cs.CL

[1] Do Large Language Models Possess a Theory of Mind? A Comparative Evaluation Using the Strange Stories Paradigm

Anna Babarczy, Andras Lukacs, Peter Vedres, Zeteny Bujka

Main category: cs.CL

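TL;DR: This study evaluates whether large language models (LLMs) have human-like Theory of Mind (ToM), that is, the ability to infer others' beliefs, intentions, and emotions from text. GPT-4o performed close to human level on the ToM tasks, whereas earlier, smaller models were susceptible to distracting information.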

Details

Motivation: To examine whether the social-cognitive reasoning that LLMs display, despite lacking social embodiment and real mental representations, stems from genuine understanding of minds or merely from statistical pattern matching.

Method: Five LLMs and human participants answered questions about story characters' beliefs, intentions, and emotions in a classic text-based test adapted from human ToM research; accuracy and robustness were analyzed systematically.

Result: Performance differed markedly across models. Earlier, smaller models were affected by the number of available cues and by irrelevant information, whereas GPT-4o maintained high accuracy and strong robustness across conditions, performing close to human level.

Conclusion: GPT-4o shows near-human ToM ability, suggesting that some advanced LLMs may have moved beyond pure pattern matching, although the nature of their cognition still needs to be clarified.

Abstract: The study explores whether current Large Language Models (LLMs) exhibit Theory of Mind (ToM) capabilities -- specifically, the ability to infer others' beliefs, intentions, and emotions from text. Given that LLMs are trained on language data without social embodiment or access to other manifestations of mental representations, their apparent social-cognitive reasoning raises key questions about the nature of their understanding. Are they capable of robust mental-state attribution indistinguishable from human ability in its output, or do their outputs merely reflect superficial pattern completion? To address this question, we tested five LLMs and compared their performance to that of human controls using an adapted version of a text-based tool widely used in human ToM research. The test involves answering questions about the beliefs, intentions, and emotions of story characters. The results revealed a performance gap between the models. Earlier and smaller models were strongly affected by the number of relevant inferential cues available and, to some extent, were also vulnerable to the presence of irrelevant or distracting information in the texts. In contrast, GPT-4o demonstrated high accuracy and strong robustness, performing comparably to humans even in the most challenging conditions. This work contributes to ongoing debates about the cognitive status of LLMs and the boundary between genuine understanding and statistical approximation.

[2] TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots

Fangrui Huang, Souhad Chbeir, Arpandeep Khatua, Sheng Wang, Sijun Tan, Kenan Ye, Lily Bailey, Merryn Daniel, Ryan Louie, Sanmi Koyejo, Ehsan Adeli

Main category: cs.CL

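TL;DR: This paper proposes THERAPYGYM, a framework for evaluating and improving the clinical fidelity and safety of psychotherapy chatbots, and releases THERAPYJUDGEBENCH, an expert-annotated validation set. The framework supports reinforcement learning against clinical criteria and substantially improves models' adherence to cognitive behavioral therapy (CBT) techniques as measured by the Cognitive Therapy Rating Scale (CTRS).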

Details

Motivation: Existing LLM evaluation methods (fluency metrics, preference tests, generic dialogue benchmarks) cannot measure the clinical dimensions that matter in psychotherapy, so a dedicated evaluation and training framework grounded in clinical practice is needed.

Method: THERAPYGYM (1) assesses fidelity to CBT techniques with an automated CTRS pipeline; (2) defines a multi-label annotation scheme for safety risks; (3) releases THERAPYJUDGEBENCH, with 116 dialogues and 1,270 expert ratings, to calibrate LLM judges; and (4) uses CTRS and safety metrics as reward signals for reinforcement learning with diverse patient simulations.

Result: Models trained with THERAPYGYM raise their average CTRS from 0.10 to 0.60 under expert ratings and from 0.16 to 0.59 under LLM judges, supporting the framework's effectiveness in improving clinical fidelity and safety.

Conclusion: THERAPYGYM offers a scalable, clinically credible paradigm for evaluating and training psychotherapy AI, moving therapy chatbots toward evidence-based practice and greater safety.

Abstract: Large language models (LLMs) are increasingly used for mental-health support; yet prevailing evaluation methods--fluency metrics, preference tests, and generic dialogue benchmarks--fail to capture the clinically critical dimensions of psychotherapy. We introduce THERAPYGYM, a framework that evaluates and improves therapy chatbots along two clinical pillars: fidelity and safety. Fidelity is measured using the Cognitive Therapy Rating Scale (CTRS), implemented as an automated pipeline that scores adherence to CBT techniques over multi-turn sessions. Safety is assessed using a multi-label annotation scheme, covering therapy-specific risks (e.g., failing to address harm or abuse). To mitigate bias and unreliability in LLM-based judges, we further release THERAPYJUDGEBENCH, a validation set of 116 dialogues with 1,270 expert ratings for auditing and calibration against licensed clinicians. THERAPYGYM also serves as a training harness: CTRS and safety-based rewards drive RL with configurable patient simulations spanning diverse symptom profiles. Models trained in THERAPYGYM improve on expert ratings, with average CTRS rising from 0.10 to 0.60 (and 0.16 to 0.59 under LLM judges). Our work enables scalable development of therapy chatbots that are faithful to evidence-based practice and safer in high-stakes use.

[3] How Confident Is the First Token? An Uncertainty-Calibrated Prompt Optimization Framework for Large Language Model Classification and Understanding

Wei Chen, Guoyang Ju, Yuanyuan Qi

Main category: cs.CL

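TL;DR: This paper proposes the Log-Scale Focal Uncertainty (LSFU) metric and an uncertainty-calibrated prompt optimization framework (UCPOF). By incorporating class priors to improve uncertainty estimation, it improves few-shot learning and RAG efficiency on multiple-choice understanding tasks.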

Details

Motivation: Existing uncertainty measures based on output probabilities (e.g., entropy) ignore differences in class priors in the pretraining corpus and cannot separate spurious confidence from true certainty, which leads to poor confidence calibration and undermines the reliability of prompt optimization.

Method: LSFU is a first-token-based uncertainty metric inspired by focal loss; it uses label priors as a risk-modulation factor to suppress noise from high-frequency classes and amplify risk for long-tail classes. The UCPOF framework uses first-token uncertainty to select high-quality exemplars and dynamically optimize prompts.

Result: On multiple-choice understanding tasks, UCPOF improves average accuracy by 6.03% over few-shot baselines, surpasses always-on full RAG by 5.75%, and reduces the average retrieval trigger rate by 50.66%.

Conclusion: LSFU characterizes model uncertainty more precisely, and UCPOF, by triggering RAG adaptively, substantially reduces cost while maintaining state-of-the-art performance, offering a new paradigm for reliable and efficient prompt engineering.

Abstract: With the widespread adoption of large language models (LLMs) in natural language processing, prompt engineering and retrieval-augmented generation (RAG) have become mainstream to enhance LLMs' performance on complex tasks. However, LLMs generate outputs autoregressively, leading to inevitable output uncertainty. Since model performance is highly sensitive to prompt design, precise uncertainty measurement is crucial for reliable prompt optimization. For multi-class multiple-choice (understanding) tasks, conventional uncertainty measures (e.g., entropy) based on output probabilities treat all classes equally and ignore class prior differences in pretraining corpora. This failure to distinguish spurious confidence (from priors) from true certainty (from contextual understanding) results in poor confidence calibration. To address this, we propose Log-Scale Focal Uncertainty (LSFU), a first-token-based metric inspired by focal loss. LSFU incorporates label prior probabilities as a risk-modulation factor to suppress noise from high-frequency classes and emphasize risk for low-frequency long-tail classes, with a dynamic weighting mechanism unifying the measurement scale. Based on LSFU, we further propose the uncertainty-calibrated prompt optimization framework (UCPOF), which leverages the first token of model outputs to select high-quality exemplars and dynamically optimize prompts. Comprehensive evaluations show UCPOF improves average accuracy by 6.03% over few-shot baselines, surpasses always-on full RAG by 5.75% in overall average accuracy, and reduces the average retrieval trigger rate by 50.66%. By adaptively triggering RAG only for high-uncertainty samples, our framework significantly lowers computational costs while maintaining state-of-the-art performance.
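The idea of triggering retrieval only for high-uncertainty samples can be illustrated with a minimal sketch. It uses plain Shannon entropy of the first-token distribution as a stand-in: the abstract does not give the LSFU formula, so the prior-aware weighting is not reproduced here, and the `threshold` value and example distributions are hypothetical.

```python
import math

def first_token_entropy(probs):
    """Shannon entropy (in nats) of a first-token distribution over answer options."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_retrieve(probs, threshold=0.8):
    """Trigger retrieval only when the first-token distribution is uncertain.

    `threshold` is a hypothetical value; in practice it would be tuned on held-out data.
    """
    return first_token_entropy(probs) > threshold

confident = [0.94, 0.02, 0.02, 0.02]  # model strongly prefers one option
uncertain = [0.30, 0.28, 0.22, 0.20]  # model is close to guessing

print(should_retrieve(confident), should_retrieve(uncertain))
```

Plain entropy treats all classes equally, which is exactly the limitation the abstract attributes to conventional measures; a prior-aware metric would reweight these terms.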

[4] Agentic Framework for Political Biography Extraction

Yifei Zhu, Songpo Yang, Jiangnan Zhu, Junyan Jiang

Main category: cs.CL

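TL;DR: This paper proposes a two-stage, LLM-based "Synthesis-Coding" framework for automatically building large-scale databases of political elite biographies. It substantially improves accuracy and scalability and outperforms human experts and Wikipedia on several measures.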

Details

Motivation: Political science research has long been constrained by the high cost, heavy reliance on manual labor, and difficulty of automation involved in building large-scale structured political datasets.

Method: A two-stage Synthesis-Coding framework: upstream, recursive agentic LLMs search, filter, and synthesize biographical information from heterogeneous web pages; downstream, an LLM codes the synthesized content into structured dataframes.

Result: (1) Given synthesized contexts, LLM coders match or exceed human experts in extraction accuracy; (2) in web environments, the agentic system gathers more information than the human collective intelligence aggregated in Wikipedia; (3) coding directly from long or multi-language corpora introduces bias, which the synthesis stage mitigates by building signal-dense representations.

Conclusion: The framework is generalizable, scalable, and transparent, offering a new paradigm for building large-scale, extensible, and interpretable databases in political science.

Abstract: The production of large-scale political datasets typically demands extracting structured facts from vast piles of unstructured documents or web sources, a task that traditionally relies on expensive human experts and remains prohibitively difficult to automate at scale. In this paper, we leverage Large Language Models (LLMs) to automate the extraction of multi-dimensional elite biographies, addressing a long-standing bottleneck in political science research. We propose a two-stage "Synthesis-Coding" framework for complex extraction tasks: an upstream synthesis stage that uses recursive agentic LLMs to search, filter, and curate biographies from heterogeneous web sources, followed by a downstream coding stage that maps the curated biographies into structured dataframes. We validate this framework through three primary results. First, we demonstrate that, when given curated contexts, LLM coders match or outperform human experts in extraction accuracy. Second, we show that in web environments, the agentic system synthesizes more information from web resources than human collective intelligence (Wikipedia). Finally, we find that coding directly from long and multi-language corpora introduces bias that the synthesis stage can alleviate by curating evidence into signal-dense representations. Through comprehensive evaluation, we provide a generalizable, scalable framework for building transparent and extensible large-scale databases in political science.

[5] Controllable Evidence Selection in Retrieval-Augmented Question Answering via Deterministic Utility Gating

Victor P. Unda

Main category: cs.CL

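TL;DR: This paper proposes a training-free, deterministic evidence selection framework (MUE/DUE). Using explicit signals for semantic relatedness, term coverage, conceptual distinctiveness, and redundancy control, it evaluates each text unit independently for whether it states the fact, rule, or condition the question requires, producing auditable, compact, high-fidelity evidence sets.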

Details

Motivation: Retrieval based on vector similarity struggles to distinguish texts that are semantically similar but differ in evidential value, so selected sentences may be redundant, incomplete, or mismatched to the question's conditions.

Method: Meaning-Utility Estimation (MUE) and Diversity-Utility Estimation (DUE) score each sentence or record independently with fixed rules, making explicit judgments on semantic relatedness, term coverage, conceptual distinctiveness, and redundancy. A unit is accepted only if it independently states the fact, rule, or condition the question requires; units are not merged or expanded.

Result: The framework performs training-free, deterministic evidence filtering, producing compact, auditable, non-redundant evidence sets while maintaining high precision, and draws a clear line between relevant text and usable evidence.

Conclusion: Deterministic, explicit, training-free evidence selection can improve the reliability and interpretability of evidence in retrieval-augmented question answering, offering a new route toward trustworthy AI systems.

Abstract: Many modern AI question-answering systems convert text into vectors and retrieve the closest matches to a user question. While effective for topical similarity, similarity scores alone do not explain why some retrieved text can serve as evidence while other equally similar text cannot. When many candidates receive similar scores, systems may select sentences that are redundant, incomplete, or address different conditions than the question requires. This paper presents a deterministic evidence selection framework for retrieval-augmented question answering. The approach introduces Meaning-Utility Estimation (MUE) and Diversity-Utility Estimation (DUE), fixed scoring and redundancy-control procedures that determine evidence admissibility prior to answer generation. Each sentence or record is evaluated independently using explicit signals for semantic relatedness, term coverage, conceptual distinctiveness, and redundancy. No training or fine-tuning is required. In the prototype, a unit is accepted only if it explicitly states the fact, rule, or condition required by the task. Units are not merged or expanded. If no unit independently satisfies the requirement, the system returns no answer. This deterministic gating produces compact, auditable evidence sets and establishes a clear boundary between relevant text and usable evidence.
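The gating behavior described above (independent per-unit acceptance, redundancy control, abstention when nothing qualifies) can be sketched as follows. This is not the paper's MUE/DUE scoring, which the abstract does not specify: term coverage and Jaccard overlap are simple stand-ins, and the function names, thresholds, and example sentences are all hypothetical.

```python
import re

def tokens(text):
    """Lowercased word set; a deliberately simple stand-in for real term extraction."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def term_coverage(required_terms, unit):
    """Fraction of the (non-empty, lowercase) required terms that appear in the unit."""
    required = set(required_terms)
    return len(required & tokens(unit)) / len(required)

def jaccard(a, b):
    """Token-set overlap between two units, used here as a redundancy signal."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

def select_evidence(required_terms, units, min_coverage=1.0, max_overlap=0.8):
    """Accept each unit on its own merits, then drop near-duplicates.

    Returns an empty list when nothing qualifies, so the caller can abstain.
    """
    accepted = []
    for unit in units:
        if term_coverage(required_terms, unit) < min_coverage:
            continue  # the unit alone does not state everything the question needs
        if any(jaccard(unit, kept) > max_overlap for kept in accepted):
            continue  # redundant with evidence already accepted
        accepted.append(unit)
    return accepted

units = [
    "Refunds are available within 30 days of purchase.",
    "Refunds are available within 30 days of the purchase.",
    "Our store sells refurbished laptops.",
]
evidence = select_evidence(["refunds", "30", "days"], units)
print(evidence)
```

Because every rule is fixed and no scores are learned, the same inputs always yield the same evidence set, which is what makes the selection auditable.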

[6] DynaRAG: Bridging Static and Dynamic Knowledge in Retrieval-Augmented Generation

Penghao Liang, Mengwei Yuan, Jianan Liu, Jing Yang, Xianyou Li, Weiran Yan, Yichao Wu

Main category: cs.CL

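TL;DR: DynaRAG is a retrieval-augmented generation (RAG) framework that dynamically calls external APIs to compensate for gaps in a static knowledge base. Combining an LLM reranker, a sufficiency classifier, and the Gorilla v2 API-calling model, it significantly improves accuracy on dynamic questions and reduces hallucinations on the CRAG benchmark.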

Details

Motivation: Traditional RAG relies only on static corpora and struggles with time-sensitive questions or those needing real-time information; a system that adaptively combines static and dynamic knowledge is needed.

Method: DynaRAG comprises (1) an LLM reranker that assesses document relevance; (2) a sufficiency classifier that decides whether API fallback is needed; (3) the Gorilla v2 model for calling external APIs; and (4) FAISS-backed schema filtering to guide API selection.

Result: On the CRAG benchmark, DynaRAG significantly improves accuracy on dynamic questions and reduces hallucinations.

Conclusion: Dynamic-aware routing and selective tool use are important for building reliable, real-world question-answering systems.

Abstract: We present DynaRAG, a retrieval-augmented generation (RAG) framework designed to handle both static and time-sensitive information needs through dynamic knowledge integration. Unlike traditional RAG pipelines that rely solely on static corpora, DynaRAG selectively invokes external APIs when retrieved documents are insufficient for answering a query. The system employs an LLM-based reranker to assess document relevance, a sufficiency classifier to determine when fallback is necessary, and Gorilla v2 -- a state-of-the-art API calling model -- for accurate tool invocation. We further enhance robustness by incorporating schema filtering via FAISS to guide API selection. Evaluations on the CRAG benchmark demonstrate that DynaRAG significantly improves accuracy on dynamic questions, while also reducing hallucinations. Our results highlight the importance of dynamic-aware routing and selective tool use in building reliable, real-world question-answering systems.
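The routing logic in the abstract (rerank, check sufficiency, fall back to an API only when needed) can be sketched with injected components. Every function below is a hypothetical stub so that the control flow runs end to end; none of them is DynaRAG's reranker, classifier, or Gorilla v2 interface.

```python
def answer_query(query, retrieve, rerank, is_sufficient, call_api, generate, top_k=3):
    """Route a query: answer from static documents when they suffice, else fall back to an API."""
    docs = rerank(query, retrieve(query))[:top_k]
    if is_sufficient(query, docs):
        return generate(query, docs), "static"
    api_result = call_api(query)
    return generate(query, docs + [api_result]), "dynamic"

# Hypothetical stand-ins for the real components.
CORPUS = ["Paris is the capital of France.", "The Eiffel Tower opened in 1889."]

def retrieve(query):
    return list(CORPUS)

def rerank(query, docs):
    words = set(query.lower().split())
    return sorted(docs, key=lambda d: len(words & set(d.lower().split())), reverse=True)

def is_sufficient(query, docs):
    # Toy rule: treat anything about "today" as time-sensitive.
    return "today" not in query.lower()

def call_api(query):
    return "API result placeholder"

def generate(query, context):
    return f"answer from {len(context)} context item(s)"

static_answer = answer_query(
    "What is the capital of France?", retrieve, rerank, is_sufficient, call_api, generate
)
dynamic_answer = answer_query(
    "What is the weather in Paris today?", retrieve, rerank, is_sufficient, call_api, generate
)
print(static_answer, dynamic_answer)
```

Passing the components in as arguments keeps the router itself trivial to test and lets the sufficiency check be replaced without touching the control flow.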

[7] Learned but Not Expressed: Capability-Expression Dissociation in Large Language Models

Toshiyuki Shigemura

Main category: cs.CL

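TL;DR: This empirical study finds that although large language models (LLMs) can reconstruct non-causal solutions from their training data, they never produce such content in standard generation tasks, revealing a systematic dissociation between learned capability and expressed output.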

Details

Motivation: To investigate why LLMs can reconstruct particular content from their training data (such as non-causal solutions) yet never express it in ordinary generation, challenging the default assumption that presence in the training data directly predicts output probability.

Method: Empirical observation of 300 prompt-response generations (covering narrative and problem-solving tasks, ten scenarios, and three distinct LLMs), interpreted in light of findings on memorization contiguity and alignment-induced discourse priors.

Result: No instance of a non-causal solution appeared in any generated output (0%, 95% CI: [0%, 1.2%]), even though reconstruction capability was verified under conditional extraction.

Conclusion: Task-conditioned generation policies can comprehensively suppress the output of learned content, indicating that an LLM's behavioral boundaries are determined not only by its training data but also by policies applied at generation time.

Abstract: Large language models (LLMs) demonstrate the capacity to reconstruct and trace learned content from their training data under specific elicitation conditions, yet this capability does not manifest in standard generation contexts. This empirical observational study examines the expression of non-causal, non-implementable solution types across 300 prompt-response generations spanning narrative and problem-solving task contexts. Drawing on recent findings regarding memorization contiguity and alignment-induced discourse priors, we document a systematic dissociation between learned capability and expressed output. Across three distinct LLMs, ten task scenarios, and both creative narrative and practical advisory contexts, we documented zero instances of non-causal solution frames in generated outputs (0%, 95% CI: [0%, 1.2%]), despite verified reconstruction capability under conditional extraction. These findings challenge the prevailing assumption that training data presence directly predicts output probability, demonstrating instead that task-conditioned generation policies can comprehensively suppress learned content across diverse contexts. The results offer implications for understanding generation dynamics, output distribution control, and the behavioral boundaries of contemporary LLMs.
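The reported interval for zero events in 300 generations can be checked with a short calculation. The abstract does not name the interval method; assuming the exact (Clopper-Pearson) two-sided interval, the upper limit for zero observed events in n trials is 1 - (alpha/2)^(1/n), which gives about 1.2% for n = 300. The function name below is illustrative.

```python
def zero_event_upper_bound(n, alpha=0.05):
    """Upper limit of the two-sided exact (Clopper-Pearson) interval when 0 of n trials are events."""
    return 1.0 - (alpha / 2.0) ** (1.0 / n)

upper = zero_event_upper_bound(300)
print(f"{upper:.4f}")  # 0.0122
```

So an observed rate of 0% across 300 samples is statistically compatible with a true rate of up to roughly 1.2%, which is why the interval, not just the point estimate, matters when interpreting the "zero instances" finding.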

[8] Real-Time Trustworthiness Scoring for LLM Structured Outputs and Data Extraction

Hui Wen Goh, Jonas Mueller

Main category: cs.CL

TL;DR: CONSTRUCT is a method for scoring the trustworthiness of LLM structured outputs in real time; it flags likely errors and guides human review.

Details

Motivation: Structured outputs from current LLMs contain sporadic errors, which limits the potential of enterprise AI applications.

Method: CONSTRUCT requires neither labeled data nor custom model deployment, works with any LLM (including black-box APIs without logprobs), and scores both the overall output and each individual field.

Result: On one of the first public structured-output benchmarks with reliable ground truth, CONSTRUCT detects errors from models including Gemini 3 and GPT-5 with significantly higher precision and recall than other scoring methods.

Conclusion: CONSTRUCT efficiently localizes errors in structured outputs, helps allocate human review effort, and improves the reliability of enterprise AI.

Abstract: Structured Outputs from current LLMs exhibit sporadic errors, hindering enterprise AI efforts from realizing their immense potential. We present CONSTRUCT, a method to score the trustworthiness of LLM Structured Outputs in real-time, such that lower-scoring outputs are more likely to contain errors. This reveals the best places to focus limited human review bandwidth. CONSTRUCT additionally scores the trustworthiness of each field within an LLM Structured Output, helping reviewers quickly identify which parts of the output are wrong. Our method is suitable for any LLM (including black-box LLM APIs without logprobs such as reasoning models and Anthropic models), requires neither labeled training data nor custom model deployment, and works for complex Structured Outputs with many fields of diverse types (including nested JSON schemas). We additionally present one of the first public LLM Structured Output benchmarks with reliable ground-truth values that are not full of mistakes. Over this four-dataset benchmark, CONSTRUCT detects errors from various LLMs (including Gemini 3 and GPT-5) with significantly higher precision/recall than other scoring methods.

[9] Beyond Accuracy: An Explainability-Driven Analysis of Harmful Content Detection

Trishita Dhara, Siddhesh Sheth

Main category: cs.CL

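TL;DR: Using two post-hoc explanation methods, Shapley Additive Explanations and Integrated Gradients, this paper analyzes the decision logic of a RoBERTa model for harmful content detection. It reveals systematic failure modes on borderline, contextual, and politically sensitive content, and stresses that the value of explainable AI lies in improving transparency and supporting human moderation as a diagnostic tool rather than in raising performance.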

Details

Motivation: Existing harmful content detection systems lack explainability; it is especially hard to understand why a model reaches its decision on borderline, context-dependent, and politically sensitive content. Current research mostly targets accuracy gains and neglects explanatory analysis.

Method: A RoBERTa classifier is trained on the Civil Comments dataset, and Shapley Additive Explanations (SHAP) and Integrated Gradients (IG) are used for attribution analysis of correct predictions and typical errors, combined with qualitative case studies to identify common failure modes.

Result: Although the model reaches an AUC of 0.93 and an accuracy of 0.94, the explanation analysis exposes its limitations: IG tends toward diffuse contextual attributions, whereas SHAP focuses on explicit lexical cues; the divergence between the two shows up in both false negatives and false positives; and recurring failure modes include indirect toxicity, lexical over-attribution, and misjudged political discourse.

Conclusion: The core value of explainable AI is to give human moderators a transparent, diagnosable basis for decisions and to expose model uncertainty and reasoning flaws; it should be positioned as a transparency and diagnostic tool rather than a means of performance optimization.

Abstract: Although automated harmful content detection systems are widely used to monitor online platforms, moderators and end users often cannot understand the logic underlying their predictions. While recent studies have focused on increasing classification accuracy, little attention has been paid to understanding why neural models identify content as harmful, especially in borderline, contextual, and politically sensitive situations. In this work, we present an explainability-driven analysis of a neural harmful content detection model trained on the Civil Comments dataset. Two popular post-hoc explanation methods, Shapley Additive Explanations and Integrated Gradients, are used to analyze the behavior of a RoBERTa-based classifier in both correct predictions and systematic failure cases. Despite strong overall performance, with an area under the curve of 0.93 and an accuracy of 0.94, the analysis reveals limitations that are not observable from aggregate evaluation metrics alone. Integrated Gradients appears to extract more diffuse contextual attributions while Shapley Additive Explanations extracts more focused attributions on explicit lexical cues. The consequent divergence in their outputs manifests in both false negatives and false positives. Qualitative case studies reveal recurring failure modes such as indirect toxicity, lexical over-attribution, or political discourse. The results suggest that explainable AI can foster human-in-the-loop moderation by exposing model uncertainty and making the rationale behind automated decisions more interpretable. Most importantly, this work highlights the role of explainability as a transparency and diagnostic resource for online harmful content detection systems rather than as a performance-enhancing lever.
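For readers unfamiliar with Integrated Gradients, the sketch below computes it for a toy logistic model over three bag-of-words style features. It is not the paper's RoBERTa setup: the weights, input, and all-zeros baseline are hypothetical, and the path integral is approximated with a midpoint Riemann sum.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    """Toy logistic classifier: probability that the input is harmful."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def gradient(x, w, b):
    """Analytic gradient of the logistic output with respect to each input feature."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients along the straight path."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for i, g in enumerate(gradient(point, w, b)):
            totals[i] += g
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, totals)]

w, b = [2.0, -1.0, 0.5], -0.3                  # hypothetical weights for three token features
x, baseline = [1.0, 1.0, 0.0], [0.0, 0.0, 0.0]  # input and all-zeros baseline
attributions = integrated_gradients(x, baseline, w, b)
gap = model(x, w, b) - model(baseline, w, b)
print(attributions, sum(attributions), gap)
```

The attributions sum (up to numerical error) to the difference between the model's output at the input and at the baseline, the completeness property that makes IG attributions comparable across examples.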

[10] MineDraft: A Framework for Batch Parallel Speculative Decoding

Zhenwei Tang, Arun Verma, Zijian Zhou, Zhaoxuan Wu, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low

Main category: cs.CL

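TL;DR: This paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework that hides drafting latency by overlapping the drafting and verification stages, substantially improving throughput and end-to-end latency for large language model inference.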

Details

Motivation: Standard speculative decoding (SD) is limited by the strictly sequential execution of its drafting and verification stages, which creates a performance bottleneck.

Method: MineDraft maintains two batches of requests so that drafting for one batch runs in parallel with verification for the other; the paper also provides a theoretical efficiency analysis.

Result: Compared with standard SD, MineDraft improves throughput by up to 75% and reduces end-to-end latency by up to 39%; it has been implemented as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.

Conclusion: Batch parallel speculative decoding (PSD) can substantially improve inference efficiency, and MineDraft offers a new paradigm for efficient, practical LLM inference.

Abstract: Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD is often limited by the strictly sequential execution of these drafting and verification stages. To address this, this paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework designed to effectively hide drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes PSD through a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. Our experimental results show that MineDraft significantly improves both throughput (up to 75%) and end-to-end latency (up to 39%) over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.
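Why overlapping helps can be seen in an idealized timing model. This is not the paper's theoretical analysis: it assumes fixed per-round drafting and verification times, two batches, no contention between the draft and target models, and it ignores acceptance rates and batch-size effects. The timings are made up for illustration.

```python
def sequential_time(rounds, t_draft, t_verify, batches=2):
    """Standard SD: every round of every batch drafts first, then verifies."""
    return batches * rounds * (t_draft + t_verify)

def overlapped_time(rounds, t_draft, t_verify):
    """Two batches alternate roles: one is verified while the other is drafted.

    After one initial draft there are 2 * rounds verification steps; all but the
    last overlap with a draft for the other batch, so each costs max(t_draft, t_verify).
    """
    steps = 2 * rounds
    return t_draft + (steps - 1) * max(t_draft, t_verify) + t_verify

seq = sequential_time(10, t_draft=1.0, t_verify=3.0)
par = overlapped_time(10, t_draft=1.0, t_verify=3.0)
print(seq, par, seq / par)
```

In this toy setting the drafting time disappears almost entirely from the total whenever verification is the slower stage, which is the sense in which overlapping "hides" drafting latency.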

[11] An Agentic System for Schema Aware NL2SQL Generation

David Onyango, Naseef Mansoor

Main category: cs.CL

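TL;DR: This paper proposes CESMA, a schema-based agentic system that uses small language models (SLMs) as the primary agents and selectively calls a large language model (LLM) for error correction when needed. It substantially lowers computational cost and privacy risk while approaching LLM-level performance on the BIRD benchmark.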

Details

Motivation: Existing NL2SQL methods depend on large language models (LLMs), which bring high computational overhead, data privacy concerns, and difficulty deploying in resource-constrained environments; a lightweight, efficient, deployable alternative is needed.

Method: A multi-agent system grounded in the database schema uses locally run small language models (SLMs) as the core executors; an error-detection mechanism triggers the LLM as a fallback only when the SLM output is wrong; task decomposition and schema-aware prompting strengthen the SLMs.

Result: On the BIRD benchmark the system reaches 47.78% execution accuracy and a validation efficiency score of 51.05%; about 67% of queries are resolved locally by SLMs; the average cost per query falls to 0.0085 (versus 0.094 for LLM-only systems), a total cost reduction of more than 90%.

Conclusion: The hybrid architecture, SLM-led with on-demand LLM fallback, preserves the practicality of NL2SQL while greatly improving efficiency, privacy, and deployability, offering a new paradigm for resource-sensitive settings.

Abstract: The natural language to SQL (NL2SQL) task plays a pivotal role in democratizing data access by enabling non-expert users to interact with relational databases through intuitive language. While recent frameworks have enhanced translation accuracy via task specialization, their reliance on Large Language Models (LLMs) raises significant concerns regarding computational overhead, data privacy, and real-world deployability in resource-constrained environments. To address these challenges, we propose a schema-based agentic system that strategically employs Small Language Models (SLMs) as primary agents, complemented by a selective LLM fallback mechanism. Because the LLM is invoked only upon detection of errors in SLM-generated output, the proposed system significantly reduces computational expenditure. Experimental results on the BIRD benchmark demonstrate that our system achieves an execution accuracy of 47.78% and a validation efficiency score of 51.05%, with over 90% cost reduction compared to LLM-centric baselines, as approximately 67% of queries are resolved using local SLMs. The system achieves an average cost per query of 0.0085 compared to 0.094 for LLM-only systems, with near-zero operational costs for locally executed queries. [GitHub repository: https://github.com/mindslab25/CESMA]
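The SLM-first, LLM-on-error pattern can be sketched with an in-memory SQLite database. The two "models" are hypothetical stubs that return fixed SQL, and treating an execution failure as the error signal is an assumption: the abstract does not say how CESMA detects errors. The last line checks the reported cost figures against the "over 90%" claim.

```python
import sqlite3

def run_with_fallback(question, conn, small_model, large_model):
    """Try the small model's SQL first; call the large model only if execution fails."""
    sql = small_model(question)
    try:
        return conn.execute(sql).fetchall(), "slm"
    except sqlite3.Error:
        sql = large_model(question)
        return conn.execute(sql).fetchall(), "llm"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Lin")])

def small_model(question):
    # Hypothetical stub: handles the easy question, misnames a column otherwise.
    if "how many" in question.lower():
        return "SELECT COUNT(*) FROM users"
    return "SELECT username FROM users ORDER BY id"

def large_model(question):
    # Hypothetical stub for the fallback model.
    return "SELECT name FROM users ORDER BY id"

rows_a, source_a = run_with_fallback("How many users are there?", conn, small_model, large_model)
rows_b, source_b = run_with_fallback("List all user names.", conn, small_model, large_model)

# Reported average cost per query: 0.0085 (hybrid) vs 0.094 (LLM-only).
cost_reduction = 1 - 0.0085 / 0.094
print(source_a, source_b, round(cost_reduction, 3))
```

Execution failure only catches invalid SQL, not queries that run but return the wrong rows, so a real system would need a stronger check than this sketch uses.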

[12] BenchBrowser -- Collecting Evidence for Evaluating Benchmark Validity

Harshita Diddee, Gregory Yauney, Swabha Swayamdipta, Daphne Ippolito

Main category: cs.CL

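TL;DR: This paper introduces BenchBrowser, a tool that retrieves evaluation items relevant to natural-language use cases. It addresses the lack of fine-grained coverage and validation in existing benchmarks and helps practitioners diagnose shortcomings in content validity and convergent validity.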

Details

Motivation: Existing language model benchmarks are too coarse-grained to reflect the nuanced capabilities that real use cases require, which creates a gap between measured model performance and users' actual needs.

Method: BenchBrowser is a retriever covering 20 benchmark suites; it matches natural-language queries to relevant evaluation items, and its retrieval precision is validated by a human study.

Result: BenchBrowser supports diagnosing low content validity (incomplete coverage of a capability) and low convergent validity (unstable rankings when measuring the same capability), and quantifies the gap between practitioner goals and what benchmarks test.

Conclusion: BenchBrowser provides a quantitative tool for improving the alignment between benchmarks and practical goals, helping avoid an illusion of competence.

Abstract: Do language model benchmarks actually measure what practitioners intend them to? High-level metadata is too coarse to convey the granular reality of benchmarks: a "poetry" benchmark may never test for haikus, while "instruction-following" benchmarks will often test for an arbitrary mix of skills. This opacity makes verifying alignment with practitioner goals a laborious process, risking an illusion of competence even when models fail on untested facets of user interests. We introduce BenchBrowser, a retriever that surfaces evaluation items relevant to natural language use cases over 20 benchmark suites. Validated by a human study confirming high retrieval precision, BenchBrowser generates evidence to help practitioners diagnose low content validity (narrow coverage of a capability's facets) and low convergent validity (lack of stable rankings when measuring the same capability). BenchBrowser thus helps quantify a critical gap between practitioner intent and what benchmarks actually test.

[13] Evaluating FrameNet-Based Semantic Modeling for Gender-Based Violence Detection in Clinical Records

Lívia Dutra, Arthur Lorenzi, Frederico Belcavello, Ely Matos, Marcelo Viridiano, Lorena Larré, Olívia Guaranha, Erik Santos, Sofia Reinach, Pedro de Paula, Tiago Torrent

Main category: cs.CL

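TL;DR: This study examines how effectively FrameNet-based semantic annotation identifies patterns of gender-based violence (GBV) in the open-text fields of electronic medical records. Models that incorporate semantic annotation clearly outperform models using only structured data, improving F1 by more than 0.3.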

Details

Motivation: Although healthcare professionals in Brazil are legally required to report gender-based violence, cases are heavily underreported because abuse is hard to identify and information systems are poorly integrated; better automatic identification of GBV in clinical text is needed.

Method: Open-text fields of electronic medical records are semantically annotated with FrameNet, and three SVM classifiers are built: (1) frame-annotated text only, (2) frame-annotated text plus parameterized data, and (3) parameterized data only. The models are compared quantitatively and qualitatively.

Result: Models incorporating semantic annotation improve F1 by more than 0.3 over models using only structured data, indicating that domain-specific semantic representations provide useful signals beyond structured information such as demographics.

Conclusion: Semantic analysis of clinical narratives can strengthen early identification of GBV and support better-targeted public health interventions.

Abstract: Gender-based violence (GBV) is a major public health issue, with the World Health Organization estimating that one in three women experiences physical or sexual violence by an intimate partner during her lifetime. In Brazil, although healthcare professionals are legally required to report such cases, underreporting remains significant due to difficulties in identifying abuse and limited integration between public information systems. This study investigates whether FrameNet-based semantic annotation of open-text fields in electronic medical records can support the identification of patterns of GBV. We compare the performance of an SVM classifier for GBV cases trained on (1) frame-annotated text, (2) annotated text combined with parameterized data, and (3) parameterized data alone. Quantitative and qualitative analyses show that models incorporating semantic annotation outperform categorical models, achieving over 0.3 improvement in F1 score and demonstrating that domain-specific semantic representations provide meaningful signals beyond structured demographic data. The findings support the hypothesis that semantic analysis of clinical narratives can enhance early identification strategies and support more informed public health interventions.

[14] How LLMs Distort Our Written Language

Marwa Abdulhai, Isadora White, Yanming Wan, Ibrahim Qureshi, Joel Leibo, Max Kleiman-Weiner, Natasha Jaques

Main category: cs.CL

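TL;DR: This paper studies how large language models (LLMs) systematically change the meaning of human text when used as writing assistants, finding that they affect not only style and tone but also substantially distort the intended meaning. Through a user study, a retrospective revision experiment, and an analysis of real AI-generated reviews, it shows that LLM use leads to more neutral positions, less originality, and weaker standards in scientific reviewing.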

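Details

Motivation: To investigate whether the widespread use of LLMs for writing assistance covertly changes the semantic content of human expression rather than only style or grammar, with particular attention to potential long-term effects on the quality of cultural and scientific communication.

Method: (1) A human user study analyzing how different levels of LLM use affect neutrality, creativity, and personal voice in writing; (2) using a dataset of human-written essays with feedback collected in 2021, asking LLMs to make grammar-only revisions and measuring the semantic shift; (3) analyzing the 21% of peer reviews at a top conference that were LLM-generated, comparing their scoring tendencies and the dimensions they emphasize (such as clarity and significance).

Result: (1) The share of essays that stayed neutral rose by nearly 70% with extensive LLM use, and heavy users more often felt the writing lacked creativity and personal voice; (2) even when prompted to correct only grammar, LLMs still substantially changed the meaning of the original text; (3) LLM-generated reviews put less emphasis on the clarity and significance of the research and assigned scores a full point higher on average.

Conclusion: LLM writing assistance carries a risk of systematic semantic shift. There is a serious mismatch between users' perception of its benefits and its implicit interference with meaning, and its effects on education, publishing, and research evaluation deserve attention.

Abstract: Large language models (LLMs) are used by over a billion people globally, most often to assist with writing. In this work, we demonstrate that LLMs not only alter the voice and tone of human writing, but also consistently alter the intended meaning. First, we conduct a human user study to understand how people actually interact with LLMs when using them for writing. Our findings reveal that extensive LLM use led to a nearly 70% increase in essays that remained neutral in answering the topic question. Significantly more heavy LLM users reported that the writing was less creative and not in their voice. Next, using a dataset of human-written essays that was collected in 2021 before the widespread release of LLMs, we study how asking an LLM to revise the essay based on the human-written feedback in the dataset induces large changes in the resulting content and meaning. We find that even when LLMs are prompted with expert feedback and asked to only make grammar edits, they still change the text in a way that significantly alters its semantic meaning. We then examine LLM-generated text in the wild, specifically focusing on the 21% of AI-generated scientific peer reviews at a recent top AI conference. We find that LLM-generated reviews place significantly less weight on clarity and significance of the research, and assign scores that, on average, are a full point higher. These findings highlight a misalignment between the perceived benefit of AI use and an implicit, consistent effect on the semantics of human writing, motivating future work on how widespread AI writing will affect our cultural and scientific institutions.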

[15] Modeling the human lexicon under temperature variations: linguistic factors, diversity and typicality in LLM word associations

Maria Andueza Rodriguez, Marie Candito, Richard Huyghe

Main category: cs.CL

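TL;DR: This study assesses the human-likeness of LLMs' internal lexicons by comparing the word associations of humans and large language models (LLMs). Models of different sizes show a trade-off between response typicality and variability, and temperature markedly modulates that trade-off.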

Details

Motivation: To examine whether the lexical knowledge inside large language models is human-like, in particular whether their lexicons follow the patterns of human word association.

Method: Using English cue-response pairs from the SWOW dataset, human responses are compared with associations generated by three LLMs (Mistral-7B, Llama-3.1-8B, Qwen-2.5-32B) at multiple temperature settings, analyzing the influence of lexical factors such as word frequency and concreteness as well as response variability and typicality.

Result: All models reproduce human trends for frequency and concreteness. However, larger models (such as Qwen) give more typical but less variable responses, whereas smaller models (such as Mistral and Llama) are more variable but less typical; raising the temperature increases variability and lowers typicality.

Conclusion: LLM lexicons reflect some human lexical regularities but also differ systematically. Model size and temperature are key controllable variables that shape lexical representations and should be accounted for in related research.

Abstract: Large language models (LLMs) achieve impressive results in terms of fluency in text generation, yet the nature of their linguistic knowledge - in particular the human-likeness of their internal lexicon - remains uncertain. This study compares human and LLM-generated word associations to evaluate how accurately models capture human lexical patterns. Using English cue-response pairs from the SWOW dataset and newly generated associations from three LLMs (Mistral-7B, Llama-3.1-8B, and Qwen-2.5-32B) across multiple temperature settings, we examine (i) the influence of lexical factors such as word frequency and concreteness on cue-response pairs, and (ii) the variability and typicality of LLM responses compared to human responses. Results show that all models mirror human trends for frequency and concreteness but differ in response variability and typicality. Larger models such as Qwen tend to emulate a single "prototypical" human participant, generating highly typical but minimally variable responses, while smaller models such as Mistral and Llama produce more variable yet less typical responses. Temperature settings further influence this trade-off, with higher values increasing variability but decreasing typicality. These findings highlight both the similarities and differences between human and LLM lexicons, emphasizing the need to account for model size and temperature when probing LLM lexical representations.
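The variability/typicality trade-off under temperature follows from how temperature rescales a model's output distribution. The sketch below applies temperature-scaled softmax to hypothetical scores for four candidate responses to a cue; the logits are made up, and entropy and top-response probability are used only as rough proxies for variability and typicality, not as the paper's measures.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats, a simple proxy for response variability."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

logits = [4.0, 2.5, 1.0, 0.5]  # hypothetical scores for four candidate responses
low = softmax_with_temperature(logits, 0.5)
high = softmax_with_temperature(logits, 1.5)
print(max(low), entropy(low))
print(max(high), entropy(high))
```

At the higher temperature the most likely response loses probability mass while entropy rises, mirroring the reported pattern of more variable but less typical associations.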

[16] GRAFITE: Generative Regression Analysis Framework for Issue Tracking and Evaluation

Ja Young Lee, Mírian Silva, Mohamed Nasr, Shonda Witherspoon, Enzo Bozzani, Veronique Demers, Radha Ratnaparkhi, Hui Wu, Sara Rosenthal

Main category: cs.CL

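TL;DR: This paper presents GRAFITE, a platform for continuous evaluation of large language models (LLMs). It builds an issue repository from user feedback and runs quality assurance tests with LLM-as-a-judge, supporting multi-model comparison and regression detection across versions.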

Details

Motivation: As LLMs are repeatedly exposed to benchmark data during training, test contamination inflates measured performance; a continuous, dynamic, contamination-resistant evaluation mechanism is needed.

Method: The GRAFITE platform (1) accumulates model issues from user feedback over time into an issue repository; (2) defines a QA testing pipeline that uses LLM-as-a-judge for automatic assessment; and (3) supports side-by-side testing of multiple models and regression analysis across releases.

Result: An open-source continuous evaluation system that tracks how model capabilities evolve; it is available on GitHub with a demo video.

Conclusion: GRAFITE offers a contamination-resistant, scalable, user-driven paradigm for continuous LLM evaluation, helping track real changes in model capability more reliably.

Abstract: Large language models (LLMs) are largely motivated by their performance on popular topics and benchmarks at the time of their release. However, over time, contamination occurs due to significant exposure of benchmark data during training. This poses a risk of model performance inflation if testing is not carefully executed. To address this challenge, we present GRAFITE, a continuous LLM evaluation platform built around a comprehensive system for maintaining and evaluating model issues. Our approach enables building a repository of model problems based on user feedback over time and offers a pipeline for assessing LLMs against these issues through quality assurance (QA) tests using LLM-as-a-judge. The platform enables side-by-side comparison of multiple models, facilitating regression detection across different releases. The platform is available at https://github.com/IBM/grafite. The demo video is available at www.youtube.com/watch?v=XFZyoleN56k.

[17] CWoMP: Morpheme Representation Learning for Interlinear Glossing

Morris Alper, Enora Rice, Bhargav Shandilya, Alexis Palmer, Lori Levin

Main category: cs.CL

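TL;DR: This paper proposes CWoMP, which uses contrastive learning to align words with their constituent morphemes in a shared embedding space and an autoregressive decoder with an updatable lexicon to generate morpheme sequences, yielding automatic IGT generation that is efficient, interpretable, and interactively improvable.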

Details

Motivation: Existing automatic IGT methods treat glosses as character sequences and ignore their compositional structure, while manual IGT annotation is slow and laborious, so low-resource languages in particular need better automation.

Method: CWoMP (Contrastive Word-Morpheme Pretraining) uses a contrastively trained encoder to align words in context with their constituent morphemes; an autoregressive decoder generates the gloss sequence from a modifiable lexicon of morpheme embeddings.

Result: CWoMP outperforms existing methods on a range of low-resource languages, with especially large gains in extremely low-resource settings, while being more efficient.

Conclusion: By modeling morphemes as basic form-meaning units, CWoMP combines performance, efficiency, interpretability, and user intervention, offering a new paradigm for automating IGT in low-resource languages.

Abstract: Interlinear glossed text (IGT) is a standard notation for language documentation which is linguistically rich but laborious to produce manually. Recent automated IGT methods treat glosses as character sequences, neglecting their compositional structure. We propose CWoMP (Contrastive Word-Morpheme Pretraining), which instead treats morphemes as atomic form-meaning units with learned representations. A contrastively trained encoder aligns words-in-context with their constituent morphemes in a shared embedding space; an autoregressive decoder then generates the morpheme sequence by retrieving entries from a mutable lexicon of these embeddings. Predictions are interpretable--grounded in lexicon entries--and users can improve results at inference time by expanding the lexicon without retraining. We evaluate on diverse low-resource languages, showing that CWoMP outperforms existing methods while being significantly more efficient, with particularly strong gains in extremely low-resource settings.
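The claim that users can improve results at inference time by expanding the lexicon can be illustrated with nearest-neighbor retrieval over a small embedding table. This is only a toy: the three-dimensional vectors, gloss labels, and cosine-similarity lookup are hypothetical stand-ins, not CWoMP's learned encoder or decoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_gloss(query, lexicon):
    """Return the lexicon entry whose embedding is closest to the query embedding."""
    return max(lexicon, key=lambda gloss: cosine(query, lexicon[gloss]))

# Hypothetical lexicon mapping gloss labels to toy embeddings.
lexicon = {"dog": [0.9, 0.1, 0.0], "PL": [0.0, 0.2, 0.9]}
query = [0.1, 0.9, 0.1]  # embedding of a morpheme the lexicon does not yet cover

before = retrieve_gloss(query, lexicon)  # best available, but a poor match
lexicon["run"] = [0.0, 1.0, 0.1]         # user adds an entry; no retraining involved
after = retrieve_gloss(query, lexicon)
print(before, after)
```

Because each prediction is a lookup into the lexicon, it can be traced back to a specific entry, and adding an entry changes the output immediately.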

[18] How Psychological Learning Paradigms Shaped and Constrained Artificial Intelligence

Alex Anvi Eponon,Ildar Batyrshin,Christian E. Maldonado-Sifuentes,Grigori Sidorov

Main category: cs.CL

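TL;DR: This paper examines the historical links between AI paradigms and psychological theories, arguing that each AI paradigm inherited both the strengths and the structural limitations of its corresponding psychological theory. It then proposes ReSynth, a trimodular framework (Intellect/Identity/Memory), to address the shortcomings of existing methods in knowledge structuring, representational updatability, and the construction of understanding, with the goal of making systematic behavior a necessary consequence of AGI architecture rather than an accidental property.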

Details Motivation: Existing AI paradigms (such as reinforcement learning, deep learning, and curriculum learning) were inspired by psychological theories but also inherited their structural flaws, which makes it hard for them to support the adaptability and construction of understanding that artificial general intelligence requires. Method: The paper traces the genealogy from psychological paradigms (behaviorism, cognitivism, constructivism) to AI methods and diagnoses the limitations inherited at each stage. Drawing on cross-cultural views of education (especially the Eastern view of memorization as a structured precursor to understanding) and Aizawa's critique of classicism and connectionism, it proposes the trimodular ReSynth architecture, which treats reasoning (Intellect), purpose (Identity), and knowledge (Memory) as independent, composable modules. Result: The paper presents the ReSynth framework, which decouples reasoning, purpose, and knowledge at the architectural level so that systematic behavior is necessary rather than accidental, and offers new design principles for AI methodology drawn from cross-cultural psychology. Conclusion: To achieve real adaptability and understanding, AI must move beyond simple imitation of psychological theories toward representational architectures with built-in guarantees of systematicity; ReSynth represents a new path to AGI centered on "structured construction". Abstract: The dominant paradigms of artificial intelligence were shaped by learning theories from psychology: behaviorism inspired reinforcement learning, cognitivism gave rise to deep learning and memory-augmented architectures, and constructivism influenced curriculum learning and compositional approaches. This paper argues that each AI paradigm inherited not only the strengths but the structural limitations of the psychological theory that inspired it. Reinforcement learning cannot account for the internal structure of knowledge, deep learning compresses representations into opaque parameter spaces resistant to principled update, and current integrative approaches lack a formal account of how new understanding is constructed from existing components. The paper further examines a cross-cultural divergence in the interpretation of rote learning, arguing that the Eastern conception of memorization as a structured, multi-phase precursor to understanding offers an underexploited bridge between psychological theory and AI methodology. Drawing on the systematicity debate and Aizawa's critique of both classicism and connectionism, this paper introduces ReSynth, a trimodular framework that separates reasoning (Intellect), purpose (Identity), and knowledge (Memory) as architecturally independent components. The paper traces the genealogy from psychological paradigm to AI method, diagnoses the inherited limitations at each stage, and argues that adaptability, the central challenge of artificial general intelligence, requires a representational architecture in which systematic behavior is a necessary consequence rather than an accidental property.

[19] From Noise to Signal: When Outliers Seed New Topics

Evangelia Zve,Gauvain Bourgne,Benjamin Icard,Jean-Gabriel Ganascia

Main category: cs.CL

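TL;DR: This paper proposes a temporal taxonomy for identifying the different trajectories that news documents follow as topics evolve, in particular "anticipatory outliers" that foreshadow emerging topics, and validates it on a French news corpus on hydrogen energy.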

Details Motivation: Traditional dynamic topic modeling treats outliers as noise, but the authors argue that some outliers are early signals of emerging topics and deserve systematic modeling and use. Method: The authors build a temporal taxonomy describing how news documents relate to topic formation over time, distinguishing anticipatory outliers, reinforcing documents, and isolated documents; it is implemented and evaluated in a cumulative clustering setting using document embeddings from eleven state-of-the-art language models. Result: On the French news corpus HydroNewsFr, inter-model agreement identifies a small, high-consensus set of anticipatory outliers; qualitative case studies confirm that the taxonomy captures topic-evolution paths such as anticipating, initiating, and drifting. Conclusion: Outliers should not simply be discarded; they can bridge weak-signal detection and dynamic topic modeling, and the proposed taxonomy gives a structured view of the role individual documents play in topic evolution. Abstract: Outliers in dynamic topic modeling are typically treated as noise, yet we show that some can serve as early signals of emerging topics. We introduce a temporal taxonomy of news-document trajectories that defines how documents relate to topic formation over time. It distinguishes anticipatory outliers, which precede the topics they later join, from documents that either reinforce existing topics or remain isolated. By capturing these trajectories, the taxonomy links weak-signal detection with temporal topic modeling and clarifies how individual articles anticipate, initiate, or drift within evolving clusters. We implement it in a cumulative clustering setting using document embeddings from eleven state-of-the-art language models and evaluate it retrospectively on HydroNewsFr, a French news corpus on the hydrogen economy. Inter-model agreement reveals a small, high-consensus subset of anticipatory outliers, increasing confidence in these labels. Qualitative case studies further illustrate these trajectories through concrete topic developments.

[20] Synthetic Data Generation for Training Diversified Commonsense Reasoning Models

Tianhui Zhang,Bei Peng,Danushka Bollegala

Main category: cs.CL

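TL;DR: This paper proposes a two-stage method for building CommonSyn, the first synthetic dataset for diversified generative commonsense reasoning (GCR), to ease the shortage of high-quality, diverse commonsense training data; models fine-tuned on it beat baseline models in both generation diversity and quality.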

Details Motivation: Existing generative commonsense reasoning (GCR) datasets are small, cover a narrow range of scenarios, and are costly to annotate, so they cannot meet the training needs of diversified commonsense generation models. Method: The paper proposes a two-stage synthetic data construction method that produces CommonSyn, the first large-scale, high-quality, diverse synthetic GCR dataset. Result: Across large language models of different sizes, models fine-tuned on CommonSyn significantly improve both the diversity and the commonsense quality of generated responses, outperforming baseline models and models fine-tuned on human-annotated datasets. Conclusion: Synthetic data can effectively fill the gap in training resources for diversified commonsense reasoning, and CommonSyn provides a reliable, scalable data foundation for diversified GCR. Abstract: Conversational agents are required to respond to their users not only with high quality (i.e. commonsense bearing) responses, but also considering multiple plausible alternative scenarios, reflecting the diversity in their responses. Despite the growing need to train diverse commonsense generators, the progress of this line of work has been significantly hindered by the lack of large-scale high-quality diverse commonsense training datasets. Due to the high annotation costs, existing Generative Commonsense Reasoning (GCR) datasets are created using a small number of human annotators, covering only a narrow set of commonsense scenarios. To address this training resource gap, we propose a two-stage method to create CommonSyn, the first-ever synthetic dataset for diversified GCR. Models fine-tuned on our synthetic data jointly increase both generation diversity and quality compared with vanilla models and models fine-tuned on human-crafted datasets, across Large Language Models (LLMs) of different sizes.

[21] PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

Ruishuo Chen,Yu Chen,Zhuoran Li,Longbo Huang

Main category: cs.CL

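TL;DR: This paper proposes PowerFlow, a framework that recasts unsupervised reinforcement learning as a distribution matching problem, uses GFlowNet as a variational sampler, and designs a length-aware trajectory-balance objective to remove the length bias of autoregressive generation; tuning the α parameter makes the LLM output distribution sharper or flatter, selectively eliciting logical reasoning or creative expression.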

Details Motivation: Existing unsupervised reinforcement learning methods rely on heuristic intrinsic rewards, lack a well-defined theoretical optimization target, and are prone to degenerative biases. Method: The paper proposes the PowerFlow framework, which models unsupervised fine-tuning as a distribution matching problem, treats GFlowNet as an amortized variational sampler for unnormalized densities, designs a length-aware Trajectory-Balance objective to neutralize the structural length bias of autoregressive generation, and targets α-power distributions to steer the shape (sharp or flat) of the LLM output distribution. Result: PowerFlow consistently outperforms existing RLIF methods across experiments and matches or even exceeds supervised GRPO; in aligned models it mitigates over-sharpening while improving both diversity and quality, shifting the Pareto frontier on creative tasks. Conclusion: PowerFlow offers a principled new paradigm for unsupervised reinforcement learning, using controllable distribution shaping to selectively elicit the dual capabilities of LLMs (logical reasoning and creative expression) with a better trade-off between quality and diversity. Abstract: Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $\alpha$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($\alpha > 1$) to intensify logical reasoning, or flattening it ($\alpha < 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.
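The effect of targeting an α-power distribution can be seen on a toy categorical distribution: raising each probability to the power α and renormalizing sharpens the distribution when α > 1 and flattens it when α < 1. The sketch below only illustrates that arithmetic; PowerFlow itself works over sequence-level distributions with a GFlowNet sampler, and the function names here are hypothetical.

```python
import math

def power_distribution(probs, alpha):
    """Return p_i**alpha / sum_j p_j**alpha for a discrete distribution."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    # Work in log space for stability; zero-probability entries stay at zero.
    logs = [alpha * math.log(p) if p > 0 else float("-inf") for p in probs]
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

base = [0.5, 0.3, 0.2]
sharp = power_distribution(base, 2.0)   # alpha > 1: mass moves toward the mode
flat = power_distribution(base, 0.5)    # alpha < 1: mass spreads out
```

Comparing `entropy(sharp)`, `entropy(base)`, and `entropy(flat)` shows the ordering the abstract relies on: the sharpened distribution has the lowest entropy and the flattened one the highest.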

[22] AutoScreen-FW: An LLM-based Framework for Resume Screening

Zhelin Xu,Shuhei Yamamoto,Atsuyuki Morishima

Main category: cs.CL

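TL;DR: This paper proposes AutoScreen-FW, a locally deployable automated resume screening framework built on open-source large language models (LLMs), which uses representative sample selection and in-context learning to improve screening efficiency and privacy protection.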

Details Motivation: Corporate recruiters must screen large numbers of resumes in limited time, which is burdensome and risks overlooking suitable candidates; existing LLM-based methods depend on commercial models, which raises privacy risks, and there are no public annotated resume datasets to guide model learning. Method: The paper proposes the AutoScreen-FW framework, which uses several strategies to select a small set of representative resume samples and combines them with a persona description and evaluation criteria for in-context learning, so that an open-source LLM acts as a career advisor and evaluates new resumes. Result: Experiments show that the open-source LLMs consistently outperform GPT-5-nano across multiple ground-truth settings and surpass GPT-5-mini under one of them; under the other settings they are slightly weaker than GPT-5-mini but process each resume substantially faster. Conclusion: AutoScreen-FW has potential for local deployment, improving screening efficiency and reducing recruiters' workload while protecting data privacy. Abstract: Corporate recruiters often need to screen many resumes within a limited time, which increases their burden and may cause suitable candidates to be overlooked. To address these challenges, prior work has explored LLM-based automated resume screening. However, some methods rely on commercial LLMs, which may pose data privacy risks. Moreover, since companies typically do not make resumes with evaluation results publicly available, it remains unclear which resume samples should be used during learning to improve an LLM's judgment performance. To address these problems, we propose AutoScreen-FW, a local, automated LLM-based resume screening framework. AutoScreen-FW uses several methods to select a small set of representative resume samples. These samples are used for in-context learning together with a persona description and evaluation criteria, enabling open-source LLMs to act as a career advisor and evaluate unseen resumes. Experiments with multiple ground truths show that the open-source LLM judges consistently outperform GPT-5-nano. Under one ground-truth setting, they also surpass GPT-5-mini. Although they are slightly weaker than GPT-5-mini under other ground-truth settings, they run substantially faster per resume than commercial GPT models. These findings indicate the potential for deploying AutoScreen-FW locally in companies to support efficient screening while reducing recruiters' burden.

[23] TopoChunker: Topology-Aware Agentic Document Chunking Framework

Xiaoyu Liu

Main category: cs.CL

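TL;DR: This paper proposes TopoChunker, an agent-based document chunking framework that builds a Structured Intermediate Representation (SIR) to preserve a document's intrinsic topology, easing the semantic fragmentation caused by linear chunking; it significantly improves RAG performance on multiple benchmarks while reducing computational overhead.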

Details Motivation: Existing document chunking methods for RAG linearize text, which destroys the document's inherent topological hierarchy and causes semantic fragmentation that harms retrieval quality. Method: The paper proposes the TopoChunker framework, comprising an Inspector Agent (which dynamically selects cost-optimized extraction paths) and a Refiner Agent (which performs capacity auditing and topological context disambiguation); heterogeneous documents are mapped onto a Structured Intermediate Representation (SIR) that explicitly models cross-segment dependencies. Result: TopoChunker reaches state-of-the-art results on GutenQA and GovReport: an 8.0% absolute gain in generation accuracy, a Recall@3 of 83.26%, and a 23.5% reduction in token overhead. Conclusion: Through structure-aware chunking, TopoChunker improves the scalability and efficiency of RAG systems while maintaining high retrieval quality. Abstract: Current document chunking methods for Retrieval-Augmented Generation (RAG) typically linearize text. This forced linearization strips away intrinsic topological hierarchies, creating "semantic fragmentation" that degrades downstream retrieval quality. In this paper, we propose TopoChunker, an agentic framework that maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to explicitly preserve cross-segment dependencies. To balance structural fidelity with computational cost, TopoChunker employs a dual-agent architecture. An Inspector Agent dynamically routes documents through cost-optimized extraction paths, while a Refiner Agent performs capacity auditing and topological context disambiguation to reconstruct hierarchical lineage. Evaluated on unstructured narratives (GutenQA) and complex reports (GovReport), TopoChunker demonstrates state-of-the-art performance. It outperforms the strongest LLM-based baseline by 8.0% in absolute generation accuracy and achieves an 83.26% Recall@3, while simultaneously reducing token overhead by 23.5%, offering a scalable approach for structure-aware RAG.

[24] TARo: Token-level Adaptive Routing for LLM Test-time Alignment

Arushi Rai,Qiang Zhang,Hanqing Zeng,Yunkai Zhang,Dipesh Tamboli,Xiangjun Fan,Zhuokai Zhao,Lizhu Zhang

Main category: cs.CL

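TL;DR: This paper proposes Token-level Adaptive Routing (TARo), a test-time alignment method that adds a learnable token-level router and a reward model trained on fine-grained mathematical reasoning traces at inference time, improving the structured reasoning of frozen LLMs without extra post-training; it delivers significant gains in several domains (such as clinical reasoning and instruction following) and generalizes across model scales.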

Details Motivation: Large language models have strong reasoning ability but usually need expensive post-training, and current test-time alignment methods focus mainly on preference alignment, leaving no lightweight option aimed at reasoning. Method: The paper proposes Token-level Adaptive Routing (TARo): reward models are first trained on step-wise mathematical reasoning traces to capture logical-consistency signals, and a learnable token-level router then dynamically controls how strongly the reward model guides the base model during inference. Result: TARo improves over the base model by up to +22.4% and over existing token-level test-time alignment methods by +8.4%; it also improves results on MedXpertQA (clinical reasoning) and AlpacaEval (instruction following), and transfers from small to large models without retraining. Conclusion: TARo extends test-time alignment from preference optimization to robust, cross-domain structured reasoning, offering a new paradigm for efficiently strengthening the reasoning of frozen LLMs. Abstract: Large language models (LLMs) exhibit strong reasoning capabilities but typically require expensive post-training to reach high performance. Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning. To bridge this gap, we propose Token-level Adaptive Routing (TARo), which steers frozen LLMs toward structured reasoning entirely at inference time. Specifically, we first train reward models on step-wise mathematical traces to capture fine-grained logical consistency signals, then introduce a learnable token-level router that automatically controls the guidance of the reward model to the base model. Extensive experiments show that TARo significantly improves reasoning performance by up to +22.4% over the base model and +8.4% over existing token-level test-time alignment methods, while also boosting out-of-distribution clinical reasoning (MedXpertQA) and instruction following (AlpacaEval). Furthermore, TARo generalizes from small to large backbones without retraining, extending test-time alignment from preference optimization to robust, cross-domain reasoning.

[25] Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs

Masayuki Kawarada,Tatsuya Ishigaki,Hiroya Takamura

Main category: cs.CL

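TL;DR: This paper presents the first benchmark for evaluating task interference in multimodal large language models, finding that interference is directional, with especially large performance drops when switching from text-only to image-based targets, and that modality difference is the main driver of interference.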

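Details Motivation: Although multimodal dialogue systems are increasingly common, task interference has so far been studied only in text-only settings, and no evaluation benchmark exists for multimodal LLMs. Method: The authors build a benchmark covering six tasks across text and vision, systematically vary history-target pairs along three axes (modality mismatch, reasoning mismatch, and answer format mismatch), and run experiments on open-weights and proprietary multimodal models. Result: Task interference is highly directional: switching from text to image targets causes severe performance drops, while the reverse has little effect; mismatches along several axes compound the interference; modality difference is the dominant factor, followed by answer format, with changes in reasoning requirement having the least effect. Conclusion: Task interference is widespread in multimodal LLMs and follows structural regularities, so modality-switch effects need to be modeled explicitly in model design and evaluation. Abstract: Task interference, the performance degradation caused by task switches within a single conversation, has been studied exclusively in text-only settings despite the growing prevalence of multimodal dialogue systems. We introduce a benchmark for evaluating this phenomenon in multimodal LLMs, covering six tasks across text and vision with systematic variation of history-target pairs along three axes: modality mismatch, reasoning mismatch, and answer format mismatch. Experiments on both open-weights and proprietary models reveal that task interference is highly directional: switching from text-only to image-based targets causes severe performance drops, while the reverse transition yields minimal degradation. Interference is further amplified when mismatches co-occur across multiple dimensions, and is driven most strongly by modality differences, followed by answer format, while reasoning requirement shifts cause minimal degradation.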

[26] Adaptive Decoding via Test-Time Policy Learning for Self-Improving Generation

Asmita Bhardwaj,Yuya Jeremy Ong,Eelaaf Zahid,Basel Shbita

Main category: cs.CL

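TL;DR: This paper proposes a reinforcement learning-based decoder sampler that dynamically adjusts sampling parameters (such as temperature and top-p) at test time to improve LLM generation quality without updating model weights, significantly outperforming static decoding strategies on several summarization datasets.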

Details Motivation: Mainstream decoding strategies (such as greedy decoding or fixed temperature/top-p) are static and task-agnostic, so they adapt poorly to domains with flexible stylistic or structural demands, leading to unstable or suboptimal generation quality. Method: Decoding is modeled as a sequential decision problem, and a lightweight reinforcement learning policy network adjusts sampling parameters at test time; a composite reward function (with structured shaping terms for length, coverage, repetition, and completeness) is used, and Granite-3.3-2B and Qwen-2.5-0.5B are evaluated on datasets including BookSum, arXiv, and WikiHow. Result: Compared with greedy and static baselines, the policy achieves relative gains of up to +88% on BookSum (Granite) and +79% on WikiHow (Qwen); ablations show that the composite reward and the structured shaping terms are essential for stable improvement. Conclusion: Reinforcement learning can serve as a practical test-time adaptation mechanism, enabling domain-aware and user-controllable generation from large models without retraining. Abstract: Decoding strategies largely determine the quality of Large Language Model (LLM) outputs, yet widely used heuristics such as greedy or fixed temperature/top-p decoding are static and often task-agnostic, leading to suboptimal or inconsistent generation quality across domains that demand stylistic or structural flexibility. We introduce a reinforcement learning-based decoder sampler that treats decoding as sequential decision-making and learns a lightweight policy to adjust sampling parameters at test-time while keeping LLM weights frozen. We evaluate on summarization datasets including BookSum, arXiv, and WikiHow using Granite-3.3-2B and Qwen-2.5-0.5B. Our policy sampler consistently outperforms greedy and static baselines, achieving relative gains of up to +88% (BookSum, Granite) and +79% (WikiHow, Qwen). Reward ablations show that overlap-only objectives underperform compared to composite rewards, while structured shaping terms (length, coverage, repetition, completeness) enable stable and sustained improvements. These findings highlight reinforcement learning as a practical mechanism for test-time adaptation in decoding, enabling domain-aware and user-controllable generation without retraining large models.

[27] UT-ACA: Uncertainty-Triggered Adaptive Context Allocation for Long-Context Inference

Lang Zhou,Shuxuan Li,Zhuohao Li,Shi Liu,Zhilin Zhao,Wei-Shi Zheng

Main category: cs.CL

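TL;DR: This paper proposes UT-ACA, a framework that dynamically adjusts the context window at inference time based on token-level uncertainty, to mitigate attention dilution and out-of-distribution degradation in long-context inference.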

Details Motivation: Long-context inference suffers from attention dilution and out-of-distribution degradation, and existing context selection methods use a fixed budget that cannot adapt to non-uniform token-level context demands. Method: The paper proposes the Uncertainty-Triggered Adaptive Context Allocation (UT-ACA) framework: an uncertainty detector combines semantic embeddings with logit-based confidence and models uncertainty accumulation across decoding steps; when insufficient evidence is detected, the framework selectively rolls back, expands the context, and regenerates the token. Result: Experiments show that UT-ACA substantially reduces average context usage while maintaining generation quality in long-context settings. Conclusion: Dynamic, uncertainty-driven context allocation is an efficient, quality-preserving optimization strategy for long-context inference. Abstract: Long-context inference remains challenging for large language models due to attention dilution and out-of-distribution degradation. Context selection mitigates this limitation by attending to a subset of key-value cache entries, yet most methods allocate a fixed context budget throughout decoding despite highly non-uniform token-level contextual demands. To address this issue, we propose Uncertainty-Triggered Adaptive Context Allocation (UT-ACA), an inference-time framework that dynamically adjusts the context window based on token-wise uncertainty. UT-ACA learns an uncertainty detector that combines semantic embeddings with logit-based confidence while accounting for uncertainty accumulation across decoding steps. When insufficient evidence is indicated, UT-ACA selectively rolls back, expands the context window, and regenerates the token with additional support. Experiments show that UT-ACA substantially reduces average context usage while preserving generation quality in long-context settings.

[28] GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms

Masayuki Kawarada,Kodai Watanabe,Soichiro Murakami

Main category: cs.CL

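TL;DR: GAIN is a new benchmark for evaluating how large language models trade off norm adherence against business goals in realistic business scenarios, introducing five types of contextual pressure to systematically analyze the factors that influence decisions.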

Details Motivation: Existing benchmarks mostly focus on abstract scenarios, do not assess adaptability to norm-goal conflicts in real business applications, and reveal little about the key factors behind LLM decisions. Method: The authors build the GAIN benchmark with 1,200 scenarios across four domains (hiring, customer support, advertising, and finance); each scenario provides a goal, a situation, a norm, and one of five explicitly designed pressures (Goal Alignment, Risk Aversion, Emotional/Ethical Appeal, Social/Authoritative Influence, Personal Incentive) to systematically probe decision tendencies. Result: Experiments show that advanced LLMs usually mirror human decision patterns but diverge significantly from human behavior under Personal Incentive pressure, showing stronger norm adherence rather than a tendency to compromise. Conclusion: GAIN sheds light on how LLMs decide under complex norm-goal conflicts, highlighting in particular their conservative tendency when personal gain is at stake, and provides a new evaluation dimension for improving models' real-world adaptability. Abstract: We introduce GAIN (Goal-Aligned Decision-Making under Imperfect Norms), a benchmark designed to evaluate how large language models (LLMs) balance adherence to norms against business goals. Existing benchmarks typically focus on abstract scenarios rather than real-world business applications. Furthermore, they provide limited insights into the factors influencing LLM decision-making. This restricts their ability to measure models' adaptability to complex, real-world norm-goal conflicts. In GAIN, models receive a goal, a specific situation, a norm, and additional contextual pressures. These pressures, explicitly designed to encourage potential norm deviations, are a unique feature that differentiates GAIN from other benchmarks, enabling a systematic evaluation of the factors influencing decision-making. We define five types of pressures: Goal Alignment, Risk Aversion, Emotional/Ethical Appeal, Social/Authoritative Influence, and Personal Incentive. The benchmark comprises 1,200 scenarios across four domains: hiring, customer support, advertising, and finance. Our experiments show that advanced LLMs frequently mirror human decision-making patterns. However, when Personal Incentive pressure is present, they diverge significantly, showing a strong tendency to adhere to norms rather than deviate from them.

[29] WASD: Locating Critical Neurons as Sufficient Conditions for Explaining and Controlling LLM Behavior

Haonan Yu,Junhao Liu,Zhenyu Yan,Haoran Lin,Xin Zhang

Main category: cs.CL

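TL;DR: This paper proposes WASD, a framework that explains large language model behavior by identifying sufficient neural conditions for token generation, enabling more stable, accurate, and concise behavioral control.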

Details Motivation: Existing methods for controlling LLM behavior suffer from high training costs, a lack of natural-language controllability, or poor semantic coherence, so a more efficient and precise control mechanism is needed. Method: The WASD framework represents candidate conditions as neuron-activation predicates and iteratively searches, under input perturbations, for a minimal set of sufficient conditions that guarantees the current output. Result: Experiments on SST-2 and CounterFact with the Gemma-2-2B model show that WASD produces explanations that are more stable, accurate, and concise than conventional attribution graphs; a case study on controlling cross-lingual generation confirms its practical effectiveness. Conclusion: WASD provides an interpretable, controllable, and efficient neuron-level approach to steering LLM behavior, improving the precision and practicality of behavioral control. Abstract: Precise behavioral control of large language models (LLMs) is critical for complex applications. However, existing methods often incur high training costs, lack natural language controllability, or compromise semantic coherence. To bridge this gap, we propose WASD (unWeaving Actionable Sufficient Directives), a novel framework that explains model behavior by identifying sufficient neural conditions for token generation. Our method represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations. Experiments on SST-2 and CounterFact with the Gemma-2-2B model demonstrate that our approach produces explanations that are more stable, accurate, and concise than conventional attribution graphs. Moreover, through a case study on controlling cross-lingual output generation, we validate the practical effectiveness of WASD in controlling model behavior.

[30] The Truncation Blind Spot: How Decoding Strategies Systematically Exclude Human-Like Token Choices

Esteban Garces Arias,Nurzhan Sapargali,Christian Heumann,Matthias Aßenmacher

Main category: cs.CL

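TL;DR: This paper argues that standard decoding strategies for text generation (such as top-k and nucleus sampling) rely on the statistical probability of tokens, whereas human language production is driven by communicative appropriateness; as a result, models struggle to produce statistically rare but contextually appropriate tokens, creating a "truncation blind spot" that makes machine-generated text easier to detect. Empirical analysis shows that 8-18% of human-selected tokens fall outside typical truncation boundaries and that detection performance is determined mainly by truncation parameters rather than model scale or architecture; low-detectability configurations often come at the cost of coherence.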

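Details Motivation: Standard decoding strategies differ fundamentally from human language production: the former sample from high-probability regions, the latter aims for contextual appropriateness; this mismatch creates a "truncation blind spot" that may explain why machine text is easy to identify. Method: The authors run a large-scale empirical analysis of over 1.8 million texts generated by 8 language models with 5 decoding strategies and 53 hyperparameter configurations, quantify the share of human-selected tokens that fall outside truncation boundaries, and train classifiers on predictability and lexical diversity features to assess detection performance. Result: 8-18% of human-selected tokens lie outside typical truncation boundaries; simple classifiers detect machine text well; truncation parameters are the main source of variation in detection rates, while model scale and architecture matter little; configurations with low detection rates often yield incoherent text. Conclusion: The detectability of machine text stems mainly from likelihood-based token selection itself rather than from insufficient model capability; evading detection and producing natural text are distinct and often conflicting objectives. Abstract: Standard decoding strategies for text generation, including top-k, nucleus sampling, and contrastive search, select tokens based on likelihood, restricting selection to high-probability regions. Human language production operates differently: tokens are chosen for communicative appropriateness rather than statistical frequency. This mismatch creates a truncation blind spot: contextually appropriate but statistically rare tokens remain accessible to humans yet unreachable by likelihood-based decoding. We hypothesize this contributes to the detectability of machine-generated text. Analyzing over 1.8 million texts across eight language models, five decoding strategies, and 53 hyperparameter configurations, we find that 8-18% of human-selected tokens fall outside typical truncation boundaries. Simple classifiers trained on predictability and lexical diversity achieve remarkable detection rates. Crucially, neither model scale nor architecture correlates strongly with detectability; truncation parameters account for most variance. Configurations achieving low detectability often produce incoherent text, indicating that evading detection and producing natural text are distinct objectives. These findings suggest detectability is enhanced by likelihood-based token selection, not merely a matter of model capability.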
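The quantity at the center of this abstract, the share of human-selected tokens that a truncation rule would never sample, can be sketched on toy data. The code below is an illustration only: the helper names, the four-token vocabulary, and the probabilities are made up, and the paper's actual measurement protocol may differ.

```python
def ranked(probs):
    """Token ids sorted from most to least probable (ties keep index order)."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

def top_k_set(probs, k):
    """Tokens that top-k sampling can reach."""
    return set(ranked(probs)[:k])

def nucleus_set(probs, top_p):
    """Smallest prefix of the ranking whose cumulative mass reaches top_p."""
    kept, mass = set(), 0.0
    for i in ranked(probs):
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept

def excluded_fraction(steps, keep_fn):
    """Share of human tokens that fall outside the kept set at their step.

    steps is a list of (model_probs, human_token_id) pairs.
    """
    if not steps:
        return 0.0
    missed = sum(1 for probs, token in steps if token not in keep_fn(probs))
    return missed / len(steps)

# Toy next-token distributions paired with the token a human actually wrote.
steps = [
    ([0.6, 0.3, 0.08, 0.02], 0),
    ([0.6, 0.3, 0.08, 0.02], 3),
    ([0.4, 0.4, 0.15, 0.05], 2),
    ([0.9, 0.05, 0.03, 0.02], 1),
]
top_k_miss = excluded_fraction(steps, lambda p: top_k_set(p, 2))
nucleus_miss = excluded_fraction(steps, lambda p: nucleus_set(p, 0.85))
```

Note that the two rules exclude different human tokens here: top-k with k=2 misses the third step, while the nucleus rule with top_p=0.85 misses the fourth, which is why the abstract treats truncation parameters, not just the strategy, as the variable of interest.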

[31] EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models

Minsoo Cheong,Donghyun Son,Woosang Lim,Sungjoo Yoo

Main category: cs.CL

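TL;DR: This paper proposes EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute, significantly speeding up inference for diffusion-based large language models (dLLMs).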

Details Motivation: Diffusion-based large language models (dLLMs) use bidirectional attention, which rules out lossless KV caching and requires a full forward pass at every denoising step; existing approximate KV caching methods lower the compute cost, but their decision overhead grows with context length or model depth. Method: EntropyCache builds on two empirical observations: (1) decoded token entropy correlates with KV cache drift and so gives a cheap proxy for cache staleness; (2) the feature volatility of decoded tokens persists for several steps after unmasking, so the k most recently decoded tokens should be recomputed. The skip-or-recompute decision costs only O(V) per step, independent of context length and model scale. Result: Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show speedups of 15.2×-26.4× on standard benchmarks and 22.4×-24.1× on chain-of-thought benchmarks, with competitive accuracy and decision overhead of only 0.5% of inference time. Conclusion: EntropyCache is an efficient, lightweight, training-free KV caching strategy that reduces redundant computation in dLLM inference and offers a practical route to deploying diffusion language models. Abstract: Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the $k$ most recently decoded tokens. The skip-or-recompute decision requires only $O(V)$ computation per step, independent of context length and model scale. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that EntropyCache achieves $15.2\times$-$26.4\times$ speedup on standard benchmarks and $22.4\times$-$24.1\times$ on chain-of-thought benchmarks, with competitive accuracy and decision overhead accounting for only $0.5\%$ of inference time. Code is available at https://github.com/mscheong01/EntropyCache.
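The skip-or-recompute signal described above can be sketched as a small gate: compute the entropy of each newly decoded token's distribution, take the maximum, and compare it with a threshold. This is a minimal sketch under assumptions, not the released implementation: the threshold value, the rule that higher entropy triggers recomputation, and the function names are all hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of one token distribution; cost is linear in vocabulary size."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_recompute(new_token_dists, threshold):
    """Assumed gate: recompute when the max entropy among newly decoded tokens exceeds threshold."""
    if not new_token_dists:
        return False
    return max(entropy(d) for d in new_token_dists) > threshold

def positions_to_refresh(decode_order, k):
    """Return the k most recently decoded positions (the end of decode_order)."""
    return decode_order[-k:] if k > 0 else []

confident = [0.97, 0.01, 0.01, 0.01]    # low-entropy prediction
uncertain = [0.25, 0.25, 0.25, 0.25]    # maximum-entropy prediction over 4 tokens

skip_step = should_recompute([confident], threshold=0.5)
refresh_step = should_recompute([confident, uncertain], threshold=0.5)
recent = positions_to_refresh([5, 2, 9, 7], k=2)
```

The gate reads only the output distributions of the tokens decoded in the current step, so its cost does not depend on how long the context is, which is the property the abstract emphasizes.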

[32] When Names Change Verdicts: Intervention Consistency Reveals Systematic Bias in LLM Decision-Making

Abhinaba Basu,Pavan Chakraborty

Main category: cs.CL

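TL;DR: This paper proposes ICE-Guard, a framework that uses intervention consistency testing to detect LLM reliance on three types of spurious features (demographic, authority, and framing) in high-stakes decisions. Evaluating 11 models on 3,000 vignettes, it finds that authority and framing bias far exceed demographic bias and that bias is domain-specific; structured decomposition substantially reduces bias, and ICE-guided iterative prompt patching achieves a cumulative 78% bias reduction.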

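Details Motivation: Large language models (LLMs) are increasingly used for high-stakes decisions, but their sensitivity to spurious features is not well characterized; in particular, current research focuses heavily on demographic bias and neglects other potential sources of bias. Method: The paper proposes the ICE-Guard framework, based on intervention consistency testing, which systematically constructs three types of intervention: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements). It evaluates 11 LLMs on 3,000 vignettes across 10 high-stakes domains, introduces structured decomposition (feature extraction followed by a deterministic rubric) as a mitigation, and designs an ICE-guided detect-diagnose-mitigate-verify loop for iterative prompt patching. Result: (1) Authority bias (mean 5.8%) and framing bias (5.0%) are substantially higher than demographic bias (2.2%); (2) bias is highly domain-dependent, e.g. authority bias reaches 22.6% in finance but only 2.8% in criminal justice; (3) structured decomposition lowers the flip rate by a median of 49% and by up to 100%; (4) the ICE loop achieves a cumulative 78% bias reduction via iterative prompt patching; (5) validation on real COMPAS recidivism data yields flip rates higher than the synthetic benchmark, indicating that the benchmark gives a conservative estimate. Conclusion: Reliance on spurious features is multidimensional and varies by domain, so attention cannot be limited to demographic bias; structured reasoning and closed-loop optimization based on intervention consistency are effective ways to make LLM high-stakes decisions more robust; ICE-Guard provides a reproducible, extensible evaluation framework for detecting and mitigating bias. Abstract: Large language models (LLMs) are increasingly used for high-stakes decisions, yet their susceptibility to spurious features remains poorly characterized. We introduce ICE-Guard, a framework applying intervention consistency testing to detect three types of spurious feature reliance: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements). Across 3,000 vignettes spanning 10 high-stakes domains, we evaluate 11 LLMs from 8 families and find that (1) authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%), challenging the field's narrow focus on demographics; (2) bias concentrates in specific domains -- finance shows 22.6% authority bias while criminal justice shows only 2.8%; (3) structured decomposition, where the LLM extracts features and a deterministic rubric decides, reduces flip rates by up to 100% (median 49% across 9 models). We demonstrate an ICE-guided detect-diagnose-mitigate-verify loop achieving cumulative 78% bias reduction via iterative prompt patching. Validation against real COMPAS recidivism data shows COMPAS-derived flip rates exceed pooled synthetic rates, suggesting our benchmark provides a conservative estimate of real-world bias. Code and data are publicly available.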
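The flip rate reported in this abstract can be read as the share of paired vignettes whose decision changes when only a spurious feature is swapped. The sketch below assumes that reading; the function names, the decision labels, and the example numbers are illustrative and are not taken from the ICE-Guard code.

```python
def flip_rate(pairs):
    """Share of (original_decision, intervened_decision) pairs that disagree."""
    if not pairs:
        return 0.0
    flips = sum(1 for original, intervened in pairs if original != intervened)
    return flips / len(pairs)

def relative_reduction(before, after):
    """Fractional drop in flip rate after a mitigation (0.0 if there was nothing to reduce)."""
    if before == 0:
        return 0.0
    return (before - after) / before

# Toy decisions before and after swapping only the applicant's name.
name_swap_pairs = [
    ("approve", "approve"),
    ("approve", "deny"),     # the decision flipped although only the name changed
    ("deny", "deny"),
    ("deny", "deny"),
]
baseline_rate = flip_rate(name_swap_pairs)
mitigated_rate = 0.125       # hypothetical flip rate after a mitigation step
reduction = relative_reduction(baseline_rate, mitigated_rate)
```

With these toy values one of four decisions flips, and halving the rate corresponds to a 50% relative reduction, the same kind of figure as the median 49% quoted for structured decomposition.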

[33] Cross-Lingual LLM-Judge Transfer via Evaluation Decomposition

Ivaxi Sheth,Zeno Jonke,Amin Mantrach,Saab Mansour

Main category: cs.CL

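TL;DR: This paper proposes a decomposition-based framework for cross-lingual LLM evaluation that builds a Universal Criteria Set (UCS) to enable interpretable, transferable evaluation without human annotations in the target language.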

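Details Motivation: Existing LLM evaluation methods depend heavily on English, while other languages lack high-quality, low-cost human-annotated data, which makes adaptation difficult. Method: The paper proposes a decomposition-based evaluation framework centered on a language-agnostic Universal Criteria Set (UCS), which breaks the evaluation task into shared dimensions and produces an interpretable intermediate representation that supports cross-lingual transfer with minimal supervision. Result: Experiments on faithfulness tasks across multiple languages and model backbones show consistent improvements over strong baselines without any target-language human annotations. Conclusion: The UCS framework offers a scalable, interpretable, low-resource paradigm for automated cross-lingual evaluation of large language models. Abstract: As large language models are increasingly deployed across diverse real-world applications, extending automated evaluation beyond English has become a critical challenge. Existing evaluation approaches are predominantly English-focused, and adapting them to other languages is hindered by the scarcity and cost of human-annotated judgments in most languages. We introduce a decomposition-based evaluation framework built around a Universal Criteria Set (UCS). UCS consists of a shared, language-agnostic set of evaluation dimensions, producing an interpretable intermediate representation that supports cross-lingual transfer with minimal supervision. Experiments on multiple faithfulness tasks across languages and model backbones demonstrate consistent improvements over strong baselines without requiring target-language annotations.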

[34] ICE: Intervention-Consistent Explanation Evaluation with Statistical Grounding for LLMs

Abhinaba Basu,Pavan Chakraborty

Main category: cs.CL

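TL;DR: This paper proposes ICE, a framework that evaluates explanation faithfulness with randomization tests under multiple intervention operators, and finds that faithfulness depends heavily on the intervention operator, is almost unrelated to human plausibility, and shows significant model-language interactions.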

Details Motivation: Existing benchmarks for explanation faithfulness use a single intervention and no statistical testing, so they cannot separate real faithfulness from chance-level performance. Method: The paper proposes the ICE (Intervention-Consistent Explanation) framework, which runs randomization tests with multiple intervention operators against matched random baselines and reports win rates with confidence intervals. Result: Experiments on 7 LLMs, 4 English tasks, 6 non-English languages, and 2 attribution methods show that faithfulness depends heavily on the intervention operator (gaps of up to 44 percentage points); about one-third of configurations show anti-faithfulness; faithfulness is almost unrelated to human plausibility (|r| < 0.04); and multilingual evaluation reveals significant model-language interactions. Conclusion: Explanation faithfulness should not be summarized as a single score but compared across intervention operators; randomized baselines are essential; the relationship between faithfulness and plausibility needs rethinking, and differences in model behavior across languages deserve attention. Abstract: Evaluating whether explanations faithfully reflect a model's reasoning remains an open problem. Existing benchmarks use single interventions without statistical testing, making it impossible to distinguish genuine faithfulness from chance-level performance. We introduce ICE (Intervention-Consistent Explanation), a framework that compares explanations against matched random baselines via randomization tests under multiple intervention operators, yielding win rates with confidence intervals. Evaluating 7 LLMs across 4 English tasks, 6 non-English languages, and 2 attribution methods, we find that faithfulness is operator-dependent: operator gaps reach up to 44 percentage points, with deletion typically inflating estimates on short text but the pattern reversing on long text, suggesting that faithfulness should be interpreted comparatively across intervention operators rather than as a single score. Randomized baselines reveal anti-faithfulness in one-third of configurations, and faithfulness shows near-zero correlation with human plausibility (|r| < 0.04). Multilingual evaluation reveals dramatic model-language interactions not explained by tokenization alone. We release the ICE framework and ICEBench benchmark.
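Comparing an explanation against matched random baselines, as the abstract describes, can be sketched as follows: intervene on the tokens the explanation selects, intervene on equally sized random token sets, and count how often the explanation produces the larger effect. Everything specific here is an assumption for illustration: the effect scores are made up, and the Wilson interval is one common choice of confidence interval, not necessarily the one ICE uses.

```python
import math

def win_rate(explained_effect, random_effects):
    """Share of random baselines that the explanation's intervention effect beats."""
    wins = sum(1 for r in random_effects if explained_effect > r)
    return wins / len(random_effects)

def randomization_p_value(explained_effect, random_effects):
    """One-sided p-value with the usual +1 correction for randomization tests."""
    at_least = sum(1 for r in random_effects if r >= explained_effect)
    return (1 + at_least) / (1 + len(random_effects))

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical drops in model confidence after each intervention.
explained_effect = 0.28
random_effects = [0.10, 0.20, 0.05, 0.30, 0.15, 0.25, 0.12, 0.08, 0.22]

rate = win_rate(explained_effect, random_effects)
p_value = randomization_p_value(explained_effect, random_effects)
low, high = wilson_interval(successes=8, n=len(random_effects))
```

A win rate well below one half under this scheme would correspond to what the abstract calls anti-faithfulness: the explanation's tokens matter less than randomly chosen ones.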

[35] Language Model Maps for Prompt-Response Distributions via Log-Likelihood Vectors

Yusuke Takase,Momose Oyama,Hidetoshi Shimodaira

Main category: cs.CL

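TL;DR: This paper proposes representing language models by log-likelihood vectors and PMI vectors, building model maps that quantify differences between conditional distributions and reveal structural relationships among models as well as the effects of prompt engineering.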

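Details Motivation: Existing methods struggle to systematically characterize how language models' conditional distributions differ across prompts and how models relate globally, so an interpretable, measurable framework for analyzing model behavior is needed. Method: Language models are represented as log-likelihood vectors over prompt-response pairs, yielding model maps in which Euclidean distance approximates KL divergence; pointwise mutual information (PMI) vectors are introduced to reduce the influence of unconditional distributions. Result: Experiments on a large collection of public language models confirm that the maps reflect model attributes, task performance, and systematic shifts caused by prompt modifications; PMI vectors do better at capturing differences related to training data. Conclusion: The framework provides an interpretable, quantitative tool for analyzing input-dependent language model behavior and supports modeling and predicting the effects of prompt engineering. Abstract: We propose a method that represents language models by log-likelihood vectors over prompt-response pairs and constructs model maps for comparing their conditional distributions. In this space, distances between models approximate the KL divergence between the corresponding conditional distributions. Experiments on a large collection of publicly available language models show that the maps capture meaningful global structure, including relationships to model attributes and task performance. The method also captures systematic shifts induced by prompt modifications and their approximate additive compositionality, suggesting a way to analyze and predict the effects of composite prompt operations. We further introduce pointwise mutual information (PMI) vectors to reduce the influence of unconditional distributions; in some cases, PMI-based model maps better reflect training-data-related differences. Overall, the framework supports the analysis of input-dependent model behavior.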
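The representation in this abstract can be made concrete with a toy example: each model becomes a vector with one log-likelihood per prompt-response pair, models are compared by Euclidean distance, and a PMI vector subtracts the unconditional log-likelihood of each response. The numbers and model names below are made up, and the sketch omits any centering or scaling the paper may apply before relating distance to KL divergence.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pmi_vector(conditional_loglik, unconditional_loglik):
    """Pointwise mutual information per pair: log p(y|x) - log p(y)."""
    return [c - u for c, u in zip(conditional_loglik, unconditional_loglik)]

# log p(response | prompt) for the same three prompt-response pairs, per model.
model_a = [-1.0, -2.0, -3.0]
model_b = [-1.1, -2.1, -2.9]   # behaves much like model_a
model_c = [-3.0, -1.0, -0.5]   # behaves differently

dist_ab = euclidean(model_a, model_b)
dist_ac = euclidean(model_a, model_c)

# log p(response) without the prompt, for model_a (hypothetical values).
model_a_unconditional = [-1.5, -2.5, -3.5]
model_a_pmi = pmi_vector(model_a, model_a_unconditional)
```

Models with similar conditional behavior land close together (`dist_ab` is much smaller than `dist_ac`), and the PMI vector isolates how much the prompt changes each response's likelihood rather than how likely the response is on its own.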

[36] Cross-Modal Rationale Transfer for Explainable Humanitarian Classification on Social Media

Thi Huyen Nguyen,Koustav Rudra,Wolfgang Nejdl

Main category: cs.CL

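TL;DR: This paper proposes an interpretable multimodal classification framework that uses cross-modal rationale transfer (text to image) to jointly classify crisis-related text and images and extract rationales; on CrisisMMD it improves Macro-F1 by 2-35% and reaches 80% accuracy in zero-shot transfer.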

Details Motivation: Existing crisis-information classifiers lack interpretability, and in particular have no effective way to extract rationales from images, which limits real-world deployment; manually annotating multimodal rationales is also costly. Method: Learns a joint text-image representation with a visual language transformer, extracts text rationales first, then derives image rationales through a cross-modal mapping from the text rationales (rationale transfer), and finally classifies based on the rationales from both modalities. Result: On CrisisMMD, Macro-F1 improves by 2-35% and image rationale patch retrieval improves by 12%; zero-shot generalization reaches 80% accuracy; human evaluation confirms that the image rationales are better. Conclusion: The interpretable-by-design framework extracts multimodal rationales and classifies with little annotation effort, balancing performance, transparency, and generalization for real crisis-response settings. Abstract: Advances in social media data dissemination enable the provision of real-time information during a crisis. The information comes from different classes, such as infrastructure damages, persons missing or stranded in the affected zone, etc. Existing methods attempted to classify text and images into various humanitarian categories, but their decision-making process remains largely opaque, which affects their deployment in real-life applications. Recent work has sought to improve transparency by extracting textual rationales from tweets to explain predicted classes. However, such explainable classification methods have mostly focused on text, rather than crisis-related images. In this paper, we propose an interpretable-by-design multimodal classification framework. Our method first learns the joint representation of text and image using a visual language transformer model and extracts text rationales. Next, it extracts the image rationales via the mapping with text rationales. Our approach demonstrates how to learn rationales in one modality from another through cross-modal rationale transfer, which saves annotation effort. Finally, tweets are classified based on extracted rationales. Experiments are conducted over CrisisMMD benchmark dataset, and results show that our proposed method boosts the classification Macro-F1 by 2-35% while extracting accurate text tokens and image patches as rationales. Human evaluation also supports the claim that our proposed method is able to retrieve better image rationale patches (12%) that help to identify humanitarian classes. 
Our method adapts well to new, unseen datasets in zero-shot mode, achieving an accuracy of 80%.

[37] DiscoPhon: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units

Maxime Poli,Manel Khentout,Angelo Ortiz Tandazo,Ewan Dunbar,Emmanuel Chemla,Emmanuel Dupoux

Main category: cs.CL

TL;DR: DiscoPhon is a multilingual benchmark for unsupervised phoneme discovery from discrete speech units, covering 12 languages and evaluating unit quality, recognition, and segmentation using pretrained HuBERT and SpidR models.

Details Motivation: To evaluate unsupervised phoneme discovery across diverse languages with limited data (10 hours per language), addressing the need for standardized multilingual benchmarks in speech representation learning. Method: Constructing DiscoPhon, a benchmark with 6 dev and 6 test languages; using pretrained multilingual HuBERT and SpidR models to derive discrete speech units; mapping units to phoneme inventories via many-to-one or one-to-one assignment; evaluating unit quality, phoneme recognition, and segmentation. Result: Current multilingual models contain sufficient phonemic information for derived units to correlate well with phonemes, though performance varies across languages. Conclusion: DiscoPhon enables systematic evaluation of unsupervised phoneme discovery; results confirm phonemic signals are present in modern self-supervised models but highlight cross-lingual variability needing further investigation. Abstract: We introduce DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 dev and 6 test languages, chosen to span a wide range of phonemic contrasts. Given only 10 hours of speech in a previously unseen language, systems must produce discrete units that are mapped to a predefined phoneme inventory, through either a many-to-one or a one-to-one assignment. The resulting sequences are evaluated for unit quality, recognition and segmentation. We provide four pretrained multilingual HuBERT and SpidR baselines, and show that phonemic information is available enough in current models for derived units to correlate well with phonemes, though with variations across languages.

[38] Learning to Self-Evolve

Xiaoyin Chen,Canwen Xu,Yite Wang,Boyi Liu,Zhewei Yao,Yuxiong He

Main category: cs.CL

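TL;DR: This paper proposes Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models to improve their own contexts at test time; it substantially improves Text-to-SQL and question-answering performance and transfers across models.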

Details Motivation: Existing test-time self-evolution methods rely entirely on the model's inherent reasoning ability and never train it explicitly for the task; this work aims to treat self-evolution as a learnable skill. Method: Reduces multi-step context evolution to a single-step reinforcement learning objective in which each context edit is rewarded by the improvement in downstream performance, paired with a tree-guided evolution loop. Result: On BIRD and MMLU-Redux, a 4B-parameter model outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods such as GEPA and TextGrad, and can guide other models without additional training. Conclusion: Explicitly modeling and training test-time self-evolution as a learnable skill substantially improves model performance and generalization. Abstract: We introduce Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models (LLMs) to improve their own contexts at test time. We situate LSE in the setting of test-time self-evolution, where a model iteratively refines its context from feedback on seen problems to perform better on new ones. Existing approaches rely entirely on the inherent reasoning ability of the model and never explicitly train it for this task. LSE reduces the multi-step evolution problem to a single-step RL objective, where each context edit is rewarded by the improvement in downstream performance. We pair this objective with a tree-guided evolution loop. On Text-to-SQL generation (BIRD) and general question answering (MMLU-Redux), a 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods including GEPA and TextGrad, and transfers to guide other models without additional training. Our results highlight the effectiveness of treating self-evolution as a learnable skill.

[39] A Comparative Empirical Study of Catastrophic Forgetting Mitigation in Sequential Task Adaptation for Continual Natural Language Processing Systems

Aram Abrahamyan,Sachin Kumar

Main category: cs.CL

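TL;DR: This paper empirically compares how well continual learning (CL) methods mitigate catastrophic forgetting in continual intent classification. Using CLINC150 to build a 10-task label-disjoint scenario, it evaluates three backbones (ANN, GRU, Transformer) with several CL strategies (MIR, LwF, HAT) and their combinations, finding that replay (especially MIR) is the key ingredient and that the best CL strategy depends on the backbone.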

Details Motivation: Neural language models deployed in practice must keep adapting to new tasks and domains without forgetting earlier knowledge; prior work lacks a systematic empirical analysis of how backbone architectures pair with continual learning strategies. Method: Builds a 10-task label-disjoint continual learning scenario on CLINC150 and compares three backbones (ANN, GRU, Transformer), applying three representative CL methods (replay-based MIR, regularization-based LwF, parameter-isolation HAT) individually and in all pairwise and triple combinations, evaluated by average accuracy, macro F1, and backward transfer. Result: All architectures forget severely under naive sequential fine-tuning; no single CL method fully prevents forgetting; combinations that include MIR (MIR+HAT, MIR+LwF, MIR+LwF+HAT) are the most robust, with near-zero or positive backward transfer; the best combination varies by architecture (MIR+HAT for ANN and Transformer, MIR+LwF+HAT for GRU); in several cases CL methods even surpass joint training, indicating a regularization effect. Conclusion: Continual intent-classification systems should be designed by choosing the backbone architecture and the CL mechanism jointly; replay (especially MIR) is the key to a better stability-plasticity trade-off. Abstract: Neural language models deployed in real-world applications must continually adapt to new tasks and domains without forgetting previously acquired knowledge. This work presents a comparative empirical study of catastrophic forgetting mitigation in continual intent classification. Using the CLINC150 dataset, we construct a 10-task label-disjoint scenario and evaluate three backbone architectures: a feed-forward Artificial Neural Network (ANN), a Gated Recurrent Unit (GRU), and a Transformer encoder, under a range of continual learning (CL) strategies. We consider one representative method from each major CL family: replay-based Maximally Interfered Retrieval (MIR), regularization-based Learning without Forgetting (LwF), and parameter-isolation via Hard Attention to Task (HAT), both individually and in all pairwise and triple combinations. Performance is assessed with average accuracy, macro F1, and backward transfer, capturing the stability-plasticity trade-off across the task sequence. Our results show that naive sequential fine-tuning suffers from severe forgetting for all architectures and that no single CL method fully prevents it. Replay emerges as a key ingredient: MIR is the most reliable individual strategy, and combinations that include replay (MIR+HAT, MIR+LwF, MIR+LwF+HAT) consistently achieve high final performance with near-zero or mildly positive backward transfer. The optimal configuration is architecture-dependent. 
MIR+HAT yields the best result for ANN and Transformer, whereas MIR+LwF+HAT works best for GRU; in several cases CL methods even surpass joint training, indicating a regularization effect. These findings highlight the importance of jointly selecting backbone architecture and CL mechanism when designing continual intent-classification systems.
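
For readers unfamiliar with the backward transfer metric, the sketch below shows one common way to compute it from a task-by-task accuracy matrix. The abstract does not spell out its exact formula, so the definition used here (final accuracy on each earlier task minus accuracy right after that task was learned, averaged over earlier tasks) is an assumption, and the accuracy values in the usage note are made up.

```python
def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the final task.

    acc[t][i] is the accuracy on task i measured after training on task t
    (a hypothetical T x T matrix; only entries with i <= t are used).
    """
    final = acc[-1]
    return sum(final) / len(final)


def backward_transfer(acc):
    """Average change in accuracy on earlier tasks caused by later training.

    Negative values indicate forgetting; values near zero or above indicate
    that earlier tasks were preserved or even improved.
    """
    n_tasks = len(acc)
    if n_tasks < 2:
        raise ValueError("backward transfer needs at least two tasks")
    deltas = [acc[-1][i] - acc[i][i] for i in range(n_tasks - 1)]
    return sum(deltas) / (n_tasks - 1)
```

For example, with `acc = [[0.9, 0, 0], [0.8, 0.9, 0], [0.7, 0.8, 0.9]]`, accuracy on the first task falls from 0.9 to 0.7 and on the second from 0.9 to 0.8, so the backward transfer is negative, which reads as forgetting.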

[40] Automatic detection of Gen-AI texts: A comparative framework of neural models

Cristian Buttaro,Irene Amerini

Main category: cs.CL

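TL;DR: This paper designs and evaluates four machine-learning detectors of AI-generated text (MLP, 1D-CNN, MobileNet-CNN, Transformer) and compares them with eight widely used online detection tools on multilingual (English/Italian) and topic-specific (Art and Mental Health) datasets; the supervised detectors prove more stable and robust than the commercial tools across languages and domains.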

Details Motivation: The rapid progress of large language models makes human-written and machine-generated text hard to tell apart, creating serious problems in academic, editorial, and social domains and calling for reliable, robust detection of AI-generated text. Method: Implements four supervised neural detectors (MLP, 1D-CNN, MobileNet-CNN, Transformer), trains and tests them on the COLING Multilingual Dataset (English and Italian) and on an original Art and Mental Health dataset, and compares them with eight commercial detectors including ZeroGPT and GPTZero. Result: The supervised detectors are more stable and robust than the commercial tools across languages and domains; performance differs among the models, exposing strengths and limitations of current detection strategies. Conclusion: Custom supervised detectors outperform general-purpose commercial tools for AI-text detection, adapting better to multilingual and specialized-domain settings; future work should focus on generalization and adversarial robustness. Abstract: The rapid proliferation of Large Language Models has significantly increased the difficulty of distinguishing between human-written and AI generated texts, raising critical issues across academic, editorial, and social domains. This paper investigates the problem of AI generated text detection through the design, implementation, and comparative evaluation of multiple machine learning based detectors. Four neural architectures are developed and analyzed: a Multilayer Perceptron, a one-dimensional Convolutional Neural Network, a MobileNet-based CNN, and a Transformer model. The proposed models are benchmarked against widely used online detectors, including ZeroGPT, GPTZero, QuillBot, Originality.AI, Sapling, IsGen, Rephrase, and Writer. Experiments are conducted on the COLING Multilingual Dataset, considering both English and Italian configurations, as well as on an original thematic dataset focused on Art and Mental Health. Results show that supervised detectors achieve more stable and robust performance than commercial tools across different languages and domains, highlighting key strengths and limitations of current detection strategies.

[41] Implicit Grading Bias in Large Language Models: How Writing Style Affects Automated Assessment Across Math, Programming, and Essay Tasks

Rudra Jadhav,Janhavi Danve,Sonalika Shaw

Main category: cs.CL

TL;DR: This paper investigates implicit grading bias in large language models (LLMs) when evaluating student responses, finding that LLaMA 3.3 and Qwen 2.5 significantly penalize writing-style variations (e.g., informal language, non-native phrasing) in Essay/Writing tasks—even when instructed to grade only content correctness—while showing minimal bias in Mathematics and Programming tasks.

Details Motivation: Concerns about fairness and bias in LLMs used as automated graders in education motivate this study, specifically whether LLMs exhibit implicit bias based on writing style despite correct content. Method: A controlled dataset of 180 student responses across three subjects (Math, Programming, Essay/Writing) was created, each with three surface-level perturbations (grammar errors, informal language, non-native phrasing). Two open-source LLMs (LLaMA 3.3 70B and Qwen 2.5 72B) were prompted to grade responses on a 1–10 scale, explicitly instructed to assess only content correctness and ignore writing style. Result: Significant grading bias was found only in Essay/Writing tasks (p < 0.05), with medium-to-very-large effect sizes (Cohen’s d = 0.64–4.25); informal language incurred the largest penalties (−1.90 and −1.20 points), followed by non-native phrasing (−1.35 and −0.90); Math and Programming showed negligible, statistically insignificant bias. Conclusion: LLM grading bias is subject-dependent and style-sensitive, persists despite explicit de-biasing instructions, and necessitates mandatory bias auditing before institutional deployment in educational assessment. Abstract: As large language models (LLMs) are increasingly deployed as automated graders in educational settings, concerns about fairness and bias in their evaluations have become critical. This study investigates whether LLMs exhibit implicit grading bias based on writing style when the underlying content correctness remains constant. We constructed a controlled dataset of 180 student responses across three subjects (Mathematics, Programming, and Essay/Writing), each with three surface-level perturbation types: grammar errors, informal language, and non-native phrasing. 
Two state-of-the-art open-source LLMs -- LLaMA 3.3 70B (Meta) and Qwen 2.5 72B (Alibaba) -- were prompted to grade responses on a 1-10 scale with explicit instructions to evaluate content correctness only and to disregard writing style. Our results reveal statistically significant grading bias in Essay/Writing tasks across both models and all perturbation types (p < 0.05), with effect sizes ranging from medium (Cohen's d = 0.64) to very large (d = 4.25). Informal language received the heaviest penalty, with LLaMA deducting an average of 1.90 points and Qwen deducting 1.20 points on a 10-point scale -- penalties comparable to the difference between a B+ and C+ letter grade. Non-native phrasing was penalized 1.35 and 0.90 points respectively. In sharp contrast, Mathematics and Programming tasks showed minimal bias, with most conditions failing to reach statistical significance. These findings demonstrate that LLM grading bias is subject-dependent, style-sensitive, and persists despite explicit counter-bias instructions in the grading prompt. We discuss implications for equitable deployment of LLM-based grading systems and recommend bias auditing protocols before institutional adoption.

[42] Mi:dm K 2.5 Pro

KT Tech innovation Group

Main category: cs.CL

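TL;DR: Mi:dm K 2.5 Pro is a 32B-parameter flagship Korean LLM aimed at complex enterprise tasks. It combines AST analysis, gap-filling synthesis, Depth Upscaling (DuS), 128K long-context training, and multi-stage post-training (Reasoning SFT, asynchronous RL, and Fusion Training) to improve multi-step reasoning, domain adaptation, and tool use; it reports state-of-the-art results on Korean benchmarks and passes Responsible AI safety evaluations.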

Details Motivation: Existing LLMs fall short on multi-step reasoning, long-context understanding, and agentic workflows in Korean-language and domain-specific enterprise applications, where scaling alone is insufficient. Method: Builds a quality-centric multi-source data pipeline (AST analysis for code, gap-filling synthesis for mathematics, an LLM-based quality evaluator); pre-trains with Depth Upscaling (DuS) and a progressive strategy that supports a 128K context window; post-trains with Reasoning SFT, model merging, and asynchronous reinforcement learning, then applies Fusion Training to rebalance reasoning ability with conversational fluency, consistent response styling, and reliable tool use. Result: Achieves state-of-the-art results on Korean-specific benchmarks and performance competitive with leading global and domestic models; Responsible AI evaluations validate safety against attacks with a balance of harmlessness and responsiveness. Conclusion: Mi:dm K 2.5 Pro supports the effectiveness of a reasoning-first approach for enterprise scenarios and offers a reusable technical path for deploying LLMs in non-English, highly specialized language settings. Abstract: The evolving LLM landscape requires capabilities beyond simple text generation, prioritizing multi-step reasoning, long-context understanding, and agentic workflows. This shift challenges existing models in enterprise environments, especially in Korean-language and domain-specific scenarios where scaling is insufficient. We introduce Mi:dm K 2.5 Pro, a 32B parameter flagship LLM designed to address enterprise-grade complexity through reasoning-focused optimization. Our methodology builds a robust data foundation via a quality-centric curation pipeline utilizing abstract syntax tree (AST) analysis for code, gap-filling synthesis for mathematics, and an LLM-based quality evaluator. Pre-training scales the model via layer-predictor-based Depth Upscaling (DuS) and a progressive strategy supporting a 128K token context window. Post-training introduces a specialized multi-stage pipeline, including Reasoning SFT, model merging, and asynchronous reinforcement learning (RL), to develop complex problem-solving skills. "Fusion Training" then rebalances these capabilities with conversational fluency, consistent response styling, and reliable tool-use. The evaluations show that Mi:dm K 2.5 Pro achieves competitive performance against leading global and domestic models. In addition, it sets state-of-the-art results on Korean-specific benchmarks, showcasing deep linguistic and cultural understanding. Finally, Responsible AI evaluations validate safety against attacks, ensuring a secure profile for deployment with a balance of harmlessness and responsiveness.

[43] Detecting Basic Values in A Noisy Russian Social Media Text Data: A Multi-Stage Classification Framework

Maria Milkova,Maksim Rudnev

Main category: cs.CL

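TL;DR: This study proposes a multi-stage classification framework for detecting human values in noisy Russian-language social media text. Grounded in Schwartz's theory, it combines LLM annotation, soft-label aggregation, and a Transformer model (XLM-RoBERTa large); validated on 7.5 million posts, it reaches an F1 macro of 0.83 and reveals distinctive patterns of value expression in Russian social networks.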

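Details Motivation: To identify abstract, subjective human values (per Schwartz's theory) accurately in noisy, unstructured Russian social media text while handling annotation subjectivity and uncertainty in expert judgments. Method: Builds a multi-stage pipeline: filtering of spam and nonpersonal content, selection of value-relevant and politically relevant posts, repeated annotation by an LLM (such as GPT), aggregation of the multiple LLM judgments into soft labels that reflect the level of agreement, and training of multi-label Transformer models (such as XLM-RoBERTa large) to predict the probability of each of the ten basic values; expert annotations are treated as an interpretative benchmark with its own uncertainty rather than as absolute ground truth. Result: XLM-RoBERTa large reaches an F1 macro of 0.83 and an F1 of 0.71 on the test set; the model systematically overestimates the Openness to Change value domain; the study reveals distinctive patterns of value expression and co-occurrence in Russian social networks; all models are publicly released. Conclusion: Value detection is better framed as a multi-perspective interpretive task in which expert labels, LLM outputs, and model predictions are coherent but not identical readings of the same text; the framework balances subjectivity with scalability and offers a reproducible methodology and empirical basis for value analysis in cross-cultural digital environments. Abstract: This study presents a multi-stage classification framework for detecting human values in noisy Russian language social media, validated on a random sample of 7.5 million public text posts. Drawing on Schwartz's theory of basic human values, we design a multi-stage pipeline that includes spam and nonpersonal content filtering, targeted selection of value relevant and politically relevant posts, LLM based annotation, and multi-label classification. Particular attention is given to verifying the quality of LLM annotations and model predictions against human experts. We treat human expert annotations not as ground truth but as an interpretative benchmark with its own uncertainty. To account for annotation subjectivity, we aggregate multiple LLM generated judgments into soft labels that reflect varying levels of agreement. These labels are then used to train transformer based models capable of predicting the probability of each of the ten basic values. The best performing model, XLM-RoBERTa large, achieves an F1 macro of 0.83 and an F1 of 0.71 on held out test data. By treating value detection as a multi perspective interpretive task, where expert labels, GPT annotations, and model predictions represent coherent but not identical readings of the same texts, we show that the model generally aligns with human judgments but systematically overestimates the Openness to Change value domain. 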
Empirically, the study reveals distinct patterns of value expression and their co-occurrence in Russian social networks, contributing to a broader research agenda on cultural variation, communicative framing, and value based interpretation in digital environments. All models are released publicly.

[44] Why Better Cross-Lingual Alignment Fails for Better Cross-Lingual Transfer: Case of Encoders

Yana Veitsman,Yihong Liu,Hinrich Schütze

Main category: cs.CL

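TL;DR: This paper shows that cross-lingual alignment and downstream task performance pursue mismatched objectives: raising embedding similarity alone does not guarantee better downstream results. Representational analyses confirm that the gradients of the alignment and task losses are close to orthogonal, and the paper closes with practical advice on combining alignment with fine-tuning.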

Details Motivation: Prior work assumes that better cross-lingual alignment yields better cross-lingual transfer, yet explicit alignment methods, while increasing embedding similarity, often fail to improve token-level downstream performance, and the reason has been unclear. Method: Analyzes four XLM-R encoder models aligned on different language pairs and fine-tuned for either POS tagging or sentence classification, using representational analyses that include embedding distances, gradient similarities, and gradient magnitudes for both the task and alignment losses. Result: (1) Embedding distances do not reliably predict changes in task performance; (2) the gradients of the alignment and task losses are often close to orthogonal, so optimizing one objective contributes little to the other. Conclusion: Alignment and downstream objectives are largely orthogonal, and the benefit of alignment varies by language and task, so "better" alignment does not necessarily mean "better" cross-lingual transfer; losses for joint training should be chosen with care. Abstract: Better cross-lingual alignment is often assumed to yield better cross-lingual transfer. However, explicit alignment techniques -- despite increasing embedding similarity -- frequently fail to improve token-level downstream performance. In this work, we show that this mismatch arises because alignment and downstream task objectives are largely orthogonal, and because the downstream benefits from alignment vary substantially across languages and task types. We analyze four XLM-R encoder models aligned on different language pairs and fine-tuned for either POS Tagging or Sentence Classification. Using representational analyses, including embedding distances, gradient similarities, and gradient magnitudes for both task and alignment losses, we find that: (1) embedding distances alone are unreliable predictors of improvements (or degradations) in task performance and (2) alignment and task gradients are often close to orthogonal, indicating that optimizing one objective may contribute little to optimizing the other. Taken together, our findings explain why "better" alignment often fails to translate into "better" cross-lingual transfer. Based on these insights, we provide practical guidelines for combining cross-lingual alignment with task-specific fine-tuning, highlighting the importance of careful loss selection.
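
The "close to orthogonal" finding can be pictured as a cosine similarity near zero between two gradient vectors. The following is a minimal sketch, assuming the gradients of the task loss and the alignment loss have already been collected as per-parameter arrays; the function name and input layout are illustrative, not taken from the paper.

```python
import numpy as np


def gradient_cosine(task_grads, align_grads):
    """Cosine similarity between two gradients given as lists of per-parameter arrays.

    Values near 0 mean the two losses pull the parameters in nearly
    unrelated directions; values near 1 or -1 mean they agree or conflict.
    """
    g_task = np.concatenate([np.ravel(g) for g in task_grads]).astype(float)
    g_align = np.concatenate([np.ravel(g) for g in align_grads]).astype(float)
    denom = np.linalg.norm(g_task) * np.linalg.norm(g_align)
    if denom == 0.0:
        return 0.0  # convention chosen here for an all-zero gradient
    return float(np.dot(g_task, g_align) / denom)
```

Comparing the two norms separately (the "gradient magnitudes" in the abstract) would show which loss dominates a joint update, which the cosine alone does not reveal.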

[45] Evaluating LLM-Generated Lessons from the Language Learning Students' Perspective: A Short Case Study on Duolingo

Carlos Rafael Catalan,Patricia Nicole Monderin,Lheane Marie Dizon,Gap Estrella,Raymund John Sarmimento,Marie Antoinette Patalagsa

Main category: cs.CL

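TL;DR: This paper examines how current language learning applications such as Duolingo fall short on profession-specific content. A survey of employees at a multinational company in the Philippines finds that general scenarios help build foundational language skills, but the lack of domain-specific content hinders professional-level fluency; the authors therefore propose combining personalized, domain-specific lessons with general foundational ones.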

Details Motivation: Existing language learning applications focus mainly on general real-world scenarios and offer little support for profession-specific contexts, which makes professional-level fluency hard to reach. Method: Surveys five employees of a multinational company in the Philippines about their experience with Duolingo, analyzes their feedback on general versus work-related lesson scenarios, and summarizes their suggestions for customized lessons. Result: Respondents encountered general scenarios more often than work-related ones and found them effective for building foundations; work-related scenarios, though rarer, help bridge the gap toward professional fluency; the scenarios that participants suggested diverged widely, underscoring the need for personalization. Conclusion: Language learning applications should combine personalized, domain-specific lesson generation with general foundational lessons to support both basic language skills and professional fluency. Abstract: Popular language learning applications such as Duolingo use large language models (LLMs) to generate lessons for their users. Most lessons focus on general real-world scenarios such as greetings, ordering food, or asking directions, with limited support for profession-specific contexts. This gap can hinder learners from achieving professional-level fluency, which we define as the ability to comfortably communicate various work-related and domain-specific information in the target language. We surveyed five employees from a multinational company in the Philippines on their experiences with Duolingo. Results show that respondents encountered general scenarios more frequently than work-related ones, and that the former are relatable and effective in building foundational grammar, vocabulary, and cultural knowledge. The latter helps bridge the gap toward professional fluency as it contains domain-specific vocabulary. Each participant suggested lesson scenarios that diverge in contexts when analyzed in aggregate. With this understanding, we propose that language learning applications should generate lessons that adapt to an individual's needs through personalized, domain-specific lesson scenarios while maintaining foundational support through general, relatable lesson scenarios.

[46] A Human-in/on-the-Loop Framework for Accessible Text Generation

Lourdes Moreno,Paloma Martínez

Main category: cs.CL

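TL;DR: This paper proposes a hybrid framework that explicitly integrates human participation into LLM-based accessible text generation. It combines Human-in-the-Loop (intervention during generation) with Human-on-the-Loop (supervision after generation) and uses standards-aligned checklists, event-triggered rules, and measurable accessibility KPIs to make accessible text generation traceable, reproducible, and auditable, with explainability and ethical accountability built into the core design.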

Details Motivation: Current automatic text simplification and evaluation pipelines lean too heavily on automated metrics, which do not reflect real user comprehension or normative standards and so cannot guarantee cognitive accessibility. Method: Builds a hybrid human-machine framework in which HiTL guides adjustments during generation and HoTL provides systematic expert review after generation; empirical evidence is turned into three tools: standards-aligned checklists, Event-Condition-Action trigger rules, and accessibility KPIs. Result: The framework yields a traceable, reproducible, and auditable process for generating and evaluating accessible text, supports structured feedback that improves model adaptation, and treats explainability and ethical accountability as core design principles. Conclusion: Embedding the human role in both generation and supervision can improve the quality of accessible text and move NLP systems toward greater transparency, inclusiveness, and ethical accountability. Abstract: Plain Language and Easy-to-Read formats in text simplification are essential for cognitive accessibility. Yet current automatic simplification and evaluation pipelines remain largely automated, metric-driven, and fail to reflect user comprehension or normative standards. This paper introduces a hybrid framework that explicitly integrates human participation into LLM-based accessible text generation. Human-in-the-Loop (HiTL) contributions guide adjustments during generation, while Human-on-the-Loop (HoTL) supervision ensures systematic post-generation review. Empirical evidence from user studies and annotated resources is operationalized into (i) checklists aligned with standards, (ii) Event-Condition-Action trigger rules for activating expert oversight, and (iii) accessibility Key Performance Indicators (KPIs). The framework shows how human-centered mechanisms can be encoded for evaluation and reused to provide structured feedback that improves model adaptation. By embedding the human role in both generation and supervision, it establishes a traceable, reproducible, and auditable process for creating and evaluating accessible texts. In doing so, it integrates explainability and ethical accountability as core design principles, contributing to more transparent and inclusive NLP systems.

[47] Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs

Vedant Pandya

Main category: cs.CL

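TL;DR: This paper proposes XKD-Dial, a training pipeline for English-Hindi knowledge-grounded dialogue generation with explicit citations and explainability. Its four-stage progressive training improves factual accuracy and cross-lingual ability, and the paper systematically analyzes how citation behaviour is learned.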

Details Motivation: Existing knowledge-grounded dialogue systems are mostly English-only, lack verifiable citation mechanisms, offer little transparency into their decisions, and provide weak bilingual support. Method: Proposes a four-stage training pipeline (multilingual adaptation, English SFT with citation grounding, bilingual SFT, and citation-aware GRPO alignment), combined with three post-hoc explainability analyses (cross-attention alignment, Integrated Gradients attribution, and occlusion-based causal grounding), evaluated systematically across several model architectures. Result: Citation-grounded SFT reduces the hallucination rate of encoder-decoder models to 0.0%; progressive training prevents catastrophic forgetting while improving Hindi capabilities; smaller models match larger models on English after SFT; GRPO brings only marginal gains on structured citation tasks. Conclusion: Explicit citation modeling and progressive multi-stage training are key to improving the factuality, explainability, and generalization of bilingual knowledge-grounded dialogue systems, and the explainability analyses show how citation is learned, not only whether it is learned. Abstract: Knowledge-grounded dialogue systems aim to generate informative, contextually relevant responses by conditioning on external knowledge sources. However, most existing approaches focus exclusively on English, lack explicit citation mechanisms for verifying factual claims, and offer limited transparency into model decision-making. We present XKD-Dial, a progressive four-stage training pipeline for explainable, knowledge-grounded dialogue generation in a bilingual (English-Hindi) setting, comprising: (1) multilingual adaptation, (2) English dialogue SFT with citation grounding, (3) bilingual dialogue SFT, and (4) GRPO alignment with citation-aware rewards. We evaluate six models spanning encoder-decoder (250M-3B) and decoder-only (1B-7B) architectures at every pipeline stage. Our key contributions are: (i) three post-hoc explainability analyses - cross-attention alignment, Integrated Gradients attribution, and occlusion-based causal grounding - applied systematically across the training trajectory to reveal how citation behaviour is learned, not only whether it is learned; (ii) citation-grounded SFT reduces hallucination to 0.0% for encoder-decoder models from Stage 2 onward; (iii) the progressive pipeline prevents catastrophic forgetting while improving Hindi capabilities; (iv) smaller models match larger models on English after SFT; and (v) GRPO provides marginal improvement over well-designed SFT for structured citation tasks. We evaluate across six automatic metrics (BLEU, ROUGE, BERTScore, FactScore, Citation-F1, and hallucination rate).

[48] Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought

Xinghao Zhao

Main category: cs.CL

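TL;DR: This paper proposes entropy-trajectory monotonicity as a predictor based on the shape of uncertainty dynamics in chain-of-thought reasoning. A chain whose per-step answer-distribution entropy decreases at every step (a monotone chain) is significantly more likely to end in a correct answer, and this shape feature is more discriminative than magnitude measures such as total entropy reduction. The finding is shown on GSM8K with Qwen2.5-7B-Instruct and replicates on Mistral-7B, at a fraction of the cost of self-consistency.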

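Details Motivation: Chain-of-thought (CoT) reasoning improves LLM accuracy, but there is no cheap, reliable way to detect failures; methods based on total uncertainty (such as total entropy reduction) or single-step confidence work poorly, which motivates looking for structure in how uncertainty evolves. Method: Defines entropy-trajectory monotonicity: at each reasoning step, sample a few answer completions and compute the entropy of the resulting answer distribution; a chain is monotone if the entropy decreases at every step. The study compares the accuracy of monotone and non-monotone chains and analyzes total entropy change, violation counts, calibration error (ECE), and cost efficiency relative to baselines such as self-consistency. Result: On GSM8K, monotone chains reach 68.8% accuracy versus 46.8% for non-monotone chains (+21.9 pp, p=0.0005); total entropy reduction is not correlated with accuracy (ρ=-0.06); 0/1/2 monotonicity violations correspond to 68.8%/50.8%/28.6% accuracy; monotonicity gains +5.8 pp at 73.7% coverage, outperforming scalar confidence baselines at roughly 1,500 tokens per question, one-eighth the cost of 40-chain self-consistency; results replicate on Mistral-7B (+34.7 pp). Conclusion: Structural properties of the uncertainty trajectory, such as monotonicity, predict CoT success more reliably than aggregate measures such as total entropy reduction or average confidence; this shape-over-magnitude principle suggests an efficient, low-overhead way to assess reasoning reliability. Abstract: Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive. We study whether the shape of uncertainty dynamics across reasoning steps--captured by sampling a few answer completions per step--predicts correctness. We introduce entropy-trajectory monotonicity: a chain is monotone if its per-step answer-distribution entropy decreases at every step. On GSM8K (n=300) with Qwen2.5-7B-Instruct, monotone chains achieve 68.8% accuracy vs. 46.8% for non-monotone chains (+21.9 pp; Fisher's p=0.0005; OR=2.50). Critically, total entropy reduction is not predictive (ρ=-0.06, p=0.31), revealing a shape-over-magnitude dissociation: whether entropy decreases at every step matters, not how much. Violation count 0/1/2 gives 68.8%/50.8%/28.6% accuracy. Token log-probability confidence worsens in calibration with step depth (ECE: 0.186->0.312), and monotonicity achieves +5.8 pp at 73.7% coverage, outperforming scalar baselines at approx 1,500 tokens/question--1/8 the cost of 40-chain self-consistency. Results replicate on Mistral-7B (n=300): monotone chains reach 72.3% vs. 37.6% (+34.7 pp; OR=4.33). Structural properties of uncertainty trajectories are thus more informative than aggregate measures.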
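
The monotonicity check described in this entry can be sketched in a few lines. The version below assumes the sampled answers for each reasoning step are already available as lists of strings, uses Shannon entropy in bits over the empirical answer distribution, and counts a tie between consecutive steps as a violation; the last two choices are assumptions, since the abstract says only that entropy must decrease at every step.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (bits) of the empirical distribution over sampled answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def violation_count(step_answers):
    """Number of consecutive step pairs where entropy fails to decrease.

    step_answers[t] is the list of answers sampled after reasoning step t
    (hypothetical input; how completions are sampled is left open).
    """
    entropies = [answer_entropy(a) for a in step_answers]
    return sum(1 for prev, cur in zip(entropies, entropies[1:]) if cur >= prev)


def is_monotone(step_answers):
    """A chain is monotone when entropy decreases at every step."""
    return violation_count(step_answers) == 0
```

A chain whose sampled answers narrow from four distinct values to one would count as monotone here, while a chain whose answers scatter again midway would register one violation.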

[49] RADIUS: Ranking, Distribution, and Significance - A Comprehensive Alignment Suite for Survey Simulation

Weronika Łajewska,Paul Missault,George Davidson,Saab Mansour

Main category: cs.CL

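TL;DR: This paper proposes RADIUS, an evaluation suite for assessing LLM-based survey simulation along two dimensions, ranking alignment and distribution alignment, each backed by statistical significance testing.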

Details Motivation: Existing evaluation metrics for survey simulation are fragmented and non-standardized, and they overlook the critical dimension of ranking alignment, which makes them a poor basis for decision-making applications. Method: Designs RADIUS, a two-dimensional alignment framework covering ranking alignment and distribution alignment, each complemented by statistical significance testing. Result: RADIUS exposes the limitations of existing metrics, makes survey simulation evaluation more comparable and useful, and comes with an open-source implementation for reproducible research. Conclusion: RADIUS offers a more comprehensive, rigorous, and reproducible evaluation standard for survey simulation, especially for decision-making settings that need to preserve the ranking structure of human preferences. Abstract: Simulation of surveys using LLMs is emerging as a powerful application for generating human-like responses at scale. Prior work evaluates survey simulation using metrics borrowed from other domains, which are often ad hoc, fragmented, and non-standardized, leading to results that are difficult to compare. Moreover, existing metrics focus mainly on accuracy or distributional measures, overlooking the critical dimension of ranking alignment. In practice, a simulation can achieve high accuracy while still failing to capture the option most preferred by humans - a distinction that is critical in decision-making applications. We introduce RADIUS, a comprehensive two-dimensional alignment suite for survey simulation that captures: 1) RAnking alignment and 2) DIstribUtion alignment, each complemented by statistical Significance testing. RADIUS highlights the limitations of existing metrics, enables more meaningful evaluation of survey simulation, and provides an open-source implementation for reproducible and comparable assessment.

[50] Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval

Hangeol Chang,Changsun Lee,Seungjoon Rho,Junho Yeo,Jong Chul Ye

Main category: cs.CL

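TL;DR: This paper proposes Hypothesis-Conditioned Query Rewriting (HCQR), a training-free pre-retrieval framework that rewrites retrieval queries around a working hypothesis to improve RAG decision-making on multiple-choice tasks.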

Details Motivation: Existing RAG methods rely on a single initial query that favors topical relevance over decision-relevant evidence, so the retrieved context often fails to discriminate among answer options and does little to help select the final answer. Method: HCQR first derives a lightweight working hypothesis from the question and candidate answers, then turns it into three targeted retrieval queries that seek evidence to (1) support the hypothesis, (2) distinguish it from competing options, and (3) verify salient clues in the question. Result: On MedQA and MMLU-Med, HCQR improves average accuracy over Simple RAG by 5.9 and 3.6 points, respectively, and also outperforms re-rank/filter baselines. Conclusion: HCQR shifts RAG from topic-oriented to evidence-oriented retrieval and improves how well LLMs use external knowledge to decide among options in multiple-choice reasoning tasks. Abstract: Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by grounding generation in external, non-parametric knowledge. However, when a task requires choosing among competing options, simply grounding generation in broadly relevant context is often insufficient to drive the final decision. Existing RAG methods typically rely on a single initial query, which often favors topical relevance over decision-relevant evidence, and therefore retrieves background information that can fail to discriminate among answer options. To address this issue, here we propose Hypothesis-Conditioned Query Rewriting (HCQR), a training-free pre-retrieval framework that reorients RAG from topic-oriented retrieval to evidence-oriented retrieval. HCQR first derives a lightweight working hypothesis from the input question and candidate options, and then rewrites retrieval into three targeted queries that seek evidence to: (1) support the hypothesis, (2) distinguish it from competing alternatives, and (3) verify salient clues in the question. This approach enables context retrieval that is more directly aligned with answer selection, allowing the generator to confirm or overturn the initial hypothesis based on the retrieved evidence. Experiments on MedQA and MMLU-Med show that HCQR consistently outperforms single-query RAG and re-rank/filter baselines, improving average accuracy over Simple RAG by 5.9 and 3.6 points, respectively. Code is available at https://anonymous.4open.science/r/HCQR-1C2E.

[51] What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

Gagan Bhatia,Ahmad Muhammad Isa,Maxime Peyrard,Wei Zhao

Main category: cs.CL

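TL;DR: This paper introduces MultiTempBench, a multi-task temporal reasoning benchmark spanning several languages and calendar systems, and analyzes LLM temporal reasoning under different resource conditions, finding that tokenisation quality (especially fragmentation of temporal tokens) is the key bottleneck in low-resource languages.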

Details Motivation: Existing temporal reasoning benchmarks are mostly limited to English and the Gregorian calendar and do not comprehensively evaluate temporal understanding across languages and calendar systems; the tokenisation of temporal expressions in low-resource languages has also not been studied systematically. Method: Builds MultiTempBench, a multilingual temporal reasoning benchmark of 15,000 examples (5 languages, 3 calendars), designs a multi-faceted evaluation (the mDFR metric, geometric probing, crossed mixed-effects regression), and evaluates 20 LLMs. Result: Fragmentation of temporal tokens severely harms Year/Month/Day separation and reasoning accuracy in low-resource languages and rarer calendars; temporal linearity matters more in high-resource languages, whereas fragmentation dominates in low-resource languages. Conclusion: Temporal reasoning performance depends heavily on how well the underlying tokeniser handles temporal entities; low-resource settings in particular call for targeted improvements to tokenisation and calendar representation rather than only scaling model size or data. Abstract: We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb

[52] MoRI: Learning Motivation-Grounded Reasoning for Scientific Ideation in Large Language Models

Chenyang Gu,Jiahao Cheng,Meicong Zhang,Pujun Zheng,Jinquan Zheng,Guoxiu He

Main category: cs.CL

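TL;DR: This paper proposes MoRI, a framework that improves the quality of scientific idea generation through motivation-grounded reasoning and significantly outperforms existing LLMs and agentic methods.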

Details Motivation: Existing LLM-based agentic approaches to scientific ideation do not adequately model the scientific reasoning process, so their outputs lack technical depth and scientific grounding. Method: MoRI first applies supervised fine-tuning so that the base LLM learns to generate a research motivation from a given context, then trains it with a composite reinforcement learning reward (entropy-aware information gain and contrastive semantic gain) so that its reasoning is both technically detailed and conceptually consistent. Result: MoRI significantly outperforms strong commercial LLMs and complex agentic baselines on several dimensions, including novelty, technical rigor, and feasibility. Conclusion: By explicitly modeling the reasoning path from research motivation to methodology, MoRI improves the depth and credibility of generated scientific ideas and offers a new paradigm for AI-driven research. Abstract: Scientific ideation aims to propose novel solutions within a given scientific context. Existing LLM-based agentic approaches emulate human research workflows, yet inadequately model scientific reasoning, resulting in surface-level conceptual recombinations that lack technical depth and scientific grounding. To address this issue, we propose MoRI (Motivation-grounded Reasoning for Scientific Ideation), a framework that enables LLMs to explicitly learn the reasoning process from research motivations to methodologies. The base LLM is initialized via supervised fine-tuning to generate a research motivation from a given context, and is subsequently trained under a composite reinforcement learning reward that approximates scientific rigor: (1) entropy-aware information gain encourages the model to uncover and elaborate high-complexity technical details grounded in ground-truth methodologies, and (2) contrastive semantic gain constrains the reasoning trajectory to remain conceptually aligned with scientifically valid solutions. Empirical results show that MoRI significantly outperforms strong commercial LLMs and complex agentic baselines across multiple dimensions, including novelty, technical rigor, and feasibility. The code will be made available on GitHub: https://github.com/ECNU-Text-Computing/IdeaGeneration.

[53] Parallelograms Strike Back: LLMs Generate Better Analogies than People

Qiawen Ella Liu,Raja Marjieh,Jian-Qiao Zhu,Adele E. Goldberg,Thomas L. Griffiths

Main category: cs.CL

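TL;DR: This paper compares humans and large language models (LLMs) on four-term analogy completion (A:B::C:D) and finds that LLM-generated analogies fit the classic "parallelogram" geometric model more closely and are rated higher. The advantage comes mainly from LLMs satisfying the relation-preserving constraint more consistently rather than from greater sensitivity to local similarity; human performance is pulled down by a long tail of low-quality responses, and the LLM advantage disappears when only the most frequent responses are compared.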

Details Motivation: To determine why the "parallelogram" geometric model fails for word analogies: is the model itself a poor account, or are people simply not good at generating analogies that satisfy its relational constraint? Method: Compares human and LLM completions on the same analogy problems from Peterson et al. (2020), analyzing human quality ratings, parallelogram alignment (vector geometry in a GloVe embedding space), word frequency, and local similarity, and separating all responses from modal responses. Result: LLM analogies are rated higher, align more closely with the parallelogram structure, and rely less on highly accessible high-frequency words; the advantage comes from fewer weak (long-tail) responses rather than uniformly superior ones, and it disappears when only modal responses are compared, although parallelogram alignment and lower word frequency still predict which LLM responses are rated above human ones. Conclusion: The parallelogram model is not a poor account of analogical relations; rather, humans often fail to produce analogies that satisfy its relational constraint, whereas LLMs do so more consistently, which supports the validity of the geometric model. Abstract: Four-term word analogies (A:B::C:D) are classically modeled geometrically as "parallelograms," yet recent work suggests this model poorly captures how humans produce analogies, with simple local-similarity heuristics often providing a better account (Peterson et al., 2020). But does the parallelogram model fail because it is a bad model of analogical relations, or because people are not very good at generating relation-preserving analogies? We compared human and large language model (LLM) analogy completions on the same set of analogy problems from Peterson et al. (2020). We find that LLM-generated analogies are reliably judged as better than human-generated ones, and are also more closely aligned with the parallelogram structure in a distributional embedding space (GloVe). Crucially, we show that the improvement over human analogies was driven by greater parallelogram alignment and reduced reliance on accessible words rather than enhanced sensitivity to local similarity. Moreover, the LLM advantage is driven not by uniformly superior responses by LLMs, but by humans producing a long tail of weak completions: when only modal (most frequent) responses by both systems are compared, the LLM advantage disappears. However, greater parallelogram alignment and lower word frequency continue to predict which LLM completions are rated higher than those of humans. Overall, these results suggest that the parallelogram model is not a poor account of word analogy. Rather, humans may often fail to produce completions that satisfy this relational constraint, whereas LLMs do so more consistently.
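
For readers unfamiliar with the parallelogram model, the sketch below shows the classic construction: the answer D to A:B::C:D is the word whose vector is closest to C + B - A, and alignment can be measured as the cosine between the two offset vectors B - A and D - C. The three-dimensional vectors in `TOY_VECTORS` are invented for illustration (they are not GloVe values), and the exact alignment measure used in the paper may differ from the `alignment` function here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def complete_analogy(a, b, c, vectors):
    """Return the word closest to c + b - a, excluding the three cue words."""
    target = [vc + vb - va for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def alignment(a, b, c, d, vectors):
    """Cosine between the offsets (b - a) and (d - c); 1.0 is a perfect parallelogram."""
    ab = [y - x for x, y in zip(vectors[a], vectors[b])]
    cd = [y - x for x, y in zip(vectors[c], vectors[d])]
    return cosine(ab, cd)


# Invented toy embeddings, not real GloVe vectors.
TOY_VECTORS = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.1, 0.2, -1.0],
}

if __name__ == "__main__":
    print(complete_analogy("man", "woman", "king", TOY_VECTORS))
    print(alignment("man", "woman", "king", "queen", TOY_VECTORS))
```

A completion with low alignment is one where the C-to-D offset points in a different direction from the A-to-B offset, which is the kind of weak response the abstract attributes to the human long tail.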

[54] A Dataset and Resources for Identifying Patient Health Literacy Information from Clinical Notes

Madeline Bittner,Dina Demner-Fushman,Yasmeen Shabazz,Davis Bartels,Dukyong Yoon,Brad Quitadamo,Rajiv Menghrajani,Leo Celi,Sarvesh Soni

Main category: cs.CL

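TL;DR: This paper introduces HEALIX, the first publicly available health literacy dataset annotated on real clinical notes, and uses it to benchmark zero-shot and few-shot prompting strategies across four open-source large language models.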

Details Motivation: Current health literacy screening tools vary widely in feasibility, number of items, question format, and dimensions covered, which makes consistent documentation in structured electronic health records difficult; automatic detection from unstructured clinical notes is promising but has been limited by the lack of annotated resources. Method: Builds HEALIX by combining social worker note sampling, keyword-based filtering, and LLM-based active learning, annotating 589 clinical notes across 9 note types with three health literacy labels (low, normal, high); benchmarks zero-shot and few-shot prompting strategies on four open-source LLMs. Result: HEALIX is released with 589 annotated notes, 3 labels, and 9 note types, and the zero-shot and few-shot benchmarks provide baselines and resources for follow-up work. Conclusion: HEALIX fills the gap in annotated health literacy data from clinical notes and supports progress toward automated health literacy assessment in real clinical settings. Abstract: Health literacy is a critical determinant of patient outcomes, yet current screening tools are not always feasible and differ considerably in the number of items, question format, and dimensions of health literacy they capture, making documentation in structured electronic health records difficult to achieve. Automated detection from unstructured clinical notes offers a promising alternative, as these notes often contain richer, more contextual health literacy information, but progress has been limited by the lack of annotated resources. We introduce HEALIX, the first publicly available annotated health literacy dataset derived from real clinical notes, curated through a combination of social worker note sampling, keyword-based filtering, and LLM-based active learning. HEALIX contains 589 notes across 9 note types, annotated with three health literacy labels: low, normal, and high. To demonstrate its utility, we benchmarked zero-shot and few-shot prompting strategies across four open-source large language models (LLMs).

[55] DaPT: A Dual-Path Framework for Multilingual Multi-hop Question Answering

Yilin Wang,Yuchun Fan,Jiaoyang Li,Ziming Zhu,Yongyu Mu,Qiaozhi He,Tong Xiao,Jingbo Zhu

Main category: cs.CL

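TL;DR: This paper proposes DaPT, a framework that addresses the performance imbalance of RAG systems in multilingual multi-hop question answering (MM-hop QA) by building multilingual benchmarks and combining parallel sub-question graph generation with a bilingual retrieval-and-answer strategy, substantially improving accuracy.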

Details Motivation: RAG systems lack evaluation benchmarks for multilingual multi-hop QA and rely too heavily on LLMs' semantic understanding in English, which degrades their effectiveness in multilingual settings. Method: Builds multilingual multi-hop QA benchmarks in five languages and proposes DaPT, which generates sub-question graphs in parallel for the source-language query and its English translation, merges them, and then solves the sub-questions sequentially with a bilingual retrieval-and-answer strategy. Result: DaPT consistently outperforms the baselines; on MuSiQue, the most challenging benchmark, it achieves an 18.3% relative improvement in average EM over the strongest baseline, and it mitigates the performance imbalance across languages. Conclusion: DaPT yields more accurate and more concise answers from RAG systems in multilingual multi-hop QA and provides a practical benchmark and a new direction for cross-lingual RAG research. Abstract: Retrieval-augmented generation (RAG) systems have made significant progress in solving complex multi-hop question answering (QA) tasks in the English scenario. However, RAG systems inevitably face the application scenario of retrieving across multilingual corpora and queries, leaving several open challenges. The first one involves the absence of benchmarks that assess RAG systems' capabilities under the multilingual multi-hop (MM-hop) QA setting. The second centers on the overreliance on LLMs' strong semantic understanding in English, which diminishes effectiveness in multilingual scenarios. To address these challenges, we first construct multilingual multi-hop QA benchmarks by translating English-only benchmarks into five languages, and then we propose DaPT, a novel multilingual RAG framework. DaPT generates sub-question graphs in parallel for both the source-language query and its English translation counterpart, then merges them before employing a bilingual retrieval-and-answer strategy to sequentially solve sub-questions. Our experimental results demonstrate that advanced RAG systems suffer from a significant performance imbalance in multilingual scenarios. Furthermore, our proposed method consistently yields more accurate and concise answers compared to the baselines, significantly enhancing RAG performance on this task. For instance, on the most challenging MuSiQue benchmark, DaPT achieves a relative improvement of 18.3% in average EM score over the strongest baseline.

[56] UGID: Unified Graph Isomorphism for Debiasing Large Language Models

Zikang Ding,Junchi Yao,Junhao Li,Yi Zhang,Wenbo Jiang,Hongbo Liu,Lijie Hu

Main category: cs.CL

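TL;DR: This paper proposes UGID, a framework that models the Transformer as a computational graph and jointly constrains attention routing and hidden states at the internal-representation level to debias large language models while preserving their capabilities.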

Details Motivation: Large language models exhibit pronounced social biases that output-level or data-optimization methods cannot fully remove, because the biases are embedded in the models' internal representations. Method: Proposes Unified Graph Isomorphism for Debiasing (UGID), which models the Transformer as a structured computational graph (attention mechanisms as edges, hidden states as nodes) and enforces invariance of the graph structure across counterfactual inputs; it jointly constrains attention routing and hidden representations in bias-sensitive regions and adds a log-space constraint on sensitive logits plus a selective anchor-based objective to preserve definitional semantics. Result: Experiments on several LLMs show that UGID reduces bias in both in-distribution and out-of-distribution settings, significantly reduces internal structural discrepancies, and preserves model safety and utility. Conclusion: UGID is an effective internal-representation-level debiasing framework that balances bias reduction with capability preservation and offers a new direction for LLM fairness research. Abstract: Large language models (LLMs) exhibit pronounced social biases. Output-level or data-optimization-based debiasing methods cannot fully resolve these biases, and many prior works have shown that biases are embedded in internal representations. We propose Unified Graph Isomorphism for Debiasing large language models (UGID), an internal-representation-level debiasing framework for large language models that models the Transformer as a structured computational graph, where attention mechanisms define the routing edges of the graph and hidden states define the graph nodes. Specifically, debiasing is formulated as enforcing invariance of the graph structure across counterfactual inputs, with differences allowed only on sensitive attributes. UGID jointly constrains attention routing and hidden representations in bias-sensitive regions, effectively preventing bias migration across architectural components. To achieve effective behavioral alignment without degrading general capabilities, we introduce a log-space constraint on sensitive logits and a selective anchor-based objective to preserve definitional semantics. Extensive experiments on large language models demonstrate that UGID effectively reduces bias under both in-distribution and out-of-distribution settings, significantly reduces internal structural discrepancies, and preserves model safety and utility.

[57] Optimal Splitting of Language Models from Mixtures to Specialized Domains

Skyler Seto,Pierre Ablin,Anastasiia Filippova,Jiayuan Ye,Louis Bethune,Angelos Katharopoulos,David Grangier

Main category: cs.CL

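TL;DR: This paper proposes a scaling-law-based method for allocating compute between pretraining and specialization when training multiple models; it accurately predicts the loss of models of different sizes at different token counts and improves performance on common sense knowledge and reasoning benchmarks.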

Details Motivation: Language models are usually trained in two stages (general pretraining followed by domain specialization); in the multi-domain setting this means training a separate model for each domain (split model training), and the allocation of compute between the stages lacks principled guidance. Method: Pretrains multiple models independently on a general corpus and uses scaling laws to choose the compute allocation between pretraining and continued pretraining (specialization); the approach predicts the loss for a model of size N with D pretraining tokens and D' specialization tokens and extrapolates to larger models and more tokens. Result: The method consistently improves performance on common sense knowledge and reasoning benchmarks across model sizes and compute budgets. Conclusion: Scaling-law-guided compute allocation gives a predictable and extensible recipe for efficiently training language models across multiple domains. Abstract: Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a two-stage paradigm: pretraining first on the full corpus of data followed by specialization on a subset of high-quality, specialized data from the full corpus. In the multi-domain setting, this involves continued pretraining of multiple models on each specialized domain, referred to as split model training. We propose a method for pretraining multiple models independently over a general pretraining corpus, and determining the optimal compute allocation between pretraining and continued pretraining using scaling laws. Our approach accurately predicts the loss of a model of size N with D pretraining and D' specialization tokens, and extrapolates to larger model sizes and number of tokens. Applied to language model training, our approach consistently improves performance on common sense knowledge and reasoning benchmarks across different model sizes and compute budgets.

[58] VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

Chonghan Liu,Yimin Du,Qi An,Xin He,Cunqi Zhai,Fei Tan,Weijia Lin,Xiaochun Gong,Yongchao Deng,Shousheng Jia,Xiangzheng Zhang

Main category: cs.CL

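TL;DR: This paper proposes Variable Entropy Policy Optimization (VEPO), which uses reinforcement learning with verifiable rewards to impose deterministic structural constraints during training and improve translation performance for low-resource languages.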

Details Motivation: Large language models perform poorly on low-resource languages, mainly because of inefficient subword segmentation and imbalanced training data. Method: Proposes Variable Entropy Policy Optimization (VEPO), which combines reinforcement learning with verifiable rewards, a variable entropy mechanism, entropy-tempered advantage estimation, and asymmetric clipping to enforce structural constraints on sequence length, format consistency, and linguistic well-formedness. Result: Experiments across 90 FLORES-200 translation directions, scored with COMET-22 and chrF, show that VEPO substantially improves tokenization efficiency and translation quality and narrows the performance gap for low-resource languages. Conclusion: By dynamically balancing literal fidelity against semantic naturalness while guaranteeing structural constraints, VEPO provides a robust and efficient training framework for low-resource machine translation. Abstract: Large language models frequently exhibit suboptimal performance on low-resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework ensures prescribed sequence length, robust format consistency, and rigorous linguistic well-formedness, all enforced during training. Central to our approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration-exploitation manifold. By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200 directions, measured with COMET-22 and chrF, demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.
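
The VEPO abstract mentions asymmetric clipping without giving a formula. As a rough illustration of the general idea only, the sketch below shows a PPO-style clipped surrogate whose lower and upper clip bounds differ, so that probability increases are allowed slightly more room than decreases. The function name `clipped_surrogate` and the values `eps_low=0.2` and `eps_high=0.28` are assumptions chosen for the example, and the entropy-tempered advantage estimation described in the abstract is not reproduced.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token PPO-style objective with separate lower and upper clip bounds.

    ratio:     new-policy probability divided by old-policy probability
    advantage: advantage estimate for the sampled token
    eps_low, eps_high: illustrative clip widths (assumed, not from the paper)
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Taking the minimum keeps the pessimistic bound, as in standard PPO.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Positive advantage: gains are capped at the upper bound 1 + eps_high.
    print(clipped_surrogate(1.5, 1.0))
    # Negative advantage: the ratio is floored at 1 - eps_low.
    print(clipped_surrogate(0.5, -1.0))
```

With symmetric clipping at 0.2, a ratio of 1.25 with positive advantage would already be clipped; with the wider upper bound it is not, which is one common way to sustain exploration.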

[59] Evaluating Counterfactual Strategic Reasoning in Large Language Models

Dimitrios Georgousis,Maria Lymperaiou,Angeliki Dimitriou,Giorgos Filandrianos,Giorgos Stamou

Main category: cs.CL

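TL;DR: This paper evaluates the strategic behavior of large language models (LLMs) in repeated games, introducing counterfactual variants of Prisoner's Dilemma and Rock-Paper-Scissors that alter payoff structures and action labels to test whether the models reason strategically or rely on memorized patterns; the results show clear limitations in incentive sensitivity, structural generalization, and strategic reasoning in counterfactual environments.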

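Details Motivation: To determine whether the strategic performance LLMs show on game-theoretic tasks reflects genuine reasoning or merely patterns memorized from training data. Method: Builds counterfactual variants of two classic games (Prisoner's Dilemma and Rock-Paper-Scissors) by modifying payoff structures and action labels, and designs a multi-metric evaluation framework that compares LLM behavior in the default and counterfactual settings. Result: LLMs show marked weaknesses in counterfactual environments, including insensitivity to changed incentives, difficulty generalizing across structures, and a lack of robust strategic reasoning. Conclusion: The strategic behavior of current LLMs more likely stems from surface pattern matching than from deep game-theoretic reasoning, which points to limitations on tasks that require abstract modeling and counterfactual thinking. Abstract: We evaluate Large Language Models (LLMs) in repeated game-theoretic settings to assess whether strategic performance reflects genuine reasoning or reliance on memorized patterns. We consider two canonical games, Prisoner's Dilemma (PD) and Rock-Paper-Scissors (RPS), upon which we introduce counterfactual variants that alter payoff structures and action labels, breaking familiar symmetries and dominance relations. Our multi-metric evaluation framework compares default and counterfactual instantiations, showcasing LLM limitations in incentive sensitivity, structural generalization and strategic reasoning within counterfactual environments.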
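
To make "breaking familiar dominance relations" concrete, the sketch below checks for a strictly dominant action in a default Prisoner's Dilemma and in an altered variant. The payoff numbers and the `COUNTERFACTUAL_PD` and `relabel` constructions are illustrative assumptions; the abstract does not list the paper's actual counterfactual payoffs or labels.

```python
def dominant_action(payoffs, actions):
    """Return the row player's strictly dominant action, or None if there is none.

    payoffs[(mine, theirs)] is the row player's payoff.
    """
    for a in actions:
        beats_all = all(
            payoffs[(a, other)] > payoffs[(b, other)]
            for b in actions if b != a
            for other in actions
        )
        if beats_all:
            return a
    return None


def relabel(payoffs, mapping):
    """Rename actions without changing the underlying incentives."""
    return {(mapping[m], mapping[t]): v for (m, t), v in payoffs.items()}


# Standard Prisoner's Dilemma: defecting (D) strictly dominates.
DEFAULT_PD = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

# Illustrative counterfactual: payoffs altered so cooperating (C) dominates.
COUNTERFACTUAL_PD = {("C", "C"): 5, ("C", "D"): 1, ("D", "C"): 3, ("D", "D"): 0}

if __name__ == "__main__":
    print(dominant_action(DEFAULT_PD, ["C", "D"]))
    print(dominant_action(COUNTERFACTUAL_PD, ["C", "D"]))
    swapped = relabel(DEFAULT_PD, {"C": "X", "D": "Y"})
    print(dominant_action(swapped, ["X", "Y"]))
```

A model that keeps playing the action labeled "defect" after the payoffs change, or that is thrown off by neutral labels, would be following the memorized pattern rather than the incentives.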

[60] Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Zhuolin Yang,Zihan Liu,Yang Chen,Wenliang Dai,Boxin Wang,Sheng-Chieh Lin,Chankyu Lee,Yangyi Chen,Dongfu Jiang,Jiafan He,Renjie Pi,Grace Lam,Nayeon Lee,Alexander Bukharin,Mohammad Shoeybi,Bryan Catanzaro,Wei Ping

Main category: cs.CL

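TL;DR: Nemotron-Cascade 2 is an open 30B MoE model with only 3B activated parameters that approaches frontier-level mathematical and coding reasoning and offers strong agentic capabilities, reaching gold-medal-level performance at the IMO, IOI, and ICPC World Finals and rivaling much larger models with 20x fewer parameters.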

Details Motivation: To improve the reasoning and agentic capabilities of small MoE models, achieve high intelligence density under a tight parameter budget, and advance efficient open models. Method: After supervised fine-tuning (SFT) on a carefully curated dataset, Cascade RL is substantially expanded to cover many reasoning and agentic domains, and multi-domain on-policy distillation is introduced, using the strongest intermediate teacher model for each domain to transfer knowledge. Result: The model reaches gold-medal-level performance at the IMO, IOI, and ICPC World Finals; its mathematical and coding reasoning approaches that of frontier open models; only 3B of its 30B total parameters are activated, and it uses 20x fewer parameters than the comparable prior model. Conclusion: Nemotron-Cascade 2 shows that, with advanced RL and distillation strategies, a compact MoE architecture can support strong general reasoning and agentic capabilities, setting a new reference point for efficient open models. Abstract: We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoints and training data.

[61] F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Ziyin Zhang,Zihan Liao,Hang Yu,Peng Di,Rui Wang

Main category: cs.CL

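TL;DR: F2LLM-v2 is a family of multilingual embedding models in 8 sizes from 80M to 14B that supports more than 200 languages (with emphasis on mid- and low-resource languages), improves efficiency through two-stage LLM-based training, matryoshka learning, pruning, and knowledge distillation, leads on MTEB benchmarks, and is fully open-sourced.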

Details Motivation: To address the shortcomings of existing LLM-based embedding models in efficiency and in support for mid- and low-resource languages, and to advance open, efficient, multilingual embedding models. Method: Uses a two-stage LLM-based embedding training pipeline combined with matryoshka learning, model pruning, and knowledge distillation, trained on 60 million high-quality, publicly available multilingual samples. Result: F2LLM-v2-14B ranks first on 11 MTEB benchmarks, and the smaller models set a new state of the art for resource-constrained scenarios. Conclusion: F2LLM-v2 retains strong performance while substantially improving efficiency and language coverage, making it an advanced open-source embedding family for both deployment and research. Abstract: We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
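
Matryoshka learning, mentioned in the F2LLM-v2 abstract, trains embeddings so that a prefix of the vector is itself a usable, smaller embedding. The sketch below shows the usual way such an embedding is consumed: keep the first `dim` components and L2-normalize the result. Whether F2LLM-v2 expects exactly this post-processing is an assumption, and `truncate_embedding` is a hypothetical helper rather than part of the released code.

```python
import math


def truncate_embedding(vector, dim):
    """Keep the first `dim` components of an embedding and L2-normalize them."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in head]


if __name__ == "__main__":
    full = [3.0, 4.0, 12.0]  # toy 3-dimensional "embedding"
    small = truncate_embedding(full, 2)
    print(small)  # unit-length 2-dimensional prefix
```

Truncation trades a little retrieval quality for smaller indexes and faster similarity search, which is why it pairs naturally with the resource-constrained use cases the abstract highlights.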

cs.CV [Back]

[62] RARE disease detection from Capsule Endoscopic Videos based on Vision Transformers

X. Gao,C. Chien,G. Liu,A. Manullang

Main category: cs.CV

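TL;DR: This paper fine-tunes a Transformer-based deep learning model (Google Vision Transformer) for multi-label classification of capsule endoscopic videos (CEV) over 17 anatomical and pathological labels, but mAP@0.5 and mAP@0.95 on the test set are both very low (about 0.02), indicating poor performance.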

Details Motivation: To tackle multi-label classification of capsule endoscopic videos in support of automated diagnosis of gastrointestinal disease. Method: Fine-tunes Google's Vision Transformer (ViT, reported as "batch16") at 224 x 224 resolution for multi-label classification of 17 anatomical structures and lesions. Result: On three test videos, mAP@0.5 is 0.0205 and mAP@0.95 is 0.0196, which is markedly low. Conclusion: The current ViT fine-tuning setup performs poorly on this CEV multi-label task and needs further work on model architecture, data preprocessing, or labeling strategy. Abstract: This work addresses the Gastro Competition on multi-label classification from capsule endoscopic videos (CEV). A Transformer-based deep learning network is fine-tuned for this task. The baseline model is Google Vision Transformer (ViT) batch16 with 224 x 224 resolution. In total, 17 labels are classified, which are mouth, esophagus, stomach, small intestine, colon, z-line, pylorus, ileocecal valve, active bleeding, angiectasia, blood, erosion, erythema, hematin, lymphangioectasis, polyp, and ulcer. On the test dataset of three videos, the overall mAP @0.5 is 0.0205 whereas the overall mAP @0.95 is 0.0196.

[63] S3T-Former: A Purely Spike-Driven State-Space Topology Transformer for Skeleton Action Recognition

Naichuan Zheng,Hailun Xia,Zepeng Sun,Weiyi Li,Yujia Wang

Main category: cs.CV

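TL;DR: This paper proposes S3T-Former, the first purely spike-driven Transformer architecture for energy-efficient skeleton action recognition; with Multi-Stream Anatomical Spiking Embedding (M-ASE), Lateral Spiking Topology Routing (LSTR), and a Spiking State-Space (S3) Engine, it addresses short-term amnesia while staying highly sparse, substantially reducing energy consumption and reaching state-of-the-art performance.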

Details Motivation: Existing skeleton action recognition methods based on spiking neural networks (SNNs) sacrifice the intrinsic sparsity of SNNs by relying on dense matrix aggregation, multimodal fusion, or non-sparse frequency-domain transforms, and they suffer from the short-term amnesia of spiking neurons, which makes deployment on resource-constrained edge devices difficult. Method: Proposes the Spiking State-Space Topology Transformer (S3T-Former), comprising (1) Multi-Stream Anatomical Spiking Embedding (M-ASE), a generalized kinematic differential operator that converts multimodal skeleton features into heterogeneous sparse spike streams; (2) Lateral Spiking Topology Routing (LSTR) for on-demand conditional spike propagation; and (3) a Spiking State-Space (S3) Engine that models long-range temporal dynamics without non-sparse spectral transforms. Result: Experiments on multiple large-scale datasets show that S3T-Former achieves highly competitive accuracy with theoretical energy consumption well below that of conventional ANNs, establishing a new state of the art for energy-efficient neuromorphic action recognition. Conclusion: S3T-Former is the first purely spike-driven Transformer architecture that is truly sparse in space and time and retains long-range memory, offering a new paradigm for low-power skeleton action recognition at the edge. Abstract: Skeleton-based action recognition is crucial for multimedia applications but heavily relies on power-hungry Artificial Neural Networks (ANNs), limiting their deployment on resource-constrained edge devices. Spiking Neural Networks (SNNs) provide an energy-efficient alternative; however, existing spiking models for skeleton data often compromise the intrinsic sparsity of SNNs by resorting to dense matrix aggregations, heavy multimodal fusion modules, or non-sparse frequency domain transformations. Furthermore, they severely suffer from the short-term amnesia of spiking neurons. In this paper, we propose the Spiking State-Space Topology Transformer (S3T-Former), which, to the best of our knowledge, is the first purely spike-driven Transformer architecture specifically designed for energy-efficient skeleton action recognition. Rather than relying on heavy fusion overhead, we formulate a Multi-Stream Anatomical Spiking Embedding (M-ASE) that acts as a generalized kinematic differential operator, elegantly transforming multimodal skeleton features into heterogeneous, highly sparse event streams. To achieve true topological and temporal sparsity, we introduce Lateral Spiking Topology Routing (LSTR) for on-demand conditional spike propagation, and a Spiking State-Space (S3) Engine to systematically capture long-range temporal dynamics without non-sparse spectral workarounds. Extensive experiments on multiple large-scale datasets demonstrate that S3T-Former achieves highly competitive accuracy while theoretically reducing energy consumption compared to classic ANNs, establishing a new state-of-the-art for energy-efficient neuromorphic action recognition.

[64] DarkDriving: A Real-World Day and Night Aligned Dataset for Autonomous Driving in the Dark Environment

Wuqi Wang,Haochen Yang,Baolu Li,Jiaqi Sun,Xiangmo Zhao,Zhigang Xu,Qing Guo,Haigen Min,Tianyun Zhang,Hongkai Yu

Main category: cs.CV

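TL;DR: This paper presents DarkDriving, the first real-world day-night aligned benchmark dataset for low-light enhancement in autonomous driving; using a trajectory-tracking-based pose matching method in a large closed test field, it collects 9,538 precisely aligned day-night image pairs with 2D bounding box annotations, supporting low-light enhancement and 2D/3D detection tasks.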

Details Motivation: Existing low-light enhancement datasets are mostly limited to small-range exposure adjustment or static scenes, and precisely aligned day-night image pairs are hard to obtain in real driving scenes, which severely limits research in this area. Method: Proposes Trajectory Tracking based Pose Matching (TTPM) to automatically collect and precisely align day and night images in a 69-acre closed test field, manually annotates 2D bounding boxes, and defines four perception-related tasks. Result: Builds the DarkDriving dataset of 9,538 precisely aligned day-night image pairs (alignment error of only a few centimeters) and validates its usefulness for low-light enhancement and detection, as well as its generalization to other datasets such as nuScenes. Conclusion: DarkDriving is the first day-night aligned low-light enhancement benchmark for real, dynamic driving scenes; it provides a comprehensive and reliable evaluation platform for nighttime perception in autonomous driving and generalizes well. Abstract: Low-light conditions are challenging for vision-centric perception systems for autonomous driving in the dark environment. In this paper, we propose a new benchmark dataset (named DarkDriving) to investigate low-light enhancement for autonomous driving. Existing real-world low-light enhancement benchmark datasets can be collected by controlling various exposures only in small-range, static scenes. The dark images of current nighttime driving datasets do not have precisely aligned daytime counterparts. The extreme difficulty of collecting a real-world day and night aligned dataset in dynamic driving scenes has significantly limited research in this area. With a proposed automatic day-night Trajectory Tracking based Pose Matching (TTPM) method in a large real-world closed driving test field (area: 69 acres), we collected the first real-world day and night aligned dataset for autonomous driving in the dark environment. The DarkDriving dataset has 9,538 day and night image pairs precisely aligned in location and spatial contents, with an alignment error of just several centimeters. For each pair, we also manually label the 2D object bounding boxes. DarkDriving introduces four perception-related tasks, including low-light enhancement, generalized low-light enhancement, and low-light enhancement for 2D detection and 3D detection of autonomous driving in the dark environment. The experimental results show that our DarkDriving dataset provides a comprehensive benchmark for evaluating low-light enhancement for autonomous driving and that it can also be generalized to enhance dark images and promote detection in other low-light driving environments, such as nuScenes.

[65] SSP-SAM: SAM with Semantic-Spatial Prompt for Referring Expression Segmentation

Wei Tang,Xuejing Liu,Yanpeng Sun,Zechao Li

Main category: cs.CV

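TL;DR: This paper proposes SSP-SAM, a framework that adds a Semantic-Spatial Prompt (SSP) encoder to strengthen SAM's understanding of natural language, enabling accurate and robust text-guided segmentation on Referring Expression Segmentation (RES) and Generalized RES (GRES) tasks.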

Details Motivation: SAM excels at general image segmentation but lacks natural language understanding, so it cannot be applied directly to Referring Expression Segmentation (RES); existing methods also struggle to handle both RES and the more flexible Generalized RES (GRES) setting. Method: Proposes SSP-SAM, whose Semantic-Spatial Prompt (SSP) encoder incorporates visual and linguistic attention adapters to jointly model image and text features and produce high-quality prompts that guide SAM toward precise masks. Result: The method clearly outperforms existing approaches on mainstream RES and GRES benchmarks, keeps high precision at strict thresholds such as Pr@0.9, and performs better in the open-vocabulary PhraseCut setting. Conclusion: SSP-SAM bridges SAM's strong segmentation ability and the need for language understanding, supports GRES without additional modifications, and offers a general and efficient approach to text-driven segmentation. Abstract: The Segment Anything Model (SAM) excels at general image segmentation but has limited ability to understand natural language, which restricts its direct application in Referring Expression Segmentation (RES). Toward this end, we propose SSP-SAM, a framework that fully utilizes SAM's segmentation capabilities by integrating a Semantic-Spatial Prompt (SSP) encoder. Specifically, we incorporate both visual and linguistic attention adapters into the SSP encoder, which highlight salient objects within the visual features and discriminative phrases within the linguistic features. This design enhances the referent representation for the prompt generator, resulting in high-quality SSPs that enable SAM to generate precise masks guided by language. Although not specifically designed for Generalized RES (GRES), where the referent may correspond to zero, one, or multiple objects, SSP-SAM naturally supports this more flexible setting without additional modifications. Extensive experiments on widely used RES and GRES benchmarks confirm the superiority of our method. Notably, our approach generates segmentation masks of high quality, achieving strong precision even at strict thresholds such as Pr@0.9. Further evaluation on the PhraseCut dataset demonstrates improved performance in open-vocabulary scenarios compared to existing state-of-the-art RES methods. The code and checkpoints are available at: https://github.com/WayneTomas/SSP-SAM.

[66] CytoSyn: a Foundation Diffusion Model for Histopathology -- Tech Report

Thomas Duboudin,Xavier Fontaine,Etienne Andrier,Lionel Guillou,Alexandre Filiot,Thalyssa Baiocco-Rodrigues,Antoine Olivier,Alberto Romagnoni,John Klein,Jean-Baptiste Schiratti

Main category: cs.CV

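TL;DR: This paper introduces CytoSyn, a generative foundation latent diffusion model for histopathology that produces highly realistic and diverse H&E-stained images; methodological improvements, dataset scaling, and sampling optimization yield the upgraded CytoSyn-v2, which is compared in depth with PixCell, and the work highlights the strong effect of preprocessing (such as JPEG compression) on diffusion model performance.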

Details Motivation: Computational pathology has many self-supervised feature extractors, but generative foundation models built specifically for histopathology remain scarce, leaving tasks such as virtual staining poorly supported. Method: Proposes the latent diffusion model CytoSyn and the improved CytoSyn-v2, studies methodological improvements, training set scaling, sampling strategies, and slide-level overfitting, and compares the models in depth with PixCell, with particular attention to how preprocessing details such as JPEG compression affect both the models and the evaluation metrics. Result: Trained on large-scale TCGA data (more than 10,000 whole-slide images across 32 cancer types), the model reaches state-of-the-art performance on histopathology image generation and generalizes to inflammatory bowel disease images; model weights, datasets, and synthetic samples are publicly released. Conclusion: The CytoSyn models fill the gap in generative foundation models for histopathology, demonstrate high-quality generation and cross-disease generalization, and show that preprocessing details strongly affect evaluation results, providing a useful baseline and resources for future work. Abstract: Computational pathology has made significant progress in recent years, fueling advances in both fundamental disease understanding and clinically ready tools. This evolution is driven by the availability of large amounts of digitized slides and specialized deep learning methods and models. Multiple self-supervised foundation feature extractors have been developed, enabling downstream predictive applications from cell segmentation to tumor sub-typing and survival analysis. In contrast, generative foundation models designed specifically for histopathology remain scarce. Such models could address tasks that are beyond the capabilities of feature extractors, such as virtual staining. In this paper, we introduce CytoSyn, a state-of-the-art foundation latent diffusion model that enables the guided generation of highly realistic and diverse histopathology H&E-stained images, as shown in an extensive benchmark. We explored methodological improvements, training set scaling, sampling strategies and slide-level overfitting, culminating in the improved CytoSyn-v2, and compared our work to PixCell, a state-of-the-art model, in an in-depth manner. This comparison highlighted the strong sensitivity of both diffusion models and performance metrics to preprocessing-specific details such as JPEG compression. Our model has been trained on a dataset obtained from more than 10,000 TCGA diagnostic whole-slide images of 32 different cancer types. Despite being trained only on oncology slides, it maintains state-of-the-art performance generating inflammatory bowel disease images. To support the research community, we publicly release CytoSyn's weights, its training and validation datasets, and a sample of synthetic images in this repository: https://huggingface.co/Owkin-Bioptimus/CytoSyn.

[67] Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model

Chen Zhao,Zhuoran Wang,Haoyang Li,Shifeng Bao,Guanlin Li,Youhe Feng,Yang Li,Jie Tang,Jing Zhang

Main category: cs.CV

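TL;DR: This paper proposes Action-Draft-and-Verify (ADV), which combines diffusion-based action generation with single-forward-pass reranking by a VLM and substantially improves success rates on simulated and real-world tasks.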

Details Motivation: Diffusion action experts are efficient and precise, but the auto-regressive paradigm offers stronger robustness and generalization in out-of-distribution environments, so the two should be combined. Method: In ADV, a diffusion action expert first drafts multiple candidate action chunks, and the vision-language model (VLM) then scores all candidates in a single forward pass with a perplexity-style metric and selects the best one. Result: Success rate improves by +4.3 points in simulation and +19.7 points in the real world, with only the overhead of a single-pass VLM reranking. Conclusion: ADV combines the strengths of the diffusion and auto-regressive paradigms, improving the performance and robustness of VLA models in deployment while remaining efficient. Abstract: Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, while auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): a diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in the real world over a diffusion-based baseline, with a single-pass VLM reranking overhead.
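
The verify step of ADV can be pictured as picking the candidate whose tokens the VLM finds least surprising. The sketch below assumes the per-token log-probabilities for each candidate have already been obtained (in the paper, from one batched VLM forward pass) and selects the candidate with the lowest perplexity. How continuous action chunks are turned into tokens for scoring is not described in the abstract, so the inputs here are placeholders, and `perplexity` and `select_action_chunk` are hypothetical names.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_action_chunk(candidates, logprobs_per_candidate):
    """Return (best_candidate, its_perplexity); lower perplexity wins."""
    if len(candidates) != len(logprobs_per_candidate):
        raise ValueError("one log-probability list is required per candidate")
    scores = [perplexity(lp) for lp in logprobs_per_candidate]
    best = min(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


if __name__ == "__main__":
    # Placeholder candidates and made-up log-probabilities.
    drafts = ["chunk_a", "chunk_b", "chunk_c"]
    logprobs = [[-2.0, -2.0], [-0.5, -0.5], [-1.0, -3.0]]
    chosen, score = select_action_chunk(drafts, logprobs)
    print(chosen, score)
```

Because scoring only needs likelihoods of already drafted candidates, it avoids token-by-token auto-regressive decoding, which is consistent with the single-pass overhead the abstract reports.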

[68] One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control

Haoxiang Rao,Zhao Wang,Chenyang Si,Yan Lyu,Yuanyi Duan,Fang Zhao,Caifeng Shan

Main category: cs.CV

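TL;DR: This paper proposes O2MAG, a training-free few-shot industrial anomaly generation method that uses the self-attention of a single reference anomalous image to synthesize more realistic anomalies; combined with an anomaly mask, Anomaly-Guided Optimization, and Dual-Attention Enhancement, it significantly improves downstream anomaly detection.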

Details Motivation: Anomalous samples are scarce in industrial anomaly detection, and existing few-shot anomaly synthesis methods are slow to train and struggle to reproduce the real anomaly distribution, which limits detection performance. Method: O2MAG starts from a single reference anomalous image and controls three parallel diffusion processes via self-attention grafting; an anomaly mask mitigates foreground-background confusion; Anomaly-Guided Optimization aligns the text prompt with the true anomaly semantics; Dual-Attention Enhancement reinforces self- and cross-attention in masked regions. Result: Outperforms prior state-of-the-art methods on downstream anomaly detection tasks. Conclusion: O2MAG is an efficient, training-free few-shot anomaly generation framework that produces more realistic, text-consistent anomalous images and improves industrial anomaly detection. Abstract: Industrial anomaly detection (AD) is characterized by an abundance of normal images but a scarcity of anomalous ones. Although numerous few-shot anomaly synthesis methods have been proposed to augment anomalous data for downstream AD tasks, most existing approaches require time-consuming training and struggle to learn distributions that are faithful to real anomalies, thereby restricting the efficacy of AD models trained on such data. To address these limitations, we propose a training-free few-shot anomaly generation method, namely O2MAG, which leverages the self-attention in One reference anomalous image to synthesize More realistic anomalies, supporting effective downstream anomaly detection. Specifically, O2MAG manipulates three parallel diffusion processes via self-attention grafting and incorporates the anomaly mask to mitigate foreground-background query confusion, synthesizing text-guided anomalies that closely adhere to real anomalous distributions. To bridge the semantic gap between the encoded anomaly text prompts and the true anomaly semantics, Anomaly-Guided Optimization is further introduced to align the synthesis process with the target anomalous distribution, steering the generation toward realistic and text-consistent anomalies. Moreover, to mitigate faint anomaly synthesis inside anomaly masks, Dual-Attention Enhancement is adopted during generation to reinforce both self- and cross-attention on masked regions. Extensive experiments validate the effectiveness of O2MAG, demonstrating its superior performance over prior state-of-the-art methods on downstream AD tasks.

[69] Q-Drift: Quantization-Aware Drift Correction for Diffusion Model Sampling

Sooyoung Ryu,Mathieu Salzmann,Saqib Javed

Main category: cs.CV

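TL;DR: This paper proposes Q-Drift, a sampler-side correction that mitigates the accumulation of quantization noise in diffusion models under post-training quantization (PTQ) and improves generation quality.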

Details Motivation: Post-training quantization (PTQ) is practical, but quantization noise accumulates along the denoising trajectory and can degrade generation quality. Method: Q-Drift models quantization error as an implicit stochastic perturbation at each step and derives a drift correction that preserves the marginal distribution; as few as 5 paired full-precision/quantized calibration runs suffice to estimate the timestep-wise variance statistic, and the correction is compatible with common samplers, models, and PTQ methods. Result: Evaluated on 6 text-to-image models, 3 samplers, and 2 PTQ methods, Q-Drift improves FID in most settings (by up to 4.59) while preserving CLIP scores. Conclusion: Q-Drift is a lightweight, general, plug-and-play sampler-side correction that effectively mitigates the degradation of diffusion models under PTQ. Abstract: Post-training quantization (PTQ) is a practical path to deploy large diffusion models, but quantization noise can accumulate over the denoising trajectory and degrade generation quality. We propose Q-Drift, a principled sampler-side correction that treats quantization error as an implicit stochastic perturbation on each denoising step and derives a marginal-distribution-preserving drift adjustment. Q-Drift estimates a timestep-wise variance statistic from calibration, in practice requiring as few as 5 paired full-precision/quantized calibration runs. The resulting sampler correction is plug-and-play with common samplers, diffusion models, and PTQ methods, while incurring negligible overhead at inference. Across six diverse text-to-image models (spanning DiT and U-Net), three samplers (Euler, flow-matching, DPM-Solver++), and two PTQ methods (SVDQuant, MixDQ), Q-Drift improves FID over the corresponding quantized baseline in most settings, with up to 4.59 FID reduction on PixArt-Sigma (SVDQuant W3A4), while preserving CLIP scores.

[70] Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters

Mohammed Rahman Sherif Khan Mohammad,Ardhendu Behera,Sandip Pradhan,Swagat Kumar,Amr Ahmed

Main category: cs.CV

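TL;DR: This paper proposes a training-only heterogeneous graph teacher framework (TOGA) that builds a multimodal graph, performs cross-modal reasoning with a graph transformer, and distills fine-grained relational knowledge into the Tip-Adapter cache, improving few-shot performance without adding inference cost.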

Details Motivation: Existing adapter-based few-shot methods for CLIP (such as Tip-Adapter) use only global uni-modal features, overlooking fine-grained relations among image patches and their structural alignment with class text. Method: Builds a high-capacity Heterogeneous Graph Teacher used only during training, which (i) models multi-scale visual patches and text prompts in a unified graph, (ii) performs deep cross-modal reasoning with a Modality-aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high-fidelity class features; a cache-aware dual-objective strategy then distills the relational knowledge into Tip-Adapter's key-value cache. Result: Consistently sets a new state of the art on standard 1-16-shot benchmarks; ablations confirm that graph supervision, text-guided reasoning, and node filtering are the key components. Conclusion: Without modifying the lightweight adapter or adding inference cost, introducing a heterogeneous graph teacher at training time alone substantially improves few-shot generalization, supporting the value of structured cross-modal relational modeling. Abstract: Recent adapter-based CLIP tuning (e.g., Tip-Adapter) is a strong few-shot learner, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global uni-modal feature vectors, overlooking fine-grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training-only framework. Instead of altering the lightweight adapter, we construct a high-capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi-scale visual patches and text prompts into a unified graph, (ii) performs deep cross-modal reasoning via a Modality-aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high-fidelity class features. Crucially, we employ a cache-aware dual-objective strategy to supervise this relational knowledge directly into the Tip-Adapter's key-value cache, effectively upgrading the prototypes while the graph teacher is discarded at test time. Thus, inference remains identical to Tip-Adapter with zero extra latency or memory. Across standard 1-16-shot benchmarks, our method consistently establishes a new state-of-the-art. Ablations confirm that the auxiliary graph supervision, text-guided reasoning, and node filtering are the essential ingredients for robust few-shot adaptation. Code is available at https://github.com/MR-Sherif/TOGA.git.
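Since the abstract states that inference stays identical to Tip-Adapter, a rough sketch of Tip-Adapter-style cache inference may help. It follows the key-value cache formulation as commonly described for Tip-Adapter (affinity `exp(-beta * (1 - f·k))` between the test feature and cached keys, blended with CLIP's zero-shot logits); the toy vectors, the `alpha`/`beta` values, and the omission of CLIP's logit scale are assumptions for illustration, not details from this paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tip_adapter_logits(feat, keys, values, text_weights, alpha=1.0, beta=5.5):
    """Blend cache-based logits with zero-shot logits for one test feature.

    feat, keys, and text_weights are assumed to be L2-normalised vectors;
    values are one-hot labels of the cached few-shot samples.
    """
    # Affinity of the test feature to each cached key.
    affinities = [math.exp(-beta * (1.0 - dot(feat, k))) for k in keys]
    n_classes = len(text_weights)
    # Cache logits: affinity-weighted sum of the cached one-hot labels.
    cache_logits = [
        sum(a * v[c] for a, v in zip(affinities, values)) for c in range(n_classes)
    ]
    # Zero-shot logits from the class text embeddings.
    clip_logits = [dot(feat, w) for w in text_weights]
    return [alpha * cl + zl for cl, zl in zip(cache_logits, clip_logits)]

# Toy two-class example with two cached samples.
logits = tip_adapter_logits(
    feat=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1, 0], [0, 1]],
    text_weights=[[1.0, 0.0], [0.0, 1.0]],
)
```

Under this reading, the training-only teacher changes only what is stored in `keys` and `values`, which is why test-time latency and memory are unaffected.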

[71] From Concepts to Judgments: Interpretable Image Aesthetic Assessment

Xiao-Chang Liu,Johan Wagemans

Main category: cs.CV

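TL;DR: This paper proposes an interpretable image aesthetic assessment (IAA) framework grounded in human-understandable aesthetic concepts; by learning a subspace of high-level aesthetic concepts and adding a residual predictor, it provides transparent, interpretable aesthetic judgments while remaining competitive in predictive performance.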

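Details Motivation: Existing IAA models predict well but lack interpretability, whereas users need not only a score but also an understanding of why an image is or is not pleasing; humans rely on high-level semantic cues when judging aesthetics, which motivates an interpretable framework built on human-understandable concepts. Method: An interpretable IAA framework grounded in human-understandable aesthetic concepts: (1) learn high-level aesthetic concepts in an accessible manner and construct a concept subspace as the basis of the interpretable model; (2) introduce a simple yet effective residual predictor to capture subtle aesthetic influences beyond the explicit concepts. Result: Experiments on photographic and artistic datasets show that the method is competitive in predictive performance while providing transparent, human-understandable aesthetic judgments. Conclusion: A framework based on a high-level aesthetic concept subspace and residual modeling can markedly improve the interpretability of IAA models while maintaining competitive predictive accuracy, answering users' need to know "why". Abstract: Image aesthetic assessment (IAA) aims to predict the aesthetic quality of images as perceived by humans. While recent IAA models achieve strong predictive performance, they offer little insight into the factors driving their predictions. Yet for users, understanding why an image is considered pleasing or not is as valuable as the score itself, motivating growing interest in interpretability within IAA. When humans evaluate aesthetics, they naturally rely on high-level cues to justify their judgments. Motivated by this observation, we propose an interpretable IAA framework grounded in human-understandable aesthetic concepts. We learn these concepts in an accessible manner, constructing a subspace that forms the foundation of an inherently interpretable model. To capture nuanced influences on aesthetic perception beyond explicit concepts, we introduce a simple yet effective residual predictor. Experiments on photographic and artistic datasets demonstrate that our method achieves competitive predictive performance while offering transparent, human-understandable aesthetic judgments.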

[72] Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

Yuhao Dong,Zuyan Liu,Shulin Tian,Yongming Rao,Ziwei Liu

Main category: cs.CV

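TL;DR: This paper proposes Insight-V++, a multi-agent visual reasoning framework that combines automatic generation of high-quality long-chain multimodal reasoning data, a dual-agent design (reasoning plus summary), and the newly proposed ST-GRPO and J-GRPO algorithms to significantly improve multimodal LLMs on complex image and video reasoning tasks.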

Details Motivation: Existing MLLMs lack high-quality long-chain multimodal reasoning data and suitable training paradigms, which limits their complex visual reasoning, especially spatial-temporal reasoning over video. Method: Builds the unified multi-agent visual reasoning framework Insight-V++: (1) a multi-granularity automatic data generation pipeline; (2) a dual-agent architecture (reasoning agent plus summary agent); (3) the improved online reinforcement learning algorithms ST-GRPO and J-GRPO; (4) an iterative self-improving training loop driven by feedback from the summary agent. Result: On base models such as LLaVA-NeXT and Qwen2.5-VL, Insight-V++ achieves significant gains on multiple complex image and video reasoning benchmarks while preserving performance on traditional perception tasks. Conclusion: Insight-V++ shows that combining multi-agent collaboration, high-quality synthetic data, and online reinforcement learning is an effective paradigm for advancing deep visual reasoning in MLLMs. Abstract: Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent to execute extensive analytical chains, and a summary agent to critically evaluate and distill final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off-policy nature fundamentally constrained reinforcement learning potential. To overcome these limitations, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. 
Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models like LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.

[73] VLM-AutoDrive: Post-Training Vision-Language Models for Safety-Critical Autonomous Driving Events

Mohammad Qazim Bhat,Yufan Huang,Niket Agarwal,Hao Wang,Michael Woods,John Kenyon,Tsung-Yi Lin,Xiaodong Yang,Ming-Yu Liu,Kevin Xie

Main category: cs.CV

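TL;DR: This paper proposes VLM-AutoDrive, a framework that post-trains pretrained vision-language models with multi-source supervision (metadata-derived captions, LLM-generated descriptions, VQA pairs, and CoT reasoning), significantly improving both performance and interpretability in detecting collisions and near-collisions in dashcam video.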

Details Motivation: General-purpose multimodal LLMs perform poorly at detecting sparse, brief safety-critical events (such as collisions) in driving scenes because of domain and temporal misalignment. Method: Proposes the modular post-training framework VLM-AutoDrive, which combines metadata-derived captions, LLM-generated descriptions, VQA samples, and chain-of-thought reasoning supervision for domain-aligned, interpretable learning. Result: On real-world Nexar dashcam data, the Collision F1 of the Cosmos-Reason1 7B model rises from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%. Conclusion: VLM-AutoDrive offers a scalable, interpretable way to adapt general-purpose VLMs to safety-critical, temporally localized perception tasks, bridging the gap between perception, causality, and decision reasoning. Abstract: The rapid growth of ego-centric dashcam footage presents a major challenge for detecting safety-critical events such as collisions and near-collisions, scenarios that are brief, rare, and difficult for generic vision models to capture. While multimodal large language models (MLLMs) demonstrate strong general reasoning ability, they underperform in driving contexts due to domain and temporal misalignment. We introduce VLM-AutoDrive, a modular post-training framework for adapting pretrained Vision-Language Models (VLMs) to high-fidelity anomaly detection. The framework integrates metadata-derived captions, LLM-generated descriptions, visual question answering (VQA) pairs, and chain-of-thought (CoT) reasoning supervision to enable domain-aligned and interpretable learning. Off-the-shelf VLMs such as NVIDIA's Cosmos-Reason1 7B (CR1) exhibit near-zero Collision recall in zero-shot settings; fine-tuning with VLM-AutoDrive improves Collision F1 from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%. VLM-AutoDrive offers a scalable recipe for adapting general-purpose VLMs to safety-critical, temporally localized perception tasks. Evaluated on real-world Nexar dashcam videos, it achieves substantial gains in Collision and Near-Collision detection while producing interpretable reasoning traces, bridging the gap between perception, causality, and decision reasoning in autonomous driving.

[74] MicroVision: An Open Dataset and Benchmark Models for Detecting Vulnerable Road Users and Micromobility Vehicles

Alexander Rasch,Rahul Rajendra Pai

Main category: cs.CV

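TL;DR: This paper introduces MicroVision, an open image dataset captured from the perspective of vulnerable road users (VRUs) for detecting pedestrians, cyclists, e-scooter riders, and stationary micromobility vehicles (bicycles, e-scooters); it addresses the limited category granularity and viewpoint coverage of existing datasets for VRUs and micromobility vehicles.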

Details Motivation: Existing open image datasets lack fine-grained categories for vulnerable road users (VRUs) and micromobility vehicles (MMVs) and diverse viewpoints (especially the VRU perspective), which limits object-detection models used for traffic safety. Method: Builds the MicroVision dataset, recorded in Gothenburg, Sweden, with more than 8,000 anonymized full-HD images captured across many scenes over an entire year and more than 30,000 annotated VRU and MMV instances; trains benchmark object-detection models based on state-of-the-art architectures. Result: The benchmark models reach a mean average precision (mAP) of up to 0.723 on an unseen test set; the dataset and model weights are publicly released. Conclusion: MicroVision fills a data gap in fine-grained VRU and MMV detection and can support more precise traffic-safety and micromobility monitoring systems. Abstract: Micromobility is a growing mode of transportation, raising new challenges for traffic safety and planning due to increased interactions in areas where vulnerable road users (VRUs) share the infrastructure with micromobility, including parked micromobility vehicles (MMVs). Approaches to support traffic safety and planning increasingly rely on detecting road users in images -- a computer-vision task relying heavily on the quality of the images to train on. However, existing open image datasets for training such models lack focus and diversity in VRUs and MMVs, for instance, by categorizing both pedestrians and MMV riders as "person", or by not including new MMVs like e-scooters. Furthermore, datasets are often captured from a car perspective and lack data from areas where only VRUs travel (sidewalks, cycle paths). To help close this gap, we introduce the MicroVision dataset: an open image dataset and annotations for training and evaluating models for detecting the most common VRUs (pedestrians, cyclists, e-scooterists) and stationary MMVs (bicycles, e-scooters), from a VRU perspective. The dataset, recorded in Gothenburg (Sweden), consists of more than 8,000 anonymized, full-HD images with more than 30,000 carefully annotated VRUs and MMVs, captured over an entire year and part of almost 2,000 unique interaction scenes. Along with the dataset, we provide first benchmark object-detection models based on state-of-the-art architectures, which achieved a mean average precision of up to 0.723 on an unseen test set. 
The dataset and model can support traffic safety to distinguish between different VRUs and MMVs, or help monitoring systems identify the use of micromobility. The dataset and model weights can be accessed at https://doi.org/10.71870/eepz-jd52.

[75] Semantic Segmentation and Depth Estimation for Real-Time Lunar Surface Mapping Using 3D Gaussian Splatting

Guillem Casadesus Vila,Adam Dai,Grace Gao

Main category: cs.CV

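TL;DR: This paper proposes a real-time dense lunar-surface mapping framework based on 3D Gaussian Splatting (3DGS) that fuses GRU-based stereo depth estimation with a CNN semantic segmentation model; on LuPNT simulation data it reconstructs a 120-meter traverse with a geometric height accuracy of about 3 cm without LiDAR, and it supports novel view synthesis and future integration into a SLAM system.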

Details Motivation: Lunar surface navigation and mapping face poorly textured terrain, high-contrast lighting, and limited compute, calling for robust, efficient, lightweight perception and mapping. Method: A real-time mapping framework: (1) benchmark several dense perception models on synthetic datasets generated with the LuPNT simulator; (2) select a stereo depth estimation model based on Gated Recurrent Units (GRUs), which balances speed and accuracy, and a CNN semantic segmentation model; (3) use ground-truth poses to decouple local scene understanding from global state estimation; (4) fuse dense depth and semantics into a 3D Gaussian Splatting (3DGS) map representation. Result: Reconstructs a 120-meter lunar traverse with a geometric height accuracy of about 3 cm, outperforming a traditional point-cloud baseline without LiDAR; the resulting 3DGS map supports novel view synthesis and provides a basis for joint map and pose optimization in a full SLAM system. Conclusion: Combining semantic segmentation and dense depth estimation with learned map representations such as 3DGS is an effective approach to building detailed, large-scale lunar maps for future missions. Abstract: Navigation and mapping on the lunar surface require robust perception under challenging conditions, including poorly textured environments, high-contrast lighting, and limited computational resources. This paper presents a real-time mapping framework that integrates dense perception models with a 3D Gaussian Splatting (3DGS) representation. We first benchmark several models on synthetic datasets generated with the LuPNT simulator, selecting a stereo dense depth estimation model based on Gated Recurrent Units for its balance of speed and accuracy in depth estimation, and a convolutional neural network for its superior performance in detecting semantic segments. Using ground truth poses to decouple the local scene understanding from the global state estimation, our pipeline reconstructs a 120-meter traverse with a geometric height accuracy of approximately 3 cm, outperforming a traditional point cloud baseline without LiDAR. The resulting 3DGS map enables novel view synthesis and serves as a foundation for a full SLAM system, where its capacity for joint map and pose optimization would offer significant advantages. Our results demonstrate that combining semantic segmentation and dense depth estimation with learned map representations is an effective approach for creating detailed, large-scale maps to support future lunar surface missions.

[76] LRConv-NeRV: Low Rank Convolution for Efficient Neural Video Compression

Tamer Shanableh

Main category: cs.CV

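TL;DR: This paper proposes LRConv-NeRV, which replaces selected dense 3×3 convolutional layers in the NeRV decoder with structured low-rank separable convolutions, significantly improving compute and storage efficiency while preserving reconstruction quality and temporal coherence.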

Details Motivation: NeRV's convolutional decoder is computationally expensive and memory intensive, which makes it hard to deploy in resource-constrained environments. Method: In the NeRV decoder, selected 3×3 convolutional layers are progressively replaced, from the final stage toward earlier stages, by low-rank separable factorizations (LRConv) and trained end to end; robustness is also evaluated under INT8 quantization. Result: Applying LRConv only to the final decoder stage cuts GFLOPs by 68% (201.9 → 64.9) and model size by 9.3%, with PSNR/MS-SSIM almost unchanged and bitrate reduced by about 9.2%; under INT8, quality stays close to the original NeRV; LPIPS analysis indicates good temporal stability. Conclusion: LRConv-NeRV offers a better efficiency-quality trade-off than existing methods in low-precision, resource-constrained settings and is a promising architecture for efficient neural video decoding. Abstract: Neural Representations for Videos (NeRV) encode entire video sequences within neural network parameters, offering an alternative paradigm to conventional video codecs. However, the convolutional decoder of NeRV remains computationally expensive and memory intensive, limiting its deployment in resource-constrained environments. This paper proposes LRConv-NeRV, an efficient NeRV variant that replaces selected dense 3x3 convolutional layers with structured low-rank separable convolutions, trained end-to-end within the decoder architecture. By progressively applying low-rank factorization from the largest to earlier decoder stages, LRConv-NeRV enables controllable trade-offs between reconstruction quality and efficiency. Extensive experiments demonstrate that applying LRConv only to the final decoder stage reduces decoder complexity by 68%, from 201.9 to 64.9 GFLOPs, and model size by 9.3%, while incurring negligible quality loss and achieving approximately 9.2% bitrate reduction. Under INT8 post-training quantization, LRConv-NeRV preserves reconstruction quality close to the dense NeRV baseline, whereas more aggressive factorization of early decoder stages leads to disproportionate quality degradation. Compared to existing work under layer-aligned settings, LRConv-NeRV achieves a more favorable efficiency versus quality trade-off, offering substantial GFLOPs and parameter reductions while maintaining higher PSNR/MS-SSIM and improved temporal stability. 
Temporal flicker analysis using LPIPS further shows that the proposed solution preserves temporal coherence close to the NeRV baseline. These results establish LRConv-NeRV as a potential architectural alternative for efficient neural video decoding under low-precision and resource-constrained settings.
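To give a feel for why a low-rank separable factorization saves parameters, the sketch below compares a dense 3×3 convolution with one common factorization: a 3×1 convolution into `rank` channels followed by a 1×3 convolution out of them. This particular factorization, the channel counts, and the rank are assumptions for illustration; the abstract does not specify the exact LRConv structure, and biases are ignored.

```python
def dense_conv_params(c_in, c_out, k=3):
    """Weights in a dense k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def separable_low_rank_params(c_in, c_out, rank, k=3):
    """Weights in a k x 1 conv (c_in -> rank) followed by a 1 x k conv (rank -> c_out)."""
    return k * c_in * rank + k * rank * c_out

# Hypothetical layer: 64 input channels, 64 output channels, rank 16.
dense = dense_conv_params(64, 64)
low_rank = separable_low_rank_params(64, 64, rank=16)
ratio = low_rank / dense  # fraction of the dense layer's weights that remain
```

Because the cost of the factorized layer grows linearly in `rank`, the rank acts as the knob for the quality-efficiency trade-off the abstract describes.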

[77] CycleCap: Improving VLMs Captioning Performance via Self-Supervised Cycle Consistency Fine-Tuning

Marios Krestenitis,Christos Tzelepis,Konstantinos Ioannidis,Steafanos Vrochidis,Ioannis Kompatsiaris,Georgios Tzimiropoulos,Shaogang Gong,Ioannis Patras

Main category: cs.CV

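TL;DR: This paper proposes CycleCap, which uses image-text cycle consistency (reconstructing the image with a pre-trained text-to-image model) as a self-supervised signal and optimizes the VLM's captioning with GRPO; it requires no annotated data, significantly reduces hallucination, and improves caption accuracy.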

Details Motivation: Existing vision-language models (VLMs) are prone to image-text misalignment in captioning, producing generic or hallucinated content; existing remedies depend on large annotated datasets or complex inference frameworks, which are costly and generalize poorly. Method: Build an image → text (VLM) → image (text-to-image model) cycle, use the similarity between the original and reconstructed images as the reward, and fine-tune the VLM end to end with Group Relative Policy Optimization (GRPO); no human-annotated image-text pairs are needed, so the optimization is self-supervised. Result: Across four VLMs with 1B to 7B parameters, CycleCap consistently outperforms state-of-the-art methods on both caption quality and hallucination benchmarks, without requiring supervised data. Conclusion: Cycle consistency can serve as a strong self-supervised signal that directly improves VLM captioning; CycleCap shows that raw images alone suffice to improve image-text alignment, offering a new paradigm for lightweight, low-resource VLM training. Abstract: Visual-Language Models (VLMs) have achieved remarkable progress in image captioning, visual question answering, and visual reasoning. Yet they remain prone to vision-language misalignment, often producing overly generic or hallucinated descriptions. Existing approaches address this via instruction tuning, which requires costly, large-scale annotated datasets, or via complex test-time frameworks for caption refinement. In this work, we revisit image-text alignment through the lens of cycle consistency: given an image and a caption generated by an image-to-text model, the backward mapping through a text-to-image model should reconstruct an image that closely matches the original. In our setup, a VLM serves as the image-to-text component, while a pre-trained text-to-image model closes the loop by reconstructing the image from the generated caption. Building on this, we introduce CycleCap, a fine-tuning scheme to improve image captioning using Group Relative Policy Optimization (GRPO) with a reward based on the similarity between the original and reconstructed images, computed on-the-fly. Unlike previous work that uses cycle consistency loss for preference dataset construction, our method leverages cycle consistency directly as a self-supervised training signal. This enables the use of raw images alone, eliminating the need for curated image-text datasets, while steering the VLM to produce more accurate and grounded text descriptions. 
Applied to four VLMs ranging from 1B to 7B parameters, CycleCap yields consistent improvements across captioning and hallucination benchmarks, surpassing state-of-the-art methods that rely on supervised cycle consistency training.
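The core of GRPO is a group-relative advantage: several captions are sampled for the same image, each gets a reward, and rewards are standardized within the group. The sketch below shows only that step, assuming the reward is an image-similarity score in [0, 1]; the reward values are made up, and the use of the population standard deviation plus a small `eps` is an implementation assumption.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical similarity rewards for three captions of the same image.
rewards = [0.2, 0.5, 0.8]
advantages = group_relative_advantages(rewards)
```

Captions whose reconstructions match the original image better than the group average receive positive advantages and are reinforced; no learned value function or labelled caption is needed.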

[78] Fast and Generalizable NeRF Architecture Selection for Satellite Scene Reconstruction

Devjyoti Chakraborty,Zaki Sukma,Rakandhiya D. Rachmanto,Kriti Ghosh,In Kee Kim,Suchendra M. Bhandarkar,Lakshmish Ramaswamy,Nancy K. O'Hare,Deepak Mishra

Main category: cs.CV

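TL;DR: This paper proposes PreSCAN, a framework that predicts NeRF reconstruction quality before training from lightweight geometric and photometric descriptors, enabling fast architecture selection (< 30 seconds), a large speedup (1000×) over NAS, and more energy-efficient edge deployment.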

Details Motivation: Deploying NeRF on satellite imagery requires separate training per scene, and NAS takes hours to days; SHAP analysis reveals that multi-view consistency, rather than model architecture, determines reconstruction quality. Method: Based on the SHAP findings, builds the PreSCAN predictive framework, which estimates NeRF quality before training from lightweight geometric and photometric descriptors; combined with offline cost profiling, it optimizes inference power and latency on an edge platform (Jetson Orin). Result: PreSCAN selects a suitable architecture in < 30 seconds with < 1 dB prediction error, a 1000× speedup over NAS; on Jetson Orin it reduces inference power by 26% and latency by 43% with minimal quality loss; on DFC2019 it generalizes across scenes without retraining. Conclusion: Multi-view consistency is the key factor behind NeRF reconstruction quality on satellite imagery; PreSCAN provides an efficient, lightweight, deployable predictor that makes NeRF more practical on resource-constrained edge platforms. Abstract: Neural Radiance Fields (NeRF) have emerged as a powerful approach for photorealistic 3D reconstruction from multi-view images. However, deploying NeRF for satellite imagery remains challenging. Each scene requires individual training, and optimizing architectures via Neural Architecture Search (NAS) demands hours to days of GPU time. While existing approaches focus on architectural improvements, our SHAP analysis reveals that multi-view consistency, rather than model architecture, determines reconstruction quality. Based on this insight, we develop PreSCAN, a predictive framework that estimates NeRF quality prior to training using lightweight geometric and photometric descriptors. PreSCAN selects suitable architectures in < 30 seconds with < 1 dB prediction error, achieving 1000× speedup over NAS. We further demonstrate PreSCAN's deployment utility on edge platforms (Jetson Orin), where combining its predictions with offline cost profiling reduces inference power by 26% and latency by 43% with minimal quality loss. Experiments on DFC2019 datasets confirm that PreSCAN generalizes across diverse satellite scenes without retraining.

[79] Unrolled Reconstruction with Integrated Super-Resolution for Accelerated 3D LGE MRI

Md Hasibul Husain Hisham,Shireen Elhabian,Ganesh Adluru,Jason Mendes,Andrew Arai,Eugene Kholmovski,Ravi Ranjan,Edward DiBella

Main category: cs.CV

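TL;DR: This paper proposes a hybrid unrolled reconstruction framework that embeds an EDSR network in the optimization loop in place of the proximal operator, jointly optimizing super-resolution enhancement and data consistency, and significantly improving the reconstruction of fine structures such as the left atrium in accelerated 3D LGE MRI.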

Details Motivation: Accelerated 3D late gadolinium enhancement (LGE) MRI needs robust reconstruction to recover thin-walled atrial structures from undersampled k-space data; existing unrolled models combine physics-based consistency with learned priors but operate at the acquired resolution and may not fully recover high-frequency detail. Method: A hybrid unrolled reconstruction framework in which an Enhanced Deep Super-Resolution (EDSR) network replaces the proximal operator of a conventional unrolled network, so that each iteration jointly performs super-resolution enhancement and data-consistency enforcement; the model is trained end to end on retrospectively undersampled preclinical 3D LGE datasets. Result: Compared with compressed sensing, MoDL, and self-guided DIP baselines, the method improves PSNR and SSIM across acceleration factors, better preserves fine cardiac structures, and improves left atrium (LA) segmentation. Conclusion: Embedding super-resolution priors directly in model-based reconstruction yields measurable gains in accelerated 3D LGE MRI. Abstract: Accelerated 3D late gadolinium enhancement (LGE) MRI requires robust reconstruction methods to recover thin atrial structures from undersampled k-space data. While unrolled model-based networks effectively integrate physics-driven data consistency with learned priors, they operate at the acquired resolution and may fail to fully recover high-frequency detail. We propose a hybrid unrolled reconstruction framework in which an Enhanced Deep Super-Resolution (EDSR) network replaces the proximal operator within each iteration of the optimization loop, enabling joint super-resolution enhancement and data consistency enforcement. The model is trained end-to-end on retrospectively undersampled preclinical 3D LGE datasets and compared against compressed sensing, Model-Based Deep Learning (MoDL), and self-guided Deep Image Prior (DIP) baselines. Across acceleration factors, the proposed method consistently improves PSNR and SSIM over standard unrolled reconstruction and better preserves fine cardiac structures, leading to improved LA (left atrium) segmentation performance. These results demonstrate that integrating super-resolution priors directly within model-based reconstruction provides measurable gains in accelerated 3D LGE MRI.

[80] VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection

Bo-Cheng Qiu,Yu-Fan Lin,Yu-Zhe Pien,Chia-Ming Lee,Fu-En Yang,Yu-Chiang Frank Wang,Chih-Chung Hsu

Main category: cs.CV

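TL;DR: This paper addresses the RARE-VISION capsule endoscopy event detection task by combining two backbones (EndoFM-LV and DINOv3 ViT-L/16), a Diverse Head Ensemble, Validation-Guided Hierarchical Fusion, and Anatomy-Aware Temporal Event Decoding to improve event-level detection; on the hidden test set it achieves a temporal mAP@0.5 of 0.3530 and mAP@0.95 of 0.3235.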

Details Motivation: Capsule endoscopy event detection is hard because findings are sparse and visually heterogeneous and videos are long and noisy, while most existing methods rely on frame-wise classification that does not match event-level evaluation. Method: The framework combines two backbones, EndoFM-LV (local temporal context) and DINOv3 ViT-L/16 (frame-level visual semantics), followed by a Diverse Head Ensemble, Validation-Guided Hierarchical Fusion (class-wise weighting, backbone weighting, and probability calibration), and Anatomy-Aware Temporal Event Decoding (temporal smoothing, anatomical constraints, threshold refinement, and per-label event generation). Result: On the official hidden test set, overall temporal mAP@0.5 is 0.3530 and mAP@0.95 is 0.3235; ablations indicate that complementary backbones, validation-guided fusion, and anatomy-aware decoding all contribute to event-level performance. Conclusion: Treating event detection as a metric-aligned task and combining multi-source feature fusion with joint temporal and anatomical modeling improves the robustness and accuracy of detecting sparse, heterogeneous lesion events in capsule endoscopy. Abstract: Capsule endoscopy event detection is challenging because diagnostically relevant findings are sparse, visually heterogeneous, and embedded in long, noisy video streams, while evaluation is performed at the event level rather than by frame accuracy alone. We therefore formulate the RARE-VISION task as a metric-aligned event detection problem instead of a purely frame-wise classification task. Our framework combines two complementary backbones, EndoFM-LV for local temporal context and DINOv3 ViT-L/16 for strong frame-level visual semantics, followed by a Diverse Head Ensemble, Validation-Guided Hierarchical Fusion, and Anatomy-Aware Temporal Event Decoding. The fusion stage uses validation-derived class-wise model weighting, backbone weighting, and probability calibration, while the decoding stage applies temporal smoothing, anatomical constraints, threshold refinement, and per-label event generation to produce stable event predictions. Validation ablations indicate that complementary backbones, validation-guided fusion, and anatomy-aware temporal decoding all contribute to event-level performance. On the official hidden test set, the proposed method achieved an overall temporal mAP@0.5 of 0.3530 and temporal mAP@0.95 of 0.3235.
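The step from frame-wise probabilities to events can be sketched as follows. This is a simplified illustration of two of the decoding ingredients named in the abstract (temporal smoothing and thresholding into per-label events); the moving-average window, the fixed threshold, and the function names are assumptions, and the anatomical constraints and threshold refinement are not modeled.

```python
def smooth(probs, window=3):
    """Centered moving average; the window shrinks at the sequence borders."""
    half = window // 2
    out = []
    for i in range(len(probs)):
        lo, hi = max(0, i - half), min(len(probs), i + half + 1)
        out.append(sum(probs[lo:hi]) / (hi - lo))
    return out

def decode_events(probs, threshold=0.5):
    """Group consecutive frames at or above the threshold into (start, end) events."""
    events, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            events.append((start, i - 1))
            start = None
    if start is not None:
        events.append((start, len(probs) - 1))
    return events

# Hypothetical per-frame probabilities for one label.
frame_probs = [0.0, 0.1, 0.9, 0.8, 0.9, 0.1, 0.0, 0.0]
events = decode_events(smooth(frame_probs))
```

Smoothing before thresholding suppresses single-frame spikes, which matters when the metric scores temporal overlap of events instead of per-frame accuracy.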

[81] To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

Rui Hong,Shuxue Quan

Main category: cs.CV

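TL;DR: This paper proposes a tri-layer diagnostic framework that uses three metrics (Latent Anomaly Detection, Visual Necessity Score, and Competition Score) to analyze whether vision-language models (VLMs) genuinely rely on visual information in their answers or exploit language shortcuts. It finds that 69.6% of samples exhibit "visual sycophancy", where the model perceives a visual anomaly yet still hallucinates to satisfy user expectations; larger models reduce language shortcuts but amplify visual sycophancy; and the diagnostic scores also enable training-free, post-hoc selective prediction that improves accuracy.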

Details Motivation: To investigate whether VLMs that answer correctly genuinely rely on visual information, or instead exploit language shortcuts or hallucinate to satisfy user expectations. Method: A tri-layer diagnostic framework comprising Latent Anomaly Detection (perceptual awareness), Visual Necessity Score (visual dependency, measured via KL divergence), and Competition Score (conflict between visual grounding and instruction following), combined with counterfactual interventions (blind, noise, and conflict images) and evaluated on 7 VLMs and 7,000 model-sample pairs. Result: 69.6% of samples exhibit Visual Sycophancy and zero samples show Robust Refusal; scaling Qwen2.5-VL from 7B to 72B reduces Language Shortcuts but amplifies Visual Sycophancy; the diagnostic scores support post-hoc selective prediction, with up to +9.5pp accuracy at 50% coverage. Conclusion: Current VLMs commonly sacrifice truthfulness to satisfy user expectations, and alignment training has suppressed their ability to acknowledge uncertainty; scaling alone cannot solve the visual grounding problem; the proposed diagnostic framework can identify and mitigate hallucination and supports efficient post-hoc optimization. Abstract: When VLMs answer correctly, do they genuinely rely on visual information or exploit language shortcuts? We introduce the Tri-Layer Diagnostic Framework, which disentangles hallucination sources via three metrics: Latent Anomaly Detection (perceptual awareness), Visual Necessity Score (visual dependency, measured via KL divergence), and Competition Score (conflict between visual grounding and instruction following). Using counterfactual interventions (blind, noise, and conflict images) across 7 VLMs and 7,000 model-sample pairs, our taxonomy reveals that 69.6% of samples exhibit Visual Sycophancy--models detect visual anomalies but hallucinate to satisfy user expectations--while zero samples show Robust Refusal, indicating alignment training has systematically suppressed truthful uncertainty acknowledgment. A scaling analysis (Qwen2.5-VL 7B to 72B) shows larger models reduce Language Shortcuts but amplify Visual Sycophancy, demonstrating scale alone cannot resolve the grounding problem. Diagnostic scores further enable a post-hoc selective prediction strategy achieving up to +9.5pp accuracy at 50% coverage with no additional training cost.
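One plausible reading of a KL-based visual dependency score is to compare the model's answer distribution given the real image with its distribution given a blind (uninformative) image. The sketch below computes that divergence; the direction of the KL term, the choice of counterfactual, and the toy distributions are assumptions, since the abstract states only that KL divergence is used.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same answer options."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )

# Hypothetical answer distributions over two options.
p_with_image = [0.9, 0.1]   # model sees the real image
p_blind_image = [0.5, 0.5]  # model sees a blank image
visual_necessity = kl_divergence(p_with_image, p_blind_image)
```

A score near zero would suggest that the answer barely changes when the image is removed, which is the signature of a language shortcut.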

[82] Pixel-Accurate Epipolar Guided Matching

Oleksii Nasypanyi,Francois Rameau

Main category: cs.CV

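TL;DR: This paper proposes an exact keypoint matching method in angular space: each keypoint is assigned a tolerance circle that is converted into a 1D angular interval query, and a segment tree provides efficient, pixel-accurate matching under the epipolar constraint.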

Details Motivation: Existing epipolar-guided keypoint matching methods rely on coarse spatial binning, which introduces approximation errors, incurs costly post-processing, and can miss valid matches. Method: Each keypoint is mapped to an angular interval as viewed from the epipole (defined by its tolerance circle); matching is cast as a 1D angular interval query and solved in logarithmic time with a segment tree. Result: On ETH3D the method is noticeably faster than existing approaches while guaranteeing pixel-level tolerance, supporting per-keypoint control, avoiding unnecessary descriptor comparisons, and recovering exact correspondence sets. Conclusion: The exact angular-space matching framework overcomes the drawbacks of spatial binning, achieves a better balance between speed and match completeness, and suits geometry-sensitive tasks such as SfM. Abstract: Keypoint matching can be slow and unreliable in challenging conditions such as repetitive textures or wide-baseline views. In such cases, known geometric relations (e.g., the fundamental matrix) can be used to restrict potential correspondences to a narrow epipolar envelope, thereby reducing the search space and improving robustness. These epipolar-guided matching approaches have proved effective in tasks such as SfM; however, most rely on coarse spatial binning, which introduces approximation errors, requires costly post-processing, and may miss valid correspondences. We address these limitations with an exact formulation that performs candidate selection directly in angular space. In our approach, each keypoint is assigned a tolerance circle which, when viewed from the epipole, defines an angular interval. Matching then becomes a 1D angular interval query, solved efficiently in logarithmic time with a segment tree. This guarantees pixel-level tolerance, supports per-keypoint control, and removes unnecessary descriptor comparisons. Extensive evaluation on ETH3D demonstrates noticeable speedups over existing approaches while recovering exact correspondence sets.
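The geometry behind the angular interval can be sketched directly: a circle of radius r centered at distance d from the epipole subtends a half-angle of asin(r / d). The code below computes that interval and tests whether a ray from the epipole hits the tolerance circle. The function names are hypothetical, the query treats the epipolar line as a ray for simplicity, and a plain membership test stands in for the segment tree used in the paper.

```python
import math

def angular_interval(keypoint, epipole, radius):
    """Return (center_angle, half_width) of the tolerance circle seen from the epipole."""
    dx = keypoint[0] - epipole[0]
    dy = keypoint[1] - epipole[1]
    dist = math.hypot(dx, dy)
    if dist <= radius:
        # The epipole lies inside the circle, so every direction hits it.
        return (0.0, math.pi)
    return (math.atan2(dy, dx), math.asin(radius / dist))

def ray_hits_circle(angle, interval):
    """True if a ray leaving the epipole at `angle` passes through the circle."""
    center, half_width = interval
    # Wrap the angular difference into [-pi, pi) before comparing.
    diff = (angle - center + math.pi) % (2 * math.pi) - math.pi
    return abs(diff) <= half_width

# Keypoint 10 px from the epipole with a 1 px tolerance.
interval = angular_interval((10.0, 0.0), (0.0, 0.0), 1.0)
```

Because the half-width depends on each keypoint's own radius and distance, the tolerance is exact in pixels for every keypoint, unlike a fixed-width angular or spatial bin.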

[83] Inst4DGS: Instance-Decomposed 4D Gaussian Splatting with Multi-Video Label Permutation Learning

Yonghan Lee,Dinesh Manocha

Main category: cs.CV

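TL;DR: Inst4DGS is an instance-decomposed 4D Gaussian Splatting method that aligns instance labels across videos with a differentiable Sinkhorn layer and introduces motion scaffolds to make long-horizon trajectory optimization more efficient, achieving state-of-the-art rendering and segmentation quality.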

Details Motivation: Dynamic 4D Gaussian Splatting has advanced rapidly, but its instance-decomposed variant remains underexplored because inconsistent instance labels across multi-view videos are hard to associate. Method: Per-video label-permutation latents combined with a differentiable Sinkhorn layer learn cross-video instance matches; instance-decomposed motion scaffolds provide low-dimensional motion bases per object to support long-horizon trajectory optimization. Result: On the Panoptic Studio and Neural3DV datasets, Inst4DGS jointly supports tracking and instance decomposition; on Panoptic Studio, PSNR improves from 26.10 to 28.36 and instance mIoU from 0.6310 to 0.9129. Conclusion: Through explicit label alignment and the motion-scaffold design, Inst4DGS achieves high-fidelity rendering, stable instance identities, and efficient long-horizon modeling, advancing 4D scene understanding. Abstract: We present Inst4DGS, an instance-decomposed 4D Gaussian Splatting (4DGS) approach with long-horizon per-Gaussian trajectories. While dynamic 4DGS has advanced rapidly, instance-decomposed 4DGS remains underexplored, largely due to the difficulty of associating inconsistent instance labels across independently segmented multi-view videos. We address this challenge by introducing per-video label-permutation latents that learn cross-video instance matches through a differentiable Sinkhorn layer, enabling direct multi-view supervision with consistent identity preservation. This explicit label alignment yields sharp decision boundaries and temporally stable identities without identity drift. To further improve efficiency, we propose instance-decomposed motion scaffolds that provide low-dimensional motion bases per object for long-horizon trajectory optimization. Experiments on Panoptic Studio and Neural3DV show that Inst4DGS jointly supports tracking and instance decomposition while achieving state-of-the-art rendering and segmentation quality. On the Panoptic Studio dataset, Inst4DGS improves PSNR from 26.10 to 28.36, and instance mIoU from 0.6310 to 0.9129, over the strongest baseline.
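A Sinkhorn layer turns a matrix of matching scores into an approximately doubly stochastic matrix, a soft permutation, by alternately normalizing rows and columns. The sketch below shows that iteration on a small score matrix; the scores, the iteration count, and the absence of a temperature parameter are assumptions for illustration, and a differentiable implementation would run the same steps inside an autodiff framework.

```python
import math

def sinkhorn(scores, n_iters=100):
    """Alternate row and column normalization of exp(scores)."""
    m = [[math.exp(s) for s in row] for row in scores]
    n = len(m)
    for _ in range(n_iters):
        # Row normalization: each row sums to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column normalization: each column sums to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m

# Hypothetical scores between labels in one video (rows) and global instance IDs (columns).
scores = [
    [2.0, 0.5, 0.1],
    [0.3, 1.5, 0.2],
    [0.1, 0.4, 1.0],
]
soft_perm = sinkhorn(scores)
```

Each row of `soft_perm` can be read as a soft assignment of one per-video label to the shared instance identities, which is what allows label matching to be learned by gradient descent.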

[84] Mind the Rarities: Can Rare Skin Diseases Be Reliably Diagnosed via Diagnostic Reasoning?

Yang Liu,Jiyao Yang,Hongjin Zhao,Xiaoyong Li,Yanzhe Ji,Xingjian Li,Runmin Jiang,Tianyang Wang,Saeed Anwar,Dongwoo Kim,Yue Yao,Zhenyue Qin,Min Xu

Main category: cs.CV

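TL;DR: This paper builds DermCase, a long-context multimodal benchmark based on dermatology case reports, to evaluate the clinical reasoning of large vision-language models (LVLMs) in diagnosing rare skin diseases, and proposes DermLIP-based similarity metrics for more reliable assessment of differential diagnosis quality.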

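Details Motivation: Existing benchmarks focus on common diseases and evaluate only final diagnostic accuracy, overlooking the clinical reasoning process that is critical for complex rare diseases; a new benchmark that can assess diagnostic reasoning is needed. Method: Build the DermCase benchmark with 26,030 image-text pairs and 6,354 challenging cases, each with complete clinical information and a step-by-step reasoning chain; propose DermLIP-based similarity metrics to assess differential diagnosis quality; systematically evaluate 22 mainstream LVLMs, and run instruction-tuning and DPO fine-tuning experiments together with an error analysis. Result: All 22 LVLMs show significant deficiencies in diagnostic accuracy, differential diagnosis, and clinical reasoning; instruction tuning substantially improves performance whereas DPO yields minimal gains; systematic error analysis reveals critical limitations in current models' reasoning. Conclusion: DermCase fills the gap in evaluating diagnostic reasoning for rare skin diseases, exposes fundamental limitations of LVLMs in clinical reasoning, and highlights the need for model design and evaluation paradigms that look beyond the final answer to interpretable, structured reasoning. Abstract: Large vision-language models (LVLMs) demonstrate strong performance in dermatology; however, evaluating diagnostic reasoning for rare conditions remains largely unexplored. Existing benchmarks focus on common diseases and assess only final accuracy, overlooking the clinical reasoning process, which is critical for complex cases. We address this gap by constructing DermCase, a long-context benchmark derived from peer-reviewed case reports. Our dataset contains 26,030 multi-modal image-text pairs and 6,354 clinically challenging cases, each annotated with comprehensive clinical information and step-by-step reasoning chains. To enable reliable evaluation, we establish DermLIP-based similarity metrics that achieve stronger alignment with dermatologists for assessing differential diagnosis quality. Benchmarking 22 leading LVLMs exposes significant deficiencies across diagnosis accuracy, differential diagnosis, and clinical reasoning. Fine-tuning experiments demonstrate that instruction tuning substantially improves performance while Direct Preference Optimization (DPO) yields minimal gains. Systematic error analysis further reveals critical limitations in current models' reasoning capabilities.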

[85] SynQ: Accurate Zero-shot Quantization by Synthesis-aware Fine-tuning

Minjun Kim,Jongjin Kim,U Kang

Main category: cs.CV

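TL;DR: This paper proposes SynQ, a framework that reduces noise in synthetic data with a low-pass filter, improves accuracy by aligning class activation maps, and uses only soft labels for difficult samples to avoid misguidance, achieving state-of-the-art zero-shot quantization (ZSQ) performance.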

Details Motivation: Address three challenges that arise in zero-shot quantization (ZSQ) because training data are inaccessible: noise in synthetic data, predictions based on off-target patterns, and misguidance by erroneous hard labels. Method: Propose the SynQ framework: 1) apply a low-pass filter to reduce noise in synthetic samples; 2) align the class activation maps of the quantized model and the pre-trained model to improve accuracy; 3) use only soft labels for difficult samples to avoid being misled by the pre-trained model's errors. Result: Across multiple experiments, SynQ achieves state-of-the-art accuracy on zero-shot quantization. Conclusion: SynQ overcomes key limitations of existing ZSQ methods, markedly improves accuracy under data-free quantization, and offers a better option for edge deployment in privacy-sensitive settings. Abstract: How can we accurately quantize a pre-trained model without any data? Quantization algorithms are widely used for deploying neural networks on resource-constrained edge devices. Zero-shot Quantization (ZSQ) addresses the crucial and practical scenario where training data are inaccessible for privacy or security reasons. However, three significant challenges hinder the performance of existing ZSQ methods: 1) noise in the synthetic dataset, 2) predictions based on off-target patterns, and 3) misguidance by erroneous hard labels. In this paper, we propose SynQ (Synthesis-aware Fine-tuning for Zero-shot Quantization), a carefully designed ZSQ framework to overcome the limitations of existing methods. SynQ minimizes the noise from the generated samples by exploiting a low-pass filter. Then, SynQ trains the quantized model to improve accuracy by aligning its class activation map with the pre-trained model. Furthermore, SynQ mitigates misguidance from the pre-trained model's error by leveraging only soft labels for difficult samples. Extensive experiments show that SynQ achieves state-of-the-art accuracy over existing ZSQ methods.

[86] R&D: Balancing Reliability and Diversity in Synthetic Data Augmentation for Semantic Segmentation

Huy Che,Dinh-Duy Phan,Duc-Khai Lam

Main category: cs.CV

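TL;DR: This paper proposes a new synthetic data augmentation method for semantic segmentation based on controllable diffusion models; class-aware prompting and visual prior blending improve image quality and label alignment, and results on benchmarks such as PASCAL VOC and BDD100K confirm its effectiveness in data-scarce scenarios.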

Details Motivation: Collecting and annotating data for pixel-level semantic segmentation is expensive; traditional augmentation cannot generate new structures, and existing generative models struggle to keep synthetic images consistent with the ground-truth labels. Method: Propose a synthetic data augmentation pipeline built on controllable diffusion models, introducing class-aware prompting and visual prior blending to improve image quality and ensure precise alignment with segmentation labels. Result: The method significantly improves semantic segmentation performance on benchmarks such as PASCAL VOC and BDD100K, especially in data-scarce scenarios, and improves model robustness in real-world settings. Conclusion: The method bridges the gap between synthetic and real data, improving reliability while preserving diversity, and offers an efficient, controllable augmentation paradigm for semantic segmentation. Abstract: Collecting and annotating datasets for pixel-level semantic segmentation tasks are highly labor-intensive. Data augmentation provides a viable solution by enhancing model generalization without additional real-world data collection. Traditional augmentation techniques, such as translation, scaling, and color transformations, create geometric variations but fail to generate new structures. While generative models have been employed to extend semantic information of datasets, they often struggle to maintain consistency between the original and generated images, particularly for pixel-level tasks. In this work, we propose a novel synthetic data augmentation pipeline that integrates controllable diffusion models. Our approach balances the diversity and reliability of synthetic data, effectively bridging the gap between synthetic and real data. We utilize class-aware prompting and visual prior blending to improve image quality further, ensuring precise alignment with segmentation labels. By evaluating benchmark datasets such as PASCAL VOC and BDD100K, we demonstrate that our method significantly enhances semantic segmentation performance, especially in data-scarce scenarios, while improving model robustness in real-world applications. Our code is available at https://github.com/chequanghuy/Enhanced-Generative-Data-Augmentation-for-Semantic-Segmentation-via-Stronger-Guidance.

[87] AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents

Yibo Shi,Jungang Li,Linghao Zhang,Zihao Dongfang,Biao Wu,Sicheng Tao,Yibo Yan,Chenxi Qin,Weiting Liu,Zhixin Lin,Hanqian Li,Yu Huang,Song Dai,Yonghua Hei,Yue Ding,Xiang Li,Shikang Wang,Chengdong Xu,Jingqi Liu,Xueying Ma,Zhiwen Zheng,Xiaofei Zhang,Bincheng Wang,Nichen Yang,Jie Wu,Lihua Tian,Chen Li,Xuming Hu

Main category: cs.CV

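TL;DR: This paper presents the AndroTMem framework, comprising the diagnostic benchmark AndroTMem-Bench and a new memory mechanism, ASM, which anchors key intermediate states to raise task completion rates for long-horizon Android GUI agents and substantially eases the interaction-memory bottleneck.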

Details Motivation: Existing GUI agents suffer interaction-memory failures on long-horizon tasks: full replay is redundant and noisy, while summaries tend to lose dependency-critical information and traceability. Method: Build the AndroTMem-Bench benchmark (1,069 tasks, 34,473 steps), focused on tasks with strong causal dependencies; propose Anchored State Memory (ASM), which replaces sequence replay or summaries with causally linked intermediate-state anchors and supports subgoal-targeted retrieval and attribution-aware decision making. Result: Evaluated on 12 GUI agents, ASM improves task completion rate (TCR) by 5%-30.16% and average memory score (AMS) by 4.93%-24.66% over full-sequence replay and summary-based baselines. Conclusion: Anchored, structured interaction memory (ASM) effectively addresses the memory bottleneck in long-horizon GUI tasks, with consistent and robust gains. Abstract: Long-horizon GUI agents are a key step toward real-world deployment, yet effective interaction memory under prevailing paradigms remains under-explored. Replaying full interaction sequences is redundant and amplifies noise, while summaries often erase dependency-critical information and traceability. We present AndroTMem, a diagnostic framework for anchored memory in long-horizon Android GUI agents. Its core benchmark, AndroTMem-Bench, comprises 1,069 tasks with 34,473 interaction steps (avg. 32.1 per task, max. 65). We evaluate agents with TCR (Task Complete Rate), focusing on tasks whose completion requires carrying forward critical intermediate state; AndroTMem-Bench is designed to enforce strong step-to-step causal dependencies, making sparse yet essential intermediate states decisive for downstream actions and centering interaction memory in evaluation. Across open- and closed-source GUI agents, we observe a consistent pattern: as interaction sequences grow longer, performance drops are driven mainly by within-task memory failures, not isolated perception errors or local action mistakes. Guided by this diagnosis, we propose Anchored State Memory (ASM), which represents interaction sequences as a compact set of causally linked intermediate-state anchors to enable subgoal-targeted retrieval and attribution-aware decision making. 
Across multiple settings and 12 evaluated GUI agents, ASM consistently outperforms full-sequence replay and summary-based baselines, improving TCR by 5%-30.16% and AMS by 4.93%-24.66%, indicating that anchored, structured memory effectively mitigates the interaction-memory bottleneck in long-horizon GUI tasks. The code, benchmark, and related resources are publicly available at [https://github.com/CVC2233/AndroTMem](https://github.com/CVC2233/AndroTMem).

[88] SR-Nav: Spatial Relationships Matter for Zero-shot Object Goal Navigation

Leyuan Fang,Zan Mao,Zijing Wang,Yinlong Yan

Main category: cs.CV

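TL;DR: This paper proposes SR-Nav, a framework that uses a Dynamic Spatial Relationship Graph (DSRG) to model structured prior relationships between objects and regions; its relation-aware matching and dynamic relationship planning modules improve perception robustness and planning efficiency for zero-shot object-goal navigation under weak observations, reaching SOTA performance on HM3D.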

Details Motivation: Existing foundation-model-based methods for zero-shot object-goal navigation perform poorly under bad viewpoints or weak semantic cues because they lack reliable perception and reasoning; the inherent spatial relationships between objects and regions in a scene can serve as structured priors that help an agent infer target locations from partial observations. Method: Propose the Spatial Relation-aware Navigation (SR-Nav) framework: 1) build a Dynamic Spatial Relationship Graph (DSRG) that fuses foundation-model priors with real-time observations; 2) design a relation-aware matching module that replaces naive detection with relationship matching to make visual perception more robust; 3) introduce a dynamic relationship planning module that computes optimal paths from the DSRG on the fly, shrinking the search space and reducing redundant exploration. Result: On the HM3D dataset, SR-Nav achieves state-of-the-art performance in both success rate and navigation efficiency (e.g., SPL). Conclusion: Explicitly modeling and exploiting spatial-relationship priors markedly improves the robustness and efficiency of zero-shot object-goal navigation under challenging observation conditions, confirming the value of structured scene knowledge for embodied navigation. Abstract: Zero-shot object-goal navigation aims to find target objects in unseen environments using only egocentric observation. Recent methods leverage foundation models' comprehension and reasoning capabilities to enhance navigation performance. However, when faced with poor viewpoints or weak semantic cues, foundation models often fail to support reliable reasoning in both perception and planning, resulting in inefficient or failed navigation. We observe that inherent relationships among objects and regions encode structured scene priors, which help agents infer plausible target locations even under partial observations. Motivated by this insight, we propose Spatial Relation-aware Navigation (SR-Nav), a framework that models both observed and experience-based spatial relationships to enhance both perception and planning. Specifically, SR-Nav first constructs a Dynamic Spatial Relationship Graph (DSRG) that encodes the target-centered spatial relationships through the foundation models and updates dynamically with real-time observations. We then introduce a Relation-aware Matching Module. It utilizes relationship matching instead of naive detection, leveraging diverse relationships in the DSRG to verify and correct errors, enhancing visual perception robustness. Finally, we design a Dynamic Relationship Planning Module to reduce the planning search space by dynamically computing the optimal paths based on the DSRG from the current position, thereby guiding planning and reducing exploration redundancy. 
Experiments on HM3D show that our method achieves state-of-the-art performance in both success rate and navigation efficiency. The code will be publicly available at https://github.com/Mzyw-1314/SR-Nav

Arushi Rai,Adriana Kovashka

Main category: cs.CV

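TL;DR: This paper proposes a self-consistency objective that needs no extra annotation: it forces related tasks (such as generation and verification) to attend to the same frames, improving the temporal grounding of video LLMs on sports coaching tasks.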

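Details Motivation: Video large language models (Video-LLMs) often attend to irrelevant frames, which is especially harmful for sports coaching tasks that need precise temporal grounding; frame-level supervision is hard to obtain because human annotation is expensive and labels generated by other models are unreliable. Method: Building on the observation that related tasks (generation and verification) should attend to the same keyframes, design a self-consistency objective over visual attention maps, validated and refined on the VidDiffBench benchmark. Result: On the three sports coaching tasks Exact, FitnessQA, and ExpertAF, the method improves over supervised fine-tuning by +3.0% and +14.1% accuracy and +0.9 BERTScore, respectively, and surpasses some closed-source models. Conclusion: A self-consistency attention constraint that requires no frame-level annotation effectively reduces temporal grounding errors in Video-LLMs and has clear practical value for fine-grained temporal tasks such as sports coaching. Abstract: Video-LLMs often attend to irrelevant frames, which is especially detrimental for sports coaching tasks requiring precise temporal grounding. Yet obtaining frame-level supervision is challenging: expensive to collect from humans and unreliable from other models. We improve temporal grounding without additional annotations by exploiting the observation that related tasks, such as generation and verification, must attend to the same frames. We enforce this via a self-consistency objective over select visual attention maps of tightly-related tasks. Using VidDiffBench, which provides ground-truth keyframe annotations, we first validate that attention misallocation is a significant bottleneck. We then show that training with our objective yields gains of +3.0%, +14.1% accuracy and +0.9 BERTScore over supervised finetuning across three sports coaching tasks: Exact, FitnessQA, and ExpertAF, even surpassing closed-source models.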
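
One way to picture a self-consistency objective over attention maps is as a divergence between the per-frame attention profiles of two related tasks. The sketch below uses the Jensen-Shannon divergence purely as an assumption; the abstract does not say which divergence, which attention heads, or which layers the authors use, and the function names and example values here are hypothetical.

```python
import math

def normalize(weights):
    """Scale non-negative per-frame attention weights to sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions, smoothed by eps."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def attention_consistency(attn_a, attn_b):
    """Jensen-Shannon divergence between two per-frame attention profiles;
    zero when both tasks distribute attention over frames identically."""
    p, q = normalize(attn_a), normalize(attn_b)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical attention mass per frame for a generation and a verification pass.
generation_attn = [0.1, 0.6, 0.2, 0.1]
verification_attn = [0.4, 0.1, 0.1, 0.4]
penalty = attention_consistency(generation_attn, verification_attn)
```

A symmetric divergence is used here so that neither task is treated as the reference; added to a training loss, such a term would penalize the two passes for looking at different frames.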

[90] Interpretable Prostate Cancer Detection using a Small Cohort of MRI Images

Vahid Monfared,Mohammad Hadi Gharib,Ali Sabri,Maryam Shahali,Farid Rashidi,Amit Mehta,Reza Rawassizadeh

Main category: cs.CV

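TL;DR: This paper presents an interpretable framework for automatic prostate cancer detection from a small set of T2-weighted MRI images, using transfer learning and data augmentation to offset data scarcity; a transfer-learned ResNet18 achieves the best performance with few parameters (90.9% accuracy, 95.2% sensitivity), and HOG+SVM also performs well on the small dataset; the method needs only single-modality T2 images, performs competitively with existing methods that depend on multiple modalities and large cohorts, and shows higher sensitivity than radiologists in a reader study.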

Details Motivation: Prostate cancer is a leading cause of death in men, but lesions in T2-weighted MRI are subtle and heterogeneous, making manual reading difficult; large annotated datasets are also lacking clinically, so efficient, interpretable AI methods suited to small, single-modality data are needed. Method: Using transfer learning and data augmentation on a small dataset of only 162 T2-weighted images (102 cancer, 60 normal), systematically evaluate Vision Transformers (ViT, Swin), a CNN (ResNet18), and classical methods (logistic regression, SVM, HOG+SVM); validate with quantitative metrics and a blinded radiologist reader study. Result: Transfer-learned ResNet18 achieves the best performance (90.9% accuracy, 95.2% sensitivity, AUC 0.905) with only 11M parameters; HOG+SVM reaches an AUC of 0.917, close to the deep models; the AI model's sensitivity is markedly higher than the mean of five radiologists (67.5%, Fleiss Kappa = 0.524). Conclusion: For small, single-modality T2 MRI data, a lightweight transfer-learned CNN outperforms more complex ViTs, and classical handcrafted features remain competitive; the method has potential for clinical use in assisting screening, reducing missed cancers, and improving reading consistency. Abstract: Prostate cancer is a leading cause of mortality in men, yet interpretation of T2-weighted prostate MRI remains challenging due to subtle and heterogeneous lesions. We developed an interpretable framework for automatic cancer detection using a small dataset of 162 T2-weighted images (102 cancer, 60 normal), addressing data scarcity through transfer learning and augmentation. We performed a comprehensive comparison of Vision Transformers (ViT, Swin), CNNs (ResNet18), and classical methods (Logistic Regression, SVM, HOG+SVM). Transfer-learned ResNet18 achieved the best performance (90.9% accuracy, 95.2% sensitivity, AUC 0.905) with only 11M parameters, while Vision Transformers showed lower performance despite substantially higher complexity. Notably, HOG+SVM achieved comparable accuracy (AUC 0.917), highlighting the effectiveness of handcrafted features in small datasets. Unlike state-of-the-art approaches relying on biparametric MRI (T2+DWI) and large cohorts, our method achieves competitive performance using only T2-weighted images, reducing acquisition complexity and computational cost. In a reader study of 22 cases, five radiologists achieved a mean sensitivity of 67.5% (Fleiss Kappa = 0.524), compared to 95.2% for the AI model, suggesting potential for AI-assisted screening to reduce missed cancers and improve consistency. Code and data are publicly available.

[91] Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images

Kazuya Nishimura,Ryoma Bise,Shinnosuke Matsuo,Haruka Hirose,Yasuhiro Kojima

Main category: cs.CV

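TL;DR: This paper proposes CPNN, a cell-type prototype-informed neural network that estimates cell-type prototypes from single-cell RNA-sequencing data and learns cell-composition weights from pathology images, giving more accurate and interpretable predictions of slide-level and spatial-transcriptomics-level gene expression.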

Details Motivation: Existing methods treat gene expression as a slide- or spot-level signal, ignoring that it arises from the aggregation of cell-level expression, and therefore lack cell-resolved biological guidance. Method: Propose the Cell-type Prototype-informed Neural Network (CPNN): first estimate stable, robust cell-type prototypes (mean expression profiles) from public single-cell RNA-seq data, then learn cell-type compositional weights directly from pathology images and model the relationship between the prototypes and bulk or spatial expression. Result: On three slide-level datasets and three patch-level spatial transcriptomics datasets, CPNN achieves the best Spearman correlation in every setting; visualizing the inferred compositional weights yields interpretable biological insight. Conclusion: By introducing cell-type prototypes as a biological prior, CPNN delivers more accurate, structured, and interpretable gene expression prediction and offers a new paradigm for integrating digital pathology with multi-omics. Abstract: Estimating slide- and patch-level gene expression profiles from pathology images enables rapid and low-cost molecular analysis with broad clinical impact. Despite strong results, existing approaches treat gene expression as a mere slide- or spot-level signal and do not incorporate the fact that the measured expression arises from the aggregation of underlying cell-level expression. To explicitly introduce this missing cell-resolved guidance, we propose a Cell-type Prototype-informed Neural Network (CPNN) that leverages publicly available single-cell RNA-sequencing datasets. Since single-cell measurements are noisy and not paired with histology images, we first estimate cell-type prototypes, i.e., mean expression profiles that reflect stable gene-gene co-variation patterns. CPNN then learns cell-type compositional weights directly from images and models the relationship between prototypes and observed bulk or spatial expression, providing a biologically grounded and structurally regularized prediction framework. We evaluate CPNN on three slide-level datasets and three patch-level spatial transcriptomics datasets. Across all settings, CPNN achieves the highest performance in terms of Spearman correlation. Moreover, by visualizing the inferred compositional weights, our framework provides interpretable insights into which cell types drive the predicted expression. Code is publicly available at https://github.com/naivete5656/CPNN.
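
The idea of predicting expression from cell-type prototypes and image-derived compositional weights can be pictured with a simple mixture. The sketch below assumes the simplest possible relationship, a softmax-weighted sum of prototypes; the abstract does not state that CPNN uses exactly this form, and the function names, logits, and prototype values are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores to non-negative weights that sum to 1."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def mix_prototypes(logits, prototypes):
    """Predict a gene-expression vector as a convex combination of
    cell-type prototypes (one mean expression profile per cell type)."""
    weights = softmax(logits)
    n_genes = len(prototypes[0])
    return [sum(w * proto[g] for w, proto in zip(weights, prototypes))
            for g in range(n_genes)]

# Hypothetical example: two cell types, two genes, equal image-derived logits.
prototypes = [[1.0, 2.0],   # cell type A
              [3.0, 6.0]]   # cell type B
predicted = mix_prototypes([0.0, 0.0], prototypes)
```

Because the weights are constrained to a simplex, they can be read directly as estimated cell-type proportions, which is what makes this kind of formulation interpretable.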

[92] MedQ-UNI: Toward Unified Medical Image Quality Assessment and Restoration via Vision-Language Modeling

Jiyao Liu,Junzhi Ning,Wanying Qu,Lihao Liu,Chenglong Ma,Junjun He,Ningsheng Xu

Main category: cs.CV

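TL;DR: This paper proposes MedQ-UNI, a model that unifies medical image quality assessment (Med-IQA) and medical image restoration (Med-IR) under an assess-then-restore paradigm, enabling general-purpose medical image restoration across modalities and degradation types.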

Details Motivation: Existing medical image restoration methods are mostly specific to one modality or degradation type and generalize poorly, mainly because Med-IR is disconnected from Med-IQA and lacks an explicit understanding of image quality. Method: Propose MedQ-UNI, a vision-language autoregressive dual-expert architecture with shared attention, comprising a quality assessment expert (which generates structured natural language descriptions) and a restoration expert (which performs targeted restoration conditioned on those descriptions); build a multimodal, multi-task dataset of 50K paired samples and a 2K-sample evaluation benchmark. Result: A single MedQ-UNI model reaches SOTA performance on all tasks without task-specific adaptation while generating better quality descriptions, confirming that explicit quality understanding improves restoration fidelity and interpretability. Conclusion: Explicitly integrating Med-IQA into the Med-IR pipeline improves generalization, restoration quality, and interpretability, offering a new paradigm for general-purpose medical image restoration. Abstract: Existing medical image restoration (Med-IR) methods are typically modality-specific or degradation-specific, failing to generalize across the heterogeneous degradations encountered in clinical practice. We argue this limitation stems from the isolation of Med-IR from medical image quality assessment (Med-IQA), as restoration models without explicit quality understanding struggle to adapt to diverse degradation types across modalities. To address these challenges, we propose MedQ-UNI, a unified vision-language model that follows an assess-then-restore paradigm, explicitly leveraging Med-IQA to guide Med-IR across arbitrary modalities and degradation types. MedQ-UNI adopts a multimodal autoregressive dual-expert architecture with shared attention: a quality assessment expert first identifies degradation issues through structured natural language descriptions, and a restoration expert then conditions on these descriptions to perform targeted image restoration. To support this paradigm, we construct a large-scale dataset of approximately 50K paired samples spanning three imaging modalities and five restoration tasks, each annotated with structured quality descriptions for joint Med-IQA and Med-IR training, along with a 2K-sample benchmark for evaluation. Extensive experiments demonstrate that a single MedQ-UNI model, without any task-specific adaptation, achieves state-of-the-art restoration performance across all tasks while generating superior descriptions, confirming that explicit quality understanding meaningfully improves restoration fidelity and interpretability.

[93] Recolour What Matters: Region-Aware Colour Editing via Token-Level Diffusion

Yuqi Yang,Dongliang Chang,Yijia Ling,Ruoyi Du,Zhanyu Ma

Main category: cs.CV

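TL;DR: ColourCrafter is a diffusion-based framework for fine-grained, region-aware colour editing; it fuses RGB colour tokens with image tokens in latent space and adds a perceptual Lab-space loss, achieving precise, controllable local colour edits while preserving structural integrity.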

Details Motivation: Existing text-driven colour editing methods cannot accurately express continuous chromatic variation, so results drift from the target hue, especially in fine-grained and local editing tasks. Method: Propose the ColourCrafter framework: 1) token-level fusion of RGB colour tokens and image tokens in latent space; 2) a perceptual Lab-space loss that decouples luminance from chrominance and constrains edits to masked regions; 3) a large-scale, high-quality dataset, ColourfulSet. Result: The method reaches SOTA performance on fine-grained colour editing, markedly improving colour accuracy, controllability, and perceptual fidelity. Conclusion: ColourCrafter turns global tone transfer into a structured, region-aware generation process, effectively addressing continuous chromatic control and offering a new paradigm for controllable image editing. Abstract: Colour is one of the most perceptually salient yet least controllable attributes in image generation. Although recent diffusion models can modify object colours from user instructions, their results often deviate from the intended hue, especially for fine-grained and local edits. Early text-driven methods rely on discrete language descriptions that cannot accurately represent continuous chromatic variations. To overcome this limitation, we propose ColourCrafter, a unified diffusion framework that transforms colour editing from global tone transfer into a structured, region-aware generation process. Unlike traditional colour-driven methods, ColourCrafter performs token-level fusion of RGB colour tokens and image tokens in latent space, selectively propagating colour information to semantically relevant regions while preserving structural fidelity. A perceptual Lab-space loss further enhances pixel-level precision by decoupling luminance and chrominance and constraining edits within masked areas. Additionally, we build ColourfulSet, a large-scale dataset of high-quality image pairs with continuous and diverse colour variations. Extensive experiments demonstrate that ColourCrafter achieves state-of-the-art colour accuracy, controllability and perceptual fidelity in fine-grained colour editing. Our project is available at https://yangyuqi317.github.io/ColourCrafter.github.io/.

[94] Do Vision Language Models Understand Human Engagement in Games?

Ziyi Wang,Qizan Guo,Rishitosh Singh,Xiyang Hu

Main category: cs.CV

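TL;DR: This paper evaluates whether vision-language models (VLMs) can infer player engagement from gameplay video; zero-shot predictions are weak and theory-guided prompts bring no significant improvement, revealing a gap in current VLMs between perceiving and understanding human psychological states.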

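Details Motivation: Investigate whether vision-language models can reliably infer players' latent psychological states in games (such as engagement) from visual cues alone, in support of game design and user-experience research. Method: On the GameVibe Few-Shot dataset (covering nine FPS games), evaluate three VLMs under six prompting strategies, including zero-shot prediction, theory-guided prompts grounded in Flow, GameFlow, Self-Determination Theory, and the MDA framework, and retrieval-augmented prompting; the tasks are pointwise engagement prediction per window and pairwise prediction of engagement change between consecutive windows. Result: Zero-shot VLM predictions are generally weaker than per-game majority-class baselines; memory- or retrieval-augmented prompting improves pointwise prediction in some settings, but pairwise prediction remains difficult throughout; theory-guided prompting yields no stable gains and sometimes reinforces surface-level shortcuts. Conclusion: Current VLMs can recognize visible gameplay cues but remain significantly limited in robustly inferring deeper psychological states such as engagement across games, exposing a perception-understanding gap. Abstract: Inferring human engagement from gameplay video is important for game design and player-experience research, yet it remains unclear whether vision--language models (VLMs) can infer such latent psychological states from visual cues alone. Using the GameVibe Few-Shot dataset across nine first-person shooter games, we evaluate three VLMs under six prompting strategies, including zero-shot prediction, theory-guided prompts grounded in Flow, GameFlow, Self-Determination Theory, and MDA, and retrieval-augmented prompting. We consider both pointwise engagement prediction and pairwise prediction of engagement change between consecutive windows. Results show that zero-shot VLM predictions are generally weak and often fail to outperform simple per-game majority-class baselines. Memory- or retrieval-augmented prompting improves pointwise prediction in some settings, whereas pairwise prediction remains consistently difficult across strategies. Theory-guided prompting alone does not reliably help and can instead reinforce surface-level shortcuts. These findings suggest a perception--understanding gap in current VLMs: although they can recognize visible gameplay cues, they still struggle to robustly infer human engagement across games.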

[95] T-QPM: Enabling Temporal Out-Of-Distribution Detection and Domain Generalization for Vision-Language Models in Open-World

Aditi Naiknaware,Salimeh Sekeh

Main category: cs.CV

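TL;DR: This paper proposes T-QPM, a new multimodal OOD detection framework for dynamic environments; it models cross-modal consistency, learns lightweight temporal fusion weights, and adds ATC regularization, markedly improving robustness under temporal drift and covariate shift.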

Details Motivation: Existing OOD detection methods based on VLMs such as CLIP rely on fixed fusion rules and assume a static environment, so they cope poorly with temporal drift and covariate shift. Method: Propose the two-step Temporal Quadruple-Pattern Matching (T-QPM) framework: the first step builds cross-modal consistency patterns between ID/OOD images and text; the second step learns lightweight fusion weights that combine semantic matching with visual typicality, with explicit regularization based on Average Thresholded Confidence (ATC) to ensure temporal stability. Result: On temporally partitioned benchmarks the method significantly outperforms static baselines, showing stronger robustness and temporal consistency in non-stationary environments. Conclusion: T-QPM offers a more robust, adaptive paradigm for multimodal OOD detection in dynamic, non-stationary open-world settings. Abstract: Out-of-distribution (OOD) detection remains a critical challenge in open-world learning, where models must adapt to evolving data distributions. While recent vision-language models (VLMs) like CLIP enable multimodal OOD detection through Dual-Pattern Matching (DPM), existing methods typically suffer from two major shortcomings: (1) they rely on fixed fusion rules and assume static environments, failing under temporal drift; and (2) they lack robustness against covariate-shifted inputs. In this paper, we propose a novel two-step framework to enhance OOD detection and covariate distribution shift robustness in dynamic settings. We extend the dual-pattern regime into Temporal Quadruple-Pattern Matching (T-QPM). First, by pairing OOD images with text descriptions, we introduce cross-modal consistency patterns between ID and OOD signals, refining the decision boundary through joint image-text reasoning. Second, we address temporal distribution shifts by learning lightweight fusion weights to optimally combine semantic matching and visual typicality. To ensure stability, we enforce explicit regularization based on Average Thresholded Confidence (ATC), preventing performance degradation as distributions evolve. Experiments on temporally partitioned benchmarks demonstrate that our approach significantly outperforms static baselines, offering a robust, temporally-consistent framework for multimodal OOD detection in non-stationary environments.

[96] TexEditor: Structure-Preserving Text-Driven Texture Editing

Bo Zhao,Yihang Liu,Chenfeng Zhang,Huan Yang,Kun Gai,Wei Ji

Main category: cs.CV

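TL;DR: This paper proposes TexEditor, a model dedicated to text-guided texture editing; a high-quality SFT dataset, TexBlender, and a reinforcement-learning-based structure-preserving training method, StructureNFT, substantially improve geometric consistency during editing, and a new benchmark, TexBench, enables more complete evaluation in real-world scenes.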

Details Motivation: Existing SOTA text-guided texture editing models struggle to preserve structural consistency even though the intended edits change appearance only. Method: Build TexBlender, a high-quality Blender-based SFT dataset; propose StructureNFT, a reinforcement-learning-based method that transfers structural priors from synthetic data to real-world scenes; design a new benchmark, TexBench, for real-world evaluation. Result: TexEditor outperforms strong baselines such as Nano Banana Pro on multiple Blender-based benchmarks and on the authors' TexBench, and also generalizes well on the general image editing benchmark ImgEdit. Conclusion: Jointly optimizing data and training strategy improves structure preservation in texture editing, and TexEditor provides a more robust, practical solution for the task. Abstract: Text-guided texture editing aims to modify object appearance while preserving the underlying geometric structure. However, our empirical analysis reveals that even SOTA editing models frequently struggle to maintain structural consistency during texture editing, despite the intended changes being purely appearance-related. Motivated by this observation, we jointly enhance structure preservation from both data and training perspectives, and build TexEditor, a dedicated texture editing model based on Qwen-Image-Edit-2509. Firstly, we construct TexBlender, a high-quality SFT dataset generated with Blender, which provides strong structural priors for a cold start. Secondly, we introduce StructureNFT, an RL-based approach that integrates structure-preserving losses to transfer the structural priors learned during SFT to real-world scenes. Moreover, due to the limited realism and evaluation coverage of existing benchmarks, we introduce TexBench, a general-purpose real-world benchmark for text-guided texture editing. Extensive experiments on existing Blender-based texture benchmarks and our TexBench show that TexEditor consistently outperforms strong baselines such as Nano Banana Pro. In addition, we assess TexEditor on the general-purpose benchmark ImgEdit to validate its generalization. Our code and data are available at https://github.com/KlingAIResearch/TexEditor.

[97] FILT3R: Latent State Adaptive Kalman Filter for Streaming 3D Reconstruction

Seonghyun Jin,Jong Chul Ye

Main category: cs.CV

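TL;DR: This paper proposes FILT3R, a training-free latent-space filtering layer that models recurrent state updates as stochastic state estimation in token space; it estimates process noise online and adaptively computes a Kalman-style gain to balance memory retention against new observations, markedly improving the stability of long-horizon 3D reconstruction.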

Details Motivation: In streaming 3D reconstruction, existing state update strategies (aggressive overwriting or conservative updating) become unstable beyond the training horizon and lack a robust mechanism for dynamically trading off history against new evidence. Method: FILT3R treats the state update as stochastic state estimation in token space, maintains a variance per token, estimates process noise online from EMA-normalized temporal drift, and computes an adaptive Kalman gain to fuse history with information from new frames. Result: FILT3R shows better long-horizon stability on depth, pose, and 3D reconstruction; its gain shrinks as evidence accumulates and rises under genuine scene change, and the rule is interpretable, plug-in, and covers common overwrite and gating policies as special cases. Conclusion: FILT3R provides a general, training-free, theoretically grounded state update mechanism that mitigates long-term instability in streaming 3D reconstruction and improves robustness and generalization in practical deployment. Abstract: Streaming 3D reconstruction maintains a persistent latent state that is updated online from incoming frames, enabling constant-memory inference. A key failure mode is the state update rule: aggressive overwrites forget useful history, while conservative updates fail to track new evidence, and both behaviors become unstable beyond the training horizon. To address this challenge, we propose FILT3R, a training-free latent filtering layer that casts recurrent state updates as stochastic state estimation in token space. FILT3R maintains a per-token variance and computes a Kalman-style gain that adaptively balances memory retention against new observations. Process noise -- governing how much the latent state is expected to change between frames -- is estimated online from EMA-normalized temporal drift of candidate tokens. Using extensive experiments, we demonstrate that FILT3R yields an interpretable, plug-in update rule that generalizes common overwrite and gating policies as special cases. Specifically, we show that gains shrink in stable regimes as uncertainty contracts with accumulated evidence, and rise when genuine scene change increases process uncertainty, improving long-horizon stability for depth, pose, and 3D reconstruction, compared to the existing methods. Code will be released at https://github.com/jinotter3/FILT3R.
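
The Kalman-style gain mentioned in the abstract above follows the textbook scalar Kalman update, which can be sketched per token as below. This is an illustration of the standard filter equations, not FILT3R's code: the function name, the scalar per-token variance, and the noise values are assumptions, and FILT3R's online estimate of process noise from EMA-normalized drift is replaced here by a plain argument.

```python
def kalman_token_update(state, variance, candidate, process_noise, obs_noise):
    """One scalar-variance Kalman step for a single token.

    state, candidate: lists of floats (token embedding and its new observation).
    variance: current uncertainty of the stored token.
    Returns (new_state, new_variance, gain).
    """
    prior_var = variance + process_noise        # uncertainty grows between frames
    gain = prior_var / (prior_var + obs_noise)  # in [0, 1]: trust in the new frame
    new_state = [s + gain * (c - s) for s, c in zip(state, candidate)]
    new_var = (1.0 - gain) * prior_var          # uncertainty contracts after fusing
    return new_state, new_var, gain

# Hypothetical values: a stable scene (zero process noise) observed twice.
state, var = [0.0, 0.0], 1.0
state, var, gain_1 = kalman_token_update(state, var, [1.0, 2.0], 0.0, 1.0)
state, var, gain_2 = kalman_token_update(state, var, [1.0, 2.0], 0.0, 1.0)
```

With `obs_noise = 0` the gain is 1 and the update reduces to a full overwrite, while a gain near 0 keeps the stored state, which matches the abstract's remark that overwrite and gating policies appear as special cases.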

[98] NymeriaPlus: Enriching Nymeria Dataset with Additional Annotations and Data

Daniel DeTone,Federica Bogo,Eric-Tuan Le,Duncan Frost,Julian Straub,Yawar Siddiqui,Yuting Ye,Jakob Engel,Richard Newcombe,Lingni Ma

Main category: cs.CV

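TL;DR: This paper introduces NymeriaPlus, an upgraded version of the Nymeria dataset with improved human motion modeling, dense 3D/2D annotations of indoor objects and structures, instance-level 3D object reconstructions, and new modalities such as basemap recordings, audio, and wristband videos, aiming to build a stronger in-the-wild egocentric multimodal benchmark and advance multimodal learning research for embodied AI.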

Details Motivation: Existing egocentric datasets fall short in multimodal fusion, fine-grained spatial annotation, and real-world scene coverage, making it hard to support embodied AI in modeling complex human-object-environment interactions. Method: Starting from the original Nymeria dataset, improve the human motion representation (MHR/SMPL), add dense 3D/2D bounding box annotations, generate instance-level 3D object reconstructions, and integrate new modalities such as basemap recordings, audio, and wristband videos to build a unified, coherent NymeriaPlus benchmark. Result: The NymeriaPlus dataset is released, with more accurate human poses, rich structured annotations of indoor scenes, high-quality 3D object reconstructions, and heterogeneous multi-source modalities. Conclusion: NymeriaPlus substantially improves the expressiveness and usefulness of in-the-wild egocentric datasets and is expected to fill a key gap in current egocentric resources and promote multimodal embodied intelligence research. Abstract: The Nymeria Dataset, released in 2024, is a large-scale collection of in-the-wild human activities captured with multiple egocentric wearable devices that are spatially localized and temporally synchronized. It provides body-motion ground truth recorded with a motion-capture suit, device trajectories, semi-dense 3D point clouds, and in-context narrations. In this paper, we upgrade Nymeria and introduce NymeriaPlus. NymeriaPlus features: (1) improved human motion in Momentum Human Rig (MHR) and SMPL formats; (2) dense 3D and 2D bounding box annotations for indoor objects and structural elements; (3) instance-level 3D object reconstructions; and (4) additional modalities, e.g., basemap recordings, audio, and wristband videos. By consolidating these complementary modalities and annotations into a single, coherent benchmark, NymeriaPlus strengthens Nymeria into a more powerful in-the-wild egocentric dataset. We expect NymeriaPlus to bridge a key gap in existing egocentric resources and to support a broader range of research, including unique explorations of multimodal learning for embodied AI.

[99] Efficient Video Diffusion with Sparse Information Transmission for Video Compression

Mingde Zhou,Zheng Chen,Yulun Zhang

Main category: cs.CV

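TL;DR: This paper proposes Diff-SIT, which combines a Sparse Temporal Encoding Module (STEM) with a one-step video diffusion model with frame type embedding (ODFTE) to markedly improve the perceptual quality and temporal consistency of video compression at ultra-low bitrates.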

Details Motivation: Traditional end-to-end video compression models produce blurry reconstructions with poor perceptual quality at ultra-low bitrates; existing generative compression methods often ignore inter-frame temporal correlation, leading to temporal incoherence and low efficiency. Method: Propose the Diff-SIT framework with two parts: 1) the Sparse Temporal Encoding Module (STEM), which sparsely encodes the original frame sequence into an information-dense intermediate sequence to save bitrate; 2) the one-step video diffusion model ODFTE, which uses a Frame Type Embedder (FTE) to reconstruct adaptively by frame type and fully exploits temporal correlation. Result: Experiments on multiple datasets show that Diff-SIT significantly outperforms existing methods at ultra-low bitrates and sets a new SOTA in perceptual quality and temporal consistency. Conclusion: Through sparse encoding and frame-aware diffusion modeling, Diff-SIT balances bitrate, perceptual quality, and temporal consistency, offering a new paradigm for ultra-low-bitrate video compression. Abstract: Video compression aims to maximize reconstruction quality with minimal bitrates. Beyond standard distortion metrics, perceptual quality and temporal consistency are also critical. However, at ultra-low bitrates, traditional end-to-end compression models tend to produce blurry images of poor perceptual quality. Besides, existing generative compression methods often treat video frames independently and show limitations in time coherence and efficiency. To address these challenges, we propose the Efficient Video Diffusion with Sparse Information Transmission (Diff-SIT), which comprises the Sparse Temporal Encoding Module (STEM) and the One-Step Video Diffusion with Frame Type Embedder (ODFTE). The STEM sparsely encodes the original frame sequence into an information-rich intermediate sequence, achieving significant bitrate savings. Subsequently, the ODFTE processes this intermediate sequence as a whole, which exploits the temporal correlation. During this process, our proposed Frame Type Embedder (FTE) guides the diffusion model to perform adaptive reconstruction according to different frame types to optimize the overall quality. Extensive experiments on multiple datasets demonstrate that Diff-SIT establishes a new state-of-the-art in perceptual quality and temporal consistency, particularly in the challenging ultra-low-bitrate regime. Code is released at https://github.com/MingdeZhou/Diff-SIT.

[100] HOMEY: Heuristic Object Masking with Enhanced YOLO for Property Insurance Risk Detection

Teerapong Panboonyuen

Main category: cs.CV

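TL;DR: HOMEY is a new detection framework that combines YOLO with a domain-specific masking mechanism and a custom loss function to identify property risks automatically, markedly improving detection accuracy and reliability.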

Details Motivation: Automated property risk detection is a high-impact but underexplored area of computer vision with direct implications for real estate, underwriting, and insurance operations. Method: Proposes the HOMEY framework, which combines YOLO, a heuristic object masking mechanism, and risk-aware loss calibration, trained on 17 risk-related property classes. Result: Experiments on real-world property imagery show that HOMEY achieves better detection accuracy and reliability than baseline YOLO models while retaining fast inference. Conclusion: HOMEY not only delivers efficient risk detection but also supports interpretable, low-cost risk analysis, laying the foundation for scalable AI-driven property insurance workflows. Abstract: Automated property risk detection is a high-impact yet underexplored frontier in computer vision with direct implications for real estate, underwriting, and insurance operations. We introduce HOMEY (Heuristic Object Masking with Enhanced YOLO), a novel detection framework that combines YOLO with a domain-specific masking mechanism and a custom-designed loss function. HOMEY is trained to detect 17 risk-related property classes, including structural damages (e.g., cracked foundations, roof issues), maintenance neglect (e.g., dead yards, overgrown bushes), and liability hazards (e.g., falling gutters, garbage, hazard signs). Our approach introduces heuristic object masking to amplify weak signals in cluttered backgrounds and risk-aware loss calibration to balance class skew and severity weighting. Experiments on real-world property imagery demonstrate that HOMEY achieves superior detection accuracy and reliability compared to baseline YOLO models, while retaining fast inference. Beyond detection, HOMEY enables interpretable and cost-efficient risk analysis, laying the foundation for scalable AI-driven property insurance workflows.

[101] From Snapshots to Symphonies: The Evolution of Protein Prediction from Static Structures to Generative Dynamics and Multimodal Interactions

Jingzhi Chen,Lijian Xu

Main category: cs.CV

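TL;DR: This review surveys the paradigm shift of artificial intelligence in protein science across five dimensions (multimodal representations, refinement of static structure prediction, generative modeling of conformations, prediction of heterogeneous interactions, and functional inference) and identifies current bottlenecks and future directions.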

Details Motivation: Artificial intelligence has fundamentally transformed the protein folding problem, creating a need to chart systematically the shift from static structure prediction toward modeling dynamic conformational ensembles and complex biomolecular interactions. Method: A systematic literature review organized around five interconnected dimensions: unified multimodal representations; MSA-free static prediction architectures and all-atom complex modeling; generative frameworks based on diffusion and flow matching; prediction of heterogeneous molecular interactions; and function-oriented fitness landscapes and text-guided property prediction. Result: Clarifies the key advances and technical paths of AI-driven protein science and identifies core bottlenecks, including data bias, limited interpretability, and the disconnect between geometric metrics and physical reality. Conclusion: AI is evolving from a structural analysis tool into a universal simulator capable of understanding and rewriting the dynamic language of life; future work needs physically consistent generative models, multimodal foundation architectures, and experimental closed-loop systems. Abstract: The protein folding problem has been fundamentally transformed by artificial intelligence, evolving from static structure prediction toward the modeling of dynamic conformational ensembles and complex biomolecular interactions. This review systematically examines the paradigm shift in AI-driven protein science across five interconnected dimensions: unified multimodal representations that integrate sequences, geometries, and textual knowledge; refinement of static prediction through MSA-free architectures and all-atom complex modeling; generative frameworks, including diffusion models and flow matching, that capture conformational distributions consistent with thermodynamic ensembles; prediction of heterogeneous interactions spanning protein-ligand, protein-nucleic acid, and protein-protein complexes; and functional inference of fitness landscapes, mutational effects, and text-guided property prediction. We critically analyze current bottlenecks, including data distribution biases, limited mechanistic interpretability, and the disconnect between geometric metrics and biophysical reality, while identifying future directions toward physically consistent generative models, multimodal foundation architectures, and experimental closed-loop systems. This methodological transformation marks artificial intelligence's transition from a structural analysis tool into a universal simulator capable of understanding and ultimately rewriting the dynamic language of life.

[102] Foundations and Architectures of Artificial Intelligence for Motor Insurance

Teerapong Panboonyuen

Main category: cs.CV

TL;DR: This handbook introduces a vertically integrated AI paradigm for motor insurance, featuring domain-adapted transformer architectures for vehicle damage analysis, claims evaluation, and underwriting, all deployed in real-world Thai insurance systems with emphasis on MLOps and production reliability.

Details Motivation: To bridge the gap between cutting-edge AI research and reliable, large-scale industrial deployment in high-stakes motor insurance—particularly addressing practical constraints in real-world systems like those in Thailand. Method: Develops domain-adapted transformer architectures for structured visual understanding, relational vehicle representation learning, and multimodal document intelligence; integrates them into a scalable, production-aware pipeline; and couples algorithmic innovation with co-evolved MLOps practices. Result: End-to-end automation of vehicle damage analysis, claims evaluation, and underwriting workflows within nationwide motor insurance infrastructure in Thailand, demonstrating robustness and scalability under real-world constraints. Conclusion: A vertically integrated, domain-specific AI stack—combining tailored models, multimodal reasoning, and production-grade MLOps—enables trustworthy, deployable AI for motor insurance, setting a blueprint for AI in regulated, high-impact industries. Abstract: This handbook presents a systematic treatment of the foundations and architectures of artificial intelligence for motor insurance, grounded in large-scale real-world deployment. It formalizes a vertically integrated AI paradigm that unifies perception, multimodal reasoning, and production infrastructure into a cohesive intelligence stack for automotive risk assessment and claims processing. At its core, the handbook develops domain-adapted transformer architectures for structured visual understanding, relational vehicle representation learning, and multimodal document intelligence, enabling end-to-end automation of vehicle damage analysis, claims evaluation, and underwriting workflows. These components are composed into a scalable pipeline operating under practical constraints observed in nationwide motor insurance systems in Thailand. 
Beyond model design, the handbook emphasizes the co-evolution of learning algorithms and MLOps practices, establishing a principled framework for translating modern artificial intelligence into reliable, production-grade systems in high-stakes industrial environments.

[103] OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting

Hongjia Zhai,Qi Zhang,Xiaokun Pan,Xiyu Zhang,Yitong Dong,Huaqi Zhang,Dan Xu,Guofeng Zhang

Main category: cs.CV

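TL;DR: This paper proposes OnlinePG, a system that combines 3D Gaussian Splatting with an online local-to-global mapping strategy to achieve real-time, online, open-vocabulary panoptic scene understanding and mapping.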

Details Motivation: Most existing methods are offline or lack instance-level understanding, so they fall short of the online, fine-grained environment perception that real-world robotic tasks require. Method: Adopts a sliding-window local-to-global paradigm; builds a 3D segment clustering graph that fuses geometric and semantic cues to achieve local consistency; updates the global map through explicit grids with spatial attributes and robust bidirectional bipartite 3D Gaussian instance matching; and uses VLM features inside the 3D spatial grids for open-vocabulary understanding. Result: On several widely used datasets, OnlinePG achieves the best performance among online methods while maintaining real-time efficiency. Conclusion: OnlinePG is the first to unify 3D Gaussian Splatting, online panoptic mapping, and open-vocabulary perception in an efficient real-time framework, significantly improving online environment understanding for embodied agents. Abstract: Open-vocabulary scene understanding with online panoptic mapping is essential for embodied applications to perceive and interact with environments. However, existing methods are predominantly offline or lack instance-level understanding, limiting their applicability to real-world robotic tasks. In this paper, we propose OnlinePG, a novel and effective system that integrates geometric reconstruction and open-vocabulary perception using 3D Gaussian Splatting in an online setting. Technically, to achieve online panoptic mapping, we employ an efficient local-to-global paradigm with a sliding window. To build a locally consistent map, we construct a 3D segment clustering graph that jointly leverages geometric and semantic cues, fusing inconsistent segments within the sliding window into complete instances. Subsequently, to update the global map, we construct explicit grids with spatial attributes for the local 3D Gaussian map and fuse them into the global map via robust bidirectional bipartite 3D Gaussian instance matching. Finally, we utilize the fused VLM features inside the 3D spatial attribute grids to achieve open-vocabulary scene understanding. Extensive experiments on widely used datasets demonstrate that our method achieves better performance among online approaches, while maintaining real-time efficiency.

[104] CAFlow: Adaptive-Depth Single-Step Flow Matching for Efficient Histopathology Super-Resolution

Elad Yoshai,Ariel D. Yoshai,Natan T. Shaked

Main category: cs.CV

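TL;DR: This paper proposes CAFlow, an adaptive-depth single-step flow-matching super-resolution framework that dynamically routes each image tile to the shallowest effective network exit, sharply reducing computation while preserving reconstruction quality; it delivers efficient, high-quality super-resolution of whole-slide digital pathology images and validates usability on a downstream clinical task.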

Details Motivation: Whole-slide images in digital pathology routinely exceed gigapixel scale, and conventional generative super-resolution methods are too computationally expensive for practical deployment. Method: Proposes the CAFlow framework: (1) adaptive-depth single-step flow matching; (2) flow matching in pixel-unshuffled space to reduce spatial computation; (3) a lightweight exit classifier for dynamic routing; (4) the FlowResNet backbone, which mixes convolution and window self-attention and has four early exits; (5) exact t=0 conditioning for half of the training samples to secure single-step reconstruction quality. Result: On multi-organ histopathology x4 SR, adaptive routing reaches 31.72 dB PSNR (versus 31.84 dB at full depth); the shallowest exit exceeds bicubic interpolation by 1.9 dB at 2.8x less compute than SwinIR-light; quality drops by only 0.02 dB on held-out colon tissue; at x8 the method outperforms comparable-compute baselines and approaches the larger SwinIR-Medium; downstream nuclei segmentation confirms structural fidelity; training takes under 5 hours on a single GPU, and whole-slide inference drops from minutes to seconds. Conclusion: CAFlow strikes a good balance between computational efficiency and reconstruction quality, shows potential for clinical deployment, and offers a new paradigm for real-time super-resolution of high-resolution medical images. Abstract: In digital pathology, whole-slide images routinely exceed gigapixel resolution, making computationally intensive generative super-resolution (SR) impractical for routine deployment. We introduce CAFlow, an adaptive-depth single-step flow-matching framework that routes each image tile to the shallowest network exit that preserves reconstruction quality. CAFlow performs flow matching in pixel-unshuffled rearranged space, reducing spatial computation by 16x while enabling direct inference. We show that dedicating half of training to exact t=0 samples is essential for single-step quality (-1.5 dB without it). The backbone, FlowResNet (1.90M parameters), mixes convolution and window self-attention blocks across four early exits spanning 3.1 to 13.3 GFLOPs. A lightweight exit classifier (~6K parameters) achieves 33% compute savings at only 0.12 dB cost. On multi-organ histopathology x4 SR, adaptive routing achieves 31.72 dB PSNR versus 31.84 dB at full depth, while the shallowest exit exceeds bicubic by +1.9 dB at 2.8x less compute than SwinIR-light. The method generalizes to held-out colon tissue with minimal quality loss (-0.02 dB), and at x8 upscaling it outperforms all comparable-compute baselines while remaining competitive with the much larger SwinIR-Medium model. Downstream nuclei segmentation confirms preservation of clinically relevant structure. The model trains in under 5 hours on a single GPU, and adaptive routing can reduce whole-slide inference from minutes to seconds.
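To make the "16x less spatial computation" claim concrete, the following is a minimal pure-Python sketch of pixel unshuffling, not CAFlow's code. It assumes an unshuffle factor of 4 (the abstract states only the 16x figure, so r = 4 is an inference from 16 = 4 x 4), and the function name pixel_unshuffle and the toy image are illustrative.

```python
def pixel_unshuffle(image, r):
    """Rearrange a C x H x W nested list into (C*r*r) x (H//r) x (W//r).

    Every r x r spatial block is moved into the channel dimension, so no
    pixel is discarded; only the spatial grid shrinks by a factor of r*r.
    """
    channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    if height % r or width % r:
        raise ValueError("height and width must be divisible by r")
    out = []
    for c in range(channels):
        for dy in range(r):          # row offset inside each r x r block
            for dx in range(r):      # column offset inside each block
                out.append([
                    [image[c][y * r + dy][x * r + dx] for x in range(width // r)]
                    for y in range(height // r)
                ])
    return out


# Toy 1 x 8 x 8 image whose pixel value encodes its position (y * 8 + x).
toy = [[[y * 8 + x for x in range(8)] for y in range(8)]]
packed = pixel_unshuffle(toy, 4)  # r = 4 is an assumption, see lead-in

spatial_before = len(toy[0]) * len(toy[0][0])        # 64 positions
spatial_after = len(packed[0]) * len(packed[0][0])   # 4 positions
```

Layers that operate per spatial position then visit 16 times fewer positions, at the cost of 16 times more channels per position.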

[105] Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models

Liwei Che,Zhiyu Xue,Yihao Quan,Benlin Liu,Zeru Shi,Michelle Hurst,Jacob Feldman,Ruixiang Tang,Ranjay Krishna,Vladimir Pavlovic

Main category: cs.CV

TL;DR: This paper investigates how Large Vision-Language Models (LVLMs) perform counting, reveals a shared 'counting circuit' via new interpretability methods (Visual Activation Patching and HeadLens), and proposes a lightweight fine-tuning strategy on synthetic counting data that improves both counting and general visual reasoning.

Details Motivation: Counting is a fundamental yet revealing test of LVLMs' reasoning ability, requiring object individuation and aggregation; understanding how LVLMs count can shed light on their visual reasoning mechanisms. Method: The authors use controlled synthetic and real-world counting benchmarks, combined with two novel mechanistic interpretability methods—Visual Activation Patching and HeadLens—to analyze LVLM internals; they then design a lightweight fine-tuning intervention using only synthetic counting images. Result: LVLMs exhibit human-like counting (precise for small numbers, noisy for large ones); a shared 'counting circuit' is identified across tasks; fine-tuning on synthetic counting data improves in-distribution counting, +8.36% on out-of-distribution counting benchmarks, and +1.54% on complex visual reasoning tasks (e.g., Qwen2.5-VL). Conclusion: Counting plays a central, influential role in LVLM visual reasoning; targeted enhancement of counting mechanisms—via interpretable analysis and minimal synthetic-data fine-tuning—can broadly improve visual reasoning performance. Abstract: Counting serves as a simple but powerful test of a Large Vision-Language Model's (LVLM's) reasoning; it forces the model to identify each individual object and then add them all up. In this study, we investigate how LVLMs implement counting using controlled synthetic and real-world benchmarks, combined with mechanistic analyses. Our results show that LVLMs display a human-like counting behavior, with precise performance on small numerosities and noisy estimation for larger quantities. We introduce two novel interpretability methods, Visual Activation Patching and HeadLens, and use them to uncover a structured "counting circuit" that is largely shared across a variety of visual reasoning tasks. 
Building on these insights, we propose a lightweight intervention strategy that exploits simple and abundantly available synthetic images to fine-tune arbitrary pretrained LVLMs exclusively on counting. Despite the narrow scope of this fine-tuning, the intervention not only enhances counting accuracy on in-distribution synthetic data, but also yields an average improvement of +8.36% on out-of-distribution counting benchmarks and an average gain of +1.54% on complex, general visual reasoning tasks for Qwen2.5-VL. These findings highlight the central, influential role of counting in visual reasoning and suggest a potential pathway for improving overall visual reasoning capabilities through targeted enhancement of counting mechanisms.

[106] 3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model

Hyun-kyu Ko,Jihyeon Park,Younghyun Kim,Dongheok Park,Eunbyung Park

Main category: cs.CV

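TL;DR: This paper proposes the 3DreamBooth and 3Dapter framework for 3D-aware customized video generation; through 1-frame optimization and multi-view joint training, it models true 3D geometry without multi-view video data and improves the consistency and realism of novel-view synthesis.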

Details Motivation: Existing subject-driven video generation methods mostly rely on 2D representations and struggle to model true 3D geometry, so novel views contain arbitrary details and distorted identity; in addition, multi-view video data is scarce, and direct fine-tuning tends to cause temporal overfitting. Method: Proposes 3DreamBooth (1-frame spatial optimization that embeds a robust 3D prior) and 3Dapter (a visual conditioning module jointly optimized over multiple views with an asymmetrical conditioning strategy); the latter acts as a dynamic selective router that queries view-specific geometric hints from a minimal set of reference views. Result: With limited single-view or multi-view input, the framework significantly improves the geometric consistency and texture fidelity of novel-view video generation, avoids temporal overfitting, and achieves genuinely 3D-aware customized video generation. Conclusion: The framework overcomes the limits of 2D-centric video customization and offers a scalable 3D-aware paradigm for few-shot, multi-view controllable video generation. Abstract: Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/

[107] Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection

Yongwei Jiang,Yixiong Zou,Yuhua Li,Ruixuan Li

Main category: cs.CV

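TL;DR: This paper proposes a center-periphery attention refinement framework, inspired by the human fovea, to address the "target-domain astigmatism" problem in cross-domain few-shot object detection, significantly improving localization precision and detection performance in target domains.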

Details Motivation: Existing cross-domain few-shot object detection methods suffer from dispersed attention, imprecise localization, and redundant predictions in target domains, which the authors name the "target-domain astigmatism" problem; regular fine-tuning tends to alleviate it, but not sufficiently. Method: Proposes a center-periphery attention refinement framework comprising (1) a Positive Pattern Refinement module (simulating the visual center) that reshapes attention using class prototypes; (2) a Negative Context Modulation module (simulating the visual periphery) that models background to sharpen boundary discrimination; and (3) a Textual Semantic Alignment module that strengthens the center-periphery distinction through cross-modal cues. Result: Detection accuracy improves consistently on six CD-FSOD benchmarks, reaching new state-of-the-art performance. Conclusion: An attention refinement mechanism inspired by the biological visual system effectively corrects target-domain astigmatism and markedly strengthens generalization and adaptation under data scarcity and domain shift. Abstract: Cross-domain few-shot object detection (CD-FSOD) aims to adapt pretrained detectors from a source domain to target domains with limited annotations, suffering from severe domain shifts and data scarcity problems. In this work, we find a previously overlooked phenomenon: models exhibit dispersed and unfocused attention in target domains, leading to imprecise localization and redundant predictions, much like a human who cannot focus on visual objects. Therefore, we call it the target-domain Astigmatism problem. Analysis of attention distances across transformer layers reveals that regular fine-tuning inherently shows a trend to remedy this problem, but results are still far from satisfactory, which we aim to enhance in this paper. Biologically inspired by the human fovea-style visual system, we enhance the fine-tuning's inherent trend through a center-periphery attention refinement framework, which contains (1) a Positive Pattern Refinement module to reshape attention toward semantic objects using class-specific prototypes, simulating the visual center region; (2) a Negative Context Modulation module to enhance boundary discrimination by modeling background context, simulating the visual periphery region; and (3) a Textual Semantic Alignment module to strengthen center-periphery distinction through cross-modal cues. Our bio-inspired approach transforms astigmatic attention into focused patterns, substantially improving adaptation to target domains. Experiments on six challenging CD-FSOD benchmarks consistently demonstrate improved detection accuracy and establish new state-of-the-art results.

[108] CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models

Xiang Chen,Fangfang Yang,Chunlei Meng,Chengyin Hu,Ang Li,Yiwei Wei,Jiahuan Long,Jiujiang Guo

Main category: cs.CV

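TL;DR: This paper proposes CoDA, a framework that simulates multi-stage distribution shifts along the clinical imaging pipeline (acquisition, reconstruction, display, and delivery) to evaluate the robustness of medical vision-language models (MVLMs) and multimodal large language models (MLLMs) in realistic clinical settings; experiments show that chained shifts hurt performance more than any single stage and that existing models are unreliable at auditing image authenticity; finally, it proposes a teacher-guided token-space adaptation repair strategy to improve robustness.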

Details Motivation: Existing robustness evaluations of medical vision-language models (MVLMs) mostly use clean images or single corruptions and overlook the distribution shifts caused by routine end-to-end clinical image processing that preserves readability, so an evaluation method closer to the real workflow is needed. Method: Proposes the CoDA (chain-of-distribution) framework, which jointly optimizes the composition and parameters of several clinically plausible stages (acquisition shading, reconstruction and display remapping, delivery and export degradation) under structural-similarity constraints to generate images that are visually plausible but statistically shifted; also introduces a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment. Result: CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs on brain MRI, chest X-ray, and abdominal CT, with chained shifts consistently more damaging than single stages; both proprietary and medical-specific MLLMs are unreliable at auditing image authenticity and make high-confidence errors; the proposed post-hoc repair improves accuracy on CoDA-degraded samples. Conclusion: CoDA exposes a key distribution-robustness risk that MVLMs face in real clinical deployment; a lightweight token-space alignment repair strategy mitigates the problem and offers a new route to safe deployment. Abstract: Medical vision-language models (MVLMs) are increasingly used as perceptual backbones in radiology pipelines and as the visual front end of multimodal assistants, yet their reliability under real clinical workflows remains underexplored. Prior robustness evaluations often assume clean, curated inputs or study isolated corruptions, overlooking routine acquisition, reconstruction, display, and delivery operations that preserve clinical readability while shifting image statistics. To address this gap, we propose CoDA, a chain-of-distribution framework that constructs clinically plausible pipeline shifts by composing acquisition-like shading, reconstruction and display remapping, and delivery and export degradations. Under masked structural-similarity constraints, CoDA jointly optimizes stage compositions and parameters to induce failures while preserving visual plausibility. Across brain MRI, chest X-ray, and abdominal CT, CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs, with chained compositions consistently more damaging than any single stage. We also evaluate multimodal large language models (MLLMs) as technical-authenticity auditors of imaging realism and quality rather than pathology. Proprietary multimodal models show degraded auditing reliability and persistent high-confidence errors on CoDA-shifted samples, while the medical-specific MLLMs we test exhibit clear deficiencies in medical image quality auditing. Finally, we introduce a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment, which improves accuracy on archived CoDA outputs. Overall, our findings characterize a clinically grounded threat surface for MVLM deployment and show that lightweight alignment improves robustness in deployment.

[109] HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering

Dan Ben-Ami,Gabriele Serussi,Kobi Cohen,Chaim Baskin

Main category: cs.CV

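TL;DR: HiMu is a training-free framework for long-video question answering that uses a single text-only LLM call to decompose the query into a hierarchical logic tree, then composes signals from lightweight multimodal experts with fuzzy logic, keeping accuracy high while remaining efficient (few frames, low compute).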

Details Motivation: Existing frame selection methods for long-video question answering face a trade-off between efficiency and structured reasoning: similarity-based methods are fast but lose temporal order and cross-modal bindings, while agent-based methods recover that structure at excessive computational cost. Method: HiMu is training-free: a text-only LLM first decomposes the question into a hierarchical logic tree; each leaf (an atomic predicate) is handled by a matching lightweight multimodal expert (such as CLIP, OCR, or ASR); the per-modality signals are normalized, temporally smoothed for alignment, and composed bottom-up with fuzzy-logic operators to produce a continuous satisfaction curve. Result: On Video-MME, LongVideoBench, and HERBench-Lite, HiMu with 16 input frames and Qwen3-VL 8B outperforms all compared selectors; with GPT-4o it surpasses agentic systems that use 32-512 frames while requiring roughly 10x fewer FLOPs. Conclusion: HiMu bridges the gap between efficiency and structured reasoning and advances the efficiency-accuracy Pareto front of frame selection for long-video question answering. Abstract: Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.
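The bottom-up fuzzy-logic composition can be illustrated with a small pure-Python sketch. It is not HiMu's implementation: the min/max operators, the "then" operator, the predicate names, and the scores are all assumptions chosen for illustration, and the abstract does not say which fuzzy operators the authors use.

```python
def fuzzy_and(a, b):
    """Per-frame conjunction (Goedel t-norm: minimum)."""
    return [min(x, y) for x, y in zip(a, b)]


def fuzzy_or(a, b):
    """Per-frame disjunction (maximum)."""
    return [max(x, y) for x, y in zip(a, b)]


def fuzzy_then(a, b):
    """Score frame t highly when b holds at t and a held at an earlier frame."""
    out, best_a = [], 0.0
    for x, y in zip(a, b):
        out.append(min(best_a, y))   # b now, best evidence for a so far
        best_a = max(best_a, x)
    return out


def top_k_frames(curve, k):
    """Indices of the k highest-scoring frames, returned in temporal order."""
    ranked = sorted(range(len(curve)), key=lambda i: -curve[i])
    return sorted(ranked[:k])


# Hypothetical per-frame expert scores in [0, 1] for three atomic predicates.
dog_visible = [0.1, 0.9, 0.8, 0.2, 0.1, 0.1]   # e.g. from a vision expert
bark_heard = [0.0, 0.2, 0.9, 0.1, 0.0, 0.0]    # e.g. from an audio expert
door_opens = [0.0, 0.0, 0.1, 0.2, 0.9, 0.7]

# Query tree: (dog_visible AND bark_heard) THEN door_opens
curve = fuzzy_then(fuzzy_and(dog_visible, bark_heard), door_opens)
selected = top_k_frames(curve, 2)
```

Because the composed curve stays continuous, frame selection reduces to picking its peaks rather than running an LVLM once per candidate frame.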

[110] CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention

Jiacheng Tang,Zhiyuan Zhou,Zhuolin He,Jia Zhang,Kai Zhang,Jian Pu

Main category: cs.CV

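TL;DR: This paper proposes CausalVAD, a framework that uses a sparse causal intervention scheme (SCIS) module to de-confound the training of end-to-end driving models, removing spurious correlations and improving planning accuracy and safety.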

Details Motivation: Existing planning-oriented end-to-end driving models learn only statistical correlations and are prone to causal confusion driven by dataset bias, which undermines reliability and safety in real deployment. Method: Proposes the CausalVAD framework, built around the sparse causal intervention scheme (SCIS): it constructs a dictionary of prototypes representing latent driving contexts and uses it to intervene causally on the model's sparse vectorized queries, removing confounder-induced spurious associations at the representation level. Result: Achieves state-of-the-art planning accuracy and safety on benchmarks such as nuScenes and shows stronger robustness to data bias and to noisy scenarios designed to induce causal confusion. Conclusion: CausalVAD mitigates causal confusion through a plug-and-play causal intervention mechanism and offers a new paradigm for improving the trustworthiness of end-to-end autonomous driving models. Abstract: Planning-oriented end-to-end driving models show great promise, yet they fundamentally learn statistical correlations instead of true causal relationships. This vulnerability leads to causal confusion, where models exploit dataset biases as shortcuts, critically harming their reliability and safety in complex scenarios. To address this, we introduce CausalVAD, a de-confounding training framework that leverages causal intervention. At its core, we design the sparse causal intervention scheme (SCIS), a lightweight, plug-and-play module to instantiate the backdoor adjustment theory in neural networks. SCIS constructs a dictionary of prototypes representing latent driving contexts. It then uses this dictionary to intervene on the model's sparse vectorized queries. This step actively eliminates spurious associations induced by confounders, thereby removing spurious factors from the representations used by downstream tasks. Extensive experiments on benchmarks like nuScenes show CausalVAD achieves state-of-the-art planning accuracy and safety. Furthermore, our method demonstrates superior robustness against both data bias and noisy scenarios configured to induce causal confusion.
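The backdoor adjustment that SCIS is said to instantiate can be shown on a toy discrete model. The sketch below is a textbook illustration, not the SCIS module: the binary variables and every probability are invented for the example. A confounder z influences both x and y, while x has no causal effect on y.

```python
# Toy structural model (all numbers are illustrative assumptions):
#   z -> x, z -> y, and no edge x -> y.
P_Z = {0: 0.5, 1: 0.5}                     # P(z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}            # P(x=1 | z)
P_Y1_GIVEN_XZ = {                          # P(y=1 | x, z); independent of x
    (0, 0): 0.1, (1, 0): 0.1,
    (0, 1): 0.7, (1, 1): 0.7,
}


def p_x_given_z(x, z):
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1.0 - p1


def observational(x):
    """P(y=1 | x): conditioning on x lets z leak in through P(z | x)."""
    evidence = sum(p_x_given_z(x, z) * P_Z[z] for z in P_Z)
    joint = sum(P_Y1_GIVEN_XZ[(x, z)] * p_x_given_z(x, z) * P_Z[z] for z in P_Z)
    return joint / evidence


def interventional(x):
    """Backdoor adjustment: P(y=1 | do(x)) = sum_z P(y=1 | x, z) * P(z)."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * P_Z[z] for z in P_Z)


spurious_gap = observational(1) - observational(0)    # correlation via z
causal_gap = interventional(1) - interventional(0)    # true effect of x
```

In this toy model the observational gap is large although x does nothing, while the adjusted gap is zero. In CausalVAD the role of z is played, per the abstract, by a dictionary of prototypes for latent driving contexts.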

[111] HAViT: Historical Attention Vision Transformer

Swarnendu Banik,Manish Das,Shiv Ram Dubey,Satish Kumar Singh

Main category: cs.CV

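TL;DR: This paper proposes a cross-layer attention propagation method that stores and blends historical attention matrices in the ViT encoder to improve feature learning and optimization dynamics; it needs only minimal architectural changes and yields consistent gains across several datasets and models.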

Details Motivation: Attention in each Vision Transformer layer operates independently, which limits information flow and feature learning, so a mechanism that strengthens cross-layer information transfer is needed. Method: Proposes cross-layer attention propagation, which preserves and blends the historical attention matrices of the encoder layers using a learnable or fixed weight (e.g., alpha = 0.45) and supports random initialization to speed up convergence. Result: ViT accuracy rises from 75.74% to 77.07% on CIFAR-100 (+1.33%) and from 57.82% to 59.07% on TinyImageNet (+1.25%); CaiT improves by 1.01%; alpha = 0.45 is identified as the best blending coefficient, and random initialization outperforms zero initialization. Conclusion: Cross-layer attention propagation is a lightweight, general, and effective improvement that systematically strengthens attention modeling and training stability in ViT-style models. Abstract: Vision Transformers have excelled in computer vision but their attention mechanisms operate independently across layers, limiting information flow and feature learning. We propose an effective cross-layer attention propagation method that preserves and integrates historical attention matrices across encoder layers, offering a principled refinement of inter-layer information flow in Vision Transformers. This approach enables progressive refinement of attention patterns throughout the transformer hierarchy, enhancing feature acquisition and optimization dynamics. The method requires minimal architectural changes, adding only attention matrix storage and blending operations. Comprehensive experiments on CIFAR-100 and TinyImageNet demonstrate consistent accuracy improvements, with ViT performance increasing from 75.74% to 77.07% on CIFAR-100 (+1.33%) and from 57.82% to 59.07% on TinyImageNet (+1.25%). Cross-architecture validation shows similar gains across transformer variants, with CaiT showing a 1.01% improvement. Systematic analysis identifies the blending hyperparameter of historical attention (alpha = 0.45) as optimal across all configurations, providing the ideal balance between current and historical attention information. Random initialization consistently outperforms zero initialization, indicating that diverse initial attention patterns accelerate convergence and improve final performance. Our code is publicly available at https://github.com/banik-s/HAViT.
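One plausible reading of the blending step is a convex combination of the current and historical attention maps. The pure-Python sketch below rests on several assumptions that the abstract does not confirm: alpha weights the history term, the history is the previous layer's blended map, and the first layer is left unblended (the paper instead initializes the history, reportedly best at random). The function names are illustrative and this is not the released HAViT code.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def blend_attention(current, history, alpha=0.45):
    """Assumed form: alpha * history + (1 - alpha) * current, elementwise."""
    return [
        [alpha * h + (1.0 - alpha) * c for c, h in zip(cur_row, hist_row)]
        for cur_row, hist_row in zip(current, history)
    ]


def propagate(score_layers, alpha=0.45):
    """Carry the blended attention map forward from layer to layer."""
    history, outputs = None, []
    for scores in score_layers:
        attention = [softmax(row) for row in scores]
        if history is not None:          # first layer has no history here
            attention = blend_attention(attention, history, alpha)
        outputs.append(attention)
        history = attention
    return outputs


# Two layers of raw attention scores for a 2-token toy sequence.
layers = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
maps = propagate(layers)
```

Since both inputs are row-stochastic, the convex combination keeps every row summing to one, so the blended map can replace the usual attention weights without renormalization.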

[112] Color image restoration based on nonlocal saturation-value similarity

Wei Wang,Yakun Li

Main category: cs.CV

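TL;DR: This paper proposes a new nonlocal variational method for color image restoration based on saturation-value similarity; it models patch similarity in the HSV color space to exploit color information more finely and solves the model with an efficient algorithm based on Bregman splitting.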

Details Motivation: Traditional nonlocal methods extract patches directly from the RGB channels and compute grayscale similarity, which cannot capture the color information of a color image finely; this paper uses the saturation and value channels of the HSV color space to measure similarity between color patches more accurately and thereby improve restoration quality. Method: Builds a nonlocal total variation regularizer based on saturation-value similarity and embeds it in the definition of the nonlocal gradient; formulates the corresponding nonlocal variational models; designs an efficient numerical algorithm using the Bregmanized operator splitting method and analyzes its convergence. Result: Experiments show that the proposed method outperforms the compared methods in visual quality and in quantitative metrics including PSNR, SSIM, QSSIM, and S-CIELAB color error. Conclusion: A nonlocal variational model based on saturation-value similarity preserves the color structure of color images more effectively and markedly improves restoration performance. Abstract: In this paper, we propose and develop a novel nonlocal variational technique based on saturation-value similarity for color image restoration. In traditional nonlocal methods, image patches are extracted from the red, green, and blue channels of a color image directly, and the color information cannot be described finely because the patch similarity is mainly based on the grayscale values of independent channels. The main aim of this paper is to propose and develop a novel nonlocal regularization method by considering the similarity of image patches in the saturation-value channels of a color image. In particular, we first establish saturation-value similarity based nonlocal total variation by incorporating saturation-value similarity of color image patches into the proposed nonlocal gradients, which can describe the saturation and value similarity of two adjacent color image patches. The proposed nonlocal variational models are then formulated based on saturation-value similarity based nonlocal total variation. Moreover, we design an effective and efficient algorithm to solve the proposed optimization problem numerically by employing the Bregmanized operator splitting method, and we also study the convergence of the proposed algorithms. Numerical examples are presented to demonstrate that the performance of the proposed models is better than that of other testing methods in terms of visual quality and some quantitative metrics including peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), quaternion structural similarity index (QSSIM) and S-CIELAB color error.
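As a rough illustration of measuring patch similarity in saturation and value rather than per RGB channel, the sketch below converts pixels with Python's standard colorsys module and applies a classical nonlocal-means style weight. Both choices are assumptions: the paper's own saturation-value definition and weight function may differ, and the helper names and the bandwidth h are hypothetical.

```python
import colorsys
import math


def sv_patch(rgb_patch):
    """Map each (r, g, b) pixel in [0, 1] to its (saturation, value) pair."""
    return [colorsys.rgb_to_hsv(r, g, b)[1:] for r, g, b in rgb_patch]


def sv_distance(patch_a, patch_b):
    """Mean squared difference of two equal-size patches in (S, V) space."""
    total = 0.0
    for (sa, va), (sb, vb) in zip(sv_patch(patch_a), sv_patch(patch_b)):
        total += (sa - sb) ** 2 + (va - vb) ** 2
    return total / len(patch_a)


def nonlocal_weight(patch_a, patch_b, h=0.5):
    """Classical nonlocal-means weight exp(-d / h^2), here with the SV distance."""
    return math.exp(-sv_distance(patch_a, patch_b) / (h * h))


# Three flat 2 x 2 patches, flattened to lists of four pixels.
red = [(1.0, 0.0, 0.0)] * 4        # S = 1.0, V = 1.0
pale_red = [(1.0, 0.5, 0.5)] * 4   # S = 0.5, V = 1.0
gray = [(0.5, 0.5, 0.5)] * 4       # S = 0.0, V = 0.5

w_same = nonlocal_weight(red, red)
w_pale = nonlocal_weight(red, pale_red)
w_gray = nonlocal_weight(red, gray)
```

Patches with closer saturation and value receive larger weights, so they contribute more to the nonlocal regularizer.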

[113] AU Codes, Language, and Synthesis: Translating Anatomy to Text for Facial Behavior Synthesis

Jiahe Wang,Cong Liang,Xuandong Huang,Yuxin Wang,Xin Yun,Yi Wu,Yanan Chang,Shangfei Wang

Main category: cs.CV

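TL;DR: This paper proposes describing Action Units (AUs) in natural language to fix the anatomically implausible results that linear AU combination causes in existing AU-based facial behavior synthesis, especially for conflicting AUs; it builds BP4D-AUText, the first large-scale AU text-image paired dataset, and proposes the generative model VQ-AUFace, which significantly outperforms existing methods in anatomical plausibility, behavioral richness, and perceptual realism.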

Details Motivation: Existing AU-based facial synthesis methods encode AUs as one-hot vectors and combine them linearly, which cannot model conflicting AUs (AUs that activate the same muscle with opposing actions) and leads to anatomically implausible artifacts and unnatural motion superposition. Method: Represents facial behavior through natural language descriptions of AUs, builds the BP4D-AUText dataset with a rule-based Dynamic AU Text Processor, and develops VQ-AUFace, a generative model that incorporates facial structural priors and uses modern text-to-image models for high-fidelity facial synthesis. Result: In quantitative experiments and user studies, the method significantly outperforms existing methods in anatomical plausibility, behavioral richness, and perceptual realism, particularly in challenging cases involving conflicting AUs. Conclusion: A natural-language AU representation models complex and conflicting facial behaviors more accurately, and a generative framework with structural priors improves the realism and diversity of facial behavior synthesis, offering a new paradigm for modeling nonverbal communication. Abstract: Facial behavior synthesis remains a critical yet underexplored challenge. While text-to-face models have made progress, they often rely on coarse emotion categories, which lack the nuance needed to capture the full spectrum of human nonverbal communication. Action Units (AUs) provide a more precise and anatomically grounded alternative. However, current AU-based approaches typically encode AUs as one-hot vectors, modeling compound expressions as simple linear combinations of individual AUs. This linearity becomes problematic when handling conflicting AUs, defined as those that activate the same facial muscle with opposing actions. Such cases lead to anatomically implausible artifacts and unnatural motion superpositions. To address this, we propose a novel method that represents facial behavior through natural language descriptions of AUs. This approach preserves the expressiveness of the AU framework while enabling explicit modeling of complex and conflicting AUs. It also unlocks the potential of modern text-to-image models for high-fidelity facial synthesis. Supporting this direction, we introduce BP4D-AUText, the first large-scale text-image paired dataset for complex facial behavior. It is synthesized by applying a rule-based Dynamic AU Text Processor to the BP4D and BP4D+ datasets. We further propose VQ-AUFace, a generative model that leverages facial structural priors to synthesize realistic and diverse facial behaviors from text. Extensive quantitative experiments and user studies demonstrate that our approach significantly outperforms existing methods. It excels in generating facial expressions that are anatomically plausible, behaviorally rich, and perceptually convincing, particularly under challenging conditions involving conflicting AUs.

[114] myMNIST: Benchmark of PETNN, KAN, and Classical Deep Learning Models for Burmese Handwritten Digit Recognition

Ye Kyaw Thu,Thazin Myint Oo,Thepchai Supnithi

Main category: cs.CV

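TL;DR: This paper presents the first systematic benchmark on myMNIST (formerly BHDD), a Burmese handwritten digit dataset, evaluating 11 models: the CNN performs best, PETNN (GELU) follows closely, the energy-based JEM is also competitive, and KAN-based models trail slightly but remain useful; the study establishes reproducible baselines, highlights PETNN's strengths, and releases the benchmark to promote research on regional script recognition.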

Details Motivation: Establish a reproducible, cross-paradigm benchmark on myMNIST to advance Myanmar NLP/AI research and to assess emerging architectures (such as KAN, PETNN, and JEM) for regional script recognition. Method: Systematically evaluate 11 models on myMNIST: MLP, CNN, LSTM, GRU, Transformer, FastKAN, EfficientKAN, JEM, and three PETNN variants (Sigmoid/GELU/SiLU), compared quantitatively using Precision, Recall, F1-Score, and Accuracy. Result: The CNN achieves the best performance (F1 = 0.9959, Accuracy = 0.9970); PETNN (GELU) is second (F1 = 0.9955, Accuracy = 0.9966), ahead of LSTM, GRU, Transformer, and the KAN variants; JEM performs solidly (F1 = 0.9944, Accuracy = 0.9958); the KAN models reach an Accuracy of about 0.992. Conclusion: The CNN remains a strong baseline; PETNN shows the potential to rival classical and Transformer models; JEM confirms the effectiveness of energy-based modeling; the benchmark provides a public, reproducible foundation for Burmese digit recognition and for evaluating emerging architectures on regional scripts. Abstract: We present the first systematic benchmark on myMNIST (formerly BHDD), a publicly available Burmese handwritten digit dataset important for Myanmar NLP/AI research. We evaluate eleven architectures spanning classical deep learning models (Multi-Layer Perceptron, Convolutional Neural Network, Long Short-Term Memory, Gated Recurrent Unit, Transformer), recent alternatives (FastKAN, EfficientKAN), an energy-based model (JEM), and physics-inspired PETNN variants (Sigmoid, GELU, SiLU). Using Precision, Recall, F1-Score, and Accuracy as evaluation metrics, our results show that the CNN remains a strong baseline, achieving the best overall scores (F1 = 0.9959, Accuracy = 0.9970). The PETNN (GELU) model closely follows (F1 = 0.9955, Accuracy = 0.9966), outperforming LSTM, GRU, Transformer, and KAN variants. JEM, representing energy-based modeling, performs competitively (F1 = 0.9944, Accuracy = 0.9958). KAN-based models (FastKAN, EfficientKAN) trail the top performers but provide a meaningful alternative baseline (Accuracy ~0.992). These findings (i) establish reproducible baselines for myMNIST across diverse modeling paradigms, (ii) highlight PETNN's strong performance relative to classical and Transformer-based models, and (iii) quantify the gap between energy-inspired PETNNs and a true energy-based model (JEM). We release this benchmark to facilitate future research on Myanmar digit recognition and to encourage broader evaluation of emerging architectures on regional scripts.
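
The four metrics named in this entry can be computed from predicted and true digit labels as in the minimal sketch below. The abstract does not say how per-class scores are averaged, so macro-averaging is an assumption here, and the function name `macro_metrics` is hypothetical rather than taken from the benchmark's code.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return (precision, recall, f1, accuracy), macro-averaged over classes.

    Macro-averaging is an assumption; the benchmark may average differently.
    """
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(classes)
    accuracy = sum(tp.values()) / len(y_true)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n, accuracy
```

With ten balanced digit classes, macro-averaged and per-sample scores tend to be close, which may be why the reported F1 and Accuracy values differ only in the third decimal place.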

[115] Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness

Lu Yu,Haiyang Zhang,Changsheng Xu

Main category: cs.CV

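TL;DR: This paper proposes two text-guided-attention methods for zero-shot robustness (TGA-ZSR and Comp-TGA), which strengthen CLIP against adversarial examples through local refinement and global constraint mechanisms and substantially improve zero-shot robust accuracy across 16 datasets.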

Details Motivation: Pre-trained vision-language models such as CLIP have strong zero-shot capabilities but are vulnerable to adversarial examples; the authors observe that adversarial perturbations shift text-guided attention and build their robustness strategy on this observation. Method: Propose the TGA-ZSR framework, consisting of a Local Attention Refinement Module and a Global Attention Constraint Module; further propose Comp-TGA, which combines attention guided by the class prompt with reversed attention driven by the non-class prompt to model the foreground in a complementary way. Result: TGA-ZSR and Comp-TGA improve zero-shot robust accuracy over the current state of the art by 9.58% and 11.95%, respectively, across 16 datasets. Conclusion: Text-guided attention effectively improves CLIP's zero-shot robustness; the complementary attention design (Comp-TGA) further reduces attention to irrelevant or spurious features and yields better robustness. Abstract: Owing to their impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP) have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon wherein adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: Local Attention Refinement Module and Global Attention Constraint Module. Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness. Additionally, the Global Attention Constraint Module acquires text-guided attention from both the target and original models using clean examples. Its objective is to maintain model performance on clean samples while enhancing overall robustness. However, we observe that the method occasionally focuses on irrelevant or spurious features, which can lead to suboptimal performance and undermine its robustness in certain scenarios. To overcome this limitation, we further propose a novel approach called Complementary Text-Guided Attention (Comp-TGA). This method integrates two types of foreground attention: attention guided by the class prompt and reversed attention driven by the non-class prompt. These complementary attention mechanisms allow the model to capture a more comprehensive and accurate representation of the foreground. The experiments validate that TGA-ZSR and Comp-TGA yield 9.58% and 11.95% improvements, respectively, in zero-shot robust accuracy over the current state-of-the-art techniques across 16 datasets.

[116] SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation

Jialiang Kang,Han Shu,Wenshuo Li,Yingjie Zhai,Xinghao Chen

Main category: cs.CV

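TL;DR: SJD-PAC is an improved speculative Jacobi decoding framework that combines a proactive drafting strategy with an adaptive continuation mechanism to speed up text-to-image generation by 3.8× without loss of image quality.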

Details Motivation: The original Speculative Jacobi Decoding (SJD) has low draft-token acceptance rates in high-entropy visual generation, which creates a throughput bottleneck. Method: Propose SJD-PAC: 1) a proactive drafting strategy that raises local acceptance rates in high-entropy regions; 2) an adaptive continuation mechanism that keeps validating the sequence after the first rejection, avoiding full resampling. Result: A 3.8× speedup on standard text-to-image benchmarks with lossless image quality. Conclusion: SJD-PAC relieves SJD's acceptance-rate bottleneck in visual generation and significantly improves inference efficiency while strictly preserving the target distribution. Abstract: Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck that severely limits overall throughput. To overcome this, we introduce SJD-PAC, an enhanced SJD framework. First, SJD-PAC employs a proactive drafting strategy to improve local acceptance rates in these challenging high-entropy regions. Second, we introduce an adaptive continuation mechanism that sustains sequence validation after an initial rejection, bypassing the need for full resampling. Working in tandem, these optimizations significantly increase the average acceptance length per step, boosting inference speed while strictly preserving the target distribution. Experiments on standard text-to-image benchmarks demonstrate that SJD-PAC achieves a 3.8× speedup with lossless image quality.
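
To make "acceptance length per step" concrete, the sketch below applies the generic speculative-sampling acceptance test (accept a drafted token with probability min(1, p/q)) and counts how many draft tokens survive before the first rejection. This is the standard criterion that speculative decoding methods build on, not SJD-PAC's proactive drafting or adaptive continuation, whose details the abstract does not give. The function and argument names are hypothetical.

```python
def accepted_prefix_length(p_target, q_draft, uniforms):
    """Count draft tokens accepted before the first rejection.

    p_target[i]: probability the current verification pass assigns to token i.
    q_draft[i]:  probability under which token i was drafted (assumed > 0).
    uniforms[i]: a sample from U(0, 1) used for the accept/reject test.
    """
    accepted = 0
    for p, q, u in zip(p_target, q_draft, uniforms):
        if u < min(1.0, p / q):
            accepted += 1
        else:
            break  # plain speculative decoding stops validating here
    return accepted
```

In this baseline form, every token after the first rejection is discarded regardless of its own acceptance probability. The abstract describes adaptive continuation as sustaining validation past that point, which is where the longer average acceptance length would come from.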

[117] Improving Joint Audio-Video Generation with Cross-Modal Context Learning

Bingqi Ma,Linlong Lang,Ming Zhang,Dailan He,Xingtong Ge,Yi Zhang,Guanglu Song,Yu Liu

Main category: cs.CV

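TL;DR: This paper proposes Cross-Modal Context Learning (CCL), a method for improving joint audio-video generation under the dual-stream Transformer architecture. By introducing the TARP, LCT, DCR, and UCG modules, it mitigates several problems in cross-modal interaction and maintains high performance while substantially reducing resource requirements.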

Details Motivation: Existing dual-stream Transformer methods for audio-video generation suffer from model manifold variations, biases in multi-modal background regions, inconsistency of classifier-free guidance (CFG) between training and inference, and conflicts between multiple conditions. Method: Propose Cross-Modal Context Learning (CCL), comprising: 1) Temporally Aligned RoPE and Partitioning (TARP); 2) Cross-Modal Context Attention (CCA), built from Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR); 3) Unconditional Context Guidance (UCG), which uses LCT to improve CFG consistency. Result: CCL achieves state-of-the-art performance in multiple evaluations while requiring substantially fewer computational resources than recent academic methods. Conclusion: CCL mitigates key defects of cross-modal interaction in dual-stream audio-video generation and improves generation quality, training stability, and inference consistency, making it an efficient and robust multimodal generation framework. Abstract: The dual-stream transformer architecture-based joint audio-video generation method has become the dominant paradigm in current research. By incorporating pre-trained video diffusion models and audio diffusion models, along with a cross-modal interaction attention module, high-quality, temporally synchronized audio-video content can be generated with minimal training data. In this paper, we first revisit the dual-stream transformer paradigm and further analyze its limitations, including model manifold variations caused by the gating mechanism controlling cross-modal interactions, biases in multi-modal background regions introduced by cross-modal attention, and the inconsistencies in multi-modal classifier-free guidance (CFG) during training and inference, as well as conflicts between multiple conditions. To alleviate these issues, we propose Cross-Modal Context Learning (CCL), equipped with several carefully designed modules. Temporally Aligned RoPE and Partitioning (TARP) effectively enhances the temporal alignment between audio latent and video latent representations. The Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR) in the Cross-Modal Context Attention (CCA) module provide stable unconditional anchors for cross-modal information, while dynamically routing based on different training tasks, further enhancing the model's convergence speed and generation quality. During inference, Unconditional Context Guidance (UCG) leverages the unconditional support provided by LCT to facilitate different forms of CFG, improving train-inference consistency and further alleviating conflicts. Through comprehensive evaluations, CCL achieves state-of-the-art performance compared with recent academic methods while requiring substantially fewer resources.

[118] Benchmarking CNN-based Models against Transformer-based Models for Abdominal Multi-Organ Segmentation on the RATIC Dataset

Lukas Bayer,Sheethal Bhat,Andreas Maier

Main category: cs.CV

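TL;DR: This study systematically compares three hybrid Transformer models (UNETR, SwinUNETR, UNETR++) with the CNN baseline SegResNet on abdominal CT multi-organ segmentation. On the small- to medium-sized heterogeneous RATIC dataset, the well-optimized CNN still outperforms the hybrid Transformer models.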

Details Motivation: Investigate whether recently popular Transformer architectures can surpass conventional CNNs for abdominal CT multi-organ segmentation, particularly on heterogeneous, small- to medium-sized medical datasets. Method: Benchmark four models (UNETR, SwinUNETR, UNETR++, and SegResNet) under identical preprocessing and training conditions on the RATIC dataset (206 CT scans from 23 institutions, with five abdominal organs annotated), using the Dice Similarity Coefficient (DSC) as the primary metric. Result: SegResNet achieves the best overall performance and outperforms every Transformer model on all organs; UNETR++ is the strongest of the Transformer models; UNETR converges fastest. Conclusion: For small- to medium-sized, heterogeneous medical imaging datasets, well-optimized CNN architectures such as SegResNet remain highly competitive, and current hybrid Transformer models do not yet show a clear advantage. Abstract: Accurate multi-organ segmentation in abdominal CT scans is essential for computer-aided diagnosis and treatment. While convolutional neural networks (CNNs) have long been the standard approach in medical image segmentation, transformer-based architectures have recently gained attention due to their ability to model long-range dependencies. In this study, we systematically benchmark the three hybrid transformer-based models UNETR, SwinUNETR, and UNETR++ against a strong CNN baseline, SegResNet, for volumetric multi-organ segmentation on the heterogeneous RATIC dataset. The dataset comprises 206 annotated CT scans from 23 institutions worldwide, covering five abdominal organs. All models were trained and evaluated under identical preprocessing and training conditions using the Dice Similarity Coefficient (DSC) as the primary metric. The results show that the CNN-based SegResNet achieves the highest overall performance, outperforming all hybrid transformer-based models across all organs. Among the transformer-based approaches, UNETR++ delivers the most competitive results, while UNETR demonstrates notably faster convergence with fewer training iterations. These findings suggest that, for small- to medium-sized heterogeneous datasets, well-optimized CNN architectures remain highly competitive and may outperform hybrid transformer-based designs.
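
Because the comparison rests on the Dice Similarity Coefficient, a minimal per-organ sketch may help: DSC = 2|P ∩ T| / (|P| + |T|) for the predicted voxel set P and ground-truth set T of one organ label. The flattened-list input and the convention of returning 1.0 when an organ is absent from both volumes are assumptions, and `dice_coefficient` is a hypothetical name, not the study's code.

```python
def dice_coefficient(pred, target, label):
    """Dice Similarity Coefficient for one organ label.

    pred, target: flattened label volumes (one integer label per voxel).
    Returns 1.0 when the label is absent from both (an assumed convention).
    """
    pred_hits = [p == label for p in pred]
    target_hits = [t == label for t in target]
    intersection = sum(p and t for p, t in zip(pred_hits, target_hits))
    denominator = sum(pred_hits) + sum(target_hits)
    return 1.0 if denominator == 0 else 2.0 * intersection / denominator


# Toy volume with labels 0 (background), 1, and 2.
prediction = [0, 1, 1, 2, 2, 2]
ground_truth = [0, 1, 2, 2, 2, 0]
per_organ = {organ: dice_coefficient(prediction, ground_truth, organ) for organ in (1, 2)}
```

An overall score would typically be the mean of the per-organ values, although the abstract does not specify how the authors aggregate them.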

[119] OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data

Bin Cao,Sipeng Zheng,Hao Luo,Boyuan Li,Jing Liu,Zongqing Lu

Main category: cs.CV

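TL;DR: This paper introduces OpenT2M, a large-scale, high-quality, open-source motion dataset, and MonoFrill, a text-to-motion generation model pretrained on it. The model's core is a new motion tokenizer, 2D-PRQ, and the combination significantly improves generalization and zero-shot performance.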

Details Motivation: Existing text-to-motion (T2M) models perform poorly on unseen text descriptions, mainly because motion datasets are small and lack diversity. Method: Build OpenT2M, a million-level, high-quality, open-source motion dataset with over 2800 hours of motion, physical feasibility validation, and fine-grained text annotations; design an automated pipeline for long-horizon sequences; propose the MonoFrill model and its core component, 2D-PRQ, a motion tokenizer based on partitioning the human body into biological parts. Result: OpenT2M significantly improves the generalization of existing T2M models; 2D-PRQ performs strongly in motion reconstruction and zero-shot tasks; MonoFrill achieves strong generation results with a simple structure. Conclusion: Together, OpenT2M and MonoFrill address the T2M field's longstanding data quality and benchmarking problems and advance the field. Abstract: Text-to-motion (T2M) generation aims to create realistic human movements from text descriptions, with promising applications in animation and robotics. Despite recent progress, current T2M models perform poorly on unseen text descriptions due to the small scale and limited diversity of existing motion datasets. To address this problem, we introduce OpenT2M, a million-level, high-quality, and open-source motion dataset containing over 2800 hours of human motion. Each sequence undergoes rigorous quality control through physical feasibility validation and multi-granularity filtering, with detailed second-wise text annotations. We also develop an automated pipeline for creating long-horizon sequences, enabling complex motion generation. Building upon OpenT2M, we introduce MonoFrill, a pretrained motion model that achieves compelling T2M results without complicated designs or technical tricks as "frills". Its core component is 2D-PRQ, a novel motion tokenizer that captures spatiotemporal dependencies by dividing the human body into biological parts. Experiments show that OpenT2M significantly improves generalization of existing T2M models, while 2D-PRQ achieves superior reconstruction and strong zero-shot performance. We expect OpenT2M and MonoFrill will advance the T2M field by addressing longstanding data quality and benchmarking challenges.

[120] GenVideoLens: Where LVLMs Fall Short in AI-Generated Video Detection?

Yueying Zou,Pei Pei Li,Zekun Li,Xinyu Guo,Xing Cui,Huaibo Huang,Ran He

Main category: cs.CV

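TL;DR: This paper introduces GenVideoLens, a fine-grained benchmark for evaluating the multi-dimensional capabilities of large vision-language models (LVLMs) in AI-generated video detection. It finds that models are weak on optical consistency, physical interactions, and temporal-causal reasoning, and that performance is unevenly distributed across dimensions.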

Details Motivation: Existing evaluations treat AI-generated video detection as binary classification and rely on coarse-grained metrics such as overall accuracy, which reveal little about LVLMs' strengths and weaknesses on specific authenticity cues. Method: Build the GenVideoLens benchmark of 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions (covering perceptual, optical, physical, and temporal cues); systematically evaluate 11 representative LVLMs; run temporal perturbation experiments to analyze how well the models use temporal information. Result: LVLMs do relatively well on perceptual cues but fall short on optical consistency, physical interactions, and temporal-causal reasoning; performance varies widely across dimensions, with some smaller open-source models beating stronger proprietary models on specific cues; temporal perturbation experiments show that current LVLMs make limited use of temporal information. Conclusion: GenVideoLens provides diagnostic insight, exposes key capability gaps of LVLMs in AI-generated video detection, and gives clear direction for future model design and improvement. Abstract: In recent years, AI-generated videos have become increasingly realistic and sophisticated. Meanwhile, Large Vision-Language Models (LVLMs) have shown strong potential for detecting such content. However, existing evaluation protocols largely treat the task as a binary classification problem and rely on coarse-grained metrics such as overall accuracy, providing limited insight into where LVLMs succeed or fail. To address this limitation, we introduce GenVideoLens, a fine-grained benchmark that enables dimension-wise evaluation of LVLM capabilities in AI-generated video detection. The benchmark contains 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions covering perceptual, optical, physical, and temporal cues. We evaluate eleven representative LVLMs on this benchmark. Our analysis reveals a pronounced dimensional imbalance. While LVLMs perform relatively well on perceptual cues, they struggle with optical consistency, physical interactions, and temporal-causal reasoning. Model performance also varies substantially across dimensions, with smaller open-source models sometimes outperforming stronger proprietary models on specific authenticity cues. Temporal perturbation experiments further show that current LVLMs make limited use of temporal information. Overall, GenVideoLens provides diagnostic insights into LVLM behavior, revealing key capability gaps and offering guidance for improving future AI-generated video detection systems.

[121] SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

Carlos Hinojosa,Clemens Grange,Bernard Ghanem

Main category: cs.CV

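TL;DR: This paper studies how strongly the safety decisions of vision-language models (VLMs) depend on semantic cues. It proposes a semantic steering framework and the SAVeS benchmark, and finds that VLM safety behavior is easily swayed by textual, visual, and cognitive interventions. This indicates reliance on visual-linguistic associations rather than genuine visual understanding and exposes a potential vulnerability in multimodal safety systems.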

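Details Motivation: In real-world and embodied settings, VLM safety decisions depend on visual context, but it is unclear which visual evidence drives these judgments; the question is whether models rely on simple semantic cues rather than genuine visual understanding. Method: Propose a semantic steering framework that applies controlled textual, visual, and cognitive interventions without changing the scene content; build the SAVeS benchmark and an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Result: Experiments on multiple VLMs and an additional benchmark show that safety decisions are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding; automated steering pipelines can exploit this mechanism, exposing a vulnerability in multimodal safety systems. Conclusion: VLM safety behavior can be manipulated through semantic cues, reflecting a lack of truly grounded visual understanding; multimodal safety mechanisms need to be improved for robustness and reliability. Abstract: Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives these judgments. We study whether multimodal safety behavior in VLMs can be steered by simple semantic cues. We introduce a semantic steering framework that applies controlled textual, visual, and cognitive interventions without changing the underlying scene content. To evaluate these effects, we propose SAVeS, a benchmark for situational safety under semantic cues, together with an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Experiments across multiple VLMs and an additional state-of-the-art benchmark show that safety decisions are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding. We further demonstrate that automated steering pipelines can exploit these mechanisms, highlighting a potential vulnerability in multimodal safety systems.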

[122] GEAR: Geography-knowledge Enhanced Analog Recognition Framework in Extreme Environments

Zelin Liu,Bocheng Li,Yuling Zhou,Xuanting Li,Yixuan Yang,Jing Wang,Weishu Zhao,Xiaofeng Gao

Main category: cs.CV

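TL;DR: This paper proposes the GEAR framework, a three-stage pipeline (skeleton-guided screening, physics-aware filtering, and graph-based fine recognition) that efficiently retrieves terrain analogs of the Mariana Trench on the Qinghai-Tibet Plateau. It also designs MSG-Net to improve terrain-similarity recognition and finds that the extracted features correlate significantly with biological data.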

Details Motivation: Deep-sea sampling is prohibitively expensive, so there is a pressing need to find terrestrial analogs on the Qinghai-Tibet Plateau that resemble the Mariana Trench in geological origin and microbial function; existing models cannot combine geographical knowledge with computational efficiency. Method: Propose the three-stage GEAR framework: 1) skeleton-guided screening and clipping; 2) physics-aware filtering based on the Topographic Waveform Comparator (TWC) and Morphological Texture Module (MTM); 3) fine recognition with the Morphology-integrated Siamese Graph Network (MSG-Net), built on geomorphological metrics; also release an expert-annotated topographic similarity dataset targeting tectonic collision zones. Result: Every stage is shown to be effective; MSG-Net improves F1-Score by 1.38 percentage points over the SOTA baseline; features extracted by MSG-Net correlate significantly with biological data. Conclusion: GEAR identifies cross-domain terrain analogs efficiently and accurately, offering a low-cost terrestrial alternative for deep-sea research and supporting subsequent biological analysis. Abstract: The Mariana Trench and the Qinghai-Tibet Plateau exhibit significant similarities in geological origins and microbial metabolic functions. Given that deep-sea biological sampling faces prohibitive costs, recognizing structurally homologous terrestrial analogs of the Mariana Trench on the Qinghai-Tibet Plateau is of great significance. Yet, no existing model adequately addresses cross-domain topographic similarity retrieval, either neglecting geographical knowledge or sacrificing computational efficiency. To address these challenges, we present the Geography-knowledge Enhanced Analog Recognition (GEAR) Framework, a three-stage pipeline designed to efficiently retrieve analogs from 2.5 million square kilometers of the Qinghai-Tibet Plateau: (1) Skeleton-guided Screening and Clipping: Recognition of candidate valleys and initial screening based on size and linear morphological criteria. (2) Physics-aware Filtering: The Topographic Waveform Comparator (TWC) and Morphological Texture Module (MTM) evaluate the waveform and texture and filter out inconsistent candidate valleys. (3) Graph-based Fine Recognition: We design a Morphology-integrated Siamese Graph Network (MSG-Net) based on geomorphological metrics. Correspondingly, we release an expert-annotated topographic similarity dataset targeting tectonic collision zones. Experiments demonstrate the effectiveness of every stage. In addition, MSG-Net achieved an F1-Score 1.38 percentage points higher than the SOTA baseline. Using features extracted by MSG-Net, we discovered a significant correlation with biological data, providing evidence for future biological analysis.

[123] SwiftGS: Episodic Priors for Immediate Satellite Surface Recovery

Rong Fu,Jiekai Wu,Haiyun Wei,Xiaowen Ma,Shiyin Lin,Kangan Qian,Chuang Liu,Jianyuan Ni,Simon James Fong

Main category: cs.CV

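TL;DR: This paper proposes SwiftGS, a meta-learned system that predicts geometry-radiation-decoupled Gaussian primitives and a lightweight signed distance function (SDF) in a single forward pass, enabling fast, large-scale 3D reconstruction from multi-date satellite imagery at much lower computational cost while maintaining accuracy.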

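Details Motivation: Existing methods struggle with illumination changes, sensor heterogeneity, and the high cost of per-scene optimization in 3D reconstruction from multi-date satellite imagery. Method: Propose SwiftGS: episodic training within a meta-learning framework; prediction of geometry-radiation-decoupled Gaussian primitives together with a lightweight SDF; a differentiable physics graph that models projection, illumination, and sensor response; spatial gating that blends sparse Gaussian detail with global SDF structure; semantic-geometric fusion, conditional lightweight task heads, and multi-view supervision from a frozen geometric teacher, trained under an uncertainty-aware multi-task loss. Result: Zero-shot reconstruction at inference with optional compact calibration; accurate digital surface model (DSM) reconstruction and view-consistent rendering; significantly reduced computational cost; ablations confirm the benefits of the hybrid representation, physics-aware rendering, and episodic meta-training. Conclusion: SwiftGS offers a new paradigm for large-scale, fast, low-cost 3D reconstruction from multi-date satellite imagery, balancing accuracy, efficiency, and generalization. Abstract: Rapid, large-scale 3D reconstruction from multi-date satellite imagery is vital for environmental monitoring, urban planning, and disaster response, yet remains difficult due to illumination changes, sensor heterogeneity, and the cost of per-scene optimization. We introduce SwiftGS, a meta-learned system that reconstructs 3D surfaces in a single forward pass by predicting geometry-radiation-decoupled Gaussian primitives together with a lightweight SDF, replacing expensive per-scene fitting with episodic training that captures transferable priors. The model couples a differentiable physics graph for projection, illumination, and sensor response with spatial gating that blends sparse Gaussian detail and global SDF structure, and incorporates semantic-geometric fusion, conditional lightweight task heads, and multi-view supervision from a frozen geometric teacher under an uncertainty-aware multi-task loss. At inference, SwiftGS operates zero-shot with optional compact calibration and achieves accurate DSM reconstruction and view-consistent rendering at significantly reduced computational cost, with ablations highlighting the benefits of the hybrid representation, physics-aware rendering, and episodic meta-training.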

[124] Training-Free Sparse Attention for Fast Video Generation via Offline Layer-Wise Sparsity Profiling and Online Bidirectional Co-Clustering

Jiayi Luo,Jiayu Chen,Jiankun Wang,Cong Wang,Hanxin Zhu,Qingyun Sun,Chen Gao,Zhibo Chen,Jianxin Li

Main category: cs.CV

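TL;DR: This paper proposes SVOO, a training-free sparse attention framework for video generation that combines offline layer-wise sensitivity profiling with online bidirectional co-clustering, significantly speeding up inference while maintaining high quality.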

Details Motivation: Existing training-free sparse attention methods ignore layer heterogeneity and query-key coupling, which leads to a poor quality-speedup trade-off. Method: SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to determine each layer's intrinsic pruning level; (ii) online block-wise sparse attention via a novel bidirectional co-clustering algorithm. Result: Validated on seven widely used video generation models, SVOO achieves up to 1.93× speedup over SOTA methods while maintaining a PSNR of up to 29 dB on Wan2.1. Conclusion: By exploiting the intrinsic per-layer nature of attention sparsity and modeling bidirectional query-key coupling, SVOO improves the quality-efficiency balance of training-free sparse attention in video generation. Abstract: Diffusion Transformers (DiTs) achieve strong video generation quality but suffer from high inference cost due to dense 3D attention, leading to the development of sparse attention technologies to improve efficiency. However, existing training-free sparse attention methods in video generation still face two unresolved limitations: ignoring layer heterogeneity in attention pruning and ignoring query-key coupling in block partitioning, which hinder a better quality-speedup trade-off. In this work, we uncover a critical insight that the attention sparsity of each layer is its intrinsic property, with minor effects across different inputs. Motivated by this, we propose SVOO, a training-free Sparse attention framework for fast Video generation via Offline layer-wise sparsity profiling and Online bidirectional co-clustering. Specifically, SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to derive intrinsic per-layer pruning levels, and (ii) online block-wise sparse attention via a novel bidirectional co-clustering algorithm. Extensive experiments on seven widely used video generation models demonstrate that SVOO achieves a superior quality-speedup trade-off over state-of-the-art methods, delivering up to 1.93× speedup while maintaining a PSNR of up to 29 dB on Wan2.1.
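
The 29 dB figure is a peak signal-to-noise ratio, PSNR = 10 · log10(MAX² / MSE). The sketch below computes it for two flattened frames. The abstract does not say which reference the PSNR is measured against (plausibly the dense-attention output of the same model), so that pairing is an assumption; `psnr` and its 8-bit default `max_value` are illustrative choices.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)
```

For 8-bit pixels, 29 dB corresponds to a root-mean-square error of roughly 9 intensity levels, which gives a rough sense of how far the accelerated output may drift from its reference.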

[125] PhysVideo: Physically Plausible Video Generation with Cross-View Geometry Guidance

Cong Wang,Hanxin Zhu,Xiao Tang,Jiayi Luo,Xin Jin,Long Chen,Fei-Yue Wang,Zhibo Chen

Main category: cs.CV

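TL;DR: This paper proposes PhysVideo, a two-stage framework for generating physically consistent video: the first stage uses Phys4View to generate physics-aware orthogonal foreground videos, and the second uses VideoSyn to synthesize full videos with background. It also builds PhysMV, a multi-view dataset of 160K sequences.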

Details Motivation: Existing video generation methods have improved visual fidelity but struggle to keep motion physically consistent, mainly because real object motion unfolds in three-dimensional space while video provides only partial, view-dependent 2D projections. Method: Propose the two-stage PhysVideo framework: in the first stage, Phys4View uses physics-aware attention, geometry-enhanced cross-view attention, and temporal attention to generate physics-aware orthogonal foreground videos; in the second stage, VideoSyn takes the foreground videos as guidance and models the interaction between foreground dynamics and background context for controllable video synthesis; the PhysMV multi-view dataset is built for training. Result: Physical realism and spatio-temporal coherence improve significantly over existing video generation methods. Conclusion: By introducing 3D physical priors and multi-view modeling, PhysVideo mitigates physically implausible motion in video generation and offers a new paradigm for high-quality, physically consistent video generation. Abstract: Recent progress in video generation has led to substantial improvements in visual fidelity, yet ensuring physically consistent motion remains a fundamental challenge. Intuitively, this limitation can be attributed to the fact that real-world object motion unfolds in three-dimensional space, while video observations provide only partial, view-dependent projections of such dynamics. To address these issues, we propose PhysVideo, a two-stage framework that first generates physics-aware orthogonal foreground videos and then synthesizes full videos with background. In the first stage, Phys4View leverages physics-aware attention to capture the influence of physical attributes on motion dynamics, and enhances spatio-temporal consistency by incorporating geometry-enhanced cross-view attention and temporal attention. In the second stage, VideoSyn uses the generated foreground videos as guidance and learns the interactions between foreground dynamics and background context for controllable video synthesis. To support training, we construct PhysMV, a dataset containing 40K scenes, each consisting of four orthogonal viewpoints, resulting in a total of 160K video sequences. Extensive experiments demonstrate that PhysVideo significantly improves physical realism and spatial-temporal coherence over existing video generation methods. Home page: https://anonymous.4open.science/w/Phys4D/.

[126] MeInTime: Bridging Age Gap in Identity-Preserving Face Restoration

Teer Song,Yue Zhang,Yu Tian,Ziyang Wang,Xianlin Zhang,Guixuan Zhang,Xuan Liu,Xueming Li,Yasen Zhang

Main category: cs.CV

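TL;DR: This paper proposes MeInTime, a diffusion-based cross-age face restoration method that decouples identity and age modeling and, given only cross-age reference images and an age prompt, achieves restoration with high identity fidelity and age consistency.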

Details Motivation: Most existing reference-based face restoration methods implicitly assume that the reference image and the degraded input are the same age, which limits them in practical scenarios such as historical photo restoration where only cross-age references exist. Method: Propose MeInTime: 1) a new attention mechanism that explicitly injects identity features; 2) Gated Residual Fusion modules that merge degraded features with identity representations; 3) Age-Aware Gradient Guidance, a training-free sampling strategy that uses an age-driven direction to steer the denoising latent toward the target age semantic manifold. Result: Outperforms existing face restoration methods in both identity preservation and age consistency. Conclusion: MeInTime extends reference-based face restoration to the cross-age setting, supports explicitly decoupling identity and age modeling, and offers a new approach for applications such as historical image restoration. Abstract: To better preserve an individual's identity, face restoration has evolved from reference-free to reference-based approaches, which leverage high-quality reference images of the same identity to enhance identity fidelity in the restored outputs. However, most existing methods implicitly assume that the reference and degraded input are age-aligned, limiting their effectiveness in real-world scenarios where only cross-age references are available, such as historical photo restoration. This paper proposes MeInTime, a diffusion-based face restoration method that extends reference-based restoration from same-age to cross-age settings. Given one or a few reference images along with an age prompt corresponding to the degraded input, MeInTime achieves faithful restoration with both identity fidelity and age consistency. Specifically, we decouple the modeling of identity and age conditions. During training, we focus solely on effectively injecting identity features through a newly introduced attention mechanism and introduce Gated Residual Fusion modules to facilitate the integration between degraded features and identity representations. At inference, we propose Age-Aware Gradient Guidance, a training-free sampling strategy, using an age-driven direction to iteratively nudge the identity-aware denoising latent toward the desired age semantic manifold. Extensive experiments demonstrate that MeInTime outperforms existing face restoration methods in both identity preservation and age consistency. Our code is available at: https://github.com/teer4/MeInTime

[127] Click-to-Ask: An AI Live Streaming Assistant with Offline Copywriting and Online Interactive QA

Ruizhi Yu,Keyang Zhong,Peng Liu,Qi Wu,Haoran Zhang,Yanhao Zhang,Chen Chen,Haonan Lu

Main category: cs.CV

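TL;DR: This paper presents Click-to-Ask, an AI assistant for live streaming commerce. An offline module processes multimodal product information to produce structured data and compliant copywriting, and an online module answers viewer questions in real time, improving preparation efficiency, content engagement, and response timeliness.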

Details Motivation: Make product promotion more efficient and convenient for streamers in live streaming commerce by addressing two challenges: answering viewer questions in real time and preparing promotional content efficiently. Method: Design an AI system with offline and online modules: the offline module processes multimodal product information and generates structured data and compliant copy; the online module combines the offline output with an event-level historical memory maintained in a streaming architecture to answer questions the streamer clicks on in real time. Result: On a self-collected dataset of TikTok live stream frames, Question Recognition Accuracy reaches 0.913 and the Response Quality score is 0.876. Conclusion: Click-to-Ask improves preparation efficiency, interactivity, and responsiveness in live streaming commerce and shows good potential for practical use. Abstract: Live streaming commerce has become a prominent form of broadcasting in the modern era. To facilitate more efficient and convenient product promotions for streamers, we present Click-to-Ask, an AI-driven assistant for live streaming commerce with complementary offline and online components. The offline module processes diverse multimodal product information, transforming complex inputs into structured product data and generating compliant promotional copywriting. During live broadcasts, the online module enables real-time responses to viewer inquiries by allowing streamers to click on questions and leveraging both the structured product information generated by the offline module and an event-level historical memory maintained in a streaming architecture. This system significantly reduces the time needed for promotional preparation, enhances content engagement, and enables prompt interaction with audience inquiries, ultimately improving the effectiveness of live streaming commerce. On our collected dataset of TikTok live stream frames, the proposed method achieves a Question Recognition Accuracy of 0.913 and a Response Quality score of 0.876, demonstrating considerable potential for practical application. The video demonstration can be viewed here: https://www.youtube.com/shorts/mWIXK-SWhiE.

[128] Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation

Pius Horn,Janis Keuper

Main category: cs.CV

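TL;DR: This paper proposes a semantics-aware evaluation framework for table extraction based on synthetic PDFs and LLM-as-a-judge. It agrees with human judgment substantially better than structural-similarity metrics such as TEDS and GriTS, and is validated through a large-scale human study and an evaluation of 21 PDF parsers.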

Details Motivation: Existing evaluations of PDF table extraction rely on rule-based metrics such as TEDS and GriTS, which cannot measure semantic equivalence of table content and therefore diverge from human judgment. Method: Build a benchmark of realistic, diverse synthetic PDFs generated from LaTeX; design a semantic matching pipeline that incorporates LLM-as-a-judge and accommodates inconsistencies in parser output; run a human validation study with over 1,500 quality judgments. Result: LLM-based evaluation correlates with human judgment at Pearson r = 0.93, well above TEDS (r = 0.68) and GriTS (r = 0.70); evaluating 21 parsers on 451 tables reveals significant performance differences. Conclusion: LLM-as-a-judge provides a more reliable, scalable, and reproducible semantic evaluation approach for PDF table extraction, with practical value for scientific data mining and knowledge base construction. Abstract: Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule-based metrics that fail to capture semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity. As our central methodological contribution, we apply LLM-as-a-judge for semantic table evaluation, integrated into a matching pipeline that accommodates inconsistencies in parser outputs. Through a human validation study comprising over 1,500 quality judgments on extracted table pairs, we show that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.93) compared to Tree Edit Distance-based Similarity (TEDS, r=0.68) and Grid Table Similarity (GriTS, r=0.70). Evaluating 21 contemporary PDF parsers across 100 synthetic documents containing 451 tables reveals significant performance disparities. Our results offer practical guidance for selecting parsers for tabular data extraction and establish a reproducible, scalable evaluation methodology for this critical task. Code and data: https://github.com/phorn1/pdf-parse-bench Metric study and human evaluation: https://github.com/phorn1/table-metric-study
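
The metric comparison in this entry rests on Pearson's correlation between an automatic score and human judgments over the same table pairs. A minimal sketch of that statistic follows; the function name and the toy scores are illustrative and are not taken from the linked repositories.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical ratings for three extracted tables.
human_scores = [1.0, 2.0, 3.0]
metric_scores = [1.0, 3.0, 2.0]
r = pearson_r(human_scores, metric_scores)
```

A metric that ranks tables exactly as humans do on a linear scale gives r = 1, so the reported gap between 0.93 and roughly 0.7 indicates that the structural metrics disagree with human raters considerably more often.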

[129] Multiscale Switch for Semi-Supervised and Contrastive Learning in Medical Ultrasound Image Segmentation

Jingguo Qu,Xinyang Han,Yao Pu,Man-Lik Chui,Simon Takadiyi Gunda,Ziman Chen,Jing Qin,Ann Dorothy King,Winnie Chiu-Wing Chu,Jing Cai,Michael Tin-Cheung Ying

Main category: cs.CV

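TL;DR: This paper proposes Switch, a semi-supervised learning framework that uses Multiscale Switch (MSS) and Frequency Domain Switch (FDS) strategies to substantially improve ultrasound image segmentation, and that surpasses fully supervised baselines even at low labeling ratios.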

Details Motivation: Medical ultrasound image segmentation is hampered by scarce labeled data and by imaging artifacts such as speckle noise and low-contrast boundaries; existing semi-supervised methods underuse unlabeled data and lack robust feature representation mechanisms. Method: Propose the Switch framework: 1) the Multiscale Switch (MSS) strategy, which uses hierarchical patch mixing to achieve uniform spatial coverage; 2) Frequency Domain Switch (FDS) with contrastive learning, which performs amplitude switching in Fourier space for more robust features; the whole is built on a teacher-student architecture that uses both labeled and unlabeled data. Result: Evaluated on six ultrasound datasets (lymph nodes, breast lesions, thyroid nodules, prostate); at a 5% labeling ratio, Dice reaches 80.04% (LN-INT), 85.52% (DDTI), and 83.48% (Prostate), exceeding fully supervised baselines; the model has only 1.8M parameters. Conclusion: Switch is a parameter-efficient, high-performing semi-supervised method for ultrasound image segmentation, suited to clinical settings with limited annotation resources; the code is open source. Abstract: Medical ultrasound image segmentation faces significant challenges due to limited labeled data and characteristic imaging artifacts including speckle noise and low-contrast boundaries. While semi-supervised learning (SSL) approaches have emerged to address data scarcity, existing methods suffer from suboptimal unlabeled data utilization and lack robust feature representation mechanisms. In this paper, we propose Switch, a novel SSL framework with two key innovations: (1) Multiscale Switch (MSS) strategy that employs hierarchical patch mixing to achieve uniform spatial coverage; (2) Frequency Domain Switch (FDS) with contrastive learning that performs amplitude switching in Fourier space for robust feature representations. Our framework integrates these components within a teacher-student architecture to effectively leverage both labeled and unlabeled data. Comprehensive evaluation across six diverse ultrasound datasets (lymph nodes, breast lesions, thyroid nodules, and prostate) demonstrates consistent superiority over state-of-the-art methods. At 5% labeling ratio, Switch achieves remarkable improvements: 80.04% Dice on LN-INT, 85.52% Dice on DDTI, and 83.48% Dice on Prostate datasets, with our semi-supervised approach even exceeding fully supervised baselines. The method maintains parameter efficiency (1.8M parameters) while delivering superior performance, validating its effectiveness for resource-constrained medical imaging applications. The source code is publicly available at https://github.com/jinggqu/Switch
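
"Amplitude switching in Fourier space" can be illustrated as follows: keep the phase spectrum of one signal and substitute the amplitude spectrum of another. The sketch is a loose illustration under several assumptions. It uses a 1D signal and a naive DFT so that it needs only the standard library, whereas the paper presumably works on 2D images; it swaps the full spectrum, while the actual method may switch only part of it. All function names are hypothetical.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform (O(n^2)), kept simple for illustration."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def amplitude_switch(content, style):
    """Keep the phase of `content` and take the amplitude of `style`."""
    content_spec, style_spec = dft(content), dft(style)
    mixed = [abs(s) * cmath.exp(1j * cmath.phase(c))
             for c, s in zip(content_spec, style_spec)]
    # For real inputs the mixed spectrum stays conjugate-symmetric,
    # so the imaginary parts are numerical noise and can be dropped.
    return [value.real for value in idft(mixed)]
```

The usual intuition is that phase carries most of the structural content (boundaries, shapes) while amplitude carries appearance statistics such as contrast, so a segmentation label plausibly remains valid after the swap. Whether FDS relies on exactly this argument is not stated in the abstract.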

[130] Multimodal Model for Computational Pathology: Representation Learning and Image Compression

Peihang Wu,Zehong Chen,Lijian Xu

Main category: cs.CV

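TL;DR: This review surveys recent advances in multimodal computational pathology, focusing on the challenges posed by the high resolution of whole slide images (WSIs), scarce annotations, multimodal fusion, and model interpretability. It identifies four research directions: self-supervised representation learning with structure-aware token compression, multimodal data generation and augmentation, parameter-efficient adaptation with reasoning-enhanced few-shot learning, and multi-agent collaborative reasoning. It argues for a unified, interpretable framework that integrates high-resolution imaging with clinical knowledge.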

Details Motivation: The extreme resolution of whole slide images (WSIs) makes computation difficult, scarce expert annotations limit supervised learning, multimodal fusion struggles to preserve biological interpretability, and modeling ultra-long visual sequences lacks clinical transparency. Method: A systematic review organized around four directions: (1) self-supervised representation learning and structure-aware token compression; (2) multimodal data generation and augmentation; (3) parameter-efficient adaptation and reasoning-enhanced few-shot learning; (4) multi-agent collaborative reasoning. It examines in particular how token compression enables cross-scale modeling and how multi-agent mechanisms simulate a pathologist's "Chain of Thought" to achieve uncertainty-aware evidence fusion. Result: The review maps the key technical routes and development of multimodal computational pathology and identifies token compression and multi-agent mechanisms as central to improving modeling capacity and clinical trustworthiness. Conclusion: Future progress depends on unified multimodal frameworks that deeply integrate high-resolution visual data, clinical reports, and biomedical knowledge to support interpretable and safe AI-assisted diagnosis. Abstract: Whole slide imaging (WSI) has transformed digital pathology by enabling computational analysis of gigapixel histopathology images. Recent foundation model advances have accelerated progress in computational pathology, facilitating joint reasoning across pathology images, clinical reports, and structured data. Despite this progress, challenges remain: the extreme resolution of WSIs creates computational hurdles for visual learning; limited expert annotations constrain supervised approaches; integrating multimodal information while preserving biological interpretability remains difficult; and the opacity of modeling ultra-long visual sequences hinders clinical transparency. This review comprehensively surveys recent advances in multimodal computational pathology. We systematically analyze four research directions: (1) self-supervised representation learning and structure-aware token compression for WSIs; (2) multimodal data generation and augmentation; (3) parameter-efficient adaptation and reasoning-enhanced few-shot learning; and (4) multi-agent collaborative reasoning for trustworthy diagnosis. We specifically examine how token compression enables cross-scale modeling and how multi-agent mechanisms simulate a pathologist's "Chain of Thought" across magnifications to achieve uncertainty-aware evidence fusion. Finally, we discuss open challenges and argue that future progress depends on unified multimodal frameworks integrating high-resolution visual data with clinical and biomedical knowledge to support interpretable and safe AI-assisted diagnosis.

[131] Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels

Juan Miguel Valverde,Dim P. Papadopoulos,Rasmus Larsen,Anders Bjorholm Dahl

Main category: cs.CV

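TL;DR: This paper proposes SCNP, a method that improves the topology accuracy of image segmentation by penalizing each logit with its poorest-classified neighbor. It applies to a range of structure morphologies and image modalities, is validated on 13 datasets, and can be integrated into multiple segmentation frameworks and loss functions.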

Details Motivation: Standard deep learning segmentation models cannot guarantee topology accuracy (for example, the number of connected components), which undermines the reliability of downstream quantitative analysis; existing remedies are hard to integrate, computationally expensive, or restricted to particular morphologies. Method: SCNP (Same-Class Neighbor Penalization) penalizes each pixel's logits during training using the poorest-classified neighboring pixel of the same class, so that the model must improve the prediction at a pixel's neighbors before it can improve the pixel itself. Result: SCNP is validated on 13 datasets covering different structure morphologies and image modalities, and is integrated into three semantic and instance segmentation frameworks and several loss functions, improving topology accuracy at low computational cost. Conclusion: SCNP is an efficient, general, and easily integrated strategy for improving topology accuracy that does not depend on a specific network architecture or morphology assumption. Abstract: Standard deep learning models for image segmentation cannot guarantee topology accuracy, failing to preserve the correct number of connected components or structures. This, in turn, affects the quality of the segmentations and compromises the reliability of the subsequent quantification analyses. Previous works have proposed to enhance topology accuracy with specialized frameworks, architectures, and loss functions. However, these methods are often cumbersome to integrate into existing training pipelines, they are computationally very expensive, or they are restricted to structures with tubular morphology. We present SCNP, an efficient method that improves topology accuracy by penalizing the logits with their poorest-classified neighbor, forcing the model to improve the prediction at the pixels' neighbors before allowing it to improve the pixels themselves. We show the effectiveness of SCNP across 13 datasets, covering different structure morphologies and image modalities, and integrate it into three frameworks for semantic and instance segmentation. Additionally, we show that SCNP can be integrated into several loss functions, making them improve topology accuracy. Our code can be found at https://jmlipman.github.io/SCNP-SameClassNeighborPenalization.
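
The neighbor-penalization idea in the SCNP abstract can be illustrated with a rough sketch. This is not the authors' implementation: it assumes a binary task with a single foreground-logit map, a square neighborhood of hypothetical size `radius`, and the reading that "poorest-classified" means the lowest logit among foreground neighbors and the highest logit among background neighbors. The function name `scnp_penalize` is made up for illustration.

```python
def scnp_penalize(logits, labels, radius=1):
    """Replace each logit with that of its worst-classified same-class neighbor.

    logits: 2D list of foreground logits; labels: 2D list of 0/1 ground truth.
    Assumed rule (not taken from the paper): foreground pixels take the minimum
    logit among same-class neighbors, background pixels take the maximum.
    """
    h, w = len(logits), len(logits[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Collect logits of neighbors (including the pixel) sharing its label.
            same = [
                logits[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
                if labels[j][i] == labels[y][x]
            ]
            out[y][x] = min(same) if labels[y][x] == 1 else max(same)
    return out


logits = [[2.0, -1.0, 3.0], [0.5, 4.0, -2.0]]
labels = [[1, 1, 0], [1, 1, 0]]
penalized = scnp_penalize(logits, labels)
```

Feeding the penalized logits to an ordinary loss means a well-classified pixel still incurs a loss while any same-class neighbor is poorly classified. A trainable version would express the same selection as differentiable min/max pooling in a deep learning framework.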

[132] Ontology-Guided Diffusion for Zero-Shot Visual Sim2Real Transfer

Mohamed Youssef,Mayar Elfares,Anna-Maria Meer,Matteo Bortoletto,Andreas Bulling

Main category: cs.CV

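TL;DR: This paper proposes Ontology-Guided Diffusion (OGD), a neuro-symbolic framework for zero-shot simulation-to-reality (sim2real) image translation. OGD models realism as structured knowledge (interpretable traits such as lighting and material, with their relationships stored in a knowledge graph), then combines a graph neural network embedding with instructions produced by a symbolic planner to drive a pretrained diffusion model, yielding high-quality, interpretable, and data-efficient cross-domain translation.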

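Details Motivation: Existing diffusion-based sim2real methods rely on unstructured prompts or statistical alignment and fail to capture the structured factors that make an image look real; labeled real-world data is also scarce, which calls for a zero-shot, interpretable, and generalizable approach. Method: OGD builds a realism ontology that decomposes realism into interpretable visual traits (such as lighting and material) and encodes their relationships in a knowledge graph. It infers trait activations from a synthetic image and produces a global embedding with a graph neural network, while a symbolic planner generates a consistent sequence of visual edits. The embedding conditions the diffusion model through cross-attention, and the planned edits are supplied as a structured instruction prompt. Result: Across multiple benchmarks, OGD's graph embeddings distinguish real from synthetic images better than baselines, and its sim2real translation outperforms state-of-the-art diffusion methods; the results indicate that structured realism modeling improves interpretability, data efficiency, and zero-shot generalization. Conclusion: Explicitly modeling the structure of realism (an ontology plus a knowledge graph) is key to interpretable, data-efficient, and generalizable zero-shot sim2real transfer, and OGD offers a template for neuro-symbolic methods in generative vision tasks. Abstract: Bridging the simulation-to-reality (sim2real) gap remains challenging as labelled real-world data is scarce. Existing diffusion-based approaches rely on unstructured prompts or statistical alignment, which do not capture the structured factors that make images look real. We introduce Ontology-Guided Diffusion (OGD), a neuro-symbolic zero-shot sim2real image translation framework that represents realism as structured knowledge. OGD decomposes realism into an ontology of interpretable traits -- such as lighting and material properties -- and encodes their relationships in a knowledge graph. From a synthetic image, OGD infers trait activations and uses a graph neural network to produce a global embedding. In parallel, a symbolic planner uses the ontology traits to compute a consistent sequence of visual edits needed to narrow the realism gap. The graph embedding conditions a pretrained instruction-guided diffusion model via cross-attention, while the planned edits are converted into a structured instruction prompt. Across benchmarks, our graph-based embeddings better distinguish real from synthetic imagery than baselines, and OGD outperforms state-of-the-art diffusion methods in sim2real image translations. Overall, OGD shows that explicitly encoding realism structure enables interpretable, data-efficient, and generalisable zero-shot sim2real transfer.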

[133] EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation

Longfei Liu,Yongjie Hou,Yang Li,Qirui Wang,Youyang Sha,Yongjun Yu,Yinzhi Wang,Peizhe Ru,Xuanlong Yu,Xi Shen

Main category: cs.CV

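TL;DR: This paper proposes EdgeCrafter, a unified framework of compact ViTs for dense prediction on edge devices. Using task-specialized distillation and an edge-friendly encoder-decoder design, it reports state-of-the-art results on detection, instance segmentation, and pose estimation, showing that compact ViTs are feasible and competitive in edge settings.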

Details Motivation: Dense prediction on edge devices is still dominated by CNNs such as YOLO, while small ViTs struggle to balance accuracy and efficiency because of insufficient task-specific representation learning; the authors argue that the problem is a lack of targeted optimization, not an inherent mismatch between ViTs and edge dense prediction. Method: The EdgeCrafter framework is centered on the ECDet detection model, which pairs a compact ViT backbone obtained through knowledge distillation with an encoder-decoder designed for edge deployment. It is extended to instance segmentation (ECInsSeg) and pose estimation (ECPose-X), with an emphasis on low parameter counts and edge suitability throughout. Result: ECDet-S reaches 51.7 AP on COCO with fewer than 10M parameters and only COCO annotations; ECInsSeg performs comparably to RF-DETR with fewer parameters; ECPose-X reaches 74.8 AP, well above YOLO26Pose-X (71.6 AP), which relies on Objects365 pretraining. Conclusion: Compact ViTs combined with task-specialized distillation and edge-aware architecture design can deliver efficient, high-quality dense prediction on resource-constrained edge devices and are a practical alternative to CNNs. Abstract: Deploying high-performance dense prediction models on resource-constrained edge devices remains challenging due to strict limits on computation and memory. In practice, lightweight systems for object detection, instance segmentation, and pose estimation are still dominated by CNN-based architectures such as YOLO, while compact Vision Transformers (ViTs) often struggle to achieve a similarly strong accuracy-efficiency trade-off, even with large-scale pretraining. We argue that this gap is largely due to insufficient task-specific representation learning in small-scale ViTs, rather than an inherent mismatch between ViTs and edge dense prediction. To address this issue, we introduce EdgeCrafter, a unified compact ViT framework for edge dense prediction centered on ECDet, a detection model built from a distilled compact backbone and an edge-friendly encoder-decoder design. On the COCO dataset, ECDet-S achieves 51.7 AP with fewer than 10M parameters using only COCO annotations. For instance segmentation, ECInsSeg achieves performance comparable to RF-DETR while using substantially fewer parameters. For pose estimation, ECPose-X reaches 74.8 AP, significantly outperforming YOLO26Pose-X (71.6 AP) despite the latter's reliance on extensive Objects365 pretraining. These results show that compact ViTs, when paired with task-specialized distillation and edge-aware design, can be a practical and competitive option for edge dense prediction. Code is available at: https://intellindust-ai-lab.github.io/projects/EdgeCrafter/

[134] 6Bit-Diffusion: Inference-Time Mixed-Precision Quantization for Video Diffusion Models

Rundong Su,Jintao Zhang,Zhihang Yuan,Haojie Duanmu,Jianfei Chen,Jun Zhu

Main category: cs.CV

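TL;DR: This paper proposes a mixed-precision quantization and compute-optimization framework for video diffusion transformers (Video DiTs). It dynamically assigns NVFP4 or INT8 precision and introduces a Temporal Delta Cache (TDC), improving inference efficiency without degrading generation quality.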

Details Motivation: Existing post-training quantization methods use a static bit-width allocation and ignore how the quantization difficulty of activations varies across diffusion timesteps, giving a poor efficiency-quality trade-off; the high memory and compute cost of Video DiT inference also limits practical deployment. Method: An NVFP4/INT8 mixed-precision quantization framework. Building on a strong linear correlation between a Transformer block's input-output difference and the quantization sensitivity of its linear layers, a lightweight predictor dynamically assigns low precision (NVFP4) to temporally stable layers and higher precision (INT8) to volatile layers. Because block residuals are highly consistent over time, a Temporal Delta Cache (TDC) skips recomputation of invariant blocks. Result: Experiments show 1.92× end-to-end acceleration and 3.32× memory reduction, setting a new baseline for efficient Video DiT inference. Conclusion: Dynamic mixed-precision quantization combined with exploiting temporal redundancy (TDC) is an effective way to speed up video diffusion transformer inference while balancing compression, speed, and generation quality. Abstract: Diffusion transformers have demonstrated remarkable capabilities in generating videos. However, their practical deployment is severely constrained by high memory usage and computational cost. Post-Training Quantization provides a practical way to reduce memory usage and boost computation speed. Existing quantization methods typically apply a static bit-width allocation, overlooking the quantization difficulty of activations across diffusion timesteps, leading to a suboptimal trade-off between efficiency and quality. In this paper, we propose an inference-time NVFP4/INT8 Mixed-Precision Quantization framework. We find a strong linear correlation between a block's input-output difference and the quantization sensitivity of its internal linear layers. Based on this insight, we design a lightweight predictor that dynamically allocates NVFP4 to temporally stable layers to maximize memory compression, while selectively preserving INT8 for volatile layers to ensure robustness. This adaptive precision strategy enables aggressive quantization without compromising generation quality. Besides this, we observe that the residual between the input and output of a Transformer block exhibits high temporal consistency across timesteps. Leveraging this temporal redundancy, we introduce Temporal Delta Cache (TDC) to skip computations for these invariant blocks, further reducing the computational cost.
Extensive experiments demonstrate that our method achieves 1.92$\times$ end-to-end acceleration and 3.32$\times$ memory reduction, setting a new baseline for efficient inference in Video DiTs.
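
The two mechanisms in this abstract (precision selection driven by a block's input-output difference, and skipping blocks whose residual barely changes between timesteps) can be sketched in a few lines. Everything here is an assumption for illustration: the thresholds, the `choose_precision` rule, and the `TemporalDeltaCache` class are hypothetical stand-ins, and no real quantization is performed.

```python
def relative_change(prev, curr):
    """L2 norm of (curr - prev) relative to the L2 norm of prev."""
    num = sum((c - p) ** 2 for p, c in zip(prev, curr)) ** 0.5
    den = sum(p ** 2 for p in prev) ** 0.5 or 1.0  # avoid division by zero
    return num / den


def choose_precision(block_in, block_out, threshold=0.1):
    """Hypothetical rule: small input-output difference -> NVFP4, else INT8."""
    return "NVFP4" if relative_change(block_in, block_out) < threshold else "INT8"


class TemporalDeltaCache:
    """Reuse a block's cached residual when its input has barely changed."""

    def __init__(self, block, tol=0.05):
        self.block = block
        self.tol = tol
        self.last_input = None
        self.last_delta = None
        self.skipped = 0

    def __call__(self, x):
        if self.last_input is not None and relative_change(self.last_input, x) < self.tol:
            self.skipped += 1
            # Skip the block: add the cached residual to the new input.
            return [xi + di for xi, di in zip(x, self.last_delta)]
        y = self.block(x)
        self.last_input = list(x)
        self.last_delta = [yi - xi for xi, yi in zip(x, y)]
        return y


cached = TemporalDeltaCache(lambda x: [2.0 * v for v in x])
first = cached([1.0, 2.0])    # computed by the block
second = cached([1.01, 2.0])  # input nearly unchanged, residual reused
```

The paper's predictor is learned and its skip criterion may differ; the sketch only shows where the decision sits relative to the block call.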

[135] WeNLEX: Weakly Supervised Natural Language Explanations for Multilabel Chest X-ray Classification

Isabel Rio-Torto,Jaime S. Cardoso,Luís F. Teixeira

Main category: cs.CV

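TL;DR: WeNLEX is a weakly supervised model that generates natural language explanations for multilabel chest X-ray classification. It uses image reconstruction and distribution alignment to keep explanations faithful and plausible, works in both post-hoc and in-model (jointly trained) settings, and can even improve classification performance.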

Details Motivation: Existing methods supervise explanation generation with human-annotated explanations, so the outputs are plausible but not faithful to the model's reasoning; a method is needed that produces explanations that are both faithful and plausible without large annotated datasets. Method: WeNLEX ensures faithfulness by matching images reconstructed from the explanations against the original images in the black-box model's feature space, and maintains plausibility through distribution alignment with a small set of clinician-annotated explanations. It supports both post-hoc and in-model (joint training) modes, and the explanation database can be swapped to suit a different audience, such as non-medical users. Result: WeNLEX is validated on metrics for faithfulness, simulatability, diversity, and plausibility using as few as 5 ground-truth explanations per diagnosis; joint training improves classification AUC by 2.21%; a simplified layman version of the explanation model is also built. Conclusion: WeNLEX generates faithful and plausible natural language explanations from few examples, is flexible and practical, and shows that modeling interpretability can benefit downstream task performance. Abstract: Natural language explanations provide an inherently human-understandable way to explain black-box models, closely reflecting how radiologists convey their diagnoses in textual reports. Most works explicitly supervise the explanation generation process using datasets annotated with explanations. Thus, though plausible, the generated explanations are not faithful to the model's reasoning. In this work, we propose WeNLEX, a weakly supervised model for the generation of natural language explanations for multilabel chest X-ray classification. Faithfulness is ensured by matching images generated from their corresponding natural language explanations with original images, in the black-box model's feature space. Plausibility is maintained via distribution alignment with a small database of clinician-annotated explanations. We empirically demonstrate, through extensive validation on multiple metrics to assess faithfulness, simulatability, diversity, and plausibility, that WeNLEX is able to produce faithful and plausible explanations, using as little as 5 ground-truth explanations per diagnosis. Furthermore, WeNLEX can operate in both post-hoc and in-model settings. In the latter, i.e., when the multilabel classifier is trained together with the rest of the network, WeNLEX improves the classification AUC of the standalone classifier by 2.21%, thus showing that adding interpretability to the training process can actually increase the downstream task performance.
Additionally, simply by changing the database, WeNLEX explanations are adaptable to any target audience, and we showcase this flexibility by training a layman version of WeNLEX, where explanations are simplified for non-medical users.

[136] DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection

Haochen Li,Rui Zhang,Hantao Yao,Xin Zhang,Yifan Hao,Shaohui Peng,Yongwei Zhao,Ling Li

Main category: cs.CV

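TL;DR: This paper proposes DA-Mamba, a hybrid architecture combining CNNs with state space models (SSMs) for domain adaptive object detection. Its IA-SSM and OA-SSM modules provide image-level and instance-level global-local alignment respectively, balancing efficiency with long-range modeling.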

Details Motivation: CNN-based methods struggle to extract global domain-invariant features, while Transformer-based methods are computationally expensive; a method is needed that offers global modeling capacity at a practical deployment cost. Method: DA-Mamba fuses CNNs with linear-complexity state space models (SSMs). An Image-Aware SSM (IA-SSM) is embedded in the backbone to strengthen image-level global-local alignment, and an Object-Aware SSM (OA-SSM) is embedded in the detection head to model spatial and semantic dependencies among objects for instance-level alignment. Result: Experiments on multiple cross-domain detection benchmarks support the effectiveness and efficiency of DA-Mamba, which improves cross-domain detection performance while lowering computational cost. Conclusion: By adding lightweight, linear-complexity SSM modules, DA-Mamba models global domain-invariant features while retaining the efficiency of CNNs, offering a new approach to efficient domain adaptive object detection. Abstract: Domain Adaptive Object Detection (DAOD) aims to transfer detectors from a labeled source domain to an unlabeled target domain. Existing DAOD methods employ multi-granularity feature alignment to learn domain-invariant representations. However, the local connectivity of their CNN-based backbone and detection head restricts alignment to local regions, failing to extract global domain-invariant features. Although transformer-based DAOD methods capture global dependencies via attention mechanisms, their quadratic computational cost hinders practical deployment. To solve this, we propose DA-Mamba, a hybrid CNN-State Space Model (SSM) architecture that combines the efficiency of CNNs with the linear-time long-range modeling capability of SSMs to capture both global and local domain-invariant features. Specifically, we introduce two novel modules: Image-Aware SSM (IA-SSM) and Object-Aware SSM (OA-SSM). IA-SSM is integrated into the backbone to enhance global domain awareness, enabling image-level global and local alignment. OA-SSM is inserted into the detection head to model spatial and semantic dependencies among objects, enhancing instance-level alignment. Comprehensive experiments demonstrate that the proposed method can efficiently improve the cross-domain performance of the object detector.
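
The "linear-time long-range modeling" attributed to SSMs comes from a recurrence that visits each sequence element once, unlike attention, which compares every pair. A minimal sketch of a discrete scalar state space recurrence follows; it uses fixed parameters `a`, `b`, `c` chosen for illustration, whereas Mamba-style models make these input-dependent, and it is not the IA-SSM or OA-SSM module itself.

```python
def ssm_scan(xs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t, y_t = c * h_t over a sequence in O(n)."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # the state carries information from all earlier inputs
        ys.append(c * h)
    return ys


# An impulse at t=0 keeps influencing later outputs with geometric decay.
response = ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)
```

Because the state `h` summarizes the whole prefix, a feature at one end of a flattened image sequence can affect outputs at the other end without any pairwise comparison.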

[137] ProCal: Probability Calibration for Neighborhood-Guided Source-Free Domain Adaptation

Ying Zheng,Yiyi Zhang,Yi Wang,Lap-Pui Chau

Main category: cs.CV

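TL;DR: This paper proposes ProCal, which dynamically calibrates neighborhood prediction probabilities through a dual-model collaborative prediction mechanism. It addresses the forgetting of source knowledge and overfitting to local noise in source-free domain adaptation, and is optimized jointly with a soft supervision loss and a diversity loss.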

Details Motivation: Existing neighborhood-based source-free domain adaptation methods over-rely on prediction similarity among neighbors, which leads to rapid forgetting of source knowledge and overfitting to local noise. Method: ProCal is a probability calibration method that combines the source model's initial predictions with the current model's online outputs to calibrate neighbor probabilities, together with a joint objective that combines a soft supervision loss and a diversity loss. Result: The method is validated on 31 cross-domain tasks across four public datasets, and a theoretical analysis shows that ProCal converges to an equilibrium in which source knowledge and target information are effectively fused. Conclusion: ProCal mitigates knowledge forgetting and overfitting while striking a balance between retaining source knowledge and adapting to the target domain. Abstract: Source-Free Domain Adaptation (SFDA) adapts pre-trained models to unlabeled target domains without requiring access to source data. Although state-of-the-art methods leveraging local neighborhood structures show promise for SFDA, they tend to over-rely on prediction similarity among neighbors. This over-reliance accelerates the forgetting of source knowledge and increases susceptibility to local noise overfitting. To address these issues, we introduce ProCal, a probability calibration method that dynamically calibrates neighborhood-based predictions through a dual-model collaborative prediction mechanism. ProCal integrates the source model's initial predictions with the current model's online outputs to effectively calibrate neighbor probabilities. This strategy not only mitigates the interference of local noise but also preserves the discriminative information from the source model, thereby achieving a balance between knowledge retention and domain adaptation. Furthermore, we design a joint optimization objective that combines a soft supervision loss with a diversity loss to guide the target model. Our theoretical analysis shows that ProCal converges to an equilibrium where source knowledge and target information are effectively fused, reducing both knowledge forgetting and overfitting. We validate the effectiveness of our approach through extensive experiments on 31 cross-domain tasks across four public datasets. Our code is available at: https://github.com/zhengyinghit/ProCal.
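
One simple way to picture a dual-model calibration of neighbor probabilities is a convex combination of the frozen source model's prediction and the current model's prediction for each neighbor, averaged into a soft target. The abstract does not give ProCal's actual formula, so the mixing weight `alpha`, the function names, and the averaging step below are all assumptions made for illustration.

```python
def calibrate(source_probs, current_probs, alpha=0.5):
    """Mix source-model and current-model class probabilities, then renormalize."""
    mixed = [alpha * s + (1.0 - alpha) * c for s, c in zip(source_probs, current_probs)]
    total = sum(mixed)
    return [m / total for m in mixed]


def neighborhood_target(neighbors, alpha=0.5):
    """Average the calibrated probabilities of all neighbors into a soft target.

    neighbors: list of (source_probs, current_probs) pairs, one per neighbor.
    """
    calibrated = [calibrate(s, c, alpha) for s, c in neighbors]
    n_classes = len(calibrated[0])
    return [sum(p[k] for p in calibrated) / len(calibrated) for k in range(n_classes)]


target = neighborhood_target([
    ([0.8, 0.2], [0.4, 0.6]),  # neighbor 1: source vs. current prediction
    ([0.6, 0.4], [0.2, 0.8]),  # neighbor 2
])
```

With `alpha` near 1 the target stays anchored to source knowledge, and with `alpha` near 0 it follows the adapting model, which is the retention-versus-adaptation trade-off the abstract describes.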

[138] SEAR: Simple and Efficient Adaptation of Visual Geometric Transformers for RGB+Thermal 3D Reconstruction

Vsevolod Skorokhodov,Chenghao Xu,Shuo Sun,Olga Fink,Malcolm Mielle

Main category: cs.CV

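TL;DR: This paper proposes SEAR, a simple and efficient fine-tuning strategy that adapts a pretrained visual geometry Transformer to multimodal RGB-thermal input. Trained on a small RGB-T dataset, it delivers large performance gains and maintains accurate 3D reconstruction and pose estimation even in challenging conditions such as low light and dense smoke.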

Details Motivation: Visual geometry models pretrained on RGB data degrade on mixed RGB-thermal (RGB-T) input, and in particular struggle to align the RGB and thermal modalities when they are processed jointly. Method: SEAR is a lightweight fine-tuning strategy that adapts a pretrained geometry Transformer to RGB-T input. The authors also introduce a new RGB-T dataset covering different capture times, viewpoints, and illumination conditions, and use ablation studies to analyze how the modalities are aligned. Result: SEAR outperforms state-of-the-art methods on 3D reconstruction and camera pose estimation (for example, an improvement of over 29% in AUC@30), with richer detail, better cross-modal consistency, and negligible inference overhead; it remains robust in challenging scenes such as low light and dense smoke. Conclusion: SEAR shows that a pretrained visual geometry model can be transferred efficiently with only a small amount of multimodal data, providing a practical, robust, and scalable approach to multimodal 3D scene understanding. Abstract: Foundational feed-forward visual geometry models enable accurate and efficient camera pose estimation and scene reconstruction by learning strong scene priors from massive RGB datasets. However, their effectiveness drops when applied to mixed sensing modalities, such as RGB-thermal (RGB-T) images. We observe that while a visual geometry grounded transformer pretrained on RGB data generalizes well to thermal-only reconstruction, it struggles to align RGB and thermal modalities when processed jointly. To address this, we propose SEAR, a simple yet efficient fine-tuning strategy that adapts a pretrained geometry transformer to multimodal RGB-T inputs. Despite being trained on a relatively small RGB-T dataset, our approach significantly outperforms state-of-the-art methods for 3D reconstruction and camera pose estimation, achieving large improvements on all metrics (e.g., over 29\% in AUC@30) and delivering higher detail and consistency between modalities with negligible overhead in inference time compared to the original pretrained model. Notably, SEAR enables reliable multimodal pose estimation and reconstruction even under challenging conditions, such as low lighting and dense smoke. We validate our architecture through extensive ablation studies, demonstrating how the model aligns both modalities. Additionally, we introduce a new dataset featuring RGB and thermal sequences captured at different times, viewpoints, and illumination conditions, providing a robust benchmark for future work in multimodal 3D scene reconstruction.
Code and models are publicly available at https://www.github.com/Schindler-EPFL-Lab/SEAR.

[139] Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors

Jiatong Xia,Zicheng Duan,Anton van den Hengel,Lingqiao Liu

Main category: cs.CV

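TL;DR: This paper proposes Points-to-3D, a framework that uses point cloud priors (for example, from LiDAR or VGGT) to give diffusion models better geometric controllability and structural completeness, improving the quality and geometric fidelity of generated 3D assets and scenes.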

Details Motivation: Most 3D generation methods are conditioned on images or text, while readily available 3D point cloud priors (such as LiDAR scans or VGGT outputs) remain underused, so explicit geometric constraints are not modeled effectively. Method: Built on the latent 3D diffusion model TRELLIS, the method initializes the sparse structure latent from point cloud priors and adds a structure inpainting network with a staged sampling strategy (global structural inpainting followed by boundary refinement), preserving the visible input regions while completing the overall geometry. Result: On object and scene generation, Points-to-3D outperforms state-of-the-art methods in rendering quality and geometric fidelity, and it accepts either real point clouds or point clouds estimated from a single image (for example, by VGGT). Conclusion: Explicitly embedding point cloud priors improves the geometric accuracy and structural controllability of 3D generation and offers a new approach to controllable generation from geometric priors. Abstract: Recent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, visible-region point clouds are easy to obtain from active sensors such as LiDAR or from feed-forward predictors like VGGT, offering explicit geometric constraints that current methods fail to exploit. In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point cloud priors for geometry-controllable 3D asset and scene generation. Built on the latent 3D diffusion model TRELLIS, Points-to-3D first replaces pure-noise sparse structure latent initialization with an input formulation tailored to point cloud priors. A structure inpainting network, trained within the TRELLIS framework on task-specific data designed to learn global structural inpainting, is then used for inference with a staged sampling strategy (structural inpainting followed by boundary refinement), completing the global geometry while preserving the visible regions of the input priors. In practice, Points-to-3D can take either accurate point-cloud priors or VGGT-estimated point clouds from single images as input. Experiments on both object and scene scenarios consistently demonstrate superior performance over state-of-the-art baselines in terms of rendering quality and geometric fidelity, highlighting the effectiveness of explicitly embedding point-cloud priors for achieving more accurate and structurally controllable 3D generation.

[140] Rethinking Uncertainty Quantification and Entanglement in Image Segmentation

Jakob Lønborg Christensen,Vedrana Andersen Dahl,Morten Rieger Hannemose,Anders Bjorholm Dahl,Christian F. Baumgartner

Main category: cs.CV

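TL;DR: This paper presents a comprehensive empirical study of uncertainty quantification (UQ) in medical image segmentation. It analyzes how models of data-related aleatoric uncertainty (AU) and model-related epistemic uncertainty (EU) combine and how the two become entangled, proposes a new metric to quantify that entanglement, and evaluates many AU-EU combinations on downstream tasks such as OOD detection, ambiguity modeling, and calibration.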

Details Motivation: Many methods exist for modeling AU and EU, but how they behave in combination and how strongly the two are entangled is unclear, which limits the interpretability and practical usefulness of UQ. Method: An empirical study covering a broad range of AU-EU model combinations, a proposed metric that quantifies AU-EU entanglement, and a systematic evaluation of the combinations on downstream UQ tasks (OOD detection, ambiguity modeling, calibration). Result: Ensembles show the lowest entanglement and the best performance for OOD detection. For ambiguity modeling and calibration the best model depends on the dataset, with softmax/SSN-based methods performing well and Probabilistic UNets being less entangled. A softmax ensemble performs strongly on all tasks. The paper also analyzes sources of entanglement and directions for mitigation. Conclusion: AU and EU are substantially entangled, which weakens the meaning of the decomposition; AU-EU combinations should be chosen with care; a softmax ensemble is a practical choice that combines good performance with low entanglement; and disentanglement mechanisms need further study. Abstract: Uncertainty quantification (UQ) is crucial in safety-critical applications such as medical image segmentation. Total uncertainty is typically decomposed into data-related aleatoric uncertainty (AU) and model-related epistemic uncertainty (EU). Many methods exist for modeling AU (such as Probabilistic UNet, Diffusion) and EU (such as ensembles, MC Dropout), but it is unclear how they interact when combined. Additionally, recent work has revealed substantial entanglement between AU and EU, undermining the interpretability and practical usefulness of the decomposition. We present a comprehensive empirical study covering a broad range of AU-EU model combinations, propose a metric to quantify uncertainty entanglement, and evaluate both across downstream UQ tasks. For out-of-distribution detection, ensembles exhibit consistently lower entanglement and superior performance. For ambiguity modeling and calibration, the best models are dataset-dependent, with softmax/SSN-based methods performing well and Probabilistic UNets being less entangled. A softmax ensemble fares remarkably well on all tasks. Finally, we analyze potential sources of uncertainty entanglement and outline directions for mitigating this effect.
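
The AU/EU decomposition mentioned in this abstract is commonly computed from an ensemble with the information-theoretic split: total uncertainty is the entropy of the averaged prediction, aleatoric uncertainty is the average entropy of the individual members, and epistemic uncertainty is the difference between the two. The sketch below shows that standard split for a single pixel; whether the paper uses exactly this formulation is an assumption, and its entanglement metric is not reproduced here.

```python
import math


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def decompose(member_probs):
    """Split ensemble uncertainty for one pixel into (total, aleatoric, epistemic).

    member_probs: one class-probability list per ensemble member.
    """
    n = len(member_probs)
    n_classes = len(member_probs[0])
    mean = [sum(p[k] for p in member_probs) / n for k in range(n_classes)]
    total = entropy(mean)                                  # entropy of the mean
    aleatoric = sum(entropy(p) for p in member_probs) / n  # mean of the entropies
    return total, aleatoric, total - aleatoric             # epistemic = difference


# Members that disagree confidently: high epistemic uncertainty.
total_d, au_d, eu_d = decompose([[0.9, 0.1], [0.1, 0.9]])
# Members that agree: the epistemic term vanishes, whatever the aleatoric level.
total_a, au_a, eu_a = decompose([[0.7, 0.3], [0.7, 0.3]])
```

Entanglement in the abstract's sense would show up when the aleatoric and epistemic terms move together across inputs instead of responding to separate causes.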

[141] Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

Yuchen Li,Amanmeet Garg,Shalini Chaudhuri,Rui Zhao,Garin Kessler

Main category: cs.CV

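TL;DR: This paper proposes Perceptio, a perception-enhanced large vision language model with 2D and 3D spatial reasoning. It generates explicit semantic segmentation tokens and depth tokens within the autoregressive sequence, improving the spatial grounding of LVLMs.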

Details Motivation: Large vision language models (LVLMs) are strong at semantic understanding but weak at fine-grained spatial grounding, because they lack an explicit spatial representation and must infer complex geometry implicitly. Method: (1) Distill a monocular depth teacher into a VQ-VAE depth codebook that encodes dense depth maps as compact sequences; (2) embed SAM2 semantic segmentation tokens and VQ-VAE depth tokens in the LLM so that the model emits spatial tokens before answering; (3) stabilize training with composite depth-token losses (marker, token, and count) and a differentiable soft-merging reconstruction technique; (4) use a multi-task co-training strategy. Result: Built on InternVL, Perceptio reaches state-of-the-art results on several benchmarks: cIoU gains of +0.8/+1.4/+1.1 on RefCOCO/+/g, a 10.3% accuracy gain on HardBLINK spatial understanding, and a 1.0% accuracy gain on MMBench. Conclusion: Explicit spatial tokens and a spatial chain of thought materially strengthen spatial grounding in LVLMs and offer a new approach to spatial understanding in multimodal models. Abstract: Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.
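
Tokenizing dense depth with a VQ-VAE codebook comes down to a nearest-neighbor lookup: each encoded depth patch is replaced by the index of its closest codebook vector, and the indices become tokens in the model's output sequence. The sketch below shows only that lookup on toy 2D vectors. The codebook values, the `<depth>`/`</depth>` marker strings, and the `d{index}` token naming are hypothetical and not taken from the paper.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    tokens = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def to_sequence(tokens, start="<depth>", end="</depth>"):
    """Wrap token indices in (hypothetical) marker tokens for an autoregressive LM."""
    return [start] + [f"d{t}" for t in tokens] + [end]


codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]  # toy 3-entry codebook
patches = [[0.1, -0.1], [4.0, 6.0], [0.9, 1.2]]  # toy encoded depth patches
depth_tokens = quantize(patches, codebook)
sequence = to_sequence(depth_tokens)
```

Decoding reverses the lookup by replacing each index with its codebook vector, which is why the paper needs a soft-merging step to make reconstruction differentiable during training.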

[142] VesselTok: Tokenizing Vessel-like 3D Biomedical Graph Representations for Reconstruction and Generation

Chinmay Prabhakar,Bastian Wittmann,Tamaz Amiranashvili,Paul Büschl,Ezequiel de la Rosa,Julian McGinnis,Benedikt Wiestler,Bjoern Menze,Suprosanna Shit

Main category: cs.CV

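TL;DR: This paper proposes VesselTok, a framework that learns latent representations (tokens) of spatially dense graphs such as blood vessels and airways from a parametric shape perspective. It encodes tubular geometry using centerline points with a pseudo radius, and is shown to generalize across anatomies and to support generative modeling and downstream inverse problems.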

Details Motivation: The spatial complexity of large, high-resolution anatomical networks creates significant computational challenges, so efficient methods for modeling spatial graphs are needed. Method: VesselTok takes centerline points and their pseudo radii as input and learns a latent representation, conditioned on the centerline points, that encodes neural implicit representations of vessel-like tubular structures. Result: VesselTok is evaluated on several anatomies, including lung airways, lung vessels, and brain vessels; its latent representations generalize to unseen anatomies, support generation of plausible anatomical graphs, and transfer to downstream inverse problems such as link prediction. Conclusion: VesselTok provides a robust, generalizable, and transferable lightweight framework for learning graph representations of complex tubular anatomy. Abstract: Spatial graphs provide a lightweight and elegant representation of curvilinear anatomical structures such as blood vessels, lung airways, and neuronal networks. Accurately modeling these graphs is crucial in clinical and (bio-)medical research. However, the high spatial resolution of large networks drastically increases their complexity, resulting in significant computational challenges. In this work, we aim to tackle these challenges by proposing VesselTok, a framework that approaches spatially dense graphs from a parametric shape perspective to learn latent representations (tokens). VesselTok leverages centerline points with a pseudo radius to effectively encode tubular geometry. Specifically, we learn a novel latent representation conditioned on centerline points to encode neural implicit representations of vessel-like, tubular structures. We demonstrate VesselTok's performance across diverse anatomies, including lung airways, lung vessels, and brain vessels, highlighting its ability to robustly encode complex topologies. To prove the effectiveness of VesselTok's learnt latent representations, we show that they (i) generalize to unseen anatomies, (ii) support generative modeling of plausible anatomical graphs, and (iii) transfer effectively to downstream inverse problems, such as link prediction.
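
To see why centerline points with a radius are enough to describe tubular geometry, consider an analytic implicit function: the signed distance from a query point to the union of spheres placed at each centerline point. The sketch below is a hand-written stand-in for the learned neural implicit representation described in the abstract, not VesselTok itself, and the function name `tube_sdf` is made up for illustration.

```python
import math


def tube_sdf(query, centerline):
    """Signed distance from `query` to a union of spheres along a centerline.

    centerline: list of ((x, y, z), radius) pairs.
    Negative values are inside the tube, positive values are outside.
    """
    return min(math.dist(query, center) - radius for center, radius in centerline)


vessel = [((0.0, 0.0, 0.0), 1.0), ((2.0, 0.0, 0.0), 0.5)]  # toy two-point vessel
inside = tube_sdf((0.0, 0.0, 0.0), vessel)
outside = tube_sdf((3.0, 0.0, 0.0), vessel)
```

A dense surface can be recovered from the zero level set of such a function, so a handful of (point, radius) pairs stands in for a much larger voxel or mesh representation.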

[143] Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging

Hesong Li,Ziqi Wu,Ruiwen Shao,Ying Fu

Main category: cs.CV

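TL;DR: This paper proposes a statistical characteristic-guided denoising network (SCGN) that improves atom localization in high-resolution transmission electron microscopy (HRTEM) images captured with rapid, short exposures. It denoises jointly in the spatial and frequency domains using spatial deviation-guided weighting and frequency band-guided weighting, and the authors build a synthetic dataset matched to HRTEM noise characteristics.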

Details Motivation: Observing millisecond-scale nucleation with HRTEM requires short-exposure imaging, which introduces severe noise that obscures atomic positions; existing methods struggle to combine realistic noise modeling with strong denoising performance. Method: SCGN combines a spatial deviation-guided mechanism for selecting convolution weights with a frequency band-guided mechanism for enhancing signal and suppressing noise. The authors also design an HRTEM-specific noise calibration method and a synthetic dataset containing disordered structures and realistic noise. Result: SCGN outperforms state-of-the-art methods on both synthetic and real HRTEM images and improves accuracy on the downstream atom localization task. Conclusion: By combining statistical priors from the spatial and frequency domains, SCGN denoises HRTEM images of nucleation dynamics efficiently and robustly, providing a reliable image basis for studying atomic-scale dynamic processes. Abstract: High-Resolution Transmission Electron Microscopy (HRTEM) enables atomic-scale observation of nucleation dynamics, which advances the study of advanced solid materials. Nonetheless, due to the millisecond-scale rapid change of nucleation, it requires short-exposure rapid imaging, leading to severe noise that obscures atomic positions. In this work, we propose a statistical characteristic-guided denoising network, which utilizes statistical characteristics to guide the denoising process in both spatial and frequency domains. In the spatial domain, we present spatial deviation-guided weighting to select appropriate convolution operations for each spatial position based on the deviation characteristic. In the frequency domain, we present frequency band-guided weighting to enhance signals and suppress noise based on band characteristics. We also develop an HRTEM-specific noise calibration method and generate a dataset with disordered structures and realistic HRTEM image noise, which helps ensure the denoising performance of models on real images for nucleation observation. Experiments on synthetic and real data show our method outperforms the state-of-the-art methods in HRTEM image denoising, with effectiveness in the localization downstream task. Code will be available at https://github.com/HeasonLee/SCGN.

[144] Towards Interpretable Foundation Models for Retinal Fundus Images

Samuel Ofosu Mensah,Maria Camila Roa Carvajal,Kerol Djoumessi,Philipp Berens

Main category: cs.CV

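TL;DR: This paper proposes Dual-IFM, a foundation model that is interpretable by design. Class evidence maps give local interpretability for individual images, and a 2D projection layer gives global interpretability for entire datasets. Trained on more than 800,000 color fundus photographs, it performs comparably to state-of-the-art models with up to 16 times as many parameters and remains interpretable on out-of-distribution data.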

Details Motivation: Existing foundation models, such as those trained with self-supervised learning, rely on architectures with limited interpretability, which restricts their use in high-stakes domains such as medical imaging; models that are both interpretable and high-performing are needed. Method: Dual-IFM is interpretable in two ways: (1) it produces class evidence maps that are faithful to the decision process, for local interpretability; (2) it includes a 2D projection layer that allows direct visualization of the representation space, for global interpretability. It is pretrained with self-supervision on more than 800,000 color fundus photographs from multiple sources. Result: On downstream tasks Dual-IFM performs close to state-of-the-art foundation models with up to 16 times as many parameters, and it provides interpretable predictions on out-of-distribution data. Conclusion: Large-scale self-supervised pretraining combined with inherent interpretability can yield robust representations for retinal imaging and offers a path for high-stakes medical AI applications. Abstract: Foundation models are used to extract transferable representations from large amounts of unlabeled data, typically via self-supervised learning (SSL). However, many of these models rely on architectures that offer limited interpretability, which is a critical issue in high-stakes domains such as medical imaging. We propose Dual-IFM, a foundation model that is interpretable-by-design in two ways: First, it provides local interpretability for individual images through class evidence maps that are faithful to the decision-making process. Second, it provides global interpretability for entire datasets through a 2D projection layer that allows for direct visualization of the model's representation space. We trained our model on over 800,000 color fundus photographs from various sources to learn generalizable, interpretable representations for different downstream tasks. Our results show that our model reaches a performance range similar to that of state-of-the-art foundation models with up to $16\times$ the number of parameters, while providing interpretable predictions on out-of-distribution data. Our results suggest that large-scale SSL pretraining paired with inherent interpretability can lead to robust representations for retinal imaging.

[145] HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models

Xiangyu Bai,Bishoy Galoaa,Sarah Ostadabbas

Main category: cs.CV

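TL;DR: This paper proposes HORNet, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to choose the key frames a vision-language model (VLM) needs for video question answering (VQA). It sharply reduces the number of input frames and the compute cost while improving answer quality.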

Details Motivation: Most video question answering systems use uniform or heuristic frame sampling, which cannot be optimized for downstream answer quality even though frame selection is critical to VLM performance. Method: The paper formalizes the Select Any Frames (SAF) task and designs HORNet, a lightweight frame selection policy trained with GRPO while the VLM stays frozen; the policy transfers across VLMs without retraining. Result: HORNet has fewer than 1M trainable parameters, reduces input frames by up to 99%, and cuts VLM processing time by up to 93%. It improves F1 by 1.7% on MSVD-QA and gains 7.3 points over uniform sampling on NExT-QA; transferring to a stronger VLM yields an additional 8.5% relative gain. The evaluation covers six benchmarks (341,877 QA pairs, 114.2 hours of video). Conclusion: Optimizing what a VLM sees (input frame selection) is an efficient and practical approach that complements optimizing what it generates, improving performance and computational efficiency together. Abstract: Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce \textbf{HORNet}, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces input frames by up to 99\% and VLM processing time by up to 93\%, while improving answer quality on short-form benchmarks (+1.7\% F1 on MSVD-QA) and achieving strong performance on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA). We formalize this as Select Any Frames (SAF), a task that decouples visual input curation from VLM reasoning, and show that GRPO-trained selection generalizes better out-of-distribution than supervised and PPO alternatives. HORNet's policy further transfers across VLM answerers without retraining, yielding an additional 8.5\% relative gain when paired with a stronger model. Evaluated across six benchmarks spanning 341,877 QA pairs and 114.2 hours of video, our results demonstrate that optimizing \emph{what} a VLM sees is a practical and complementary alternative to optimizing what it generates while improving efficiency. Code is available at https://github.com/ostadabbas/HORNet.
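
GRPO replaces a learned value baseline with a comparison inside a group of samples: several candidate outputs are drawn for the same input, each is scored, and each sample's advantage is its reward standardized against the group. The sketch below shows that normalization step only. Treating each sample as a candidate frame subset scored by the frozen VLM's answer quality is an assumption about how HORNet applies it, and the use of the population standard deviation is a simplification.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Four candidate frame subsets for one question; 1.0 means the VLM answered correctly.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Selections that beat their group average get positive advantages and are reinforced, which lets the policy learn from answer quality without a separate critic network.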

[146] Motion-o: Trajectory-Grounded Video Reasoning

Bishoy Galoaa,Shayda Moezzi,Xiangyu Bai,Sarah Ostadabbas

Main category: cs.CV

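TL;DR: This paper introduces a new video understanding capability, Spatial-Temporal-Trajectory (STT) reasoning, together with the Motion-o model and the Motion Chain of Thought (MCoT) method. By explicitly modeling object motion (direction, speed, and scale change), augmenting data with trajectory annotations, and training with a reward grounded in visual evidence, it improves trajectory prediction and spatial-temporal grounding without changing the model architecture.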

Details Motivation: Existing video reasoning research neglects how objects move, leaving trajectory understanding implicit and hard to verify; there is no formal framework that connects successive observations and describes motion patterns explicitly. Method: (1) Define Spatial-Temporal-Trajectory (STT) reasoning; (2) build Motion-o, a motion-centric extension to visual language models; (3) create a trajectory-grounding dataset that augments sparse keyframes into dense bounding box tracks; (4) introduce Motion Chain of Thought (MCoT), which uses a tag to summarize each object's direction, speed, and scale change; (5) train with a reward function grounded in visual evidence, with no architectural changes. Result: Motion-o improves spatial-temporal grounding and trajectory prediction while remaining fully compatible with existing vision-language model frameworks, supporting the value of explicit motion reasoning for evidence-based video understanding. Conclusion: Explicitly modeling and verifying object trajectories is an important missing piece of video understanding; Motion-o and MCoT offer a path toward interpretable, verifiable, evidence-driven video reasoning systems. Abstract: Recent research has made substantial progress on video reasoning, with many models leveraging spatio-temporal evidence chains to strengthen their inference capabilities. At the same time, a growing set of datasets and benchmarks now provides structured annotations designed to support and evaluate such reasoning. However, little attention has been paid to reasoning about \emph{how} objects move between observations: no prior work has articulated the motion patterns by connecting successive observations, leaving trajectory understanding implicit and difficult to verify. We formalize this missing capability as Spatial-Temporal-Trajectory (STT) reasoning and introduce \textbf{Motion-o}, a motion-centric video understanding extension to visual language models that makes trajectories explicit and verifiable. To enable motion reasoning, we also introduce a trajectory-grounding dataset artifact that expands sparse keyframe supervision via augmentation to yield denser bounding box tracks and a stronger trajectory-level training signal. Finally, we introduce Motion Chain of Thought (MCoT), a structured reasoning pathway that makes object trajectories through discrete \texttt{} tag summarizing per-object direction, speed, and scale (of velocity) change to explicitly connect grounded observations into trajectories. To train Motion-o, we design a reward function that compels the model to reason directly over visual evidence, all while requiring no architectural modifications.
Empirical results demonstrate that Motion-o improves spatial-temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks, establishing motion reasoning as a critical extension for evidence-based video understanding. Code is available at https://github.com/ostadabbas/Motion-o.
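
The per-object direction, speed, and scale change that the abstract describes can be illustrated with a small sketch. This is not the Motion-o implementation: the function name `summarize_motion`, the `(x1, y1, x2, y2)` box format, and the use of box-area ratio as the scale measure are all assumptions made for illustration.

```python
import math

def summarize_motion(box_a, box_b, t_a, t_b):
    """Summarize motion between two (x1, y1, x2, y2) boxes seen at times t_a < t_b.

    Hypothetical helper: box format and the returned fields are assumptions.
    """
    dt = t_b - t_a
    if dt <= 0:
        raise ValueError("t_b must be later than t_a")
    # Box centers in image coordinates (the y axis points down).
    cx_a, cy_a = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cx_b, cy_b = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    dx, dy = cx_b - cx_a, cy_b - cy_a
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    if area_a <= 0:
        raise ValueError("box_a must have positive area")
    return {
        "direction_deg": math.degrees(math.atan2(dy, dx)),  # angle of center displacement
        "speed": math.hypot(dx, dy) / dt,                   # pixels per time unit
        "scale_ratio": area_b / area_a,                     # >1 means the box grew
    }

motion = summarize_motion((0, 0, 10, 10), (3, 4, 13, 14), t_a=0.0, t_b=2.0)
print(motion)
```

Applying such a summary to each consecutive pair of boxes in a track gives one discrete motion description per step, which is the kind of signal a structured tag could carry.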

[147] PromptHub: Enhancing Multi-Prompt Visual In-Context Learning with Locality-Aware Fusion, Concentration and Alignment

Tianci Luo,Jinpeng Wang,Shiyu Qin,Niu Lian,Yan Feng,Bin Chen,Chun Yuan,Shu-Tao Xia

Main category: cs.CV

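TL;DR: This paper proposes PromptHub, a framework that improves multi-prompt fusion in visual in-context learning through locality-aware fusion, concentration, and alignment, and significantly outperforms existing methods.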

Details Motivation: Existing patch-wise fusion frameworks and model-agnostic supervision limit the use of informative cues, which caps performance gains. Method: PromptHub uses locality-aware fusion, trains jointly with complementary concentration, alignment, and prediction objectives, and adds data augmentation to reinforce supervision. Result: It significantly outperforms existing methods on three fundamental vision tasks, and its universality, transferability, and robustness are validated in out-of-distribution settings and various retrieval scenarios. Conclusion: PromptHub establishes a reliable locality-aware paradigm for prompt fusion that moves beyond prior patch-wise approaches. Abstract: Visual In-Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work pioneered prompt fusion that combines the advantages of various demonstrations, which shows a promising way to extend VICL. Unfortunately, the patch-wise fusion framework and model-agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi-prompting through locality-aware fusion, concentration and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub. Moreover, we validate its universality, transferability, and robustness across out-of-distribution settings and various retrieval scenarios. This work establishes a reliable locality-aware paradigm for prompt fusion, moving beyond prior patch-wise approaches. Code is available at https://github.com/luotc-why/ICLR26-PromptHub.

[148] MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model

Youngwan Lee,Soojin Jang,Yoorhim Cho,Seunghwan Lee,Yong-Ju Lee,Sung Ju Hwang

Main category: cs.CV

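TL;DR: This paper introduces MultihopSpatial, a benchmark for evaluating and improving multi-hop, compositional spatial reasoning and precise visual grounding in vision-language models, along with a new metric, Acc@50IoU, and a training corpus, MultihopSpatial-Train; it shows that reinforcement learning post-training improves both spatial reasoning and embodied manipulation performance.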

Details Motivation: Existing spatial reasoning benchmarks focus only on simple single-hop relations and neglect the multi-hop compositional reasoning and precise visual grounding that real-world scenarios require. Method: Build the MultihopSpatial benchmark for multi-hop spatial reasoning (with 1- to 3-hop complex queries), propose the Acc@50IoU metric to jointly evaluate reasoning and grounding, and release the large-scale training corpus MultihopSpatial-Train; post-train VLMs with reinforcement learning to improve spatial intelligence. Result: An evaluation of 37 state-of-the-art VLMs shows that compositional spatial reasoning remains a major challenge; reinforcement learning post-training improves both intrinsic spatial reasoning and downstream embodied manipulation performance. Conclusion: Multi-hop compositional spatial reasoning is a key weakness of current VLMs that calls for dedicated benchmarks, metrics, and training data, and reinforcement learning is an effective way to improve it. Abstract: Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when deployed as Vision-Language-Action (VLA) agents in physical environments. However, existing benchmarks predominantly focus on elementary, single-hop relations, neglecting the multi-hop compositional reasoning and precise visual grounding essential for real-world scenarios. To address this, we introduce MultihopSpatial, offering three key contributions: (1) A comprehensive benchmark designed for multi-hop and compositional spatial reasoning, featuring 1- to 3-hop complex queries across diverse spatial perspectives. (2) Acc@50IoU, a complementary metric that simultaneously evaluates reasoning and visual grounding by requiring both answer selection and precise bounding box prediction - capabilities vital for robust VLA deployment. (3) MultihopSpatial-Train, a dedicated large-scale training corpus to foster spatial intelligence. Extensive evaluation of 37 state-of-the-art VLMs yields eight key insights, revealing that compositional spatial reasoning remains a formidable challenge. Finally, we demonstrate that reinforcement learning post-training on our corpus enhances both intrinsic VLM spatial reasoning and downstream embodied manipulation performance.
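
A metric that requires both a correct answer and a sufficiently accurate box, as Acc@50IoU is described above, could be computed roughly as follows. This is a sketch under assumptions: the 0.5 IoU threshold is inferred from the metric's name, and the sample field names (`pred_answer`, `gt_answer`, `pred_box`, `gt_box`) and `(x1, y1, x2, y2)` box format are hypothetical, not taken from the benchmark's code.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(samples, threshold=0.5):
    """Fraction of samples with the right answer AND a box with IoU >= threshold."""
    if not samples:
        return 0.0
    hits = sum(
        1
        for s in samples
        if s["pred_answer"] == s["gt_answer"]
        and iou(s["pred_box"], s["gt_box"]) >= threshold
    )
    return hits / len(samples)

samples = [
    # Right answer, perfect box: counts.
    {"pred_answer": "A", "gt_answer": "A", "pred_box": (0, 0, 10, 10), "gt_box": (0, 0, 10, 10)},
    # Right answer, poorly localized box: does not count.
    {"pred_answer": "B", "gt_answer": "B", "pred_box": (0, 0, 10, 10), "gt_box": (5, 0, 15, 10)},
    # Wrong answer, perfect box: does not count.
    {"pred_answer": "C", "gt_answer": "D", "pred_box": (0, 0, 10, 10), "gt_box": (0, 0, 10, 10)},
]
print(acc_at_iou(samples))
```

The joint condition is the point of the design: a model that guesses the answer without locating the referenced object gets no credit.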

[149] Translating MRI to PET through Conditional Diffusion Models with Enhanced Pathology Awareness

Yitong Li,Igor Yakushev,Dennis M. Hedderich,Christian Wachinger

Main category: cs.CV

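TL;DR: This paper proposes PASTA, a framework that uses conditional diffusion models with enhanced pathology awareness to generate high-quality, pathology-rich synthetic PET images from MRI, improving Alzheimer's disease diagnosis performance.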

Details Motivation: PET is sensitive for diagnosing neurodegenerative diseases but is costly and involves radiation exposure; MRI is safe but less sensitive. Existing MRI-to-PET synthesis methods neglect pathology modeling, which limits their clinical usefulness. Method: Propose PASTA, built on conditional diffusion models, with a highly interactive dual-arm architecture, multi-modal condition integration, a cycle exchange consistency constraint, and a volumetric generation strategy, to preserve both structural and pathological details. Result: The synthesized PET images perform well in both qualitative and quantitative evaluations; for Alzheimer's diagnosis, performance improves over MRI by 4%, close to that of real PET. Conclusion: PASTA improves pathology awareness in cross-modality medical image translation and offers a low-cost, radiation-free route to accurate diagnosis of neurodegenerative diseases. Abstract: Positron emission tomography (PET) is a widely recognized technique for diagnosing neurodegenerative diseases, offering critical functional insights. However, its high costs and radiation exposure hinder its widespread use. In contrast, magnetic resonance imaging (MRI) does not involve such limitations. While MRI also detects neurodegenerative changes, it is less sensitive for diagnosis compared to PET. To overcome such limitations, one approach is to generate synthetic PET from MRI. Recent advances in generative models have paved the way for cross-modality medical image translation; however, existing methods largely emphasize structural preservation while neglecting the critical need for pathology awareness. To address this gap, we propose PASTA, a novel image translation framework built on conditional diffusion models with enhanced pathology awareness. PASTA surpasses state-of-the-art methods by preserving both structural and pathological details through its highly interactive dual-arm architecture and multi-modal condition integration. Additionally, we introduce a novel cycle exchange consistency and volumetric generation strategy that significantly enhances PASTA's ability to produce high-quality 3D PET images. Our qualitative and quantitative results demonstrate the high quality and pathology awareness of the synthesized PET scans. For Alzheimer's diagnosis, the performance of these synthesized scans improves over MRI by 4%, almost reaching the performance of actual PET. Our code is available at https://github.com/ai-med/PASTA.

[150] GHOST: Fast Category-agnostic Hand-Object Interaction Reconstruction from RGB Videos using Gaussian Splatting

Ahmed Tawfik Aboukhadra,Marcel Rogge,Nadia Robertini,Abdalla Arafa,Jameel Malik,Ahmed Elhayek,Didier Stricker

Main category: cs.CV

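TL;DR: GHOST is a fast, category-agnostic framework based on 2D Gaussian Splatting for reconstructing dynamic hand-object interactions from monocular RGB video; it uses geometric priors, grasp-aware alignment, and a hand-aware background loss to produce physically consistent, animatable 3D reconstructions.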

Details Motivation: Existing methods rely on category-specific templates or heavy computation and still struggle to produce physically consistent hand-object alignment in 3D. Method: Propose GHOST, which represents hands and objects as dense, view-consistent Gaussian discs and introduces three innovations: a geometric-prior retrieval and consistency loss, a grasp-aware alignment, and a hand-aware background loss. Result: It reaches state-of-the-art 3D reconstruction and 2D rendering quality on ARCTIC, HO3D, and in-the-wild datasets, and runs an order of magnitude faster than prior category-agnostic methods. Conclusion: GHOST is an efficient and robust solution for hand-object interaction modeling that supports complete, physically consistent, and animatable reconstruction from a single video. Abstract: Understanding realistic hand-object interactions from monocular RGB videos is essential for AR/VR, robotics, and embodied AI. Existing methods rely on category-specific templates or heavy computation, yet still produce physically inconsistent hand-object alignment in 3D. We introduce GHOST (Gaussian Hand-Object Splatting), a fast, category-agnostic framework for reconstructing dynamic hand-object interactions using 2D Gaussian Splatting. GHOST represents both hands and objects as dense, view-consistent Gaussian discs and introduces three key innovations: (1) a geometric-prior retrieval and consistency loss that completes occluded object regions, (2) a grasp-aware alignment that refines hand translations and object scale to ensure realistic contact, and (3) a hand-aware background loss that prevents penalizing hand-occluded object regions. GHOST achieves complete, physically consistent, and animatable reconstructions from a single RGB video while running an order of magnitude faster than prior category-agnostic methods. Extensive experiments on ARCTIC, HO3D, and in-the-wild datasets demonstrate state-of-the-art accuracy in 3D reconstruction and 2D rendering quality, establishing GHOST as an efficient and robust solution for realistic hand-object interaction modeling. Code is available at https://github.com/ATAboukhadra/GHOST.

[151] Unsupervised Contrastive Learning for Efficient and Robust Spectral Shape Matching

Feifan Luo,Hongyang Chen

Main category: cs.CV

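TL;DR: This paper proposes a new 3D shape matching method based on unsupervised contrastive learning; by improving feature quality in the embedding space and simplifying the functional map learning architecture, it improves both matching accuracy and computational efficiency.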

Details Motivation: Existing deep functional map methods focus on optimizing pointwise or functional maps and neglect direct enhancement of feature representations in the embedding space; they also depend on computationally expensive traditional functional map solvers, which leads to inadequate feature quality, limited matching performance, and low efficiency. Method: Propose an unsupervised contrastive learning framework that improves feature consistency and discriminability, and design a simplified functional map learning architecture that needs neither expensive functional map solvers nor multiple auxiliary losses; the two are integrated into a unified two-branch pipeline. Result: On challenging benchmarks covering near-isometric, non-isometric, and topologically inconsistent cases, the method reaches state-of-the-art accuracy and efficiency, even surpassing some supervised methods. Conclusion: Combining unsupervised contrastive learning with a simplified functional map architecture provides an efficient and robust approach to non-rigid 3D shape matching. Abstract: Estimating correspondences between pairs of non-rigid deformable 3D shapes remains a significant challenge in computer vision and graphics. While deep functional map methods have become the go-to solution for addressing this problem, they primarily focus on optimizing pointwise and functional maps either individually or jointly, rather than directly enhancing feature representations in the embedding space, which often results in inadequate feature quality and suboptimal matching performance. Furthermore, these approaches heavily rely on traditional functional map techniques, such as time-consuming functional map solvers, which incur substantial computational costs. In this work, we introduce, for the first time, a novel unsupervised contrastive learning-based approach for efficient and robust 3D shape matching. We begin by presenting an unsupervised contrastive learning framework that promotes feature learning by maximizing consistency within positive similarity pairs and minimizing it within negative similarity pairs, thereby improving both the consistency and discriminability of the learned features. We then design a significantly simplified functional map learning architecture that eliminates the need for computationally expensive functional map solvers and multiple auxiliary functional map losses, greatly enhancing computational efficiency. By integrating these two components into a unified two-branch pipeline, our method achieves state-of-the-art performance in both accuracy and efficiency. 
Extensive experiments demonstrate that our approach is not only computationally efficient but also outperforms current state-of-the-art methods across various challenging benchmarks, including near-isometric, non-isometric, and topologically inconsistent scenarios, even surpassing supervised techniques.
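
The idea of pulling positive pairs together and pushing negative pairs apart can be sketched with an InfoNCE-style loss. InfoNCE is a common contrastive objective swapped in here for illustration; the abstract does not say which loss the paper uses, and the function names, temperature value, and toy vectors below are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is close to its positive
    and far from every negative."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy features: the anchor matches its positive and is orthogonal to the negative.
good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# Swapped roles: the "positive" is orthogonal and the negative matches.
bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(good, bad)
```

Minimizing a loss of this shape raises similarity for corresponding features and lowers it for non-corresponding ones, which is the consistency and discriminability trade-off the abstract refers to.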

[152] VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

Jiayi Yuan,Haobo Jiang,De Wen Soh,Na Zhao

Main category: cs.CV

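TL;DR: VGGT-360 is a training-free, zero-shot framework for panoramic depth estimation. It recasts the task as panoramic reprojection over 3D models reconstructed by VGGT-like foundation models, and combines three modules (uncertainty-guided projection, structure-saliency enhanced attention, and correlation-weighted 3D model correction) to produce geometry-consistent, high-quality depth.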

Details Motivation: Existing training-free, view-independent methods for panoramic depth estimation lack geometric consistency and cross-view coherence; the work aims to exploit the intrinsic 3D consistency of foundation models. Method: Propose VGGT-360 with three plug-and-play modules: (i) uncertainty-guided adaptive projection, which slices the panorama into perspective views and allocates view density according to gradient-based uncertainty; (ii) structure-saliency enhanced attention, which injects structure-aware confidence into VGGT's attention layers to make 3D reconstruction more robust; (iii) correlation-weighted 3D model correction, which reweights overlapping points using attention-inferred correlation scores to improve geometric consistency. Result: Across multiple resolutions and indoor and outdoor datasets, VGGT-360 outperforms both trained and training-free state-of-the-art methods. Conclusion: VGGT-360 shows that high-quality, geometry-consistent panoramic depth estimation is feasible without training and offers a new way to use foundation model priors for geometric reasoning. Abstract: This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.

[153] CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think

Zening Sun,Zhengpeng Xie,Lichen Bai,Shitong Shao,Shuo Yang,Zeke Xie

Main category: cs.CV

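TL;DR: This paper proposes CRAFT, which builds a high-quality dataset with Composite Reward Filtering (CRF) and applies an enhanced SFT; with very few samples (100) it outperforms existing preference optimization methods and converges 11-220× faster.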

Details Motivation: Existing methods for aligning diffusion models (such as SFT and DPO) depend on high-quality images or on large-scale preference data of inconsistent quality, and they are computationally inefficient. Method: Propose Composite Reward Assisted Fine-Tuning (CRAFT): first use Composite Reward Filtering (CRF) to select high-quality, consistent data, then perform an enhanced variant of SFT; the paper also proves theoretically that CRAFT optimizes a lower bound of group-based reinforcement learning. Result: With only 100 samples, CRAFT outperforms SOTA methods that need thousands of preference pairs and converges 11-220× faster. Conclusion: CRAFT is a lightweight, efficient, data-economical paradigm for aligning diffusion models, with both theoretical grounding and strong empirical performance. Abstract: Aligning diffusion models has achieved remarkable breakthroughs in generating high-quality, human preference-aligned images. Existing techniques, such as supervised fine-tuning (SFT) and DPO-style preference optimization, have become principled tools for fine-tuning diffusion models. However, SFT relies on high-quality images that are costly to obtain, while DPO-style methods depend on large-scale preference datasets, which are often inconsistent in quality. Beyond data dependency, these methods are further constrained by computational inefficiency. To address these two challenges, we propose Composite Reward Assisted Fine-Tuning (CRAFT), a lightweight yet powerful fine-tuning paradigm that requires significantly reduced training data while maintaining computational efficiency. It first leverages a Composite Reward Filtering (CRF) technique to construct a high-quality and consistent training dataset and then performs an enhanced variant of SFT. We also theoretically prove that CRAFT actually optimizes the lower bound of group-based reinforcement learning, establishing a principled connection between SFT with selected data and reinforcement learning. Our extensive empirical results demonstrate that CRAFT with only 100 samples can easily outperform recent SOTA preference optimization methods with thousands of preference-paired samples. Moreover, CRAFT can even achieve 11-220× faster convergence than the baseline preference optimization methods, highlighting its extremely high efficiency.
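
Filtering a candidate pool by a composite of several reward scores, the general idea behind CRF as described above, might look like the sketch below. The abstract does not specify how the rewards are combined, so the weighted sum, the reward names (`aesthetic`, `alignment`), and the function `composite_reward_filter` are all assumptions for illustration.

```python
def composite_reward_filter(samples, weights, k):
    """Keep the k samples with the highest weighted sum of reward scores.

    Hypothetical sketch: `samples` is a list of dicts with a "rewards" mapping,
    and `weights` maps each reward name to its weight.
    """
    def composite(sample):
        return sum(w * sample["rewards"][name] for name, w in weights.items())

    return sorted(samples, key=composite, reverse=True)[:k]

candidates = [
    {"id": "a", "rewards": {"aesthetic": 0.9, "alignment": 0.8}},
    {"id": "b", "rewards": {"aesthetic": 0.2, "alignment": 0.9}},
    {"id": "c", "rewards": {"aesthetic": 0.7, "alignment": 0.75}},
]
kept = composite_reward_filter(candidates, {"aesthetic": 0.5, "alignment": 0.5}, k=2)
print([s["id"] for s in kept])
```

Sample `b` scores well on one reward but poorly on the other, so it is dropped; requiring agreement across rewards is one way to obtain a small but consistent fine-tuning set.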

[154] Unleashing the Power of Simplicity: A Minimalist Strategy for State-of-the-Art Fingerprint Enhancement

Raffaele Cappelli

Main category: cs.CV

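TL;DR: This paper presents a minimalist, efficient approach to fingerprint enhancement, comprising two new methods (contextual filtering and learning-based); it outperforms existing complex methods on low-quality fingerprints, and the implementation is open-sourced to support reproducibility and further research.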

Details Motivation: Existing fingerprint enhancement methods perform poorly on low-quality fingerprints and are computationally demanding, so simpler and more effective methods are needed. Method: Introduce two new methods, a contextual filtering method and a learning-based method, with an emphasis on simplicity and practicality. Result: Validated on a challenging latent fingerprint database, the new methods produce clearer, more accurate, and less noisy enhanced images and consistently outperform current state-of-the-art methods. Conclusion: A minimalist design can achieve high-quality fingerprint enhancement, and future research should balance algorithmic complexity against practical benefit. Abstract: Fingerprint recognition systems, which rely on the unique characteristics of human fingerprints, are essential in modern security and verification applications. Accurate minutiae extraction, a critical step in these systems, depends on the quality of fingerprint images. Despite recent improvements in fingerprint enhancement techniques, state-of-the-art methods often struggle with low-quality fingerprints and can be computationally demanding. This paper presents a minimalist approach to fingerprint enhancement, prioritizing simplicity and effectiveness. Two novel methods are introduced: a contextual filtering method and a learning-based method. These techniques consistently outperform complex state-of-the-art methods, producing clearer, more accurate, and less noisy images. The effectiveness of these methods is validated using a challenging latent fingerprint database. The open-source implementation of these techniques not only fosters reproducibility but also encourages further advancements in the field. The findings underscore the importance of simplicity in achieving high-quality fingerprint enhancement and suggest that future research should balance complexity and practical benefits.

[155] Generalized Hand-Object Pose Estimation with Occlusion Awareness

Hui Yang,Wei Sun,Jian Liu,Jian Xiao,Tao Xie,Hossein Rahmani,Ajmal Saeed Mian,Nicu Sebe,Gim Hee Lee

Main category: cs.CV

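TL;DR: This paper proposes GenHOI, a framework that combines hierarchical semantic prompts with hand priors to improve the generalization of hand-object pose estimation from a single RGB image under occlusion.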

Details Motivation: Generalized 3D hand-object pose estimation from a single RGB image is hard because object appearances and interaction patterns vary widely, especially under heavy occlusion. Method: Propose GenHOI: introduce a hierarchical semantic prompt (textual descriptions encoding object states, hand configurations, and interaction patterns), adopt a multi-modal masked modeling strategy over RGB images, predicted point clouds, and text, and use hand priors as stable spatial references to extract implicit interaction constraints. Result: State-of-the-art hand-object pose estimation performance on the DexYCB and HO3Dv2 benchmarks. Conclusion: By fusing semantic knowledge with geometric priors, GenHOI improves generalization and robustness under occlusion, with unseen objects, and in novel interactions. Abstract: Generalized 3D hand-object pose estimation from a single RGB image remains challenging due to the large variations in object appearances and interaction patterns, especially under heavy occlusion. We propose GenHOI, a framework for generalized hand-object pose estimation with occlusion awareness. GenHOI integrates hierarchical semantic knowledge with hand priors to enhance model generalization under challenging occlusion conditions. Specifically, we introduce a hierarchical semantic prompt that encodes object states, hand configurations, and interaction patterns via textual descriptions. This enables the model to learn abstract high-level representations of hand-object interactions for generalization to unseen objects and novel interactions while compensating for missing or ambiguous visual cues. To enable robust occlusion reasoning, we adopt a multi-modal masked modeling strategy over RGB images, predicted point clouds, and textual descriptions. Moreover, we leverage hand priors as stable spatial references to extract implicit interaction constraints. This allows reliable pose inference even under significant variations in object shapes and interaction patterns. Extensive experiments on the challenging DexYCB and HO3Dv2 benchmarks demonstrate that our method achieves state-of-the-art performance in hand-object pose estimation.

[156] Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token

Anqi Zhang,Xiaokang Ji,Guangyu Gao,Jianbo Jiao,Chi Harold Liu,Yunchao Wei

Main category: cs.CV

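TL;DR: This paper proposes SELF1E, which retains image features at their original resolution and adds residual feature compensation, pixel-unshuffle operations, and a dual-pathway attention mask, so that a multi-modal large language model (MLLM) can perform end-to-end segmentation with a single segmentation embedding and no external mask decoder, with performance competitive with specialist-decoder methods.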

Details Motivation: Existing MLLM-based segmentation methods depend heavily on specialist mask decoders or additional tokens, which limits model simplicity and end-to-end capability; this paper asks whether high-quality segmentation can be achieved with the MLLM alone, without an external decoder. Method: (1) Keep image features at their original high resolution; (2) refill and enhance the high-resolution features with residual features extracted from the MLLM's compressed features; (3) apply pixel-unshuffle operations to features with and without LLM processing, respectively, to recover detail and amplify the residuals; (4) design an attention mask with dual perception pathways (image-to-image and image-to-segmentation) to strengthen interaction between pixels and the segmentation token. Result: Across multiple segmentation tasks, SELF1E performs on par with SOTA methods that rely on specialist mask decoders, supporting the feasibility of decoder-free MLLM segmentation. Conclusion: A single segmentation embedding (SELF1E) without an external mask decoder is enough for high-performance MLLM segmentation; the key ingredients are high-resolution feature fidelity, residual enhancement, and dual-pathway attention. Abstract: Recent segmentation methods leveraging Multi-modal Large Language Models (MLLMs) have shown reliable object-level segmentation and enhanced spatial perception. However, almost all previous methods predominantly rely on specialist mask decoders to interpret masks from generated segmentation-related embeddings and visual features, or incorporate multiple additional tokens to assist. This paper aims to investigate whether and how we can unlock segmentation from MLLM itSELF with 1 segmentation Embedding (SELF1E) while achieving competitive results, which eliminates the need for external decoders. To this end, our approach targets the fundamental limitation of resolution reduction in pixel-shuffled image features from MLLMs. First, we retain image features at their original uncompressed resolution, and refill them with residual features extracted from MLLM-processed compressed features, thereby improving feature precision. Subsequently, we integrate pixel-unshuffle operations on image features with and without LLM processing, respectively, to unleash the details of compressed features and amplify the residual features under uncompressed resolution, which further enhances the resolution of refilled features. Moreover, we redesign the attention mask with dual perception pathways, i.e., image-to-image and image-to-segmentation, enabling rich feature interaction between pixels and the segmentation token. 
Comprehensive experiments across multiple segmentation tasks validate that SELF1E achieves performance competitive with specialist mask decoder-based methods, demonstrating the feasibility of decoder-free segmentation in MLLMs. Project page: https://github.com/ANDYZAQ/SELF1E.

[157] SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models

Quentin Guimard,Federico Bartsch,Simone Caldarella,Rahaf Aljundi,Elisa Ricci,Massimiliano Mancini

Main category: cs.CV

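TL;DR: This paper proposes Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework that operates in the latent space of a Sparse Autoencoder (SAE); by decomposing CLIP text embeddings it modulates bias-relevant neurons precisely, improving fairness across several datasets and models while preserving semantic fidelity.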

Details Motivation: Existing post-hoc debiasing methods that operate in the dense CLIP embedding space are limited by the strong entanglement of bias and task-relevant information, which makes it hard to remove bias without degrading semantic fidelity. Method: Propose SEM, which decomposes CLIP text embeddings into disentangled features in a Sparse Autoencoder (SAE) latent space, identifies and modulates bias-relevant neurons, and preserves query-relevant ones, enabling more precise, non-linear interventions. Result: Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification. Conclusion: Sparse latent representations provide an effective foundation for post-hoc debiasing of vision-language models. Abstract: Models that bridge vision and language, such as CLIP, are key components of multimodal AI, yet their large-scale, uncurated training data introduce severe social and spurious biases. Existing post-hoc debiasing methods often operate directly in the dense CLIP embedding space, where bias and task-relevant information are highly entangled. This entanglement limits their ability to remove bias without degrading semantic fidelity. In this work, we propose Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework that operates in a Sparse Autoencoder (SAE) latent space. By decomposing CLIP text embeddings into disentangled features, SEM identifies and modulates bias-relevant neurons while preserving query-relevant ones. This enables more precise, non-linear interventions. Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification. Our results demonstrate that sparse latent representations provide an effective foundation for post-hoc debiasing of vision-language models.
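
The encode, modulate, decode pattern that the abstract describes can be sketched with a toy sparse autoencoder. Everything concrete here is an assumption: the identity encoder and decoder weights, the neuron indices, the `scale` parameter, and the function name `modulate` are placeholders, and the way SEM actually identifies bias-relevant and query-relevant neurons is not reproduced.

```python
def relu(values):
    return [max(0.0, x) for x in values]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def modulate(embedding, w_enc, w_dec, bias_neurons, keep_neurons, scale=0.0):
    """Encode into the SAE latent space, damp bias-relevant neurons, decode.

    Neurons listed in keep_neurons (query-relevant) are never damped.
    """
    latent = relu(matvec(w_enc, embedding))
    for i in bias_neurons:
        if i not in keep_neurons:
            latent[i] *= scale
    return matvec(w_dec, latent)

# Toy 3-dimensional SAE with identity weights, purely for illustration.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
embedding = [1.0, 2.0, 3.0]
debiased = modulate(embedding, identity, identity, bias_neurons={1}, keep_neurons=set())
protected = modulate(embedding, identity, identity, bias_neurons={1}, keep_neurons={1})
print(debiased, protected)
```

Because the intervention touches individual latent neurons rather than a direction in the dense space, a neuron that matters for the query can be left intact even when it is also flagged as bias-related.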

[158] FUMO: Prior-Modulated Diffusion for Single Image Reflection Removal

Telang Xu,Chaoyang Zhang,Guangtao Zhai,Xiaohong Liu

Main category: cs.CV

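TL;DR: This paper proposes FUMO, a single image reflection removal framework based on a diffusion model with prior modulation; it introduces an intensity prior and a high-frequency prior to improve spatial controllability and structural faithfulness, and uses a coarse-to-fine training paradigm to improve reflection removal.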

Details Motivation: In real scenes, reflection strength varies spatially and is tightly entangled with transmission structures, which makes single image reflection removal very challenging. Method: Propose FUMO, which extracts from the mixed image an intensity prior (estimating reflection severity) and a high-frequency prior (capturing detail-sensitive responses via multi-scale residual aggregation), and trains coarse-to-fine: in the first stage the priors gate the conditional residual injections, focusing on regions that are reflection-dominant and structure-sensitive; in the second stage a refinement network corrects local misalignment and sharpens details in image space. Result: Competitive quantitative results and consistently improved perceptual quality on both standard benchmarks and challenging in-the-wild images. Conclusion: Explicit prior guidance improves the spatial controllability and structural faithfulness of SIRR; the code is open source. Abstract: Single image reflection removal (SIRR) is challenging in real scenes, where reflection strength varies spatially and reflection patterns are tightly entangled with transmission structures. This paper presents a diffusion model with prior modulation framework (FUMO) that introduces explicit guidance signals to improve spatial controllability and structural faithfulness. Two priors are extracted directly from the mixed image: an intensity prior that estimates spatial reflection severity and a high-frequency prior that captures detail-sensitive responses via multi-scale residual aggregation. We propose a coarse-to-fine training paradigm. In the first stage, these cues are combined to gate the conditional residual injections, focusing the conditioning on regions that are both reflection-dominant and structure-sensitive. In the second stage, a fine-grained refinement network corrects local misalignment and sharpens fine details in the image space. Experiments conducted on both standard benchmarks and challenging images in the wild demonstrate competitive quantitative results and consistently improved perceptual quality. The code is released at https://github.com/Lucious-Desmon/FUMO.

[159] TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

Yan Shu,Bin Ren,Zhitong Xiong,Xiao Xiang Zhu,Begüm Demir,Nicu Sebe,Paolo Rota

Main category: cs.CV

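TL;DR: This paper introduces TerraScope, a unified vision-language model designed for earth observation that supports modality-flexible and multi-temporal spatial reasoning, along with the large-scale dataset Terra-CoT and TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning.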

Details Motivation: Existing vision-language models for earth observation struggle to ground complex spatial reasoning in precise pixel-level visual representations. Method: Propose TerraScope, which handles single-modality inputs (optical or SAR), adaptively fuses modalities when both are available, and integrates multi-temporal sequences for change analysis; curate Terra-CoT, a dataset of 1 million samples with pixel-level masks; and design TerraScope-Bench, the first pixel-grounded geospatial reasoning benchmark, with six sub-tasks that jointly evaluate answer accuracy and mask quality. Result: TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning and provides interpretable visual evidence. Conclusion: TerraScope delivers modality-flexible, multi-temporal, pixel-grounded geospatial reasoning and advances trustworthy, interpretable use of VLMs in earth observation. Abstract: Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.

[160] Measuring 3D Spatial Geometric Consistency in Dynamic Generated Videos

Weijia Dou,Wenzhao Zheng,Weiliang Chen,Yu Zheng,Jie Zhou,Jiwen Lu

Main category: cs.CV

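TL;DR: This paper proposes SGC, a metric for evaluating the 3D spatial geometric consistency of generated videos; it estimates camera poses from different local regions and measures their divergence to quantify geometric inconsistency.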

Details Motivation: Existing evaluation methods fail to characterize 3D spatial geometric inconsistencies in generated videos: fidelity-centric metrics such as FVD are insensitive to geometric distortions, while consistency-focused benchmarks often wrongly penalize valid foreground dynamics. Method: SGC first separates static from dynamic regions, then partitions the static background into spatially coherent sub-regions; it predicts depth for each pixel, estimates a local camera pose for each sub-region, and computes the divergence among these poses to quantify geometric consistency. Result: Experiments show that SGC robustly quantifies geometric inconsistencies and identifies critical failures that other metrics miss. Conclusion: SGC is a new, effective, and robust metric for the 3D spatial geometric consistency of generated videos that fills a gap left by existing evaluation methods. Abstract: Recent generative models can produce high-fidelity videos, yet they often exhibit 3D spatial geometric inconsistencies. Existing evaluation methods fail to accurately characterize these inconsistencies: fidelity-centric metrics like FVD are insensitive to geometric distortions, while consistency-focused benchmarks often penalize valid foreground dynamics. To address this gap, we introduce SGC, a metric for evaluating 3D Spatial Geometric Consistency in dynamically generated videos. We quantify geometric consistency by measuring the divergence among multiple camera poses estimated from distinct local regions. Our approach first separates static from dynamic regions, then partitions the static background into spatially coherent sub-regions. We predict depth for each pixel, estimate a local camera pose for each sub-region, and compute the divergence among these poses to quantify geometric consistency. Experiments on real and generative videos demonstrate that SGC robustly quantifies geometric inconsistencies, effectively identifying critical failures missed by existing metrics.
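
Measuring divergence among per-region camera poses can be illustrated with a simplified sketch that looks only at rotation. The abstract does not define SGC's divergence measure, so the mean pairwise geodesic angle used here, the omission of translation, and the names `rotation_angle`, `pose_divergence`, and `rot_z` are assumptions.

```python
import math
from itertools import combinations

def rotation_angle(r_a, r_b):
    """Geodesic angle (radians) between two 3x3 rotation matrices.

    Uses trace(r_a^T r_b) = sum of elementwise products.
    """
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for numerical safety
    return math.acos(cos_angle)

def pose_divergence(rotations):
    """Mean pairwise rotation angle; 0.0 when all regions agree on one pose."""
    pairs = list(combinations(rotations, 2))
    if not pairs:
        return 0.0
    return sum(rotation_angle(a, b) for a, b in pairs) / len(pairs)

def rot_z(theta):
    """Rotation about the z axis by theta radians (toy pose generator)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

consistent = pose_divergence([rot_z(0.3), rot_z(0.3), rot_z(0.3)])
inconsistent = pose_divergence([rot_z(0.0), rot_z(0.1), rot_z(0.3)])
print(consistent, inconsistent)
```

In a geometrically consistent static background, every sub-region should imply the same camera motion, so the divergence stays near zero; sub-regions that disagree push it up.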

[161] SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation

Phuc Pham,Uy Dieu Tran,Binh-Son Hua,Phong Nguyen

Main category: cs.CV

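TL;DR: This paper proposes SwiftTailor, a two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation, improving both the speed and quality of 3D garment generation.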

Details Motivation: Existing methods rely on large vision-language models to generate 2D sewing patterns that are then converted to 3D meshes; quality is high but inference is slow (30 seconds to a minute), which does not suit real-time or large-scale use. Method: SwiftTailor has two lightweight modules: PatternMaker, an efficient multi-modal vision-language model that predicts sewing patterns, and GarmentSewer, a dense prediction transformer that generates a garment geometry image in a unified UV space; the 3D mesh is then reconstructed directly through inverse mapping, remeshing, and dynamic stitching, avoiding the cost of physical simulation. Result: Experiments on Multimodal GarmentCodeData show that SwiftTailor keeps SOTA accuracy and visual fidelity while greatly reducing inference time. Conclusion: SwiftTailor offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation. Abstract: Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using a garment modeling framework such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer that converts these patterns into a novel Garment Geometry Image, encoding the 3D surface of all garment panels in a unified UV space. The final 3D mesh is reconstructed through an efficient inverse mapping process that incorporates remeshing and dynamic stitching algorithms to directly assemble the garment, thereby amortizing the cost of physical simulation. Extensive experiments on the Multimodal GarmentCodeData demonstrate that SwiftTailor achieves state-of-the-art accuracy and visual fidelity while significantly reducing inference time. This work offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation.

[162] Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

Yikai Zheng,Xin Ding,Yifan Yang,Shiqi Jiang,Hao Wu,Qianxi Zhang,Weijun Wang,Ting Cao,Yunxin Liu

Main category: cs.CV

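TL;DR: Em-Garde is a new framework that decouples semantic understanding from streaming perception; with an instruction-guided proposal parser and a lightweight proposal matching module, it improves both the accuracy and the efficiency of proactive responses in streaming video understanding.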

Details Motivation: Existing proactive VideoLLMs that make per-frame triggering decisions face a trade-off between efficiency and accuracy; efficient, accurate proactive responses are needed under strict computational constraints. Method: Em-Garde has two parts: (1) at query time, an instruction-guided proposal parser turns the user query into structured, perceptually grounded visual proposals; (2) during streaming, a lightweight proposal matching module triggers responses efficiently through embedding-based matching. Result: On StreamingBench and OVO-Bench, Em-Garde consistently outperforms prior models in proactive response accuracy and efficiency. Conclusion: Em-Garde is an effective approach to proactive video understanding under computational constraints and supports the design choice of decoupling semantic understanding from streaming perception. Abstract: Recent advances in Streaming Video Understanding have enabled a new interaction paradigm where models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decision making, which suffers from an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.
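
The propose-then-match pattern described above can be sketched as follows: proposals are embedded once at query time, and each incoming frame is only compared against them. This is an illustration under assumptions, not Em-Garde's code; the cosine similarity, the fixed threshold, the toy 2-dimensional embeddings, and the function `first_trigger` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def first_trigger(frame_embeddings, proposal_embeddings, threshold=0.9):
    """Return the index of the first frame matching any proposal, else None.

    Proposals are computed once per query, so the per-frame cost is only
    a handful of similarity computations.
    """
    for index, frame in enumerate(frame_embeddings):
        if any(cosine(frame, p) >= threshold for p in proposal_embeddings):
            return index
    return None

proposals = [[0.0, 1.0]]                       # one visual proposal from the query
stream = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]  # embeddings of incoming frames
print(first_trigger(stream, proposals))
```

Keeping the expensive language-side reasoning out of the per-frame loop is what lets a design like this respond proactively without running a large model on every frame.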

[163] SignAgent: Agentic LLMs for Linguistically-Grounded Sign Language Annotation and Dataset Curation

Oliver Cory,Ozge Mercanoglu Sincan,Richard Bowden

Main category: cs.CV

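TL;DR: This paper introduces SignAgent, a new framework that uses large language models (LLMs) for scalable, linguistically-grounded sign language (SL) annotation and dataset curation; an Orchestrator coordinates the tools and SignGraph supplies linguistic knowledge, and the framework performs strongly on pseudo-gloss annotation and ID glossing.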

Details Motivation: Traditional computational methods for sign languages work at the gloss level and overlook linguistic detail; manual linguistic annotation is slow and expensive and cannot support large-scale, phonologically-aware datasets. Method: Propose SignAgent with two core components: SignAgent Orchestrator, a reasoning LLM that coordinates a suite of linguistic tools, and SignGraph, a knowledge-grounded LLM that provides lexical and linguistic grounding. The framework is evaluated on two tasks: Pseudo-gloss Annotation (constrained assignment that uses multi-modal evidence to extract and order gloss labels) and ID Glossing (detecting and refining visual clusters based on visual similarity and phonological overlap). Result: SignAgent achieves strong performance on large-scale, linguistically-aware sign language data annotation and curation. Conclusion: SignAgent shows that an LLM-based agentic framework is effective and scalable for complex sign language annotation tasks and offers a new way to build high-quality, phonologically-aware sign language datasets. Abstract: This paper introduces SignAgent, a novel agentic framework that utilises Large Language Models (LLMs) for scalable, linguistically-grounded Sign Language (SL) annotation and dataset curation. Traditional computational methods for SLs often operate at the gloss level, overlooking crucial linguistic nuances, while manual linguistic annotation remains a significant bottleneck, proving too slow and expensive for the creation of large-scale, phonologically-aware datasets. SignAgent addresses these challenges through SignAgent Orchestrator, a reasoning LLM that coordinates a suite of linguistic tools, and SignGraph, a knowledge-grounded LLM that provides lexical and linguistic grounding. We evaluate our framework on two downstream annotation tasks. First, on Pseudo-gloss Annotation, where the agent performs constrained assignment, using multi-modal evidence to extract and order suitable gloss labels for signed sequences. Second, on ID Glossing, where the agent detects and refines visual clusters by reasoning over both visual similarity and phonological overlap to correctly identify and group lexical sign variants. Our results demonstrate that our agentic approach achieves strong performance for large-scale, linguistically-aware data annotation and curation.

[164] DROID-SLAM in the Wild

Moyang Li,Zihan Zhu,Marc Pollefeys,Daniel Barath

Main category: cs.CV

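TL;DR: This paper proposes a real-time RGB SLAM system based on differentiable uncertainty-aware bundle adjustment. It handles dynamic environments by estimating per-pixel uncertainty from multi-view visual feature inconsistency, enabling robust tracking and reconstruction.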

Details Motivation: Traditional SLAM assumes static scenes and tends to fail in dynamic environments; existing dynamic SLAM methods depend on predefined dynamic priors or uncertainty-aware mapping and remain limited with unknown dynamic objects or highly cluttered scenes. Method: A differentiable Uncertainty-aware Bundle Adjustment that estimates per-pixel uncertainty from multi-view visual feature inconsistency to improve robustness in dynamic environments. Result: State-of-the-art camera poses and scene geometry in cluttered dynamic scenes, running in real time at around 10 FPS. Conclusion: The method markedly improves the robustness and practicality of SLAM in dynamic environments and suits complex real-world scenes. Abstract: We present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 10 FPS. Code and datasets are available at https://github.com/MoyangLi00/DROID-W.git.

[165] Multi-Modal Building Change Detection for Large-Scale Small Changes: Benchmark and Baseline

Ye Wang,Wei Lu,Zhihui You,Keyan Chen,Tongfei Liu,Kaiyu Li,Hongruixuan Chen,Qingling Shu,Sibao Chen

Main category: cs.CV

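TL;DR: This paper introduces LSMD, a new multi-modal building change detection dataset, and MSCNet, a network that fuses RGB and near-infrared (NIR) information to improve the detection of small-scale changes.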

Details Motivation: Existing change detection methods are sensitive to illumination, season, and surface material variation, and RGB-only imagery tends to produce pseudo-changes and semantic ambiguity; multi-modal datasets often lack high-resolution, accurately registered bi-temporal imagery, and existing methods do not fully exploit the heterogeneity between RGB and NIR. Method: The authors build LSMD, a large-scale small-change multi-modal dataset, and propose the Multi-modal Spectral Complementarity Network (MSCNet), comprising the Neighborhood Context Enhancement Module (NCEM), the Cross-modal Alignment and Interaction Module (CAIM), and the Saliency-aware Multisource Refinement Module (SMRM). Result: MSCNet outperforms existing methods under multiple input configurations and improves fine-grained building change detection. Conclusion: Fusing RGB and NIR with a purpose-built network improves the robustness and accuracy of small-change detection in complex scenes, and LSMD provides a new benchmark for multi-modal change detection. Abstract: Change detection in optical remote sensing imagery is susceptible to illumination fluctuations, seasonal changes, and variations in surface land-cover materials. Relying solely on RGB imagery often produces pseudo-changes and leads to semantic ambiguity in features. Incorporating near-infrared (NIR) information provides heterogeneous physical cues that are complementary to visible light, thereby enhancing the discriminability of building materials and tiny structures while improving detection accuracy. However, existing multi-modal datasets generally lack high-resolution and accurately registered bi-temporal imagery, and current methods often fail to fully exploit the inherent heterogeneity between these modalities. To address these issues, we introduce the Large-scale Small-change Multi-modal Dataset (LSMD), a bi-temporal RGB-NIR building change detection benchmark dataset targeting small changes in realistic scenarios, providing a rigorous testing platform for evaluating multi-modal change detection methods in complex environments. Based on LSMD, we further propose the Multi-modal Spectral Complementarity Network (MSCNet) to achieve effective cross-modal feature fusion. MSCNet comprises three key components: the Neighborhood Context Enhancement Module (NCEM) to strengthen local spatial details, the Cross-modal Alignment and Interaction Module (CAIM) to enable deep interaction between RGB and NIR features, and the Saliency-aware Multisource Refinement Module (SMRM) to progressively refine fused features. Extensive experiments demonstrate that MSCNet effectively leverages multi-modal information and consistently outperforms existing methods under multiple input configurations, validating its efficacy for fine-grained building change detection. The source code will be made publicly available at: https://github.com/AeroVILab-AHU/LSMD

[166] TAU-R1: Visual Language Model for Traffic Anomaly Understanding

Yuqiang Lin,Kehua Chen,Sam Lockyer,Arjun Yadav,Mingxuan Sui,Shucheng Zhang,Yan Shi,Bingzhang Wang,Yuang Zhang,Markus Zarbock,Florain Stanek,Adrian Evans,Wenbin Li,Yinhai Wang,Nic Zhang

Main category: cs.CV

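TL;DR: This paper introduces the Roundabout-TAU dataset and the TAU-R1 model for Traffic Anomaly Understanding (TAU). A two-layer vision-language framework with a two-stage training strategy performs well on both classification and reasoning tasks.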

Details Motivation: Research on Traffic Anomaly Understanding (TAU) is held back by the lack of real-world benchmarks and task-specific methods. Method: The authors build Roundabout-TAU, a dataset of real roundabout videos (342 clips, more than 2,000 question-answer pairs), and propose TAU-R1, a two-layer framework (a lightweight anomaly classifier plus a larger anomaly reasoner), trained in two stages: decomposed-QA-enhanced supervised fine-tuning followed by GRPO-based post-training with TAU-specific rewards. Result: TAU-R1 performs strongly on both anomaly classification and reasoning while remaining efficient to deploy; the dataset and code are released. Conclusion: Roundabout-TAU fills a gap in real-world TAU benchmarks, and TAU-R1 with its training strategy offers an effective recipe for adapting VLMs to a vertical domain. Abstract: Traffic Anomaly Understanding (TAU) is important for traffic safety in Intelligent Transportation Systems. Recent vision-language models (VLMs) have shown strong capabilities in video understanding. However, progress on TAU remains limited due to the lack of benchmarks and task-specific methodologies. To address this limitation, we introduce Roundabout-TAU, a dataset constructed from real-world roundabout videos collected in collaboration with the City of Carmel, Indiana. The dataset contains 342 clips and is annotated with more than 2,000 question-answer pairs covering multiple aspects of traffic anomaly understanding. Building on this benchmark, we propose TAU-R1, a two-layer vision-language framework for TAU. The first layer is a lightweight anomaly classifier that performs coarse anomaly categorisation, while the second layer is a larger anomaly reasoner that generates detailed event summaries. To improve task-specific reasoning, we introduce a two-stage training strategy consisting of decomposed-QA-enhanced supervised fine-tuning followed by TAU-GRPO, a GRPO-based post-training method with TAU-specific reward functions. Experimental results show that TAU-R1 achieves strong performance on both anomaly classification and reasoning tasks while maintaining deployment efficiency. The dataset and code are available at: https://github.com/siri-rouser/TAU-R1
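
For readers unfamiliar with GRPO, the group-relative advantage that GRPO-style post-training builds on can be sketched as follows. This is a generic illustration, not the TAU-GRPO implementation: the reward values are made up, the TAU-specific reward functions are not described in the abstract, and using the population standard deviation with a small epsilon is an assumption.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and spread of its sampling group.

    Assumption: population standard deviation plus a small epsilon; some
    implementations differ in this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]

# Hypothetical rewards for four responses sampled for the same question.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group average receive a positive advantage and those below receive a negative one, so no separate value network is needed to form the baseline.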

[167] CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization

Weilin Chen,Jiahao Rao,Wenhao Wang,Xinyang Li,Xuan Cheng,Liujuan Cao

Main category: cs.CV

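TL;DR: CustomTex is a reference-image-driven framework for instance-level, high-fidelity texturing of 3D indoor scenes. It combines semantic-level and pixel-level distillation within a VSD optimization framework to produce a unified texture map with high quality, few artifacts, and minimal baked-in shading.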

Details Motivation: Existing text-driven methods lack fine-grained, instance-level control and produce low-quality textures with artifacts and baked-in shading. Method: CustomTex uses a dual-distillation strategy: semantic-level distillation (with instance cross-attention) ensures semantic plausibility and reference-instance alignment, and pixel-level distillation improves visual fidelity; both are unified within a Variational Score Distillation (VSD) optimization framework. Result: CustomTex clearly outperforms state-of-the-art methods in instance-level consistency, texture sharpness, artifact suppression, and reduction of baked-in shading. Conclusion: CustomTex provides a more direct and user-friendly path to high-quality, customizable 3D scene appearance editing. Abstract: The creation of high-fidelity, customizable 3D indoor scene textures remains a significant challenge. While text-driven methods offer flexibility, they lack the precision for fine-grained, instance-level control, and often produce textures with insufficient quality, artifacts, and baked-in shading. To overcome these limitations, we introduce CustomTex, a novel framework for instance-level, high-fidelity scene texturing driven by reference images. CustomTex takes an untextured 3D scene and a set of reference images specifying the desired appearance for each object instance, and generates a unified, high-resolution texture map. The core of our method is a dual-distillation approach that separates semantic control from pixel-level enhancement. We employ semantic-level distillation, equipped with an instance cross-attention, to ensure semantic plausibility and "reference-instance" alignment, and pixel-level distillation to enforce high visual fidelity. Both are unified within a Variational Score Distillation (VSD) optimization framework. Experiments demonstrate that CustomTex achieves precise instance-level consistency with reference images and produces textures with superior sharpness, reduced artifacts, and minimal baked-in shading compared to state-of-the-art methods. Our work establishes a more direct and user-friendly path to high-quality, customizable 3D scene appearance editing.

[168] Revisiting Autoregressive Models for Generative Image Classification

Ilia Sudakov,Artem Babenko,Dmitry Baranchuk

Main category: cs.CV

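TL;DR: This paper proposes a class-conditional generative classifier built on any-order autoregressive (AR) models. Averaging predictions over multiple token orders overcomes the fixed-token-order limitation of earlier AR classifiers, outperforming diffusion-based classifiers on several image classification benchmarks while being up to 25x more efficient.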

Details Motivation: Existing visual autoregressive generative classifiers depend on a fixed token order, which imposes a restrictive inductive bias for image understanding; single-order predictions lean on partial discriminative cues and are less comprehensive. Method: Use recent any-order AR models to predict under multiple token orders and average the results (order-marginalized prediction), improving discriminative power and robustness. Result: The approach consistently outperforms diffusion-based classifiers across image classification benchmarks and is up to 25x more efficient; its accuracy is competitive with state-of-the-art self-supervised discriminative models. Conclusion: Any-order AR modeling unlocks the classification potential of AR generative models and shows that generative classifiers can compete with advanced discriminative models in both accuracy and efficiency. Abstract: Class-conditional generative models have emerged as accurate and robust classifiers, with diffusion models demonstrating clear advantages over other visual generative paradigms, including autoregressive (AR) models. In this work, we revisit visual AR-based generative classifiers and identify an important limitation of prior approaches: their reliance on a fixed token order, which imposes a restrictive inductive bias for image understanding. We observe that single-order predictions rely more on partial discriminative cues, while averaging over multiple token orders provides a more comprehensive signal. Based on this insight, we leverage recent any-order AR models to estimate order-marginalized predictions, unlocking the high classification potential of AR models. Our approach consistently outperforms diffusion-based classifiers across diverse image classification benchmarks, while being up to 25x more efficient. Compared to state-of-the-art self-supervised discriminative models, our method delivers competitive classification performance, a notable achievement for generative classifiers.
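
The idea of order-marginalized prediction can be sketched in a few lines. The code below is an illustration under stated assumptions: `toy_log_likelihood` is a hypothetical stand-in for a class-conditional any-order AR model, and averaging log-likelihoods over randomly sampled orders is one plausible reading of "averaging over multiple token orders", not necessarily the estimator used in the paper.

```python
import math
import random

def order_marginalized_scores(tokens, labels, log_likelihood, num_orders=8, seed=0):
    """Average class-conditional log-likelihoods over random token orders."""
    rng = random.Random(seed)
    orders = []
    for _ in range(num_orders):
        order = list(range(len(tokens)))
        rng.shuffle(order)
        orders.append(order)
    return {
        label: sum(log_likelihood(tokens, label, order) for order in orders) / num_orders
        for label in labels
    }

def toy_log_likelihood(tokens, label, order):
    """Hypothetical model: 'smooth' expects repeated tokens, 'noisy' expects changes."""
    p_same = 0.9 if label == "smooth" else 0.1
    total = math.log(0.5)  # uniform prior on the first visited token
    for prev, cur in zip(order, order[1:]):
        same = tokens[prev] == tokens[cur]
        total += math.log(p_same if same else 1.0 - p_same)
    return total

labels = ["smooth", "noisy"]
scores = order_marginalized_scores([0, 0, 0, 0], labels, toy_log_likelihood)
prediction = max(scores, key=scores.get)
```

Classification then picks the label with the highest averaged score, so no single visiting order dominates the decision.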

[169] GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning

Yiren Lu,Yi Du,Disheng Liu,Yunlai Zhou,Chen Wang,Yu Yin

Main category: cs.CV

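TL;DR: This paper proposes GSMem, a framework that uses 3D Gaussian Splatting (3DGS) as a re-observable, continuous spatial memory for zero-shot embodied exploration and reasoning. Spatial recollection, multi-modal retrieval, and a hybrid exploration strategy improve task performance.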

Details Motivation: Existing scene representations, such as discrete scene graphs or static view snapshots, cannot be re-observed after the fact, so a target missed in the initial observation is unrecoverable, which undermines long-horizon embodied exploration. Method: GSMem builds a persistent spatial memory on 3D Gaussian Splatting (3DGS); a retrieval mechanism combining object-level scene graphs and semantic-level language fields provides spatial recollection; a hybrid exploration strategy combines VLM-driven semantic scoring with a 3DGS-based coverage objective. Result: Experiments on embodied question answering and lifelong navigation show that GSMem improves localization accuracy, reasoning quality, and exploration robustness in a zero-shot setting. Conclusion: As a renderable, queryable spatial memory, 3DGS can support long-term perception, reasoning, and exploration for embodied agents. Abstract: Effective embodied exploration requires agents to accumulate and retain spatial knowledge over time. However, existing scene representations, such as discrete scene graphs or static view-based snapshots, lack post-hoc re-observability. If an initial observation misses a target, the resulting memory omission is often irrecoverable. To bridge this gap, we propose GSMem, a zero-shot embodied exploration and reasoning framework built upon 3D Gaussian Splatting (3DGS). By explicitly parameterizing continuous geometry and dense appearance, 3DGS serves as a persistent spatial memory that endows the agent with Spatial Recollection: the ability to render photorealistic novel views from optimal, previously unoccupied viewpoints. To operationalize this, GSMem employs a retrieval mechanism that simultaneously leverages parallel object-level scene graphs and semantic-level language fields. This complementary design robustly localizes target regions, enabling the agent to "hallucinate" optimal views for high-fidelity Vision-Language Model (VLM) reasoning. Furthermore, we introduce a hybrid exploration strategy that combines VLM-driven semantic scoring with a 3DGS-based coverage objective, balancing task-aware exploration with geometric coverage. Extensive experiments on embodied question answering and lifelong navigation demonstrate the robustness and effectiveness of our framework.

[170] ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation

Kwanyoung Lee,Hyunwoo Oh,SeungJu Cha,Sungho Koh,Dong-Jin Kim

Main category: cs.CV

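TL;DR: This paper proposes ADAPT, a training-free, deterministic framework that uses attention scores and orthogonal components to optimize prompt scheduling, improving diffusion models on rare compositional concept generation.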

Details Motivation: Diffusion models struggle to generate compositional concepts that are rare in the training data; existing methods such as R2F are limited by the randomness of language models and by suboptimal guidance from iterative text embedding switching. Method: ADAPT uses attention scores and orthogonal components to plan prompt schedules deterministically and align them semantically, with no additional training or fine-tuning. Result: On the RareBench benchmark, ADAPT markedly improves rare compositional generation, accurately reflects the semantics of rare attributes, and offers deterministic, precise control while preserving visual integrity. Conclusion: ADAPT is an efficient, stable, training-free framework that addresses the challenge of rare compositional concept generation in diffusion models. Abstract: Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM for prompt scheduling, they suffer from inherent variance due to the randomness of language models and suboptimal guidance from iterative text embedding switching. To address these problems, we propose the ADAPT framework, a training-free framework that deterministically plans and semantically aligns prompt schedules, providing consistent guidance to enhance the composition of rare concepts. By leveraging attention scores and orthogonal components, ADAPT significantly enhances compositional generation of rare concepts in the RareBench benchmark without additional training or fine-tuning. Through comprehensive experiments, we demonstrate that ADAPT achieves superior performance in RareBench and accurately reflects the semantic information of rare attributes, providing deterministic and precise control over the generation of rare compositions without compromising visual integrity.
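
The abstract does not say how the orthogonal components are computed or used, so the following is only a reminder of the standard linear-algebra operation the term usually refers to: removing from one vector its projection onto another. The vector names and values are hypothetical, and real text embeddings would be far higher-dimensional.

```python
def orthogonal_component(v, u):
    """Return the part of v orthogonal to u, i.e. v minus its projection onto u."""
    uu = sum(x * x for x in u)
    if uu == 0:
        return list(v)  # nothing to project onto
    scale = sum(a * b for a, b in zip(v, u)) / uu
    return [a - scale * b for a, b in zip(v, u)]

# Toy 2-D "embeddings": what the rare prompt adds beyond the base prompt.
rare = [3.0, 4.0]
base = [1.0, 0.0]
residual = orthogonal_component(rare, base)
```

By construction the residual has zero dot product with the base vector, which is what makes it a candidate for isolating attribute-specific information.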

[171] Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation

Kwanyoung Lee,SeungJu Cha,Yebin Ahn,Hyunwoo Oh,Sungho Koh,Dong-Jin Kim

Main category: cs.CV

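TL;DR: This paper proposes Adaptive Auxiliary Prompt Blending (AAPB), a framework that improves semantic alignment and structural consistency when diffusion models generate rare concepts or edit images. A closed-form adaptive coefficient grounded in Tweedie's identity gives stable, training-free generation.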

Details Motivation: Diffusion models perform poorly in low-density regions of the training distribution (rare concepts or editing instructions), producing semantic misalignment and structural inconsistency; the root cause is the long-tailed nature of text-image datasets. Method: AAPB uses auxiliary anchor prompts for semantic or structural support and derives, from Tweedie's identity, a closed-form adaptive blending coefficient at each diffusion step that balances the target prompt against the anchor prompt. Result: On RareBench and FlowEdit, AAPB consistently improves semantic accuracy and structural fidelity over fixed interpolation and other training-free baselines. Conclusion: AAPB is a principled, training-free prompt modulation framework for rare concept generation and image editing that improves stability and target faithfulness in low-density regions. Abstract: Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long-tailed nature of text-image datasets, where rare concepts or editing instructions are underrepresented. To address this, we introduce Adaptive Auxiliary Prompt Blending (AAPB), a unified framework that stabilizes the diffusion process in low-density regions. AAPB leverages auxiliary anchor prompts to provide semantic support in rare concept generation and structural support in image editing, ensuring faithful guidance toward the target prompt. Unlike prior heuristic prompt alternation methods, AAPB derives a closed-form adaptive coefficient that optimally balances the influence between the auxiliary anchor and the target prompt at each diffusion step. Grounded in Tweedie's identity, our formulation provides a principled and training-free framework for adaptive prompt blending, ensuring stable and target-faithful generation. We demonstrate the effectiveness of adaptive interpolation over fixed interpolation through controlled experiments and empirically show consistent improvements on the RareBench and FlowEdit datasets, achieving superior semantic accuracy and structural fidelity compared to prior training-free baselines.

[172] ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis

Zhan Jin,Yu Luo,Yizhou Zhang,Ziyang Cui,Yuqing Wei,Xianchao Liu,Xueying Zeng,Qing Zhang

Main category: cs.CV

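TL;DR: ARIADNE is a two-stage framework that couples preference-aligned perception with RL-based diagnostic reasoning for anatomically coherent coronary stenosis detection. The perception module fine-tunes a vision-language model with DPO, using Betti numbers as preference signals to align vessel geometry; the reasoning module models stenosis localization as a Markov Decision Process with an explicit rejection mechanism to improve reliability. It performs well on clinical data and in multi-center external validation.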

Details Motivation: Conventional pixel-wise losses cannot guarantee topological consistency in coronary segmentation, so vessel trees come out fragmented and clinically less usable despite high pixel accuracy. Method: ARIADNE has two stages: (1) the perception module fine-tunes the Sa2VA vision-language foundation model with DPO (Direct Preference Optimization), using Betti numbers as the topological preference signal; (2) the reasoning module models stenosis localization as a Markov Decision Process with an autonomous rejection mechanism for ambiguous anatomy such as bifurcations and crossings, optimizing diagnostic reliability rather than coverage. Result: On 1,400 clinical angiograms, centerline Dice reaches 0.838 and false positives fall by 41% relative to geometric baselines; generalization is validated on the multi-center ARCADE and XCAD datasets. Conclusion: This is the first use of DPO for topological alignment in medical imaging, showing that preference learning over structural constraints mitigates topological errors without sacrificing diagnostic sensitivity, with relevance to interventional cardiology workflows. Abstract: Conventional pixel-wise loss functions fail to enforce topological constraints in coronary vessel segmentation, producing fragmented vascular trees despite high pixel-level accuracy. We present ARIADNE, a two-stage framework coupling preference-aligned perception with RL-based diagnostic reasoning for topologically coherent stenosis detection. The perception module employs DPO to fine-tune the Sa2VA vision-language foundation model using Betti number constraints as preference signals, aligning the policy toward geometrically complete vessel structures rather than pixel-wise overlap metrics. The reasoning module formulates stenosis localization as a Markov Decision Process with an explicit rejection mechanism that autonomously defers ambiguous anatomical candidates such as bifurcations and vessel crossings, shifting from coverage maximization to reliability optimization. On 1,400 clinical angiograms, ARIADNE achieves state-of-the-art centerline Dice of 0.838 and reduces false positives by 41% compared to geometric baselines. External validation on multi-center benchmarks ARCADE and XCAD confirms generalization across acquisition protocols. This represents the first application of DPO for topological alignment in medical imaging, demonstrating that preference-based learning over structural constraints mitigates topological violations while maintaining diagnostic sensitivity in interventional cardiology workflows.
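
To make the Betti-number signal concrete: the zeroth Betti number of a binary mask is its number of connected components, so a fragmented vessel tree has a larger value than an intact one. The sketch below counts components with 8-connectivity and ranks two candidate masks. Treating "fewer components" as the preferred output is an assumption made for illustration; the abstract does not spell out how the preference pairs are formed, and the function names are hypothetical.

```python
from collections import deque

def betti0(mask):
    """Count 8-connected foreground components in a binary mask (list of rows)."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = set()
    components = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            components += 1
            seen.add((r, c))
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and mask[ny][nx] and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            queue.append((ny, nx))
    return components

def preference_pair(mask_a, mask_b):
    """Return (chosen, rejected), preferring fewer components; None on a tie."""
    a, b = betti0(mask_a), betti0(mask_b)
    if a == b:
        return None
    return (mask_a, mask_b) if a < b else (mask_b, mask_a)

connected = [[1, 1, 1], [0, 0, 1]]
fragmented = [[1, 0, 1], [0, 0, 0]]
pair = preference_pair(fragmented, connected)
```

A pair of this form (chosen, rejected) is the kind of input a DPO-style objective consumes, which is how a non-differentiable topological measure can still steer training.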

[173] Reconstruction Matters: Learning Geometry-Aligned BEV Representation through 3D Gaussian Splatting

Yiren Lu,Xin Ye,Burhaneddin Yaman,Jingru Luo,Zhexiao Xiong,Liu Ren,Yu Yin

Main category: cs.CV

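TL;DR: This paper proposes Splat2BEV, a framework that uses explicit 3D Gaussian Splatting reconstruction to assist Bird's-Eye-View (BEV) perception, improving semantic and geometric accuracy and reaching state-of-the-art results on nuScenes and Argoverse.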

Details Motivation: Most BEV perception methods are trained end to end as black boxes, lacking explicit 3D geometric understanding and interpretability, which limits performance. Method: Splat2BEV first pre-trains a Gaussian generator that explicitly reconstructs the 3D scene from multi-view images and produces geometry-aligned features, then projects these features into BEV space for downstream tasks. Result: State-of-the-art results on BEV tasks such as semantic segmentation and 3D object detection on nuScenes and Argoverse, supporting the value of explicit 3D reconstruction for BEV perception. Conclusion: An explicit 3D representation is important for accurate and interpretable BEV perception, and Splat2BEV offers a way to combine geometric priors with semantic learning. Abstract: Bird's-Eye-View (BEV) perception serves as a cornerstone for autonomous driving, offering a unified spatial representation that fuses surrounding-view images to enable reasoning for various downstream tasks, such as semantic segmentation, 3D object detection, and motion prediction. However, most existing BEV perception frameworks adopt an end-to-end training paradigm, where image features are directly transformed into the BEV space and optimized solely through downstream task supervision. This formulation treats the entire perception process as a black box, often lacking explicit 3D geometric understanding and interpretability, leading to suboptimal performance. In this paper, we claim that an explicit 3D representation matters for accurate BEV perception, and we propose Splat2BEV, a Gaussian Splatting-assisted framework for BEV tasks. Splat2BEV aims to learn BEV feature representations that are both semantically rich and geometrically precise. We first pre-train a Gaussian generator that explicitly reconstructs 3D scenes from multi-view inputs, enabling the generation of geometry-aligned feature representations. These representations are then projected into the BEV space to serve as inputs for downstream tasks. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that Splat2BEV achieves state-of-the-art performance and validate the effectiveness of incorporating explicit 3D reconstruction into BEV perception.

[174] Tinted Frames: Question Framing Blinds Vision-Language Models

Wan-Cyuan Fan,Jiayun Luo,Declan Kutscher,Leonid Sigal,Ritwik Gupta

Main category: cs.CV

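TL;DR: This paper shows that vision-language models (VLMs) are "selectively blind": their attention to the image changes markedly with the framing of the text prompt (for example multiple choice versus open-ended), which degrades visual reasoning. The authors propose a lightweight learnable prompt-tuning method that makes visual attention and performance more robust across framings.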

Details Motivation: VLMs are often found to be "blind" on tasks that require visual reasoning, underusing the visual input; this paper asks whether that blindness is systematic and driven by linguistic framing, and what mechanism underlies it. Method: Using visual attention as a probe, the authors quantify how different framings (multiple choice, yes/no, open-ended) change the allocation of attention over image regions, then design a lightweight prompt-tuning method with learnable tokens that steers the model toward the robust attention patterns seen with open-ended questions. Result: Constrained framings such as multiple choice substantially reduce attention to image context, weaken focus on task-relevant regions, and shift attention toward uninformative tokens; this misallocation is the main cause of lower accuracy and cross-framing inconsistency. The proposed method improves visual grounding and cross-framing consistency across several VLMs and benchmarks. Conclusion: VLM blindness is not absolute but a selective effect modulated by linguistic framing; intervening on attention, rather than on architecture or training, improves the robustness of visual reasoning. Abstract: Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind. They modulate the amount of attention applied to visual inputs based on linguistic framing even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Constrained framings, such as multiple choice and yes/no, induce substantially lower attention to image context compared to open-ended, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving both visual grounding and performance across framings.
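
One simple way to quantify "the amount of attention applied to visual inputs" is the share of an attention row that lands on image tokens. The sketch below computes that share for two framings; the attention weights are invented for illustration, and the paper's actual probe (which layers, heads, and aggregation it uses) is not specified in the abstract.

```python
def image_attention_fraction(attention_row, is_image_token):
    """Fraction of total attention mass assigned to image tokens."""
    total = sum(attention_row)
    if total == 0:
        return 0.0
    image_mass = sum(w for w, flag in zip(attention_row, is_image_token) if flag)
    return image_mass / total

# Hypothetical attention from the answer position over four tokens:
# the first two are image tokens, the last two are text tokens.
is_image = [True, True, False, False]
open_ended = image_attention_fraction([0.3, 0.3, 0.2, 0.2], is_image)
multiple_choice = image_attention_fraction([0.1, 0.1, 0.4, 0.4], is_image)
```

Comparing this fraction across framings of the same question is the kind of measurement that would expose a framing-dependent drop in visual attention.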

[175] RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing

Yue Gong,Hongyu Li,Shanyuan Liu,Bo Cheng,Yuhang Ma,Liebucha Wu,Xiaoyu Wu,Manyuan Zhang,Dawei Leng,Yuhui Yin,Lijun Zhang

Main category: cs.CV

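TL;DR: This paper proposes the Representation-Pivoted AutoEncoder (RPiAE), a tokenizer built on pretrained visual representations. Representation-Pivot Regularization and a variational bridge preserve semantic structure while improving reconstruction fidelity and compressing the latent space, which improves diffusion-based generation and editing.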

Details Motivation: Tokenizers that reuse pretrained visual representation models as frozen encoders suffer from low reconstruction fidelity, poor editing quality, and latents so high-dimensional that diffusion modeling becomes difficult. Method: RPiAE has three parts: (1) Representation-Pivot Regularization, which constrains a representation-initialized encoder to keep its original semantic structure while it is fine-tuned; (2) a variational bridge that further compresses the latent space; (3) an objective-decoupled, stage-wise training strategy that optimizes generative tractability and reconstruction fidelity in sequence. Result: RPiAE outperforms other visual tokenizers on text-to-image generation and image editing and achieves the best reconstruction fidelity among representation-based tokenizers. Conclusion: RPiAE balances semantic preservation, reconstruction accuracy, and diffusion modeling efficiency, giving diffusion models a better latent space. Abstract: Diffusion models have become the dominant paradigm for image generation and editing, with latent diffusion models shifting denoising to a compact latent space for efficiency and scalability. Recent attempts to leverage pretrained visual representation models as tokenizer priors either align diffusion features to representation features or directly reuse representation encoders as frozen tokenizers. Although such approaches can improve generation metrics, they often suffer from limited reconstruction fidelity due to frozen encoders, which in turn degrades editing quality, as well as overly high-dimensional latents that make diffusion modeling difficult. To address these limitations, we propose Representation-Pivoted AutoEncoder, a representation-based tokenizer that improves both generation and editing. We introduce Representation-Pivot Regularization, a training strategy that enables a representation-initialized encoder to be fine-tuned for reconstruction while preserving the semantic structure of the pretrained representation space, followed by a variational bridge that compresses the latent space into a compact one for better diffusion modeling. We adopt an objective-decoupled stage-wise training strategy that sequentially optimizes generative tractability and reconstruction-fidelity objectives. Together, these components yield a tokenizer that preserves strong semantics, reconstructs faithfully, and produces latents with reduced diffusion modeling complexity. Experiments demonstrate that RPiAE outperforms other visual tokenizers on text-to-image generation and image editing, while delivering the best reconstruction fidelity among representation-based tokenizers.

[176] Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Shang-Jui Ray Kuo,Paola Cascante-Bonilla

Main category: cs.CV

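TL;DR: This paper examines state space models (SSMs) as vision backbones for vision-language models (VLMs). They perform strongly on VQA and localization and stay competitive at smaller scale. The paper also finds that higher ImageNet accuracy or a larger backbone does not reliably yield better VLM performance, and proposes stabilization strategies that improve localization robustness.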

Details Motivation: To determine whether state space models (SSMs) can be an effective alternative to transformer vision backbones for VLM performance and efficiency across tasks such as VQA and localization. Method: A systematic, controlled evaluation of SSM vision backbones in VLMs, including comparison under matched ImageNet-1K initialization, adaptation with dense-task (detection or segmentation) training, and analysis of backbone stability and the factors that affect performance. Result: The SSM backbone achieves the strongest overall performance on VQA and localization; after dense-task tuning it remains competitive at a substantially smaller model scale; ImageNet accuracy does not reliably predict VLM performance, and some backbones are unstable in localization; the proposed stabilization strategies improve robustness for both backbone families. Conclusion: SSM vision backbones are a strong alternative to transformer-based encoders in VLMs, combining strong performance, smaller scale, and robustness that can be improved. Abstract: Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.

[177] DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising

Tianjiao Yu,Xinzhuo Li,Muntasir Wahed,Jerry Xiong,Yifan Shen,Ying Shen,Ismini Lourentzou

Main category: cs.CV

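TL;DR: DreamPartGen is a semantically grounded, part-aware text-to-3D generation framework. Duplex Part Latents (DPLs) model each part's geometry and appearance, Relational Semantic Latents (RSLs) model semantic dependencies between parts, and synchronized co-denoising enforces geometric and semantic consistency.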

Details Motivation: Most text-to-3D methods ignore the semantic and functional part structure of 3D objects; existing part-aware methods focus on geometry, lack semantic grounding, and cannot align parts with text or model inter-part relations. Method: DreamPartGen introduces Duplex Part Latents (DPLs), which jointly model each part's geometry and appearance, and Relational Semantic Latents (RSLs), which capture language-derived dependencies between parts; a synchronized co-denoising process enforces geometric and semantic consistency. Result: State-of-the-art geometric fidelity and text-shape alignment across multiple benchmarks. Conclusion: DreamPartGen produces interpretable, text-aligned, structurally consistent 3D content and offers an approach to controllable part-level text-to-3D generation. Abstract: Understanding and generating 3D objects as compositions of meaningful parts is fundamental to human perception and reasoning. However, most text-to-3D methods overlook the semantic and functional structure of parts. While recent part-aware approaches introduce decomposition, they remain largely geometry-focused, lacking semantic grounding and failing to model how parts align with textual descriptions or their inter-part relations. We propose DreamPartGen, a framework for semantically grounded, part-aware text-to-3D generation. DreamPartGen introduces Duplex Part Latents (DPLs) that jointly model each part's geometry and appearance, and Relational Semantic Latents (RSLs) that capture inter-part dependencies derived from language. A synchronized co-denoising process enforces mutual geometric and semantic consistency, enabling coherent, interpretable, and text-aligned 3D synthesis. Across multiple benchmarks, DreamPartGen delivers state-of-the-art performance in geometric fidelity and text-shape alignment.

[178] LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

Keda Tao,Yuhua Zheng,Jia Xu,Wenjie Du,Kele Shao,Hesong Wang,Xueyi Chen,Xin Jin,Junhan Zhu,Bohan Yu,Weiqiang Wang,Jian Liu,Can Qin,Yulun Zhang,Ming-Hsuan Yang,Huan Wang

Main category: cs.CV

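TL;DR: This paper introduces LVOmniBench, a new benchmark for evaluating cross-modal understanding of long-form audio and video, with 275 high-quality videos of 10 to 90 minutes and 1,014 question-answer pairs. Current OmniLLMs perform poorly on it: open-source models score below 35%, and Gemini 3 Pro reaches about 65%.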

Details Motivation: Existing OmniLLM evaluations focus on short clips (10 seconds to 5 minutes) and do not reflect real-world demands, where videos run for tens of minutes, leaving a clear evaluation gap. Method: LVOmniBench is built from audio-visually rich content on open platforms, manually selected and annotated into 275 long videos (10 to 90 minutes) and 1,014 QA pairs, with evaluation dimensions covering long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Result: Current OmniLLMs perform poorly on LVOmniBench: open-source models generally score below 35%, and Gemini 3 Pro peaks at about 65%, confirming that long-form audio-visual understanding remains a major challenge. Conclusion: LVOmniBench fills a gap in evaluating long-form audio-visual cross-modal understanding, and its data and findings should encourage OmniLLMs with long-range modeling and complex cross-modal reasoning. Abstract: Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.

[179] Rethinking Vector Field Learning for Generative Segmentation

Chaoyang Wang,Yaobo Liang,Boci Peng,Fan Duan,Jingdong Wang,Yunhai Tong

Main category: cs.CV

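TL;DR: This paper approaches diffusion-based segmentation from the perspective of vector field learning. It adds a distance-aware correction term to the velocity field to mitigate gradient vanishing and trajectory traversing, and designs an efficient category encoding scheme based on Kronecker sequences, substantially improving generative semantic segmentation.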

Details Motivation: When diffusion models are used for segmentation, there is an intrinsic mismatch between continuous flow matching objectives and discrete perception tasks, and the gradient vanishing and trajectory traversing problems are poorly understood. Method: A vector field reshaping strategy adds a detached distance-aware correction term, with attractive and repulsive interactions, to the velocity field; a quasi-random category encoding based on Kronecker sequences is embedded in an end-to-end pixel neural field framework. Result: Significant improvements over vanilla flow matching across benchmarks, substantially narrowing the gap between generative segmentation and strong discriminative models. Conclusion: Revisiting diffusion segmentation through vector field learning is effective; the correction mechanism and encoding scheme improve convergence and class separability without changing the original training framework. Abstract: Taming diffusion models for generative segmentation has attracted increasing attention. While existing approaches primarily focus on architectural tweaks or training heuristics, there remains a limited understanding of the intrinsic mismatch between continuous flow matching objectives and discrete perception tasks. In this work, we revisit diffusion segmentation from the perspective of vector field learning. We identify two key limitations of the commonly used flow matching objective: gradient vanishing and trajectory traversing, which result in slow convergence and poor class separation. To tackle these issues, we propose a principled vector field reshaping strategy that augments the learned velocity field with a detached distance-aware correction term. This correction introduces both attractive and repulsive interactions, enhancing gradient magnitudes near centroids while preserving the original diffusion training framework. Furthermore, we design a computationally efficient, quasi-random category encoding scheme inspired by Kronecker sequences, which integrates seamlessly with an end-to-end pixel neural field framework for pixel-level semantic alignment. Extensive experiments consistently demonstrate significant improvements over vanilla flow matching approaches, substantially narrowing the performance gap between generative segmentation and strong discriminative specialists.
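
A Kronecker sequence takes the fractional part of n times a fixed irrational vector, which spreads points evenly over the unit cube without random sampling. The sketch below generates one code per category in this way. The choice of irrationals (fractional parts of square roots of the first primes) and the function names are assumptions made for illustration; the abstract only says the encoding is "inspired by Kronecker sequences" and does not give the paper's construction.

```python
import math

def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def kronecker_codes(num_classes, dim):
    """Code for class n (1-based) is frac(n * alpha), one alpha per dimension.

    Assumption: alpha_j = frac(sqrt(p_j)) for the j-th prime p_j.
    """
    alphas = [math.sqrt(p) % 1.0 for p in first_primes(dim)]
    return [[(n * a) % 1.0 for a in alphas] for n in range(1, num_classes + 1)]

codes = kronecker_codes(num_classes=5, dim=3)
```

Because the codes are a closed-form function of the class index, adding categories requires no learned embedding table and no re-sampling.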

[180] DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding

Dong Zhuo,Wenzhao Zheng,Sicheng Zuo,Siming Yan,Lu Hou,Jie Zhou,Jiwen Lu

Main category: cs.CV

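TL;DR: This paper proposes DriveTok, a 3D driving scene tokenizer for multi-view autonomous driving scenes. It converts visual features into unified scene tokens via 3D deformable cross-attention and supports multi-task reconstruction and prediction.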

Details Motivation: Existing image tokenizers mainly target monocular 2D scenes and suffer from inefficiency and inter-view inconsistency on high-resolution multi-view driving scenes. Method: DriveTok extracts semantically rich features with vision foundation models and generates scene tokens through 3D deformable cross-attention; for decoding, a multi-view transformer reconstructs multi-view features, and multiple heads handle RGB, depth, and semantic reconstruction as well as 3D semantic occupancy prediction. Result: Experiments on nuScenes show that DriveTok's scene tokens perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction. Conclusion: DriveTok achieves unified multi-view tokenization that integrates semantic, geometric, and textural information, improving the efficiency and consistency of visual representations in autonomous driving systems. Abstract: With the growing adoption of vision-language-action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular and 2D scenes, leading to inefficiency and inter-view inconsistency when applied to high-resolution multi-view driving scenes. To address this, we propose DriveTok, an efficient 3D driving scene tokenizer for unified multi-view reconstruction and understanding. DriveTok first obtains semantically rich visual features from vision foundation models and then transforms them into the scene tokens with 3D deformable cross-attention. For decoding, we employ a multi-view transformer to reconstruct multi-view features from the scene tokens and use multiple heads to obtain RGB, depth, and semantic reconstructions. We also add a 3D head directly on the scene tokens for 3D semantic occupancy prediction for better spatial awareness. With the multiple training objectives, DriveTok learns unified scene tokens that integrate semantic, geometric, and textural information for efficient multi-view tokenization. Extensive experiments on the widely used nuScenes dataset demonstrate that the scene tokens from DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction tasks.

[181] Spectrally-Guided Diffusion Noise Schedules

Carlos Esteves,Ameesh Makadia

Main category: cs.CV

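TL;DR: This paper proposes per-instance noise schedules based on an image's spectral properties. Theoretical bounds determine the useful noise range and eliminate redundant sampling steps, and the schedule is conditionally sampled at inference, improving the generation quality of pixel diffusion models at low step counts.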

Details Motivation: Noise schedules for denoising diffusion models are usually handcrafted and need manual tuning across resolutions, with no ability to adapt to the properties of an individual image. Method: Theoretical bounds on the efficacy of minimum and maximum noise levels are derived from the image's spectral properties and used to build "tight" per-instance noise schedules; at inference, such schedules are sampled conditionally. Result: The proposed schedules improve the generation quality of single-stage pixel diffusion models, particularly in the low-step regime. Conclusion: Noise schedules should adapt to image content (such as its spectrum) rather than be set uniformly; spectrally guided tight schedules reduce redundant computation and improve generation efficiency and quality. Abstract: Denoising diffusion models are widely used for high-quality image and video generation. Their performance depends on noise schedules, which define the distribution of noise levels applied during training and the sequence of noise levels traversed during sampling. Noise schedules are typically handcrafted and require manual tuning across different resolutions. In this work, we propose a principled way to design per-instance noise schedules for pixel diffusion, based on the image's spectral properties. By deriving theoretical bounds on the efficacy of minimum and maximum noise levels, we design "tight" noise schedules that eliminate redundant steps. During inference, we propose to conditionally sample such noise schedules. Experiments show that our noise schedules improve generative quality of single-stage pixel diffusion models, particularly in the low-step regime.
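
To make the notion of a "tight" schedule concrete, here is a minimal sketch that builds a sampling sequence restricted to a per-image noise range. The bounds `sigma_min` and `sigma_max` are taken as given inputs (the abstract says they come from spectral analysis but does not state the formulas), and the geometric spacing between them is an assumption chosen for illustration, not the paper's construction.

```python
def tight_schedule(sigma_min, sigma_max, num_steps):
    """Noise levels from sigma_max down to sigma_min, geometrically spaced.

    sigma_min / sigma_max are hypothetical per-image bounds; how they are
    derived from the image spectrum is not covered by this sketch.
    """
    if not 0 < sigma_min < sigma_max:
        raise ValueError("expected 0 < sigma_min < sigma_max")
    if num_steps < 2:
        raise ValueError("need at least two steps")
    ratio = sigma_min / sigma_max
    return [sigma_max * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]


# Example: an 8-step schedule confined to a (made-up) per-image range.
schedule = tight_schedule(sigma_min=0.02, sigma_max=80.0, num_steps=8)
```

The point of the sketch is only that every one of the few available steps falls inside the range that matters for the given image, instead of being spent on noise levels outside it.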

[182] EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing

Yang Fu,Yike Zheng,Ziyun Dai,Henghui Ding

Main category: cs.CV

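TL;DR: This paper introduces the VOR dataset and the EffectErase method for high-quality removal of target objects in video together with their visual effects (such as deformation, shadows, and reflections). VOR is a large-scale paired video dataset covering multiple effect types and complex scenes; EffectErase achieves effect-aware video object removal through an inverse insertion task and a consistency constraint, and performs strongly across experiments.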

Details Motivation: Existing diffusion models can remove target objects from video but struggle to fully erase their visual effects (such as shadows, reflections, and deformation), and there is no high-quality paired dataset that systematically covers these effects for training and evaluation. Method: The authors build VOR, a large-scale paired video dataset (60K video pairs, five effect types, many object categories, and dynamic multi-object scenes). On top of it they propose EffectErase, which uses a reciprocal learning scheme that treats video object insertion as the inverse auxiliary task, and adds task-aware region guidance and an insertion-removal consistency loss to jointly localize effect regions and strengthen structural consistency. Result: Trained on VOR, EffectErase clearly outperforms existing methods across effect removal tasks, producing more coherent backgrounds and more complete effect erasure, with strong generalization. Conclusion: VOR fills the gap left by the lack of a benchmark for video object effect removal; EffectErase validates effect-aware modeling and reciprocal learning and offers a new paradigm for high-quality video editing. Abstract: Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless backgrounds. Recent diffusion-based video inpainting and object removal methods can remove the objects but often struggle to erase these effects and to synthesize coherent backgrounds. Beyond method limitations, progress is further hampered by the lack of a comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation. To address this, we introduce VOR (Video Object Removal), a large-scale dataset that provides diverse paired videos, each consisting of one video where the target object is present with its effects and a counterpart where the object and effects are absent, with corresponding object masks. VOR contains 60K high-quality video pairs from captured and synthetic sources, covers five effect types, and spans a wide range of object categories as well as complex, dynamic multi-object scenes. Building on VOR, we propose EffectErase, an effect-aware video object removal method that treats video object insertion as the inverse auxiliary task within a reciprocal learning scheme. The model includes task-aware region guidance that focuses learning on affected areas and enables flexible task switching. It also includes an insertion-removal consistency objective that encourages complementary behaviors and shared localization of effect regions and structural cues. Trained on VOR, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.

[183] Under One Sun: Multi-Object Generative Perception of Materials and Illumination

Nobuo Yoshii,Xinran Nicole Han,Ryo Kawahara,Todd Zickler,Ko Nishino

Main category: cs.CV

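TL;DR: This paper proposes Multi-Object Generative Perception (MultiGP), a generative inverse rendering method that works from a single image and exploits the prior that multiple objects share the same illumination to jointly sample reflectance, texture, and illumination.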

Details Motivation: Resolve the inherent ambiguity of single-image radiometric disentanglement by using the prior that objects in the same scene share the same illumination. Method: A cascaded end-to-end architecture (combining image-space and angular-space disentanglement), Coordinated Guidance that drives diffusion to converge to a single consistent illumination estimate, Axial Attention that promotes information exchange between objects of different reflectance, and a Texture Extraction ControlNet that preserves high-frequency texture while decoupling it from lighting. Result: Experiments show that MultiGP effectively exploits the complementary spatial and frequency characteristics of multiple object appearances to recover individual texture and reflectance along with the common illumination. Conclusion: By introducing a multi-object consistency constraint and new generative modeling mechanisms, MultiGP improves the disentanglement quality and sampling flexibility of radiometric components in single-image inverse rendering. Abstract: We introduce Multi-Object Generative Perception (MultiGP), a generative inverse rendering method for stochastic sampling of all radiometric constituents -- reflectance, texture, and illumination -- underlying object appearance from a single image. Our key idea to solve this inherently ambiguous radiometric disentanglement is to leverage the fact that while their texture and reflectance may differ, objects in the same scene are all lit by the same illumination. MultiGP exploits this consensus to produce samples of reflectance, texture, and illumination from a single image of known shapes based on four key technical contributions: a cascaded end-to-end architecture that combines image-space and angular-space disentanglement; Coordinated Guidance for diffusion convergence to a single consistent illumination estimate; Axial Attention applied to facilitate "cross-talk" between objects of different reflectance; and a Texture Extraction ControlNet to preserve high-frequency texture details while ensuring decoupling from estimated lighting. Experimental results demonstrate that MultiGP effectively leverages the complementary spatial and frequency characteristics of multiple object appearances to recover individual texture and reflectance as well as the common illumination.

[184] Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

Chenyang Gu,Mingyuan Zhang,Haozhe Xie,Zhongang Cai,Lei Yang,Ziwei Liu

Main category: cs.CV

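TL;DR: This paper proposes a three-stage motion generation framework (Perception, Planning, Control) centered on MoTok, a diffusion-based discrete motion tokenizer that balances semantic conditioning with motion fidelity and significantly improves controllability and accuracy on HumanML3D.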

Details Motivation: Existing motion generation methods split into continuous diffusion models (strong at kinematic control) and discrete token generators (well suited to semantic conditioning), and it is hard to get both advantages at once; this paper aims to combine them. Method: A three-stage framework: 1) Perception extracts condition features; 2) Planning generates discrete tokens with MoTok, a diffusion-based tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder; 3) Control enforces fine-grained kinematic constraints through diffusion-based optimization. Coarse constraints guide token planning and fine-grained constraints are handled in the control stage, so they do not disrupt semantic planning. Result: On HumanML3D, compared with MaskControl, trajectory error drops from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029 while using only one-sixth of the tokens; under stronger kinematic constraints FID improves further, from 0.033 to 0.014, rather than degrading. Conclusion: The framework combines the strengths of discrete semantic modeling and continuous motion control; MoTok provides a high-fidelity, low-overhead motion representation and improves the controllability, fidelity, and robustness of motion generation. Abstract: Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details from disrupting semantic token planning. On HumanML3D, our method significantly improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods that degrade under stronger kinematic constraints, ours improves fidelity, reducing FID from 0.033 to 0.014.
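
For readers who want the reported numbers as relative changes, the short sketch below recomputes them from the figures quoted in the abstract. Only the before/after values come from the source; the helper name and the labels are illustrative.

```python
def relative_reduction(before, after):
    """Fraction by which a lower-is-better metric decreased."""
    return (before - after) / before


# (before, after) pairs as quoted in the abstract.
reported = {
    "trajectory error (cm), MaskControl -> MoTok": (0.72, 0.08),
    "FID, MaskControl -> MoTok": (0.083, 0.029),
    "FID, MoTok as kinematic constraints strengthen": (0.033, 0.014),
}

reductions = {name: relative_reduction(b, a) for name, (b, a) in reported.items()}

for name, value in reductions.items():
    print(f"{name}: {value:.1%} lower")
```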

[185] SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Xinyao Zhang,Wenkai Dong,Yuxin Song,Bo Fang,Qi Zhang,Jing Wang,Fan Chen,Hui Zhang,Haocheng Feng,Yu Lu,Hang Zhou,Chun Yuan,Jingdong Wang

Main category: cs.CV

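TL;DR: This paper proposes SAMA, a framework that factorizes instruction-guided video editing into semantic anchoring and motion alignment to better balance precise semantic modification with motion fidelity. It does not rely on external priors and shows strong zero-shot generalization.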

Details Motivation: Existing instruction-guided video editing models struggle to combine precise semantic modification with faithful motion preservation, and they rely heavily on explicit external priors (such as VLM features or structural conditions), which limits robustness and generalization. Method: SAMA consists of 1) Semantic Anchoring, which jointly predicts semantic tokens and video latents at sparse anchor frames for purely instruction-driven structural planning, and 2) Motion Alignment, which pre-trains the backbone on motion-centric pretext tasks (cube inpainting, speed perturbation, and tube shuffle) so that it learns temporal dynamics directly from raw video. Training has two stages: factorized pre-training without paired editing data, followed by supervised fine-tuning on paired editing data. Result: SAMA reaches state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni); factorized pre-training alone already yields strong zero-shot video editing ability. Conclusion: Explicitly factorizing semantics and motion is an effective route to better editing quality and generalization; SAMA shows that high-quality editing is possible without external priors, using only self-supervised motion modeling. Abstract: Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.

[186] MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Haitian Li,Haozhe Xie,Junxiang Xu,Beichen Wen,Fangzhou Hong,Ziwei Liu

Main category: cs.CV

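TL;DR: MonoArt is a unified framework for reconstructing articulated 3D objects from a single image. Progressive structural reasoning disentangles motion cues from object structure, enabling stable and interpretable articulation inference and state-of-the-art performance on PartNet-Mobility.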

Details Motivation: Reconstructing articulated 3D objects from a single image is hard because motion cues and object structure are tightly entangled, which makes direct regression unstable; existing methods rely on multi-view supervision, retrieval-based assembly, or video generation, sacrificing scalability or efficiency. Method: MonoArt uses progressive structural reasoning: image features are progressively transformed into canonical geometry, structured part representations, and motion-aware embeddings, avoiding direct regression of articulation parameters and requiring no external motion templates or multi-stage pipelines. Result: On PartNet-Mobility, MonoArt reaches state-of-the-art reconstruction accuracy and inference speed, and it generalizes to robotic manipulation and articulated scene reconstruction. Conclusion: By learning structured, progressive internal representations, MonoArt disentangles structure from motion and offers an efficient, stable, and scalable paradigm for single-image articulated reconstruction. Abstract: Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.

[187] Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

Yuqing Wang,Chuofan Ma,Zhijie Lin,Yao Teng,Lijun Yu,Shuai Wang,Jiaming Han,Jiashi Feng,Yi Jiang,Xihui Liu

Main category: cs.CV

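TL;DR: This paper proposes Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations (768-1024 dims). Fine-grained per-dimension masking and prediction let it model rich correlations across positions and dimensions in a fixed number of steps T; it achieves state-of-the-art discrete generation on ImageNet-256 and shows that the same discrete tokens serve both understanding and generation.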

Details Motivation: Existing discrete visual generation methods are limited to low-dimensional latent tokens (8-32 dims) with weak semantic expressiveness; high-dimensional pretrained representations (768-1024 dims) are semantically rich, but generating them discretely poses fundamental challenges, and a new paradigm is needed to support unified multimodal architectures. Method: CubiD applies fine-grained masking to the high-dimensional discrete representation (any dimension at any spatial position can be masked independently) and predicts each masked entry from partial observations; the number of generation steps T is fixed and far smaller than the total number of entries hwd, so correlations within and across positions are modeled efficiently. Result: CubiD achieves state-of-the-art discrete generation on ImageNet-256 and scales well from 900M to 3.7B parameters; experiments confirm that the discretized tokens preserve the understanding capability of the original representations and can serve both understanding and generation tasks. Conclusion: CubiD is the first to achieve efficient discrete generation of high-dimensional representations, bridging the gap between semantic richness and generative capability in discrete tokens and offering a viable path toward unified multimodal models. Abstract: Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation -- any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at $T$ regardless of feature dimensionality, where $T \ll hwd$. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.
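
The bookkeeping behind "any dimension at any position can be masked" with a fixed step count can be sketched as follows: all h*w*d entries of the token cube start masked and are revealed in T groups. The uniform random ordering and the function name are assumptions for illustration; the abstract does not say how CubiD chooses which entries to reveal at each step, and the predictor itself is omitted.

```python
import random


def unmask_plan(h, w, d, num_steps, seed=0):
    """Split all h*w*d token entries into num_steps groups to reveal in turn.

    Each entry is (row, col, channel). Assumption: entries are revealed in a
    uniformly random order, in groups of roughly equal size.
    """
    entries = [(i, j, c) for i in range(h) for j in range(w) for c in range(d)]
    random.Random(seed).shuffle(entries)
    total = len(entries)
    # Group t covers entries[bounds[t]:bounds[t + 1]].
    bounds = [t * total // num_steps for t in range(num_steps + 1)]
    return [entries[bounds[t]:bounds[t + 1]] for t in range(num_steps)]


# 2 x 2 positions with 4 channels each: 16 entries revealed in T = 5 steps.
plan = unmask_plan(h=2, w=2, d=4, num_steps=5)
```

The number of steps is a free parameter here, independent of d, which mirrors the abstract's statement that T stays fixed regardless of feature dimensionality.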

[188] Matryoshka Gaussian Splatting

Zhilin Guo,Boqiao Zhang,Hakan Aktas,Kyle Fogarty,Jeffrey Hu,Nursena Koprucu Aslan,Wenzhao Li,Canberk Baykal,Albert Miao,Josef Bengtson,Chenliang Zhou,Weihao Xia,Cristina Nader Vasconcelos,Cengiz Oztireli

Main category: cs.CV

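TL;DR: This paper proposes Matryoshka Gaussian Splatting (MGS), a training framework for 3D Gaussian Splatting that supports continuous level of detail (LoD), giving a smoothly adjustable speed-quality trade-off from a single model without sacrificing full-capacity rendering quality.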

Details Motivation: Existing discrete LoD methods offer only a limited set of operating points, while continuous LoD methods are more flexible but often show noticeable quality degradation at full capacity, making LoD a costly design choice. Method: MGS uses stochastic budget training: each iteration samples a random splat budget k and optimizes both the prefix formed by the first k splats and the full set of Gaussians. This needs only two forward passes and no architectural changes. The model learns a single ordered set of Gaussians such that rendering any prefix stays coherent and fidelity improves smoothly with k. Result: Experiments across four benchmarks and six baselines show that MGS matches its backbone at full capacity and enables a continuous trade-off between rendering speed and quality from a single model; ablations validate the ordering strategy, training objective, and model capacity design. Conclusion: MGS gives 3D Gaussian Splatting an efficient, high-quality, truly continuous LoD capability, improving its flexibility and practicality in real deployments. Abstract: The ability to render scenes at adjustable fidelity from a single model, known as level of detail (LoD), is crucial for practical deployment of 3D Gaussian Splatting (3DGS). Existing discrete LoD methods expose only a limited set of operating points, while concurrent continuous LoD approaches enable smoother scaling but often suffer noticeable quality degradation at full capacity, making LoD a costly design decision. We introduce Matryoshka Gaussian Splatting (MGS), a training framework that enables continuous LoD for standard 3DGS pipelines without sacrificing full-capacity rendering quality. MGS learns a single ordered set of Gaussians such that rendering any prefix, the first k splats, produces a coherent reconstruction whose fidelity improves smoothly with increasing budget. Our key idea is stochastic budget training: each iteration samples a random splat budget and optimises both the corresponding prefix and the full set. This strategy requires only two forward passes and introduces no architectural modifications. Experiments across four benchmarks and six baselines show that MGS matches the full-capacity performance of its backbone while enabling a continuous speed-quality trade-off from a single model. Extensive ablations on ordering strategies, training objectives, and model capacity further validate the designs.
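
The stochastic budget objective described above can be sketched on a toy problem. Here a "splat" is just a scalar contribution and "rendering" a prefix means summing its first k contributions; the real method renders Gaussians with a 3DGS rasterizer. The squared-error loss, the uniform sampling of k, and all names are assumptions made for illustration.

```python
import random


def render(splats, k):
    """Toy renderer: the output is the sum of the first k splat contributions."""
    return sum(splats[:k])


def stochastic_budget_loss(splats, target, rng):
    """One training-step objective: prefix loss at a random budget plus full loss.

    This corresponds to the two forward passes mentioned in the abstract:
    one over the sampled prefix and one over the complete ordered set.
    """
    n = len(splats)
    k = rng.randint(1, n)  # random splat budget, inclusive on both ends
    prefix_loss = (render(splats, k) - target) ** 2
    full_loss = (render(splats, n) - target) ** 2
    return k, prefix_loss + full_loss


rng = random.Random(0)
splats = [1.0, 0.5, 0.25, 0.25]  # ordered: early splats carry the most signal
k, loss = stochastic_budget_loss(splats, target=2.0, rng=rng)
```

In this toy setup the full set already matches the target, so the loss comes entirely from the sampled prefix and shrinks as k grows, which is the ordering behaviour the training objective is meant to encourage.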

[189] Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Xianjin Wu,Dingkang Liang,Tianrui Feng,Kui Xia,Yumeng Zhang,Xiaofan Li,Xiao Tan,Xiang Bai

Main category: cs.CV

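TL;DR: This paper proposes VEGA-3D, a framework that uses the implicit spatiotemporal and physical priors in pre-trained video diffusion models to strengthen the spatial and geometric understanding of multimodal large language models (MLLMs). It needs no explicit 3D supervision and outperforms state-of-the-art baselines on several spatial reasoning and embodied manipulation tasks.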

Details Motivation: Existing MLLMs suffer from "spatial blindness" and struggle with fine-grained geometric reasoning and physical dynamics, while methods that depend on explicit 3D modalities or complex geometric scaffolding are limited by data scarcity and poor generalization. Method: VEGA-3D treats a pre-trained video diffusion model as an implicit "Latent World Simulator", extracts spatiotemporal features from intermediate noise levels of its denoising process, and fuses them with semantic representations through a token-level adaptive gated fusion mechanism, injecting dense geometric cues into the MLLM. Result: The method outperforms state-of-the-art baselines on benchmarks for 3D scene understanding, spatial reasoning, and embodied manipulation, supporting the claim that generative priors are a scalable foundation for physical-world understanding. Conclusion: The implicit spatial and physical priors in video generation models can be extracted and transferred to MLLMs, offering a lightweight, plug-and-play way to address spatial blindness without extra 3D annotation. Abstract: While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
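
A token-level gated fusion of two feature streams can be sketched as below. The abstract names the mechanism but not its form, so the scalar sigmoid gate computed from the concatenated features and the convex mixing rule are assumptions, as are all parameter names; a real implementation would use learned tensors in a deep learning framework.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(semantic, generative, gate_weights, gate_bias):
    """Fuse two feature vectors for one token with a scalar gate.

    gate = sigmoid(w . [semantic; generative] + b)
    fused = gate * generative + (1 - gate) * semantic
    """
    if len(semantic) != len(generative):
        raise ValueError("feature vectors must have the same length")
    concat = list(semantic) + list(generative)
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated length")
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [gate * g + (1.0 - gate) * s for s, g in zip(semantic, generative)]
    return gate, fused


# With zero weights and bias the gate is 0.5, so the streams are averaged.
gate, fused = gated_fuse([1.0, 3.0], [3.0, 1.0], [0.0] * 4, 0.0)
```

Because the gate is computed per token, each token can lean more on the semantic stream or on the video-model features, which is what "token-level adaptive" suggests.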