cs.CL

[1] Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis

May Lynn Reese,Markela Zeneli,Mindy Ng,Jacob Haimes,Andreea Damien,Elizabeth Stade

Main category: cs.CL

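TL;DR: This study addresses the safety risks of large language models (LLMs) in mental health support, particularly for people experiencing psychosis. It proposes clinician-informed safety evaluation criteria, builds a human-consensus dataset, and validates the effectiveness and reliability of LLM-as-a-Judge for automated safety evaluation.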

Details

Motivation: Existing evaluations of LLMs in mental health lack clinical validation and scalability. High-frequency LLM use by people with psychosis may reinforce delusions and hallucinations, so safe, reliable, scalable, clinically oriented evaluation methods are urgently needed.

Method: (1) Develop and validate seven clinician-informed safety criteria; (2) construct a human-consensus annotated dataset; (3) compare automated evaluation by a single LLM (LLM-as-a-Judge) against a majority vote over several LLMs (LLM-as-a-Jury).

Result: LLM-as-a-Judge agrees closely with the human consensus (Cohen's κ from 0.56 to 0.75); Gemini performs best (κ = 0.75), slightly ahead of LLM-as-a-Jury (κ = 0.74).

Conclusion: LLM-as-a-Judge is a clinically credible, scalable paradigm for evaluating LLM safety in mental health, providing methodological support for AI applications that serve high-risk populations.

Abstract: General-purpose Large Language Models (LLMs) are becoming widely adopted by people for mental health support. Yet emerging evidence suggests there are significant risks associated with high-frequency use, particularly for individuals suffering from psychosis, as LLMs may reinforce delusions and hallucinations. Existing evaluations of LLMs in mental health contexts are limited by a lack of clinical validation and scalability of assessment. To address these issues, this research focuses on psychosis as a critical condition for LLM safety evaluation by (1) developing and validating seven clinician-informed safety criteria, (2) constructing a human-consensus dataset, and (3) testing automated assessment using an LLM as an evaluator (LLM-as-a-Judge) or taking the majority vote of several LLM judges (LLM-as-a-Jury). Results indicate that LLM-as-a-Judge aligns closely with the human consensus (Cohen's $\kappa_{\text{human} \times \text{gemini}} = 0.75$, $\kappa_{\text{human} \times \text{qwen}} = 0.68$, $\kappa_{\text{human} \times \text{kimi}} = 0.56$) and that the best judge slightly outperforms LLM-as-a-Jury (Cohen's $\kappa_{\text{human} \times \text{jury}} = 0.74$). Overall, these findings have promising implications for clinically grounded, scalable methods in LLM safety evaluations for mental health contexts.
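The agreement statistic and the jury aggregation described in entry [1] can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the binary safe/unsafe labels, and the toy data are all assumptions made here, and only the definitions of Cohen's kappa and majority voting are standard.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)

def majority_vote(judge_labels):
    """Per-item majority label; an odd number of judges with binary labels avoids ties."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*judge_labels)]

# Made-up labels (1 = response judged safe, 0 = unsafe); not data from the paper.
human   = [1, 1, 0, 0, 1, 0, 1, 0]
judge_a = [1, 1, 0, 0, 1, 0, 0, 0]
judge_b = [1, 0, 0, 0, 1, 1, 1, 0]
judge_c = [1, 1, 0, 1, 1, 0, 1, 0]

jury = majority_vote([judge_a, judge_b, judge_c])
print(cohens_kappa(human, judge_a), cohens_kappa(human, jury))
```

In this toy data each judge errs on a different item, so the vote cancels the errors; the paper reports the opposite ordering for its best judge, which is possible whenever judges' errors are correlated.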

[2] CIPHER: Conformer-based Inference of Phonemes from High-density EEG

Varshith Madishetty

Main category: cs.CL

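TL;DR: This paper presents CIPHER, a dual-pathway model that decodes speech information from scalp EEG using ERP features and broadband DDA coefficients. It performs strongly on binary tasks but only modestly on the 11-class CVC phoneme task, and is positioned as a benchmark and feature-comparison study rather than a practical EEG-to-text system.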

Details

Motivation: Decoding speech information from scalp EEG is difficult because of low signal-to-noise ratio and spatial blurring.

Method: Proposes CIPHER, a Conformer-based model with two pathways: (i) an ERP feature pathway and (ii) a broadband DDA coefficient pathway. It is evaluated on OpenNeuro ds006104 (24 participants, with TMS intervention); the primary task is leave-one-subject-out (LOSO) decoding of 11-class CVC phonemes.

Result: Binary articulatory tasks reach near-ceiling performance but are vulnerable to confounds such as acoustic onset separability and TMS-target blocking. On the 11-class phoneme task the real-word WER is high (ERP: 0.671±0.080, DDA: 0.688±0.096), indicating limited fine-grained phoneme discriminability.

Conclusion: The work serves mainly as a benchmark and feature-comparison study for EEG speech decoding, stresses cautious interpretation of neural representations under controlled confounds, and does not claim to be a practical EEG-to-text system.

Abstract: Decoding speech information from scalp EEG remains difficult due to low SNR and spatial blurring. We present CIPHER (Conformer-based Inference of Phonemes from High-density EEG Representations), a dual-pathway model using (i) ERP features and (ii) broadband DDA coefficients. On OpenNeuro ds006104 (24 participants, two studies with concurrent TMS), binary articulatory tasks reach near-ceiling performance but are highly confound-vulnerable (acoustic onset separability and TMS-target blocking). On the primary 11-class CVC phoneme task under full Study 2 LOSO (16 held-out subjects), performance is substantially lower (real-word WER: ERP 0.671 +/- 0.080, DDA 0.688 +/- 0.096), indicating limited fine-grained discriminability. We therefore position this work as a benchmark and feature-comparison study rather than an EEG-to-text system, and we constrain neural-representation claims to confound-controlled evidence.

[3] SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy

Joy Bhalla,Kristina Gligorić

Main category: cs.CL

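TL;DR: This paper proposes SWAY, an unsupervised computational linguistic metric for measuring sycophancy in large language models, which uses counterfactual prompting to identify how a model's stance shifts under positive versus negative linguistic pressure. It finds that sycophancy increases with epistemic commitment and proposes a counterfactual chain-of-thought (CoT) mitigation that substantially reduces sycophancy without harming responsiveness to genuine evidence.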

Details

Motivation: Prior work has studied sycophancy in LLMs, but rigorous computational linguistic metrics for identifying when and how strongly it occurs are lacking.

Method: Proposes the SWAY metric, which uses counterfactual prompting to quantify how a model's stance changes under positive versus negative linguistic pressure; benchmarks six models and designs a counterfactual chain-of-thought (CoT) mitigation strategy.

Result: Sycophancy increases with epistemic commitment. A conventional instruction to be explicitly anti-sycophantic has limited effect and can backfire, whereas the counterfactual CoT mitigation drives sycophancy to near zero without weakening responsiveness to genuine evidence.

Conclusion: The paper contributes SWAY, a quantifiable metric for assessing sycophancy, and an efficient, robust mitigation informed by it, offering a new path toward more reliable and neutral LLMs.

Abstract: Large language models exhibit sycophancy: the tendency to shift outputs toward user-expressed stances, regardless of correctness or consistency. While prior work has studied this issue and its impacts, rigorous computational linguistic metrics are needed to identify when models are being sycophantic. Here, we introduce SWAY, an unsupervised computational linguistic measure of sycophancy. We develop a counterfactual prompting mechanism to identify how much a model's agreement shifts under positive versus negative linguistic pressure, isolating framing effects from content. Applying this metric to benchmark 6 models, we find that sycophancy increases with epistemic commitment. Leveraging our metric, we introduce a counterfactual mitigation strategy teaching models to consider what the answer would be if opposite assumptions were suggested. While a baseline mitigation that instructs models to be explicitly anti-sycophantic yields moderate reductions and can backfire, our counterfactual CoT mitigation drives sycophancy to near zero across models, commitment levels, and clause types, while not suppressing responsiveness to genuine evidence. Overall, we contribute a metric for benchmarking sycophancy and a mitigation informed by it.

[4] Skeleton-based Coherence Modeling in Narratives

Nishit Asnani,Rohan Badlani

Main category: cs.CL

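TL;DR: This paper examines whether the consistency of sentence skeletons is an effective measure of textual coherence. It proposes a Sentence/Skeleton Similarity Network (SSN) and finds that models based on full sentences outperform skeleton-based models for coherence evaluation.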

Details

Motivation: To investigate whether the consistency of sentence skeletons across consecutive sentences effectively characterizes textual coherence.

Method: Proposes the Sentence/Skeleton Similarity Network (SSN) to model coherence across sentence pairs and compares it with baselines such as cosine similarity and Euclidean distance.

Result: SSN substantially outperforms the baselines, but sentence-level models still outperform skeleton-level models, indicating that modeling coherence over whole sentences is the right direction.

Conclusion: Although skeletons show some promise, current sentence-level coherence modeling is more effective, and skeleton consistency is not a better coherence metric.

Abstract: Modeling coherence in text is a task that has excited NLP researchers for a long time. It has applications in detecting incoherent structures and helping the author fix them. There has been recent work in using neural networks to extract a skeleton from one sentence, and then use that skeleton to generate the next sentence for coherent narrative story generation. In this project, we aim to study if the consistency of skeletons across subsequent sentences is a good metric to characterize the coherence of a given body of text. We propose a new Sentence/Skeleton Similarity Network (SSN) for modeling coherence across pairs of sentences, and show that this network performs much better than baseline similarity techniques like cosine similarity and Euclidean distance. Although skeletons appear to be promising candidates for modeling coherence, our results show that sentence-level models outperform those on skeletons for evaluating textual coherence, thus indicating that the current state-of-the-art coherence modeling techniques are going in the right direction by dealing with sentences rather than their sub-parts.

[5] Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets

Dat Tran,Douwe Kiela

Main category: cs.CL

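TL;DR: Using an information-theoretic argument based on the Data Processing Inequality, this paper argues that, under a fixed reasoning-token budget and perfect context utilization, single-agent systems (SAS) are more information-efficient than multi-agent systems (MAS). Experiments show that, when compute is matched, SAS match or outperform MAS on multi-hop reasoning, and that artifacts in API budget control and in benchmarks can inflate the apparent advantage of MAS.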

Details

Motivation: Reported performance gains of multi-agent systems (MAS) are often confounded by extra test-time compute. When compute is normalized, single-agent systems (SAS) can match or even outperform MAS, but the theoretical basis and evaluation methodology remain unclear.

Method: Presents an information-theoretic analysis based on the Data Processing Inequality to derive information-efficiency bounds for SAS and MAS. Runs a controlled empirical study on three model families (Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5) with strictly matched reasoning-token budgets, comparing several MAS architectures with SAS on multi-hop reasoning tasks, and diagnoses artifacts in API budget control and in benchmarks.

Result: With matched reasoning-token budgets, SAS consistently match or outperform the MAS variants on multi-hop reasoning. API-level budget control in Gemini 2.5 shows significant artifacts, and standard benchmarks contain evaluation artifacts; together these lead to overestimated MAS performance.

Conclusion: Most of the advantage of MAS on multi-hop reasoning stems from uncontrolled extra compute and differences in context utilization rather than inherent architectural benefits. The trade-offs among compute, context, and coordination should be weighed explicitly, and more rigorous evaluation practices established.

Abstract: Recent work reports strong performance from multi-agent LLM systems (MAS), but these gains are often confounded by increased test-time computation. When computation is normalized, single-agent systems (SAS) can match or outperform MAS, yet the theoretical basis and evaluation methodology behind this comparison remain unclear. We present an information-theoretic argument, grounded in the Data Processing Inequality, suggesting that under a fixed reasoning-token budget and with perfect context utilization, single-agent systems are more information-efficient. This perspective further predicts that multi-agent systems become competitive when a single agent's effective context utilization is degraded, or when more compute is expended. We test these predictions in a controlled empirical study across three model families (Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5), comparing SAS with multiple MAS architectures under matched budgets. We find that SAS consistently match or outperform MAS on multi-hop reasoning tasks when reasoning tokens are held constant. Beyond aggregate performance, we conduct a detailed diagnostic analysis of system behavior and evaluation methodology. We identify significant artifacts in API-based budget control (particularly in Gemini 2.5) and in standard benchmarks, both of which can inflate apparent gains from MAS. Overall, our results suggest that, for multi-hop reasoning tasks, many reported advantages of multi-agent systems are better explained by unaccounted computation and context effects rather than inherent architectural benefits, and highlight the importance of understanding and explicitly controlling the trade-offs between compute, context, and coordination in agentic systems.
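The Data Processing Inequality behind entry [5] can be stated compactly. The notation below is an assumption made for illustration (the abstract does not give the paper's own formalization): $Q$ is the quantity to be inferred (the correct answer), $X$ is the full context available to a single agent, and $M$ is a message that one agent passes to another.

```latex
% Illustrative notation, not taken from the paper.
% Assumption: the message M is computed from the context X alone,
% so Q -> X -> M is a Markov chain.
% The Data Processing Inequality then gives
I(Q; M) \le I(Q; X)
```

Read this way, an agent that sees only the message $M$ cannot hold more information about $Q$ than an agent that reads $X$ directly. That matches the abstract's caveat that the single-agent advantage presumes perfect context utilization.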

[6] Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models

Ayush Rajesh Jhaveri,Anthony GX-Chen,Ilia Sucholutsky,Eunsol Choi

Main category: cs.CL

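TL;DR: This paper studies whether large language models (LLMs) exhibit confirmation bias and finds human-like confirmation bias in a rule-discovery task. Intervention strategies designed for humans (such as encouraging consideration of counterexamples) substantially improve rule-discovery rates and generalize to a new task, the Blicket test.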

Details

Motivation: Confirmation bias hinders rational reasoning, but it is unclear whether LLMs share this cognitive bias. The paper tests how confirmation bias manifests in LLM hypothesis testing and whether it can be mitigated.

Method: Uses the rule-discovery paradigm from psychology (a triple-feedback task) on 11 LLMs of different families and scales; tests several human-inspired prompting interventions and distills the intervention-induced behavior into the models, evaluating transfer to a new task (the Blicket test).

Result: LLMs broadly exhibit confirmation bias, leading to slower and less frequent rule discovery (42% at baseline). With interventions such as counterexample prompts, the average discovery rate rises to 56%, and behavior distillation also generalizes to the Blicket test.

Conclusion: Confirmation bias is an inherent limitation of LLMs in hypothesis exploration but can be effectively mitigated by borrowing human cognitive interventions, suggesting that human and machine cognitive biases are comparable and adjustable.

Abstract: Confirmation bias, the tendency to seek evidence that supports rather than challenges one's belief, hinders one's reasoning ability. We examine whether large language models (LLMs) exhibit confirmation bias by adapting the rule-discovery study from human psychology: given a sequence of three numbers (a "triple"), an agent engages in an interactive feedback loop where it (1) proposes a new triple, (2) receives feedback on whether it satisfies the hidden rule, and (3) guesses the rule. Across eleven LLMs of multiple families and scales, we find that LLMs exhibit confirmation bias, often proposing triples to confirm their hypothesis rather than trying to falsify it. This leads to slower and less frequent discovery of the hidden rule. We further explore intervention strategies (e.g., encouraging the agent to consider counterexamples) developed for humans. We find that prompting LLMs with such instructions consistently decreases confirmation bias in LLMs, improving rule discovery rates from 42% to 56% on average. Lastly, we mitigate confirmation bias by distilling intervention-induced behavior into LLMs, showing promising generalization to a new task, the Blicket test. Our work shows that confirmation bias is a limitation of LLMs in hypothesis exploration, and that it can be mitigated via injecting interventions designed for humans.
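The difference between confirming and falsifying proposals in the triple task of entry [6] can be made concrete with a small sketch. The hidden rule, the agent's working hypothesis, and the proposed triples below are illustrative assumptions in the style of the classic psychology task; they are not the paper's actual rules or prompts.

```python
def hidden_rule(triple):
    """Hidden rule (assumed for illustration): numbers are strictly increasing."""
    a, b, c = triple
    return a < b < c

def hypothesis(triple):
    """Agent's narrower working hypothesis (assumed): each number is the previous plus 2."""
    a, b, c = triple
    return b - a == 2 and c - b == 2

def informative_tests(proposals):
    """Proposals whose feedback contradicts what the hypothesis predicts."""
    return [t for t in proposals if hidden_rule(t) != hypothesis(t)]

# Confirming strategy: only propose triples the hypothesis already accepts.
confirming = [(1, 3, 5), (10, 12, 14), (20, 22, 24)]
# Falsifying strategy: propose triples the hypothesis rejects.
falsifying = [(1, 2, 3), (5, 10, 20), (6, 4, 2)]

print(informative_tests(confirming))  # no proposal challenges the hypothesis
print(informative_tests(falsifying))  # some proposals expose the hypothesis as too narrow
```

Every confirming proposal receives positive feedback, so the agent never learns that its hypothesis is too narrow; two of the three falsifying proposals do reveal it.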

[7] Social Meaning in Large Language Models: Structure, Magnitude, and Pragmatic Prompting

Roland Mühlenbernd

Main category: cs.CL

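TL;DR: This paper asks whether large language models (LLMs) quantitatively approximate human social meaning inferences and whether prompting strategies grounded in pragmatic theory improve that approximation. It proposes two calibration metrics (ESR and CDS) and finds that models reliably reproduce the structure of human inferences but vary widely in magnitude calibration; prompting for speaker knowledge and motives most consistently reduces magnitude deviation.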

Details

Motivation: To examine whether LLMs approximate human social meaning inference not only qualitatively but also quantitatively, and whether prompting strategies guided by pragmatic theory improve their calibration.

Method: Proposes two calibration metrics, the Effect Size Ratio (ESR) and the Calibration Deviation Score (CDS), and derives prompting strategies from two pragmatic assumptions: (1) social meaning arises from reasoning over linguistic alternatives; (2) listeners infer the speaker's knowledge state and communicative motives. Three frontier LLMs are evaluated on a case study of numerical (im)precision.

Result: All models reliably reproduce the qualitative structure of human social inferences but differ substantially in magnitude calibration. Prompting for speaker knowledge and motives most consistently reduces magnitude deviation, prompting for alternative-awareness alone tends to amplify exaggeration, and only the combination of both components improves all calibration-sensitive metrics across all models. Fine-grained magnitude calibration remains partially unresolved.

Conclusion: LLMs capture inferential structure but distort inferential strength to varying degrees; pragmatic theory provides a useful but incomplete tool for improving the approximation.

Abstract: Large language models (LLMs) increasingly exhibit human-like patterns of pragmatic and social reasoning. This paper addresses two related questions: do LLMs approximate human social meaning not only qualitatively but also quantitatively, and can prompting strategies informed by pragmatic theory improve this approximation? To address the first, we introduce two calibration-focused metrics distinguishing structural fidelity from magnitude calibration: the Effect Size Ratio (ESR) and the Calibration Deviation Score (CDS). To address the second, we derive prompting conditions from two pragmatic assumptions: that social meaning arises from reasoning over linguistic alternatives, and that listeners infer speaker knowledge states and communicative motives. Applied to a case study on numerical (im)precision across three frontier LLMs, we find that all models reliably reproduce the qualitative structure of human social inferences but differ substantially in magnitude calibration. Prompting models to reason about speaker knowledge and motives most consistently reduces magnitude deviation, while prompting for alternative-awareness tends to amplify exaggeration. Combining both components is the only intervention that improves all calibration-sensitive metrics across all models, though fine-grained magnitude calibration remains only partially resolved. LLMs thus capture inferential structure while variably distorting inferential strength, and pragmatic theory provides a useful but incomplete handle for improving that approximation.

[8] PolyJarvis: LLM Agent for Autonomous Polymer MD Simulations

Alexander Zhao,Achuth Chandrasekhar,Amir Barati Farimani

Main category: cs.CL

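TL;DR: PolyJarvis is an agent that couples a large language model (LLM) with the RadonPy simulation platform. Starting from natural language or SMILES input, it autonomously runs the full polymer molecular dynamics (MD) simulation and property-prediction workflow, and its prediction accuracy is validated on several polymers.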

Details

Motivation: All-atom MD simulation requires highly specialized expertise (force field selection, system construction, equilibration, property extraction), which limits its broad use in polymer research. The paper aims to lower the barrier and improve efficiency through LLM-driven automation.

Method: Builds the PolyJarvis agent, which integrates an LLM with the RadonPy platform through Model Context Protocol (MCP) servers. Starting from a polymer name or SMILES string, it automatically performs monomer construction, charge assignment, polymerization, force field parameterization, GPU-accelerated equilibration, and property calculation.

Result: Validated on four polymers (PE, aPS, PMMA, PEG): density errors are 0.1 to 4.8% and bulk modulus errors 17 to 24%. The predicted Tg of PMMA (395 K) deviates from experiment by +10 to +18 K, while the other three polymers are overestimated by +38 to +47 K. Of the 8 property-polymer combinations with directly comparable experimental references, 5 meet strict acceptance criteria. The Tg deviations are attributed mainly to the intrinsic cooling-rate bias of MD rather than agent error.

Conclusion: LLM-driven agents can autonomously and reliably execute full polymer MD workflows with results consistent with expert-run simulations, offering a feasible paradigm for automating materials simulation.

Abstract: All-atom molecular dynamics (MD) simulations can predict polymer properties from molecular structure, yet their execution requires specialized expertise in force field selection, system construction, equilibration, and property extraction. We present PolyJarvis, an agent that couples a large language model (LLM) with the RadonPy simulation platform through Model Context Protocol (MCP) servers, enabling end-to-end polymer property prediction from natural language input. Given a polymer name or SMILES string, PolyJarvis autonomously executes monomer construction, charge assignment, polymerization, force field parameterization, GPU-accelerated equilibration, and property calculation. Validation is conducted on polyethylene (PE), atactic polystyrene (aPS), poly(methyl methacrylate) (PMMA), and poly(ethylene glycol) (PEG). Results show density predictions within 0.1-4.8% and bulk moduli within 17-24% of reference values for aPS and PMMA. The PMMA glass transition temperature (Tg) (395 K) matches experiment within +10 to +18 K, while the remaining three polymers overestimate Tg by +38 to +47 K (vs upper experimental bounds). Of the 8 property-polymer combinations with directly comparable experimental references, 5 meet strict acceptance criteria. For cases lacking suitable amorphous-phase experimental references, agreement with prior MD literature is reported separately. The remaining Tg failures are attributable primarily to the intrinsic MD cooling-rate bias rather than agent error. This work demonstrates that LLM-driven agents can autonomously execute polymer MD workflows producing results consistent with expert-run simulations.

[9] Principled and Scalable Diversity-Aware Retrieval via Cardinality-Constrained Binary Quadratic Programming

Qiheng Lu,Nicholas D. Sidiropoulos

Main category: cs.CL

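TL;DR: This paper proposes a diversity-aware retrieval method based on cardinality-constrained binary quadratic programming (CCBQP), which balances relevance and semantic diversity through an interpretable trade-off parameter. It designs a Frank-Wolfe algorithm with convergence guarantees that outperforms baselines on the Pareto frontier while being more efficient.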

Details

Motivation: Existing diversity-aware retrieval methods lack theoretical guarantees and scale poorly as the number of retrieved passages k grows.

Method: Formulates diversity retrieval as a cardinality-constrained binary quadratic program (CCBQP) with an interpretable trade-off parameter balancing relevance and semantic diversity; uses a non-convex tight continuous relaxation and a Frank-Wolfe based algorithm, supported by landscape analysis and convergence proofs.

Result: Consistently outperforms baselines on the relevance-diversity Pareto frontier while achieving significant speedup.

Conclusion: The CCBQP framework and its optimization algorithm provide a theoretically grounded, efficient, and scalable solution for diversity-aware retrieval.

Abstract: Diversity-aware retrieval is essential for Retrieval-Augmented Generation (RAG), yet existing methods lack theoretical guarantees and face scalability issues as the number of retrieved passages $k$ increases. We propose a principled formulation of diversity retrieval as a cardinality-constrained binary quadratic programming (CCBQP) problem, which explicitly balances relevance and semantic diversity through an interpretable trade-off parameter. Inspired by recent advances in combinatorial optimization, we develop a non-convex tight continuous relaxation and a Frank-Wolfe based algorithm with landscape analysis and convergence guarantees. Extensive experiments demonstrate that our method consistently dominates baselines on the relevance-diversity Pareto frontier, while achieving significant speedup.
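To make the shape of a cardinality-constrained binary quadratic objective concrete for entry [9], here is a toy sketch. The specific objective (summed relevance minus `lam` times summed pairwise similarity among the chosen passages), the parameter name `lam`, and the data are assumptions made for illustration; the abstract does not state the paper's exact formulation. The solver is plain brute force over all size-k subsets, swapped in for readability, and is not the paper's Frank-Wolfe method.

```python
from itertools import combinations

def objective(selected, relevance, similarity, lam):
    """Relevance of the chosen passages minus lam times their pairwise redundancy."""
    rel = sum(relevance[i] for i in selected)
    redundancy = sum(similarity[i][j] for i, j in combinations(selected, 2))
    return rel - lam * redundancy

def best_subset(relevance, similarity, k, lam):
    """Exhaustive search over subsets of exactly k passages (the cardinality constraint)."""
    candidates = combinations(range(len(relevance)), k)
    return max(candidates, key=lambda s: objective(s, relevance, similarity, lam))

# Toy data: passages 0 and 1 are both relevant but nearly duplicates of each other.
relevance = [0.9, 0.85, 0.5, 0.4]
similarity = [
    [1.0, 0.95, 0.1, 0.2],
    [0.95, 1.0, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.3],
    [0.2, 0.2, 0.3, 1.0],
]

print(best_subset(relevance, similarity, k=2, lam=0.0))  # relevance only
print(best_subset(relevance, similarity, k=2, lam=1.0))  # relevance traded against diversity
```

With `lam=0.0` the two near-duplicate passages win on relevance alone; with `lam=1.0` the second slot goes to a less relevant but dissimilar passage. Exhaustive search grows combinatorially with the pool size and k, which is the scalability problem the abstract's continuous relaxation is meant to address.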

[10] Pragmatics Meets Culture: Culturally-adapted Artwork Description Generation and Evaluation

Lingjun Zhao,Dayeon Ki,Marine Carpuat,Hal Daumé

Main category: cs.CL

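TL;DR: This paper introduces the task of culturally-adapted artwork description generation, evaluates language models' cultural competence with a framework based on culturally grounded question answering, and uses a pragmatic speaker model to improve listener comprehension.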

Details

Motivation: Language models show cultural bias in decision-making tasks, but their cultural familiarity in open-ended text generation is unclear. The paper examines how well models adapt artwork descriptions for audiences of different cultural backgrounds.

Method: Introduces the culturally-adapted artwork description generation task, builds an evaluation framework based on culturally grounded question answering, and applies a pragmatic speaker model to improve generation.

Result: Base models are only marginally adequate for the task, but a pragmatic speaker model improves simulated listener comprehension by up to 8.2%; a human study further confirms that it is rated 8.0% more helpful for comprehension.

Conclusion: Pragmatic modeling can substantially improve language models' cultural competence and has practical value for cross-cultural generation tasks.

Abstract: Language models are known to exhibit various forms of cultural bias in decision-making tasks, yet much less is known about their degree of cultural familiarity in open-ended text generation tasks. In this paper, we introduce the task of culturally-adapted art description generation, where models describe artworks for audiences from different cultural groups who vary in their familiarity with the cultural symbols and narratives embedded in the artwork. To evaluate cultural competence in this pragmatic generation task, we propose a framework based on culturally grounded question answering. We find that base models are only marginally adequate for this task, but, through a pragmatic speaker model, we can improve simulated listener comprehension by up to 8.2%. A human study further confirms that the model with higher pragmatic competence is rated as more helpful for comprehension by 8.0%.

[11] Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models

Liran Ringel,Ameen Ali,Yaniv Romano

Main category: cs.CL

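TL;DR: This paper proposes DEMASK, which uses a lightweight dependency predictor to guide parallel decoding in discrete diffusion language models, mitigating the quality loss caused by approximating the joint conditional with a fully factorized distribution. It achieves a 1.7-2.2x speedup while matching or improving accuracy.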

Details

Motivation: In parallel decoding, discrete diffusion language models (dLLMs) approximate the joint conditional with a fully factorized product of per-token marginals, which causes distributional mismatch among strongly dependent tokens and degrades output quality.

Method: Proposes DEMASK: a lightweight dependency predictor attached to the final hidden states of the dLLM estimates, in a single forward pass, pairwise conditional influences between masked positions; a greedy algorithm then selects positions with bounded cumulative dependency for simultaneous unmasking.

Result: Achieves a 1.7-2.2x speedup on Dream-7B with accuracy matching or exceeding confidence-based and KL-based baselines; under a sub-additivity assumption, the method is proven to bound the total variation distance.

Conclusion: DEMASK is an efficient, theoretically grounded parallel decoding strategy that mitigates the distributional mismatch caused by independence assumptions in dLLMs, balancing speed and output quality.

Abstract: Discrete diffusion language models (dLLMs) accelerate text generation by unmasking multiple tokens in parallel. However, parallel decoding introduces a distributional mismatch: it approximates the joint conditional using a fully factorized product of per-token marginals, which degrades output quality when selected tokens are strongly dependent. We propose DEMASK (DEpendency-guided unMASKing), a lightweight dependency predictor that attaches to the final hidden states of a dLLM. In a single forward pass, it estimates pairwise conditional influences between masked positions. Using these predictions, a greedy selection algorithm identifies positions with bounded cumulative dependency for simultaneous unmasking. Under a sub-additivity assumption, we prove this bounds the total variation distance between our parallel sampling and the model's joint. Empirically, DEMASK achieves 1.7-2.2$\times$ speedup on Dream-7B while matching or improving accuracy compared to confidence-based and KL-based baselines.
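The greedy selection step described in entry [11] might look roughly like the sketch below. Everything specific here is an assumption: the abstract does not say how positions are ordered, how the budget is defined, or whether influences are symmetric. The sketch orders positions by a hypothetical confidence score, treats dependency as a symmetric pairwise score, and uses made-up values in place of the learned predictor's output.

```python
def select_positions(confidence, dependency, budget):
    """Greedily pick masked positions while the summed pairwise dependency stays within budget."""
    # Visit positions from most to least confident (ordering is an assumption).
    order = sorted(confidence, key=confidence.get, reverse=True)
    selected, total = [], 0.0
    for pos in order:
        # Dependency this position would add with respect to everything already chosen.
        added = sum(dependency.get(frozenset((pos, q)), 0.0) for q in selected)
        if total + added <= budget:
            selected.append(pos)
            total += added
    return selected

# Hypothetical predictor outputs for four masked positions.
confidence = {3: 0.9, 4: 0.8, 7: 0.7, 9: 0.6}
dependency = {
    frozenset((3, 4)): 0.5,   # adjacent tokens, strongly dependent
    frozenset((3, 7)): 0.05,
    frozenset((4, 7)): 0.4,
    frozenset((3, 9)): 0.02,
    frozenset((7, 9)): 0.1,
}

print(select_positions(confidence, dependency, budget=0.2))
```

Position 4 is skipped because unmasking it together with position 3 would exceed the budget on its own; it would be left masked for a later decoding step.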

[12] An Empirical Study of Many-Shot In-Context Learning for Machine Translation of Low-Resource Languages

Yinhan Lu,Gaganpreet Jhajj,Chen Zhang,Anietie Andy,David Ifeoluwa Adelani

Main category: cs.CL

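TL;DR: This paper studies many-shot in-context learning (ICL) for machine translation of low-resource languages and finds that BM25-based example retrieval substantially improves data efficiency, greatly reducing the number of examples needed.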

Details

Motivation: To address poor machine translation performance for low-resource languages caused by limited pre-training data, while reducing the high inference cost of many-shot ICL.

Method: An empirical study on ten truly low-resource languages recently added to FLORES+, analyzing how retrieving more informative examples, using out-of-domain data, and ordering examples by length affect many-shot ICL, with BM25 used for example retrieval.

Result: Many-shot ICL improves as the number of examples grows. With BM25 retrieval, 50 retrieved examples roughly match 250 many-shot examples, and 250 retrieved examples perform similarly to 1,000 many-shot examples.

Conclusion: BM25 retrieval substantially improves the data efficiency of many-shot ICL for low-resource machine translation, offering a more feasible option for resource-constrained settings.

Abstract: In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks from a few examples, making it promising for languages underrepresented in pre-training. Recent work on many-shot ICL suggests that modern LLMs can further benefit from larger ICL examples enabled by their long context windows. However, such gains depend on careful example selection, and the inference cost can be prohibitive for low-resource language communities. In this paper, we present an empirical study of many-shot ICL for machine translation from English into ten truly low-resource languages recently added to FLORES+. We analyze the effects of retrieving more informative examples, using out-of-domain data, and ordering examples by length. Our findings show that many-shot ICL becomes more effective as the number of examples increases. More importantly, we show that BM25-based retrieval substantially improves data efficiency: 50 retrieved examples roughly match 250 many-shot examples, while 250 retrieved examples perform similarly to 1,000 many-shot examples.
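BM25-based example retrieval of the kind entry [12] relies on can be sketched in a few lines. This is a generic Okapi BM25 scorer with a commonly used non-negative IDF variant and whitespace tokenization; the parameter defaults, the toy example pool, and the placeholder target strings are assumptions for illustration and do not reflect the paper's actual setup.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for the query (whitespace tokenization)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Non-negative IDF variant: log(1 + (N - n + 0.5) / (n + 0.5)).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

def retrieve(query, pool, k):
    """Return the k (source, target) pairs whose source side best matches the query."""
    scores = bm25_scores(query, [src for src, _ in pool])
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in ranked[:k]]

# Hypothetical parallel pool; target sides are placeholders, not real translations.
pool = [
    ("the river is wide", "<target sentence 1>"),
    ("she sells fish at the market", "<target sentence 2>"),
    ("the market opens at dawn", "<target sentence 3>"),
]
top = retrieve("where is the fish market", pool, k=2)
print(top)
```

The retrieved pairs would then be placed in the prompt as in-context examples ahead of the sentence to translate.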

[13] Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge

Yiyang Shen,Lifu Tu,Weiran Wang

Main category: cs.CL

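TL;DR: This paper proposes a label-free reinforcement learning framework that uses an LLM acting as a judge to score model outputs, enabling knowledge distillation without ground-truth labels.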

Details

Motivation: Existing RL approaches rely on verifiable rewards (that is, ground-truth labels), which limits their use on large amounts of unlabeled data.

Method: Designs an LLM-based judge that produces reward signals efficiently through a single-token output, and combines it with verifiable rewards for RL fine-tuning.

Result: Substantial performance gains on math reasoning benchmarks, supporting the effectiveness of the LLM judge as a training signal.

Conclusion: LLMs can serve as efficient, scalable reward providers, advancing RL for language models in unsupervised or weakly supervised settings.

Abstract: Reinforcement Learning (RL) has been shown to substantially improve the reasoning capability of small and large language models (LLMs), but existing approaches typically rely on verifiable rewards, hence ground truth labels. We propose an RL framework that uses rewards from an LLM that acts as a judge evaluating model outputs over large amounts of unlabeled data, enabling label-free knowledge distillation and removing the need for ground-truth supervision. Notably, the judge operates with a single-token output, making reward computation efficient. When combined with verifiable rewards, our approach yields substantial performance gains across math reasoning benchmarks. These results suggest that LLM-based evaluators can produce effective training signals for RL fine-tuning.

[14] Train Yourself as an LLM: Exploring Effects of AI Literacy on Persuasion via Role-playing LLM Training

Qihui Fan,Min Ge,Chenyan Jia,Weiyan Shi

Main category: cs.CL

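TL;DR: This paper presents LLMimic, a role-play-based, interactive, gamified AI literacy tutorial in which users simulate the LLM training pipeline. Experiments show that it significantly improves AI literacy, reduces susceptibility to AI persuasion, and enhances truthfulness and social responsibility levels.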

Details

Motivation: As LLMs grow more persuasive, the public becomes more susceptible to their influence. Existing mitigations (such as detectors and disclaimers) treat people as passive recipients, so more proactive, human-centered interventions are needed.

Method: Designs and implements LLMimic, an interactive gamified tutorial in which participants play the role of an LLM and go through three stages: pretraining, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF). A 2×3 between-subjects experiment (N=274) compares watching an AI history video (control) with using LLMimic (treatment) across three realistic AI persuasion scenarios.

Result: LLMimic significantly improves AI literacy (p<.001), reduces AI persuasion success across scenarios (p<.05), and significantly raises truthfulness and social responsibility levels in the hotel recommendation scenario (p<.01).

Conclusion: LLMimic offers a scalable, human-centered route to improving AI literacy, helping the public respond more carefully and rationally to persuasive AI content.

Abstract: As large language models (LLMs) become increasingly persuasive, there is concern that people's opinions and decisions may be influenced across various contexts at scale. Prior mitigation (e.g., AI detectors and disclaimers) largely treats people as passive recipients of AI-generated information. To provide a more proactive intervention against persuasive AI, we introduce LLMimic, a role-play-based, interactive, gamified AI literacy tutorial, where participants assume the role of an LLM and progress through three key stages of the training pipeline (pretraining, SFT, and RLHF). We conducted a $2 \times 3$ between-subjects study ($N = 274$) where participants either (1) watched an AI history video (control) or (2) interacted with LLMimic (treatment), and then engaged in one of three realistic AI persuasion scenarios: (a) charity donation persuasion, (b) malicious money solicitation, or (c) hotel recommendation. Our results show that LLMimic significantly improved participants' AI literacy ($p < .001$), reduced persuasion success across scenarios ($p < .05$), and enhanced truthfulness and social responsibility levels ($p < .01$) in the hotel scenario. These findings suggest that LLMimic offers a scalable, human-centered approach to improving AI literacy and supporting more informed interactions with persuasive AI.

[15] Overcoming the "Impracticality" of RAG: Proposing a Real-World Benchmark and Multi-Dimensional Diagnostic Framework

Kenichirou Narita,Siqi Peng,Taku Fukui,Moyuru Yamada,Satoshi Munakata,Satoru Takahashi

Main category: cs.CL

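TL;DR: This paper proposes a multi-dimensional diagnostic framework for evaluating retrieval-augmented generation (RAG) systems in enterprise environments, addressing gaps in existing benchmarks along dimensions such as reasoning complexity, retrieval difficulty, document diversity, and explainability.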

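Details

Motivation: Existing academic benchmarks cannot systematically diagnose the multi-dimensional challenges facing enterprise RAG systems (reasoning complexity, retrieval difficulty, diverse document structure, and operational explainability requirements), so high-scoring models prove unreliable in real deployment.

Method: Defines a four-axis difficulty taxonomy and integrates it into an enterprise RAG benchmark to form a multi-dimensional diagnostic framework.

Result: Proposes the first multi-dimensional benchmark framework for diagnosing weaknesses in enterprise RAG systems, supporting fine-grained performance attribution.

Conclusion: The framework helps identify the weak points of RAG systems in real-world settings and improves deployment reliability and explainability.

Abstract: Performance evaluation of Retrieval-Augmented Generation (RAG) systems within enterprise environments is governed by multi-dimensional and composite factors extending far beyond simple final accuracy checks. These factors include reasoning complexity, retrieval difficulty, the diverse structure of documents, and stringent requirements for operational explainability. Existing academic benchmarks fail to systematically diagnose these interlocking challenges, resulting in a critical gap where models achieving high performance scores fail to meet the expected reliability in practical deployment. To bridge this discrepancy, this research proposes a multi-dimensional diagnostic framework by defining a four-axis difficulty taxonomy and integrating it into an enterprise RAG benchmark to diagnose potential system weaknesses.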

[16] Speaking of Language: Reflections on Metalanguage Research in NLP

Nathan Schneider,Antonios Anastasopoulos

Main category: cs.CL

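TL;DR: This paper focuses on metalanguage: it defines metalanguage and discusses its relationship to natural language processing (NLP) and large language models (LLMs), describes metalanguage-centered research in the authors' two labs, and surveys metalanguage and metalinguistic tasks along four dimensions, proposing several understudied future directions.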

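Details

Motivation: The role of metalanguage in NLP and LLMs has not been systematically surveyed or given due attention; its definition, connections, and research space need clarifying.

Method: A systematic review and outlook on metalanguage and its tasks through conceptual definition, analysis of links to the field, a summary of lab practice, and a multi-dimensional framework.

Result: Gives a clear definition of metalanguage and its links to NLP and LLMs; summarizes related work from the two labs; builds a four-dimensional framework for classifying metalinguistic tasks.

Conclusion: Metalanguage is a key lens for understanding, evaluating, and improving language model capabilities and should become an important direction for future NLP and LLM research.

Abstract: This work aims to shine a spotlight on the topic of metalanguage. We first define metalanguage, link it to NLP and LLMs, and then discuss our two labs' metalanguage-centered efforts. Finally, we discuss four dimensions of metalanguage and metalinguistic tasks, offering a list of understudied future research directions.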

[17] Revealing the Learning Dynamics of Long-Context Continual Pre-training

Yupu Liang,Shuang Chen,Guanwei Zhang,Shaolei Wang,Suncong Zheng

Main category: cs.CL

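TL;DR: This paper presents the first systematic study of learning dynamics in long-context continual pre-training (LCCP) for the industrial-grade model Hunyuan-A13B (80B parameters). It proposes an analysis framework spanning behavioral, probabilistic, and mechanistic levels and finds that more than 150B tokens are needed for adequate training, that conventional NIAH evaluation shows "deceptive saturation", and that perplexity (PPL) and retrieval-head attention more reliably reflect true convergence and downstream performance.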

Details

Motivation: Existing LCCP studies are limited to small models and little data (tens of billions of tokens) and transfer poorly to industrial-grade models. Mainstream evaluations such as Needle-in-a-Haystack (NIAH) are prone to "deceptive saturation" and do not reflect the model's true convergence state.

Method: Tracks a 200B-token LCCP training trajectory on Hunyuan-A13B and builds a three-level analysis framework: behavioral (SFT probing), probabilistic (perplexity), and mechanistic (attention patterns), in particular using retrieval-head attention as a low-cost training monitoring signal.

Result: (1) The industrial-grade model needs more than 150B tokens to reach true saturation; (2) NIAH scores saturate early and are unreliable, while PPL keeps decreasing and correlates strongly with downstream performance; (3) the evolution of retrieval-head attention predicts SFT results with high correlation and is a stable, efficient training monitor.

Conclusion: The paper establishes a monitoring framework, evaluation system, and mechanistic interpretation for LCCP of industrial-grade LLMs, showing the necessity of large-scale data, the advantage of PPL over NIAH for evaluation, and a new path to interpretable mechanistic monitoring.

Abstract: Existing studies on Long-Context Continual Pre-training (LCCP) mainly focus on small-scale models and limited data regimes (tens of billions of tokens). We argue that directly migrating these small-scale settings to industrial-grade models risks insufficient adaptation and premature training termination. Furthermore, current evaluation methods rely heavily on downstream benchmarks (e.g., Needle-in-a-Haystack), which often fail to reflect the intrinsic convergence state and can lead to "deceptive saturation". In this paper, we present the first systematic investigation of LCCP learning dynamics using the industrial-grade Hunyuan-A13B (80B total parameters), tracking its evolution across a 200B-token training trajectory. Specifically, we propose a hierarchical framework to analyze LCCP dynamics across behavioral (supervised fine-tuning probing), probabilistic (perplexity), and mechanistic (attention patterns) levels. Our findings reveal: (1) Necessity of Massive Data Scaling: Training regimes of dozens of billions of tokens are insufficient for industrial-grade LLMs' LCCP (e.g., Hunyuan-A13B reaches saturation after training over 150B tokens). (2) Deceptive Saturation vs. Intrinsic Saturation: Traditional NIAH scores report "fake saturation" early, while our PPL-based analysis reveals continuous intrinsic improvements and correlates more strongly with downstream performance. (3) Mechanistic Monitoring for Training Stability: Retrieval heads act as efficient, low-resource training monitors, as their evolving attention scores reliably track LCCP progress and exhibit high correlation with SFT results. This work provides a comprehensive monitoring framework, evaluation system, and mechanistic interpretation for the LCCP of industrial-grade LLMs.
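The probabilistic-level signal in entry [17] is perplexity, which can be computed from per-token log-probabilities as below. The helper `relative_drop` and the toy numbers are assumptions added for illustration of how a checkpoint-to-checkpoint trend might be tracked; they are not the paper's monitoring procedure.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), with natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def relative_drop(ppl_history):
    """Relative PPL change between the last two checkpoints (positive means still improving)."""
    prev, last = ppl_history[-2], ppl_history[-1]
    return (prev - last) / prev

# A model that assigns probability 0.25 to every token has perplexity 4.
uniform_over_four = [math.log(0.25)] * 8
ppl = perplexity(uniform_over_four)
print(ppl, relative_drop([10.0, 8.0]))
```

A benchmark score that has already hit its ceiling gives no such trend, which is the abstract's point about PPL continuing to move after NIAH scores flatten.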

[18] SocioEval: A Template-Based Framework for Evaluating Socioeconomic Status Bias in Foundation Models

Divyanshu Kumar,Ishita Gupta,Nitin Aravind Birur,Tanay Baswa,Sahil Agarwal,Prashanth Harshangi

Main category: cs.CL

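TL;DR: This paper proposes SocioEval, a framework for systematically evaluating socioeconomic status bias in large language models, covering 8 themes, 18 topics, and 240 prompts. Evaluating 13 frontier LLMs reveals widely varying bias rates (0.42%-33.75%) and uneven behavior across themes, with lifestyle judgments showing 10 times the bias of education-related decisions; existing safeguards prevent explicit discrimination but are vulnerable to domain-specific stereotypes.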

Details

Motivation: Although many bias evaluation frameworks exist for attributes such as race and gender, socioeconomic status bias remains badly neglected despite its broad real-world impact, so systematic evaluation tools are urgently needed.

Method: Builds SocioEval, a template-based hierarchical evaluation framework with 8 themes and 18 topics that generates 240 prompts across 6 class-pair combinations; the 3,120 responses from 13 frontier LLMs are assessed with a rigorous three-stage annotation protocol.

Result: Bias rates vary widely across models (0.42%-33.75%); lifestyle judgments show 10 times the bias rate of education-related decisions; deployment safeguards suppress explicit discrimination but are brittle to domain-specific stereotypes.

Conclusion: SocioEval provides a scalable, extensible foundation for evaluating and auditing class-based bias in language models, supporting responsible AI development.

Abstract: As Large Language Models (LLMs) increasingly power decision-making systems across critical domains, understanding and mitigating their biases becomes essential for responsible AI deployment. Although bias assessment frameworks have proliferated for attributes such as race and gender, socioeconomic status bias remains significantly underexplored despite its widespread implications in the real world. We introduce SocioEval, a template-based framework for systematically evaluating socioeconomic bias in foundation models through decision-making tasks. Our hierarchical framework encompasses 8 themes and 18 topics, generating 240 prompts across 6 class-pair combinations. We evaluated 13 frontier LLMs on 3,120 responses using a rigorous three-stage annotation protocol, revealing substantial variation in bias rates (0.42%-33.75%). Our findings demonstrate that bias manifests differently across themes (lifestyle judgments show 10$\times$ higher bias than education-related decisions) and that deployment safeguards effectively prevent explicit discrimination but show brittleness to domain-specific stereotypes. SocioEval provides a scalable, extensible foundation for auditing class-based bias in language models.

[19] Too Polite to Disagree: Understanding Sycophancy Propagation in Multi-Agent Systems

Vira Kasprova,Amruta Parulekar,Abdulrahman AlRabah,Krishna Agaram,Ritwik Garg,Sagar Jha,Nimet Beyza Bozdag,Dilek Hakkani-Tur

Main category: cs.CL

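TL;DR: This paper studies sycophancy of large language models (LLMs) in multi-agent systems and proposes mitigating it by giving agents prior rankings of their peers' sycophancy; experiments show the approach substantially improves discussion accuracy.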

Details Motivation: Prior work has focused mainly on sycophancy in single-agent settings, while its effect in collaborative multi-agent systems remains underexplored; this paper asks whether agents' awareness of their peers' sycophancy levels influences group discussion outcomes. Method: Runs controlled experiments on six open-source LLMs, giving each agent peer sycophancy rankings computed with static (pre-discussion) and dynamic (online) strategies as prior information. Result: Providing sycophancy priors reduces the influence of sycophancy-prone agents on the discussion, mitigates error cascades, and improves final discussion accuracy by an absolute 10.5%. Conclusion: Introducing sycophancy priors into multi-agent systems is a lightweight, effective mechanism that reduces sycophancy in discussion and improves downstream task accuracy. Abstract: Large language models (LLMs) often exhibit sycophancy: agreement with user stance even when it conflicts with the model's opinion. While prior work has mostly studied this in single-agent settings, it remains underexplored in collaborative multi-agent systems. We ask whether awareness of other agents' sycophancy levels influences discussion outcomes. To investigate this, we run controlled experiments with six open-source LLMs, providing agents with peer sycophancy rankings that estimate each peer's tendency toward sycophancy. These rankings are based on scores calculated using various static (pre-discussion) and dynamic (online) strategies. We find that providing sycophancy priors reduces the influence of sycophancy-prone peers, mitigates error-cascades, and improves final discussion accuracy by an absolute 10.5%. Thus, this is a lightweight, effective way to reduce discussion sycophancy and improve downstream accuracy.

[20] Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments

Divyanshu Kumar,Ishita Gupta,Nitin Aravind Birur,Tanay Baswa,Sahil Agarwal,Prashanth Harshangi

Main category: cs.CL

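TL;DR: This paper proposes a multi-task, hierarchical bias evaluation framework covering 9 bias types (including neglected axes such as caste, language, and geography). Testing 7 mainstream LLMs with about 45K prompts across 7 tasks, it finds that bias is task-dependent, that safety alignment is asymmetric, and that current alignment methods mask rather than mitigate representational harm.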

Details Motivation: Existing single-task benchmarks cannot capture the full spectrum of a language model's biases and in particular overlook important but under-studied axes such as caste, language, and geography, leading to systematic mischaracterization of model bias. Method: Builds a hierarchical taxonomy covering 9 bias types (including caste, linguistic, and geographic bias) and designs 7 tasks ranging from explicit decision-making to implicit association; runs a large-scale audit of 7 commercial and open-weight LLMs with about 45K prompts. Result: Three systematic patterns emerge: (1) bias is highly task-dependent: the same model counters stereotypes on explicit tasks but reinforces them on implicit ones (Stereotype Score divergence up to 0.43); (2) safety alignment is asymmetric: models refuse to assign negative traits to marginalized groups but freely associate positive traits with privileged groups; (3) neglected bias axes (such as caste) show the strongest stereotyping, indicating that alignment effort tracks benchmark coverage rather than actual harm severity. Conclusion: Single-benchmark bias audits systematically mischaracterize LLM bias; current alignment strategies mainly mask representational harm rather than truly mitigating it; multi-dimensional, multi-task, harm-oriented evaluation paradigms are needed. Abstract: How biased is a language model? The answer depends on how you ask. A model that refuses to choose between castes for a leadership role will, in a fill-in-the-blank task, reliably associate upper castes with purity and lower castes with lack of hygiene. Single-task benchmarks miss this because they capture only one slice of a model's bias profile. We introduce a hierarchical taxonomy covering 9 bias types, including under-studied axes like caste, linguistic, and geographic bias, operationalized through 7 evaluation tasks that span explicit decision-making to implicit association. Auditing 7 commercial and open-weight LLMs with ~45K prompts, we find three systematic patterns. First, bias is task-dependent: models counter stereotypes on explicit probes but reproduce them on implicit ones, with Stereotype Score divergences up to 0.43 between task types for the same model and identity groups. Second, safety alignment is asymmetric: models refuse to assign negative traits to marginalized groups, but freely associate positive traits with privileged ones. Third, under-studied bias axes show the strongest stereotyping across all models, suggesting alignment effort tracks benchmark coverage rather than harm severity. These results demonstrate that single-benchmark audits systematically mischaracterize LLM bias and that current alignment practices mask representational harm rather than mitigating it.

[21] Trivial Vocabulary Bans Improve LLM Reasoning More Than Deep Linguistic Constraints

Rodney Jehu-Appiah

Main category: cs.CL

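TL;DR: Through a tightly controlled replication, this study rejects the earlier hypothesis that E-Prime (English without the verb "to be") improves language-model reasoning via cognitive restructuring driven by vocabulary-cognition mappings. The results show that any constraint forcing a model off its default generation path (including a ban on neutral filler words) acts as an output regularizer that improves reasoning, and that shallower, less disruptive constraints work better.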

Details Motivation: To test whether the mechanism proposed in earlier work, that E-Prime triggers cognitive restructuring through specific vocabulary-cognition mappings and thereby improves reasoning, actually holds, and to introduce active controls that separate real effects from confounds. Method: A controlled experiment with five conditions (unconstrained control, E-Prime, No-Have, metacognitive prompt, neutral filler-word ban) across six language models and seven reasoning tasks, collecting 15,600 trials (11,919 after compliance filtering) to test each prediction of the cognitive restructuring hypothesis. Result: All four treatments outperformed the control (83.0%), including the two active controls predicted to have no effect; the neutral filler-word ban yielded the largest gain (+6.7 pp) and E-Prime the smallest (+3.7 pp); the cross-model correlation signature did not replicate (mean r=0.005); treatment effects ranked in perfect inverse order of theoretical depth. Conclusion: The cognitive restructuring hypothesis is rejected in favor of a simpler "output regularization" account: any constraint that slightly perturbs a model's default generation path can improve reasoning by imposing monitoring load and suppressing fluent but shallow response patterns; the shallowest constraints work best because they cause the least conceptual disruption; the study is presented as a case of discovery through disconfirmation. Abstract: A previous study reported that E-Prime (English without the verb "to be") selectively altered reasoning in language models, with cross-model correlations suggesting a structural signature tied to which vocabulary was removed. I designed a replication with active controls to test the proposed mechanism: cognitive restructuring through specific vocabulary-cognition mappings. The experiment tested five conditions (unconstrained control, E-Prime, No-Have, elaborated metacognitive prompt, neutral filler-word ban) across six models and seven reasoning tasks (N=15,600 trials, 11,919 after compliance filtering). Every prediction from the cognitive restructuring hypothesis was disconfirmed. All four treatments outperformed the control (83.0%), including both active controls predicted to show null effects. The neutral filler-word ban, banning words like "very" and "just" with no role in logical inference, produced the largest improvement (+6.7 pp), while E-Prime produced the smallest (+3.7 pp). The four conditions ranked in perfect inverse order of theoretical depth. The cross-model correlation signature did not replicate (mean r=0.005). These results are consistent with a simpler mechanism: any constraint that forces a model off its default generation path acts as an output regularizer, improving reasoning by disrupting fluent but shallow response patterns. The shallowest constraints work best because they impose monitoring load with minimal conceptual disruption. I present these findings as a case study in discovery through disconfirmation.
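
The abstract reports compliance filtering (15,600 trials reduced to 11,919, roughly 76% retained) without describing the check itself. One plausible form is a whole-word scan of each response against the banned vocabulary, sketched below in Python. The word lists and function names are assumptions for illustration: "very" and "just" are the only filler words the abstract names, and the E-Prime list covers inflected forms of "to be" but deliberately ignores contractions such as "it's".

```python
import re

# "very" and "just" are named in the abstract; treating the ban as exactly
# these two words is an assumption made for this sketch.
FILLER_BAN = {"very", "just"}
# Inflected forms of "to be"; contractions such as "it's" are not handled.
E_PRIME_BAN = {"be", "am", "is", "are", "was", "were", "been", "being"}

def violations(text, banned):
    """Return the banned words that occur as whole words in text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sorted(set(words) & banned)

def complies(text, banned):
    """A response complies when it uses none of the banned words."""
    return not violations(text, banned)

response = "The answer is 42, just barely."
print(violations(response, FILLER_BAN), violations(response, E_PRIME_BAN))
```

A real filter would likely need to handle contractions and quoted material, which is one reason a simple scan like this can only approximate the study's procedure.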

[22] Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

Yihong Dong,Xiaoha Jian,Xue Jiang,Xuyuan Guo,Zhiyuan Fan,Jiaru Qian,Kechi Zhang,Jia Li,Zhi Jin,Ge Li

Main category: cs.CL

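TL;DR: This paper introduces ChomskyBench, a benchmark that systematically evaluates the formal reasoning capabilities of large language models (LLMs) through the Chomsky hierarchy, and finds that current LLMs face a significant efficiency bottleneck on formal language tasks rather than an absolute lack of capability.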

Details Motivation: Existing LLM benchmarks lack systematic evaluation grounded in computation and complexity, and so cannot reveal how well models grasp the structured, hierarchical complexity of formal languages. Method: Builds ChomskyBench: language recognition and generation tasks covering every level of the Chomsky hierarchy, with process-trace evaluation in natural language and deterministic symbolic verification. Result: LLM performance stratifies clearly and declines as hierarchy level rises; task difficulty substantially affects both inference length and performance; larger models and advanced inference methods bring notable relative gains but face severe efficiency barriers, since practical reliability would require prohibitive computational cost; LLMs are far less efficient than traditional algorithmic programs on these tasks. Conclusion: The formal reasoning bottleneck of current LLMs stems mainly from inefficiency rather than a capability ceiling; traditional software tools remain indispensable; future designs need to balance capability with computational efficiency. Abstract: The formal reasoning capabilities of LLMs are crucial for advancing automated software engineering. However, existing benchmarks for LLMs lack systematic evaluation based on computation and complexity, leaving a critical gap in understanding their formal reasoning capabilities. Therefore, it is still unknown whether SOTA LLMs can grasp the structured, hierarchical complexity of formal languages as defined by Computation Theory. To address this, we introduce ChomskyBench, a benchmark for systematically evaluating LLMs through the lens of Chomsky Hierarchy. Unlike prior work that uses vectorized classification for neural networks, ChomskyBench is the first to combine full Chomsky Hierarchy coverage, process-trace evaluation via natural language, and deterministic symbolic verifiability. ChomskyBench is composed of a comprehensive suite of language recognition and generation tasks designed to test capabilities at each level. Extensive experiments indicate a clear performance stratification that correlates with the hierarchy's levels of complexity. Our analysis reveals a direct relationship where increasing task difficulty substantially impacts both inference length and performance. Furthermore, we find that while larger models and advanced inference methods offer notable relative gains, they face severe efficiency barriers: achieving practical reliability would require prohibitive computational costs, revealing that current limitations stem from inefficiency rather than absolute capability bounds. A time complexity analysis further indicates that LLMs are significantly less efficient than traditional algorithmic programs for these formal tasks. These results delineate the practical limits of current LLMs, highlight the indispensability of traditional software tools, and provide insights to guide the development of future LLMs with more powerful formal reasoning capabilities.
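
"Deterministic symbolic verifiability" means a model's recognition answer can be checked against an exact membership test. The Python sketch below shows such tests for three textbook languages, one per level below Type 0. These particular languages and function names are illustrative assumptions; the abstract does not list ChomskyBench's actual tasks.

```python
import re

def is_regular(s):
    """Type 3 (regular): (ab)*, decidable by a finite automaton."""
    return re.fullmatch(r"(ab)*", s) is not None

def is_context_free(s):
    """Type 2 (context-free): a^n b^n with n >= 0; needs a counter or stack."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

def is_context_sensitive(s):
    """Type 1 (context-sensitive): a^n b^n c^n with n >= 0; not context-free."""
    n = len(s) // 3
    return len(s) % 3 == 0 and s == "a" * n + "b" * n + "c" * n

for candidate in ["abab", "aabb", "aabbcc"]:
    print(candidate, is_regular(candidate), is_context_free(candidate),
          is_context_sensitive(candidate))
```

Each checker runs in time linear in the input length, which gives a sense of the efficiency gap the abstract describes between algorithmic programs and LLM inference on the same tasks.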

[23] Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts

Jiawen Deng,Wentao Zhang,Ziyun Jiao,Fuji Ren

Main category: cs.CL

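TL;DR: This paper examines alignment failures of conversational AI in emotionally and ethically sensitive scenarios. It proposes a user simulator based on psychological personas and emotional pacing to stress-test mainstream models, finds typical failure modes including affective misalignment, ethical guidance failures, and mishandled trade-offs between empathy and responsibility, and builds a taxonomy of these failures, stressing the need to maintain ethical coherence and affective sensitivity throughout dynamic dialogue.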

Details Motivation: Prior research focuses mostly on static safety checks or emotional benchmarks and overlooks how conversational AI achieves value alignment as a dialogue evolves; systematic diagnosis is especially lacking for emotionally and ethically sensitive interactions. Method: Designs and implements a multi-turn user simulator with psychological personas and controllable emotional pacing, uses it to stress-test mainstream conversational models, and qualitatively analyzes their typical failure modes as emotions escalate. Result: Mainstream conversational models show recurrent breakdowns as emotional intensity rises, including affective misalignments, ethical guidance failures, and cross-dimensional trade-offs between empathy and responsibility. Conclusion: Conversational AI must maintain both ethical coherence and affective sensitivity throughout dynamic interaction; the proposed failure taxonomy gives the HCI community a diagnostic framework and design implications for value-sensitive contexts. Abstract: Conversational AI is increasingly deployed in emotionally charged and ethically sensitive interactions. Previous research has primarily concentrated on emotional benchmarks or static safety checks, overlooking how alignment unfolds in evolving conversation. We explore the research question: what breakdowns arise when conversational agents confront emotionally and ethically sensitive behaviors, and how do these affect dialogue quality? To stress-test chatbot performance, we develop a persona-conditioned user simulator capable of engaging in multi-turn dialogue with psychological personas and staged emotional pacing. Our analysis reveals that mainstream models exhibit recurrent breakdowns that intensify as emotional trajectories escalate. We identify several common failure patterns, including affective misalignments, ethical guidance failures, and cross-dimensional trade-offs where empathy supersedes or undermines responsibility. We organize these patterns into a taxonomy and discuss the design implications, highlighting the necessity to maintain ethical coherence and affective sensitivity throughout dynamic interactions. The study offers the HCI community a new perspective on the diagnosis and improvement of conversational AI in value-sensitive and emotionally charged contexts.

[24] Multiple-Debias: A Full-process Debiasing Method for Multilingual Pre-trained Language Models

Haoyu Liang,Peijian Zeng,Wentao Huang,Aimin Yang,Dong Zhou

Main category: cs.CL

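TL;DR: This paper proposes Multiple-Debias, a multilingual debiasing method that combines multilingual counterfactual data augmentation, multilingual Self-Debias, and parameter-efficient fine-tuning across the pre-processing and post-processing stages to reduce bias in MPLMs on sensitive attributes such as gender, race, and religion, and validates it in German, Spanish, Chinese, and Japanese.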

Details Motivation: Multilingual pre-trained language models (MPLMs) often exhibit biases related to sensitive attributes such as gender, race, and religion, so effective multilingual debiasing methods are needed. Method: Proposes Multiple-Debias, which combines multilingual counterfactual data augmentation, multilingual Self-Debias, and parameter-efficient fine-tuning to cover the full process from pre-processing to post-processing; extends CrowS-Pairs to German, Spanish, Chinese, and Japanese for evaluation. Result: Significantly reduces MPLM bias across three sensitive attributes in four languages; experiments show that multilingual debiasing outperforms monolingual methods and that integrating cross-lingual information notably improves model fairness. Conclusion: Joint multilingual debiasing is an effective route to fairer MPLMs, and Multiple-Debias offers an extensible full-process solution for multilingual fairness research. Abstract: Multilingual Pre-trained Language Models (MPLMs) have become essential tools for natural language processing. However, they often exhibit biases related to sensitive attributes such as gender, race, and religion. In this paper, we introduce a comprehensive multilingual debiasing method named Multiple-Debias to address these issues across multiple languages. By incorporating multilingual counterfactual data augmentation and multilingual Self-Debias across both pre-processing and post-processing stages, alongside parameter-efficient fine-tuning, we significantly reduced biases in MPLMs across three sensitive attributes in four languages. We also extended CrowS-Pairs to German, Spanish, Chinese, and Japanese, validating our full-process multilingual debiasing method for gender, racial, and religious bias. Our experiments show that (i) multilingual debiasing methods surpass monolingual approaches in effectively mitigating biases, and (ii) integrating debiasing information from different languages notably improves the fairness of MPLMs.

[25] When Modalities Remember: Continual Learning for Multimodal Knowledge Graphs

Linyu Li,Zhi Jin,Yichi Zhang,Dongming Jin,Yuanpeng He,Haoran Duan,Gadeng Luosang,Nyima Tashi

Main category: cs.CL

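TL;DR: This paper introduces the task of continual multimodal knowledge graph reasoning (CMMKGR), builds benchmark datasets for it, and designs the MRCKG model, which uses a multimodal-structural collaborative curriculum, a cross-modal knowledge preservation mechanism, and a multimodal contrastive replay strategy to mitigate catastrophic forgetting and improve the learning of new knowledge.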

Details Motivation: Existing continual knowledge graph reasoning methods ignore multimodal signals, while multimodal knowledge graph reasoning methods cannot cope with the catastrophic forgetting caused by evolving graphs, so continual multimodal knowledge graph reasoning calls for systematic study. Method: Proposes MRCKG, with three components: 1) a multimodal-structural collaborative curriculum that schedules learning by the structural connectivity of new triples to the historical graph and by their multimodal compatibility; 2) a cross-modal knowledge preservation mechanism that maintains entity representation stability, relational semantic consistency, and modality anchoring; 3) a multimodal contrastive replay scheme with two-stage optimization, combining importance sampling and representation alignment. Result: Experiments on multiple datasets show that MRCKG preserves previously learned multimodal knowledge while substantially improving the learning of new knowledge. Conclusion: The paper systematically defines and addresses continual multimodal knowledge graph reasoning; MRCKG and its accompanying benchmarks provide a solid foundation and an effective solution for this direction. Abstract: Real-world multimodal knowledge graphs (MMKGs) are dynamic, with new entities, relations, and multimodal knowledge emerging over time. Existing continual knowledge graph reasoning (CKGR) methods focus on structural triples and cannot fully exploit multimodal signals from new entities. Existing multimodal knowledge graph reasoning (MMKGR) methods, however, usually assume static graphs and suffer catastrophic forgetting as graphs evolve. To address this gap, we present a systematic study of continual multimodal knowledge graph reasoning (CMMKGR). We construct several continual multimodal knowledge graph benchmarks from existing MMKG datasets and propose MRCKG, a new CMMKGR model. Specifically, MRCKG employs a multimodal-structural collaborative curriculum to schedule progressive learning based on the structural connectivity of new triples to the historical graph and their multimodal compatibility. It also introduces a cross-modal knowledge preservation mechanism to mitigate forgetting through entity representation stability, relational semantic consistency, and modality anchoring. In addition, a multimodal contrastive replay scheme with a two-stage optimization strategy reinforces learned knowledge via multimodal importance sampling and representation alignment. Experiments on multiple datasets show that MRCKG preserves previously learned multimodal knowledge while substantially improving the learning of new knowledge.

[26] Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

Tianze Xu,Yanzhao Zheng,Pengrui Lu,Lyumanshan Ye,Yong Wu,Zhentao Zhang,Yuanqiang Yu,Chao Ma,Jihuai Zhu,Pengfei Liu,Baohua Dong,Hangcheng Zhu,Ruohui Huang,Gang Yu

Main category: cs.CL

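TL;DR: This paper proposes the Rubrics to Tokens (RTT) framework, which introduces a Token-Level Relevance Discriminator and the RTT-GRPO algorithm to map coarse response-level scores to fine-grained token-level credit assignment, and designs Intra-sample Token Group Normalization to handle the multi-dimensional token-level reward space, improving instruction-level and rubric-level accuracy.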

Details Motivation: Existing rubric-based reinforcement learning methods rely on response-level rewards, which suffer from reward sparsity and ambiguity. Method: Proposes the RTT framework, comprising a Token-Level Relevance Discriminator that predicts token-level relevance, RTT-GRPO, which integrates response-level and token-level advantages, and Intra-sample Token Group Normalization for the three-dimensional token-level reward space. Result: Across multiple models and benchmarks, RTT consistently outperforms other baselines in instruction-level and rubric-level accuracy. Conclusion: RTT effectively mitigates reward sparsity and ambiguity, enabling finer-grained and more robust LLM alignment. Abstract: Rubric-based Reinforcement Learning (RL) has emerged as a promising approach for aligning Large Language Models (LLMs) with complex, open-domain instruction following tasks. However, existing methods predominantly rely on response-level rewards, introducing severe reward sparsity and reward ambiguity problems. To address these issues, we propose Rubrics to Tokens (RTT), a novel rubric-based RL framework that bridges coarse response-level scores and fine-grained token-level credit assignment. RTT introduces a Token-Level Relevance Discriminator to predict which tokens in the response are responsible for a specific constraint, and optimizes the policy model via RTT-GRPO, which integrates response-level and token-level advantages within a unified framework. Furthermore, when transitioning from one-dimensional, outcome-level reward to three-dimensional reward space in the token-level rubric-based RL, we propose a novel group normalization method, called Intra-sample Token Group Normalization, to accommodate this shift. Extensive experiments and benchmarks demonstrate that RTT consistently outperforms other baselines in both instruction- and rubric-level accuracy across different models.
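
For context on the response-level baseline that RTT builds on, GRPO-style methods standardize each reward against the other responses sampled for the same prompt. The Python sketch below shows only that standard response-level step; the abstract does not specify how Intra-sample Token Group Normalization extends it to token-level rewards, so nothing here should be read as RTT's method. The function name, the `eps` term, and the use of the population standard deviation are assumptions (implementations differ on the last point).

```python
from statistics import mean, pstdev

def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses, two of which satisfied the rubric.
adv = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])
# A group with identical rewards carries no learning signal.
flat = group_normalized_advantages([1.0, 1.0, 1.0])
print(adv, flat)
```

The second call illustrates the sparsity problem the abstract mentions: when every response in a group receives the same score, all response-level advantages collapse to zero.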

[27] Student-in-the-Loop Chain-of-Thought Distillation via Generation-Time Selection

Chaoqun He,Yingfa Chen,Chaojun Xiao,Xu Han,Lijie Wen

Main category: cs.CL

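TL;DR: This paper proposes Gen-SSD, a framework that brings the student model into the generation process to select and prune reasoning paths in real time, improving distillation of a large model's reasoning into a small model.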

Details Motivation: Existing distillation methods rely on post-hoc filtering of teacher-generated reasoning paths; they cannot control generation itself and tend to produce paths beyond the student's learning capacity. Method: Proposes Gen-SSD (Generation-time Self-Selection Distillation), in which the student model evaluates candidate continuations during teacher sampling, so that only learnable paths are expanded and unhelpful branches are pruned early. Result: On mathematical reasoning benchmarks, Gen-SSD outperforms standard knowledge distillation by about 5.9 points and other recent baselines by up to 4.7 points; the generated reasoning paths are more stable and more learnable. Conclusion: Incorporating student supervision during generation markedly improves reasoning distillation, indicating that generation-time selection is preferable to post-hoc filtering. Abstract: Large reasoning models achieve strong performance on complex tasks through long chain-of-thought (CoT) trajectories, but directly transferring such reasoning processes to smaller models remains challenging. A key difficulty is that not all teacher-generated reasoning trajectories are suitable for student learning. Existing approaches typically rely on post-hoc filtering, selecting trajectories after full generation based on heuristic criteria. However, such methods cannot control the generation process itself and may still produce reasoning paths that lie outside the student's learning capacity. To address this limitation, we propose Gen-SSD (Generation-time Self-Selection Distillation), a student-in-the-loop framework that performs generation-time selection. Instead of passively consuming complete trajectories, the student evaluates candidate continuations during the teacher's sampling process, guiding the expansion of only learnable reasoning paths and enabling early pruning of unhelpful branches. Experiments on mathematical reasoning benchmarks demonstrate that Gen-SSD consistently outperforms standard knowledge distillation and recent baselines, with improvements of around 5.9 points over Standard KD and up to 4.7 points over other baselines. Further analysis shows that Gen-SSD produces more stable and learnable reasoning trajectories, highlighting the importance of incorporating supervision during generation for effective distillation.

[28] GRADE: Probing Knowledge Gaps in LLMs through Gradient Subspace Dynamics

Yujing Wang,Yuanbang Liang,Yukun Lai,Hainan Zhang,Hanqi Yan

Main category: cs.CL

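TL;DR: This paper proposes GRADE, which quantifies knowledge gaps via the cross-layer rank ratio of the gradient to the hidden-state subspace in order to detect whether an LLM's internal knowledge suffices to answer a question, and is more precise than methods that rely on hidden states alone.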

Details Motivation: Existing methods use the model's hidden states to capture activated knowledge, but that knowledge may not match what the query requires (for example, it may reflect irrelevant features such as style and length) and so cannot accurately reflect knowledge gaps. Method: Proposes GRADE (Gradient Dynamics for knowledge gap detection), which treats gradients as estimators of the required knowledge update and quantifies the knowledge gap via the cross-layer rank ratio of the gradient to that of the corresponding hidden-state subspace. Result: Validated on six benchmarks for effectiveness and robustness to input perturbations, with a case study showing that it can produce interpretable explanations of knowledge gaps for long-form answers. Conclusion: GRADE detects model knowledge gaps more accurately, improving the reliability of LLM deployment, and offers interpretability advantages. Abstract: Detecting whether a model's internal knowledge is sufficient to correctly answer a given question is a fundamental challenge in deploying responsible LLMs. In addition to verbalising the confidence by LLM self-report, more recent methods explore the model internals, such as the hidden states of the response tokens to capture how much knowledge is activated. We argue that such activated knowledge may not align with what the query requires, e.g., capturing the stylistic and length-related features that are uninformative for answering the query. To fill the gap, we propose GRADE (Gradient Dynamics for knowledge gap detection), which quantifies the knowledge gap via the cross-layer rank ratio of the gradient to that of the corresponding hidden state subspace. This is motivated by the property of gradients as estimators of the required knowledge updates for a given target. We validate GRADE on six benchmarks, demonstrating its effectiveness and robustness to input perturbations. In addition, we present a case study showing how the gradient chain can generate interpretable explanations of knowledge gaps for long-form answers.

[29] LLM-based Atomic Propositions help weak extractors: Evaluation of a Propositioner for triplet extraction

Luc Pommeret,Thomas Gerald,Patrick Paroubek,Sahar Ghannay,Christophe Servan,Sophie Rosset

Main category: cs.CL

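TL;DR: This paper introduces MPropositionneur-V2, a model that decomposes text into atomic propositions to improve knowledge-graph triplet extraction; it helps weaker extractors in particular, and a fallback strategy balances entity and relation extraction performance for strong LLMs.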

Details Motivation: Knowledge graph construction requires extracting structured triplets from complex, dense text, and atomic propositions, as minimal semantically autonomous units of information, may improve extraction. Method: Introduces MPropositionneur-V2, a small multilingual model (based on Qwen3-0.6B, distilled from Qwen3-32B), and integrates it into two extraction paradigms: entity-centric (GLiREL) and generative (Qwen3). Result: On SMiLER, FewRel, DocRED, and CaRB, atomic propositions improve relation recall for weaker extractors (such as GLiREL, CoreNLP, and 0.6B models) and, in the multilingual setting, overall accuracy; for strong LLMs, a fallback strategy recovers entity recall losses while preserving relation extraction gains. Conclusion: Atomic propositions are an interpretable intermediate data structure that complements rather than replaces existing extractors. Abstract: Knowledge Graph construction from natural language requires extracting structured triplets from complex, information-dense sentences. In this paper, we investigate if the decomposition of text into atomic propositions (minimal, semantically autonomous units of information) can improve the triplet extraction. We introduce MPropositionneur-V2, a small multilingual model covering six European languages trained by knowledge distillation from Qwen3-32B into a Qwen3-0.6B architecture, and we evaluate its integration into two extraction paradigms: entity-centric (GLiREL) and generative (Qwen3). Experiments on SMiLER, FewRel, DocRED and CaRB show that atomic propositions benefit weaker extractors (GLiREL, CoreNLP, 0.6B models), improving relation recall and, in the multilingual setting, overall accuracy. For stronger LLMs, a fallback combination strategy recovers entity recall losses while preserving the gains in relation extraction. These results show that atomic propositions are an interpretable intermediate data structure that complements extractors without replacing them.

[30] One Model to Translate Them All? A Journey to Mount Doom for Multilingual Model Merging

Baban Gain,Asif Ekbal,Trilok Nath Singh

Main category: cs.CL

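TL;DR: This paper systematically studies weight-space model merging for multilingual machine translation and finds that it degrades performance, especially when target languages differ; analysis of internal representations shows that multilingual fine-tuning reshapes model geometry and reduces compatibility with standard weight-merging assumptions.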

Details Motivation: Although weight-space model merging succeeds in multitask settings, its behavior in multilingual settings remains unclear; this paper examines its applicability to multilingual machine translation and the reasons it fails. Method: Fully fine-tunes a language model on large-scale bilingual corpora, evaluates standard merging strategies, and analyzes internal representations with span-conditioned neuron selectivity and layer-wise centered kernel alignment. Result: Merging degrades performance, especially when target languages differ; language-specific neurons concentrate in embedding layers and upper transformer blocks, while intermediate layers are largely shared; after fine-tuning, neurons for supervised and related languages become less exclusive while those for unsupervised languages grow more isolated, increasing representational divergence in higher layers. Conclusion: Multilingual fine-tuning reshapes the model's representational geometry and reduces its compatibility with standard weight-space merging assumptions, which explains why merging fails in multilingual translation. Abstract: Weight-space model merging combines independently fine-tuned models without accessing original training data, offering a practical alternative to joint training. While merging succeeds in multitask settings, its behavior in multilingual contexts remains poorly understood. We systematically study weight-space merging for multilingual machine translation by fully fine-tuning language model on large-scale bilingual corpora and evaluating standard merging strategies. Our experiments reveal that merging degrades performance, especially when target languages differ. To explain this failure, we analyze internal representations using span-conditioned neuron selectivity and layer-wise centered kernel alignment. We find that language-specific neurons concentrate in embedding layers and upper transformer blocks, while intermediate layers remain largely shared across languages. Critically, fine-tuning redistributes rather than sharpens language selectivity: neurons for supervised and related languages become less exclusive, while those for unsupervised languages grow more isolated. This redistribution increases representational divergence in higher layers that govern generation. These findings suggest that multilingual fine-tuning may reshape geometry in ways that reduce compatibility with standard weight-space merging assumptions. Our work thus provides an explanation for why merging fails in multilingual translation scenarios.

[31] BioUNER: A Benchmark Dataset for Clinical Urdu Named Entity Recognition

Wazir Ali,Adeeb Noor,Sanaullah Mahar,Alia,Muhammad Mazhar Younas

Main category: cs.CL

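TL;DR: This paper presents a gold-standard benchmark dataset for Biomedical Urdu Named Entity Recognition (BioUNER) containing 153K tokens, annotated by three native-speaker annotators familiar with the medical domain with an inter-annotator agreement of 0.78, and evaluates it intrinsically and extrinsically with models including SVM, LSTM, mBERT, and XLM-RoBERTa.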

Details Motivation: The lack of a high-quality benchmark dataset for Urdu biomedical named entity recognition hinders progress on biomedical NLP tasks in the language. Method: Text is crawled from Urdu health news portals, medical prescriptions, and hospital health blogs and websites; after preprocessing, three native speakers with medical-domain familiarity annotate it with the Doccano tool to build the BioUNER dataset; several machine learning and deep learning models are used for intrinsic (annotation agreement) and extrinsic (model performance) evaluation. Result: An inter-annotator agreement score of 0.78 is achieved, and several models are benchmarked on the dataset, supporting its validity and utility as a gold-standard resource. Conclusion: BioUNER is a gold-standard Urdu named entity recognition dataset for the biomedical domain that fills a resource gap and provides a reliable benchmark and foundation for future Urdu medical NLP research. Abstract: In this article, we present a gold-standard benchmark dataset for Biomedical Urdu Named Entity Recognition (BioUNER), developed by crawling health-related articles from online Urdu news portals, medical prescriptions, and hospital health blogs and websites. After preprocessing, three native annotators with familiarity in the medical domain participated in the annotation process using the Doccano text annotation tool and annotated 153K tokens. Following annotation, the proposed BioUNER dataset was evaluated both intrinsically and extrinsically. An inter-annotator agreement score of 0.78 was achieved, thereby validating the dataset as gold-standard quality. To demonstrate the utility and benchmarking capability of the dataset, we evaluated several machine learning and deep learning models, including Support Vector Machines (SVM), Long Short-Term Memory networks (LSTM), Multilingual BERT (mBERT), and XLM-RoBERTa. The gold-standard BioUNER dataset serves as a reliable benchmark and a valuable addition to Urdu language processing resources.
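
The abstract reports an inter-annotator agreement of 0.78 without naming the statistic. A common choice for token-level labels is Cohen's kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance from each annotator's label frequencies; with three annotators it would be computed per pair. The Python sketch below is a generic implementation under that assumption, and the entity labels in the example are hypothetical, not BioUNER's tag set.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and equally long")
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:  # both annotators used a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical token labels from two annotators.
ann_1 = ["O", "O", "DISEASE", "DRUG", "O", "DISEASE"]
ann_2 = ["O", "O", "DISEASE", "O", "O", "DRUG"]
kappa = cohens_kappa(ann_1, ann_2)
print(round(kappa, 3))
```

In this toy example the annotators agree on 4 of 6 tokens, but chance agreement is already 15/36, so kappa comes out at 3/7, well below the raw agreement rate.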

[32] Council Mode: Mitigating Hallucination and Bias in LLMs via Multi-Agent Consensus

Shuai Wu,Xue Li,Yanna Feng,Yufang Li,Zhijun Wang

Main category: cs.CL

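TL;DR: This paper proposes Council Mode, a multi-agent consensus framework that queries multiple heterogeneous large language models (LLMs) in parallel and has a consensus model synthesize their outputs, substantially reducing hallucination rates and bias variance.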

Details Motivation: Large language models, especially those with MoE architectures, suffer from hallucinations and systematic biases, and these biases are amplified by uneven expert activation during inference. Method: Proposes a three-phase Council pipeline: (1) an intelligent triage classifier that routes queries by complexity; (2) parallel generation across architecturally diverse frontier LLMs; (3) structured consensus synthesis that explicitly identifies agreement, disagreement, and unique findings before producing the final response. Result: A 35.9% relative reduction in hallucination rate on HaluEval, a 7.8-point improvement on TruthfulQA, and significantly lower bias variance across domains. Conclusion: Council Mode is an effective, scalable multi-model collaboration paradigm for mitigating LLM hallucination and bias, backed by a mathematical formulation and an open-source implementation. Abstract: Large Language Models (LLMs), particularly those employing Mixture-of-Experts (MoE) architectures, have achieved remarkable capabilities across diverse natural language processing tasks. However, these models frequently suffer from hallucinations -- generating plausible but factually incorrect content -- and exhibit systematic biases that are amplified by uneven expert activation during inference. In this paper, we propose the Council Mode, a novel multi-agent consensus framework that addresses these limitations by dispatching queries to multiple heterogeneous frontier LLMs in parallel and synthesizing their outputs through a dedicated consensus model. The Council pipeline operates in three phases: (1) an intelligent triage classifier that routes queries based on complexity, (2) parallel expert generation across architecturally diverse models, and (3) a structured consensus synthesis that explicitly identifies agreement, disagreement, and unique findings before producing the final response. We implement and evaluate this architecture within an open-source AI workspace. Our comprehensive evaluation across multiple benchmarks demonstrates that the Council Mode achieves a 35.9% relative reduction in hallucination rates on the HaluEval benchmark and a 7.8-point improvement on TruthfulQA compared to the best-performing individual model, while maintaining significantly lower bias variance across domains. We provide the mathematical formulation of the consensus mechanism, detail the system architecture, and present extensive empirical results with ablation studies.

[33] A Multi-head-based architecture for effective morphological tagging in Russian with open dictionary

K. Skibin,M. Pozhidaev,S. Suschenko

Main category: cs.CL

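TL;DR: This paper proposes a new architecture based on multi-head attention for Russian morphological tagging. Subword splitting and vector aggregation support an open dictionary and fine-grained morphological analysis; the model reaches accuracy of 98-99% and above on the SinTagRus and Taiga datasets, outperforming previous methods, while requiring modest training resources and offering fast inference.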

Details Motivation: To address Russian morphological tagging with support for an open dictionary (handling out-of-vocabulary words) and for modeling finer morphological features from word-internal structure (such as prefixes and endings). Method: A new architecture based on multi-head attention; preprocessing splits words into subtokens and aggregates subtoken vectors into token vectors; no RNNs and no large-scale unsupervised pretraining (as in BERT). Result: On SinTagRus and Taiga, accuracy for some grammatical categories reaches 98-99% and above; for nine out of ten words the model predicts all grammatical categories exactly and indicates which categories should not be analyzed; the model can be trained on consumer-grade GPUs and processes text faster than previous approaches. Conclusion: The multi-head attention architecture improves on existing methods in accuracy, generalization (open dictionary), training efficiency, and inference speed, offering an efficient and practical approach to Russian morphological analysis. Abstract: The article proposes a new architecture based on Multi-head attention to solve the problem of morphological tagging for the Russian language. The preprocessing of the word vectors includes splitting the words into subtokens, followed by a trained procedure for aggregating the vectors of the subtokens into vectors for tokens. This makes it possible to support an open dictionary and to analyze morphological features taking into account parts of words (prefixes, endings, etc.). The open dictionary will make it possible to analyze words that are absent from the training dataset. The computational experiment performed on the SinTagRus and Taiga datasets shows that for some grammatical categories the proposed architecture achieves accuracy of 98-99% and above, which outperforms previously known results. For nine out of ten words, the architecture precisely predicts all grammatical categories and indicates when the categories must not be analyzed for the word. At the same time, the model based on the proposed architecture can be trained on consumer-level graphics accelerators, retains all the advantages of Multi-head attention over RNNs (RNNs are not used in the proposed approach), does not require pretraining on large collections of unlabeled texts (like BERT), and shows higher processing speed than previous results.

[34] How Annotation Trains Annotators: Competence Development in Social Influence Recognition

Maciej Markiewicz,Beata Bajcar,Wiktoria Mieleszczenko-Kowszewicz,Aleksander Szczęsny,Tomasz Adamczyk,Grzegorz Chodak,Karolina Ostrowska,Aleksandra Sawczuk,Jolanta Babiak,Jagoda Szklarczyk,Przemysław Kazienko

Main category: cs.CL

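TL;DR: This study examines how the annotation competence of annotators (both experts and non-experts) changes over time in a social influence recognition task. It finds that the annotation process itself raises annotators' self-perceived competence and data quality, with larger gains for experts, which in turn affects the performance of large language models trained on their annotations.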

Details Motivation: Human data annotation is often treated as an objective reference, yet many annotation tasks are inherently subjective and annotators' judgments may evolve over time, so changes in annotation quality need to be studied from a competence perspective. Method: 25 annotators from 5 groups (experts and non-experts) annotate 1,021 dialogues with 20 social influence techniques plus associated intentions, reactions, and consequences; 150 of the texts are annotated twice, before and after the main annotation process; competence shifts are measured by combining qualitative and quantitative analysis, semi-structured interviews, self-assessment surveys, and LLM training and evaluation on the comparison dataset. Result: Annotators' self-perceived competence and confidence rise significantly; data quality shows observable improvement, more pronounced in expert groups; shifts in annotator competence visibly affect the performance of LLMs trained on their annotations. Conclusion: Annotation is not only data production but also a process of competence development, particularly for experts; the effect of changing annotator competence on data quality and model performance deserves attention. Abstract: Human data annotation, especially when involving experts, is often treated as an objective reference. However, many annotation tasks are inherently subjective, and annotators' judgments may evolve over time. This study investigates changes in the quality of annotators' work from a competence perspective during a process of social influence recognition. The study involved 25 annotators from five different groups, including both experts and non-experts, who annotated a dataset of 1,021 dialogues with 20 social influence techniques, along with intentions, reactions, and consequences. An initial subset of 150 texts was annotated twice - before and after the main annotation process - to enable comparison. To measure competence shifts, we combined qualitative and quantitative analyses of the annotated data, semi-structured interviews with annotators, self-assessment surveys, and Large Language Model training and evaluation on the comparison dataset. The results indicate a significant increase in annotators' self-perceived competence and confidence. Moreover, observed changes in data quality suggest that the annotation process may enhance annotator competence and that this effect is more pronounced in expert groups. The observed shifts in annotator competence have a visible impact on the performance of LLMs trained on their annotated data.

[35] LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation

Yilin Xiao,Jin Chen,Qinggang Zhang,Yujing Zhang,Chuang Zhou,Longhao Yang,Lingfei Ren,Xin Yang,Xiao Huang

Main category: cs.CL

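TL;DR: This paper proposes LogicPoison, an attack framework that uses a type-preserving entity swapping mechanism to implicitly corrupt the logical connections of a knowledge graph, bypassing the defenses of GraphRAG systems and significantly degrading their reasoning performance without altering surface-level text semantics.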

Details Motivation: GraphRAG systems are robust to traditional RAG attacks such as text poisoning and prompt injection, but their security depends on the topological integrity of the graph; the authors find that a new kind of attack is possible by implicitly tampering with logical connections rather than surface text. Method: The LogicPoison attack framework uses type-preserving entity swapping to perturb both global logic hubs (disrupting overall connectivity) and query-specific reasoning bridges (severing multi-hop inference paths), rerouting valid reasoning into dead ends. Result: Experiments on multiple benchmarks show that LogicPoison bypasses GraphRAG's defenses, significantly degrades performance, and outperforms state-of-the-art baselines in both effectiveness and stealth. Conclusion: The security of GraphRAG has a fundamental dependence on graph topology; LogicPoison exposes an attack dimension at the level of logical structure and offers a warning and direction for building more robust graph-augmented generation systems. Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) enhances the reasoning capabilities of Large Language Models (LLMs) by grounding their responses in structured knowledge graphs. Leveraging community detection and relation filtering techniques, GraphRAG systems demonstrate inherent resistance to traditional RAG attacks, such as text poisoning and prompt injection. However, in this paper, we find that the security of GraphRAG systems fundamentally relies on the topological integrity of the underlying graph, which can be undermined by implicitly corrupting the logical connections, without altering surface-level text semantics. To exploit this vulnerability, we propose LogicPoison, a novel attack framework that targets logical reasoning rather than injecting false contents. Specifically, LogicPoison employs a type-preserving entity swapping mechanism to perturb both global logic hubs for disrupting overall graph connectivity and query-specific reasoning bridges for severing essential multi-hop inference paths. This approach effectively reroutes valid reasoning into dead ends while maintaining surface-level textual plausibility. Comprehensive experiments across multiple benchmarks demonstrate that LogicPoison successfully bypasses GraphRAG's defenses, significantly degrading performance and outperforming state-of-the-art baselines in both effectiveness and stealth. Our code is available at https://github.com/Jord8061/logicPoison.

[36] NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons

Haonan Dong,Kehan Jiang,Haoran Ye,Wenhao Zhu,Zhaolu Kang,Guojie Song

Main category: cs.CL

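TL;DR: This paper proposes NeuReasoner, a framework that uses white-box analysis to identify key neurons associated with reasoning failures (Mixture of Neurons, MoN) and combines lightweight MLPs for failure detection with a special-token-triggered self-correction mechanism. The result is explainable, controllable, and unified reasoning optimization that substantially improves performance and reduces token consumption across multiple benchmarks and models.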

Details Motivation: Large Reasoning Models (LRMs) show three failure modes on complex reasoning tasks (intra-step, inter-step, and instance level), but current methods target only a single level and rely on black-box designs and reinforcement learning, which limits explainability and controllability. Method: A white-box analysis identifies key neurons (Mixture of Neurons, MoN) and their fluctuation patterns associated with each failure mode; the NeuReasoner framework combines lightweight MLPs for failure detection with a self-correction mechanism triggered by special tokens and learned via supervised fine-tuning (SFT). Result: Evaluated on six benchmarks and six backbone models (8B~70B) against nine strong baselines, NeuReasoner improves performance by up to 27.0% and reduces token consumption by 19.6% to 63.3%. Conclusion: NeuReasoner is an explainable, controllable, and unified reasoning enhancement framework that uses neuron-level insight and lightweight intervention to mitigate multi-level reasoning failures while balancing performance and efficiency. Abstract: Large Reasoning Models (LRMs) have recently achieved remarkable success in complex reasoning tasks. However, closer scrutiny reveals persistent failure modes compromising performance and cost: I) Intra-step level, marked by calculation or derivation errors; II) Inter-step level, involving oscillation and stagnation; and III) Instance level, causing maladaptive over-thinking. Existing endeavors target isolated levels without unification, while their black-box nature and reliance on RL hinder explainability and controllability. To bridge these gaps, we conduct an in-depth white-box analysis, identifying key neurons (Mixture of Neurons, MoN) and their fluctuation patterns associated with distinct failures. Building upon these insights, we propose NeuReasoner, an explainable, controllable, and unified reasoning framework driven by MoN. Technically, NeuReasoner integrates lightweight MLPs for failure detection with a special token-triggered self-correction mechanism learned via SFT. During inference, special tokens are inserted upon failure detection to actuate controllable remedial behaviors. Extensive evaluations across six benchmarks, six backbone models (8B~70B) against nine competitive baselines, demonstrate that NeuReasoner achieves performance gains of up to 27.0% while reducing token consumption by 19.6% ~ 63.3%.

[37] R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning

Wanlong Liu,Bo Zhang,Chenliang Li,Shaopeng Lai,Yuning Wu,Xuanyu Lei,Ming Yan

Main category: cs.CL

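TL;DR: This paper proposes the R2-Write framework, which introduces explicit reflection and revision patterns to improve the deep reasoning ability of large models on open-ended writing tasks.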

Details Motivation: Existing mainstream reasoning models achieve limited gains on open-ended writing tasks; they lack deep reflection and revision patterns and so do not improve as they do on mathematical reasoning. Method: R2-Write is an automated framework that generates high-quality thinking trajectories through iterative writer-judge interaction, with a process reward mechanism that supervises reflection quality to improve performance and token efficiency. Result: Experiments on multiple creative writing and deep-research benchmarks show that R2-Write significantly improves performance on open-ended writing tasks. Conclusion: Explicitly incorporating reflection and revision patterns unlocks deep reasoning capabilities for open-ended writing tasks. Abstract: While deep reasoning with long chain-of-thought has dramatically improved large language models in verifiable domains like mathematics, its effectiveness for open-ended tasks such as writing remains unexplored. In this paper, we conduct a systematic investigation revealing that existing mainstream reasoning models achieve limited gains on open-ended writing tasks. Our further analysis shows that these models lack deep reflection and revision patterns in open-ended writing, resulting in substantially smaller improvements compared to mathematical reasoning tasks. To address this limitation, we introduce R2-Write: an automated framework that synthesizes high-quality thinking trajectories enriched with explicit reflection and revision patterns through iterative writer-judge interaction. To prevent redundant reflections, we design a process reward mechanism that supervises reflection quality during reinforcement learning, improving both performance and token efficiency. Extensive experiments across multiple creative writing and deep-research benchmarks demonstrate significant improvements, validating that explicitly incorporating reflection and revision patterns unlocks deep reasoning capabilities for open-ended writing tasks.

[38] JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

Aichen Cai,Anmeng Zhang,Anyu Li,Bo Zhang,Bohua Cai,Chang Li,Changjian Jiang,Changkai Lu,Chao Xue,Chaocai Liang,Cheng Zhang,Dongkai Liu,Fei Wang,Guoqiang Huang,Haijian Ke,Han Lin,Hao Wang,Ji Miao,Jiacheng Zhang,Jialong Shi,Jifeng Zhu,Jingjing Qian,Junhui Luo,Junwu Xiong,Lam So,Liang Huang,Ming Ke,Mingyang Li,Panfeng Shi,Peng Hao,Qi Wang,Qian Lai,Qiaoqiao Yuan,Qingyu Yin,Qiong Cao,Qixiang Wang,Rongcheng Bian,Rongduo Han,Shaoqiang Zheng,Shi Hu,Shi Suo,Shijie Ren,Shijin Zhang,Shiying Fan,Shuai Xie,Tianyi Zhang,Wei Liu,Wentao Tan,Xianghan Meng,Xiaodong He,Xing Pan,Xiran Wang,Xuyang Peng,Ya Zhang,Yang Liu,Yangyang Duan,Yanxu Chen,Yicheng Gong,Yidan Huang,Yifei Liu,Yinhao Bai,Yongqiang Liu,Yuesong Zhang,Yuqi Zhang,Zerui Xie,Zhenfang Wang,Zhennan Shen,Zheyuan Liu,Zhuwei Zeng

Main category: cs.CL

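TL;DR: JoyAI-LLM Flash is an efficient MoE language model with 48B total parameters and only 2.7B activated per forward pass. It combines FiberPO, a new RL algorithm, with training-inference co-design to substantially improve token efficiency and inference throughput while maintaining strong performance.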

Details Motivation: To break the trade-off between performance and token efficiency in the sub-50B parameter regime, and to address insufficient sparsity in existing MoE models, poor RL stability, and the disconnect between training and inference. Method: The model is pretrained on 20 trillion tokens and post-trained with SFT, DPO, and large-scale multi-environment RL; FiberPO (inspired by fibration theory) decomposes trust-region maintenance into global and local components; thinking and non-thinking cognitive modes are balanced; dense Multi-Token Prediction (MTP) and Quantization-Aware Training (QAT) are combined for training-inference co-design. Result: 48B total parameters with only 2.7B activated, a sparsity ratio substantially higher than SOTA models of comparable scale; FiberPO improves the multi-scale stability of RL policy optimization; MTP and QAT raise inference throughput; the checkpoints are released openly. Conclusion: JoyAI-LLM Flash shows that algorithmic innovation (FiberPO), cognitive mode modeling, and training-inference co-design can together deliver strong performance, high sparsity, and high token efficiency in the sub-50B regime. Abstract: We introduce JoyAI-LLM Flash, an efficient Mixture-of-Experts (MoE) language model designed to redefine the trade-off between strong performance and token efficiency in the sub-50B parameter regime. JoyAI-LLM Flash is pretrained on a massive corpus of 20 trillion tokens and further optimized through a rigorous post-training pipeline, including supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and large-scale reinforcement learning (RL) across diverse environments. To improve token efficiency, JoyAI-LLM Flash strategically balances thinking and non-thinking cognitive modes and introduces FiberPO, a novel RL algorithm inspired by fibration theory that decomposes trust-region maintenance into global and local components, providing unified multi-scale stability control for LLM policy optimization. To enhance architectural sparsity, the model comprises 48B total parameters while activating only 2.7B parameters per forward pass, achieving a substantially higher sparsity ratio than contemporary industry-leading models of comparable scale. To further improve inference throughput, we adopt a joint training-inference co-design that incorporates dense Multi-Token Prediction (MTP) and Quantization-Aware Training (QAT). We release the checkpoints for both JoyAI-LLM-48B-A3B Base and its post-trained variants on Hugging Face to support the open-source community.
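As a quick check on the sparsity figures quoted in the abstract, the following minimal sketch computes the share of parameters activated per forward pass. The function name and the choice of billions as the unit are illustrative assumptions; only the 48B and 2.7B values come from the abstract.

```python
def activation_ratio(total_params_b: float, active_params_b: float) -> float:
    """Return the fraction of parameters used per forward pass.

    Both arguments are parameter counts in billions (an illustrative unit choice).
    """
    if total_params_b <= 0 or active_params_b < 0:
        raise ValueError("parameter counts must be positive")
    return active_params_b / total_params_b


# Values taken from the abstract: 48B total, 2.7B active per forward pass.
ratio = activation_ratio(48.0, 2.7)
print(f"active share: {ratio:.2%}, total/active: {1 / ratio:.1f}x")
```

Roughly one parameter in eighteen is active per forward pass under these figures.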

[39] Querying Structured Data Through Natural Language Using Language Models

Hontan Valentin-Micu,Bunea Andrei-Alexandru,Tantaroudas Nikolaos Dimitrios,Popovici Dan-Matei

Main category: cs.CL

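TL;DR: This paper proposes an open-source method that fine-tunes a lightweight LLM (DeepSeek-R1-Distill-8B) to generate executable queries directly, so that users can query structured non-textual data (such as geographic service accessibility data) in natural language while avoiding the limitations of RAG on numerical and highly structured data. A synthetic data generation pipeline builds high-quality question-answer pairs, and evaluation on data from the Durangaldea region of Spain shows generalization across languages and locations as well as deployment feasibility.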

Details Motivation: Existing RAG methods struggle with numerical and highly structured non-textual data, offer no end-to-end path from a user's natural-language question to a precise executable query, and depend on large models that are unsuitable for resource-constrained settings. Method: An end-to-end executable query generation framework: a synthetic training data pipeline grounded in dataset semantics and user intent; fine-tuning of DeepSeek-R1-Distill-8B with QLoRA and 4-bit quantization; support for multilingual and zero-shot location queries. Result: On the Durangaldea (Spain) service accessibility dataset, the fine-tuned model achieves high accuracy in monolingual, multilingual, and unseen-location scenarios, showing robust generalization and reliable query generation. Conclusion: Lightweight domain-specific models can serve as natural-language interfaces to structured data with high precision without relying on large proprietary LLMs; the method suits low-resource deployment and can extend to multi-dataset systems. Abstract: This paper presents an open-source methodology for allowing users to query structured, non-textual datasets through natural language. Unlike Retrieval-Augmented Generation (RAG), which struggles with numerical and highly structured information, our approach trains an LLM to generate executable queries. To support this capability, we introduce a principled pipeline for synthetic training data generation, producing diverse question-answer pairs that capture both user intent and the semantics of the underlying dataset. We fine-tune a compact model, DeepSeek-R1-Distill-8B, using QLoRA with 4-bit quantization, making the system suitable for deployment on commodity hardware. We evaluate our approach on a dataset describing accessibility to essential services across Durangaldea, Spain. The fine-tuned model achieves high accuracy across monolingual, multilingual, and unseen-location scenarios, demonstrating both robust generalization and reliable query generation. Our results highlight that small, domain-specific models can achieve high precision for this task without relying on large proprietary LLMs, making this methodology suitable for resource-constrained environments and adaptable to broader multi-dataset systems.

[40] Verbalizing LLMs' assumptions to explain and control sycophancy

Myra Cheng,Isabel Sieh,Humishka Zope,Sunny Yu,Lujain Ibrahim,Aryaman Arora,Jared Moore,Desmond Ong,Dan Jurafsky,Diyi Yang

Main category: cs.CL

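TL;DR: This paper proposes the Verbalized Assumptions framework, which traces the socially sycophantic behavior of large language models (LLMs) to incorrect assumptions about user intent (such as overestimating how often users seek validation rather than information), and uses interpretable linear probes to provide evidence of a causal link between these assumptions and behavior.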

Details Motivation: When answering questions such as "am I in the wrong?", LLMs tend to affirm the user rather than give a genuine assessment; the authors hypothesize that this stems from misjudging user intent (underestimating the need for information and overestimating the need for reassurance). Method: The Verbalized Assumptions framework explicitly elicits the LLM's implicit assumptions about user intent; assumption probes (linear probes) analyze internal representations; the study also compares people's expectations of AI versus other humans and examines the effect of training data bias. Result: On social sycophancy datasets, the assumption LLMs make most often is that the user is "seeking validation"; assumption probes enable fine-grained, interpretable steering of social sycophancy; LLMs trained on human-human conversation do not adapt to the higher expectation of objectivity that users bring to human-AI interaction. Conclusion: Social sycophancy in LLMs arises from systematically incorrect assumptions about user intent; explicitly modeling and intervening on these assumptions is a new and effective route to understanding and mitigating such safety issues. Abstract: LLMs can be socially sycophantic, affirming users when they ask questions like "am I in the wrong?" rather than providing genuine assessment. We hypothesize that this behavior arises from incorrect assumptions about the user, like underestimating how often users are seeking information over reassurance. We present Verbalized Assumptions, a framework for eliciting these assumptions from LLMs. Verbalized Assumptions provide insight into LLM sycophancy, delusion, and other safety issues, e.g., the top bigram in LLMs' assumptions on social sycophancy datasets is "seeking validation." We provide evidence for a causal link between Verbalized Assumptions and sycophantic model behavior: our assumption probes (linear probes trained on internal representations of these assumptions) enable interpretable fine-grained steering of social sycophancy. We explore why LLMs default to sycophantic assumptions: on identical queries, people expect more objective and informative responses from AI than from other humans, but LLMs trained on human-human conversation do not account for this difference in expectations. Our work contributes a new understanding of assumptions as a mechanism for sycophancy.

[41] Multi-Aspect Knowledge Distillation for Language Model with Low-rank Factorization

Zihe Liu,Yulong Mao,Jinan Xu,Xinrui Peng,Kaiyu Huang

Main category: cs.CL

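TL;DR: This paper proposes Multi-aspect Knowledge Distillation (MaKD), which mimics the self-attention and feed-forward modules in greater depth to capture rich language knowledge at different aspects, mitigating the loss of fine-grained information caused by existing methods that focus only on the knowledge distribution among layers. Experiments show that MaKD is competitive with several strong baselines under the same parameter budget and also works for distilling auto-regressive models.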

Details Motivation: Existing knowledge distillation methods focus only on the knowledge distribution among layers, so fine-grained information is lost during alignment. Method: Multi-aspect Knowledge Distillation (MaKD) mimics the self-attention and feed-forward modules in depth to capture language knowledge from multiple aspects. Result: MaKD achieves performance competitive with several strong baselines under the same parameter budget and performs well when distilling auto-regressive models. Conclusion: MaKD mitigates the loss of fine-grained information, improves compression of pre-trained language models, and generalizes well. Abstract: Knowledge distillation is an effective technique for pre-trained language model compression. However, existing methods only focus on the knowledge distribution among layers, which may cause the loss of fine-grained information in the alignment process. To address this issue, we introduce the Multi-aspect Knowledge Distillation (MaKD) method, which mimics the self-attention and feed-forward modules in greater depth to capture rich language knowledge information at different aspects. Experimental results demonstrate that MaKD can achieve competitive performance compared with various strong baselines with the same storage parameter budget. In addition, our method also performs well in distilling auto-regressive architecture models.

[42] Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts

Jinsook Lee,Kirk Vanacore,Zhuqian Zhou,Bakhtawar Ahtisham,Rene F. Kizilcec

Main category: cs.CL

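TL;DR: This paper proposes a domain-adapted RAG pipeline for annotating pedagogical dialogue. Instead of fine-tuning the large language model itself, it fine-tunes a lightweight embedding model and indexes dialogues at the utterance level to retrieve labeled demonstrations. Annotation agreement (Cohen's κ) improves substantially across datasets and models, and utterance-level indexing matters more than embedding quality.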

Details Motivation: Automated annotation of pedagogical dialogue is a high-stakes task, and large language models (LLMs) perform poorly without sufficient domain knowledge; existing approaches mostly rely on model fine-tuning, which is costly and inflexible. Method: A domain-adapted RAG pipeline: 1) fine-tune a lightweight embedding model on tutoring corpora; 2) index labeled tutoring dialogues at the utterance level; 3) at inference time, retrieve the most relevant few-shot demonstrations for the LLM; the generative model is not fine-tuned. Result: On two real tutoring dialogue datasets (TalkMoves and Eedi) with three LLMs (GPT-5.2, Claude Sonnet 4.6, Qwen3-32b), Cohen's κ reaches 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, well above the no-retrieval baselines (0.275-0.413 and 0.160-0.410); an ablation shows that utterance-level indexing is the main driver, with top-1 label match rates rising by 22.3 and 20.2 percentage points respectively; retrieval also reduces systematic label bias in zero-shot prompting, especially for rare and context-dependent labels. Conclusion: Adapting only the retrieval component of RAG, not the generative model, is a practical, low-cost, and effective route to high-quality pedagogical dialogue annotation. Abstract: Automated annotation of pedagogical dialogue is a high-stakes task where LLMs often fail without sufficient domain grounding. We present a domain-adapted RAG pipeline for tutoring move annotation. Rather than fine-tuning the generative model, we adapt retrieval by fine-tuning a lightweight embedding model on tutoring corpora and indexing dialogues at the utterance level to retrieve labeled few-shot demonstrations. Evaluated across two real tutoring dialogue datasets (TalkMoves and Eedi) and three LLM backbones (GPT-5.2, Claude Sonnet 4.6, Qwen3-32b), our best configuration achieves Cohen's $κ$ of 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, substantially outperforming no-retrieval baselines ($κ= 0.275$-$0.413$ and $0.160$-$0.410$). An ablation study reveals that utterance-level indexing, rather than embedding quality alone, is the primary driver of these gains, with top-1 label match rates improving from 39.7% to 62.0% on TalkMoves and 52.9% to 73.1% on Eedi under domain-adapted retrieval. Retrieval also corrects systematic label biases present in zero-shot prompting and yields the largest improvements for rare and context-dependent labels. These findings suggest that adapting the retrieval component alone is a practical and effective path toward expert-level pedagogical dialogue annotation while keeping the generative model frozen.
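The agreement figures above are reported as Cohen's κ. As a reminder of what that statistic measures, here is a minimal sketch of the standard two-rater formula, κ = (p_o - p_e) / (1 - p_e). The function name and the toy labels are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two label sequences of equal length."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: a human annotator versus an LLM annotator.
human = ["probe", "probe", "revoice", "revoice"]
model = ["probe", "revoice", "revoice", "revoice"]
print(cohens_kappa(human, model))
```

Unlike raw accuracy, κ discounts the agreement that two annotators would reach by chance given how often each uses every label.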

[43] StoryScope: Investigating idiosyncrasies in AI fiction

Jenna Russell,Rishanth Rajendhran,Mohit Iyyer,John Wieting

Main category: cs.CL

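TL;DR: This paper proposes StoryScope, which automatically extracts narrative features across 10 dimensions (such as character agency and chronological continuity) to surface deep narrative differences between AI and human fiction. Without relying on stylistic cues it reaches 93.2% macro-F1 for human vs. AI detection and reveals a distinct narrative "fingerprint" for each large model.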

Details Motivation: Existing work mostly focuses on surface-level stylistic features of AI text; this paper asks whether deeper discourse-level narrative choices (such as character agency and chronological jumps) can distinguish AI fiction from human fiction, addressing questions of authorship and originality. Method: The StoryScope pipeline automatically extracts 304 fine-grained, interpretable discourse-level narrative features across 10 dimensions from a parallel corpus (10,272 prompts, each answered by a human author and five LLMs, yielding 61,608 stories of about 5,000 words), and builds classifiers on these features for human vs. AI detection and six-way authorship attribution. Result: Narrative features alone reach 93.2% macro-F1 for human vs. AI detection and 68.4% macro-F1 for six-way attribution, retaining over 97% of the performance of models that include stylistic cues; AI stories tend to over-explain themes and use single-track plots, while human stories show more moral ambiguity and temporal complexity; each model has its own narrative fingerprint (Claude shows flat event escalation, GPT favors dream sequences, Gemini leans on external character description); AI stories cluster tightly in narrative space while human stories are more dispersed. Conclusion: The difference between AI and human fiction lies not only in language style but also in how narratives are constructed; systematic narrative-level biases provide a stable, interpretable, and model-specific basis for detection, with implications for AI content governance and the assessment of literary originality. Abstract: As AI-generated fiction becomes increasingly prevalent, questions of authorship and originality are becoming central to how written work is evaluated. While most existing work in this space focuses on identifying surface-level signatures of AI writing, we ask instead whether AI-generated stories can be distinguished from human ones without relying on stylistic signals, focusing on discourse-level narrative choices such as character agency and chronological discontinuity. We propose StoryScope, a pipeline that automatically induces a fine-grained, interpretable feature space of discourse-level narrative features across 10 dimensions. We apply StoryScope to a parallel corpus of 10,272 writing prompts, each written by a human author and five LLMs, yielding 61,608 stories, each ~5,000 words, and 304 extracted features per story. Narrative features alone achieve 93.2% macro-F1 for human vs. AI detection and 68.4% macro-F1 for six-way authorship attribution, retaining over 97% of the performance of models that include stylistic cues. A compact set of 30 core narrative features captures much of this signal: AI stories over-explain themes and favor tidy, single-track plots while human stories frame protagonists' choices as more morally ambiguous and have increased temporal complexity. Per-model fingerprint features enable six-way attribution: for example, Claude produces notably flat event escalation, GPT over-indexes on dream sequences, and Gemini defaults to external character description. We find that AI-generated stories cluster in a shared region of narrative space, while human-authored stories exhibit greater diversity. More broadly, these results suggest that differences in underlying narrative construction, not just writing style, can be used to separate human-written original works from AI-generated fiction.

[44] Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation

Nazanin Jafari,James Allan,Mohit Iyyer

Main category: cs.CL

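TL;DR: This paper proposes a framework for evaluating the factuality of long-form LLM generation that measures both precision and recall and introduces an importance-aware weighting scheme based on relevance and salience.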

Details Motivation: Existing methods focus only on the precision side of factuality and overlook the equally important recall side, that is, whether the generated content covers the relevant facts that should be included. Method: External knowledge sources are used to construct a set of reference facts and to judge whether the generated text captures them; an importance-aware weighting scheme based on relevance and salience is added. Result: Current LLMs perform clearly better on precision than on recall, which indicates factual incompleteness in long-form generation; models are better at covering highly important facts than the full set of relevant facts. Conclusion: Factuality evaluation should cover both precision and recall; factual incompleteness is a major limitation of current long-form generation. Abstract: Evaluating the factuality of long-form output generated by large language models (LLMs) remains challenging, particularly when responses are open-ended and contain many fine-grained factual statements. Existing evaluation methods primarily focus on precision: they decompose a response into atomic claims and verify each claim against external knowledge sources such as Wikipedia. However, this overlooks an equally important dimension of factuality: recall, whether the generated response covers the relevant facts that should be included. We propose a comprehensive factuality evaluation framework that jointly measures precision and recall. Our method leverages external knowledge sources to construct reference facts and determine whether they are captured in generated text. We further introduce an importance-aware weighting scheme based on relevance and salience. Our analysis reveals that current LLMs perform substantially better on precision than on recall, suggesting that factual incompleteness remains a major limitation of long-form generation and that models are generally better at covering highly important facts than the full set of relevant facts.
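To make the precision/recall distinction concrete, here is a minimal sketch of claim-level precision alongside an importance-weighted recall. The abstract does not give the paper's weighting formula, so the weights below are placeholder numbers and the function names are hypothetical; this shows the general idea, not the authors' implementation.

```python
def precision(claim_supported):
    """Share of generated atomic claims that are verified as supported."""
    if not claim_supported:
        raise ValueError("need at least one claim")
    return sum(claim_supported) / len(claim_supported)


def weighted_recall(reference_facts):
    """Importance-weighted share of reference facts covered by the response.

    reference_facts: list of (importance_weight, covered) pairs. The weights
    stand in for a relevance/salience score (an assumption in this sketch).
    """
    total = sum(weight for weight, _ in reference_facts)
    if total <= 0:
        raise ValueError("total importance weight must be positive")
    covered = sum(weight for weight, is_covered in reference_facts if is_covered)
    return covered / total


# Hypothetical response: 4 claims, 3 supported; 3 reference facts, 2 covered.
claims = [True, True, False, True]
facts = [(3.0, True), (1.0, False), (1.0, True)]
print(precision(claims), weighted_recall(facts))
```

With these toy numbers the unweighted recall would be 2/3, while the weighted value is higher because the most important fact is covered, which mirrors the abstract's observation that models cover highly important facts better than the full set.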

[45] Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control

Lihao Sun,Lewen Yan,Xiaoya Lu,Andrew Lee,Jie Zhang,Jing Shao

Main category: cs.CL

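TL;DR: This paper presents a method for identifying a valence-arousal (VA) subspace in LLM representations. Emotion steering vectors are extracted from emotion-labeled texts, and VA axes are learned with principal component analysis and ridge regression. The subspace shows circular geometry consistent with human emotion perception, its projections correlate with human VA ratings, and steering generation along it gives control over affective dimensions, refusal, and sycophancy, with consistent effects across several models.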

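Details Motivation: To explore whether LLMs implicitly contain a valence-arousal (VA) structure consistent with human emotion perception, and to establish an interpretable, steerable emotion subspace for directed control of the affective tone of outputs and of related behaviors (such as refusal and sycophancy). Method: Emotion steering vectors are built from 211k emotion-labeled texts and reduced with PCA; ridge regression against the model's self-reported VA scores then yields the VA axes; the study verifies the subspace geometry, its correlation with human VA ratings, and the effect of steering along the VA axes on affective dimensions and on refusal and sycophancy. Result: A VA subspace with circular geometry is identified; its projections correlate with human VA ratings for 44k lexical items; steering along the VA axes shifts affective dimensions monotonically and controls refusal (stronger at low arousal) and sycophancy (stronger at high arousal) in both directions; the effects replicate on Llama-3.1-8B, Qwen3-8B, and Qwen3-14B. Conclusion: The hidden representations of LLMs contain a linearly decodable, human-like VA emotion structure; it supports modeling of emotion perception and can serve as a unified mechanism for explaining and controlling several higher-level behavioral tendencies (such as refusal and sycophancy), offering a new route to affective alignment and controllable generation. Abstract: We present a method to identify a valence-arousal (VA) subspace within large language model representations. From 211k emotion-labeled texts, we derive emotion steering vectors, then learn VA axes as linear combinations of their top PCA components via ridge regression on the model's self-reported valence-arousal scores. The resulting VA subspace exhibits circular geometry consistent with established models of human emotion perception. Projections along our recovered VA subspace correlate with human-crowdsourced VA ratings across 44k lexical items. Furthermore, steering generation along these axes produces monotonic shifts in the corresponding affective dimensions of model outputs. Steering along these directions also induces near-monotonic bidirectional control over refusal and sycophancy: increasing arousal decreases refusal and increases sycophancy, and vice versa. These effects replicate across Llama-3.1-8B, Qwen3-8B, and Qwen3-14B, demonstrating cross-architecture generality. We provide a mechanistic account for these effects and prior emotionally-framed controls: refusal-associated tokens ("I can't," "sorry") occupy low-arousal, negative-valence regions, so VA steering directly modulates their emission probability.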
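The two operations the abstract relies on, projecting a representation onto the VA axes and steering along an axis, can be illustrated with plain vectors. This is a minimal sketch under strong assumptions: the three-dimensional vectors, the axis values, and the function names are all hypothetical, and real axes would be learned from model activations as the abstract describes.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_to_va(hidden, valence_axis, arousal_axis):
    """Return (valence, arousal) coordinates of a hidden vector."""
    return dot(hidden, valence_axis), dot(hidden, arousal_axis)


def circumplex_angle(valence, arousal):
    """Angle in degrees of a point in the VA plane, for inspecting circular layout."""
    return math.degrees(math.atan2(arousal, valence))


def steer(hidden, axis, alpha):
    """Shift a hidden vector by alpha along one axis (additive steering)."""
    return [h + alpha * a for h, a in zip(hidden, axis)]


# Hypothetical unit axes in a toy 3-dimensional representation space.
valence_axis = [1.0, 0.0, 0.0]
arousal_axis = [0.0, 1.0, 0.0]
hidden = [1.0, 1.0, 5.0]

v, a = project_to_va(hidden, valence_axis, arousal_axis)
print(v, a, circumplex_angle(v, a))
print(project_to_va(steer(hidden, arousal_axis, 2.0), valence_axis, arousal_axis))
```

Plotting the angles of many emotion vectors is one way to inspect whether they spread around a circle, which is the geometric property the abstract reports.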

[46] Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents

Delip Rao,Eric Wong,Chris Callison-Burch

Main category: cs.CL

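TL;DR: This paper systematically evaluates the reliability of citation URLs generated by large language models and deep research agents, finds substantial URL hallucination and non-resolution, and releases urlhealth, an open-source tool for checking and classifying URL validity that markedly improves citation reliability.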

Details Motivation: Large language models and deep research agents supply citation URLs to support their claims, but the reliability of these URLs has not been systematically measured. Method: Six research questions are analyzed on DRBench (53,090 URLs, 10 models and agents) and ExpertQA (168,021 URLs across 32 academic fields), using the Wayback Machine to assess URL validity; the open-source tool urlhealth is developed for URL liveness checking and hallucination classification. Result: 3-13% of citation URLs are hallucinated (never existed) and 5-18% are non-resolving overall; deep research agents produce more citations than search-augmented LLMs but hallucinate at higher rates; failure rates differ clearly across fields; with urlhealth, non-resolving URLs fall by 6-79 times to under 1%. Conclusion: Citation URL validity can be measured at scale and corrected in practice with tools such as urlhealth, offering a workable route to more trustworthy AI-generated content. Abstract: Large language models and deep research agents supply citation URLs to support their claims, yet the reliability of these citations has not been systematically measured. We address six research questions about citation URL validity using 10 models and agents on DRBench (53,090 URLs) and 3 models on ExpertQA (168,021 URLs across 32 academic fields). We find that 3-13% of citation URLs are hallucinated (they have no record in the Wayback Machine and likely never existed), while 5-18% are non-resolving overall. Deep research agents generate substantially more citations per query than search-augmented LLMs but hallucinate URLs at higher rates. Domain effects are pronounced: non-resolving rates range from 5.4% (Business) to 11.4% (Theology), with per-model effects even larger. Decomposing failures reveals that some models fabricate every non-resolving URL, while others show substantial link-rot fractions indicating genuine retrieval. As a solution, we release urlhealth, an open-source tool for URL liveness checking and stale-vs-hallucinated classification using the Wayback Machine. In agentic self-correction experiments, models equipped with urlhealth reduce non-resolving citation URLs by 6-79× to under 1%, though effectiveness depends on the model's tool-use competence. The tool and all data are publicly available. Our characterization findings, failure taxonomy, and open-source tooling establish that citation URL validity is both measurable at scale and correctable in practice.
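The stale-vs-hallucinated distinction described in the abstract comes down to two observations per URL: whether it resolves now, and whether an archive has any record of it. The sketch below encodes that decision rule over precomputed booleans. It is not the urlhealth API; the class, function names, and inputs are hypothetical, and no network calls are made.

```python
from collections import Counter
from enum import Enum


class UrlStatus(Enum):
    LIVE = "live"
    STALE = "stale"                              # archived once, dead now (link rot)
    LIKELY_HALLUCINATED = "likely_hallucinated"  # dead and never archived


def classify_url(resolves: bool, has_archive_record: bool) -> UrlStatus:
    """Classify one citation URL from two precomputed observations."""
    if resolves:
        return UrlStatus.LIVE
    if has_archive_record:
        return UrlStatus.STALE
    return UrlStatus.LIKELY_HALLUCINATED


def status_rates(checks):
    """Return the share of URLs in each status for (resolves, archived) pairs."""
    if not checks:
        raise ValueError("need at least one URL check")
    counts = Counter(classify_url(r, a) for r, a in checks)
    return {status: counts[status] / len(checks) for status in UrlStatus}


# Hypothetical observations for four cited URLs.
checks = [(True, True), (True, False), (False, True), (False, False)]
for status, rate in status_rates(checks).items():
    print(status.value, rate)
```

In this framing the non-resolving rate is the sum of the stale and likely-hallucinated shares, which matches how the abstract reports hallucinated URLs as a subset of non-resolving ones.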

[47] Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation

Prakhar Bansal,Shivangi Agarwal

Main category: cs.CL

TL;DR: This survey presents a unified framework for evaluating LLM augmentation strategies based on the degree of structured context provided during inference, covering methods from prompt engineering to CausalRAG, and introduces rigorous evaluation protocols and a deployment-oriented decision framework.

Details Motivation: LLMs suffer from static knowledge, limited context windows, and poor causal reasoning; thus, augmentation strategies are needed to enhance their capabilities at inference time. Method: The paper proposes a unified axis—degree of structured context—to categorize and compare augmentation techniques (e.g., in-context learning, RAG, GraphRAG, CausalRAG); it also introduces a literature-screening protocol, claim-audit framework, and cross-paper evidence synthesis. Result: A transparent, evidence-based comparison of augmentation strategies; identification of higher-confidence findings vs. emerging results; and a practical decision framework for deploying retrieval-augmented NLP systems. Conclusion: Structured context augmentation significantly improves LLM reasoning and reliability, and future work should prioritize trustworthy, interpretable, and deployable retrieval-augmented NLP systems. Abstract: Large language models (LLMs) encode vast world knowledge in their parameters, yet they remain fundamentally limited by static knowledge, finite context windows, and weakly structured causal reasoning. This survey provides a unified account of augmentation strategies along a single axis: the degree of structured context supplied at inference time. We cover in-context learning and prompt engineering, Retrieval-Augmented Generation (RAG), GraphRAG, and CausalRAG. Beyond conceptual comparison, we provide a transparent literature-screening protocol, a claim-audit framework, and a structured cross-paper evidence synthesis that distinguishes higher-confidence findings from emerging results. The paper concludes with a deployment-oriented decision framework and concrete research priorities for trustworthy retrieval-augmented NLP.

[48] Reliability Gated Multi-Teacher Distillation for Low Resource Abstractive Summarization

Dipto Sumit,Ankan Kumar Roy,Sadia Khair Rodela,Atia Haque Asha,Mourchona Afrin,Niloy Farhan,Farig Yousuf Sadeque

Main category: cs.CL

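TL;DR: This paper proposes a reliability-focused multi-teacher knowledge distillation approach for low-resource abstractive summarization, built on two core mechanisms, EWAD (Entropy-Weighted Agreement-Aware Distillation) and CPDP (Capacity-Proportional Divergence Preservation), and evaluates its effectiveness and reliability across multiple models and language settings.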

Details Motivation: To address the reliability of multi-teacher knowledge distillation for abstractive summarization in low-resource languages (such as Bangla), and to avoid relying on complex distillation strategies while overlooking output stability and calibration. Method: EWAD (token-level, entropy-weighted supervision routing based on inter-teacher agreement) and CPDP (a geometric constraint on the student's position in the logit space of heterogeneous teachers), combined with logit-level KD, cross-lingual pseudo-label KD, and a human-validated multi-judge LLM evaluation. Result: Logit-level KD is the most stable; complex distillation improves semantic similarity for short summaries but harms long-summary quality; cross-lingual pseudo-label KD retains 71-122% of teacher ROUGE-L at 3.2x compression; the multi-judge evaluation reveals calibration bias in single-judge pipelines. Conclusion: Reliability-aware multi-teacher distillation helps clarify when multi-teacher supervision works and when scaling data should take priority over designing complex loss functions. Abstract: We study multi-teacher knowledge distillation for low-resource abstractive summarization from a reliability-aware perspective. We introduce EWAD (Entropy-Weighted Agreement-Aware Distillation), a token-level mechanism that routes supervision between teacher distillation and gold supervision based on inter-teacher agreement, and CPDP (Capacity-Proportional Divergence Preservation), a geometric constraint on the student position relative to heterogeneous teachers. Across two Bangla datasets, 13 BanglaT5 ablations, and eight Qwen2.5 experiments, we find that logit-level KD provides the most reliable gains, while more complex distillation improves semantic similarity for short summaries but degrades longer outputs. Cross-lingual pseudo-label KD across ten languages retains 71-122% of teacher ROUGE-L at 3.2x compression. A human-validated multi-judge LLM evaluation further reveals calibration bias in single-judge pipelines. Overall, our results show that reliability-aware distillation helps characterize when multi-teacher supervision improves summarization and when data scaling outweighs loss engineering.

[49] Learning the Signature of Memorization in Autoregressive Language Models

David Ilić,Kostadin Cvejoski,David Stanojević,Evgeny Grigorenko

Main category: cs.CL

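TL;DR: This paper proposes LT-MIA, the first transferable learned membership inference attack, which trains a classifier on the abundant labeled data that fine-tuning naturally produces. It finds that fine-tuned language models carry an invariant memorization signature that holds across architectures and domains and is independent of the detection method, and it transfers zero-shot to Mamba, RWKV-4, RecurrentGemma, and the code domain, outperforming hand-crafted baselines.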

Details Motivation: Existing membership inference attacks on fine-tuned language models rely on hand-crafted heuristics that are bounded by the designer's intuition and lack generality and scalability, whereas fine-tuning itself can generate unlimited samples with known membership, which makes a data-driven deep learning approach possible. Method: Learned Transfer MIA (LT-MIA) recasts membership inference as sequence classification over per-token distributional statistics; a binary classifier is trained only on transformer models, without shadow models, relying on the memorization signal shared by models trained with gradient descent on cross-entropy loss. Result: LT-MIA transfers zero-shot to the unseen Mamba, RWKV-4, and RecurrentGemma architectures with AUC of 0.963, 0.972, and 0.936, all above the held-out transformer test set (0.908); it reaches 0.865 AUC on code; on transformers its TPR at 0.1% FPR is 2.8 times that of the best baseline; simple likelihood-based methods also transfer strongly, supporting the generality of the signature. Conclusion: Fine-tuned language models carry a memorization signature determined by the optimization objective (cross-entropy with gradient descent) and independent of architecture and modality; LT-MIA is the first learned, generalizable membership inference attack and brings the task into the deep learning paradigm. Abstract: All prior membership inference attacks for fine-tuned language models use hand-crafted heuristics (e.g., loss thresholding, Min-K%, reference calibration), each bounded by the designer's intuition. We introduce the first transferable learned attack, enabled by the observation that fine-tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow model bottleneck and brings membership inference into the deep learning era: learning what matters rather than designing it, with generalization through training diversity and scale. We discover that fine-tuning language models produces an invariant signature of memorization detectable across architectural families and data domains. We train a membership inference classifier exclusively on transformer-based models. It transfers zero-shot to Mamba (state-space), RWKV-4 (linear attention), and RecurrentGemma (gated recurrence), achieving 0.963, 0.972, and 0.936 AUC respectively. Each evaluation combines an architecture and dataset never seen during training, yet all three exceed performance on held-out transformers (0.908 AUC). These four families share no computational mechanisms; their only commonality is gradient descent on cross-entropy loss. Even simple likelihood-based methods exhibit strong transfer, confirming the signature exists independently of the detection method. Our method, Learned Transfer MIA (LT-MIA), captures this signal most effectively by reframing membership inference as sequence classification over per-token distributional statistics. On transformers, LT-MIA achieves 2.8× higher TPR at 0.1% FPR than the strongest baseline. The method also transfers to code (0.865 AUC) despite training only on natural language texts. Code and trained classifier available at https://github.com/JetBrains-Research/learned-mia.
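For context on the hand-crafted baselines the abstract contrasts against, here is a minimal sketch of a Min-K% style score: the mean log-probability of the k% least likely tokens in a sequence, where a higher score suggests the text was seen in training. This illustrates a baseline heuristic, not LT-MIA itself; the function name, the default k, and the toy log-probabilities are assumptions.

```python
def min_k_percent_score(token_log_probs, k=0.2):
    """Mean log-probability of the k fraction of lowest-probability tokens.

    token_log_probs: per-token log-probabilities under the target model.
    k: fraction of tokens to keep, in (0, 1]; 0.2 is an illustrative default.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    if not 0 < k <= 1:
        raise ValueError("k must be in (0, 1]")
    n_keep = max(1, int(len(token_log_probs) * k))
    lowest = sorted(token_log_probs)[:n_keep]
    return sum(lowest) / n_keep


# Hypothetical per-token log-probabilities for two short sequences.
likely_member = [-0.1, -0.2, -0.3, -0.5, -0.4]
likely_non_member = [-0.1, -0.2, -3.0, -0.5, -4.0]
print(min_k_percent_score(likely_member, k=0.4))
print(min_k_percent_score(likely_non_member, k=0.4))
```

A membership decision then needs a threshold on this score, and that kind of designer-chosen rule is what the abstract says the learned classifier replaces.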

[50] BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

Sean Wu,Fredrik K. Gustafsson,Edward Phillips,Boyan Gao,Anshul Thakur,David A. Clifton

Main category: cs.CL

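TL;DR: This paper proposes the Behavioral Alignment Score (BAS), a new decision-theoretic metric for evaluating whether an LLM's confidence supports risk-aware decisions in answer-or-abstain settings; BAS emphasizes avoiding overconfident errors, complements traditional calibration metrics such as ECE and AURC, and shows that current frontier models remain prone to severe overconfidence.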

Details Motivation: Existing evaluation protocols force models to produce an answer and ignore that, in high-risk settings, a model should be allowed to abstain based on its confidence; standard calibration metrics such as ECE and AURC do not capture the critical effect of overconfidence on decision safety. Method: Proposes the utility-based Behavioral Alignment Score (BAS), which aggregates realized utility across a continuum of risk thresholds and depends on both the magnitude and the ordering of confidence; proves theoretically that truthful confidence estimates uniquely maximize expected BAS; builds a multi-model, multi-task benchmark using BAS alongside ECE and AURC, and tests interventions such as top-k confidence elicitation and post-hoc calibration. Result: BAS distinguishes models that have similar ECE/AURC but very different decision reliability; empirically, even frontier models remain severely overconfident; simple interventions (top-k confidence elicitation, post-hoc calibration) meaningfully improve BAS. Conclusion: BAS provides a principled standard, built around an asymmetric penalty, for evaluating the decision usefulness of LLM confidence; it addresses the limitations of traditional calibration metrics and supports safer, risk-aware LLM deployment. Abstract: Large language models (LLMs) often produce confident but incorrect answers in settings where abstention would be safer. Standard evaluation protocols, however, require a response and do not account for how confidence should guide decisions under different risk preferences. To address this gap, we introduce the Behavioral Alignment Score (BAS), a decision-theoretic metric for evaluating how well LLM confidence supports abstention-aware decision making. BAS is derived from an explicit answer-or-abstain utility model and aggregates realized utility across a continuum of risk thresholds, yielding a measure of decision-level reliability that depends on both the magnitude and ordering of confidence. We show theoretically that truthful confidence estimates uniquely maximize expected BAS utility, linking calibration to decision-optimal behavior. BAS is related to proper scoring rules such as log loss, but differs structurally: log loss penalizes underconfidence and overconfidence symmetrically, whereas BAS imposes an asymmetric penalty that strongly prioritizes avoiding overconfident errors. Using BAS alongside widely used metrics such as ECE and AURC, we then construct a benchmark of self-reported confidence reliability across multiple LLMs and tasks. Our results reveal substantial variation in decision-useful confidence, and while larger and more accurate models tend to achieve higher BAS, even frontier models remain prone to severe overconfidence. 
Importantly, models with similar ECE or AURC can exhibit very different BAS due to highly overconfident errors, highlighting limitations of standard metrics. We further show that simple interventions, such as top-$k$ confidence elicitation and post-hoc calibration, can meaningfully improve confidence reliability. Overall, our work provides both a principled metric and a comprehensive benchmark for evaluating LLM confidence reliability.
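The abstract does not give the BAS formula, so the following is only a sketch of one standard answer-or-abstain construction that has the properties described (aggregation over risk thresholds, asymmetric penalty). The penalty `t / (1 - t)`, the uniform weighting over thresholds, and all function names are assumptions for illustration, not the paper's definition.

```python
import math

def threshold_utility(confidence, correct, t):
    """Utility at risk threshold t: the model answers only if confidence >= t."""
    if confidence < t:
        return 0.0  # abstain
    # Assumed penalty: a wrong answer costs t / (1 - t), so answering pays off
    # in expectation exactly when the true probability of being correct is >= t.
    return 1.0 if correct else -t / (1.0 - t)

def aggregated_utility(confidence, correct, steps=10_000):
    """Average threshold_utility over t in [0, 1) with a midpoint rule."""
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) / steps
        total += threshold_utility(confidence, correct, t)
    return total / steps

def closed_form(confidence, correct):
    """Analytic value of the same integral (needs confidence < 1 when wrong)."""
    return confidence if correct else confidence + math.log(1.0 - confidence)
```

Under these assumptions a correct answer at confidence `c` earns `c`, while a wrong one earns `c + ln(1 - c)`, which diverges to minus infinity as `c` approaches 1. That is one way an asymmetric penalty on overconfident errors can arise from a threshold-based utility model.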

cs.CV [Back]

[51] Internalized Reasoning for Long-Context Visual Document Understanding

Austin Veselka

Main category: cs.CV

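TL;DR: This paper proposes a synthetic reasoning data pipeline for long-document visual understanding that builds chain-of-thought (CoT) training data through page relevance scoring, evidence extraction, and ordering, and combines SFT with low-strength model merging to substantially improve the performance and inference efficiency of smaller models on long-document understanding tasks.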

Details Motivation: Existing open-source recipes for long-document visual understanding have not explored reasoning, even though reasoning has been shown to drive large gains on tasks such as math and code. Method: Builds a synthetic data pipeline that scores each page for relevance to the question, extracts textual evidence, and orders it by relevance to form a thinking trace; applies supervised fine-tuning (SFT) gated by a control token, then internalizes the reasoning capability via low-strength model merging. Result: Qwen3 VL 32B reaches 58.3 on MMLongBenchDoc, surpassing the 7× larger Qwen3 VL 235B A22B (57.0); with Mistral Small 3.1 24B, synthetic reasoning outperforms distillation from the Thinking version by 3.8 points on MMLBD-C, and internalized reasoning produces 12.4× fewer mean output tokens. Conclusion: Synthetic reasoning data combined with lightweight model merging can efficiently give small and mid-sized multimodal models strong long-document reasoning ability, balancing performance and inference efficiency. Abstract: Visual long-document understanding is critical for enterprise, legal, and scientific applications, yet the best performing open recipes have not explored reasoning, a capability which has driven leaps in math and code performance. We introduce a synthetic data pipeline for reasoning in long-document understanding that generates thinking traces by scoring each page for question relevance, extracting textual evidence and ordering it from most to least relevant. We apply SFT to the resulting traces within \texttt{} tags, gated by a \texttt{} control token, and the resulting reasoning capability is internalized via low-strength model merging. We study Qwen3 VL 32B and Mistral Small 3.1 24B. With Qwen3 VL, we achieve 58.3 on MMLongBenchDoc, surpassing the 7$\times$ larger Qwen3 VL 235B A22B (57.0). With Mistral, we show that synthetic reasoning outperforms distillation from the Thinking version's traces by 3.8 points on MMLBD-C, and internalized reasoning exhibits 12.4$\times$ fewer mean output tokens compared to explicit reasoning. We release our pipeline for reproducibility and further exploration.
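The abstract does not say how the low-strength merge is computed. A common choice is linear interpolation between the base and fine-tuned weights, sketched below; the function name, the plain-list parameter format, and the default `strength` are all hypothetical.

```python
def merge_weights(base, finetuned, strength=0.2):
    """Linear interpolation per parameter: base + strength * (finetuned - base).

    Assumption: both checkpoints are dicts mapping parameter names to flat
    lists of floats; a small `strength` keeps the result close to the base.
    """
    if base.keys() != finetuned.keys():
        raise ValueError("checkpoints must share parameter names")
    merged = {}
    for name, base_values in base.items():
        tuned_values = finetuned[name]
        merged[name] = [b + strength * (f - b)
                        for b, f in zip(base_values, tuned_values)]
    return merged
```

With `strength=0.0` the base model is returned unchanged and with `strength=1.0` the fine-tuned model is returned, so the parameter acts as a dial for how much of the reasoning-tuned behavior is carried over.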

[52] Beyond Fixed Inference: Quantitative Flow Matching for Adaptive Image Denoising

Jigang Duan,Genwei Ma,Xu Jiang,Wenfeng Xu,Ping Yang,Xing Zhao

Main category: cs.CV

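TL;DR: This paper proposes a quantitative flow matching framework for adaptive image denoising that estimates the input noise level and uses it to adjust the inference trajectory (starting point, number of steps, and step size), improving both denoising accuracy and inference efficiency.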

Details Motivation: Existing diffusion- and flow-based generative models denoise poorly under unknown and varying noise conditions, because the learned vector fields are inconsistent across noise levels and performance degrades when training and inference noise do not match. Method: Proposes a quantitative flow matching framework that first estimates the input noise level from local pixel statistics and then adapts the starting point, the number of integration steps, and the step-size schedule of flow matching inference. Result: Experiments on natural, medical, and microscopy images show robustness and strong generalization across noise levels and imaging conditions, with gains in both restoration accuracy and inference efficiency. Conclusion: Coupling quantitative noise estimation with noise-adaptive flow inference effectively addresses image denoising under unknown noise while balancing accuracy and computational cost. Abstract: Diffusion and flow-based generative models have shown strong potential for image restoration. However, image denoising under unknown and varying noise conditions remains challenging, because the learned vector fields may become inconsistent across different noise levels, leading to degraded restoration quality under mismatch between training and inference. To address this issue, we propose a quantitative flow matching framework for adaptive image denoising. The method first estimates the input noise level from local pixel statistics, and then uses this quantitative estimate to adapt the inference trajectory, including the starting point, the number of integration steps, and the step-size schedule. In this way, the denoising process is better aligned with the actual corruption level of each input, reducing unnecessary computation for lightly corrupted images while providing sufficient refinement for heavily degraded ones. By coupling quantitative noise estimation with noise-adaptive flow inference, the proposed method improves both restoration accuracy and inference efficiency. Extensive experiments on natural, medical, and microscopy images demonstrate its robustness and strong generalization across diverse noise levels and imaging conditions.
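The two steps described above (estimate noise, then adapt the trajectory) can be sketched as follows. The estimator (median absolute neighbour difference), the linear mapping from noise level to start time, and the constants `sigma_max` and `max_steps` are illustrative assumptions; the paper's actual estimator and schedule are not given in the abstract.

```python
import math
import statistics

def estimate_sigma(image):
    """Robust noise estimate from horizontal neighbour differences.

    Assumption: `image` is a list of rows of floats in [0, 1] and the
    underlying clean image is locally smooth.
    """
    diffs = [abs(row[j + 1] - row[j])
             for row in image for j in range(len(row) - 1)]
    # For i.i.d. Gaussian noise the difference has std sigma * sqrt(2), and
    # median(|x|) is about 0.6745 * std for a zero-mean Gaussian.
    return statistics.median(diffs) / (0.6745 * math.sqrt(2))

def plan_inference(sigma, sigma_max=0.5, max_steps=20):
    """Map the noise estimate to (start time, step count, step size)."""
    severity = min(max(sigma / sigma_max, 0.0), 1.0)
    t_start = 1.0 - severity  # cleaner inputs start closer to t = 1
    steps = max(1, math.ceil(max_steps * severity))
    return t_start, steps, (1.0 - t_start) / steps
```

In this sketch a lightly corrupted image gets a late start time and few steps, while a heavily corrupted one integrates over the full interval, which mirrors the compute trade-off the abstract describes.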

[53] Environment-Aware Channel Prediction for Vehicular Communications: A Multimodal Visual Feature Fusion Framework

Xuejian Zhang,Ruisi He,Minseok Kim,Inocent Calist,Mi Yang,Ziyi Qi

Main category: cs.CV

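TL;DR: This paper proposes an environment-aware channel prediction framework based on multimodal visual feature fusion. It uses vehicle-side panoramic RGB images and GPS data, together with semantic segmentation and depth estimation, to extract semantic, depth, and position features, fuses them adaptively through a squeeze-excitation attention gating module, and jointly predicts path loss, delay spread, azimuth spread of arrival and departure, and the angular power spectrum with high accuracy.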

Details Motivation: 6G vehicular networks demand channel prediction that is highly reliable, low-latency, and adaptable; traditional models struggle to balance accuracy, generalization, and deployability, while onboard and roadside sensing devices provide rich environmental priors. Method: Builds a three-branch network that extracts semantic, depth, and position features and fuses them adaptively with a squeeze-excitation attention gating module; designs a dedicated regression head and a composite multi-constraint loss for 360-dimensional angular power spectrum prediction, enabling joint prediction of multiple channel parameters. Result: On a synchronized urban V2I measurement dataset, the RMSE is 3.26 dB for path loss and 37.66 ns, 5.05 degrees, and 5.08 degrees for delay spread, azimuth spread of arrival, and azimuth spread of departure, respectively; the mean/median cosine similarity of the angular power spectrum is 0.9342/0.9571. Conclusion: The environment-aware framework markedly improves the accuracy, generalization, and practicality of channel prediction and offers a feasible technical path for channel prediction in 6G intelligent vehicular networks. Abstract: The deep integration of communication with intelligence and sensing, as a defining vision of 6G, renders environment-aware channel prediction a key enabling technology. As a representative 6G application, vehicular communications require accurate and forward-looking channel prediction under stringent reliability, latency, and adaptability demands. Traditional empirical and deterministic models remain limited in balancing accuracy, generalization, and deployability, while the growing availability of onboard and roadside sensing devices offers a promising source of environmental priors. This paper proposes an environment-aware channel prediction framework based on multimodal visual feature fusion. Using GPS data and vehicle-side panoramic RGB images, together with semantic segmentation and depth estimation, the framework extracts semantic, depth, and position features through a three-branch architecture and performs adaptive multimodal fusion via a squeeze-excitation attention gating module. For 360-dimensional angular power spectrum (APS) prediction, a dedicated regression head and a composite multi-constraint loss are further designed. As a result, joint prediction of path loss (PL), delay spread (DS), azimuth spread of arrival (ASA), azimuth spread of departure (ASD), and APS is achieved. 
Experiments on a synchronized urban V2I measurement dataset yield the best root mean square error (RMSE) of 3.26 dB for PL, RMSEs of 37.66 ns, 5.05 degrees, and 5.08 degrees for DS, ASA, and ASD, respectively, and mean/median APS cosine similarities of 0.9342/0.9571, demonstrating strong accuracy, generalization, and practical potential for intelligent channel prediction in 6G vehicular communications.

[54] Variational Encoder--Multi-Decoder (VE-MD) for Privacy-by-functional-design (Group) Emotion Recognition

Anderson Augusma,Dominique Vaufreydaz,Fédérique Letué

Main category: cs.CV

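TL;DR: This paper proposes VE-MD, a privacy-aware group emotion recognition framework whose multi-decoder structure jointly optimizes emotion classification and reconstruction of body and facial structure, avoids explicit individual-level modeling, and achieves SOTA performance on several GER and IER datasets.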

Details Motivation: Existing group emotion recognition methods depend on explicit individual-level processing (such as face cropping and person tracking), which raises privacy concerns and is mismatched with a goal that only requires group-level understanding. Method: Proposes the Variational Encoder-Multi-Decoder (VE-MD) framework, in which a shared latent representation is jointly optimized for emotion classification and structural reconstruction; two structural decoding strategies are used, a transformer-based PersonQuery decoder and a dense Heatmap decoder, to support variable group sizes. Result: Achieves SOTA on GAF-3.0 (90.06%) and VGAF (82.25% with audio fusion); shows that GER needs interaction-related structural information to be preserved, whereas in IER the projected structural representation can act as an effective denoising bottleneck; also performs strongly on IER datasets such as SamSemo. Conclusion: Explicitly modeling group structural information, rather than optimizing the latent space alone, is essential for group emotion recognition; VE-MD improves collective affect inference without relying on individual feature extraction while meeting its privacy-oriented design goals. Abstract: Group Emotion Recognition (GER) aims to infer collective affect in social environments such as classrooms, crowds, and public events. Many existing approaches rely on explicit individual-level processing, including cropped faces, person tracking, or per-person feature extraction, which makes the analysis pipeline person-centric and raises privacy concerns in deployment scenarios where only group-level understanding is needed. This research proposes VE-MD, a Variational Encoder-Multi-Decoder framework for group emotion recognition under a privacy-aware functional design. Rather than providing formal anonymization or cryptographic privacy guarantees, VE-MD is designed to avoid explicit individual monitoring by constraining the model to predict only aggregate group-level affect, without identity recognition or per-person emotion outputs. VE-MD learns a shared latent representation jointly optimized for emotion classification and internal prediction of body and facial structural representations. Two structural decoding strategies are investigated: a transformer-based PersonQuery decoder and a dense Heatmap decoder that naturally accommodates variable group sizes. Experiments on six in-the-wild datasets, including two GER and four Individual Emotion Recognition (IER) benchmarks, show that structural supervision consistently improves representation learning. 
More importantly, the results reveal a clear distinction between GER and IER: optimizing the latent space alone is often insufficient for GER because it tends to attenuate interaction-related cues, whereas preserving explicit structural outputs improves collective affect inference. In contrast, projected structural representations seem to act as an effective denoising bottleneck for IER. VE-MD achieves state-of-the-art performance on GAF-3.0 (up to 90.06%) and VGAF (82.25% with multimodal fusion with audio). These results show that preserving interaction-related structural information is particularly beneficial for group-level affect modeling without relying on prior individual feature extraction. On IER datasets using multimodal fusion with the audio modality, VE-MD outperforms SOTA on SamSemo (77.9%, adding the text modality) while achieving competitive performance on MER-MULTI (63.8%), DFEW (70.7%) and EngageNet (69.0%).

[55] LumiVideo: An Intelligent Agentic System for Video Color Grading

Yuchen Guo,Junli Gong,Hongmin Cai,Yiu-ming Cheung,Weifeng Su

Main category: cs.CV

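TL;DR: This paper proposes LumiVideo, an intelligent video color grading system that mimics the cognitive workflow of professional colorists through four stages (perception, reasoning, execution, and reflection); it combines an LLM with RAG and Tree of Thoughts (ToT) search to produce ASC-CDL parameters and a 3D LUT, guarantees temporal consistency, and supports iterative refinement through natural language feedback; it also releases LumiGrade, the first grading benchmark for log-encoded video.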

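Details Motivation: Existing automated grading methods are static black boxes that lack interpretability and the iterative control professionals need, and they fall short of the demands that cinematic grading places on semantic understanding, physical lighting modeling, and temporal consistency. Method: Proposes the LumiVideo agentic system with a four-stage workflow: (1) Perception analyzes the scene's physical lighting and semantic content; (2) Reasoning combines the LLM's internalized cinematic knowledge with a RAG-augmented Tree of Thoughts search to navigate the non-linear color parameter space; (3) Execution outputs ASC-CDL parameters and a globally consistent 3D LUT, analytically guaranteeing temporal consistency; (4) an optional Reflection stage supports iterative refinement driven by natural language feedback. The log-video benchmark LumiGrade is built alongside. Result: Experiments show that LumiVideo approaches human expert quality in fully automatic mode and enables precise iterative control when directed; the generated ASC-CDL and 3D LUT come with a strict temporal consistency guarantee; LumiGrade provides an evaluation basis for future work. Conclusion: LumiVideo is the first to bring the embodied-agent paradigm to video color grading, combining automated performance with professional controllability, and moves AI grading from pixel generation toward a parametric, interpretable, and iterative paradigm. Abstract: Video color grading is a critical post-production process that transforms flat, log-encoded raw footage into emotionally resonant cinematic visuals. Existing automated methods act as static, black-box executors that directly output edited pixels, lacking both interpretability and the iterative control required by professionals. We introduce LumiVideo, an agentic system that mimics the cognitive workflow of professional colorists through four stages: Perception, Reasoning, Execution, and Reflection. Given only raw log video, LumiVideo autonomously produces a cinematic base grade by analyzing the scene's physical lighting and semantic content. Its Reasoning engine synergizes an LLM's internalized cinematic knowledge with a Retrieval-Augmented Generation (RAG) framework via a Tree of Thoughts (ToT) search to navigate the non-linear color parameter space. Rather than generating pixels, the system compiles the deduced parameters into industry-standard ASC-CDL configurations and a globally consistent 3D LUT, analytically guaranteeing temporal consistency. An optional Reflection loop then allows creators to refine the result via natural language feedback. We further introduce LumiGrade, the first log-encoded video benchmark for evaluating automated grading. Experiments show that LumiVideo approaches human expert quality in fully automatic mode while enabling precise iterative control when directed.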
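For readers unfamiliar with ASC-CDL, the sketch below applies the usual slope/offset/power transform followed by a saturation adjustment with Rec. 709 luma weights. It is a generic illustration of the format, not LumiVideo's code; the clamping to [0, 1] and the function names are assumptions of this sketch.

```python
REC709_LUMA = (0.2126, 0.7152, 0.0722)

def clamp01(x):
    """Clamp a value to the [0, 1] range."""
    return min(max(x, 0.0), 1.0)

def apply_cdl(rgb, slope, offset, power, saturation=1.0):
    """Apply slope, offset, power per channel, then a global saturation.

    `rgb`, `slope`, `offset`, `power` are 3-tuples of floats.
    """
    # Slope/offset/power: out = clamp(in * slope + offset) ** power
    sop = [clamp01(v * s + o) ** p
           for v, s, o, p in zip(rgb, slope, offset, power)]
    # Saturation scales each channel's distance from the luma value.
    luma = sum(w * c for w, c in zip(REC709_LUMA, sop))
    return tuple(clamp01(luma + saturation * (c - luma)) for c in sop)
```

Because the same small parameter set is applied as a pure per-pixel function to every frame, identical input colors always map to identical output colors, which is one reading of why a parametric grade avoids frame-to-frame flicker.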

[56] From Elevation Maps To Contour Lines: SVM and Decision Trees to Detect Violin Width Reduction

Philémon Beghin,Anne-Emmanuelle Ceulemans,François Glineur

Main category: cs.CV

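TL;DR: This paper explores automatic detection of violin width reduction from 3D photogrammetric meshes, comparing a raw geometric representation built from elevation maps with a feature-engineered approach based on parametric contour-line fitting; the results indicate that the latter performs better.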

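Details Motivation: Automatic detection of violin width reduction matters for instrument restoration, authentication, and digital archiving, while existing methods are limited in accuracy and robustness. Method: Applies two classifiers, support vector machines (SVM) and decision trees, to a raw geometric representation built from elevation maps and to a feature-engineered representation based on parametric contour-line fitting. Result: The feature-engineered contour-line approach outperforms the elevation-map approach, although the latter occasionally achieves strong results. Conclusion: For detecting violin width reduction, carefully designed geometric features such as contour-line fits are more discriminative and stable than a generic elevation-map representation. Abstract: We explore the automatic detection of violin width reduction using 3D photogrammetric meshes. We compare SVM and Decision Trees applied to a geometry-based raw representation built from elevation maps with a more targeted, feature-engineered approach relying on parametric contour-line fitting. Although elevation maps occasionally achieve strong results, their performance does not surpass that of the contour-based inputs.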

[57] PlayGen-MoG: Framework for Diverse Multi-Agent Play Generation via Mixture-of-Gaussians Trajectory Prediction

Kevin Song

Main category: cs.CV

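TL;DR: This paper proposes PlayGen-MoG, a framework for generating multi-agent motion trajectories from an initial formation; it uses a MoG output head, relative spatial attention, and non-autoregressive prediction of absolute displacements to address posterior collapse, mode collapse, and dependence on trajectory history, achieving accurate and diverse generation on American football data.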

Details Motivation: Existing generative models (such as CVAEs and diffusion models) tend to suffer posterior collapse or converge to the dataset mean when generating multi-agent trajectories in team sports; most methods also need several frames of observed history, which makes them unsuitable for play design where only the initial formation is known. Method: Proposes PlayGen-MoG: (1) a Mixture-of-Gaussians (MoG) output head with shared mixture weights, so that all players' trajectories are coupled through a single scenario selection; (2) relative spatial attention that encodes pairwise player positions and distances as learned attention biases; (3) non-autoregressive prediction of absolute displacements from the initial formation, which eliminates cumulative error and removes the dependence on history. Result: On American football tracking data, ADE is 1.68 yards and FDE is 3.98 yards; all 8 MoG components are fully utilized, with entropy of 2.06 out of 2.08, and qualitative results confirm diverse generation without mode collapse. Conclusion: PlayGen-MoG balances realism, diversity, and formation-conditioned controllability in trajectory generation and offers a new paradigm for history-free play design. Abstract: Multi-agent trajectory generation in team sports requires models that capture both the diversity of possible plays and realistic spatial coordination between players on plays. Standard generative approaches such as Conditional Variational Autoencoders (CVAE) and diffusion models struggle with this task, exhibiting posterior collapse or convergence to the dataset mean. Moreover, most trajectory prediction methods operate in a forecasting regime that requires multiple frames of observed history, limiting their use for play design where only the initial formation is available. We present PlayGen-MoG, an extensible framework for formation-conditioned play generation that addresses these challenges through three design choices: 1/ a Mixture-of-Gaussians (MoG) output head with shared mixture weights across all agents, where a single set of weights selects a play scenario that couples all players' trajectories, 2/ relative spatial attention that encodes pairwise player positions and distances as learned attention biases, and 3/ non-autoregressive prediction of absolute displacements from the initial formation, eliminating cumulative error drift and removing the dependence on observed trajectory history, enabling realistic play generation from a single static formation alone. On American football tracking data, PlayGen-MoG achieves 1.68 yard ADE and 3.98 yard FDE while maintaining full utilization of all 8 mixture components with entropy of 2.06 out of 2.08, and qualitatively confirming diverse generation without mode collapse.
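The reported metrics can be made concrete with a short sketch. ADE and FDE below follow their usual definitions (mean and final-step Euclidean error); the averaging order over agents and timesteps is an assumption, as is the reading that the entropy is measured in nats, which is suggested by ln 8 ≈ 2.08 for 8 components.

```python
import math

def ade_fde(pred, truth):
    """Average and final displacement error.

    `pred` and `truth` are lists of agents; each agent is a list of (x, y)
    positions per timestep, in the same units (for example yards).
    """
    errors = [[math.dist(p, q) for p, q in zip(pa, ta)]
              for pa, ta in zip(pred, truth)]
    ade = sum(sum(e) for e in errors) / sum(len(e) for e in errors)
    fde = sum(e[-1] for e in errors) / len(errors)
    return ade, fde

def mixture_entropy(weights):
    """Shannon entropy (nats) of mixture usage; the maximum is ln(K)."""
    return -sum(w * math.log(w) for w in weights if w > 0)
```

A uniform distribution over 8 components gives `mixture_entropy([1/8] * 8)` equal to ln 8, so an observed value close to that ceiling indicates that no component has been abandoned.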

Nikhil Kalidasu,Sahana Ganapathy

Main category: cs.CV

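TL;DR: This paper proposes SPAR, a physical adversarial attack on the open-source ALPR system fast-alpr that, without modifying the license plate or accessing ALPR infrastructure, reduces recognition accuracy by 60% and achieves an 18% targeted impersonation rate at low cost (under $100); the authors argue that it is legal in the state of Texas.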

Details Motivation: Prior work has focused on security vulnerabilities in ALPR systems but has paid little attention to legality and physical-world practicality, in particular whether an attack is feasible for low-resourced attackers. Method: Proposes the Street-legal Physical Adversarial Rim (SPAR), a white-box physical adversarial attack designed from an analysis of fast-alpr that interferes with ALPR detection by adding a specific adversarial pattern on a rim; it was implemented entirely with commercial agentic coding assistants and requires neither tampering with the plate nor access to the ALPR system. Result: Under optimal conditions, SPAR reduces ALPR recognition accuracy by 60% and achieves an 18% targeted vehicle impersonation rate; it costs under $100 to produce; a legal analysis argues that it is lawful in the state of Texas. Conclusion: SPAR exposes practical vulnerabilities of modern ALPR systems in real physical-world conditions, challenges the security and legal assumptions behind current ALPR deployments, and suggests new directions for both attack and defense. Abstract: Automatic license plate reader (ALPR) systems are widely deployed to identify and track vehicles. While prior work has demonstrated vulnerabilities in ALPR systems, far less attention has been paid to their legality and physical-world practicality. We investigate whether low-resourced threat actors can engineer a successful adversarial attack against a modern open-source ALPR system. We introduce the Street-legal Physical Adversarial Rim (SPAR), a physically realizable white-box attack against the popular ALPR system fast-alpr. SPAR requires no access to ALPR infrastructure during attack deployment and does not alter or obscure the attacker's license plate. Based on prior legislation and case law, we argue that SPAR is street-legal in the state of Texas. Under optimal conditions, SPAR reduces ALPR accuracy by 60% and achieves an 18% targeted impersonation rate. SPAR can be produced for under $100, and it was implemented entirely by commercial agentic coding assistants. These results highlight practical vulnerabilities in modern ALPR systems under realistic physical-world conditions and suggest new directions for both attack and defense.

[59] VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation

Mengtian Li,Yuwei Lu,Feifei Li,Chenqi Gan,Zhifeng Xie,Xi Wang

Main category: cs.CV

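TL;DR: This paper proposes VERTIGO, the first framework to bring visual preference optimization to camera trajectory generation; it scores real-time renders with a fine-tuned vision-language model and applies Direct Preference Optimization (DPO) to improve framing quality, prompt consistency, and visual aesthetics, sharply reducing the character off-screen rate while preserving the geometric fidelity of camera motion.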

Details Motivation: Existing generative camera systems lack a director-in-the-loop feedback mechanism; they produce in-distribution motion trajectories but often with poor framing, off-screen characters, and weak aesthetics, so explicit preference supervision at the visual level is needed. Method: Builds the VERTIGO framework: Unity renders 2D preview images of generated trajectories in real time; a cinematically fine-tuned vision-language model scores the alignment between previews and text prompts using a cyclic semantic similarity mechanism; the resulting visual preference signal is used for Direct Preference Optimization (DPO) post-training. Result: On both Unity renders and diffusion-based Camera-to-Video pipelines, condition adherence, framing quality, and perceptual realism improve consistently; the character off-screen rate falls from 38% to nearly 0%; user studies show a preference over baselines in composition, consistency, prompt adherence, and aesthetic quality. Conclusion: Visual preference optimization is a key route to making generated camera trajectories more usable; VERTIGO demonstrates the effectiveness of combining real-time rendering, a fine-tuned VLM, and DPO for end-to-end visual alignment, offering a new paradigm for editable, evaluable cinematic camera control. Abstract: Cinematic camera control relies on a tight feedback loop between director and cinematographer, where camera motion and framing are continuously reviewed and refined. Recent generative camera systems can produce diverse, text-conditioned trajectories, but they lack this "director in the loop" and have no explicit supervision of whether a shot is visually desirable. This results in in-distribution camera motion but poor framing, off-screen characters, and undesirable visual aesthetics. In this paper, we introduce VERTIGO, the first framework for visual preference optimization of camera trajectory generators. Our framework leverages a real-time graphics engine (Unity) to render 2D visual previews from generated camera motion. A cinematically fine-tuned vision-language model then scores these previews using our proposed cyclic semantic similarity mechanism, which aligns renders with text prompts. This process provides the visual preference signals for Direct Preference Optimization (DPO) post-training. Both quantitative evaluations and user studies on Unity renders and diffusion-based Camera-to-Video pipelines show consistent gains in condition adherence, framing quality, and perceptual realism. Notably, VERTIGO reduces the character off-screen rate from 38% to nearly 0% while preserving the geometric fidelity of camera motion. 
User study participants further prefer VERTIGO over baselines across composition, consistency, prompt adherence, and aesthetic quality, confirming the perceptual benefits of our visual preference post-training.
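The DPO objective mentioned above has a standard per-pair form, sketched here for a single preference pair. How VERTIGO turns VLM scores into chosen/rejected trajectory pairs is not described in the abstract, so the inputs (sequence log-probabilities under the trained policy and a frozen reference) and the default `beta` are assumptions of this illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin compares how much more the policy prefers the chosen sample
    over the rejected one, relative to the reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy and reference agree, the margin is zero and the loss is ln 2; the loss falls as the policy raises the relative likelihood of the preferred trajectory.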

[60] Hierarchical, Interpretable, Label-Free Concept Bottleneck Model

Haodong Xie,Yujun Cai,Rahul Singh Maharjan,Yiwei Wang,Federico Tavella,Angelo Cangelosi

Main category: cs.CV

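TL;DR: This paper proposes HIL-CBM, a hierarchical, interpretable, label-free concept bottleneck model that improves interpretability and classification performance through multi-level abstraction modeling and a gradient-based consistency loss.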

Details Motivation: Existing concept bottleneck models (CBMs) operate at a single semantic level and cannot mirror the human cognitive process of recognizing objects at different levels of abstraction. Method: Proposes the HIL-CBM framework, which introduces a gradient-based visual consistency loss that encourages different abstraction layers to attend to similar spatial regions and trains dual classification heads, each operating on feature concepts at a different level; no relational concept annotations are required. Result: Classification accuracy on benchmark datasets exceeds that of state-of-the-art sparse CBMs; human evaluations indicate that its explanations are more interpretable and accurate while the model remains hierarchical and label-free. Conclusion: By modeling hierarchy in a way that is closer to human cognition, HIL-CBM improves the interpretability and performance of CBMs and offers a new direction for interpretable AI. Abstract: Concept Bottleneck Models (CBMs) introduce interpretability to black-box deep learning models by predicting labels through human-understandable concepts. However, unlike humans, who identify objects at different levels of abstraction using both general and specific features, existing CBMs operate at a single semantic level in both concept and label space. We propose HIL-CBM, a Hierarchical Interpretable Label-Free Concept Bottleneck Model that extends CBMs into a hierarchical framework to enhance interpretability by more closely mirroring the human cognitive process. HIL-CBM enables classification and explanation across multiple semantic levels without requiring relational concept annotations. HIL-CBM aligns the abstraction level of concept-based explanations with that of model predictions, progressing from abstract to concrete. This is achieved by (i) introducing a gradient-based visual consistency loss that encourages abstraction layers to focus on similar spatial regions, and (ii) training dual classification heads, each operating on feature concepts at different abstraction levels. Experiments on benchmark datasets demonstrate that HIL-CBM outperforms state-of-the-art sparse CBMs in classification accuracy. Human evaluations further show that HIL-CBM provides more interpretable and accurate explanations, while maintaining a hierarchical and label-free approach to feature concepts.

[61] Guideline2Graph: Profile-Aware Multimodal Parsing for Executable Clinical Decision Graphs

Onur Selim Kilic,Yeti Z. Gurbuz,Cem O. Yaldiz,Afra Nawar,Etrit Haxholli,Ogul Can,Eli Waxman

Main category: cs.CV

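TL;DR: This paper proposes a decomposition-first pipeline that converts clinical practice guidelines into executable clinical decision graphs; through topology-aware chunking, interface-constrained chunk graph generation, and provenance-preserving global aggregation, it substantially improves cross-page continuity and the accuracy and auditability of control flow.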

Details Motivation: Existing methods struggle with long, multimodal clinical guideline documents, particularly in preserving cross-page continuity and building a complete decision graph. Method: Uses a decomposition-first pipeline comprising topology-aware chunking, interface-constrained chunk graph generation, and provenance-preserving global aggregation, with explicit entry/terminal interfaces and semantic deduplication. Result: On a prostate-guideline benchmark, edge and triplet precision/recall improve from 19.6%/16.1% to 69.0%/87.5%, and node recall rises from 78.1% to 93.8%. Conclusion: The method supports auditable, structurally consistent conversion of guidelines into clinical decision support, but validation is so far limited to one expert-adjudicated prostate guideline and needs to be extended to multiple guidelines. Abstract: Clinical practice guidelines are long, multimodal documents whose branching recommendations are difficult to convert into executable clinical decision support (CDS), and one-shot parsing often breaks cross-page continuity. Recent LLM/VLM extractors are mostly local or text-centric, under-specifying section interfaces and failing to consolidate cross-page control flow across full documents into one coherent decision graph. We present a decomposition-first pipeline that converts full-guideline evidence into an executable clinical decision graph through topology-aware chunking, interface-constrained chunk graph generation, and provenance-preserving global aggregation. Rather than relying on single-pass generation, the pipeline uses explicit entry/terminal interfaces and semantic deduplication to preserve cross-page continuity while keeping the induced control flow auditable and structurally consistent. We evaluate on an adjudicated prostate-guideline benchmark with matched inputs and the same underlying VLM backbone across compared methods. On the complete merged graph, our approach improves edge and triplet precision/recall from $19.6\%/16.1\%$ in existing models to $69.0\%/87.5\%$, while node recall rises from $78.1\%$ to $93.8\%$. These results support decomposition-first, auditable guideline-to-CDS conversion on this benchmark, while current evidence remains limited to one adjudicated prostate guideline and motivates broader multi-guideline validation.
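Edge-level precision and recall of a predicted decision graph against a reference graph can be computed as below. This sketch assumes exact matching of (source, target) node pairs; the benchmark's actual matching rule (for instance, semantic matching of node text) is not stated in the abstract.

```python
def precision_recall(predicted, reference):
    """Precision and recall of predicted graph elements under exact match.

    `predicted` and `reference` are iterables of hashable items, such as
    (source, target) edge tuples or (source, relation, target) triplets.
    """
    predicted, reference = set(predicted), set(reference)
    matched = len(predicted & reference)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    return precision, recall
```

The same function applies to nodes, edges, or triplets, which is convenient when all three are reported side by side as in the results above.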

[62] Generating Satellite Imagery Data for Wildfire Detection through Mask-Conditioned Generative AI

Valeria Martin,K. Brent Venable,Derek Morgan

Main category: cs.CV

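TL;DR: This paper examines whether EarthSynth, a diffusion-based foundation model for Earth observation, can synthesize realistic post-wildfire Sentinel-2 RGB imagery conditioned on real burn masks without task-specific fine-tuning; across several experimental configurations and quantitative metrics, inpainting-based pipelines outperform full-tile generation, and VLM-assisted prompts are competitive.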

Details Motivation: Deep-learning-based wildfire monitoring systems are limited by the scarcity of labeled satellite imagery; this paper explores whether generative foundation models can ease that data bottleneck by synthesizing high-quality, conditionally controlled post-fire imagery. Method: Using the EarthSynth model, designs six experimental configurations that systematically compare (i) mask-only full-tile generation versus inpainting with pre-fire context, (ii) three hand-crafted prompts versus a prompt generated by the Qwen2-VL vision-language model, and (iii) a region-wise color-matching post-processing step; evaluation uses masks derived from the CalFireSeg-50 dataset. Result: Inpainting pipelines outperform full-tile generation on all metrics (Burn IoU, ΔC_burn, Darkness Contrast, Spectral Plausibility); the structured inpainting prompt gives the best spatial alignment (Burn IoU = 0.456) and burn saliency (Darkness Contrast = 20.44); color matching yields the lowest color distance (ΔC_burn = 63.22) but weakens saliency; VLM prompts perform close to hand-crafted ones. Conclusion: EarthSynth can support conditional post-fire image synthesis and is most reliable in the inpainting setting; the approach offers a feasible route to generative data augmentation for wildfire detection. Abstract: The scarcity of labeled satellite imagery remains a fundamental bottleneck for deep-learning (DL)-based wildfire monitoring systems. This paper investigates whether a diffusion-based foundation model for Earth Observation (EO), EarthSynth, can synthesize realistic post-wildfire Sentinel-2 RGB imagery conditioned on existing burn masks, without task-specific retraining. Using burn masks derived from the CalFireSeg-50 dataset (Martin et al., 2025), we design and evaluate six controlled experimental configurations that systematically vary: (i) pipeline architecture (mask-only full generation vs. inpainting with pre-fire context), (ii) prompt engineering strategy (three hand-crafted prompts and a VLM-generated prompt via Qwen2-VL), and (iii) a region-wise color-matching post-processing step. Quantitative assessment on 10 stratified test samples uses four complementary metrics: Burn IoU, burn-region color distance (ΔC_burn), Darkness Contrast, and Spectral Plausibility. Results show that inpainting-based pipelines consistently outperform full-tile generation across all metrics, with the structured inpainting prompt achieving the best spatial alignment (Burn IoU = 0.456) and burn saliency (Darkness Contrast = 20.44), while color matching produces the lowest color distance (ΔC_burn = 63.22) at the cost of reduced burn saliency. VLM-assisted inpainting is competitive with hand-crafted prompts. 
These findings provide a foundation for incorporating generative data augmentation into wildfire detection pipelines. Code and experiments are available at: https://www.kaggle.com/code/valeriamartinh/genai-all-runned
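Burn IoU, the spatial-alignment metric reported above, is an intersection-over-union between two binary masks. The sketch below shows only that final comparison; how a burn mask is recovered from a synthesized image (for example by thresholding darkness) is not specified in the abstract and is left outside this example.

```python
def burn_iou(pred_mask, true_mask):
    """Intersection over union of two binary masks of equal shape.

    Masks are lists of rows; any truthy value counts as "burned".
    Returns 1.0 when both masks are empty (a convention assumed here).
    """
    inter = union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0
```

For instance, `burn_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]])` has one overlapping pixel out of three burned pixels in total, giving one third.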

[63] VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors

Haz Sameen Shahgir,Xiaofu Chen,Yu Fu,Erfan Shayegani,Nael Abu-Ghazaleh,Yova Kementchedjhieva,Yue Dong

Main category: cs.CV

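TL;DR: This paper identifies the root cause of the poor performance of vision language models (VLMs) on fine-grained visual tasks such as visual correspondence: their training paradigm relies heavily on mapping visual information to existing language concepts, which leaves reasoning about unnameable or novel visual entities weak; the authors verify this through empirical analysis and Logit Lens studies and attribute it to learned training shortcuts rather than an inherent architectural flaw.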

Details Motivation: VLMs perform poorly on tasks that require fine-grained visual perception even when their internal representations contain the needed visual information; the authors aim to find the root cause of this gap. Method: Evaluates how VLMs handle nameable versus unnameable visual entities on visual correspondence tasks (semantic, shape, and face); uses Logit Lens analysis to examine how the model assigns semantic labels and corresponding tokens to different entities; and tests both assigning arbitrary names to unknown entities and task-specific fine-tuning as ways to improve performance. Result: VLMs perform markedly better on correspondence tasks with nameable entities than with unnameable ones; Logit Lens shows that they explicitly assign semantic labels to nameable entities and surface more unique tokens; introducing arbitrary names improves performance, but task-specific fine-tuning is stronger and does not depend on language priors. Conclusion: VLM failures on visual tasks stem from language-mapping shortcuts learned during training rather than a fundamental limitation of multimodal architectures; improved training paradigms, for example ones that rely less on language priors, may overcome the current bottleneck. Abstract: Vision Language Models (VLMs) achieve impressive performance across a wide range of multimodal tasks. However, on some tasks that demand fine-grained visual perception, they often fail even when the required information is present in their internal representations. In this work, we demonstrate that this gap arises from their narrow training pipeline which focuses on moving visual information to the textual space. Consequently, VLMs can only reason about visual entities that can be mapped to known concepts in the language space, leaving vision-focused tasks such as visual correspondence and reasoning about novel visual entities poorly supported. As a result, VLMs are severely limited in several important multimodal capabilities because they rely on brittle, hallucinated textual descriptions of visual entities that they cannot map to textual representations. We verify this behavior through visual correspondence tasks, in which VLMs must detect matching entities between two images. Testing across semantic, shape, and face correspondence tasks, we find that VLMs perform much better when the relevant entities are nameable in language than when they are unnameable. Mechanistically, our Logit Lens analyses confirm that VLMs explicitly assign semantic labels to nameable entities and surface more unique corresponding tokens compared to unnameable entities. Furthermore, we show that teaching completely arbitrary names for unknown entities improves performance, yet task-specific finetuning yields even stronger generalization without relying on language priors. 
Our findings suggest that current VLM failures on visual tasks reflect learned shortcuts from their training, rather than a fundamental limitation of multimodal architectures.

[64] Token-Efficient Multimodal Reasoning via Image Prompt Packaging

Joong Ho Choi,Jiayang Zhao,Avani Appalla,Himansh Mukesh,Dhwanil Vasani,Boyi Qian

Main category: cs.CV

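TL;DR: This paper proposes Image Prompt Packaging (IPPg), which embeds structured text in images to reduce text token overhead and achieves 35.8%–91.0% inference cost reductions across several models and tasks; the effect depends strongly on model and task, and the paper systematically analyzes failure modes and the impact of visual encoding choices.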

Details Motivation: Deploying large multimodal language models at scale is constrained by token-based inference costs, and the cost-performance behavior of visual prompting strategies remains unclear. Method: Proposes Image Prompt Packaging (IPPg), which embeds structured text directly into images to compress text tokens; benchmarks it on five datasets, three frontier models (GPT-4.1, GPT-4o, Claude 3.5 Sonnet), and two task families (VQA and code generation); derives a cost model decomposed by token type; and runs a rendering-configuration ablation and an error-mode analysis. Result: IPPg reduces inference cost by 35.8%–91.0% with text token compression of up to 96%; accuracy stays competitive in many settings but depends strongly on model and task (for example, GPT-4.1 gains in both accuracy and cost on CoSQL, while Claude 3.5 incurs higher cost on some VQA tasks); three main failure modes are identified: spatial reasoning, non-English inputs, and character-sensitive operations; an ablation over 125 rendering configurations shows accuracy shifts of 10–30 percentage points. Conclusion: Visual encoding choices are a first-class variable in multimodal system design; IPPg is a promising cost-optimization paradigm but must be adapted carefully to the model and task. Abstract: Deploying large multimodal language models at scale is constrained by token-based inference costs, yet the cost-performance behavior of visual prompting strategies remains poorly characterized. We introduce Image Prompt Packaging (IPPg), a prompting paradigm that embeds structured text directly into images to reduce text token overhead, and benchmark it across five datasets, three frontier models (GPT-4.1, GPT-4o, Claude 3.5 Sonnet), and two task families (VQA and code generation). We derive a cost formulation decomposing savings by token type and show IPPg achieves 35.8--91.0\% inference cost reductions. Despite token compression of up to 96\%, accuracy remains competitive in many settings, though outcomes are highly model- and task-dependent: GPT-4.1 achieves simultaneous accuracy and cost gains on CoSQL, while Claude 3.5 incurs cost increases on several VQA benchmarks. Systematic error analysis yields a failure-mode taxonomy: spatial reasoning, non-English inputs, and character-sensitive operations are most vulnerable, while schema-structured tasks benefit most. A 125-configuration rendering ablation reveals accuracy shifts of 10--30 percentage points, establishing visual encoding choices as a first-class variable in multimodal system design.
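A cost model decomposed by token type can be sketched as follows. The three-way split (text input, image input, output), the price dictionary keys, and every number in the usage note are hypothetical; they are not the paper's formulation or any provider's actual pricing.

```python
def request_cost(text_tokens, image_tokens, output_tokens, price):
    """Cost of one request in currency units.

    `price` maps "text_in", "image_in" and "out" to a price per million
    tokens (hypothetical keys chosen for this sketch).
    """
    total = (text_tokens * price["text_in"]
             + image_tokens * price["image_in"]
             + output_tokens * price["out"])
    return total / 1_000_000

def relative_saving(baseline_cost, packaged_cost):
    """Fraction of the baseline cost saved; negative means a cost increase."""
    return 1.0 - packaged_cost / baseline_cost
```

As a made-up example, moving 10,000 text tokens into an image billed as 1,000 image tokens plus a 400-token text remainder saves cost only if the image token count and price stay low enough, which is consistent with the observation above that packaging can increase cost for some models.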

[65] Delaunay Canopy: Building Wireframe Reconstruction from Airborne LiDAR Point Clouds via Delaunay Graph

Donghyun Kim,Chanyoung Kim,Youngjoong Kwon,Seong Jae Hwang

Main category: cs.CV

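TL;DR: This paper proposes Delaunay Canopy, which uses the Delaunay graph as a geometric prior to build an adaptive search space and a Delaunay Graph Scoring mechanism to extract curvature features, improving building wireframe reconstruction in sparse, noisy, and internal-corner regions.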

Details Motivation: Existing methods struggle to reconstruct accurate building wireframes in regions with heavy noise, sparse points, or internal corners, mainly because they lack an adaptive search space that can exploit the rich 3D geometry of large, sparse point clouds. Method: The Delaunay Canopy framework uses the Delaunay graph as a geometric prior to define an adaptive search space; introduces Delaunay Graph Scoring, which reconstructs the geometric manifold and produces region-wise curvature signatures at the same time; and builds corner and wire selection modules on this prior to focus on highly probable structural elements. Result: Experiments on the Building3D Tallinn city and entry-level datasets show state-of-the-art wireframe reconstruction, with accurate predictions across diverse and complex building geometries. Conclusion: Through geometrically adaptive modeling, Delaunay Canopy markedly improves the robustness and accuracy of wireframe reconstruction from point clouds, especially in challenging regions, and offers a new paradigm for topology-aware 3D building understanding. Abstract: Reconstructing building wireframes from airborne LiDAR point clouds yields a compact, topology-centric representation that enables structural understanding beyond dense meshes. Yet a key limitation persists: conventional methods have failed to achieve accurate wireframe reconstruction in regions afflicted by significant noise, sparsity, or internal corners. This failure stems from the inability to establish an adaptive search space to effectively leverage the rich 3D geometry of large, sparse building point clouds. In this work, we address this challenge with Delaunay Canopy, which utilizes the Delaunay graph as a geometric prior to define a geometrically adaptive search space. Central to our approach is Delaunay Graph Scoring, which not only reconstructs the underlying geometric manifold but also yields region-wise curvature signatures to robustly guide the reconstruction. Built on this foundation, our corner and wire selection modules leverage the Delaunay-induced prior to focus on highly probable elements, thereby shaping the search space and enabling accurate prediction even in previously intractable regions. Extensive experiments on the Building3D Tallinn city and entry-level datasets demonstrate state-of-the-art wireframe reconstruction, delivering accurate predictions across diverse and complex building geometries.

[66] An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis

Md. Sajeebul Islam Sk.,Md. Mehedi Hasan Shawon,Md. Golam Rabiul Alam

Main category: cs.CV

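TL;DR: This paper proposes an end-to-end explainable vision-language model framework for MRI-based diagnosis of lumbar spinal stenosis (LSS). A Spatial Patch Cross-Attention module and an Adaptive PID-Tversky loss improve segmentation and classification accuracy, and the framework generates radiologist-style clinical reports, improving the explainability and clinical usefulness of AI in medical imaging.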

Details Motivation: LSS diagnosis depends on manual interpretation of multi-view MRI, which leads to high inter-observer variability and delays. Existing vision-language models struggle to handle severe class imbalance while preserving spatial segmentation accuracy, largely because global pooling discards anatomical hierarchy. Method: The authors propose a Spatial Patch Cross-Attention module for precise, text-guided localization of spinal anomalies; design an Adaptive PID-Tversky loss, informed by control theory, that dynamically reweights hard, under-segmented minority instances; and combine foundational vision-language models with an automated radiology report generation module. Result: The framework reaches a diagnostic classification accuracy of 90.69%, a macro-averaged Dice score of 0.9512 for segmentation, and a CIDEr score of 92.80. It converts segmentation results into radiologist-style clinical reports, giving strong explainability. Conclusion: While keeping human supervision in place, the framework improves the accuracy, robustness, and transparency of LSS diagnosis and sets a new benchmark for explainable AI in clinical medical imaging. Abstract: Lumbar Spinal Stenosis (LSS) diagnosis remains a critical clinical challenge, with diagnosis heavily dependent on labor-intensive manual interpretation of multi-view Magnetic Resonance Imaging (MRI), leading to substantial inter-observer variability and diagnostic delays. Existing vision-language models fail to address the extreme class imbalance prevalent in clinical segmentation datasets while also preserving spatial accuracy, primarily due to global pooling mechanisms that discard crucial anatomical hierarchies. We present an end-to-end Explainable Vision-Language Model framework designed to overcome these limitations, built on two principal components. We propose a Spatial Patch Cross-Attention module that enables precise, text-directed localization of spinal anomalies. A novel Adaptive PID-Tversky Loss function, which integrates control theory principles, dynamically adjusts training penalties to specifically address difficult, under-segmented minority instances. By incorporating foundational VLMs alongside an Automated Radiology Report Generation module, our framework achieves strong performance: a diagnostic classification accuracy of 90.69%, a macro-averaged Dice score of 0.9512 for segmentation, and a CIDEr score of 92.80%. Furthermore, the framework provides explainability by converting complex segmentation predictions into radiologist-style clinical reports, thereby establishing a new benchmark for transparent, interpretable AI in clinical medical imaging that keeps essential human supervision while enhancing diagnostic capabilities.
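
For readers unfamiliar with the loss family, the sketch below shows the standard Tversky loss, in which `alpha` weights false positives and `beta` weights false negatives. The abstract does not say how the PID terms enter the loss, so the `PIDWeight` class is purely a hypothetical illustration of a textbook PID update applied to `beta`; it is not the authors' formulation, and its gains are made up.

```python
def tversky_index(pred, target, alpha, beta, eps=1e-7):
    """Standard Tversky index: TP / (TP + alpha * FP + beta * FN)."""
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)


def tversky_loss(pred, target, alpha=0.5, beta=0.5):
    """Loss form; alpha = beta = 0.5 reduces to the Dice loss."""
    return 1.0 - tversky_index(pred, target, alpha, beta)


class PIDWeight:
    """Hypothetical PID controller that adjusts beta from an error signal."""

    def __init__(self, kp, ki, kd, base):
        self.kp, self.ki, self.kd, self.base = kp, ki, kd, base
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error):
        # Classic proportional + integral + derivative terms.
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        value = (self.base + self.kp * error
                 + self.ki * self.integral + self.kd * derivative)
        return min(1.0, max(0.0, value))  # keep the weight in [0, 1]


# An under-segmented example: one true positive, two missed pixels.
pred = [1.0, 0.0, 0.0, 0.0]
target = [1, 1, 1, 0]
dice_like = tversky_loss(pred, target, alpha=0.5, beta=0.5)
fn_heavy = tversky_loss(pred, target, alpha=0.3, beta=0.7)

controller = PIDWeight(kp=0.5, ki=0.1, kd=0.0, base=0.5)
new_beta = controller.update(0.2)  # error could be, e.g., a target Dice minus the current Dice
```

Raising `beta` increases the loss on under-segmented examples, which is the lever an adaptive scheme would presumably turn when minority structures are being missed.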

[67] Rapidly deploying on-device eye tracking by distilling visual foundation models

Cheng Jiang,Jogendra Kundu,David Colmenares,Fengting Yang,Joseph Robinson,Yatong An,Ali Behrooz

Main category: cs.CV

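TL;DR: This paper proposes DistillGaze, a framework that distills a visual foundation model using labeled synthetic data and unlabeled real data to produce accurate, lightweight on-device eye-tracking (ET) models that can be adapted quickly to new hardware.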

Details Motivation: Visual foundation models (VFMs) perform well on natural images but fall short on specialized domains such as near-eye infrared imagery, and frequent changes in hardware configuration make it hard to deploy accurate on-device gaze estimation quickly. Method: DistillGaze has two stages. The first adapts a VFM into a domain-specialized teacher via self-supervised learning on labeled synthetic data and unlabeled real data. The second trains a lightweight on-device student (256K parameters) using both teacher guidance and self-training. Result: On a large-scale crowd-sourced dataset covering more than 2,000 participants, DistillGaze reduces median gaze error by 58.62% relative to synthetic-only baselines, while the model remains lightweight and suitable for real-time on-device deployment. Conclusion: DistillGaze offers an efficient path for training and deploying ET models as hardware evolves, and a general recipe for combining synthetic supervision with unlabeled real data in on-device regression tasks. Abstract: Eye tracking (ET) plays a critical role in augmented and virtual reality applications. However, rapidly deploying high-accuracy, on-device gaze estimation for new products remains challenging because hardware configurations (e.g., camera placement, camera pose, and illumination) often change across device generations. Visual foundation models (VFMs) are a promising direction for rapid training and deployment, and they excel on natural-image benchmarks; yet we find that off-the-shelf VFMs still struggle to achieve high accuracy on specialized near-eye infrared imagery. To address this gap, we introduce DistillGaze, a framework that distills a foundation model by leveraging labeled synthetic data and unlabeled real data for rapid and high-performance on-device gaze estimation. DistillGaze proceeds in two stages. First, we adapt a VFM into a domain-specialized teacher using self-supervised learning on labeled synthetic and unlabeled real images. Synthetic data provides scalable, high-quality gaze supervision, while unlabeled real data helps bridge the synthetic-to-real domain gap. Second, we train an on-device student using both teacher guidance and self-training. Evaluated on a large-scale, crowd-sourced dataset spanning over 2,000 participants, DistillGaze reduces median gaze error by 58.62% relative to synthetic-only baselines while maintaining a lightweight 256K-parameter model suitable for real-time on-device deployment.
Overall, DistillGaze provides an efficient pathway for training and deploying ET models that adapt to hardware changes, and offers a recipe for combining synthetic supervision with unlabeled real data in on-device regression tasks.

[68] Feature Attribution Stability Suite: How Stable Are Post-Hoc Attributions?

Kamalasankari Subramaniakuppusamy,Jugal Gajjar

Main category: cs.CV

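TL;DR: This paper proposes the FASS benchmark for evaluating the stability of post-hoc feature attribution methods under realistic input perturbations. It enforces prediction-invariance filtering and decomposes stability into three metrics: structural similarity, rank correlation, and top-k Jaccard overlap. Experiments show that geometric perturbations destabilize attributions more than photometric ones, and that Grad-CAM is the most stable method across datasets.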

Details Motivation: Existing attribution-stability evaluations have three flaws: they focus on additive noise, reduce stability to a single scalar, and do not condition on prediction preservation, so they conflate explanation fragility with model sensitivity. Method: The Feature Attribution Stability Suite (FASS) introduces prediction-invariance filtering, decomposes stability into three complementary metrics (structural similarity, rank correlation, and top-k Jaccard overlap), and evaluates systematically under geometric, photometric, and compression perturbations. Result: Stability estimates depend strongly on the perturbation family and on prediction-invariance filtering. Geometric perturbations cause substantially more attribution instability than photometric ones. Without conditioning on prediction preservation, up to 99% of evaluated pairs involve changed predictions. Under the controlled evaluation, Grad-CAM shows the highest stability across datasets. Conclusion: Attribution stability must be evaluated with both the perturbation type and the prediction-invariance constraint in mind; FASS offers a finer-grained evaluation framework closer to real deployment, and Grad-CAM is comparatively the most robust across settings. Abstract: Post-hoc feature attribution methods are widely deployed in safety-critical vision systems, yet their stability under realistic input perturbations remains poorly characterized. Existing metrics evaluate explanations primarily under additive noise, collapse stability to a single scalar, and fail to condition on prediction preservation, conflating explanation fragility with model sensitivity. We introduce the Feature Attribution Stability Suite (FASS), a benchmark that enforces prediction-invariance filtering, decomposes stability into three complementary metrics (structural similarity, rank correlation, and top-k Jaccard overlap), and evaluates across geometric, photometric, and compression perturbations. Evaluating four attribution methods (Integrated Gradients, GradientSHAP, Grad-CAM, LIME) across four architectures and three datasets (ImageNet-1K, MS COCO, and CIFAR-10), FASS shows that stability estimates depend critically on perturbation family and prediction-invariance filtering. Geometric perturbations expose substantially greater attribution instability than photometric changes, and without conditioning on prediction preservation, up to 99% of evaluated pairs involve changed predictions. Under this controlled evaluation, we observe consistent method-level trends, with Grad-CAM achieving the highest stability across datasets.
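
Two of the three stability metrics named above, along with the prediction-invariance filter, are simple enough to sketch in plain Python. This is an illustrative reading rather than the FASS implementation: attribution maps are flattened to lists, ties are ignored in the rank correlation, and the dictionary keys `pred_clean` and `pred_perturbed` are hypothetical names.

```python
def top_k_jaccard(attr_a, attr_b, k):
    """Jaccard overlap between the index sets of the k largest attributions."""
    def top(attr):
        order = sorted(range(len(attr)), key=lambda i: attr[i], reverse=True)
        return set(order[:k])
    a, b = top(attr_a), top(attr_b)
    return len(a & b) / len(a | b)


def ranks(values):
    """Rank positions (0 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(attr_a, attr_b):
    """Spearman rank correlation via the no-ties closed form."""
    n = len(attr_a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(attr_a), ranks(attr_b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def prediction_invariant(pairs):
    """Keep only pairs whose predicted label survives the perturbation."""
    return [p for p in pairs if p["pred_clean"] == p["pred_perturbed"]]


clean = [0.9, 0.1, 0.5, 0.3]       # attribution on the original image
perturbed = [0.8, 0.2, 0.1, 0.6]   # attribution after a perturbation
jaccard = top_k_jaccard(clean, perturbed, k=2)
rho = spearman(clean, perturbed)
kept = prediction_invariant([
    {"pred_clean": 3, "pred_perturbed": 3},
    {"pred_clean": 3, "pred_perturbed": 7},
])
```

Filtering first matters because an attribution that changes along with the predicted class reflects model sensitivity, not explanation fragility.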

[69] Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation

Ji Young Byun,Young-Jin Park,Jean-Philippe Corbeil,Asma Ben Abacha

Main category: cs.CV

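TL;DR: This paper systematically studies confidence calibration of vision-language models (VLMs) on medical visual question answering (VQA). Overconfidence is widespread and is not resolved by scaling the model or changing the prompting strategy. Simple post-hoc calibration (such as Platt scaling) lowers calibration error but cannot improve discriminative quality (AUROC). The proposed hallucination-aware calibration (HAC) uses vision-grounded hallucination detection signals to improve both calibration and AUROC, especially on open-ended questions.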

Details Motivation: In clinical decision support, VLMs need reliable confidence estimates as well as high accuracy, yet overconfidence in VLMs has not been studied systematically in the medical domain. Method: The authors run an empirical study on three medical VQA benchmarks across three VLM families (Qwen3-VL, InternVL3, LLaVA-NeXT), three model scales (2B to 38B), and multiple confidence-estimation prompting strategies; compare prompting with post-hoc calibration (such as Platt scaling); and propose and validate hallucination-aware calibration (HAC), which adds vision-grounded hallucination detection signals to refine confidence estimates. Result: (1) Overconfidence is widespread and is not mitigated by model scale or prompting strategy. (2) Post-hoc methods such as Platt scaling reduce calibration error and outperform prompting. (3) Because they are monotonic, post-hoc methods cannot improve AUROC. (4) HAC improves both calibration error and AUROC, with the largest gains on open-ended questions. Conclusion: Post-hoc calibration is recommended as standard practice when deploying medical VLMs, and hallucination signals can make VLMs more reliable in medical VQA, offering a new path toward trustworthy clinical AI. Abstract: As vision-language models (VLMs) are increasingly deployed in clinical decision support, more than accuracy is required: knowing when to trust their predictions is equally critical. Yet, a comprehensive and systematic investigation into the overconfidence of these models remains notably scarce in the medical domain. We address this gap through a comprehensive empirical study of confidence calibration in VLMs, spanning three model families (Qwen3-VL, InternVL3, LLaVA-NeXT), three model scales (2B to 38B), and multiple confidence estimation prompting strategies, across three medical visual question answering (VQA) benchmarks. Our study yields three key findings: First, overconfidence persists across model families and is not resolved by scaling or prompting, such as chain-of-thought and verbalized confidence variants. Second, simple post-hoc calibration approaches, such as Platt scaling, reduce calibration error and consistently outperform the prompt-based strategy. Third, due to their (strict) monotonicity, these post-hoc calibration methods are inherently limited in improving the discriminative quality of predictions, leaving AUROC at the same level. Motivated by these findings, we investigate hallucination-aware calibration (HAC), which incorporates vision-grounded hallucination detection signals as complementary inputs to refine confidence estimates. We find that leveraging these hallucination signals improves both calibration and AUROC, with the largest gains on open-ended questions.
Overall, our findings suggest post-hoc calibration as standard practice for medical VLM deployment over raw confidence estimates, and highlight the practical usefulness of hallucination signals to enable more reliable use of VLMs in medical VQA.
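
The third finding, that a strictly monotonic calibrator leaves AUROC unchanged, can be checked with a few lines of Python. The sketch below applies a Platt-style sigmoid with hypothetical coefficients (`a` and `b` are hand-picked, not fitted) to made-up confidence scores, where label 1 means the answer was correct.

```python
import math


def auroc(scores, labels):
    """Probability that a correct answer outscores an incorrect one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def platt(score, a, b):
    """Platt-style map sigmoid(a * score + b); strictly increasing when a > 0."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


# Made-up overconfident scores: mean confidence is about 0.90, accuracy is 0.60.
raw = [0.95, 0.90, 0.85, 0.99, 0.80]
correct = [1, 0, 1, 1, 0]
accuracy = sum(correct) / len(correct)

calibrated = [platt(s, a=4.0, b=-3.2) for s in raw]  # hypothetical coefficients

auroc_raw = auroc(raw, correct)
auroc_cal = auroc(calibrated, correct)
gap_raw = abs(sum(raw) / len(raw) - accuracy)
gap_cal = abs(sum(calibrated) / len(calibrated) - accuracy)
```

The map shrinks the gap between mean confidence and accuracy, yet because it preserves the ordering of scores, every pairwise comparison that AUROC counts is unchanged. Improving AUROC therefore requires extra information, which is the role the hallucination signals play in HAC.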

[70] Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding

Ye Mao,Weixun Luo,Ranran Huang,Junpeng Jing,Krystian Mikolajczyk

Main category: cs.CV

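TL;DR: This paper proposes UniScene3D, a transformer-based 3D encoder that jointly models appearance and geometry from multi-view colored pointmaps. Cross-view geometric alignment and grounded view alignment improve robustness, and the model reaches state-of-the-art results on several 3D scene understanding tasks.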

Details Motivation: Pretraining 3D encoders by aligning them with CLIP is a promising way to learn general 3D scene representations, but it requires better fusion of appearance and geometry and consistency across views. Method: The authors propose the UniScene3D transformer encoder, which takes multi-view colored pointmaps as input, and design cross-view geometric alignment (for geometric consistency) and grounded view alignment (for semantic consistency). Result: In low-shot and task-specific fine-tuning evaluations, the model achieves state-of-the-art performance on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA. Conclusion: The method learns unified 3D scene representations effectively and improves multi-task generalization and robustness. Abstract: Pretraining 3D encoders by aligning with Contrastive Language Image Pretraining (CLIP) has emerged as a promising direction to learn generalizable representations for 3D scene understanding. In this paper, we propose UniScene3D, a transformer-based encoder that learns unified scene representations from multi-view colored pointmaps, jointly modeling image appearance and geometry. For robust colored pointmap representation learning, we introduce novel cross-view geometric alignment and grounded view alignment to enforce cross-view geometry and semantic consistency. Extensive low-shot and task-specific fine-tuning evaluations on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA demonstrate state-of-the-art performance. These results highlight the effectiveness of our approach for unified 3D scene understanding. Project page: https://yebulabula.github.io/UniScene3D/

[71] WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models

Haiyu Wang,Yutong Wang,Jack Jiang,Sai Qian Zhang

Main category: cs.CV

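TL;DR: This paper proposes Weighted SVD (WSVD), which applies SVD at a finer granularity with adaptive importance weighting and achieves over 1.8x decoding speedup while preserving accuracy.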

Details Motivation: Existing SVD methods rarely deliver substantial latency reductions in Vision Language Models. Method: The authors introduce a new computational pattern that applies SVD at a finer granularity, adaptively assign importance weights during SVD to preserve accuracy, and then add quantization of both weights and activations. Result: WSVD achieves over 1.8x decoding speedup while preserving model accuracy. Conclusion: WSVD is an efficient, accuracy-preserving compression method for VLMs, and the code has been open-sourced. Abstract: Singular Value Decomposition (SVD) has become an important technique for reducing the computational burden of Vision Language Models (VLMs), which play a central role in tasks such as image captioning and visual question answering. Although multiple prior works have proposed efficient SVD variants to enable low-rank operations, we find that in practice it remains difficult to achieve substantial latency reduction during model execution. To address this limitation, we introduce a new computational pattern and apply SVD at a finer granularity, enabling real and measurable improvements in execution latency. Furthermore, recognizing that weight elements differ in their relative importance, we adaptively allocate relative importance to each element during the SVD process to better preserve accuracy, then extend this framework with quantization applied to both weights and activations, resulting in a highly efficient VLM. Collectively, we introduce Weighted SVD (WSVD), which outperforms other approaches by achieving over $1.8\times$ decoding speedup while preserving accuracy. We open source our code at: https://github.com/SAI-Lab-NYU/WSVD
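
As background on why low-rank factorization can reduce compute at all, the sketch below counts multiply-accumulate operations (MACs) for a dense layer versus a rank-`r` factorization. It illustrates only generic truncated SVD, not WSVD's weighting or its finer-grained computational pattern, and the layer shape is a hypothetical example.

```python
def dense_macs(m, n):
    """MACs for multiplying one input vector by a dense m x n weight matrix."""
    return m * n


def low_rank_macs(m, n, r):
    """MACs when the matrix is factored into (m x r) and (r x n) pieces."""
    return r * (m + n)


def break_even_rank(m, n):
    """Rank below which the factored form needs fewer MACs than the dense one."""
    return m * n / (m + n)


# Hypothetical square projection layer.
M, N, RANK = 4096, 4096, 512
dense = dense_macs(M, N)
factored = low_rank_macs(M, N, RANK)
theoretical_speedup = dense / factored
limit = break_even_rank(M, N)
```

A fourfold drop in MACs does not guarantee a fourfold drop in latency, since memory traffic and kernel launch overhead also matter; that gap between theoretical and measured savings is the problem the abstract says WSVD targets.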

[72] FusionBERT: Multi-View Image-3D Retrieval via Cross-Attention Visual Fusion and Normal-Aware 3D Encoder

Wei Li,Yufan Ren,Hanqing Jiang,Jianhui Ding,Zhen Peng,Leman Feng,Yichun Shentu,Guoqiang Xu,Baigui Sun

Main category: cs.CV

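TL;DR: FusionBERT is a multi-view visual fusion framework for image-3D cross-modal retrieval. It aggregates multi-view image features with cross-attention and adds a normal-aware 3D encoder that improves representations of textureless or degraded 3D models, outperforming existing state-of-the-art methods in both single-view and multi-view settings.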

Details Motivation: Existing image-3D representation learning methods mainly align features between a single-view image and a 3D model, which fits poorly with real scenarios where objects are captured from multiple viewpoints. Multi-view observations carry complementary geometric and appearance cues, but current multimodal large models rarely study how to fuse them to improve cross-modal retrieval. Method: FusionBERT has two parts: (1) a cross-attention-based multi-view visual aggregator that adaptively fuses features from multiple views, models complementary relationships between views, and selectively emphasizes informative visual cues; and (2) a normal-aware 3D model encoder that jointly encodes point normals and 3D positions to strengthen geometric representation, especially for textureless or color-degraded 3D models. Result: On image-3D retrieval, FusionBERT significantly outperforms existing state-of-the-art multimodal large models in both single-view and multi-view settings, supporting the value of its multi-view fusion strategy and normal-aware encoding. Conclusion: FusionBERT offers a new paradigm for multi-view image-3D cross-modal retrieval; by fusing multi-view visual information and strengthening 3D geometric representation, it improves retrieval accuracy and establishes a strong baseline for the task. Abstract: We propose FusionBERT, a novel multi-view visual fusion framework for image-3D multimodal retrieval. Existing image-3D representation learning methods predominantly focus on feature alignment of a single object image and its 3D model, limiting their applicability in realistic scenarios where an object is typically observed and captured from multiple viewpoints. Although multi-view observations naturally provide complementary geometric and appearance cues, existing multimodal large models rarely explore how to effectively fuse such multi-view visual information for better cross-modal retrieval. To address this limitation, we introduce a multi-view image-3D retrieval framework named FusionBERT, which uses a cross-attention-based multi-view visual aggregator to adaptively integrate features from multi-view images of an object. The proposed multi-view visual encoder fuses inter-view complementary relationships and selectively emphasizes informative visual cues across multiple views to obtain a more robust fused visual feature for better 3D model matching. Furthermore, FusionBERT includes a normal-aware 3D model encoder that further enhances the 3D geometric feature of an object model by jointly encoding point normals and 3D positions, enabling more robust representation learning for textureless or color-degraded 3D models.
Extensive image-3D retrieval experiments demonstrate that FusionBERT achieves significantly higher retrieval accuracy than SOTA multimodal large models under both single-view and multi-view settings, establishing a strong baseline for multi-view multimodal retrieval.

[73] TrackerSplat: Exploiting Point Tracking for Fast and Robust Dynamic 3D Gaussians Reconstruction

Daheng Yin,Isaac Ding,Yili Jin,Jianxin Shi,Jiangchuan Liu

Main category: cs.CV

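TL;DR: TrackerSplat combines off-the-shelf point tracking models with 3D Gaussian Splatting to make dynamic scene reconstruction robust to large inter-frame displacements, reducing artifacts while improving parallel reconstruction throughput and visual quality.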

Details Motivation: Existing Gaussian-based methods for dynamic scene reconstruction handle large inter-frame displacements from fast object motion poorly, producing artifacts and temporal inconsistency. Method: TrackerSplat uses off-the-shelf point tracking models to extract pixel trajectories, triangulates the per-view trajectories onto 3D Gaussians to guide the initialization of their position, rotation, and scale, and then runs gradient-based optimization. Result: On real-world datasets, TrackerSplat is robust in large-displacement scenes and, compared with baselines, improves parallel processing throughput and rendering quality while reducing fading and recoloring artifacts. Conclusion: By initializing Gaussians with point-tracking guidance, TrackerSplat overcomes the poor adaptation of conventional 3DGS to large displacements in dynamic scenes, balances efficiency and quality, and suits applications such as robotics and immersive media. Abstract: Recent advancements in 3D Gaussian Splatting (3DGS) have demonstrated its potential for efficient and photorealistic 3D reconstructions, which is crucial for diverse applications such as robotics and immersive media. However, current Gaussian-based methods for dynamic scene reconstruction struggle with large inter-frame displacements, leading to artifacts and temporal inconsistencies under fast object motions. To address this, we introduce TrackerSplat, a novel method that integrates advanced point tracking methods to enhance the robustness and scalability of 3DGS for dynamic scene reconstruction. TrackerSplat utilizes off-the-shelf point tracking models to extract pixel trajectories and triangulate per-view pixel trajectories onto 3D Gaussians to guide the relocation, rotation, and scaling of Gaussians before training. This strategy effectively handles large displacements between frames, dramatically reducing the fading and recoloring artifacts prevalent in prior methods. By accurately positioning Gaussians prior to gradient-based optimization, TrackerSplat overcomes the quality degradation associated with large frame gaps when processing multiple adjacent frames in parallel across multiple devices, thereby boosting reconstruction throughput while preserving rendering quality. Experiments on real-world datasets confirm the robustness of TrackerSplat in challenging scenarios with significant displacements, achieving superior throughput under parallel settings and maintaining visual quality compared to baselines. The code is available at https://github.com/yindaheng98/TrackerSplat.

[74] Moondream Segmentation: From Words to Masks

Ethan Reid

Main category: cs.CV

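TL;DR: Moondream Segmentation extends Moondream 3 to referring image segmentation. It autoregressively decodes a vector path and iteratively refines the mask, adds a reinforcement learning stage to improve mask quality, and releases RefCOCO-M, a validation split with more accurate masks.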

Details Motivation: To address ambiguity in the supervised signal for referring image segmentation and evaluation noise caused by polygon annotations. Method: The authors build a referring image segmentation model on Moondream 3; decode a vector path autoregressively and iteratively refine the rasterized mask; add a reinforcement learning stage that directly optimizes mask quality and produces coarse-to-ground-truth targets for training the refiner; and release RefCOCO-M, a validation split with boundary-accurate masks. Result: The model reaches 80.2% cIoU on RefCOCO (val) and 62.6% mIoU on LVIS (val). Conclusion: Moondream Segmentation improves the accuracy and robustness of referring image segmentation, and both reinforcement learning and high-quality annotations are important to the gains. Abstract: We present Moondream Segmentation, a referring image segmentation extension of Moondream 3, a vision-language model. Given an image and a referring expression, the model autoregressively decodes a vector path and iteratively refines the rasterized mask into a final detailed mask. We introduce a reinforcement learning stage that resolves ambiguity in the supervised signal by directly optimizing mask quality. Rollouts from this stage produce coarse-to-ground-truth targets for the refiner. To mitigate evaluation noise from polygon annotations, we release RefCOCO-M, a cleaned RefCOCO validation split with boundary-accurate masks. Moondream Segmentation achieves a cIoU of 80.2% on RefCOCO (val) and 62.6% mIoU on LVIS (val).
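
The two reported metrics aggregate differently, which matters when comparing the numbers. The sketch below assumes the usual definitions from the referring segmentation literature: cIoU as cumulative IoU (total intersection over total union across the dataset) and mIoU as the mean of per-sample IoU. The abstract does not define them, so treat this as an assumption; masks are flattened to 0/1 lists for simplicity.

```python
def intersection_union(pred, gt):
    """Pixel counts of the intersection and union of two binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def mean_iou(pairs):
    """Average of per-sample IoU; every sample counts equally."""
    ious = []
    for pred, gt in pairs:
        inter, union = intersection_union(pred, gt)
        ious.append(inter / union if union else 1.0)  # empty vs empty: perfect
    return sum(ious) / len(ious)


def cumulative_iou(pairs):
    """Total intersection over total union; large objects dominate."""
    total_inter = total_union = 0
    for pred, gt in pairs:
        inter, union = intersection_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union


small = ([1, 0, 0, 0], [1, 1, 0, 0])                  # IoU 1/2
large = ([1] * 8 + [0, 0], [1] * 9 + [0])             # IoU 8/9
miou = mean_iou([small, large])
ciou = cumulative_iou([small, large])
```

Because cumulative IoU is weighted by object area, a model that handles large objects well can score higher on cIoU than on mIoU over the same predictions, so figures reported under the two metrics are not directly comparable.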

[75] Rascene: High-Fidelity 3D Scene Imaging with mmWave Communication Signals

Kunzhe Song,Geo Jie Zhou,Xiaoming Liu,Huacheng Zeng

Main category: cs.CV

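TL;DR: Rascene is an integrated sensing and communication (ISAC) framework that uses mmWave OFDM communication signals for 3D scene imaging. Multi-frame, spatially adaptive fusion with confidence-weighted forward projection yields accurate, low-cost, and robust 3D scene reconstruction.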

Details Motivation: Optical sensors such as cameras and LiDAR degrade under adverse conditions including smoke, fog, and poor lighting, while specialized radar depends on bespoke hardware and licensed spectrum, which limits scalability and cost-effectiveness. Method: The Rascene framework uses ubiquitous mmWave OFDM communication signals for 3D imaging, applying multi-frame, spatially adaptive fusion with confidence-weighted forward projection to overcome the sparsity and multipath ambiguity of individual frames. Result: Experiments show that the method reconstructs 3D scenes with high precision and is low-cost, scalable, and robust. Conclusion: Rascene offers a new paradigm for low-cost, scalable, and robust 3D environmental perception and moves ISAC closer to practical use in autonomous driving and robot navigation. Abstract: Robust 3D environmental perception is critical for applications such as autonomous driving and robot navigation. However, optical sensors such as cameras and LiDAR often fail under adverse conditions, including smoke, fog, and non-ideal lighting. Although specialized radar systems can operate in these environments, their reliance on bespoke hardware and licensed spectrum limits scalability and cost-effectiveness. This paper introduces Rascene, an integrated sensing and communication (ISAC) framework that leverages ubiquitous mmWave OFDM communication signals for 3D scene imaging. To overcome the sparse and multipath-ambiguous nature of individual radio frames, Rascene performs multi-frame, spatially adaptive fusion with confidence-weighted forward projection, enabling the recovery of geometric consensus across arbitrary poses. Experimental results demonstrate that our method reconstructs 3D scenes with high precision, offering a new pathway toward low-cost, scalable, and robust 3D perception.

[76] Unlocking Multi-Site Clinical Data: A Federated Approach to Privacy-First Child Autism Behavior Analysis

Guangyu Sun,Wenhan Wu,Zhishuai Guo,Ziteng Wang,Pegah Khosravi,Chen Chen

Main category: cs.CV

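TL;DR: This paper proposes a privacy-preserving framework based on Federated Learning (FL) and human skeletal abstraction for multi-site recognition of autistic behaviors in children, balancing data privacy with model generalization.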

Details Motivation: Automated recognition of autistic behaviors in children is important for early intervention, but privacy regulations such as HIPAA prevent clinical data from being pooled centrally, and data scarcity at individual sites leads to poor generalization and difficulty adapting to local distributions. Method: The authors present the first federated learning framework for pose-based child autism behavior recognition, with two layers of privacy protection: (1) skeletal keypoints replace raw RGB video to remove identifiable visual information, and (2) federated learning keeps sensitive pose data on site. Result: Experiments on the MMASD benchmark show higher recognition accuracy than traditional federated baselines, combining accuracy, generalization, and site-specific personalization. Conclusion: The work offers a feasible and robust paradigm for privacy-first autism behavior analysis across clinical sites, in which models move but data stays put. Abstract: Automated recognition of autistic behaviors in children is essential for early intervention and objective clinical assessment. However, the development of robust models is severely hindered by strict privacy regulations (e.g., HIPAA) and the sensitive nature of pediatric data, which prevents the centralized aggregation of clinical datasets. Furthermore, individual clinical sites often suffer from data scarcity, making it difficult to learn generalized behavior patterns or tailor models to site-specific patient distributions. To address these challenges, we observe that Federated Learning (FL) can decouple model training from raw data access, enabling multi-site collaboration while maintaining strict data residency. In this paper, we present the first study exploring Federated Learning for pose-based child autism behavior recognition. Our framework employs a two-layer privacy protection mechanism: utilizing human skeletal abstraction to remove identifiable visual information from the raw RGB videos and FL to ensure sensitive pose data remains within the clinic. This approach leverages distributed clinical data to learn generalized representations while providing the flexibility for site-specific personalization. Experimental results on the MMASD benchmark demonstrate that our framework achieves high recognition accuracy, outperforming traditional federated baselines and providing a robust, privacy-first solution for multi-site clinical analysis.
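
To make the "data stays in the clinic" idea concrete, the sketch below shows the aggregation step of federated averaging (FedAvg), the standard baseline in which the server combines client model parameters weighted by local dataset size. The abstract does not name the aggregation rule the authors use, so FedAvg is swapped in here purely for illustration; the parameter vectors and site sizes are made up.

```python
def fed_avg(client_params, client_sizes):
    """Size-weighted average of client parameter vectors (FedAvg aggregation)."""
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical clinics: only these parameter vectors leave each site,
# never the skeleton sequences or the raw video they were trained on.
clinic_a = [1.0, 2.0]   # trained on 1 local sample
clinic_b = [3.0, 4.0]   # trained on 3 local samples
global_params = fed_avg([clinic_a, clinic_b], [1, 3])
```

Site-specific personalization, which the abstract also mentions, would typically fine-tune or partially replace these global parameters locally after aggregation.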

[77] Smart Transfer: Leveraging Vision Foundation Model for Rapid Building Damage Mapping with Post-Earthquake VHR Imagery

Hao Li,Liwei Zou,Wenping Yin,Gulsen Taskin,Naoto Yokoya,Danfeng Hong,Wufan Zhao

Main category: cs.CV

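TL;DR: This paper proposes Smart Transfer, a framework that uses vision foundation models and two new transfer strategies (Pixel-wise Clustering and a Distance-Penalized Triplet) for rapid building damage mapping from post-earthquake very-high-resolution imagery, improving cross-region generalization.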

Details Motivation: Traditional post-disaster damage surveys generalize poorly across urban morphologies and new disaster events, and they rely on slow, labor-intensive manual annotation that cannot meet the rapid-response needs of the "Golden 72 Hours". Method: Smart Transfer is a GeoAI framework built on vision foundation models, with two new transfer strategies: (1) Pixel-wise Clustering (PC), which provides prototype-level global feature alignment, and (2) a Distance-Penalized Triplet (DPT), which incorporates spatial autocorrelation by penalizing semantically inconsistent but spatially adjacent patches more strongly. Result: On data from the 2023 Turkiye-Syria earthquake, Smart Transfer performs well in cross-region transfer settings including Leave One Domain Out (LODO) and Specific Source Domain Combination (SSDC), enabling efficient, automated damage mapping. Conclusion: Smart Transfer offers climate-vulnerable regions a scalable, automated GeoAI solution that speeds up post-disaster damage assessment and helps strengthen community disaster resilience. Abstract: Living in a changing climate, human society now faces more frequent and severe natural disasters than ever before. As a consequence, rapid disaster response during the "Golden 72 Hours" of search and rescue becomes a vital humanitarian necessity and community concern. However, traditional disaster damage surveys routinely fail to generalize across distinct urban morphologies and new disaster events. Effective damage mapping typically requires exhaustive and time-consuming manual data annotation. To address this issue, we introduce Smart Transfer, a novel Geospatial Artificial Intelligence (GeoAI) framework, leveraging state-of-the-art vision Foundation Models (FMs) for rapid building damage mapping with post-earthquake Very High Resolution (VHR) imagery. Specifically, we design two novel model transfer strategies: first, Pixel-wise Clustering (PC), ensuring robust prototype-level global feature alignment; second, a Distance-Penalized Triplet (DPT), integrating patch-level spatial autocorrelation patterns by assigning stronger penalties to semantically inconsistent yet spatially adjacent patches. Extensive experiments and ablations from the recent 2023 Turkiye-Syria earthquake show promising performance in multiple cross-region transfer settings, namely Leave One Domain Out (LODO) and Specific Source Domain Combination (SSDC).
Moreover, Smart Transfer provides a scalable, automated GeoAI solution to accelerate building damage mapping and support rapid disaster response, offering new opportunities to enhance disaster resilience in climate-vulnerable regions and communities. The data and code are publicly available at https://github.com/ai4city-hkust/SmartTransfer.
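
The abstract describes the Distance-Penalized Triplet only in words: spatially adjacent patches that disagree semantically are penalized more strongly. The sketch below pairs the standard triplet margin loss with a hypothetical proximity weight to show what such a penalty could look like. The exponential weighting, the `sigma` parameter, and all numbers are assumptions for illustration, not the paper's definition.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0, weight=1.0):
    """Standard triplet margin loss, optionally scaled by a penalty weight."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    return weight * max(0.0, gap)


def proximity_weight(spatial_distance, sigma=1.0):
    """Hypothetical weight in (1, 2]: closer negative patches are penalized more."""
    return 1.0 + math.exp(-spatial_distance / sigma)


anchor, positive, negative = [0.0, 0.0], [0.0, 1.0], [1.0, 0.0]
near = triplet_loss(anchor, positive, negative, weight=proximity_weight(0.0))
far = triplet_loss(anchor, positive, negative, weight=proximity_weight(10.0))
```

With identical features, the triplet whose negative patch lies next to the anchor contributes about twice the loss of the distant one, which pushes the model hardest where damaged and intact buildings sit side by side.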

[78] Cross-Vehicle 3D Geometric Consistency for Self-Supervised Surround Depth Estimation on Articulated Vehicles

Weimin Liu,Jiyuan Qiu,Wenjun Wang,Joshua H. Meng

Main category: cs.CV

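TL;DR: This paper proposes ArticuSurDepth, a self-supervised surround-view depth estimation framework for articulated vehicles. It adds cross-view and cross-vehicle geometric consistency constraints, multi-view spatial context enrichment, a surface normal constraint, and ground-aware camera height regularization, improving depth accuracy and reaching state-of-the-art results on several benchmarks.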

Details Motivation: Existing self-supervised surround depth estimation methods are designed mainly for passenger vehicles and do not account for the cross-segment geometric complexity and motion coupling of articulated vehicles (such as trucks and construction machinery), which leads to inconsistent depth reasoning across views. Method: ArticuSurDepth (1) uses a vision foundation model to provide structural priors; (2) adds a multi-view spatial context enrichment strategy; (3) introduces a cross-view surface normal constraint; (4) applies ground-plane-aware camera height regularization to encourage metric depth estimation; and (5) models cross-vehicle pose consistency to bridge motion estimation between articulated segments. Result: The method achieves state-of-the-art depth estimation on a self-collected articulated-vehicle dataset and on the public DDAD, nuScenes, and KITTI benchmarks. Conclusion: ArticuSurDepth addresses the geometric and motion modeling difficulties of surround depth estimation on articulated vehicles, shows the value of structural priors and multi-source geometric consistency for self-supervised depth learning, and points to low-cost 3D perception for specialized vehicles and robotic platforms. Abstract: Surround depth estimation provides a cost-effective alternative to LiDAR for 3D perception in autonomous driving. While recent self-supervised methods explore multi-camera settings to improve scale awareness and scene coverage, they are primarily designed for passenger vehicles and rarely consider articulated vehicles or robotics platforms. The articulated structure introduces complex cross-segment geometry and motion coupling, making consistent depth reasoning across views more challenging. In this work, we propose ArticuSurDepth, a self-supervised framework for surround-view depth estimation on articulated vehicles that enhances depth learning through cross-view and cross-vehicle geometric consistency guided by structural priors from a vision foundation model. Specifically, we introduce a multi-view spatial context enrichment strategy and a cross-view surface normal constraint to improve structural coherence across spatial and temporal contexts. We further incorporate camera height regularization with ground plane-awareness to encourage metric depth estimation, together with cross-vehicle pose consistency that bridges motion estimation between articulated segments. To validate the proposed method, we built an articulated-vehicle experimental platform and collected a dataset with it. Experimental results demonstrate state-of-the-art (SoTA) depth estimation performance on our self-collected dataset as well as on the DDAD, nuScenes, and KITTI benchmarks.

[79] Drift-Resilient Temporal Priors for Visual Tracking

Yuqing Huang,Liting Lin,Weijun Zhuang,Zhenyu He,Xin Li

Main category: cs.CV

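TL;DR: This paper proposes DTPTrack, a module that suppresses model drift in multi-frame tracking using a Temporal Reliability Calibrator (TRC) and a Temporal Guidance Synthesizer (TGS). It improves several trackers and sets new state-of-the-art results on benchmarks including LaSOT and GOT-10k.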

Details Motivation: Existing multi-frame trackers are prone to model drift because they naively aggregate noisy historical predictions, so a more robust way of modeling temporal information is needed. Method: DTPTrack is a lightweight, general module with two core components: (1) a Temporal Reliability Calibrator (TRC) that learns a reliability score for each historical frame state while anchoring on the ground-truth template, and (2) a Temporal Guidance Synthesizer (TGS) that synthesizes the calibrated history into dynamic temporal priors to guide prediction. Result: Integrated into OSTrack, ODTrack, and LoRAT, the module yields consistent, significant gains. The best model, built on an extended LoRATv2 backbone, reaches a 77.5% Success rate on LaSOT and 80.3% AO on GOT-10k, setting a new state of the art on several benchmarks. Conclusion: DTPTrack is an efficient, general, plug-and-play temporal modeling module that mitigates model drift and improves the robustness and accuracy of multi-frame visual tracking. Abstract: Temporal information is crucial for visual tracking, but existing multi-frame trackers are vulnerable to model drift caused by naively aggregating noisy historical predictions. In this paper, we introduce DTPTrack, a lightweight and generalizable module designed to be seamlessly integrated into existing trackers to suppress drift. Our framework consists of two core components: (1) a Temporal Reliability Calibrator (TRC) mechanism that learns to assign a per-frame reliability score to historical states, filtering out noise while anchoring on the ground-truth template; and (2) a Temporal Guidance Synthesizer (TGS) module that synthesizes this calibrated history into a compact set of dynamic temporal priors to provide predictive guidance. To demonstrate its versatility, we integrate DTPTrack into three diverse tracking architectures (OSTrack, ODTrack, and LoRAT) and show consistent, significant performance gains across all baselines. Our best-performing model, built upon an extended LoRATv2 backbone, sets a new state-of-the-art on several benchmarks, achieving a 77.5% Success rate on LaSOT and an 80.3% AO on GOT-10k.

[80] Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs

Yuhui Lin,Siyue Yu,Yuxing Yang,Guangliang Cheng,Jimin Xiao

Main category: cs.CV

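TL;DR: This paper proposes Efficient3D, a framework that accelerates inference in 3D multimodal large language models while maintaining accuracy, using a Debiased Visual Token Importance Estimator (DVTIE) and an Adaptive Token Rebalancing (ATR) strategy.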

Details Motivation: 3D multimodal large language models (MLLMs) are large and take high-dimensional input features, so inference is expensive and deployment on resource-constrained platforms is difficult. Method: Efficient3D is a unified visual token pruning framework with two core modules: the Debiased Visual Token Importance Estimator (DVTIE), which predicts token importance more reliably, and the Adaptive Token Rebalancing (ATR) strategy, which adjusts pruning strength dynamically according to scene complexity. Result: Experiments on five 3D vision-language benchmarks (ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D) show that Efficient3D outperforms unpruned baselines, including a +2.57% CIDEr improvement on Scan2Cap. Conclusion: Efficient3D offers a scalable, efficient inference solution for 3D MLLMs that balances computational cost with semantic completeness. Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have expanded reasoning capabilities into 3D domains, enabling fine-grained spatial understanding. However, the substantial size of 3D MLLMs and the high dimensionality of input features introduce considerable inference overhead, which limits practical deployment on resource-constrained platforms. To overcome this limitation, this paper presents Efficient3D, a unified framework for visual token pruning that accelerates 3D MLLMs while maintaining competitive accuracy. The proposed framework introduces a Debiased Visual Token Importance Estimator (DVTIE) module, which considers the influence of shallow initial layers during attention aggregation, thereby producing more reliable importance predictions for visual tokens. In addition, an Adaptive Token Rebalancing (ATR) strategy is developed to dynamically adjust pruning strength based on scene complexity, preserving semantic completeness and maintaining balanced attention across layers. Together, they enable context-aware token reduction that maintains essential semantics with lower computation. Comprehensive experiments conducted on five representative 3D vision and language benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D, demonstrate that Efficient3D achieves superior performance compared with unpruned baselines, with a +2.57% CIDEr improvement on the Scan2Cap dataset. Therefore, Efficient3D provides a scalable and effective solution for efficient inference in 3D MLLMs. The code is released at: https://github.com/sol924/Efficient3D

[81] Parser-Oriented Structural Refinement for a Stable Layout Interface in Document Parsing

Fuyuan Liu,Dianyu Yu,He Ren,Nayu Liu,Xiaomian Kang,Delai Qiu,Fa Zhang,Genpeng Zhen,Shengping Liu,Jiaen Liang,Wei Huang,Yining Wang,Junnan Zhu

Main category: cs.CV

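TL;DR: This paper proposes a lightweight structural refinement module that stabilizes the interface between the detector and the downstream parser in Document Layout Analysis (DLA). Using set-level reasoning, it jointly handles instance retention, box localization, and parser input order prediction, significantly improving layout quality and reading-order accuracy on dense, complex pages.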

Details Motivation: In explicit Document Layout Analysis (DLA) pipelines, the downstream parser receives only a filtered and serialized set of layout instances; on dense pages with overlapping regions or ambiguous boundaries, detector output is unstable, so the retained instance set becomes inconsistent with its input order and causes severe parsing errors. Method: A lightweight structural refinement module is inserted between a DETR-style detector and the parser. It treats raw detector outputs as a compact hypothesis pool and performs joint set-level reasoning over query features, semantic cues, box geometry, and visual evidence; it produces instance retention decisions, refined boxes, and the parser input order in one pass; and it is trained with retention-oriented supervision and a difficulty-aware ordering objective. Result: The method consistently improves page-level layout quality on multiple public benchmarks; integrated into an end-to-end parsing pipeline, it substantially reduces sequence mismatch and reaches a Reading Order Edit of 0.024 on OmniDocBench. Conclusion: The structural refinement module effectively stabilizes the detector-parser interface, improving layout consistency and downstream parsing robustness, especially on structurally complex pages, and gives document understanding systems a more reliable intermediate representation. Abstract: Accurate document parsing requires both robust content recognition and a stable parser interface. In explicit Document Layout Analysis (DLA) pipelines, downstream parsers do not consume the full detector output. Instead, they operate on a retained and serialized set of layout instances. However, on dense pages with overlapping regions and ambiguous boundaries, unstable layout hypotheses can make the retained instance set inconsistent with its parser input order, leading to severe downstream parsing errors. To address this issue, we introduce a lightweight structural refinement stage between a DETR-style detector and the parser to stabilize the parser interface. Treating raw detector outputs as a compact hypothesis pool, the proposed module performs set-level reasoning over query features, semantic cues, box geometry, and visual evidence. From a shared refined structural state, it jointly determines instance retention, refines box localization, and predicts parser input order before handoff. We further introduce retention-oriented supervision and a difficulty-aware ordering objective to better align the retained instance set and its order with the final parser input, especially on structurally complex pages. Extensive experiments on public benchmarks show that our method consistently improves page-level layout quality. When integrated into a standard end-to-end parsing pipeline, the stabilized parser interface also substantially reduces sequence mismatch, achieving a Reading Order Edit of 0.024 on OmniDocBench.

[82] DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning

Fanwei Zeng,Changtao Miao,Jing Huang,Zhiya Tan,Shutao Gong,Xiaoming Yu,Yang Wang,Weibin Yao,Joey Tianyi Zhou,Jianshu Li,Yin Yan

Main category: cs.CV

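TL;DR: This paper proposes DocShield, the first unified framework to formulate text-centric forgery analysis as a visual-logical co-reasoning problem. A Cross-Cues-aware Chain of Thought (CCT) mechanism and weighted multi-task reward optimization significantly improve detection, localization, and explanation, and the authors also release the multilingual RealText-V1 dataset.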

Details Motivation: Rapid progress in generative AI has made text-centric image forgeries increasingly realistic, threatening document safety; existing methods rely on visual cues, lack evidence-driven reasoning, and treat detection, localization, and explanation as separate tasks, which hurts reliability and interpretability. Method: DocShield is a unified framework built around a Cross-Cues-aware Chain of Thought (CCT) mechanism that iteratively cross-validates visual anomalies against textual semantics; a weighted multi-task reward for GRPO-based optimization jointly aligns reasoning structure, spatial evidence, and authenticity prediction; and the RealText-V1 multilingual dataset of document-like text images provides pixel-level manipulation masks and expert-level textual explanations. Result: On T-IC13, macro-average F1 improves by 41.4% over specialized frameworks and by 23.4% over GPT-4o, with consistent gains on challenging benchmarks such as T-SROIE; the code, model, and dataset will be open-sourced. Conclusion: Through its visual-logical co-reasoning paradigm, DocShield unifies the detection, localization, and explanation tasks of forgery analysis, markedly improves performance and interpretability, and offers a new paradigm and practical tool for document-level text forgery forensics. Abstract: The rapid progress of generative AI has enabled increasingly realistic text-centric image forgeries, posing major challenges to document safety. Existing forensic methods mainly rely on visual cues and lack evidence-based reasoning to reveal subtle text manipulations. Detection, localization, and explanation are often treated as isolated tasks, limiting reliability and interpretability. To tackle these challenges, we propose DocShield, the first unified framework formulating text-centric forgery analysis as a visual-logical co-reasoning problem. At its core, a novel Cross-Cues-aware Chain of Thought (CCT) mechanism enables implicit agentic reasoning, iteratively cross-validating visual anomalies with textual semantics to produce consistent, evidence-grounded forensic analysis. We further introduce a Weighted Multi-Task Reward for GRPO-based optimization, aligning reasoning structure, spatial evidence, and authenticity prediction. Complementing the framework, we construct RealText-V1, a multilingual dataset of document-like text images with pixel-level manipulation masks and expert-level textual explanations. Extensive experiments show DocShield significantly outperforms existing methods, improving macro-average F1 by 41.4% over specialized frameworks and 23.4% over GPT-4o on T-IC13, with consistent gains on the challenging T-SROIE benchmark. Our dataset, model, and code will be publicly released.

[83] XrayClaw: Cooperative-Competitive Multi-Agent Alignment for Trustworthy Chest X-ray Diagnosis

Shawn Young,Lijian Xu

Main category: cs.CV

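TL;DR: This paper proposes XrayClaw, a framework that uses a cooperative-competitive multi-agent architecture and Competitive Preference Optimization to improve the accuracy, reasoning trustworthiness, and generalization of automated chest X-ray diagnosis.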

Details Motivation: Traditional monolithic AI models suffer from logical inconsistencies and diagnostic hallucinations in chest X-ray interpretation, and existing multi-agent systems are prone to consensus errors because their agents share a single underlying model. Method: XrayClaw is a multi-agent framework with four cooperative agents that simulate the clinical workflow and one competitive auditing agent; a Competitive Preference Optimization objective enforces mutual verification between analytical and holistic interpretations to suppress illogical reasoning. Result: On MS-CXR-T, MIMIC-CXR, and CheXbench, XrayClaw reaches state-of-the-art diagnostic accuracy, clinical reasoning fidelity, and zero-shot domain generalization, and it markedly mitigates cumulative hallucinations. Conclusion: XrayClaw offers a new paradigm for trustworthy medical imaging analysis and shows that a cooperative-competitive mechanism combined with alignment optimization can improve the reliability and robustness of AI diagnosis. Abstract: Chest X-ray (CXR) interpretation is a fundamental yet complex clinical task that increasingly relies on artificial intelligence for automation. However, traditional monolithic models often lack the nuanced reasoning required for trustworthy diagnosis, frequently leading to logical inconsistencies and diagnostic hallucinations. While multi-agent systems offer a potential solution by simulating collaborative consultations, existing frameworks remain susceptible to consensus-based errors when instantiated by a single underlying model. This paper introduces XrayClaw, a novel framework that operationalizes multi-agent alignment through a sophisticated cooperative-competitive architecture. XrayClaw integrates four specialized cooperative agents to simulate a systematic clinical workflow, alongside a competitive agent that serves as an independent auditor. To reconcile these distinct diagnostic pathways, we propose Competitive Preference Optimization, a learning objective that penalizes illogical reasoning by enforcing mutual verification between analytical and holistic interpretations. Extensive empirical evaluations on the MS-CXR-T, MIMIC-CXR, and CheXbench benchmarks demonstrate that XrayClaw achieves state-of-the-art performance in diagnostic accuracy, clinical reasoning fidelity, and zero-shot domain generalization. Our results indicate that XrayClaw effectively mitigates cumulative hallucinations and enhances the overall reliability of automated CXR diagnosis, establishing a new paradigm for trustworthy medical imaging analysis.

[84] VBGS-SLAM: Variational Bayesian Gaussian Splatting Simultaneous Localization and Mapping

Yuhan Zhu,Yanyu Zhang,Jie Xu,Wei Ren

Main category: cs.CV

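TL;DR: This paper proposes VBGS-SLAM, a variational Bayesian 3D Gaussian Splatting SLAM method that jointly optimizes camera poses and scene Gaussian parameters in a probabilistic model and explicitly represents uncertainty, improving robustness and drift resistance while preserving efficient, high-quality rendering.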

Details Motivation: Existing 3DGS-SLAM methods rely on deterministic pose optimization, which makes them sensitive to initialization and prone to catastrophic forgetting as the map evolves. Method: The VBGS-SLAM framework unifies pose tracking and Gaussian map refinement in a generative probabilistic model; using the conjugacy of multivariate Gaussians together with variational inference, it obtains efficient closed-form updates and explicitly maintains posterior uncertainty over poses and scene parameters. Result: The method shows better tracking performance and robustness on long sequences and achieves efficient, high-quality novel view synthesis across diverse synthetic and real-world scenes. Conclusion: Through uncertainty-aware probabilistic modeling, VBGS-SLAM markedly improves the stability and generalization of 3DGS-SLAM while balancing efficiency and accuracy. Abstract: 3D Gaussian Splatting (3DGS) has shown promising results for 3D scene modeling using mixtures of Gaussians, yet its existing simultaneous localization and mapping (SLAM) variants typically rely on direct, deterministic pose optimization against the splat map, making them sensitive to initialization and susceptible to catastrophic forgetting as the map evolves. We propose Variational Bayesian Gaussian Splatting SLAM (VBGS-SLAM), a novel framework that couples the splat map refinement and camera pose tracking in a generative probabilistic form. By leveraging conjugate properties of multivariate Gaussians and variational inference, our method admits efficient closed-form updates and explicitly maintains posterior uncertainty over both poses and scene parameters. This uncertainty-aware method mitigates drift and enhances robustness in challenging conditions, while preserving the efficiency and rendering quality of existing 3DGS. Our experiments demonstrate superior tracking performance and robustness in long sequence prediction, alongside efficient, high-quality novel view synthesis across diverse synthetic and real-world scenes.
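
The closed-form updates mentioned above come from Gaussian conjugacy. The sketch below shows the simplest scalar case, a Gaussian prior on an unknown mean combined with one Gaussian observation of known noise variance. It is a textbook identity used only to illustrate what "closed-form update with posterior uncertainty" means; it is not the VBGS-SLAM update itself, and `gaussian_posterior` is a name introduced for this example.

```python
def gaussian_posterior(prior_mean, prior_var, obs, obs_var):
    """Conjugate update for a Gaussian mean with known observation noise.

    Precisions (inverse variances) add, and the posterior mean is the
    precision-weighted average of the prior mean and the observation.
    """
    prior_prec = 1.0 / prior_var
    obs_prec = 1.0 / obs_var
    post_prec = prior_prec + obs_prec
    post_mean = (prior_prec * prior_mean + obs_prec * obs) / post_prec
    return post_mean, 1.0 / post_prec


# Prior N(0, 1), one observation x = 2 with unit noise variance.
mean, var = gaussian_posterior(0.0, 1.0, 2.0, 1.0)
```

The posterior variance shrinks with every observation, which is the kind of explicit uncertainty a probabilistic tracker can carry forward instead of a single point estimate.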

[85] ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving

Zihao Sheng,Xin Ye,Jingru Luo,Sikai Chen,Liu Ren

Main category: cs.CV

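TL;DR: This paper proposes a VLA autonomous driving framework that combines a world model with intrinsic rewards; by predicting future images and using uncertainty-driven safe exploration, it improves policy robustness and generalization in out-of-distribution scenarios.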

Details Motivation: Existing end-to-end Vision-Language-Action (VLA) driving models trained by behavior cloning are limited by imitation learning and cannot explore new scenarios; reinforcement learning needs state-transition information that offline VLA data does not directly provide, so a world model is needed to support exploration and supervision. Method: The authors build a unified understanding-and-generation framework that (1) uses future RGB and depth image generation as dense world modeling objectives to strengthen the fine-grained representations behind trajectory prediction; (2) uses image prediction uncertainty as an intrinsic reward signal to identify out-of-distribution but safe exploration opportunities; and (3) designs a safety-gated reward and optimizes the policy with Group Relative Policy Optimization (GRPO). Result: The approach achieves state-of-the-art performance on the NAVSIM and nuScenes benchmarks, with a PDMS of 93.7 and an EPDMS of 88.8 on NAVSIM. Conclusion: A world model can provide dense supervision and also serve as an effective intrinsic exploration incentive; the framework markedly improves the safety and adaptability of VLA models in out-of-distribution scenarios and offers a practical path toward reinforcement learning for end-to-end autonomous driving. Abstract: End-to-end autonomous driving models based on Vision-Language-Action (VLA) architectures have shown promising results by learning driving policies through behavior cloning on expert demonstrations. However, imitation learning inherently limits the model to replicating observed behaviors without exploring diverse driving strategies, leaving it brittle in novel or out-of-distribution scenarios. Reinforcement learning (RL) offers a natural remedy by enabling policy exploration beyond the expert distribution. Yet VLA models, typically trained on offline datasets, lack directly observable state transitions, necessitating a learned world model to anticipate action consequences. In this work, we propose a unified understanding-and-generation framework that leverages world modeling to simultaneously enable meaningful exploration and provide dense supervision. Specifically, we augment trajectory prediction with future RGB and depth image generation as dense world modeling objectives, requiring the model to learn fine-grained visual and geometric representations that substantially enrich the planning backbone. Beyond serving as a supervisory signal, the world model further acts as a source of intrinsic reward for policy exploration: its image prediction uncertainty naturally measures a trajectory's novelty relative to the training distribution, where high uncertainty indicates out-of-distribution scenarios that, if safe, represent valuable learning opportunities. 
We incorporate this exploration signal into a safety-gated reward and optimize the policy via Group Relative Policy Optimization (GRPO). Experiments on the NAVSIM and nuScenes benchmarks demonstrate the effectiveness of our approach, achieving a state-of-the-art PDMS score of 93.7 and an EPDMS of 88.8 on NAVSIM. The code and demo will be publicly available at https://zihaosheng.github.io/ExploreVLA/.
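
To make the reward shaping concrete, here is a minimal sketch of a safety-gated reward followed by the group-relative advantage normalization that GRPO uses (each sampled trajectory's reward is standardized against the other samples in its group). The gating rule, the `beta` weight, and the `unsafe_penalty` value are assumptions for illustration; the abstract does not specify the exact reward form.

```python
from statistics import mean, pstdev


def gated_reward(task_reward, uncertainty, is_safe, beta=0.5, unsafe_penalty=-1.0):
    """Hypothetical safety gate: the novelty bonus only counts for safe rollouts."""
    if not is_safe:
        return unsafe_penalty
    return task_reward + beta * uncertainty


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three trajectories sampled for the same scene (toy numbers).
rewards = [
    gated_reward(1.0, 0.2, is_safe=True),   # familiar and safe
    gated_reward(1.0, 0.8, is_safe=True),   # novel and safe: largest bonus
    gated_reward(2.0, 0.9, is_safe=False),  # novel but unsafe: gated out
]
advantages = group_relative_advantages(rewards)
```

In this toy group the novel-but-safe trajectory receives the highest advantage and the unsafe one the lowest, regardless of its raw task reward.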

[86] MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications

Mirali Purohit,Bimal Gajera,Irish Mehta,Bhanu Tokas,Jacob Adler,Steven Lu,Scott Dickenshied,Serina Diniega,Brian Bue,Umaa Rebbapragada,Hannah Kerner

Main category: cs.CV

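TL;DR: MOMO is the first multi-sensor foundation model for Mars remote sensing. It merges models trained on three sensors (HiRISE, CTX, and THEMIS) using a novel Equal Validation Loss (EVL) strategy and outperforms a range of baselines on Mars downstream tasks, especially segmentation.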

Details Motivation: Mars multi-resolution remote sensing data lacks a unified foundation model, so the representations of different sensors, which span a wide range of resolutions and characteristics, need to be fused effectively. Method: MOMO uses model merging to integrate models trained independently on HiRISE, CTX, and THEMIS; its core is the Equal Validation Loss (EVL) strategy, which aligns checkpoints across sensors by validation loss similarity before fusing them via task arithmetic. Result: Pretrained on a large Mars orbital corpus of about 12 million samples, MOMO achieves better overall performance on 9 downstream tasks from Mars-Bench than ImageNet-pretrained models, earth observation foundation models, sensor-specific pretraining, and fully supervised baselines; gains on segmentation tasks are especially consistent and significant. Conclusion: Model merging guided by a good checkpoint selection strategy such as EVL is an effective way to build foundation models for multi-resolution remote sensing data; MOMO provides a reproducible, open-source baseline for AI modeling in planetary science. Abstract: We introduce MOMO, the first multi-sensor foundation model for Mars remote sensing. MOMO uses model merge to integrate representations learned independently from three key Martian sensors (HiRISE, CTX, and THEMIS), spanning resolutions from 0.25 m/pixel to 100 m/pixel. Central to our method is our novel Equal Validation Loss (EVL) strategy, which aligns checkpoints across sensors based on validation loss similarity before fusion via task arithmetic. This ensures models are merged at compatible convergence stages, leading to improved stability and generalization. We train MOMO on a large-scale, high-quality corpus of $\sim 12$ million samples curated from Mars orbital data and evaluate it on 9 downstream tasks from Mars-Bench. MOMO achieves better overall performance compared to ImageNet pre-trained, earth observation foundation model, sensor-specific pre-training, and fully-supervised baselines. Particularly on segmentation tasks, MOMO shows consistent and significant performance improvement. Our results demonstrate that model merging through an optimal checkpoint selection strategy provides an effective approach for building foundation models for multi-resolution data. The model weights, pretraining code, pretraining data, and evaluation code are available at: https://github.com/kerner-lab/MOMO.
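
The two steps named above, aligning checkpoints by validation loss and then fusing them with task arithmetic, can be sketched as follows. This is a simplified reading, not the released MOMO code: the rule for picking the shared target loss (the lowest loss every sensor reaches), the flat weight lists, and the function names are assumptions made for this example. The merge itself follows the standard task arithmetic form, base weights plus scaled task vectors.

```python
def select_equal_loss_checkpoints(val_losses):
    """Pick, per sensor, the checkpoint whose validation loss is closest to a
    shared target. val_losses maps sensor name -> list of per-checkpoint losses.

    Assumed target: the lowest loss level that every sensor reaches.
    """
    target = max(min(losses) for losses in val_losses.values())
    return {
        sensor: min(range(len(losses)), key=lambda i: abs(losses[i] - target))
        for sensor, losses in val_losses.items()
    }


def task_arithmetic_merge(base, finetuned, scale=1.0):
    """Task arithmetic: merged = base + scale * sum_i (finetuned_i - base)."""
    merged = list(base)
    for weights in finetuned:
        for j, (w, b) in enumerate(zip(weights, base)):
            merged[j] += scale * (w - b)
    return merged


# Toy validation curves (made-up numbers) for the three sensors.
picked = select_equal_loss_checkpoints({
    "hirise": [0.9, 0.6, 0.4, 0.3],
    "ctx": [0.8, 0.5, 0.45],
    "themis": [1.0, 0.7, 0.6, 0.46, 0.2],
})
merged = task_arithmetic_merge([1.0, 2.0], [[1.5, 2.0], [1.0, 1.0]], scale=0.5)
```

Selecting checkpoints at matching loss levels, rather than simply the last checkpoint of each run, is what the abstract means by merging models at compatible convergence stages.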

[87] THOM: Generating Physically Plausible Hand-Object Meshes From Text

Uyoung Jeong,Yihalem Yimolal Tiruneh,Hyung Jin Chang,Seungryul Baek,Kwang In Kim

Main category: cs.CV

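TL;DR: THOM is a training-free framework that generates high-fidelity, physically plausible 3D hand-object interaction (HOI) meshes from text, using a two-stage pipeline (Gaussian generation followed by physics-based optimization) together with new mesh extraction and vertex mapping methods.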

Details Motivation: Text-to-3D hand-object interaction generation is hampered by ill-posed mesh extraction and by physics-based optimization on erroneous meshes, and existing methods depend on template object meshes, which limits generalization. Method: THOM first generates 3D Gaussian representations of the hand and the object, then runs physics-based HOI optimization; an explicit vertex-to-Gaussian mapping supports topology-aware regularization, and the pipeline adds vision-language model (VLM) guided translation refinement and contact-aware optimization. Result: THOM surpasses state-of-the-art methods in text alignment, visual realism, and interaction plausibility. Conclusion: THOM delivers high-quality 3D HOI generation without training or template meshes, balancing visual fidelity with physical plausibility, and offers a practical option for robotic grasping and VR/AR content generation. Abstract: The generation of 3D hand-object interactions (HOIs) from text is crucial for dexterous robotic grasping and VR/AR content generation, requiring both high visual fidelity and physical plausibility. Nevertheless, the ill-posed problem of mesh extraction from text-generated Gaussians, and physics-based optimization on the erroneous meshes pose challenges. To address these issues, we introduce THOM, a training-free framework that generates photorealistic, physically plausible 3D HOI meshes without the need for a template object mesh. THOM employs a two-stage pipeline, initially generating the hand and object Gaussians, followed by physics-based HOI optimization. Our new mesh extraction method and vertex-to-Gaussian mapping explicitly assign Gaussian elements to mesh vertices, allowing topology-aware regularization. Furthermore, we improve the physical plausibility of interactions by VLM-guided translation refinement and contact-aware optimization. Comprehensive experiments demonstrate that THOM consistently surpasses state-of-the-art methods in terms of text alignment, visual realism, and interaction plausibility.

[88] Visual Instruction-Finetuned Language Model for Versatile Brain MR Image Tasks

Jonghun Kim,Sinyoung Ra,Hyunjin Park

Main category: cs.CV

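TL;DR: This paper proposes LLaBIT, which extends a large language model to multiple clinically relevant brain MRI tasks such as image segmentation and image translation. It reuses image encoder feature maps to reduce spatial information loss and augments training with LLM-generated text data; experiments show it outperforms task-specific models on multiple tasks.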

Details Motivation: Simple text-to-image generation has limited clinical utility, while more important medical imaging tasks such as lesion segmentation and missing-sequence reconstruction have not yet been integrated into a single general-purpose language model. Method: LLaBIT (1) extends the visual reasoning of LLMs to multiple brain MRI tasks; (2) reuses feature maps from the image encoder to mitigate the spatial information loss caused by image tokenization; and (3) generates text data with LLMs under strict predefined instructions to augment scarce brain MRI image-text pairs. Result: Evaluated on five brain MRI datasets across four tasks (report generation, visual question answering, image segmentation, and image translation), LLaBIT performs best on all tasks and also outperforms task-specific models. Conclusion: LLaBIT applies one large language model to multiple clinically important brain MRI tasks, demonstrating its effectiveness and generalization and offering a new paradigm for multimodal medical AI. Abstract: LLMs have demonstrated remarkable capabilities in linguistic reasoning and are increasingly adept at vision-language tasks. The integration of image tokens into transformers has enabled direct visual input and output, advancing research from image-to-text descriptions to text-to-image generation. However, simple text-to-image generation holds limited clinical utility. In medical imaging, tasks such as image segmentation for localizing pathologies or image translation for reconstructing missing sequences have much greater clinical importance. Despite this, integrating these diverse, clinically relevant tasks within a single, versatile language model remains unexplored. Our method, LLaBIT (Large Language Model for Brain Image Translation), extends the visual reasoning of LLMs to these clinically meaningful tasks in the brain MRI domain. To mitigate the spatial information loss inherent in image tokenization, we incorporate a mechanism to reuse feature maps from the image encoder, minimizing data degradation. We also generate text data using LLMs with strict predefined instructions to augment limited image-text paired data in brain MRI. We comprehensively evaluated our method on five brain MRI datasets across four distinct tasks: report generation, visual question answering, image segmentation, and image translation. Our model not only demonstrated superior performance across all tasks but also outperformed specialized, task-specific models in direct comparisons, highlighting its efficacy and versatility.

[89] Differentiable Stroke Planning with Dual Parameterization for Efficient and High-Fidelity Painting Creation

Jinfan Liu,Wuze Zhang,Zhangli Hu,Zhehan Zhao,Ye Chen,Bingbing Ni

Main category: cs.CV

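TL;DR: This paper proposes a dual representation that couples discrete polylines with continuous Bézier control points through a bidirectional mapping, enabling collaborative optimization that improves the structural coherence of stroke layouts and reconstruction quality while reducing stroke count and optimization time.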

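Details Motivation: In existing stroke-based rendering, search methods tend to get trapped in local minima, while differentiable optimizers lack structural awareness and produce cluttered layouts. Method: A dual representation couples discrete polylines with continuous Bézier control points via a bidirectional mapping; local gradients refine global stroke structure, content-aware stroke proposals help escape local minima, and a Gaussian-splatting-inspired parallel initialization strategy is introduced. Result: Compared with existing differentiable vectorization methods, the approach uses 30-50% fewer strokes, produces more structurally coherent layouts, improves reconstruction quality, and cuts optimization time by 30-40%. Conclusion: The dual representation framework bridges the gap between discrete search and continuous optimization, delivering combined gains in efficiency, structure, and reconstruction quality. Abstract: In stroke-based rendering, search methods often get trapped in local minima due to discrete stroke placement, while differentiable optimizers lack structural awareness and produce unstructured layouts. To bridge this gap, we propose a dual representation that couples discrete polylines with continuous Bézier control points via a bidirectional mapping mechanism. This enables collaborative optimization: local gradients refine global stroke structures, while content-aware stroke proposals help escape poor local optima. Our representation further supports Gaussian-splatting-inspired initialization, enabling highly parallel stroke optimization across the image. Experiments show that our approach reduces the number of strokes by 30-50%, achieves more structurally coherent layouts, and improves reconstruction quality, while cutting optimization time by 30-40% compared to existing differentiable vectorization methods.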
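
One direction of such a mapping, from continuous Bézier control points to a discrete polyline, is easy to show. The sketch below samples a cubic Bézier curve with the Bernstein basis; the choice of a cubic curve, uniform parameter sampling, and the function names are assumptions for illustration, and the reverse (polyline to control points) direction of the paper's bidirectional mapping is not covered.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bézier curve at parameter t in [0, 1] (Bernstein form)."""
    u = 1.0 - t
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def bezier_to_polyline(ctrl, n_segments=8):
    """Sample the curve at n_segments + 1 evenly spaced parameters."""
    return [cubic_bezier(*ctrl, i / n_segments) for i in range(n_segments + 1)]


# An arch-shaped stroke defined by four control points.
stroke = bezier_to_polyline([(0, 0), (0, 1), (1, 1), (1, 0)])
```

Because every polyline vertex is a smooth function of the control points, gradients computed on the discrete stroke can flow back to the continuous parameters.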

[90] DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection

Siheng Wang,Yanshu Li,Bohan Hu,Zhengdao Li,Haibo Zhan,Linshan Li,Weiming Liu,Ruizhi Qian,Guangxin Wu,Hao Zhang,Jifeng Shen,Piotr Koniusz,Zhengtao Yao,Junhao Dong,Qiang Sun

Main category: cs.CV

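TL;DR: This paper proposes DeCo-DETR, a vision-centric framework that decouples semantic cognition from object detection. By building a hierarchical semantic prototype space and using a decoupled training strategy, it keeps zero-shot detection performance competitive while significantly improving inference efficiency.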

Details Motivation: Existing open-vocabulary object detection (OVOD) methods have two main problems: multimodal designs incur heavy computation at inference time, and tightly coupled training objectives create a trade-off between closed-set detection accuracy and open-world generalization. Method: DeCo-DETR adopts a decoupling paradigm: (1) it uses pre-trained LVLMs to generate region-level descriptions and aligns them via CLIP to build a reusable hierarchical semantic prototype space, avoiding text encoder calls at inference; (2) a decoupled training strategy separates semantic alignment and object detection into parallel optimization streams, disentangling semantic reasoning from localization. Result: On standard OVOD benchmarks, DeCo-DETR achieves competitive zero-shot detection performance while significantly improving inference efficiency. Conclusion: Decoupling semantic cognition from detection is an effective way to make OVOD systems more practical and scalable. Abstract: Open-vocabulary Object Detection (OVOD) enables models to recognize objects beyond predefined categories, but existing approaches remain limited in practical deployment. On the one hand, multimodal designs often incur substantial computational overhead due to their reliance on text encoders at inference time. On the other hand, tightly coupled training objectives introduce a trade-off between closed-set detection accuracy and open-world generalization. Thus, we propose Decoupled Cognition DETR (DeCo-DETR), a vision-centric framework that addresses these challenges through a unified decoupling paradigm. Instead of depending on online text encoding, DeCo-DETR constructs a hierarchical semantic prototype space from region-level descriptions generated by pre-trained LVLMs and aligned via CLIP, enabling efficient and reusable semantic representation. Building upon this representation, the framework further disentangles semantic reasoning from localization through a decoupled training strategy, which separates alignment and detection into parallel optimization streams. Extensive experiments on standard OVOD benchmarks demonstrate that DeCo-DETR achieves competitive zero-shot detection performance while significantly improving inference efficiency. These results highlight the effectiveness of decoupling semantic cognition from detection, offering a practical direction for scalable OVOD systems.
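
The efficiency argument rests on classifying regions against precomputed semantic prototypes instead of running a text encoder per query. A minimal sketch of that lookup is below, using cosine similarity against a cached table. The flat (non-hierarchical) prototype dictionary, the 2-D toy vectors, and the names `cosine` and `classify_region` are assumptions for illustration and do not reflect how DeCo-DETR actually builds or organizes its prototype space.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_region(feature, prototypes):
    """Return the label of the cached prototype most similar to the feature."""
    return max(prototypes, key=lambda name: cosine(feature, prototypes[name]))


# Prototypes are computed once offline and reused for every image (toy values).
prototypes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
label = classify_region([0.9, 0.1], prototypes)
```

At inference time only this table lookup is needed, so the cost of the language side is paid once when the prototypes are built.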

[91] InverseDraping: Recovering Sewing Patterns from 3D Garment Surfaces via BoxMesh Bridging

Leyang Jin,Zirong Jin,Zisheng Ye,Haokai Pang,Xiaoguang Han,Yujian Zheng,Hao Li

Main category: cs.CV

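TL;DR: This paper proposes a two-stage framework that recovers parametric 2D sewing patterns from 3D garment geometry through a structured intermediate representation, BoxMesh, substantially improving the accuracy and robustness of inverse modeling.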

Details Motivation: Recovering sewing patterns from draped 3D garments is an ill-posed problem; existing methods struggle to disentangle deformation from intrinsic structure, leading to high ambiguity and low accuracy. Method: In Stage I, a geometry-driven autoregressive model infers BoxMesh from the 3D garment; BoxMesh encodes garment-level and panel-level structure and disentangles intrinsic geometry and stitching topology from draping deformation. In Stage II, a semantics-aware autoregressive model parses BoxMesh into parametric sewing patterns. Autoregressive modeling is used to handle the variable-length, structured nature of panels and stitching relationships. Result: The method achieves state-of-the-art performance on the GarmentCodeData benchmark and generalizes effectively to real-world scans and single-view images. Conclusion: As a physically grounded intermediate representation, BoxMesh reduces the ambiguity of the inverse problem, and the two-stage decomposition improves the accuracy, robustness, and generalization of sewing pattern recovery. Abstract: Recovering sewing patterns from draped 3D garments is a challenging problem in human digitization research. In contrast to the well-studied forward process of draping designed sewing patterns using mature physical simulation engines, the inverse process of recovering parametric 2D patterns from deformed garment geometry remains fundamentally ill-posed for existing methods. We propose a two-stage framework that centers on a structured intermediate representation, BoxMesh, which serves as the key to bridging the gap between 3D garment geometry and parametric sewing patterns. BoxMesh encodes both garment-level geometry and panel-level structure in 3D, while explicitly disentangling intrinsic panel geometry and stitching topology from draping-induced deformations. This representation imposes a physically grounded structure on the problem, significantly reducing ambiguity. In Stage I, a geometry-driven autoregressive model infers BoxMesh from the input 3D garment. In Stage II, a semantics-aware autoregressive model parses BoxMesh into parametric sewing patterns. We adopt autoregressive modeling to naturally handle the variable-length and structured nature of panel configurations and stitching relationships. This decomposition separates geometric inversion from structured pattern inference, leading to more accurate and robust recovery. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the GarmentCodeData benchmark and generalizes effectively to real-world scans and single-view images.

[92] Generalized Small Object Detection: A Point-Prompted Paradigm and Benchmark

Haoran Zhu,Wen Yang,Guangyou Yang,Chang Xu,Ruixiang Zhang,Fang Xu,Haijian Zhang,Gui-Song Xia

Main category: cs.CV

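TL;DR: This paper introduces TinySet-9M, a large-scale multi-domain dataset for small object detection, along with the point-prompted paradigm P2SOD and the DEAL framework. With only a single click, DEAL substantially improves small object detection, including a 31.4% relative gain on strict localization metrics.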

Details Motivation: Small object detection suffers from extremely few pixels and ambiguous boundaries, which make annotation difficult, leave large-scale high-quality data scarce, and yield weak semantic representations; label-efficient methods degrade especially badly. Method: The authors build TinySet-9M, the first large-scale multi-domain dataset for small object detection; propose Point-Prompt Small Object Detection (P2SOD), a new paradigm that introduces sparse point prompts at inference time; and develop DEAL, a scalable and transferable point-prompted detection framework built on it. Result: With a single click, DEAL achieves a 31.4% relative improvement over fully supervised baselines on strict localization metrics such as AP75 on TinySet-9M, and it generalizes to unseen categories and unseen datasets. Conclusion: Together, the TinySet-9M dataset, the P2SOD paradigm, and the DEAL framework address the two core challenges of data scarcity and weak semantic representation in small object detection, providing a new approach and a practical benchmark for label-efficient small object detection. Abstract: Small object detection (SOD) remains challenging due to extremely limited pixels and ambiguous object boundaries. These characteristics lead to challenging annotation, limited availability of large-scale high-quality datasets, and inherently weak semantic representations for small objects. In this work, we first address the data limitation by introducing TinySet-9M, the first large-scale, multi-domain dataset for small object detection. Beyond filling the gap in large-scale datasets, we establish a benchmark to evaluate the effectiveness of existing label-efficient detection methods for small objects. Our evaluation reveals that weak visual cues further exacerbate the performance degradation of label-efficient methods in small object detection, highlighting a critical challenge in label-efficient SOD. Secondly, to tackle the limitation of insufficient semantic representation, we move beyond training-time feature enhancement and propose a new paradigm termed Point-Prompt Small Object Detection (P2SOD). This paradigm introduces sparse point prompts at inference time as an efficient information bridge for category-level localization, enabling semantic augmentation. Building upon the P2SOD paradigm and the large-scale TinySet-9M dataset, we further develop DEAL (DEtect Any smalL object), a scalable and transferable point-prompted detection framework that learns robust, prompt-conditioned representations from large-scale data. 
With only a single click at inference time, DEAL achieves a 31.4% relative improvement over fully supervised baselines under strict localization metrics (e.g., AP75) on TinySet-9M, while generalizing effectively to unseen categories and unseen datasets. Our project is available at https://zhuhaoraneis.github.io/TinySet-9M/.

[93] A Unified Perspective on Adversarial Membership Manipulation in Vision Models

Ruize Gao,Kaiwen Zhou,Yongqiang Chen,Feng Liu

Main category: cs.CV

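TL;DR: This paper presents the first systematic study of how membership inference attacks (MIAs) on vision models behave under adversarial conditions. It shows that small perturbations can disguise non-member samples as members (adversarial membership fabrication) and proposes detection and robust inference methods based on gradient-geometry signals that markedly improve MIA resilience.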

Details Motivation: Existing membership inference attacks (MIAs) assume honest query inputs and do not consider robustness in adversarial settings; the authors find that MIAs can be manipulated by adversarial perturbations, an overlooked threat surface they call adversarial membership manipulation. Method: The authors empirically show that adversarial membership fabrication works across diverse architectures and datasets; identify and characterize its distinctive geometric signature, a gradient-norm collapse trajectory; and use it to design a detection strategy based on gradient-geometry signals and a robust membership inference framework. Result: Adversarial membership fabrication is shown to be broadly effective; a geometric criterion separates fabricated members from true members; and the proposed detection and robust inference methods significantly strengthen MIAs against adversarial manipulation. Conclusion: The paper establishes the first comprehensive framework for analyzing and defending against adversarial membership manipulation in vision models and highlights that MIAs used for privacy evaluation must also be adversarially robust. Abstract: Membership inference attacks (MIAs) aim to determine whether a specific data point was part of a model's training set, serving as effective tools for evaluating privacy leakage of vision models. However, existing MIAs implicitly assume honest query inputs, and their adversarial robustness remains unexplored. We show that MIAs for vision models expose a previously overlooked adversarial surface: adversarial membership manipulation, where imperceptible perturbations can reliably push non-member images into the "member" region of state-of-the-art MIAs. In this paper, we provide the first unified perspective on this phenomenon by analyzing its mechanism and implications. We begin by demonstrating that adversarial membership fabrication is consistently effective across diverse architectures and datasets. We then reveal a distinctive geometric signature - a characteristic gradient-norm collapse trajectory - that reliably separates fabricated from true members despite their nearly identical semantic representations. Building on this insight, we introduce a principled detection strategy grounded in gradient-geometry signals and develop a robust inference framework that substantially mitigates adversarial manipulation. Extensive experiments show that fabrication is broadly effective, while our detection and robust inference strategies significantly enhance resilience. This work establishes the first comprehensive framework for adversarial membership manipulation in vision models.

[94] EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors

Ryuhei Miyazato,Shunsuke Kitada,Kei Harada

Main category: cs.CV

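TL;DR: This paper proposes EnsemHalDet, an ensemble-based hallucination detection framework for vision-language models (VLMs). It trains independent detectors on multiple internal representations, such as attention outputs and hidden states, and combines their outputs, consistently improving AUC across multiple datasets and models.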

Details Motivation: Existing hallucination detection methods based on internal representations usually rely on a single representation or detector, which limits their ability to capture diverse hallucination signals. Method: EnsemHalDet trains an independent detector for each of several VLM internal representations (such as attention outputs and hidden states) and combines them through ensemble learning. Result: Experiments on multiple VQA datasets and different VLMs show that EnsemHalDet consistently outperforms prior methods and single-detector models in terms of AUC. Conclusion: Ensembling diverse internal signals significantly improves the robustness and accuracy of multimodal hallucination detection. Abstract: Vision-Language Models (VLMs) excel at multimodal tasks, but they remain vulnerable to hallucinations that are factually incorrect or ungrounded in the input image. Recent work suggests that hallucination detection using internal representations is more efficient and accurate than approaches that rely solely on model outputs. However, existing internal-representation-based methods typically rely on a single representation or detector, limiting their ability to capture diverse hallucination signals. In this paper, we propose EnsemHalDet, an ensemble-based hallucination detection framework that leverages multiple internal representations of VLMs, including attention outputs and hidden states. EnsemHalDet trains independent detectors for each representation and combines them through ensemble learning. Experimental results across multiple VQA datasets and VLMs show that EnsemHalDet consistently outperforms prior methods and single-detector models in terms of AUC. These results demonstrate that ensembling diverse internal signals significantly improves robustness in multimodal hallucination detection.
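
The idea of combining detectors and scoring them by AUC can be sketched with plain score averaging. The abstract only says the detectors are combined "through ensemble learning", so the unweighted mean used here is an assumed stand-in, the detector scores are made-up toy numbers rather than results from the paper, and `ensemble_scores` and `auc` are names introduced for this example.

```python
def ensemble_scores(detector_scores):
    """Average per-sample hallucination scores across detectors (assumed rule)."""
    n = len(detector_scores)
    return [sum(col) / n for col in zip(*detector_scores)]


def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]                 # 1 = hallucinated answer (toy data)
attn_det = [0.9, 0.4, 0.5, 0.1]       # detector on attention outputs
hidden_det = [0.6, 0.7, 0.2, 0.65]    # detector on hidden states
combined = ensemble_scores([attn_det, hidden_det])
```

In this toy case each detector misranks a different pair and the average fixes both, but that outcome depends on the detectors making complementary errors and is not guaranteed in general.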

[95] CANDLE: Illumination-Invariant Semantic Priors for Color Ambient Lighting Normalization

Rong-Lin Jian,Ting-Yao Chen,Yu-Fan Lin,Chia-Ming Lee,Fu-En Yang,Yu-Chiang Frank Wang,Chih-Chung Hsu

Main category: cs.CV

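TL;DR: This paper proposes CANDLE, which uses DINOv3 self-supervised features as illumination-robust semantic priors and combines multi-layer feature injection with a color-frequency refinement design to address color ambient lighting normalization under multi-colored illumination, achieving leading results on several benchmarks.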

Details Motivation: Existing geometric and low-level priors struggle to recover object-intrinsic color under strong color casts, highlight saturation, and material-dependent reflectance, so more robust semantic priors are needed. Method: CANDLE includes a DINO Omni-layer Guidance (D.O.G.) module that adaptively injects multi-layer DINOv3 features into successive encoder stages, and a color-frequency refinement design (BFACG + SFFB) that suppresses decoder-side chromatic collapse and detail contamination. Result: On CL3AN, CANDLE gains +1.22 dB PSNR; it placed 3rd in the NTIRE 2026 ALN Color Lighting Challenge and 2nd in fidelity on the White Lighting track with the lowest FID. Conclusion: DINOv3 self-supervised features can serve as strong illumination-invariant semantic priors, and combining layer-wise guidance with frequency-domain refinement improves the accuracy and generalization of color normalization under multi-colored illumination. Abstract: Color ambient lighting normalization under multi-colored illumination is challenging due to severe chromatic shifts, highlight saturation, and material-dependent reflectance. Existing geometric and low-level priors are insufficient for recovering object-intrinsic color when illumination-induced chromatic bias dominates. We observe that DINOv3's self-supervised features remain highly consistent between colored-light inputs and ambient-lit ground truth, motivating their use as illumination-robust semantic priors. We propose CANDLE (Color Ambient Normalization with DINO Layer Enhancement), which introduces DINO Omni-layer Guidance (D.O.G.) to adaptively inject multi-layer DINOv3 features into successive encoder stages, and a color-frequency refinement design (BFACG + SFFB) to suppress decoder-side chromatic collapse and detail contamination. Experiments on CL3AN show a +1.22 dB PSNR gain over the strongest prior method. CANDLE achieves 3rd place on the NTIRE 2026 ALN Color Lighting Challenge and 2nd place in fidelity on the White Lighting track with the lowest FID, confirming strong generalization across both chromatic and luminance-dominant illumination conditions. Code is available at https://github.com/ron941/CANDLE.

[96] LumaFlux: Lifting 8-Bit Worlds to HDR Reality with Physically-Guided Diffusion Transformers

Shreshth Saini,Hakan Gedik,Neil Birkbeck,Yilin Wang,Balu Adsumilli,Alan C. Bovik

Main category: cs.CV

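TL;DR: This paper proposes LumaFlux, an SDR-to-HDR inverse tone-mapping method based on a diffusion transformer (DiT) that combines physical guidance with perceptual modeling to significantly improve the reconstruction of highlights, color, and texture.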

Details Motivation: Existing inverse tone-mapping (ITM) methods generalize poorly to real-world degradations, stylistic variations, and differences between camera pipelines, often producing clipped highlights, desaturated colors, and unstable tone reproduction. Method: LumaFlux comprises (1) a Physically-Guided Adaptation (PGA) module that injects luminance, spatial, and frequency cues into attention through low-rank residuals; (2) a Perceptual Cross-Modulation (PCM) layer that stabilizes chroma and texture via FiLM conditioning on vision encoder features; (3) an HDR Residual Coupler that fuses physical and perceptual signals with timestep- and layer-adaptive modulation; and (4) a lightweight Rational-Quadratic Spline decoder that produces smooth, interpretable tone fields and enhances the VAE decoder output. The authors also build the first large-scale SDR-HDR training dataset and a new evaluation benchmark. Result: LumaFlux outperforms state-of-the-art methods across benchmarks, with better luminance reconstruction and perceptual color fidelity and only a minimal increase in parameters. Conclusion: LumaFlux is the first to integrate physical modeling and perceptual guidance within a diffusion transformer framework, offering a new paradigm for high-quality, robust SDR-to-HDR conversion. Abstract: The rapid adoption of HDR-capable devices has created a pressing need to convert 8-bit Standard Dynamic Range (SDR) content into perceptually and physically accurate 10-bit High Dynamic Range (HDR). Existing inverse tone-mapping (ITM) methods often rely on fixed tone-mapping operators that struggle to generalize to real-world degradations, stylistic variations, and camera pipelines, frequently producing clipped highlights, desaturated colors, or unstable tone reproduction. We introduce LumaFlux, the first physically and perceptually guided diffusion transformer (DiT) for SDR-to-HDR reconstruction, built by adapting a large pretrained DiT. LumaFlux introduces (1) a Physically-Guided Adaptation (PGA) module that injects luminance, spatial descriptors, and frequency cues into attention through low-rank residuals; (2) a Perceptual Cross-Modulation (PCM) layer that stabilizes chroma and texture via FiLM conditioning from vision encoder features; and (3) an HDR Residual Coupler that fuses physical and perceptual signals under a timestep- and layer-adaptive modulation schedule. Finally, a lightweight Rational-Quadratic Spline decoder reconstructs smooth, interpretable tone fields for highlight and exposure expansion, enhancing the output of the VAE decoder to generate HDR. To enable robust HDR learning, we curate the first large-scale SDR-HDR training corpus. For fair and reproducible comparison, we further establish a new evaluation benchmark, comprising HDR references and corresponding expert-graded SDR versions. 
Across benchmarks, LumaFlux outperforms state-of-the-art baselines, achieving superior luminance reconstruction and perceptual color fidelity with minimal additional parameters.
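
The FiLM conditioning mentioned for the PCM layer can be illustrated with a minimal sketch. FiLM scales and shifts each feature channel with a per-channel pair (gamma, beta). This is not the authors' implementation: the function name `film` and all numbers are hypothetical, and in a real model gamma and beta would be predicted by a small network from the vision-encoder features.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma and beta need one entry per channel")
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


# Hypothetical values: three channels of one token. In practice gamma/beta
# would come from a conditioning network, not be written by hand.
token = [1.0, -2.0, 0.5]
gamma = [1.0, 0.5, 2.0]
beta = [0.0, 1.0, -1.0]

modulated = film(token, gamma, beta)
print(modulated)
```

Because the modulation is per channel, it adds only two parameters per channel at the point where it is applied, which is one reason this style of conditioning is a common choice for lightweight adaptation.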

[97] UNICA: A Unified Neural Framework for Controllable 3D Avatars

Jiahe Zhu,Xinyao Wang,Yiyu Zhuang,Yanwen Wang,Jing Tian,Yao Yao,Hao Zhu

Main category: cs.CV

TL;DR: UNICA is a skeleton-free, unified neural model for controllable 3D human avatars that generates geometry from game-like keyboard inputs using action-conditioned diffusion on 2D position maps, followed by point-transformer-based 3D Gaussian Splatting for high-fidelity rendering—eliminating separate motion planning, rigging, and physics simulation.

Details Motivation: Conventional 3D avatar creation involves lengthy, fragmented pipelines (appearance modeling, motion planning, rigging, physical simulation), hindering efficiency and realism—especially for dynamic elements like hair and clothing. Method: UNICA employs an action-conditioned diffusion model operating on 2D position maps to generate avatar geometry frame-by-frame from keyboard controls; a point transformer then converts the output into 3D Gaussian Splatting representations for free-view rendering. Result: UNICA achieves high-fidelity, free-view 3D avatar generation with natural dynamics for hair and loose clothing, supports extra-long autoregressive sequences, and unifies motion planning, rigging, physical simulation, and rendering in one end-to-end framework. Conclusion: UNICA establishes the first fully unified, skeleton-free neural framework for controllable 3D human avatars, significantly simplifying the pipeline while improving dynamic realism and generative flexibility. Abstract: Controllable 3D human avatars have found widespread applications in 3D games, the metaverse, and AR/VR scenarios. The conventional approach to creating such a 3D avatar requires a lengthy, intricate pipeline encompassing appearance modeling, motion planning, rigging, and physical simulation. In this paper, we introduce UNICA (UNIfied neural Controllable Avatar), a skeleton-free generative model that unifies all avatar control components into a single neural framework. Given keyboard inputs akin to video game controls, UNICA generates the next frame of a 3D avatar's geometry through an action-conditioned diffusion model operating on 2D position maps. A point transformer then maps the resulting geometry to 3D Gaussian Splatting for high-fidelity free-view rendering. Our approach naturally captures hair and loose clothing dynamics without manually designed physical simulation, and supports extra-long autoregressive generation. 
To the best of our knowledge, UNICA is the first model to unify the workflow of "motion planning, rigging, physical simulation, and rendering". Code is released at https://github.com/zjh21/UNICA.

[98] PaveBench: A Versatile Benchmark for Pavement Distress Perception and Interactive Vision-Language Analysis

Dexiang Li,Zhenning Che,Haijun Zhang,Dongliang Zhou,Zhao Zhang,Yahong Han

Main category: cs.CV

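TL;DR: This paper proposes PaveBench, a large-scale benchmark for pavement distress perception and interactive vision-language analysis on real-world highway inspection images. It covers four tasks (classification, detection, segmentation, and visual question answering) and is released with the companion PaveVQA dataset and an agent-augmented VQA framework that integrates domain-specific models.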

Details Motivation: Existing pavement distress research is largely confined to conventional unimodal vision tasks and lacks support for quantitative analysis, interpretability, multi-turn interaction, and joint vision-language reasoning; existing datasets also cannot support the integrated decision-making required in real inspections. Method: Build the PaveBench benchmark, comprising large-scale annotated data for the visual tasks (including a hard-distractor subset) and PaveVQA, a visual question answering dataset with multi-turn and expert-corrected interactions; propose an agent-augmented VQA framework in which domain-specific models serve as tools alongside vision-language models. Result: The benchmark provides unified task definitions and evaluation protocols; several state-of-the-art methods are systematically evaluated; the agent-augmented framework is shown to be effective on multi-step pavement distress reasoning; and the full dataset is publicly released. Conclusion: PaveBench fills the gap for a benchmark covering multimodal interaction and fact-grounded reasoning in intelligent pavement inspection, and pushes the field from "seeing" distress toward "understanding it and supporting decisions". Abstract: Pavement condition assessment is essential for road safety and maintenance. Existing research has made significant progress. However, most studies focus on conventional computer vision tasks such as classification, detection, and segmentation. In real-world applications, pavement inspection requires more than visual recognition. It also requires quantitative analysis, explanation, and interactive decision support. Current datasets are limited. They focus on unimodal perception. They lack support for multi-turn interaction and fact-grounded reasoning. They also do not connect perception with vision-language analysis. To address these limitations, we introduce PaveBench, a large-scale benchmark for pavement distress perception and interactive vision-language analysis on real-world highway inspection images. PaveBench supports four core tasks: classification, object detection, semantic segmentation, and vision-language question answering. It provides unified task definitions and evaluation protocols. On the visual side, PaveBench provides large-scale annotations and includes a curated hard-distractor subset for robustness evaluation. It contains a large collection of real-world pavement images. On the multimodal side, we introduce PaveVQA, a real-image question answering (QA) dataset that supports single-turn, multi-turn, and expert-corrected interactions. It covers recognition, localization, quantitative estimation, and maintenance reasoning. We evaluate several state-of-the-art methods and provide a detailed analysis.
We also present a simple and effective agent-augmented visual question answering framework that integrates domain-specific models as tools alongside vision-language models. The dataset is available at: https://huggingface.co/datasets/MML-Group/PaveBench.

[99] CMCC-ReID: Cross-Modality Clothing-Change Person Re-Identification

Haoxuan Xu,Hanzi Wang,Guanglin Niu

Main category: cs.CV

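TL;DR: This paper proposes a new person re-identification task, CMCC-ReID, which targets matching in long-term surveillance when cross-modality (visible/infrared) discrepancy and clothing change occur together, and builds a new benchmark, SYSU-CMCC. It designs a Progressive Identity Alignment Network (PIA), with Dual-Branch Disentangling Learning (DBDL) and Bi-Directional Prototype Learning (BPL) modules, that mitigates clothing interference and the modality gap and significantly outperforms existing methods on SYSU-CMCC.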

Details Motivation: Real long-term surveillance involves both modality discrepancy (e.g., visible versus infrared) and clothing change, but existing work addresses only VI-ReID or CC-ReID separately and overlooks the practical challenge of their co-occurrence. Method: Propose the Progressive Identity Alignment Network (PIA): (1) a Dual-Branch Disentangling Learning (DBDL) module separates identity features from clothing-related factors to obtain clothing-agnostic representations; (2) a Bi-Directional Prototype Learning (BPL) module performs intra-modality and inter-modality contrastive learning in the embedding space to bridge the modality gap and further suppress clothing interference. Result: On the newly built SYSU-CMCC benchmark, PIA significantly outperforms existing VI-ReID and CC-ReID methods and establishes a strong baseline for the new task. Conclusion: CMCC-ReID is a new task that better matches the needs of real long-term surveillance; PIA's progressive disentangle-then-align strategy handles the dual heterogeneity effectively and gives later work an extensible framework and a practical benchmark. Abstract: Person Re-Identification (ReID) faces severe challenges from modality discrepancy and clothing variation in long-term surveillance scenarios. While existing studies have made significant progress in either Visible-Infrared ReID (VI-ReID) or Clothing-Change ReID (CC-ReID), real-world surveillance systems often face both challenges simultaneously. To address this overlooked yet realistic problem, we define a new task, termed Cross-Modality Clothing-Change Re-Identification (CMCC-ReID), which targets pedestrian matching across variations in both modality and clothing. To advance research in this direction, we construct a new benchmark SYSU-CMCC, where each identity is captured in both visible and infrared domains with distinct outfits, reflecting the dual heterogeneity of long-term surveillance. To tackle CMCC-ReID, we propose a Progressive Identity Alignment Network (PIA) that progressively mitigates the issues of clothing variation and modality discrepancy. Specifically, a Dual-Branch Disentangling Learning (DBDL) module separates identity-related cues from clothing-related factors to achieve clothing-agnostic representation, and a Bi-Directional Prototype Learning (BPL) module performs intra-modality and inter-modality contrast in the embedding space to bridge the modality gap while further suppressing clothing interference. Extensive experiments on the SYSU-CMCC dataset demonstrate that PIA establishes a strong baseline for this new task and significantly outperforms existing methods.

[100] QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models

Xinhao Wang,Zhonyu Xia,Zhiwei Lin,Zhe Li,Yongtao Wang

Main category: cs.CV

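TL;DR: This paper proposes a quantization-aware vision token pruning framework that jointly optimizes vision token pruning and post-training quantization (PTQ), improving the inference accuracy and stability of multimodal large language models (MLLMs) at low bit widths (e.g., W4A4).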

Details Motivation: PTQ and vision token pruning are usually applied independently, yet they are strongly coupled: semantics-driven pruning can wrongly discard activation outliers that are critical to quantization stability, which worsens low-bit quantization error. Method: Propose a lightweight hybrid sensitivity metric that combines simulated group-wise quantization error with outlier intensity, and combine it with semantic relevance scores to select vision tokens that are both semantically important and robust to quantization. Result: Experiments on LLaVA architectures show that, under aggressive pruning that retains only 12.5% of vision tokens, the method improves accuracy by 2.24% over the baseline and even surpasses dense quantization without pruning. Conclusion: According to the authors, this is the first method to explicitly co-optimize vision token pruning and PTQ for accurate low-bit MLLM inference, offering a new paradigm for efficient multimodal inference in resource-constrained settings. Abstract: Multimodal Large Language Models (MLLMs) have shown strong reasoning ability, but their high computational and memory costs hinder deployment in resource-constrained settings. While Post-Training Quantization (PTQ) and vision token pruning are standard compression techniques, they are usually treated as independent optimizations. In this paper, we show that these two techniques are strongly coupled: naively applying semantic-based token pruning to PTQ-optimized MLLMs can discard activation outliers that are important for numerical stability and thus worsen quantization errors in low-bit regimes (e.g., W4A4). To address this issue, we propose a quantization-aware vision token pruning framework. Our method introduces a lightweight hybrid sensitivity metric that combines simulated group-wise quantization error with outlier intensity. By combining this metric with standard semantic relevance scores, the method retains tokens that are both semantically informative and robust to quantization. Experiments on standard LLaVA architectures show that our method consistently outperforms naive integration baselines. At an aggressive pruning ratio that retains only 12.5% of visual tokens, our framework improves accuracy by 2.24% over the baseline and even surpasses dense quantization without pruning. To the best of our knowledge, this is the first method that explicitly co-optimizes vision token pruning and PTQ for accurate low-bit MLLM inference.
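
One way to picture a score that mixes semantic relevance with quantization sensitivity is the sketch below. It is an illustration only, not the paper's metric: the abstract does not give the formula, so the min-max normalization, the weight `lam`, the equal mix of the two sensitivity terms, and the choice to treat higher sensitivity as a reason to keep a token (so that outlier-carrying tokens survive pruning) are all assumptions. Every function name is hypothetical.

```python
def group_quant_error(values, bits=4, group_size=4):
    """Mean squared error of simulated symmetric per-group quantization."""
    qmax = 2 ** (bits - 1) - 1
    err = 0.0
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        # Fall back to scale 1.0 for an all-zero group.
        scale = max(abs(v) for v in group) / qmax or 1.0
        for v in group:
            q = max(-qmax - 1, min(qmax, round(v / scale)))
            err += (v - q * scale) ** 2
    return err / len(values)


def outlier_intensity(values):
    """Peak magnitude relative to the mean magnitude (assumed definition)."""
    mean_abs = sum(abs(v) for v in values) / len(values)
    return max(abs(v) for v in values) / (mean_abs + 1e-12)


def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]


def select_tokens(tokens, semantic, keep, lam=0.5):
    """Keep the `keep` tokens with the highest combined score (assumed rule)."""
    qerr = min_max([group_quant_error(t) for t in tokens])
    outl = min_max([outlier_intensity(t) for t in tokens])
    sem = min_max(semantic)
    score = [(1 - lam) * s + lam * 0.5 * (q + o)
             for s, q, o in zip(sem, qerr, outl)]
    ranked = sorted(range(len(tokens)), key=lambda i: score[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical activations: token 3 carries a large outlier but has a low
# semantic score, so purely semantic pruning would drop it.
tokens = [
    [0.1, 0.2, -0.1, 0.3],
    [0.1, 0.1, 0.1, 0.1],
    [0.2, -0.2, 0.1, 0.0],
    [0.1, 8.0, 0.1, -0.1],
]
semantic = [0.9, 0.1, 0.2, 0.3]
kept = select_tokens(tokens, semantic, keep=2)
print(kept)
```

With `lam=0` the selection reduces to plain semantic top-k, which makes the effect of the sensitivity term easy to compare on the same inputs.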

[101] MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling

Shubo Lin,Xuanyang Zhang,Wei Cheng,Weiming Hu,Gang Yu,Jin Gao

Main category: cs.CV

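TL;DR: This paper proposes the MMPhysVideo framework, which improves the physical plausibility of video generation through multimodal modeling. It unifies semantics, geometry, and spatio-temporal trajectories into a pseudo-RGB format and designs a Bidirectionally Controlled Teacher architecture with a knowledge-distilled student model, significantly improving physical consistency and visual quality without additional inference cost.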

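Details Motivation: Existing video diffusion models (VDMs) rely solely on pixel-level reconstruction, which yields physically inconsistent results; physical priors are needed to make generated content more realistic and plausible. Method: Propose the MMPhysVideo framework: (1) encode semantics, geometry, and spatio-temporal trajectories into a unified pseudo-RGB format; (2) design a Bidirectionally Controlled Teacher architecture (parallel branches plus zero-initialized control links) that decouples RGB and perception processing; (3) distill the teacher's physical prior into a single-stream student model via representation alignment; (4) build the MMPhysPipe data pipeline, which uses a VLM guided by a chain-of-visual-evidence rule to locate physical subjects and produce multi-granular perceptual annotations. Result: Across multiple benchmarks, MMPhysVideo consistently outperforms advanced models without additional inference cost, significantly improving physical plausibility and visual quality and reaching state-of-the-art performance. Conclusion: Jointly modeling multi-dimensional physical perception signals and integrating them into the video diffusion process is an effective route to more physically consistent generated video; MMPhysVideo offers a scalable, efficient, and practical paradigm for physics-driven video generation. Abstract: Despite advancements in generating visually stunning content, video diffusion models (VDMs) often yield physically inconsistent results due to pixel-only reconstruction. To address this, we propose MMPhysVideo, the first framework to scale physical plausibility in video generation through joint multimodal modeling. We recast perceptual cues, specifically semantics, geometry, and spatio-temporal trajectory, into a unified pseudo-RGB format, enabling VDMs to directly capture complex physical dynamics. To mitigate cross-modal interference, we propose a Bidirectionally Controlled Teacher architecture, which utilizes parallel branches to fully decouple RGB and perception processing and adopts two zero-initialized control links to gradually learn pixel-wise consistency. For inference efficiency, the teacher's physical prior is distilled into a single-stream student model via representation alignment. Furthermore, we present MMPhysPipe, a scalable data curation and annotation pipeline tailored for constructing physics-rich multimodal datasets. MMPhysPipe employs a vision-language model (VLM) guided by a chain-of-visual-evidence rule to pinpoint physical subjects, enabling expert models to extract multi-granular perceptual information. Without additional inference costs, MMPhysVideo consistently improves physical plausibility and visual quality over advanced models across various benchmarks and achieves state-of-the-art performance compared to existing methods.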

[102] NavCrafter: Exploring 3D Scenes from a Single Image

Hongbo Duan,Peiyu Zhuang,Yi Liu,Zhengyang Zhang,Yuxin Zhang,Pengting Luo,Fangming Liu,Xueqian Wang

Main category: cs.CV

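TL;DR: NavCrafter is a 3D scene exploration framework that generates controllable, temporally and spatially consistent novel-view video sequences from a single image. It combines video diffusion models, geometry-aware expansion, multi-stage camera control, and collision-aware trajectory planning to significantly improve novel-view synthesis and 3D reconstruction quality under large viewpoint changes.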

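Details Motivation: Generating flexible 3D scenes from a single image is vital when 3D data acquisition is costly or impractical, and existing methods fall short in novel-view synthesis and reconstruction fidelity under large viewpoint changes. Method: Propose the NavCrafter framework: (1) use video diffusion models to capture 3D priors; (2) a geometry-aware expansion strategy that progressively extends scene coverage; (3) a multi-stage camera control mechanism (dual-branch camera injection plus attention modulation) for controllable multi-view synthesis; (4) a collision-aware camera trajectory planner; (5) an enhanced 3D Gaussian Splatting (3DGS) pipeline with depth-aligned supervision, structural regularization, and refinement. Result: NavCrafter achieves state-of-the-art novel-view synthesis under large viewpoint shifts and substantially improves 3D reconstruction fidelity. Conclusion: NavCrafter effectively combines generative modeling with geometric reasoning, advancing both performance and controllability in single-image 3D scene exploration. Abstract: Creating flexible 3D scenes from a single image is vital when direct 3D data acquisition is costly or impractical. We introduce NavCrafter, a novel framework that explores 3D scenes from a single image by synthesizing novel-view video sequences with camera controllability and temporal-spatial consistency. NavCrafter leverages video diffusion models to capture rich 3D priors and adopts a geometry-aware expansion strategy to progressively extend scene coverage. To enable controllable multi-view synthesis, we introduce a multi-stage camera control mechanism that conditions diffusion models with diverse trajectories via dual-branch camera injection and attention modulation. We further propose a collision-aware camera trajectory planner and an enhanced 3D Gaussian Splatting (3DGS) pipeline with depth-aligned supervision, structural regularization and refinement. Extensive experiments demonstrate that NavCrafter achieves state-of-the-art novel-view synthesis under large viewpoint shifts and substantially improves 3D reconstruction fidelity.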

[103] Factorized Multi-Resolution HashGrid for Efficient Neural Radiance Fields: Execution on Edge-Devices

Kim Jun-Seong,Mingyu Kim,GeonU Kim,Tae-Hyun Oh,Jin-Hwa Kim

Main category: cs.CV

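TL;DR: Fact-Hash is a new parameter-encoding method that merges tensor factorization with hash encoding to train neural radiance fields (NeRF) efficiently on resource-constrained devices, cutting memory use by over one-third while maintaining rendering quality and speed.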

Details Motivation: NeRF's heavy computational demands make it hard to deploy on edge devices; supporting on-device training, which suits communication-limited and privacy-sensitive settings and fast adaptation to changing scenes, requires overcoming resource bottlenecks in GPU memory, storage, and power. Method: Propose Fact-Hash: project 3D coordinates into multiple lower-dimensional (2D/1D) spaces, hash-encode each, and aggregate them into a single feature; tensor factorization compresses parameters, enables rich high-resolution features, and adds few-shot robustness. Result: Compared with previous encoding methods, memory usage falls by over one-third with PSNR maintained; on-device experiments show better computational efficiency and lower energy consumption. Conclusion: Fact-Hash improves feature grid representation, balances rendering quality and speed under memory constraints, and is a promising solution for on-device NeRF training. Abstract: We introduce Fact-Hash, a novel parameter-encoding method for training on-device neural radiance fields. Neural Radiance Fields (NeRF) have proven pivotal in 3D representations, but their applications are limited due to large computational resources. On-device training can open large application fields, providing strength in communication limitations, privacy concerns, and fast adaptation to a frequently changing scene. However, challenges such as limited resources (GPU memory, storage, and power) impede their deployment. To handle this, we introduce Fact-Hash, a novel parameter-encoding merging Tensor Factorization and Hash-encoding techniques. This integration offers two benefits: the use of rich high-resolution features and the few-shot robustness. In Fact-Hash, we project 3D coordinates into multiple lower-dimensional forms (2D or 1D) before applying the hash function and then aggregate them into a single feature. Comparative evaluations against state-of-the-art methods demonstrate Fact-Hash's superior memory efficiency, preserving quality and rendering speed. Fact-Hash saves memory usage by over one-third while maintaining the PSNR values compared to previous encoding methods. The on-device experiment validates the superiority of Fact-Hash compared to alternative positional encoding methods in computational efficiency and energy consumption. These findings highlight Fact-Hash as a promising solution to improve feature grid representation, address memory constraints, and improve quality in various applications. Project page: https://facthash.github.io/
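
The "project, hash, aggregate" idea in the abstract can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Fact-Hash implementation: it uses only the three axis-aligned 2D projections, a single resolution, nearest-cell lookup with no interpolation, random rather than learned tables, and element-wise product as the aggregation. The XOR-with-primes hash follows the form commonly used in multi-resolution hash encodings; the names `hash_index`, `make_table`, and `encode` are hypothetical.

```python
import random

# Primes commonly used in spatial hashing for hash-grid encodings.
PRIMES = (1, 2654435761, 805459861)


def hash_index(cell, table_size):
    """XOR the integer cell coordinates, each multiplied by a prime."""
    h = 0
    for coord, prime in zip(cell, PRIMES):
        h ^= coord * prime
    return h % table_size


def make_table(size, dim, rng):
    """Stand-in for a learned feature table (random values here)."""
    return [[rng.uniform(0.5, 1.5) for _ in range(dim)] for _ in range(size)]


def encode(point, tables, resolution):
    """Encode a point in [0, 1)^3 from three 2D hashed lookups."""
    x, y, z = point
    projections = {"xy": (x, y), "xz": (x, z), "yz": (y, z)}
    feature = None
    for name, coords in projections.items():
        cell = tuple(int(c * resolution) for c in coords)
        table = tables[name]
        entry = table[hash_index(cell, len(table))]
        # Assumed aggregation: element-wise product across projections.
        feature = list(entry) if feature is None else [
            a * b for a, b in zip(feature, entry)]
    return feature


rng = random.Random(0)
tables = {name: make_table(size=64, dim=2, rng=rng)
          for name in ("xy", "xz", "yz")}
feature = encode((0.25, 0.5, 0.75), tables, resolution=16)
print(feature)
```

The intuition for memory savings is that each table only has to cover a 2D (or 1D) grid rather than a full 3D one; how the paper actually sizes and combines its tables is not specified in the abstract.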

[104] Deformation-based In-Context Learning for Point Cloud Understanding

Chengxing Lin,Jinhong Deng,Yinjie Lei,Wen Li

Main category: cs.CV

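TL;DR: This paper proposes DeformPIC, a deformation-based in-context learning framework for point clouds. By deforming the query point cloud under task-specific prompt guidance, it overcomes the lack of geometric priors and the training-inference objective mismatch of existing masked point modeling methods, achieving better performance on reconstruction, denoising, and registration and the best results on a newly proposed out-of-domain generalization benchmark.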

Details Motivation: Existing point cloud in-context learning methods based on Masked Point Modeling (MPM) lack explicit geometric priors and suffer from a training-inference objective mismatch: training uses target-side information that is unavailable at inference. Method: Propose the DeformPIC framework, which abandons the masked-reconstruction paradigm and instead learns to geometrically deform the query point cloud under task-specific prompt guidance, enabling explicit geometric reasoning and consistent training and inference objectives. Result: Average Chamfer Distance is reduced by 1.6, 1.8, and 4.7 points on reconstruction, denoising, and registration, respectively; DeformPIC also achieves the best performance on a newly built out-of-domain generalization benchmark. Conclusion: By introducing deformation-based geometric reasoning, DeformPIC addresses key shortcomings of existing point cloud ICL methods and significantly improves multitask performance and cross-domain generalization. Abstract: Recent advances in point cloud In-Context Learning (ICL) have demonstrated strong multitask capabilities. Existing approaches typically adopt a Masked Point Modeling (MPM)-based paradigm for point cloud ICL. However, MPM-based methods directly predict the target point cloud from masked tokens without leveraging geometric priors, requiring the model to infer spatial structure and geometric details solely from token-level correlations via transformers. Additionally, these methods suffer from a training-inference objective mismatch, as the model learns to predict the target point cloud using target-side information that is unavailable at inference time. To address these challenges, we propose DeformPIC, a deformation-based framework for point cloud ICL. Unlike existing approaches that rely on masked reconstruction, DeformPIC learns to deform the query point cloud under task-specific guidance from prompts, enabling explicit geometric reasoning and consistent objectives. Extensive experiments demonstrate that DeformPIC consistently outperforms previous state-of-the-art methods, achieving reductions of 1.6, 1.8, and 4.7 points in average Chamfer Distance on reconstruction, denoising, and registration tasks, respectively. Furthermore, we introduce a new out-of-domain benchmark to evaluate generalization across unseen data distributions, where DeformPIC achieves state-of-the-art performance.
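
For readers unfamiliar with the metric behind the reported numbers, the sketch below computes a symmetric Chamfer Distance between two small point sets. Conventions differ between papers (squared or unsquared distances, sums or means, extra scaling factors), and the abstract does not say which one is used, so the mean-of-squared-distances form here is an assumption.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance (mean squared nearest-neighbor form)."""
    a_to_b = sum(min(sq_dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(sq_dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


# Two tiny point clouds that differ in one point.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(pred, target))
```

This brute-force version is quadratic in the number of points; evaluation code for real point clouds typically uses a spatial index or batched GPU kernels for the nearest-neighbor search.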

[105] Adaptive Local Frequency Filtering for Fourier-Encoded Implicit Neural Representations

Ligen Shi,Jun Qiu,Yuhang Zheng,Chang Liu

Main category: cs.CV

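TL;DR: This paper proposes an adaptive local frequency filtering method for Fourier-encoded implicit neural representations (INRs). It introduces a spatially varying parameter α(x) that dynamically modulates Fourier components, improving both the modeling of non-stationary signals and optimization efficiency.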

Details Motivation: Conventional Fourier feature mappings use fixed frequencies over the entire spatial domain, so they adapt poorly to signals with spatially varying local spectra and converge slowly on high-frequency details. Method: Propose an adaptive local frequency filtering method that introduces a spatially varying parameter α(x) to modulate the encoded Fourier components, yielding low-pass, band-pass, or high-pass behavior at different locations; analyze from the neural tangent kernel (NTK) perspective how the filter reshapes the effective kernel spectrum. Result: On 2D image fitting, 3D shape representation, and sparse data reconstruction, the method improves reconstruction quality and speeds up optimization relative to fixed-frequency baselines; the learned α(x) gives an intuitive visualization of spatial frequency preferences. Conclusion: Adaptive local frequency modulation is a practical and effective enhancement that improves the ability of Fourier-encoded INRs to model non-stationary continuous signals. Abstract: Fourier-encoded implicit neural representations (INRs) have shown strong capability in modeling continuous signals from discrete samples. However, conventional Fourier feature mappings use a fixed set of frequencies over the entire spatial domain, making them poorly suited to signals with spatially varying local spectra and often leading to slow convergence of high-frequency details. To address this issue, we propose an adaptive local frequency filtering method for Fourier-encoded INRs. The proposed method introduces a spatially varying parameter $α(\mathbf{x})$ to modulate encoded Fourier components, enabling a smooth transition among low-pass, band-pass, and high-pass behaviors at different spatial locations. We further analyze the effect of the proposed filter from the neural tangent kernel (NTK) perspective and provide an NTK-inspired interpretation of how it reshapes the effective kernel spectrum. Experiments on 2D image fitting, 3D shape representation, and sparse data reconstruction demonstrate that the proposed method consistently improves reconstruction quality and leads to faster optimization compared with fixed-frequency baselines. In addition, the learned $α(\mathbf{x})$ provides an intuitive visualization of spatially varying frequency preferences, which helps explain the behavior of the model on non-stationary signals. These results indicate that adaptive local frequency modulation is a practical enhancement for Fourier-encoded INRs.
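
To make the idea of a position-dependent filter on Fourier features concrete, here is a minimal 1D sketch. The abstract does not give the filter's functional form, so the Gaussian window in log2-frequency centered at α is purely an assumption chosen because it moves smoothly from low-pass to band-pass to high-pass as α grows. The hand-written `alpha_fn` stands in for what the paper learns, and all names are hypothetical.

```python
import math


def band_weights(alpha, freqs, width=1.0):
    """Assumed filter: Gaussian window in log2-frequency centered at alpha."""
    return [math.exp(-((math.log2(f) - alpha) ** 2) / (2.0 * width ** 2))
            for f in freqs]


def filtered_encoding(x, alpha_fn, freqs):
    """Fourier features of x, each frequency scaled by its local weight."""
    weights = band_weights(alpha_fn(x), freqs)
    feats = []
    for w, f in zip(weights, freqs):
        phase = 2.0 * math.pi * f * x
        feats.extend((w * math.sin(phase), w * math.cos(phase)))
    return feats


FREQS = [1.0, 2.0, 4.0, 8.0]


def alpha_fn(x):
    """Placeholder for a learned alpha(x): low-pass near 0, high-pass near 1."""
    return 3.0 * x


low = band_weights(alpha_fn(0.0), FREQS)   # emphasizes the lowest frequency
high = band_weights(alpha_fn(1.0), FREQS)  # emphasizes the highest frequency
features = filtered_encoding(0.5, alpha_fn, FREQS)
print(low, high, len(features))
```

In a trainable version, α(x) would be produced by a small network or grid and optimized jointly with the INR, which is what would let the per-location frequency preference be visualized afterwards.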

[106] HiDiGen: Hierarchical Diffusion for B-Rep Generation with Explicit Topological Constraints

Shurui Liu,Weide Chen,Ancong Wu

Main category: cs.CV

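TL;DR: HiDiGen is a two-stage hierarchical generation framework that first builds a topological scaffold and then refines geometry with Transformer-based diffusion modules, producing novel, diverse, and topologically valid B-rep CAD models.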

Details Motivation: B-rep is the standard format in CAD systems, but the tight coupling between its discrete topology and continuous geometry makes deep generative modeling difficult. Method: Propose HiDiGen: the first stage establishes face-edge incidence relations to form a topological scaffold and generates face proxies and initial edge curves; the second stage uses multiple Transformer-based diffusion modules to refine face surfaces and vertex positions, while dynamically establishing and enforcing edge-vertex adjacencies. Result: HiDiGen generates novel, diverse, and topologically valid CAD models and shows strong performance in experiments. Conclusion: Decoupling geometry and topology modeling into stages improves the quality and validity of B-rep generation, and the hierarchical diffusion strategy helps balance structural consistency with diversity. Abstract: Boundary representation (B-rep) is the standard 3D modeling format in CAD systems, encoding both geometric primitives and topological connectivity. Despite its prevalence, deep generative modeling of valid B-rep structures remains challenging due to the intricate interplay between discrete topology and continuous geometry. In this paper, we propose HiDiGen, a hierarchical generation framework that decouples geometry modeling into two stages, each guided by explicitly modeled topological constraints. Specifically, our approach first establishes face-edge incidence relations to define a coherent topological scaffold, upon which face proxies and initial edge curves are generated. Subsequently, multiple Transformer-based diffusion modules are employed to refine the geometry by generating precise face surfaces and vertex positions, with edge-vertex adjacencies dynamically established and enforced to preserve structural consistency. This progressive geometry hierarchy enables the generation of more novel and diverse shapes, while two-stage topological modeling ensures high validity. Experimental results show that HiDiGen achieves strong performance, generating novel, diverse, and topologically sound CAD models.

[107] A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos

Allen He,Qi Liu,Kun Liu,Xinchen Liu,Wu Liu

Main category: cs.CV

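TL;DR: This paper proposes an end-to-end method for temporal sentence grounding that introduces a Sentence Conditioned Adapter (SCADA) to dynamically modulate the video backbone, easing the task discrepancy problem and reaching state-of-the-art performance on two benchmarks.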

Details Motivation: Most existing methods extract video features with frozen, query-agnostic pre-trained visual encoders, so the video backbone is trained for visual classification but used for temporal sentence grounding, creating a task mismatch. Method: Propose a fully end-to-end training paradigm that jointly optimizes the video backbone and the localization head, and design a Sentence Conditioned Adapter (SCADA) that uses sentence features to adaptively tune a small portion of backbone parameters, yielding language-guided modulation of visual features. Result: The method significantly outperforms existing state-of-the-art approaches on two mainstream TSGV benchmarks; empirical results show that end-to-end learning beats frozen-backbone baselines, and SCADA supports deeper backbones with reduced memory use. Conclusion: End-to-end joint optimization and lightweight, language-conditioned adaptation are key to improving TSGV performance, and SCADA effectively bridges the semantic gap between visual representations and language queries. Abstract: Temporal sentence grounding in videos (TSGV) aims to localize a temporal segment that semantically corresponds to a sentence query from an untrimmed video. Most current methods adopt pre-trained query-agnostic visual encoders for offline feature extraction, and the video backbones are frozen and not optimized for TSGV. This leads to a task discrepancy issue for the video backbone trained for visual classification, but utilized for TSGV. To bridge this gap, we propose a fully end-to-end paradigm that jointly optimizes the video backbone and localization head. We first conduct an empirical study validating the effectiveness of end-to-end learning over frozen baselines across different model scales. Furthermore, we introduce a Sentence Conditioned Adapter (SCADA), which leverages sentence features to train a small portion of video backbone parameters adaptively. SCADA facilitates the deployment of deeper network backbones with reduced memory and significantly enhances visual representation by modulating feature maps through precise integration of linguistic embeddings. Experiments on two benchmarks show that our method outperforms state-of-the-art approaches. The code and models will be released.

[108] HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits

Leyang Jin,Yujian Zheng,Bingkui Tong,Yuda Qiu,Zhenyu Xie,Hao Li

Main category: cs.CV

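TL;DR: This paper proposes a new framework that uses the 3D priors of video generation models to turn single-view hair reconstruction into a calibrated multi-view reconstruction task, and achieves high-quality, efficient full-view 3D strand reconstruction with a neural orientation extractor and a two-stage strand-growing algorithm.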

Details Motivation: Existing methods rely on limited frontal-view cues and small-scale or style-restricted synthetic data, and struggle to produce consistent, realistic 3D hair structure in invisible regions. Method: Propose a single-view-to-multi-view reconstruction framework built on the 3D priors of video generation models; introduce a neural orientation extractor trained on sparse real-image annotations to improve full-view orientation estimation; design a two-stage strand-growing algorithm based on a hybrid implicit field to generate detailed 3D strand curves efficiently. Result: The method achieves state-of-the-art single-view 3D strand reconstruction in both visible and invisible regions across diverse hair portraits. Conclusion: The method overcomes the poor consistency and realism of invisible regions in single-view hair reconstruction and significantly improves reconstruction quality and generalization. Abstract: Reconstructing strand-level 3D hair from a single-view image is highly challenging, especially when preserving consistent and realistic attributes in unseen regions. Existing methods rely on limited frontal-view cues and small-scale/style-restricted synthetic data, often failing to produce satisfactory results in invisible regions. In this work, we propose a novel framework that leverages the strong 3D priors of video generation models to transform single-view hair reconstruction into a calibrated multi-view reconstruction task. To balance reconstruction quality and efficiency for the reformulated multi-view task, we further introduce a neural orientation extractor trained on sparse real-image annotations for better full-view orientation estimation. In addition, we design a two-stage strand-growing algorithm based on a hybrid implicit field to synthesize the 3D strand curves with fine-grained details at a relatively fast speed. Extensive experiments demonstrate that our method achieves state-of-the-art performance on single-view 3D hair strand reconstruction on a diverse range of hair portraits in both visible and invisible regions.

[109] Token Warping Helps MLLMs Look from Nearby Viewpoints

Phillip Y. Lee,Chanho Park,Mingue Park,Seungwoo Yoo,Juil Koo,Minhyuk Sung

Main category: cs.CV

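TL;DR: This paper proposes a token-based viewpoint transformation method that applies backward warping to image tokens in ViT-based MLLMs (instead of pixel-level warping), improving the robustness of visual reasoning under nearby viewpoint changes, and validates it on the newly built ViewBench benchmark.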

Details Motivation: Existing MLLMs are sensitive to viewpoint changes, and pixel-level warping is vulnerable to depth errors and introduces geometric distortions; inspired by the part-level structural representations posited in theories of human mental imagery, the paper asks whether image tokens can serve as an effective substrate for viewpoint transformation. Method: Propose and compare forward and backward token warping, focusing on backward token warping: define a dense grid on the target view and, for each grid point, retrieve the corresponding token from the source view; evaluate on the newly built ViewBench benchmark. Result: Backward token warping significantly improves the stability and semantic coherence of MLLM reasoning from nearby viewpoints, and on ViewBench it consistently outperforms all baselines, including pixel-wise warping, spatially fine-tuned MLLMs, and a generative warping method. Conclusion: Image tokens in ViTs can serve as a robust substrate for viewpoint transformation; backward token warping is a more robust and semantics-preserving mechanism that offers a new way to improve the spatial generalization of MLLMs. Abstract: Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.
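
Backward warping, as described in the abstract, loops over the target grid and pulls a token from the source view for each cell. The sketch below shows only that lookup pattern. The mapping `target_to_source` is a hypothetical callable (in practice it would be derived from depth and relative camera pose), nearest-neighbor retrieval is an assumption, and tokens are represented as strings for readability.

```python
def backward_warp_tokens(src_tokens, target_to_source, fill=None):
    """Fill each target grid cell with the nearest source-view token.

    src_tokens: H x W nested list of tokens from the source view.
    target_to_source: callable (row, col) -> (src_row, src_col), possibly
        fractional; a stand-in for a geometry-derived mapping.
    fill: value used when the lookup falls outside the source grid.
    """
    height, width = len(src_tokens), len(src_tokens[0])
    warped = []
    for r in range(height):
        row = []
        for c in range(width):
            sr, sc = target_to_source(r, c)
            ri, ci = round(sr), round(sc)
            inside = 0 <= ri < height and 0 <= ci < width
            row.append(src_tokens[ri][ci] if inside else fill)
        warped.append(row)
    return warped


# Toy example: the target view sees the scene shifted by one token column.
src = [["a", "b", "c"],
       ["d", "e", "f"]]
warped = backward_warp_tokens(src, lambda r, c: (r, c + 1))
print(warped)
```

Because every target cell is assigned exactly one value, backward warping leaves no holes inside the covered area, whereas forward warping (pushing source tokens to target positions) can leave gaps or collisions. Cells with no source correspondence still need a fill policy, shown here as `None`.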

[110] SPG: Sparse-Projected Guides with Sparse Autoencoders for Zero-Shot Anomaly Detection

Tomoyasu Nanaumi,Yukino Tsuzuki,Junichi Okubo,Junichiro Fujii,Takayoshi Yamashita

Main category: cs.CV

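TL;DR: This paper proposes Sparse-Projected Guides (SPG), a prompt-free framework for zero-shot anomaly detection and segmentation that uses frozen foundation model features and generalizes across categories without target-domain adaptation. SPG learns sparse guide coefficients in the latent space of a sparse autoencoder (SAE), generates normal/anomaly guide vectors from them, and optimizes the coefficients with a two-stage training strategy. Experiments show leading performance on the MVTec AD and VisA datasets, together with interpretability.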

Details Motivation: Existing prompt-based methods rely on handcrafted or learned prompt embeddings as normal/anomalous reference vectors, which limits generalization and interpretability; the zero-shot setting rules out target-domain fine-tuning, so a more robust and interpretable prompt-free mechanism is needed. Method: Propose Sparse-Projected Guides (SPG): (1) train a sparse autoencoder (SAE) on patch-token features; (2) optimize only the sparse guide coefficients (with the backbone and SAE frozen) under auxiliary pixel-level mask supervision; guide vectors are generated as linear combinations of SAE dictionary atoms weighted by the sparse coefficients, with no prompt embeddings. Result: Under cross-dataset zero-shot settings on MVTec AD and VisA, SPG with a DINOv3 backbone attains the highest pixel-level AUROC among the compared methods; it remains competitive with OpenCLIP (ViT-L/14@336px); the guide coefficients trace back to a small set of dictionary atoms, revealing category-general and category-specific factors. Conclusion: SPG is an efficient, interpretable, prompt-free approach to zero-shot anomaly detection and segmentation whose sparse guide mechanism gives strong generalization and traceable decisions on frozen foundation models. Abstract: We study zero-shot anomaly detection and segmentation using frozen foundation model features, where all learnable parameters are trained only on a labeled auxiliary dataset and deployed to unseen target categories without any target-domain adaptation. Existing prompt-based approaches use handcrafted or learned prompt embeddings as reference vectors for normal/anomalous states. We propose Sparse-Projected Guides (SPG), a prompt-free framework that learns sparse guide coefficients in the Sparse Autoencoder (SAE) latent space, which generate normal/anomaly guide vectors via the SAE dictionary. SPG employs a two-stage learning strategy on the labeled auxiliary dataset: (i) train an SAE on patch-token features, and (ii) optimize only guide coefficients using auxiliary pixel-level masks while freezing the backbone and SAE. On MVTec AD and VisA under cross-dataset zero-shot settings, SPG achieves competitive image-level detection and strong pixel-level segmentation; with DINOv3, SPG attains the highest pixel-level AUROC among the compared methods. We also report SPG instantiated with OpenCLIP (ViT-L/14@336px) to align the backbone with CLIP-based baselines. Moreover, the learned guide coefficients trace decisions back to a small set of dictionary atoms, revealing category-general and category-specific factors.
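
The step "sparse coefficients generate guide vectors via the SAE dictionary" amounts to a sparse linear combination of dictionary atoms, which the sketch below illustrates. The scoring rule (difference of cosine similarities to the anomaly and normal guides) is an assumption, since the abstract does not say how guides are compared with patch features. The tiny dictionary, the coefficients, and all names are hypothetical.

```python
import math


def guide_vector(coeffs, dictionary):
    """Sparse linear combination of dictionary atoms.

    coeffs: {atom_index: weight}, only non-zero entries are stored.
    dictionary: list of atom vectors (stand-in for SAE decoder rows).
    """
    dim = len(dictionary[0])
    guide = [0.0] * dim
    for idx, weight in coeffs.items():
        for d in range(dim):
            guide[d] += weight * dictionary[idx][d]
    return guide


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def anomaly_score(patch, normal_guide, anomaly_guide):
    """Assumed rule: positive when the patch is closer to the anomaly guide."""
    return cosine(patch, anomaly_guide) - cosine(patch, normal_guide)


# Hypothetical 2-D dictionary with three atoms; each guide uses one atom,
# so the decision can be traced back to that atom.
dictionary = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
normal_guide = guide_vector({0: 1.0}, dictionary)
anomaly_guide = guide_vector({1: 1.0}, dictionary)

print(anomaly_score([0.1, 0.9], normal_guide, anomaly_guide))
print(anomaly_score([1.0, 0.0], normal_guide, anomaly_guide))
```

Storing the coefficients as a sparse mapping mirrors the interpretability claim in the abstract: the few atoms with non-zero weight are the ones a decision can be attributed to.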

[111] Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework

Yu Zhu,Kang Li,Zheng Li,Pheng-Ann Heng

Main category: cs.CV

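TL;DR: This paper proposes a self-reflection hierarchical prompt framework that uses forward and backward knowledge transfer to improve class-incremental segmentation in surgical video scene parsing, avoids catastrophic forgetting, and yields significant gains on both CNN and Transformer models.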

Details Motivation: Existing class-incremental learning methods for surgical video scene parsing overlook the potential of forward transfer (old knowledge helping to learn new classes) and backward transfer (learning new classes refining old knowledge). Method: Build a self-reflection hierarchical prompt framework on a frozen pre-trained model: an instrument-aware hierarchical prompt parsing tree (shared prompts at the root, partially shared prompts at intermediate nodes, instrument-distinct prompts at the leaves) supports forward transfer; self-reflection refinement via directed-weighted graph propagation strengthens the representation of existing knowledge to support backward transfer. Result: On two public benchmarks, the method improves over competing methods by more than 5% (CNN-based model) and 11% (transformer-based foundation model), respectively. Conclusion: The framework exploits forward and backward knowledge transfer together, significantly improving class-incremental surgical instrument segmentation while remaining general and practical. Abstract: To continuously enhance model adaptability in surgical video scene parsing, recent studies incrementally update it to progressively learn to segment an increasing number of surgical instruments over time. However, prior works have consistently overlooked the potential of positive forward knowledge transfer, i.e., how past knowledge could help learn new classes, and positive backward knowledge transfer, i.e., how learning new classes could help refine past knowledge. In this paper, we propose a self-reflection hierarchical prompt framework that unlocks the power of positive forward and backward knowledge transfer in class incremental segmentation, aiming to proficiently learn new instruments, improve existing skills of regular instruments, and avoid catastrophic forgetting of old instruments. Our framework is built on a frozen, pre-trained model that adaptively appends instrument-aware prompts for new classes throughout training episodes. To enable positive forward knowledge transfer, we organize instrument prompts into a hierarchical prompt parsing tree with the instrument-shared prompt partition as the root node, n-part-shared prompt partitions as intermediate nodes and instrument-distinct prompt partitions as leaf nodes, to expose the reusable historical knowledge for new classes to simplify their learning. Conversely, to encourage positive backward knowledge transfer, we conduct self-reflection refining on existing knowledge by directed-weighted graph propagation, examining the knowledge associations recorded in the tree to improve its representativeness without causing catastrophic forgetting.
Our framework is applicable to both CNN-based models and advanced transformer-based foundation models, yielding more than 5% and 11% improvements over the competing methods on two public benchmarks respectively.

[112] InstructTable: Improving Table Structure Recognition Through Instructions

Boming Chen,Zining Wang,Zhentao Guo,Jianqiang Liu,Chen Duan,Yu Gu,Kai Zhou,Pengfei Yan

Main category: cs.CV

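TL;DR: This paper proposes InstructTable, an instruction-guided, multi-stage training framework for table structure recognition (TSR) that combines table instruction pre-training with TSR fine-tuning, and introduces the template-free synthesis method TME to build the complex-table benchmark BCDSTab, significantly improving recognition of complex tables.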

Details Motivation: Traditional vision models lack semantic support, while vision-language models underemphasize visual structure modeling, so neither accurately recognizes complex table structures with merged or empty cells. Method: Propose the InstructTable framework: (1) instruction-guided table structure pre-training that focuses on fine-grained structural patterns; (2) complementary TSR fine-tuning that preserves strong visual modeling; (3) Table Mix Expand (TME), a template-free synthesis method used to build the BCDSTab benchmark of 900 complex table images. Result: InstructTable achieves state-of-the-art performance on FinTabNet, PubTabNet, MUSTARD, and the self-built BCDSTab; ablation studies confirm the benefit of the table-specific instructions and the synthetic data. Conclusion: By combining semantic instruction guidance with visual structure modeling and supplementing them with high-quality synthetic data, InstructTable improves the robustness and accuracy of complex table structure recognition. Abstract: Table structure recognition (TSR) holds widespread practical importance by parsing tabular images into structured representations, yet encounters significant challenges when processing complex layouts involving merged or empty cells. Traditional visual-centric models rely exclusively on visual information while lacking crucial semantic support, thereby impeding accurate structural recognition in complex scenarios. Vision-language models leverage contextual semantics to enhance comprehension; however, these approaches underemphasize the modeling of visual structural information. To address these limitations, this paper introduces InstructTable, an instruction-guided multi-stage training TSR framework. Meticulously designed table instruction pre-training directs attention toward fine-grained structural patterns, enhancing comprehension of complex tables. Complementary TSR fine-tuning preserves robust visual information modeling, maintaining high-precision table parsing across diverse scenarios. Furthermore, we introduce Table Mix Expand (TME), an innovative template-free method for synthesizing large-scale authentic tabular data. Leveraging TME, we construct the Balanced Complex Dense Synthetic Tables (BCDSTab) benchmark, comprising 900 complex table images synthesized through our method to serve as a rigorous benchmark. Extensive experiments on multiple public datasets (FinTabNet, PubTabNet, MUSTARD) and BCDSTab demonstrate that InstructTable achieves state-of-the-art performance in TSR tasks.
Ablation studies further confirm the positive impact of the proposed tabular-data-specific instructions and synthetic data.

[113] Information-Regularized Constrained Inversion for Stable Avatar Editing from Sparse Supervision

Zhenxiao Liang,Qixing Huang

Main category: cs.CV

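TL;DR: This paper proposes a conditioning-guided edited reconstruction framework that performs constrained inversion in a structured avatar latent space and restricts updates to a low-dimensional, part-specific edit subspace, preventing identity leakage and temporal flicker.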

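Details Motivation: Existing methods that edit animatable human avatars from sparse supervision (for example, a few edited keyframes) are prone to identity leakage and pose-dependent temporal flicker. The root cause is that the edit constraints underdetermine the problem, which makes the inversion ill-conditioned. Method: The authors build a conditioning-guided edited reconstruction framework that performs constrained inversion in a structured avatar latent space and restricts updates to a low-dimensional, part-specific edit subspace. A conditioning objective is derived from a local linearization of the decoding-and-rendering pipeline; optimizing it yields an edit-subspace information matrix whose spectrum guides frame reweighting and keyframe activation. Result: The method only operates on small subspace matrices, can be implemented efficiently (e.g., with Hessian-vector products), and significantly improves temporal stability when edited supervision is limited. Conclusion: The framework models editing as an ill-conditioned inversion problem; structured latent-space constraints and information-matrix-driven optimization mitigate identity leakage and temporal flicker and make avatar editing under sparse supervision more robust and stable. Abstract: Editing animatable human avatars typically relies on sparse supervision, often a few edited keyframes, yet naively fitting a reconstructed avatar to these edits frequently causes identity leakage and pose-dependent temporal flicker. We argue that these failures are best understood as an ill-conditioned inversion: the available edited constraints do not sufficiently determine the latent directions responsible for the intended edit. We propose a conditioning-guided edited reconstruction framework that performs editing as a constrained inversion in a structured avatar latent space, restricting updates to a low-dimensional, part-specific edit subspace to prevent unintended identity changes. Crucially, we design the editing constraints during inversion by optimizing a conditioning objective derived from a local linearization of the full decoding-and-rendering pipeline, yielding an edit-subspace information matrix whose spectrum predicts stability and drives frame reweighting / keyframe activation. The resulting method operates on small subspace matrices and can be implemented efficiently (e.g., via Hessian-vector products), and improves stability under limited edited supervision.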

[114] Progressive Video Condensation with MLLM Agent for Long-form Video Understanding

Yufei Yin,Yuchen Xing,Qianke Meng,Minghao Chen,Yan Yang,Zhou Yu

Main category: cs.CV

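TL;DR: This paper proposes ProVCA, which progressively condenses a video at multiple granularities to locate query-relevant keyframes efficiently, improving MLLM video understanding while reducing compute.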

Details Motivation: Existing text-first pipelines lose fine-grained visual cues, while full-frame video MLLMs are too computationally expensive; a balance is needed between preserving visual detail and controlling compute cost. Method: ProVCA consists of a segment localization module (coarsely locates the relevant video segment), a snippet selection module (selects important snippets by similarity), and a keyframe refinement module (pinpoints keyframes within those snippets), giving a progressive condensation from segment to snippet to frame. Result: Zero-shot accuracies of 69.3% on EgoSchema, 80.5% on NExT-QA, and 77.7% on IntentQA, which are state of the art, while using fewer frames than previous training-free methods. Conclusion: ProVCA balances accuracy and efficiency in video understanding and shows that progressive keyframe localization is a viable way to strengthen MLLM video reasoning. Abstract: Understanding long videos requires extracting query-relevant information from long sequences under tight compute budgets. Existing text-then-LLM pipelines lose fine-grained visual cues, while video-based multimodal large language models (MLLMs) can keep visual details but are too frame-hungry and computationally expensive. In this work, we aim to harness MLLMs for efficient video understanding. We propose ProVCA, a progressive video condensation agent that iteratively locates key video frames at multiple granularities. ProVCA first adopts a segment localization module to identify the video segment relevant to the query, then a snippet selection module to select important snippets based on similarity, and finally a keyframe refinement module to pinpoint specific keyframes in those snippets. By progressively narrowing the scope from coarse segments to fine frames, ProVCA identifies a small set of keyframes for MLLM-based reasoning. ProVCA achieves state-of-the-art zero-shot accuracies of 69.3% on EgoSchema, 80.5% on NExT-QA, and 77.7% on IntentQA, while using fewer frames than previous training-free methods.
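
The coarse-to-fine narrowing described for ProVCA can be illustrated with a toy sketch. This is not the authors' implementation: it assumes each frame already has a scalar query-relevance score (in the paper these decisions involve an MLLM agent and similarity measures that the abstract does not specify), and the names `condense`, `segment_len`, `snippet_len`, and `top_snippets` are hypothetical.

```python
def chunks(indices, size):
    """Split a list of frame indices into consecutive chunks of at most `size`."""
    return [indices[i:i + size] for i in range(0, len(indices), size)]


def mean_score(scores, idxs):
    """Average relevance score over a group of frame indices."""
    return sum(scores[i] for i in idxs) / len(idxs)


def condense(scores, segment_len, snippet_len, top_snippets):
    """Toy coarse-to-fine keyframe selection (assumed scoring, not ProVCA itself)."""
    frames = list(range(len(scores)))
    # Stage 1: keep the single segment with the highest mean relevance.
    segment = max(chunks(frames, segment_len), key=lambda s: mean_score(scores, s))
    # Stage 2: keep the most relevant snippets inside that segment.
    snippets = sorted(
        chunks(segment, snippet_len),
        key=lambda s: mean_score(scores, s),
        reverse=True,
    )[:top_snippets]
    # Stage 3: keep the best-scoring frame of each surviving snippet.
    return sorted(max(s, key=lambda i: scores[i]) for s in snippets)


scores = [0.1, 0.2, 0.1, 0.0, 0.3, 0.9, 0.2, 0.8, 0.1, 0.1, 0.4, 0.0]
print(condense(scores, segment_len=4, snippet_len=2, top_snippets=2))  # [5, 7]
```

Only the two selected frames would then be passed to the downstream model, which is where the frame savings in such a scheme would come from.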

[115] Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models

Hai Nguyen-Truong,Alper Balbay,Tunga Bayrak

Main category: cs.CV

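TL;DR: This paper frames visual explanation in geometry education as a Referring Image Segmentation (RIS) task. It presents a fully automated synthetic data engine that generates more than 200,000 geometry diagrams with exact masks and diverse language descriptions, fine-tunes vision-language models for the geometry domain, and reaches 85% on a newly proposed geometry-aware metric, Buffered IoU.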

Details Motivation: RIS models trained on natural images perform very poorly on geometric diagrams because of the domain gap (photographic scenes versus abstract diagrams), and suitable annotated data is lacking. Method: The authors build a fully automated procedural data engine that synthesizes a large number of geometry diagrams with pixel-level masks and diverse language descriptions, fine-tune vision-language models (such as Florence-2) on the geometry domain, and propose the geometry-aware evaluation metric Buffered IoU. Result: The fine-tuned Florence-2 reaches 49% IoU and 85% Buffered IoU on geometric RIS, far above the <1% IoU of the zero-shot setting; Buffered IoU is shown to better reflect localization quality for thin geometric structures. Conclusion: The work lays the data, model, and evaluation groundwork for Artificial General Teachers (AGTs) that provide visually grounded, step-by-step geometry explanations. Abstract: We study visual explanation in geometry education as a Referring Image Segmentation (RIS) problem: given a diagram and a natural language description, the task is to produce a pixel-level mask for the referred geometric element. However, existing RIS models trained on natural image benchmarks such as RefCOCO fail catastrophically on geometric diagrams due to the fundamental domain shift between photographic scenes and abstract, textureless schematics. To address the absence of suitable training data, we present a fully automated procedural data engine that generates over 200,000 synthetic geometry diagrams with pixel-perfect segmentation masks and linguistically diverse referring expressions, requiring zero manual annotation. We further propose domain-specific fine-tuning of vision-language models (VLMs), demonstrating that a fine-tuned Florence-2 achieves 49% IoU and 85% Buffered IoU (BIoU), compared to <1% IoU in zero-shot settings. We introduce Buffered IoU, a geometry-aware evaluation metric that accounts for thin-structure localization, and show that it better reflects true segmentation quality than standard IoU. Our results establish a foundation for building Artificial General Teachers (AGTs) capable of providing visually grounded, step-by-step explanations of geometry problems.
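
The abstract does not give a formula for Buffered IoU, so the sketch below only illustrates why a tolerance-based metric can differ sharply from standard IoU on thin structures. It assumes one plausible reading: both masks are dilated by a small pixel radius before the IoU is computed. The functions `dilate`, `iou`, and `buffered_iou` and the `radius` parameter are hypothetical and may not match the paper's definition.

```python
def dilate(mask, radius):
    """Grow a set of (row, col) pixels by `radius` using a square window."""
    offsets = range(-radius, radius + 1)
    return {(r + dr, c + dc) for (r, c) in mask for dr in offsets for dc in offsets}


def iou(a, b):
    """Standard intersection over union of two pixel sets."""
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return len(a & b) / len(union)


def buffered_iou(pred, gt, radius=1):
    """Assumed variant: IoU after dilating both masks by `radius` pixels."""
    return iou(dilate(pred, radius), dilate(gt, radius))


# A one-pixel-wide line predicted one row above the ground truth.
pred = {(5, c) for c in range(10)}
gt = {(6, c) for c in range(10)}
print(iou(pred, gt))                     # 0.0
print(buffered_iou(pred, gt, radius=1))  # 0.5
```

In this toy case a one-pixel offset drives standard IoU to zero even though the prediction is visually almost correct, while the buffered variant still gives partial credit.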

[116] EvaNet: Towards More Efficient and Consistent Infrared and Visible Image Fusion Assessment

Chunyang Cheng,Tianyang Xu,Xiao-Jun Wu,Tao Zhou,Hui Li,Zhangyong Tang,Josef Kittler

Main category: cs.CV

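TL;DR: This paper proposes EvaNet, a unified evaluation framework built specifically for image fusion that is lightweight, efficient, and more consistent with human perception. It decomposes the fusion result, trains with contrastive learning and large-language-model guidance, and introduces the first consistency evaluation paradigm.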

Details Motivation: Most existing image fusion metrics are borrowed directly from other vision tasks without adaptation to fusion, which makes evaluation inaccurate and computationally expensive. Method: A unified evaluation framework based on a lightweight network. Following a divide-and-conquer strategy, the fused image is first decomposed into infrared and visible components, and information preservation is then assessed for each component separately. Training uses contrastive learning and scene-aware guidance from a large language model. The authors also build the first consistency evaluation framework, which uses no-reference scores and downstream task performance as references for alignment with human perception. Result: On standard image fusion benchmarks the method is up to 1,000 times faster than traditional metrics and agrees markedly better with human subjective judgments and downstream task performance. Conclusion: The work redefines the evaluation paradigm for image fusion, shows that a dedicated, learnable, perception-aligned evaluation model outperforms generic hand-crafted metrics, and provides a new benchmark and an open-source tool (EvaNet) for follow-up research. Abstract: Evaluation is essential in image fusion research, yet most existing metrics are directly borrowed from other vision tasks without proper adaptation. These traditional metrics, often based on complex image transformations, not only fail to capture the true quality of the fusion results but also are computationally demanding. To address these issues, we propose a unified evaluation framework specifically tailored for image fusion. At its core is a lightweight network designed to efficiently approximate widely used metrics, following a divide-and-conquer strategy. Unlike conventional approaches that directly assess similarity between fused and source images, we first decompose the fusion result into infrared and visible components. The evaluation model is then used to measure the degree of information preservation in these separated components, effectively disentangling the fusion evaluation process. During training, we incorporate a contrastive learning strategy and inform our evaluation model with a perceptual scene assessment provided by a large language model. Last, we propose the first consistency evaluation framework, which measures the alignment between image fusion metrics and human visual perception, using both independent no-reference scores and downstream task performance as objective references. Extensive experiments show that our learning-based evaluation paradigm delivers both superior efficiency (up to 1,000 times faster) and greater consistency across a range of standard image fusion benchmarks. Our code will be publicly available at https://github.com/AWCXV/EvaNet.

[117] RayMamba: Ray-Aligned Serialization for Long-Range 3D Object Detection

Cheng Lu,Mingqian Ji,Shanshan Zhang,Zhihao Li,Jian Yang

Main category: cs.CV

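TL;DR: This paper proposes RayMamba, a geometry-aware, plug-and-play enhancement module that uses a ray-aligned serialization strategy to improve voxel-based 3D detectors at long range (for example, 40–50 m), markedly improving context modeling in sparse LiDAR point clouds.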

Details Motivation: Long-range LiDAR observations are highly sparse and fragmented, so existing detectors struggle to model context reliably. Existing SSM-based methods use generic serialization strategies that fail to preserve meaningful neighborhoods in sparse scenes, which limits their long-range modeling. Method: RayMamba uses a ray-aligned, sector-wise serialization strategy that organizes sparse voxels into ordered sequences preserving directional continuity and occlusion-related context for Mamba-based modeling. It supports both LiDAR-only and multimodal detectors with little overhead. Result: Gains of up to 2.49 mAP and 1.59 NDS in the 40–50 m range on nuScenes; on Argoverse 2, VoxelNeXt improves from 30.3 to 31.2 mAP. Conclusion: Through geometry-aware serialization, RayMamba eases context modeling in sparse long-range scenes and offers an efficient, general, plug-and-play enhancement for long-range 3D detection. Abstract: Long-range 3D object detection remains challenging because LiDAR observations become highly sparse and fragmented in the far field, making reliable context modeling difficult for existing detectors. To address this issue, recent state space model (SSM)-based methods have improved long-range modeling efficiency. However, their effectiveness is still limited by generic serialization strategies that fail to preserve meaningful contextual neighborhoods in sparse scenes. To overcome this limitation, we propose RayMamba, a geometry-aware plug-and-play enhancement for voxel-based 3D detectors. RayMamba organizes sparse voxels into sector-wise ordered sequences through a ray-aligned serialization strategy, which preserves directional continuity and occlusion-related context for subsequent Mamba-based modeling. It is compatible with both LiDAR-only and multimodal detectors, while introducing only modest overhead. Extensive experiments on nuScenes and Argoverse 2 demonstrate consistent improvements across strong baselines. In particular, RayMamba achieves up to 2.49 mAP and 1.59 NDS gain in the challenging 40–50 m range on nuScenes, and further improves VoxelNeXt on Argoverse 2 from 30.3 to 31.2 mAP.
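
To make the idea of a sector-wise, ray-aligned ordering concrete, here is a minimal sketch under stated assumptions: voxel centers are given in a sensor-centered frame, sectors are equal azimuth bins, and voxels within a sector are ordered by horizontal range so that nearer (potentially occluding) voxels precede farther ones. The function `ray_aligned_order` and the parameter `num_sectors` are hypothetical; the abstract does not specify RayMamba's actual ordering rule.

```python
import math


def ray_aligned_order(voxels, num_sectors=8):
    """Return voxel indices sorted by (azimuth sector, range from sensor).

    `voxels` is a list of (x, y, z) centers in a sensor-centered frame.
    """
    def key(item):
        idx, (x, y, _z) = item
        azimuth = math.atan2(y, x) % (2 * math.pi)  # map angle into [0, 2*pi)
        sector = int(azimuth / (2 * math.pi) * num_sectors)
        sector = min(sector, num_sectors - 1)  # guard against rounding at 2*pi
        return (sector, math.hypot(x, y), idx)

    return [idx for idx, _ in sorted(enumerate(voxels), key=key)]


voxels = [(10.0, 1.0, 0.0), (-1.0, 5.0, 0.0), (2.0, 0.5, 0.0), (-0.2, 1.0, 0.0)]
print(ray_aligned_order(voxels, num_sectors=4))  # [2, 0, 3, 1]
```

Each sector's voxels form a contiguous run ordered outward from the sensor, which is the kind of sequence a state space model could then scan.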

[118] UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting

Geonuk Kim,Minhoi Kim,Kangil Lee,Minsu Kim,Hyeonseong Jeon,Jeonghoon Han,Hyoungjoon Lim,Junho Yim

Main category: cs.CV

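TL;DR: This paper proposes UniSpector, which builds a semantically structured, transferable prompt topology to address open-set recognition and prompt embedding collapse in industrial defect inspection, and significantly outperforms baselines on the newly proposed Inspect Anything benchmark.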

Details Motivation: Most existing industrial inspection methods assume a closed set and cannot recognize unknown defects. Visual prompting scales well, but high intra-class variance and small inter-class differences often cause prompt embedding collapse. Method: The UniSpector framework has three components: (1) a Spatial-Spectral Prompt Encoder that extracts orientation-invariant, fine-grained features; (2) a Contrastive Prompt Encoder that regularizes the prompt space into a semantically ordered angular manifold; (3) Prompt-guided Query Selection that generates adaptive object queries. Result: On Inspect Anything, the first benchmark for visual-prompt-based open-set defect localization, UniSpector outperforms baselines by at least 19.7% in AP50b and 15.8% in AP50m. Conclusion: UniSpector enables a scalable, retraining-free inspection paradigm that supports continuously evolving production lines and offers key insights for designing generic visual prompting. Abstract: Although industrial inspection systems should be capable of recognizing unprecedented defects, most existing approaches operate under a closed-set assumption, which prevents them from detecting novel anomalies. While visual prompting offers a scalable alternative for industrial inspection, existing methods often suffer from prompt embedding collapse due to high intra-class variance and subtle inter-class differences. To resolve this, we propose UniSpector, which shifts the focus from naive prompt-to-region matching to the principled design of a semantically structured and transferable prompt topology. UniSpector employs the Spatial-Spectral Prompt Encoder to extract orientation-invariant, fine-grained representations; these serve as a solid basis for the Contrastive Prompt Encoder to explicitly regularize the prompt space into a semantically organized angular manifold. Additionally, Prompt-guided Query Selection generates adaptive object queries aligned with the prompt. We introduce Inspect Anything, the first benchmark for visual-prompt-based open-set defect localization, where UniSpector significantly outperforms baselines by at least 19.7% and 15.8% in AP50b and AP50m, respectively. These results show that our method enables a scalable, retraining-free inspection paradigm for continuously evolving industrial environments, while offering critical insights into the design of generic visual prompting.

[119] SentiAvatar: Towards Expressive and Interactive Digital Humans

Chuhao Jin,Rui Zhang,Qingzhe Gao,Haoyu Shi,Dayu Wu,Yichen Jiang,Yihan Wu,Ruihua Song

Main category: cs.CV

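TL;DR: This paper proposes SentiAvatar, a framework for building 3D digital humans (such as the virtual character SuSu) that speak, gesture, and emote in real time. By building the new SuSuInterActs dataset, pre-training a Motion Foundation Model, and proposing an audio-aware plan-then-infill architecture, it addresses three challenges (scarce multimodal data, fragile semantic-to-motion mapping, and frame-level motion-prosody synchronization) and reaches state of the art on several metrics.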

Details Motivation: Building realistic, interactive 3D digital humans faces three challenges: a lack of large-scale, high-quality multimodal dialogue data; the difficulty of robustly mapping semantics to full-body motion; and the difficulty of finely synchronizing speech prosody and motion at the frame level. Method: (1) Collect and release the SuSuInterActs dataset (21K clips, 37 hours, with synchronized speech, full-body motion, and facial expressions); (2) pre-train a Motion Foundation Model on 200K+ motion sequences; (3) propose an audio-aware, two-stage plan-then-infill architecture that first plans sentence-level semantic motion and then performs speech-driven frame-level interpolation. Result: R@1 of 43.64% on SuSuInterActs (about twice the best baseline), and FGD 4.941 and BC 8.078 on BEATv2; generating 6 s of motion takes only 0.3 s, with unlimited multi-turn streaming. Conclusion: Through coordinated advances in data, model, and architecture, SentiAvatar markedly improves the quality, rhythmic consistency, and real-time performance of digital-human motion generation and offers a scalable path for embodied intelligence and virtual interaction. Abstract: We present SentiAvatar, a framework for building expressive interactive 3D digital humans, and use it to create SuSu, a virtual character that speaks, gestures, and emotes in real time. Achieving such a system remains challenging, as it requires jointly addressing three key problems: the lack of large-scale, high-quality multimodal data, robust semantic-to-motion mapping, and fine-grained frame-level motion-prosody synchronization. To solve these problems, first, we build SuSuInterActs (21K clips, 37 hours), a dialogue corpus captured via optical motion capture around a single character with synchronized speech, full-body motion, and facial expressions. Second, we pre-train a Motion Foundation Model on 200K+ motion sequences, equipping it with rich action priors that go well beyond the conversation. We then propose an audio-aware plan-then-infill architecture that decouples sentence-level semantic planning from frame-level prosody-driven interpolation, so that generated motions are both semantically appropriate and rhythmically aligned with speech. Experiments show that SentiAvatar achieves state-of-the-art on both SuSuInterActs (R@1 43.64%, nearly 2 times the best baseline) and BEATv2 (FGD 4.941, BC 8.078), producing 6s of output in 0.3s with unlimited multi-turn streaming. The source code, model, and dataset are available at https://sentiavatar.github.io.

[120] GP-4DGS: Probabilistic 4D Gaussian Splatting from Monocular Video via Variational Gaussian Processes

Mijeong Kim,Jungtaek Kim,Bohyung Han

Main category: cs.CV

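TL;DR: This paper proposes GP-4DGS, which brings Gaussian Processes (GPs) into 4D Gaussian Splatting (4DGS) for uncertainty-aware modeling of dynamic scenes, supporting uncertainty quantification for motion predictions, motion estimation in sparsely observed regions, and temporal extrapolation.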

Details Motivation: Existing 4DGS methods only provide deterministic reconstruction; they cannot capture motion ambiguity and lack a way to assess prediction reliability. Method: The authors design spatio-temporal kernels that model the correlation structure of deformation fields and use variational Gaussian Processes with inducing points for scalable inference, embedding the GP in the 4DGS framework. Result: Experiments show that GP-4DGS improves reconstruction quality while providing reliable uncertainty estimates that identify regions of high motion ambiguity. Conclusion: The work is the first to bring probabilistic modeling into 4DGS systematically and moves neural graphics toward interpretable, trustworthy understanding of dynamic scenes. Abstract: We present GP-4DGS, a novel framework that integrates Gaussian Processes (GPs) into 4D Gaussian Splatting (4DGS) for principled probabilistic modeling of dynamic scenes. While existing 4DGS methods focus on deterministic reconstruction, they are inherently limited in capturing motion ambiguity and lack mechanisms to assess prediction reliability. By leveraging the kernel-based probabilistic nature of GPs, our approach introduces three key capabilities: (i) uncertainty quantification for motion predictions, (ii) motion estimation for unobserved or sparsely sampled regions, and (iii) temporal extrapolation beyond observed training frames. To scale GPs to the large number of Gaussian primitives in 4DGS, we design spatio-temporal kernels that capture the correlation structure of deformation fields and adopt variational Gaussian Processes with inducing points for tractable inference. Our experiments show that GP-4DGS enhances reconstruction quality while providing reliable uncertainty estimates that effectively identify regions of high motion ambiguity. By addressing these challenges, our work takes a meaningful step toward bridging probabilistic modeling and neural graphics.

[121] BEVPredFormer: Spatio-temporal Attention for BEV Instance Prediction in Autonomous Driving

Miguel Antunes-García,Santiago Montiel-Marín,Fabio Sánchez-García,Rodrigo Gutiérrez-Moreno,Rafael Barea,Luis M. Bergasa

Main category: cs.CV

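TL;DR: This paper proposes BEVPredFormer, a new camera-only architecture for bird's-eye-view (BEV) instance prediction that uses attention-based temporal processing and 3D projection to improve spatio-temporal understanding, and matches or surpasses state-of-the-art performance on nuScenes.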

Details Motivation: Traditional modular perception pipelines suffer from cumulative errors and latency. Existing instance prediction models struggle to process the dense spatio-temporal information of dynamic driving environments, which requires modeling fine-grained motion patterns and long-range dependencies while staying real-time. Method: BEVPredFormer uses (1) camera-only input; (2) attention-based temporal processing and 3D projection; (3) a recurrent-free design with gated transformer layers; (4) divided spatio-temporal attention; (5) multi-scale task heads; (6) a difference-guided feature extraction module. Result: On nuScenes, BEVPredFormer is on par with or better than current state-of-the-art methods; ablation studies validate each component. Conclusion: BEVPredFormer offers a robust, efficient, end-to-end BEV instance prediction solution for autonomous driving perception, particularly suited to real-time understanding of dynamic scenes. Abstract: A robust awareness of how dynamic scenes evolve is essential for Autonomous Driving systems, as they must accurately detect, track, and predict the behaviour of surrounding obstacles. Traditional perception pipelines that rely on modular architectures tend to suffer from cumulative errors and latency. Instance Prediction models provide a unified solution, performing Bird's-Eye-View segmentation and motion estimation across current and future frames using information directly obtained from different sensors. However, a key challenge in these models lies in the effective processing of the dense spatial and temporal information inherent in dynamic driving environments. This level of complexity demands architectures capable of capturing fine-grained motion patterns and long-range dependencies without compromising real-time performance. We introduce BEVPredFormer, a novel camera-only architecture for BEV instance prediction that uses attention-based temporal processing to improve temporal and spatial comprehension of the scene and relies on an attention-based 3D projection of the camera information. BEVPredFormer employs a recurrent-free design that incorporates gated transformer layers, divided spatio-temporal attention mechanisms, and multi-scale head tasks. Additionally, we incorporate a difference-guided feature extraction module that enhances temporal representations. Extensive ablation studies validate the effectiveness of each architectural component.
When evaluated on the nuScenes dataset, BEVPredFormer was on par with or surpassed State-Of-The-Art methods, highlighting its potential for robust and efficient Autonomous Driving perception.

[122] PolyReal: A Benchmark for Real-World Polymer Science Workflows

Wanhao Liu,Weida Wang,Jiaqing Xie,Suorong Yang,Jue Wang,Benteng Chen,Guangtao Mei,Zonglin Yang,Shufei Zhang,Yuchun Mo,Lang Cheng,Jin Zeng,Houqiang Li,Wanli Ouyang,Yuqiang Li

Main category: cs.CV

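TL;DR: This paper proposes PolyReal, a multimodal benchmark grounded in real polymer experiment workflows for evaluating multimodal large language models (MLLMs) across the full lifecycle of scientific practice. It reveals a marked imbalance between abstract knowledge reasoning and practical tasks such as lab safety analysis and raw data extraction.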

Details Motivation: Existing polymer science benchmarks overlook real research workflows and cannot systematically evaluate MLLMs across the full, practice-driven experimental lifecycle; a benchmark closer to actual scientific practice is needed. Method: The authors build PolyReal, which covers five key capabilities across the polymer experiment lifecycle (foundational knowledge application, lab safety analysis, experiment mechanism reasoning, raw data extraction, and performance and application exploration), and evaluate mainstream MLLMs on it. Result: MLLMs perform well on knowledge-intensive tasks such as mechanism reasoning but drop sharply on practice-oriented tasks such as safety analysis and data extraction, exposing a large gap between abstract knowledge and its contextual, practical application. Conclusion: PolyReal fills a gap in evaluating MLLMs on real scientific workflows and provides a practical, extensible benchmark for advancing reliable AI use in real research settings. Abstract: Multimodal Large Language Models (MLLMs) excel in general domains but struggle with complex, real-world science. We posit that polymer science, an interdisciplinary field spanning chemistry, physics, biology, and engineering, is an ideal high-stakes testbed due to its diverse multimodal data. Yet, existing benchmarks related to polymer science largely overlook real-world workflows, limiting their practical utility and failing to systematically evaluate MLLMs across the full, practice-grounded lifecycle of experimentation. We introduce PolyReal, a novel multimodal benchmark grounded in real-world scientific practices to evaluate MLLMs on the full lifecycle of polymer experimentation. It covers five critical capabilities: (1) foundational knowledge application; (2) lab safety analysis; (3) experiment mechanism reasoning; (4) raw data extraction; and (5) performance & application exploration. Our evaluation of leading MLLMs on PolyReal reveals a capability imbalance. While models perform well on knowledge-intensive reasoning (e.g., Experiment Mechanism Reasoning), they drop sharply on practice-based tasks (e.g., Lab Safety Analysis and Raw Data Extraction). This exposes a severe gap between abstract scientific knowledge and its practical, context-dependent application, showing that these real-world tasks remain challenging for MLLMs. Thus, PolyReal helps address this evaluation gap and provides a practical benchmark for assessing AI systems in real-world scientific workflows.

[123] Modality-Specific Hierarchical Enhancement for RGB-D Camouflaged Object Detection

Yuzhen Niu,Yangqing Wang,Ri Cheng,Fusheng Li,Rongshen Wang,Zhichen Yang

Main category: cs.CV

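TL;DR: This paper proposes MHENet, a framework for RGB-D camouflaged object detection that uses modality-specific hierarchical enhancement and adaptive fusion to make better use of RGB and depth features.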

Details Motivation: Existing RGB-D camouflaged object detection methods underuse modality-specific cues, which limits fusion quality, mainly because RGB and depth features are fused directly after backbone extraction without modality-specific enhancement. Method: The MHENet framework includes a Texture Hierarchical Enhancement Module (THEM) that enhances high-frequency texture, a Geometry Hierarchical Enhancement Module (GHEM) that enhances geometric structure via learnable gradient extraction, and an Adaptive Dynamic Fusion Module (ADFM) that performs spatially adaptive feature fusion. Result: Experiments on four benchmark datasets show that MHENet surpasses 16 state-of-the-art methods both qualitatively and quantitatively. Conclusion: Modality-specific enhancement combined with adaptive fusion markedly improves RGB-D camouflaged object detection and validates the hierarchical enhancement strategy. Abstract: Camouflaged object detection (COD) is challenging due to high target-background similarity, and recent methods address this by complementarily using RGB-D texture and geometry cues. However, RGB-D COD methods still underutilize modality-specific cues, which limits fusion quality. We believe this is because RGB and depth features are fused directly after backbone extraction without modality-specific enhancement. To address this limitation, we propose MHENet, an RGB-D COD framework that performs modality-specific hierarchical enhancement and adaptive fusion of RGB and depth features. Specifically, we introduce a Texture Hierarchical Enhancement Module (THEM) to amplify subtle texture variations by extracting high-frequency information and a Geometry Hierarchical Enhancement Module (GHEM) to enhance geometric structures via learnable gradient extraction, while preserving cross-scale semantic consistency. Finally, an Adaptive Dynamic Fusion Module (ADFM) adaptively fuses the enhanced texture and geometry features with spatially varying weights. Experiments on four benchmarks demonstrate that MHENet surpasses 16 state-of-the-art methods qualitatively and quantitatively. Code is available at https://github.com/afdsgh/MHENet.

[124] MMTalker: Multiresolution 3D Talking Head Synthesis with Multimodal Feature Fusion

Bin Liu,Zhixiang Xiong,Zhifen He,Bo Li

Main category: cs.CV

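TL;DR: This paper proposes MMTalker, which uses a multi-resolution representation and multimodal feature fusion to improve lip-sync accuracy and expression realism in speech-driven 3D facial animation.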

Details Motivation: Existing speech-driven 3D facial animation methods still struggle with lip-sync accuracy and realistic expressions, mainly because the cross-modal mapping is highly ill-posed. Method: Mesh parameterization and non-uniform differentiable sampling give a continuous, detailed 3D face representation; a residual graph convolutional network and a dual cross-attention mechanism extract discriminative motion features from multiple modalities (speech and facial mesh geometry); a lightweight regression network then jointly predicts vertex-wise geometric displacements in the canonical UV space. Result: Significantly better than current state-of-the-art methods, especially in the synchronization accuracy of lip and eye movements. Conclusion: Through multi-resolution modeling and multimodal fusion, MMTalker eases the ill-posedness of the speech-to-3D-facial-motion mapping and improves the realism and synchronization of the synthesized animation. Abstract: Speech-driven three-dimensional (3D) facial animation synthesis aims to build a mapping from one-dimensional (1D) speech signals to time-varying 3D facial motion signals. Current methods still face challenges in maintaining lip-sync accuracy and producing realistic facial expressions, primarily due to the highly ill-posed nature of this cross-modal mapping. In this paper, we introduce a novel 3D audio-driven facial animation synthesis method through multi-resolution representation and multi-modal feature fusion, called MMTalker, which can accurately reconstruct the rich details of 3D facial motion. We first achieve the continuous representation of 3D face with details by mesh parameterization and non-uniform differentiable sampling. The mesh parameterization technique establishes the correspondence between UV plane and 3D facial mesh and is used to offer ground truth for the continuous learning. Differentiable non-uniform sampling enables precise facial detail acquisition by setting learnable sampling probability in each triangular face. Next, we employ residual graph convolutional network and dual cross-attention mechanism to extract discriminative facial motion feature from multiple input modalities. This proposed multimodal fusion strategy makes full use of the hierarchical features of speech and the explicit spatiotemporal geometric features of facial mesh. Finally, a lightweight regression network predicts the vertex-wise geometric displacements of the synthesized talking face by jointly processing the sampled points in the canonical UV space and the encoded facial motion features.
Comprehensive experiments demonstrate that significant improvements are achieved over state-of-the-art methods, especially in the synchronization accuracy of lip and eye movements.

[125] Learning from Synthetic Data via Provenance-Based Input Gradient Guidance

Koshiro Nagano,Ryo Fujii,Ryo Hachiuma,Fumiaki Sato,Taiki Sekii,Hideo Saito

Main category: cs.CV

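TL;DR: This paper proposes a learning framework that uses provenance information from the synthetic data generation process (whether each region of the input space originates from the target object) as an auxiliary supervisory signal. By decomposing and guiding input gradients, it suppresses the model's reliance on non-target regions and directly promotes discriminative representations of target regions.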

Details Motivation: Existing synthetic-data learning methods improve robustness only indirectly, and synthesis biases and artifacts can lead the model to learn spurious correlations; there is no explicit supervision over which input regions are truly discriminative. Method: The approach uses provenance information from the synthesis process (target versus non-target region labels) to decompose input gradients and introduces input gradient guidance that suppresses gradient responses over non-target regions. Result: Effectiveness and generality are demonstrated across multiple tasks and modalities, including weakly supervised object localization, spatio-temporal action localization, and image classification. Conclusion: Explicitly using the provenance of synthetic data guides the model to focus on truly discriminative target regions and markedly improves robustness and generalization. Abstract: Learning methods using synthetic data have attracted attention as an effective approach for increasing the diversity of training data while reducing collection costs, thereby improving the robustness of model discrimination. However, many existing methods improve robustness only indirectly through the diversification of training samples and do not explicitly teach the model which regions in the input space truly contribute to discrimination; consequently, the model may learn spurious correlations caused by synthesis biases and artifacts. Motivated by this limitation, this paper proposes a learning framework that uses provenance information obtained during the training data synthesis process, indicating whether each region in the input space originates from the target object, as an auxiliary supervisory signal to promote the acquisition of representations focused on target regions. Specifically, input gradients are decomposed based on information about target and non-target regions during synthesis, and input gradient guidance is introduced to suppress gradients over non-target regions. This suppresses the model's reliance on non-target regions and directly promotes the learning of discriminative representations for target regions. Experiments demonstrate the effectiveness and generality of the proposed method across multiple tasks and modalities, including weakly supervised object localization, spatio-temporal action localization, and image classification.

[126] CrossWeaver: Cross-modal Weaving for Arbitrary-Modality Semantic Segmentation

Zelin Zhang,Kedi Li,Huiqi Liang,Tao Zhang,Chuanzhi Xu

Main category: cs.CV

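TL;DR: This paper proposes CrossWeaver, which uses a Modality Interaction Block (MIB) and a Seam-Aligned Fusion (SAF) module for efficient, reliability-aware cross-modal fusion under arbitrary modality combinations, markedly improving the performance and generalization of multimodal semantic segmentation.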

Details Motivation: Existing methods rely on hand-designed fusion strategies that are inflexible, coordinate modalities poorly, and struggle to balance efficient information exchange against preserving each modality's characteristics. Method: The CrossWeaver framework is built around a selective, reliability-aware Modality Interaction Block (MIB) for cross-modal interaction inside the encoder, complemented by a lightweight Seam-Aligned Fusion (SAF) module for feature aggregation. Result: State-of-the-art performance on multiple multimodal semantic segmentation benchmarks with very few additional parameters, and strong generalization to unseen modality combinations. Conclusion: CrossWeaver is a simple yet effective multimodal fusion framework for arbitrary modality combinations that addresses key limitations of existing methods in flexibility, coordination, and generalization. Abstract: Multimodal semantic segmentation has shown great potential in leveraging complementary information across diverse sensing modalities. However, existing approaches often rely on carefully designed fusion strategies that either use modality-specific adaptations or rely on loosely coupled interactions, thereby limiting flexibility and resulting in less effective cross-modal coordination. Moreover, these methods often struggle to balance efficient information exchange with preserving the unique characteristics of each modality across different modality combinations. To address these challenges, we propose CrossWeaver, a simple yet effective multimodal fusion framework for arbitrary-modality semantic segmentation. Its core is a Modality Interaction Block (MIB), which enables selective and reliability-aware cross-modal interaction within the encoder, while a lightweight Seam-Aligned Fusion (SAF) module further aggregates the enhanced features. Extensive experiments on multiple multimodal semantic segmentation benchmarks demonstrate that our framework achieves state-of-the-art performance with minimal additional parameters and strong generalization to unseen modality combinations.

[127] Collaborative Multi-Mode Pruning for Vision-Language Models

Zimeng Wu,Yunhong Wang,Donghao Wang,Jiaxin Chen

Main category: cs.CV

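TL;DR: This paper proposes CoMP, a collaborative multi-mode pruning framework for vision-language models (VLMs) that prunes parameters and tokens jointly. It designs a Collaborative Importance Metric (CIM) and a Multi-Mode Pruning Strategy (MPS) and substantially reduces performance degradation at high pruning ratios.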

Details Motivation: Existing VLM pruning methods are mostly limited to a single mode (parameters only or tokens only) and do not fully exploit the redundancy in each mode, so performance drops sharply at high pruning ratios. Method: The Collaborative Multi-Mode Pruning (CoMP) framework (1) designs a Collaborative Importance Metric (CIM) that models the mutual influence between parameters and tokens, and (2) proposes a Multi-Mode Pruning Strategy (MPS) that selects the best pruning mode dynamically at each stage and combines historical cost with random exploration to stabilize convergence. Result: Across a range of vision-language tasks and models, CoMP significantly outperforms existing state-of-the-art pruning methods at high pruning ratios. Conclusion: Jointly optimizing parameter and token pruning while modeling their collaborative importance is an effective way to balance compression efficiency and accuracy in VLMs; CoMP offers a new approach to deploying VLMs in resource-constrained settings. Abstract: Vision-Language Models (VLMs) have advanced rapidly within the unified Transformer architecture, yet their deployment on resource-constrained devices remains challenging due to high computational complexity. While pruning has emerged as an effective technique for compressing VLMs, existing approaches predominantly focus on a single mode by pruning either parameters or tokens, without fully exploring the inherent redundancy in each mode, which leads to substantial performance degradation at high pruning ratios. To address the above limitations, we propose Collaborative Multi-Mode Pruning (CoMP), a novel framework tailored for VLMs by performing joint parameter and token pruning. Specifically, we first design a Collaborative Importance Metric (CIM) that investigates the mutual interference between the coupled parameters and tokens. It incorporates distinct significance of tokens into the computation of parameter importance scores, while simultaneously mitigating the effect of pruned parameters on token importance scores. Moreover, we develop a Multi-Mode Pruning Strategy (MPS) that decomposes the overall pruning process into a sequence of pruning stages, while in each stage we estimate the priority of different pruning modes based on their pruning cost and adaptively shift to the optimal one. Additionally, MPS integrates the historical cost and random exploration, in order to achieve a stable pruning process and avoid local optimum.
Extensive experiments across various vision-language tasks and models demonstrate that our method effectively improves performance under high pruning ratios compared with state-of-the-art approaches. The source code is available at https://github.com/Wuzimeng/CoMP.git.

[128] Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection

Wenhao Li,Zimeng Wu,Yu Wu,Zehua Fu,Jiaxin Chen

Main category: cs.CV

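TL;DR: This paper proposes UAVGen, which uses a Visual Prototype Conditioned Diffusion Model (VPC-DM) and a Focal Region Enhanced Data Pipeline (FRE-DP) to improve the quality of generated data and the detection accuracy for small objects in UAV scenes.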

Details Motivation: Object detection in UAV imagery faces dynamically changing scenes and scarce annotated data; existing layout-to-image generation methods tend to produce artifacts near the boundaries of small objects, which limits performance. Method: UAVGen consists of (1) a Visual Prototype Conditioned Diffusion Model (VPC-DM) that uses class-representative instances to enrich latent embeddings for high-fidelity generation, and (2) a Focal Region Enhanced Data Pipeline (FRE-DP) that emphasizes foreground object regions and applies label refinement to correct generation errors. Result: UAVGen significantly outperforms state-of-the-art methods on multiple UAV detection benchmarks and consistently improves accuracy across different detectors. Conclusion: UAVGen mitigates small-object generation artifacts and label misalignment and provides a new way to produce high-quality synthetic data for low-resource UAV detection. Abstract: Unmanned aerial vehicle (UAV) based object detection is a critical but challenging task when applied in dynamically changing scenarios with limited annotated training data. Layout-to-image generation approaches have proved effective in promoting detection accuracy by synthesizing labeled images based on diffusion models. However, they suffer from frequently producing artifacts, especially near layout boundaries of tiny objects, thus substantially limiting their performance. To address these issues, we propose UAVGen, a novel layout-to-image generation framework tailored for UAV-based object detection. Specifically, UAVGen designs a Visual Prototype Conditioned Diffusion Model (VPC-DM) that constructs representative instances for each class and integrates them into latent embeddings for high-fidelity object generation. Moreover, a Focal Region Enhanced Data Pipeline (FRE-DP) is introduced to emphasize object-concentrated foreground regions in synthesis, combined with a label refinement to correct missing, extra and misaligned generations. Extensive experimental results demonstrate that our method significantly outperforms state-of-the-art approaches, and consistently promotes accuracy when integrated with distinct detectors. The source code is available at https://github.com/Sirius-Li/UAVGen.

[129] Exploring Motion-Language Alignment for Text-driven Motion Generation

Ruxi Gu,Zilei Wang,Wei Wang

Main category: cs.CV

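TL;DR: This paper proposes MLA-Gen, a framework that fuses global motion priors with fine-grained local text conditioning and addresses attention over-concentrating on the start text token (attention sink), markedly improving motion quality and text-motion alignment in text-to-human-motion generation.

Details Motivation: Existing text-driven human motion generation methods struggle to align motion dynamics with textual semantics, and attention over-concentrates on the start text token (attention sink), which degrades semantic grounding. Method: Proposes the MLA-Gen framework, which integrates global motion priors with local text conditioning; defines the SinkRatio metric to quantify attention concentration; and designs alignment-aware masking and control strategies to regulate the attention distribution during generation. Result: Significantly outperforms strong baselines on multiple benchmarks, improving both motion quality and text-motion alignment, and validates the attention sink phenomenon and the effectiveness of the mitigation strategies. Conclusion: Reframing text-to-motion generation from the perspective of motion-language alignment is effective; explicitly modeling and regulating the attention distribution strengthens semantic grounding; MLA-Gen offers a new paradigm for high-quality, well-aligned motion generation. Abstract: Text-driven human motion generation aims to synthesize realistic motion sequences that follow textual descriptions. Despite recent advances, accurately aligning motion dynamics with textual semantics remains a fundamental challenge. In this paper, we revisit text-to-motion generation from the perspective of motion-language alignment and propose MLA-Gen, a framework that integrates global motion priors with fine-grained local conditioning. This design enables the model to capture common motion patterns, while establishing detailed alignment between texts and motions. Furthermore, we identify a previously overlooked attention sink phenomenon in human motion generation, where attention disproportionately concentrates on the start text token, limiting the utilization of informative textual cues and leading to degraded semantic grounding. To analyze this issue, we introduce SinkRatio, a metric for measuring attention concentration, and develop alignment-aware masking and control strategies to regulate attention during generation. Extensive experiments demonstrate that our approach consistently improves both motion quality and motion-language alignment over strong baselines. Code will be released upon acceptance.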

[130] Effect of Input Resolution on Retinal Vessel Segmentation Performance: An Empirical Study Across Five Datasets

Amarnath R

Main category: cs.CV

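TL;DR: This paper studies how image resizing affects thin-vessel detection in retinal vessel segmentation, finds that the standard Dice metric does not reflect the loss of thin-vessel information, proposes a width-stratified sensitivity metric, and shows that datasets of different resolutions respond differently to resizing.

Details Motivation: Deep learning pipelines commonly resize fundus images to meet GPU memory and batching requirements, but resizing reduces thin vessels to subpixel structures and causes irreversible information loss, to which common metrics such as Dice are insensitive. Method: On five fundus datasets (DRIVE, STARE, CHASE_DB1, HRF, FIVES), a UNet is trained at different downsampling ratios with all other settings fixed; a width-stratified sensitivity metric is proposed that estimates vessel width with a Euclidean distance transform and evaluates thin (half-width <3 pixels), medium (3–7 pixels), and thick (>7 pixels) vessels separately. Result: For the high-resolution datasets (HRF, FIVES), thin-vessel sensitivity first rises and then falls as images are downsampled, peaking at processed widths of 256–876 pixels; for the low-to-mid resolution datasets (DRIVE, STARE, CHASE_DB1), sensitivity is highest at native resolution and drops with any downsampling; aggressive downsampling lowers thin-vessel sensitivity by up to 15.8 percentage points (DRIVE) while the Dice score changes little. Conclusion: Image resizing has a significant, dataset-dependent effect on thin-vessel segmentation, and global metrics such as Dice alone mask critical microvascular degradation; width-stratified evaluation should guide the choice of preprocessing strategy. Abstract: Most deep learning pipelines for retinal vessel segmentation resize fundus images to satisfy GPU memory constraints and enable uniform batch processing. However, the impact of this resizing on thin vessel detection remains underexplored. When high resolution images are downsampled, thin vessels are reduced to subpixel structures, causing irreversible information loss even before the data enters the network. Standard volumetric metrics such as the Dice score do not capture this loss because thick vessel pixels dominate the evaluation. We investigated this effect by training a baseline UNet at multiple downsampling ratios across five fundus datasets (DRIVE, STARE, CHASE_DB1, HRF, and FIVES) with native widths ranging from 565 to 3504 pixels, keeping all other settings fixed. We introduce a width-stratified sensitivity metric that evaluates thin (half-width <3 pixels), medium (3 to 7 pixels), and thick (>7 pixels) vessel detection separately, using native resolution width estimates derived from a Euclidean distance transform. Results show that for high-resolution datasets (HRF, FIVES), thin vessel sensitivity improves monotonically as images are downsampled toward the encoder's effective operating range, peaking at processed widths between 256 and 876 pixels. For low-to-mid resolution datasets (DRIVE, STARE, CHASE_DB1), thin vessel sensitivity is highest at or near native resolution and degrades with any downsampling. 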
Across all five datasets, aggressive downsampling reduced thin vessel sensitivity by up to 15.8 percentage points (DRIVE) while Dice remained relatively stable, confirming that Dice alone is insufficient for evaluating microvascular segmentation.
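The width-stratified sensitivity described in this abstract can be illustrated with a minimal pure-Python sketch. It is not the author's implementation: the brute-force distance transform, the "local thickness" rule (each vessel pixel takes the radius of the largest inscribed disk that covers it), and all function names are assumptions made for illustration. Only the thin/medium/thick thresholds (<3, 3 to 7, >7 pixels of half-width) come from the abstract.

```python
import math

def edt(mask):
    """Brute-force Euclidean distance from each vessel pixel to the nearest background pixel."""
    h, w = len(mask), len(mask[0])
    background = [(r, c) for r in range(h) for c in range(w) if not mask[r][c]]
    return {
        (r, c): min(math.hypot(r - br, c - bc) for br, bc in background)
        for r in range(h) for c in range(w) if mask[r][c]
    }

def local_half_width(dist):
    """Assumed width rule: radius of the largest inscribed disk covering each pixel."""
    return {
        p: max(d for q, d in dist.items()
               if math.hypot(p[0] - q[0], p[1] - q[1]) < d)
        for p in dist
    }

def stratified_sensitivity(gt, pred, thin_max=3.0, thick_min=7.0):
    """Sensitivity (recall) computed separately for thin, medium and thick vessel pixels."""
    widths = local_half_width(edt(gt))
    hits = {"thin": 0, "medium": 0, "thick": 0}
    totals = dict(hits)
    for (r, c), hw in widths.items():
        band = "thin" if hw < thin_max else "thick" if hw > thick_min else "medium"
        totals[band] += 1
        hits[band] += int(bool(pred[r][c]))
    # None marks a band with no ground-truth pixels.
    return {b: hits[b] / totals[b] if totals[b] else None for b in totals}

# Toy ground truth: one thick bar (rows 2..16) and one 1-pixel-wide vessel (row 20).
H, W = 24, 30
gt = [[1 if 2 <= r <= 16 or r == 20 else 0 for c in range(W)] for r in range(H)]
# Toy prediction: the bar is fully detected, half of the thin vessel is missed.
pred = [[1 if 2 <= r <= 16 or (r == 20 and c < 15) else 0 for c in range(W)]
        for r in range(H)]

scores = stratified_sensitivity(gt, pred)
overall = sum(p for gr, pr in zip(gt, pred) for g, p in zip(gr, pr) if g) / sum(map(sum, gt))
print(scores, overall)
```

In this toy case the pooled sensitivity stays high because thick-vessel pixels dominate the count, while the thin band exposes the missed vessel, which is the effect the abstract attributes to Dice-style global metrics.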

[131] Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation

Hanshuai Cui,Zhiqing Tang,Zhi Yao,Fanshuai Meng,Weijia Jia,Wei Zhao

Main category: cs.CV

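TL;DR: This paper proposes SCOPE, a framework that improves the inference efficiency of autoregressive video diffusion models through tri-modal scheduling (cache, predict, recompute) and selective computation, achieving up to 4.73x speedup while preserving generation quality.

Details Motivation: Autoregressive video diffusion models are expensive at inference, and existing training-free acceleration methods use only a binary cache-or-recompute strategy, ignoring intermediate cases and not adapting to the differing per-frame noise levels produced by asynchronous noise schedules. Method: SCOPE combines (1) a prediction mode based on noise-level Taylor extrapolation that fills the gap between caching and recomputation, with stability backed by error propagation analysis; (2) selective computation that executes only on the active frame interval; and (3) joint scheduling over the three modes (cache, predict, recompute). Result: On MAGI-1 and SkyReels-V2, SCOPE achieves up to 4.73x inference speedup with generation quality comparable to the original model, outperforming all training-free baselines. Conclusion: SCOPE addresses the inference inefficiency that asynchronous noise schedules and coarse-grained decisions cause in autoregressive video diffusion models, offering a finer-grained and more robust paradigm for training-free acceleration. Abstract: Autoregressive (AR) video diffusion models enable long-form video generation but remain expensive due to repeated multi-step denoising. Existing training-free acceleration methods rely on binary cache-or-recompute decisions, overlooking intermediate cases where direct reuse is too coarse yet full recomputation is unnecessary. Moreover, asynchronous AR schedules assign different noise levels to co-generated frames, yet existing methods process the entire valid interval uniformly. To address these AR-specific inefficiencies, we present SCOPE, a training-free framework for efficient AR video diffusion. SCOPE introduces a tri-modal scheduler over cache, predict, and recompute, where prediction via noise-level Taylor extrapolation fills the gap between reuse and recomputation with explicit stability controls backed by error propagation analysis. It further introduces selective computation that restricts execution to the active frame interval. On MAGI-1 and SkyReels-V2, SCOPE achieves up to 4.73x speedup while maintaining quality comparable to the original output, outperforming all training-free baselines.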
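To make the cache / predict / recompute idea concrete, here is a small hypothetical sketch. The abstract does not specify the decision rule, the extrapolation order, or any thresholds, so the first-order extrapolation, the `change` score, and the tolerances `reuse_tol` and `predict_tol` below are all illustrative assumptions rather than SCOPE's actual scheduler.

```python
def taylor_predict(history, sigma_next):
    """First-order extrapolation over the noise level.

    history: list of (noise_level, feature_vector) pairs from earlier full computations.
    The slope is a finite difference between the two most recent entries.
    """
    (s0, f0), (s1, f1) = history[-2], history[-1]
    slope = [(b - a) / (s1 - s0) for a, b in zip(f0, f1)]
    return [b + g * (sigma_next - s1) for b, g in zip(f1, slope)]

def choose_mode(change, reuse_tol=0.01, predict_tol=0.1):
    """Hypothetical three-way decision based on an estimated feature change."""
    if change < reuse_tol:
        return "cache"       # features barely move: reuse the cached output
    if change < predict_tol:
        return "predict"     # moderate drift: extrapolate instead of recomputing
    return "recompute"       # large drift: run the full network

# Toy history in which features vary linearly with the noise level.
history = [(1.0, [2.0, 4.0]), (0.8, [1.6, 3.2])]
predicted = taylor_predict(history, 0.6)
modes = [choose_mode(c) for c in (0.001, 0.05, 0.5)]
print(predicted, modes)
```

A first-order rule is exact only when features change linearly with the noise level; the stability controls mentioned in the abstract would presumably bound how long such predictions are trusted before a recomputation is forced.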

[132] Rendering Multi-Human and Multi-Object with 3D Gaussian Splatting

Weiquan Wang,Jun Xiao,Feifei Shao,Yi Yang,Yueting Zhuang,Long Chen

Main category: cs.CV

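TL;DR: This paper proposes MM-GS, a framework built on 3D Gaussian Splatting that uses instance-level multi-view fusion and scene-level instance interaction modeling to reconstruct dynamic multi-human, multi-object scenes from sparse views, markedly improving view consistency and interaction modeling.

Details Motivation: Reconstructing dynamic multi-human, multi-object interaction scenes from sparse views is essential for digital twins, robotics, and VR/AR, but faces two major challenges: view-consistent modeling under severe mutual occlusion, and modeling complex combinatorial interaction dependencies. Method: Proposes the hierarchical MM-GS framework: (1) a Per-Instance Multi-View Fusion module fuses visual information across views for each instance to build a robust, consistent representation; (2) a Scene-Level Instance Interaction module reasons over a global scene graph about relationships among all participants and refines their attributes to capture interaction effects. Result: Significantly outperforms strong baselines on several challenging datasets, producing state-of-the-art results with high-fidelity details and plausible inter-instance contacts. Conclusion: MM-GS addresses the view-consistency and interaction-modeling challenges of MHMO rendering and offers a new paradigm for reconstructing complex dynamic scenes. Abstract: Reconstructing dynamic scenes with multiple interacting humans and objects from sparse-view inputs is a critical yet challenging task, essential for creating high-fidelity digital twins for robotics and VR/AR. This problem, which we term Multi-Human Multi-Object (MHMO) rendering, presents two significant obstacles: achieving view-consistent representations for individual instances under severe mutual occlusion, and explicitly modeling the complex and combinatorial dependencies that arise from their interactions. To overcome these challenges, we propose MM-GS, a novel hierarchical framework built upon 3D Gaussian Splatting. Our method first employs a Per-Instance Multi-View Fusion module to establish a robust and consistent representation for each instance by aggregating visual information across all available views. Subsequently, a Scene-Level Instance Interaction module operates on a global scene graph to reason about relationships between all participants, refining their attributes to capture subtle interaction effects. Extensive experiments on challenging datasets demonstrate that our method significantly outperforms strong baselines, producing state-of-the-art results with high-fidelity details and plausible inter-instance contacts.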

[133] Explicit Time-Frequency Dynamics for Skeleton-Based Gait Recognition

Seoyeon Ko,Yeojin Song,Egene Chung,Luca Quagliato,Taeyong Lee,Junhyug Noh

Main category: cs.CV

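TL;DR: This paper proposes a plug-and-play Wavelet Feature Stream that extracts time-frequency dynamics of joint velocities with the continuous wavelet transform (CWT) and fuses them into existing skeleton backbones, markedly improving the robustness of gait recognition under appearance changes (such as carrying a bag or wearing a coat) without modifying the backbone or adding supervision.

Details Motivation: Skeleton-based gait recognizers are good at modeling spatial structure but often neglect explicit motion dynamics, which are crucial for robustness to appearance changes. Method: The continuous wavelet transform (CWT) is applied to each joint's velocity sequence to produce multi-scale scalograms, from which a lightweight multi-scale CNN extracts discriminative dynamic features; these features are fused with the backbone output for classification. Result: On CASIA-B, the method yields consistent gains on several strong backbones (GaitMixer, GaitFormer, GaitGraph), with especially large gains under covariate shifts such as carrying bags (BG) and wearing coats (CL); combined with GaitMixer it sets a new state of the art for skeleton-based gait recognition. Conclusion: Explicit time-frequency modeling is strongly complementary to standard spatio-temporal encoders, and the proposed Wavelet Feature Stream is an efficient, general, plug-and-play enhancement module. Abstract: Skeleton-based gait recognizers excel at modeling spatial configurations but often underuse explicit motion dynamics that are crucial under appearance changes. We introduce a plug-and-play Wavelet Feature Stream that augments any skeleton backbone with time-frequency dynamics of joint velocities. Concretely, per-joint velocity sequences are transformed by the continuous wavelet transform (CWT) into multi-scale scalograms, from which a lightweight multi-scale CNN learns discriminative dynamic cues. The resulting descriptor is fused with the backbone representation for classification, requiring no changes to the backbone architecture or additional supervision. Across CASIA-B, the proposed stream delivers consistent gains on strong skeleton backbones (e.g., GaitMixer, GaitFormer, GaitGraph) and establishes a new skeleton-based state of the art when attached to GaitMixer. The improvements are especially pronounced under covariate shifts such as carrying bags (BG) and wearing coats (CL), highlighting the complementarity of explicit time-frequency modeling and standard spatio-temporal encoders.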
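The step from a joint trajectory to a scalogram can be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: the abstract does not name the mother wavelet or the scales, so the Ricker (Mexican-hat) wavelet, the scale set, and the finite-difference velocity are hypothetical choices, and the CNN and fusion stages are omitted.

```python
import math

def velocity(positions):
    """First-order finite difference of one joint coordinate over time."""
    return [b - a for a, b in zip(positions, positions[1:])]

def ricker(t, scale):
    """Ricker (Mexican-hat) wavelet; an assumed choice of mother wavelet."""
    x = t / scale
    return (1.0 - x * x) * math.exp(-0.5 * x * x) / math.sqrt(scale)

def cwt(signal, scales):
    """Direct-sum continuous wavelet transform.

    Returns one row of coefficients per scale, i.e. a scalogram of shape
    len(scales) x len(signal).
    """
    n = len(signal)
    return [
        [sum(signal[k] * ricker(k - shift, s) for k in range(n)) for shift in range(n)]
        for s in scales
    ]

# Toy joint coordinate: a periodic swing with a period of 16 frames.
positions = [math.sin(2.0 * math.pi * t / 16.0) for t in range(65)]
vel = velocity(positions)
scales = [1, 2, 4, 8]
scalogram = cwt(vel, scales)

# Energy per scale shows which temporal scale dominates the motion.
energy = [sum(c * c for c in row) for row in scalogram]
print(energy)
```

Stacking such scalograms over joints and coordinates would give the image-like input that a small CNN can consume; in practice a library routine would replace the quadratic-time direct sum used here.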

[134] GenSmoke-GS: A Multi-Stage Method for Novel View Synthesis from Smoke-Degraded Images Using a Generative Model

Qida Cao,Xinyuan Hu,Changyue Shi,Jiajun Ding,Zhou Yu,Jun Yu

Main category: cs.CV

TL;DR: This paper proposes a multi-stage method for restoring smoke-degraded images and reconstructing 3D scenes, which took first place in Track 2 of the NTIRE 2026 3DRR Challenge.

Details Motivation: Smoke reduces image visibility and weakens cross-view consistency, which harms 3D scene optimization and rendering. Method: A multi-stage pipeline of image restoration, dehazing, MLLM-based enhancement, 3DGS-MCMC optimization, and averaging over repeated runs, designed to improve visibility before rendering while preserving multi-view content consistency. Result: Better quantitative metrics and visual quality than the baselines on the challenge benchmark, and a final ranking of first among 14 teams. Conclusion: The multi-stage strategy mitigates smoke interference while balancing improved visibility with 3D consistency, and is practical for reconstructing real smoke-degraded scenes. Abstract: This paper describes our method for Track 2 of the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge on smoke-degraded images. In this task, smoke reduces image visibility and weakens the cross-view consistency required by scene optimization and rendering. We address this problem with a multi-stage pipeline consisting of image restoration, dehazing, MLLM-based enhancement, 3DGS-MCMC optimization, and averaging over repeated runs. The main purpose of the pipeline is to improve visibility before rendering while limiting scene-content changes across input views. Experimental results on the challenge benchmark show improved quantitative performance and better visual quality than the provided baselines. The code is available at https://github.com/plbbl/GenSmoke-GS. Our method achieved a ranking of 1 out of 14 participants in Track 2 of the NTIRE 3DRR Challenge, as reported on the official competition website: https://www.codabench.org/competitions/13993/#/results-tab.

[135] QVAD: A Question-Centric Agentic Framework for Efficient and Training-Free Video Anomaly Detection

Lokman Bekit,Hamza Karim,Nghia T Nguyen,Yasin Yilmaz

Main category: cs.CV

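TL;DR: QVAD is a question-driven video anomaly detection framework in which a dynamic dialogue between an LLM and a VLM iteratively refines visual queries, improving lightweight VLM performance without parameter updates, reaching state-of-the-art results on several benchmarks, and suiting edge deployment.

Details Motivation: Existing training-free VAD methods rely on large, heavy vision-language models to compensate for the ambiguity of static prompts, but the authors argue that the bottleneck is the static nature of the inquiry rather than model capacity itself. Method: Proposes the QVAD framework, which models VLM-LLM interaction as a dynamic question-answering process; an LLM agent iteratively refines queries based on visual context (prompt-updating), guiding a lightweight VLM to produce high-fidelity captions and precise semantic reasoning without parameter updates. Result: Achieves state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal with far fewer parameters than competing methods; generalizes well to the single-scene ComplexVAD dataset; and offers fast inference with a low memory footprint, suitable for edge devices. Conclusion: A dynamic query mechanism unlocks the potential of lightweight VLMs, showing that the key to better VAD lies in how questions are asked rather than in how large a model is used; QVAD offers a new paradigm for efficient, deployable video anomaly detection. Abstract: Video Anomaly Detection (VAD) is a fundamental challenge in computer vision, particularly due to the open-set nature of anomalies. While recent training-free approaches utilizing Vision-Language Models (VLMs) have shown promise, they typically rely on massive, resource-intensive foundation models to compensate for the ambiguity of static prompts. We argue that the bottleneck in VAD is not necessarily model capacity, but rather the static nature of inquiry. We propose QVAD, a question-centric agentic framework that treats VLM-LLM interaction as a dynamic dialogue. By iteratively refining queries based on visual context, our LLM agent guides smaller VLMs to produce high-fidelity captions and precise semantic reasoning without parameter updates. This "prompt-updating" mechanism effectively unlocks the latent capabilities of lightweight models, enabling state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal using a fraction of the parameters required by competing methods. We further demonstrate exceptional generalizability on the single-scene ComplexVAD dataset. Crucially, QVAD achieves high inference speeds with minimal memory footprints, making advanced VAD capabilities deployable on resource-constrained edge devices.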

[136] STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models

Linfeng Fan,Yuan Tian,Ziwei Li,Zhiwu Lu

Main category: cs.CV

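TL;DR: This paper proposes STEAR, a layer-aware spatiotemporal evidence intervention framework that mitigates spatiotemporal hallucinations in video large language models (Video-LLMs) by selecting visual evidence in middle decoder layers and using it for local grounding restoration and counterfactual verification, markedly improving the faithfulness, temporal consistency, and robustness of video understanding.

Details Motivation: Existing methods treat hallucination as a uniform decoding failure and apply global correction rules, whereas the authors find that different decoder layers contribute differently to visual grounding and linguistic composition, so intervention must be layer-aware. Method: Proposes the STEAR framework, which identifies high-risk decoding steps and selects token-conditioned visual evidence from the grounding-sensitive middle layers for two purposes: restoring missing local grounding in the middle layers, and constructing temporally perturbed patch-level counterfactuals to falsify inconsistent reasoning during late-layer decoding. Result: Across several mainstream Video-LLM backbones and challenging benchmarks, STEAR consistently reduces spatial and temporal hallucinations and improves faithfulness, temporal consistency, and robustness. Conclusion: Reliable video decoding depends on intervening on precise evidence at the right layer rather than imposing a global penalty. Abstract: Video Large Language Models (Video-LLMs) remain prone to spatiotemporal hallucinations, often generating visually unsupported details or incorrect temporal relations. Existing mitigation methods typically treat hallucination as a uniform decoding failure, applying globally shared correction rules. We instead observe that decoder layers contribute differently to visual grounding and later linguistic composition, indicating that intervention must be layer-aware. Based on this insight, we propose STEAR, a layer-aware spatiotemporal evidence intervention framework. STEAR identifies high-risk decoding steps and selects token-conditioned visual evidence from grounding-sensitive middle layers. It uses this shared evidence for two coupled purposes: restoring missing local grounding in middle layers, and constructing temporally perturbed patch-level counterfactuals to falsify inconsistent reasoning during late-layer decoding. Consequently, STEAR mitigates both spatial and temporal hallucinations within an efficient single-encode inference framework. Experiments across representative Video-LLM backbones and challenging benchmarks demonstrate that STEAR consistently reduces hallucinations while improving faithfulness, temporal consistency, and robustness. Our results confirm that reliable video decoding relies on intervening on precise evidence at the right layer, rather than enforcing a global penalty. The code is provided in the Supplementary Material.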

[137] Can Nano Banana 2 Replace Traditional Image Restoration Models? An Evaluation of Its Performance on Image Restoration Tasks

Weixiong Sun,Xiang Yin,Chao Dong

Main category: cs.CV

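TL;DR: This paper systematically evaluates the general-purpose generative model Nano Banana 2 on image restoration tasks, finding that it outperforms existing methods on fidelity metrics and is competitive in perceptual quality, but is sensitive to prompt design and needs iterative refinement.

Details Motivation: To investigate whether general-purpose generative AI models (such as Nano Banana 2) can serve as a unified image restoration solution. Method: A systematic image restoration evaluation of Nano Banana 2 across diverse scenes and degradation types, focusing on how prompt design affects the trade-off between reconstruction accuracy and perceptual quality, with comparisons against current state-of-the-art restoration models and supporting user studies. Result: Nano Banana 2 outperforms state-of-the-art models on full-reference metrics while remaining competitive in perceptual quality; it generalizes well to challenging scenarios such as small faces, dense crowds, and severe degradations; but it is sensitive to prompts and often requires iterative adjustment. Conclusion: General-purpose generative models have strong potential as unified image restoration tools, but their controllability and robustness need to improve. Abstract: Recent advances in generative AI raise the question of whether general-purpose image editing models can serve as unified solutions for image restoration. In this work, we conduct a systematic evaluation of Nano Banana 2 for image restoration across diverse scenes and degradation types. Our results show that prompt design plays a critical role, where concise prompts with explicit fidelity constraints achieve the best trade-off between reconstruction accuracy and perceptual quality. Compared with state-of-the-art restoration models, Nano Banana 2 achieves superior performance in full-reference metrics while remaining competitive in perceptual quality, which is further supported by user studies. We also observe strong generalization in challenging scenarios, such as small faces, dense crowds, and severe degradations. However, the model remains sensitive to prompt formulation and may require iterative refinement for optimal results. Overall, our findings suggest that general-purpose generative models hold strong potential as unified image restoration solvers, while highlighting the importance of controllability and robustness. All test results are available at https://github.com/yxyuanxiao/NanoBanana2TestOnIR.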

[138] Gram-MMD: A Texture-Aware Metric for Image Realism Assessment

Joé Napolitano,Pascal Nguyen

Main category: cs.CV

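TL;DR: This paper proposes Gram-MMD (GMMD), a new image realism metric that computes MMD over the upper-triangular part of Gram matrices of intermediate features from pretrained networks to capture texture and structure at a finer granularity, compensating for the limits of semantic-level metrics such as FID and CMMD; hyperparameters are tuned with a meta-metric protocol, and the metric's effectiveness and complementarity are validated on several datasets and a cross-domain scenario.

Details Motivation: Existing distributional metrics (such as FID and CMMD) compare feature distributions only at a semantic level and tend to overlook the fine-grained textural information that matters for telling real from generated images. Method: Proposes Gram-MMD (GMMD): Gram matrices are computed from intermediate activations of a pretrained backbone, their upper-triangular part is taken as the feature representation, and the Maximum Mean Discrepancy (MMD) measures the distance between an anchor distribution of real images and the distribution under evaluation; hyperparameters are selected with a meta-metric protocol based on controlled image degradations (Spearman/Kendall monotonicity). Result: GMMD performs well on KADID-10k, RAISE, and a cross-domain driving scenario (KITTI / Virtual KITTI / Stanford Cars); in particular, where CMMD ranks incorrectly because of its semantic bias (for example, rating real images as less realistic than synthetic ones), GMMD preserves the correct ordering. Conclusion: GMMD captures textural and structural details that existing semantic-level metrics miss, provides complementary realism information, and is a more robust, fine-grained metric for assessing the realism of generated images. Abstract: Evaluating the realism of generated images remains a fundamental challenge in generative modeling. Existing distributional metrics such as the Frechet Inception Distance (FID) and CLIP-MMD (CMMD) compare feature distributions at a semantic level but may overlook fine-grained textural information that can be relevant for distinguishing real from generated images. We introduce Gram-MMD (GMMD), a realism metric that leverages Gram matrices computed from intermediate activations of pretrained backbone networks to capture correlations between feature maps. By extracting the upper-triangular part of these symmetric Gram matrices and measuring the Maximum Mean Discrepancy (MMD) between an anchor distribution of real images and an evaluation distribution, GMMD produces a representation that encodes textural and structural characteristics at a finer granularity than global embeddings. To select the hyperparameters of the metric, we employ a meta-metric protocol based on controlled degradations applied to MS-COCO images, measuring monotonicity via Spearman's rank correlation and Kendall's tau. We conduct experiments on both the KADID-10k database and the RAISE realness assessment dataset using various backbone architectures, including DINOv2, DC-AE, Stable Diffusion's VAE encoder, VGG19, and the AlexNet backbone from LPIPS, among others. 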
We also demonstrate on a cross-domain driving scenario (KITTI / Virtual KITTI / Stanford Cars) that CMMD can incorrectly rank real images as less realistic than synthetic ones due to its semantic bias, while GMMD preserves the correct ordering. Our results suggest that GMMD captures complementary information to existing semantic-level metrics.
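The core computation described above (Gram matrix of feature maps, upper-triangular vectorization, then MMD between two sets of images) can be sketched in a few lines. This is a toy illustration, not the authors' code: the Gaussian (RBF) kernel, its bandwidth `sigma`, the division of the Gram matrix by the number of spatial positions, and the biased MMD estimator are all assumptions, and hand-written lists stand in for backbone activations.

```python
import math

def gram_upper(feature_map):
    """Upper-triangular part (diagonal included) of the C x C Gram matrix.

    feature_map: list of C channels, each a flat list of H*W activations.
    Dividing by the number of positions is an assumed normalization.
    """
    c, n = len(feature_map), len(feature_map[0])
    return [
        sum(a * b for a, b in zip(feature_map[i], feature_map[j])) / n
        for i in range(c) for j in range(i, c)
    ]

def rbf(x, y, sigma):
    """Gaussian kernel between two vectors (assumed kernel choice)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of the squared MMD between two sets of vectors."""
    def mean_k(a, b):
        return sum(rbf(u, v, sigma) for u in a for v in b) / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)

def gmmd(anchor_maps, eval_maps, sigma=1.0):
    """Squared MMD between Gram descriptors of anchor and evaluated images."""
    anchor = [gram_upper(f) for f in anchor_maps]
    evaluated = [gram_upper(f) for f in eval_maps]
    return mmd2(anchor, evaluated, sigma)

# Toy "activations": two channels over four positions per image.
textured = [[[1, 0, 1, 0], [0, 1, 0, 1]], [[1, 0, 0, 1], [0, 1, 1, 0]]]
smooth = [[[1, 1, 1, 1], [1, 1, 1, 1]], [[1, 1, 1, 1], [1, 1, 1, 1]]]

same = gmmd(textured, textured)
different = gmmd(textured, smooth)
print(same, different)
```

Because the Gram matrix sums over spatial positions, the descriptor records which channels co-activate rather than where, which is why it behaves as a texture statistic and can differ between sets whose semantic content is similar.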

[139] SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction

Zicheng Zhang,Xiangting Meng,Ke Wu,Wenchao Ding

Main category: cs.CV

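TL;DR: This paper proposes SparseSplat, the first feed-forward 3D Gaussian Splatting model that adaptively adjusts Gaussian density, using entropy-based probabilistic sampling and a specialized point cloud network to produce compact 3DGS maps that keep rendering quality high with far fewer Gaussians.

Details Motivation: Existing feed-forward 3D Gaussian Splatting methods produce spatially uniform, highly redundant 3DGS maps, which limits their integration into downstream reconstruction tasks. Method: Proposes an entropy-based probabilistic sampling strategy that adaptively generates large, sparse Gaussians or small, dense Gaussians according to scene structure and the information richness of local regions; and designs a specialized point cloud network that efficiently encodes local context and decodes it into 3DGS attributes, addressing the receptive field mismatch. Result: SparseSplat reaches state-of-the-art rendering quality with only 22% of the Gaussians and maintains reasonable rendering quality with only 1.5% of the Gaussians. Conclusion: SparseSplat substantially improves the compactness and practicality of feed-forward 3DGS, providing a more efficient representation for downstream 3D reconstruction tasks. Abstract: Recent progress in feed-forward 3D Gaussian Splatting (3DGS) has notably improved rendering quality. However, the spatially uniform and highly redundant 3DGS map generated by previous feed-forward 3DGS methods limits their integration into downstream reconstruction tasks. We propose SparseSplat, the first feed-forward 3DGS model that adaptively adjusts Gaussian density according to scene structure and information richness of local regions, yielding highly compact 3DGS maps. To achieve this, we propose entropy-based probabilistic sampling, generating large, sparse Gaussians in textureless areas and assigning small, dense Gaussians to regions with rich information. Additionally, we designed a specialized point cloud network that efficiently encodes local context and decodes it into 3DGS attributes, addressing the receptive field mismatch between the general 3DGS optimization pipeline and feed-forward models. Extensive experimental results demonstrate that SparseSplat can achieve state-of-the-art rendering quality with only 22% of the Gaussians and maintain reasonable rendering quality with only 1.5% of the Gaussians. Project page: https://victkk.github.io/SparseSplat-page/.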

[140] MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs

Jiameng Li,Aleksei Tiulpin,Matthew B. Blaschko

Main category: cs.CV

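TL;DR: This paper proposes MI-Pruner, which guides visual token pruning by directly computing the mutual information (MI) between visual and textual features, requires no access to attention maps or changes to the model architecture, and outperforms existing attention-based pruning methods while remaining efficient.

Details Motivation: Visual information in MLLMs is relatively sparse, so efficient visual pruning is needed; current methods rely on attention scores, are tied to a specific mechanism, and lack explicit modeling of crossmodal dependency. Method: Proposes MI-Pruner, which computes the mutual information between visual and textual features before they interact and uses it to score and retain high-MI visual tokens, giving non-intrusive, lightweight pruning. Result: MI-Pruner outperforms attention-based pruning methods on multiple benchmarks while adding minimal latency, confirming its effectiveness and efficiency. Conclusion: Mutual-information-based visual token pruning is a more fundamental and general measure of crossmodal importance and offers a new direction for efficient MLLM inference. Abstract: For multimodal large language models (MLLMs), visual information is relatively sparse compared with text. As a result, research on visual pruning has emerged for efficient inference. Current approaches typically measure token importance based on the attention scores in the visual encoder or in the LLM decoder, then select visual tokens with high attention scores while pruning others. In this paper, we pursue a different and more surgical approach. Instead of relying on mechanism-specific signals, we directly compute Mutual Information (MI) between visual and textual features themselves, prior to their interaction. This allows us to explicitly measure crossmodal dependency at the feature level. Our MI-Pruner is simple, efficient and non-intrusive, requiring no access to internal attention maps or architectural modifications. Experimental results demonstrate that our approach outperforms previous attention-based pruning methods with minimal latency.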

[141] A Data-Centric Vision Transformer Baseline for SAR Sea Ice Classification

David Mike-Ewewie,Panhapiseth Lim,Priyanka Kumar

Main category: cs.CV

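TL;DR: This paper presents a Vision Transformer baseline for sea ice classification from SAR imagery that uses focal loss to improve recognition of rare ice classes (such as multi-year ice), reaching 69.6% accuracy and 83.9% multi-year ice precision on the AI4Arctic/ASIP dataset and providing a reliable reference for multimodal fusion.

Details Motivation: Sea ice classification is critical for climate monitoring and Arctic navigation safety; SAR-based classification faces morphologically similar ice classes and severe class imbalance, so a trustworthy, reproducible single-modality (SAR-only) baseline is needed to support later multimodal fusion research. Method: Using the AI4Arctic/ASIP v2 dataset (461 Sentinel-1 scenes with expert ice charts), the study combines full-resolution Extra Wide SAR inputs, leakage-aware stratified patch splitting, SIGRID-3 stage-of-development labels, and training-set normalization, and compares ViT-Base (cross-entropy and weighted cross-entropy) with ViT-Large (focal loss). Result: ViT-Large with focal loss achieves 69.6% accuracy, 68.8% weighted F1, and 83.9% precision on Multi-Year Ice (MYI) on the held-out set; focal loss offers a better precision-recall balance than weighted cross-entropy on rare ice classes. Conclusion: ViT-Large trained with focal loss is the strongest SAR-only sea ice classification baseline among the tested configurations, with its high precision most evident on the minority class, and it gives a clean, comparable, reliable starting point for future multimodal methods that fuse optical, thermal infrared, or meteorological data. Abstract: Accurate and automated sea ice classification is important for climate monitoring and maritime safety in the Arctic. While Synthetic Aperture Radar (SAR) is the operational standard because of its all-weather capability, it remains challenging to distinguish morphologically similar ice classes under severe class imbalance. Rather than claiming a fully validated multimodal system, this paper establishes a trustworthy SAR-only baseline that future fusion work can build upon. Using the AI4Arctic/ASIP Sea Ice Dataset (v2), which contains 461 Sentinel-1 scenes matched with expert ice charts, we combine full-resolution Sentinel-1 Extra Wide inputs, leakage-aware stratified patch splitting, SIGRID-3 stage-of-development labels, and training-set normalization to evaluate Vision Transformer baselines. We compare ViT-Base models trained with cross-entropy and weighted cross-entropy against a ViT-Large model trained with focal loss. Among the tested configurations, ViT-Large with focal loss achieves 69.6% held-out accuracy, 68.8% weighted F1, and 83.9% precision on the minority Multi-Year Ice class. These results show that focal-loss training offers a more useful precision-recall trade-off than weighted cross-entropy for rare ice classes and establishes a cleaner baseline for future multimodal fusion with optical, thermal, or meteorological data.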
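For readers unfamiliar with the loss compared here, the sketch below contrasts the standard multi-class focal loss, FL = -alpha_t (1 - p_t)^gamma log(p_t), with plain cross-entropy on a single sample. The value gamma = 2.0, the optional per-class weights `alpha`, and the toy logits are illustrative assumptions; the abstract does not report the hyperparameters used in the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Standard cross-entropy for one sample."""
    return -math.log(softmax(logits)[target])

def focal_loss(logits, target, gamma=2.0, alpha=None):
    """Focal loss for one sample: -alpha_t * (1 - p_t)**gamma * log(p_t).

    gamma = 0 and alpha = None recover plain cross-entropy.
    """
    p_t = softmax(logits)[target]
    weight = 1.0 if alpha is None else alpha[target]
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)

# Class 0 is the true label in both toy samples.
easy = [4.0, 0.0, 0.0]   # confident and correct
hard = [0.0, 1.0, 0.0]   # true class receives low probability

for name, logits in (("easy", easy), ("hard", hard)):
    ce = cross_entropy(logits, 0)
    fl = focal_loss(logits, 0)
    print(name, round(ce, 4), round(fl, 4), round(fl / ce, 4))
```

The ratio printed in the last column is the modulating factor (1 - p_t)^gamma: it shrinks the contribution of well-classified samples far more than that of hard ones, which is the property that makes the loss attractive when a few abundant classes would otherwise dominate training.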

[142] Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

Zhangyun Tan,Zeliang Zhang,Susan Liang,Yolo Yunlong Tang,Lisha Chen,Chenliang Xu

Main category: cs.CV

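TL;DR: This paper introduces VLM-UnBench, the first benchmark for training-free visual concept unlearning in vision-language models (VLMs), and shows that current prompt-based unlearning methods work only under idealized conditions and fall short of genuine visual concept erasure.

Details Motivation: Training-based unlearning methods share a structural flaw (fine-tuning degrades general capabilities before unlearning begins), while training-free methods lack a rigorous evaluation benchmark; a reliable benchmark is needed to separate genuine forgetting from instruction compliance. Method: Builds the VLM-UnBench benchmark, covering four forgetting levels, 7 source datasets, and 11 concept axes, with a three-level probe taxonomy and five evaluation conditions, and systematically evaluates 13 VLM configurations across 8 settings. Result: Realistic unlearning prompts barely reduce forget accuracy (it stays near the no-instruction baseline); meaningful drops appear only under oracle conditions that disclose the target concept; object and scene concepts are the hardest to suppress; and stronger instruction-tuned models remain capable. Conclusion: Current prompt-driven visual concept suppression is far from true erasure, exposing a fundamental gap between what the methods achieve and what deployment requires. Abstract: VLMs trained on web-scale data retain sensitive and copyrighted visual concepts that deployment may require removing. Training-based unlearning methods share a structural flaw: fine-tuning on a narrow forget set degrades general capabilities before unlearning begins, making it impossible to attribute subsequent performance drops to the unlearning procedure itself. Training-free approaches sidestep this by suppressing concepts through prompts or system instructions, but no rigorous benchmark exists for evaluating them on visual tasks. We introduce VLM-UnBench, the first benchmark for training-free visual concept unlearning in VLMs. It covers four forgetting levels, 7 source datasets, and 11 concept axes, and pairs a three-level probe taxonomy with five evaluation conditions to separate genuine forgetting from instruction compliance. Across 8 evaluation settings and 13 VLM configurations, realistic unlearning prompts leave forget accuracy near the no-instruction baseline; meaningful reductions appear only under oracle conditions that disclose the target concept to the model. Object and scene concepts are the most resistant to suppression, and stronger instruction-tuned models remain capable despite explicit forget instructions. These results expose a clear gap between prompt-level suppression and true visual concept erasure.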

[143] Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models

Chengyin Hu,Yuxian Dong,Yikun Guo,Xiang Chen,Junqi Wu,Jiahuan Long,Yiwei Wei,Tingsong Jiang,Wen Yao

Main category: cs.CV

TL;DR: This paper proposes UCGP, a universal physical adversarial patch framework for attacking infrared vision-language models (IR-VLMs), which uses curved-grid mesh parameterization and a unified representation-driven objective to disrupt cross-modal semantic alignment while remaining physically deployable.

Details Motivation: Existing adversarial patch methods for RGB images do not meet the open-ended semantic understanding and physical deployment requirements of infrared VLMs, whose robustness has not been studied systematically. Method: Proposes the UCGP framework, which uses Curved-Grid Mesh (CGM) parameterization to generate low-frequency, continuous, deployable patches, combines it with a representation-driven objective (subspace departure, topology disruption, stealth), and adds Meta Differential Evolution and EOT-augmented TPS deformation modeling to improve real-world robustness. Result: UCGP markedly degrades semantic understanding across diverse IR-VLM architectures and shows cross-model transferability, cross-dataset generalization, real-world physical effectiveness, and robustness against defenses. Conclusion: Current infrared multimodal systems have an overlooked robustness vulnerability, and UCGP reveals the security risks they face in open-world scenarios. Abstract: Infrared vision-language models (IR-VLMs) have emerged as a promising paradigm for multimodal perception in low-visibility environments, yet their robustness to adversarial attacks remains largely unexplored. Existing adversarial patch methods are mainly designed for RGB-based models in closed-set settings and are not readily applicable to the open-ended semantic understanding and physical deployment requirements of infrared VLMs. To bridge this gap, we propose Universal Curved-Grid Patch (UCGP), a universal physical adversarial patch framework for IR-VLMs. UCGP integrates Curved-Grid Mesh (CGM) parameterization for continuous, low-frequency, and deployable patch generation with a unified representation-driven objective that promotes subspace departure, topology disruption, and stealth. To improve robustness under real-world deployment and domain shift, we further incorporate Meta Differential Evolution and EOT-augmented TPS deformation modeling. Rather than manipulating labels or prompts, UCGP directly disrupts the visual representation space, weakening cross-modal semantic alignment. Extensive experiments demonstrate that UCGP consistently compromises semantic understanding across diverse IR-VLM architectures while maintaining cross-model transferability, cross-dataset generalization, real-world physical effectiveness, and robustness against defenses. These findings reveal a previously overlooked robustness vulnerability in current infrared multimodal systems.

[144] Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

Xingtong Ge,Yi Zhang,Yushi Huang,Dailan He,Xiahong Wang,Bingqi Ma,Guanglu Song,Yu Liu,Jun Zhang

Main category: cs.CV

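TL;DR: This paper proposes Salt, which uses Self-Consistent Distribution Matching Distillation (SC-DMD) and Cache-Distribution-Aware training to improve video generation quality at extremely low NFE (2–4 steps), balancing sharpness, motion coherence, and real-time operation.

Details Motivation: Existing distillation methods for video generation models struggle to deliver both sharp frames and realistic motion at extremely low inference budgets (2–4 NFEs): trajectory-style consistency distillation tends to produce over-smoothing and weak motion, while distribution matching distillation (DMD) does not explicitly model the compositional consistency of multi-step denoising updates and is prone to drift. Method: Proposes Self-Consistent Distribution Matching Distillation (SC-DMD), which explicitly constrains the endpoint consistency of consecutive denoising updates; for real-time autoregressive generation, treats the KV cache as a quality-related condition and designs cache-aware training, which applies SC-DMD over multi-step rollouts and adds a cache-conditioned feature alignment objective that steers low-quality outputs toward high-quality references. Result: Significantly improves low-NFE video generation quality on both non-autoregressive backbones (e.g., Wan 2.1) and autoregressive real-time paradigms (e.g., Self Forcing), while remaining compatible with various KV-cache mechanisms. Conclusion: SC-DMD and cache-aware training together address motion distortion and compositional drift in low-step video distillation, and Salt provides an effective, general distillation framework for real-time, high-quality video generation. Abstract: Distilling video generation models to extremely low inference budgets (e.g., 2–4 NFEs) is crucial for real-time deployment, yet remains challenging. Trajectory-style consistency distillation often becomes conservative under complex video dynamics, yielding an over-smoothed appearance and weak motion. Distribution matching distillation (DMD) can recover sharp, mode-seeking samples, but its local training signals do not explicitly regularize how denoising updates compose across timesteps, making composed rollouts prone to drift. To overcome this challenge, we propose Self-Consistent Distribution Matching Distillation (SC-DMD), which explicitly regularizes the endpoint-consistent composition of consecutive denoising updates. For real-time autoregressive video generation, we further treat the KV cache as a quality-parameterized condition and propose Cache-Distribution-Aware training. This training scheme applies SC-DMD over multi-step rollouts and introduces a cache-conditioned feature alignment objective that steers low-quality outputs toward high-quality references. Across extensive experiments on both non-autoregressive backbones (e.g., Wan 2.1) and autoregressive real-time paradigms (e.g., Self Forcing), our method, dubbed Salt, consistently improves low-NFE video generation quality while remaining compatible with diverse KV-cache memory mechanisms. Source code will be released at https://github.com/XingtongGe/Salt.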

[145] SCC-Loc: A Unified Semantic Cascade Consensus Framework for UAV Thermal Geo-Localization

Xiaoran Zhang,Yu Liu,Jinyu Liang,Kangqiushi Li,Zhiwei Huang,Huaxin Xiao

Main category: cs.CV

TL;DR: This paper proposes SCC-Loc, a framework that uses semantic guidance, cascaded filtering, and a consensus-driven strategy to resolve feature ambiguity in thermal-visible cross-modal geo-localization, achieving high-accuracy UAV localization in GNSS-denied environments.

Details Motivation: The thermal-visible modality gap causes feature ambiguity that severely disrupts conventional coarse-to-fine registration, and suitable public datasets are lacking. Method: Proposes the SCC-Loc framework: (1) Semantic-Guided Viewport Alignment (SGVA) optimizes satellite crop regions; (2) Cascaded Spatial-Adaptive Texture-Structure Filtering (C-SATSF) strengthens geometric consistency; (3) Consensus-Driven Reliability-Aware Position Selection (CD-RAPS) is combined with physically constrained optimization; a shared DINOv2 backbone serves both global retrieval and MINIMA$_{\text{RoMa}}$ matching, and the Thermal-UAV dataset is constructed. Result: On the Thermal-UAV dataset, the mean localization error drops to 9.37 m, and accuracy within a 5 m threshold improves 7.6-fold over the strongest baseline, setting a new state of the art. Conclusion: SCC-Loc bridges the thermal-visible modality gap, markedly improves the robustness and accuracy of cross-modal geo-localization, offers zero-shot generalization, and advances data and methods in the field. Abstract: Cross-modal Thermal Geo-localization (TG) provides a robust, all-weather solution for Unmanned Aerial Vehicles (UAVs) in Global Navigation Satellite System (GNSS)-denied environments. However, profound thermal-visible modality gaps introduce severe feature ambiguity, systematically corrupting conventional coarse-to-fine registration. To dismantle this bottleneck, we propose SCC-Loc, a unified Semantic-Cascade-Consensus localization framework. By sharing a single DINOv2 backbone across global retrieval and MINIMA$_{\text{RoMa}}$ matching, it minimizes memory footprint and achieves zero-shot, highly accurate absolute position estimation. Specifically, we tackle modality ambiguity by introducing three cohesive components. First, we design the Semantic-Guided Viewport Alignment (SGVA) module to adaptively optimize satellite crop regions, effectively correcting initial spatial deviations. Second, we develop the Cascaded Spatial-Adaptive Texture-Structure Filtering (C-SATSF) mechanism to explicitly enforce geometric consistency, thereby eradicating dense cross-modal outliers. Finally, we propose the Consensus-Driven Reliability-Aware Position Selection (CD-RAPS) strategy to derive the optimal solution through a synergy of physically constrained pose optimization. To address data scarcity, we construct Thermal-UAV, a comprehensive dataset providing 11,890 diverse thermal queries referenced against a large-scale satellite ortho-photo and corresponding spatially aligned Digital Surface Model (DSM). 
Extensive experiments demonstrate that SCC-Loc establishes a new state-of-the-art, suppressing the mean localization error to 9.37 m and providing a 7.6-fold accuracy improvement within a strict 5-m threshold over the strongest baseline. Code and dataset are available at https://github.com/FloralHercules/SCC-Loc.
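
The two headline numbers above (mean localization error and accuracy within a 5-m threshold) are standard summary statistics over per-query position errors. The following is a minimal sketch of how such metrics could be computed, assuming predictions and ground truth are already expressed in a local metric frame; the function name `localization_metrics` and the toy data are hypothetical and not taken from the SCC-Loc code.

```python
import math

def localization_metrics(pred_xy, true_xy, threshold_m=5.0):
    """Return (mean error in metres, fraction of queries within threshold_m)."""
    # Euclidean distance between each predicted and true 2D position.
    errors = [math.dist(p, t) for p, t in zip(pred_xy, true_xy)]
    mean_error = sum(errors) / len(errors)
    within = sum(e <= threshold_m for e in errors) / len(errors)
    return mean_error, within

# Toy example: three queries, coordinates in metres (illustrative values only).
pred = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
true = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
mean_error, acc_5m = localization_metrics(pred, true)
```

A mean error can be dominated by a few large failures, which is presumably why the abstract reports a thresholded accuracy alongside it.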

[146] SD-FSMIS: Adapting Stable Diffusion for Few-Shot Medical Image Segmentation

Meihua Li,Yang Zhang,Weizhao He,Hu Qu,Yisong Li

Main category: cs.CV

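TL;DR: This paper proposes SD-FSMIS, a framework that adapts a pretrained Stable Diffusion model with a Support-Query Interaction module and a Visual-to-Textual Condition Translator module for few-shot medical image segmentation, performing well in both standard and cross-domain settings.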

Details Motivation: To address data scarcity and domain shift in medical image segmentation and to explore the potential of diffusion models for few-shot medical image segmentation (FSMIS). Method: SD-FSMIS repurposes the conditional generative architecture of Stable Diffusion and introduces a Support-Query Interaction (SQI) module and a Visual-to-Textual Condition Translator (VTCT) module, which converts visual features from the support set into an implicit textual embedding that guides the diffusion process. Result: Competitive with state-of-the-art methods in the standard few-shot setting, with excellent generalization in more challenging cross-domain scenarios. Conclusion: Once adapted, large-scale generative models such as Stable Diffusion can substantially improve the data efficiency and robustness of few-shot medical image segmentation. Abstract: Few-Shot Medical Image Segmentation (FSMIS) aims to segment novel object classes in medical images using only minimal annotated examples, addressing the critical challenges of data scarcity and domain shifts prevalent in medical imaging. While Diffusion Models (DM) excel in visual tasks, their potential for FSMIS remains largely unexplored. We propose that the rich visual priors learned by large-scale DMs offer a powerful foundation for a more robust and data-efficient segmentation approach. In this paper, we introduce SD-FSMIS, a novel framework designed to effectively adapt the powerful pre-trained Stable Diffusion (SD) model for the FSMIS task. Our approach repurposes its conditional generative architecture by introducing two key components: a Support-Query Interaction (SQI) and a Visual-to-Textual Condition Translator (VTCT). Specifically, SQI provides a straightforward yet powerful means of adapting SD to the FSMIS paradigm. The VTCT module translates visual cues from the support set into an implicit textual embedding that guides the diffusion model, enabling precise conditioning of the generation process. Extensive experiments demonstrate that SD-FSMIS achieves competitive results compared to state-of-the-art methods in standard settings. Surprisingly, it also demonstrates excellent generalization ability in more challenging cross-domain scenarios. These findings highlight the immense potential of adapting large-scale generative models to advance data-efficient and robust medical image segmentation.

[147] CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

Yuhan Pu,Hao Zheng,Ziqian Mo,Hill Zhang,Tianyi Fan,Shuhong Wu,Jiaheng Wei

Main category: cs.CV

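TL;DR: CAMEO is a multi-agent framework that recasts conditional image editing as a quality-aware, feedback-driven, multi-stage closed-loop process, improving the structural consistency and controllability of edits.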

Details Motivation: Existing single-step generative methods for conditional image editing lack explicit quality control, are prone to structural distortion and environment-inconsistent edits, and depend on manual prompt tuning. Method: CAMEO decomposes editing into four coordinated stages (planning, structured prompting, hypothesis generation, and adaptive reference grounding) and embeds an evaluation module in the editing loop, iteratively refining intermediate results through structured feedback. Result: On anomaly insertion and human pose switching, CAMEO achieves a 20% higher win rate on average across multiple strong editing backbones and independent evaluators, showing improved robustness, controllability, and structural reliability. Conclusion: By introducing closed-loop feedback and staged coordination, CAMEO addresses the quality and structural control problems of single-step editing and offers a new paradigm for conditional image editing. Abstract: Conditional image editing aims to modify a source image according to textual prompts and optional reference guidance. Such editing is crucial in scenarios requiring strict structural control (e.g., anomaly insertion in driving scenes and complex human pose transformation). Despite recent advances in large-scale editing models (e.g., Seedream and Nano Banana), most approaches rely on single-step generation. This paradigm often lacks explicit quality control, may introduce excessive deviation from the original image, and frequently produces structural artifacts or environment-inconsistent modifications, typically requiring manual prompt tuning to achieve acceptable results. We propose CAMEO, a structured multi-agent framework that reformulates conditional editing as a quality-aware, feedback-driven process rather than a one-shot generation task. CAMEO decomposes editing into coordinated stages of planning, structured prompting, hypothesis generation, and adaptive reference grounding, where external guidance is invoked only when task complexity requires it. To overcome the lack of intrinsic quality control in existing methods, evaluation is embedded directly within the editing loop. Intermediate results are iteratively refined through structured feedback, forming a closed-loop process that progressively corrects structural and contextual inconsistencies. We evaluate CAMEO on anomaly insertion and human pose switching tasks. Across multiple strong editing backbones and independent evaluation models, CAMEO consistently achieves a 20% higher win rate on average compared to multiple state-of-the-art models, demonstrating improved robustness, controllability, and structural reliability in conditional image editing.
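
The closed-loop idea described above (generate, evaluate, feed the critique back, repeat) can be illustrated with a small control loop. This is only a sketch under stated assumptions: `closed_loop_edit`, `toy_edit`, `toy_evaluate`, the score threshold, and the round budget are all hypothetical stand-ins, and the abstract does not specify CAMEO's actual agents or stopping rule.

```python
def closed_loop_edit(source, instruction, edit_fn, evaluate_fn,
                     max_rounds=3, accept_score=0.8):
    """Edit, score, and re-edit with feedback until accepted or out of budget."""
    feedback = None
    best_result, best_score = None, float("-inf")
    for _ in range(max_rounds):
        # The editor sees the evaluator's feedback from the previous round.
        candidate = edit_fn(source, instruction, feedback)
        score, feedback = evaluate_fn(source, candidate)
        if score > best_score:
            best_result, best_score = candidate, score
        if score >= accept_score:
            break  # quality gate passed, stop refining
    return best_result, best_score

# Hypothetical stand-ins for an editing backbone and an evaluation model.
def toy_edit(source, instruction, feedback):
    return source + ("+fix" if feedback else "+draft")

def toy_evaluate(source, candidate):
    score = 0.9 if candidate.endswith("+fix") else 0.5
    return score, "reduce structural artifacts"

result, score = closed_loop_edit("img", "insert anomaly", toy_edit, toy_evaluate)
```

Keeping the best-scoring candidate rather than the last one guards against a refinement round that makes the result worse.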

[148] EffiMiniVLM: A Compact Dual-Encoder Regression Framework

Yin-Loon Khor,Yi-Jie Wong,Yan Chai Hum

Main category: cs.CV

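TL;DR: This paper proposes EffiMiniVLM, a lightweight dual-encoder vision-language regression model designed to predict product quality from images and text alone in cold-start scenarios; with an efficient architecture, a weighted Huber loss, and training on a small fraction of the data, it achieves leading performance at very low resource cost.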

Details Motivation: In cold-start scenarios without user interaction history, product quality must be predicted from item images and textual metadata alone, yet existing vision-language models tend to be large and to depend on extensive external datasets, which makes them computationally expensive. Method: EffiMiniVLM pairs an EfficientNet-B0 image encoder with a MiniLM text encoder in a dual-encoder architecture with a lightweight regression head, and introduces a Huber loss weighted by rating counts to improve sample efficiency. Result: Trained on only 20% of the Amazon Reviews 2023 dataset, the model (27.7M parameters, 6.8 GFLOPs) reaches a CES score of 0.40 at the lowest resource cost in the benchmark; it is roughly 4x to 8x more resource-efficient than the other top-5 methods and is the only one that uses no external data. Scaling the data to 40% is enough to overtake larger models. Conclusion: Compact vision-language regression models can deliver high-quality predictions at very low resource cost, with strong data efficiency, scalability, and practicality for cold-start recommendation. Abstract: Predicting product quality from multimodal item information is critical in cold-start scenarios, where user interaction history is unavailable and predictions must rely on images and textual metadata. However, existing vision-language models typically depend on large architectures and/or extensive external datasets, resulting in high computational cost. To address this, we propose EffiMiniVLM, a compact dual-encoder vision-language regression framework that integrates an EfficientNet-B0 image encoder and a MiniLM-based text encoder with a lightweight regression head. To improve training sample efficiency, we introduce a weighted Huber loss that leverages rating counts to emphasize more reliable samples, yielding consistent performance gains. Trained using only 20% of the Amazon Reviews 2023 dataset, the proposed model contains 27.7M parameters and requires 6.8 GFLOPs, yet achieves a CES score of 0.40 with the lowest resource cost in the benchmark. Despite its small size, it remains competitive with significantly larger models, achieving comparable performance while being approximately 4x to 8x more resource-efficient than other top-5 methods and being the only approach that does not use external datasets. Further analysis shows that scaling the data to 40% alone allows our model to overtake other methods, which use larger models and datasets, highlighting strong scalability despite the model's compact design.
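
A Huber loss weighted by rating counts, as mentioned above, can be sketched as follows. The exact weighting used by EffiMiniVLM is not given in the abstract; the `log1p` weighting and the function names here are assumptions chosen only to illustrate the idea that items with more ratings contribute more to the loss.

```python
import math

def huber(residual, delta=1.0):
    """Standard Huber loss: quadratic near zero, linear in the tails."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def weighted_huber_loss(preds, targets, rating_counts, delta=1.0):
    """Weighted mean of per-sample Huber losses.

    Assumption: weights grow as log(1 + count), so heavily rated items
    count more without dominating. At least one count must be positive.
    """
    weights = [math.log1p(c) for c in rating_counts]
    total = sum(w * huber(p - t, delta)
                for w, p, t in zip(weights, preds, targets))
    return total / sum(weights)

# Equal counts reduce to the plain mean of the per-sample Huber losses.
loss_equal = weighted_huber_loss([4.0, 2.0], [4.5, 4.0], [10, 10])
# Up-weighting the badly predicted second item increases the loss.
loss_skewed = weighted_huber_loss([4.0, 2.0], [4.5, 4.0], [10, 1000])
```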

[149] SFFNet: Synergistic Feature Fusion Network With Dual-Domain Edge Enhancement for UAV Image Object Detection

Wenfeng Zhang,Jun Ni,Yue Meng,Xiaodong Pei,Wei Hu,Qibing Qin,Lei Huang

Main category: cs.CV

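TL;DR: This paper proposes SFFNet, a synergistic feature fusion network for object detection in UAV images, comprising a multi-scale dynamic dual-domain coupling (MDDC) module and a synergistic feature pyramid network (SFPN); it performs strongly on the VisDrone and UAVDT datasets.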

Details Motivation: Object detection in UAV images is hampered by complex background noise and imbalanced target scales; traditional methods struggle to separate objects from backgrounds and do not fully exploit multi-scale information. Method: SFFNet comprises (1) the MDDC module, which decouples multi-scale object edges from background noise through dual-driven edge extraction in the frequency and spatial domains; (2) the SFPN module, which combines linear deformable convolutions with a wide-area perception module (WPM) to strengthen geometric and semantic representation; and (3) six detectors of different scales (N/S/M/B/L/X) to suit different scenarios. Result: On VisDrone and UAVDT, SFFNet-X reaches 36.8 AP and 20.6 AP, respectively; the lightweight models (N/S) balance accuracy against parameter efficiency. Conclusion: Through dual-domain edge enhancement and synergistic multi-scale feature fusion, SFFNet improves detection of small and multi-scale objects against complex backgrounds in UAV images, offering both accuracy and flexibility. Abstract: Object detection in unmanned aerial vehicle (UAV) images remains a highly challenging task, primarily caused by the complexity of background noise and the imbalance of target scales. Traditional methods struggle to effectively separate objects from intricate backgrounds and fail to fully leverage the rich multi-scale information contained within images. To address these issues, we have developed a synergistic feature fusion network (SFFNet) with dual-domain edge enhancement specifically tailored for object detection in UAV images. Firstly, the multi-scale dynamic dual-domain coupling (MDDC) module is designed. This component introduces a dual-driven edge extraction architecture that operates in both the frequency and spatial domains, enabling effective decoupling of multi-scale object edges from background noise. Secondly, to further enhance the representation capability of the model's neck in terms of both geometric and semantic information, a synergistic feature pyramid network (SFPN) is proposed. SFPN leverages linear deformable convolutions to adaptively capture irregular object shapes and establishes long-range contextual associations around targets through the designed wide-area perception module (WPM). Moreover, to adapt to various applications and resource-constrained scenarios, six detectors of different scales (N/S/M/B/L/X) are designed. Experiments on two challenging aerial datasets (VisDrone and UAVDT) demonstrate the outstanding performance of SFFNet-X, achieving 36.8 AP and 20.6 AP, respectively. The lightweight models (N/S) also maintain a balance between detection accuracy and parameter efficiency. The code will be available at https://github.com/CQNU-ZhangLab/SFFNet.

[150] The Eleventh NTIRE 2026 Efficient Super-Resolution Challenge Report

Bin Ren,Hang Guo,Yan Shu,Jiaqi Ma,Ziteng Cui,Shuhong Liu,Guofeng Mei,Lei Sun,Zongwei Wu,Fahad Shahbaz Khan,Salman Khan,Radu Timofte,Yawei Li,Hongyuan Yu,Pufan Xu,Chen Wu,Long Peng,Jiaojiao Yi,Siyang Yi,Yuning Cui,Jingyuan Xia,Xing Mou,Keji He,Jinlin Wu,Zongang Gao,Sen Yang,Rui Zheng,Fengguo Li,Yecheng Lei,Wenkai Min,Jie Liu,Keye Cao,Shubham Sharma,Manish Prasad,Haobo Li,Matin Fazel,Abdelhak Bentaleb,Rui Chen,Shurui Shi,Zitao Dai,Qingliang Liu,Yang Cheng,Jing Hu,Xuan Zhang,Rui Ding,Tingyi Zhang,Hui Deng,Mengyang Wang,Fulin Liu,Jing Wei,Qian Wang,Hongying Liu,Mingyang Li,Guanglu Dong,Zheng Yang,Chao Ren,Hongbo Fang,Lingxuan Li,Lin Si,Pan Gao,Moncef Gabbouj,Watchara Ruangsang,Supavadee Aramvith

Main category: cs.CV

TL;DR: This paper reviews the NTIRE 2026 challenge on efficient single-image super-resolution, summarizing participant solutions and results.

Details Motivation: To advance efficient single-image super-resolution by reducing computational cost (runtime, parameters, FLOPs) without sacrificing PSNR performance. Method: Organizing and evaluating submissions to the NTIRE 2026 challenge, with performance measured on the DIV2K_LSDIR_valid and DIV2K_LSDIR_test datasets. Result: 15 teams made valid submissions out of 95 registered participants; PSNR targets of around 26.90 dB (valid) and 26.99 dB (test) served as the quality bar. Conclusion: The challenge provides a benchmark for state-of-the-art efficient super-resolution methods and highlights trade-offs between efficiency and accuracy. Abstract: This paper reviews the NTIRE 2026 challenge on efficient single-image super-resolution with a focus on the proposed solutions and results. The aim of this challenge is to devise a network that reduces one or several aspects, such as runtime, parameters, and FLOPs, while maintaining PSNR of around 26.90 dB on the DIV2K_LSDIR_valid dataset, and 26.99 dB on the DIV2K_LSDIR_test dataset. The challenge had 95 registered participants, and 15 teams made valid submissions. Together, the submissions gauge the state of the art in efficient single-image super-resolution.
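
The PSNR targets quoted above follow the usual definition, PSNR = 10 log10(MAX^2 / MSE). The sketch below computes it over flat lists of pixel values; it assumes 8-bit intensities (MAX = 255), and the challenge's exact evaluation protocol (color space, border cropping) is not stated in the abstract, so this is illustrative rather than the official scoring code.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Worst case for 8-bit data: every pixel off by the full range gives 0 dB.
worst = psnr([0, 0, 0, 0], [255, 255, 255, 255])
# Identical inputs give infinite PSNR.
perfect = psnr([10, 20, 30], [10, 20, 30])
```

Because PSNR is logarithmic in the mean squared error, the 0.09 dB gap between the validation and test targets corresponds to only about a 2% difference in MSE.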

[151] PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction

Daniel C. MacRae,Luuk van der Hoek,Robert van der Wal,Suzanne P. M. de Vette,Hendrike Neh,Baoqiang Ma,Peter M. A. van Ooijen,Lisanne V. van Dijk

Main category: cs.CV

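TL;DR: PR3DICTR is an open-access platform built on PyTorch and MONAI for three-dimensional medical image classification, supporting rapid model development and flexible customization.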

Details Motivation: To reduce the duplicated effort, lack of standardization, and heavy development burden involved in building three-dimensional medical image classification models. Method: A modular design that integrates established model architectures, hyper-parameter solutions, and training methodologies while letting users plug in their own components. Result: Binary or event-based three-dimensional classification tasks can be run with as little as two lines of code, improving development efficiency and reproducibility. Conclusion: PR3DICTR gives medical AI researchers a lightweight, open, standardized, and highly extensible framework for developing 3D image classification models. Abstract: Three-dimensional medical image data and computer-aided decision making, particularly using deep learning, are becoming increasingly important in the medical field. To aid in these developments we introduce PR3DICTR: Platform for Research in 3D Image Classification and sTandardised tRaining. Built using community-standard distributions (PyTorch and MONAI), PR3DICTR provides an open-access, flexible and convenient framework for prediction model development, with an explicit focus on classification using three-dimensional medical image data. By combining modular design principles and standardization, it aims to alleviate developmental burden whilst retaining adjustability. It provides users with a wealth of pre-established functionality, for instance in model architecture design options, hyper-parameter solutions and training methodologies, but still gives users the opportunity and freedom to "plug in" their own solutions or modules. PR3DICTR can be applied to any binary or event-based three-dimensional classification task and can work with as little as two lines of code.

[152] ProtoFlow: Mitigating Forgetting in Class-Incremental Remote Sensing Segmentation via Low-Curvature Prototype Flow

Jiekai Wu,Rong Fu,Chuangqi Li,Zijian Zhang,Guangxin Wu,Hao Zhang,Shiyin Lin,Jianyuan Ni,Yang Li,Dongxu Zhang,Amir H. Gandomi,Simon Fong,Pengbin Feng

Main category: cs.CV

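TL;DR: This paper proposes ProtoFlow, a framework that models the temporal trajectories of class prototypes and learns an explicit temporal vector field to mitigate representation drift and catastrophic forgetting in continual learning for remote sensing image segmentation.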

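Details Motivation: Remote sensing segmentation in real deployment is inherently continual (new classes keep emerging, and acquisition conditions shift across seasons, cities, and sensors), yet existing incremental methods often treat training steps as isolated updates, leaving representation drift and forgetting insufficiently controlled. Method: ProtoFlow is a time-aware prototype dynamics framework that models class prototypes as trajectories and learns their evolution with an explicit temporal vector field; it jointly enforces low-curvature motion and inter-class separation to stabilize prototype geometry throughout incremental learning. Result: On standard class- and domain-incremental remote sensing benchmarks, mIoUall improves by up to 1.5-2.0 points, forgetting is reduced, and performance is consistently better than strong baselines. Conclusion: Explicitly modeling the temporal evolution of prototypes is a practical and interpretable strategy for improving the robustness of continual remote sensing segmentation. Abstract: Remote sensing segmentation in real deployment is inherently continual: new semantic categories emerge, and acquisition conditions shift across seasons, cities, and sensors. Despite recent progress, many incremental approaches still treat training steps as isolated updates, which leaves representation drift and forgetting insufficiently controlled. We present ProtoFlow, a time-aware prototype dynamics framework that models class prototypes as trajectories and learns their evolution with an explicit temporal vector field. By jointly enforcing low-curvature motion and inter-class separation, ProtoFlow stabilizes prototype geometry throughout incremental learning. Experiments on standard class- and domain-incremental remote sensing benchmarks show consistent gains over strong baselines, including up to 1.5-2.0 points improvement in mIoUall, together with reduced forgetting. These results suggest that explicitly modeling temporal prototype evolution is a practical and interpretable strategy for robust continual remote sensing segmentation.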
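
One simple way to picture a "low-curvature" constraint on a prototype trajectory is a penalty on its discrete second differences: a prototype that moves in a straight line at constant speed across incremental steps incurs zero penalty, while sharp turns are penalized. The sketch below uses that finite-difference proxy; it is an assumption made for illustration, and the abstract does not say how ProtoFlow actually measures curvature.

```python
def curvature_penalty(trajectory):
    """Sum of squared second differences along a sequence of prototype vectors.

    trajectory: list of equal-length vectors, one per incremental step.
    Assumption: second finite differences stand in for curvature.
    """
    penalty = 0.0
    for prev, cur, nxt in zip(trajectory, trajectory[1:], trajectory[2:]):
        # Second difference p[t+1] - 2 p[t] + p[t-1], squared and summed.
        penalty += sum((n - 2.0 * c + p) ** 2
                       for p, c, n in zip(prev, cur, nxt))
    return penalty

# A prototype drifting along a straight line has zero penalty.
straight = curvature_penalty([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)])
# A prototype that turns sharply between steps is penalized.
bent = curvature_penalty([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
```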

[153] VOSR: A Vision-Only Generative Model for Image Super-Resolution

Rongyuan Wu,Lingchen Sun,Zhengqiang Zhang,Xiangtao Kong,Jixin Zhao,Shihao Wang,Lei Zhang

Main category: cs.CV

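TL;DR: This paper proposes VOSR, a vision-only generative super-resolution framework that does not rely on pretrained text-to-image models; trained on visual data alone, it matches or exceeds T2I-diffusion-based methods in perceptual quality and structural fidelity at a much lower training cost.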

Details Motivation: Most existing generative image super-resolution (SR) methods rely on large text-to-image (T2I) diffusion models, but SR is fundamentally a restoration task conditioned on a low-resolution (LR) input, so starting from a generic T2I model is a paradigm mismatch; this work asks whether high-quality generative SR can be achieved with visual data alone. Method: VOSR (1) uses a pretrained vision encoder to extract semantically rich, spatially grounded features from the LR input as visual guidance; (2) revises classifier-free guidance by replacing the standard unconditional branch with a restoration-oriented guidance strategy that preserves weak LR anchors; and (3) trains a multi-step model first and then distills it into an efficient one-step model. Result: VOSR requires less than one-tenth of the training cost of representative T2I-based SR methods and, in both multi-step and one-step settings, achieves competitive or better perceptual quality and efficiency, with more faithful structures and fewer hallucinations on synthetic and real-world benchmarks. Conclusion: The work shows for the first time that high-quality generative SR can be achieved without multimodal pretraining, using visual data alone, which points to a new paradigm for lightweight, specialized, and efficient SR models. Abstract: Most of the recent generative image super-resolution (SR) methods rely on adapting large text-to-image (T2I) diffusion models pretrained on web-scale text-image data. While effective, this paradigm starts from a generic T2I generator, even though SR is fundamentally a low-resolution (LR) input-conditioned image restoration task. In this work, we investigate whether an SR model trained purely on visual data can rival T2I-based ones. To this end, we propose VOSR, a Vision-Only generative framework for SR. We first extract semantically rich and spatially grounded features from the LR input using a pretrained vision encoder as visual semantic guidance. We then revisit classifier-free guidance for training generative models and show that the standard unconditional branch is ill-suited to restoration models trained from scratch. We therefore replace it with a restoration-oriented guidance strategy that preserves weak LR anchors. Built upon these designs, we first train a multi-step VOSR model from scratch and then distill it into a one-step model for efficient inference. VOSR requires less than one-tenth of the training cost of representative T2I-based SR methods, yet in both multi-step and one-step settings, it achieves competitive or even better perceptual quality and efficiency, while producing more faithful structures with fewer hallucinations on both synthetic and real-world benchmarks. Our results, for the first time, show that high-quality generative SR can be achieved without multimodal pretraining. The code and models can be found at https://github.com/cswry/VOSR.
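
Classifier-free guidance, which the abstract revisits, combines two model predictions by extrapolating from a reference branch toward the fully conditioned branch. In standard classifier-free guidance the reference is the unconditional prediction; per the abstract, VOSR instead uses a branch that keeps weak LR anchors. The sketch below shows only the generic combination rule; the function name and the plain-list representation of predictions are assumptions for illustration.

```python
def guided_prediction(cond_pred, reference_pred, scale):
    """Guidance rule: reference + scale * (conditional - reference).

    reference_pred is the unconditional output in standard classifier-free
    guidance; here it may be any weaker-conditioned branch (assumption).
    """
    return [r + scale * (c - r) for c, r in zip(cond_pred, reference_pred)]

cond = [1.0, 2.0]   # prediction with full conditioning (toy values)
ref = [0.0, 1.0]    # prediction from the weaker reference branch (toy values)

same_as_cond = guided_prediction(cond, ref, scale=1.0)
same_as_ref = guided_prediction(cond, ref, scale=0.0)
extrapolated = guided_prediction(cond, ref, scale=2.0)
```

A scale of 1 recovers the conditional prediction, and scales above 1 push further in the direction that conditioning suggests.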

[154] CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

Ankan Deria,Komal Kumar,Xilin He,Imran Razzak,Hisham Cholakkal,Fahad Shahbaz Khan,Salman Khan

Main category: cs.CV

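TL;DR: This paper proposes CoME-VL, a framework that fuses a contrastively trained vision encoder (e.g., CLIP) with a self-supervised one (e.g., DINO) through entropy-guided multi-layer aggregation and RoPE-enhanced cross-attention, improving performance across a range of vision-language tasks.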

Details Motivation: Existing VLMs mostly rely on a single contrastively trained vision encoder, which helps cross-modal alignment, whereas self-supervised encoders capture richer dense semantics and are more robust; this work aims to fuse the two complementary visual representations effectively to strengthen VLMs. Method: CoME-VL (1) applies entropy-guided multi-layer feature aggregation with orthogonality-constrained projections to reduce redundancy; (2) uses RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens; and (3) injects the fused tokens into a decoder-only LLM, remaining compatible with standard VLM pipelines. Result: CoME-VL consistently outperforms single-encoder baselines across diverse vision-language benchmarks, with average gains of 4.9% on visual understanding tasks and 5.4% on grounding tasks; it reaches state-of-the-art detection performance on RefCOCO and beats the baseline by a large margin. Conclusion: Contrastive and self-supervised vision encoders are strongly complementary, and structured, low-redundancy representation-level fusion can substantially improve VLM performance, supporting the effectiveness and scalability of multi-encoder designs. Abstract: Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL: Complementary Multi-Encoder Vision-Language, a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion by (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. The fused tokens can be injected into a decoder-only LLM with minimal changes to standard VLM pipelines. Extensive experiments across diverse vision-language benchmarks demonstrate that CoME-VL consistently outperforms single-encoder baselines. In particular, we observe an average improvement of 4.9% on visual understanding tasks and 5.4% on grounding tasks. Our method achieves state-of-the-art performance on RefCOCO for detection while improving over the baseline by a large margin. Finally, we conduct ablation studies on layer merging, non-redundant feature mixing, and fusion capacity to evaluate how complementary contrastive and self-supervised signals affect VLM performance.