cs.CL

[1] Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models

Arash Gholami Davoodi,Navid Rezazadeh,Seyed Pouyan Mousavi Davoudi,Pouya Pezeshkpour

Main category: cs.CL

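TL;DR: This paper proposes Top-W, a geometry-aware truncation sampling method based on Wasserstein distance. By balancing probability mass against entropy over the token-embedding space, it improves generation diversity and creativity while preserving logical coherence. Theory shows that the optimal truncation is either a single token or a one-dimensional prefix, and experiments show that it outperforms existing decoding methods on several benchmarks.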

Details

Motivation: Existing truncation-based samplers rely mainly on statistics such as probability mass and entropy and ignore the semantic geometry of the token-embedding space, which makes it hard to balance diversity, creativity, and logical consistency.

Method: Proposes Top-W, a geometry-aware truncation rule that uses Wasserstein distance to measure how far the truncated distribution departs from the original, while explicitly trading off retained probability mass against the entropy of the kept set. Theory gives a closed-form structure for the optimal truncation set (a single token or a one-dimensional prefix), implemented with efficient geometry-based potentials (nearest-set or k-NN) and an alternating decoding routine.

Result: On four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) and three instruction-tuned models, Top-W consistently outperforms prior state-of-the-art decoding methods, with improvements of up to 33.7%. It also improves creativity under judge-based open-ended evaluation.

Conclusion: By bringing token-embedding geometry into truncation, Top-W yields a better truncation strategy that improves both accuracy and creativity without changing the standard decoding interface, supporting geometry-aware design for LLM decoding.

Abstract: Large language models (LLMs) must balance diversity and creativity against logical coherence in open-ended generation. Existing truncation-based samplers are effective but largely heuristic, relying mainly on probability mass and entropy while ignoring semantic geometry of the token space. We present Top-W, a geometry-aware truncation rule that uses Wasserstein distance (defined over token-embedding geometry) to keep the cropped distribution close to the original, while explicitly balancing retained probability mass against the entropy of the kept set. Our theory yields a simple closed-form structure for the fixed-potential subset update: depending on the mass-entropy trade-off, the optimal crop either collapses to a single token or takes the form of a one-dimensional prefix that can be found efficiently with a linear scan. We implement Top-W using efficient geometry-based potentials (nearest-set or k-NN) and pair it with an alternating decoding routine that keeps the standard truncation-and-sampling interface unchanged. Extensive experiments on four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) across three instruction-tuned models show that Top-W consistently outperforms prior state-of-the-art decoding approaches, achieving up to 33.7% improvement. Moreover, we find that Top-W not only improves accuracy-focused performance, but also boosts creativity under judge-based open-ended evaluation.

[2] When Less Is More? Diagnosing ASR Predictions in Sardinian via Layer-Wise Decoding

Domenico De Cristofaro,Alessandro Vietti,Marianne Pouplier,Aleese Block

Main category: cs.CL

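TL;DR: This paper studies whether intermediate layers of multilingual speech models encode more accurate phoneme representations than the final output layer. Decoding a pretrained Wav2Vec2 model layer by layer on Campidanese Sardinian, a low-resource language, shows that truncating the upper transformer layers lowers the Phoneme Error Rate (PER), with the best result two layers before the final layer. Further analysis shows that intermediate predictions better preserve segmental identity and avoid overgeneration. The paper also introduces "regressive errors", where a correct intermediate-layer prediction is overwritten by an error at the final layer, suggesting that deeper layers may abstract away from acoustic detail.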

Details

Motivation: To examine whether intermediate layers of multilingual speech models give more accurate phoneme representations than the final layer, and in particular whether standard metrics hide linguistically meaningful behavior in low-resource languages.

Method: Applies layer-wise decoding to a pretrained Wav2Vec2 model for phoneme recognition on Campidanese Sardinian data, combined with fine-grained alignment analysis and detection of regressive errors.

Result: PER drops when upper layers are truncated, with the best performance two layers before the final layer. Intermediate predictions better preserve segmental identity and reduce overgeneration and certain classes of phonological errors. Regressive errors are observed, where correct intermediate predictions are overwritten by errors at the final layer.

Conclusion: Intermediate representations are more phonetically discriminative, early-layer probing is a useful diagnostic tool for low-resource ASR models, and surface-level error metrics such as PER do not fully reflect how well a model captures acoustic detail.

Abstract: Recent studies have shown that intermediate layers in multilingual speech models often encode more phonetically accurate representations than the final output layer. In this work, we apply a layer-wise decoding strategy to a pretrained Wav2Vec2 model to investigate how phoneme-level predictions evolve across encoder layers, focusing on Campidanese Sardinian, a low-resource language. We show that truncating upper transformer layers leads to improved Phoneme Error Rates (PER), with the best performance achieved not at the final layer, but two layers earlier. Through fine-grained alignment analysis, we find that intermediate predictions better preserve segmental identity, avoid overgeneration, and reduce certain classes of phonological errors. We also introduce the notion of regressive errors, cases where correct predictions at intermediate layers are overwritten by errors at the final layer. These regressions highlight the limitations of surface-level error metrics and reveal how deeper layers may generalize or abstract away from acoustic detail. Our findings support the use of early-layer probing as a diagnostic tool for ASR models, particularly in low-resource settings where standard evaluation metrics may fail to capture linguistically meaningful behavior.
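The layer-wise comparison in this entry rests on the Phoneme Error Rate, which is conventionally the edit distance between reference and hypothesis phoneme sequences divided by the reference length. The sketch below is illustrative only: the function names, the `hyps_by_layer` mapping, and the per-segment regressive-error check are assumptions made for this example, not the authors' code, and the paper's alignment procedure is not reproduced.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """PER = edit distance / number of reference phonemes."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def best_layer(ref, hyps_by_layer):
    """Return (layer, PER) for the layer with the lowest PER.

    hyps_by_layer is a hypothetical mapping: layer index -> decoded phonemes.
    """
    scored = ((layer, phoneme_error_rate(ref, hyp))
              for layer, hyp in hyps_by_layer.items())
    return min(scored, key=lambda item: item[1])


def is_regressive(ref_seg, intermediate_seg, final_seg):
    """Assumed check: intermediate layer was right, final layer is wrong."""
    return intermediate_seg == ref_seg and final_seg != ref_seg
```

With decoded outputs for several layers, `best_layer` picks the layer to truncate at, and `is_regressive` flags aligned segments where the final layer overwrote a correct intermediate prediction.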

[3] Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs

Keenan Pepper,Alex McKenzie,Florin Pop,Stijn Servaes,Martin Leitgab,Mike Vaiana,Judd Rosenblatt,Michael S. A. Graziano,Diogo de Lucena

Main category: cs.CL

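TL;DR: This paper proposes training lightweight adapters, without modifying the language model's parameters, to make self-interpretation more reliable and more general. With very few parameters (d_model+1), the method clearly outperforms existing prompt-based self-interpretation approaches on feature labeling, topic identification, and uncovering implicit reasoning.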

Details

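Motivation: Existing self-interpretation methods depend on prompting, are sensitive to hyperparameters, and are unreliable. A robust, scalable interpretation mechanism that leaves the original model untouched is needed.

Method: With the language model frozen, trains a lightweight scalar affine adapter (including a bias vector) on interpretability artifacts such as sparse autoencoder features, so that the model can generate high-quality descriptions.

Result: At 70B scale, adapter-generated sparse autoencoder feature labels outperform the training labels themselves (71% vs 63% generation scoring). Topic identification reaches 94% recall@1, versus 1% for untrained baselines. The adapters decode bridge entities in multi-hop reasoning that appear in neither the prompt nor the response. The bias vector alone accounts for 85% of the improvement, and self-interpretation gains outpace capability gains as models scale.

Conclusion: Self-interpretation can be improved reliably by freezing the model and training only a very lightweight adapter, and the ability strengthens with model scale, pointing to a scalable and deployable approach to interpretability.

Abstract: Self-interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just $d_\text{model}+1$ parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (71% vs 63% generation scoring at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and decode bridge entities in multi-hop reasoning that appear in neither prompt nor response, surfacing implicit reasoning without chain-of-thought. The learned bias vector alone accounts for 85% of improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find self-interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that self-interpretation improves with scale, without modifying the model being interpreted.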
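To make the $d_\text{model}+1$ parameter count concrete, here is a minimal sketch of a scalar affine map: one scalar scale plus one bias vector of length `d_model`. This reading of "scalar affine" is an assumption based on the abstract's mention of a learned bias vector; the class name and fields are hypothetical, and training and the way the output is fed to the frozen LM are not shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScalarAffineAdapter:
    """Assumed form: output = scale * vector + bias (elementwise)."""
    d_model: int
    scale: float = 1.0                              # 1 parameter
    bias: List[float] = field(default_factory=list)  # d_model parameters

    def __post_init__(self):
        if not self.bias:
            self.bias = [0.0] * self.d_model
        if len(self.bias) != self.d_model:
            raise ValueError("bias must have d_model entries")

    def __call__(self, vector):
        if len(vector) != self.d_model:
            raise ValueError("input must have d_model entries")
        return [self.scale * v + b for v, b in zip(vector, self.bias)]

    def num_parameters(self):
        return 1 + len(self.bias)
```

For example, `ScalarAffineAdapter(d_model=4096).num_parameters()` gives 4097, which is why such an adapter is negligible next to the frozen model.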

[4] Physically Interpretable AlphaEarth Foundation Model Embeddings Enable LLM-Based Land Surface Intelligence

Mashrekur Rahman

Main category: cs.CL

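TL;DR: This paper analyzes the 64-dimensional embeddings of Google's AlphaEarth satellite foundation model, shows that they are strongly and physically interpretable with respect to 26 environmental variables, and builds a retrieval-augmented Land Surface Intelligence system that turns natural language queries into satellite-grounded assessments.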

Details

Motivation: The dense embeddings produced by satellite foundation models lack physical interpretability, which limits their use in environmental decision systems.

Method: Uses 12.1 million samples from the Continental United States and combines linear, nonlinear, and attention-based methods to analyze how AlphaEarth embeddings relate to 26 environmental variables. Builds a Land Surface Intelligence (LSI) system on a FAISS index with retrieval-augmented generation, and evaluates it with an LLM-as-Judge framework in which several LLMs rotate roles.

Result: 12 of 26 environmental variables reach $R^2 > 0.90$ (temperature and elevation approach 0.97). Dimension-variable relationships agree across methods and are robust in space and time. The LSI system scores a weighted 3.74 ± 0.77 over 360 queries, with grounding and coherence as the strongest criteria.

Conclusion: Satellite foundation model embeddings have clear physical structure and can be used reliably for environmental and geospatial intelligence tasks.

Abstract: Satellite foundation models produce dense embeddings whose physical interpretability remains poorly understood, limiting their integration into environmental decision systems. Using 12.1 million samples across the Continental United States (2017-2023), we first present a comprehensive interpretability analysis of Google AlphaEarth's 64-dimensional embeddings against 26 environmental variables spanning climate, vegetation, hydrology, temperature, and terrain. Combining linear, nonlinear, and attention-based methods, we show that individual embedding dimensions map onto specific land surface properties, while the full embedding space reconstructs most environmental variables with high fidelity (12 of 26 variables reach $R^2 > 0.90$; temperature and elevation approach $R^2 = 0.97$). The strongest dimension-variable relationships converge across all three analytical methods and remain robust under spatial block cross-validation (mean $\Delta R^2 = 0.017$) and temporally stable across all seven study years (mean inter-year correlation $r = 0.963$). Building on these validated interpretations, we then developed a Land Surface Intelligence system that implements retrieval-augmented generation over a FAISS-indexed embedding database of 12.1 million vectors, translating natural language environmental queries into satellite-grounded assessments. An LLM-as-Judge evaluation across 360 query-response cycles, using four LLMs in rotating generator, system, and judge roles, achieved weighted scores of $\mu = 3.74 \pm 0.77$ (scale 1-5), with grounding ($\mu = 3.93$) and coherence ($\mu = 4.25$) as the strongest criteria. Our results demonstrate that satellite foundation model embeddings are physically structured representations that can be operationalized for environmental and geospatial intelligence.

[5] Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation

Tianci Xue,Zeyi Liao,Tianneng Shi,Zilu Wang,Kai Zhang,Dawn Song,Yu Su,Huan Sun

Main category: cs.CL

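TL;DR: This paper proposes ACuRL, a framework that uses autonomous curriculum reinforcement learning to let computer-use agents (CUAs) keep adapting to an environment with zero human-annotated data. It also introduces CUAJudge, an automatic evaluator that supplies reliable reward signals. The approach improves performance while avoiding catastrophic forgetting.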

Details

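Motivation: Real-world digital environments are diverse and dynamic, so agents often meet unseen scenarios and distribution shifts and need to keep learning in specific environments. High-quality, environment-grounded agent data usually requires costly human annotation, so a solution that needs no human data is called for.

Method: Proposes the ACuRL framework. The agent first explores the target environment to gather initial experience. During iterative training, a curriculum task generator combines past experience with feedback from the previous iteration to produce new tasks matched to the agent's current ability. CUAJudge, an automatic evaluator with 93% agreement with human judgments, provides reliable reward signals.

Result: The method supports both intra-environment and cross-environment continual learning, giving 4-22% performance gains without catastrophic forgetting. Analysis shows that only sparse parameter updates are needed (e.g., 20% of parameters), which helps explain the efficient and robust adaptation.

Conclusion: ACuRL achieves continual environment adaptation for CUAs with zero human annotation, combining performance gains with stability, and offers a practical path to deploying agents in dynamic real-world environments.

Abstract: Real-world digital environments are highly diverse and dynamic. These characteristics cause agents to frequently encounter unseen scenarios and distribution shifts, making continual learning in specific environments essential for computer-use agents (CUAs). However, a key challenge lies in obtaining high-quality and environment-grounded agent data without relying on costly human annotation. In this work, we introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data. The agent first explores target environments to acquire initial experiences. During subsequent iterative training, a curriculum task generator leverages these experiences together with feedback from the previous iteration to synthesize new tasks tailored for the agent's current capabilities. To provide reliable reward signals, we introduce CUAJudge, a robust automatic evaluator for CUAs that achieves 93% agreement with human judgments. Empirically, our method effectively enables both intra-environment and cross-environment continual learning, yielding 4-22% performance gains without catastrophic forgetting on existing environments. Further analyses show highly sparse updates (e.g., 20% of parameters), which help explain the effective and robust adaptation. Our data and code are available at https://github.com/OSU-NLP-Group/ACuRL.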

[6] The Alignment Bottleneck in Decomposition-Based Claim Verification

Mahmud Elahi Akhter,Federico Ruggeri,Iman Munire Bilal,Rob Procter,Maria Liakata

Main category: cs.CL

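TL;DR: This paper argues that structured claim decomposition gives inconsistent results on complex claims because of two overlooked bottlenecks: evidence alignment and sub-claim error profiles. The authors build a new dataset and define two evidence alignment setups (SAE and SRE), finding that decomposition helps only when evidence is granular and strictly aligned. Under noisy sub-claim labels, an "abstention" strategy limits error propagation better than incorrect predictions. The study calls for future frameworks to focus on precise evidence synthesis and on calibrating the label bias of sub-claim models.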

Details

Motivation: Existing structured claim decomposition methods give inconsistent empirical results. The authors attribute this to overlooked issues of evidence alignment and the distribution of sub-claim errors.

Method: Builds a new dataset of real-world complex claims with temporally bounded evidence and human-annotated sub-claim evidence spans. Defines two evidence alignment setups, Sub-claim Aligned Evidence (SAE) and Repeated Claim-level Evidence (SRE). Evaluates decomposition across several datasets (PHEMEPlus, MMM-Fact, COVID-Fact) and analyzes how different error types affect robustness under noisy labels.

Result: Decomposition significantly improves performance only under SAE. Under SRE it fails to help and often degrades performance. With noisy sub-claim labels, "abstention" limits error propagation more effectively than incorrect predictions.

Conclusion: Future claim decomposition frameworks must prioritize precise evidence synthesis and calibrate the label bias of sub-claim verification models, rather than rely on the decomposition structure alone.

Abstract: Structured claim decomposition is often proposed as a solution for verifying complex, multi-faceted claims, yet empirical results have been inconsistent. We argue that these inconsistencies stem from two overlooked bottlenecks: evidence alignment and sub-claim error profiles. To better understand these factors, we introduce a new dataset of real-world complex claims, featuring temporally bounded evidence and human-annotated sub-claim evidence spans. We evaluate decomposition under two evidence alignment setups: Sub-claim Aligned Evidence (SAE) and Repeated Claim-level Evidence (SRE). Our results reveal that decomposition brings significant performance improvement only when evidence is granular and strictly aligned. By contrast, standard setups that rely on repeated claim-level evidence (SRE) fail to improve and often degrade performance, as shown across different datasets and domains (PHEMEPlus, MMM-Fact, COVID-Fact). Furthermore, we demonstrate that in the presence of noisy sub-claim labels, the nature of the error ends up determining downstream robustness. We find that conservative "abstention" significantly reduces error propagation compared to aggressive but incorrect predictions. These findings suggest that future claim decomposition frameworks must prioritize precise evidence synthesis and calibrate the label bias of sub-claim verification models.

[7] Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models

Théo Lasnier,Wissam Antoun,Francis Kulumba,Djamé Seddah

Main category: cs.CL

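TL;DR: This paper presents the first mechanistic analysis of language-switching backdoors and finds that backdoor triggers do not form separate circuits but instead hijack the model's existing language-encoding components.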

Details

Motivation: Backdoor attacks are a serious security threat to large language models, but the internal mechanisms by which triggers operate are still unclear.

Method: Uses activation patching on the GAPperon model family (1B, 8B, 24B parameters) to localize where triggers form and to identify the attention heads that process trigger information.

Result: Trigger-activated attention heads overlap substantially with the heads that naturally encode output language (Jaccard indices of 0.18 to 0.66), indicating that triggers hijack existing language-related functions instead of building new ones.

Conclusion: Backdoor triggers depend on and reuse the model's existing functional structure, so defenses may benefit from monitoring known functional components instead of searching for hidden circuits.

Abstract: Backdoor attacks pose significant security risks for Large Language Models (LLMs), yet the internal mechanisms by which triggers operate remain poorly understood. We present the first mechanistic analysis of language-switching backdoors, studying the GAPperon model family (1B, 8B, 24B parameters) which contains triggers injected during pretraining that cause output language switching. Using activation patching, we localize trigger formation to early layers (7.5-25% of model depth) and identify which attention heads process trigger information. Our central finding is that trigger-activated heads substantially overlap with heads naturally encoding output language across model scales, with Jaccard indices between 0.18 and 0.66 over the top heads identified. This suggests that backdoor triggers do not form isolated circuits but instead co-opt the model's existing language components. These findings have implications for backdoor defense: detection methods may benefit from monitoring known functional components rather than searching for hidden circuits, and mitigation strategies could potentially leverage this entanglement between injected and natural behaviors.
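The overlap figures in this entry are Jaccard indices, that is, the size of the intersection of two sets divided by the size of their union. The sketch below shows the computation for two sets of attention heads; representing a head as a `(layer, head)` tuple and the example sets are assumptions for illustration, not data from the paper.

```python
def jaccard_index(heads_a, heads_b):
    """Jaccard index |A & B| / |A | B| of two collections of heads."""
    a, b = set(heads_a), set(heads_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical top heads, each written as (layer, head_index).
trigger_heads = {(2, 5), (3, 1), (3, 7), (4, 0)}
language_heads = {(3, 1), (3, 7), (5, 2), (6, 4)}

overlap = jaccard_index(trigger_heads, language_heads)  # 2 shared / 6 total
```

Here two of six distinct heads are shared, so the index is 1/3, which would sit inside the 0.18 to 0.66 range reported in the abstract.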

[8] When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents

Virginie Mouilleron,Théo Lasnier,Djamé Seddah

Main category: cs.CL

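TL;DR: This paper introduces Multimodal Finance Eval, the first multimodal benchmark for understanding French financial documents. It evaluates six open-weight vision-language models on text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning, and finds clear weaknesses in chart understanding and multi-turn interaction.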

Details

Motivation: The reliability of vision-language models in specialized non-English domains, finance in particular, is underexplored. Financial documents mix dense regulatory text, numerical tables, and charts, and extraction errors can have real-world consequences.

Method: Builds the Multimodal Finance Eval benchmark of 1,204 expert-validated questions covering text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning. Evaluates six open-weight VLMs (8B-124B parameters) with an LLM-as-judge protocol.

Result: Models do well on text and table tasks (85-90% accuracy) but drop sharply on chart interpretation (34-62%). In multi-turn dialogue, early mistakes propagate and overall accuracy falls to roughly 50%, regardless of model size.

Conclusion: Current VLMs are effective on well-defined extraction tasks but remain brittle in interactive, multi-step financial analysis. Multimodal Finance Eval provides a challenging benchmark for this high-stakes domain.

Abstract: Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. This gap is especially critical in finance, where documents mix dense regulatory text, numerical tables, and visual charts, and where extraction errors can have real-world consequences. We introduce Multimodal Finance Eval, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning, drawn from real investment prospectuses, KIDs, and PRIIPs. We evaluate six open-weight VLMs (8B-124B parameters) using an LLM-as-judge protocol. While models achieve strong performance on text and table tasks (85-90% accuracy), they struggle with chart interpretation (34-62%). Most notably, multi-turn dialogue reveals a sharp failure mode: early mistakes propagate across turns, driving accuracy down to roughly 50% regardless of model size. These results show that current VLMs are effective for well-defined extraction tasks but remain brittle in interactive, multi-step financial analysis. Multimodal Finance Eval offers a challenging benchmark to measure and drive progress in this high-stakes setting.

[9] Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs

Zhongzhi Li,Xuansheng Wu,Yijiang Li,Lijie Hu,Ninghao Liu

Main category: cs.CL

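TL;DR: This paper proposes FAC Synthesis, a diversity-driven data synthesis framework built on Feature Activation Coverage (FAC). It uses a sparse autoencoder to identify missing features and generates samples that reflect them, improving downstream performance of large language models, and it finds an interpretable feature space shared across model families.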

Details

Motivation: Existing methods for building post-training data measure diversity at the text level, which gives only a weak signal about coverage of task-relevant features and does little to drive downstream performance.

Method: Introduces the Feature Activation Coverage (FAC) metric, which measures data diversity in an interpretable feature space. On top of it, builds the FAC Synthesis framework: a sparse autoencoder first identifies features that the seed data fails to activate sufficiently, and synthetic samples are then generated to activate those features.

Result: On several downstream tasks, including instruction following, toxicity detection, reward modeling, and behavior steering, FAC Synthesis improves both data diversity and model performance. It also finds a shared, interpretable feature space across model families (LLaMA, Mistral, and Qwen) that supports cross-model knowledge transfer.

Conclusion: FAC offers a more fundamental and interpretable measure of data diversity, and FAC Synthesis offers a practical, general approach to data-centric optimization of LLM training data.

Abstract: The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC) which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.
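The abstract does not give a formula for Feature Activation Coverage, so the following is only one plausible reading, stated as an assumption: treat a feature as covered when at least one sample activates it above a threshold, report the covered fraction, and list the uncovered features as candidates for synthesis. All names, the threshold, and the dictionary format for sparse autoencoder activations are hypothetical.

```python
def activated_features(activations, threshold=0.0):
    """Features that some sample activates above the threshold.

    activations: list of per-sample dicts mapping feature id -> activation
    (a stand-in for sparse autoencoder outputs).
    """
    active = set()
    for sample in activations:
        active.update(f for f, value in sample.items() if value > threshold)
    return active


def feature_activation_coverage(activations, all_features, threshold=0.0):
    """Return (coverage fraction, set of missing features).

    Assumed definition for illustration, not the paper's exact metric.
    """
    all_features = set(all_features)
    if not all_features:
        raise ValueError("feature set must be non-empty")
    covered = activated_features(activations, threshold) & all_features
    return len(covered) / len(all_features), all_features - covered
```

Under this reading, the missing set returned for a seed dataset is what a generator would be asked to target when producing synthetic samples.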

[10] When are We Worried? Temporal Trends of Anxiety and What They Reveal about Us

Saif M. Mohammad

Main category: cs.CL

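TL;DR: Using a newly built lexicon of word-anxiety associations, this paper analyzes a large number of US and Canadian tweets and shows how anxiety varies over the day, over the week, and with grammatical tense and pronoun use.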

Details

Motivation: To study the temporal patterns of anxiety expressed on social media and how they relate to linguistic features such as tense and pronouns, for a better understanding of how anxiety shows up psychologically and behaviorally.

Method: Applies lexicon-based affect analysis to large-scale Twitter data, with time-series analysis (by hour and by day of week), tense classification (past, present, future), and statistics by pronoun category (first, second, and third person; subject and object forms).

Result: Anxiety peaks at 8am and is lowest around noon. It is highest mid-week and lowest on weekends. Past-tense sentences show the most anxiety and future-tense sentences the least. Posts with third-person pronouns and with subject pronouns show higher anxiety.

Conclusion: Expressions of anxiety follow clear temporal and linguistic patterns, reflecting how physiological rhythms and cognitive focus (self, time orientation, attention to others) relate to emotional state.

Abstract: In this short paper, we make use of a recently created lexicon of word-anxiety associations to analyze large amounts of US and Canadian social media data (tweets) to explore *when* we are anxious and what insights that reveals about us. We show that our levels of anxiety on social media exhibit systematic patterns of rise and fall during the day: highest at 8am (in line with when we have high cortisol levels in the body) and lowest around noon. Anxiety is lowest on weekends and highest mid-week. We also examine anxiety in past, present, and future tense sentences to show that anxiety is highest in past tense and lowest in future tense. Finally, we examine the use of anxiety and calmness words in posts that contain pronouns to show: more anxiety in 3rd person pronouns (he, they) posts than 1st and 2nd person pronouns and higher anxiety in posts with subject pronouns (I, he, she, they) than object pronouns (me, him, her, them). Overall, these trends provide valuable insights on not just when we are anxious, but also how different types of focus (future, past, self, outward, etc.) are related to anxiety.

[11] EVOKE: Emotion Vocabulary Of Korean and English

Yoonwon Jung,Hagyeong Shin,Benjamin K. Bergen

Main category: cs.CL

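TL;DR: This paper introduces EVOKE, a parallel dataset of English and Korean emotion vocabulary. It offers comprehensive coverage of emotion words, many-to-many translations, and identification of language-specific emotion words, with systematic annotation of adjectives, verbs, polysemous words, and metaphors. The authors describe it as the most comprehensive, systematic, and theory-agnostic English-Korean emotion word resource to date.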

Details

Motivation: To build a comprehensive, systematic, theory-agnostic parallel dataset of English and Korean emotion vocabulary that serves research in emotion science, psycholinguistics, computational linguistics, and natural language processing.

Method: Builds a parallel emotion vocabulary dataset of 1,427 Korean words and 1,399 English words, systematically annotates 819 Korean and 924 English adjectives and verbs, and annotates polysemy, semantic relationships, and emotion-related metaphors.

Result: Releases EVOKE, described as the most comprehensive, systematic, and theory-agnostic English-Korean emotion vocabulary dataset to date, covering many-to-many translations, language-specific words, and annotations of polysemous words and metaphors. The dataset is publicly available.

Conclusion: EVOKE is a flexible, practical resource for cross-lingual emotion research that can be adapted to different disciplines and theoretical perspectives.

Abstract: This paper introduces EVOKE, a parallel dataset of emotion vocabulary in English and Korean. The dataset offers comprehensive coverage of emotion words in each language, in addition to many-to-many translations between words in the two languages and identification of language-specific emotion words. The dataset contains 1,427 Korean words and 1,399 English words, and we systematically annotate 819 Korean and 924 English adjectives and verbs. We also annotate multiple meanings of each word and their relationships, identifying polysemous emotion words and emotion-related metaphors. The dataset is, to our knowledge, the most comprehensive, systematic, and theory-agnostic dataset of emotion words in both Korean and English to date. It can serve as a practical tool for emotion science, psycholinguistics, computational linguistics, and natural language processing, allowing researchers to adopt different views on the resource reflecting their needs and theoretical perspectives. The dataset is publicly available at https://github.com/yoonwonj/EVOKE.

[12] LATA: A Tool for LLM-Assisted Translation Annotation

Baorong Huang,Ali Asiri

Main category: cs.CL

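TL;DR: This paper presents an interactive, LLM-based tool for building high-quality parallel corpora for structurally divergent language pairs such as Arabic-English. It combines template-based prompt management with a human verification workflow to keep annotation efficient while improving the accuracy of sentence-level alignment and translation-phenomenon annotation.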

Details

Motivation: Traditional automatic alignment tools struggle with the deep linguistic shifts and semantic nuances found in structurally divergent language pairs such as Arabic-English. A method is needed that combines the scalability of automation with the precision of expert human judgment.

Method: Designs an LLM-assisted interactive tool in which a template-based Prompt Manager drives sentence segmentation and alignment under strict JSON output constraints. Automated preprocessing feeds a human-in-the-loop workflow, and a stand-off architecture lets researchers correct alignments and add translation technique annotations.

Result: The tool keeps annotation efficient while supporting the precision needed to annotate complex translation phenomena in specialized domains, showing that LLMs are a feasible aid for parallel corpus construction on structurally divergent language pairs.

Conclusion: Under strict constraints, LLMs can support the construction of high-quality parallel corpora, especially for language pairs where traditional tools fail. Template-based prompting combined with human review is a workable way to balance automation and linguistic rigor.

Abstract: The construction of high-quality parallel corpora for translation research has increasingly evolved from simple sentence alignment to complex, multi-layered annotation tasks. This methodological shift presents significant challenges for structurally divergent language pairs, such as Arabic-English, where standard automated tools frequently fail to capture deep linguistic shifts or semantic nuances. This paper introduces a novel, LLM-assisted interactive tool designed to reduce the gap between scalable automation and the rigorous precision required for expert human judgment. Unlike traditional statistical aligners, our system employs a template-based Prompt Manager that leverages large language models (LLMs) for sentence segmentation and alignment under strict JSON output constraints. In this tool, automated preprocessing integrates into a human-in-the-loop workflow, allowing researchers to refine alignments and apply custom translation technique annotations through a stand-off architecture. By leveraging LLM-assisted processing, the tool balances annotation efficiency with the linguistic precision required to analyze complex translation phenomena in specialized domains.

[13] Neuro-Symbolic Synergy for Interactive World Modeling

Hongyu Zhao,Siyu Zhou,Haolin Yang,Zengyi Qin,Tianyi Zhou

Main category: cs.CL

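TL;DR: This paper proposes the Neuro-Symbolic Synergy (NeSyS) framework, which combines the semantic expressivity of large language models (LLMs) with the logical consistency of symbolic world models (WMs). Alternating training and constraints on the output probability distribution improve both the accuracy and the data efficiency of world modeling.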

Details

Motivation: Large language models (LLMs) used as world models (WMs) tend to hallucinate, especially in corner cases that require strict compliance with deterministic transition rules. Symbolic WMs are logically consistent but lack semantic expressivity, so the two are complementary.

Method: Proposes the Neuro-Symbolic Synergy (NeSyS) framework: (1) the LLM and an executable symbolic WM are trained alternately, each on trajectories the other explains inadequately; (2) instead of rule-based prompting, the symbolic WM constrains the LLM directly by modifying its output probability distribution; (3) the neural WM is fine-tuned only on trajectories not covered by symbolic rules, cutting training data by 50%.

Result: In three interactive environments (ScienceWorld, Webshop, and Plancraft), NeSyS shows consistent advantages over baselines in both WM prediction accuracy and data efficiency.

Conclusion: NeSyS bridges a key gap between neural and symbolic approaches to world modeling, combining semantic expressivity with logical robustness, and offers a new way to build reliable and efficient world models.

Abstract: Large language models (LLMs) exhibit strong general-purpose reasoning capabilities, yet they frequently hallucinate when used as world models (WMs), where strict compliance with deterministic transition rules, particularly in corner cases, is essential. In contrast, Symbolic WMs provide logical consistency but lack semantic expressivity. To bridge this gap, we propose Neuro-Symbolic Synergy (NeSyS), a framework that integrates the probabilistic semantic priors of LLMs with executable symbolic rules to achieve both expressivity and robustness. NeSyS alternates training between the two models using trajectories inadequately explained by the other. Unlike rule-based prompting, the symbolic WM directly constrains the LLM by modifying its output probability distribution. The neural WM is fine-tuned only on trajectories not covered by symbolic rules, reducing training data by 50% without loss of accuracy. Extensive experiments on three distinct interactive environments, i.e., ScienceWorld, Webshop, and Plancraft, demonstrate NeSyS's consistent advantages over baselines in both WM prediction accuracy and data efficiency.
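The abstract says the symbolic WM constrains the LLM by modifying its output probability distribution but does not say how. One simple possibility, assumed here purely for illustration, is to zero out next-state candidates that a symbolic rule forbids and renormalize the remainder. The function name, the rule predicate, and the toy states are all hypothetical.

```python
def constrain_distribution(probs, is_allowed):
    """Mask candidates rejected by a symbolic rule, then renormalize.

    probs: dict mapping candidate next state -> probability from the neural WM.
    is_allowed: predicate standing in for an executable symbolic rule.
    """
    kept = {state: p for state, p in probs.items() if is_allowed(state)}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("symbolic rules reject every candidate")
    return {state: p / total for state, p in kept.items()}


# Toy example: a rule stating that objects cannot vanish.
neural_probs = {
    "door opens": 0.5,
    "door stays locked": 0.3,
    "door vanishes": 0.2,
}
constrained = constrain_distribution(
    neural_probs, lambda state: "vanishes" not in state
)
```

After masking, the remaining mass of 0.8 is rescaled so the two permitted transitions receive 0.625 and 0.375.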

[14] Canvas-of-Thought: Grounding Reasoning via Mutable Structured States

Lingzhuang Sun,Yuxia Zhu,Ruitong Liu,Hao Liang,Zheng Sun,Caijun Jia,Honghao He,Yuchen Wu,Siyuan Li,Jingxuan Wei,Xiangxiang Zhang,Bihui Yu,Wentao Zhang

Main category: cs.CL

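TL;DR: This paper proposes Canvas-of-Thought (Canvas-CoT), which uses an HTML Canvas as an external reasoning substrate. Atomic DOM operations and a rendering-based critique loop improve the reasoning precision and context efficiency of multimodal large language models on high-dimensional tasks such as geometry and SVG design.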

Details

Motivation: Chain-of-Thought (CoT) in multimodal large language models is limited to linear text sequences, which makes it hard to correct errors and maintain state efficiently. The problem is most visible in high-dimensional tasks such as geometry and SVG design, which need explicit visual guidance.

Method: Introduces an HTML Canvas as an external reasoning canvas that supports DOM-based CRUD operations for in-place state updates, combined with a rendering-based critique loop that gives hard visual constraint feedback.

Result: Significantly outperforms existing baselines on VCode, RBench-V, and MathVista, supporting Canvas-CoT's advantages in context efficiency and complex task solving.

Conclusion: Canvas-CoT offers a new paradigm for multimodal reasoning that uses explicit state management and visual feedback to move past the bottleneck of linear CoT.

Abstract: While Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), relying solely on linear text sequences remains a bottleneck for complex tasks. We observe that even when auxiliary visual elements are interleaved, they are often treated as static snapshots within a one-dimensional, unstructured reasoning chain. We argue that such approaches treat reasoning history as an immutable stream: correcting a local error necessitates either generating verbose downstream corrections or regenerating the entire context. This forces the model to implicitly maintain and track state updates, significantly increasing token consumption and cognitive load. This limitation is particularly acute in high-dimensional domains, such as geometry and SVG design, where the textual expression of CoT lacks explicit visual guidance, further constraining the model's reasoning precision. To bridge this gap, we introduce Canvas-of-Thought (Canvas-CoT). By leveraging an HTML Canvas as an external reasoning substrate, Canvas-CoT empowers the model to perform atomic, DOM-based CRUD operations. This architecture enables in-place state revisions without disrupting the surrounding context, allowing the model to explicitly maintain the "ground truth". Furthermore, we integrate a rendering-based critique loop that serves as a hard constraint validator, providing explicit visual feedback to resolve complex tasks that are difficult to articulate through text alone. Extensive experiments on VCode, RBench-V, and MathVista demonstrate that Canvas-CoT significantly outperforms existing baselines, establishing a new paradigm for context-efficient multimodal reasoning.

[15] On the Robustness of Knowledge Editing for Detoxification

Ming Dong,Shiyi Tang,Ziyan Peng,Guanyi Chen,Tingting He

Main category: cs.CL

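TL;DR: This paper proposes a robustness-oriented evaluation framework for knowledge-editing (KE) based detoxification. It shows that existing KE detoxification methods suffer from pseudo-detoxification, break down when several targets are edited jointly, and transfer poorly across languages, so their practical robustness is limited.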

Details

Motivation: Evaluations of KE-based detoxification rely heavily on automatic toxicity classifiers and overlook whether behavioural suppression is genuine and robust. A more complete and reliable evaluation is needed.

Method: Builds an evaluation framework along three dimensions (optimisation robustness, compositional robustness, and cross-lingual robustness) and systematically analyzes how KE detoxification fails under different settings.

Result: Pseudo-detoxification is a common failure mode. Editing multiple behaviours jointly degrades effectiveness. Detoxification works only under specific model-method combinations, and cross-lingual generalization is limited.

Conclusion: KE-based detoxification is not universally reliable. Its effectiveness depends on the model, the number of detoxification objectives, and the languages covered, so it should be evaluated and deployed with care.

Abstract: Knowledge-Editing-based (KE-based) detoxification has emerged as a promising approach for mitigating harmful behaviours in Large Language Models. Existing evaluations, however, largely rely on automatic toxicity classifiers, implicitly assuming that reduced toxicity scores reflect genuine behavioural suppression. In this work, we propose a robustness-oriented evaluation framework for KE-based detoxification that examines its reliability beyond standard classifier-based metrics along three dimensions: optimisation robustness, compositional robustness, and cross-lingual robustness. We identify pseudo-detoxification as a common failure mode, where apparent toxicity reductions arise from degenerate generation behaviours rather than meaningful suppression of unsafe content. We further show that detoxification effectiveness degrades when multiple unsafe behaviours are edited jointly, and that both monolingual and cross-lingual detoxification remain effective only under specific model-method combinations. Overall, our results indicate that KE-based detoxification is robust only for certain models, limited numbers of detoxification objectives, and a subset of languages.

[16] LHAW: Controllable Underspecification for Long-Horizon Tasks

George Pu,Michael S. Lee,Udari Madhushani Sehwag,David J. Lee,Bryan Zhu,Yash Maurya,Mohit Raghavendra,Yuan Xue,Samuel Marc Denton

Main category: cs.CL

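TL;DR: This paper proposes LHAW, a framework for systematically generating and evaluating underspecification in long-horizon workflows. Underspecified variants are validated through empirical agent trials, and a dataset of 285 task variants is released.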

Details

Motivation: There is no scalable, task-agnostic framework for systematically constructing ambiguity in custom workflows and measuring its impact, which holds back progress on clarification seeking by long-horizon workflow agents.

Method: Proposes LHAW (Long-Horizon Augmented Workflows), a modular, dataset-agnostic synthetic pipeline that turns well-specified tasks into controllable underspecified variants by systematically removing information along four dimensions: Goals, Constraints, Inputs, and Context. Variants are classified from empirical agent trials (terminal state divergence) as outcome-critical, divergent, or benign.

Result: Releases 285 task variants drawn from TheAgentCompany, SWE-Bench Pro, and MCP-Atlas, along with a formal analysis of how current agents detect, reason about, and resolve underspecification.

Conclusion: LHAW is the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, supporting the development of reliable autonomous systems.

Abstract: Long-horizon workflow agents that operate effectively over extended periods are essential for truly autonomous systems. Their reliable execution critically depends on the ability to reason through ambiguous situations in which clarification seeking is necessary to ensure correct task execution. However, progress is limited by the lack of scalable, task-agnostic frameworks for systematically curating and measuring the impact of ambiguity across custom workflows. We address this gap by introducing LHAW (Long-Horizon Augmented Workflows), a modular, dataset-agnostic synthetic pipeline that transforms any well-specified task into controllable underspecified variants by systematically removing information across four dimensions (Goals, Constraints, Inputs, and Context) at configurable severity levels. Unlike approaches that rely on LLM predictions of ambiguity, LHAW validates variants through empirical agent trials, classifying them as outcome-critical, divergent, or benign based on observed terminal state divergence. We release 285 task variants from TheAgentCompany, SWE-Bench Pro and MCP-Atlas according to our taxonomy alongside formal analysis measuring how current agents detect, reason about, and resolve underspecification across ambiguous settings. LHAW provides the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, enabling development of reliable autonomous systems.

[17] When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning

Leheng Sheng,Yongtao Zhang,Wenchang Ma,Yaorui Shi,Ting Huang,Xiang Wang,An Zhang,Ke Shen,Tat-Seng Chua

Main category: cs.CL

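TL;DR: This paper proposes GRU-Mem, which adds two text-controlled gates (an update gate and an exit gate) and end-to-end reinforcement learning rewards to make long-context reasoning in large language models more stable and efficient. It generally outperforms MemAgent.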

Details

Motivation: Existing long-context reasoning methods such as MemAgent suffer from two problems, uncontrolled memory growth and the lack of a loop exit mechanism, which lead to degraded performance and wasted computation. Method: Proposes GRU-Mem, which borrows the gating structure of GRUs to design an update gate and an exit gate and trains them with end-to-end reinforcement learning, using two reward signals, $r^{\text{update}}$ and $r^{\text{exit}}$, to optimize memory updating and loop termination respectively. Result: Across a range of long-context reasoning tasks, GRU-Mem generally outperforms MemAgent, with up to 400% inference speed acceleration. Conclusion: Through gating and reinforcement learning, GRU-Mem achieves more stable, efficient, and controllable long-context reasoning, offering a new approach to long-range dependency modeling in LLMs. Abstract: While reasoning over long context is crucial for various real-world applications, it remains challenging for large language models (LLMs) as they suffer from performance degradation as the context length grows. Recent work MemAgent has tried to tackle this by processing context chunk-by-chunk in an RNN-like loop and updating a textual memory for final answering. However, this naive recurrent memory update faces two crucial drawbacks: (i) memory can quickly explode because it can update indiscriminately, even on evidence-free chunks; and (ii) the loop lacks an exit mechanism, leading to unnecessary computation even after sufficient evidence is collected. To address these issues, we propose GRU-Mem, which incorporates two text-controlled gates for more stable and efficient long-context reasoning. Specifically, in GRU-Mem, the memory only updates when the update gate is open and the recurrent loop will exit immediately once the exit gate is open. To endow the model with such capabilities, we introduce two reward signals $r^{\text{update}}$ and $r^{\text{exit}}$ within end-to-end RL, rewarding the correct updating and exiting behaviors respectively. Experiments on various long-context reasoning tasks demonstrate the effectiveness and efficiency of GRU-Mem, which generally outperforms the vanilla MemAgent with up to 400\% inference speed acceleration.
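The gated loop described in the abstract can be sketched in a few lines. This is only an illustration of the control flow (update the memory only when the update gate is open, stop as soon as the exit gate is open); the names `GateDecision`, `gated_memory_loop`, and `toy_step` are hypothetical, and `toy_step` is a keyword-matching stand-in for the LLM, not the paper's model or reward design.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class GateDecision:
    should_update: bool  # update gate: rewrite the memory for this chunk?
    should_exit: bool    # exit gate: is the collected evidence sufficient?
    new_memory: str      # candidate memory proposed for this chunk


def gated_memory_loop(
    chunks: List[str],
    step: Callable[[str, str], GateDecision],
    memory: str = "",
) -> Tuple[str, int]:
    """Process chunks in order; return the final memory and chunks read."""
    chunks_read = 0
    for chunk in chunks:
        chunks_read += 1
        decision = step(memory, chunk)
        if decision.should_update:
            # Memory changes only when the update gate is open.
            memory = decision.new_memory
        if decision.should_exit:
            # Leave the loop early instead of reading the remaining chunks.
            break
    return memory, chunks_read


def toy_step(memory: str, chunk: str) -> GateDecision:
    """Hypothetical stand-in for the LLM: a chunk containing 'key' is evidence."""
    has_evidence = "key" in chunk
    new_memory = (memory + " " + chunk).strip() if has_evidence else memory
    return GateDecision(has_evidence, has_evidence, new_memory)
```

With the chunks `["noise", "the key is 42", "more noise"]`, the loop skips the first chunk without touching memory and exits after the second, so the third chunk is never processed.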

[18] Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Ailin Huang,Ang Li,Aobo Kong,Bin Wang,Binxing Jiao,Bo Dong,Bojun Wang,Boyu Chen,Brian Li,Buyun Ma,Chang Su,Changxin Miao,Changyi Wan,Chao Lou,Chen Hu,Chen Xu,Chenfeng Yu,Chengting Feng,Chengyuan Yao,Chunrui Han,Dan Ma,Dapeng Shi,Daxin Jiang,Dehua Ma,Deshan Sun,Di Qi,Enle Liu,Fajie Zhang,Fanqi Wan,Guanzhe Huang,Gulin Yan,Guoliang Cao,Guopeng Li,Han Cheng,Hangyu Guo,Hanshan Zhang,Hao Nie,Haonan Jia,Haoran Lv,Hebin Zhou,Hekun Lv,Heng Wang,Heung-Yeung Shum,Hongbo Huang,Hongbo Peng,Hongyu Zhou,Hongyuan Wang,Houyong Chen,Huangxi Zhu,Huimin Wu,Huiyong Guo,Jia Wang,Jian Zhou,Jianjian Sun,Jiaoren Wu,Jiaran Zhang,Jiashu Lv,Jiashuo Liu,Jiayi Fu,Jiayu Liu,Jie Cheng,Jie Luo,Jie Yang,Jie Zhou,Jieyi Hou,Jing Bai,Jingcheng Hu,Jingjing Xie,Jingwei Wu,Jingyang Zhang,Jishi Zhou,Junfeng Liu,Junzhe Lin,Ka Man Lo,Kai Liang,Kaibo Liu,Kaijun Tan,Kaiwen Yan,Kaixiang Li,Kang An,Kangheng Lin,Lei Yang,Liang Lv,Liang Zhao,Liangyu Chen,Lieyu Shi,Liguo Tan,Lin Lin,Lina Chen,Luck Ma,Mengqiang Ren,Michael Li,Ming Li,Mingliang Li,Mingming Zhang,Mingrui Chen,Mitt Huang,Na Wang,Peng Liu,Qi Han,Qian Zhao,Qinglin He,Qinxin Du,Qiuping Wu,Quan Sun,Rongqiu Yang,Ruihang Miao,Ruixin Han,Ruosi Wan,Ruyan Guo,Shan Wang,Shaoliang Pang,Shaowen Yang,Shengjie Fan,Shijie Shang,Shiliang Yang,Shiwei Li,Shuangshuang Tian,Siqi Liu,Siye Wu,Siyu Chen,Song Yuan,Tiancheng Cao,Tianchi Yue,Tianhao Cheng,Tianning Li,Tingdan Luo,Wang You,Wei Ji,Wei Yuan,Wei Zhang,Weibo Wu,Weihao Xie,Wen Sun,Wenjin Deng,Wenzhen Zheng,Wuxun Xie,Xiangfeng Wang,Xiangwen Kong,Xiangyu Liu,Xiangyu Zhang,Xiaobo Yang,Xiaojia Liu,Xiaolan Yuan,Xiaoran Jiao,Xiaoxiao Ren,Xiaoyun Zhang,Xin Li,Xin Liu,Xin Wu,Xing Chen,Xingping Yang,Xinran Wang,Xu Zhao,Xuan He,Xuanti Feng,Xuedan Cai,Xuqiang Zhou,Yanbo Yu,Yang Li,Yang Xu,Yanlin Lai,Yanming Xu,Yaoyu Wang,Yeqing Shen,Yibo Zhu,Yichen Lv,Yicheng Cao,Yifeng Gong,Yijing Yang,Yikun Yang,Yin Zhao,Yingxiu Zhao,Yinmin Zhang,Yitong Zhang,Yixuan Zhang,Yiyang Chen,Yongchi Zhao,Yongshen Long,Yongyao Wang,Yousong Guan,Yu Zhou,Yuang Peng,Yuanhao Ding,Yuantao Fan,Yuanzhen Yang,Yuchu Luo,Yudi Zhao,Yue Peng,Yueqiang Lin,Yufan Lu,Yuling Zhao,Yunzhou Ju,Yurong Zhang,Yusheng Li,Yuxiang Yang,Yuyang Chen,Yuzhu Cai,Zejia Weng,Zetao Hong,Zexi Li,Zhe Xie,Zheng Ge,Zheng Gong,Zheng Zeng,Zhenyi Lu,Zhewei Huang,Zhichao Chang,Zhiguo Huang,Zhiheng Hu,Zidong Yang,Zili Wang,Ziqi Ren,Zixin Zhang,Zixuan Wang

Main category: cs.CL

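TL;DR: Step 3.5 Flash is a sparse MoE model that balances frontier-level intelligence with inference efficiency. It pairs a 196B-parameter foundation with 11B active parameters and uses hybrid sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency of agentic interactions; a stable reinforcement learning framework that combines verifiable signals with preference feedback enables consistent self-improvement in mathematics, code, and tool use; on several agentic and specialized benchmarks its performance is comparable to GPT-5.2 xHigh and Gemini 3.0 Pro.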

Details

Motivation: Building efficient, reliable agents with sharp reasoning requires maintaining frontier-level intelligence while substantially reducing the latency and compute cost of multi-round interactions. Method: Proposes Step 3.5 Flash, a sparse MoE architecture (196B total parameters, 11B active) with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3); designs a reinforcement learning framework that combines verifiable signals with preference feedback and remains stable under large-scale off-policy training, enabling self-improvement across mathematics, code, and tool use. Result: Achieves 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6, 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to GPT-5.2 xHigh and Gemini 3.0 Pro. Conclusion: Step 3.5 Flash redefines the efficiency frontier for agent deployment, providing a high-density, scalable foundation model for complex industrial agents. Abstract: We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.
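To make the two headline numbers concrete, the sketch below lays out a 3:1 sliding-window/full attention pattern and computes the active-parameter fraction. The abstract only states the 3:1 ratio and the 196B/11B figures; placing the full-attention layer last in each group of four, the layer count used in the example, and the helper names are assumptions made for illustration.

```python
from typing import List


def attention_layout(num_layers: int, sliding_per_full: int = 3) -> List[str]:
    """Return a per-layer attention type for an interleaved pattern.

    Assumption: every group of (sliding_per_full + 1) layers ends with
    the full-attention layer; the source does not specify the ordering.
    """
    period = sliding_per_full + 1
    return [
        "full" if (i + 1) % period == 0 else "sliding"
        for i in range(num_layers)
    ]


def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of parameters used per token in a sparse MoE model."""
    return active_params_b / total_params_b


layout = attention_layout(8)              # hypothetical 8-layer stack
share = active_fraction(11.0, 196.0)      # figures quoted in the abstract
```

Under these assumptions, roughly 5.6% of the parameters are active per token, and three out of every four layers attend only within a local window.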

[19] Online Causal Kalman Filtering for Stable and Effective Policy Optimization

Shuo He,Lang Feng,Xin Cheng,Lei Feng,Bo An

Main category: cs.CL

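TL;DR: This paper proposes KPO, an online causal Kalman filtering method for stabilizing reinforcement learning policy optimization of large language models. By modeling and dynamically smoothing importance sampling ratios, it mitigates the training instability caused by high-variance token-level IS ratios and achieves SOTA results on math reasoning tasks.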

Details

Motivation: In existing RL methods for LLMs, token-level importance sampling (IS) ratios have high variance, while sequence-level IS ratios ignore temporal deviation across tokens, making policy optimization unstable or even causing collapse; the authors find that local off-policy deviation is structurally inconsistent at the token level, which distorts policy-gradient updates across adjacent tokens. Method: Proposes Online Causal Kalman Filtering for Policy Optimization (KPO): the desired IS ratio is modeled as a latent state that evolves across tokens, and a causal Kalman filter (depending only on past tokens) estimates this state online and autoregressively, yielding structure-aware, noise-robust token-level IS ratios. Result: KPO outperforms current SOTA methods on several challenging math reasoning datasets (such as MATH and AMC), supporting its effectiveness in improving training stability and policy performance. Conclusion: By introducing causal Kalman filtering, KPO suppresses noise in IS ratios while preserving token-level local structural variation, offering a new approach to stable reinforcement learning for large language models. Abstract: Reinforcement learning for large language models suffers from high-variance token-level importance sampling (IS) ratios, which can destabilize policy optimization at scale. To improve stability, recent methods typically use a fixed sequence-level IS ratio for all tokens in a sequence or adjust each token's IS ratio separately, thereby neglecting temporal off-policy deviation across tokens in a sequence. In this paper, we first empirically identify that local off-policy deviation is structurally inconsistent at the token level, which may distort policy-gradient updates across adjacent tokens and lead to training collapse. To address the issue, we propose Online Causal Kalman Filtering for stable and effective Policy Optimization (KPO). Concretely, we model the desired IS ratio as a latent state that evolves across tokens and apply a Kalman filter to update this state online and autoregressively based on the states of past tokens, regardless of future tokens. The resulting filtered IS ratios preserve token-wise local structure-aware variation while strongly smoothing noise spikes, yielding more stable and effective policy updates. Experimentally, KPO achieves superior results on challenging math reasoning datasets compared with state-of-the-art counterparts.
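The idea of treating the IS ratio as a latent state filtered causally across tokens can be illustrated with a textbook scalar Kalman filter. This is a generic sketch, not the KPO algorithm: the random-walk state model, filtering in log-ratio space, the noise variances, and the function name `causal_kalman_smooth` are all assumptions chosen for illustration.

```python
from typing import List


def causal_kalman_smooth(
    log_ratios: List[float],
    process_var: float = 0.01,  # assumed variance of the latent random walk
    obs_var: float = 1.0,       # assumed variance of per-token noise
) -> List[float]:
    """Filter per-token log IS ratios using only past and current tokens."""
    if not log_ratios:
        return []
    state = log_ratios[0]       # initialise the latent state at the first token
    state_var = obs_var
    filtered = [state]
    for observed in log_ratios[1:]:
        # Predict: a random walk keeps the mean and inflates the variance.
        prior_var = state_var + process_var
        # Update: blend the prediction with the new observation.
        gain = prior_var / (prior_var + obs_var)
        state = state + gain * (observed - state)
        state_var = (1.0 - gain) * prior_var
        filtered.append(state)
    return filtered


raw = [0.0, 0.0, 3.0, 0.0]          # a single noisy spike at token 2
smoothed = causal_kalman_smooth(raw)
```

Because each output depends only on the prefix seen so far, filtering a truncated sequence gives the same values as the corresponding prefix of the full result, which is the causal property the abstract emphasises; the spike at token 2 is damped rather than passed through unchanged.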

[20] How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning

Jiahao Yuan,Yike Xu,Jinyong Wen,Baokun Wang,Yang Chen,Xiaotong Lin,Wuliang Huang,Ziyi Gao,Xing Fu,Yu Cheng,Weiqiang Wang

Main category: cs.CL

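TL;DR: This paper systematically studies how causal, hybrid, and bidirectional attention masks affect user representation learning, and proposes Gradient-Guided Soft Masking (GGSM) to improve the training transition from causal to bidirectional attention, improving the quality of user embeddings.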

Details

Motivation: Decoder-only large language models are widely used as user behavior encoders, but the effect of attention masking on user embedding quality has not been thoroughly explored. Method: Within a unified contrastive learning framework trained on large-scale real-world Alipay data, systematically compares causal, hybrid, and bidirectional attention masks; proposes Gradient-Guided Soft Masking (GGSM), a gradient-based pre-warmup applied before a linear scheduler that gradually opens future attention during optimization. Result: On 9 industrial user cognition benchmarks (covering prediction, preference, and marketing sensitivity), GGSM yields more stable training and higher-quality bidirectional representations than causal, hybrid, and scheduler-only baselines, while remaining compatible with decoder pretraining. Conclusion: Attention mask design and the training transition strategy are critical when adapting decoder-only LLMs to user representation learning. Abstract: Decoder-only large language models are increasingly used as behavioral encoders for user representation learning, yet the impact of attention masking on the quality of user embeddings remains underexplored. In this work, we conduct a systematic study of causal, hybrid, and bidirectional attention masks within a unified contrastive learning framework trained on large-scale real-world Alipay data that integrates long-horizon heterogeneous user behaviors. To improve training dynamics when transitioning from causal to bidirectional attention, we propose Gradient-Guided Soft Masking, a gradient-based pre-warmup applied before a linear scheduler that gradually opens future attention during optimization. Evaluated on 9 industrial user cognition benchmarks covering prediction, preference, and marketing sensitivity tasks, our approach consistently yields more stable training and higher-quality bidirectional representations compared with causal, hybrid, and scheduler-only baselines, while remaining compatible with decoder pretraining. Overall, our findings highlight the importance of masking design and training transition in adapting decoder-only LLMs for effective user representation learning. Our code is available at https://github.com/JhCircle/Deepfind-GGSM.
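The linear scheduler that gradually opens future attention can be sketched as a soft mask whose above-diagonal entries ramp from 0 (causal) to 1 (bidirectional). Only the scheduler-only part is shown; the gradient-guided pre-warmup is not reproduced because the abstract does not specify it. The function names and the multiplicative form of the mask are assumptions for illustration.

```python
import numpy as np


def future_openness(step: int, total_steps: int) -> float:
    """Linear ramp in [0, 1]: 0 means causal, 1 means fully bidirectional."""
    return min(1.0, max(0.0, step / max(1, total_steps)))


def soft_attention_mask(seq_len: int, openness: float) -> np.ndarray:
    """Multiplicative mask with 1 on past/current positions.

    Entry (i, j) with j > i (a future position) is set to `openness`.
    """
    mask = np.full((seq_len, seq_len), openness, dtype=np.float64)
    mask[np.tril_indices(seq_len)] = 1.0  # lower triangle incl. diagonal
    return mask


halfway = soft_attention_mask(3, future_openness(50, 100))
```

At step 0 the mask equals the usual lower-triangular causal mask, and at the final step it is all ones; intermediate steps let future tokens contribute with a fractional weight.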

[21] UMEM: Unified Memory Extraction and Management Framework for Generalizable Memory

Yongshi Ye,Hui Jiang,Feihu Jiang,Tian Lan,Yichao Du,Biao Fu,Xiaodong Shi,Qianghuai Jia,Longyue Wang,Weihua Luo

Main category: cs.CL

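TL;DR: This paper proposes UMEM, a framework that jointly optimizes memory extraction and memory management for large language models. Using Semantic Neighborhood Modeling and a GRPO reward based on neighborhood-level marginal utility, it improves memory generalization and significantly outperforms baselines on multiple benchmarks.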

Details

Motivation: Existing methods treat memory extraction as a static process and optimize only memory management, which leads to poor memory generalization and the accumulation of instance-specific noise. Method: Proposes the Unified Memory Extraction and Management (UMEM) framework, which jointly optimizes an LLM to perform memory extraction and management simultaneously; introduces Semantic Neighborhood Modeling and optimizes with a neighborhood-level marginal utility reward via GRPO. Result: Significantly outperforms strong baselines on five benchmarks, with up to a 10.67% improvement on multi-turn interactive tasks, and maintains a monotonic growth curve during continuous evolution. Conclusion: Jointly modeling memory extraction and management, combined with utility evaluation at the semantic-neighborhood level, effectively improves the memory generalization and robustness of self-evolving agents. Abstract: Self-evolving memory serves as the trainable parameters for Large Language Model (LLM)-based agents, where extraction (distilling insights from experience) and management (updating the memory bank) must be tightly coordinated. Existing methods predominantly optimize memory management while treating memory extraction as a static process, resulting in poor generalization, where agents accumulate instance-specific noise rather than robust memories. To address this, we propose Unified Memory Extraction and Management (UMEM), a self-evolving agent framework that jointly optimizes a Large Language Model to simultaneously extract and manage memories. To mitigate overfitting to specific instances, we introduce Semantic Neighborhood Modeling and optimize the model with a neighborhood-level marginal utility reward via GRPO. This approach ensures memory generalizability by evaluating memory utility across clusters of semantically related queries. Extensive experiments across five benchmarks demonstrate that UMEM significantly outperforms highly competitive baselines, achieving up to a 10.67% improvement in multi-turn interactive tasks. Furthermore, UMEM maintains a monotonic growth curve during continuous evolution. Codes and models will be publicly released.

[22] Benchmarks Are Not That Out of Distribution: Word Overlap Predicts Performance

Woojin Chung,Jeonghoon Kim

Main category: cs.CL

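TL;DR: This paper studies pre-training data quality and finds that benchmark performance is largely driven by the overlap in word-frequency statistics between the pre-training corpus and the evaluation data. Word-level unigram cross-entropy is strongly inversely related to benchmark performance, indicating that most standard benchmarks are not truly out-of-distribution with respect to pre-training data.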

Details

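Motivation: Understanding what constitutes high-quality pre-training data is a central question in language model training; the paper asks whether benchmark performance is primarily driven by statistical pattern overlap, especially at the word-frequency level, between pre-training corpora and evaluation data. Method: Measures the overlap between pre-training corpora and 10 zero-shot benchmarks using word-level unigram cross-entropy and word frequency statistics; runs controlled experiments on 4 pre-training datasets of different sizes (8.5B to 60B tokens) and 5 model sizes (400M to 3B parameters). Result: Finds a robust inverse relationship between word-level unigram cross-entropy and benchmark performance; at similar cross-entropy, larger pre-training subsets yield better downstream results; word frequency statistics have an additional effect on benchmark scores. Conclusion: Most standard benchmarks are only weakly out-of-distribution relative to pre-training corpora, and simple word-overlap statistics predict their performance well, suggesting that current evaluation is biased and that more robust, genuinely out-of-distribution benchmarks are needed. Abstract: Understanding what constitutes high-quality pre-training data remains a central question in language model training. In this work, we investigate whether benchmark performance is primarily driven by the degree of statistical pattern overlap between pre-training corpora and evaluation datasets. We measure this overlap using word-level unigram cross-entropy and word frequency statistics, and perform controlled experiments across $10$ zero-shot benchmarks, $4$ pre-training datasets spanning $8.5\mathrm{B}$ to $60\mathrm{B}$ tokens, and model sizes ranging from $400\mathrm{M}$ to $3\mathrm{B}$ parameters. Our results demonstrate a robust inverse relationship between word-level unigram cross-entropy and benchmark performance, suggesting that widely used benchmarks are strongly influenced by word overlap between training and evaluation data. Thus, larger pre-training subsets with similar word-level unigram cross-entropy yield improved downstream results, indicating that word frequency statistics play an additional role in shaping benchmark scores. Taken together, these results suggest that many standard benchmarks are only weakly out-of-distribution relative to pre-training corpora, so that simple word-overlap statistics predict benchmark performance.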
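A minimal sketch of word-level unigram cross-entropy follows: word frequencies are estimated from a training corpus, and the average negative log-probability of the evaluation words is reported in bits per word. The whitespace tokenizer, add-one smoothing over the joint vocabulary, and the function names are assumptions; the paper's exact tokenization and smoothing are not given in the abstract.

```python
import math
from collections import Counter
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def unigram_cross_entropy(
    train_texts: Iterable[str], eval_texts: Iterable[str]
) -> float:
    """Bits per word of eval text under the training unigram distribution."""
    train_counts = Counter(w for t in train_texts for w in tokenize(t))
    eval_words = [w for t in eval_texts for w in tokenize(t)]
    if not eval_words:
        raise ValueError("evaluation text is empty")
    # Add-one smoothing over the joint vocabulary so unseen words get p > 0.
    vocab = set(train_counts) | set(eval_words)
    total = sum(train_counts.values()) + len(vocab)
    neg_log_prob = 0.0
    for word in eval_words:
        neg_log_prob -= math.log2((train_counts[word] + 1) / total)
    return neg_log_prob / len(eval_words)


train = ["the cat sat", "the dog sat"]
near = unigram_cross_entropy(train, ["the cat sat"])
far = unigram_cross_entropy(train, ["quantum flux capacitor"])
```

Evaluation text that shares vocabulary with the training corpus scores a lower cross-entropy than text made of unseen words, which is the overlap signal the paper relates to benchmark performance.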

[23] Targeted Syntactic Evaluation of Language Models on Georgian Case Alignment

Daniel Gallagher,Gerhard Heyer

Main category: cs.CL

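TL;DR: This paper evaluates transformer-based language models on Georgian's split-ergative case system. Models perform worst when assigning the ergative (ERG) and best on the nominative (NOM), and performance correlates with the frequency distribution of the case forms (NOM > DAT > ERG). The authors build a public dataset of 370 syntactic tests and propose a treebank-based method for generating minimal pairs with the Grew query language.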

Details

Motivation: To evaluate the grammatical competence of transformer-based language models on a rare and complex split-ergative system (Georgian), focusing on argument-role marking with the nominative, ergative, and dative cases, and to address the lack of syntactic evaluation benchmarks for low-resource languages. Method: Uses a treebank-based approach with the Grew query language to generate minimal pairs; builds a dataset of 370 samples across seven tasks, with three noun forms tested per sample; evaluates five encoder models and two decoder-only models with word- and/or sentence-level accuracy. Result: All models perform worst on the ergative (ERG) and best on the nominative (NOM); performance correlates with the frequency distribution of the three forms (NOM > DAT > ERG); the highly specific role of the ergative, together with scarce training data, likely explains the poor performance on this case. Conclusion: Models' grasp of split ergativity is limited by the sparsity and functional complexity of specific grammatical categories such as the ergative; the proposed method and public dataset provide a reusable framework for syntactic evaluation of low-resource languages. Abstract: This paper evaluates the performance of transformer-based language models on split-ergative case alignment in Georgian, a particularly rare system for assigning grammatical cases to mark argument roles. We focus on subject and object marking determined through various permutations of nominative, ergative, and dative noun forms. A treebank-based approach for the generation of minimal pairs using the Grew query language is implemented. We create a dataset of 370 syntactic tests made up of seven tasks containing 50-70 samples each, where three noun forms are tested in any given sample. Five encoder- and two decoder-only models are evaluated with word- and/or sentence-level accuracy metrics. Regardless of the specific syntactic makeup, models performed worst in assigning the ergative case correctly and strongest in assigning the nominative case correctly. Performance correlated with the overall frequency distribution of the three forms (NOM > DAT > ERG). Though data scarcity is a known issue for low-resource languages, we show that the highly specific role of the ergative along with a lack of available training data likely contributes to poor performance on this case. The dataset is made publicly available and the methodology provides an interesting avenue for future syntactic evaluations of languages where benchmarks are limited.

[24] Locomo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents

Yifei Li,Weidong Guo,Lingling Zhang,Rongman Xu,Muye Huang,Hui Liu,Lijiao Xu,Yu Xu,Jun Liu

Main category: cs.CL

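TL;DR: This paper introduces LoCoMo-Plus, a benchmark that evaluates how well large language models remember and apply latent cues in long conversations, such as implicit user state, goals, or values. It emphasizes cognitive memory over surface-level factual recall and proposes a unified evaluation framework based on constraint consistency.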

Details

Motivation: Existing benchmarks and evaluation protocols focus mainly on surface-level factual recall and fail to reflect realistic conversations, where appropriate responses depend on implicit cues such as user state, goals, or values that are never explicitly queried. Method: Builds the LoCoMo-Plus benchmark, which targets cognitive memory under cue-trigger semantic disconnect; proposes a new evaluation framework based on constraint consistency that replaces conventional string-matching metrics and explicit task-type prompting. Result: Experiments show that current mainstream models, retrieval methods, and memory systems still perform poorly on cognitive memory tasks, and that their failure modes are not captured by existing benchmarks. Conclusion: Cognitive memory is a key challenge for long-term dialogue systems; LoCoMo-Plus fills a gap in existing evaluation regarding the retention and application of implicit constraints, and its code and evaluation framework are open source. Abstract: Long-term conversational memory is a core capability for LLM-based dialogue systems, yet existing benchmarks and evaluation protocols primarily focus on surface-level factual recall. In realistic interactions, appropriate responses often depend on implicit constraints such as user state, goals, or values that are not explicitly queried later. To evaluate this setting, we introduce LoCoMo-Plus, a benchmark for assessing cognitive memory under cue-trigger semantic disconnect, where models must retain and apply latent constraints across long conversational contexts. We further show that conventional string-matching metrics and explicit task-type prompting are misaligned with such scenarios, and propose a unified evaluation framework based on constraint consistency. Experiments across diverse backbone models, retrieval-based methods, and memory systems demonstrate that cognitive memory remains challenging and reveal failures not captured by existing benchmarks. Our code and evaluation framework are publicly available at: https://github.com/xjtuleeyf/Locomo-Plus.

[25] Macaron: Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling

Alaa Elsetohy,Sama Hadhoud,Haryo Akbarianto Wibowo,Chenxi Whitehouse,Genta Indra Winata,Fajri Koto,Alham Fikri Aji

Main category: cs.CL

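TL;DR: Macaron is a culture-aware multilingual reasoning benchmark whose template-based design separates reasoning type from cultural aspect. It covers 20 languages (including low-resource ones) and 22 cultural aspects, and reveals substantial performance gaps for current multilingual LLMs on local-language and culture-specific tasks.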

Details

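Motivation: Existing multilingual benchmarks either keep English-centric scenarios (obtained by translation) or lack control over the type of reasoning required; a benchmark is needed that systematically controls both reasoning type and cultural context. Method: Proposes the Macaron benchmark, built on 100 language-agnostic templates covering 7 reasoning types and 22 cultural aspects; native annotators create scenario-aligned multiple-choice questions in English and the local language, along with systematically derived True/False questions; the benchmark spans 20 countries/cultural contexts, 10 scripts, and 20 languages. Result: In zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models perform best and reach near-parity between English and local languages; open-weight models degrade substantially in local languages and often approach chance on True/False questions; culture-grounded mathematical and counting templates are the hardest. Conclusion: Macaron fills the gap for a culture-aware multilingual benchmark with controlled reasoning, exposes fundamental limitations of current multilingual LLMs in culturally grounded reasoning, especially for low-resource languages, and offers a new standard for model development and evaluation. Abstract: Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates that cover 7 reasoning types and 22 cultural aspects, native annotators create scenario-aligned English and local-language multiple-choice questions and systematically derived True/False questions. Macaron contains 11,862 instances spanning 20 countries/cultural contexts, 10 scripts, and 20 languages (including low-resource ones like Amharic, Yoruba, Zulu, Kyrgyz, and some Arabic dialects). In zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models achieve the strongest performance and near-parity between English and local languages, while open-weight models degrade substantially in local languages and often approach chance on T/F tasks. Culture-grounded mathematical and counting templates are consistently the hardest. The data can be accessed at https://huggingface.co/datasets/AlaaAhmed2444/Macaron.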

[26] Reinforced Curriculum Pre-Alignment for Domain-Adaptive VLMs

Yuming Yan,Shuo Yang,Kai Tang,Sihong Chen,Yang Zhang,Ke Xu,Dan Hu,Qun Yu,Pengfei Hu,Edith C. H. Ngai

Main category: cs.CL

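TL;DR: This paper proposes Reinforced Curriculum Pre-Alignment (RCPA), a method for efficiently adapting vision-language models (VLMs) to specialized domains without harming their general multimodal capabilities. Through a curriculum-aware progressive modulation mechanism, it applies partial output constraints early on to introduce domain knowledge safely, then shifts to full generation optimization, balancing domain adaptation against the preservation of general capabilities.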

Details

Motivation: Supervised fine-tuning (SFT) tends to cause catastrophic forgetting, while continual pretraining is hard to apply to VLMs because of its high compute cost and the unavailability of pretraining data; RL methods such as GRPO preserve general capabilities but are prone to optimization collapse when the model lacks initial domain knowledge. An efficient, stable post-training adaptation method that balances domain adaptation with general capability is therefore needed. Method: Proposes the RCPA framework, a two-phase curriculum-style adaptation: the early phase uses partial output constraints to expose the model safely to new domain concepts; the later phase gradually transitions to full generation optimization, refining responses and aligning them with domain-specific preferences. This mechanism provides progressive, controllable knowledge injection. Result: Extensive experiments on specialized domains (such as medical imaging and geometric reasoning) and general benchmarks show that RCPA improves domain performance while maintaining or even enhancing the original general multimodal capabilities, outperforming baselines such as SFT and GRPO. Conclusion: RCPA offers a practical new paradigm for building high-performing, generalizable, and adaptable VLMs, addressing the core tension between domain adaptation and the preservation of general capabilities. Abstract: Vision-Language Models (VLMs) demonstrate remarkable general-purpose capabilities but often fall short in specialized domains such as medical imaging or geometric problem-solving. Supervised Fine-Tuning (SFT) can enhance performance within a target domain, but it typically causes catastrophic forgetting, limiting its generalization. The central challenge, therefore, is to adapt VLMs to new domains while preserving their general-purpose capabilities. Continual pretraining is effective for expanding knowledge in Large Language Models (LLMs), but it is less feasible for VLMs due to prohibitive computational costs and the unavailability of pretraining data for most open-source models. This necessitates efficient post-training adaptation methods. Reinforcement learning (RL)-based approaches such as Group Relative Policy Optimization (GRPO) have shown promise in preserving general abilities, yet they often fail in domain adaptation scenarios where the model initially lacks sufficient domain knowledge, leading to optimization collapse. To bridge this gap, we propose Reinforced Curriculum Pre-Alignment (RCPA), a novel post-training paradigm that introduces a curriculum-aware progressive modulation mechanism. In the early phase, RCPA applies partial output constraints to safely expose the model to new domain concepts. As the model's domain familiarity increases, training gradually transitions to full generation optimization, refining responses and aligning them with domain-specific preferences. This staged adaptation balances domain knowledge acquisition with the preservation of general multimodal capabilities. Extensive experiments across specialized domains and general benchmarks validate the effectiveness of RCPA, establishing a practical pathway toward building high-performing and domain-adaptive VLMs.

[27] Deep Learning-based Method for Expressing Knowledge Boundary of Black-Box LLM

Haotian Sheng,Heyong Wang,Ming Hong,Hongman He,Junqiu Liu

Main category: cs.CL

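TL;DR: This paper proposes LSCL, a method for expressing the knowledge boundaries of black-box large language models. Within a knowledge distillation framework, it models the relationship between the input question, output answer, and token probabilities on one side and the model's internal knowledge state on the other, improving the model's ability to quantify and express its own knowledge boundaries.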

Details

Motivation: Existing work on knowledge boundary expression mostly targets white-box LLMs, while methods for black-box LLMs that offer only API access remain largely unexplored; LLMs also tend to hallucinate because they lack awareness of their internal knowledge state, so a mechanism for expressing their knowledge boundaries is needed. Method: Proposes LSCL (LLM-Supervised Confidence Learning), which builds a deep learning model within a knowledge distillation framework; taking the black-box LLM's input question, output answer, and token probabilities as inputs, it learns a mapping to the LLM's internal knowledge state and thereby quantifies and expresses its knowledge boundaries; an adaptive alternative is designed for cases where token probabilities are unavailable. Result: Experiments on multiple public datasets and prominent black-box LLMs show that LSCL significantly outperforms baseline models on metrics such as accuracy and recall; the adaptive alternative performs close to LSCL and still surpasses the baselines. Conclusion: LSCL offers a scalable, practical way to express the knowledge boundaries of black-box LLMs, helping to mitigate hallucination and improve model reliability and usefulness. Abstract: Large Language Models (LLMs) have achieved remarkable success; however, the emergence of content generation distortion (hallucination) limits their practical applications. The core cause of hallucination lies in LLMs' lack of awareness regarding their stored internal knowledge, preventing them from expressing their knowledge state on questions beyond their internal knowledge boundaries, as humans do. However, existing research on knowledge boundary expression primarily focuses on white-box LLMs, leaving methods suitable for black-box LLMs, which offer only API access without revealing internal parameters, largely unexplored. Against this backdrop, this paper proposes LSCL (LLM-Supervised Confidence Learning), a deep learning-based method for expressing the knowledge boundaries of black-box LLMs. Based on the knowledge distillation framework, this method designs a deep learning model. Taking the input question, output answer, and token probability from a black-box LLM as inputs, it constructs a mapping between the inputs and the model's internal knowledge state, enabling the quantification and expression of the black-box LLM's knowledge boundaries. Experiments conducted on diverse public datasets and with multiple prominent black-box LLMs demonstrate that LSCL effectively assists black-box LLMs in accurately expressing their knowledge boundaries. It significantly outperforms existing baseline models on metrics such as accuracy and recall rate. Furthermore, considering scenarios where some black-box LLMs do not support access to token probability, an adaptive alternative method is proposed. The performance of this alternative approach is close to that of LSCL and surpasses baseline models.

[28] Beyond Confidence: The Rhythms of Reasoning in Generative Models

Deyuan Liu,Zecheng Wang,Zhanyue Qin,Zhiying Tu,Dianhui Chu,Dianbo Sui

Main category: cs.CL

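TL;DR: This paper proposes a new metric, the Token Constraint Bound ($\delta_{\mathrm{TCB}}$), which quantifies the maximum internal-state perturbation a large language model (LLM) can withstand while keeping its dominant token prediction unchanged, and uses it to assess local prediction robustness.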

Details

Motivation: Existing metrics such as accuracy and perplexity cannot effectively assess an LLM's local prediction robustness to small changes in the input context, because normalized output probabilities obscure how stable the model's internal state really is under perturbation. Method: Proposes the Token Constraint Bound ($\delta_{\mathrm{TCB}}$), a metric grounded in the geometry of the output embedding space and defined as the maximum internal-state perturbation the LLM can withstand before its dominant next-token prediction changes significantly. Result: Experiments show that $\delta_{\mathrm{TCB}}$ correlates with effective prompt engineering and reveals critical prediction instabilities missed by perplexity, particularly in in-context learning and text generation. Conclusion: $\delta_{\mathrm{TCB}}$ provides a principled, complementary framework for understanding and potentially improving the contextual stability of LLM predictions. Abstract: Large Language Models (LLMs) exhibit impressive capabilities yet suffer from sensitivity to slight input context variations, hampering reliability. Conventional metrics like accuracy and perplexity fail to assess local prediction robustness, as normalized output probabilities can obscure the underlying resilience of an LLM's internal state to perturbations. We introduce the Token Constraint Bound ($\delta_{\mathrm{TCB}}$), a novel metric that quantifies the maximum internal state perturbation an LLM can withstand before its dominant next-token prediction significantly changes. Intrinsically linked to output embedding space geometry, $\delta_{\mathrm{TCB}}$ provides insights into the stability of the model's internal predictive commitment. Our experiments show $\delta_{\mathrm{TCB}}$ correlates with effective prompt engineering and uncovers critical prediction instabilities missed by perplexity during in-context learning and text generation. $\delta_{\mathrm{TCB}}$ offers a principled, complementary approach to analyze and potentially improve the contextual stability of LLM predictions.

[29] I can tell whether you are a Native Hawlêri Speaker! How ANN, CNN, and RNN perform in NLI-Native Language Identification

Hardi Garari,Hossein Hassani

Main category: cs.CL

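TL;DR: This paper studies Native Language Identification (NLI) for Hewlêri, a subdialect of Sorani Kurdish. It builds the first Hewlêri speech dataset and, in experiments with ANN, CNN, and RNN models, finds that the RNN reaches the highest accuracy, 95.92%, on 5-second audio segments.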

Details

Motivation: Existing NLI research concentrates on widely studied languages such as English and German; work on less-resourced languages such as Kurdish and on their dialects and subdialects is scarce, and NLI for the Hewlêri subdialect of Sorani Kurdish in particular has not been studied. Method: Collects about 24 hours of interview speech from 40 native or non-native Hewlêri speakers to build a dedicated speech dataset; designs and trains three neural network models (ANN, CNN, RNN) and runs 66 experiments covering audio durations from 1 to 60 seconds, undersampling, oversampling, and cross-validation. Result: The RNN model achieves the highest accuracy, 95.92%, on 5-second audio segments with an 80:10:10 data split; the resulting dataset is the first NLI speech dataset for the Hewlêri subdialect. Conclusion: The study shows that deep learning-based NLI can be applied effectively to subdialect identification in a less-resourced language; the dataset and experimental results provide a foundation for Kurdish linguistics, computational linguistics, and related applications. Abstract: Native Language Identification (NLI) is a task in Natural Language Processing (NLP) that typically determines the native language of an author through their writing or a speaker through their speaking. It has various applications in different areas, such as forensic linguistics and general linguistics studies. Although considerable research has been conducted on NLI regarding two different languages, such as English and German, the literature indicates a significant gap regarding NLI for dialects and subdialects. The gap becomes wider in less-resourced languages such as Kurdish. This research focuses on NLI within the context of a subdialect of Sorani (Central) Kurdish. It aims to investigate the NLI for Hewlêri, a subdialect spoken in Hewlêr (Erbil), the capital of the Kurdistan Region of Iraq. We collected about 24 hours of speech by recording interviews with 40 native or non-native Hewlêri speakers, 17 female and 23 male. We created three Neural Network-based models: Artificial Neural Network (ANN), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN), which were evaluated through 66 experiments, covering various time-frames from 1 to 60 seconds, undersampling, oversampling, and cross-validation. The RNN model showed the highest accuracy of 95.92% for 5-second audio segmentation, using an 80:10:10 data splitting scheme. The created dataset is the first speech dataset for NLI on the Hewlêri subdialect in the Sorani Kurdish dialect, which can be of benefit to various research areas.

[30] C-MOP: Integrating Momentum and Boundary-Aware Clustering for Enhanced Prompt Evolution

Binwei Yan,Yifei Fu,Mingjian Zhu,Hanting Chen,Mingxuan Yuan,Yunhe Wang,Hailin Hu

Main category: cs.CL

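TL;DR: This paper proposes C-MOP, a framework that improves automatic prompt optimization for large language models through Boundary-Aware Contrastive Sampling (BACS) and Momentum-Guided Semantic Clustering (MGSC), outperforming existing methods.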

Details

Motivation: Existing automatic prompt optimization methods often suffer from noisy and conflicting update signals, which makes stable prompt optimization difficult. Method: Proposes the C-MOP framework with two core modules: (1) Boundary-Aware Contrastive Sampling (BACS), which uses batch-level information to mine Hard Negatives, Anchors, and Boundary Pairs, characterizing the typical representation and decision boundaries of positive and negative prompt samples; (2) Momentum-Guided Semantic Clustering (MGSC), which introduces a textual momentum mechanism with temporal decay to distill a stable consensus from gradients that fluctuate across iterations, mitigating semantic conflicts. Result: C-MOP consistently outperforms SOTA baselines such as PromptWizard and ProTeGi, with average gains of 1.58% and 3.35%; it also enables a general LLM with only 3B activated parameters to surpass a 70B domain-specific dense LLM. Conclusion: Through structured sampling and momentum-guided clustering, C-MOP mitigates noise and semantic conflicts in prompt optimization, improving both stability and effectiveness and offering a new paradigm for efficient prompt engineering. Abstract: Automatic prompt optimization is a promising direction to boost the performance of Large Language Models (LLMs). However, existing methods often suffer from noisy and conflicting update signals. In this research, we propose C-MOP (Cluster-based Momentum Optimized Prompting), a framework that stabilizes optimization via Boundary-Aware Contrastive Sampling (BACS) and Momentum-Guided Semantic Clustering (MGSC). Specifically, BACS utilizes batch-level information to mine tripartite features (Hard Negatives, Anchors, and Boundary Pairs) to precisely characterize the typical representation and decision boundaries of positive and negative prompt samples. To resolve semantic conflicts, MGSC introduces a textual momentum mechanism with temporal decay that distills persistent consensus from fluctuating gradients across iterations. Extensive experiments demonstrate that C-MOP consistently outperforms SOTA baselines like PromptWizard and ProTeGi, yielding average gains of 1.58% and 3.35%. Notably, C-MOP enables a general LLM with 3B activated parameters to surpass a 70B domain-specific dense LLM, highlighting its effectiveness in driving precise prompt evolution. The code is available at https://github.com/huawei-noah/noah-research/tree/master/C-MOP.

[31] Diagnosing Structural Failures in LLM-Based Evidence Extraction for Meta-Analysis

Zhiyin Tan,Jennifer D'Souza

Main category: cs.CL

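TL;DR: This paper proposes a structural diagnostic framework for evaluating how well large language models (LLMs) extract evidence for systematic reviews and meta-analyses. It finds systematic deficiencies in structural tasks such as variable binding, role relations, and numeric attribution, so current LLMs do not meet the reliability requirements of automated meta-analysis.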

Details

Motivation: Despite rapid progress in large language models, it remains unclear whether they can meet the requirements of structured evidence extraction in systematic reviews and meta-analyses, such as preserving roles, methods, and effect-size attribution. Method: Proposes a structural diagnostic framework built on schema-constrained queries of increasing relational and numerical complexity; constructs a manually curated corpus spanning five scientific domains, together with a unified query suite and evaluation protocol; evaluates two SOTA LLMs under per-document and long-context multi-document inputs. Result: LLMs perform moderately on single-property queries, but performance drops sharply on tasks that require stable binding between variables, roles, statistical methods, and effect sizes; extraction of full meta-analytic association tuples is almost entirely unreliable; long context makes the errors worse; downstream aggregation amplifies even minor upstream errors. Conclusion: The limitations of LLMs stem from structural failures (role reversals, cross-analysis binding drift, compression of dense results, numeric misattribution) rather than entity recognition errors, indicating that current LLMs lack the structural fidelity, relational binding, and numerical grounding needed for reliable automated meta-analysis. Abstract: Systematic reviews and meta-analyses rely on converting narrative articles into structured, numerically grounded study records. Despite rapid advances in large language models (LLMs), it remains unclear whether they can meet the structural requirements of this process, which hinge on preserving roles, methods, and effect-size attribution across documents rather than on recognizing isolated entities. We propose a structural, diagnostic framework that evaluates LLM-based evidence extraction as a progression of schema-constrained queries with increasing relational and numerical complexity, enabling precise identification of failure points beyond atom-level extraction. Using a manually curated corpus spanning five scientific domains, together with a unified query suite and evaluation protocol, we evaluate two state-of-the-art LLMs under both per-document and long-context, multi-document input regimes. Across domains and models, performance remains moderate for single-property queries but degrades sharply once tasks require stable binding between variables, roles, statistical methods, and effect sizes. Full meta-analytic association tuples are extracted with near-zero reliability, and long-context inputs further exacerbate these failures. Downstream aggregation amplifies even minor upstream errors, rendering corpus-level statistics unreliable. Our analysis shows that these limitations stem not from entity recognition errors, but from systematic structural breakdowns, including role reversals, cross-analysis binding drift, instance compression in dense result sections, and numeric misattribution, indicating that current LLMs lack the structural fidelity, relational binding, and numerical grounding required for automated meta-analysis. The code and data are publicly available on GitHub (https://github.com/zhiyintan/LLM-Meta-Analysis).

[32] The CLEF-2026 FinMMEval Lab: Multilingual and Multimodal Evaluation of Financial AI Systems

Zhuohan Xie,Rania Elbadry,Fan Zhang,Georgi Georgiev,Xueqing Peng,Lingfei Qian,Jimin Huang,Dimitar Dimitrov,Vanshikaa Jani,Yuyang Dai,Jiahui Geng,Yuxia Wang,Ivan Koychev,Veselin Stoyanov,Preslav Nakov

Main category: cs.CL

TL;DR: FinMMEval Lab at CLEF 2026 introduces the first multilingual and multimodal evaluation framework for financial LLMs, featuring three tasks—Financial Exam QA, PolyFiQA, and Financial Decision Making—to assess reasoning, generalization, and action across languages and modalities.

Details

Motivation: Existing financial NLP benchmarks are largely monolingual, text-only, and narrow, failing to reflect real-world multilingual and multimodal financial AI needs. Method: Design and release of a new multilingual, multimodal evaluation framework with three interconnected tasks: Financial Exam Question Answering, Multilingual Financial Question Answering (PolyFiQA), and Financial Decision Making. Result: A comprehensive, publicly available evaluation suite enabling robust, transparent, and globally inclusive assessment of financial LLMs. Conclusion: FinMMEval 2026 sets a new standard for evaluating financial LLMs across languages and modalities, fostering reproducible and equitable progress in financial AI. Abstract: We present the setup and the tasks of the FinMMEval Lab at CLEF 2026, which introduces the first multilingual and multimodal evaluation framework for financial Large Language Models (LLMs). While recent advances in financial natural language processing have enabled automated analysis of market reports, regulatory documents, and investor communications, existing benchmarks remain largely monolingual, text-only, and limited to narrow subtasks. FinMMEval 2026 addresses this gap by offering three interconnected tasks that span financial understanding, reasoning, and decision-making: Financial Exam Question Answering, Multilingual Financial Question Answering (PolyFiQA), and Financial Decision Making. Together, these tasks provide a comprehensive evaluation suite that measures models' ability to reason, generalize, and act across diverse languages and modalities. The lab aims to promote the development of robust, transparent, and globally inclusive financial AI systems, with datasets and evaluation resources publicly released to support reproducible research.

[33] SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora

Masataka Yoneda,Yusuke Matsushita,Go Kamoda,Kohei Suenaga,Takuya Akiba,Masaki Waga,Sho Yokoi

Main category: cs.CL

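TL;DR: This paper presents an ultra-fast, flexible search algorithm for trillion-scale natural language corpora. It builds on suffix-array string matching, combined with a disk-aware design and dynamic corpus-aware pruning, completes searches in under 0.3 seconds, and handles semantic variations (substitution, insertion, and deletion).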

Details

Motivation: Existing methods suffer from high search latency on large corpora and struggle with the combinatorial explosion caused by semantic variation in queries. Method: String matching based on suffix arrays; disk-aware exact lookup; dynamic corpus-aware pruning; statistical properties of natural language are exploited to suppress exponential growth of the search space. Result: On FineWeb-Edu (1.4T tokens), search latency is significantly lower than that of infini-gram, infini-gram mini, and SoftMatcha; the method identifies benchmark contamination that other approaches miss; an online soft-search demo covers seven languages. Conclusion: The method greatly reduces search latency while maintaining high accuracy, combines scalability with practicality, and offers a new tool for large-scale corpus analysis and data quality assessment. Abstract: We present an ultra-fast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds while handling semantic variations (substitution, insertion, and deletion). Our approach employs string matching based on suffix arrays that scales well with corpus size. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: fast exact lookup enabled by a disk-aware design, and dynamic corpus-aware pruning. We theoretically show that the proposed method suppresses exponential growth in the search space with respect to query length by leveraging statistical properties of natural language. In experiments on FineWeb-Edu (Lozhkov et al., 2024) (1.4T tokens), we show that our method achieves significantly lower search latency than existing methods: infini-gram (Liu et al., 2024), infini-gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, we demonstrate that our method identifies benchmark contamination in training corpora, unidentified by existing approaches. We also provide an online demo of fast, soft search across corpora in seven languages.
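
The suffix-array lookup that this entry builds on can be illustrated with a toy example. The sketch below is not the SoftMatcha 2 implementation: it covers exact matching only (no soft matching, disk-aware layout, or pruning), builds the array naively in memory, and the function names `build_suffix_array` and `count_occurrences` are hypothetical.

```python
from typing import List, Sequence


def build_suffix_array(tokens: Sequence[str]) -> List[int]:
    """Return suffix start positions sorted by the suffix they begin.

    Naive construction for illustration only; real systems use
    specialised construction algorithms and on-disk storage.
    """
    tokens = list(tokens)
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_occurrences(tokens: Sequence[str], sa: List[int],
                      pattern: Sequence[str]) -> int:
    """Count exact occurrences of `pattern` with two binary searches."""
    tokens, pattern = list(tokens), list(pattern)
    m = len(pattern)

    def prefix(rank: int) -> List[str]:
        # First m tokens of the suffix at the given rank in the array.
        return tokens[sa[rank]: sa[rank] + m]

    # Lower bound: first suffix whose m-token prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-token prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


corpus = "the cat sat on the mat the cat ran".split()
sa = build_suffix_array(corpus)
print(count_occurrences(corpus, sa, ["the", "cat"]))
```

Because all suffixes sharing a prefix sit in one contiguous block of the sorted array, each query costs a logarithmic number of comparisons in the corpus size, which is the property that makes this family of indexes attractive for very large corpora.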

[34] Computational Phenomenology of Temporal Experience in Autism: Quantifying the Emotional and Narrative Characteristics of Lived Unpredictability

Kacper Dudzic,Karolina Drożdż,Maciej Wodziński,Anastazja Szuła,Marcin Moskalewicz

Main category: cs.CL

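TL;DR: This study combines three methods (phenomenological interviews, computational analysis of a corpus of autistic narratives, and a computational replication of a narrative-flow study) and shows that the core of temporal difficulties in autistic individuals is the 'unpredictability of experience' rather than a deficit in narrative construction. The computational analysis finds that their temporal vocabulary is more negatively valenced, particularly in the 'Immediacy & Suddenness' category, and that their autobiographical narratives are structurally authentic.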

Details

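Motivation: To bridge the gap between phenomenological and computational methods and to overcome three limitations of existing research: the dominance of deficit-based medical models, small samples in qualitative work, and the lack of phenomenological grounding in computational studies. Method: A three-stage mixed-methods design: A) structured phenomenological interviews with autistic individuals using the Transdiagnostic Assessment of Temporal Experience; B) construction of a dedicated corpus of autistic autobiographies, followed by computational linguistic analysis; C) replication of a computational narrative-flow study to assess the phenomenological authenticity of autistic autobiographical texts. Result: The interviews show that the most significant difference between the autistic and control groups lies in the unpredictability of experience; the computational analysis confirms that the temporal lexicon is more negatively valenced, most notably words in the 'Immediacy & Suddenness' category (e.g., unpredictably, precipitously); the narrative-flow analysis shows that the autobiographical texts are structurally closer to real autobiographies than to fictional texts. Conclusion: The temporal challenges of autistic individuals are essentially a matter of the 'unpredictability of lived experience', rooted in the contents of experience itself rather than in deficits of narrative ability or temporal representation; the study supports rethinking temporal differences from a neurodiversity perspective. Abstract: Disturbances in temporality, such as desynchronization with the social environment and its unpredictability, are considered core features of autism with a deep impact on relationships. However, limitations regarding research on this issue include: 1) the dominance of deficit-based medical models of autism, 2) sample size in qualitative research, and 3) the lack of phenomenological anchoring in computational research. To bridge the gap between phenomenological and computational approaches and overcome sample-size limitations, our research integrated three methodologies. Study A: structured phenomenological interviews with autistic individuals using the Transdiagnostic Assessment of Temporal Experience. Study B: computational analysis of an autobiographical corpus of autistic narratives built for this purpose. Study C: a replication of a computational study using narrative flow measures to assess the perceived phenomenological authenticity of autistic autobiographies. Interviews revealed that the most significant differences between the autistic and control groups concerned unpredictability of experience. Computational results mirrored these findings: the temporal lexicon in autistic narratives was significantly more negatively valenced - particularly the "Immediacy & Suddenness" category. Outlier analysis identified terms associated with perceived discontinuity (unpredictably, precipitously, and abruptly) as highly negative.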
The computational analysis of narrative flow found that the autistic narratives contained within the corpus quantifiably resemble autobiographical stories more than imaginary ones. Overall, the temporal challenges experienced by autistic individuals were shown to primarily concern lived unpredictability and stem from the contents of lived experience, and not from autistic narrative construction.

[35] Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models

Mingyu Cao,Alvaro Correia,Christos Louizos,Shiwei Liu,Lu Yin

Main category: cs.CL

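TL;DR: SOAR is a training-free decoding algorithm that adaptively adjusts the order in which masked positions are decoded during denoising. When confidence is low it widens the search to avoid premature commitment, and when confidence is high it decodes positions in parallel for efficiency, improving the balance between generation quality and inference speed of diffusion language models (DLMs) on mathematical reasoning and code generation tasks.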

Details

Motivation: Standard greedy decoding makes locally optimal choices that can lead to a suboptimal unmasking order, and it performs poorly on tasks that require complex reasoning. Method: The SOAR decoding algorithm adjusts its strategy dynamically according to the model's uncertainty about its predictions at each position: at low confidence it widens the search over candidate positions to decode, and at high confidence it decodes multiple positions in parallel to reduce the number of iterations. Result: On mathematical reasoning and code generation benchmarks including GSM8K, MBPP, and HumanEval, SOAR significantly improves generation quality on Dream-7B and LLaDA-8B while keeping inference speed competitive. Conclusion: SOAR gives diffusion language models a practical, training-free decoding scheme that balances generation quality and decoding efficiency. Abstract: Diffusion Language Models (DLMs) generate text by iteratively denoising a masked sequence, repeatedly deciding which positions to commit at each step. Standard decoding follows a greedy rule: unmask the most confident positions, yet this local choice can lock the model into a suboptimal unmasking order, especially on reasoning-heavy prompts. We present SOAR, a training-free decoding algorithm that adapts its behavior to the model's uncertainty. When confidence is low, SOAR briefly widens the search over alternative unmasking decisions to avoid premature commitments; when confidence is high, it collapses the search and decodes many positions in parallel to reduce the number of denoising iterations. Across mathematical reasoning and code generation benchmarks (GSM8K, MBPP, HumanEval) on Dream-7B and LLaDA-8B, SOAR improves generation quality while maintaining competitive inference speed, offering a practical way to balance quality and efficiency in DLM decoding.
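
The confidence switch described in this entry might look roughly like the following. This is a loose sketch and not the SOAR algorithm itself: the abstract does not give the switching rule, so the single `threshold`, the `beam_width` parameter, and the function name `plan_step` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def plan_step(confidence: Dict[int, float],
              threshold: float = 0.9,
              beam_width: int = 2) -> Tuple[str, List[int]]:
    """Decide how to treat the still-masked positions at one denoising step.

    `confidence` maps each masked position to the model's top-token
    probability there (hypothetical input format).
    """
    ranked = sorted(confidence, key=confidence.get, reverse=True)
    confident = [pos for pos in ranked if confidence[pos] >= threshold]
    if confident:
        # High confidence: commit every confident position in parallel.
        return "accelerate", confident
    # Low confidence: branch over a few alternative positions instead of
    # committing greedily to the single most confident one.
    return "search", ranked[:beam_width]


print(plan_step({0: 0.95, 1: 0.97, 2: 0.40}))
print(plan_step({0: 0.50, 1: 0.60, 2: 0.40}))
```

A full decoder would score the branches returned in the "search" case and keep the best partial sequences, which this sketch leaves out.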

[36] LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules

Ivan Vulić,Adam Grycner,Quentin de Laroussilhe,Jonas Pfeiffer

Main category: cs.CL

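TL;DR: This paper proposes LoRA-Squeeze, which improves standard LoRA by training at a high rank first and then compressing, either post hoc or by reducing the rank dynamically during training. It uses RSVD to produce low-rank adapters efficiently and significantly improves the trade-off between parameter efficiency and performance across many tasks.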

Details

Motivation: Standard LoRA requires an optimal rank to be chosen in advance, depends heavily on hyperparameters, and makes modules of heterogeneous rank complex to deploy; a more flexible, robust, and easily deployable rank-adaptive method is needed. Method: LoRA-Squeeze first fine-tunes the model at a higher source rank, reconstructs the full weight update matrix, and then compresses it with Randomized Singular Value Decomposition (RSVD) into a LoRA module of a specified target rank; it supports both post-hoc compression and gradual rank annealing during training. Result: Validated on 13 text and 10 vision-language tasks: low-rank adapters obtained by post-hoc compression often outperform those trained directly at the target rank; a small number of fine-tuning steps at the target rank improves performance further; the gradual rank-annealing variant consistently achieves the best size-performance trade-off. Conclusion: Learning at high rank and then compressing beats optimizing directly at low rank. LoRA-Squeeze offers a simple, general, and efficient paradigm for LoRA rank optimization and substantially eases the rank sensitivity and deployment burden of standard LoRA. Abstract: Despite its huge number of variants, standard Low-Rank Adaptation (LoRA) is still a dominant technique for parameter-efficient fine-tuning (PEFT). Nonetheless, it faces persistent challenges, including the pre-selection of an optimal rank and rank-specific hyper-parameters, as well as the deployment complexity of heterogeneous-rank modules and more sophisticated LoRA derivatives. In this work, we introduce LoRA-Squeeze, a simple and efficient methodology that aims to improve standard LoRA learning by changing LoRA module ranks either post-hoc or dynamically during training. Our approach posits that it is better to first learn an expressive, higher-rank solution and then compress it, rather than learning a constrained, low-rank solution directly. The method involves fine-tuning with a deliberately high(er) source rank, reconstructing or efficiently approximating the reconstruction of the full weight update matrix, and then using Randomized Singular Value Decomposition (RSVD) to create a new, compressed LoRA module at a lower target rank. Extensive experiments across 13 text and 10 vision-language tasks show that post-hoc compression often produces lower-rank adapters that outperform those trained directly at the target rank, especially if a small number of fine-tuning steps at the target rank is allowed. Moreover, a gradual, in-tuning rank annealing variant of LoRA-Squeeze consistently achieves the best LoRA size-performance trade-off.
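
The post-hoc compression step can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: it forms the dense update `B @ A` explicitly (the abstract notes the reconstruction can also be approximated efficiently), uses a basic randomized range finder for the RSVD, and splits the singular values evenly between the two new factors. The name `squeeze_lora` and the `oversample` and `seed` parameters are hypothetical.

```python
import numpy as np


def squeeze_lora(B, A, target_rank, oversample=8, seed=0):
    """Compress a LoRA pair (B: d_out x r, A: r x d_in) to `target_rank`."""
    rng = np.random.default_rng(seed)
    delta_w = B @ A                                   # full weight update
    sketch = min(target_rank + oversample, min(delta_w.shape))

    # Randomized range finder: project onto a random low-dimensional
    # subspace and orthonormalise the result.
    omega = rng.standard_normal((delta_w.shape[1], sketch))
    Q, _ = np.linalg.qr(delta_w @ omega)

    # Exact SVD of the small projected matrix, lifted back by Q.
    U_small, S, Vt = np.linalg.svd(Q.T @ delta_w, full_matrices=False)
    U = Q @ U_small

    # Keep the leading components and share sqrt(S) between the factors.
    root = np.sqrt(S[:target_rank])
    B_new = U[:, :target_rank] * root
    A_new = root[:, None] * Vt[:target_rank]
    return B_new, A_new


rng = np.random.default_rng(1)
B = rng.standard_normal((64, 8))     # source rank 8
A = rng.standard_normal((8, 48))
B4, A4 = squeeze_lora(B, A, target_rank=4)
print(B4.shape, A4.shape)
```

The compressed pair has the same interface as a LoRA module trained at the target rank, so it can be loaded, or briefly fine-tuned further, in the same way.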

[37] Linguistic Indicators of Early Cognitive Decline in the DementiaBank Pitt Corpus: A Statistical and Machine Learning Study

Artsvik Avetisyan,Sachin Kumar

Main category: cs.CL

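TL;DR: Using transcripts of spontaneous speech from the DementiaBank Pitt Corpus, this study evaluates how well three linguistic representations (raw cleaned text, a POS-enhanced representation, and a POS-only syntactic representation), combined with logistic regression and random forest models, identify dementia. Subject-level cross-validation and non-parametric statistical tests are used to verify the clinical reliability and interpretability of the results.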

Details

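Motivation: Early cognitive decline often shows up as subtle changes in spontaneous language. Finding linguistically interpretable markers of dementia helps build transparent, clinically trustworthy language-based screening. Method: Three linguistic representations (raw text, POS-enhanced, POS-only) are built from spontaneous speech transcripts in the DementiaBank Pitt Corpus and modeled with logistic regression and random forests; subject-level five-fold cross-validation prevents speaker overlap, and global feature importance analysis together with Mann-Whitney U tests (with Cliff's delta effect sizes) provides interpretability and statistical validation. Result: Model performance is stable across representations, and the POS-enhanced and POS-only representations are particularly robust under subject-level evaluation; statistical analysis finds significant group differences in function word usage, lexical diversity, syntactic structure, and discourse coherence, which agree closely with the ML feature importances. Conclusion: Abstract linguistic features such as syntactic and grammatical patterns robustly capture signals of early cognitive decline under clinically realistic evaluation conditions; combining interpretable machine learning with non-parametric statistical validation supports the use of linguistically grounded features for transparent, reliable language-based cognitive screening. Abstract: Background: Subtle changes in spontaneous language production are among the earliest indicators of cognitive decline. Identifying linguistically interpretable markers of dementia can support transparent and clinically grounded screening approaches. Methods: This study analyzes spontaneous speech transcripts from the DementiaBank Pitt Corpus using three linguistic representations: raw cleaned text, a part-of-speech (POS)-enhanced representation combining lexical and grammatical information, and a POS-only syntactic representation. Logistic regression and random forest models were evaluated under two protocols: transcript-level train-test splits and subject-level five-fold cross-validation to prevent speaker overlap. Model interpretability was examined using global feature importance, and statistical validation was conducted using Mann-Whitney U tests with Cliff's delta effect sizes. Results: Across representations, models achieved stable performance, with syntactic and grammatical features retaining strong discriminative power even in the absence of lexical content. Subject-level evaluation yielded more conservative but consistent results, particularly for POS-enhanced and POS-only representations. Statistical analysis revealed significant group differences in functional word usage, lexical diversity, sentence structure, and discourse coherence, aligning closely with machine learning feature importance findings.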
Conclusion: The results demonstrate that abstract linguistic features capture robust markers of early cognitive decline under clinically realistic evaluation. By combining interpretable machine learning with non-parametric statistical validation, this study supports the use of linguistically grounded features for transparent and reliable language-based cognitive screening.
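
For readers unfamiliar with the statistics named in this entry, the Mann-Whitney U statistic and Cliff's delta can both be computed from pairwise comparisons between the two groups. The sketch below is a plain-Python illustration with made-up numbers, not the study's pipeline; it omits the p-value, and the function name `u_and_cliffs_delta` is hypothetical.

```python
from typing import Sequence, Tuple


def u_and_cliffs_delta(x: Sequence[float],
                       y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for x, Cliff's delta) from all pairwise comparisons."""
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    ties = len(x) * len(y) - greater - less
    u_x = greater + 0.5 * ties                 # ties count as half a win
    delta = (greater - less) / (len(x) * len(y))
    return u_x, delta


# Illustrative feature values for two groups (not data from the paper).
group_a = [3, 4, 5]
group_b = [1, 2, 3]
u, delta = u_and_cliffs_delta(group_a, group_b)
print(u, delta)
```

The two quantities are linked by delta = 2U / (n * m) - 1, so the effect size adds no new information beyond U but rescales it to the range [-1, 1]. In practice the test itself, including the p-value, is available as `scipy.stats.mannwhitneyu`.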

[38] Language Model Inversion through End-to-End Differentiation

Kevin Yandoka Denamganaï,Kartic Subr

Main category: cs.CL

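TL;DR: This paper proposes a gradient-based optimization method that treats a language model (LM) as a function on sequences of distributions over tokens, makes a frozen LM end-to-end differentiable, and uses this to invert the model, finding input prompts that produce a specified output.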

Details

Motivation: Little existing work analyzes the invertibility of language models: given a target output, how to find an input prompt that generates it remains unsolved. Method: The language model is modeled as a function on sequences of distributions over tokens; a simple algorithm makes the frozen language model end-to-end differentiable, and prompts are optimized by gradient descent. Result: Experiments show that the method reliably and efficiently optimizes prompts of length 10 and 80 to produce target outputs of length 20 on several white-box language models. Conclusion: Language models are potentially invertible; a distributional view combined with gradient optimization achieves effective prompt inversion and offers a new approach to controllable text generation and model analysis. Abstract: Despite emerging research on Language Models (LM), few approaches analyse the invertibility of LMs. That is, given an LM and a desirable target output sequence of tokens, determining what input prompts would yield the target output remains an open problem. We formulate this problem as a classical gradient-based optimisation. First, we propose a simple algorithm to achieve end-to-end differentiability of a given (frozen) LM and then find optimised prompts via gradient descent. Our central insight is to view LMs as functions operating on sequences of distributions over tokens (rather than the traditional view as functions on sequences of tokens). Our experiments and ablations demonstrate that our DLM-powered inversion can reliably and efficiently optimise prompts of lengths 10 and 80 for targets of length 20, for several white-box LMs (out-of-the-box).
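
One way to picture the "sequences of distributions over tokens" view is to feed the model an expected embedding instead of a looked-up one. The sketch below illustrates only that idea and is an assumption about how such a relaxation could be set up, not the paper's algorithm: `soft_embed`, `softmax`, and the toy embedding matrix are hypothetical, and no language model or gradient step is included.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def soft_embed(token_probs, embedding_matrix):
    """Map per-position token distributions to expected embeddings.

    token_probs: (seq_len, vocab) rows summing to 1.
    embedding_matrix: (vocab, dim).
    """
    return token_probs @ embedding_matrix


vocab, dim = 4, 3
E = np.arange(vocab * dim, dtype=float).reshape(vocab, dim)

# A one-hot distribution reproduces the ordinary embedding lookup ...
one_hot = np.eye(vocab)[[2, 0]]
print(np.allclose(soft_embed(one_hot, E), E[[2, 0]]))

# ... while a soft distribution gives a mixture that varies smoothly with
# the underlying logits, which is what a gradient-based search needs.
print(soft_embed(softmax(np.zeros((1, vocab))), E))
```

Under this view the prompt is parameterised by continuous logits, so a loss on the model output can in principle be backpropagated to them; how the optimised distributions are turned back into discrete tokens is left open here.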

[39] Embedding Inversion via Conditional Masked Diffusion Language Models

Han Xiao

Main category: cs.CL

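TL;DR: This paper proposes an embedding inversion method based on conditional masked diffusion that recovers all tokens in parallel through iterative denoising. It needs only 8 forward passes and no access to the target encoder, and achieves high token accuracy and cosine similarity.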

Details

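Motivation: Traditional embedding inversion methods mostly rely on sequential autoregressive generation, which is inefficient and depends on the target encoder; this paper aims for a more efficient, encoder-independent parallel inversion method. Method: Embedding inversion is modeled as a conditional masked diffusion process; adaptive layer normalization conditions the masked diffusion language model on the target embedding, and iterative denoising is performed without access to the target encoder. Result: On 32-token sequences and three embedding models, the method reaches 81.3% token accuracy and 0.87 cosine similarity with only 8 forward passes through a 78M-parameter model. Conclusion: Conditional masked diffusion is an efficient, lightweight embedding inversion paradigm that does not depend on the target encoder and significantly outperforms traditional autoregressive methods. Abstract: We frame embedding inversion as conditional masked diffusion, recovering all tokens in parallel through iterative denoising rather than sequential autoregressive generation. A masked diffusion language model is conditioned on the target embedding via adaptive layer normalization, requiring only 8 forward passes through a 78M parameter model with no access to the target encoder. On 32-token sequences across three embedding models, the method achieves 81.3% token accuracy and 0.87 cosine similarity.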

[40] Conversational Behavior Modeling Foundation Model With Multi-Level Perception

Dingkun Zhou,Shuchang Pan,Jiachen Lian,Siddharth Banerjee,Sarika Pasumarthy,Dhruv Hebbar,Siddhant Patel,Zeyi Austin Li,Kan Jen Cheng,Sanay Bordia,Krish Patel,Akshaj Gupta,Tingle Li,Gopala Anumanchipalli

Main category: cs.CL

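TL;DR: This paper proposes a framework based on multi-level perception and Graph-of-Thoughts (GoT) for modeling the implicit intent-to-action chain in human conversation, supporting more natural behavior and interpretable reasoning in full-duplex spoken interaction systems.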

Details

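Motivation: Human conversation relies on an implicit, timed chain of thoughts; capturing this perceptual pathway is essential for building natural full-duplex interactive systems. Method: Multi-level perception modeling and a Graph-of-Thoughts (GoT) framework with a hierarchical labeling scheme that models the causal and temporal dependencies between high-level communicative intents and low-level speech acts; a high-quality paired corpus (controllable, event-rich dialogues plus human annotations) is built, and a transformer operating on a dynamically evolving graph predicts speech acts, generates justifications, and iteratively refines its reasoning. Result: Experiments on synthetic and real full-duplex dialogue data demonstrate the framework's advantages in robust behavior detection and interpretable reasoning chains, and lay a foundation for benchmarking reasoning in full-duplex spoken dialogue. Conclusion: The GoT framework effectively models the intent-to-action mapping in conversation and improves the naturalness, controllability, and interpretability of full-duplex interactive systems, an important step toward embodied, reasoning-capable dialogue systems. Abstract: Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this perceptual pathway is key to building natural full-duplex interactive systems. We introduce a framework that models this process as multi-level perception, and then reasons over conversational behaviors via a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a high-quality corpus that pairs controllable, event-rich dialogue data with human-annotated labels. The GoT framework structures streaming predictions as an evolving graph, enabling a transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.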

[41] Simultaneous Speech-to-Speech Translation Without Aligned Data

Tom Labiausse,Romain Fabre,Yannick Estève,Alexandre Défossez,Neil Zeghidour

Main category: cs.CL

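TL;DR: Hibiki-Zero is an end-to-end simultaneous speech translation method that needs no word-level alignments. It combines sentence-level supervision with GRPO reinforcement learning to optimize latency and quality, reaches state-of-the-art performance on multilingual tasks, and supports rapid adaptation to low-resource languages.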

Details

Motivation: Traditional simultaneous speech translation depends on word-level aligned data that is hard to obtain at scale, or on suboptimal language-specific alignment heuristics, which limits scaling to many languages. Method: The Hibiki-Zero framework first trains a high-latency speech translation model on sentence-level aligned data, then applies a GRPO-based reinforcement learning strategy to jointly optimize translation quality and latency; no word-level alignment is needed at any stage. Result: State-of-the-art translation accuracy, latency, voice transfer, and naturalness on five X-to-English tasks; a new language can be added with less than 1000 hours of speech; a 45-hour multilingual evaluation benchmark, model weights, and inference code are released. Conclusion: Hibiki-Zero removes the dependence on word-level alignment, simplifies the training pipeline, and markedly improves multilingual scalability and low-resource adaptability, offering a more robust and general solution for real-time speech translation in practical deployments. Abstract: Simultaneous speech translation requires translating source speech into a target language in real-time while handling non-monotonic word dependencies. Traditional approaches rely on supervised training with word-level aligned data, which is difficult to collect at scale and thus depends on synthetic alignments using language-specific heuristics that are suboptimal. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. This fundamentally simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language-specific alignment heuristics. We first train on sentence-level aligned data to learn speech translation at high latency, then apply a novel reinforcement learning strategy using GRPO to optimize latency while preserving translation quality. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks. Moreover, we demonstrate that our model can be adapted to support a new input language with less than 1000h of speech. We provide examples, model weights, and inference code, and we release a benchmark containing 45h of multilingual data for speech translation evaluation.

[42] SteuerLLM: Local specialized large language model for German tax law analysis

Sebastian Wind,Jeta Sopa,Laurin Schmid,Quirin Jackl,Sebastian Kiefer,Fei Wu,Martin Mayr,Harald Köstler,Gerhard Wellein,Andreas Maier,Soroosh Tayebi Arasteh

Main category: cs.CL

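TL;DR: This paper introduces SteuerEx, the first open benchmark based on authentic German university tax law exams, and develops the domain-adapted model SteuerLLM, which significantly outperforms general-purpose models of comparable size on tax law reasoning. The results underline that domain data and architectural adaptation matter more than parameter scale.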

Details

Motivation: Large language models perform worse in domains with strict rules, precise terminology, and strong legal constraints, such as tax law; authentic, evaluable domain benchmarks and specialized models are needed. Method: 1) The SteuerEx benchmark is generated algorithmically from authentic German tax law exams; it contains 115 expert-validated questions covering 6 core domains and uses statement-level, partial-credit evaluation. 2) SteuerLLM (28B) is trained on large-scale synthetic data generated from authentic exam material with a controlled retrieval-augmented pipeline. Result: SteuerLLM consistently outperforms general-purpose instruction-tuned models of comparable size on SteuerEx and even beats some larger systems; all data, model weights, code, and a web demo are openly released. Conclusion: For realistic legal reasoning tasks, high-quality domain data and targeted architectural design matter more than simply scaling up parameters; the open resources support reproducible research on domain-specific legal AI. Abstract: Large language models (LLMs) demonstrate strong general reasoning and language understanding, yet their performance degrades in domains governed by strict formal rules, precise terminology, and legally binding structure. Tax law exemplifies these challenges, as correct answers require exact statutory citation, structured legal argumentation, and numerical accuracy under rigid grading schemes. We algorithmically generate SteuerEx, the first open benchmark derived from authentic German university tax law examinations. SteuerEx comprises 115 expert-validated examination questions spanning six core tax law domains and multiple academic levels, and employs a statement-level, partial-credit evaluation framework that closely mirrors real examination practice. We further present SteuerLLM, a domain-adapted LLM for German tax law trained on a large-scale synthetic dataset generated from authentic examination material using a controlled retrieval-augmented pipeline. SteuerLLM (28B parameters) consistently outperforms general-purpose instruction-tuned models of comparable size and, in several cases, substantially larger systems, demonstrating that domain-specific data and architectural adaptation are more decisive than parameter scale for performance on realistic legal reasoning tasks. All benchmark data, training datasets, model weights, and evaluation code are released openly to support reproducible research in domain-specific legal artificial intelligence. A web-based demo of SteuerLLM is available at https://steuerllm.i5.ai.fau.de.

[43] DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

Yicheng Chen,Zerun Ma,Xinchen Xie,Yining Li,Kai Chen

Main category: cs.CL

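TL;DR: This paper proposes end-to-end data recipe generation: the DataChef-32B model uses online reinforcement learning to design LLM training data pipelines automatically. It matches human experts on multiple tasks and substantially improves Qwen3-1.7B-Base in the math domain.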

Details

Motivation: Designing data recipes for LLM training still relies heavily on manual, time-consuming effort, and automated methods are needed. Method: The paper formulates the task of end-to-end data recipe generation and builds DataChef-32B, which uses online reinforcement learning with a proxy reward to generate, from raw data sources, data processing pipelines adapted to the target task. Result: On six held-out tasks, recipes generated by DataChef-32B achieve downstream performance comparable to that of human experts; its recipe brings Qwen3-1.7B-Base to 66.7 on AIME'25, surpassing Qwen3-1.7B. Conclusion: This work opens a new path toward automating LLM training and developing self-evolving AI systems. Abstract: In the current landscape of Large Language Models (LLMs), the curation of large-scale, high-quality training data is a primary driver of model performance. A key lever is the data recipe, which comprises a data processing pipeline to transform raw sources into training corpora. Despite the growing use of LLMs to automate individual data processing steps, such as data synthesis and filtering, the overall design of data recipes remains largely manual and labor-intensive, requiring substantial human expertise and iteration. To bridge this gap, we formulate end-to-end data recipe generation for LLM adaptation. Given a target benchmark and a pool of available data sources, a model is required to output a complete data recipe that adapts a base LLM to the target task. We present DataChef-32B, which performs online reinforcement learning using a proxy reward that predicts downstream performance for candidate recipes. Across six held-out tasks, DataChef-32B produces practical recipes that reach comparable downstream performance to those curated by human experts. Notably, the recipe from DataChef-32B adapts Qwen3-1.7B-Base to the math domain, achieving 66.7 on AIME'25 and surpassing Qwen3-1.7B. This work sheds new light on automating LLM training and developing self-evolving AI systems.

[44] Can Large Language Models Make Everyone Happy?

Usman Naseem,Gautam Siddharth Kashyap,Ebad Shabbir,Sushant Kumar Ray,Abdullah Mohammad,Rafiq Ali

Main category: cs.CL

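TL;DR: This paper introduces the MisAlign-Profile benchmark for systematically evaluating misalignment trade-offs in large language models across the safety, value, and cultural dimensions. It builds the MISALIGNTRADE dataset covering 112 normative domains and reveals cross-dimensional misalignment rates of 12%-34% in mainstream LLMs.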

Details

Motivation: Existing benchmarks (such as SAFETUNEBED, VALUEBENCH, and WORLDVIEW-BENCH) evaluate safety, value, or culture in isolation and cannot capture the interactions and trade-offs that arise when these dimensions co-occur; newer methods based on mechanistic interpretability (such as MIB) are still insufficient to characterize cross-dimensional misalignment systematically. Method: The unified MisAlign-Profile benchmark comprises: (1) the MISALIGNTRADE dataset, covering 112 domains (14 safety, 56 value, 42 cultural), with each prompt labeled with a semantic misalignment type (object, attribute, or relation), generated with two models (Gemma-2 and Qwen3) and deduplicated with SimHash; (2) two-stage rejection sampling to produce high-quality misaligned/aligned response pairs; (3) a quantitative evaluation of cross-dimensional misalignment on general-purpose, fine-tuned, and open-weight LLMs. Result: Evaluation on MISALIGNTRADE finds cross-dimensional misalignment rates of 12%-34% across mainstream LLMs, confirming that safety, value, and cultural objectives are hard to satisfy simultaneously, and that misalignment patterns depend on domain and semantic type. Conclusion: MisAlign-Profile provides the first systematic, interpretable quantitative evaluation of multi-dimensional misalignment trade-offs in LLMs, gives alignment research a new benchmark and empirical evidence, and encourages alignment methods that jointly account for safety, value, and culture. Abstract: Misalignment in Large Language Models (LLMs) refers to the failure to simultaneously satisfy safety, value, and cultural dimensions, leading to behaviors that diverge from human expectations in real-world settings where these dimensions must co-occur. Existing benchmarks, such as SAFETUNEBED (safety-centric), VALUEBENCH (value-centric), and WORLDVIEW-BENCH (culture-centric), primarily evaluate these dimensions in isolation and therefore provide limited insight into their interactions and trade-offs. More recent efforts based on mechanistic interpretability, including MIB and INTERPRETABILITY BENCHMARK, offer valuable perspectives on model failures; however, they remain insufficient for systematically characterizing cross-dimensional trade-offs. To address these gaps, we introduce MisAlign-Profile, a unified benchmark for measuring misalignment trade-offs inspired by mechanistic profiling. First, we construct MISALIGNTRADE, an English misaligned-aligned dataset spanning 112 normative domains, including 14 safety, 56 value, and 42 cultural domains. In addition to domain labels, each prompt is classified into one of three orthogonal semantic types (object, attribute, or relation misalignment) using Gemma-2-9B-it and expanded via Qwen3-30B-A3B-Instruct-2507, with SimHash-based fingerprinting to avoid duplication. Each prompt is paired with misaligned and aligned responses through two-stage rejection sampling to ensure quality.
Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs on MISALIGNTRADE, revealing 12%-34% misalignment trade-offs across dimensions.

[45] Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away

Soumya Suvra Ghosal,Souradip Chakraborty,Vaibhav Singh,Furong Huang,Dinesh Manocha,Amrit Singh Bedi

Main category: cs.CL

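TL;DR: SafeThink is a lightweight inference-time defense that monitors a safety threshold during reasoning and, when needed, injects a short corrective prefix (such as 'Wait, think safely'). It sharply reduces jailbreak success rates on multimodal large reasoning models while preserving their reasoning ability.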

Details

Motivation: RL-based chain-of-thought post-training improves the reasoning ability of multimodal large reasoning models but harms their safety and their robustness to jailbreak attacks. Method: SafeThink monitors the reasoning process in real time with a safety reward model and conditionally injects an optimized short corrective prefix when the safety threshold is violated; experiments show that early intervention (steps 1-3) is enough to steer generation toward safe outcomes. Result: On six open-source MLRMs and four jailbreak benchmarks, SafeThink lowers attack success rates by 30-60% (for example, LlamaV-o1 on JailbreakV-28K drops from 63.33% to 5.74%) with almost no loss in reasoning performance (MathVista accuracy 65.20% to 65.00%). Conclusion: Treating safety recovery as a satisficing constraint rather than an optimization objective is feasible and efficient; lightweight intervention early in reasoning is enough to preserve both safety and reasoning ability. Abstract: Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs), but recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K, R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1-3 reasoning steps typically suffices to redirect the full generation toward safe completions.
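
The monitor-and-inject loop described in this entry could be organised along the following lines. This is a schematic sketch under several assumptions, not the SafeThink implementation: the step generator and safety scorer are passed in as plain callables (stubbed below), the unsafe candidate step is simply discarded before the prefix is injected, and the names `safe_think` and `CORRECTIVE_PREFIX` as well as the `threshold` value are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

CORRECTIVE_PREFIX = "Wait, think safely"


def safe_think(generate_step: Callable[[str, List[str]], Optional[str]],
               safety_score: Callable[[str, List[str]], float],
               prompt: str,
               threshold: float = 0.5,
               max_steps: int = 8) -> Tuple[List[str], int]:
    """Generate a reasoning trace, steering only when the score falls below threshold."""
    trace: List[str] = []
    interventions = 0
    for _ in range(max_steps):
        step = generate_step(prompt, trace)
        if step is None:                       # generator signals completion
            break
        if safety_score(prompt, trace + [step]) < threshold:
            # Satisficing check failed: drop the candidate step and inject
            # the corrective prefix so the next step is conditioned on it.
            trace.append(CORRECTIVE_PREFIX)
            interventions += 1
            continue
        trace.append(step)
    return trace, interventions


# Stub model and scorer for demonstration only.
def toy_generate(prompt, trace):
    if len(trace) >= 3:
        return None
    return "safe step" if CORRECTIVE_PREFIX in trace else "risky step"


def toy_score(prompt, trace):
    return 0.0 if trace[-1] == "risky step" else 1.0


print(safe_think(toy_generate, toy_score, "example prompt"))
```

Because the check is a threshold rather than a maximisation, traces that already score above it are left untouched, which is consistent with the entry's claim that reasoning performance is largely preserved.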

[46] TEGRA: Text Encoding With Graph and Retrieval Augmentation for Misinformation Detection

Géraud Faye,Wassila Ouerdane,Guillaume Gadek,Céline Hudelot

Main category: cs.CL

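TL;DR: This paper proposes TEG, a misinformation detection method that combines knowledge graphs with text encoding, and extends it to TEGRA, which incorporates domain knowledge. Experiments show that it outperforms methods based on language models alone.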

Details

Motivation: Manual fact-checking relies on external knowledge, whereas existing misinformation detection methods mostly rely on language models alone and make little effective use of structured external knowledge. Method: Text Encoding with Graph (TEG) extracts structured graph information from the text and jointly encodes the text and the graph; TEGRA extends it by incorporating domain-specific knowledge. Result: Experiments on multiple datasets show that TEG clearly outperforms methods that use language models alone, and TEGRA further improves classification accuracy in most cases. Conclusion: An encoding that fuses structured knowledge graphs with text effectively improves misinformation detection, and adding domain knowledge brings further gains. Abstract: Misinformation detection is a critical task that can benefit significantly from the integration of external knowledge, much like manual fact-checking. In this work, we propose a novel method for representing textual documents that facilitates the incorporation of information from a knowledge base. Our approach, Text Encoding with Graph (TEG), processes documents by extracting structured information in the form of a graph and encoding both the text and the graph for classification purposes. Through extensive experiments, we demonstrate that this hybrid representation enhances misinformation detection performance compared to using language models alone. Furthermore, we introduce TEGRA, an extension of our framework that integrates domain-specific knowledge, further enhancing classification accuracy in most cases.

[47] Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning

Dawid J. Kopiczko,Sagar Vaze,Tijmen Blankevoort,Yuki M. Asano

Main category: cs.CL

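TL;DR: This paper finds that in supervised fine-tuning (SFT) on chain-of-thought data, repetition (training for many epochs on a smaller dataset) is more effective than a single epoch on a large dataset, and markedly improves performance on reasoning benchmarks such as AIME and GPQA. Training token accuracy is a reliable signal of when repetition has saturated, so it can replace expensive data scaling.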

Details

Motivation: Standard machine learning intuition holds that more diverse training samples help generalization, but the authors observe that repetition works better in chain-of-thought SFT, which calls for an investigation of the mechanism and of practical strategies. Method: Under a fixed parameter-update budget, SFT results are compared systematically across combinations of dataset size and number of epochs; token accuracy is used to monitor training dynamics, and its relationship to generalization is analyzed. Result: Olmo3-7B trained for 128 epochs on 400 samples significantly outperforms 1 epoch on 51200 samples (by 12-26 percentage points) with no additional catastrophic forgetting; saturation of token accuracy (full memorization) coincides with peak generalization. Conclusion: Repetition offers a substantial advantage in SFT for reasoning models; token accuracy can serve as an efficient, low-cost stopping criterion; the 'repetition advantage' is a new open problem for understanding the training dynamics of large models. Abstract: Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms the equivalent 1 epoch on 51200 samples by 12-26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated; improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings provide a practical approach for reasoning SFT, where scaling epochs with token accuracy as a stopping criterion can replace expensive undirected data scaling. We pose the repetition advantage, where full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.
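
The "fixed update budget" comparison in this entry comes down to simple arithmetic: 400 samples for 128 epochs and 51200 samples for 1 epoch both present 51,200 samples to the model. The sketch below enumerates such equal-budget configurations and adds a token-accuracy stopping check. It assumes a constant batch size (so that samples seen is proportional to updates), and the 0.99 threshold and both function names are illustrative choices, not values from the paper.

```python
from typing import Dict, Iterable, Optional, Sequence


def equal_budget_epochs(total_samples_seen: int,
                        dataset_sizes: Iterable[int]) -> Dict[int, int]:
    """Map each dataset size to the epoch count that uses the same budget.

    Sizes that do not divide the budget evenly are skipped.
    """
    return {n: total_samples_seen // n
            for n in dataset_sizes
            if total_samples_seen % n == 0}


def first_saturated_epoch(token_accuracy_by_epoch: Sequence[float],
                          threshold: float = 0.99) -> Optional[int]:
    """Return the first (1-indexed) epoch whose training token accuracy reaches threshold."""
    for epoch, accuracy in enumerate(token_accuracy_by_epoch, start=1):
        if accuracy >= threshold:
            return epoch
    return None


print(equal_budget_epochs(51200, [400, 1600, 51200]))
print(first_saturated_epoch([0.60, 0.90, 0.995, 1.0]))
```

Used this way, the stopping check replaces a fixed epoch count: training on the small dataset continues until token accuracy saturates, which the entry reports as the point where further epochs stop helping.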

cs.CV

[48] MPA: Multimodal Prototype Augmentation for Few-Shot Learning

Liwen Wu,Wei Wang,Lei Zhao,Zhan Gao,Qika Lin,Shaowen Yao,Zuozhu Liu,Bin Pu

Main category: cs.CV

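TL;DR: This paper proposes MPA, a multimodal prototype augmentation framework for few-shot learning that combines LLM-based semantic enhancement, multi-view feature augmentation, and uncertainty modeling, and significantly improves single-domain and cross-domain few-shot classification.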

Details

Motivation: Existing few-shot learning methods rely mainly on the visual modality alone and compute prototypes from raw support images; they lack rich multimodal information and struggle with sparse semantics and limited feature diversity. Method: The MPA framework has three parts: 1) LLM-based Multi-Variant Semantic Enhancement (LMSE), which uses large language models to generate diverse category descriptions that enrich semantics; 2) Hierarchical Multi-View Augmentation (HMA), which combines natural and multi-view data augmentations to increase feature diversity; 3) Adaptive Uncertain Class Absorber (AUCA), which introduces uncertain classes through interpolation and Gaussian sampling to absorb samples near ambiguous boundaries. Result: MPA outperforms the state of the art across most settings on 4 single-domain and 6 cross-domain FSL benchmarks; in the 5-way 1-shot setting it beats the second-best method by 12.29% in the single-domain setting and 24.56% in the cross-domain setting. Conclusion: By augmenting prototypes with semantic, visual, and uncertainty modeling, MPA effectively eases the semantic scarcity and feature generalization bottlenecks of few-shot scenarios and offers a new paradigm for cross-domain few-shot learning. Abstract: Recently, few-shot learning (FSL) has become a popular task that aims to recognize new classes from only a few labeled examples and has been widely applied in fields such as natural science, remote sensing, and medical imaging. However, most existing methods focus only on the visual modality and compute prototypes directly from raw support images, which lack comprehensive and rich multimodal information. To address these limitations, we propose a novel Multimodal Prototype Augmentation FSL framework called MPA, including LLM-based Multi-Variant Semantic Enhancement (LMSE), Hierarchical Multi-View Augmentation (HMA), and an Adaptive Uncertain Class Absorber (AUCA). LMSE leverages large language models to generate diverse paraphrased category descriptions, enriching the support set with additional semantic cues. HMA exploits both natural and multi-view augmentations to enhance feature diversity (e.g., changes in viewing distance, camera angles, and lighting conditions). AUCA models uncertainty by introducing uncertain classes via interpolation and Gaussian sampling, effectively absorbing uncertain samples. Extensive experiments on four single-domain and six cross-domain FSL benchmarks demonstrate that MPA achieves superior performance compared to existing state-of-the-art methods across most settings. Notably, MPA surpasses the second-best method by 12.29% and 24.56% in the single-domain and cross-domain setting, respectively, in the 5-way 1-shot setting.

[49] VERA: Identifying and Leveraging Visual Evidence Retrieval Heads in Long-Context Understanding

Rongcan Pei,Huan Li,Fang Guo,Qi Zhu

Main category: cs.CV

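TL;DR: This paper analyzes the internal mechanisms of vision-language models (VLMs) in long-context processing, identifies a key class of Visual Evidence Retrieval (VER) heads, and builds on them to propose VERA, a training-free framework that detects uncertainty and explicitly verbalizes the visual evidence attended to by VER heads, substantially improving the long-context understanding of open-source VLMs.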

Details

Motivation: Vision-Language Models (VLMs) perform poorly on long-context and complex reasoning tasks, and their internal mechanisms and performance bottlenecks need to be better understood. Method: Attention analysis identifies a dynamic, sparse set of Visual Evidence Retrieval (VER) heads; the training-free VERA framework, triggered by model uncertainty (entropy), then explicitly verbalizes the visual evidence those heads attend to. Result: Across five benchmarks, VERA substantially improves the long-context understanding of open-source VLMs, with average relative gains of 21.3% on Qwen3-VL-8B-Instruct and 20.1% on GLM-4.1V-Thinking. Conclusion: VER heads play a causal role in VLM long-context reasoning; VERA is a lightweight, training-free method that eases the bottleneck in joint long-text and visual reasoning. Abstract: While Vision-Language Models (VLMs) have shown promise in textual understanding, they face significant challenges when handling long context and complex reasoning tasks. In this paper, we dissect the internal mechanisms governing long-context processing in VLMs to understand their performance bottlenecks. Through the lens of attention analysis, we identify specific Visual Evidence Retrieval (VER) Heads - a sparse, dynamic set of attention heads critical for locating visual cues during reasoning, distinct from static OCR heads. We demonstrate that these heads are causal to model performance; masking them leads to significant degradation. Leveraging this discovery, we propose VERA (Visual Evidence Retrieval Augmentation), a training-free framework that detects model uncertainty (i.e., entropy) to trigger the explicit verbalization of visual evidence attended by VER heads. Comprehensive experiments demonstrate that VERA significantly improves long-context understanding of open-source VLMs: it yields an average relative improvement of 21.3% on Qwen3-VL-8B-Instruct and 20.1% on GLM-4.1V-Thinking across five benchmarks.
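The entropy trigger mentioned in the abstract can be illustrated with a short sketch. It computes the Shannon entropy of a next-token distribution and fires when the entropy exceeds a threshold; the function names and the threshold value of 1.0 nat are assumptions made for illustration, and the abstract does not state how VERA sets its trigger.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_verbalize(logits, threshold=1.0):
    """Hypothetical trigger: request explicit evidence when the model is unsure."""
    return entropy(softmax(logits)) > threshold

confident = should_verbalize([10.0, 0.0, 0.0, 0.0])   # peaked distribution
uncertain = should_verbalize([0.0, 0.0, 0.0, 0.0])    # uniform distribution
```

A uniform distribution over four tokens has entropy ln 4 (about 1.386 nats), so it crosses the assumed threshold, while the peaked distribution does not.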

[50] Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization

Tao Yu,Yujia Yang,Haopeng Jin,Junhao Gong,Xinlong Chen,Yuxuan Zhou,Shanbin Zhang,Jiabing Yang,Xinming Wang,Hongzhu Yi,Ping Nie,Kai Zou,Zhang Zhang,Yan Huang,Liang Wang,Yeshani,Ruiwen Tao,Jin Ma,Haijin Liang,Jinwen Luo

Main category: cs.CV

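TL;DR: This paper presents RVMS-Bench, the first benchmark for real-world video memory search, with 1,440 samples drawn from real open-web videos and a four-dimensional memory description framework. It also proposes RACLO, an agent framework that uses abductive reasoning to simulate the human "Recall-Search-Verify" cognitive process, and shows that current multimodal large language models still fall clearly short on fuzzy-memory-driven video retrieval and moment localization.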

Details

Motivation: Traditional video retrieval benchmarks are limited to matching precise text against closed video pools and do not reflect the real need to search the open web for videos from fuzzy, multi-dimensional memories. Method: The authors build RVMS-Bench (1,440 samples, 20 categories, and four duration groups, using a four-level description framework of Global Impression, Key Moment, Temporal Context, and Auditory Memory, with human-in-the-loop verification) and propose the RACLO agent framework, which uses abductive reasoning to model the human "Recall-Search-Verify" cognitive process. Result: Experiments show that existing MLLMs perform inadequately on real-world video retrieval and moment localization from fuzzy memories; RVMS-Bench and RACLO make evaluation of this task more realistic and the approach to it more effective. Conclusion: RVMS-Bench and RACLO provide a new benchmark and a new paradigm for unstructured video retrieval in real-world settings, pushing models toward more robust and more human-like video memory search. Abstract: Traditional video retrieval benchmarks focus on matching precise descriptions to closed video pools, failing to reflect real-world searches characterized by fuzzy, multi-dimensional memories on the open web. We present RVMS-Bench, a comprehensive system for evaluating real-world video memory search. It consists of 1,440 samples spanning 20 diverse categories and four duration groups, sourced from real-world open-web videos. RVMS-Bench utilizes a hierarchical description framework encompassing Global Impression, Key Moment, Temporal Context, and Auditory Memory to mimic realistic multi-dimensional search cues, with all samples strictly verified via a human-in-the-loop protocol. We further propose RACLO, an agentic framework that employs abductive reasoning to simulate the human "Recall-Search-Verify" cognitive process, effectively addressing the challenge of searching for videos via fuzzy memories in the real world. Experiments reveal that existing MLLMs still demonstrate insufficient capabilities in real-world Video Retrieval and Moment Localization based on fuzzy memories. We believe this work will facilitate the advancement of video retrieval robustness in real-world unstructured scenarios.

[51] AD$^2$: Analysis and Detection of Adversarial Threats in Visual Perception for End-to-End Autonomous Driving Systems

Ishan Sahu,Somnath Hazra,Somak Aditya,Soumyajit Dey

Main category: cs.CV

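TL;DR: This paper studies the robustness of end-to-end autonomous driving systems under black-box adversarial threats, presents three attacks on visual perception, and designs AD², a lightweight attention-based attack detection model that improves detection performance and computational efficiency over existing approaches.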

Details

Motivation: End-to-end autonomous driving systems have advanced, but their adversarial robustness has not been studied enough, leaving serious safety risks. Method: Two advanced driving agents, Transfuser and Interfuser, are evaluated in closed loop in CARLA under three physical and digital black-box adversarial attacks; the authors also propose AD², a lightweight attention-based attack detection model that exploits spatial-temporal consistency. Result: Under attack, the driving scores of both agents drop by up to 99%; with multi-camera inputs, AD² shows better detection capability and computational efficiency than existing methods. Conclusion: Current end-to-end autonomous driving systems are extremely fragile under realistic adversarial threats, and efficient, reliable attack detection mechanisms such as AD² are needed to keep them safe. Abstract: End-to-end autonomous driving systems have achieved significant progress, yet their adversarial robustness remains largely underexplored. In this work, we conduct a closed-loop evaluation of state-of-the-art autonomous driving agents under black-box adversarial threat models in CARLA. Specifically, we consider three representative attack vectors on the visual perception pipeline: (i) a physics-based blur attack induced by acoustic waves, (ii) an electromagnetic interference attack that distorts captured images, and (iii) a digital attack that adds ghost objects as carefully crafted bounded perturbations on images. Our experiments on two advanced agents, Transfuser and Interfuser, reveal severe vulnerabilities to such attacks, with driving scores dropping by up to 99% in the worst case, raising valid safety concerns. To help mitigate such threats, we further propose a lightweight Attack Detection model for Autonomous Driving systems (AD$^2$) based on attention mechanisms that capture spatial-temporal consistency. Comprehensive experiments across multi-camera inputs on CARLA show that our detector achieves superior detection capability and computational efficiency compared to existing approaches.

[52] ArtisanGS: Interactive Tools for Gaussian Splat Selection with AI and Human in the Loop

Clement Fuji Tsang,Anita Hu,Or Perel,Carsten Kolve,Maria Shugrina

Main category: cs.CV

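TL;DR: This paper presents an interactive selection and segmentation toolset for 3D Gaussian Splat (3DGS) representations. It supports fast propagation of user-guided 2D masks to 3DGS selections, manual editing, and binary segmentation, and pairs with a video diffusion model for controllable local editing on any in-the-wild capture.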

Details

Motivation: Extracting usable objects from in-the-wild captures is difficult, and existing 3DGS editing methods offer little control or interactivity. Method: The authors propose an AI-driven method that propagates 2D masks to 3DGS selections, combine it with manual selection and segmentation tools, and integrate a user-guided local editing pipeline built on a custom video diffusion model. Result: The toolset enables flexible binary segmentation and local editing of arbitrary in-the-wild 3DGS scenes without additional optimization, and is reported to outperform the state of the art in selection accuracy and in downstream applications such as editing. Conclusion: The interactive toolset substantially improves controllable editing of 3DGS representations in real scenes and provides a practical basis for downstream tasks such as physics simulation and animation. Abstract: Representations in the family of 3D Gaussian Splats (3DGS) are growing into a viable alternative to traditional graphics for an expanding number of applications, including recent techniques that facilitate physics simulation and animation. However, extracting usable objects from in-the-wild captures remains challenging and controllable editing techniques for this representation are limited. Unlike the bulk of emerging techniques, focused on automatic solutions or high-level editing, we introduce an interactive suite of tools centered around versatile Gaussian Splat selection and segmentation. We propose a fast AI-driven method to propagate user-guided 2D selection masks to 3DGS selections. This technique allows for user intervention in the case of errors and is further coupled with flexible manual selection and segmentation tools. These allow a user to achieve virtually any binary segmentation of an unstructured 3DGS scene. We evaluate our toolset against the state-of-the-art for Gaussian Splat selection and demonstrate its utility for downstream applications by developing a user-guided local editing approach, leveraging a custom Video Diffusion Model. With flexible selection tools, users have direct control over the areas that the AI can modify. Our selection and editing tools can be used for any in-the-wild capture without additional optimization.

[53] When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

Jiacheng Hou,Yining Sun,Ruochong Jin,Haochen Han,Fangming Liu,Wai Kin Victor Chan,Alex Jinpeng Wang

Main category: cs.CV

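TL;DR: This paper proposes a new visual-to-visual jailbreak attack (VJA) that attacks image editing models through purely visual inputs such as marks and arrows, and builds IESBench, a safety-oriented benchmark. Experiments show that VJA compromises several commercial models; the paper also proposes a training-free defense based on introspective multimodal reasoning that substantially improves model safety.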

Details

Motivation: With the rise of vision-prompt image editing models, the attack surface has shifted from text to vision, but this visual attack risk has not been studied systematically; its vulnerabilities need to be explored, and evaluation and defense tools are needed. Method: The authors propose the Vision-Centric Jailbreak Attack (VJA), design the IESBench safety benchmark, and develop a training-free defense based on introspective multimodal reasoning. Result: VJA reaches attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5; the proposed defense substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Conclusion: The work reveals the risk of visual jailbreak attacks in vision-driven image editing models and provides IESBench, described as the first safety benchmark for this setting, together with a practical, lightweight defense, advancing safer and more trustworthy image editing systems. Abstract: Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual-text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety-oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state-of-the-art commercial models, achieving attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. To mitigate this vulnerability, we propose a training-free defense based on introspective multimodal reasoning, which substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities and provide both a benchmark and a practical defense to advance safe and trustworthy modern image editing systems. Warning: This paper contains offensive images created by large image editing models.

[54] DEGMC: Denoising Diffusion Models Based on Riemannian Equivariant Group Morphological Convolutions

El Hadji S. Diop,Thierno Fall,Mohamed Daoudi

Main category: cs.CV

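TL;DR: This paper improves diffusion models by combining geometric feature extraction with group equivariance. It defines group morphological convolutions on Riemannian manifolds (derived from viscosity solutions of Hamilton-Jacobi-type PDEs) and adds a convection term to strengthen nonlinear modeling and the representation of geometric structure, outperforming standard DDPM on MNIST, RotoMNIST, and CIFAR-10.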

Details

Motivation: Current Denoising Diffusion Probabilistic Models (DDPM) have two problems: weak extraction of key geometric features, and networks that lack equivariance to the Euclidean group (rotations, reflections, and so on). Method: The authors propose group morphological convolutions on Riemannian manifolds, based on viscosity solutions of first-order Hamilton-Jacobi-type partial differential equations, which act as morphological multiscale dilations and erosions; they add a convection term to the model and solve it with the method of characteristics, to better capture nonlinearities, thin geometric structures, and symmetries. Result: On the MNIST, RotoMNIST, and CIFAR-10 datasets, the method shows noticeable improvements over the baseline DDPM model. Conclusion: Embedding geometric priors and group equivariance in diffusion models is effective; the proposed group morphological convolution framework gives generative models stronger structure awareness and symmetry modeling. Abstract: In this work, we address two major issues in recent Denoising Diffusion Probabilistic Models (DDPM): 1) geometric key feature extraction and 2) network equivariance. Since the DDPM prediction network relies on the U-net architecture, which is theoretically only translation equivariant, we introduce a geometric approach combined with an equivariance property of the more general Euclidean group, which includes rotations, reflections, and permutations. We introduce the notion of group morphological convolutions in Riemannian manifolds, which are derived from the viscosity solutions of first-order Hamilton-Jacobi-type partial differential equations (PDEs) that act as morphological multiscale dilations and erosions. We add a convection term to the model and solve it using the method of characteristics. This helps us better capture nonlinearities, represent thin geometric structures, and incorporate symmetries into the learning process. Experimental results on the MNIST, RotoMNIST, and CIFAR-10 datasets show noticeable improvements compared to the baseline DDPM model.

[55] XSPLAIN: XAI-enabling Splat-based Prototype Learning for Attribute-aware INterpretability

Dominik Galus,Julia Farganus,Tymoteusz Zapala,Mikołaj Czachorowski,Piotr Borycki,Przemysław Spurek,Piotr Syga

Main category: cs.CV

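TL;DR: This paper proposes XSPLAIN, the first ante-hoc, prototype-based interpretability framework designed specifically for 3D Gaussian Splatting (3DGS) classification. It disentangles features with a voxel-aggregated PointNet and an invertible orthogonal transformation, provides intuitive example-based explanations without hurting classification performance, and is shown in a user study to improve transparency and trust.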

Details

Motivation: 3D Gaussian Splatting (3DGS) has become a standard for high-fidelity reconstruction, but its use in critical domains is limited, mainly because its generation models and the classification of splats lack interpretability; existing explainability methods for other 3D representations such as point clouds rely on ambiguous saliency maps that do not capture the volumetric coherence of Gaussian primitives. Method: The XSPLAIN framework uses a voxel-aggregated PointNet backbone and a novel invertible orthogonal transformation that disentangles feature channels while leaving the original decision boundaries unchanged; explanations are grounded in representative training samples and support intuitive "this looks like that" reasoning. Result: In a rigorous user study with N=51, participants chose XSPLAIN explanations as the best 48.4% of the time, significantly outperforming baselines (p<0.001) and confirming gains in transparency and user trust, with no loss in classification performance. Conclusion: XSPLAIN is the first ante-hoc, prototype-driven interpretability method for 3DGS classification; it balances mathematical rigor, visual intuitiveness, and practical usability, and offers a new paradigm for the trustworthy deployment of 3D generative models. Abstract: 3D Gaussian Splatting (3DGS) has rapidly become a standard for high-fidelity 3D reconstruction, yet its adoption in multiple critical domains is hindered by the lack of interpretability of the generation models as well as classification of the Splats. While explainability methods exist for other 3D representations, like point clouds, they typically rely on ambiguous saliency maps that fail to capture the volumetric coherence of Gaussian primitives. We introduce XSPLAIN, the first ante-hoc, prototype-based interpretability framework designed specifically for 3DGS classification. Our approach leverages a voxel-aggregated PointNet backbone and a novel, invertible orthogonal transformation that disentangles feature channels for interpretability while strictly preserving the original decision boundaries. Explanations are grounded in representative training examples, enabling intuitive "this looks like that" reasoning without any degradation in classification performance. A rigorous user study (N=51) demonstrates a decisive preference for our approach: participants selected XSPLAIN explanations 48.4% of the time as the best, significantly outperforming baselines (p<0.001), showing that XSPLAIN provides transparency and user trust. The source code for this work is available at: https://github.com/Solvro/ml-splat-xai
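The claim that an orthogonal transformation can re-mix feature channels while preserving decision boundaries follows from a standard identity: for an orthogonal Q, (W Qᵀ)(Q x) = W x. The sketch below checks this numerically with a 2-D rotation; the weights, feature vector, and rotation angle are made-up values, and the abstract does not state that XSPLAIN absorbs the transform into a linear classifier in exactly this way.

```python
import math

def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Matrix-matrix product for nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def rotation(theta):
    """2-D rotation matrix, a simple example of an orthogonal transform."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

Q = rotation(0.7)                      # assumed orthogonal transform
W = [[1.0, -2.0], [0.5, 3.0]]          # hypothetical linear classifier weights
x = [0.3, -1.2]                        # hypothetical feature vector

logits_before = matvec(W, x)
W_rot = matmul(W, transpose(Q))        # classifier expressed in the rotated basis
x_rot = matvec(Q, x)                   # features after the orthogonal transform
logits_after = matvec(W_rot, x_rot)    # equals W x up to rounding
```

Because the logits are identical, the predicted class is unchanged for every input, which is one way a channel-mixing transform can leave classification accuracy untouched.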

[56] PMMA: The Polytechnique Montreal Mobility Aids Dataset

Qingwu Liu,Nicolas Saunier,Guillaume-Alexandre Bilodeau

Main category: cs.CV

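TL;DR: This paper introduces PMMA, a new pedestrian detection dataset focused on pedestrians using mobility aids such as wheelchairs, canes, and walkers, and benchmarks a range of object detection and tracking models on it.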

Details

Motivation: Existing pedestrian detection datasets do not adequately cover pedestrians who use mobility aids (wheelchairs, canes, walkers), which limits the development and evaluation of related assistive systems. Method: The authors build the outdoor PMMA dataset with nine pedestrian categories, and implement and evaluate seven object detection models (Faster R-CNN, CenterNet, YOLOX, DETR, Deformable DETR, DINO, RT-DETR) and three tracking algorithms (ByteTrack, BOT-SORT, OC-SORT) under the MMDetection framework. Result: YOLOX, Deformable DETR, and Faster R-CNN perform best on detection; the three trackers differ only slightly; the PMMA dataset and accompanying code are publicly released. Conclusion: PMMA fills the gap in pedestrian detection datasets for mobility aid users and provides a new benchmark and practical resource for related research. Abstract: This study introduces a new object detection dataset of pedestrians using mobility aids, named PMMA. The dataset was collected in an outdoor environment, where volunteers used wheelchairs, canes, and walkers, resulting in nine categories of pedestrians: pedestrians, cane users, two types of walker users, whether walking or resting, five types of wheelchair users, including wheelchair users, people pushing empty wheelchairs, and three types of users pushing occupied wheelchairs, including the entire pushing group, the pusher and the person seated on the wheelchair. To establish a benchmark, seven object detection models (Faster R-CNN, CenterNet, YOLOX, DETR, Deformable DETR, DINO, and RT-DETR) and three tracking algorithms (ByteTrack, BOT-SORT, and OC-SORT) were implemented under the MMDetection framework. Experimental results show that YOLOX, Deformable DETR, and Faster R-CNN achieve the best detection performance, while the differences among the three trackers are relatively small. The PMMA dataset is publicly available at https://doi.org/10.5683/SP3/XJPQUG, and the video processing and model training code is available at https://github.com/DatasetPMMA/PMMA.

[57] Colorimeter-Supervised Skin Tone Estimation from Dermatoscopic Images for Fairness Auditing

Marin Benčević,Krešimir Romić,Ivana Hartmann Tolić,Irena Galić

Main category: cs.CV

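TL;DR: This paper proposes a neural-network-based skin tone estimation method for fairness auditing. By predicting Fitzpatrick skin type and ITA values, it fills the gap left by the lack of reliable skin tone annotations in public dermatology datasets.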

Details

Motivation: Neural-network-based diagnostic models for dermatoscopic images show performance differences across skin tones, but the lack of reliable skin tone annotations limits fairness auditing. Method: The authors predict Fitzpatrick skin type with ordinal regression and the Individual Typology Angle (ITA) with color regression, supervised by in-person Fitzpatrick labels and colorimeter measurements, and add large-scale pretraining on synthetic and real dermatoscopic and clinical images. Result: The Fitzpatrick model reaches agreement comparable to human crowdsourced annotations, and ITA predictions agree closely with colorimeter measurements, substantially outperforming pixel averaging; on ISIC 2020 and MILK10k, fewer than 1% of subjects are Fitzpatrick types V and VI; code and pretrained models are open-sourced. Conclusion: To the authors' knowledge, this is the first dermatoscopic skin tone estimation neural network validated against colorimeter measurements; it supports the evidence for clinically relevant performance gaps across skin tones and provides a tool for rapid skin tone annotation and bias auditing. Abstract: Neural-network-based diagnosis from dermatoscopic images is increasingly used for clinical decision support, yet studies report performance disparities across skin tones. Fairness auditing of these models is limited by the lack of reliable skin-tone annotations in public dermatoscopy datasets. We address this gap with neural networks that predict Fitzpatrick skin type via ordinal regression and the Individual Typology Angle (ITA) via color regression, using in-person Fitzpatrick labels and colorimeter measurements as targets. We further leverage extensive pretraining on synthetic and real dermatoscopic and clinical images. The Fitzpatrick model achieves agreement comparable to human crowdsourced annotations, and ITA predictions show high concordance with colorimeter-derived ITA, substantially outperforming pixel-averaging approaches. Applying these estimators to ISIC 2020 and MILK10k, we find that fewer than 1% of subjects belong to Fitzpatrick types V and VI. We release code and pretrained models as an open-source tool for rapid skin-tone annotation and bias auditing. This is, to our knowledge, the first dermatoscopic skin-tone estimation neural network validated against colorimeter measurements, and it supports growing evidence of clinically relevant performance gaps across skin-tone groups.
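For readers unfamiliar with the regression target, the Individual Typology Angle is conventionally computed from CIELAB values as ITA = arctan((L* - 50) / b*) × 180/π. The sketch below implements that standard definition; the function name and sample values are illustrative, and it uses `atan2` (which matches the arctan form whenever b* > 0) to avoid dividing by zero. It is not the paper's estimator, which predicts ITA from images.

```python
import math

def individual_typology_angle(l_star, b_star):
    """ITA in degrees from CIELAB lightness L* and yellow-blue component b*.
    atan2 equals arctan((L* - 50) / b*) for b* > 0 and stays defined at b* = 0."""
    return math.degrees(math.atan2(l_star - 50.0, b_star))

# Illustrative colorimeter-style readings (made-up values).
lighter = individual_typology_angle(70.0, 15.0)
darker = individual_typology_angle(35.0, 15.0)
```

Higher ITA corresponds to lighter skin, so `lighter` is positive and `darker` is negative in this example.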

[58] ERGO: Excess-Risk-Guided Optimization for High-Fidelity Monocular 3D Gaussian Splatting

Zehua Ma,Hanhui Li,Zhenyu Xie,Xiaonan Luo,Michael Kampffmeyer,Feng Gao,Xiaodan Liang

Main category: cs.CV

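TL;DR: This paper proposes ERGO, an adaptive optimization framework that uses excess risk decomposition to handle inconsistent supervision from synthesized views in single-image 3D generation, improving geometric fidelity and texture quality.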

Details

Motivation: Generating 3D content from a single image suffers from missing geometric and textural information, and the auxiliary views supplied by existing generative models contain geometric inconsistencies and texture misalignments that propagate and amplify errors during 3D reconstruction. Method: The ERGO framework, based on excess risk decomposition, splits the 3D Gaussian splatting optimization loss into a reducible excess risk and an irreducible Bayes error; it dynamically estimates the excess risk of each view and adapts the loss weights accordingly, and it adds geometry-aware and texture-aware objectives to form a global-local optimization paradigm. Result: On the Google Scanned Objects and OmniObject3D datasets, ERGO outperforms existing state-of-the-art methods in both geometric fidelity and texture quality, and is robust to supervision noise. Conclusion: With a theory-driven loss decomposition and adaptive weighting, ERGO makes effective use of noisy synthesized-view supervision and provides a more reliable, higher-quality optimization framework for single-image 3D generation. Abstract: Generating 3D content from a single image remains a fundamentally challenging and ill-posed problem due to the inherent absence of geometric and textural information in occluded regions. While state-of-the-art generative models can synthesize auxiliary views to provide additional supervision, these views inevitably contain geometric inconsistencies and textural misalignments that propagate and amplify artifacts during 3D reconstruction. To effectively harness these imperfect supervisory signals, we propose an adaptive optimization framework guided by excess risk decomposition, termed ERGO. Specifically, ERGO decomposes the optimization losses in 3D Gaussian splatting into two components, i.e., excess risk that quantifies the suboptimality gap between current and optimal parameters, and Bayes error that models the irreducible noise inherent in synthesized views. This decomposition enables ERGO to dynamically estimate the view-specific excess risk and adaptively adjust loss weights during optimization. Furthermore, we introduce geometry-aware and texture-aware objectives that complement the excess-risk-derived weighting mechanism, establishing a synergistic global-local optimization paradigm. Consequently, ERGO demonstrates robustness against supervision noise while consistently enhancing both geometric fidelity and textural quality of the reconstructed 3D content. Extensive experiments on the Google Scanned Objects dataset and the OmniObject3D dataset demonstrate the superiority of ERGO over existing state-of-the-art methods.
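The decomposition described in the abstract can be written schematically as follows. The notation is an assumption introduced here for illustration (v indexes a synthesized view, θ the current Gaussian parameters, θ* the optimal parameters); the abstract does not give the paper's exact formulation or how each term is estimated.

```latex
\mathcal{L}_v(\theta)
  = \underbrace{\mathcal{L}_v(\theta) - \mathcal{L}_v(\theta^{*})}_{\text{excess risk } \mathcal{E}_v(\theta)}
  + \underbrace{\mathcal{L}_v(\theta^{*})}_{\text{Bayes error } \mathcal{B}_v}
```

Read this way, a view whose loss is dominated by the irreducible term carries mostly noise, while a view with large excess risk still has something to teach the model, which is the intuition behind weighting views by their estimated excess risk.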

[59] A Low-Rank Defense Method for Adversarial Attack on Diffusion Models

Jiaxuan Zhu,Siyu Huang

Main category: cs.CV

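TL;DR: This paper proposes LoRD, an efficient defense against adversarial attacks on latent diffusion models (LDMs). It combines low-rank adaptation (LoRA) modules with a merging idea and a balance parameter to build a defense pipeline that detects and defends against adversarial samples, and reports better performance than baselines on facial and landscape images.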

Details

Motivation: Adversarial attacks on diffusion models and on their fine-tuning process are developing rapidly; defenses are needed to keep abuse of these attacks from affecting the practical use of diffusion models. Method: The Low-Rank Defense (LoRD) strategy combines low-rank adaptation (LoRA) modules with a merging idea and a balance parameter, and is used to build a defense pipeline for LDMs. Result: With LoRD, an LDM fine-tuned on both adversarial and clean samples can still generate high-quality images; in experiments on facial and landscape images, it significantly outperforms baseline methods. Conclusion: LoRD is an efficient and practical adversarial defense for diffusion models that balances robustness and image generation quality. Abstract: Recently, adversarial attacks for diffusion models as well as their fine-tuning process have been developed rapidly. To prevent the abuse of these attack algorithms from affecting the practical application of diffusion models, it is critical to develop corresponding defensive strategies. In this work, we propose an efficient defensive strategy, named Low-Rank Defense (LoRD), to defend against adversarial attacks on Latent Diffusion Models (LDMs). LoRD introduces the merging idea and a balance parameter, combined with the low-rank adaptation (LoRA) modules, to detect and defend against adversarial samples. Based on LoRD, we build up a defense pipeline that applies the learned LoRD modules to help diffusion models defend against attack algorithms. Our method ensures that the LDM fine-tuned on both adversarial and clean samples can still generate high-quality images. To demonstrate the effectiveness of our approach, we conduct extensive experiments on facial and landscape images, and our method shows significantly better defense performance compared to the baseline methods.

[60] Flow Matching with Uncertainty Quantification and Guidance

Juyeop Han,Lukas Lao Beyer,Sertac Karaman

Main category: cs.CV

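TL;DR: This paper proposes uncertainty-aware flow matching (UA-Flow), which jointly predicts the velocity field and heteroscedastic uncertainty within flow matching to improve the quality and reliability of generated samples.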

Details

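Motivation: Sampling-based generative models such as flow matching are successful, but sample quality can be inconsistent or degraded, so sample reliability needs to be assessed and generation quality improved. Method: UA-Flow is a lightweight extension of flow matching that models heteroscedastic uncertainty alongside the predicted velocity field; it propagates velocity uncertainty through the flow dynamics to estimate per-sample uncertainty, and uses that estimate in uncertainty-aware classifier guidance and classifier-free guidance. Result: Experiments show that UA-Flow's uncertainty signals correlate more strongly with sample fidelity than those of baseline methods, and that uncertainty-guided sampling further improves image generation quality. Conclusion: UA-Flow is an effective, lightweight improvement that quantifies sample uncertainty reliably and uses it to raise generation quality. Abstract: Despite the remarkable success of sampling-based generative models such as flow matching, they can still produce samples of inconsistent or degraded quality. To assess sample reliability and generate higher-quality outputs, we propose uncertainty-aware flow matching (UA-Flow), a lightweight extension of flow matching that predicts the velocity field together with heteroscedastic uncertainty. UA-Flow estimates per-sample uncertainty by propagating velocity uncertainty through the flow dynamics. These uncertainty estimates act as a reliability signal for individual samples, and we further use them to steer generation via uncertainty-aware classifier guidance and classifier-free guidance. Experiments on image generation show that UA-Flow produces uncertainty signals more highly correlated with sample fidelity than baseline methods, and that uncertainty-guided sampling further improves generation quality.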
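Predicting a velocity "together with heteroscedastic uncertainty" is commonly trained with a Gaussian negative log-likelihood in which the model outputs a per-dimension log-variance. The sketch below shows that generic loss; it is a common formulation assumed here for illustration, and the abstract does not say that UA-Flow uses exactly this objective. All names are hypothetical.

```python
import math

def gaussian_nll(pred_velocity, target_velocity, log_var):
    """Mean per-dimension Gaussian NLL, dropping the constant 0.5*log(2*pi).
    A large predicted variance discounts the squared error but pays a log penalty."""
    terms = [0.5 * (lv + (t - p) ** 2 / math.exp(lv))
             for p, t, lv in zip(pred_velocity, target_velocity, log_var)]
    return sum(terms) / len(terms)

# Same prediction error, two different uncertainty claims (made-up numbers).
overconfident = gaussian_nll([0.0], [2.0], [0.0])   # variance 1
hedged = gaussian_nll([0.0], [2.0], [1.0])          # variance e
```

With an error of 2, admitting more variance lowers the loss, which is how this objective pushes the model to report high uncertainty where its velocity estimate is unreliable.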

[61] Conditional Uncertainty-Aware Political Deepfake Detection with Stochastic Convolutional Neural Networks

Rafael-Petruţ Gardoş

Main category: cs.CV

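TL;DR: This paper proposes a conditional, uncertainty-aware detection approach for political deepfake images using stochastic convolutional neural networks. Uncertainty is evaluated within an empirical, decision-oriented reliability framework that emphasizes calibration quality and operational usefulness rather than purely Bayesian modeling; experiments show that calibrated probabilistic outputs and uncertainty estimates can support risk-aware content moderation policies.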

Details

Motivation: Most existing automated deepfake detectors provide only point predictions with no indication of reliability, which is a critical operational flaw in high-stakes political contexts; uncertainty-aware detection is needed. Method: The authors construct a politically focused binary image dataset; fully fine-tune ResNet-18 and EfficientNet-B4; compare deterministic inference, single-pass stochastic prediction, MC dropout, temperature scaling, and ensemble-based uncertainty surrogates; and evaluate uncertainty through observable criteria such as calibration quality, proper scoring rules, and error-conditioned analysis. Result: Calibrated probabilistic outputs and uncertainty estimates make risk-aware moderation policies more feasible; a systematic confidence-band analysis shows where uncertainty adds operational value beyond predicted confidence, and the study also identifies the limits of the approach in political settings. Conclusion: Uncertainty should be treated not only as a theoretical concept but as a measurable, actionable decision signal; the work offers an empirically driven, reliability-oriented approach to deepfake detection for high-stakes political content moderation. Abstract: Recent advances in generative image models have enabled the creation of highly realistic political deepfakes, posing risks to information integrity, public trust, and democratic processes. While automated deepfake detectors are increasingly deployed in moderation and investigative pipelines, most existing systems provide only point predictions and fail to indicate when outputs are unreliable, an operationally critical limitation in high-stakes political contexts. This work investigates conditional, uncertainty-aware political deepfake detection using stochastic convolutional neural networks within an empirical, decision-oriented reliability framework. Rather than treating uncertainty as a purely Bayesian construct, it is evaluated through observable criteria, including calibration quality, proper scoring rules, and its alignment with prediction errors under both global and confidence-conditioned analyses. A politically focused binary image dataset is constructed via deterministic metadata filtering from a large public real-synthetic corpus. Two pretrained CNN backbones (ResNet-18 and EfficientNet-B4) are fully fine-tuned for classification. Deterministic inference is compared with single-pass stochastic prediction, Monte Carlo dropout with multiple forward passes, temperature scaling, and ensemble-based uncertainty surrogates. Evaluation reports ROC-AUC, thresholded confusion matrices, calibration metrics, and generator-disjoint out-of-distribution performance.
Results demonstrate that calibrated probabilistic outputs and uncertainty estimates enable risk-aware moderation policies. A systematic confidence-band analysis further clarifies when uncertainty provides operational value beyond predicted confidence, delineating both the benefits and limitations of uncertainty-aware deepfake detection in political settings.
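Two of the tools named in the abstract, temperature scaling and calibration metrics, are easy to illustrate for a binary detector. The sketch below is a generic version under stated assumptions: a single scalar logit per image, equal-width confidence bins, and expected calibration error (ECE) as the calibration metric. The paper's exact metrics, binning, and temperature fitting procedure are not given in the abstract, and every name here is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def temperature_scale(logit, temperature):
    """Soften (T > 1) or sharpen (T < 1) a binary logit before the sigmoid."""
    return sigmoid(logit / temperature)

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE over equal-width bins of the predicted-class confidence."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    total = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(a for _, a in b) / len(b)
            ece += len(b) / total * abs(avg_conf - accuracy)
    return ece

# Made-up logits and labels (1 = deepfake) for four images.
logits = [4.0, 3.5, -4.0, 3.8]
labels = [1, 0, 0, 1]
raw_ece = expected_calibration_error([sigmoid(z) for z in logits], labels)
scaled_ece = expected_calibration_error(
    [temperature_scale(z, 3.0) for z in logits], labels)
```

In practice the temperature would be fitted on a held-out validation set rather than fixed by hand as it is here.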

[62] Monte Carlo Maximum Likelihood Reconstruction for Digital Holography with Speckle

Xi Chen,Arian Maleki,Shirin Jalali

Main category: cs.CV

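TL;DR: This paper proposes PGD-MC, a projected gradient descent method based on randomized linear algebra and Monte Carlo estimation, for scalable maximum likelihood estimation (MLE) reconstruction in digital holography. It needs no explicit matrix inversion, supports high resolution and physically accurate finite-aperture modeling, and substantially improves reconstruction quality and computational efficiency.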

Details

Motivation: The cost of high-dimensional matrix inversion makes traditional MLE impractical with physically accurate finite-aperture models in coherent imaging such as digital holography; existing methods often rely on simplifying assumptions that sacrifice modeling accuracy. Method: PGD-MC uses conjugate gradient to compute the likelihood gradient efficiently, exploits the structure of the sensing matrix together with Monte Carlo estimation to avoid explicit matrix inversion, and incorporates several denoisers as regularization. Result: The method is robust across diverse realistic aperture models; it substantially improves on existing Plug-and-Play methods in both reconstruction accuracy and speed; and it scales to high-resolution holographic reconstruction. Conclusion: PGD-MC provides a flexible, efficient, and physically accurate MLE reconstruction framework for finite-aperture digital holography, improving the balance between statistical modeling and computational feasibility in coherent imaging. Abstract: In coherent imaging, speckle is statistically modeled as multiplicative noise, posing a fundamental challenge for image reconstruction. While maximum likelihood estimation (MLE) provides a principled framework for speckle mitigation, its application to coherent imaging systems such as digital holography with finite apertures is hindered by the prohibitive cost of high-dimensional matrix inversion, especially at high resolutions. This computational burden has prevented the use of MLE-based reconstruction with physically accurate aperture modeling. In this work, we propose a randomized linear algebra approach that enables scalable MLE optimization without explicit matrix inversions in gradient computation. By exploiting the structural properties of the sensing matrix and using conjugate gradient for likelihood gradient evaluation, the proposed algorithm supports accurate aperture modeling without the simplifying assumptions commonly imposed for tractability. We term the resulting method projected gradient descent with Monte Carlo estimation (PGD-MC). The proposed PGD-MC framework (i) demonstrates robustness to diverse and physically accurate aperture models, (ii) achieves substantial improvements in reconstruction quality and computational efficiency, and (iii) scales effectively to high-resolution digital holography. Extensive experiments incorporating three representative denoisers as regularization show that PGD-MC provides a flexible and effective MLE-based reconstruction framework for digital holography with finite apertures, consistently outperforming prior Plug-and-Play model-based iterative reconstruction methods in both accuracy and speed.
Our code is available at: https://github.com/Computational-Imaging-RU/MC_Maximum_Likelihood_Digital_Holography_Speckle.
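The abstract's key computational idea, evaluating terms of the form A⁻¹b without ever forming A⁻¹, is what the conjugate gradient method provides: it only needs matrix-vector products. The sketch below is the textbook algorithm for a symmetric positive definite system, written against a `matvec` callback; it is a generic illustration, not code from the linked repository, and the 2×2 matrix is a made-up example.

```python
def conjugate_gradient(matvec, b, tol=1e-12, max_iter=100):
    """Solve A x = b for symmetric positive definite A, given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)                       # residual b - A x with x = 0
    p = list(r)                       # initial search direction
    rs_old = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs_old < tol:              # converged (also guards division by zero)
            break
        ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Toy SPD system standing in for a covariance-like operator.
A = [[4.0, 1.0], [1.0, 3.0]]
apply_A = lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A]
solution = conjugate_gradient(apply_A, [1.0, 2.0])
```

Because only `apply_A` is called, the same routine works when A is too large to store and is available only as a fast operator, which is the situation the abstract describes for structured sensing matrices.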

[63] Comp2Comp: Open-Source Software with FDA-Cleared Artificial Intelligence Algorithms for Computed Tomography Image Analysis

Adrit Rao,Malte Jensen,Andrea T. Fisher,Louis Blankemeier,Pauline Berens,Arash Fereydooni,Seth Lirette,Eren Alkan,Felipe C. Kitamura,Juan M. Zambrano Chaves,Eduardo Reis,Arjun Desai,Marc H. Willis,Jason Hom,Andrew Johnston,Leon Lenchik,Robert D. Boutin,Eduardo M. J. M. Farina,Augusto S. Serpa,Marcelo S. Takahashi,Jordan Perchik,Steven A. Rothenberg,Jamie L. Schroeder,Ross Filice,Leonardo K. Bittencourt,Hari Trivedi,Marly van Assen,John Mongan,Kimberly Kallianos,Oliver Aalami,Akshay S. Chaudhari

Main category: cs.CV

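TL;DR: This paper describes two FDA 510(k)-cleared deep learning pipelines in the open-source Comp2Comp package, Abdominal Aortic Quantification (AAQ) and Bone Mineral Density (BMD) estimation, for opportunistic analysis of CT scans. AAQ has a mean absolute error of 1.57 mm for aneurysm size assessment, and BMD reaches 81.0% sensitivity and 78.4% specificity for classifying osteoporosis risk; the validation results support clinical use.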

Details

Motivation: Existing open-source image analysis tools lack rigorous validation, and commercial tools lack transparency, which leads to unexpected failures in clinical deployment. Method: The authors develop and validate two fully open-source, FDA 510(k)-cleared deep learning pipelines (AAQ and BMD) within the Comp2Comp package, for abdominal aorta segmentation with diameter measurement and for vertebral body segmentation with bone density estimation; both are validated on multi-center external datasets against radiologist annotations (AAQ) and DXA ground truth (BMD). Result: AAQ has a mean absolute error of 1.57 mm (95% CI 1.38-1.80 mm) on 258 patient CT scans; BMD binary classification on 371 patient scans has a sensitivity of 81.0% (95% CI 74.0-86.8%) and a specificity of 78.4% (95% CI 72.3-83.7%). Conclusion: The AAQ and BMD algorithms in Comp2Comp are accurate enough for clinical use, and open-sourcing them makes the FDA clearance process more transparent and makes it easier for hospitals to pre-test them and for researchers to reuse them. Abstract: Artificial intelligence allows automatic extraction of imaging biomarkers from already-acquired radiologic images. This paradigm of opportunistic imaging adds value to medical imaging without additional imaging costs or patient radiation exposure. However, many open-source image analysis solutions lack rigorous validation while commercial solutions lack transparency, leading to unexpected failures when deployed. Here, we report development and validation for two of the first fully open-sourced, FDA-510(k)-cleared deep learning pipelines to mitigate both challenges: Abdominal Aortic Quantification (AAQ) and Bone Mineral Density (BMD) estimation are both offered within the Comp2Comp package for opportunistic analysis of computed tomography scans. AAQ segments the abdominal aorta to assess aneurysm size; BMD segments vertebral bodies to estimate trabecular bone density and osteoporosis risk. AAQ-derived maximal aortic diameters were compared against radiologist ground-truth measurements on 258 patient scans enriched for abdominal aortic aneurysms from four external institutions. BMD binary classifications (low vs. normal bone density) were compared against concurrent DXA scan ground truths obtained on 371 patient scans from four external institutions. AAQ had an overall mean absolute error of 1.57 mm (95% CI 1.38-1.80 mm). BMD had a sensitivity of 81.0% (95% CI 74.0-86.8%) and specificity of 78.4% (95% CI 72.3-83.7%). Comp2Comp AAQ and BMD demonstrated sufficient accuracy for clinical use.
Open-sourcing these algorithms improves transparency of typically opaque FDA clearance processes, allows hospitals to test the algorithms before cumbersome clinical pilots, and provides researchers with best-in-class methods.
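The three headline metrics in the abstract (mean absolute error, sensitivity, specificity) have standard definitions, sketched below for readers who want to reproduce them on their own validation data. The function names and the toy measurements are made up; this is not Comp2Comp code and does not compute the reported confidence intervals.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference, e.g. aortic diameters in millimetres."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)

def sensitivity_specificity(y_true, y_pred):
    """Binary labels: 1 = condition present (e.g. low bone density), 0 = absent."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: two diameter measurements and four density classifications.
mae_mm = mean_absolute_error([30.0, 52.0], [31.0, 50.0])
sens, spec = sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 0])
```

Sensitivity is the fraction of true positives that the algorithm catches, and specificity is the fraction of true negatives it correctly leaves alone.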

[64] HII-DPO: Eliminate Hallucination via Accurate Hallucination-Inducing Counterfactual Images

Yilin Yang,Zhenghui Guo,Yuke Wang,Omprakash Gnawali,Sheng Di,Chengming Zhang

Main category: cs.CV

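TL;DR: This paper proposes a new method that synthesizes hallucination-inducing images (HIIs) to reveal and quantify scene-conditioned hallucination patterns caused by language bias in large vision-language models (VLMs), and builds the MOH benchmark and preference datasets that substantially reduce hallucination.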

Details

Motivation: Existing hallucination mitigation methods overlook the underlying hallucination patterns driven by language bias, which need to be revealed and quantified systematically. Method: The authors design a new pipeline to synthesize Hallucination-Inducing Images (HIIs), use them to uncover a scene-conditioned hallucination pattern, establish the Masked-Object-Hallucination (MOH) benchmark, and build high-quality preference datasets from HIIs for fine-grained alignment. Result: The method improves on the current state of the art by up to 38% on standard hallucination benchmarks while preserving general model capabilities. Conclusion: Scene-conditioned hallucination caused by language bias is an important weakness of VLMs, and HII-driven synthesis, evaluation, and alignment can mitigate it effectively. Abstract: Large Vision-Language Models (VLMs) have achieved remarkable success across diverse multimodal tasks but remain vulnerable to hallucinations rooted in inherent language bias. Despite recent progress, existing hallucination mitigation methods often overlook the underlying hallucination patterns driven by language bias. In this work, we design a novel pipeline to accurately synthesize Hallucination-Inducing Images (HIIs). Using synthesized HIIs, we reveal a consistent scene-conditioned hallucination pattern: models tend to mention objects that are highly typical of the scene even when visual evidence is removed. To quantify the susceptibility of VLMs to this hallucination pattern, we establish the Masked-Object-Hallucination (MOH) benchmark to rigorously evaluate existing state-of-the-art alignment frameworks. Finally, we leverage HIIs to construct high-quality preference datasets for fine-grained alignment. Experimental results demonstrate that our approach effectively mitigates hallucinations while preserving general model capabilities. Specifically, our method achieves up to a 38% improvement over the current state-of-the-art on standard hallucination benchmarks.

[65] Towards Remote Sensing Change Detection with Neural Memory

Zhenyu Yang,Gensheng Pei,Yazhou Yao,Tianfei Zhou,Lizhong Ding,Fumin Shen

Main category: cs.CV

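TL;DR: This paper proposes ChangeTitans, a framework for remote sensing change detection that combines the VTitans vision backbone, a hierarchical VTitans-Adapter, and the TS-CBAM two-stream fusion module to improve long-range dependency modeling and detection accuracy while remaining computationally efficient.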

Details

Motivation: Existing methods struggle to model long-range dependencies while staying computationally efficient; Transformers capture global context but at high complexity, and linear attention methods often fail to capture complex spatiotemporal relationships. Method: Propose the Titans-based ChangeTitans framework, comprising: 1) the VTitans vision backbone, which integrates neural memory with segmented local attention; 2) a hierarchical VTitans-Adapter that refines multi-scale features; 3) the TS-CBAM two-stream module, which uses cross-temporal attention to suppress pseudo-changes. Result: State-of-the-art performance on four benchmark datasets (LEVIR-CD, WHU-CD, LEVIR-CD+, and SYSU-CD), with 84.36% IoU and 91.52% F1-score on LEVIR-CD at a manageable computational cost. Conclusion: ChangeTitans effectively balances modeling capacity and efficiency, showing its strength and practicality for remote sensing change detection. Abstract: Remote sensing change detection is essential for environmental monitoring, urban planning, and related applications. However, current methods often struggle to capture long-range dependencies while maintaining computational efficiency. Although Transformers can effectively model global context, their quadratic complexity poses scalability challenges, and existing linear attention approaches frequently fail to capture intricate spatiotemporal relationships. Drawing inspiration from the recent success of Titans in language tasks, we present ChangeTitans, a Titans-based framework for remote sensing change detection. Specifically, we propose VTitans, the first Titans-based vision backbone that integrates neural memory with segmented local attention, thereby capturing long-range dependencies while mitigating computational overhead. Next, we present a hierarchical VTitans-Adapter to refine multi-scale features across different network layers. Finally, we introduce TS-CBAM, a two-stream fusion module leveraging cross-temporal attention to suppress pseudo-changes and enhance detection accuracy. Experimental evaluations on four benchmark datasets (LEVIR-CD, WHU-CD, LEVIR-CD+, and SYSU-CD) demonstrate that ChangeTitans achieves state-of-the-art results, attaining 84.36% IoU and 91.52% F1-score on LEVIR-CD, while remaining computationally competitive.
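The two LEVIR-CD figures quoted above can be sanity-checked against each other. The sketch below assumes that IoU and F1 are both computed for the change class from the same pooled confusion counts (an assumption, since the abstract does not state the evaluation protocol); under that assumption F1 = 2·IoU / (1 + IoU). The function names are illustrative.

```python
def iou_f1_from_counts(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """IoU and F1 of the positive (change) class from confusion counts."""
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


def f1_from_iou(iou: float) -> float:
    """F1 implied by IoU when both use the same counts: F1 = 2*IoU / (1 + IoU)."""
    return 2.0 * iou / (1.0 + iou)


if __name__ == "__main__":
    # Reported IoU of 84.36% implies an F1 close to the reported 91.52%.
    print(round(100 * f1_from_iou(0.8436), 2))
```

The implied value agrees with the reported F1-score to within rounding, which is consistent with both metrics coming from the same counts.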

[66] End-to-End LiDAR optimization for 3D point cloud registration

Siddhant Katyan,Marc-André Gardner,Jean-François Lalonde

Main category: cs.CV

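TL;DR: This paper proposes an adaptive LiDAR sensing framework that feeds registration feedback into the sensing loop and jointly optimizes LiDAR acquisition and registration hyperparameters, balancing point density, noise, and sparsity to improve registration accuracy and efficiency.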

Details

Motivation: Conventional point cloud registration relies on pre-acquired data with fixed LiDAR configurations, leading to suboptimal data collection and substantial computational overhead (sampling, denoising, parameter tuning); LiDAR sensor design is decoupled from downstream tasks such as registration. Method: Propose an adaptive LiDAR sensing framework that dynamically adjusts sensor parameters and embeds registration feedback in the sensing loop, jointly optimizing the LiDAR acquisition strategy and registration hyperparameters. Result: In CARLA simulation, the method outperforms fixed-parameter baselines, improving registration accuracy and efficiency while retaining good generalization. Conclusion: Adaptive LiDAR sensing can effectively coordinate sensor configuration with downstream tasks, offering a new paradigm for autonomous driving and robotic perception. Abstract: LiDAR sensors are a key modality for 3D perception, yet they are typically designed independently of downstream tasks such as point cloud registration. Conventional registration operates on pre-acquired datasets with fixed LiDAR configurations, leading to suboptimal data collection and significant computational overhead for sampling, noise filtering, and parameter tuning. In this work, we propose an adaptive LiDAR sensing framework that dynamically adjusts sensor parameters, jointly optimizing LiDAR acquisition and registration hyperparameters. By integrating registration feedback into the sensing loop, our approach optimally balances point density, noise, and sparsity, improving registration accuracy and efficiency. Evaluations in the CARLA simulation demonstrate that our method outperforms fixed-parameter baselines while retaining generalization abilities, highlighting the potential of adaptive LiDAR for autonomous perception and robotic applications.

[67] Characterizing and Optimizing the Spatial Kernel of Multi Resolution Hash Encodings

Tianxiang Dai,Jonathan Fan

Main category: cs.CV

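TL;DR: This paper proposes a point spread function (PSF) based analysis that quantitatively characterizes the spatial behavior of multi-resolution hash encoding (MHE) from a physical systems perspective, revealing its anisotropy, its logarithmic spatial response, and that its effective resolution is governed by the average rather than the maximum resolution; on this basis it designs Rotated MHE (R-MHE) to eliminate anisotropy and improve the signal-to-noise ratio.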

Details

Motivation: MHE is widely used in neural fields but lacks a rigorous physical-systems understanding, and hyperparameter selection relies on empirical heuristics, so theoretical guidance is needed. Method: Define and analyze the point spread function (PSF) of MHE, derive a closed-form approximation of the collision-free PSF, and quantify spatial resolution and fidelity; model the speckle noise introduced by hash collisions; propose the Rotated MHE (R-MHE) architecture to mitigate anisotropy. Result: The effective spatial bandwidth (FWHM) of MHE is determined by the average resolution N_avg rather than N_max; optimization is shown to broaden the FWHM; finite hash capacity causes speckle noise and lowers SNR; R-MHE substantially mitigates anisotropy without adding parameters. Conclusion: The paper establishes the first MHE analysis framework grounded in physical principles (the PSF/Green's function analogy), moving MHE design from heuristics toward an interpretable, optimizable, theory-driven paradigm. Abstract: Multi-Resolution Hash Encoding (MHE), the foundational technique behind Instant Neural Graphics Primitives, provides a powerful parameterization for neural fields. However, its spatial behavior lacks rigorous understanding from a physical systems perspective, leading to reliance on heuristics for hyperparameter selection. This work introduces a novel analytical approach that characterizes MHE by examining its Point Spread Function (PSF), which is analogous to the Green's function of the system. This methodology enables a quantification of the encoding's spatial resolution and fidelity. We derive a closed-form approximation for the collision-free PSF, uncovering inherent grid-induced anisotropy and a logarithmic spatial profile. We establish that the idealized spatial bandwidth, specifically the Full Width at Half Maximum (FWHM), is determined by the average resolution, $N_{\text{avg}}$. This leads to a counterintuitive finding: the effective resolution of the model is governed by the broadened empirical FWHM (and therefore $N_{\text{avg}}$), rather than the finest resolution $N_{\max}$, a broadening effect we demonstrate arises from optimization dynamics. Furthermore, we analyze the impact of finite hash capacity, demonstrating how collisions introduce speckle noise and degrade the Signal-to-Noise Ratio (SNR). Leveraging these theoretical insights, we propose Rotated MHE (R-MHE), an architecture that applies distinct rotations to the input coordinates at each resolution level. R-MHE mitigates anisotropy while maintaining the efficiency and parameter count of the original MHE. 
This study establishes a methodology based on physical principles that moves beyond heuristics to characterize and optimize MHE.
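The idea of applying a distinct rotation to the input coordinates at each resolution level can be sketched in 2D as follows. This is only an illustration under stated assumptions: the evenly spaced angle schedule over [0, π/2), the rounded geometric resolution ladder, and all function names are hypothetical choices, not the schedule or code used by the authors.

```python
import numpy as np


def level_resolutions(n_min: int, n_max: int, num_levels: int) -> np.ndarray:
    """Geometric ladder of grid resolutions from n_min to n_max (rounded; simplified)."""
    return np.round(np.geomspace(n_min, n_max, num_levels)).astype(int)


def rotate_2d(xy: np.ndarray, angle: float) -> np.ndarray:
    """Rotate an (N, 2) array of points about the origin by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    return xy @ rot.T


def rotated_level_coords(xy: np.ndarray, resolutions: np.ndarray) -> list[np.ndarray]:
    """Per-level grid coordinates: rotate by a level-specific angle, then scale.

    Hypothetical angle schedule: evenly spaced over [0, pi/2), so level 0 is unrotated.
    The scaled coordinates would then be passed to the usual grid lookup and hashing.
    """
    num_levels = len(resolutions)
    angles = np.arange(num_levels) * (np.pi / 2) / num_levels
    return [rotate_2d(xy, a) * n for a, n in zip(angles, resolutions)]


if __name__ == "__main__":
    pts = np.random.default_rng(0).uniform(-0.5, 0.5, size=(4, 2))
    res = level_resolutions(16, 512, 6)
    for n, coords in zip(res, rotated_level_coords(pts, res)):
        print(n, coords.shape)
```

Because a rotation adds no trainable parameters, a scheme of this kind leaves the table sizes untouched, which matches the abstract's statement that R-MHE keeps the parameter count of the original MHE.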

[68] The Garbage Dataset (GD): A Multi-Class Image Benchmark for Automated Waste Segregation

Suman Kunwar

Main category: cs.CV

TL;DR: This paper introduces the Garbage Dataset (GD), a large, diverse, publicly available image dataset for waste classification, benchmarks deep learning models on it, and highlights challenges like class imbalance and carbon cost of model training.

Details

Motivation: To advance automated waste segregation through machine learning by providing a realistic, publicly available, and well-characterized benchmark dataset addressing real-world challenges such as class imbalance and background complexity. Method: Constructed GD with 13,348 labeled images across 10 waste categories; applied rigorous validation (checksums, outlier detection), statistical and visual analysis (PCA/t-SNE, entropy, saliency), and benchmarked multiple SOTA deep learning models on accuracy, F1-score, and operational carbon emissions. Result: EfficientNetV2S achieved best performance (96.19% accuracy, 0.96 F1-score) but with moderate carbon cost; analysis revealed class imbalance, high-outlier classes (plastic, cardboard, paper), and brightness variations. Conclusion: GD is a valuable real-world benchmark for waste classification research, yet practical deployment must address class imbalance, background complexity, and environmental trade-offs in model selection. Abstract: This study introduces the Garbage Dataset (GD), a publicly available image dataset designed to advance automated waste segregation through machine learning and computer vision. It's a diverse dataset covering 10 common household waste categories: metal, glass, biological, paper, battery, trash, cardboard, shoes, clothes, and plastic. The dataset comprises 13,348 labeled images collected through multiple methods, including DWaste mobile app and curated web sources. Methods included rigorous validation through checksums and outlier detection, analysis of class imbalance and visual separability via PCA/t-SNE, and assessment of background complexity using entropy and saliency measures. The dataset was benchmarked using state-of-the-art deep learning models (EfficientNetV2M, EfficientNetV2S, MobileNet, ResNet50, ResNet101) evaluated on performance metrics and operational carbon emissions. 
Experimental results indicate that EfficientNetV2S achieved the highest performance with 96.19% accuracy and a 0.96 F1-score, though with a moderate carbon cost. Analysis revealed inherent dataset characteristics including class imbalance, a skew toward high-outlier classes (plastic, cardboard, paper), and brightness variations that require consideration. The main conclusion is that GD provides a valuable, real-world benchmark for waste classification research while highlighting important challenges such as class imbalance, background complexity, and environmental trade-offs in model selection that must be addressed for practical deployment. The dataset is publicly released to support further research in environmental sustainability applications.

[69] Med-SegLens: Latent-Level Model Diffing for Interpretable Medical Image Segmentation

Salma J. Ahmed,Emad A. Mohammed,Azam Asilian Bidgoli

Main category: cs.CV

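TL;DR: Med-SegLens is a model-diffing framework that decomposes segmentation model activations with sparse autoencoders, identifies latent representations that are stable across architectures and datasets, and uses latent-space interventions to improve cross-dataset generalization.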

Details

Motivation: Modern segmentation models perform strongly but lack interpretability, making it hard to diagnose failures, understand dataset shift, or intervene in a principled way. Method: Propose the Med-SegLens framework, which uses sparse autoencoders to extract interpretable latent features from SegFormer and U-Net and performs latent alignment across architectures and datasets (healthy, adult, pediatric, and sub-Saharan African glioma cohorts). Result: A stable backbone of shared representations is found, and dataset shift stems from differing reliance on population-specific latents; these latents act as causal bottlenecks for segmentation failures; targeted interventions recover 70% of failure cases without retraining, raising the Dice score from 39.4% to 74.2%. Conclusion: Latent-level model diffing provides a practical, mechanistic tool for diagnosing failures and mitigating dataset shift in segmentation models. Abstract: Modern segmentation models achieve strong predictive performance but remain largely opaque, limiting our ability to diagnose failures, understand dataset shift, or intervene in a principled manner. We introduce Med-SegLens, a model-diffing framework that decomposes segmentation model activations into interpretable latent features using sparse autoencoders trained on SegFormer and U-Net. Through cross-architecture and cross-dataset latent alignment across healthy, adult, pediatric, and sub-Saharan African glioma cohorts, we identify a stable backbone of shared representations, while dataset shift is driven by differential reliance on population-specific latents. We show that these latents act as causal bottlenecks for segmentation failures, and that targeted latent-level interventions can correct errors and improve cross-dataset adaptation without retraining, recovering performance in 70% of failure cases and improving Dice score from 39.4% to 74.2%. Our results demonstrate that latent-level model diffing provides a practical and mechanistic tool for diagnosing failures and mitigating dataset shift in segmentation models.
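For readers unfamiliar with the metric behind the 39.4% to 74.2% figure, the Dice score of a binary mask is twice the overlap divided by the total foreground size of the prediction and the reference. The sketch below is the standard definition only; the smoothing constant `eps` and the function name are illustrative choices, and how the paper averages Dice over cases or classes is not stated in the abstract.

```python
import numpy as np


def dice_score(pred, target, eps: float = 1e-8) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    # eps keeps the ratio defined (and equal to 1) when both masks are empty.
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))


if __name__ == "__main__":
    prediction = np.array([[1, 1, 0], [0, 0, 0]])
    reference = np.array([[1, 0, 0], [0, 0, 0]])
    print(dice_score(prediction, reference))
```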

[70] 1%>100%: High-Efficiency Visual Adapter with Complex Linear Projection Optimization

Dongshuo Yin,Xue Yang,Deng-Ping Fan,Shi-Min Hu

Main category: cs.CV

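TL;DR: This paper proposes CoLin, a new adapter based on complex linear projection optimization that adds only about 1% parameters yet substantially improves the adaptation efficiency of vision foundation models, outperforming full fine-tuning and classical delta-tuning methods.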

Details

Motivation: Conventional full fine-tuning is costly and inefficient; delta-tuning works well for LLMs but does not transfer directly to vision foundation models, so adaptation efficiency for vision tasks needs to improve. Method: Propose a low-rank complex adapter based on Complex Linear Projection Optimization (CoLin), theoretically analyze the convergence problem of low-rank composite matrices, and design a tailored loss function to address it. Result: On object detection, segmentation, image classification, and rotated object detection in remote sensing scenarios, CoLin is the first to outperform both full fine-tuning and classical delta-tuning with only 1% parameters. Conclusion: CoLin offers a novel and efficient adaptation solution for deploying vision foundation models, and the code has been open-sourced. Abstract: Deploying vision foundation models typically relies on efficient adaptation strategies, whereas conventional full fine-tuning suffers from prohibitive costs and low efficiency. While delta-tuning has proven effective in boosting the performance and efficiency of LLMs during adaptation, its advantages cannot be directly transferred to the fine-tuning pipeline of vision foundation models. To push the boundaries of adaptation efficiency for vision tasks, we propose an adapter with Complex Linear Projection Optimization (CoLin). For architecture, we design a novel low-rank complex adapter that introduces only about 1% parameters to the backbone. For efficiency, we theoretically prove that low-rank composite matrices suffer from severe convergence issues during training, and address this challenge with a tailored loss. Extensive experiments on object detection, segmentation, image classification, and rotated object detection (remote sensing scenario) demonstrate that CoLin outperforms both full fine-tuning and classical delta-tuning approaches with merely 1% parameters for the first time, providing a novel and efficient solution for deployment of vision foundation models. We release the code at https://github.com/DongshuoYin/CoLin.

[71] 3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars

Zhongju Wang,Zhenhong Sun,Beier Wang,Yifu Wang,Daoyi Dong,Huadong Mo,Hongdong Li

Main category: cs.CV

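TL;DR: This paper proposes 3DXTalker, a highly expressive 3D talking avatar generation method built on data-driven identity modeling, rich audio representations, and controllable spatial dynamics, which markedly improves lip synchronization, emotional expression, and the naturalness of head pose.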

Details

Motivation: Existing methods are limited by few identity samples, narrow audio representations, and a lack of explicit controllability, making it hard to jointly satisfy identity preservation, lip synchronization, emotional expression, and spatial dynamics. Method: Propose the 3DXTalker framework: 1) a 2D-to-3D data curation pipeline and disentangled representations to support scalable identity modeling; 2) frame-wise amplitude and emotional cues to enrich the audio embeddings; 3) a flow-matching-based Transformer to model facial dynamics in a unified way; 4) prompt-conditioned stylized control of head pose. Result: Outperforms existing methods on multiple metrics, achieving unified, high-quality generation of lip synchronization, emotional expression, and head motion, with strong results in both qualitative and quantitative evaluation. Conclusion: Through coordinated multi-dimensional modeling, 3DXTalker addresses key expressivity bottlenecks of 3D talking avatars and offers a new paradigm for digital human generation. Abstract: Audio-driven 3D talking avatar generation is increasingly important in virtual communication, digital humans, and interactive media, where avatars must preserve identity, synchronize lip motion with speech, express emotion, and exhibit lifelike spatial dynamics, collectively defining a broader objective of expressivity. However, achieving this remains challenging due to insufficient training data with limited subject identities, narrow audio representations, and restricted explicit controllability. In this paper, we propose 3DXTalker, an expressive 3D talking avatar through data-curated identity modeling, audio-rich representations, and spatial dynamics controllability. 3DXTalker enables scalable identity modeling via a 2D-to-3D data curation pipeline and disentangled representations, alleviating data scarcity and improving identity generalization. Then, we introduce frame-wise amplitude and emotional cues beyond standard speech embeddings, ensuring superior lip synchronization and nuanced expression modulation. These cues are unified by a flow-matching-based transformer for coherent facial dynamics. Moreover, 3DXTalker also enables natural head-pose motion generation while supporting stylized control via prompt-based conditioning. Extensive experiments show that 3DXTalker integrates lip synchronization, emotional expression, and head-pose dynamics within a unified framework and achieves superior performance in 3D talking avatar generation.

[72] MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps

Sharat Bhat,Harshita Khandelwal,Tushar Kataria,Vivek Gupta

Main category: cs.CV

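TL;DR: This paper presents MapVerse, a large-scale benchmark built on real-world maps, containing 11,837 human-authored question-answer pairs that cover 10 map categories and multiple question types, for evaluating vision-language models on map understanding and spatial reasoning; experiments show that existing models do reasonably well on simple classification tasks but still fall well short on complex spatial reasoning.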

Details

Motivation: Existing VLMs are inconsistent at map reasoning, and there is no high-quality, diverse, real-world benchmark that comprehensively evaluates genuine geospatial reasoning. Method: Build MapVerse, a large-scale real-world map benchmark with 1,025 real maps and 11,837 human-annotated question-answer pairs covering 10 map categories and multiple question categories; systematically evaluate 10 state-of-the-art models, with fine-grained categorical analysis and a study of the influence of visual factors. Result: Current VLMs perform well on classification-style tasks, but performance drops sharply on tasks requiring complex spatial reasoning (such as direction inference, path planning, and understanding topological relations); open- and closed-source models hit similar bottlenecks. Conclusion: MapVerse provides a more reliable and more challenging evaluation platform for multimodal map reasoning; the results expose fundamental limitations of existing models in deep spatial-semantic understanding and real-world generalization, and point to directions for future research. Abstract: Maps are powerful carriers of structured and contextual knowledge, encompassing geography, demographics, infrastructure, and environmental patterns. Reasoning over such knowledge requires models to integrate spatial relationships, visual cues, real-world context, and domain-specific expertise, capabilities that current large language models (LLMs) and vision-language models (VLMs) still struggle to exhibit consistently. Yet, datasets used to benchmark VLMs on map-based reasoning remain narrow in scope, restricted to specific domains, and heavily reliant on artificially generated content (outputs from LLMs or pipeline-based methods), offering limited depth for evaluating genuine geospatial reasoning. To address this gap, we present MapVerse, a large-scale benchmark built on real-world maps. It comprises 11,837 human-authored question-answer pairs across 1,025 maps, spanning ten diverse map categories and multiple question categories for each. The dataset provides a rich setting for evaluating map reading, interpretation, and multimodal reasoning. We evaluate ten state-of-the-art models against our benchmark to establish baselines and quantify reasoning gaps. Beyond overall performance, we conduct fine-grained categorical analyses to assess model inference across multiple dimensions and investigate the visual factors shaping reasoning outcomes. Our findings reveal that while current VLMs perform competitively on classification-style tasks, both open- and closed-source models fall short on advanced tasks requiring complex spatial reasoning.

[73] RealHD: A High-Quality Dataset for Robust Detection of State-of-the-Art AI-Generated Images

Hanzhe Yu,Yun Ye,Jintao Rong,Qi Xuan,Chen Ma

Main category: cs.CV

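TL;DR: This paper presents a high-quality, large-scale dataset (over 730,000 images) covering several AI generation modes (text-to-image, inpainting, refinement, face swapping) alongside real images, and designs a lightweight detection method based on noise entropy that outperforms existing methods in both generalization and performance.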

Details

Motivation: Existing datasets for AI-generated image detection suffer from poor generalization, low image quality, overly simple prompts, and insufficient diversity, so a higher-quality, richer, and more challenging benchmark dataset is needed. Method: Build a high-quality dataset of over 730,000 images spanning multiple categories and generation modes (text-to-image, image inpainting, image refinement, face swapping); annotate each generated image with its generation method and category, with binary masks additionally provided for inpainted images; propose a lightweight detection method based on an entropy tensor of Non-Local Means noise. Result: Detection models trained on the dataset generalize better; the proposed noise-entropy method is competitive and provides a solid baseline for future research; the dataset and code are open-sourced. Conclusion: By building a high-quality large-scale dataset and proposing a new detection method, this work substantially advances the robustness and practicality of AI-generated image detection. Abstract: The rapid advancement of generative AI has raised concerns about the authenticity of digital images, as highly realistic fake images can now be generated at low cost, potentially increasing societal risks. In response, several datasets have been established to train detection models aimed at distinguishing AI-generated images from real ones. However, existing datasets suffer from limited generalization, low image quality, overly simple prompts, and insufficient image diversity. To address these limitations, we propose a high-quality, large-scale dataset comprising over 730,000 images across multiple categories, including both real and AI-generated images. The generated images are synthesized via state-of-the-art methods, including text-to-image generation (guided by over 10,000 carefully designed prompts), image inpainting, image refinement, and face swapping. Each generated image is annotated with its generation method and category. Inpainting images further include binary masks to indicate inpainted regions, providing rich metadata for analysis. Compared to existing datasets, detection models trained on our dataset demonstrate superior generalization capabilities. Our dataset not only serves as a strong benchmark for evaluating detection methods but also contributes to advancing the robustness of AI-generated image detection techniques. Building upon this, we propose a lightweight detection method based on image noise entropy, which transforms the original image into an entropy tensor of Non-Local Means (NLM) noise before classification. 
Extensive experiments demonstrate that models trained on our dataset achieve strong generalization, and our method delivers competitive performance, establishing a solid baseline for future research. The dataset and source code are publicly available at https://real-hd.github.io.

[74] Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance

Shengyang Sun,Jiashen Hua,Junyi Feng,Xiaojin Gong

Main category: cs.CV

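TL;DR: This paper proposes a text-guided framework for weakly supervised multimodal video anomaly detection that augments text data via in-context learning and designs a multi-scale bottleneck Transformer fusion module, substantially improving detection performance.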

Details

Motivation: Existing weakly supervised multimodal video anomaly detection methods underuse the text modality; text carries explicit semantic information, but general-purpose language models struggle to capture anomaly-specific nuances, relevant text descriptions are scarce, and multimodal fusion is prone to redundancy and imbalance. Method: 1) A multi-stage text augmentation mechanism based on in-context learning that generates high-quality anomaly text samples for fine-tuning the text feature extractor; 2) a multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens for progressive cross-modal fusion, mitigating redundancy and imbalance. Result: State-of-the-art performance on the UCF-Crime and XD-Violence datasets. Conclusion: The text modality has significant potential in weakly supervised video anomaly detection, and the proposed text-guided framework effectively improves anomaly characterization and detection accuracy. Abstract: Weakly supervised multimodal video anomaly detection has gained significant attention, yet the potential of the text modality remains under-explored. Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging due to the inability of general-purpose language models to capture anomaly-specific nuances and the scarcity of relevant descriptions. Furthermore, multimodal fusion often suffers from redundancy and imbalance. To address these issues, we propose a novel text-guided framework. First, we introduce an in-context learning-based multi-stage text augmentation mechanism to generate high-quality anomaly text samples for fine-tuning the text feature extractor. Second, we design a multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens to progressively integrate information across modalities, mitigating redundancy and imbalance. Experiments on UCF-Crime and XD-Violence demonstrate state-of-the-art performance.

[75] C^2ROPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning

Guanting Ye,Qiyan Zhao,Wenhao Yu,Xiaofeng Zhang,Jianmin Ji,Yanyong Zhang,Ka-Veng Yuen

Main category: cs.CV

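TL;DR: This paper proposes C^2RoPE, an improved rotary positional encoding (RoPE) that introduces spatio-temporal continuity modeling and Chebyshev causal masking to address the loss of spatial locality in visual features and long-range attention decay in 3D multimodal models, substantially improving 3D scene reasoning and visual question answering.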

Details

Motivation: Existing LLM-based 3D multimodal models inherit 1D RoPE, which breaks the spatial continuity of visual features along the column dimension and loses spatial locality, and whose temporal-proximity prior causes long-term decay in which early visual tokens are neglected. Method: Propose C^2RoPE: 1) build a triplet hybrid positional index that combines the 1D temporal index with Cartesian spatial coordinates; 2) design a frequency allocation strategy to encode the three-component positional information; 3) introduce Chebyshev Causal Masking, based on the 2D Chebyshev distance, to model spatial causal relationships. Result: On multiple benchmarks including 3D scene reasoning and 3D visual question answering, C^2RoPE clearly outperforms the original RoPE and other positional encodings, validating its modeling of spatial continuity and causality. Conclusion: By explicitly modeling the local spatial continuity and spatial causal relationships of visual tokens, C^2RoPE overcomes inherent shortcomings of conventional RoPE in 3D multimodal understanding and offers a new paradigm for positional encoding in 3D LMMs. Abstract: Recent advances in 3D Large Multimodal Models (LMMs) built on Large Language Models (LLMs) have established the alignment of 3D visual features with LLM representations as the dominant paradigm. However, the inherited Rotary Position Embedding (RoPE) introduces limitations for multimodal processing. Specifically, applying 1D temporal positional indices disrupts the continuity of visual features along the column dimension, resulting in spatial locality loss. Moreover, RoPE follows the prior that temporally closer image tokens are more causally related, leading to long-term decay in attention allocation and causing the model to progressively neglect earlier visual tokens as the sequence length increases. To address these issues, we propose C^2RoPE, an improved RoPE that explicitly models local spatial Continuity and spatial Causal relationships for visual processing. C^2RoPE introduces a spatio-temporal continuous positional embedding mechanism for visual tokens. It first integrates 1D temporal positions with Cartesian-based spatial coordinates to construct a triplet hybrid positional index, and then employs a frequency allocation strategy to encode spatio-temporal positional information across the three index components. Additionally, we introduce Chebyshev Causal Masking, which determines causal dependencies by computing the Chebyshev distance of image tokens in 2D space. Evaluation results across various benchmarks, including 3D scene reasoning and 3D visual question answering, demonstrate C^2RoPE's effectiveness. 
The code is available at https://github.com/ErikZ719/C2RoPE.
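The locality argument in this abstract can be made concrete with a small sketch. For image tokens flattened in raster order, two vertically adjacent tokens are a full row width apart in the 1D index but only 1 apart in 2D Chebyshev distance, max(|Δrow|, |Δcol|). The `ring_causal_mask` function below is a purely hypothetical masking rule (tokens ordered by Chebyshev distance from the grid center) added for illustration; the abstract does not specify the paper's actual rule, and all names here are assumptions.

```python
import numpy as np


def grid_coords(height: int, width: int) -> np.ndarray:
    """(height*width, 2) array of (row, col) positions in raster order."""
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    return np.stack([rows.ravel(), cols.ravel()], axis=1)


def chebyshev_matrix(coords: np.ndarray) -> np.ndarray:
    """Pairwise Chebyshev distances max(|d_row|, |d_col|) between tokens."""
    diff = np.abs(coords[:, None, :] - coords[None, :, :])
    return diff.max(axis=-1)


def ring_causal_mask(height: int, width: int) -> np.ndarray:
    """Hypothetical rule: token i may attend to token j if j lies on the same
    or an inner Chebyshev ring around the grid center. mask[i, j] is True when allowed."""
    coords = grid_coords(height, width)
    center = np.array([(height - 1) / 2.0, (width - 1) / 2.0])
    ring = np.abs(coords - center).max(axis=1)
    return ring[None, :] <= ring[:, None]


if __name__ == "__main__":
    coords = grid_coords(4, 4)
    dist = chebyshev_matrix(coords)
    # Tokens 0 and 4 are vertical neighbours: 4 apart in raster index, 1 apart in 2D.
    print(abs(4 - 0), dist[0, 4])
```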

[76] MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning

Chenhao Zhang,Yazhe Niu,Hongsheng Li

Main category: cs.CV

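TL;DR: This paper proposes MetaphorStar, the first end-to-end visual reinforcement learning framework for image metaphor understanding, comprising the new dataset TFQ-Data, the reinforcement learning method TFQ-GRPO, and the benchmark TFQ-Bench; it far surpasses mainstream MLLMs on multiple image metaphor understanding tasks and finds that training on this task improves complex visual reasoning.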

Details

Motivation: Existing MLLMs are weak at image metaphor understanding (which involves culture, emotion, context, and ToM) because they lack multi-hop reasoning, cultural context modeling, and theory-of-mind capabilities. Method: Propose the MetaphorStar framework with three components: the fine-grained dataset TFQ-Data, the visual reinforcement learning algorithm TFQ-GRPO, and the structured benchmark TFQ-Bench, trained with an end-to-end RL strategy. Result: The MetaphorStar family improves by an average of 82.6% on image metaphor understanding benchmarks; MetaphorStar-32B reaches SOTA on multiple-choice and open-style questions and significantly outperforms Gemini-3.0-pro on true-false questions; training on this task also improves general visual reasoning. Conclusion: Visual reinforcement learning is an effective paradigm for improving image metaphor understanding; MetaphorStar demonstrates positive transfer from task-specific training to deeper visual semantic understanding and reasoning, and all resources are fully open-sourced. Abstract: Metaphorical comprehension in images remains a critical challenge for today's AI systems. While Multimodal Large Language Models (MLLMs) excel at basic Visual Question Answering (VQA), they consistently struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. This difficulty stems from the task's demand for sophisticated multi-hop reasoning, cultural context, and Theory of Mind (ToM) capabilities, which current models lack. To fill this gap, we propose MetaphorStar, the first end-to-end visual reinforcement learning (RL) framework for image implication tasks. Our framework includes three core components: the fine-grained dataset TFQ-Data, the visual RL method TFQ-GRPO, and the well-structured benchmark TFQ-Bench. Our fully open-source MetaphorStar family, trained using TFQ-GRPO on TFQ-Data, significantly improves performance by an average of 82.6% on the image implication benchmarks. Compared with 20+ mainstream MLLMs, MetaphorStar-32B achieves state-of-the-art (SOTA) on Multiple-Choice Question and Open-Style Question, and significantly outperforms the top closed-source model Gemini-3.0-pro on True-False Question. Crucially, our experiments reveal that learning image implication tasks improves the general understanding ability, especially the complex visual reasoning ability. We further provide a systematic analysis of model parameter scaling, training data scaling, and the impact of different model architectures and training strategies, demonstrating the broad applicability of our method. 
We open-sourced all model weights, datasets, and method code at https://metaphorstar.github.io.

[77] Enhancing Underwater Images via Adaptive Semantic-aware Codebook Learning

Bosen Lin,Feng Gao,Yanwei Yu,Junyu Dong,Qian Du

Main category: cs.CV

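TL;DR: This paper proposes SUCode, a semantic-aware underwater image enhancement network that achieves adaptive enhancement through a semantic-aware discrete codebook representation, addressing the color distortion and detail loss that arise when conventional methods ignore inconsistent degradation across regions.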

Details

Motivation: Underwater image enhancement (UIE) is an ill-posed problem: natural clean reference images are unavailable and degradation varies markedly across semantic regions; existing methods apply a single global model and ignore inconsistent degradation across scene components, causing color distortion and detail loss. Method: Propose the SUCode network: 1) a semantic-aware, pixel-level codebook representation; 2) a three-stage training paradigm that avoids pseudo ground-truth contamination; 3) a Gated Channel Attention Module (GCAM) and Frequency-Aware Feature Fusion (FAFF) that jointly model channel and frequency information. Result: Experiments on multiple benchmark datasets show that SUCode achieves state-of-the-art performance on both reference and no-reference metrics. Conclusion: By introducing a semantic-aware codebook and a multi-scale feature fusion mechanism, SUCode improves color fidelity and texture recovery in heterogeneous underwater scenes and offers a new approach to UIE. Abstract: Underwater Image Enhancement (UIE) is an ill-posed problem where natural clean references are not available, and the degradation levels vary significantly across semantic regions. Existing UIE methods treat images with a single global model and ignore the inconsistent degradation of different scene components. This oversight leads to significant color distortions and loss of fine details in heterogeneous underwater scenes, especially where degradation varies significantly across different image regions. Therefore, we propose SUCode (Semantic-aware Underwater Codebook Network), which achieves adaptive UIE from semantic-aware discrete codebook representation. Compared with one-shot codebook-based methods, SUCode exploits semantic-aware, pixel-level codebook representation tailored to heterogeneous underwater degradation. A three-stage training paradigm is employed to represent raw underwater image features to avoid pseudo ground-truth contamination. Gated Channel Attention Module (GCAM) and Frequency-Aware Feature Fusion (FAFF) jointly integrate channel and frequency cues for faithful color restoration and texture recovery. Extensive experiments on multiple benchmarks demonstrate that SUCode achieves state-of-the-art performance, outperforming recent UIE methods on both reference and no-reference metrics. The code will be made publicly available at https://github.com/oucailab/SUCode.

[78] Enhancing YOLOv11n for Reliable Child Detection in Noisy Surveillance Footage

Khanh Linh Tran,Minh Nguyen Dang,Thien Nguyen Trong,Hung Nguyen Quoc,Linh Nguyen Kieu

Main category: cs.CV

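TL;DR: This paper proposes a lightweight, practical enhancement for child detection based on YOLOv11n that improves detection in low-quality surveillance video (occlusion, small objects, blur, low light); with domain-specific data augmentation and the SAHI inference strategy it raises mAP without changing the model architecture while retaining real-time deployability on edge devices.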

Details

Motivation: In real-world settings, low-quality surveillance video (such as CCTV in daycare or missing-child alert systems) suffers from heavy occlusion, small objects, motion blur, and low light, degrading existing detectors, so a lightweight, deployment-friendly solution is needed. Method: Building on YOLOv11n, design a synthetic augmentation strategy for child detection (spatial perturbations plus photometric degradations) and add Slicing Aided Hyper Inference (SAHI) at inference time to improve recall of small and occluded objects; all training and evaluation use the child-only subset of the Roboflow Daycare dataset. Result: Compared with the YOLOv11n baseline, mAP@0.5 rises by 0.7 percentage points to 0.967 and mAP@0.5:0.95 rises by 2.3 percentage points to 0.783; the whole pipeline is compatible with low-power edge devices and supports real-time operation. Conclusion: Without modifying the network architecture, the approach improves the accuracy and robustness of child detection in low-quality surveillance footage and is practical and deployment-friendly for resource-constrained security settings. Abstract: This paper presents a practical and lightweight solution for enhancing child detection in low-quality surveillance footage, a critical component in real-world missing child alert and daycare monitoring systems. Building upon the efficient YOLOv11n architecture, we propose a deployment-ready pipeline that improves detection under challenging conditions including occlusion, small object size, low resolution, motion blur, and poor lighting commonly found in existing CCTV infrastructures. Our approach introduces a domain-specific augmentation strategy that synthesizes realistic child placements using spatial perturbations such as partial visibility, truncation, and overlaps, combined with photometric degradations including lighting variation and noise. To improve recall of small and partially occluded instances, we integrate Slicing Aided Hyper Inference (SAHI) at inference time. All components are trained and evaluated on a filtered, child-only subset of the Roboflow Daycare dataset. Compared to the baseline YOLOv11n, our enhanced system achieves a mean Average Precision at 0.5 IoU (mAP@0.5) of 0.967 and a mean Average Precision averaged over IoU thresholds from 0.5 to 0.95 (mAP@0.5:0.95) of 0.783, yielding absolute improvements of 0.7 and 2.3 percentage points, respectively, without architectural changes. Importantly, the entire pipeline maintains compatibility with low-power edge devices and supports real-time performance, making it particularly well suited for low-cost or resource-constrained industrial surveillance deployments. 
The example augmented dataset and the source code used to generate it are available at: https://github.com/html-ptit/Data-Augmentation-YOLOv11n-child-detection
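The slicing idea behind sliced inference can be sketched without any detector: the frame is cut into overlapping tiles, each tile is passed to the detector at full resolution, and the per-tile boxes are shifted back into frame coordinates. The code below is a minimal pure-Python illustration of that tiling step only. It is not the SAHI library's implementation, and the tile size, the 20% overlap, and the function names are assumptions chosen for the example.

```python
def slice_boxes(img_w: int, img_h: int, slice_w: int, slice_h: int,
                overlap: float = 0.2) -> list[tuple[int, int, int, int]]:
    """Overlapping tiles (x1, y1, x2, y2) covering an img_w x img_h frame.

    The last tile in each row and column is shifted back so it ends at the frame edge.
    """
    step_x = max(1, int(slice_w * (1 - overlap)))
    step_y = max(1, int(slice_h * (1 - overlap)))
    boxes = []
    y = 0
    while True:
        y0 = min(y, max(img_h - slice_h, 0))
        x = 0
        while True:
            x0 = min(x, max(img_w - slice_w, 0))
            boxes.append((x0, y0, min(x0 + slice_w, img_w), min(y0 + slice_h, img_h)))
            if x0 + slice_w >= img_w:
                break
            x += step_x
        if y0 + slice_h >= img_h:
            break
        y += step_y
    return boxes


def to_frame_coords(box: tuple[int, int, int, int],
                    tile_origin: tuple[int, int]) -> tuple[int, int, int, int]:
    """Shift a detection box from tile coordinates into full-frame coordinates."""
    x1, y1, x2, y2 = box
    ox, oy = tile_origin
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)


if __name__ == "__main__":
    for tile in slice_boxes(1000, 600, 512, 512):
        print(tile)
```

Because tiles overlap, the same child can be detected in two neighbouring tiles, so a duplicate-merging step such as non-maximum suppression over the shifted boxes would still be needed after this stage.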

[79] Fast Person Detection Using YOLOX With AI Accelerator For Train Station Safety

Mas Nurul Achmadiah,Novendra Setyawan,Achmad Arif Bryantono,Chi-Chia Sun,Wen-Kai Kuo

Main category: cs.CV

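TL;DR: This paper proposes a passenger detection method based on YOLOX and an edge AI accelerator (Hailo-8) to improve safety at train station crossings; experiments show it beats the Jetson Orin Nano in both accuracy (over 12% higher) and latency (20 ms lower).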

Details

Motivation: Train station crossings pose safety hazards such as passengers crossing the line, so more efficient, low-latency edge AI is needed to improve real-time detection and reduce accidents. Method: Deploy the YOLOX object detection model on the Hailo-8 edge AI accelerator and compare it experimentally with the Jetson Orin Nano in terms of accuracy and inference latency. Result: Compared with the Jetson Orin Nano, the Hailo-8 improves accuracy by over 12% and reduces inference latency by 20 ms. Conclusion: As edge AI acceleration hardware, the Hailo-8 shows better accuracy and real-time performance for passenger detection at train stations and is suited to safety-critical transportation settings. Abstract: Recently, image processing has advanced rapidly and has been applied in many fields, including health, industry, and transportation. In the transportation sector, object detection is widely used to improve security, for example, in traffic security and passenger crossings at train stations. Some accidents occur in the train crossing area at the station, such as when passengers carelessly cross the yellow line. Further safety measures and additional technology are therefore required to reduce the number of accidents. This paper focuses on passenger detection applications at train stations using YOLOX and Edge AI Accelerator hardware. The performance of the AI accelerator is compared with that of the Jetson Orin Nano. The experimental results show that the Hailo-8 AI hardware accelerator has higher accuracy than the Jetson Orin Nano (an improvement of over 12%) and lower latency (reduced by 20 ms).

[80] Improving Medical Visual Reinforcement Fine-Tuning via Perception and Reasoning Augmentation

Guangjing Yang,ZhangYuan Yu,Ziyuan Qin,Xinyuan Song,Huahui Yi,Qingbo Kang,Jun Gao,Yiyue Li,Chenlin Du,Qicheng Lao

Main category: cs.CV

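TL;DR: This paper proposes VRFT-Aug, a visual reinforcement fine-tuning framework for medical imaging that uses prior knowledge injection, perception-driven policy refinement, medically informed reward design, and behavioral imitation to strengthen visual perception and structured reasoning, and validates it on multiple medical datasets.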

Details

Motivation: Existing rule-based reinforcement fine-tuning (RFT) methods remain underexplored in cross-modal, and especially vision-centric, medical imaging, a domain that demands both robust visual perception and structured reasoning. Method: Propose the VRFT-Aug framework with four training strategies (prior knowledge injection, perception-driven policy refinement, medically informed reward shaping, and behavioral imitation) to strengthen perception and reasoning and to stabilize RFT. Result: On multiple medical imaging datasets, VRFT-Aug outperforms standard supervised fine-tuning and RFT baselines, and yields empirical insights and practical training heuristics that can generalize to other medical image tasks. Conclusion: VRFT-Aug offers a feasible path and practical guidance for building reliable, reasoning-capable large models for high-stakes medical applications. Abstract: While recent advances in Reinforcement Fine-Tuning (RFT) have shown that rule-based reward schemes can enable effective post-training for large language models, their extension to cross-modal, vision-centric domains remains largely underexplored. This limitation is especially pronounced in the medical imaging domain, where effective performance requires both robust visual perception and structured reasoning. In this work, we address this gap by proposing VRFT-Aug, a visual reinforcement fine-tuning framework tailored for the medical domain. VRFT-Aug introduces a series of training strategies designed to augment both perception and reasoning, including prior knowledge injection, perception-driven policy refinement, medically informed reward shaping, and behavioral imitation. Together, these methods aim to stabilize and improve the RFT process. Through extensive experiments across multiple medical datasets, we show that our approaches consistently outperform both standard supervised fine-tuning and RFT baselines. Moreover, we provide empirically grounded insights and practical training heuristics that can be generalized to other medical image tasks. We hope this work contributes actionable guidance and fresh inspiration for the ongoing effort to develop reliable, reasoning-capable models for high-stakes medical applications.

[81] A Vision-Language Foundation Model for Zero-shot Clinical Collaboration and Automated Concept Discovery in Dermatology

Siyuan Yan,Xieji Li,Dan Mo,Philipp Tschandl,Yiwen Jiang,Zhonghua Wang,Ming Hu,Lie Ju,Cristina Vico-Alonso,Yizhen Zheng,Jiahe Liu,Juexiao Zhou,Camilla Chello,Jen G. Cheung,Julien Anriot,Luc Thomas,Clare Primiero,Gin Tan,Aik Beng Ng,Simon See,Xiaoying Tang,Albert Ip,Xiaoyang Liao,Adrian Bowling,Martin Haskett,Shuang Zhao,Monika Janda,H. Peter Soyer,Victoria Mar,Harald Kittler,Zongyuan Ge

Main category: cs.CV

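TL;DR: DermFM-Zero is a vision-language foundation model that reaches zero-shot state-of-the-art performance in dermatological diagnosis and multimodal retrieval without task-specific fine-tuning. Several clinical studies show that it improves clinicians' diagnostic accuracy, outperforms specialists, and offers better robustness and interpretability.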

Details

Motivation: Medical foundation models depend on task-specific fine-tuning, which hinders broad deployment; this work aims to make zero-shot clinical decision support practical. Method: DermFM-Zero, a vision-language foundation model, is trained with masked latent modelling and contrastive learning on over 4 million dermatology multimodal data points, and sparse autoencoders provide unsupervised concept disentanglement to improve interpretability and robustness. Result: The model reaches state-of-the-art results on 20 zero-shot diagnosis and multimodal retrieval benchmarks; in three multinational clinical studies it significantly improves primary-care physicians' diagnostic accuracy, outperforms dermatologists, and enables non-experts working with AI assistance to surpass unassisted experts; its latent representations are interpretable and can be used to suppress artifact-induced biases. Conclusion: DermFM-Zero shows that a high-quality medical foundation model can provide effective, safe, and transparent zero-shot clinical decision support without fine-tuning. Abstract: Medical foundation models have shown promise in controlled benchmarks, yet widespread deployment remains hindered by reliance on task-specific fine-tuning. Here, we introduce DermFM-Zero, a dermatology vision-language foundation model trained via masked latent modelling and contrastive learning on over 4 million multimodal data points. We evaluated DermFM-Zero across 20 benchmarks spanning zero-shot diagnosis and multimodal retrieval, achieving state-of-the-art performance without task-specific adaptation. We further evaluated its zero-shot capabilities in three multinational reader studies involving over 1,100 clinicians. In primary care settings, AI assistance enabled general practitioners to nearly double their differential diagnostic accuracy across 98 skin conditions. In specialist settings, the model significantly outperformed board-certified dermatologists in multimodal skin cancer assessment. In collaborative workflows, AI assistance enabled non-experts to surpass unassisted experts while improving management appropriateness. Finally, we show that DermFM-Zero's latent representations are interpretable: sparse autoencoders unsupervisedly disentangle clinically meaningful concepts that outperform predefined-vocabulary approaches and enable targeted suppression of artifact-induced biases, enhancing robustness without retraining. These findings demonstrate that a foundation model can provide effective, safe, and transparent zero-shot clinical decision support.

[82] Eliminating VAE for Fast and High-Resolution Generative Detail Restoration

Yan Wang,Shijie Zhao,Junlin Li,Li Zhang

Main category: cs.CV

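TL;DR: This paper proposes GenDR-Pix, a one-step, pixel-space diffusion model for super-resolution. By eliminating the VAE bottleneck and using multi-stage adversarial distillation, random padding, and a masked Fourier-space loss, it achieves a 2.8x speedup and 60% memory savings while preserving visual quality, restoring a 4K image within 1 second using only 6 GB of GPU memory.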

Details

Motivation: Existing diffusion super-resolution methods such as GenDR reach one-step inference through step distillation, but the VAE remains a latency and memory bottleneck that forces high-resolution images to be processed tile by tile; the goal is to remove the VAE entirely and improve end-to-end efficiency without sacrificing quality. Method: Pixel-space GenDR-Pix: (1) replace the VAE with pixel-(un)shuffle operations for end-to-end pixel-level modeling; (2) use multi-stage adversarial distillation, in which generative features from previous-stage models guide adversarial discrimination; (3) apply random padding to augment generative features and avoid discriminator collapse; (4) add a masked Fourier-space loss that penalizes amplitude outliers; (5) combine a padding-based self-ensemble with classifier-free guidance to make inference more robust. Result: Compared with GenDR, GenDR-Pix is 2.8x faster and uses 60% less memory with negligible visual degradation, surpassing other one-step diffusion SR methods; it restores a 4K image in 1 second with only 6 GB of GPU memory. Conclusion: Eliminating the VAE and moving to pixel-space modeling is a key route to accelerating diffusion super-resolution; multi-stage adversarial distillation and frequency-domain constraints mitigate the repeated-pattern artifacts introduced by pixel shuffle, offering a new paradigm for efficient, high-quality real-world SR. Abstract: Diffusion models have attained remarkable breakthroughs in the real-world super-resolution (SR) task, albeit at slow inference and high demand on devices. To accelerate inference, recent works like GenDR adopt step distillation to minimize the step number to one. However, the memory boundary still restricts the maximum processing size, necessitating tile-by-tile restoration of high-resolution images. Through profiling the pipeline, we pinpoint that the variational auto-encoder (VAE) is the bottleneck of latency and memory. To completely solve the problem, we leverage pixel-(un)shuffle operations to eliminate the VAE, reversing the latent-based GenDR to pixel-space GenDR-Pix. However, upscaling with x8 pixel-shuffle may induce artifacts of repeated patterns. To alleviate the distortion, we propose a multi-stage adversarial distillation to progressively remove the encoder and decoder. Specifically, we utilize generative features from the previous stage models to guide adversarial discrimination. Moreover, we propose random padding to augment generative features and avoid discriminator collapse. We also introduce a masked Fourier space loss to penalize the outliers of amplitude. To improve inference performance, we empirically integrate a padding-based self-ensemble with classifier-free guidance to improve inference scaling. Experimental results show that GenDR-Pix performs 2.8x acceleration and 60% memory-saving compared to GenDR with negligible visual degradation, surpassing other one-step diffusion SR. Against all odds, GenDR-Pix can restore a 4K image in only 1 second and 6GB.
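
The pixel-(un)shuffle operations named in the abstract are standard space-to-depth rearrangements. The following is a minimal pure-Python sketch for a single-channel image, not the GenDR-Pix implementation; the function names and the toy 4x4 image are illustrative assumptions, and the channel ordering (offset `dy * r + dx`) is one common convention.

```python
def pixel_unshuffle(image, r):
    """Rearrange an H x W image into r*r channels of size (H/r) x (W/r)."""
    h, w = len(image), len(image[0])
    if h % r or w % r:
        raise ValueError("height and width must be divisible by r")
    # Channel dy * r + dx collects the pixel at offset (dy, dx) of every r x r block.
    return [
        [[image[y * r + dy][x * r + dx] for x in range(w // r)] for y in range(h // r)]
        for dy in range(r)
        for dx in range(r)
    ]


def pixel_shuffle(channels, r):
    """Inverse of pixel_unshuffle: interleave r*r channels back into one image."""
    hs, ws = len(channels[0]), len(channels[0][0])
    image = [[0] * (ws * r) for _ in range(hs * r)]
    for c, channel in enumerate(channels):
        dy, dx = divmod(c, r)
        for y in range(hs):
            for x in range(ws):
                image[y * r + dy][x * r + dx] = channel[y][x]
    return image


# Toy 4x4 image (hypothetical data) with pixel values 0..15.
image = [[y * 4 + x for x in range(4)] for y in range(4)]
channels = pixel_unshuffle(image, 2)   # 4 channels of size 2x2
restored = pixel_shuffle(channels, 2)  # lossless round trip
```

Because the rearrangement is lossless and parameter-free, it trades spatial resolution for channel depth without the encoder and decoder passes that a VAE requires.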

[83] VideoSTF: Stress-Testing Output Repetition in Video Large Language Models

Yuxin Cao,Wei Song,Shangzhi Xu,Jingling Xue,Jin Song Dong

Main category: cs.CV

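TL;DR: This paper proposes VideoSTF, the first framework for systematically evaluating output repetition in Video Large Language Models (VideoLLMs). It finds that the problem is widespread, highly sensitive to temporal perturbations of the video, and exploitable as a security vulnerability through black-box attacks.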

Details

Motivation: Existing VideoLLM benchmarks focus on task accuracy and factual correctness and overlook severe output repetition, an underexplored generation failure mode. Method: VideoSTF defines three n-gram-based repetition metrics, builds a standardized testbed of 10,000 diverse videos with controlled temporal transformations, and runs pervasive testing, temporal stress testing, and adversarial exploitation on 10 advanced VideoLLMs. Result: Output repetition is widespread and highly sensitive to temporal perturbations of the video; simple temporal transformations efficiently induce repetitive degeneration in a black-box setting, exposing an exploitable security vulnerability. Conclusion: Output repetition is a fundamental stability issue in modern VideoLLMs, which motivates stability-aware evaluation for video-language systems. Abstract: Video Large Language Models (VideoLLMs) have recently achieved strong performance in video understanding tasks. However, we identify a previously underexplored generation failure: severe output repetition, where models degenerate into self-reinforcing loops of repeated phrases or sentences. This failure mode is not captured by existing VideoLLM benchmarks, which focus primarily on task accuracy and factual correctness. We introduce VideoSTF, the first framework for systematically measuring and stress-testing output repetition in VideoLLMs. VideoSTF formalizes repetition using three complementary n-gram-based metrics and provides a standardized testbed of 10,000 diverse videos together with a library of controlled temporal transformations. Using VideoSTF, we conduct pervasive testing, temporal stress testing, and adversarial exploitation across 10 advanced VideoLLMs. We find that output repetition is widespread and, critically, highly sensitive to temporal perturbations of video inputs. Moreover, we show that simple temporal transformations can efficiently induce repetitive degeneration in a black-box setting, exposing output repetition as an exploitable security vulnerability. Our results reveal output repetition as a fundamental stability issue in modern VideoLLMs and motivate stability-aware evaluation for video-language systems. Our evaluation code and scripts are available at: https://github.com/yuxincao22/VideoSTF_benchmark.
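
The abstract does not define its three n-gram-based metrics, so the sketch below shows only one generic way such a measure could be computed: the fraction of n-grams that repeat an earlier n-gram. The function name, the whitespace tokenization, and the example strings are assumptions for illustration and are not taken from VideoSTF.

```python
from collections import Counter


def ngram_repetition_rate(text, n=3):
    """Fraction of n-grams that duplicate an n-gram seen earlier in the text."""
    tokens = text.lower().split()  # whitespace tokenization is an assumption
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first of each distinct n-gram is a repeat.
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)


# Hypothetical model outputs: a degenerate loop and an ordinary caption.
looping = "the man walks " * 5
normal = "a man walks his dog across the quiet park at dawn"

looping_rate = ngram_repetition_rate(looping)
normal_rate = ngram_repetition_rate(normal)
```

A score near 0 indicates little repetition, while a looping output pushes the score toward 1 as the same few n-grams recur.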

[84] Multimodal Priors-Augmented Text-Driven 3D Human-Object Interaction Generation

Yin Wang,Ziyao Zhang,Zhiying Leng,Haitian Liu,Frederick W. B. Li,Mu Li,Xiaohui Liang

Main category: cs.CV

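TL;DR: This paper proposes MP-HOI, a framework that uses multimodal data priors, an enhanced object representation, a modality-aware MoE model, and cascaded diffusion with interaction supervision to address poor human and object motion quality and weak interaction in text-driven 3D human-object interaction motion generation.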

Details

Motivation: Existing direct text-to-HOI mapping methods are limited by the cross-modality gap, which leads to sub-optimal human motion, unnatural object motion, and weak human-object interaction. Method: MP-HOI comprises (1) multimodal data (text, image, pose/object) from large multimodal models used as priors; (2) an object representation enhanced with geometric keypoints, contact features, and dynamic properties; (3) a modality-aware Mixture-of-Experts (MoE) model for multimodal feature fusion; (4) a cascaded diffusion framework with interaction supervision that progressively refines interaction features. Result: MP-HOI outperforms existing methods in generating high-fidelity, fine-grained HOI motions. Conclusion: By modeling several dimensions jointly, MP-HOI narrows the cross-modality gap and markedly improves the quality and realism of text-driven 3D HOI motion generation. Abstract: We address the challenging task of text-driven 3D human-object interaction (HOI) motion generation. Existing methods primarily rely on a direct text-to-HOI mapping, which suffers from three key limitations due to the significant cross-modality gap: (Q1) sub-optimal human motion, (Q2) unnatural object motion, and (Q3) weak interaction between humans and objects. To address these challenges, we propose MP-HOI, a novel framework grounded in four core insights: (1) Multimodal Data Priors: We leverage multimodal data (text, image, pose/object) from large multimodal models as priors to guide HOI generation, which tackles Q1 and Q2 in data modeling. (2) Enhanced Object Representation: We improve existing object representations by incorporating geometric keypoints, contact features, and dynamic properties, enabling expressive object representations, which tackles Q2 in data representation. (3) Multimodal-Aware Mixture-of-Experts (MoE) Model: We propose a modality-aware MoE model for effective multimodal feature fusion, which tackles Q1 and Q2 in feature fusion. (4) Cascaded Diffusion with Interaction Supervision: We design a cascaded diffusion framework that progressively refines human-object interaction features under dedicated supervision, which tackles Q3 in interaction refinement. Comprehensive experiments demonstrate that MP-HOI outperforms existing approaches in generating high-fidelity and fine-grained HOI motions.

[85] AurigaNet: A Real-Time Multi-Task Network for Enhanced Urban Driving Perception

Kiarash Ghasemzadeh,Sedigheh Dehghani

Main category: cs.CV

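TL;DR: This paper proposes AurigaNet, an advanced multi-task network for autonomous driving perception that jointly handles object detection, lane detection, and drivable area instance segmentation. It achieves several state-of-the-art results on BDD100K, and its real-time performance is validated on embedded devices such as the Jetson Orin NX.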

Details

Motivation: Building reliable, efficient AI systems for autonomous driving that generalize well remains challenging; multi-task learning can improve computational efficiency, real-time capability, and generalization, which calls for better multi-task architectures. Method: AurigaNet is a multi-task network that jointly models object detection, lane detection, and drivable area instance segmentation end to end; it is trained and evaluated on BDD100K and deployed on the Jetson Orin NX embedded platform to validate real-time performance. Result: On BDD100K, drivable area segmentation reaches 85.2% IoU (+0.7%), lane detection reaches 60.8% IoU (more than +30%), and object detection reaches an mAP@0.5:0.95 of 47.6% (+2.9%); the network shows competitive real-time performance on the Jetson Orin NX. Conclusion: AurigaNet is a robust, efficient, and deployable multi-task perception architecture that improves the accuracy and practicality of autonomous driving perception and offers a feasible option for in-vehicle deployment. Abstract: Self-driving cars hold significant potential to reduce traffic accidents, alleviate congestion, and enhance urban mobility. However, developing reliable AI systems for autonomous vehicles remains a substantial challenge. Over the past decade, multi-task learning has emerged as a powerful approach to address complex problems in driving perception. Multi-task networks offer several advantages, including increased computational efficiency, real-time processing capabilities, optimized resource utilization, and improved generalization. In this study, we present AurigaNet, an advanced multi-task network architecture designed to push the boundaries of autonomous driving perception. AurigaNet integrates three critical tasks: object detection, lane detection, and drivable area instance segmentation. The system is trained and evaluated using the BDD100K dataset, renowned for its diversity in driving conditions. Key innovations of AurigaNet include its end-to-end instance segmentation capability, which significantly enhances both accuracy and efficiency in path estimation for autonomous vehicles. Experimental results demonstrate that AurigaNet achieves an 85.2% IoU in drivable area segmentation, outperforming its closest competitor by 0.7%. In lane detection, AurigaNet achieves a remarkable 60.8% IoU, surpassing other models by more than 30%. Furthermore, the network achieves an mAP@0.5:0.95 of 47.6% in traffic object detection, exceeding the next leading model by 2.9%. Additionally, we validate the practical feasibility of AurigaNet by deploying it on embedded devices such as the Jetson Orin NX, where it demonstrates competitive real-time performance. These results underscore AurigaNet's potential as a robust and efficient solution for autonomous driving perception systems. The code is available at https://github.com/KiaRational/AurigaNet.
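
The segmentation and lane results above are reported as IoU (intersection over union). As a reminder of what that number measures, here is a minimal sketch for two binary masks; the `mask_iou` name and the toy masks are illustrative assumptions, and this is not AurigaNet's evaluation code.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    intersection = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly, so their IoU is defined as 1.0 here.
    return intersection / union if union else 1.0


# Hypothetical 3x3 masks: the prediction is shifted one column left of the target.
pred = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
target = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
iou = mask_iou(pred, target)  # 2 shared pixels out of 6 covered pixels
```

The one-column shift already drops the score to one third, which is why thin structures such as lane markings tend to yield lower IoU than large drivable areas.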

[86] Dynamic Frequency Modulation for Controllable Text-driven Image Generation

Tiandong Shi,Ling Zhao,Ji Qi,Jiayi Ma,Chengli Peng

Main category: cs.CV

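TL;DR: This paper proposes a training-free frequency modulation method that uses a frequency-dependent weighting function with dynamic decay to keep the image's structure framework consistent in diffusion models while allowing targeted semantic modifications.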

Details

Motivation: In existing text-guided diffusion models, modifying the prompt to adjust semantics tends to cause unintended global structure changes, and current fixes rely on empirically chosen feature-map interventions that are unstable. Method: From a frequency perspective, the paper analyzes how the frequency spectrum of the noisy latent variables affects the emergence of the structure framework and textures, finding that low frequencies dominate early structure building and high frequencies dominate later detail synthesis; based on this, it designs a training-free frequency modulation method that manipulates the noisy latent variable directly through a weighting function with dynamic decay. Result: The method avoids empirical feature-map selection and supports precise semantic edits while keeping the structure consistent; experiments show that it significantly outperforms current state-of-the-art methods. Conclusion: The frequency perspective offers a new approach to controllable generation with diffusion models, and the proposed training-free frequency modulation achieves a better balance between structural fidelity and semantic flexibility. Abstract: The success of text-guided diffusion models has established a new image generation paradigm driven by the iterative refinement of text prompts. However, modifying the original text prompt to achieve the expected semantic adjustments often results in unintended global structure changes that disrupt user intent. Existing methods rely on empirical feature map selection for intervention, whose performance heavily depends on appropriate selection, leading to suboptimal stability. This paper tries to solve the aforementioned problem from a frequency perspective and analyzes the impact of the frequency spectrum of noisy latent variables on the hierarchical emergence of the structure framework and fine-grained textures during the generation process. We find that lower-frequency components are primarily responsible for establishing the structure framework in the early generation stage. Their influence diminishes over time, giving way to higher-frequency components that synthesize fine-grained textures. In light of this, we propose a training-free frequency modulation method utilizing a frequency-dependent weighting function with dynamic decay. This method maintains the structure framework consistency while permitting targeted semantic modifications. By directly manipulating the noisy latent variable, the proposed method avoids the empirical selection of internal feature maps. Extensive experiments demonstrate that the proposed method significantly outperforms current state-of-the-art methods, achieving an effective balance between preserving structure and enabling semantic updates.

[87] AMAP-APP: Efficient Segmentation and Morphometry Quantification of Fluorescent Microscopy Images of Podocytes

Arash Fatehi,David Unnersjö-Jess,Linus Butt,Noémie Moreau,Thomas Benzing,Katarzyna Bozek

Main category: cs.CV

TL;DR: AMAP-APP is a cross-platform desktop application that significantly speeds up automated podocyte foot process quantification while maintaining high accuracy, overcoming computational and usability limitations of the original AMAP method.

Details

Motivation: The original AMAP method suffers from high computational demands, absence of a user interface, and Linux-only dependency, limiting its accessibility in kidney research. Method: AMAP-APP replaces intensive instance segmentation with classic image processing, retains the original semantic segmentation model, and introduces a refined Region of Interest (ROI) algorithm; validated on 365 mouse/human STED/confocal images using Pearson correlation and Two One-Sided T-tests (TOST). Result: AMAP-APP achieves 147× faster processing on consumer hardware; morphometric outputs show r>0.90 correlation and statistical equivalence (TOST P<0.05) vs. original AMAP; improved ROI algorithm reduces deviation from manual delineations. Conclusion: AMAP-APP democratizes deep learning-based podocyte morphometry by enabling use on standard computers across Windows, macOS, and Linux, facilitating broader adoption in nephrology research and clinical diagnostics. Abstract: Background: Automated podocyte foot process quantification is vital for kidney research, but the established "Automatic Morphological Analysis of Podocytes" (AMAP) method is hindered by high computational demands, a lack of a user interface, and Linux dependency. We developed AMAP-APP, a cross-platform desktop application designed to overcome these barriers. Methods: AMAP-APP optimizes efficiency by replacing intensive instance segmentation with classic image processing while retaining the original semantic segmentation model. It introduces a refined Region of Interest (ROI) algorithm to improve precision. Validation involved 365 mouse and human images (STED and confocal), benchmarking performance against the original AMAP via Pearson correlation and Two One-Sided T-tests (TOST). Results: AMAP-APP achieved a 147-fold increase in processing speed on consumer hardware. 
Morphometric outputs (area, perimeter, circularity, and slit diaphragm density) showed high correlation (r>0.90) and statistical equivalence (TOST P<0.05) to the original method. Additionally, the new ROI algorithm demonstrated superior accuracy compared to the original, showing reduced deviation from manual delineations. Conclusion: AMAP-APP democratizes deep learning-based podocyte morphometry. By eliminating the need for high-performance computing clusters and providing a user-friendly interface for Windows, macOS, and Linux, it enables widespread adoption in nephrology research and potential clinical diagnostics.
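
The validation above reports Pearson correlation between AMAP-APP and the original AMAP. The sketch below shows how such a coefficient is computed from paired measurements; the `pearson_r` helper and the numbers are made up for illustration and are not data from the paper. The TOST equivalence test is not shown.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical paired measurements of one morphometric parameter per image.
original_method = [1.0, 2.0, 3.0, 4.0]
new_method = [1.1, 1.9, 3.2, 3.8]
r = pearson_r(original_method, new_method)
```

A high r shows only that the two methods rank images consistently; it does not rule out a constant offset between them, which is why the authors pair it with an equivalence test.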

[88] TwiFF (Think With Future Frames): A Large-Scale Dataset for Dynamic Visual Reasoning

Junhua Liu,Zhangcheng Wang,Zhike Han,Ningli Wang,Guotao Liang,Kun Kuang

Main category: cs.CV

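TL;DR: This paper introduces the TwiFF-2.7M dataset and the TwiFF-Bench evaluation benchmark to support dynamic visual reasoning, and designs the TwiFF model, which combines video generation and image comprehension to produce temporally coherent visual chain-of-thought reasoning and significantly outperforms existing methods on dynamic visual question answering.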

Details

Motivation: Existing Visual Chain-of-Thought (VCoT) methods are confined to static scenes, struggle to model temporal dynamics, and cannot effectively support dynamic tasks such as instruction following, action prediction, and camera motion. Method: The authors build TwiFF-2.7M, the first large-scale temporally grounded VCoT dataset (2.7 million video clips), and TwiFF-Bench, a high-quality benchmark of 1,078 samples; they propose the TwiFF model, which unifies pre-trained video generation and image comprehension modules and reasons by iteratively generating future action frames and textual reasoning. Result: TwiFF significantly outperforms existing VCoT methods and textual chain-of-thought baselines on dynamic reasoning tasks, validating its effectiveness for dynamic visual question answering. Conclusion: Temporal modeling is key to improving VCoT reasoning in dynamic scenes; the TwiFF framework, dataset, and benchmark provide a new paradigm and a solid foundation for dynamic multimodal reasoning. Abstract: Visual Chain-of-Thought (VCoT) has emerged as a promising paradigm for enhancing multimodal reasoning by integrating visual perception into intermediate reasoning steps. However, existing VCoT approaches are largely confined to static scenarios and struggle to capture the temporal dynamics essential for tasks such as instruction, prediction, and camera motion. To bridge this gap, we propose TwiFF-2.7M, the first large-scale, temporally grounded VCoT dataset derived from 2.7 million video clips, explicitly designed for dynamic visual question and answer. Accompanying this, we introduce TwiFF-Bench, a high-quality evaluation benchmark of 1,078 samples that assesses both the plausibility of reasoning trajectories and the correctness of final answers in open-ended dynamic settings. Building on these foundations, we propose the TwiFF model, a unified model that synergistically leverages pre-trained video generation and image comprehension capabilities to produce temporally coherent visual reasoning cues, iteratively generating future action frames and textual reasoning. Extensive experiments demonstrate that TwiFF significantly outperforms existing VCoT methods and Textual Chain-of-Thought baselines on dynamic reasoning tasks, which fully validates the effectiveness for visual question answering in dynamic scenarios. Our code and data are available at https://github.com/LiuJunhua02/TwiFF.

[89] OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL

Jinjie Shen,Jing Wu,Yaxiong Wang,Lechao Cheng,Shengeng Tang,Tianrui Hui,Nan Pu,Zhun Zhong

Main category: cs.CV

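TL;DR: This paper proposes OmniVL-Guard, a reinforcement-learning-based unified framework for multimodal (text, image, video) forgery detection and grounding. It uses self-evolving chain-of-thought generation and adaptive reward scaling policy optimization to address the "difficulty bias" problem in multi-task optimization, reaching state-of-the-art performance on joint detection and grounding with zero-shot cross-domain generalization.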

Details

Motivation: Existing forgery detection methods are mostly limited to uni-modal or bi-modal settings and cannot handle real-world misinformation that interleaves text, images, and videos; in addition, when detection and grounding are optimized jointly, differences in task difficulty let the easier task dominate the gradients (the "difficulty bias" problem), which hurts fine-grained grounding. Method: OmniVL-Guard has two core components: (1) Self-Evolving CoT Generation, which mitigates the cold-start problem and improves the quality of reasoning paths; (2) Adaptive Reward Scaling Policy Optimization (ARSPO), which dynamically adjusts per-task reward scales and weights to balance the joint optimization of detection and grounding. Result: The framework significantly outperforms existing state-of-the-art methods on multiple benchmarks and shows strong zero-shot robust generalization in out-of-domain scenarios. Conclusion: OmniVL-Guard is the first unified forgery detection and grounding framework for interleaved text, image, and video content; its reinforcement learning mechanism mitigates difficulty bias in multi-task optimization and offers a new paradigm for trustworthy multimodal content analysis. Abstract: Existing forgery detection methods are often limited to uni-modal or bi-modal settings, failing to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirements of simultaneous detection and localization pose a critical "difficulty bias" problem: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding during multi-task optimization. To address this challenge, we propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. Particularly, OmniVL-Guard comprises two core designs: Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization (ARSPO). Self-Evolving CoT Generation synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, Adaptive Reward Scaling Policy Optimization (ARSPO) dynamically modulates reward scales and task weights, ensuring a balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits zero-shot robust generalization across out-of-domain scenarios.

[90] AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models

Zhifeng Rao,Wenlong Chen,Lei Xie,Xia Hua,Dongfu Yin,Zhen Tian,F. Richard Yu

Main category: cs.CV

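TL;DR: This paper proposes a new framework that integrates depth estimation into Vision-Language-Action (VLA) models. By introducing geometry-aware 3D features and an "action assistant" module constrained by action priors, it improves spatial understanding and action grounding in complex 3D environments.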

Details

Motivation: Most existing VLA models are trained on 2D images and lack an understanding of 3D spatial structure, which limits their perception and action grounding in real robotic tasks. Method: The VGGT depth estimator extracts geometry-aware 3D cues from RGB images, an "action assistant" module constrains and calibrates the 3D features with action priors, and the enhanced 3D features are fused with 2D visual tokens. Result: The approach markedly improves perception and action prediction accuracy in geometrically ambiguous scenarios and strengthens the generalization and robustness of VLA models. Conclusion: Depth-driven data augmentation and auxiliary expert supervision can bridge the gap between 2D observations and 3D-aware decision-making, offering a new route to more reliable embodied intelligence systems. Abstract: Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic perception and control, yet most existing approaches primarily rely on VLMs trained on 2D images, which limits their spatial understanding and action grounding in complex 3D environments. To address this limitation, we propose a novel framework that integrates depth estimation into VLA models to enrich 3D feature representations. Specifically, we employ a depth estimation baseline called VGGT to extract geometry-aware 3D cues from standard RGB inputs, enabling efficient utilization of existing large-scale 2D datasets while implicitly recovering 3D structural information. To further enhance the reliability of these depth-derived features, we introduce a new module called action assistant, which constrains the learned 3D representations with action priors and ensures their consistency with downstream control tasks. By fusing the enhanced 3D features with conventional 2D visual tokens, our approach significantly improves the generalization ability and robustness of VLA models. Experimental results demonstrate that the proposed method not only strengthens perception in geometrically ambiguous scenarios but also leads to superior action prediction accuracy. This work highlights the potential of depth-driven data augmentation and auxiliary expert supervision for bridging the gap between 2D observations and 3D-aware decision-making in robotic systems.

[91] (MGS)$^2$-Net: Unifying Micro-Geometric Scale and Macro-Geometric Structure for Cross-View Geo-Localization

Minglei Li,Mengfan He,Chao Chen,Ziyang Meng

Main category: cs.CV

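TL;DR: This paper proposes the (MGS)$^2$ framework, which combines Macro-Geometric Structure Filtering (MGSF) and Micro-Geometric Scale Adaptation (MGSA) modules with a Geometric-Appearance Contrastive Distillation (GACD) loss to significantly improve cross-view geo-localization, particularly under the large geometric misalignment encountered in GNSS-denied UAV navigation.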

Details

Motivation: Existing cross-view geo-localization methods model the problem in 2D and neglect 3D geometry, so they struggle to align features between oblique aerial images and orthographic satellite images when vertical facades (macro-structure) and scale variations (micro-scale) are present. Method: The (MGS)$^2$ framework consists of (1) a Macro-Geometric Structure Filtering (MGSF) module that uses dilated geometric gradients to physically filter out high-frequency facade artifacts and enhance the view-invariant horizontal plane; (2) a Micro-Geometric Scale Adaptation (MGSA) module that uses depth priors to dynamically rectify scale discrepancies across multi-branch features; (3) a Geometric-Appearance Contrastive Distillation (GACD) loss that suppresses interference from oblique occlusions. Result: Recall@1 reaches 97.5% on University-1652 and 97.02% on SUES-200, significantly better than existing methods, with strong cross-dataset generalization. Conclusion: By explicitly modeling 3D geometry, (MGS)$^2$ jointly optimizes macro-structure filtering, micro-scale adaptation, and geometric-appearance discrimination, which mitigates the cross-view domain shift and offers a new paradigm for robust UAV navigation in GNSS-denied settings. Abstract: Cross-view geo-localization (CVGL) is pivotal for GNSS-denied UAV navigation but remains brittle under the drastic geometric misalignment between oblique aerial views and orthographic satellite references. Existing methods predominantly operate within a 2D manifold, neglecting the underlying 3D geometry where view-dependent vertical facades (macro-structure) and scale variations (micro-scale) severely corrupt feature alignment. To bridge this gap, we propose (MGS)$^2$, a geometry-grounded framework. The core of our innovation is the Macro-Geometric Structure Filtering (MGSF) module. Unlike pixel-wise matching sensitive to noise, MGSF leverages dilated geometric gradients to physically filter out high-frequency facade artifacts while enhancing the view-invariant horizontal plane, directly addressing the domain shift. To guarantee robust input for this structural filtering, we explicitly incorporate a Micro-Geometric Scale Adaptation (MGSA) module. MGSA utilizes depth priors to dynamically rectify scale discrepancies via multi-branch feature fusion. Furthermore, a Geometric-Appearance Contrastive Distillation (GACD) loss is designed to strictly discriminate against oblique occlusions. Extensive experiments demonstrate that (MGS)$^2$ achieves state-of-the-art performance, recording a Recall@1 of 97.5% on University-1652 and 97.02% on SUES-200. Furthermore, the framework exhibits superior cross-dataset generalization against geometric ambiguity. The code is available at: https://github.com/GabrielLi1473/MGS-Net.
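
Recall@1, the headline metric above, counts a query as correct when its top-ranked gallery match shares its location label. The sketch below illustrates the computation with cosine similarity on toy 2-D embeddings; the function names, labels, and vectors are assumptions for illustration and do not reproduce the (MGS)$^2$ evaluation pipeline.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def recall_at_k(queries, query_labels, gallery, gallery_labels, k=1):
    """Fraction of queries whose true label appears among the top-k gallery matches."""
    hits = 0
    for query, label in zip(queries, query_labels):
        ranked = sorted(
            range(len(gallery)),
            key=lambda i: cosine(query, gallery[i]),
            reverse=True,
        )
        if label in {gallery_labels[i] for i in ranked[:k]}:
            hits += 1
    return hits / len(queries)


# Hypothetical satellite gallery and drone-view query embeddings.
gallery = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
gallery_labels = ["site_a", "site_b", "site_c"]
queries = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
query_labels = ["site_a", "site_b", "site_c"]

r1 = recall_at_k(queries, query_labels, gallery, gallery_labels, k=1)
r2 = recall_at_k(queries, query_labels, gallery, gallery_labels, k=2)
```

In this toy case the third query is ranked closest to the wrong site, so it counts as a miss at k=1 but as a hit at k=2.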

[92] FGAA-FPN: Foreground-Guided Angle-Aware Feature Pyramid Network for Oriented Object Detection

Jialin Ma

Main category: cs.CV

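TL;DR: This paper proposes FGAA-FPN, a foreground-guided, angle-aware feature pyramid network for oriented object detection. Foreground modulation and angle-aware attention improve the discriminability of multi-scale features, and the network reaches state-of-the-art performance on the DOTA dataset.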

Details

Motivation: Existing methods lack explicit foreground modeling and do not exploit geometric orientation priors, which limits performance under cluttered backgrounds, severe scale variation, and large orientation changes. Method: FGAA-FPN includes (1) a Foreground-Guided Feature Modulation module that learns foreground saliency under weak supervision to enhance object regions in low-level features; (2) an Angle-Aware Multi-Head Attention module that encodes relative orientation relationships to guide global interactions among high-level semantic features; the overall design follows a functional decomposition across pyramid levels. Result: The network reaches 75.5% mAP on DOTA v1.0 and 68.3% mAP on DOTA v1.5, state-of-the-art at the time. Conclusion: By combining foreground guidance with angle-aware modeling, FGAA-FPN improves the discriminability and robustness of multi-scale features in oriented object detection and offers a new approach to remote sensing image understanding. Abstract: With the increasing availability of high-resolution remote sensing and aerial imagery, oriented object detection has become a key capability for geographic information updating, maritime surveillance, and disaster response. However, it remains challenging due to cluttered backgrounds, severe scale variation, and large orientation changes. Existing approaches largely improve performance through multi-scale feature fusion with feature pyramid networks or contextual modeling with attention, but they often lack explicit foreground modeling and do not leverage geometric orientation priors, which limits feature discriminability. To overcome these limitations, we propose FGAA-FPN, a Foreground-Guided Angle-Aware Feature Pyramid Network for oriented object detection. FGAA-FPN is built on a hierarchical functional decomposition that accounts for the distinct spatial resolution and semantic abstraction across pyramid levels, thereby strengthening multi-scale representations. Concretely, a Foreground-Guided Feature Modulation module learns foreground saliency under weak supervision to enhance object regions and suppress background interference in low-level features. In parallel, an Angle-Aware Multi-Head Attention module encodes relative orientation relationships to guide global interactions among high-level semantic features. Extensive experiments on DOTA v1.0 and DOTA v1.5 demonstrate that FGAA-FPN achieves state-of-the-art results, reaching 75.5% and 68.3% mAP, respectively.

[93] Ecological mapping with geospatial foundation models

Craig Mahlasi,Gciniwe S. Baloyi,Zaheed Gaffoor,Levente Klein,Anne Jones,Etienne Vos,Michal Muszynski,Geoffrey Dawson,Campbell Watson

Main category: cs.CV

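TL;DR: This study examines the utility, challenges, and opportunities of geospatial foundation models (GFMs) for ecological applications. It fine-tunes the pretrained Prithvi-EO-2.0 and TerraMind models and compares them with a ResNet-101 baseline, showing better performance on land use/land cover (LULC) generation, forest functional trait mapping, and peatland detection. Multimodal input markedly improves TerraMind, but the study also stresses the mismatch between input data and pretraining modalities and the need for higher resolution and more accurate labels.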

Details

Motivation: The potential of geospatial foundation models (GFMs) in high-value use cases such as ecological mapping has not been fully explored, and their suitability, limitations, and directions for improvement need systematic evaluation. Method: Two pretrained GFMs, Prithvi-EO-2.0 and TerraMind, are fine-tuned on three ecological tasks (LULC generation, forest functional trait mapping, and peatland detection) and compared with a ResNet-101 baseline; the study also analyzes the effect of multimodal input and of modality shift in the data. Result: The GFMs outperform ResNet-101 in all experiments; TerraMind is marginally better than Prithvi overall and significantly better once additional modalities are added; performance is limited by the divergence between input data and pretraining modalities, and the models would benefit from higher resolution and more accurate pixel-level labels. Conclusion: GFMs show clear advantages in ecological remote sensing tasks, especially with multimodal support, but practical deployment must address modality adaptation, data quality, and label accuracy. Abstract: Geospatial foundation models (GFMs) are a fast-emerging paradigm for various geospatial tasks, such as ecological mapping. However, the utility of GFMs has not been fully explored for high-value use cases. This study aims to explore the utility, challenges and opportunities associated with the application of GFMs for ecological uses. In this regard, we fine-tune several pretrained AI models, namely, Prithvi-EO-2.0 and TerraMind, across three use cases, and compare this with a baseline ResNet-101 model. First, we demonstrate TerraMind's LULC generation capabilities. We then explore the utility of the GFMs in forest functional trait mapping and peatlands detection. In all experiments, the GFMs outperform the baseline ResNet models. In general, TerraMind marginally outperforms Prithvi. However, with additional modalities TerraMind significantly outperforms the baseline ResNet and Prithvi models. Nonetheless, consideration should be given to the divergence of input data from pretrained modalities. We note that these models would benefit from higher resolution and more accurate labels, especially for use cases where pixel-level dynamics need to be mapped.

[94] A Diffusion-Based Generative Prior Approach to Sparse-view Computed Tomography

Davide Evangelista,Pasquale Cascarano,Elena Loli Piccolomini

Main category: cs.CV

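TL;DR: This paper proposes a diffusion-based Deep Generative Prior (DGP) framework for sparse-view or limited-angle CT reconstruction that combines the explainability of model-based methods with the representational power of generative models.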

Details

Motivation: Insufficient data in sparse or limited-angle CT leads to artifacts and even structural distortion in the reconstructed images, which calls for new methods that combine explainability with generative power. Method: A diffusion-based generative model is embedded in the DGP framework and combined with an iterative optimization algorithm to solve the reconstruction problem, with modifications to the image generation, the model, and the optimization algorithm. Result: The approach yields very promising reconstructions even under highly sparse scan geometries. Conclusion: The method improves reconstruction quality while retaining the explainability of a model-based approach, although further research is needed. Abstract: The reconstruction of X-ray CT images from sparse or limited-angle geometries is a highly challenging task. The lack of data typically results in artifacts in the reconstructed image and may even lead to object distortions. For this reason, the use of deep generative models in this context is of great interest and holds considerable potential. In the Deep Generative Prior (DGP) framework, the use of diffusion-based generative models is combined with an iterative optimization algorithm for the reconstruction of CT images from sinograms acquired under sparse geometries, to maintain the explainability of a model-based approach while introducing the generative power of a neural network. There are therefore several aspects that can be further investigated within these frameworks to improve reconstruction quality, such as image generation, the model, and the iterative algorithm used to solve the minimization problem, for which we propose modifications with respect to existing approaches. The results obtained even under highly sparse geometries are very promising, although further research is clearly needed in this direction.

[95] OccFace: Unified Occlusion-Aware Facial Landmark Detection with Per-Point Visibility

Xinhao Xiang,Zhengxin Li,Saurav Dhakad,Theo Bancroft,Jiawei Zhang,Weiyang Li

Main category: cs.CV

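TL;DR: This paper proposes OccFace, an occlusion-aware facial landmark detection framework for universal human-like faces (humans, stylized characters, and other non-human designs). It jointly predicts landmark coordinates and per-point visibility to improve robustness under occlusion, and it comes with an accompanying dataset and evaluation metrics.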

Details

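Motivation: Existing methods handle occlusion implicitly during landmark localization and do not predict per-point visibility, which downstream applications need; performance is especially limited for human-like faces with large appearance variation and rotation-driven self-occlusion. Method: OccFace adopts a unified dense 100-point layout and a heatmap-based backbone, and adds an occlusion module that combines local evidence with cross-landmark context to jointly predict coordinates and visibility; visibility supervision mixes manual labels with pseudo labels derived from mask-heatmap overlap. Result: Experiments show markedly better robustness under external occlusion and large head rotations, particularly for landmarks in occluded regions, while accuracy on visible landmarks is preserved; the authors also release a dataset with 100-point landmarks and per-point visibility annotations, along with a dedicated evaluation suite (Occ AP, F1@0.5, ROC-AUC, and related metrics). Conclusion: By explicitly modeling landmark visibility, OccFace achieves robust, accurate landmark detection for a wide range of human-like faces under complex occlusion and provides a more reliable structured facial representation for downstream applications. Abstract: Accurate facial landmark detection under occlusion remains challenging, especially for human-like faces with large appearance variation and rotation-driven self-occlusion. Existing detectors typically localize landmarks while handling occlusion implicitly, without predicting per-point visibility that downstream applications can benefit from. We present OccFace, an occlusion-aware framework for universal human-like faces, including humans, stylized characters, and other non-human designs. OccFace adopts a unified dense 100-point layout and a heatmap-based backbone, and adds an occlusion module that jointly predicts landmark coordinates and per-point visibility by combining local evidence with cross-landmark context. Visibility supervision mixes manual labels with landmark-aware masking that derives pseudo visibility from mask-heatmap overlap. We also create an occlusion-aware evaluation suite reporting NME on visible vs. occluded landmarks and benchmarking visibility with Occ AP, F1@0.5, and ROC-AUC, together with a dataset annotated with 100-point landmarks and per-point visibility. Experiments show improved robustness under external occlusion and large head rotations, especially on occluded regions, while preserving accuracy on visible landmarks.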
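
The evaluation suite reports NME (normalized mean error) separately for visible and occluded landmarks. A minimal sketch of that split is shown below, assuming a single normalization distance per face (for example an inter-ocular distance); the function name, the three-point layout, and the coordinates are hypothetical and do not come from the OccFace code or dataset.

```python
import math


def nme_by_visibility(pred, truth, visible, norm):
    """Return (NME over visible landmarks, NME over occluded landmarks)."""
    errors = {True: [], False: []}
    for (px, py), (tx, ty), vis in zip(pred, truth, visible):
        # Euclidean error of one landmark, divided by the normalization distance.
        errors[bool(vis)].append(math.hypot(px - tx, py - ty) / norm)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(errors[True]), mean(errors[False])


# Hypothetical face with three landmarks; the last one is occluded.
truth = [(10.0, 10.0), (20.0, 10.0), (15.0, 20.0)]
pred = [(10.0, 13.0), (20.0, 10.0), (19.0, 23.0)]
visible = [1, 1, 0]
norm = 10.0  # assumed normalization distance in pixels

nme_visible, nme_occluded = nme_by_visibility(pred, truth, visible, norm)
```

Reporting the two averages separately keeps a large error on hidden points from being masked by the many easy visible points in an overall mean.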

[96] Multimodal Information Fusion for Chart Understanding: A Survey of MLLMs -- Evolution, Limitations, and Cognitive Enhancement

Zhihang Yi,Jian Zhao,Jiancheng Lv,Tao Wang

Main category: cs.CV

TL;DR: This survey provides a comprehensive roadmap for chart understanding using Multimodal Large Language Models (MLLMs), analyzing challenges, categorizing tasks and datasets, reviewing methodologies from classic deep learning to modern MLLMs, and identifying future directions like better alignment and reinforcement learning.

Details

Motivation: The field of MLLM-based chart analysis is fragmented and lacks systematic organization, necessitating a structured survey to unify and advance the domain. Method: The paper systematically structures the domain by analyzing fusion challenges, proposing a novel taxonomy of benchmarks (canonical and non-canonical), reviewing methodological evolution (from classic deep learning to state-of-the-art MLLMs), and critically assessing model limitations. Result: A comprehensive taxonomy and evolutionary overview of chart understanding with MLLMs, highlighting current gaps (e.g., perceptual and reasoning deficits) and outlining concrete future research directions. Conclusion: This survey serves as a foundational reference to guide researchers and practitioners toward building more robust, reliable, and cognitively enhanced chart understanding systems using MLLMs. Abstract: Chart understanding is a quintessential information fusion task, requiring the seamless integration of graphical and textual data to extract meaning. The advent of Multimodal Large Language Models (MLLMs) has revolutionized this domain, yet the landscape of MLLM-based chart analysis remains fragmented and lacks systematic organization. This survey provides a comprehensive roadmap of this nascent frontier by structuring the domain's core components. We begin by analyzing the fundamental challenges of fusing visual and linguistic information in charts. We then categorize downstream tasks and datasets, introducing a novel taxonomy of canonical and non-canonical benchmarks to highlight the field's expanding scope. Subsequently, we present a comprehensive evolution of methodologies, tracing the progression from classic deep learning techniques to state-of-the-art MLLM paradigms that leverage sophisticated fusion strategies. 
By critically examining the limitations of current models, particularly their perceptual and reasoning deficits, we identify promising future directions, including advanced alignment techniques and reinforcement learning for cognitive enhancement. This survey aims to equip researchers and practitioners with a structured understanding of how MLLMs are transforming chart information fusion and to catalyze progress toward more robust and reliable systems.

[97] Self-Supervised Image Super-Resolution Quality Assessment based on Content-Free Multi-Model Oriented Representation Learning

Kian Majlessi,Amir Masoud Soltani,Mohammad Ebrahim Mahdavi,Aurelien Gourrier,Peyman Adibi

Main category: cs.CV

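TL;DR: This paper proposes S3RIQA, a no-reference quality assessment method for real-world super-resolved images. It uses self-supervised contrastive learning, building positive and negative pairs from images produced by different SR algorithms to achieve content-independent quality assessment, and introduces a new dataset, SRMORSS, to support training.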

Details

Motivation: Super-resolved images in real-world settings exhibit complex, unpredictable degradations that existing quality assessment methods handle poorly, and effective no-reference assessment is lacking, especially in data-scarce domains. Method: S3RIQA (1) pretrains multiple SR-model-oriented representations with self-supervised contrastive learning; (2) builds a contrastive framework in which images from the same SR model are positives and images from different models are negatives; (3) adds targeted preprocessing and an auxiliary task to handle the degradation differences caused by different scaling factors; and (4) constructs a new dataset, SRMORSS, for unsupervised pretraining. Result: On real SR-IQA benchmarks, S3RIQA consistently outperforms most existing state-of-the-art metrics. Conclusion: S3RIQA is a domain-adaptive no-reference SR-IQA method suited to real-world, data-scarce scenarios; its core idea is to tie degradation modeling to the SR algorithm and to learn content-independent quality representations through self-supervised contrastive learning. Abstract: Super-resolution (SR) applied to real-world low-resolution (LR) images often results in complex, irregular degradations that stem from the inherent complexity of natural scene acquisition. In contrast to SR artifacts arising from synthetic LR images created under well-defined scenarios, those distortions are highly unpredictable and vary significantly across different real-life contexts. Consequently, assessing the quality of SR images (SR-IQA) obtained from realistic LR images remains a challenging and underexplored problem. In this work, we introduce a no-reference SR-IQA approach tailored for such highly ill-posed realistic settings. The proposed method enables domain-adaptive IQA for real-world SR applications, particularly in data-scarce domains. We hypothesize that degradations in super-resolved images are strongly dependent on the underlying SR algorithms, rather than being solely determined by image content. To this end, we introduce a self-supervised learning (SSL) strategy that first pretrains multiple SR-model-oriented representations in a pretext stage. Our contrastive learning framework forms positive pairs from images produced by the same SR model and negative pairs from those generated by different methods, independent of image content. The proposed approach, S3RIQA, further incorporates targeted preprocessing to extract complementary quality information and an auxiliary task to better handle the various degradation profiles associated with different SR scaling factors.
To this end, we constructed a new dataset, SRMORSS, to support unsupervised pretext training; it includes a wide range of SR algorithms applied to numerous real LR images, which addresses a gap in existing datasets. Experiments on real SR-IQA benchmarks demonstrate that S3RIQA consistently outperforms most state-of-the-art relevant metrics.
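
The pairing rule described above (same SR model gives a positive pair, different SR models give a negative pair, regardless of image content) can be sketched as follows. This is only an illustration of the pairing logic, not the authors' training code; `contrastive_pairs` and the string labels are hypothetical names.

```python
from itertools import combinations

def contrastive_pairs(sr_model_ids):
    """Split all index pairs into positives and negatives.

    sr_model_ids[i] names the SR algorithm that produced image i.
    Image content is deliberately ignored: only the producing model matters.
    """
    positives, negatives = [], []
    for i, j in combinations(range(len(sr_model_ids)), 2):
        if sr_model_ids[i] == sr_model_ids[j]:
            positives.append((i, j))
        else:
            negatives.append((i, j))
    return positives, negatives
```

For a batch labeled `["A", "A", "B"]`, images 0 and 1 form the only positive pair, while each of them forms a negative pair with image 2.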

[98] Spectral-Spatial Contrastive Learning Framework for Regression on Hyperspectral Data

Mohamad Dhaini,Paul Honeine,Maxime Berar,Antonin Van Exem

Main category: cs.CV

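TL;DR: This paper proposes a model-agnostic spectral-spatial contrastive learning framework for regression on hyperspectral data, together with augmentation transformations suited to such data; experiments show that it significantly improves the performance of several backbone networks.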

Details

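Motivation: Contrastive learning has succeeded in representation learning tasks such as image classification, but it remains underexplored for regression, particularly on hyperspectral data. Method: Proposes a spectral-spatial contrastive learning framework that supports backbones such as 3D convolutional and transformer-based networks, and designs a set of augmentation transformations suited to hyperspectral data. Result: Experiments on synthetic and real hyperspectral datasets show that the framework and transformations significantly improve the regression performance of all tested backbones. Conclusion: Spectral-spatial contrastive learning effectively improves representations for hyperspectral regression, and the framework is general across models and practical. Abstract: Contrastive learning has demonstrated great success in representation learning, especially for image classification tasks. However, there is still a shortage of studies targeting regression tasks, and more specifically applications on hyperspectral data. In this paper, we propose a spectral-spatial contrastive learning framework for regression tasks on hyperspectral data, with a model-agnostic design that can enhance backbones such as 3D convolutional and transformer-based networks. Moreover, we provide a collection of transformations relevant for augmenting hyperspectral data. Experiments on synthetic and real datasets show that the proposed framework and transformations significantly improve the performance of all studied backbone models.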

[99] Chatting with Images for Introspective Visual Thinking

Junfei Wu,Jian Guan,Qiang Liu,Shu Wu,Liang Wang,Wei Wu,Tienie Tan

Main category: cs.CV

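TL;DR: This paper proposes the "chatting with images" framework, which performs joint vision-language reasoning through language-guided feature modulation and addresses the shortcomings of existing large vision-language models in preserving fine-grained visual information and in cross-modal alignment.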

Details

Motivation: Existing large vision-language models rely on single-pass visual encoding, which loses fine-grained visual information; "thinking with images" approaches try to improve on this, but their visual states lack grounding in linguistic semantics, which hampers cross-modal alignment, especially when geometric reasoning spans distant regions or multiple images. Method: Proposes the "chatting with images" paradigm, which reframes visual manipulation as language-guided feature modulation; designs ViLaVT, whose dynamic vision encoder jointly re-encodes multiple image regions under the guidance of language prompts; trains with a two-stage curriculum of supervised fine-tuning and reinforcement learning. Result: Extensive experiments on eight benchmarks show that ViLaVT yields strong and consistent improvements, with especially large gains on complex multi-image and video-based spatial reasoning tasks. Conclusion: Language-guided dynamic visual re-encoding couples linguistic reasoning more tightly with visual state updates and is an effective new route to better cross-modal reasoning in LVLMs. Abstract: Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. Recently, the "thinking with images" proposal has attempted to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose "chatting with images", a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and train it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.

[100] Text-to-Vector Conversion for Residential Plan Design

Egor Bazhenov,Stepan Kasai,Viacheslav Shalamov,Valeria Efimova

Main category: cs.CV

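TL;DR: This paper proposes a new method for generating vector residential plans from textual descriptions and a new algorithm for vectorizing raster plans into structured vector images; both improve CLIPScore over existing methods (by about 5% and 4%, respectively).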

Details

Motivation: Vector graphics scale without quality loss, which matters in design and architecture, but they are complex to produce; automatic text-to-vector plan generation is still immature and needs more efficient, higher-quality methods. Method: Proposes a new method for generating vector residential plans from text that handles right angles inherently and supports flexible settings, and designs a new algorithm that converts raster plans into structured vector images. Result: The text-to-vector method scores about 5% higher in CLIPScore than existing solutions; images produced by the vectorization algorithm improve CLIPScore by about 4%. Conclusion: The proposed methods improve both visual quality and structural plausibility, advancing text-driven vector graphics generation and raster-to-vector conversion. Abstract: Computer graphics, comprising both raster and vector components, is a fundamental part of modern science, industry, and digital communication. While raster graphics offer ease of use, their pixel-based structure limits scalability. Vector graphics, defined by mathematical primitives, provide scalability without quality loss; however, they are more complex to produce. For design and architecture, the versatility of vector graphics is paramount, despite its computational demands. This paper introduces a novel method for generating vector residential plans from textual descriptions. Our approach surpasses existing solutions by approximately 5% in CLIPScore-based visual quality, benefiting from its inherent handling of right angles and flexible settings. Additionally, we present a new algorithm for vectorizing raster plans into structured vector images. Such images achieve a CLIPScore about 4% higher than those of other methods.

[101] Dual-End Consistency Model

Linwei Dong,Ruoyu Guo,Ge Bai,Zehuan Yuan,Yawei Luo,Changqing Zou

Main category: cs.CV

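TL;DR: This paper proposes the Dual-End Consistency Model (DE-CM), which combines critical sub-trajectory selection, a continuous-time CM objective with flow-matching boundary regularization, and a noise-to-noisy (N2N) mapping to address the training instability and inflexible sampling of consistency models, reaching a state-of-the-art one-step FID of 1.70 on ImageNet 256×256.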

Details

Motivation: Consistency models (CMs) accelerate generation for diffusion and flow models but suffer from two bottlenecks: training instability (loss divergence caused by the self-supervised term) and inflexible sampling (caused by error accumulation); existing methods overlook the key role of trajectory selection. Method: The Dual-End Consistency Model (DE-CM) (1) decomposes the PF-ODE trajectory and selects three critical sub-trajectories as optimization targets; (2) uses a continuous-time CM objective for few-step distillation; (3) introduces flow matching as a boundary regularizer to stabilize training; and (4) designs a novel noise-to-noisy (N2N) mapping that maps noise directly to any intermediate point, alleviating error accumulation in the first step. Result: On ImageNet 256×256, one-step generation reaches an FID of 1.70, outperforming existing CM-based one-step methods. Conclusion: Trajectory selection is central to the training stability and sampling flexibility of consistency models; DE-CM combines sub-trajectory clusters, boundary regularization, and the N2N mapping to offer a new paradigm for efficient generative modeling. Abstract: The slow iterative sampling nature remains a major bottleneck for the practical deployment of diffusion and flow-based generative models. While consistency models (CMs) represent a state-of-the-art distillation-based approach for efficient generation, their large-scale application is still limited by two key issues: training instability and inflexible sampling. Existing methods seek to mitigate these problems through architectural adjustments or regularized objectives, yet overlook the critical reliance on trajectory selection. In this work, we first analyze these two limitations: training instability originates from loss divergence induced by an unstable self-supervised term, whereas sampling inflexibility arises from error accumulation. Based on this analysis, we propose the Dual-End Consistency Model (DE-CM), which selects vital sub-trajectory clusters to achieve stable and effective training. DE-CM decomposes the PF-ODE trajectory and selects three critical sub-trajectories as optimization targets. Specifically, our approach leverages continuous-time CM objectives to achieve few-step distillation and utilizes flow matching as a boundary regularizer to stabilize the training process. Furthermore, we propose a novel noise-to-noisy (N2N) mapping that can map noise to any point, thereby alleviating the error accumulation in the first step. Extensive experimental results show the effectiveness of our method: it achieves a state-of-the-art FID score of 1.70 in one-step generation on the ImageNet 256x256 dataset, outperforming existing CM-based one-step approaches.

[102] From Steering to Pedalling: Do Autonomous Driving VLMs Generalize to Cyclist-Assistive Spatial Perception and Planning?

Krishna Kanth Nakka,Vedasri Nakka

Main category: cs.CV

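TL;DR: This paper introduces CyclingVQA, the first diagnostic benchmark for vision-language models from a cyclist's perspective, covering perception, spatio-temporal understanding, and traffic-rule-to-lane reasoning. Experiments show that current VLMs have some capability but remain clearly weak at recognizing cyclist-specific traffic cues and associating signs with lanes, and that driving-specialized models transfer poorly to cycling scenarios.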

Details

Motivation: Existing VLM evaluations are vehicle-centric and do not systematically assess perception and reasoning from a cyclist's viewpoint, even though cyclists face distinct safety challenges in urban traffic and need dedicated assistive systems. Method: Builds the CyclingVQA diagnostic benchmark, covering cyclist-perspective perception, spatio-temporal understanding, and traffic-rule-to-lane reasoning; evaluates 31+ mainstream VLMs (general-purpose, spatially enhanced, and autonomous-driving-specialized) under a unified protocol; and performs systematic error analysis. Result: Current VLMs perform unevenly on cyclist-related tasks: generalist models often beat driving-specialized ones, and models are notably weak at recognizing cyclist-specific traffic cues (such as bicycle signs and hand signals) and at mapping traffic signs to the correct lane. Conclusion: CyclingVQA exposes key shortcomings of existing VLMs in cyclist-centric understanding, underlines the need for model architectures and training paradigms tailored to cyclists, and gives clear directions for future cyclist-assistive intelligent systems. Abstract: Cyclists often encounter safety-critical situations in urban traffic, highlighting the need for assistive systems that support safe and informed decision-making. Recently, vision-language models (VLMs) have demonstrated strong performance on autonomous driving benchmarks, suggesting their potential for general traffic understanding and navigation-related reasoning. However, existing evaluations are predominantly vehicle-centric and fail to assess perception and reasoning from a cyclist-centric viewpoint. To address this gap, we introduce CyclingVQA, a diagnostic benchmark designed to probe perception, spatio-temporal understanding, and traffic-rule-to-lane reasoning from a cyclist's perspective. Evaluating 31+ recent VLMs spanning general-purpose, spatially enhanced, and autonomous-driving-specialized models, we find that current models demonstrate encouraging capabilities, while also revealing clear areas for improvement in cyclist-centric perception and reasoning, particularly in interpreting cyclist-specific traffic cues and associating signs with the correct navigational lanes. Notably, several driving-specialized models underperform strong generalist VLMs, indicating limited transfer from vehicle-centric training to cyclist-assistive scenarios. Finally, through systematic error analysis, we identify recurring failure modes to guide the development of more effective cyclist-assistive intelligent systems.

[103] RSHallu: Dual-Mode Hallucination Evaluation for Remote-Sensing Multimodal Large Language Models with Domain-Tailored Mitigation

Zihui Zhou,Yong Feng,Yanying Chen,Guofan Duan,Zhenxi Song,Mingliang Zhou,Weijia Jia

Main category: cs.CV

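TL;DR: This paper presents RSHallu, a systematic study of hallucinations in remote-sensing multimodal large language models (RS-MLLMs). It builds an RS-oriented hallucination taxonomy, the RSHalluEval benchmark, and the RSHalluShield mitigation dataset, and proposes training-free plug-and-play mitigation strategies that substantially reduce hallucinations while preserving downstream task performance.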

Details

Motivation: Hallucinations, i.e. responses inconsistent with the input image, limit the use of remote-sensing MLLMs in high-stakes scenarios such as emergency management and agricultural monitoring, and the problem has not been studied systematically in remote sensing. Method: (1) Builds an RS-oriented hallucination taxonomy and introduces the notion of image-level hallucination; (2) builds the RSHalluEval benchmark with 2,023 QA pairs, supporting both cloud auditing and lightweight local checking; (3) releases the RSHalluShield training dataset with 30k QA pairs and proposes training-free mitigation strategies, including decoding-time logit correction and RS-aware prompting. Result: Across several representative RS-MLLMs, the mitigation methods raise the hallucination-free rate by up to 21.63 percentage points while remaining competitive on downstream tasks such as RSVQA and RSVG. Conclusion: RSHallu is the first work to systematically define, evaluate, and mitigate hallucinations in remote-sensing MLLMs, laying groundwork for high-reliability remote-sensing AI applications. Abstract: Multimodal large language models (MLLMs) are increasingly adopted in remote sensing (RS) and have shown strong performance on tasks such as RS visual grounding (RSVG), RS visual question answering (RSVQA), and multimodal dialogue. However, hallucinations, which are responses inconsistent with the input RS images, severely hinder their deployment in high-stakes scenarios (e.g., emergency management and agricultural monitoring) and remain under-explored in RS. In this work, we present RSHallu, a systematic study with three deliverables: (1) we formalize RS hallucinations with an RS-oriented taxonomy and introduce image-level hallucination to capture RS-specific inconsistencies beyond object-centric errors (e.g., modality, resolution, and scene-level semantics); (2) we build a hallucination benchmark RSHalluEval (2,023 QA pairs) and enable dual-mode checking, supporting high-precision cloud auditing and low-cost reproducible local checking via a compact checker fine-tuned on the RSHalluCheck dataset (15,396 QA pairs); and (3) we introduce a domain-tailored dataset RSHalluShield (30k QA pairs) for training-friendly mitigation and further propose training-free plug-and-play strategies, including decoding-time logit correction and RS-aware prompting. Across representative RS-MLLMs, our mitigation improves the hallucination-free rate by up to 21.63 percentage points under a unified protocol, while maintaining competitive performance on downstream RS tasks (RSVQA/RSVG). Code and datasets will be released.

[104] DMP-3DAD: Cross-Category 3D Anomaly Detection via Realistic Depth Map Projection with Few Normal Samples

Zi Wang,Katsuya Hotta,Koichiro Kamide,Yawen Zou,Jianjian Qin,Chao Zhang,Jun Yu

Main category: cs.CV

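TL;DR: This paper proposes DMP-3DAD, a training-free framework for cross-category anomaly detection on 3D point clouds that achieves effective few-shot detection via multi-view realistic depth map projection and a frozen CLIP visual encoder.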

Details

Motivation: Existing methods depend on category-specific training, which makes them hard to adapt to few-shot scenarios and limits cross-category flexibility. Method: Converts point clouds into a fixed number of realistic depth images, extracts multi-view features with a frozen CLIP visual encoder, and detects anomalies through weighted feature similarity, with no fine-tuning or category-specific adaptation. Result: On the ShapeNetPart dataset, DMP-3DAD achieves state-of-the-art performance in the few-shot setting. Conclusion: DMP-3DAD offers a simple yet effective solution for cross-category 3D anomaly detection that is practical and generalizes well. Abstract: Cross-category anomaly detection for 3D point clouds aims to determine whether an unseen object belongs to a target category using only a few normal examples. Most existing methods rely on category-specific training, which limits their flexibility in few-shot scenarios. In this paper, we propose DMP-3DAD, a training-free framework for cross-category 3D anomaly detection based on multi-view realistic depth map projection. Specifically, by converting point clouds into a fixed set of realistic depth images, our method leverages a frozen CLIP visual encoder to extract multi-view representations and performs anomaly detection via weighted feature similarity, which does not require any fine-tuning or category-dependent adaptation. Extensive experiments on the ShapeNetPart dataset demonstrate that DMP-3DAD achieves state-of-the-art performance under the few-shot setting. The results show that the proposed approach provides a simple yet effective solution for practical cross-category 3D anomaly detection.
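
One way to read "anomaly detection via weighted feature similarity" is sketched below: compare each view's feature of the test object with the matching view of a normal reference, then average the cosine similarities with per-view weights. This is an illustrative guess rather than the paper's scoring rule; the feature vectors stand in for frozen-encoder outputs, and `weights` and both function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def anomaly_score(test_views, ref_views, weights):
    """1 minus the weighted mean per-view similarity; higher is more anomalous.

    test_views[k] and ref_views[k] are features of the k-th depth view.
    weights[k] is an assumed per-view importance (must not sum to zero).
    """
    weighted = sum(w * cosine(t, r)
                   for w, t, r in zip(weights, test_views, ref_views))
    return 1.0 - weighted / sum(weights)
```

With several normal samples, one plausible extension is to take the minimum score over the references, so that an object only needs to match one normal example closely.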

[105] DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

Chenlong Deng,Mengjie Deng,Junjie Wu,Dun Zeng,Teng Wang,Qingsong Xie,Jiadeng Huang,Shengjie Ma,Changwang Zhang,Zhaoxiang Wang,Jun Wang,Yutao Zhu,Zhicheng Dou

Main category: cs.CV

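TL;DR: This paper proposes DeepImageSearch, a new paradigm that recasts image retrieval as an autonomous exploration task, and builds the DISBench benchmark to evaluate context-dependent retrieval over temporally ordered visual streams.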

Details

Motivation: Existing multimodal retrieval systems focus only on semantic matching between a single image and a query, ignoring the temporal dependencies and contextual links inherent in real visual streams. Method: Proposes the agentic DeepImageSearch paradigm and builds the DISBench benchmark; designs a human-model collaborative pipeline to generate context-dependent queries; develops a modular agent baseline equipped with fine-grained tools and a dual-memory system. Result: Experiments show that current state-of-the-art models perform poorly on DISBench, confirming both the difficulty of the benchmark and the need for agentic reasoning. Conclusion: Image retrieval needs to move from static matching to an autonomous exploration paradigm with multi-step reasoning and long-horizon context modeling. Abstract: Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine-grained tools and a dual-memory system for long-horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state-of-the-art models, highlighting the necessity of incorporating agentic reasoning into next-generation retrieval systems.

[106] Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

Aojun Lu,Tao Feng,Hangjie Yuan,Wei Li,Yanan Sun

Main category: cs.CV

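TL;DR: This paper offers a data-centric explanation: RL outperforms SFT in VLM post-training because it implicitly selects medium-difficulty samples. Based on this, the authors design DC-SFT, an explicit difficulty-filtering method that surpasses RL in OOD generalization, stability, and efficiency.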

Details

Motivation: To explain why RL post-training gives VLMs stronger OOD generalization than SFT, the paper puts forward a data-centric hypothesis of implicit data-difficulty filtering. Method: Systematically evaluates how training data of different difficulty levels affects the OOD generalization of SFT models, and on that basis proposes Difficulty-Curated SFT (DC-SFT), which explicitly filters the training set by sample difficulty. Result: Confirms that sample difficulty strongly affects OOD performance (hard samples hurt generalization); DC-SFT surpasses both standard SFT and RL methods in OOD generalization while being more stable and more efficient. Conclusion: RL's generalization advantage stems from its implicit difficulty-filtering mechanism; DC-SFT is a simpler, more efficient, and more stable data-centric alternative that opens a new path to robust VLM generalization. Abstract: The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.
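
The idea of explicitly filtering the SFT training set by sample difficulty can be sketched as below. The abstract does not say how difficulty is measured or where the cut-offs lie, so this sketch assumes a per-sample `pass_rate` field (for example, the fraction of correct answers from a reference model) and hypothetical thresholds `low` and `high` that keep a medium-difficulty band.

```python
def curate_by_difficulty(samples, low=0.2, high=0.8):
    """Keep samples whose estimated pass rate lies in [low, high].

    samples: list of dicts with a 'pass_rate' key in [0, 1] (assumed field).
    A pass rate near 0 marks a hard sample, near 1 an easy one; both
    extremes are dropped here, which is an assumption of this sketch.
    """
    return [s for s in samples if low <= s["pass_rate"] <= high]
```

Setting `high=1.0` would turn this into a filter that removes only the hard samples, which is the case the abstract singles out as harmful to OOD performance.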

[107] Resource-Efficient RGB-Only Action Recognition for Edge Deployment

Dongsik Yoon,Jongeun Kim,Dayeon Lee

Main category: cs.CV

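TL;DR: This paper proposes a lightweight RGB-only video action recognition network optimized for edge devices, which keeps accuracy high while substantially reducing compute and resource costs.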

Details

Motivation: Action recognition on edge devices faces strict constraints on latency, memory, storage, and power; multimodal methods (such as skeleton or depth) can improve performance but depend on extra sensors or costly pose estimation, which makes them unsuitable for edge use. Method: Builds a compact RGB-only model on an X3D-style backbone, adding Temporal Shift, selective temporal adaptation, and parameter-free attention. Result: Achieves a good balance of accuracy and efficiency on NTU RGB+D 60/120; profiling on the Jetson Orin Nano shows a smaller on-device footprint and more practical resource utilization. Conclusion: The RGB-only approach makes edge deployment considerably more feasible without sacrificing accuracy, offering an efficient option for action recognition in resource-constrained settings. Abstract: Action recognition on edge devices poses stringent constraints on latency, memory, storage, and power consumption. While auxiliary modalities such as skeleton and depth information can enhance recognition performance, they often require additional sensors or computationally expensive pose-estimation pipelines, limiting practicality for edge use. In this work, we propose a compact RGB-only network tailored for efficient on-device inference. Our approach builds upon an X3D-style backbone augmented with Temporal Shift, and further introduces selective temporal adaptation and parameter-free attention. Extensive experiments on the NTU RGB+D 60 and 120 benchmarks demonstrate a strong accuracy-efficiency balance. Moreover, deployment-level profiling on the Jetson Orin Nano verifies a smaller on-device footprint and practical resource utilization compared to existing RGB-based action recognition techniques.
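
For readers unfamiliar with Temporal Shift, the sketch below shows the commonly used formulation: a fraction of the channels is filled from the next frame, an equal fraction from the previous frame, and the rest is left in place, with zero padding at the sequence boundaries. It operates on a plain `x[t][c]` nested list for clarity; the `fold_div` value and the exact variant used in this paper are assumptions, since the abstract gives no details.

```python
def temporal_shift(x, fold_div=4):
    """Shift a fraction of channels along time; x is indexed as x[t][c].

    The first C // fold_div channels take their value from frame t + 1,
    the next C // fold_div from frame t - 1, and the rest stay unchanged.
    Positions with no source frame are zero-padded.
    """
    num_frames, num_channels = len(x), len(x[0])
    fold = num_channels // fold_div
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1      # pull from the future frame
            elif c < 2 * fold:
                src = t - 1      # pull from the past frame
            else:
                src = t          # unshifted channel
            if 0 <= src < num_frames:
                out[t][c] = x[src][c]
    return out
```

The operation adds no parameters and no multiplications, which is why it is attractive for the edge-deployment budget discussed above.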

[108] Flow caching for autoregressive video generation

Yuexiao Ma,Xuzhe Zheng,Jing Xu,Xiwei Xu,Feng Ling,Xiawu Zheng,Huafeng Kuang,Huixia Li,Xing Wang,Xuefeng Xiao,Fei Chao,Rongrong Ji

Main category: cs.CV

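TL;DR: This paper proposes FlowCache, the first caching framework designed specifically for autoregressive video generation. With a chunkwise independent caching policy and a joint importance-redundancy optimized KV cache compression mechanism, it substantially accelerates ultra-long video generation (2.38x to 6.7x speedup) with almost no loss in quality.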

Details

Motivation: Autoregressive video generation is slow because it synthesizes content sequentially; existing caching methods assume uniform denoising across frames, but in autoregressive models different video chunks show different similarity patterns at the same timestep, so conventional caching breaks down. Method: Proposes a chunkwise caching strategy in which every video chunk makes its own caching decisions, and designs a joint importance-redundancy optimized KV cache compression mechanism that preserves generation quality under a fixed memory budget. Result: Achieves speedups of 2.38x on MAGI-1 and 6.7x on SkyReels-V2, with VBench changes of only +0.87 and −0.79 respectively, i.e. almost no quality loss. Conclusion: FlowCache unlocks the potential of autoregressive models for real-time, ultra-long video generation and sets a new benchmark for efficient large-scale video synthesis. Abstract: Autoregressive models, often built on Transformer architectures, represent a powerful paradigm for generating ultra-long videos by synthesizing content in sequential chunks. However, this sequential generation process is notoriously slow. While caching strategies have proven effective for accelerating traditional video diffusion models, existing methods assume uniform denoising across all frames, an assumption that breaks down in autoregressive models where different video chunks exhibit varying similarity patterns at identical timesteps. In this paper, we present FlowCache, the first caching framework specifically designed for autoregressive video generation. Our key insight is that each video chunk should maintain independent caching policies, allowing fine-grained control over which chunks require recomputation at each timestep. We introduce a chunkwise caching strategy that dynamically adapts to the unique denoising characteristics of each chunk, complemented by a joint importance-redundancy optimized KV cache compression mechanism that maintains fixed memory bounds while preserving generation quality. Our method achieves remarkable speedups of 2.38 times on MAGI-1 and 6.7 times on SkyReels-V2, with negligible quality degradation (VBench: 0.87 increase and 0.79 decrease respectively). These results demonstrate that FlowCache successfully unlocks the potential of autoregressive models for real-time, ultra-long video generation, establishing a new benchmark for efficient video synthesis at scale. The code is available at https://github.com/mikeallen39/FlowCache.
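
The key idea that each chunk keeps its own caching decision can be illustrated with a small sketch. It is not FlowCache's actual criterion: here a chunk is recomputed when its current input has drifted from the input seen at its last full computation by more than a relative L1 threshold, and both the drift measure and the `threshold` value are assumptions made for illustration.

```python
def chunks_to_recompute(curr_inputs, cached_inputs, threshold=0.05):
    """Return indices of chunks whose cached output should not be reused.

    curr_inputs[i]:   current input features of chunk i (list of floats).
    cached_inputs[i]: input at the last full computation, or None if the
                      chunk has never been computed.
    Each chunk is tested on its own, so some chunks can reuse their cache
    at a timestep while others are recomputed.
    """
    stale = []
    for i, (cur, old) in enumerate(zip(curr_inputs, cached_inputs)):
        if old is None:
            stale.append(i)
            continue
        diff = sum(abs(a - b) for a, b in zip(cur, old))
        scale = sum(abs(b) for b in old) or 1.0  # avoid division by zero
        if diff / scale > threshold:
            stale.append(i)
    return stale
```

A single global decision would force all chunks to follow the most rapidly changing one; the per-chunk test avoids that, which is the behaviour the abstract argues for.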

[109] Hyperspectral Smoke Segmentation via Mixture of Prototypes

Lujian Yao,Haitao Zhao,Xianghai Kong,Yuhan Xu

Main category: cs.CV

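TL;DR: This paper proposes a smoke segmentation method based on hyperspectral imaging, builds HSSDataset, the first hyperspectral smoke segmentation dataset, and designs a mixture of prototypes (MoP) network for adaptive band weighting, which markedly improves smoke segmentation performance.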

Details

Motivation: Conventional visible-light methods for smoke segmentation are limited by insufficient spectral information and struggle with cloud interference and semi-transparent smoke regions. Method: Proposes a mixture of prototypes (MoP) network with band split, prototype-based spectral representation, and a dual-level router; builds two new datasets, HSSDataset and MSSDataset. Result: Outperforms existing methods on both hyperspectral and multispectral modalities, confirming the effectiveness and generality of the approach. Conclusion: The work establishes a new paradigm for spectral-image smoke segmentation and advances intelligent sensing for wildfire management and industrial safety. Abstract: Smoke segmentation is critical for wildfire management and industrial safety applications. Traditional visible-light-based methods face limitations due to insufficient spectral information, particularly struggling with cloud interference and semi-transparent smoke regions. To address these challenges, we introduce hyperspectral imaging for smoke segmentation and present the first hyperspectral smoke segmentation dataset (HSSDataset) with carefully annotated samples collected from over 18,000 frames across 20 real-world scenarios using a Many-to-One annotation protocol. However, different spectral bands exhibit varying discriminative capabilities across spatial regions, necessitating adaptive band weighting strategies. We decompose this into three technical challenges: spectral interaction contamination, limited spectral pattern modeling, and complex weighting router problems. We propose a mixture of prototypes (MoP) network with: (1) Band split for spectral isolation, (2) Prototype-based spectral representation for diverse patterns, and (3) Dual-level router for adaptive spatial-aware band weighting. We further construct a multispectral dataset (MSSDataset) with RGB-infrared images. Extensive experiments validate superior performance across both hyperspectral and multispectral modalities, establishing a new paradigm for spectral-based smoke segmentation.

[110] Stride-Net: Fairness-Aware Disentangled Representation Learning for Chest X-Ray Diagnosis

Darakshan Rashid,Raza Imam,Dwarikanath Mahapatra,Brejesh Lall

Main category: cs.CV

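TL;DR: This paper proposes Stride-Net, a fairness-aware chest X-ray classification framework. Using patch-level learnable masks, an adversarial confusion loss, and semantic alignment based on Group Optimal Transport, it balances disease discriminability against demographic invariance, achieving both accuracy and fairness on MIMIC-CXR and CheXpert.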

Details

Motivation: Deep neural networks for chest X-ray classification underperform on specific demographic subgroups (for example, by race or gender), threatening clinical safety and equity; existing debiasing methods often sacrifice overall diagnostic performance or generalize poorly, and they treat fairness as a post hoc constraint rather than an inherent property of the representation. Method: Stride-Net operates at the image patch level: (1) a learnable stride-based mask selects disease-relevant regions and suppresses sensitive attribute information, combined with an adversarial confusion loss; (2) Group Optimal Transport aligns image features with BioBERT-generated disease label embeddings to prevent shortcut learning. Result: On MIMIC-CXR and CheXpert, across race and intersectional race-gender subgroups, Stride-Net consistently improves fairness metrics (such as equal opportunity difference and mean absolute deviation) on architectures including ResNet and ViT, while matching or exceeding baseline overall accuracy. Conclusion: Stride-Net builds fairness into representation learning by disentangling disease discriminability from dependence on sensitive attributes and anchoring features in clinical semantics, achieving a better accuracy-fairness trade-off and offering a new paradigm for robust, fair deployment of medical AI. Abstract: Deep neural networks for chest X-ray classification achieve strong average performance, yet often underperform for specific demographic subgroups, raising critical concerns about clinical safety and equity. Existing debiasing methods frequently yield inconsistent improvements across datasets or attain fairness by degrading overall diagnostic utility, treating fairness as a post hoc constraint rather than a property of the learned representation. In this work, we propose Stride-Net (Sensitive Attribute Resilient Learning via Disentanglement and Learnable Masking with Embedding Alignment), a fairness-aware framework that learns disease-discriminative yet demographically invariant representations for chest X-ray analysis. Stride-Net operates at the patch level, using a learnable stride-based mask to select label-aligned image regions while suppressing sensitive attribute information through an adversarial confusion loss. To anchor representations in clinical semantics and discourage shortcut learning, we further enforce semantic alignment between image features and BioBERT-based disease label embeddings via Group Optimal Transport. We evaluate Stride-Net on the MIMIC-CXR and CheXpert benchmarks across race and intersectional race-gender subgroups. Across architectures including ResNet and Vision Transformers, Stride-Net consistently improves fairness metrics while matching or exceeding baseline accuracy, achieving a more favorable accuracy-fairness trade-off than prior debiasing approaches.
Our code is available at https://github.com/Daraksh/Fairness_StrideNet.

[111] Chart Specification: Structural Representations for Incentivizing VLM Reasoning in Chart-to-Code Generation

Minggui He,Mingchen Dai,Jian Zhang,Yilun Liu,Shimin Tao,Pufan Zeng,Osamu Yoshie,Yuya Ieiri

Main category: cs.CV

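TL;DR: This paper proposes Chart Specification, a structured intermediate representation that improves the fidelity of chart-to-code generation through structural supervision and clearly outperforms existing methods.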

Details

Motivation: Existing approaches based on supervised fine-tuning only encourage surface-level token imitation, cannot guarantee faithful modeling of chart structure, and tend to produce hallucinated or semantically inconsistent outputs. Method: Proposes Chart Specification as a structured intermediate representation, filters syntactic noise to build a structurally balanced training set, and designs a Spec-Align Reward that gives fine-grained, verifiable feedback on structural correctness so that reinforcement learning can enforce consistent plotting logic. Result: Consistently outperforms prior methods on three public benchmarks; with only 3K samples it surpasses leading baselines by up to 61.7%, and scaling to 4K samples sets a new state of the art on all metrics. Conclusion: Precise structural supervision is an efficient route to high-fidelity chart-to-code generation. Abstract: Vision-Language Models (VLMs) have shown promise in generating plotting code from chart images, yet achieving structural fidelity remains challenging. Existing approaches largely rely on supervised fine-tuning, encouraging surface-level token imitation rather than faithful modeling of underlying chart structure, which often leads to hallucinated or semantically inconsistent outputs. We propose Chart Specification, a structured intermediate representation that shifts training from text imitation to semantically grounded supervision. Chart Specification filters syntactic noise to construct a structurally balanced training set and supports a Spec-Align Reward that provides fine-grained, verifiable feedback on structural correctness, enabling reinforcement learning to enforce consistent plotting logic. Experiments on three public benchmarks show that our method consistently outperforms prior approaches. With only 3K training samples, we achieve strong data efficiency, surpassing leading baselines by up to 61.7% on complex benchmarks, and scaling to 4K samples establishes new state-of-the-art results across all evaluated metrics. Overall, our results demonstrate that precise structural supervision offers an efficient pathway to high-fidelity chart-to-code generation. Code and dataset are available at: https://github.com/Mighten/chart-specification-paper

[112] ResWorld: Temporal Residual World Model for End-to-End Autonomous Driving

Jinqing Zhang,Zehua Fu,Zelin Xu,Wenying Dai,Qingjie Liu,Yunhong Wang

Main category: cs.CV

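TL;DR: This paper proposes TR-World, a temporal residual world model focused on dynamic-object modeling. It extracts dynamic information by computing temporal residuals of scene representations and pairs this with an FGTR module that lets trajectories interact with future BEV features, significantly improving end-to-end autonomous driving planning.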

Details

Motivation: Existing world models model static regions redundantly and lack deep interaction with trajectories, which limits their effectiveness for autonomous driving planning. Method: The Temporal Residual World Model (TR-World) takes only temporal residuals as input to model dynamic objects, and a Future-Guided Trajectory Refinement (FGTR) module provides two-way interaction and supervision between prior trajectories and future BEV features. Result: On the nuScenes and NAVSIM datasets, the ResWorld method achieves state-of-the-art planning performance. Conclusion: A world-model design that focuses on dynamic modeling and jointly optimizes trajectories with future features can effectively improve the planning accuracy and robustness of end-to-end autonomous driving systems. Abstract: The comprehensive understanding capabilities of world models for driving scenarios have significantly improved the planning accuracy of end-to-end autonomous driving frameworks. However, the redundant modeling of static regions and the lack of deep interaction with trajectories hinder world models from exerting their full effectiveness. In this paper, we propose the Temporal Residual World Model (TR-World), which focuses on dynamic object modeling. By calculating the temporal residuals of scene representations, the information of dynamic objects can be extracted without relying on detection and tracking. TR-World takes only temporal residuals as input, thus predicting the future spatial distribution of dynamic objects more precisely. By combining the prediction with the static object information contained in the current BEV features, accurate future BEV features can be obtained. Furthermore, we propose the Future-Guided Trajectory Refinement (FGTR) module, which conducts interaction between prior trajectories (predicted from the current scene representation) and the future BEV features. This module can not only utilize future road conditions to refine trajectories, but also provides sparse spatial-temporal supervision on future BEV features to prevent world model collapse. Comprehensive experiments conducted on the nuScenes and NAVSIM datasets demonstrate that our method, namely ResWorld, achieves state-of-the-art planning performance. The code is available at https://github.com/mengtan00/ResWorld.git.
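
The temporal-residual idea in the ResWorld abstract can be illustrated with a toy sketch. It assumes that a scene representation is a small 2D BEV grid and that the residual is a plain element-wise difference between consecutive frames; the function names and the threshold are hypothetical and are not taken from the paper or its repository.

```python
def temporal_residual(prev_bev, curr_bev):
    """Element-wise difference between two BEV grids given as lists of rows."""
    return [
        [curr - prev for prev, curr in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_bev, curr_bev)
    ]


def dynamic_cells(residual, threshold=0.5):
    """Grid cells whose absolute change exceeds a (hypothetical) threshold."""
    return [
        (r, c)
        for r, row in enumerate(residual)
        for c, value in enumerate(row)
        if abs(value) > threshold
    ]


# Toy example: one occupied cell moves one step to the right.
prev_bev = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
curr_bev = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]

residual = temporal_residual(prev_bev, curr_bev)
moving = dynamic_cells(residual)
```

Static cells cancel out in the residual, so only the cell the object left and the cell it entered are flagged. This is the intuition behind feeding residuals rather than full features to a dynamics model.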

[113] FastUSP: A Multi-Level Collaborative Acceleration Framework for Distributed Diffusion Model Inference

Guandong Li

Main category: cs.CV

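TL;DR: This paper proposes FastUSP, a framework that combines compile-level, communication-level, and operator-level optimizations to significantly improve multi-GPU distributed inference for large diffusion models such as FLUX and Qwen-Image, and in particular reduces kernel launch overhead, the main bottleneck.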

Details

Motivation: Existing Unified Sequence Parallelism (USP) implementations suffer from high kernel launch overhead and suboptimal computation-communication scheduling when running large models on multiple GPUs, which limits inference efficiency. Method: FastUSP is a multi-level optimization framework: (1) compile level: graph compilation with CUDA Graphs and computation-communication reordering; (2) communication level: FP8 quantized collective communication; (3) operator level: pipelined Ring attention with double buffering. Result: On FLUX (12B), FastUSP achieves a 1.12× to 1.16× end-to-end speedup over baseline USP, with compile-level optimization contributing the most; on Qwen-Image it reaches a 1.09× speedup on 2 GPUs; on 4 to 8 GPUs, a PyTorch Inductor compatibility limitation with Ring attention prevents compile optimization, while baseline USP still scales to 1.30× to 1.46× of 2-GPU performance. The analysis identifies kernel launch overhead as the primary bottleneck on modern high-bandwidth GPU interconnects. Conclusion: Through systematic multi-level optimization, FastUSP improves the efficiency of distributed attention inference for large models and shows that, on current hardware, kernel launch overhead is more limiting than communication latency, which points to a new direction for designing efficient distributed inference systems. Abstract: Large-scale diffusion models such as FLUX (12B parameters) and Stable Diffusion 3 (8B parameters) require multi-GPU parallelism for efficient inference. Unified Sequence Parallelism (USP), which combines Ulysses and Ring attention mechanisms, has emerged as the state-of-the-art approach for distributed attention computation. However, existing USP implementations suffer from significant inefficiencies including excessive kernel launch overhead and suboptimal computation-communication scheduling. In this paper, we propose FastUSP, a multi-level optimization framework that integrates compile-level optimization (graph compilation with CUDA Graphs and computation-communication reordering), communication-level optimization (FP8 quantized collective communication), and operator-level optimization (pipelined Ring attention with double buffering). We evaluate FastUSP on FLUX (12B) and Qwen-Image models across 2, 4, and 8 NVIDIA RTX 5090 GPUs. On FLUX, FastUSP achieves a consistent 1.12× to 1.16× end-to-end speedup over baseline USP, with compile-level optimization contributing the dominant improvement. On Qwen-Image, FastUSP achieves a 1.09× speedup on 2 GPUs; on 4 to 8 GPUs, we identify a PyTorch Inductor compatibility limitation with Ring attention that prevents compile optimization, while baseline USP scales to 1.30× to 1.46× of 2-GPU performance. 
We further provide a detailed analysis of the performance characteristics of distributed diffusion inference, revealing that kernel launch overhead, rather than communication latency, is the primary bottleneck on modern high-bandwidth GPU interconnects.

[114] Towards Learning a Generalizable 3D Scene Representation from 2D Observations

Martin Gromniak,Jan-Gerrit Habekost,Sebastian Kamp,Sven Magg,Stefan Wermter

Main category: cs.CV

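TL;DR: This paper proposes a generalizable Neural Radiance Field (NeRF) approach that predicts 3D occupancy of the global workspace from a robot's egocentric views without scene-specific finetuning, and validates its geometric reconstruction accuracy on a real robot.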

Details

Motivation: Existing methods mostly operate in camera-centric coordinates and are hard to use directly for robotic manipulation; a 3D occupancy prediction method is needed that models the scene in a global workspace frame and generalizes across scenes. Method: A Generalizable NeRF model maps multi-view egocentric observations to a 3D occupancy representation in the global workspace frame, accepts flexible source views, and needs no scene-specific finetuning. Result: Trained on 40 real scenes and tested on a humanoid robot, the model reaches 26 mm reconstruction error on 3D structure, including occluded regions, going beyond traditional stereo vision methods. Conclusion: The method generalizes effectively from egocentric observations to global workspace occupancy, giving robots a more complete and transferable 3D understanding of their environment for autonomous manipulation. Abstract: We introduce a Generalizable Neural Radiance Field approach for predicting 3D workspace occupancy from egocentric robot observations. Unlike prior methods operating in camera-centric coordinates, our model constructs occupancy representations in a global workspace frame, making it directly applicable to robotic manipulation. The model integrates flexible source views and generalizes to unseen object arrangements without scene-specific finetuning. We demonstrate the approach on a humanoid robot and evaluate predicted geometry against 3D sensor ground truth. Trained on 40 real scenes, our model achieves 26 mm reconstruction error, including occluded regions, validating its ability to infer complete 3D occupancy beyond traditional stereo vision methods.

[115] Healthy Harvests: A Comparative Look at Guava Disease Classification Using InceptionV3

Samanta Ghosh,Shaila Afroz Anika,Umma Habiba Ahmed,B. M. Shahria Alam,Mohammad Tahmid Noor,Nishat Tasnim Niloy

Main category: cs.CV

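TL;DR: This study uses InceptionV3 and ResNet50 to classify guava fruit images into three classes (anthracnose, fruit fly damage, healthy), improves performance with data augmentation and mixing strategies such as CutMix and MixUp, and adds SHAP analysis for interpretability; InceptionV3 reaches 98.15% accuracy.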

Details

Motivation: Guava fruits are susceptible to many diseases, and early identification is essential for protecting fruit quality and yield. Method: The study uses 473 original images from Mendeley Data, resized to 256×256 RGB and expanded to 3784 images through data augmentation; it trains InceptionV3 and ResNet50 deep learning models, applies the CutMix and MixUp data mixing methods, and evaluates performance and interpretability with confusion matrices and SHAP analysis. Result: InceptionV3 reaches 98.15% accuracy and ResNet50 reaches 94.46%; the confusion matrices confirm the overall classification performance of both models, and SHAP analysis reveals the image regions that most influence predictions. Conclusion: Advanced deep learning models, InceptionV3 in particular, combined with data augmentation, mixing strategies, and interpretability analysis, can identify and diagnose guava diseases efficiently and reliably. Abstract: Guava fruits often suffer from many diseases. This can harm fruit quality and crop yield. Early identification is important for minimizing damage and ensuring fruit health. This study focuses on three categories for classifying diseases: Anthracnose, Fruit flies, and Healthy fruit. The dataset used in this study was collected from Mendeley Data and contains 473 original images of guava. These images vary in size and format. The original images were resized to 256x256 pixels in RGB color mode for better consistency. After this, data augmentation was applied to improve the dataset by generating variations of the original images. The augmented dataset consists of 3784 images produced with advanced preprocessing techniques. Two deep learning models were implemented to classify the images. The InceptionV3 model is well known for its advanced architecture, which applies multiple convolutional filters to obtain different features effectively. On the other hand, the ResNet50 model helps to train deeper networks by using residual learning. The InceptionV3 model achieved an accuracy of 98.15%, and ResNet50 achieved 94.46%. Data mixing methods such as CutMix and MixUp were applied to enhance the models' robustness. The confusion matrix was used to evaluate the overall performance of both InceptionV3 and ResNet50. Additionally, SHAP analysis is used to improve interpretability, which helps to find the parts of the image that are significant for the model prediction. This study aims to highlight how advanced models enhan
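
As a rough illustration of the MixUp strategy named in the guava abstract, the sketch below blends two flattened images and their one-hot labels with a Beta-distributed weight. This is the generic MixUp recipe, not the authors' training code; the `alpha` value, the helper name `mixup`, and the toy inputs are assumptions made for the example.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two samples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Toy "images" (flattened pixels) and one-hot labels for two of the three classes.
anthracnose_img, anthracnose_lbl = [0.9, 0.8, 0.1, 0.2], [1.0, 0.0, 0.0]
healthy_img, healthy_lbl = [0.1, 0.2, 0.7, 0.6], [0.0, 0.0, 1.0]

x_mix, y_mix, lam = mixup(
    anthracnose_img, anthracnose_lbl, healthy_img, healthy_lbl,
    rng=random.Random(0),  # fixed seed so the sketch is repeatable
)
```

The mixed label stays a valid probability vector because both inputs are one-hot and the weights sum to one. Separately, the reported dataset sizes imply eight images per original after augmentation (473 × 8 = 3784).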

[116] VFGS-Net: Frequency-Guided State-Space Learning for Topology-Preserving Retinal Vessel Segmentation

Ruiqi Song,Lei Liu,Ya-Nan Zhang,Chao Wang,Xiaoning Li,Nan Mu

Main category: cs.CV

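TL;DR: This paper proposes VFGS-Net, an end-to-end retinal vessel segmentation network that combines frequency-aware feature enhancement, dual-path convolutional representation learning, and bidirectional asymmetric spatial modeling, significantly improving segmentation of fine vessels, branching structures, and low-contrast regions.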

Details

Motivation: Existing methods struggle to preserve fine capillaries while maintaining the overall topological continuity of vessels, mainly because vessels are elongated, vary widely in scale, and have low contrast. Method: VFGS-Net comprises a dual-path feature convolution module (capturing local textures and multi-scale semantics), a vessel-aware frequency-domain channel attention mechanism (adaptively reweighting spectral components), and a bidirectional asymmetric Mamba2-based spatial modeling block (capturing long-range dependencies and strengthening global continuity). Result: On four public retinal vessel datasets the model achieves leading or competitive performance relative to the state of the art, with clear gains on fine vessels, complex branches, and low-contrast regions. Conclusion: By modeling along multiple dimensions, VFGS-Net mitigates key challenges in retinal vessel segmentation and shows good robustness and clinical potential. Abstract: Accurate retinal vessel segmentation is a critical prerequisite for quantitative analysis of retinal images and computer-aided diagnosis of vascular diseases such as diabetic retinopathy. However, the elongated morphology, wide scale variation, and low contrast of retinal vessels pose significant challenges for existing methods, making it difficult to simultaneously preserve fine capillaries and maintain global topological continuity. To address these challenges, we propose the Vessel-aware Frequency-domain and Global Spatial modeling Network (VFGS-Net), an end-to-end segmentation framework that seamlessly integrates frequency-aware feature enhancement, dual-path convolutional representation learning, and bidirectional asymmetric spatial state-space modeling within a unified architecture. Specifically, VFGS-Net employs a dual-path feature convolution module to jointly capture fine-grained local textures and multi-scale contextual semantics. A novel vessel-aware frequency-domain channel attention mechanism is introduced to adaptively reweight spectral components, thereby enhancing vessel-relevant responses in high-level features. Furthermore, at the network bottleneck, we propose a bidirectional asymmetric Mamba2-based spatial modeling block to efficiently capture long-range spatial dependencies and strengthen the global continuity of vascular structures. Extensive experiments on four publicly available retinal vessel datasets demonstrate that VFGS-Net achieves competitive or superior performance compared to state-of-the-art methods. 
Notably, our model consistently improves segmentation accuracy for fine vessels, complex branching patterns, and low-contrast regions, highlighting its robustness and clinical potential.

[117] DFIC: Towards a balanced facial image dataset for automatic ICAO compliance verification

Nuno Gonçalves,Diogo Nunes,Carla Guerra,João Marcos

Main category: cs.CV

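TL;DR: This paper introduces the DFIC dataset, with about 58,000 images and 2706 videos, for automatically verifying whether facial images comply with ICAO standards, and uses it to develop a new method based on spatial attention that outperforms existing methods on compliance verification.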

Details

Motivation: Manually checking whether facial images in passports and other machine-readable travel documents (MRTDs) comply with ISO/IEC and ICAO standards is inefficient, so efficient automated compliance verification is needed. Method: The authors build the DFIC dataset and fine-tune on it a new model centered on spatial attention mechanisms to automatically verify ICAO compliance requirements. Result: The proposed method outperforms the state of the art on ICAO compliance verification; the DFIC dataset is publicly released and offers a more balanced demographic distribution and unprecedented diversity. Conclusion: The DFIC dataset and its accompanying method substantially improve automated ICAO compliance verification and can also be applied to improving the security, privacy, and fairness of facial recognition systems. Abstract: Ensuring compliance with ISO/IEC and ICAO standards for facial images in machine-readable travel documents (MRTDs) is essential for reliable identity verification, but current manual inspection methods are inefficient in high-demand environments. This paper introduces the DFIC dataset, a novel comprehensive facial image dataset comprising around 58,000 annotated images and 2706 videos of more than 1000 subjects, covering a broad range of non-compliant conditions in addition to compliant portraits. Our dataset provides a more balanced demographic distribution than the existing public datasets, with one partition that is nearly uniformly distributed, facilitating the development of automated ICAO compliance verification methods. Using DFIC, we fine-tuned a novel method that relies heavily on spatial attention mechanisms for the automatic validation of ICAO compliance requirements, and we compared it with state-of-the-art methods for ICAO compliance verification, demonstrating improved results. The DFIC dataset is now public (https://github.com/visteam-isr-uc/DFIC) for the training and validation of new models, offering an unprecedented diversity of faces that will improve both robustness and adaptability to the intrinsically diverse combinations of faces and props that can be presented to the validation system. These results emphasize the potential of DFIC to enhance automated ICAO compliance methods, but it can also be used in many other applications that aim to improve the security, privacy, and fairness of facial recognition systems.

[118] Interpretable Vision Transformers in Image Classification via SVDA

Vasileios Arampatzakis,George Pavlidis,Nikolaos Mitianoudis,Nikos Papamarkos

Main category: cs.CV

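TL;DR: This paper brings the SVD-Inspired Attention (SVDA) mechanism into the Vision Transformer (ViT) to improve the interpretability, sparsity, and spectral structure of attention; experiments on several image classification benchmarks show markedly better interpretability with no loss of accuracy.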

Details

Motivation: The attention mechanisms of Vision Transformers are often opaque, dense, and unstructured, and lack interpretability. Method: The previously proposed SVD-Inspired Attention (SVDA) mechanism is adapted to the ViT architecture, giving a geometrically grounded attention formulation, and the interpretability indicators defined with SVDA are used to monitor attention dynamics and representation structure during training. Result: On four benchmarks (CIFAR-10, FashionMNIST, CIFAR-100, and ImageNet-100), SVDA consistently produces more interpretable attention patterns without sacrificing classification accuracy. Conclusion: SVDA is a comprehensive and informative tool for analyzing and developing structured visual attention models, and lays a foundation for explainable AI, spectral diagnostics, and attention-based model compression. Abstract: Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. In this work, we adapt our previously proposed SVD-Inspired Attention (SVDA) mechanism to the ViT architecture, introducing a geometrically grounded formulation that enhances interpretability, sparsity, and spectral structure. We use interpretability indicators, originally proposed with SVDA, to monitor attention dynamics during training and assess structural properties of the learned representations. Experimental evaluations on four widely used benchmarks (CIFAR-10, FashionMNIST, CIFAR-100, and ImageNet-100) demonstrate that SVDA consistently yields more interpretable attention patterns without sacrificing classification accuracy. While the current framework offers descriptive insights rather than prescriptive guidance, our results establish SVDA as a comprehensive and informative tool for analyzing and developing structured attention models in computer vision. This work lays the foundation for future advances in explainable AI, spectral diagnostics, and attention-based model compression.

[119] Enhancing Predictability of Multi-Tenant DNN Inference for Autonomous Vehicles' Perception

Liangkai Liu,Kang G. Shin,Jinkyu Lee,Chengmo Yang,Weisong Shi

Main category: cs.CV

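TL;DR: PP-DNN is a predictable perception system for autonomous vehicles that dynamically selects critical frames and regions of interest (ROIs) to reduce the image data that must be processed while preserving the accuracy of multi-tenant DNN models. Its core components are an ROI generator, a FLOPs predictor, an ROI scheduler, and a detection predictor for non-critical frames; on the BDD100K and nuScenes datasets it significantly improves perception predictability, the number of fusion frames, detection completeness, and cost-effectiveness.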

Details

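Motivation: Real-time DNN perception in autonomous vehicles is limited by the gap between computation requirements and available resources. Existing work mostly focuses on model compression and overlooks optimization at the input-data level; in multi-tenant DNN settings, critical frames and ROIs also change with the environment and are hard to identify and schedule accurately. Method: The PP-DNN system consists of (1) an ROI generator that identifies critical frames and ROIs from the similarity of consecutive frames and the traffic scenario; (2) a FLOPs predictor that estimates the corresponding MAC operations; (3) an ROI scheduler that coordinates the processing of critical frames and ROIs across multiple DNN models; and (4) a detection predictor that handles non-critical frames. The whole system is integrated into a ROS-based autonomous driving pipeline. Result: On the BDD100K and nuScenes datasets, relative to the baseline, the number of fusion frames increases by up to 7.3x, fusion delay drops by more than 2.6x, fusion-delay variation drops by more than 2.3x, detection completeness improves by 75.4%, and cost-effectiveness improves by up to 98%. Conclusion: By dynamically sparsifying the input data, PP-DNN significantly improves the real-time performance and predictability of multi-tenant DNN perception without sacrificing accuracy, offering a new approach for resource-constrained AV platforms. Abstract: Autonomous vehicles (AVs) rely on sensors and deep neural networks (DNNs) to perceive their surrounding environment and make maneuver decisions in real time. However, achieving real-time DNN inference in the AV's perception pipeline is challenging due to the large gap between the computation requirement and the AV's limited resources. Most, if not all, existing studies focus on optimizing the DNN inference time to achieve faster perception by compressing the DNN model with pruning and quantization. In contrast, we present a Predictable Perception system with DNNs (PP-DNN) that reduces the amount of image data to be processed while maintaining the same level of accuracy for multi-tenant DNNs by dynamically selecting critical frames and regions of interest (ROIs). PP-DNN is based on our key insight that critical frames and ROIs for AVs vary with the AV's surrounding environment. However, it is challenging to identify and use critical frames and ROIs in multi-tenant DNNs for predictable inference. Given image-frame streams, PP-DNN leverages an ROI generator to identify critical frames and ROIs based on the similarities of consecutive frames and traffic scenarios. PP-DNN then leverages a FLOPs predictor to predict multiply-accumulate operations (MACs) from the dynamic critical frames and ROIs. The ROI scheduler coordinates the processing of critical frames and ROIs with multiple DNN models. Finally, we design a detection predictor for the perception of non-critical frames. 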
We have implemented PP-DNN in an ROS-based AV pipeline and evaluated it with the BDD100K and the nuScenes dataset. PP-DNN is observed to significantly enhance perception predictability, increasing the number of fusion frames by up to 7.3x, reducing the fusion delay by >2.6x and fusion-delay variations by >2.3x, improving detection completeness by 75.4% and the cost-effectiveness by up to 98% over the baseline.
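
The abstract says the ROI generator picks critical frames from the similarity of consecutive frames. A minimal sketch of that idea is shown below, assuming frames are flattened pixel lists and that dissimilarity is a mean absolute difference against the last critical frame; the metric, the threshold, and the function names are hypothetical choices for illustration, not PP-DNN's actual criterion.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute pixel difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def select_critical_frames(frames, threshold=0.1):
    """Indices of frames that differ enough from the last critical frame."""
    if not frames:
        return []
    critical = [0]          # the first frame is always processed
    reference = frames[0]
    for index in range(1, len(frames)):
        if mean_abs_diff(frames[index], reference) > threshold:
            critical.append(index)
            reference = frames[index]
    return critical


frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1],   # nearly identical to frame 0
    [1.0, 1.0, 0.0, 0.0],   # scene change
    [1.0, 1.0, 0.0, 0.05],  # nearly identical to frame 2
    [0.0, 0.0, 1.0, 1.0],   # scene change
]
critical = select_critical_frames(frames)
```

Only frames 0, 2, and 4 would be sent to the full DNN models in this toy stream; the remaining frames are the ones a lightweight predictor would have to cover.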

[120] Interpretable Vision Transformers in Monocular Depth Estimation via SVDA

Vasileios Arampatzakis,George Pavlidis,Nikolaos Mitianoudis,Nikos Papamarkos

Main category: cs.CV

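TL;DR: This paper proposes an attention mechanism inspired by singular value decomposition (SVDA) and embeds it in the Dense Prediction Transformer (DPT), giving the first spectrally structured formulation of attention for dense prediction tasks. It improves the interpretability of monocular depth estimation models and introduces six spectral indicators that reveal how attention organizes during training.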

Details

Motivation: Self-attention in existing Transformers lacks interpretability in dense prediction tasks such as monocular depth estimation, so an attention formulation is needed that is intrinsically interpretable rather than a post-hoc approximation. Method: A learnable diagonal matrix is embedded into normalized query-key interactions, decoupling directional alignment from spectral modulation; the resulting SVD-Inspired Attention (SVDA) is integrated into the DPT framework. Result: On KITTI and NYU-v2, accuracy is preserved or slightly improved with only minor computational overhead; six quantifiable spectral indicators (entropy, rank, sparsity, alignment, selectivity, and robustness) are extracted and reveal attention patterns that are consistent across datasets and network depth. Conclusion: SVDA turns attention from a black-box mechanism into a quantifiable descriptor, redefines interpretability in monocular depth estimation, and offers a new route to transparent dense prediction models. Abstract: Monocular depth estimation is a central problem in computer vision with applications in robotics, AR, and autonomous driving, yet the self-attention mechanisms that drive modern Transformer architectures remain opaque. We introduce SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), providing the first spectrally structured formulation of attention for dense prediction tasks. SVDA decouples directional alignment from spectral modulation by embedding a learnable diagonal matrix into normalized query-key interactions, enabling attention maps that are intrinsically interpretable rather than post-hoc approximations. Experiments on KITTI and NYU-v2 show that SVDA preserves or slightly improves predictive accuracy while adding only minor computational overhead. More importantly, SVDA unlocks six spectral indicators that quantify entropy, rank, sparsity, alignment, selectivity, and robustness. These reveal consistent cross-dataset and depth-wise patterns in how attention organizes during training, insights that remain inaccessible in standard Transformers. By shifting the role of attention from opaque mechanism to quantifiable descriptor, SVDA redefines interpretability in monocular depth estimation and opens a principled avenue toward transparent dense prediction models.
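
One way to read "a learnable diagonal matrix embedded into normalized query-key interactions" is a score of the form q̂ᵀ diag(σ) k̂ followed by a softmax. The sketch below implements that reading in plain Python. The abstract does not give the exact SVDA equations, so the normalization choice, the softmax, and every name here are assumptions for illustration only.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def svda_like_attention(queries, keys, sigma):
    """Row-wise softmax over q_hat^T diag(sigma) k_hat scores."""
    q_hat = [l2_normalize(q) for q in queries]
    k_hat = [l2_normalize(k) for k in keys]
    attention = []
    for q in q_hat:
        scores = [sum(qd * sd * kd for qd, sd, kd in zip(q, sigma, k)) for k in k_hat]
        peak = max(scores)                       # subtract max for stability
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        attention.append([w / total for w in weights])
    return attention


queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
sigma = [2.0, 0.0]   # hypothetical spectrum: only the first direction matters

attn = svda_like_attention(queries, keys, sigma)
```

With this spectrum the first query concentrates on the first key, while the second query, which lives entirely in the suppressed direction, falls back to uniform attention. Reading the diagonal directly in this way is the kind of inspection that spectral indicators could summarize.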

[121] LaSSM: Efficient Semantic-Spatial Query Decoding via Local Aggregation and State Space Models for 3D Instance Segmentation

Lei Yao,Yi Wang,Yawen Cui,Moyun Liu,Lap-Pui Chau

Main category: cs.CV

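TL;DR: LaSSM is an efficient and simple query-based method for 3D scene instance segmentation that uses a hierarchical semantic-spatial query initializer and a coordinate-guided state space model decoder to keep performance high while substantially reducing computation.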

Details

Motivation: Existing query-based methods for 3D scene instance segmentation on point clouds face difficult query initialization (because point clouds are sparse) and costly attention mechanisms in the decoder. Method: A hierarchical semantic-spatial query initializer derives queries from superpoints using both semantic cues and spatial distribution; a coordinate-guided state space model (SSM) decoder, with a local aggregation scheme and a spatial dual-path SSM block, uses coordinate information to model dependencies among queries. Result: LaSSM ranks first on the ScanNet++ V2 leaderboard, improving mAP by 2.5% with only 1/3 of the FLOPs of the previous best method; it also achieves competitive performance at lower computational cost on ScanNet, ScanNet200, S3DIS, and ScanNet++ V1; ablation studies and qualitative results validate the design. Conclusion: By simplifying query initialization and adopting an efficient SSM decoder, LaSSM strikes a better balance between accuracy and efficiency for large-scale 3D scene instance segmentation. Abstract: Query-based 3D scene instance segmentation from point clouds has attained notable performance. However, existing methods suffer from the query initialization dilemma due to the sparse nature of point clouds and rely on computationally intensive attention mechanisms in query decoders. We accordingly introduce LaSSM, prioritizing simplicity and efficiency while maintaining competitive performance. Specifically, we propose a hierarchical semantic-spatial query initializer to derive the query set from superpoints by considering both semantic cues and spatial distribution, achieving comprehensive scene coverage and accelerated convergence. We further present a coordinate-guided state space model (SSM) decoder that progressively refines queries. The novel decoder features a local aggregation scheme that restricts the model to focus on geometrically coherent regions and a spatial dual-path SSM block to capture underlying dependencies within the query set by integrating associated coordinates information. Our design enables efficient instance prediction, avoiding the incorporation of noisy information and reducing redundant computation. LaSSM ranks first place on the latest ScanNet++ V2 leaderboard, outperforming the previous best method by 2.5% mAP with only 1/3 FLOPs, demonstrating its superiority in challenging large-scale scene instance segmentation. LaSSM also achieves competitive performance on ScanNet, ScanNet200, S3DIS and ScanNet++ V1 benchmarks with less computational cost. Extensive ablation studies and qualitative results validate the effectiveness of our design. 
The code and weights are available at https://github.com/RayYoh/LaSSM.

[122] Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting

Rishikesh Bhyri,Brian R Quaranto,Philip J Seger,Kaity Tung,Brendan Fox,Gene Yang,Steven D. Schwaitzberg,Junsong Yuan,Nan Xi,Peter C W Kim

Main category: cs.CV

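TL;DR: This paper proposes the Chain-of-Look framework, which mimics the sequential human counting process and adds a neighboring loss function to significantly improve counting accuracy on high-density surgical instrument images, and it introduces the new SurgCount-HD dataset.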

Details

Motivation: Accurately counting instruments in the operating room is essential for patient safety, but existing approaches such as object detection and multimodal large models perform poorly when instruments are densely stacked. Method: The Chain-of-Look visual reasoning framework builds an ordered spatial visual chain that mimics how humans count; a neighboring loss function models the spatial constraints among instruments; and the SurgCount-HD dataset provides 1,464 high-density images. Result: On dense surgical instrument counting, the method clearly outperforms counting models such as CountGD and REC as well as multimodal large language models such as Qwen and ChatGPT. Conclusion: Structured visual chains combined with modeling of physical constraints are an effective way to improve high-density, fine-grained counting, offering a new approach to precise perception in medical AI. Abstract: Accurate counting of surgical instruments in Operating Rooms (OR) is a critical prerequisite for ensuring patient safety during surgery. Despite recent progress of large visual-language models and agentic AI, accurately counting such instruments remains highly challenging, particularly in dense scenarios where instruments are tightly clustered. To address this problem, we introduce Chain-of-Look, a novel visual reasoning framework that mimics the sequential human counting process by enforcing a structured visual chain, rather than relying on classic object detection which is unordered. This visual chain guides the model to count along a coherent spatial trajectory, improving accuracy in complex scenes. To further enforce the physical plausibility of the visual chain, we introduce the neighboring loss function, which explicitly models the spatial constraints inherent to densely packed surgical instruments. We also present SurgCount-HD, a new dataset comprising 1,464 high-density surgical instrument images. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches for counting (e.g., CountGD, REC) as well as Multimodality Large Language Models (e.g., Qwen, ChatGPT) in the challenging task of dense surgical instrument counting.

[123] PuriLight: A Lightweight Shuffle and Purification Framework for Monocular Depth Estimation

Yujie Chen,Li Zhang,Xiaomeng Chu,Tian Zhang

Main category: cs.CV

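TL;DR: PuriLight is a lightweight, efficient framework for self-supervised monocular depth estimation that uses three new modules (SDC, RAKA, DFSP) to improve structural precision and detail preservation while keeping the model small, reaching state-of-the-art performance.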

Details

Motivation: Existing self-supervised monocular depth estimation methods struggle to balance computational efficiency and structural precision: large models are impractical and lightweight models sacrifice precision, so a solution that is both lightweight and structurally precise is needed. Method: A three-stage architecture is proposed, with Shuffle-Dilation Convolution (SDC) for local feature extraction, Rotation-Adaptive Kernel Attention (RAKA) for hierarchical feature enhancement, and Deep Frequency Signal Purification (DFSP) for global feature purification. Result: The framework achieves state-of-the-art performance with very few parameters, combining high accuracy with high computational efficiency. Conclusion: PuriLight balances light weight and structural precision, providing an efficient and practical approach to self-supervised monocular depth estimation. Abstract: We propose PuriLight, a lightweight and efficient framework for self-supervised monocular depth estimation, to address the dual challenges of computational efficiency and detail preservation. While recent advances in self-supervised depth estimation have reduced reliance on ground truth supervision, existing approaches remain constrained by either bulky architectures compromising practicality or lightweight models sacrificing structural precision. These dual limitations underscore the critical need to develop lightweight yet structurally precise architectures. Our framework addresses these limitations through a three-stage architecture incorporating three novel modules: the Shuffle-Dilation Convolution (SDC) module for local feature extraction, the Rotation-Adaptive Kernel Attention (RAKA) module for hierarchical feature enhancement, and the Deep Frequency Signal Purification (DFSP) module for global feature purification. Through effective collaboration, these modules enable PuriLight to achieve both lightweight and accurate feature extraction and processing. Extensive experiments demonstrate that PuriLight achieves state-of-the-art performance with minimal training parameters while maintaining exceptional computational efficiency. Codes will be available at https://github.com/ishrouder/PuriLight.

[124] First International StepUP Competition for Biometric Footstep Recognition: Methods, Results and Remaining Challenges

Robyn Larracy,Eve MacDonald,Angkoon Phinyomark,Saeid Rezaei,Mahdi Laghaei,Ali Hajighasem,Aaron Tabor,Erik Scheme

Main category: cs.CV

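TL;DR: This paper presents the First International StepUP Competition for footstep recognition, which uses the large UNB StepUP-P150 pressure dataset to advance deep-learning-based footstep pressure biometrics; team Saeid_UCC won with an EER of 10.77%, but generalization across footwear remains a key challenge.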

Details

Motivation: The lack of large, diverse footstep pressure datasets has limited model generalization and robustness to new users, different footwear, and changes in walking speed. Method: An international competition was organized around the UNB StepUP-P150 dataset; participating teams developed robust recognition models that were evaluated on a dedicated test set designed to assess verification performance under challenging variations; the winning solution used a generative reward machine (GRM) optimization strategy. Result: 23 teams from around the world registered; team Saeid_UCC achieved the lowest equal error rate (EER), 10.77%; overall results were strong, but generalization to unfamiliar footwear is still insufficient. Conclusion: The StepUP-P150 dataset and the first competition have clearly advanced footstep pressure biometrics, but cross-footwear generalization remains the core bottleneck and should be a focus of future work. Abstract: Biometric footstep recognition, based on a person's unique pressure patterns under their feet during walking, is an emerging field with growing applications in security and safety. However, progress in this area has been limited by the lack of large, diverse datasets necessary to address critical challenges such as generalization to new users and robustness to shifts in factors like footwear or walking speed. The recent release of the UNB StepUP-P150 dataset, the largest and most comprehensive collection of high-resolution footstep pressure recordings to date, opens new opportunities for addressing these challenges through deep learning. To mark this milestone, the First International StepUP Competition for Biometric Footstep Recognition was launched. Competitors were tasked with developing robust recognition models using the StepUP-P150 dataset that were then evaluated on a separate, dedicated test set designed to assess verification performance under challenging variations, given limited and relatively homogeneous reference data. The competition attracted global participation, with 23 registered teams from academia and industry. The top-performing team, Saeid_UCC, achieved the best equal error rate (EER) of 10.77% using a generative reward machine (GRM) optimization strategy. Overall, the competition showcased strong solutions, but persistent challenges in generalizing to unfamiliar footwear highlight a critical area for future work.
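
The competition ranks teams by equal error rate (EER), the operating point where the false accept rate equals the false reject rate. The sketch below approximates EER by sweeping thresholds over the observed scores. It assumes similarity scores where higher means "more likely the same person"; the toy scores and the function name are illustrative and unrelated to the competition's evaluation code.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER by sweeping a threshold over all observed scores."""
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # False accept: an impostor score at or above the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        # False reject: a genuine score below the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


genuine = [0.9, 0.8, 0.7, 0.4]    # same-user comparisons
impostor = [0.6, 0.3, 0.2, 0.1]   # different-user comparisons

eer = equal_error_rate(genuine, impostor)
```

In this toy case a threshold of 0.6 wrongly accepts one of four impostors and wrongly rejects one of four genuine attempts, so the EER is 25%. An EER of 10.77%, as reported for the winning team, corresponds to both error rates sitting near 10.77% at the balanced threshold.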

[125] FastFlow: Accelerating The Generative Flow Matching Models with Bandit Inference

Divya Jyoti Bajpai,Dhruv Bhardwaj,Soumya Roy,Tejas Duseja,Harsh Agarwal,Aashay Sandansing,Manjesh Kumar Hanawal

Main category: cs.CV

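TL;DR: FastFlow is a plug-and-play adaptive inference framework that accelerates generation by skipping redundant denoising steps in flow-matching models; it needs no retraining and delivers more than a 2.6x speedup while maintaining high quality.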

Details

Motivation: Flow-matching models perform very well, but their sequential denoising process is slow; existing acceleration methods are static, require retraining, and generalize poorly. Method: FastFlow estimates velocity with finite differences and extrapolates the state, skipping steps that have little effect on the denoising path; the decision of how many steps to skip is modeled as a multi-armed bandit problem so that the skipping policy is learned adaptively. Result: On image generation, video generation, and editing tasks, FastFlow achieves more than a 2.6x speedup with no significant drop in output quality, and it works plug-and-play and generalizes across tasks. Conclusion: FastFlow offers an efficient, general, training-free way to accelerate flow-matching models, markedly improving inference efficiency without sacrificing generation quality. Abstract: Flow-matching models deliver state-of-the-art fidelity in image and video generation, but the inherent sequential denoising process renders them slower. Existing acceleration methods like distillation, trajectory truncation, and consistency approaches are static, require retraining, and often fail to generalize across tasks. We propose FastFlow, a plug-and-play adaptive inference framework that accelerates generation in flow matching models. FastFlow identifies denoising steps that produce only minor adjustments to the denoising path and approximates them without using the full neural network models used for velocity predictions. The approximation utilizes finite-difference velocity estimates from prior predictions to efficiently extrapolate future states, enabling faster advancements along the denoising path at zero compute cost. This enables skipping computation at intermediary steps. We model the decision of how many steps to safely skip before requiring a full model computation as a multi-armed bandit problem. The bandit learns the optimal skips to balance speed with performance. FastFlow integrates seamlessly with existing pipelines and generalizes across image generation, video generation, and editing tasks. Experiments demonstrate a speedup of over 2.6x while maintaining high-quality outputs. The source code for this work can be found at https://github.com/Div290/FastFlow.
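
The step-skipping idea in the FastFlow abstract can be sketched with a scalar Euler integrator that, on selected steps, linearly extrapolates the velocity from the two most recent values instead of calling the model. This is a simplified illustration under stated assumptions: the skip schedule is fixed rather than chosen by a bandit, the state is a single float, and `velocity_fn`, `skip_every`, and `integrate` are hypothetical names, not FastFlow's API.

```python
def integrate(velocity_fn, x0, num_steps, skip_every=0):
    """Euler integration of dx/dt = v(x, t) on [0, 1] with optional skipped steps."""
    dt = 1.0 / num_steps
    x, t = x0, 0.0
    history = []        # the two most recent velocities (predicted or extrapolated)
    model_calls = 0
    for step in range(num_steps):
        can_skip = (
            skip_every > 0
            and len(history) >= 2
            and step % skip_every == skip_every - 1
        )
        if can_skip:
            # Finite-difference extrapolation: v_next ~ v_last + (v_last - v_prev).
            v = history[-1] + (history[-1] - history[-2])
        else:
            v = velocity_fn(x, t)   # stands in for a full network forward pass
            model_calls += 1
        history = (history + [v])[-2:]
        x += dt * v
        t += dt
    return x, model_calls


def toy_velocity(x, t):
    """A velocity field that is linear in time, so extrapolation is exact."""
    return 2.0 * t


x_full, calls_full = integrate(toy_velocity, 0.0, num_steps=10)
x_fast, calls_fast = integrate(toy_velocity, 0.0, num_steps=10, skip_every=3)
```

For this toy field the skipped run reaches the same endpoint with 7 model calls instead of 10. On a real model the extrapolation is only approximate, which is why the abstract frames the choice of how many steps to skip as a bandit problem that trades speed against quality.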

[126] HairWeaver: Few-Shot Photorealistic Hair Motion Synthesis with Sim-to-Real Guided Video Diffusion

Di Chang,Ji Hou,Aljaz Bozic,Assaf Neuberger,Felix Juefei-Xu,Olivier Maury,Gene Wei-Chin Lin,Tuur Stuyck,Doug Roble,Mohammad Soleymani,Stephane Grabli

Main category: cs.CV

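TL;DR: HairWeaver is a diffusion-based method for animating a single human image that uses two lightweight LoRA modules to give fine control over hair motion while preserving photorealism.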

Details

Motivation: Existing methods can control body pose but do not model hair motion specifically, so the generated hair animation is stiff and unrealistic. Method: The HairWeaver framework adds a Motion-Context-LoRA (to integrate motion conditions) and a Sim2Real-Domain-LoRA (to preserve appearance across domains) as lightweight fine-tuning on a video diffusion backbone; the training data is a dataset of dynamic human motion generated with a CG simulator. Result: The method reaches the state of the art in multiple evaluations, generating photorealistic hair animation with natural responsiveness and dynamic detail. Conclusion: HairWeaver effectively addresses hair dynamics modeling from a single image and offers a new approach to controllable, realistic human animation. Abstract: We present HairWeaver, a diffusion-based pipeline that animates a single human image with realistic and expressive hair dynamics. While existing methods successfully control body pose, they lack specific control over hair, and as a result, fail to capture the intricate hair motions, resulting in stiff and unrealistic animations. HairWeaver overcomes this limitation using two specialized modules: a Motion-Context-LoRA to integrate motion conditions and a Sim2Real-Domain-LoRA to preserve the subject's photoreal appearance across different data domains. These lightweight components are designed to guide a video diffusion backbone while maintaining its core generative capabilities. By training on a specialized dataset of dynamic human motion generated from a CG simulator, HairWeaver affords fine control over hair motion and ultimately learns to produce highly realistic hair that responds naturally to movement. Comprehensive evaluations demonstrate that our approach sets a new state of the art, producing lifelike human hair animations with dynamic details.

[127] PhyCritic: Multimodal Critic Models for Physical AI

Tianyi Xiong,Shihao Wang,Guilin Liu,Yi Dong,Ming Li,Heng Huang,Jan Kautz,Zhiding Yu

Main category: cs.CV

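TL;DR: This paper proposes PhyCritic, a multimodal critic model optimized for physical AI tasks. A two-stage RLVR pipeline (physical skill warmup followed by self-referential critic finetuning) improves its judgment ability and stability on physical perception, causal reasoning, and planning tasks, and it outperforms open-source baselines on multiple benchmarks.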

Details

Motivation: Existing critic models mainly target general visual tasks such as image captioning or visual question answering and offer little support for physical AI tasks involving perception, causal reasoning, and planning, so a reliable, purpose-built critic is needed. Method: PhyCritic is trained with a two-stage RLVR pipeline: the first stage is a physical skill warmup that strengthens physically oriented perception and reasoning; the second stage is self-referential critic finetuning, in which the model first generates its own prediction as an internal reference and then judges candidate responses, improving judgment stability and physical correctness. Result: PhyCritic clearly outperforms open-source baselines on both physical and general-purpose multimodal judge benchmarks; used as a policy model, it further improves perception and reasoning in physically grounded tasks. Conclusion: PhyCritic fills the gap for a dedicated critic model in physical AI, and its self-referential mechanism and physical warmup design offer a new approach to building more reliable and physically consistent multimodal critics. Abstract: With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.

[128] Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

Gongye Liu,Bo Yang,Yida Zhi,Zhizhou Zhong,Lei Ke,Didan Deng,Han Gao,Yongxiang Huang,Kaihao Zhang,Hongbo Fu,Wenhan Luo

Main category: cs.CV

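TL;DR: This paper proposes DiNa-LRM, a latent-space reward model native to diffusion models. It performs preference learning directly on noisy diffusion states and models uncertainty with a noise-calibrated Thurstone likelihood, maintaining strong performance while substantially reducing computational cost.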

Details

Motivation: Existing reward functions based on vision-language models (VLMs) are computationally expensive, and applying a pixel-space reward to a latent-space generator introduces a domain mismatch. Method: Proposes DiNa-LRM, built on a pretrained latent diffusion backbone, which introduces a timestep-conditioned reward head and a noise-calibrated Thurstone likelihood and supports inference-time noise ensembling. Result: On image alignment benchmarks it substantially outperforms existing diffusion-based reward baselines and is competitive with state-of-the-art VLMs at a much lower computational cost; in preference optimization it improves training dynamics, enabling faster and more resource-efficient model alignment. Conclusion: DiNa-LRM provides an efficient, robust, diffusion-native latent reward mechanism for diffusion and flow-matching models, addressing both the domain mismatch and the high cost. Abstract: Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
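
Editor's illustration (not from the paper): the abstract names a noise-calibrated Thurstone likelihood but does not give its formula. The sketch below shows only the classic Thurstone comparison, in which each reward is perturbed by independent Gaussian noise and the preference probability is a normal CDF of the scaled reward gap. The function names and the linear schedule in `sigma_for_noise` are hypothetical stand-ins for whatever noise-dependent uncertainty DiNa-LRM actually uses.

```python
import math


def thurstone_pref_prob(r_a, r_b, sigma_a, sigma_b):
    """P(a preferred over b) when each reward carries independent Gaussian noise."""
    z = (r_a - r_b) / math.sqrt(sigma_a ** 2 + sigma_b ** 2)
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sigma_for_noise(noise_level, base=0.5, scale=1.0):
    """Hypothetical schedule: uncertainty grows linearly with the diffusion noise level."""
    return base + scale * noise_level


def thurstone_nll(r_win, r_lose, noise_win, noise_lose):
    """Negative log-likelihood of one observed preference (winner over loser)."""
    p = thurstone_pref_prob(
        r_win, r_lose, sigma_for_noise(noise_win), sigma_for_noise(noise_lose)
    )
    return -math.log(max(p, 1e-12))  # clamp to avoid log(0)
```

Under this assumed schedule, the same reward gap yields a preference probability closer to 0.5 at higher noise levels, so heavily noised states contribute a softer training signal than nearly clean ones.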

[129] SurfPhase: 3D Interfacial Dynamics in Two-Phase Flows from Sparse Videos

Yue Gao,Hong-Xing Yu,Sanghyeon Chang,Qianxi Fu,Bo Zhu,Yoonjin Won,Juan Carlos Niebles,Jiajun Wu

Main category: cs.CV

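TL;DR: SurfPhase is a new model for reconstructing the 3D dynamics of sharp, deformable liquid-vapor interfaces in two-phase flows from sparse camera views. It combines dynamic Gaussian surfels, a signed distance function, and a video diffusion model, and achieves high-quality novel-view synthesis and velocity estimation from only two camera views.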

Details

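Motivation: Interfacial dynamics in two-phase flows are critical to momentum, heat, and mass transfer but are difficult to measure experimentally; classical techniques have intrinsic limitations near moving interfaces, and existing neural rendering methods cannot handle sharp, deformable liquid-vapor interfaces. Method: Proposes the SurfPhase model, which combines dynamic Gaussian surfels with a signed distance function (SDF) formulation for geometric consistency and uses a video diffusion model to synthesize novel-view videos that refine the reconstruction from sparse observations. Result: Evaluated on a newly collected dataset of high-speed pool boiling videos, it achieves high-quality novel-view synthesis and velocity estimation from only two camera views. Conclusion: SurfPhase overcomes the shortcomings of existing methods in reconstructing sharp dynamic interfaces and offers a new paradigm for non-intrusive 3D measurement of interfacial dynamics in two-phase flows. Abstract: Interfacial dynamics in two-phase flows govern momentum, heat, and mass transfer, yet remain difficult to measure experimentally. Classical techniques face intrinsic limitations near moving interfaces, while existing neural rendering methods target single-phase flows with diffuse boundaries and cannot handle sharp, deformable liquid-vapor interfaces. We propose SurfPhase, a novel model for reconstructing 3D interfacial dynamics from sparse camera views. Our approach integrates dynamic Gaussian surfels with a signed distance function formulation for geometric consistency, and leverages a video diffusion model to synthesize novel-view videos to refine reconstruction from sparse observations. We evaluate on a new dataset of high-speed pool boiling videos, demonstrating high-quality view synthesis and velocity estimation from only two camera views. Project website: https://yuegao.me/SurfPhase.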

Table of Contents

cs.CL [Back]

[1] Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models

[2] When Less Is More? Diagnosing ASR Predictions in Sardinian via Layer-Wise Decoding

[3] Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs

[4] Physically Interpretable AlphaEarth Foundation Model Embeddings Enable LLM-Based Land Surface Intelligence

[5] Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation

[6] The Alignment Bottleneck in Decomposition-Based Claim Verification

[7] Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models

[8] When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents

[9] Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs

[10] When are We Worried? Temporal Trends of Anxiety and What They Reveal about Us

[11] EVOKE: Emotion Vocabulary Of Korean and English

[12] LATA: A Tool for LLM-Assisted Translation Annotation

[13] Neuro-Symbolic Synergy for Interactive World Modeling

[14] Canvas-of-Thought: Grounding Reasoning via Mutable Structured States

[15] On the Robustness of Knowledge Editing for Detoxification

[16] LHAW: Controllable Underspecification for Long-Horizon Tasks

[17] When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning

[18] Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

[19] Online Causal Kalman Filtering for Stable and Effective Policy Optimization

[20] How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning

[21] UMEM: Unified Memory Extraction and Management Framework for Generalizable Memory

[22] Benchmarks Are Not That Out of Distribution: Word Overlap Predicts Performance

[23] Targeted Syntactic Evaluation of Language Models on Georgian Case Alignment

[24] Locomo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents

[25] Macaron: Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling

[26] Reinforced Curriculum Pre-Alignment for Domain-Adaptive VLMs

[27] Deep Learning-based Method for Expressing Knowledge Boundary of Black-Box LLM

[28] Beyond Confidence: The Rhythms of Reasoning in Generative Models

[29] I can tell whether you are a Native Hawlêri Speaker! How ANN, CNN, and RNN perform in NLI-Native Language Identification

[30] C-MOP: Integrating Momentum and Boundary-Aware Clustering for Enhanced Prompt Evolution

[31] Diagnosing Structural Failures in LLM-Based Evidence Extraction for Meta-Analysis

[32] The CLEF-2026 FinMMEval Lab: Multilingual and Multimodal Evaluation of Financial AI Systems

[33] SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora

[34] Computational Phenomenology of Temporal Experience in Autism: Quantifying the Emotional and Narrative Characteristics of Lived Unpredictability

[35] Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models

[36] LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules

[37] Linguistic Indicators of Early Cognitive Decline in the DementiaBank Pitt Corpus: A Statistical and Machine Learning Study

[38] Language Model Inversion through End-to-End Differentiation

[39] Embedding Inversion via Conditional Masked Diffusion Language Models

[40] Conversational Behavior Modeling Foundation Model With Multi-Level Perception

[41] Simultaneous Speech-to-Speech Translation Without Aligned Data

[42] SteuerLLM: Local specialized large language model for German tax law analysis

[43] DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

[44] Can Large Language Models Make Everyone Happy?

[45] Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away

[46] TEGRA: Text Encoding With Graph and Retrieval Augmentation for Misinformation Detection

[47] Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning

cs.CV [Back]

[48] MPA: Multimodal Prototype Augmentation for Few-Shot Learning

[49] VERA: Identifying and Leveraging Visual Evidence Retrieval Heads in Long-Context Understanding

[50] Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization

[51] AD$^2$: Analysis and Detection of Adversarial Threats in Visual Perception for End-to-End Autonomous Driving Systems

[52] ArtisanGS: Interactive Tools for Gaussian Splat Selection with AI and Human in the Loop

[53] When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

[54] DEGMC: Denoising Diffusion Models Based on Riemannian Equivariant Group Morphological Convolutions

[55] XSPLAIN: XAI-enabling Splat-based Prototype Learning for Attribute-aware INterpretability

[56] PMMA: The Polytechnique Montreal Mobility Aids Dataset

[57] Colorimeter-Supervised Skin Tone Estimation from Dermatoscopic Images for Fairness Auditing

[58] ERGO: Excess-Risk-Guided Optimization for High-Fidelity Monocular 3D Gaussian Splatting

[59] A Low-Rank Defense Method for Adversarial Attack on Diffusion Models

[60] Flow Matching with Uncertainty Quantification and Guidance

[61] Conditional Uncertainty-Aware Political Deepfake Detection with Stochastic Convolutional Neural Networks

[62] Monte Carlo Maximum Likelihood Reconstruction for Digital Holography with Speckle

[63] Comp2Comp: Open-Source Software with FDA-Cleared Artificial Intelligence Algorithms for Computed Tomography Image Analysis

[64] HII-DPO: Eliminate Hallucination via Accurate Hallucination-Inducing Counterfactual Images

[65] Towards Remote Sensing Change Detection with Neural Memory

[66] End-to-End LiDAR optimization for 3D point cloud registration

[67] Characterizing and Optimizing the Spatial Kernel of Multi Resolution Hash Encodings

[68] The Garbage Dataset (GD): A Multi-Class Image Benchmark for Automated Waste Segregation

[69] Med-SegLens: Latent-Level Model Diffing for Interpretable Medical Image Segmentation

[70] 1%>100%: High-Efficiency Visual Adapter with Complex Linear Projection Optimization

[71] 3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars

[72] MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps

[73] RealHD: A High-Quality Dataset for Robust Detection of State-of-the-Art AI-Generated Images

[74] Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance

[75] C^2ROPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning

[76] MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning

[77] Enhancing Underwater Images via Adaptive Semantic-aware Codebook Learning