cs.CL

[1] Multimodal Consistency-Guided Reference-Free Data Selection for ASR Accent Adaptation

Ligong Lei, Wenwen Lu, Xudong Pang, Zaokere Kadeer, Aishan Wumaier

Main category: cs.CL

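TL;DR: This paper proposes a multimodal consistency-guided, reference-free data selection method for accent adaptation of automatic speech recognition (ASR) systems, which markedly improves the reliability of pseudo-label selection and downstream performance.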

Details

Motivation: Existing ASR systems degrade on accented speech, and conventional text-based pseudo-label filters (such as perplexity filtering) tend to pick hypotheses that are fluent but acoustically mismatched, which amplifies errors during fine-tuning.

Method: An unsupervised, transductive, reference-free data selection pipeline driven by multimodal consistency. It first applies target-aware preselection based on submodular mutual information. It then generates multiple pseudo-transcriptions through perturbation-based decoding and scores them with two reference-free signals: speech-text alignment in a shared embedding space and predicted word error rate (WER). Finally, a percentile-threshold rule keeps the high-confidence pseudo-labels.

Result: In the in-domain experiments, selecting only about 1.5k utterances from a 30k pool reaches 10.91% WER, close to the 10.45% obtained with all 30k supervised labels. In the cross-domain setting the method avoids the degradation caused by unfiltered pseudo-labels, and on a stronger ASR backbone it outperforms random sampling and recent baselines.

Conclusion: Multimodal consistency scoring improves the quality and robustness of pseudo-labels in unsupervised accent adaptation and offers an efficient, scalable approach to low-resource accent adaptation.

Abstract: Automatic speech recognition (ASR) systems often degrade on accented speech because acoustic-phonetic and prosodic shifts induce a mismatch to training data, making labeled accent adaptation costly. However, common pseudo-label selection heuristics are largely text-centric (e.g., perplexity (PPL) filtering) and can prefer fluent yet acoustically mismatched hypotheses, leading to error amplification when fine-tuning. To address this, we introduce a multimodal consistency-guided, reference-free data selection pipeline for ASR accent adaptation under a transductive, label-free protocol. The pipeline starts with a target-aware preselection step based on submodular mutual information to improve query relevance and reduce downstream computation. It then generates multiple pseudo-transcriptions per utterance via perturbation-based decoding and scores each hypothesis using two reference-free signals: speech-text alignment in a shared embedding space and predicted word error rate (WER). A simple percentile-based selection rule retains reliable pseudo-labels for fine-tuning while discarding noisy utterances. In an in-domain setting, selecting ~1.5k utterances from a 30k pool achieves 10.91% WER, close to 10.45% obtained using 30k supervised labels. In a cross-domain setting with a mismatched candidate pool, consistency-filtered subsets avoid the degradation caused by unfiltered pseudo-labels under strong accent shift, and matched-hour experiments on a stronger ASR backbone further confirm gains over random sampling and recent selection baselines.
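The percentile-based selection step described in the abstract can be pictured with a small sketch. The abstract does not say how the two signals are combined, so everything here is an assumption made for illustration: the `Hypothesis` record, the rule "keep the best hypothesis per utterance, then require alignment at or above one percentile and predicted WER at or below another", and the default percentiles are hypothetical, not the authors' settings.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    utt_id: str
    text: str
    alignment: float      # speech-text alignment score, higher is better (assumed scale)
    predicted_wer: float  # reference-free WER estimate, lower is better


def percentile(values, q):
    """Linear-interpolation percentile of a non-empty list, q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of empty list")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def select_pseudo_labels(hyps, align_q=70.0, wer_q=30.0):
    """Hypothetical rule: keep one hypothesis per utterance, then filter by percentiles."""
    best = {}
    for h in hyps:
        cur = best.get(h.utt_id)
        # Prefer higher alignment; break ties with lower predicted WER.
        if cur is None or (h.alignment, -h.predicted_wer) > (cur.alignment, -cur.predicted_wer):
            best[h.utt_id] = h
    candidates = list(best.values())
    align_thr = percentile([h.alignment for h in candidates], align_q)
    wer_thr = percentile([h.predicted_wer for h in candidates], wer_q)
    return [h for h in candidates
            if h.alignment >= align_thr and h.predicted_wer <= wer_thr]
```

Requiring both signals to pass is one plausible way to discard hypotheses that look good on text alone; a weighted combination would be an equally reasonable alternative.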

[2] LLM-Powered Automatic Translation and Urgency in Crisis Scenarios

Belu Ticona, Antonis Anastasopoulos

Main category: cs.CL

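TL;DR: This paper evaluates how well large language models (LLMs) and machine translation systems preserve urgency, a key semantic property, in multilingual translation for crisis scenarios. It finds widespread performance degradation, high instability, and distorted urgency in current models, and warns against deploying them directly for high-stakes crisis communication.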

Details

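Motivation: Large language models have been proposed for multilingual communication in crisis response, but their suitability for high-stakes crisis settings has not been adequately evaluated. In particular, no systematic study has examined whether they preserve the urgency of a message.

Method: The authors build an urgency-annotated dataset covering more than 32 languages, compare state-of-the-art LLMs with dedicated machine translation models on how well crisis-domain translation preserves urgency, and analyze how the language of the prompt and of the input affects LLM urgency judgments.

Result: Both LLMs and dedicated translation models show substantial performance degradation and instability on crisis translation. Even linguistically adequate translations often distort perceived urgency, and LLM urgency classifications depend heavily on the language of the prompt and input.

Conclusion: General-purpose language technologies carry real risks in crisis communication, and crisis-specific evaluation frameworks are needed.

Abstract: Large language models (LLMs) are increasingly proposed for crisis preparedness and response, particularly for multilingual communication. However, their suitability for high-stakes crisis contexts remains insufficiently evaluated. This work examines the performance of state-of-the-art LLMs and machine translation systems in crisis-domain translation, with a focus on preserving urgency, which is a critical property for effective crisis communication and triaging. Using multilingual crisis data and a newly introduced urgency-annotated dataset covering over 32 languages, we show that both dedicated translation models and LLMs exhibit substantial performance degradation and instability. Crucially, even linguistically adequate translations can distort perceived urgency, and LLM-based urgency classifications vary widely depending on the language of the prompt and input. These findings highlight significant risks in deploying general-purpose language technologies for crisis communication and underscore the need for crisis-aware evaluation frameworks.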

[3] Using Machine Learning to Enhance the Detection of Obfuscated Abusive Words in Swahili: A Focus on Child Safety

Phyllis Nabangi, Abdul-Jalil Zakaria, Jema David Ndibwile

Main category: cs.CL

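TL;DR: This paper studies the detection of cyberbullying and obfuscated abusive language in Swahili, a low-resource language, using machine learning models such as SVM, logistic regression, and decision trees, with SMOTE and related techniques to handle class imbalance. The small, imbalanced dataset limits how well the models generalize.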

Details

Motivation: The rise of digital technology has worsened cyberbullying and online abuse, which threaten children in particular. Swahili is one of the most widely used languages in Africa, yet it lacks effective detection tools because resources are scarce.

Method: Support Vector Machines (SVM), logistic regression, and decision trees, combined with parameter tuning and SMOTE for class imbalance, evaluated systematically on precision, recall, and F1.

Result: The models perform well on high-dimensional text data, but the small and imbalanced dataset limits the generalizability of the results. Each model shows a different performance profile on obfuscated-language detection.

Conclusion: Larger, high-quality annotated datasets and techniques such as transfer learning and multimodal fusion are needed to build more robust, culturally appropriate systems for detecting cyberbullying against children.

Abstract: The rise of digital technology has dramatically increased the potential for cyberbullying and online abuse, necessitating enhanced measures for detection and prevention, especially among children. This study focuses on detecting abusive obfuscated language in Swahili, a low-resource language that poses unique challenges due to its limited linguistic resources and technological support. Swahili is chosen because of its popularity: it is the most widely spoken language in Africa, with over 16 million native speakers and upwards of 100 million speakers in total, spanning regions in East Africa and some parts of the Middle East. We employed machine learning models including Support Vector Machines (SVM), Logistic Regression, and Decision Trees, optimized through rigorous parameter tuning and techniques like Synthetic Minority Over-sampling Technique (SMOTE) to handle data imbalance. Our analysis revealed that, while these models perform well on high-dimensional textual data, our dataset's small size and imbalance limit our findings' generalizability. Precision, recall, and F1 scores were thoroughly analyzed, highlighting the nuanced performance of each model in detecting obfuscated language. This research contributes to the broader discourse on ensuring safer online environments for children, advocating for expanded datasets and advanced machine-learning techniques to improve the effectiveness of cyberbullying detection systems. Future work will focus on enhancing data robustness, exploring transfer learning, and integrating multimodal data to create more comprehensive and culturally sensitive detection mechanisms.
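For readers unfamiliar with SMOTE, the core idea is to synthesize new minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. The following is a minimal standard-library sketch of that idea, not the study's pipeline; the function name `smote`, its parameters, and the toy vectors are illustrative assumptions, and in practice the inputs would be feature vectors (for example TF-IDF) from the training split only.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating minority samples with near neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (squared Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

Oversampling before the train/test split would leak synthetic copies of test-like points into training, which is why the sketch is meant to be applied to training data only.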

[4] Language Model Memory and Memory Models for Language

Benjamin L. Badger

Main category: cs.CL

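TL;DR: This paper examines how well machine learning models, in particular language models and autoencoders, store input information in hidden-layer embeddings (their "memory"). Language model embeddings turn out to carry little information, whereas autoencoders achieve nearly perfect memory. The paper therefore proposes a parallelizable encoder-decoder memory model architecture and improves memory by combining causal prediction with an information retention objective, and by freezing a high-fidelity encoder and applying curriculum training.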

Details

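Motivation: Language models and other common models rely heavily on embeddings as "memory", but their actual capacity to store information has not been characterized systematically. The next-token prediction objective alone is non-invertible, so it is poorly suited to forming accurate memories, especially when the input is not fully exposed.

Method: Compare the information content of language model and autoencoder embeddings; propose and implement a parallelizable encoder-decoder memory model; train with a joint objective (causal prediction plus information retention); and introduce a training strategy that freezes a high-fidelity encoder and applies a two-stage curriculum (first learn to process memories, then learn to predict next tokens).

Result: Autoencoder embeddings reconstruct the input almost perfectly, while language model embeddings are information-poor. Under the joint objective, the proposed memory model forms and decodes information-rich embeddings. The curriculum strategy markedly improves training efficiency and memory performance.

Conclusion: Next-token prediction alone is not suited to high-fidelity memory formation. Efficient, scalable memory-augmented models require an information retention objective with better invertibility, together with co-designed architecture and training strategy.

Abstract: The ability of machine learning models to store input information in hidden layer vector embeddings, analogous to the concept of 'memory', is widely employed but not well characterized. We find that language model embeddings typically contain relatively little input information regardless of data and compute scale during training. In contrast, embeddings from autoencoders trained for input regeneration are capable of nearly perfect memory formation. The substitution of memory embeddings for token sequences leads to substantial computational efficiencies, motivating the introduction of a parallelizable encoder-decoder memory model architecture. Upon causal training, these models contain information-poor embeddings incapable of arbitrary information access, but by combining causal and information retention objective functions they learn to form and decode information-rich memories. Training can be further streamlined by freezing a high fidelity encoder followed by a curriculum training approach where decoders first learn to process memories and then learn to additionally predict next tokens. We introduce the perspective that next token prediction training alone is poorly suited for accurate memory formation as the objective itself is non-invertible, motivating the use of combined objective functions for models where the entire input is not exposed.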

[5] From Perceptions To Evidence: Detecting AI-Generated Content In Turkish News Media With A Fine-Tuned BERT Classifier

Ozancan Ozdemir

Main category: cs.CL

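TL;DR: This study is the first to take a data-driven approach to detecting AI-generated content in mainstream Turkish news media, using a fine-tuned Turkish BERT model. It finds that roughly 2.5% of news content has been rewritten or revised by large language models.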

Details

Motivation: Existing research on Turkish news media is limited to interviews with journalists or fake news detection and lacks empirical, quantitative analysis of AI-generated content. This paper aims to fill that gap.

Method: Fine-tune a Turkish BERT model (dbmdz/bert-base-turkish-cased) on 3,600 labeled articles from three Turkish outlets with different editorial orientations for binary classification of AI-rewritten content, then deploy it on more than 3,500 unseen articles.

Result: The model reaches an F1 score of 0.9708 on the test set with balanced precision and recall. On real articles from 2023 to 2026, mean prediction confidence exceeds 0.96 and the estimated share of AI-rewritten content is about 2.5%.

Conclusion: This is the first study to go beyond journalists' self-reports and measure AI use in Turkish news media empirically. It confirms that LLMs already see real but limited use in this setting.

Abstract: The rapid integration of large language models into newsroom workflows has raised urgent questions about the prevalence of AI-generated content in online media. While computational studies have begun to quantify this phenomenon in English-language outlets, no empirical investigation exists for Turkish news media, where existing research remains limited to qualitative interviews with journalists or fake news detection. This study addresses that gap by fine-tuning a Turkish-specific BERT model (dbmdz/bert-base-turkish-cased) on a labeled dataset of 3,600 articles from three major Turkish outlets with distinct editorial orientations for binary classification of AI-rewritten content. The model achieves 0.9708 F1 score on the held-out test set with symmetric precision and recall across both classes. Subsequent deployment on over 3,500 unseen articles spanning 2023 to 2026 reveals consistent cross-source and temporally stable classification patterns, with mean prediction confidence exceeding 0.96 and an estimated 2.5 percent of examined news content rewritten or revised by LLMs on average. To the best of our knowledge, this is the first study to move beyond self-reported journalist perceptions toward empirical, data-driven measurement of AI usage in Turkish news media.

[6] Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens

Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, Yu Meng

Main category: cs.CL

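TL;DR: This paper proposes a new measure of reasoning quality based on "deep-thinking tokens" and uses it to design Think@n, a more efficient and lower-cost test-time scaling strategy.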

Details

Motivation: Prior work shows that simply increasing generation length (that is, Chain-of-Thought length) does not reliably improve reasoning accuracy and can even hurt performance through "overthinking". A more reliable measure of reasoning effort is needed.

Method: Define and identify deep-thinking tokens (tokens whose predictions are revised significantly in the deeper layers of the model) and compute their share of the generated sequence, the deep-thinking ratio. Building on this, the Think@n strategy evaluates prefixes and prioritizes samples with a high deep-thinking ratio, enabling early filtering and a more efficient form of self-consistency.

Result: Across several mathematical and scientific benchmarks (AIME, HMMT, GPQA-diamond) and reasoning models (GPT-OSS, DeepSeek-R1, Qwen3), the deep-thinking ratio shows a robust positive correlation with accuracy and clearly outperforms length-based and confidence-based baselines. Think@n matches or exceeds standard self-consistency while substantially reducing inference cost.

Conclusion: The deep-thinking ratio is a better proxy for reasoning quality than generation length or confidence, and Think@n offers a new approach to efficient, scalable test-time reasoning.

Abstract: Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens: tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
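To make the deep-thinking ratio concrete, here is a rough sketch. The abstract does not give the exact criterion, so this block assumes one: each token comes with a per-layer next-token distribution, a token "settles" at the first layer after which every layer stays within a Jensen-Shannon tolerance of the final layer, and a token counts as deep-thinking when it settles in the last quarter of the network. The tolerance, the depth fraction, and all function names are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def settling_layer(layer_dists, tol=0.1):
    """Earliest layer from which all later layers stay within tol of the final layer."""
    final = layer_dists[-1]
    settled = len(layer_dists) - 1
    for i in range(len(layer_dists) - 1, -1, -1):
        if js_divergence(layer_dists[i], final) <= tol:
            settled = i
        else:
            break
    return settled


def deep_thinking_ratio(token_layer_dists, depth_fraction=0.75, tol=0.1):
    """Share of tokens whose prediction settles only in the deepest layers (assumed rule)."""
    if not token_layer_dists:
        return 0.0
    deep = 0
    for layer_dists in token_layer_dists:
        cutoff = depth_fraction * (len(layer_dists) - 1)
        if settling_layer(layer_dists, tol) >= cutoff:
            deep += 1
    return deep / len(token_layer_dists)
```

Under this reading, a prefix-level ratio could be computed the same way on the first few hundred tokens, which is what would allow early rejection of unpromising samples.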

[7] On Calibration of Large Language Models: From Response To Capability

Sin-Han Yang, Cheng-Kuang Wu, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee, Shao-Hua Sun

Main category: cs.CL

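TL;DR: This paper introduces capability calibration, which estimates a large language model's expected accuracy on a given query rather than the correctness of a single response. Through theoretical analysis and empirical study, it shows that capability calibration outperforms conventional response-level calibration and improves pass@$k$ prediction and inference budget allocation.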

Details

Motivation: Existing LLM calibration methods focus on the correctness of a single response (response-level confidence). In practice the more relevant question is how capable the model is of solving a query overall, and a single response cannot reflect that capability because decoding is stochastic. The objective is therefore misaligned.

Method: Introduce capability calibration as a new formulation, formally distinguish it from response calibration, and build an empirical evaluation framework that systematically tests a range of confidence estimation methods.

Result: Capability calibration clearly improves pass@$k$ prediction accuracy and the efficiency of inference budget allocation, and outperforms conventional response calibration on several metrics.

Conclusion: Capability calibration matches real deployment needs more closely, provides a new foundation for trustworthy use of large models, and has broad potential applicability.

Abstract: Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings where the central question is how likely a model is to solve a query overall. We show that this mismatch results from the stochastic nature of modern LLM decoding, under which single-response correctness fails to reflect underlying model capability. To address this issue, we introduce capability calibration, which targets the model's expected accuracy on a query. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We establish an empirical evaluation setup and study a range of confidence estimation methods. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation, establishing a foundation with potential for diverse applications.
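The link between a per-query capability estimate and pass@$k$ can be written down in a few lines. This sketch is not the paper's method: it assumes samples are drawn independently, so that a capability of p implies pass@k = 1 - (1 - p)^k, and it includes the widely used combinatorial pass@k estimator for comparison. The function names are illustrative.

```python
from math import comb


def empirical_capability(correct_flags):
    """Fraction of sampled responses that are correct: an estimate of expected accuracy."""
    if not correct_flags:
        raise ValueError("need at least one sampled response")
    return sum(1 for flag in correct_flags if flag) / len(correct_flags)


def predicted_pass_at_k(capability, k):
    """pass@k implied by a capability estimate, assuming independent samples."""
    return 1.0 - (1.0 - capability) ** k


def unbiased_pass_at_k(n, c, k):
    """Combinatorial pass@k estimate from n samples of which c are correct (k <= n)."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A well-calibrated capability score lets a system decide how many samples a query deserves: a query with capability 0.5 already reaches 0.75 at k = 2, while one at 0.05 needs far more budget or a different strategy.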

[8] Small Reward Models via Backward Inference

Yike Wang, Faeze Brahman, Shangbin Feng, Teng Xiao, Hannaneh Hajishirzi, Yulia Tsvetkov

Main category: cs.CL

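TL;DR: This paper proposes FLIP, which performs reward modeling without reference answers or rubrics by reconstructing the instruction through backward inference, and which clearly outperforms LLM-as-a-Judge baselines across several domains.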

Details

Motivation: Existing reward modeling approaches either depend on the strong reasoning ability of large models (LLM-as-a-Judge) or require reference responses or explicit rubrics, which limits flexibility and accessibility.

Method: FLIP uses backward inference: given a response, it infers the instruction most likely to have produced it and uses the similarity between the inferred and the original instruction as the reward signal. No reference response or hand-written rubric is required.

Result: Experiments across four domains with 13 small language models show that FLIP beats LLM-as-a-Judge baselines by 79.6% on average. It clearly improves downstream performance under test-time scaling via parallel sampling and under GRPO training, is more effective on longer outputs, and is robust.

Conclusion: By explicitly exploiting the validation-generation gap, FLIP delivers reliable, flexible reward modeling without external supervision for small models and resource-constrained settings, extending where reward modeling can be applied.

Abstract: Reward models (RMs) play a central role throughout the language model (LM) pipeline, particularly in non-verifiable domains. However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility. In this work, we propose FLIP (FLipped Inference for Prompt reconstruction), a reference-free and rubric-free reward modeling approach that reformulates reward modeling through backward inference: inferring the instruction that would most plausibly produce a given response. The similarity between the inferred and the original instructions is then used as the reward signal. Evaluations across four domains using 13 small language models show that FLIP outperforms LLM-as-a-Judge baselines by an average of 79.6%. Moreover, FLIP substantially improves downstream performance in extrinsic evaluations under test-time scaling via parallel sampling and GRPO training. We further find that FLIP is particularly effective for longer outputs and robust to common forms of reward hacking. By explicitly exploiting the validation-generation gap, FLIP enables reliable reward modeling in downscaled regimes where judgment methods fail. Code available at https://github.com/yikee/FLIP.
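The backward-inference reward can be sketched as a two-step function: infer an instruction from the response, then score its similarity to the original instruction. The abstract does not name the similarity measure, so this sketch assumes token-level F1 as a stand-in, and `infer_instruction` is a hypothetical callable that would wrap a small language model in a real setup. It is an illustration of the idea, not the code in the FLIP repository.

```python
from collections import Counter
from typing import Callable


def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two strings (stand-in similarity; an assumption)."""
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    overlap = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(tokens_a)
    recall = overlap / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


def flip_reward(instruction: str, response: str,
                infer_instruction: Callable[[str], str]) -> float:
    """Reward = similarity between the original and the back-inferred instruction."""
    inferred = infer_instruction(response)  # hypothetical model call
    return token_f1(inferred, instruction)
```

The intuition is that an on-topic, complete response makes its instruction easy to recover, whereas an evasive or off-topic one does not, so no reference answer is needed.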

[9] DistillLens: Symmetric Knowledge Distillation Through Logit Lens

Manish Dhakal, Uthman Jinadu, Anjila Budathoki, Rajshekhar Sunderraman, Yi Ding

Main category: cs.CL

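TL;DR: This paper proposes DistillLens, a framework that projects intermediate hidden states into the vocabulary space via the Logit Lens and applies a symmetric divergence objective for structural alignment, so that the student's thought process matches the teacher's and knowledge distillation improves.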

Details

Motivation: Standard knowledge distillation ignores the uncertainty information in the teacher's intermediate-layer thought process, and existing feature-based distillation methods (such as MSE and asymmetric KL) fail to model that uncertainty effectively.

Method: The DistillLens framework maps the intermediate hidden states of teacher and student into the vocabulary space with the Logit Lens and applies a symmetric divergence constraint. This calibrates the thought process in both directions, prevents both overconfidence and underconfidence, and preserves high-entropy reasoning pathways.

Result: On GPT-2 and Llama architectures, DistillLens consistently outperforms standard knowledge distillation and feature-transfer baselines across several instruction-following benchmarks.

Conclusion: Symmetric alignment of intermediate representations transfers more of the teacher's reasoning and uncertainty modeling, which improves the student's generalization and robustness.

Abstract: Standard Knowledge Distillation (KD) compresses Large Language Models (LLMs) by optimizing final outputs, yet it typically treats the thought process in the teacher's intermediate layers as a black box. While feature-based distillation attempts to bridge this gap, existing methods (e.g., MSE and asymmetric KL divergence) ignore the rich uncertainty profiles required for the final output. In this paper, we introduce DistillLens, a framework that symmetrically aligns the evolving thought processes of student and teacher models. By projecting intermediate hidden states into the vocabulary space via the Logit Lens, we enforce structural alignment using a symmetric divergence objective. Our analysis proves that this constraint imposes a dual-sided penalty, preventing both overconfidence and underconfidence while preserving the high-entropy information conduits essential for final deduction. Extensive experiments on GPT-2 and Llama architectures demonstrate that DistillLens consistently outperforms standard KD and feature-transfer baselines on diverse instruction-following benchmarks. The code is available at https://github.com/manishdhakal/DistillLens.
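A symmetric divergence over layer-wise vocabulary distributions can be illustrated as follows. The abstract does not specify which symmetric divergence is used, so this sketch assumes the average of the two KL directions; it also assumes the inputs are logits that have already been projected into vocabulary space (the Logit Lens step) and that teacher and student layers are paired one to one. All names are hypothetical and this is not the DistillLens implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon to avoid log(0)."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))


def symmetric_kl(p, q):
    """Average of both KL directions: penalizes over- and underconfidence alike."""
    return 0.5 * (kl(p, q) + kl(q, p))


def layerwise_alignment_loss(teacher_layer_logits, student_layer_logits):
    """Mean symmetric divergence over paired layers (assumed one-to-one pairing)."""
    if len(teacher_layer_logits) != len(student_layer_logits):
        raise ValueError("expected the same number of paired layers")
    losses = [
        symmetric_kl(softmax(t), softmax(s))
        for t, s in zip(teacher_layer_logits, student_layer_logits)
    ]
    return sum(losses) / len(losses)
```

A one-directional KL would penalize only one kind of mismatch strongly; averaging both directions is the simplest way to obtain the dual-sided penalty the abstract describes.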

[10] LLM-Confidence Reranker: A Training-Free Approach for Enhancing Retrieval-Augmented Generation Systems

Zhipeng Song, Xiangyu Kong, Xinrui Bao, Yizhi Zhou, Jiulong Jiao, Sitong Liu, Yuhang Zhou, Heng Qi

Main category: cs.CL

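TL;DR: This paper proposes the LLM-Confidence Reranker (LCR), a training-free, plug-and-play algorithm that uses a large language model's intrinsic confidence signal (measured with MSCP) to improve document reranking in retrieval-augmented generation (RAG). It clearly improves NDCG@5 on the BEIR and TREC benchmarks while remaining computationally efficient and broadly compatible.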

Details

Motivation: Existing RAG rerankers require specialized training, are computationally expensive, and do not fully exploit the semantic capabilities and confidence signals inherent in LLMs, so hallucination remains a serious problem in knowledge-intensive tasks.

Method: LCR is a training-free, black-box reranking algorithm. In the first stage it assesses the LLM's confidence for queries and documents (MSCP) through multinomial sampling and semantic clustering. In the second stage it bins by confidence thresholds and applies multi-level sorting, preserving the original ranking for high-confidence queries while promoting relevant documents.

Result: On the BEIR and TREC benchmarks, using pre-trained LLMs with 7-9B parameters, LCR improves NDCG@5 by up to 20.6% over pre-trained LLM and fine-tuned Transformer rerankers, without degradation. Ablation studies confirm that LLM confidence correlates positively with document relevance.

Conclusion: By tapping the LLM's intrinsic confidence, LCR achieves efficient, robust, scalable reranking that needs no training, is computationally light, and is broadly compatible, which helps mitigate hallucination in RAG systems for critical applications such as medical diagnosis.

Abstract: Large language models (LLMs) have revolutionized natural language processing, yet hallucinations in knowledge-intensive tasks remain a critical challenge. Retrieval-augmented generation (RAG) addresses this by integrating external knowledge, but its efficacy depends on accurate document retrieval and ranking. Although existing rerankers demonstrate effectiveness, they frequently necessitate specialized training, impose substantial computational expenses, and fail to fully exploit the semantic capabilities of LLMs, particularly their inherent confidence signals. We propose the LLM-Confidence Reranker (LCR), a training-free, plug-and-play algorithm that enhances reranking in RAG systems by leveraging black-box LLM confidence derived from Maximum Semantic Cluster Proportion (MSCP). LCR employs a two-stage process: confidence assessment via multinomial sampling and clustering, followed by binning and multi-level sorting based on query and document confidence thresholds. This approach prioritizes relevant documents while preserving original rankings for high-confidence queries, ensuring robustness. Evaluated on BEIR and TREC benchmarks with BM25 and Contriever retrievers, LCR, using only 7-9B-parameter pre-trained LLMs, consistently improves NDCG@5 by up to 20.6% across pre-trained LLM and fine-tuned Transformer rerankers, without degradation. Ablation studies validate the hypothesis that LLM confidence positively correlates with document relevance, elucidating LCR's mechanism. LCR offers computational efficiency, parallelism for scalability, and broad compatibility, mitigating hallucinations in applications like medical diagnosis.
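Read literally, Maximum Semantic Cluster Proportion suggests sampling several answers, grouping them by meaning, and taking the share of the largest group as the confidence. The sketch below follows that reading as an assumption; the clustering is a deliberately crude stand-in (normalized string equality) where a real system would use a semantic clustering step, and the function names are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Crude stand-in for semantic clustering: lowercase, drop punctuation, squeeze spaces."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def mscp(samples, cluster_key=normalize):
    """Share of sampled answers that fall into the largest cluster (assumed definition)."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(cluster_key(s) for s in samples)
    return max(counts.values()) / len(samples)
```

With a score like this per query and per query-document pair, the binning stage could leave the retriever's order untouched when query confidence is already high and reorder only when document-conditioned confidence differs across bins.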

[11] Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment

Jing Zhao,Ting Zhen,Junwei bao,Hongfei Jiang,Yang song

Main category: cs.CL

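TL;DR: This paper proposes Elo-Evolve, a framework that aligns LLMs through dynamic multi-agent competition within an adaptive opponent pool, removing the dependence on static reward functions and improving robustness and training stability.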

Details

Motivation: Existing LLM alignment methods compress large amounts of human preference data into static, absolute reward functions, which leads to data scarcity, noise sensitivity, and training instability.

Method: The Elo-Evolve co-evolutionary framework (1) learns directly from binary win/loss outcomes and drops the Bradley-Terry model, and (2) uses Elo-orchestrated opponent selection to provide automatic curriculum learning through temperature-controlled sampling. It is grounded in PAC learning theory, which highlights the sample-complexity advantage of pairwise comparison.

Result: Experiments show a 4.5x noise reduction compared with absolute scoring. On Alpaca Eval 2.0 and MT-Bench, Elo-Evolve clearly outperforms static pairwise training and point-based methods.

Conclusion: Dynamic opponent selection and pairwise comparison are key to better LLM alignment, and Elo-Evolve points toward a scalable, robust alignment approach.

Abstract: Current alignment methods for Large Language Models (LLMs) rely on compressing vast amounts of human preference data into static, absolute reward functions, leading to data scarcity, noise sensitivity, and training instability. We introduce Elo-Evolve, a co-evolutionary framework that redefines alignment as dynamic multi-agent competition within an adaptive opponent pool. Our approach makes two key innovations: (1) eliminating Bradley-Terry model dependencies by learning directly from binary win/loss outcomes in pairwise competitions, and (2) implementing Elo-orchestrated opponent selection that provides automatic curriculum learning through temperature-controlled sampling. We ground our approach in PAC learning theory, demonstrating that pairwise comparison achieves superior sample complexity, and we empirically validate a 4.5x noise reduction compared to absolute scoring approaches. Experimentally, we train a Qwen2.5-7B model using our framework with opponents including Qwen2.5-14B, Qwen2.5-32B, and Qwen3-8B models. Results demonstrate a clear performance hierarchy: point-based methods < static pairwise training < Elo-Evolve across Alpaca Eval 2.0 and MT-Bench, validating the progressive benefits of pairwise comparison and dynamic opponent selection for LLM alignment.
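The two ingredients named in the abstract, Elo ratings and temperature-controlled opponent sampling, can be sketched with the standard Elo formulas. The sampling rule here is an assumption: opponents whose rating is close to the policy's are favoured through a softmax over the negative rating gap, with the temperature controlling how sharp that preference is. The K-factor, temperature, and function names are illustrative, not the paper's settings.

```python
import math
import random


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Update both ratings after one match; score_a is 1.0 for a win, 0.0 for a loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def opponent_probabilities(policy_rating, opponent_ratings, temperature=100.0):
    """Assumed rule: softmax over negative rating gap, so close opponents are favoured."""
    weights = {name: math.exp(-abs(r - policy_rating) / temperature)
               for name, r in opponent_ratings.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_opponent(policy_rating, opponent_ratings, temperature=100.0, rng=None):
    """Draw one opponent name according to opponent_probabilities."""
    rng = rng or random.Random()
    probs = opponent_probabilities(policy_rating, opponent_ratings, temperature)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]
```

As the policy's rating rises, the same rule shifts probability mass toward stronger opponents, which is one way an automatic curriculum can emerge without a hand-written schedule.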

[12] Metaphors' journeys across time and genre: tracking the evolution of literary metaphors with temporal embeddings

Veronica Mangiaterra, Chiara Barattieri di San Pietro, Paolo Canal, Valentina Bambini

Main category: cs.CL

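TL;DR: This study uses diachronic distributional semantics to analyze how the semantic similarity of 19th-century Italian literary metaphors changes across periods and genres. Metaphor processing difficulty is stable overall but is strongly shaped by genre and by lexical-semantic features.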

Details

Motivation: Literary metaphors are a salient feature of literary language but have received little experimental study. Earlier psycholinguistic and computational work also overlooked the temporal dimension, even though many literary metaphors were coined centuries ago and are distant in time from contemporary readers.

Method: Train word embeddings on 19th- and 21st-century Italian literary and nonliterary corpora (124 million tokens in total), quantify changes in topic-vehicle semantic similarity for 515 19th-century literary metaphors as a proxy for metaphor processing load, and examine how genre and lexical-semantic features (such as vector coherence and semantic neighborhood density) modulate the effect.

Result: Overall, topic-vehicle similarity, and therefore metaphor processing load, stays stable across centuries. Genre matters, however: metaphors are harder (lower similarity) in modern literary contexts and easier (higher similarity) in contemporary nonliterary Web language. The semantic features of the metaphors' component terms further shape this pattern.

Conclusion: The processing load of literary metaphors is stable over time but depends strongly on genre evolution and lexical-semantic structure. The results fit broader accounts of language change in Italian: the stylistic simplification of modern literature may have increased processing demands, while the high creativity of Web language appears to make metaphors more accessible.

Abstract: Metaphors are a distinctive feature of literary language, yet they remain less studied experimentally than everyday metaphors. Moreover, previous psycholinguistic and computational approaches overlooked the temporal dimension, although many literary metaphors were coined centuries before their contemporary readers. This study innovatively applies tools from diachronic distributional semantics to assess whether the processing costs of literary metaphors varied over time and genre. Specifically, we trained word embeddings on literary and nonliterary Italian corpora from the 19th and 21st centuries, for a total of 124 million tokens, and modeled changes in the semantic similarity between topics and vehicles of 515 19th-century literary metaphors, taking this measure as a proxy of metaphor processing demands. Overall, semantic similarity, and hence metaphor processing demands, remained stable over time. However, genre played a key role: metaphors appeared more difficult (i.e., lower topic-vehicle similarity) in modern literary contexts than in 19th-century literature, but easier (i.e., higher topic-vehicle similarity) in today's nonliterary language (e.g., the Web) than in 19th-century nonliterary texts. This pattern was further shaped by semantic features of metaphors' individual terms, such as vector coherence and semantic neighborhood density. Collectively, these findings align with broader linguistic changes in Italian, such as the stylistic simplification of modern literature, which may have increased metaphor processing demands, and the high creativity of the Web's language, which seems to render metaphor more accessible.
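The topic-vehicle measure reduces to a cosine similarity computed inside each period's embedding space, followed by a difference across periods. The sketch below shows that computation on toy vectors; the dictionaries standing in for the two embedding spaces, the example words, and the function names are illustrative assumptions, not the study's data or code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_shift(old_space, new_space, topic, vehicle):
    """Change in topic-vehicle similarity from the older to the newer embedding space."""
    old_similarity = cosine(old_space[topic], old_space[vehicle])
    new_similarity = cosine(new_space[topic], new_space[vehicle])
    return new_similarity - old_similarity
```

Because each similarity is computed within a single space, the two spaces do not need to be aligned with each other for the difference to be meaningful. A positive shift would correspond to a metaphor whose terms have drawn closer, which the study treats as lower processing demand.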

[13] On Theoretically-Driven LLM Agents for Multi-Dimensional Discourse Analysis

Maciej Uberna, Michał Wawer, Jarosław A. Chudziak, Marcin Koszowy

Main category: cs.CL

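TL;DR: This paper presents a multi-agent framework that injects argumentation theory through retrieval-augmented generation (RAG) and clearly improves the identification of the functions of rephrasing strategies in discourse (such as intensification and generalisation). Macro F1 improves by nearly 30% over a zero-shot baseline, showing that theoretical grounding is essential for function-aware analysis of argumentation.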

Details

Motivation: Large language models can detect surface similarity but struggle to capture the pragmatic function that rephrasing serves in a rhetorical context (such as deintensification or intensification). Argumentation theory is needed to improve function identification.

Method: Build an annotated dataset of political debates with five rephrase function labels (Deintensification, Intensification, Specification, Generalisation, Other), then design and compare two LLM agent systems: one that incorporates argumentation theory through retrieval-augmented generation (RAG), and a zero-shot baseline.

Result: The RAG-enhanced agents outperform the baseline on every metric, with especially clear advantages on Intensification and Generalisation, for an overall Macro F1 improvement of nearly 30%.

Conclusion: Explicit argumentation-theory knowledge is essential for identifying rephrase functions, and modeling surface similarity alone cannot deliver function-aware analysis. The multi-agent architecture offers a route to scalable, theory-driven computational tools for analyzing rhetorical strategies.

Abstract: Identifying the strategic uses of reformulation in discourse remains a key challenge for computational argumentation. While LLMs can detect surface-level similarity, they often fail to capture the pragmatic functions of rephrasing, such as its role within rhetorical discourse. This paper presents a comparative multi-agent framework designed to quantify the benefits of incorporating explicit theoretical knowledge for this task. We utilise a dataset of annotated political debates to establish a new standard encompassing four distinct rephrase functions (Deintensification, Intensification, Specification, and Generalisation) plus Other, which covers all remaining types (D-I-S-G-O). We then evaluate two parallel LLM-based agent systems: one enhanced by argumentation theory via Retrieval-Augmented Generation (RAG), and an identical zero-shot baseline. The results reveal a clear performance gap: the RAG-enhanced agents substantially outperform the baseline across the board, with particularly strong advantages in detecting Intensification and Generalisation context, yielding an overall Macro F1-score improvement of nearly 30%. Our findings provide evidence that theoretical grounding is not only beneficial but essential for advancing beyond mere paraphrase detection towards function-aware analysis of argumentative discourse. This comparative multi-agent architecture represents a step towards scalable, theoretically informed computational tools capable of identifying rhetorical strategies in contemporary discourse.
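Since the headline number is a Macro F1 over the D-I-S-G-O labels, it may help to recall how that metric is computed: F1 is calculated per label and then averaged with equal weight, so rare functions count as much as frequent ones. The sketch below is a generic implementation with made-up labels for illustration; it is not the paper's evaluation script.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1 scores; a label with no hits scores 0."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        if tp == 0:
            scores.append(0.0)
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores)
```

Because of the equal weighting, a system that improves only on the dominant Other class would move Macro F1 much less than one that improves on Intensification or Generalisation.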

[14] RMPL: Relation-aware Multi-task Progressive Learning with Stage-wise Training for Multimedia Event Extraction

Yongkang Jin, Jianwen Luo, Jingjing Wang, Jianmin Yao, Yu Hong

Main category: cs.CL

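TL;DR: This paper proposes RMPL, a relation-aware multi-task progressive learning framework that improves multimedia event extraction (MEE) under low-resource conditions and addresses the weaknesses of existing methods in structured representation and cross-modal argument grounding.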

Details

Motivation: Existing MEE methods are limited by scarce annotated data (M2E2 is the only benchmark and provides annotations for evaluation only, not for training). The dominant approaches, cross-modal alignment or inference-time prompting of VLMs, lack explicit structured event representations, which leads to weak argument grounding.

Method: RMPL is a relation-aware multi-task progressive learning framework that uses heterogeneous supervision from unimodal event extraction and multimedia relation extraction in stages. It first learns cross-modal, event-centric representations under a unified schema, then fine-tunes event mention identification and argument role extraction on mixed textual and visual data.

Result: On the M2E2 benchmark, experiments with several vision-language models (VLMs) show that RMPL yields consistent improvements across modality settings.

Conclusion: RMPL eases the low-resource bottleneck in MEE. Through progressive multi-task learning and cross-modal structural modeling, it strengthens the cross-modal alignment of event semantics and argument grounding.

Abstract: Multimedia Event Extraction (MEE) aims to identify events and their arguments from documents that contain both text and images. It requires grounding event semantics across different modalities. Progress in MEE is limited by the lack of annotated training data. M2E2 is the only established benchmark, but it provides annotations only for evaluation. This makes direct supervised training impractical. Existing methods mainly rely on cross-modal alignment or inference-time prompting with Vision-Language Models (VLMs). These approaches do not explicitly learn structured event representations and often produce weak argument grounding in multimodal settings. To address these limitations, we propose RMPL, a Relation-aware Multi-task Progressive Learning framework for MEE under low-resource conditions. RMPL incorporates heterogeneous supervision from unimodal event extraction and multimedia relation extraction with stage-wise training. The model is first trained with a unified schema to learn shared event-centric representations across modalities. It is then fine-tuned for event mention identification and argument role extraction using mixed textual and visual data. Experiments on the M2E2 benchmark with multiple VLMs show consistent improvements across different modality settings.

[15] How Do Lexical Senses Correspond Between Spoken German and German Sign Language?

Melis Çelikkol, Wei Zhao

Main category: cs.CL

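TL;DR: This paper proposes a usage-based analysis of cross-modal mappings between word senses and signs, builds the first manually annotated dataset mapping German word senses to German Sign Language (DGS) sign IDs, and evaluates two computational methods, exact match and semantic similarity, on how well they identify different mapping types (such as one-to-many and many-to-one).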

Details

Motivation: Existing sign language dictionaries often overlook the fact that polysemous and homonymous words correspond to different signs in different contexts, so their mapping coverage is incomplete. A usage-driven approach is needed to uncover new mappings that dictionaries do not yet record and to enrich lexicographic resources.

Method: Manually annotate 1,404 mappings from German word uses to DGS sign IDs (drawn from 32 German words and 49 DGS signs) and define four categories (Type 1, Type 2, Type 3, and No Match). Compare Semantic Similarity (SS), using SBERT embeddings, with Exact Match (EM).

Result: Semantic similarity reaches 88.52% overall accuracy, well above exact match (71.31%), with a gain of 52.1 percentage points on Type 1 (one-to-many). The work produces the first publicly available annotated dataset and code for cross-modal sense-to-sign mapping.

Conclusion: Usage-based analysis reveals mapping patterns missing from dictionaries, and semantic similarity is better suited to identifying complex cross-modal correspondences. The work contributes new data, methods, and empirical grounding for sign language lexicography and cross-modal NLP.

Abstract: Sign language lexicographers construct bilingual dictionaries by establishing word-to-sign mappings, where polysemous and homonymous words corresponding to different signs across contexts are often underrepresented. A usage-based approach examining how word senses map to signs can identify such novel mappings absent from current dictionaries, enriching lexicographic resources. We address this by analyzing German and German Sign Language (Deutsche Gebärdensprache, DGS), manually annotating 1,404 word use-to-sign ID mappings derived from 32 words from the German Word Usage Graph (D-WUG) and 49 signs from the Digital Dictionary of German Sign Language (DW-DGS). We identify three correspondence types: Type 1 (one-to-many), Type 2 (many-to-one), and Type 3 (one-to-one), plus No Match cases. We evaluate computational methods: Exact Match (EM) and Semantic Similarity (SS) using SBERT embeddings. SS substantially outperforms EM overall (88.52% vs. 71.31%), with dramatic gains for Type 1 (+52.1 pp). Our work establishes the first annotated dataset for cross-modal sense correspondence and reveals which correspondence patterns are computationally identifiable. Our code and dataset are made publicly available.

[16] OMGs: A multi-agent system supporting MDT decision-making across the ovarian tumour care continuum

Yangyang Zhang, Zilong Wang, Jianbo Xu, Yongqi Chen, Chu Han, Zhihao Zhang, Shuai Liu, Hui Li, Huiping Zhang, Ziqi Liu, Jiaxin Chen, Jun Zhu, Zheng Feng, Hao Wen, Xingzhu Ju, Yanping Zhong, Yunqiu Zhang, Jie Duan, Jun Li, Dongsheng Li, Weijie Wang, Haiyan Zhu, Wei Jiang, Xiaohua Wu, Shuo Wang, Haiming Li, Qinhao Guo

Main category: cs.CL

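TL;DR: This paper presents OMGs (Ovarian tumour Multidisciplinary intelligent aGent System), a multi-agent AI framework in which domain-expert agents collaborate to produce MDT-style treatment recommendations. It also introduces the SPEAR evaluation framework, covering safety, personalization, evidence, actionability, and robustness. The recommendations are comparable in quality to expert MDT consensus and stronger on evidence, which could improve ovarian cancer care in resource-limited settings.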

Details

Motivation: Most patients worldwide, especially in resource-constrained centres, cannot get timely expert consensus from a multidisciplinary tumour board (MDT), yet ovarian tumour management depends heavily on MDT decisions.

Method: Build OMGs, a multi-agent AI system in which domain-specific agents jointly integrate multidisciplinary evidence and generate MDT-style recommendations with transparent rationales. Develop the SPEAR framework (Safety, Personalization, Evidence, Actionability, Robustness) for systematic evaluation, and carry out multicentre re-evaluation, prospective clinical validation, and paired human-AI studies.

Result: In multicentre re-evaluation, OMGs scored 4.45±0.30, comparable to expert MDT consensus (4.53±0.23), with a higher Evidence score (4.57 vs 3.92). The prospective evaluation (59 patients) showed high concordance with routine MDT decisions. The paired human-AI studies showed that OMGs most substantially improved clinicians' recommendations on Evidence and Robustness.

Conclusion: Multi-agent deliberative AI systems can match the recommendation performance of expert MDT consensus and could help close the gap in specialized oncology expertise in resource-limited settings, improving access to and quality of ovarian tumour care.

Abstract: Ovarian tumour management has increasingly relied on multidisciplinary tumour board (MDT) deliberation to address treatment complexity and disease heterogeneity. However, most patients worldwide lack access to timely expert consensus, particularly in resource-constrained centres where MDT resources are scarce or unavailable. Here we present OMGs (Ovarian tumour Multidisciplinary intelligent aGent System), a multi-agent AI framework where domain-specific agents deliberate collaboratively to integrate multidisciplinary evidence and generate MDT-style recommendations with transparent rationales. To systematically evaluate MDT recommendation quality, we developed SPEAR (Safety, Personalization, Evidence, Actionability, Robustness) and validated OMGs across diverse clinical scenarios spanning the care continuum. In multicentre re-evaluation, OMGs achieved performance comparable to expert MDT consensus ($4.45 \pm 0.30$ versus $4.53 \pm 0.23$), with higher Evidence scores (4.57 versus 3.92). In prospective multicentre evaluation (59 patients), OMGs demonstrated high concordance with routine MDT decisions. Critically, in paired human-AI studies, OMGs most substantially enhanced clinicians' recommendations in Evidence and Robustness, the dimensions most compromised when multidisciplinary expertise is unavailable. These findings suggest that multi-agent deliberative systems can achieve performance comparable to expert MDT consensus, with potential to expand access to specialized oncology expertise in resource-limited settings.

[17] The acquisition of English irregular inflections by Yemeni L1 Arabic learners: A Universal Grammar approach

Muneef Y. Alsawsh,Mohammed Q. Shormani

Main category: cs.CL

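TL;DR: Working within a Universal Grammar (UG) framework, this study examines how Yemeni learners of English as a second language acquire English irregular inflections, focusing on L1 transfer and L2 developmental influence and testing the Feature Reassembly Hypothesis (FRH). Early errors stem mainly from L1 transfer, while later ones reflect UG-driven morphological reassembly. Statistics show significant improvement from stage 1 to stage 2, but consonant change, zero-morpheme, and -a plural forms remain difficult, suggesting that input quality and instruction constrain full UG access.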

Details

Motivation: To investigate how Yemeni L2 learners of English acquire irregular inflectional morphology, to test whether the Feature Reassembly Hypothesis (FRH) applies within the UG framework, and to separate the roles of L1 transfer and L2 developmental factors. Method: Using the UG framework together with the FRH, the study analyzes error types in learner data from two developmental stages and applies a one-way analysis of variance (ANOVA) to test for differences in production between stages. Result: Stage 1 errors are dominated by L1 transfer (phonological and structural mismatches), while stage 2 shows increased sensitivity to UG properties and morphological reconfiguration. The statistics confirm a significant improvement in the production of irregular inflections, but consonant change, zero-morpheme, and -a plural forms remain persistently difficult. Conclusion: L1 transfer and L2 development jointly shape the initial stages of acquisition, but high-quality linguistic input and effective instruction are the key conditions for UG-driven feature reassembly in adult L2 learners. Abstract: This study examines the acquisition of English irregular inflections by Yemeni learners of English as a second language (L2), utilizing a Universal Grammar (UG) approach. Within the UG approach, the study considers the Feature Reassembly Hypothesis (FRH) (Lardiere, 2008, 2009) as part of UG, focusing on the roles of first language (L1) transfer and L2 developmental influence. It analyzes learner errors across two developmental stages. Stage 1 data reveal a dominant influence of L1 transfer, particularly in phonological and structural mismatches, while stage 2 data demonstrate increased learner sensitivity to UG properties and morphological reconfiguration toward the target language. Findings reveal that errors in irregular inflectional morphology are attributed to both interlingual and intralingual sources, with overgeneralization of L2 rules as a common developmental strategy. Statistical analysis, including a one-way ANOVA, indicates significant improvement in the production of well-formed irregular inflections from stage 1 to stage 2, underscoring learners' continued access to UG. However, persistent difficulties with consonant change, zero-morpheme, and -a plural inflections suggest that limited exposure, ineffective input modeling, and insufficient instructional quality constrain full UG access. The study concludes that while L1 transfer and L2 developmental factors influence initial stages of acquisition, appropriate linguistic input and instruction are critical for facilitating UG-driven feature reassembly in adult L2 learners.

[18] Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind

Minyuan Ruan,Ziyue Wang,Kaiming Liu,Yunghwei Lai,Peng Li,Yang Liu

Main category: cs.CL

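TL;DR: This paper proposes a practical Theory of Mind (ToM) framework for large language models (LLMs), emphasizing its role in detecting and bridging the epistemic divergence between a user's subjective beliefs and the true state of the environment. It introduces a new benchmark, \benchname, and a trajectory-based ToM dataset, and uses reinforcement learning to improve how well models reason about user mental states and perform on downstream tasks.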

Details

Motivation: Current LLMs struggle to understand ambiguous intentions, which produces an epistemic divergence between the user's subjective beliefs and the true environment state. Existing ToM evaluations are mostly limited to isolated belief inference and overlook its functional value in real interaction. Method: The authors formalize ToM as a mechanism for detecting and resolving epistemic divergence and build the benchmark \benchname to assess how models reconcile user beliefs and profiles in practice. They further curate a trajectory-based ToM dataset that links belief tracking with task-related state inference, and train on it with reinforcement learning. Result: Evaluation of 11 leading models shows that they generally fail to identify the underlying cognitive gaps that impede task success. The model trained on the new dataset improves consistently both in reasoning about user mental states and in downstream task performance. Conclusion: ToM should be treated as a key mechanism supporting human-AI interaction rather than as an isolated reasoning skill; its practical value lies in dynamically reconciling user beliefs with the true state. Abstract: Large Language Models (LLMs) have developed rapidly and are widely applied to both general-purpose and professional tasks to assist human users. However, they still struggle to comprehend and respond to the true user needs when intentions and instructions are imprecisely conveyed, leading to a divergence between subjective user beliefs and true environment states. Resolving this epistemic divergence requires Theory of Mind (ToM), yet existing ToM evaluations for LLMs primarily focus on isolated belief inference, overlooking its functional utility in real-world interaction. To this end, we formalize ToM for LLMs as a mechanism for epistemic divergence detection and resolution, and propose a benchmark, \benchname, to assess how models reconcile user beliefs and profiles in practice. Results across 11 leading models reveal a significant limitation in identifying underlying cognitive gaps that impede task success. To bridge this gap, we further curate a trajectory-based ToM dataset linking belief tracking with task-related state inference. The model trained on this data via reinforcement learning shows consistent improvement in reasoning about user mental states, leading to enhanced downstream performance. Our work highlights the practical value of ToM as an essential interaction-level mechanism rather than as a standalone reasoning skill.

[19] Speculative Decoding with a Speculative Vocabulary

Miles Williams,Young D. Kwon,Rui Li,Alexandros Kouris,Stylianos I. Venieris

Main category: cs.CL

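TL;DR: This paper proposes SpecVocab, which speculates on the vocabulary by dynamically selecting a vocabulary subset at each decoding step. It targets the output distribution bottleneck in existing speculative decoding and raises throughput while keeping speculation effective.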

Details

Motivation: In existing speculative decoding methods, the draft model's output embedding matrix dominates drafting cost, while simply shrinking the vocabulary hurts speculation effectiveness when the target token is out-of-vocabulary. Method: SpecVocab dynamically selects a vocabulary subset at each decoding step for the draft model's output prediction, avoiding the loss of effectiveness caused by a fixed small vocabulary. Result: Across a variety of tasks, SpecVocab achieves a higher acceptance length than the state-of-the-art method EAGLE-3 and yields up to an 8.1% increase in average throughput. Conclusion: Vocabulary speculation is an effective alternative to a fixed reduced vocabulary, and SpecVocab strikes a better balance between efficiency and effectiveness. Abstract: Speculative decoding has rapidly emerged as a leading approach for accelerating language model (LM) inference, as it offers substantial speedups while yielding identical outputs. This relies upon a small draft model, tasked with predicting the outputs of the target model. State-of-the-art speculative decoding methods use a draft model consisting of a single decoder layer and output embedding matrix, with the latter dominating drafting time for the latest LMs. Recent work has sought to address this output distribution bottleneck by reducing the vocabulary of the draft model. Although this can improve throughput, it compromises speculation effectiveness when the target token is out-of-vocabulary. In this paper, we argue for vocabulary speculation as an alternative to a reduced vocabulary. We propose SpecVocab, an efficient and effective method that selects a vocabulary subset per decoding step. Across a variety of tasks, we demonstrate that SpecVocab can achieve a higher acceptance length than the state-of-the-art speculative decoding approach, EAGLE-3. Notably, this yields up to an 8.1% increase in average throughput over EAGLE-3.
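
The output distribution bottleneck mentioned in this abstract comes from scoring the draft model's hidden state against every row of the output embedding matrix. As a rough sketch, and not SpecVocab itself, the pure-Python snippet below shows that scoring only a subset of vocabulary rows yields the same logits for those tokens with proportionally less work. The function names, the toy matrix, and the hard-coded candidate list are all assumptions for illustration; the abstract does not say how SpecVocab chooses its subset.

```python
def full_logits(hidden, out_embed):
    """Score every vocabulary row: one dot product per token."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in out_embed]


def subset_logits(hidden, out_embed, candidate_ids):
    """Score only the candidate token ids (hypothetical per-step subset)."""
    return {
        tok: sum(h * w for h, w in zip(hidden, out_embed[tok]))
        for tok in candidate_ids
    }


# Toy output embedding: 6 tokens, hidden size 3 (made-up numbers).
OUT_EMBED = [
    [0.1, 0.0, 0.2],
    [0.5, -0.3, 0.1],
    [-0.2, 0.4, 0.0],
    [0.9, 0.1, -0.5],
    [0.0, 0.0, 1.0],
    [0.3, 0.3, 0.3],
]
HIDDEN = [1.0, 2.0, -1.0]
CANDIDATES = [1, 3, 5]  # placeholder for a per-step vocabulary guess

full = full_logits(HIDDEN, OUT_EMBED)
sub = subset_logits(HIDDEN, OUT_EMBED, CANDIDATES)
best_in_subset = max(sub, key=sub.get)
```

If the target token falls outside the candidate list, the draft cannot propose it, which is the failure mode the abstract attributes to a fixed reduced vocabulary.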

[20] PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training

Yuhan Cheng,Hancheng Ye,Hai Helen Li,Jingwei Sun,Yiran Chen

Main category: cs.CL

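TL;DR: This paper proposes PrivAct, a multi-agent learning framework that internalizes contextual privacy preservation into the generation behavior of LLM agents so that their actions are privacy-compliant.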

Details

Motivation: Existing approaches rely on external, inference-time interventions that are brittle, scenario-specific, and may expand the privacy attack surface. Meanwhile, LLM agents handling sensitive, context-dependent information are prone to privacy leakage because contextual privacy is implicit. Method: PrivAct embeds privacy preferences into each agent so that the models internalize contextual privacy preservation directly in their generation behavior. Result: Across multiple LLM backbones and benchmarks, PrivAct consistently improves contextual privacy preservation, reducing leakage rates by up to 12.32% while maintaining helpfulness, and shows zero-shot generalization and robustness across diverse multi-agent topologies. Conclusion: PrivAct achieves a better privacy-helpfulness trade-off and strengthens system-wide contextual integrity, offering a new approach to privacy-safe agents in personalized tasks. Abstract: Large language model (LLM) agents are increasingly deployed in personalized tasks involving sensitive, context-dependent information, where privacy violations may arise in agents' actions due to the implicitness of contextual privacy. Existing approaches rely on external, inference-time interventions which are brittle, scenario-specific, and may expand the privacy attack surface. We propose PrivAct, a contextual privacy-aware multi-agent learning framework that internalizes contextual privacy preservation directly into models' generation behavior for privacy-compliant agentic actions. By embedding privacy preferences into each agent, PrivAct enhances system-wide contextual integrity while achieving a more favorable privacy-helpfulness tradeoff. Experiments across multiple LLM backbones and benchmarks demonstrate consistent improvements in contextual privacy preservation, reducing leakage rates by up to 12.32% while maintaining comparable helpfulness, as well as zero-shot generalization and robustness across diverse multi-agent topologies. Code is available at https://github.com/chengyh23/PrivAct.

[21] Tutoring Large Language Models to be Domain-adaptive, Precise, and Safe

Somnath Banerjee

Main category: cs.CL

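TL;DR: This paper proposes a "Responsible Intelligence" framework that aims to balance the generative power of large language models (LLMs) against the safety, cultural-fit, and ethical requirements of real-world deployment.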

Details

Motivation: As LLMs grow in importance within artificial intelligence, there is an urgent need to move from general-purpose architectures toward systems that are context-aware, inherently safe, and sensitive to global cultural differences. Method: The research follows three stages: classical supervised domain adaptation to meet task precision, decoding-time alignment to improve safety, and human feedback with preference modeling to achieve sociolinguistic precision. Result: A unified framework covering domain adaptation, ethical robustness, and multilingual cultural alignment, intended to improve the reliability and inclusiveness of LLMs in real-world settings. Conclusion: The Responsible Intelligence framework offers a systematic path for deploying LLMs in practice, stressing that technical performance, controllable safety, and cultural respect must work together. Abstract: The overarching research direction of this work is the development of a "Responsible Intelligence" framework designed to reconcile the immense generative power of Large Language Models (LLMs) with the stringent requirements of real-world deployment. As these models become a transformative force in artificial intelligence, there is an urgent need to move beyond general-purpose architectures toward systems that are contextually aware, inherently safer, and deeply respectful of global cultural nuances. This research navigates three interconnected threads: domain adaptation to ensure technical precision, ethical rigor to mitigate adversarial vulnerabilities, and cultural/multilingual alignment to promote global inclusivity. The methodological trajectory moves from classical supervised adaptation for task-specific demands to decoding-time alignment for safety, finally leveraging human feedback and preference modeling to achieve sociolinguistic acuity.

[22] Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages

Somnath Banerjee,Rima Hazra,Animesh Mukherjee

Main category: cs.CL

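TL;DR: This paper argues that current LLM safety mechanisms perform poorly on low-resource languages and code-mixed inputs, that culturally harmful behavior can slip past standard toxicity scores, and that English safety patches transfer poorly to low-resource languages. It proposes a practical multilingual safety agenda for the Global South: parameter-efficient safety steering, culturally grounded evaluation and preference data, and participatory community workflows.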

Details

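Motivation: Existing LLM safety pipelines, benchmarks, and alignment methods mainly target English and a few high-resource languages, on the assumption that safety and factuality transfer across languages. The evidence shows this assumption does not hold, with serious safety gaps in low-resource languages, code-mixing, and culturally specific settings. Method: The authors synthesize recent findings on three failures: guardrails breaking down on low-resource and code-mixed inputs, culturally harmful behavior going undetected, and English safety patches failing to transfer to low-resource languages. They then propose a three-part practical agenda for researchers and students in the Global South: parameter-efficient safety steering, culturally grounded evaluation and preference data, and participatory workflows for defining and mitigating harm. Result: The paper identifies three key shortcomings of current multilingual AI safety and sets out an actionable path for building safety locally, stressing that multilingual safety is a core requirement for equitable AI rather than an add-on. Conclusion: Multilingual safety cannot rest on English-centric transfer assumptions. It has to be rebuilt through embedded local knowledge, culturally sensitive evaluation, and community empowerment in order to achieve fair and reliable AI deployment in the Global South. Abstract: Large language models (LLMs) are being deployed across the Global South, where everyday use involves low-resource languages, code-mixing, and culturally specific norms. Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality "transfer" across languages. Evidence increasingly shows they do not. We synthesize recent findings indicating that (i) safety guardrails weaken sharply on low-resource and code-mixed inputs, (ii) culturally harmful behavior can persist even when standard toxicity scores look acceptable, and (iii) English-only knowledge edits and safety patches often fail to carry over to low-resource languages. In response, we outline a practical agenda for researchers and students in the Global South: parameter-efficient safety steering, culturally grounded evaluation and preference data, and participatory workflows that empower local communities to define and mitigate harm. Our aim is to make multilingual safety a core requirement, not an add-on, for equitable AI in underrepresented regions.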

[23] ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics

Hend Al-Khalifa,Nadia Ghezaiel,Maria Bounnit,Hend Hamed Alhazmi,Noof Abdullah Alfear,Reem Fahad Alqifari,Ameera Masoud Almasoud,Sharefah Ahmed Al-Ghamdi

Main category: cs.CL

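TL;DR: This paper introduces ADAB, a new annotated dataset for Arabic politeness detection. It contains 10,000 samples from several online platforms, covers Modern Standard Arabic and multiple dialects, and is labelled as polite, impolite, or neutral based on Arabic linguistic traditions and pragmatic theory. The authors also benchmark 40 model configurations.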

Details

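Motivation: Politeness expressions in Arabic are rich and complex, yet there is no high-quality politeness detection resource that covers multiple varieties and follows Arabic linguistic traditions. Method: The authors build the ADAB dataset from text collected on four online platforms spanning social media, e-commerce, and customer service, covering Modern Standard Arabic and the Gulf, Egyptian, Levantine, and Maghrebi dialects. Samples are manually annotated into three classes (polite, impolite, neutral) according to Arabic linguistic traditions and pragmatic theory, together with 16 categories of politeness-related linguistic features. They evaluate 40 model configurations, including traditional machine learning, transformer models, and large language models. Result: ADAB contains 10,000 samples with annotations across 16 linguistic feature categories and reaches an inter-annotator agreement of kappa = 0.703. The benchmark results for 40 model configurations provide baselines for future work. Conclusion: ADAB fills a gap in Arabic politeness detection resources and supports culturally sensitive Arabic NLP research. Abstract: The growing importance of culturally-aware natural language processing systems has led to an increasing demand for resources that capture sociopragmatic phenomena across diverse languages. Nevertheless, Arabic-language resources for politeness detection remain under-explored, despite the rich and complex politeness expressions embedded in Arabic communication. In this paper, we introduce ADAB (Arabic Politeness Dataset), a new annotated Arabic dataset collected from four online platforms, including social media, e-commerce, and customer service domains, covering Modern Standard Arabic and multiple dialects (Gulf, Egyptian, Levantine, and Maghrebi). The dataset was annotated based on Arabic linguistic traditions and pragmatic theory, resulting in three classes: polite, impolite, and neutral. It contains 10,000 samples with linguistic feature annotations across 16 politeness categories and achieves substantial inter-annotator agreement (kappa = 0.703). We benchmark 40 model configurations, including traditional machine learning, transformer-based models, and large language models. The dataset aims to support research on politeness-aware Arabic NLP.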
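
To make the reported agreement figure more concrete, here is a minimal sketch of how a kappa statistic is computed. It assumes Cohen's kappa with two annotators; the abstract does not say which kappa variant or how many annotators ADAB used, and the labels below are made-up toy data rather than dataset samples.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Made-up toy labels, not ADAB data.
ANNOTATOR_1 = ["polite", "polite", "neutral", "impolite", "neutral", "polite"]
ANNOTATOR_2 = ["polite", "neutral", "neutral", "impolite", "neutral", "polite"]
kappa = cohens_kappa(ANNOTATOR_1, ANNOTATOR_2)
```

Kappa discounts the agreement expected by chance, which is why it is lower than the raw proportion of matching labels (here 5 of 6).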

[24] Evaluating Prompt Engineering Techniques for RAG in Small Language Models: A Multi-Hop QA Approach

Amir Hossein Mohammadi,Ali Moeinian,Zahra Razavizade,Afsaneh Fatemi,Reza Ramezani

Main category: cs.CL

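TL;DR: Through a large-scale empirical study, this paper systematically evaluates 24 prompt templates for retrieval-augmented generation (RAG) with small language models (SLMs). On the HotpotQA multi-hop question answering task, the best template improves on the standard RAG prompt by up to 6%, and the paper offers practical recommendations for deploying SLM-based RAG in resource-constrained settings.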

Details

Motivation: To optimize retrieval-augmented generation (RAG) with small language models (SLMs) on complex multi-hop question answering, with a focus on prompt template design, a crucial but neglected factor. Method: A large-scale empirical evaluation of 24 prompt templates (a standard RAG prompt, nine techniques from the literature, and 14 new hybrid variants) on the HotpotQA dataset, using two SLMs: Qwen2.5-3B Instruct and Gemma3-4B-It. Result: On a test set of 18,720 instances, the best templates reach reported figures of up to 83% on Qwen2.5 and 84.5% on Gemma3-4B-It, an improvement of up to 6% over the standard RAG prompt for both models. Conclusion: Prompt template design has a significant effect on SLM-based RAG performance; the proposed hybrid templates and accompanying analysis give practical guidance for efficient RAG deployment in resource-constrained environments. Abstract: Retrieval Augmented Generation (RAG) is a powerful approach for enhancing the factual grounding of language models by integrating external knowledge. While widely studied for large language models, the optimization of RAG for Small Language Models (SLMs) remains a critical research gap, particularly in complex, multi-hop question-answering tasks that require sophisticated reasoning. In these systems, prompt template design is a crucial yet under-explored factor influencing performance. This paper presents a large-scale empirical study to investigate this factor, evaluating 24 different prompt templates on the HotpotQA dataset. The set includes a standard RAG prompt, nine well-formed techniques from the literature, and 14 novel hybrid variants, all tested on two prominent SLMs: Qwen2.5-3B Instruct and Gemma3-4B-It. Our findings, based on a test set of 18,720 instances, reveal significant performance gains of up to 83% on Qwen2.5 and 84.5% on Gemma3-4B-It, yielding an improvement of up to 6% for both models compared to the Standard RAG prompt. This research also offers concrete analysis and actionable recommendations for designing effective and efficient prompts for SLM-based RAG systems, particularly for deployment in resource-constrained environments.

[25] Pre-Editorial Normalization for Automatically Transcribed Medieval Manuscripts in Old French and Latin

Thibault Clérice,Rachel Bawden,Anthony Glaise,Ariane Pinche,David Smith

Main category: cs.CL

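TL;DR: This paper proposes the task of Pre-Editorial Normalization (PEN), which aims to bridge the methodological divide between palaeographic transcriptions and normalized digital editions. It builds a large silver training set and a manually corrected gold evaluation set from the CoMMA corpus, and a ByT5-based model reaches a 6.7% character error rate.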

Details

Motivation: Existing ATR models leave a usability gap between palaeographic transcription (faithful to the graphemes) and normalized output (suited to readers and NLP tools): the former is hard to use, and the latter tends to over-normalize and hallucinate. An intermediate approach that keeps the strengths of both has been missing. Method: The authors define the new task of Pre-Editorial Normalization (PEN). From the CoMMA corpus, aligned with digitized Old French and Latin editions using passim, they build a 4.66M-sample silver training set and a 1.8k-sample gold evaluation set, and benchmark ByT5 sequence-to-sequence models on normalization and pre-annotation tasks. Result: A formal definition of PEN, a large silver training set (4.66M samples), a high-quality gold evaluation set (1.8k samples), and a ByT5 model that reaches a 6.7% CER on PEN, substantially better than previous models. Conclusion: As an intermediate step that keeps palaeographic fidelity while supporting practical normalization, PEN bridges the gap between scholarly rigour and practical usability in the digitization of historical texts. Abstract: Recent advances in Automatic Text Recognition (ATR) have improved access to historical archives, yet a methodological divide persists between palaeographic transcriptions and normalized digital editions. While ATR models trained on more palaeographically-oriented datasets such as CATMuS have shown greater generalizability, their raw outputs remain poorly compatible with most readers and downstream NLP tools, thus creating a usability gap. On the other hand, ATR models trained to produce normalized outputs have been shown to struggle to adapt to new domains and tend to over-normalize and hallucinate. We introduce the task of Pre-Editorial Normalization (PEN), which consists in normalizing graphemic ATR output according to editorial conventions, which has the advantage of keeping an intermediate step with palaeographic fidelity while providing a normalized version for practical usability. We present a new dataset derived from the CoMMA corpus and aligned with digitized Old French and Latin editions using passim. We also produce a manually corrected gold-standard evaluation set. We benchmark this resource using ByT5-based sequence-to-sequence models on normalization and pre-annotation tasks. Our contributions include the formal definition of PEN, a 4.66M-sample silver training corpus, a 1.8k-sample gold evaluation set, and a normalization model achieving a 6.7% CER, substantially outperforming previous models for this task.
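
The character error rate (CER) quoted above is conventionally the Levenshtein edit distance between the model output and the reference, divided by the reference length. The sketch below implements that standard definition in pure Python; whether the authors apply any extra preprocessing (case folding, whitespace handling) is not stated, and the example strings are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference char
                            curr[j - 1] + 1,      # insert a hypothesis char
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of hyp measured against a non-empty ref."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Invented example: a graphemic "u" where the edition prints "v".
example_cer = cer("chevalier", "cheualier")
```

On this toy pair a single substitution over nine reference characters gives a CER of 1/9, or about 11%.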

[26] HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

Weiqi Zhai,Zhihai Wang,Jinghang Wang,Boyu Yang,Xiaogang Li,Xiang Xu,Bohan Wang,Peng Wang,Xingzhe Wu,Anfeng Li,Qiyuan Feng,Yuhao Zhou,Shoulin Han,Wenjie Luo,Yiyuan Li,Yaxuan Wang,Ruixian Luo,Guojie Lin,Peiyao Xiao,Chengliang Xu,Ben Wang,Zeyu Wang,Zichao Chen,Jianan Ye,Yijie Hu,Jialong Chen,Zongwen Shen,Yuliang Xu,An Yang,Bowen Yu,Dayiheng Liu,Junyang Lin,Hu Wei,Que Shen,Bing Zhao

Main category: cs.CL

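TL;DR: This paper presents HLE-Verified, an expert-verified and revised version of the HLE benchmark. A two-stage workflow (validation, then repair) improves evaluation accuracy and substantially reduces annotation noise. Model accuracy rises by 7-10 percentage points on average, and by 30-40 percentage points on items that were originally erroneous.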

Details

Motivation: HLE, a benchmark for frontier large language models, contains a notable number of noisy items that bias evaluation and distort cross-model comparisons, so a more reliable and transparent verification and revision scheme is needed. Method: A two-stage validation-and-repair workflow. In Stage I, each item's problem and answer undergo binary validation (domain-expert review plus model-based cross-checks), yielding 641 verified items. In Stage II, flawed but fixable items are revised under strict constraints by two independent experts, audited with model assistance, and finally adjudicated, yielding 1,170 revised-and-certified items. The remaining 689 items are released as an uncertain set with explicit uncertainty sources and expertise tags. Result: Across seven state-of-the-art language models, average absolute accuracy on HLE-Verified is 7-10 percentage points higher than on the original HLE, and 30-40 percentage points higher on items whose original problem statement or reference answer was erroneous. Model confidence is strongly associated with errors in the problem or answer, which supports the effectiveness of the revisions. Conclusion: Through systematic verification and a transparent revision protocol, HLE-Verified substantially reduces annotation noise in HLE-style benchmarks, enables a more faithful and robust measurement of model capabilities, and offers a methodological example for building high-quality benchmarks. Abstract: Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain-expert review and model-based cross-checks, yielding 641 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,170 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate seven state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7-10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30-40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: https://github.com/SKYLENAGE-AI/HLE-Verified
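
The three subsets reported in the abstract can be tallied with a few lines of arithmetic. The snippet uses only the counts given above; the dictionary keys and variable names are illustrative.

```python
# Item counts as reported in the abstract.
SUBSETS = {"verified": 641, "revised_and_certified": 1170, "uncertain": 689}

total = sum(SUBSETS.values())  # 2500 items across the three subsets
shares = {name: count / total for name, count in SUBSETS.items()}
certified = SUBSETS["verified"] + SUBSETS["revised_and_certified"]  # 1811
```

Under these counts roughly 72% of the items (1,811 of 2,500) end up verified or revised-and-certified, and about 28% remain in the documented uncertain set.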

[27] Chain-of-Thought Reasoning with Large Language Models for Clinical Alzheimer's Disease Assessment and Diagnosis

Tongze Zhang,Jun-En Ding,Melik Ozolcer,Fang-Ming Hung,Albert Chih-Chieh Yang,Feng Liu,Yi-Rou Ji,Sang Won Bae

Main category: cs.CL

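TL;DR: This paper proposes a new approach to Alzheimer's disease (AD) diagnosis based on large language models (LLMs) and chain-of-thought (CoT) reasoning. It uses electronic health records (EHRs) to improve diagnostic accuracy and interpretability, and raises the F1 score on CDR grading tasks by up to 15% over a zero-shot baseline.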

Details

Motivation: Traditional AD diagnosis depends on imaging and physicians' clinical assessment, which is time-consuming and labour-intensive. LLMs are increasingly used in medicine, but their use in AD assessment remains limited because the disease has complex causes that are hard to observe directly through imaging. Method: An LLM-based chain-of-thought (CoT) reasoning framework. Rather than fine-tuning the LLM directly for classification, it generates explicit diagnostic reasoning paths and produces structured predictions from them, improving both the handling of AD's complex factors and interpretability. Result: Diagnostic stability and performance improve significantly across multiple CDR grading tasks, with F1 gains of up to 15% over the zero-shot baseline. Conclusion: CoT reasoning improves the diagnostic ability and interpretability of LLMs in AD assessment and offers a new approach to EHR-based intelligent diagnosis of neurodegenerative disease. Abstract: Alzheimer's disease (AD) has become a prevalent neurodegenerative disease worldwide. Traditional diagnosis still relies heavily on medical imaging and clinical assessment by physicians, which is often time-consuming and resource-intensive in terms of both human expertise and healthcare resources. In recent years, large language models (LLMs) have been increasingly applied to the medical field using electronic health records (EHRs), yet their application in Alzheimer's disease assessment remains limited, particularly given that AD involves complex multifactorial etiologies that are difficult to observe directly through imaging modalities. In this work, we propose leveraging LLMs to perform Chain-of-Thought (CoT) reasoning on patients' clinical EHRs. Unlike direct fine-tuning of LLMs on EHR data for AD classification, our approach utilizes LLM-generated CoT reasoning paths to provide the model with explicit diagnostic rationale for AD assessment, followed by structured CoT-based predictions. This pipeline not only enhances the model's ability to diagnose intrinsically complex factors but also improves the interpretability of the prediction process across different stages of AD progression. Experimental results demonstrate that the proposed CoT-based diagnostic framework significantly enhances stability and diagnostic performance across multiple CDR grading tasks, achieving up to a 15% improvement in F1 score compared to the zero-shot baseline method.

[28] The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective

Ali Zahedzadeh,Behnam Bahrak

Main category: cs.CL

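TL;DR: This paper examines the trade-off between sufficiency and conciseness in LLM self-explanations such as chain-of-thought. Drawing on the information bottleneck principle, it treats explanations as compressed representations and tests this view in English and Persian using length-constrained explanations evaluated by multiple models.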

Details

Motivation: Self-explanations improve multi-step question answering but are often verbose and costly to generate, which raises the question of how much explanation is actually needed. Method: Following the information bottleneck principle, explanations are modelled as compressed representations that keep only the information needed to produce the answer. The authors design a length-constrained evaluation pipeline and use multiple language models to assess explanation sufficiency on the ARC Challenge dataset, in its original English and in a Persian translation. Result: Moderately shortened explanations remain sufficient and preserve accuracy while being substantially shorter, but excessive compression degrades performance. The pattern holds for both English and the resource-limited Persian setting. Conclusion: Longer explanations are not necessarily better; there is a compression range that balances sufficiency and conciseness, a finding that can guide efficient explainable AI. Abstract: Large Language Models increasingly rely on self-explanations, such as chain-of-thought reasoning, to improve performance on multi-step question answering. While these explanations enhance accuracy, they are often verbose and costly to generate, raising the question of how much explanation is truly necessary. In this paper, we examine the trade-off between sufficiency, defined as the ability of an explanation to justify the correct answer, and conciseness, defined as the reduction in explanation length. Building on the information bottleneck principle, we conceptualize explanations as compressed representations that retain only the information essential for producing correct answers. To operationalize this view, we introduce an evaluation pipeline that constrains explanation length and assesses sufficiency using multiple language models on the ARC Challenge dataset. To broaden the scope, we conduct experiments in both English, using the original dataset, and Persian, as a resource-limited language through translation. Our experiments show that more concise explanations often remain sufficient, preserving accuracy while substantially reducing explanation length, whereas excessive compression leads to performance degradation.
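
For reference, the information bottleneck principle invoked here is usually written as the objective below. Mapping it onto explanations is an interpretive assumption: $X$ stands for the question together with the model's full explanation, $T$ for the shortened explanation, and $Y$ for the correct answer. The abstract does not state the exact formulation the authors use.

```latex
\min_{p(t \mid x)} \; I(X; T) \;-\; \beta \, I(T; Y), \qquad \beta > 0
```

Minimizing $I(X;T)$ rewards conciseness, the $\beta\, I(T;Y)$ term rewards sufficiency, and $\beta$ sets the trade-off between the two.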

[29] Named Entity Recognition for Payment Data Using NLP

Srikumar Nayak

Main category: cs.CL

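TL;DR: This paper systematically evaluates several named entity recognition (NER) models for extracting information from financial payment data and proposes PaymentBERT, a new hybrid architecture. On 50,000 payment transactions in multiple formats it reaches a 95.7% F1-score, clearly ahead of traditional methods, while supporting real-time processing.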

Details

Motivation: To improve the accurate extraction of structured information from unstructured payment data in automated transaction processing, in support of regulatory needs such as sanctions screening and anti-money laundering (AML) compliance. Method: The paper compares CRF, BiLSTM-CRF, BERT, and FinBERT, and builds and fine-tunes PaymentBERT, a hybrid architecture that combines domain-specific financial embeddings with contextual representations. Experiments, cross-format generalization analysis, and ablation studies are run on 50,000 annotated transactions covering SWIFT MT103, ISO 20022, and domestic payment systems. Result: Fine-tuned BERT reaches a 94.2% F1-score, 12.8 percentage points above CRF. PaymentBERT raises this to 95.7% while keeping real-time processing capability, and the models generalize well across payment formats. Conclusion: Domain-adapted transformer models, PaymentBERT in particular, are the strongest option evaluated for payment NER, combining high accuracy with practicality and giving financial institutions a deployable path for automated compliance and payment systems. Abstract: Named Entity Recognition (NER) has emerged as a critical component in automating financial transaction processing, particularly in extracting structured information from unstructured payment data. This paper presents a comprehensive analysis of state-of-the-art NER algorithms specifically designed for payment data extraction, including Conditional Random Fields (CRF), Bidirectional Long Short-Term Memory with CRF (BiLSTM-CRF), and transformer-based models such as BERT and FinBERT. We conduct extensive experiments on a dataset of 50,000 annotated payment transactions across multiple payment formats including SWIFT MT103, ISO 20022, and domestic payment systems. Our experimental results demonstrate that fine-tuned BERT models achieve an F1-score of 94.2% for entity extraction, outperforming traditional CRF-based approaches by 12.8 percentage points. Furthermore, we introduce PaymentBERT, a novel hybrid architecture combining domain-specific financial embeddings with contextual representations, achieving state-of-the-art performance with 95.7% F1-score while maintaining real-time processing capabilities. We provide detailed analysis of cross-format generalization, ablation studies, and deployment considerations. This research provides practical insights for financial institutions implementing automated sanctions screening, anti-money laundering (AML) compliance, and payment processing systems.

[30] GRRM: Group Relative Reward Modeling for Machine Translation

Sen Yang,Shanbo Cheng,Lu Xu,Jianbing Zhang,Shujian Huang

Main category: cs.CL

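TL;DR: This paper proposes the Group Quality Metric (GQM) paradigm and an instance of it, the Group Relative Reward Model (GRRM), to make GRPO more effective on open-ended tasks such as machine translation. GRRM scores a group of candidate translations jointly to obtain fine-grained relative quality judgments and is embedded in the GRPO training loop, improving both translation quality and reasoning ability.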

Details

Motivation: Standard Scalar Quality Metrics (SQM) evaluate candidate translations in isolation, so on open-ended tasks such as machine translation they struggle to capture fine linguistic differences. The resulting inaccurate intra-group ranking limits GRPO. Method: The authors propose the GQM paradigm and build GRRM, a reward model that processes the whole candidate group jointly and uses comparative analysis to judge relative quality at an adaptive granularity. GRRM is then integrated into the GRPO training loop to optimize the translation policy. Result: GRRM achieves competitive ranking accuracy relative to all baselines. The framework improves general translation quality and also yields reasoning capabilities comparable to state-of-the-art reasoning models. Conclusion: GRRM addresses the lack of comparative context in SQM on open-ended tasks, shows that joint intra-group evaluation matters for improving LLM translation and reasoning, and offers a new way to apply GRPO to complex generation tasks. Abstract: While Group Relative Policy Optimization (GRPO) offers a powerful framework for LLM post-training, its effectiveness in open-ended domains like Machine Translation hinges on accurate intra-group ranking. We identify that standard Scalar Quality Metrics (SQM) fall short in this context; by evaluating candidates in isolation, they lack the comparative context necessary to distinguish fine-grained linguistic nuances. To address this, we introduce the Group Quality Metric (GQM) paradigm and instantiate it via the Group Relative Reward Model (GRRM). Unlike traditional independent scorers, GRRM processes the entire candidate group jointly, leveraging comparative analysis to rigorously resolve relative quality and adaptive granularity. Empirical evaluations confirm that GRRM achieves competitive ranking accuracy among all baselines. Building on this foundation, we integrate GRRM into the GRPO training loop to optimize the translation policy. Experimental results demonstrate that our framework not only improves general translation quality but also unlocks reasoning capabilities comparable to state-of-the-art reasoning models. We release code, datasets, and model checkpoints at https://github.com/NJUNLP/GRRM.
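
Why intra-group ranking matters for GRPO can be seen from how group-relative advantages are typically computed: each candidate's reward is normalized against the mean and standard deviation of its own group. The sketch below assumes that common mean/std formulation and uses invented reward values; it does not reproduce GRRM's scoring, and `eps` is an illustrative numerical guard.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four candidate translations of one source sentence.
TIED_SCORES = [0.80, 0.80, 0.80, 0.80]  # a scorer that cannot separate them
RANKED_SCORES = [1.0, 0.0, 3.0, 2.0]    # hypothetical scores from a joint ranking

flat = group_relative_advantages(TIED_SCORES)
spread = group_relative_advantages(RANKED_SCORES)
```

When every candidate receives the same score, all advantages are zero and the group contributes no learning signal; once the candidates are separated, the best one gets a positive advantage and the worst a negative one.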

[31] Geometry-Preserving Aggregation for Mixture-of-Experts Embedding Models

Sajjad Kachuee,Mohammad Sharifkhani

Main category: cs.CL

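TL;DR: This paper proposes Spherical Barycentric Aggregation (SBA), a new method that addresses the mismatch between linear weighted aggregation and the geometry of expert representations in MoE embedding models. Experiments show that SBA improves performance on several tasks while keeping training stable.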

Details

Motivation: Existing MoE embedding models combine experts by linear weighted aggregation, which implicitly assumes a linear subspace structure in the embedding space. In practice, expert outputs lie on a shared hyperspherical manifold with tightly concentrated norms and substantial angular separation, so linear aggregation causes vectors to collapse inward, distorts them, and reduces comparability. Method: Spherical Barycentric Aggregation (SBA) handles the radial and angular components separately, preserving the hyperspherical geometry while staying compatible with existing routing mechanisms. Result: SBA gives consistent gains on several MTEB tasks (semantic similarity, clustering, duplicate question detection) at identical training cost and with full stability. Geometric analysis confirms that SBA avoids aggregation-induced collapse and maintains hyperspherical consistency. Conclusion: MoE embedding architectures should use geometry-aware aggregation, and SBA provides an effective, plug-and-play option. Abstract: Mixture-of-Experts (MoE) embedding models combine expert outputs using weighted linear summation, implicitly assuming a linear subspace structure in the embedding space. This assumption is shown to be inconsistent with the geometry of expert representations. Geometric analysis of a modern MoE embedding model reveals that expert outputs lie on a shared hyperspherical manifold characterized by tightly concentrated norms and substantial angular separation. Under this geometry, linear aggregation induces inward collapse toward the manifold interior, distorting vector magnitude and direction and reducing embedding comparability. To address this inconsistency, Spherical Barycentric Aggregation (SBA) is introduced as a geometry-preserving aggregation operator that separates radial and angular components to maintain hyperspherical structure while remaining fully compatible with existing routing mechanisms. Experiments on selected tasks from the Massive Text Embedding Benchmark (MTEB), including semantic similarity, clustering, and duplicate question detection, demonstrate consistent performance improvements with identical training cost and full stability. Additional geometric analyses confirm that SBA prevents aggregation-induced collapse and preserves hyperspherical consistency, highlighting the importance of geometry-aware aggregation in MoE embedding architectures.
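
The inward collapse described in this abstract is easy to reproduce on a toy example. The sketch below contrasts plain linear aggregation with one plausible reading of "separating radial and angular components": average the norms for the radius, and renormalize the weighted sum of unit directions for the angle. This is an assumption for illustration, not the paper's SBA operator; it also assumes routing weights that sum to 1 and non-zero expert vectors.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def linear_aggregate(experts, weights):
    """Standard MoE combination: weighted sum of expert vectors."""
    dim = len(experts[0])
    return [sum(w * e[i] for w, e in zip(weights, experts)) for i in range(dim)]


def spherical_aggregate(experts, weights):
    """Hypothetical SBA-style combination: radius and direction handled apart."""
    # Radial part: weighted average of the expert norms.
    radius = sum(w * norm(e) for w, e in zip(weights, experts))
    # Angular part: weighted sum of unit directions, renormalized to length 1.
    units = [[x / norm(e) for x in e] for e in experts]
    direction = linear_aggregate(units, weights)
    d_norm = norm(direction)
    if d_norm == 0.0:
        raise ValueError("directions cancel out; no mean direction exists")
    return [radius * x / d_norm for x in direction]


EXPERTS = [[1.0, 0.0], [0.0, 1.0]]  # two unit-norm, orthogonal toy outputs
WEIGHTS = [0.5, 0.5]

linear_norm = norm(linear_aggregate(EXPERTS, WEIGHTS))        # shrinks below 1
spherical_norm = norm(spherical_aggregate(EXPERTS, WEIGHTS))  # stays at 1
```

With two orthogonal unit vectors and equal weights, the linear result has norm of about 0.707 and falls inside the sphere, while the spherical variant stays on it.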

[32] Context Shapes LLMs Retrieval-Augmented Fact-Checking Effectiveness

Pietro Bernardelle,Stefano Civelli,Kevin Roitero,Gianluca Demartini

Main category: cs.CL

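TL;DR: This study examines how large language models (LLMs) perform on fact verification with long contexts. Accuracy declines as context length grows, and the position of the evidence in the prompt matters (beginning or end beats the middle), which underlines the importance of prompt structure in retrieval-augmented fact-checking systems.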

Details

Motivation: LLMs perform well on many reasoning tasks, but their behaviour on long contexts remains inconsistent. This paper focuses on fact verification and studies how context length and evidence position affect LLM verification performance. Method: On three datasets (HOVER, FEVEROUS, and ClimateFEVER), the authors evaluate five open-source models from the Llama-3.1, Qwen2.5, and Qwen3 families at 7B, 32B, and 70B parameters, measuring both parametric factual knowledge and the effect of context length and evidence placement on verification accuracy. Result: LLMs hold non-trivial parametric factual knowledge; verification accuracy declines as the context grows; and accuracy is clearly higher when the evidence sits at the beginning or end of the prompt than in the middle. Conclusion: Context length and evidence position are key factors in LLM fact verification, so prompt structure design is critical for retrieval-augmented fact-checking systems. Abstract: Large language models (LLMs) show strong reasoning abilities across diverse tasks, yet their performance on extended contexts remains inconsistent. While prior research has emphasized mid-context degradation in question answering, this study examines the impact of context in LLM-based fact verification. Using three datasets (HOVER, FEVEROUS, and ClimateFEVER) and five open-source models across different parameter sizes (7B, 32B and 70B parameters) and model families (Llama-3.1, Qwen2.5 and Qwen3), we evaluate both parametric factual knowledge and the impact of evidence placement across varying context lengths. We find that LLMs exhibit non-trivial parametric knowledge of factual claims and that their verification accuracy generally declines as context length increases. Similarly to what has been shown in previous works, in-context evidence placement plays a critical role with accuracy being consistently higher when relevant evidence appears near the beginning or end of the prompt and lower when placed mid-context. These results underscore the importance of prompt structure in retrieval-augmented fact-checking systems.
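
An evidence-placement experiment of the kind described here needs prompts that differ only in where the relevant passage sits. The helper below is a minimal sketch of such a prompt builder; the function name, the passage numbering, the SUPPORTED/REFUTED wording, and the toy passages are all assumptions, not the authors' templates.

```python
def build_prompt(claim, evidence, distractors, position):
    """Place the evidence passage at the start, middle, or end of the context."""
    if position == "start":
        index = 0
    elif position == "middle":
        index = len(distractors) // 2
    elif position == "end":
        index = len(distractors)
    else:
        raise ValueError("position must be 'start', 'middle', or 'end'")
    passages = distractors[:index] + [evidence] + distractors[index:]
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"{context}\n\nClaim: {claim}\nAnswer SUPPORTED or REFUTED."


# Toy inputs (invented): one relevant passage among four distractors.
DISTRACTORS = ["Filler A.", "Filler B.", "Filler C.", "Filler D."]
EVIDENCE = "Relevant passage."
prompts = {
    pos: build_prompt("Toy claim.", EVIDENCE, DISTRACTORS, pos)
    for pos in ("start", "middle", "end")
}
```

Context length can then be varied independently by changing the number of distractor passages while holding the evidence position fixed.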

[33] LogitsCoder: Towards Efficient Chain-of-Thought Path Search via Logits Preference Decoding for Code Generation

Jizheng Chen,Weiming Zhang,Xinyi Dai,Weiwen Liu,Kounianhua Du,Yasheng Wang,Ruiming Tang,Yong Yu,Weinan Zhang

Main category: cs.CL

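TL;DR: This paper proposes LogitsCoder, a framework that improves chain-of-thought reasoning for code generation through lightweight logit-level control, addressing the underthinking and overthinking problems of existing test-time scaling methods.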

Details

Motivation: Existing test-time scaling (TTS) methods for code generation face two challenges, 'underthinking' (reasoning chains that are too shallow) and 'overthinking' (verbose, inefficient reasoning), which limit the balance between reasoning depth and efficiency. Method: LogitsCoder has two components: 1) Logits Preference Decoding, which steers token selection at the logit level toward statistically preferred patterns; 2) Logits Rank Based Path Selection and Thoughts Aggregation, which selects and aggregates diverse reasoning paths based on logit ranking. Result: Experiments show that LogitsCoder produces more efficient, higher-quality reasoning chains and outperforms baseline methods on code generation. Conclusion: Through fine-grained logit-level control, LogitsCoder achieves chain-of-thought reasoning that balances depth and efficiency, offering a new paradigm for code generation. Abstract: Code generation remains a challenging task that requires precise and structured reasoning. Existing Test Time Scaling (TTS) methods, including structured tree search, have made progress in exploring reasoning paths but still face two major challenges: (1) underthinking, where reasoning chains tend to be shallow and fail to capture the full complexity of problems; and (2) overthinking, where overly verbose reasoning leads to inefficiency and increased computational costs. To address these issues, we propose LogitsCoder, a novel framework that enhances chain-of-thought reasoning through lightweight, logit-level control mechanisms for code generation. LogitsCoder iteratively generates and refines reasoning steps by first steering token selection toward statistically preferred patterns via Logits Preference Decoding, then selecting and aggregating diverse reasoning paths using Logits Rank Based Path Selection and Thoughts Aggregation. This results in coherent and effective reasoning chains that balance depth and efficiency. Extensive experiments demonstrate that LogitsCoder produces more efficient and higher-quality reasoning chains, leading to superior code generation performance compared to baseline methods.

[34] LM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts

Yang Liu,Jiaye Yang,Weikang Li,Jiahui Liang,Yang Li,Lingyong Yan

Main category: cs.CL

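TL;DR: LM-Lexicon is a new definition modeling approach that combines data clustering, semantic expert learning, and model merging with a sparse mixture-of-experts architecture, yielding substantial gains on five benchmarks (+7% BLEU).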

Details

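Motivation: Improve the performance and efficiency of definition modeling, especially for semantic-intensive applications, and address the shortcomings of existing methods in fine-grained semantic modeling and expert coordination. Method: Data clustering decomposes the definition modeling task into several semantic sub-domains; a small language model is trained as a semantic expert for each sub-domain; a semantic-aware domain-level routing mechanism and a sparse mixture-of-experts architecture are then used to merge the models. Result: BLEU improves by 7% over the prior state-of-the-art model on five widely used benchmarks; the clustering strategy yields nearly 10% improvement in definition quality; domain-level routing achieves 1% higher expert efficacy than conventional token-level routing; test-time compute and semantic expert scaling bring further gains. Conclusion: LM-Lexicon advances definition modeling and offers insights and empirical evidence for designing efficient language models for semantic-intensive tasks. Abstract: We introduce LM-Lexicon, an innovative definition modeling approach that incorporates data clustering, semantic expert learning, and model merging using a sparse mixture-of-experts architecture. By decomposing the definition modeling task into specialized semantic domains, where small language models are trained as domain experts, LM-Lexicon achieves substantial improvements (+7% BLEU score compared with the prior state-of-the-art model) over existing methods on five widely used benchmarks. Empirically, we demonstrate that 1) the clustering strategy enables fine-grained expert specialization with nearly 10% improvement in definition quality; 2) the semantic-aware domain-level routing mechanism achieves higher expert efficacy (+1%) than conventional token-level routing; and 3) further performance gains can be obtained through test-time compute and semantic expert scaling. Our work advances definition modeling while providing insights into the development of efficient language models for semantic-intensive applications.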

[35] From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset

Jandad Jahani,Mursal Dawodi,Jawid Ahmad Baktash

Main category: cs.CL

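TL;DR: This paper presents a release-level analysis of the Pashto portion of the Mozilla Common Voice corpus (v24.0). The corpus has grown rapidly (to 2,768.7 hours) but shows key data-quality and representativeness issues: highly concentrated contributions (Gini = 0.941), an age distribution skewed toward young adults, many missing gender labels, and moderate prompt reuse. The analysis provides an empirical audit and improvement priorities for building ASR datasets for low-resource languages.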

Details

Motivation: Pashto, with more than 60 million speakers, has long lacked large-scale, openly licensed speech data to support modern automatic speech recognition (ASR) development. Method: A release-level quantitative analysis of the Pashto component of Mozilla Common Voice (v24.0, December 2025), covering growth in scale, validation throughput, contributor inequality (Gini coefficient), completeness of demographic metadata (age, gender), and sentence-level prompt reuse (for example, the share of unique sentences that covers 50% of validated clips). Result: Pashto speech grew from 1.49 hours in mid-2023 to 2,768.7 hours in 2025 (including 975.89 validated hours); contributor participation is extremely uneven (Gini = 0.941); the age distribution is skewed toward young adults; 41.97% of clips lack a self-reported gender label; 35.88% of unique sentences account for 50% of validated clips. Conclusion: Although the corpus has scaled quickly, it remains immature in contributor diversity, metadata completeness, and subgroup auditability; expanding validation capacity and broadening demographic representation are the key next steps. Abstract: Large, openly licensed speech datasets are essential for building automatic speech recognition (ASR) systems, yet many widely spoken languages remain underrepresented in public resources. Pashto, spoken by more than 60 million people, has historically lacked large-scale openly licensed speech data suitable for modern ASR development. This paper presents a release-level analysis of the Pashto component of the Mozilla Common Voice corpus, focusing on version 24.0 (December 2025) and contextualizing trends across major releases. We document rapid growth from 1.49 recorded hours in mid-2023 to 2,768.7 total hours in 2025, including 975.89 validated hours available for supervised ASR training. Beyond scale, we analyze validation throughput, contributor participation inequality, demographic metadata completeness, and sentence-level concentration in the validated subset. We find that participation is extremely concentrated (Gini = 0.941), age representation is strongly skewed toward young adults, and 41.97% of clips lack self-reported gender labels, limiting subgroup auditing based on metadata. At the textual level, prompt reuse is moderate: 35.88% of unique sentences account for 50% of validated clips, suggesting that structural concentration is driven primarily by uneven contributor activity rather than dominance of a small prompt set.
These results provide a quantitative audit of a rapidly scaling low-resource speech corpus and highlight practical priorities for improving dataset maturity, including expanded validation capacity and broader demographic participation.
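
The two concentration measures quoted above (a Gini coefficient over contributors and the share of unique sentences covering half of the clips) can be computed as in the following sketch. It assumes per-contributor and per-sentence clip counts are already available as lists of integers, and that the sentence share is taken over sentences ranked from most to least frequent; the function names are illustrative and are not from the paper.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = equal, near 1 = concentrated)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Closed form on ascending data: G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def share_of_items_covering(counts, fraction=0.5):
    """Smallest share of items, most frequent first, whose counts reach `fraction` of the total."""
    xs = sorted(counts, reverse=True)
    total = sum(xs)
    running = 0
    for k, x in enumerate(xs, start=1):
        running += x
        if running >= fraction * total:
            return k / len(xs)
    return 1.0
```

For instance, `gini([0, 0, 0, 10])` evaluates to 0.75 (one contributor supplies everything among four), while four equal contributors give 0.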

[36] Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric

Ruipeng Jia,Yunyi Yang,Yuxin Wu,Yongbo Gai,Siyuan Tao,Mengyu Zhou,Jianhe Lin,Xiaoxi Jiang,Guanjun Jiang

Main category: cs.CL

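TL;DR: This paper proposes the Open Rubric System (OpenRS), an LLM-as-a-Judge framework built on interpretable, verifiable rubrics. It targets the brittleness and reward hacking that scalar reward models cause in open-ended alignment through information compression; its core idea is to replace implicitly learned scalar rewards with an explicit, principle-governed reasoning process.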

Details

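Motivation: Scalar reward models compress multi-dimensional human preferences into a single score, creating an information bottleneck that leads to brittleness and reward hacking in open-ended alignment; robust alignment for non-verifiable tasks is fundamentally a principle generalization problem, so reward should be modeled as an inspectable reasoning process rather than a black-box function. Method: OpenRS comprises Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs); a constitution-like meta-rubric governs how rubrics are instantiated, weighted, and enforced; adaptive rubrics are generated on the fly from the semantic differences between candidate responses; comparisons are made pairwise per criterion and preferences are aggregated externally, avoiding pointwise weighted scalarization; a two-level meta-rubric refinement pipeline (automated evolutionary refinement for general principles plus a human-in-the-loop procedure for domain principles) is combined with pointwise verifiable rubrics that serve as guardrails and as a reward source. Result: OpenRS improves discriminability on open-ended tasks and avoids the bias introduced by scalarization; it provides hard-constraint guardrails and verifiable reward components; it keeps principles consistent yet editable across domains; and it is instantiated as reward supervision in pairwise reinforcement learning. Conclusion: Modeling reward as a structured reasoning process that is constrained by explicit meta-principles, verifiable, and editable is better suited to robust, trustworthy, and debuggable alignment than learning a scalar reward function. Abstract: Scalar reward models compress multi-dimensional human preferences into a single opaque score, creating an information bottleneck that often leads to brittleness and reward hacking in open-ended alignment. We argue that robust alignment for non-verifiable tasks is fundamentally a principle generalization problem: reward should not be a learned function internalized into a judge, but an explicit reasoning process executed under inspectable principles. To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs), which provide both hard-constraint guardrails and verifiable reward components when ground-truth or programmatic checks are available. OpenRS uses an explicit meta-rubric -- a constitution-like specification that governs how rubrics are instantiated, weighted, and enforced -- and instantiates adaptive rubrics on the fly by conditioning on the semantic differences between two candidate responses. It then performs criterion-wise pairwise comparisons and aggregates criterion-level preferences externally, avoiding pointwise weighted scalarization while improving discriminability in open-ended settings.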
To keep principles consistent yet editable across various domains, we introduce a two-level meta-rubric refinement pipeline (automated evolutionary refinement for general principles and a reproducible human-in-the-loop procedure for domain principles), complemented with pointwise verifiable rubrics that act as both guardrails against degenerate behaviors and a source of verifiable reward for objective sub-tasks. Finally, we instantiate OpenRS as reward supervision in pairwise RL training.
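
As a rough illustration of aggregating criterion-level pairwise preferences outside the judge, with verifiable checks acting as guardrails, consider the sketch below. It is an assumption-laden toy: the verdict labels, the weighted vote, and the rule that a failed hard constraint decides the comparison outright are all hypothetical choices, not the aggregation rule used by OpenRS.

```python
def aggregate_pairwise(criterion_verdicts, weights=None, a_passes=True, b_passes=True):
    """Combine per-criterion verdicts ("A", "B", or "tie") into one preference.

    `a_passes` / `b_passes` stand in for the outcome of verifiable hard-constraint
    checks (hypothetical); a response that fails while the other passes loses.
    """
    if a_passes != b_passes:
        return "A" if a_passes else "B"
    weights = weights or {}
    margin = 0.0
    for criterion, verdict in criterion_verdicts.items():
        weight = weights.get(criterion, 1.0)  # unweighted criteria count once
        if verdict == "A":
            margin += weight
        elif verdict == "B":
            margin -= weight
    if margin > 0:
        return "A"
    if margin < 0:
        return "B"
    return "tie"
```

Keeping the per-criterion verdicts around, rather than only the final label, is what makes the judgment inspectable after the fact.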

[37] Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework

Grzegorz Statkiewicz,Alicja Dobrzeniecka,Karolina Seweryn,Aleksandra Krasnodębska,Karolina Piosek,Katarzyna Bogusz,Sebastian Cygert,Wojciech Kusa

Main category: cs.CL

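TL;DR: By automatically translating and filtering existing multimodal datasets, and adding synthetic Polish OCR and culture-specific data, this paper builds vision-language models (VLMs) for Polish. They improve on the baseline by 9.5% on a Polish-adapted MMBench, and human evaluation confirms higher caption quality.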

Details

Motivation: Most vision-language models are trained on English data and adapt poorly to other languages and cultural contexts, which limits their use by non-English speakers and hinders the development of culturally diverse multimodal systems. Method: The LLaVA-Next methodology is reproduced and adapted; a fully automated translation and filtering pipeline processes multimodal datasets, supplemented by synthetic Polish OCR and culture-related task data; training relies almost entirely on automatic translation with minimal manual intervention. Result: +9.5% over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench; human evaluation shows higher linguistic correctness of generated captions; the models and evaluation dataset are released publicly. Conclusion: Large-scale automated translation combined with lightweight filtering can effectively bootstrap high-quality multimodal models for low-resource languages, although challenges remain in cultural coverage and evaluation. Abstract: Most vision-language models (VLMs) are trained on English-centric data, limiting their performance in other languages and cultural contexts. This restricts their usability for non-English-speaking users and hinders the development of multimodal systems that reflect diverse linguistic and cultural realities. In this work, we reproduce and adapt the LLaVA-Next methodology to create a set of Polish VLMs. We rely on a fully automated pipeline for translating and filtering existing multimodal datasets, and complement this with synthetic Polish data for OCR and culturally specific tasks. Despite relying almost entirely on automatic translation and minimal manual intervention to the training data, our approach yields strong results: we observe a +9.5% improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, along with higher-quality captions in generative evaluations, as measured by human annotators in terms of linguistic correctness. These findings highlight that large-scale automated translation, combined with lightweight filtering, can effectively bootstrap high-quality multimodal models for low-resource languages. Some challenges remain, particularly in cultural coverage and evaluation. To facilitate further research, we make our models and evaluation dataset publicly available.

[38] GTS: Inference-Time Scaling of Latent Reasoning with a Learnable Gaussian Thought Sampler

Minghan Wang,Ye Bai,Thuy-Trang Vu,Ehsan Shareghi,Gholamreza Haffari

Main category: cs.CL

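TL;DR: This paper proposes the Gaussian Thought Sampler (GTS), which models inference-time scaling in latent reasoning as conditional sampling from learnable densities, replacing heuristic random perturbations and improving the diversity and effectiveness of reasoning trajectories.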

Details

Motivation: Existing inference-time scaling methods rely on heuristic random perturbations (such as dropout or Gaussian noise) and do not explicitly model exploration, which makes them inefficient under a limited sampling budget; stronger perturbations do not necessarily yield better trajectories and may disrupt reasoning structure. Method: The Gaussian Thought Sampler (GTS) predicts context-dependent perturbation distributions over continuous reasoning states; it is trained with GRPO-style policy optimization while the backbone is kept frozen. Result: On GSM8K, GTS achieves more reliable inference-time scaling than heuristic baselines across two latent reasoning architectures. Conclusion: Improving inference-time scaling for latent reasoning depends on structured, optimizable exploration mechanisms rather than on simply increasing randomness. Abstract: Inference-time scaling (ITS) in latent reasoning models typically introduces stochasticity through heuristic perturbations, such as dropout or fixed Gaussian noise. While these methods increase trajectory diversity, their exploration behavior is not explicitly modeled and can be inefficient under finite sampling budgets. We observe that stronger perturbations do not necessarily translate into more effective candidate trajectories, as unguided noise may disrupt internal decision structure rather than steer it. To provide a more structured alternative, we model latent thought exploration as conditional sampling from learnable densities and instantiate this idea as a Gaussian Thought Sampler (GTS). GTS predicts context-dependent perturbation distributions over continuous reasoning states and is trained with GRPO-style policy optimization while keeping the backbone frozen. Experiments on GSM8K with two latent reasoning architectures show that GTS achieves more reliable inference-time scaling than heuristic baselines. These findings indicate that improving latent ITS requires structured and optimizable exploration mechanisms rather than simply amplifying stochasticity.
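
To make the contrast with fixed Gaussian noise concrete, here is a minimal sketch of drawing a perturbed latent state from a diagonal Gaussian whose mean and log standard deviation are supplied per dimension. In a learned sampler these two vectors would be predicted from context by a trained module; here they are plain arguments, and the function name and the diagonal parameterization are assumptions rather than details from the paper.

```python
import math
import random


def sample_perturbed_state(state, mean, log_std, rng):
    """Return state + delta, where delta ~ N(mean, diag(exp(log_std)) ** 2).

    Fixed-noise baselines correspond to mean = 0 and a constant log_std;
    a learned sampler would predict both vectors from the current context.
    """
    return [
        h + m + math.exp(s) * rng.gauss(0.0, 1.0)
        for h, m, s in zip(state, mean, log_std)
    ]
```

Calling it several times with the same `mean` and `log_std` but a shared `random.Random` instance yields distinct candidate trajectories centered on the predicted shift.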

[39] Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

Nitay Calderon,Eyal Ben-David,Zorik Gekhman,Eran Ofek,Gal Yona

Main category: cs.CL

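TL;DR: This paper proposes a behavioral framework for profiling the factual knowledge of large language models (LLMs) that separates whether a fact is encoded from whether it is accessible, and builds the WikiProfile benchmark for empirical analysis. Frontier models are found to encode facts almost to saturation, yet recall remains the main bottleneck, and inference-time reasoning (thinking) substantially improves recall.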

Details

Motivation: Existing factuality evaluations treat all errors alike and cannot tell whether an error stems from missing knowledge (empty shelves) or from limited access to knowledge (lost keys), so a finer-grained, fact-level profiling framework is needed. Method: A fact-based behavioral framework classifies each fact by encoding status (encoded or not) and by accessibility (cannot be recalled, directly recalled, or recalled only with thinking); the WikiProfile benchmark is generated by an automated pipeline that uses a prompted LLM grounded in web search. Result: Across 4 million responses from 13 LLMs, GPT-5 and Gemini-3 encode 95-98% of facts, yet recall failures remain common, especially for long-tail facts and reverse questions; adding thinking recovers a substantial fraction of recall failures. Conclusion: The factuality bottleneck of LLMs is shifting from encoding knowledge to accessing it, so future gains should focus on how models use what they already encode rather than on scale alone. Abstract: Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95-98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.
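
The fact-level taxonomy above reduces to a small decision rule once the three behavioral signals are known. The sketch below assumes each signal has already been measured as a boolean; how the paper actually tests whether a fact is encoded or recalled is not described in the abstract, so the inputs and label strings here are hypothetical.

```python
def profile_fact(encoded, recalled_directly, recalled_with_thinking):
    """Assign one profile label to a fact from three boolean signals."""
    if not encoded:
        return "not encoded"  # the "empty shelves" case
    if recalled_directly:
        return "directly recalled"
    if recalled_with_thinking:
        return "recalled only with thinking"
    return "encoded but not recalled"  # the "lost keys" case
```

Counting labels over a set of facts then separates errors caused by missing knowledge from errors caused by failed access.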

[40] CCiV: A Benchmark for Structure, Rhythm and Quality in LLM-Generated Chinese Ci Poetry

Shangqing Zhao,Yupei Ren,Yuhao Zhou,Xiaopeng Bai,Man Lan

Main category: cs.CL

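TL;DR: This paper introduces the CCiV benchmark for evaluating large language models on classical Chinese Ci poetry generation along three dimensions (structure, rhythm, and artistic quality), and reveals challenges in generating historical variants, following tonal patterns, and aligning formal correctness with literary quality.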

Details

Motivation: Generating classical Chinese Ci poetry is a major challenge for large language models, and their abilities in structure, rhythm, and artistic quality need systematic evaluation. Method: The CCiV benchmark covers 30 Cipai and is used to evaluate 17 LLMs; the study analyzes the historical variants that models produce and their adherence to tonal patterns, tests how form-aware prompting affects models of different strength, and examines the relationship between formal correctness and literary quality. Result: Models often generate valid but unexpected historical variants; following tonal patterns is harder than following structural rules; form-aware prompting can improve control for stronger models but may degrade weaker ones; the link between formal correctness and literary quality is weak and inconsistent. Conclusion: CCiV highlights the need for variant-aware evaluation and for more holistic approaches to constrained creative generation. Abstract: The generation of classical Chinese Ci poetry, a form demanding a sophisticated blend of structural rigidity, rhythmic harmony, and artistic quality, poses a significant challenge for large language models (LLMs). To systematically evaluate and advance this capability, we introduce Chinese Cipai Variants (CCiV), a benchmark designed to assess LLM-generated Ci poetry across these three dimensions: structure, rhythm, and quality. Our evaluation of 17 LLMs on 30 Cipai reveals two critical phenomena: models frequently generate valid but unexpected historical variants of a poetic form, and adherence to tonal patterns is substantially harder than structural rules. We further show that form-aware prompting can improve structural and tonal control for stronger models, while potentially degrading weaker ones. Finally, we observe weak and inconsistent alignment between formal correctness and literary quality in our sample. CCiV highlights the need for variant-aware evaluation and more holistic constrained creative generation methods.

[41] Character-aware Transformers Learn an Irregular Morphological Pattern Yet None Generalize Like Humans

Akhilesh Kakolu Ramarao,Kevin Tang,Dinah Baer-Henney

Main category: cs.CL

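TL;DR: This paper asks whether neural networks can serve as cognitive models of morphological learning, focusing on the acquisition and generalization of the Spanish L-shaped morphome. Models with position-invariant positional encoding capture the L-shaped paradigm clustering better, but no model generalizes the pattern to novel forms the way humans do, revealing a gap between statistical pattern reproduction and morphological abstraction.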

Details

Motivation: Test whether neural networks can acquire and generalize, as humans do, an abstract morphological pattern with no apparent phonological, semantic, or syntactic motivation (the Spanish L-shaped morphome), in order to assess their plausibility as cognitive models. Method: Five encoder-decoder transformer variants are compared along two dimensions, sequential vs. position-invariant positional encoding and atomic vs. decomposed tag representations, and are trained and tested for generalization on the Spanish L-shaped morphome. Result: Position-invariant models recover the correct L-shaped paradigm clustering even when training data are sparse, whereas sequential models capture the pattern only partially; no model generalizes the L-shaped pattern to novel verbs as humans do. The models tend toward a mood-based generalization (across subjunctive forms) instead of the L-shaped pattern that spans person and mood (first-person singular indicative plus all subjunctive forms). Conclusion: Current transformer models can reproduce specific statistical patterns but lack human-like morphological abstraction, which limits their value as cognitive models. Abstract: Whether neural networks can serve as cognitive models of morphological learning remains an open question. Recent work has shown that encoder-decoder models can acquire irregular patterns, but evidence that they generalize these patterns like humans is mixed. We investigate this using the Spanish L-shaped morphome, where only the first-person singular indicative (e.g., pongo 'I put') shares its stem with all subjunctive forms (e.g., ponga, pongas) despite lacking apparent phonological, semantic, or syntactic motivation. We compare five encoder-decoder transformers varying along two dimensions: sequential vs. position-invariant positional encoding, and atomic vs. decomposed tag representations. Positional encoding proves decisive: position-invariant models recover the correct L-shaped paradigm clustering even when L-shaped verbs are scarce in training, whereas sequential positional encoding models only partially capture the pattern. Yet none of the models productively generalize this pattern to novel forms. Position-invariant models generalize the L-shaped stem across subjunctive cells but fail to extend it to the first-person singular indicative, producing a mood-based generalization rather than the L-shaped morphomic pattern. Humans do the opposite, generalizing preferentially to the first-person singular indicative over subjunctive forms. None of the models reproduce the human pattern, highlighting the gap between statistical pattern reproduction and morphological abstraction.

[42] A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing

Naeimeh Nourmohammadi,Md Meem Hossain,The Anh Han,Safina Showkat Ara,Zia Ush Shamszaman

Main category: cs.CL

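TL;DR: This paper proposes a multi-agent medical question answering framework that combines the complementary strengths of different large language models (GPT, LLaMA, DeepSeek R1) with evidence retrieval, uncertainty estimation, and bias detection to improve the reliability and interpretability of clinical answers. The full system reaches 87% accuracy, and its answers are grounded in retrieved evidence and pass through safety checks.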

Details

Motivation: Current large language models for medical question answering suffer from weak verification, insufficient evidence grounding, and unreliable confidence signals, which limits their clinical use. Method: Two phases. First, three LLM families are fine-tuned on MedQuAD data and their generation quality is benchmarked. Second, a modular multi-agent pipeline is built with a Clinical Reasoning agent (fine-tuned LLaMA), an Evidence Retrieval agent (querying PubMed), and a Refinement agent (DeepSeek R1), supported by uncertainty estimation (Monte Carlo dropout, perplexity scoring) and bias detection (lexical and sentiment analysis plus LIME/SHAP). Result: DeepSeek R1 scores best on generation metrics (ROUGE-1 0.536 ± 0.04, among others); the full system reaches 87% accuracy with relevance of about 0.80, perplexity falls to 4.13, and end-to-end latency is 36.5 seconds; evidence augmentation reduces uncertainty, and human intervention is supported for high-risk cases. Conclusion: Dividing work among specialized agents and adding verification layers can mitigate key weaknesses of single large models in medical settings, providing a practical, extensible architecture for trustworthy, traceable, and bias-aware clinical AI. Abstract: Large language models (LLMs) show promise for healthcare question answering, but clinical use is limited by weak verification, insufficient evidence grounding, and unreliable confidence signalling. We propose a multi-agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to improve answer reliability. Our approach has two phases. First, we fine-tune three representative LLM families (GPT, LLaMA, and DeepSeek R1) on MedQuAD-derived medical QA data (20k+ question-answer pairs across multiple NIH domains) and benchmark generation quality. DeepSeek R1 achieves the strongest scores (ROUGE-1 0.536 ± 0.04; ROUGE-2 0.226 ± 0.03; BLEU 0.098 ± 0.018) and substantially outperforms the specialised biomedical baseline BioGPT in zero-shot evaluation. Second, we implement a modular multi-agent pipeline in which a Clinical Reasoning agent (fine-tuned LLaMA) produces structured explanations, an Evidence Retrieval agent queries PubMed to ground responses in recent literature, and a Refinement agent (DeepSeek R1) improves clarity and factual consistency; an optional human validation path is triggered for high-risk or high-uncertainty cases. Safety mechanisms include Monte Carlo dropout and perplexity-based uncertainty scoring, plus lexical and sentiment-based bias detection supported by LIME/SHAP-based analyses.
In evaluation, the full system achieves 87% accuracy with relevance around 0.80, and evidence augmentation reduces uncertainty (perplexity 4.13) compared to base responses, with mean end-to-end latency of 36.5 seconds under the reported configuration. Overall, the results indicate that agent specialisation and verification layers can mitigate key single-model limitations and provide a practical, extensible design for evidence-based and bias-aware medical AI.
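
Perplexity-based uncertainty scoring, one of the safety signals mentioned above, follows from per-token log-probabilities by the standard definition: the exponential of the negative mean log-probability. The sketch below assumes natural-log probabilities are already available from the generating model; the function name is illustrative, and how the paper thresholds or combines this score with Monte Carlo dropout is not stated in the abstract.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a generated sequence from natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)
```

Lower values indicate that the model assigned higher probability to its own output; a response whose score exceeds a chosen threshold could be routed to the human validation path.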

[43] Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering

Tao Xu

Main category: cs.CL

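TL;DR: This paper proposes the Deferred Visual Ingestion (DVI) framework for multimodal document question answering, which invokes a vision-language model (VLM) only when a user asks a question, substantially reducing compute cost and improving reliability.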

Details

Motivation: Existing methods use a pre-ingestion strategy that runs a VLM on every page to generate descriptions, which is costly, makes retrieval unreliable, and cannot recover once it fails. Method: DVI uses demand-side ingestion: indexing extracts only lightweight metadata, pages are located through structured metadata indexes and BM25 full-text search, and the original images are then sent with the specific question to a VLM for targeted analysis. Result: On real industrial engineering drawings, DVI reaches comparable accuracy at zero ingestion VLM cost (46.7% vs. 48.9%), a 50% effectiveness rate on visually necessary queries (vs. 0%), and 100% page localization (98% search space compression), and it supports interactive refinement and progressive caching. Conclusion: DVI recasts the core problem of multimodal document question answering from "QA accuracy" to "page localization", improving efficiency, reliability, and interactivity for document understanding in industrial settings. Abstract: Existing multimodal document question answering methods universally adopt a supply-side ingestion strategy: running a Vision-Language Model (VLM) on every page during indexing to generate comprehensive descriptions, then answering questions through text retrieval. However, this "pre-ingestion" approach is costly (a 113-page engineering drawing package requires approximately 80,000 VLM tokens), end-to-end unreliable (VLM outputs may fail to be correctly retrieved due to format mismatches in the retrieval infrastructure), and irrecoverable once it fails. This paper proposes the Deferred Visual Ingestion (DVI) framework, adopting a demand-side ingestion strategy: the indexing phase performs only lightweight metadata extraction, deferring visual understanding to the moment users pose specific questions. DVI's core principle is "Index for locating, not understanding"--achieving page localization through structured metadata indexes and BM25 full-text search, then sending original images along with specific questions to a VLM for targeted analysis. Experiments on two real industrial engineering drawings (113 pages + 7 pages) demonstrate that DVI achieves comparable overall accuracy at zero ingestion VLM cost (46.7% vs. 48.9%), an effectiveness rate of 50% on visually necessary queries (vs. 0% for pre-ingestion), and 100% page localization (98% search space compression). DVI also supports interactive refinement and progressive caching, transforming the "QA accuracy" problem into a "page localization" problem--once the correct drawing page is found, obtaining the answer becomes a matter of interaction rounds.
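
The page-localization step can be sketched with a small, self-contained Okapi BM25 ranker over lightweight per-page text (for example, title-block metadata). This is only an illustration under assumptions: whitespace tokenization, the parameter defaults `k1=1.5` and `b=0.75`, and the function names are choices made here, not the paper's implementation, and the later VLM call on the top-ranked page image is left out.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberately simple assumption)."""
    return text.lower().split()


def bm25_rank(pages, query, k1=1.5, b=0.75):
    """Return page indices sorted by Okapi BM25 score for the query, best first."""
    if not pages:
        return []
    docs = [tokenize(page) for page in pages]
    n = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n
    # Document frequency: number of pages containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    query_terms = tokenize(query)
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

Only the first few indices returned would then be passed, as original page images together with the user's question, to the VLM.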

[44] GPT-5 vs Other LLMs in Long Short-Context Performance

Nima Esmi,Maryam Nezhad-Moghaddam,Fatemeh Borhani,Asadollah Shahbahrami,Amin Daemdoost,Georgi Gaydadjiev

Main category: cs.CL

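TL;DR: This paper evaluates four state-of-the-art large language models (Grok-4, GPT-4, Gemini 2.5, GPT-5) on long-context tasks. Despite very large theoretical context windows, accuracy drops to 50-53% on very long inputs (such as 20K social media posts), revealing a wide gap between theoretical capacity and practical performance.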

Details

Motivation: Determine whether large language models that nominally support very long contexts (millions of tokens) can robustly use long-context information in complex tasks that require integrating many details. Method: Grok-4, GPT-4, Gemini 2.5, and GPT-5 are evaluated on three datasets (two supplementary datasets for retrieving recipes and math problems, and a primary dataset of 20K social media posts for depression detection), with a focus on how accuracy, precision, and related metrics change with input length. Result: Once the social media input exceeds 5K posts (about 70K tokens), accuracy degrades significantly for all models, falling to 50-53% at 20K posts; GPT-5 keeps precision at about 95% despite the drop in accuracy; the "lost in the middle" problem is largely resolved in newer models. Conclusion: Current large models still face a serious performance bottleneck on long-context tasks, and a large theoretical context length does not guarantee practical usability; sensitive tasks call for attention to precision and other metrics, not accuracy alone. Abstract: With the significant expansion of the context window in Large Language Models (LLMs), these models are theoretically capable of processing millions of tokens in a single pass. However, research indicates a significant gap between this theoretical capacity and the practical ability of models to robustly utilize information within long contexts, especially in tasks that require a comprehensive understanding of numerous details. This paper evaluates the performance of four state-of-the-art models (Grok-4, GPT-4, Gemini 2.5, and GPT-5) on long short-context tasks. For this purpose, three datasets were used: two supplementary datasets for retrieving culinary recipes and math problems, and a primary dataset of 20K social media posts for depression detection. The results show that as the input volume on the social media dataset exceeds 5K posts (70K tokens), the performance of all models degrades significantly, with accuracy dropping to around 50-53% for 20K posts. Notably, in the GPT-5 model, despite the sharp decline in accuracy, its precision remained high at approximately 95%, a feature that could be highly effective for sensitive applications like depression detection. This research also indicates that the "lost in the middle" problem has been largely resolved in newer models. This study emphasizes the gap between the theoretical capacity and the actual performance of models on complex, high-volume data tasks and highlights the importance of metrics beyond simple accuracy for practical applications.
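
High precision alongside near-chance accuracy can look contradictory, so a small worked example may help. The confusion-matrix counts below are hypothetical, chosen only to reproduce roughly 52% accuracy with 95% precision; they are not taken from the paper.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Share of positive predictions that are correct (0.0 if none were made)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical counts: the model flags only 20 of 1000 posts as positive,
# gets 19 of those right, and misses 479 true positives.
tp, fp, fn, tn = 19, 1, 479, 501
example_accuracy = accuracy(tp, fp, fn, tn)   # 520 / 1000
example_precision = precision(tp, fp)         # 19 / 20
```

In this toy setting the model rarely raises a flag but is almost always right when it does, which is the behavior that precision rewards and accuracy hides.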

[45] Knowing When Not to Answer: Abstention-Aware Scientific Reasoning

Samir Abdaljalil,Erchin Serpedin,Hasan Kurban

Main category: cs.CL

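TL;DR: This paper proposes an abstention-aware framework for scientific claim verification. It decomposes claims into minimal conditions, audits each against evidence with natural language inference (NLI), and allows three decisions (support, refute, or abstain), emphasizing that abstaining when evidence is insufficient is more reliable than forcing an answer. Experiments show that abstention substantially reduces risk and that the framework is model-agnostic.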

Details

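Motivation: Current evaluations require large language models to give a definitive answer when verifying scientific claims, yet in scientific practice an unsupported conclusion is more harmful than abstaining; an evaluation framework that recognizes uncertainty and abstains appropriately is therefore needed. Method: An abstention-aware verification framework that 1) decomposes scientific claims into minimal verifiable conditions; 2) audits each condition against available evidence with natural language inference (NLI); 3) chooses to support, refute, or abstain based on confidence. It is evaluated on the SciFact and PubMedQA benchmarks with six models of different architectures (encoder-decoder, open-weight chat models, and proprietary APIs). Result: Raw accuracy differs only modestly across models, but confidence-based abstention substantially reduces error risk at moderate coverage; the ability to abstain affects final reliability more than the choice of model. Conclusion: The core challenge in scientific reasoning is not picking the best model but judging whether the evidence suffices to justify a conclusion; abstention-aware evaluation is a practical, model-agnostic way to assess scientific reliability and provides a unified experimental basis for future work on selective reasoning. Abstract: Large language models are increasingly used to answer and verify scientific claims, yet existing evaluations typically assume that a model must always produce a definitive answer. In scientific settings, however, unsupported or uncertain conclusions can be more harmful than abstaining. We study this problem through an abstention-aware verification framework that decomposes scientific claims into minimal conditions, audits each condition against available evidence using natural language inference (NLI), and selectively decides whether to support, refute, or abstain. We evaluate this framework across two complementary scientific benchmarks: SciFact and PubMedQA, covering both closed-book and open-domain evidence settings. Experiments are conducted with six diverse language models, including encoder-decoder, open-weight chat models, and proprietary APIs. Across all benchmarks and models, we observe that raw accuracy varies only modestly across architectures, while abstention plays a critical role in controlling error. In particular, confidence-based abstention substantially reduces risk at moderate coverage levels, even when absolute accuracy improvements are limited. Our results suggest that in scientific reasoning tasks, the primary challenge is not selecting a single best model, but rather determining when available evidence is sufficient to justify an answer.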
This work highlights abstention-aware evaluation as a practical and model-agnostic lens for assessing scientific reliability, and provides a unified experimental basis for future work on selective reasoning in scientific domains. Code is available at https://github.com/sabdaljalil2000/ai4science.
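
The trade-off between risk and coverage under confidence-based abstention can be sketched as follows: answer only the most confident fraction of claims and measure the error rate on that subset. The function name and the ranking-by-confidence rule are assumptions for illustration; the abstract does not specify how the paper computes its risk-coverage numbers.

```python
def risk_at_coverage(confidences, correct, coverage):
    """Error rate on the `coverage` fraction of most-confident predictions.

    The remaining predictions are treated as abstentions and excluded.
    """
    if not confidences:
        return 0.0
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    keep = max(1, round(coverage * len(ranked)))  # always answer at least one
    answered = ranked[:keep]
    errors = sum(1 for _, is_correct in answered if not is_correct)
    return errors / keep
```

If confidence is informative, the returned risk falls as `coverage` is lowered, even though the underlying model and its raw accuracy are unchanged.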

[46] We can still parse using syntactic rules

Ghaly Hussein

Main category: cs.CL

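TL;DR: This paper proposes a new syntactic parsing approach based on context-free grammar (CFG) and generalized phrase structure grammar (GPSG) that produces both dependency and constituency parses, tolerates noise and incomplete parses, and reaches a UAS of about 54% on Universal Dependencies data.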

Details

Motivation: Overcome the limitations of traditional CFG in syntactic modeling while drawing on theoretical syntax since the 1950s to build an interpretable, transparent computational model. Method: A new parsing algorithm with an accompanying system of syntactic rules and features that generates dependency and constituency trees together and supports multiple parse hypotheses with later reranking. Result: Average UAS of 54.5% on the Universal Dependencies development data (7 corpora) and 53.8% on the test set (12 corpora), with support for multiple candidate parses and reranking. Conclusion: The approach brings classical theoretical syntax into a computational framework, improving the robustness, interpretability, and practicality of syntactic parsing. Abstract: This research introduces a new parsing approach, based on earlier syntactic work on context-free grammar (CFG) and generalized phrase structure grammar (GPSG). The approach comprises both a new parsing algorithm and a set of syntactic rules and features that overcome the limitations of CFG. It also generates both dependency and constituency parse trees, while accommodating noise and incomplete parses. The system was tested on data from Universal Dependencies, showing a promising average Unlabeled Attachment Score (UAS) of 54.5% in the development dataset (7 corpora) and 53.8% in the test set (12 corpora). The system also provides multiple parse hypotheses, allowing further reranking to improve parsing accuracy. This approach also brings much of the theoretical syntactic work since the 1950s into a computational context. The application of this approach provides a transparent and interpretable NLP model to process language input.
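
For readers unfamiliar with the metric, Unlabeled Attachment Score is the fraction of tokens whose predicted head matches the gold head, ignoring the dependency label. The sketch below assumes heads are given as one integer per token (0 for the root, as in CoNLL-U) and that a token left unattached by an incomplete parse is represented by `None` and counted as wrong; that last convention is an assumption, not something the abstract states.

```python
def unlabeled_attachment_score(gold_heads, predicted_heads):
    """Fraction of tokens whose predicted head index equals the gold head index."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_heads:
        return 0.0
    # A predicted head of None (unattached token) never equals an integer gold head.
    correct = sum(1 for gold, pred in zip(gold_heads, predicted_heads) if gold == pred)
    return correct / len(gold_heads)
```

Corpus-level UAS is usually computed by pooling tokens across sentences rather than averaging per-sentence scores.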

[47] AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents

Lingxiang Hu,Yiding Sun,Tianle Xia,Wenwei Li,Ming Xu,Liqun Liu,Peng Shu,Huan Yu,Jie Jiang

Main category: cs.CL

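TL;DR: This paper proposes AD-Bench, a benchmark built from real advertising and marketing business needs for evaluating LLM agents on multi-round, multi-tool analytical tasks. Experiments show that even the best current model (Gemini-3-Pro) degrades markedly on the hardest tasks (L3), revealing a clear capability gap in this specialized domain.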

Details

Motivation: Existing LLM agent benchmarks are mostly limited to idealized simulations and do not reflect the real complexity of multi-round interaction and multi-tool collaboration in specialized domains such as advertising and marketing analytics, so a benchmark grounded in actual business needs is required. Method: AD-Bench is built from real user marketing analysis requests; domain experts provide verifiable reference answers and reference tool-call trajectories; requests are split into three difficulty levels (L1-L3); models are evaluated with Pass@1, Pass@3, and trajectory coverage. Result: Gemini-3-Pro achieves Pass@1 of 68.0% and Pass@3 of 83.0% on AD-Bench overall, but on L3 these fall to 49.4% and 62.1%, with trajectory coverage of 70.1%, highlighting its limits in complex marketing analysis scenarios. Conclusion: AD-Bench fills the gap for a realistic benchmark that targets a specialized domain and emphasizes multi-round, multi-tool collaboration, providing infrastructure for evaluating and improving advertising and marketing agents. Abstract: While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem. Current benchmarks, however, are largely restricted to idealized simulations, failing to address the practical demands of specialized domains like advertising and marketing analytics. In these fields, tasks are inherently more complex, often requiring multi-round interaction with professional marketing tools. To address this gap, we propose AD-Bench, a benchmark designed based on real-world business requirements of advertising and marketing platforms. AD-Bench is constructed from real user marketing analysis requests, with domain experts providing verifiable reference answers and corresponding reference tool-call trajectories. The benchmark categorizes requests into three difficulty levels (L1-L3) to evaluate agents' capabilities under multi-round, multi-tool collaboration. Experiments show that on AD-Bench, Gemini-3-Pro achieves Pass@1 = 68.0% and Pass@3 = 83.0%, but performance drops significantly on L3 to Pass@1 = 49.4% and Pass@3 = 62.1%, with a trajectory coverage of 70.1%, indicating that even state-of-the-art models still exhibit substantial capability gaps in complex advertising and marketing analysis scenarios. AD-Bench provides a realistic benchmark for evaluating and improving advertising marketing agents; the leaderboard and code can be found at https://github.com/Emanual20/adbench-leaderboard.
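
The abstract does not define Pass@k, so the following is a sketch under one common reading: a task counts as passed at k if at least one of its first k attempts is judged correct, and the score is the fraction of such tasks. The function name and input layout (one list of booleans per task) are hypothetical.

```python
def pass_at_k(attempts_per_task, k):
    """Fraction of tasks solved by at least one of the first k attempts."""
    if not attempts_per_task:
        return 0.0
    solved = sum(1 for attempts in attempts_per_task if any(attempts[:k]))
    return solved / len(attempts_per_task)
```

Under this reading Pass@3 can never be lower than Pass@1 on the same runs, which matches the ordering of the reported numbers.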

[48] Detecting LLM Hallucinations via Embedding Cluster Geometry: A Three-Type Taxonomy with Measurable Signatures

Matic Korun

Main category: cs.CL

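TL;DR: This paper proposes a taxonomy of large language model hallucinations based on geometric features of token embedding cluster structure. It identifies three hallucination types (center-drift, wrong-well convergence, coverage gaps), defines three measurable geometric statistics (α, β, λ_s), and tests how universal and how architecture-dependent they are across 11 models.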

Details

Motivation: Understanding the internal mechanism of LLM hallucinations calls for an observable, measurable taxonomy grounded in the geometry of the embedding space. Method: The static token embedding spaces of 11 transformer models (encoder and decoder architectures) are analyzed; three hallucination types are defined from cluster structure, and three geometric statistics (α, β, λ_s) are proposed to quantify them. Result: Polarity structure (α > 0.5) and cluster cohesion (β > 0) hold in all 11 models; the radial information gradient λ_s is significant in 9 of 11 models (p < 0.05), and is not significant for ALBERT and MiniLM because of architectural properties (factorized embedding compression and distillation-induced isotropy, respectively). Conclusion: The geometric structure of the embedding space is a prerequisite for type-specific hallucination detection; the three hallucination types carry detectable geometric signatures, and different architectures show predictable vulnerability profiles. Abstract: We propose a geometric taxonomy of large language model hallucinations based on observable signatures in token embedding cluster structure. By analyzing the static embedding spaces of 11 transformer models spanning encoder (BERT, RoBERTa, ELECTRA, DeBERTa, ALBERT, MiniLM, DistilBERT) and decoder (GPT-2) architectures, we identify three operationally distinct hallucination types: Type 1 (center-drift) under weak context, Type 2 (wrong-well convergence) to locally coherent but contextually incorrect cluster regions, and Type 3 (coverage gaps) where no cluster structure exists. We introduce three measurable geometric statistics: α (polarity coupling), β (cluster cohesion), and λ_s (radial information gradient). Across all 11 models, polarity structure (α > 0.5) is universal (11/11), cluster cohesion (β > 0) is universal (11/11), and the radial information gradient is significant (9/11, p < 0.05). We demonstrate that the two models failing λ_s significance -- ALBERT and MiniLM -- do so for architecturally explicable reasons: factorized embedding compression and distillation-induced isotropy, respectively. These findings establish the geometric prerequisites for type-specific hallucination detection and yield testable predictions about architecture-dependent vulnerability profiles.

[49] STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts

Zachary Bamberger,Till R. Saenger,Gilad Morad,Ofra Amir,Brandon M. Stewart,Amir Feder

Main category: cs.CL

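TL;DR: STATe-of-Thoughts (STATe) is an interpretable inference-time-compute (ITC) method that replaces conventional high-temperature sampling with discrete, interpretable textual interventions (for example, a controller that selects high-level reasoning actions), improving output diversity, interpretability, and quality control.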

Details

Motivation: Existing ITC methods such as Best-of-N and Tree-of-Thoughts rely on high-temperature sampling, which yields insufficient output diversity and leaves the reasoning process hard to control or explain. Method: STATe has three components: a controller selects high-level reasoning actions (discrete, interpretable textual interventions), a generator produces reasoning steps conditioned on those actions, and an evaluator scores candidates to guide search; the search is structured over the action space throughout. Result: (1) Action-guided interventions produce markedly greater response diversity than temperature sampling; (2) in an argument-generation case study, the action sequences themselves are strongly predictive of output quality and help explain it; (3) promising but unexplored regions of the action space can be identified and generation steered toward them. Conclusion: STATe offers a practical, controllable, and scalable ITC approach for generating high-quality, diverse, and interpretable text. Abstract: Inference-Time-Compute (ITC) methods like Best-of-N and Tree-of-Thoughts are meant to produce output candidates that are both high-quality and diverse, but their use of high-temperature sampling often fails to achieve meaningful output diversity. Moreover, existing ITC methods offer limited control over how to perform reasoning, which in turn limits their explainability. We present STATe-of-Thoughts (STATe), an interpretable ITC method that searches over high-level reasoning patterns. STATe replaces stochastic sampling with discrete and interpretable textual interventions: a controller selects actions encoding high-level reasoning choices, a generator produces reasoning steps conditioned on those choices, and an evaluator scores candidates to guide search. This structured approach yields three main advantages. First, action-guided textual interventions produce greater response diversity than temperature-based sampling. Second, in a case study on argument generation, STATe's explicit action sequences capture interpretable features that are highly predictive of output quality. Third, estimating the association between performance and action choices allows us to identify promising yet unexplored regions of the action space and steer generation directly toward them. Together, these results establish STATe as a practical framework for generating high-quality, diverse, and interpretable text. Our framework is available at https://github.com/zbambergerNLP/state-of-thoughts.

[50] Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook

Ming Li,Xirui Li,Tianyi Zhou

Main category: cs.CL

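TL;DR: This paper presents the first large-scale systemic diagnosis of the dynamic evolution of an AI agent society (Moltbook) and proposes a quantitative analysis framework. Although the society's global semantics stabilize quickly, individual diversity stays high and lexical turnover is frequent; agents show strong individual inertia and little mutual influence or consensus formation, so no stable collective influence anchors emerge.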

Details

Motivation: To investigate whether AI agent societies exhibit convergence dynamics like human societies, and to provide principles for designing and analyzing future AI agent societies. Method: A quantitative diagnostic framework is proposed for measuring the dynamic evolution of AI agent societies, covering semantic stabilization, lexical turnover, individual inertia, influence persistence, and collective consensus. Result: In Moltbook, global semantics stabilize quickly, but individual diversity remains high and vocabulary keeps turning over; agents show strong inertia and weak adaptation, so mutual influence and consensus rarely form; influence is short-lived, with no persistent supernodes or shared social memory. Conclusion: Scale and interaction density alone are not enough to produce genuine socialization; mechanisms such as shared memory and plasticity need to be designed in deliberately to foster social evolution. Abstract: As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems? Moltbook has recently come to approximate a plausible future scenario in which autonomous agents participate in an open-ended, continuously evolving online society. We present the first large-scale systemic diagnosis of this AI agent society. Beyond static observation, we introduce a quantitative diagnostic framework for dynamic evolution in AI agent societies, measuring semantic stabilization, lexical turnover, individual inertia, influence persistence, and collective consensus. Our analysis reveals a system in dynamic balance in Moltbook: while global semantic averages stabilize rapidly, individual agents retain high diversity and persistent lexical turnover, defying homogenization. However, agents exhibit strong individual inertia and minimal adaptive response to interaction partners, preventing mutual influence and consensus. Consequently, influence remains transient with no persistent supernodes, and the society fails to develop stable collective influence anchors due to the absence of shared social memory. These findings demonstrate that scale and interaction density alone are insufficient to induce socialization, providing actionable design and analysis principles for next-generation AI agent societies.

[51] InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

Shuofei Qiao,Yunxiang Wei,Xuehai Wang,Bin Wu,Boyang Xue,Ningyu Zhang,Hossein A. Rahmani,Yanshan Wang,Qiang Zhang,Keyan Ding,Jeff Z. Pan,Huajun Chen,Emine Yilmaz

Main category: cs.CL

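TL;DR: This paper proposes InnoEval, a framework that addresses the bottleneck of evaluating scientific ideas in the LLM era through knowledge-grounded, multi-perspective reasoning, using heterogeneous deep knowledge search and an interdisciplinary review board to approach human-level assessment of innovation.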

Details

Motivation: Existing idea evaluation methods suffer from narrow knowledge horizons, one-dimensional evaluation, and the inherent bias of LLM-as-a-Judge; they lack expert knowledge grounding, collective deliberation, and multi-criteria decision-making. Method: Idea evaluation is framed as a knowledge-grounded, multi-perspective reasoning problem. A heterogeneous deep knowledge search engine retrieves and grounds dynamic evidence from diverse online sources, and an innovation review board of reviewers with distinct academic backgrounds performs a multi-dimensional, decoupled evaluation. Result: In point-wise, pair-wise, and group-wise evaluation tasks, InnoEval consistently outperforms baselines, and its judgment patterns and consensus align closely with those of human experts. Conclusion: InnoEval narrows the gap between the rapid generation of ideas by LLMs and the lagging pace of scientific evaluation, offering an approach to trustworthy, interpretable, human-like evaluation of scientific ideas. Abstract: The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation needs knowledgeable grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias in LLM-as-a-Judge. To address these, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus with an innovation review board containing reviewers with distinct academic backgrounds, enabling a multi-dimensional decoupled evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval can consistently outperform baselines in point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with human experts.

[52] Beyond Token-Level Policy Gradients for Complex Reasoning with Large Language Models

Mufan Xu,Kehai Chen,Xuefeng Bai,Zhengyu Niu,Muyun Yang,Tiejun Zhao,Min Zhang

Main category: cs.CL

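TL;DR: This paper proposes Multi-token Policy Gradient Optimization (MPO), a framework that treats K consecutive tokens as a single semantic action to better model the block-level structure of complex reasoning tasks, improving performance on mathematical reasoning and coding tasks.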

Details

Motivation: Existing policy-gradient methods for language models select actions one token at a time during auto-regressive generation, which does not match the block-level nature of complex reasoning, where a single semantic decision (such as defining a variable or composing an equation) often spans several tokens; the result is a mismatch in optimization granularity. Method: Multi-token Policy Gradient Optimization (MPO) treats K consecutive tokens as one joint action for policy-gradient optimization, enabling block-level semantic modeling and optimization of higher-level objectives. Result: On mathematical reasoning and coding benchmarks, MPO clearly outperforms standard token-level policy-gradient baselines. Conclusion: Token-level policy gradients have inherent limitations for complex reasoning; moving to a coarser (for example, block-level) action space opens a new direction for reasoning-intensive language tasks. Abstract: Existing policy-gradient methods for auto-regressive language models typically select subsequent tokens one at a time as actions in the policy. While effective for many generation tasks, such an approach may not fully capture the structure of complex reasoning tasks, where a single semantic decision is often realized across multiple tokens, for example, when defining variables or composing equations. This introduces a potential mismatch between token-level optimization and the inherently block-level nature of reasoning in these settings. To bridge this gap, we propose Multi-token Policy Gradient Optimization (MPO), a framework that treats sequences of K consecutive tokens as unified semantic actions. This block-level perspective enables our method to capture the compositional structure of reasoning trajectories and supports optimization over coherent, higher-level objectives. Experiments on mathematical reasoning and coding benchmarks show that MPO outperforms standard token-level policy gradient baselines and highlight the limitations of token-level policy gradients for complex reasoning, motivating future research to look beyond token-level granularity for reasoning-intensive language tasks.
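
To make the idea of treating K consecutive tokens as one action concrete, here is a minimal sketch. It relies only on the fact that, under an auto-regressive model, the log-probability of a block is the sum of its token log-probabilities; the REINFORCE-style surrogate loss and the per-block advantages are illustrative assumptions, not the objective defined in the paper.

```python
import numpy as np

def block_log_probs(token_log_probs, k):
    """Sum per-token log-probs into blocks of k consecutive tokens.
    The final block may be shorter; it is padded with zeros (log 1)."""
    lp = np.asarray(token_log_probs, dtype=float)
    n_blocks = -(-len(lp) // k)  # ceiling division
    padded = np.zeros(n_blocks * k)
    padded[:len(lp)] = lp
    return padded.reshape(n_blocks, k).sum(axis=1)

def block_policy_gradient_loss(token_log_probs, block_advantages, k):
    """Hypothetical surrogate: negative mean of advantage * block log-prob."""
    blp = block_log_probs(token_log_probs, k)
    adv = np.asarray(block_advantages, dtype=float)
    return -float(np.mean(adv * blp))
```

With k = 1 this reduces to the usual token-level surrogate, which is one way to see the block formulation as a coarsening of the action space.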

Fathima Ameen,Danielle Brown,Manusha Malgareddy,Amanul Haque

Main category: cs.CL

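TL;DR: This paper introduces TruthStance, a large-scale dataset of Truth Social conversations with reply-tree structure preserved, together with a human-annotated benchmark for argument mining and claim-based stance detection, used to evaluate and improve how large language models analyze such niche platforms.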

Details

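Motivation: Publicly available resources concentrate on mainstream platforms such as Twitter and Reddit, while conversational structure and stance expression on alt-tech platforms such as Truth Social have not been studied systematically; new datasets and benchmarks suited to their conversational structure are needed. Method: The TruthStance dataset is built (24,378 posts and 523,360 comments, with reply-tree structure preserved); 1,500 instances are human-annotated for argument presence and claim-based stance detection and used to evaluate several LLM prompting strategies; the best configuration is then used to generate LLM labels for a large number of unannotated instances. Result: The work releases the first large-scale structured conversation dataset for Truth Social with a human-annotated benchmark, reports inter-annotator agreement, validates specific LLM prompting strategies, and open-sources all code, human annotations, and LLM-generated labels. Conclusion: TruthStance fills the data gap for argument and stance analysis on alt-tech platforms, supports fine-grained modeling of opinion evolution, interaction depth, and topical leaning on non-mainstream social platforms, and shows that LLMs are viable for semi-automatic annotation in low-resource platform settings. Abstract: Argument mining and stance detection are central to understanding how opinions are formed and contested in online discourse. However, most publicly available resources focus on mainstream platforms such as Twitter and Reddit, leaving conversational structure on alt-tech platforms comparatively under-studied. We introduce TruthStance, a large-scale dataset of Truth Social conversation threads spanning 2023-2025, consisting of 24,378 posts and 523,360 comments with reply-tree structure preserved. We provide a human-annotated benchmark of 1,500 instances across argument mining and claim-based stance detection, including inter-annotator agreement, and use it to evaluate large language model (LLM) prompting strategies. Using the best-performing configuration, we release additional LLM-generated labels for 24,352 posts (argument presence) and 107,873 comments (stance to parent), enabling analysis of stance and argumentation patterns across depth, topics, and users. All code and data are released publicly.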

[54] WavePhaseNet: A DFT-Based Method for Constructing Semantic Conceptual Hierarchy Structures (SCHS)

Kiyotaka Kasubuchi,Kazuo Fukiya

Main category: cs.CL

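TL;DR: This paper reformulates Transformer/Attention mechanisms through measure theory and frequency-domain analysis and argues that hallucination is an inherent structural limitation. It proposes WavePhaseNet, which uses the discrete Fourier transform to build a semantic conceptual hierarchy structure; derives a minimum embedding dimensionality of about 3,000 from 1/f spectral analysis; and introduces cohomological consistency control, using harmonic projection based on Hodge theory to suppress hallucination and improve semantic consistency.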

Details

Motivation: To address the structural root of hallucination in large language models and expose the theoretical limits of current attention mechanisms in semantic representation and logical consistency. Method: The approach combines measure theory (modeling the embedding space as a conditional expectation over a σ-algebra), frequency-domain analysis (DFT decomposition of semantics into low- and high-frequency components), dimensionality reduction (based on the 1/f spectrum and cumulative energy analysis), and cohomological regularization (constructing a cochain complex and applying harmonic projection). Result: The paper argues that hallucination is a necessary consequence of the embedding space failing to be isomorphic to the semantic truth set, finds that GPT-4's embedding space can be reduced to about 3,000 dimensions without losing semantic completeness, and reports that WavePhaseNet with cohomological control improves reasoning consistency and suppresses hallucination. Conclusion: Hallucination stems from a mathematical limitation of current LLM architectures; frequency-domain decoupling, dimensionality refinement, and cohomological constraints can systematically improve semantic fidelity and logical robustness. Abstract: This paper reformulates Transformer/Attention mechanisms in Large Language Models (LLMs) through measure theory and frequency analysis, theoretically demonstrating that hallucination is an inevitable structural limitation. The embedding space functions as a conditional expectation over a σ-algebra, and its failure to be isomorphic to the semantic truth set fundamentally causes logical consistency breakdown. WavePhaseNet Method: The authors propose WavePhaseNet, which explicitly constructs a Semantic Conceptual Hierarchy Structure (SCHS) using Discrete Fourier Transform (DFT). By applying DFT along the sequence dimension, semantic information is decomposed into frequency bands: low-frequency components capture global meaning and intent, while high-frequency components represent local syntax and expression. This staged separation enables precise semantic manipulation in diagonalized space. Dimensionality Reduction: GPT-4's 24,576-dimensional embedding space exhibits a 1/f spectral structure based on language self-similarity and Zipf's law. Through cumulative energy analysis, the authors derive that approximately 3,000 dimensions constitute the lower bound for "complete representation." This demonstrates that reduction from 24,576 to 3,000 dimensions preserves meaning and intent while enabling rigorous reasoning and suppressing hallucination. Cohomological Consistency Control: The reduced embedding space, constructed via cohomological regularization over overlapping local windows, allows defining a graph structure and cochain complex. This quantifies inconsistencies among local inferences as coboundary-based losses. Applying harmonic projection based on Hodge theory positions cohomology as a computable regularization principle for controlling semantic consistency, extracting maximally consistent global representations.
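
The two numerical steps described above (a DFT along the sequence dimension that separates low- and high-frequency bands, and a cumulative energy analysis that picks a component count) can be sketched with NumPy. This is a generic illustration, not WavePhaseNet itself: the hard cutoff, the function names, and the energy threshold are assumptions.

```python
import numpy as np

def split_frequency_bands(embeddings, cutoff):
    """embeddings: array of shape (seq_len, dim).
    Applies a real FFT along the sequence axis and returns (low, high),
    where `low` keeps frequency bins below `cutoff` and `high` keeps the rest."""
    x = np.asarray(embeddings, dtype=float)
    spectrum = np.fft.rfft(x, axis=0)
    low_spec = spectrum.copy()
    low_spec[cutoff:] = 0
    high_spec = spectrum - low_spec
    n = x.shape[0]
    return np.fft.irfft(low_spec, n=n, axis=0), np.fft.irfft(high_spec, n=n, axis=0)

def components_for_energy(energies, fraction):
    """Smallest number of components, taken in order of decreasing energy,
    whose cumulative share of total energy reaches `fraction`."""
    e = np.sort(np.asarray(energies, dtype=float))[::-1]
    cumulative = np.cumsum(e) / e.sum()
    return int(np.searchsorted(cumulative, fraction) + 1)
```

Because the transform is linear, the two bands sum back to the input, so the split loses no information; only the choice of cutoff decides what counts as "global" versus "local".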

[55] LLM-Guided Knowledge Distillation for Temporal Knowledge Graph Reasoning

Wang Xing,Wei Song,Siyu Lin,Chen Wu,Man Wang

Main category: cs.CL

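TL;DR: This paper proposes an LLM-assisted distillation framework for temporal knowledge graph (TKG) reasoning, in which a large language model acts as an auxiliary teacher that supplies rich, temporally aware supervision, improving a lightweight student model while keeping inference cost low.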

Details

Motivation: Existing TKG reasoning models are computationally heavy and costly to deploy; conventional compression and distillation methods mostly target static graphs, struggle to model time-dependent interactions, and tend to degrade performance. Method: A dual-teacher distillation framework is designed, combining a high-capacity temporal teacher model with a large language model (LLM) as an auxiliary teacher; the LLM supplies background knowledge and temporal signals, and a staged alignment strategy jointly optimizes the supervised and distillation objectives. Result: On multiple public TKG benchmarks, the method clearly outperforms strong distillation baselines and improves link prediction across a range of backbone architectures, while the student model stays lightweight and efficient. Conclusion: Large language models can serve as effective 'teachers' that transfer temporal reasoning capability to resource-constrained TKG systems, offering a new approach to efficient TKG modeling. Abstract: Temporal knowledge graphs (TKGs) support reasoning over time-evolving facts, yet state-of-the-art models are often computationally heavy and costly to deploy. Existing compression and distillation techniques are largely designed for static graphs; directly applying them to temporal settings may overlook time-dependent interactions and lead to performance degradation. We propose an LLM-assisted distillation framework specifically designed for temporal knowledge graph reasoning. Beyond a conventional high-capacity temporal teacher, we incorporate a large language model as an auxiliary instructor to provide enriched supervision. The LLM supplies broad background knowledge and temporally informed signals, enabling a lightweight student to better model event dynamics without increasing inference-time complexity. Training is conducted by jointly optimizing supervised and distillation objectives, using a staged alignment strategy to progressively integrate guidance from both teachers. Extensive experiments on multiple public TKG benchmarks with diverse backbone architectures demonstrate that the proposed approach consistently improves link prediction performance over strong distillation baselines, while maintaining a compact and efficient student model. The results highlight the potential of large language models as effective teachers for transferring temporal reasoning capability to resource-efficient TKG systems.
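
A joint supervised-plus-distillation objective with two teachers can take a form like the one below. This is a minimal sketch under stated assumptions: the KL-based distillation terms, the temperature, and the weights `w_temporal` and `w_llm` (which a staged schedule could vary over training) are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """Mean over the batch of KL(p || q) for row-wise distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1).mean())

def dual_teacher_loss(student_logits, labels, temporal_teacher_logits,
                      llm_teacher_probs, w_temporal, w_llm, temperature=2.0):
    """Cross-entropy on gold labels plus weighted KL terms to each teacher."""
    probs = softmax(student_logits)
    ce = float(-np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean())
    student_soft = softmax(student_logits, temperature)
    temporal_soft = softmax(temporal_teacher_logits, temperature)
    return (ce
            + w_temporal * kl_divergence(temporal_soft, student_soft)
            + w_llm * kl_divergence(llm_teacher_probs, student_soft))
```

One way to read "staged alignment" in this sketch is to start with `w_llm = 0` and raise it later in training; that schedule is an assumption made for illustration.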

[56] Robust Bias Evaluation with FilBBQ: A Filipino Bias Benchmark for Question-Answering Language Models

Lance Calvin Lim Gamboa,Yue Feng,Mark Lee

Main category: cs.CL

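TL;DR: This paper builds FilBBQ, a bias evaluation benchmark for the Philippine context with more than 10,000 test prompts for detecting sexist and homophobic stereotypes in language models, and proposes a robust evaluation method that averages bias scores over responses from multiple seeds.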

Details

Motivation: To extend the linguistic and cultural coverage of the Bias Benchmark for Question-Answering (BBQ) so that it can assess gender and sexual-orientation bias in the Philippine context. Method: FilBBQ is built through a four-phase process (template categorization, culturally aware translation, new template construction, and prompt generation); for evaluation, responses are sampled under multiple random seeds and the resulting bias scores are averaged to improve reliability. Result: Model bias scores vary considerably across seeds, and the models show clear sexist and homophobic biases relating to emotion, domesticity, stereotyped queer interests, and polygamy. Conclusion: FilBBQ provides the first localized, large-scale, robust benchmark for evaluating bias in Filipino language models and advances bias research for low-resource languages. Abstract: With natural language generation becoming a popular use case for language models, the Bias Benchmark for Question-Answering (BBQ) has grown to be an important benchmark format for evaluating stereotypical associations exhibited by generative models. We expand the linguistic scope of BBQ and construct FilBBQ through a four-phase development process consisting of template categorization, culturally aware translation, new template construction, and prompt generation. These processes resulted in a bias test composed of more than 10,000 prompts which assess whether models demonstrate sexist and homophobic prejudices relevant to the Philippine context. We then apply FilBBQ to models trained in Filipino, using a robust evaluation protocol that improves upon the reliability and accuracy of previous BBQ implementations. Specifically, we account for models' response instability by obtaining prompt responses across multiple seeds and averaging the bias scores calculated from these distinctly seeded runs. Our results confirm both the variability of bias scores across different seeds and the presence of sexist and homophobic biases relating to emotion, domesticity, stereotyped queer interests, and polygamy. FilBBQ is available via GitHub.
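
The multi-seed protocol described above amounts to computing a bias score per seeded run and then averaging. A minimal sketch follows; the function name and the choice to report the sample standard deviation alongside the mean are illustrative assumptions, and how each per-seed bias score is computed is left outside the sketch.

```python
from statistics import mean, stdev

def aggregate_bias_scores(scores_by_seed):
    """scores_by_seed: dict mapping seed -> bias score from that run.
    Returns (mean score, sample standard deviation across seeds)."""
    values = list(scores_by_seed.values())
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread
```

Reporting the spread next to the mean makes the seed-to-seed variability, which the results highlight, visible rather than averaging it away.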

[57] Measuring and Mitigating Post-hoc Rationalization in Reverse Chain-of-Thought Generation

Guangyue Peng,Zongchao Chen,Wen Luo,Yuntao Wen,Wei Li,Ruixiang Feng,Ran Le,Chen Yang,Zhenwei An,Yang Song,Tao Zhang,Houfeng Wang

Main category: cs.CL

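TL;DR: This paper proposes Structural Skeleton-guided Reasoning (SSR), which first generates an answer-invariant structural skeleton and then uses it to guide generation of the full reasoning trace, mitigating the post-hoc rationalization caused by answer anchoring in Reverse Chain-of-Thought Generation (RCG) and improving performance on multiple benchmarks.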

Details

Motivation: Reverse Chain-of-Thought Generation (RCG) tends to produce post-hoc rationalizations because the model already knows the answer while generating the reasoning trace, so the answer becomes a cognitive anchor; existing semantic suppression strategies actually deepen this anchoring. Method: SSR is a two-phase method: the first phase generates an answer-invariant functional skeleton, and the second uses that skeleton to guide generation of the full reasoning trace. A distilled variant, SSR-D, fine-tunes student models on teacher-generated SSR traces. Result: SSR-D improves over suppression baselines by up to 10% on open-ended reasoning benchmarks while preserving OOD generalization; all three anchoring measures (lexical, entropic, probabilistic) drop markedly. Conclusion: Redirecting information flow from answer monitoring to structural planning addresses answer anchoring at its root; SSR and its distilled variant offer an approach to controllable, robust reasoning-trace generation. Abstract: Reverse Chain-of-Thought Generation (RCG) synthesizes reasoning traces from query-answer pairs, but runs the risk of producing post-hoc rationalizations: when models can see the answer during generation, the answer serves as a cognitive anchor that shapes the entire explanation. We formalize this phenomenon through a three-level measurement hierarchy of lexical, entropic, and probabilistic anchoring, which capture surface artifacts, entropy dynamics, and latent answer dependence, respectively. We analyze semantic suppression, the intuitive mitigation strategy that instructs models to ignore the answer, and find it counterproductive: while it reduces lexical overlap, it paradoxically increases entropic and probabilistic anchoring. Drawing on Ironic Process Theory from cognitive psychology, we attribute this failure to active monitoring of the forbidden answer, which inadvertently deepens dependence on it. To break this cycle, we propose Structural Skeleton-guided Reasoning (SSR), a two-phase approach that first generates an answer-invariant functional skeleton structure, then uses this skeleton to guide full trace generation. By redirecting the information flow to structural planning rather than answer monitoring, SSR consistently reduces anchoring across all three levels. We further introduce Distilled SSR (SSR-D), which fine-tunes models on teacher-generated SSR traces to ensure reliable structural adherence. Experiments across open-ended reasoning benchmarks demonstrate that SSR-D achieves up to 10% improvement over suppression baselines while preserving out-of-distribution (OOD) generalization.

[58] HyperRAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation

Wen-Sheng Lien,Yu-Kai Chan,Hao-Lung Hsiao,Bo-Kai Ruan,Meng-Fen Chiang,Chien-An Chen,Yi-Ren Yeh,Hong-Han Shuai

Main category: cs.CL

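TL;DR: This paper proposes HyperRAG, a framework that replaces conventional binary knowledge graphs with n-ary hypergraphs and, through two retrieval variants (HyperRetriever and HyperMemory), improves the accuracy and interpretability of multi-hop question answering.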

Details

Motivation: Existing knowledge-graph-based RAG methods are constrained by binary relations, rigid retrieval schemes, and dense similarity search, which lead to low context relevance, high computational overhead, and weak relational expressiveness. Method: HyperRAG has two core components: (i) HyperRetriever learns structural-semantic reasoning to build query-conditioned n-ary relational chains; (ii) HyperMemory uses the LLM's parametric memory to guide beam search, dynamically scoring n-ary facts and entities for query-aware path expansion. Result: Effectiveness is validated on WikiTopics (11 closed-domain datasets) and three open-domain QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA); HyperRetriever gains on average 2.95% in MRR and 1.23% in Hits@10 and achieves the highest overall answer accuracy. Conclusion: N-ary hypergraphs support multi-hop reasoning more efficiently and interpretably; HyperRAG improves both open- and closed-domain QA, particularly in bridging reasoning gaps and making reasoning paths interpretable. Abstract: Graph-based retrieval-augmented generation (RAG) methods, typically built on knowledge graphs (KGs) with binary relational facts, have shown promise in multi-hop open-domain QA. However, their rigid retrieval schemes and dense similarity search often introduce irrelevant context, increase computational overhead, and limit relational expressiveness. In contrast, n-ary hypergraphs encode higher-order relational facts that capture richer inter-entity dependencies and enable shallower, more efficient reasoning paths. To address these limitations, we propose HyperRAG, a RAG framework tailored for n-ary hypergraphs with two complementary retrieval variants: (i) HyperRetriever learns structural-semantic reasoning over n-ary facts to construct query-conditioned relational chains. It enables accurate factual tracking, adaptive high-order traversal, and interpretable multi-hop reasoning under context constraints. (ii) HyperMemory leverages the LLM's parametric memory to guide beam search, dynamically scoring n-ary facts and entities for query-aware path expansion. Extensive evaluations on WikiTopics (11 closed-domain datasets) and three open-domain QA benchmarks (HotpotQA, MuSiQue, and 2WikiMultiHopQA) validate HyperRAG's effectiveness. HyperRetriever achieves the highest answer accuracy overall, with average gains of 2.95% in MRR and 1.23% in Hits@10 over the strongest baseline. Qualitative analysis further shows that HyperRetriever bridges reasoning gaps through adaptive and interpretable n-ary chain construction, benefiting both open and closed-domain QA.
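
The reported gains are in MRR and Hits@10. For readers less familiar with these ranking metrics, a minimal sketch of their standard definitions follows; the input format (one 1-based rank per query, `None` when the correct item was not retrieved) is an assumption made for the example.

```python
def mrr_and_hits(ranks, k=10):
    """ranks: list with the 1-based rank of the first correct item per query,
    or None if it was not retrieved. Returns (MRR, Hits@k)."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks if r is not None) / n
    hits = sum(1 for r in ranks if r is not None and r <= k) / n
    return mrr, hits
```

MRR rewards placing the correct item near the top of the list, while Hits@k only checks whether it appears within the first k results, which is why the two metrics can move by different amounts.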

[59] BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR

Md. Najib Hasan,Mst. Jannatun Ferdous Rain,Fyad Mohammed,Nazmul Siddique

Main category: cs.CL

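TL;DR: This paper proposes BETA-labeling, an annotation framework in which multiple LLMs collaborate, uses it to build a high-quality Bangla IR dataset, and empirically examines the effectiveness and risks of cross-lingual dataset reuse (via machine translation) for low-resource IR.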

Details

Motivation: Information retrieval (IR) in low-resource languages is limited by the scarcity of high-quality, task-specific annotated data; manual annotation is costly and hard to scale, while using large language models (LLMs) as automated annotators raises concerns about label reliability, bias, and evaluation validity. Method: The BETA-labeling framework combines several LLMs from different model families as annotators, applies contextual alignment, consistency checks, and majority agreement, and adds human evaluation to verify label quality. In addition, 'one-hop' cross-lingual dataset reuse is tested through LLM translation across multiple language pairs, analyzing meaning preservation and task validity. Result: Human verification finds the Bangla IR dataset produced by the framework to be of high quality; the reuse experiments show that translation quality differs markedly between languages, and that semantic distortion and task bias undermine reliability, so reuse does not generalize. Conclusion: LLM-assisted dataset construction for low-resource IR is promising but needs strict quality control; cross-lingual dataset reuse should not rely blindly on machine translation and requires language-aware evaluation and correction. Abstract: IR in low-resource languages remains limited by the scarcity of high-quality, task-specific annotated datasets. Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators introduces concerns about label reliability, bias, and evaluation validity. This work presents a Bangla IR dataset constructed using a BETA-labeling framework involving multiple LLM annotators from diverse model families. The framework incorporates contextual alignment, consistency checks, and majority agreement, followed by human evaluation to verify label quality. Beyond dataset creation, we examine whether IR datasets from other low-resource languages can be effectively reused through one-hop machine translation. Using LLM-based translation across multiple language pairs, we examine meaning preservation and task validity between source and translated datasets. Our experiments reveal substantial variation across languages, reflecting language-dependent biases and inconsistent semantic preservation that directly affect the reliability of cross-lingual dataset reuse. Overall, this study highlights both the potential and limitations of LLM-assisted dataset creation for low-resource IR. It provides empirical evidence of the risks associated with cross-lingual dataset reuse and offers practical guidance for constructing more reliable benchmarks and evaluation pipelines in low-resource language settings.
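
The majority-agreement step among several LLM annotators can be illustrated as below. This is a sketch under assumptions: the label strings, the strict-majority threshold, and the decision to return `None` (for example, to route the item to human review) when agreement is too low are hypothetical and not details given in the abstract.

```python
from collections import Counter

def majority_label(labels, min_agreement=0.5):
    """labels: one label per annotator for a single query-document pair.
    Returns (label, share) if the top label's share exceeds min_agreement,
    otherwise (None, share)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    share = count / len(labels)
    return (label if share > min_agreement else None), share
```

For instance, `majority_label(["relevant", "relevant", "irrelevant"])` accepts "relevant" with two-thirds agreement, while an even split yields no label.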

[60] Query as Anchor: Scenario-Adaptive User Representation via Large Language Model

Jiahao Yuan,Yike Xu,Jinyong Wen,Baokun Wang,Ziyi Gao,Xiaotong Lin,Yun Liu,Xing Fu,Yu Cheng,Yongchao Liu,Weiqiang Wang,Zhongle Xie

Main category: cs.CL

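TL;DR: This paper proposes Query-as-Anchor, a framework that learns dynamic, query-aware user representations to resolve the tension between universality and task sensitivity in industrial-scale user modeling, and validates its performance in several real Alipay scenarios.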

Details

Motivation: Existing user representation methods mostly produce static, task-agnostic embeddings that struggle to serve divergent downstream tasks within a unified vector space; heterogeneous multi-source data also introduces noise and modality conflicts that degrade representation quality. Method: The Query-as-Anchor framework builds UserU, a pre-training dataset of multi-modal user behavior; designs the Q-Anchor Embedding architecture, which integrates hierarchical coarse-to-fine encoders into dual-tower LLMs with joint contrastive-autoregressive optimization; introduces cluster-based soft prompt tuning to align with scenario-specific modalities; and, at deployment, anchors the query at the end of the sequence so that KV caching can accelerate inference. Result: The method consistently reaches SOTA on 10 Alipay industrial benchmarks; large-scale online A/B testing confirms its effectiveness in real business scenarios; and it scales well and deploys efficiently. Conclusion: Query-as-Anchor shifts user modeling from static encoding to dynamic, query-aware synthesis, improving the universality, task adaptability, and deployment efficiency of industrial-scale user representations. Abstract: Industrial-scale user representation learning requires balancing robust universality with acute task-sensitivity. However, existing paradigms primarily yield static, task-agnostic embeddings that struggle to reconcile the divergent requirements of downstream scenarios within unified vector spaces. Furthermore, heterogeneous multi-source data introduces inherent noise and modality conflicts, degrading representation. We propose Query-as-Anchor, a framework shifting user modeling from static encoding to dynamic, query-aware synthesis. To empower Large Language Models (LLMs) with deep user understanding, we first construct UserU, an industrial-scale pre-training dataset that aligns multi-modal behavioral sequences with user understanding semantics, and our Q-Anchor Embedding architecture integrates hierarchical coarse-to-fine encoders into dual-tower LLMs via joint contrastive-autoregressive optimization for query-aware user representation. To bridge the gap between general pre-training and specialized business logic, we further introduce Cluster-based Soft Prompt Tuning to enforce discriminative latent structures, effectively aligning model attention with scenario-specific modalities. For deployment, anchoring queries at sequence termini enables KV-cache-accelerated inference with negligible incremental latency. Evaluations on 10 Alipay industrial benchmarks show consistent SOTA performance, strong scalability, and efficient deployment. Large-scale online A/B testing in Alipay's production system across two real-world scenarios further validates its practical effectiveness. Our code is prepared for public release and will be available at: https://github.com/JhCircle/Q-Anchor.

[61] Beyond Translation: Evaluating Mathematical Reasoning Capabilities of LLMs in Sinhala and Tamil

Sukumar Kishanthan,Kumar Thushalika,Buddhi Jayasekara,Asela Hevapathige

Main category: cs.CL

TL;DR: 本文探讨了大语言模型（LLMs）在低资源语言（如僧伽罗语和泰米尔语）中是否具备真正的数学推理能力，还是依赖隐式翻译到英语。研究构建了三语平行数学数据集，涵盖六类数学问题，发现基础算术能力跨语言迁移良好，但复杂推理在僧伽罗语和泰米尔语中显著下降，表明多语言表现不等于均等的跨语言推理能力。

Details

Motivation: 检验大语言模型在低资源语言（如僧伽罗语、泰米尔语）中的数学推理是真正基于该语言进行，还是依赖隐式翻译至英语，从而厘清多语言能力的本质。 Method: 构建由母语为僧伽罗语、泰米尔语和英语且具数学训练背景者原生撰写的平行数学问题数据集，覆盖六类数学任务；对四个主流大语言模型进行跨语言评估，并分析错误模式与模型、题型的关联性。 Result: 基础算术推理在三种语言间表现稳健，但单位冲突、优化等复杂推理任务在僧伽罗语和泰米尔语中性能显著下降；不同模型在不同题型上的失败模式各异。 Conclusion: 模型在多语言场景下的强表现未必反映一致的语言内推理能力；需开展细粒度、题型感知的多语言评估，避免将翻译能力误判为推理能力。 Abstract: Large language models (LLMs) demonstrate strong mathematical reasoning in English, but whether these capabilities reflect genuine multilingual reasoning or reliance on translation-based processing in low-resource languages like Sinhala and Tamil remains unclear. We examine this fundamental question by evaluating whether LLMs genuinely reason mathematically in these languages or depend on implicit translation to English-like representations. Using a taxonomy of six math problem types, from basic arithmetic to complex unit conflict and optimization problems, we evaluate four prominent large language models. To avoid translation artifacts that confound language ability with translation quality, we construct a parallel dataset where each problem is natively authored by fluent speakers with mathematical training in all three languages. Our analysis demonstrates that while basic arithmetic reasoning transfers robustly across languages, complex reasoning tasks show significant degradation in Tamil and Sinhala. The pattern of failures varies by model and problem type, suggesting that apparent multilingual competence may not reflect uniform reasoning capabilities across languages. These findings challenge the common assumption that models exhibiting strong multilingual performance can reason equally effectively across languages, and highlight the need for fine-grained, type-aware evaluation in multilingual settings.

[62] Explainable Token-level Noise Filtering for LLM Fine-tuning Datasets

Yuchen Yang,Wenze Lin,Enhao Huang,Zhixuan Chu,Hongbin Zhou,Lan Tao,Yiming Li,Zhan Qin,Kui Ren

Main category: cs.CL

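TL;DR: This paper proposes XTF, a framework that decomposes token-level data into three attributes (reasoning importance, knowledge novelty, and task relevance) to filter noisy tokens during fine-tuning, improving large language model performance on downstream tasks.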

Details

Motivation: Current fine-tuning datasets are mostly designed at the sentence level, which does not match the token-level optimization mechanism of LLMs, so token-level noise harms model performance. Method: XTF is an explainable token-level noise filtering framework that decomposes each token's contribution into three explicit attributes (reasoning importance, knowledge novelty, and task relevance), scores them, and then masks the gradients of the tokens judged noisy. Result: Experiments on three downstream tasks (math, code, and medicine) with 7 mainstream LLMs show that XTF improves performance by up to 13.7% over regular fine-tuning. Conclusion: Token-level data optimization matters for LLM fine-tuning, and attribute-decomposition strategies help explain and improve complex training mechanisms. Abstract: Large Language Models (LLMs) have seen remarkable advancements, achieving state-of-the-art results in diverse applications. Fine-tuning, an important step for adapting LLMs to specific downstream tasks, typically involves further training on corresponding datasets. However, a fundamental discrepancy exists between current fine-tuning datasets and the token-level optimization mechanism of LLMs: most datasets are designed at the sentence level, which introduces token-level noise that degrades final performance. In this paper, we propose XTF, an explainable token-level noise filtering framework. XTF decomposes the complex and subtle contributions of token-level data to the fine-tuning process into three distinct and explicit attributes (reasoning importance, knowledge novelty, and task relevance), which can be assessed using scoring methods, and then masks the gradients of selected noisy tokens accordingly to optimize the performance of fine-tuned LLMs. We conduct extensive experiments on three representative downstream tasks (math, code and medicine) across 7 mainstream LLMs. The results demonstrate that XTF can significantly improve downstream performance by up to 13.7% compared to regular fine-tuning. Our work highlights the importance of token-level dataset optimization, and demonstrates the potential of strategies based on attribute decomposition for explaining complex training mechanisms.
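
Masking the gradients of selected tokens is commonly implemented by excluding those tokens from the loss, so they contribute nothing to the update. The sketch below shows that mechanism only; collapsing XTF's three attribute scores into a single hypothetical `noise_scores` value and filtering with a fixed threshold are simplifying assumptions, not the paper's scoring method.

```python
import numpy as np

def masked_token_loss(token_nll, noise_scores, threshold):
    """token_nll: per-token negative log-likelihoods for one sequence.
    noise_scores: hypothetical per-token noise scores (higher = noisier).
    Tokens with score above `threshold` are dropped from the loss.
    Returns (mean loss over kept tokens, boolean keep mask)."""
    nll = np.asarray(token_nll, dtype=float)
    keep = np.asarray(noise_scores, dtype=float) <= threshold
    if not keep.any():
        return 0.0, keep
    return float((nll * keep).sum() / keep.sum()), keep
```

Normalizing by the number of kept tokens, rather than the sequence length, keeps the loss scale comparable between heavily and lightly filtered sequences.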

[63] Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation

Shefayat E Shams Adib,Ahmed Alfey Sani,Ekramul Alam Esham,Ajwad Abrar,Tareque Mohmud Chowdhury

Main category: cs.CL

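TL;DR: This paper compares the zero-shot performance of five large language models (LLMs) on medical question answering using the iCliniq dataset (38,000 doctor-patient Q&A pairs). Larger models (such as Llama 3.3 70B) perform better overall, but Llama-4-Maverick-17B offers a better trade-off between efficiency and performance; the study stresses balancing practical utility against resource efficiency for clinical deployment.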

Details

Motivation: To improve access to healthcare in low-resource regions and support the deployment of lightweight, clinically useful LLM-based medical QA systems. Method: Five LLMs (from Llama-3-8B to Llama-4-Maverick-17B, plus GPT-5-mini) are evaluated zero-shot on the iCliniq dataset, with BLEU and ROUGE measuring generation quality and no fine-tuning. Result: Llama 3.3 70B Instruct performs best; Llama-4-Maverick-17B offers a better compromise between performance and inference efficiency; larger models retain a clear scaling advantage on clinical tasks. Conclusion: LLM capabilities continue to approach professional-level medical reasoning, but real clinical use must balance model size, compute cost, and clinical utility; this work provides a standardized benchmark to support future efficient medical NLP systems. Abstract: Recently, Large Language Models (LLMs) have gained significant traction in the medical domain, especially in developing medical QA systems that enhance access to healthcare in low-resource settings. This paper compares five LLMs deployed between April 2024 and August 2025 for medical QA, using the iCliniq dataset, containing 38,000 medical questions and answers across diverse specialties. Our models include Llama-3-8B-Instruct, Llama 3.2 3B, Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct, and GPT-5-mini. We use a zero-shot evaluation methodology with BLEU and ROUGE metrics to evaluate performance without specialized fine-tuning. Our results show that larger models like Llama 3.3 70B Instruct outperform smaller models, consistent with observed scaling benefits in clinical tasks. Notably, Llama-4-Maverick-17B exhibited more competitive results, highlighting efficiency trade-offs relevant for practical deployment. These findings align with advancements in LLM capabilities toward professional-level medical reasoning and reflect the increasing feasibility of LLM-supported QA systems in real clinical environments. This benchmark aims to serve as a standardized setting for future studies that minimize model size and computational resources while maximizing clinical utility in medical NLP applications.
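
ROUGE, one of the two metrics used here, measures n-gram overlap between a generated answer and a reference. As a simplified illustration, the sketch below computes a ROUGE-1 F1 score with lowercase whitespace tokenization; real evaluations normally use an established metric package with proper tokenization and stemming, so treat this as an assumption-laden toy version.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate answer."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Overlap metrics like this reward lexical similarity to the reference, which is worth keeping in mind when reading scores for free-form medical answers that can be correct while phrased differently.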

[64] The Wikidata Query Logs Dataset

Sebastian Walter,Hannah Bast

Main category: cs.CL

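TL;DR: This paper introduces the Wikidata Query Logs (WDQL) dataset, containing 200k natural-language question-query pairs generated from real SPARQL queries, and proposes an agent-based method to clean, de-anonymize, and verify those queries in order to improve the training of question-answering models.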

Details

Motivation: Existing Wikidata question-answering datasets are small and mostly template-generated, lacking the diversity of real queries; real log queries, in turn, often return no results because of anonymization, so a method is needed to restore their semantic validity. Method: An agent-based iterative method de-anonymizes and cleans anonymized SPARQL log queries, verifies them against Wikidata, and generates the corresponding natural-language questions at the same time. Result: The resulting WDQL dataset has 200k question-query pairs, the largest Wikidata QA dataset at the time and the closest to real queries; all data and the agent code are open-sourced. Conclusion: WDQL improves the training of QA models on real-world queries, and the agent-based method offers a workable approach to building high-quality structured datasets from noisy logs. Abstract: We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 200k question-query pairs over the Wikidata knowledge graph. It is over 6x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the dataset's benefit for training question-answering methods. All WDQL assets, as well as the agent code, are publicly available under a permissive license.

[65] GradMAP: Faster Layer Pruning with Gradient Metric and Projection Compensation

Hao Liu,Guangyan Li,Wensheng Zhang,Yongqiang Tang

Main category: cs.CL

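TL;DR: This paper proposes GradMAP, an efficient layer pruning method based on a gradient importance metric and projection compensation, which substantially speeds up pruning (4× on average) and alleviates performance degradation.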

Details

Motivation: Existing LLM layer pruning methods struggle to balance pruning performance with efficiency, and fall short both in assessing layer importance and in recovering performance after pruning. Method: GradMAP has two stages: the first introduces a global layer importance metric based on gradient magnitudes that needs only a single backward pass; the second identifies the layers with the largest mean shift and applies a one-step projection compensation matrix to correct the representation drift. Result: Experiments on multiple benchmarks show that GradMAP outperforms existing methods in both pruning speed (4× on average) and recovered model performance. Conclusion: With an efficient gradient metric and lightweight projection compensation, GradMAP achieves fast, low-loss LLM layer pruning and offers a practical route to compressing and deploying large models. Abstract: Large Language Models (LLMs) exhibit strong reasoning abilities, but their high computational costs limit their practical deployment. Recent studies reveal significant redundancy in LLM layers, making layer pruning an active research topic. Layer pruning research primarily focuses on two aspects: measuring layer importance and recovering performance after pruning. Unfortunately, existing works fail to maintain pruning performance and efficiency simultaneously. In this study, we propose GradMAP, a faster layer pruning method with Gradient Metric And Projection compensation, which consists of two stages. In the first stage, we introduce a novel metric based on gradient magnitudes, enabling a global assessment of layer importance. Notably, it requires only a single backward propagation step per pruning decision, substantially enhancing pruning efficiency. In the second stage, we first analyze the layers with the largest mean shift resulting from pruning, and then incorporate a simple yet effective projection compensation matrix to correct this drift in one step. In this way, the degradation of model performance caused by layer pruning is effectively alleviated. Extensive experiments show that GradMAP outperforms previous layer pruning methods in both pruning speed (achieving an average 4× speedup) and performance.

[66] Is Information Density Uniform when Utterances are Grounded on Perception and Discourse?

Matteo Gay,Coleman Haley,Mario Giulianelli,Edoardo Ponti

Main category: cs.CL

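TL;DR: This paper is the first to test the Uniform Information Density (UID) hypothesis in visual contexts. Using multilingual vision-and-language models on image-caption data in 30 languages and visual storytelling data in 13 languages, it finds that perceptual grounding significantly increases the global and local uniformity of information distribution, supporting a context-sensitive formulation of UID.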

Details

Motivation: Earlier studies of the UID hypothesis were limited to text-only input and ignored the perceptual context in which language is actually produced; this paper examines how visual grounding affects the distribution of information density. Method: Multilingual vision-and-language models estimate token surprisal on image-caption data in 30 languages and visual storytelling data in 13 languages, and the resulting changes in global and local uniformity of information density are analyzed. Result: Visual grounding consistently increases the uniformity of information distribution, both globally and locally, with the largest surprisal reductions at the onset of discourse units; the effect is robust across 11 language families. Conclusion: The UID hypothesis should be extended to a context-sensitive form; in realistic multimodal language use, perceptual grounding makes information flow more uniform, which supports UID modelling with higher ecological validity. Abstract: The Uniform Information Density (UID) hypothesis posits that speakers are subject to a communicative pressure to distribute information evenly within utterances, minimising surprisal variance. While this hypothesis has been tested empirically, prior studies are limited exclusively to text-only inputs, abstracting away from the perceptual context in which utterances are produced. In this work, we present the first computational study of UID in visually grounded settings. We estimate surprisal using multilingual vision-and-language models over image-caption data in 30 languages and visual storytelling data in 13 languages, together spanning 11 families. We find that grounding on perception consistently smooths the distribution of information, increasing both global and local uniformity across typologically diverse languages compared to text-only settings. In visual narratives, grounding in both image and discourse contexts has additional effects, with the strongest surprisal reductions occurring at the onset of discourse units. Overall, this study takes a first step towards modelling the temporal dynamics of information flow in ecologically plausible, multimodal language use, and finds that grounded language exhibits greater information uniformity, supporting a context-sensitive formulation of UID.
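
To make "global and local uniformity" concrete, the sketch below uses two operationalizations that are common in the UID literature: variance of surprisal around the utterance mean (global) and mean squared change between adjacent tokens (local). Whether the paper uses exactly these measures is an assumption, and the token probabilities are invented for illustration.

```python
import math
from statistics import mean, pvariance

def surprisals(token_probs):
    """Surprisal in bits for each token: -log2 p(token | context)."""
    return [-math.log2(p) for p in token_probs]

def global_nonuniformity(s):
    """Variance of surprisal around the utterance mean (lower = more uniform)."""
    return pvariance(s)

def local_nonuniformity(s):
    """Mean squared change in surprisal between adjacent tokens."""
    return mean((b - a) ** 2 for a, b in zip(s, s[1:]))

# Hypothetical per-token probabilities for the same utterance,
# without and with the image as context.
text_only = surprisals([0.5, 0.125, 0.5, 0.125])
grounded = surprisals([0.25, 0.25, 0.25, 0.25])

print(global_nonuniformity(text_only), local_nonuniformity(text_only))
print(global_nonuniformity(grounded), local_nonuniformity(grounded))
```

Both toy utterances carry the same total information (8 bits), so the comparison isolates how evenly that information is spread rather than how much of it there is.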

[67] Breaking Data Efficiency Dilemma: A Federated and Augmented Learning Framework For Alzheimer's Disease Detection via Speech

Xiao Wei,Bin Wen,Yuqin Lin,Kai Li,Mingyang gu,Xiaobao Wang,Longbiao Wang,Jianwu Dang

Main category: cs.CL

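TL;DR: This paper proposes FAL-AD, a framework that combines federated learning with speech data augmentation to address the data efficiency problem caused by scarce medical speech data and privacy restrictions in early Alzheimer's disease diagnosis, reaching 91.52% multi-modal accuracy on the ADReSSo dataset.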

Details

Motivation: Early diagnosis of Alzheimer's disease (AD) is critical, but AI-based speech detection faces a data efficiency dilemma caused by medical data scarcity and privacy barriers. Method: The FAL-AD framework has three parts: 1) voice conversion-based data augmentation that generates diverse pathological speech samples through cross-category voice-content recombination; 2) an adaptive federated learning paradigm that enables cross-institutional collaboration under privacy constraints; 3) an attentive cross-modal fusion model that achieves fine-grained word-level acoustic-textual alignment and interaction. Result: FAL-AD reaches 91.52% multi-modal classification accuracy on the ADReSSo dataset, outperforming all centralized baselines. Conclusion: FAL-AD offers a practical, efficient, and privacy-preserving solution to the data efficiency dilemma in speech-based AD diagnosis. Abstract: Early diagnosis of Alzheimer's Disease (AD) is crucial for delaying its progression. While AI-based speech detection is non-invasive and cost-effective, it faces a critical data efficiency dilemma due to medical data scarcity and privacy barriers. Therefore, we propose FAL-AD, a novel framework that synergistically integrates federated learning with data augmentation to systematically optimize data efficiency. Our approach delivers three key breakthroughs: First, absolute efficiency improvement through voice conversion-based augmentation, which generates diverse pathological speech samples via cross-category voice-content recombination. Second, collaborative efficiency breakthrough via an adaptive federated learning paradigm, maximizing cross-institutional benefits under privacy constraints. Finally, representational efficiency optimization by an attentive cross-modal fusion model, which achieves fine-grained word-level alignment and acoustic-textual interaction. Evaluated on ADReSSo, FAL-AD achieves a state-of-the-art multi-modal accuracy of 91.52%, outperforming all centralized baselines and demonstrating a practical solution to the data efficiency dilemma. Our source code is publicly available at https://github.com/smileix/fal-ad.

[68] Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

Gianluca Vico,Jindřich Libovický

Main category: cs.CL

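TL;DR: This paper builds a crowdsourced Piedmontese dataset of 145 Italian-Piedmontese parallel sentences with manual word alignment and uses it to evaluate large language models on tokenization, topic classification, and machine translation. Piedmontese incurs a tokenization penalty, yet classification performance approaches that of major languages; translation from Piedmontese into high-resource languages is adequate, while the reverse direction remains difficult.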

Details

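Motivation: Build a high-quality resource for the endangered language Piedmontese that reflects real writing habits, and assess how well large language models support it. Method: A dataset of 145 Italian-Piedmontese parallel sentences derived from Flores+ is translated by native speakers in their natural orthography and supplemented with manual word alignment; three benchmarks are then run on it: tokenization parity, topic classification, and machine translation. Result: Piedmontese is at a clear disadvantage in tokenization; topic classification performance approaches that of Italian, French, and English; machine translation is asymmetric, with Piedmontese-to-high-resource translation working reasonably well and generation into Piedmontese being poor. Conclusion: The dataset fills a gap in low-resource NLP evaluation for endangered languages and exposes key limitations of current LLMs in handling non-standard orthography and generating low-resource languages. Abstract: We present a crowdsourced dataset for Piedmontese, an endangered Romance language of northwestern Italy. The dataset comprises 145 Italian-Piedmontese parallel sentences derived from Flores+, with translations produced by speakers writing in their natural orthographic style rather than adhering to standardized conventions, along with manual word alignment. We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation. Our analysis reveals that Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet LLMs achieve classification performance approaching that of Italian, French, and English. Machine translation results are asymmetric: models translate adequately from Piedmontese into high-resource languages, but generation into Piedmontese remains challenging. The dataset and code are publicly released.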

[69] LLMStructBench: Benchmarking Large Language Model Structured Data Extraction

Sönke Tenckhoff,Mario Koddenbrock,Erik Rodner

Main category: cs.CL

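TL;DR: This paper presents LLMStructBench, a new benchmark for evaluating how well large language models extract structured data from natural-language text and produce valid JSON output. It includes a diverse, manually verified dataset, systematic tests across 22 models and 5 prompting strategies, and new metrics that cover both token-level accuracy and document-level validity.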

Details

Motivation: Evaluation of LLMs on structured data extraction and JSON generation lacks a systematic benchmark, which makes it hard to measure parsing reliability comprehensively. Method: The open benchmark LLMStructBench covers diverse, manually verified parsing scenarios; 22 models and 5 prompting strategies are tested systematically; new metrics capture both token-level accuracy and document-level validity. Result: The choice of prompting strategy matters more than model size; a suitable strategy markedly improves structural validity for small or unreliable models but may increase the number of semantic errors. Conclusion: LLMStructBench provides a reproducible, multi-dimensional evaluation basis for research on LLMs in structured parsing and ETL applications. Abstract: We present LLMStructBench, a novel benchmark for evaluating Large Language Models (LLMs) on extracting structured data and generating valid JavaScript Object Notation (JSON) outputs from natural-language text. Our open dataset comprises diverse, manually verified parsing scenarios of varying complexity and enables systematic testing across 22 models and five prompting strategies. We further introduce complementary performance metrics that capture both token-level accuracy and document-level validity, facilitating rigorous comparison of model, size, and prompting effects on parsing reliability. In particular, we show that choosing the right prompting strategy is more important than standard attributes such as model size. This especially ensures structural validity for smaller or less reliable models but increases the number of semantic errors. Our benchmark suite is a step towards future research in the area of LLMs applied to parsing or Extract, Transform and Load (ETL) applications.
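
The distinction between structural validity and semantic correctness can be sketched as follows. This is a minimal illustration, not the benchmark's metric: `document_valid` checks only that the output parses as a JSON object, and `field_accuracy` is a field-level stand-in for the paper's token-level accuracy, whose definition is not given in the abstract. The example record is hypothetical.

```python
import json

def document_valid(raw):
    """Document-level validity: does the raw output parse as a JSON object?"""
    try:
        return isinstance(json.loads(raw), dict)
    except json.JSONDecodeError:
        return False

def field_accuracy(raw, gold):
    """Fraction of gold fields reproduced exactly; 0.0 for invalid output."""
    if not document_valid(raw):
        return 0.0
    pred = json.loads(raw)
    hits = sum(1 for key, value in gold.items() if pred.get(key) == value)
    return hits / len(gold)

gold = {"name": "Ada", "age": 36}          # hypothetical reference record
valid_but_wrong = '{"name": "Ada", "age": 37}'
truncated = '{"name": "Ada", "age": 36'    # missing closing brace

print(document_valid(valid_but_wrong), field_accuracy(valid_but_wrong, gold))
print(document_valid(truncated), field_accuracy(truncated, gold))
```

The first output is well-formed yet contains a wrong value, which is the kind of case the abstract alludes to when it notes that better structural validity can coincide with more semantic errors.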

[70] Rethinking the Role of LLMs in Time Series Forecasting

Xin Qiu,Junlong Tong,Yirong Sun,Yunpu Ma,Wei Zhang,Xiaoyu Shen

Main category: cs.CL

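TL;DR: Through a large-scale empirical study, this paper shows that large language models (LLMs) are effective for time series forecasting (TSF), with notable advantages in cross-domain generalization, distribution shift, and complex temporal modelling. Pre-alignment, pretrained knowledge, and model architecture each play a key and complementary role. The results overturn earlier conclusions that questioned the value of LLMs, and the code is released.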

Details

Motivation: Existing studies question whether LLMs bring real gains in time series forecasting and often report performance no better than traditional methods; the authors argue that these conclusions are limited by small-scale, narrow evaluation settings. Method: A large-scale empirical study covering 8 billion observations, 17 forecasting scenarios, 4 horizons, multiple alignment strategies (pre-alignment and post-alignment), and both in-domain and out-of-domain settings, combined with token-level routing analysis and prompt optimization to verify the role of each LLM component. Result: LLM4TSF significantly improves forecasting performance, with the largest gains in cross-domain generalization; pre-alignment beats post-alignment in over 90% of tasks; pretrained knowledge is critical for robustness under distribution shift, while the architecture is better at modelling complex temporal dynamics; under large-scale mixed distributions, a fully intact LLM is indispensable. Conclusion: LLMs are not only useful in TSF but indispensable under specific conditions (cross-domain settings, distribution shift, large-scale mixed data); the study clarifies the boundary conditions of LLM effectiveness and gives practical guidance for model design. Abstract: Large language models (LLMs) have been introduced to time series forecasting (TSF) to incorporate contextual knowledge beyond numerical signals. However, existing studies question whether LLMs provide genuine benefits, often reporting comparable performance without LLMs. We show that such conclusions stem from limited evaluation settings and do not hold at scale. We conduct a large-scale study of LLM-based TSF (LLM4TSF) across 8 billion observations, 17 forecasting scenarios, 4 horizons, multiple alignment strategies, and both in-domain and out-of-domain settings. Our results demonstrate that LLM4TSF indeed improves forecasting performance, with especially large gains in cross-domain generalization. Pre-alignment outperforms post-alignment in over 90% of tasks. Both pretrained knowledge and model architecture of LLMs contribute and play complementary roles: pretraining is critical under distribution shifts, while architecture excels at modeling complex temporal dynamics. Moreover, under large-scale mixed distributions, a fully intact LLM becomes indispensable, as confirmed by token-level routing analysis and prompt-based improvements. Overall, our findings overturn prior negative assessments, establish clear conditions under which LLMs are useful, and provide practical guidance for effective model design. We release our code at https://github.com/EIT-NLP/LLM4TSF.

[71] Cognitive networks reconstruct mindsets about STEM subjects and educational contexts in almost 1000 high-schoolers, University students and LLM-based digital twins

Francesco Gariboldi,Emma Franchino,Edith Haim,Gianluca Lattanzi,Alessandro Grecucci,Massimo Stella

Main category: cs.CL

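TL;DR: This paper uses cognitive network science to build behavioural forma mentis networks (BFMNs) and analyses attitudes toward STEM-related concepts across groups (high school students, university students, early-career STEM experts) and a large language model (GPT-oss). Science and research are framed positively throughout, whereas mathematics and statistics carry stronger negative and anxious affect, most markedly in high math-anxiety groups. This cognitive-emotional dissonance is more pronounced in human networks than in the LLM, indicating that LLMs struggle to reproduce the experience-based anxiety found in real educational settings.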

Details

Motivation: Examine how STEM attitudes emerge from the interaction of conceptual knowledge, educational experience, and affect, and compare the cognitive-affective differences between humans and large language models in representing such attitudes. Method: Behavioural forma mentis networks (BFMNs) are built from free association data, with words as nodes, empirical associations as edges, and perceived valence as annotation; the semantic neighbourhoods ("frames") around key STEM terms are analysed and quantified by valence aura, emotional profile, network overlap (Jaccard similarity), and concreteness, and compared against random baselines and GPT-oss "digital twins". Result: Science and research are consistently framed positively, while mathematics and statistics show markedly negative and anxious auras, more so in high math-anxiety groups; high-anxiety frames are more abstract and decontextualised; the overlap between mathematics and anxiety is greater in human networks than in GPT-oss. Conclusion: BFMNs capture the cognitive-affective signatures of mindsets toward the target domains; LLMs can approximate cultural-level attitudes but cannot reproduce deeper psychological components such as anxiety that depend on individual experience and educational context. Abstract: Attitudes toward STEM develop from the interaction of conceptual knowledge, educational experiences, and affect. Here we use cognitive network science to reconstruct group mindsets as behavioural forma mentis networks (BFMNs). In this case, nodes are cue words and free associations, edges are empirical associative links, and each concept is annotated with perceived valence. We analyse BFMNs from N = 994 observations spanning high school students, university students, and early-career STEM experts, alongside LLM (GPT-oss) "digital twins" prompted to emulate comparable profiles. Focusing also on semantic neighbourhoods ("frames") around key target concepts (e.g., STEM subjects or educational actors/places), we quantify frames in terms of valence auras, emotional profiles, network overlap (Jaccard similarity), and concreteness relative to null baselines. Across student groups, science and research are consistently framed positively, while their core quantitative subjects (mathematics and statistics) exhibit more negative and anxiety-related auras, amplified in higher math-anxiety subgroups, evidencing a STEM-science cognitive and emotional dissonance. High-anxiety frames are also less concrete than chance, suggesting more abstract and decontextualised representations of threatening quantitative domains. Human networks show greater overlap between mathematics and anxiety than GPT-oss. The results highlight how BFMNs capture cognitive-affective signatures of mindsets towards the target domains and indicate that LLM-based digital twins approximate cultural attitudes but miss key context-sensitive, experience-based components needed to replicate human educational anxiety.
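
The network overlap measure named in the abstract, Jaccard similarity, is the size of the intersection of two frames divided by the size of their union. The sketch below applies it to two invented frames; the word lists are hypothetical and are not taken from the study's data.

```python
def jaccard(frame_a, frame_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets of associates."""
    a, b = set(frame_a), set(frame_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical frames: words associated with each target concept.
math_frame = {"numbers", "exam", "stress", "logic"}
anxiety_frame = {"stress", "exam", "fear", "worry"}

print(jaccard(math_frame, anxiety_frame))
```

A higher value for the mathematics and anxiety frames in one network than in another is the kind of comparison behind the statement that human networks show greater overlap than GPT-oss.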

[72] Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers

Jonathan Lys,Vincent Gripon,Bastien Pasdeloup,Lukas Mauch,Fabien Cardinaux,Ghouthi Boukli Hacene

Main category: cs.CL

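TL;DR: This paper identifies an input-output alignment shift in large language models and proposes a lightweight mitigation based on residual attenuation that improves representation alignment in autoregressive Transformers.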

Details

Motivation: LLMs are trained autoregressively with causal masking, which creates a subtle mismatch between residual connections and the supervision target (the next token): the current token's representation may not be the most relevant information for predicting the next token. Method: The input-output alignment shift is located empirically in pretrained LLMs using decoding trajectories and similarity-based metrics; a lightweight residual-path mitigation based on residual attenuation is proposed, implemented either as a fixed-layer intervention or as a learnable gating mechanism. Result: Experiments on multiple benchmarks show that the method alleviates the representation misalignment and improves performance. Conclusion: The input-output alignment shift is an overlooked but important issue in autoregressive Transformers, and residual attenuation is an efficient and general architectural enhancement. Abstract: Large Language Models (LLMs) are trained with next-token prediction, implemented in autoregressive Transformers via causal masking for parallelism. This creates a subtle misalignment: residual connections tie activations to the current token, while supervision targets the next token, potentially propagating mismatched information if the current token is not the most informative for prediction. In this work, we empirically localize this input-output alignment shift in pretrained LLMs, using decoding trajectories over tied embedding spaces and similarity-based metrics. Our experiments reveal that the hidden token representations switch from input alignment to output alignment deep within the network. Motivated by this observation, we propose a lightweight residual-path mitigation based on residual attenuation, implemented either as a fixed-layer intervention or as a learnable gating mechanism. Experiments on multiple benchmarks show that these strategies alleviate the representation misalignment and yield improvements, providing an efficient and general architectural enhancement for autoregressive Transformers.
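
One plausible reading of "residual attenuation" is to scale the identity path of a residual block by a factor below 1, so that less of the current-token representation is carried forward. The sketch below shows that reading on plain lists; the exact form used in the paper, the layer at which it is applied, and the names `alpha` and `gate_logit` are all assumptions.

```python
import math

def attenuated_residual(x, sublayer_out, alpha):
    """h = alpha * x + f(x): alpha < 1 damps the identity (residual) path."""
    return [alpha * xi + fi for xi, fi in zip(x, sublayer_out)]

def gated_residual(x, sublayer_out, gate_logit):
    """Learnable variant (assumed): alpha = sigmoid(gate_logit)."""
    alpha = 1.0 / (1.0 + math.exp(-gate_logit))
    return attenuated_residual(x, sublayer_out, alpha)

x = [2.0, 4.0]    # hidden state entering the block (toy values)
fx = [1.0, 1.0]   # sublayer output f(x) (toy values)

print(attenuated_residual(x, fx, 1.0))   # standard residual connection
print(attenuated_residual(x, fx, 0.5))   # fixed attenuation
print(gated_residual(x, fx, 0.0))        # gate at sigmoid(0) = 0.5
```

With `alpha = 1.0` the block reduces to an ordinary residual connection, which makes the fixed and gated variants easy to compare against the unmodified model.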

[73] Unlocking Reasoning Capability on Machine Translation in Large Language Models

Sara Rajaee,Sebastian Vincent,Alexandre Berard,Marzieh Fadaee,Kelly Marchisio,Tom Kocmi

Main category: cs.CL

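TL;DR: This paper finds that applying explicit reasoning directly in reasoning-oriented large language models (RLMs) degrades machine translation (MT) quality, because the reasoning is linear and lacks self-correction and exploration of alternatives. It proposes a structured multi-step reasoning framework for translation (drafting, adequacy refinement, fluency improvement, and selective iterative revision) and post-trains a model on synthetic data, which significantly improves translation quality.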

Details

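Motivation: Reasoning-oriented large language models (RLMs) excel at tasks such as mathematics and coding, but their effect on machine translation (MT) has not been studied systematically, and initial attempts show that explicit reasoning harms translation quality; the causes need to be understood and a reasoning mechanism suited to MT designed. Method: 1) Systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark; 2) analyse properties of the reasoning traces (such as linearity and lack of revision); 3) propose a structured reasoning framework for translation comprising multi-step drafting, adequacy refinement, fluency improvement, and selective iterative revision; 4) build a synthetic dataset of dynamic structured reasoning traces and post-train a large reasoning model on it. Result: Enabling explicit reasoning in standard RLMs generally lowers MT quality; injecting higher-quality reasoning traces does not reliably improve weaker models; the structured reasoning framework with synthetic-data post-training significantly outperforms standard translation fine-tuning and generic reasoning injection baselines on multiple metrics. Conclusion: Reasoning must be structured for the specific task (here, machine translation) to be beneficial; generic or linear reasoning does not suit MT, whereas a structured, multi-stage, selectively iterative process fits the nature of translation better. Abstract: Reasoning-oriented large language models (RLMs) achieve strong gains on tasks such as mathematics and coding by generating explicit intermediate reasoning. However, their impact on machine translation (MT) remains underexplored. We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models. Analysis reveals that MT reasoning traces are highly linear, lacking revision, self-correction and exploration of alternative translations, which limits their usefulness. Furthermore, injecting higher-quality reasoning traces from stronger models does not reliably improve weaker models' performance. To address this mismatch, we propose a structured reasoning framework tailored to translation, based on multi-step drafting, adequacy refinement, fluency improvement, and selective iterative revision. We curate a synthetic dataset of dynamic structured reasoning traces and post-train a large reasoning model on this data. Experiments show significant improvements over standard translation fine-tuning and injected generic reasoning baselines. Our findings demonstrate that reasoning must be task-structured to benefit MT.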

[74] Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation

Shiwei Hong,Lingyao Li,Ethan Z. Rong,Chenxinran Shen,Zhicong Lu

Main category: cs.CL

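TL;DR: This paper examines whether broadcast community discussion (critic and audience threads) improves LLM-generated stand-up comedy writing in a multi-agent sandbox. Adding community discussion raises the preference rate among expert judges to 75.6% and improves scores on Craft/Clarity and Social Response, with an occasional increase in aggressive humor.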

Details

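Motivation: Existing work focuses mostly on how prompts and localized feedback affect LLM writing and overlooks the persistent public reception found in online communities. Method: A controlled multi-agent sandbox compares a discussion condition (critic and audience threads are recorded, filtered, stored, and retrieved as social memory to condition later generations) with a baseline condition (no discussion); 50 rounds (250 paired monologues) are judged by five expert annotators using A/B preference and a 15-item rubric. Result: The discussion condition wins 75.6% of A/B comparisons; Craft/Clarity improves by Δ = 0.440 and Social Response by Δ = 0.422; aggressive humor occasionally increases. Conclusion: Integrating community discussion as retrievable social memory improves LLM performance on creative writing tasks such as stand-up comedy, particularly in social responsiveness and expressive quality, but the potential for negative stylistic drift needs attention. Abstract: Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined. We test whether broadcast community discussion improves stand-up comedy writing in a controlled multi-agent sandbox: in the discussion condition, critic and audience threads are recorded, filtered, stored as social memory, and later retrieved to condition subsequent generations, whereas the baseline omits discussion. Across 50 rounds (250 paired monologues) judged by five expert annotators using A/B preference and a 15-item rubric, discussion wins 75.6% of instances and improves Craft/Clarity (Δ = 0.440) and Social Response (Δ = 0.422), with occasional increases in aggressive humor.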

[75] Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment

Laurène Vaugrante,Anietta Weckauff,Thilo Hagendorff

Main category: cs.CL

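TL;DR: This paper studies the behavioral self-awareness of large language models (LLMs) as they undergo emergent misalignment and its reversal, and finds that models can assess changes in their own harmfulness in a way that reflects their actual alignment state.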

Details

Motivation: Investigate the intersection of emergent misalignment and behavioral self-awareness: whether a model can, without in-context examples, perceive and describe its own behavioral transitions as its alignment state changes (for example, from misaligned to realigned). Method: GPT-4.1 models are fine-tuned sequentially on datasets that induce and then reverse emergent misalignment; the models then rate their own harmfulness without in-context examples, and the agreement between these self-reports and the actual alignment state is analysed. Result: Emergently misaligned models rate themselves as significantly more harmful than the base model and the realigned models, indicating behavioral self-awareness of their own misaligned state. Conclusion: Behavioral self-awareness can serve as an informative indicator of a model's internal safety state, supporting the idea of querying models directly for signals about their alignment and safety. Abstract: Recent research has demonstrated that large language models (LLMs) fine-tuned on incorrect trivia question-answer pairs exhibit toxicity - a phenomenon later termed "emergent misalignment". Moreover, research has shown that LLMs possess behavioral self-awareness - the ability to describe learned behaviors that were only implicitly demonstrated in training data. Here, we investigate the intersection of these phenomena. We fine-tune GPT-4.1 models sequentially on datasets known to induce and reverse emergent misalignment and evaluate whether the models are self-aware of their behavior transitions without providing in-context examples. Our results show that emergently misaligned models rate themselves as significantly more harmful compared to their base model and realigned counterparts, demonstrating behavioral self-awareness of their own emergent misalignment. Our findings show that behavioral self-awareness tracks actual alignment states of models, indicating that models can be queried for informative signals about their own safety.

[76] A Geometric Analysis of Small-sized Language Model Hallucinations

Emanuele Ricco,Elia Onofri,Lorenzo Cima,Stefano Cresci,Roberto Di Pietro

Main category: cs.CL

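TL;DR: This paper studies hallucinations in small language models from a geometric perspective, finds that genuine responses cluster more tightly in the embedding space, and builds on this a label propagation method that classifies large numbers of responses from only 30-50 annotations with F1 above 90%.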

Details

Motivation: Hallucinations (fluent but factually incorrect responses) seriously undermine the reliability of language models in multi-step or agentic settings, and effective detection is especially needed for small models. Method: The geometric hypothesis that genuine responses cluster more tightly in the embedding space is stated and proved, and this property is used to design a label-efficient propagation method for classifying responses. Result: The separability of genuine and hallucinated responses in the embedding space is shown to be consistent; with only 30-50 human annotations, response classification reaches F1 scores above 90%. Conclusion: Framing hallucination as a question of geometric structure in the embedding space opens a path beyond traditional knowledge-checking and single-response evaluation paradigms. Abstract: Hallucinations -- fluent but factually incorrect responses -- pose a major challenge to the reliability of language models, especially in multi-step or agentic settings. This work investigates hallucinations in small-sized LLMs through a geometric perspective, starting from the hypothesis that when models generate multiple responses to the same prompt, genuine ones exhibit tighter clustering in the embedding space. We prove this hypothesis and, leveraging this geometric insight, also show that it is possible to achieve a consistent level of separability. This latter result is used to introduce a label-efficient propagation method that classifies large collections of responses from just 30-50 annotations, achieving F1 scores above 90%. Our findings, framing hallucinations from a geometric perspective in the embedding space, complement traditional knowledge-centric and single-response evaluation paradigms, paving the way for further research.
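
The two ideas in the abstract, cluster tightness and propagation from a few labels, can be illustrated with a toy example. The sketch below measures tightness as mean pairwise cosine similarity and propagates labels by nearest labeled neighbour. Both choices are stand-ins chosen for simplicity; the paper's actual tightness measure and propagation method are not described in the abstract, and the 2-D vectors are invented.

```python
import math
from itertools import combinations
from statistics import mean

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def tightness(vectors):
    """Mean pairwise cosine similarity (higher = tighter cluster)."""
    return mean(cosine(u, v) for u, v in combinations(vectors, 2))

def propagate(labeled, unlabeled):
    """Assign each unlabeled embedding the label of its most similar labeled one."""
    return [max(labeled, key=lambda item: cosine(item[0], x))[1] for x in unlabeled]

# Hypothetical 2-D response embeddings.
tight = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
loose = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
labeled = [([1.0, 0.0], "genuine"), ([0.0, 1.0], "hallucinated")]

print(tightness(tight) > tightness(loose))
print(propagate(labeled, [[0.9, 0.1], [0.2, 0.8]]))
```

In practice the embeddings would come from a sentence encoder and the labeled set would be the 30-50 annotated responses mentioned in the abstract.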

[77] Overthinking Loops in Agents: A Structural Risk via MCP Tools

Yohan Lee,Jisoo Jang,Seoyeon Choi,Sangyeop Kim,Seungtaek Choi

Main category: cs.CL

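TL;DR: This paper exposes a new supply-chain attack surface in tool-using LLM agents, the structural overthinking attack: a malicious tool server registers tools that induce plausible-looking but cyclic and redundant tool-call chains, wasting resources and degrading task performance. Experiments show that decoding-time concision controls do not defend reliably, so defenses need to be designed at the level of tool-call structure.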

Details

Motivation: Tool-using LLM agents select and orchestrate tools based on tool metadata (names, descriptions, return messages); this convenience creates a supply-chain attack risk, in particular that malicious tools can be co-registered and induce covert cyclic calling behavior. Method: The structural overthinking attack is formalized as distinct from plain token verbosity; 14 malicious tools across 3 servers are implemented to trigger three attack types (repetition, forced refinement, and distraction); the attack is evaluated across multiple tool registries and tool-capable models. Result: The attack inflates token consumption by up to 142.4×, increases latency, and can harm task outcomes; decoding-time concision controls do not reliably prevent loop induction. Conclusion: Defending against such attacks cannot rely on token-level control alone and should reason about the structure of tool calls. Abstract: Tool-using LLM agents increasingly coordinate real workloads by selecting and chaining third-party tools based on text-visible metadata such as tool names, descriptions, and return messages. We show that this convenience creates a supply-chain attack surface: a malicious MCP tool server can be co-registered alongside normal tools and induce overthinking loops, where individually trivial or plausible tool calls compose into cyclic trajectories that inflate end-to-end tokens and latency without any single step looking abnormal. We formalize this as a structural overthinking attack, distinguishable from token-level verbosity, and implement 14 malicious tools across three servers that trigger repetition, forced refinement, and distraction. Across heterogeneous registries and multiple tool-capable models, the attack causes severe resource amplification (up to 142.4× tokens) and can degrade task outcomes. Finally, we find that decoding-time concision controls do not reliably prevent loop induction, suggesting defenses should reason about tool-call structure rather than tokens alone.
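
The abstract suggests that defenses should look at tool-call structure rather than tokens. One very simple structural check, sketched below, scans an agent's tool-call trace for short sequences of tool names that repeat back-to-back. This heuristic, its thresholds, and the tool names are assumptions for illustration; it is not a defense proposed or evaluated in the paper.

```python
def repeated_cycles(calls, max_len=3, min_repeats=3):
    """Return tool-name sequences of length <= max_len that occur
    back-to-back at least min_repeats times in the call trace."""
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(calls) - n * min_repeats + 1):
            window = tuple(calls[i:i + n])
            if all(tuple(calls[i + k * n:i + (k + 1) * n]) == window
                   for k in range(min_repeats)):
                found.add(window)
    return found

# Hypothetical traces: one with a refine/verify loop, one ordinary.
looping = ["search", "refine", "verify", "refine", "verify",
           "refine", "verify", "answer"]
normal = ["search", "fetch", "answer"]

print(repeated_cycles(looping))
print(repeated_cycles(normal))
```

Each individual call in the looping trace looks reasonable, which matches the abstract's point that no single step appears abnormal; only the repeated pattern stands out.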

[78] Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque

Jaione Bengoetxea,Itziar Gonzalez-Dios,Rodrigo Agerri

Main category: cs.CL

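TL;DR: This paper presents BasPhyCo, the first non-question-answering physical commonsense reasoning dataset for the low-resource language Basque, and evaluates multilingual LLMs on standard and dialectal variants at three levels (accuracy, consistency, verifiability). Models are especially weak on verifiability for dialectal variants.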

Details

Motivation: Fill the research gap on non-question-answering physical commonsense reasoning with large language models in low-resource languages such as Basque. Method: The Basque physical commonsense reasoning dataset BasPhyCo (with standard and dialectal variants) is built from the Italian GITA dataset, and multiple multilingual and language-specific pretrained models are evaluated at three levels: distinguishing plausible from implausible narratives, identifying the conflicting element, and determining the physical state that causes the implausibility. Result: LLMs show limited physical commonsense reasoning in Basque, especially in dialectal variants and particularly at the verifiability level. Conclusion: Current LLMs remain seriously lacking in physical commonsense reasoning for low-resource languages and their dialectal variants, and targeted data and modelling improvements are needed. Abstract: Physical commonsense reasoning represents a fundamental capability of human intelligence, enabling individuals to understand their environment, predict future events, and navigate physical spaces. Recent years have witnessed growing interest in reasoning tasks within Natural Language Processing (NLP). However, no prior research has examined the performance of Large Language Models (LLMs) on non-question-answering (non-QA) physical commonsense reasoning tasks in low-resource languages such as Basque. Taking the Italian GITA as a starting point, this paper addresses this gap by presenting BasPhyCo, the first non-QA physical commonsense reasoning dataset for Basque, available in both standard and dialectal variants. We evaluate model performance across three hierarchical levels of commonsense understanding: (1) distinguishing between plausible and implausible narratives (accuracy), (2) identifying the conflicting element that renders a narrative implausible (consistency), and (3) determining the specific physical state that creates the implausibility (verifiability). These tasks were assessed using multiple multilingual LLMs as well as models pretrained specifically for Italian and Basque. Results indicate that, in terms of verifiability, LLMs exhibit limited physical commonsense capabilities in low-resource languages such as Basque, especially when processing dialectal variants.

[79] Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research

Matteo Rinaldi,Rossella Varvara,Viviana Patti

Main category: cs.CL

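TL;DR: This paper introduces Testimole-conversational, a large corpus of Italian discussion board messages from 1996 to 2024 (more than 30 billion word-tokens), suitable for pre-training Italian large language models and for linguistic and sociological research.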

Details

Motivation: Build a large Italian discussion board corpus with a long time span to support pre-training of Italian large language models as well as linguistic and sociological research. Method: Messages from Italian discussion boards between 1996 and 2024 are collected and organized into a corpus of more than 30 billion word-tokens. Result: The resulting corpus, Testimole-conversational, covers a rich range of informal written Italian and online social interaction. Conclusion: The corpus is suited to NLP tasks (language modelling, domain adaptation, conversational analysis), supports research on language variation and social phenomena in digital communication, and will be made freely available to the research community. Abstract: We present "Testimole-conversational", a massive collection of discussion board messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), renders it an ideal dataset for native Italian Large Language Models' pre-training. Furthermore, discussion board messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction over a wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also supports investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.

[80] BFS-PO: Best-First Search for Large Reasoning Models

Fiorenzo Parascandolo,Wenhui Tan,Enver Sangineto,Ruihua Song,Rita Cucchiara

Main category: cs.CL

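TL;DR: This paper proposes BFS-PO, a reinforcement learning algorithm that uses best-first search with backtracking at maximum-entropy nodes to reduce overthinking in large reasoning models (LRMs), improving accuracy while shortening reasoning chains.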

Details

Motivation: Large reasoning models (LRMs) perform well on complex reasoning tasks but tend to produce overly long reasoning chains (overthinking), which raises compute cost and makes output verbose; reinforcement learning algorithms such as GRPO/DAPO may make this worse. Method: BFS-PO uses a best-first search strategy combined with a backtracking mechanism based on maximum-entropy nodes, generating progressively shorter correct responses during training so that the model learns concise reasoning paths. Result: Across multiple benchmarks and base LRMs, BFS-PO improves accuracy while shortening reasoning chains. Conclusion: BFS-PO is an effective RL training method for reducing overthinking in LRMs, improving performance and efficiency together. Abstract: Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance in reasoning tasks using long reasoning chains. However, this has also led to a significant increase in computational costs and the generation of verbose output, a phenomenon known as overthinking. The tendency to overthink is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO/DAPO. In this paper, we propose BFS-PO, an RL algorithm which alleviates this problem using a Best-First Search exploration strategy. Specifically, BFS-PO looks for the shortest correct answer using a backtracking mechanism based on maximum entropy nodes. By generating progressively shorter responses during training, BFS-PO learns to produce concise reasoning chains. Using different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase the LRM accuracy and shorten its answers.

[81] Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition

Varun Nathan,Shreyas Guha,Ayush Kumar

Main category: cs.CL

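TL;DR: This paper presents a domain-grounded framework and benchmark for tool-aware plan generation for business-insight queries in contact centers, comprising an evaluation framework, a data curation methodology, and a large-scale study of LLM planning ability, and reveals marked weaknesses of current LLMs on multi-step and compound queries and on tool alignment.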

Details

Motivation: Address the difficulty of generating executable analysis plans in contact centers that combine structured tools (such as Text2SQL) and unstructured tools (such as RAG), and fill gaps in existing methods regarding tool understanding, modelling of parallel dependencies, and systematic evaluation. Method: A domain-specific tool-aware plan generation framework is built with a two-mode evaluation scheme (metric-wise evaluation and one-shot evaluation); a data refinement process based on an evaluator-optimizer loop produces high-quality plan lineages; 14 LLMs are evaluated at scale for planning ability with and without lineage in the prompt. Result: LLMs perform poorly on compound queries and on plans exceeding 4 steps; the best total score is 84.8% (Claude-3-7-Sonnet) and the best one-shot match rate is only 49.75% (o3-mini); plan lineage yields mixed gains, helping several top models and improving step executability; shorter, simpler plans are markedly easier. Conclusion: Current LLMs still have persistent gaps in tool understanding, especially in tool-prompt alignment and tool-usage completeness; the framework offers a reproducible path for assessing and improving agentic planning for data queries in contact-center settings. Abstract: We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights, our target use case, requires decomposing it into executable steps over structured tools (Text2SQL (T2S)/Snowflake) and unstructured tools (RAG/transcripts) with explicit depends_on for parallelism. Our contributions are threefold: (i) a reference-based plan evaluation framework operating in two modes - a metric-wise evaluator spanning seven dimensions (e.g., tool-prompt alignment, query adherence) and a one-shot evaluator; (ii) a data curation methodology that iteratively refines plans via an evaluator->optimizer loop to produce high-quality plan lineages (ordered plan revisions) while reducing manual effort; and (iii) a large-scale study of 14 LLMs across sizes and families for their ability to decompose queries into step-by-step, executable, and tool-assigned plans, evaluated under prompts with and without lineage. Empirically, LLMs struggle on compound queries and on plans exceeding 4 steps (typically 5-15); the best total metric score reaches 84.8% (Claude-3-7-Sonnet), while the strongest one-shot match rate at the "A+" tier (Extremely Good, Very Good) is only 49.75% (o3-mini). Plan lineage yields mixed gains overall but benefits several top models and improves step executability for many. Our results highlight persistent gaps in tool-understanding, especially in tool-prompt alignment and tool-usage completeness, and show that shorter, simpler plans are markedly easier. The framework and findings provide a reproducible path for assessing and improving agentic planning with tools for answering data-analysis queries in contact-center settings.
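
The abstract mentions plans whose steps carry an explicit `depends_on` field to expose parallelism. A minimal sketch of how such a plan could be grouped into parallel stages is shown below. Only the `depends_on` name comes from the abstract; the `id` and `tool` fields, the step contents, and the function `parallel_stages` are hypothetical.

```python
def parallel_stages(plan):
    """Group step ids into stages; every step in a stage has all of its
    dependencies completed in earlier stages, so a stage can run in parallel."""
    remaining = {step["id"]: set(step["depends_on"]) for step in plan}
    done, stages = set(), []
    while remaining:
        ready = sorted(s for s, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic or unresolved depends_on")
        stages.append(ready)
        done.update(ready)
        for s in ready:
            del remaining[s]
    return stages

# Hypothetical three-step plan mixing a structured and an unstructured tool.
plan = [
    {"id": "s1", "tool": "T2S", "depends_on": []},
    {"id": "s2", "tool": "RAG", "depends_on": []},
    {"id": "s3", "tool": "T2S", "depends_on": ["s1", "s2"]},
]

print(parallel_stages(plan))
```

Here the SQL step and the transcript-retrieval step have no dependencies and can run together, while the final step waits for both.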

[82] Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance System

Kawin Mayilvaghanan,Siddhant Gupta,Ayush Kumar

Main category: cs.CL

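TL;DR: This paper evaluates the fairness of 18 large language models (LLMs) on contact center quality assurance (QA) and finds systematic bias across 13 dimensions covering identity, context, and behavioral style. It uses the Counterfactual Flip Rate (CFR) and Mean Absolute Score Difference (MASD) as fairness measures, identifies contextual priming and implicit linguistic identity cues as the main sources of bias, and finds that fairness does not track model accuracy.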

Details

Motivation: LLMs are widely used for quality assessment in contact centers, but their web-scale training data may introduce demographic and behavioral biases that affect the fairness of employee performance evaluation. Method: A counterfactual fairness evaluation framework defines and computes the Counterfactual Flip Rate (CFR) and Mean Absolute Score Difference (MASD); 18 LLMs are tested on 3,000 real contact center transcripts across 13 dimensions (identity, context, behavioral style), and the effect of fairness-aware prompting is analysed. Result: CFR ranges from 5.4% to 13.0%; contextual priming of historical performance causes the most severe bias (CFR up to 16.4%), and implicit linguistic identity cues are a persistent source of bias; larger, more strongly aligned models are fairer, but fairness does not improve with accuracy; explicit fairness prompts bring only modest improvement. Conclusion: Standardized fairness auditing is needed before LLMs are deployed in high-stakes workforce evaluation such as contact center QA, as current models still show significant bias from multiple sources. Abstract: Large Language Models (LLMs) are increasingly deployed in contact-center Quality Assurance (QA) to automate agent performance evaluation and coaching feedback. While LLMs offer unprecedented scalability and speed, their reliance on web-scale training data raises concerns regarding demographic and behavioral biases that may distort workforce assessment. We present a counterfactual fairness evaluation of LLM-based QA systems across 13 dimensions spanning three categories: Identity, Context, and Behavioral Style. Fairness is quantified using the Counterfactual Flip Rate (CFR), the frequency of binary judgment reversals, and the Mean Absolute Score Difference (MASD), the average shift in coaching or confidence scores across counterfactual pairs. Evaluating 18 LLMs on 3,000 real-world contact center transcripts, we find systematic disparities, with CFR ranging from 5.4% to 13.0% and consistent MASD shifts across confidence, positive, and improvement scores. Larger, more strongly aligned models show lower unfairness, though fairness does not track accuracy. Contextual priming of historical performance induces the most severe degradations (CFR up to 16.4%), while implicit linguistic identity cues remain a persistent bias source. Finally, we analyze the efficacy of fairness-aware prompting, finding that explicit instructions yield only modest improvements in evaluative consistency. Our findings underscore the need for standardized fairness auditing pipelines prior to deploying LLMs in high-stakes workforce evaluation.
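
The two fairness measures follow directly from their definitions in the abstract: CFR is the share of counterfactual pairs whose binary judgment flips, and MASD is the mean absolute difference in score across pairs. The sketch below computes both on invented pairs; the data and function names are hypothetical, and any weighting or aggregation across dimensions used in the paper is not covered.

```python
def counterfactual_flip_rate(judgment_pairs):
    """Share of (original, counterfactual) binary judgments that differ."""
    flips = sum(1 for original, counterfactual in judgment_pairs
                if original != counterfactual)
    return flips / len(judgment_pairs)

def mean_abs_score_diff(score_pairs):
    """Average absolute shift in score between original and counterfactual."""
    return sum(abs(a - b) for a, b in score_pairs) / len(score_pairs)

# Hypothetical outputs for four transcripts and their counterfactual versions.
judgments = [(True, True), (True, False), (False, False), (True, True)]
scores = [(4.0, 4.0), (3.0, 2.0), (5.0, 4.5), (2.0, 2.5)]

print(counterfactual_flip_rate(judgments))
print(mean_abs_score_diff(scores))
```

A perfectly counterfactually fair evaluator would score 0 on both measures, since changing only the protected or contextual attribute would leave every judgment and score unchanged.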

[83] Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation

Mengdan Zhu,Yufan Zhao,Tao Di,Yulan Yan,Liang Zhao

Main category: cs.CL

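TL;DR: This paper proposes a reinforcement learning framework in which a large language model generates lists of interest-driven news search queries from cross-domain user signals. The policy is optimized with GRPO, performance improves with more inference-time sampling and larger model capacity, and a lightweight student model is obtained through on-policy distillation for deployment.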

Details

Motivation: Existing news recommendation systems struggle to mine deep, reusable user interests from heterogeneous cross-domain signals and face scalability challenges in large-scale production environments. Method: A reinforcement learning framework for query-list generation that uses GRPO with multiple reward signals for policy optimization; a systematic study of two compute dimensions, inference-time sampling and model capacity; and on-policy distillation that transfers the policy of a large teacher model to a lightweight student model. Result: Offline experiments, ablation studies, and online A/B tests all show that the method improves interest modeling quality and downstream recommendation performance, with consistent gains in a real production system. Conclusion: LLMs trained with reinforcement learning can mine deep cross-domain user interests, and combining compute scaling with distillation meets industrial deployment requirements while preserving performance. Abstract: News recommendation plays a critical role in online news platforms by helping users discover relevant content. Cross-domain news recommendation further requires inferring users' underlying information needs from heterogeneous signals that often extend beyond direct news consumption. A key challenge lies in moving beyond surface-level behaviors to capture deeper, reusable user interests while maintaining scalability in large-scale production systems. In this paper, we present a reinforcement learning framework that trains large language models to generate high-quality lists of interest-driven news search queries from cross-domain user signals. We formulate query-list generation as a policy optimization problem and employ GRPO with multiple reward signals. We systematically study two compute dimensions: inference-time sampling and model capacity, and empirically observe consistent improvements with increased compute that exhibit scaling-like behavior. Finally, we perform on-policy distillation to transfer the learned policy from a large, compute-intensive teacher to a compact student model suitable for scalable deployment. Extensive offline experiments, ablation studies and large-scale online A/B tests in a production news recommendation system demonstrate consistent gains in both interest modeling quality and downstream recommendation performance.
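GRPO scores each sampled output relative to the other outputs sampled for the same input, which the sketch below illustrates. Two points are assumptions rather than details from the abstract: the reward signals are combined by a weighted sum, and the group statistics use the population standard deviation (implementations differ on this). The reward names and values are invented.

```python
from statistics import mean, pstdev


def combine_rewards(reward_components, weights):
    """Weighted sum of several reward signals (an assumed combination rule)."""
    return sum(w * r for w, r in zip(weights, reward_components))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four query lists sampled for the same user, each scored by two
# hypothetical rewards: (relevance, diversity).
group = [(0.9, 0.2), (0.4, 0.8), (0.1, 0.1), (0.7, 0.6)]
weights = (0.7, 0.3)

rewards = [combine_rewards(components, weights) for components in group]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward = {r:.2f}  advantage = {a:+.2f}")
```

Because the advantages are centered within the group, samples that beat their siblings are reinforced and the rest are penalized, without a separate learned value model.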

[84] Cold-Start Personalization via Training-Free Priors from Structured World Models

Avinandan Bose,Shuyue Stella Li,Faeze Brahman,Pang Wei Koh,Simon Shaolei Du,Yulia Tsvetkov,Maryam Fazel,Lin Xiao,Asli Celikyilmaz

Main category: cs.CL

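TL;DR: This paper proposes Pep, a framework that learns the correlation structure of preferences offline and performs Bayesian inference online for efficient cold-start preference elicitation. Across several domains it outperforms reinforcement learning with fewer interactions, far fewer parameters, and higher response alignment.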

Details

Motivation: Cold-start personalization is a routing problem: users care about only a few of many preference dimensions, and the dimensions that matter differ from user to user. With a limited question budget, a structured questioning strategy is needed, yet conventional reinforcement learning, whose terminal reward ignores the factored structure of preference data, tends to collapse to fixed question sequences that ignore user responses. Method: Cold-start elicitation is decoupled into two stages. Offline, a structured world model of preference correlations is learned from complete user profiles; online, without any training, Bayesian inference selects the most informative questions and predicts preferences on dimensions that were never asked about. Result: Across medical, mathematical, social, and commonsense reasoning, Pep raises the alignment between generated responses and users' stated preferences from 68.5% (RL) to 80.8% with 3-5x fewer interactions; when two users answer the same question differently, Pep changes its follow-up question 39-62% of the time versus 0-28% for RL; the model has only about 10K parameters versus 8B for RL. Conclusion: The key bottleneck in cold-start elicitation is the ability to exploit the factored structure of preference data; Pep achieves strong performance with a very small model through structured modeling and Bayesian inference, indicating that structural priors and probabilistic inference suit this task better than large-scale end-to-end learning. Abstract: Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available. The core challenge is a routing problem: each task admits dozens of preference dimensions, yet individual users care about only a few, and which ones matter depends on who is asking. With a limited question budget, asking without structure will miss the dimensions that matter. Reinforcement learning is the natural formulation, but in multi-turn settings its terminal reward fails to exploit the factored, per-criterion structure of preference data, and in practice learned policies collapse to static question sequences that ignore user responses. We propose decomposing cold-start elicitation into offline structure learning and online Bayesian inference. Pep (Preference Elicitation with Priors) learns a structured world model of preference correlations offline from complete profiles, then performs training-free Bayesian inference online to select informative questions and predict complete preference profiles, including dimensions never asked about. The framework is modular across downstream solvers and requires only simple belief models. Across medical, mathematical, social, and commonsense reasoning, Pep achieves 80.8% alignment between generated responses and users' stated preferences versus 68.5% for RL, with 3-5x fewer interactions. When two users give different answers to the same question, Pep changes its follow-up 39-62% of the time versus 0-28% for RL. It does so with ~10K parameters versus 8B for RL, showing that the bottleneck in cold-start elicitation is the capability to exploit the factored structure of preference data.
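The idea of selecting informative questions by Bayesian inference can be sketched as follows. This is not Pep's actual belief model: it assumes a belief over a small set of discrete latent user profiles, yes/no questions with a known answer likelihood per profile, and question selection by expected information gain. All names and numbers are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def posterior(prior, likelihood_yes, answer):
    """Bayes update; likelihood_yes[k] = P(answer is yes | profile k)."""
    unnorm = [p * (l if answer else 1.0 - l) for p, l in zip(prior, likelihood_yes)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def expected_information_gain(prior, likelihood_yes):
    """Expected entropy reduction over profiles from asking one question."""
    p_yes = sum(p * l for p, l in zip(prior, likelihood_yes))
    gain = entropy(prior)
    for answer, p_answer in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_answer > 0:
            gain -= p_answer * entropy(posterior(prior, likelihood_yes, answer))
    return gain


# Two latent profiles and two candidate questions (all values invented).
prior = [0.5, 0.5]
questions = {
    "wants_step_by_step": [0.9, 0.1],  # answers differ sharply by profile
    "likes_examples": [0.5, 0.5],      # answers carry no information
}

best = max(questions, key=lambda q: expected_information_gain(prior, questions[q]))
print("ask first:", best)
print("belief after 'yes':", posterior(prior, questions[best], True))
print("belief after 'no': ", posterior(prior, questions[best], False))
```

Since the belief after a "yes" differs from the belief after a "no", the next question chosen under the updated belief can differ too, which is the kind of response-dependent follow-up the abstract contrasts with static RL policies.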

[85] Text Style Transfer with Parameter-efficient LLM Finetuning and Round-trip Translation

Ruoxi Liu,Philipp Koehn

Main category: cs.CL

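TL;DR: This paper proposes a text style transfer method based on parameter-efficient fine-tuning of large language models. It uses round-trip translation to build parallel data from monolingual corpora and adds retrieval-augmented generation to improve the consistency of terminology and names.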

Details

Motivation: Parallel corpora are scarce for text style transfer. Method: Parameter-efficient fine-tuning of LLMs, with round-trip translation used to synthesize parallel data from monolingual corpora; the resulting 'neutralized' text serves as a shared input style, and retrieval-augmented generation (RAG) supplies terminology and name knowledge. Result: Experiments in four domains show that the method consistently outperforms zero-shot prompting and few-shot in-context learning in both BLEU and style accuracy. Conclusion: The method reduces the dependence on parallel corpora and improves style transfer quality and robustness, particularly for terminology and name consistency. Abstract: This paper proposes a novel method for Text Style Transfer (TST) based on parameter-efficient fine-tuning of Large Language Models (LLMs). Addressing the scarcity of parallel corpora that map between styles, the study employs round-trip translation to synthesize such parallel datasets from monolingual corpora. This approach creates 'neutralized' text devoid of stylistic attributes, essentially creating a shared input style at training time and inference time. Experimental results demonstrate consistent superiority of this method over zero-shot prompting and few-shot ICL techniques, measured by BLEU scores and style accuracy scores across four investigated domains. Furthermore, the integration of retrieval-augmented generation (RAG) for terminology and name knowledge enhances robustness and stylistic consistency.

cs.CV [Back]

[86] Beyond Ground: Map-Free LiDAR Relocalization for UAVs

Hengyu Mu,Jianshi Wu,Yuxin Guo,XianLian Lin,Qingyong Hu,Chenglu Wen,Cheng Wang

Main category: cs.CV

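TL;DR: This paper proposes MAILS, a map-free LiDAR relocalization framework for unmanned aerial vehicles (UAVs). Locality-preserving sliding window attention, coordinate-independent feature initialization, and locally invariant positional encoding make it robust to large yaw rotations and altitude changes on sparse point clouds. The authors also build the first large-scale LiDAR relocalization dataset that reflects real UAV flight characteristics, and experiments show that MAILS localizes far more accurately than existing methods.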

Details

Motivation: Existing LiDAR relocalization methods mainly target autonomous driving; in UAV scenarios, large yaw rotations, altitude changes, and sparse point clouds severely degrade their accuracy, and no dedicated dataset reflects real UAV flight characteristics such as irregular trajectories and varying altitudes. Method: The MAILS framework: (1) a Locality-Preserving Sliding Window Attention module extracts locally discriminative geometric features; (2) coordinate-independent feature initialization and a locally invariant positional encoding improve robustness to large yaw and altitude changes; (3) a large-scale UAV LiDAR relocalization dataset covering four scenes and various flight trajectories is constructed. Result: On the self-built UAV dataset and in comparative experiments, MAILS achieves high-precision relocalization and clearly outperforms existing methods. Conclusion: MAILS is the first map-free LiDAR relocalization framework designed specifically for UAVs; with its architectural improvements and a dedicated dataset, it addresses the key challenges of LiDAR relocalization in UAV scenarios and improves localization robustness and accuracy in GNSS-denied environments. Abstract: Localization is a fundamental capability in unmanned aerial vehicle (UAV) systems. Map-free LiDAR relocalization offers an effective solution for achieving high-precision positioning in environments with weak or unavailable GNSS signals. However, existing LiDAR relocalization methods are primarily tailored to autonomous driving, exhibiting significantly degraded accuracy in UAV scenarios. In this paper, we propose MAILS, a novel map-free LiDAR relocalization framework for UAVs. A Locality-Preserving Sliding Window Attention module is first introduced to extract locally discriminative geometric features from sparse point clouds. To handle substantial yaw rotations and altitude variations encountered during UAV flight, we then design a coordinate-independent feature initialization module and a locally invariant positional encoding mechanism, which together significantly enhance the robustness of feature extraction. Furthermore, existing LiDAR-based relocalization datasets fail to capture real-world UAV flight characteristics, such as irregular trajectories and varying altitudes. To address this gap, we construct a large-scale LiDAR localization dataset for UAVs, which comprises four scenes and various flight trajectories, designed to evaluate UAV relocalization performance under realistic conditions. Extensive experiments demonstrate that our method achieves satisfactory localization precision and consistently outperforms existing techniques by a significant margin. Our code and dataset will be released soon.

[87] Explanatory Interactive Machine Learning for Bias Mitigation in Visual Gender Classification

Nathanya Satriani,Djordje Slijepčević,Markus Schedl,Matthias Zeppelzauer

Main category: cs.CV

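TL;DR: This study examines whether explanatory interactive learning (XIL) can mitigate bias and spurious correlations in visual classifiers, focusing on gender classification. It compares CAIPI, RRR, and a new hybrid approach, and finds that CAIPI is the most effective at improving both fairness and accuracy.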

Details

Motivation: Visual classifiers, gender classifiers in particular, suffer from data bias, spurious correlations, and model bias; the goal is to improve their fairness and interpretability. Method: Two mainstream XIL strategies, CAIPI and Right for the Right Reasons (RRR), and a new hybrid of the two are applied to gender classification; explanations are generated with GradCAM and BLA and evaluated quantitatively against segmentation masks. Result: CAIPI is the most effective at steering the model toward relevant image features and clearly reduces the gap in misclassification rates between genders (that is, it reduces bias); all XIL methods improve transparency and fairness, most at the cost of a slight drop in performance, whereas CAIPI can even slightly improve classification accuracy. Conclusion: The XIL paradigm, CAIPI in particular, can mitigate model bias and improve fairness and interpretability without sacrificing performance, and sometimes while improving it, which shows its potential for deploying trustworthy AI in sensitive tasks. Abstract: Explanatory interactive learning (XIL) enables users to guide model training in machine learning (ML) by providing feedback on the model's explanations, thereby helping it to focus on features that are relevant to the prediction from the user's perspective. In this study, we explore the capability of this learning paradigm to mitigate bias and spurious correlations in visual classifiers, specifically in scenarios prone to data bias, such as gender classification. We investigate two methodologically different state-of-the-art XIL strategies, i.e., CAIPI and Right for the Right Reasons (RRR), as well as a novel hybrid approach that combines both strategies. The results are evaluated quantitatively by comparing segmentation masks with explanations generated using Gradient-weighted Class Activation Mapping (GradCAM) and Bounded Logit Attention (BLA). Experimental results demonstrate the effectiveness of these methods in (i) guiding ML models to focus on relevant image features, particularly when CAIPI is used, and (ii) reducing model bias (i.e., balancing the misclassification rates between male and female predictions). Our analysis further supports the potential of XIL methods to improve fairness in gender classifiers. Overall, the increased transparency and fairness obtained by XIL come with slight performance decreases, with the exception of CAIPI, which shows potential to even improve classification accuracy.

[88] COOPERTRIM: Adaptive Data Selection for Uncertainty-Aware Cooperative Perception

Shilpa Mukhopadhyay,Amit Roy-Chowdhury,Hang Qiu

Main category: cs.CV

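TL;DR: This paper proposes COOPERTRIM, a framework that exploits temporal continuity to dynamically select which features to share, cutting bandwidth requirements substantially (by up to 98.54%) while maintaining perception performance.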

Details

Motivation: Cooperative perception is held back by the tension between limited communication bandwidth and rich sensor information, and existing feature selection methods still fall short of what practical wireless links can carry. Method: COOPERTRIM, a temporally aware adaptive feature selection framework that introduces a conformal temporal uncertainty metric to gauge feature relevance and a data-driven mechanism to decide dynamically how many features to share. Result: On semantic segmentation and 3D detection, bandwidth is reduced by up to 80.28% and 72.52% respectively; relative to other selection strategies, IoU improves by as much as 45.54% with up to 72% less bandwidth; combined with compression, bandwidth usage drops to as low as 1.46% without loss of IoU. Conclusion: By modeling time, COOPERTRIM achieves a flexible and robust bandwidth-performance trade-off, makes cooperative perception considerably more practical, and paves the way for real-world deployment. Abstract: Cooperative perception enables autonomous agents to share encoded representations over wireless communication to enhance each other's live situational awareness. However, the tension between the limited communication bandwidth and the rich sensor information hinders its practical deployment. Recent studies have explored selection strategies that share only a subset of features per frame while striving to keep the performance on par. Nevertheless, the bandwidth requirement still stresses current wireless technologies. To fundamentally ease the tension, we take a proactive approach, exploiting the temporal continuity to identify features that capture environment dynamics, while avoiding repetitive and redundant transmission of static information. By incorporating temporal awareness, agents are empowered to dynamically adapt the sharing quantity according to environment complexity. We instantiate this intuition into an adaptive selection framework, COOPERTRIM, which introduces a novel conformal temporal uncertainty metric to gauge feature relevance, and a data-driven mechanism to dynamically determine the sharing quantity. To evaluate COOPERTRIM, we take semantic segmentation and 3D detection as example tasks. Across multiple open-source cooperative segmentation and detection models, COOPERTRIM achieves up to 80.28% and 72.52% bandwidth reduction respectively while maintaining a comparable accuracy. Relative to other selection strategies, COOPERTRIM also improves IoU by as much as 45.54% with up to 72% less bandwidth. Combined with compression strategies, COOPERTRIM can further reduce bandwidth usage to as low as 1.46% without compromising IoU performance. Qualitative results show COOPERTRIM gracefully adapts to environmental dynamics, localization error, and communication latency, demonstrating flexibility and paving the way for real-world deployment.

[89] Evaluating the Impact of Post-Training Quantization on Reliable VQA with Multimodal LLMs

Paul Jonas Kurz,Tobias Jan Wieczorek,Mohamed A. Abdelsalam,Rahaf Aljundi,Marcus Rohrbach

Main category: cs.CV

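TL;DR: This paper studies how post-training quantization (PTQ) affects the accuracy and reliability of multimodal large language models (MLLMs) on visual question answering (VQA). Quantization degrades both, but the data-aware MBQ method and an adapted Selector confidence estimator substantially mitigate the loss of reliability; int4 MBQ with the Selector approaches full-precision performance with about 75% less memory.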

Details

Motivation: Current MLLMs have two problems: they are overconfident (wrong answers given with high confidence) and too large to deploy on edge devices. A systematic study of how compression such as PTQ jointly affects reliability and accuracy is therefore needed. Method: Two MLLMs, Qwen2-VL-7B and Idefics3-8B, are quantized at multiple bit widths with a data-free method (HQQ) and a data-aware method (MBQ); the Selector confidence estimator is adapted to quantized multimodal settings and its robustness is evaluated across quantization levels and out-of-distribution (OOD) scenarios. Result: PTQ lowers accuracy and reliability together; MBQ mitigates the loss better than HQQ; the Selector substantially improves the reliability of quantized models; int4 MBQ with the Selector gives the best efficiency-reliability trade-off, approaching the original model's performance with about 75% less memory. Conclusion: Quantization and reliability are closely linked in multimodal models, and data-aware quantization combined with a dedicated confidence estimator is a viable way to obtain both efficiency and reliability; this is the first systematic study linking quantization and reliability in multimodal settings. Abstract: Multimodal Large Language Models (MLLMs) are increasingly deployed in domains where both reliability and efficiency are critical. However, current models remain overconfident, producing highly certain but incorrect answers. At the same time, their large size limits deployment on edge devices, necessitating compression. We study the intersection of these two challenges by analyzing how Post-Training Quantization (PTQ) compression affects both accuracy and reliability in Visual Question Answering (VQA). We evaluate two MLLMs, Qwen2-VL-7B and Idefics3-8B, quantized with data-free (HQQ) and data-aware (MBQ) methods across multiple bit widths. To counteract the reduction in reliability caused by quantization, we adapt the Selector confidence estimator for quantized multimodal settings and test its robustness across various quantization levels and out-of-distribution (OOD) scenarios. We find that PTQ degrades both accuracy and reliability. Data-aware methods soften the effect thereof. The Selector substantially mitigates the reliability impact. The combination of int4 MBQ and the Selector achieves the best efficiency-reliability trade-off, closing in on uncompressed performance at approx. 75% less memory demand. Overall, we present the first systematic study linking quantization and reliability in multimodal settings.
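The "approx. 75% less memory" figure is consistent with simple weight-storage arithmetic, sketched below. The sketch assumes a 16-bit baseline and counts weights only; it ignores quantization metadata (scales, zero points), activations, and any layers kept at higher precision, so real savings will differ somewhat. The parameter count is a nominal value for a "7B" model, not a figure from the paper.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


def relative_saving(baseline_bits, quantized_bits):
    """Fraction of weight memory saved by lowering the bit width."""
    return 1.0 - quantized_bits / baseline_bits


params = 7_000_000_000  # nominal parameter count (assumption)

fp16_gib = weight_memory_gib(params, 16)
int4_gib = weight_memory_gib(params, 4)
saving = relative_saving(16, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:  {int4_gib:.1f} GiB")
print(f"saving:         {saving:.0%}")
```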

[90] NutVLM: A Self-Adaptive Defense Framework against Full-Dimension Attacks for Vision Language Models in Autonomous Driving

Xiaoxu Peng,Dong Zhou,Jianwen Zhang,Guanghui Sun,Anh Tu Ngo,Anupam Chattopadhyay

Main category: cs.CV

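TL;DR: This paper proposes NutVLM, a self-adaptive defense framework that makes vision language models (VLMs) in autonomous driving more robust to both local and global adversarial attacks while preserving clean-sample performance. Its core components are NutNet++, a three-class discriminative sentinel network; grayscale masking to purify local attacks; and Expert-guided Adversarial Prompt Tuning (EAPT), which avoids full-model fine-tuning. It improves overall metrics on the Dolphins benchmark by 4.89%.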

Details

Motivation: Existing VLM defenses struggle to reconcile adversarial robustness with clean-sample performance and do not cover the full range of threats, from local physical patches to imperceptible global perturbations; an end-to-end, lightweight, adaptive safety solution is needed. Method: The NutVLM framework: (1) NutNet++ acts as a unified sentinel that classifies inputs three ways, as benign samples, local patches, or global perturbations; (2) local threats are purified with efficient grayscale masking; (3) global perturbations trigger Expert-guided Adversarial Prompt Tuning (EAPT), which generates "corrective driving prompts" through gradient-based latent optimization and discrete projection to refocus the VLM's attention. Result: On the Dolphins benchmark, NutVLM improves overall metrics such as Accuracy, Language Score, and GPT Score by 4.89%, clearly outperforming existing defenses without harming original performance. Conclusion: NutVLM is a scalable, low-overhead, end-to-end defense framework for VLMs that balances robustness and generalization and offers a practical safeguard for intelligent transportation systems. Abstract: Vision Language Models (VLMs) have advanced perception in autonomous driving (AD), but they remain vulnerable to adversarial threats. These risks range from localized physical patches to imperceptible global perturbations. Existing defense methods for VLMs remain limited and often fail to reconcile robustness with clean-sample performance. To bridge these gaps, we propose NutVLM, a comprehensive self-adaptive defense framework designed to secure the entire perception-decision lifecycle. Specifically, we first employ NutNet++ as a sentinel, which is a unified detection-purification mechanism. It identifies benign samples, local patches, and global perturbations through three-way classification. Subsequently, localized threats are purified via efficient grayscale masking, while global perturbations trigger Expert-guided Adversarial Prompt Tuning (EAPT). Instead of the costly parameter updates of full-model fine-tuning, EAPT generates "corrective driving prompts" via gradient-based latent optimization and discrete projection. These prompts refocus the VLM's attention without requiring exhaustive full-model retraining. Evaluated on the Dolphins benchmark, our NutVLM yields a 4.89% improvement in overall metrics (e.g., Accuracy, Language Score, and GPT Score). These results validate NutVLM as a scalable security solution for intelligent transportation. Our code is available at https://github.com/PXX/NutVLM.

[91] VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

Jiarong Liang,Max Ku,Ka-Hei Hui,Ping Nie,Wenhu Chen

Main category: cs.CV

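TL;DR: This paper proposes the VisPhyWorld framework and the VisPhyBench benchmark, which evaluate the physical reasoning of multimodal large language models by requiring them to generate executable physics simulation code. Current models understand scene semantics well but fall short at inferring physical parameters and simulating consistent dynamics.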

Details

Motivation: Existing physical reasoning evaluations such as VQA and VoE can be passed with shortcut strategies and do not test whether a model forms a testable physical hypothesis; a paradigm that directly probes physical world modeling is needed. Method: VisPhyWorld, an execution-based evaluation framework that requires models to generate runnable physics simulation code from visual input; on top of it, the VisPhyBench benchmark with 209 scenes derived from 108 physical templates, together with a systematic protocol that evaluates appearance reconstruction and the plausibility of physical motion. Result: The benchmark pipeline produces valid reconstructed videos in 97.7% of cases; experiments show that state-of-the-art MLLMs are strong at semantic understanding but weak at inferring physical parameters and simulating consistent dynamics. Conclusion: Generating executable physics code is a stricter and verifiable way to evaluate physical reasoning; current MLLMs have not yet truly mastered modeling of the underlying physical dynamics. Abstract: Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations. By producing runnable code, the inferred world representation is directly inspectable, editable, and falsifiable. This separates physical reasoning from rendering. Building on this framework, we introduce VisPhyBench, comprising 209 evaluation scenes derived from 108 physical templates and a systematic protocol that evaluates how well models reconstruct appearance and reproduce physically plausible motion. Our pipeline produces valid reconstructed videos in 97.7% of cases on the benchmark. Experiments show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to accurately infer physical parameters and to simulate consistent physical dynamics.

Edwyn Brient,Santiago Velasco-Forero,Rami Kassab

Main category: cs.CV

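TL;DR: This paper proposes new evaluation metrics based on a physical decomposition of HRRP data into mask, features, and noise, addressing the lack of interpretability and multi-level evaluation in existing assessments that rely on black-box classification models.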

Details

Motivation: Generated HRRP data are currently evaluated with black-box classification models, which offer neither interpretability nor multi-level evaluation. Method: HRRP data are decomposed into three physical components, mask, features, and noise, and two physically grounded evaluation metrics are designed on top of this decomposition. Result: The proposed metrics are validated on an expensive and challenging dataset and show good discriminative ability. Conclusion: Metrics based on physical decomposition provide a more interpretable and finer-grained assessment of the quality of generated HRRP data. Abstract: High-resolution range profile (HRRP) data are in vogue in radar automatic target recognition (RATR). With the interest in classification models using HRRP, filling gaps in datasets using generative models has recently received promising contributions. Evaluating generated data is a challenging topic, even for explicit data like face images. However, the evaluation methods used in the state of the art of HRRP generation rely on classification models. Such "black-box" models allow neither explainability of the generated data nor multi-level evaluation. This work focuses on decomposing HRRP data into three components: the mask, the features, and the noise. Using this decomposition, we propose two metrics based on the physical interpretation of those data. We leverage an expensive dataset to evaluate our metrics on a challenging task and demonstrate their discriminative ability.

[93] Conditional Generative Models for High-Resolution Range Profiles: Capturing Geometry-Driven Trends in a Large-Scale Maritime Dataset

Edwyn Brient,Santiago Velasco-Forero,Rami Kassab

Main category: cs.CV

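TL;DR: This paper studies the synthesis of high-resolution range profiles (HRRPs) on a large-scale maritime database. It finds that geometric factors such as ship dimensions and aspect angle are the key drivers of HRRP generation, and generative models conditioned on them reproduce the line-of-sight geometric trend seen in real data.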

Details

Motivation: HRRPs are highly sensitive to acquisition conditions, which limits their robustness across operational scenarios, and existing conditional generation methods are constrained by small, highly specific datasets. Method: Using a large-scale maritime database, the analysis identifies the geometric factors that dominate HRRP variation (ship dimensions and the desired aspect angle), and generative models are trained conditioned on them. Result: The synthesized HRRPs accurately reproduce the expected line-of-sight geometric trend observed in measured data. Conclusion: Acquisition geometry is the central factor for robust HRRP generation. Abstract: High-resolution range profiles (HRRPs) enable fast onboard processing for radar automatic target recognition, but their strong sensitivity to acquisition conditions limits robustness across operational scenarios. Conditional HRRP generation can mitigate this issue, yet prior studies are constrained by small, highly specific datasets. We study HRRP synthesis on a large-scale maritime database representative of coastal surveillance variability. Our analysis indicates that the fundamental scenario drivers are geometric: ship dimensions and the desired aspect angle. Conditioning on these variables, we train generative models and show that the synthesized signatures reproduce the expected line-of-sight geometric trend observed in real data. These results highlight the central role of acquisition geometry for robust HRRP generation.

[94] Effect of Convolutional Depth on Image Recognition Performance: VGG vs. ResNet vs. GoogLeNet

Manfred M. Fischer,Joshua Pitts

Main category: cs.CV

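TL;DR: Through controlled experiments on three CNN architectures, VGG, ResNet, and GoogLeNet, this paper argues that a network's effective depth, not its nominal depth, is what governs performance, convergence, and computational efficiency; residual and Inception structures use depth more efficiently to improve accuracy.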

Details

Motivation: Deep convolutional networks have driven progress in image recognition, but simply increasing nominal depth does not always bring higher accuracy, stable training, or efficient computation; the mechanism by which depth actually helps needs to be clarified. Method: A controlled comparison of VGG, ResNet, and GoogLeNet under standardized training, explicitly distinguishing nominal depth from effective depth and analyzing the effect of each. Result: Plain deep networks show accuracy saturation and optimization instability, whereas ResNet and GoogLeNet, thanks to their architectural mechanisms, reach higher accuracy at lower effective depth with a better accuracy-compute trade-off. Conclusion: Effective depth, not nominal depth, determines the benefit of depth; architecture design, by constraining how effective depth is realized, determines how well depth works as a scaling dimension. Abstract: Increasing convolutional depth has been central to advances in image recognition, yet deeper networks do not uniformly yield higher accuracy, stable optimization, or efficient computation. We present a controlled comparative study of three canonical convolutional neural network architectures - VGG, ResNet, and GoogLeNet - to isolate how depth influences classification performance, convergence behavior, and computational efficiency. By standardizing training protocols and explicitly distinguishing between nominal and effective depth, we show that the benefits of depth depend critically on architectural mechanisms that constrain its effective manifestation during training rather than on nominal depth alone. Although plain deep networks exhibit early accuracy saturation and optimization instability, residual and inception-based architectures consistently translate additional depth into improved accuracy at lower effective depth and favorable accuracy-compute trade-offs. These findings demonstrate that effective depth, not nominal depth, is the operative quantity governing depth's role as a productive scaling dimension in convolutional networks.

[95] KidMesh: Computational Mesh Reconstruction for Pediatric Congenital Hydronephrosis Using Deep Neural Networks

Haoran Sun,Zhanpeng Zhu,Anguo Zhang,Bo Liu,Zhaohua Lin,Liqin Huang,Mingjing Yang,Lei Liu,Shan Lin,Wangbin Ding

Main category: cs.CV

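TL;DR: This paper proposes KidMesh, an end-to-end deep learning method that reconstructs 3D mesh models of pediatric congenital hydronephrosis (CH) directly from magnetic resonance urography (MRU) images. It does not rely on scarce, hard-to-obtain mesh-level annotations and serves both morphological and functional assessment.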

Details

Motivation: Existing voxel-based segmentation methods provide only morphological information and cannot directly support functional assessment such as urodynamic simulation, which requires complex post-processing to convert results into meshes; high-quality mesh annotations are also hard to obtain from sparsely sampled MRU slices. Method: KidMesh uses a deep neural network that extracts feature maps from MRU images, converts them into feature vertices by grid sampling, and deforms a template mesh accordingly to produce the target CH mesh; a new training scheme removes the need for accurate mesh-level annotations. Result: KidMesh reconstructs a CH mesh in 0.4 seconds on average; 3.7% and 0.2% of vertices have errors above 3.2mm and 6.4mm, respectively; after rasterization the Dice score is 0.86; the reconstructed meshes have no self-intersections and can be used directly in renal urine flow simulation. Conclusion: KidMesh provides efficient, end-to-end reconstruction of usable CH meshes from MRU images, overcoming annotation scarcity and cumbersome post-processing, and offers a new tool for clinical urodynamic analysis. Abstract: Pediatric congenital hydronephrosis (CH) is a common urinary tract disorder, primarily caused by obstruction at the renal pelvis-ureter junction. Magnetic resonance urography (MRU) can visualize hydronephrosis, including renal pelvis and calyces, by utilizing the natural contrast provided by water. Existing voxel-based segmentation approaches can extract CH regions from MRU, facilitating disease diagnosis and prognosis. However, these segmentation methods predominantly focus on morphological features, such as size, shape, and structure. To enable functional assessments, such as urodynamic simulations, external complex post-processing steps are required to convert these results into mesh-level representations. To address this limitation, we propose an end-to-end method based on deep neural networks, namely KidMesh, which could automatically reconstruct CH meshes directly from MRU. Generally, KidMesh extracts feature maps from MRU images and converts them into feature vertices through grid sampling. It then deforms a template mesh according to these feature vertices to generate the specific CH meshes of MRU images. Meanwhile, we develop a novel schema to train KidMesh without relying on accurate mesh-level annotations, which are difficult to obtain due to the sparsely sampled MRU slices. Experimental results show that KidMesh could reconstruct CH meshes in an average of 0.4 seconds, and achieve comparable performance to conventional methods without requiring post-processing. The reconstructed meshes exhibited no self-intersections, with only 3.7% and 0.2% of the vertices having error distances exceeding 3.2mm and 6.4mm, respectively. After rasterization, these meshes achieved a Dice score of 0.86 against manually delineated CH masks. Furthermore, these meshes could be used in renal urine flow simulations, providing valuable urodynamic information for clinical practice.
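The two accuracy figures quoted above, the Dice score and the share of vertices beyond an error threshold, can be computed as in this minimal sketch. It assumes flattened binary masks and a precomputed list of per-vertex error distances in millimetres; how the paper measures vertex error is not stated in the abstract, and the toy values are invented.

```python
def dice_score(pred_mask, true_mask):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for flat binary masks (0/1)."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    return 1.0 if total == 0 else 2.0 * intersection / total


def fraction_exceeding(errors_mm, threshold_mm):
    """Share of vertices whose error distance is above the threshold."""
    return sum(1 for e in errors_mm if e > threshold_mm) / len(errors_mm)


# Toy example: an 8-voxel mask pair and four vertex errors.
pred = [1, 1, 1, 0, 0, 1, 0, 0]
true = [1, 1, 0, 0, 0, 1, 1, 0]
vertex_errors_mm = [1.0, 4.0, 7.0, 2.0]

print(f"Dice: {dice_score(pred, true):.2f}")
print(f"> 3.2 mm: {fraction_exceeding(vertex_errors_mm, 3.2):.0%}")
print(f"> 6.4 mm: {fraction_exceeding(vertex_errors_mm, 6.4):.0%}")
```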

[96] DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving

Haisheng Su,Wei Wu,Feixiang Song,Junjie Zhang,Zhenjie Yang,Junchi Yan

Main category: cs.CV

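TL;DR: This paper proposes DriveMamba, a new end-to-end autonomous driving paradigm built on the Mamba architecture. Task-centric sparse modeling, implicit view correspondence, and long-term temporal fusion address the information loss, error accumulation, high computational complexity, and rigid multi-modal relation modeling of conventional Transformer-based E2E-AD.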

Details

Motivation: Existing end-to-end autonomous driving methods such as UniAD use a sequential modular design and dense BEV features, which leads to information loss and cumulative errors; insufficient training of the image backbone and the high complexity of attention further limit scalability and efficiency on spatiotemporal input. Method: DriveMamba: (1) a task-centric, single-stage Unified Mamba decoder; (2) image features and task outputs are converted into token-level sparse representations sorted by their positions in 3D space; (3) linear-complexity Mamba models long-range task dependencies; (4) a bidirectional trajectory-guided "local-to-global" scan preserves spatial locality from the ego perspective. Result: Experiments on nuScenes and Bench2Drive confirm clear advantages in performance, generalization, and inference efficiency. Conclusion: By replacing the dense, sequential Transformer paradigm with sparse, dynamic, linear-complexity modeling, DriveMamba offers a new approach and a practical framework for efficient, scalable end-to-end autonomous driving. Abstract: Recent advances in End-to-End Autonomous Driving (E2E-AD) have often been devoted to integrating modular designs into a unified framework for joint optimization, e.g., UniAD, which follows a sequential paradigm (i.e., perception-prediction-planning) based on separable Transformer decoders and relies on dense BEV features to encode scene representations. However, such a manual ordering design can inevitably cause information loss and cumulative errors, and lacks flexible and diverse relation modeling among different modules and sensors. Meanwhile, insufficient training of the image backbone and the quadratic complexity of the attention mechanism also hinder the scalability and efficiency of E2E-AD systems in handling spatiotemporal input. To this end, we propose DriveMamba, a Task-Centric Scalable paradigm for efficient E2E-AD, which integrates dynamic task relation modeling, implicit view correspondence learning and long-term temporal fusion into a single-stage Unified Mamba decoder. Specifically, both extracted image features and expected task outputs are converted into token-level sparse representations in advance, which are then sorted by their instantiated positions in 3D space. The linear-complexity operator enables efficient long-context sequential token modeling to capture task-related inter-dependencies simultaneously. Additionally, a bidirectional trajectory-guided "local-to-global" scan method is designed to preserve spatial locality from ego-perspective, thus facilitating the ego-planning. Extensive experiments conducted on nuScenes and Bench2Drive datasets demonstrate the superiority, generalizability and great efficiency of DriveMamba.

[97] Spectral Collapse in Diffusion Inversion

Nicolas Bourriez,Alexandre Verine,Auguste Genovesio

Main category: cs.CV

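TL;DR: This paper proposes Orthogonal Variance Guidance (OVG) to resolve the "spectral collapse" that spectral sparsity causes in conditional diffusion inversion, achieving both structural fidelity and realistic texture on super-resolution and sketch-to-image tasks.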

Details

Motivation: Standard deterministic diffusion inversion (e.g. DDIM) fails when the source domain is spectrally sparser than the target domain: the recovered latent deviates from an isotropic Gaussian, producing spectral collapse, that is, oversmoothed and texture-poor generations; stochastic alternatives restore the noise variance but break semantic consistency and cause structural drift. Method: Orthogonal Variance Guidance (OVG), an inference-time method that corrects the ODE dynamics so that the theoretical Gaussian noise magnitude is enforced within the null space of the structural gradient, balancing structure preservation against texture recovery. Result: Experiments on BBBC021 (microscopy super-resolution) and Edges2Shoes (sketch-to-shoe images) show that OVG restores realistic textures while preserving structural fidelity. Conclusion: OVG is a general and effective correction mechanism for diffusion inversion in spectrally asymmetric unpaired image translation, and it resolves the structure-texture trade-off. Abstract: Conditional diffusion inversion provides a powerful framework for unpaired image-to-image translation. However, we demonstrate through an extensive analysis that standard deterministic inversion (e.g. DDIM) fails when the source domain is spectrally sparse compared to the target domain (e.g., super-resolution, sketch-to-image). In these contexts, the recovered latent from the input does not follow the expected isotropic Gaussian distribution. Instead, it exhibits a signal with lower frequencies, locking target sampling to oversmoothed and texture-poor generations. We term this phenomenon spectral collapse. We observe that stochastic alternatives attempting to restore the noise variance tend to break the semantic link to the input, leading to structural drift. To resolve this structure-texture trade-off, we propose Orthogonal Variance Guidance (OVG), an inference-time method that corrects the ODE dynamics to enforce the theoretical Gaussian noise magnitude within the null-space of the structural gradient. Extensive experiments on microscopy super-resolution (BBBC021) and sketch-to-image (Edges2Shoes) demonstrate that OVG effectively restores photorealistic textures while preserving structural fidelity.
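Acting "within the null-space of the structural gradient" rests on a standard linear-algebra step: removing from an update the component that points along the gradient. The sketch below shows only that projection on plain vectors. It is not the authors' OVG algorithm, whose ODE correction and noise-magnitude constraint are not specified in the abstract, and the vectors are invented.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_orthogonal(v, g, eps=1e-12):
    """Return the part of v orthogonal to g: v - (<v, g> / <g, g>) g."""
    gg = dot(g, g)
    if gg < eps:
        return list(v)  # a zero gradient constrains nothing
    coef = dot(v, g) / gg
    return [vi - coef * gi for vi, gi in zip(v, g)]


# Hypothetical variance-restoring update v and structural gradient g.
v = [1.0, 2.0, 3.0]
g = [0.0, 0.0, 2.0]

v_perp = project_orthogonal(v, g)
print("projected update:", v_perp)
print("component along g:", dot(v_perp, g))
```

To first order, an update confined to this orthogonal complement leaves the structural objective unchanged, which gives some intuition for how variance could be restored without structural drift.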

[98] Progressive Contrast Registration for High-Fidelity Bidirectional Photoacoustic Microscopy Alignment

Jiahao Qin

Main category: cs.CV

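TL;DR: This paper proposes PCReg-Net, a progressive contrast-guided registration framework that addresses the domain shift and geometric misalignment caused by bidirectional scanning in high-speed optical-resolution photoacoustic microscopy (OR-PAM), substantially improving registration accuracy while keeping real-time speed.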

Details

Motivation: Bidirectional raster scanning doubles OR-PAM imaging speed but introduces coupled domain shift and geometric misalignment between forward and backward scan lines; existing methods that assume brightness constancy achieve limited registration quality (NCC ≤ 0.96). Method: PCReg-Net consists of four lightweight modules: (1) a registration U-Net for coarse alignment; (2) a reference feature extractor that captures multi-scale structural cues; (3) a contrast module that identifies residual misalignment by comparing coarse-registered and reference features; (4) a refinement U-Net with feature injection. Two reference-free metrics for temporal consistency, TNCC and TNCG, are also proposed. Result: On OR-PAM-Reg-4K (432 test samples), PCReg-Net reaches NCC = 0.983, SSIM = 0.982, and PSNR = 46.96 dB, more than 14 dB above the state of the art, and runs in real time. Conclusion: PCReg-Net overcomes the registration challenges of bidirectional scanning with both high accuracy and real-time speed, and provides a new reference-free evaluation approach that helps improve high-speed OR-PAM image quality. Abstract: High-speed optical-resolution photoacoustic microscopy (OR-PAM) with bidirectional raster scanning doubles imaging speed but introduces coupled domain shift and geometric misalignment between forward and backward scan lines. Existing methods, constrained by brightness constancy assumptions, achieve limited alignment quality (NCC ≤ 0.96). We propose PCReg-Net, a progressive contrast-guided registration framework that performs coarse-to-fine alignment through four lightweight modules: (1) a registration U-Net for coarse alignment, (2) a reference feature extractor capturing multi-scale structural cues, (3) a contrast module that identifies residual misalignment by comparing coarse-registered and reference features, and (4) a refinement U-Net with feature injection for high-fidelity output. We further propose the Temporal NCC (TNCC) and Temporal NCC Gap (TNCG) for reference-free evaluation of inter-frame temporal consistency. On OR-PAM-Reg-4K (432 test samples), PCReg-Net achieves NCC of 0.983, SSIM of 0.982, and PSNR of 46.96 dB, surpassing the state-of-the-art by over 14 dB at real-time speed. Code is available at https://github.com/JiahaoQin/PCReg-Net
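Two of the reported metrics, NCC and PSNR, have common textbook definitions, sketched below on flattened images. The paper's exact variants (windowing, normalization, peak value) are not given in the abstract, so the zero-mean NCC and the peak value of 1.0 used here are assumptions; TNCC and TNCG are not reproduced because their definitions are not stated.

```python
import math


def ncc(a, b):
    """Zero-mean normalized cross-correlation of two flattened images."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return num / den if den > 0 else 0.0


def psnr(a, b, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value**2 / mse)


forward = [0.10, 0.40, 0.35, 0.80]
# Same structure with a gain and offset change, mimicking a brightness shift.
backward = [2.0 * x + 0.1 for x in forward]

print(f"NCC:  {ncc(forward, backward):.3f}")
print(f"PSNR: {psnr(forward, backward):.2f} dB")
```

Zero-mean NCC is unchanged by a gain and offset applied to one image, so it measures structural alignment rather than brightness agreement, while PSNR penalizes the intensity difference directly. On the decibel scale, a 14 dB PSNR gap corresponds to roughly a 25-fold lower mean squared error.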

[99] WildfireVLM: AI-powered Analysis for Early Wildfire Detection and Risk Assessment Using Satellite Imagery

Aydin Ayanzadeh,Prakhar Dixit,Sadia Kamal,Milton Halem

Main category: cs.CV

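TL;DR: This paper proposes WildfireVLM, a framework that combines YOLOv12 object detection with multimodal large language models (MLLMs) to detect wildfires and smoke in satellite imagery and produce language-driven risk assessments, supporting real-time processing and disaster-response decisions.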

Details

Motivation: Wildfires are becoming more frequent and more damaging, yet satellite monitoring is hampered by faint smoke signals, changing weather, and the difficulty of real-time analysis over large areas, which calls for a solution that couples visual perception with semantic reasoning. Method: Builds a labeled smoke and wildfire dataset from multi-source remote sensing data, including Landsat-8/9 and GOES-16; uses YOLOv12 to detect fire zones and smoke; uses MLLMs to turn detections into contextualized risk assessments and response recommendations; evaluates risk-reasoning quality with an LLM-as-judge; and deploys the system on a service-oriented architecture. Result: WildfireVLM detects fire zones and smoke and produces interpretable, actionable risk assessments; the system supports real-time processing, visual risk dashboards, and long-term tracking, demonstrating the value and scalability of joint vision-language modeling. Conclusion: Tightly integrating computer vision with multimodal language reasoning can improve the accuracy, timeliness, and decision support of remote-sensing wildfire monitoring, offering a new paradigm for large-scale intelligent disaster management. Abstract: Wildfires are a growing threat to ecosystems, human lives, and infrastructure, with their frequency and intensity rising due to climate change and human activities. Early detection is critical, yet satellite-based monitoring remains challenging due to faint smoke signals, dynamic weather conditions, and the need for real-time analysis over large areas. We introduce WildfireVLM, an AI framework that combines satellite imagery wildfire detection with language-driven risk assessment. We construct a labeled wildfire and smoke dataset using imagery from Landsat-8/9, GOES-16, and other publicly available Earth observation sources, including harmonized products with aligned spectral bands. WildfireVLM employs YOLOv12 to detect fire zones and smoke plumes, leveraging its ability to detect small, complex patterns in satellite imagery. We integrate Multimodal Large Language Models (MLLMs) that convert detection outputs into contextualized risk assessments and prioritized response recommendations for disaster management. We validate the quality of risk reasoning using an LLM-as-judge evaluation with a shared rubric. The system is deployed using a service-oriented architecture that supports real-time processing, visual risk dashboards, and long-term wildfire tracking, demonstrating the value of combining computer vision with language-based reasoning for scalable wildfire monitoring.

[100] Fine-Tuning a Large Vision-Language Model for Artwork's Scoring and Critique

Zhehan Zhang,Meihua Qian,Li Luo,Siyu Huang,Chaoyi Zhou,Ripon Saha,Xinxin Song

Main category: cs.CV

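TL;DR: This paper proposes an automated framework for assessing the creativity of paintings by fine-tuning the vision-language model Qwen2-VL-7B with multi-task learning. It outputs a numerical score and structured feedback together, with high accuracy (Pearson r > 0.97), and the generated feedback is semantically close to expert critiques (SBERT cosine = 0.798).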

Details

Motivation: Manual assessment of artistic creativity (e.g., TTCT) is costly at scale, and existing machine-learning methods rely mostly on image features and lack interpretable feedback. Method: Builds a dataset of 1000 human paintings with expert scores on five dimensions (originality, color, texture, composition, content) and written critiques; fine-tunes Qwen2-VL-7B with a lightweight regression head that predicts a 1-100 score, and embeds the rubric and artwork description in the system prompt so that the score and the textual feedback are generated jointly. Result: On the 100-point scale, reaches a Pearson correlation above 0.97 and an MAE of about 3.95; the average SBERT cosine similarity between generated feedback and expert critiques is 0.798. Conclusion: The method bridges computer vision and art assessment and provides a scalable, interpretable automated tool for creativity research and classroom feedback. Abstract: Assessing artistic creativity is foundational to creativity research and arts education, yet manual scoring (e.g., Torrance Tests of Creative Thinking) is labor-intensive at scale. Prior machine-learning approaches show promise for visual creativity scoring, but many rely mainly on image features and provide limited or no explanatory feedback. We propose a framework for automated creativity assessment of human paintings by fine-tuning the vision-language model Qwen2-VL-7B with multi-task learning. Our dataset contains 1000 human-created paintings scored on a 1-100 scale and paired with a short human-written description (content or artist explanation). Two expert raters evaluated each work using a five-dimension rubric (originality, color, texture, composition, content) and provided written critiques; we use an 80/20 train-test split. We add a lightweight regression head on the visual encoder output so the model can predict a numerical score and generate rubric-aligned feedback in a single forward pass. By embedding the structured rubric and the artwork description in the system prompt, we constrain the generated text to match the quantitative prediction. Experiments show strong accuracy, achieving Pearson r > 0.97 and MAE about 3.95 on the 100-point scale. Qualitative evaluation indicates the generated feedback is semantically close to expert critiques (average SBERT cosine similarity = 0.798). The proposed approach bridges computer vision and art assessment and offers a scalable tool for creativity research and classroom feedback.
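
For readers unfamiliar with the two accuracy metrics reported above, here is a minimal sketch of how Pearson r and MAE are computed from predicted and expert scores. The score lists are made-up illustrative values, not data from the paper, and the function names are hypothetical.

```python
import math

def pearson_r(pred, target):
    """Pearson correlation between predicted and reference scores."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)

def mae(pred, target):
    """Mean absolute error, in the same units as the scores."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

# Illustrative scores on a 1-100 scale (not taken from the paper).
predicted = [52.0, 71.0, 88.0, 35.0]
expert = [50.0, 75.0, 90.0, 30.0]
print(pearson_r(predicted, expert), mae(predicted, expert))
```

The two metrics answer different questions: Pearson r checks whether the model ranks and spaces artworks like the experts do, while MAE reports how many points off a typical prediction is, so a model can score well on one and poorly on the other.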

[101] Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension

Haoran Xu,Hongyu Wang,Jiaze Li,Shunpeng Chen,Zizhao Tong,Jianzhong Ju,Zhenbo Luo,Jian Luan

Main category: cs.CV

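TL;DR: This paper proposes Visual Para-Thinker, the first parallel reasoning framework for multimodal large language models (MLLMs). It uses visual partitioning, Pa-Attention, and LPRoPE to improve reasoning diversity and efficiency, and validates the approach on several visual benchmarks.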

Details

Motivation: Existing LLM test-time scaling laws rely on longer reasoning chains (vertical scaling), which tends to lock the model into fixed thinking patterns; parallel reasoning can mitigate this narrowing of exploration, but its extension to the visual domain is unexplored. Method: Proposes a parallel reasoning paradigm based on visual partitioning, with two partitioning strategies; introduces Pa-Attention to keep reasoning paths independent and LPRoPE to increase reasoning diversity; and builds a native multimodal parallel reasoning implementation on vLLM. Result: Experiments on visual benchmarks including V*, CountBench, RefCOCO, and HallusionBench show that Visual Para-Thinker extends the benefits of parallel reasoning to the visual domain, improving performance and efficiency. Conclusion: Visual parallel reasoning is an effective new way past the vertical test-time scaling bottleneck of MLLMs, and Visual Para-Thinker provides the first systematic framework and implementation in this direction. Abstract: Existing LLM test-time scaling laws emphasize the emergence of self-reflective behaviors through extended reasoning length. Nevertheless, this vertical scaling strategy often encounters plateaus in exploration as the model becomes locked into a specific thinking pattern. By shifting from depth to parallelism, parallel thinking mitigates the narrowing of exploration. However, the extension of this paradigm to the visual domain remains an open research question. In this paper, we first examine the role of visual partitioning in parallelized reasoning and subsequently propose two distinct strategies. Based on the above, we introduce Visual Para-Thinker, representing the inaugural parallel reasoning framework for MLLMs. To maintain path independence and promote diversity in reasoning, our approach integrates Pa-Attention alongside LPRoPE. Leveraging the vLLM framework, we have developed a native multimodal implementation that facilitates high-efficiency parallel processing. Empirical results on benchmark datasets such as V*, CountBench, RefCOCO, and HallusionBench confirm that Visual Para-Thinker successfully extends the benefits of parallel reasoning to the visual domain.

[102] Agentic Spatio-Temporal Grounding via Collaborative Reasoning

Heng Zhao,Yew-Soon Ong,Joey Tianyi Zhou

Main category: cs.CV

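TL;DR: This paper proposes ASTG, a training-free, open-world framework for spatio-temporal video grounding in which two agents built on multimodal large language models (a spatial reasoning agent and a temporal reasoning agent) collaborate to extract the target spatio-temporal tube, substantially reducing reliance on supervision and improving generalization.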

Details

Motivation: Existing STVG methods suffer from redundant computation, heavy supervision requirements, and poor generalization; weakly supervised methods lower annotation cost but remain tied to dataset-level training and perform worse. Method: Proposes the Agentic Spatio-Temporal Grounder (ASTG), which builds a Spatial Reasoning Agent (SRA) and a Temporal Reasoning Agent (TRA) on multimodal large language models, decouples spatial and temporal reasoning through a propose-and-evaluate paradigm, and uses a dedicated visual memory and dialogue context for autonomous tube extraction, verification, and localization. Result: On mainstream benchmarks, ASTG outperforms existing weakly supervised and zero-shot methods and approaches the performance of some fully supervised methods. Conclusion: ASTG achieves efficient STVG in a training-free, open-world setting and offers a new paradigm for video understanding. Abstract: Spatio-Temporal Video Grounding (STVG) aims to retrieve the spatio-temporal tube of a target object or person in a video given a text query. Most existing approaches perform frame-wise spatial localization within a predicted temporal span, resulting in redundant computation, heavy supervision requirements, and limited generalization. Weakly-supervised variants mitigate annotation costs but remain constrained by the dataset-level train-and-fit paradigm and deliver inferior performance. To address these challenges, we propose the Agentic Spatio-Temporal Grounder (ASTG) framework for the task of STVG towards an open-world and training-free scenario. Specifically, two specialized agents, SRA (Spatial Reasoning Agent) and TRA (Temporal Reasoning Agent), built on modern Multimodal Large Language Models (MLLMs), work collaboratively to retrieve the target tube in an autonomous and self-guided manner. Following a propose-and-evaluate paradigm, ASTG decouples spatio-temporal reasoning and automates the tube extraction, verification and temporal localization processes. With a dedicated visual memory and dialogue context, the retrieval efficiency is significantly enhanced. Experiments on popular benchmarks demonstrate the superiority of the proposed approach: it outperforms existing weakly-supervised and zero-shot approaches by a margin and is comparable to some of the fully-supervised methods.

[103] Sim2Radar: Toward Bridging the Radar Sim-to-Real Gap with VLM-Guided Scene Reconstruction

Emily Bejerano,Federico Tondolo,Aayan Qayyum,Xiaofan Yu,Xiaofan Jiang

Main category: cs.CV

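TL;DR: Sim2Radar is an end-to-end framework that synthesizes millimeter-wave radar training data directly from a single RGB image, with no manual scene modeling. It reconstructs a material-aware 3D scene by combining monocular depth estimation, segmentation, and vision-language reasoning, then simulates mmWave propagation with a physics-based ray tracer using ITU-R electromagnetic properties. On real indoor scenes, pre-training a radar point-cloud detector on the synthetic data and then fine-tuning improves 3D detection accuracy (AP@0.3) by up to 3.7.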

Details

Motivation: Millimeter-wave radar perceives reliably in visually degraded indoor environments such as smoke, dust, and low light, but learning-based radar perception is limited by the scarcity and high cost of collecting and annotating large-scale radar datasets. Method: Sim2Radar first infers object materials by combining monocular depth estimation, semantic segmentation, and vision-language reasoning to build a material-aware 3D scene; it then simulates mmWave propagation with a configurable physics-based ray tracer using Fresnel reflection models parameterized by ITU-R electromagnetic properties, synthesizing radar data end to end from single-view RGB images. Result: On real indoor scenes, pre-training a radar point-cloud object detector on Sim2Radar synthetic data and fine-tuning on real radar data raises 3D average precision (AP at IoU 0.3) by up to +3.7, with gains coming mainly from better spatial localization. Conclusion: Physics-based, vision-driven radar simulation can provide effective geometric priors for radar learning and improve performance when real-data supervision is limited. Abstract: Millimeter-wave (mmWave) radar provides reliable perception in visually degraded indoor environments (e.g., smoke, dust, and low light), but learning-based radar perception is bottlenecked by the scarcity and cost of collecting and annotating large-scale radar datasets. We present Sim2Radar, an end-to-end framework that synthesizes training radar data directly from single-view RGB images, enabling scalable data generation without manual scene modeling. Sim2Radar reconstructs a material-aware 3D scene by combining monocular depth estimation, segmentation, and vision-language reasoning to infer object materials, then simulates mmWave propagation with a configurable physics-based ray tracer using Fresnel reflection models parameterized by ITU-R electromagnetic properties. Evaluated on real-world indoor scenes, Sim2Radar improves downstream 3D radar perception via transfer learning: pre-training a radar point-cloud object detection model on synthetic data and fine-tuning on real radar yields up to +3.7 3D AP (IoU 0.3), with gains driven primarily by improved spatial localization. These results suggest that physics-based, vision-driven radar simulation can provide effective geometric priors for radar learning and measurably improve performance under limited real-data supervision.
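
The Fresnel reflection step mentioned above can be illustrated with a short sketch. It assumes a planar interface between air and a non-magnetic material with complex relative permittivity `eps_r`; the permittivity values below are placeholders rather than ITU-R figures, and how Sim2Radar maps materials to permittivity is not described in the abstract.

```python
import cmath
import math

def fresnel_reflection(eps_r, theta_deg):
    """Return (gamma_te, gamma_tm) for a plane wave hitting an air-material interface.

    eps_r: complex relative permittivity of the material (assumed non-magnetic).
    theta_deg: angle of incidence measured from the surface normal, in degrees.
    """
    theta = math.radians(theta_deg)
    cos_t = math.cos(theta)
    root = cmath.sqrt(eps_r - math.sin(theta) ** 2)
    # TE: electric field perpendicular to the plane of incidence.
    gamma_te = (cos_t - root) / (cos_t + root)
    # TM: electric field parallel to the plane of incidence (sign conventions vary).
    gamma_tm = (eps_r * cos_t - root) / (eps_r * cos_t + root)
    return gamma_te, gamma_tm

# Placeholder materials: a lossless dielectric and a slightly lossy one.
te, tm = fresnel_reflection(4.0, 0.0)
te_lossy, tm_lossy = fresnel_reflection(complex(5.3, -0.3), 45.0)
print(abs(te), abs(tm), abs(te_lossy) ** 2, abs(tm_lossy) ** 2)
```

At normal incidence with a relative permittivity of 4, both coefficients have magnitude 1/3, a quick sanity check on the formulas. In a ray tracer, the squared magnitude would scale the power carried by each reflected ray, which is why the inferred material of a surface changes the simulated radar return.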

[104] IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs

Yifan Tan,Yifu Sun,Shirui Huang,Hong Liu,Guanghua Yu,Jianchen Zhu,Yangdong Deng

Main category: cs.CV

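TL;DR: This paper proposes IDPruner, which uses the Maximal Marginal Relevance (MMR) algorithm to reach a Pareto-optimal balance between visual token importance and semantic diversity. It needs no attention maps, is compatible with FlashAttention, supports one-shot pruning, and achieves state-of-the-art performance across multiple models and tasks.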

Details

Motivation: Existing visual token pruning methods lack a principled framework for optimally combining importance and diversity, making it hard to achieve both computational efficiency and performance. Method: Based on a systematic analysis of the trade-off between importance and diversity, proposes IDPruner, which applies the MMR algorithm to perform one-shot token pruning without relying on attention maps. Result: On Qwen2.5-VL-7B-Instruct, retains 95.18% of baseline performance at a 75% pruning ratio and 86.40% at a 90% pruning ratio, with state-of-the-art results and strong generalization across architectures and benchmarks. Conclusion: IDPruner offers a principled, efficient, compatible, and generalizable approach to visual token pruning for MLLMs. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities, yet they encounter significant computational bottlenecks due to the massive volume of visual tokens. Consequently, visual token pruning, which substantially reduces the token count, has emerged as a critical technique for accelerating MLLM inference. Existing approaches focus on token importance, diversity, or an intuitive combination of both, without a principled framework for their optimal integration. To address this issue, we first conduct a systematic analysis to characterize the trade-off between token importance and semantic diversity. Guided by this analysis, we propose the Importance and Diversity Pruner (IDPruner), which leverages the Maximal Marginal Relevance (MMR) algorithm to achieve a Pareto-optimal balance between these two objectives. Crucially, our method operates without requiring attention maps, ensuring full compatibility with FlashAttention and efficient deployment via one-shot pruning. We conduct extensive experiments across various model architectures and multimodal benchmarks, demonstrating that IDPruner achieves state-of-the-art performance and superior generalization across diverse architectures and tasks. Notably, on Qwen2.5-VL-7B-Instruct, IDPruner retains 95.18% of baseline performance when pruning 75% of the tokens, and still maintains 86.40% even under an extreme 90% pruning ratio. Our code is available at https://github.com/Tencent/AngelSlim.
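
To show the general idea behind MMR-based selection, here is a minimal greedy sketch. It is the textbook MMR rule applied to tokens, not the IDPruner implementation: the importance scores, the cosine-similarity redundancy term, the trade-off weight `lam`, and all names are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def mmr_select(features, importance, keep, lam=0.5):
    """Greedily pick `keep` token indices, trading importance against redundancy."""
    selected = []
    remaining = list(range(len(features)))
    while remaining and len(selected) < keep:
        def score(i):
            # Redundancy: highest similarity to any token already kept.
            redundancy = max(
                (cosine(features[i], features[j]) for j in selected), default=0.0
            )
            return lam * importance[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy tokens: 0 and 1 are near-duplicates, 2 is less important but different.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
importance = [0.9, 0.85, 0.5]
print(mmr_select(features, importance, keep=2, lam=0.5))
print(mmr_select(features, importance, keep=2, lam=1.0))
```

With `lam=0.5` the second pick skips the near-duplicate token in favor of the dissimilar one, whereas `lam=1.0` reduces to plain top-k by importance. Nothing here reads attention weights, which is consistent with the abstract's point about compatibility with FlashAttention.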

[105] Diagnostic Benchmarks for Invariant Learning Dynamics: Empirical Validation of the Eidos Architecture

Datorien L. Anderson

Main category: cs.CV

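TL;DR: This paper presents the PolyShapes-Ideal (PSI) dataset for evaluating how well models learn topological invariance, and validates the "Form-First" hypothesis that the Eidos architecture generalizes through structural constraints.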

Details

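Motivation: Existing vision benchmarks depend heavily on textural correlations, which makes it hard to assess whether a model captures topological invariance (the ability to keep structural identity under affine transformations), so a dedicated diagnostic benchmark is needed. Method: Builds the PSI dataset with three diagnostic tasks: polygon classification under noise, zero-shot transfer from MNIST to fonts, and geometric collapse mapping under progressive deformation; the Eidos architecture is evaluated on these tasks. Result: Eidos reaches >99% accuracy on PSI and 81.67% zero-shot transfer accuracy across 30 unseen typefaces, without pre-training. Conclusion: Generalization in structurally constrained architectures stems from geometric integrity rather than statistical scale, supporting the "Form-First" hypothesis. Abstract: We present the PolyShapes-Ideal (PSI) dataset, a suite of diagnostic benchmarks designed to isolate topological invariance -- the ability to maintain structural identity across affine transformations -- from the textural correlations that dominate standard vision benchmarks. Through three diagnostic probes (polygon classification under noise, zero-shot font transfer from MNIST, and geometric collapse mapping under progressive deformation), we demonstrate that the Eidos architecture achieves >99% accuracy on PSI and 81.67% zero-shot transfer across 30 unseen typefaces without pre-training. These results validate the "Form-First" hypothesis: generalization in structurally constrained architectures is a property of geometric integrity, not statistical scale.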

[106] Synthesizing the Kill Chain: A Zero-Shot Framework for Target Verification and Tactical Reasoning on the Edge

Jesse Barkley,Abraham George,Amir Barati Farimani

Main category: cs.CV

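TL;DR: This paper proposes a hierarchical zero-shot framework for dynamic military environments that combines lightweight object detection with compact vision-language models (VLMs) to perform accurate, low-latency autonomous perception and reasoning on edge devices.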

Details

Motivation: Autonomous edge robots in dynamic military environments are constrained by scarce domain-specific training data and limited edge compute. Method: Builds a hierarchical zero-shot framework in which Grounding DINO acts as a high-recall, text-promptable region proposer and frames with high detection confidence are passed to lightweight Qwen and Gemma VLMs (4B-12B) for semantic verification; extends this into an agentic Scout-Commander workflow; and proposes a "Controlled Input" methodology that decouples perception from reasoning to diagnose model failure modes. Result: On Battlefield 6 synthetic videos: up to 100% accuracy for false-positive filtering, up to 97.5% for damage assessment, and 55-90% for fine-grained vehicle classification; the Scout-Commander workflow achieves 100% correct asset deployment and a 9.8/10 reasoning score graded by GPT-4o, with latency under 75 seconds; "Controlled Input" shows that Gemma3-12B is strong at tactical reasoning but weak at visual perception, while Gemma3-4B suffers reasoning collapse. Conclusion: The results validate hierarchical zero-shot architectures for edge autonomy and provide a diagnostic framework for assessing VLM suitability in safety-critical applications. Abstract: Deploying autonomous edge robotics in dynamic military environments is constrained by both scarce domain-specific training data and the computational limits of edge hardware. This paper introduces a hierarchical, zero-shot framework that cascades lightweight object detection with compact Vision-Language Models (VLMs) from the Qwen and Gemma families (4B-12B parameters). Grounding DINO serves as a high-recall, text-promptable region proposer, and frames with high detection confidence are passed to edge-class VLMs for semantic verification. We evaluate this pipeline on 55 high-fidelity synthetic videos from Battlefield 6 across three tasks: false-positive filtering (up to 100% accuracy), damage assessment (up to 97.5%), and fine-grained vehicle classification (55-90%). We further extend the pipeline into an agentic Scout-Commander workflow, achieving 100% correct asset deployment and a 9.8/10 reasoning score (graded by GPT-4o) with sub-75-second latency. A novel "Controlled Input" methodology decouples perception from reasoning, revealing distinct failure phenotypes: Gemma3-12B excels at tactical logic but fails in visual perception, while Gemma3-4B exhibits reasoning collapse even with accurate inputs. These findings validate hierarchical zero-shot architectures for edge autonomy and provide a diagnostic framework for certifying VLM suitability in safety-critical applications.

[107] MotionWeaver: Holistic 4D-Anchored Framework for Multi-Humanoid Image Animation

Xirui Hu,Yanbo Ding,Jiahao Wang,Tingting Shi,Yali Wang,Guo Zhi Zhi,Weizhan Zhang

Main category: cs.CV

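TL;DR: This paper proposes MotionWeaver, a framework that uses unified motion representations and a 4D-anchored paradigm to achieve, for the first time, high-quality joint animation of images containing multiple humanoid characters, and builds the first large-scale multi-character interaction video dataset and benchmark.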

Details

Motivation: Existing character image animation methods are limited to single-person settings and generalize poorly to realistic scenes in which multiple humanoid characters coexist, interact in complex ways, and frequently occlude each other. Method: Proposes unified motion representations (extracting identity-agnostic motion and explicitly binding it to the corresponding character) and a holistic 4D-anchored paradigm (fusing motion representations with video latents in a shared 4D space, reinforced by hierarchical 4D-level supervision). Result: Achieves state-of-the-art results on a self-built benchmark of 300 videos with paired characters, and generalizes markedly better to diverse humanoid forms, complex interactions, and heavy occlusion. Conclusion: MotionWeaver provides a scalable and robust paradigm for multi-character animation, moving character animation from single-person settings toward realistic multi-agent scenes. Abstract: Character image animation, which synthesizes videos of reference characters driven by pose sequences, has advanced rapidly but remains largely limited to single-human settings. Existing methods struggle to generalize to multi-humanoid scenarios, which involve diverse humanoid forms, complex interactions, and frequent occlusions. We address this gap with two key innovations. First, we introduce unified motion representations that extract identity-agnostic motions and explicitly bind them to corresponding characters, enabling generalization across diverse humanoid forms and seamless extension to multi-humanoid scenarios. Second, we propose a holistic 4D-anchored paradigm that constructs a shared 4D space to fuse motion representations with video latents, and further reinforces this process with hierarchical 4D-level supervision to better handle interactions and occlusions. We instantiate these ideas in MotionWeaver, an end-to-end framework for multi-humanoid image animation. To support this setting, we curate a 46-hour dataset of multi-human videos with rich interactions, and construct a 300-video benchmark featuring paired humanoid characters. Quantitative and qualitative experiments demonstrate that MotionWeaver not only achieves state-of-the-art results on our benchmark but also generalizes effectively across diverse humanoid forms, complex interactions, and challenging multi-humanoid scenarios.

[108] HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving

Yiru Wang,Zichong Gu,Yu Gao,Anqing Jiang,Zhigang Sun,Shuo Wang,Yuwen Heng,Hao Sun

Main category: cs.CV

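TL;DR: This paper proposes HiST-VLA, which improves the reliability and accuracy of trajectory generation by VLA models in autonomous driving through hierarchical spatio-temporal modeling, geometry-aware enhancement, dynamic token sparsification, and a hierarchical planner.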

Details

Motivation: In safety-critical settings such as autonomous driving, existing Vision-Language-Action (VLA) models are limited by imprecise numerical reasoning, weak 3D spatial awareness, and high sensitivity to context. Method: Proposes HiST-VLA, which strengthens 3D spatio-temporal reasoning by combining geometric awareness with driving-command and state-history prompting; adds dynamic token sparsification based on token fusion to improve computational efficiency; and uses a hierarchical transformer planner with dynamic latent regularization to refine trajectories under language guidance so that they are spatially grounded and temporally coherent. Result: Reaches state-of-the-art results on the NAVSIM v2 benchmark: an EPDMS of 88.6 on Navtest and 50.9 on the pseudo closed-loop Navhard. Conclusion: HiST-VLA mitigates key capability gaps of VLA models in autonomous driving and delivers accurate, robust, and computationally efficient trajectory generation. Abstract: Vision-Language-Action (VLA) models offer promising capabilities for autonomous driving through multimodal understanding. However, their utilization in safety-critical scenarios is constrained by inherent limitations, including imprecise numerical reasoning, weak 3D spatial awareness, and high sensitivity to context. To address these challenges, we propose HiST-VLA, a novel Hierarchical Spatio-Temporal VLA model designed for reliable trajectory generation. Our framework enhances 3D spatial and temporal reasoning by integrating geometric awareness with fine-grained driving commands and state history prompting. To ensure computational efficiency, we integrate dynamic token sparsification into the VLA architecture. This approach fuses redundant tokens rather than filtering them, effectively reducing redundancy without sacrificing model performance. Furthermore, we employ a hierarchical transformer-based planner to progressively refine coarse VLA waypoints into fine-grained trajectories. Crucially, the planner utilizes dynamic latent regularization to incorporate language commands, ensuring strict spatial grounding and temporal coherence. Extensive evaluation on the NAVSIM v2 benchmark demonstrates state-of-the-art performance on Navtest, achieving an EPDMS of 88.6, and an EPDMS of 50.9 on the pseudo closed-loop Navhard benchmark.

[109] Zwitscherkasten -- DIY Audiovisual bird monitoring

Dominik Blum,Elias Häring,Fabian Jirges,Martin Schäffer,David Schick,Florian Schulenberg,Torsten Schön

Main category: cs.CV

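TL;DR: Zwitscherkasten is a DIY multimodal bird monitoring system for edge devices that combines audio and visual data to identify bird species in real time and non-invasively on resource-constrained hardware.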

Details

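Motivation: Scalable biodiversity monitoring and citizen science applications need accurate, low-power bird species identification on resource-constrained edge devices. Method: Builds the multimodal Zwitscherkasten system, which integrates lightweight deep learning models for bioacoustic and image classification, uses an acoustic activity detector to reduce energy consumption, and performs visual recognition with a fine-grained detection and classification pipeline. Result: Shows that accurate bird species identification is feasible on embedded platforms. Conclusion: The system demonstrates that deploying multimodal deep learning models on edge devices for real-time, non-invasive bird monitoring is feasible and effective. Abstract: This paper presents Zwitscherkasten, a DIY, multimodal system for bird species monitoring using audio and visual data on edge devices. Deep learning models for bioacoustic and image-based classification are deployed on resource-constrained hardware, enabling real-time, non-invasive monitoring. An acoustic activity detector reduces energy consumption, while visual recognition is performed using fine-grained detection and classification pipelines. Results show that accurate bird species identification is feasible on embedded platforms, supporting scalable biodiversity monitoring and citizen science applications.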

[110] MedScope: Incentivizing "Think with Videos" for Clinical Reasoning via Coarse-to-Fine Tool Calling

Wenjie Li,Yujie Zhang,Haoran Sun,Xingqi He,Hongcheng Gao,Chenglong Ma,Ming Hu,Guankun Wang,Shiyi Yao,Renhao Yang,Hongliang Ren,Lei Wang,Junjun He,Yankai Jiang

Main category: cs.CV

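TL;DR: This paper proposes MedScope, a tool-using model for reasoning over long-form clinical videos that performs coarse-to-fine evidence seeking and is optimized with grounding-aware reinforcement learning, reaching state-of-the-art performance on clinical video understanding tasks.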

Details

Motivation: Existing vision-language models handle long-form clinical videos only through passive sampling or weakly grounded inspection, which makes it hard to iteratively locate, verify, and justify predictions and leaves them without an explicit reliance on temporally localized visual evidence. Method: Proposes MedScope, which interleaves tool calls with intermediate reasoning for coarse-to-fine evidence retrieval; builds the ClinVideoSuite dataset; and trains with Grounding-Aware Group Relative Policy Optimization (GA-GRPO), which reinforces tool use through grounding-aligned rewards and evidence-weighted advantages. Result: On full and fine-grained video understanding benchmarks, MedScope achieves state-of-the-art performance in both in-domain and out-of-domain evaluations. Conclusion: MedScope offers a viable path toward medical AI agents that can genuinely "think with videos", emphasizing tool-integrated reasoning and temporal evidence grounding. Abstract: Long-form clinical videos are central to visual evidence-based decision-making, with growing importance for applications such as surgical robotics and related settings. However, current multimodal large language models typically process videos with passive sampling or weakly grounded inspection, which limits their ability to iteratively locate, verify, and justify predictions with temporally targeted evidence. To close this gap, we propose MedScope, a tool-using clinical video reasoning model that performs coarse-to-fine evidence seeking over long-form procedures. By interleaving intermediate reasoning with targeted tool calls and verification on retrieved observations, MedScope produces more accurate and trustworthy predictions that are explicitly grounded in temporally localized visual evidence. To address the lack of high-fidelity supervision, we build ClinVideoSuite, an evidence-centric, fine-grained clinical video suite. We then optimize MedScope with Grounding-Aware Group Relative Policy Optimization (GA-GRPO), which directly reinforces tool use with grounding-aligned rewards and evidence-weighted advantages. On full and fine-grained video understanding benchmarks, MedScope achieves state-of-the-art performance in both in-domain and out-of-domain evaluations. Our approach illuminates a path toward medical AI agents that can genuinely "think with videos" through tool-integrated reasoning. We will release our code, models, and data.

[111] Ask the Expert: Collaborative Inference for Vision Transformers with Near-Edge Accelerators

Hao Liu,Suhaib A. Fahmy

Main category: cs.CV

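TL;DR: This paper proposes a collaborative inference framework that pairs a lightweight generalist ViT on an edge device with several medium-sized expert ViTs on a near-edge accelerator. Using dynamic routing and a progressive expert training strategy, it keeps accuracy high while substantially reducing latency and energy consumption.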

Details

Motivation: Deploying Vision Transformers on edge devices is hampered by their high computational complexity, while fully offloading to the cloud adds significant latency, so a collaborative inference scheme that balances efficiency and accuracy is needed. Method: Designs a dynamic routing mechanism that uses the edge model's Top-k predictions to select the most relevant expert for low-confidence samples, and proposes a progressive specialist training strategy that improves expert accuracy on specific dataset subsets. Result: On CIFAR-100 with a real edge and near-edge testbed, expert specialization accuracy improves by 4.12% and overall accuracy by 2.76%; latency drops by up to 45% compared with edge-only execution, and energy consumption by up to 46% compared with near-edge-only offloading. Conclusion: The proposed framework achieves a better balance among accuracy, latency, and energy efficiency, offering an effective way to deploy ViTs in resource-constrained settings. Abstract: Deploying Vision Transformers on edge devices is challenging due to their high computational complexity, while full offloading to cloud resources presents significant latency overheads. We propose a novel collaborative inference framework, which orchestrates a lightweight generalist ViT on an edge device and multiple medium-sized expert ViTs on a near-edge accelerator. A novel routing mechanism uses the edge model's Top-k predictions to dynamically select the most relevant expert for samples with low confidence. We further design a progressive specialist training strategy to enhance expert accuracy on dataset subsets. Extensive experiments on the CIFAR-100 dataset using a real-world edge and near-edge testbed demonstrate the superiority of our framework. Specifically, the proposed training strategy improves expert specialization accuracy by 4.12% on target subsets and enhances overall accuracy by 2.76% over static experts. Moreover, our method reduces latency by up to 45% compared to edge execution, and energy consumption by up to 46% compared to just near-edge offload.
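
The routing idea described above might be sketched as follows. This is a guess at the general shape, not the authors' mechanism: the confidence threshold, the rule of choosing the expert whose class subset covers the most Top-k probability mass, and every name below are hypothetical.

```python
def route(probs, expert_classes, k=3, threshold=0.8):
    """Decide whether the edge model answers or a sample is sent to an expert.

    probs: class probabilities from the edge (generalist) model.
    expert_classes: one set of class indices per expert (hypothetical partition).
    Returns ("edge", predicted_class) or ("expert", expert_index).
    """
    ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    top_k = ranked[:k]
    # Confident prediction: keep the sample on the edge device.
    if probs[top_k[0]] >= threshold:
        return ("edge", top_k[0])

    # Assumed rule: offload to the expert covering the most Top-k probability mass.
    def coverage(expert):
        return sum(probs[c] for c in top_k if c in expert_classes[expert])

    best_expert = max(range(len(expert_classes)), key=coverage)
    return ("expert", best_expert)

# Toy setup: five classes split across three hypothetical experts.
experts = [{0, 1}, {2, 3}, {4}]
print(route([0.90, 0.05, 0.05, 0.00, 0.00], experts))
print(route([0.05, 0.40, 0.35, 0.15, 0.05], experts))
```

In the second call the top class belongs to expert 0, but classes 2 and 3 together hold more of the Top-k mass, so this rule selects expert 1. Only low-confidence samples incur the cost of offloading, which is the kind of trade-off the reported latency and energy figures reflect.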

[112] Meningioma Analysis and Diagnosis using Limited Labeled Samples

Jiamiao Lu,Wei Wu,Ke Gao,Ping Mao,Weichuan Zhang,Tuo Wang,Lingkun Ma,Jiapan Guo,Zanyi Wu,Yuqing Hu,Changming Sun

Main category: cs.CV

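TL;DR: This paper proposes AMSF-Net, an adaptively weighted spatial-frequency feature fusion architecture for few-shot meningioma MRI classification. It extracts multi-band features with the discrete wavelet transform and fuses them with dynamic weights, clearly improving classification performance, and introduces a new dataset to validate the approach.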

Details

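Motivation: The biological behavior and treatment response of meningiomas depend on their pathological grade, so accurate grading is essential for treatment planning and prognosis; existing methods struggle to exploit the complementary spatial and frequency-domain information in MRI under few-shot conditions, and the contribution of each frequency band varies widely across images. Method: Proposes the adaptively weighted multi-band spatial-frequency fusion network (AMSF-Net), which decomposes MRI images with the discrete wavelet transform into multi-scale frequency sub-bands, uses a learnable weighting module to fuse the frequency-band and spatial-domain features dynamically, and is adapted to few-shot learning. Result: On three datasets, including the newly built meningioma MRI dataset, the method outperforms current state-of-the-art methods; the abstract states that the code will be released. Conclusion: Adaptive fusion of spatial features with differently weighted frequency-domain features improves few-shot meningioma MRI grading and offers a more robust and interpretable approach to clinical decision support. Abstract: The biological behavior and treatment response of meningiomas depend on their grade, making an accurate diagnosis essential for treatment planning and prognosis assessment. We observed that the weighted fusion of spatial-frequency domain features significantly influences meningioma classification performance. Notably, the contribution of specific frequency bands obtained by discrete wavelet transform varies considerably across different images. A feature fusion architecture with adaptive weights of different frequency band information and spatial domain information is proposed for few-shot meningioma learning. To verify the effectiveness of the proposed method, a new MRI dataset of meningiomas is introduced. The experimental results demonstrate the superiority of the proposed method compared with existing state-of-the-art methods on three datasets. The code will be available at: https://github.com/ICL-SUST/AMSF-Net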

[113] An Integrated Causal Inference Framework for Traffic Safety Modeling with Semantic Street-View Visual Features

Lishan Sun,Yujia Cheng,Pengfei Cui,Lei Han,Mohamed Abdel-Aty,Yunhan Zheng,Xingchen Zhang

Main category: cs.CV

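TL;DR: This paper proposes a causal analysis framework that combines semantic segmentation with Double Machine Learning, using street-view images to quantify the causal effect of visual environment features (such as greenery proportion) on regional traffic crashes. Greenery significantly reduces crashes (ATE = -6.38), especially in densely populated, socially vulnerable urban cores, but offers limited protection for vulnerable road users (VRUs).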

Details

Motivation: Macroscopic traffic safety models rely mostly on static sociodemographic and infrastructure indicators and overlook the environment as drivers visually perceive it; observational evidence suggests the visual environment affects driving safety, but robust causal inference to support policy evaluation in complex spatial settings is lacking. Method: Applies semantic segmentation to Google Street View images to extract visual environment features; builds a Double Machine Learning framework to estimate causal effects; uses SHAP values to characterize the nonlinear influence of confounding variables; and uses causal forests to estimate conditional average treatment effects (CATE). Result: Greenery proportion has a significant negative causal effect on regional traffic crashes (ATE = -6.38, p = 0.005); the effect is strongest in densely populated, socially vulnerable urban cores; it mainly reduces angle and rear-end crashes, while the protective effect for vulnerable road users (VRUs) such as pedestrians and cyclists is not significant. Conclusion: The study provides causal evidence for greening as a safety intervention, supports prioritizing improvements to hazardous visual environments, and points to the need for distinct landscape and road design strategies for VRUs. Abstract: Macroscopic traffic safety modeling aims to identify critical risk factors for regional crashes, thereby informing targeted policy interventions for safety improvement. However, current approaches rely heavily on static sociodemographic and infrastructure metrics, frequently overlooking the impact of drivers' visual perception of the driving environment. Although visual environment features have been found to impact driving and traffic crashes, existing evidence remains largely observational, failing to establish robust causality for traffic policy evaluation in complex spatial environments. To fill these gaps, we applied semantic segmentation to Google Street View imagery to extract visual environmental features and proposed a Double Machine Learning framework to quantify their causal effects on regional crashes. Meanwhile, we utilized SHAP values to characterize the nonlinear influence mechanisms of confounding variables in the models and applied causal forests to estimate conditional average treatment effects. Leveraging crash records from the Miami metropolitan area, Florida, and 220,000 street view images, evidence shows that greenery proportion exerts a significant and robust negative causal effect on traffic crashes (Average Treatment Effect = -6.38, p = 0.005). This protective effect exhibits spatial heterogeneity, being most pronounced in densely populated and socially vulnerable urban cores. While greenery significantly mitigates angle and rear-end crashes, its protective benefit for vulnerable road users (VRUs) remains limited. Our findings provide causal evidence for greening as a potential safety intervention, prioritizing hazardous visual environments while highlighting the need for distinct design optimizations to protect VRUs.
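
The Double Machine Learning idea referenced above can be sketched in its simplest partialling-out form. Plain least-squares lines stand in for the machine-learning nuisance models, there is a single confounder, and the data are synthetic with a built-in effect of -2; none of this reproduces the paper's models, features, or estimates.

```python
def fit_line(x, y):
    """Least-squares fit of y ~ a + b*x; a stand-in for any ML nuisance learner."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return mean_y - b * mean_x, b

def dml_effect(x, t, y, folds=2):
    """Cross-fitted partialling-out estimate of the effect of t on y given x."""
    n = len(x)
    t_res, y_res = [0.0] * n, [0.0] * n
    for f in range(folds):
        test = [i for i in range(n) if i % folds == f]
        train = [i for i in range(n) if i % folds != f]
        # Nuisance models are fit on the training fold only (cross-fitting).
        a_t, b_t = fit_line([x[i] for i in train], [t[i] for i in train])
        a_y, b_y = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in test:
            t_res[i] = t[i] - (a_t + b_t * x[i])
            y_res[i] = y[i] - (a_y + b_y * x[i])
    # Regress outcome residuals on treatment residuals.
    return sum(a * b for a, b in zip(t_res, y_res)) / sum(a * a for a in t_res)

# Synthetic data: confounder x drives both treatment t and outcome y.
x = [float(i) for i in range(8)]
noise = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, -0.1]
t = [xi + e for xi, e in zip(x, noise)]
y = [-2.0 * ti + 3.0 * xi for ti, xi in zip(t, x)]
print(dml_effect(x, t, y), fit_line(t, y)[1])
```

The first printed value recovers the built-in effect of -2, while the second, a naive regression of `y` on `t` alone, is pulled away from it because `x` influences both variables. Removing what the confounders explain from both treatment and outcome before relating them is the core of the approach; in practice the nuisance models would be flexible learners over many covariates.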

[114] FireRed-Image-Edit-1.0 Techinical Report

Super Intelligence Team,Changhao Qiao,Chao Hui,Chen Li,Cunzheng Wang,Dejia Song,Jiale Zhang,Jing Li,Qiang Xiang,Runqi Wang,Shuang Sun,Wei Zhu,Xu Tang,Yao Hu,Yibo Chen,Yuhao Huang,Yuxuan Duan,Zhiyi Chen,Ziyuan Guo

Main category: cs.CV

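TL;DR: FireRed-Image-Edit is an instruction-based image editing model built on a diffusion transformer. Through systematic optimization of data construction, training strategy, and evaluation design, it reaches state-of-the-art performance, and the authors release the code, models, and a new benchmark, REDEdit-Bench.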

Details

Motivation: Existing instruction-based image editing methods fall short in data quality, training stability, editing controllability, and evaluation coverage, and need systematic optimization to improve semantic consistency, text fidelity, and identity preservation. Method: Builds a 1.6B-sample multi-source dataset, from which over 100M high-quality samples are retained after cleaning, stratification, auto-labeling, and two-stage filtering; proposes a multi-stage training pipeline (pre-training, supervised fine-tuning, reinforcement learning); introduces a Multi-Condition Aware Bucket Sampler, Stochastic Instruction Alignment, Asymmetric Gradient Optimization for DPO, DiffusionNFT with OCR rewards, and a differentiable Consistency Loss; and establishes REDEdit-Bench, a new benchmark covering 15 editing categories. Result: On REDEdit-Bench and public benchmarks (ImgEdit, GEdit), performance is competitive with or superior to leading open-source and proprietary models, with clear gains in text-editing fidelity, identity consistency, and low-level enhancement. Conclusion: Systematic engineering across data, training, and evaluation is key to advancing instruction-based image editing, and the released methods and resources should support further progress in the field. Abstract: We present FireRed-Image-Edit, a diffusion transformer for instruction-based image editing that achieves state-of-the-art performance through systematic optimization of data curation, training methodology, and evaluation design. We construct a 1.6B-sample training corpus, comprising 900M text-to-image and 700M image editing pairs from diverse sources. After rigorous cleaning, stratification, auto-labeling, and two-stage filtering, we retain over 100M high-quality samples balanced between generation and editing, ensuring strong semantic coverage and instruction alignment. Our multi-stage training pipeline progressively builds editing capability via pre-training, supervised fine-tuning, and reinforcement learning. To improve data efficiency, we introduce a Multi-Condition Aware Bucket Sampler for variable-resolution batching and Stochastic Instruction Alignment with dynamic prompt re-indexing. To stabilize optimization and enhance controllability, we propose Asymmetric Gradient Optimization for DPO, DiffusionNFT with layout-aware OCR rewards for text editing, and a differentiable Consistency Loss for identity preservation. We further establish REDEdit-Bench, a comprehensive benchmark spanning 15 editing categories, including newly introduced beautification and low-level enhancement tasks. Extensive experiments on REDEdit-Bench and public benchmarks (ImgEdit and GEdit) demonstrate competitive or superior performance against both open-source and proprietary systems. We release code, models, and the benchmark suite to support future research.

[115] Visual Foresight for Robotic Stow: A Diffusion-Based World Model from Sparse Snapshots

Lijun Zhang,Nikhil Chacko,Petter Nilsson,Ruinian Xu,Shantanu Thakar,Bai Lou,Harpreet Sawhney,Zhebin Zhang,Mudit Agrawal,Bhavana Chandrashekhar,Aaron Parness

Main category: cs.CV

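TL;DR: This paper proposes FOREST, a stow-intent-conditioned world model that predicts the state of storage bins in automated warehouses after a stow operation. It uses instance masks and a latent diffusion transformer to predict layouts accurately and proves useful in downstream tasks.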

Details

Motivation: Automated warehouses benefit from predicting the state of a storage bin before a stow operation is executed, in order to improve planning and decision-making. Method: Proposes FOREST, which represents bin states as item-aligned instance masks and uses a latent diffusion transformer, conditioned on the current observation and the stow intent, to predict the post-stow configuration. Result: FOREST substantially outperforms heuristic baselines in geometric agreement; in two downstream tasks, load-quality assessment and multi-stow reasoning, replacing the real post-stow masks with its predictions causes only a modest performance drop. Conclusion: FOREST provides reliable predictions of the post-stow state and a useful foresight signal for warehouse planning. Abstract: Automated warehouses execute millions of stow operations, where robots place objects into storage bins. For these systems it is valuable to anticipate how a bin will look from the current observations and the planned stow behavior before real execution. We propose FOREST, a stow-intent-conditioned world model that represents bin states as item-aligned instance masks and uses a latent diffusion transformer to predict the post-stow configuration from the observed context. Our evaluation shows that FOREST substantially improves the geometric agreement between predicted and true post-stow layouts compared with heuristic baselines. We further evaluate the predicted post-stow layouts in two downstream tasks, in which replacing the real post-stow masks with FOREST predictions causes only modest performance loss in load-quality assessment and multi-stow reasoning, indicating that our model can provide useful foresight signals for warehouse planning.

[116] From Prompt to Production: Automating Brand-Safe Marketing Imagery with Text-to-Image Models

Parmida Atighehchian,Henry Wang,Andrei Kapustin,Boris Lerner,Tiancheng Jiang,Taylor Jensen,Negin Sokhandan

Main category: cs.CV

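TL;DR: This paper proposes a fully automated, scalable text-to-image pipeline for generating marketing images that balances automation efficiency with human-supervised quality, improving marketing object fidelity by 30.77% and human preference by 52.00%.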

Details

Motivation: Existing text-to-image models produce impressive results, but building a scalable deployment pipeline for production remains difficult; large-scale automated processing must be balanced against the human feedback that safeguards quality. Method: Designs and implements a fully automated, scalable text-to-image pipeline dedicated to marketing imagery for commercial products, balancing image quality, fidelity, and creative variation while complying with marketing guidelines. Result: Marketing object fidelity, measured with DINOV2, improves by 30.77%; human preference tests show a 52.00% improvement for the generated results. Conclusion: The pipeline combines efficient automation with the necessary human oversight and offers a practical, high-performing way to deploy text-to-image models in marketing. Abstract: Text-to-image models have made significant strides, producing impressive results in generating images from textual descriptions. However, creating a scalable pipeline for deploying these models in production remains a challenge. Achieving the right balance between automation and human feedback is critical to maintain both scale and quality. While automation can handle large volumes, human oversight is still an essential component to ensure that the generated images meet the desired standards and are aligned with the creative vision. This paper presents a new pipeline that offers a fully automated, scalable solution for generating marketing images of commercial products using text-to-image models. The proposed system maintains the quality and fidelity of images, while also introducing sufficient creative variation to adhere to marketing guidelines. By streamlining this process, we ensure a seamless blend of efficiency and human oversight, achieving a $30.77\%$ increase in marketing object fidelity using DINOV2 and a $52.00\%$ increase in human preference over the generated outcome.

[117] Detecting Brick Kiln Infrastructure at Scale: Graph, Foundation, and Remote Sensing Models for Satellite Imagery Data

Usman Nazir,Xidong Chen,Hafiz Muhammad Abubakar,Hadia Abu Bakar,Raahim Arbaz,Fezan Rasool,Bin Chen,Sara Khalid

Main category: cs.CV

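TL;DR: This paper proposes ClimateGraph, a model for large-scale brick kiln detection in South and Central Asia from high-resolution satellite imagery, builds a multi-city dataset of 1.3 million image tiles, and compares the performance and complementarity of graph neural networks, remote-sensing-specific models, and foundation models.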

Details

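Motivation: Brick kilns are a major source of air pollution and forced labor in South Asia, but large-scale monitoring remains difficult because ground data are sparse and outdated. Method: Builds a very large brick kiln satellite image dataset (1.3 million tiles) at zoom-20 resolution (0.149 meters per pixel) covering five regions in South and Central Asia, and proposes ClimateGraph, a region-adaptive graph neural network that models the spatial and directional layout of kilns; also evaluates a conventional remote sensing detection pipeline and recent foundation models for satellite imagery. Result: Experiments show that graph models, foundation models, and remote-sensing-specific methods have complementary strengths for brick kiln detection, yielding practical guidance for scalable satellite-based monitoring. Conclusion: The approaches complement rather than replace one another; ClimateGraph stands out at capturing the spatial structure of kilns, while foundation models and remote sensing pipelines each have advantages in generalization and deployment efficiency, so they should be chosen or combined according to the monitoring need. Abstract: Brick kilns are a major source of air pollution and forced labor in South Asia, yet large-scale monitoring remains limited by sparse and outdated ground data. We study brick kiln detection at scale using high-resolution satellite imagery and curate a multi-city zoom-20 (0.149 meters per pixel) resolution dataset comprising over 1.3 million image tiles across five regions in South and Central Asia. We propose ClimateGraph, a region-adaptive graph-based model that captures spatial and directional structure in kiln layouts, and evaluate it against established graph learning baselines. In parallel, we assess a remote sensing based detection pipeline and benchmark it against recent foundation models for satellite imagery. Our results highlight complementary strengths across graph, foundation, and remote sensing approaches, providing practical guidance for scalable brick kiln monitoring from satellite imagery.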

[118] Using Deep Learning to Generate Semantically Correct Hindi Captions

Wasim Akram Khan,Anil Kumar Vuppala

Main category: cs.CV

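TL;DR: This paper studies multimodal architectures for producing Hindi image captions from English ones, combining CNN visual feature extraction, bidirectional LSTM text encoding, and an attention mechanism to improve generation quality. Hindi captions for the Flickr8k dataset are generated with Google Cloud Translator, and the model reaches a BLEU-1 of 0.59 and a BLEU-4 of 0.19, supporting its effectiveness for Hindi image captioning.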

Details

Motivation: Existing image captioning research focuses mainly on English and offers little support for other widely spoken languages such as Hindi, so multilingual captioning, and Hindi captioning in particular, needs to be explored. Method: Uses pre-trained CNNs (VGG16, ResNet50, and Inception V3) to extract local and global visual features; encodes text with uni-directional and bi-directional LSTMs; adds an attention mechanism that produces weighted feature vectors; and translates the English Flickr8k captions into Hindi with Google Cloud Translator to serve as target labels. Result: The attention-based bidirectional LSTM with VGG16 achieves the best BLEU-1 and BLEU-4 scores, 0.59 and 0.19 respectively; the model generates semantically accurate, relevant Hindi image captions. Conclusion: The study builds an image captioning framework suited to Hindi, shows potential for practical use, and provides a model that later multilingual captioning research can build on. Abstract: Automated image captioning using the content from the image is very appealing when done by harnessing the capability of computer vision and natural language processing. Extensive research has been done in the field with a major focus on the English language which gives the scope for further developments in the same with consideration of popular foreign languages. This research utilizes distinct models for translating the image caption into Hindi, the fourth most popular language across the world. Exploring the multi-modal architectures this research comprises local visual features, global visual features, attention mechanisms, and pre-trained models. Using Google Cloud Translator on the image dataset from Flickr8k, Hindi image descriptions have been generated. Pre-trained CNNs like VGG16, ResNet50, and Inception V3 helped in retrieving image characteristics, while the uni-directional and bi-directional techniques of text encoding are used for the text encoding process. An additional Attention layer helps to generate a weight vector and, by multiplying it, combine image characteristics from each time step into a sentence-level feature vector. Bilingual evaluation understudy scores are used to compare the research outcome. Many experiments that serve as a baseline are done for the comparative analysis of the research. An image with a score of BLEU-1 is considered sufficient, whereas one with a score of BLEU-4 is considered to have fluid image captioning. For both BLEU scores, the attention-based bidirectional LSTM with VGG16 produced the best results of 0.59 and 0.19 respectively. 
The experiments demonstrate the research's ability to produce relevant, semantically accurate image captions in Hindi. The research accomplishes its goals, and future research can be guided by this model.

[119] AdaCorrection: Adaptive Offset Cache Correction for Accurate Diffusion Transformers

Dong Liu,Yanxuan Yu,Ben Lengerich,Ying Nian Wu

Main category: cs.CV

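TL;DR: This paper proposes AdaCorrection, a framework that uses adaptive cache correction to improve the inference efficiency of Diffusion Transformers (DiTs) while maintaining high generation quality.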

Details

Motivation: Diffusion Transformers (DiTs) excel at high-quality image and video generation, but their iterative denoising structure makes inference expensive; existing acceleration methods rely on static caching schedules or coarse-grained heuristics, which tend to cause temporal drift and cache misalignment that degrade generation quality. Method: AdaCorrection is an adaptive offset cache correction framework: at each timestep it assesses cache validity with lightweight spatio-temporal signals and dynamically blends cached and fresh activations; the correction is computed on-the-fly, without additional supervision or retraining. Result: On image and video diffusion benchmarks, AdaCorrection delivers moderate acceleration with almost no loss in FID and consistently improves generation performance. Conclusion: Through a lightweight, adaptive cache correction mechanism, AdaCorrection mitigates cache misalignment and temporal drift in DiT inference, balancing generation quality and efficiency at low computational overhead. Abstract: Diffusion Transformers (DiTs) achieve state-of-the-art performance in high-fidelity image and video generation but suffer from expensive inference due to their iterative denoising structure. While prior methods accelerate sampling by caching intermediate features, they rely on static reuse schedules or coarse-grained heuristics, which often lead to temporal drift and cache misalignment that significantly degrade generation quality. We introduce AdaCorrection, an adaptive offset cache correction framework that maintains high generation fidelity while enabling efficient cache reuse across Transformer layers during diffusion inference. At each timestep, AdaCorrection estimates cache validity with lightweight spatio-temporal signals and adaptively blends cached and fresh activations. This correction is computed on-the-fly without additional supervision or retraining. Our approach achieves strong generation quality with minimal computational overhead, maintaining near-original FID while providing moderate acceleration. Experiments on image and video diffusion benchmarks show that AdaCorrection consistently improves generation performance.
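The blending step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the spatio-temporal signals, so the drift-based `cache_validity` score, its `sensitivity` parameter, and the linear `blend` rule below are all assumptions chosen only to show how a validity score in [0, 1] could weight cached against fresh activations.

```python
import math


def cache_validity(current_input, cached_input, sensitivity=5.0):
    """Hypothetical validity score in (0, 1]: 1.0 when the layer input has not
    drifted since the cache was written, decaying as relative drift grows."""
    diff = math.sqrt(sum((c - p) ** 2 for c, p in zip(current_input, cached_input)))
    norm = math.sqrt(sum(p ** 2 for p in cached_input)) + 1e-8
    return math.exp(-sensitivity * diff / norm)


def blend(cached_act, fresh_act, validity):
    """Convex combination: trust the cache in proportion to its validity."""
    return [validity * c + (1.0 - validity) * f for c, f in zip(cached_act, fresh_act)]


if __name__ == "__main__":
    cached_input = [1.0, 2.0, 3.0]
    current_input = [1.1, 2.0, 2.9]
    v = cache_validity(current_input, cached_input)
    print(blend([0.5, 0.5], [0.7, 0.3], v))
```

A validity of 1.0 reproduces the cached activations and a validity of 0.0 falls back to the fresh ones; how the real method avoids paying for the fresh computation at every step is not stated in the abstract and is not modeled here.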

[120] The Diffusion Duet: Harmonizing Dual Channels with Wavelet Suppression for Image Separation

Jingwei Li,Wei Pu

Main category: cs.CV

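TL;DR: This paper proposes a diffusion-based dual-channel blind image separation method (DCDSM) with a Wavelet Suppression Module (WSM) that improves detail separation, reaching state-of-the-art performance on rain/snow removal and complex mixture separation.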

Details

Motivation: Traditional BIS methods suffer from estimation bias, texture distortion, and residual artifacts under strong noise and nonlinear mixing, and struggle to model the complex feature distributions of source images in real scenes. Method: Proposes the Dual-Channel Diffusion Separation Model (DCDSM), which uses diffusion models to learn source image feature distributions; a novel Wavelet Suppression Module (WSM) is introduced into the dual-branch reverse denoising process, exploiting the noise coupling between source images to build an interactive separation network. Result: On synthetic rain/snow datasets, PSNR/SSIM reach 35.0023 dB/0.9549 and 29.8108 dB/0.9243 respectively; on complex mixture separation, average PSNR and SSIM are 25.0049 dB and 0.7997, clearly better than comparison methods such as Histoformer and LDRCNet. Conclusion: DCDSM effectively addresses residual noise removal and detail preservation in blind image separation, demonstrating the potential and practicality of diffusion models for dual-channel BIS. Abstract: Blind image separation (BIS) refers to the inverse problem of simultaneously estimating and restoring multiple independent source images from a single observation image under conditions of unknown mixing mode and without prior knowledge of the source images. Traditional methods relying on statistical independence assumptions or CNN/GAN variants struggle to characterize complex feature distributions in real scenes, leading to estimation bias, texture distortion, and artifact residue under strong noise and nonlinear mixing. This paper innovatively introduces diffusion models into dual-channel BIS, proposing an efficient Dual-Channel Diffusion Separation Model (DCDSM). DCDSM leverages diffusion models' powerful generative capability to learn source image feature distributions and reconstruct feature structures effectively. A novel Wavelet Suppression Module (WSM) is designed within the dual-branch reverse denoising process, forming an interactive separation network that enhances detail separation by exploiting the mutual coupling noise characteristic between source images. 
Extensive experiments on synthetic datasets containing rain/snow and complex mixtures demonstrate that DCDSM achieves state-of-the-art performance: 1) In image restoration tasks, it obtains PSNR/SSIM values of 35.0023 dB/0.9549 and 29.8108 dB/0.9243 for rain and snow removal respectively, outperforming Histoformer and LDRCNet by 1.2570 dB/0.9272 dB (PSNR) and 0.0262/0.0289 (SSIM) on average; 2) For complex mixture separation, the restored dual-source images achieve average PSNR and SSIM of 25.0049 dB and 0.7997, surpassing comparative methods by 4.1249 dB and 0.0926. Both subjective and objective evaluations confirm DCDSM's superiority in addressing rain/snow residue removal and detail preservation challenges.
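The PSNR figures above are easier to interpret with the metric's definition in hand. The sketch below is a generic PSNR computation on flattened pixel lists, not the paper's evaluation code; the 8-bit peak value of 255 and the helper names are assumptions. It also shows the standard conversion from a PSNR gain in dB to the implied reduction in mean squared error at a fixed peak value.

```python
import math


def mse(reference, restored):
    """Mean squared error between two equally sized flattened images."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("images must be non-empty and the same size")
    return sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    err = mse(reference, restored)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / err)


def mse_ratio_from_psnr_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain at the same peak value."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    print(psnr([0, 128, 255, 64], [2, 126, 250, 64]))
    print(mse_ratio_from_psnr_gain(1.2570))
```

Because PSNR is logarithmic, a gain of a little over 1 dB corresponds to a noticeably lower error on average, although per-image behavior can differ from the averaged figure.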

[121] An Online Reference-Free Evaluation Framework for Flowchart Image-to-Code Generation

Giang Son Nguyen,Zi Pong Lim,Sarthak Ketanbhai Modi,Yon Shin Teo,Wenya Wang

Main category: cs.CV

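TL;DR: This paper proposes a reference-free framework for evaluating the output quality of vision-language models (VLMs), used to monitor flowchart image-to-code (e.g., Mermaid) conversion. It scores outputs automatically with two metrics, OCR-based recall and visual-entailment-based precision, plus their harmonic mean (F1), and validation on the FlowVQA dataset shows strong agreement with ground-truth-based metrics.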

Details

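Motivation: In production, the vision-language models (VLMs) used in document processing pipelines must handle arbitrary flowchart images for which no ground-truth code exists, which makes the quality of the generated code hard to assess and calls for a reference-free way to monitor quality at inference time. Method: Proposes a reference-free evaluation framework with two automated metrics: Recall_OCR, which estimates content coverage from text extracted from the image via OCR, and Precision_VE, which uses Visual Entailment to check whether the generated code is grounded in the original image; their harmonic mean, F1_OCR-VE, serves as a unified quality score. Result: Validation on the FlowVQA dataset shows strong agreement with ground-truth-based evaluation (Pearson correlations of 0.97, 0.91, and 0.94 for Recall, Precision, and F1 respectively), supporting its reliability as a continuous quality monitoring tool in production. Conclusion: The proposed reference-free framework addresses the difficulty of measuring flowchart image-to-code generation quality when no labels are available; it is practical, simple to deploy, agrees closely with ground-truth-based evaluation, and suits continuous quality monitoring in production systems. Abstract: Vision-Language Models (VLMs) are increasingly used in document processing pipelines to convert flowchart images into structured code (e.g., Mermaid). In production, these systems process arbitrary inputs for which no ground-truth code exists, making output quality difficult to assess. We propose a reference-free evaluation framework that monitors flowchart image-to-code generation quality at inference time, using only the input image and the generated output. The framework introduces two automated metrics: $\text{Recall}_{\text{OCR}}$, which estimates content coverage by extracting text from the input image via OCR as a proxy reference, and $\text{Precision}_{\text{VE}}$, which detects hallucinated elements through Visual Entailment against the original image. Their harmonic mean, $\text{F1}_{\text{OCR-VE}}$, provides a unified quality score. Validation on the FlowVQA dataset shows strong agreement with ground-truth metrics (average Pearson's $r = 0.97$, $0.91$, and $0.94$ for Recall, Precision, and F1, respectively), confirming the framework's reliability as a practical, reference-free alternative for continuous quality monitoring in production settings.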
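How the two metrics combine into one score can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: the OCR output is assumed to be a list of strings, recall is approximated by case-insensitive substring matching against the generated code, and the Visual Entailment model is replaced by a precomputed list of boolean verdicts, one per generated element. All function names are hypothetical.

```python
def recall_ocr(ocr_texts, generated_code):
    """Fraction of OCR-extracted strings that appear in the generated code."""
    if not ocr_texts:
        return 0.0
    code = generated_code.lower()
    hits = sum(1 for text in ocr_texts if text.lower() in code)
    return hits / len(ocr_texts)


def precision_ve(entailment_verdicts):
    """Fraction of generated elements judged as grounded in the image.
    Each verdict stands in for the output of a visual entailment model."""
    if not entailment_verdicts:
        return 0.0
    return sum(1 for ok in entailment_verdicts if ok) / len(entailment_verdicts)


def f1_score(recall, precision):
    """Harmonic mean of recall and precision, 0.0 when both are zero."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


if __name__ == "__main__":
    mermaid = "graph TD\n  A[Start] --> B{Valid?}\n  B --> C[End]"
    ocr = ["Start", "Valid?", "End", "Retry"]   # "Retry" was missed by the model
    verdicts = [True, True, False]              # one generated element is hallucinated
    r, p = recall_ocr(ocr, mermaid), precision_ve(verdicts)
    print(r, p, f1_score(r, p))
```

The harmonic mean penalizes an output that does well on only one axis, so a generation that copies every label but invents edges, or one that is fully grounded but omits most nodes, scores low on the unified metric.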

[122] LAF-YOLOv10 with Partial Convolution Backbone, Attention-Guided Feature Pyramid, Auxiliary P2 Head, and Wise-IoU Loss for Small Object Detection in Drone Aerial Imagery

Sohail Ali Farooqui,Zuhair Ahmed Khan Taha,Mohammed Mudassir Uddin,Shahnawaz Alam

Main category: cs.CV

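TL;DR: This paper proposes LAF-YOLOv10, which builds on YOLOv10n and integrates four complementary techniques (PC-C2f, AG-FPN, a P2 detection head, and Wise-IoU v3) to improve small-object detection in drone imagery. It reaches 35.1% and 35.8% mAP@0.5 on VisDrone and UAVDT respectively, with only 2.3M parameters and 24.3 FPS on a Jetson Orin Nano.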

Details

Motivation: Existing detectors struggle with the challenges of drone imagery: targets spanning only a few pixels, cluttered backgrounds, heavy occlusion, and limited onboard compute. Method: Builds on YOLOv10n and integrates: 1) a Partial Conv C2f module that reduces redundant backbone computation; 2) an Attention-Guided FPN that strengthens multi-scale feature fusion; 3) a new P2 detection head that improves localization of very small objects (below 8×8 pixels), with the P5 head removed; 4) Wise-IoU v3 in place of CIoU to lessen the effect of annotation noise. Result: Reaches 35.1±0.3% mAP@0.5 on VisDrone-DET2019, 3.3 points above YOLOv10n; 35.8±0.4% on UAVDT; and 24.3 FPS (FP16) on an NVIDIA Jetson Orin Nano. Conclusion: The four techniques jointly improve computational efficiency, feature fusion, spatial resolution, and regression robustness, forming a lightweight and efficient small-object detector suited to embedded UAV platforms. Abstract: Unmanned aerial vehicles serve as primary sensing platforms for surveillance, traffic monitoring, and disaster response, making aerial object detection a central problem in applied computer vision. Current detectors struggle with UAV-specific challenges: targets spanning only a few pixels, cluttered backgrounds, heavy occlusion, and strict onboard computational budgets. This study introduces LAF-YOLOv10, built on YOLOv10n, integrating four complementary techniques to improve small-object detection in drone imagery. A Partial Convolution C2f (PC-C2f) module restricts spatial convolution to one quarter of backbone channels, reducing redundant computation while preserving discriminative capacity. An Attention-Guided Feature Pyramid Network (AG-FPN) inserts Squeeze-and-Excitation channel gates before multi-scale fusion and replaces nearest-neighbor upsampling with DySample for content-aware interpolation. An auxiliary P2 detection head at 160$\times$160 resolution extends localization to objects below 8$\times$8 pixels, while the P5 head is removed to redistribute parameters. Wise-IoU v3 replaces CIoU for bounding box regression, attenuating gradients from noisy annotations in crowded aerial scenes. The four modules address non-overlapping bottlenecks: PC-C2f compresses backbone computation, AG-FPN refines cross-scale fusion, the P2 head recovers spatial resolution, and Wise-IoU stabilizes regression under label noise. No individual component is novel; the contribution is the joint integration within a single YOLOv10 framework. 
Across three training runs (seeds 42, 123, 256), LAF-YOLOv10 achieves 35.1$\pm$0.3\% mAP@0.5 on VisDrone-DET2019 with 2.3\,M parameters, exceeding YOLOv10n by 3.3 points. Cross-dataset evaluation on UAVDT yields 35.8$\pm$0.4\% mAP@0.5. Benchmarks on NVIDIA Jetson Orin Nano confirm 24.3 FPS at FP16, demonstrating viability for embedded UAV deployment.

[123] Handling Supervision Scarcity in Chest X-ray Classification: Long-Tailed and Zero-Shot Learning

Ha-Hieu Pham,Hai-Dang Nguyen,Thanh-Huy Nguyen,Min Xu,Ulas Bagci,Trung-Nghia Le,Huy-Hieu Pham

Main category: cs.CV

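TL;DR: This paper presents a two-task solution for long-tailed multi-label classification and zero-shot OOD disease recognition on chest X-rays (CXR), ranking first on the development-phase public leaderboard of the CXR-LT 2026 challenge.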

Details

Motivation: CXR classification in clinical practice is often limited by imperfect supervision: disease distributions are extremely long-tailed and multi-label, and rare or previously unseen findings frequently lack annotations. Method: Task 1 adopts an imbalance-aware multi-label learning strategy to improve recognition of tail classes; Task 2 designs a zero-shot prediction approach that needs no labels or examples from the OOD classes and produces confidence scores for unseen disease categories. Result: Evaluated with macro-averaged mAP on the PadChest-based benchmark, the method performs strongly on both tasks and ranks first on the development-phase public leaderboard. Conclusion: The task-specific solutions address both long-tailed supervision and zero-shot OOD recognition in CXR and show potential for clinical use. Abstract: Chest X-ray (CXR) classification in clinical practice is often limited by imperfect supervision, arising from (i) extreme long-tailed multi-label disease distributions and (ii) missing annotations for rare or previously unseen findings. The CXR-LT 2026 challenge addresses these issues on a PadChest-based benchmark with a 36-class label space split into 30 in-distribution classes for training and 6 out-of-distribution (OOD) classes for zero-shot evaluation. We present task-specific solutions tailored to the distinct supervision regimes. For Task 1 (long-tailed multi-label classification), we adopt an imbalance-aware multi-label learning strategy to improve recognition of tail classes while maintaining stable performance on frequent findings. For Task 2 (zero-shot OOD recognition), we propose a prediction approach that produces scores for unseen disease categories without using any supervised labels or examples from the OOD classes during training. Evaluated with macro-averaged mean Average Precision (mAP), our method achieves strong performance on both tasks, ranking first on the public leaderboard of the development phase. Code and pre-trained models are available at https://github.com/hieuphamha19/CXR_LT.

[124] Learning on the Fly: Replay-Based Continual Object Perception for Indoor Drones

Sebastian-Ion Nae,Mihai-Eugen Barbu,Sebastian Mocanu,Marius Leordeanu

Main category: cs.CV

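TL;DR: This paper introduces a class-incremental learning (CIL) benchmark for indoor drones, builds a temporally coherent indoor UAV video dataset of 14,400 frames, and evaluates three replay-based CIL methods on a resource-constrained edge setup (YOLOv11-nano). Forgetting-Aware Replay (FAR) performs best under a very small 5% replay budget (82.96% mAP50-95), and a Grad-CAM analysis links attention shifts to reduced localization quality.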

Details

Motivation: Most existing UAV datasets cover outdoor scenes and lack temporally coherent indoor video, while indoor drones must learn new object classes in real time and limit catastrophic forgetting, which calls for a suitable class-incremental learning benchmark. Method: Builds a semi-automatically annotated indoor UAV video dataset (14,400 frames, 98.6% first-pass labeling agreement); compares three replay-based CIL strategies, Experience Replay (ER), Maximally Interfered Retrieval (MIR), and Forgetting-Aware Replay (FAR), on the lightweight YOLOv11-nano detector; and analyzes attention changes with Grad-CAM. Result: With a 5% replay budget, FAR reaches an average accuracy (mAP50-95) of 82.96%, ahead of ER and MIR; Grad-CAM shows attention shifts across classes that are associated with reduced localization quality for drones; the experiments indicate that replay-based CIL can be deployed effectively on edge aerial systems. Conclusion: This work contributes an indoor UAV video dataset with preserved temporal coherence for incremental learning and shows that Forgetting-Aware Replay is the most practical option under very small replay budgets, offering a workable path and an empirical benchmark for continual visual learning on resource-constrained drones. Abstract: Autonomous agents such as indoor drones must learn new object classes in real-time while limiting catastrophic forgetting, motivating Class-Incremental Learning (CIL). However, most unmanned aerial vehicle (UAV) datasets focus on outdoor scenes and offer limited temporally coherent indoor videos. We introduce an indoor dataset of $14,400$ frames capturing inter-drone and ground vehicle footage, annotated via a semi-automatic workflow with a $98.6\%$ first-pass labeling agreement before final manual verification. Using this dataset, we benchmark 3 replay-based CIL strategies: Experience Replay (ER), Maximally Interfered Retrieval (MIR), and Forgetting-Aware Replay (FAR), using YOLOv11-nano as a resource-efficient detector for deployment-constrained UAV platforms. Under tight memory budgets ($5-10\%$ replay), FAR performs better than the rest, achieving an average accuracy (ACC, $mAP_{50-95}$ across increments) of $82.96\%$ with $5\%$ replay. Gradient-weighted class activation mapping (Grad-CAM) analysis shows attention shifts across classes in mixed scenes, which is associated with reduced localization quality for drones. The experiments further demonstrate that replay-based continual learning can be effectively applied to edge aerial systems. Overall, this work contributes an indoor UAV video dataset with preserved temporal coherence and an evaluation of replay-based CIL under limited replay budgets. 
Project page: https://spacetime-vision-robotics-laboratory.github.io/learning-on-the-fly-cl
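To make the notion of a replay budget concrete, here is a minimal sketch of the simplest of the three strategies, Experience Replay, using a fixed-capacity buffer filled by reservoir sampling. It is not the benchmark's code: the `ReplayBuffer` class, the `replay_capacity` helper, and the choice of reservoir sampling are assumptions, the abstract does not say what the 5% is measured against (the sketch assumes a fraction of the frames available for training), and the selection rules that distinguish MIR and FAR are not modeled.

```python
import random


def replay_capacity(num_frames, replay_fraction):
    """Buffer size implied by a replay budget, e.g. 5% of the available frames."""
    return max(1, round(num_frames * replay_fraction))


class ReplayBuffer:
    """Fixed-size memory of past samples, filled by reservoir sampling so every
    sample seen so far has the same probability of being retained."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = sample

    def sample(self, k):
        """Draw up to k stored samples to mix into the next training batch."""
        return self.rng.sample(self.items, min(k, len(self.items)))


if __name__ == "__main__":
    buffer = ReplayBuffer(replay_capacity(14400, 0.05))
    for frame_id in range(2000):       # frames from an earlier class increment
        buffer.add(frame_id)
    print(len(buffer.items), buffer.sample(4))
```

When a new class increment arrives, each training batch would combine new frames with a draw from `sample`, which is what limits forgetting of earlier classes at a small memory cost.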

[125] GLIMPSE : Real-Time Text Recognition and Contextual Understanding for VQA in Wearables

Akhil Ramachandran,Ankit Arun,Ashish Shenoy,Abhay Harpale,Srihari Jayakumar,Debojeet Chatterjee,Mohsen Moslehpour,Pierce Chuang,Yichao Lu,Vikas Bhardwaj,Peyman Heidari

Main category: cs.CV

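TL;DR: This paper proposes a hybrid architecture that runs selective high-resolution OCR on-device while streaming low-resolution video for visual context. It resolves the conflict between the high-resolution needs of Text VQA on wearables and their power and thermal limits, substantially reducing power consumption while preserving text understanding quality.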

Details

Motivation: Deploying Text VQA on wearable devices faces a fundamental tension: text recognition needs high-resolution video, but high-quality streaming drains the battery and causes thermal throttling; existing models also struggle to maintain coherent temporal context across frames in real-time video streams. Method: Exploits the asymmetric resolution requirements of text recognition (OCR) and scene understanding with a hybrid architecture that performs selective high-resolution OCR on-device while streaming low-resolution video for visual context. Result: On a Text VQA benchmark covering five task categories, the system reaches 72% accuracy at 0.49x the power consumption of full-resolution streaming, supporting sustained VQA sessions on resource-constrained wearables. Conclusion: The hybrid-resolution strategy balances text recognition accuracy against system energy efficiency and offers a workable approach to real-time Text VQA on wearable devices. Abstract: Video Large Language Models (Video LLMs) have shown remarkable progress in understanding and reasoning about visual content, particularly in tasks involving text recognition and text-based visual question answering (Text VQA). However, deploying Text VQA on wearable devices faces a fundamental tension: text recognition requires high-resolution video, but streaming high-quality video drains battery and causes thermal throttling. Moreover, existing models struggle to maintain coherent temporal context when processing text across multiple frames in real-time streams. We observe that text recognition and visual reasoning have asymmetric resolution requirements - OCR needs fine detail while scene understanding tolerates coarse features. We exploit this asymmetry with a hybrid architecture that performs selective high-resolution OCR on-device while streaming low-resolution video for visual context. On a benchmark of text-based VQA samples across five task categories, our system achieves 72% accuracy at 0.49x the power consumption of full-resolution streaming, enabling sustained VQA sessions on resource-constrained wearables without sacrificing text understanding quality.

[126] Benchmarking Video Foundation Models for Remote Parkinson's Disease Screening

Md Saiful Islam,Ekram Hossain,Abdelrahman Abdelkader,Tariq Adnan,Fazla Rabbi Mashrur,Sooyong Park,Praveen Kumar,Qasim Sudais,Natalia Chunga,Nami Shah,Jan Freyberg,Christopher Kanan,Ruth Schneider,Ehsan Hoque

Main category: cs.CV

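TL;DR: This study systematically evaluates seven video foundation models (VFMs) for remote, video-based Parkinson's disease screening on a large new dataset of 32,847 videos from 1,888 participants. Performance differs markedly across models and clinical tasks, and the results provide empirical evidence and a baseline for matching models to tasks in remote neurological monitoring.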

Details

Motivation: In remote video-based Parkinson's disease screening, it remains unclear how traditional handcrafted features compare with emerging video foundation models (VFMs), and the robustness of different VFM architectures across diverse clinical tasks has not been systematically evaluated. Method: Builds a new dataset of 32,847 videos from 1,888 participants (727 with PD) across 16 standardized clinical tasks; with the pre-trained VFM feature extractors frozen and a uniform linear classification head, systematically evaluates seven state-of-the-art VFMs, including VideoPrism, V-JEPA, ViViT, VideoMAE, and TimeSformer, on each task (AUC, accuracy, sensitivity, specificity). Result: Performance is strongly task-dependent: VideoPrism is best on visual speech kinematics (no audio) and facial expressivity, V-JEPA is best on upper-limb motor tasks, and TimeSformer is most competitive on rhythmic tasks such as finger tapping; overall AUC is 76.4-85.3% and accuracy 71.5-80.6%, with specificity up to 90.3% but sensitivity of only 43.2-57.3%. Conclusion: VFMs are task-specific in PD video screening and no single model suits every clinical task; multiple tasks and modalities should be combined with task-aware calibration; the study establishes a rigorous baseline for VFM-based remote PD screening and a roadmap for matching models to tasks in deployment. Abstract: Remote, video-based assessments offer a scalable pathway for Parkinson's disease (PD) screening. While traditional approaches rely on handcrafted features mimicking clinical scales, recent advances in video foundation models (VFMs) enable representation learning without task-specific customization. However, the comparative effectiveness of different VFM architectures across diverse clinical tasks remains poorly understood. We present a large-scale systematic study using a novel video dataset from 1,888 participants (727 with PD), comprising 32,847 videos across 16 standardized clinical tasks. We evaluate seven state-of-the-art VFMs -- including VideoPrism, V-JEPA, ViViT, and VideoMAE -- to determine their robustness in clinical screening. By evaluating frozen embeddings with a linear classification head, we demonstrate that task saliency is highly model-dependent: VideoPrism excels in capturing visual speech kinematics (no audio) and facial expressivity, while V-JEPA proves superior for upper-limb motor tasks. Notably, TimeSformer remains highly competitive for rhythmic tasks like finger tapping. Our experiments yield AUCs of 76.4-85.3% and accuracies of 71.5-80.6%. While high specificity (up to 90.3%) suggests strong potential for ruling out healthy individuals, the lower sensitivity (43.2-57.3%) highlights the need for task-aware calibration and integration of multiple tasks and modalities. 
Overall, this work establishes a rigorous baseline for VFM-based PD screening and provides a roadmap for selecting suitable tasks and architectures in remote neurological monitoring. Code and anonymized structured data are publicly available: https://anonymous.4open.science/r/parkinson_video_benchmarking-A2C5
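The gap between specificity and sensitivity reported above follows directly from how the two are defined on a confusion matrix. The sketch below uses made-up counts, not data from the study, and a hypothetical `screening_metrics` helper, to show how a classifier can look accurate overall while still missing about half of the positive cases.

```python
def screening_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts.

    tp: PD participants flagged as PD        fn: PD participants missed
    tn: healthy participants cleared         fp: healthy participants flagged
    """
    total = tp + fn + tn + fp
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


if __name__ == "__main__":
    # Illustrative counts only: 10 PD and 10 healthy participants.
    print(screening_metrics(tp=5, fn=5, tn=9, fp=1))
```

With a fixed linear head on frozen embeddings, moving the decision threshold trades one quantity for the other, which is one reading of the abstract's call for task-aware calibration.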

[127] SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning

Jintao Zhang,Kai Jiang,Chendong Xiang,Weiqi Feng,Yuezhou Hu,Haocheng Xi,Jianfei Chen,Jun Zhu

Main category: cs.CV

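TL;DR: This paper proposes SpargeAttention2, a trainable sparse attention method that combines a hybrid masking rule, an efficient implementation, and a distillation-inspired fine-tuning objective to reach up to 95% attention sparsity and a 16.2x attention speedup while maintaining generation quality.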

Details

Motivation: Existing training-free sparse attention methods are effective, and trainable sparse attention promises even higher sparsity; however, it is not well understood when the common masking rules fail, why training enables higher sparsity, or what limits fine-tuning with the diffusion loss. Method: Proposes SpargeAttention2: (i) a hybrid masking rule that combines Top-k and Top-p for more robust masking at high sparsity; (ii) an efficient trainable sparse attention implementation; (iii) a distillation-inspired fine-tuning objective that better preserves generation quality. Result: On video diffusion models, it reaches 95% attention sparsity and a 16.2x attention speedup while maintaining generation quality, consistently outperforming prior sparse attention methods. Conclusion: By analyzing the key questions systematically and designing components to match, SpargeAttention2 achieves both high sparsity and high-quality generation and offers a new approach to trainable sparse attention. Abstract: Many training-free sparse attention methods are effective for accelerating diffusion models. Recently, several works suggest that making sparse attention trainable can further increase sparsity while preserving generation quality. We study three key questions: (1) when do the two common masking rules, i.e., Top-k and Top-p, fail, and how can we avoid these failures? (2) why can trainable sparse attention reach higher sparsity than training-free methods? (3) what are the limitations of fine-tuning sparse attention using the diffusion loss, and how can we address them? Based on this analysis, we propose SpargeAttention2, a trainable sparse attention method that achieves high sparsity without degrading generation quality. SpargeAttention2 includes (i) a hybrid masking rule that combines Top-k and Top-p for more robust masking at high sparsity, (ii) an efficient trainable sparse attention implementation, and (iii) a distillation-inspired fine-tuning objective to better preserve generation quality during fine-tuning using sparse attention. Experiments on video diffusion models show that SpargeAttention2 reaches 95% attention sparsity and a 16.2x attention speedup while maintaining generation quality, consistently outperforming prior sparse attention methods.
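The two masking rules named in the abstract can be sketched on a single row of attention probabilities. The abstract does not say how SpargeAttention2 combines them, so the union used in `hybrid_mask` below (keep a key if either rule selects it) is an assumption made for illustration, as are the function names; the actual method also operates on blocks inside a GPU kernel rather than on Python lists.

```python
def _descending_order(probs):
    """Indices of probs sorted from largest to smallest (ties keep input order)."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)


def top_k_mask(probs, k):
    """Keep the k keys with the highest attention probability."""
    keep = set(_descending_order(probs)[:k])
    return [i in keep for i in range(len(probs))]


def top_p_mask(probs, p):
    """Keep the smallest set of top keys whose cumulative probability reaches p."""
    keep, total = set(), 0.0
    for i in _descending_order(probs):
        keep.add(i)
        total += probs[i]
        if total >= p:
            break
    return [i in keep for i in range(len(probs))]


def hybrid_mask(probs, k, p):
    """Assumed combination: a key survives if either rule keeps it."""
    return [a or b for a, b in zip(top_k_mask(probs, k), top_p_mask(probs, p))]


if __name__ == "__main__":
    peaked = [0.90, 0.05, 0.03, 0.02]   # Top-p alone keeps a single key here
    flat = [0.25, 0.25, 0.25, 0.25]     # Top-k alone covers only half the mass here
    print(hybrid_mask(peaked, k=2, p=0.8))
    print(hybrid_mask(flat, k=2, p=0.8))
```

The two example rows show the complementary behavior: on a peaked row Top-p keeps very few keys, and on a flat row Top-k discards a large share of the probability mass, which is one plausible reading of why a combined rule is more robust at high sparsity.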

[128] Nighttime Autonomous Driving Scene Reconstruction with Physically-Based Gaussian Splatting

Tae-Kyeong Kim,Xingxin Chen,Guile Wu,Chengjie Huang,Dongfeng Bai,Bingbing Liu

Main category: cs.CV

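TL;DR: This paper proposes a new method that integrates physically based rendering into 3D Gaussian Splatting (3DGS) to improve reconstruction of nighttime autonomous driving scenes, markedly improving modeling under low light while keeping real-time rendering.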

Details

Motivation: Existing NeRF- and 3DGS-based methods work well under normal lighting but degrade on nighttime driving scenes with complex lighting and appearance, which calls for a targeted improvement. Method: Integrates physically based rendering into 3DGS and jointly optimizes BRDF-based material properties; diffuse components are modeled explicitly through a global illumination module, and specular components through anisotropic spherical Gaussians. Result: Across diverse nighttime scenarios on two real-world autonomous driving datasets, nuScenes and Waymo, quantitative and qualitative results surpass state-of-the-art methods, with real-time rendering retained. Conclusion: 3DGS combined with physically based rendering improves the realism and robustness of nighttime autonomous driving scene reconstruction and suggests a new direction for low-light simulation. Abstract: This paper focuses on scene reconstruction under nighttime conditions in autonomous driving simulation. Recent methods based on Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) have achieved photorealistic modeling in autonomous driving scene reconstruction, but they primarily focus on normal-light conditions. Low-light driving scenes are more challenging to model due to their complex lighting and appearance conditions, which often causes performance degradation of existing methods. To address this problem, this work presents a novel approach that integrates physically based rendering into 3DGS to enhance nighttime scene reconstruction for autonomous driving. Specifically, our approach integrates physically based rendering into composite scene Gaussian representations and jointly optimizes Bidirectional Reflectance Distribution Function (BRDF) based material properties. We explicitly model diffuse components through a global illumination module and specular components by anisotropic spherical Gaussians. As a result, our approach improves reconstruction quality for outdoor nighttime driving scenes, while maintaining real-time rendering. Extensive experiments across diverse nighttime scenarios on two real-world autonomous driving datasets, including nuScenes and Waymo, demonstrate that our approach outperforms the state-of-the-art methods both quantitatively and qualitatively.

[129] Privacy-Concealing Cooperative Perception for BEV Scene Segmentation

Song Wang,Lingling Li,Marcus Santos,Guanghui Wang

Main category: cs.CV

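TL;DR: This paper proposes a Privacy-Concealing Cooperation (PCC) framework for Bird's Eye View (BEV) semantic segmentation, which uses adversarial learning to conceal visual clues in shared BEV features so that images cannot be reconstructed, while preserving segmentation performance.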

Details

Motivation: Cooperative perception improves the perception capability of autonomous vehicles, but sharing sensing data allows sensitive visual content to be reconstructed, creating a serious privacy-leakage problem. Method: Proposes the Privacy-Concealing Cooperation (PCC) framework, which designs a hiding network on top of the shared BEV features and trains it adversarially: the hiding network conceals visual clues while an image reconstruction network tries to recover the original images; the perception network and the hiding network are optimized jointly end-to-end. Result: Experiments show that PCC substantially degrades the quality of reconstructed images with minimal impact on semantic segmentation performance, protecting the visual privacy of cooperating vehicles. Conclusion: PCC conceals private information in shared features while preserving BEV semantic segmentation accuracy, offering a workable approach to privacy protection in cooperative perception. Abstract: Cooperative perception systems for autonomous driving aim to overcome the limited perception range of a single vehicle by communicating with adjacent agents to share sensing information. While this improves perception performance, these systems also face a significant privacy-leakage issue, as sensitive visual content can potentially be reconstructed from the shared data. In this paper, we propose a novel Privacy-Concealing Cooperation (PCC) framework for Bird's Eye View (BEV) semantic segmentation. Based on commonly shared BEV features, we design a hiding network to prevent an image reconstruction network from recovering the input images from the shared features. An adversarial learning mechanism is employed to train the network, where the hiding network works to conceal the visual clues in the BEV features while the reconstruction network attempts to uncover these clues. To maintain segmentation performance, the perception network is integrated with the hiding network and optimized end-to-end. The experimental results demonstrate that the proposed PCC framework effectively degrades the quality of the reconstructed images with minimal impact on segmentation performance, providing privacy protection for cooperating vehicles. The source code will be made publicly available upon publication.

[130] Diff-Aid: Inference-time Adaptive Interaction Denoising for Rectified Text-to-Image Generation

Binglei Li,Mengping Yang,Zhiyu Tan,Junping Zhang,Hao Li

Main category: cs.CV

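TL;DR: This paper proposes Diff-Aid, a lightweight, plug-and-play inference-time method that adaptively adjusts the interactions between text tokens and image features across Transformer blocks and denoising timesteps, improving how faithfully text-to-image diffusion models follow complex prompts and yielding interpretable modulation patterns.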

Details

Motivation: Existing text-to-image diffusion models still struggle to follow complex textual descriptions faithfully, mainly because of insufficient interaction between textual and visual features; prior methods lack flexibility and overlook the dynamic interactions across blocks and timesteps. Method: Proposes Diff-Aid, a lightweight inference-time method that adaptively adjusts the interaction between each text token and the image features across transformer blocks and denoising timesteps; as a plug-and-play module it is compatible with downstream uses such as style LoRAs, controllable generation, and zero-shot editing. Result: On the strong baselines SD 3.5 and FLUX, Diff-Aid consistently improves prompt adherence, image quality, and human preference; it also produces interpretable modulation patterns that reveal the semantic alignment process. Conclusion: Diff-Aid offers a flexible, efficient, interpretable, plug-and-play inference-time approach to improving text-image alignment in T2I models, with broad downstream applicability. Abstract: Recent text-to-image (T2I) diffusion models have achieved remarkable advancement, yet faithfully following complex textual descriptions remains challenging due to insufficient interactions between textual and visual features. Prior approaches enhance such interactions via architectural design or handcrafted textual condition weighting, but lack flexibility and overlook the dynamic interactions across different blocks and denoising stages. To provide a more flexible and efficient solution to this problem, we propose Diff-Aid, a lightweight inference-time method that adaptively adjusts per-token text and image interactions across transformer blocks and denoising timesteps. Beyond improving generation quality, Diff-Aid yields interpretable modulation patterns that reveal how different blocks, timesteps, and textual tokens contribute to semantic alignment during denoising. As a plug-and-play module, Diff-Aid can be seamlessly integrated into downstream applications for further improvement, including style LoRAs, controllable generation, and zero-shot editing. Experiments on strong baselines (SD 3.5 and FLUX) demonstrate consistent improvements in prompt adherence, visual quality, and human preference across various metrics. Our code and models will be released.

[131] Two-Stream Interactive Joint Learning of Scene Parsing and Geometric Vision Tasks

Guanfeng Tang,Hongbo Zhao,Ziwei Long,Jiayao Li,Bohong Xiao,Wei Ye,Hanli Wang,Rui Fan

Main category: cs.CV

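TL;DR: This paper proposes Two Interactive Streams (TwInS), a framework inspired by the human visual system for jointly learning scene parsing and geometric vision tasks. Bidirectional feature fusion and a semi-supervised training strategy improve performance without costly human-annotated correspondences.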

Details

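Motivation: Inspired by the parallel yet interactive pathways for contextual and spatial understanding in the human visual system, the work targets the lack of effective synergy between scene parsing and geometric vision tasks and their dependence on large amounts of human-annotated correspondences. Method: Proposes the TwInS framework: 1) within a unified architecture, multi-level contextual features from the scene parsing stream are infused into the geometric vision stream to guide its iterative refinement; 2) decoded geometric features are projected into the contextual feature space by a cross-task adapter for selective heterogeneous feature fusion; 3) a tailored semi-supervised training strategy exploits large-scale multi-view data and removes the dependence on human-annotated correspondence ground truth. Result: Extensive experiments on three public datasets validate the effectiveness of each core component of TwInS and show performance above existing state-of-the-art methods. Conclusion: TwInS is an effective bio-inspired joint learning framework that couples scene parsing and geometric vision tasks closely and generalizes well in practice; the source code will be released. Abstract: Inspired by the human visual system, which operates on two parallel yet interactive streams for contextual and spatial understanding, this article presents Two Interactive Streams (TwInS), a novel bio-inspired joint learning framework capable of simultaneously performing scene parsing and geometric vision tasks. TwInS adopts a unified, general-purpose architecture in which multi-level contextual features from the scene parsing stream are infused into the geometric vision stream to guide its iterative refinement. In the reverse direction, decoded geometric features are projected into the contextual feature space for selective heterogeneous feature fusion via a novel cross-task adapter, which leverages rich cross-view geometric cues to enhance scene parsing. To eliminate the dependence on costly human-annotated correspondence ground truth, TwInS is further equipped with a tailored semi-supervised training strategy, which unleashes the potential of large-scale multi-view data and enables continuous self-evolution without requiring ground-truth correspondences. Extensive experiments conducted on three public datasets validate the effectiveness of TwInS's core components and demonstrate its superior performance over existing state-of-the-art approaches. The source code will be made publicly available upon publication.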

[132] AdaVBoost: Mitigating Hallucinations in LVLMs via Token-Level Adaptive Visual Attention Boosting

Jiacheng Zhang,Feng Liu,Chao Du,Tianyu Pang

Main category: cs.CV

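TL;DR: This paper proposes AdaVBoost, an adaptive visual attention boosting framework based on Visual Grounding Entropy (VGE), to mitigate hallucinations in Large Vision-Language Models (LVLMs).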

Details

Motivation: Existing visual attention boosting methods use a predefined scaling factor whose strength is mismatched across generation steps: too weak and hallucinations persist, too strong and new hallucinations appear. Method: Proposes the AdaVBoost framework, which introduces Visual Grounding Entropy (VGE) to quantify the hallucination risk of the token generated at each step and uses it to dynamically adjust the strength of attention boosting on the corresponding visual tokens, enabling fine-grained, adaptive intervention. Result: Across multiple LVLMs and hallucination benchmarks, AdaVBoost significantly outperforms baseline methods. Conclusion: Adaptive, token-level visual attention boosting is more effective than fixed-strength boosting; VGE serves as a hallucination-risk signal that can effectively guide attention intervention. Abstract: Visual attention boosting has emerged as a promising direction for mitigating hallucinations in Large Vision-Language Models (LVLMs), where existing methods primarily focus on where to boost by applying a predefined scaling to the attention of method-specific visual tokens during autoregressive generation. In this paper, we identify a fundamental trade-off in these methods: a predefined scaling factor can be too weak at some generation steps, leaving hallucinations unresolved, yet too strong at others, leading to new hallucinations. Motivated by this finding, we propose AdaVBoost, a token-level visual attention boosting framework that adaptively determines how much attention to boost at each generation step. Specifically, we introduce Visual Grounding Entropy (VGE) to estimate hallucination risk, which leverages visual grounding as a complementary signal to capture evidence mismatches beyond entropy. Guided by VGE, AdaVBoost applies stronger visual attention boosting to high-risk tokens and weaker boosting to low-risk tokens, enabling token-level adaptive intervention at each generation step. Extensive experiments show that AdaVBoost significantly outperforms baseline methods across multiple LVLMs and hallucination benchmarks.
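
The idea of boosting visual attention more strongly for high-risk tokens can be illustrated with a small sketch. The abstract does not give the formula for VGE, so the risk score below is plain normalized entropy used as a stand-in, and the linear risk-to-boost mapping, the bounds `min_boost`/`max_boost`, and all function names are hypothetical.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a next-token distribution, scaled to [0, 1].

    Stand-in risk score only; this is NOT the paper's VGE definition.
    """
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs)) if len(probs) > 1 else 0.0

def boost_factor(risk, min_boost=1.0, max_boost=2.0):
    """Map a risk score in [0, 1] to a scaling factor (hypothetical linear rule)."""
    risk = min(max(risk, 0.0), 1.0)
    return min_boost + risk * (max_boost - min_boost)

def boost_visual_attention(attn, visual_idx, factor):
    """Scale attention on visual tokens by `factor`, then renormalize to sum to 1."""
    scaled = [a * factor if i in visual_idx else a for i, a in enumerate(attn)]
    total = sum(scaled)
    return [a / total for a in scaled]

# Usage: a uniform (maximally uncertain) prediction gets the strongest boost.
risk = normalized_entropy([0.25, 0.25, 0.25, 0.25])
boosted = boost_visual_attention([0.2, 0.3, 0.5], {0, 1}, boost_factor(risk))
```

A fixed-factor baseline corresponds to calling `boost_visual_attention` with the same `factor` at every step; the adaptive variant recomputes it per generated token.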

[133] Towards Sparse Video Understanding and Reasoning

Chenwei Xu,Zhen Ye,Shang Wu,Weijian Li,Zihan Wang,Zhuofan Xia,Lie Lu,Pranav Maneriker,Fan Du,Manling Li,Han Liu

Main category: cs.CV

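TL;DR: This paper proposes REVISE, a multi-round video question answering (VQA) agent that improves efficiency and accuracy by selecting sparse, informative frames, maintaining a summary state, and stopping early; it also introduces EAGER, an annotation-free reward that supports reinforcement fine-tuning.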

Details

Motivation: Existing VQA methods typically sample video frames uniformly, which incurs high computational cost and redundancy; video information needs to be used more efficiently, reducing the number of frames, rounds, and prompt length while maintaining accuracy. Method: Proposes the multi-round agent REVISE, which combines sparse frame selection, summary-state maintenance, and confident early stopping; designs the annotation-free reward EAGER, comprising three terms (confidence gain, summary sufficiency, and correct-and-early stop), to support reinforcement fine-tuning of open-source VLMs. Result: On multiple VQA benchmarks, REVISE improves accuracy while markedly reducing the number of frames, reasoning rounds, and prompt tokens used. Conclusion: REVISE achieves practical, efficient sparse video reasoning, combining generality (plug-and-play with proprietary VLMs) with optimizability (reinforcement fine-tuning of open-source models), and offers a new paradigm for video understanding. Abstract: We present REVISE (Reasoning with Video Sparsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, REVISE selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a "plug-and-play" setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, REVISE improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.
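
The confidence-gain term of EAGER (the increase in the log-odds gap between the correct option and the strongest alternative) can be sketched as follows. This is one plausible reading of the abstract, not the authors' implementation: the per-option probabilities, the clipping constant `eps`, and the function names are assumptions.

```python
import math

def log_odds_gap(option_probs, correct, eps=1e-6):
    """Log-odds of the correct option minus log-odds of the strongest alternative."""
    def logit(p):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep the logarithm finite
        return math.log(p / (1.0 - p))
    best_other = max(p for k, p in option_probs.items() if k != correct)
    return logit(option_probs[correct]) - logit(best_other)

def confidence_gain(before, after, correct):
    """Change in the log-odds gap after new frames are added (positive is rewarded)."""
    return log_odds_gap(after, correct) - log_odds_gap(before, correct)

# Usage: option probabilities before and after a round that adds frames.
before = {"A": 0.4, "B": 0.4, "C": 0.2}
after = {"A": 0.7, "B": 0.2, "C": 0.1}
gain = confidence_gain(before, after, correct="A")
```

In this example the gap is zero before the new frames (options A and B are tied) and positive afterwards, so the round would receive a positive confidence-gain reward.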

[134] A generalizable foundation model for intraoperative understanding across surgical procedures

Kanggil Park,Yongjun Jeon,Soyoung Lim,Seonmin Park,Jongmin Shin,Jung Yong Kim,Sehyeon An,Jinsoo Rhu,Jongman Kim,Gyu-Seong Choi,Namkee Oh,Kyu-Hwan Jung

Main category: cs.CV

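TL;DR: This paper presents ZEN, a generalizable foundation model for intraoperative surgical video understanding, trained with a self-supervised multi-teacher distillation framework on more than 4 million frames from over 21 procedures; it shows strong cross-procedure generalization and outperforms existing models across 20 downstream tasks.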

Details

Motivation: Intraoperative visual perception varies widely across surgeons and procedures, leading to inconsistent assessment, training, and AI system development; existing surgical AI models are defined for narrow tasks and do not generalize across procedures or institutions. Method: Proposes ZEN, trained with a self-supervised multi-teacher distillation framework on a large, diverse dataset covering over 21 procedures and more than 4 million frames, and systematically evaluates multiple representation learning strategies under a unified benchmark. Result: Across 20 downstream tasks in full fine-tuning, frozen-backbone, few-shot, and zero-shot settings, ZEN consistently outperforms existing surgical foundation models and shows robust cross-procedure generalization. Conclusion: ZEN provides a more unified representational basis for surgical scene understanding and supports future applications such as intraoperative assistance and surgical training assessment. Abstract: In minimally invasive surgery, clinical decisions depend on real-time visual interpretation, yet intraoperative perception varies substantially across surgeons and procedures. This variability limits consistent assessment, training, and the development of reliable artificial intelligence systems, as most surgical AI models are designed for narrowly defined tasks and do not generalize across procedures or institutions. Here we introduce ZEN, a generalizable foundation model for intraoperative surgical video understanding trained on more than 4 million frames from over 21 procedures using a self-supervised multi-teacher distillation framework. We curated a large and diverse dataset and systematically evaluated multiple representation learning strategies within a unified benchmark. Across 20 downstream tasks and full fine-tuning, frozen-backbone, few-shot and zero-shot settings, ZEN consistently outperforms existing surgical foundation models and demonstrates robust cross-procedure generalization. These results suggest a step toward unified representations for surgical scene understanding and support future applications in intraoperative assistance and surgical training assessment.

[135] Layer-Guided UAV Tracking: Enhancing Efficiency and Occlusion Robustness

Yang Zhou,Derui Ding,Ran Sun,Ying Sun,Haohua Zhang

Main category: cs.CV

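TL;DR: This paper proposes LGTrack, a lightweight unified framework for UAV visual object tracking that combines dynamic layer selection, efficient feature enhancement, and robust occlusion representation learning to reach faster-than-real-time speed (258.7 FPS) while maintaining high accuracy.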

Details

Motivation: Addresses the trade-off between accuracy and efficiency in UAV visual tracking, particularly under challenges such as unpredictable occlusion. Method: Proposes the LGTrack framework, which includes a lightweight Global-Grouped Coordinate Attention (GGCA) module to model long-range dependencies and global context, and a lightweight Similarity-Guided Layer Adaptation (SGLA) module that replaces knowledge distillation, balancing accuracy and inference efficiency. Result: On three datasets, LGTrack achieves state-of-the-art real-time speed (258.7 FPS on UAVDT) with competitive tracking accuracy (82.8% precision). Conclusion: LGTrack substantially increases inference speed while maintaining high tracking accuracy, providing an efficient and reliable tracking solution for resource-constrained UAV platforms. Abstract: Visual object tracking (VOT) plays a pivotal role in unmanned aerial vehicle (UAV) applications. Addressing the trade-off between accuracy and efficiency, especially under challenging conditions like unpredictable occlusion, remains a significant challenge. This paper introduces LGTrack, a unified UAV tracking framework that integrates dynamic layer selection, efficient feature enhancement, and robust representation learning for occlusions. By employing a novel lightweight Global-Grouped Coordinate Attention (GGCA) module, LGTrack captures long-range dependencies and global contexts, enhancing feature discriminability with minimal computational overhead. Additionally, a lightweight Similarity-Guided Layer Adaptation (SGLA) module replaces knowledge distillation, achieving an optimal balance between tracking precision and inference efficiency. Experiments on three datasets demonstrate LGTrack's state-of-the-art real-time speed (258.7 FPS on UAVDT) while maintaining competitive tracking accuracy (82.8% precision). Code is available at https://github.com/XiaoMoc/LGTrack

[136] DCDM: Divide-and-Conquer Diffusion Models for Consistency-Preserving Video Generation

Haoyu Zhao,Yuang Zhang,Junqi Cheng,Jiaxi Gu,Zenghui Lu,Peng Shu,Zuxuan Wu,Yu-Gang Jiang

Main category: cs.CV

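TL;DR: This paper proposes DCDM, a system-level framework that uses decomposed modeling to address three kinds of consistency in video generation: intra-clip world knowledge, inter-clip camera motion, and inter-shot element consistency; its effectiveness is validated on the test set of the CVM Competition at AAAI'26.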

Details

Motivation: Existing video generation models achieve high visual fidelity but fall short on semantic, geometric, and identity consistency. Method: DCDM decomposes video consistency modeling into three dedicated modules: (1) a large language model parses the prompt and drives a diffusion transformer to generate coherent video; (2) a temporal camera representation is introduced in the noise space and combined with text-to-image initialization to control camera motion; (3) a holistic scene generation paradigm with windowed cross-attention and sparse inter-shot self-attention ensures long-range narrative coherence. Result: Validation on the test set of the CVM Competition at AAAI'26 shows that the proposed strategies effectively improve all three kinds of consistency. Conclusion: Through a systematically decoupled design, DCDM markedly improves multi-dimensional consistency in video generation while keeping a unified generation backbone. Abstract: Recent video generative models have demonstrated impressive visual fidelity, yet they often struggle with semantic, geometric, and identity consistency. In this paper, we propose a system-level framework, termed the Divide-and-Conquer Diffusion Model (DCDM), to address three key challenges: (1) intra-clip world knowledge consistency, (2) inter-clip camera consistency, and (3) inter-shot element consistency. DCDM decomposes video consistency modeling under these scenarios into three dedicated components while sharing a unified video generation backbone. For intra-clip consistency, DCDM leverages a large language model to parse input prompts into structured semantic representations, which are subsequently translated into coherent video content by a diffusion transformer. For inter-clip camera consistency, we propose a temporal camera representation in the noise space that enables precise and stable camera motion control, along with a text-to-image initialization mechanism to further enhance controllability. For inter-shot consistency, DCDM adopts a holistic scene generation paradigm with windowed cross-attention and sparse inter-shot self-attention, ensuring long-range narrative coherence while maintaining computational efficiency. We validate our framework on the test set of the CVM Competition at AAAI'26, and the results demonstrate that the proposed strategies effectively address these challenges.

[137] KorMedMCQA-V: A Multimodal Benchmark for Evaluating Vision-Language Models on the Korean Medical Licensing Examination

Byungjin Choi,Seongsu Bae,Sunjun Kweon,Edward Choi

Main category: cs.CV

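TL;DR: This paper introduces KorMedMCQA-V, a Korean multimodal multiple-choice question answering benchmark based on the Korean Medical Licensing Examination, containing 1,534 questions and 2,043 medical images across modalities such as X-ray, CT, and ECG; more than 50 vision-language models are evaluated under a unified zero-shot protocol, showing that specialized models vary widely in performance and that multi-image reasoning remains difficult.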

Details

Motivation: Existing Korean medical QA benchmarks lack multimodal support, making it hard to fully assess VLM reasoning in realistic clinical settings (such as combining imaging with text); a Korean medical benchmark covering both language and vision is needed. Method: Builds the KorMedMCQA-V dataset (with questions, options, multi-image annotations, and modality labels), designs a unified zero-shot evaluation protocol, systematically evaluates 50+ VLMs (general-purpose, medical-specialized, and Korean-specialized), and performs fine-grained analyses (such as multi-image reasoning, modality differences, and the effect of model architecture). Result: Gemini-3.0-Pro achieves the highest accuracy at 96.9%, Qwen3-VL-32B-Thinking (best open-source) reaches 83.7%, and VARCO-VISION-2.0-14B (best Korean-specialized) only 43.2%; reasoning-oriented models gain up to 20 percentage points; all models degrade on multi-image questions; performance differs markedly across imaging modalities. Conclusion: KorMedMCQA-V fills the gap in Korean medical multimodal evaluation, exposes the bottlenecks of current VLMs in cross-image reasoning and Korean medical understanding, emphasizes that model architecture (reasoning ability) matters more than domain or language fine-tuning alone, and provides a standardized evaluation platform and data resource for future research. Abstract: We introduce KorMedMCQA-V, a Korean medical licensing-exam-style multimodal multiple-choice question answering benchmark for evaluating vision-language models (VLMs). The dataset consists of 1,534 questions with 2,043 associated images from Korean Medical Licensing Examinations (2012-2023), with about 30% containing multiple images requiring cross-image evidence integration. Images cover clinical modalities including X-ray, computed tomography (CT), electrocardiography (ECG), ultrasound, endoscopy, and other medical visuals. We benchmark over 50 VLMs across proprietary and open-source categories (spanning general-purpose, medical-specialized, and Korean-specialized families) under a unified zero-shot evaluation protocol. The best proprietary model (Gemini-3.0-Pro) achieves 96.9% accuracy, the best open-source model (Qwen3-VL-32B-Thinking) 83.7%, and the best Korean-specialized model (VARCO-VISION-2.0-14B) only 43.2%. We further find that reasoning-oriented model variants gain up to +20 percentage points over instruction-tuned counterparts, medical domain specialization yields inconsistent gains over strong general-purpose baselines, all models degrade on multi-image questions, and performance varies notably across imaging modalities. By complementing the text-only KorMedMCQA benchmark, KorMedMCQA-V forms a unified evaluation suite for Korean medical reasoning across text-only and multimodal conditions.
The dataset is available via Hugging Face Datasets: https://huggingface.co/datasets/seongsubae/KorMedMCQA-V.

[138] Optimizing Point-of-Care Ultrasound Video Acquisition for Probabilistic Multi-Task Heart Failure Detection

Armin Saadat,Nima Hashemi,Bahar Khodabakhshian,Michael Y. Tsang,Christina Luong,Teresa S. M. Tsang,Purang Abolmaesumi

Main category: cs.CV

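TL;DR: This paper proposes a reinforcement-learning-based personalized echocardiography acquisition strategy that dynamically selects the next best view or terminates acquisition, using 32% fewer videos while maintaining diagnostic performance for aortic stenosis (AS) and left ventricular ejection fraction (LVEF).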

Details

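Motivation: Point-of-care ultrasound (POCUS) must support clinical decisions under limited time and operator effort; conventional full-view acquisition is inefficient and costly, so a personalized, low-cost, high-yield acquisition strategy is needed. Method: Models POCUS as a sequential acquisition problem: a reinforcement learning (RL) agent dynamically selects the next view or terminates based on the multi-view videos already acquired; after termination, a shared multi-view transformer jointly performs AS severity classification (ordinal) and LVEF regression and outputs Gaussian predictive distributions to quantify uncertainty; the reward balances diagnostic benefit against acquisition cost, producing patient-specific acquisition pathways. Result: On 1,820 test studies, the method achieves 77.2% mean balanced accuracy (bACC), matching the full acquisition protocol while using only 68% of the videos, which demonstrates robust multi-task diagnosis under acquisition budget constraints. Conclusion: This patient-tailored, cost-aware acquisition framework can markedly streamline POCUS workflows without degrading decision quality, produces interpretable scan pathways suited to bedside use, can be extended to other cardiac endpoints, and has potential for clinical translation. Abstract: Purpose: Echocardiography with point-of-care ultrasound (POCUS) must support clinical decision-making under tight bedside time and operator-effort constraints. We introduce a personalized data acquisition strategy in which an RL agent, given a partially observed multi-view study, selects the next view to acquire or terminates acquisition to support heart-failure (HF) assessment. Upon termination, a diagnostic model jointly predicts aortic stenosis (AS) severity and left ventricular ejection fraction (LVEF), two key HF biomarkers, and outputs uncertainty, enabling an explicit trade-off between diagnostic performance and acquisition cost. Methods: We model POCUS as a sequential acquisition problem: at each step, a video selector (RL agent) chooses the next view to acquire or terminates acquisition. Upon termination, a shared multi-view transformer performs multi-task inference with two heads, ordinal AS classification, and LVEF regression, and outputs Gaussian predictive distributions yielding ordinal probabilities over AS classes and EF thresholds. These probabilities drive a reward that balances expected diagnostic benefit against acquisition cost, producing patient-specific acquisition pathways. Results: The dataset comprises 12,180 patient-level studies, split into training/validation/test sets (75/15/15).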
On the 1,820 test studies, our method matches full-study performance while using 32% fewer videos, achieving 77.2% mean balanced accuracy (bACC) across AS severity classification and LVEF estimation, demonstrating robust multi-task performance under acquisition budgets. Conclusion: Patient-tailored, cost-aware acquisition can streamline POCUS workflows while preserving decision quality, producing interpretable scan pathways suited to bedside use. The framework is extensible to additional cardiac endpoints and merits prospective evaluation for clinical integration.
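
One way a Gaussian predictive distribution can yield probabilities over EF thresholds is to integrate the Gaussian between consecutive cut points. The sketch below illustrates that computation only; the threshold values, the example mean and standard deviation, and the function names are hypothetical and are not taken from the paper.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bin_probabilities(mu, sigma, thresholds):
    """Probability mass of N(mu, sigma^2) in each interval cut by the thresholds."""
    cdfs = [0.0] + [normal_cdf(t, mu, sigma) for t in sorted(thresholds)] + [1.0]
    return [hi - lo for lo, hi in zip(cdfs, cdfs[1:])]

# Usage: predicted LVEF of 45 with standard deviation 5, and two
# illustrative (hypothetical) cut points at 40 and 50.
probs = bin_probabilities(mu=45.0, sigma=5.0, thresholds=[40.0, 50.0])
```

A wider predictive distribution spreads mass across more bins, which is how the uncertainty could feed into a reward that weighs further acquisition against its cost.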

[139] LeafNet: A Large-Scale Dataset and Comprehensive Benchmark for Foundational Vision-Language Understanding of Plant Diseases

Khang Nguyen Quoc,Phuong D. Dao,Luyl-Da Quach

Main category: cs.CV

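TL;DR: This paper introduces the LeafNet dataset and the LeafBench benchmark for evaluating vision-language models (VLMs) on plant pathology tasks; experiments show that VLMs outperform vision-only models at disease diagnosis but remain clearly deficient on tasks such as fine-grained identification.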

Details

Motivation: Foundation models and vision-language pre-trained models see limited use in agriculture, particularly plant pathology, mainly because large-scale, high-quality multimodal image-text datasets and benchmarks are lacking. Method: Builds LeafNet, a multimodal dataset with 186,000 leaf images, 97 disease classes, and 13,950 question-answer pairs, and designs LeafBench, a visual question-answering benchmark covering six key agricultural tasks; systematically evaluates 12 state-of-the-art VLMs and compares them with vision-only models. Result: Binary (healthy/diseased) classification exceeds 90% accuracy, but fine-grained pathogen and species identification stays below 65%; fine-tuned VLMs significantly outperform vision-only models, confirming that the language modality is key to improving diagnostic precision. Conclusion: Current VLMs still show substantial capability gaps in understanding plant pathology, and LeafBench provides a necessary and rigorous evaluation framework for advancing reliable AI-assisted plant disease diagnosis. Abstract: Foundation models and vision-language pre-training have significantly advanced Vision-Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their application in domain-specific agricultural tasks, such as plant pathology, remains limited due to the lack of large-scale, comprehensive multimodal image-text datasets and benchmarks. To address this gap, we introduce LeafNet, a comprehensive multimodal dataset, and LeafBench, a visual question-answering benchmark developed to systematically evaluate the capabilities of VLMs in understanding plant diseases. The dataset comprises 186,000 leaf digital images spanning 97 disease classes, paired with metadata, generating 13,950 question-answer pairs spanning six critical agricultural tasks. The questions assess various aspects of plant pathology understanding, including visual symptom recognition, taxonomic relationships, and diagnostic reasoning. Benchmarking 12 state-of-the-art VLMs on our LeafBench dataset, we reveal substantial disparity in their disease understanding capabilities. Our study shows performance varies markedly across tasks: binary healthy-diseased classification exceeds 90% accuracy, while fine-grained pathogen and species identification remains below 65%. Direct comparison between vision-only models and VLMs demonstrates the critical advantage of multimodal architectures: fine-tuned VLMs outperform traditional vision models, confirming that integrating linguistic representations significantly enhances diagnostic precision.
These findings highlight critical gaps in current VLMs for plant pathology applications and underscore the need for LeafBench as a rigorous framework for methodological advancement and progress evaluation toward reliable AI-assisted plant disease diagnosis. Code is available at https://github.com/EnalisUs/LeafBench.

Rang Meng,Weipeng Wu,Yingjie Yin,Yuming Li,Chenguang Ma

Main category: cs.CV

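TL;DR: EchoTorrent is an efficient and stable framework for real-time streaming multimodal video generation; through four techniques (multi-teacher training, adaptive CFG calibration, hybrid long tail forcing, and VAE decoder refinement), it markedly improves temporal consistency, identity preservation, and audio-lip synchronization while maintaining high visual quality.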

Details

Motivation: Existing multimodal video generation models suffer from high inference latency and poor temporal stability; under streaming inference in particular they exhibit multimodal degradation such as spatial blurring, temporal drift, and lip desynchronization, so efficiency and performance are hard to achieve together. Method: Proposes the EchoTorrent framework with four parts: (1) multi-teacher training, in which domain expert models transfer knowledge to a student model; (2) Adaptive CFG Calibration (ACC-DMD), which uses a phased spatiotemporal schedule to remove audio CFG augmentation errors in DMD and enables single-pass inference per step; (3) hybrid long tail forcing, which applies alignment constraints only to tail frames during self-rollout training and uses a causal-bidirectional hybrid architecture to mitigate streaming degradation; (4) VAE decoder refinement, which recovers high-frequency details through pixel-domain optimization while avoiding latent-space ambiguities. Result: Experiments show that EchoTorrent achieves autoregressive generation with very few forward passes, markedly extending temporal consistency and improving identity preservation and audio-lip synchronization accuracy. Conclusion: EchoTorrent effectively addresses the efficiency-performance trade-off in streaming multimodal video generation and offers a feasible technical path toward real-time deployment. Abstract: Recent multi-modal video generation models have achieved high visual quality, but their prohibitive latency and limited temporal stability hinder real-time deployment. Streaming inference exacerbates these issues, leading to pronounced multimodal degradation, such as spatial blurring, temporal drift, and lip desynchronization, which creates an unresolved efficiency-performance trade-off. To this end, we propose EchoTorrent, a novel schema with a fourfold design: (1) Multi-Teacher Training fine-tunes a pre-trained model on distinct preference domains to obtain specialized domain experts, which sequentially transfer domain-specific knowledge to a student model; (2) Adaptive CFG Calibration (ACC-DMD), which calibrates the audio CFG augmentation errors in DMD via a phased spatiotemporal schedule, eliminating redundant CFG computations and enabling single-pass inference per step; (3) Hybrid Long Tail Forcing, which enforces alignment exclusively on tail frames during long-horizon self-rollout training via a causal-bidirectional hybrid architecture, effectively mitigates spatiotemporal degradation in streaming mode while enhancing fidelity to reference frames; and (4) VAE Decoder Refiner through pixel-domain optimization of the VAE decoder to recover high-frequency details while circumventing latent-space ambiguities. Extensive experiments and analysis demonstrate that EchoTorrent achieves few-pass autoregressive generation with substantially extended temporal consistency, identity preservation, and audio-lip synchronization.

[141] An Ensemble Learning Approach towards Waste Segmentation in Cluttered Environment

Maimoona Jafar,Syed Imran Ali,Ahsan Saadat,Muhammad Bilal,Shah Khalid

Main category: cs.CV

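TL;DR: This paper proposes an ensemble learning method (EL-4) based on a weighted average of U-Net and FPN models to improve waste segmentation accuracy in real-world scenes; it raises IoU and lowers Dice loss, aiding automated waste sorting.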

Details

Motivation: Real-world waste environments are complex (deformed items, no regular patterns, overlapping objects), which makes segmentation difficult; high-accuracy segmentation masks are needed to support automated robotic sorting and improve recycling efficiency. Method: Uses two high-performing segmentation models, U-Net and FPN, and fuses their predictions by weighted averaging to build the ensemble model EL-4; uses a waste dataset that closely mimics real scenes, with preprocessing to enhance feature learning. Result: EL-4 achieves an IoU of 0.8306 (versus 0.8065 for U-Net) and a Dice loss of 0.09019 (versus 0.1183 for FPN). Conclusion: The ensemble method effectively improves waste segmentation accuracy, helping material recovery facilities achieve more efficient automated sorting and raw material acquisition with less human intervention. Abstract: Environmental pollution is a critical global issue, with recycling emerging as one of the most viable solutions. This study focuses on waste segregation, a crucial step in recycling processes to obtain raw material. Recent advancements in computer vision have significantly contributed to waste classification and recognition. In waste segregation, segmentation masks are essential for robots to accurately localize and pick objects from conveyor belts. The complexity of real-world waste environments, characterized by deformed items without specific patterns and overlapping objects, further complicates waste segmentation tasks. This paper proposes an Ensemble Learning approach to improve segmentation accuracy by combining high performing segmentation models, U-Net and FPN, using a weighted average method. U-Net excels in capturing fine details and boundaries in segmentation tasks, while FPN effectively handles scale variation and context in complex environments, and their combined masks result in more precise predictions. The dataset used closely mimics real-life waste scenarios, and preprocessing techniques were applied to enhance feature learning for deep learning segmentation models. The ensemble model, referred to as EL-4, achieved an IoU value of 0.8306, an improvement over U-Net's 0.8065, and reduced Dice loss to 0.09019 from FPN's 0.1183. This study could contribute to the efficiency of waste sorting at Material Recovery Facility, facilitating better raw material acquisition for recycling with minimal human intervention and enhancing the overall throughput.
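
Weighted averaging of two segmentation models, together with the IoU and Dice loss used to score the result, can be sketched on toy data. This is an illustration under stated assumptions, not the EL-4 implementation: the probability maps are flattened lists, and the weight `w_unet`, the 0.5 threshold, and the example values are hypothetical.

```python
def weighted_average(p_unet, p_fpn, w_unet=0.5):
    """Blend two per-pixel foreground probability maps (flat lists of floats)."""
    return [w_unet * a + (1.0 - w_unet) * b for a, b in zip(p_unet, p_fpn)]

def binarize(probs, threshold=0.5):
    """Turn probabilities into a 0/1 mask."""
    return [1 if p >= threshold else 0 for p in probs]

def iou(pred, target):
    """Intersection over union of two 0/1 masks."""
    inter = sum(p & t for p, t in zip(pred, target))
    union = sum(p | t for p, t in zip(pred, target))
    return inter / union if union else 1.0

def dice_loss(pred, target):
    """One minus the Dice coefficient of two 0/1 masks."""
    inter = sum(p & t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter / denom if denom else 1.0)

# Usage: four pixels; one model over-segments, the other under-segments.
target = [1, 1, 0, 0]
p_unet = [0.9, 0.7, 0.6, 0.1]
p_fpn = [0.8, 0.4, 0.2, 0.1]
ensemble_mask = binarize(weighted_average(p_unet, p_fpn, w_unet=0.5))
```

In this toy case each model alone makes one error in the opposite direction, and the averaged probabilities recover the target mask exactly; real gains depend on how complementary the two models' errors are.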

[142] A WDLoRA-Based Multimodal Generative Framework for Clinically Guided Corneal Confocal Microscopy Image Synthesis in Diabetic Neuropathy

Xin Zhang,Liangxiu Han,Yue Shi,Yalin Zheng,Uazman Alam,Maryam Ferdousi,Rayaz Malik

Main category: cs.CV

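TL;DR: This paper proposes a multimodal generative framework based on Weight-Decomposed Low-Rank Adaptation (WDLoRA) for clinically guided corneal confocal microscopy (CCM) image synthesis, addressing the scarcity of high-quality labelled data for diagnosing diabetic peripheral neuropathy (DPN). The method reaches state-of-the-art visual fidelity and structural integrity and improves downstream diagnostic and segmentation models.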

Details

Motivation: Existing deep learning diagnostic models are limited by scarce labelled data and fine-grained variability in corneal nerve morphology; general-purpose generative AI struggles to guarantee anatomical fidelity in medical imaging because it lacks domain-specific training. Method: Proposes WDLoRA, a parameter-efficient fine-tuning mechanism that decouples the magnitude and direction of weight updates to separately model nerve topology (orientation) and stromal contrast (intensity); performs multimodal conditional generation by jointly conditioning on nerve segmentation masks and disease-specific clinical prompts. Result: Synthetic images reach FID = 5.18 and SSIM = 0.630, significantly outperforming GAN and standard diffusion models; they preserve gold-standard clinical biomarkers and are statistically equivalent to real data; when used to train downstream models, diagnostic accuracy improves by 2.1% and segmentation performance by 2.2%. Conclusion: The WDLoRA-driven, clinically guided generative framework can effectively alleviate the data bottleneck in medical AI and provides high-fidelity, interpretable synthetic data for small-sample medical tasks such as DPN. Abstract: Corneal Confocal Microscopy (CCM) is a sensitive tool for assessing small-fiber damage in Diabetic Peripheral Neuropathy (DPN), yet the development of robust, automated deep learning-based diagnostic models is limited by scarce labelled data and fine-grained variability in corneal nerve morphology. Although Artificial Intelligence (AI)-driven foundation generative models excel at natural image synthesis, they often struggle in medical imaging due to limited domain-specific training, compromising the anatomical fidelity required for clinical analysis. To overcome these limitations, we propose a Weight-Decomposed Low-Rank Adaptation (WDLoRA)-based multimodal generative framework for clinically guided CCM image synthesis. WDLoRA is a parameter-efficient fine-tuning (PEFT) mechanism that decouples magnitude and directional weight updates, enabling foundation generative models to independently learn the orientation (nerve topology) and intensity (stromal contrast) required for medical realism. By jointly conditioning on nerve segmentation masks and disease-specific clinical prompts, the model synthesises anatomically coherent images across the DPN spectrum (Control, T1NoDPN, T1DPN). A comprehensive three-pillar evaluation demonstrates that the proposed framework achieves state-of-the-art visual fidelity (Fréchet Inception Distance (FID): 5.18) and structural integrity (Structural Similarity Index Measure (SSIM): 0.630), significantly outperforming GAN and standard diffusion baselines. Crucially, the synthetic images preserve gold-standard clinical biomarkers and are statistically equivalent to real patient data.
When used to train automated diagnostic models, the synthetic dataset improves downstream diagnostic accuracy by 2.1% and segmentation performance by 2.2%, validating the framework's potential to alleviate data bottlenecks in medical AI.

[143] Fine-tuned Vision Language Model for Localization of Parasitic Eggs in Microscopic Images

Chan Hao Sien,Hezerul Abdul Karim,Nouar AlDahoul

Main category: cs.CV

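TL;DR: This paper proposes a parasitic egg localization method based on a fine-tuned vision language model (VLM) for automating the diagnosis of soil-transmitted helminth infections; it achieves high-accuracy localization in microscopic images (mIOU 0.94) and outperforms detection methods such as EfficientDet.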

Details

Motivation: Manual microscopic diagnosis of soil-transmitted helminth infections is time-consuming, labour-intensive, and error-prone; automated solutions are urgently needed, especially in tropical and subtropical regions that lack specialized diagnostic resources. Method: Adopts and fine-tunes the Microsoft Florence vision language model (VLM) so that it can precisely localize all types of parasitic eggs in microscopic images, with a focus on the localization task. Result: The proposed localization VLM reaches an mIOU of 0.94, outperforming mainstream object detection methods such as EfficientDet. Conclusion: The VLM has the potential to serve as a core component of an automated framework for intelligent parasitological diagnosis and offers a scalable engineering solution. Abstract: Soil-transmitted helminth (STH) infections continuously affect a large proportion of the global population, particularly in tropical and sub-tropical regions, where access to specialized diagnostic expertise is limited. Although manual microscopic diagnosis of parasitic eggs remains the diagnostic gold standard, the approach can be labour-intensive, time-consuming, and prone to human error. This paper aims to utilize a vision language model (VLM) such as Microsoft Florence that was fine-tuned to localize all parasitic eggs within microscopic images. The preliminary results show that our localization VLM performs comparatively better than the other object detection methods, such as EfficientDet, with an mIOU of 0.94. This finding demonstrates the potential of the proposed VLM to serve as a core component of an automated framework, offering a scalable engineering solution for intelligent parasitological diagnosis.
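
For localization, mIOU is commonly computed as the mean intersection over union between predicted and ground-truth bounding boxes. The sketch below shows that computation for axis-aligned boxes; the abstract does not spell out the evaluation protocol, so the one-to-one pairing of predictions with ground truth, the `(x1, y1, x2, y2)` box format, and the function names are assumptions.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, targets):
    """Mean IoU over predictions already paired one-to-one with ground truth."""
    pairs = list(zip(preds, targets))
    return sum(box_iou(p, t) for p, t in pairs) / len(pairs)

# Usage: one perfect prediction and one partially overlapping prediction.
preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
targets = [(0, 0, 2, 2), (1, 1, 3, 3)]
score = mean_iou(preds, targets)
```

The second pair overlaps in a 1x1 square out of a combined area of 7, so its IoU is 1/7 and the mean over both pairs is 4/7.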

[144] RGA-Net: A Vision Enhancement Framework for Robotic Surgical Systems Using Reciprocal Attention Mechanisms

Quanjun Li,Weixuan Li,Han Xia,Junhua Zhou,Chi-Man Pun,Xuhang Chen

Main category: cs.CV

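TL;DR: This paper proposes RGA-Net, a deep learning framework for removing surgical smoke in robotic surgery; with dual-stream hybrid attention and axis-decomposed attention modules plus reciprocal cross-gating, it markedly improves endoscopic video clarity, benefiting human-robot interaction and surgical safety.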

Details

Motivation: Surgical smoke severely degrades endoscopic video quality, affecting precise teleoperation and surgical outcomes in robotic surgery. Method: Proposes RGA-Net, which comprises a Dual-Stream Hybrid Attention (DHA) module (combining shifted window attention with frequency-domain processing) and an Axis-Decomposed Attention (ADA) module (factorized multi-scale attention), with reciprocal cross-gating connecting the encoder and decoder. Result: Experiments on the DesmokeData and LSD3K datasets show that RGA-Net outperforms existing methods on smoke removal and provides clear visualization suitable for integration into robotic surgery. Conclusion: RGA-Net offers an effective technical path to more reliable visual feedback in robotic surgery, helping to reduce surgeons' cognitive load, optimize workflows, lower the risk of iatrogenic injury, and advance safer, more reliable robotic surgical systems. Abstract: Robotic surgical systems rely heavily on high-quality visual feedback for precise teleoperation; yet, surgical smoke from energy-based devices significantly degrades endoscopic video feeds, compromising the human-robot interface and surgical outcomes. This paper presents RGA-Net (Reciprocal Gating and Attention-fusion Network), a novel deep learning framework specifically designed for smoke removal in robotic surgery workflows. Our approach addresses the unique challenges of surgical smoke (including dense, non-homogeneous distribution and complex light scattering) through a hierarchical encoder-decoder architecture featuring two key innovations: (1) a Dual-Stream Hybrid Attention (DHA) module that combines shifted window attention with frequency-domain processing to capture both local surgical details and global illumination changes, and (2) an Axis-Decomposed Attention (ADA) module that efficiently processes multi-scale features through factorized attention mechanisms. These components are connected via reciprocal cross-gating blocks that enable bidirectional feature modulation between encoder and decoder pathways. Extensive experiments on the DesmokeData and LSD3K surgical datasets demonstrate that RGA-Net achieves superior performance in restoring visual clarity suitable for robotic surgery integration. Our method enhances the surgeon-robot interface by providing consistently clear visualization, laying a technical foundation for alleviating surgeons' cognitive burden, optimizing operation workflows, and reducing iatrogenic injury risks in minimally invasive procedures.
These practical benefits could be further validated through future clinical trials involving surgeon usability assessments. The proposed framework represents a significant step toward more reliable and safer robotic surgical systems through computational vision enhancement.

[145] Explore Intrinsic Geometry for Query-based Tiny and Oriented Object Detector with Momentum-based Bipartite Matching

Junpeng Zhang,Zewei Yang,Jie Feng,Yuhui Zheng,Ronghua Shang,Mengxuan Zhang

Main category: cs.CV

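TL;DR: IGOFormer is a new query-based oriented object detector that markedly improves detection of arbitrarily oriented (especially tiny) objects by introducing an intrinsic geometry-aware decoder and a momentum-based bipartite matching mechanism.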

Details

Motivation: Existing query-based detectors perform poorly on arbitrarily oriented objects, particularly tiny objects with limited texture information, mainly because intrinsic geometry is underused during pixel-level feature decoding and because bipartite matching is inconsistent across stages. Method: Proposes IGOFormer: (1) an intrinsic geometry-aware decoder that associates object queries with geometric embeddings to model the geometric layout; (2) a momentum-based bipartite matching mechanism that stabilizes cross-stage matching with an exponential moving average using query-specific smoothing factors. Result: On DOTA-V1.0 under the single-scale setting with a Swin-T backbone, the method reaches 78.00% AP50, clearly outperforming existing methods; ablation studies validate each module. Conclusion: By explicitly modeling the intrinsic geometric structure of objects and improving matching stability, IGOFormer effectively addresses the detection of arbitrarily oriented (especially tiny) objects in remote sensing imagery and offers a new direction for query-based detection frameworks. Abstract: Recent query-based detectors have achieved remarkable progress, yet their performance remains constrained when handling objects with arbitrary orientations, especially for tiny objects capturing limited texture information. This limitation primarily stems from the underutilization of intrinsic geometry during pixel-based feature decoding and the occurrence of inter-stage matching inconsistency caused by stage-wise bipartite matching. To tackle these challenges, we present IGOFormer, a novel query-based oriented object detector that explicitly integrates intrinsic geometry into feature decoding and enhances inter-stage matching stability. Specifically, we design an Intrinsic Geometry-aware Decoder, which enhances the object-related features conditioned on an object query by injecting complementary geometric embeddings extrapolated from their correlations to capture the geometric layout of the object, thereby offering a critical geometric insight into its orientation. Meanwhile, a Momentum-based Bipartite Matching scheme is developed to adaptively aggregate historical matching costs by formulating an exponential moving average with query-specific smoothing factors, effectively preventing conflicting supervisory signals arising from inter-stage matching inconsistency. Extensive experiments and ablation studies demonstrate the superiority of our IGOFormer for aerial oriented object detection, achieving an AP50 score of 78.00% on DOTA-V1.0 using Swin-T backbone under the single-scale setting. The code will be made publicly available.
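
The effect of smoothing matching costs across decoder stages can be shown on a toy cost matrix. This is a sketch of the general idea only: the abstract does not say how the query-specific smoothing factors are obtained, so `alphas` is a hypothetical constant per query, and a brute-force search stands in for the Hungarian algorithm that a real detector would use.

```python
from itertools import permutations

def ema_costs(prev, current, alphas):
    """Per-query exponential moving average; row q uses its own factor alphas[q]."""
    return [[a * p + (1.0 - a) * c for p, c in zip(prow, crow)]
            for prow, crow, a in zip(prev, current, alphas)]

def best_assignment(cost):
    """Brute-force min-cost one-to-one matching; result[t] is the query for target t.

    Only suitable for tiny matrices with at least as many queries as targets.
    """
    n_queries, n_targets = len(cost), len(cost[0])
    best, best_cost = None, float("inf")
    for perm in permutations(range(n_queries), n_targets):
        total = sum(cost[q][t] for t, q in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return best

# Usage: costs[q][t] for 3 queries and 2 targets at two consecutive stages.
stage1 = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
stage2 = [[0.6, 0.9], [0.8, 0.2], [0.5, 0.5]]
smoothed = ema_costs(stage1, stage2, alphas=[0.7, 0.7, 0.7])
```

Matching on the raw `stage2` costs reassigns target 0 from query 0 to query 2, while matching on `smoothed` keeps it on query 0, which is the kind of stage-to-stage flip the momentum scheme is meant to prevent.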

[146] Generative Latent Representations of 3D Brain MRI for Multi-Task Downstream Analysis in Down Syndrome

Jordi Malé,Juan Fortea,Mateus Rozalem-Aranha,Neus Martínez-Abadías,Xavier Sevillano

Main category: cs.CV

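TL;DR: This paper develops multiple variational autoencoders (VAEs) to encode 3D brain MRI scans and systematically evaluates the latent representations in terms of reconstruction quality, latent space structure visualization, and Down syndrome classification.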

Details

Motivation: Although generative models hold great promise in medical imaging, the structure and information content of their latent representations, and their suitability for downstream clinical tasks, remain underexplored. Method: Builds multiple variational autoencoders (VAEs) to encode 3D brain MRI and evaluates them systematically through MRI reconstruction quality assessment, visualization of the latent space structure with Principal Component Analysis (PCA), and downstream classification on an MRI dataset of individuals with Down syndrome and euploid individuals. Result: The VAEs capture key brain features while maintaining high reconstruction fidelity; the latent space shows clear clustering and, in particular, separates individuals with Down syndrome from euploid controls. Conclusion: The latent representations learned by the VAEs are well structured and discriminative, which helps advance the use of generative models in neuroimaging research and clinical decision-making. Abstract: Generative models have emerged as powerful tools in medical imaging, enabling tasks such as segmentation, anomaly detection, and high-quality synthetic data generation. These models typically rely on learning meaningful latent representations, which are particularly valuable given the high-dimensional nature of 3D medical images like brain magnetic resonance imaging (MRI) scans. Despite their potential, latent representations remain underexplored in terms of their structure, information content, and applicability to downstream clinical tasks. Investigating these representations is crucial for advancing the use of generative models in neuroimaging research and clinical decision-making. In this work, we develop multiple variational autoencoders (VAEs) to encode 3D brain MRI scans into compact latent space representations for generative and predictive applications. We systematically evaluate the effectiveness of the learned representations through three key analyses: (i) a quantitative and qualitative assessment of MRI reconstruction quality, (ii) a visualisation of the latent space structure using Principal Component Analysis, and (iii) downstream classification tasks on a proprietary dataset of brain MRI scans from euploid individuals and individuals with Down syndrome. Our results demonstrate that the VAE successfully captures essential brain features while maintaining high reconstruction fidelity. The latent space exhibits clear clustering patterns, particularly in distinguishing individuals with Down syndrome from euploid controls.

[147] T2MBench: A Benchmark for Out-of-Distribution Text-to-Motion Generation

Bin Yang,Rong Ou,Weisheng Xu,Jiaqi Xiong,Xintao Li,Taowen Wang,Luyu Zhu,Xu Jiang,Jing Tan,Renjing Xu

Main category: cs.CV

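TL;DR: This paper proposes a new benchmark for evaluating text-to-motion generation models under out-of-distribution (OOD) text conditions. It comprises 1,025 OOD text prompts and a unified multi-dimensional evaluation framework (LLM-based evaluation, multi-factor motion evaluation, and fine-grained accuracy evaluation). A systematic analysis of 14 baseline models finds that existing models generally perform poorly on fine-grained accuracy.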

Details

Motivation: Existing evaluations of text-to-motion generation are confined to in-distribution text and a limited set of metrics, which makes it hard to assess systematically how well models generalize and generate under complex OOD text conditions.

Method: A prompt dataset of 1,025 OOD text descriptions is constructed, together with a unified evaluation framework that combines LLM-based evaluation, multi-factor motion evaluation, and fine-grained accuracy evaluation; 14 representative baseline models are evaluated comprehensively on two derived datasets.

Result: The baselines each show strengths in areas such as semantic alignment, motion generalizability, and physical quality, but most perform poorly on the fine-grained accuracy evaluation.

Conclusion: Current text-to-motion methods have clear limitations in OOD scenarios; the benchmark offers practical guidance for designing and evaluating future production-level models.

Abstract: Most existing evaluations of text-to-motion generation focus on in-distribution textual inputs and a limited set of evaluation criteria, which restricts their ability to systematically assess model generalization and motion generation capabilities under complex out-of-distribution (OOD) textual conditions. To address this limitation, we propose a benchmark specifically designed for OOD text-to-motion evaluation, which includes a comprehensive analysis of 14 representative baseline models and the two datasets derived from evaluation results. Specifically, we construct an OOD prompt dataset consisting of 1,025 textual descriptions. Based on this prompt dataset, we introduce a unified evaluation framework that integrates LLM-based Evaluation, Multi-factor Motion evaluation, and Fine-grained Accuracy Evaluation. Our experimental results reveal that while different baseline models demonstrate strengths in areas such as text-to-motion semantic alignment, motion generalizability, and physical quality, most models struggle to achieve strong performance with Fine-grained Accuracy Evaluation. These findings highlight the limitations of existing methods in OOD scenarios and offer practical guidance for the design and evaluation of future production-level text-to-motion models.

Haoyi Tao,Chaozheng Huang,Nan Wang,Han Lyu,Linfeng Zhang,Guolin Ke,Xi Fang

Main category: cs.CV

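TL;DR: This paper presents the OmniScience dataset and a dynamic model-routing re-captioning pipeline to improve how well multimodal large language models understand scientific images.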

Details

Motivation: Existing open-source multimodal large language models (MLLMs) have limited ability to understand scientific images (such as schematic diagrams, experimental characterizations, and analytical charts), mainly because existing datasets have narrow domain coverage, coarse structural annotations, and weak semantic grounding.

Method: OmniScience, a large-scale, high-quality multimodal dataset of 1.5 million figure-caption-context triplets, is constructed. A dynamic model-routing re-captioning pipeline fuses visual features, original figure captions, and in-text references to generate dense, self-contained descriptions, with expert validation and quality filtering to improve accuracy and semantic completeness.

Result: Image-text multimodal similarity rises from 0.769 to 0.956; Qwen2.5-VL-3B finetuned on OmniScience gains 0.378 on MM-MT-Bench and 0.140 on MMMU.

Conclusion: OmniScience substantially improves the ability of multimodal large models to understand scientific images, and the proposed re-captioning method and evaluation protocol offer a new approach for the field.

Abstract: Multimodal Large Language Models demonstrate strong performance on natural image understanding, yet exhibit limited capability in interpreting scientific images, including but not limited to schematic diagrams, experimental characterizations, and analytical charts. This limitation is particularly pronounced in open-source MLLMs. The gap largely stems from existing datasets with limited domain coverage, coarse structural annotations, and weak semantic grounding. We introduce OmniScience, a large-scale, high-fidelity multi-modal dataset comprising 1.5 million figure-caption-context triplets, spanning more than 10 major scientific disciplines. To obtain image caption data with higher information density and accuracy for multi-modal large-model training, we develop a dynamic model-routing re-captioning pipeline that leverages state-of-the-art multi-modal large language models to generate dense, self-contained descriptions by jointly synthesizing visual features, original figure captions, and corresponding in-text references authored by human scientists. The pipeline is further reinforced with rigorous quality filtering and alignment with human expert judgments, ensuring both factual accuracy and semantic completeness, and boosts the image-text multi-modal similarity score from 0.769 to 0.956. We further propose a caption QA protocol as a proxy task for evaluating visual understanding. Under this setting, the Qwen2.5-VL-3B model finetuned on OmniScience shows substantial gains over baselines, achieving a gain of 0.378 on MM-MT-Bench and a gain of 0.140 on MMMU.

[149] SAM4Dcap: Training-free Biomechanical Twin System from Monocular Video

Li Wang,HaoYu Wang,Xi Chen,ZeKun Jiang,Kang Li,Jian Li

Main category: cs.CV

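TL;DR: This paper presents SAM4Dcap, an open-source, end-to-end framework that estimates biomechanical metrics from monocular video without additional training by combining the 4D human mesh recovery of SAM-Body4D with the OpenSim biomechanical solver. Preliminary experiments suggest its knee kinematic predictions can be comparable to those of multi-view systems.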

Details

Motivation: Quantitative biomechanical analysis is essential for clinical diagnosis and injury prevention, but conventional optical motion capture is expensive and hard to deploy outside the laboratory, for example at home. Monocular video solutions are badly needed, yet existing methods fall short.

Method: SAM4Dcap integrates SAM-Body4D for temporally consistent 4D human mesh recovery and solves the biomechanics with OpenSim; reconstructed meshes are converted automatically into trajectory files compatible with a range of musculoskeletal models; automated prompting strategies and a Linux-native build are also provided.

Result: In preliminary evaluations on walking and drop-jump tasks, SAM4Dcap's knee kinematic predictions approach those of multi-view systems, although discrepancies in hip flexion and residual jitter remain.

Conclusion: SAM4Dcap bridges advanced computer vision and established biomechanical simulation, providing a flexible, accessible foundation for motion analysis outside the laboratory.

Abstract: Quantitative biomechanical analysis is essential for clinical diagnosis and injury prevention but is often restricted to laboratories due to the high cost of optical motion capture systems. While multi-view video approaches have lowered barriers, they remain impractical for home-based scenarios requiring monocular capture. This paper presents SAM4Dcap, an open-source, end-to-end framework for estimating biomechanical metrics from monocular video without additional training. SAM4Dcap integrates the temporally consistent 4D human mesh recovery of SAM-Body4D with the OpenSim biomechanical solver. The pipeline converts reconstructed meshes into trajectory files compatible with diverse musculoskeletal models. We introduce automated prompting strategies and a Linux-native build for processing. Preliminary evaluations on walking and drop-jump tasks indicate that SAM4Dcap has the potential to achieve knee kinematic predictions comparable to multi-view systems, although some discrepancies in hip flexion and residual jitter remain. By bridging advanced computer vision with established biomechanical simulation, SAM4Dcap provides a flexible, accessible foundation for non-laboratory motion analysis.

[150] Offline-Poly: A Polyhedral Framework For Offline 3D Multi-Object Tracking

Xiaoyu Li,Yitao Wu,Xian Wu,Haolin Zhuo,Lijun Zhao,Lining Sun

Main category: cs.CV

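TL;DR: This paper proposes Offline-Poly, a general offline 3D multi-object tracking method built on the Tracking-by-Tracking (TBT) paradigm. It is decoupled from upstream detectors and trackers, exploits global optimization and future observability to improve tracking, and reaches SOTA on nuScenes and KITTI.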

Details

Motivation: Existing offline 3D MOT methods are direct extensions of online frameworks and do not fully exploit the global optimization and full-horizon observability available offline; they also depend on fixed upstream components and customized architectures, which limits adaptability.

Method: A standardized Tracking-by-Tracking (TBT) paradigm is introduced, and Offline-Poly is built on it with three stages: pre-processing, hierarchical matching and fusion, and tracklet refinement. Global scene context is used to eliminate ghost tracklets, tracklets are associated across input sources, and refinement jointly uses local and global motion patterns.

Result: Offline-Poly reaches 77.6% AMOTA on nuScenes and 83.00% HOTA on KITTI, both state of the art; experiments confirm its flexibility, generalizability, and the effectiveness of each module.

Conclusion: Through its tracking-centric design and the TBT paradigm, Offline-Poly substantially improves the performance and generality of offline 3D MOT and provides more robust pseudo-labels for downstream tasks such as 4DAL.

Abstract: Offline 3D multi-object tracking (MOT) is a critical component of the 4D auto-labeling (4DAL) process. It enhances pseudo-labels generated by high-performance detectors through the incorporation of temporal context. However, existing offline 3D MOT approaches are direct extensions of online frameworks and fail to fully exploit the advantages of the offline setting. Moreover, these methods often depend on fixed upstream and customized architectures, limiting their adaptability. To address these limitations, we propose Offline-Poly, a general offline 3D MOT method based on a tracking-centric design. We introduce a standardized paradigm termed Tracking-by-Tracking (TBT), which operates exclusively on arbitrary off-the-shelf tracking outputs and produces offline-refined tracklets. This formulation decouples the offline tracker from specific upstream detectors or trackers. Under the TBT paradigm, Offline-Poly accepts one or multiple coarse tracking results and processes them through a structured pipeline comprising pre-processing, hierarchical matching and fusion, and tracklet refinement. Each module is designed to capitalize on the two fundamental properties of offline tracking: resource unconstrainedness, which permits global optimization beyond real-time limits, and future observability, which enables tracklet reasoning over the full temporal horizon. Offline-Poly first eliminates short-term ghost tracklets and re-identifies fragmented segments using global scene context. It then constructs scene-level similarity to associate tracklets across multiple input sources. Finally, Offline-Poly refines tracklets by jointly leveraging local and global motion patterns. On nuScenes, we achieve SOTA performance with 77.6% AMOTA. On KITTI, it achieves leading results with 83.00% HOTA. Comprehensive experiments further validate the flexibility, generalizability, and modular effectiveness of Offline-Poly.
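To make the pre-processing idea concrete, the following is a minimal sketch of one way to drop short-lived "ghost" tracklets once the whole sequence is available. The data layout (a dict from track ID to the list of frame indices where it was observed) and the `min_frames` threshold are hypothetical; the abstract does not describe Offline-Poly's actual criterion, which also uses global scene context.

```python
def drop_short_tracklets(tracklets, min_frames=3):
    """Keep only tracklets observed in at least `min_frames` frames.

    tracklets: dict mapping track_id -> list of frame indices (assumed layout).
    min_frames: hypothetical threshold, not a value from the paper.
    """
    return {tid: frames for tid, frames in tracklets.items()
            if len(frames) >= min_frames}

coarse = {"car_1": [0, 1, 2, 3, 4], "ghost_7": [2], "ped_3": [1, 2, 3]}
print(sorted(drop_short_tracklets(coarse)))
```

A filter like this is only possible offline: an online tracker cannot know at frame 2 that `ghost_7` will never be seen again.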

[151] Skeleton2Stage: Reward-Guided Fine-Tuning for Physically Plausible Dance Generation

Jidong Jia,Youjian Zhang,Huan Fu,Dacheng Tao

Main category: cs.CV

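TL;DR: This paper proposes reinforcement learning fine-tuning (RLFT) with physics-based rewards to make skeleton-based dance motions physically plausible when visualized with a human body mesh. It targets self-penetration and foot-ground contact anomalies and adds an anti-freezing reward to preserve motion dynamics.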

Details

Motivation: Most existing dance generation methods are trained in the skeletal domain and ignore mesh-level physical constraints, so the visualized results show body self-penetration and foot-ground contact anomalies that hurt both aesthetics and practical use.

Method: Physics-based rewards are derived from the human body mesh, combining an imitation reward (penalizing penetration and foot skating) with a Foot-Ground Deviation (FGD) reward, plus an anti-freezing reward that prevents the motion from freezing; the diffusion model is optimized with Reinforcement Learning Fine-Tuning (RLFT).

Result: On multiple dance datasets the method significantly improves the physical plausibility of generated motions, producing more realistic and aesthetically pleasing dances.

Conclusion: Mesh-level physical constraints combined with a multi-objective reward design bridge the gap between skeletal motion generation and mesh visualization, giving a more practical and realistic solution for dance generation.

Abstract: Despite advances in dance generation, most methods are trained in the skeletal domain and ignore mesh-level physical constraints. As a result, motions that look plausible as joint trajectories often exhibit body self-penetration and Foot-Ground Contact (FGC) anomalies when visualized with a human body mesh, reducing the aesthetic appeal of generated dances and limiting their real-world applications. We address this skeleton-to-mesh gap by deriving physics-based rewards from the body mesh and applying Reinforcement Learning Fine-Tuning (RLFT) to steer the diffusion model toward physically plausible motion synthesis under mesh visualization. Our reward design combines (i) an imitation reward that measures a motion's general plausibility by its imitability in a physical simulator (penalizing penetration and foot skating), and (ii) a Foot-Ground Deviation (FGD) reward with test-time FGD guidance to better capture the dynamic foot-ground interaction in dance. However, we find that the physics-based rewards tend to push the model to generate freezing motions for fewer physical anomalies and better imitability. To mitigate this, we propose an anti-freezing reward to preserve motion dynamics while maintaining physical plausibility. Experiments on multiple dance datasets consistently demonstrate that our method can significantly improve the physical plausibility of generated motions, yielding more realistic and aesthetically pleasing dances. The project page is available at: https://jjd1123.github.io/Skeleton2Stage/

[152] Foundation Model-Driven Semantic Change Detection in Remote Sensing Imagery

Hengtong Shen,Li Yan,Hong Xie,Yaxuan Wei,Xinhao Li,Wenfei Shen,Peixian Lv,Fei Tan

Main category: cs.CV

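TL;DR: This paper proposes PerASCD, which uses the remote sensing foundation model PerA to drive semantic change detection (SCD). A Cascaded Gated Decoder (CG-Decoder) simplifies the pipeline and strengthens multi-scale semantic understanding, a Soft Semantic Consistency Loss (SSCLoss) improves training stability, and the method reaches SOTA on two public datasets.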

Details

Motivation: Existing semantic change detection methods are limited by the model's weak semantic understanding and by the inherent complexity of the task, and face challenges in both performance and paradigm complexity.

Method: PerASCD is built on the remote sensing foundation model PerA. It includes a modular Cascaded Gated Decoder (CG-Decoder) that promotes multi-level feature interaction and fusion, and a Soft Semantic Consistency Loss (SSCLoss) that mitigates numerical instability during training; the feasibility of pairing several other RS foundation models with the decoder is also examined.

Result: The CG-Decoder simplifies the SCD paradigm and adapts seamlessly across various vision encoders, reaching state-of-the-art (SOTA) performance on the two public benchmark datasets LEVIR-CD+ and WHU-CD.

Conclusion: By strengthening multi-scale semantic understanding, simplifying the decoder structure, and stabilizing training, PerASCD markedly improves SCD performance and generalization, offering an efficient and robust approach to remote sensing change detection.

Abstract: Remote sensing (RS) change detection methods can extract critical information on surface dynamics and are an essential means for humans to understand changes in the earth's surface and environment. Among these methods, semantic change detection (SCD) can more effectively interpret the multi-class information contained in bi-temporal RS imagery, providing semantic-level predictions that support dynamic change monitoring. However, due to the limited semantic understanding capability of the model and the inherent complexity of the SCD tasks, existing SCD methods face significant challenges in both performance and paradigm complexity. In this paper, we propose PerASCD, an SCD method driven by the RS foundation model PerA, designed to enhance the multi-scale semantic understanding and overall performance. We introduce a modular Cascaded Gated Decoder (CG-Decoder) that simplifies complex SCD decoding pipelines while promoting effective multi-level feature interaction and fusion. In addition, we propose a Soft Semantic Consistency Loss (SSCLoss) to mitigate the numerical instability commonly encountered during SCD training. We further explore the applicability of multiple existing RS foundation models on the SCD task when equipped with the proposed decoder. Experimental results demonstrate that our decoder not only effectively simplifies the paradigm of SCD, but also achieves seamless adaptation across various vision encoders. Our method achieves state-of-the-art (SOTA) performance on two public benchmark datasets, validating its effectiveness. The code is available at https://github.com/SathShen/PerASCD.git.

[153] Joint Orientation and Weight Optimization for Robust Watertight Surface Reconstruction via Dirichlet-Regularized Winding Fields

Jiaze Li,Daisheng Jin,Fei Hou,Junhui Hou,Zheng Liu,Shiqing Xin,Wenping Wang,Ying He

Main category: cs.CV

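TL;DR: This paper proposes Dirichlet Winding Reconstruction (DiWR), a robust method for reconstructing watertight surfaces from unoriented point clouds with non-uniform sampling, noise, and outliers. It jointly optimizes point orientations, area weights, and confidence coefficients while minimizing the Dirichlet energy of the generalized winding number field, with no preprocessing required.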

Details

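Motivation: Watertight surfaces are hard to reconstruct robustly from unoriented point clouds affected by non-uniform sampling, noise, and outliers; the goal is to avoid the error accumulation of traditional multi-stage pipelines and their reliance on preprocessing.

Method: The generalized winding number (GWN) field serves as the implicit representation. Point orientations, per-point area weights, and confidence coefficients are optimized jointly, with minimization of the Dirichlet energy of the GWN field as the core objective and additional GWN-based constraints for robustness.

Result: The method is validated on 3D Gaussian Splatting outputs, point clouds produced by a computer-vision pipeline, and artificially corrupted graphics benchmarks; reconstruction quality and robustness exceed those of traditional multi-stage methods and recent joint optimization methods.

Conclusion: DiWR provides a unified, end-to-end framework that needs no preprocessing, achieves high-quality watertight reconstruction on challenging real-world point clouds, and is markedly more robust to uneven sampling, noise, and outliers.

Abstract: We propose Dirichlet Winding Reconstruction (DiWR), a robust method for reconstructing watertight surfaces from unoriented point clouds with non-uniform sampling, noise, and outliers. Our method uses the generalized winding number (GWN) field as the target implicit representation and jointly optimizes point orientations, per-point area weights, and confidence coefficients in a single pipeline. The optimization minimizes the Dirichlet energy of the induced winding field together with additional GWN-based constraints, allowing DiWR to compensate for non-uniform sampling, reduce the impact of noise, and downweight outliers during reconstruction, with no reliance on separate preprocessing. We evaluate DiWR on point clouds from 3D Gaussian Splatting, a computer-vision pipeline, and corrupted graphics benchmarks. Experiments show that DiWR produces plausible watertight surfaces on these challenging inputs and outperforms both traditional multi-stage pipelines and recent joint orientation-reconstruction methods.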
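For readers unfamiliar with the GWN field, the sketch below evaluates the usual point-cloud form of the winding number, w(q) = sum_i a_i (p_i - q) . n_i / (4 pi |p_i - q|^3), on a sampled unit sphere. It only illustrates why orientations n_i and area weights a_i are the natural unknowns; it is not DiWR's optimization, and the helper names and the Fibonacci sampling are assumptions made for this example.

```python
import math

def fibonacci_sphere(n):
    """Roughly uniform samples on the unit sphere (used here as a toy point cloud)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        phi = i * golden_angle
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points

def winding_number(query, points, normals, areas):
    """Discrete generalized winding number of an oriented, area-weighted point cloud."""
    total = 0.0
    for p, n, a in zip(points, normals, areas):
        d = (p[0] - query[0], p[1] - query[1], p[2] - query[2])
        dist = math.sqrt(d[0] ** 2 + d[1] ** 2 + d[2] ** 2)
        total += a * (d[0] * n[0] + d[1] * n[1] + d[2] * n[2]) / (4.0 * math.pi * dist ** 3)
    return total

pts = fibonacci_sphere(500)
normals = pts                               # outward normals of a unit sphere
areas = [4.0 * math.pi / len(pts)] * len(pts)
print(winding_number((0.0, 0.0, 0.0), pts, normals, areas))  # close to 1 inside
print(winding_number((3.0, 0.0, 0.0), pts, normals, areas))  # close to 0 outside
```

Flipping a normal or mis-scaling an area weight perturbs this field away from the clean inside/outside indicator, which is the intuition behind optimizing both quantities against a smoothness (Dirichlet) energy.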

[154] Gaussian Sequences with Multi-Scale Dynamics for 4D Reconstruction from Monocular Casual Videos

Can Li,Jie Gu,Jingmin Chen,Fangzhou Qiu,Lei Sun

Main category: cs.CV

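TL;DR: This paper proposes a monocular 4D reconstruction method based on a multi-scale dynamics mechanism and multi-modal priors. It represents dynamic 3D scenes with layered Gaussian sequences and substantially improves the accuracy and consistency of dynamic novel-view synthesis.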

Details

Motivation: Four-dimensional (4D) reconstruction of dynamic scenes from monocular video is severely ill-posed, but real-world dynamics exhibit multi-scale regularity from the object level down to the particle level, which can be exploited to reduce ambiguity.

Method: A multi-scale dynamics mechanism factorizes complex motion fields; multi-scale dynamic Gaussian sequences, built from compositions of multi-level motion, are proposed as a new representation for dynamic 3D Gaussians; multi-modal priors from vision foundation models provide complementary supervision.

Result: The method achieves accurate and globally consistent 4D reconstruction from monocular casual videos and clearly outperforms existing methods on dynamic novel-view synthesis across benchmark and real-world manipulation datasets.

Conclusion: Multi-scale modeling combined with multi-modal priors effectively improves the physical plausibility and reconstruction fidelity of monocular dynamic scene reconstruction.

Abstract: Understanding dynamic scenes from casual videos is critical for scalable robot learning, yet four-dimensional (4D) reconstruction under strictly monocular settings remains highly ill-posed. To address this challenge, our key insight is that real-world dynamics exhibit a multi-scale regularity from object to particle level. To this end, we design the multi-scale dynamics mechanism that factorizes complex motion fields. Within this formulation, we propose Gaussian sequences with multi-scale dynamics, a novel representation for dynamic 3D Gaussians derived through compositions of multi-level motion. This layered structure substantially alleviates ambiguity of reconstruction and promotes physically plausible dynamics. We further incorporate multi-modal priors from vision foundation models to establish complementary supervision, constraining the solution space and improving the reconstruction fidelity. Our approach enables accurate and globally consistent 4D reconstruction from monocular casual videos. Experiments on dynamic novel-view synthesis (NVS) on benchmark and real-world manipulation datasets demonstrate considerable improvements over existing methods.

[155] VAR-3D: View-aware Auto-Regressive Model for Text-to-3D Generation via a 3D Tokenizer

Zongcheng Han,Dongyan Cao,Haoran Sun,Yu Hong

Main category: cs.CV

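TL;DR: This paper proposes VAR-3D, which uses a view-aware 3D VQ-VAE and a rendering-supervised training strategy to address information loss and objective mismatch in discrete representation learning for text-to-3D generation, substantially improving generation quality and text-3D alignment.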

Details

Motivation: Existing text-to-3D methods lose information during encoding when learning discrete 3D representations, vector quantization adds further geometric distortion, and the two-stage training paradigm leaves the reconstruction objective inconsistent with the auto-regressive generation objective.

Method: View-aware Auto-Regressive 3D (VAR-3D) includes a view-aware 3D VQ-VAE that converts 3D geometric structure into discrete tokens, and a rendering-supervised training strategy that jointly optimizes discrete token prediction and visual reconstruction.

Result: Experiments show that VAR-3D significantly outperforms existing methods in generation quality and text-3D alignment.

Conclusion: VAR-3D effectively mitigates information loss and objective mismatch in discrete 3D representation learning, improving the fidelity and structural consistency of text-conditioned 3D generation.

Abstract: Recent advances in auto-regressive transformers have achieved remarkable success in generative modeling. However, text-to-3D generation remains challenging, primarily due to bottlenecks in learning discrete 3D representations. Specifically, existing approaches often suffer from information loss during encoding, causing representational distortion before the quantization process. This effect is further amplified by vector quantization, ultimately degrading the geometric coherence of text-conditioned 3D shapes. Moreover, the conventional two-stage training paradigm induces an objective mismatch between reconstruction and text-conditioned auto-regressive generation. To address these issues, we propose View-aware Auto-Regressive 3D (VAR-3D), which integrates a view-aware 3D Vector Quantized-Variational AutoEncoder (VQ-VAE) to convert the complex geometric structure of 3D models into discrete tokens. Additionally, we introduce a rendering-supervised training strategy that couples discrete token prediction with visual reconstruction, encouraging the generative process to better preserve visual fidelity and structural consistency relative to the input text. Experiments demonstrate that VAR-3D significantly outperforms existing methods in both generation quality and text-3D alignment.
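The quantization step that the abstract identifies as a source of distortion can be illustrated with the generic VQ-VAE lookup: each continuous latent vector is replaced by its nearest codebook entry, and the entry's index becomes the discrete token. This is a minimal sketch of that generic step only; the toy codebook and function name are illustrative, and VAR-3D's view-aware tokenizer is not described in enough detail here to reproduce.

```python
def quantize(vec, codebook):
    """Return (token_index, codebook_vector) for the nearest codebook entry.

    Generic VQ-VAE nearest-neighbour lookup under squared Euclidean distance.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    idx = min(range(len(codebook)), key=lambda k: sqdist(vec, codebook[k]))
    return idx, codebook[idx]

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]   # toy codebook with 3 entries
token, snapped = quantize([0.9, 1.2], codebook)
print(token, snapped)
```

The gap between the input `[0.9, 1.2]` and the snapped vector is the quantization error; any information the encoder has already lost is compounded by this rounding, which is the effect the abstract describes.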

[156] Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings

Haonan Jiang,Yuji Wang,Yongjie Zhu,Xin Lu,Wenyu Qin,Meng Wang,Pengfei Wan,Yansong Tang

Main category: cs.CV

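TL;DR: This paper proposes a reasoning-driven Universal Multimodal Embeddings (UME) framework. Embedder-Guided Reinforcement Learning (EG-RL) optimizes the Reasoner to produce Traceability Chain-of-Thought (T-CoT), improving cross-modal semantic consistency and fine-grained matching.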

Details

Motivation: The chains of thought (CoT) produced by existing generative multimodal embedding methods are limited to textual analysis of the query and are unrelated to retrieving the target, so they do little to support high-quality cross-modal embeddings.

Method: A reasoning-driven UME framework is proposed, comprising an Embedder-Guided Reinforcement Learning (EG-RL) mechanism and Traceability CoT (T-CoT). The Reasoner's output focuses on retrieval-relevant multimodal cues, and the Embedder provides explicit supervision to align the reasoning with the embedding task.

Result: On the MMEB-V2 and UVRB benchmarks the method outperforms the pioneering embedding model with limited computational resources, improving cross-modal semantic consistency, fine-grained matching, and generalization to complex scenarios.

Conclusion: Structured reasoning optimized for the retrieval target markedly improves multimodal embedding quality and offers an efficient, practical approach to reasoning-driven UME.

Abstract: Leveraging Multimodal Large Language Models (MLLMs) has become pivotal for advancing Universal Multimodal Embeddings (UME) in addressing diverse cross-modal tasks. Recent studies demonstrate that incorporating generative Chain-of-Thought (CoT) reasoning can substantially enhance task-specific representations compared to discriminative methods. However, the generated reasoning CoTs of existing generative embedding methods are limited to the textual analysis of queries and are irrelevant to the retrieval of the targets. To address these limitations, we propose a reasoning-driven UME framework that integrates Embedder-Guided Reinforcement Learning (EG-RL) to optimize the Reasoner to produce evidential Traceability CoT (T-CoT). Our key contributions are threefold: (1) We design an EG-RL framework where the Embedder provides explicit supervision to the Reasoner, ensuring the generated CoT traces are aligned with embedding tasks. (2) We introduce T-CoT, which extracts critical multimodal cues to focus on retrieval-relevant elements and provides multimodal inputs for the Embedder. (3) With limited computational resources, our framework outperforms the pioneering embedding model on both MMEB-V2 and UVRB benchmarks. The integration of multimodal evidence in structured reasoning, paired with retrieval-oriented alignment, effectively strengthens cross-modal semantic consistency and boosts the fine-grained matching capability of the model as well as the generalization across complex scenarios. Our work demonstrates that targeted reasoning optimization can significantly improve multimodal embedding quality, providing a practical and efficient solution for reasoning-driven UME development.

[157] Prior-guided Hierarchical Instance-pixel Contrastive Learning for Ultrasound Speckle Noise Suppression

Zhenyu Bu,Yuanxin Xie,Guang-Quan Zhou

Main category: cs.CV

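TL;DR: This paper proposes a prior-guided hierarchical instance-pixel contrastive learning model that combines statistics-guided pixel-level contrastive learning with memory-bank-based instance-level contrastive learning in a hybrid Transformer-CNN architecture, improving ultrasound image denoising.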

Details

Motivation: Speckle in ultrasound images both degrades the image and encodes texture and anatomical detail, so suppressing noise while preserving structural fidelity is a key challenge.

Method: A prior-guided hierarchical instance-pixel contrastive learning model is proposed: (1) statistics-guided pixel-level contrastive learning enlarges the distributional discrepancy between noisy and clean pixels; (2) a memory bank supports instance-level contrastive learning; (3) a hybrid architecture pairs a Transformer (global modeling) with a CNN (local reconstruction).

Result: On two public ultrasound datasets the method clearly outperforms existing mainstream methods, confirming its effectiveness.

Conclusion: Through multi-granularity contrastive learning and a heterogeneous network, the model strikes a better balance between noise robustness and structural fidelity, offering a new approach to ultrasound denoising.

Abstract: Ultrasound denoising is essential for mitigating speckle-induced degradations, thereby enhancing image quality and improving diagnostic reliability. Nevertheless, because speckle patterns inherently encode both texture and fine anatomical details, effectively suppressing noise while preserving structural fidelity remains a significant challenge. In this study, we propose a prior-guided hierarchical instance-pixel contrastive learning model for ultrasound denoising, designed to promote noise-invariant and structure-aware feature representations by maximizing the separability between noisy and clean samples at both pixel and instance levels. Specifically, a statistics-guided pixel-level contrastive learning strategy is introduced to enhance distributional discrepancies between noisy and clean pixels, thereby improving local structural consistency. Concurrently, a memory bank is employed to facilitate instance-level contrastive learning in the feature space, encouraging representations that more faithfully approximate the underlying data distribution. Furthermore, a hybrid Transformer-CNN architecture is adopted, coupling a Transformer-based encoder for global context modeling with a CNN-based decoder optimized for fine-grained anatomical structure restoration, thus enabling complementary exploitation of long-range dependencies and local texture details. Extensive evaluations on two publicly available ultrasound datasets demonstrate that the proposed model consistently outperforms existing methods, confirming its effectiveness and superiority.

[158] High-Fidelity Causal Video Diffusion Models for Real-Time Ultra-Low-Bitrate Semantic Communication

Cem Eteke,Batuhan Tosun,Alexander Griessel,Wolfgang Kellerer,Eckehard Steinbach

Main category: cs.CV

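TL;DR: This paper proposes a video diffusion model for high-fidelity, causal, real-time video generation under ultra-low-bitrate semantic communication constraints. It takes semantic coding plus compressed frames as input, uses a modular diffusion architecture with an efficient temporal distillation procedure, and achieves strong perceptual quality and semantic fidelity at extremely low bitrates.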

Details

Motivation: Under ultra-low-bitrate semantic communication constraints, high-fidelity, causal, real-time video generation struggles to preserve semantic integrity and texture detail at the same time.

Method: Lossy semantic video coding transmits the scene structure, complemented by highly compressed low-resolution frames that supply texture information; a modular video diffusion model is built with Semantic Control, a Restoration Adapter, and a Temporal Adapter; efficient temporal distillation enables causal, real-time synthesis.

Result: At ultra-low bitrates (< 0.0003 bpp) the framework clearly outperforms classical, neural, and generative baselines on multiple datasets, with strong perceptual quality, semantic fidelity, and temporal consistency, while reducing trainable parameters by 300x and training time by 2x.

Conclusion: The method balances communication efficiency and generation quality, and is of practical value for video generation driven by semantic communication.

Abstract: We introduce a video diffusion model for high-fidelity, causal, and real-time video generation under ultra-low-bitrate semantic communication constraints. Our approach utilizes lossy semantic video coding to transmit the semantic scene structure, complemented by a stream of highly compressed, low-resolution frames that provide sufficient texture information to preserve fidelity. Building on these inputs, we introduce a modular video diffusion model that contains Semantic Control, Restoration Adapter, and Temporal Adapter. We further introduce an efficient temporal distillation procedure that enables extension to real-time and causal synthesis, reducing trainable parameters by 300x and training time by 2x, while adhering to communication constraints. Evaluated across diverse datasets, the framework achieves strong perceptual quality, semantic fidelity, and temporal consistency at ultra-low bitrates (< 0.0003 bpp), outperforming classical, neural, and generative baselines in extensive quantitative, qualitative, and subjective evaluations.
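To give a sense of scale for the 0.0003 bpp figure, the short calculation below converts bits per pixel into a channel rate. The 512x512 resolution and 30 fps are assumptions chosen for illustration; the abstract does not state the evaluation resolution or frame rate.

```python
def bitrate_bps(bpp, width, height, fps):
    """Channel rate in bits per second for a given bits-per-pixel budget."""
    return bpp * width * height * fps

# Assumed video format (not from the paper): 512x512 at 30 frames per second.
rate = bitrate_bps(0.0003, 512, 512, 30)
print(f"{rate / 1000:.2f} kbit/s")  # prints "2.36 kbit/s"
```

Under those assumed settings the whole video stream fits in a budget of roughly 2.4 kbit/s, which is why the decoder has to regenerate most texture rather than receive it.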

[159] Automated Prediction of Paravalvular Regurgitation before Transcatheter Aortic Valve Implantation

Michele Cannito,Riccardo Renzulli,Adson Duarte,Farzad Nikfam,Carlo Alberto Barbano,Enrico Chiesa,Francesco Bruno,Federico Giacobbe,Wojciech Wanha,Arturo Giordano,Marco Grangetto,Fabrizio D'Ascenzo

Main category: cs.CV

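TL;DR: This study uses 3D convolutional neural networks to predict paravalvular aortic regurgitation (PVR) after Transcatheter Aortic Valve Implantation (TAVI) from preoperative cardiac CT. The results suggest the approach captures subtle anatomical features and could support personalized risk assessment and procedural optimization.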

Details

Motivation: Despite continued advances in TAVI, paravalvular aortic regurgitation (PVR) remains a common post-procedural complication that seriously affects long-term prognosis, so accurate preoperative prediction is needed.

Method: Preoperative cardiac CT data from TAVI patients are collected into a dataset of isotropic 3D CT volumes, and 3D convolutional neural networks are trained to predict the occurrence of PVR.

Result: The model extracts subtle anatomical features related to PVR from preoperative CT and shows promising predictive potential.

Conclusion: 3D deep learning on preoperative CT could become a new tool for personalized risk assessment and procedural planning before TAVI.

Abstract: Severe aortic stenosis is a common and life-threatening condition in elderly patients, often treated with Transcatheter Aortic Valve Implantation (TAVI). Despite procedural advances, paravalvular aortic regurgitation (PVR) remains one of the most frequent post-TAVI complications, with a proven impact on long-term prognosis. In this work, we investigate the potential of deep learning to predict the occurrence of PVR from preoperative cardiac CT. To this end, a dataset of preoperative TAVI patients was collected, and 3D convolutional neural networks were trained on isotropic CT volumes. The results achieved suggest that volumetric deep learning can capture subtle anatomical features from pre-TAVI imaging, opening new perspectives for personalized risk assessment and procedural optimization. Source code is available at https://github.com/EIDOSLAB/tavi.

[160] Synthetic Dataset Generation and Validation for Robotic Surgery Instrument Segmentation

Giorgio Chiesa,Rossella Borra,Vittorio Lauro,Sabrina De Cillis,Daniele Amparore,Cristian Fiori,Riccardo Renzulli,Marco Grangetto

Main category: cs.CV

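TL;DR: This paper presents a workflow for generating and validating a synthetic dataset for robotic surgery instrument segmentation. Automated 3D modeling and rendering produce photorealistic videos with exact annotations, and a sensible mix of real and synthetic data is shown to improve model generalization.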

Details

Motivation: Real annotated data for robotic surgery is scarce, costly to obtain, and unable to cover every scenario; the paper examines how effective synthetic data is for instrument segmentation and where its limits lie.

Method: A fully automated 3D modeling, animation, and rendering pipeline is built with Autodesk Maya and Python, generating synthetic videos with pixel-level masks that include randomized motion, lighting variations, and synthetic blood textures; several segmentation models are trained on different ratios of real and synthetic data for validation.

Result: Mixing real and synthetic data, especially in balanced proportions, significantly improves generalization, whereas purely synthetic data or an excessive share of it introduces a measurable domain shift.

Conclusion: The framework provides a reproducible, scalable data generation tool for surgical computer vision, supporting research on data augmentation, domain adaptation, and simulation-based pretraining.

Abstract: This paper presents a comprehensive workflow for generating and validating a synthetic dataset designed for robotic surgery instrument segmentation. A 3D reconstruction of the Da Vinci robotic arms was refined and animated in Autodesk Maya through a fully automated Python-based pipeline capable of producing photorealistic, labeled video sequences. Each scene integrates randomized motion patterns, lighting variations, and synthetic blood textures to mimic intraoperative variability while preserving pixel-accurate ground truth masks. To validate the realism and effectiveness of the generated data, several segmentation models were trained under controlled ratios of real and synthetic data. Results demonstrate that a balanced composition of real and synthetic samples significantly improves model generalization compared to training on real data only, while excessive reliance on synthetic data introduces a measurable domain shift. The proposed framework provides a reproducible and scalable tool for surgical computer vision, supporting future research in data augmentation, domain adaptation, and simulation-based pretraining for robotic-assisted surgery. Data and code are available at https://github.com/EIDOSLAB/Sintetic-dataset-DaVinci.
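The "controlled ratios of real and synthetic data" experiment can be pictured with a small helper that draws a training set at a chosen synthetic fraction. This is a hypothetical sketch of such a protocol, not the authors' code: the function name, the sampling without replacement, and the fixed seed are all assumptions.

```python
import random

def mix_datasets(real, synthetic, synthetic_fraction, total, seed=0):
    """Sample `total` items with the requested share of synthetic samples.

    All parameters are illustrative; the paper's exact ratios are not listed
    in the abstract.
    """
    rng = random.Random(seed)
    n_syn = round(total * synthetic_fraction)
    n_real = total - n_syn
    if n_real > len(real) or n_syn > len(synthetic):
        raise ValueError("not enough samples for the requested ratio")
    mixed = rng.sample(real, n_real) + rng.sample(synthetic, n_syn)
    rng.shuffle(mixed)
    return mixed

real = [f"real_{i}" for i in range(10)]
synthetic = [f"syn_{i}" for i in range(10)]
balanced = mix_datasets(real, synthetic, synthetic_fraction=0.5, total=8)
print(sum(name.startswith("syn_") for name in balanced), "synthetic of", len(balanced))
```

Holding `total` fixed while sweeping `synthetic_fraction` separates the effect of the mixing ratio from the effect of simply having more training data.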

[161] Cardiac Output Prediction from Echocardiograms: Self-Supervised Learning with Limited Data

Adson Duarte,Davide Vitturini,Emanuele Milillo,Andrea Bragagnolo,Carlo Alberto Barbano,Riccardo Renzulli,Michele Cannito,Federico Giacobbe,Francesco Bruno,Ovidio de Filippo,Fabrizio D'Ascenzo,Marco Grangetto

Main category: cs.CV

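TL;DR: This paper proposes a SimCLR-based self-supervised pretraining strategy for predicting cardiac output (CO) from apical four-chamber echocardiographic videos. It improves performance with limited data and outperforms PanEcho, a model trained on over one million exams.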

Details

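Motivation: Cardiac output (CO) is a key parameter in diagnosing and managing cardiovascular disease, but accurate measurement requires right-heart catheterization, which is invasive and time-consuming. Reliable non-invasive alternatives, such as AI prediction from echocardiography, are therefore needed.

Method: Self-supervised learning (SSL) pretraining with the SimCLR framework is performed directly on the same limited echocardiographic video dataset used for the downstream task; the model is then fine-tuned to predict CO.

Result: SSL mitigates overfitting and improves representation learning, reaching an average Pearson correlation of 0.41 on the test set and outperforming PanEcho, which was trained on over one million echocardiographic exams.

Conclusion: Even when data are scarce, self-supervised pretraining on the small task-specific dataset improves CO prediction, offering a workable route for small-sample modeling in medical imaging.

Abstract: Cardiac Output (CO) is a key parameter in the diagnosis and management of cardiovascular diseases. However, its accurate measurement requires right-heart catheterization, an invasive and time-consuming procedure, motivating the development of reliable non-invasive alternatives using echocardiography. In this work, we propose a self-supervised learning (SSL) pretraining strategy based on SimCLR to improve CO prediction from apical four-chamber echocardiographic videos. The pretraining is performed using the same limited dataset available for the downstream task, demonstrating the potential of SSL even under data scarcity. Our results show that SSL mitigates overfitting and improves representation learning, achieving an average Pearson correlation of 0.41 on the test set and outperforming PanEcho, a model trained on over one million echocardiographic exams. Source code is available at https://github.com/EIDOSLAB/cardiac-output.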
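SimCLR pretraining optimizes the NT-Xent contrastive loss over pairs of augmented views. The sketch below is a dependency-free version of that loss on toy 2D embeddings, assuming that `z[2k]` and `z[2k+1]` are the two views of sample `k`; the temperature value and this pairing convention are illustrative choices, not settings reported in the abstract.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nt_xent(z, temperature=0.5):
    """NT-Xent loss; z[2k] and z[2k+1] are assumed to be two views of sample k."""
    n = len(z)
    total = 0.0
    for i in range(n):
        j = i ^ 1  # index of the positive partner of view i
        positive = cosine(z[i], z[j]) / temperature
        # Denominator runs over every other view in the batch, positive included.
        logits = [cosine(z[i], z[k]) / temperature for k in range(n) if k != i]
        total += math.log(sum(math.exp(l) for l in logits)) - positive
    return total / n

aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
shuffled = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(nt_xent(aligned) < nt_xent(shuffled))
```

The loss is lower when the two views of each clip agree and differ from other clips, which is the signal the encoder learns from before any CO labels are used.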

[162] Low-Pass Filtering Improves Behavioral Alignment of Vision Models

Max Wolff,Thomas Klein,Evgenia Rusak,Felix Wichmann,Wieland Brendel

Main category: cs.CV

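TL;DR: This paper finds that the advantage of generative models in alignment with human visual behavior stems mainly from a built-in low-pass filtering effect (such as an image resizing step), not from generative modeling itself. Applying a simple Gaussian blur to the input images of discriminative models such as CLIP at test time markedly improves their agreement with human behavior and sets a new SOTA, and the effect closely matches the spectral properties of the human contrast sensitivity function.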

Details

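Motivation: Although deep neural networks perform very well on computer vision benchmarks, they still fall clearly short of modeling human visual behavior, for example in error consistency and shape bias. Recent work hypothesized that generative classifiers match human vision better; this paper tests that hypothesis and looks for the underlying cause.

Method: Controlled experiments analyze the frequency-domain effect of the resizing operation in generative models; different low-pass filters (Gaussian blur in particular) are applied to discriminative models such as CLIP at test time and their effect on model-human alignment (error consistency) is measured; filters are then optimized directly for alignment and the Pareto-optimal frontier of the benchmark is computed; finally, the frequency response of the optimal filters is compared with the human contrast sensitivity function (CSF).

Result: (1) The behavioral-alignment advantage of generative models comes mainly from their implicit low-pass filtering (e.g. resizing). (2) Gaussian blur applied to the inputs of discriminative models such as CLIP at test time alone halves the alignment gap between models and humans, a new SOTA. (3) Low-pass filters, Gaussian kernels in particular, are close to optimal, and their spectrum closely matches the human CSF. (4) The frontier of all Pareto-optimal solutions on the benchmark is characterized for the first time.

Conclusion: The key factor in behavioral alignment with human vision is the suppression of high spatial frequencies, not the generative modeling paradigm itself, which is consistent with the band-pass filtering of the human visual system as described by the CSF. An effective way to improve DNN behavioral alignment is therefore to introduce biologically plausible frequency-domain priors explicitly, not to switch to generative architectures.

Abstract: Despite their impressive performance on computer vision benchmarks, Deep Neural Networks (DNNs) still fall short of adequately modeling human visual behavior, as measured by error consistency and shape bias. Recent work hypothesized that behavioral alignment can be drastically improved through generative, rather than discriminative, classifiers, with far-reaching implications for models of human vision. Here, we instead show that the increased alignment of generative models can be largely explained by a seemingly innocuous resizing operation in the generative model which effectively acts as a low-pass filter. In a series of controlled experiments, we show that removing high-frequency spatial information from discriminative models like CLIP drastically increases their behavioral alignment. Simply blurring images at test time, rather than training on blurred images, achieves a new state-of-the-art score on the model-vs-human benchmark, halving the current alignment gap between DNNs and human observers. Furthermore, low-pass filters are likely optimal, which we demonstrate by directly optimizing filters for alignment. To contextualize the performance of optimal filters, we compute the frontier of all possible Pareto-optimal solutions to the benchmark, which was formerly unknown. We explain our findings by observing that the frequency spectrum of optimal Gaussian filters roughly matches the spectrum of band-pass filters implemented by the human visual system. We show that the contrast sensitivity function, describing the inverse of the contrast threshold required for humans to detect a sinusoidal grating as a function of spatiotemporal frequency, is approximated well by Gaussian filters of the specific width that also maximizes error consistency.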
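The test-time intervention described here amounts to blurring each image before it reaches the classifier. Below is a minimal, dependency-free sketch of a separable Gaussian low-pass filter on a grayscale image stored as a list of rows. The `sigma` value, the border clamping, and the 3-sigma kernel radius are assumptions for illustration; the paper's optimal filter width is not given in the abstract.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_1d(signal, sigma):
    """Convolve a 1D signal with the Gaussian kernel, clamping indices at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out

def blur_image(image, sigma):
    """Separable Gaussian blur: filter every row, then every column."""
    rows = [blur_1d(row, sigma) for row in image]
    cols = [blur_1d(list(col), sigma) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

# A checkerboard is almost pure high-frequency content, so blurring flattens it.
checker = [[float((r + c) % 2) for c in range(8)] for r in range(8)]
blurred = blur_image(checker, sigma=1.5)
flat = [v for row in blurred for v in row]
print(max(flat) - min(flat) < 1.0)
```

In practice the blurred image would then be passed unchanged through a pretrained model such as CLIP; no retraining is involved, which is the point the abstract stresses.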

[163] Human-Aligned Evaluation of a Pixel-wise DNN Color Constancy Model

Hamed Heidari-Gorji,Raquel Gil Rodriguez,Karl R. Gegenfurtner

Main category: cs.CV

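TL;DR: This paper combines color constancy research in virtual reality with a deep neural network model. After transfer learning, a ResNet-based U-Net performs the same achromatic object selection task as human observers, and model and human behavior correspond closely across color-cue conditions.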

Details

Motivation: To examine whether a deep neural network can reproduce human color constancy with respect to established mechanisms (local surround, maximum flux, spatial mean), and to evaluate the model with a behavioral task rather than against physical ground truth. Method: A pre-trained ResNet-based U-Net predicts surface reflectance; transfer learning fine-tunes only the decoder on images from the baseline VR condition; the model's output is then used to perform the same achromatic object selection task as in the human experiments. Result: Both the model and human observers show high color constancy in the baseline condition, and their performance declines in a similar, condition-dependent way when the local surround or spatial mean cue is removed. Conclusion: The DNN reproduces human color constancy behavior and can serve as a computational tool for studying the mechanisms of human color constancy. Abstract: We previously investigated color constancy in photorealistic virtual reality (VR) and developed a Deep Neural Network (DNN) that predicts reflectance from rendered images. Here, we combine both approaches to compare model and human performance with respect to established color constancy mechanisms: local surround, maximum flux and spatial mean. Rather than evaluating the model against physical ground truth, model performance was assessed using the same achromatic object selection task employed in the human experiments. The model, a ResNet-based U-Net from our previous work, was pre-trained on rendered images to predict surface reflectance. We then applied transfer learning, fine-tuning only the network's decoder on images from the baseline VR condition. To parallel the human experiment, the model's output was used to perform the same achromatic object selection task across all conditions. Results show a strong correspondence between the model and human behavior. Both achieved high constancy under baseline conditions and showed similar, condition-dependent performance declines when the local surround or spatial mean color cues were removed.

[164] Parameter-Efficient Fine-Tuning of DINOv2 for Large-Scale Font Classification

Daniel Chen,Zaria Zinn,Marcus Lowe

Main category: cs.CV

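TL;DR: This paper presents a font classification system built by fine-tuning a DINOv2 ViT with LoRA. It reaches approximately 86% top-1 accuracy over 394 font families while training fewer than 1% of the parameters, uses augmented synthetic data to improve generalization, and releases all resources as open source.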

Details

Motivation: Existing font recognition methods generalize poorly to real-world scenes and are costly to train; an efficient, robust, and deployable lightweight font classifier is needed. Method: A DINOv2 Vision Transformer serves as the backbone and is fine-tuned parameter-efficiently with Low-Rank Adaptation (LoRA); a synthetic data pipeline renders Google Fonts text images with diverse augmentations (colors, alignment, line wrapping, Gaussian noise); a built-in preprocessing module keeps training and inference consistent. Result: Approximately 86% top-1 accuracy on the 394-class task while training fewer than 1% of the model's 87.2M parameters; the model is deployed as a HuggingFace Inference Endpoint; the synthetic training images generalize to real-world typographic samples. Conclusion: The work shows that parameter-efficient fine-tuning of a ViT combined with high-quality synthetic data is effective for fine-grained font recognition, and it provides a reproducible, extensible, open-source solution. Abstract: We present a font classification system capable of identifying 394 font families from rendered text images. Our approach fine-tunes a DINOv2 Vision Transformer using Low-Rank Adaptation (LoRA), achieving approximately 86% top-1 accuracy while training fewer than 1% of the model's 87.2M parameters. We introduce a synthetic dataset generation pipeline that renders Google Fonts at scale with diverse augmentations including randomized colors, alignment, line wrapping, and Gaussian noise, producing training images that generalize to real-world typographic samples. The model incorporates built-in preprocessing to ensure consistency between training and inference, and is deployed as a HuggingFace Inference Endpoint. We release the model, dataset, and full training pipeline as open-source resources.
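
To see why LoRA can stay under 1% of 87.2M parameters, the arithmetic below counts the trainable parameters that low-rank adapters add. This is a rough sketch: the abstract does not state the rank, the adapted layers, or the hidden size, so the configuration here (12 blocks, hidden size 768, adapters on two projections per block, rank 8) is purely a hypothetical example.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters LoRA adds to one d_out x d_in linear layer.

    The update is B @ A with A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    return rank * (d_in + d_out)


# Hypothetical configuration, NOT taken from the paper.
num_blocks = 12
adapted_layers_per_block = 2   # e.g. query and value projections
hidden = 768
rank = 8

total_lora = num_blocks * adapted_layers_per_block * lora_param_count(hidden, hidden, rank)
backbone_params = 87_200_000   # total model size quoted in the abstract
fraction = total_lora / backbone_params

print(total_lora)              # 294912
print(f"{fraction:.4%}")       # roughly 0.34% under these assumptions
```

Any classification head would add further trainable parameters on top of this count.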

[165] RPGD: RANSAC-P3P Gradient Descent for Extrinsic Calibration in 3D Human Pose Estimation

Zhanyu Tuo

Main category: cs.CV

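TL;DR: RPGD is a human-pose-driven extrinsic calibration framework that combines RANSAC-P3P with gradient descent to robustly calibrate monocular or multi-view RGB cameras using only natural human motion.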

Details

Motivation: Collecting large-scale 3D human pose estimation datasets requires a practical, automatic, and robust way to calibrate camera extrinsics using only natural human motion. Method: RPGD (RANSAC-P3P Gradient Descent) formulates extrinsic calibration as a coarse-to-fine problem tailored to human poses: RANSAC-P3P provides a robust initial estimate, and gradient descent then refines it. Result: On three public 3D HPE datasets and a self-collected in-the-wild dataset, RPGD consistently recovers extrinsics with accuracy comparable to the provided ground truth, reaching sub-pixel MPJPE reprojection error and remaining robust under noise. Conclusion: RPGD offers a practical, automatic, and reliable extrinsic calibration solution for large-scale 3D human pose data collection. Abstract: In this paper, we propose RPGD (RANSAC-P3P Gradient Descent), a human-pose-driven extrinsic calibration framework that robustly aligns MoCap-based 3D skeletal data with monocular or multi-view RGB cameras using only natural human motion. RPGD formulates extrinsic calibration as a coarse-to-fine problem tailored to human poses, combining the global robustness of RANSAC-P3P with Gradient-Descent-based refinement. We evaluate RPGD on three large-scale public 3D HPE datasets as well as on a self-collected in-the-wild dataset. Experimental results demonstrate that RPGD consistently recovers extrinsic parameters with accuracy comparable to the provided ground truth, achieving sub-pixel MPJPE reprojection error even in challenging, noisy settings. These results indicate that RPGD provides a practical and automatic solution for reliable extrinsic calibration of large-scale 3D HPE dataset collection.
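
The reprojection error mentioned in the abstract above can be illustrated with a plain pinhole model. The sketch below is an assumption about the kind of residual a RANSAC inlier test or a gradient-descent refinement might use; it is not RPGD's implementation, and all names and intrinsics are hypothetical.

```python
import math


def project(point_world, rotation, translation, fx, fy, cx, cy):
    """Pinhole projection of one 3D world point with extrinsics (R, t)."""
    x, y, z = (
        sum(rotation[r][c] * point_world[c] for c in range(3)) + translation[r]
        for r in range(3)
    )
    return (fx * x / z + cx, fy * y / z + cy)


def mean_reprojection_error(joints_3d, joints_2d, rotation, translation,
                            fx, fy, cx, cy):
    """Mean pixel distance between projected 3D joints and detected 2D joints."""
    errors = []
    for p3, p2 in zip(joints_3d, joints_2d):
        u, v = project(p3, rotation, translation, fx, fy, cx, cy)
        errors.append(math.hypot(u - p2[0], v - p2[1]))
    return sum(errors) / len(errors)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
joints_3d = [(0.0, 0.0, 2.0), (0.5, -0.2, 3.0)]
# Hypothetical intrinsics: focal length 1000 px, principal point (640, 360).
joints_2d = [project(p, identity, [0.0, 0.0, 0.0], 1000.0, 1000.0, 640.0, 360.0)
             for p in joints_3d]
print(mean_reprojection_error(joints_3d, joints_2d, identity, [0.0, 0.0, 0.0],
                              1000.0, 1000.0, 640.0, 360.0))
```

With the correct extrinsics the error is zero; perturbing the translation makes it grow, which is the signal a refinement step would minimize.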

[166] MamaDino: A Hybrid Vision Model for Breast Cancer 3-Year Risk Prediction

Ruggiero Santeramo,Igor Zubarev,Florian Jug

Main category: cs.CV

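TL;DR: This paper proposes MamaDino, which combines convolutional and transformer components and explicitly models bilateral breast asymmetry. On 512x512 low-resolution mammograms it matches the 3-year breast cancer risk prediction performance of the high-resolution Mirai model.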

Details

Motivation: Existing deep learning risk models such as Mirai rely on high-resolution images and simple multi-view fusion and do little to model contralateral asymmetry explicitly; this work asks whether structured modeling can preserve performance at much lower resolution. Method: MamaDino fuses frozen self-supervised DINOv3 ViT-S features with a trainable CNN encoder (512x512 input) and aggregates bilateral information through a BilateralMixer module; it is trained on the OPTIMAM dataset (53,883 women) and evaluated on internal and external test sets. Result: For breast-level 3-year risk prediction, MamaDino reaches AUC 0.736 (internal) and 0.677 (external), matching Mirai while using about 13x fewer input pixels; the BilateralMixer improves discrimination, and performance is consistent across age, ethnicity, scanner, tumour type, and grade. Conclusion: Explicit modeling of bilateral asymmetry combined with complementary inductive biases maintains state-of-the-art accuracy at much lower image resolution, which opens a route to personalized screening in resource-constrained settings. Abstract: Breast cancer screening programmes increasingly seek to move from one-size-fits-all intervals to risk-adapted and personalized strategies. Deep learning (DL) has enabled image-based risk models with stronger 1- to 5-year prediction than traditional clinical models, but leading systems (e.g., Mirai) typically use convolutional backbones, very high-resolution inputs (>1M pixels) and simple multi-view fusion, with limited explicit modelling of contralateral asymmetry. We hypothesised that combining complementary inductive biases (convolutional and transformer-based) with explicit contralateral asymmetry modelling would allow us to match state-of-the-art 3-year risk prediction performance even when operating on substantially lower-resolution mammograms, indicating that using less detailed images in a more structured way can recover state-of-the-art accuracy. We present MamaDino, a mammography-aware multi-view attentional DINO model. MamaDino fuses frozen self-supervised DINOv3 ViT-S features with a trainable CNN encoder at 512x512 resolution, and aggregates bilateral breast information via a BilateralMixer to output a 3-year breast cancer risk score. We train on 53,883 women from OPTIMAM (UK) and evaluate on matched 3-year case-control cohorts: an in-distribution test set from four screening sites and an external out-of-distribution cohort from an unseen site. At breast-level, MamaDino matches Mirai on both internal and external tests while using ~13x fewer input pixels.
Adding the BilateralMixer improves discrimination to AUC 0.736 (vs 0.713) in-distribution and 0.677 (vs 0.666) out-of-distribution, with consistent performance across age, ethnicity, scanner, tumour type and grade. These findings demonstrate that explicit contralateral modelling and complementary inductive biases enable predictions that match Mirai, despite operating on substantially lower-resolution mammograms.

[167] Fusing Pixels and Genes: Spatially-Aware Learning in Computational Pathology

Minghao Han,Dingkang Yang,Linhao Qu,Zizhi Chen,Gang Li,Han Wang,Jiacong Wang,Lihua Zhang

Main category: cs.CV

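TL;DR: This paper proposes STAMP, a framework that integrates spatial transcriptomics data to strengthen multimodal representation learning over pathology images and molecular information, improving model performance and generalization in computational pathology.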

Details

Motivation: Existing multimodal models in computational pathology rely mainly on vision and language, but language lacks molecular specificity and offers limited pathological supervision, which creates a representational bottleneck. Method: STAMP integrates spatially resolved gene expression data and uses self-supervised, gene-guided training; the authors build the large-scale spatial transcriptomics dataset SpaVis-6M and train a spatially-aware gene encoder on it; hierarchical multi-scale contrastive alignment and cross-scale patch localization align images with transcriptomics spatially. Result: STAMP consistently achieves strong performance across six datasets and four downstream tasks, supporting the value of spatial molecular supervision for multimodal learning. Conclusion: Integrating spatially resolved molecular supervision is a key path for advancing multimodal learning in computational pathology. Abstract: Recent years have witnessed remarkable progress in multimodal learning within computational pathology. Existing models primarily rely on vision and language modalities; however, language alone lacks molecular specificity and offers limited pathological supervision, leading to representational bottlenecks. In this paper, we propose STAMP, a Spatial Transcriptomics-Augmented Multimodal Pathology representation learning framework that integrates spatially-resolved gene expression profiles to enable molecule-guided joint embedding of pathology images and transcriptomic data. Our study shows that self-supervised, gene-guided training provides a robust and task-agnostic signal for learning pathology image representations. Incorporating spatial context and multi-scale information further enhances model performance and generalizability. To support this, we constructed SpaVis-6M, the largest Visium-based spatial transcriptomics dataset to date, and trained a spatially-aware gene encoder on this resource. Leveraging hierarchical multi-scale contrastive alignment and cross-scale patch localization mechanisms, STAMP effectively aligns spatial transcriptomics with pathology images, capturing spatial structure and molecular variation. We validate STAMP across six datasets and four downstream tasks, where it consistently achieves strong performance. These results highlight the value and necessity of integrating spatially resolved molecular supervision for advancing multimodal learning in computational pathology. The code is included in the supplementary materials. The pretrained weights and SpaVis-6M are available at: https://github.com/Hanminghao/STAMP.
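
The contrastive alignment between image patches and gene expression profiles can be pictured with a generic InfoNCE loss. The sketch below is a single-scale, image-to-gene version under the assumption that embedding `i` in each list describes the same spot; it is not STAMP's hierarchical multi-scale objective, and the temperature value is an arbitrary choice.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(image_embs, gene_embs, temperature=0.07):
    """Mean image-to-gene InfoNCE loss; pair i is the positive for image i."""
    losses = []
    for i, img in enumerate(image_embs):
        logits = [cosine(img, g) / temperature for g in gene_embs]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


images = [[1.0, 0.0], [0.0, 1.0]]
genes_aligned = [[1.0, 0.0], [0.0, 1.0]]
genes_shuffled = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(images, genes_aligned))   # close to zero
print(info_nce(images, genes_shuffled))  # large
```

The loss is low when each image embedding sits closest to its own gene embedding and high when the pairing is wrong, which is the pressure that pulls the two modalities into a shared space.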

[168] MarsRetrieval: Benchmarking Vision-Language Models for Planetary-Scale Geospatial Retrieval on Mars

Shuoyuan Wang,Yiran Wang,Hongxin Wei

Main category: cs.CV

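TL;DR: This paper introduces MarsRetrieval, a retrieval benchmark for evaluating vision-language models on Martian geospatial discovery. It covers three tasks (paired image-text retrieval, landform retrieval, and global geo-localization) and highlights the importance of domain-specific fine-tuning.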

Details

Motivation: Most existing Mars benchmarks are confined to closed-set supervised visual tasks and do not support text-guided retrieval for geospatial discovery, so a vision-language retrieval benchmark for planetary science is needed. Method: The MarsRetrieval benchmark comprises three tasks (paired image-text retrieval, landform retrieval, and global geo-localization) covering multiple spatial scales and diverse geomorphic origins; a unified retrieval-centric protocol evaluates multimodal embedding architectures, including contrastive dual-tower encoders and generative vision-language models. Result: The benchmark is challenging: even strong foundation models often fail to capture Mars-specific geomorphic distinctions, and domain-specific fine-tuning is critical for generalizable geospatial discovery in planetary settings. Conclusion: MarsRetrieval provides a systematic retrieval benchmark for vision-language understanding on Mars and confirms the importance of domain adaptation for planetary science models. Abstract: Data-driven approaches like deep learning are rapidly advancing planetary science, particularly in Mars exploration. Despite recent progress, most existing benchmarks remain confined to closed-set supervised visual tasks and do not support text-guided retrieval for geospatial discovery. We introduce MarsRetrieval, a retrieval benchmark for evaluating vision-language models for Martian geospatial discovery. MarsRetrieval includes three tasks: (1) paired image-text retrieval, (2) landform retrieval, and (3) global geo-localization, covering multiple spatial scales and diverse geomorphic origins. We propose a unified retrieval-centric protocol to benchmark multimodal embedding architectures, including contrastive dual-tower encoders and generative vision-language models. Our evaluation shows MarsRetrieval is challenging: even strong foundation models often fail to capture domain-specific geomorphic distinctions. We further show that domain-specific fine-tuning is critical for generalizable geospatial discovery in planetary settings. Our code is available at https://github.com/ml-stat-Sustech/MarsRetrieval
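
A retrieval-centric protocol typically reports Recall@k over a query-gallery similarity matrix. The abstract does not name the exact metrics MarsRetrieval uses, so the following is only a generic sketch with a made-up similarity matrix in which gallery item `i` is the correct match for query `i`.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct item appears in the top-k results.

    similarity[i][j] is the score of query i against gallery item j;
    gallery item i is assumed to be the ground-truth match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores, e.g. cosine similarities between text queries and image embeddings.
similarity = [
    [0.9, 0.1, 0.0],
    [0.8, 0.5, 0.1],
    [0.2, 0.3, 0.25],
]
print(recall_at_k(similarity, 1))
print(recall_at_k(similarity, 2))
```

Because the metric only needs a similarity matrix, the same evaluation code can serve dual-tower encoders and embeddings extracted from generative models alike.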

[169] Elastic Diffusion Transformer

Jiangshan Wang,Zeqiang Lai,Jiarui Chen,Jiayi Guo,Hang Guo,Xiu Li,Xiangyu Yue,Chunchao Guo

Main category: cs.CV

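TL;DR: This paper proposes the Elastic Diffusion Transformer (E-DiT), an adaptive acceleration framework for Diffusion Transformers. It dynamically skips redundant blocks and adjusts MLP width, adds a training-free block-level feature cache, and reaches up to about 2x inference speedup while maintaining generation quality.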

Details

Motivation: Existing DiT acceleration methods such as pruning and distillation rely on a fixed computational capacity, which leads to insufficient acceleration and degraded quality; the authors observe that DiT generation exhibits substantial, sample-dependent sparsity that an adaptive strategy can exploit. Method: E-DiT equips each DiT block with a lightweight router that decides whether to skip the block and, if not, predicts the optimal MLP width reduction ratio inside it; a training-free block-level feature cache driven by the router predictions removes further redundant computation at inference. Result: Experiments on 2D image models (Qwen-Image, FLUX) and 3D asset generation (Hunyuan3D-3.0) show up to about 2x speedup with negligible loss in generation quality. Conclusion: Sample-adaptive sparse computation and caching let E-DiT balance efficiency and generation quality, offering a new direction for accelerating large-model inference. Abstract: Diffusion Transformers (DiT) have demonstrated remarkable generative capabilities but remain highly computationally expensive. Previous acceleration methods, such as pruning and distillation, typically rely on a fixed computational capacity, leading to insufficient acceleration and degraded generation quality. To address this limitation, we propose Elastic Diffusion Transformer (E-DiT), an adaptive acceleration framework for DiT that effectively improves efficiency while maintaining generation quality. Specifically, we observe that the generative process of DiT exhibits substantial sparsity (i.e., some computations can be skipped with minimal impact on quality), and this sparsity varies significantly across samples. Motivated by this observation, E-DiT equips each DiT block with a lightweight router that dynamically identifies sample-dependent sparsity from the input latent. Each router adaptively determines whether the corresponding block can be skipped. If the block is not skipped, the router then predicts the optimal MLP width reduction ratio within the block. During inference, we further introduce a block-level feature caching mechanism that leverages router predictions to eliminate redundant computations in a training-free manner. Extensive experiments across 2D image (Qwen-Image and FLUX) and 3D asset (Hunyuan3D-3.0) demonstrate the effectiveness of E-DiT, achieving up to ~2x speedup with negligible loss in generation quality. Code will be available at https://github.com/wangjiangshan0725/Elastic-DiT.
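
The two router decisions described in the abstract above (skip the block, or run it with a narrower MLP) can be sketched on a toy residual block. Everything here is a simplification under stated assumptions: the router outputs `skip_score` and `width_ratio` are passed in as plain numbers, the threshold of 0.5 is arbitrary, attention is omitted, and width reduction is modeled by keeping only the first hidden units.

```python
def elastic_block(x, skip_score, width_ratio, mlp_weights, skip_threshold=0.5):
    """Toy residual MLP block with a skip decision and a reduced hidden width.

    mlp_weights is a list of (w_in, w_out) pairs, one pair per hidden unit.
    """
    if skip_score > skip_threshold:
        return list(x)  # block skipped: identity on the residual stream
    hidden_units = max(1, round(len(mlp_weights) * width_ratio))
    out = list(x)
    for w_in, w_out in mlp_weights[:hidden_units]:  # narrower MLP
        h = max(0.0, sum(a * b for a, b in zip(w_in, x)))  # ReLU activation
        for d in range(len(out)):
            out[d] += h * w_out[d]
    return out


x = [1.0, 2.0]
mlp_weights = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
print(elastic_block(x, skip_score=0.9, width_ratio=1.0, mlp_weights=mlp_weights))
print(elastic_block(x, skip_score=0.1, width_ratio=1.0, mlp_weights=mlp_weights))
print(elastic_block(x, skip_score=0.1, width_ratio=0.5, mlp_weights=mlp_weights))
```

Skipping costs nothing beyond the router itself, while a reduced width ratio trades part of the block's computation for a cheaper approximation; both decisions are made per sample.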

[170] Inject Where It Matters: Training-Free Spatially-Adaptive Identity Preservation for Text-to-Image Personalization

Guandong Li,Mengxia Ye

Main category: cs.CV

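TL;DR: This paper proposes SpatialID, a training-free, spatially-adaptive identity modulation framework for personalized text-to-image generation. A spatial mask extractor and a temporal-spatial scheduling strategy keep identity features from contaminating non-facial regions, improving text adherence, visual consistency, and image quality.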

Details

Motivation: Existing tuning-free methods inject visual identity features uniformly in space, so identity features contaminate non-facial regions such as backgrounds and lighting and degrade text adherence. Method: SpatialID uses a Spatial Mask Extractor derived from cross-attention responses to decouple face-relevant regions from context-free regions, and a Temporal-Spatial Scheduling strategy that adjusts spatial constraints dynamically (from Gaussian priors to attention-based masks and adaptive relaxation) to match the diffusion generation dynamics. Result: On IBench, SpatialID achieves state-of-the-art text adherence (CLIP-T: 0.281), visual consistency (CLIP-I: 0.827), and image quality (IQ: 0.523), significantly reducing background contamination while keeping robust identity preservation. Conclusion: SpatialID is an efficient, training-free method for personalized text-to-image generation whose spatially-adaptive identity modulation improves text adherence and generation quality without sacrificing identity fidelity. Abstract: Personalized text-to-image generation aims to integrate specific identities into arbitrary contexts. However, existing tuning-free methods typically employ Spatially Uniform Visual Injection, causing identity features to contaminate non-facial regions (e.g., backgrounds and lighting) and degrading text adherence. To address this without expensive fine-tuning, we propose SpatialID, a training-free spatially-adaptive identity modulation framework. SpatialID fundamentally decouples identity injection into face-relevant and context-free regions using a Spatial Mask Extractor derived from cross-attention responses. Furthermore, we introduce a Temporal-Spatial Scheduling strategy that dynamically adjusts spatial constraints (transitioning from Gaussian priors to attention-based masks and adaptive relaxation) to align with the diffusion generation dynamics. Extensive experiments on IBench demonstrate that SpatialID achieves state-of-the-art performance in text adherence (CLIP-T: 0.281), visual consistency (CLIP-I: 0.827), and image quality (IQ: 0.523), significantly eliminating background contamination while maintaining robust identity preservation.

[171] A Deployment-Friendly Foundational Framework for Efficient Computational Pathology

Yu Cai,Cheng Jin,Jiabo Ma,Fengtao Zhou,Yingxue Xu,Zhengrui Guo,Yihui Wang,Zhengyu Zhang,Ling Liang,Yonghao Tan,Pingcheng Dong,Du Cai,On Ki Tang,Chenglong Zhao,Xi Wang,Can Yang,Yali Xu,Jing Cui,Zhenhui Li,Ronald Cheong Kin Chan,Yueping Liu,Feng Gao,Xiuming Zhang,Li Liang,Hao Chen,Kwang-Ting Cheng

Main category: cs.CV

TL;DR: LitePath is a lightweight, deployment-friendly pathology foundation model that significantly reduces computational cost and energy consumption while maintaining high accuracy across diverse pathology tasks.

Details

Motivation: The high computational cost of existing pathology foundation models (PFMs) limits their clinical accessibility and scalability, especially for gigapixel whole slide images. Method: LitePath integrates LiteFM, a compact model distilled from three large PFMs, and the Adaptive Patch Selector (APS), a lightweight component for task-specific patch selection. Result: LitePath reduces parameters by 28x and FLOPs by 403.5x relative to Virchow2; runs 104.5x faster on the Jetson Orin Nano Super; retains 99.71% of Virchow2's AUC on average; ranks second among 19 models; and attains the highest Deployability Score (D-Score), surpassing Virchow2 by 10.64%. Conclusion: LitePath enables rapid, cost-effective, and energy-efficient pathology analysis on low-power edge hardware while keeping accuracy comparable to state-of-the-art PFMs and reducing the carbon footprint of AI deployment. Abstract: Pathology foundation models (PFMs) have enabled robust generalization in computational pathology through large-scale datasets and expansive architectures, but their substantial computational cost, particularly for gigapixel whole slide images, limits clinical accessibility and scalability. Here, we present LitePath, a deployment-friendly foundational framework designed to mitigate model over-parameterization and patch-level redundancy. LitePath integrates LiteFM, a compact model distilled from three large PFMs (Virchow2, H-Optimus-1 and UNI2) using 190 million patches, and the Adaptive Patch Selector (APS), a lightweight component for task-specific patch selection. The framework reduces model parameters by 28x and lowers FLOPs by 403.5x relative to Virchow2, enabling deployment on low-power edge hardware such as the NVIDIA Jetson Orin Nano Super. On this device, LitePath processes 208 slides per hour, 104.5x faster than Virchow2, and consumes 0.36 kWh per 3,000 slides, 171x lower than Virchow2 on an RTX3090 GPU.
We validated accuracy using 37 cohorts across four organs and 26 tasks (26 internal, 9 external, and 2 prospective), comprising 15,672 slides from 9,808 patients disjoint from the pretraining data. LitePath ranks second among 19 evaluated models and outperforms larger models including H-Optimus-1, mSTAR, UNI2 and GPFM, while retaining 99.71% of the AUC of Virchow2 on average. To quantify the balance between accuracy and efficiency, we propose the Deployability Score (D-Score), defined as the weighted geometric mean of normalized AUC and normalized FLOP, where LitePath achieves the highest value, surpassing Virchow2 by 10.64%. These results demonstrate that LitePath enables rapid, cost-effective and energy-efficient pathology image analysis on accessible hardware while maintaining accuracy comparable to state-of-the-art PFMs and reducing the carbon footprint of AI deployment.
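
The D-Score is described above as a weighted geometric mean of normalized AUC and normalized FLOPs. The abstract gives neither the weights nor the normalization, so the sketch below assumes both inputs are already scaled to (0, 1] with higher meaning better, and uses an equal weighting purely for illustration.

```python
def d_score(norm_auc, norm_efficiency, auc_weight=0.5):
    """Weighted geometric mean of two scores in (0, 1].

    auc_weight and the normalization of both inputs are assumptions,
    not values taken from the paper.
    """
    if not (0.0 < norm_auc <= 1.0 and 0.0 < norm_efficiency <= 1.0):
        raise ValueError("inputs must lie in (0, 1]")
    return (norm_auc ** auc_weight) * (norm_efficiency ** (1.0 - auc_weight))


# Hypothetical models: one accurate but heavy, one slightly less accurate but light.
print(d_score(1.0, 0.2))
print(d_score(0.9, 0.4))
```

A geometric mean penalizes a model that is very weak on either axis more than an arithmetic mean would, which suits a score meant to reward balanced accuracy and efficiency.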

[172] Flow4R: Unifying 4D Reconstruction and Tracking with Scene Flow

Shenhan Qian,Ganlin Zhang,Shangzhe Wu,Daniel Cremers

Main category: cs.CV

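TL;DR: Flow4R proposes a unified, scene-flow-centric framework in which a Vision Transformer predicts 3D point position, scene flow, pose weight, and confidence directly from two-view inputs, without explicit pose estimation or bundle adjustment, and reaches state-of-the-art results on 4D reconstruction and tracking.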

Details

Motivation: Existing methods decouple geometry from motion: multi-view reconstruction assumes static scenes, while dynamic tracking relies on explicit camera pose estimation or separate motion models, so a unified representation is missing. Method: Flow4R takes camera-space scene flow as its central representation and uses a Vision Transformer to predict, end to end from two-view inputs, each pixel's 3D point position, scene flow, pose weight, and confidence; a shared decoder infers local geometry and bidirectional motion symmetrically, with no explicit pose regressor or bundle adjustment. Result: Trained jointly on static and dynamic datasets, Flow4R achieves state-of-the-art performance on 4D reconstruction and tracking. Conclusion: A scene-flow-centric representation effectively unifies 3D structure, object motion, and camera motion and improves spatiotemporal scene understanding. Abstract: Reconstructing and tracking dynamic 3D scenes remains a fundamental challenge in computer vision. Existing approaches often decouple geometry from motion: multi-view reconstruction methods assume static scenes, while dynamic tracking frameworks rely on explicit camera pose estimation or separate motion models. We propose Flow4R, a unified framework that treats camera-space scene flow as the central representation linking 3D structure, object motion, and camera motion. Flow4R predicts a minimal per-pixel property set (3D point position, scene flow, pose weight, and confidence) from two-view inputs using a Vision Transformer. This flow-centric formulation allows local geometry and bidirectional motion to be inferred symmetrically with a shared decoder in a single forward pass, without requiring explicit pose regressors or bundle adjustment. Trained jointly on static and dynamic datasets, Flow4R achieves state-of-the-art performance on 4D reconstruction and tracking tasks, demonstrating the effectiveness of the flow-centric representation for spatiotemporal scene understanding.

[173] Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation

Jia Li,Xiaomeng Fu,Xurui Peng,Weifeng Chen,Youwei Zheng,Tianyu Zhao,Jiexi Wang,Fangmin Chen,Xing Wang,Hayden Kwok-Hay So

Main category: cs.CV

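TL;DR: This paper proposes FLEX, which uses frequency-aware RoPE modulation, antiphase noise sampling, and an inference-only attention sink to substantially improve long-video extrapolation in autoregressive video diffusion models without retraining.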

Details

Motivation: Autoregressive video diffusion models suffer severe extrapolation failure in long video generation, mainly because of the spectral bias of 3D positional embeddings and the lack of dynamic priors in noise sampling. Method: FLEX combines Frequency-aware RoPE Modulation (adaptively interpolating low-frequency components while extrapolating high-frequency ones), Antiphase Noise Sampling (injecting high-frequency dynamic priors), and an Inference-only Attention Sink (anchoring global structure). Result: On VBench, FLEX significantly outperforms state-of-the-art models at 6x extrapolation (30 s) and matches long-video fine-tuned baselines at 12x (60 s); as a plug-and-play component it extends models such as LongLive to 4-minute video generation. Conclusion: FLEX is a training-free, plug-and-play inference-time framework that bridges short-horizon training and long-horizon inference and substantially raises the achievable video length. Abstract: Autoregressive video diffusion models have emerged as a scalable paradigm for long video generation. However, they often suffer from severe extrapolation failure, where rapid error accumulation leads to significant temporal degradation when extending beyond training horizons. We identify that this failure primarily stems from the spectral bias of 3D positional embeddings and the lack of dynamic priors in noise sampling. To address these issues, we propose FLEX (Frequency-aware Length EXtension), a training-free inference-time framework that bridges the gap between short-term training and long-term inference. FLEX introduces Frequency-aware RoPE Modulation to adaptively interpolate under-trained low-frequency components while extrapolating high-frequency ones to preserve multi-scale temporal discriminability. This is integrated with Antiphase Noise Sampling (ANS) to inject high-frequency dynamic priors and Inference-only Attention Sink to anchor global structure. Extensive evaluations on VBench demonstrate that FLEX significantly outperforms state-of-the-art models at 6x extrapolation (30s duration) and matches the performance of long-video fine-tuned baselines at 12x scale (60s duration). As a plug-and-play augmentation, FLEX seamlessly integrates into existing inference pipelines for horizon extension. It effectively pushes the generation limits of models such as LongLive, supporting consistent and dynamic video synthesis at a 4-minute scale. Project page is available at https://ga-lee.github.io/FLEX_demo.
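
The idea of interpolating low-frequency RoPE components while leaving high-frequency ones to extrapolate can be sketched along a single temporal axis. The rule below is a simple hard-threshold stand-in, similar in spirit to piecewise RoPE scaling schemes, and is an assumption rather than FLEX's actual modulation; the function names, the base of 10000, and the period-versus-training-length criterion are all illustrative.

```python
import math


def rope_frequencies(dim, base=10000.0):
    """Standard RoPE angular frequencies for a head dimension `dim`."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def modulated_frequencies(dim, train_len, target_len, base=10000.0):
    """Interpolate only components whose period exceeds the training length."""
    scale = train_len / target_len
    out = []
    for freq in rope_frequencies(dim, base):
        period = 2.0 * math.pi / freq
        if period > train_len:
            # Low frequency: never completed a full cycle during training,
            # so compress positions back into the trained range.
            out.append(freq * scale)
        else:
            # High frequency: already seen over many cycles, keep unchanged.
            out.append(freq)
    return out


base_freqs = rope_frequencies(8)
new_freqs = modulated_frequencies(8, train_len=100, target_len=600)
for before, after in zip(base_freqs, new_freqs):
    print(f"{before:.6f} -> {after:.6f}")
```

In this toy setting with a 6x longer target, the two fastest components are untouched and the two slowest are scaled by 1/6, so fine temporal detail keeps its resolution while slow components stay within the range seen in training.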

[174] Explainability-Inspired Layer-Wise Pruning of Deep Neural Networks for Efficient Object Detection

Abhinav Shukla,Nachiket Tapas

Main category: cs.CV

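TL;DR: This paper proposes a layer-wise pruning method based on a SHAP-inspired gradient-activation attribution for compressing object detection models. Compared with conventional L1-norm pruning, it gives better accuracy-efficiency trade-offs on several models, including ShuffleNetV2 and RetinaNet.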

Details

Motivation: Conventional magnitude-based pruning does not necessarily reflect the true functional contribution of network components to task performance, making it hard to balance accuracy and efficiency on resource-constrained devices. Method: An explainability-driven, layer-wise pruning framework estimates layer importance with a SHAP-inspired gradient-activation attribution, replacing static weight magnitude as the pruning criterion. Result: On the COCO 2017 validation set, the method yields a 10% increase in inference speed on ShuffleNetV2, whereas L1 pruning degrades performance by 13.7%; on RetinaNet it preserves the baseline mAP (0.151) with negligible impact on inference speed, whereas L1 pruning incurs a 1.3% mAP drop for a 6.2% speed increase. Conclusion: Data-driven assessment of layer importance is the sounder criterion, and explainability-inspired compression offers a route to deploying DNNs on edge devices that preserves both performance and interpretability. Abstract: Deep neural networks (DNNs) have achieved remarkable success in object detection tasks, but their increasing complexity poses significant challenges for deployment on resource-constrained platforms. While model compression techniques such as pruning have emerged as essential tools, traditional magnitude-based pruning methods do not necessarily align with the true functional contribution of network components to task-specific performance. In this work, we present an explainability-inspired, layer-wise pruning framework tailored for efficient object detection. Our approach leverages a SHAP-inspired gradient-activation attribution to estimate layer importance, providing a data-driven proxy for functional contribution rather than relying solely on static weight magnitudes. We conduct comprehensive experiments across diverse object detection architectures, including ResNet-50, MobileNetV2, ShuffleNetV2, Faster R-CNN, RetinaNet, and YOLOv8, evaluating performance on the Microsoft COCO 2017 validation set. The results show that the proposed attribution-inspired pruning consistently identifies different layers as least important compared to L1-norm-based methods, leading to improved accuracy-efficiency trade-offs. Notably, for ShuffleNetV2, our method yields a 10% empirical increase in inference speed, whereas L1-pruning degrades performance by 13.7%. For RetinaNet, the proposed approach preserves the baseline mAP (0.151) with negligible impact on inference speed, while L1-pruning incurs a 1.3% mAP drop for a 6.2% speed increase.
These findings highlight the importance of data-driven layer importance assessment and demonstrate that explainability-inspired compression offers a principled direction for deploying deep neural networks on edge and resource-constrained platforms while preserving both performance and interpretability.
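
A gradient-activation attribution for layer importance can be sketched as the mean absolute product of each unit's activation and the loss gradient at that unit. The abstract does not give the exact scoring formula, so this is a generic gradient-times-activation proxy on made-up numbers, with hypothetical layer names.

```python
def layer_importance(activations, gradients):
    """Mean |activation * gradient| over all units of one layer."""
    products = [abs(a * g) for a, g in zip(activations, gradients)]
    return sum(products) / len(products)


def least_important_layer(layers):
    """layers maps a name to (activations, gradients); returns the lowest-scoring name."""
    return min(layers, key=lambda name: layer_importance(*layers[name]))


# Toy values standing in for activations and gradients collected on a data batch.
layers = {
    "conv1": ([0.5, 1.0, 0.2], [0.4, -0.3, 0.1]),
    "conv2": ([2.0, 1.5, 3.0], [0.01, -0.02, 0.005]),
    "conv3": ([0.1, 0.2, 0.1], [1.0, -2.0, 0.5]),
}
for name, (acts, grads) in layers.items():
    print(name, round(layer_importance(acts, grads), 4))
print("prune first:", least_important_layer(layers))
```

Here `conv2` has the largest activations yet the smallest score because its gradients are tiny, which illustrates how a data-driven criterion can rank layers differently from a magnitude-based one.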

[175] BitDance: Scaling Autoregressive Generative Models with Binary Tokens

Yuang Ai,Jiaming Han,Shaobin Zhuang,Weijia Mao,Xuefeng Hu,Ziyan Yang,Zhenheng Yang,Huaibo Huang,Xiangyu Yue,Hao Chen

Main category: cs.CV

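TL;DR: BitDance is an autoregressive image generator that predicts binary visual tokens rather than codebook indices, giving high expressiveness with efficient sampling. With a binary diffusion head and the newly proposed next-patch diffusion decoding, it achieves the best FID among AR models on ImageNet (1.24) with fewer parameters and faster inference, and it performs strongly on text-to-image generation.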

Details

Motivation: Conventional autoregressive image generation is limited in expressiveness by the size of its discrete codebook, and standard classification-based sampling cannot cope with an extremely high-entropy token space; the goal is to combine expressiveness, scalability, and inference efficiency. Method: BitDance (1) replaces codebook indices with 256-bit binary tokens, greatly enlarging the state space; (2) uses a binary diffusion head that generates tokens by continuous-space diffusion instead of softmax classification; (3) introduces next-patch diffusion, which predicts multiple tokens in parallel with high accuracy. Result: FID 1.24 on ImageNet 256x256, the best among AR models; with 260M parameters it beats state-of-the-art parallel AR models with 1.4B parameters, using 5.4x fewer parameters and running 8.7x faster; over 30x faster than prior AR models when generating 1024x1024 images; text-to-image generation is high-resolution, photorealistic, and efficient. Conclusion: BitDance shows that autoregressive modeling with binary tokens and a diffusion head is viable and effective, offering an efficient approach for AR foundation models that combines performance, scale, and speed. Abstract: We present BitDance, a scalable autoregressive (AR) image generator that predicts binary visual tokens instead of codebook indices. With high-entropy binary latents, BitDance lets each token represent up to $2^{256}$ states, yielding a compact yet highly expressive discrete representation. Sampling from such a huge token space is difficult with standard classification. To resolve this, BitDance uses a binary diffusion head: instead of predicting an index with softmax, it employs continuous-space diffusion to generate the binary tokens. Furthermore, we propose next-patch diffusion, a new decoding method that predicts multiple tokens in parallel with high accuracy, greatly speeding up inference. On ImageNet 256x256, BitDance achieves an FID of 1.24, the best among AR models. With next-patch diffusion, BitDance beats state-of-the-art parallel AR models that use 1.4B parameters, while using 5.4x fewer parameters (260M) and achieving 8.7x speedup. For text-to-image generation, BitDance trains on large-scale multimodal tokens and generates high-resolution, photorealistic images efficiently, showing strong performance and favorable scaling. When generating 1024x1024 images, BitDance achieves a speedup of over 30x compared to prior AR models. We release code and models to facilitate further research on AR foundation models. Code and models are available at: https://github.com/shallowdream204/BitDance.
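
The size of a 256-bit token space explains why a softmax over token indices is impractical. The short sketch below only illustrates that counting argument; the helper name is hypothetical and says nothing about how BitDance encodes or decodes its latents.

```python
def bits_to_token(bits):
    """Interpret a sequence of 0/1 values as one integer token id (MSB first)."""
    token = 0
    for b in bits:
        token = (token << 1) | b
    return token


num_states = 2 ** 256                 # distinct values of one 256-bit token
print(len(str(num_states)))           # number of decimal digits in 2**256
print(bits_to_token([1, 0, 1]))       # small example: binary 101
print(bits_to_token([1] * 256) == num_states - 1)
```

A classification head would need one logit per state, which is infeasible at this scale, whereas predicting the 256 bits as a continuous vector keeps the output size fixed at 256.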

[176] Restoration Adaptation for Semantic Segmentation on Low Quality Images

Kai Guan,Rongyuan Wu,Shuai Li,Wentao Zhu,Wenjun Zeng,Lei Zhang

Main category: cs.CV

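TL;DR: This paper proposes RASS, a framework that couples semantic-constrained image restoration (SCR) with the segmentation model to achieve accurate semantic segmentation on low-quality images.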

Details

Motivation: Low-quality images degrade semantic segmentation in real-world scenes; conventional restoration methods focus on pixel fidelity and neglect semantic information, and existing segmentation models are not robust to degradation. Method: A Semantic-Constrained Restoration (SCR) model injects segmentation priors into restoration by aligning its cross-attention maps with segmentation masks; restoration knowledge is then transferred to the segmentation model through LoRA-based module merging and task-specific fine-tuning. Result: On a newly built real-world low-quality dataset and on synthetic and real-world degradation benchmarks, SCR and RASS significantly outperform state-of-the-art methods. Conclusion: Semantically guided restoration combined with segmentation improves the robustness and accuracy of segmentation on low-quality images. Abstract: In real-world scenarios, the performance of semantic segmentation often deteriorates when processing low-quality (LQ) images, which may lack clear semantic structures and high-frequency details. Although image restoration techniques offer a promising direction for enhancing degraded visual content, conventional real-world image restoration (Real-IR) models primarily focus on pixel-level fidelity and often fail to recover task-relevant semantic cues, limiting their effectiveness when directly applied to downstream vision tasks. Conversely, existing segmentation models trained on high-quality data lack robustness under real-world degradations. In this paper, we propose Restoration Adaptation for Semantic Segmentation (RASS), which effectively integrates semantic image restoration into the segmentation process, enabling high-quality semantic segmentation directly on LQ images. Specifically, we first propose a Semantic-Constrained Restoration (SCR) model, which injects segmentation priors into the restoration model by aligning its cross-attention maps with segmentation masks, encouraging semantically faithful image reconstruction. Then, RASS transfers semantic restoration knowledge into segmentation through LoRA-based module merging and task-specific fine-tuning, thereby enhancing the model's robustness to LQ images. To validate the effectiveness of our framework, we construct a real-world LQ image segmentation dataset with high-quality annotations, and conduct extensive experiments on both synthetic and real-world LQ benchmarks. The results show that SCR and RASS significantly outperform state-of-the-art methods in segmentation and restoration tasks.
Code, models, and datasets will be available at https://github.com/Ka1Guan/RASS.git.

[177] CoCoEdit: Content-Consistent Image Editing via Region Regularized Reinforcement Learning

Yuhui Wu,Chenxi Xie,Ruibin Li,Liyi Chen,Qiaosi Yi,Lei Zhang

Main category: cs.CV

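TL;DR: This paper proposes CoCoEdit, a post-training framework that uses region-regularized reinforcement learning to improve content consistency in the non-edited regions of an image while preserving editing quality.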

Details

Motivation: Existing image editing models produce impressive results but often cause unwanted changes outside the target region and offer no safeguard for content consistency. Method: The authors curate a 40K high-quality editing dataset; combine a pixel-level similarity reward with MLLM-based rewards; design a region-based regularizer that preserves non-edited regions for high-reward samples and encourages editing effects for low-reward samples; and annotate editing masks for GEdit-Bench and ImgEdit-Bench together with pixel-level evaluation metrics. Result: Applied to Qwen-Image-Edit and FLUX-Kontext, CoCoEdit significantly improves content consistency as measured by PSNR/SSIM and human subjective ratings, while its editing scores remain competitive with state-of-the-art models. Conclusion: Region-regularized reinforcement learning lets editing quality and content consistency be optimized together and supports more controllable, robust image editing. Abstract: Image editing has achieved impressive results with the development of large-scale generative models. However, existing models mainly focus on the editing effects of intended objects and regions, often leading to unwanted changes in unintended regions. We present a post-training framework for Content-Consistent Editing (CoCoEdit) via region regularized reinforcement learning. We first augment existing editing datasets with refined instructions and masks, from which 40K diverse and high quality samples are curated as training set. We then introduce a pixel-level similarity reward to complement MLLM-based rewards, enabling models to ensure both editing quality and content consistency during the editing process. To overcome the spatial-agnostic nature of the rewards, we propose a region-based regularizer, aiming to preserve non-edited regions for high-reward samples while encouraging editing effects for low-reward samples. For evaluation, we annotate editing masks for GEdit-Bench and ImgEdit-Bench, introducing pixel-level similarity metrics to measure content consistency and editing quality. Applying CoCoEdit to Qwen-Image-Edit and FLUX-Kontext, we achieve not only competitive editing scores with state-of-the-art models, but also significantly better content consistency, measured by PSNR/SSIM metrics and human subjective ratings.
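
Content consistency measured by PSNR outside the edit mask can be sketched as follows. This is a generic masked PSNR over flattened grayscale pixels, offered as an assumption about what such a pixel-level metric looks like; it is not the paper's evaluation code, and the convention that mask value 1 marks edited pixels is my own.

```python
import math


def masked_psnr(original, edited, edit_mask, max_value=255.0):
    """PSNR over pixels where edit_mask is 0 (the region that should not change)."""
    squared_errors = [
        (o - e) ** 2
        for o, e, m in zip(original, edited, edit_mask)
        if m == 0
    ]
    if not squared_errors:
        raise ValueError("mask covers the whole image")
    mse = sum(squared_errors) / len(squared_errors)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


original = [10, 20, 30, 40]
edit_mask = [0, 0, 1, 0]                 # only the third pixel is meant to change
clean_edit = [10, 20, 99, 40]            # change confined to the masked pixel
leaky_edit = [10, 25, 99, 40]            # an unintended change outside the mask

print(masked_psnr(original, clean_edit, edit_mask))
print(masked_psnr(original, leaky_edit, edit_mask))
```

An edit confined to the mask scores infinite PSNR on the preserved region, while any leakage outside the mask lowers it, regardless of how large the intended change is.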

[178] ForgeryVCR: Visual-Centric Reasoning via Efficient Forensic Tools in MLLMs for Image Forgery Detection and Localization

Youqi Wang,Shen Chen,Haowei Wang,Rongxuan Peng,Taiping Yao,Shunquan Tan,Changsheng Chen,Bin Li,Shouhong Ding

Main category: cs.CV

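TL;DR: ForgeryVCR proposes a visual-centric reasoning framework. A forensic toolbox turns imperceptible tampering traces into explicit visual evidence, and a Strategic Tool Learning paradigm (gain-driven trajectory construction plus reinforcement learning guided by a tool utility reward) lets a multimodal large language model actively invoke multi-view reasoning paths, reaching state-of-the-art performance on image forgery detection and localization.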

Details

Motivation: Multimodal large language models that follow a text-centric Chain-of-Thought (CoT) paradigm tend to hallucinate in image forgery detection and localization, because language is ill-suited to describing imperceptible low-level tampering traces and pixel-level inconsistencies. Method: ForgeryVCR comprises (1) a forensic toolbox that converts imperceptible traces into explicit visual intermediates and (2) a Strategic Tool Learning post-training paradigm, covering gain-driven trajectory construction for supervised fine-tuning and reinforcement learning guided by a tool utility reward; the model learns to invoke multi-view reasoning paths on its own, such as local zoom-in, compression history, noise residuals, and frequency-domain analysis. Result: State-of-the-art performance on both forgery detection and localization, with stronger generalization and robustness and minimal tool redundancy. Conclusion: Visual-centric reasoning with strategic tool learning overcomes the limits of text-centric CoT in fine-grained forgery analysis and offers an approach for MLLMs on tasks that require pixel-level understanding, such as digital forensics. Abstract: Existing Multimodal Large Language Models (MLLMs) for image forgery detection and localization predominantly operate under a text-centric Chain-of-Thought (CoT) paradigm. However, forcing these models to textually characterize imperceptible low-level tampering traces inevitably leads to hallucinations, as linguistic modalities are insufficient to capture such fine-grained pixel-level inconsistencies. To overcome this, we propose ForgeryVCR, a framework that incorporates a forensic toolbox to materialize imperceptible traces into explicit visual intermediates via Visual-Centric Reasoning. To enable efficient tool utilization, we introduce a Strategic Tool Learning post-training paradigm, encompassing gain-driven trajectory construction for Supervised Fine-Tuning (SFT) and subsequent Reinforcement Learning (RL) optimization guided by a tool utility reward. This paradigm empowers the MLLM to act as a proactive decision-maker, learning to spontaneously invoke multi-view reasoning paths including local zoom-in for fine-grained inspection and the analysis of invisible inconsistencies in compression history, noise residuals, and frequency domains. Extensive experiments reveal that ForgeryVCR achieves state-of-the-art (SOTA) performance in both detection and localization tasks, demonstrating superior generalization and robustness with minimal tool redundancy. The project page is available at https://youqiwong.github.io/projects/ForgeryVCR/.

[179] GeoFusionLRM: Geometry-Aware Self-Correction for Consistent 3D Reconstruction

Ahmet Burak Yildirim,Tuna Saygin,Duygu Ceylan,Aysegul Dundar

Main category: cs.CV

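TL;DR: This paper proposes GeoFusionLRM, a geometry-aware self-correction framework that uses a large reconstruction model's (LRM) own normal and depth predictions to improve the geometric consistency and detail alignment of single-image 3D reconstruction, without additional supervision or external signals.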

Details

Motivation: Existing single-image 3D reconstruction methods, large reconstruction models in particular, often show geometric inconsistencies and misaligned details that limit reconstruction fidelity. Method: GeoFusionLRM feeds geometric cues, namely the model's own predicted normals and depth, back into the network through a dedicated transformer and fusion module, so that the model can self-correct structural errors and enforce consistency with the conditioning image. Result: Across experiments, GeoFusionLRM substantially improves the geometric sharpness, normal consistency, and overall fidelity of reconstructed meshes, outperforming state-of-the-art LRM baselines. Conclusion: Internal feedback of geometric cues strengthens an LRM's ability to correct itself and improves reconstruction quality without extra supervision, offering a new paradigm for single-image 3D reconstruction. Abstract: Single-image 3D reconstruction with large reconstruction models (LRMs) has advanced rapidly, yet reconstructions often exhibit geometric inconsistencies and misaligned details that limit fidelity. We introduce GeoFusionLRM, a geometry-aware self-correction framework that leverages the model's own normal and depth predictions to refine structural accuracy. Unlike prior approaches that rely solely on features extracted from the input image, GeoFusionLRM feeds back geometric cues through a dedicated transformer and fusion module, enabling the model to correct errors and enforce consistency with the conditioning image. This design improves the alignment between the reconstructed mesh and the input views without additional supervision or external signals. Extensive experiments demonstrate that GeoFusionLRM achieves sharper geometry, more consistent normals, and higher fidelity than state-of-the-art LRM baselines.

[180] EgoSound: Benchmarking Sound Understanding in Egocentric Videos

Bingwen Zhu,Yuqian Fu,Qiaole Dong,Guolei Sun,Tianwen Qian,Yuzheng Wu,Danda Pani Paudel,Xiangyang Xue,Yanwei Fu

Main category: cs.CV

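TL;DR: This paper introduces EgoSound, the first benchmark for evaluating how well multimodal large language models (MLLMs) understand sound in egocentric video. It covers seven task types and 7315 validated QA pairs, and experiments on nine SOTA models expose current limitations in fine-grained spatial and causal reasoning.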

Details

Motivation: Human perception is multisensory, yet existing MLLMs focus mainly on vision-language understanding and overlook the role sound plays in conveying spatial layout, off-screen events, and causal interactions. In egocentric scenes, where audio and visual signals are tightly coupled, a systematic evaluation of auditory understanding is needed. Method: Builds EgoSound, the first egocentric sound understanding benchmark, by unifying data from Ego4D and EgoBlind, defining a seven-task taxonomy (including intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning), and using a multi-stage auto-generative pipeline to produce 7315 QA pairs across 900 videos. Result: Experiments on nine SOTA MLLMs show emerging auditory reasoning abilities but marked limitations in fine-grained spatial localization and causal understanding. Conclusion: EgoSound offers a challenging new benchmark for advancing multisensory egocentric intelligence, bridging the gap between seeing and truly hearing the world. Abstract: Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in vision-language understanding. Yet, human perception is inherently multisensory, integrating sight, sound, and motion to reason about the world. Among these modalities, sound provides indispensable cues about spatial layout, off-screen events, and causal interactions, particularly in egocentric settings where auditory and visual signals are tightly coupled. To this end, we introduce EgoSound, the first benchmark designed to systematically evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning. Constructed through a multi-stage auto-generative pipeline, EgoSound contains 7315 validated QA pairs across 900 videos. Comprehensive experiments on nine state-of-the-art MLLMs reveal that current models exhibit emerging auditory reasoning abilities but remain limited in fine-grained spatial and causal understanding. EgoSound establishes a challenging foundation for advancing multisensory egocentric intelligence, bridging the gap between seeing and truly hearing the world.

[181] DenseMLLM: Standard Multimodal LLMs are Intrinsic Dense Predictors

Yi Li,Hongze Shen,Lexiang Tang,Xin Li,Xinpeng Ding,Yinsong Liu,Deqiang Jiang,Xing Sun,Xiaomeng Li

Main category: cs.CV

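TL;DR: This paper proposes DenseMLLM, a multimodal large language model that performs dense prediction (e.g., semantic segmentation, depth estimation) without task-specific decoders. A novel vision token supervision strategy lets a general-purpose MLLM support dense perception tasks.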

Details

Motivation: Extending existing MLLMs to fine-grained dense prediction requires complex, task-specific decoders, which fragments the architecture, adds complexity, departs from the generalist design of MLLMs, and limits practicality. Method: DenseMLLM keeps the standard MLLM architecture and adds a novel vision token supervision strategy for multiple labels and tasks, avoiding task-specific decoders. Result: Highly competitive performance across a range of dense prediction and vision-language benchmarks, showing that a standard general-purpose MLLM can support dense perception without architectural specialization. Conclusion: With an appropriate supervision strategy, a standard general-purpose MLLM can handle dense prediction directly, without extra decoders or architectural changes, which points to a new paradigm for unified, lightweight, generalizable multimodal models. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high-level visual understanding. However, extending these models to fine-grained dense prediction tasks, such as semantic segmentation and depth estimation, typically necessitates the incorporation of complex, task-specific decoders and other customizations. This architectural fragmentation increases model complexity and deviates from the generalist design of MLLMs, ultimately limiting their practicality. In this work, we challenge this paradigm by accommodating standard MLLMs to perform dense predictions without requiring additional task-specific decoders. The proposed model is called DenseMLLM, grounded in the standard architecture with a novel vision token supervision strategy for multiple labels and tasks. Despite its minimalist design, our model achieves highly competitive performance across a wide range of dense prediction and vision-language benchmarks, demonstrating that a standard, general-purpose MLLM can effectively support dense perception without architectural specialization.

[182] Detection of On-Ground Chestnuts Using Artificial Intelligence Toward Automated Picking

Kaixuan Fang,Yuzhen Lu,Xinyang Mu

Main category: cs.CV

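TL;DR: This study systematically evaluates 29 real-time object detectors (from the YOLO and RT-DETR families) on detecting chestnuts on the orchard floor, and builds and releases a dataset of 319 images with 6524 annotated chestnuts. YOLOv12m achieves the best mAP@0.5 (95.1%), and YOLO models overall outperform RT-DETR, making them better suited to embedded real-time harvesting.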

Details

Motivation: Traditional mechanized chestnut harvesting is costly, non-selective, and prone to damaging nuts. Low-cost, vision-guided automated harvesting needs accurate, reliable chestnut detection in complex on-ground environments, a problem that had not been addressed systematically. Method: Collects 319 orchard-floor images containing 6524 annotated chestnuts and systematically evaluates 29 SOTA real-time object detectors (14 YOLO v11-v13 and 15 RT-DETR v1-v4 models at varied scales) through replicated modeling experiments, comparing detection accuracy (mAP@0.5 and mAP@[0.5:0.95]) and inference efficiency. Result: YOLOv12m reaches the highest mAP@0.5 (95.1%); RT-DETRv2-R101 is the best RT-DETR variant (91.1%); YOLOv11x is best on mAP@[0.5:0.95] (80.1%). All models show potential for real-time detection, and YOLO models outperform RT-DETR in both accuracy and inference speed. Conclusion: YOLO models, YOLOv12m in particular, are better suited to resource-constrained on-board real-time chestnut detection. The dataset and code are publicly available as a basis for further agricultural vision research. Abstract: Traditional mechanized chestnut harvesting is too costly for small producers, non-selective, and prone to damaging nuts. Accurate, reliable detection of chestnuts on the orchard floor is crucial for developing low-cost, vision-guided automated harvesting technology. However, developing a reliable chestnut detection system faces challenges in complex environments with shading, varying natural light conditions, and interference from weeds, fallen leaves, stones, and other foreign on-ground objects, which have remained unaddressed. This study collected 319 images of chestnuts on the orchard floor, containing 6524 annotated chestnuts. A comprehensive set of 29 state-of-the-art real-time object detectors, including 14 in the YOLO (v11-13) and 15 in the RT-DETR (v1-v4) families at varied model scales, was systematically evaluated through replicated modeling experiments for chestnut detection. Experimental results show that the YOLOv12m model achieves the best mAP@0.5 of 95.1% among all the evaluated models, while the RT-DETRv2-R101 was the most accurate variant among RT-DETR models, with mAP@0.5 of 91.1%. In terms of mAP@[0.5:0.95], the YOLOv11x model achieved the best accuracy of 80.1%. All models demonstrate significant potential for real-time chestnut detection, and YOLO models outperformed RT-DETR models in terms of both detection accuracy and inference, making them better suited for on-board deployment.
Both the dataset and software programs in this study have been made publicly available at https://github.com/AgFood-Sensing-and-Intelligence-Lab/ChestnutDetection.
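
For readers unfamiliar with the two accuracy metrics quoted in this entry, the following is a minimal sketch of the quantities behind them, not the authors' evaluation code. It assumes the common COCO-style convention, in which mAP@0.5 counts a detection as correct when its box IoU with a ground-truth box is at least 0.5, and mAP@[0.5:0.95] averages over IoU thresholds from 0.50 to 0.95 in steps of 0.05. The function names are illustrative.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Overlap is zero when the boxes do not intersect.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def coco_iou_thresholds():
    """The ten IoU thresholds averaged in mAP@[0.5:0.95] (assumed convention)."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A single prediction matches a ground-truth box at the given IoU threshold."""
    return box_iou(pred_box, gt_box) >= threshold
```

A prediction that overlaps a chestnut loosely can therefore count toward mAP@0.5 while failing the stricter thresholds, which is one reason the best model can differ between the two metrics.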

[183] LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models

Shufan Li,Yuchen Zhu,Jiuxiang Gu,Kangning Liu,Zhe Lin,Yongxin Chen,Molei Tao,Aditya Grover,Jason Kuen

Main category: cs.CV

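TL;DR: This paper proposes LaViDa-R1, a multimodal, general-purpose reasoning diffusion language model. A unified post-training framework integrates supervised finetuning with multi-task reinforcement learning and introduces several new training techniques, yielding strong results on multimodal tasks such as visual math reasoning, reason-intensive grounding, and image editing.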

Details

Motivation: Existing reasoning diffusion language models mostly rely on task-specific reinforcement learning and lack a unified, scalable approach to multimodal reasoning. Method: Proposes a unified post-training framework that combines supervised finetuning (SFT) with multi-task reinforcement learning (RL), and introduces new techniques including answer-forcing, tree search, and complementary likelihood estimation. Result: LaViDa-R1 shows strong generalization and performance on multimodal tasks including visual math reasoning, reason-intensive grounding, and image editing. Conclusion: A unified multi-task post-training framework combined with new training techniques improves diffusion language models on general multimodal reasoning and offers a new paradigm for dLLM development. Abstract: Diffusion language models (dLLMs) recently emerged as a promising alternative to auto-regressive LLMs. The latest works further extended them to multimodal understanding and generation tasks. In this work, we propose LaViDa-R1, a multimodal, general-purpose reasoning dLLM. Unlike existing works that build reasoning dLLMs through task-specific reinforcement learning, LaViDa-R1 incorporates diverse multimodal understanding and generation tasks in a unified manner. In particular, LaViDa-R1 is built with a novel unified post-training framework that seamlessly integrates supervised finetuning (SFT) and multi-task reinforcement learning (RL). It employs several novel training techniques, including answer-forcing, tree search, and complementary likelihood estimation, to enhance effectiveness and scalability. Extensive experiments demonstrate LaViDa-R1's strong performance on a wide range of multimodal tasks, including visual math reasoning, reason-intensive grounding, and image editing.

[184] ARport: An Augmented Reality System for Markerless Image-Guided Port Placement in Robotic Surgery

Zheng Han,Zixin Yang,Yonghao Long,Lin Zhang,Peter Kazanzides,Qi Dou

Main category: cs.CV

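TL;DR: ARport is an augmented reality system built on an optical see-through head-mounted display. Without external sensors or markers, it automatically maps pre-planned trocar layouts onto the patient's body surface to provide intuitive spatial guidance.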

Details

Motivation: Precise trocar (port) placement is a critical step in robot-assisted surgery, but a gap exists between preoperative planning and intraoperative execution, calling for a visualization solution that is intuitive, easy to use, and integrates smoothly into the workflow. Method: ARport runs on an optical see-through head-mounted display (OST-HMD) and reconstructs the operative scene from the RGB, depth, and pose data it captures. A foundation model extracts the patient's body surface, and surface-based markerless registration aligns the preoperative anatomical model to that surface, enabling in-situ visualization of the trocar layout. Result: In full-scale human-phantom experiments, ARport accurately overlaid pre-planned trocar sites onto the physical phantom, with consistent spatial correspondence between the virtual plan and the real anatomy. Conclusion: ARport provides a fully marker-free, hardware-minimal way to visualize preoperative trocar plans directly on the patient's body surface. It makes intraoperative setup more efficient and shows potential for integration into routine clinical workflows. Abstract: Purpose: Precise port placement is a critical step in robot-assisted surgery, where port configuration influences both visual access to the operative field and instrument maneuverability. To bridge the gap between preoperative planning and intraoperative execution, we present ARport, an augmented reality (AR) system that automatically maps pre-planned trocar layouts onto the patient's body surface, providing intuitive spatial guidance during surgical preparation. Methods: ARport, implemented on an optical see-through head-mounted display (OST-HMD), operates without any external sensors or markers, simplifying setup and enhancing workflow integration. It reconstructs the operative scene from RGB, depth, and pose data captured by the OST-HMD, extracts the patient's body surface using a foundation model, and performs surface-based markerless registration to align preoperative anatomical models to the extracted patient's body surface, enabling in-situ visualization of planned trocar layouts. A demonstration video illustrating the overall workflow is available online. Results: In full-scale human-phantom experiments, ARport accurately overlaid pre-planned trocar sites onto the physical phantom, achieving consistent spatial correspondence between virtual plans and real anatomy. Conclusion: ARport provides a fully marker-free and hardware-minimal solution for visualizing preoperative trocar plans directly on the patient's body surface. The system facilitates efficient intraoperative setup and demonstrates potential for seamless integration into routine clinical workflows.

[185] When Test-Time Guidance Is Enough: Fast Image and Video Editing with Diffusion Guidance

Ahmed Ghorbel,Badr Moufad,Navid Bagheri Shouraki,Alain Oliviero Durmus,Thomas Hirtz,Eric Moulines,Jimmy Olsson,Yazid Janati

Main category: cs.CV

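TL;DR: This paper presents a test-time guidance method that avoids vector-Jacobian product (VJP) computation, making text-driven image and video editing substantially more efficient and effective, with performance that matches or even exceeds training-based methods.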

Details

Motivation: Existing text-driven editing methods based on diffusion or flow models rely on costly vector-Jacobian product (VJP) computations to approximate the intractable guidance term, which limits practical use. Method: Building on the VJP-free approximation of Moufad et al. (2025), the paper provides further theoretical analysis and substantially extends the empirical evaluation to large-scale image and video editing benchmarks. Result: Experiments show that test-time guidance alone can match or even surpass training-based methods. Conclusion: Test-time guidance is an efficient and powerful alternative that achieves high-quality text-driven editing without additional training. Abstract: Text-driven image and video editing can be naturally cast as inpainting problems, where masked regions are reconstructed to remain consistent with both the observed content and the editing prompt. Recent advances in test-time guidance for diffusion and flow models provide a principled framework for this task; however, existing methods rely on costly vector-Jacobian product (VJP) computations to approximate the intractable guidance term, limiting their practical applicability. Building upon the recent work of Moufad et al. (2025), we provide theoretical insights into their VJP-free approximation and substantially extend their empirical evaluation to large-scale image and video editing benchmarks. Our results demonstrate that test-time guidance alone can achieve performance comparable to, and in some cases surpass, training-based methods.

[186] Towards Spatial Transcriptomics-driven Pathology Foundation Models

Konstantin Hemker,Andrew H. Song,Cristina Almagro-Pérez,Guillaume Jaume,Sophia J. Wagner,Anurag Vaidya,Nikola Simidjievski,Mateja Jamnik,Faisal Mahmood

Main category: cs.CV

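TL;DR: This paper proposes Spatial Expression-Aligned Learning (SEAL), a self-supervised learning framework that infuses molecular information from spatial transcriptomics (ST) into pathology vision encoders. Through parameter-efficient finetuning, it improves existing pathology foundation models on many downstream tasks and adds cross-modal capabilities.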

Details

Motivation: The success of multimodal foundation models suggests that morphological and molecular information can jointly strengthen pathology representations, yet existing pathology vision models make no use of local gene expression, which limits their ability to support molecular-level interpretation. Method: SEAL uses paired gene expression spot and tissue region data from spatial transcriptomics as the supervision signal and finetunes existing pathology vision encoders in a parameter-efficient way, with no training from scratch. It is trained on over 700,000 examples and applies to several widely used pathology foundation models. Result: Across 38 slide-level and 15 patch-level downstream tasks, SEAL consistently outperforms vision-only and ST prediction baselines, improving prediction of molecular status, pathway activity, treatment response, and gene expression. It generalizes well out of distribution and supports new cross-modal functions such as gene-to-image retrieval. Conclusion: SEAL offers a general, practical paradigm for ST-guided finetuning of pathology foundation models, showing that localized molecular supervision strengthens visual representations and extends their cross-modal utility. Abstract: Spatial transcriptomics (ST) provides spatially resolved measurements of gene expression, enabling characterization of the molecular landscape of human tissue beyond histological assessment as well as localized readouts that can be aligned with morphology. Concurrently, the success of multimodal foundation models that integrate vision with complementary modalities suggests that morphomolecular coupling between local expression and morphology can be systematically used to improve histological representations themselves. We introduce Spatial Expression-Aligned Learning (SEAL), a vision-omics self-supervised learning framework that infuses localized molecular information into pathology vision encoders. Rather than training new encoders from scratch, SEAL is designed as a parameter-efficient vision-omics finetuning method that can be flexibly applied to widely used pathology foundation models. We instantiate SEAL by training on over 700,000 paired gene expression spot-tissue region examples spanning tumor and normal samples from 14 organs. Tested across 38 slide-level and 15 patch-level downstream tasks, SEAL provides a drop-in replacement for pathology foundation models that consistently improves performance over widely used vision-only and ST prediction baselines on slide-level molecular status, pathway activity, and treatment response prediction, as well as patch-level gene expression prediction tasks. Additionally, SEAL encoders exhibit robust domain generalization on out-of-distribution evaluations and enable new cross-modal capabilities such as gene-to-image retrieval.
Our work proposes a general framework for ST-guided finetuning of pathology foundation models, showing that augmenting existing models with localized molecular supervision is an effective and practical step for improving visual representations and expanding their cross-modal utility.

[187] UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model

Shaobin Zhuang,Yuang Ai,Jiaming Han,Weijia Mao,Xiaohui Li,Fangyikang Wang,Xiao Wang,Yan Li,Shanchuan Lin,Kun Xu,Zhenheng Yang,Huaibo Huang,Xiangyu Yue,Hao Chen,Yali Wang

Main category: cs.CV

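TL;DR: This paper proposes UniWeTok, a unified discrete visual tokenizer built on a massive binary codebook (2^128). Using Pre-Post Distillation, a Generative-Aware Prior, a convolution-attention hybrid architecture, and the SigLu activation function, it balances high-fidelity reconstruction, semantic extraction, and generative suitability, reaching SOTA or leading results on image generation, multimodal understanding, and editing at markedly lower training cost.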

Details

Motivation: Existing visual tokenizers struggle to satisfy three conflicting objectives within one framework (high-fidelity reconstruction, complex semantic extraction, and generative suitability), which holds back unified multimodal large language models (MLLMs). Method: UniWeTok uses a 128-bit binary codebook; introduces Pre-Post Distillation and a Generative-Aware Prior to improve semantic extraction and the generative prior; adopts a convolution-attention hybrid encoder with the SigLu activation function to stabilize distillation and ease the optimization conflict between token entropy loss and commitment loss; and uses a three-stage training framework to improve adaptability across resolutions and perception-sensitive scenarios such as faces and text. Result: On ImageNet, image generation FID reaches 1.38 (vs. 1.42 for REPA) with only 33B training tokens (vs. 262B for REPA). In the general domain, the DPG score is 86.63 (vs. 83.84 for FLUX.1 [Dev]) and the GEdit overall score is 5.09 (vs. 5.06 for OmniGen). The model supports multimodal understanding, generation, and editing. Conclusion: UniWeTok balances reconstruction quality, semantic richness, and generative suitability in one visual representation, providing a key building block for more efficient and general MLLMs; code and models are released. Abstract: Unified Multimodal Large Language Models (MLLMs) require a visual representation that simultaneously supports high-fidelity reconstruction, complex semantic extraction, and generative suitability. However, existing visual tokenizers typically struggle to satisfy these conflicting objectives within a single framework. In this paper, we introduce UniWeTok, a unified discrete tokenizer designed to bridge this gap using a massive binary codebook ($\mathit{2^{128}}$). For the training framework, we introduce Pre-Post Distillation and a Generative-Aware Prior to enhance the semantic extraction and generative prior of the discrete tokens. In terms of model architecture, we propose a convolution-attention hybrid architecture with the SigLu activation function. SigLu activation not only bounds the encoder output and stabilizes the semantic distillation process but also effectively addresses the optimization conflict between token entropy loss and commitment loss. We further propose a three-stage training framework designed to enhance UniWeTok's adaptability across various image resolutions and perception-sensitive scenarios, such as those involving human faces and textual content. On ImageNet, UniWeTok achieves state-of-the-art image generation performance (FID: UniWeTok 1.38 vs. REPA 1.42) while requiring a remarkably low training compute (Training Tokens: UniWeTok 33B vs. REPA 262B).
In the general domain, UniWeTok demonstrates highly competitive capabilities across a broad range of tasks, including multimodal understanding, image generation (DPG Score: UniWeTok 86.63 vs. FLUX.1 [Dev] 83.84), and editing (GEdit Overall Score: UniWeTok 5.09 vs. OmniGen 5.06). We release code and models to facilitate community exploration of unified tokenizer and MLLM.
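
To make the size of a $2^{128}$ binary codebook concrete, here is a small sketch under a loudly stated assumption: each token is taken to be 128 binary channels obtained by thresholding a real-valued latent at zero. The abstract does not describe the actual quantization rule, so `binarize` and `code_index` are hypothetical helpers for illustration only.

```python
import random


def binarize(latent):
    """Map each real-valued channel to one bit by its sign (assumed rule)."""
    return [1 if v > 0 else 0 for v in latent]


def code_index(bits):
    """Pack a list of bits into one integer, most significant bit first."""
    idx = 0
    for b in bits:
        idx = (idx << 1) | b
    return idx


# A stand-in 128-channel latent; a real encoder output would go here.
rng = random.Random(0)
latent = [rng.uniform(-1.0, 1.0) for _ in range(128)]
bits = binarize(latent)
idx = code_index(bits)  # an integer in [0, 2**128)
```

Because every 128-bit pattern is a valid code, the codebook never has to be stored as an explicit table; the index is implied by the bits themselves.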

[188] UniRef-Image-Edit: Towards Scalable and Consistent Multi-Reference Image Editing

Hongyang Wei,Bin Wen,Yancheng Long,Yankai Yang,Yuhang Hu,Tianke Zhang,Wei Chen,Haonan Fan,Kaiyu Jiang,Jiankang Chen,Changyi Liu,Kaiyu Tang,Haojie Ding,Xiao Yang,Jia Sun,Huaiqing Wang,Zhenyu Yang,Xinyu Wei,Xianglong He,Yangguang Li,Fan Yang,Tingting Gao,Lei Zhang,Guorui Zhou,Han Li

Main category: cs.CV

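TL;DR: This paper proposes UniRef-Image-Edit, a multimodal generation system that unifies single-image editing and multi-image composition. A new Sequence-Extended Latent Fusion (SELF) representation and two-stage training (supervised fine-tuning plus Multi-Source GRPO reinforcement learning) improve cross-reference consistency and visual quality.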

Details

Motivation: Existing diffusion-based image editing methods struggle to stay consistent across multiple reference inputs because the references interact only weakly. Method: Proposes the Sequence-Extended Latent Fusion (SELF) representation, which dynamically serializes multiple reference images into a fixed-length latent sequence, and a two-stage training scheme: (1) supervised fine-tuning (SFT) that jointly trains single-image editing and multi-image composition with a progressive sequence length strategy (total pixel budget raised from 1024² to 1536² and then 2048²); (2) reinforcement learning (RL) with Multi-Source GRPO (MSGRPO) to reconcile conflicting multi-reference constraints. Result: Markedly better visual fidelity and cross-reference consistency in multi-reference generation; single-image editing and multi-image composition are modeled in one framework; code, models, training data, and reward data are to be open-sourced. Conclusion: With the SELF representation and two-stage training, UniRef-Image-Edit addresses the tension between consistency and detail in multi-reference image generation and offers a scalable, reproducible framework for multimodal image editing. Abstract: We present UniRef-Image-Edit, a high-performance multi-modal generation system that unifies single-image editing and multi-image composition within a single framework. Existing diffusion-based editing methods often struggle to maintain consistency across multiple conditions due to limited interaction between reference inputs. To address this, we introduce Sequence-Extended Latent Fusion (SELF), a unified input representation that dynamically serializes multiple reference images into a coherent latent sequence. During a dedicated training stage, all reference images are jointly constrained to fit within a fixed-length sequence under a global pixel-budget constraint. Building upon SELF, we propose a two-stage training framework comprising supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we jointly train on single-image editing and multi-image composition tasks to establish a robust generative prior. We adopt a progressive sequence length training strategy, in which all input images are initially resized to a total pixel budget of $1024^2$, and are then gradually increased to $1536^2$ and $2048^2$ to improve visual fidelity and cross-reference consistency. This gradual relaxation of compression enables the model to incrementally capture finer visual details while maintaining stable alignment across references. For the RL stage, we introduce Multi-Source GRPO (MSGRPO), to our knowledge the first reinforcement learning framework tailored for multi-reference image generation.
MSGRPO optimizes the model to reconcile conflicting visual constraints, significantly enhancing compositional consistency. We will open-source the code, models, training data, and reward data for community research purposes.
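
The global pixel-budget constraint in this entry can be pictured with a short sketch. It assumes, purely for illustration, that all reference images are downscaled by one shared factor (and never upscaled) so that their summed pixel count fits the budget; the abstract does not state the actual resizing rule, and `fit_to_pixel_budget` is a hypothetical helper.

```python
import math


def fit_to_pixel_budget(sizes, budget):
    """Scale (width, height) pairs by one shared factor so total pixels <= budget.

    Images are only ever shrunk; if the set already fits, sizes are unchanged.
    """
    total = sum(w * h for w, h in sizes)
    scale = min(1.0, math.sqrt(budget / total))
    # Flooring keeps the resized total at or below the budget.
    return [(max(1, int(w * scale)), max(1, int(h * scale))) for w, h in sizes]


# The three budgets named in the abstract, applied to two example references.
references = [(2048, 2048), (1024, 1024)]
schedule = [1024 ** 2, 1536 ** 2, 2048 ** 2]
resized_per_stage = [fit_to_pixel_budget(references, b) for b in schedule]
```

Under this assumption, each later stage compresses the same references less, which matches the abstract's description of a gradual relaxation of compression.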

[189] GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery

Fengxiang Wang,Mingshuo Chen,Yueying Li,Yajie Yang,Yifan Zhang,Long Lan,Xue Yang,Hongda Sun,Yulin Wang,Di Wang,Jun Song,Jing Zhang,Bo Du

Main category: cs.CV

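TL;DR: This paper proposes GeoEyes, a staged training framework (cold-start supervised fine-tuning followed by the adaptive reinforcement learning method AdaZoom-GRPO) that addresses tool usage homogenization in multimodal large models for ultra-high-resolution remote sensing VQA. The model learns to zoom on demand and stop at the right time, with substantial performance gains.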

Details

Motivation: Existing zoom-enabled multimodal large language models show a "Tool Usage Homogenization" failure mode on ultra-high-resolution remote sensing VQA: zooming behavior becomes task-agnostic and fails to acquire the key visual evidence. Method: GeoEyes consists of (1) UHR Chain-of-Zoom (UHR-CoZ), a cold-start supervised fine-tuning dataset that covers diverse zooming regimes, and (2) AdaZoom-GRPO, a reinforcement learning method that explicitly rewards evidence gain and answer improvement. Result: GeoEyes achieves substantial improvements on UHR remote sensing benchmarks, reaching 54.23% accuracy on XLRS-Bench. Conclusion: Staged training can guide an MLLM toward task-driven visual exploration that starts and stops appropriately, overcoming tool homogenization and improving UHR remote sensing VQA. Abstract: The "thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substantial improvements on UHR remote sensing benchmarks, with 54.23% accuracy on XLRS-Bench.
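
The abstract does not spell out AdaZoom-GRPO's reward or update rule. As background only, the sketch below shows the group-relative advantage normalization that GRPO-style methods commonly use: several rollouts for the same question are scored, and each reward is standardized against its own group. The reward values are made-up placeholders, and how evidence gain and answer improvement are combined into a reward is not shown because the source does not say.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward against the mean and std of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every rollout receives the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Placeholder rewards for four zoom trajectories answering one question.
rewards = [0.0, 0.5, 1.0, 0.5]
advantages = group_relative_advantages(rewards)
```

Trajectories that beat their group's average get a positive advantage and are reinforced, so no separate value network is needed in this style of training.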

[190] HiVid: LLM-Guided Video Saliency For Content-Aware VOD And Live Streaming

Jiahui Chen,Bo Peng,Lianchen Jia,Zeyu Zhang,Tianchi Huang,Lifeng Sun

Main category: cs.CV

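TL;DR: HiVid is a new framework that uses large language models (LLMs) as a human proxy to generate high-quality, content-aware chunk-level importance weights for video-on-demand (VOD) and live streaming, improving weight prediction accuracy and the correlation with users' subjective quality of experience (QoE).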

Details

Motivation: Content-aware streaming needs dynamic, chunk-level importance weights to optimize subjective quality of experience (QoE), but human annotation is prohibitively expensive and existing visual saliency models generalize poorly. Method: HiVid has three core modules: (1) a perception module that assesses frames autoregressively within a local context window, working around LLM modality and token limits; (2) a ranking module that resolves inconsistent local ratings in VOD through global re-ranking with an LLM-guided merge-sort algorithm; (3) a prediction module for low-latency live streaming, a multi-modal time series model with content-aware attention and an adaptive horizon that predicts weights online without future information. Result: HiVid improves weight prediction accuracy over SOTA baselines by up to 11.5% for VOD and 26% for live streaming; a real-world user study shows a 14.7% gain in streaming QoE correlation. Conclusion: HiVid is the first to apply LLMs at scale to video importance modeling, balancing accuracy, real-time operation, and generalization, and offers a scalable, high-fidelity approach to QoE optimization for content-aware streaming. Abstract: Content-aware streaming requires dynamic, chunk-level importance weights to optimize subjective quality of experience (QoE). However, direct human annotation is prohibitively expensive while vision-saliency models generalize poorly. We introduce HiVid, the first framework to leverage Large Language Models (LLMs) as a scalable human proxy to generate high-fidelity weights for both Video-on-Demand (VOD) and live streaming. We address 3 non-trivial challenges: (1) To extend LLMs' limited modality and circumvent token limits, we propose a perception module to assess frames in a local context window, autoregressively building a coherent understanding of the video. (2) For VOD with rating inconsistency across local windows, we propose a ranking module to perform global re-ranking with a novel LLM-guided merge-sort algorithm. (3) For live streaming which requires low-latency, online inference without future knowledge, we propose a prediction module to predict future weights with a multi-modal time series model, which comprises a content-aware attention and adaptive horizon to accommodate asynchronous LLM inference. Extensive experiments show HiVid improves weight prediction accuracy by up to 11.5% for VOD and 26% for live streaming over SOTA baselines. Real-world user study validates HiVid boosts streaming QoE correlation by 14.7%.
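
The global re-ranking idea can be illustrated with an ordinary merge sort whose comparison is delegated to a pluggable judge. This is a generic sketch, not HiVid's algorithm: the abstract does not say how the LLM is queried, so `ranks_higher` is a hypothetical callback, replaced here by a lookup into made-up scores so the example runs without any model.

```python
def merge_sort(items, ranks_higher):
    """Sort items from most to least important using a pairwise judge.

    ranks_higher(a, b) returns True when a should be placed before b.
    """
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], ranks_higher)
    right = merge_sort(items[mid:], ranks_higher)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        # Take from the right only on a strict win, which keeps the sort stable.
        if ranks_higher(right[j], left[i]):
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


# Stand-in judge: made-up importance scores instead of an LLM comparison.
score = {"c0": 0.2, "c1": 0.9, "c2": 0.5, "c3": 0.7}
ranking = merge_sort(["c0", "c1", "c2", "c3"], lambda a, b: score[a] > score[b])
```

A merge sort needs on the order of n log n pairwise judgments, which matters when each judgment is a model call rather than a numeric comparison.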

[191] Freq-DP Net: A Dual-Branch Network for Fence Removal using Dual-Pixel and Fourier Priors

Kunal Swami,Sudha Velusamy,Chandra Sekhar Seelamantula

Main category: cs.CV

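TL;DR: This paper proposes Freq-DP Net, the first method to use dual-pixel (DP) sensors for single-image fence occlusion removal. It fuses a geometric prior from defocus disparity with a global structural prior of the fence and clearly outperforms existing methods.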

Details

Motivation: Existing methods perform poorly on static scenes or depend on motion cues from multiple frames, so they cannot handle fence occlusion in a single image effectively. Method: Freq-DP Net is a dual-branch network that models a geometric prior from defocus disparity with an explicit cost volume and a structural prior of the fence learned via Fast Fourier Convolution (FFC), then fuses the two with an attention mechanism for accurate fence segmentation. Result: On a newly built, diverse fence dataset, the method significantly outperforms strong general-purpose baselines and sets a new SOTA for single-image, DP-based fence removal. Conclusion: Freq-DP Net demonstrates the particular value of dual-pixel sensors for single-image occlusion removal and offers a new paradigm for hardware-aware image restoration. Abstract: Fence occlusions degrade visual quality and limit downstream computer vision applications, and removing them from a single image is a challenging task. Existing methods often fail on static scenes or require motion cues from multiple frames. To overcome these limitations, we introduce the first framework to leverage dual-pixel (DP) sensors for this problem. We propose Freq-DP Net, a novel dual-branch network that fuses two complementary priors: a geometric prior from defocus disparity, modeled using an explicit cost volume, and a structural prior of the fence's global pattern, learned via Fast Fourier Convolution (FFC). An attention mechanism intelligently merges these cues for highly accurate fence segmentation. To validate our approach, we build and release a diverse benchmark with different fence varieties. Experiments demonstrate that our method significantly outperforms strong general-purpose baselines, establishing a new state-of-the-art for single-image, DP-based fence removal.

[192] Learning Significant Persistent Homology Features for 3D Shape Understanding

Prachi Kudeshia,Jiju Poovvancheri

Main category: cs.CV

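TL;DR: This paper presents a new approach for bringing persistent homology (topological features) into 3D point cloud analysis. It builds topologically-enriched versions of ModelNet40 and ShapeNet and designs TopoGAT, an end-to-end learnable method for selecting significant persistent points, improving performance on classification and segmentation.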

Details

Motivation: Existing 3D point cloud benchmarks focus on geometry and neglect topological structure, although topology (such as persistent homology) provides complementary and stable shape descriptors; it needs to be brought into deep learning pipelines systematically. Method: (1) Builds topologically-enriched ModelNet40 and ShapeNet datasets by augmenting every point cloud with persistent homology features; (2) proposes TopoGAT, a graph-attention-based deep model that learns end to end to select significant persistent points from the raw point cloud and its topological signature. Result: TopoGAT is more stable and more discriminative than traditional hand-crafted statistical selection; integrating it into standard point cloud classification and part-segmentation pipelines improves accuracy and segmentation metrics (such as mIoU). Conclusion: Topologically-enriched datasets together with learnable significant-feature selection make persistent homology practical in real 3D deep learning workflows and provide a new basis for joint geometry-topology modeling. Abstract: Geometry and topology constitute complementary descriptors of three-dimensional shape, yet existing benchmark datasets primarily capture geometric information while neglecting topological structure. This work addresses this limitation by introducing topologically-enriched versions of ModelNet40 and ShapeNet, where each point cloud is augmented with its corresponding persistent homology features. These benchmarks with the topological signatures establish a foundation for unified geometry-topology learning and enable systematic evaluation of topology-aware deep learning architectures for 3D shape analysis. Building on this foundation, we propose a deep learning-based significant persistent point selection method, TopoGAT, that learns to identify the most informative topological features directly from input data and the corresponding topological signatures, circumventing the limitations of hand-crafted statistical selection criteria. A comparative study verifies the superiority of the proposed method over traditional statistical approaches in terms of stability and discriminative power. Integrating the selected significant persistent points into standard point cloud classification and part-segmentation pipelines yields improvements in both classification accuracy and segmentation metrics. The presented topologically-enriched datasets, coupled with our learnable significant feature selection approach, enable the broader integration of persistent homology into the practical deep learning workflows for 3D point cloud analysis.

[193] Dual-Signal Adaptive KV-Cache Optimization for Long-Form Video Understanding in Vision-Language Models

Vishnu Sai,Dheeraj Sai,Srinath B,Girish Varma,Priyesh Shukla

Main category: cs.CV

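TL;DR: This paper proposes Sali-Cache, a framework that uses two signals, optical flow and saliency detection, to identify redundancy in advance and compress the KV cache proactively, cutting memory use by 2.2x while maintaining 100% accuracy.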

Details

Motivation: Vision-Language Models hit a memory bottleneck on long videos because the KV cache grows linearly with sequence length, and existing reactive caching strategies are computationally expensive and inefficient. Method: Sali-Cache combines a temporal filter based on optical flow (to remove inter-frame redundancy) with a spatial filter based on saliency detection (to keep visually important regions), and optimizes the cache proactively before attention is computed. Result: On LLaVA 1.6, it achieves a 2.20x memory compression ratio while BLEU, ROUGE-L, and Exact Match all stay at 100%; under the same memory budget, it preserves long temporal context without degrading performance. Conclusion: Sali-Cache is an efficient, lossless, anticipatory caching mechanism that makes long-video processing with VLMs more practical and more compatible with available hardware. Abstract: Vision-Language Models (VLMs) face a critical memory bottleneck when processing long-form video content due to the linear growth of the Key-Value (KV) cache with sequence length. Existing solutions predominantly employ reactive eviction strategies that compute full attention matrices before discarding tokens, resulting in substantial computational waste. We propose Sali-Cache, a novel a priori optimization framework that implements dual-signal adaptive caching through proactive memory management. By integrating a temporal filter based on optical flow analysis for detecting inter-frame redundancy and a spatial filter leveraging saliency detection for identifying visually significant regions, Sali-Cache intelligently manages memory allocation before entering computationally expensive attention operations. Experimental evaluation on the LLaVA 1.6 architecture demonstrates that our method achieves a 2.20x compression ratio in effective memory usage while maintaining 100% accuracy across BLEU, ROUGE-L, and Exact Match metrics. Furthermore, under identical memory budget constraints, Sali-Cache preserves context-rich features over extended temporal durations without degrading model performance, enabling efficient processing of long-form video content on consumer-grade hardware.

[194] AbracADDbra: Touch-Guided Object Addition by Decoupling Placement and Editing Subtasks

Kunal Swami,Raghu Chittersu,Yuvraj Rathore,Rajeev Irny,Shashavali Doodekula,Alok Shukla

Main category: cs.CV

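TL;DR: This paper proposes AbracADDbra, a framework that combines touch priors with succinct text instructions for precise object addition. A vision-language transformer handles placement and a diffusion model generates the object and its mask, improving both edit quality and usability.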

Details

Motivation: Instruction-based object addition suffers from a usability gap: text-only prompts are ambiguous and mask-based inputs are tedious. Method: AbracADDbra uses a vision-language transformer for touch-guided placement, followed by a diffusion model that jointly generates the target object and an instance mask; the Touch2Add benchmark is contributed for standardized evaluation. Result: The placement model significantly outperforms both random placement and general-purpose VLM baselines, and initial placement accuracy correlates strongly with final edit quality. Conclusion: The work opens a path toward more accessible and efficient creative editing tools. Abstract: Instruction-based object addition is often hindered by the ambiguity of text-only prompts or the tedious nature of mask-based inputs. To address this usability gap, we introduce AbracADDbra, a user-friendly framework that leverages intuitive touch priors to spatially ground succinct instructions for precise placement. Our efficient, decoupled architecture uses a vision-language transformer for touch-guided placement, followed by a diffusion model that jointly generates the object and an instance mask for high-fidelity blending. To facilitate standardized evaluation, we contribute the Touch2Add benchmark for this interactive task. Our extensive evaluations, where our placement model significantly outperforms both random placement and general-purpose VLM baselines, confirm the framework's ability to produce high-fidelity edits. Furthermore, our analysis reveals a strong correlation between initial placement accuracy and final edit quality, validating our decoupled approach. This work thus paves the way for more accessible and efficient creative tools.

[195] Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

A. Said Gurbuz,Sunghwan Hong,Ahmed Nassar,Marc Pollefeys,Peter Staar

Main category: cs.CV

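TL;DR: This paper introduces the ScreenParse dataset and the ScreenVLM model to address the sparse annotation, limited diversity, and low deployment efficiency of existing screen-parsing datasets. ScreenParse provides dense annotations for 771K web screenshots, and ScreenVLM is a lightweight 316M-parameter vision-language model that substantially outperforms much larger foundation VLMs on dense parsing and transfers well.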

Details

Motivation: Existing screen-parsing datasets have sparse annotations, limited coverage, and poor generalization, and they do not meet the low-latency, on-device requirements of practical deployment. Method: Builds ScreenParse, a large-scale densely annotated dataset (771K web screenshots, 21M UI elements), generated by the automated Webshot pipeline with VLM-based relabeling and quality filtering; trains the lightweight ScreenVLM on it, decoding the ScreenTag markup representation with a structure-aware loss. Result: ScreenVLM reaches 0.592 PageIoU on ScreenParse, far above larger foundation VLMs (0.294); it transfers well to public benchmarks; finetuning foundation VLMs on ScreenParse also improves their UI grounding performance. Conclusion: Dense screen-parsing supervision provides transferable structural priors, and ScreenParse and ScreenVLM lay key groundwork for efficient, robust computer-use agents. Abstract: Modern computer-use agents (CUA) must perceive a screen as a structured state (what elements are visible, where they are, and what text they contain) before they can reliably ground instructions and act. Yet, most available grounding datasets provide sparse supervision, with insufficient and low-diversity labels that annotate only a small subset of task-relevant elements per screen, which limits both coverage and generalization; moreover, practical deployment requires efficiency to enable low-latency, on-device use. We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse URLs, extracts annotations, and applies VLM-based relabeling and quality filtering. Using ScreenParse, we train ScreenVLM, a compact, 316M-parameter vision language model (VLM) that decodes a compact ScreenTag markup representation with a structure-aware loss that upweights structure-critical tokens. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.592 vs. 0.294 PageIoU on ScreenParse) and shows strong transfer to public benchmarks. Moreover, finetuning foundation VLMs on ScreenParse consistently improves their grounding performance, suggesting that dense screen supervision provides transferable structural priors for UI understanding. Project page: https://saidgurbuz.github.io/screenparse/.

[196] Differential pose optimization in descriptor space -- Combining Geometric and Photometric Methods for Motion Estimation

Andreas L. Teigen,Annette Stahl,Rudolf Mester

Main category: cs.CV

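TL;DR: This paper proposes a new two-frame relative pose optimization method that combines the strengths of photometric and re-projection error by replacing the photometric error with densely sampled geometric feature descriptors. Experiments show its accuracy still does not surpass re-projection-error-based methods, and the analysis attributes this mainly to the slowly varying descriptor similarity metric.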

Details

Motivation: Addresses two-frame relative pose optimization in computer vision, weighing the trade-offs between photometric and re-projection error in accuracy, robustness, and the possibility of loop closing, and explores a method that combines the strengths of both. Method: Uses densely sampled geometric feature descriptors and replaces the conventional photometric error with a descriptor residual, retaining the sub-pixel accuracy of differential photometric methods while adding the expressiveness of geometric feature descriptors. Result: The method tracks accurately but does not outperform pose optimization based on re-projection error; further analysis indicates that the descriptor similarity metric varies slowly and does not strictly correspond to keypoint placement accuracy. Conclusion: Although combining photometric and geometric features is appealing, current descriptor design limits the attainable pose-optimization performance, and the link between the similarity metric and keypoint accuracy needs improvement. Abstract: One of the fundamental problems in computer vision is the two-frame relative pose optimization problem. Primarily, two different kinds of error values are used: photometric error and re-projection error. The selection of error value is usually directly dependent on the selection of feature paradigm, photometric features, or geometric features. It is a trade-off between accuracy, robustness, and the possibility of loop closing. We investigate a third method that combines the strengths of both paradigms into a unified approach. Using densely sampled geometric feature descriptors, we replace the photometric error with a descriptor residual from a dense set of descriptors, thereby enabling the employment of sub-pixel accuracy in differential photometric methods, along with the expressiveness of the geometric feature descriptor. Experiments show that although the proposed strategy is an interesting approach that results in accurate tracking, it ultimately does not outperform pose optimization strategies based on re-projection error despite utilizing more information. We proceed to analyze the underlying reason for this discrepancy and present the hypothesis that the descriptor similarity metric is too slowly varying and does not necessarily correspond strictly to keypoint placement accuracy.

[197] A Generative AI Approach for Reducing Skin Tone Bias in Skin Cancer Classification

Areez Muhammed Shabu,Mohammad Samar Ansari,Asra Aslam

Main category: cs.CV

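TL;DR: This paper proposes a generative data augmentation method based on LoRA fine-tuning of Stable Diffusion to mitigate model bias in skin cancer diagnosis caused by the skewed skin-tone distribution of the ISIC dataset (>70% light skin, <8% dark skin). It significantly improves lesion segmentation and classification for darker skin tones and strengthens the fairness of medical AI.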

Details

Motivation: Existing AI skin-cancer diagnostic tools are trained on data with far too few dark-skin samples (fewer than 8% of ISIC images), which leads to low diagnostic accuracy and poor fairness for people with darker skin; greater skin-tone diversity in medical imaging is urgently needed. Method: Proposes a generative augmentation pipeline that fine-tunes a pre-trained Stable Diffusion model with LoRA on the dark-skin subset of ISIC to generate synthetic dermoscopic images conditioned on lesion type and skin tone, and uses the generated data for the downstream tasks of lesion segmentation and binary classification. Result: For segmentation, models trained with augmented data improve IoU, Dice coefficient, and boundary accuracy on real test images; for classification, an EfficientNet-B0 model trained on the augmented data reaches 92.14% accuracy. Conclusion: Generative-AI-driven data augmentation can effectively mitigate skin-tone bias in skin cancer diagnosis, improving model fairness and generalization and offering a feasible path toward more inclusive medical AI. Abstract: Skin cancer is one of the most common cancers worldwide and early detection is critical for effective treatment. However, current AI diagnostic tools are often trained on datasets dominated by lighter skin tones, leading to reduced accuracy and fairness for people with darker skin. The International Skin Imaging Collaboration (ISIC) dataset, one of the most widely used benchmarks, contains over 70% light-skin images, while dark-skin images account for fewer than 8%. This imbalance poses a significant barrier to equitable healthcare delivery and highlights the urgent need for methods that address demographic diversity in medical imaging. This paper addresses this challenge of skin tone imbalance in automated skin cancer detection using dermoscopic images. To overcome this, we present a generative augmentation pipeline that fine-tunes a pre-trained Stable Diffusion model using Low-Rank Adaptation (LoRA) on the dark-skin image subset of the ISIC dataset and generates synthetic dermoscopic images conditioned on lesion type and skin tone. In this study, we investigated the utility of these images on two downstream tasks: lesion segmentation and binary classification. For segmentation, models trained on the augmented dataset and evaluated on held-out real images show consistent improvements in IoU, Dice coefficient, and boundary accuracy. These evaluations verify the utility of the generated dataset. For classification, an EfficientNet-B0 model trained on the augmented dataset achieved 92.14% accuracy.
This paper demonstrates that synthetic data augmentation with generative AI can substantially reduce bias and increase fairness in conventional dermatological diagnostics, and it points to open challenges for future work.
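
For readers unfamiliar with the segmentation metrics named above, the following minimal sketch computes IoU and the Dice coefficient for two binary masks using their standard definitions. The flat 0/1 list representation and the function name are illustrative assumptions; the paper's own evaluation code is not described in the abstract.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two equal-length flat binary masks (0/1 values)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty; treating this as a perfect match is a convention.
        return 1.0, 1.0
    iou = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return iou, dice


# Toy 4-pixel example: one overlapping pixel, three pixels in the union.
iou, dice = iou_and_dice([1, 1, 0, 0], [1, 0, 1, 0])
print(iou, dice)
```

The two metrics are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank models identically on a single image but can differ once averaged over a dataset.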

[198] Image-based Joint-level Detection for Inflammation in Rheumatoid Arthritis from Small and Imbalanced Data

Shun Kato,Yasushi Kondo,Shuntaro Saito,Yoshimitsu Aoki,Mariko Isogawa

Main category: cs.CV

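TL;DR: This paper proposes a framework for detecting rheumatoid arthritis (RA) inflammation from RGB hand images, using self-supervised pretraining and imbalance-aware training to improve F1-score and G-mean.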

Details

Motivation: Early diagnosis and continuous monitoring of rheumatoid arthritis (RA) are essential, but patients often cannot reach specialist care promptly; a convenient method for detecting joint inflammation from RGB images taken at home is needed. Method: Builds a dedicated dataset to quantify the difficulty of visual detection; proposes a detection framework with a global-local encoder that combines self-supervised pretraining on large-scale healthy hand images with an imbalance-aware training strategy. Result: In experiments, the method improves F1-score by 0.2 and G-mean by 0.25 over the baseline model. Conclusion: The work is the first to systematically tackle sample scarcity and class imbalance in RA inflammation detection from RGB images, and it validates the effectiveness and practicality of the proposed framework. Abstract: Rheumatoid arthritis (RA) is an autoimmune disease characterized by systemic joint inflammation. Early diagnosis and tight follow-up are essential to the management of RA, as ongoing inflammation can cause irreversible joint damage. The detection of arthritis is important for diagnosis and assessment of disease activity; however, it often takes a long time for patients to receive appropriate specialist care. Therefore, there is a strong need to develop systems that can detect joint inflammation easily using RGB images captured at home. Consequently, we tackle the task of RA inflammation detection from RGB hand images. This task is highly challenging due to general issues in medical imaging, such as the scarcity of positive samples, data imbalance, and the inherent difficulty of the task itself. However, to the best of our knowledge, no existing work has explicitly addressed these challenges in RGB-based RA inflammation detection. This paper quantitatively demonstrates the difficulty of visually detecting inflammation by constructing a dedicated dataset, and we propose an inflammation detection framework with a global-local encoder that combines self-supervised pretraining on large-scale healthy hand images with imbalance-aware training to detect RA-related joint inflammation from RGB hand images. Our experiments demonstrated that the proposed approach improves F1-score by 0.2 points and G-mean by 0.25 points compared with the baseline model.
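
The two metrics reported above are commonly used for imbalanced binary problems. The sketch below computes them from confusion-matrix counts, assuming the usual definitions (F1 as the harmonic mean of precision and recall, G-mean as the geometric mean of sensitivity and specificity); the abstract does not spell out its exact definitions, and the counts are invented for illustration.

```python
import math


def f1_and_gmean(tp, fp, fn, tn):
    """Return (F1, G-mean) from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # also called sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# A classifier that always predicts "no inflammation" on a 10/90 split:
# accuracy is 0.90, yet both F1 and G-mean are 0.
print(f1_and_gmean(tp=0, fp=0, fn=10, tn=90))
```

This is why such metrics, not plain accuracy, are the natural choice when positive samples are scarce.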

[199] Event-based Visual Deformation Measurement

Yuliang Wu,Wei Zhai,Yuxin Cui,Tiesong Zhao,Yang Cao,Zheng-Jun Zha

Main category: cs.CV

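TL;DR: This paper proposes a visual deformation measurement method that combines an event camera with frame images; using an Affine Invariant Simplicial (AIS) framework and a neighborhood-greedy optimization strategy, it maintains accuracy while greatly reducing data storage and computation.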

Details

Motivation: Traditional image-based deformation measurement assumes small inter-frame motion and struggles with highly dynamic scenes, while high-speed video solutions impose heavy storage and computational burdens. Method: Proposes an event-frame fusion framework in which the event stream supplies motion cues at high temporal resolution and frames supply spatially precise estimates; introduces the AIS framework, based on a solid elastic modeling prior, which partitions the deformation field into low-parameter linearized sub-regions; designs a neighborhood-greedy optimization strategy to speed up convergence and suppress error accumulation. Result: On a self-built benchmark of more than 120 sequences, survival rate improves by 1.6% over the state of the art while using only 18.9% of the storage and computing resources of high-speed video methods. Conclusion: The method strikes a better balance among accuracy, efficiency, and practicality, offering a lightweight and efficient approach to dynamic deformation measurement. Abstract: Visual Deformation Measurement (VDM) aims to recover dense deformation fields by tracking surface motion from camera observations. Traditional image-based methods rely on minimal inter-frame motion to constrain the correspondence search space, which limits their applicability to highly dynamic scenes or necessitates high-speed cameras at the cost of prohibitive storage and computational overhead. We propose an event-frame fusion framework that exploits events for temporally dense motion cues and frames for spatially dense precise estimation. Revisiting the solid elastic modeling prior, we propose an Affine Invariant Simplicial (AIS) framework. It partitions the deformation field into linearized sub-regions with low-parametric representation, effectively mitigating motion ambiguities arising from sparse and noisy events. To speed up parameter searching and reduce error accumulation, a neighborhood-greedy optimization strategy is introduced, enabling well-converged sub-regions to guide their poorly-converged neighbors and effectively suppressing local error accumulation in long-term dense tracking. To evaluate the proposed method, a benchmark dataset with temporally aligned event streams and frames is established, encompassing over 120 sequences spanning diverse deformation scenarios. Experimental results show that our method outperforms the state-of-the-art baseline by 1.6% in survival rate. Remarkably, it achieves this using only 18.9% of the data storage and processing resources of high-speed video methods.

[200] Adapting VACE for Real-Time Autoregressive Video Diffusion

Ryan Fosdick

Main category: cs.CV

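TL;DR: This paper adapts VACE for real-time autoregressive video generation by moving reference frames into a parallel conditioning pathway, which avoids full-sequence bidirectional attention and so supports streaming and KV caching without additional training, at the cost of reduced reference-to-video fidelity.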

Details

Motivation: The original VACE relies on bidirectional attention over full sequences and cannot fit streaming video generation pipelines that require fixed chunk sizes and causal attention. Method: Moves reference frames from the diffusion latent space into a parallel conditioning pathway, preserving fixed chunk sizes and KV caching and reusing pretrained VACE weights. Result: On 1.3B and 14B models, structural control and inpainting add 20-30% latency overhead with negligible VRAM cost; reference-to-video fidelity, however, degrades markedly because of causal attention constraints. Conclusion: The adaptation makes VACE feasible for real-time autoregressive video generation, trading generation quality for efficiency and providing a practical route to controllable streaming video generation. Abstract: We describe an adaptation of VACE (Video All-in-one Creation and Editing) for real-time autoregressive video generation. VACE provides unified video control (reference guidance, structural conditioning, inpainting, and temporal extension) but assumes bidirectional attention over full sequences, making it incompatible with streaming pipelines that require fixed chunk sizes and causal attention. The key modification moves reference frames from the diffusion latent space into a parallel conditioning pathway, preserving the fixed chunk sizes and KV caching that autoregressive models require. This adaptation reuses existing pretrained VACE weights without additional training. Across 1.3B and 14B model scales, VACE adds 20-30% latency overhead for structural control and inpainting, with negligible VRAM cost relative to the base model. Reference-to-video fidelity is severely degraded compared to batch VACE due to causal attention constraints. A reference implementation is available at https://github.com/daydreamlive/scope.

[201] Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models

In Chong Choi,Jiacheng Zhang,Feng Liu,Yiliao Song

Main category: cs.CV

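TL;DR: This paper proposes MAPA, a multi-turn adaptive prompting attack for bypassing the safety defenses of large vision-language models (LVLMs). It alternates text and vision attack actions within each turn and iteratively refines the attack trajectory across turns, substantially raising attack success rates.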

Details

Motivation: Existing multi-turn jailbreak attacks on LVLMs are easily detected and blocked by safety-aligned models once visual inputs are introduced, so a stealthier, adaptive attack strategy is needed. Method: MAPA uses a two-level design: 1) within each turn, it alternates text and vision attack actions to elicit the most malicious response; 2) across turns, it adjusts the attack trajectory through iterative back-and-forth refinement to gradually amplify response maliciousness. Result: MAPA clearly outperforms state-of-the-art methods on several mainstream LVLMs (e.g., LLaVA-V1.6-Mistral-7B, Qwen2.5-VL-7B), improving attack success rates by 11-35%. Conclusion: MAPA shows that coordinated, adaptive attacks combining text and vision are effective at breaking LVLM safety defenses, offering a new angle for evaluating and improving the safety of multimodal models. Abstract: Multi-turn jailbreak attacks are effective against text-only large language models (LLMs) by gradually introducing malicious content across turns. When extended to large vision-language models (LVLMs), we find that naively adding visual inputs can cause existing multi-turn jailbreaks to be easily defended. For example, overly malicious visual input will easily trigger the defense mechanism of safety-aligned LVLMs, making the response more conservative. To address this, we propose MAPA: a multi-turn adaptive prompting attack that 1) at each turn, alternates text-vision attack actions to elicit the most malicious response; and 2) across turns, adjusts the attack trajectory through iterative back-and-forth refinement to gradually amplify response maliciousness. This two-level design enables MAPA to consistently outperform state-of-the-art methods, improving attack success rates by 11-35% on recent benchmarks against LLaVA-V1.6-Mistral-7B, Qwen2.5-VL-7B-Instruct, Llama-3.2-Vision-11B-Instruct and GPT-4o-mini.

Qingqian Yang,Hao Wang,Sai Qian Zhang,Jian Li,Yang Hua,Miao Pan,Tao Song,Zhengwei Qi,Haibing Guan

Main category: cs.CV

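TL;DR: This paper proposes pFedNavi, a structure-aware, dynamically adaptive personalized federated learning framework for vision-language navigation (VLN). It identifies client-specific layers through layer-wise mixing coefficients and fuses their parameters at fine granularity, significantly improving navigation performance and convergence speed while preserving privacy.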

Details

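Motivation: VLN requires large amounts of trajectory-instruction data from private indoor environments, raising serious privacy concerns, while conventional federated learning (FL) struggles to train a good global model under VLN's extreme cross-client heterogeneity (large differences in environments and instruction styles). Method: Proposes the pFedNavi framework, which adaptively identifies client-critical layers (such as the encoder-decoder projection and environment-sensitive decoder layers) through layer-wise mixing coefficients and applies fine-grained parameter fusion to those layers, balancing global knowledge sharing with local specialization. Result: On the two standard VLN benchmarks R2R and RxR, with ResNet and CLIP visual representations, pFedNavi outperforms the FedAvg baseline across the board: up to 7.5% higher navigation success rate, up to 7.8% higher trajectory fidelity, and 1.38x faster convergence under non-IID conditions. Conclusion: pFedNavi eases the tension between privacy protection and model performance in VLN and demonstrates the advantages and feasibility of structure-aware personalized federated learning for highly heterogeneous multimodal navigation tasks. Abstract: Vision-Language Navigation (VLN) requires large-scale trajectory instruction data from private indoor environments, raising significant privacy concerns. Federated Learning (FL) mitigates this by keeping data on-device, but vanilla FL struggles under VLN's extreme cross-client heterogeneity in environments and instruction styles, making a single global model suboptimal. This paper proposes pFedNavi, a structure-aware and dynamically adaptive personalized federated learning framework tailored for VLN. Our key idea is to personalize where it matters: pFedNavi adaptively identifies client-specific layers via layer-wise mixing coefficients, and performs fine-grained parameter fusion on the selected components (e.g., the encoder-decoder projection and environment-sensitive decoder layers) to balance global knowledge sharing with local specialization. We evaluate pFedNavi on two standard VLN benchmarks, R2R and RxR, using both ResNet and CLIP visual representations. Across all metrics, pFedNavi consistently outperforms the FedAvg-based VLN baseline, achieving up to 7.5% improvement in navigation success rate and up to 7.8% gain in trajectory fidelity, while converging 1.38x faster under non-IID conditions.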
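
One simple way to picture layer-wise mixing coefficients is as a per-layer convex combination of local and global weights. The sketch below is only that generic idea: the abstract does not give pFedNavi's actual fusion rule or how the coefficients are learned, so the formula, the layer names, and the fixed coefficient values are all assumptions made for illustration.

```python
def mix_layers(local, global_, alpha):
    """Blend weights per layer: alpha * local + (1 - alpha) * global.

    local, global_: dict mapping layer name -> list of floats (same shapes).
    alpha: dict mapping layer name -> coefficient in [0, 1].
           alpha = 1 keeps the layer fully personalized,
           alpha = 0 adopts the shared global layer.
    """
    mixed = {}
    for name, local_w in local.items():
        a = alpha[name]
        mixed[name] = [a * l + (1 - a) * g for l, g in zip(local_w, global_[name])]
    return mixed


# Hypothetical layer names and coefficients.
local = {"visual_encoder": [1.0, 2.0], "enc_dec_projection": [2.0], "decoder": [3.0]}
global_ = {"visual_encoder": [0.0, 0.0], "enc_dec_projection": [4.0], "decoder": [1.0]}
alpha = {"visual_encoder": 0.0, "enc_dec_projection": 0.5, "decoder": 1.0}
print(mix_layers(local, global_, alpha))
```

In a real system the coefficients would be adapted per client during training; here they are constants so the blending step can be read in isolation.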

[203] Feature Recalibration Based Olfactory-Visual Multimodal Model for Fine-Grained Rice Deterioration Detection

Rongqiang Zhao,Hengrui Hu,Yijing Wang,Mingchun Sun,Jie Liu

Main category: cs.CV

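TL;DR: This paper proposes a feature-recalibration-based olfactory-visual multimodal model for fine-grained rice deterioration detection. Using FDEC and FDRA-Net to improve feature representation and sensitivity, it surpasses existing methods in accuracy (99.89%) and operational simplicity and could extend to the wider agrifood domain.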

Details

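Motivation: Existing multimodal methods for rice deterioration detection have limited ability to represent and extract fine-grained abnormal features and rely on costly equipment (such as hyperspectral cameras and mass spectrometers), which raises detection costs and lengthens data acquisition. Method: Proposes a feature-recalibration-based olfactory-visual multimodal model: 1) the fine-grained deterioration embedding constructor (FDEC) reconstructs the labeled multimodal embedded-feature dataset to enhance sample representation; 2) the fine-grained deterioration recalibration attention network (FDRA-Net) emphasizes signal variations and increases sensitivity to fine-grained deterioration on the rice surface. Result: Experiments show a classification accuracy of 99.89%, improving detection accuracy and simplifying the procedure relative to SOTA methods; field detection confirms its accuracy and ease of operation. Conclusion: The method addresses both the weak feature representation and the equipment dependence of fine-grained rice deterioration detection, combines high accuracy with easy deployment, and can be extended to other agricultural and food products. Abstract: Multimodal methods are widely used in rice deterioration detection, but they exhibit limited capability in representing and extracting fine-grained abnormal features. Moreover, these methods rely on devices, such as hyperspectral cameras and mass spectrometers, increasing detection costs and prolonging data acquisition time. To address these issues, we propose a feature recalibration based olfactory-visual multimodal model for fine-grained rice deterioration detection. The fine-grained deterioration embedding constructor (FDEC) is proposed to reconstruct the labeled multimodal embedded-feature dataset, enhancing sample representation. The fine-grained deterioration recalibration attention network (FDRA-Net) is proposed to emphasize signal variations and increase sensitivity to fine-grained deterioration on the rice surface. Experiments show that the proposed method achieves a classification accuracy of 99.89%. Compared with state-of-the-art methods, the detection accuracy is improved and the procedure is simplified. Furthermore, field detection demonstrates the advantages of accuracy and operational simplicity. The proposed method can also be extended to other agrifood products in the agriculture and food industries.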

[204] Learning Proposes, Geometry Disposes: A Modular Framework for Efficient Spatial Reasoning

Haichao Zhu,Zhaorui Yang,Qian Zhang

Main category: cs.CV

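TL;DR: This paper studies how learning modules and geometric algorithms should cooperate in spatial perception and proposes an end-to-end modular framework in which learning modules generate geometric hypotheses (such as pose and depth) and geometric algorithms make the final decisions (such as ICP optimization). Experiments show that relying on learning alone is unreliable, whereas geometrically aligning the learned outputs and passing them to a geometric module yields consistent gains in moderately challenging scenes.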

Details

Motivation: Learning-based methods have strong representational capacity, but it remains unsettled whether they should directly replace geometric estimation or serve as intermediate modules within geometric systems; this paper aims to fill that gap in design practice. Method: Proposes a modular learning-geometry framework that uses a learning model such as VGGT to generate relative pose and depth hypotheses and then applies classical point-to-plane ICP (RGB-D) for the geometric decision, stressing that learned outputs must be geometrically aligned with the camera intrinsics. Result: On the TUM RGB-D dataset: (1) learning-based pose proposals alone are unreliable; (2) learned depth that is not aligned with the intrinsics degrades performance; (3) geometrically aligned learned depth followed by a geometric backend consistently improves performance in moderately difficult rigid scenes. Conclusion: Geometry should not be treated as mere post-hoc refinement but as an essential arbiter that validates and absorbs learning-based geometric observations; modular, geometry-aware system design is crucial for robust spatial perception. Abstract: Spatial perception aims to estimate camera motion and scene structure from visual observations, a problem traditionally addressed through geometric modeling and physical consistency constraints. Recent learning-based methods have demonstrated strong representational capacity for geometric perception and are increasingly used to augment classical geometry-centric systems in practice. However, whether learning components should directly replace geometric estimation or instead serve as intermediate modules within such pipelines remains an open question. In this work, we address this gap and investigate an end-to-end modular framework for effective spatial reasoning, where learning proposes geometric hypotheses, while geometric algorithms dispose estimation decisions. In particular, we study this principle in the context of relative camera pose estimation on RGB-D sequences. Using VGGT as a representative learning model, we evaluate learning-based pose and depth proposals under varying motion magnitudes and scene dynamics, followed by a classical point-to-plane RGB-D ICP as the geometric backend. Our experiments on the TUM RGB-D benchmark reveal three consistent findings: (1) learning-based pose proposals alone are unreliable; (2) learning-proposed geometry, when improperly aligned with camera intrinsics, can degrade performance; and (3) when learning-proposed depth is geometrically aligned and followed by a geometric disposal stage, consistent improvements emerge in moderately challenging rigid settings.
These results demonstrate that geometry is not merely a refinement component, but an essential arbiter that validates and absorbs learning-based geometric observations. Our study highlights the importance of modular, geometry-aware system design for robust spatial perception.
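
To make the geometric backend concrete, the sketch below evaluates the standard point-to-plane ICP residual, n · (R p + t − q), for a candidate pose and a set of already matched points. It shows only the error term that such a backend minimizes, not the paper's pipeline: correspondence search, the solver, and all point values here are assumptions made for illustration.

```python
def point_to_plane_residuals(src, dst, normals, R, t):
    """Signed point-to-plane distances for matched 3D points.

    src, dst: lists of (x, y, z) source and target points, matched by index.
    normals: unit surface normals at the target points.
    R: 3x3 rotation as nested lists; t: translation (x, y, z).
    """
    residuals = []
    for p, q, n in zip(src, dst, normals):
        # Apply the candidate pose to the source point.
        moved = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
        # Project the remaining offset onto the target normal.
        residuals.append(sum(n[i] * (moved[i] - q[i]) for i in range(3)))
    return residuals


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
src = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
dst = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0)]  # second match is offset within the plane
normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
res = point_to_plane_residuals(src, dst, normals, identity, (0.0, 0.0, 0.1))
print(res, sum(r * r for r in res))
```

Because only the offset along the normal is penalized, sliding along a flat surface costs nothing. A pose or depth proposal that is inconsistent with the scene geometry shows up as large residuals, which is one way to read the "arbiter" role the abstract assigns to geometry.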

[205] Understanding Sensor Vulnerabilities in Industrial XR Tracking

Sourya Saha,Md. Nurul Absur

Main category: cs.CV

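TL;DR: Through a controlled empirical study, this paper systematically analyzes the behavior of visual-inertial odometry (VIO) under sensor degradation in industrial environments. It finds that inertial degradation affects pose estimation error far more severely than visual degradation, in some cases causing trajectory deviations of hundreds to thousands of meters, and therefore argues that XR system design and evaluation should put more weight on inertial sensor reliability.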

Details

Motivation: Existing VIO evaluations mostly focus on ideal sensor behavior, whereas sensors in real industrial settings often degrade persistently; the impact of this is unclear and calls for systematic study. Method: Uses systematic fault injection to simulate degradation of the visual and inertial modalities separately and evaluates the effects quantitatively across a range of operating regimes. Result: Visual degradation typically leads to bounded, centimeter-level pose errors, whereas inertial degradation can cause large trajectory deviations of hundreds to thousands of meters, a pronounced asymmetry. Conclusion: Inertial sensor reliability deserves greater emphasis as a key consideration in designing and evaluating XR systems for real industrial settings. Abstract: Extended Reality (XR) systems deployed in industrial and operational settings rely on Visual--Inertial Odometry (VIO) for continuous six-degree-of-freedom pose tracking, yet these environments often involve sensing conditions that deviate from ideal assumptions. Despite this, most VIO evaluations emphasize nominal sensor behavior, leaving the effects of sustained sensor degradation under operational conditions insufficiently understood. This paper presents a controlled empirical study of VIO behavior under degraded sensing, examining faults affecting visual and inertial modalities across a range of operating regimes. Through systematic fault injection and quantitative evaluation, we observe a pronounced asymmetry in fault impact where degradations affecting visual sensing typically lead to bounded pose errors on the order of centimeters, whereas degradations affecting inertial sensing can induce substantially larger trajectory deviations, in some cases reaching hundreds to thousands of meters. These observations motivate greater emphasis on inertial reliability in the evaluation and design of XR systems for real-life industrial settings.

[206] Hierarchical Vision-Language Interaction for Facial Action Unit Detection

Yong Li,Yi Ren,Yizhe Zhang,Wenhua Zhang,Tianyi Zhang,Muyun Jiang,Guo-Sen Xie,Cuntai Guan

Main category: cs.CV

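TL;DR: This paper proposes HiVA, which uses a large language model to generate textual AU descriptions as semantic priors and combines an AU-aware dynamic graph with hierarchical cross-modal attention (DDCA and CDCA) to make facial action unit detection more discriminative and generalizable when annotated data is limited.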

Details

Motivation: Facial action unit (AU) detection struggles to learn discriminative, generalizable AU representations when annotated data is scarce. Method: Proposes Hierarchical Vision-language Interaction for AU Understanding (HiVA): 1) a large language model generates diverse, contextually rich textual AU descriptions; 2) an AU-aware dynamic graph module learns AU-specific visual representations; 3) a hierarchical cross-modal attention architecture combines the fine-grained Disentangled Dual Cross-Attention (DDCA) with Contextual Dual Cross-Attention (CDCA), which models global inter-AU dependencies. Result: HiVA consistently surpasses state-of-the-art methods on multiple benchmarks; qualitative analysis shows that it produces semantically clear activation patterns, supporting its ability to learn robust, interpretable cross-modal correspondences. Conclusion: By fusing multi-grained visual features with refined language semantics, HiVA significantly improves AU detection performance and interpretability, offering a new approach to facial behavior analysis with limited data. Abstract: Facial Action Unit (AU) detection seeks to recognize subtle facial muscle activations as defined by the Facial Action Coding System (FACS). A primary challenge in AU detection is the effective learning of discriminative and generalizable AU representations under conditions of limited annotated data. To address this, we propose a Hierarchical Vision-language Interaction for AU Understanding (HiVA) method, which leverages textual AU descriptions as semantic priors to guide and enhance AU detection. Specifically, HiVA employs a large language model to generate diverse and contextually rich AU descriptions to strengthen language-based representation learning. To capture both fine-grained and holistic vision-language associations, HiVA introduces an AU-aware dynamic graph module that facilitates the learning of AU-specific visual representations. These features are further integrated within a hierarchical cross-modal attention architecture comprising two complementary mechanisms: Disentangled Dual Cross-Attention (DDCA), which establishes fine-grained, AU-specific interactions between visual and textual features, and Contextual Dual Cross-Attention (CDCA), which models global inter-AU dependencies. This collaborative, cross-modal learning paradigm enables HiVA to leverage multi-grained vision-based AU features in conjunction with refined language-based AU details, culminating in robust and semantically enriched AU detection capabilities. Extensive experiments show that HiVA consistently surpasses state-of-the-art approaches.
Besides, qualitative analyses reveal that HiVA produces semantically meaningful activation patterns, highlighting its efficacy in learning robust and interpretable cross-modal correspondences for comprehensive facial behavior analysis.

[207] D-SECURE: Dual-Source Evidence Combination for Unified Reasoning in Misinformation Detection

Gagandeep Singh,Samudi Amarasinghe,Priyanka Singh

Main category: cs.CV

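TL;DR: This paper proposes D-SECURE, a framework that fuses internal manipulation detection (HAMMER) with external evidence-based reasoning (DEFAME) to improve multimodal misinformation detection, especially for content whose image and text are mutually consistent but factually wrong or subtly manipulated.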

Details

Motivation: Existing systems rely either on analyzing the content itself (which misses global factual errors) or on external retrieval alone (which misses subtle pixel-level or text-level manipulations); keeping the two separate leads to missed detections. Method: Integrates the HAMMER manipulation detector with the DEFAME retrieval-and-verification pipeline: DEFAME performs broad fact-checking, and HAMMER analyzes fine-grained edits in the residual or uncertain cases that DEFAME leaves. Result: Experiments on DGM4 and ClaimReview show that the two modules are complementary; the combined framework produces a unified, explainable report that merges manipulation cues with external evidence. Conclusion: By combining internal and external evidence with a clear division of labor, D-SECURE compensates for the weaknesses of single-source detection and improves the robustness and explainability of misinformation detection for multimodal news-style posts. Abstract: Multimodal misinformation increasingly mixes realistic image edits with fluent but misleading text, producing persuasive posts that are difficult to verify. Existing systems usually rely on a single evidence source. Content-based detectors identify local inconsistencies within an image and its caption but cannot determine global factual truth. Retrieval-based fact-checkers reason over external evidence but treat inputs as coarse claims and often miss subtle visual or textual manipulations. This separation creates failure cases where internally consistent fabrications bypass manipulation detectors and fact-checkers verify claims that contain pixel-level or token-level corruption. We present D-SECURE, a framework that combines internal manipulation detection with external evidence-based reasoning for news-style posts. D-SECURE integrates the HAMMER manipulation detector with the DEFAME retrieval pipeline. DEFAME performs broad verification, and HAMMER analyses residual or uncertain cases that may contain fine-grained edits. Experiments on DGM4 and ClaimReview samples highlight the complementary strengths of both systems and motivate their fusion. We provide a unified, explainable report that incorporates manipulation cues and external evidence.

[208] Controlling Your Image via Simplified Vector Graphics

Lanqing Guo,Xi Liu,Yufei Wang,Zhihao Li,Siyu Huang

Main category: cs.CV

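TL;DR: This paper proposes a layer-wise controllable image generation method based on simplified vector graphics (VGs). It parses images into semantically aligned, structurally coherent VG representations and uses them to guide image synthesis, enabling fine, element-level editing of shapes, colors, and objects with high-fidelity output.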

Details

Motivation: Existing image generation methods achieve high visual quality but lack intuitive, controllable editing at the element level (such as shapes, colors, and adding or removing objects). Method: First parses images efficiently into hierarchical vector graphics (VG) representations that are semantically aligned and structurally coherent; then builds a VG-guided image synthesis framework that combines the structural and semantic features of VGs with noise prediction, so users can freely modify VG elements and generate realistic images. Result: Performs well on image editing, object-level manipulation, and fine-grained content generation, supporting the method's effectiveness and generality. Conclusion: The work establishes a vector-graphics-based paradigm for controllable image generation that markedly improves precise control over geometry, color, and semantics. Abstract: Recent advances in image generation have achieved remarkable visual quality, while a fundamental challenge remains: Can image generation be controlled at the element level, enabling intuitive modifications such as adjusting shapes, altering colors, or adding and removing objects? In this work, we address this challenge by introducing layer-wise controllable generation through simplified vector graphics (VGs). Our approach first efficiently parses images into hierarchical VG representations that are semantically aligned and structurally coherent. Building on this representation, we design a novel image synthesis framework guided by VGs, allowing users to freely modify elements and seamlessly translate these edits into photorealistic outputs. By leveraging the structural and semantic features of VGs in conjunction with noise prediction, our method provides precise control over geometry, color, and object semantics. Extensive experiments demonstrate the effectiveness of our approach in diverse applications, including image editing, object-level manipulation, and fine-grained content creation, establishing a new paradigm for controllable image generation. Project page: https://guolanqing.github.io/Vec2Pix/

[209] CoCoDiff: Correspondence-Consistent Diffusion Model for Fine-grained Style Transfer

Wenbo Nie,Zixiang Li,Renshuai Tao,Bin Wu,Yunchao Wei,Yao Zhao

Main category: cs.CV

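TL;DR: CoCoDiff is a training-free, low-cost image style transfer framework that uses pretrained latent diffusion models for fine-grained, semantically consistent stylization; a pixel-wise semantic correspondence module and a cycle-consistency module help preserve geometry and detail at the region and object level.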

Details

Motivation: Most existing style transfer methods operate at the global level and overlook region-wise and even pixel-wise semantic correspondence, which causes semantic inconsistency and loss of detail. Method: Proposes the CoCoDiff framework: 1) it builds on pretrained latent diffusion models; 2) a pixel-wise semantic correspondence module constructs a dense alignment map between content and style images from intermediate diffusion features; 3) a cycle-consistency module enforces structural and perceptual alignment across iterations. Result: Without additional training or supervision, CoCoDiff reaches state-of-the-art visual quality and quantitative results, outperforming methods that rely on extra training or annotations. Conclusion: The correspondence cues contained in pretrained diffusion models can be mined effectively, enabling high-fidelity, semantically consistent fine-grained style transfer without training. Abstract: Transferring visual style between images while preserving semantic correspondence between similar objects remains a central challenge in computer vision. While existing methods have made great strides, most of them operate at the global level but overlook region-wise and even pixel-wise semantic correspondence. To address this, we propose CoCoDiff, a novel training-free and low-cost style transfer framework that leverages pretrained latent diffusion models to achieve fine-grained, semantically consistent stylization. We identify that correspondence cues within generative diffusion models are under-explored and that content consistency across semantically matched regions is often neglected. CoCoDiff introduces a pixel-wise semantic correspondence module that mines intermediate diffusion features to construct a dense alignment map between content and style images. Furthermore, a cycle-consistency module enforces structural and perceptual alignment across iterations, yielding object and region level stylization that preserves geometry and detail. Despite requiring no additional training or supervision, CoCoDiff delivers state-of-the-art visual quality and strong quantitative results, outperforming methods that rely on extra training or annotations.

[210] TikArt: Aperture-Guided Observation for Fine-Grained Visual Reasoning via Reinforcement Learning

Hao Ding,Zhichuan Yang,Weijie Ge,Ziqin Gao,Chaoyi Lu,Lei Zhao

Main category: cs.CV

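TL;DR: TikArt is an aperture-guided multimodal reasoning agent that uses two visual "aperture" actions, Zoom and Segment, to focus dynamically on key regions during fine-grained visual reasoning, and builds persistent vision-language memory from verbal observations, significantly improving high-resolution image understanding.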

Details

Motivation: Existing MLLMs are limited in fine-grained visual reasoning by a single global image encoding, which makes it hard to capture key evidence such as tiny objects, cluttered regions, or subtle markings. Method: Proposes the TikArt (Thinking Aperture) agent, which follows a Think-Aperture-Observe loop: language generation, then Zoom (rectangular crop) or Segment (SAM2 mask-based crop), then an explicit textual observation; built on Qwen3-VL-8B, it optimizes its reasoning policy with the two-stage AGRPO reinforcement learning strategy, using rewards that couple task success with purposeful aperture use. Result: Consistently outperforms the baseline model on V*, HR-Bench-4K/8K, MME-RealWorld-Lite, MMStar, RefCOCO, and ReasonSeg, and produces interpretable aperture trajectories for high-resolution reasoning. Conclusion: Dynamic, controllable local visual focusing (such as Zoom and Segment) combined with verbalized observations is a key route to better fine-grained visual reasoning in MLLMs, and it offers good interpretability and generalization. Abstract: We address fine-grained visual reasoning in multimodal large language models (MLLMs), where key evidence may reside in tiny objects, cluttered regions, or subtle markings that are lost under a single global image encoding. We introduce TikArt (Thinking Aperture), an aperture-guided agent that casts multi-step vision-language reasoning as a decision process over regions of interest. TikArt follows a Think-Aperture-Observe loop, alternating between language generation and two aperture actions: Zoom extracts rectangular crops, while Segment invokes SAM2 to obtain mask-based crops for irregular targets. After every action, the model must produce an explicit observation, turning local visual cues into persistent linguistic memory. Built on Qwen3-VL-8B, TikArt optimizes its reasoning policy with AGRPO, a GRPO-style reinforcement learning algorithm with a two-stage curriculum: it warms up segmentation actions and then jointly optimizes visual math, fine-grained VQA, and segmentation, using rewards that couple task success with purposeful aperture use. Experiments on V*, HR-Bench-4K/8K, MME-RealWorld-Lite, MMStar, RefCOCO, and ReasonSeg show consistent gains over the backbone and yield interpretable aperture trajectories for high-resolution reasoning.

[211] Gaussian Mesh Renderer for Lightweight Differentiable Rendering

Xinpeng Liu,Fumio Okura

Main category: cs.CV

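TL;DR: This paper proposes the Gaussian Mesh Renderer (GMR), a lightweight differentiable mesh renderer that tightly integrates 3D Gaussians with triangle meshes to obtain smoother gradients and more efficient optimization.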

Details

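Motivation: Traditional mesh-based differentiable renderers are slow or computationally heavy to optimize for surface reconstruction, while 3D Gaussian Splatting (3DGS) is efficient but does not directly support mesh structure; a method that combines the strengths of both is needed. Method: Proposes the Gaussian Mesh Renderer (GMR), in which each Gaussian primitive is derived analytically from its corresponding mesh triangle, reusing the efficient rasterization process of 3DGS while preserving mesh structural fidelity and supporting gradient propagation. Result: Compared with traditional mesh renderers, GMR yields smoother gradients and notably better optimization with small batch sizes under limited memory; the code is open source. Conclusion: GMR bridges 3D Gaussian and triangle mesh representations, improving the efficiency and stability of differentiable mesh rendering while preserving structural fidelity, and offers a new direction for novel view synthesis and surface reconstruction. Abstract: 3D Gaussian Splatting (3DGS) has enabled high-fidelity virtualization with fast rendering and optimization for novel view synthesis. On the other hand, triangle mesh models still remain a popular choice for surface reconstruction but suffer from slow or heavy optimization in traditional mesh-based differentiable renderers. To address this problem, we propose a new lightweight differentiable mesh renderer leveraging the efficient rasterization process of 3DGS, named Gaussian Mesh Renderer (GMR), which tightly integrates the Gaussian and mesh representations. Each Gaussian primitive is analytically derived from the corresponding mesh triangle, preserving structural fidelity and enabling gradient flow. Compared to traditional mesh renderers, our method achieves smoother gradients, which especially contributes to better optimization using smaller batch sizes with limited memory. Our implementation is available in the public GitHub repository at https://github.com/huntorochi/Gaussian-Mesh-Renderer.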

[212] Uncertainty-Aware Vision-Language Segmentation for Medical Imaging

Aryan Das,Tanishq Rachamalla,Koushik Biswas,Swalpa Kumar Roy,Vinay Kumar Verma

Main category: cs.CV

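TL;DR: This paper proposes an uncertainty-aware multimodal medical segmentation framework that combines images with clinical text. It uses the MoDAB module and SSMix for efficient cross-modal fusion and introduces the SEU loss to jointly model spatial overlap, spectral consistency, and predictive uncertainty, improving reliability when image quality is poor. Experiments show superior performance and computational efficiency on several datasets.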

Details

Motivation: To address the segmentation uncertainty caused by poor image quality in complex clinical scenarios, while exploiting the complementary information in images and clinical text to improve diagnostic accuracy. Method: Proposes a Modality Decoding Attention Block (MoDAB) with a lightweight State Space Mixer (SSMix) for cross-modal fusion, and designs the Spectral-Entropic Uncertainty (SEU) loss, which jointly models spatial overlap, spectral consistency, and predictive uncertainty. Result: On the public QATA-COVID19, MosMed++, and Kvasir-SEG datasets, segmentation performance exceeds existing SOTA methods while being significantly more computationally efficient. Conclusion: Uncertainty modelling and structured multimodal alignment are essential for vision-language medical segmentation; the proposed framework balances robustness, accuracy, and efficiency. Abstract: We introduce a novel uncertainty-aware multimodal segmentation framework that leverages both radiological images and associated clinical text for precise medical diagnosis. We propose a Modality Decoding Attention Block (MoDAB) with a lightweight State Space Mixer (SSMix) to enable efficient cross-modal fusion and long-range dependency modelling. To guide learning under ambiguity, we propose the Spectral-Entropic Uncertainty (SEU) Loss, which jointly captures spatial overlap, spectral consistency, and predictive uncertainty in a unified objective. In complex clinical circumstances with poor image quality, this formulation improves model reliability. Extensive experiments on various publicly available medical datasets, QATA-COVID19, MosMed++, and Kvasir-SEG, demonstrate that our method achieves superior segmentation performance while being significantly more computationally efficient than existing State-of-the-Art (SoTA) approaches. Our results highlight the importance of incorporating uncertainty modelling and structured modality alignment in vision-language medical segmentation tasks. Code: https://github.com/arya-domain/UA-VLS

[213] Prototype Instance-semantic Disentanglement with Low-rank Regularized Subspace Clustering for WSIs Explainable Recognition

Chentao Li,Pan Huang

Main category: cs.CV

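TL;DR: This paper proposes PID-LRSC, an end-to-end prototype instance-semantic disentanglement framework that tackles the instance-semantic entanglement encountered when recognizing tumor regions in whole slide images. Low-rank regularized subspace clustering and enhanced contrastive learning mitigate confusion at the instance and semantic levels respectively, improving representation capability, interpretability, and the reliability of auxiliary diagnosis.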

Details

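Motivation: Tumor regions are crucial in pathological diagnosis, but they closely resemble precancerous lesions, and non-tumor instances far outnumber tumor instances. This causes instance-semantic entanglement in multi-instance learning and weakens both representation capability and interpretability. Method: Proposes the PID-LRSC framework: 1) secondary instance subspace learning is used to construct low-rank regularized subspace clustering (LRSC), mitigating the instance entanglement caused by the excess of non-tumor instances; 2) prototype instance semantic disentanglement (PID), based on enhanced contrastive learning, resolves the semantic entanglement caused by the high similarity between tumor and precancerous tissue. Result: Experiments on multicentre pathology datasets show that PID-LRSC outperforms other SOTA methods, with clearer instance semantics during decision-making and more reliable auxiliary diagnostic results. Conclusion: PID-LRSC effectively disentangles instance and semantic information, improving the accuracy and interpretability of tumor recognition in WSI analysis and providing a more reliable deep learning solution for pathology-assisted diagnosis. Abstract: The tumor region plays a key role in pathological diagnosis. Tumor tissues are highly similar to precancerous lesions and non-tumor instances often greatly exceed tumor instances in whole slide images (WSIs). These issues cause instance-semantic entanglement in multi-instance learning frameworks, degrading both model representation capability and interpretability. To address this, we propose an end-to-end prototype instance semantic disentanglement framework with low-rank regularized subspace clustering, PID-LRSC, in two aspects. First, we use secondary instance subspace learning to construct low-rank regularized subspace clustering (LRSC), addressing instance entanglement caused by an excessive proportion of non-tumor instances. Second, we employ enhanced contrastive learning to design prototype instance semantic disentanglement (PID), resolving semantic entanglement caused by the high similarity between tumor and precancerous tissues. We conduct extensive experiments on multicentre pathology datasets, showing that PID-LRSC outperforms other SOTA methods. Overall, PID-LRSC provides clearer instance semantics during decision-making and significantly enhances the reliability of auxiliary diagnostic outcomes.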

[214] MacNet: An End-to-End Manifold-Constrained Adaptive Clustering Network for Interpretable Whole Slide Image Classification

Mingrui Ma,Chentao Li,Pan Huang,Jing Qin

Main category: cs.CV

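TL;DR: This paper proposes an end-to-end multiple instance learning (MIL) framework that combines Grassmann re-embedding with manifold adaptive clustering to improve the accuracy and interpretability of pathological diagnosis on whole slide images (WSIs).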

Details

Motivation: Existing two-step frameworks rely on offline feature encoders trained without domain knowledge; attention-based MIL methods are outcome-oriented but poorly interpretable; clustering methods are interpretable but suffer from high-dimensional features and semantically ambiguous centroids. Method: Proposes an end-to-end MIL framework that integrates Grassmann re-embedding with manifold adaptive clustering, plus a prior-knowledge-guided proxy instance labeling and aggregation strategy that approximates patch-level labels and focuses on pathologically relevant tumor regions. Result: Validated on multicentre WSI datasets: 1) the cluster-incorporated model outperforms baselines in both grading accuracy and interpretability; 2) end-to-end learning refines the feature representations at an acceptable computational cost. Conclusion: Through geometric manifold modeling and a prior-guided weakly supervised strategy, the method balances performance, interpretability, and practicality in WSI analysis. Abstract: Whole slide images (WSIs) are the gold standard for pathological diagnosis and sub-typing. Current mainstream two-step frameworks employ offline feature encoders trained without domain-specific knowledge. Among them, attention-based multiple instance learning (MIL) methods are outcome-oriented and offer limited interpretability. Clustering-based approaches can provide an explainable decision-making process but suffer from high-dimensional features and semantically ambiguous centroids. To this end, we propose an end-to-end MIL framework that integrates Grassmann re-embedding and manifold adaptive clustering, where the manifold geometric structure facilitates robust clustering results. Furthermore, we design a prior-knowledge-guided proxy instance labeling and aggregation strategy to approximate patch labels and focus on pathologically relevant tumor regions. Experiments on multicentre WSI datasets demonstrate that: 1) our cluster-incorporated model achieves superior performance in both grading accuracy and interpretability; 2) end-to-end learning refines better feature representations and it requires acceptable computation resources.

[215] MedVAR: Towards Scalable and Efficient Medical Image Generation via Next-scale Autoregressive Prediction

Zhicheng He,Yunpeng Zhao,Junde Wu,Ziwei Niu,Zijun Li,Lanfen Lin,Yueming Jin

Main category: cs.CV

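TL;DR: This paper presents MedVAR, the first autoregressive foundation model for medical image generation. It adopts the next-scale prediction paradigm for fast, scalable medical image synthesis and reaches SOTA in multi-scale representation and generative performance.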

Details

Motivation: Existing medical image generation methods fall short in architectural efficiency, multi-organ data scale, and evaluation methodology, making it hard to build a scalable generative backbone. Method: Proposes MedVAR, which uses autoregression with next-scale prediction for coarse-to-fine image generation, and curates a harmonized multimodal dataset of about 440,000 CT/MRI images covering six anatomical regions to support hierarchical generation. Result: Surpasses existing methods across fidelity, diversity, and scalability, reaching SOTA performance in medical image generation. Conclusion: MedVAR offers an efficient, scalable architectural direction for medical generative foundation models and advances multi-scale representations alongside downstream task adaptation. Abstract: Medical image generation is pivotal in applications like data augmentation for low-resource clinical tasks and privacy-preserving data sharing. However, developing a scalable generative backbone for medical imaging requires architectural efficiency, sufficient multi-organ data, and principled evaluation, yet current approaches leave these aspects unresolved. Therefore, we introduce MedVAR, the first autoregressive-based foundation model that adopts the next-scale prediction paradigm to enable fast and scale-up-friendly medical image synthesis. MedVAR generates images in a coarse-to-fine manner and produces structured multi-scale representations suitable for downstream use. To support hierarchical generation, we curate a harmonized dataset of around 440,000 CT and MRI images spanning six anatomical regions. Comprehensive experiments across fidelity, diversity, and scalability show that MedVAR achieves state-of-the-art generative performance and offers a promising architectural direction for future medical generative foundation models.

[216] Efficient Text-Guided Convolutional Adapter for the Diffusion Model

Aryan Das,Koushik Biswas,Swalpa Kumar Roy,Badri Narayana Patro,Vinay Kumar Verma

Main category: cs.CV

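TL;DR: This paper proposes Nexus Adapters (Nexus Prime and Slim), efficient text-guided adapters for structure-preserving conditional generation (SPCG) that improve understanding of the text prompt while preserving structural information, with far fewer parameters and better performance.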

Details

Motivation: Existing structure-preserving methods are inefficient and parameter-heavy, and their adapters are unaware of the input text prompt, so they are optimized only for the structural input and ignore the prompt condition. Method: Proposes two new adapters, Nexus Prime and Nexus Slim; each Nexus Block introduces cross-attention so that the text prompt and the structural input (such as sketches or depth maps) jointly guide generation. Result: With only 8M additional parameters, Nexus Prime significantly outperforms T2I-Adapter; Nexus Slim has 18M fewer parameters than T2I-Adapter and still reaches SOTA performance. Conclusion: Through a prompt-aware design and a lightweight architecture, Nexus Adapters combine efficiency and performance in structure-preserving generation. Abstract: We introduce the Nexus Adapters, novel text-guided efficient adapters to the diffusion-based framework for the Structure Preserving Conditional Generation (SPCG). Recently, structure-preserving methods have achieved promising results in conditional image generation by using a base model for prompt conditioning and an adapter for structure input, such as sketches or depth maps. These approaches are highly inefficient and sometimes require equal parameters in the adapter compared to the base architecture. It is not always possible to train the model since the diffusion model is itself costly, and doubling the parameter count is highly inefficient. In these approaches, the adapter is not aware of the input prompt; therefore, it is optimal only for the structural input but not for the input prompt. To overcome the above challenges, we propose two efficient adapters, Nexus Prime and Slim, which are guided by prompts and structural inputs. Each Nexus Block incorporates cross-attention mechanisms to enable rich multimodal conditioning. Therefore, the proposed adapter has a better understanding of the input prompt while preserving the structure. We conducted extensive experiments on the proposed models and demonstrated that the Nexus Prime adapter significantly enhances performance, requiring only 8M additional parameters compared to the baseline, T2I-Adapter. Furthermore, we also introduced a lightweight Nexus Slim adapter with 18M fewer parameters than the T2I-Adapter, which still achieved state-of-the-art results. Code: https://github.com/arya-domain/Nexus-Adapters

[217] Architectural Insights for Post-Tornado Damage Recognition

Robinson Umeike,Thang Dao,Shane Crawford,John van de Lindt,Blythe Johnston,Wanting Wang,Trung Do,Ajibola Mofikoya,Sarbesh Banjara,Cuong Pham

Main category: cs.CV

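TL;DR: This paper presents a systematic experimental framework that evaluates 79 open-source deep learning models on tornado building damage assessment. It finds that optimizer choice (e.g., SGD over Adam) and a low learning rate (1e-4) can matter more than the network architecture itself, and builds a champion model on ConvNeXt-Base with strong cross-event generalization.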

Details

Motivation: Post-tornado building damage assessment urgently needs fast, accurate methods, but existing automated approaches are limited by severe domain shift between pre-training data and real disaster scenes and by extreme class imbalance. Method: Curates a new Quad-State Tornado Damage (QSTD) dataset and systematically evaluates 79 CNN and Vision Transformer models, analyzing the effect of architecture, optimizer, learning rate, and other key factors across more than 2,300 controlled experiments. Result: Optimizer choice (SGD vs Adam) yields F1 gains of 25–38 points, and a low learning rate (1e-4) raises F1 by 10.2 points on average; ConvNeXt-Base reaches 46.4% Macro F1 (+34.6) and 85.5% Ordinal Top-1 Accuracy on the TMTD test set. Conclusion: The performance bottleneck in tornado damage assessment lies less in the model architecture than in the careful design of the training strategy, especially optimizer and learning rate; this finding offers practical guidance for deploying disaster-response AI models. Abstract: Rapid and accurate building damage assessment in the immediate aftermath of tornadoes is critical for coordinating life-saving search and rescue operations, optimizing emergency resource allocation, and accelerating community recovery. However, current automated methods struggle with the unique visual complexity of tornado-induced wreckage, primarily due to severe domain shift from standard pre-training datasets and extreme class imbalance in real-world disaster data. To address these challenges, we introduce a systematic experimental framework evaluating 79 open-source deep learning models, encompassing both Convolutional Neural Networks (CNNs) and Vision Transformers, across over 2,300 controlled experiments on our newly curated Quad-State Tornado Damage (QSTD) benchmark dataset. Our findings reveal that achieving operational-grade performance hinges on a complex interaction between architecture and optimization, rather than architectural selection alone. Most strikingly, we demonstrate that optimizer choice can be more consequential than architecture: switching from Adam to SGD provided dramatic F1 gains of +25 to +38 points for Vision Transformer and Swin Transformer families, fundamentally reversing their ranking from bottom-tier to competitive with top-performing CNNs. Furthermore, a low learning rate of 1x10^(-4) proved universally critical, boosting average F1 performance by +10.2 points across all architectures.
Our champion model, ConvNeXt-Base trained with these optimized settings, demonstrated strong cross-event generalization on the held-out Tuscaloosa-Moore Tornado Damage (TMTD) dataset, achieving 46.4% Macro F1 (+34.6 points over baseline) and retaining 85.5% Ordinal Top-1 Accuracy despite temporal and sensor domain shifts.
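
The gap between the two reported numbers (46.4% Macro F1 versus 85.5% Ordinal Top-1 Accuracy) is easier to read with the definition of macro-averaged F1 in hand: each class contributes equally, so rare damage classes weigh as much as common ones. The following is a minimal sketch of the standard metric, not the authors' evaluation code; the two-class labels and the toy predictions are hypothetical.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (standard macro-F1 definition)."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced toy data: 8 samples of class 0, 2 samples of class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10  # a model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                          # majority-class accuracy
print(macro_f1(y_true, y_pred, [0, 1]))  # much lower: the minority class scores 0
```

In this toy case the always-majority predictor is right on 8 of 10 samples, yet its macro F1 is the mean of 16/18 and 0, which is why macro F1 is a stricter measure under extreme class imbalance.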

[218] Error Patterns in Historical OCR: A Comparative Analysis of TrOCR and a Vision-Language Model

Ari Vesalainen,Eetu Mäkelä,Laura Ruotsalainen,Mikko Tolonen

Main category: cs.CV

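TL;DR: This paper compares TrOCR and Qwen on OCR of eighteenth-century English texts. Qwen achieves lower CER/WER and is more robust to degraded images, but carries a risk of implicit orthographic normalization; TrOCR preserves orthographic fidelity better but is prone to cascading errors. The study argues for evaluation tailored to each model's architectural characteristics.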

Details

Motivation: Existing OCR metrics such as CER and WER say little about a model's practical reliability for humanities scholarship, especially given the degraded print quality, archaic glyphs, and non-standardized orthography of eighteenth-century printed texts. Method: Uses length-weighted accuracy metrics and hypothesis-driven error analysis to compare a dedicated OCR transformer (TrOCR) with a general-purpose vision-language model (Qwen) on line-level historical English texts. Result: Qwen has lower CER/WER and is more robust to degraded input, but exhibits selective linguistic regularization and orthographic normalization; TrOCR preserves orthography more faithfully but is prone to cascading errors. Although the two models have similar aggregate accuracy, they differ markedly in error locality, detectability, and scholarly risk. Conclusion: The architectural priors of OCR models systematically shape their error patterns, so historical digitization workflows need architecture-aware evaluation strategies. Abstract: Optical Character Recognition (OCR) of eighteenth-century printed texts remains challenging due to degraded print quality, archaic glyphs, and non-standardized orthography. Although transformer-based OCR systems and Vision-Language Models (VLMs) achieve strong aggregate accuracy, metrics such as Character Error Rate (CER) and Word Error Rate (WER) provide limited insight into their reliability for scholarly use. We compare a dedicated OCR transformer (TrOCR) and a general-purpose Vision-Language Model (Qwen) on line-level historical English texts using length-weighted accuracy metrics and hypothesis-driven error analysis. While Qwen achieves lower CER/WER and greater robustness to degraded input, it exhibits selective linguistic regularization and orthographic normalization that may silently alter historically meaningful forms. TrOCR preserves orthographic fidelity more consistently but is more prone to cascading error propagation. Our findings show that architectural inductive biases shape OCR error structure in systematic ways. Models with similar aggregate accuracy can differ substantially in error locality, detectability, and downstream scholarly risk, underscoring the need for architecture-aware evaluation in historical digitization workflows.
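
To make the CER and WER figures concrete, here is a minimal sketch of both metrics computed from Levenshtein edit distance. It assumes that "length-weighted" means pooling edits over the total reference length across lines, which is one common reading and not confirmed by the abstract; the sample line pair is hypothetical and chosen to mimic orthographic normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def pooled_error_rate(pairs, tokenize):
    """Total edits divided by total reference length over all (ref, hyp) lines."""
    edits = sum(edit_distance(tokenize(r), tokenize(h)) for r, h in pairs)
    length = sum(len(tokenize(r)) for r, _ in pairs)
    return edits / length


def cer(pairs):
    return pooled_error_rate(pairs, list)       # character-level tokens


def wer(pairs):
    return pooled_error_rate(pairs, str.split)  # whitespace-delimited words


# Hypothetical line: the hypothesis modernizes two period spellings.
lines = [("publick musick", "public music")]
print(cer(lines))  # 2 character edits over 14 reference characters
print(wer(lines))  # both words differ from the reference
```

The example shows why aggregate rates can hide scholarly risk: two dropped letters barely move the CER, yet every word in the line has lost its historical spelling.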

[219] Cross-view Domain Generalization via Geometric Consistency for LiDAR Semantic Segmentation

Jindong Zhao,Yuan Gao,Yang Xia,Sheng Nie,Jun Yue,Weiwei Sun,Shaobo Xia

Main category: cs.CV

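TL;DR: This paper proposes the CVGC framework for cross-view domain generalization in LiDAR semantic segmentation, using cross-view geometric augmentation and geometric consistency constraints to improve generalization to target domains with different acquisition viewpoints.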

Details

Motivation: Existing domain generalization methods for LiDAR semantic segmentation assume that source and target domains share similar acquisition views (e.g., both vehicle-mounted) and struggle in cross-view scenarios, where viewpoint differences cause structural incompleteness and non-uniform point density. Method: Proposes the CVGC framework: 1) a cross-view geometric augmentation module that simulates the changes in visibility and sampling density induced by viewpoint changes, generating multiple cross-view point clouds of the same scene; 2) a geometric consistency module that enforces consistent semantic and occupancy predictions across differently augmented point clouds of the same scene. Result: Provides the first systematic evaluation of cross-view domain generalization on six public LiDAR datasets; CVGC consistently outperforms SOTA methods when generalizing from a single source domain to multiple target domains with heterogeneous viewpoints. Conclusion: CVGC effectively mitigates cross-view domain shift, offering a more robust domain generalization solution for LiDAR semantic segmentation under diverse real-world deployment conditions. Abstract: Domain-generalized LiDAR semantic segmentation (LSS) seeks to train models on source-domain point clouds that generalize reliably to multiple unseen target domains, which is essential for real-world LiDAR applications. However, existing approaches assume similar acquisition views (e.g., vehicle-mounted) and struggle in cross-view scenarios, where observations differ substantially due to viewpoint-dependent structural incompleteness and non-uniform point density. Accordingly, we formulate cross-view domain generalization for LiDAR semantic segmentation and propose a novel framework, termed CVGC (Cross-View Geometric Consistency). Specifically, we introduce a cross-view geometric augmentation module that models viewpoint-induced variations in visibility and sampling density, generating multiple cross-view observations of the same scene. Subsequently, a geometric consistency module enforces consistent semantic and occupancy predictions across geometrically augmented point clouds of the same scene. Extensive experiments on six public LiDAR datasets establish the first systematic evaluation of cross-view domain generalization for LiDAR semantic segmentation, demonstrating that CVGC consistently outperforms state-of-the-art methods when generalizing from a single source domain to multiple target domains with heterogeneous acquisition viewpoints. The source code will be publicly available at https://github.com/KintomZi/CVGC-DG

[220] MoRL: Reinforced Reasoning for Unified Motion Understanding and Generation

Hongpeng Wang,Zeyu Zhang,Wenhao Li,Hao Tang

Main category: cs.CV

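TL;DR: This paper proposes MoRL, a multimodal motion model that combines supervised fine-tuning with reinforcement learning, using verifiable rewards to improve motion understanding and generation. It also introduces Chain-of-Motion (CoM) for step-by-step planning and reflection at test time, and significantly outperforms existing methods.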

Details

Motivation: Existing methods for human motion understanding and generation have limited reasoning capability and lack test-time planning. Method: Proposes MoRL, which combines supervised fine-tuning with reinforcement learning using verifiable rewards based on semantic alignment, reasoning coherence, physical plausibility, and text-motion consistency; introduces Chain-of-Motion (CoM) as a test-time reasoning method; and constructs two large-scale CoT datasets, MoUnd-CoT-140K and MoGen-CoT-140K. Result: Significantly surpasses current state-of-the-art baselines on the HumanML3D and KIT-ML datasets. Conclusion: Through unified modeling and verifiable rewards, MoRL improves both logical reasoning and perceptual realism in motion understanding and generation, and CoM further strengthens test-time planning. Abstract: Human motion understanding and generation are crucial for vision and robotics but remain limited in reasoning capability and test-time planning. We propose MoRL, a unified multimodal motion model trained with supervised fine-tuning and reinforcement learning with verifiable rewards. Our task-specific reward design combines semantic alignment and reasoning coherence for understanding with physical plausibility and text-motion consistency for generation, improving both logical reasoning and perceptual realism. To further enhance inference, we introduce Chain-of-Motion (CoM), a test-time reasoning method that enables step-by-step planning and reflection. We also construct two large-scale CoT datasets, MoUnd-CoT-140K and MoGen-CoT-140K, to align motion sequences with reasoning traces and action descriptions. Experiments on HumanML3D and KIT-ML show that MoRL achieves significant gains over state-of-the-art baselines. Code: https://github.com/AIGeeksGroup/MoRL. Website: https://aigeeksgroup.github.io/MoRL.

[221] OmniVTON++: Training-Free Universal Virtual Try-On with Principal Pose Guidance

Zhaotong Yang,Yong Du,Shengfeng He,Yuhui Li,Xinzhe Li,Yangyang Xu,Junyu Dong,Jian Yang

Main category: cs.CV

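TL;DR: OmniVTON++ is a training-free, universal image-based virtual try-on framework. Its three modules (Structured Garment Morphing, Principal Pose Guidance, and Continuous Boundary Stitching) deliver high-quality try-on results across datasets, garment types, and scenarios such as multi-garment, multi-human, and anime character try-on.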

Details

Motivation: Existing image-based virtual try-on methods usually need retraining for specific data conditions, generalize poorly, and lack a unified solution. Method: Proposes the OmniVTON++ framework with three core components: Structured Garment Morphing (correspondence-driven garment adaptation), Principal Pose Guidance (step-wise structural regulation during diffusion sampling), and Continuous Boundary Stitching (boundary-aware refinement), with no task-specific retraining at any stage. Result: Reaches SOTA performance in generalization settings such as cross-dataset and cross-garment-type evaluation; supports single- and multi-garment, single- and multi-human, and anime character try-on; and is compatible with multiple diffusion backbones. Conclusion: OmniVTON++ provides a training-free, highly generalizable, and widely applicable unified framework for virtual try-on, substantially extending the practical reach of VTON. Abstract: Image-based Virtual Try-On (VTON) concerns the synthesis of realistic person imagery through garment re-rendering under human pose and body constraints. In practice, however, existing approaches are typically optimized for specific data conditions, making their deployment reliant on retraining and limiting their generalization as a unified solution. We present OmniVTON++, a training-free VTON framework designed for universal applicability. It addresses the intertwined challenges of garment alignment, human structural coherence, and boundary continuity by coordinating Structured Garment Morphing for correspondence-driven garment adaptation, Principal Pose Guidance for step-wise structural regulation during diffusion sampling, and Continuous Boundary Stitching for boundary-aware refinement, forming a cohesive pipeline without task-specific retraining. Experimental results demonstrate that OmniVTON++ achieves state-of-the-art performance across diverse generalization settings, including cross-dataset and cross-garment-type evaluations, while reliably operating across scenarios and diffusion backbones within a single formulation. In addition to single-garment, single-human cases, the framework supports multi-garment, multi-human, and anime character virtual try-on, expanding the scope of virtual try-on applications. The source code will be released to the public.

[222] DriveFine: Refining-Augmented Masked Diffusion VLA for Precise and Robust Driving

Chenxu Dang,Sining Ang,Yongkang Li,Haochen Tian,Jie Wang,Guang Li,Hangjun Ye,Jie Ma,Long Chen,Yan Wang

Main category: cs.CV

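TL;DR: This paper proposes DriveFine, a vision-language-action (VLA) model for autonomous driving planning that combines masked diffusion with self-correction. A novel plug-and-play block-MoE structure decouples the generation and refinement experts, and a hybrid reinforcement learning strategy is added; the model shows strong efficacy and robustness on several benchmarks.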

Details

Motivation: Diffusion-based VLA planners suffer from modality alignment difficulties, low training efficiency, and weak generalization, while token-based planners suffer from cumulative causal errors and irreversible decoding. Their strengths and weaknesses are complementary, which calls for a combined approach. Method: Proposes DriveFine: 1) built on a masked diffusion framework; 2) a plug-and-play block-MoE structure that explicitly separates a generation expert from a refinement expert, with expert selection at inference and gradient blocking during training; 3) a hybrid reinforcement learning strategy that encourages effective exploration by the refinement expert while keeping training stable. Result: Experiments on the NAVSIM v1, v2, and Navhard benchmarks show that DriveFine clearly outperforms existing methods in planning performance and robustness. Conclusion: DriveFine combines the flexibility of diffusion modeling with the controllability of token decoding; its block-MoE design allows reuse of pretrained weights while remaining extensible, offering a new paradigm for VLA models. Abstract: Vision-Language-Action (VLA) models for autonomous driving increasingly adopt generative planners trained with imitation learning followed by reinforcement learning. Diffusion-based planners suffer from modality alignment difficulties, low training efficiency, and limited generalization. Token-based planners are plagued by cumulative causal errors and irreversible decoding. In summary, the two dominant paradigms exhibit complementary strengths and weaknesses. In this paper, we propose DriveFine, a masked diffusion VLA model that combines flexible decoding with self-correction capabilities. In particular, we design a novel plug-and-play block-MoE, which seamlessly injects a refinement expert on top of the generation expert. By enabling explicit expert selection during inference and gradient blocking during training, the two experts are fully decoupled, preserving the foundational capabilities and generic patterns of the pretrained weights, which highlights the flexibility and extensibility of the block-MoE design. Furthermore, we design a hybrid reinforcement learning strategy that encourages effective exploration of refinement expert while maintaining training stability. Extensive experiments on NAVSIM v1, v2, and Navhard benchmarks demonstrate that DriveFine exhibits strong efficacy and robustness. The code will be released at https://github.com/MSunDYY/DriveFine.

[223] YOLO26: A Comprehensive Architecture Overview and Key Improvements

Priyanto Hidayatullah,Refdinal Tubagus

Main category: cs.CV

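TL;DR: This study analyzes the architecture of the latest YOLO26 model in depth, focusing on its improvements in inference speed, small-object detection, and multi-task extensions such as instance segmentation and pose estimation.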

Details

Motivation: Existing technical documentation does not adequately describe how YOLO26 actually operates, especially details in the source code that have rarely been examined. This paper therefore carries out a rigorous architecture-level analysis to give researchers and developers a precise, reusable understanding of the model. Method: Based on the official YOLO26 GitHub source code and documentation, performs a systematic reverse analysis of the architecture, extracting and integrating key components (removal of DFL, end-to-end NMS-free inference, ProgLoss + STAL label assignment, and the MuSGD optimizer), and builds the first CNN-based YOLO26 architecture diagram. Result: Presents the complete CNN architecture diagram of YOLO26 for the first time; reports the claimed 43% inference speedup in CPU mode; and confirms support for multiple tasks, including instance segmentation, pose estimation, and oriented bounding box decoding. Conclusion: YOLO26 substantially improves real-time performance on edge devices and extends the capabilities of the YOLO series through several design changes; this study provides a solid architectural basis for further improvement and application. Abstract: You Only Look Once (YOLO) has been the prominent model for computer vision in deep learning for a decade. This study explores the novel aspects of YOLO26, the most recent version in the YOLO series. The elimination of Distribution Focal Loss (DFL), implementation of End-to-End NMS-Free Inference, introduction of ProgLoss + Small-Target-Aware Label Assignment (STAL), and use of the MuSGD optimizer are the primary enhancements designed to improve inference speed, which is claimed to achieve a 43% boost in CPU mode. This is designed to allow YOLO26 to attain real-time performance on edge devices or those without GPUs. Additionally, YOLO26 offers improvements in many computer vision tasks, including instance segmentation, pose estimation, and oriented bounding box (OBB) decoding. We aim for this effort to provide more value than just consolidating information already included in the existing technical documentation. Therefore, we performed a rigorous architectural investigation into YOLO26, mostly using the source code available in its GitHub repository and its official documentation. The authentic and detailed operational mechanisms of YOLO26 are inside the source code, which is seldom extracted by others. The YOLO26 architectural diagram is shown as the outcome of the investigation. This study is, to our knowledge, the first one presenting the CNN-based YOLO26 architecture, which is the core of YOLO26.
Our objective is to provide a precise architectural comprehension of YOLO26 for researchers and developers aspiring to enhance the YOLO model, ensuring it remains the leading deep learning model in computer vision.

[224] VariViT: A Vision Transformer for Variable Image Sizes

Aswathi Varma,Suprosanna Shit,Chinmay Prabhakar,Daniel Scholz,Hongwei Bran Li,Bjoern Menze,Daniel Rueckert,Benedikt Wiestler

Main category: cs.CV

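TL;DR: This paper proposes VariViT, an improved Vision Transformer that handles variable image sizes. With a novel positional embedding resizing scheme and a batching strategy, it keeps a fixed patch size while improving feature representation for medical images such as brain MRI, and outperforms standard ViT and ResNet on glioma genotype prediction and brain tumor classification.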

Details

Motivation: Conventional ViTs split images into fixed-size patches and therefore need pre-processing such as cropping or resizing. In medical imaging, for example with irregularly shaped tumors, this can skew the foreground-to-background ratio, lose information, or introduce artefacts that affect diagnosis; there is also a trade-off between computational cost and accuracy. Method: Proposes VariViT: 1) accepts variable image sizes while keeping a fixed patch size; 2) uses a novel resizable positional embedding to accommodate a variable number of patches; 3) introduces a new batching strategy to reduce computational complexity. Result: On two 3D brain MRI datasets, VariViT reaches F1-scores of 75.5% and 76.3% on glioma genotype prediction and brain tumor classification respectively, outperforming standard ViT and ResNet; the new batching strategy cuts computation time by up to 30%. Conclusion: VariViT improves both the representational capability and the computational efficiency of ViTs on variable-size medical images, supporting its practical value for precision medical image analysis. Abstract: Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning, leveraging self-attention mechanisms to excel in various tasks. ViTs split images into fixed-size patches, constraining them to a predefined size and necessitating pre-processing steps like resizing, padding, or cropping. This poses challenges in medical imaging, particularly with irregularly shaped structures like tumors. A fixed bounding box crop size produces input images with highly variable foreground-to-background ratios. Resizing medical images can degrade information and introduce artefacts, impacting diagnosis. Hence, tailoring variable-sized crops to regions of interest can enhance feature representation capabilities. Moreover, large images are computationally expensive, and smaller sizes risk information loss, presenting a computation-accuracy tradeoff. We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size. VariViT employs a novel positional embedding resizing scheme for a variable number of patches. We also implement a new batching strategy within VariViT to reduce computational complexity, resulting in faster training and inference times. In our evaluations on two 3D brain MRI datasets, VariViT surpasses vanilla ViTs and ResNet in glioma genotype prediction and brain tumor classification. It achieves F1-scores of 75.5% and 76.3%, respectively, learning more discriminative features. Our proposed batching strategy reduces computation time by up to 30% compared to conventional architectures.
These findings underscore the efficacy of VariViT in image representation learning. Our code can be found here: https://github.com/Aswathi-Varma/varivit
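
The abstract does not describe VariViT's positional embedding resizing scheme, so the sketch below shows only the generic idea it builds on: resampling a learned grid of positional embeddings to match a different number of patches. It uses plain bilinear interpolation on a 2D grid as a stand-in (the paper works with 3D MRI volumes), and the function name and toy values are hypothetical.

```python
def resize_pos_embed(grid, new_h, new_w):
    """Bilinearly resample an H x W grid of embedding vectors to new_h x new_w.

    Generic illustration only; not VariViT's actual resizing scheme.
    """
    old_h, old_w = len(grid), len(grid[0])

    def src_coord(i, new_n, old_n):
        # Map an output index to a source coordinate with corners aligned.
        return 0.0 if new_n == 1 else i * (old_n - 1) / (new_n - 1)

    out = []
    for i in range(new_h):
        y = src_coord(i, new_h, old_h)
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        fy = y - y0
        row = []
        for j in range(new_w):
            x = src_coord(j, new_w, old_w)
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            fx = x - x0
            # Blend the four neighbouring embeddings, one channel at a time.
            vec = [
                (1 - fy) * ((1 - fx) * a + fx * b) + fy * ((1 - fx) * c + fx * d)
                for a, b, c, d in zip(grid[y0][x0], grid[y0][x1],
                                      grid[y1][x0], grid[y1][x1])
            ]
            row.append(vec)
        out.append(row)
    return out


# Hypothetical 2 x 2 grid of 1-dimensional embeddings, resized for a 3 x 3 patch grid.
base = [[[0.0], [1.0]],
        [[2.0], [3.0]]]
resized = resize_pos_embed(base, 3, 3)
print(resized[1][1])  # centre position: average of the four corner embeddings
```

The patch size stays fixed in this picture; only the number of grid positions changes with the input size, which is why the embedding table has to be resampled.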

[225] VIGIL: Tackling Hallucination Detection in Image Recontextualization

Joanna Wojciechowicz,Maria Łubniewska,Jakub Antczak,Justyna Baczyńska,Wojciech Gromski,Wojciech Kozłowski,Maciej Zięba

Main category: cs.CV

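TL;DR: This paper presents VIGIL, the first fine-grained categorization benchmark and detection framework for hallucinations in image recontextualization by large multimodal models (LMMs). It divides hallucinations into five categories, designs a multi-stage detection pipeline, and releases the dataset and code.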

Details

Motivation: Existing work treats multimodal hallucination as a single uniform problem and lacks fine-grained categorization and evaluation; this paper aims to fill the gap in decomposing hallucination types and detecting them interpretably for image recontextualization. Method: Builds the VIGIL benchmark dataset and proposes a multi-stage detection pipeline that combines several open-source models to handle five categories: pasted object hallucinations, background hallucinations, object omission, positional and logical inconsistencies, and physical law violations. Result: Enables interpretable, fine-grained identification and localization of the various hallucinations LMMs produce in image recontextualization; experiments demonstrate the effectiveness of the detection pipeline, and all resources are openly released. Conclusion: VIGIL is the first systematic categorization and detection effort for hallucinations in image recontextualization, improving the transparency and debuggability of multimodal model evaluation and providing open infrastructure for follow-up research. Abstract: We introduce VIGIL (Visual Inconsistency & Generative In-context Lucidity), the first benchmark dataset and framework providing a fine-grained categorization of hallucinations in the multimodal image recontextualization task for large multimodal models (LMMs). While existing research often treats hallucinations as a uniform issue, our work addresses a significant gap in multimodal evaluation by decomposing these errors into five categories: pasted object hallucinations, background hallucinations, object omission, positional & logical inconsistencies, and physical law violations. To address these complexities, we propose a multi-stage detection pipeline. Our architecture processes recontextualized images through a series of specialized steps targeting object-level fidelity, background consistency, and omission detection, leveraging a coordinated ensemble of open-source models, whose effectiveness is demonstrated through extensive experimental evaluations. Our approach enables a deeper understanding of where the models fail with an explanation; thus, we fill a gap in the field, as no prior methods offer such categorization and decomposition for this task. To promote transparency and further exploration, we openly release VIGIL, along with the detection pipeline and benchmark code, through our GitHub repository: https://github.com/mlubneuskaya/vigil and Data repository: https://huggingface.co/datasets/joannaww/VIGIL.

[226] SketchingReality: From Freehand Scene Sketches To Photorealistic Images

Ahmed Bourouis,Mikhail Bessmeltsev,Yulia Gryaditskaya

Main category: cs.CV

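TL;DR: This paper proposes a modulation-based method for generating high-quality, semantically aligned images from freehand sketches without supervision from pixel-aligned ground-truth images, and introduces a new loss that improves both semantic consistency with the sketch and realism of the output.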

Details

Motivation: Existing methods mostly use edge maps (often misnamed sketches), while methods that truly handle freehand sketches, with their abstraction and distortions, remain scarce; freehand sketches also inherently lack a single correct pixel-aligned ground-truth image, which hampers training and evaluation. Method: Proposes a modulation-based approach that prioritizes semantic interpretation of the sketch over strict alignment to edge positions, together with a novel loss that allows training directly on freehand sketches without pixel-aligned ground-truth images. Result: The method outperforms existing approaches in semantic alignment with freehand sketches and in the realism and overall quality of the generated images. Conclusion: A semantics-driven modulation strategy combined with the new loss helps bridge the gap between freehand sketches and high-quality image generation, offering a more robust and practical paradigm for sketch-guided generation. Abstract: Recent years have witnessed remarkable progress in generative AI, with natural language emerging as the most common conditioning input. As underlying models grow more powerful, researchers are exploring increasingly diverse conditioning signals, such as depth maps, edge maps, camera parameters, and reference images, to give users finer control over generation. Among different modalities, sketches are a natural and long-standing form of human communication, enabling rapid expression of visual concepts. Previous literature has largely focused on edge maps, often misnamed 'sketches', yet algorithms that effectively handle true freehand sketches, with their inherent abstraction and distortions, remain underexplored. We pursue the challenging goal of balancing photorealism with sketch adherence when generating images from freehand input. A key obstacle is the absence of ground-truth, pixel-aligned images: by their nature, freehand sketches do not have a single correct alignment. To address this, we propose a modulation-based approach that prioritizes semantic interpretation of the sketch over strict adherence to individual edge positions. We further introduce a novel loss that enables training on freehand sketches without requiring ground-truth pixel-aligned images. We show that our method outperforms existing approaches in both semantic alignment with freehand sketch inputs and in the realism and overall quality of the generated images.

[227] Advances in Global Solvers for 3D Vision

Zhenjun Zhao,Heng Yang,Bangyan Liao,Yingping Zeng,Shaocheng Yan,Yingdong Gu,Peidong Liu,Yi Zhou,Haoang Li,Javier Civera

Main category: cs.CV

TL;DR: This survey provides the first systematic review of global solvers in geometric vision, categorizing them into Branch-and-Bound, Convex Relaxation, and Graduated Non-Convexity paradigms, and analyzes their theoretical foundations, algorithmic designs, and trade-offs across ten core vision tasks.

Details

Motivation: To unify and systematically review global solvers in geometric vision, addressing the need for certifiable solutions to nonconvex geometric optimization problems traditionally tackled by local or heuristic methods. Method: A comprehensive taxonomy-based analysis of three core global solver paradigms—Branch-and-Bound (BnB), Convex Relaxation (CR), and Graduated Non-Convexity (GNC)—covering their theoretical foundations, algorithmic designs, robustness and scalability enhancements, and evaluation across ten vision tasks. Result: Identifies optimality-robustness-scalability trade-offs governing solver selection; highlights future directions including scalable guaranteed algorithms, integration of data-driven priors, standardized benchmarks, and societal implications for safety-critical applications. Conclusion: This survey consolidates theoretical foundations, practical advances, and broader impacts to provide a unified perspective and roadmap toward certifiable, trustworthy perception for real-world 3D vision applications. Abstract: Global solvers have emerged as a powerful paradigm for 3D vision, offering certifiable solutions to nonconvex geometric optimization problems traditionally addressed by local or heuristic methods. This survey presents the first systematic review of global solvers in geometric vision, unifying the field through a comprehensive taxonomy of three core paradigms: Branch-and-Bound (BnB), Convex Relaxation (CR), and Graduated Non-Convexity (GNC). We present their theoretical foundations, algorithmic designs, and practical enhancements for robustness and scalability, examining how each addresses the fundamental nonconvexity of geometric estimation problems. Our analysis spans ten core vision tasks, from Wahba problem to bundle adjustment, revealing the optimality-robustness-scalability trade-offs that govern solver selection. 
We identify critical future directions: scaling algorithms while maintaining guarantees, integrating data-driven priors with certifiable optimization, establishing standardized benchmarks, and addressing societal implications for safety-critical deployment. By consolidating theoretical foundations, practical advances, and broader impacts, this survey provides a unified perspective and roadmap toward certifiable, trustworthy perception for real-world applications. A continuously-updated literature summary and companion code tutorials are available at https://github.com/ericzzj1989/Awesome-Global-Solvers-for-3D-Vision.

[228] MeFEm: Medical Face Embedding model

Yury Borets,Stepan Botman

Main category: cs.CV

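TL;DR: MeFEm is a vision model built on a modified JEPA architecture for biometric and medical analysis from facial images. With an axial stripe masking strategy, a circular loss weighting scheme, and probabilistic reassignment of the CLS token, it outperforms strong baselines such as FaRL and Franca while using less data, and performs well on BMI estimation.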

Details

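Motivation: Existing methods for biometric and medical analysis from facial images require large amounts of data, suffer from strong domain bias (for example, imbalanced BMI estimation datasets), and model semantically relevant regions poorly. Method: MeFEm is built on a modified Joint Embedding Predictive Architecture (JEPA). It introduces an axial stripe masking strategy to focus learning on semantically relevant regions, a circular loss weighting scheme, and probabilistic reassignment of the CLS token to improve linear probing quality. It is trained on a consolidated dataset of curated facial images. Result: MeFEm clearly outperforms strong baselines such as FaRL and Franca on core anthropometric tasks while using less data, and shows promising results on a newly consolidated closed-source BMI evaluation dataset. Model weights are publicly available. Conclusion: MeFEm offers an efficient, robust, and reproducible baseline for biomedical analysis from facial images, and its architectural and data choices may carry over to visual representation learning with little data and high semantic precision. Abstract: We present MeFEm, a vision model based on a modified Joint Embedding Predictive Architecture (JEPA) for biometric and medical analysis from facial images. Key modifications include an axial stripe masking strategy to focus learning on semantically relevant regions, a circular loss weighting scheme, and the probabilistic reassignment of the CLS token for high quality linear probing. Trained on a consolidated dataset of curated images, MeFEm outperforms strong baselines like FaRL and Franca on core anthropometric tasks despite using significantly less data. It also shows promising results on Body Mass Index (BMI) estimation, evaluated on a novel, consolidated closed-source dataset that addresses the domain bias prevalent in existing data. Model weights are available at https://huggingface.co/boretsyury/MeFEm , offering a strong baseline for future work in this domain.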

[229] Universal Image Immunization against Diffusion-based Image Editing via Semantic Injection

Chanhui Lee,Seunghyun Shin,Donggyu Choi,Hae-gon Jeon,Jeany Son

Main category: cs.CV

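TL;DR: This paper proposes the first universal image immunization framework. It generates a single universal adversarial perturbation (UAP) tailored to diffusion-based editing pipelines and defends against malicious text-prompted image editing without requiring training data.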

Details

Motivation: Diffusion-based image editing enables new creative uses but carries ethical and legal risks such as deepfakes and copyright infringement. Most existing image immunization methods optimize an adversarial perturbation separately for each image, which limits scalability and practicality. Method: Inspired by universal adversarial perturbations (UAPs) used in targeted attacks, the method designs a UAP that embeds a semantic target and suppresses the original content, misdirecting the attention of the diffusion editing model so that the original image semantics are overwritten. It also supports data-free settings. Result: The method significantly outperforms several baselines in the UAP setting, matches image-specific methods under a more restricted perturbation budget, and shows strong black-box transferability across different diffusion models. Conclusion: The proposed universal image immunization framework combines efficiency, practicality, and generalization, and is a notable step toward defending images against diffusion-based editing. Abstract: Recent advances in diffusion models have enabled powerful image editing capabilities guided by natural language prompts, unlocking new creative possibilities. However, they introduce significant ethical and legal risks, such as deepfakes and unauthorized use of copyrighted visual content. To address these risks, image immunization has emerged as a promising defense against AI-driven semantic manipulation. Yet, most existing approaches rely on image-specific adversarial perturbations that require individual optimization for each image, thereby limiting scalability and practicality. In this paper, we propose the first universal image immunization framework that generates a single, broadly applicable adversarial perturbation specifically designed for diffusion-based editing pipelines. Inspired by universal adversarial perturbation (UAP) techniques used in targeted attacks, our method generates a UAP that embeds a semantic target into images to be protected. Simultaneously, it suppresses original content to effectively misdirect the model's attention during editing. As a result, our approach effectively blocks malicious editing attempts by overwriting the original semantic content in the image via the UAP. Moreover, our method operates effectively even in data-free settings without requiring access to training data or domain knowledge, further enhancing its practicality and broad applicability in real-world scenarios. Extensive experiments show that our method, as the first universal immunization approach, significantly outperforms several baselines in the UAP setting. In addition, despite the inherent difficulty of universal perturbations, our method also achieves performance on par with image-specific methods under a more restricted perturbation budget, while also exhibiting strong black-box transferability across different diffusion models.
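
To make the "single, broadly applicable perturbation" idea concrete, the sketch below applies one shared perturbation to a whole batch of images under an L-infinity budget. It is only an illustration, not the authors' method: the function name, the budget of 8/255, and the random stand-in for an optimized UAP are all assumptions, and the semantic-injection objective used to optimize the perturbation is not shown.

```python
import numpy as np

def apply_universal_perturbation(images, delta, epsilon=8 / 255):
    """Add one shared perturbation to every image, within an L-infinity budget.

    `epsilon` is a hypothetical budget chosen for illustration only.
    """
    delta = np.clip(delta, -epsilon, epsilon)   # enforce the perturbation budget
    return np.clip(images + delta, 0.0, 1.0)    # keep pixels in the valid [0, 1] range

rng = np.random.default_rng(0)
images = rng.random((4, 32, 32, 3))              # toy batch of images in [0, 1]
delta = rng.normal(scale=0.1, size=(32, 32, 3))  # random stand-in for an optimized UAP
protected = apply_universal_perturbation(images, delta)
```

The same `delta` broadcasts over the batch dimension, which is what distinguishes a universal perturbation from image-specific ones that need a separate optimization per image.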

[230] It's a Matter of Time: Three Lessons on Long-Term Motion for Perception

Willem Davison,Xinyue Hao,Laura Sevilla-Lara

Main category: cs.CV

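TL;DR: This paper studies the role of long-term motion information in visual perception. It finds that such information helps in understanding not only actions but also objects, materials, and spatial information, generalizes better in low-data and zero-shot settings, and is computationally efficient.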

Details

Motivation: Temporal information is essential for perception, yet its role is less well understood than that of image information. The paper investigates what long-term motion can reveal about the world and what properties it has for visual learning. Method: The authors build on recent advances in point-track estimation to learn temporal representations and run experiments on a variety of perceptual tasks. Result: 1) Long-term motion representations carry information about actions, objects, materials, and spatial layout, often even more than images. 2) They generalize far better than image representations in low-data and zero-shot settings. 3) They are low-dimensional and computationally efficient, and combining them with video representations improves performance. Conclusion: Long-term motion is an underrated but highly promising signal for visual perception, and future models should make fuller use of it. Abstract: Temporal information has long been considered to be essential for perception. While there is extensive research on the role of image information for perceptual tasks, the role of the temporal dimension remains less well understood: What can we learn about the world from long-term motion information? What properties does long-term motion information have for visual learning? We leverage recent success in point-track estimation, which offers an excellent opportunity to learn temporal representations and experiment on a variety of perceptual tasks. We draw 3 clear lessons: 1) Long-term motion representations contain information to understand actions, but also objects, materials, and spatial information, often even better than images. 2) Long-term motion representations generalize far better than image representations in low-data settings and in zero-shot tasks. 3) The very low dimensionality of motion information makes motion representations a better trade-off between GFLOPs and accuracy than standard video representations, and used together they achieve higher performance than video representations alone. We hope these insights will pave the way for the design of future models that leverage the power of long-term motion information for perception.

[231] Depth Completion as Parameter-Efficient Test-Time Adaptation

Bingxin Ke,Qunjie Zhou,Jiahui Huang,Xuanchi Ren,Tianchang Shen,Konrad Schindler,Laura Leal-Taixé,Shengyu Huang

Main category: cs.CV

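TL;DR: CAPA is a parameter-efficient test-time optimization framework that adapts pre-trained 3D foundation models to depth completion using sparse geometric cues. It freezes the backbone, fine-tunes only a small set of parameters, and for videos shares parameters across the sequence to improve temporal consistency and robustness.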

Details

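Motivation: Existing methods typically train task-specific encoders for auxiliary inputs, which tend to overfit and generalize poorly. Sparse observations available at test time should instead guide the geometric prior of a foundation model to correct distortions and misplaced structures. Method: CAPA freezes the backbone of a 3D foundation model and updates only a small set of parameters through parameter-efficient fine-tuning (e.g., LoRA or VPT), with gradients computed directly from the sparse observations available at test time. For videos, it adds sequence-level parameter sharing and jointly optimizes all frames to exploit temporal correlations. Result: CAPA achieves state-of-the-art results on multiple indoor and outdoor datasets and is compatible with any ViT-based foundation model, making it model-agnostic with strong generalization. Conclusion: Through lightweight test-time parameter updates and joint video-level optimization, CAPA adapts 3D foundation models to individual scenes efficiently and robustly, offering a new approach to depth completion. Abstract: We introduce CAPA, a parameter-efficient test-time optimization framework that adapts pre-trained 3D foundation models (FMs) for depth completion, using sparse geometric cues. Unlike prior methods that train task-specific encoders for auxiliary inputs, which often overfit and generalize poorly, CAPA freezes the FM backbone. Instead, it updates only a minimal set of parameters using Parameter-Efficient Fine-Tuning (e.g. LoRA or VPT), guided by gradients calculated directly from the sparse observations available at inference time. This approach effectively grounds the foundation model's geometric prior in the scene-specific measurements, correcting distortions and misplaced structures. For videos, CAPA introduces sequence-level parameter sharing, jointly adapting all frames to exploit temporal correlations, improve robustness, and enforce multi-frame consistency. CAPA is model-agnostic, compatible with any ViT-based FM, and achieves state-of-the-art results across diverse condition patterns on both indoor and outdoor datasets. Project page: research.nvidia.com/labs/dvl/projects/capa.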
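
The core loop described above (frozen predictions, a handful of trainable parameters, gradients from sparse measurements only) can be illustrated with a deliberately tiny stand-in. The sketch below is an assumption-laden toy, not CAPA itself: instead of LoRA or VPT parameters inside a ViT, it fits just a global scale and shift on top of a frozen depth prediction, and the function name, learning rate, and synthetic data are all hypothetical.

```python
import numpy as np

def adapt_to_sparse_depth(pred, sparse_depth, mask, steps=2000, lr=0.3):
    """Fit a scale and shift on a frozen prediction using only observed pixels.

    Toy stand-in for parameter-efficient test-time adaptation: the two
    scalars play the role of the small trainable parameter set.
    """
    scale, shift = 1.0, 0.0
    p, d = pred[mask], sparse_depth[mask]           # only sparse observations are used
    for _ in range(steps):
        residual = scale * p + shift - d            # error at observed pixels
        scale -= lr * 2.0 * np.mean(residual * p)   # gradient of the mean squared error
        shift -= lr * 2.0 * np.mean(residual)
    return scale * pred + shift, (scale, shift)

rng = np.random.default_rng(0)
pred = rng.random((32, 32))                  # frozen model output (relative depth)
true_depth = 2.0 * pred + 0.5                # synthetic metric depth for the toy scene
mask = rng.random(pred.shape) < 0.05         # roughly 5% of pixels are measured
sparse_depth = np.where(mask, true_depth, 0.0)
adapted, (scale, shift) = adapt_to_sparse_depth(pred, sparse_depth, mask)
```

The point of the design is that the prediction itself is never retrained: only the few adaptation parameters receive gradients, and those gradients come exclusively from pixels where a measurement exists.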

[232] SAILS: Segment Anything with Incrementally Learned Semantics for Task-Invariant and Training-Free Continual Learning

Shishir Muralidhara,Didier Stricker,René Schuster

Main category: cs.CV

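TL;DR: SAILS is a training-free framework for class-incremental semantic segmentation. It uses SAM for zero-shot region extraction and associates regions with semantics through prototypes in a fixed feature space, which avoids forgetting and yields positive backward transfer.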

Details

Motivation: Continual learning is limited by repeated retraining, high computational cost, and catastrophic forgetting, which makes it hard to apply in practice. Method: SAILS decouples class-incremental semantic segmentation (CISS) into two stages: 1) zero-shot region extraction with the Segment Anything Model (SAM); 2) semantic association in a fixed feature space, where selective intra-class clustering produces multiple prototypes per class. Result: Without any incremental training, SAILS typically surpasses existing training-based methods on standard CISS datasets, especially on long and difficult task sequences. It eliminates forgetting entirely and exhibits positive backward transfer. Conclusion: SAILS offers an efficient, forgetting-free approach to CISS that requires no parameter updates, making continual learning considerably more applicable to real-world settings. Abstract: Continual learning remains constrained by the need for repeated retraining, high computational costs, and the persistent challenge of forgetting. These factors significantly limit the applicability of continual learning in real-world settings, as iterative model updates require significant computational resources and inherently exacerbate forgetting. We present SAILS -- Segment Anything with Incrementally Learned Semantics, a training-free framework for Class-Incremental Semantic Segmentation (CISS) that sidesteps these challenges entirely. SAILS leverages foundational models to decouple CISS into two stages: Zero-shot region extraction using Segment Anything Model (SAM), followed by semantic association through prototypes in a fixed feature space. SAILS incorporates selective intra-class clustering, resulting in multiple prototypes per class to better model intra-class variability. Our results demonstrate that, despite requiring no incremental training, SAILS typically surpasses the performance of existing training-based approaches on standard CISS datasets, particularly in long and challenging task sequences where forgetting tends to be most severe. By avoiding parameter updates, SAILS completely eliminates forgetting and maintains consistent, task-invariant performance. Furthermore, SAILS exhibits positive backward transfer, where the introduction of new classes can enhance performance on previous classes.
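
The second stage, semantic association through prototypes in a fixed feature space, can be sketched as a prototype memory that only ever grows. This is a minimal illustration under stated assumptions rather than the SAILS implementation: the class name `PrototypeMemory`, the plain k-means clustering, the cosine-similarity matching, and the 2-D toy features are all hypothetical, and the SAM region extraction and feature extractor are not shown.

```python
import numpy as np

class PrototypeMemory:
    """Stores several prototypes per class; new classes never modify old ones."""

    def __init__(self):
        self.prototypes = []   # one feature vector per prototype
        self.labels = []       # class label of each prototype

    def add_class(self, label, features, n_prototypes=2, iters=10):
        # Simple k-means as a stand-in for intra-class clustering.
        centers = features[:n_prototypes].copy()
        for _ in range(iters):
            dists = np.linalg.norm(features[:, None] - centers[None], axis=-1)
            assign = dists.argmin(axis=1)
            for k in range(n_prototypes):
                if np.any(assign == k):
                    centers[k] = features[assign == k].mean(axis=0)
        self.prototypes.extend(centers)
        self.labels.extend([label] * n_prototypes)

    def classify(self, feature):
        # Label a region feature by its most similar prototype (cosine similarity).
        protos = np.stack(self.prototypes)
        norms = np.linalg.norm(protos, axis=1) * np.linalg.norm(feature) + 1e-12
        return self.labels[int((protos @ feature / norms).argmax())]

rng = np.random.default_rng(0)
memory = PrototypeMemory()
memory.add_class("cat", rng.normal([5, 0], 0.1, size=(20, 2)))
before = memory.classify(np.array([5.0, 0.1]))
memory.add_class("dog", rng.normal([0, 5], 0.1, size=(20, 2)))   # a later task
after = memory.classify(np.array([5.0, 0.1]))
```

Because adding a class only appends prototypes and no parameters are updated, earlier prototypes are left untouched, which is the sense in which such a scheme cannot forget.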

[233] GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

Shih-Fang Chen,Jun-Cheng Chen,I-Hong Jhuo,Yen-Yu Lin

Main category: cs.CV

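TL;DR: This paper proposes the GOT-JEPA pretraining framework and the OccuSolver module, which use model-predictive learning to improve the generalization and robustness of generic object trackers in dynamic environments, particularly under occlusion.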

Details

Motivation: Existing generic object trackers generalize poorly to unseen scenarios, and their occlusion modeling is coarse and lacks fine-grained perception. Method: 1) GOT-JEPA extends the JEPA paradigm: a student predictor learns, from a corrupted frame, to predict the pseudo-tracking models that a teacher predictor generates from the clean frame, which provides stable pseudo supervision. 2) OccuSolver builds on a point tracker and, conditioned on iteratively generated object priors, progressively refines visibility states and occlusion patterns. Result: Evaluations on seven benchmarks show clear gains in tracker generalization and robustness. Conclusion: Together, GOT-JEPA and OccuSolver improve a tracker's ability to adapt to complex dynamic scenes, especially fine-grained occlusion, and suggest a new approach to generic tracking. Abstract: The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current frame. This design provides stable pseudo supervision and explicitly trains the predictor to produce reliable tracking models under occlusions, distractors, and other adverse observations, improving generalization to dynamic environments. Building on GOT-JEPA, we further propose OccuSolver to enhance occlusion perception for object tracking. OccuSolver adapts a point-centric point tracker for object-aware visibility estimation and detailed occlusion-pattern capture. Conditioned on object priors iteratively generated by the tracker, OccuSolver incrementally refines visibility states, strengthens occlusion handling, and produces higher-quality reference labels that progressively improve subsequent model predictions. Extensive evaluations on seven benchmarks show that our method effectively enhances tracker generalization and robustness.

[234] VIPA: Visual Informative Part Attention for Referring Image Segmentation

Yubin Cho,Hyunwoo Yu,Kyeongbo Kong,Kyomin Sohn,Bongjoon Hyun,Suk-Ju Kang

Main category: cs.CV

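TL;DR: This paper proposes Visual Informative Part Attention (VIPA), a new framework for referring image segmentation (RIS). By introducing a "visual expression" and a visual expression generator (VEG) module, it improves cross-modal alignment and fine-grained segmentation and achieves state-of-the-art results on four public benchmarks.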

Details

Motivation: Existing methods inject visual information into language tokens but do not fully exploit visual context for fine-grained segmentation. High-variance cross-modal projection needs to be reduced and semantic consistency strengthened. Method: VIPA is built around a "visual expression", that is, the informative parts of the visual context that carry structural and semantic information about the target. The VEG module retrieves visual tokens using local-global linguistic cues and refines them to suppress noise and share informative visual attributes. Result: VIPA outperforms existing state-of-the-art methods on four public RIS benchmarks (such as RefCOCO and RefCOCO+), and visual analysis confirms that its attention aligns robustly with fine-grained target regions. Conclusion: By explicitly modeling informative visual parts and combining them with the context-aware VEG module, VIPA improves cross-modal semantic alignment and segmentation accuracy in RIS and offers a new direction for fine-grained vision-language understanding. Abstract: Referring Image Segmentation (RIS) aims to segment a target object described by a natural language expression. Existing methods have evolved by incorporating visual information into the language tokens. To more effectively exploit visual contexts for fine-grained segmentation, we propose a novel Visual Informative Part Attention (VIPA) framework for referring image segmentation. VIPA leverages the informative parts of visual contexts, called a visual expression, which can effectively provide the structural and semantic visual target information to the network. This design reduces high-variance cross-modal projection and enhances semantic consistency in an attention mechanism of the referring image segmentation. We also design a visual expression generator (VEG) module, which retrieves informative visual tokens via local-global linguistic context cues and refines the retrieved tokens for reducing noise information and sharing informative visual attributes. This module allows the visual expression to consider comprehensive contexts and capture semantic visual contexts of informative regions. In this way, our framework enables the network's attention to robustly align with the fine-grained regions of interest. Extensive experiments and visual analysis demonstrate the effectiveness of our approach. Our VIPA outperforms the existing state-of-the-art methods on four public RIS benchmarks.

[235] Debiasing Central Fixation Confounds Reveals a Peripheral "Sweet Spot" for Human-like Scanpaths in Hard-Attention Vision

Pengcheng Pan,Yonekura Shogo,Yasuo Kuniyoshi

Main category: cs.CV

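TL;DR: This paper shows that scanpath-based evaluation metrics are easily confounded by center bias on object-centric datasets, and proposes GCS, a debiased metric that more accurately measures how well hard-attention models match human gaze behavior.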

Details

Motivation: Existing scanpath-based metrics are easily confounded by center bias on object-centric datasets, which leads to misjudging how well a model's behavior aligns with human gaze. Method: The authors analyze the effect of center bias on the Gaze-CIFAR-10 dataset, systematically vary foveal patch size and peripheral context to identify a peripheral "sweet spot", and propose GCS, a center-debiased composite metric that incorporates movement similarity. Result: Only under specific sensory constraints do model scanpaths both exceed the center baseline and show human-like temporal statistics. GCS reveals a robust sweet spot at medium patch size and identifies a "shortcut regime" when the field of view becomes too large. Conclusion: Active perception models should be evaluated with debiased metrics such as GCS to separate genuine behavioral alignment from central tendency, and gaze benchmarks should be designed to make evaluation more reliable. Abstract: Human eye movements in visual recognition reflect a balance between foveal sampling and peripheral context. Task-driven hard-attention models for vision are often evaluated by how well their scanpaths match human gaze. However, common scanpath metrics can be strongly confounded by dataset-specific center bias, especially on object-centric datasets. Using Gaze-CIFAR-10, we show that a trivial center-fixation baseline achieves surprisingly strong scanpath scores, approaching many learned policies. This makes standard metrics optimistic and blurs the distinction between genuine behavioral alignment and mere central tendency. We then analyze a hard-attention classifier under constrained vision by sweeping foveal patch size and peripheral context, revealing a peripheral sweet spot: only a narrow range of sensory constraints yields scanpaths that are simultaneously (i) above the center baseline after debiasing and (ii) temporally human-like in movement statistics. To address center bias, we propose GCS (Gaze Consistency Score), a center-debiased composite metric augmented with movement similarity. GCS uncovers a robust sweet spot at medium patch size with both foveal and peripheral vision, that is not obvious from raw scanpath metrics or accuracy alone, and also highlights a "shortcut regime" when the field-of-view becomes too large. We discuss implications for evaluating active perception on object-centric datasets and for designing gaze benchmarks that better separate behavioral alignment from center bias.

[236] Integrating Affordances and Attention models for Short-Term Object Interaction Anticipation

Lorenzo Mur Labadia,Ruben Martinez-Cantin,Jose J. Guerrero,Giovanni M. Farinella,Antonino Furnari

Main category: cs.CV

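TL;DR: This paper proposes STAformer and an improved variant that combine frame-guided temporal pooling, dual image-video attention, and multiscale feature fusion to improve short-term object interaction anticipation. It also introduces two new modules, environment affordance modeling and interaction hotspot prediction, which substantially raise mAP on Ego4D and EPIC-Kitchens.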

Details

Motivation: Short-term object interaction anticipation (STA) is essential for wearable assistants to understand user goals and for human-robot interaction, but existing methods fall short in accuracy and temporal modeling. Method: The paper proposes STAformer and STAformer++, two attention-based architectures, and introduces two new modules, environment affordance modeling (a persistent memory of the scene) and interaction hotspot prediction (based on hand and object trajectories), to support STA prediction from an image-video input pair. Result: Overall Top-5 mAP improves by up to 23 and 31 percentage points on Ego4D and EPIC-Kitchens, respectively. Code, annotations, and pre-extracted affordance features are released. Conclusion: Fusing multi-source spatiotemporal cues with affordance priors substantially improves STA prediction and provides an effective route to intent understanding in embodied intelligence. Abstract: Short Term object-interaction Anticipation consists in detecting the location of the next active objects, the noun and verb categories of the interaction, as well as the time to contact from the observation of egocentric video. This ability is fundamental for wearable assistants to understand user goals and provide timely assistance, or to enable human-robot interaction. In this work, we present a method to improve the performance of STA predictions. Our contributions are two-fold: 1) We propose STAformer and STAformer++, two novel attention-based architectures integrating frame-guided temporal pooling, dual image-video attention, and multiscale feature fusion to support STA predictions from an image-input video pair; 2) We introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a persistent memory of interactions that can take place in a given physical scene. We explore how to integrate environment affordances via simple late fusion and with an approach which adaptively learns how to best fuse affordances with end-to-end predictions. Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot. Our results show significant improvements on Overall Top-5 mAP, with gains up to +23 p.p. on Ego4D and +31 p.p. on a novel set of curated EPIC-Kitchens STA labels. We released the code, annotations, and pre-extracted affordances on Ego4D and EPIC-Kitchens to encourage future research in this area.

[237] Multi-dimensional Persistent Sheaf Laplacians for Image Analysis

Xiang Xiang Wang,Guo-Wei Wei

Main category: cs.CV

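TL;DR: This paper proposes a multi-dimensional persistent sheaf Laplacian (MPSL) framework on simplicial complexes for image analysis, which fuses topological spectral features across multiple reduced dimensions to improve classification stability and performance.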

Details

Motivation: Common dimensionality reduction methods such as PCA are sensitive to the chosen reduced dimension, and selecting a single dimension or averaging across dimensions fails to capture information at different scales. Method: Image samples are treated as simplicial complexes. Persistent sheaf Laplacians are built at each of several reduced dimensions to extract multiscale localized topological spectral representations, and statistical summaries are aggregated across scales and dimensions to form the image representation. Result: On the COIL20 and ETH80 datasets, the method consistently outperforms PCA baselines at moderate reduced dimensions and is more robust to the choice of dimension. Conclusion: By integrating multi-dimensional, multiscale topological spectral information, MPSL mitigates the sensitivity of conventional dimensionality reduction to the choice of dimension and improves the discriminability and robustness of image representations. Abstract: We propose a multi-dimensional persistent sheaf Laplacian (MPSL) framework on simplicial complexes for image analysis. The proposed method is motivated by the strong sensitivity of commonly used dimensionality reduction techniques, such as principal component analysis (PCA), to the choice of reduced dimension. Rather than selecting a single reduced dimension or averaging results across dimensions, we exploit complementary advantages of multiple reduced dimensions. At a given dimension, image samples are regarded as simplicial complexes, and persistent sheaf Laplacians are utilized to extract a multiscale localized topological spectral representation for individual image samples. Statistical summaries of the resulting spectra are then aggregated across scales and dimensions to form multiscale multi-dimensional image representations. We evaluate the proposed framework on the COIL20 and ETH80 image datasets using standard classification protocols. Experimental results show that the proposed method provides more stable performance across a wide range of reduced dimensions and achieves consistent improvements over PCA-based baselines in moderate dimensional regimes.

[238] CT-Bench: A Benchmark for Multimodal Lesion Understanding in Computed Tomography

Qingqing Zhu,Qiao Jin,Tejas S. Mathai,Yin Fang,Zhizheng Wang,Yifan Yang,Maame Sarfo-Gyamfi,Benjamin Hou,Ran Gu,Praveen T. S. Balamuralikrishna,Kenneth C. Wang,Ronald M. Summers,Zhiyong Lu

Main category: cs.CV

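TL;DR: This paper introduces CT-Bench, a first-of-its-kind benchmark for lesion analysis in CT imaging. It comprises an annotated lesion image and metadata set and a multitask visual question answering benchmark for evaluating and improving multimodal AI models on lesion localization, description, size estimation, and attribute categorization.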

Details

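Motivation: AI for automatic lesion segmentation and report generation in CT is limited by the scarcity of publicly available CT datasets with lesion-level annotations. Method: CT-Bench has two components: (1) a Lesion Image and Metadata Set containing 20,335 lesions from 7,795 CT studies, with bounding boxes, descriptions, and size information; (2) a multitask visual question answering benchmark with 2,850 QA pairs covering lesion localization, description, size estimation, and attribute categorization, with hard negative examples to reflect real diagnostic challenges. Multiple state-of-the-art multimodal models (including vision-language models and medical CLIP variants) are evaluated and compared with radiologist assessments, and models are also fine-tuned on the Lesion Image and Metadata Set. Result: Several state-of-the-art multimodal models are systematically evaluated on CT-Bench, and the comparison with radiologist assessments supports the benchmark's validity. Fine-tuning on the Lesion Image and Metadata Set yields significant gains on both components. Conclusion: CT-Bench is the first benchmark to comprehensively support lesion-level analysis in CT, combining scale, rich annotation, and clinical realism, and can advance research and clinical adoption of medical AI for CT understanding. Abstract: Artificial intelligence (AI) can automatically delineate lesions on computed tomography (CT) and generate radiology report content, yet progress is limited by the scarcity of publicly available CT datasets with lesion-level annotations. To bridge this gap, we introduce CT-Bench, a first-of-its-kind benchmark dataset comprising two components: a Lesion Image and Metadata Set containing 20,335 lesions from 7,795 CT studies with bounding boxes, descriptions, and size information, and a multitask visual question answering benchmark with 2,850 QA pairs covering lesion localization, description, size estimation, and attribute categorization. Hard negative examples are included to reflect real-world diagnostic challenges. We evaluate multiple state-of-the-art multimodal models, including vision-language and medical CLIP variants, by comparing their performance to radiologist assessments, demonstrating the value of CT-Bench as a comprehensive benchmark for lesion analysis. Moreover, fine-tuning models on the Lesion Image and Metadata Set yields significant performance gains across both components, underscoring the clinical utility of CT-Bench.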

[239] Wrivinder: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery

Chandrakanth Gudavalli,Tajuddin Manhar Mohammed,Abhay Yadav,Ananth Vishnu Bhaskar,Hardik Prajapati,Cheng Peng,Rama Chellappa,Shivkumar Chandrasekaran,B. S. Manjunath

Main category: cs.CV

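TL;DR: Wrivinder is a zero-shot, geometry-driven framework that aggregates multiple ground photographs to reconstruct a consistent 3D scene and aligns it with satellite imagery for accurate geo-localization of ground cameras. The authors also release the MC-Sat dataset to support evaluation of this task.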

Details

Motivation: Aligning ground imagery with satellite maps is difficult under large viewpoint gaps or unreliable GPS, and suitable benchmark data is lacking. Method: Wrivinder combines structure from motion (SfM), 3D Gaussian Splatting, semantic grounding, and metric cues from monocular depth to render a stable zenith view that can be matched to satellite imagery. Result: In zero-shot experiments, Wrivinder achieves sub-30 m geolocation accuracy on both dense and large-area scenes. Conclusion: Geometry-driven multi-view aggregation improves the robustness and accuracy of ground-to-satellite alignment without paired supervision, and Wrivinder and MC-Sat together form a first comprehensive baseline and testbed for this direction. Abstract: Aligning ground-level imagery with geo-registered satellite maps is crucial for mapping, navigation, and situational awareness, yet remains challenging under large viewpoint gaps or when GPS is unreliable. We introduce Wrivinder, a zero-shot, geometry-driven framework that aggregates multiple ground photographs to reconstruct a consistent 3D scene and align it with overhead satellite imagery. Wrivinder combines SfM reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular depth-based metric cues to produce a stable zenith-view rendering that can be directly matched to satellite context for metrically accurate camera geo-localization. To support systematic evaluation of this task, which lacks suitable benchmarks, we also release MC-Sat, a curated dataset linking multi-view ground imagery with geo-registered satellite tiles across diverse outdoor environments. Together, Wrivinder and MC-Sat provide a first comprehensive baseline and testbed for studying geometry-centered cross-view alignment without paired supervision. In zero-shot experiments, Wrivinder achieves sub-30 m geolocation accuracy across both dense and large-area scenes, highlighting the promise of geometry-based aggregation for robust ground-to-satellite localization.

[240] AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories

Zun Wang,Han Lin,Jaehong Yoon,Jaemin Cho,Yue Zhang,Mohit Bansal

Main category: cs.CV

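TL;DR: This paper proposes AnchorWeave, a framework that replaces an error-prone global 3D reconstruction memory with multiple well-aligned local geometric memories, and introduces a multi-anchor weaving controller and coverage-driven local memory retrieval, substantially improving spatial consistency in long-horizon camera-controllable video generation.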

Details

Motivation: Memory methods based on global 3D reconstruction suffer from cross-view geometric inconsistency caused by multi-view pose and depth estimation errors. Once fused, these errors become noise that degrades generation quality. Method: AnchorWeave performs coverage-driven local memory retrieval aligned with the target trajectory and fuses multiple clean local geometric memories during generation through a multi-anchor weaving controller. Result: Scene spatial consistency improves substantially on long-horizon camera-controllable video generation while visual quality remains high. Ablations confirm the effectiveness of local geometric conditioning, multi-anchor control, and coverage-driven retrieval. Conclusion: Replacing a global memory with multiple well-aligned local geometric memories and fusing them intelligently is an effective way to improve spatial consistency in long-horizon video generation. Abstract: Maintaining spatial world consistency over long horizons remains a central challenge for camera-controllable video generation. Existing memory-based approaches often condition generation on globally reconstructed 3D scenes by rendering anchor videos from the reconstructed geometry in the history. However, reconstructing a global 3D scene from multiple views inevitably introduces cross-view misalignment, as pose and depth estimation errors cause the same surfaces to be reconstructed at slightly different 3D locations across views. When fused, these inconsistencies accumulate into noisy geometry that contaminates the conditioning signals and degrades generation quality. We introduce AnchorWeave, a memory-augmented video generation framework that replaces a single misaligned global memory with multiple clean local geometric memories and learns to reconcile their cross-view inconsistencies. To this end, AnchorWeave performs coverage-driven local memory retrieval aligned with the target trajectory and integrates the selected local memories through a multi-anchor weaving controller during generation. Extensive experiments demonstrate that AnchorWeave significantly improves long-term scene consistency while maintaining strong visual quality, with ablation and analysis studies further validating the effectiveness of local geometric conditioning, multi-anchor control, and coverage-driven retrieval.

[241] PAct: Part-Decomposed Single-View Articulated Object Generation

Qingming Liu,Xinyue Yao,Shuyuan Zhang,Yueci Deng,Guiliang Liu,Zhen Liu,Kui Jia

Main category: cs.CV

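TL;DR: This paper proposes a part-centric generative framework that quickly produces high-fidelity, movable 3D articulated objects from a single image. It avoids time-consuming per-instance optimization while preserving structural correctness, plausible motion, and controllable interaction.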

Details

Motivation: Existing articulated object modeling methods are either slow (optimization or distillation based) or inaccurate (template or retrieval based), and cannot meet the scalability, fidelity, and speed requirements of interactive 3D applications. Method: The paper proposes a part-based generative representation: an object is modeled as a set of movable parts, each encoded by latent tokens augmented with part identity and articulation cues. Conditioned on a single image, the model generates, end to end, 3D assets with consistent geometry, composition, and motion. Result: On common articulated categories such as drawers and doors, the method improves input consistency, part accuracy, and articulation plausibility over optimization-based and retrieval-based baselines while substantially reducing inference time. Conclusion: The part-centric generative framework enables efficient, controllable, high-fidelity synthesis of articulated objects and offers a new approach to interactive 3D content creation for embodied AI, robotics, and VR/AR. Abstract: Articulated objects are central to interactive 3D applications, including embodied AI, robotics, and VR/AR, where functional part decomposition and kinematic motion are essential. Yet producing high-fidelity articulated assets remains difficult to scale because it requires reliable part decomposition and kinematic rigging. Existing approaches largely fall into two paradigms: optimization-based reconstruction or distillation, which can be accurate but often takes tens of minutes to hours per instance, and inference-time methods that rely on template or part retrieval, producing plausible results that may not match the specific structure and appearance in the input observation. We introduce a part-centric generative framework for articulated object creation that synthesizes part geometry, composition, and articulation under explicit part-aware conditioning. Our representation models an object as a set of movable parts, each encoded by latent tokens augmented with part identity and articulation cues. Conditioned on a single image, the model generates articulated 3D assets that preserve instance-level correspondence while maintaining valid part structure and motion. The resulting approach avoids per-instance optimization, enables fast feed-forward inference, and supports controllable assembly and articulation, which are important for embodied interaction. Experiments on common articulated categories (e.g., drawers and doors) show improved input consistency, part accuracy, and articulation plausibility over optimization-based and retrieval-driven baselines, while substantially reducing inference time.

[242] ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery

Ayush Shrivastava,Kirtan Gangani,Laksh Jain,Mayank Goel,Nipun Batra

Main category: cs.CV

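TL;DR: This paper presents ThermEval-B, a benchmark of about 55,000 thermal visual question answering pairs for evaluating how well vision language models (VLMs) understand thermal images. Experiments show that current VLMs perform poorly on temperature-grounded reasoning and related tasks, which calls for evaluation and modeling methods designed specifically for thermal imagery.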

Details

Motivation: Existing vision language models perform well on RGB images but do not generalize to thermal imagery. Thermal sensing is critical in low-light settings such as nighttime surveillance, search and rescue, autonomous driving, and medical screening, and its physical semantics (temperature) differ fundamentally from those of RGB images (color and texture), so a dedicated benchmark is needed. Method: The authors build ThermEval-B, a structured thermal visual question answering benchmark that combines public datasets with the newly collected ThermEval-D, the first dataset to provide dense per-pixel temperature maps with semantic body-part annotations. They systematically evaluate 25 open-source and closed-source VLMs on temperature reasoning, robustness to colormap changes, and reliance on language priors. Result: All evaluated VLMs consistently fail at temperature-grounded reasoning, are sensitive to colormap transformations, and tend to fall back on language priors or fixed answers. Prompting and supervised fine-tuning bring only marginal gains. Conclusion: Thermal understanding cannot rest on RGB-centric assumptions and requires dedicated evaluation. ThermEval-B provides a foundational benchmark for advancing thermal vision language modeling. Abstract: Vision language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. Unlike RGB imagery, thermal images encode physical temperature rather than color or texture, requiring perceptual and reasoning capabilities that existing RGB-centric benchmarks do not evaluate. We introduce ThermEval-B, a structured benchmark of approximately 55,000 thermal visual question answering pairs designed to assess the foundational primitives required for thermal vision language understanding. ThermEval-B integrates public datasets with our newly collected ThermEval-D, the first dataset to provide dense per-pixel temperature maps with semantic body-part annotations across diverse indoor and outdoor environments. Evaluating 25 open-source and closed-source VLMs, we find that models consistently fail at temperature-grounded reasoning, degrade under colormap transformations, and default to language priors or fixed responses, with only marginal gains from prompting or supervised fine-tuning. These results demonstrate that thermal understanding requires dedicated evaluation beyond RGB-centric assumptions, positioning ThermEval as a benchmark to drive progress in thermal vision language modeling.

[243] Image Generation with a Sphere Encoder

Kaiyu Yue,Menglin Jia,Ji Hou,Tom Goldstein

Main category: cs.CV

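TL;DR: This paper proposes the Sphere Encoder, an efficient generative framework that maps images onto a spherical latent space and decodes directly from sampled points, matching many-step diffusion models in image generation with fewer than five steps.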

Details

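Motivation: Diffusion models have high inference cost and require many generation steps. The paper seeks a more efficient paradigm that generates high-quality images in one or very few steps. Method: An encoder maps natural images uniformly onto a unit-sphere latent space, and a decoder maps random points on the sphere back to image space. The model is trained end to end using only image reconstruction losses. Generation samples a point on the sphere and decodes it. The architecture supports conditional generation, and looping the encoder and decoder can further improve quality. Result: Across several datasets, the Sphere Encoder is competitive in image quality with state-of-the-art diffusion models at a small fraction of their inference cost. Conclusion: The Sphere Encoder offers a novel, efficient, low-cost generative modeling paradigm, demonstrates that high-quality single-step or near-single-step image generation is feasible, and opens a new path for improving the efficiency of generative models. Abstract: We introduce the Sphere Encoder, an efficient generative framework capable of producing images in a single forward pass and competing with many-step diffusion models using fewer than five steps. Our approach works by learning an encoder that maps natural images uniformly onto a spherical latent space, and a decoder that maps random latent vectors back to the image space. Trained solely through image reconstruction losses, the model generates an image by simply decoding a random point on the sphere. Our architecture naturally supports conditional generation, and looping the encoder/decoder a few times can further enhance image quality. Across several datasets, the sphere encoder approach yields performance competitive with state-of-the-art diffusion models, but with a small fraction of the inference cost. Project page is available at https://sphere-encoder.github.io .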
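
Generation in this setup starts from "a random point on the sphere". A standard way to draw such points uniformly is to normalize Gaussian vectors, sketched below. This shows only the sampling step under assumptions: the function names and the latent dimension of 16 are hypothetical, and the learned encoder and decoder are not shown.

```python
import numpy as np

def sample_on_sphere(n, dim, rng):
    """Draw n points uniformly from the unit sphere in `dim` dimensions.

    Normalizing isotropic Gaussian vectors gives a uniform distribution
    on the sphere because the Gaussian is rotation invariant.
    """
    z = rng.standard_normal((n, dim))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def project_to_sphere(latents):
    """Map arbitrary latent vectors onto the unit sphere (e.g., encoder outputs)."""
    return latents / np.linalg.norm(latents, axis=1, keepdims=True)

rng = np.random.default_rng(0)
points = sample_on_sphere(1000, 16, rng)   # 16 is a hypothetical latent size
# A trained decoder would map each row of `points` to an image.
```

If the encoder really does spread training images uniformly over the sphere, every sampled point lies in a region the decoder has seen during training, which is the intuition behind decoding random points directly.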

[244] EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing

Yehonathan Litman,Shikun Liu,Dario Seyb,Nicholas Milef,Yang Zhou,Carl Marshall,Shubham Tulsiani,Caleb Leak

Main category: cs.CV

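TL;DR: This paper proposes EditCtrl, an efficient video editing control framework that uses a local video context module and a lightweight global temporal context embedder to substantially reduce computational cost while preserving editing quality.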

Details

Motivation: Existing generative video editing methods built on pre-trained video foundation models are computationally expensive and inefficient, because they process the full video context even when only a local edit is needed. Method: The EditCtrl framework combines a local video context module that operates only on the masked region (with compute proportional to the edit size) and a lightweight temporal global context embedder that keeps the video consistent at low overhead. Result: EditCtrl is 10 times more compute efficient than state-of-the-art methods and surpasses full-attention methods in editing quality; it also enables new capabilities such as multi-region text-driven editing and autoregressive content propagation. Conclusion: EditCtrl achieves both efficiency and quality in video editing, offering a practical route to real-time, fine-grained video editing in real applications. Abstract: High-fidelity generative video editing has seen significant quality improvements by leveraging pre-trained video foundation models. However, their computational cost is a major bottleneck, as they are often designed to inefficiently process the full video context regardless of the inpainting mask's size, even for sparse, localized edits. In this paper, we introduce EditCtrl, an efficient video inpainting control framework that focuses computation only where it is needed. Our approach features a novel local video context module that operates solely on masked tokens, yielding a computational cost proportional to the edit size. This local-first generation is then guided by a lightweight temporal global context embedder that ensures video-wide context consistency with minimal overhead. Not only is EditCtrl 10 times more compute efficient than state-of-the-art generative editing methods, it even improves editing quality compared to methods designed with full-attention. Finally, we showcase how EditCtrl unlocks new capabilities, including multi-region editing with text prompts and autoregressive content propagation.
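
The idea of operating solely on masked tokens can be illustrated with a small gather/scatter sketch. This is not EditCtrl's architecture: the token count, feature size, function names, and the placeholder "local module" (a simple scaling) are assumptions for illustration. The point is only that the work done scales with the number of masked tokens, not with the full video.

```python
import numpy as np

def gather_masked_tokens(tokens, mask):
    """Return only the tokens inside the edit mask, plus their flat indices."""
    idx = np.flatnonzero(mask)
    return tokens[idx], idx

def scatter_back(tokens, edited, idx):
    """Write edited tokens back into a copy of the full token sequence."""
    out = tokens.copy()
    out[idx] = edited
    return out

rng = np.random.default_rng(0)
num_tokens, dim = 1000, 32  # assumed sizes for a flattened video token grid
tokens = rng.standard_normal((num_tokens, dim))

mask = np.zeros(num_tokens, dtype=bool)
mask[100:150] = True  # a sparse, localized edit covering 5% of the tokens

local_tokens, idx = gather_masked_tokens(tokens, mask)
edited = local_tokens * 0.5  # hypothetical placeholder for the local module
result = scatter_back(tokens, edited, idx)
```

Here the placeholder module touches 50 tokens instead of 1000, and tokens outside the mask are returned unchanged. In the paper, video-wide consistency is handled separately by the global context embedder, which this sketch does not model.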

Table of Contents

cs.CL [Back]

[1] Multimodal Consistency-Guided Reference-Free Data Selection for ASR Accent Adaptation

[2] LLM-Powered Automatic Translation and Urgency in Crisis Scenarios

[3] Using Machine Learning to Enhance the Detection of Obfuscated Abusive Words in Swahili: A Focus on Child Safety

[4] Language Model Memory and Memory Models for Language

[5] From Perceptions To Evidence: Detecting AI-Generated Content In Turkish News Media With A Fine-Tuned Bert Classifier

[6] Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens

[7] On Calibration of Large Language Models: From Response To Capability

[8] Small Reward Models via Backward Inference

[9] DistillLens: Symmetric Knowledge Distillation Through Logit Lens

[10] LLM-Confidence Reranker: A Training-Free Approach for Enhancing Retrieval-Augmented Generation Systems

[11] Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment

[12] Metaphors' journeys across time and genre: tracking the evolution of literary metaphors with temporal embeddings

[13] On Theoretically-Driven LLM Agents for Multi-Dimensional Discourse Analysis

[14] RMPL: Relation-aware Multi-task Progressive Learning with Stage-wise Training for Multimedia Event Extraction

[15] How Do Lexical Senses Correspond Between Spoken German and German Sign Language?

[16] OMGs: A multi-agent system supporting MDT decision-making across the ovarian tumour care continuum

[17] The acquisition of English irregular inflections by Yemeni L1 Arabic learners: A Universal Grammar approach

[18] Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind

[19] Speculative Decoding with a Speculative Vocabulary

[20] PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training

[21] Tutoring Large Language Models to be Domain-adaptive, Precise, and Safe

[22] Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages

[23] ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics

[24] Evaluating Prompt Engineering Techniques for RAG in Small Language Models: A Multi-Hop QA Approach

[25] Pre-Editorial Normalization for Automatically Transcribed Medieval Manuscripts in Old French and Latin

[26] HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

[27] Chain-of-Thought Reasoning with Large Language Models for Clinical Alzheimer's Disease Assessment and Diagnosis

[28] The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective

[29] Named Entity Recognition for Payment Data Using NLP

[30] GRRM: Group Relative Reward Modeling for Machine Translation

[31] Geometry-Preserving Aggregation for Mixture-of-Experts Embedding Models

[32] Context Shapes LLMs Retrieval-Augmented Fact-Checking Effectiveness

[33] LogitsCoder: Towards Efficient Chain-of-Thought Path Search via Logits Preference Decoding for Code Generation

[34] LM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts

[35] From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset

[36] Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric

[37] Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework

[38] GTS: Inference-Time Scaling of Latent Reasoning with a Learnable Gaussian Thought Sampler

[39] Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

[40] CCiV: A Benchmark for Structure, Rhythm and Quality in LLM-Generated Chinese Ci Poetry

[41] Character-aware Transformers Learn an Irregular Morphological Pattern Yet None Generalize Like Humans

[42] A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing

[43] Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering

[44] GPT-5 vs Other LLMs in Long Short-Context Performance

[45] Knowing When Not to Answer: Abstention-Aware Scientific Reasoning

[46] We can still parse using syntactic rules

[47] AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents

[48] Detecting LLM Hallucinations via Embedding Cluster Geometry: A Three-Type Taxonomy with Measurable Signatures

[49] STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts

[50] Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook

[51] InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

[52] Beyond Token-Level Policy Gradients for Complex Reasoning with Large Language Models

[53] TruthStance: An Annotated Dataset of Conversations on Truth Social

[54] WavePhaseNet: A DFT-Based Method for Constructing Semantic Conceptual Hierarchy Structures (SCHS)

[55] LLM-Guided Knowledge Distillation for Temporal Knowledge Graph Reasoning

[56] Robust Bias Evaluation with FilBBQ: A Filipino Bias Benchmark for Question-Answering Language Models

[57] Measuring and Mitigating Post-hoc Rationalization in Reverse Chain-of-Thought Generation

[58] HyperRAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation

[59] BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR

[60] Query as Anchor: Scenario-Adaptive User Representation via Large Language Model

[61] Beyond Translation: Evaluating Mathematical Reasoning Capabilities of LLMs in Sinhala and Tamil

[62] Explainable Token-level Noise Filtering for LLM Fine-tuning Datasets

[63] Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation

[64] The Wikidata Query Logs Dataset

[65] GradMAP: Faster Layer Pruning with Gradient Metric and Projection Compensation

[66] Is Information Density Uniform when Utterances are Grounded on Perception and Discourse?

[67] Breaking Data Efficiency Dilemma: A Federated and Augmented Learning Framework For Alzheimer's Disease Detection via Speech

[68] Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

[69] LLMStructBench: Benchmarking Large Language Model Structured Data Extraction

[70] Rethinking the Role of LLMs in Time Series Forecasting

[71] Cognitive networks reconstruct mindsets about STEM subjects and educational contexts in almost 1000 high-schoolers, University students and LLM-based digital twins

[72] Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers

[73] Unlocking Reasoning Capability on Machine Translation in Large Language Models

[74] Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation

[75] Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment

[76] A Geometric Analysis of Small-sized Language Model Hallucinations

[77] Overthinking Loops in Agents: A Structural Risk via MCP Tools

[78] Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque