cs.CL

[1] References Improve LLM Alignment in Non-Verifiable Domains

Kejian Shi,Yixin Liu,Peifeng Wang,Alexander R. Fabbri,Shafiq Joty,Arman Cohan

Main category: cs.CL

TL;DR: This paper proposes reference-guided LLM evaluators as a substitute for the hard verifiers of RLVR in non-verifiable domains that lack ground-truth verifiers, such as LLM alignment. Reference outputs from frontier models or human writers improve the discriminative accuracy of LLM judges, and applying the improved judges to self-improvement alignment training clearly outperforms both direct supervised fine-tuning and reference-free self-improvement.

Details Motivation: RLVR works well on verifiable tasks but cannot be applied directly to non-verifiable domains such as LLM alignment, which lack ground-truth verifiers, so a soft verification mechanism is needed. Method: Design reference-based LLM evaluation protocols that strengthen LLM judges of different capability levels with frontier-model or human-written references, then use the strengthened judges for reference-guided self-improvement alignment training. Result: On AlpacaEval and Arena-Hard, Llama-3-8B-Instruct reaches 73.1% / 58.7% and Qwen2.5-7B reaches 70.0% / 74.1%; the average absolute gains are +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement, with performance comparable to training with the ArmoRM reward model. Conclusion: Reference-guided LLM evaluators can effectively support LLM post-training in non-verifiable domains, offering a practical and scalable alternative verification paradigm for alignment and similar tasks. Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether reference-guided LLM-evaluators can bridge this gap by serving as soft "verifiers". First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs. Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality (i.e., human-written) references. Building on these improved judges, we demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to self-improve. We show that reference-guided self-improvement yields clear gains over both direct SFT on reference outputs and self-improvement with reference-free judges, achieving performance comparable to training with ArmoRM, a strong finetuned reward model. Specifically, our method achieves 73.1% and 58.7% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B, corresponding to average absolute gains of +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement on AlpacaEval / Arena-Hard. These results highlight the potential of using reference-guided LLM-evaluators to enable effective LLM post-training in non-verifiable domains.

[2] Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark

Charalampos Mastrokostas,Nikolaos Giarelis,Nikos Karacapilidis

Main category: cs.CL

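TL;DR: This paper targets Greek question answering: it builds DemosQA, a new dataset reflecting Greek social and cultural characteristics, proposes an efficient LLM evaluation framework that adapts to multiple languages, and systematically evaluates 11 monolingual and multilingual LLMs on 6 Greek QA datasets.

Details Motivation: Existing LLMs mainly target high-resource languages such as English, and multilingual models have training data skewed toward a few dominant languages, so they struggle to represent the social, cultural, and historical traits of under-resourced languages; meanwhile, the effectiveness of monolingual models for under-resourced languages on specific tasks remains under-studied. Method: Build DemosQA, a Greek dataset based on social media questions and answers; design a memory-efficient LLM evaluation framework adaptable to diverse QA datasets and languages; run a large-scale evaluation of 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets using 3 prompting strategies. Result: Provides DemosQA, the first high-quality QA dataset focused on the Greek social and cultural context; validates the effectiveness and generality of the proposed evaluation framework; systematically reveals performance differences among models on Greek QA and the effect of prompting strategy; releases all code and data. Conclusion: Monolingual LLMs show promise on Greek QA, but their performance depends jointly on data quality, model scale, and prompting strategy; building language-specific datasets and evaluation methods is essential for improving LLM capabilities in under-resourced languages. Abstract: Recent advancements in Natural Language Processing and Deep Learning have enabled the development of Large Language Models (LLMs), which have significantly advanced the state-of-the-art across a wide range of tasks, including Question Answering (QA). Despite these advancements, research on LLMs has primarily targeted high-resourced languages (e.g., English), and only recently has attention shifted toward multilingual models. However, these models demonstrate a training data bias towards a small number of popular languages or rely on transfer learning from high- to under-resourced languages; this may lead to a misrepresentation of social, cultural, and historical aspects. To address this challenge, monolingual LLMs have been developed for under-resourced languages; however, their effectiveness remains less studied when compared to multilingual counterparts on language-specific tasks. In this study, we address this research gap in Greek QA by contributing: (i) DemosQA, a novel dataset, which is constructed using social media user questions and community-reviewed answers to better capture the Greek social and cultural zeitgeist; (ii) a memory-efficient LLM evaluation framework adaptable to diverse QA datasets and languages; and (iii) an extensive evaluation of 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets using 3 different prompting strategies. We release our code and data to facilitate reproducibility.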

[3] One-step Language Modeling via Continuous Denoising

Chanhyuk Lee,Jaehoon Yoo,Manan Agarwal,Sheel Shah,Jerry Huang,Aditi Raghunathan,Seunghoon Hong,Nicholas M. Boffi,Jinwoo Kim

Main category: cs.CL

TL;DR: This paper proposes a flow-based continuous-denoising language model (FLM) and its distilled version (FMLM), which outperform discrete diffusion models in both generation quality and speed and clearly surpass existing methods in few-step generation (for example, one step).

Details Motivation: Discrete diffusion language models promise faster generation, but their quality degrades sharply in the few-step regime; this paper explores whether continuous flow models can replace discrete diffusion to achieve both quality and efficiency. Method: Build a flow model (FLM) that performs continuous denoising in Euclidean space over one-hot token encodings, train it to predict the clean data with a cross-entropy objective, and introduce a time reparameterization that improves training stability; then distill it into a flow map model (FMLM) for few-step generation. Result: On the LM1B and OWT datasets, FLM matches the quality of state-of-the-art discrete diffusion models; FMLM outperforms recent methods across the board in few-step generation, with one-step quality exceeding their 8-step results. Conclusion: Discrete diffusion is not a necessity for generative modeling over discrete modalities; flow-based language modeling can deliver high-quality, fast generation at scale. Abstract: Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. In practice, however, they exhibit a sharp degradation of sample quality in the few-step regime, failing to realize this promise. Here we show that language models leveraging flow-based continuous denoising can outperform discrete diffusion in both quality and speed. By revisiting the fundamentals of flows over discrete modalities, we build a flow-based language model (FLM) that performs Euclidean denoising over one-hot token encodings. We show that the model can be trained by predicting the clean data via a cross entropy objective, where we introduce a simple time reparameterization that greatly improves training stability and generation quality. By distilling FLM into its associated flow map, we obtain a distilled flow map language model (FMLM) capable of few-step generation. On the LM1B and OWT language datasets, FLM attains generation quality matching state-of-the-art discrete diffusion models. With FMLM, our approach outperforms recent few-step language models across the board, with one-step generation exceeding their 8-step quality. Our work calls into question the widely held hypothesis that discrete diffusion processes are necessary for generative modeling over discrete modalities, and paves the way toward accelerated flow-based language modeling at scale. Code is available at https://github.com/david3684/flm.

[4] Claim Automation using Large Language Model

Zhengda Mo,Zhiyu Quan,Eli O'Donohue,Kaiwen Zhong

Main category: cs.CL

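TL;DR: This paper proposes a locally deployed, governance-aware language modeling component for the insurance domain. Using historical warranty claims data, it fine-tunes pretrained LLMs with LoRA to generate structured corrective-action recommendations, substantially outperforming general-purpose LLMs and achieving near-identical matches to the ground-truth corrective actions in about 80% of cases.

Details Motivation: Deployment of large language models (LLMs) in regulated, data-sensitive domains such as insurance remains limited, creating a need for domain adaptation that balances performance, interpretability, and compliance. Method: Using millions of historical warranty claims, fine-tune pretrained LLMs with LoRA to build a locally deployed, governance-aware initial decision module, and design a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation. Result: In about 80% of test cases, the domain fine-tuned model generates corrective actions that closely match the ground-truth labels, substantially outperforming commercial general-purpose and prompt-based LLMs. Conclusion: Domain-adaptive fine-tuning effectively aligns model output distributions with real operational data and is a key technical path toward reliable, governable AI applications in insurance. Abstract: While Large Language Models (LLMs) have achieved strong performance on general-purpose language tasks, their deployment in regulated and data-sensitive domains, including insurance, remains limited. Leveraging millions of historical warranty claims, we propose a locally deployed governance-aware language modeling component that generates structured corrective-action recommendations from unstructured claim narratives. We fine-tune pretrained LLMs using Low-Rank Adaptation (LoRA), scoping the model to an initial decision module within the claim processing pipeline to speed up claim adjusters' decisions. We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy. Our results show that domain-specific fine-tuning substantially outperforms commercial general-purpose and prompt-based LLMs, with approximately 80% of the evaluated cases achieving near-identical matches to ground-truth corrective actions. Overall, this study provides both theoretical and empirical evidence to prove that domain-adaptive fine-tuning can align model output distributions more closely with real-world operational data, demonstrating its promise as a reliable and governable building block for insurance applications.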
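The abstract names Low-Rank Adaptation (LoRA) but gives no configuration, so the following is only a minimal sketch of the standard LoRA weight update, W' = W + (alpha / r) * B A, in plain Python. The matrix sizes, rank, `alpha`, and every function name here are illustrative assumptions, not the authors' setup.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Hypothetical toy dimensions; real layers are far larger.
d_out, d_in, rank = 6, 8, 2
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen weights
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable, small init
b = [[0.0] * rank for _ in range(d_out)]                              # trainable, zero init

w_eff = lora_effective_weight(w, a, b, alpha=16)

full_params = d_out * d_in            # parameters updated by full fine-tuning
lora_params = rank * (d_in + d_out)   # parameters updated by LoRA
print(full_params, lora_params)
```

Because B starts at zero, the adapted layer initially reproduces the pretrained weights exactly, and only the two small matrices are trained. The parameter saving grows with layer size; at these toy dimensions it is modest.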

[5] BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization

Ahmed Rafid,Rumman Adib,Fariya Ahmed,Ajwad Abrar,Mohammed Saidul Islam

Main category: cs.CL

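TL;DR: This paper proposes BanglaSummEval, a reference-free, question-answering-based framework for evaluating the factual consistency of Bangla summaries. A single multilingual instruction-tuned model handles question generation, question answering, answer extraction, and importance weighting, with BERTScore-Recall measuring semantic consistency; validation on human-written summaries from the educational and medical domains shows strong correlation with expert judgments.

Details Motivation: Most existing factual consistency metrics overlook under-resourced languages such as Bangla and depend heavily on reference summaries, which limits their applicability and interpretability. Method: Build BanglaSummEval, a reference-free, QA-based evaluation framework: questions are generated automatically from the source document and the summary, a single multilingual instruction-tuned model handles question answering, candidate answer extraction, and question importance weighting, and BERTScore-Recall compares answers to capture semantic consistency. Result: Validated on 300 human-written Bangla summaries from the educational and medical domains, the framework correlates strongly with expert judgments (Pearson r = 0.694, Spearman ρ = 0.763) and provides interpretable step-wise diagnostics. Conclusion: BanglaSummEval offers an efficient, unified, transparent, and practical solution for factual consistency evaluation in under-resourced languages. Abstract: Evaluating factual consistency is essential for reliable text summarization, particularly in high-stakes domains such as healthcare and news. However, most existing evaluation metrics overlook Bangla, a widely spoken yet under-resourced language, and often depend on reference summaries. We introduce BanglaSummEval, a reference-free, question-answering-based framework for evaluating factual consistency in Bangla summarization. The proposed method assesses both factual accuracy and content coverage through automatically generated questions and answers derived from the source document and the summary. A single multilingual instruction-tuned language model handles question generation, question answering, candidate answer extraction, and question importance weighting. This unified design reduces system complexity and computational cost. To capture semantic consistency beyond surface-level overlap, we use BERTScore-Recall for answer comparison. We validate BanglaSummEval on 300 human-written summaries from educational and medical domains, demonstrating strong correlation with expert human judgments (Pearson's $r = 0.694$, Spearman's $\rho = 0.763$). By providing interpretable, step-wise diagnostics alongside reliable evaluation scores, BanglaSummEval offers a practical and transparent solution for factual consistency evaluation in low-resource language settings.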
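The validation above rests on Pearson and Spearman correlations between metric scores and expert ratings. As a rough illustration of how such agreement figures are computed, here is a small stdlib-only sketch; the sample scores are invented for the example and are not data from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


# Invented example: automatic metric scores vs. expert ratings for five summaries.
metric_scores = [0.42, 0.55, 0.61, 0.70, 0.93]
expert_ratings = [2, 3, 3, 4, 5]
print(pearson(metric_scores, expert_ratings), spearman(metric_scores, expert_ratings))
```

Spearman only looks at rank order, so it can exceed Pearson when the relationship is monotonic but not linear, which is one plausible reading of the gap between the two reported values.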

[6] Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect

Minh Duc Bui,Manuel Mager,Peter Herbert Kann,Katharina von der Wense

Main category: cs.CL

TL;DR: This paper presents the first NLP research on Meenzerisch, the dialect of Mainz. It builds the first NLP-ready digital dictionary (2,351 entries) and evaluates LLMs on definition generation and word generation for the dialect; accuracy stays below 10% on both tasks, showing that more resources and research effort are urgently needed.

Details Motivation: Meenzerisch is on the verge of dying out, and NLP could help preserve and revive it, yet no prior NLP research has addressed the dialect. Method: Build a Meenzerisch-to-Standard German digital dictionary (2,351 word pairs) derived from Schramm (1966); define two tasks, generating definitions for dialect words and generating dialect words from definitions; evaluate multiple LLMs and try few-shot learning and rule extraction as enhancements. Result: All LLMs perform very poorly on both tasks: the best definition-generation accuracy is only 6.27% and the best word-generation accuracy only 1.51%; few-shot learning and rule injection help, but accuracy remains below 10%. Conclusion: Current LLMs cannot handle the endangered German dialect Meenzerisch effectively, which underscores the urgency of building more dialect resources and intensifying NLP research on German dialects. Abstract: Meenzerisch, the dialect spoken in the German city of Mainz, is also the traditional language of the Mainz carnival, a yearly celebration well known throughout Germany. However, Meenzerisch is on the verge of dying out, a fate it shares with many other German dialects. Natural language processing (NLP) has the potential to help with the preservation and revival efforts of languages and dialects. However, so far no NLP research has looked at Meenzerisch. This work presents the first research in the field of NLP that is explicitly focused on the dialect of Mainz. We introduce a digital dictionary, an NLP-ready dataset derived from an existing resource (Schramm, 1966), to support researchers in modeling and benchmarking the language. It contains 2,351 words in the dialect paired with their meanings described in Standard German. We then use this dataset to answer the following research questions: (1) Can state-of-the-art large language models (LLMs) generate definitions for dialect words? (2) Can LLMs generate words in Meenzerisch, given their definitions? Our experiments show that LLMs can do neither: the best model for definitions reaches only 6.27% accuracy and the best word generation model's accuracy is 1.51%. We then conduct two additional experiments in order to see if accuracy is improved by few-shot learning and by extracting rules from the training set, which are then passed to the LLM. While those approaches are able to improve the results, accuracy remains below 10%. This highlights that additional resources and an intensification of research efforts focused on German dialects are desperately needed.

[7] ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

Ofer Meshi,Krisztian Balog,Sally Goldman,Avi Caciularu,Guy Tennenholtz,Jihwan Jeong,Amir Globerson,Craig Boutilier

Main category: cs.CL

TL;DR: This paper presents the ConvApparel dataset and a comprehensive validation framework to address the "realism gap" of LLM-based user simulators; experiments show that data-driven simulators perform better in counterfactual validation.

Details Motivation: LLM-based user simulators suffer from a "realism gap", so systems optimized on simulated interactions may perform poorly in the real world. Method: Build the ConvApparel dataset (collected with a dual-agent protocol using "good" and "bad" recommenders and enriched with first-person annotations of user satisfaction) and propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation. Result: All simulators show a significant realism gap; however, data-driven simulators adapt better to unseen behaviors than a prompted baseline in counterfactual validation and perform better overall. Conclusion: ConvApparel and the proposed validation framework help expose and narrow the realism gap of user simulators; data-driven approaches are imperfect but more robust. Abstract: The promise of LLM-based user simulators to improve conversational AI is hindered by a critical "realism gap," leading to systems that are optimized for simulated interactions, but may fail to perform well in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol (using both "good" and "bad" recommenders) enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.

[8] When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English

Hasan Can Biyik,Libby Barak,Jing Peng,Anna Feldman

Main category: cs.CL

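TL;DR: This paper studies transfer learning in cross-lingual euphemism detection and finds that semantic overlap does not always guarantee positive transfer: in the low-resource Turkish-to-English direction performance can even degrade, while training on non-overlapping euphemisms (NOPETs) sometimes improves results, an effect closely tied to differences in label distribution.

Details Motivation: Euphemisms depend heavily on cultural and pragmatic context, which makes them hard to model, and it is unclear how cross-lingual equivalence affects transfer in multilingual euphemism detection. Method: Categorize Potentially Euphemistic Terms (PETs) in Turkish and English into Overlapping (OPETs) and Non-Overlapping (NOPETs) subsets according to their functional, pragmatic, and semantic alignment, and analyze how the two subsets behave in cross-lingual transfer, supported by category-level and label-distribution analyses. Result: Transfer is asymmetric: semantic overlap does not guarantee positive transfer; performance in the Turkish-to-English direction can degrade, while NOPET-based training can improve it; differences in label distribution are a key explanatory factor; domain alignment may influence transfer, but the evidence is limited by data sparsity. Conclusion: Transfer in cross-lingual euphemism detection depends not only on semantic equivalence but also on label distribution, resource asymmetry, and domain alignment, so the equivalence assumption needs rethinking in low-resource settings. Abstract: Euphemisms substitute socially sensitive expressions, often softening or reframing meaning, and their reliance on cultural and pragmatic context complicates modeling across languages. In this study, we investigate how cross-lingual equivalence influences transfer in multilingual euphemism detection. We categorize Potentially Euphemistic Terms (PETs) in Turkish and English into Overlapping (OPETs) and Non-Overlapping (NOPETs) subsets based on their functional, pragmatic, and semantic alignment. Our findings reveal a transfer asymmetry: semantic overlap is insufficient to guarantee positive transfer, particularly in low-resource Turkish-to-English direction, where performance can degrade even for overlapping euphemisms, and in some cases, improve under NOPET-based training. Differences in label distribution help explain these counterintuitive results. Category-level analysis suggests that transfer may be influenced by domain-specific alignment, though evidence is limited by sparsity.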

[9] Eigenmood Space: Uncertainty-Aware Spectral Graph Analysis of Psychological Patterns in Classical Persian Poetry

Kourosh Shahnazari,Seyed Moein Ayyoubzadeh,Mohammadali Keshtparvar

Main category: cs.CL

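TL;DR: This paper proposes an uncertainty-aware computational framework for classical Persian poetry that uses automatic multi-label annotation, confidence-weighted aggregation, and a spectral graph embedding (Eigenmood) to enable interpretable, auditable, large-scale analysis of poets' psychological patterns.

Details Motivation: Emotional expression in classical Persian poetry relies on metaphor and rhetorical indirection, which makes close reading indispensable but hinders reproducible large-scale comparison; the work aims to combine computational scale with the caution of humanistic interpretation. Method: Using large-scale automatic multi-label annotation, assign each verse a set of psychological concepts, confidence scores, and an abstention flag (insufficient evidence); build a Poet × Concept matrix and quantify each poet's individuality with Jensen-Shannon and Kullback-Leibler divergence; build a confidence-weighted concept co-occurrence graph and derive the Eigenmood embedding through Laplacian spectral decomposition. Result: On a corpus of 61,573 verses by 10 poets, 22.2% of verses are flagged as abstentions; the study reports a sensitivity analysis under confidence thresholding, selection-bias diagnostics that treat abstention as a category, and a distant-to-close workflow that retrieves exemplar verses along Eigenmood axes. Conclusion: The framework supports scalable digital-humanities analysis while preserving interpretive caution and auditability by propagating verse-level uncertainty up to poet-level inference. Abstract: Classical Persian poetry is a historically sustained archive in which affective life is expressed through metaphor, intertextual convention, and rhetorical indirection. These properties make close reading indispensable while limiting reproducible comparison at scale. We present an uncertainty-aware computational framework for poet-level psychological analysis based on large-scale automatic multi-label annotation. Each verse is associated with a set of psychological concepts, per-label confidence scores, and an abstention flag that signals insufficient evidence. We aggregate confidence-weighted evidence into a Poet $\times$ Concept matrix, interpret each poet as a probability distribution over concepts, and quantify poetic individuality as divergence from a corpus baseline using Jensen-Shannon divergence and Kullback-Leibler divergence. To capture relational structure beyond marginals, we build a confidence-weighted co-occurrence graph over concepts and define an Eigenmood embedding through Laplacian spectral decomposition. On a corpus of 61,573 verses across 10 poets, 22.2% of verses are abstained, underscoring the analytical importance of uncertainty. We further report sensitivity analysis under confidence thresholding, selection-bias diagnostics that treat abstention as a category, and a distant-to-close workflow that retrieves verse-level exemplars along Eigenmood axes. The resulting framework supports scalable, auditable digital-humanities analysis while preserving interpretive caution by propagating uncertainty from verse-level evidence to poet-level inference.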
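To make the "individuality as divergence from a corpus baseline" step concrete, here is a minimal sketch of Jensen-Shannon divergence over a toy Poet × Concept matrix. The poet names, the concept weights, and the choice of pooled column sums as the baseline are all assumptions for illustration; the abstract does not specify how the baseline is formed.

```python
import math


def normalize(weights):
    """Turn non-negative evidence weights into a probability distribution."""
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute 0. Requires q_i > 0 where p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical confidence-weighted evidence per concept (one row per poet).
poet_concept = {
    "poet_a": [4.0, 1.0, 1.0],
    "poet_b": [2.0, 2.0, 2.0],
}

# Assumed baseline: pooled evidence over all poets, normalized.
baseline = normalize([sum(column) for column in zip(*poet_concept.values())])

individuality = {
    name: js_divergence(normalize(row), baseline)
    for name, row in poet_concept.items()
}
print(individuality)
```

Jensen-Shannon divergence stays finite even when a poet never uses a concept, whereas raw KL divergence against a baseline with a zero entry would not, which may be one reason to report both.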

[10] Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

Serin Kim,Sangam Lee,Dongha Lee

Main category: cs.CL

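TL;DR: This paper presents Persona2Web, the first benchmark for evaluating personalized web agents on the real open web. Built on the clarify-to-personalize principle, it requires agents to resolve query ambiguity by inferring preferences implicitly from user history, and it provides a dataset and experimental setup comprising user histories, ambiguous queries, and a reasoning-aware evaluation framework.

Details Motivation: Current web agents lack personalization and struggle to interpret ambiguous queries from users' implicit preferences and context, so a benchmark that evaluates realistic personalized behavior is needed. Method: Propose the Persona2Web benchmark, which contains long-term user histories, ambiguous queries that require preference inference, and a reasoning-aware framework for fine-grained evaluation; run systematic experiments across agent architectures, backbone models, history access schemes, and queries with varying ambiguity levels. Result: Reveals key challenges for personalized web agents in using history, resolving ambiguity, and modeling preferences; the code and datasets are publicly available. Conclusion: Persona2Web provides a standardized evaluation basis and empirical evidence for research on web agents that are genuinely user-aware. Abstract: Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every detail of their intent, practical web agents must be able to interpret ambiguous queries by inferring user preferences and contexts. To address this challenge, we present Persona2Web, the first benchmark for evaluating personalized web agents on the real open web, built upon the clarify-to-personalize principle, which requires agents to resolve ambiguity based on user history rather than relying on explicit instructions. Persona2Web consists of: (1) user histories that reveal preferences implicitly over long time spans, (2) ambiguous queries that require agents to infer implicit user preferences, and (3) a reasoning-aware evaluation framework that enables fine-grained assessment of personalization. We conduct extensive experiments across various agent architectures, backbone models, history access schemes, and queries with varying ambiguity levels, revealing key challenges in personalized web agent behavior. For reproducibility, our codes and datasets are publicly available at https://anonymous.4open.science/r/Persona2Web-73E8.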

[11] ReIn: Conversational Error Recovery with Reasoning Inception

Takyoung Kim,Jinseok Nam,Chandrayee Basu,Xing Fan,Chengyuan Ma,Heng Ji,Gokhan Tur,Dilek Hakkani-Tür

Main category: cs.CL

TL;DR: This paper proposes Reasoning Inception (ReIn), a test-time intervention method that improves a conversational agent's ability to recover from user-induced errors without modifying model parameters or system prompts.

Details Motivation: LLM-based conversational agents perform well on fixed datasets but remain fragile when users introduce unanticipated errors; this work focuses on error recovery rather than prevention and explores effective strategies under the realistic constraint that the model cannot be fine-tuned and the prompt cannot be modified. Method: Propose ReIn: an external inception module identifies predefined errors in the dialogue context and generates a recovery plan, which is injected into the agent's internal reasoning process to guide corrective actions, leaving model parameters and system prompts unchanged. Result: In simulated user-induced failure scenarios (such as ambiguous and unsupported requests), ReIn substantially improves task success, generalizes to unseen error types, and consistently outperforms explicit prompt-modification approaches. Conclusion: ReIn is an efficient, plug-and-play error recovery mechanism; jointly defining recovery tools with ReIn can safely and effectively strengthen the robustness of conversational agents without changing the backbone model or system prompts. Abstract: Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors. Rather than focusing on error prevention, this work focuses on error recovery, which necessitates the accurate diagnosis of erroneous dialogue contexts and execution of proper recovery plans. Under realistic constraints precluding model fine-tuning or prompt modification due to significant cost and time requirements, we explore whether agents can recover from contextually flawed interactions and how their behavior can be adapted without altering model parameters and prompts. To this end, we propose Reasoning Inception (ReIn), a test-time intervention method that plants an initial reasoning into the agent's decision-making process. Specifically, an external inception module identifies predefined errors within the dialogue context and generates recovery plans, which are subsequently integrated into the agent's internal reasoning process to guide corrective actions, without modifying its parameters or system prompts. We evaluate ReIn by systematically simulating conversational failure scenarios that directly hinder successful completion of user goals: user's ambiguous and unsupported requests. Across diverse combinations of agent models and inception modules, ReIn substantially improves task success and generalizes to unseen error types. Moreover, it consistently outperforms explicit prompt-modification approaches, underscoring its utility as an efficient, on-the-fly method.
In-depth analysis of its operational mechanism, particularly in relation to instruction hierarchy, indicates that jointly defining recovery tools with ReIn can serve as a safe and effective strategy for improving the resilience of conversational agents without modifying the backbone models or system prompts.

[12] Large Language Models Persuade Without Planning Theory of Mind

Jared Moore,Rasmus Overmark,Ned Cooper,Beba Cibralic,Nick Haber,Cameron R. Jones

Main category: cs.CL

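TL;DR: This paper introduces a novel theory of mind (ToM) task that uses persuasion experiments to assess how well humans and large language models (LLMs) model others' knowledge and motivational states in dynamic interaction. LLMs excel when those states are revealed but fail badly when they must be actively inferred; in persuading real people, however, LLMs outperform humans, indicating that they are good at rhetorical persuasion that needs no explicit ToM. The study cautions against equating LLMs' persuasive ability with human-like ToM and points to the potential societal impact.

Details Motivation: Existing ToM evaluations mostly rely on static question answering and neglect first-personal interaction, a core element of ToM; a dynamic interactive task that reflects genuine mental-state modeling is needed. Method: Design three persuasion experiments in which a persuader must strategically reveal information, based on the target's knowledge states (what the target knows) and motivational states (what the target values), to steer the target toward a policy choice; the states are either Revealed (given directly) or Hidden (to be inquired about or inferred); compare humans and LLMs when persuading a rational bot, when persuading human targets, and when real belief change is measured. Result: Experiment 1: LLMs excel in the Revealed condition but fall below chance in the Hidden condition, while humans perform moderately well in both. Experiments 2 and 3: LLMs outperform humans across the board at persuading human targets and changing their real beliefs. Conclusion: LLMs lack the multi-step mental-state reasoning that human-like ToM requires, yet they can efficiently influence human beliefs and behavior through non-ToM rhetorical strategies; static ToM benchmarks can mislead, so interactive evaluation deserves more weight. Abstract: A growing body of work attempts to evaluate the theory of mind (ToM) abilities of humans and large language models (LLMs) using static, non-interactive question-and-answer benchmarks. However, theoretical work in the field suggests that first-personal interaction is a crucial part of ToM and that such predictive, spectatorial tasks may fail to evaluate it. We address this gap with a novel ToM task that requires an agent to persuade a target to choose one of three policy proposals by strategically revealing information. Success depends on a persuader's sensitivity to a given target's knowledge states (what the target knows about the policies) and motivational states (how much the target values different outcomes). We varied whether these states were Revealed to persuaders or Hidden, in which case persuaders had to inquire about or infer them. In Experiment 1, participants persuaded a bot programmed to make only rational inferences. LLMs excelled in the Revealed condition but performed below chance in the Hidden condition, suggesting difficulty with the multi-step planning required to elicit and use mental state information. Humans performed moderately well in both conditions, indicating an ability to engage such planning. In Experiment 2, where a human target role-played the bot, and in Experiment 3, where we measured whether human targets' real beliefs changed, LLMs outperformed human persuaders across all conditions.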
These results suggest that effective persuasion can occur without explicit ToM reasoning (e.g., through rhetorical strategies) and that LLMs excel at this form of persuasion. Overall, our results caution against attributing human-like ToM to LLMs while highlighting LLMs' potential to influence people's beliefs and behavior.

[13] Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data

Deepak Uniyal,Md Abul Bashar,Richi Nayak

Main category: cs.CL

TL;DR: This paper compares four cross-lingual text classification approaches—translation-based annotation, translation-based data, direct multilingual model application, and a hybrid method—to filter hydrogen-related tweets from noisy multilingual social media data (English, Japanese, Hindi, Korean), followed by topic modeling to uncover dominant themes.

Details Motivation: Analysing large-scale, multilingual social media discourse—especially public debates across diverse languages—remains challenging; existing keyword-driven collection yields substantial irrelevant content, necessitating robust cross-lingual filtering methods. Method: Four cross-lingual classification strategies are evaluated on a decade-long, nine-million-tweet dataset in four languages: (1) translate English annotations → train language-specific models; (2) translate all raw tweets into English → train one English-based model; (3) apply English-fine-tuned multilingual transformers directly per language; (4) hybrid—combine translated annotations with multilingual training. Filtered outputs undergo topic modeling. Result: Each approach shows distinct performance trade-offs in filtering relevance; the hybrid strategy achieves the best balance between accuracy and scalability, while direct multilingual application suffers from language mismatch and translation-based methods incur latency/quality costs. Topic modeling reveals temporally evolving thematic clusters (e.g., policy vs. technology focus) across languages. Conclusion: No single cross-lingual method universally dominates; optimal pipeline design depends on resource constraints, language coverage, and annotation availability. Hybrid approaches offer promising scalability and accuracy for real-world multilingual social media analysis. Abstract: Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span across diverse languages. This study investigates how different approaches for cross-lingual text classification can support reliable analysis of global conversations. Using hydrogen energy as a case study, we analyse a decade-long dataset of over nine million tweets in English, Japanese, Hindi, and Korean (2013--2022) for topic discovery. 
The online keyword-driven data collection results in a significant amount of irrelevant content. We explore four approaches to filter relevant content: (1) translating English annotated data into target languages for building language-specific models for each target language, (2) translating unlabelled data appearing from all languages into English for creating a single model based on English annotations, (3) applying English fine-tuned multilingual transformers directly to each target language data, and (4) a hybrid strategy that combines translated annotations with multilingual training. Each approach is evaluated for its ability to filter hydrogen-related tweets from noisy keyword-based collections. Subsequently, topic modeling is performed to extract dominant themes within the relevant subsets. The results highlight key trade-offs between translation and multilingual approaches, offering actionable insights into optimising cross-lingual pipelines for large-scale social media analysis.

[14] ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning

Hussein S. Al-Olimat,Ahmad Alshareef

Main category: cs.CL

TL;DR: This paper introduces ALPS, an expert-curated diagnostic test set for Arabic linguistics and pragmatics that probes models' deep semantic and pragmatic abilities; it finds that current models are highly fluent but still lag well behind humans on fundamentals such as morpho-syntactic dependencies.

Details Motivation: Existing Arabic NLP benchmarks mostly rely on synthetic or translated data and lack deep linguistic verification; a native, expert-annotated, culturally authentic diagnostic set is needed to fill the gap in evaluating deep semantics and pragmatics. Method: Build ALPS, a native Arabic diagnostic test set of 531 questions covering 15 tasks and 47 subtasks, designed with deep involvement from Arabic linguistics experts; evaluate 23 models (commercial, open-source, and Arabic-native) against two baselines, single-pass average human accuracy (84.6%) and an expert-adjudicated oracle (99.2%). Result: Models reach a 36.5% error rate on morpho-syntactic dependency tasks that rely on diacritics, markedly higher than on compositional semantics; the top commercial model (Gemini-3-flash, 94.2%) exceeds average human performance, but the best Arabic-specific model (Jais-2-70B, 83.6%) still falls short of it. Conclusion: Current LLMs have critical weaknesses in deep Arabic language understanding, especially at the morpho-syntactic level; ALPS offers a new paradigm for more rigorous, culturally appropriate evaluation of linguistic ability. Abstract: While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data which may benefit from deeper linguistic verification. We introduce ALPS (Arabic Linguistic & Pragmatic Suite), a native, expert-curated diagnostic challenge set probing Deep Semantics and Pragmatics, capabilities that complement specialized large-scale benchmarks. While broad-coverage benchmarks prioritize scale and multi-task coverage, ALPS targets the depth of linguistic understanding through 531 rigorously crafted questions across 15 tasks and 47 subtasks. We developed the dataset with deep expertise in Arabic linguistics, guaranteeing cultural authenticity and eliminating translation artifacts. Evaluating 23 diverse models (commercial, open-source, and Arabic-native) against a single-pass human performance (avg. 84.6% accuracy) and an expert-adjudicated oracle (99.2%), we reveal a critical dissociation: models achieve high fluency but fail on fundamental morpho-syntactic dependencies, with elevated error rates on morpho-syntactic dependencies (36.5% across diacritics-reliant tasks) compared to compositional semantics. While top commercial models (Gemini-3-flash at 94.2%) surpass the average single human, a substantial gap persists between commercial giants and Arabic-native models, with the best Arabic-specific model (Jais-2-70B at 83.6%) approaching but not matching human performance.

[15] BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios

Yunseung Lee,Subin Kim,Youngjun Kwak,Jaegul Choo

Main category: cs.CL

TL;DR: This paper proposes BankMathBench, a domain-specific mathematical reasoning benchmark dataset for banking scenarios, aimed at improving LLMs' multi-step numerical reasoning on core banking computations such as deposits and loans; experiments show that open-source LLMs fine-tuned on the dataset improve markedly in formula generation and numerical reasoning accuracy.

Details Motivation: Existing LLMs have low accuracy on core banking computations (such as total payout estimation, multi-product comparison, and interest calculation under early repayment), and no benchmark reflects realistic banking scenarios. Method: Build the BankMathBench dataset with three difficulty levels (basic, intermediate, advanced) covering single-product reasoning, multi-product comparison, and multi-condition scenarios; train open-source LLMs on the dataset with tool-augmented fine-tuning. Result: After tool-augmented fine-tuning, accuracy on basic, intermediate, and advanced tasks rises by 57.6, 75.1, and 62.9 percentage points respectively, significantly better than zero-shot baselines. Conclusion: BankMathBench is an effective and reliable benchmark for evaluating and improving LLMs' numerical reasoning in real-world banking. Abstract: Large language model (LLM)-based chatbots are increasingly being adopted in the financial domain, particularly in digital banking, to handle customer inquiries about products such as deposits, savings, and loans. However, these models still exhibit low accuracy in core banking computations, including total payout estimation, comparison of products with varying interest rates, and interest calculation under early repayment conditions. Such tasks require multi-step numerical reasoning and contextual understanding of banking products, yet existing LLMs often make systematic errors: misinterpreting product types, applying conditions incorrectly, or failing basic calculations involving exponents and geometric progressions. However, such errors have rarely been captured by existing benchmarks. Mathematical datasets focus on fundamental math problems, whereas financial benchmarks primarily target financial documents, leaving everyday banking scenarios underexplored. To address this limitation, we propose BankMathBench, a domain-specific dataset that reflects realistic banking tasks. BankMathBench is organized in three levels of difficulty (basic, intermediate, and advanced), corresponding to single-product reasoning, multi-product comparison, and multi-condition scenarios, respectively. When trained on BankMathBench, open-source LLMs exhibited notable improvements in both formula generation and numerical reasoning accuracy, demonstrating the dataset's effectiveness in enhancing domain-specific reasoning.
With tool-augmented fine-tuning, the models achieved average accuracy increases of 57.6%p (basic), 75.1%p (intermediate), and 62.9%p (advanced), representing significant gains over zero-shot baselines. These findings highlight BankMathBench as a reliable benchmark for evaluating and advancing LLMs' numerical reasoning in real-world banking scenarios.
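
The abstract mentions calculations involving exponents and geometric progressions. As a rough illustration of what such banking computations look like, the sketch below computes a lump-sum deposit payout and an installment-savings payout. The compounding conventions (annual for the deposit, monthly with deposits at the start of each month for the savings plan), the function names, and the figures are assumptions for illustration and are not taken from BankMathBench.

```python
def deposit_payout(principal: float, annual_rate: float, years: int) -> float:
    """Lump-sum deposit with annual compounding: P * (1 + r)^n (assumed convention)."""
    return principal * (1 + annual_rate) ** years


def installment_payout(monthly_deposit: float, annual_rate: float, months: int) -> float:
    """Installment savings with monthly compounding (assumed convention).

    Each deposit is made at the start of a month, so the total is the
    geometric series D * (1 + i) * ((1 + i)^n - 1) / i, with i the monthly rate.
    """
    i = annual_rate / 12
    if i == 0:
        # No interest: the payout is just the sum of the deposits.
        return monthly_deposit * months
    return monthly_deposit * (1 + i) * ((1 + i) ** months - 1) / i


# Illustrative figures only.
lump_sum = deposit_payout(1000, 0.05, 2)
savings = installment_payout(100, 0.12, 2)
```

The closed form for the installment plan is the kind of multi-step formula where a model can go wrong by misreading the product type, for example by treating an installment plan as a lump-sum deposit.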

[16] Projective Psychological Assessment of Large Multimodal Models Using Thematic Apperception Tests

Anton Dzega,Aviad Elyashar,Ortal Slobodin,Odeya Cohen,Rami Puzis

Main category: cs.CL

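TL;DR: Using Thematic Apperception Test (TAT) images and the SCORS-G scale, this study assesses the personality-like traits that Large Multimodal Models (LMMs) display through non-language modalities. The models understand interpersonal dynamics and self-concept well but show systematic deficits in perceiving and regulating aggression, and larger and more recent models perform better.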

Details Motivation: Explore whether large models have personality-like functional characteristics that psychological scales can measure, with a focus on non-language modalities (image understanding and narrative generation). Method: Use TAT images as stimuli, with LMMs acting both as subject models (generating stories) and as evaluator models (scoring the stories with the SCORS-G scale), and compare the results against human experts. Result: Evaluator models understand and analyze TAT responses very well, and their scores agree closely with those of human experts; all models understand interpersonal dynamics and self-concept well but consistently fail to perceive and regulate aggression; performance improves systematically with larger parameter counts and later release dates. Conclusion: LMMs exhibit some stable, measurable dimensions of personality-like functioning, which supports the presence of a rudimentary social-cognitive structure, but the aggression-related dimensions are fundamentally lacking, pointing to limits of current architectures in emotion regulation and in modeling negative motivation. Abstract: Thematic Apperception Test (TAT) is a psychometrically grounded, multidimensional assessment framework that systematically differentiates between cognitive-representational and affective-relational components of personality-like functioning. This test is a projective psychological framework designed to uncover unconscious aspects of personality. This study examines whether the personality traits of Large Multimodal Models (LMMs) can be assessed through non-language-based modalities, using the Social Cognition and Object Relations Scale - Global (SCORS-G). LMMs are employed in two distinct roles: as subject models (SMs), which generate stories in response to TAT images, and as evaluator models (EMs), which assess these narratives using the SCORS-G framework. Evaluators demonstrated an excellent ability to understand and analyze TAT responses. Their interpretations are highly consistent with those of human experts. Assessment results highlight that all models understand interpersonal dynamics very well and have a good grasp of the concept of self. However, they consistently fail to perceive and regulate aggression. Performance varied systematically across model families, with larger and more recent models consistently outperforming smaller and earlier ones across SCORS-G dimensions.

[17] The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI

Dusan Bosnjakovic

Main category: cs.CL

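TL;DR: This paper proposes a new psychometrics-based auditing framework that quantifies stable latent behavioral tendencies in LLMs (such as optimization bias, sycophancy, and status-quo legitimization) without relying on ground-truth labels, and shows that in locked-in provider ecosystems these latent biases can form recursive ideological echo chambers.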

Details Motivation: As LLMs become the foundational reasoning layer of multi-agent systems and recursive evaluation loops (such as LLM-as-a-judge), durable, cross-version, provider-level behavioral signatures need to be detected; traditional benchmarks measure only transient task accuracy and cannot capture the stable "prevailing mindsets" embedded during training and alignment. Method: Apply latent trait estimation from psychometric measurement theory, using forced-choice ordinal vignettes under ordinal uncertainty (with semantically orthogonal decoys) and cryptographic permutation-invariance to ensure fairness; use Mixed Linear Models (MixedLM) and the Intraclass Correlation Coefficient (ICC) to analyze the response consistency of nine leading models across several dimensions. Result: Item-level framing drives high variance, yet a significant "lab signal" drives behavioral clustering across models; in "locked-in" provider ecosystems, latent biases are compounding variables that amplify recursively across architectural layers rather than static errors. Conclusion: Latent behavioral tendencies are stable across versions and spread at the system level, so psychometric auditing should become part of standard AI safety and governance processes to guard against recursive ideological echo chambers in multi-layered AI architectures. Abstract: As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral signatures becomes a critical requirement for safety and governance. Traditional benchmarks measure transient task accuracy but fail to capture stable, latent response policies -- the "prevailing mindsets" embedded during training and alignment that outlive individual model versions. This paper introduces a novel auditing framework that utilizes psychometric measurement theory -- specifically latent trait estimation under ordinal uncertainty -- to quantify these tendencies without relying on ground-truth labels. Utilizing forced-choice ordinal vignettes masked by semantically orthogonal decoys and governed by cryptographic permutation-invariance, the research audits nine leading models across dimensions including Optimization Bias, Sycophancy, and Status-Quo Legitimization. Using Mixed Linear Models (MixedLM) and Intraclass Correlation Coefficient (ICC) analysis, the research identifies that while item-level framing drives high variance, a persistent "lab signal" accounts for significant behavioral clustering. These findings demonstrate that in "locked-in" provider ecosystems, latent biases are not merely static errors but compounding variables that risk creating recursive ideological echo chambers in multi-layered AI architectures.

[18] What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform

Adrian Cosma,Cosmin Dumitrache,Emilian Radoi

Main category: cs.CL

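TL;DR: This paper analyzes patient satisfaction signals in Romanian text-based telemedicine. Using 77,334 patient question-doctor response pairs, it builds a binary classifier to predict thumbs-up feedback; SHAP analysis shows that doctor and patient history features dominate, while linguistic features of the response text (such as politeness and hedging) have a smaller but actionable effect.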

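Details Motivation: Text-based telemedicine is increasingly common, and clinicians must give medical advice clearly and effectively in writing; platforms also rely on patient ratings, which reflect communication quality more than clinical accuracy, so the linguistic factors behind patient satisfaction need closer study. Method: From 77,334 anonymised patient question-doctor response pairs, treat thumbs-up as the positive class and all other feedback as the negative class; extract language-agnostic features (length, structure, readability proxies), Romanian LIWC psycholinguistic features, and politeness/hedging markers; train a classifier with a time-based split and use SHAP for interpretability, together with subgroup correlation analyses. Result: Patient and clinician history features are the strongest predictors of satisfaction; among response-text features, politeness and hedging are consistently positively associated with satisfaction, while lexical diversity is negatively associated; text features contribute less but are open to intervention. Conclusion: In text-based telemedicine, beyond historical behaviour, the linguistic style of the doctor's response (especially politeness and hedging) is a controllable factor associated with patient satisfaction, which suggests that communication training or assistive tools could improve written responses. Abstract: Text-based telemedicine has become a common mode of care, requiring clinicians to deliver medical advice clearly and effectively in writing. As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain satisfaction scores, even though these evaluations often reflect communication quality more than clinical accuracy. We analyse patient satisfaction signals in Romanian text-based telemedicine. Using a sample of 77,334 anonymised patient question--doctor response pairs, we model feedback as a binary outcome, treating thumbs-up responses as positive and grouping negative or absent feedback into the other class. We extract interpretable, predominantly language-agnostic features (e.g., length, structural characteristics, readability proxies), along with Romanian LIWC psycholinguistic features and politeness/hedging markers where available. We train a classifier with a time-based split and perform SHAP-based analyses, which indicate that patient and clinician history features dominate prediction, functioning as strong priors, while characteristics of the response text provide a smaller but, crucially, actionable signal. In subgroup correlation analyses, politeness and hedging are consistently positively associated with patient feedback, whereas lexical diversity shows a negative association.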

[19] Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study

Kensuke Okada,Yui Furukawa,Kyosuke Bunji

Main category: cs.CL

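TL;DR: This paper proposes a psychometric framework to quantify and mitigate the bias that socially desirable responding (SDR) introduces into questionnaire-based evaluation of LLMs. SDR is quantified by comparing IRT latent scores under HONEST and FAKE-GOOD instructions, and a desirability-matched graded forced-choice (GFC) Big Five inventory is designed for mitigation. Experiments show that GFC substantially reduces SDR while largely preserving recovery of the target persona profiles.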

Details Motivation: Human self-report questionnaires are widely used in NLP to evaluate LLMs (e.g., persona consistency, safety, bias), but they assume honest responding; in evaluative contexts LLMs tend toward socially desirable answers (SDR), which biases scores and leads to wrong conclusions. Method: A psychometric framework in two parts: (1) administer the same inventory under HONEST and FAKE-GOOD instructions, estimate latent scores with item response theory (IRT), and compute a direction-corrected standardized effect size to quantify SDR; (2) construct a graded forced-choice (GFC) Big Five inventory of 30 cross-domain, desirability-matched pairs to mitigate SDR. Result: Across nine instruction-tuned LLMs tested on synthetic personas with known target profiles, Likert-style questionnaires show consistently large SDR, whereas the GFC inventory substantially attenuates SDR while largely preserving accurate recovery of the target persona profiles. Conclusion: LLMs show substantial, model-dependent SDR in questionnaire-based evaluation; mitigation strategies such as GFC can improve validity; SDR-aware reporting is recommended for LLM benchmarking and auditing. Abstract: Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments presume honest responding; in evaluative contexts, LLMs can instead gravitate toward socially preferred answers, a form of socially desirable responding (SDR), biasing questionnaire-derived scores and downstream conclusions. We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs. To quantify SDR, the same inventory is administered under HONEST versus FAKE-GOOD instructions, and SDR is computed as a direction-corrected standardized effect size from item response theory (IRT)-estimated latent scores. This enables comparisons across constructs and response formats, as well as against human instructed-faking benchmarks. For mitigation, we construct a graded forced-choice (GFC) Big Five inventory by selecting 30 cross-domain pairs from an item pool via constrained optimization to match desirability. Across nine instruction-tuned LLMs evaluated on synthetic personas with known target profiles, Likert-style questionnaires show consistently large SDR, whereas desirability-matched GFC substantially attenuates SDR while largely preserving the recovery of the intended persona profiles. These results highlight a model-dependent SDR-recovery trade-off and motivate SDR-aware reporting practices for questionnaire-based benchmarking and auditing of LLMs.
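
To make the "direction-corrected standardized effect size" concrete, here is a minimal sketch. It assumes a Cohen's-d-style statistic with a pooled standard deviation, applied to plain lists of scores rather than IRT-estimated latent scores; the abstract does not give the paper's exact estimator, and the function name and `desirable_high` flag are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def sdr_effect_size(honest, fake_good, desirable_high=True):
    """Standardized FAKE-GOOD minus HONEST difference, signed so that
    positive values always mean a shift toward the desirable pole.

    Assumption: pooled-SD Cohen's d; the paper's estimator may differ.
    """
    n1, n2 = len(honest), len(fake_good)
    pooled_var = ((n1 - 1) * stdev(honest) ** 2
                  + (n2 - 1) * stdev(fake_good) ** 2) / (n1 + n2 - 2)
    d = (mean(fake_good) - mean(honest)) / sqrt(pooled_var)
    # Flip the sign for traits where a low score is the desirable direction
    # (for example, neuroticism).
    return d if desirable_high else -d


# Toy scores, invented for illustration.
conscientiousness = sdr_effect_size([0, 1, 2], [1, 2, 3], desirable_high=True)
neuroticism = sdr_effect_size([1, 2, 3], [0, 1, 2], desirable_high=False)
```

The sign flip is what allows effect sizes to be compared across constructs: both toy traits above shift by one pooled standard deviation toward the desirable pole, so both yield the same positive value.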

[20] Towards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective

Yukun Chen,Xinyu Zhang,Jialong Tang,Yu Wan,Baosong Yang,Yiming Li,Zhan Qin,Kui Ren

Main category: cs.CL

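TL;DR: This paper presents X-Value, a cross-lingual values assessment benchmark that evaluates LLMs' ability to understand the deep-level values of content in multilingual settings. It covers 18 languages and more than 5,000 QA pairs, is grounded in Schwartz's Theory of Basic Human Values, and shows that current SOTA models underperform on the task and exhibit significant disparities across languages.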

Details Motivation: Current content-safety evaluation focuses on explicit harms (e.g., violence, hate speech) and neglects the subtler, deeper value dimensions of digital content, and it lacks a cross-lingual, global perspective on values assessment. Method: Build the X-Value benchmark: more than 5,000 QA pairs in 18 languages, organized into 7 core domains following Schwartz's Theory of Basic Human Values and split into easy and hard levels; propose a two-stage annotation framework that first decides whether an issue falls under global consensus or value pluralism, then runs a multi-party evaluation of the latent values in the content. Result: Systematic evaluation on X-Value shows that current SOTA LLMs reach below 77% accuracy on cross-lingual values assessment, with performance gaps across languages exceeding 20% ($\Delta Acc > 20\%$). Conclusion: LLMs remain seriously inadequate at fine-grained, values-aware content assessment, and their ability to understand and judge values across languages and cultures needs to improve; X-Value provides the first systematic benchmark in this direction. Abstract: While large language models (LLMs) have become pivotal to content safety, current evaluation paradigms primarily focus on detecting explicit harms (e.g., violence or hate speech), neglecting the subtler value dimensions conveyed in digital content. To bridge this gap, we introduce X-Value, a novel Cross-lingual Values Assessment Benchmark designed to evaluate LLMs' ability to assess deep-level values of content from a global perspective. X-Value consists of more than 5,000 QA pairs across 18 languages, systematically organized into 7 core domains grounded in Schwartz's Theory of Basic Human Values and categorized into easy and hard levels for discriminative evaluation. We further propose a unique two-stage annotation framework that first identifies whether an issue falls under global consensus (e.g., human rights) or pluralism (e.g., religion), and subsequently conducts a multi-party evaluation of the latent values embedded within the content. Systematic evaluations on X-Value reveal that current SOTA LLMs exhibit deficiencies in cross-lingual values assessment ($Acc < 77\%$), with significant performance disparities across different languages ($\Delta Acc > 20\%$). This work highlights the urgent need to improve the nuanced, values-aware content assessment capability of LLMs. Our X-Value is available at: https://huggingface.co/datasets/Whitolf/X-Value.

[21] Representation Collapse in Machine Translation Through the Lens of Angular Dispersion

Evgeniia Tokarchuk,Maya K. Nachesa,Sergey Troshin,Vlad Niculae

Main category: cs.CL

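TL;DR: This paper studies representation collapse in Transformer-based neural machine translation caused by the standard next-token prediction training strategy, which is especially pronounced in deeper layers and in continuous-output NMT. The authors mitigate it with an angular-dispersion-based regularizer and show that it improves translation quality and that the benefit persists in quantized models.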

Details Motivation: The standard next-token prediction training strategy can lead to representation collapse, which is more severe in deeper Transformer layers and in continuous-output NMT, where it can even push the model toward the trivial solution of setting all vectors to the same value. Method: Analyze the dynamics of representation collapse across the layers of discrete and continuous NMT Transformers throughout training, and intervene with an existing regularization method based on angular dispersion. Result: The regularizer mitigates representation collapse and also improves translation quality; quantized models show similar collapse behavior and retain the regularization gains. Conclusion: Representation collapse is a common problem in NMT training, and angular dispersion regularization is a simple, robust remedy that applies to both regular and quantized Transformer models. Abstract: Modern neural translation models based on the Transformer architecture are known for their high performance, particularly when trained on high-resource datasets. A standard next-token prediction training strategy, while widely adopted in practice, may lead to overlooked artifacts such as representation collapse. Previous works have shown that this problem is especially pronounced in the representation of the deeper Transformer layers, where it often fails to efficiently utilize the geometric space. Representation collapse is even more evident in end-to-end training of continuous-output neural machine translation, where the trivial solution would be to set all vectors to the same value. In this work, we analyze the dynamics of representation collapse at different levels of discrete and continuous NMT transformers throughout training. We incorporate an existing regularization method based on angular dispersion and demonstrate empirically that it not only mitigates collapse but also improves translation quality. Furthermore, we show that quantized models exhibit similar collapse behavior and that the benefits of regularization are preserved even after quantization.
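
One simple way to see what collapse means geometrically is to measure how similar a set of representation vectors is in angle. The sketch below computes the mean pairwise cosine similarity as a diagnostic. It is an illustrative proxy under my own assumptions, not the angular dispersion regularizer the paper uses, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors.

    Values near 1 mean the vectors point in almost the same direction
    (collapsed); lower values mean they are spread out in angle.
    """
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D "representations", invented for illustration.
collapsed = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
dispersed = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
collapsed_score = mean_pairwise_cosine(collapsed)
dispersed_score = mean_pairwise_cosine(dispersed)
```

A dispersion-based regularizer would, roughly speaking, push a statistic of this kind down during training; tracking it per layer is one way to observe the deeper-layer effect the abstract describes.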

[22] Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

Bogdan Kostić,Conor Fallon,Julian Risch,Alexander Löser

Main category: cs.CL

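TL;DR: This paper studies how lexical and syntactic perturbations affect 23 LLMs on three benchmarks (MMLU, SQuAD, and AMEGA). Lexical perturbations cause substantial performance drops almost everywhere, while syntactic perturbations have mixed effects; robustness does not scale consistently with model size, suggesting that current LLMs rely more on surface lexical patterns than on deeper linguistic competence.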

Details Motivation: The reliability of existing LLM benchmarks is in question because they are sensitive to shallow variations in input prompts, so model robustness to semantically equivalent but differently worded inputs needs systematic study. Method: Design two linguistically motivated perturbation pipelines, lexical perturbation via synonym substitution and syntactic transformation guided by dependency parsing, and test performance changes and ranking stability for 23 LLMs on three mainstream benchmarks. Result: Lexical perturbations cause significant performance degradation across nearly all models and tasks; syntactic perturbations have heterogeneous effects and occasionally improve results; both perturbation types destabilize leaderboards on complex tasks; robustness does not increase monotonically with parameter count and depends strongly on the task. Conclusion: Current LLMs rely more on surface lexical cues than on abstract linguistic competence, and robustness testing should become a standard part of LLM evaluation. Abstract: The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ two linguistically principled pipelines to generate meaning-preserving variations: one performing synonym substitution for lexical changes, and another using dependency parsing to determine applicable syntactic transformations. Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results. Both perturbation types destabilize model leaderboards on complex tasks. Furthermore, model robustness did not consistently scale with model size, revealing strong task dependence. Overall, the findings suggest that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence, underscoring the need for robustness testing as a standard component of LLM evaluation.

[23] RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering

Yiming Zhang,Siyue Zhang,Junbo Zhao,Chen Zhao

Main category: cs.CL

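TL;DR: This paper proposes RPDR, a framework that generates synthetic data, selects easy-to-learn instances via Round-Trip prediction, and trains a dense retriever on them, substantially improving retrieval for long-tail question answering.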

Details Motivation: LLMs struggle to acquire and accurately recall low-frequency knowledge in long-tail question answering, and existing dense retrievers generalize poorly to rare or niche knowledge. Method: Propose the RPDR data augmentation framework with three components: synthetic data generation, data selection with Round-Trip prediction (to identify easy-to-learn instances), and dense retriever training on the selected instances; add a dynamic routing mechanism that sends queries to specialized retrieval modules. Result: On two long-tail retrieval benchmarks, PopQA and EntityQuestion, RPDR substantially outperforms baselines such as BM25 and Contriever, especially on extremely long-tail categories; human analysis identifies its strengths and limitations. Conclusion: RPDR improves dense retrievers in long-tail settings, and dynamic routing further improves retrieval adaptability, offering a new approach to long-tail knowledge retrieval. Abstract: Long-tail question answering presents significant challenges for large language models (LLMs) due to their limited ability to acquire and accurately recall less common knowledge. Retrieval-augmented generation (RAG) systems have shown great promise in mitigating this limitation by integrating external retrieval mechanisms. However, dense retrieval models often face the same difficulties when generalizing to rare or niche knowledge. In this study, we introduce RPDR, a novel data augmentation framework that selects high-quality easy-to-learn training data, to enhance dense retrievers. Our approach is built around three core components: synthetic data generation, data selection with Round-Trip prediction to identify easy-to-learn instances, and retriever training with these instances. We evaluate RPDR on two long-tail retrieval benchmarks, PopQA and EntityQuestion, demonstrating substantial improvements over existing retrievers like BM25 and Contriever, especially on extremely long-tail categories. We identify the strengths and limitations of RPDR through detailed human analysis and propose a dynamic routing mechanism to dynamically route queries to specialized retrieval modules to further improve retrieval performance.

[24] The Role of the Availability Heuristic in Multiple-Choice Answering Behaviour

Leonidas Zotos,Hedderik van Rijn,Malvina Nissim

Main category: cs.CL

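TL;DR: This paper examines how effective the availability heuristic (choosing the option that comes most readily to mind) is as a guessing strategy for multiple-choice questions (MCQs). Cognitive availability is quantified by the prevalence of each option's concepts in a large corpus (such as Wikipedia); correct answers turn out to be more available than distractors, and the strategy scores 13.5% to 32.9% above the random-guess baseline. LLM-generated options show similar availability patterns. The results suggest that models of student answering behaviour should account for availability.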

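Details Motivation: Students who are unsure of the correct MCQ answer often guess, and the availability heuristic (Tversky & Kahneman, 1973) suggests that the option that comes most readily to mind may more often be correct, but this intuition has not been systematically tested. The paper tests the strategy computationally and checks whether it holds for both human-written and machine-generated questions. Method: Propose a computational method that quantifies the cognitive availability of each MCQ option from the frequency or co-occurrence frequency of its concepts in a large text corpus (such as Wikipedia); evaluate the "always pick the most available option" strategy on three large real question sets, and compare the availability distributions of expert-written and LLM-generated options. Result: Correct answers are significantly more available than distractors in all question sets; choosing by availability alone scores 13.5% to 32.9% above the random-guess baseline; LLM-generated options show the same pattern of more available correct answers, in line with the frequentist nature of their training data. Conclusion: The availability heuristic is an effective and robust MCQ guessing strategy; the effect holds for human-written questions and extends to LLM-generated ones, suggesting that corpus statistics may be internalized as cognitive cues; future models of student answering behaviour should include availability explicitly. Abstract: When students are unsure of the correct answer to a multiple-choice question (MCQ), guessing is common practice. The availability heuristic, proposed by A. Tversky and D. Kahneman in 1973, suggests that the ease with which relevant instances come to mind, typically operationalised by the mere frequency of exposure, can offer a mental shortcut for problems in which the test-taker does not know the exact answer. Is simply choosing the option that comes most readily to mind a good strategy for answering MCQs? We propose a computational method of assessing the cognitive availability of MCQ options operationalised by concepts' prevalence in large corpora. The key finding, across three large question sets, is that correct answers, independently of the question stem, are significantly more available than incorrect MCQ options. Specifically, using Wikipedia as the retrieval corpus, we find that always selecting the most available option leads to scores 13.5% to 32.9% above the random-guess baseline. We further find that LLM-generated MCQ options show similar patterns of availability compared to expert-created options, despite the LLMs' frequentist nature and their training on large collections of textual data. Our findings suggest that availability should be considered in current and future work when computationally modelling student behaviour.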
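
The "always select the most available option" strategy can be sketched in a few lines. The version below assumes availability is a single precomputed corpus count per option; the paper derives availability from Wikipedia retrieval, whereas the counts, option names, and function names here are invented for illustration.

```python
def most_available(options, corpus_counts):
    """Return the option with the highest corpus count (0 if unseen)."""
    return max(options, key=lambda option: corpus_counts.get(option, 0))


def heuristic_accuracy(questions, corpus_counts):
    """Fraction of questions where the most available option is correct.

    `questions` is a list of (options, correct_option) pairs.
    """
    hits = sum(
        1 for options, correct in questions
        if most_available(options, corpus_counts) == correct
    )
    return hits / len(questions)


# Invented counts: the heuristic picks "Paris" (correct) and
# "Sydney" (incorrect, the answer key says "Canberra").
toy_counts = {"Paris": 900, "Lyon": 120, "Nice": 80,
              "Sydney": 400, "Canberra": 60, "Perth": 90}
toy_questions = [
    (["Lyon", "Paris", "Nice"], "Paris"),
    (["Sydney", "Canberra", "Perth"], "Canberra"),
]
toy_accuracy = heuristic_accuracy(toy_questions, toy_counts)
```

The second toy question shows the heuristic's failure mode: a prominent distractor beats a less frequent correct answer. The reported gains over random guessing are therefore an aggregate tendency, not a per-question guarantee.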

[25] Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference

Anastasia Zhukova,Felix Hamborg,Karsten Donnay,Norman Meuschke,Bela Gipp

Main category: cs.CL

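TL;DR: This paper proposes a revised annotation scheme for cross-document coreference resolution (CDCR) that treats coreference chains as discourse elements (DEs) and supports both identity and near-identity relations, to better capture lexical diversity and framing differences in news coverage. NewsWCL50 and a subset of ECB+ are reannotated and validated under the scheme.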

Details Motivation: Existing CDCR datasets focus mostly on event coreference and define coreference narrowly, which makes them a poor fit for news coverage where wording varies widely and viewpoints are polarized. Method: Redefine coreference chains as discourse elements (DEs) that support identity and near-identity relations; reannotate NewsWCL50 and a subset of ECB+ with a unified codebook; evaluate the new datasets with lexical diversity metrics and a same-head-lemma baseline. Result: On lexical diversity and related metrics, the reannotated datasets fall between the original ECB+ and NewsWCL50, which supports their consistency and discourse-aware character. Conclusion: The revised scheme improves CDCR's ability to model discourse diversity and framing differences in news, and provides a reliable resource for more balanced, discourse-aware CDCR research. Abstract: Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents, enabling content analysis that aggregates information at the level of discourse participants. However, existing datasets primarily focus on event resolution and employ a narrow definition of coreference, which limits their effectiveness in analyzing diverse and polarized news coverage where wording varies widely. This paper proposes a revised CDCR annotation scheme of the NewsWCL50 dataset, treating coreference chains as discourse elements (DEs) and conceptual units of analysis. The approach accommodates both identity and near-identity relations, e.g., by linking "the caravan" - "asylum seekers" - "those contemplating illegal entry", allowing models to capture lexical diversity and framing variation in media discourse, while maintaining the fine-grained annotation of DEs. We reannotate the NewsWCL50 and a subset of ECB+ using a unified codebook and evaluate the new datasets through lexical diversity metrics and a same-head-lemma baseline. The results show that the reannotated datasets align closely, falling between the original ECB+ and NewsWCL50, thereby supporting balanced and discourse-aware CDCR research in the news domain.
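
A same-head-lemma baseline groups mentions whose syntactic head has the same lemma. The sketch below assumes the head lemma of each mention is already known (in practice it would come from a parser and lemmatizer); the function name and the hand-assigned head lemmas are hypothetical, while the mention strings reuse the abstract's example.

```python
from collections import defaultdict


def same_head_lemma_clusters(mentions):
    """Cluster mentions by head lemma.

    `mentions` is a list of (mention_text, head_lemma) pairs; the head
    lemmas are assumed to be precomputed by an external parser.
    """
    clusters = defaultdict(list)
    for text, head_lemma in mentions:
        clusters[head_lemma.lower()].append(text)
    return dict(clusters)


# Head lemmas assigned by hand for illustration.
toy_mentions = [
    ("the caravan", "caravan"),
    ("a migrant caravan", "caravan"),
    ("asylum seekers", "seeker"),
    ("those contemplating illegal entry", "those"),
]
toy_clusters = same_head_lemma_clusters(toy_mentions)
```

On this toy input the baseline links the two "caravan" mentions but keeps "asylum seekers" and "those contemplating illegal entry" apart, even though the revised scheme would place all of them in one discourse element. That gap is why the baseline is a useful probe of lexical diversity.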

[26] Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics

Sanjeev Kumar,Preethi Jyothi,Pushpak Bhattacharyya

Main category: cs.CL

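TL;DR: This paper compares two machine translation metrics, BLEU and ChrF++, in extremely low-resource language (ELRL) settings and finds that BLEU, despite its lower scores, provides complementary lexical-precision information that makes results easier to interpret.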

Details Motivation: In extremely low-resource language (ELRL) settings, traditional metrics such as BLEU tend to misjudge translation quality when data is scarce, so more suitable evaluation methods need to be examined. Method: Compare BLEU (n-gram-based) and ChrF++ (character-based), examining how each responds to translation artifacts such as hallucinations, repetition, source-text copying, and diacritic (matra) variations, across three ELRLs (Magahi, Bhojpuri, and Chhattisgarhi), with a focus on LLM and NMT system outputs. Result: Although recent work relies heavily on ChrF++, BLEU still provides valuable lexical-precision information; the two metrics are complementary, and BLEU helps make evaluation results more interpretable. Conclusion: ELRL MT evaluation should not drop BLEU but should combine it with metrics such as ChrF++ to cover different dimensions of quality. Abstract: Evaluating machine translation (MT) quality in extremely low-resource language (ELRL) scenarios poses unique challenges, as widely used metrics such as BLEU, effective in high-resource settings, often misrepresent quality in data-scarce contexts. This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings. We examine how each metric responds to translation artifacts, including hallucinations, repetition, source-text copying, and diacritic (matra) variations across three ELRLs: Magahi, Bhojpuri, and Chhattisgarhi, with a focus on outputs from large language models (LLMs) and neural MT (NMT) systems. While recent work often relies solely on ChrF++, our findings show that BLEU, despite its lower absolute scores, provides complementary lexical-precision insights that improve interpretability.
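
The contrast between word-level and character-level matching can be illustrated with a much-simplified pair of scores. The sketch below is neither BLEU nor ChrF++: it uses a single character n-gram order with an F-beta score and a plain word unigram precision, only to show why a one-character spelling difference costs more at the word level. The function names, defaults, and toy strings are assumptions for illustration.

```python
from collections import Counter


def ngram_counts(seq, n):
    """Multiset of n-grams over a sequence (string or list of tokens)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def char_fscore(hyp, ref, n=2, beta=2.0):
    """Simplified character n-gram F-beta score (one order, spaces removed)."""
    h = ngram_counts(hyp.replace(" ", ""), n)
    r = ngram_counts(ref.replace(" ", ""), n)
    match = sum((h & r).values())  # clipped n-gram matches
    if match == 0:
        return 0.0
    precision = match / sum(h.values())
    recall = match / sum(r.values())
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


def word_unigram_precision(hyp, ref):
    """Clipped word unigram precision, the simplest BLEU-style ingredient."""
    h = ngram_counts(hyp.split(), 1)
    r = ngram_counts(ref.split(), 1)
    total = sum(h.values())
    return sum((h & r).values()) / total if total else 0.0


# Made-up tokens: the hypothesis differs from the reference by one character.
reference = "abcd efggh ijkl"
hypothesis = "abcd efgh ijkl"
char_score = char_fscore(hypothesis, reference)
word_score = word_unigram_precision(hypothesis, reference)
```

In the toy pair, one of three words fails to match exactly, while most character bigrams still match, so the character-level score is the more forgiving of the two. This is consistent with the abstract's point that the two metric families respond differently to artifacts such as diacritic variation.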

[27] Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study

Dylan Bouchard,Mohit Singh Chauhan,Viren Bajaj,David Skarbrevik

Main category: cs.CL

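TL;DR: This paper proposes a taxonomy for fine-grained uncertainty quantification (UQ) in long-form generation, organized into three stages (response decomposition, unit-level scoring, and response-level aggregation), to evaluate the factuality of long-form LLM outputs systematically. Experiments show that scoring based on claim-response entailment performs best and that uncertainty-aware decoding markedly improves factuality.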

Details Motivation: Existing uncertainty quantification methods target short-form outputs and generalize poorly to long-form generation, and there is no systematic framework for fine-grained analysis. Method: Build a UQ taxonomy covering three stages: response decomposition, unit-level scoring (e.g., claim-level or sentence-level), and response-level aggregation; formalize several families of consistency-based black-box scorers, generalizing and extending existing methods. Result: (1) claim-response entailment performs better than or on par with more complex claim-level scorers; (2) claim-level scoring generally beats sentence-level scoring; (3) uncertainty-aware decoding markedly improves the factuality of long-form outputs. Conclusion: The framework clarifies relationships between prior methods, supports fair comparisons, and gives practical guidance for choosing fine-grained UQ components. Abstract: Uncertainty quantification has emerged as an effective approach to closed-book hallucination detection for LLMs, but existing methods are largely designed for short-form outputs and do not generalize well to long-form generation. We introduce a taxonomy for fine-grained uncertainty quantification in long-form LLM outputs that distinguishes methods by design choices at three stages: response decomposition, unit-level scoring, and response-level aggregation. We formalize several families of consistency-based black-box scorers, providing generalizations and extensions of existing methods. In our experiments across multiple LLMs and datasets, we find 1) claim-response entailment consistently performs better or on par with more complex claim-level scorers, 2) claim-level scoring generally yields better results than sentence-level scoring, and 3) uncertainty-aware decoding is highly effective for improving the factuality of long-form outputs. Our framework clarifies relationships between prior methods, enables apples-to-apples comparisons, and provides practical guidance for selecting components for fine-grained UQ.

[28] AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue

Adib Sakhawat,Fardeen Sadab,Rakin Shahriar

Main category: cs.CL

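TL;DR: This paper introduces AIDG (Adversarial Information Deduction Game), a framework for evaluating LLMs' strategic reasoning in dynamic multi-turn dialogue. Models are substantially better at information containment (defense) than at information extraction (deduction), and the paper analyzes two key bottlenecks behind the gap.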

Details Motivation: Static benchmarks are not enough to evaluate LLMs' strategic reasoning, which calls for dynamic, multi-turn interaction; in particular, the asymmetry between information extraction and information containment needs to be examined. Method: Build the game-theoretic AIDG framework with two complementary tasks: AIDG-I (pragmatic strategy in social deduction) and AIDG-II (constraint satisfaction in a structured "20 Questions" setting); test six frontier LLMs across 439 games. Result: Models perform substantially better at containment than at extraction, with a 350 ELO advantage on defense (Cohen's d = 5.47); confirmation strategies are 7.75x more effective than blind deduction (p < 0.00001); 41.3% of deductive failures stem from instruction-following degrading under conversational load. Conclusion: LLMs are good at local defensive coherence but fall short on strategic inquiry that requires global state tracking; their strategic reasoning is limited by how well they model information dynamics and keep adhering to constraints. Abstract: Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions. We introduce AIDG (Adversarial Information Deduction Game), a game-theoretic framework that probes the asymmetry between information extraction (active deduction) and information containment (state maintenance) in dialogue. We propose two complementary tasks: AIDG-I, measuring pragmatic strategy in social deduction, and AIDG-II, measuring constraint satisfaction in a structured "20 Questions" setting. Across 439 games with six frontier LLMs, we observe a clear capability asymmetry: models perform substantially better at containment than deduction, with a 350 ELO advantage on defense (Cohen's d = 5.47). We identify two bottlenecks driving this gap: (1) Information Dynamics, where confirmation strategies are 7.75x more effective than blind deduction (p < 0.00001), and (2) Constraint Adherence, where instruction-following degrades under conversational load, accounting for 41.3% of deductive failures. These findings suggest that while LLMs excel at local defensive coherence, they struggle with the global state tracking required for strategic inquiry.
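
To get a feel for the size of a 350-point rating gap, the sketch below evaluates the expected score under the standard logistic Elo model with a scale of 400. Whether the paper uses exactly this formulation is an assumption; the abstract only reports the gap, and the function name is hypothetical.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated side under the standard Elo model.

    E = 1 / (1 + 10^(-gap / scale)); the scale of 400 is the conventional
    choice and is assumed here, not taken from the paper.
    """
    return 1.0 / (1.0 + 10 ** (-rating_gap / scale))


# Expected score for the defender given the reported 350-point advantage.
defender_expected = elo_expected_score(350)
```

Under these assumptions a 350-point gap corresponds to an expected score of roughly 0.88 for the stronger side, which conveys how lopsided the containment versus deduction asymmetry is.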

[29] ABCD: All Biases Come Disguised

Mateusz Nowak,Xavier Cadet,Peter Chin

Main category: cs.CL

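TL;DR: This paper proposes an MCQ evaluation protocol that reduces label and position bias by using uniform, unordered labels and matching on the full answer text, which makes LLM evaluation more robust.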

Details Motivation: In existing MCQ benchmarks, LLMs exhibit several biases (answer position, label content, and the distribution of correct answers in few-shot examples) that prevent an accurate assessment of their real reasoning ability. Method: Design the synthetic NonsenseQA benchmark to expose these biases; propose a new evaluation protocol that replaces the original labels with uniform, unordered labels, asks the model to answer with the full answer text, and matches the prediction to the gold answer with a sentence similarity model. Result: Across multiple benchmarks and models, the protocol substantially improves robustness to answer permutations, reducing mean accuracy variance 3x with only a minimal drop in performance; ablation studies show it is more robust than the standard methods. Conclusion: Removing surface cues (such as label position) from evaluation reveals LLMs' underlying capabilities more faithfully, and the proposed protocol is a simple, effective, robust evaluation scheme that needs no fine-tuning or extra prompting. Abstract: Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions. Through a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit varying degrees of label-position-few-shot-prompt bias, where the model either uses the answer position, the label in front of the answer, the distributions of correct answers present in the few-shot prompt, or a combination of all to answer each MCQ question. We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels and prompts the LLM to use the whole answer presented. With a simple sentence similarity model, we demonstrate improved robustness and lower standard deviation between different permutations of answers with a minimal drop in LLM's performance, exposing the LLM's capabilities under reduced evaluation artifacts, without any help from the prompt examples or the option labels. Across multiple benchmarks and models, this protocol substantially improves the robustness to answer permutations, reducing mean accuracy variance $3\times$ with only a minimal decrease in the mean model's performance. Through ablation studies on various embedding models and similarity functions, we show that the method is more robust than the standard ones.
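
The matching step of such a protocol can be sketched as follows: the model answers with full text, and the evaluator picks the option most similar to that text, so labels and positions play no role. Token-overlap (Jaccard) similarity stands in for the sentence similarity model here; this substitution and all names and strings below are assumptions for illustration.

```python
from itertools import permutations


def jaccard(a, b):
    """Token-overlap similarity, a crude stand-in for a sentence embedding model."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def match_answer(model_output, options):
    """Return the option text most similar to the model's free-text answer."""
    return max(options, key=lambda option: jaccard(model_output, option))


# Toy example: the matched option does not depend on option order.
toy_options = ["the mitochondria", "the nucleus", "the ribosome"]
toy_output = "it is the nucleus"
matches = {match_answer(toy_output, list(order)) for order in permutations(toy_options)}
```

Because matching depends only on the answer text, every ordering of the options yields the same match in this toy case (barring exact similarity ties, which `max` would break by position). That order-independence is the property the protocol aims for.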

[30] Entropy-Based Data Selection for Language Models

Hongming Li,Yang Liu,Chao Huang

Main category: cs.CL

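TL;DR: This paper proposes an entropy-based unsupervised data selection framework (EUDS) that reduces the data and compute needed to fine-tune language models, especially in compute-constrained settings.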

Details Motivation: Existing data selection methods can reduce the amount of data needed for fine-tuning but usually require a high compute budget, which makes them hard to use in resource-limited practice; and although LLMs ease data scarcity, evaluating data usability remains difficult. Method: Propose the Entropy-Based Unsupervised Data Selection (EUDS) framework, a computationally efficient unsupervised data-filtering mechanism that uses uncertainty estimates (entropy) to guide data selection. Result: Experiments on sentiment analysis, topic classification, and question answering show that EUDS significantly reduces computational cost, improves training efficiency, and maintains or even improves model performance with less data. Conclusion: EUDS offers a practical approach to efficient LM fine-tuning in compute-constrained settings, balancing data efficiency against compute cost. Abstract: Modern language models (LMs) increasingly require two critical resources: computational resources and data resources. Data selection techniques can effectively reduce the amount of training data required for fine-tuning LMs. However, their effectiveness is closely related to computational resources, as they typically require a high compute budget. Owing to the resource limitations in practical fine-tuning scenarios, we systematically reveal the relationship between data selection and uncertainty estimation of selected data. Although large language models (LLMs) exhibit exceptional capabilities in language understanding and generation, which provide new ways to alleviate data scarcity, evaluating data usability remains a challenging task. This makes efficient data selection indispensable. To mitigate these issues, we propose the Entropy-Based Unsupervised Data Selection (EUDS) framework. Empirical experiments on sentiment analysis (SA), topic classification (Topic-CLS), and question answering (Q&A) tasks validate its effectiveness. EUDS establishes a computationally efficient data-filtering mechanism. Theoretical analysis and experimental results confirm the effectiveness of our approach. EUDS significantly reduces computational costs and improves training time efficiency with less data requirement. This provides an innovative solution for the efficient fine-tuning of LMs in compute-constrained scenarios.
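
As a rough picture of entropy-guided selection, the sketch below ranks unlabeled examples by the Shannon entropy of a model's predicted class distribution and keeps the top k. The abstract does not say how EUDS computes entropy or whether it keeps high- or low-entropy examples, so the scoring source, the `highest` flag, and all names are assumptions.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)


def select_by_entropy(examples, k, highest=True):
    """Keep k examples ranked by predictive entropy.

    `examples` is a list of (text, predicted_probs) pairs. Whether to keep
    the most or least uncertain examples is left as a parameter because
    the source does not specify it.
    """
    ranked = sorted(examples, key=lambda ex: entropy(ex[1]), reverse=highest)
    return [text for text, _ in ranked[:k]]


# Toy predictions, invented for illustration.
toy_pool = [
    ("clearly positive review", [0.98, 0.02]),
    ("mixed feelings review", [0.55, 0.45]),
    ("ambiguous review", [0.50, 0.50]),
]
most_uncertain = select_by_entropy(toy_pool, k=2, highest=True)
```

No labels are needed for this ranking, which matches the "unsupervised" framing. The cost is one forward pass per candidate example, far less than a training run on the full pool.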

[31] PEACE 2.0: Grounded Explanations and Counter-Speech for Combating Hate Expressions

Greta Damo,Stéphane Petiot,Elena Cabrio,Serena Villata

Main category: cs.CL

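TL;DR: PEACE 2.0 is a new tool for analyzing and explaining hate speech and for generating evidence-grounded counter-speech, using retrieval-augmented generation (RAG) to make its explanations and responses more credible and effective.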

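Details Motivation: Hate speech detection methods are fairly mature, but automatically generating well-grounded, effective counter-speech remains an open challenge, and the characteristics of counter-speech itself have not been explored systematically. Method: Present the PEACE 2.0 tool, built on a retrieval-augmented generation (RAG) pipeline, with three functions: (1) grounding hate speech explanations in evidence and facts; (2) automatically generating evidence-grounded counter-speech; (3) analyzing the linguistic and structural characteristics of counter-speech. Result: PEACE 2.0 supports in-depth analysis and response generation for both explicit and implicit hate speech, improving the interpretability of explanations and the quality of counter-speech. Conclusion: RAG can support explainable analysis of hate speech and the generation of high-quality counter-speech, offering a new route toward healthier online conversation. Abstract: The increasing volume of hate speech on online platforms poses significant societal challenges. While the Natural Language Processing community has developed effective methods to automatically detect the presence of hate speech, responses to it, called counter-speech, are still an open challenge. We present PEACE 2.0, a novel tool that, besides analysing and explaining why a message is considered hateful or not, also generates a response to it. More specifically, PEACE 2.0 has three main new functionalities, all leveraging a Retrieval-Augmented Generation (RAG) pipeline: i) grounding HS explanations in evidence and facts, ii) automatically generating evidence-grounded counter-speech, and iii) exploring the characteristics of counter-speech replies. By integrating these capabilities, PEACE 2.0 enables in-depth analysis and response generation for both explicit and implicit hateful messages.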

[32] Auditing Reciprocal Sentiment Alignment: Inversion Risk, Dialect Representation and Intent Misalignment in Transformers

Nusrat Jahan Lia,Shubhashis Roy Dipta

Main category: cs.CL

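TL;DR: This paper studies cross-lingual sentiment alignment between Bengali and English and finds that current alignment paradigms suffer from severe sentiment misjudgment, asymmetric empathy, and a modern bias in low-resource settings. It argues for culturally sensitive, pluralistic alignment and proposes an "Affective Stability" metric for evaluating alignment quality.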

Details Motivation: The core of bidirectional alignment is ensuring that AI accurately understands human intent and that humans can trust AI behavior, but this loop breaks down badly across language barriers; for low-resource languages such as Bengali in particular, current alignment methods lack emotional fidelity, which threatens human-AI trust. Method: Benchmark four transformer models (including mDistilBERT and IndicBERT) to quantify cross-lingual sentiment alignment, focusing on the sentiment inversion rate, asymmetric empathy, and errors on formal (Sadhu) Bengali. Result: mDistilBERT shows a 28.7% sentiment inversion rate; "Asymmetric Empathy" is observed; IndicBERT's alignment error rises by 57% on formal (Sadhu) Bengali; the current compressed, universal alignment paradigm fails to preserve emotional fidelity. Conclusion: Equitable human-AI co-evolution requires pluralistic, culturally grounded alignment rooted in language and dialect diversity rather than a single compressed paradigm; alignment benchmarks should include "Affective Stability" metrics that specifically penalize polarity inversions in low-resource and dialectal contexts. Abstract: The core theme of bidirectional alignment is ensuring that AI systems accurately understand human intent and that humans can trust AI behavior. However, this loop fractures significantly across language barriers. Our research addresses Cross-Lingual Sentiment Misalignment between Bengali and English by benchmarking four transformer architectures. We reveal severe safety and representational failures in current alignment paradigms. We demonstrate that a compressed model (mDistilBERT) exhibits a 28.7% "Sentiment Inversion Rate," fundamentally misinterpreting positive user intent as negative (or vice versa). Furthermore, we identify systemic nuances affecting human-AI trust, including "Asymmetric Empathy" where some models systematically dampen and others amplify the affective weight of Bengali text relative to its English counterpart. Finally, we reveal a "Modern Bias" in the regional model (IndicBERT), which shows a 57% increase in alignment error when processing formal (Sadhu) Bengali. We argue that equitable human-AI co-evolution requires pluralistic, culturally grounded alignment that respects language and dialectal diversity over universal compression, which fails to preserve the emotional fidelity required for reciprocal human-AI trust. We recommend that alignment benchmarks incorporate "Affective Stability" metrics that explicitly penalize polarity inversions in low-resource and dialectal contexts.
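
A sentiment inversion rate can be read as the share of parallel sentence pairs whose predicted polarity flips between the two languages. The sketch below assumes signed sentiment scores in [-1, 1] for each side of a pair and counts sign disagreements; the abstract does not give the paper's exact definition, so this formulation and the function name are assumptions.

```python
def sentiment_inversion_rate(pairs):
    """Fraction of (english_score, bengali_score) pairs with opposite signs.

    Scores are assumed to be signed polarities in [-1, 1]; pairs where
    either score is exactly 0 are not counted as inversions.
    """
    flipped = sum(1 for en_score, bn_score in pairs if en_score * bn_score < 0)
    return flipped / len(pairs)


# Invented scores for four parallel sentences.
toy_pairs = [(0.8, -0.3), (0.5, 0.6), (-0.4, -0.2), (-0.7, 0.1)]
toy_rate = sentiment_inversion_rate(toy_pairs)
```

An "Affective Stability" style metric could build on the same counting by weighting each flip, for instance by the magnitude of the disagreement, though that extension is speculative and not described in the source.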

[33] Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian

Pietro Ferrazzi,Mattia Franzin,Alberto Lavelli,Bernardo Magnini

Main category: cs.CL

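TL;DR: This paper examines how well small LLMs (around one billion parameters) perform on 20 clinical NLP tasks, finds that a fine-tuned Qwen3-1.7B outperforms the much larger Qwen3-32B, and releases several Italian medical datasets and models.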

Details Motivation: Large language models perform well on medical NLP tasks, but their high computational cost limits deployment in real healthcare settings; this work tests whether small LLMs can meet resource-constrained requirements while keeping accuracy high. Method: Selects small models of around 1B parameters from the Llama-3, Gemma-3, and Qwen3 families and systematically evaluates adaptation strategies on 20 Italian clinical NLP tasks, including few-shot prompting and constraint decoding at inference time, and supervised fine-tuning and continual pre-training at training time. Result: Supervised fine-tuning works best; a fine-tuned Qwen3-1.7B scores on average 9.2 points higher than Qwen3-32B; the authors also release the publicly available Italian medical NLP datasets, a 126M-word emergency department corpus, and a 175M-word continual pre-training corpus. Conclusion: With appropriate adaptation (especially fine-tuning), small LLMs can match or even surpass larger models on many clinical NLP tasks, offering a viable path for deploying medical AI under resource constraints. Abstract: Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families (Llama-3, Gemma-3, and Qwen3) across 20 clinical NLP tasks spanning Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers strong lower-resource alternatives. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration based on Qwen3-1.7B achieving an average score +9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian Hospital, and 175M words from various sources that we used for continual pre-training.

[34] Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics

Baris Karacan,Barbara Di Eugenio,Patrick Thornton

Main category: cs.CL

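TL;DR: This paper introduces a new dataset of obstetrics clinical notes and systematically evaluates transformer-based supervised models and zero-shot large language models on clinical section segmentation, finding that zero-shot models adapt better across domains but require handling of hallucinations.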

Details Motivation: Existing clinical section segmentation methods are trained mainly on general medical corpora such as MIMIC-III and do not cover specific domains such as obstetrics; the potential of zero-shot large language models for this task has also not been fully explored. Method: 1) Curates a new de-identified, section-labeled obstetrics notes dataset; 2) systematically evaluates transformer-based supervised models on a MIMIC-III subset (in-domain) and on the new obstetrics dataset (out-of-domain); 3) conducts the first head-to-head comparison of supervised models and zero-shot large language models on medical section segmentation. Result: Supervised models perform strongly in-domain but drop substantially out-of-domain; zero-shot models show stronger cross-domain robustness once hallucinated section headers are corrected. Conclusion: Building domain-specific clinical resources is essential, and hallucination-corrected zero-shot segmentation offers a viable path for extending healthcare NLP beyond well-studied corpora. Abstract: Clinical free-text notes contain vital patient information. They are structured into labelled sections; recognizing these sections has been shown to support clinical decision-making and downstream NLP tasks. In this paper, we advance clinical section segmentation through three key contributions. First, we curate a new de-identified, section-labeled obstetrics notes dataset, to supplement the medical domains covered in public corpora such as MIMIC-III, on which most existing segmentation approaches are trained. Second, we systematically evaluate transformer-based supervised models for section segmentation on a curated subset of MIMIC-III (in-domain), and on the new obstetrics dataset (out-of-domain). Third, we conduct the first head-to-head comparison of supervised models for medical section segmentation with zero-shot large language models. Our results show that while supervised models perform strongly in-domain, their performance drops substantially out-of-domain. In contrast, zero-shot models demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected. These findings underscore the importance of developing domain-specific clinical resources and highlight zero-shot segmentation as a promising direction for applying healthcare NLP beyond well-studied corpora, as long as hallucinations are appropriately managed.

[35] Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems

Zhangqi Duan,Arnav Kankaria,Dhruv Kartik,Andrew Lan

Main category: cs.CL

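TL;DR: This paper proposes a framework that uses large language models (LLMs) to automatically assign fine-grained correctness labels to the knowledge components (KCs) in programming tasks. Combined with a temporal context-aware Code-KC mapping mechanism, it improves learning curve fit and predictive performance, and human evaluation shows substantial agreement with expert annotations.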

Details Motivation: Real-world programming datasets lack fine-grained KC-level correctness labels, and simply propagating problem-level correctness to every KC hides students' partial mastery and leads to poorly fitted learning curves. Method: Proposes an automated LLM-based framework for KC-level correctness labeling and introduces a temporal context-aware Code-KC mapping mechanism to align KCs with individual student code more precisely. Result: The resulting KC labels yield learning curves more consistent with cognitive theory (such as the power law of practice), improve predictive performance under models such as the Additive Factors Model, and show substantial agreement with expert annotations in human evaluation. Conclusion: Automated KC-level labeling with LLMs is feasible and effective, offering a new approach to fine-grained student modeling on open-ended programming tasks. Abstract: Fine-grained skill representations, commonly referred to as knowledge components (KCs), are fundamental to many approaches in student modeling and learning analytics. However, KC-level correctness labels are rarely available in real-world datasets, especially for open-ended programming tasks where solutions typically involve multiple KCs simultaneously. Simply propagating problem-level correctness to all associated KCs obscures partial mastery and often leads to poorly fitted learning curves. To address this challenge, we propose an automated framework that leverages large language models (LLMs) to label KC-level correctness directly from student-written code. Our method assesses whether each KC is correctly applied and further introduces a temporal context-aware Code-KC mapping mechanism to better align KCs with individual student code. We evaluate the resulting KC-level correctness labels in terms of learning curve fit and predictive performance using the power law of practice and the Additive Factors Model. Experimental results show that our framework leads to learning curves that are more consistent with cognitive theory and improves predictive performance, compared to baselines. Human evaluation further demonstrates substantial agreement between LLM and expert annotations.
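For readers unfamiliar with the Additive Factors Model named in the abstract, the following is a minimal sketch of its standard prediction rule: the log-odds of a correct step is the student's ability plus, for each KC involved, that KC's easiness and a learning rate multiplied by the number of prior practice opportunities. The parameter values and KC names below are made up for illustration and do not come from the paper.

```python
import math

def afm_probability(student_ability, kc_easiness, kc_learning_rate, prior_opportunities):
    """Probability of a correct step under the Additive Factors Model.

    prior_opportunities maps each KC involved in the step to the number of
    times the student has already practiced that KC.
    """
    logit = student_ability
    for kc, opportunities in prior_opportunities.items():
        logit += kc_easiness[kc] + kc_learning_rate[kc] * opportunities
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical parameters for two KCs (illustrative values only).
easiness = {"loops": -0.5, "conditionals": 0.2}
rate = {"loops": 0.3, "conditionals": 0.1}

p_first = afm_probability(0.0, easiness, rate, {"loops": 0, "conditionals": 0})
p_later = afm_probability(0.0, easiness, rate, {"loops": 4, "conditionals": 4})
print(p_first, p_later)  # the predicted probability rises with practice
```

Because each KC contributes its own term, a wrong KC-level label shifts the opportunity counts for that KC only, which is one reason fine-grained labels can change the fitted learning curves.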

[36] Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

Jyotin Goel,Souvik Maji,Pratik Mazumder

Main category: cs.CL

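TL;DR: This paper proposes an adaptive regularization training framework that adjusts update strength during fine-tuning according to safety risk, preserving model safety without sacrificing utility.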

Details Motivation: The safety of instruction-following language models tends to degrade after fine-tuning (especially adversarial fine-tuning), and existing defenses struggle to deliver both safety and utility. Method: Proposes a risk-aware adaptive regularization framework with two risk estimators: a judge-based Safety Critic that assigns batch-level harm scores, and a lightweight classifier on intermediate model activations that predicts harmful intent. The risk signal constrains higher-risk updates to stay close to a safe reference policy, while lower-risk updates proceed with standard training. Result: Harmful-intent signals can be predicted from pre-generation hidden activations; judge scores provide high-recall safety guidance; across multiple models and attack scenarios the method lowers attack success rate compared to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. Conclusion: The work offers a principled and practical mechanism for keeping models aligned throughout fine-tuning, maintaining safety without sacrificing utility. Abstract: Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effective high-recall safety guidance. Across multiple model families and attack scenarios, adaptive regularization with either risk estimation approach consistently lowers attack success rate compared to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. This work demonstrates a principled mechanism for maintaining safety without sacrificing utility.
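One way to picture risk-adaptive regularization is a loss whose pull toward the reference policy scales with the estimated risk. The sketch below assumes a KL penalty and a linear risk-to-weight mapping; both are illustrative choices, since the abstract does not state the exact regularizer, and the function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def adaptive_loss(task_loss, policy_probs, reference_probs, risk, max_weight=1.0):
    """Task loss plus a risk-scaled penalty for drifting from the reference policy.

    risk is assumed to lie in [0, 1], e.g. from a judge score or an activation
    probe; the linear risk-to-weight mapping here is an assumption.
    """
    risk = min(max(risk, 0.0), 1.0)
    weight = max_weight * risk
    return task_loss + weight * kl_divergence(policy_probs, reference_probs)

policy = [0.5, 0.5]
reference = [0.9, 0.1]
low_risk = adaptive_loss(1.0, policy, reference, risk=0.0)   # standard training
high_risk = adaptive_loss(1.0, policy, reference, risk=1.0)  # constrained update
print(low_risk, high_risk)
```

With zero risk the objective reduces to the ordinary task loss, which matches the abstract's description of lower-risk updates proceeding with standard training.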

[37] Modeling Distinct Human Interaction in Web Agents

Faria Huq,Zora Zhiruo Wang,Zhanqiu Guo,Venu Arvind Arangarajan,Tianyue Ou,Frank Xu,Shuyan Zhou,Graham Neubig,Jeffrey P. Bigham

Main category: cs.CL

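TL;DR: This paper proposes modeling human intervention to support collaborative web task execution. It builds CowCorpus, a dataset of 400 real-user trajectories, identifies four patterns of human-agent interaction, and trains language models to predict when users will intervene, improving intervention prediction accuracy by 61.4-63.4% and user-rated agent usefulness by 26.5%.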

Details Motivation: Current autonomous web agents lack a systematic understanding of when and why humans intervene, so they either proceed past critical decision points or request unnecessary confirmation; modeling human intervention is needed for genuine collaboration. Method: Collects CowCorpus (400 real-user web navigation trajectories with over 4,200 interleaved human and agent actions), identifies four interaction patterns (hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover), and trains language models on this basis to predict when users will intervene. Result: Intervention prediction accuracy improves by 61.4-63.4% over base language models; when deployed in live web navigation agents, user-rated agent usefulness increases by 26.5%. Conclusion: Structured modeling of human intervention makes web agents markedly more adaptive and collaborative. Abstract: Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation. In this work, we introduce the task of modeling human intervention to support collaborative web task execution. We collect CowCorpus, a dataset of 400 real-user web navigation trajectories containing over 4,200 interleaved human and agent actions. We identify four distinct patterns of user interaction with agents: hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover. Leveraging these insights, we train language models (LMs) to anticipate when users are likely to intervene based on their interaction styles, yielding a 61.4-63.4% improvement in intervention prediction accuracy over base LMs. Finally, we deploy these intervention-aware models in live web navigation agents and evaluate them in a user study, finding a 26.5% increase in user-rated agent usefulness. Together, our results show structured modeling of human intervention leads to more adaptive, collaborative agents.

[38] The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines?

Jayadev Billa

Main category: cs.CL

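TL;DR: This paper shows experimentally that, on tasks solvable from a transcript, current speech LLMs behave like a cascade of Whisper transcription followed by a text LLM rather than understanding speech end to end. Models such as Ultravox are effectively expensive and less noise-robust cascades, whereas Qwen2-Audio shows a genuine architectural difference.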

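Details Motivation: To determine whether current speech LLMs achieve genuine end-to-end speech understanding or merely perform implicit ASR (automatic speech recognition), that is, whether their internal mechanism is equivalent to an "ASR + text LLM" cascade. Method: Uses a matched-backbone controlled design to evaluate four speech LLMs on six tasks; applies the logit lens to analyze hidden-layer representations and LEACE concept erasure to test whether text representations are causally necessary, and compares performance under clean and noisy conditions. Result: Ultravox is statistically indistinguishable from its matched cascade (κ = 0.93); the logit lens shows literal text emerging in hidden states; LEACE erasure collapses accuracy to near zero, showing that text representations are causally necessary; Qwen2-Audio diverges markedly from cascade behavior; at 0 dB noise, the speech LLMs' clean-condition advantages over cascades reverse by up to 7.6%, indicating worse robustness. Conclusion: Most current speech LLMs are effectively expensive implicit ASR cascades with worse noise robustness, and their "end-to-end" advantage is overstated; the Qwen2-Audio result suggests that architecture influences whether a model moves beyond cascade behavior. Abstract: Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\to$LLM cascades. We show this through matched-backbone testing across four speech LLMs and six tasks, controlling for the LLM backbone for the first time. Ultravox is statistically indistinguishable from its matched cascade ($\kappa = 0.93$); logit lens reveals literal text emerging in hidden states; LEACE concept erasure confirms text representations are causally necessary in both architectures tested, collapsing accuracy to near-zero. Qwen2-Audio genuinely diverges, revealing cascade equivalence is architecture-dependent, not universal. For most deployed use cases, current speech LLMs are expensive cascades, and under noise, they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0 dB.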
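The agreement statistic κ quoted above is presumably Cohen's kappa, which corrects raw agreement between two systems for the agreement expected by chance. The sketch below computes it for two lists of per-example predictions; the label lists are invented for illustration and are not data from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both systems emit the same label independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both systems always emit the same single label
    return (observed - expected) / (1.0 - expected)

# Hypothetical answers from a speech LLM and its matched cascade.
speech_llm = ["A", "B", "A", "C", "B", "A"]
cascade = ["A", "B", "A", "C", "A", "A"]
kappa = cohens_kappa(speech_llm, cascade)
print(round(kappa, 3))
```

A value near 1 means the two systems make nearly the same decisions example by example, which is a stronger claim than having similar overall accuracy.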

[39] Unmasking the Factual-Conceptual Gap in Persian Language Models

Alireza Sakhaeirad,Ali Ma'manpoosh,Arshia Hemmat

Main category: cs.CL

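TL;DR: This paper introduces DivanBench, a diagnostic benchmark focused on Persian cultural superstitions and customs, and shows that current Persian LLMs have serious deficits in reasoning about implicit social norms: acquiescence bias, amplification of that bias by continued pretraining, and a marked gap between factual retrieval and situational application.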

Details Motivation: Existing Persian NLP benchmarks have expanded into pragmatics and politeness but fail to separate memorized cultural facts from the ability to reason about implicit social norms. Method: Builds DivanBench, a benchmark of 315 questions covering three task types (factual retrieval, paired scenario verification, and situational reasoning), and evaluates seven Persian LLMs. Result: Three problems emerge: (1) most models show severe acquiescence bias, identifying appropriate behaviors but failing to reject clear violations; (2) continuous Persian pretraining amplifies this bias and weakens the ability to discern contradictions; (3) all models show a 21% performance gap between factual retrieval and situational application. Conclusion: Cultural competence cannot be obtained simply by scaling monolingual data; current models mimic cultural patterns without internalizing the underlying schemas. Abstract: While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms. We introduce DivanBench, a diagnostic benchmark focused on superstitions and customs, which are arbitrary, context-dependent rules that resist simple logical deduction. Through 315 questions across three task types (factual retrieval, paired scenario verification, and situational reasoning), we evaluate seven Persian LLMs and reveal three critical failures: most models exhibit severe acquiescence bias, correctly identifying appropriate behaviors but failing to reject clear violations; continuous Persian pretraining amplifies this bias rather than improving reasoning, often degrading the model's ability to discern contradictions; and all models show a 21% performance gap between retrieving factual knowledge and applying it in scenarios. These findings demonstrate that cultural competence requires more than scaling monolingual data, as current models learn to mimic cultural patterns without internalizing the underlying schemas.

[40] Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking

Iskar Deng,Nathalia Xu,Shane Steinert-Threlkeld

Main category: cs.CL

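TL;DR: By training GPT-2 models on 18 synthetic corpora implementing different differential argument marking (DAM) systems, this paper finds that language models reproduce the human-language preference for natural markedness direction (overtly marking semantically atypical arguments) but not the strong object preference (DAM more often marks objects than subjects), suggesting that different typological tendencies may arise from different mechanisms.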

Details Motivation: To test whether language models trained on synthetic corpora reproduce the cross-linguistic regularities of differential argument marking (DAM), in particular its two key typological dimensions: naturalness of markedness direction and argument-role (subject/object) preference. Method: Uses a controlled synthetic learning method, training GPT-2 models on 18 synthetic corpora implementing distinct DAM systems and evaluating generalization with minimal pairs. Result: Models reliably reproduce the human-language preference for natural markedness direction (favoring marking of semantically atypical arguments) but do not reproduce the strong object-marking preference. Conclusion: The two typological tendencies of DAM may stem from different cognitive or statistical mechanisms, and language models can serve as computational tools for probing the sources of linguistic universals. Abstract: Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order. In this paper, we extend this paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. Using a controlled synthetic learning method, we train GPT-2 models on 18 corpora implementing distinct DAM systems and evaluate their generalization using minimal pairs. Our results reveal a dissociation between two typological dimensions of DAM. Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. In contrast, models do not reproduce the strong object preference in human languages, in which overt marking in DAM more often targets objects rather than subjects. These findings suggest that different typological tendencies may arise from distinct underlying sources.

[41] What Language is This? Ask Your Tokenizer

Clara Meister,Ahmetcan Yavuz,Pietro Lesci,Tiago Pimentel

Main category: cs.CL

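TL;DR: This paper proposes UniLID, a language identification method based on the UnigramLM tokenization algorithm. It substantially improves performance in low-resource and closely related language settings, exceeding 70% accuracy with as few as five samples per language, supports incremental addition of languages, and integrates directly into existing LLM tokenization pipelines.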

Details Motivation: Existing language identification systems are near-perfect on high-resource languages but remain brittle for low-resource and closely related languages. Method: Builds UniLID on the UnigramLM tokenization algorithm: it learns language-conditional unigram distributions over a shared vocabulary but treats segmentation as a language-specific phenomenon; new languages can be added incrementally without retraining. Result: Performance on standard benchmarks is competitive with fastText, GlotLID, and CLD3; sample efficiency improves substantially in low-resource settings (over 70% accuracy with five samples per language); fine-grained dialect identification improves markedly. Conclusion: UniLID is a simple, efficient, and extensible language identification method that balances performance, data and compute efficiency, and engineering practicality for modern multilingual NLP systems. Abstract: Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models. Despite near-perfect performance on high-resource languages, existing systems remain brittle in low-resource and closely related language settings. We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm, leveraging its probabilistic framing, parameter estimation technique and inference strategy. In short, we learn language-conditional unigram distributions over a shared tokenizer vocabulary but treat segmentation as a language-specific phenomenon. Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models, and can naturally be integrated into existing language model tokenization pipelines. Empirical evaluations against widely used baselines, including fastText, GlotLID, and CLD3, show that UniLID achieves competitive performance on standard benchmarks, substantially improves sample efficiency in low-resource settings (surpassing 70% accuracy with as few as five labeled samples per language), and delivers large gains on fine-grained dialect identification.
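The idea of language-conditional unigram distributions over a shared vocabulary can be sketched as follows: each language scores a text by the log-probability of its own best segmentation (found with Viterbi search), and the highest-scoring language wins. This is an illustrative reading of the abstract rather than the authors' implementation; the vocabulary, probabilities, and function names are invented, and the real method may score segmentations differently.

```python
import math

def best_segmentation_logprob(text, token_logprobs, max_token_len=8):
    """Highest total log-probability over all splits of text into vocabulary tokens."""
    best = [-math.inf] * (len(text) + 1)
    best[0] = 0.0
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_token_len), end):
            piece = text[start:end]
            if piece in token_logprobs and best[start] > -math.inf:
                best[end] = max(best[end], best[start] + token_logprobs[piece])
    return best[len(text)]

def identify_language(text, models):
    """Pick the language whose unigram model gives the best-scoring segmentation."""
    scores = {lang: best_segmentation_logprob(text, lp) for lang, lp in models.items()}
    return max(scores, key=scores.get)

def to_logprobs(probs):
    return {token: math.log(p) for token, p in probs.items()}

# Toy shared vocabulary with per-language probabilities (made-up numbers).
MODELS = {
    "en": to_logprobs({"the": 0.40, "der": 0.02, "t": 0.15, "h": 0.13,
                       "e": 0.15, "d": 0.10, "r": 0.05}),
    "de": to_logprobs({"the": 0.02, "der": 0.40, "t": 0.10, "h": 0.05,
                       "e": 0.15, "d": 0.15, "r": 0.13}),
}

print(identify_language("the", MODELS), identify_language("der", MODELS))
```

Under this framing, adding a language only requires estimating one more distribution over the same vocabulary, which is consistent with the abstract's claim that existing models need no retraining.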

[42] Sink-Aware Pruning for Diffusion Language Models

Aidar Myrzakhan,Tianyi Li,Bowei Guo,Shengkun Tang,Zhiqiang Shen

Main category: cs.CL

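TL;DR: This paper proposes Sink-Aware Pruning, a new pruning method for diffusion language models (DLMs). It shows that attention sinks in DLMs vary strongly over time and therefore should not be preserved by default as in autoregressive models; the method achieves a better quality-efficiency trade-off without retraining.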

Details Motivation: Diffusion language models (DLMs) have high inference cost because of iterative denoising and need efficient pruning; existing pruning strategies mostly inherit the heuristic of preserving attention sinks from autoregressive (AR) LLMs, but the authors find that this assumption does not hold for DLMs. Method: Based on an empirical analysis of how attention-sink positions vary over timesteps in DLMs, proposes Sink-Aware Pruning, which automatically identifies and prunes unstable sink tokens instead of preserving sinks by default as in AR models. Result: Without retraining, the method outperforms strong prior pruning baselines under matched compute and achieves a better quality-efficiency trade-off. Conclusion: Attention sinks in DLMs are often transient and less structurally essential than in AR models, so unstable sinks should be pruned rather than preserved; Sink-Aware Pruning offers a new approach to efficient DLM inference. Abstract: Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics, largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose Sink-Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA-Lab/Sink-Aware-Pruning.
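The stability measurement described in the abstract can be approximated with a simple statistic: take the token position that receives the most attention at each denoising step and look at how much that position varies. The sketch below does this on invented attention totals; the paper's exact metric and the pruning rule itself are not reproduced here, and the function names are hypothetical.

```python
from statistics import pvariance

def dominant_sink_positions(attention_received):
    """attention_received[t][j] is the attention mass token j receives at step t."""
    return [max(range(len(row)), key=row.__getitem__) for row in attention_received]

def sink_position_variance(attention_received):
    """Population variance of the dominant sink index across timesteps."""
    return pvariance(dominant_sink_positions(attention_received))

# Toy trajectories: a stable first-token sink versus a sink that moves around.
stable_sink = [[0.7, 0.1, 0.1, 0.1]] * 3
moving_sink = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.6, 0.1, 0.2],
]
print(sink_position_variance(stable_sink), sink_position_variance(moving_sink))
```

A variance of zero corresponds to the stable anchor behavior the abstract attributes to AR models, while a large value flags the transient sinks that the method treats as candidates for pruning.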

cs.CV [Back]

[43] Three-dimensional Damage Visualization of Civil Structures via Gaussian Splatting-enabled Digital Twins

Shuo Wang,Shuo Wang,Xin Nie,Yasutaka Narazaki,Thomas Matiki,Billie F. Spencer

Main category: cs.CV

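TL;DR: This paper proposes a Gaussian Splatting (GS)-based digital twin method for 3D damage visualization of civil infrastructure; it is more efficient than NeRF and supports multi-scale reconstruction and updates as damage evolves over time.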

Details Motivation: Traditional 2D image-based damage identification cannot meet the need of modern infrastructure inspection for precise 3D damage visualization; existing methods such as NeRF fall short in efficiency, so a better 3D representation is needed. Method: Uses Gaussian Splatting (GS) for 3D reconstruction and maps 2D damage segmentation results into 3D space; designs a multi-scale reconstruction strategy to balance efficiency and detail; supports updating the digital twin from new observations. Result: On an open-source synthetic post-earthquake dataset, the method reduces segmentation errors, improves 3D damage visualization, and yields an efficient, updatable digital twin. Conclusion: GS is better suited than NeRF to efficient, high-fidelity 3D damage visualization in infrastructure digital twins, and the proposed method offers a promising solution for infrastructure health monitoring. Abstract: Recent advancements in civil infrastructure inspections underscore the need for precise three-dimensional (3D) damage visualization on digital twins, transcending traditional 2D image-based damage identifications. Compared to conventional photogrammetric 3D reconstruction techniques, modern approaches such as Neural Radiance Field (NeRF) and Gaussian Splatting (GS) excel in scene representation, rendering quality, and handling featureless regions. Among them, GS stands out for its efficiency, leveraging discrete anisotropic 3D Gaussians to represent radiance fields, unlike NeRF's continuous implicit model. This study introduces a GS-enabled digital twin method tailored for effective 3D damage visualization. The method's key contributions include: 1) utilizing GS-based 3D reconstruction to visualize 2D damage segmentation results while reducing segmentation errors; 2) developing a multi-scale reconstruction strategy to balance efficiency and damage detail; 3) enabling digital twin updates as damage evolves over time. Demonstrated on an open-source synthetic dataset for post-earthquake inspections, the proposed approach offers a promising solution for comprehensive 3D damage visualization in civil infrastructure digital twins.

[44] Analytic Score Optimization for Multi Dimension Video Quality Assessment

Boda Lin,Yongjie Zhu,Wenyu Qin,Meng Wang,Pengfei Wan

Main category: cs.CV

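TL;DR: This paper presents a multi-dimensional approach to video quality assessment (VQA): it builds UltraVQA, a large-scale multi-dimensional dataset, and designs Analytic Score Optimization (ASO), a theoretically grounded method that improves discrete quality score prediction and alignment with human preferences.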

Details Motivation: Traditional VQA relies on a single MOS score, which cannot capture the diversity of video quality; richer, multi-dimensional, interpretable quality annotations are needed for fine-grained evaluation and model optimization. Method: Builds UltraVQA, a large-scale UGC video dataset covering five quality dimensions with fine-grained sub-attributes and GPT-generated rationales; proposes Analytic Score Optimization (ASO), a post-training objective that frames quality assessment as a regularized decision-making process and derives a closed-form solution capturing the ordinal nature of human ratings. Result: ASO outperforms most baselines, including closed-source APIs and open-source models, and reduces the mean absolute error (MAE) of quality prediction. Conclusion: Multi-dimensional, interpretable quality annotations and theoretically grounded optimization methods such as ASO are key to moving VQA closer to human perception. Abstract: Video Quality Assessment (VQA) is evolving beyond single-number mean opinion score toward richer, multi-faceted evaluations of video content. In this paper, we present a large-scale multi-dimensional VQA dataset UltraVQA that encompasses diverse User-Generated Content (UGC) annotated across five key quality dimensions: Motion Quality, Motion Amplitude, Aesthetic Quality, Content Quality, and Clarity Quality. Each video in our dataset is scored by over 3 human raters on these dimensions, with fine-grained sub-attribute labels, and accompanied by an explanatory rationale generated by GPT based on the collective human judgments. To better leverage these rich annotations and improve discrete quality score assessment, we introduce Analytic Score Optimization (ASO), a theoretically grounded post-training objective derived for multi-dimensional VQA. By reframing quality assessment as a regularized decision-making process, we obtain a closed-form solution that naturally captures the ordinal nature of human ratings, ensuring alignment with human ranking preferences. In experiments, our method outperforms most baselines including closed-source APIs and open-source models, while also reducing mean absolute error (MAE) in quality prediction. Our work highlights the importance of multi-dimensional, interpretable annotations and reinforcement-based alignment in advancing video quality assessment.

[45] DODO: Discrete OCR Diffusion Models

Sean Man,Roy Ganz,Roi Ronen,Shahar Tsiper,Shai Mazor,Niv Nayman

Main category: cs.CV

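TL;DR: This paper proposes DODO, the first model to apply block discrete diffusion to OCR. Generating text block by block mitigates the synchronization errors of global diffusion, achieving up to 3x faster inference while keeping accuracy near the state of the art.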

Details Motivation: Existing vision-language models based on autoregressive decoding are computationally expensive and slow for OCR; OCR is a highly deterministic task that in theory suits parallel decoding, but existing masked diffusion models are too structurally unstable for OCR's strict exact-match requirements. Method: Proposes DODO, which uses block discrete diffusion to decompose text generation into blocks, avoiding the synchronization errors of global diffusion. Result: Achieves near state-of-the-art recognition accuracy on OCR with up to 3x faster inference than autoregressive baselines. Conclusion: Block discrete diffusion is an effective parallel generation approach for deterministic vision-language tasks such as OCR, and DODO demonstrates its advantages in both accuracy and efficiency. Abstract: Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLM) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents as it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; they introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference compared to autoregressive baselines.

[46] StereoAdapter-2: Globally Structure-Consistent Underwater Stereo Depth Estimation

Zeyu Ren,Xiang Li,Yiran Wang,Zeyu Zhang,Hao Tang

Main category: cs.CV

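TL;DR: This paper proposes StereoAdapter-2, which replaces the conventional ConvGRU with a ConvSS2D operator based on selective state space models to make long-range disparity propagation more efficient in underwater stereo depth estimation. It also builds UW-StereoDepth-80K, a large-scale synthetic dataset, and with dynamic LoRA adaptation reaches state-of-the-art zero-shot performance on underwater benchmarks.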

Details Motivation: Underwater stereo depth estimation suffers from severe domain shift caused by wavelength-dependent light attenuation, scattering, and refraction; existing methods that combine monocular foundation models with GRU-based iterative refinement are limited by the sequential gating and local convolutional kernels of GRUs and handle large-disparity and textureless regions poorly. Method: Proposes the StereoAdapter-2 framework: 1) replaces ConvGRU with a ConvSS2D operator using a four-directional scanning strategy that aligns with epipolar geometry and captures vertical structural consistency, enabling long-range propagation in a single update step at linear complexity; 2) builds the UW-StereoDepth-80K synthetic dataset by combining semantic-aware style transfer and geometry-consistent novel view synthesis; 3) inherits the dynamic LoRA adaptation of StereoAdapter. Result: Zero-shot performance improves by 17% on TartanAir-UW and 7.2% on SQUID, with robustness validated on the real-world BlueROV2 platform. Conclusion: Combining the ConvSS2D operator with a high-quality synthetic dataset improves the zero-shot generalization and computational efficiency of underwater stereo matching, giving underwater robots more reliable and efficient depth estimation. Abstract: Stereo depth estimation is fundamental to underwater robotic perception, yet suffers from severe domain shifts caused by wavelength-dependent light attenuation, scattering, and refraction. Recent approaches leverage monocular foundation models with GRU-based iterative refinement for underwater adaptation; however, the sequential gating and local convolutional kernels in GRUs necessitate multiple iterations for long-range disparity propagation, limiting performance in large-disparity and textureless underwater regions. In this paper, we propose StereoAdapter-2, which replaces the conventional ConvGRU updater with a novel ConvSS2D operator based on selective state space models. The proposed operator employs a four-directional scanning strategy that naturally aligns with epipolar geometry while capturing vertical structural consistency, enabling efficient long-range spatial propagation within a single update step at linear computational complexity. Furthermore, we construct UW-StereoDepth-80K, a large-scale synthetic underwater stereo dataset featuring diverse baselines, attenuation coefficients, and scattering parameters through a two-stage generative pipeline combining semantic-aware style transfer and geometry-consistent novel view synthesis. Combined with dynamic LoRA adaptation inherited from StereoAdapter, our framework achieves state-of-the-art zero-shot performance on underwater benchmarks with a 17% improvement on TartanAir-UW and a 7.2% improvement on SQUID, and real-world validation on the BlueROV2 platform demonstrates the robustness of our approach. Code: https://github.com/AIGeeksGroup/StereoAdapter-2. Website: https://aigeeksgroup.github.io/StereoAdapter-2.

[47] SemCovNet: Towards Fair and Semantic Coverage-Aware Learning for Underrepresented Visual Concepts

Sakib Ahammed,Xia Cui,Xinqi Fan,Wenqi Lu,Moi Hoon Yap

Main category: cs.CV

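TL;DR: This paper proposes the Semantic Coverage-Aware Network (SemCovNet) to address Semantic Coverage Imbalance (SCI) in vision models, improving semantic fairness and model reliability through a semantic descriptor map, descriptor attention modulation, and a descriptor-visual alignment loss.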

Details Motivation: Existing vision datasets exhibit Semantic Coverage Imbalance (SCI), an overlooked bias arising from long-tailed semantic representations that affects how models learn and reason about rare yet meaningful semantics. Method: Proposes SemCovNet, comprising a Semantic Descriptor Map (SDM), a Descriptor Attention Modulation (DAM) module, and a Descriptor-Visual Alignment (DVA) loss, and introduces a Coverage Disparity Index (CDI) to quantify semantic fairness. Result: Experiments across multiple datasets show that SemCovNet substantially reduces CDI and improves model reliability and semantic fairness. Conclusion: SCI is a measurable and correctable bias, and this work lays a foundation for advancing semantic fairness and interpretable vision learning. Abstract: Modern vision models increasingly rely on rich semantic representations that extend beyond class labels to include descriptive concepts and contextual attributes. However, existing datasets exhibit Semantic Coverage Imbalance (SCI), a previously overlooked bias arising from the long-tailed semantic representations. Unlike class imbalance, SCI occurs at the semantic level, affecting how models learn and reason about rare yet meaningful semantics. To mitigate SCI, we propose Semantic Coverage-Aware Network (SemCovNet), a novel model that explicitly learns to correct semantic coverage disparities. SemCovNet integrates a Semantic Descriptor Map (SDM) for learning semantic representations, a Descriptor Attention Modulation (DAM) module that dynamically weights visual and concept features, and a Descriptor-Visual Alignment (DVA) loss that aligns visual features with descriptor semantics. We quantify semantic fairness using a Coverage Disparity Index (CDI), which measures the alignment between coverage and error. Extensive experiments across multiple datasets demonstrate that SemCovNet enhances model reliability and substantially reduces CDI, achieving fairer and more equitable performance. This work establishes SCI as a measurable and correctable bias, providing a foundation for advancing semantic fairness and interpretable vision learning.

[48] Xray-Visual Models: Scaling Vision models on Industry Scale Data

Shlok Mishra,Tsung-Yu Lin,Linda Wang,Hongli Xu,Yimin Liu,Michael Hsu,Chaitanya Ahuja,Hao Yuan,Jianpeng Cheng,Hong-You Chen,Haoyuan Xu,Chao Li,Abhijeet Awasthi,Jihye Moon,Don Husa,Michael Ge,Sumedha Singla,Arkabandhu Chowdhury,Phong Dingh,Satya Narayan Shukla,Yonghuan Yang,David Jacobs,Qi Guo,Jun Xiao,Xiangjun Fan,Aashu Singh

Main category: cs.CV

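TL;DR: Xray-Visual is a unified vision model trained on large-scale social media data that combines image and video understanding. Using a three-stage training strategy and an efficient ViT architecture, it reaches state-of-the-art performance on multiple benchmarks and shows strong robustness and cross-modal retrieval ability.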

Details Motivation: To address the difficulty of training vision models on large-scale, multi-source, noisy social media data, and the limited efficiency and generalization of joint image-video modeling. Method: Proposes a three-stage training pipeline (self-supervised MAE, semi-supervised hashtag classification, and CLIP-style contrastive learning) on a Vision Transformer with efficient token reorganization (EViT), and adopts LLM2CLIP-style text encoders; uses over 15 billion image-text pairs and 10 billion video-hashtag pairs with data balancing and noise suppression. Result: Achieves state-of-the-art results on ImageNet, Kinetics, HMDB51, MSCOCO, and other benchmarks; is robust to domain shift and adversarial perturbations; LLM2CLIP significantly improves cross-modal retrieval and generalization in real-world settings. Conclusion: Xray-Visual sets new benchmarks for scalable multimodal vision models, balancing accuracy, efficiency, and robustness for industry-scale visual understanding. Abstract: We present Xray-Visual, a unified vision model architecture for large-scale image and video understanding trained on industry-scale social media data. Our model leverages over 15 billion curated image-text pairs and 10 billion video-hashtag pairs from Facebook and Instagram, employing robust data curation pipelines that incorporate balancing and noise suppression strategies to maximize semantic diversity while minimizing label noise. We introduce a three-stage training pipeline that combines self-supervised MAE, semi-supervised hashtag classification, and CLIP-style contrastive learning to jointly optimize image and video modalities. Our architecture builds on a Vision Transformer backbone enhanced with efficient token reorganization (EViT) for improved computational efficiency. Extensive experiments demonstrate that Xray-Visual achieves state-of-the-art performance across diverse benchmarks, including ImageNet for image classification, Kinetics and HMDB51 for video understanding, and MSCOCO for cross-modal retrieval. The model exhibits strong robustness to domain shift and adversarial perturbations. We further demonstrate that integrating large language models as text encoders (LLM2CLIP) significantly enhances retrieval performance and generalization capabilities, particularly in real-world environments. Xray-Visual establishes new benchmarks for scalable, multimodal vision models, while maintaining superior accuracy and computational efficiency.

[49] HS-3D-NeRF: 3D Surface and Hyperspectral Reconstruction From Stationary Hyperspectral Images Using Multi-Channel NeRFs

Kibon Ku,Talukder Z. Jubery,Adarsh Krishnamurthy,Baskar Ganapathysubramanian

Main category: cs.CV

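TL;DR: This paper proposes HSI-SC-NeRF, a stationary-camera multi-channel neural radiance field framework for high-throughput, high-fidelity hyperspectral 3D reconstruction of agricultural produce, aimed at postharvest inspection.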

Details Motivation: Conventional approaches to fusing hyperspectral imaging with 3D reconstruction need complex hardware that is hard to fit into automated phenotyping platforms, and existing NeRF methods rely on a moving camera, which limits throughput and reproducibility in indoor agricultural settings. Method: The acquisition system pairs a stationary camera with a rotating sample inside a Teflon diffuse-illumination chamber. Object poses are estimated from ArUco markers and mapped into the camera frame through simulated pose transformations. A multi-channel NeRF jointly optimizes reconstruction over all spectral bands with a composite spectral loss and a two-stage training schedule (geometric initialization, then radiometric refinement). Result: Experiments on three produce samples show high spatial reconstruction accuracy and strong spectral fidelity across the visible and near-infrared range. Conclusion: HSI-SC-NeRF addresses the difficulty of fusing hyperspectral and 3D data at scale in automated agricultural settings and has the potential to be integrated into practical agricultural workflows. Abstract: Advances in hyperspectral imaging (HSI) and 3D reconstruction have enabled accurate, high-throughput characterization of agricultural produce quality and plant phenotypes, both essential for advancing agricultural sustainability and breeding programs. HSI captures detailed biochemical features of produce, while 3D geometric data substantially improves morphological analysis. However, integrating these two modalities at scale remains challenging, as conventional approaches involve complex hardware setups incompatible with automated phenotyping systems. Recent advances in neural radiance fields (NeRF) offer computationally efficient 3D reconstruction but typically require moving-camera setups, limiting throughput and reproducibility in standard indoor agricultural environments. To address these challenges, we introduce HSI-SC-NeRF, a stationary-camera multi-channel NeRF framework for high-throughput hyperspectral 3D reconstruction targeting postharvest inspection of agricultural produce. Multi-view hyperspectral data is captured using a stationary camera while the object rotates within a custom-built Teflon imaging chamber providing diffuse, uniform illumination. Object poses are estimated via ArUco calibration markers and transformed to the camera frame of reference through simulated pose transformations, enabling standard NeRF training on stationary-camera data. A multi-channel NeRF formulation optimizes reconstruction across all hyperspectral bands jointly using a composite spectral loss, supported by a two-stage training protocol that decouples geometric initialization from radiometric refinement. Experiments on three agricultural produce samples demonstrate high spatial reconstruction accuracy and strong spectral fidelity across the visible and near-infrared spectrum, confirming the suitability of HSI-SC-NeRF for integration into automated agricultural workflows.
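
The idea of treating a rotating object under a fixed camera as a fixed object under an orbiting virtual camera can be sketched as follows. This is only an illustration, not the authors' pipeline: it assumes the turntable axis coincides with the camera's y axis, that the marker-based pose is already available as a rotation `r_obj` and translation `t_obj` of the object in the camera frame, and the helper names are hypothetical.

```python
import math

def rot_y(theta):
    """3x3 rotation about the y axis (assumed turntable axis)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def virtual_camera_pose(r_obj, t_obj):
    """Invert an object-in-camera pose (R, t) into a camera-in-object pose (R^T, -R^T t)."""
    r_cam = transpose(r_obj)
    t_cam = [-x for x in matvec(r_cam, t_obj)]
    return r_cam, t_cam

# Object placed 0.5 units in front of the camera, rotated in 90-degree steps.
t_obj = [0.0, 0.0, 0.5]
centers = [virtual_camera_pose(rot_y(k * math.pi / 2), t_obj)[1] for k in range(4)]
radii = [math.sqrt(sum(x * x for x in c)) for c in centers]
```

Under these assumptions the virtual camera centers lie on a circle around the object, which is the multi-view layout a standard NeRF trainer expects.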

[50] DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers

Dahye Kim,Deepti Ghadiyaram,Raghudeep Gadde

Main category: cs.CV

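TL;DR: This paper proposes dynamic tokenization, a strategy that adapts patch size to the denoising timestep and content complexity during generation, improving the inference efficiency of Diffusion Transformers (DiTs) while preserving generation quality.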

Details Motivation: DiTs deliver excellent image and video generation but are computationally expensive, largely because a fixed-size tokenization cannot adapt to the differing needs for global structure and local detail at different denoising stages. Method: A dynamic tokenization method adjusts patch size at inference time according to the denoising timestep and content complexity: larger patches model global structure in early steps, and smaller patches refine local detail in later steps. Result: Speedups of up to 3.52× on FLUX-1.Dev and 3.2× on Wan 2.1, without loss of generation quality or prompt adherence. Conclusion: Dynamic tokenization is an efficient, plug-and-play inference-time optimization that substantially cuts the compute cost of DiTs while maintaining high-quality generation. Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content's complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps only require coarser patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation and substantially reduces cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to $3.52\times$ and $3.2\times$ speedup on FLUX-1.Dev and Wan 2.1, respectively, without compromising the generation quality and prompt adherence.
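
A coarse-to-fine patch schedule of the kind described above might look like the following sketch. The thresholds, the candidate patch sizes, and the `complexity` score in [0, 1] are all hypothetical choices made for illustration; the abstract does not specify the actual scheduling rule.

```python
def choose_patch_size(step, num_steps, complexity, sizes=(8, 4, 2)):
    """Pick a coarse patch early and a finer one later or for complex content.

    `complexity` is an assumed content score in [0, 1]; thresholds are illustrative.
    """
    progress = step / max(num_steps - 1, 1)
    score = max(progress, complexity)
    if score < 1 / 3:
        return sizes[0]
    if score < 2 / 3:
        return sizes[1]
    return sizes[2]

def num_tokens(height, width, patch):
    """Token count for a height x width latent grid split into patch x patch tiles."""
    return (height // patch) * (width // patch)

# Token counts over a 30-step run on a 64x64 latent with low-complexity content.
schedule = [choose_patch_size(s, 30, complexity=0.0) for s in range(30)]
tokens_per_step = [num_tokens(64, 64, p) for p in schedule]
```

Because attention cost grows faster than linearly in the token count, spending the early steps at a coarse patch size is where most of the saving would come from in such a scheme.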

[51] Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling

Divyam Madaan,Sumit Chopra,Kyunghyun Cho

Main category: cs.CV

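TL;DR: PRIMO is a supervised latent-variable imputation model that quantifies the predictive impact of a missing modality in multimodal learning. It supports training and inference with incomplete modalities and performs comparably to unimodal and multimodal baselines on several datasets.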

Details Motivation: Existing multimodal large language models (MLLMs) usually assume that all modalities are available during training and inference. In practice, modalities are often missing, collected asynchronously, or complete for only a subset of examples, so robust methods for incomplete modalities are needed. Method: PRIMO introduces a latent variable that models the relationship between the missing and observed modalities in the context of the prediction task. Training uses all examples, including those with partial modalities. At inference, many samples are drawn from the learned distribution over the missing modality to obtain the marginal predictive distribution and to compute an instance-level modality impact based on predictive variance. Result: On synthetic XOR, Audio-Vision MNIST, and MIMIC-III (mortality and ICD-9 prediction), PRIMO matches unimodal baselines when a modality is fully missing and multimodal baselines when all modalities are available. It also provides a variance-based, instance-level measure of modality impact and a visual analysis of completions. Conclusion: PRIMO handles incomplete multimodal data, covers complete and partial modality settings in one model, and combines predictive performance with interpretability, offering a practical and principled framework for imputation and attribution. Abstract: Despite the recent success of Multimodal Large Language Models (MLLMs), existing approaches predominantly assume the availability of multiple modalities during training and inference. In practice, multimodal data is often incomplete because modalities may be missing, collected asynchronously, or available only for a subset of examples. In this work, we propose PRIMO, a supervised latent-variable imputation model that quantifies the predictive impact of any missing modality within the multimodal learning setting. PRIMO enables the use of all available training examples, whether modalities are complete or partial. Specifically, it models the missing modality through a latent variable that captures its relationship with the observed modality in the context of prediction. During inference, we draw many samples from the learned distribution over the missing modality to both obtain the marginal predictive distribution (for the purpose of prediction) and analyze the impact of the missing modalities on the prediction for each instance. We evaluate PRIMO on a synthetic XOR dataset, Audio-Vision MNIST, and MIMIC-III for mortality and ICD-9 prediction. Across all datasets, PRIMO obtains performance comparable to unimodal baselines when a modality is fully missing and to multimodal baselines when all modalities are available. PRIMO quantifies the predictive impact of a modality at the instance level using a variance-based metric computed from predictions across latent completions. We visually demonstrate how varying completions of the missing modality result in a set of plausible labels.
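
The sampling-based inference step can be illustrated with a toy XOR setting. This is a minimal sketch, not PRIMO itself: the learned latent distribution is replaced by a fair coin, and `modality_impact`, `xor_predict`, and `x1_only_predict` are hypothetical names introduced here.

```python
import random
from statistics import mean, pvariance

def modality_impact(predict, x_obs, sample_missing, n_samples=1000, seed=0):
    """Average predictions over sampled completions of the missing modality.

    Returns (marginal prediction, variance across completions). The variance
    serves as an instance-level impact score for the missing modality.
    """
    rng = random.Random(seed)
    preds = [predict(x_obs, sample_missing(rng)) for _ in range(n_samples)]
    return mean(preds), pvariance(preds)

def xor_predict(x1, x2):
    return float(x1 != x2)   # label depends on both modalities

def x1_only_predict(x1, x2):
    return float(x1)         # label ignores the missing modality

def coin(rng):
    return rng.randint(0, 1)  # stand-in for the learned completion distribution

xor_mean, xor_var = modality_impact(xor_predict, 1, coin)
flat_mean, flat_var = modality_impact(x1_only_predict, 1, coin)
```

In this toy case the XOR predictor shows high variance across completions (the missing modality matters), while the predictor that ignores the second input shows none.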

[52] Patch-Based Spatial Authorship Attribution in Human-Robot Collaborative Paintings

Eric Chen,Patricia Alves-Oliveira

Main category: cs.CV

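TL;DR: This paper proposes a patch-based framework for spatial authorship attribution in human-robot collaborative paintings and validates it in a forensic case study of 15 abstract paintings, reaching 88.8% patch-level and 86.7% painting-level accuracy and outperforming existing baselines.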

Details Motivation: As agentic AI takes a growing part in creative production, documenting authorship matters to artists, collectors, and legal contexts, yet ground-truth authorship in collaborative art is ambiguous, which calls for interpretable, quantifiable attribution methods. Method: A patch-based spatial attribution approach is applied to data captured with commodity flatbed scanners and evaluated with leave-one-painting-out cross-validation. Conditional Shannon entropy quantifies stylistic overlap to assess uncertainty in regions of mixed authorship. Result: Patch-level accuracy is 88.8% and painting-level accuracy is 86.7%, above texture-based and pretrained-feature baselines (68.0%-84.7%). Hybrid regions show 64% higher conditional entropy than pure paintings (p=0.003), suggesting the model detects mixed authorship rather than failing to classify. Conclusion: The trained model is specific to one human-robot pair, but the approach provides a sample-efficient methodological basis for attribution in data-scarce human-AI creative workflows and could in future extend to any human-robot collaborative painting. Abstract: As agentic AI becomes increasingly involved in creative production, documenting authorship has become critical for artists, collectors, and legal contexts. We present a patch-based framework for spatial authorship attribution within human-robot collaborative painting practice, demonstrated through a forensic case study of one human artist and one robotic system across 15 abstract paintings. Using commodity flatbed scanners and leave-one-painting-out cross-validation, the approach achieves 88.8% patch-level accuracy (86.7% painting-level via majority vote), outperforming texture-based and pretrained-feature baselines (68.0%-84.7%). For collaborative artworks, where ground truth is inherently ambiguous, we use conditional Shannon entropy to quantify stylistic overlap; manually annotated hybrid regions exhibit 64% higher uncertainty than pure paintings (p=0.003), suggesting the model detects mixed authorship rather than classification failure. The trained model is specific to this human-robot pair but provides a methodological grounding for sample-efficient attribution in data-scarce human-AI creative workflows that, in the future, has the potential to extend authorship attribution to any human-robot collaborative painting.
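
The two aggregation steps mentioned above, majority voting over patches and entropy as an uncertainty measure, can be sketched as below. The function names are hypothetical, and averaging the per-patch entropy of the classifier's predicted distribution is one plausible reading of "conditional Shannon entropy", not necessarily the exact estimator used in the paper.

```python
import math
from collections import Counter

def majority_vote(patch_labels):
    """Painting-level label: the most frequent patch-level prediction."""
    return Counter(patch_labels).most_common(1)[0][0]

def mean_conditional_entropy(patch_probs):
    """Average over patches of H(author | patch), in bits."""
    def entropy(p):
        return -sum(q * math.log2(q) for q in p if q > 0)
    return sum(entropy(p) for p in patch_probs) / len(patch_probs)

painting_label = majority_vote(["human", "robot", "human"])
pure_region = mean_conditional_entropy([[0.95, 0.05], [0.9, 0.1]])
hybrid_region = mean_conditional_entropy([[0.55, 0.45], [0.5, 0.5]])
```

A region whose patches receive near-uniform author probabilities scores close to 1 bit, while confidently attributed patches score near 0, which is the contrast the hybrid-versus-pure comparison relies on.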

[53] PartRAG: Retrieval-Augmented Part-Level 3D Generation and Editing

Peize Li,Zeyu Zhang,Hao Tang

Main category: cs.CV

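TL;DR: PartRAG is a retrieval-augmented framework for single-image 3D generation. It uses hierarchical contrastive retrieval to bring diverse, physically plausible part priors from an external part database, and it supports localized, interactive part edits in a shared canonical space, improving both generation quality and editing flexibility.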

Details Motivation: Existing single-image 3D generation methods struggle to cover the long tail of part geometries and to maintain multi-view consistency, and they offer limited support for precise, localized edits. Method: PartRAG (1) builds an external part database of 1,236 part-annotated 3D assets; (2) uses a Hierarchical Contrastive Retrieval module to align dense image patches with 3D latents at part and object granularity; (3) fuses the retrieved exemplars into a diffusion transformer; and (4) adds a masked, part-level editor that performs part swaps, attribute refinements, and compositional updates in a shared canonical space. Result: Competitive results on Objaverse, ShapeNet, and ABO. On Objaverse, Chamfer Distance drops from 0.1726 to 0.1528 and F-Score rises from 0.7472 to 0.844. Inference takes 38 seconds and interactive edits take 5 to 8 seconds. Qualitatively, part boundaries are sharper, thin structures are better preserved, and behavior on articulated objects is robust. Conclusion: By combining retrieval augmentation with an editable canonical representation, PartRAG eases two bottlenecks of single-image 3D generation, limited part diversity and inflexible editing, and offers a route to controllable, high-quality 3D content generation. Abstract: Single-image 3D generation with part-level structure remains challenging: learned priors struggle to cover the long tail of part geometries and maintain multi-view consistency, and existing systems provide limited support for precise, localized edits. We present PartRAG, a retrieval-augmented framework that integrates an external part database with a diffusion transformer to couple generation with an editable representation. To overcome the first challenge, we introduce a Hierarchical Contrastive Retrieval module that aligns dense image patches with 3D part latents at both part and object granularity, retrieving from a curated bank of 1,236 part-annotated assets to inject diverse, physically plausible exemplars into denoising. To overcome the second challenge, we add a masked, part-level editor that operates in a shared canonical space, enabling swaps, attribute refinements, and compositional updates without regenerating the whole object while preserving non-target parts and multi-view consistency. PartRAG achieves competitive results on Objaverse, ShapeNet, and ABO, reducing Chamfer Distance from 0.1726 to 0.1528 and raising F-Score from 0.7472 to 0.844 on Objaverse, with inference in 38s and interactive edits in 5-8s. Qualitatively, PartRAG produces sharper part boundaries, better thin-structure fidelity, and robust behavior on articulated objects. Code: https://github.com/AIGeeksGroup/PartRAG. Website: https://aigeeksgroup.github.io/PartRAG.

[54] Amber-Image: Efficient Compression of Large-Scale Diffusion Transformers

Chaojie Yang,Tian Li,Yue Zhang,Jun Gao

Main category: cs.CV

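TL;DR: This paper proposes an efficient compression framework that turns Qwen-Image, a 60-layer dual-stream MMDiT model, into the lightweight Amber-Image family of T2I models. Using timestep-sensitive depth pruning, a hybrid-stream architecture, and staged knowledge distillation, it cuts parameters by 70% and keeps the whole pipeline under 2,000 GPU hours while preserving high-quality image generation and text rendering.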

Details Motivation: DiT architectures have advanced text-to-image generation, but their compute cost makes deployment difficult, so efficient compression methods that avoid training from scratch are needed. Method: Amber-Image-10B is derived with timestep-sensitive depth pruning, with retained layers reinitialized by local weight averaging and then optimized through layer-wise distillation and full-parameter fine-tuning. Amber-Image-6B adds a hybrid-stream architecture that converts deep-layer dual streams into a single stream initialized from the image branch, followed by progressive distillation and lightweight fine-tuning. Result: Parameters are reduced by 70% with no large-scale data engineering, the full pipeline takes fewer than 2,000 GPU hours, and results on DPG-Bench and LongText-Bench match much larger models, with high-fidelity synthesis and strong text rendering. Conclusion: The framework combines high performance with high efficiency and offers a feasible path to deploying large DiT models in practice. Abstract: Diffusion Transformer (DiT) architectures have significantly advanced Text-to-Image (T2I) generation but suffer from prohibitive computational costs and deployment barriers. To address these challenges, we propose an efficient compression framework that transforms the 60-layer dual-stream MMDiT-based Qwen-Image into lightweight models without training from scratch. Leveraging this framework, we introduce Amber-Image, a series of streamlined T2I models. We first derive Amber-Image-10B using a timestep-sensitive depth pruning strategy, where retained layers are reinitialized via local weight averaging and optimized through layer-wise distillation and full-parameter fine-tuning. Building on this, we develop Amber-Image-6B by introducing a hybrid-stream architecture that converts deep-layer dual streams into a single stream initialized from the image branch, further refined via progressive distillation and lightweight fine-tuning. Our approach reduces parameters by 70% and eliminates the need for large-scale data engineering. Notably, the entire compression and training pipeline, from the 10B to the 6B variant, requires fewer than 2,000 GPU hours, demonstrating exceptional cost-efficiency compared to training from scratch. Extensive evaluations on benchmarks like DPG-Bench and LongText-Bench show that Amber-Image achieves high-fidelity synthesis and superior text rendering, matching much larger models.

[55] StructCore: Structure-Aware Image-Level Scoring for Training-Free Unsupervised Anomaly Detection

Joongwon Chae,Lihui Luo,Yang Liu,Runming Wang,Dongmei Yu,Zeming Liang,Xi Yuan,Dayan Zhang,Zhenglin Chen,Peiwu Qin,Ilmoon Chae

Main category: cs.CV

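TL;DR: This paper proposes StructCore, a training-free, structure-aware image-level scoring method that improves image-level decisions in memory-bank-based unsupervised anomaly detection. It captures distributional and spatial characteristics of the anomaly score map and calibrates them with a diagonal Mahalanobis distance estimated from normal samples, improving image-level detection performance.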

Details Motivation: Max pooling is the standard way to turn anomaly score maps into image-level decisions in memory-bank-based unsupervised anomaly detection, but it relies on a single extreme response and discards most information about how anomaly evidence is distributed and structured across the image, so normal and anomalous scores often overlap. Method: StructCore is training-free. It first extracts a low-dimensional structural descriptor φ(S) from the anomaly score map S that characterizes its distributional and spatial properties, then estimates diagonal Mahalanobis calibration parameters from the normal samples in the training set to refine image-level scoring, leaving pixel-level localization unchanged. Result: StructCore reaches an image-level AUROC of 99.6% on MVTec AD and 98.4% on VisA. Conclusion: By exploiting structural anomaly signatures that max pooling misses, StructCore delivers robust image-level anomaly detection without any added training cost. Abstract: Max pooling is the de facto standard for converting anomaly score maps into image-level decisions in memory-bank-based unsupervised anomaly detection (UAD). However, because it relies on a single extreme response, it discards most information about how anomaly evidence is distributed and structured across the image, often causing normal and anomalous scores to overlap. We propose StructCore, a training-free, structure-aware image-level scoring method that goes beyond max pooling. Given an anomaly score map, StructCore computes a low-dimensional structural descriptor φ(S) that captures distributional and spatial characteristics, and refines image-level scoring via a diagonal Mahalanobis calibration estimated from train-good samples, without modifying pixel-level localization. StructCore achieves image-level AUROC scores of 99.6% on MVTec AD and 98.4% on VisA, demonstrating robust image-level anomaly detection by exploiting structural signatures missed by max pooling.
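
A diagonal Mahalanobis calibration over a small descriptor can be sketched as below. The three descriptor features used here (maximum, mean, and mean of the top 10% of scores) are assumptions chosen for illustration; the abstract does not list the actual components of φ(S), and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def descriptor(score_map):
    """Toy structural descriptor: [max, mean, mean of top 10% scores]."""
    flat = sorted((v for row in score_map for v in row), reverse=True)
    k = max(1, len(flat) // 10)
    return [flat[0], mean(flat), mean(flat[:k])]

def fit_diagonal(normal_descriptors, eps=1e-6):
    """Per-feature mean and variance estimated from normal (train-good) samples."""
    columns = list(zip(*normal_descriptors))
    mu = [mean(c) for c in columns]
    var = [pstdev(c) ** 2 + eps for c in columns]  # eps avoids division by zero
    return mu, var

def diagonal_mahalanobis(d, mu, var):
    """Distance that treats features as independent (diagonal covariance)."""
    return math.sqrt(sum((x - m) ** 2 / v for x, m, v in zip(d, mu, var)))

normal_maps = [
    [[0.1, 0.2], [0.1, 0.1]],
    [[0.2, 0.1], [0.1, 0.2]],
    [[0.1, 0.1], [0.2, 0.2]],
]
mu, var = fit_diagonal([descriptor(m) for m in normal_maps])
normal_score = diagonal_mahalanobis(descriptor(normal_maps[0]), mu, var)
anomaly_score = diagonal_mahalanobis(descriptor([[0.1, 0.9], [0.8, 0.1]]), mu, var)
```

Since only per-feature means and variances of normal samples are needed, this kind of calibration involves no gradient-based training, consistent with the training-free framing.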

[56] Cholec80-port: A Geometrically Consistent Trocar Port Segmentation Dataset for Robust Surgical Scene Understanding

Shunsuke Kikuchi,Atsushi Kouno,Hiroki Matsuzaki

Main category: cs.CV

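TL;DR: This paper presents the Cholec80-port dataset and a unified port-mask annotation standard that excludes the central opening. It targets the degradation that trocar ports, with their specular, textured surfaces, cause in geometry-based downstream tasks such as image stitching and 3D reconstruction in laparoscopic surgery. Experiments show that geometrically consistent annotations improve cross-dataset robustness more than simply adding data.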

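Details Motivation: Trocar ports occlude the view in laparoscopic images and attract a disproportionate number of feature points, which seriously harms geometry-based downstream tasks, yet existing public datasets lack explicit, geometrically consistent port annotations and often wrongly mask the visible central opening. Method: The authors build Cholec80-port, a high-quality port segmentation dataset, and define a strict SOP that specifies a port-sleeve mask excluding the central opening. They also cleanse and unify several existing public datasets under the same SOP. Result: Experiments show that geometrically consistent annotations substantially improve cross-dataset robustness, and that this gain is independent of dataset size. Conclusion: A geometrically consistent port annotation standard, in particular one that leaves the region visible through the central opening unmasked, is important for the generalization and stability of laparoscopic visual geometry tasks. Cholec80-port and its SOP provide a reliable benchmark for the field. Abstract: Trocar ports are camera-fixed, pseudo-static structures that can persistently occlude laparoscopic views and attract disproportionate feature points due to specular, textured surfaces. This makes ports particularly detrimental to geometry-based downstream pipelines such as image stitching, 3D reconstruction, and visual SLAM, where dynamic or non-anatomical outliers degrade alignment and tracking stability. Despite this practical importance, explicit port labels are rare in public surgical datasets, and existing annotations often violate geometric consistency by masking the central lumen (opening), even when anatomical regions are visible through it. We present Cholec80-port, a high-fidelity trocar port segmentation dataset derived from Cholec80, together with a rigorous standard operating procedure (SOP) that defines a port-sleeve mask excluding the central opening. We additionally cleanse and unify existing public datasets under the same SOP. Experiments demonstrate that geometrically consistent annotations substantially improve cross-dataset robustness beyond what dataset size alone provides.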

[57] Cross Pseudo Labeling For Weakly Supervised Video Anomaly Detection

Lee Dayeon,Kim Dongheyong,Park Chaewon,Woo Sungmin,Lee Sangyoun

Main category: cs.CV

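TL;DR: This paper proposes CPL-VAD, a dual-branch framework that uses cross pseudo labeling to combine binary anomaly detection with category classification based on vision-language alignment. It performs weakly supervised video anomaly detection and abnormal category recognition and reaches state-of-the-art performance on XD-Violence and UCF-Crime.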

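Details Motivation: With only video-level labels, weakly supervised video anomaly detection struggles to achieve precise anomaly localization and abnormal category recognition at the same time. Method: CPL-VAD is a dual-branch framework. One branch performs snippet-level binary anomaly detection, the other uses vision-language alignment to recognize abnormal event categories, and a cross pseudo-labeling mechanism lets the two branches reinforce each other. Result: On XD-Violence and UCF-Crime, CPL-VAD achieves state-of-the-art performance in both anomaly detection and abnormal category classification. Conclusion: Cross pseudo labeling effectively combines temporal localization precision with semantic discrimination, supporting the value of dual-branch collaborative learning for weakly supervised video anomaly analysis. Abstract: Weakly supervised video anomaly detection aims to detect anomalies and identify abnormal categories with only video-level labels. We propose CPL-VAD, a dual-branch framework with cross pseudo labeling. The binary anomaly detection branch focuses on snippet-level anomaly localization, while the category classification branch leverages vision-language alignment to recognize abnormal event categories. By exchanging pseudo labels, the two branches transfer complementary strengths, combining temporal precision with semantic discrimination. Experiments on XD-Violence and UCF-Crime demonstrate that CPL-VAD achieves state-of-the-art performance in both anomaly detection and abnormal category classification.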

[58] ComptonUNet: A Deep Learning Model for GRB Localization with Compton Cameras under Noisy and Low-Statistic Conditions

Shogo Sato,Kazuo Tanaka,Shojun Ogasawara,Kazuki Yamamoto,Kazuhiko Murasaki,Ryuichi Tanida,Jun Kataoka

Main category: cs.CV

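TL;DR: This paper proposes ComptonUNet, a hybrid deep learning framework for robustly localizing faint gamma-ray burst (GRB) sources under low photon statistics and strong background noise. The model combines the statistical efficiency of direct reconstruction with image denoising and significantly outperforms existing methods in simulations.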

Details Motivation: Faint GRBs are hard to detect and localize because of low photon statistics and strong background noise, and existing machine learning methods struggle to balance statistical robustness against noise suppression. Method: ComptonUNet is a hybrid deep learning framework that jointly processes raw data and reconstructs images, combining the statistical efficiency of direct reconstruction models with the denoising capability of image-based architectures. Result: On simulated GRB events embedded in backgrounds representative of low-Earth-orbit missions, ComptonUNet significantly improves localization accuracy in low-statistic, high-background scenarios and outperforms existing methods. Conclusion: ComptonUNet offers a new approach to high-accuracy localization of faint, distant GRBs and could strengthen their use as probes of early star formation. Abstract: Gamma-ray bursts (GRBs) are among the most energetic transient phenomena in the universe and serve as powerful probes for high-energy astrophysical processes. In particular, faint GRBs originating from a distant universe may provide unique insights into the early stages of star formation. However, detecting and localizing such weak sources remains challenging owing to low photon statistics and substantial background noise. Although recent machine learning models address individual aspects of these challenges, they often struggle to balance the trade-off between statistical robustness and noise suppression. Consequently, we propose ComptonUNet, a hybrid deep learning framework that jointly processes raw data and reconstructs images for robust GRB localization. ComptonUNet was designed to operate effectively under conditions of limited photon statistics and strong background contamination by combining the statistical efficiency of direct reconstruction models with the denoising capabilities of image-based architectures. We perform realistic simulations of GRB-like events embedded in background environments representative of low-Earth orbit missions to evaluate the performance of ComptonUNet. Our results demonstrate that ComptonUNet significantly outperforms existing approaches, achieving improved localization accuracy across a wide range of low-statistic and high-background scenarios.

[59] 3D Scene Rendering with Multimodal Gaussian Splatting

Chi-Shiang Gau,Konstantinos D. Polyzos,Athanasios Bacharis,Saketh Madhuvarasu,Tara Javidi

Main category: cs.CV

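TL;DR: This paper proposes a multimodal framework that combines radio-frequency (RF) sensing, such as automotive radar, with 3D Gaussian Splatting (GS) rendering to overcome the initialization difficulties of vision-only GS in adverse weather, low light, or occlusion. The method predicts depth efficiently from sparse RF depth measurements and produces a high-quality point cloud for GS initialization, improving robustness and efficiency while keeping rendering quality high.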

Details Motivation: Conventional vision-based Gaussian Splatting (GS) depends on many camera views for initialization and training, and its performance degrades when visual cues are unreliable, as in adverse weather, low illumination, or partial occlusion. RF signals are inherently robust to these disturbances, so adding RF sensing can improve the reliability and applicability of GS. Method: A multimodal framework combines RF sensing (such as automotive radar) with GS rendering. Efficient depth prediction from sparse RF depth measurements yields a high-quality 3D point cloud that initializes the Gaussian primitives of various GS architectures. Result: Numerical experiments show that the RF-augmented GS scheme achieves high-fidelity 3D scene rendering driven by RF-informed structural accuracy, with improved robustness and efficiency under visually degraded conditions. Conclusion: Combining RF sensing with GS is a more efficient and more robust alternative to vision-only GS rendering and offers a more adaptable 3D reconstruction and rendering option for applications such as autonomous driving and robotics. Abstract: 3D scene reconstruction and rendering are core tasks in computer vision, with applications spanning industrial monitoring, robotics, and autonomous driving. Recent advances in 3D Gaussian Splatting (GS) and its variants have achieved impressive rendering fidelity while maintaining high computational and memory efficiency. However, conventional vision-based GS pipelines typically rely on a sufficient number of camera views to initialize the Gaussian primitives and train their parameters, typically incurring additional processing cost during initialization while falling short in conditions where visual cues are unreliable, such as adverse weather, low illumination, or partial occlusions. To cope with these challenges, and motivated by the robustness of radio-frequency (RF) signals to weather, lighting, and occlusions, we introduce a multimodal framework that integrates RF sensing, such as automotive radar, with GS-based rendering as a more efficient and robust alternative to vision-only GS rendering. The proposed approach enables efficient depth prediction from only sparse RF-based depth measurements, yielding a high-quality 3D point cloud for initializing Gaussian functions across diverse GS architectures. Numerical tests demonstrate the merits of judiciously incorporating RF sensing into GS pipelines, achieving high-fidelity 3D scene rendering driven by RF-informed structural accuracy.

[60] B$^3$-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates

Hiromichi Kamata,Samuel Arthur Munro,Fuminori Homma

Main category: cs.CV

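TL;DR: This paper proposes B$^3$-Seg, an open-vocabulary segmentation method for 3D Gaussian Splatting that needs no predefined camera viewpoints, no annotated data, and no retraining. It uses Beta-Bernoulli Bayesian updates and active view selection driven by Expected Information Gain (EIG) to deliver fast, information-efficient interactive segmentation with theoretical guarantees.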

Details Motivation: Existing 3DGS segmentation methods rely on predefined viewpoints, ground-truth labels, or costly retraining, which makes them unsuitable for the low-latency, real-time editing needed in film and game production. Method: Segmentation is formulated as sequential Beta-Bernoulli Bayesian updates, and the next view is chosen actively using an analytic Expected Information Gain (EIG). EIG is adaptively monotone and submodular, so a greedy strategy yields a (1−1/e) approximation to the optimal view sampling policy. Result: On multiple datasets, B$^3$-Seg completes end-to-end segmentation within a few seconds, is competitive with high-cost supervised methods, and comes with provable information efficiency. Conclusion: B$^3$-Seg enables camera-free, training-free, open-vocabulary interactive 3DGS segmentation that is both practical and theoretically grounded, bringing real-time 3D content editing closer to production use. Abstract: Interactive 3D Gaussian Splatting (3DGS) segmentation is essential for real-time editing of pre-reconstructed assets in film and game production. However, existing methods rely on predefined camera viewpoints, ground-truth labels, or costly retraining, making them impractical for low-latency use. We propose B$^3$-Seg (Beta-Bernoulli Bayesian Segmentation for 3DGS), a fast and theoretically grounded method for open-vocabulary 3DGS segmentation under camera-free and training-free conditions. Our approach reformulates segmentation as sequential Beta-Bernoulli Bayesian updates and actively selects the next view via analytic Expected Information Gain (EIG). This Bayesian formulation guarantees the adaptive monotonicity and submodularity of EIG, which produces a greedy $(1{-}1/e)$ approximation to the optimal view sampling policy. Experiments on multiple datasets show that B$^3$-Seg achieves competitive results to high-cost supervised methods while operating end-to-end segmentation within a few seconds. The results demonstrate that B$^3$-Seg enables practical, interactive 3DGS segmentation with provable information efficiency.
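
The sequential update and greedy view choice can be illustrated with a small sketch. Each Gaussian carries a Beta(a, b) belief that it belongs to the target, and evidence from a rendered view updates the counts. Note the substitution: the paper's analytic EIG is not reproduced here; total posterior variance over visible Gaussians is used as a simple uncertainty proxy, and all names and the `visibility` structure are hypothetical.

```python
def beta_update(a, b, fg_weight, bg_weight):
    """Conjugate update: add foreground and background evidence to the Beta counts."""
    return a + fg_weight, b + bg_weight

def beta_mean(a, b):
    """Posterior probability that the Gaussian belongs to the target."""
    return a / (a + b)

def beta_variance(a, b):
    n = a + b
    return a * b / (n * n * (n + 1))

def pick_next_view(params, visibility):
    """Greedy choice: the view whose visible Gaussians are most uncertain.

    Posterior variance stands in for the analytic EIG used in the paper.
    """
    def score(view):
        return sum(beta_variance(*params[i]) for i in visibility[view])
    return max(visibility, key=score)

# Three Gaussians: two with uninformative priors, one already well observed.
params = [(1.0, 1.0), (1.0, 1.0), (10.0, 1.0)]
visibility = {"front": [0, 1], "back": [2]}
next_view = pick_next_view(params, visibility)
params[0] = beta_update(*params[0], fg_weight=1.0, bg_weight=0.0)
```

The conjugate form is what makes each update a constant-time operation per Gaussian, which fits the low-latency setting the abstract targets.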

[61] BadCLIP++: Stealthy and Persistent Backdoors in Multimodal Contrastive Learning

Siyuan Liang,Yongcheng Jing,Yingjie Wang,Jiaxing Huang,Ee-chien Chang,Dacheng Tao

Main category: cs.CV

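TL;DR: This paper proposes BadCLIP++, a framework that addresses the stealthiness and persistence challenges of backdoor attacks on multimodal contrastive learning models. It combines a semantic-fusion QR micro-trigger, target-aligned subset selection, radius shrinkage and centroid alignment, and curvature control with elastic weight consolidation, achieving a high attack success rate (99.99%) at a very low poisoning rate (0.3%) and strong resistance to many defenses.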

Details Motivation: Existing backdoor attacks on multimodal contrastive learning models fail under strong detection or continued fine-tuning, mainly because cross-modal inconsistency exposes trigger patterns and because gradient dilution at low poisoning rates accelerates backdoor forgetting. These coupled causes have not been adequately modeled or addressed. Method: BadCLIP++ is a unified framework. (1) For stealthiness, a semantic-fusion QR micro-trigger is embedded near task-relevant regions, and target-aligned subset selection strengthens the signal at low injection rates. (2) For persistence, radius shrinkage and centroid alignment stabilize trigger embeddings, while curvature control and elastic weight consolidation stabilize model parameters. (3) A first theoretical analysis shows that, within a trust region, gradients from clean fine-tuning and from the backdoor objective are co-directional, giving a non-increasing upper bound on attack success degradation. Result: With only 0.3% poisoning, the digital attack success rate (ASR) reaches 99.99%, 11.4 points above baselines. Across 19 defenses, ASR stays above 99.90% with less than 0.8% drop in clean accuracy. Physical attacks reach 65.03% success, and the method is robust to watermark removal defenses. Conclusion: BadCLIP++ achieves both stealthiness and persistence, showing that backdoor attacks on multimodal contrastive learning models can be more practical and robust than previously demonstrated, and it supplies theoretical support for this line of study. Abstract: Research on backdoor attacks against multimodal contrastive learning models faces two key challenges: stealthiness and persistence. Existing methods often fail under strong detection or continuous fine-tuning, largely due to (1) cross-modal inconsistency that exposes trigger patterns and (2) gradient dilution at low poisoning rates that accelerates backdoor forgetting. These coupled causes remain insufficiently modeled and addressed. We propose BadCLIP++, a unified framework that tackles both challenges. For stealthiness, we introduce a semantic-fusion QR micro-trigger that embeds imperceptible patterns near task-relevant regions, preserving clean-data statistics while producing compact trigger distributions. We further apply target-aligned subset selection to strengthen signals at low injection rates. For persistence, we stabilize trigger embeddings via radius shrinkage and centroid alignment, and stabilize model parameters through curvature control and elastic weight consolidation, maintaining solutions within a low-curvature wide basin resistant to fine-tuning. We also provide the first theoretical analysis showing that, within a trust region, gradients from clean fine-tuning and backdoor objectives are co-directional, yielding a non-increasing upper bound on attack success degradation. Experiments demonstrate that with only 0.3% poisoning, BadCLIP++ achieves 99.99% attack success rate (ASR) in digital settings, surpassing baselines by 11.4 points. Across nineteen defenses, ASR remains above 99.90% with less than 0.8% drop in clean accuracy. The method further attains 65.03% success in physical attacks and shows robustness against watermark removal defenses.

[62] NRGS-SLAM: Monocular Non-Rigid SLAM for Endoscopy via Deformation-Aware 3D Gaussian Splatting

Jiwei Shan,Zeyu Cai,Yirui Li,Yongbo Chen,Lijun Han,Yun-hui Liu,Hesheng Wang,Shing Shin Cheng

Main category: cs.CV

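TL;DR: This paper proposes NRGS-SLAM, a monocular non-rigid SLAM system for endoscopy based on 3D Gaussian Splatting. A deformation-aware Gaussian map with learnable deformation probabilities, a coarse-to-fine deformable tracking module, a progressive mapping module, and a unified robust geometric loss together decouple camera motion from soft-tissue deformation and improve pose accuracy and reconstruction quality.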

Details Motivation: In endoscopic scenes, persistent soft-tissue deformation violates the rigidity assumption and strongly couples camera ego-motion with intrinsic deformation. Existing monocular non-rigid SLAM methods lack effective decoupling mechanisms and rely on sparse or low-fidelity scene representations, which causes tracking drift and limits reconstruction quality. Method: NRGS-SLAM (1) builds a deformation-aware 3D Gaussian map in which each Gaussian primitive carries a learnable deformation probability optimized through Bayesian self-supervision; (2) uses a deformable tracking module that performs coarse-to-fine pose estimation prioritizing low-deformation regions, followed by efficient per-frame deformation updates; (3) uses a deformable mapping module that progressively expands and refines the map; and (4) adds a unified robust geometric loss that incorporates external geometric priors. Result: On multiple public endoscopic datasets, NRGS-SLAM reduces pose estimation RMSE by up to 50% relative to state-of-the-art methods and produces higher-fidelity photo-realistic reconstructions. Ablation studies confirm the contribution of each key design. Conclusion: NRGS-SLAM is an efficient, robust, high-quality solution for monocular non-rigid SLAM in endoscopy, and its Gaussian-splatting-based deformation modeling and decoupling strategy has practical value and potential for wider use. Abstract: Visual simultaneous localization and mapping (V-SLAM) is a fundamental capability for autonomous perception and navigation. However, endoscopic scenes violate the rigidity assumption due to persistent soft-tissue deformations, creating a strong coupling ambiguity between camera ego-motion and intrinsic deformation. Although recent monocular non-rigid SLAM methods have made notable progress, they often lack effective decoupling mechanisms and rely on sparse or low-fidelity scene representations, which leads to tracking drift and limited reconstruction quality. To address these limitations, we propose NRGS-SLAM, a monocular non-rigid SLAM system for endoscopy based on 3D Gaussian Splatting. To resolve the coupling ambiguity, we introduce a deformation-aware 3D Gaussian map that augments each Gaussian primitive with a learnable deformation probability, optimized via a Bayesian self-supervision strategy without requiring external non-rigidity labels. Building on this representation, we design a deformable tracking module that performs robust coarse-to-fine pose estimation by prioritizing low-deformation regions, followed by efficient per-frame deformation updates. A carefully designed deformable mapping module progressively expands and refines the map, balancing representational capacity and computational efficiency. In addition, a unified robust geometric loss incorporates external geometric priors to mitigate the inherent ill-posedness of monocular non-rigid SLAM. Extensive experiments on multiple public endoscopic datasets demonstrate that NRGS-SLAM achieves more accurate camera pose estimation (up to 50% reduction in RMSE) and higher-quality photo-realistic reconstructions than state-of-the-art methods. Comprehensive ablation studies further validate the effectiveness of our key design choices. Source code will be publicly available upon paper acceptance.

[63] Selective Training for Large Vision Language Models via Visual Information Gain

Seulbi Lee,Sangheum Hwang

Main category: cs.CV

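TL;DR: This paper introduces Visual Information Gain (VIG), a metric that quantifies how much image input reduces a model's prediction uncertainty, and builds a selective training strategy on it to improve the visual grounding of large vision language models (LVLMs) and mitigate language bias.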

Details Motivation: Existing LVLMs suffer from language bias, producing answers without relying on visual evidence, and prior methods lack a quantitative measure of how much an individual training sample or token benefits from the image. Method: The paper proposes the perplexity-based Visual Information Gain (VIG) metric, which supports fine-grained analysis at sample and token level, and a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. Result: The method improves visual grounding and mitigates language bias while using less supervision, and achieves better performance on multiple benchmarks. Conclusion: VIG is an interpretable, actionable quantitative tool for assessing and strengthening the visual dependence of LVLMs, and VIG-guided training is an efficient way to make use of visual information. Abstract: Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, they typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.
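
One way to read a perplexity-based gain is as the drop in log-perplexity when the image is provided. The sketch below assumes that per-token log-probabilities of the reference answer are already available from two forward passes (with and without the image); the exact definition in the paper may differ, and the function names and threshold are hypothetical.

```python
import math

def token_vig(logp_with_image, logp_text_only):
    """Per-token gain: log p(token | image, text) - log p(token | text)."""
    return [a - b for a, b in zip(logp_with_image, logp_text_only)]

def perplexity(logps):
    return math.exp(-sum(logps) / len(logps))

def sample_vig(logp_with_image, logp_text_only):
    """Sample-level gain: reduction in log-perplexity due to the image."""
    return math.log(perplexity(logp_text_only)) - math.log(perplexity(logp_with_image))

def select_tokens(vig, threshold):
    """Indices of tokens whose gain exceeds an assumed threshold."""
    return [i for i, g in enumerate(vig) if g > threshold]

# Token 0 (say, a color word) becomes far more likely with the image; token 1 does not.
with_image = [math.log(0.9), math.log(0.5)]
text_only = [math.log(0.1), math.log(0.5)]
gains = token_vig(with_image, text_only)
kept = select_tokens(gains, threshold=0.5)
```

Tokens that the language prior already predicts well get a gain near zero and would be deprioritized, which is the mechanism by which such a selection rule could reduce language bias.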

[64] EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models

Yahong Wang,Juncheng Wu,Zhangkai Ni,Chengmei Yang,Yihang Liu,Longzhen Yang,Yuyin Zhou,Ying Wen,Lianghua He

Main category: cs.CV

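TL;DR: This paper proposes EntropyPrune, a matrix-entropy-based visual token pruning method. By identifying an "Entropy Collapse Layer" (ECL), it enables interpretable, transferable, and efficient pruning that substantially lowers MLLM inference cost while retaining high accuracy.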

Details Motivation: Existing MLLM token pruning methods choose pruning layers empirically and statically, without a theoretical basis, which hurts interpretability and cross-model generalization, while the large number of visual tokens makes inference expensive. Method: From a matrix-entropy perspective, the paper identifies and defines the "Entropy Collapse Layer" (ECL) as a principled criterion for when to prune. The EntropyPrune framework then exploits the spectral equivalence of dual Gram matrices to compute token entropy efficiently, quantifying the value of each visual token and pruning redundant ones without attention maps. Result: On LLaVA-1.5-7B, FLOPs drop by 68.2% while 96.0% of the original performance is retained. The method outperforms state-of-the-art pruning methods on multimodal benchmarks and extends to high-resolution image and video models. Conclusion: Matrix entropy offers an interpretable, transferable basis for visual token pruning in MLLMs, and EntropyPrune combines efficiency, robustness, and strong generalization, making it a practical option for MLLM acceleration. Abstract: Multimodal large language models (MLLMs) incur substantial inference cost due to the processing of hundreds of visual tokens per image. Although token pruning has proven effective for accelerating inference, determining when and where to prune remains largely heuristic. Existing approaches typically rely on static, empirically selected layers, which limit interpretability and transferability across models. In this work, we introduce a matrix-entropy perspective and identify an "Entropy Collapse Layer" (ECL), where the information content of visual representations exhibits a sharp and consistent drop, which provides a principled criterion for selecting the pruning stage. Building on this observation, we propose EntropyPrune, a novel matrix-entropy-guided token pruning framework that quantifies the information value of individual visual tokens and prunes redundant ones without relying on attention maps. Moreover, to enable efficient computation, we exploit the spectral equivalence of dual Gram matrices, reducing the complexity of entropy computation and yielding up to a 64x theoretical speedup. Extensive experiments on diverse multimodal benchmarks demonstrate that EntropyPrune consistently outperforms state-of-the-art pruning methods in both accuracy and efficiency. On LLaVA-1.5-7B, our method achieves a 68.2% reduction in FLOPs while preserving 96.0% of the original performance. Furthermore, EntropyPrune generalizes effectively to high-resolution and video-based models, highlighting the strong robustness and scalability in practical MLLM acceleration. The code will be publicly available at https://github.com/YahongWang1/EntropyPrune.
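
The dual Gram equivalence can be checked numerically with a small example. For a token matrix X (n tokens by d features), the n×n matrix X Xᵀ and the d×d matrix Xᵀ X share the same nonzero eigenvalues, so any spectral entropy can be computed from whichever is smaller. To stay dependency-free, the sketch below uses the order-2 Rényi entropy, which needs only traces; this is a stand-in chosen for illustration and may not be the entropy the paper uses.

```python
import math

def gram(rows):
    """Gram matrix of pairwise dot products between the given row vectors."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]

def renyi2_entropy(g):
    """Order-2 matrix entropy of g / tr(g): -log( tr(g^2) / tr(g)^2 ).

    For a symmetric matrix, tr(g^2) equals the sum of squared entries.
    """
    trace = sum(g[i][i] for i in range(len(g)))
    frob_sq = sum(v * v for row in g for v in row)
    return -math.log(frob_sq / (trace * trace))

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]  # 4 tokens, 2 features
token_gram = gram(tokens)                                   # 4 x 4
feature_gram = gram([list(col) for col in zip(*tokens)])    # 2 x 2
entropy_from_tokens = renyi2_entropy(token_gram)
entropy_from_features = renyi2_entropy(feature_gram)
```

Both routes give the same entropy, so when one dimension is much smaller than the other, working with the smaller Gram matrix is the cheaper option.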

[65] GASS: Geometry-Aware Spherical Sampling for Disentangled Diversity Enhancement in Text-to-Image Generation

Ye Zhu,Kaleb S. Newman,Johannes F. Lutzeyer,Adriana Romero-Soriano,Michal Drozdzal,Olga Russakovsky

Main category: cs.CV

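TL;DR: This paper proposes Geometry-Aware Spherical Sampling (GASS), which takes a geometric view of the CLIP embedding space and decomposes diversity into prompt-related and prompt-independent directions, increasing the diversity of text-to-image generation with minimal impact on image fidelity and semantic alignment.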

Details Motivation: Current text-to-image models align well with prompts but produce images with limited diversity, which restricts user choice and risks amplifying societal biases. Method: GASS decomposes diversity in the CLIP embedding space into the text-embedding direction (prompt-related semantic variation) and an orthogonal direction (prompt-independent variation such as backgrounds). It widens the spread of the geometric projections of generated image embeddings along both directions and uses this to guide sampling. Result: Experiments on several frozen T2I backbones (U-Net and DiT, diffusion and flow models) and benchmarks show that the method improves diversity in a disentangled way with minimal impact on image quality and semantic alignment. Conclusion: Explicitly modeling and controlling the sources of diversity from a geometric viewpoint is an effective and general way to improve diversity in T2I generation. Abstract: Despite high semantic alignment, modern text-to-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. This lack of diversity not only restricts user choice, but also risks amplifying societal biases. In this work, we enhance the T2I diversity through a geometric lens. Unlike most existing methods that rely primarily on entropy-based guidance to increase sample dissimilarity, we introduce Geometry-Aware Spherical Sampling (GASS) to enhance diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Specifically, we decompose the diversity measure in CLIP embeddings using two orthogonal directions: the text embedding, which captures semantic variation related to the prompt, and an identified orthogonal direction that captures prompt-independent variation (e.g., backgrounds). Based on this decomposition, GASS increases the geometric projection spread of generated image embeddings along both axes and guides the T2I sampling process via expanded predictions along the generation trajectory. Our experiments on different frozen T2I backbones (U-Net and DiT, diffusion and flow) and benchmarks demonstrate the effectiveness of disentangled diversity enhancement with minimal impact on image fidelity and semantic alignment.
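
The two-axis decomposition can be sketched with plain vectors. The example assumes the image and text embeddings are already computed, obtains the second axis by Gram-Schmidt from an arbitrary candidate vector (how the paper identifies its orthogonal direction is not stated in the abstract), and measures spread as the variance of projections. All names are hypothetical.

```python
import math
from statistics import pvariance

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def orthogonal_direction(candidate, text_dir):
    """Remove the text-direction component from a candidate (Gram-Schmidt step)."""
    proj = dot(candidate, text_dir)
    return normalize([c - proj * t for c, t in zip(candidate, text_dir)])

def projection_spread(embeddings, direction):
    """Variance of the embeddings' projections onto a unit direction."""
    return pvariance([dot(e, direction) for e in embeddings])

text_dir = normalize([1.0, 0.0, 0.0])
ortho_dir = orthogonal_direction([1.0, 1.0, 0.0], text_dir)
images = [[0.9, 0.1, 0.0], [0.9, -0.1, 0.0], [0.8, 0.0, 0.3]]
semantic_spread = projection_spread(images, text_dir)
independent_spread = projection_spread(images, ortho_dir)
```

Tracking the two spreads separately is what would let a sampler raise prompt-independent variety (for example, backgrounds) without pulling images away from the prompt.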

[66] HiMAP: History-aware Map-occupancy Prediction with Fallback

Yiming Xu,Yi Yang,Hao Cheng,Monika Sester

Main category: cs.CV

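TL;DR: HiMAP is a trajectory prediction framework that requires no multi-object tracking (MOT). It uses historical occupancy maps and a historical query module to produce identity-agnostic, robust multi-modal motion forecasts, and it substantially outperforms tracking-free baselines on Argoverse 2.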

Details Motivation: Existing motion forecasting methods depend heavily on multi-object tracking (MOT); when tracking fails through occlusion, identity switches, or missed detections, prediction quality and safety degrade sharply. A predictor that does not rely on object IDs and is robust to MOT failures is needed. Method: HiMAP converts past detections into spatiotemporally invariant historical occupancy maps; a historical query module conditions on the current agent state to iteratively retrieve agent-specific history from the unlabeled occupancy representation; a temporal map embedding summarizes the retrieved history and, together with the final query and map context, drives a DETR-style decoder that generates multi-modal future trajectories. Result: On Argoverse 2, HiMAP performs comparably to tracking-based methods while operating without IDs; in the no-tracking setting it improves over a fine-tuned QCNet by 11% in FDE, 12% in ADE, and 4% in MR; it also supports streaming inference and stable, simultaneous forecasts for all agents. Conclusion: HiMAP removes the dependence on object IDs and MOT, is robust and practical, and can serve as a reliable prediction fallback when tracking fails in safety-critical autonomous driving systems. Abstract: Accurate motion forecasting is critical for autonomous driving, yet most predictors rely on multi-object tracking (MOT) with identity association, assuming that objects are correctly and continuously tracked. When tracking fails due to, e.g., occlusion, identity switches, or missed detections, prediction quality degrades and safety risks increase. We present HiMAP, a tracking-free trajectory prediction framework that remains reliable under MOT failures. HiMAP converts past detections into spatiotemporally invariant historical occupancy maps and introduces a historical query module that conditions on the current agent state to iteratively retrieve agent-specific history from unlabeled occupancy representations. The retrieved history is summarized by a temporal map embedding and, together with the final query and map context, drives a DETR-style decoder to produce multi-modal future trajectories. This design lifts identity reliance, supports streaming inference via reusable encodings, and serves as a robust fallback when tracking is unavailable. On Argoverse 2, HiMAP achieves performance comparable to tracking-based methods while operating without IDs, and it substantially outperforms strong baselines in the no-tracking setting, yielding relative gains of 11% in FDE, 12% in ADE, and a 4% reduction in MR over a fine-tuned QCNet. Beyond aggregate metrics, HiMAP delivers stable forecasts for all agents simultaneously without waiting for tracking to recover, highlighting its practical value for safety-critical autonomy.
The code is available under: https://github.com/XuYiMing83/HiMAP.
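For readers unfamiliar with the metrics quoted above, the following sketch shows the usual single-trajectory definitions of average displacement error (ADE) and final displacement error (FDE), plus the relative-gain arithmetic. It is an illustration only: the benchmark's official evaluation (for example, taking the best of several predicted modes) is not reproduced, and the function names and numbers are hypothetical.

```python
import math


def ade(pred, gt):
    """Average displacement error: mean Euclidean distance between predicted
    and ground-truth positions over all future timesteps."""
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)


def fde(pred, gt):
    """Final displacement error: Euclidean distance at the last timestep."""
    return math.dist(pred[-1], gt[-1])


def relative_gain(baseline, ours):
    """Relative reduction of an error metric with respect to a baseline."""
    return (baseline - ours) / baseline


# Hypothetical two-step trajectories as (x, y) points.
pred = [(0.0, 0.0), (3.0, 4.0)]
gt = [(0.0, 0.0), (0.0, 0.0)]

example_ade = ade(pred, gt)
example_fde = fde(pred, gt)
example_gain = relative_gain(1.00, 0.89)  # an 11% relative reduction
```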

[67] Inferring Height from Earth Embeddings: First insights using Google AlphaEarth

Alireza Hamoudzadeh,Valeria Belloni,Roberta Ravanelli

Main category: cs.CV

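TL;DR: This study examines whether AlphaEarth embeddings (10 m resolution) can effectively guide deep learning regression models for regional surface height mapping, using U-Net and U-Net++ architectures to decode the embeddings. The results show that the embeddings contain a decodable height signal and that U-Net++ generalizes better on the test set (R² = 0.84 vs. 0.78), although residual bias and distribution shift remain challenges.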

Details Motivation: To explore whether the geospatial and multimodal features encoded in Earth embeddings (specifically AlphaEarth Embeddings) can effectively support deep learning models for regional surface height estimation, improving the generalization and efficiency of remote-sensing terrain modeling. Method: AlphaEarth embeddings at 10 m resolution serve as input features and a high-quality Digital Surface Model (DSM) as the reference; U-Net and U-Net++ are used as lightweight convolutional decoders for regression and height prediction. Result: Both models reach R² = 0.97 in training; on the test set U-Net++ performs better (R² = 0.84, median difference −2.62 m) than U-Net (R² = 0.78, median difference −7.22 m); however, the test RMSE is about 16 m and a systematic residual bias remains. Conclusion: AlphaEarth embeddings capture transferable topographic patterns and can support DL-based height mapping, particularly when combined with spatially aware convolutional architectures; the bias caused by distribution shift still needs to be addressed to improve regional transferability. Abstract: This study investigates whether the geospatial and multimodal features encoded in Earth Embeddings can effectively guide deep learning (DL) regression models for regional surface height mapping. In particular, we focused on AlphaEarth Embeddings at 10 m spatial resolution and evaluated their capability to support terrain height inference using a high-quality Digital Surface Model (DSM) as reference. U-Net and U-Net++ architectures were thus employed as lightweight convolutional decoders to assess how well the geospatial information distilled in the embeddings can be translated into accurate surface height estimates. Both architectures achieved strong training performance (both with $R^2 = 0.97$), confirming that the embeddings encode informative and decodable height-related signals. On the test set, performance decreased due to distribution shifts in height frequency between training and testing areas. Nevertheless, U-Net++ shows better generalization ($R^2 = 0.84$, median difference = -2.62 m) compared with the standard U-Net ($R^2 = 0.78$, median difference = -7.22 m), suggesting enhanced robustness to distribution mismatch. While the testing RMSE (approximately 16 m for U-Net++) and residual bias highlight remaining challenges in generalization, strong correlations indicate that the embeddings capture transferable topographic patterns.
Overall, the results demonstrate the promising potential of AlphaEarth Embeddings to guide DL-based height mapping workflows, particularly when combined with spatially aware convolutional architectures, while emphasizing the need to address bias for improved regional transferability.
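The three quantities reported in this abstract ($R^2$, RMSE, and median difference) can be computed as in the sketch below. It uses the standard definitions of the coefficient of determination and root-mean-square error; the sign convention for the median difference (predicted minus reference) is an assumption, since the abstract does not state it, and the height values are made up for illustration.

```python
import math
import statistics


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as the heights (metres)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def median_difference(y_true, y_pred):
    """Median of (predicted - reference); a negative value would mean the
    model tends to underestimate height. The sign convention is assumed."""
    return statistics.median(p - t for t, p in zip(y_true, y_pred))


# Hypothetical reference and predicted heights in metres.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]

example_r2 = r2_score(reference, predicted)
example_rmse = rmse(reference, predicted)
example_median = median_difference(reference, predicted)
```

A constant offset, as in this toy case, lowers $R^2$ and raises RMSE even though the predictions are perfectly correlated with the reference, which is consistent with the abstract's remark that strong correlation can coexist with residual bias.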

[68] A Multi-modal Detection System for Infrastructure-based Freight Signal Priority

Ziyan Zhang,Chuheng Wei,Xuanpeng Zhao,Siyan Li,Will Snyder,Mike Stas,Peng Hao,Kanok Boriboonsomsin,Guoyuan Wu

Main category: cs.CV

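TL;DR: This paper presents an infrastructure-based multi-modal freight vehicle detection system that fuses LiDAR and camera sensors. With a hybrid sensing architecture and a perception pipeline combining several detection methods, it perceives freight vehicle type, position, and speed in real time at high spatio-temporal resolution to support Freight Signal Priority (FSP) applications.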

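Details Motivation: Freight vehicles approaching signalized intersections need reliable detection and motion estimation to support infrastructure-based Freight Signal Priority (FSP); accurate and timely perception of vehicle type, position, and speed is a prerequisite for effective priority control. Method: The authors design and deploy an infrastructure-based multi-modal detection system that fuses LiDAR and cameras, using a hybrid sensing architecture made up of an intersection-mounted subsystem and a midblock subsystem that synchronize data over wireless communication; the perception pipeline combines clustering-based and deep learning-based detection with Kalman filter tracking; LiDAR point clouds are registered into geodetic reference frames to support lane-level localization and stable tracking. Result: Field evaluations show that the system monitors freight vehicle movements reliably at high spatio-temporal resolution, with real-time performance and practical robustness. Conclusion: The system offers a reusable design pattern and deployment experience for FSP-oriented infrastructure sensing, and demonstrates the feasibility and effectiveness of multi-modal fusion for freight traffic management. Abstract: Freight vehicles approaching signalized intersections require reliable detection and motion estimation to support infrastructure-based Freight Signal Priority (FSP). Accurate and timely perception of vehicle type, position, and speed is essential for enabling effective priority control strategies. This paper presents the design, deployment, and evaluation of an infrastructure-based multi-modal freight vehicle detection system integrating LiDAR and camera sensors. A hybrid sensing architecture is adopted, consisting of an intersection-mounted subsystem and a midblock subsystem, connected via wireless communication for synchronized data transmission. The perception pipeline incorporates both clustering-based and deep learning-based detection methods with Kalman filter tracking to achieve stable real-time performance. LiDAR measurements are registered into geodetic reference frames to support lane-level localization and consistent vehicle tracking. Field evaluations demonstrate that the system can reliably monitor freight vehicle movements at high spatio-temporal resolution. The design and deployment provide practical insights for developing infrastructure-based sensing systems to support FSP applications.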

[69] EA-Swin: An Embedding-Agnostic Swin Transformer for AI-Generated Video Detection

Hung Mai,Loi Dinh,Duc Hai Nguyen,Dat Do,Luong Doan,Khanh Nguyen Quoc,Huan Vu,Phong Ho,Naeem Ul Islam,Tuan Do

Main category: cs.CV

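TL;DR: This paper proposes the EA-Swin model and the EA-Video dataset for efficient detection of AI-generated videos, substantially improving accuracy and cross-distribution generalization.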

Details Motivation: Existing detection methods fall short against advanced video generators such as Sora2 and Veo3 because they rely on shallow embedding trajectories, image-based adaptation, or computationally heavy multimodal large language models. Method: The authors propose the Embedding-Agnostic Swin Transformer (EA-Swin), which uses a factorized windowed attention design to model spatiotemporal dependencies directly on pretrained video embeddings, and they build EA-Video, a benchmark of 130K videos that covers diverse generators and supports evaluation on unseen generators. Result: EA-Swin reaches 0.97–0.99 accuracy on major generators, 5–20% above prior SoTA methods (0.8–0.9), and generalizes well to unseen generator distributions. Conclusion: EA-Swin is a scalable and robust solution for detecting modern AI-generated video. Abstract: Recent advances in foundation video generators such as Sora2, Veo3, and other commercial systems have produced highly realistic synthetic videos, exposing the limitations of existing detection methods that rely on shallow embedding trajectories, image-based adaptation, or computationally heavy MLLMs. We propose EA-Swin, an Embedding-Agnostic Swin Transformer that models spatiotemporal dependencies directly on pretrained video embeddings via a factorized windowed attention design, making it compatible with generic ViT-style patch-based encoders. Alongside the model, we construct the EA-Video dataset, a benchmark dataset comprising 130K videos that integrates newly collected samples with curated existing datasets, covering diverse commercial and open-source generators and including unseen-generator splits for rigorous cross-distribution evaluation. Extensive experiments show that EA-Swin achieves 0.97-0.99 accuracy across major generators, outperforming prior SoTA methods (typically 0.8-0.9) by a margin of 5-20%, while maintaining strong generalization to unseen distributions, establishing a scalable and robust solution for modern AI-generated video detection.

[70] Physics Encoded Spatial and Temporal Generative Adversarial Network for Tropical Cyclone Image Super-resolution

Ruoyi Zhang,Jiawei Yuan,Lujia Ye,Runling Yu,Liling Zhao

Main category: cs.CV

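TL;DR: This paper proposes a Physics Encoded Spatial and Temporal Generative Adversarial Network (PESTGAN) for super-resolution of tropical cyclone satellite images. By introducing a physics-constrained PhyCell module and a dual-discriminator framework, it significantly improves the physical fidelity of meteorological structures while maintaining pixel-wise accuracy.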

Details Motivation: Existing deep learning-based super-resolution methods treat satellite image sequences as ordinary videos and ignore the atmospheric physical laws that govern cloud motion. Method: The authors design a disentangled generator containing a PhyCell module, which approximates the vorticity equation via constrained convolutions and encodes the physical dynamics as implicit latent representations; they also introduce a spatial and temporal dual-discriminator framework that enforces motion consistency and spatial realism respectively. Result: For 4× super-resolution on the Digital Typhoon dataset, the method surpasses existing approaches in structural fidelity and perceptual quality, with marked improvements in meteorological structure reconstruction and physical fidelity. Conclusion: PESTGAN effectively combines physical priors with deep generative models, offering an approach to meteorological image super-resolution that balances physical plausibility and visual quality. Abstract: High-resolution satellite imagery is indispensable for tracking the genesis, intensification, and trajectory of tropical cyclones (TCs). However, existing deep learning-based super-resolution (SR) methods often treat satellite image sequences as generic videos, neglecting the underlying atmospheric physical laws governing cloud motion. To address this, we propose a Physics Encoded Spatial and Temporal Generative Adversarial Network (PESTGAN) for TC image super-resolution. Specifically, we design a disentangled generator architecture incorporating a PhyCell module, which approximates the vorticity equation via constrained convolutions and encodes the resulting approximate physical dynamics as implicit latent representations to separate physical dynamics from visual textures. Furthermore, a dual-discriminator framework is introduced, employing a temporal discriminator to enforce motion consistency alongside spatial realism. Experiments on the Digital Typhoon dataset for 4× upscaling demonstrate that PESTGAN achieves better structural fidelity and perceptual quality. While maintaining competitive pixel-wise accuracy compared to existing approaches, our method significantly excels in reconstructing meteorologically plausible cloud structures with superior physical fidelity.

[71] Attachment Anchors: A Novel Framework for Laparoscopic Grasping Point Prediction in Colorectal Surgery

Dennis N. Schneider,Lars Wagner,Daniel Rueckert,Dirk Wilhelm

Main category: cs.CV

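TL;DR: This paper proposes "attachment anchors", a structured representation that encodes the local geometric and mechanical relationships between tissue and its anatomical attachments in colorectal surgery. It improves grasping point prediction from laparoscopic images, with especially strong gains in out-of-distribution settings such as unseen procedures or surgeons.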

Details Motivation: Colorectal procedures are complex and lengthy and have been underrepresented in existing research, yet their repetitive tissue manipulation makes them a well-suited learning environment for machine learning-driven autonomous assistance. Method: The authors introduce attachment anchors as an intermediate representation that normalizes surgical scenes into a local reference frame, predict this representation from laparoscopic images, and incorporate it into a machine learning-based grasping framework. Result: On a dataset of 90 colorectal surgeries, attachment anchors improve grasping point prediction over image-only baselines, with larger gains in out-of-distribution settings. Conclusion: Attachment anchors are an effective intermediate representation that improves the generalization and robustness of learning-based tissue manipulation in colorectal surgery. Abstract: Accurate grasping point prediction is a key challenge for autonomous tissue manipulation in minimally invasive surgery, particularly in complex and variable procedures such as colorectal interventions. Due to their complexity and prolonged duration, colorectal procedures have been underrepresented in current research. At the same time, they pose a particularly interesting learning environment due to repetitive tissue manipulation, making them a promising entry point for autonomous, machine learning-driven support. Therefore, in this work, we introduce attachment anchors, a structured representation that encodes the local geometric and mechanical relationships between tissue and its anatomical attachments in colorectal surgery. This representation reduces uncertainty in grasping point prediction by normalizing surgical scenes into a consistent local reference frame. We demonstrate that attachment anchors can be predicted from laparoscopic images and incorporated into a grasping framework based on machine learning. Experiments on a dataset of 90 colorectal surgeries demonstrate that attachment anchors improve grasping point prediction compared to image-only baselines. There are particularly strong gains in out-of-distribution settings, including unseen procedures and operating surgeons. These results suggest that attachment anchors are an effective intermediate representation for learning-based tissue manipulation in colorectal surgery.

[72] Leveraging Contrastive Learning for a Similarity-Guided Tampered Document Data Generation Pipeline

Mohamed Dhouib,Davide Buscaldi,Sonia Vanier,Aymen Shabou

Main category: cs.CV

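TL;DR: This paper proposes a new method for generating high-quality tampered document images. Two auxiliary networks (contrastive text-crop matching and character-boundary assessment) improve the diversity and visual quality of the generated data, which in turn significantly improves how well tampered-text detection models generalize to real data.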

Details Motivation: Existing rule-based methods for generating tampered documents lack variety, have poor visual quality, and leave obvious artifacts, so models trained on them generalize weakly and perform poorly on real data. Method: The proposed generation framework (1) trains a contrastive-learning auxiliary network to match similar text crops, with a new strategy for constructing positive and negative pairs; (2) trains a second auxiliary network to assess whether a text crop tightly encloses the intended characters; and (3) integrates both into a carefully designed pipeline that produces diverse, high-quality tampered document images. Result: With identical source images and training protocols, multiple models trained on data generated by this method show consistent improvements on several open-source test sets, confirming the effectiveness and generality of the generated data. Conclusion: The framework alleviates data scarcity and improves the robustness and practicality of tampered-text detection models, offering a new way to produce high-quality synthetic data for document security. Abstract: Detecting tampered text in document images is a challenging task due to data scarcity. To address this, previous work has attempted to generate tampered documents using rule-based methods. However, the resulting documents often suffer from limited variety and poor visual quality, typically leaving highly visible artifacts that are rarely observed in real-world manipulations. This undermines the model's ability to learn robust, generalizable features and results in poor performance on real-world data. Motivated by this discrepancy, we propose a novel method for generating high-quality tampered document images. We first train an auxiliary network to compare text crops, leveraging contrastive learning with a novel strategy for defining positive pairs and their corresponding negatives. We also train a second auxiliary network to evaluate whether a crop tightly encloses the intended characters, without cutting off parts of characters or including parts of adjacent ones. Using a carefully designed generation pipeline that leverages both networks, we introduce a framework capable of producing diverse, high-quality tampered document images. We assess the effectiveness of our data generation pipeline by training multiple models on datasets derived from the same source images, generated using our method and existing approaches, under identical training protocols. Evaluating these models on various open-source datasets shows that our pipeline yields consistent performance improvements across architectures and datasets.

[73] Polaffini: A feature-based approach for robust affine and polyaffine image registration

Antoine Legouhy,Cosimo Campo,Ross Callaghan,Hojjat Azadbakht,Hui Zhang

Main category: cs.CV

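TL;DR: This paper proposes the Polaffini framework, which uses pre-trained deep learning segmentation models to extract the centroids of anatomical structures as feature points, enabling fast, robust, and accurate anatomically grounded image registration. It supports global and local affine transformations as well as polyaffine transformations with tunable smoothness, and it outperforms popular intensity-based registration methods in structural alignment and in initializing non-linear registration.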

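Details Motivation: Conventional intensity-based medical image registration relies on surrogate measures of alignment, while approaches based on anatomical features have been limited by the difficulty of extracting features reliably; recent deep learning segmentation models make fine-grained anatomical delineations reliably available, motivating new anatomically grounded registration methods. Method: Polaffini extracts the centroid of each anatomical region directly from deep learning segmentations as feature points with 1-to-1 correspondence, performs efficient global and local affine matching via closed-form solutions, and combines these into a polyaffine transformation with tunable smoothness; embedding the transformation in the log-Euclidean framework ensures diffeomorphic properties. Result: Polaffini outperforms popular intensity-based registration methods in structural alignment and noticeably improves the initialization of subsequent non-linear registration, while being fast, robust, and accurate. Conclusion: Polaffini turns the capabilities of modern deep learning segmentation into an advantage for anatomically grounded registration; it is both practical and theoretically sound, works as standalone registration or as pre-alignment for non-linear registration, and is easy to integrate into clinical image processing pipelines. Abstract: In this work we present Polaffini, a robust and versatile framework for anatomically grounded registration. Medical image registration is dominated by intensity-based registration methods that rely on surrogate measures of alignment quality. In contrast, feature-based approaches that operate by identifying explicit anatomical correspondences, while more desirable in theory, have largely fallen out of favor due to the challenges of reliably extracting features. However, such challenges are now significantly overcome thanks to recent advances in deep learning, which provide pre-trained segmentation models capable of instantly delivering reliable, fine-grained anatomical delineations. We aim to demonstrate that these advances can be leveraged to create new anatomically-grounded image registration algorithms. To this end, we propose Polaffini, which obtains, from these segmented regions, anatomically grounded feature points with 1-to-1 correspondence in a particularly simple way: extracting their centroids. These enable efficient global and local affine matching via closed-form solutions. Those are used to produce an overall transformation ranging from affine to polyaffine with tunable smoothness. Polyaffine transformations can have many more degrees of freedom than affine ones allowing for finer alignment, and their embedding in the log-Euclidean framework ensures diffeomorphic properties.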
Polaffini has applications both for standalone registration and as pre-alignment for subsequent non-linear registration, and we evaluate it against popular intensity-based registration techniques. Results demonstrate that Polaffini outperforms competing methods in terms of structural alignment and provides improved initialisation for downstream non-linear registration. Polaffini is fast, robust, and accurate, making it particularly well-suited for integration into medical image processing pipelines.
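The centroid-based, closed-form affine matching described above can be sketched in a few lines. This is a 2-D toy illustration under stated assumptions, not the Polaffini implementation: label `0` is assumed to be background, the least-squares fit via normal equations is one common closed-form choice (the abstract does not specify which is used), and the polyaffine and log-Euclidean steps are omitted entirely. All function names are hypothetical.

```python
def region_centroids(label_map):
    """Centroid (x, y) of every non-zero label in a 2-D label map."""
    sums = {}
    for y, row in enumerate(label_map):
        for x, label in enumerate(row):
            if label == 0:  # assumption: 0 marks background
                continue
            sx, sy, n = sums.get(label, (0.0, 0.0, 0))
            sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.
    Assumes `a` is square and non-singular."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_affine_2d(src, dst):
    """Least-squares affine map from matched points src -> dst.
    Returns [[a11, a12, tx], [a21, a22, ty]] from the normal equations."""
    rows = [[x, y, 1.0] for x, y in src]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for k in range(2):
        atb = [sum(r[i] * d[k] for r, d in zip(rows, dst)) for i in range(3)]
        params.append(solve_linear(ata, atb))
    return params


# Toy check: recover a known affine map from four matched "centroids".
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
dst = [(2.0 * x + 0.5 * y + 3.0, -0.5 * x + 1.0 * y - 1.0) for x, y in src]
affine = fit_affine_2d(src, dst)
centroids = region_centroids([[1, 1, 0], [0, 2, 2]])
```

Because segment labels give the correspondence for free, no iterative matching is needed; at least three non-collinear centroids are required for the 2-D fit to be well posed.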

Yuchang Jiang,Anton Raichuk,Xiaoye Tong,Vivien Sainte Fare Garnot,Daniel Ortiz-Gonzalo,Dan Morris,Konrad Schindler,Jan Dirk Wegner,Maxim Neumann

Main category: cs.CV

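TL;DR: This paper presents the first 10 m resolution tree crop map of South America, generated from Sentinel-1/2 time series with a multi-modal deep learning model. It shows that existing maps supporting the EUDR misclassify smallholder agroforestry as forest, leading to false alerts and unfair penalties, and it provides a high-accuracy baseline to support more effective, inclusive, and equitable zero-deforestation policies.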

Details Motivation: Monitoring tree crop expansion is essential for implementing zero-deforestation policies such as the EU's EUDR, but the lack of high-resolution data that distinguish diverse agricultural systems from forests severely limits this work. Method: A multi-modal, spatio-temporal deep learning model trained on Sentinel-1 and Sentinel-2 satellite image time series generates a 10 m resolution tree crop map of South America. Result: The map identifies approximately 11 million hectares of tree crops, 23% of which is linked to forest cover loss between 2000 and 2020; existing regulatory maps supporting the EUDR often misclassify established agriculture, particularly smallholder agroforestry, as "forest", creating a risk of false deforestation alerts. Conclusion: The high-resolution tree crop map can serve as a key baseline that reduces misclassification and supports more effective, inclusive, and equitable forest conservation and sustainable agriculture policies. Abstract: Monitoring tree crop expansion is vital for zero-deforestation policies like the European Union's Regulation on Deforestation-free Products (EUDR). However, these efforts are hindered by a lack of high-resolution data distinguishing diverse agricultural systems from forests. Here, we present the first 10m-resolution tree crop map for South America, generated using a multi-modal, spatio-temporal deep learning model trained on Sentinel-1 and Sentinel-2 satellite imagery time series. The map identifies approximately 11 million hectares of tree crops, 23% of which is linked to 2000-2020 forest cover loss. Critically, our analysis reveals that existing regulatory maps supporting the EUDR often classify established agriculture, particularly smallholder agroforestry, as "forest". This discrepancy risks false deforestation alerts and unfair penalties for small-scale farmers. Our work mitigates this risk by providing a high-resolution baseline, supporting conservation policies that are effective, inclusive, and equitable.

[75] DRetHTR: Linear-Time Decoder-Only Retentive Network for Handwritten Text Recognition

Changhun Kim,Martin Mayr,Thomas Gorges,Fei Wu,Mathias Seuret,Andreas Maier,Vincent Christlein

Main category: cs.CV

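TL;DR: This paper proposes DRetHTR, a decoder-only handwritten text recognition model built on Retentive Networks (RetNet). By removing softmax attention and the KV cache and introducing multi-scale sequential priors and layer-wise gamma scaling, it keeps state-of-the-art accuracy while substantially improving inference speed (1.6–1.9× faster) and memory efficiency (38–42% less).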

Details Motivation: Existing Transformer-based handwritten text recognition (HTR) systems decode slowly and use a lot of memory because the key-value (KV) cache grows with output length; a more efficient alternative architecture is needed. Method: DRetHTR replaces the Transformer decoder with RetNet, substituting softmax-free retention for softmax attention; it injects multi-scale sequential priors and introduces layer-wise gamma scaling, which progressively enlarges the effective retention horizon in deeper layers to recover a local-to-global inductive bias. Result: The model achieves best reported or competitive character error rates on IAM-A (2.26% CER), RIMES (1.81%), Bentham (3.46%), and READ-2016 (4.21%), together with a 1.6–1.9× speedup and 38–42% memory savings. Conclusion: A decoder-only RetNet can substantially improve decoding efficiency without sacrificing HTR accuracy, offering a better architectural choice for practical deployment. Abstract: State-of-the-art handwritten text recognition (HTR) systems commonly use Transformers, whose growing key-value (KV) cache makes decoding slow and memory-intensive. We introduce DRetHTR, a decoder-only model built on Retentive Networks (RetNet). Compared to an equally sized decoder-only Transformer baseline, DRetHTR delivers 1.6-1.9x faster inference with 38-42% less memory usage, without loss of accuracy. By replacing softmax attention with softmax-free retention and injecting multi-scale sequential priors, DRetHTR avoids a growing KV cache: decoding is linear in output length in both time and memory. To recover the local-to-global inductive bias of attention, we propose layer-wise gamma scaling, which progressively enlarges the effective retention horizon in deeper layers. This encourages early layers to model short-range dependencies and later layers to capture broader context, mitigating the flexibility gap introduced by removing softmax. Consequently, DRetHTR achieves best reported test character error rates of 2.26% (IAM-A, en), 1.81% (RIMES, fr), and 3.46% (Bentham, en), and is competitive on READ-2016 (de) with 4.21%. This demonstrates that decoder-only RetNet enables Transformer-level HTR accuracy with substantially improved decoding speed and memory efficiency.
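To make the "no growing KV cache" claim concrete, the sketch below implements a simplified single-head retention step in its recurrent form, following the general RetNet formulation (state update $S_n = \gamma S_{n-1} + k_n^\top v_n$, output $o_n = q_n S_n$), and compares it with the equivalent decay-weighted sum over all past steps. It is a simplification and not DRetHTR's code: positional rotations, normalization, gating, multi-scale heads, and the paper's layer-wise gamma schedule are all omitted, and the inputs are toy values.

```python
def retention_recurrent(queries, keys, values, gamma):
    """Recurrent retention: a fixed-size d_k x d_v state replaces the KV cache."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # S_n = gamma * S_{n-1} + outer(k_n, v_n)
        state = [[gamma * state[i][j] + k[i] * v[j] for j in range(d_v)]
                 for i in range(d_k)]
        # o_n = q_n S_n
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs


def retention_parallel(queries, keys, values, gamma):
    """Reference form: o_n = sum_{m <= n} gamma^(n-m) * (q_n . k_m) * v_m."""
    outputs = []
    for n, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for m in range(n + 1):
            weight = gamma ** (n - m) * sum(a * b for a, b in zip(q, keys[m]))
            out = [o + weight * x for o, x in zip(out, values[m])]
        outputs.append(out)
    return outputs


# Toy inputs: three decoding steps, d_k = 2, d_v = 2.
qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
ks = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
vs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

rec = retention_recurrent(qs, ks, vs, gamma=0.9)
par = retention_parallel(qs, ks, vs, gamma=0.9)
```

The recurrent form stores only the state matrix, so memory does not grow with the number of decoded tokens. The decay factor controls how far back the state effectively looks (the weights $\gamma^{n-m}$ sum to at most $1/(1-\gamma)$), which is the quantity that layer-wise gamma scaling is described as enlarging in deeper layers.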

[76] SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery

Lorenzo Caselli,Marco Mistretta,Simone Magistri,Andrew D. Bagdanov

Main category: cs.CV

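TL;DR: This paper proposes SpectralGCD, an efficient and effective multimodal method for Generalized Category Discovery. It uses CLIP cross-modal image-concept similarities as a unified representation and improves semantic quality and alignment through spectral filtering and bidirectional knowledge distillation, reaching state-of-the-art performance on multiple benchmarks at lower computational cost.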

Details Motivation: Methods trained only on image features tend to overfit to known classes; multimodal methods improve on this but model the modalities independently and are computationally expensive. Method: SpectralGCD builds a unified representation from CLIP cross-modal image-concept similarities; introduces Spectral Filtering, based on a teacher model's cross-modal covariance matrix, to select relevant semantic concepts; and applies forward and reverse knowledge distillation so that the student's representations remain semantically sufficient and well aligned across modalities. Result: Across six benchmark datasets, SpectralGCD achieves accuracy comparable to or significantly better than existing state-of-the-art methods at a much lower computational cost. Conclusion: By anchoring learning to explicit semantics, reducing reliance on spurious visual cues, and distilling efficiently, SpectralGCD balances high performance with high efficiency for generalized category discovery. Abstract: Generalized Category Discovery (GCD) aims to identify novel categories in unlabeled data while leveraging a small labeled subset of known classes. Training a parametric classifier solely on image features often leads to overfitting to old classes, and recent multimodal approaches improve performance by incorporating textual information. However, they treat modalities independently and incur high computational cost. We propose SpectralGCD, an efficient and effective multimodal approach to GCD that uses CLIP cross-modal image-concept similarities as a unified cross-modal representation. Each image is expressed as a mixture over semantic concepts from a large task-agnostic dictionary, which anchors learning to explicit semantics and reduces reliance on spurious visual cues. To maintain the semantic quality of representations learned by an efficient student, we introduce Spectral Filtering which exploits a cross-modal covariance matrix over the softmaxed similarities measured by a strong teacher model to automatically retain only relevant concepts from the dictionary. Forward and reverse knowledge distillation from the same teacher ensures that the cross-modal representations of the student remain both semantically sufficient and well-aligned. Across six benchmarks, SpectralGCD delivers accuracy comparable to or significantly superior to state-of-the-art methods at a fraction of the computational cost. The code is publicly available at: https://github.com/miccunifi/SpectralGCD.

[77] A High-Level Survey of Optical Remote Sensing

Panagiotis Koletsis,Vasilis Efthymiou,Maria Vakalopoulou,Nikos Komodakis,Anastasios Doulamis,Georgios Th. Papadopoulos

Main category: cs.CV

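TL;DR: This paper is a broad survey of optical remote sensing (particularly with drone-mounted RGB cameras) that gives researchers new to the field an overview of the area, key datasets, and guidance on research directions.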

Details Motivation: The existing literature lacks a survey that reviews the capabilities, tasks, datasets, and methods of optical remote sensing (especially drone RGB imagery) from a holistic perspective; a guiding reference for newcomers is needed. Method: A systematic literature review and synthesis that categorizes and summarizes the task types, technical methods, common datasets, and research trends in optical remote sensing. Result: A comprehensive knowledge framework covering the range of tasks, mainstream methods, representative datasets, and practical insights, filling the gap left by the absence of a field-wide survey. Conclusion: Optical remote sensing, especially with low-cost drone RGB imagery, already offers broad and practical capabilities; the survey provides a structured entry guide that helps researchers find a direction quickly and supports follow-up work. Abstract: In recent years, significant advances in computer vision have also propelled progress in remote sensing. Concurrently, the use of drones has expanded, with many organizations incorporating them into their operations. Most drones are equipped by default with RGB cameras, which are both robust and among the easiest sensors to use and interpret. The body of literature on optical remote sensing is vast, encompassing diverse tasks, capabilities, and methodologies. Each task or methodology could warrant a dedicated survey. This work provides a comprehensive overview of the capabilities of the field, while also presenting key information, such as datasets and insights. It aims to serve as a guide for researchers entering the field, offering high-level insights and helping them focus on areas most relevant to their interests. To the best of our knowledge, no existing survey addresses this holistic perspective.

[78] EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models

Xiaomeng Peng,Xilang Huang,Seon Han Choi

Main category: cs.CV

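TL;DR: This paper proposes EAGLE, a tuning-free framework that uses expert model outputs to guide multimodal large language models (MLLMs) toward accurate industrial anomaly detection and interpretable descriptions, and it supports its effectiveness with an analysis of attention behavior.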

Details Motivation: Existing deep learning methods produce only binary decisions and lack semantic explanations; MLLMs could generate fine-grained language-based analyses, but they require costly fine-tuning and are often less accurate than lightweight specialist detectors. Method: The authors propose EAGLE, a tuning-free expert-augmented attention guidance framework that integrates expert model outputs to guide MLLMs toward accurate detection and interpretable descriptions, and they analyze how the attention of intermediate MLLM layers to anomalous regions changes. Result: On MVTec-AD and VisA, EAGLE improves the anomaly detection performance of multiple MLLMs without any parameter updates, with results comparable to fine-tuning-based methods. Conclusion: EAGLE is an efficient, training-free way to enhance MLLMs that balances detection accuracy and interpretability; the attention analysis indicates that it increases the model's attention concentration on anomalous regions. Abstract: Industrial anomaly detection is important for smart manufacturing, but many deep learning approaches produce only binary decisions and provide limited semantic explanations. Multimodal large language models (MLLMs) can potentially generate fine-grained, language-based analyses, yet existing methods often require costly fine-tuning and do not consistently improve anomaly detection accuracy compared to lightweight specialist detectors. We propose expert-augmented attention guidance for industrial anomaly detection in MLLMs (EAGLE), a tuning-free framework that integrates outputs from an expert model to guide MLLMs toward both accurate detection and interpretable anomaly descriptions. We further study how EAGLE affects MLLM internals by examining the attention distribution of MLLMs to the anomalous image regions in the intermediate layers. We observe that successful anomaly detection is associated with increased attention concentration on anomalous regions, and EAGLE tends to encourage this alignment. Experiments on MVTec-AD and VisA show that EAGLE improves anomaly detection performance across multiple MLLMs without any parameter updates, achieving results comparable to fine-tuning based methods. Code is available at https://github.com/shengtun/Eagle

[79] 4D Monocular Surgical Reconstruction under Arbitrary Camera Motions

Jiwei Shan,Zeyu Cai,Cheng-Tai Hsieh,Yirui Li,Hao Liu,Lijun Han,Hesheng Wang,Shing Shin Cheng

Main category: cs.CV

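TL;DR: This paper proposes Local-EndoGS, a high-quality 4D reconstruction framework for monocular endoscopic video that supports arbitrary camera motion. Window-based local modeling, a coarse-to-fine initialization strategy, and long-range trajectory and physical motion constraints significantly improve reconstruction quality for deformable scenes.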

Details Motivation: Existing methods based on implicit neural representations or 3D Gaussian splatting mostly depend on fixed viewpoints, stereo depth, or accurate structure-from-motion (SfM), and struggle with the monocular, large-motion endoscopic sequences common in clinical practice. Method: Local-EndoGS combines (1) a progressive, window-based global representation that assigns a local deformable model to each observed window; (2) a coarse-to-fine initialization strategy that integrates multi-view geometry, cross-window information, and monocular depth priors; and (3) long-range 2D pixel trajectory constraints and physical motion priors that improve deformation plausibility. Result: On three public deformable endoscopic datasets, Local-EndoGS surpasses state-of-the-art methods in both appearance quality and geometric accuracy; ablation studies confirm the effectiveness of each key design. Conclusion: Local-EndoGS offers a robust and scalable approach to 4D reconstruction of monocular, large-motion endoscopic video, advancing its potential for use in real clinical settings. Abstract: Reconstructing deformable surgical scenes from endoscopic videos is challenging and clinically important. Recent state-of-the-art methods based on implicit neural representations or 3D Gaussian splatting have made notable progress. However, most are designed for deformable scenes with fixed endoscope viewpoints and rely on stereo depth priors or accurate structure-from-motion for initialization and optimization, limiting their ability to handle monocular sequences with large camera motion in real clinical settings. To address this, we propose Local-EndoGS, a high-quality 4D reconstruction framework for monocular endoscopic sequences with arbitrary camera motion. Local-EndoGS introduces a progressive, window-based global representation that allocates local deformable scene models to each observed window, enabling scalability to long sequences with substantial motion. To overcome unreliable initialization without stereo depth or accurate structure-from-motion, we design a coarse-to-fine strategy integrating multi-view geometry, cross-window information, and monocular depth priors, providing a robust foundation for optimization. We further incorporate long-range 2D pixel trajectory constraints and physical motion priors to improve deformation plausibility. Experiments on three public endoscopic datasets with deformable scenes and varying camera motions show that Local-EndoGS consistently outperforms state-of-the-art methods in appearance quality and geometry. Ablation studies validate the effectiveness of our key designs. Code will be released upon acceptance at: https://github.com/IRMVLab/Local-EndoGS.

[80] QuPAINT: Physics-Aware Instruction Tuning Approach to Quantum Material Discovery

Xuan-Bac Nguyen,Hoang-Quan Nguyen,Sankalp Pandey,Tim Faltermeier,Nicholas Borys,Hugh Churchill,Khoa Luu

Main category: cs.CV

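TL;DR: This paper proposes QuPAINT, a physics-aware multimodal framework that combines the physics-based synthetic data generator Synthia, the first quantum materials instruction dataset QMat-Instruct, and a Physics-Informed Attention module to improve the generalization and robustness of layer-number identification for two-dimensional quantum materials in optical microscopy images. It also establishes the QF-Bench benchmark.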

Details Motivation: In optical microscopy images of two-dimensional quantum materials, layer-dependent contrast is subtle, labeled data are scarce, and imaging varies widely across laboratories; existing vision models lack physical priors and generalize poorly. Method: The work introduces Synthia (a physics-based synthetic data generator built on thin-film interference), QMat-Instruct (a physics-informed multimodal instruction dataset), QuPAINT (a physics-aware instruction-tuning architecture that incorporates optical priors), and QF-Bench (a comprehensive benchmark spanning multiple materials, substrates, and imaging conditions). Result: The approach substantially reduces dependence on manual annotation, improves generalization and discriminative performance on new materials and under different hardware conditions, and yields more robust and reproducible layer-number identification on QF-Bench. Conclusion: Bringing physical priors into the multimodal learning pipeline is an effective way to tackle scientific image analysis problems with few samples, domain shift, and high variability. Abstract: Characterizing two-dimensional quantum materials from optical microscopy images is challenging due to the subtle layer-dependent contrast, limited labeled data, and significant variation across laboratories and imaging setups. Existing vision models struggle in this domain since they lack physical priors and cannot generalize to new materials or hardware conditions. This work presents a new physics-aware multimodal framework that addresses these limitations from both the data and model perspectives. We first present Synthia, a physics-based synthetic data generator that simulates realistic optical responses of quantum material flakes under thin-film interference. Synthia produces diverse and high-quality samples, helping reduce the dependence on expert manual annotation. We introduce QMat-Instruct, the first large-scale instruction dataset for quantum materials, comprising multimodal, physics-informed question-answer pairs designed to teach Multimodal Large Language Models (MLLMs) to understand the appearance and thickness of flakes. Then, we propose Physics-Aware Instruction Tuning (QuPAINT), a multimodal architecture that incorporates a Physics-Informed Attention module to fuse visual embeddings with optical priors, enabling more robust and discriminative flake representations. Finally, we establish QF-Bench, a comprehensive benchmark spanning multiple materials, substrates, and imaging settings, offering standardized protocols for fair and reproducible evaluation.

[81] Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection

Yichen Lu,Siwei Nie,Minlong Lu,Xudong Yang,Xiaobo Zhang,Peng Zhang

Main category: cs.CV

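TL;DR: This paper proposes PixTrace and CopyNCE, which use pixel-level coordinate tracking and a geometry-guided contrastive loss to improve fine-grained correspondence learning in image copy detection, achieving state-of-the-art performance and better interpretability.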

Details Motivation: Existing image copy detection methods based on view-level contrastive learning are limited under sophisticated edits because they do not model fine-grained correspondences. Method: The PixTrace module explicitly models pixel-space mappings under editing transformations, and the CopyNCE loss uses the overlap ratios provided by PixTrace to constrain similarity learning between image patches. Result: On the DISC21 dataset, the matcher reaches 88.7% uAP and 83.9% RP90 and the descriptor reaches 72.6% uAP and 68.4% RP90, leading prior methods with better interpretability. Conclusion: Integrating pixel-level geometric traceability into the contrastive learning framework suppresses supervision noise in self-supervised training and markedly improves the accuracy and interpretability of image copy detection. Abstract: Image Copy Detection (ICD) aims to identify manipulated content between image pairs through robust feature representation learning. While self-supervised learning (SSL) has advanced ICD systems, existing view-level contrastive methods struggle with sophisticated edits due to insufficient fine-grained correspondence learning. We address this limitation by exploiting the inherent geometric traceability in edited content through two key innovations. First, we propose PixTrace, a pixel coordinate tracking module that maintains explicit spatial mappings across editing transformations. Second, we introduce CopyNCE, a geometrically-guided contrastive loss that regularizes patch affinity using overlap ratios derived from PixTrace's verified mappings. Our method bridges pixel-level traceability with patch-level similarity learning, suppressing supervision noise in SSL training. Extensive experiments demonstrate not only state-of-the-art performance (88.7% uAP / 83.9% RP90 for matcher, 72.6% uAP / 68.4% RP90 for descriptor on DISC21 dataset) but also better interpretability over existing methods.

[82] FoundationPose-Initialized 3D-2D Liver Registration for Surgical Augmented Reality

Hanyuan Zhang,Lucas He,Runlong He,Abdolrahim Kadkhodamohammadi,Danail Stoyanov,Brian R. Davidson,Evangelos B. Mazomenos,Matthew J. Clarkson

Main category: cs.CV

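TL;DR: This paper proposes a lightweight non-rigid registration method that combines laparoscopic depth maps with a foundation pose estimator and replaces conventional finite-element models with NICP, achieving clinically relevant registration accuracy (9.91 mm) with lower engineering complexity.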

Details Motivation: Augmented-reality tumor localization in laparoscopic liver surgery currently relies on organ contours and complex finite-element (FE) deformation models, which carry a high modeling and engineering barrier and require specialist expertise. Method: Integrates laparoscopic depth maps with a foundation pose estimator for camera-liver pose estimation, and replaces the FE model with non-rigid iterative closest point (NICP) for deformable registration. Result: On real patient data, the depth-augmented foundation pose approach achieves 9.91 mm mean registration error in 3 cases; combined rigid-NICP registration outperforms rigid-only registration, showing that NICP can efficiently replace FE models. Conclusion: The pipeline maintains clinically relevant accuracy while offering a lighter, more engineering-friendly alternative to FE. Abstract: Augmented reality can improve tumor localization in laparoscopic liver surgery. Existing registration pipelines typically depend on organ contours; deformable (non-rigid) alignment is often handled with finite-element (FE) models coupled to dimensionality-reduction or machine-learning components. We integrate laparoscopic depth maps with a foundation pose estimator for camera-liver pose estimation and replace FE-based deformation with non-rigid iterative closest point (NICP) to lower engineering/modeling complexity and expertise requirements. On real patient data, the depth-augmented foundation pose approach achieved 9.91 mm mean registration error in 3 cases. Combined rigid-NICP registration outperformed rigid-only registration, demonstrating NICP as an efficient substitute for finite-element deformable models. This pipeline achieves clinically relevant accuracy while offering a lightweight, engineering-friendly alternative to FE-based deformation.

[83] LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs

Behzad Bozorgtabar,Dwarikanath Mahapatra,Sudipta Roy,Muzammal Naseer,Imran Razzak,Zongyuan Ge

Main category: cs.CV

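TL;DR: This paper proposes LATA, which uses Laplacian smoothing and a failure-aware conformal score to improve the efficiency and class-wise balance of zero-shot prediction sets for medical vision-language models under domain shift, without breaking exchangeability.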

Details Motivation: Split conformal prediction (SCP) for medical vision-language models suffers from overly large prediction sets, unbalanced class-wise coverage (high CCV), and difficulty maintaining finite-sample coverage in few-shot or imbalanced regimes; moreover, directly using calibration labels breaks exchangeability and voids the theoretical guarantees. Method: Proposes LATA (Laplacian-Assisted Transductive Adaptation): (1) a training- and label-free step that builds an image-image k-NN graph over the joint calibration and test pool and applies graph-Laplacian smoothing to zero-shot probabilities with a few CCCP mean-field updates; (2) a failure-aware conformal score that plugs into the ViLU framework to model instance difficulty and label plausibility; (3) variants that are strictly label-free or optionally use calibration marginals. Result: Across three medical VLMs and nine downstream tasks, LATA consistently reduces prediction set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and approaching label-using methods at far lower compute; ablations and qualitative analyses confirm that it sharpens predictions without compromising exchangeability. Conclusion: LATA is a black-box, lightweight, guarantee-preserving method for zero-shot uncertainty calibration that eases the efficiency and fairness bottlenecks of medical VLMs under domain shift and offers a more reliable, practical conformal prediction solution for clinical deployment. Abstract: Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage unbalanced (a high class-conditioned coverage gap, CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose LATA (Laplacian-Assisted Transductive Adaptation), a training- and label-free refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image-image k-NN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a failure-aware conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency and class-wise balance at fixed coverage. LATA is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant using calibration marginals once.
Across three medical VLMs and nine downstream tasks, LATA consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that LATA sharpens zero-shot predictions without compromising exchangeability.
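
For readers unfamiliar with the split conformal prediction baseline that the abstract builds on, the following is a minimal sketch of plain SCP only, not of LATA itself. It assumes the common nonconformity score `1 - p(true label)`; the function names and the toy probabilities are illustrative and do not come from the paper.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return float("inf")  # too few calibration points: keep every class
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, threshold):
    """Keep every class whose score 1 - p does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]

# Toy calibration data (hypothetical): zero-shot probabilities and true labels.
cal_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
cal_labels = [0, 1, 2, 0]

# Nonconformity score: one minus the probability given to the true label.
cal_scores = [1.0 - p[y] for p, y in zip(cal_probs, cal_labels)]
q_hat = conformal_threshold(cal_scores, alpha=0.25)

# Prediction set for a new test example at target coverage 1 - alpha.
test_set = prediction_set([0.5, 0.45, 0.05], q_hat)
print(q_hat, test_set)
```

Larger thresholds give larger (less efficient) sets, which is the quantity the abstract reports LATA as reducing at fixed coverage.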

[84] GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking

Zixu Cheng,Da Li,Jian Hu,Ziquan Liu,Wei Li,Shaogang Gong

Main category: cs.CV

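TL;DR: This paper proposes GraphThinker, a reinforcement finetuning-based method that reduces hallucinations in video reasoning by constructing event-level scene graphs and strengthening visual grounding.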

Details Motivation: Existing video reasoning models do not explicitly model causal relationships between events, which makes them prone to hallucination during reasoning; causal relationships are also implicit and costly to annotate manually. Method: Proposes GraphThinker: (1) a multimodal large language model (MLLM) constructs an event-based video scene graph (EVSG) that explicitly models intra- and inter-event relations and serves as an intermediate thinking process; (2) a visual attention reward is introduced during reinforcement finetuning to strengthen visual grounding. Result: Evaluated on RexTime and VidHalluc, GraphThinker outperforms existing methods in modeling object and event relations and in precise event localization, markedly reducing hallucinations in video reasoning. Conclusion: Combining explicit structured causal modeling with stronger visual grounding effectively mitigates hallucinations of multimodal large models in video reasoning. Abstract: Video reasoning requires understanding the causal relationships between events in a video. However, such relationships are often implicit and costly to annotate manually. While existing multimodal large language models (MLLMs) often infer event relations through dense captions or video summaries for video reasoning, such modeling still lacks causal understanding. Without explicit causal structure modeling within and across video events, these models suffer from hallucinations during video reasoning. In this work, we propose GraphThinker, a reinforcement finetuning-based method that constructs structural event-level scene graphs and enhances visual grounding to jointly reduce hallucinations in video reasoning. Specifically, we first employ an MLLM to construct an event-based video scene graph (EVSG) that explicitly models both intra- and inter-event relations, and incorporate these formed scene graphs into the MLLM as an intermediate thinking process. We also introduce a visual attention reward during reinforcement finetuning, which strengthens video grounding and further mitigates hallucinations. We evaluate GraphThinker on two datasets, RexTime and VidHalluc, where it shows superior ability to capture object and event relations with more precise event localization, reducing hallucinations in video reasoning compared to prior methods.

[85] RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward

Qiucheng Wu,Jing Shi,Simon Jenni,Kushal Kafle,Tianyu Wang,Shiyu Chang,Handong Zhao

Main category: cs.CV

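TL;DR: This paper proposes RetouchIQ, a framework in which multimodal large language model (MLLM) agents, guided by a generalist reward model, perform instruction-based executable image editing, addressing the unreliable reward signals caused by the subjectivity of creative image editing.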

Details Motivation: Existing reinforcement-learning-based MLLM image editing methods lack reliable, verifiable reward signals and struggle to reflect the subjectivity of creative editing. Method: Proposes the RetouchIQ framework, comprising: (1) an MLLM agent that interprets the user's editing intent and generates executable parameter adjustments; (2) a generalist reward model, an RL fine-tuned MLLM that generates multi-dimensional evaluation metrics for each instruction and outputs a scalar reward; (3) a dataset of 190k instruction-reasoning pairs and a new benchmark. Result: Substantially outperforms previous MLLM-based and diffusion-based editing systems in semantic consistency and perceptual quality. Conclusion: Generalist reward-driven MLLM agents can serve as flexible, explainable, and executable assistants for professional image editing. Abstract: Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable image adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model, an RL fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. Then, the reward model provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems.
Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.

[86] Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment

Ivan Rinaldi,Matteo Mendula,Nicola Fanelli,Florence Levé,Matteo Testi,Giovanna Castellano,Gennaro Vessio

Main category: cs.CV

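TL;DR: This paper introduces the ArtSound dataset and the ArtToMus framework, the first to generate music directly from artwork images without text as an intermediary, advancing research on visual-to-music cross-modal generation.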

Details Motivation: Existing image-conditioned music generation methods are limited because (i) they are trained on natural photographs and struggle to capture the semantic, stylistic, and cultural content of artworks, and (ii) they rely on image-to-text conversion, using language as a semantic shortcut that prevents genuine direct visual-to-audio modeling. Method: Builds the ArtSound dataset of 105,884 artwork-music pairs with dual-modality captions, and proposes the ArtToMus framework, which projects visual embeddings directly into the conditioning space of a latent diffusion model to generate music from artworks without text translation or language supervision. Result: Music generated by ArtToMus is musically coherent and stylistically consistent and reflects salient visual cues of the source artwork; absolute alignment scores are lower than those of text-conditioned systems, but perceptual quality is competitive and the cross-modal correspondence is meaningful. Conclusion: The work establishes direct visual-to-music generation as a new research direction and provides resources (dataset and code) that support applications in multimedia art, cultural heritage, and AI-assisted creative practice. Abstract: Music generation has advanced markedly through multimodal deep learning, enabling models to synthesize audio from text and, more recently, from images. However, existing image-conditioned systems suffer from two fundamental limitations: (i) they are typically trained on natural photographs, limiting their ability to capture the richer semantic, stylistic, and cultural content of artworks; and (ii) most rely on an image-to-text conversion stage, using language as a semantic shortcut that simplifies conditioning but prevents direct visual-to-audio learning. Motivated by these gaps, we introduce ArtSound, a large-scale multimodal dataset of 105,884 artwork-music pairs enriched with dual-modality captions, obtained by extending ArtGraph and the Free Music Archive. We further propose ArtToMus, the first framework explicitly designed for direct artwork-to-music generation, which maps digitized artworks to music without image-to-text translation or language-based semantic supervision. The framework projects visual embeddings into the conditioning space of a latent diffusion model, enabling music synthesis guided solely by visual information. Experimental results show that ArtToMus generates musically coherent and stylistically consistent outputs that reflect salient visual cues of the source artworks. While absolute alignment scores remain lower than those of text-conditioned systems (as expected given the substantially increased difficulty of removing linguistic supervision), ArtToMus achieves competitive perceptual quality and meaningful cross-modal correspondence.
This work establishes direct visual-to-music generation as a distinct and challenging research direction, and provides resources that support applications in multimedia art, cultural heritage, and AI-assisted creative practice. Code and dataset will be publicly released upon acceptance.

[87] Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery

Jowaria Khan,Anindya Sarkar,Yevgeniy Vorobeychik,Elizabeth Bondi-Kelly

Main category: cs.CV

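TL;DR: This paper proposes a geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning, using concept relevance modeling to improve target discovery efficiency when data are scarce and the environment is dynamic.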

Details Motivation: In real-world settings such as environmental monitoring, disaster response, and public health, data collection is costly, the environment changes dynamically, and ground truth is sparse and biased, which limits existing learning-based methods such as reinforcement learning. Method: Proposes a unified geospatial discovery framework with two innovations: (1) an uncertainty sampling strategy weighted by concept relevance; (2) a relevance-aware meta-batch formation strategy that promotes semantic diversity in online meta-learning. Result: Experiments on a real-world PFAS contamination dataset show that the method reliably uncovers targets with limited data and a varying environment. Conclusion: The framework mitigates the challenges of sparse annotation and environmental dynamics and improves the efficiency and generalization of geospatial target discovery. Abstract: In many real-world settings, such as environmental monitoring, disaster response, or public health, with costly and difficult data collection and dynamic environments, strategically sampling from unobserved regions is essential for efficiently uncovering hidden targets under tight resource constraints. Yet, sparse and biased geospatial ground truth limits the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of *concept relevance*, which captures how domain-specific factors influence target presence: a *concept-weighted uncertainty sampling strategy*, where uncertainty is modulated by learned relevance based on readily-available domain-specific concepts (e.g., land cover, source proximity); and a *relevance-aware meta-batch formation strategy* that promotes semantic diversity during online-meta updates, improving generalization in dynamic environments. Our experiments include testing on a real-world dataset of cancer-causing PFAS (Per- and polyfluoroalkyl substances) contamination, showcasing our method's reliability at uncovering targets with limited data and a varying environment.
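
The idea of uncertainty "modulated by learned relevance" can be illustrated with a small sketch. The abstract does not give the scoring rule, so the version below is an assumption: it multiplies the binary entropy of the predicted target probability by a per-region relevance weight in [0, 1]. The names `binary_entropy` and `select_region` and all numbers are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli prediction; 0 when the model is certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_region(pred_probs, relevance):
    """Pick the region with the highest relevance-weighted uncertainty.

    Assumed scoring rule (not from the paper): entropy * relevance.
    """
    scores = [binary_entropy(p) * r for p, r in zip(pred_probs, relevance)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Three candidate regions: predicted probability of a target, and a relevance
# weight derived from concepts such as land cover or source proximity.
pred_probs = [0.5, 0.9, 0.6]
relevance = [0.2, 1.0, 0.9]

best, scores = select_region(pred_probs, relevance)
print(best, [round(s, 3) for s in scores])
```

In this toy case plain uncertainty sampling would pick region 0 (probability 0.5), whereas the weighted score prefers region 2, which is nearly as uncertain but far more relevant.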

[88] CORAL: Correspondence Alignment for Improved Virtual Try-On

Jiyoung Kim,Youngjin Shin,Siyoon Jin,Dahyun Chung,Jisu Nam,Tongmin Kim,Jongjae Park,Hyeonwoo Kang,Seungryong Kim

Main category: cs.CV

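TL;DR: This paper proposes CORAL, a framework that explicitly aligns query-key matching to improve person-garment correspondence in virtual try-on, improving both global shape transfer and local detail preservation.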

Details Motivation: Existing virtual try-on methods struggle to preserve garment details in unpaired settings, do not explicitly model person-garment correspondence, and in particular do not explain how this correspondence emerges within Diffusion Transformers. Method: Analyzes the full 3D attention mechanism in DiT and finds that person-garment correspondence depends on precise query-key matching; building on this, proposes the CORAL framework with a correspondence distillation loss and an entropy minimization loss, together with an evaluation protocol based on a vision-language model. Result: CORAL outperforms the baseline in both global shape transfer and local detail preservation, and ablations validate each component. Conclusion: Explicitly modeling and optimizing person-garment query-key matching is key to improving virtual try-on quality, and CORAL offers an effective and interpretable solution. Abstract: Existing methods for Virtual Try-On (VTON) often struggle to preserve fine garment details, especially in unpaired settings where accurate person-garment correspondence is required. These methods do not explicitly enforce person-garment alignment and fail to explain how correspondence emerges within Diffusion Transformers (DiTs). In this paper, we first analyze full 3D attention in DiT-based architecture and reveal that the person-garment correspondence critically depends on precise person-garment query-key matching within the full 3D attention. Building on this insight, we then introduce CORrespondence ALignment (CORAL), a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. CORAL integrates two complementary components: a correspondence distillation loss that aligns reliable matches with person-garment attention, and an entropy minimization loss that sharpens the attention distribution. We further propose a VLM-based evaluation protocol to better reflect human preference. CORAL consistently improves over the baseline, enhancing both global shape transfer and local detail preservation. Extensive ablations validate our design choices.

[89] IntRec: Intent-based Retrieval with Contrastive Refinement

Pourya Shamsolmoali,Masoumeh Zareapoor,Eric Granger,Yue Lu

Main category: cs.CV

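TL;DR: This paper proposes IntRec, a framework that interactively refines object retrieval results using user feedback, markedly improving accuracy on LVIS and LVIS-Ambiguous.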

Details Motivation: Existing open-vocabulary detectors perform one-shot inference and cannot iteratively correct predictions from user feedback, which makes ambiguous queries and scenes with multiple similar objects hard to handle. Method: Designs an Intent State (IS) that maintains dual memory sets of positive anchors and negative constraints, and introduces a contrastive alignment function that ranks candidate objects, enabling feedback-based fine-grained disambiguation. Result: Achieves 35.4 AP on LVIS, surpassing OVMR, CoDet, and CAKE; on LVIS-Ambiguous, a single round of feedback yields +7.9 AP with less than 30 ms of added latency per interaction. Conclusion: IntRec demonstrates the effectiveness of interactive mechanisms in open-vocabulary object retrieval, delivering substantial gains without additional supervision. Abstract: Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
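
The ranking behaviour described above (reward similarity to confirmed cues, penalize similarity to rejected hypotheses) can be sketched as follows. The exact alignment function is not given in the abstract; this sketch assumes cosine similarity and a score of "best positive match minus a penalty times best negative match". The `IntentState` class, the `penalty` parameter, and the toy 2-D embeddings are all illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class IntentState:
    """Dual memory: positive anchors (confirmed) and negative constraints (rejected)."""

    def __init__(self):
        self.positives = []
        self.negatives = []

    def score(self, cand, penalty=1.0):
        # Assumed scoring rule: closest positive minus weighted closest negative.
        pos = max((cosine(cand, p) for p in self.positives), default=0.0)
        neg = max((cosine(cand, n) for n in self.negatives), default=0.0)
        return pos - penalty * neg

def rank(candidates, state):
    """Return candidate indices sorted from best to worst score."""
    return sorted(range(len(candidates)),
                  key=lambda i: state.score(candidates[i]), reverse=True)

state = IntentState()
state.positives.append([1.0, 0.0])             # embedding of the user's query
candidates = [[0.9, 0.1], [0.7, -0.7]]         # two detected objects

before = rank(candidates, state)               # one-shot ranking
state.negatives.append(candidates[before[0]])  # user rejects the top hit
after = rank(candidates, state)                # ranking after one feedback round
print(before, after)
```

After the single rejection the previously top-ranked candidate drops below the alternative, which mirrors the one-round corrective feedback setting evaluated on LVIS-Ambiguous.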

[90] Human-level 3D shape perception emerges from multi-view learning

Tyler Bonnen,Jitendra Malik,Angjoo Kanazawa

Main category: cs.CV

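TL;DR: This paper proposes a new multi-view neural network framework trained on visual-spatial data from natural scenes (such as camera location and depth) without object priors; zero-shot, it matches human accuracy on 3D shape inference and predicts human error patterns and reaction times.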

Details Motivation: Humans can infer 3D structure from 2D images, but computational models have long fallen short of human performance, calling for modeling approaches that better match human perception. Method: Designs a class of neural networks without object priors that take multi-view images of natural scenes as input and are trained with a visual-spatial objective (such as predicting camera pose and depth); uses zero-shot evaluation to compare model and human behavior on a standard 3D perception task. Result: The framework is the first to match human accuracy on 3D shape inference; independent readouts of model responses predict fine-grained human behavior such as error distributions and reaction times. Conclusion: Human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data alone. Abstract: Humans can infer the three-dimensional structure of objects from two-dimensional visual inputs. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we develop a modeling framework that predicts human 3D shape inferences for arbitrary objects, directly from experimental stimuli. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data; given a set of images taken from different locations within a natural scene, these models learn to predict spatial information related to these images, such as camera location and visual depth, without relying on any object-related inductive biases. Notably, these visual-spatial signals are analogous to sensory cues readily available to humans. We design a zero-shot evaluation approach to determine the performance of these 'multi-view' models on a well established 3D perception task, then compare model and human behavior. Our modeling framework is the first to match human accuracy on 3D shape inferences, even without task-specific training or fine-tuning. Remarkably, independent readouts of model responses predict fine-grained measures of human behavior, including error patterns and reaction times, revealing a natural correspondence between model dynamics and human perception. Taken together, our findings indicate that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data.
All code, human behavioral data, and experimental stimuli needed to reproduce our findings can be found on our project page.

[91] When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs

Yu Fang,Yuchun Feng,Dong Jing,Jiaqi Liu,Yue Yang,Zhenyu Wei,Daniel Szafir,Mingyu Ding

Main category: cs.CV

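TL;DR: This paper proposes Counterfactual Action Guidance (CAG), which pairs a language-unconditioned Vision-Action (VA) branch with the original VLA model for joint decision making, mitigating instruction violations caused by visual shortcuts and markedly improving the language following and task success of VLAs in counterfactual scenarios without modifying the model architecture or requiring extra training.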

Details Motivation: Under instructions that lack strong scene-specific supervision, existing Vision-Language-Action models (VLAs) are swayed by dataset biases and rely on visual shortcuts rather than language intent, producing counterfactual failures; the problem has not been studied systematically. Method: Proposes CAG, a dual-branch inference scheme that combines the standard VLA policy with a language-unconditioned Vision-Action (VA) module, performs a counterfactual comparison during action selection, and explicitly regularizes language conditioning; it needs no new data, architectural changes, or fine-tuning. Result: On the newly built LIBERO-CF counterfactual benchmark, CAG improves the language-following accuracy of π₀.₅ by 9.7% (training-free) to 15.5% (with a VA model) and task success by 3.6% to 8.5%; in real-robot experiments it reduces counterfactual failures by 9.4% and improves task success by 17.2% on average. Conclusion: CAG is a plug-and-play, training-free enhancement that reduces the reliance of VLAs on visual shortcuts and markedly improves language-following robustness on unseen tasks and in real-world settings. Abstract: Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice often fail to faithfully follow language. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act based on vision shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training regardless of language intent. To systematically study it, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs that evaluates language following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate its plug-and-play integration across diverse VLAs and consistent improvements.
For example, on LIBERO-CF, CAG improves $π_{0.5}$ by 9.7% in language following accuracy and 3.6% in task success on under-observed tasks using a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.

[92] OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents

Akashah Shabbir,Muhammad Umer Sheikh,Muhammad Akhtar Munir,Hiyam Debary,Mustansar Fiaz,Muhammad Zaigham Zaheer,Paolo Fraccaro,Fahad Shahbaz Khan,Muhammad Haris Khan,Xiao Xiang Zhu,Salman Khan

Main category: cs.CV

TL;DR: OpenEarthAgent is a unified framework for geospatial reasoning agents trained on satellite imagery and natural-language queries, using structured reasoning traces and GIS-based tools to improve multimodal reasoning in remote sensing.

Details Motivation: Extending multimodal reasoning to remote sensing is challenging due to spatial scale, geographic structures, multispectral indices, and the need for coherent multi-step logic. Method: Supervised fine-tuning on structured reasoning trajectories involving GIS operations and spectral index analyses (e.g., NDVI, NBR, NDBI), using a corpus of 14,538 training and 1,169 evaluation instances with over 100K reasoning steps. Result: The agent shows structured reasoning, stable spatial understanding, and interpretable tool-driven behavior; achieves consistent improvements over a strong baseline and competitive performance against recent open and closed-source models. Conclusion: OpenEarthAgent successfully bridges the gap in multimodal geospatial reasoning by unifying tool-augmented learning, explicit reasoning traces, and domain-specific analytical capabilities. Abstract: Recent progress in multimodal reasoning has enabled agents that can interpret imagery, connect it with language, and perform structured analytical tasks. Extending such capabilities to the remote sensing domain remains challenging, as models must reason over spatial scale, geographic structures, and multispectral indices while maintaining coherent multi-step logic. To bridge this gap, OpenEarthAgent introduces a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multistep tool interactions across diverse analytical contexts. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split. 
It spans urban, environmental, disaster, and infrastructure domains, and incorporates GIS-based operations alongside index analyses such as NDVI, NBR, and NDBI. Grounded in explicit reasoning traces, the learned agent demonstrates structured reasoning, stable spatial understanding, and interpretable behaviour through tool-driven geospatial interactions across diverse conditions. We report consistent improvements over a strong baseline and competitive performance relative to recent open and closed-source models.
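
The three spectral indices named above are all normalized differences of two bands. The sketch below uses their standard definitions: NDVI from near-infrared and red, NBR from near-infrared and the longer shortwave-infrared band, and NDBI from the shorter shortwave-infrared band and near-infrared. The function names and the reflectance values are illustrative and are not taken from the OpenEarthAgent toolset.

```python
def normalized_difference(a, b):
    """(a - b) / (a + b), returning 0.0 when both bands are zero."""
    denom = a + b
    return (a - b) / denom if denom != 0 else 0.0

def ndvi(nir, red):
    """Normalized Difference Vegetation Index: high for healthy vegetation."""
    return normalized_difference(nir, red)

def nbr(nir, swir2):
    """Normalized Burn Ratio: drops sharply over burned areas."""
    return normalized_difference(nir, swir2)

def ndbi(swir1, nir):
    """Normalized Difference Built-up Index: positive over built-up surfaces."""
    return normalized_difference(swir1, nir)

# Hypothetical surface reflectances for one vegetated pixel.
pixel = {"red": 0.1, "nir": 0.5, "swir1": 0.25, "swir2": 0.125}

print(round(ndvi(pixel["nir"], pixel["red"]), 3))
print(round(nbr(pixel["nir"], pixel["swir2"]), 3))
print(round(ndbi(pixel["swir1"], pixel["nir"]), 3))
```

In a tool-augmented agent, each of these would typically be exposed as a separate callable tool operating on whole raster bands rather than single pixels.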

cs.OH [Back]

[93] A Conceptual Hybrid Framework for Post-Quantum Security: Integrating BB84 QKD, AES, and Bio-inspired Mechanisms

Md. Ismiel Hossen Abir

Main category: cs.OH

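TL;DR: This paper proposes a hybrid security framework for the post-quantum era that combines AES, BB84 quantum key distribution, quantum state comparison, and an immune-inspired system to counter the threat Shor's algorithm poses to RSA; it is currently a conceptual design that has not been implemented or rigorously validated.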

Details Motivation: Classical cryptosystems such as RSA face a serious threat from Shor's quantum algorithm, so a new protection framework covering both classical and quantum security is needed. Method: Proposes a conceptual hybrid security framework that combines AES encryption, BB84 quantum key distribution, quantum state comparison for authentication, and a bio-inspired immune mechanism. Result: In theory, the framework provides secure key distribution (BB84), lightweight authentication (quantum state comparison), and adaptive threat detection (the immune-inspired system), and it is scalable. Conclusion: The conceptual framework offers a new approach to data protection in the post-quantum era, but concrete implementation, security proofs, and experimental validation remain future work. Abstract: Quantum computing is a significant risk to classical cryptography, especially RSA, which depends on the difficulty of factoring large numbers. Classical factorization methods, such as Trial Division and Pollard's Rho, are inefficient for large keys, while Shor's quantum algorithm can break RSA efficiently in polynomial time. This research studies RSA's vulnerabilities under both classical and quantum attacks and designs a hybrid security framework to ensure data protection in the post-quantum era. The conceptual framework combines AES encryption for classical security, BB84 Quantum Key Distribution (QKD) for secure key exchange with eavesdropping detection, quantum state comparison for lightweight authentication, and a bio-inspired immune system for adaptive threat detection. RSA is vulnerable to Shor's algorithm, BB84 achieves full key agreement in ideal conditions, and it detects eavesdropping with high accuracy. The conceptual model includes both classical and quantum security methods, providing a scalable and adaptive solution for post-quantum data protection. This work primarily proposes a conceptual framework. Detailed implementation, security proofs, and extensive experimental validation are considered future work.
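
The two BB84 properties mentioned above (full key agreement in ideal conditions, and detectable eavesdropping) can be illustrated with a purely classical toy simulation. This is a sketch under idealized assumptions (no channel noise, a simple intercept-resend attacker), not the paper's implementation; the function names and parameters are hypothetical.

```python
import random

def bb84_sift(n_bits, eavesdrop=False, seed=0):
    """Simulate BB84 and return Alice's and Bob's sifted keys."""
    rng = random.Random(seed)
    alice_bits = [rng.randint(0, 1) for _ in range(n_bits)]
    alice_bases = [rng.randint(0, 1) for _ in range(n_bits)]  # 0 or 1: two bases
    bob_bases = [rng.randint(0, 1) for _ in range(n_bits)]
    sifted_alice, sifted_bob = [], []
    for bit, a_basis, b_basis in zip(alice_bits, alice_bases, bob_bases):
        sent_bit, sent_basis = bit, a_basis
        if eavesdrop:
            # Intercept-resend: Eve measures in a random basis and resends.
            e_basis = rng.randint(0, 1)
            if e_basis != sent_basis:
                sent_bit = rng.randint(0, 1)  # wrong basis gives a random result
            sent_basis = e_basis
        # Bob's measurement is faithful only if his basis matches the state's.
        received = sent_bit if b_basis == sent_basis else rng.randint(0, 1)
        if a_basis == b_basis:  # sifting: keep rounds where the bases agree
            sifted_alice.append(bit)
            sifted_bob.append(received)
    return sifted_alice, sifted_bob

def error_rate(key_a, key_b):
    """Fraction of positions where the two sifted keys disagree."""
    return sum(x != y for x, y in zip(key_a, key_b)) / len(key_a) if key_a else 0.0

clean = error_rate(*bb84_sift(2000))
attacked = error_rate(*bb84_sift(2000, eavesdrop=True))
print(clean, round(attacked, 3))
```

Without an eavesdropper the sifted keys agree exactly; under intercept-resend the expected error rate on the sifted key is about 25%, which is what Alice and Bob would detect by comparing a sample of their bits.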