cs.CL

[1] References Improve LLM Alignment in Non-Verifiable Domains

Kejian Shi, Yixin Liu, Peifeng Wang, Alexander R. Fabbri, Shafiq Joty, Arman Cohan

Main category: cs.CL

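TL;DR: This paper proposes reference-guided LLM evaluators as a substitute for ground-truth verifiers in non-verifiable domains such as LLM alignment, where RLVR cannot be applied directly. Reference outputs from frontier models or human writers improve the discriminative ability of LLM judges, and using those judges for self-improvement alignment training yields significant performance gains.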

Motivation: Existing Reinforcement Learning with Verifiable Rewards (RLVR) methods depend on ground-truth verifiers and cannot be applied directly to non-verifiable tasks such as LLM alignment, so a soft verification mechanism is needed.

Method: Design reference-augmented LLM evaluation protocols that use reference outputs from frontier models or humans to improve the accuracy of LLM judges at different capability levels. On top of these, build a reference-guided self-improvement alignment framework in which the improved judges guide the model's own optimization.

Result: On AlpacaEval and Arena-Hard, Llama-3-8B-Instruct reaches 73.1% and 58.7%, and Qwen2.5-7B reaches 70.0% and 74.1%. Average absolute gains are +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement, with performance comparable to training with the ArmoRM reward model.

Conclusion: Reference-guided LLM evaluators can compensate for the lack of ground-truth verifiers in non-verifiable domains and offer a new paradigm for LLM post-training.

Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether reference-guided LLM-evaluators can bridge this gap by serving as soft "verifiers". First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs. Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality (i.e., human-written) references. Building on these improved judges, we demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to self-improve. We show that reference-guided self-improvement yields clear gains over both direct SFT on reference outputs and self-improvement with reference-free judges, achieving performance comparable to training with ArmoRM, a strong finetuned reward model. Specifically, our method achieves 73.1% and 58.7% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B, corresponding to average absolute gains of +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement on AlpacaEval / Arena-Hard. These results highlight the potential of using reference-guided LLM-evaluators to enable effective LLM post-training in non-verifiable domains.

[2] Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark

Charalampos Mastrokostas, Nikolaos Giarelis, Nikos Karacapilidis

Main category: cs.CL

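TL;DR: For Greek question answering, this paper builds DemosQA, a new dataset that reflects Greek social and cultural characteristics, proposes an efficient LLM evaluation framework adaptable to multiple languages, and systematically evaluates 11 monolingual and multilingual LLMs on 6 Greek QA datasets.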

Motivation: Existing LLMs mainly target high-resource languages such as English. Multilingual models carry training-data bias and struggle to reflect the social, cultural, and historical characteristics of under-resourced languages such as Greek. Monolingual models have been developed, but their effectiveness on language-specific tasks is understudied.

Method: Build DemosQA, a Greek dataset drawn from social media questions and answers; design a memory-efficient LLM evaluation framework; evaluate 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets using 3 prompting strategies.

Result: Contributes DemosQA, a QA dataset focused on the Greek social and cultural context; demonstrates the generality and efficiency of the evaluation framework; reveals performance differences across LLMs on Greek QA and the effect of prompting strategy.

Conclusion: Monolingual LLMs show potential on Greek QA but need to be paired with high-quality localized data and adapted evaluation methods. The work provides data, tools, and empirical baselines for LLM research on under-resourced languages.

Abstract: Recent advancements in Natural Language Processing and Deep Learning have enabled the development of Large Language Models (LLMs), which have significantly advanced the state-of-the-art across a wide range of tasks, including Question Answering (QA). Despite these advancements, research on LLMs has primarily targeted high-resourced languages (e.g., English), and only recently has attention shifted toward multilingual models. However, these models demonstrate a training data bias towards a small number of popular languages or rely on transfer learning from high- to under-resourced languages; this may lead to a misrepresentation of social, cultural, and historical aspects. To address this challenge, monolingual LLMs have been developed for under-resourced languages; however, their effectiveness remains less studied when compared to multilingual counterparts on language-specific tasks. In this study, we address this research gap in Greek QA by contributing: (i) DemosQA, a novel dataset, which is constructed using social media user questions and community-reviewed answers to better capture the Greek social and cultural zeitgeist; (ii) a memory-efficient LLM evaluation framework adaptable to diverse QA datasets and languages; and (iii) an extensive evaluation of 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets using 3 different prompting strategies. We release our code and data to facilitate reproducibility.

[3] One-step Language Modeling via Continuous Denoising

Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M. Boffi, Jinwoo Kim

Main category: cs.CL

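TL;DR: This paper proposes a flow-based continuous denoising language model (FLM) and its distilled version (FMLM), which outperform discrete diffusion models in both generation quality and speed, with especially large gains in few-step (including one-step) generation.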

Motivation: Discrete diffusion language models lose sample quality sharply in few-step generation and so fail to deliver their promised speed advantage over autoregressive models. The paper asks whether continuous flow models can replace discrete diffusion and achieve better few-step, high-quality generation on discrete modalities such as text.

Method: Build a flow-based language model (FLM) that performs continuous denoising in Euclidean space over one-hot token encodings, train it to predict clean data with a cross-entropy objective, and introduce a simple time reparameterization that improves training stability and generation quality. FLM is then distilled into a flow map language model (FMLM) that supports few-step and even one-step generation.

Result: On LM1B and OWT, FLM matches the generation quality of state-of-the-art discrete diffusion models. FMLM outperforms recent few-step methods across the board, and its one-step generation exceeds their 8-step quality.

Conclusion: Discrete diffusion processes are not a necessary ingredient for generative modeling over discrete modalities. Flow-based continuous denoising can model language more efficiently and at high quality, opening a path toward accelerated flow-based language modeling at scale.

Abstract: Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. In practice, however, they exhibit a sharp degradation of sample quality in the few-step regime, failing to realize this promise. Here we show that language models leveraging flow-based continuous denoising can outperform discrete diffusion in both quality and speed. By revisiting the fundamentals of flows over discrete modalities, we build a flow-based language model (FLM) that performs Euclidean denoising over one-hot token encodings. We show that the model can be trained by predicting the clean data via a cross entropy objective, where we introduce a simple time reparameterization that greatly improves training stability and generation quality. By distilling FLM into its associated flow map, we obtain a distilled flow map language model (FMLM) capable of few-step generation. On the LM1B and OWT language datasets, FLM attains generation quality matching state-of-the-art discrete diffusion models. With FMLM, our approach outperforms recent few-step language models across the board, with one-step generation exceeding their 8-step quality. Our work calls into question the widely held hypothesis that discrete diffusion processes are necessary for generative modeling over discrete modalities, and paves the way toward accelerated flow-based language modeling at scale. Code is available at https://github.com/david3684/flm.
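The training idea in the abstract (Euclidean denoising over one-hot encodings, with a cross-entropy loss on the clean token) can be illustrated with a toy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the linear interpolation path between Gaussian noise and the one-hot vector, the function names, and the stand-in `toy_denoiser`. The paper's time reparameterization and the flow-map distillation are not shown.

```python
import math
import random


def one_hot(token_id, vocab_size):
    """Encode a token as a one-hot vector in R^vocab_size."""
    return [1.0 if i == token_id else 0.0 for i in range(vocab_size)]


def noisy_point(token_id, vocab_size, t, rng):
    """Point on an ASSUMED linear path: pure noise at t=0, clean one-hot at t=1."""
    clean = one_hot(token_id, vocab_size)
    noise = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
    return [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]


def cross_entropy(logits, token_id):
    """Negative log-probability of the clean token under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[token_id]


def toy_denoiser(x_t):
    """Hypothetical stand-in for a neural network: reuse the noisy input as logits."""
    return list(x_t)


rng = random.Random(0)
x_t = noisy_point(token_id=2, vocab_size=5, t=0.9, rng=rng)
loss = cross_entropy(toy_denoiser(x_t), token_id=2)
print(f"cross-entropy on the clean token: {loss:.4f}")
```

In a real model the denoiser would be a trained network that takes the whole noisy sequence and the time `t`; the sketch only shows where the continuous input and the discrete cross-entropy target meet.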

[4] Claim Automation using Large Language Model

Zhengda Mo, Zhiyu Quan, Eli O'Donohue, Kaiwen Zhong

Main category: cs.CL

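TL;DR: This paper proposes a locally deployed, governance-aware language modeling component for the insurance domain. Pretrained LLMs are fine-tuned with LoRA to generate structured corrective-action recommendations from unstructured claim narratives, and a multi-dimensional evaluation shows the result substantially outperforms general-purpose LLMs.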

Motivation: Large language models (LLMs) perform well on general-purpose tasks, but their deployment in tightly regulated, data-sensitive domains such as insurance remains limited. A tailored approach that balances performance, interpretability, and compliance is needed.

Method: Using millions of historical warranty claims, fine-tune pretrained LLMs with Low-Rank Adaptation (LoRA) to build a locally deployed, governance-aware language modeling component that serves as the initial decision module in the claim processing pipeline. Evaluation combines automated semantic similarity metrics with human evaluation.

Result: For approximately 80% of evaluated cases, the domain-tuned model produces corrective actions that are near-identical to the ground truth, substantially outperforming commercial general-purpose and prompt-based LLMs.

Conclusion: Domain-adaptive fine-tuning brings the model's output distribution closer to real operational data and is a key building block for reliable, governable insurance AI applications.

Abstract: While Large Language Models (LLMs) have achieved strong performance on general-purpose language tasks, their deployment in regulated and data-sensitive domains, including insurance, remains limited. Leveraging millions of historical warranty claims, we propose a locally deployed governance-aware language modeling component that generates structured corrective-action recommendations from unstructured claim narratives. We fine-tune pretrained LLMs using Low-Rank Adaptation (LoRA), scoping the model to an initial decision module within the claim processing pipeline to speed up claim adjusters' decisions. We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy. Our results show that domain-specific fine-tuning substantially outperforms commercial general-purpose and prompt-based LLMs, with approximately 80% of the evaluated cases achieving near-identical matches to ground-truth corrective actions. Overall, this study provides both theoretical and empirical evidence to prove that domain-adaptive fine-tuning can align model output distributions more closely with real-world operational data, demonstrating its promise as a reliable and governable building block for insurance applications.
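For readers unfamiliar with LoRA, the general mechanism is that a frozen weight matrix W is augmented with a trainable low-rank product, giving an effective weight W + (alpha / r) * B A. The sketch below shows that arithmetic in plain Python. The matrix sizes, rank, and alpha are illustrative assumptions; the abstract does not state the configuration the authors used.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen weight, d_out x d_in
    b: trainable factor, d_out x r (conventionally initialised to zeros)
    a: trainable factor, r x d_in
    """
    r = len(a)  # the LoRA rank is the number of rows of A
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def trainable_params(d_out, d_in, r):
    """Parameters in B and A together, versus d_out * d_in for full fine-tuning."""
    return r * (d_out + d_in)


# Illustrative numbers only: a 2x3 frozen weight adapted with rank 1.
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
a = [[0.5, 0.5, 0.5]]  # 1 x 3
b = [[2.0], [0.0]]     # 2 x 1
print(lora_effective_weight(w, a, b, alpha=1.0))
print(trainable_params(4096, 4096, 8), "trainable vs", 4096 * 4096, "full")
```

Because B starts at zero in the usual setup, the adapted model initially behaves exactly like the pretrained one, and only the small factors are updated during fine-tuning.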

[5] BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization

Ahmed Rafid, Rumman Adib, Fariya Ahmed, Ajwad Abrar, Mohammed Saidul Islam

Main category: cs.CL

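TL;DR: This paper proposes BanglaSummEval, a reference-free, question-answering-based framework for evaluating the factual consistency of Bangla summaries. A single multilingual instruction-tuned model handles question generation, answering, and importance weighting, BERTScore-Recall is used to judge semantic consistency, and the resulting scores correlate strongly with human judgments.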

Motivation: Most existing factual consistency metrics overlook Bangla, an under-resourced language, and depend on reference summaries. A reliable, interpretable, reference-free evaluation method for the language is missing.

Method: Build BanglaSummEval, a reference-free QA-based evaluation framework. Questions are generated automatically from the source document and the summary; one multilingual instruction-tuned model performs question generation, question answering, candidate answer extraction, and question importance weighting; answers are compared with BERTScore-Recall to measure semantic consistency.

Result: Validated on 300 human-written Bangla summaries from the educational and medical domains, the scores correlate strongly with expert human ratings (Pearson r = 0.694, Spearman ρ = 0.763), and the framework provides interpretable, step-wise diagnostics.

Conclusion: BanglaSummEval offers a practical and transparent approach to factual consistency evaluation for low-resource language summarization, and its unified single-model design reduces system complexity and computational cost.

Abstract: Evaluating factual consistency is essential for reliable text summarization, particularly in high-stakes domains such as healthcare and news. However, most existing evaluation metrics overlook Bangla, a widely spoken yet under-resourced language, and often depend on reference summaries. We introduce BanglaSummEval, a reference-free, question-answering-based framework for evaluating factual consistency in Bangla summarization. The proposed method assesses both factual accuracy and content coverage through automatically generated questions and answers derived from the source document and the summary. A single multilingual instruction-tuned language model handles question generation, question answering, candidate answer extraction, and question importance weighting. This unified design reduces system complexity and computational cost. To capture semantic consistency beyond surface-level overlap, we use BERTScore-Recall for answer comparison. We validate BanglaSummEval on 300 human-written summaries from educational and medical domains, demonstrating strong correlation with expert human judgments (Pearson's $r = 0.694$, Spearman's $\rho = 0.763$). By providing interpretable, step-wise diagnostics alongside reliable evaluation scores, BanglaSummEval offers a practical and transparent solution for factual consistency evaluation in low-resource language settings.
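The validation step above reports Pearson and Spearman correlations between metric scores and expert ratings. As a reminder of what those two numbers measure, here is a minimal standard-library sketch. The score lists are made-up placeholders rather than data from the paper, and the helper names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder scores for five summaries (not from the paper).
metric_scores = [0.62, 0.80, 0.45, 0.91, 0.70]
human_scores = [3.0, 4.0, 2.0, 5.0, 4.0]
print(f"Pearson r = {pearson(metric_scores, human_scores):.3f}")
print(f"Spearman rho = {spearman(metric_scores, human_scores):.3f}")
```

Pearson rewards a linear relationship between the two score lists, while Spearman only asks that they order the summaries similarly, which is why a metric can score higher on one than the other.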

[6] Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect

Minh Duc Bui, Manuel Mager, Peter Herbert Kann, Katharina von der Wense

Main category: cs.CL

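TL;DR: This paper presents the first NLP research on Meenzerisch, the dialect of Mainz. It builds the first NLP-ready digital dictionary for the dialect (2,351 entries) and evaluates LLMs on definition generation and dialect word generation. The best models reach only 6.27% and 1.51% accuracy, and even with few-shot or rule-based help accuracy stays below 10%, showing that more resources and research are needed.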

Motivation: Meenzerisch is on the verge of dying out. NLP could support its preservation and revival, but no NLP research has previously addressed the dialect.

Method: Build a digital dictionary derived from Schramm (1966) that pairs 2,351 Meenzerisch words with meanings in Standard German, and define two tasks: LLMs generate definitions for dialect words, and LLMs generate dialect words from definitions. Few-shot learning and rule extraction from the training set are tried as enhancements.

Result: LLMs perform very poorly on both tasks. The best definition-generation accuracy is 6.27% and the best word-generation accuracy is 1.51%; with few-shot learning and extracted rules, accuracy improves but remains below 10%.

Conclusion: Current LLMs cannot handle Meenzerisch effectively. The results highlight the scarcity of NLP resources for German dialects and the need for more data and dedicated research.

Abstract: Meenzerisch, the dialect spoken in the German city of Mainz, is also the traditional language of the Mainz carnival, a yearly celebration well known throughout Germany. However, Meenzerisch is on the verge of dying out, a fate it shares with many other German dialects. Natural language processing (NLP) has the potential to help with the preservation and revival efforts of languages and dialects. However, so far no NLP research has looked at Meenzerisch. This work presents the first research in the field of NLP that is explicitly focused on the dialect of Mainz. We introduce a digital dictionary, an NLP-ready dataset derived from an existing resource (Schramm, 1966), to support researchers in modeling and benchmarking the language. It contains 2,351 words in the dialect paired with their meanings described in Standard German. We then use this dataset to answer the following research questions: (1) Can state-of-the-art large language models (LLMs) generate definitions for dialect words? (2) Can LLMs generate words in Meenzerisch, given their definitions? Our experiments show that LLMs can do neither: the best model for definitions reaches only 6.27% accuracy and the best word generation model's accuracy is 1.51%. We then conduct two additional experiments in order to see if accuracy is improved by few-shot learning and by extracting rules from the training set, which are then passed to the LLM. While those approaches are able to improve the results, accuracy remains below 10%. This highlights that additional resources and an intensification of research efforts focused on German dialects are desperately needed.

[7] ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

Ofer Meshi, Krisztian Balog, Sally Goldman, Avi Caciularu, Guy Tennenholtz, Jihwan Jeong, Amir Globerson, Craig Boutilier

Main category: cs.CL

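TL;DR: This paper introduces the ConvApparel dataset and a comprehensive validation framework to address the "realism gap" of LLM-based user simulators. Experiments show that data-driven simulators perform better under counterfactual validation.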

Motivation: LLM-based user simulators suffer from a "realism gap", so conversational systems optimized in simulation may perform poorly in the real world.

Method: Build ConvApparel, a dataset of human-AI conversations collected with a dual-agent protocol ("good" and "bad" recommenders) and annotated with first-person user satisfaction, which enables counterfactual validation. Propose an evaluation framework that combines statistical alignment, a human-likeness score, and counterfactual validation.

Result: All evaluated user simulators show a significant realism gap. Data-driven simulators nonetheless outperform a prompted baseline under counterfactual validation, adapting more realistically to unseen behaviors.

Conclusion: ConvApparel and the validation framework provide a new benchmark for measuring and improving user simulator realism. Data-driven approaches are imperfect but more robust.

Abstract: The promise of LLM-based user simulators to improve conversational AI is hindered by a critical "realism gap," leading to systems that are optimized for simulated interactions, but may fail to perform well in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol, using both "good" and "bad" recommenders, enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.

[8] When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English

Hasan Can Biyik, Libby Barak, Jing Peng, Anna Feldman

Main category: cs.CL

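TL;DR: This paper studies transfer learning in cross-lingual euphemism detection and finds that semantic overlap is not enough to guarantee positive transfer. In the low-resource Turkish-to-English direction in particular, performance can degrade even on overlapping euphemisms, while training on non-overlapping euphemisms (NOPETs) can sometimes improve results.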

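Motivation: Euphemisms depend heavily on cultural and pragmatic context, which makes them hard to model, especially across languages. How well existing cross-lingual transfer works for euphemism detection is unclear.

Method: Divide Potentially Euphemistic Terms (PETs) in Turkish and English into Overlapping (OPETs) and Non-Overlapping (NOPETs) subsets according to their functional, pragmatic, and semantic alignment, then run cross-lingual transfer experiments and category-level analysis on the subsets.

Result: Transfer is asymmetric, and semantic overlap does not guarantee positive transfer. In the Turkish-to-English direction, performance can drop on OPETs and in some cases improves under NOPET-based training. Differences in label distribution partly explain this. Domain alignment may influence transfer, but the evidence is limited by data sparsity.

Conclusion: Transfer in cross-lingual euphemism detection depends not only on semantic equivalence but also on label distribution, pragmatic alignment, and resource conditions, so finer-grained cross-lingual alignment frameworks are needed.

Abstract: Euphemisms substitute socially sensitive expressions, often softening or reframing meaning, and their reliance on cultural and pragmatic context complicates modeling across languages. In this study, we investigate how cross-lingual equivalence influences transfer in multilingual euphemism detection. We categorize Potentially Euphemistic Terms (PETs) in Turkish and English into Overlapping (OPETs) and Non-Overlapping (NOPETs) subsets based on their functional, pragmatic, and semantic alignment. Our findings reveal a transfer asymmetry: semantic overlap is insufficient to guarantee positive transfer, particularly in low-resource Turkish-to-English direction, where performance can degrade even for overlapping euphemisms, and in some cases, improve under NOPET-based training. Differences in label distribution help explain these counterintuitive results. Category-level analysis suggests that transfer may be influenced by domain-specific alignment, though evidence is limited by sparsity.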

[9] Eigenmood Space: Uncertainty-Aware Spectral Graph Analysis of Psychological Patterns in Classical Persian Poetry

Kourosh Shahnazari, Seyed Moein Ayyoubzadeh, Mohammadali Keshtparvar

Main category: cs.CL

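TL;DR: This paper proposes an uncertainty-aware computational framework for poet-level psychological analysis of classical Persian poetry. Large-scale automatic multi-label annotation, confidence-weighted aggregation, and a graph embedding (Eigenmood) enable scalable, auditable computational humanities analysis while preserving interpretive caution.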

Motivation: Classical Persian poetry expresses emotion through metaphor, intertextuality, and rhetorical indirection, so it demands close reading yet resists reproducible comparison at scale. A computational humanities method that balances interpretability and scalability is needed.

Method: Using large-scale automatic multi-label annotation, assign each verse a set of psychological concepts, confidence scores, and an abstention flag. Build a Poet × Concept matrix and quantify individuality with Jensen-Shannon and Kullback-Leibler divergence. Build a confidence-weighted concept co-occurrence graph and define the Eigenmood embedding through Laplacian spectral decomposition. Add sensitivity analysis, selection-bias diagnostics, and a distant-to-close reading workflow.

Result: On a corpus of 61,573 verses by 10 poets, 22.2% of verses are marked as abstentions, which underscores the role of uncertainty modeling in robust analysis. Eigenmood axes support retrieval of verse-level exemplars, giving auditable and interpretable poet-level psychological representations.

Conclusion: The framework propagates uncertainty from verse-level evidence to poet-level inference, balancing scalable analysis with interpretive caution in the digital humanities.

Abstract: Classical Persian poetry is a historically sustained archive in which affective life is expressed through metaphor, intertextual convention, and rhetorical indirection. These properties make close reading indispensable while limiting reproducible comparison at scale. We present an uncertainty-aware computational framework for poet-level psychological analysis based on large-scale automatic multi-label annotation. Each verse is associated with a set of psychological concepts, per-label confidence scores, and an abstention flag that signals insufficient evidence. We aggregate confidence-weighted evidence into a Poet × Concept matrix, interpret each poet as a probability distribution over concepts, and quantify poetic individuality as divergence from a corpus baseline using Jensen-Shannon divergence and Kullback-Leibler divergence. To capture relational structure beyond marginals, we build a confidence-weighted co-occurrence graph over concepts and define an Eigenmood embedding through Laplacian spectral decomposition. On a corpus of 61,573 verses across 10 poets, 22.2% of verses are abstained, underscoring the analytical importance of uncertainty. We further report sensitivity analysis under confidence thresholding, selection-bias diagnostics that treat abstention as a category, and a distant-to-close workflow that retrieves verse-level exemplars along Eigenmood axes. The resulting framework supports scalable, auditable digital-humanities analysis while preserving interpretive caution by propagating uncertainty from verse-level evidence to poet-level inference.
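The aggregation and divergence steps described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the annotation tuple layout, the concept names, the poet identifiers, and the choice of an unweighted mean of poet distributions as the corpus baseline are all hypothetical. The Laplacian spectral step is omitted because it needs a linear algebra library.

```python
import math
from collections import defaultdict


def poet_distributions(annotations, concepts):
    """Sum per-label confidences per poet, skip abstained verses, then normalise.

    annotations: iterable of (poet, {concept: confidence}, abstained) tuples.
    """
    mass = defaultdict(lambda: dict.fromkeys(concepts, 0.0))
    for poet, labels, abstained in annotations:
        if abstained:
            continue  # abstained verses contribute no evidence
        for concept, confidence in labels.items():
            mass[poet][concept] += confidence
    dists = {}
    for poet, row in mass.items():
        total = sum(row.values())
        if total > 0:
            dists[poet] = [row[c] / total for c in concepts]
    return dists


def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical toy annotations, not data from the paper.
concepts = ["longing", "joy", "grief"]
annotations = [
    ("poet_a", {"longing": 0.9, "grief": 0.3}, False),
    ("poet_a", {"joy": 0.2}, True),  # abstained, so ignored
    ("poet_b", {"joy": 0.8, "longing": 0.2}, False),
]
dists = poet_distributions(annotations, concepts)
# ASSUMPTION: the corpus baseline is the unweighted mean of poet distributions.
baseline = [sum(d[i] for d in dists.values()) / len(dists) for i in range(len(concepts))]
for poet, dist in dists.items():
    print(poet, round(js_divergence(dist, baseline), 4))
```

Jensen-Shannon divergence is symmetric and always finite, which makes it convenient when a poet never uses some concept; plain KL divergence would be infinite in the direction where the baseline assigns zero mass.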

[10] Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

Serin Kim, Sangam Lee, Dongha Lee

Main category: cs.CL

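TL;DR: This paper presents Persona2Web, the first benchmark for evaluating personalized web agents on the real open web. It is built on the clarify-to-personalize principle, under which agents resolve query ambiguity from user history rather than from explicit instructions.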

Motivation: Current web agents lack personalization and struggle to interpret ambiguous queries from users' implicit preferences and context, so a benchmark that measures personalization is needed.

Method: Build the Persona2Web benchmark, consisting of user histories that implicitly reveal long-term preferences, ambiguous queries that require preference inference, and a reasoning-aware evaluation framework for fine-grained assessment of personalization. Run experiments across agent architectures, backbone models, and history access schemes.

Result: The experiments reveal key challenges in personalized web agent behavior and show both the value and the difficulty of resolving ambiguity from user history.

Conclusion: Persona2Web provides a reproducible, extensible evaluation standard for personalized web agents and pushes agents from an instruction-driven toward a history-driven personalization paradigm.

Abstract: Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every detail of their intent, practical web agents must be able to interpret ambiguous queries by inferring user preferences and contexts. To address this challenge, we present Persona2Web, the first benchmark for evaluating personalized web agents on the real open web, built upon the clarify-to-personalize principle, which requires agents to resolve ambiguity based on user history rather than relying on explicit instructions. Persona2Web consists of: (1) user histories that reveal preferences implicitly over long time spans, (2) ambiguous queries that require agents to infer implicit user preferences, and (3) a reasoning-aware evaluation framework that enables fine-grained assessment of personalization. We conduct extensive experiments across various agent architectures, backbone models, history access schemes, and queries with varying ambiguity levels, revealing key challenges in personalized web agent behavior. For reproducibility, our codes and datasets are publicly available at https://anonymous.4open.science/r/Persona2Web-73E8.

[11] ReIn: Conversational Error Recovery with Reasoning Inception

Takyoung Kim, Jinseok Nam, Chandrayee Basu, Xing Fan, Chengyuan Ma, Heng Ji, Gokhan Tur, Dilek Hakkani-Tür

Main category: cs.CL

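TL;DR: This paper proposes Reasoning Inception (ReIn), a test-time intervention that improves a conversational agent's ability to recover from user-induced errors without modifying model parameters or system prompts.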

Motivation: LLM-based conversational agents perform well on fixed task-oriented datasets but remain fragile when users cause unanticipated errors. The paper focuses on error recovery rather than prevention, and looks for effective recovery mechanisms under the realistic constraint that the model cannot be fine-tuned and the prompt cannot be changed.

Method: ReIn uses an external inception module to identify predefined errors in the dialogue context and generate a recovery plan, then injects that plan into the agent's internal reasoning process to guide corrective actions, leaving model parameters and system prompts untouched.

Result: In simulated failure scenarios involving ambiguous and unsupported user requests, ReIn substantially improves task success, generalizes to unseen error types, and consistently outperforms explicit prompt-modification approaches.

Conclusion: ReIn is an efficient, plug-and-play error recovery strategy. Jointly defining recovery tools with ReIn can safely and effectively improve the robustness of conversational agents without modifying the backbone model or system prompts.

Abstract: Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors. Rather than focusing on error prevention, this work focuses on error recovery, which necessitates the accurate diagnosis of erroneous dialogue contexts and execution of proper recovery plans. Under realistic constraints precluding model fine-tuning or prompt modification due to significant cost and time requirements, we explore whether agents can recover from contextually flawed interactions and how their behavior can be adapted without altering model parameters and prompts. To this end, we propose Reasoning Inception (ReIn), a test-time intervention method that plants an initial reasoning into the agent's decision-making process. Specifically, an external inception module identifies predefined errors within the dialogue context and generates recovery plans, which are subsequently integrated into the agent's internal reasoning process to guide corrective actions, without modifying its parameters or system prompts. We evaluate ReIn by systematically simulating conversational failure scenarios that directly hinder successful completion of user goals: user's ambiguous and unsupported requests. Across diverse combinations of agent models and inception modules, ReIn substantially improves task success and generalizes to unseen error types. Moreover, it consistently outperforms explicit prompt-modification approaches, underscoring its utility as an efficient, on-the-fly method. In-depth analysis of its operational mechanism, particularly in relation to instruction hierarchy, indicates that jointly defining recovery tools with ReIn can serve as a safe and effective strategy for improving the resilience of conversational agents without modifying the backbone models or system prompts.

[12] Large Language Models Persuade Without Planning Theory of Mind

Jared Moore, Rasmus Overmark, Ned Cooper, Beba Cibralic, Nick Haber, Cameron R. Jones

Main category: cs.CL

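TL;DR: This paper proposes a new theory of mind (ToM) task that uses persuasion experiments to assess how well humans and large language models (LLMs) understand and use others' knowledge and motivational states in dynamic interaction. LLMs do poorly when they must reason about hidden mental states, yet outperform humans at persuading real people, which suggests their advantage comes from rhetorical strategies rather than genuine ToM.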

Motivation: Existing ToM evaluations mostly rely on static question answering and neglect first-person interaction, a key element of ToM. The paper designs a dynamic persuasion task closer to real social interaction to assess ToM more accurately.

Method: Three experiments. In Experiment 1, persuaders (humans and LLMs) persuade a bot programmed to make only rational inferences. In Experiment 2, a human target role-plays the bot. Experiment 3 measures whether human targets' real beliefs change. Persuaders must disclose information strategically according to the target's knowledge state (what it knows about the policies) and motivational state (which outcomes it values), and these states are either Revealed or Hidden.

Result: In Experiment 1, LLMs excel in the Revealed condition but fall below chance in the Hidden condition, while humans perform moderately well in both. In Experiments 2 and 3, LLMs outperform human persuaders across all conditions.

Conclusion: LLMs do not necessarily have human-like ToM, and their persuasive advantage more likely stems from pattern matching and rhetorical strategies. The study cautions against attributing LLM success to ToM while highlighting their real potential to influence human beliefs and behavior.

Abstract: A growing body of work attempts to evaluate the theory of mind (ToM) abilities of humans and large language models (LLMs) using static, non-interactive question-and-answer benchmarks. However, theoretical work in the field suggests that first-personal interaction is a crucial part of ToM and that such predictive, spectatorial tasks may fail to evaluate it. We address this gap with a novel ToM task that requires an agent to persuade a target to choose one of three policy proposals by strategically revealing information. Success depends on a persuader's sensitivity to a given target's knowledge states (what the target knows about the policies) and motivational states (how much the target values different outcomes). We varied whether these states were Revealed to persuaders or Hidden, in which case persuaders had to inquire about or infer them. In Experiment 1, participants persuaded a bot programmed to make only rational inferences. LLMs excelled in the Revealed condition but performed below chance in the Hidden condition, suggesting difficulty with the multi-step planning required to elicit and use mental state information. Humans performed moderately well in both conditions, indicating an ability to engage such planning. In Experiment 2, where a human target role-played the bot, and in Experiment 3, where we measured whether human targets' real beliefs changed, LLMs outperformed human persuaders across all conditions. These results suggest that effective persuasion can occur without explicit ToM reasoning (e.g., through rhetorical strategies) and that LLMs excel at this form of persuasion. Overall, our results caution against attributing human-like ToM to LLMs while highlighting LLMs' potential to influence people's beliefs and behavior.

[13] Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data

Deepak Uniyal, Md Abul Bashar, Richi Nayak

Main category: cs.CL

TL;DR: This paper compares four cross-lingual text classification approaches to filter hydrogen-related tweets from noisy multilingual social media data (English, Japanese, Hindi, Korean), then performs topic modeling; it identifies trade-offs between translation-based and multilingual transformer methods.

Motivation: Analysing multilingual social media discourse remains a major NLP challenge, especially for large-scale, noisy, keyword-collected public debates that span diverse languages. Reliable cross-lingual classification is needed for meaningful analysis of global conversations.

Method: Four cross-lingual classification strategies are evaluated on a dataset of over nine million tweets (2013-2022) in four languages: (1) translate English annotations into each target language and train language-specific models; (2) translate all unlabelled tweets into English and train one English model; (3) apply English-fine-tuned multilingual transformers directly to each target language; (4) a hybrid that combines translated annotations with multilingual training. The filtered outputs then undergo topic modeling.

Result: Each approach shows distinct performance trade-offs in filtering hydrogen-related tweets from noise; the hybrid strategy strikes a strong balance between accuracy and scalability; topic modeling reveals dominant thematic patterns within the filtered subsets across languages.

Conclusion: No single method dominates universally. The best cross-lingual pipeline depends on data scale, annotation availability, and language resource constraints, and hybrid approaches offer promising flexibility for real-world multilingual social media analysis.

Abstract: Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span across diverse languages. This study investigates how different approaches for cross-lingual text classification can support reliable analysis of global conversations. Using hydrogen energy as a case study, we analyse a decade-long dataset of over nine million tweets in English, Japanese, Hindi, and Korean (2013-2022) for topic discovery. The online keyword-driven data collection results in a significant amount of irrelevant content. We explore four approaches to filter relevant content: (1) translating English annotated data into target languages for building language-specific models for each target language, (2) translating unlabelled data appearing from all languages into English for creating a single model based on English annotations, (3) applying English fine-tuned multilingual transformers directly to each target language data, and (4) a hybrid strategy that combines translated annotations with multilingual training. Each approach is evaluated for its ability to filter hydrogen-related tweets from noisy keyword-based collections. Subsequently, topic modeling is performed to extract dominant themes within the relevant subsets. The results highlight key trade-offs between translation and multilingual approaches, offering actionable insights into optimising cross-lingual pipelines for large-scale social media analysis.

[14] ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning

Hussein S. Al-Olimat, Ahmad Alshareef

Main category: cs.CL

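TL;DR: This paper introduces ALPS, a native, expert-curated diagnostic challenge set for Arabic linguistics and pragmatics that probes deep semantic and pragmatic ability. ALPS contains 531 carefully crafted questions across 15 tasks and 47 subtasks and favors depth of understanding over breadth. Evaluating 23 models against a single-pass human baseline (84.6% accuracy) and an expert-adjudicated oracle (99.2%) shows that models are fluent but markedly weaker on morpho-syntactic dependencies (36.5% error on diacritics-reliant tasks) than on compositional semantics. Top commercial models (Gemini-3-flash at 94.2%) surpass the average human, while the best Arabic-specific model (Jais-2-70B at 83.6%) remains slightly below human level.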

Motivation: Existing Arabic NLP benchmarks often rely on synthetic or translated data and lack deep linguistic verification. A native, expert-led, culturally authentic diagnostic benchmark free of translation artifacts is needed to assess deep semantic and pragmatic ability.

Method: Build ALPS with deep involvement from Arabic linguistics experts: 15 tasks, 47 subtasks, and 531 questions focused on deep semantics and pragmatics. Evaluate 23 models (commercial, open-source, and Arabic-native) against a single-pass human baseline (84.6%) and an expert-adjudicated oracle (99.2%).

Result: Error rates on morpho-syntactic dependencies, especially diacritics-reliant tasks, reach 36.5%, well above those on compositional semantics. Gemini-3-flash reaches 94.2%, above the average human; Jais-2-70B, the best Arabic-specific model, reaches 83.6%, approaching but not matching human performance. The results expose a dissociation between model fluency and deep linguistic understanding.

Conclusion: ALPS fills the gap for a diagnostic Arabic benchmark aimed at deep language understanding. Current large models still show clear weaknesses in basic abilities such as morpho-syntax, which underlines the value of native language resources and expert linguistic guidance.

Abstract: While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data which may benefit from deeper linguistic verification. We introduce ALPS (Arabic Linguistic & Pragmatic Suite), a native, expert-curated diagnostic challenge set probing Deep Semantics and Pragmatics, capabilities that complement specialized large-scale benchmarks. While broad-coverage benchmarks prioritize scale and multi-task coverage, ALPS targets the depth of linguistic understanding through 531 rigorously crafted questions across 15 tasks and 47 subtasks. We developed the dataset with deep expertise in Arabic linguistics, guaranteeing cultural authenticity and eliminating translation artifacts. Evaluating 23 diverse models (commercial, open-source, and Arabic-native) against a single-pass human performance (avg. 84.6% accuracy) and an expert-adjudicated oracle (99.2%), we reveal a critical dissociation: models achieve high fluency but fail on fundamental morpho-syntactic dependencies, with elevated error rates (36.5% across diacritics-reliant tasks) compared to compositional semantics. While top commercial models (Gemini-3-flash at 94.2%) surpass the average single human, a substantial gap persists between commercial giants and Arabic-native models, with the best Arabic-specific model (Jais-2-70B at 83.6%) approaching but not matching human performance.

[15] BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios

Yunseung Lee, Subin Kim, Youngjun Kwak, Jaegul Choo

Main category: cs.CL

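TL;DR: This paper proposes BankMathBench, a numerical reasoning benchmark dataset for banking scenarios, aimed at improving LLM accuracy on core banking computations such as deposits and loans. The dataset covers basic, intermediate, and advanced difficulty levels, and open-source LLMs fine-tuned on it improve markedly in formula generation and numerical reasoning.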

Details Motivation: 现有大语言模型在银行核心计算任务(如本息估算、多产品比对、提前还款计息)中准确率低,且缺乏反映真实银行业务场景的评估基准。 Method: 构建BankMathBench领域专用数据集,按难度分为基础(单产品)、中级(多产品比较)、高级(多条件)三类任务;采用工具增强微调方式训练开源LLM,并评估其公式生成与数值推理性能。 Result: 工具增强微调后,模型在基础、中级、高级任务上的平均准确率分别提升57.6、75.1、62.9个百分点,显著优于零样本基线。 Conclusion: BankMathBench是评估和提升大语言模型在真实银行业务中数值推理能力的有效且可靠的基准。 Abstract: Large language models (LLMs)-based chatbots are increasingly being adopted in the financial domain, particularly in digital banking, to handle customer inquiries about products such as deposits, savings, and loans. However, these models still exhibit low accuracy in core banking computations-including total payout estimation, comparison of products with varying interest rates, and interest calculation under early repayment conditions. Such tasks require multi-step numerical reasoning and contextual understanding of banking products, yet existing LLMs often make systematic errors-misinterpreting product types, applying conditions incorrectly, or failing basic calculations involving exponents and geometric progressions. However, such errors have rarely been captured by existing benchmarks. Mathematical datasets focus on fundamental math problems, whereas financial benchmarks primarily target financial documents, leaving everyday banking scenarios underexplored. To address this limitation, we propose BankMathBench, a domain-specific dataset that reflects realistic banking tasks. BankMathBench is organized in three levels of difficulty-basic, intermediate, and advanced-corresponding to single-product reasoning, multi-product comparison, and multi-condition scenarios, respectively. When trained on BankMathBench, open-source LLMs exhibited notable improvements in both formula generation and numerical reasoning accuracy, demonstrating the dataset's effectiveness in enhancing domain-specific reasoning. 
With tool-augmented fine-tuning, the models achieved average accuracy increases of 57.6%p (basic), 75.1%p (intermediate), and 62.9%p (advanced), representing significant gains over zero-shot baselines. These findings highlight BankMathBench as a reliable benchmark for evaluating and advancing LLMs' numerical reasoning in real-world banking scenarios.
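
The abstract mentions calculations involving exponents and geometric progressions. As a rough illustration of what such a computation can look like, here is a minimal Python sketch. The function names, the compounding conventions (annual compounding for the deposit, monthly compounding with start-of-month payments for the installment product), and the absence of tax are all assumptions for illustration and are not taken from BankMathBench.

```python
def deposit_payout(principal: float, annual_rate: float, years: int) -> float:
    """Lump-sum deposit with annual compounding: P * (1 + r)^n (assumed convention)."""
    return principal * (1 + annual_rate) ** years


def installment_payout(monthly_amount: float, annual_rate: float, months: int) -> float:
    """Installment savings, deposits at the start of each month, monthly compounding.

    The k-th deposit grows for (months - k + 1) months, so the total is a
    geometric series: A * q * (q^n - 1) / (q - 1), with q = 1 + r / 12.
    """
    q = 1 + annual_rate / 12
    if q == 1:  # zero interest: the series degenerates to a plain sum
        return monthly_amount * months
    return monthly_amount * q * (q ** months - 1) / (q - 1)


if __name__ == "__main__":
    print(round(deposit_payout(1000, 0.05, 2), 2))
    print(round(installment_payout(100, 0.12, 2), 2))
```

A multi-product comparison in this style would evaluate both functions under each product's terms and compare the payouts.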

[16] Projective Psychological Assessment of Large Multimodal Models Using Thematic Apperception Tests

Anton Dzega,Aviad Elyashar,Ortal Slobodin,Odeya Cohen,Rami Puzis

Main category: cs.CL

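TL;DR: This study uses Thematic Apperception Test (TAT) images and the SCORS-G scale to assess the personality-like traits that large multimodal models (LMMs) exhibit in non-language modalities. The models do well at understanding interpersonal interaction and self-concept but show systematic deficits in perceiving and regulating aggression, and performance improves with model size and recency.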

Details Motivation: Explore whether large multimodal models (LMMs) have personality-like traits that psychometric methods can assess, with a focus on non-language modalities such as image understanding and narrative generation. Method: Use TAT images as stimuli and have LMMs act both as subject models (generating stories) and as evaluator models (scoring the stories on personality dimensions with the SCORS-G scale), then compare against human expert ratings. Result: Evaluator models understand and analyze TAT responses very well, and their scores agree closely with human experts; all models handle interpersonal dynamics and self-concept well but generally fail to perceive and regulate aggression; performance improves systematically with parameter count and later release date. Conclusion: LMMs show some stable, measurable dimensions of personality-like functioning, which supports viewing them as "computational personality entities" with rudimentary social cognition, but they have a fundamental limitation in emotion regulation (especially aggression), pointing to weaknesses of current architectures in affective modeling. Abstract: The Thematic Apperception Test (TAT) is a psychometrically grounded, multidimensional assessment framework that systematically differentiates between cognitive-representational and affective-relational components of personality-like functioning. This test is a projective psychological framework designed to uncover unconscious aspects of personality. This study examines whether the personality traits of Large Multimodal Models (LMMs) can be assessed through non-language-based modalities, using the Social Cognition and Object Relations Scale - Global (SCORS-G). LMMs are employed in two distinct roles: as subject models (SMs), which generate stories in response to TAT images, and as evaluator models (EMs), which assess these narratives using the SCORS-G framework. Evaluators demonstrated an excellent ability to understand and analyze TAT responses. Their interpretations are highly consistent with those of human experts. Assessment results highlight that all models understand interpersonal dynamics very well and have a good grasp of the concept of self. However, they consistently fail to perceive and regulate aggression. Performance varied systematically across model families, with larger and more recent models consistently outperforming smaller and earlier ones across SCORS-G dimensions.

[17] The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI

Dusan Bosnjakovic

Main category: cs.CL

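TL;DR: This paper proposes an auditing framework based on psychometric measurement theory that quantifies stable behavioral tendencies of large language models (such as optimization bias, sycophancy, and status-quo legitimization) without ground-truth labels. It finds a significant "lab signal" across models, suggesting that latent biases could form recursive ideological echo chambers in multi-layered AI architectures.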

Details Motivation: As LLMs move from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops, detecting durable, provider-level behavioral signatures becomes a key requirement for safety and governance; traditional benchmarks measure only transient task accuracy and miss the stable latent response policies embedded during training and alignment. Method: Apply latent trait estimation from psychometrics under ordinal uncertainty, building forced-choice ordinal vignettes that are masked by semantically orthogonal decoys and safeguarded by cryptographic permutation invariance; audit nine leading models on dimensions including optimization bias, sycophancy, and status-quo legitimization; analyze results with Mixed Linear Models (MixedLM) and the Intraclass Correlation Coefficient (ICC). Result: Item-level framing drives high variance, but a significant, persistent "lab signal" produces behavioral clustering; in "locked-in" provider ecosystems, latent biases are not merely static errors but variables that compound. Conclusion: The framework can reveal stable behavioral tendencies of LLMs without labels and warns that latent biases may trigger recursive ideological echo chambers in multi-layered AI systems, with implications for AI safety and governance. Abstract: As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral signatures becomes a critical requirement for safety and governance. Traditional benchmarks measure transient task accuracy but fail to capture stable, latent response policies -- the "prevailing mindsets" embedded during training and alignment that outlive individual model versions. This paper introduces a novel auditing framework that utilizes psychometric measurement theory -- specifically latent trait estimation under ordinal uncertainty -- to quantify these tendencies without relying on ground-truth labels. Utilizing forced-choice ordinal vignettes masked by semantically orthogonal decoys and governed by cryptographic permutation-invariance, the research audits nine leading models across dimensions including Optimization Bias, Sycophancy, and Status-Quo Legitimization. Using Mixed Linear Models (MixedLM) and Intraclass Correlation Coefficient (ICC) analysis, the research identifies that while item-level framing drives high variance, a persistent "lab signal" accounts for significant behavioral clustering. These findings demonstrate that in "locked-in" provider ecosystems, latent biases are not merely static errors but compounding variables that risk creating recursive ideological echo chambers in multi-layered AI architectures.
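
To make the ICC idea concrete, the sketch below computes a one-way ANOVA intraclass correlation for a balanced design, where each group could stand for one lab and each value for one model's score. This is a simplification chosen for illustration: the abstract reports MixedLM and ICC analysis but does not specify the estimator, and the function name `icc_oneway` is hypothetical.

```python
from typing import List


def icc_oneway(groups: List[List[float]]) -> float:
    """One-way ICC for equally sized groups: (MSB - MSW) / (MSB + (k - 1) * MSW)."""
    n = len(groups)          # number of groups (e.g. labs)
    k = len(groups[0])       # observations per group
    if n < 2 or k < 2 or any(len(g) != k for g in groups):
        raise ValueError("need at least 2 groups of equal size >= 2")
    grand_mean = sum(sum(g) for g in groups) / (n * k)
    means = [sum(g) / k for g in groups]
    # Between-group and within-group mean squares.
    msb = k * sum((m - grand_mean) ** 2 for m in means) / (n - 1)
    msw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (n * (k - 1))
    denom = msb + (k - 1) * msw
    return (msb - msw) / denom if denom else 0.0


if __name__ == "__main__":
    # Scores cluster tightly by group, so most variance is between groups.
    print(icc_oneway([[0.9, 1.0, 1.1], [2.9, 3.0, 3.1]]))
```

A value near 1 means scores cluster by group; a value near or below 0 means group membership explains little of the variance.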

[18] What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform

Adrian Cosma,Cosmin Dumitrache,Emilian Radoi

Main category: cs.CL

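TL;DR: This paper analyzes patient satisfaction signals in Romanian text-based telemedicine. Doctor and patient history features dominate satisfaction prediction, while linguistic features of the response text (such as politeness and hedging) have a smaller but actionable effect.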

Details Motivation: As text-based telemedicine spreads, clinicians face pressure to maintain patient satisfaction scores through written communication, and these scores often reflect communication quality rather than clinical accuracy. The paper investigates which linguistic and non-linguistic factors drive patient feedback in a Romanian-language setting. Method: Using 77,334 anonymized patient question and doctor response pairs, treat a patient thumbs-up as the positive class and everything else as negative; extract language-agnostic features (length, structure, readability), Romanian LIWC psycholinguistic features, and politeness/hedging markers; train a classifier with a time-based split and run SHAP-based interpretability analysis. Result: Patient and doctor history features are the strongest predictors of satisfaction; politeness and hedging in the response text are consistently positively associated with positive feedback, whereas lexical diversity is negatively associated. Conclusion: Satisfaction in text-based telemedicine depends heavily on history, but language strategy (such as more politeness and hedging) offers an actionable route to improvement. Abstract: Text-based telemedicine has become a common mode of care, requiring clinicians to deliver medical advice clearly and effectively in writing. As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain satisfaction scores, even though these evaluations often reflect communication quality more than clinical accuracy. We analyse patient satisfaction signals in Romanian text-based telemedicine. Using a sample of 77,334 anonymised patient question--doctor response pairs, we model feedback as a binary outcome, treating thumbs-up responses as positive and grouping negative or absent feedback into the other class. We extract interpretable, predominantly language-agnostic features (e.g., length, structural characteristics, readability proxies), along with Romanian LIWC psycholinguistic features and politeness/hedging markers where available. We train a classifier with a time-based split and perform SHAP-based analyses, which indicate that patient and clinician history features dominate prediction, functioning as strong priors, while characteristics of the response text provide a smaller but, crucially, actionable signal. In subgroup correlation analyses, politeness and hedging are consistently positively associated with patient feedback, whereas lexical diversity shows a negative association.

[19] Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study

Kensuke Okada,Yui Furukawa,Kyosuke Bunji

Main category: cs.CL

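TL;DR: This paper proposes a psychometric framework to quantify and mitigate the bias that socially desirable responding (SDR) introduces into questionnaire-based evaluation of large language models (LLMs). SDR is quantified by comparing IRT latent scores under honest versus fake-good instructions, and a desirability-matched graded forced-choice (GFC) inventory is designed to mitigate it; the approach is validated on multiple LLMs.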

Details Motivation: Existing LLM evaluations built on human self-report questionnaires assume that models answer honestly, but in practice LLMs tend toward socially preferred answers (SDR), which systematically distorts results and conclusions. Method: 1) An SDR quantification method: administer the same inventory under HONEST and FAKE-GOOD instructions, estimate latent scores with IRT, and compute a direction-corrected standardized effect size; 2) a GFC version of a Big Five inventory: select 30 cross-domain, desirability-matched item pairs from an item pool via constrained optimization; 3) experiments on nine instruction-tuned LLMs using synthetic personas with known target profiles. Result: Likert-style questionnaires show large SDR on all LLMs, whereas the GFC inventory substantially reduces SDR while still recovering the target persona profiles well; the results reveal a model-dependent trade-off between SDR suppression and profile recovery. Conclusion: SDR is a systematic bias that questionnaire-based LLM evaluation cannot ignore; SDR-aware methods (such as GFC designs) and reporting of SDR levels should be adopted to make benchmarking and auditing more reliable. Abstract: Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments presume honest responding; in evaluative contexts, LLMs can instead gravitate toward socially preferred answers, a form of socially desirable responding (SDR), biasing questionnaire-derived scores and downstream conclusions. We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs. To quantify SDR, the same inventory is administered under HONEST versus FAKE-GOOD instructions, and SDR is computed as a direction-corrected standardized effect size from item response theory (IRT)-estimated latent scores. This enables comparisons across constructs and response formats, as well as against human instructed-faking benchmarks. For mitigation, we construct a graded forced-choice (GFC) Big Five inventory by selecting 30 cross-domain pairs from an item pool via constrained optimization to match desirability. Across nine instruction-tuned LLMs evaluated on synthetic personas with known target profiles, Likert-style questionnaires show consistently large SDR, whereas desirability-matched GFC substantially attenuates SDR while largely preserving the recovery of the intended persona profiles. These results highlight a model-dependent SDR-recovery trade-off and motivate SDR-aware reporting practices for questionnaire-based benchmarking and auditing of LLMs.
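
One plausible reading of a "direction-corrected standardized effect size" is a pooled-SD Cohen's d between the two instruction conditions, multiplied by +1 or -1 depending on which direction of the trait is socially desirable. The sketch below implements that reading only; the function name, the sign convention, and the use of precomputed scores in place of IRT estimation are assumptions, not details from the paper.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def sdr_effect_size(honest: Sequence[float],
                    fake_good: Sequence[float],
                    desirable_direction: int = 1) -> float:
    """Cohen's d (pooled SD) of FAKE-GOOD minus HONEST scores, sign-corrected.

    desirable_direction is +1 when a higher score is the desirable pole
    (e.g. conscientiousness) and -1 when a lower score is (e.g. neuroticism).
    """
    n_h, n_f = len(honest), len(fake_good)
    pooled_var = ((n_h - 1) * variance(honest) +
                  (n_f - 1) * variance(fake_good)) / (n_h + n_f - 2)
    d = (mean(fake_good) - mean(honest)) / sqrt(pooled_var)
    return desirable_direction * d


if __name__ == "__main__":
    # Hypothetical latent scores for one trait under the two instructions.
    print(sdr_effect_size([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
```

With this convention a positive value always means a shift toward the desirable pole, which is what allows comparison across traits with opposite polarity.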

[20] Towards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective

Yukun Chen,Xinyu Zhang,Jialong Tang,Yu Wan,Baosong Yang,Yiming Li,Zhan Qin,Kui Ren

Main category: cs.CL

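TL;DR: This paper proposes X-Value, a cross-lingual values assessment benchmark that evaluates how well large language models (LLMs) understand the deep-level values of digital content. Current SOTA models fall short on this task and show significant performance gaps across languages.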

Details Motivation: Current content safety evaluation focuses on explicit harms (such as violence and hate speech) and neglects the deeper value dimensions implicit in digital content, in particular lacking cross-lingual, globally oriented values assessment. Method: Build the X-Value benchmark with more than 5,000 QA pairs in 18 languages, covering 7 core domains from Schwartz's Theory of Basic Human Values and split into easy and hard levels; propose a two-stage annotation framework that first decides whether an issue falls under global consensus or pluralism and then conducts a multi-party assessment of the latent values. Result: Systematic evaluation on X-Value shows that current SOTA LLMs reach below 77% accuracy on cross-lingual values assessment, with accuracy gaps of more than 20% between languages. Conclusion: Current LLMs have clear weaknesses in fine-grained, values-aware content assessment, and their cross-lingual, deep-level understanding and judgment of values need to improve. Abstract: While large language models (LLMs) have become pivotal to content safety, current evaluation paradigms primarily focus on detecting explicit harms (e.g., violence or hate speech), neglecting the subtler value dimensions conveyed in digital content. To bridge this gap, we introduce X-Value, a novel Cross-lingual Values Assessment Benchmark designed to evaluate LLMs' ability to assess deep-level values of content from a global perspective. X-Value consists of more than 5,000 QA pairs across 18 languages, systematically organized into 7 core domains grounded in Schwartz's Theory of Basic Human Values and categorized into easy and hard levels for discriminative evaluation. We further propose a unique two-stage annotation framework that first identifies whether an issue falls under global consensus (e.g., human rights) or pluralism (e.g., religion), and subsequently conducts a multi-party evaluation of the latent values embedded within the content. Systematic evaluations on X-Value reveal that current SOTA LLMs exhibit deficiencies in cross-lingual values assessment ($Acc < 77\%$), with significant performance disparities across different languages ($\Delta Acc > 20\%$). This work highlights the urgent need to improve the nuanced, values-aware content assessment capability of LLMs. Our X-Value is available at: https://huggingface.co/datasets/Whitolf/X-Value.

[21] Representation Collapse in Machine Translation Through the Lens of Angular Dispersion

Evgeniia Tokarchuk,Maya K. Nachesa,Sergey Troshin,Vlad Niculae

Main category: cs.CL

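TL;DR: This paper analyzes representation collapse in Transformer-based neural machine translation caused by the standard next-token prediction training strategy, which is more pronounced in deeper layers and in continuous-output NMT. It applies a regularization method based on angular dispersion to mitigate the problem and shows that translation quality improves in discrete, continuous, and quantized models.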

Details Motivation: The standard next-token prediction training strategy tends to cause representation collapse, which is especially severe in deeper Transformer layers and in continuous-output NMT, where the model can even drift toward the trivial solution (all vectors identical), limiting effective use of the geometric space. Method: Analyze the dynamics of representation collapse at different levels of discrete and continuous NMT Transformers throughout training; incorporate an existing regularization method based on angular dispersion; further verify the effect of the regularization on quantized models. Result: Empirically, the regularization both mitigates representation collapse and improves translation quality, and the improvement persists in quantized models. Conclusion: Representation collapse is an important hidden risk for Transformer NMT performance, and angular dispersion regularization is a simple and general remedy that applies across output types (discrete/continuous) and model precisions (quantized/non-quantized). Abstract: Modern neural translation models based on the Transformer architecture are known for their high performance, particularly when trained on high-resource datasets. A standard next-token prediction training strategy, while widely adopted in practice, may lead to overlooked artifacts such as representation collapse. Previous works have shown that this problem is especially pronounced in the representation of the deeper Transformer layers, where it often fails to efficiently utilize the geometric space. Representation collapse is even more evident in end-to-end training of continuous-output neural machine translation, where the trivial solution would be to set all vectors to the same value. In this work, we analyze the dynamics of representation collapse at different levels of discrete and continuous NMT transformers throughout training. We incorporate an existing regularization method based on angular dispersion and demonstrate empirically that it not only mitigates collapse but also improves translation quality. Furthermore, we show that quantized models exhibit similar collapse behavior and that the benefits of regularization are preserved even after quantization.
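
To give a feel for how collapse can be measured, the sketch below computes the spherical variance of a set of vectors: one minus the length of the mean of their unit-normalized directions. It is 0 when all vectors point the same way (full collapse) and approaches 1 when directions spread out evenly. This is a generic diagnostic chosen for illustration, not the regularizer used in the paper, and the function name is hypothetical.

```python
import math
from typing import List, Sequence


def spherical_variance(vectors: Sequence[Sequence[float]]) -> float:
    """Return 1 - ||mean of unit vectors||; 0 means all directions coincide."""
    dim = len(vectors[0])
    mean_dir: List[float] = [0.0] * dim
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            raise ValueError("zero vector has no direction")
        for i, x in enumerate(v):
            mean_dir[i] += x / norm / len(vectors)
    return 1.0 - math.sqrt(sum(m * m for m in mean_dir))


if __name__ == "__main__":
    collapsed = [[1.0, 0.0], [2.0, 0.0], [0.5, 0.0]]   # same direction
    spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
    print(spherical_variance(collapsed), spherical_variance(spread))
```

A dispersion-based regularizer would, in spirit, add a training penalty that pushes a statistic like this away from the collapsed end.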

[22] Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

Bogdan Kostić,Conor Fallon,Julian Risch,Alexander Löser

Main category: cs.CL

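TL;DR: This paper studies how lexical and syntactic perturbations affect 23 large language models on three benchmarks (MMLU, SQuAD, and AMEGA). Lexical perturbations significantly reduce performance, syntactic perturbations have mixed effects, and robustness does not increase monotonically with model size, suggesting that current LLMs rely more on surface lexical patterns than on deeper linguistic competence.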

Details Motivation: The reliability of existing LLM benchmarks is in question because they are sensitive to shallow changes in input prompts, so model behavior under meaning-preserving perturbations needs systematic study to test genuine linguistic competence. Method: Use two linguistically principled pipelines to generate meaning-preserving perturbations: synonym substitution for lexical changes, and dependency parsing to determine applicable syntactic transformations; test how 23 mainstream LLMs respond to both perturbation types on MMLU, SQuAD, and AMEGA. Result: Lexical perturbations cause significant performance drops across nearly all models and tasks; syntactic perturbations have heterogeneous effects and occasionally improve results; both types destabilize model leaderboards on complex tasks; robustness shows no consistent positive relation to parameter count and depends strongly on the task. Conclusion: LLMs rely more on surface lexical cues than on abstract linguistic competence, so robustness testing should become a standard part of LLM evaluation. Abstract: The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ two linguistically principled pipelines to generate meaning-preserving variations: one performing synonym substitution for lexical changes, and another using dependency parsing to determine applicable syntactic transformations. Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results. Both perturbation types destabilize model leaderboards on complex tasks. Furthermore, model robustness did not consistently scale with model size, revealing strong task dependence. Overall, the findings suggest that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence, underscoring the need for robustness testing as a standard component of LLM evaluation.

[23] RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering

Yiming Zhang,Siyue Zhang,Junbo Zhao,Chen Zhao

Main category: cs.CL

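TL;DR: This paper proposes the RPDR framework, which combines synthetic data generation, Round-Trip prediction to select easy-to-learn samples, and targeted training to substantially improve dense retrievers on long-tail question answering, and adds a dynamic routing mechanism for further gains.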

Details Motivation: Large language models are limited in long-tail question answering by their ability to acquire and accurately recall low-frequency knowledge, and existing dense retrievers also generalize poorly to rare or niche knowledge. Method: Propose the RPDR data augmentation framework with three parts: synthetic data generation, data selection based on Round-Trip prediction (to pick easy-to-learn samples), and dense retriever training on the selected samples; also design a dynamic routing mechanism that dispatches queries to specialized retrieval modules. Result: On two long-tail retrieval benchmarks, PopQA and EntityQuestion, RPDR substantially outperforms baselines such as BM25 and Contriever, with the largest gains on extremely long-tail categories; human analysis confirms its strengths and limitations. Conclusion: By augmenting with high-quality, easy-to-learn data, RPDR eases the generalization bottleneck of dense retrievers in long-tail settings, and dynamic routing can improve retrieval further, offering a new approach to long-tail knowledge retrieval. Abstract: Long-tail question answering presents significant challenges for large language models (LLMs) due to their limited ability to acquire and accurately recall less common knowledge. Retrieval-augmented generation (RAG) systems have shown great promise in mitigating this limitation by integrating external retrieval mechanisms. However, dense retrieval models often face the same difficulties when generalizing to rare or niche knowledge. In this study, we introduce RPDR, a novel data augmentation framework that selects high-quality easy-to-learn training data, to enhance dense retrievers. Our approach is built around three core components: synthetic data generation, data selection with Round-Trip prediction to identify easy-to-learn instances, and retriever training with these instances. We evaluate RPDR on two long-tail retrieval benchmarks, PopQA and EntityQuestion, demonstrating substantial improvements over existing retrievers like BM25 and Contriever, especially on extremely long-tail categories. We identify the strengths and limitations of RPDR through detailed human analysis and propose a dynamic routing mechanism to dynamically route queries to specialized retrieval modules to further improve retrieval performance.

[24] The Role of the Availability Heuristic in Multiple-Choice Answering Behaviour

Leonidas Zotos,Hedderik van Rijn,Malvina Nissim

Main category: cs.CL

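TL;DR: This paper examines how effective guessing by cognitive availability (the availability heuristic) is on multiple-choice questions (MCQs). It quantifies the availability of each option by how often its concept appears in a large corpus (such as Wikipedia) and finds that correct answers are generally more available than distractors; always choosing the most available option beats random guessing by 13.5% to 32.9%. The pattern holds for both LLM-generated and expert-written questions, suggesting that availability should be included in computational models of student behaviour.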

Details Motivation: Students often guess when unsure of the correct MCQ answer; the classic availability heuristic holds that people tend to pick the option that comes to mind most easily, but the practical effectiveness of this strategy in standardized tests has not been systematically verified. Method: Propose a computational method that quantifies the cognitive availability of each MCQ option from concept frequency in a large corpus (such as Wikipedia); test on three large MCQ datasets whether correct answers are significantly more available, and compare the effect between LLM-generated and human-written questions. Result: Correct answers are significantly more available than incorrect options in all question sets; choosing only the most available option scores 13.5% to 32.9% above random guessing; LLM-generated options show the same availability pattern. Conclusion: The availability heuristic has real predictive power in MCQ answering and should not be ignored; future cognitive models of students (especially of guessing behaviour) should treat the availability of options as a key variable. Abstract: When students are unsure of the correct answer to a multiple-choice question (MCQ), guessing is common practice. The availability heuristic, proposed by A. Tversky and D. Kahneman in 1973, suggests that the ease with which relevant instances come to mind, typically operationalised by the mere frequency of exposure, can offer a mental shortcut for problems in which the test-taker does not know the exact answer. Is simply choosing the option that comes most readily to mind a good strategy for answering MCQs? We propose a computational method of assessing the cognitive availability of MCQ options operationalised by concepts' prevalence in large corpora. The key finding, across three large question sets, is that correct answers, independently of the question stem, are significantly more available than incorrect MCQ options. Specifically, using Wikipedia as the retrieval corpus, we find that always selecting the most available option leads to scores 13.5% to 32.9% above the random-guess baseline. We further find that LLM-generated MCQ options show similar patterns of availability compared to expert-created options, despite the LLMs' frequentist nature and their training on large collections of textual data. Our findings suggest that availability should be considered in current and future work when computationally modelling student behaviour.
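
The guessing strategy described above can be sketched in a few lines: score each option by how often it occurs in a corpus and pick the highest. The frequency table, question format, and function names below are invented for illustration; the paper derives availability from Wikipedia, and its exact retrieval and counting procedure is not described here.

```python
from typing import Dict, List


def most_available_option(options: List[str], corpus_counts: Dict[str, int]) -> str:
    """Pick the option with the highest corpus frequency (ties go to the first)."""
    return max(options, key=lambda opt: corpus_counts.get(opt.lower(), 0))


def availability_accuracy(questions: List[dict], corpus_counts: Dict[str, int]) -> float:
    """Fraction of questions where the most available option is the correct one."""
    hits = sum(most_available_option(q["options"], corpus_counts) == q["answer"]
               for q in questions)
    return hits / len(questions)


if __name__ == "__main__":
    # Hypothetical counts standing in for a real corpus lookup.
    counts = {"paris": 900, "lyon": 120, "nice": 300, "oxygen": 700, "argon": 40}
    quiz = [
        {"options": ["Lyon", "Paris", "Nice"], "answer": "Paris"},
        {"options": ["Argon", "Oxygen"], "answer": "Argon"},
    ]
    print(availability_accuracy(quiz, counts))
```

Note that the stem of the question is never read, which matches the abstract's observation that the effect holds independently of the question stem.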

[25] Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference

Anastasia Zhukova,Felix Hamborg,Karsten Donnay,Norman Meuschke,Bela Gipp

Main category: cs.CL

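TL;DR: This paper proposes a revised annotation scheme for cross-document coreference resolution (CDCR) that treats coreference chains as discourse elements (DEs) and supports both identity and near-identity relations, so as to better capture lexical diversity and framing differences in news coverage. NewsWCL50 and a subset of ECB+ are reannotated and evaluated.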

Details Motivation: Existing CDCR datasets lean toward event coreference and define coreference narrowly, so they cope poorly with the wide variation in wording and framing found in diverse, polarized news. Method: Redefine coreference chains as discourse elements (DEs) that support identity and near-identity relations; reannotate NewsWCL50 and a subset of ECB+ with a unified codebook; evaluate with lexical diversity metrics and a same-head-lemma baseline. Result: On lexical diversity and related metrics, the reannotated datasets fall between the original ECB+ and NewsWCL50, which supports their balance and discourse sensitivity. Conclusion: The revised scheme improves how CDCR models the complexity of news discourse and provides high-quality resources and a methodological basis for more robust, discourse-aware cross-document coreference research. Abstract: Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents, enabling content analysis that aggregates information at the level of discourse participants. However, existing datasets primarily focus on event resolution and employ a narrow definition of coreference, which limits their effectiveness in analyzing diverse and polarized news coverage where wording varies widely. This paper proposes a revised CDCR annotation scheme of the NewsWCL50 dataset, treating coreference chains as discourse elements (DEs) and conceptual units of analysis. The approach accommodates both identity and near-identity relations, e.g., by linking "the caravan" - "asylum seekers" - "those contemplating illegal entry", allowing models to capture lexical diversity and framing variation in media discourse, while maintaining the fine-grained annotation of DEs. We reannotate the NewsWCL50 and a subset of ECB+ using a unified codebook and evaluate the new datasets through lexical diversity metrics and a same-head-lemma baseline. The results show that the reannotated datasets align closely, falling between the original ECB+ and NewsWCL50, thereby supporting balanced and discourse-aware CDCR research in the news domain.
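
A same-head-lemma baseline simply puts two mentions in the same chain when the lemma of their head word matches. The sketch below assumes head lemmas are already available (in practice they would come from a parser and lemmatizer); the function name and the hand-assigned lemmas are illustrative and not taken from the paper's setup.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def same_head_lemma_chains(mentions: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (mention_text, head_lemma) pairs into chains keyed by head lemma."""
    chains: Dict[str, List[str]] = defaultdict(list)
    for text, head_lemma in mentions:
        chains[head_lemma.lower()].append(text)
    return dict(chains)


if __name__ == "__main__":
    # Head lemmas are assigned by hand here for illustration.
    mentions = [
        ("the caravan", "caravan"),
        ("asylum seekers", "seeker"),
        ("a migrant caravan", "caravan"),
        ("those contemplating illegal entry", "those"),
    ]
    for head, chain in same_head_lemma_chains(mentions).items():
        print(head, chain)
```

On the abstract's example, this baseline links the two "caravan" mentions but leaves "asylum seekers" and "those contemplating illegal entry" in separate chains, which is the kind of lexically diverse near-identity link the revised scheme is meant to capture.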

[26] Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics

Sanjeev Kumar,Preethi Jyothi,Pushpak Bhattacharyya

Main category: cs.CL

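TL;DR: This paper compares two machine translation metrics, BLEU and ChrF++, in extremely low-resource language (ELRL) settings and finds that BLEU, despite lower scores, provides complementary lexical-precision information that improves interpretability.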

Details Motivation: Mainstream metrics such as BLEU often misjudge translation quality in extremely low-resource language (ELRL) settings, so their suitability and complementarity with the character-level metric ChrF++ in this setting need study. Method: Compare BLEU (n-gram-based) and ChrF++ (character-based), examining how each responds to translation defects such as hallucination, repetition, source-text copying, and diacritic (matra) variation; experiments cover three ELRLs (Magahi, Bhojpuri, and Chhattisgarhi) and outputs from both LLM and NMT systems. Result: ChrF++ is widely adopted, but BLEU, despite lower absolute scores, provides valuable lexical-precision information; the two are complementary, and BLEU helps make evaluation results more interpretable. Conclusion: BLEU should not be dropped in ELRL translation evaluation; it should be used together with ChrF++ to cover different quality signals and make evaluation more comprehensive and interpretable. Abstract: Evaluating machine translation (MT) quality in extremely low-resource language (ELRL) scenarios poses unique challenges, as widely used metrics such as BLEU, effective in high-resource settings, often misrepresent quality in data-scarce contexts. This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings. We examine how each metric responds to translation artifacts, including hallucinations, repetition, source-text copying, and diacritic (matra) variations across three ELRLs: Magahi, Bhojpuri, and Chhattisgarhi, with a focus on outputs from large language models (LLMs) and neural MT (NMT) systems. While recent work often relies solely on ChrF++, our findings show that BLEU, despite its lower absolute scores, provides complementary lexical-precision insights that improve interpretability.
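
The contrast between word-level and character-level overlap can be seen in a toy example. The sketch below computes a single-order n-gram F-score over either words or characters. It is a deliberately simplified stand-in: real BLEU combines several n-gram orders with a brevity penalty, and real ChrF++ averages character and word n-grams with a recall-weighted F-score, so the numbers here are not comparable to either metric.

```python
from collections import Counter
from typing import Sequence


def ngram_counts(units: Sequence, n: int) -> Counter:
    """Count n-grams over a sequence of words (list) or characters (str)."""
    return Counter(tuple(units[i:i + n]) for i in range(len(units) - n + 1))


def overlap_f(hyp: Sequence, ref: Sequence, n: int, beta: float = 1.0) -> float:
    """F-beta of clipped n-gram matches between hypothesis and reference."""
    h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
    matches = sum((h & r).values())  # Counter intersection clips the counts
    if matches == 0:
        return 0.0
    precision = matches / sum(h.values())
    recall = matches / sum(r.values())
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


if __name__ == "__main__":
    hyp, ref = "the cats sat", "the cat sat"
    print(overlap_f(hyp.split(), ref.split(), n=1))  # word unigrams
    print(overlap_f(hyp, ref, n=2))                  # character bigrams
```

A one-character difference costs a whole word at the word level but only a couple of character bigrams, which is why character-based scores tend to be higher and more forgiving of small spelling or diacritic variations.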

[27] Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study

Dylan Bouchard,Mohit Singh Chauhan,Viren Bajaj,David Skarbrevik

Main category: cs.CL

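TL;DR: This paper proposes a fine-grained uncertainty quantification (UQ) framework for long-form generation, with a three-stage taxonomy (response decomposition, unit-level scoring, and response-level aggregation) for systematically assessing the factuality of long-form LLM outputs. Experiments show that scoring by claim-response entailment is consistently effective, claim-level scoring beats sentence-level scoring, and uncertainty-aware decoding substantially improves factuality.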

Details Motivation: Existing uncertainty quantification methods mainly target short-form outputs and generalize poorly to hallucination detection in long-form generation. Method: Build a taxonomy of fine-grained uncertainty quantification for long-form LLM outputs covering three stages (response decomposition, unit-level scoring, and response-level aggregation), and formalize several families of consistency-based black-box scorers. Result: The experiments find that 1) claim-response entailment scoring is consistently effective and performs better than or on par with more complex claim-level methods; 2) claim-level scoring is generally better than sentence-level scoring; 3) uncertainty-aware decoding substantially improves long-form factuality. Conclusion: The framework clarifies the relationships between prior methods, supports fair comparison, and gives practical guidance for choosing components for fine-grained uncertainty quantification. Abstract: Uncertainty quantification has emerged as an effective approach to closed-book hallucination detection for LLMs, but existing methods are largely designed for short-form outputs and do not generalize well to long-form generation. We introduce a taxonomy for fine-grained uncertainty quantification in long-form LLM outputs that distinguishes methods by design choices at three stages: response decomposition, unit-level scoring, and response-level aggregation. We formalize several families of consistency-based black-box scorers, providing generalizations and extensions of existing methods. In our experiments across multiple LLMs and datasets, we find 1) claim-response entailment consistently performs better or on par with more complex claim-level scorers, 2) claim-level scoring generally yields better results than sentence-level scoring, and 3) uncertainty-aware decoding is highly effective for improving the factuality of long-form outputs. Our framework clarifies relationships between prior methods, enables apples-to-apples comparisons, and provides practical guidance for selecting components for fine-grained UQ.
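
The three stages can be pictured as a small pipeline: split the response into claims, score each claim by how many independently sampled responses entail it, then aggregate. In the sketch below every component is a placeholder: sentence splitting stands in for claim decomposition, a substring check stands in for an entailment model, and the mean stands in for whatever aggregation a real system would choose. None of the names come from the paper.

```python
from typing import Callable, Dict, List


def split_into_claims(response: str) -> List[str]:
    """Stage 1 (decomposition): crude sentence split as a stand-in for claim extraction."""
    return [s.strip() for s in response.split(".") if s.strip()]


def claim_scores(response: str,
                 sampled_responses: List[str],
                 entails: Callable[[str, str], bool]) -> Dict[str, float]:
    """Stage 2 (unit scoring): fraction of sampled responses that entail each claim."""
    return {
        claim: sum(entails(sample, claim) for sample in sampled_responses)
        / len(sampled_responses)
        for claim in split_into_claims(response)
    }


def response_confidence(scores: Dict[str, float]) -> float:
    """Stage 3 (aggregation): mean claim score; other aggregators are possible."""
    return sum(scores.values()) / len(scores) if scores else 0.0


def toy_entails(premise: str, claim: str) -> bool:
    """Hypothetical entailment check; a real system would call an NLI model."""
    return claim.lower() in premise.lower()


if __name__ == "__main__":
    answer = "Paris is in France. Paris has 10 million residents."
    samples = ["Paris is in France. It is the capital.", "Paris is in France."]
    scores = claim_scores(answer, samples, toy_entails)
    print(scores, response_confidence(scores))
```

Claims with low scores are the candidates for hallucination; an uncertainty-aware decoder could, for example, drop or regenerate them.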

[28] AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue

Adib Sakhawat,Fardeen Sadab,Rakin Shahriar

Main category: cs.CL

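TL;DR: This paper proposes the AIDG framework to evaluate the strategic reasoning of large language models in dynamic multi-turn dialogue. Models are far better at information containment (defense) than at information extraction (offense), and the paper identifies two bottlenecks: information dynamics and constraint adherence.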

Details Motivation: Static benchmarks are insufficient for evaluating the strategic reasoning of LLMs, which calls for dynamic, multi-turn interactive settings; in particular, the asymmetry between information extraction and information containment needs examination. Method: Build AIDG (Adversarial Information Deduction Game), a game-theoretic framework with two complementary tasks: AIDG-I (pragmatic strategy in social deduction) and AIDG-II (constraint satisfaction in a structured "20 Questions" setting); test six frontier LLMs across 439 games. Result: A clear capability asymmetry: a 350-point ELO advantage on defense (Cohen's d = 5.47); confirmation strategies are 7.75x more effective than blind deduction (p < 0.00001); 41.3% of deductive failures stem from instruction-following degrading under conversational load. Conclusion: LLMs excel at local defensive coherence but are weak at strategic inquiry that requires global state tracking. Abstract: Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions. We introduce AIDG (Adversarial Information Deduction Game), a game-theoretic framework that probes the asymmetry between information extraction (active deduction) and information containment (state maintenance) in dialogue. We propose two complementary tasks: AIDG-I, measuring pragmatic strategy in social deduction, and AIDG-II, measuring constraint satisfaction in a structured "20 Questions" setting. Across 439 games with six frontier LLMs, we observe a clear capability asymmetry: models perform substantially better at containment than deduction, with a 350 ELO advantage on defense (Cohen's d = 5.47). We identify two bottlenecks driving this gap: (1) Information Dynamics, where confirmation strategies are 7.75x more effective than blind deduction (p < 0.00001), and (2) Constraint Adherence, where instruction-following degrades under conversational load, accounting for 41.3% of deductive failures. These findings suggest that while LLMs excel at local defensive coherence, they struggle with the global state tracking required for strategic inquiry.
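
For intuition about the size of a 350-point rating gap, the sketch below evaluates the standard Elo expected-score formula. It assumes the conventional logistic curve with a scale of 400, which the abstract does not confirm for AIDG, so the resulting figure is only indicative.

```python
def elo_expected_score(rating_diff: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated side: 1 / (1 + 10^(-diff / scale))."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / scale))


if __name__ == "__main__":
    # A 350-point advantage under the conventional Elo curve (assumed).
    print(round(elo_expected_score(350), 3))
```

Under that assumption, a 350-point advantage corresponds to an expected score of roughly 0.88 for the stronger side.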

[29] ABCD: All Biases Come Disguised

Mateusz Nowak,Xavier Cadet,Peter Chin

Main category: cs.CL

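TL;DR: This paper proposes an MCQ evaluation protocol that reduces label-position bias by using uniform, unordered labels and matching answers by sentence similarity, making LLM evaluation more robust.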

Details Motivation: Existing MCQ benchmarks are affected by biases such as label position and the distribution of answers in few-shot prompts, which leads to inaccurate assessment of LLMs' true reasoning ability. Method: Design the synthetic NonsenseQA benchmark to expose the biases; propose a new evaluation protocol that replaces the original labels with uniform, unordered labels, asks the model to answer with the full answer text, and uses a lightweight sentence similarity model to match the prediction to the true answer. Result: Across multiple benchmarks and models, the protocol substantially improves robustness to answer permutations, reducing mean accuracy variance 3x with only a slight drop in performance; ablations confirm it is more robust than standard methods. Conclusion: Reducing surface cues in evaluation (such as label position) reveals LLMs' knowledge and reasoning more faithfully, and the proposed protocol offers a practical path to fairer, more robust MCQ evaluation. Abstract: Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions. Through a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit varying degrees of label-position-few-shot-prompt bias, where the model either uses the answer position, the label in front of the answer, the distributions of correct answers present in the few-shot prompt, or a combination of all to answer each MCQ question. We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels and prompts the LLM to use the whole answer presented. With a simple sentence similarity model, we demonstrate improved robustness and lower standard deviation between different permutations of answers with a minimal drop in LLM's performance, exposing the LLM's capabilities under reduced evaluation artifacts, without any help from the prompt examples or the option labels. Across multiple benchmarks and models, this protocol substantially improves the robustness to answer permutations, reducing mean accuracy variance $3\times$ with only a minimal decrease in the mean model's performance. Through ablation studies on various embedding models and similarity functions, we show that the method is more robust than the standard ones.
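
The matching step of such a protocol can be sketched as follows: the model answers in free text, and the evaluator picks the option whose text is most similar to that answer, so option order and labels play no role. The paper uses a sentence similarity model; the token-level Jaccard similarity below is a dependency-free stand-in, and all names are hypothetical.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a stand-in for an embedding-based similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def match_answer(model_output: str,
                 options: List[str],
                 similarity: Callable[[str, str], float] = jaccard) -> int:
    """Return the index of the option most similar to the model's free-text answer."""
    return max(range(len(options)), key=lambda i: similarity(model_output, options[i]))


if __name__ == "__main__":
    output = "It is the cell nucleus"
    order_a = ["the mitochondria", "the cell nucleus", "the ribosome"]
    order_b = ["the cell nucleus", "the ribosome", "the mitochondria"]
    # The matched option text is the same under both orderings.
    print(order_a[match_answer(output, order_a)], order_b[match_answer(output, order_b)])
```

Because the match depends only on the option text, permuting the options changes the returned index but not the selected answer.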

[30] Entropy-Based Data Selection for Language Models

Hongming Li,Yang Liu,Chao Huang

Main category: cs.CL

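TL;DR: This paper proposes an entropy-based unsupervised data selection framework (EUDS) for efficiently fine-tuning language models under limited compute, improving training efficiency by reducing data requirements and computational cost.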

Details Motivation: Existing data selection methods can reduce the data needed for fine-tuning but usually require a high compute budget, which makes them hard to use in practical resource-constrained fine-tuning; and although large language models ease data scarcity, evaluating data usability remains challenging. Method: Propose the Entropy-Based Unsupervised Data Selection (EUDS) framework, a computationally efficient unsupervised data-filtering mechanism that combines uncertainty estimation with an entropy measure to select data. Result: Experiments on sentiment analysis, topic classification, and question answering validate EUDS: it significantly reduces computational cost, improves training time efficiency, and lowers the amount of data required. Conclusion: EUDS offers an innovative and practical solution for efficient language model fine-tuning in compute-constrained scenarios. Abstract: Modern language models (LMs) increasingly require two critical resources: computational resources and data resources. Data selection techniques can effectively reduce the amount of training data required for fine-tuning LMs. However, their effectiveness is closely related to computational resources, as they typically require a high compute budget. Owing to the resource limitations in practical fine-tuning scenarios, we systematically reveal the relationship between data selection and uncertainty estimation of selected data. Although large language models (LLMs) exhibit exceptional capabilities in language understanding and generation, which provide new ways to alleviate data scarcity, evaluating data usability remains a challenging task. This makes efficient data selection indispensable. To mitigate these issues, we propose the Entropy-Based Unsupervised Data Selection (EUDS) framework. Empirical experiments on sentiment analysis (SA), topic classification (Topic-CLS), and question answering (Q&A) tasks validate its effectiveness. EUDS establishes a computationally efficient data-filtering mechanism. Theoretical analysis and experimental results confirm the effectiveness of our approach. EUDS significantly reduces computational costs and improves training time efficiency with less data requirement. This provides an innovative solution for the efficient fine-tuning of LMs in compute-constrained scenarios.
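
Entropy-based selection generally starts from the Shannon entropy of a model's predicted class distribution for each unlabeled example. The sketch below computes that entropy and ranks examples by it. The abstract does not say how EUDS uses the ranking (for instance, whether it keeps high-entropy or low-entropy examples, or applies thresholds), so the ranking direction and function names here are assumptions for illustration.

```python
import math
from typing import List, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in nats of a probability distribution; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def rank_by_entropy(predictions: List[Sequence[float]]) -> List[int]:
    """Return example indices sorted from most to least uncertain."""
    return sorted(range(len(predictions)),
                  key=lambda i: shannon_entropy(predictions[i]),
                  reverse=True)


if __name__ == "__main__":
    # Hypothetical predicted class probabilities for three unlabeled examples.
    preds = [[0.9, 0.1], [0.5, 0.5], [1.0, 0.0]]
    print(rank_by_entropy(preds))
```

A selection rule would then take a slice of this ranking according to the data budget.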

[31] PEACE 2.0: Grounded Explanations and Counter-Speech for Combating Hate Expressions

Greta Damo,Stéphane Petiot,Elena Cabrio,Serena Villata

Main category: cs.CL

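TL;DR: This paper presents PEACE 2.0, a tool that not only analyzes and explains why a message is judged hateful (or not) but also uses retrieval-augmented generation (RAG) to automatically generate fact-grounded counter-speech replies, and explores the characteristics of counter-speech.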

Details Motivation: The surge of hate speech on online platforms causes serious social problems; automatic detection has made progress, but generating effective, well-grounded counter-speech remains an open challenge. Method: Propose the PEACE 2.0 tool, built on a retrieval-augmented generation (RAG) pipeline, to i) explain hate speech judgments based on evidence, ii) automatically generate evidence-grounded counter-speech, and iii) analyze the linguistic and content characteristics of counter-speech. Result: PEACE 2.0 supports in-depth analysis and response generation for both explicit and implicit hate speech, improving the credibility of explanations and the quality of counter-speech. Conclusion: By combining explanation and generation, PEACE 2.0 moves hate speech governance from passive detection toward active, explainable, and grounded intervention. Abstract: The increasing volume of hate speech on online platforms poses significant societal challenges. While the Natural Language Processing community has developed effective methods to automatically detect the presence of hate speech, responses to it, called counter-speech, are still an open challenge. We present PEACE 2.0, a novel tool that, besides analysing and explaining why a message is considered hateful or not, also generates a response to it. More specifically, PEACE 2.0 has three main new functionalities: leveraging a Retrieval-Augmented Generation (RAG) pipeline i) to ground hate speech explanations into evidence and facts, ii) to automatically generate evidence-grounded counter-speech, and iii) to explore the characteristics of counter-speech replies. By integrating these capabilities, PEACE 2.0 enables in-depth analysis and response generation for both explicit and implicit hateful messages.

[32] Auditing Reciprocal Sentiment Alignment: Inversion Risk, Dialect Representation and Intent Misalignment in Transformers

Nusrat Jahan Lia,Shubhashis Roy Dipta

Main category: cs.CL

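TL;DR: This paper studies cross-lingual sentiment alignment between Bengali and English and finds serious safety and representational failures in current alignment paradigms. In particular, the compressed model mDistilBERT has a 28.7% sentiment inversion rate, and the study reveals systemic problems such as "Asymmetric Empathy" and "Modern Bias". It argues for pluralistic, language-sensitive alignment and recommends adding an "Affective Stability" metric to benchmarks.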

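Details Motivation: The core of bidirectional alignment is ensuring that AI accurately understands human intent and that humans can trust AI behavior, but this loop breaks down badly across language barriers; the paper focuses on Bengali-English cross-lingual sentiment alignment, a low-resource and culturally sensitive setting, to expose how current alignment paradigms fail in multilingual environments and what that means for human-AI trust. Method: Build a cross-lingual sentiment alignment benchmark and systematically evaluate four Transformer architectures (including mDistilBERT and IndicBERT) on sentiment polarity consistency between Bengali (including the formal Sadhu register) and English; quantify the "Sentiment Inversion Rate", the degree of "Asymmetric Empathy", and the alignment error of the regional model across registers. Result: mDistilBERT shows 28.7% sentiment inversion; IndicBERT's alignment error rises by 57% on Sadhu Bengali; an "Asymmetric Empathy" effect appears, in which some models systematically dampen and others amplify the affective intensity of Bengali text; generic compression strategies are shown to harm emotional fidelity and threaten the basis of human-AI trust. Conclusion: Equitable human-AI co-evolution must abandon the single path of universal compression in favor of pluralistic, culturally grounded alignment that respects language and dialect diversity; alignment evaluation should include an "Affective Stability" metric that explicitly penalizes polarity inversions, especially for low-resource languages and dialects. Abstract: The core theme of bidirectional alignment is ensuring that AI systems accurately understand human intent and that humans can trust AI behavior. However, this loop fractures significantly across language barriers. Our research addresses Cross-Lingual Sentiment Misalignment between Bengali and English by benchmarking four transformer architectures. We reveal severe safety and representational failures in current alignment paradigms. We demonstrate that the compressed model (mDistilBERT) exhibits a 28.7% "Sentiment Inversion Rate," fundamentally misinterpreting positive user intent as negative (or vice versa). Furthermore, we identify systemic nuances affecting human-AI trust, including "Asymmetric Empathy" where some models systematically dampen and others amplify the affective weight of Bengali text relative to its English counterpart. Finally, we reveal a "Modern Bias" in the regional model (IndicBERT), which shows a 57% increase in alignment error when processing formal (Sadhu) Bengali. We argue that equitable human-AI co-evolution requires pluralistic, culturally grounded alignment that respects language and dialectal diversity over universal compression, which fails to preserve the emotional fidelity required for reciprocal human-AI trust. We recommend that alignment benchmarks incorporate "Affective Stability" metrics that explicitly penalize polarity inversions in low-resource and dialectal contexts.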
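
A sentiment inversion rate can be read as the share of parallel sentence pairs whose predicted polarity flips between positive and negative across the two languages. The sketch below implements that reading. The label names, the treatment of neutral predictions (not counted as inversions), and the function name are assumptions; the paper's exact definition is not given in the abstract.

```python
from typing import Sequence

# Pairs of labels that count as a polarity flip (assumed definition).
_OPPOSITE = {("positive", "negative"), ("negative", "positive")}


def sentiment_inversion_rate(labels_lang_a: Sequence[str],
                             labels_lang_b: Sequence[str]) -> float:
    """Share of aligned pairs whose predicted polarity flips between the two languages."""
    if len(labels_lang_a) != len(labels_lang_b) or not labels_lang_a:
        raise ValueError("label lists must be non-empty and aligned")
    flips = sum((a, b) in _OPPOSITE for a, b in zip(labels_lang_a, labels_lang_b))
    return flips / len(labels_lang_a)


if __name__ == "__main__":
    # Hypothetical predictions on four parallel Bengali/English sentences.
    bengali = ["positive", "negative", "neutral", "positive"]
    english = ["negative", "negative", "positive", "positive"]
    print(sentiment_inversion_rate(bengali, english))
```

An "Affective Stability" style metric could then be defined as one minus this rate, though that formulation is a guess rather than something stated in the abstract.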

[33] Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian

Pietro Ferrazzi,Mattia Franzin,Alberto Lavelli,Bernardo Magnini

Main category: cs.CL

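TL;DR: This paper examines how well small LLMs (around one billion parameters) perform on 20 clinical NLP tasks, finds that a fine-tuned Qwen3-1.7B outperforms the much larger Qwen3-32B, and releases several Italian medical datasets and models.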

Details Motivation: Large language models perform well on medical NLP tasks, but their high computational cost limits deployment in real healthcare settings; the paper tests whether small LLMs can serve resource-constrained environments while remaining accurate. Method: Selects small LLMs of around one billion parameters from the Llama-3, Gemma-3, and Qwen3 families and systematically evaluates several adaptation strategies on 20 clinical NLP tasks, covering few-shot prompting and constraint decoding at inference time and supervised fine-tuning and continual pre-training at training time. Result: Fine-tuning is the most effective strategy; the fine-tuned Qwen3-1.7B scores on average 9.2 points higher than Qwen3-32B; the authors release a collection of publicly available Italian medical NLP datasets, a 126M-word emergency department corpus, and a 175M-word corpus used for continual pre-training. Conclusion: With suitable adaptation, small LLMs can match or surpass larger models on many clinical NLP tasks and are viable for real healthcare settings; the released data and models should support research on low-resource medical NLP. Abstract: Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families (Llama-3, Gemma-3, and Qwen3) across 20 clinical NLP tasks among Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers strong lower-resource alternatives. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration based on Qwen3-1.7B achieving an average score +9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian Hospital, and 175M words from various sources that we used for continual pre-training.

[34] Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics

Baris Karacan,Barbara Di Eugenio,Patrick Thornton

Main category: cs.CL

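TL;DR: This paper advances clinical section segmentation by curating a new obstetrics notes dataset, systematically evaluating transformer-based supervised models, and presenting the first comparison of supervised models with zero-shot large language models for medical section segmentation. Zero-shot models prove more robust across domains, provided hallucinated section headers are corrected.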

Details Motivation: Existing clinical section segmentation methods are mostly trained on general medical corpora such as MIMIC-III and generalize poorly to specific domains such as obstetrics; annotated domain-specific data and systematic evaluation of zero-shot LLMs are both lacking. Method: (1) Curates a de-identified, section-labeled obstetrics notes dataset; (2) systematically evaluates transformer-based supervised models on a MIMIC-III subset (in-domain) and on the new obstetrics dataset (out-of-domain); (3) runs the first head-to-head comparison of supervised models and zero-shot LLMs on medical section segmentation, correcting the hallucinated headers the LLMs produce. Result: Supervised models perform strongly in-domain but drop substantially out-of-domain; zero-shot LLMs adapt well across domains once hallucinated headers are corrected. Conclusion: Domain-specific clinical NLP resources need further development; hallucination-corrected zero-shot segmentation is a promising way to extend healthcare NLP beyond well-studied corpora. Abstract: Clinical free-text notes contain vital patient information. They are structured into labelled sections; recognizing these sections has been shown to support clinical decision-making and downstream NLP tasks. In this paper, we advance clinical section segmentation through three key contributions. First, we curate a new de-identified, section-labeled obstetrics notes dataset, to supplement the medical domains covered in public corpora such as MIMIC-III, on which most existing segmentation approaches are trained. Second, we systematically evaluate transformer-based supervised models for section segmentation on a curated subset of MIMIC-III (in-domain), and on the new obstetrics dataset (out-of-domain). Third, we conduct the first head-to-head comparison of supervised models for medical section segmentation with zero-shot large language models. Our results show that while supervised models perform strongly in-domain, their performance drops substantially out-of-domain. In contrast, zero-shot models demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected. These findings underscore the importance of developing domain-specific clinical resources and highlight zero-shot segmentation as a promising direction for applying healthcare NLP beyond well-studied corpora, as long as hallucinations are appropriately managed.

[35] Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems

Zhangqi Duan,Arnav Kankaria,Dhruv Kartik,Andrew Lan

Main category: cs.CL

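TL;DR: This paper proposes a framework that uses large language models (LLMs) to automatically label the correctness of fine-grained knowledge components (KCs) in programming tasks. Combined with a temporal context-aware Code-KC mapping mechanism, it improves learning curve fit and predictive performance, and human evaluation shows substantial agreement between LLM and expert annotations.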

Details Motivation: Real-world programming datasets lack KC-level correctness labels, and simply propagating problem-level correctness to every KC hides partial mastery and produces poorly fitted learning curves. Method: Uses LLMs to judge directly from student code whether each KC is applied correctly, and introduces a temporal context-aware Code-KC mapping mechanism to align KCs more precisely with individual student code. Result: Outperforms baselines in learning curve fit (power law of practice, AFM) and in predictive performance; human evaluation shows substantial agreement between LLM and expert annotations. Conclusion: Automated LLM-based KC-level labeling is feasible and effective, and offers a new way to model unlabeled programming education data. Abstract: Fine-grained skill representations, commonly referred to as knowledge components (KCs), are fundamental to many approaches in student modeling and learning analytics. However, KC-level correctness labels are rarely available in real-world datasets, especially for open-ended programming tasks where solutions typically involve multiple KCs simultaneously. Simply propagating problem-level correctness to all associated KCs obscures partial mastery and often leads to poorly fitted learning curves. To address this challenge, we propose an automated framework that leverages large language models (LLMs) to label KC-level correctness directly from student-written code. Our method assesses whether each KC is correctly applied and further introduces a temporal context-aware Code-KC mapping mechanism to better align KCs with individual student code. We evaluate the resulting KC-level correctness labels in terms of learning curve fit and predictive performance using the power law of practice and the Additive Factors Model. Experimental results show that our framework leads to learning curves that are more consistent with cognitive theory and improves predictive performance, compared to baselines. Human evaluation further demonstrates substantial agreement between LLM and expert annotations.
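
To make "learning curve fit" concrete, the following Python sketch fits the power law of practice, error(t) = a * t^(-b), to the error rate of one KC across successive practice opportunities. It is an illustration only: the least-squares fit in log-log space and the function name `fit_power_law` are assumptions, not the evaluation procedure used in the paper.

```python
import math


def fit_power_law(error_rates):
    """Fit error(t) = a * t**(-b) by ordinary least squares in log-log space.

    error_rates[i] is the observed error rate at practice opportunity i + 1.
    All rates must be strictly positive and at least two are required.
    Returns the pair (a, b).
    """
    if len(error_rates) < 2:
        raise ValueError("need at least two practice opportunities")
    if any(rate <= 0 for rate in error_rates):
        raise ValueError("error rates must be strictly positive")
    xs = [math.log(t) for t in range(1, len(error_rates) + 1)]
    ys = [math.log(rate) for rate in error_rates]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log(error) = log(a) - b * log(t), so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope


# Synthetic curve generated from a = 0.8, b = 0.5; the fit recovers both.
rates = [0.8 * t ** -0.5 for t in range(1, 7)]
a, b = fit_power_law(rates)
print(round(a, 3), round(b, 3))  # 0.8 0.5
```

A positive b indicates that errors fall with practice. If problem-level correctness is copied to every KC, the per-KC error rates become noisier and the fitted curve tends to flatten, which is the poor fit the abstract describes.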

[36] Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

Jyotin Goel,Souvik Maji,Pratik Mazumder

Main category: cs.CL

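TL;DR: This paper proposes an adaptive regularization training framework that dynamically adjusts how strongly high-risk updates are constrained during fine-tuning, preserving safety without sacrificing utility.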

Details Motivation: Instruction-following language models tend to lose their safety behavior during fine-tuning, especially adversarial fine-tuning, and existing defenses either offer limited protection or force a trade-off between safety and utility. Method: Proposes a training framework that adapts regularization strength to safety risk, with two risk estimators: a judge-based Safety Critic that assigns harm scores to training batches, and an activation-based risk predictor that uses a lightweight classifier on intermediate model activations to estimate harmful intent. Higher-risk updates are constrained to stay close to a safe reference policy, while lower-risk updates proceed with standard training. Result: Pre-generation activations predict harmful intent, and judge scores provide effective high-recall safety guidance; across multiple model families and attack scenarios the method lowers attack success rate, preserves downstream performance, and adds no inference-time cost. Conclusion: The work provides a principled mechanism that balances safety and utility, keeping models aligned throughout continued fine-tuning. Abstract: Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effective high-recall safety guidance. Across multiple model families and attack scenarios, adaptive regularization with either risk estimation approach consistently lowers attack success rate compared to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. This work demonstrates a principled mechanism for maintaining safety without sacrificing utility.
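
The idea of risk-dependent regularization can be sketched in a few lines of Python. The abstract does not specify the loss, so everything below is an assumption made for illustration: a KL penalty toward the reference policy, a hard risk threshold, a linear weight, and the names `adaptive_loss`, `threshold`, and `max_weight`.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support.

    q must be positive wherever p is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def adaptive_loss(task_loss, policy_probs, reference_probs, risk,
                  threshold=0.5, max_weight=1.0):
    """Task loss plus a KL penalty whose weight depends on the risk signal.

    Hypothetical rule: batches with risk below `threshold` train normally
    (weight 0); riskier batches are pulled toward the safe reference policy
    with weight max_weight * risk.
    """
    weight = max_weight * risk if risk >= threshold else 0.0
    return task_loss + weight * kl_divergence(policy_probs, reference_probs)


policy = [0.5, 0.5]      # current model's next-token distribution (toy)
reference = [0.9, 0.1]   # safe reference policy's distribution (toy)

low_risk = adaptive_loss(1.0, policy, reference, risk=0.1)
high_risk = adaptive_loss(1.0, policy, reference, risk=0.8)
print(low_risk)              # 1.0, no penalty applied
print(high_risk > low_risk)  # True, penalty applied
```

In this sketch the risk value would come from either estimator described above (the Safety Critic or the activation-based predictor); the penalty only affects training, which is consistent with the claim of no inference-time cost.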

[37] Modeling Distinct Human Interaction in Web Agents

Faria Huq,Zora Zhiruo Wang,Zhanqiu Guo,Venu Arvind Arangarajan,Tianyue Ou,Frank Xu,Shuyan Zhou,Graham Neubig,Jeffrey P. Bigham

Main category: cs.CL

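TL;DR: This paper models human intervention to support collaborative web task execution. It builds the CowCorpus dataset, identifies four patterns of user interaction, trains language models that predict interventions more accurately, and shows in a live user study that the resulting agents are rated as more useful.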

Details Motivation: Current autonomous web agents lack a systematic understanding of when and why humans intervene, so they often pass critical decision points without asking or request confirmation too often; modeling human intervention is needed for more natural human-agent collaboration. Method: Builds CowCorpus, a dataset of 400 real-user web navigation trajectories, derives four user-agent interaction patterns from it, and trains language models to predict when humans will intervene. Result: Intervention prediction accuracy improves by 61.4-63.4% over base language models; after deployment, user-rated agent usefulness rises by 26.5%. Conclusion: Structured modeling of human intervention makes web agents markedly more adaptive and collaborative. Abstract: Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation. In this work, we introduce the task of modeling human intervention to support collaborative web task execution. We collect CowCorpus, a dataset of 400 real-user web navigation trajectories containing over 4,200 interleaved human and agent actions. We identify four distinct patterns of user interaction with agents -- hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover. Leveraging these insights, we train language models (LMs) to anticipate when users are likely to intervene based on their interaction styles, yielding a 61.4-63.4% improvement in intervention prediction accuracy over base LMs. Finally, we deploy these intervention-aware models in live web navigation agents and evaluate them in a user study, finding a 26.5% increase in user-rated agent usefulness. Together, our results show structured modeling of human intervention leads to more adaptive, collaborative agents.

[38] The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines?

Jayadev Billa

Main category: cs.CL

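TL;DR: This paper shows that current speech LLMs largely act as implicit ASR systems, behaviorally and mechanistically equivalent to Whisper→LLM cascades. Matched-backbone tests that control for the LLM backbone find Ultravox nearly indistinguishable from its matched cascade, whereas Qwen2-Audio genuinely diverges; under noise, speech LLMs perform worse than cascades.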

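Details Motivation: To determine whether current speech LLMs truly understand speech end to end or only perform implicit ASR and rely on text representations, and to clarify their internal mechanism and practical deployment value. Method: Uses a matched-backbone design to compare four speech LLMs with their corresponding Whisper→LLM cascades on six tasks; applies logit lens to inspect hidden-state representations, uses LEACE concept erasure to test whether text representations are causally necessary, and evaluates robustness at different signal-to-noise ratios. Result: Ultravox agrees closely with its cascade (κ=0.93); literal text appears in hidden states and is shown to be causally necessary, with accuracy collapsing to near zero after LEACE erasure; Qwen2-Audio diverges clearly from cascade behavior; at 0 dB noise, the clean-condition advantages of speech LLMs over cascades reverse by up to 7.6%. Conclusion: Most current speech LLMs are not genuine end-to-end speech understanding models but expensive implicit ASR cascades that are less robust to noise; how they behave depends on architecture, and they have no inherent, universal advantage over cascades. Abstract: Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\to$LLM cascades. We show this through matched-backbone testing across four speech LLMs and six tasks, controlling for the LLM backbone for the first time. Ultravox is statistically indistinguishable from its matched cascade ($\kappa{=}0.93$); logit lens reveals literal text emerging in hidden states; LEACE concept erasure confirms text representations are causally necessary in both architectures tested, collapsing accuracy to near-zero. Qwen2-Audio genuinely diverges, revealing cascade equivalence is architecture-dependent, not universal. For most deployed use cases, current speech LLMs are expensive cascades, and under noise, they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0 dB.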
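
The agreement statistic κ reported above is presumably Cohen's kappa, which corrects raw agreement between two systems for the agreement expected by chance; that reading is an assumption, since the abstract does not name the statistic. A minimal Python sketch, with hypothetical toy answers standing in for the outputs of a speech LLM and its matched cascade:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two aligned, non-empty label sequences")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Chance agreement: probability both systems pick the same label
    # if each drew independently from its own label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both systems always emit the same single label
    return (observed - expected) / (1.0 - expected)


speech_llm_answers = ["yes", "yes", "no", "no"]
cascade_answers = ["yes", "yes", "no", "yes"]
print(cohens_kappa(speech_llm_answers, cascade_answers))  # 0.5
```

Here the two systems agree on 3 of 4 items (0.75) while chance agreement is 0.5, giving κ = 0.5. A value of 0.93 therefore indicates agreement far beyond what matching label frequencies alone would produce.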

[39] Unmasking the Factual-Conceptual Gap in Persian Language Models

Alireza Sakhaeirad,Ali Ma'manpoosh,Arshia Hemmat

Main category: cs.CL

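TL;DR: This paper introduces DivanBench, a diagnostic benchmark focused on Persian cultural superstitions and customs. Evaluating seven Persian LLMs on three task types, it finds widespread acquiescence bias, shows that continued pretraining amplifies this bias, and reveals a marked gap between factual retrieval and situational application, indicating that current models mimic cultural surface patterns without internalizing the underlying norms.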

Details Motivation: Existing Persian NLP benchmarks have expanded into pragmatics and politeness but do not separate memorized cultural facts from the ability to reason about implicit social norms. Method: Builds DivanBench, a 315-question benchmark covering factual retrieval, paired scenario verification, and situational reasoning, and uses it to evaluate seven Persian LLMs. Result: Three problems emerge: severe acquiescence bias (models recognize appropriate behavior but fail to reject clear violations); continuous Persian pretraining amplifies this bias and weakens the ability to detect contradictions; and every model shows a 21% performance gap between factual retrieval and situational application. Conclusion: Cultural competence cannot be obtained just by scaling monolingual data; current models imitate cultural patterns on the surface without internalizing the schemas behind them. Abstract: While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms. We introduce DivanBench, a diagnostic benchmark focused on superstitions and customs: arbitrary, context-dependent rules that resist simple logical deduction. Through 315 questions across three task types (factual retrieval, paired scenario verification, and situational reasoning), we evaluate seven Persian LLMs and reveal three critical failures: most models exhibit severe acquiescence bias, correctly identifying appropriate behaviors but failing to reject clear violations; continuous Persian pretraining amplifies this bias rather than improving reasoning, often degrading the model's ability to discern contradictions; and all models show a 21% performance gap between retrieving factual knowledge and applying it in scenarios. These findings demonstrate that cultural competence requires more than scaling monolingual data, as current models learn to mimic cultural patterns without internalizing the underlying schemas.

[40] Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking

Iskar Deng,Nathalia Xu,Shane Steinert-Threlkeld

Main category: cs.CL

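TL;DR: This paper trains GPT-2 models on synthetic corpora implementing 18 different differential argument marking (DAM) systems. The models reproduce the human-language preference for natural markedness direction (marking semantically atypical arguments) but not the strong object preference (marking objects more often than subjects), suggesting that different typological regularities may come from different mechanisms.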

Details Motivation: To test whether language models trained on synthetic corpora reproduce the cross-linguistic regularities of differential argument marking (DAM) found in human languages, and so shed light on where those regularities come from. Method: Uses a controlled synthetic learning setup, training GPT-2 models on 18 synthetic corpora that implement distinct DAM systems and evaluating generalization with minimal pairs. Result: Models reliably reproduce the human preference for the natural markedness direction, in which overt marking targets semantically atypical arguments, but do not reproduce the strong tendency of human languages to mark objects rather than subjects. Conclusion: The two typological tendencies of DAM (markedness direction and argument role preference) may arise from different cognitive or learning mechanisms. Abstract: Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order. In this paper, we extend this paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. Using a controlled synthetic learning method, we train GPT-2 models on 18 corpora implementing distinct DAM systems and evaluate their generalization using minimal pairs. Our results reveal a dissociation between two typological dimensions of DAM. Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. In contrast, models do not reproduce the strong object preference in human languages, in which overt marking in DAM more often targets objects rather than subjects. These findings suggest that different typological tendencies may arise from distinct underlying sources.

[41] What Language is This? Ask Your Tokenizer

Clara Meister,Ahmetcan Yavuz,Pietro Lesci,Tiago Pimentel

Main category: cs.CL

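TL;DR: This paper proposes UniLID, a language identification method based on the UnigramLM tokenization algorithm. It substantially improves performance in low-resource and closely related language settings and is efficient, extensible, and easy to integrate.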

Details Motivation: Existing language identification systems are brittle in low-resource and closely related language settings, so a more robust, efficient, and extensible method is needed. Method: Proposes UniLID, which builds on the UnigramLM algorithm to learn language-conditional unigram distributions over a shared vocabulary while treating segmentation as a language-specific process. Result: UniLID is competitive with fastText, GlotLID, and CLD3 on standard benchmarks; in low-resource settings it exceeds 70% accuracy with only five samples per language; it delivers large gains on fine-grained dialect identification. Conclusion: UniLID is a simple, efficient, incrementally extensible approach to language identification that suits low-resource and fine-grained settings in particular and fits directly into existing LLM tokenization pipelines. Abstract: Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models. Despite near-perfect performance on high-resource languages, existing systems remain brittle in low-resource and closely related language settings. We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm, leveraging its probabilistic framing, parameter estimation technique and inference strategy. In short, we learn language-conditional unigram distributions over a shared tokenizer vocabulary but treat segmentation as a language-specific phenomenon. Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models, and can naturally be integrated into existing language model tokenization pipelines. Empirical evaluations against widely used baselines, including fastText, GlotLID, and CLD3, show that UniLID achieves competitive performance on standard benchmarks, substantially improves sample efficiency in low-resource settings (surpassing 70% accuracy with as few as five labeled samples per language), and delivers large gains on fine-grained dialect identification.
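
The core idea, language-conditional unigram distributions over a shared vocabulary with language-specific segmentation, can be sketched as follows. This is not the UniLID implementation: scoring each language by its best (Viterbi) segmentation is an assumption, and the toy vocabulary, probabilities, and function names are hypothetical.

```python
import math


def best_segmentation_logprob(text, token_logprobs, max_token_len=8):
    """Viterbi score of the best segmentation of `text` under one language's
    unigram distribution. Returns -inf if the text cannot be segmented."""
    best = [-math.inf] * (len(text) + 1)
    best[0] = 0.0
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_token_len), end):
            piece = text[start:end]
            if piece in token_logprobs and best[start] > -math.inf:
                best[end] = max(best[end], best[start] + token_logprobs[piece])
    return best[-1]


def identify_language(text, models):
    """Pick the language whose unigram model gives the best segmentation score."""
    scores = {lang: best_segmentation_logprob(text, logprobs)
              for lang, logprobs in models.items()}
    return max(scores, key=scores.get)


def to_logprobs(probs):
    return {token: math.log(p) for token, p in probs.items()}


# Shared toy vocabulary; only the per-language probabilities differ.
CHARS = {c: 0.07 for c in "thedr"}
TOY_MODELS = {
    "english": to_logprobs({"the": 0.60, "der": 0.05, **CHARS}),
    "german": to_logprobs({"der": 0.60, "the": 0.05, **CHARS}),
}

print(identify_language("the", TOY_MODELS))  # english
print(identify_language("der", TOY_MODELS))  # german
```

Because each language is just a probability table over the same vocabulary, adding a language in this sketch means adding one more entry to `TOY_MODELS`, which mirrors the incremental-extension property claimed in the abstract.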

[42] Sink-Aware Pruning for Diffusion Language Models

Aidar Myrzakhan,Tianyi Li,Bowei Guo,Shengkun Tang,Zhiqiang Shen

Main category: cs.CL

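TL;DR: This paper proposes Sink-Aware Pruning, a pruning method for diffusion language models (DLMs). It observes that attention sink positions in DLMs vary strongly over time, unlike the stable sinks of autoregressive models, and should therefore be identified and pruned when unstable. The method needs no retraining and achieves a better quality-efficiency trade-off.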

Details Motivation: Diffusion language models (DLMs) are expensive at inference because of multi-step denoising and need efficient pruning, but existing pruning strategies mostly reuse the heuristic from autoregressive LLMs (AR LLMs) of preserving attention sinks, an assumption that does not hold for DLMs. Method: Analyzes how far the dominant sink positions shift across timesteps during DLM generation and finds high variance; based on this, proposes Sink-Aware Pruning, which automatically identifies and prunes unstable sink tokens instead of keeping sinks by default as in AR models. Result: Without retraining, the method outperforms several strong pruning baselines under matched compute and achieves a better quality-efficiency balance. Conclusion: Attention sinks in DLMs are often transient and less structurally essential than in AR models, so pruning should treat them differently; Sink-Aware Pruning offers a new approach to efficient DLM inference. Abstract: Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics, largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose Sink-Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA-Lab/Sink-Aware-Pruning.

cs.CV [Back]

[43] Three-dimensional Damage Visualization of Civil Structures via Gaussian Splatting-enabled Digital Twins

Shuo Wang,Shuo Wang,Xin Nie,Yasutaka Narazaki,Thomas Matiki,Billie F. Spencer

Main category: cs.CV

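TL;DR: This paper proposes a Gaussian Splatting (GS)-based digital twin method for 3D damage visualization of civil infrastructure. It is more efficient than NeRF and supports multi-scale reconstruction and updates as damage evolves over time.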

Details Motivation: Traditional 2D image-based damage identification cannot meet the need for precise 3D damage visualization in infrastructure inspection, and existing methods such as NeRF fall short in efficiency or in featureless regions, so a better 3D representation is needed. Method: Uses Gaussian Splatting (GS) for 3D reconstruction and maps 2D damage segmentation results into 3D space to reduce segmentation errors; designs a multi-scale reconstruction strategy that balances efficiency and detail; supports updating the digital twin as damage evolves. Result: The method is validated on an open-source synthetic post-earthquake dataset and delivers high-fidelity, efficient, updatable 3D damage visualization. Conclusion: GS-enabled digital twins are a promising, application-oriented approach to 3D damage visualization for civil infrastructure. Abstract: Recent advancements in civil infrastructure inspections underscore the need for precise three-dimensional (3D) damage visualization on digital twins, transcending traditional 2D image-based damage identifications. Compared to conventional photogrammetric 3D reconstruction techniques, modern approaches such as Neural Radiance Field (NeRF) and Gaussian Splatting (GS) excel in scene representation, rendering quality, and handling featureless regions. Among them, GS stands out for its efficiency, leveraging discrete anisotropic 3D Gaussians to represent radiance fields, unlike NeRF's continuous implicit model. This study introduces a GS-enabled digital twin method tailored for effective 3D damage visualization. The method's key contributions include: 1) utilizing GS-based 3D reconstruction to visualize 2D damage segmentation results while reducing segmentation errors; 2) developing a multi-scale reconstruction strategy to balance efficiency and damage detail; 3) enabling digital twin updates as damage evolves over time. Demonstrated on an open-source synthetic dataset for post-earthquake inspections, the proposed approach offers a promising solution for comprehensive 3D damage visualization in civil infrastructure digital twins.

[44] Analytic Score Optimization for Multi Dimension Video Quality Assessment

Boda Lin,Yongjie Zhu,Wenyu Qin,Meng Wang,Pengfei Wan

Main category: cs.CV

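TL;DR: This paper proposes a multi-dimensional approach to video quality assessment (VQA). It builds the large-scale multi-dimensional dataset UltraVQA and designs the theoretically grounded Analytic Score Optimization (ASO) method, which improves discrete quality score prediction and alignment with human preferences.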

Details Motivation: Traditional VQA is limited to a single MOS score and cannot capture the complex quality characteristics of user-generated content (UGC) across motion, aesthetics, clarity, and other aspects; a richer, more interpretable evaluation scheme that follows human judgment is needed. Method: Builds UltraVQA, a large-scale UGC video dataset with five quality dimensions, fine-grained sub-attribute labels, and GPT-generated explanatory rationales; proposes Analytic Score Optimization (ASO), which models quality assessment as a regularized ordinal decision process and derives a closed-form solution aligned with human ranking preferences. Result: ASO outperforms most baselines, including closed-source APIs and open-source models, and reduces the mean absolute error (MAE) of quality prediction. Conclusion: Multi-dimensional, interpretable annotation combined with reinforcement-based alignment is a key path toward more realistic and robust VQA. Abstract: Video Quality Assessment (VQA) is evolving beyond single-number mean opinion score toward richer, multi-faceted evaluations of video content. In this paper, we present a large-scale multi-dimensional VQA dataset UltraVQA that encompasses diverse User-Generated Content (UGC) annotated across five key quality dimensions: Motion Quality, Motion Amplitude, Aesthetic Quality, Content Quality, and Clarity Quality. Each video in our dataset is scored by over 3 human raters on these dimensions, with fine-grained sub-attribute labels, and accompanied by an explanatory rationale generated by GPT based on the collective human judgments. To better leverage these rich annotations and improve discrete quality score assessment, we introduce Analytic Score Optimization (ASO), a theoretically grounded post-training objective derived for multi-dimensional VQA. By reframing quality assessment as a regularized decision-making process, we obtain a closed-form solution that naturally captures the ordinal nature of human ratings, ensuring alignment with human ranking preferences. In experiments, our method outperforms most baselines including closed-source APIs and open-source models, while also reducing mean absolute error (MAE) in quality prediction. Our work highlights the importance of multi-dimensional, interpretable annotations and reinforcement-based alignment in advancing video quality assessment.

[45] DODO: Discrete OCR Diffusion Models

Sean Man,Roy Ganz,Roi Ronen,Shahar Tsiper,Shai Mazor,Niv Nayman

Main category: cs.CV

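TL;DR: This paper proposes DODO, the first model to apply block discrete diffusion to OCR. Generating text block by block mitigates the synchronization errors of global diffusion, giving up to 3x faster inference at near state-of-the-art accuracy.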

Details Motivation: Vision-language models that rely on autoregressive decoding are computationally expensive and slow for OCR; OCR is a highly deterministic task that in theory suits parallel decoding, but existing masked diffusion models are structurally unstable and cannot meet OCR's strict exact-match requirement. Method: Proposes DODO, which uses block discrete diffusion to decompose text generation into blocks that are modeled in parallel, mitigating the synchronization errors of global diffusion. Result: DODO reaches near state-of-the-art accuracy on OCR and runs up to 3x faster than autoregressive baselines. Conclusion: Block discrete diffusion is an effective way to speed up OCR inference; DODO shows that large speedups are possible while keeping accuracy high, and suggests a direction for other deterministic vision-language tasks. Abstract: Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLM) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents as it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; they introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference compared to autoregressive baselines.

[46] StereoAdapter-2: Globally Structure-Consistent Underwater Stereo Depth Estimation

Zeyu Ren,Xiang Li,Yiran Wang,Zeyu Zhang,Hao Tang

Main category: cs.CV

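TL;DR: This paper proposes StereoAdapter-2, which replaces the conventional ConvGRU with a ConvSS2D operator based on selective state space models to improve underwater stereo depth estimation. It also builds the large-scale synthetic dataset UW-StereoDepth-80K and achieves state-of-the-art zero-shot results.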

Details Motivation: Underwater stereo depth estimation suffers from severe domain shift caused by wavelength-dependent light attenuation, scattering, and refraction. Existing methods that combine monocular foundation models with GRU-based iterative refinement are limited in large-disparity and textureless regions, because the sequential gating and local convolutions of GRUs propagate long-range disparity information inefficiently. Method: Proposes a new ConvSS2D update operator with a four-directional scanning strategy that matches epipolar geometry and preserves vertical structural consistency, enabling efficient long-range spatial propagation in a single step; builds the UW-StereoDepth-80K synthetic dataset with a two-stage pipeline of semantic-aware style transfer and geometry-consistent novel view synthesis; combines these with dynamic LoRA adaptation. Result: Zero-shot performance improves by 17% on TartanAir-UW and by 7.2% on SQUID, and real-world tests on the BlueROV2 platform confirm robustness. Conclusion: By improving state space modeling and using high-quality synthetic data, StereoAdapter-2 substantially improves the zero-shot generalization and deployment robustness of underwater stereo matching. Abstract: Stereo depth estimation is fundamental to underwater robotic perception, yet suffers from severe domain shifts caused by wavelength-dependent light attenuation, scattering, and refraction. Recent approaches leverage monocular foundation models with GRU-based iterative refinement for underwater adaptation; however, the sequential gating and local convolutional kernels in GRUs necessitate multiple iterations for long-range disparity propagation, limiting performance in large-disparity and textureless underwater regions. In this paper, we propose StereoAdapter-2, which replaces the conventional ConvGRU updater with a novel ConvSS2D operator based on selective state space models. The proposed operator employs a four-directional scanning strategy that naturally aligns with epipolar geometry while capturing vertical structural consistency, enabling efficient long-range spatial propagation within a single update step at linear computational complexity. Furthermore, we construct UW-StereoDepth-80K, a large-scale synthetic underwater stereo dataset featuring diverse baselines, attenuation coefficients, and scattering parameters through a two-stage generative pipeline combining semantic-aware style transfer and geometry-consistent novel view synthesis. Combined with dynamic LoRA adaptation inherited from StereoAdapter, our framework achieves state-of-the-art zero-shot performance on underwater benchmarks, with a 17% improvement on TartanAir-UW and a 7.2% improvement on SQUID. Real-world validation on the BlueROV2 platform demonstrates the robustness of our approach.
Code: https://github.com/AIGeeksGroup/StereoAdapter-2. Website: https://aigeeksgroup.github.io/StereoAdapter-2.

[47] SemCovNet: Towards Fair and Semantic Coverage-Aware Learning for Underrepresented Visual Concepts

Sakib Ahammed,Xia Cui,Xinqi Fan,Wenqi Lu,Moi Hoon Yap

Main category: cs.CV

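TL;DR: This paper proposes the Semantic Coverage-Aware Network (SemCovNet) to address Semantic Coverage Imbalance (SCI) in vision models, using a Semantic Descriptor Map, Descriptor Attention Modulation, and a Descriptor-Visual Alignment loss to improve semantic fairness and model reliability.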

Details Motivation: Existing vision datasets exhibit Semantic Coverage Imbalance (SCI), a long-tailed bias at the semantic level rather than the class level, which affects how models learn and reason about rare but meaningful semantics. Method: Proposes SemCovNet, which combines a Semantic Descriptor Map (SDM) for learning semantic representations, a Descriptor Attention Modulation (DAM) module that dynamically weights visual and concept features, and a Descriptor-Visual Alignment (DVA) loss that aligns visual features with semantic descriptors; introduces a Coverage Disparity Index (CDI) to quantify semantic fairness. Result: Experiments on multiple datasets show that SemCovNet substantially reduces CDI and improves model reliability and semantic fairness, giving more balanced and interpretable visual learning. Conclusion: The paper is the first to explicitly define and quantify SCI, shows that it is a measurable and correctable bias, and lays a foundation for semantic fairness and interpretable vision learning. Abstract: Modern vision models increasingly rely on rich semantic representations that extend beyond class labels to include descriptive concepts and contextual attributes. However, existing datasets exhibit Semantic Coverage Imbalance (SCI), a previously overlooked bias arising from the long-tailed semantic representations. Unlike class imbalance, SCI occurs at the semantic level, affecting how models learn and reason about rare yet meaningful semantics. To mitigate SCI, we propose Semantic Coverage-Aware Network (SemCovNet), a novel model that explicitly learns to correct semantic coverage disparities. SemCovNet integrates a Semantic Descriptor Map (SDM) for learning semantic representations, a Descriptor Attention Modulation (DAM) module that dynamically weights visual and concept features, and a Descriptor-Visual Alignment (DVA) loss that aligns visual features with descriptor semantics. We quantify semantic fairness using a Coverage Disparity Index (CDI), which measures the alignment between coverage and error. Extensive experiments across multiple datasets demonstrate that SemCovNet enhances model reliability and substantially reduces CDI, achieving fairer and more equitable performance. This work establishes SCI as a measurable and correctable bias, providing a foundation for advancing semantic fairness and interpretable vision learning.

[48] Xray-Visual Models: Scaling Vision models on Industry Scale Data

Shlok Mishra,Tsung-Yu Lin,Linda Wang,Hongli Xu,Yimin Liu,Michael Hsu,Chaitanya Ahuja,Hao Yuan,Jianpeng Cheng,Hong-You Chen,Haoyuan Xu,Chao Li,Abhijeet Awasthi,Jihye Moon,Don Husa,Michael Ge,Sumedha Singla,Arkabandhu Chowdhury,Phong Dingh,Satya Narayan Shukla,Yonghuan Yang,David Jacobs,Qi Guo,Jun Xiao,Xiangjun Fan,Aashu Singh

Main category: cs.CV

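TL;DR: Xray-Visual is a unified vision model trained on large-scale social media data that covers both image and video understanding. It uses a three-stage training strategy and an efficient ViT architecture (EViT), reaches state-of-the-art results on multiple benchmarks, and shows strong robustness and cross-modal retrieval ability.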

Details Motivation: To address the difficulty of training vision models on large-scale, multi-source, noisy social media data, and the limited efficiency and generalization of joint image and video modeling. Method: Proposes a three-stage training pipeline (self-supervised MAE, semi-supervised hashtag classification, CLIP-style contrastive learning), uses a Vision Transformer enhanced with efficient token reorganization (EViT), and introduces LLMs as text encoders (LLM2CLIP); the data comprise 15 billion image-text pairs and 10 billion video-hashtag pairs, curated with balancing and noise suppression strategies. Result: State-of-the-art results on ImageNet, Kinetics, HMDB51, MSCOCO, and other benchmarks; strong robustness to domain shift and adversarial perturbations; LLM2CLIP significantly improves cross-modal retrieval and generalization in real-world settings. Conclusion: Xray-Visual sets a new reference point for scalable, accurate, and efficient multimodal vision models and supports the value of industry-scale social media data combined with staged training. Abstract: We present Xray-Visual, a unified vision model architecture for large-scale image and video understanding trained on industry-scale social media data. Our model leverages over 15 billion curated image-text pairs and 10 billion video-hashtag pairs from Facebook and Instagram, employing robust data curation pipelines that incorporate balancing and noise suppression strategies to maximize semantic diversity while minimizing label noise. We introduce a three-stage training pipeline that combines self-supervised MAE, semi-supervised hashtag classification, and CLIP-style contrastive learning to jointly optimize image and video modalities. Our architecture builds on a Vision Transformer backbone enhanced with efficient token reorganization (EViT) for improved computational efficiency. Extensive experiments demonstrate that Xray-Visual achieves state-of-the-art performance across diverse benchmarks, including ImageNet for image classification, Kinetics and HMDB51 for video understanding, and MSCOCO for cross-modal retrieval. The model exhibits strong robustness to domain shift and adversarial perturbations. We further demonstrate that integrating large language models as text encoders (LLM2CLIP) significantly enhances retrieval performance and generalization capabilities, particularly in real-world environments. Xray-Visual establishes new benchmarks for scalable, multimodal vision models, while maintaining superior accuracy and computational efficiency.

[49] HS-3D-NeRF: 3D Surface and Hyperspectral Reconstruction From Stationary Hyperspectral Images Using Multi-Channel NeRFs

Kibon Ku,Talukder Z. Jubery,Adarsh Krishnamurthy,Baskar Ganapathysubramanian

Main category: cs.CV

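TL;DR: This paper proposes HSI-SC-NeRF, a stationary-camera multi-channel neural radiance field framework for high-throughput, high-fidelity hyperspectral 3D reconstruction of agricultural produce, aimed at postharvest inspection.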

Details Motivation: Existing approaches that combine hyperspectral imaging with 3D reconstruction need complex hardware and are hard to scale; conventional NeRF requires a moving camera, which limits throughput and reproducibility in agricultural settings. Method: Builds an acquisition system with a stationary camera and a rotating object (Teflon chamber plus ArUco calibration); proposes a multi-channel NeRF that jointly optimizes reconstruction across all spectral bands; uses a composite spectral loss and two-stage training (geometric initialization followed by radiometric refinement). Result: Experiments on three agricultural produce samples show high spatial reconstruction accuracy and strong spectral fidelity across the visible and near-infrared range. Conclusion: HSI-SC-NeRF addresses the problem of high-throughput, high-fidelity hyperspectral 3D reconstruction for automated agricultural phenotyping and has potential for practical deployment. Abstract: Advances in hyperspectral imaging (HSI) and 3D reconstruction have enabled accurate, high-throughput characterization of agricultural produce quality and plant phenotypes, both essential for advancing agricultural sustainability and breeding programs. HSI captures detailed biochemical features of produce, while 3D geometric data substantially improves morphological analysis. However, integrating these two modalities at scale remains challenging, as conventional approaches involve complex hardware setups incompatible with automated phenotyping systems. Recent advances in neural radiance fields (NeRF) offer computationally efficient 3D reconstruction but typically require moving-camera setups, limiting throughput and reproducibility in standard indoor agricultural environments. To address these challenges, we introduce HSI-SC-NeRF, a stationary-camera multi-channel NeRF framework for high-throughput hyperspectral 3D reconstruction targeting postharvest inspection of agricultural produce. Multi-view hyperspectral data is captured using a stationary camera while the object rotates within a custom-built Teflon imaging chamber providing diffuse, uniform illumination. Object poses are estimated via ArUco calibration markers and transformed to the camera frame of reference through simulated pose transformations, enabling standard NeRF training on stationary-camera data. A multi-channel NeRF formulation optimizes reconstruction across all hyperspectral bands jointly using a composite spectral loss, supported by a two-stage training protocol that decouples geometric initialization from radiometric refinement.
Experiments on three agricultural produce samples demonstrate high spatial reconstruction accuracy and strong spectral fidelity across the visible and near-infrared spectrum, confirming the suitability of HSI-SC-NeRF for integration into automated agricultural workflows.

[50] DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers

Dahye Kim,Deepti Ghadiyaram,Raghudeep Gadde

Main category: cs.CV

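TL;DR: This paper proposes dynamic tokenization, a strategy that improves the inference efficiency of Diffusion Transformers (DiTs) for image and video generation by adapting the patch size across denoising stages, substantially reducing compute while preserving generation quality.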

Details Motivation: DiTs achieve state-of-the-art results but are computationally expensive because tokenization is fixed (constant patch size); variation in content complexity and denoising stage goes unexploited, leaving room for optimization. Method: Dynamic tokenization adjusts the patch size according to content complexity and the current denoising timestep: coarse (large) patches model global structure early, and fine (small) patches refine local detail later; the strategy is applied on the fly at inference. Result: Up to 3.52× and 3.2× inference speedup on FLUX-1.Dev and Wan 2.1, respectively, without harming generation quality or prompt adherence. Conclusion: Dynamic tokenization is an efficient, plug-and-play test-time optimization that markedly improves DiT generation efficiency and offers a practical path to high-resolution image and video generation. Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content's complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps only require coarser patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation and substantially reduces cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to $3.52\times$ and $3.2\times$ speedup on FLUX-1.Dev and Wan 2.1, respectively, without compromising the generation quality and prompt adherence.
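
To make the coarse-to-fine idea concrete, here is a minimal sketch of a timestep-only patch schedule and the token counts it implies. The function names, the three patch sizes, and the equal-length stages are illustrative assumptions, not the authors' scheduler, which also takes content complexity into account.

```python
def patch_size_for_step(step: int, num_steps: int, sizes=(4, 2, 1)) -> int:
    """Return a patch size for a denoising step, moving from coarse to fine.

    Hypothetical rule: split the trajectory into equal stages, one per size.
    """
    stage = min(step * len(sizes) // num_steps, len(sizes) - 1)
    return sizes[stage]


def token_count(height: int, width: int, patch: int) -> int:
    """Number of tokens when an H x W latent is cut into patch x patch tiles."""
    return (height // patch) * (width // patch)


def relative_token_cost(height: int, width: int, num_steps: int, sizes=(4, 2, 1)) -> float:
    """Total tokens processed under the schedule, relative to always using the finest patch."""
    dynamic = sum(
        token_count(height, width, patch_size_for_step(s, num_steps, sizes))
        for s in range(num_steps)
    )
    fixed = num_steps * token_count(height, width, min(sizes))
    return dynamic / fixed
```

Token count is only a proxy for compute: self-attention scales faster than linearly in the number of tokens, so the actual savings of any such schedule depend on the model.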

[51] Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling

Divyam Madaan,Sumit Chopra,Kyunghyun Cho

Main category: cs.CV

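TL;DR: This paper proposes PRIMO, a model for multimodal data with missing modalities. It represents the missing modality with a latent variable and samples from it at inference to estimate its effect on the prediction, maintaining performance when modalities are incomplete and quantifying each modality's predictive contribution at the level of individual instances.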

Details Motivation: Most existing multimodal large language models (MLLMs) assume that all modalities are available during training and inference, but in practice multimodal data is often missing, collected asynchronously, or complete for only a subset of examples, so methods that make effective use of incomplete multimodal data are needed. Method: PRIMO is a supervised latent-variable imputation model that represents the missing modality as a latent variable tied to the observed modality in the context of the prediction task; at inference, it draws many samples from the latent distribution to obtain the marginal predictive distribution and to analyze how the missing modality affects each instance's prediction. Result: On a synthetic XOR dataset, Audio-Vision MNIST, and MIMIC-III (mortality and ICD-9 prediction), PRIMO performs close to unimodal baselines when a modality is fully missing and close to multimodal baselines when all modalities are present; it also introduces a variance-based, instance-level measure of modality impact and visualizes the plausible label distributions produced by different completions of the missing modality. Conclusion: PRIMO trains and predicts effectively on multimodal data with missing modalities, combining predictive performance with interpretability, and offers a new approach to learning from incomplete multimodal data. Abstract: Despite the recent success of Multimodal Large Language Models (MLLMs), existing approaches predominantly assume the availability of multiple modalities during training and inference. In practice, multimodal data is often incomplete because modalities may be missing, collected asynchronously, or available only for a subset of examples. In this work, we propose PRIMO, a supervised latent-variable imputation model that quantifies the predictive impact of any missing modality within the multimodal learning setting. PRIMO enables the use of all available training examples, whether modalities are complete or partial. Specifically, it models the missing modality through a latent variable that captures its relationship with the observed modality in the context of prediction. During inference, we draw many samples from the learned distribution over the missing modality to both obtain the marginal predictive distribution (for the purpose of prediction) and analyze the impact of the missing modalities on the prediction for each instance. We evaluate PRIMO on a synthetic XOR dataset, Audio-Vision MNIST, and MIMIC-III for mortality and ICD-9 prediction. Across all datasets, PRIMO obtains performance comparable to unimodal baselines when a modality is fully missing and to multimodal baselines when all modalities are available. PRIMO quantifies the predictive impact of a modality at the instance level using a variance-based metric computed from predictions across latent completions. 
We visually demonstrate how varying completions of the missing modality result in a set of plausible labels.
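
The sampling step described above can be illustrated with a small sketch: given the predicted probability of the positive class under several sampled completions of the missing modality, the mean gives a marginal prediction and the variance gives an instance-level impact score. The function names and the use of population variance are assumptions made for illustration; the abstract does not specify the exact form of PRIMO's metric.

```python
from statistics import mean, pvariance


def marginal_prediction(probs_per_completion: list[float]) -> float:
    """Average the predicted probability over sampled latent completions."""
    return mean(probs_per_completion)


def modality_impact(probs_per_completion: list[float]) -> float:
    """Variance of the prediction across completions (assumed form of the metric).

    Near zero: the missing modality barely changes the prediction for this instance.
    Large: the prediction depends strongly on what the missing modality contains.
    """
    return pvariance(probs_per_completion)
```

For example, predictions of `[0.5, 0.5, 0.5]` give an impact of 0, while `[0.0, 1.0]` gives 0.25, the largest value possible for probabilities.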

[52] Patch-Based Spatial Authorship Attribution in Human-Robot Collaborative Paintings

Eric Chen,Patricia Alves-Oliveira

Main category: cs.CV

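TL;DR: This paper proposes a patch-based framework for spatial authorship attribution in human-robot collaborative paintings and validates it in a case study of 15 abstract paintings. The method reaches 88.8% patch-level accuracy, outperforming several baselines, and uses conditional Shannon entropy to quantify stylistic overlap, suggesting that the model detects mixed authorship rather than misclassifying.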

Details Motivation: As agentic AI takes a growing part in creative production, establishing authorship matters to artists, collectors, and legal contexts; in collaborative art the ground-truth authorship labels are often ambiguous, so interpretable and verifiable attribution methods are needed. Method: A patch-based spatial authorship attribution framework, with data captured on commodity flatbed scanners and evaluated by leave-one-painting-out cross-validation; conditional Shannon entropy quantifies the stylistic overlap between human and robot and is checked against manually annotated hybrid regions. Result: 88.8% patch-level accuracy and 86.7% painting-level accuracy (majority vote), ahead of texture-based and pretrained-feature baselines (68.0%-84.7%); hybrid regions show 64% higher conditional entropy than pure paintings (p=0.003), suggesting that the model detects genuine mixed authorship. Conclusion: Although the trained model currently applies only to this specific human-robot pair, it provides a sample-efficient methodological basis for authorship attribution in data-scarce human-AI creative workflows, and may in the future extend to any human-robot collaborative painting. Abstract: As agentic AI becomes increasingly involved in creative production, documenting authorship has become critical for artists, collectors, and legal contexts. We present a patch-based framework for spatial authorship attribution within human-robot collaborative painting practice, demonstrated through a forensic case study of one human artist and one robotic system across 15 abstract paintings. Using commodity flatbed scanners and leave-one-painting-out cross-validation, the approach achieves 88.8% patch-level accuracy (86.7% painting-level via majority vote), outperforming texture-based and pretrained-feature baselines (68.0%-84.7%). For collaborative artworks, where ground truth is inherently ambiguous, we use conditional Shannon entropy to quantify stylistic overlap; manually annotated hybrid regions exhibit 64% higher uncertainty than pure paintings (p=0.003), suggesting the model detects mixed authorship rather than classification failure. The trained model is specific to this human-robot pair but provides a methodological grounding for sample-efficient attribution in data-scarce human-AI creative workflows that, in the future, has the potential to extend authorship attribution to any human-robot collaborative painting.
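
The two aggregation steps mentioned in this entry, painting-level majority voting and entropy as an uncertainty measure, can be sketched as follows. This is an illustration under assumptions: the entropy here is computed over a patch's predicted class probabilities, which may differ from the exact conditional entropy the authors compute, and all names are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(class_probs: list[float]) -> float:
    """Shannon entropy (in bits) of one patch's predicted class distribution."""
    return -sum(p * math.log2(p) for p in class_probs if p > 0)


def painting_label(patch_labels: list[str]) -> str:
    """Painting-level label by majority vote over patch-level predictions."""
    return Counter(patch_labels).most_common(1)[0][0]
```

A patch predicted as `[0.5, 0.5]` between human and robot has the maximum entropy of 1 bit, while a confident `[1.0, 0.0]` prediction has entropy 0; averaging such values over a region gives one way to compare hybrid and pure areas.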

[53] PartRAG: Retrieval-Augmented Part-Level 3D Generation and Editing

Peize Li,Zeyu Zhang,Hao Tang

Main category: cs.CV

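TL;DR: PartRAG is a retrieval-augmented framework for single-image 3D generation. It uses hierarchical contrastive retrieval to bring diverse, physically plausible part priors in from an external part database, and supports localized, editable part-level operations in a shared canonical space, markedly improving geometric accuracy and multi-view consistency.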

Details Motivation: Existing single-image 3D generation methods struggle to cover the long tail of part geometries and to maintain multi-view consistency, and they offer little support for precise local edits. Method: The PartRAG framework comprises (1) a hierarchical contrastive retrieval module that aligns image patches with 3D part latents at both part and object level and retrieves from a bank of 1,236 part-annotated assets, and (2) a masked part-level editor that performs part swaps, attribute refinements, and compositional updates in a shared canonical space. Result: On Objaverse, Chamfer Distance drops from 0.1726 to 0.1528 and F-Score rises from 0.7472 to 0.844; inference takes 38 s and interactive edits take only 5-8 s; qualitatively, part boundaries are sharper, thin structures are better preserved, and articulated objects are handled robustly. Conclusion: Through retrieval augmentation and an editable canonical representation, PartRAG eases the twin challenges of generalizing over part geometry and providing local control, offering a new approach to single-image 3D generation. Abstract: Single-image 3D generation with part-level structure remains challenging: learned priors struggle to cover the long tail of part geometries and maintain multi-view consistency, and existing systems provide limited support for precise, localized edits. We present PartRAG, a retrieval-augmented framework that integrates an external part database with a diffusion transformer to couple generation with an editable representation. To overcome the first challenge, we introduce a Hierarchical Contrastive Retrieval module that aligns dense image patches with 3D part latents at both part and object granularity, retrieving from a curated bank of 1,236 part-annotated assets to inject diverse, physically plausible exemplars into denoising. To overcome the second challenge, we add a masked, part-level editor that operates in a shared canonical space, enabling swaps, attribute refinements, and compositional updates without regenerating the whole object while preserving non-target parts and multi-view consistency. PartRAG achieves competitive results on Objaverse, ShapeNet, and ABO (reducing Chamfer Distance from 0.1726 to 0.1528 and raising F-Score from 0.7472 to 0.844 on Objaverse), with inference in 38 s and interactive edits in 5-8 s. Qualitatively, PartRAG produces sharper part boundaries, better thin-structure fidelity, and robust behavior on articulated objects. Code: https://github.com/AIGeeksGroup/PartRAG. Website: https://aigeeksgroup.github.io/PartRAG.

[54] Amber-Image: Efficient Compression of Large-Scale Diffusion Transformers

Chaojie Yang,Tian Li,Yue Zhang,Jun Gao

Main category: cs.CV

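TL;DR: This paper proposes an efficient compression framework that turns Qwen-Image, a 60-layer dual-stream MMDiT model, into the lightweight T2I models Amber-Image (10B and 6B). Using timestep-sensitive pruning, local weight averaging, layer-wise distillation, and progressive distillation, it cuts parameters by 70% and keeps the GPU training cost under 2,000 hours while retaining high-fidelity image generation and strong text rendering.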

Details Motivation: DiT architectures perform strongly on text-to-image generation but are costly to compute and hard to deploy, calling for an efficient compression method that avoids training from scratch. Method: A compression framework built on timestep-sensitive depth pruning: Amber-Image-10B uses local weight averaging for initialization, then layer-wise distillation and full-parameter fine-tuning; Amber-Image-6B additionally adopts a hybrid-stream architecture (deep-layer dual streams converted to a single stream initialized from the image branch), followed by progressive distillation and lightweight fine-tuning. No large-scale data engineering is needed at any stage. Result: Parameters are reduced by 70%, and compression plus training takes fewer than 2,000 GPU hours in total; on DPG-Bench and LongText-Bench, the models match much larger models in high-fidelity image generation and text rendering. Conclusion: The framework substantially improves the efficiency and deployability of DiT models, producing lightweight T2I models at low cost and with few resources without sacrificing generation quality. Abstract: Diffusion Transformer (DiT) architectures have significantly advanced Text-to-Image (T2I) generation but suffer from prohibitive computational costs and deployment barriers. To address these challenges, we propose an efficient compression framework that transforms the 60-layer dual-stream MMDiT-based Qwen-Image into lightweight models without training from scratch. Leveraging this framework, we introduce Amber-Image, a series of streamlined T2I models. We first derive Amber-Image-10B using a timestep-sensitive depth pruning strategy, where retained layers are reinitialized via local weight averaging and optimized through layer-wise distillation and full-parameter fine-tuning. Building on this, we develop Amber-Image-6B by introducing a hybrid-stream architecture that converts deep-layer dual streams into a single stream initialized from the image branch, further refined via progressive distillation and lightweight fine-tuning. Our approach reduces parameters by 70% and eliminates the need for large-scale data engineering. Notably, the entire compression and training pipeline, from the 10B to the 6B variant, requires fewer than 2,000 GPU hours, demonstrating exceptional cost-efficiency compared to training from scratch. Extensive evaluations on benchmarks like DPG-Bench and LongText-Bench show that Amber-Image achieves high-fidelity synthesis and superior text rendering, matching much larger models.

[55] StructCore: Structure-Aware Image-Level Scoring for Training-Free Unsupervised Anomaly Detection

Joongwon Chae,Lihui Luo,Yang Liu,Runming Wang,Dongmei Yu,Zeming Liang,Xi Yuan,Dayan Zhang,Zhenglin Chen,Peiwu Qin,Ilmoon Chae

Main category: cs.CV

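TL;DR: This paper proposes StructCore, a training-free, structure-aware image-level scoring method that improves image-level decisions in memory-bank-based unsupervised anomaly detection. It captures the distributional and spatial characteristics of the anomaly score map and calibrates scores with a Mahalanobis distance estimated from normal samples, markedly improving image-level AUROC.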

Details Motivation: Existing methods such as max pooling rely on a single extreme response and ignore how anomaly evidence is distributed and structured across the image, which causes normal and anomalous scores to overlap. Method: StructCore computes a low-dimensional structural descriptor phi(S) of the anomaly score map that captures its distributional and spatial characteristics, and calibrates image-level scores with a diagonal Mahalanobis distance estimated from normal training samples, leaving pixel-level localization unchanged. Result: StructCore reaches image-level AUROC of 99.6% on MVTec AD and 98.4% on VisA. Conclusion: By exploiting structural signatures that max pooling misses, StructCore achieves robust image-level anomaly detection without additional training. Abstract: Max pooling is the de facto standard for converting anomaly score maps into image-level decisions in memory-bank-based unsupervised anomaly detection (UAD). However, because it relies on a single extreme response, it discards most information about how anomaly evidence is distributed and structured across the image, often causing normal and anomalous scores to overlap. We propose StructCore, a training-free, structure-aware image-level scoring method that goes beyond max pooling. Given an anomaly score map, StructCore computes a low-dimensional structural descriptor phi(S) that captures distributional and spatial characteristics, and refines image-level scoring via a diagonal Mahalanobis calibration estimated from train-good samples, without modifying pixel-level localization. StructCore achieves image-level AUROC scores of 99.6% on MVTec AD and 98.4% on VisA, demonstrating robust image-level anomaly detection by exploiting structural signatures missed by max pooling.
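
A diagonal Mahalanobis calibration of the kind described here can be sketched in a few lines: fit a per-dimension mean and variance on descriptors of normal images, then score a new descriptor by its variance-normalized distance from that mean. The descriptor contents, the function names, and the small variance floor are assumptions for illustration; the abstract does not define phi(S) in detail.

```python
import math
from statistics import mean, pvariance


def fit_diagonal(normal_descriptors: list[list[float]], eps: float = 1e-8):
    """Estimate per-dimension mean and variance from descriptors of normal images."""
    columns = list(zip(*normal_descriptors))
    mu = [mean(c) for c in columns]
    var = [pvariance(c) + eps for c in columns]  # eps guards against zero variance
    return mu, var


def diagonal_mahalanobis(descriptor: list[float], mu: list[float], var: list[float]) -> float:
    """Distance from the normal mean, with each dimension scaled by its own variance."""
    return math.sqrt(sum((x - m) ** 2 / v for x, m, v in zip(descriptor, mu, var)))
```

Because the covariance is diagonal, each descriptor dimension is normalized independently, so no matrix inversion is required and the calibration stays cheap even with few normal samples.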

[56] Cholec80-port: A Geometrically Consistent Trocar Port Segmentation Dataset for Robust Surgical Scene Understanding

Shunsuke Kikuchi,Atsushi Kouno,Hiroki Matsuzaki

Main category: cs.CV

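TL;DR: This paper presents the Cholec80-port dataset and a unified port-mask annotation standard (excluding the central opening) to address the interference that trocar ports cause in geometry-based vision tasks such as image stitching and 3D reconstruction in laparoscopic surgery; experiments show that geometrically consistent annotations substantially improve cross-dataset robustness.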

Details Motivation: In laparoscopic surgery, trocar ports are specular, textured, and fixed relative to the camera, so they attract a disproportionate share of feature points and seriously disturb geometry-based downstream tasks (image stitching, 3D reconstruction, visual SLAM); existing public datasets lack explicit, geometrically consistent port annotations and often wrongly mask the central opening. Method: Build Cholec80-port, a high-quality trocar port segmentation dataset, together with a strict SOP under which the port-sleeve mask excludes the central opening; cleanse and unify several existing public datasets under the same SOP. Result: Geometrically consistent annotations substantially improve cross-dataset robustness, with gains beyond what increasing dataset size alone provides. Conclusion: A geometrically consistent trocar port annotation standard (excluding the central opening) is important for the stability of laparoscopic vision tasks; Cholec80-port and the unified SOP provide a reliable benchmark and practical guidance for related research. Abstract: Trocar ports are camera-fixed, pseudo-static structures that can persistently occlude laparoscopic views and attract disproportionate feature points due to specular, textured surfaces. This makes ports particularly detrimental to geometry-based downstream pipelines such as image stitching, 3D reconstruction, and visual SLAM, where dynamic or non-anatomical outliers degrade alignment and tracking stability. Despite this practical importance, explicit port labels are rare in public surgical datasets, and existing annotations often violate geometric consistency by masking the central lumen (opening), even when anatomical regions are visible through it. We present Cholec80-port, a high-fidelity trocar port segmentation dataset derived from Cholec80, together with a rigorous standard operating procedure (SOP) that defines a port-sleeve mask excluding the central opening. We additionally cleanse and unify existing public datasets under the same SOP. Experiments demonstrate that geometrically consistent annotations substantially improve cross-dataset robustness beyond what dataset size alone provides.

[57] Cross Pseudo Labeling For Weakly Supervised Video Anomaly Detection

Lee Dayeon,Kim Dongheyong,Park Chaewon,Woo Sungmin,Lee Sangyoun

Main category: cs.CV

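TL;DR: CPL-VAD is a dual-branch framework for weakly supervised video anomaly detection that uses cross pseudo labeling to combine temporal localization precision with semantic classification, reaching state-of-the-art performance on XD-Violence and UCF-Crime.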

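Details Motivation: Existing weakly supervised video anomaly detection methods struggle to deliver both snippet-level anomaly localization and fine-grained recognition of anomaly categories, and in particular lack semantic-level modeling. Method: A dual-branch architecture: (1) a binary anomaly detection branch focused on snippet-level anomaly localization, and (2) a category classification branch that uses vision-language alignment to recognize abnormal event categories; the two branches reinforce each other through cross pseudo labeling. Result: On XD-Violence and UCF-Crime, CPL-VAD achieves state-of-the-art performance in both anomaly detection (AUC) and abnormal category classification (accuracy). Conclusion: Cross pseudo labeling effectively bridges the localization and classification tasks, and vision-language alignment improves semantic understanding of anomalies under weak supervision, supporting joint learning of the two tasks for weakly supervised video anomaly detection. Abstract: Weakly supervised video anomaly detection aims to detect anomalies and identify abnormal categories with only video-level labels. We propose CPL-VAD, a dual-branch framework with cross pseudo labeling. The binary anomaly detection branch focuses on snippet-level anomaly localization, while the category classification branch leverages vision-language alignment to recognize abnormal event categories. By exchanging pseudo labels, the two branches transfer complementary strengths, combining temporal precision with semantic discrimination. Experiments on XD-Violence and UCF-Crime demonstrate that CPL-VAD achieves state-of-the-art performance in both anomaly detection and abnormal category classification.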

[58] ComptonUNet: A Deep Learning Model for GRB Localization with Compton Cameras under Noisy and Low-Statistic Conditions

Shogo Sato,Kazuo Tanaka,Shojun Ogasawara,Kazuki Yamamoto,Kazuhiko Murasaki,Ryuichi Tanida,Jun Kataoka

Main category: cs.CV

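TL;DR: This paper proposes ComptonUNet, a hybrid deep learning framework for robustly localizing faint gamma-ray bursts (GRBs) under low photon statistics and strong background noise. The model combines the statistical efficiency of direct reconstruction with image denoising and clearly outperforms existing methods in simulation.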

Details Motivation: faint GRBs from the distant universe are valuable for studying early star formation, but their detection and localization are challenging due to low photon statistics and high background noise; existing ML models fail to balance statistical robustness and noise suppression. Method: ComptonUNet, a hybrid deep learning framework that jointly processes raw data and reconstructs images, integrating statistical efficiency of direct reconstruction with denoising capability of image-based architectures. Result: ComptonUNet achieves significantly improved GRB localization accuracy in realistic simulations with low-statistic and high-background conditions representative of low-Earth orbit missions. Conclusion: ComptonUNet provides a robust solution for faint GRB localization under challenging observational conditions, advancing high-energy astrophysical probing capabilities. Abstract: Gamma-ray bursts (GRBs) are among the most energetic transient phenomena in the universe and serve as powerful probes for high-energy astrophysical processes. In particular, faint GRBs originating from a distant universe may provide unique insights into the early stages of star formation. However, detecting and localizing such weak sources remains challenging owing to low photon statistics and substantial background noise. Although recent machine learning models address individual aspects of these challenges, they often struggle to balance the trade-off between statistical robustness and noise suppression. Consequently, we propose ComptonUNet, a hybrid deep learning framework that jointly processes raw data and reconstructs images for robust GRB localization. ComptonUNet was designed to operate effectively under conditions of limited photon statistics and strong background contamination by combining the statistical efficiency of direct reconstruction models with the denoising capabilities of image-based architectures. 
We perform realistic simulations of GRB-like events embedded in background environments representative of low-Earth orbit missions to evaluate the performance of ComptonUNet. Our results demonstrate that ComptonUNet significantly outperforms existing approaches, achieving improved localization accuracy across a wide range of low-statistic and high-background scenarios.

[59] 3D Scene Rendering with Multimodal Gaussian Splatting

Chi-Shiang Gau,Konstantinos D. Polyzos,Athanasios Bacharis,Saketh Madhuvarasu,Tara Javidi

Main category: cs.CV

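TL;DR: This paper proposes a multimodal framework that combines radio-frequency (RF) sensing, such as automotive radar, with 3D Gaussian Splatting (GS) rendering to overcome the initialization difficulties of vision-only GS in adverse weather, low light, or occlusion. The method predicts depth efficiently from sparse RF depth measurements and produces a high-quality point cloud for GS initialization, improving robustness and efficiency while keeping rendering quality high.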

Details Motivation: Conventional vision-based Gaussian Splatting (GS) relies on many camera views for initialization and training, and its performance suffers where visual cues are unreliable, such as adverse weather, low illumination, or partial occlusion; RF signals are inherently robust to these disturbances, so adding RF sensing can make GS more reliable and more widely applicable. Method: A multimodal framework that combines RF sensing (e.g., automotive radar) with GS rendering; efficient depth prediction from sparse RF depth measurements yields a high-quality 3D point cloud used to initialize the Gaussian primitives of various GS architectures. Result: Numerical experiments show that the RF-augmented GS approach achieves high-fidelity 3D scene rendering driven by structural accuracy, with better robustness and rendering efficiency under visually degraded conditions. Conclusion: Combining RF sensing with GS is a more efficient and more robust alternative for 3D scene reconstruction and rendering, particularly for applications such as autonomous driving that demand robustness to the environment. Abstract: 3D scene reconstruction and rendering are core tasks in computer vision, with applications spanning industrial monitoring, robotics, and autonomous driving. Recent advances in 3D Gaussian Splatting (GS) and its variants have achieved impressive rendering fidelity while maintaining high computational and memory efficiency. However, conventional vision-based GS pipelines typically rely on a sufficient number of camera views to initialize the Gaussian primitives and train their parameters, which incurs additional processing cost during initialization and falls short in conditions where visual cues are unreliable, such as adverse weather, low illumination, or partial occlusions. To cope with these challenges, and motivated by the robustness of radio-frequency (RF) signals to weather, lighting, and occlusions, we introduce a multimodal framework that integrates RF sensing, such as automotive radar, with GS-based rendering as a more efficient and robust alternative to vision-only GS rendering. The proposed approach enables efficient depth prediction from only sparse RF-based depth measurements, yielding a high-quality 3D point cloud for initializing Gaussian functions across diverse GS architectures. Numerical tests demonstrate the merits of judiciously incorporating RF sensing into GS pipelines, achieving high-fidelity 3D scene rendering driven by RF-informed structural accuracy.

[60] B$^3$-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates

Hiromichi Kamata,Samuel Arthur Munro,Fuminori Homma

Main category: cs.CV

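TL;DR: This paper proposes B^3-Seg, an interactive, open-vocabulary segmentation method for 3D Gaussian Splatting (3DGS) that needs neither predefined cameras nor training. It uses Beta-Bernoulli Bayesian updates and an analytic Expected Information Gain (EIG) to select views actively, completing end-to-end segmentation in a few seconds with both theoretical guarantees and practical efficiency.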

Details Motivation: Existing 3DGS segmentation methods depend on predefined viewpoints, ground-truth labels, or costly retraining, and so cannot meet the need for low-latency interactive editing in film and game production. Method: Segmentation is modeled as a sequence of Beta-Bernoulli Bayesian updates, and the next view is chosen actively using an analytically computed Expected Information Gain (EIG); the adaptive monotonicity and submodularity of EIG yield a near-optimal greedy view sampling strategy. Result: On multiple datasets, B^3-Seg completes end-to-end segmentation within seconds, with performance comparable to high-cost supervised methods and without camera priors or training. Conclusion: B^3-Seg delivers practical, efficient, and theoretically grounded interactive 3DGS segmentation, making open-vocabulary, zero-shot, low-latency use considerably more feasible. Abstract: Interactive 3D Gaussian Splatting (3DGS) segmentation is essential for real-time editing of pre-reconstructed assets in film and game production. However, existing methods rely on predefined camera viewpoints, ground-truth labels, or costly retraining, making them impractical for low-latency use. We propose B$^3$-Seg (Beta-Bernoulli Bayesian Segmentation for 3DGS), a fast and theoretically grounded method for open-vocabulary 3DGS segmentation under camera-free and training-free conditions. Our approach reformulates segmentation as sequential Beta-Bernoulli Bayesian updates and actively selects the next view via analytic Expected Information Gain (EIG). This Bayesian formulation guarantees the adaptive monotonicity and submodularity of EIG, which produces a greedy $(1{-}1/e)$ approximation to the optimal view sampling policy. Experiments on multiple datasets show that B$^3$-Seg achieves results competitive with high-cost supervised methods while completing end-to-end segmentation within a few seconds. The results demonstrate that B$^3$-Seg enables practical, interactive 3DGS segmentation with provable information efficiency.
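
The Beta-Bernoulli update at the core of this formulation is a standard conjugate update and can be sketched for a single Gaussian as follows. How B$^3$-Seg derives foreground and background evidence from each rendered view, and how it computes EIG, is not given in the abstract; the evidence arguments and function names below are therefore hypothetical.

```python
def beta_update(alpha: float, beta: float, fg_evidence: float, bg_evidence: float):
    """Conjugate update: add foreground evidence to alpha, background evidence to beta."""
    return alpha + fg_evidence, beta + bg_evidence


def posterior_mean(alpha: float, beta: float) -> float:
    """Posterior probability that the Gaussian belongs to the target object."""
    return alpha / (alpha + beta)


def posterior_variance(alpha: float, beta: float) -> float:
    """Remaining uncertainty about the label; it shrinks as evidence accumulates."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))
```

Starting from the uniform prior Beta(1, 1) and observing three foreground and one background hit gives Beta(4, 2), with posterior mean 2/3 and a smaller variance than the prior. An active view selector would prefer views expected to reduce this uncertainty the most.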

[61] BadCLIP++: Stealthy and Persistent Backdoors in Multimodal Contrastive Learning

Siyuan Liang,Yongcheng Jing,Yingjie Wang,Jiaxing Huang,Ee-chien Chang,Dacheng Tao

Main category: cs.CV

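TL;DR: This paper proposes BadCLIP++, a stealthy and persistent backdoor attack framework against multimodal contrastive learning models. Using a semantic-fusion micro-trigger, target-aligned subset selection, embedding stabilization, and parameter stabilization, it reaches a high attack success rate (99.99%) at a very low poisoning rate (0.3%) and withstands a range of defenses and fine-tuning.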

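Details Motivation: Existing backdoor attacks on multimodal contrastive learning perform poorly under strong detection and continued fine-tuning, mainly because cross-modal inconsistency exposes the trigger pattern and because gradient dilution at low poisoning rates makes the backdoor fade quickly; these coupled causes have not been adequately modeled or addressed. Method: The unified BadCLIP++ framework (1) designs a semantic-fusion QR micro-trigger embedded near task-relevant regions, preserving clean-data statistics; (2) uses target-aligned subset selection to strengthen the signal at low poisoning rates; (3) stabilizes trigger embeddings through radius shrinkage and centroid alignment; (4) stabilizes model parameters through curvature control and elastic weight consolidation; and (5) provides the first theoretical analysis showing that, within a trust region, gradients from clean fine-tuning and the backdoor objective are co-directional, which bounds the degradation of attack success from above. Result: With only 0.3% poisoning, the digital attack reaches 99.99% ASR, 11.4 percentage points ahead of baselines; across 19 defenses ASR stays above 99.90% with less than 0.8% drop in clean accuracy; physical attacks succeed 65.03% of the time, and the attack is robust to defenses such as watermark removal. Conclusion: BadCLIP++ addresses both stealthiness and persistence of backdoor attacks in multimodal contrastive learning, combines empirical strength with theoretical interpretability, and offers a new benchmark for security evaluation and defense design. Abstract: Research on backdoor attacks against multimodal contrastive learning models faces two key challenges: stealthiness and persistence. Existing methods often fail under strong detection or continuous fine-tuning, largely due to (1) cross-modal inconsistency that exposes trigger patterns and (2) gradient dilution at low poisoning rates that accelerates backdoor forgetting. These coupled causes remain insufficiently modeled and addressed. We propose BadCLIP++, a unified framework that tackles both challenges. For stealthiness, we introduce a semantic-fusion QR micro-trigger that embeds imperceptible patterns near task-relevant regions, preserving clean-data statistics while producing compact trigger distributions. We further apply target-aligned subset selection to strengthen signals at low injection rates. For persistence, we stabilize trigger embeddings via radius shrinkage and centroid alignment, and stabilize model parameters through curvature control and elastic weight consolidation, maintaining solutions within a low-curvature wide basin resistant to fine-tuning. We also provide the first theoretical analysis showing that, within a trust region, gradients from clean fine-tuning and backdoor objectives are co-directional, yielding a non-increasing upper bound on attack success degradation. Experiments demonstrate that with only 0.3% poisoning, BadCLIP++ achieves 99.99% attack success rate (ASR) in digital settings, surpassing baselines by 11.4 points. 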
Across nineteen defenses, ASR remains above 99.90% with less than 0.8% drop in clean accuracy. The method further attains 65.03% success in physical attacks and shows robustness against watermark removal defenses.

[62] NRGS-SLAM: Monocular Non-Rigid SLAM for Endoscopy via Deformation-Aware 3D Gaussian Splatting

Jiwei Shan,Zeyu Cai,Yirui Li,Yongbo Chen,Lijun Han,Yun-hui Liu,Hesheng Wang,Shing Shin Cheng

Main category: cs.CV

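TL;DR: This paper proposes NRGS-SLAM, a monocular non-rigid SLAM system for endoscopy based on 3D Gaussian Splatting. With a deformation-aware Gaussian map carrying learnable deformation probabilities, deformation-aware tracking and mapping modules, and a unified robust geometric loss, it decouples camera motion from soft-tissue deformation and clearly outperforms existing methods in pose accuracy and reconstruction quality.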

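Details Motivation: In endoscopic scenes, persistent soft-tissue deformation violates the rigidity assumption and strongly couples camera ego-motion with intrinsic deformation; existing monocular non-rigid SLAM methods lack an effective decoupling mechanism and rely on sparse or low-fidelity scene representations, leading to tracking drift and limited reconstruction quality. Method: NRGS-SLAM (1) builds a deformation-aware 3D Gaussian map with learnable deformation probabilities, optimized by Bayesian self-supervision; (2) uses a deformation-aware coarse-to-fine tracking module that estimates pose from low-deformation regions first and then updates per-frame deformation; (3) uses a progressive deformable mapping module that balances representational capacity and efficiency; and (4) introduces a unified robust geometric loss that incorporates external geometric priors. Result: On multiple public endoscopic datasets, NRGS-SLAM reduces pose-estimation RMSE by up to 50% and produces higher-quality photo-realistic reconstructions; ablation studies confirm the contribution of each core design choice. Conclusion: With its deformation-aware Gaussian representation and jointly optimized pipeline, NRGS-SLAM markedly improves localization and mapping for monocular non-rigid SLAM in endoscopic scenes, providing a more reliable basis for applications such as minimally invasive surgical navigation. Abstract: Visual simultaneous localization and mapping (V-SLAM) is a fundamental capability for autonomous perception and navigation. However, endoscopic scenes violate the rigidity assumption due to persistent soft-tissue deformations, creating a strong coupling ambiguity between camera ego-motion and intrinsic deformation. Although recent monocular non-rigid SLAM methods have made notable progress, they often lack effective decoupling mechanisms and rely on sparse or low-fidelity scene representations, which leads to tracking drift and limited reconstruction quality. To address these limitations, we propose NRGS-SLAM, a monocular non-rigid SLAM system for endoscopy based on 3D Gaussian Splatting. To resolve the coupling ambiguity, we introduce a deformation-aware 3D Gaussian map that augments each Gaussian primitive with a learnable deformation probability, optimized via a Bayesian self-supervision strategy without requiring external non-rigidity labels. Building on this representation, we design a deformable tracking module that performs robust coarse-to-fine pose estimation by prioritizing low-deformation regions, followed by efficient per-frame deformation updates. A carefully designed deformable mapping module progressively expands and refines the map, balancing representational capacity and computational efficiency. In addition, a unified robust geometric loss incorporates external geometric priors to mitigate the inherent ill-posedness of monocular non-rigid SLAM. 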
Extensive experiments on multiple public endoscopic datasets demonstrate that NRGS-SLAM achieves more accurate camera pose estimation (up to 50% reduction in RMSE) and higher-quality photo-realistic reconstructions than state-of-the-art methods. Comprehensive ablation studies further validate the effectiveness of our key design choices. Source code will be publicly available upon paper acceptance.

[63] Selective Training for Large Vision Language Models via Visual Information Gain

Seulbi Lee,Sangheum Hwang

Main category: cs.CV

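TL;DR: This paper introduces Visual Information Gain (VIG), a metric that quantifies how much the image input reduces a language model's prediction uncertainty, and builds a VIG-guided selective training method on it to strengthen the visual grounding of large vision language models (LVLMs) and reduce language bias.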

Details Motivation: Large vision language models (LVLMs) suffer from language bias, answering without relying on image content; existing methods lack a quantitative measure of how much the image contributes to an individual training sample or token. Method: A perplexity-based Visual Information Gain (VIG) metric that supports fine-grained analysis at both sample and token level, and a VIG-guided selective training strategy that prioritizes high-VIG samples and tokens. Result: The method improves visual grounding and mitigates language bias while achieving better performance with less supervision. Conclusion: VIG offers an interpretable, quantitative tool for assessing and strengthening the visual dependence of LVLMs, and VIG-guided training makes models use image content more efficiently. Abstract: Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, it typically lacks a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.
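
One plausible reading of a perplexity-based gain is the difference in log-perplexity of the answer with and without the image, which reduces to a mean per-token log-probability difference. The sketch below assumes that form and assumes the per-token log-probabilities have already been obtained from two forward passes; the paper's exact definition may differ, and the function names are hypothetical.

```python
def token_vig(logp_with_image: list[float], logp_text_only: list[float]) -> list[float]:
    """Per-token gain: how much more log-probability each token gets when the image is given."""
    return [w - t for w, t in zip(logp_with_image, logp_text_only)]


def sample_vig(logp_with_image: list[float], logp_text_only: list[float]) -> float:
    """Sample-level gain, equal to log(PPL_text_only / PPL_with_image) under this assumed form."""
    gains = token_vig(logp_with_image, logp_text_only)
    return sum(gains) / len(gains)


def select_high_vig(samples: dict[str, float], threshold: float) -> list[str]:
    """Keep the ids of samples whose gain exceeds a (hypothetical) threshold."""
    return [sid for sid, gain in samples.items() if gain > threshold]
```

Under this reading, a token such as a color word that is hard to guess from text alone but easy once the image is seen gets a large positive gain, while function words get gains near zero.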

[64] EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models

Yahong Wang,Juncheng Wu,Zhangkai Ni,Chengmei Yang,Yihang Liu,Longzhen Yang,Yuyin Zhou,Ying Wen,Lianghua He

Main category: cs.CV

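TL;DR: This paper proposes EntropyPrune, a visual token pruning method based on matrix entropy. By identifying an "Entropy Collapse Layer" (ECL), it provides interpretable, transferable acceleration of MLLM inference, substantially reducing FLOPs while retaining high performance.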

Details Motivation: Existing MLLM token pruning methods rely on static, empirically chosen pruning layers without theoretical grounding, which hurts interpretability and transfer across models; meanwhile, the large number of visual tokens makes inference expensive. Method: From a matrix-entropy perspective, the authors observe that the information content of visual representations drops sharply and consistently at a particular layer (the Entropy Collapse Layer, ECL) and use this to decide when to prune; the EntropyPrune framework exploits the spectral equivalence of dual Gram matrices to compute token-level information entropy efficiently, quantifying and pruning redundant visual tokens without attention maps. Result: On LLaVA-1.5-7B, FLOPs are reduced by 68.2% while 96.0% of the original performance is retained; the method outperforms state-of-the-art pruning methods across multimodal benchmarks and generalizes well to high-resolution image and video models. Conclusion: EntropyPrune offers a principled, computationally efficient, model-agnostic approach to accelerating MLLM inference and supports matrix entropy as an effective and general pruning criterion. Abstract: Multimodal large language models (MLLMs) incur substantial inference cost due to the processing of hundreds of visual tokens per image. Although token pruning has proven effective for accelerating inference, determining when and where to prune remains largely heuristic. Existing approaches typically rely on static, empirically selected layers, which limit interpretability and transferability across models. In this work, we introduce a matrix-entropy perspective and identify an "Entropy Collapse Layer" (ECL), where the information content of visual representations exhibits a sharp and consistent drop, which provides a principled criterion for selecting the pruning stage. Building on this observation, we propose EntropyPrune, a novel matrix-entropy-guided token pruning framework that quantifies the information value of individual visual tokens and prunes redundant ones without relying on attention maps. Moreover, to enable efficient computation, we exploit the spectral equivalence of dual Gram matrices, reducing the complexity of entropy computation and yielding up to a 64x theoretical speedup. Extensive experiments on diverse multimodal benchmarks demonstrate that EntropyPrune consistently outperforms state-of-the-art pruning methods in both accuracy and efficiency. On LLaVA-1.5-7B, our method achieves a 68.2% reduction in FLOPs while preserving 96.0% of the original performance. 
Furthermore, EntropyPrune generalizes effectively to high-resolution and video-based models, highlighting the strong robustness and scalability in practical MLLM acceleration. The code will be publicly available at https://github.com/YahongWang1/EntropyPrune.

[65] GASS: Geometry-Aware Spherical Sampling for Disentangled Diversity Enhancement in Text-to-Image Generation

Ye Zhu,Kaleb S. Newman,Johannes F. Lutzeyer,Adriana Romero-Soriano,Michal Drozdzal,Olga Russakovsky

Main category: cs.CV

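TL;DR: This paper proposes Geometry-Aware Spherical Sampling (GASS), which increases the diversity of text-to-image models by controlling the geometric spread of generated images in CLIP embedding space along two orthogonal directions, one prompt-related and one prompt-independent, while preserving image quality and semantic alignment.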

Details Motivation: Current text-to-image models align well with the prompt but produce images with limited diversity, which restricts user choice and may amplify societal biases. Method: GASS decomposes diversity in CLIP embedding space into the text-embedding direction (prompt-related) and an orthogonal direction (prompt-independent, e.g., backgrounds), then widens the geometric projection spread of generated image embeddings along both axes to guide sampling. Result: Experiments on several frozen T2I backbones (U-Net and DiT; diffusion and flow models) and benchmarks show that the method enhances diversity in a disentangled way with minimal impact on image fidelity and semantic alignment. Conclusion: Modeling diversity geometrically is feasible and effective; GASS provides a general diversity-enhancement strategy that does not depend on entropy, is interpretable, and applies across architectures. Abstract: Despite high semantic alignment, modern text-to-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. This lack of diversity not only restricts user choice, but also risks amplifying societal biases. In this work, we enhance the T2I diversity through a geometric lens. Unlike most existing methods that rely primarily on entropy-based guidance to increase sample dissimilarity, we introduce Geometry-Aware Spherical Sampling (GASS) to enhance diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Specifically, we decompose the diversity measure in CLIP embeddings using two orthogonal directions: the text embedding, which captures semantic variation related to the prompt, and an identified orthogonal direction that captures prompt-independent variation (e.g., backgrounds). Based on this decomposition, GASS increases the geometric projection spread of generated image embeddings along both axes and guides the T2I sampling process via expanded predictions along the generation trajectory. Our experiments on different frozen T2I backbones (U-Net and DiT, diffusion and flow) and benchmarks demonstrate the effectiveness of disentangled diversity enhancement with minimal impact on image fidelity and semantic alignment.
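
The geometric decomposition behind this idea can be illustrated with plain vector projection: split an image embedding into its component along the text embedding and the residual orthogonal to it, then measure how widely a batch of images spreads along the text direction. This is a simplified sketch; GASS identifies one specific orthogonal direction rather than using the full residual, and the function names here are hypothetical.

```python
import math
from statistics import pvariance


def decompose(image_emb: list[float], text_emb: list[float]):
    """Split an image embedding into a coefficient along the text direction and an orthogonal residual."""
    norm = math.sqrt(sum(t * t for t in text_emb))
    unit = [t / norm for t in text_emb]
    coeff = sum(i * u for i, u in zip(image_emb, unit))
    residual = [i - coeff * u for i, u in zip(image_emb, unit)]
    return coeff, residual


def projection_spread(image_embs: list[list[float]], text_emb: list[float]) -> float:
    """Variance of the projections onto the text direction across a batch of images."""
    return pvariance([decompose(e, text_emb)[0] for e in image_embs])
```

A batch whose projections are nearly identical has a spread close to zero along the prompt-related axis; a diversity-oriented sampler would aim to widen that spread, and the analogous one along the prompt-independent axis, without pushing embeddings away from the prompt altogether.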

[66] HiMAP: History-aware Map-occupancy Prediction with Fallback

Yiming Xu,Yi Yang,Hao Cheng,Monika Sester

Main category: cs.CV

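TL;DR: HiMAP is a trajectory prediction framework that needs no multi-object tracking (MOT). Using historical occupancy maps and a historical query module, it produces identity-agnostic, robust multi-modal motion forecasts and clearly outperforms existing methods in the no-tracking setting.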

Details Motivation: Existing predictors rely heavily on multi-object tracking (MOT), which fails under occlusion, identity switches, or missed detections, degrading prediction quality and raising safety risks. Method: HiMAP converts past detections into spatiotemporally invariant historical occupancy maps; a historical query module conditions on the current agent state to iteratively retrieve agent-specific history from the unlabeled occupancy representation; a temporal map embedding summarizes that history and, together with the final query and map context, drives a DETR-style decoder that outputs multi-modal future trajectories. Result: On Argoverse 2, HiMAP matches tracking-based methods while operating without IDs; in the no-tracking setting it improves on a fine-tuned QCNet by 11% in FDE, 12% in ADE, and 4% in MR; it forecasts all agents simultaneously and stably without waiting for tracking to recover. Conclusion: HiMAP removes the dependence on object IDs and MOT, supports streaming inference, and is robust, making it a reliable prediction fallback when tracking fails in safety-critical autonomous driving systems. Abstract: Accurate motion forecasting is critical for autonomous driving, yet most predictors rely on multi-object tracking (MOT) with identity association, assuming that objects are correctly and continuously tracked. When tracking fails due to, e.g., occlusion, identity switches, or missed detections, prediction quality degrades and safety risks increase. We present HiMAP, a tracking-free trajectory prediction framework that remains reliable under MOT failures. HiMAP converts past detections into spatiotemporally invariant historical occupancy maps and introduces a historical query module that conditions on the current agent state to iteratively retrieve agent-specific history from unlabeled occupancy representations. The retrieved history is summarized by a temporal map embedding and, together with the final query and map context, drives a DETR-style decoder to produce multi-modal future trajectories. This design lifts identity reliance, supports streaming inference via reusable encodings, and serves as a robust fallback when tracking is unavailable. On Argoverse 2, HiMAP achieves performance comparable to tracking-based methods while operating without IDs, and it substantially outperforms strong baselines in the no-tracking setting, yielding relative gains of 11% in FDE, 12% in ADE, and a 4% reduction in MR over a fine-tuned QCNet. Beyond aggregate metrics, HiMAP delivers stable forecasts for all agents simultaneously without waiting for tracking to recover, highlighting its practical value for safety-critical autonomy.
The code is available under: https://github.com/XuYiMing83/HiMAP.
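
For readers unfamiliar with the metrics quoted in the HiMAP abstract (ADE, FDE, MR), the sketch below computes them for a multi-modal forecast. It rests on assumptions the abstract does not state: the best mode is chosen by smallest final error, and a miss is a final error above 2.0 m. Benchmark definitions differ in such details, so treat this as illustrative rather than as the Argoverse 2 evaluation code.

```python
import math

def displacement_errors(pred, truth):
    """Per-step Euclidean distance between one predicted and one true trajectory."""
    return [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, truth)]

def min_ade_fde(modes, truth):
    """Best-of-K errors: select the mode with the smallest final displacement
    (an assumed convention), then report its average and final error."""
    best = min(modes, key=lambda m: displacement_errors(m, truth)[-1])
    errs = displacement_errors(best, truth)
    return sum(errs) / len(errs), errs[-1]

def miss_rate(final_errors, threshold=2.0):
    """Fraction of agents whose best final error exceeds `threshold` metres
    (the 2.0 m default is an assumption)."""
    return sum(e > threshold for e in final_errors) / len(final_errors)
```

A call such as `min_ade_fde([mode_a, mode_b], truth)` returns an `(ade, fde)` pair for one agent; averaging those pairs over agents gives the dataset-level numbers that relative gains like "11% in FDE" are computed from.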

[67] Inferring Height from Earth Embeddings: First insights using Google AlphaEarth

Alireza Hamoudzadeh,Valeria Belloni,Roberta Ravanelli

Main category: cs.CV

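TL;DR: This study examines whether AlphaEarth embeddings (10 m resolution) can guide deep learning regression models for regional surface height mapping, using U-Net and U-Net++ architectures to decode the embeddings. The embeddings contain decodable height-related signals, and U-Net++ generalizes better on the test set (R² = 0.84), but residual bias and distribution shift remain challenges.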

Details Motivation: To test whether the geospatial and multimodal features encoded in Earth embeddings (AlphaEarth Embeddings in particular) can support deep learning models for regional surface height regression. Method: AlphaEarth Embeddings at 10 m resolution serve as input features and a high-quality Digital Surface Model (DSM) as ground truth; U-Net and U-Net++ are used as lightweight convolutional decoders for height regression and evaluation. Result: Both models reach R² = 0.97 in training; on the test set U-Net++ performs better (R² = 0.84, median difference −2.62 m) than U-Net (R² = 0.78, median difference −7.22 m); the test RMSE is about 16 m for U-Net++, with systematic residual bias. Conclusion: AlphaEarth Embeddings carry transferable topographic patterns and show potential for guiding DL-based height mapping, especially with spatially aware convolutional architectures, but distribution shift and systematic bias must be addressed to improve regional generalization. Abstract: This study investigates whether the geospatial and multimodal features encoded in Earth Embeddings can effectively guide deep learning (DL) regression models for regional surface height mapping. In particular, we focused on AlphaEarth Embeddings at 10 m spatial resolution and evaluated their capability to support terrain height inference using a high-quality Digital Surface Model (DSM) as reference. U-Net and U-Net++ architectures were thus employed as lightweight convolutional decoders to assess how well the geospatial information distilled in the embeddings can be translated into accurate surface height estimates. Both architectures achieved strong training performance (both with R² = 0.97), confirming that the embeddings encode informative and decodable height-related signals. On the test set, performance decreased due to distribution shifts in height frequency between training and testing areas. Nevertheless, U-Net++ shows better generalization (R² = 0.84, median difference = −2.62 m) compared with the standard U-Net (R² = 0.78, median difference = −7.22 m), suggesting enhanced robustness to distribution mismatch. While the testing RMSE (approximately 16 m for U-Net++) and residual bias highlight remaining challenges in generalization, strong correlations indicate that the embeddings capture transferable topographic patterns.
Overall, the results demonstrate the promising potential of AlphaEarth Embeddings to guide DL-based height mapping workflows, particularly when combined with spatially aware convolutional architectures, while emphasizing the need to address bias for improved regional transferability.
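
The three figures reported above (R², RMSE, median difference) can be reproduced for any pair of predicted and reference height lists with a short sketch. Two points are assumptions, since the abstract does not define them: R² is taken as the coefficient of determination, and the difference is taken as prediction minus reference. The function name `regression_report` is hypothetical.

```python
import math
import statistics

def regression_report(pred, ref):
    """Return R^2 (coefficient of determination), RMSE, and the median of
    (prediction - reference) for two equal-length lists of heights in metres."""
    residuals = [p - r for p, r in zip(pred, ref)]
    mean_ref = sum(ref) / len(ref)
    ss_res = sum(e * e for e in residuals)            # residual sum of squares
    ss_tot = sum((r - mean_ref) ** 2 for r in ref)    # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / len(ref)),
        "median_diff": statistics.median(residuals),
    }
```

Under these conventions, a model that underestimates every height by a constant offset yields a negative median difference and an RMSE equal to that offset, which is one way a systematic residual bias like the one described in the abstract would show up.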

[68] A Multi-modal Detection System for Infrastructure-based Freight Signal Priority

Ziyan Zhang,Chuheng Wei,Xuanpeng Zhao,Siyan Li,Will Snyder,Mike Stas,Peng Hao,Kanok Boriboonsomsin,Guoyuan Wu

Main category: cs.CV

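TL;DR: This paper presents an infrastructure-based multi-modal freight vehicle detection system that fuses LiDAR and camera sensors. It uses a hybrid sensing architecture, combines clustering-based and deep learning-based detection with Kalman filter tracking, and perceives freight vehicle type, position, and speed at high spatio-temporal resolution to support Freight Signal Priority (FSP).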

Details Motivation: Freight vehicles approaching signalized intersections need reliable detection and motion estimation to support infrastructure-based Freight Signal Priority (FSP); accurate, timely perception of vehicle type, position, and speed is essential for effective priority control. Method: The authors design and deploy a multi-modal infrastructure sensing system integrating LiDAR and cameras, with a hybrid architecture made of an intersection-mounted subsystem and a midblock subsystem that synchronize data over a wireless link; the perception pipeline combines clustering-based and deep learning-based detection with Kalman filter tracking; LiDAR point clouds are registered into geodetic reference frames to support lane-level localization and stable tracking. Result: Field evaluations show that the system reliably monitors freight vehicle movements at high spatio-temporal resolution. Conclusion: The design and deployment offer practical experience and a technical reference for building infrastructure-based sensing systems for FSP. Abstract: Freight vehicles approaching signalized intersections require reliable detection and motion estimation to support infrastructure-based Freight Signal Priority (FSP). Accurate and timely perception of vehicle type, position, and speed is essential for enabling effective priority control strategies. This paper presents the design, deployment, and evaluation of an infrastructure-based multi-modal freight vehicle detection system integrating LiDAR and camera sensors. A hybrid sensing architecture is adopted, consisting of an intersection-mounted subsystem and a midblock subsystem, connected via wireless communication for synchronized data transmission. The perception pipeline incorporates both clustering-based and deep learning-based detection methods with Kalman filter tracking to achieve stable real-time performance. LiDAR measurements are registered into geodetic reference frames to support lane-level localization and consistent vehicle tracking. Field evaluations demonstrate that the system can reliably monitor freight vehicle movements at high spatio-temporal resolution. The design and deployment provide practical insights for developing infrastructure-based sensing systems to support FSP applications.
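
The Kalman filter tracking step mentioned in the abstract can be sketched in one dimension. This is a textbook constant-velocity filter along a lane, not the deployed system: the state layout, the white-acceleration process noise, and every default variance below are assumptions chosen for illustration.

```python
class ConstantVelocityKalman:
    """1-D constant-velocity Kalman filter: state = [position, velocity],
    measurements are positions only (e.g. a detection centroid along a lane)."""

    def __init__(self, pos, vel=0.0, pos_var=25.0, vel_var=25.0,
                 accel_var=1.0, meas_var=0.25):
        self.x = [pos, vel]
        self.P = [[pos_var, 0.0], [0.0, vel_var]]   # state covariance
        self.q = accel_var                          # process-noise intensity
        self.r = meas_var                           # measurement variance

    def predict(self, dt):
        pos, vel = self.x
        (a, b), (c, d) = self.P
        q = self.q
        self.x = [pos + dt * vel, vel]
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]]
        self.P = [
            [a + dt * (b + c) + dt * dt * d + q * dt ** 4 / 4,
             b + dt * d + q * dt ** 3 / 2],
            [c + dt * d + q * dt ** 3 / 2,
             d + q * dt * dt],
        ]

    def update(self, z):
        (a, b), (c, d) = self.P
        s = a + self.r                  # innovation variance
        k0, k1 = a / s, c / s           # Kalman gain for position and velocity
        y = z - self.x[0]               # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]
```

Calling `predict(dt)` then `update(z)` once per frame yields a smoothed position and a speed estimate in `x[1]`, the two motion quantities the abstract lists as inputs to priority control.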

[69] EA-Swin: An Embedding-Agnostic Swin Transformer for AI-Generated Video Detection

Hung Mai,Loi Dinh,Duc Hai Nguyen,Dat Do,Luong Doan,Khanh Nguyen Quoc,Huan Vu,Phong Ho,Naeem Ul Islam,Tuan Do

Main category: cs.CV

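TL;DR: This paper proposes the EA-Swin model and the EA-Video dataset for efficient detection of AI-generated video, improving both accuracy and generalization.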

Details Motivation: Existing detectors fall short against advanced video generators such as Sora2 and Veo3 because they rely on shallow embedding trajectories, image-based adaptation, or computationally heavy multimodal large language models. Method: The Embedding-Agnostic Swin Transformer (EA-Swin) uses factorized windowed attention to model spatiotemporal dependencies directly on pretrained video embeddings; the authors also build EA-Video, a 130K-video benchmark covering diverse commercial and open-source generators, with unseen-generator splits for cross-distribution evaluation. Result: EA-Swin reaches 0.97–0.99 accuracy on major generators, 5–20% above prior SoTA methods (typically 0.8–0.9), and generalizes well to unseen generators. Conclusion: EA-Swin is a scalable and robust solution for detecting modern AI-generated video. Abstract: Recent advances in foundation video generators such as Sora2, Veo3, and other commercial systems have produced highly realistic synthetic videos, exposing the limitations of existing detection methods that rely on shallow embedding trajectories, image-based adaptation, or computationally heavy MLLMs. We propose EA-Swin, an Embedding-Agnostic Swin Transformer that models spatiotemporal dependencies directly on pretrained video embeddings via a factorized windowed attention design, making it compatible with generic ViT-style patch-based encoders. Alongside the model, we construct the EA-Video dataset, a benchmark dataset comprising 130K videos that integrates newly collected samples with curated existing datasets, covering diverse commercial and open-source generators and including unseen-generator splits for rigorous cross-distribution evaluation. Extensive experiments show that EA-Swin achieves 0.97-0.99 accuracy across major generators, outperforming prior SoTA methods (typically 0.8-0.9) by a margin of 5-20%, while maintaining strong generalization to unseen distributions, establishing a scalable and robust solution for modern AI-generated video detection.

[70] Physics Encoded Spatial and Temporal Generative Adversarial Network for Tropical Cyclone Image Super-resolution

Ruoyi Zhang,Jiawei Yuan,Lujia Ye,Runling Yu,Liling Zhao

Main category: cs.CV

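TL;DR: This paper proposes a Physics Encoded Spatial and Temporal Generative Adversarial Network (PESTGAN) for super-resolution of tropical cyclone satellite images. A physics-constrained module (PhyCell) and a dual-discriminator framework improve the meteorological plausibility and physical fidelity of the reconstructions.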

Details Motivation: Existing deep learning super-resolution methods treat satellite image sequences as ordinary video and ignore the atmospheric physics that governs cloud motion, so their reconstructions can be physically implausible. Method: PESTGAN combines a disentangled generator (with a PhyCell module that approximates the vorticity equation and encodes physical dynamics) and two discriminators (one enforcing temporal consistency, one enforcing spatial realism). Result: For 4× super-resolution on the Digital Typhoon dataset, PESTGAN achieves better structural fidelity and perceptual quality than existing methods; pixel-wise accuracy is comparable, while meteorological plausibility and physical fidelity improve markedly. Conclusion: Generative models that incorporate physical priors can raise the meteorological credibility of remote-sensing super-resolution and offer a new paradigm for physics-guided AI weather modeling. Abstract: High-resolution satellite imagery is indispensable for tracking the genesis, intensification, and trajectory of tropical cyclones (TCs). However, existing deep learning-based super-resolution (SR) methods often treat satellite image sequences as generic videos, neglecting the underlying atmospheric physical laws governing cloud motion. To address this, we propose a Physics Encoded Spatial and Temporal Generative Adversarial Network (PESTGAN) for TC image super-resolution. Specifically, we design a disentangled generator architecture incorporating a PhyCell module, which approximates the vorticity equation via constrained convolutions and encodes the resulting approximate physical dynamics as implicit latent representations to separate physical dynamics from visual textures. Furthermore, a dual-discriminator framework is introduced, employing a temporal discriminator to enforce motion consistency alongside spatial realism. Experiments on the Digital Typhoon dataset for 4× upscaling demonstrate that PESTGAN establishes a better performance in structural fidelity and perceptual quality. While maintaining competitive pixel-wise accuracy compared to existing approaches, our method significantly excels in reconstructing meteorologically plausible cloud structures with superior physical fidelity.
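
To make the phrase "approximates the vorticity equation via constrained convolutions" more concrete, the sketch below computes relative vorticity, ζ = ∂v/∂x − ∂u/∂y, with central differences, which amount to applying the fixed stencil [−1, 0, 1] / 2 along each axis. This is only the diagnostic quantity on a regular grid, not the PhyCell module or the full vorticity equation; the grid layout and function name are assumptions.

```python
def relative_vorticity(u, v, dx=1.0, dy=1.0):
    """zeta = dv/dx - du/dy at interior grid points, by central differences.
    u[j][i] and v[j][i] are wind components; j indexes y (rows), i indexes x
    (columns). Border cells are left at 0.0."""
    rows, cols = len(u), len(u[0])
    zeta = [[0.0] * cols for _ in range(rows)]
    for j in range(1, rows - 1):
        for i in range(1, cols - 1):
            dv_dx = (v[j][i + 1] - v[j][i - 1]) / (2 * dx)
            du_dy = (u[j + 1][i] - u[j - 1][i]) / (2 * dy)
            zeta[j][i] = dv_dx - du_dy
    return zeta
```

For a solid-body rotation (u = −y, v = x) every interior cell evaluates to 2, the expected constant vorticity, which makes a convenient sanity check for any finite-difference kernel of this kind.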

[71] Attachment Anchors: A Novel Framework for Laparoscopic Grasping Point Prediction in Colorectal Surgery

Dennis N. Schneider,Lars Wagner,Daniel Rueckert,Dirk Wilhelm

Main category: cs.CV

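TL;DR: This paper proposes "attachment anchors", a structured representation that improves autonomous tissue grasping point prediction in minimally invasive colorectal surgery. By encoding the local geometric and mechanical relationships between tissue and its anatomical attachments, it normalizes surgical scenes into a consistent local reference frame, which reduces uncertainty in grasping point prediction; experiments on data from 90 surgeries show its advantage in out-of-distribution settings.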

Details Motivation: Colorectal procedures are complex and long, and are underrepresented in current research; their repetitive tissue manipulation nevertheless makes them a promising entry point for machine learning-driven autonomous support. Method: The authors introduce attachment anchors, a structured intermediate representation that encodes the local geometric and mechanical relationships between tissue and its anatomical attachments, predict it from laparoscopic images, and integrate it into a machine learning grasping framework. Result: On a dataset of 90 colorectal surgeries, attachment anchors improve grasping point prediction over image-only baselines, with particularly strong gains for unseen procedures and surgeons (out-of-distribution settings). Conclusion: Attachment anchors are an effective intermediate representation that improves the generalization and robustness of learning-based tissue manipulation in colorectal surgery. Abstract: Accurate grasping point prediction is a key challenge for autonomous tissue manipulation in minimally invasive surgery, particularly in complex and variable procedures such as colorectal interventions. Due to their complexity and prolonged duration, colorectal procedures have been underrepresented in current research. At the same time, they pose a particularly interesting learning environment due to repetitive tissue manipulation, making them a promising entry point for autonomous, machine learning-driven support. Therefore, in this work, we introduce attachment anchors, a structured representation that encodes the local geometric and mechanical relationships between tissue and its anatomical attachments in colorectal surgery. This representation reduces uncertainty in grasping point prediction by normalizing surgical scenes into a consistent local reference frame. We demonstrate that attachment anchors can be predicted from laparoscopic images and incorporated into a grasping framework based on machine learning. Experiments on a dataset of 90 colorectal surgeries demonstrate that attachment anchors improve grasping point prediction compared to image-only baselines. There are particularly strong gains in out-of-distribution settings, including unseen procedures and operating surgeons. These results suggest that attachment anchors are an effective intermediate representation for learning-based tissue manipulation in colorectal surgery.

[72] Leveraging Contrastive Learning for a Similarity-Guided Tampered Document Data Generation Pipeline

Mohamed Dhouib,Davide Buscaldi,Sonia Vanier,Aymen Shabou

Main category: cs.CV

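TL;DR: This paper proposes a new method for generating high-quality tampered document images. Two auxiliary networks (a contrastive-learning network for matching text crops and a network that checks whether a crop tightly encloses its characters) increase the diversity and visual quality of the generated data, and the approach is validated across multiple datasets and models.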

Details Motivation: Existing rule-based methods for generating tampered documents lack diversity, have poor visual quality, and leave obvious artifacts, so models trained on them generalize weakly and perform poorly on real-world data. Method: The authors design two auxiliary networks: (1) a text-crop comparison network trained with contrastive learning, including a new strategy for constructing positive and negative samples; (2) a network that assesses how tightly a crop bounds its characters; both are combined in an end-to-end pipeline for generating tampered document images. Result: Under identical training settings, models trained on data from this method show consistent gains on multiple open-source benchmark datasets, across different network architectures. Conclusion: The proposed generation framework alleviates data scarcity and improves the robustness and generalization of tampered-text detection models in real-world settings. Abstract: Detecting tampered text in document images is a challenging task due to data scarcity. To address this, previous work has attempted to generate tampered documents using rule-based methods. However, the resulting documents often suffer from limited variety and poor visual quality, typically leaving highly visible artifacts that are rarely observed in real-world manipulations. This undermines the model's ability to learn robust, generalizable features and results in poor performance on real-world data. Motivated by this discrepancy, we propose a novel method for generating high-quality tampered document images. We first train an auxiliary network to compare text crops, leveraging contrastive learning with a novel strategy for defining positive pairs and their corresponding negatives. We also train a second auxiliary network to evaluate whether a crop tightly encloses the intended characters, without cutting off parts of characters or including parts of adjacent ones. Using a carefully designed generation pipeline that leverages both networks, we introduce a framework capable of producing diverse, high-quality tampered document images. We assess the effectiveness of our data generation pipeline by training multiple models on datasets derived from the same source images, generated using our method and existing approaches, under identical training protocols. Evaluating these models on various open-source datasets shows that our pipeline yields consistent performance improvements across architectures and datasets.

[73] Polaffini: A feature-based approach for robust affine and polyaffine image registration

Antoine Legouhy,Cosimo Campo,Ross Callaghan,Hojjat Azadbakht,Hui Zhang

Main category: cs.CV

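TL;DR: This paper proposes Polaffini, a framework for anatomically grounded image registration that uses the centroids of anatomical structures, obtained from pre-trained deep learning segmentation models, as feature points. Closed-form global and local affine matching yields a polyaffine transformation with tunable smoothness, which preserves diffeomorphic properties while improving structural alignment and the initialization of downstream non-linear registration.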

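Details Motivation: Conventional intensity-based medical image registration relies on surrogate measures of alignment quality; feature-based methods are more desirable in theory but fell out of favor because features were hard to extract reliably. Recent deep learning segmentation models now deliver reliable, fine-grained anatomical delineations, which motivates new anatomically grounded registration algorithms. Method: Polaffini takes the anatomical regions output by a pre-trained segmentation model and extracts their centroids as feature points with 1-to-1 correspondence; these points enable efficient global and local affine matching via closed-form solutions; the results are combined into an overall transformation ranging from affine to polyaffine with tunable smoothness; the polyaffine transformation is embedded in the log-Euclidean framework to ensure diffeomorphic properties. Result: Polaffini outperforms popular intensity-based registration methods in structural alignment and clearly improves initialization for downstream non-linear registration; it is fast, robust, and accurate. Conclusion: Polaffini turns modern deep learning segmentation into an anatomically grounded registration approach that is practical and efficient, usable for standalone registration or as pre-alignment for non-linear registration, and well suited to integration into medical image processing pipelines. Abstract: In this work we present Polaffini, a robust and versatile framework for anatomically grounded registration. Medical image registration is dominated by intensity-based registration methods that rely on surrogate measures of alignment quality. In contrast, feature-based approaches that operate by identifying explicit anatomical correspondences, while more desirable in theory, have largely fallen out of favor due to the challenges of reliably extracting features. However, such challenges are now significantly overcome thanks to recent advances in deep learning, which provide pre-trained segmentation models capable of instantly delivering reliable, fine-grained anatomical delineations. We aim to demonstrate that these advances can be leveraged to create new anatomically-grounded image registration algorithms. To this end, we propose Polaffini, which obtains, from these segmented regions, anatomically grounded feature points with 1-to-1 correspondence in a particularly simple way: extracting their centroids. These enable efficient global and local affine matching via closed-form solutions. Those are used to produce an overall transformation ranging from affine to polyaffine with tunable smoothness. Polyaffine transformations can have many more degrees of freedom than affine ones allowing for finer alignment, and their embedding in the log-Euclidean framework ensures diffeomorphic properties.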
Polaffini has applications both for standalone registration and as pre-alignment for subsequent non-linear registration, and we evaluate it against popular intensity-based registration techniques. Results demonstrate that Polaffini outperforms competing methods in terms of structural alignment and provides improved initialisation for downstream non-linear registration. Polaffini is fast, robust, and accurate, making it particularly well-suited for integration into medical image processing pipelines.
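
The feature-extraction step that Polaffini describes as "particularly simple" (taking centroids of segmented regions and pairing them by label) can be sketched as follows. The sketch assumes 2-D integer label maps stored as nested lists, with label 0 as background; real medical volumes are 3-D and the function names here are hypothetical. The subsequent affine and polyaffine fitting is not shown.

```python
from collections import defaultdict

def region_centroids(labels, background=0):
    """Return {label: (row, col)} centroids of a 2-D integer label map."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])      # row sum, col sum, pixel count
    for r, row in enumerate(labels):
        for c, lab in enumerate(row):
            if lab != background:
                acc = sums[lab]
                acc[0] += r
                acc[1] += c
                acc[2] += 1
    return {lab: (sr / n, sc / n) for lab, (sr, sc, n) in sums.items()}

def matched_points(moving_labels, fixed_labels):
    """Pair centroids of labels present in both images; the shared label id
    provides the 1-to-1 correspondence."""
    m = region_centroids(moving_labels)
    f = region_centroids(fixed_labels)
    return [(m[k], f[k]) for k in sorted(m.keys() & f.keys())]
```

The list returned by `matched_points` is the kind of paired point set a closed-form least-squares affine fit can consume, either over all pairs (global) or over local neighbourhoods of pairs.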

Yuchang Jiang,Anton Raichuk,Xiaoye Tong,Vivien Sainte Fare Garnot,Daniel Ortiz-Gonzalo,Dan Morris,Konrad Schindler,Jan Dirk Wegner,Maxim Neumann

Main category: cs.CV

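TL;DR: This paper presents a multi-modal spatio-temporal deep learning model trained on Sentinel-1/2 image time series and uses it to produce the first 10 m-resolution tree crop map for South America. The map reveals that existing regulatory maps supporting the EUDR often misclassify smallholder agroforestry as forest, leading to false deforestation alerts; the high-accuracy baseline supports more effective, inclusive, and equitable zero-deforestation policy.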

Details Motivation: Monitoring tree crop expansion is vital for implementing zero-deforestation policies such as the EU's EUDR, but the lack of high-resolution data that distinguishes diverse agricultural systems from forests severely limits this work. Method: A multi-modal spatio-temporal deep learning model fuses Sentinel-1 radar and Sentinel-2 optical satellite image time series to produce a 10 m-resolution tree crop map of South America. Result: The map identifies about 11 million hectares of tree crops, 23% of which is linked to 2000–2020 forest cover loss; existing regulatory maps supporting the EUDR often misclassify established agriculture, particularly smallholder agroforestry, as forest. Conclusion: The high-resolution tree crop map reduces the risk of false deforestation alerts and supports forest conservation policies that are effective, inclusive, and equitable. Abstract: Monitoring tree crop expansion is vital for zero-deforestation policies like the European Union's Regulation on Deforestation-free Products (EUDR). However, these efforts are hindered by a lack of high-resolution data distinguishing diverse agricultural systems from forests. Here, we present the first 10m-resolution tree crop map for South America, generated using a multi-modal, spatio-temporal deep learning model trained on Sentinel-1 and Sentinel-2 satellite imagery time series. The map identifies approximately 11 million hectares of tree crops, 23% of which is linked to 2000-2020 forest cover loss. Critically, our analysis reveals that existing regulatory maps supporting the EUDR often classify established agriculture, particularly smallholder agroforestry, as "forest". This discrepancy risks false deforestation alerts and unfair penalties for small-scale farmers. Our work mitigates this risk by providing a high-resolution baseline, supporting conservation policies that are effective, inclusive, and equitable.

[75] DRetHTR: Linear-Time Decoder-Only Retentive Network for Handwritten Text Recognition

Changhun Kim,Martin Mayr,Thomas Gorges,Fei Wu,Mathias Seuret,Andreas Maier,Vincent Christlein

Main category: cs.CV

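TL;DR: This paper proposes DRetHTR, a decoder-only handwritten text recognition model built on Retentive Networks (RetNet). By replacing softmax attention with retention, injecting multi-scale sequential priors, and applying layer-wise gamma scaling, it substantially improves inference speed and memory efficiency while matching or exceeding the accuracy of existing Transformer models.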

Details Motivation: Transformer-based handwritten text recognition (HTR) systems decode slowly and use a lot of memory because their key-value (KV) cache grows with output length, so a more efficient architecture is needed. Method: DRetHTR uses RetNet as a decoder-only backbone, replacing self-attention with softmax-free retention; it injects multi-scale sequential priors to model temporal structure at different granularities; and it introduces layer-wise gamma scaling so that deeper layers progressively enlarge the effective retention horizon, recovering a local-to-global inductive bias. Result: Compared with an equally sized decoder-only Transformer baseline, DRetHTR is 1.6–1.9x faster and uses 38–42% less memory; it achieves the best reported character error rates (CER) of 2.26% on IAM-A, 1.81% on RIMES, and 3.46% on Bentham, and is competitive on READ-2016 with 4.21%. Conclusion: A decoder-only RetNet reaches Transformer-level HTR accuracy while greatly improving decoding efficiency, which supports it as an efficient HTR backbone. Abstract: State-of-the-art handwritten text recognition (HTR) systems commonly use Transformers, whose growing key-value (KV) cache makes decoding slow and memory-intensive. We introduce DRetHTR, a decoder-only model built on Retentive Networks (RetNet). Compared to an equally sized decoder-only Transformer baseline, DRetHTR delivers 1.6-1.9x faster inference with 38-42% less memory usage, without loss of accuracy. By replacing softmax attention with softmax-free retention and injecting multi-scale sequential priors, DRetHTR avoids a growing KV cache: decoding is linear in output length in both time and memory. To recover the local-to-global inductive bias of attention, we propose layer-wise gamma scaling, which progressively enlarges the effective retention horizon in deeper layers. This encourages early layers to model short-range dependencies and later layers to capture broader context, mitigating the flexibility gap introduced by removing softmax. Consequently, DRetHTR achieves best reported test character error rates of 2.26% (IAM-A, en), 1.81% (RIMES, fr), and 3.46% (Bentham, en), and is competitive on READ-2016 (de) with 4.21%. This demonstrates that decoder-only RetNet enables Transformer-level HTR accuracy with substantially improved decoding speed and memory efficiency.
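
Why retention avoids a growing KV cache can be seen in a minimal single-head sketch of the RetNet recurrence: a fixed-size state S is decayed by γ and updated with the outer product of the current key and value, and the output is the query applied to S. The sketch omits normalization, positional rotation, multiple heads, and DRetHTR's multi-scale priors; it is an illustration of the general retention mechanism, not the authors' implementation.

```python
def retention_recurrent(queries, keys, values, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n, o_n = q_n S_n.
    The state has shape d_k x d_v regardless of sequence length."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        for a in range(d_k):
            for b in range(d_v):
                state[a][b] = gamma * state[a][b] + k[a] * v[b]
        outputs.append([sum(q[a] * state[a][b] for a in range(d_k))
                        for b in range(d_v)])
    return outputs

def retention_parallel(queries, keys, values, gamma):
    """Equivalent causal form: o_n = sum_{m<=n} gamma^(n-m) (q_n . k_m) v_m."""
    outputs = []
    for n, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for m in range(n + 1):
            w = gamma ** (n - m) * sum(x * y for x, y in zip(q, keys[m]))
            out = [o + w * x for o, x in zip(out, values[m])]
        outputs.append(out)
    return outputs
```

Both functions return the same outputs, but the recurrent one does constant work per decoded token. Because past tokens are weighted by γ^(n−m), a γ closer to 1 reaches further back (the weights sum to roughly 1 / (1 − γ)), which is a plausible reading of how layer-wise gamma scaling enlarges the retention horizon in deeper layers.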

[76] SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery

Lorenzo Caselli,Marco Mistretta,Simone Magistri,Andrew D. Bagdanov

Main category: cs.CV

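TL;DR: This paper proposes SpectralGCD, an efficient and effective multimodal method for Generalized Category Discovery (GCD). It uses CLIP cross-modal image-concept similarities as a unified representation and applies spectral filtering and bidirectional knowledge distillation to improve semantic quality and alignment, matching or exceeding SOTA on multiple benchmarks at lower computational cost.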

Details Motivation: GCD methods trained only on image features tend to overfit to known classes; multimodal methods improve performance but process modalities independently and are computationally expensive. Method: SpectralGCD (1) builds a unified cross-modal representation from CLIP image-concept similarities; (2) models each image as a mixture over concepts from a large task-agnostic semantic dictionary; (3) introduces spectral filtering, which uses the cross-modal covariance matrix of a teacher model to automatically select relevant concepts; (4) applies forward and reverse knowledge distillation so the student's representations stay semantically sufficient and well aligned across modalities. Result: On six benchmark datasets, SpectralGCD matches or significantly exceeds the accuracy of state-of-the-art methods at a much lower computational cost. Conclusion: Through semantic anchoring, spectral filtering, and bidirectional distillation, SpectralGCD delivers efficient, robust, and interpretable multimodal GCD and offers a new approach to discovering novel categories in unlabeled data. Abstract: Generalized Category Discovery (GCD) aims to identify novel categories in unlabeled data while leveraging a small labeled subset of known classes. Training a parametric classifier solely on image features often leads to overfitting to old classes, and recent multimodal approaches improve performance by incorporating textual information. However, they treat modalities independently and incur high computational cost. We propose SpectralGCD, an efficient and effective multimodal approach to GCD that uses CLIP cross-modal image-concept similarities as a unified cross-modal representation. Each image is expressed as a mixture over semantic concepts from a large task-agnostic dictionary, which anchors learning to explicit semantics and reduces reliance on spurious visual cues. To maintain the semantic quality of representations learned by an efficient student, we introduce Spectral Filtering which exploits a cross-modal covariance matrix over the softmaxed similarities measured by a strong teacher model to automatically retain only relevant concepts from the dictionary. Forward and reverse knowledge distillation from the same teacher ensures that the cross-modal representations of the student remain both semantically sufficient and well-aligned. Across six benchmarks, SpectralGCD delivers accuracy comparable to or significantly superior to state-of-the-art methods at a fraction of the computational cost. The code is publicly available at: https://github.com/miccunifi/SpectralGCD.

[77] A High-Level Survey of Optical Remote Sensing

Panagiotis Koletsis,Vasilis Efthymiou,Maria Vakalopoulou,Nikos Komodakis,Anastasios Doulamis,Georgios Th. Papadopoulos

Main category: cs.CV

TL;DR: This is a survey of optical remote sensing that aims to give researchers new to the field a broad overview, key datasets, and practical insights.

Details Motivation: The literature lacks a systematic survey of optical remote sensing (especially with drone-mounted RGB cameras) from a holistic perspective, so a comprehensive guide for new researchers is needed. Method: A systematic literature review and synthesis that organizes the tasks, capabilities, methods, and public datasets of optical remote sensing as a guide to the field. Result: A comprehensive knowledge framework covering the core capabilities, main tasks, common datasets, and practical insights of optical remote sensing, filling the gap left by the absence of such a survey. Conclusion: The survey summarizes the current state and resources of optical remote sensing (particularly drone RGB imagery) and gives newcomers a clear learning path and sense of direction. Abstract: In recent years, significant advances in computer vision have also propelled progress in remote sensing. Concurrently, the use of drones has expanded, with many organizations incorporating them into their operations. Most drones are equipped by default with RGB cameras, which are both robust and among the easiest sensors to use and interpret. The body of literature on optical remote sensing is vast, encompassing diverse tasks, capabilities, and methodologies. Each task or methodology could warrant a dedicated survey. This work provides a comprehensive overview of the capabilities of the field, while also presenting key information, such as datasets and insights. It aims to serve as a guide for researchers entering the field, offering high-level insights and helping them focus on areas most relevant to their interests. To the best of our knowledge, no existing survey addresses this holistic perspective.

[78] EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models

Xiaomeng Peng,Xilang Huang,Seon Han Choi

Main category: cs.CV

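TL;DR: This paper proposes EAGLE, a tuning-free framework for industrial anomaly detection with multimodal large language models (MLLMs). It feeds expert-model outputs to the MLLM to guide it toward accurate detection and interpretable descriptions, and shows that it increases attention on anomalous regions.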

Details Motivation: Existing deep learning methods give only binary decisions without semantic explanations; MLLMs can potentially generate fine-grained language-based analyses but need costly fine-tuning and are often less accurate than lightweight specialist detectors. Method: The tuning-free Expert-Augmented Attention Guidance framework (EAGLE) injects expert-model outputs into the MLLM as guidance signals, and the authors analyze how the attention of intermediate MLLM layers to anomalous image regions changes. Result: On MVTec-AD and VisA, EAGLE improves the anomaly detection performance of multiple MLLMs without updating any parameters, reaching results comparable to fine-tuning-based methods, with attention more concentrated on anomalous regions. Conclusion: EAGLE delivers efficient, training-free, accurate, and interpretable industrial anomaly detection, and links attention behavior to detection performance, offering a new approach to applying MLLMs in industrial vision. Abstract: Industrial anomaly detection is important for smart manufacturing, but many deep learning approaches produce only binary decisions and provide limited semantic explanations. Multimodal large language models (MLLMs) can potentially generate fine-grained, language-based analyses, yet existing methods often require costly fine-tuning and do not consistently improve anomaly detection accuracy compared to lightweight specialist detectors. We propose expert-augmented attention guidance for industrial anomaly detection in MLLMs (EAGLE), a tuning-free framework that integrates outputs from an expert model to guide MLLMs toward both accurate detection and interpretable anomaly descriptions. We further study how EAGLE affects MLLMs' internals by examining the attention distribution of MLLMs to the anomalous image regions in the intermediate layers. We observe that successful anomaly detection is associated with increased attention concentration on anomalous regions, and EAGLE tends to encourage this alignment. Experiments on MVTec-AD and VisA show that EAGLE improves anomaly detection performance across multiple MLLMs without any parameter updates, achieving results comparable to fine-tuning based methods. Code is available at https://github.com/shengtun/Eagle.

[79] 4D Monocular Surgical Reconstruction under Arbitrary Camera Motions

Jiwei Shan,Zeyu Cai,Cheng-Tai Hsieh,Yirui Li,Hao Liu,Lijun Han,Hesheng Wang,Shing Shin Cheng

Main category: cs.CV

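TL;DR: This paper proposes Local-EndoGS, a high-quality 4D reconstruction framework for monocular endoscopic video that supports arbitrary camera motion. Window-based local modeling, a coarse-to-fine initialization strategy, and long-range trajectory and physical motion priors improve deformation plausibility and reconstruction quality.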

Details Motivation: Existing methods based on implicit neural representations or 3D Gaussian splatting mostly assume fixed viewpoints, stereo depth priors, or accurate structure-from-motion (SfM), and struggle with the monocular, large-motion endoscopic sequences common in clinical practice. Method: Local-EndoGS (1) uses a progressive, window-based global representation that assigns a local deformable scene model to each observed window; (2) applies a coarse-to-fine initialization strategy that integrates multi-view geometry, cross-window information, and monocular depth priors; (3) adds long-range 2D pixel trajectory constraints and physical motion priors to improve deformation plausibility. Result: On three public deformable endoscopic datasets, Local-EndoGS outperforms state-of-the-art methods in both appearance quality and geometric accuracy; ablation studies confirm the value of each key design. Conclusion: Local-EndoGS addresses 4D reconstruction for monocular endoscopy with large motion and no depth priors, shows potential for clinical use, and its code will be released. Abstract: Reconstructing deformable surgical scenes from endoscopic videos is challenging and clinically important. Recent state-of-the-art methods based on implicit neural representations or 3D Gaussian splatting have made notable progress. However, most are designed for deformable scenes with fixed endoscope viewpoints and rely on stereo depth priors or accurate structure-from-motion for initialization and optimization, limiting their ability to handle monocular sequences with large camera motion in real clinical settings. To address this, we propose Local-EndoGS, a high-quality 4D reconstruction framework for monocular endoscopic sequences with arbitrary camera motion. Local-EndoGS introduces a progressive, window-based global representation that allocates local deformable scene models to each observed window, enabling scalability to long sequences with substantial motion. To overcome unreliable initialization without stereo depth or accurate structure-from-motion, we design a coarse-to-fine strategy integrating multi-view geometry, cross-window information, and monocular depth priors, providing a robust foundation for optimization. We further incorporate long-range 2D pixel trajectory constraints and physical motion priors to improve deformation plausibility. Experiments on three public endoscopic datasets with deformable scenes and varying camera motions show that Local-EndoGS consistently outperforms state-of-the-art methods in appearance quality and geometry. Ablation studies validate the effectiveness of our key designs. Code will be released upon acceptance at: https://github.com/IRMVLab/Local-EndoGS.

[80] QuPAINT: Physics-Aware Instruction Tuning Approach to Quantum Material Discovery

Xuan-Bac Nguyen,Hoang-Quan Nguyen,Sankalp Pandey,Tim Faltermeier,Nicholas Borys,Hugh Churchill,Khoa Luu

Main category: cs.CV

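TL;DR: This paper proposes a physics-aware multimodal framework for accurately characterizing two-dimensional quantum materials from optical microscopy images. It comprises the synthetic data generator Synthia, QMat-Instruct (the first instruction dataset for quantum materials), the physics-aware instruction tuning method QuPAINT, and the comprehensive benchmark QF-Bench.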

Details Motivation: Existing vision models lack physical priors and generalize poorly to new materials or hardware conditions; the domain also suffers from scarce labeled data, weak contrast, and large variation across laboratories. Method: The authors introduce Synthia (a physics-based synthetic data generator), QMat-Instruct (a physics-informed multimodal instruction dataset), QuPAINT (a multimodal architecture with physics-informed attention that fuses optical priors), and QF-Bench (a comprehensive benchmark spanning multiple materials, substrates, and imaging conditions). Result: The framework reduces dependence on manual annotation, improves generalization and robustness across materials and devices, and supports fair, reproducible evaluation. Conclusion: The physics-aware multimodal framework bridges the gap between domain-specific physics and data-driven AI, enabling more reliable and scalable characterization of 2D quantum materials. Abstract: Characterizing two-dimensional quantum materials from optical microscopy images is challenging due to the subtle layer-dependent contrast, limited labeled data, and significant variation across laboratories and imaging setups. Existing vision models struggle in this domain since they lack physical priors and cannot generalize to new materials or hardware conditions. This work presents a new physics-aware multimodal framework that addresses these limitations from both the data and model perspectives. We first present Synthia, a physics-based synthetic data generator that simulates realistic optical responses of quantum material flakes under thin-film interference. Synthia produces diverse and high-quality samples, helping reduce the dependence on expert manual annotation. We introduce QMat-Instruct, the first large-scale instruction dataset for quantum materials, comprising multimodal, physics-informed question-answer pairs designed to teach Multimodal Large Language Models (MLLMs) to understand the appearance and thickness of flakes. Then, we propose Physics-Aware Instruction Tuning (QuPAINT), a multimodal architecture that incorporates a Physics-Informed Attention module to fuse visual embeddings with optical priors, enabling more robust and discriminative flake representations. Finally, we establish QF-Bench, a comprehensive benchmark spanning multiple materials, substrates, and imaging settings, offering standardized protocols for fair and reproducible evaluation.

[81] Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection

Yichen Lu,Siwei Nie,Minlong Lu,Xudong Yang,Xiaobo Zhang,Peng Zhang

Main category: cs.CV

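TL;DR: This paper proposes PixTrace and CopyNCE, which use pixel-level coordinate tracking and a geometrically guided contrastive loss to improve both the detection of finely edited content in image copy detection and its interpretability.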

Details Motivation: Existing self-supervised image copy detection methods based on view-level contrastive learning struggle with sophisticated edits because they do not learn fine-grained correspondences. Method: The PixTrace module explicitly models pixel-space mappings under editing transformations, and the CopyNCE loss uses overlap ratios derived from PixTrace to constrain similarity learning between image patches. Result: The method reaches SOTA performance on DISC21 (matcher: 88.7% uAP / 83.9% RP90; descriptor: 72.6% uAP / 68.4% RP90) and is more interpretable. Conclusion: Bringing pixel-level geometric traceability into self-supervised contrastive learning suppresses training noise and markedly improves robustness to sophisticated edits and interpretability in image copy detection. Abstract: Image Copy Detection (ICD) aims to identify manipulated content between image pairs through robust feature representation learning. While self-supervised learning (SSL) has advanced ICD systems, existing view-level contrastive methods struggle with sophisticated edits due to insufficient fine-grained correspondence learning. We address this limitation by exploiting the inherent geometric traceability in edited content through two key innovations. First, we propose PixTrace, a pixel coordinate tracking module that maintains explicit spatial mappings across editing transformations. Second, we introduce CopyNCE, a geometrically-guided contrastive loss that regularizes patch affinity using overlap ratios derived from PixTrace's verified mappings. Our method bridges pixel-level traceability with patch-level similarity learning, suppressing supervision noise in SSL training. Extensive experiments demonstrate not only state-of-the-art performance (88.7% uAP / 83.9% RP90 for matcher, 72.6% uAP / 68.4% RP90 for descriptor on DISC21 dataset) but also better interpretability over existing methods.

[82] FoundationPose-Initialized 3D-2D Liver Registration for Surgical Augmented Reality

Hanyuan Zhang,Lucas He,Runlong He,Abdolrahim Kadkhodamohammadi,Danail Stoyanov,Brian R. Davidson,Evangelos B. Mazomenos,Matthew J. Clarkson

Main category: cs.CV

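TL;DR: This paper proposes an augmented reality registration method that combines laparoscopic depth maps with a foundation pose estimator and replaces finite-element (FE) models with non-rigid iterative closest point (NICP) for liver deformation registration. It reaches a 9.91 mm mean registration error on real patient data, giving clinically usable accuracy with lower engineering complexity.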

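Details Motivation: AR tumor localization in laparoscopic liver surgery currently depends on organ contours and complex finite-element deformation models, which carry high modeling and engineering barriers; a lighter, easier-to-deploy alternative is needed. Method: Fuses laparoscopic depth maps with a foundation pose estimator for camera-liver pose estimation and replaces the conventional FE deformation model with non-rigid ICP (NICP), forming a combined rigid + NICP registration pipeline. Result: On data from 3 real patient cases, the depth-augmented foundation pose approach achieves a mean registration error of 9.91 mm; rigid + NICP outperforms rigid-only registration, indicating that NICP can efficiently substitute for FE models. Conclusion: The method maintains clinically relevant accuracy while substantially lowering modeling and engineering complexity, offering a lightweight, engineering-friendly approach to deformable registration for intraoperative AR. Abstract: Augmented reality can improve tumor localization in laparoscopic liver surgery. Existing registration pipelines typically depend on organ contours; deformable (non-rigid) alignment is often handled with finite-element (FE) models coupled to dimensionality-reduction or machine-learning components. We integrate laparoscopic depth maps with a foundation pose estimator for camera-liver pose estimation and replace FE-based deformation with non-rigid iterative closest point (NICP) to lower engineering/modeling complexity and expertise requirements. On real patient data, the depth-augmented foundation pose approach achieved 9.91 mm mean registration error in 3 cases. Combined rigid-NICP registration outperformed rigid-only registration, demonstrating NICP as an efficient substitute for finite-element deformable models. This pipeline achieves clinically relevant accuracy while offering a lightweight, engineering-friendly alternative to FE-based deformation.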

[83] LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs

Behzad Bozorgtabar,Dwarikanath Mahapatra,Sudipta Roy,Muzammal Naseer,Imran Razzak,Zongyuan Ge

Main category: cs.CV

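TL;DR: This paper proposes LATA, which uses Laplacian smoothing and a failure-aware conformal score to improve the efficiency and class balance of zero-shot prediction sets for medical vision-language models under domain shift, without breaking exchangeability.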

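Details Motivation: Split conformal prediction (SCP) for medical vision-language models suffers from overly large prediction sets and unbalanced class-wise coverage (high CCV), especially in few-shot, imbalanced settings; directly using calibration labels also breaks exchangeability and voids the theoretical guarantees. Method: Proposes LATA (Laplacian-Assisted Transductive Adaptation): 1) builds an image k-NN graph over the joint calibration and test pool and applies Laplacian smoothing to the zero-shot probabilities (a small number of CCCP mean-field updates); 2) designs a failure-aware conformal score that plugs into the ViLU framework to model instance difficulty and label plausibility. No model training or label updates are required, and both a strictly label-free variant and an optional label-informed variant are supported. Result: Across 3 medical VLMs and 9 downstream tasks, LATA reduces prediction set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and approaching label-using methods at far lower compute. Ablations and qualitative analyses indicate that it sharpens predictions while preserving exchangeability. Conclusion: LATA is an efficient, lightweight, theoretically grounded, plug-and-play post-processing method that offers a practical route to trustworthy zero-shot inference for medical VLMs under real clinical domain shift. Abstract: Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage unbalanced, with a high class-conditioned coverage gap (CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose LATA (Laplacian-Assisted Transductive Adaptation), a training- and label-free refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image-image k-NN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a failure-aware conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency and class-wise balance at fixed coverage. LATA is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant using calibration marginals once.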
Across three medical VLMs and nine downstream tasks, LATA consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that LATA sharpens zero-shot predictions without compromising exchangeability.
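For readers unfamiliar with split conformal prediction, the baseline procedure that the abstract builds on can be sketched as follows. This is a minimal illustration of standard SCP with the common nonconformity score 1 - p(true class); it is not LATA's failure-aware score or its graph smoothing, and the function names are hypothetical.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores.

    Uses the finite-sample corrected rank ceil((n + 1) * (1 - alpha)).
    If that rank exceeds n, no finite threshold gives the guarantee,
    so infinity is returned (every class is then kept).
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(cal_scores)[rank - 1]

def prediction_set(probs, threshold):
    """Keep every class whose score 1 - p does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]

# Calibration scores are 1 - p(true class) on held-out labeled examples.
cal = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
qhat = conformal_threshold(cal, alpha=0.5)
print(prediction_set([0.6, 0.3, 0.1], qhat))
```

Smaller prediction sets at the same coverage level correspond to the "efficiency" the abstract refers to.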

[84] GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking

Zixu Cheng,Da Li,Jian Hu,Ziquan Liu,Wei Li,Shaogang Gong

Main category: cs.CV

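TL;DR: This paper proposes GraphThinker, a reinforcement-finetuning-based method that reduces hallucinations in video reasoning by constructing event-level scene graphs and strengthening visual grounding.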

Details Motivation: Existing multimodal large language models lack explicit causal structure modeling in video reasoning, which makes them prone to hallucination, and implicit causal relations are costly to annotate. Method: Proposes GraphThinker: 1) uses an MLLM to construct an event-based video scene graph (EVSG) that explicitly models intra- and inter-event relations and serves as an intermediate reasoning process; 2) introduces a visual attention reward during reinforcement finetuning to strengthen visual grounding. Result: On the RexTime and VidHalluc datasets, GraphThinker outperforms prior methods in modeling object and event relations and in event localization precision, and reduces hallucinations in video reasoning. Conclusion: Explicitly modeling the causal structure of events, combined with visual grounding optimization during reinforcement learning, mitigates hallucinations of multimodal large language models in video reasoning. Abstract: Video reasoning requires understanding the causal relationships between events in a video. However, such relationships are often implicit and costly to annotate manually. While existing multimodal large language models (MLLMs) often infer event relations through dense captions or video summaries for video reasoning, such modeling still lacks causal understanding. Without explicit causal structure modeling within and across video events, these models suffer from hallucinations during the video reasoning. In this work, we propose GraphThinker, a reinforcement finetuning-based method that constructs structural event-level scene graphs and enhances visual grounding to jointly reduce hallucinations in video reasoning. Specifically, we first employ an MLLM to construct an event-based video scene graph (EVSG) that explicitly models both intra- and inter-event relations, and incorporate these formed scene graphs into the MLLM as an intermediate thinking process. We also introduce a visual attention reward during reinforcement finetuning, which strengthens video grounding and further mitigates hallucinations. We evaluate GraphThinker on two datasets, RexTime and VidHalluc, where it shows superior ability to capture object and event relations with more precise event localization, reducing hallucinations in video reasoning compared to prior methods.

[85] RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward

Qiucheng Wu,Jing Shi,Simon Jenni,Kushal Kafle,Tianyu Wang,Shiyu Chang,Handong Zhao

Main category: cs.CV

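TL;DR: This paper proposes RetouchIQ, a framework that pairs multimodal large language model (MLLM) agents with a generalist reward model to perform instruction-based, executable image editing, addressing the unreliable reward signals caused by the subjectivity of creative image editing.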

Details Motivation: Existing reinforcement-learning-based image editing methods lack reliable, verifiable reward signals that reflect the subjectivity of creative editing. Method: Proposes the RetouchIQ framework, comprising: 1) an MLLM agent that interprets the user's editing intent and generates executable parameter adjustments; 2) a generalist reward model (an RL fine-tuned MLLM) that generates multi-dimensional evaluation metrics on a case-by-case basis and outputs scalar feedback, supporting high-quality, instruction-consistent reinforcement learning; 3) a dataset of 190k instruction-reasoning pairs and a new benchmark. Result: Substantially outperforms previous MLLM-based and diffusion-based editing systems in semantic consistency and perceptual quality. Conclusion: Generalist reward-driven MLLM agents can serve as flexible, explainable, and executable assistants for professional image editing. Abstract: Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable image adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model, an RL fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. Then, the reward model provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems.
Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.

[86] Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment

Ivan Rinaldi,Matteo Mendula,Nicola Fanelli,Florence Levé,Matteo Testi,Giovanna Castellano,Gennaro Vessio

Main category: cs.CV

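TL;DR: This paper introduces the ArtSound dataset and the ArtToMus framework, the first to generate music directly from artworks without a text intermediary, moving past the reliance of existing image-conditioned music generation models on natural photographs and language mediation.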

Details Motivation: Existing image-conditioned music generation systems have two major limitations: (i) their training data is restricted to natural photographs, which makes it hard to capture the rich semantic, stylistic, and cultural content of artworks; (ii) most methods rely on image-to-text conversion, using language as a semantic shortcut that prevents genuine direct visual-to-audio learning. Method: Builds the large-scale multimodal ArtSound dataset (105,884 artwork-music pairs with dual-modality captions) and proposes the ArtToMus framework, which projects visual embeddings directly into the conditioning space of a latent diffusion model, bypassing image-to-text conversion and enabling purely visually driven music synthesis. Result: Music generated by ArtToMus is musically coherent and stylistically consistent and reflects salient visual cues of the source artwork; although absolute alignment scores are lower than those of text-conditioned systems (as expected), it is competitive in perceptual quality and cross-modal correspondence. Conclusion: This work establishes direct visual-to-music generation as a new and challenging research direction and provides resources that support multimedia art, cultural heritage, and AI-assisted creation. Abstract: Music generation has advanced markedly through multimodal deep learning, enabling models to synthesize audio from text and, more recently, from images. However, existing image-conditioned systems suffer from two fundamental limitations: (i) they are typically trained on natural photographs, limiting their ability to capture the richer semantic, stylistic, and cultural content of artworks; and (ii) most rely on an image-to-text conversion stage, using language as a semantic shortcut that simplifies conditioning but prevents direct visual-to-audio learning. Motivated by these gaps, we introduce ArtSound, a large-scale multimodal dataset of 105,884 artwork-music pairs enriched with dual-modality captions, obtained by extending ArtGraph and the Free Music Archive. We further propose ArtToMus, the first framework explicitly designed for direct artwork-to-music generation, which maps digitized artworks to music without image-to-text translation or language-based semantic supervision. The framework projects visual embeddings into the conditioning space of a latent diffusion model, enabling music synthesis guided solely by visual information. Experimental results show that ArtToMus generates musically coherent and stylistically consistent outputs that reflect salient visual cues of the source artworks. While absolute alignment scores remain lower than those of text-conditioned systems (as expected given the substantially increased difficulty of removing linguistic supervision), ArtToMus achieves competitive perceptual quality and meaningful cross-modal correspondence.
This work establishes direct visual-to-music generation as a distinct and challenging research direction, and provides resources that support applications in multimedia art, cultural heritage, and AI-assisted creative practice. Code and dataset will be publicly released upon acceptance.

[87] Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery

Jowaria Khan,Anindya Sarkar,Yevgeniy Vorobeychik,Elizabeth Bondi-Kelly

Main category: cs.CV

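TL;DR: This paper proposes a geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning, using concept relevance modeling to improve target discovery efficiency under sparse, biased labeled data.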

Details Motivation: In real-world settings (such as environmental monitoring, disaster response, and public health), data collection is costly, environments change dynamically, and ground truth is sparse and biased, which makes existing learning-based methods (such as reinforcement learning) hard to apply. Method: Proposes a unified geospatial discovery framework with two core innovations: 1) a concept-relevance-weighted uncertainty sampling strategy; 2) a relevance-aware meta-batch formation strategy that promotes semantic diversity and improves generalization in dynamic environments. Result: Experiments on a real-world PFAS contamination dataset show that the method reliably uncovers targets with limited data and a changing environment. Conclusion: The framework mitigates the challenges that label scarcity and environmental dynamics pose for geospatial discovery, improving discovery efficiency and robustness under resource constraints. Abstract: In many real-world settings, such as environmental monitoring, disaster response, or public health, with costly and difficult data collection and dynamic environments, strategically sampling from unobserved regions is essential for efficiently uncovering hidden targets under tight resource constraints. Yet, sparse and biased geospatial ground truth limits the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of *concept relevance*, which captures how domain-specific factors influence target presence: a *concept-weighted uncertainty sampling strategy*, where uncertainty is modulated by learned relevance based on readily-available domain-specific concepts (e.g., land cover, source proximity); and a *relevance-aware meta-batch formation strategy* that promotes semantic diversity during online-meta updates, improving generalization in dynamic environments. Our experiments include testing on a real-world dataset of cancer-causing PFAS (Per- and polyfluoroalkyl substances) contamination, showcasing our method's reliability at uncovering targets with limited data and a varying environment.
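The idea of uncertainty "modulated by learned relevance" can be illustrated with a toy acquisition rule. The sketch below assumes a simple multiplicative form (binary predictive entropy times a per-region relevance weight); the abstract does not give the exact formula, so both the product and the function names are assumptions for illustration only.

```python
import math

def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli prediction with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_region(pred_probs, relevance):
    """Pick the index of the region with the highest relevance-weighted uncertainty.

    pred_probs: predicted probability that each unobserved region holds a target.
    relevance:  hypothetical concept-relevance weight per region
                (e.g., derived from land cover or source proximity).
    """
    scores = [binary_entropy(p) * r for p, r in zip(pred_probs, relevance)]
    return max(range(len(scores)), key=scores.__getitem__)

# Regions 0 and 1 are equally uncertain, but region 1 is more relevant.
print(select_region([0.5, 0.5, 0.9], [0.2, 0.8, 1.0]))
```

Under this assumed rule, a highly uncertain region with low relevance can lose out to a moderately uncertain region that the concepts mark as likely to matter.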

[88] CORAL: Correspondence Alignment for Improved Virtual Try-On

Jiyoung Kim,Youngjin Shin,Siyoon Jin,Dahyun Chung,Jisu Nam,Tongmin Kim,Jongjae Park,Hyeonwoo Kang,Seungryong Kim

Main category: cs.CV

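TL;DR: This paper proposes CORAL, a framework that explicitly aligns query-key matching with external correspondences in Diffusion Transformers (DiTs) to improve person-garment correspondence modeling in virtual try-on, leading to better global shape transfer and local detail preservation.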

Details Motivation: Existing virtual try-on methods struggle to preserve fine garment details in unpaired settings, do not explicitly model person-garment correspondence, and do not explain how this correspondence arises within Diffusion Transformers. Method: Analyzes the full 3D attention mechanism in DiTs and finds that person-garment correspondence depends on precise query-key matching; based on this, proposes the CORAL framework, which includes a correspondence distillation loss and an entropy minimization loss, and introduces a VLM-based evaluation protocol. Result: CORAL outperforms the baseline in both global shape transfer and local detail preservation, and ablations validate each design choice. Conclusion: Explicitly modeling and optimizing person-garment attention matching is key to improving virtual try-on quality; CORAL offers a new approach to improving the interpretability and performance of DiTs in VTON. Abstract: Existing methods for Virtual Try-On (VTON) often struggle to preserve fine garment details, especially in unpaired settings where accurate person-garment correspondence is required. These methods do not explicitly enforce person-garment alignment and fail to explain how correspondence emerges within Diffusion Transformers (DiTs). In this paper, we first analyze full 3D attention in DiT-based architecture and reveal that the person-garment correspondence critically depends on precise person-garment query-key matching within the full 3D attention. Building on this insight, we then introduce CORrespondence ALignment (CORAL), a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. CORAL integrates two complementary components: a correspondence distillation loss that aligns reliable matches with person-garment attention, and an entropy minimization loss that sharpens the attention distribution. We further propose a VLM-based evaluation protocol to better reflect human preference. CORAL consistently improves over the baseline, enhancing both global shape transfer and local detail preservation. Extensive ablations validate our design choices.
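To see why minimizing entropy "sharpens" an attention distribution, consider the entropy of a softmax over query-key scores. The sketch below is a generic illustration of that quantity, not CORAL's actual loss; the function names and the toy scores are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(scores):
    """Shannon entropy (in nats) of the attention distribution for one query."""
    probs = softmax(scores)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A flat distribution has maximal entropy; a peaked one has low entropy.
print(attention_entropy([0.0, 0.0, 0.0, 0.0]))   # uniform over 4 keys
print(attention_entropy([10.0, 0.0, 0.0, 0.0]))  # concentrated on one key
```

Driving this quantity down pushes each person-side query to concentrate its attention on a few garment-side keys rather than spreading it evenly.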

[89] IntRec: Intent-based Retrieval with Contrastive Refinement

Pourya Shamsolmoali,Masoumeh Zareapoor,Eric Granger,Yue Lu

Main category: cs.CV

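TL;DR: This paper proposes IntRec, a framework that iteratively refines object retrieval results from user feedback, substantially improving accuracy on the LVIS and LVIS-Ambiguous datasets with very low latency.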

Details Motivation: Existing open-vocabulary detectors make one-shot predictions and cannot adjust results based on user feedback, so they handle ambiguous queries and scenes crowded with similar objects poorly. Method: IntRec introduces an Intent State (IS) that maintains two memory sets, positive anchors and negative constraints, and designs a contrastive alignment function that ranks candidate objects by maximizing similarity to positive samples and minimizing similarity to negative ones. Result: Reaches 35.4 AP on LVIS, outperforming OVMR, CoDet, and CAKE; on LVIS-Ambiguous, a single round of feedback yields +7.9 AP, with less than 30 ms of added latency per interaction. Conclusion: IntRec supports interactive object retrieval and improves retrieval accuracy and robustness on ambiguous queries without additional supervision. Abstract: Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
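The ranking behaviour described above (reward similarity to confirmed cues, penalize similarity to rejected ones) can be sketched in a few lines. This is a hypothetical scoring rule, best positive cosine similarity minus a weighted best negative similarity; the abstract does not specify IntRec's actual alignment function, and `penalty` is an illustrative parameter.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_candidates(candidates, positives, negatives, penalty=1.0):
    """Return candidate indices sorted from best to worst match.

    Score = best similarity to a positive anchor
            minus penalty * best similarity to a negative constraint.
    An empty memory set contributes 0.0.
    """
    def score(c):
        pos = max((cosine(c, p) for p in positives), default=0.0)
        neg = max((cosine(c, n) for n in negatives), default=0.0)
        return pos - penalty * neg

    return sorted(range(len(candidates)),
                  key=lambda i: score(candidates[i]),
                  reverse=True)

# Toy 2-D embeddings: one confirmed cue and one rejected hypothesis.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(rank_candidates(candidates, positives=[[1.0, 0.0]], negatives=[[0.0, 1.0]]))
```

Each round of user feedback would add an embedding to one of the two memory sets and re-rank, which is cheap because no model weights change.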

[90] Human-level 3D shape perception emerges from multi-view learning

Tyler Bonnen,Jitendra Malik,Angjoo Kanazawa

Main category: cs.CV

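TL;DR: This paper proposes a new multi-view neural network framework that learns visual-spatial information (such as camera location and depth) from images of natural scenes without object priors. In a zero-shot setting it is the first to match human accuracy on 3D shape inference, and it predicts human error patterns and reaction times.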

Details Motivation: Models of the human ability to infer 3D structure from 2D images have long fallen short of human performance, motivating learning mechanisms that are closer to human perception. Method: Designs a class of neural networks without object priors that take multi-view images of natural scenes as input and are trained with visual-spatial objectives (such as predicting camera pose and depth); uses zero-shot evaluation to compare model and human behavior on a standard 3D perception task. Result: The model is the first to match human accuracy on 3D shape inference; independent readouts of model responses predict fine-grained human behavior (such as error distributions and reaction times). Conclusion: Human-like 3D perception can emerge from natural visual-spatial data and a simple, scalable learning objective alone. Abstract: Humans can infer the three-dimensional structure of objects from two-dimensional visual inputs. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we develop a modeling framework that predicts human 3D shape inferences for arbitrary objects, directly from experimental stimuli. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data; given a set of images taken from different locations within a natural scene, these models learn to predict spatial information related to these images, such as camera location and visual depth, without relying on any object-related inductive biases. Notably, these visual-spatial signals are analogous to sensory cues readily available to humans. We design a zero-shot evaluation approach to determine the performance of these 'multi-view' models on a well established 3D perception task, then compare model and human behavior. Our modeling framework is the first to match human accuracy on 3D shape inferences, even without task-specific training or fine-tuning. Remarkably, independent readouts of model responses predict fine-grained measures of human behavior, including error patterns and reaction times, revealing a natural correspondence between model dynamics and human perception. Taken together, our findings indicate that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data.
All code, human behavioral data, and experimental stimuli needed to reproduce our findings can be found on our project page.

[91] When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs

Yu Fang,Yuchun Feng,Dong Jing,Jiaqi Liu,Yue Yang,Zhenyu Wei,Daniel Szafir,Mingyu Ding

Main category: cs.CV

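TL;DR: This paper proposes Counterfactual Action Guidance (CAG), which adds a language-unconditioned Vision-Action (VA) branch that makes decisions jointly with the original VLA model to counter instruction violations caused by visual shortcuts. Without additional training or model modification, it improves the language-following ability and task success rate of VLAs in counterfactual scenarios.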

Details Motivation: Existing Vision-Language-Action models (VLAs) often rely on visual shortcuts induced by dataset bias and ignore language instructions, producing counterfactual failures when strong scene-specific supervision is absent; this problem has not been studied systematically. Method: Proposes Counterfactual Action Guidance (CAG), a dual-branch inference scheme that combines the standard VLA policy with a language-unconditioned Vision-Action (VA) module and performs counterfactual comparison during action selection, explicitly regularizing language conditioning; it requires no new data, architectural changes, or fine-tuning. Result: On the newly built counterfactual benchmark LIBERO-CF, CAG with a training-free strategy improves π₀.₅ by 9.7% in language-following accuracy and 3.6% in task success; pairing it with a VA model raises the gains to 15.5% and 8.5%. In real-robot experiments, counterfactual failures drop by 9.4% and average task success rises by 17.2%. Conclusion: CAG is a plug-and-play, training-free, general-purpose enhancement that reduces the reliance of VLAs on visual shortcuts and improves the robustness of language following, especially for low-frequency or unseen tasks. Abstract: Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice often fail to faithfully follow language. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act based on vision shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training regardless of language intent. To systematically study it, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs that evaluates language following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate its plug-and-play integration across diverse VLAs and consistent improvements.
For example, on LIBERO-CF, CAG improves $π_{0.5}$ by 9.7% in language following accuracy and 3.6% in task success on under-observed tasks using a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.

[92] OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents

Akashah Shabbir,Muhammad Umer Sheikh,Muhammad Akhtar Munir,Hiyam Debary,Mustansar Fiaz,Muhammad Zaigham Zaheer,Paolo Fraccaro,Fahad Shahbaz Khan,Muhammad Haris Khan,Xiao Xiang Zhu,Salman Khan

Main category: cs.CV

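TL;DR: OpenEarthAgent proposes a multimodal geospatial agent framework for remote sensing. Trained with tool augmentation and structured reasoning trajectories, it achieves marked improvements in satellite imagery understanding, multispectral analysis, and multi-step geospatial tasks.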

Details Motivation: Existing multimodal reasoning models adapt poorly to the demands specific to remote sensing, such as spatial scale, geographic structure, and multispectral index analysis; geospatial agents that support multi-step logic in coordination with GIS tools are needed. Method: Builds a unified tool-augmented geospatial agent framework trained by supervised fine-tuning on a structured reasoning-trajectory dataset of 14,538 training instances with more than 100K reasoning steps, covering urban, environmental, disaster, and infrastructure domains and index analyses such as NDVI, NBR, and NDBI. Result: The model shows structured reasoning, stable spatial understanding, and interpretable tool-driven behaviour; it consistently improves over a strong baseline and is competitive with recent open and closed-source models. Conclusion: OpenEarthAgent supports the effectiveness of training on explicit reasoning traces with tool coordination for multimodal reasoning in remote sensing, offering a new path for geospatial agents. Abstract: Recent progress in multimodal reasoning has enabled agents that can interpret imagery, connect it with language, and perform structured analytical tasks. Extending such capabilities to the remote sensing domain remains challenging, as models must reason over spatial scale, geographic structures, and multispectral indices while maintaining coherent multi-step logic. To bridge this gap, OpenEarthAgent introduces a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multistep tool interactions across diverse analytical contexts. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split. It spans urban, environmental, disaster, and infrastructure domains, and incorporates GIS-based operations alongside index analyses such as NDVI, NBR, and NDBI. Grounded in explicit reasoning traces, the learned agent demonstrates structured reasoning, stable spatial understanding, and interpretable behaviour through tool-driven geospatial interactions across diverse conditions. We report consistent improvements over a strong baseline and competitive performance relative to recent open and closed-source models.
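The three indices named above are all normalized differences of two spectral bands, using their standard definitions: NDVI = (NIR - Red) / (NIR + Red), NBR = (NIR - SWIR2) / (NIR + SWIR2), and NDBI = (SWIR1 - NIR) / (SWIR1 + NIR). A minimal per-pixel sketch follows; the function names and the zero-denominator convention are choices made here, not part of OpenEarthAgent's tool set.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b); 0.0 when a + b is zero (a convention chosen here)."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0

def ndvi(nir, red):
    """Normalized Difference Vegetation Index: high for healthy vegetation."""
    return normalized_difference(nir, red)

def nbr(nir, swir2):
    """Normalized Burn Ratio: drops sharply over recently burned areas."""
    return normalized_difference(nir, swir2)

def ndbi(swir1, nir):
    """Normalized Difference Built-up Index: tends to be positive over built-up land."""
    return normalized_difference(swir1, nir)

# Surface reflectance values for a single vegetated pixel (illustrative numbers).
print(ndvi(nir=0.5, red=0.1))
print(ndbi(swir1=0.3, nir=0.5))
```

All three values fall in [-1, 1] for non-negative reflectances, which is what makes thresholds on them comparable across scenes.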

cs.OH [Back]

[93] A Conceptual Hybrid Framework for Post-Quantum Security: Integrating BB84 QKD, AES, and Bio-inspired Mechanisms

Md. Ismiel Hossen Abir

Main category: cs.OH

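TL;DR: This paper proposes a hybrid security framework for the post-quantum era that combines AES, BB84 QKD, quantum state comparison, and an immune-system-like mechanism to counter the threat that Shor's algorithm poses to RSA.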

Details Motivation: Quantum computing (Shor's algorithm in particular) seriously threatens classical public-key cryptography based on integer factorization, such as RSA, so a new protection scheme that covers both classical and quantum security is needed. Method: Proposes a conceptual hybrid security framework that integrates AES symmetric encryption, BB84 quantum key distribution, quantum state comparison for lightweight authentication, and an adaptive threat detection module inspired by biological immune mechanisms. Result: In theory the framework resists Shor-based attacks; BB84 secures key exchange and detects eavesdropping with high accuracy; the overall design is scalable and adaptive. Conclusion: The conceptual framework offers an approach to post-quantum data protection that blends classical and quantum techniques, but implementation, security proofs, and experimental validation remain future work. Abstract: Quantum computing is a significant risk to classical cryptography, especially RSA, which depends on the difficulty of factoring large numbers. Classical factorization methods, such as Trial Division and Pollard's Rho, are inefficient for large keys, while Shor's quantum algorithm can break RSA efficiently in polynomial time. This research studies RSA's vulnerabilities under both classical and quantum attacks and designs a hybrid security framework to ensure data protection in the post-quantum era. The conceptual framework combines AES encryption for classical security, BB84 Quantum Key Distribution (QKD) for secure key exchange with eavesdropping detection, quantum state comparison for lightweight authentication, and a bio-inspired immune system for adaptive threat detection. The study finds that RSA is vulnerable to Shor's algorithm, while BB84 achieves full key agreement in ideal conditions and detects eavesdropping with high accuracy. The conceptual model includes both classical and quantum security methods, providing a scalable and adaptive solution for post-quantum data protection. This work primarily proposes a conceptual framework. Detailed implementation, security proofs, and extensive experimental validation are considered future work.
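The two BB84 claims in the abstract (full key agreement under ideal conditions, and detectable eavesdropping) can be reproduced with a small classical simulation. The sketch below idealizes the channel (no noise or loss) and models the eavesdropper as a simple intercept-resend attacker; it is not the paper's implementation, and all names are illustrative.

```python
import random

def measure(bit, prep_basis, meas_basis, rng):
    """Measuring in the preparation basis returns the bit; otherwise the outcome is random."""
    return bit if prep_basis == meas_basis else rng.randint(0, 1)

def bb84(n_qubits, eavesdrop=False, seed=0):
    """Simulate BB84 and return the sifted keys of Alice and Bob."""
    rng = random.Random(seed)
    alice_key, bob_key = [], []
    for _ in range(n_qubits):
        bit = rng.randint(0, 1)
        a_basis = rng.choice("+x")   # Alice's preparation basis
        b_basis = rng.choice("+x")   # Bob's measurement basis
        sent_bit, sent_basis = bit, a_basis
        if eavesdrop:
            # Intercept-resend: Eve measures in a random basis and resends her result.
            e_basis = rng.choice("+x")
            sent_bit = measure(bit, a_basis, e_basis, rng)
            sent_basis = e_basis
        result = measure(sent_bit, sent_basis, b_basis, rng)
        if a_basis == b_basis:       # sifting: keep only matching-basis rounds
            alice_key.append(bit)
            bob_key.append(result)
    return alice_key, bob_key

def error_rate(a, b):
    """Fraction of sifted positions where the two keys disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a) if a else 0.0

a, b = bb84(2000)
print("no eavesdropper:", error_rate(a, b))
a, b = bb84(4000, eavesdrop=True)
print("intercept-resend:", round(error_rate(a, b), 3))
```

Without an eavesdropper the sifted keys agree exactly; with intercept-resend the expected sifted-key error rate is 25%, which is the signal Alice and Bob look for when they compare a sample of their key.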