Table of Contents
cs.CL
[1] References Improve LLM Alignment in Non-Verifiable Domains
Kejian Shi,Yixin Liu,Peifeng Wang,Alexander R. Fabbri,Shafiq Joty,Arman Cohan
Main category: cs.CL
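TL;DR: This paper proposes reference-guided LLM evaluators as a substitute for conventional RLVR in non-verifiable domains that lack ground-truth verifiers, such as LLM alignment. Reference outputs from frontier models or human writers improve the discriminative ability of LLM judges, which are then used for self-improvement alignment training, outperforming both SFT distillation and reference-free self-improvement.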
Details
Motivation: Reinforcement Learning with Verifiable Rewards (RLVR) depends on verifiable reward signals and cannot be applied directly to non-verifiable domains that lack ground-truth verifiers, such as LLM alignment. This work asks whether reference-guided LLM evaluators can fill the gap by acting as soft verifiers.
Method: The authors design evaluation protocols based on reference outputs that improve the accuracy of LLM judges at different capability levels. The enhanced judges are then used for reference-guided self-improvement alignment training, that is, the reference-guided LLM acts as the reward model for reinforcement-learning optimization.
Result: On AlpacaEval and Arena-Hard, Llama-3-8B-Instruct reaches 73.1% / 58.7% and Qwen2.5-7B reaches 70.0% / 74.1%. Average absolute gains are +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement, and performance is comparable to training with the ArmoRM reward model.
Conclusion: Reference-guided LLM evaluators can effectively support LLM post-training in non-verifiable domains, offering a viable, high-performing approach for alignment tasks that lack a ground-truth verifier.
Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether reference-guided LLM-evaluators can bridge this gap by serving as soft "verifiers". First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs. Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality (i.e., human-written) references. Building on these improved judges, we demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to self-improve. We show that reference-guided self-improvement yields clear gains over both direct SFT on reference outputs and self-improvement with reference-free judges, achieving performance comparable to training with ArmoRM, a strong finetuned reward model. Specifically, our method achieves 73.1% and 58.7% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B, corresponding to average absolute gains of +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement on AlpacaEval / Arena-Hard.
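To make the reference-guided judging idea in this abstract concrete, here is a minimal Python sketch of a pairwise judge that sees a reference answer next to the two candidates. The prompt wording, the function names (`build_judge_prompt`, `parse_verdict`, `pick_preferred`), and the `stub_judge` stand-in are all hypothetical illustrations and are not taken from the paper, whose actual protocol this digest does not show.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt template (an assumption, not the paper's protocol).
JUDGE_TEMPLATE = """You are comparing two responses to an instruction.
Use the reference answer as a guide to what a high-quality response looks like.

[Instruction]
{instruction}

[Reference answer]
{reference}

[Response A]
{output_a}

[Response B]
{output_b}

Finish with a line of the form 'Verdict: A' or 'Verdict: B'."""


def build_judge_prompt(instruction: str, reference: str,
                       output_a: str, output_b: str) -> str:
    """Fill the template so the judge sees the reference next to both candidates."""
    return JUDGE_TEMPLATE.format(instruction=instruction, reference=reference,
                                 output_a=output_a, output_b=output_b)


def parse_verdict(judge_reply: str) -> Optional[str]:
    """Return 'A' or 'B' if the reply contains a verdict line, else None."""
    match = re.search(r"Verdict:\s*([AB])\b", judge_reply)
    return match.group(1) if match else None


def pick_preferred(judge: Callable[[str], str], instruction: str, reference: str,
                   output_a: str, output_b: str) -> Optional[str]:
    """Ask the judge once and return the preferred response text (None if unparseable)."""
    prompt = build_judge_prompt(instruction, reference, output_a, output_b)
    verdict = parse_verdict(judge(prompt))
    if verdict == "A":
        return output_a
    if verdict == "B":
        return output_b
    return None


def stub_judge(prompt: str) -> str:
    """Stand-in for a real LLM call; it ignores the prompt and always prefers B."""
    return "Response B is closer to the reference.\nVerdict: B"


if __name__ == "__main__":
    winner = pick_preferred(stub_judge, "Say hi", "Hello!", "Yo", "Hello there!")
    print("preferred response:", winner)
```

In a self-improvement loop, the preferred and rejected responses returned by such a judge could be collected as training signal; how the paper turns judgments into updates is not specified here, so that step is left out.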
These results highlight the potential of using reference-guided LLM-evaluators to enable effective LLM post-training in non-verifiable domains.
[2] Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark
Charalampos Mastrokostas,Nikolaos Giarelis,Nikos Karacapilidis
Main category: cs.CL
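TL;DR: For Greek question answering, this paper contributes a new Greek QA dataset (DemosQA) and a memory-efficient LLM evaluation framework, and evaluates 11 monolingual and multilingual LLMs on 6 Greek QA datasets.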
Details
Motivation: Existing LLMs mainly target high-resource languages such as English. Multilingual models are biased toward a few popular languages in their training data or rely on transfer from high- to low-resource languages, which can under-represent social, cultural, and historical context. The effectiveness of monolingual models built for low-resource languages on language-specific tasks remains under-studied.
Method: The authors build DemosQA, a new Greek QA dataset drawn from social media user questions and community-reviewed answers; design a memory-efficient LLM evaluation framework that adapts to diverse QA datasets and languages; and systematically evaluate 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets under 3 prompting strategies.
Result: The work delivers the first high-quality QA dataset focused on the Greek social and cultural context (DemosQA), validates the effectiveness and generality of the evaluation framework, and reveals performance differences across LLMs on Greek QA as well as the impact of prompting strategy. All code and data are released.
Conclusion: Monolingual LLMs show promise on Greek QA, but their performance depends markedly on model architecture, training data, and prompting strategy. The work provides methodological support and an empirical benchmark for evaluating and applying LLMs in low-resource languages.
Abstract: Recent advancements in Natural Language Processing and Deep Learning have enabled the development of Large Language Models (LLMs), which have significantly advanced the state-of-the-art across a wide range of tasks, including Question Answering (QA). Despite these advancements, research on LLMs has primarily targeted high-resourced languages (e.g., English), and only recently has attention shifted toward multilingual models. However, these models demonstrate a training data bias towards a small number of popular languages or rely on transfer learning from high- to under-resourced languages; this may lead to a misrepresentation of social, cultural, and historical aspects. To address this challenge, monolingual LLMs have been developed for under-resourced languages; however, their effectiveness remains less studied when compared to multilingual counterparts on language-specific tasks. In this study, we address this research gap in Greek QA by contributing: (i) DemosQA, a novel dataset, which is constructed using social media user questions and community-reviewed answers to better capture the Greek social and cultural zeitgeist; (ii) a memory-efficient LLM evaluation framework adaptable to diverse QA datasets and languages; and (iii) an extensive evaluation of 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets using 3 different prompting strategies. We release our code and data to facilitate reproducibility.
[3] One-step Language Modeling via Continuous Denoising
Chanhyuk Lee,Jaehoon Yoo,Manan Agarwal,Sheel Shah,Jerry Huang,Aditi Raghunathan,Seunghoon Hong,Nicholas M. Boffi,Jinwoo Kim
Main category: cs.CL
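TL;DR: This paper proposes a flow-based continuous-denoising language model (FLM) and its distilled version (FMLM), which outperform discrete diffusion models in both generation quality and speed; notably, one-step generation already exceeds the 8-step quality of other models.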
Details
Motivation: Discrete diffusion language models suffer a sharp drop in sample quality in the few-step regime and so fail to deliver their promised speedup over autoregressive models. This work explores whether continuous flow models can replace discrete diffusion to achieve both quality and efficiency.
Method: The authors build a flow-based language model (FLM) that performs continuous denoising in Euclidean space over one-hot token encodings, train it to predict the clean data with a cross-entropy objective, and introduce a time reparameterization that improves training stability. FLM is then distilled into a flow map language model (FMLM) to support few-step and one-step generation.
Result: On LM1B and OWT, FLM matches the generation quality of state-of-the-art discrete diffusion models. FMLM outperforms recent few-step methods across the board, with one-step generation exceeding their 8-step quality.
Conclusion: Discrete diffusion is not a necessary choice for generative modeling over discrete modalities. Flow-based language modeling offers a better balance of quality and speed and opens a path toward accelerated language modeling at scale.
Abstract: Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. In practice, however, they exhibit a sharp degradation of sample quality in the few-step regime, failing to realize this promise. Here we show that language models leveraging flow-based continuous denoising can outperform discrete diffusion in both quality and speed. By revisiting the fundamentals of flows over discrete modalities, we build a flow-based language model (FLM) that performs Euclidean denoising over one-hot token encodings. We show that the model can be trained by predicting the clean data via a cross entropy objective, where we introduce a simple time reparameterization that greatly improves training stability and generation quality. By distilling FLM into its associated flow map, we obtain a distilled flow map language model (FMLM) capable of few-step generation. On the LM1B and OWT language datasets, FLM attains generation quality matching state-of-the-art discrete diffusion models. With FMLM, our approach outperforms recent few-step language models across the board, with one-step generation exceeding their 8-step quality. Our work calls into question the widely held hypothesis that discrete diffusion processes are necessary for generative modeling over discrete modalities, and paves the way toward accelerated flow-based language modeling at scale. Code is available at https://github.com/david3684/flm.
[4] Claim Automation using Large Language Model
Zhengda Mo,Zhiyu Quan,Eli O'Donohue,Kaiwen Zhong
Main category: cs.CL
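TL;DR: This paper proposes a locally deployed, governance-aware language modeling component for insurance. Pretrained LLMs are fine-tuned with LoRA to generate structured corrective-action recommendations from unstructured warranty claim narratives, and a multi-dimensional evaluation shows the result outperforms general-purpose LLMs.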
Details
Motivation: Deployment of LLMs in regulated, data-sensitive domains such as insurance remains limited, creating a need for tailored solutions that balance performance, interpretability, and compliance.
Method: Using millions of historical warranty claims, the authors fine-tune pretrained LLMs with Low-Rank Adaptation (LoRA) to build a locally deployed, governance-aware language modeling component scoped to the initial decision module of the claim processing pipeline. Evaluation uses a multi-dimensional framework that combines automated semantic similarity metrics with human evaluation.
Result: The domain fine-tuned model produces recommendations nearly identical to the ground-truth corrective actions in about 80% of test cases, substantially outperforming commercial general-purpose LLMs and prompt-based approaches.
Conclusion: Domain-adaptive fine-tuning brings model output distributions closer to real operational data and shows promise as a reliable, governable building block in highly regulated settings such as insurance.
Abstract: While Large Language Models (LLMs) have achieved strong performance on general-purpose language tasks, their deployment in regulated and data-sensitive domains, including insurance, remains limited. Leveraging millions of historical warranty claims, we propose a locally deployed governance-aware language modeling component that generates structured corrective-action recommendations from unstructured claim narratives. We fine-tune pretrained LLMs using Low-Rank Adaptation (LoRA), scoping the model to an initial decision module within the claim processing pipeline to speed up claim adjusters' decisions. We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy. Our results show that domain-specific fine-tuning substantially outperforms commercial general-purpose and prompt-based LLMs, with approximately 80% of the evaluated cases achieving near-identical matches to ground-truth corrective actions. Overall, this study provides both theoretical and empirical evidence to prove that domain-adaptive fine-tuning can align model output distributions more closely with real-world operational data, demonstrating its promise as a reliable and governable building block for insurance applications.
[5] BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization
Ahmed Rafid,Rumman Adib,Fariya Ahmed,Ajwad Abrar,Mohammed Saidul Islam
Main category: cs.CL
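TL;DR: This paper presents BanglaSummEval, a reference-free, question-answering-based framework for evaluating factual consistency in Bangla summarization. A single multilingual instruction-tuned model handles question generation, question answering, answer extraction, and importance weighting, with BERTScore-Recall used for semantic answer comparison; experiments show strong correlation with human judgments.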
Details
Motivation: Most existing factual consistency metrics overlook Bangla, a low-resource language, and depend heavily on reference summaries; a reliable, interpretable, reference-free evaluation method for the language is missing.
Method: BanglaSummEval is a reference-free, QA-based evaluation framework: questions are generated automatically from the source document and the summary, and a single multilingual instruction-tuned model performs question answering, candidate answer extraction, and question importance weighting. Answers are compared with BERTScore-Recall to capture deeper semantic consistency.
Result: Validated on 300 human-written Bangla summaries from educational and medical domains, the framework correlates strongly with expert human ratings (Pearson r = 0.694, Spearman ρ = 0.763) and provides interpretable, step-wise diagnostics.
Conclusion: BanglaSummEval offers a practical, transparent, efficient, and well-correlated solution for factual consistency evaluation in low-resource languages, Bangla in particular, advancing summarization evaluation beyond English.
Abstract: Evaluating factual consistency is essential for reliable text summarization, particularly in high-stakes domains such as healthcare and news. However, most existing evaluation metrics overlook Bangla, a widely spoken yet under-resourced language, and often depend on reference summaries. We introduce BanglaSummEval, a reference-free, question-answering-based framework for evaluating factual consistency in Bangla summarization. The proposed method assesses both factual accuracy and content coverage through automatically generated questions and answers derived from the source document and the summary. A single multilingual instruction-tuned language model handles question generation, question answering, candidate answer extraction, and question importance weighting. This unified design reduces system complexity and computational cost. To capture semantic consistency beyond surface-level overlap, we use BERTScore-Recall for answer comparison. We validate BanglaSummEval on 300 human-written summaries from educational and medical domains, demonstrating strong correlation with expert human judgments (Pearson's $r = 0.694$, Spearman's $\rho = 0.763$). By providing interpretable, step-wise diagnostics alongside reliable evaluation scores, BanglaSummEval offers a practical and transparent solution for factual consistency evaluation in low-resource language settings.
[6] Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect
Minh Duc Bui,Manuel Mager,Peter Herbert Kann,Katharina von der Wense
Main category: cs.CL
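TL;DR: This paper presents the first NLP research on Meenzerisch, the dialect of Mainz. It builds the first NLP-ready digital dictionary (2,351 entries) and shows empirically that current LLMs perform very poorly at both definition generation and word generation for the dialect (accuracy below 10% in both), indicating an urgent need for more resources and research on German dialects.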
Details
Motivation: Meenzerisch is on the verge of dying out, and NLP could help preserve and revive it, yet no prior NLP research has addressed the dialect.
Method: The authors build a digital Meenzerisch dictionary from Schramm (1966) with 2,351 word pairs and evaluate mainstream LLMs on two tasks: (1) generating a Standard German definition for a dialect word, and (2) generating the dialect word for a given definition. They also test two improvement strategies: few-shot learning and rule extraction.
Result: Accuracy is very low for all models on both tasks (best: 6.27% and 1.51%, respectively). Few-shot learning and rule injection yield small gains, but accuracy stays below 10%.
Conclusion: Current LLMs cannot effectively handle the endangered German dialect Meenzerisch, which highlights the severe shortage of NLP resources for German dialects and the need for systematic resource building and dedicated research.
Abstract: Meenzerisch, the dialect spoken in the German city of Mainz, is also the traditional language of the Mainz carnival, a yearly celebration well known throughout Germany. However, Meenzerisch is on the verge of dying out, a fate it shares with many other German dialects. Natural language processing (NLP) has the potential to help with the preservation and revival efforts of languages and dialects. However, so far no NLP research has looked at Meenzerisch. This work presents the first research in the field of NLP that is explicitly focused on the dialect of Mainz. We introduce a digital dictionary, an NLP-ready dataset derived from an existing resource (Schramm, 1966), to support researchers in modeling and benchmarking the language. It contains 2,351 words in the dialect paired with their meanings described in Standard German. We then use this dataset to answer the following research questions: (1) Can state-of-the-art large language models (LLMs) generate definitions for dialect words? (2) Can LLMs generate words in Meenzerisch, given their definitions? Our experiments show that LLMs can do neither: the best model for definitions reaches only 6.27% accuracy and the best word generation model's accuracy is 1.51%. We then conduct two additional experiments in order to see if accuracy is improved by few-shot learning and by extracting rules from the training set, which are then passed to the LLM. While those approaches are able to improve the results, accuracy remains below 10%.
This highlights that additional resources and an intensification of research efforts focused on German dialects are desperately needed.
[7] ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders
Ofer Meshi,Krisztian Balog,Sally Goldman,Avi Caciularu,Guy Tennenholtz,Jihwan Jeong,Amir Globerson,Craig Boutilier
Main category: cs.CL
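TL;DR: This paper introduces the ConvApparel dataset and a comprehensive validation framework to address the "realism gap" in LLM-based user simulators. It uses dual-agent data collection and multi-faceted validation to assess simulator generalization, and finds that data-driven simulators perform better in counterfactual validation.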
Details
Motivation: LLM-based user simulators suffer from a "realism gap", so conversational systems optimized in simulation may perform poorly in the real world.
Method: The authors build the ConvApparel dataset, collecting human-AI conversations with a dual-agent protocol that uses both "good" and "bad" recommenders, plus first-person annotations of user satisfaction. They propose a validation framework combining statistical alignment, a human-likeness score, and counterfactual validation.
Result: Experiments reveal a significant realism gap for all user simulators. Data-driven simulators nevertheless outperform the prompted baseline in counterfactual validation, adapting more realistically to unseen behaviors.
Conclusion: Data-driven user simulators, though imperfect, embody more robust user models than prompted approaches; the dataset and validation framework offer a new route to narrowing the realism gap.
Abstract: The promise of LLM-based user simulators to improve conversational AI is hindered by a critical "realism gap," leading to systems that are optimized for simulated interactions, but may fail to perform well in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol, which uses both "good" and "bad" recommenders, enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.
[8] When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English
Hasan Can Biyik,Libby Barak,Jing Peng,Anna Feldman
Main category: cs.CL
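TL;DR: This paper studies transfer learning in cross-lingual euphemism detection and finds that semantic overlap does not always guarantee positive transfer. In the low-resource Turkish-to-English direction performance can even degrade, whereas training on non-overlapping euphemisms (NOPETs) sometimes improves results, an effect closely tied to differences in label distribution.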
Details
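Motivation: Euphemisms depend heavily on cultural and pragmatic context, which makes cross-lingual modeling hard. How well existing cross-lingual transfer works for euphemism detection is unclear, so the effect of equivalence on transfer needs investigation.
Method: Potentially Euphemistic Terms (PETs) in Turkish and English are divided into Overlapping (OPETs) and Non-Overlapping (NOPETs) subsets by their functional, pragmatic, and semantic alignment, followed by transfer experiments on multilingual models and category-level analysis.
Result: A transfer asymmetry is observed: semantic overlap is insufficient to guarantee positive transfer. In the Turkish-to-English direction, performance drops on OPETs while NOPET-based training improves it. Differences in label distribution are a key explanatory factor. Domain alignment may influence transfer, but the evidence is limited by data sparsity.
Conclusion: Transfer in cross-lingual euphemism detection cannot be predicted from semantic equivalence alone; pragmatic alignment, label distribution, and resource asymmetry must all be considered. NOPET-based training reveals a counterintuitive but effective transfer path.
Abstract: Euphemisms substitute socially sensitive expressions, often softening or reframing meaning, and their reliance on cultural and pragmatic context complicates modeling across languages. In this study, we investigate how cross-lingual equivalence influences transfer in multilingual euphemism detection. We categorize Potentially Euphemistic Terms (PETs) in Turkish and English into Overlapping (OPETs) and Non-Overlapping (NOPETs) subsets based on their functional, pragmatic, and semantic alignment. Our findings reveal a transfer asymmetry: semantic overlap is insufficient to guarantee positive transfer, particularly in the low-resource Turkish-to-English direction, where performance can degrade even for overlapping euphemisms, and in some cases, improve under NOPET-based training. Differences in label distribution help explain these counterintuitive results. Category-level analysis suggests that transfer may be influenced by domain-specific alignment, though evidence is limited by sparsity.
[9] Eigenmood Space: Uncertainty-Aware Spectral Graph Analysis of Psychological Patterns in Classical Persian Poetry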
Kourosh Shahnazari,Seyed Moein Ayyoubzadeh,Mohammadali Keshtparvar
Main category: cs.CL
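TL;DR: This paper proposes an uncertainty-aware computational framework for poet-level psychological analysis of classical Persian poetry, combining automatic multi-label annotation, confidence-weighted aggregation, and graph embedding to balance scalability with the caution that humanistic interpretation requires.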
Details
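Motivation: Classical Persian poetry expresses emotion through metaphor, intertextuality, and rhetorical indirection, which makes close reading necessary but hinders reproducible comparison at scale. A new analytical framework is needed that combines humanistic sensitivity with computational scalability.
Method: Large-scale automatic multi-label annotation assigns each verse psychological concepts, per-label confidence scores, and an abstention flag (insufficient evidence). A Poet × Concept matrix is built, and individuality is quantified with Jensen-Shannon and Kullback-Leibler divergence. A confidence-weighted concept co-occurrence graph is constructed, and an Eigenmood embedding is defined through Laplacian spectral decomposition. This is supplemented by threshold sensitivity analysis, selection-bias diagnostics, and a distant-to-close reading workflow.
Result: On a corpus of 61,573 verses by 10 poets, 22.2% of verses are marked as abstained, which confirms the importance of uncertainty modeling for robust analysis. The Eigenmood embedding supports retrieval of verse exemplars along mood axes, and the framework as a whole enables auditable, interpretable digital-humanities analysis.
Conclusion: The framework propagates uncertainty from verse-level evidence to poet-level inference, preserving the caution required for humanistic interpretation while remaining computationally scalable, and offers a new approach to psychological-stylistic analysis of classical poetry.
Abstract: Classical Persian poetry is a historically sustained archive in which affective life is expressed through metaphor, intertextual convention, and rhetorical indirection. These properties make close reading indispensable while limiting reproducible comparison at scale. We present an uncertainty-aware computational framework for poet-level psychological analysis based on large-scale automatic multi-label annotation. Each verse is associated with a set of psychological concepts, per-label confidence scores, and an abstention flag that signals insufficient evidence. We aggregate confidence-weighted evidence into a Poet $\times$ Concept matrix, interpret each poet as a probability distribution over concepts, and quantify poetic individuality as divergence from a corpus baseline using Jensen-Shannon divergence and Kullback-Leibler divergence. To capture relational structure beyond marginals, we build a confidence-weighted co-occurrence graph over concepts and define an Eigenmood embedding through Laplacian spectral decomposition. On a corpus of 61,573 verses across 10 poets, 22.2% of verses are abstained, underscoring the analytical importance of uncertainty. We further report sensitivity analysis under confidence thresholding, selection-bias diagnostics that treat abstention as a category, and a distant-to-close workflow that retrieves verse-level exemplars along Eigenmood axes.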
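The pipeline described in this abstract (confidence-weighted Poet × Concept matrix, divergence from a corpus baseline, Laplacian spectral embedding of a concept co-occurrence graph) can be sketched on toy data as follows. This is a hedged illustration only: the concept labels, the toy verses, the choice of multiplying confidences for edge weights, and the use of the unnormalized Laplacian are all assumptions made here, not details confirmed by the paper.

```python
import itertools

import numpy as np

CONCEPTS = ["longing", "joy", "grief", "awe"]  # hypothetical concept labels

# (poet, {concept: confidence}, abstained): toy annotations, not real data
VERSES = [
    ("poet_a", {"longing": 0.9, "grief": 0.6}, False),
    ("poet_a", {"longing": 0.7, "awe": 0.4}, False),
    ("poet_a", {}, True),  # abstained verse contributes no evidence
    ("poet_b", {"joy": 0.8, "awe": 0.7}, False),
    ("poet_b", {"joy": 0.5, "longing": 0.3}, False),
]


def poet_concept_matrix(verses, concepts):
    """Sum per-label confidences into a Poet x Concept matrix, skipping abstentions."""
    poets = sorted({poet for poet, _, _ in verses})
    col = {c: j for j, c in enumerate(concepts)}
    mat = np.zeros((len(poets), len(concepts)))
    for poet, labels, abstained in verses:
        if abstained:
            continue
        for concept, conf in labels.items():
            mat[poets.index(poet), col[concept]] += conf
    return poets, mat


def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (bounded between 0 and 1)."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def eigenmood_embedding(verses, concepts, dims=2):
    """Spectral embedding of concepts from the unnormalized graph Laplacian L = D - W."""
    col = {c: j for j, c in enumerate(concepts)}
    w = np.zeros((len(concepts), len(concepts)))
    for _, labels, abstained in verses:
        if abstained:
            continue
        # Assumed edge weight: product of the two labels' confidences.
        for (a, conf_a), (b, conf_b) in itertools.combinations(labels.items(), 2):
            w[col[a], col[b]] += conf_a * conf_b
            w[col[b], col[a]] += conf_a * conf_b
    laplacian = np.diag(w.sum(axis=1)) - w
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    return eigvals, eigvecs[:, 1:1 + dims]        # drop the trivial constant eigenvector


poets, mat = poet_concept_matrix(VERSES, CONCEPTS)
dist = mat / mat.sum(axis=1, keepdims=True)   # each poet as a distribution over concepts
baseline = mat.sum(axis=0) / mat.sum()        # corpus-level baseline distribution
individuality = {p: js_divergence(dist[i], baseline) for i, p in enumerate(poets)}
eigvals, embedding = eigenmood_embedding(VERSES, CONCEPTS)

for poet, score in individuality.items():
    print(f"{poet}: JS divergence from baseline = {score:.3f}")
```

Skipping abstained verses before aggregation is one simple way to keep low-evidence annotations out of the poet-level distributions; the abstract also mentions treating abstention as its own category for bias diagnostics, which this sketch does not attempt.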
The resulting framework supports scalable, auditable digital-humanities analysis while preserving interpretive caution by propagating uncertainty from verse-level evidence to poet-level inference.
[10] Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History
Serin Kim,Sangam Lee,Dongha Lee
Main category: cs.CL
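TL;DR: This paper presents Persona2Web, the first benchmark for evaluating personalized web agents on the real open web. Built on the clarify-to-personalize principle, it requires agents to resolve query ambiguity by inferring preferences implicitly from user history, and comprises user histories, ambiguous queries, and a reasoning-aware evaluation framework.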
Details
Motivation: Current web agents lack personalization and struggle to interpret ambiguous queries from implicit user preferences and context. A benchmark is needed that evaluates personalization on the real open web.
Method: Persona2Web comprises three parts: (1) user histories that implicitly reflect preferences over long time spans; (2) ambiguous queries whose resolution requires inferring preferences from that history; and (3) a reasoning-aware evaluation framework for fine-grained assessment of personalization. Multi-dimensional ablation experiments are also conducted.
Result: Systematic experiments across agent architectures, backbone models, history access schemes, and query ambiguity levels reveal key challenges in personalized web agent behavior.
Conclusion: Persona2Web provides the first reproducible, extensible real-world benchmark for personalized web agents and advances research on ambiguity resolution and personalized reasoning from user history.
Abstract: Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every detail of their intent, practical web agents must be able to interpret ambiguous queries by inferring user preferences and contexts. To address this challenge, we present Persona2Web, the first benchmark for evaluating personalized web agents on the real open web, built upon the clarify-to-personalize principle, which requires agents to resolve ambiguity based on user history rather than relying on explicit instructions. Persona2Web consists of: (1) user histories that reveal preferences implicitly over long time spans, (2) ambiguous queries that require agents to infer implicit user preferences, and (3) a reasoning-aware evaluation framework that enables fine-grained assessment of personalization. We conduct extensive experiments across various agent architectures, backbone models, history access schemes, and queries with varying ambiguity levels, revealing key challenges in personalized web agent behavior. For reproducibility, our codes and datasets are publicly available at https://anonymous.4open.science/r/Persona2Web-73E8.
[11] ReIn: Conversational Error Recovery with Reasoning Inception
Takyoung Kim,Jinseok Nam,Chandrayee Basu,Xing Fan,Chengyuan Ma,Heng Ji,Gokhan Tur,Dilek Hakkani-Tür
Main category: cs.CL
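TL;DR: This paper proposes Reasoning Inception (ReIn), a test-time intervention that improves a conversational agent's ability to recover from user-induced errors without modifying model parameters or system prompts.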
Details
Motivation: LLM-based conversational agents perform well on fixed task datasets but remain brittle when faced with unanticipated, user-induced errors. This work focuses on error recovery rather than prevention, under the practical constraint that the model cannot be fine-tuned and prompts cannot be modified.
Method: In ReIn, an external inception module identifies predefined errors in the dialogue context and generates a recovery plan, which is then injected into the agent's internal reasoning process to guide corrective actions, all without modifying model parameters or system prompts.
Result: In simulated failure scenarios such as ambiguous and unsupported user requests, ReIn substantially improves task success, generalizes to unseen error types, and consistently outperforms explicit prompt-modification approaches. Jointly defining recovery tools with ReIn can strengthen agent robustness safely and effectively.
Conclusion: ReIn is an efficient, plug-and-play error recovery mechanism that improves the resilience of conversational agents in real interactions without changes to the backbone model or system prompts.
Abstract: Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors. Rather than focusing on error prevention, this work focuses on error recovery, which necessitates the accurate diagnosis of erroneous dialogue contexts and execution of proper recovery plans. Under realistic constraints precluding model fine-tuning or prompt modification due to significant cost and time requirements, we explore whether agents can recover from contextually flawed interactions and how their behavior can be adapted without altering model parameters and prompts. To this end, we propose Reasoning Inception (ReIn), a test-time intervention method that plants an initial reasoning into the agent's decision-making process. Specifically, an external inception module identifies predefined errors within the dialogue context and generates recovery plans, which are subsequently integrated into the agent's internal reasoning process to guide corrective actions, without modifying its parameters or system prompts. We evaluate ReIn by systematically simulating conversational failure scenarios that directly hinder successful completion of user goals: ambiguous and unsupported user requests. Across diverse combinations of agent models and inception modules, ReIn substantially improves task success and generalizes to unseen error types. Moreover, it consistently outperforms explicit prompt-modification approaches, underscoring its utility as an efficient, on-the-fly method.
In-depth analysis of its operational mechanism, particularly in relation to instruction hierarchy, indicates that jointly defining recovery tools with ReIn can serve as a safe and effective strategy for improving the resilience of conversational agents without modifying the backbone models or system prompts.
[12] Large Language Models Persuade Without Planning Theory of Mind
Jared Moore,Rasmus Overmark,Ned Cooper,Beba Cibralic,Nick Haber,Cameron R. Jones
Main category: cs.CL
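TL;DR: This paper proposes a new theory of mind (ToM) task in which an agent must persuade a target to choose a policy proposal by strategically revealing information, emphasizing the importance of first-person interaction for ToM evaluation. LLMs excel when the target's knowledge and motivational states are given but do poorly when they must be actively inferred, whereas humans are better at multi-step mental-state reasoning. In persuasion of real humans, however, LLMs outperform humans, suggesting they may rely on non-ToM rhetorical strategies.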
Details
Motivation: Existing ToM evaluations rely mostly on static question answering and neglect the first-person interaction that theory emphasizes. A dynamic persuasion task closer to real social cognition is needed to test ToM ability.
Method: Three experiments: (1) LLMs and humans persuade a rational bot whose knowledge and motivational states are either Revealed or Hidden; (2) a human role-plays the target; (3) real belief change in human targets is measured. Persuasion success and strategy characteristics are evaluated.
Result: Experiment 1: LLMs excel in the Revealed condition but fall below chance in the Hidden condition, while humans perform robustly. Experiments 2 and 3: LLMs outperform human persuaders across the board in persuading real humans.
Conclusion: LLMs do not necessarily possess human-like ToM; their persuasive advantage may stem from pattern matching and rhetorical strategies. Although this ability is not equivalent to ToM, it has significant real-world influence, which calls for caution in interpreting LLMs' apparent "mind" and attention to the risks of their social impact.
Abstract: A growing body of work attempts to evaluate the theory of mind (ToM) abilities of humans and large language models (LLMs) using static, non-interactive question-and-answer benchmarks. However, theoretical work in the field suggests that first-personal interaction is a crucial part of ToM and that such predictive, spectatorial tasks may fail to evaluate it. We address this gap with a novel ToM task that requires an agent to persuade a target to choose one of three policy proposals by strategically revealing information. Success depends on a persuader's sensitivity to a given target's knowledge states (what the target knows about the policies) and motivational states (how much the target values different outcomes). We varied whether these states were Revealed to persuaders or Hidden, in which case persuaders had to inquire about or infer them. In Experiment 1, participants persuaded a bot programmed to make only rational inferences. LLMs excelled in the Revealed condition but performed below chance in the Hidden condition, suggesting difficulty with the multi-step planning required to elicit and use mental state information. Humans performed moderately well in both conditions, indicating an ability to engage such planning. In Experiment 2, where a human target role-played the bot, and in Experiment 3, where we measured whether human targets' real beliefs changed, LLMs outperformed human persuaders across all conditions.
These results suggest that effective persuasion can occur without explicit ToM reasoning (e.g., through rhetorical strategies) and that LLMs excel at this form of persuasion. Overall, our results caution against attributing human-like ToM to LLMs while highlighting LLMs' potential to influence people's beliefs and behavior.
[13] Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data
Deepak Uniyal,Md Abul Bashar,Richi Nayak
Main category: cs.CL
TL;DR: This paper compares four cross-lingual text classification approaches—translation-based (source-to-target and target-to-source), direct multilingual model application, and a hybrid method—to filter hydrogen-related tweets from noisy multilingual social media data (English, Japanese, Hindi, Korean), followed by topic modeling to uncover dominant themes.
Details
Motivation: Analysing large-scale, multilingual social media discourse, especially public debates on global topics like hydrogen energy, is challenging due to noise, language diversity, and lack of labeled data in non-English languages.
Method: Four cross-lingual classification strategies are evaluated: (1) translating English annotations to build language-specific models; (2) translating all raw tweets to English for a single English-based model; (3) directly applying English-fine-tuned multilingual transformers (e.g., XLM-R) to each language; (4) a hybrid approach combining translated annotations with multilingual training. Performance is measured on filtering hydrogen-related tweets from noisy keyword-collected data, followed by topic modeling on filtered subsets.
Result: Each approach shows distinct trade-offs in precision, recall, and computational efficiency across languages; the hybrid strategy generally achieves the best balance, while direct multilingual application works well for high-resource languages but degrades for low-resource ones like Hindi. Topic modeling reveals evolving public concerns (e.g., policy, environment, infrastructure) over time and across regions.
Conclusion: No single cross-lingual method universally outperforms others; optimal pipeline design depends on language resource availability, annotation effort, and deployment constraints. Hybrid methods offer robustness, especially for low-resource languages, and should be prioritized in real-world multilingual social media analysis.
Abstract: Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span across diverse languages. This study investigates how different approaches for cross-lingual text classification can support reliable analysis of global conversations.
Using hydrogen energy as a case study, we analyse a decade-long dataset of over nine million tweets in English, Japanese, Hindi, and Korean (2013-2022) for topic discovery. The online keyword-driven data collection results in a significant amount of irrelevant content. We explore four approaches to filter relevant content: (1) translating English annotated data into target languages for building language-specific models for each target language, (2) translating unlabelled data appearing from all languages into English for creating a single model based on English annotations, (3) applying English fine-tuned multilingual transformers directly to each target language data, and (4) a hybrid strategy that combines translated annotations with multilingual training. Each approach is evaluated for its ability to filter hydrogen-related tweets from noisy keyword-based collections. Subsequently, topic modeling is performed to extract dominant themes within the relevant subsets. The results highlight key trade-offs between translation and multilingual approaches, offering actionable insights into optimising cross-lingual pipelines for large-scale social media analysis.
[14] ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning
Hussein S. Al-Olimat,Ahmad Alshareef
Main category: cs.CL
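TL;DR: This paper introduces ALPS, a diagnostic challenge set carefully built by Arabic linguistics experts to assess models' understanding at the level of deep semantics and pragmatics; it reveals marked weaknesses of current large models on Arabic morpho-syntactic dependencies.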
Details
Motivation: Existing Arabic NLP benchmarks rely heavily on synthetic or translated data and lack deeper linguistic verification. A native, expert-annotated, culturally authentic dataset is needed to complement large-scale benchmarks.
Method: The authors build ALPS, a native Arabic diagnostic dataset of 531 questions across 15 tasks and 47 subtasks, and systematically evaluate 23 commercial, open-source, and Arabic-specific models against single-pass human performance (average 84.6%) and an expert-adjudicated upper bound (99.2%).
Result: Models are highly fluent overall but show a 36.5% error rate on morpho-syntactic dependencies (across diacritics-reliant tasks). Gemini-3-flash reaches 94.2%, above the single-human average; the best Arabic-specific model, Jais-2-70B, reaches 83.6%, still below human performance.
Conclusion: ALPS exposes key weaknesses of current large models in deep Arabic linguistic understanding and underlines the need to balance scale with linguistic depth, pushing Arabic NLP in a more authentic and robust direction.
Abstract: While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data which may benefit from deeper linguistic verification. We introduce ALPS (Arabic Linguistic & Pragmatic Suite), a native, expert-curated diagnostic challenge set probing Deep Semantics and Pragmatics, capabilities that complement specialized large-scale benchmarks. While broad-coverage benchmarks prioritize scale and multi-task coverage, ALPS targets the depth of linguistic understanding through 531 rigorously crafted questions across 15 tasks and 47 subtasks. We developed the dataset with deep expertise in Arabic linguistics, guaranteeing cultural authenticity and eliminating translation artifacts. Evaluating 23 diverse models (commercial, open-source, and Arabic-native) against a single-pass human performance (avg. 84.6% accuracy) and an expert-adjudicated oracle (99.2%), we reveal a critical dissociation: models achieve high fluency but fail on fundamental morpho-syntactic dependencies, with elevated error rates on morpho-syntactic dependencies (36.5% across diacritics-reliant tasks) compared to compositional semantics. While top commercial models (Gemini-3-flash at 94.2%) surpass the average single human, a substantial gap persists between commercial giants and Arabic-native models, with the best Arabic-specific model (Jais-2-70B at 83.6%) approaching but not matching human performance.
[15] BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios
Yunseung Lee,Subin Kim,Youngjun Kwak,Jaegul Choo
Main category: cs.CL
TL;DR: 本文提出BankMathBench,一个面向银行业务场景的数值推理基准数据集,涵盖基础、中级和高级三类任务,显著提升开源大模型在银行计算任务上的准确率。
Details
Motivation: Existing LLMs show low accuracy on core banking computations (such as principal-and-interest calculation, product comparison, and early repayment), and no evaluation benchmark reflects real banking scenarios.
Method: The authors build BankMathBench, a domain-specific dataset with three difficulty levels, basic (single product), intermediate (multi-product comparison), and advanced (multi-condition), and train and evaluate open-source LLMs with tool-augmented fine-tuning.
Result: After tool-augmented fine-tuning, accuracy on basic, intermediate, and advanced tasks rises by 57.6, 75.1, and 62.9 percentage points, respectively, well above zero-shot baselines.
Conclusion: BankMathBench is an effective and reliable benchmark for evaluating and improving LLMs' numerical reasoning in real banking scenarios.
Abstract: Large language model (LLM)-based chatbots are increasingly being adopted in the financial domain, particularly in digital banking, to handle customer inquiries about products such as deposits, savings, and loans. However, these models still exhibit low accuracy in core banking computations, including total payout estimation, comparison of products with varying interest rates, and interest calculation under early repayment conditions. Such tasks require multi-step numerical reasoning and contextual understanding of banking products, yet existing LLMs often make systematic errors: misinterpreting product types, applying conditions incorrectly, or failing basic calculations involving exponents and geometric progressions. However, such errors have rarely been captured by existing benchmarks. Mathematical datasets focus on fundamental math problems, whereas financial benchmarks primarily target financial documents, leaving everyday banking scenarios underexplored. To address this limitation, we propose BankMathBench, a domain-specific dataset that reflects realistic banking tasks. BankMathBench is organized in three levels of difficulty (basic, intermediate, and advanced), corresponding to single-product reasoning, multi-product comparison, and multi-condition scenarios, respectively. When trained on BankMathBench, open-source LLMs exhibited notable improvements in both formula generation and numerical reasoning accuracy, demonstrating the dataset's effectiveness in enhancing domain-specific reasoning.
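For a sense of the "exponents and geometric progressions" this abstract refers to, the following Python sketch computes two textbook payouts: a lump-sum deposit with periodic compounding and an installment savings plan summed as a geometric series. The function names, the rates, and the product terms are hypothetical; real banking products (and the benchmark's own tasks) may use simple interest, taxes, or other conditions not modelled here.

```python
def deposit_payout(principal: float, annual_rate: float, years: float,
                   compounds_per_year: int = 12) -> float:
    """Maturity value of a lump-sum deposit: P * (1 + r/m) ** (m * t)."""
    periodic_rate = annual_rate / compounds_per_year
    return principal * (1 + periodic_rate) ** (compounds_per_year * years)


def installment_payout(monthly_payment: float, annual_rate: float, months: int) -> float:
    """Maturity value of equal payments made at the start of each month.

    The payment made in month k compounds for (months - k + 1) months, so the
    total is a geometric series: P * (1 + r) * ((1 + r) ** n - 1) / r.
    """
    r = annual_rate / 12
    if r == 0:
        return monthly_payment * months
    return monthly_payment * (1 + r) * ((1 + r) ** months - 1) / r


if __name__ == "__main__":
    # Hypothetical multi-product comparison: same total contribution of 12,000.
    lump_sum = deposit_payout(12_000, annual_rate=0.03, years=1)
    installments = installment_payout(1_000, annual_rate=0.04, months=12)
    better = "deposit" if lump_sum > installments else "installment plan"
    print(f"deposit: {lump_sum:.2f}, installments: {installments:.2f}, better: {better}")
```

Comparing the two products needs both formulas even though the contributions are equal, because the installment money is invested for less time on average; this is the kind of multi-step reasoning the intermediate level is described as targeting.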
With tool-augmented fine-tuning, the models achieved average accuracy increases of 57.6%p (basic), 75.1%p (intermediate), and 62.9%p (advanced), representing significant gains over zero-shot baselines. These findings highlight BankMathBench as a reliable benchmark for evaluating and advancing LLMs' numerical reasoning in real-world banking scenarios.[16] Projective Psychological Assessment of Large Multimodal Models Using Thematic Apperception Tests
Anton Dzega,Aviad Elyashar,Ortal Slobodin,Odeya Cohen,Rami Puzis
Main category: cs.CL
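TL;DR: This study uses Thematic Apperception Test (TAT) images and the SCORS-G scale to assess the "personality traits" of large multimodal models (LMMs) in a non-language modality. It finds that they understand interpersonal relationships and self-concept well but generally fail to perceive and regulate aggression, and that larger and more recently released models perform better.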
Details
Motivation: Explores whether large multimodal models (LMMs) exhibit personality-like traits measurable with psychological scales, focusing on social cognition in non-language modalities (e.g., responses to images). Method: Uses TAT images as stimuli; LMMs acting as subject models (SMs) generate stories, and a second group of LMMs acting as evaluator models (EMs) score them with the SCORS-G scale; the study compares model ratings with those of human experts and analyzes systematic differences across model families on each dimension. Result: EMs understand and analyze TAT responses very well, and their ratings agree closely with human experts; all models are good at understanding interpersonal dynamics and self-concept but consistently fail at perceiving and regulating aggression; performance improves systematically with parameter scale and later release date. Conclusion: LMMs show a partially personality-like social-cognitive structure but have key deficits in affect regulation (e.g., handling aggression), suggesting that current AI still differs fundamentally from humans in deep affective modeling; SCORS-G can serve as an effective tool for assessing the psychological fidelity of AI. Abstract: Thematic Apperception Test (TAT) is a psychometrically grounded, multidimensional assessment framework that systematically differentiates between cognitive-representational and affective-relational components of personality-like functioning. This test is a projective psychological framework designed to uncover unconscious aspects of personality. This study examines whether the personality traits of Large Multimodal Models (LMMs) can be assessed through non-language-based modalities, using the Social Cognition and Object Relations Scale - Global (SCORS-G). LMMs are employed in two distinct roles: as subject models (SMs), which generate stories in response to TAT images, and as evaluator models (EMs), which assess these narratives using the SCORS-G framework. Evaluators demonstrated an excellent ability to understand and analyze TAT responses. Their interpretations are highly consistent with those of human experts. Assessment results highlight that all models understand interpersonal dynamics very well and have a good grasp of the concept of self. However, they consistently fail to perceive and regulate aggression. Performance varied systematically across model families, with larger and more recent models consistently outperforming smaller and earlier ones across SCORS-G dimensions.
[17] The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI
Dusan Bosnjakovic
Main category: cs.CL
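TL;DR: This paper proposes an auditing framework grounded in psychometric measurement theory to quantify stable behavioral tendencies in large language models (such as Optimization Bias, Sycophancy, and Status-Quo Legitimization) without relying on ground-truth labels. It finds significant clustering driven by a "lab signal", suggesting that latent biases may form recursive ideological echo chambers in multi-layered AI architectures.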
Details
Motivation: As LLMs shift from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops, detecting durable, provider-level behavioral signatures becomes a key requirement for safety and governance; traditional benchmarks measure only transient task accuracy and cannot capture the stable latent response policies embedded during training and alignment. Method: Applies latent trait estimation from psychometrics (handling ordinal uncertainty), designs forced-choice ordinal vignettes masked by semantically orthogonal decoys, and enforces cryptographic permutation-invariance to keep the audit fair; analyzes the response patterns of nine leading models across several dimensions using Mixed Linear Models (MixedLM) and the Intraclass Correlation Coefficient (ICC). Result: Item-level framing drives high variance, yet a significant "lab signal" produces behavioral clustering; in "locked-in" provider ecosystems, latent biases are not merely static errors but compounding variables that can lead to recursive ideological echo chambers in multi-layered AI architectures. Conclusion: The framework identifies stable behavioral tendencies in LLMs without labeled data, reveals the structural and recursive risks of latent bias in provider ecosystems, and offers a new auditing paradigm for AI safety and governance. Abstract: As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral signatures becomes a critical requirement for safety and governance. Traditional benchmarks measure transient task accuracy but fail to capture stable, latent response policies, the "prevailing mindsets" embedded during training and alignment that outlive individual model versions. This paper introduces a novel auditing framework that utilizes psychometric measurement theory (specifically latent trait estimation under ordinal uncertainty) to quantify these tendencies without relying on ground-truth labels. Utilizing forced-choice ordinal vignettes masked by semantically orthogonal decoys and governed by cryptographic permutation-invariance, the research audits nine leading models across dimensions including Optimization Bias, Sycophancy, and Status-Quo Legitimization. Using Mixed Linear Models (MixedLM) and Intraclass Correlation Coefficient (ICC) analysis, the research identifies that while item-level framing drives high variance, a persistent "lab signal" accounts for significant behavioral clustering. These findings demonstrate that in "locked-in" provider ecosystems, latent biases are not merely static errors but compounding variables that risk creating recursive ideological echo chambers in multi-layered AI architectures.
[18] What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform
Adrian Cosma,Cosmin Dumitrache,Emilian Radoi
Main category: cs.CL
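TL;DR: This paper analyzes patient satisfaction signals in Romanian text-based telemedicine. Using 77,334 patient question-doctor response pairs, it builds a binary classifier to predict patient feedback (thumbs-up as the positive class) and finds that history features dominate prediction, while linguistic features of the response text (such as politeness and hedging) have a smaller but actionable effect and are consistently associated with feedback.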
Details
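Motivation: Text-based telemedicine is increasingly common, and clinicians must convey medical advice clearly and effectively in writing; platforms also depend on patient ratings, which reflect communication quality more than clinical accuracy, so the communication factors behind patient satisfaction need to be understood. Method: Using 77,334 anonymised patient-doctor question-answer pairs, patient thumbs-up feedback is treated as the positive class and everything else as negative; the study extracts language-agnostic features (length, structure, readability), Romanian LIWC psycholinguistic features, and politeness/hedging markers, trains a classifier with a time-based split, and performs SHAP-based interpretability and subgroup correlation analyses. Result: Patient and clinician history features are the strongest predictors of feedback; response-text features contribute less but are actionable; politeness and hedging are consistently positively associated with positive feedback, whereas lexical diversity is negatively associated. Conclusion: In text-based telemedicine, beyond history factors, the linguistic style of the doctor's written response (especially politeness and hedging) is a key controllable factor for patient satisfaction, suggesting that communication training or assistive tools could improve written responses. Abstract: Text-based telemedicine has become a common mode of care, requiring clinicians to deliver medical advice clearly and effectively in writing. As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain satisfaction scores, even though these evaluations often reflect communication quality more than clinical accuracy. We analyse patient satisfaction signals in Romanian text-based telemedicine. Using a sample of 77,334 anonymised patient question-doctor response pairs, we model feedback as a binary outcome, treating thumbs-up responses as positive and grouping negative or absent feedback into the other class. We extract interpretable, predominantly language-agnostic features (e.g., length, structural characteristics, readability proxies), along with Romanian LIWC psycholinguistic features and politeness/hedging markers where available. We train a classifier with a time-based split and perform SHAP-based analyses, which indicate that patient and clinician history features dominate prediction, functioning as strong priors, while characteristics of the response text provide a smaller but, crucially, actionable signal. In subgroup correlation analyses, politeness and hedging are consistently positively associated with patient feedback, whereas lexical diversity shows a negative association.
[19] Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study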
Kensuke Okada,Yui Furukawa,Kyosuke Bunji
Main category: cs.CL
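TL;DR: This paper proposes a psychometric framework to quantify and mitigate the bias caused by socially desirable responding (SDR) when LLMs are evaluated with self-report questionnaires. SDR is quantified by comparing IRT latent scores under honest and fake-good instructions, and a desirability-matched graded forced-choice (GFC) inventory is designed to mitigate it. Experiments show that GFC substantially reduces SDR while largely preserving recovery of the target persona profile.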
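A minimal sketch of a direction-corrected standardized effect size of the kind described here. It assumes latent scores have already been estimated (the paper uses IRT for that step, which is not reproduced), and it uses a pooled-standard-deviation Cohen's d as a stand-in; the paper's exact formula may differ. The function name and the `higher_is_desirable` flag are hypothetical.

```python
from statistics import mean, stdev


def sdr_effect_size(honest, fake_good, higher_is_desirable=True):
    """Standardized mean shift from HONEST to FAKE-GOOD scores.

    The sign is flipped for traits where a lower score is the socially
    desirable direction, so a positive value always means "shifted toward
    the desirable pole".
    """
    n1, n2 = len(honest), len(fake_good)
    pooled_var = ((n1 - 1) * stdev(honest) ** 2
                  + (n2 - 1) * stdev(fake_good) ** 2) / (n1 + n2 - 2)
    d = (mean(fake_good) - mean(honest)) / pooled_var ** 0.5
    return d if higher_is_desirable else -d
```

Because the result is unitless, values for different traits or response formats (Likert versus GFC) can be put side by side.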
Details
Motivation: Existing LLM evaluations based on human self-report questionnaires assume that models answer honestly, but LLMs tend to give socially desirable answers, which biases the results; this socially desirable responding (SDR) needs to be identified and mitigated. Method: Proposes a two-part psychometric framework: (1) quantifying SDR: the same inventory is administered under "honest" and "fake-good" instructions, latent scores are estimated with item response theory (IRT), and a direction-corrected standardized effect size serves as the SDR index; (2) mitigating SDR: a graded forced-choice (GFC) Big Five inventory of 30 cross-domain, desirability-matched pairs is built by selecting items from an item pool via constrained optimization. Result: Across nine instruction-tuned LLMs tested on synthetic personas with known target profiles, Likert-style questionnaires consistently show large SDR, whereas the GFC inventory substantially attenuates SDR while still recovering the target persona profile fairly accurately; the results reveal a model-dependent trade-off between SDR suppression and profile recovery. Conclusion: SDR is a systematic bias that cannot be ignored in questionnaire-based LLM evaluation; SDR-aware methods (such as GFC designs and dual-instruction evaluation) should be used for benchmarking and auditing, and reports should explicitly disclose the impact of SDR. Abstract: Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments presume honest responding; in evaluative contexts, LLMs can instead gravitate toward socially preferred answers, a form of socially desirable responding (SDR), biasing questionnaire-derived scores and downstream conclusions. We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs. To quantify SDR, the same inventory is administered under HONEST versus FAKE-GOOD instructions, and SDR is computed as a direction-corrected standardized effect size from item response theory (IRT)-estimated latent scores. This enables comparisons across constructs and response formats, as well as against human instructed-faking benchmarks. For mitigation, we construct a graded forced-choice (GFC) Big Five inventory by selecting 30 cross-domain pairs from an item pool via constrained optimization to match desirability. Across nine instruction-tuned LLMs evaluated on synthetic personas with known target profiles, Likert-style questionnaires show consistently large SDR, whereas desirability-matched GFC substantially attenuates SDR while largely preserving the recovery of the intended persona profiles. These results highlight a model-dependent SDR-recovery trade-off and motivate SDR-aware reporting practices for questionnaire-based benchmarking and auditing of LLMs.
[20] Towards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective
Yukun Chen,Xinyu Zhang,Jialong Tang,Yu Wan,Baosong Yang,Yiming Li,Zhan Qin,Kui Ren
Main category: cs.CL
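TL;DR: This paper proposes X-Value, a cross-lingual values assessment benchmark for evaluating how well large language models (LLMs) understand the deep-level values of digital content, and finds that current SOTA models perform poorly on this task with significant cross-lingual disparities.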
Details
Motivation: Existing content safety evaluation focuses mainly on explicit harms (e.g., violence, hate speech) and overlooks the subtler value dimensions of digital content; a benchmark is needed that assesses LLMs' value judgments from a global perspective. Method: Builds the X-Value benchmark with more than 5,000 QA pairs in 18 languages, covering 7 core domains grounded in Schwartz's Theory of Basic Human Values and split into easy and hard levels; proposes a two-stage annotation framework that first decides whether an issue belongs to global consensus or value pluralism and then runs a multi-party evaluation of the latent values in the content. Result: Systematic evaluation on X-Value shows that current SOTA LLMs score below 77% accuracy on cross-lingual values assessment, with accuracy gaps of more than 20% between languages. Conclusion: Current LLMs are clearly deficient in fine-grained, values-aware content assessment, and their understanding and judgment across languages and value contexts urgently need improvement. Abstract: While large language models (LLMs) have become pivotal to content safety, current evaluation paradigms primarily focus on detecting explicit harms (e.g., violence or hate speech), neglecting the subtler value dimensions conveyed in digital content. To bridge this gap, we introduce X-Value, a novel Cross-lingual Values Assessment Benchmark designed to evaluate LLMs' ability to assess deep-level values of content from a global perspective. X-Value consists of more than 5,000 QA pairs across 18 languages, systematically organized into 7 core domains grounded in Schwartz's Theory of Basic Human Values and categorized into easy and hard levels for discriminative evaluation. We further propose a unique two-stage annotation framework that first identifies whether an issue falls under global consensus (e.g., human rights) or pluralism (e.g., religion), and subsequently conducts a multi-party evaluation of the latent values embedded within the content. Systematic evaluations on X-Value reveal that current SOTA LLMs exhibit deficiencies in cross-lingual values assessment (Acc < 77%), with significant performance disparities across different languages (ΔAcc > 20%). This work highlights the urgent need to improve the nuanced, values-aware content assessment capability of LLMs. Our X-Value is available at: https://huggingface.co/datasets/Whitolf/X-Value.
[21] Representation Collapse in Machine Translation Through the Lens of Angular Dispersion
Evgeniia Tokarchuk,Maya K. Nachesa,Sergey Troshin,Vlad Niculae
Main category: cs.CL
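TL;DR: This paper analyzes representation collapse in Transformer-based neural machine translation caused by the standard next-token prediction training strategy, especially in deeper layers and continuous-output settings. An angular dispersion regularizer mitigates the problem and improves translation quality, and it remains effective in quantized models.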
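One simple way to see what collapse means geometrically is to look at how similar the directions of a set of representation vectors are. The sketch below uses mean pairwise cosine similarity as an illustrative proxy: values near 1 mean the vectors point the same way (collapsed), lower values mean they are spread out. This is an assumption chosen for illustration, not the regularizer or the dispersion measure used in the paper.

```python
import math


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of vectors."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    units = [unit(v) for v in vectors]
    total, pairs = 0.0, 0
    for i in range(len(units)):
        for j in range(i + 1, len(units)):
            total += sum(a * b for a, b in zip(units[i], units[j]))
            pairs += 1
    return total / pairs


collapsed = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]             # same direction
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # evenly spaced
```

A dispersion-style regularizer would, in spirit, penalize the first configuration and favor the second.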
Details
Motivation: The standard next-token prediction training strategy can lead to representation collapse, which is especially severe in deeper Transformer layers and in continuous-output NMT, and prevents effective use of the geometric space. Method: Analyzes the training dynamics of representation collapse across layers of discrete and continuous NMT Transformers; introduces an angular dispersion regularizer to mitigate it; verifies its effectiveness in quantized models. Result: Angular dispersion regularization not only mitigates representation collapse but also improves translation quality, and the benefit is preserved in quantized models. Conclusion: Representation collapse is a key problem in NMT training, and angular dispersion regularization is a simple and effective remedy that brings both performance gains and robustness. Abstract: Modern neural translation models based on the Transformer architecture are known for their high performance, particularly when trained on high-resource datasets. A standard next-token prediction training strategy, while widely adopted in practice, may lead to overlooked artifacts such as representation collapse. Previous works have shown that this problem is especially pronounced in the representation of the deeper Transformer layers, where it often fails to efficiently utilize the geometric space. Representation collapse is even more evident in end-to-end training of continuous-output neural machine translation, where the trivial solution would be to set all vectors to the same value. In this work, we analyze the dynamics of representation collapse at different levels of discrete and continuous NMT transformers throughout training. We incorporate an existing regularization method based on angular dispersion and demonstrate empirically that it not only mitigates collapse but also improves translation quality. Furthermore, we show that quantized models exhibit similar collapse behavior and that the benefits of regularization are preserved even after quantization.
[22] Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
Bogdan Kostić,Conor Fallon,Julian Risch,Alexander Löser
Main category: cs.CL
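TL;DR: This paper studies how lexical and syntactic perturbations affect 23 large language models on three benchmarks (MMLU, SQuAD, and AMEGA). Lexical perturbations significantly reduce performance, whereas syntactic perturbations have mixed effects; robustness does not increase monotonically with model size, suggesting that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence.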
Details
Motivation: The reliability of existing LLM benchmarks is questioned because they are sensitive to shallow changes in input prompts; the effect of different meaning-preserving linguistic perturbations on model performance and ranking needs systematic study. Method: Uses two linguistically principled pipelines to generate meaning-preserving perturbations, synonym substitution (lexical) and dependency-parse-driven syntactic transformation (syntactic), and measures changes in absolute performance and relative ranking for 23 LLMs on MMLU, SQuAD, and AMEGA. Result: Lexical perturbations generally cause significant performance drops; syntactic perturbations have heterogeneous effects and occasionally improve results; both types destabilize leaderboards on complex tasks; robustness does not grow monotonically with parameter count and depends strongly on the task. Conclusion: LLMs rely more on surface lexical cues than on deep linguistic understanding, so robustness testing should become a standard part of LLM evaluation. Abstract: The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ two linguistically principled pipelines to generate meaning-preserving variations: one performing synonym substitution for lexical changes, and another using dependency parsing to determine applicable syntactic transformations. Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results. Both perturbation types destabilize model leaderboards on complex tasks. Furthermore, model robustness did not consistently scale with model size, revealing strong task dependence. Overall, the findings suggest that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence, underscoring the need for robustness testing as a standard component of LLM evaluation.
[23] RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering
Yiming Zhang,Siyue Zhang,Junbo Zhao,Chen Zhao
Main category: cs.CL
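TL;DR: This paper proposes the RPDR framework, which combines synthetic data generation, Round-Trip prediction to select easy-to-learn samples, and targeted training to substantially improve dense retrievers on long-tail question answering, and adds a dynamic routing mechanism for further gains.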
Details
Motivation: Existing large language models and dense retrievers perform poorly at acquiring and recalling long-tail (rare or niche) knowledge, so retrieval needs to improve. Method: Proposes the RPDR data augmentation framework with three parts: (1) synthetic data generation; (2) selection of easy-to-learn training samples via Round-Trip prediction; (3) fine-tuning the dense retriever on the selected samples; it also designs a dynamic routing mechanism that sends queries to specialized retrieval modules. Result: On two long-tail retrieval benchmarks, PopQA and EntityQuestion, RPDR clearly outperforms baselines such as BM25 and Contriever, with the largest gains on extremely long-tail categories; human analysis confirms its strengths and limitations. Conclusion: RPDR alleviates the long-tail generalization problem of dense retrievers through high-quality, easy-to-learn data augmentation; dynamic routing improves performance further and offers a new direction for long-tail QA retrieval. Abstract: Long-tail question answering presents significant challenges for large language models (LLMs) due to their limited ability to acquire and accurately recall less common knowledge. Retrieval-augmented generation (RAG) systems have shown great promise in mitigating this limitation by integrating external retrieval mechanisms. However, dense retrieval models often face the same difficulties when generalizing to rare or niche knowledge. In this study, we introduce RPDR, a novel data augmentation framework that selects high-quality easy-to-learn training data, to enhance dense retrievers. Our approach is built around three core components: synthetic data generation, data selection with Round-Trip prediction to identify easy-to-learn instances, and retriever training with these instances. We evaluate RPDR on two long-tail retrieval benchmarks, PopQA and EntityQuestion, demonstrating substantial improvements over existing retrievers like BM25 and Contriever, especially on extremely long-tail categories. We identify the strengths and limitations of RPDR through detailed human analysis and propose a dynamic routing mechanism to dynamically route queries to specialized retrieval modules to further improve retrieval performance.
[24] The Role of the Availability Heuristic in Multiple-Choice Answering Behaviour
Leonidas Zotos,Hedderik van Rijn,Malvina Nissim
Main category: cs.CL
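TL;DR: This paper examines whether guessing by the availability heuristic is effective for multiple-choice questions (MCQs). Measuring the conceptual availability of options from large corpora such as Wikipedia, it finds that correct answers are generally more available than distractors, so always choosing the most available option clearly beats random guessing. The pattern holds for both expert-written and LLM-generated questions, suggesting that availability should be included when modelling student behaviour.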
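The "pick the most available option" strategy can be sketched in a few lines. Here availability is reduced to a raw occurrence count per option; the counts below are made-up placeholders, and the paper's actual retrieval over Wikipedia is not reproduced.

```python
def most_available_option(options, corpus_counts):
    """Return the option whose concept occurs most often in the corpus.

    `corpus_counts` maps an option string to an occurrence count; options
    missing from the mapping are treated as never seen.
    """
    return max(options, key=lambda option: corpus_counts.get(option, 0))


# Hypothetical counts, for illustration only.
counts = {"Paris": 90000, "Lyon": 12000, "Nice": 8000}
options = ["Lyon", "Paris", "Lille", "Nice"]
guess = most_available_option(options, counts)  # "Paris"
```

Note that the guess ignores the question stem entirely, which matches the abstract's observation that correct answers are more available independently of the stem.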
Details
Motivation: Students unsure of the correct MCQ answer often guess; the classic availability heuristic (Tversky & Kahneman, 1973) holds that people tend to choose the option that comes to mind most easily, but whether this strategy actually improves scores has lacked computational validation. Method: Proposes a computational method that quantifies the conceptual availability of each MCQ option from large text corpora (such as Wikipedia), tests the availability gap between correct and incorrect options on three large question sets, and compares the pattern between LLM-generated and expert-written questions. Result: Correct answers are significantly more available than distractors in all question sets; always choosing the most available option scores 13.5% to 32.9% above the random-guess baseline; options in LLM-generated MCQs show availability patterns similar to those in expert-written ones. Conclusion: The availability heuristic is effective in practice for MCQ answering and is robust across question sources; future cognitive models of students, especially of guessing behaviour, should include conceptual availability as a key variable. Abstract: When students are unsure of the correct answer to a multiple-choice question (MCQ), guessing is common practice. The availability heuristic, proposed by A. Tversky and D. Kahneman in 1973, suggests that the ease with which relevant instances come to mind, typically operationalised by the mere frequency of exposure, can offer a mental shortcut for problems in which the test-taker does not know the exact answer. Is simply choosing the option that comes most readily to mind a good strategy for answering MCQs? We propose a computational method of assessing the cognitive availability of MCQ options operationalised by concepts' prevalence in large corpora. The key finding, across three large question sets, is that correct answers, independently of the question stem, are significantly more available than incorrect MCQ options. Specifically, using Wikipedia as the retrieval corpus, we find that always selecting the most available option leads to scores 13.5% to 32.9% above the random-guess baseline. We further find that LLM-generated MCQ options show similar patterns of availability compared to expert-created options, despite the LLMs' frequentist nature and their training on large collections of textual data. Our findings suggest that availability should be considered in current and future work when computationally modelling student behaviour.
[25] Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference
Anastasia Zhukova,Felix Hamborg,Karsten Donnay,Norman Meuschke,Bela Gipp
Main category: cs.CL
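TL;DR: This paper proposes a revised annotation scheme for cross-document coreference resolution (CDCR) that treats coreference chains as discourse elements (DEs) and supports identity and near-identity relations, to better capture lexical diversity and framing differences in news coverage; NewsWCL50 and a subset of ECB+ are reannotated and validated.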
Details
Motivation: Existing CDCR datasets mostly focus on event coreference with a narrow definition, which makes it hard to handle the wording differences and framing variation common in diverse, polarized news coverage. Method: Redefines coreference chains as discourse elements (DEs) supporting identity and near-identity relations; reannotates NewsWCL50 and a subset of ECB+ with a unified codebook; evaluates with lexical diversity metrics and a same-head-lemma baseline. Result: The reannotated datasets fall between the original ECB+ and NewsWCL50 on lexical diversity and related metrics, confirming their balance and discourse sensitivity. Conclusion: The revised scheme makes CDCR more discourse-aware in the news domain and provides a more suitable benchmark and unit of analysis for modelling lexical diversity and framing variation. Abstract: Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents, enabling content analysis that aggregates information at the level of discourse participants. However, existing datasets primarily focus on event resolution and employ a narrow definition of coreference, which limits their effectiveness in analyzing diverse and polarized news coverage where wording varies widely. This paper proposes a revised CDCR annotation scheme of the NewsWCL50 dataset, treating coreference chains as discourse elements (DEs) and conceptual units of analysis. The approach accommodates both identity and near-identity relations, e.g., by linking "the caravan" - "asylum seekers" - "those contemplating illegal entry", allowing models to capture lexical diversity and framing variation in media discourse, while maintaining the fine-grained annotation of DEs. We reannotate the NewsWCL50 and a subset of ECB+ using a unified codebook and evaluate the new datasets through lexical diversity metrics and a same-head-lemma baseline. The results show that the reannotated datasets align closely, falling between the original ECB+ and NewsWCL50, thereby supporting balanced and discourse-aware CDCR research in the news domain.
[26] Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics
Sanjeev Kumar,Preethi Jyothi,Pushpak Bhattacharyya
Main category: cs.CL
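TL;DR: This paper compares the BLEU and ChrF++ machine translation metrics in extremely low-resource language (ELRL) settings and finds that BLEU, despite its lower scores, provides complementary lexical-precision information that improves interpretability.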
Details
Motivation: Mainstream metrics such as BLEU often misrepresent quality when evaluating translation for extremely low-resource languages (ELRLs), so evaluation methods better suited to ELRLs need to be examined. Method: For BLEU (n-gram-based) and ChrF++ (character-based), systematically analyzes how each metric responds to translation defects (hallucination, repetition, source-text copying, and diacritic (matra) variation) in LLM and NMT outputs for three ELRLs: Magahi, Bhojpuri, and Chhattisgarhi. Result: Although recent work widely adopts ChrF++, BLEU still provides a valuable lexical-precision signal; the two metrics are complementary, and BLEU helps make evaluation results more interpretable. Conclusion: In ELRL translation evaluation, BLEU should not be discarded; it should be combined with metrics such as ChrF++ for multi-dimensional analysis that balances robustness and interpretability. Abstract: Evaluating machine translation (MT) quality in extremely low-resource language (ELRL) scenarios poses unique challenges, as widely used metrics such as BLEU, effective in high-resource settings, often misrepresent quality in data-scarce contexts. This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings. We examine how each metric responds to translation artifacts, including hallucinations, repetition, source-text copying, and diacritic (matra) variations across three ELRLs: Magahi, Bhojpuri, and Chhattisgarhi, with a focus on outputs from large language models (LLMs) and neural MT (NMT) systems. While recent work often relies solely on ChrF++, our findings show that BLEU, despite its lower absolute scores, provides complementary lexical-precision insights that improve interpretability.
[27] Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study
Dylan Bouchard,Mohit Singh Chauhan,Viren Bajaj,David Skarbrevik
Main category: cs.CL
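TL;DR: This paper proposes a fine-grained uncertainty quantification (UQ) framework for long-form generation, using a three-stage taxonomy (response decomposition, unit-level scoring, response-level aggregation) to systematically assess the factuality of long-form LLM outputs. Experiments show that claim-response entailment scoring performs consistently well, claim-level scoring beats sentence-level scoring, and uncertainty-aware decoding markedly improves factuality.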
Details
Motivation: Existing uncertainty quantification methods mainly target short outputs and generalize poorly to hallucination detection in long-form generation. Method: Proposes a three-stage taxonomy for fine-grained UQ (response decomposition, unit-level scoring, response-level aggregation), formalizes several families of consistency-based black-box scorers, and runs experiments across models and datasets. Result: (1) Claim-response entailment scoring is stable and performs on par with or better than more complex claim-level methods; (2) claim-level scoring outperforms sentence-level scoring; (3) uncertainty-aware decoding markedly improves long-form factuality. Conclusion: The framework unifies prior methods, supports fair comparison, and gives practical guidance for choosing components of fine-grained uncertainty quantification. Abstract: Uncertainty quantification has emerged as an effective approach to closed-book hallucination detection for LLMs, but existing methods are largely designed for short-form outputs and do not generalize well to long-form generation. We introduce a taxonomy for fine-grained uncertainty quantification in long-form LLM outputs that distinguishes methods by design choices at three stages: response decomposition, unit-level scoring, and response-level aggregation. We formalize several families of consistency-based black-box scorers, providing generalizations and extensions of existing methods. In our experiments across multiple LLMs and datasets, we find 1) claim-response entailment consistently performs better or on par with more complex claim-level scorers, 2) claim-level scoring generally yields better results than sentence-level scoring, and 3) uncertainty-aware decoding is highly effective for improving the factuality of long-form outputs. Our framework clarifies relationships between prior methods, enables apples-to-apples comparisons, and provides practical guidance for selecting components for fine-grained UQ.
[28] AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue
Adib Sakhawat,Fardeen Sadab,Rakin Shahriar
Main category: cs.CL
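TL;DR: This paper proposes the AIDG framework, which uses an adversarial information deduction game to measure the gap in LLMs' strategic reasoning between information extraction and information containment. Models are markedly better at containment than extraction, and two bottlenecks are identified: information dynamics and constraint adherence.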
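To give a feel for the size of the 350-point ELO gap reported in the abstract, the sketch below converts a rating gap into an expected score using the standard logistic Elo formula with a 400-point scale. Whether the paper uses exactly this parameterization is an assumption; the abstract does not say.

```python
def elo_expected_score(rating_gap):
    """Expected score of the higher-rated side under the standard Elo model.

    `rating_gap` is (own rating - opponent rating); the 400-point scale is
    the conventional choice and is assumed here.
    """
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))


advantage = elo_expected_score(350)  # about 0.88
```

Under that assumption, a 350-point advantage corresponds to an expected score of roughly 0.88 for the stronger side.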
Details
Motivation: Evaluating the strategic reasoning of LLMs requires moving from static benchmarks to dynamic multi-turn interaction, in particular to examine the asymmetry between information extraction (active deduction) and information containment (state maintenance). Method: Proposes the AIDG (Adversarial Information Deduction Game) framework with two complementary tasks, AIDG-I (pragmatic strategy in social deduction) and AIDG-II (constraint satisfaction in a structured "20 Questions" setting), and tests six frontier LLMs over 439 games. Result: Models are clearly stronger at containment (defense) than at extraction (offense), with a 350-point ELO advantage (Cohen's d = 5.47); confirmation strategies are 7.75x more effective than blind deduction; 41.3% of deductive failures stem from degraded instruction-following under conversational load. Conclusion: LLMs are good at local defensive coherence but are fundamentally limited in strategic inquiry tasks that require global state tracking. Abstract: Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions. We introduce AIDG (Adversarial Information Deduction Game), a game-theoretic framework that probes the asymmetry between information extraction (active deduction) and information containment (state maintenance) in dialogue. We propose two complementary tasks: AIDG-I, measuring pragmatic strategy in social deduction, and AIDG-II, measuring constraint satisfaction in a structured "20 Questions" setting. Across 439 games with six frontier LLMs, we observe a clear capability asymmetry: models perform substantially better at containment than deduction, with a 350 ELO advantage on defense (Cohen's d = 5.47). We identify two bottlenecks driving this gap: (1) Information Dynamics, where confirmation strategies are 7.75x more effective than blind deduction (p < 0.00001), and (2) Constraint Adherence, where instruction-following degrades under conversational load, accounting for 41.3% of deductive failures. These findings suggest that while LLMs excel at local defensive coherence, they struggle with the global state tracking required for strategic inquiry.
[29] ABCD: All Biases Come Disguised
Mateusz Nowak,Xavier Cadet,Peter Chin
Main category: cs.CL
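TL;DR: This paper proposes an MCQ evaluation protocol that reduces label and position bias by using uniform, unordered labels and matching answers by sentence similarity, improving the robustness of LLM evaluation.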
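The answer-matching step of such a protocol can be sketched as follows: the model writes out a full answer, and the evaluator picks the option whose text is most similar. The paper uses a sentence similarity (embedding) model; the sketch substitutes Python's `difflib.SequenceMatcher` character-level ratio purely as a dependency-free stand-in, so it illustrates the matching logic rather than the paper's scorer. The function name and sample options are hypothetical.

```python
from difflib import SequenceMatcher


def match_answer(free_text, options):
    """Return the option text most similar to the model's free-text answer."""
    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    return max(options, key=lambda option: similarity(free_text, option))


options = [
    "The mitochondria produces energy",
    "The nucleus stores DNA",
    "Ribosomes build proteins",
]
picked = match_answer("the nucleus stores the DNA", options)
```

Because the match depends only on option text, shuffling the options or relabeling them does not change which answer is selected.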
Details
Motivation: Existing MCQ benchmarks are vulnerable to model biases toward answer position, option labels, or the distribution of correct answers in few-shot examples, so evaluation results do not faithfully reflect reasoning ability. Method: Designs the synthetic NonsenseQA benchmark to expose these biases; proposes a new evaluation protocol that replaces the original labels with uniform, unordered labels, asks the model to output the full answer text, and matches predictions to gold answers with a lightweight sentence similarity model. Result: Across several benchmarks and models, the protocol reduces accuracy variance by 3x with only a slight drop in mean performance; ablations show it is robust to the choice of embedding model and similarity function. Conclusion: Standard MCQ evaluation is substantially biased, and the proposed protocol exposes LLMs' underlying capabilities more faithfully and makes evaluation more reliable. Abstract: Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions. Through a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit varying degrees of label-position-few-shot-prompt bias, where the model either uses the answer position, the label in front of the answer, the distributions of correct answers present in the few-shot prompt, or a combination of all to answer each MCQ question. We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels and prompts the LLM to use the whole answer presented. With a simple sentence similarity model, we demonstrate improved robustness and lower standard deviation between different permutations of answers with a minimal drop in LLM's performance, exposing the LLM's capabilities under reduced evaluation artifacts, without any help from the prompt examples or the option labels. Across multiple benchmarks and models, this protocol substantially improves the robustness to answer permutations, reducing mean accuracy variance 3× with only a minimal decrease in the mean model's performance. Through ablation studies on various embedding models and similarity functions, we show that the method is more robust than the standard ones.
[30] Entropy-Based Data Selection for Language Models
Hongming Li,Yang Liu,Chao Huang
Main category: cs.CL
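TL;DR: This paper proposes an Entropy-Based Unsupervised Data Selection (EUDS) framework for efficient language model fine-tuning under limited compute, improving training efficiency by reducing the required training data and computational cost.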
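A minimal sketch of entropy-based filtering, assuming each unlabeled sample comes with a predicted probability distribution from some model. The abstract does not specify how EUDS computes entropy or whether it keeps high- or low-entropy samples, so the `keep_highest` flag, the `keep_ratio` parameter, and the function names are all assumptions made for illustration.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_by_entropy(samples, keep_ratio=0.5, keep_highest=True):
    """Rank (sample_id, probs) pairs by entropy and keep a fraction of them.

    Which end of the ranking is useful is task-dependent; both are exposed.
    """
    ranked = sorted(samples, key=lambda s: shannon_entropy(s[1]),
                    reverse=keep_highest)
    keep = max(1, int(len(ranked) * keep_ratio))
    return [sample_id for sample_id, _ in ranked[:keep]]


pool = [("a", [0.5, 0.5]), ("b", [0.9, 0.1]), ("c", [1.0, 0.0]), ("d", [0.6, 0.4])]
uncertain_half = select_by_entropy(pool)                      # ["a", "d"]
confident_half = select_by_entropy(pool, keep_highest=False)  # ["c", "b"]
```

No labels are needed at any point, which is what makes this kind of filter unsupervised.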
Details
Motivation: Fine-tuning modern language models requires large amounts of compute and data, yet practical settings are often resource-constrained; existing data selection methods depend on high compute, and assessing data usability is difficult, so a low-overhead, efficient selection method is needed. Method: Proposes the Entropy-Based Unsupervised Data Selection (EUDS) framework, which filters data using uncertainty estimates (entropy) without labels or additional models, giving a computationally efficient unsupervised selection procedure. Result: Experiments on sentiment analysis, topic classification, and question answering show that EUDS significantly reduces computational cost and training time while maintaining or even improving model performance with less data. Conclusion: EUDS offers a practical solution for efficient language model fine-tuning under compute constraints, reveals the link between data selection and uncertainty estimation, and is supported by both theoretical analysis and experiments. Abstract: Modern language models (LMs) increasingly require two critical resources: computational resources and data resources. Data selection techniques can effectively reduce the amount of training data required for fine-tuning LMs. However, their effectiveness is closely related to computational resources, which always require a high compute budget. Owing to the resource limitations in practical fine-tuning scenarios, we systematically reveal the relationship between data selection and uncertainty estimation of selected data. Although large language models (LLMs) exhibit exceptional capabilities in language understanding and generation, which provide new ways to alleviate data scarcity, evaluating data usability remains a challenging task. This makes efficient data selection indispensable. To mitigate these issues, we propose the Entropy-Based Unsupervised Data Selection (EUDS) framework. Empirical experiments on sentiment analysis (SA), topic classification (Topic-CLS), and question answering (Q&A) tasks validate its effectiveness. EUDS establishes a computationally efficient data-filtering mechanism. Theoretical analysis and experimental results confirm the effectiveness of our approach. EUDS significantly reduces computational costs and improves training time efficiency with less data requirement. This provides an innovative solution for the efficient fine-tuning of LMs in compute-constrained scenarios.
[31] PEACE 2.0: Grounded Explanations and Counter-Speech for Combating Hate Expressions
Greta Damo,Stéphane Petiot,Elena Cabrio,Serena Villata
Main category: cs.CL
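TL;DR: PEACE 2.0 is a new tool for analysing and explaining hate speech and generating evidence-grounded counter-speech responses, built on Retrieval-Augmented Generation (RAG).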
Details
Motivation: The surge of hate speech on online platforms poses societal challenges; existing research focuses mostly on detection, while automatically generating effective counter-speech remains an open problem. Method: Presents the PEACE 2.0 tool, built on a Retrieval-Augmented Generation (RAG) pipeline, which (i) grounds hate speech explanations in evidence, (ii) automatically generates evidence-grounded counter-speech, and (iii) explores the characteristics of counter-speech. Result: PEACE 2.0 supports in-depth analysis and response generation for both explicit and implicit hateful messages, improving the credibility of explanations and the quality of counter-speech. Conclusion: PEACE 2.0 integrates hate speech detection, explainability, and response generation in one tool, moving from identification toward intervention. Abstract: The increasing volume of hate speech on online platforms poses significant societal challenges. While the Natural Language Processing community has developed effective methods to automatically detect the presence of hate speech, responses to it, called counter-speech, are still an open challenge. We present PEACE 2.0, a novel tool that, besides analysing and explaining why a message is considered hateful or not, also generates a response to it. More specifically, PEACE 2.0 has three main new functionalities: leveraging a Retrieval-Augmented Generation (RAG) pipeline i) to ground HS explanations into evidence and facts, ii) to automatically generate evidence-grounded counter-speech, and iii) exploring the characteristics of counter-speech replies. By integrating these capabilities, PEACE 2.0 enables in-depth analysis and response generation for both explicit and implicit hateful messages.
[32] Auditing Reciprocal Sentiment Alignment: Inversion Risk, Dialect Representation and Intent Misalignment in Transformers
Nusrat Jahan Lia,Shubhashis Roy Dipta
Main category: cs.CL
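TL;DR: This paper studies cross-lingual sentiment alignment between Bengali and English and finds that current alignment paradigms suffer from severe sentiment misjudgment, asymmetric empathy, and modern bias in low-resource languages. It argues for culturally sensitive, pluralistic alignment and proposes an "Affective Stability" metric to strengthen human-AI trust.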
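One plausible reading of a sentiment inversion rate is the share of parallel sentence pairs whose predicted polarity flips between positive and negative across the two languages. The sketch below implements that reading; the paper's exact definition is not given in the abstract, and the label strings and function name are assumptions.

```python
def sentiment_inversion_rate(source_labels, target_labels):
    """Fraction of parallel pairs whose polarity flips (positive <-> negative).

    Pairs involving "neutral" are not counted as inversions.
    """
    if len(source_labels) != len(target_labels):
        raise ValueError("label lists must be aligned pair by pair")
    flips = sum(
        1 for s, t in zip(source_labels, target_labels)
        if {s, t} == {"positive", "negative"}
    )
    return flips / len(source_labels)


english = ["positive", "negative", "positive", "neutral"]
bengali = ["negative", "negative", "positive", "positive"]
rate = sentiment_inversion_rate(english, bengali)  # 0.25
```

An "Affective Stability" style metric could then penalize a model in proportion to this rate, with extra weight on dialectal subsets.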
Details
Motivation: 双向对齐的核心是确保AI准确理解人类意图且人类能信任AI行为,但在语言障碍下该闭环严重断裂;本文聚焦孟加拉语-英语跨语言情感对齐这一被忽视的低资源场景,揭示现有对齐方法在文化与语言多样性上的根本缺陷。 Method: 通过基准测试四种Transformer模型(包括mDistilBERT和IndicBERT),定量分析其在孟加拉语(含口语与正式体Sadhu)与英语之间的情感极性预测一致性,定义并测量'情感反转率'、'不对称共情'及'现代偏见'等新现象。 Result: mDistilBERT出现28.7%情感反转;IndicBERT在正式孟加拉语中对齐误差激增57%;不同模型对孟加拉语文本的情感强度呈现系统性压制或放大(即不对称共情)。 Conclusion: 通用压缩型对齐范式无法保全低资源语言的情感保真度,应转向尊重语言与方言多样性的文化嵌入式对齐;建议在对齐评估中纳入'情感稳定性'指标,尤其关注低资源与方言语境下的极性一致性。 Abstract: The core theme of bidirectional alignment is ensuring that AI systems accurately understand human intent and that humans can trust AI behavior. However, this loop fractures significantly across language barriers. Our research addresses Cross-Lingual Sentiment Misalignment between Bengali and English by benchmarking four transformer architectures. We reveal severe safety and representational failures in current alignment paradigms. We demonstrate that compressed model (mDistilBERT) exhibits 28.7% "Sentiment Inversion Rate," fundamentally misinterpreting positive user intent as negative (or vice versa). Furthermore, we identify systemic nuances affecting human-AI trust, including "Asymmetric Empathy" where some models systematically dampen and others amplify the affective weight of Bengali text relative to its English counterpart. Finally, we reveal a "Modern Bias" in the regional model (IndicBERT), which shows a 57% increase in alignment error when processing formal (Sadhu) Bengali. We argue that equitable human-AI co-evolution requires pluralistic, culturally grounded alignment that respects language and dialectal diversity over universal compression, which fails to preserve the emotional fidelity required for reciprocal human-AI trust. 
We recommend that alignment benchmarks incorporate "Affective Stability" metrics that explicitly penalize polarity inversions in low-resource and dialectal contexts.
[33] Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian
Pietro Ferrazzi,Mattia Franzin,Alberto Lavelli,Bernardo Magnini
Main category: cs.CL
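TL;DR: This paper examines how small LLMs (around one billion parameters) perform on 20 clinical NLP tasks, finds that a fine-tuned small model (Qwen3-1.7B) can outperform a much larger one (Qwen3-32B), and releases several Italian medical datasets and models.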
Details
Motivation: Large language models perform well on medical NLP tasks, but their high computational cost limits deployment in real healthcare settings; the paper tests whether small LLMs can meet the needs of resource-constrained environments while keeping accuracy competitive. Method: Systematically evaluates small LLMs (around one billion parameters) from the Llama-3, Gemma-3, and Qwen3 families on 20 clinical NLP tasks, comparing adaptation strategies at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pre-training). Result: Fine-tuning is the most effective adaptation method; few-shot prompting combined with constraint decoding is a strong lower-resource alternative; the best Qwen3-1.7B configuration scores on average 9.2 points higher than Qwen3-32B. Conclusion: With suitable adaptation, small LLMs can match or exceed large models on many medical NLP tasks and have potential for clinical deployment; the released Italian medical datasets and models support localized medical AI. Abstract: Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families (Llama-3, Gemma-3, and Qwen3) across 20 clinical NLP tasks among Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers strong lower-resource alternatives. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration based on Qwen3-1.7B achieving an average score +9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian Hospital, and 175M words from various sources that we used for continual pre-training.
[34] Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics
Baris Karacan,Barbara Di Eugenio,Patrick Thornton
Main category: cs.CL
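TL;DR: This paper introduces a new obstetrics clinical notes dataset and systematically evaluates Transformer-based supervised models and zero-shot large language models on clinical section segmentation, finding that zero-shot models are more robust across domains once hallucinated section headers are corrected.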
Details
Motivation: Existing clinical section segmentation methods are mostly trained on general medical corpora such as MIMIC-III and do not cover specialties such as obstetrics; a systematic evaluation of zero-shot large language models on this task is also missing. Method: 1) Builds a de-identified, section-labeled obstetrics notes dataset; 2) systematically evaluates supervised Transformer models on a MIMIC-III subset (in-domain) and on the new obstetrics dataset (out-of-domain); 3) carries out the first head-to-head comparison of supervised models and zero-shot large language models on clinical section segmentation. Result: Supervised models perform strongly in-domain but degrade substantially out-of-domain; zero-shot models adapt well across domains once hallucinated headers are corrected. Conclusion: Domain-specific clinical resources need further development; zero-shot segmentation is a viable way to extend medical NLP to new settings, provided hallucinations are managed effectively. Abstract: Clinical free-text notes contain vital patient information. They are structured into labelled sections; recognizing these sections has been shown to support clinical decision-making and downstream NLP tasks. In this paper, we advance clinical section segmentation through three key contributions. First, we curate a new de-identified, section-labeled obstetrics notes dataset, to supplement the medical domains covered in public corpora such as MIMIC-III, on which most existing segmentation approaches are trained. Second, we systematically evaluate transformer-based supervised models for section segmentation on a curated subset of MIMIC-III (in-domain), and on the new obstetrics dataset (out-of-domain). Third, we conduct the first head-to-head comparison of supervised models for medical section segmentation with zero-shot large language models. Our results show that while supervised models perform strongly in-domain, their performance drops substantially out-of-domain. In contrast, zero-shot models demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected. These findings underscore the importance of developing domain-specific clinical resources and highlight zero-shot segmentation as a promising direction for applying healthcare NLP beyond well-studied corpora, as long as hallucinations are appropriately managed.
[35] Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems
Zhangqi Duan,Arnav Kankaria,Dhruv Kartik,Andrew Lan
Main category: cs.CL
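TL;DR: This paper proposes a framework that uses large language models (LLMs) to automatically label knowledge component (KC)-level correctness in programming assignments; combined with a temporal context-aware Code-KC mapping mechanism, it improves learning curve fit and predictive performance and is validated against expert annotations.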
Details
Motivation: Real-world programming data lacks fine-grained KC-level correctness labels, and simply propagating problem-level correctness to every KC hides partial mastery and produces poorly fitted learning curves. Method: Proposes an automated LLM-based framework for KC-level correctness labeling that judges whether each KC is applied correctly and includes a temporal context-aware Code-KC mapping mechanism. Result: Outperforms baselines on learning curve fit (power law of practice, Additive Factors Model) and on predictive performance; human evaluation shows substantial agreement between LLM and expert annotations. Conclusion: The framework produces high-quality KC-level labels, improves the theoretical soundness and practical value of student modeling, and offers a workable route to learning analytics for open-ended programming tasks. Abstract: Fine-grained skill representations, commonly referred to as knowledge components (KCs), are fundamental to many approaches in student modeling and learning analytics. However, KC-level correctness labels are rarely available in real-world datasets, especially for open-ended programming tasks where solutions typically involve multiple KCs simultaneously. Simply propagating problem-level correctness to all associated KCs obscures partial mastery and often leads to poorly fitted learning curves. To address this challenge, we propose an automated framework that leverages large language models (LLMs) to label KC-level correctness directly from student-written code. Our method assesses whether each KC is correctly applied and further introduces a temporal context-aware Code-KC mapping mechanism to better align KCs with individual student code. We evaluate the resulting KC-level correctness labels in terms of learning curve fit and predictive performance using the power law of practice and the Additive Factors Model. Experimental results show that our framework leads to learning curves that are more consistent with cognitive theory and improves predictive performance, compared to baselines. Human evaluation further demonstrates substantial agreement between LLM and expert annotations.
[36] Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
Jyotin Goel,Souvik Maji,Pratik Mazumder
Main category: cs.CL
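TL;DR: This paper proposes an adaptive regularization training framework that dynamically adjusts the strength of safety constraints during fine-tuning, so that instruction-following language models stay aligned with safety objectives while remaining useful. The framework relies on one of two risk estimators, a judge-based Safety Critic or a risk predictor built on intermediate model activations, and experiments show that it substantially lowers attack success rate without harming downstream performance.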
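The risk-dependent constraint summarized above can be sketched as a per-batch loss. This is a minimal illustration under stated assumptions, not the authors' formulation: it assumes the constraint is a KL penalty toward the safe reference policy and that its weight is a linear function of a risk score in [0, 1]. All function names and default values are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def regularization_weight(risk, lam_min=0.0, lam_max=1.0):
    """Map a batch-level risk score in [0, 1] to a penalty weight.

    Assumption: linear interpolation; the paper may use another schedule.
    """
    risk = min(max(risk, 0.0), 1.0)
    return lam_min + risk * (lam_max - lam_min)


def adaptive_loss(task_loss, policy_probs, reference_probs, risk):
    """Task loss plus a risk-scaled pull toward the safe reference policy."""
    penalty = kl_divergence(policy_probs, reference_probs)
    return task_loss + regularization_weight(risk) * penalty


# Low-risk batch: the update proceeds as in standard fine-tuning.
low = adaptive_loss(1.0, [0.9, 0.1], [0.5, 0.5], risk=0.0)
# High-risk batch: the same update is penalized for leaving the reference.
high = adaptive_loss(1.0, [0.9, 0.1], [0.5, 0.5], risk=1.0)
```

In this sketch the risk score would come from either estimator named in the entry (judge scores or an activation-based classifier); the loss itself does not depend on which one is used.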
Details
Motivation: Existing defenses often offer limited protection or trade safety against utility, while the safety of instruction-tuned models degrades under both benign and adversarial fine-tuning; a mechanism is needed that maintains safety without inference overhead or loss of utility. Method: Proposes an adaptive regularization training framework that uses either a Safety Critic (high-level harm scores) or a lightweight activation-based risk predictor (a classifier over intermediate-layer activations) to produce batch-level safety risk signals and constrains parameter updates accordingly: higher-risk updates are kept close to a safe reference policy, while lower-risk updates proceed as in standard training. Result: Experiments show that 1) harmful intent is predictable from pre-generation intermediate activations; 2) judge scores provide high-recall safety guidance; 3) across multiple models and attack scenarios, the method substantially lowers attack success rate, preserves downstream task performance, and adds no inference-time cost. Conclusion: Adaptive regularization is a principled and practical mechanism for maintaining safety; it keeps the model aligned throughout fine-tuning, balances safety and utility, and requires no runtime intervention. Abstract: Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effective high-recall safety guidance. Across multiple model families and attack scenarios, adaptive regularization with either risk estimation approach consistently lowers attack success rate compared to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. This work demonstrates a principled mechanism for maintaining safety without sacrificing utility.
[37] Modeling Distinct Human Interaction in Web Agents
Faria Huq,Zora Zhiruo Wang,Zhanqiu Guo,Venu Arvind Arangarajan,Tianyue Ou,Frank Xu,Shuyan Zhou,Graham Neubig,Jeffrey P. Bigham
Main category: cs.CL
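TL;DR: This paper proposes modeling human intervention to support collaborative web task execution. It builds CowCorpus, a dataset of 400 real-user trajectories, identifies four patterns of human-agent interaction, and trains language models to predict when users intervene, improving intervention prediction accuracy (by 61.4-63.4%) and user-rated agent usefulness (by 26.5%).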
Details
Motivation: Current autonomous web agents lack a systematic understanding of when and why humans intervene; they often miss critical decision points or request confirmation too often, which makes collaboration inefficient. Method: Builds the CowCorpus dataset (400 real-user web navigation trajectories with over 4,200 interleaved human and agent actions), identifies four interaction patterns (hands-off supervision, hands-on oversight, collaborative task-solving, full user takeover), and trains language models to predict intervention timing on that basis. Result: Intervention prediction accuracy improves by 61.4-63.4% over base language models; in a user study, web navigation agents equipped with the intervention-aware models are rated 26.5% more useful. Conclusion: Structured modeling of human intervention makes web agents more adaptive and collaborative. Abstract: Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation. In this work, we introduce the task of modeling human intervention to support collaborative web task execution. We collect CowCorpus, a dataset of 400 real-user web navigation trajectories containing over 4,200 interleaved human and agent actions. We identify four distinct patterns of user interaction with agents: hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover. Leveraging these insights, we train language models (LMs) to anticipate when users are likely to intervene based on their interaction styles, yielding a 61.4-63.4% improvement in intervention prediction accuracy over base LMs. Finally, we deploy these intervention-aware models in live web navigation agents and evaluate them in a user study, finding a 26.5% increase in user-rated agent usefulness. Together, our results show structured modeling of human intervention leads to more adaptive, collaborative agents.
[38] The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines?
Jayadev Billa
Main category: cs.CL
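TL;DR: This paper finds that most current speech LLMs are essentially implicit ASR systems, behaviorally and mechanistically equivalent to Whisper→LLM cascades. Matched-backbone tests that control for the LLM backbone show that models such as Ultravox are almost indistinguishable from their corresponding cascades, whereas Qwen2-Audio genuinely diverges; under noise, speech LLMs perform worse than cascades, which limits their practical deployment value.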
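The abstract for this entry reports agreement between Ultravox and its matched cascade as κ = 0.93. Assuming κ denotes Cohen's kappa over per-example predictions (the digest does not say so explicitly), the statistic can be sketched as follows; the label lists are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty, aligned label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both systems labeled independently at random
    # with their own marginal label frequencies.
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical answers from a speech LLM and its matched cascade.
speech_llm = ["A", "B", "A", "B"]
cascade = ["A", "B", "A", "A"]
kappa = cohens_kappa(speech_llm, cascade)
```

Here the two systems agree on three of four items (observed 0.75) against a chance level of 0.5, giving κ = 0.5; a value near 1, as reported for Ultravox, means the two systems make nearly the same predictions.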
Details
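Motivation: To determine whether current speech LLMs truly understand speech end to end or only perform implicit ASR and rely on text representations, and so to clarify their underlying mechanism and practical value. Method: Uses a matched-backbone design to compare four speech LLMs with their corresponding Whisper→LLM cascades on six tasks; applies logit lens to trace the emergence of text representations in hidden layers, uses LEACE concept erasure to test whether text representations are causally necessary, and evaluates robustness at different signal-to-noise ratios. Result: Ultravox closely matches its cascade (κ=0.93), logit lens shows literal text emerging in hidden states, and LEACE erasure drives accuracy to near zero; Qwen2-Audio departs from cascade behavior; at 0 dB noise, the clean-condition advantage of speech LLMs over cascades reverses by up to 7.6%. Conclusion: Most current speech LLMs are not truly end-to-end models but expensive implicit ASR cascades with worse noise robustness; cascade equivalence depends on architecture and is not universal; their design goals and deployment rationale should be reassessed. Abstract: Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\to$LLM cascades. We show this through matched-backbone testing across four speech LLMs and six tasks, controlling for the LLM backbone for the first time. Ultravox is statistically indistinguishable from its matched cascade ($\kappa = 0.93$); logit lens reveals literal text emerging in hidden states; LEACE concept erasure confirms text representations are causally necessary in both architectures tested, collapsing accuracy to near-zero. Qwen2-Audio genuinely diverges, revealing cascade equivalence is architecture-dependent, not universal. For most deployed use cases, current speech LLMs are expensive cascades, and under noise, they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0 dB.
[39] Unmasking the Factual-Conceptual Gap in Persian Language Models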
Alireza Sakhaeirad,Ali Ma'manpoosh,Arshia Hemmat
Main category: cs.CL
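TL;DR: This paper introduces DivanBench, a diagnostic benchmark focused on Persian cultural superstitions and customs. It reveals three problems in current Persian LLMs' cultural reasoning: acquiescence bias, amplification of that bias by continued pretraining, and a large gap between factual retrieval and situational application, showing that simply scaling monolingual data does not yield real cultural competence.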
Details
Motivation: Existing Persian NLP benchmarks have expanded into pragmatics and politeness but do not separate memorization of fixed cultural facts from the ability to reason about implicit social norms. Method: Builds DivanBench, a benchmark of 315 questions covering three task types (factual retrieval, paired scenario verification, situational reasoning), and evaluates seven Persian LLMs. Result: Finds three key failures: severe acquiescence bias (models recognize appropriate behavior but cannot reject clear violations); continued Persian pretraining amplifies this bias and degrades the ability to detect contradictions; all models show a 21% performance gap between factual retrieval and situational application. Conclusion: Cultural competence cannot be obtained just by scaling monolingual data; current models imitate cultural patterns without internalizing the underlying schemas. Abstract: While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms. We introduce DivanBench, a diagnostic benchmark focused on superstitions and customs, arbitrary, context-dependent rules that resist simple logical deduction. Through 315 questions across three task types (factual retrieval, paired scenario verification, and situational reasoning), we evaluate seven Persian LLMs and reveal three critical failures: most models exhibit severe acquiescence bias, correctly identifying appropriate behaviors but failing to reject clear violations; continuous Persian pretraining amplifies this bias rather than improving reasoning, often degrading the model's ability to discern contradictions; and all models show a 21% performance gap between retrieving factual knowledge and applying it in scenarios. These findings demonstrate that cultural competence requires more than scaling monolingual data, as current models learn to mimic cultural patterns without internalizing the underlying schemas.
[40] Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking
Iskar Deng,Nathalia Xu,Shane Steinert-Threlkeld
Main category: cs.CL
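TL;DR: This paper studies the typological preferences that language models show for differential argument marking (DAM) after training on synthetic corpora, finding that models reproduce the natural tendency in markedness direction seen in human languages but not the strong preference for marking objects.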
Details
Motivation: To explore whether language models trained on synthetic corpora acquire cross-linguistic regularities resembling those of human languages, in particular for semantically driven differential argument marking (DAM) systems. Method: Uses a controlled synthetic learning method to train GPT-2 models on 18 corpora implementing different DAM systems and evaluates their generalization with minimal pairs. Result: Models reliably show the human-like preference for natural markedness direction, in which marking targets semantically atypical arguments, but do not reproduce the strong object preference of human languages, in which objects are marked more often than subjects. Conclusion: Different typological tendencies may arise from different underlying mechanisms; the markedness direction preference and the argument role preference in DAM appear to have distinct cognitive or modeling bases. Abstract: Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order. In this paper, we extend this paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. Using a controlled synthetic learning method, we train GPT-2 models on 18 corpora implementing distinct DAM systems and evaluate their generalization using minimal pairs. Our results reveal a dissociation between two typological dimensions of DAM. Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. In contrast, models do not reproduce the strong object preference in human languages, in which overt marking in DAM more often targets objects rather than subjects. These findings suggest that different typological tendencies may arise from distinct underlying sources.
[41] What Language is This? Ask Your Tokenizer
Clara Meister,Ahmetcan Yavuz,Pietro Lesci,Tiago Pimentel
Main category: cs.CL
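TL;DR: This paper proposes UniLID, a language identification method based on the UnigramLM tokenization algorithm that substantially improves performance in low-resource and closely related language settings and is efficient, extensible, and easy to integrate.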
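The idea of scoring text under language-conditional unigram distributions over a shared vocabulary, with segmentation chosen per language, can be sketched as below. This is an illustrative toy rather than the UniLID implementation: the vocabularies, probabilities, unknown-character penalty, and function names are all hypothetical.

```python
import math


def best_segmentation_logprob(text, token_logprobs, max_len=8, unk_logprob=-20.0):
    """Viterbi search for the highest-scoring segmentation of `text`.

    `token_logprobs` maps vocabulary pieces to log-probabilities for ONE
    language. Single characters outside the vocabulary receive a fixed
    penalty (an assumption) so every string remains segmentable.
    """
    best = [0.0] + [-math.inf] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in token_logprobs:
                score = token_logprobs[piece]
            elif len(piece) == 1:
                score = unk_logprob
            else:
                continue
            best[end] = max(best[end], best[start] + score)
    return best[-1]


def identify_language(text, models):
    """Pick the language whose unigram model scores the text highest."""
    return max(models, key=lambda lang: best_segmentation_logprob(text, models[lang]))


# Toy per-language unigram distributions (hypothetical values).
models = {
    "en": {"the": math.log(0.5), "cat": math.log(0.5)},
    "de": {"die": math.log(0.5), "katze": math.log(0.5)},
}
predicted = identify_language("thecat", models)
```

Because each language keeps its own distribution, adding a new language in this sketch only means adding one more entry to `models`, which mirrors the incremental-addition property described for the method.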
Details
Motivation: Existing language identification systems are brittle in low-resource and closely related language settings, so a more robust, efficient, and extensible method is needed. Method: Building on the UnigramLM algorithm, learns language-conditional unigram distributions over a shared vocabulary and treats segmentation as language-specific; new languages can be added incrementally without retraining existing models. Result: Competitive with fastText, GlotLID, and CLD3 on standard benchmarks; in low-resource settings it exceeds 70% accuracy with as few as five labeled samples per language; it delivers large gains on fine-grained dialect identification. Conclusion: UniLID is a simple, efficient, and extensible language identification method that is particularly suited to low-resource and fine-grained settings and fits into existing LLM tokenization pipelines. Abstract: Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models. Despite near-perfect performance on high-resource languages, existing systems remain brittle in low-resource and closely related language settings. We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm, leveraging its probabilistic framing, parameter estimation technique and inference strategy. In short, we learn language-conditional unigram distributions over a shared tokenizer vocabulary but treat segmentation as a language-specific phenomenon. Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models, and can naturally be integrated into existing language model tokenization pipelines. Empirical evaluations against widely used baselines, including fastText, GlotLID, and CLD3, show that UniLID achieves competitive performance on standard benchmarks, substantially improves sample efficiency in low-resource settings (surpassing 70% accuracy with as few as five labeled samples per language), and delivers large gains on fine-grained dialect identification.
[42] Sink-Aware Pruning for Diffusion Language Models
Aidar Myrzakhan,Tianyi Li,Bowei Guo,Shengkun Tang,Zhiqiang Shen
Main category: cs.CL
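TL;DR: This paper proposes Sink-Aware Pruning, a pruning method for diffusion language models (DLMs). It shows that attention sinks in DLMs are highly unstable over time and therefore should not be preserved as they are in autoregressive models; without retraining, the method achieves a better quality-efficiency trade-off under matched compute.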
Details
Motivation: Diffusion language models (DLMs) have high inference cost because of iterative denoising and need efficient pruning; however, existing pruning strategies mostly carry over the heuristic of preserving attention sinks from autoregressive (AR) LLMs, an assumption that does not hold for DLMs. Method: Analyzes the variance of attention-sink positions across the DLM generation trajectory and finds that sinks are highly transient; on that basis proposes Sink-Aware Pruning, which automatically identifies and prunes unstable sink tokens instead of always preserving sinks as in AR models. Result: Without retraining, the method achieves a better quality-efficiency trade-off than several strong pruning baselines under matched compute. Conclusion: Attention sinks in DLMs lack the structural stability they have in AR models and should be identified and pruned dynamically; Sink-Aware Pruning offers a new approach to efficient DLM inference. Abstract: Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics, largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose Sink-Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA-Lab/Sink-Aware-Pruning.
cs.CV [Back]
[43] Three-dimensional Damage Visualization of Civil Structures via Gaussian Splatting-enabled Digital Twins
Shuo Wang,Shuo Wang,Xin Nie,Yasutaka Narazaki,Thomas Matiki,Billie F. Spencer
Main category: cs.CV
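TL;DR: This paper proposes a Gaussian Splatting (GS)-based digital twin method for 3D damage visualization of civil infrastructure; it is more efficient than NeRF and supports multi-scale reconstruction and updates as damage evolves over time.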
Details
Motivation: Traditional 2D image-based damage identification cannot meet the need of modern infrastructure inspection for precise 3D damage visualization; existing methods such as NeRF fall short in efficiency or in featureless regions, so a better 3D representation is needed. Method: Uses Gaussian Splatting (GS) for 3D reconstruction and maps 2D damage segmentation results into 3D space; designs a multi-scale reconstruction strategy to balance efficiency and detail; supports updating the digital twin from new observations. Result: Validated on an open-source synthetic post-earthquake dataset: the method improves 3D damage visualization quality, reduces 2D segmentation errors, and enables efficient, updatable digital twin construction. Conclusion: GS is better suited than NeRF to real-time, high-fidelity 3D damage visualization in infrastructure digital twins; the proposed method is a practical and scalable approach for structural health monitoring. Abstract: Recent advancements in civil infrastructure inspections underscore the need for precise three-dimensional (3D) damage visualization on digital twins, transcending traditional 2D image-based damage identifications. Compared to conventional photogrammetric 3D reconstruction techniques, modern approaches such as Neural Radiance Field (NeRF) and Gaussian Splatting (GS) excel in scene representation, rendering quality, and handling featureless regions. Among them, GS stands out for its efficiency, leveraging discrete anisotropic 3D Gaussians to represent radiance fields, unlike NeRF's continuous implicit model. This study introduces a GS-enabled digital twin method tailored for effective 3D damage visualization. The method's key contributions include: 1) utilizing GS-based 3D reconstruction to visualize 2D damage segmentation results while reducing segmentation errors; 2) developing a multi-scale reconstruction strategy to balance efficiency and damage detail; 3) enabling digital twin updates as damage evolves over time. Demonstrated on an open-source synthetic dataset for post-earthquake inspections, the proposed approach offers a promising solution for comprehensive 3D damage visualization in civil infrastructure digital twins.
[44] Analytic Score Optimization for Multi Dimension Video Quality Assessment
Boda Lin,Yongjie Zhu,Wenyu Qin,Meng Wang,Pengfei Wan
Main category: cs.CV
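TL;DR: This paper proposes a multi-dimensional approach to video quality assessment (VQA), builds the large-scale multi-dimensional dataset UltraVQA, and designs a theoretically grounded method, Analytic Score Optimization (ASO), that improves discrete quality score prediction and alignment with human preferences.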
Details
Motivation: Traditional VQA is limited to a single MOS score and cannot capture the complex quality characteristics of user-generated content (UGC); finer-grained, multi-dimensional, and interpretable quality annotation and modeling are needed. Method: Builds UltraVQA, a large-scale multi-dimensional UGC VQA dataset covering five quality dimensions with fine-grained sub-attribute labels and GPT-generated rationales; proposes Analytic Score Optimization (ASO), which frames quality assessment as a regularized decision-making process and derives a closed-form solution that captures the ordinal nature of human ratings. Result: ASO outperforms most baselines, including closed-source APIs and open-source models, reduces the mean absolute error (MAE) of quality prediction, and improves agreement with human ranking preferences. Conclusion: Multi-dimensional, interpretable annotation and reinforcement-based alignment are key to advancing VQA. Abstract: Video Quality Assessment (VQA) is evolving beyond single-number mean opinion score toward richer, multi-faceted evaluations of video content. In this paper, we present a large-scale multi-dimensional VQA dataset UltraVQA that encompasses diverse User-Generated Content (UGC) annotated across five key quality dimensions: Motion Quality, Motion Amplitude, Aesthetic Quality, Content Quality, and Clarity Quality. Each video in our dataset is scored by over 3 human raters on these dimensions, with fine-grained sub-attribute labels, and accompanied by an explanatory rationale generated by GPT based on the collective human judgments. To better leverage these rich annotations and improve discrete quality score assessment, we introduce Analytic Score Optimization (ASO), a theoretically grounded post-training objective derived for multi-dimensional VQA. By reframing quality assessment as a regularized decision-making process, we obtain a closed-form solution that naturally captures the ordinal nature of human ratings, ensuring alignment with human ranking preferences. In experiments, our method outperforms most baselines including closed-source APIs and open-source models, while also reducing mean absolute error (MAE) in quality prediction. Our work highlights the importance of multi-dimensional, interpretable annotations and reinforcement-based alignment in advancing video quality assessment.
[45] DODO: Discrete OCR Diffusion Models
Sean Man,Roy Ganz,Roi Ronen,Shahar Tsiper,Shai Mazor,Niv Nayman
Main category: cs.CV
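TL;DR: This paper proposes DODO, the first vision-language model (VLM) to apply block discrete diffusion to OCR; it keeps near state-of-the-art accuracy while running up to 3x faster at inference, addressing the inefficiency of conventional autoregressive decoding.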
Details
Motivation: Existing VLMs based on autoregressive decoding are computationally expensive and slow on OCR; OCR is a highly deterministic task that in principle suits parallel decoding, but existing masked diffusion models are structurally unstable and cannot satisfy OCR's strict exact-match requirement. Method: Proposes DODO, which uses block discrete diffusion to decompose text generation into blocks that are diffused in parallel, mitigating the synchronization errors of global diffusion. Result: Reaches near state-of-the-art accuracy on OCR with up to 3x faster inference than autoregressive baselines. Conclusion: Block discrete diffusion is an effective way to speed up inference for deterministic vision-language tasks such as OCR, and DODO demonstrates its feasibility. Abstract: Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLMs) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents as it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; they introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference compared to autoregressive baselines.
[46] StereoAdapter-2: Globally Structure-Consistent Underwater Stereo Depth Estimation
Zeyu Ren,Xiang Li,Yiran Wang,Zeyu Zhang,Hao Tang
Main category: cs.CV
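TL;DR: This paper proposes StereoAdapter-2, which replaces the conventional ConvGRU with a ConvSS2D operator based on selective state space models to make long-range disparity propagation in underwater stereo matching more efficient, and builds the large-scale synthetic dataset UW-StereoDepth-80K, achieving state-of-the-art zero-shot underwater depth estimation.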
Details
Motivation: Underwater stereo depth estimation suffers from severe domain shift caused by wavelength-dependent light attenuation, scattering, and refraction; existing methods that combine monocular foundation models with GRU-based iterative refinement are limited by the sequential gating and local convolutions of GRUs and perform poorly in large-disparity and textureless regions. Method: Proposes a ConvSS2D update operator with a four-directional scanning strategy that aligns with epipolar geometry and preserves vertical structural consistency, enabling efficient long-range spatial propagation in a single step; builds the UW-StereoDepth-80K synthetic dataset by combining semantic-aware style transfer with geometry-consistent novel view synthesis; integrates dynamic LoRA adaptation. Result: Zero-shot performance improves by 17% on TartanAir-UW and 7.2% on SQUID, with real-world validation on the BlueROV2 platform. Conclusion: By combining state space modeling with high-quality synthetic data, StereoAdapter-2 improves the generalization and efficiency of underwater stereo matching and provides more reliable depth estimation for underwater robotic perception. Abstract: Stereo depth estimation is fundamental to underwater robotic perception, yet suffers from severe domain shifts caused by wavelength-dependent light attenuation, scattering, and refraction. Recent approaches leverage monocular foundation models with GRU-based iterative refinement for underwater adaptation; however, the sequential gating and local convolutional kernels in GRUs necessitate multiple iterations for long-range disparity propagation, limiting performance in large-disparity and textureless underwater regions. In this paper, we propose StereoAdapter-2, which replaces the conventional ConvGRU updater with a novel ConvSS2D operator based on selective state space models. The proposed operator employs a four-directional scanning strategy that naturally aligns with epipolar geometry while capturing vertical structural consistency, enabling efficient long-range spatial propagation within a single update step at linear computational complexity. Furthermore, we construct UW-StereoDepth-80K, a large-scale synthetic underwater stereo dataset featuring diverse baselines, attenuation coefficients, and scattering parameters through a two-stage generative pipeline combining semantic-aware style transfer and geometry-consistent novel view synthesis. Combined with dynamic LoRA adaptation inherited from StereoAdapter, our framework achieves state-of-the-art zero-shot performance on underwater benchmarks, with a 17% improvement on TartanAir-UW and a 7.2% improvement on SQUID, and real-world validation on the BlueROV2 platform demonstrates the robustness of our approach.
Code: https://github.com/AIGeeksGroup/StereoAdapter-2. Website: https://aigeeksgroup.github.io/StereoAdapter-2.
[47] SemCovNet: Towards Fair and Semantic Coverage-Aware Learning for Underrepresented Visual Concepts
Sakib Ahammed,Xia Cui,Xinqi Fan,Wenqi Lu,Moi Hoon Yap
Main category: cs.CV
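TL;DR: This paper proposes the Semantic Coverage-Aware Network (SemCovNet) to address Semantic Coverage Imbalance (SCI) in vision models, improving semantic fairness and model reliability through semantic descriptor mapping, descriptor attention modulation, and a descriptor-visual alignment loss.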
Details
Motivation: Existing vision datasets exhibit Semantic Coverage Imbalance (SCI), an overlooked bias that stems from the long-tailed distribution of semantic representations and affects how models learn and reason about rare but meaningful semantics. Method: Proposes SemCovNet, which comprises a Semantic Descriptor Map (SDM), a Descriptor Attention Modulation (DAM) module, and a Descriptor-Visual Alignment (DVA) loss, and introduces a Coverage Disparity Index (CDI) to quantify semantic fairness. Result: Experiments on multiple datasets show that SemCovNet substantially reduces CDI and improves model reliability and semantic fairness. Conclusion: SCI is a measurable and correctable bias, and this work lays a foundation for semantic fairness and interpretable vision learning. Abstract: Modern vision models increasingly rely on rich semantic representations that extend beyond class labels to include descriptive concepts and contextual attributes. However, existing datasets exhibit Semantic Coverage Imbalance (SCI), a previously overlooked bias arising from long-tailed semantic representations. Unlike class imbalance, SCI occurs at the semantic level, affecting how models learn and reason about rare yet meaningful semantics. To mitigate SCI, we propose Semantic Coverage-Aware Network (SemCovNet), a novel model that explicitly learns to correct semantic coverage disparities. SemCovNet integrates a Semantic Descriptor Map (SDM) for learning semantic representations, a Descriptor Attention Modulation (DAM) module that dynamically weights visual and concept features, and a Descriptor-Visual Alignment (DVA) loss that aligns visual features with descriptor semantics. We quantify semantic fairness using a Coverage Disparity Index (CDI), which measures the alignment between coverage and error. Extensive experiments across multiple datasets demonstrate that SemCovNet enhances model reliability and substantially reduces CDI, achieving fairer and more equitable performance. This work establishes SCI as a measurable and correctable bias, providing a foundation for advancing semantic fairness and interpretable vision learning.
[48] Xray-Visual Models: Scaling Vision models on Industry Scale Data
Shlok Mishra,Tsung-Yu Lin,Linda Wang,Hongli Xu,Yimin Liu,Michael Hsu,Chaitanya Ahuja,Hao Yuan,Jianpeng Cheng,Hong-You Chen,Haoyuan Xu,Chao Li,Abhijeet Awasthi,Jihye Moon,Don Husa,Michael Ge,Sumedha Singla,Arkabandhu Chowdhury,Phong Dingh,Satya Narayan Shukla,Yonghuan Yang,David Jacobs,Qi Guo,Jun Xiao,Xiangjun Fan,Aashu Singh
Main category: cs.CV
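TL;DR: Xray-Visual is a unified vision model trained on large-scale social media data that combines image and video understanding. It uses a three-stage training strategy and an efficient ViT architecture (EViT), reaches state-of-the-art results on multiple benchmarks, and improves cross-modal retrieval through LLM2CLIP.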
Details
Motivation: To address the difficulty existing vision models have in balancing semantic diversity, label quality, computational efficiency, and generalization when trained on large-scale, multimodal (image and video), noisy data. Method: Builds a unified architecture on a Vision Transformer backbone with EViT; designs a three-stage training pipeline of self-supervised MAE pretraining, semi-supervised hashtag classification, and CLIP-style contrastive learning; trains on 15 billion image-text pairs and 10 billion video-hashtag pairs with data balancing and noise suppression; introduces LLM2CLIP, which replaces the conventional text encoder with a large language model. Result: State-of-the-art results on benchmarks including ImageNet, Kinetics, HMDB51, and MSCOCO; strong robustness to domain shift and adversarial perturbations; LLM2CLIP significantly improves cross-modal retrieval and real-world generalization. Conclusion: Xray-Visual offers a scalable, efficient, and robust approach to multimodal vision modeling, shows that well-curated industry-scale noisy data can support high-performance training, and highlights the potential of large language models as text encoders. Abstract: We present Xray-Visual, a unified vision model architecture for large-scale image and video understanding trained on industry-scale social media data. Our model leverages over 15 billion curated image-text pairs and 10 billion video-hashtag pairs from Facebook and Instagram, employing robust data curation pipelines that incorporate balancing and noise suppression strategies to maximize semantic diversity while minimizing label noise. We introduce a three-stage training pipeline that combines self-supervised MAE, semi-supervised hashtag classification, and CLIP-style contrastive learning to jointly optimize image and video modalities. Our architecture builds on a Vision Transformer backbone enhanced with efficient token reorganization (EViT) for improved computational efficiency. Extensive experiments demonstrate that Xray-Visual achieves state-of-the-art performance across diverse benchmarks, including ImageNet for image classification, Kinetics and HMDB51 for video understanding, and MSCOCO for cross-modal retrieval. The model exhibits strong robustness to domain shift and adversarial perturbations. We further demonstrate that integrating large language models as text encoders (LLM2CLIP) significantly enhances retrieval performance and generalization capabilities, particularly in real-world environments.
Xray-Visual establishes new benchmarks for scalable, multimodal vision models, while maintaining superior accuracy and computational efficiency.
[49] HS-3D-NeRF: 3D Surface and Hyperspectral Reconstruction From Stationary Hyperspectral Images Using Multi-Channel NeRFs
Kibon Ku,Talukder Z. Jubery,Adarsh Krishnamurthy,Baskar Ganapathysubramanian
Main category: cs.CV
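TL;DR: This paper proposes HSI-SC-NeRF, a stationary-camera multi-channel neural radiance field framework for high-throughput, high-fidelity hyperspectral 3D reconstruction in postharvest inspection of agricultural produce.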
Details
Motivation: Conventional approaches to combining hyperspectral imaging with 3D reconstruction require complex hardware and are hard to fit into automated phenotyping platforms; existing NeRF methods rely on a moving camera, which limits throughput and reproducibility in indoor agricultural settings. Method: Uses a stationary camera with a rotating sample to capture multi-view hyperspectral data inside a Teflon chamber with diffuse illumination; estimates poses from ArUco markers and converts them to the camera frame through simulated pose transformations; designs a multi-channel NeRF that jointly optimizes all spectral bands with a composite spectral loss and two-stage training (geometric initialization, then radiometric refinement). Result: Validated on three agricultural produce samples, with high spatial reconstruction accuracy and strong spectral fidelity across the visible and near-infrared range. Conclusion: HSI-SC-NeRF addresses hyperspectral 3D reconstruction with a stationary camera and has practical potential for integration into automated agricultural workflows. Abstract: Advances in hyperspectral imaging (HSI) and 3D reconstruction have enabled accurate, high-throughput characterization of agricultural produce quality and plant phenotypes, both essential for advancing agricultural sustainability and breeding programs. HSI captures detailed biochemical features of produce, while 3D geometric data substantially improves morphological analysis. However, integrating these two modalities at scale remains challenging, as conventional approaches involve complex hardware setups incompatible with automated phenotyping systems. Recent advances in neural radiance fields (NeRF) offer computationally efficient 3D reconstruction but typically require moving-camera setups, limiting throughput and reproducibility in standard indoor agricultural environments. To address these challenges, we introduce HSI-SC-NeRF, a stationary-camera multi-channel NeRF framework for high-throughput hyperspectral 3D reconstruction targeting postharvest inspection of agricultural produce. Multi-view hyperspectral data is captured using a stationary camera while the object rotates within a custom-built Teflon imaging chamber providing diffuse, uniform illumination. Object poses are estimated via ArUco calibration markers and transformed to the camera frame of reference through simulated pose transformations, enabling standard NeRF training on stationary-camera data. A multi-channel NeRF formulation optimizes reconstruction across all hyperspectral bands jointly using a composite spectral loss, supported by a two-stage training protocol that decouples geometric initialization from radiometric refinement.
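The stationary-camera setup described above rests on a simple equivalence: rotating the object by an angle θ in front of a fixed camera looks, from the object's point of view, like the camera orbiting by −θ. The sketch below illustrates only that geometric idea for a rotation about the vertical axis; it is not the paper's pose pipeline, and the axis convention, function names, and coordinates are assumptions.

```python
import math


def rotate_about_z(point, angle_rad):
    """Rotate a 3D point about the z axis by angle_rad (counter-clockwise)."""
    x, y, z = point
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x - s * y, s * x + c * y, z)


def simulated_camera_position(fixed_camera_position, object_angle_rad):
    """Camera position expressed in the rotating object's frame.

    If the object is turned by +theta while the camera stays fixed, the
    camera appears, relative to the object, to have moved by -theta.
    """
    return rotate_about_z(fixed_camera_position, -object_angle_rad)


# Fixed camera one unit along +x; the object is turned by 90 degrees.
virtual_camera = simulated_camera_position((1.0, 0.0, 0.0), math.pi / 2)
```

Repeating this for every turntable angle yields a ring of virtual camera positions around a static object, which is the kind of input a standard moving-camera NeRF trainer expects.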
Experiments on three agricultural produce samples demonstrate high spatial reconstruction accuracy and strong spectral fidelity across the visible and near-infrared spectrum, confirming the suitability of HSI-SC-NeRF for integration into automated agricultural workflows.
[50] DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers
Dahye Kim,Deepti Ghadiyaram,Raghudeep Gadde
Main category: cs.CV
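TL;DR: This paper proposes dynamic tokenization, a strategy that adapts patch size to the denoising timestep and content complexity during generation, substantially improving the inference efficiency of Diffusion Transformers (DiTs) while preserving generation quality.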
Details
Motivation: Existing DiTs tokenize with fixed-size patches, which causes redundant computation and offers no flexibility even though different denoising timesteps need different levels of detail. Method: A dynamic tokenization method adjusts patch size at inference time according to the denoising timestep and content complexity: larger patches early on to model global structure, smaller patches later to refine local detail. Result: Inference speedups of up to 3.52× on FLUX-1.Dev and 3.2× on Wan 2.1, without loss of generation quality or prompt adherence. Conclusion: Dynamic tokenization is an efficient, plug-and-play inference optimization that offers a new route to making DiT-style models practical. Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content's complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps only require coarser patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation and substantially reduces cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to 3.52× and 3.2× speedup on FLUX-1.Dev and Wan 2.1, respectively, without compromising the generation quality and prompt adherence.
[51] Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling
Divyam Madaan,Sumit Chopra,Kyunghyun Cho
Main category: cs.CV
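TL;DR: PRIMO is a supervised latent-variable imputation model that quantifies the predictive impact of any missing modality in multimodal learning. It supports training and inference on incomplete multimodal data and performs comparably to unimodal and multimodal baselines on several datasets.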
Details
Motivation: Most existing multimodal large language models (MLLMs) assume all modalities are available at training and inference time, but real multimodal data is often missing modalities, collected asynchronously, or complete for only a subset of examples, so methods that make effective use of incomplete data are needed. Method: PRIMO models the missing modality with a supervised latent variable that captures its relationship to the observed modality in the context of the prediction task; training uses all examples (including those with partial modalities), and inference draws many samples from the learned distribution over the missing modality to obtain the marginal predictive distribution and to compute modality impact. Result: On synthetic XOR, Audio-Vision MNIST, and MIMIC-III (mortality and ICD-9 prediction), PRIMO matches unimodal baselines when a modality is fully missing and multimodal baselines when all modalities are available; it also provides a variance-based, instance-level measure of modality impact and visualizes the set of plausible labels produced by different completions. Conclusion: PRIMO offers a unified and interpretable framework for learning from incomplete multimodal data, maintaining predictive performance while quantifying each modality's contribution to individual predictions, which improves practicality and interpretability in sparse real-world data settings. Abstract: Despite the recent success of Multimodal Large Language Models (MLLMs), existing approaches predominantly assume the availability of multiple modalities during training and inference. In practice, multimodal data is often incomplete because modalities may be missing, collected asynchronously, or available only for a subset of examples. In this work, we propose PRIMO, a supervised latent-variable imputation model that quantifies the predictive impact of any missing modality within the multimodal learning setting. PRIMO enables the use of all available training examples, whether modalities are complete or partial. Specifically, it models the missing modality through a latent variable that captures its relationship with the observed modality in the context of prediction. During inference, we draw many samples from the learned distribution over the missing modality to both obtain the marginal predictive distribution (for the purpose of prediction) and analyze the impact of the missing modalities on the prediction for each instance. We evaluate PRIMO on a synthetic XOR dataset, Audio-Vision MNIST, and MIMIC-III for mortality and ICD-9 prediction. Across all datasets, PRIMO obtains performance comparable to unimodal baselines when a modality is fully missing and to multimodal baselines when all modalities are available. PRIMO quantifies the predictive impact of a modality at the instance level using a variance-based metric computed from predictions across latent completions.
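A variance-based impact score of this kind could look roughly like the sketch below. The abstract does not give the exact formula, so the definition used here (total variance of class probabilities across sampled completions) and the function names are assumptions made for illustration only.

```python
def marginal_prediction(probs):
    """Average class probabilities over K sampled completions.

    probs: list of K probability vectors, one per latent completion.
    """
    k = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / k for c in range(n_classes)]

def modality_impact(probs):
    """Assumed impact score: per-class variance across completions, summed."""
    mean = marginal_prediction(probs)
    k = len(probs)
    return sum(
        sum((p[c] - mean[c]) ** 2 for p in probs) / k
        for c in range(len(mean))
    )

# Completions that agree: the missing modality barely matters here.
print(modality_impact([[0.9, 0.1], [0.9, 0.1]]))
# Completions that flip the label: the missing modality is decisive.
print(modality_impact([[1.0, 0.0], [0.0, 1.0]]))
```

A score near zero means every plausible completion leads to the same prediction, while a large score flags instances where the missing modality would change the answer.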
We visually demonstrate how varying completions of the missing modality result in a set of plausible labels.
[52] Patch-Based Spatial Authorship Attribution in Human-Robot Collaborative Paintings
Eric Chen,Patricia Alves-Oliveira
Main category: cs.CV
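TL;DR: This paper proposes a patch-based spatial authorship attribution framework for identifying authorship in human-robot collaborative paintings. It reaches 88.8% patch-level accuracy on 15 abstract paintings, and conditional Shannon entropy indicates that the model detects mixed-authorship regions rather than simply misclassifying them.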
Details
Motivation: As agentic AI takes a growing part in creative production, documenting human versus AI authorship in collaborative artworks matters to artists, collectors, and legal determinations; existing methods struggle with the blurred styles, difficult annotation, and scarce data that collaboration brings. Method: A patch-based attribution framework uses images captured on commodity flatbed scanners and is evaluated with leave-one-painting-out cross-validation; conditional Shannon entropy quantifies stylistic overlap between human and robot, and manually annotated hybrid regions support an uncertainty analysis. Result: 88.8% patch-level accuracy (86.7% painting-level) on 15 human-robot abstract paintings, outperforming texture-based and pretrained-feature baselines (68.0%-84.7%); hybrid regions show 64% higher uncertainty than pure paintings (p=0.003), suggesting that the model detects mixed authorship. Conclusion: The trained model is specific to this human-robot pair, but the method provides a sample-efficient approach to spatial authorship attribution in data-scarce human-AI creative workflows, with potential to extend to other human-robot collaborative painting settings. Abstract: As agentic AI becomes increasingly involved in creative production, documenting authorship has become critical for artists, collectors, and legal contexts. We present a patch-based framework for spatial authorship attribution within human-robot collaborative painting practice, demonstrated through a forensic case study of one human artist and one robotic system across 15 abstract paintings. Using commodity flatbed scanners and leave-one-painting-out cross-validation, the approach achieves 88.8% patch-level accuracy (86.7% painting-level via majority vote), outperforming texture-based and pretrained-feature baselines (68.0%-84.7%). For collaborative artworks, where ground truth is inherently ambiguous, we use conditional Shannon entropy to quantify stylistic overlap; manually annotated hybrid regions exhibit 64% higher uncertainty than pure paintings (p=0.003), suggesting the model detects mixed authorship rather than classification failure. The trained model is specific to this human-robot pair but provides a methodological grounding for sample-efficient attribution in data-scarce human-AI creative workflows that, in the future, has the potential to extend authorship attribution to any human-robot collaborative painting.
[53] PartRAG: Retrieval-Augmented Part-Level 3D Generation and Editing
Peize Li,Zeyu Zhang,Hao Tang
Main category: cs.CV
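TL;DR: This paper proposes PartRAG, a single-image 3D generation framework that combines retrieval augmentation with a diffusion transformer and supports part-level structure modeling and localized editing. Hierarchical contrastive retrieval injects diverse, physically plausible part priors from an external part database, and a masked part editor in a shared canonical space enables efficient, consistent part swaps and attribute adjustments. The authors report improved geometric accuracy and multi-view consistency on several benchmarks.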
Details
Motivation: Single-image 3D generation with part-level structure faces two challenges: learned priors struggle to cover the long tail of part geometries while maintaining multi-view consistency, and existing systems offer little support for precise, localized edits. Method: PartRAG has two core modules: (1) a hierarchical contrastive retrieval module that aligns dense image patches with 3D part latents at both part and object granularity and retrieves from a bank of 1,236 part-annotated assets; (2) a masked part-level editor operating in a shared canonical space that supports part swaps, attribute refinements, and compositional updates without regenerating the whole object. Result: Competitive results on Objaverse, ShapeNet, and ABO: on Objaverse, Chamfer Distance drops from 0.1726 to 0.1528 and F-Score rises from 0.7472 to 0.844; inference takes 38 s and interactive edits take 5-8 s; qualitatively, part boundaries are sharper, thin structures are better preserved, and behavior on articulated objects is robust. Conclusion: PartRAG eases the bottlenecks of part diversity and edit controllability in single-image 3D generation and supports the case for combining retrieval augmentation with canonical-space editing for controllable 3D content creation. Abstract: Single-image 3D generation with part-level structure remains challenging: learned priors struggle to cover the long tail of part geometries and maintain multi-view consistency, and existing systems provide limited support for precise, localized edits. We present PartRAG, a retrieval-augmented framework that integrates an external part database with a diffusion transformer to couple generation with an editable representation. To overcome the first challenge, we introduce a Hierarchical Contrastive Retrieval module that aligns dense image patches with 3D part latents at both part and object granularity, retrieving from a curated bank of 1,236 part-annotated assets to inject diverse, physically plausible exemplars into denoising. To overcome the second challenge, we add a masked, part-level editor that operates in a shared canonical space, enabling swaps, attribute refinements, and compositional updates without regenerating the whole object while preserving non-target parts and multi-view consistency. PartRAG achieves competitive results on Objaverse, ShapeNet, and ABO, reducing Chamfer Distance from 0.1726 to 0.1528 and raising F-Score from 0.7472 to 0.844 on Objaverse, with inference of 38s and interactive edits in 5-8s. Qualitatively, PartRAG produces sharper part boundaries, better thin-structure fidelity, and robust behavior on articulated objects. Code: https://github.com/AIGeeksGroup/PartRAG.
Website: https://aigeeksgroup.github.io/PartRAG.
[54] Amber-Image: Efficient Compression of Large-Scale Diffusion Transformers
Chaojie Yang,Tian Li,Yue Zhang,Jun Gao
Main category: cs.CV
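TL;DR: This paper proposes an efficient compression framework that compresses Qwen-Image, a 60-layer dual-stream MMDiT model, into the lightweight T2I models Amber-Image (10B and 6B). Using timestep-sensitive pruning, local weight averaging, layer-wise distillation, and progressive distillation, it cuts parameters by 70% and keeps the training cost under 2,000 GPU hours while retaining high-fidelity image generation and text rendering.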
Details
Motivation: DiT architectures have advanced text-to-image generation, but their high computational cost makes deployment difficult, so efficient compression methods are needed. Method: A compression framework that avoids training from scratch: Amber-Image-10B is obtained from Qwen-Image via timestep-sensitive depth pruning, local-weight-averaging initialization, layer-wise distillation, and full-parameter fine-tuning; Amber-Image-6B is then obtained with a hybrid-stream architecture (deep-layer dual streams converted to a single stream initialized from the image branch), progressive distillation, and lightweight fine-tuning. Result: Parameters are reduced by 70% with fewer than 2,000 GPU hours in total; on DPG-Bench and LongText-Bench, the models achieve high-fidelity synthesis and text rendering comparable to much larger models. Conclusion: The framework improves the efficiency and deployability of DiT-style models and strikes a favorable balance between performance and cost. Abstract: Diffusion Transformer (DiT) architectures have significantly advanced Text-to-Image (T2I) generation but suffer from prohibitive computational costs and deployment barriers. To address these challenges, we propose an efficient compression framework that transforms the 60-layer dual-stream MMDiT-based Qwen-Image into lightweight models without training from scratch. Leveraging this framework, we introduce Amber-Image, a series of streamlined T2I models. We first derive Amber-Image-10B using a timestep-sensitive depth pruning strategy, where retained layers are reinitialized via local weight averaging and optimized through layer-wise distillation and full-parameter fine-tuning. Building on this, we develop Amber-Image-6B by introducing a hybrid-stream architecture that converts deep-layer dual streams into a single stream initialized from the image branch, further refined via progressive distillation and lightweight fine-tuning. Our approach reduces parameters by 70% and eliminates the need for large-scale data engineering. Notably, the entire compression and training pipeline (from the 10B to the 6B variant) requires fewer than 2,000 GPU hours, demonstrating exceptional cost-efficiency compared to training from scratch. Extensive evaluations on benchmarks like DPG-Bench and LongText-Bench show that Amber-Image achieves high-fidelity synthesis and superior text rendering, matching much larger models.
[55] StructCore: Structure-Aware Image-Level Scoring for Training-Free Unsupervised Anomaly Detection
Joongwon Chae,Lihui Luo,Yang Liu,Runming Wang,Dongmei Yu,Zeming Liang,Xi Yuan,Dayan Zhang,Zhenglin Chen,Peiwu Qin,Ilmoon Chae
Main category: cs.CV
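TL;DR: This paper proposes StructCore, a training-free, structure-aware image-level scoring method for memory-bank-based unsupervised anomaly detection. It captures the distributional and spatial characteristics of the anomaly score map and applies a diagonal Mahalanobis calibration estimated from normal samples to improve image-level detection.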
Details
Motivation: Max pooling is the standard in memory-bank-based unsupervised anomaly detection, but it relies on a single extreme response and ignores how anomaly evidence is distributed and structured across the image, so normal and anomalous scores often overlap. Method: StructCore computes a low-dimensional structural descriptor phi(S) of the anomaly score map that captures its distributional and spatial characteristics, then refines image-level scoring with a diagonal Mahalanobis calibration estimated from normal training samples, leaving pixel-level localization unchanged. Result: Image-level AUROC of 99.6% on MVTec AD and 98.4% on VisA, improving on conventional max pooling. Conclusion: By exploiting structural signatures that max pooling misses, StructCore delivers robust image-level anomaly detection without additional training. Abstract: Max pooling is the de facto standard for converting anomaly score maps into image-level decisions in memory-bank-based unsupervised anomaly detection (UAD). However, because it relies on a single extreme response, it discards most information about how anomaly evidence is distributed and structured across the image, often causing normal and anomalous scores to overlap. We propose StructCore, a training-free, structure-aware image-level scoring method that goes beyond max pooling. Given an anomaly score map, StructCore computes a low-dimensional structural descriptor phi(S) that captures distributional and spatial characteristics, and refines image-level scoring via a diagonal Mahalanobis calibration estimated from train-good samples, without modifying pixel-level localization. StructCore achieves image-level AUROC scores of 99.6% on MVTec AD and 98.4% on VisA, demonstrating robust image-level anomaly detection by exploiting structural signatures missed by max pooling.
[56] Cholec80-port: A Geometrically Consistent Trocar Port Segmentation Dataset for Robust Surgical Scene Understanding
Shunsuke Kikuchi,Atsushi Kouno,Hiroki Matsuzaki
Main category: cs.CV
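TL;DR: This paper presents the Cholec80-port dataset and a unified port-mask annotation standard (excluding the central opening) to address the way trocar ports interfere with geometry-based tasks such as image stitching and 3D reconstruction in laparoscopic images. Experiments show that geometrically consistent annotations substantially improve cross-dataset robustness.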
Details
Motivation: Trocar ports have specular, textured surfaces that attract spurious feature points in laparoscopic images, and public datasets lack explicit, geometrically consistent port labels (existing annotations often mask the central opening), which harms geometry-based downstream tasks. Method: The authors build Cholec80-port, a high-fidelity trocar port segmentation dataset, and define a rigorous standard operating procedure (SOP) whose mask covers only the port sleeve and excludes the central opening; they also cleanse and unify several existing public datasets under the same SOP. Result: Geometrically consistent annotations substantially improve cross-dataset robustness, with gains beyond what increasing dataset size alone provides. Conclusion: A geometrically consistent annotation standard for trocar ports is important for the generalization and stability of geometry-based laparoscopic vision tasks; Cholec80-port and its SOP provide a benchmark and practical guidance. Abstract: Trocar ports are camera-fixed, pseudo-static structures that can persistently occlude laparoscopic views and attract disproportionate feature points due to specular, textured surfaces. This makes ports particularly detrimental to geometry-based downstream pipelines such as image stitching, 3D reconstruction, and visual SLAM, where dynamic or non-anatomical outliers degrade alignment and tracking stability. Despite this practical importance, explicit port labels are rare in public surgical datasets, and existing annotations often violate geometric consistency by masking the central lumen (opening), even when anatomical regions are visible through it. We present Cholec80-port, a high-fidelity trocar port segmentation dataset derived from Cholec80, together with a rigorous standard operating procedure (SOP) that defines a port-sleeve mask excluding the central opening. We additionally cleanse and unify existing public datasets under the same SOP. Experiments demonstrate that geometrically consistent annotations substantially improve cross-dataset robustness beyond what dataset size alone provides.
[57] Cross Pseudo Labeling For Weakly Supervised Video Anomaly Detection
Lee Dayeon,Kim Dongheyong,Park Chaewon,Woo Sungmin,Lee Sangyoun
Main category: cs.CV
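TL;DR: CPL-VAD is a dual-branch weakly supervised video anomaly detection framework that uses cross pseudo labeling to combine temporal localization with semantic classification, reaching state-of-the-art performance on XD-Violence and UCF-Crime.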
Details
Motivation: Existing weakly supervised video anomaly detection methods use only video-level labels and struggle to achieve both accurate snippet-level localization and anomaly category recognition. Method: CPL-VAD is a dual-branch framework: one branch performs binary anomaly detection (snippet-level localization), and the other uses vision-language alignment to classify abnormal event categories; the branches exchange pseudo labels so that each benefits from the other's strengths. Result: On XD-Violence and UCF-Crime, CPL-VAD achieves state-of-the-art performance in both anomaly detection and abnormal category classification. Conclusion: Cross pseudo labeling effectively combines temporal modeling with semantic understanding and offers a new approach to weakly supervised video anomaly detection. Abstract: Weakly supervised video anomaly detection aims to detect anomalies and identify abnormal categories with only video-level labels. We propose CPL-VAD, a dual-branch framework with cross pseudo labeling. The binary anomaly detection branch focuses on snippet-level anomaly localization, while the category classification branch leverages vision-language alignment to recognize abnormal event categories. By exchanging pseudo labels, the two branches transfer complementary strengths, combining temporal precision with semantic discrimination. Experiments on XD-Violence and UCF-Crime demonstrate that CPL-VAD achieves state-of-the-art performance in both anomaly detection and abnormal category classification.
[58] ComptonUNet: A Deep Learning Model for GRB Localization with Compton Cameras under Noisy and Low-Statistic Conditions
Shogo Sato,Kazuo Tanaka,Shojun Ogasawara,Kazuki Yamamoto,Kazuhiko Murasaki,Ryuichi Tanida,Jun Kataoka
Main category: cs.CV
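TL;DR: This paper proposes ComptonUNet, a hybrid deep learning framework for robustly localizing faint gamma-ray burst (GRB) sources under low photon statistics and strong background noise. The model combines the statistical efficiency of direct reconstruction with image denoising and outperforms existing methods in simulated tests.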
Details
Motivation: Faint GRBs originating from the distant universe may provide unique insights into the early stages of star formation, but detecting and localizing such weak sources remains challenging owing to low photon statistics and substantial background noise. Method: ComptonUNet is a hybrid deep learning framework that jointly processes raw data and reconstructs images for robust GRB localization; it combines the statistical efficiency of direct reconstruction models with the denoising capabilities of image-based architectures. Result: In simulations, ComptonUNet significantly outperforms existing approaches, achieving improved localization accuracy across a wide range of low-statistic and high-background scenarios. Conclusion: ComptonUNet is an effective solution for GRB localization under realistic low-signal, high-noise conditions, enabling better study of early-universe astrophysics. Abstract: Gamma-ray bursts (GRBs) are among the most energetic transient phenomena in the universe and serve as powerful probes for high-energy astrophysical processes. In particular, faint GRBs originating from a distant universe may provide unique insights into the early stages of star formation. However, detecting and localizing such weak sources remains challenging owing to low photon statistics and substantial background noise. Although recent machine learning models address individual aspects of these challenges, they often struggle to balance the trade-off between statistical robustness and noise suppression. Consequently, we propose ComptonUNet, a hybrid deep learning framework that jointly processes raw data and reconstructs images for robust GRB localization. ComptonUNet was designed to operate effectively under conditions of limited photon statistics and strong background contamination by combining the statistical efficiency of direct reconstruction models with the denoising capabilities of image-based architectures.
We perform realistic simulations of GRB-like events embedded in background environments representative of low-Earth orbit missions to evaluate the performance of ComptonUNet. Our results demonstrate that ComptonUNet significantly outperforms existing approaches, achieving improved localization accuracy across a wide range of low-statistic and high-background scenarios.
[59] 3D Scene Rendering with Multimodal Gaussian Splatting
Chi-Shiang Gau,Konstantinos D. Polyzos,Athanasios Bacharis,Saketh Madhuvarasu,Tara Javidi
Main category: cs.CV
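TL;DR: This paper proposes a multimodal framework that combines radio-frequency (RF) sensing, such as automotive radar, with 3D Gaussian Splatting (GS) rendering to overcome the initialization difficulties of vision-only GS in adverse weather, low light, or occlusion. Sparse RF depth measurements drive efficient depth prediction and yield a high-quality point cloud for initializing a range of GS models, improving robustness and rendering quality.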
Details
Motivation: Conventional vision-based 3D Gaussian Splatting (GS) relies on many camera views for initialization; it degrades and incurs extra initialization cost when visual cues are unreliable, as in adverse weather, low illumination, or partial occlusion. RF signals are inherently robust to these conditions, so adding RF sensing can improve the reliability and efficiency of GS. Method: A multimodal framework combines RF sensing (such as automotive radar) with GS rendering; efficient depth prediction from sparse RF depth measurements produces a high-quality 3D point cloud that initializes the Gaussian primitives and works across multiple GS architectures. Result: Numerical tests show that the RF-informed GS approach achieves high-fidelity 3D scene rendering driven by structural accuracy and outperforms vision-only GS, particularly in visually degraded scenes. Conclusion: Combining RF sensing with GS rendering is an efficient and robust alternative that eases the initialization bottleneck of vision-based GS in difficult environments, supporting applications such as industrial monitoring, robotics, and autonomous driving. Abstract: 3D scene reconstruction and rendering are core tasks in computer vision, with applications spanning industrial monitoring, robotics, and autonomous driving. Recent advances in 3D Gaussian Splatting (GS) and its variants have achieved impressive rendering fidelity while maintaining high computational and memory efficiency. However, conventional vision-based GS pipelines typically rely on a sufficient number of camera views to initialize the Gaussian primitives and train their parameters, incurring additional processing cost during initialization while falling short in conditions where visual cues are unreliable, such as adverse weather, low illumination, or partial occlusions. To cope with these challenges, and motivated by the robustness of radio-frequency (RF) signals to weather, lighting, and occlusions, we introduce a multimodal framework that integrates RF sensing, such as automotive radar, with GS-based rendering as a more efficient and robust alternative to vision-only GS rendering. The proposed approach enables efficient depth prediction from only sparse RF-based depth measurements, yielding a high-quality 3D point cloud for initializing Gaussian functions across diverse GS architectures. Numerical tests demonstrate the merits of judiciously incorporating RF sensing into GS pipelines, achieving high-fidelity 3D scene rendering driven by RF-informed structural accuracy.
[60] B³-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates
Hiromichi Kamata,Samuel Arthur Munro,Fuminori Homma
Main category: cs.CV
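TL;DR: This paper proposes B³-Seg, an open-vocabulary interactive segmentation method for 3D Gaussian Splatting (3DGS) that needs no predefined camera viewpoints, no labels, and no retraining. It selects views actively using Beta-Bernoulli Bayesian updates and analytic Expected Information Gain (EIG), combining theoretical guarantees with low-latency operation.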
Details
Motivation: Existing 3DGS segmentation methods depend on predefined viewpoints, ground-truth labels, or costly retraining, which makes them impractical for the low-latency interactive editing needed in film and game production. Method: Segmentation is formulated as sequential Beta-Bernoulli Bayesian updates, and the next view is selected actively via analytic Expected Information Gain (EIG); the adaptive monotonicity and submodularity of EIG yield a greedy sampling policy with a (1-1/e) approximation guarantee. Result: On multiple datasets, B³-Seg achieves segmentation performance competitive with high-cost supervised methods within a few seconds end to end, with provable information efficiency. Conclusion: B³-Seg delivers camera-free, training-free, open-vocabulary interactive 3DGS segmentation that is both theoretically grounded and practical. Abstract: Interactive 3D Gaussian Splatting (3DGS) segmentation is essential for real-time editing of pre-reconstructed assets in film and game production. However, existing methods rely on predefined camera viewpoints, ground-truth labels, or costly retraining, making them impractical for low-latency use. We propose B³-Seg (Beta-Bernoulli Bayesian Segmentation for 3DGS), a fast and theoretically grounded method for open-vocabulary 3DGS segmentation under camera-free and training-free conditions. Our approach reformulates segmentation as sequential Beta-Bernoulli Bayesian updates and actively selects the next view via analytic Expected Information Gain (EIG). This Bayesian formulation guarantees the adaptive monotonicity and submodularity of EIG, which produces a greedy (1-1/e) approximation to the optimal view sampling policy. Experiments on multiple datasets show that B³-Seg achieves results competitive with high-cost supervised methods while completing end-to-end segmentation within a few seconds. The results demonstrate that B³-Seg enables practical, interactive 3DGS segmentation with provable information efficiency.
[61] BadCLIP++: Stealthy and Persistent Backdoors in Multimodal Contrastive Learning
Siyuan Liang,Yongcheng Jing,Yingjie Wang,Jiaxing Huang,Ee-chien Chang,Dacheng Tao
Main category: cs.CV
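TL;DR: This paper proposes BadCLIP++, a framework that improves the stealthiness and persistence of backdoor attacks on multimodal contrastive learning through a semantic-fusion QR micro-trigger, target-aligned subset selection, trigger-embedding stabilization, and elastic constraints on model parameters. At a poisoning rate of only 0.3%, it reaches a 99.99% digital attack success rate and remains robust across many defenses and in physical settings.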
Details
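Motivation: Existing backdoor attacks on multimodal contrastive learning models perform poorly under strong detection and continuous fine-tuning, mainly because of cross-modal inconsistency and gradient dilution at low poisoning rates; these coupled causes have not been adequately modeled or addressed. Method: BadCLIP++ is a unified framework that (1) uses a semantic-fusion QR micro-trigger together with target-aligned subset selection to improve stealthiness; (2) stabilizes trigger embeddings via radius shrinkage and centroid alignment, and stabilizes model parameters via curvature control and elastic weight consolidation; (3) provides the first theoretical analysis showing that, within a trust region, gradients from clean fine-tuning and the backdoor objective are co-directional, which bounds the degradation of attack success. Result: With only 0.3% poisoning, the digital attack success rate (ASR) reaches 99.99%, 11.4 points above baselines; across 19 defenses, ASR stays above 99.90% with a clean-accuracy drop below 0.8%; physical attacks succeed 65.03% of the time, and the attack is robust to watermark removal defenses. Conclusion: BadCLIP++ addresses the two core challenges of stealthiness and persistence in backdoor attacks on multimodal contrastive learning, with both empirical and theoretical support, and provides a new benchmark for security evaluation and defense research. Abstract: Research on backdoor attacks against multimodal contrastive learning models faces two key challenges: stealthiness and persistence. Existing methods often fail under strong detection or continuous fine-tuning, largely due to (1) cross-modal inconsistency that exposes trigger patterns and (2) gradient dilution at low poisoning rates that accelerates backdoor forgetting. These coupled causes remain insufficiently modeled and addressed. We propose BadCLIP++, a unified framework that tackles both challenges. For stealthiness, we introduce a semantic-fusion QR micro-trigger that embeds imperceptible patterns near task-relevant regions, preserving clean-data statistics while producing compact trigger distributions. We further apply target-aligned subset selection to strengthen signals at low injection rates. For persistence, we stabilize trigger embeddings via radius shrinkage and centroid alignment, and stabilize model parameters through curvature control and elastic weight consolidation, maintaining solutions within a low-curvature wide basin resistant to fine-tuning. We also provide the first theoretical analysis showing that, within a trust region, gradients from clean fine-tuning and backdoor objectives are co-directional, yielding a non-increasing upper bound on attack success degradation. Experiments demonstrate that with only 0.3% poisoning, BadCLIP++ achieves 99.99% attack success rate (ASR) in digital settings, surpassing baselines by 11.4 points.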
Across nineteen defenses, ASR remains above 99.90% with less than 0.8% drop in clean accuracy. The method further attains 65.03% success in physical attacks and shows robustness against watermark removal defenses.
[62] NRGS-SLAM: Monocular Non-Rigid SLAM for Endoscopy via Deformation-Aware 3D Gaussian Splatting
Jiwei Shan,Zeyu Cai,Yirui Li,Yongbo Chen,Lijun Han,Yun-hui Liu,Hesheng Wang,Shing Shin Cheng
Main category: cs.CV
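TL;DR: This paper proposes NRGS-SLAM, a monocular non-rigid SLAM system for endoscopy based on 3D Gaussian Splatting. A deformation-aware Gaussian map with learnable deformation probabilities, deformation-aware tracking and mapping modules, and a unified robust geometric loss decouple camera motion from soft-tissue deformation, improving pose estimation accuracy and reconstruction quality.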
Details
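Motivation: Persistent soft-tissue deformation in endoscopic scenes breaks the rigidity assumption of conventional V-SLAM and strongly couples camera ego-motion with intrinsic deformation; existing monocular non-rigid SLAM methods lack effective decoupling mechanisms and rely on sparse or low-fidelity scene representations, which causes tracking drift and limits reconstruction quality. Method: NRGS-SLAM (1) builds a deformation-aware 3D Gaussian map in which each Gaussian primitive carries a learnable deformation probability optimized through Bayesian self-supervision; (2) uses a deformation-aware tracking module that performs coarse-to-fine pose estimation prioritizing low-deformation regions and then updates per-frame deformation efficiently; (3) uses a deformation-aware mapping module that progressively expands and refines the map; (4) adds a unified robust geometric loss that incorporates external geometric priors. Result: On multiple public endoscopic datasets, NRGS-SLAM reduces pose-estimation RMSE by up to 50% relative to state-of-the-art methods and produces higher-fidelity photo-realistic reconstructions; ablation studies validate the core design choices. Conclusion: By combining 3D Gaussian Splatting with deformation modeling, self-supervised learning, and robust geometric optimization, NRGS-SLAM provides an efficient, accurate, and scalable solution for monocular non-rigid SLAM in challenging endoscopic scenes. Abstract: Visual simultaneous localization and mapping (V-SLAM) is a fundamental capability for autonomous perception and navigation. However, endoscopic scenes violate the rigidity assumption due to persistent soft-tissue deformations, creating a strong coupling ambiguity between camera ego-motion and intrinsic deformation. Although recent monocular non-rigid SLAM methods have made notable progress, they often lack effective decoupling mechanisms and rely on sparse or low-fidelity scene representations, which leads to tracking drift and limited reconstruction quality. To address these limitations, we propose NRGS-SLAM, a monocular non-rigid SLAM system for endoscopy based on 3D Gaussian Splatting. To resolve the coupling ambiguity, we introduce a deformation-aware 3D Gaussian map that augments each Gaussian primitive with a learnable deformation probability, optimized via a Bayesian self-supervision strategy without requiring external non-rigidity labels. Building on this representation, we design a deformable tracking module that performs robust coarse-to-fine pose estimation by prioritizing low-deformation regions, followed by efficient per-frame deformation updates. A carefully designed deformable mapping module progressively expands and refines the map, balancing representational capacity and computational efficiency. In addition, a unified robust geometric loss incorporates external geometric priors to mitigate the inherent ill-posedness of monocular non-rigid SLAM.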
Extensive experiments on multiple public endoscopic datasets demonstrate that NRGS-SLAM achieves more accurate camera pose estimation (up to 50% reduction in RMSE) and higher-quality photo-realistic reconstructions than state-of-the-art methods. Comprehensive ablation studies further validate the effectiveness of our key design choices. Source code will be publicly available upon paper acceptance.
[63] Selective Training for Large Vision Language Models via Visual Information Gain
Seulbi Lee,Sangheum Hwang
Main category: cs.CV
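TL;DR: This paper proposes Visual Information Gain (VIG), a perplexity-based metric that quantifies how much image input reduces a language model's prediction uncertainty, and builds on it a VIG-guided selective training strategy that improves the visual grounding of large vision language models (LVLMs) and mitigates language bias.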
Details
Motivation: Large vision language models (LVLMs) suffer from language bias, answering without relying on visual evidence; prior methods lack a quantitative measure of how much an individual training sample or token benefits from the image. Method: Visual Information Gain (VIG), a perplexity-based metric, supports fine-grained analysis at both the sample and token level; a VIG-guided selective training scheme then prioritizes high-VIG samples and tokens. Result: The method improves visual grounding, mitigates language bias, and achieves better performance with significantly reduced supervision. Conclusion: VIG provides an interpretable, quantitative tool for assessing and strengthening visual grounding in LVLMs, and VIG-guided training is an efficient, data-economical path to improvement. Abstract: Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, these approaches typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.
[64] EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models
Yahong Wang,Juncheng Wu,Zhangkai Ni,Chengmei Yang,Yihang Liu,Longzhen Yang,Yuyin Zhou,Ying Wen,Lianghua He
Main category: cs.CV
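TL;DR: This paper proposes EntropyPrune, a matrix-entropy-based visual token pruning framework that identifies an "Entropy Collapse Layer" (ECL) to make pruning interpretable and transferable, reducing MLLM inference cost while preserving performance.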
Details
Motivation: Existing visual token pruning methods for MLLMs rely on static, empirically chosen layers, which limits interpretability and transfer across models and makes it hard to balance efficiency and accuracy. Method: From a matrix-entropy perspective, the authors identify and define the "Entropy Collapse Layer" (ECL) as a principled criterion for where to prune; EntropyPrune exploits the spectral equivalence of dual Gram matrices to compute token entropy efficiently, enabling adaptive token pruning without attention maps. Result: On LLaVA-1.5-7B, FLOPs drop by 68.2% while 96.0% of the original performance is retained; the method outperforms state-of-the-art pruning methods on multimodal benchmarks and on high-resolution and video models, with up to a 64x theoretical speedup in entropy computation. Conclusion: Matrix entropy offers an interpretable, transferable criterion for visual token pruning in MLLMs, and EntropyPrune is an efficient, robust, and scalable acceleration approach. Abstract: Multimodal large language models (MLLMs) incur substantial inference cost due to the processing of hundreds of visual tokens per image. Although token pruning has proven effective for accelerating inference, determining when and where to prune remains largely heuristic. Existing approaches typically rely on static, empirically selected layers, which limit interpretability and transferability across models. In this work, we introduce a matrix-entropy perspective and identify an "Entropy Collapse Layer" (ECL), where the information content of visual representations exhibits a sharp and consistent drop, which provides a principled criterion for selecting the pruning stage. Building on this observation, we propose EntropyPrune, a novel matrix-entropy-guided token pruning framework that quantifies the information value of individual visual tokens and prunes redundant ones without relying on attention maps. Moreover, to enable efficient computation, we exploit the spectral equivalence of dual Gram matrices, reducing the complexity of entropy computation and yielding up to a 64x theoretical speedup. Extensive experiments on diverse multimodal benchmarks demonstrate that EntropyPrune consistently outperforms state-of-the-art pruning methods in both accuracy and efficiency. On LLaVA-1.5-7B, our method achieves a 68.2% reduction in FLOPs while preserving 96.0% of the original performance. Furthermore, EntropyPrune generalizes effectively to high-resolution and video-based models, highlighting the strong robustness and scalability in practical MLLM acceleration.
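The spectral equivalence of the two Gram matrices can be made concrete with a short derivation. This is a sketch under an assumed definition of matrix entropy (the entropy of the eigenvalues of the trace-normalized Gram matrix); the symbols $X$, $n$, $d$, and $\sigma_i$ are introduced here for illustration, and the paper's exact formulation may differ.

```latex
% Assumed setup: X \in \mathbb{R}^{n \times d} stacks n visual tokens of
% dimension d, with singular value decomposition X = U \Sigma V^\top.
\[
  X X^\top = U \,\Sigma \Sigma^\top U^\top \in \mathbb{R}^{n \times n},
  \qquad
  X^\top X = V \,\Sigma^\top \Sigma\, V^\top \in \mathbb{R}^{d \times d}.
\]
% Both Gram matrices have the same nonzero eigenvalues \sigma_i^2 and the
% same trace, \operatorname{tr}(X X^\top) = \operatorname{tr}(X^\top X) = \sum_j \sigma_j^2.
\[
  \lambda_i = \frac{\sigma_i^2}{\sum_j \sigma_j^2},
  \qquad
  H(X) = -\sum_{i \,:\, \lambda_i > 0} \lambda_i \log \lambda_i .
\]
% Zero eigenvalues contribute nothing, so H(X) is identical whether it is
% computed from the n x n or the d x d Gram matrix.
```

Because the entropy is the same for either matrix, the eigendecomposition can be run on whichever Gram matrix is smaller, at a cost of roughly $O(\min(n,d)^3)$ instead of $O(\max(n,d)^3)$.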
The code will be publicly available at https://github.com/YahongWang1/EntropyPrune.
[65] GASS: Geometry-Aware Spherical Sampling for Disentangled Diversity Enhancement in Text-to-Image Generation
Ye Zhu,Kaleb S. Newman,Johannes F. Lutzeyer,Adriana Romero-Soriano,Michal Drozdzal,Olga Russakovsky
Main category: cs.CV
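TL;DR: This paper proposes Geometry-Aware Spherical Sampling (GASS), which increases the diversity of text-to-image generation by widening the spread of generated image embeddings, projected in CLIP embedding space, along two orthogonal directions (prompt-related and prompt-independent) while maintaining image quality and semantic alignment.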
Details
Motivation: Current text-to-image models achieve high semantic alignment but produce images with limited diversity, which restricts user choice and risks amplifying societal biases. Method: GASS decomposes diversity in CLIP embedding space into the text-embedding direction (prompt-related) and an orthogonal direction (prompt-independent variation such as backgrounds), then widens the projection spread of generated image embeddings along both axes to guide sampling. Result: Experiments on several frozen T2I backbones (U-Net and DiT; diffusion and flow models) and benchmarks show disentangled diversity gains with minimal impact on image fidelity and semantic alignment. Conclusion: Modeling diversity geometrically enables more controllable, disentangled diversity enhancement for T2I generation. Abstract: Despite high semantic alignment, modern text-to-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. This lack of diversity not only restricts user choice, but also risks amplifying societal biases. In this work, we enhance T2I diversity through a geometric lens. Unlike most existing methods that rely primarily on entropy-based guidance to increase sample dissimilarity, we introduce Geometry-Aware Spherical Sampling (GASS) to enhance diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Specifically, we decompose the diversity measure in CLIP embeddings using two orthogonal directions: the text embedding, which captures semantic variation related to the prompt, and an identified orthogonal direction that captures prompt-independent variation (e.g., backgrounds). Based on this decomposition, GASS increases the geometric projection spread of generated image embeddings along both axes and guides the T2I sampling process via expanded predictions along the generation trajectory. Our experiments on different frozen T2I backbones (U-Net and DiT, diffusion and flow) and benchmarks demonstrate the effectiveness of disentangled diversity enhancement with minimal impact on image fidelity and semantic alignment.
[66] HiMAP: History-aware Map-occupancy Prediction with Fallback
Yiming Xu,Yi Yang,Hao Cheng,Monika Sester
Main category: cs.CV
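TL;DR: HiMAP is a tracking-free trajectory prediction framework that predicts multi-modal future trajectories using historical occupancy maps and a historical query module, substantially outperforming existing methods in the no-tracking setting.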
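The results for this entry are reported in ADE, FDE, and MR. For readers unfamiliar with these metrics, the following is a minimal sketch of how they are commonly computed for a single predicted trajectory per agent. The function names and the 2.0 m miss threshold are assumptions for illustration; the benchmark's official evaluation (which takes the best of several predicted modes) is not reproduced here.

```python
import math


def displacement_errors(pred, truth):
    """Return (ADE, FDE) for one agent.

    pred, truth: equal-length lists of (x, y) positions in metres.
    ADE is the mean point-wise distance, FDE the distance at the last step.
    """
    dists = [math.hypot(px - tx, py - ty)
             for (px, py), (tx, ty) in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]


def miss_rate(final_errors, threshold=2.0):
    """Fraction of agents whose final displacement error exceeds the threshold.

    The 2.0 m default is an assumed, commonly used value.
    """
    return sum(err > threshold for err in final_errors) / len(final_errors)
```

Lower is better for all three, so a relative gain in FDE or ADE means a smaller error.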
Details
Motivation: Existing motion forecasting methods depend on multi-object tracking (MOT); when occlusion, identity switches, or missed detections occur, performance degrades and safety risks arise. A robust predictor that does not rely on identity association is needed.
Method: HiMAP converts past detections into spatiotemporally invariant historical occupancy maps and introduces a historical query module that, conditioned on the current agent state, iteratively retrieves agent-specific history from unlabeled occupancy representations. The retrieved history is summarized by a temporal map embedding and drives a DETR-style decoder that generates multi-modal future trajectories.
Result: On Argoverse 2, HiMAP performs comparably to tracking-based methods without using IDs. In the no-tracking setting, relative to a fine-tuned QCNet, it improves FDE by 11% and ADE by 12% and reduces MR by 4%. It supports streaming inference and produces stable forecasts for all agents simultaneously.
Conclusion: HiMAP removes the dependence on object identity and continuous tracking, improving robustness and safety, and offers a practical, reliable fallback for autonomous driving when tracking fails.
Abstract: Accurate motion forecasting is critical for autonomous driving, yet most predictors rely on multi-object tracking (MOT) with identity association, assuming that objects are correctly and continuously tracked. When tracking fails due to, e.g., occlusion, identity switches, or missed detections, prediction quality degrades and safety risks increase. We present HiMAP, a tracking-free trajectory prediction framework that remains reliable under MOT failures. HiMAP converts past detections into spatiotemporally invariant historical occupancy maps and introduces a historical query module that conditions on the current agent state to iteratively retrieve agent-specific history from unlabeled occupancy representations. The retrieved history is summarized by a temporal map embedding and, together with the final query and map context, drives a DETR-style decoder to produce multi-modal future trajectories. This design lifts identity reliance, supports streaming inference via reusable encodings, and serves as a robust fallback when tracking is unavailable. On Argoverse 2, HiMAP achieves performance comparable to tracking-based methods while operating without IDs, and it substantially outperforms strong baselines in the no-tracking setting, yielding relative gains of 11% in FDE, 12% in ADE, and a 4% reduction in MR over a fine-tuned QCNet. Beyond aggregate metrics, HiMAP delivers stable forecasts for all agents simultaneously without waiting for tracking to recover, highlighting its practical value for safety-critical autonomy.
The code is available at: https://github.com/XuYiMing83/HiMAP.
[67] Inferring Height from Earth Embeddings: First insights using Google AlphaEarth
Alireza Hamoudzadeh,Valeria Belloni,Roberta Ravanelli
Main category: cs.CV
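TL;DR: This study examines whether AlphaEarth Embeddings (10 m resolution) can effectively guide deep learning regression models for regional surface height mapping, using U-Net and U-Net++ architectures to decode the embeddings. The embeddings contain decodable height signals, and U-Net++ generalizes better on the test set (R² = 0.84 vs. 0.78), although residual bias and distribution shift remain challenges.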
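The figures quoted for this entry (R², median difference, RMSE) can be computed from predicted and reference heights as in the sketch below. It assumes R² means the coefficient of determination and that differences are taken as prediction minus reference; the paper may define either differently, and the function name `height_metrics` is hypothetical.

```python
import math
import statistics


def height_metrics(pred, ref):
    """Compare predicted heights against reference (e.g. DSM) heights, in metres."""
    # Assumed sign convention: negative values mean the model underestimates.
    diffs = [p - r for p, r in zip(pred, ref)]
    mean_ref = statistics.fmean(ref)
    ss_res = sum(d * d for d in diffs)
    ss_tot = sum((r - mean_ref) ** 2 for r in ref)
    return {
        "r2": 1.0 - ss_res / ss_tot,            # coefficient of determination
        "median_diff": statistics.median(diffs),  # robust indicator of bias
        "rmse": math.sqrt(ss_res / len(diffs)),
    }
```

A strongly negative median difference alongside a high R² is the pattern one would expect when predictions follow the terrain shape but carry a systematic offset.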
Details
Motivation: To explore whether the geospatial and multimodal features encoded in Earth Embeddings (specifically AlphaEarth Embeddings) can support accurate, transferable surface height regression with deep learning models.
Method: AlphaEarth Embeddings at 10 m resolution serve as input features and a high-quality Digital Surface Model (DSM) provides the reference labels; U-Net and U-Net++ are used as lightweight convolutional decoders for height regression and evaluation.
Result: Both models reach R² = 0.97 in training. On the test set, U-Net++ (R² = 0.84, median difference -2.62 m) outperforms U-Net (R² = 0.78, median difference -7.22 m); the test RMSE (about 16 m for U-Net++) indicates residual bias and distribution shift.
Conclusion: AlphaEarth Embeddings carry transferable topographic patterns and, combined with spatially aware convolutional architectures such as U-Net++, show potential for guiding DL-based height mapping, but systematic bias in regional transfer still needs to be addressed.
Abstract: This study investigates whether the geospatial and multimodal features encoded in Earth Embeddings can effectively guide deep learning (DL) regression models for regional surface height mapping. In particular, we focused on AlphaEarth Embeddings at 10 m spatial resolution and evaluated their capability to support terrain height inference using a high-quality Digital Surface Model (DSM) as reference. U-Net and U-Net++ architectures were thus employed as lightweight convolutional decoders to assess how well the geospatial information distilled in the embeddings can be translated into accurate surface height estimates. Both architectures achieved strong training performance (both with R² = 0.97), confirming that the embeddings encode informative and decodable height-related signals. On the test set, performance decreased due to distribution shifts in height frequency between training and testing areas. Nevertheless, U-Net++ shows better generalization (R² = 0.84, median difference = -2.62 m) compared with the standard U-Net (R² = 0.78, median difference = -7.22 m), suggesting enhanced robustness to distribution mismatch. While the testing RMSE (approximately 16 m for U-Net++) and residual bias highlight remaining challenges in generalization, strong correlations indicate that the embeddings capture transferable topographic patterns.
Overall, the results demonstrate the promising potential of AlphaEarth Embeddings to guide DL-based height mapping workflows, particularly when combined with spatially aware convolutional architectures, while emphasizing the need to address bias for improved regional transferability.
[68] A Multi-modal Detection System for Infrastructure-based Freight Signal Priority
Ziyan Zhang,Chuheng Wei,Xuanpeng Zhao,Siyan Li,Will Snyder,Mike Stas,Peng Hao,Kanok Boriboonsomsin,Guoyuan Wu
Main category: cs.CV
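TL;DR: This paper presents an infrastructure-based multi-modal freight vehicle detection system that fuses LiDAR and camera sensors. It uses a hybrid sensing architecture and a perception pipeline that combines clustering-based and deep learning-based detection with Kalman filter tracking, delivering stable real-time perception of freight vehicle type, position, and speed at high spatio-temporal resolution in support of Freight Signal Priority (FSP) applications.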
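The tracking stage mentioned above can be illustrated with a textbook constant-velocity Kalman filter. This is a minimal sketch under stated assumptions, not the deployed system: the one-dimensional along-lane state, the function name `kalman_cv_track`, the 10 Hz sample period, and the noise variances are all hypothetical.

```python
import numpy as np


def kalman_cv_track(measurements, dt=0.1, meas_var=0.25, accel_var=1.0):
    """Track position and speed along a lane from noisy position measurements.

    State is [position (m), velocity (m/s)] under a constant-velocity model.
    Returns a list of (position, velocity) estimates, one per update.
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition
    H = np.array([[1.0, 0.0]])              # only position is observed
    Q = accel_var * np.array([[dt**4 / 4, dt**3 / 2],
                              [dt**3 / 2, dt**2]])  # process noise
    R = np.array([[meas_var]])              # measurement noise

    x = np.array([[measurements[0]], [0.0]])  # start at first detection, unknown speed
    P = np.diag([meas_var, 100.0])            # large initial speed uncertainty

    estimates = []
    for z in measurements[1:]:
        # Predict one step ahead.
        x = F @ x
        P = F @ P @ F.T + Q
        # Correct with the new detection.
        innovation = np.array([[z]]) - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ innovation
        P = (np.eye(2) - K @ H) @ P
        estimates.append((float(x[0, 0]), float(x[1, 0])))
    return estimates
```

The speed estimate is what a priority controller would need to predict arrival time at the stop bar; a real deployment would track in two dimensions and associate detections to tracks first.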
Details
Motivation: Freight vehicles approaching signalized intersections require reliable detection and motion estimation to support infrastructure-based Freight Signal Priority (FSP); accurate, timely perception of vehicle type, position, and speed is essential for effective priority control.
Method: The authors design and deploy an infrastructure-based multi-modal detection system integrating LiDAR and cameras. A hybrid sensing architecture combines an intersection-mounted subsystem with a midblock subsystem, synchronized over wireless communication. The perception pipeline combines clustering-based and deep learning-based detection with Kalman filter tracking, and LiDAR point clouds are registered into geodetic reference frames to support lane-level localization and continuous tracking.
Result: Field evaluations show that the system reliably monitors freight vehicle movements at high spatio-temporal resolution, with stable real-time performance, lane-level localization, and consistent tracking.
Conclusion: The design and deployment offer a reusable technical path and practical reference for infrastructure-based sensing systems that support FSP applications.
Abstract: Freight vehicles approaching signalized intersections require reliable detection and motion estimation to support infrastructure-based Freight Signal Priority (FSP). Accurate and timely perception of vehicle type, position, and speed is essential for enabling effective priority control strategies. This paper presents the design, deployment, and evaluation of an infrastructure-based multi-modal freight vehicle detection system integrating LiDAR and camera sensors. A hybrid sensing architecture is adopted, consisting of an intersection-mounted subsystem and a midblock subsystem, connected via wireless communication for synchronized data transmission. The perception pipeline incorporates both clustering-based and deep learning-based detection methods with Kalman filter tracking to achieve stable real-time performance. LiDAR measurements are registered into geodetic reference frames to support lane-level localization and consistent vehicle tracking. Field evaluations demonstrate that the system can reliably monitor freight vehicle movements at high spatio-temporal resolution. The design and deployment provide practical insights for developing infrastructure-based sensing systems to support FSP applications.
[69] EA-Swin: An Embedding-Agnostic Swin Transformer for AI-Generated Video Detection
Hung Mai,Loi Dinh,Duc Hai Nguyen,Dat Do,Luong Doan,Khanh Nguyen Quoc,Huan Vu,Phong Ho,Naeem Ul Islam,Tuan Do
Main category: cs.CV
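TL;DR: This paper proposes the EA-Swin model and the EA-Video dataset for efficient detection of AI-generated videos, substantially improving accuracy and generalization.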
Details
Motivation: Existing detection methods fall short against advanced video generators such as Sora2 and Veo3 because they rely on shallow embedding trajectories, image-based adaptation, or computationally heavy MLLMs.
Method: The paper proposes the Embedding-Agnostic Swin Transformer (EA-Swin), which uses factorized windowed attention to model spatiotemporal dependencies directly on pretrained video embeddings, and builds EA-Video, a 130K-video benchmark that covers diverse generators and supports cross-distribution evaluation.
Result: EA-Swin reaches 0.97-0.99 accuracy on major generators, a 5-20% improvement over prior SoTA (0.8-0.9), and retains strong generalization to unseen generators.
Conclusion: EA-Swin is a scalable, robust solution for detecting modern AI-generated video.
Abstract: Recent advances in foundation video generators such as Sora2, Veo3, and other commercial systems have produced highly realistic synthetic videos, exposing the limitations of existing detection methods that rely on shallow embedding trajectories, image-based adaptation, or computationally heavy MLLMs. We propose EA-Swin, an Embedding-Agnostic Swin Transformer that models spatiotemporal dependencies directly on pretrained video embeddings via a factorized windowed attention design, making it compatible with generic ViT-style patch-based encoders. Alongside the model, we construct the EA-Video dataset, a benchmark dataset comprising 130K videos that integrates newly collected samples with curated existing datasets, covering diverse commercial and open-source generators and including unseen-generator splits for rigorous cross-distribution evaluation. Extensive experiments show that EA-Swin achieves 0.97-0.99 accuracy across major generators, outperforming prior SoTA methods (typically 0.8-0.9) by a margin of 5-20%, while maintaining strong generalization to unseen distributions, establishing a scalable and robust solution for modern AI-generated video detection.
[70] Physics Encoded Spatial and Temporal Generative Adversarial Network for Tropical Cyclone Image Super-resolution
Ruoyi Zhang,Jiawei Yuan,Lujia Ye,Runling Yu,Liling Zhao
Main category: cs.CV
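TL;DR: This paper proposes a Physics Encoded Spatial and Temporal Generative Adversarial Network (PESTGAN) for super-resolution of tropical cyclone satellite imagery. It introduces physical constraints (such as the vorticity equation) to improve both the physical plausibility of meteorological structures and visual quality.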
Details
Motivation: Existing deep learning super-resolution methods treat satellite image sequences as generic videos and ignore the atmospheric physics governing cloud motion, so their reconstructions lack meteorological credibility.
Method: PESTGAN has two components: (1) a disentangled generator that integrates a PhyCell module, which approximates the vorticity equation with constrained convolutions and encodes physical dynamics as implicit latent representations; (2) a dual-discriminator framework that separately enforces spatial realism and temporal motion consistency.
Result: For 4× super-resolution on the Digital Typhoon dataset, PESTGAN outperforms existing methods in structural fidelity and perceptual quality. Pixel-wise accuracy is competitive, and the meteorological plausibility and physical fidelity of cloud structures improve significantly.
Conclusion: Generative models that incorporate physical priors can make super-resolved remote sensing imagery more scientifically usable, pointing to a role for physics-informed deep learning in meteorological image processing.
Abstract: High-resolution satellite imagery is indispensable for tracking the genesis, intensification, and trajectory of tropical cyclones (TCs). However, existing deep learning-based super-resolution (SR) methods often treat satellite image sequences as generic videos, neglecting the underlying atmospheric physical laws governing cloud motion. To address this, we propose a Physics Encoded Spatial and Temporal Generative Adversarial Network (PESTGAN) for TC image super-resolution. Specifically, we design a disentangled generator architecture incorporating a PhyCell module, which approximates the vorticity equation via constrained convolutions and encodes the resulting approximate physical dynamics as implicit latent representations to separate physical dynamics from visual textures. Furthermore, a dual-discriminator framework is introduced, employing a temporal discriminator to enforce motion consistency alongside spatial realism. Experiments on the Digital Typhoon dataset for 4× upscaling demonstrate that PESTGAN achieves better performance in structural fidelity and perceptual quality. While maintaining competitive pixel-wise accuracy compared to existing approaches, our method significantly excels in reconstructing meteorologically plausible cloud structures with superior physical fidelity.
[71] Attachment Anchors: A Novel Framework for Laparoscopic Grasping Point Prediction in Colorectal Surgery
Dennis N. Schneider,Lars Wagner,Daniel Rueckert,Dirk Wilhelm
Main category: cs.CV
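TL;DR: This paper proposes a structured representation called attachment anchors to improve the accuracy of autonomous tissue grasping point prediction in minimally invasive colorectal surgery. By encoding the local geometric and mechanical relationships between tissue and its anatomical attachments, the method normalizes surgical scenes into a consistent local reference frame, which reduces uncertainty in grasping point prediction. Its advantage in out-of-distribution settings is validated on a dataset of 90 surgeries.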
Details
Motivation: Colorectal surgery is complex and lengthy and is underrepresented in current research, yet its repetitive tissue manipulation makes it a promising entry point for machine learning-driven autonomous assistance.
Method: The paper introduces attachment anchors, a structured intermediate representation that encodes the local geometric and mechanical relationships between tissue and its anatomical attachments, normalizes surgical scenes into a local reference frame, and is integrated into an image-based machine learning grasping framework.
Result: On a dataset of 90 colorectal surgeries, attachment anchors improve grasping point prediction over image-only baselines, with especially large gains in out-of-distribution settings such as unseen procedures and different surgeons.
Conclusion: Attachment anchors are an effective intermediate representation that improves the generalization and robustness of learning-based tissue manipulation in colorectal surgery.
Abstract: Accurate grasping point prediction is a key challenge for autonomous tissue manipulation in minimally invasive surgery, particularly in complex and variable procedures such as colorectal interventions. Due to their complexity and prolonged duration, colorectal procedures have been underrepresented in current research. At the same time, they pose a particularly interesting learning environment due to repetitive tissue manipulation, making them a promising entry point for autonomous, machine learning-driven support. Therefore, in this work, we introduce attachment anchors, a structured representation that encodes the local geometric and mechanical relationships between tissue and its anatomical attachments in colorectal surgery. This representation reduces uncertainty in grasping point prediction by normalizing surgical scenes into a consistent local reference frame. We demonstrate that attachment anchors can be predicted from laparoscopic images and incorporated into a grasping framework based on machine learning. Experiments on a dataset of 90 colorectal surgeries demonstrate that attachment anchors improve grasping point prediction compared to image-only baselines. There are particularly strong gains in out-of-distribution settings, including unseen procedures and operating surgeons. These results suggest that attachment anchors are an effective intermediate representation for learning-based tissue manipulation in colorectal surgery.
[72] Leveraging Contrastive Learning for a Similarity-Guided Tampered Document Data Generation Pipeline
Mohamed Dhouib,Davide Buscaldi,Sonia Vanier,Aymen Shabou
Main category: cs.CV
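TL;DR: This paper proposes a new method for generating high-quality tampered document images. Two auxiliary networks (a contrastive-learning text crop matching network and a network that assesses whether a crop tightly encloses its characters) increase the diversity and visual quality of the generated data, which improves how well tampered-text detection models generalize to real-world data.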
Details
Motivation: Existing rule-based methods for generating tampered documents offer little variety and poor visual quality and leave obvious artifacts, so models fail to learn robust features and perform poorly on real data.
Method: Two auxiliary networks are trained: (1) a text crop matching network based on contrastive learning, with a new strategy for constructing positive and negative pairs; (2) a network that evaluates how tightly a crop encloses the intended characters. Both are combined in a generation pipeline for tampered document images.
Result: Under identical training protocols, models trained on data generated by this method show consistent improvements across architectures and datasets on several open-source test sets, compared with existing generation methods.
Conclusion: High-quality, diverse synthetic tampered data is important for real-world generalization of document tampering detectors, and the proposed generation framework driven by two auxiliary networks addresses key shortcomings of existing methods.
Abstract: Detecting tampered text in document images is a challenging task due to data scarcity. To address this, previous work has attempted to generate tampered documents using rule-based methods. However, the resulting documents often suffer from limited variety and poor visual quality, typically leaving highly visible artifacts that are rarely observed in real-world manipulations. This undermines the model's ability to learn robust, generalizable features and results in poor performance on real-world data. Motivated by this discrepancy, we propose a novel method for generating high-quality tampered document images. We first train an auxiliary network to compare text crops, leveraging contrastive learning with a novel strategy for defining positive pairs and their corresponding negatives. We also train a second auxiliary network to evaluate whether a crop tightly encloses the intended characters, without cutting off parts of characters or including parts of adjacent ones. Using a carefully designed generation pipeline that leverages both networks, we introduce a framework capable of producing diverse, high-quality tampered document images. We assess the effectiveness of our data generation pipeline by training multiple models on datasets derived from the same source images, generated using our method and existing approaches, under identical training protocols. Evaluating these models on various open-source datasets shows that our pipeline yields consistent performance improvements across architectures and datasets.
[73] Polaffini: A feature-based approach for robust affine and polyaffine image registration
Antoine Legouhy,Cosimo Campo,Ross Callaghan,Hojjat Azadbakht,Hui Zhang
Main category: cs.CV
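TL;DR: This paper proposes Polaffini, a framework that uses pre-trained deep learning segmentation models to extract the centroids of anatomical structures as feature points, enabling fast, robust, and accurate anatomically grounded image registration (from affine to polyaffine transformations). It improves structural alignment and provides better initialization for subsequent non-linear registration.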
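The core idea summarized above, matching region centroids and solving for an affine map in closed form, can be sketched as follows. This is an illustration under assumptions rather than Polaffini's implementation: the function names are hypothetical, an ordinary least-squares fit stands in for whatever estimator the authors use, and the local fits, polyaffine blending, and log-Euclidean embedding are not shown.

```python
import numpy as np


def label_centroids(label_map, labels):
    """Centroid (in voxel coordinates) of each listed label in a segmentation."""
    return np.array([np.argwhere(label_map == lab).mean(axis=0) for lab in labels])


def fit_affine(src, dst):
    """Least-squares affine map with dst_i ≈ A @ src_i + b.

    src, dst: (n, d) arrays of matched points, e.g. centroids of the same
    anatomical labels in the moving and fixed images.
    """
    n = src.shape[0]
    src_h = np.hstack([src, np.ones((n, 1))])          # homogeneous coordinates
    params, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    A = params[:-1].T                                  # linear part, shape (d, d)
    b = params[-1]                                     # translation, shape (d,)
    return A, b
```

Because the same label yields one centroid in each image, correspondence is one-to-one by construction, which is what makes a direct linear solve possible without iterative matching.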
Details
Motivation: Conventional intensity-based medical image registration relies on surrogate measures of alignment, while feature-based methods fell out of favor because features were hard to extract reliably. Recent deep learning segmentation models now deliver reliable, fine-grained anatomical delineations, opening the way to anatomically grounded registration.
Method: Polaffini takes the anatomical regions output by a pre-trained deep segmentation model and extracts each region's centroid as a feature point with one-to-one correspondence. Closed-form solutions give efficient global and local affine matching, which are combined into a polyaffine transformation with tunable smoothness; embedding the transformation in the log-Euclidean framework ensures diffeomorphic properties.
Result: Polaffini outperforms popular intensity-based registration methods in structural alignment and provides better initialization for downstream non-linear registration, while being fast, robust, and accurate.
Conclusion: Polaffini turns modern deep segmentation into an anatomically grounded registration tool that works either as standalone registration or as pre-alignment for non-linear registration, and is well suited to integration into medical image processing pipelines.
Abstract: In this work we present Polaffini, a robust and versatile framework for anatomically grounded registration. Medical image registration is dominated by intensity-based registration methods that rely on surrogate measures of alignment quality. In contrast, feature-based approaches that operate by identifying explicit anatomical correspondences, while more desirable in theory, have largely fallen out of favor due to the challenges of reliably extracting features. However, such challenges are now significantly overcome thanks to recent advances in deep learning, which provide pre-trained segmentation models capable of instantly delivering reliable, fine-grained anatomical delineations. We aim to demonstrate that these advances can be leveraged to create new anatomically-grounded image registration algorithms. To this end, we propose Polaffini, which obtains, from these segmented regions, anatomically grounded feature points with 1-to-1 correspondence in a particularly simple way: extracting their centroids. These enable efficient global and local affine matching via closed-form solutions. Those are used to produce an overall transformation ranging from affine to polyaffine with tunable smoothness. Polyaffine transformations can have many more degrees of freedom than affine ones allowing for finer alignment, and their embedding in the log-Euclidean framework ensures diffeomorphic properties.
Polaffini has applications both for standalone registration and as pre-alignment for subsequent non-linear registration, and we evaluate it against popular intensity-based registration techniques. Results demonstrate that Polaffini outperforms competing methods in terms of structural alignment and provides improved initialisation for downstream non-linear registration. Polaffini is fast, robust, and accurate, making it particularly well-suited for integration into medical image processing pipelines.
[74] Tree crop mapping of South America reveals links to deforestation and conservation
Yuchang Jiang,Anton Raichuk,Xiaoye Tong,Vivien Sainte Fare Garnot,Daniel Ortiz-Gonzalo,Dan Morris,Konrad Schindler,Jan Dirk Wegner,Maxim Neumann
Main category: cs.CV
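TL;DR: This paper presents the first 10 m resolution tree crop map of South America, generated with a multi-modal spatio-temporal deep learning model from Sentinel-1 and Sentinel-2 satellite image time series. It is intended to support zero-deforestation policies (such as the EU's EUDR) and to reduce misclassification of smallholder agroforestry systems.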
Details
Motivation: Monitoring tree crop expansion is essential for implementing zero-deforestation policies such as the EU's EUDR, but existing high-resolution data struggle to distinguish diverse agricultural systems from forests. Regulatory maps therefore tend to misclassify smallholder agroforestry as forest, triggering false deforestation alerts and unfair penalties.
Method: A multi-modal, spatio-temporal deep learning model trained on Sentinel-1 and Sentinel-2 satellite image time series generates a 10 m resolution tree crop map of South America.
Result: The map identifies about 11 million hectares of tree crops, 23% of which is linked to forest cover loss between 2000 and 2020. Existing regulatory maps supporting the EUDR often classify established agriculture, particularly smallholder agroforestry, as "forest".
Conclusion: The high-resolution baseline map helps make deforestation monitoring more accurate and supports forest conservation policies that are effective, inclusive, and equitable.
Abstract: Monitoring tree crop expansion is vital for zero-deforestation policies like the European Union's Regulation on Deforestation-free Products (EUDR). However, these efforts are hindered by a lack of high-resolution data distinguishing diverse agricultural systems from forests. Here, we present the first 10 m resolution tree crop map for South America, generated using a multi-modal, spatio-temporal deep learning model trained on Sentinel-1 and Sentinel-2 satellite imagery time series. The map identifies approximately 11 million hectares of tree crops, 23% of which is linked to 2000-2020 forest cover loss. Critically, our analysis reveals that existing regulatory maps supporting the EUDR often classify established agriculture, particularly smallholder agroforestry, as "forest". This discrepancy risks false deforestation alerts and unfair penalties for small-scale farmers. Our work mitigates this risk by providing a high-resolution baseline, supporting conservation policies that are effective, inclusive, and equitable.
[75] DRetHTR: Linear-Time Decoder-Only Retentive Network for Handwritten Text Recognition
Changhun Kim,Martin Mayr,Thomas Gorges,Fei Wu,Mathias Seuret,Andreas Maier,Vincent Christlein
Main category: cs.CV
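TL;DR: This paper proposes DRetHTR, a decoder-only handwritten text recognition model built on Retentive Networks (RetNet). By removing softmax attention and the KV cache and introducing multi-scale sequential priors and layer-wise gamma scaling, it matches or exceeds Transformer accuracy while being 1.6-1.9× faster at inference and using 38-42% less memory.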
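The reason retention avoids a growing KV cache can be seen in a small single-head sketch of the standard RetNet retention operator: the recurrent form keeps one fixed-size state matrix per step, yet produces the same outputs as the parallel form. This is a simplified illustration, not DRetHTR's code; normalization, positional rotation, multi-scale heads, and the paper's layer-wise gamma schedule are omitted, and the function names are hypothetical.

```python
import numpy as np


def retention_recurrent(q, k, v, gamma):
    """Decode step by step with a constant-size state.

    q, k: (n, d_k) arrays, v: (n, d_v) array, gamma: decay in (0, 1).
    """
    state = np.zeros((q.shape[1], v.shape[1]))
    outputs = []
    for q_n, k_n, v_n in zip(q, k, v):
        # Decay the running summary and add the new key-value outer product.
        state = gamma * state + np.outer(k_n, v_n)
        outputs.append(q_n @ state)
    return np.stack(outputs)


def retention_parallel(q, k, v, gamma):
    """Equivalent parallel form with a causal decay mask (training-style)."""
    idx = np.arange(q.shape[0])
    diff = idx[:, None] - idx[None, :]
    # decay[n, m] = gamma ** (n - m) for m <= n, and 0 for future positions.
    decay = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return ((q @ k.T) * decay) @ v
```

A gamma closer to 1 decays more slowly, so the state retains older tokens for longer; that is the knob a layer-wise gamma schedule would adjust to give deeper layers a longer effective horizon.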
Details
Motivation: Transformer-based handwritten text recognition (HTR) systems decode slowly and use a lot of memory because the KV cache grows with output length; a more efficient alternative architecture is needed.
Method: DRetHTR is a decoder-only model built on RetNet. It replaces softmax attention with softmax-free retention, injects multi-scale sequential priors to model structure, and introduces layer-wise gamma scaling so that deeper layers retain a longer effective context, restoring a local-to-global inductive bias.
Result: The model achieves state-of-the-art or competitive character error rates on IAM-A (2.26%), RIMES (1.81%), Bentham (3.46%), and READ-2016 (4.21%). Compared with an equally sized Transformer decoder, inference is 1.6-1.9× faster and memory use is 38-42% lower, with decoding time and memory linear in output length.
Conclusion: DRetHTR shows that a decoder-only RetNet can combine high accuracy with high efficiency in HTR, which is useful for real-time handwriting recognition in resource-constrained settings.
Abstract: State-of-the-art handwritten text recognition (HTR) systems commonly use Transformers, whose growing key-value (KV) cache makes decoding slow and memory-intensive. We introduce DRetHTR, a decoder-only model built on Retentive Networks (RetNet). Compared to an equally sized decoder-only Transformer baseline, DRetHTR delivers 1.6-1.9x faster inference with 38-42% less memory usage, without loss of accuracy. By replacing softmax attention with softmax-free retention and injecting multi-scale sequential priors, DRetHTR avoids a growing KV cache: decoding is linear in output length in both time and memory. To recover the local-to-global inductive bias of attention, we propose layer-wise gamma scaling, which progressively enlarges the effective retention horizon in deeper layers. This encourages early layers to model short-range dependencies and later layers to capture broader context, mitigating the flexibility gap introduced by removing softmax. Consequently, DRetHTR achieves best reported test character error rates of 2.26% (IAM-A, en), 1.81% (RIMES, fr), and 3.46% (Bentham, en), and is competitive on READ-2016 (de) with 4.21%. This demonstrates that decoder-only RetNet enables Transformer-level HTR accuracy with substantially improved decoding speed and memory efficiency.
[76] SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery
Lorenzo Caselli,Marco Mistretta,Simone Magistri,Andrew D. Bagdanov
Main category: cs.CV
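TL;DR: This paper proposes SpectralGCD, an efficient and effective multimodal method for Generalized Category Discovery. It uses CLIP cross-modal image-concept similarities as a unified representation and applies spectral filtering and bidirectional knowledge distillation to improve semantic quality and alignment, achieving high accuracy at low computational cost on multiple benchmarks.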
Details
Motivation: Existing Generalized Category Discovery methods tend to overfit to known classes; multimodal methods improve results but treat modalities independently and are computationally expensive.
Method: SpectralGCD builds a unified representation from CLIP cross-modal image-concept similarities, introduces Spectral Filtering (based on a teacher model's cross-modal covariance matrix) to select relevant semantic concepts, and uses forward and reverse knowledge distillation to keep the student's representations semantically sufficient and cross-modally aligned.
Result: Across six benchmarks, SpectralGCD matches or significantly exceeds state-of-the-art accuracy at a fraction of the computational cost.
Conclusion: By anchoring learning to explicit semantics, reducing reliance on spurious visual cues, and distilling efficiently, SpectralGCD balances performance and efficiency in Generalized Category Discovery.
Abstract: Generalized Category Discovery (GCD) aims to identify novel categories in unlabeled data while leveraging a small labeled subset of known classes. Training a parametric classifier solely on image features often leads to overfitting to old classes, and recent multimodal approaches improve performance by incorporating textual information. However, they treat modalities independently and incur high computational cost. We propose SpectralGCD, an efficient and effective multimodal approach to GCD that uses CLIP cross-modal image-concept similarities as a unified cross-modal representation. Each image is expressed as a mixture over semantic concepts from a large task-agnostic dictionary, which anchors learning to explicit semantics and reduces reliance on spurious visual cues. To maintain the semantic quality of representations learned by an efficient student, we introduce Spectral Filtering which exploits a cross-modal covariance matrix over the softmaxed similarities measured by a strong teacher model to automatically retain only relevant concepts from the dictionary. Forward and reverse knowledge distillation from the same teacher ensures that the cross-modal representations of the student remain both semantically sufficient and well-aligned. Across six benchmarks, SpectralGCD delivers accuracy comparable to or significantly superior to state-of-the-art methods at a fraction of the computational cost. The code is publicly available at: https://github.com/miccunifi/SpectralGCD.
[77] A High-Level Survey of Optical Remote Sensing
Panagiotis Koletsis,Vasilis Efthymiou,Maria Vakalopoulou,Nikos Komodakis,Anastasios Doulamis,Georgios Th. Papadopoulos
Main category: cs.CV
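TL;DR: This paper is a comprehensive survey of optical remote sensing (particularly with drone-mounted RGB cameras), intended to give researchers new to the field an overview of the area, its key datasets, and its research directions.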
Details
Motivation: The literature lacks a survey that takes a holistic view of the capabilities, tasks, datasets, and methods of optical remote sensing (especially drone RGB imagery); a high-level introductory guide for newcomers is needed.
Method: A systematic literature review and synthesis that organizes and categorizes the field's task types, technical methods, common datasets, and key insights.
Result: The survey provides a framework covering the main capabilities of optical remote sensing, collects representative datasets and practical insights, and clarifies where different research directions apply and what challenges they face.
Conclusion: The survey fills the gap left by the absence of a holistic introductory guide and helps researchers quickly locate areas of interest, understand the technical landscape, and choose suitable tools and data.
Abstract: In recent years, significant advances in computer vision have also propelled progress in remote sensing. Concurrently, the use of drones has expanded, with many organizations incorporating them into their operations. Most drones are equipped by default with RGB cameras, which are both robust and among the easiest sensors to use and interpret. The body of literature on optical remote sensing is vast, encompassing diverse tasks, capabilities, and methodologies. Each task or methodology could warrant a dedicated survey. This work provides a comprehensive overview of the capabilities of the field, while also presenting key information, such as datasets and insights. It aims to serve as a guide for researchers entering the field, offering high-level insights and helping them focus on areas most relevant to their interests. To the best of our knowledge, no existing survey addresses this holistic perspective.
[78] EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models
Xiaomeng Peng,Xilang Huang,Seon Han Choi
Main category: cs.CV
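TL;DR: This paper proposes EAGLE, a tuning-free framework that uses expert model outputs to guide multimodal large language models (MLLMs) toward accurate industrial anomaly detection with interpretable descriptions, and supports its effectiveness with an analysis of attention behavior.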
Details
Motivation: Existing deep learning methods give only binary decisions with little semantic explanation. MLLMs could produce fine-grained language-based analyses, but they require costly fine-tuning and are often less accurate than lightweight specialist detectors.
Method: EAGLE is a tuning-free, expert-augmented attention guidance framework that feeds expert model outputs into MLLMs to improve both detection accuracy and interpretability. The authors also probe its internal effect by analyzing how MLLM intermediate layers distribute attention over anomalous regions.
Result: On MVTec-AD and VisA, EAGLE improves anomaly detection performance across multiple MLLMs, with results comparable to fine-tuning-based methods and without any parameter updates.
Conclusion: EAGLE is an efficient, general, training-free framework for enhancing MLLM-based industrial anomaly detection that combines accuracy with interpretability, and it reveals an association between attention concentrated on anomalous regions and successful detection.
Abstract: Industrial anomaly detection is important for smart manufacturing, but many deep learning approaches produce only binary decisions and provide limited semantic explanations. Multimodal large language models (MLLMs) can potentially generate fine-grained, language-based analyses, yet existing methods often require costly fine-tuning and do not consistently improve anomaly detection accuracy compared to lightweight specialist detectors. We propose expert-augmented attention guidance for industrial anomaly detection in MLLMs (EAGLE), a tuning-free framework that integrates outputs from an expert model to guide MLLMs toward both accurate detection and interpretable anomaly descriptions. We further study how EAGLE affects MLLM internals by examining the attention distribution of MLLMs to the anomalous image regions in the intermediate layers. We observe that successful anomaly detection is associated with increased attention concentration on anomalous regions, and EAGLE tends to encourage this alignment. Experiments on MVTec-AD and VisA show that EAGLE improves anomaly detection performance across multiple MLLMs without any parameter updates, achieving results comparable to fine-tuning based methods. Code is available at https://github.com/shengtun/Eagle
[79] 4D Monocular Surgical Reconstruction under Arbitrary Camera Motions
Jiwei Shan,Zeyu Cai,Cheng-Tai Hsieh,Yirui Li,Hao Liu,Lijun Han,Hesheng Wang,Shing Shin Cheng
Main category: cs.CV
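TL;DR: This paper proposes Local-EndoGS, a high-quality 4D reconstruction framework for deformable surgical scenes from monocular endoscopic video with arbitrary camera motion. It improves robustness and deformation plausibility through window-based local modeling, a coarse-to-fine initialization strategy, and long-range pixel trajectories combined with physical motion priors.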
Details
Motivation: Most existing methods based on implicit neural representations or 3D Gaussian splatting depend on fixed endoscope viewpoints, stereo depth priors, or accurate structure-from-motion (SfM), so they struggle with the monocular, large-motion endoscopic sequences common in clinical practice.
Method: Local-EndoGS (1) uses a progressive, window-based global representation that assigns a local deformable scene model to each observed window; (2) adopts a coarse-to-fine initialization strategy that integrates multi-view geometry, cross-window information, and monocular depth priors; (3) adds long-range 2D pixel trajectory constraints and physical motion priors to improve deformation plausibility.
Result: On three public deformable endoscopic datasets, Local-EndoGS consistently outperforms state-of-the-art methods in appearance quality and geometric accuracy; ablation studies confirm the contribution of each key design.
Conclusion: Local-EndoGS offers a robust, scalable, high-quality 4D reconstruction solution for monocular, large-motion endoscopic video, improving clinical applicability.
Abstract: Reconstructing deformable surgical scenes from endoscopic videos is challenging and clinically important. Recent state-of-the-art methods based on implicit neural representations or 3D Gaussian splatting have made notable progress. However, most are designed for deformable scenes with fixed endoscope viewpoints and rely on stereo depth priors or accurate structure-from-motion for initialization and optimization, limiting their ability to handle monocular sequences with large camera motion in real clinical settings. To address this, we propose Local-EndoGS, a high-quality 4D reconstruction framework for monocular endoscopic sequences with arbitrary camera motion. Local-EndoGS introduces a progressive, window-based global representation that allocates local deformable scene models to each observed window, enabling scalability to long sequences with substantial motion. To overcome unreliable initialization without stereo depth or accurate structure-from-motion, we design a coarse-to-fine strategy integrating multi-view geometry, cross-window information, and monocular depth priors, providing a robust foundation for optimization. We further incorporate long-range 2D pixel trajectory constraints and physical motion priors to improve deformation plausibility. Experiments on three public endoscopic datasets with deformable scenes and varying camera motions show that Local-EndoGS consistently outperforms state-of-the-art methods in appearance quality and geometry. Ablation studies validate the effectiveness of our key designs.
Code will be released upon acceptance at: https://github.com/IRMVLab/Local-EndoGS.
[80] QuPAINT: Physics-Aware Instruction Tuning Approach to Quantum Material Discovery
Xuan-Bac Nguyen,Hoang-Quan Nguyen,Sankalp Pandey,Tim Faltermeier,Nicholas Borys,Hugh Churchill,Khoa Luu
Main category: cs.CV
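TL;DR: This paper proposes QuPAINT, a physics-aware multimodal framework that combines physics-based synthetic data (Synthia), the first instruction dataset for quantum materials (QMat-Instruct), and a physics-informed attention mechanism to improve the generalization and robustness of layer-count identification for two-dimensional quantum materials in optical microscopy images. It also establishes a standardized benchmark, QF-Bench.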
Details
Motivation: Existing vision models perform poorly at characterizing two-dimensional quantum materials from optical images because layer-dependent contrast is subtle, labeled data is scarce, and variation across laboratories and equipment is large. They also lack physical priors and generalize poorly to new materials or hardware conditions.
Method: The physics-aware multimodal framework comprises (1) Synthia, a synthetic data generator based on a thin-film interference model; (2) QMat-Instruct, the first large-scale, physics-informed multimodal instruction dataset; (3) QuPAINT, an instruction-tuning approach for multimodal large models with a Physics-Informed Attention module that fuses in optical priors; (4) QF-Bench, a comprehensive benchmark spanning multiple materials, substrates, and imaging conditions.
Result: The framework reduces dependence on manual annotation, improves generalization and robustness of thickness identification on unseen materials and new equipment conditions, and supports reproducible, standardized evaluation.
Conclusion: Building physical priors into data generation, instruction construction, and model architecture is an effective way to make AI more reliable and interpretable for scientific image analysis with few samples and high variability.
Abstract: Characterizing two-dimensional quantum materials from optical microscopy images is challenging due to the subtle layer-dependent contrast, limited labeled data, and significant variation across laboratories and imaging setups. Existing vision models struggle in this domain since they lack physical priors and cannot generalize to new materials or hardware conditions. This work presents a new physics-aware multimodal framework that addresses these limitations from both the data and model perspectives. We first present Synthia, a physics-based synthetic data generator that simulates realistic optical responses of quantum material flakes under thin-film interference. Synthia produces diverse and high-quality samples, helping reduce the dependence on expert manual annotation. We introduce QMat-Instruct, the first large-scale instruction dataset for quantum materials, comprising multimodal, physics-informed question-answer pairs designed to teach Multimodal Large Language Models (MLLMs) to understand the appearance and thickness of flakes. Then, we propose Physics-Aware Instruction Tuning (QuPAINT), a multimodal architecture that incorporates a Physics-Informed Attention module to fuse visual embeddings with optical priors, enabling more robust and discriminative flake representations. Finally, we establish QF-Bench, a comprehensive benchmark spanning multiple materials, substrates, and imaging settings, offering standardized protocols for fair and reproducible evaluation.
[81] Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection
Yichen Lu,Siwei Nie,Minlong Lu,Xudong Yang,Xiaobo Zhang,Peng Zhang
Main category: cs.CV
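TL;DR: This paper proposes PixTrace and CopyNCE, which improve the performance and interpretability of image copy detection through pixel-level coordinate tracking and a geometrically guided contrastive loss.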
Details
Motivation: Existing self-supervised image copy detection methods based on view-level contrastive learning struggle with sophisticated edits because they do not learn fine-grained correspondences.
Method: The PixTrace module explicitly models pixel-space mappings under editing transformations, and the CopyNCE loss uses overlap ratios verified by PixTrace to regularize patch affinity learning.
Result: The method reaches state-of-the-art performance on DISC21 (matcher: 88.7% uAP / 83.9% RP90; descriptor: 72.6% uAP / 68.4% RP90) with better interpretability.
Conclusion: Combining pixel-level geometric traceability with patch-level similarity learning suppresses supervision noise in self-supervised training and improves the robustness and interpretability of ICD systems.
Abstract: Image Copy Detection (ICD) aims to identify manipulated content between image pairs through robust feature representation learning. While self-supervised learning (SSL) has advanced ICD systems, existing view-level contrastive methods struggle with sophisticated edits due to insufficient fine-grained correspondence learning. We address this limitation by exploiting the inherent geometric traceability in edited content through two key innovations. First, we propose PixTrace, a pixel coordinate tracking module that maintains explicit spatial mappings across editing transformations. Second, we introduce CopyNCE, a geometrically-guided contrastive loss that regularizes patch affinity using overlap ratios derived from PixTrace's verified mappings. Our method bridges pixel-level traceability with patch-level similarity learning, suppressing supervision noise in SSL training. Extensive experiments demonstrate not only state-of-the-art performance (88.7% uAP / 83.9% RP90 for matcher, 72.6% uAP / 68.4% RP90 for descriptor on DISC21 dataset) but also better interpretability over existing methods.
[82] FoundationPose-Initialized 3D-2D Liver Registration for Surgical Augmented Reality
Hanyuan Zhang,Lucas He,Runlong He,Abdolrahim Kadkhodamohammadi,Danail Stoyanov,Brian R. Davidson,Evangelos B. Mazomenos,Matthew J. Clarkson
Main category: cs.CV
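TL;DR: This paper proposes a lightweight non-rigid registration method that combines laparoscopic depth maps with a foundation pose estimator and replaces finite-element models with NICP for liver deformation registration. It achieves a mean registration error of 9.91 mm on real patient data, with clinically relevant accuracy and a simpler engineering implementation.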
Details
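Motivation: Existing augmented reality systems for laparoscopic liver surgery rely on organ-contour registration and often perform non-rigid registration with finite-element (FE) models, which demand high modeling complexity and expertise; a lighter, easier-to-deploy alternative is needed.
Method: Laparoscopic depth maps are fused with a foundation pose estimator for camera-liver pose estimation, and non-rigid iterative closest point (NICP) replaces the FE model for deformable registration, giving a combined rigid + NICP pipeline.
Result: On data from 3 real patients, the depth-augmented foundation pose approach reached a mean registration error of 9.91 mm; rigid + NICP registration clearly outperformed rigid-only registration, supporting NICP as an efficient substitute for FE models.
Conclusion: The method keeps clinically relevant accuracy while substantially lowering modeling complexity and the engineering barrier, offering a more practical and transferable non-rigid registration option for intraoperative AR navigation.
Abstract: Augmented reality can improve tumor localization in laparoscopic liver surgery. Existing registration pipelines typically depend on organ contours; deformable (non-rigid) alignment is often handled with finite-element (FE) models coupled to dimensionality-reduction or machine-learning components. We integrate laparoscopic depth maps with a foundation pose estimator for camera-liver pose estimation and replace FE-based deformation with non-rigid iterative closest point (NICP) to lower engineering/modeling complexity and expertise requirements. On real patient data, the depth-augmented foundation pose approach achieved 9.91 mm mean registration error in 3 cases. Combined rigid-NICP registration outperformed rigid-only registration, demonstrating NICP as an efficient substitute for finite-element deformable models. This pipeline achieves clinically relevant accuracy while offering a lightweight, engineering-friendly alternative to FE-based deformation.
[83] LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs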
Behzad Bozorgtabar,Dwarikanath Mahapatra,Sudipta Roy,Muzammal Naseer,Imran Razzak,Zongyuan Ge
Main category: cs.CV
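TL;DR: This paper proposes LATA, which uses Laplacian smoothing and a failure-aware conformal score to improve the efficiency and class balance of zero-shot prediction sets from medical vision-language models under domain shift, without breaking exchangeability.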
Details
Motivation: In medical vision-language models, split conformal prediction (SCP) suffers from overly large prediction sets, unbalanced class-wise coverage (high CCV), and loss of theoretical guarantees when adapting to calibration labels breaks exchangeability; these problems are most pronounced in few-shot, imbalanced settings.
Method: LATA (Laplacian-Assisted Transductive Adaptation): (1) build an image k-NN graph over the joint calibration and test pool and apply Laplacian smoothing to the zero-shot probabilities with a small number of CCCP mean-field updates; (2) design a failure-aware conformal score that plugs into the ViLU framework to model instance difficulty and label plausibility. The whole procedure is training-free and label-free (or uses only calibration marginals) and preserves SCP validity.
Result: Across 3 medical VLMs and 9 downstream tasks, LATA markedly reduces prediction set size and CCV while meeting the target coverage; it outperforms prior transductive baselines and approaches label-using methods at very low compute cost. Ablations and qualitative analyses confirm that it sharpens predictions without harming exchangeability.
Conclusion: LATA is an efficient, lightweight zero-shot uncertainty calibration method with theoretical guarantees, offering a new route to reliable deployment of medical VLMs under real clinical domain shift.
Abstract: Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage unbalanced (a high class-conditioned coverage gap, CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose LATA (Laplacian-Assisted Transductive Adaptation), a training- and label-free refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image-image k-NN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a failure-aware conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency and class-wise balance at fixed coverage. LATA is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant using calibration marginals once.
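As a rough illustration of the two ingredients named above, the sketch below pairs a label-free smoothing pass over a neighbor graph with standard split conformal calibration. It is a minimal sketch under stated assumptions, not LATA itself: plain neighbor averaging stands in for the CCCP mean-field updates, the score 1 - p(y) stands in for the failure-aware score, the neighbor lists are assumed to be given, and the names smooth, conformal_threshold, and prediction_set are hypothetical.

```python
import math


def smooth(probs, neighbors, alpha=0.5, steps=3):
    """Label-free smoothing over a given neighbor graph.

    probs[i] is the zero-shot class distribution of image i in the joint
    calibration + test pool; neighbors[i] lists the indices of its graph
    neighbors. ASSUMPTION: plain neighbor averaging is used here as a
    stand-in for the CCCP mean-field updates mentioned in the abstract.
    """
    current = [row[:] for row in probs]
    for _ in range(steps):
        updated = []
        for i, row0 in enumerate(probs):
            nbrs = neighbors[i]
            if not nbrs:
                updated.append(current[i][:])
                continue
            mean = [sum(current[j][c] for j in nbrs) / len(nbrs)
                    for c in range(len(row0))]
            # Mix the original zero-shot row with the neighbor mean.
            mixed = [(1 - alpha) * p0 + alpha * m for p0, m in zip(row0, mean)]
            total = sum(mixed)
            updated.append([v / total for v in mixed])
        current = updated
    return current


def conformal_threshold(cal_probs, cal_labels, miscoverage=0.1):
    """Split conformal quantile of the scores 1 - p(true label)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - miscoverage))
    if rank > n:
        # Too few calibration points: every class must be included.
        return float("inf")
    return scores[rank - 1]


def prediction_set(p, qhat):
    """All classes whose score 1 - p(class) does not exceed the threshold."""
    return [c for c, pc in enumerate(p) if 1.0 - pc <= qhat]
```

Because the smoothing step never looks at labels and treats calibration and test points identically, it is applied once to the whole pool before the calibration rows are scored, which mirrors the label-free, deterministic-transform argument in the abstract.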
Across three medical VLMs and nine downstream tasks, LATA consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that LATA sharpens zero-shot predictions without compromising exchangeability.
[84] GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking
Zixu Cheng,Da Li,Jian Hu,Ziquan Liu,Wei Li,Shaogang Gong
Main category: cs.CV
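TL;DR: This paper proposes GraphThinker, a reinforcement-finetuning-based method that reduces hallucinations in video reasoning by constructing event-based video scene graphs (EVSG) and strengthening visual grounding.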
Details
Motivation: Existing video reasoning models do not explicitly model causal relationships between events, so they tend to hallucinate during reasoning; such relationships are also implicit and costly to annotate by hand.
Method: GraphThinker (1) uses a multimodal large language model (MLLM) to build an event-based video scene graph (EVSG) that explicitly models intra- and inter-event relations; (2) feeds the EVSG into the MLLM as an intermediate reasoning process; (3) adds a visual attention reward during reinforcement finetuning to strengthen visual grounding.
Result: On the RexTime and VidHalluc datasets, GraphThinker outperforms prior methods at modeling object and event relations and at precise event localization, and significantly reduces hallucinations in video reasoning.
Conclusion: Combining explicit structured causal modeling (such as EVSG) with stronger visual grounding effectively mitigates MLLM hallucinations in video reasoning and makes the reasoning more reliable.
Abstract: Video reasoning requires understanding the causal relationships between events in a video. However, such relationships are often implicit and costly to annotate manually. While existing multimodal large language models (MLLMs) often infer event relations through dense captions or video summaries for video reasoning, such modeling still lacks causal understanding. Without explicit causal structure modeling within and across video events, these models suffer from hallucinations during video reasoning. In this work, we propose GraphThinker, a reinforcement finetuning-based method that constructs structural event-level scene graphs and enhances visual grounding to jointly reduce hallucinations in video reasoning. Specifically, we first employ an MLLM to construct an event-based video scene graph (EVSG) that explicitly models both intra- and inter-event relations, and incorporate these formed scene graphs into the MLLM as an intermediate thinking process. We also introduce a visual attention reward during reinforcement finetuning, which strengthens video grounding and further mitigates hallucinations. We evaluate GraphThinker on two datasets, RexTime and VidHalluc, where it shows superior ability to capture object and event relations with more precise event localization, reducing hallucinations in video reasoning compared to prior methods.
[85] RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward
Qiucheng Wu,Jing Shi,Simon Jenni,Kushal Kafle,Tianyu Wang,Shiyu Chang,Handong Zhao
Main category: cs.CV
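TL;DR: This paper proposes RetouchIQ, a framework in which multimodal large language model (MLLM) agents, guided by a generalist reward model, perform instruction-based executable image editing. Tool-use plans are optimized with RL, and the system significantly outperforms existing methods in semantic consistency and perceptual quality.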
Details
Motivation: Existing reinforcement-learning-based image editing methods lack reliable, verifiable reward signals that reflect the subjective nature of creative editing.
Method: The RetouchIQ framework comprises (1) an MLLM agent that interprets editing intent and generates executable parameter adjustments; (2) a generalist reward model (an RL fine-tuned MLLM) that generates multi-dimensional evaluation metrics case by case and outputs scalar feedback, supporting high-quality, instruction-consistent reinforcement learning; (3) a dataset of 190k instruction-reasoning pairs and a new benchmark.
Result: RetouchIQ significantly surpasses previous MLLM-based and diffusion-based editing systems in semantic consistency and perceptual quality.
Conclusion: Generalist reward-driven MLLM agents can serve as flexible, explainable, and executable assistants for professional image editing.
Abstract: Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable image adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model, an RL fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. Then, the reward model provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems.
Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.
[86] Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment
Ivan Rinaldi,Matteo Mendula,Nicola Fanelli,Florence Levé,Matteo Testi,Giovanna Castellano,Gennaro Vessio
Main category: cs.CV
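TL;DR: This paper proposes the ArtSound dataset and the ArtToMus framework, which generate music directly from artwork images for the first time, avoiding the intermediate image-to-text step that conventional methods rely on and advancing end-to-end visual-to-audio generation research.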
Details
Motivation: Existing image-conditioned music generation methods are limited in two ways: (i) their training data consists mostly of natural photographs, which makes it hard to capture the semantic, stylistic, and cultural content of artworks; (ii) most rely on image-to-text conversion, using language as a semantic shortcut that prevents true direct visual-to-audio learning.
Method: The authors build ArtSound, a large-scale dataset of 105,884 artwork-music pairs, and propose the ArtToMus framework, which projects the visual embeddings of artwork images directly into the conditioning space of a latent diffusion model for end-to-end image-to-music generation without language supervision.
Result: Music generated by ArtToMus is musically coherent and stylistically consistent and reflects salient visual cues of the source artworks; although its absolute cross-modal alignment scores are lower than those of text-conditioned methods, its perceptual quality is competitive and it shows meaningful visual-audio correspondence.
Conclusion: The work establishes direct visual-to-music generation as a distinct and challenging research direction and commits to releasing data and code to support applications in multimedia art, cultural heritage, and AI-assisted creative practice.
Abstract: Music generation has advanced markedly through multimodal deep learning, enabling models to synthesize audio from text and, more recently, from images. However, existing image-conditioned systems suffer from two fundamental limitations: (i) they are typically trained on natural photographs, limiting their ability to capture the richer semantic, stylistic, and cultural content of artworks; and (ii) most rely on an image-to-text conversion stage, using language as a semantic shortcut that simplifies conditioning but prevents direct visual-to-audio learning. Motivated by these gaps, we introduce ArtSound, a large-scale multimodal dataset of 105,884 artwork-music pairs enriched with dual-modality captions, obtained by extending ArtGraph and the Free Music Archive. We further propose ArtToMus, the first framework explicitly designed for direct artwork-to-music generation, which maps digitized artworks to music without image-to-text translation or language-based semantic supervision. The framework projects visual embeddings into the conditioning space of a latent diffusion model, enabling music synthesis guided solely by visual information. Experimental results show that ArtToMus generates musically coherent and stylistically consistent outputs that reflect salient visual cues of the source artworks. While absolute alignment scores remain lower than those of text-conditioned systems (as expected given the substantially increased difficulty of removing linguistic supervision), ArtToMus achieves competitive perceptual quality and meaningful cross-modal correspondence.
This work establishes direct visual-to-music generation as a distinct and challenging research direction, and provides resources that support applications in multimedia art, cultural heritage, and AI-assisted creative practice. Code and dataset will be publicly released upon acceptance.
[87] Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery
Jowaria Khan,Anindya Sarkar,Yevgeniy Vorobeychik,Elizabeth Bondi-Kelly
Main category: cs.CV
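TL;DR: This paper proposes a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning, using concept relevance modeling to find hidden targets more efficiently under sparse, biased geospatial data.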
Details
Motivation: In real-world settings such as environmental monitoring, disaster response, and public health, data collection is costly, environments change over time, and ground-truth labels are sparse and biased, which limits the applicability of existing learning-based methods such as reinforcement learning.
Method: Two innovations built on concept relevance: (1) a concept-weighted uncertainty sampling strategy that modulates uncertainty using readily available domain concepts (e.g., land cover, source proximity); (2) a relevance-aware meta-batch formation strategy that increases semantic diversity and generalization in online meta-learning.
Result: Experiments on a real-world PFAS contamination dataset show that the method reliably uncovers targets with limited data and in a changing environment.
Conclusion: The framework significantly improves the efficiency and robustness of discovering sparse geospatial targets and offers a new paradigm for exploring dynamic environments under resource constraints.
Abstract: In many real-world settings, such as environmental monitoring, disaster response, or public health, with costly and difficult data collection and dynamic environments, strategically sampling from unobserved regions is essential for efficiently uncovering hidden targets under tight resource constraints. Yet, sparse and biased geospatial ground truth limits the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of concept relevance, which captures how domain-specific factors influence target presence: a concept-weighted uncertainty sampling strategy, where uncertainty is modulated by learned relevance based on readily available domain-specific concepts (e.g., land cover, source proximity); and a relevance-aware meta-batch formation strategy that promotes semantic diversity during online-meta updates, improving generalization in dynamic environments. Our experiments include testing on a real-world dataset of cancer-causing PFAS (per- and polyfluoroalkyl substances) contamination, showcasing our method's reliability at uncovering targets with limited data and a varying environment.
[88] CORAL: Correspondence Alignment for Improved Virtual Try-On
Jiyoung Kim,Youngjin Shin,Siyoon Jin,Dahyun Chung,Jisu Nam,Tongmin Kim,Jongjae Park,Hyeonwoo Kang,Seungryong Kim
Main category: cs.CV
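TL;DR: This paper proposes the CORAL framework, which explicitly aligns query-key matching with external correspondences to improve garment detail preservation and person-garment correspondence accuracy in virtual try-on.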
Details
Motivation: Existing virtual try-on methods struggle to preserve fine garment details in unpaired settings and do not explicitly model person-garment correspondence; in particular, how this correspondence emerges in Diffusion Transformers remains unclear.
Method: The authors first analyze full 3D attention in DiT and find that person-garment correspondence depends on precise query-key matching. They then propose the CORAL framework, which includes a correspondence distillation loss and an entropy minimization loss, and design an evaluation protocol based on vision-language models.
Result: CORAL outperforms the baseline in both global shape transfer and local detail preservation, and ablations validate each component.
Conclusion: Explicitly modeling and optimizing person-garment query-key matching is key to better virtual try-on quality; CORAL offers a new approach to improving the interpretability and performance of DiTs on VTON.
Abstract: Existing methods for Virtual Try-On (VTON) often struggle to preserve fine garment details, especially in unpaired settings where accurate person-garment correspondence is required. These methods do not explicitly enforce person-garment alignment and fail to explain how correspondence emerges within Diffusion Transformers (DiTs). In this paper, we first analyze full 3D attention in DiT-based architecture and reveal that the person-garment correspondence critically depends on precise person-garment query-key matching within the full 3D attention. Building on this insight, we then introduce CORrespondence ALignment (CORAL), a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. CORAL integrates two complementary components: a correspondence distillation loss that aligns reliable matches with person-garment attention, and an entropy minimization loss that sharpens the attention distribution. We further propose a VLM-based evaluation protocol to better reflect human preference. CORAL consistently improves over the baseline, enhancing both global shape transfer and local detail preservation. Extensive ablations validate our design choices.
[89] IntRec: Intent-based Retrieval with Contrastive Refinement
Pourya Shamsolmoali,Masoumeh Zareapoor,Eric Granger,Yue Lu
Main category: cs.CV
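TL;DR: This paper proposes IntRec, an interactive object retrieval framework driven by user feedback. It maintains a dual memory of positive anchors and negative constraints and uses a contrastive alignment function for fine-grained disambiguation, significantly improving retrieval accuracy on LVIS and LVIS-Ambiguous.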
Details
Motivation: Existing open-vocabulary detectors perform one-shot inference and cannot refine predictions iteratively from user feedback, so they handle ambiguous queries or queries involving several similar objects poorly.
Method: The IntRec framework is built around an Intent State (IS) that holds two memories: positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function maximizes the similarity between candidate objects and positive cues while penalizing negative constraints.
Result: IntRec reaches 35.4 AP on LVIS, exceeding OVMR, CoDet, and CAKE by 2.3, 3.7, and 0.5, respectively; on LVIS-Ambiguous, a single round of feedback adds 7.9 AP, with less than 30 ms of added latency per interaction.
Conclusion: IntRec delivers efficient interactive object retrieval without additional supervision and significantly improves accuracy and robustness in complex and ambiguous scenes.
Abstract: Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
[90] Human-level 3D shape perception emerges from multi-view learning
Tyler Bonnen,Jitendra Malik,Angjoo Kanazawa
Main category: cs.CV
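TL;DR: This paper proposes a new multi-view neural network framework trained with visual-spatial objectives (such as camera location and depth) over natural scene images. Without object priors, it matches human 3D shape inference accuracy zero-shot and predicts human error patterns and reaction times.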
Details
Motivation: Models of the human ability to infer 3D structure from 2D images have long fallen short of human performance, which calls for computational frameworks closer to human perceptual mechanisms.
Method: A visual-spatial self-supervised learning framework over multi-view images of natural scenes trains neural networks to predict spatial information such as camera location and depth; a zero-shot evaluation compares model and human behavior on a standard 3D perception task.
Result: The models are the first to reach human-level accuracy on 3D shape inference; independent readouts of model responses predict fine-grained human behavior, including error distributions and reaction times.
Conclusion: Human-like 3D perception can emerge from naturalistic visual-spatial data and a simple, scalable learning objective alone.
Abstract: Humans can infer the three-dimensional structure of objects from two-dimensional visual inputs. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we develop a modeling framework that predicts human 3D shape inferences for arbitrary objects, directly from experimental stimuli. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data; given a set of images taken from different locations within a natural scene, these models learn to predict spatial information related to these images, such as camera location and visual depth, without relying on any object-related inductive biases. Notably, these visual-spatial signals are analogous to sensory cues readily available to humans. We design a zero-shot evaluation approach to determine the performance of these 'multi-view' models on a well-established 3D perception task, then compare model and human behavior. Our modeling framework is the first to match human accuracy on 3D shape inferences, even without task-specific training or fine-tuning. Remarkably, independent readouts of model responses predict fine-grained measures of human behavior, including error patterns and reaction times, revealing a natural correspondence between model dynamics and human perception. Taken together, our findings indicate that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data.
All code, human behavioral data, and experimental stimuli needed to reproduce our findings can be found on our project page.
[91] When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs
Yu Fang,Yuchun Feng,Dong Jing,Jiaqi Liu,Yue Yang,Zhenyu Wei,Daniel Szafir,Mingyu Ding
Main category: cs.CV
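TL;DR: This paper proposes Counterfactual Action Guidance (CAG), which pairs a language-unconditioned Vision-Action (VA) module with a standard VLA policy to mitigate counterfactual failures caused by visual shortcuts. It significantly improves language following and task success without architectural changes or additional training.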
Details
Motivation: Without strong scene-specific supervision, existing Vision-Language-Action (VLA) models are vulnerable to dataset biases and rely on visual shortcuts instead of language instructions, which produces counterfactual failures.
Method: CAG is a dual-branch inference scheme that combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, performs a counterfactual comparison during action selection, and thereby explicitly regularizes language conditioning. It needs no additional data, architectural changes, or model retraining.
Result: On the newly built LIBERO-CF counterfactual benchmark, CAG improves the language-following accuracy of π₀.₅ by 9.7% (training-free) and by 15.5% (paired with a VA model), and task success by 3.6% and 8.5%, respectively; in real-robot experiments, counterfactual failures drop by 9.4% and task success rises by 17.2% on average.
Conclusion: CAG is a plug-and-play, training-free, general-purpose enhancement that effectively reduces the reliance of VLAs on visual shortcuts and significantly improves the robustness and generalization of their language following.
Abstract: Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice often fail to faithfully follow language. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act based on vision shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training regardless of language intent. To systematically study it, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs that evaluates language following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate its plug-and-play integration across diverse VLAs and consistent improvements.
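The abstract does not state how the two branches are combined. One plausible reading, borrowed from classifier-free guidance and offered here only as an assumption, is to extrapolate from the language-unconditioned prediction toward the language-conditioned one. The function name guided_action and the scale parameter are hypothetical and do not come from the paper.

```python
def guided_action(vla_action, va_action, scale=1.5):
    """Combine a language-conditioned and a language-unconditioned action.

    ASSUMPTION: a classifier-free-guidance-style rule, not the paper's
    confirmed formula. vla_action is the output of the standard VLA policy
    (conditioned on vision and language); va_action is the output of the
    Vision-Action branch (vision only). With scale > 1 the component that
    depends on the instruction is amplified; scale = 1 returns the VLA
    output unchanged, and scale = 0 returns the vision-only output.
    """
    if len(vla_action) != len(va_action):
        raise ValueError("action vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(vla_action, va_action)]


if __name__ == "__main__":
    conditioned = [1.0, 0.0]    # action suggested when the instruction is read
    unconditioned = [0.5, 0.0]  # action suggested from the image alone
    print(guided_action(conditioned, unconditioned, scale=2.0))
```

Under this reading, dimensions where both branches agree are left untouched, while dimensions where the instruction changes the prediction are pushed further in the instruction's direction, which is one way to counteract a visual shortcut at inference time.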
For example, on LIBERO-CF, CAG improves $π_{0.5}$ by 9.7% in language following accuracy and 3.6% in task success on under-observed tasks using a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.
[92] OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents
Akashah Shabbir,Muhammad Umer Sheikh,Muhammad Akhtar Munir,Hiyam Debary,Mustansar Fiaz,Muhammad Zaigham Zaheer,Paolo Fraccaro,Fahad Shahbaz Khan,Muhammad Haris Khan,Xiao Xiang Zhu,Salman Khan
Main category: cs.CV
TL;DR: OpenEarthAgent is a unified framework for geospatial reasoning agents trained on satellite imagery and natural-language queries, using structured reasoning traces and GIS-based tools to improve multimodal reasoning in remote sensing.
Details
Motivation: Extending multimodal reasoning to remote sensing is challenging due to spatial scale, geographic structures, and multispectral indices; existing models lack coherent multi-step logic for geospatial tasks.
Method: Supervised fine-tuning on structured reasoning trajectories involving tool-augmented interactions (e.g., GIS operations, NDVI/NBR/NDBI analysis) over a curated corpus of 14,538 training and 1,169 evaluation instances.
Result: The agent achieves structured reasoning, stable spatial understanding, and interpretable tool-driven behavior, with consistent improvements over strong baselines and competitive performance against recent open/closed-source models.
Conclusion: OpenEarthAgent bridges the gap in multimodal geospatial reasoning by grounding model behavior in explicit, verified reasoning traces and domain-specific tools.
Abstract: Recent progress in multimodal reasoning has enabled agents that can interpret imagery, connect it with language, and perform structured analytical tasks. Extending such capabilities to the remote sensing domain remains challenging, as models must reason over spatial scale, geographic structures, and multispectral indices while maintaining coherent multi-step logic. To bridge this gap, OpenEarthAgent introduces a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multistep tool interactions across diverse analytical contexts. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split. It spans urban, environmental, disaster, and infrastructure domains, and incorporates GIS-based operations alongside index analyses such as NDVI, NBR, and NDBI.
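For readers unfamiliar with the indices named above, each is a normalized difference of two reflectance bands. The sketch below uses the standard definitions; the function names and the zero-denominator convention are illustrative choices and are not taken from OpenEarthAgent's tool set.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when both bands are zero.

    The zero-denominator convention is an assumption made for this sketch.
    """
    denom = a + b
    return 0.0 if denom == 0 else (a - b) / denom


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def nbr(nir, swir2):
    """Normalized Burn Ratio: (NIR - SWIR2) / (NIR + SWIR2)."""
    return normalized_difference(nir, swir2)


def ndbi(swir1, nir):
    """Normalized Difference Built-up Index: (SWIR1 - NIR) / (SWIR1 + NIR)."""
    return normalized_difference(swir1, nir)


if __name__ == "__main__":
    # Hypothetical surface reflectance values for a single vegetated pixel.
    print(ndvi(nir=0.5, red=0.1))
    print(nbr(nir=0.5, swir2=0.2))
    print(ndbi(swir1=0.3, nir=0.5))
```

All three values fall in [-1, 1] for non-negative reflectances. In general, dense vegetation pushes NDVI toward 1, burned areas lower NBR, and built-up surfaces raise NDBI, which is why an agent would pick a different index for vegetation, fire, and urban queries.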
Grounded in explicit reasoning traces, the learned agent demonstrates structured reasoning, stable spatial understanding, and interpretable behaviour through tool-driven geospatial interactions across diverse conditions. We report consistent improvements over a strong baseline and competitive performance relative to recent open and closed-source models.
cs.OH [Back]
[93] A Conceptual Hybrid Framework for Post-Quantum Security: Integrating BB84 QKD, AES, and Bio-inspired Mechanisms
Md. Ismiel Hossen Abir
Main category: cs.OH
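TL;DR: This paper proposes a hybrid security framework for the post-quantum era that combines AES, BB84 QKD, quantum state comparison, and an immune-like system to counter the threat that Shor's algorithm poses to RSA; it is currently a conceptual design that has not been implemented or rigorously validated.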