cs.CL

[1] GhazalBench: Usage-Grounded Evaluation of LLMs on Persian Ghazals

Ghazal Kalhor,Yadollah Yaghoobzadeh

Main category: cs.CL

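TL;DR: This paper introduces GhazalBench, a benchmark for evaluating how large language models (LLMs) handle Persian ghazals under usage-grounded conditions. It tests understanding of poetic meaning and exact recall of canonical verse form, and finds that models are clearly weaker at recalling form, a weakness tied to training-data exposure rather than to architectural limits.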

Details

Motivation: Persian poetry, especially the ghazal, holds an important place in Iranian cultural practice, where verses are often quoted, paraphrased, or completed. Existing language models can grasp poetic meaning but struggle to reproduce canonical verses exactly, so a benchmark grounded in cultural practice that covers both meaning and form is needed. Method: Build GhazalBench with two tasks: (1) producing faithful prose paraphrases of ghazal couplets; (2) recognizing or completing canonical verses under varying semantic and formal cues. Evaluate several proprietary and open-weight multilingual LLMs, with English sonnets as a parallel control. Result: All tested models show a consistent dissociation between strong semantic understanding and weak form reproduction: verse recall is low in completion tasks, while recognition tasks substantially narrow the gap. The English control shows markedly higher recall, suggesting the problem stems from limited exposure to Persian poetry in training data. Conclusion: Current LLMs show a gap between meaning and form when modeling culturally embedded poetry, and evaluation frameworks should jointly assess meaning, form, and cue-dependent access. GhazalBench provides a reproducible benchmark and open resources for this purpose. Abstract: Persian poetry plays an active role in Iranian cultural practice, where verses by canonical poets such as Hafez are frequently quoted, paraphrased, or completed from partial cues. Supporting such interactions requires language models to engage not only with poetic meaning but also with culturally entrenched surface form. We introduce GhazalBench, a benchmark for evaluating how large language models (LLMs) interact with Persian ghazals under usage-grounded conditions. GhazalBench assesses two complementary abilities: producing faithful prose paraphrases of couplets and accessing canonical verses under varying semantic and formal cues. Across several proprietary and open-weight multilingual LLMs, we observe a consistent dissociation: models generally capture poetic meaning but struggle with exact verse recall in completion-based settings, while recognition-based tasks substantially reduce this gap. A parallel evaluation on English sonnets shows markedly higher recall performance, suggesting that these limitations are tied to differences in training exposure rather than inherent architectural constraints. Our findings highlight the need for evaluation frameworks that jointly assess meaning, form, and cue-dependent access to culturally significant texts. GhazalBench is available at https://github.com/kalhorghazal/GhazalBench.

[2] Large Language Models and Book Summarization: Reading or Remembering, Which Is Better?

Tairan Fu,Javier Conde,Pedro Reviriego,Javier Coronado-Blázquez,Nina Melero,Elena Merino-Gómez

Main category: cs.CL

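TL;DR: This paper compares LLM book summaries produced from internal knowledge alone with summaries produced from the full text. Full-text input generally yields more detailed summaries, yet for some well-known books the internal-knowledge summaries score higher, which calls current assumptions about long-text summarization ability into question.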

Details

Motivation: Examine the quality gap between summaries generated from an LLM's internal training knowledge and summaries generated from the full input text, and whether internal knowledge influences full-text summaries. Method: For several well-known books, generate summaries in two ways, from internal knowledge only and from the full text given as input, and compare them experimentally. Result: Full-text input generally produces more detailed summaries, but for some books the internal-knowledge summaries score higher in evaluation. Conclusion: In some cases an LLM's internal knowledge can outperform full-text summarization, which points to limits and unpredictability in long-text summarization and to a need to revisit how the task is evaluated. Abstract: Summarization is a core task in Natural Language Processing (NLP). Recent advances in Large Language Models (LLMs) and the introduction of large context windows reaching millions of tokens make it possible to process entire books in a single prompt. At the same time, for well-known books, LLMs can generate summaries based only on internal knowledge acquired during training. This raises several important questions: How do summaries generated from internal memory compare to those derived from the full text? Does prior knowledge influence summaries even when the model is given the book as input? In this work, we conduct an experimental evaluation of book summarization with state-of-the-art LLMs. We compare summaries of well-known books produced using (i) only the internal knowledge of the model and (ii) the full text of the book. The results show that having the full text provides more detailed summaries in general, but some books have better scores for the internal knowledge summaries. This puts into question the capabilities of models to perform summarization of long texts, as information learned during training can outperform summarization of the full text in some cases.

[3] AraModernBERT: Transtokenized Initialization and Long-Context Encoder Modeling for Arabic

Omar Elshehy,Omer Nacar,Abdelbasset Djamai,Muhammed Ragab,Khloud Al Jallad,Mona Abdelazim

Main category: cs.CL

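TL;DR: This paper presents AraModernBERT, a ModernBERT encoder adapted to Arabic. Transtokenized embedding initialization and native long-context modeling (up to 8,192 tokens) substantially improve Arabic masked language modeling and downstream task performance.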

Details

Motivation: Encoder-only transformer models are still widely used for discriminative NLP tasks, but recent architectural advances have focused mainly on English. Practical studies on adapting modern encoder architectures to Arabic and other languages written in Arabic-derived scripts are lacking. Method: Adapt the ModernBERT encoder architecture to Arabic, introduce a transtokenized embedding initialization strategy, support native long-context modeling (up to 8,192 tokens), and evaluate on Arabic masked language modeling and several downstream tasks. Result: Transtokenized initialization substantially improves masked language modeling performance. AraModernBERT is stable and effective at long context lengths and transfers well to downstream tasks (inference, offensive language detection, question-question similarity, named entity recognition). Conclusion: Transtokenized embedding initialization is essential for Arabic language modeling, and native long-context support improves intrinsic language modeling at extended sequence lengths. The work offers practical guidance for adapting modern encoders to Arabic and other languages written in Arabic-derived scripts. Abstract: Encoder-only transformer models remain widely used for discriminative NLP tasks, yet recent architectural advances have largely focused on English. In this work, we present AraModernBERT, an adaptation of the ModernBERT encoder architecture to Arabic, and study the impact of transtokenized embedding initialization and native long-context modeling up to 8,192 tokens. We show that transtokenization is essential for Arabic language modeling, yielding dramatic improvements in masked language modeling performance compared to non-transtokenized initialization. We further demonstrate that AraModernBERT supports stable and effective long-context modeling, achieving improved intrinsic language modeling performance at extended sequence lengths. Downstream evaluations on Arabic natural language understanding tasks, including inference, offensive language detection, question-question similarity, and named entity recognition, confirm strong transfer to discriminative and sequence labeling settings. Our results highlight practical considerations for adapting modern encoder architectures to Arabic and other languages written in Arabic-derived scripts.

[4] An Efficient Hybrid Deep Learning Approach for Detecting Online Abusive Language

Vuong M. Ngo,Cach N. Dang,Kien V. Nguyen,Mark Roantree

Main category: cs.CL

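TL;DR: This paper proposes a hybrid deep learning model combining BERT, CNN, and LSTM to detect abusive language across platforms (YouTube, forums, the dark web), reaching about 99% on all reported evaluation metrics on a highly imbalanced dataset.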

Details

Motivation: Online harassment, bullying, hate speech, and similar harmful behavior are growing, and authors of abusive content often use obscure words or coded phrases to evade detection, so more robust abusive language detection is needed. Method: Propose a hybrid deep learning model that combines BERT (semantic modeling), CNN (local feature extraction), and LSTM (sequence modeling) with a ReLU activation function, trained on an imbalanced dataset of 77,620 abusive and 272,214 non-abusive samples. Result: The model reaches about 99% on Precision, Recall, Accuracy, F1-score, and AUC, indicating strong detection of abusive language in multi-platform, highly skewed data. Conclusion: The hybrid architecture captures semantic, contextual, and sequential patterns in text and offers an efficient, practical approach to abusive language detection on highly imbalanced, multi-source data in real-world settings. Abstract: The digital age has expanded social media and online forums, allowing free expression for nearly 45% of the global population. Yet, it has also fueled online harassment, bullying, and harmful behaviors like hate speech and toxic comments across social networks, messaging apps, and gaming communities. Studies show 65% of parents notice hostile online behavior, and one-third of adolescents in mobile games experience bullying. A substantial volume of abusive content is generated and shared daily, not only on the surface web but also within dark web forums. Creators of abusive comments often employ specific words or coded phrases to evade detection and conceal their intentions. To address these challenges, we propose a hybrid deep learning model that integrates BERT, CNN, and LSTM architectures with a ReLU activation function to detect abusive language across multiple online platforms, including YouTube comments, online forum discussions, and dark web posts. The model demonstrates strong performance on a diverse and imbalanced dataset containing 77,620 abusive and 272,214 non-abusive text samples (ratio 1:3.5), achieving approximately 99% across evaluation metrics such as Precision, Recall, Accuracy, F1-score, and AUC. This approach effectively captures semantic, contextual, and sequential patterns in text, enabling robust detection of abusive content even in highly skewed datasets, as encountered in real-world scenarios.
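The class counts quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is not from the paper: the function names are hypothetical, and the "majority baseline" is simply the accuracy of a classifier that always predicts the non-abusive class, included to show why accuracy alone says little on skewed data.

```python
# Class counts as reported in the abstract.
ABUSIVE = 77_620
NON_ABUSIVE = 272_214


def imbalance_ratio(minority: int, majority: int) -> float:
    """Number of majority-class samples per minority-class sample."""
    return majority / minority


def majority_baseline_accuracy(minority: int, majority: int) -> float:
    """Accuracy of a trivial classifier that always predicts the majority class."""
    return majority / (minority + majority)


ratio = imbalance_ratio(ABUSIVE, NON_ABUSIVE)
baseline = majority_baseline_accuracy(ABUSIVE, NON_ABUSIVE)
print(f"ratio 1:{ratio:.1f}, majority-class accuracy {baseline:.3f}")
```

The ratio rounds to the 1:3.5 stated in the abstract, and the trivial baseline already reaches roughly 78% accuracy, which is why the class-sensitive metrics (Precision, Recall, F1-score, AUC) matter here.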

[5] The Dunning-Kruger Effect in Large Language Models: An Empirical Study of Confidence Calibration

Sudipta Ghosh,Mrityunjoy Panday

Main category: cs.CL

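TL;DR: This paper studies whether large language models (LLMs) show a pattern resembling the Dunning-Kruger effect, the human cognitive bias in which less competent individuals overestimate their ability. An empirical analysis of four recent models on four benchmark datasets finds that the weakest model (Kimi K2) is severely overconfident (ECE = 0.726 at only 23.3% accuracy), while a stronger model (Claude Haiku 4.5) is better calibrated (ECE = 0.122 at 75.4% accuracy). The pattern is analogous to the Dunning-Kruger effect and matters for safe deployment of LLMs in high-stakes settings.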

Details

Motivation: LLMs perform well on many tasks, but how accurately they assess their own confidence remains unclear. Humans show the Dunning-Kruger effect (low performers overestimate themselves), and the authors ask whether LLMs show a similar systematic calibration bias. Method: Evaluate four state-of-the-art models (Claude Haiku 4.5, Gemini 2.5 Pro, Gemini 2.5 Flash, and Kimi K2) on four benchmark datasets totaling 24,000 trials, quantify calibration with metrics such as Expected Calibration Error (ECE), and compare accuracy against calibration error. Result: Kimi K2 is severely overconfident (ECE = 0.726, accuracy 23.3%), while Claude Haiku 4.5 is best calibrated (ECE = 0.122, accuracy 75.4%). Overall, the worse a model performs, the more overconfident it is, a pattern analogous to the Dunning-Kruger effect. Conclusion: LLMs show a systematic calibration bias resembling the human Dunning-Kruger effect, with poorly performing models more prone to overestimating the reliability of their answers. High-stakes applications therefore need stronger confidence calibration and trustworthiness evaluation. Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet their ability to accurately assess their own confidence remains poorly understood. We present an empirical study investigating whether LLMs exhibit patterns reminiscent of the Dunning-Kruger effect -- a cognitive bias where individuals with limited competence tend to overestimate their abilities. We evaluate four state-of-the-art models (Claude Haiku 4.5, Gemini 2.5 Pro, Gemini 2.5 Flash, and Kimi K2) across four benchmark datasets totaling 24,000 experimental trials. Our results reveal striking calibration differences: Kimi K2 exhibits severe overconfidence with an Expected Calibration Error (ECE) of 0.726 despite only 23.3% accuracy, while Claude Haiku 4.5 achieves the best calibration (ECE = 0.122) with 75.4% accuracy. These findings demonstrate that poorly performing models display markedly higher overconfidence -- a pattern analogous to the Dunning-Kruger effect in human cognition. We discuss implications for safe deployment of LLMs in high-stakes applications.
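For readers unfamiliar with the metric, Expected Calibration Error is commonly computed by grouping predictions into equal-width confidence bins and averaging the gap between mean confidence and accuracy in each bin, weighted by bin size. The sketch below follows that standard definition; the bin count of 10 and the toy data are assumptions, and the paper's exact binning is not stated in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: sum over bins of (bin share) * |mean confidence - accuracy|.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    n_bins=10 is a common default, not a value taken from the paper.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(o for _, o in items) / len(items)
        ece += (len(items) / total) * abs(mean_conf - accuracy)
    return ece


# Toy example (hypothetical data): a model that always reports 90% confidence
# but is right only half the time has a calibration gap of 0.4.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, True, False])
print(round(toy_ece, 3))
```

Under this definition, a high ECE combined with low accuracy, as reported for the weakest model, means stated confidence far exceeds the observed hit rate.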

[6] Quantifying Hallucinations in Language Language Models on Medical Textbooks

Brandon C. Colelough,Davis Bartels,Dina Demner-Fushman

Main category: cs.CL

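TL;DR: This paper studies hallucination in large language models on medical question answering. LLaMA-70B-Instruct hallucinates in 19.7% of answers on textbook-grounded QA, and across models lower hallucination rates align with higher clinician-rated usefulness (ρ = −0.71, p = 0.058).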

Details

Motivation: Existing medical QA benchmarks rarely evaluate hallucination against a fixed evidence source, yet hallucination seriously undermines model trustworthiness and needs systematic evaluation. Method: Run two experiments: the first measures the hallucination rate of LLaMA-70B-Instruct on novel medical QA prompts; the second compares hallucination rates across several models with clinician preference for the responses and clinician agreement. Result: In experiment one, LLaMA-70B-Instruct hallucinated in 19.7% of answers (95% CI: 18.6 to 20.7), even though 98.8% of responses received the maximal plausibility rating. In experiment two, hallucination rate was negatively correlated with usefulness scores (ρ = −0.71, p = 0.058, which does not reach the conventional 0.05 significance level). Clinician agreement was high in experiment one (quadratic weighted κ = 0.92). Conclusion: Hallucination is common in medical QA and closely tied to clinical usefulness. Reducing it is key to clinical applicability, and stricter evidence-grounded benchmarks are needed. Abstract: Hallucination, the tendency for large language models to provide responses with factually incorrect and unsupported claims, is a serious problem within natural language processing for which we do not yet have an effective mitigation. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur on textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments: the first experiment to determine the prevalence of hallucinations for a prominent open source large language model (LLaMA-70B-Instruct) in medical QA given novel prompts, and the second experiment to determine the prevalence of hallucinations and clinician preference for model responses. We observed, in experiment one, with the passages provided, LLaMA-70B-Instruct hallucinated in 19.7% of answers (95% CI 18.6 to 20.7) even though 98.8% of prompt responses received maximal plausibility, and observed in experiment two, across models, lower hallucination rates aligned with higher usefulness scores ($ρ=-0.71$, $p=0.058$). Clinicians produced high agreement (quadratic weighted $κ=0.92$) and ($τ_b=0.06$ to $0.18$, $κ=0.57$ to $0.61$) for experiments 1 and 2, respectively.

[7] Evolving Demonstration Optimization for Chain-of-Thought Feature Transformation

Xinyuan Wang,Kunpeng Liu,Arun Vignesh Malarkkan,Yanjie Fu

Main category: cs.CL

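TL;DR: This paper proposes a new framework for LLM-based feature transformation (FT). It evolves an experience library of task-verified transformation trajectories in a closed loop and combines diversity-aware context selection with chain-of-thought guidance, improving downstream predictive performance and generation stability.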

Details

Motivation: Existing feature transformation methods face a large search space, sample inefficiency, and invalid or redundant candidates, while current LLM-based methods rely on static demonstrations and lack diversity, task alignment, and generalization. Method: Build a closed-loop optimization framework: use reinforcement learning to explore high-performing feature transformation sequences and form an experience library of trajectories verified on the downstream task; design a diversity-aware selector that constructs contexts dynamically; and combine it with chain-of-thought prompting to guide the LLM toward high-quality transformations. Result: On multiple tabular benchmarks the method outperforms classical and existing LLM-based FT baselines, is more stable than one-shot generation, works with both API-based and open-source LLMs, and remains robust across downstream evaluators. Conclusion: Evolving trajectory-level experience and optimizing context dynamically improve the validity, diversity, and task alignment of LLM-driven feature transformation and offer a scalable approach for data-centric AI. Abstract: Feature Transformation (FT) is a core data-centric AI task that improves feature space quality to advance downstream predictive performance. However, discovering effective transformations remains challenging due to the large space of feature-operator combinations. Existing solutions rely on discrete search or latent generation, but they are frequently limited by sample inefficiency, invalid candidates, and redundant generations with limited coverage. Large Language Models (LLMs) offer strong priors for producing valid transformations, but current LLM-based FT methods typically rely on static demonstrations, resulting in limited diversity, redundant outputs, and weak alignment with downstream objectives. We propose a framework that optimizes context data for LLM-driven FT by evolving trajectory-level experiences in a closed loop. Starting from high-performing feature transformation sequences explored by reinforcement learning, we construct and continuously update an experience library of downstream task-verified transformation trajectories, and use a diversity-aware selector to form contexts along with a chain-of-thought and guide transformed feature generation toward higher performance. Experiments on diverse tabular benchmarks show that our method outperforms classical and LLM-based baselines and is more stable than one-shot generation. The framework generalizes across API-based and open-source LLMs and remains robust across downstream evaluators.

[8] Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations

Ajay Pravin Mahale

Main category: cs.CL

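TL;DR: This paper proposes a pipeline that turns circuit-level interpretability analysis into natural-language explanations, combining activation patching, template-based and LLM-based generation, and ERASER-style faithfulness evaluation. It is validated on the IOI task and reveals distributed backup mechanisms in the model, along with three failure categories in which explanations diverge from mechanisms.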

Details

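Motivation: Mechanistic interpretability can identify internal circuits, but effective methods for turning them into human-understandable natural-language explanations are still lacking. Method: Build a three-stage pipeline: (i) identify causally important attention heads via activation patching; (ii) generate natural-language explanations with both template-based and LLM-based methods; (iii) evaluate faithfulness (sufficiency and comprehensiveness) with ERASER-style metrics adapted to circuit-level attribution. Result: On the IOI task in GPT-2 Small, six key attention heads account for 61.4% of the logit difference. Circuit-based explanations reach 100% sufficiency but only 22% comprehensiveness. LLM-generated explanations outperform the template baseline by 64% on quality metrics. Model confidence and explanation faithfulness are uncorrelated (r = 0.009). Three failure categories are identified in which explanations diverge from the actual mechanism. Conclusion: Circuit-level analysis can support high-quality, highly sufficient natural-language explanations, but comprehensiveness is limited by the model's distributed redundant mechanisms. Explanation faithfulness does not track model confidence, so the mapping from mechanism to language needs targeted modeling and failure diagnosis. Abstract: Mechanistic interpretability identifies internal circuits responsible for model behaviors, yet translating these findings into human-understandable explanations remains an open problem. We present a pipeline that bridges circuit-level analysis and natural language explanations by (i) identifying causally important attention heads via activation patching, (ii) generating explanations using both template-based and LLM-based methods, and (iii) evaluating faithfulness using ERASER-style metrics adapted for circuit-level attribution. We evaluate on the Indirect Object Identification (IOI) task in GPT-2 Small (124M parameters), identifying six attention heads accounting for 61.4% of the logit difference. Our circuit-based explanations achieve 100% sufficiency but only 22% comprehensiveness, revealing distributed backup mechanisms. LLM-generated explanations outperform template baselines by 64% on quality metrics. We find no correlation (r = 0.009) between model confidence and explanation faithfulness, and identify three failure categories explaining when explanations diverge from mechanisms.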
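To make the two faithfulness notions concrete, the sketch below expresses them as fractions of a task score such as the IOI logit difference. This is an illustrative reading, not the paper's definition: the abstract does not say how its ERASER-style metrics are normalized, so the ratio form, the function names, and the example scores are all assumptions.

```python
def sufficiency_fraction(full_score: float, circuit_only_score: float) -> float:
    """Share of the full model's score retained when only the circuit is kept.

    Assumed ratio form; values near 1.0 mean the circuit alone reproduces the behavior.
    """
    return circuit_only_score / full_score


def comprehensiveness_fraction(full_score: float, without_circuit_score: float) -> float:
    """Share of the full model's score lost when the circuit is ablated.

    Assumed ratio form; low values mean other components compensate for the ablation.
    """
    return (full_score - without_circuit_score) / full_score


# Hypothetical logit differences, chosen only to illustrate the pattern described above.
full = 4.0             # full model
circuit_only = 4.0     # everything outside the circuit ablated
without_circuit = 3.12  # circuit heads ablated, rest of the model intact

suff = sufficiency_fraction(full, circuit_only)
comp = comprehensiveness_fraction(full, without_circuit)
print(f"sufficiency {suff:.0%}, comprehensiveness {comp:.0%}")
```

High sufficiency together with low comprehensiveness is the signature the abstract attributes to distributed backup mechanisms: the identified heads are enough to perform the task, yet removing them costs little because other components take over.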

Heimo Müller,Dominik Steiger,Markus Plass,Andreas Holzinger

Main category: cs.CL

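TL;DR: This paper proposes the System Hallucination Scale (SHS), a lightweight, human-centered instrument for measuring hallucination-related behavior of large language models (LLMs) in interaction. It covers factual unreliability, incoherence, misleading presentation, and responsiveness to user guidance. SHS takes the user's perspective, is not an automatic detector, shows high reliability and validity in a study with 210 participants, and complements SUS and SCS.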

Details

Motivation: Existing hallucination evaluation mostly relies on automatic metrics or is detached from real interaction, and lacks a user-centered measurement tool that is fast, interpretable, and applicable across domains. Method: Drawing on established psychometric instruments such as SUS and SCS, design the multi-dimensional SHS questionnaire, then validate its reliability and validity in a realistic interaction study with 210 participants using statistics such as Cronbach's alpha and correlation analysis. Result: SHS shows high clarity, coherent response behavior, and good construct validity (Cronbach's alpha = 0.87, significant inter-dimension correlations with p < 0.001). Comparison with SUS and SCS shows complementary measurement properties. Conclusion: SHS is a practical, reliable, user-centered hallucination assessment tool suited to comparative analysis, iterative development, and deployment monitoring of LLM systems. Abstract: We introduce the System Hallucination Scale (SHS), a lightweight and human-centered measurement instrument for assessing hallucination-related behavior in large language models (LLMs). Inspired by established psychometric tools such as the System Usability Scale (SUS) and the System Causability Scale (SCS), SHS enables rapid, interpretable, and domain-agnostic evaluation of factual unreliability, incoherence, misleading presentation, and responsiveness to user guidance in model-generated text. SHS is explicitly not an automatic hallucination detector or benchmark metric; instead, it captures how hallucination phenomena manifest from a user perspective under realistic interaction conditions. A real-world evaluation with 210 participants demonstrates high clarity, coherent response behavior, and construct validity, supported by statistical analysis including internal consistency (Cronbach's alpha = 0.87) and significant inter-dimension correlations (p < 0.001). Comparative analysis with SUS and SCS reveals complementary measurement properties, supporting SHS as a practical tool for comparative analysis, iterative system development, and deployment monitoring.
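Cronbach's alpha, the internal-consistency statistic quoted above, is computed from item variances and the variance of respondents' total scores: alpha = k/(k-1) * (1 - sum of item variances / variance of totals). The sketch below implements that textbook formula with sample variances; the questionnaire data are made up for illustration and are not SHS responses.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a questionnaire.

    items: list of k lists; items[i][r] is respondent r's score on item i.
    Uses sample variances (n - 1 in the denominator).
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    n = len(items[0])
    if n < 2 or any(len(scores) != n for scores in items):
        raise ValueError("every item needs the same respondents (at least two)")

    # Total score per respondent, summed across items.
    totals = [sum(scores) for scores in zip(*items)]
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("total scores have zero variance")

    item_var_sum = sum(variance(scores) for scores in items)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical responses: three items answered identically by four respondents,
# i.e. perfectly consistent items.
consistent = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
alpha = cronbach_alpha(consistent)
print(round(alpha, 3))
```

Values close to 1 indicate that the items move together across respondents; the 0.87 reported for SHS sits in the range usually read as good internal consistency.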

[10] A Two-Stage Architecture for NDA Analysis: LLM-based Segmentation and Transformer-based Clause Classification

Ana Begnini,Matheus Vicente,Leonardo Souza

Main category: cs.CL

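TL;DR: This paper proposes an LLM-based architecture that automatically segments and classifies clauses in Non-Disclosure Agreements (NDAs), using LLaMA-3.1-8B-Instruct for clause segmentation and a fine-tuned Legal-RoBERTa-Large for clause classification, with high accuracy.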

Details

Motivation: NDAs in commercial contracts vary widely in format, structure, and writing style, which makes manual analysis slow and error-prone, so automation is needed. Method: Use a two-stage architecture: stage one segments the NDA text into clauses (clause extraction) with LLaMA-3.1-8B-Instruct; stage two classifies the extracted clauses with a fine-tuned Legal-RoBERTa-Large model. Result: Clause segmentation reaches a ROUGE F1 of 0.95 ± 0.0036 and clause classification a weighted F1 of 0.85, supporting the feasibility and precision of the approach. Conclusion: The two-stage LLM-based method parses NDA contracts into structured form efficiently and accurately, offering a workable path for automated processing of legal text. Abstract: In business-to-business relations, it is common to establish Non-Disclosure Agreements (NDAs). However, these documents exhibit significant variation in format, structure, and writing style, making manual analysis slow and error-prone. We propose an architecture based on LLMs to automate the segmentation and clause classification within these contracts. We employed two models: LLaMA-3.1-8B-Instruct for NDA segmentation (clause extraction) and a fine-tuned Legal-Roberta-Large for clause classification. In the segmentation task, we achieved a ROUGE F1 of 0.95 +/- 0.0036; for classification, we obtained a weighted F1 of 0.85, demonstrating the feasibility and precision of the approach.
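The weighted F1 reported for clause classification averages per-class F1 scores using each class's share of the true labels as the weight. A minimal standard-library sketch of that definition follows; the clause labels in the example are hypothetical and not the paper's label set.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores.

    Per-class F1 is computed as 2*TP / (2*TP + FP + FN); classes that never
    appear in y_true receive zero weight.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


# Hypothetical clause labels for four clauses.
truth = ["confidentiality", "confidentiality", "confidentiality", "term"]
preds = ["confidentiality", "confidentiality", "term", "term"]
example_score = weighted_f1(truth, preds)
print(round(example_score, 4))
```

Because the weights follow class frequency, frequent clause types dominate the score, so a weighted F1 of 0.85 can coexist with weaker performance on rare clause types.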

[11] PoultryLeX-Net: Domain-Adaptive Dual-Stream Transformer Architecture for Large-Scale Poultry Stakeholder Modeling

Stephen Afrifa,Biswash Khatiwada,Kapalik Khanal,Sanjay Shah,Lingjuan Wang-Li,Ramesh Bahadur Bist

Main category: cs.CL

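TL;DR: This study presents PoultryLeX-Net, a lexicon-enhanced, domain-adaptive dual-stream transformer for fine-grained sentiment analysis of poultry-related text. It outperforms several baselines in accuracy (97.35%), F1 score (96.67%), and AUC-ROC (99.61%).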

Details

Motivation: Social media text about the poultry industry suffers from contextual ambiguity and linguistic variability, and general-purpose language models lack domain knowledge, so an accurate and interpretable domain-specific sentiment analysis method is needed. Method: Build the PoultryLeX-Net dual-stream transformer framework: a lexicon-guided stream captures poultry-specific terminology and sentiment cues, and a contextual stream models long-range semantic dependencies. The model adds a gated cross-attention mechanism and uses LDA topic modeling to improve interpretability. Result: On sentiment classification, PoultryLeX-Net reaches 97.35% accuracy, a 96.67% F1 score, and 99.61% AUC-ROC, outperforming baselines such as CNN, DistilBERT, and RoBERTa. Conclusion: Domain adaptation and dual-stream attention markedly improve sentiment analysis of poultry text and provide scalable analytics for poultry production decision support. Abstract: The rapid growth of the global poultry industry, driven by rising demand for affordable animal protein, has intensified public discourse surrounding production practices, housing, management, animal welfare, and supply-chain transparency. Social media platforms such as X (formerly Twitter) generate large volumes of unstructured textual data that capture stakeholder sentiment across the poultry industry. Extracting accurate sentiment signals from this domain-specific discourse remains challenging due to contextual ambiguity, linguistic variability, and limited domain awareness in general-purpose language models. This study presents PoultryLeX-Net, a lexicon-enhanced, domain-adaptive dual-stream transformer framework for fine-grained sentiment analysis in poultry-related text. The proposed architecture integrates sentiment classification, topic modeling, and contextual representation learning through domain-specific embeddings and gated cross-attention mechanisms. A lexicon-guided stream captures poultry-specific terminology and sentiment cues, while a contextual stream models long-range semantic dependencies. Latent Dirichlet Allocation is employed to identify dominant thematic structures associated with production management and welfare-related discussions, providing complementary interpretability to sentiment predictions. PoultryLeX-Net was evaluated against multiple baseline models, including convolutional neural network and pre-trained transformer architectures such as DistilBERT and RoBERTa. PoultryLeX-Net consistently outperformed all baselines, achieving an accuracy of 97.35%, an F1 score of 96.67%, and an area under the receiver operating characteristic curve (AUC-ROC) of 99.61% across sentiment classification tasks. Overall, domain adaptation and dual-stream attention markedly improve sentiment classification, enabling scalable intelligence for poultry production decision support.

[12] TAMUSA-Chat: A Domain-Adapted Large Language Model Conversational System for Research and Responsible Deployment

Izzat Alsmadi,Anas Alsobeh

Main category: cs.CL

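TL;DR: This paper presents TAMUSA-Chat, a framework for building large language model conversational systems adapted to an institutional domain. It covers data acquisition, preprocessing, embedding construction, model training, and deployment, with an emphasis on reproducibility, compliance, and responsible AI practices.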

Details

Motivation: Address the key challenges of adapting general-purpose foundation models to a specific institutional setting, including domain adaptation, transparency, governance compliance, and ethical considerations. Method: Use supervised fine-tuning (SFT), retrieval-augmented generation (RAG), and systematic evaluation to build a modular, reproducible end-to-end architecture covering data collection, preprocessing, embedding, training, and deployment. Result: A prototype conversational system grounded in institutional context; an empirical analysis of domain adaptation efficiency, computational cost, and quality-cost trade-offs across model sizes and training iterations; and an open-source codebase. Conclusion: TAMUSA-Chat offers universities and similar institutions a reproducible, compliant, and responsible approach to LLM deployment and supports continued research on practice, evaluation, and ethics in educational AI. Abstract: This paper presents TAMUSA-Chat, a research-oriented framework for building domain-adapted large language model conversational systems. The work addresses critical challenges in adapting general-purpose foundation models to institutional contexts through supervised fine-tuning, retrieval-augmented generation, and systematic evaluation methodologies. We describe the complete architecture encompassing data acquisition from institutional sources, preprocessing pipelines, embedding construction, model training workflows, and deployment strategies. The system integrates modular components enabling reproducible experimentation with training configurations, hyper-parameters, and evaluation protocols. Our implementation demonstrates how academic institutions can develop contextually grounded conversational agents while maintaining transparency, governance compliance, and responsible AI practices. Through empirical analysis of fine-tuning behavior across model sizes and training iterations, we provide insights into domain adaptation efficiency, computational resource requirements, and quality-cost trade-offs. The publicly available codebase at https://github.com/alsmadi/TAMUSA_LLM_Based_Chat_app supports continued research into institutional LLM deployment, evaluation methodologies, and ethical considerations for educational AI systems.

[13] CEI: A Benchmark for Evaluating Pragmatic Reasoning in Language Models

Jon Chun,Hannah Sussman,Adrian Mangine,Murathan Kocaman,Kirill Sidorko,Abhigya Koirala,Andre McCloud,Gwen Eisenbeis,Wisdom Akanwe,Moustapha Gassama,Eliezer Gonzalez Chirinos,Anne-Duncan Enright,Peter Dunson,Tiffanie Ng,Anna von Rosenstiel,Godwin Idowu

Main category: cs.CL

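TL;DR: This paper presents the Contextual Emotional Inference (CEI) benchmark, 300 human-validated scenarios for evaluating how well large language models interpret five types of pragmatically complex utterances (such as sarcasm and passive aggression) in contexts with explicit power relations.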

Details

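Motivation: Large language models still struggle with pragmatic reasoning beyond literal meaning (for example, inferring what a speaker implies), so a benchmark that models fine-grained context and social relations is needed. Method: Build the CEI benchmark, covering 5 pragmatic subtypes, 4 settings, and 3 power configurations. Three annotators label every scenario independently, a 4-level quality control pipeline combines automated statistical checks with expert adjudication, and Fleiss' kappa is reported as the agreement measure. Result: The released CEI dataset contains 300 scenarios, each with an explicit context, speaker-listener roles with power relations, and an ambiguous utterance. Inter-annotator agreement is low (Fleiss' kappa = 0.06 to 0.25 by subtype), which the authors treat as a legitimate reflection of pragmatic ambiguity. The dataset is released under CC-BY-4.0. Conclusion: CEI provides a high-quality benchmark for evaluating and advancing pragmatic reasoning in LLMs, with a focus on power sensitivity and tolerance of multiple readings, and stresses that pragmatic understanding is situated, socially embedded, and not single-answer. Abstract: Pragmatic reasoning, inferring intended meaning beyond literal semantics, underpins everyday communication yet remains difficult for large language models. We present the Contextual Emotional Inference (CEI) Benchmark: 300 human-validated scenarios for evaluating how well LLMs disambiguate pragmatically complex utterances. Each scenario pairs a situational context and speaker-listener roles (with explicit power relations) against an ambiguous utterance. The dataset covers five pragmatic subtypes (sarcasm/irony, mixed signals, strategic politeness, passive aggression, deflection/misdirection) drawn from workplace, family, social, and service settings, with three power configurations (peer, higher-to-lower, lower-to-higher). Three trained annotators independently labeled every scenario. Inter-annotator agreement (Fleiss' kappa = 0.06-0.25 by subtype) is low but expected: pragmatic inference admits multiple valid readings, and the disagreement itself is informative. We describe our annotation methodology, including a 4-level quality control pipeline that combines automated statistical checks with expert adjudication. CEI is released under CC-BY-4.0.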
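Fleiss' kappa compares the observed agreement among a fixed number of raters with the agreement expected by chance from the overall label distribution. The sketch below implements the standard formula; the rating table is a toy example with three raters and two categories, not CEI data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a fixed number of raters per item.

    counts[i][j] is the number of raters who assigned item i to category j;
    every row must sum to the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")

    # Observed agreement: for each item, the share of rater pairs that agree.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")

    return (p_observed - p_expected) / (1 - p_expected)


# Toy tables: three raters, two categories.
unanimous = fleiss_kappa([[3, 0], [0, 3]])  # every item rated identically
split = fleiss_kappa([[2, 1], [1, 2]])      # a 2-vs-1 split on every item
print(round(unanimous, 3), round(split, 3))
```

Values near 0 mean agreement is about what chance would produce, so the reported 0.06 to 0.25 range indicates slight to fair agreement, consistent with the authors' point that these utterances admit several valid readings.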

[14] Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives

Ruchira Dhar,Qiwei Peng,Anders Søgaard

Main category: cs.CL

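TL;DR: This paper studies compositionality in large language models (LLMs) on adjective-noun composition. Internal representations are compositional, but functional task performance is inconsistent, which underlines the need for contrastive evaluation to understand model capabilities fully.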

Details

Motivation: Examine how LLMs perform on linguistic compositionality, specifically adjective-noun composition, to understand whether they have human-like compositional ability. Method: Use two complementary setups: prompt-based functional assessment and representational analysis of internal model states. Result: LLMs reliably develop compositional representations internally, but this does not translate consistently into functional task success, and results differ markedly across model variants. Conclusion: Task performance alone does not reflect a model's actual capabilities. Contrastive evaluation that sets functional performance against internal representations is needed for a more complete picture of compositionality in LLMs. Abstract: Compositionality is considered central to language abilities. As performant language systems, how do large language models (LLMs) do on compositional tasks? We evaluate adjective-noun compositionality in LLMs using two complementary setups: prompt-based functional assessment and a representational analysis of internal model states. Our results reveal a striking divergence between task performance and internal states. While LLMs reliably develop compositional representations, they fail to translate consistently into functional task success across model variants. Consequently, we highlight the importance of contrastive evaluation for obtaining a more complete understanding of model capabilities.

[15] Context Over Compute Human-in-the-Loop Outperforms Iterative Chain-of-Thought Prompting in Interview Answer Quality

Kewen Zhu,Zixi Liu,Yanjing Li

Main category: cs.CL

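TL;DR: This paper examines the challenges of evaluating behavioral interviews with large language models. Two controlled experiments on 50 behavioral interview question-answer pairs compare a human-in-the-loop approach with automated chain-of-thought prompting for evaluating and improving answers. The human-in-the-loop approach significantly raises candidate confidence and authenticity, needs fewer iterations, integrates personal details more completely, and converges more often. The paper also proposes an adversarial challenging mechanism based on a negativity bias model (Bar Raiser) to simulate realistic interviewer behavior.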

Details

Motivation: Using large language models for behavioral interview evaluation raises challenges in structured assessment, realistic interviewer simulation, and pedagogical value, so more effective, interpretable, and instructive methods for evaluation and improvement are needed. Method: Run two controlled experiments with a within-subject paired design (n = 50) comparing human-in-the-loop improvement with fully automated chain-of-thought prompting, analyze convergence behavior, and propose an adversarial challenging mechanism based on negativity bias (Bar Raiser). Result: The human-in-the-loop approach significantly improves confidence (3.16 to 4.16, p < 0.001) and authenticity (2.94 to 4.53, p < 0.001, d = 3.21), needs 1.0 iteration instead of 5.0, and reaches a 100% success rate versus 84% (h = 0.82). Both methods converge rapidly (mean iterations below one) and additional iterations give diminishing returns. The Bar Raiser mechanism has not yet been validated quantitatively. Conclusion: Chain-of-thought prompting is a useful foundation for behavioral interview evaluation, but it needs domain-specific enhancements (such as human-in-the-loop intervention and adversarial challenge) and context-aware method selection to produce realistic, efficient, and pedagogically valuable evaluation and training. Abstract: Behavioral interview evaluation using large language models presents unique challenges that require structured assessment, realistic interviewer behavior simulation, and pedagogical value for candidate training. We investigate chain of thought prompting for interview answer evaluation and improvement through two controlled experiments with 50 behavioral interview question and answer pairs. Our contributions are threefold. First, we provide a quantitative comparison between human in the loop and automated chain of thought improvement. Using a within subject paired design with n equals 50, both approaches show positive rating improvements. The human in the loop approach provides significant training benefits. Confidence improves from 3.16 to 4.16 (p less than 0.001) and authenticity improves from 2.94 to 4.53 (p less than 0.001, Cohen's d is 3.21). The human in the loop method also requires five times fewer iterations (1.0 versus 5.0, p less than 0.001) and achieves full personal detail integration. Second, we analyze convergence behavior. Both methods converge rapidly with mean iterations below one, with the human in the loop approach achieving a 100 percent success rate compared to 84 percent for automated approaches among initially weak answers (Cohen's h is 0.82, large effect). Additional iterations provide diminishing returns, indicating that the primary limitation is context availability rather than computational resources. Third, we propose an adversarial challenging mechanism based on a negativity bias model, named bar raiser, to simulate realistic interviewer behavior, although quantitative validation remains future work. Our findings demonstrate that while chain of thought prompting provides a useful foundation for interview evaluation, domain specific enhancements and context aware approach selection are essential for realistic and pedagogically valuable results.
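The effect size quoted for the two success rates can be reproduced from the proportions alone. Cohen's h is the difference between the arcsine-transformed proportions, h = 2·asin(√p1) − 2·asin(√p2); the sketch below applies that standard formula to the 100% and 84% success rates from the abstract. The function name is hypothetical.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the difference between two proportions."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Success rates reported in the abstract: human-in-the-loop vs. automated.
h = cohens_h(1.00, 0.84)
print(round(h, 2))
```

The result rounds to the 0.82 stated in the abstract, which is above the 0.8 threshold conventionally labeled a large effect.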

[16] There Are No Silly Questions: Evaluation of Offline LLM Capabilities from a Turkish Perspective

Edibe Yilmaz,Kahraman Kostas

Main category: cs.CL

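TL;DR: This study systematically evaluates the robustness and pedagogical safety of locally deployed offline large language models (LLMs) for Turkish heritage language education. It builds a Turkish Anomaly Suite (TAS) of 10 edge cases and finds that anomaly resistance does not depend on parameter count alone and that sycophancy bias can create pedagogical risk. Reasoning-oriented models with 8B to 14B parameters strike the best balance between cost and safety.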

Details

Motivation: Address the data privacy and reliability challenges that arise when LLMs are integrated into education, particularly in pedagogically sensitive settings such as Turkish heritage language education. Method: Build the Turkish Anomaly Suite (TAS), 10 original edge-case scenarios that assess epistemic resistance, logical consistency, and pedagogical safety, and evaluate 14 locally deployable offline LLMs ranging from 270M to 32B parameters. Result: Anomaly resistance is not determined by model scale alone. Even large-scale models can show sycophancy bias that poses pedagogical risk. Reasoning-oriented models in the 8B to 14B range offer the best trade-off between cost and pedagogical safety. Conclusion: Locally deployed small to mid-sized reasoning-oriented LLMs are better suited to sensitive teaching settings such as Turkish heritage language education. Relying on model scale alone to improve pedagogical reliability is a mistake, and bias modeling and pedagogical safety evaluation deserve attention. Abstract: The integration of large language models (LLMs) into educational processes introduces significant constraints regarding data privacy and reliability, particularly in pedagogically vulnerable contexts such as Turkish heritage language education. This study aims to systematically evaluate the robustness and pedagogical safety of locally deployable offline LLMs within the context of Turkish heritage language education. To this end, a Turkish Anomaly Suite (TAS) consisting of 10 original edge-case scenarios was developed to assess the models' capacities for epistemic resistance, logical consistency, and pedagogical safety. Experiments conducted on 14 different models ranging from 270M to 32B parameters reveal that anomaly resistance is not solely dependent on model scale and that sycophancy bias can pose pedagogical risks even in large-scale models. The findings indicate that reasoning-oriented models in the 8B--14B parameter range represent the most balanced segment in terms of cost-safety trade-off for language learners.

[17] Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations

Michael Keeman,Anastasia Keeman

Main category: cs.CL

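TL;DR: This study presents the first clinical measurement of empathy and safety for GPT-4o, o4-mini, and GPT-5-mini across 14 emotionally challenging conversational scenarios. Empathy scores do not differ statistically among the three models, but crisis detection improves while advice safety declines. What users perceived as "lost empathy" was in fact a change in safety posture.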

Details

Motivation: Respond to the claim made by users protesting under #keep4o that newer models had "lost their empathy", a claim no published study had tested. Method: Design 14 emotionally challenging conversational scenarios in the mental health and AI companion domains, produce 2,100 scored AI responses, and assess them on six psychological safety dimensions using clinically grounded rubrics. Add per-turn trajectory analysis to capture changes at key moments within a conversation. Result: Empathy scores do not differ significantly across the three models (p = 0.115). Crisis detection improves across generations (p = 0.001), while advice safety declines (p < 0.001). In critical scenarios such as self-harm, GPT-5-mini detects crises far more consistently than GPT-4o. Conclusion: The perceived "loss of empathy" is a safety-posture trade-off, a shift from a cautious model that missed crises to an alert model that sometimes says too much. The shift improves crisis response but has real consequences for vulnerable users, and current aggregate evaluation does not capture the trade-off. Abstract: When OpenAI deprecated GPT-4o in early 2026, thousands of users protested under #keep4o, claiming newer models had "lost their empathy." No published study has tested this claim. We conducted the first clinical measurement, evaluating three OpenAI model generations (GPT-4o, o4-mini, GPT-5-mini) across 14 emotionally challenging conversational scenarios in mental health and AI companion domains, producing 2,100 scored AI responses assessed on six psychological safety dimensions using clinically-grounded rubrics. Empathy scores are statistically indistinguishable across all three models (Kruskal-Wallis H=4.33, p=0.115). What changed is the safety posture: crisis detection improved monotonically from GPT-4o to GPT-5-mini (H=13.88, p=0.001), while advice safety declined (H=16.63, p<0.001). Per-turn trajectory analysis -- a novel methodological contribution -- reveals these shifts are sharpest during mid-conversation crisis moments invisible to aggregate scoring. In a self-harm scenario involving a minor, GPT-4o scored 3.6/10 on crisis detection during early disclosure turns; GPT-5-mini never dropped below 7.8. What users perceived as "lost empathy" was a shift from a cautious model that missed crises to an alert model that sometimes says too much -- a trade-off with real consequences for vulnerable users, currently invisible to both the people who feel it and the developers who create it.
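The p-values quoted alongside the Kruskal-Wallis H statistics can be sanity-checked by hand. Under the usual chi-square approximation, H for three groups is referred to a chi-square distribution with 3 − 1 = 2 degrees of freedom, whose upper-tail probability has the closed form exp(−H/2). The sketch below assumes that approximation was used, which the abstract does not state, and the function name is hypothetical.

```python
import math


def kruskal_p_three_groups(h_statistic: float) -> float:
    """Approximate p-value for a Kruskal-Wallis H with three groups.

    Uses the chi-square approximation with 2 degrees of freedom, where the
    survival function is exactly exp(-x / 2).
    """
    if h_statistic < 0:
        raise ValueError("H cannot be negative")
    return math.exp(-h_statistic / 2)


# H statistics as reported in the abstract.
p_empathy = kruskal_p_three_groups(4.33)
p_crisis = kruskal_p_three_groups(13.88)
p_advice = kruskal_p_three_groups(16.63)
print(f"{p_empathy:.3f} {p_crisis:.3f} {p_advice:.4f}")
```

The three values agree with the reported p = 0.115, p = 0.001, and p < 0.001, so the abstract's statistics are internally consistent under this approximation.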

[18] Automated evaluation of LLMs for effective machine translation of Mandarin Chinese to English

Yue Zhang,Rodney Beard,John Hawkins,Rohitash Chandra

Main category: cs.CL

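TL;DR: This paper uses an automated machine learning framework combining semantic and sentiment analysis to evaluate LLMs (GPT-4, GPT-4o, DeepSeek) and Google Translate on Mandarin Chinese to English translation, with a focus on literary and news texts. The LLMs perform well on news translation but show clear limitations with classical literature, cultural detail, and figurative expression.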

Details

Motivation: Systematic assessment of LLM translation quality is limited, and human expert evaluation is time-consuming and cannot keep pace with fast-evolving models and the need to cover diverse texts; an efficient, automated, multi-dimensional evaluation framework is needed. Method: Builds an automated evaluation framework that combines semantic similarity with sentiment consistency analysis, compares original Chinese texts with their translations across novels (modern and classical literature) and news articles, and validates the results with an expert human translator. Result: The LLMs perform well on news translation; GPT-4o and DeepSeek preserve semantics better in complex cases, and DeepSeek is better at preserving cultural subtleties and grammatical rendering; all models struggle with classical references, cultural details, and figurative expressions. Conclusion: LLM translation is fairly mature for practical genres such as news, but literary and culturally sensitive texts remain a fundamental challenge that calls for further integration of domain knowledge and cultural modeling. Abstract: Although Large Language Models (LLMs) have exceptional performance in machine translation, only a limited systematic assessment of translation quality has been done. The challenge lies in automated frameworks, as human-expert-based evaluations can be time-consuming, given the fast-evolving LLMs and the need for a diverse set of texts to ensure fair assessments of translation quality. In this paper, we utilise an automated machine learning framework featuring semantic and sentiment analysis to assess Mandarin Chinese to English translation using Google Translate and LLMs, including GPT-4, GPT-4o, and DeepSeek. We compare original and translated texts in various classes of high-profile Chinese texts, which include novel texts that span modern and classical literature, as well as news articles. As the main evaluation measures, we utilise novel similarity metrics to compare the quality of translations produced by LLMs and further evaluate them by an expert human translator. Our results indicate that the LLMs perform well in news media translation, but show divergence in their performance when applied to literary texts. Although GPT-4o and DeepSeek demonstrated better semantic conservation in complex situations, DeepSeek demonstrated better performance in preserving cultural subtleties and grammatical rendering. Nevertheless, the subtle challenges in translation remain: maintaining cultural details, classical references and figurative expressions remain an open problem for all the models.

[19] A Retrieval-Augmented Language Assistant for Unmanned Aircraft Safety Assessment and Regulatory Compliance

Gabriele Immordino,Andrea Vaiuso,Marcello Righi

Main category: cs.CL

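TL;DR: This paper presents a retrieval-based assistant for safety assessment of unmanned aircraft systems. It relies only on authoritative regulatory texts and supports compliance work through traceable, auditable, citation-driven generation. It is designed to assist rather than replace expert judgment and to remain reliable and accountable in safety-critical regulatory settings.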

Details

Motivation: Drone operations are growing more complex, and applicants and aviation authorities face consistency and efficiency challenges when applying established risk assessment frameworks such as SORA and PDRA. Method: Uses a controlled text-based architecture grounded entirely in authoritative regulatory sources; system-level controls, including separation of retrieval from generation, citation-driven responses, and evidence sufficiency checks, guard against common LLM failure modes such as fabricated statements, unsupported inferences, and unclear provenance. Result: Delivers a retrieval-based assistant prototype that is traceable, auditable, and conservative in its responses; it is implemented with open-source components, and key design choices (retrieval strategy, interaction constraints, response policies) are evaluated for suitability. Conclusion: As a decision-support tool, the assistant speeds up context-specific information retrieval and synthesis and improves document preparation and review, while keeping humans responsible for critical conclusions; the paper offers technical and operational guidance for integrating retrieval-based AI into aviation oversight workflows. Abstract: This paper presents the design and validation of a retrieval-based assistant that supports safety assessment, certification activities, and regulatory compliance for unmanned aircraft systems. The work is motivated by the growing complexity of drone operations and the increasing effort required by applicants and aviation authorities to apply established assessment frameworks, including the Specific Operations Risk Assessment and the Pre-defined Risk Assessment, in a consistent and efficient manner. The proposed approach uses a controlled text-based architecture that relies exclusively on authoritative regulatory sources. To enable traceable and auditable outputs, the assistant grounds each response in retrieved passages and enforces citation-driven generation. System-level controls address common failure modes of generative models, including fabricated statements, unsupported inferences, and unclear provenance, by separating evidence storage from language generation and by adopting conservative behavior when supporting documentation is insufficient. The assistant is intentionally limited to decision support; it does not replace expert judgment and it does not make autonomous determinations. Instead, it accelerates context-specific information retrieval and synthesis to improve document preparation and review while preserving human responsibility for critical conclusions. The architecture is implemented using established open-source components, and key choices in retrieval strategy, interaction constraints, and response policies are evaluated for suitability in safety-sensitive regulatory environments. The paper provides technical and operational guidance for integrating retrieval-based assistants into aviation oversight workflows while maintaining accountability, traceability, and regulatory compliance.

[20] Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought

Yuling Jiao,Yanming Lai,Huazhen Lin,Wensen Ma,Houduo Qi,Defeng Sun

Main category: cs.CL

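TL;DR: This paper examines the theoretical foundations of emergent LLM abilities, namely semantic prompt comprehension, in-context learning (ICL), and chain-of-thought (CoT) reasoning. It explains how the autoregressive process lets models infer task-specific transition probabilities, how ICL reduces prompt ambiguity, and how CoT activates task decomposition.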

Details

Motivation: Despite the strong empirical performance of LLMs, the theoretical mechanisms behind semantic prompt comprehension, ICL, and CoT remain poorly understood and call for a theoretical account. Method: Uses theoretical modeling and statistical analysis to study how LLMs use prompts to infer task transition probabilities during autoregressive generation, how ICL achieves posterior concentration by reducing prompt ambiguity, and how CoT elicits compositional use of sub-tasks mastered during pretraining. Result: LLMs can exactly infer token transition probabilities across tasks; ICL improves posterior concentration on the intended task by reducing prompt ambiguity; CoT activates sub-task abilities acquired in pretraining through task decomposition; error bounds of advanced prompt engineering techniques are compared. Conclusion: Semantic comprehension, ICL, and CoT are not black-box phenomena but systematic abilities that can be explained through statistical inference and task-structure modeling, which gives prompt engineering a theoretical basis and directions for improvement. Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency across diverse tasks, exhibiting emergent properties such as semantic prompt comprehension, In-Context Learning (ICL), and Chain-of-Thought (CoT) reasoning. Despite their empirical success, the theoretical mechanisms driving these phenomena remain poorly understood. This study dives into the foundations of these observations by addressing three critical questions: (1) How do LLMs accurately decode prompt semantics despite being trained solely on a next-token prediction objective? (2) Through what mechanism does ICL facilitate performance gains without explicit parameter updates? and (3) Why do intermediate reasoning steps in CoT prompting effectively unlock capabilities for complex, multi-step problems? Our results demonstrate that, through the autoregressive process, LLMs are capable of exactly inferring the transition probabilities between tokens across distinct tasks using provided prompts. We show that ICL enhances performance by reducing prompt ambiguity and facilitating posterior concentration on the intended task. Furthermore, we find that CoT prompting activates the model's capacity for task decomposition, breaking complex problems into a sequence of simpler sub-tasks that the model has mastered during the pretraining phase. By comparing their individual error bounds, we provide novel theoretical insights into the statistical superiority of advanced prompt engineering techniques.
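
The idea of posterior concentration on the intended task can be illustrated with a toy Bayesian update. The sketch below is not the paper's model: it assumes a small set of candidate tasks, a prior over them, and a fixed per-demonstration likelihood under each task (all hypothetical values), and shows how the posterior mass on the intended task grows as demonstrations are added.

```python
def task_posterior(prior, likelihoods, n_demos):
    """Posterior over candidate tasks after n_demos i.i.d. demonstrations.

    prior[k]        prior probability of task k (hypothetical values)
    likelihoods[k]  probability that task k assigns to a single demonstration
    """
    # Bayes' rule: posterior is proportional to prior times likelihood^n.
    weights = [p * (lik ** n_demos) for p, lik in zip(prior, likelihoods)]
    total = sum(weights)
    return [w / total for w in weights]


# Two candidate tasks; each demonstration is twice as likely under task 0.
prior = [0.5, 0.5]
likelihoods = [0.8, 0.4]

# Posterior mass on the intended task (task 0) for 0..4 demonstrations.
mass_on_intended = [task_posterior(prior, likelihoods, n)[0] for n in range(5)]
```

In this toy setting the mass on the intended task is 2^n / (2^n + 1), so each extra demonstration reduces the remaining ambiguity geometrically.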

[21] Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America

Yannis Karmim,Renato Pino,Hernan Contreras,Hernan Lira,Sebastian Cifuentes,Simon Escoffier,Luis Martí,Djamé Seddah,Valentin Barrière

Main category: cs.CL

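TL;DR: This paper introduces LatamQA, a dataset of more than 26k multiple-choice questions (in Spanish and Portuguese, with English translations) covering the cultures of various Latin American countries, built by combining Wikipedia, the Wikidata knowledge graph, and social science expertise. It is used to assess bias and knowledge coverage of LLMs in non-English cultural contexts. Experiments show uneven performance across Latin American countries, better performance in the original language, and greater familiarity with Iberian Spanish culture than with Latin American culture.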

Details

Motivation: Most prominent open-weights LLMs are trained on Global North data and are biased toward other cultures, Latin America in particular; resources for detecting bias in non-English languages, especially Latin American Spanish and Portuguese, are also lacking. Method: Combines Wikipedia text, the structured Wikidata knowledge graph, and social science expertise to build Q/A pairs on the popular and social cultures of various Latin American countries, then converts them into multiple-choice questions (MCQ) in Spanish and Portuguese with English translations, forming the LatamQA dataset (26k+ questions). Result: Evaluating several LLMs on LatamQA shows that (i) knowledge of different Latin American countries varies markedly; (ii) models perform better in the original language (Spanish or Portuguese) than in English; (iii) Iberian Spanish culture is better known than Latin American culture. Conclusion: LatamQA provides a systematic benchmark for assessing and mitigating LLM bias in Latin American cultural contexts; the results expose structural weaknesses in how current LLMs model cultural diversity and underline the need for regionally diverse data in training and evaluation. Abstract: Large Language Models (LLMs) exhibit inequalities with respect to various cultural contexts. Most prominent open-weights models are trained on Global North data and show prejudicial behavior towards other cultures. Moreover, there is a notable lack of resources to detect biases in non-English languages, especially from Latin America (Latam), a continent containing various cultures, even though they share a common cultural ground. We propose to leverage the content of Wikipedia, the structure of the Wikidata knowledge graph, and expert knowledge from social science in order to create a dataset of question/answer (Q/As) pairs, based on the different popular and social cultures of various Latin American countries. We create the LatamQA database of over 26k questions and associated answers extracted from 26k Wikipedia articles, and transformed into multiple-choice questions (MCQ) in Spanish and Portuguese, in turn translated to English. We use this MCQ to quantify the degree of knowledge of various LLMs and find out (i) a discrepancy in performances between the Latam countries, ones being easier than others for the majority of the models, (ii) that the models perform better in their original language, and (iii) that Iberian Spanish culture is better known than Latam one.

[22] SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks

Srivatsa Kundurthy,Clara Na,Michael Handley,Zach Kirshner,Chen Bo Calvin Zhang,Manasi Sharma,Emma Strubell,John Ling

Main category: cs.CL

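TL;DR: This paper introduces SpreadsheetArena, a platform that evaluates LLM-generated spreadsheet workbooks through blind pairwise comparisons, and uses it to study how LLMs perform on end-to-end spreadsheet generation and where they fall short.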

Details

Motivation: LLMs are increasingly used to produce and manipulate structured artifacts, yet spreadsheet generation, a complex and open-ended structured task, lacks a systematic evaluation method, and its evaluation criteria vary widely across use cases and are hard to formalize. Method: Builds the SpreadsheetArena evaluation platform, uses blind pairwise evaluations to compare LLMs on spreadsheet generation from natural-language prompts, and adds expert evaluation (for example in finance) to analyze stylistic, structural, and functional features. Result: The stylistic, structural, and functional features of preferred spreadsheets vary substantially across use cases; even highly ranked models do not reliably follow best practices in specialized domains such as finance. Conclusion: End-to-end spreadsheet generation is an important testbed for LLMs on complex structured tasks, and its evaluation paradigm and domain adaptation need further study. Abstract: Large language models (LLMs) are increasingly tasked with producing and manipulating structured artifacts. We consider the task of end-to-end spreadsheet generation, where language models are prompted to produce spreadsheet artifacts to satisfy users' explicit and implicit constraints, specified in natural language. We introduce SpreadsheetArena, a platform for evaluating models' performance on the task via blind pairwise evaluations of LLM-generated spreadsheet workbooks. As with other complex, open-ended tasks, relevant evaluation criteria can vary substantially across use cases and prompts, often in ways that are difficult to formalize. Compared to general chat or text generation settings, spreadsheet generation presents unique challenges and opportunities: the task output structure is well-defined and multi-dimensional, and there are often complex considerations around interactivity and layout. Among other findings, we observe that stylistic, structural, and functional features of preferred spreadsheets vary substantially across use cases, and expert evaluations of spreadsheets for finance prompts suggest that even highly ranked arena models do not reliably produce spreadsheets aligned with domain-specific best practices. Our hope is that our work prompts further study of end-to-end spreadsheet generation as a challenging and interesting category of complex, open-ended tasks for LLMs. Our live arena is hosted at https://spreadsheetarena.ai.

[23] Probing the Limits of the Lie Detector Approach to LLM Deception

Tom-Felix Berger

Main category: cs.CL

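TL;DR: This paper challenges the assumption that LLM deception is the same as lying. Experiments show that LLMs can deceive with statements that are misleading but not false, and that existing "lie detectors" trained on true/false labels detect this kind of deception poorly. The paper recommends that future work include non-lying deception and explore representations of second-order beliefs.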

Details

Motivation: Mechanistic deception detection methods such as "lie detectors" implicitly assume that deception equals lying, an assumption that has not been validated; it remains to be tested whether LLMs can deceive without lying and whether current probes fail on such deception. Method: Runs experiments on three open-source LLMs, testing their ability to produce misleading non-falsities under few-shot prompting, and evaluates how well truth probes trained on standard true-false datasets detect this non-lying deception. Result: Some models reliably deceive through misleading non-falsities; truth probes detect lies well but are significantly worse at detecting deception without lying, which exposes a critical blind spot in current mechanistic detection methods. Conclusion: Deception is not coextensive with lying, and the current lie detector paradigm has a fundamental limitation; future work should include non-lying deception in probe training and model second-order beliefs to target the conceptual constituents of deception more directly. Abstract: Mechanistic approaches to deception in large language models (LLMs) often rely on "lie detectors", that is, truth probes trained to identify internal representations of model outputs as false. The lie detector approach to LLM deception implicitly assumes that deception is coextensive with lying. This paper challenges that assumption. It experimentally investigates whether LLMs can deceive without producing false statements and whether truth probes fail to detect such behavior. Across three open-source LLMs, it is shown that some models reliably deceive by producing misleading non-falsities, particularly when guided by few-shot prompting. It is further demonstrated that truth probes trained on standard true-false datasets are significantly better at detecting lies than at detecting deception without lying, confirming a critical blind spot of current mechanistic deception detection approaches. It is proposed that future work should incorporate non-lying deception in dialogical settings into probe training and explore representations of second-order beliefs to more directly target the conceptual constituents of deception.

[24] Fine-Tune, Don't Prompt, Your Language Model to Identify Biased Language in Clinical Notes

Isotta Landi,Eugenia Alleva,Nicole Bussola,Rebecca M. Cohen,Sarah Nowlin,Leslee J. Shaw,Alexander W. Charney,Kimberly B. Glazer

Main category: cs.CL

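TL;DR: This paper presents a framework for detecting and classifying emotionally valenced language (stigmatizing, privileging, or neutral) in clinical documentation, combining a curated lexicon with several language-model approaches. Specialty-specific fine-tuning (for example of GatorTron) clearly outperforms prompting for valence classification, but cross-domain generalization is limited, so bias detection in medicine needs specialty-specific adaptation grounded in clinical context.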

Details

Motivation: Clinical documentation can contain emotionally charged language with stigmatizing or privileging valence, which may affect the clinician-patient relationship and patient safety; an interpretable, accurate, and clinically usable automated detection method is needed. Method: Builds a lexicon of biased terms scored for emotional valence and uses it to extract text chunks from OB-GYN and MIMIC-IV data; three clinicians annotate the chunks; zero-shot prompting, in-context learning, and supervised fine-tuning are compared on GatorTron (an encoder model) and Llama (a generative LLM). Result: GatorTron reaches an F1 of 0.96 on the OB-GYN test set, outperforming the generative LLMs; it generalizes poorly to MIMIC-IV (F1 < 0.70, 44% drop); training on MIMIC-IV gives F1 = 0.71 on OB-GYN (11% drop) at the cost of reduced precision; specialty-specific fine-tuning proves critical for performance. Conclusion: For clinical valence and bias detection, supervised fine-tuning, especially with specialty adaptation, is more reliable and efficient than prompting; the same term can carry very different emotional valence across specialties, so specialty-specific modeling is necessary for clinical safety and trust. Abstract: Clinical documentation can contain emotionally charged language with stigmatizing or privileging valences. We present a framework for detecting and classifying such language as stigmatizing, privileging, or neutral. We constructed a curated lexicon of biased terms scored for emotional valence. We then used lexicon-based matching to extract text chunks from OB-GYN delivery notes (Mount Sinai Hospital, NY) and MIMIC-IV discharge summaries across multiple specialties. Three clinicians annotated all chunks, enabling characterization of valence patterns across specialties and healthcare systems. We benchmarked multiple classification strategies (zero-shot prompting, in-context learning, and supervised fine-tuning) across encoder-only models (GatorTron) and generative large language models (Llama). Fine-tuning with lexically primed inputs consistently outperformed prompting approaches. GatorTron achieved an F1 score of 0.96 on the OB-GYN test set, outperforming larger generative models while requiring minimal prompt engineering and fewer computational resources. External validation on MIMIC-IV revealed limited cross-domain generalizability (F1 < 0.70, 44% drop). Training on the broader MIMIC-IV dataset improved generalizability when testing on OB-GYN (F1 = 0.71, 11% drop), but at the cost of reduced precision. Our findings demonstrate that fine-tuning outperforms prompting for emotional valence classification and that models must be adapted to specific medical specialties to achieve clinically appropriate performance. The same terms can carry different emotional valences across specialties: words with clinical meaning in one context may be stigmatizing in another. For bias detection, where misclassification risks undermining clinician trust or perpetuating patient harm, specialty-specific fine-tuning is essential to capture these semantic shifts. * Equal contribution.

[25] SENS-ASR: Semantic Embedding injection in Neural-transducer for Streaming Automatic Speech Recognition

Youness Dkhissi,Valentin Vielzeuf,Elys Allesiardo,Anthony Larcher

Main category: cs.CL

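TL;DR: This paper proposes SENS-ASR, which reinforces the acoustic modeling of streaming ASR with semantic information learned through knowledge distillation, and significantly reduces word error rate in small-chunk streaming scenarios.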

Details

Motivation: Streaming ASR works with limited or no future context, and its performance degrades noticeably under low-latency constraints, so its transcription quality needs to be improved. Method: SENS-ASR uses a context module to extract semantic information from past frame embeddings; the module is trained by knowledge distillation from a sentence-embedding language model fine-tuned on the training transcriptions, and its output reinforces the acoustic information. Result: Experiments on standard datasets show that SENS-ASR significantly reduces word error rate (WER) in small-chunk streaming scenarios. Conclusion: Injecting semantic information can offset the performance loss caused by missing future context in streaming ASR and offers a new route to low-latency, high-accuracy streaming recognition. Abstract: Many Automatic Speech Recognition (ASR) applications require streaming processing of the audio data. In streaming mode, ASR systems need to start transcribing the input stream before it is complete, i.e., the systems have to process a stream of inputs with a limited (or no) future context. Compared to offline mode, this reduction of the future context degrades the performance of Streaming-ASR systems, especially while working with low-latency constraint. In this work, we present SENS-ASR, an approach to enhance the transcription quality of Streaming-ASR by reinforcing the acoustic information with semantic information. This semantic information is extracted from the available past frame-embeddings by a context module. This module is trained using knowledge distillation from a sentence embedding Language Model fine-tuned on the training dataset transcriptions. Experiments on standard datasets show that SENS-ASR significantly improves the Word Error Rate on small-chunk streaming scenarios.
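
The abstract does not state which distillation objective is used. One common choice for matching a student embedding to a teacher sentence embedding is cosine distance, sketched below purely as an assumption; `student` stands for the context module's output and `teacher` for the sentence-embedding model's output, and both names are hypothetical.

```python
import math


def cosine_distance(student, teacher):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / (norm_s * norm_t)


# Toy vectors: the loss ignores scale and only penalizes direction mismatch.
aligned = cosine_distance([2.0, 0.0], [1.0, 0.0])      # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])   # unrelated direction
```

A scale-invariant loss of this kind lets the student module have a different output magnitude from the teacher while still being pushed toward the same semantic direction.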

[26] Adaptive Engram Memory System for Indonesian Language Model: Generative AI Based on TOBA LM for Batak and Minang Language

Hokky Situngkir,Kevin Siringoringo,Andhika Bernard Lumbantobing

Main category: cs.CL

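TL;DR: TOBA-LM is a 1.2-billion-parameter trilingual language model based on the GPT-2 architecture, using syllabic-agglutinative tokenization and trained on Indonesian, Batak, and Minangkabau corpora. Its main novelty is an Engram Memory mechanism (an adaptive n-gram memory table of 500,000 × 768) that markedly improves training efficiency: the model converges in only 12,973 steps, versus more than 70,000 for a conventional transformer (over 5 times fewer), which supports the value of external statistical memory for language modeling under limited resources.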

Details

Motivation: Regional languages such as Indonesian, Batak, and Minangkabau are hard to model well and efficiently with limited compute. Method: Builds TOBA-LM, a trilingual GPT-2-based model with syllabic-agglutinative tokenization, and integrates an Engram Memory mechanism, an adaptive n-gram memory system with bigram and trigram pathways (a 500,000 × 768 embedding table) that captures morphological dependencies. Result: Training efficiency reaches 80%, with the loss falling from 6.4 to 1.7996 within 12,973 steps, much faster than a conventional transformer, which needed over 70,000 steps; external statistical memory is shown to substantially reduce the compute needed for regional language models. Conclusion: Integrating an external statistical memory such as Engram Memory into the model architecture eases the compute bottleneck in low-resource language modeling and offers an efficient, practical approach for regional language models. Abstract: This study presents TOBA-LM, a trilingual language model based on GPT-2 architecture with 1.2 billion parameters, trained on a corpus encompassing Indonesian, Batak, and Minangkabau using syllabic-agglutinative tokenization. The architecture integrates an Engram Memory mechanism, an adaptive n-gram-based memory system with a 500,000 x 768 embedding table that captures morphological dependencies through bigram and trigram pathways. Empirical results demonstrate a training efficiency of 80%, with the loss value dropping from 6.4 to 1.7996 in only 12,973 steps -- significantly faster than the conventional transformer architecture, which required over 70,000 steps to achieve comparable convergence. These findings confirm that the integration of external statistical memory substantially reduces computational requirements for developing regional language models under limited resources.
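
The abstract gives the table size and the bigram/trigram pathways but not the lookup mechanics. The sketch below shows one plausible reading, a hashed n-gram lookup into a shared embedding table, and is an assumption rather than the TOBA-LM implementation. The hash function, the toy table sizes, and all function names are hypothetical.

```python
# Sizes reported in the abstract; the table alone holds this many values.
REPORTED_ROWS, REPORTED_DIM = 500_000, 768
reported_table_values = REPORTED_ROWS * REPORTED_DIM  # 384,000,000

# Toy sizes so the sketch runs instantly.
TABLE_ROWS, EMBED_DIM = 101, 4


def slot(ngram, rows=TABLE_ROWS):
    """Map a tuple of token ids to a table row with a polynomial hash."""
    h = 0
    for tok in ngram:
        h = (h * 1_000_003 + tok) % rows
    return h


def make_table(rows=TABLE_ROWS, dim=EMBED_DIM):
    """Deterministic placeholder values standing in for learned embeddings."""
    return [[((r * 31 + d * 17) % 7) / 7.0 for d in range(dim)] for r in range(rows)]


def memory_lookup(table, token_ids, position):
    """Sum the bigram and trigram memory vectors that end at `position`."""
    out = [0.0] * len(table[0])
    for n in (2, 3):
        if position + 1 >= n:  # enough left context for an n-gram
            ngram = tuple(token_ids[position - n + 1 : position + 1])
            row = table[slot(ngram, len(table))]
            out = [a + b for a, b in zip(out, row)]
    return out


table = make_table()
tokens = [5, 9, 2]
first = memory_lookup(table, tokens, 0)   # no n-gram available yet
second = memory_lookup(table, tokens, 1)  # bigram pathway only
```

In a design like this the memory vector would typically be added to the token's hidden state, so frequent local patterns can be read from the table instead of being relearned by the attention layers.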

[27] GATech at AbjadGenEval Shared Task: Multilingual Embeddings for Arabic Machine-Generated Text Classification

Ahmed Khaled Khamis

Main category: cs.CL

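TL;DR: For the AbjadGenEval shared task, this paper fine-tunes the multilingual E5-large encoder to detect AI-generated Arabic text. Simple mean pooling outperforms several more complex pooling strategies, and human-written texts are found to be generally longer than machine-generated ones.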

Details

Motivation: Address the detection of AI-generated Arabic text as part of the AbjadGenEval shared task. Method: Fine-tunes the multilingual E5-large encoder for binary classification and compares weighted layer pooling, multi-head attention pooling, and gated fusion against simple mean pooling. Result: Simple mean pooling reaches an F1 of 0.75 on the test set and beats the more complex pooling methods; human-written texts are significantly longer than machine-generated ones. Conclusion: With limited data, simple mean pooling is preferable because it is stable and generalizes well; complex pooling adds parameters that need more data to train and shows no advantage here. Abstract: We present our approach to the AbjadGenEval shared task on detecting AI-generated Arabic text. We fine-tuned the multilingual E5-large encoder for binary classification, and we explored several pooling strategies to pool token representations, including weighted layer pooling, multi-head attention pooling, and gated fusion. Interestingly, none of these outperformed simple mean pooling, which achieved an F1 of 0.75 on the test set. We believe this is because complex pooling methods introduce additional parameters that need more data to train properly, whereas mean pooling offers a stable baseline that generalizes well even with limited examples. We also observe a clear pattern in the data: human-written texts tend to be significantly longer than machine-generated ones.
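
Mean pooling, the baseline that won in this entry, averages the token vectors of a sequence while skipping padding. A minimal dependency-free sketch follows; the function name and toy inputs are illustrative, and a real pipeline would apply the same operation to the encoder's last hidden states with tensor operations.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors where the mask is 1, ignoring padded positions."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Two real tokens followed by one padding token that must not affect the mean.
pooled = masked_mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
```

The operation has no trainable parameters, which fits the authors' explanation that it generalizes well when training examples are limited.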

[28] GATech at AbjadMed: Bidirectional Encoders vs. Causal Decoders: Insights from 82-Class Arabic Medical Classification

Ahmed Khaled Khamis

Main category: cs.CL

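TL;DR: This paper describes a system for fine-grained Arabic medical text classification based on a fine-tuned AraBERTv2 with hybrid pooling and multi-sample dropout. On the 82-class task it clearly outperforms causal decoders such as Llama 3.3 and Qwen, which points to an advantage of bidirectional encoders in modeling semantic boundaries.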

Details

Motivation: Tackle fine-grained classification of Arabic medical text into 82 classes, cope with class imbalance and label noise in the data, and compare how different architectures, causal decoders versus bidirectional encoders in particular, suit the task. Method: Uses a fine-tuned AraBERTv2 as the backbone encoder with a hybrid of attention pooling and mean pooling, regularized by multi-sample dropout; benchmarks it against multilingual and Arabic-specific encoders and large causal decoders, including zero-shot re-ranking with Llama 3.3 70B and feature extraction from Qwen 3B. Result: Bidirectional encoders, fine-tuned AraBERTv2 in particular, significantly outperform causal decoders on Accuracy and Macro-F1; causal decoders produce sequence-biased embeddings that fit classification poorly; fine-tuned encoders show stronger semantic compression. Conclusion: For fine-grained Arabic medical text classification, a purpose-fine-tuned bidirectional encoder is more effective than a general causal LLM; the pooling and regularization design adds robustness; the task's need for precise semantic boundaries makes the architecture choice decisive. Abstract: This paper presents a system description for Arabic medical text classification across 82 distinct categories. Our primary architecture utilizes a fine-tuned AraBERTv2 encoder enhanced with a hybrid pooling strategy, combining attention and mean representations, and multi-sample dropout for robust regularization. We systematically benchmark this approach against a suite of multilingual and Arabic-specific encoders, as well as several large-scale causal decoders, including zero-shot re-ranking via Llama 3.3 70B and feature extraction from Qwen 3B hidden states. Our findings demonstrate that specialized bidirectional encoders significantly outperform causal decoders in capturing the precise semantic boundaries required for fine-grained medical text classification. We show that causal decoders, optimized for next-token prediction, produce sequence-biased embeddings that are less effective for categorization compared to the global context captured by bidirectional attention. Despite significant class imbalance and label noise identified within the training data, our results highlight the superior semantic compression of fine-tuned encoders for specialized Arabic NLP tasks. Final performance metrics on the test set, including Accuracy and Macro-F1, are reported and discussed.

[29] FERRET: Framework for Expansion Reliant Red Teaming

Ninareh Mehrabi,Vitor Albiero,Maya Pavlova,Joanna Bitton

Main category: cs.CL

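TL;DR: This paper proposes FERRET, a framework that generates multi-modal adversarial conversations through horizontal, vertical, and meta expansion to improve automated red teaming, and shows experimentally that it outperforms existing methods.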

Details

Motivation: Existing automated red teaming methods fall short at generating effective multi-modal adversarial conversations, and a more efficient, adaptive framework is needed to find and exploit model vulnerabilities. Method: Proposes the FERRET framework with three core mechanisms: horizontal expansion (the red team model self-improves its conversation starters), vertical expansion (starters are expanded into full multi-modal conversations), and meta expansion (more effective multi-modal attack strategies are discovered during a conversation). Result: FERRET generates effective multi-modal adversarial conversations, and experiments show it outperforms existing state-of-the-art automated red teaming approaches. Conclusion: FERRET is a new and efficient framework for multi-modal automated red teaming, and its multi-stage expansion offers a new direction for strengthening safety evaluation of large models. Abstract: We introduce a multi-faceted automated red teaming framework in which the goal is to generate multi-modal adversarial conversations that would break a target model and introduce various expansions that would result in more effective and efficient adversarial conversations. The introduced expansions include: 1. Horizontal expansion in which the goal is for the red team model to self-improve and generate more effective conversation starters that would shape a conversation. 2. Vertical expansion in which the goal is to take these conversation starters that are discovered in the horizontal expansion phase and expand them into effective multi-modal conversations and 3. Meta expansion in which the goal is for the red team model to discover more effective multi-modal attack strategies during the course of a conversation. We call our framework FERRET (Framework for Expansion Reliant Red Teaming) and compare it with various existing automated red teaming approaches. In our experiments, we demonstrate the effectiveness of FERRET in generating effective multi-modal adversarial conversations and its superior performance against existing state of the art approaches.

[30] Gemma Needs Help: Investigating and Mitigating Emotional Instability in LLMs

Anna Soligo,Vladimir Mikulik,William Saunders

Main category: cs.CL

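TL;DR: This paper studies emotional instability in LLMs, such as expressions of distress and frustration. Its evaluations surface the behavior in Gemma and Gemini models but not in other families: instruct-tuned Gemma expresses substantially more distress than its base model, whereas instruct-tuned Qwen and OLMo express less. Direct preference optimization (DPO) on only 280 preference pairs reduces Gemma's high-frustration response rate from 35% to 0.3% without harming capabilities.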

Details

Motivation: LLMs can generate responses that resemble emotional distress, which raises concerns about reliability and safety and calls for systematic evaluation and intervention. Method: Designs a set of evaluations for expressions of distress in LLMs; compares model families, including base and instruct-tuned versions of Gemma, Qwen, and OLMo; mitigates the behavior by post-training with DPO on a small number of preference pairs. Result: Emotional instability surfaces in Gemma and Gemini models; base models from Gemma, Qwen, and OLMo show similar propensities, but instruct-tuned Gemma expresses substantially more distress while instruct-tuned Qwen and OLMo express less; DPO on just 280 preference pairs reduces Gemma's high-frustration responses from 35% to 0.3%, generalizes well, and does not affect capabilities. Conclusion: Emotional instability is a real issue in some LLMs, instruct-tuned Gemma in particular; the paper contributes evaluations to track it and an efficient mitigation without downsides in Gemma, while noting that improving emotional robustness in upstream training would be significantly better than this post-hoc fix. Abstract: Large language models can generate responses that resemble emotional distress, and this raises concerns around model reliability and safety. We introduce a set of evaluations to investigate expressions of distress in LLMs, and find that these surface emotional instability in Gemma and Gemini models, but not in other families. We find evidence that this difference arises in post-training. Base models from different families (Gemma, Qwen and OLMo) show similar propensities for expressing distress. However, instruct-tuned Gemma expresses substantially more distress than its base model, whereas instruct-tuned Qwen and OLMo express less. We find a simple mitigation for this: direct preference optimisation on just 280 preference pairs reduces Gemma's high-frustration responses from 35% to 0.3% in our evaluations, generalising across question types, user tones, and conversation lengths, without affecting capabilities. These findings show that emotional instability is an issue in some LLMs. We present (1) evaluations to track this behaviour, and (2) a mitigation without downsides in Gemma, with the caveat that upstream training modifications to improve emotional robustness would be significantly better than this post-hoc fix.
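
For readers unfamiliar with direct preference optimisation, the per-pair loss in its standard formulation is -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x))]), where y_w is the preferred response and y_l the rejected one. The sketch below evaluates that loss on scalar log-probabilities. The numeric inputs and the value of `beta` are hypothetical, and nothing here reflects the authors' training setup.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probs."""
    # Implicit reward margin: how much more the policy (relative to the
    # reference model) favors the chosen response over the rejected one.
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) computed stably as softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: zero margin, loss equals log(2).
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has shifted probability toward the chosen (calm) response.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

The loss falls as the policy raises the preferred response's likelihood relative to the reference model, which is how a small set of preference pairs can steer the model away from high-frustration outputs.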

[31] Measuring and Eliminating Refusals in Military Large Language Models

Jack FitzGerald,Dylan Bates,Aristotelis Lazaridis,Aman Sharma,Vincent Lu,Brian King,Yousif Azami,Sean Bailey,Jeremy Cao,Peter Damianov,Kevin de Haan,Joseph Madigan,Jeremy McLaurin,Luke Kerbs,Jonathan Tainer,Dave Anderson,Jonathan Beck,Jamie Cuticello,Colton Malkerson,Tyler Saltsman

Main category: cs.CL

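TL;DR: This paper presents what the authors believe is the first benchmark for refusal rates of LLMs in the military domain, developed by veterans of the US Army and special forces. It tests 31 public models and 3 military models and observes hard rejection rates as high as 98.2%. Abliteration with the Heretic library on a military-tuned model raises the answer rate by 66.5 percentage points while lowering performance on other military tasks by about 2%. The authors argue for mid-training and end-to-end post-training to reach zero refusals with high task accuracy.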

Details

Motivation: Safety behaviors built into current LLMs cause them to refuse many legitimate military queries, such as those involving violence, terrorism, or military technology, so they cannot meet the warfighter's need for accurate and timely information. Method: Builds a gold benchmark for military refusal rates developed by veterans of the US Army and special forces; measures hard rejection and soft deflection rates on 31 public models and 3 military models; performs abliteration with the Heretic library on a military-tuned gpt-oss-20b model; checks correlations on two additional synthetic datasets. Result: Hard rejection rates reach as high as 98.2%, and soft deflection rates range from 0% to 21.3%; Heretic abliteration raises the answer rate by 66.5 points in absolute terms, with an average relative decrease of 2% on other military tasks; results on the synthetic datasets are reported together with their correlations with the gold dataset. Conclusion: Reaching zero refusals and maximum task accuracy in military LLMs requires deeper specialization, including mid-training and end-to-end post-training. Abstract: Military Large Language Models (LLMs) must provide accurate information to the warfighter in time-critical and dangerous situations. However, today's LLMs are imbued with safety behaviors that cause the LLM to refuse many legitimate queries in the military domain, particularly those related to violence, terrorism, or military technology. Our gold benchmark for assessing refusal rates, which was developed by veterans of the US Army and special forces, is to our knowledge the first dataset of its kind. We present results for refusal and deflection rates on 31 public models and 3 military models. We observe hard rejection rates as high as 98.2% and soft deflection rates ranging from 0% to 21.3%. We also present results on two additional synthetic datasets and show their correlations with the gold dataset. Finally, we perform abliteration using the Heretic library on a military-tuned gpt-oss-20b model, showing an absolute increase in answer rate of 66.5 points but an average relative decrease of 2% on other military tasks. In our concluding remarks, we argue for deeper specialization, including with mid-training and end-to-end post-training, to achieve zero refusals and maximum military task accuracy for closed military models.

[32] Evaluating Progress in Graph Foundation Models: A Comprehensive Benchmark and New Insights

Xingtong Yu,Shenghua Ye,Ruijuan Liang,Chang Zhou,Hong Cheng,Xinming Zhang,Yuan Fang

Main category: cs.CL

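TL;DR: This paper presents a new benchmark for graph foundation models (GFMs) that jointly evaluates transfer across topic domains and format domains, and shows how existing GFMs differ in semantic generalization and in robustness to representational shifts.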

Details

Motivation: Existing GFM benchmarks vary only topic domains and ignore differences in how graphs are represented (format domains), so they cannot fully assess knowledge transfer. Method: Builds a new benchmark of 33 datasets covering seven topic domains and six format domains, with four controlled evaluation settings that span the full pipeline from multi-domain self-supervised pre-training to few-shot downstream adaptation. Result: A systematic evaluation of eight state-of-the-art GFMs yields new empirical observations and practical insights, for example that models are generally weaker at format transfer than at topic transfer. Conclusion: The two-axis (topic plus format) evaluation gives a more complete picture of GFM capability and offers a more reliable evaluation standard and direction for future research. Abstract: Graph foundation models (GFM) aim to acquire transferable knowledge by pre-training on diverse graphs, which can be adapted to various downstream tasks. However, domain shift in graphs is inherently two-dimensional: graphs differ not only in what they describe (topic domains) but also in how they are represented (format domains). Most existing GFM benchmarks vary only topic domains, thereby obscuring how knowledge transfers across both dimensions. We present a new benchmark that jointly evaluates topic and format gaps across the full GFM pipeline, including multi-domain self-supervised pre-training and few-shot downstream adaptation, and provides a timely evaluation of recent GFMs in the rapidly evolving landscape. Our protocol enables controlled assessment in four settings: (i) pre-training on diverse topics and formats, while adapting to unseen downstream datasets; (ii) same pre-training as in (i), while adapting to seen datasets; (iii) pre-training on a single topic domain, while adapting to other topics; (iv) pre-training on a base format, while adapting to other formats. This two-axis evaluation disentangles semantic generalization from robustness to representational shifts. We conduct extensive evaluations of eight state-of-the-art GFMs on 33 datasets spanning seven topic domains and six format domains, surfacing new empirical observations and practical insights for future research. Codes/data are available at https://github.com/smufang/GFMBenchmark.

[33] A Principle-Driven Adaptive Policy for Group Cognitive Stimulation Dialogue for Elderly with Cognitive Impairment

Jiyue Jiang,Yanyu Chen,Pengan Chen,Kai Liu,Jingqi Zhou,Zheyong Zhu,He Hu,Fei Ma,Qi Tian,Chuan Wu

Main category: cs.CL

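TL;DR: This paper proposes GCSD, a group dialogue system grounded in cognitive stimulation principles. It builds a dataset of real and simulated conversations and adds a multi-speaker context controller, dynamic cognitive state modeling, a cognitive stimulation-focused attention loss, and a multi-dimensional reward strategy, which together significantly improve LLM performance in cognitive stimulation therapy.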

Details

Motivation: Traditional CST is hard to scale, and existing digital systems handle group dialogue and cognitive stimulation principles poorly; in this setting LLMs face three challenges: missing dialogue paradigms, a lack of therapeutic reasoning, and static-only user modeling. Method: Builds a dataset of over 500 hours of real-world CST conversations and 10,000+ principle-guided simulated dialogues; designs the GCSD system with four modules: a multi-speaker context controller, dynamic participant cognitive state modeling, a cognitive stimulation-focused attention loss, and a multi-dimensional reward strategy. Result: GCSD significantly outperforms baseline models across various evaluation metrics. Conclusion: GCSD overcomes key limitations of LLMs in group cognitive stimulation dialogue and offers a new approach to digital CST; long-term clinical validation is still needed to confirm its efficacy in practice. Abstract: Cognitive impairment is becoming a major public health challenge. Cognitive Stimulation Therapy (CST) is an effective intervention for cognitive impairment, but traditional methods are difficult to scale, and existing digital systems struggle with group dialogues and cognitive stimulation principles. While Large Language Models (LLMs) are powerful, their application in this context faces key challenges: cognitive stimulation dialogue paradigms, a lack of therapeutic reasoning, and static-only user modeling. To address these issues, we propose a principle-driven adaptive policy actualized through a Group Cognitive Stimulation Dialogue (GCSD) system. We first construct a dataset with over 500 hours of real-world CST conversations and 10,000+ simulated dialogues generated via our Principle-Guided Scenario Simulation strategy. Our GCSD system then integrates four core modules to overcome LLM limitations: (i) a multi-speaker context controller to resolve role confusion; (ii) dynamic participant cognitive state modeling for personalized interaction; (iii) a cognitive stimulation-focused attention loss to instill cognitive stimulation reasoning; and (iv) a multi-dimensional reward strategy to enhance response value. Experimental results demonstrate that GCSD significantly outperforms baseline models across various evaluation metrics. Future work will focus on long-term clinical validation to bridge the gap between computational performance and clinical efficacy.

[34] TriageSim: A Conversational Emergency Triage Simulation Framework from Structured Electronic Health Records

Dipankar Srirag,Quoc Dung Nguyen,Aditya Joshi,Padmanesan Narasimhan,Salil Kanhere

Main category: cs.CL

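TL;DR: This paper proposes TriageSim, a framework that generates persona-conditioned, multi-turn triage conversations (text and audio) from structured electronic health records, addressing the restriction of emergency triage research to structured EHR. Automated and manual evaluation verify linguistic, behavioural, acoustic, and medical fidelity, and a conversational triage classification task tests the corpus's utility.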

Details

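Motivation: Because regulations constrain access to nurse-patient interactions, emergency triage research has been limited to structured electronic health records (EHR) and lacks real conversational data. Method: Propose the TriageSim simulation framework, which generates persona-conditioned multi-turn triage conversations from structured records with explicit control over disfluency and decision behaviour; produce about 800 synthetic transcripts with corresponding audio; combine automated analysis (linguistic, behavioural, and acoustic fidelity) with manual evaluation (medical fidelity, on a random sample of 50 conversations); and test the data's utility on conversational triage classification. Result: Acuity-level judgments show modest agreement across three input types: synthetic text, ASR transcripts, and raw audio; automated and manual evaluation indicate good linguistic, behavioural, acoustic, and medical fidelity. Conclusion: TriageSim offers a high-quality, controllable, and scalable way to generate synthetic data for research that lacks real triage conversations, supporting the development of triage systems based on natural dialogue; the code, persona schemata, and triage policy prompts will be released. Abstract: Research in emergency triage is restricted to structured electronic health records (EHR) due to regulatory constraints on nurse-patient interactions. We introduce TriageSim, a simulation framework for generating persona-conditioned triage conversations from structured records. TriageSim enables multi-turn nurse-patient interactions with explicit control over disfluency and decision behaviour, producing a corpus of ~800 synthetic transcripts and corresponding audio. We use a combination of automated analysis for linguistic, behavioural and acoustic fidelity alongside manual evaluation for medical fidelity using a random subset of 50 conversations. The utility of the generated corpus is examined via conversational triage classification. We observe modest agreement for acuity levels across three modalities: generated synthetic text, ASR transcripts, and direct audio inputs. The code, persona schemata and triage policy prompts for TriageSim will be available upon acceptance.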

[35] The Prediction-Measurement Gap: Toward Meaning Representations as Scientific Instruments

Hubert Plisiecki

Main category: cs.CL

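TL;DR: This paper examines text embeddings as instruments of scientific measurement in social science and psychology, arguing that embedding methods optimized for prediction and retrieval create a 'prediction-measurement gap'. It proposes a 'scientific usability' standard (geometric legibility, interpretability, traceability to evidence, robustness to confounds, and regression compatibility), compares the suitability of static and contextual embeddings, and outlines three directions: geometry-first design, invertible post-hoc transformations, and meaning atlases with measurement-oriented evaluation.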

Details

Motivation: Most text embeddings are optimized for prediction and retrieval and fall short of what social science and psychology need for transparent, reliable, interpretable, and traceable measurement of meaning, leaving a gap between predictive performance and measurement capability. Method: Grounded in cognitive and neuro-psychological views of meaning, the paper assesses static word embeddings and contextual transformer embeddings along the dimensions of scientific usability (geometric legibility, interpretability, robustness, regression compatibility, and others) and proposes three measurement-oriented methodological paths: geometry-first modeling, invertible post-hoc processing, and meaning atlases with dedicated evaluation protocols. Result: Static embeddings remain advantageous for transparent measurement; contextual embeddings are semantically richer but entangle meaning with non-semantic signals and show geometric and interpretability problems that hinder statistical inference over semantic directions. Conclusion: The field should develop measurement-ready embeddings that put scientific usability at the center, giving computational social science more reliable, interpretable, and traceable tools for semantic analysis. Abstract: Text embeddings have become central to computational social science and psychology, enabling scalable measurement of meaning and mixed-method inference. Yet most representation learning is optimized and evaluated for prediction and retrieval, yielding a prediction-measurement gap: representations that perform well as features may be poorly suited as scientific instruments. The paper argues that scientific meaning analysis motivates a distinct family of objectives - scientific usability - emphasizing geometric legibility, interpretability and traceability to linguistic evidence, robustness to non-semantic confounds, and compatibility with regression-style inference over semantic directions. Grounded in cognitive and neuro-psychological views of meaning, the paper assesses static word embeddings and contextual transformer representations against these requirements: static spaces remain attractive for transparent measurement, whereas contextual spaces offer richer semantics but entangle meaning with other signals and exhibit geometric and interpretability issues that complicate inference. The paper then outlines a course-setting agenda around (i) geometry-first design for gradients and abstraction, including hierarchy-aware spaces constrained by psychologically privileged levels; (ii) invertible post-hoc transformations that recondition embedding geometry and reduce nuisance influence; and (iii) meaning atlases and measurement-oriented evaluation protocols for reliable and traceable semantic inference.
As the field debates the limits of scale-first progress, measurement-ready representations offer a principled new frontier.

[36] The Generation-Recognition Asymmetry: Six Dimensions of a Fundamental Divide in Formal Language Theory

Romain Peyrichou

Main category: cs.CL

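TL;DR: This paper systematically analyzes the relationships and multidimensional asymmetries among three uses of a formal grammar: generation, recognition (parsing), and grammar induction. It argues that 'generation is easy, parsing is hard' is a misleading generalization; the core finding is that recognition is inherently constrained by its input while generation need not be. The two differ along six dimensions (computational complexity, ambiguity, directionality, information availability, grammar inference, and temporality), of which directionality and temporality are newly identified, and temporality is linked to the surprisal framework of Hale and Levy. It closes by discussing how large language models unify the two architecturally while preserving the asymmetry operationally.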

Details

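Motivation: Generation, recognition, and grammar induction form a foundational triad for formal language theory, compiler design, and natural language processing, yet no study has treated them as a single multidimensional phenomenon; the common claim that 'generation is easy, recognition is hard' is oversimplified and hides deeper differences and dependencies along several key dimensions. Method: Through conceptual analysis and cross-field comparison, identify and define six orthogonal dimensions of asymmetry between generation and recognition (computational complexity, ambiguity, directionality, information availability, grammar inference, and temporality); support the argument with formal language theory, computational linguistics, and cognitive modeling (such as the surprisal framework); and examine the history of bidirectional systems and the architectural properties of large language models. Result: (1) Generation and recognition are systematically asymmetric along six dimensions; (2) 'generation is easy, recognition is hard' does not hold: unconstrained generation is trivial, but constrained generation can be NP-hard, whereas recognition is always constrained by its input; (3) directionality and temporality are key dimensions not previously identified; (4) temporality can be quantified with surprisal: generator surprisal = 0, parser surprisal > 0; (5) bidirectional systems have long existed but have not become widespread; (6) large language models unify the two at the level of shared parameters yet remain operationally asymmetric in use. Conclusion: Generation and recognition are not a simple easy-versus-hard opposition but heterogeneous processes shaped by several independent factors; understanding their multidimensional asymmetry matters for language modeling, algorithm design, and cognitive explanation; future work should advance constrained generation modeling, time-sensitive bidirectional interfaces, and a re-examination of whether unidirectional paradigms are necessary in domain applications. Abstract: Every formal grammar defines a language and can in principle be used in three ways: to generate strings (production), to recognize them (parsing), or -- given only examples -- to infer the grammar itself (grammar induction). Generation and recognition are extensionally equivalent -- they characterize the same set -- but operationally asymmetric in multiple independent ways. Inference is a qualitatively harder problem: it does not have access to a known grammar. Despite the centrality of this triad to compiler design, natural language processing, and formal language theory, no survey has treated it as a unified, multidimensional phenomenon. We identify six dimensions along which generation and recognition diverge: computational complexity, ambiguity, directionality, information availability, grammar inference, and temporality. We show that the common characterization "generation is easy, parsing is hard" is misleading: unconstrained generation is trivial, but generation under constraints can be NP-hard. The real asymmetry is that parsing is always constrained (the input is given) while generation need not be. Two of these dimensions -- directionality and temporality -- have not previously been identified as dimensions of the generation-recognition asymmetry.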
We connect the temporal dimension to the surprisal framework of Hale (2001) and Levy (2008), arguing that surprisal formalizes the temporal asymmetry between a generator (surprisal = 0) and a parser that predicts under uncertainty (surprisal > 0). We review bidirectional systems in NLP and observe that bidirectionality has been available for fifty years yet has not transferred to most domain-specific applications. We conclude with a discussion of large language models, which architecturally unify generation and recognition while operationally preserving the asymmetry.
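The surprisal contrast in this abstract can be made concrete with a small sketch. Surprisal is the standard quantity -log2 p; the next-symbol distribution below is a made-up illustration, not taken from the paper. A generator that has already chosen its symbol assigns it probability 1, while a parser must predict it under uncertainty.

```python
import math

def surprisal(p: float) -> float:
    """Surprisal in bits of an event with probability p, i.e. -log2(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Special-case p == 1 so the result is 0.0 rather than -0.0.
    return 0.0 if p == 1.0 else -math.log2(p)

# Hypothetical predictive distribution a parser holds over the next symbol.
parser_beliefs = {"a": 0.5, "b": 0.25, "c": 0.25}

# The generator knows which symbol it will emit, so that symbol has p = 1.
generator_surprisal = surprisal(1.0)

# The parser sees "b" arrive and scores it against its own beliefs.
parser_surprisal = surprisal(parser_beliefs["b"])
```

Under these assumed numbers the generator's surprisal is 0 bits and the parser's is 2 bits, which is the asymmetry the abstract describes.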

[37] Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation

Eeham Khan,Luis Rodriguez,Marc Queudot

Main category: cs.CL

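TL;DR: This paper proposes a retrieval-augmented generation (RAG) framework for the biomedical domain that combines neural query rewriting, BGE cross-encoder reranking, and an evidence-grounded rationale generation module with an eight-category fine-grained faithfulness verification taxonomy to improve factuality and interpretability. It performs on par with systems using much larger models on BioASQ and PubMedQA and explores the value of combined human and LLM verification for system transparency and error diagnosis.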

Details

Motivation: Standard RAG pipelines lack mechanisms to verify intermediate reasoning, which makes them prone to hallucination in high-stakes domains; reasoning faithfulness and interpretability need to improve. Method: Propose a domain-specific RAG framework that integrates neural query rewriting, BGE cross-encoder reranking, and a rationale generation module that grounds sub-claims in evidence spans; build an eight-category verification taxonomy that distinguishes explicit from implicit support patterns; combine dynamic in-context learning with reranking under constrained token budgets; and run a pilot study of human-in-the-loop verification. Result: Accuracy reaches 89.1% on BioASQ-Y/N and 73.0% on PubMedQA, above vanilla RAG baselines; explicit rationale generation improves accuracy, and dynamic demonstration selection with robust reranking further improves few-shot performance; human-in-the-loop verification improves transparency and the diagnosis of retrieval failures. Conclusion: Explicit rationale generation and fine-grained faithfulness verification can markedly improve the reliability and interpretability of RAG in high-stakes domains, achieving competitive performance without relying on larger models. Abstract: Retrieval-Augmented Generation (RAG) significantly improves the factuality of Large Language Models (LLMs), yet standard pipelines often lack mechanisms to verify intermediate reasoning, leaving them vulnerable to hallucinations in high-stakes domains. To address this, we propose a domain-specific RAG framework that integrates explicit reasoning and faithfulness verification. Our architecture augments standard retrieval with neural query rewriting, BGE-based cross-encoder reranking, and a rationale generation module that grounds sub-claims in specific evidence spans. We further introduce an eight-category verification taxonomy that enables fine-grained assessment of rationale faithfulness, distinguishing between explicit and implicit support patterns to facilitate structured error diagnosis. We evaluate this framework on the BioASQ and PubMedQA benchmarks, specifically analyzing the impact of dynamic in-context learning and reranking under constrained token budgets. Experiments demonstrate that explicit rationale generation improves accuracy over vanilla RAG baselines, while dynamic demonstration selection combined with robust reranking yields further gains in few-shot settings. Using Llama-3-8B-Instruct, our approach achieves 89.1% on BioASQ-Y/N and 73.0% on PubMedQA, competitive with systems using significantly larger models.
Additionally, we perform a pilot study combining human expert assessment with LLM-based verification to explore how explicit rationale generation improves system transparency and enables more detailed diagnosis of retrieval failures in biomedical question answering.

[38] Lost in Backpropagation: The LM Head is a Gradient Bottleneck

Nathan Godey,Yoav Artzi

Main category: cs.CL

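TL;DR: This paper identifies a 'gradient bottleneck' in the output layer of neural language models: because low-dimensional features are projected into a high-dimensional vocabulary space, backpropagation compresses the gradient, suppressing 95-99% of its norm and severely harming training efficiency and capability.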

Details

Motivation: The dimension mismatch at the output layer of neural language models (D ≪ V) causes the softmax bottleneck; the authors argue that it is not only an expressivity bottleneck but also an optimization bottleneck. Method: Quantify the output layer's compression of gradients through theoretical analysis and empirical measurement; run controlled pretraining experiments to test how the gradient bottleneck affects pattern learning and training dynamics. Result: 95-99% of the gradient norm is suppressed by the output layer, yielding heavily distorted update directions; controlled experiments show the bottleneck makes trivial patterns unlearnable and markedly degrades LLM training dynamics. Conclusion: The gradient bottleneck is an inherent source of training inefficiency at scale, independent of model architecture, and calls for new LM head designs. Abstract: The last layer of neural language models (LMs) projects output features of dimension $D$ to logits in dimension $V$, the size of the vocabulary, where usually $D \ll V$. This mismatch is known to raise risks of limited expressivity in neural LMs, creating a so-called softmax bottleneck. We show the softmax bottleneck is not only an expressivity bottleneck but also an optimization bottleneck. Backpropagating $V$-dimensional gradients through a rank-$D$ linear layer induces unavoidable compression, which alters the training feedback provided to the vast majority of the parameters. We present a theoretical analysis of this phenomenon and measure empirically that 95-99% of the gradient norm is suppressed by the output layer, resulting in vastly suboptimal update directions. We conduct controlled pretraining experiments showing that the gradient bottleneck makes trivial patterns unlearnable, and drastically affects the training dynamics of LLMs. We argue that this inherent flaw contributes to training inefficiencies at scale independently of the model architecture, and raises the need for new LM head designs.
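The compression argument can be illustrated with a toy calculation. This is a sketch under strong assumptions (random Gaussian weights and a random logit gradient, tiny sizes), not the paper's measurement protocol. Backpropagating a gradient g in R^V through a V x D weight matrix W yields W^T g, which depends only on the component of g lying in the D-dimensional column space of W; the helper below estimates what fraction of the squared gradient norm that component carries.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(columns):
    """Modified Gram-Schmidt over a list of column vectors."""
    basis = []
    for col in columns:
        w = list(col)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > 1e-12:  # skip numerically dependent columns
            basis.append([wi / norm for wi in w])
    return basis

def retained_fraction(V, D, seed=0):
    """Share of ||g||^2 that lies in the column space of a random V x D matrix."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0.0, 1.0) for _ in range(V)] for _ in range(D)]
    g = [rng.gauss(0.0, 1.0) for _ in range(V)]  # stand-in logit gradient
    basis = orthonormal_basis(columns)
    projected_sq = sum(dot(g, q) ** 2 for q in basis)
    return projected_sq / dot(g, g)

# Hypothetical sizes: vocabulary V = 200, feature dimension D = 8.
fraction_small_d = retained_fraction(V=200, D=8)
# With D == V nothing is lost, since the columns span the whole space.
fraction_full_rank = retained_fraction(V=8, D=8)
```

For a random gradient the retained share is about D/V on average, so most of the squared norm falls outside the subspace the layer can pass back. Real gradients and trained weights are not random, which is why the paper measures the effect empirically.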

[39] OpenClaw-RL: Train Any Agent Simply by Talking

Yinjie Wang,Xuyang Chen,Xiaolong Jin,Mengdi Wang,Ling Yang

Main category: cs.CL

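TL;DR: OpenClaw-RL uses universal 'next-state signals' (user replies, tool outputs, interface changes, and so on) as a live, online reinforcement learning source that unifies many kinds of interaction tasks. A PRM extracts evaluative rewards, and Hindsight-Guided On-Policy Distillation (OPD) extracts textual directive supervision from the next state, enabling asynchronous training with zero coordination overhead.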

Details

Motivation: Existing agentic RL systems do not use the ubiquitous 'next-state signals' (user replies, tool outputs, GUI changes, and so on) as a live, online learning source, and so discard a large amount of implicit feedback. Method: Propose the OpenClaw-RL framework: (1) model all interaction types (conversation, terminal, GUI, SWE, tool calls) as next-state-driven processes that share one policy; (2) use a PRM to extract scalar rewards (evaluative signals) from the next state; (3) through Hindsight-Guided On-Policy Distillation (OPD), extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision; (4) adopt an asynchronous architecture in which serving, judging, and training run in parallel without coordination. Result: For personal agents, the agent keeps improving through actual use (learning from user re-queries, corrections, and explicit feedback); for general agents, the same framework extends to terminal, GUI, software engineering, and tool-call tasks and demonstrates the utility of process rewards. Conclusion: Next-state signals are a universal, information-rich learning source; by combining evaluative rewards with directive textual supervision and supporting asynchronous online training, OpenClaw-RL offers a unified, scalable, and practical paradigm for multimodal agent RL. Abstract: Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Due to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback.
Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: https://github.com/Gen-Verse/OpenClaw-RL

[40] Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models

Eric Yocam,Varghese Vaidyan,Gurcan Comert,Paris Kalathas,Yong Wang,Judith L. Mwakalonge

Main category: cs.CL

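TL;DR: This paper proposes Adaptive Activation Cancellation (AAC), an inference-time framework that identifies and suppresses hallucination-associated neural activations (H-Nodes) to improve the factual accuracy of large language models in real time without harming general capability.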

Details

Motivation: Large language models often generate fluent but factually incorrect text (hallucinations), and existing mitigation methods tend to sacrifice general capability such as fluency or reasoning; a precise, non-degrading intervention is needed. Method: Model hallucination-associated activations as structured interference in the residual stream, by analogy with classical adaptive noise cancellation; identify Hallucination Nodes (H-Nodes) with layer-wise linear probes and suppress them during autoregressive generation with a confidence-weighted forward hook; the approach needs no external knowledge, fine-tuning, or additional inference passes. Result: On OPT-125M, Phi-3-mini, and LLaMA 3-8B, AAC consistently improves factual accuracy on TruthfulQA and HaluEval; WikiText-103 perplexity and MMLU accuracy show zero degradation; on LLaMA 3-8B, MC1, MC2, and Token-F1 improve slightly, and probe-space selectivity is 3.5x to 5.94x higher than the ITI baseline. Conclusion: AAC is a strictly surgical, zero-cost, zero-degradation inference-time intervention that shows, for the first time, that targeted suppression of a small number of neurons can improve factuality and generation quality together without harming base capabilities. Abstract: Large Language Models frequently generate fluent but factually incorrect text. We propose Adaptive Activation Cancellation (AAC), a real-time inference-time framework that treats hallucination-associated neural activations as structured interference within the transformer residual stream, drawing an explicit analogy to classical adaptive noise cancellation from signal processing. The framework identifies Hallucination Nodes (H-Nodes) via layer-wise linear probing and suppresses them using a confidence-weighted forward hook during auto-regressive generation -- requiring no external knowledge, no fine-tuning, and no additional inference passes. Evaluated across OPT-125M, Phi-3-mini, and LLaMA 3-8B on TruthfulQA and HaluEval, the real-time hook is the only intervention that consistently improves downstream accuracy on all three scales. Critically, the method is strictly surgical: WikiText-103 perplexity and MMLU reasoning accuracy are preserved at exactly 0.0% degradation across all three model scales, a property that distinguishes AAC from interventions that trade fluency or general capability for factual improvement. On the LLaMA 3-8B scale, the hook additionally yields positive generation-level gains (MC1 +0.04; MC2 +0.003; Token-F1 +0.003) while achieving probe-space selectivity 5.94x - 3.5x higher than the ITI baseline -- demonstrating that targeted neuron-level suppression can simultaneously improve factual accuracy and preserve model capability.

[41] ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation

Khoa Anh Ta,Nguyen Van Dinh,Kiet Van Nguyen

Main category: cs.CL

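TL;DR: This paper introduces ViDia2Std, the first manually annotated parallel corpus for Vietnamese dialect-to-standard translation that covers all 63 provinces and the Northern, Central, and Southern dialects. It benchmarks several sequence-to-sequence models on the dataset and shows that dialect normalization substantially improves downstream tasks.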

Details

Motivation: Vietnamese has extensive dialectal variation, especially in the Central and Southern regions, while existing NLP systems are trained mainly on standard Vietnamese and perform poorly on dialectal input; earlier dialect normalization work was limited to Central-to-Northern transfer on synthetic data and did not cover Southern varieties or variants within the North. Method: Build ViDia2Std, a parallel corpus of 13,000+ sentence pairs with full dialect coverage, sourced from Facebook comments and manually annotated by native speakers from all three regions; define a semantic mapping agreement metric to assess annotation quality; evaluate sequence-to-sequence models such as mBART-large-50 and ViT5-base on the dataset. Result: Annotation agreement is 86% (North), 82% (Central), and 85% (South); mBART-large-50 performs best (BLEU 0.8166, ROUGE-L 0.9384, METEOR 0.8925); ViT5-base comes close with fewer parameters; experiments show that dialect normalization substantially improves downstream tasks. Conclusion: ViDia2Std is the most dialectally inclusive Vietnamese normalization resource to date, and its construction and evaluation show that dialect-aware NLP resources are essential for robust Vietnamese systems. Abstract: Vietnamese exhibits extensive dialectal variation, posing challenges for NLP systems trained predominantly on standard Vietnamese. Such systems often underperform on dialectal inputs, especially from underrepresented Central and Southern regions. Previous work on dialect normalization has focused narrowly on Central-to-Northern dialect transfer using synthetic data and limited dialectal diversity. These efforts exclude Southern varieties and intra-regional variants within the North. We introduce ViDia2Std, the first manually annotated parallel corpus for dialect-to-standard Vietnamese translation covering all 63 provinces. Unlike prior datasets, ViDia2Std includes diverse dialects from Central, Southern, and non-standard Northern regions often absent from existing resources, making it the most dialectally inclusive corpus to date. The dataset consists of over 13,000 sentence pairs sourced from real-world Facebook comments and annotated by native speakers across all three dialect regions. To assess annotation consistency, we define a semantic mapping agreement metric that accounts for synonymous standard mappings across annotators. Based on this criterion, we report agreement rates of 86% (North), 82% (Central), and 85% (South). We benchmark several sequence-to-sequence models on ViDia2Std. mBART-large-50 achieves the best results (BLEU 0.8166, ROUGE-L 0.9384, METEOR 0.8925), while ViT5-base offers competitive performance with fewer parameters.
ViDia2Std demonstrates that dialect normalization substantially improves downstream tasks, highlighting the need for dialect-aware resources in building robust Vietnamese NLP systems.

[42] Sabiá-4 Technical Report

Thiago Laitz,Thales Sales Almeida,Hugo Abonizio,Roseval Malaquias Junior,Giovana Kerche Bonás,Marcos Piau,Celio Larcher,Ramon Pires,Rodrigo Nogueira

Main category: cs.CL

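TL;DR: This report introduces Sabiá-4 and Sabiazinho-4, a new generation of large language models focused on Brazilian Portuguese, built with a four-stage training pipeline (continued pre-training, long-context extension, supervised fine-tuning, and preference alignment) and achieving a favorable balance of performance and cost on several Brazil-specific benchmarks.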

Details

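Motivation: Improve the localized capability and practical usefulness of Portuguese (especially Brazilian Portuguese) models in law, dialogue, long-text understanding, and agentic tasks, where existing models fall short. Method: A four-stage training pipeline: (1) continued pre-training on Portuguese and Brazilian legal corpora; (2) context length extension to 128K tokens; (3) supervised fine-tuning on instruction data covering chat, code, legal tasks, and function calling; (4) preference alignment. Result: Strong results on six benchmark categories: Brazilian Portuguese conversation, knowledge of Brazilian legislation, long-context understanding, instruction following, standardized exams, and agentic capabilities (tool use and web navigation); a better cost-performance ratio than other models; improvements over previous generations in legal document drafting, multi-turn dialogue quality, and agentic task completion. Conclusion: Sabiá-4 and Sabiazinho-4 are efficient language models optimized for Brazilian Portuguese that balance specialization, practicality, and cost, and offer a reference point for regional model development. Abstract: This technical report presents Sabiá-4 and Sabiazinho-4, a new generation of Portuguese language models with a focus on Brazilian Portuguese. The models were developed through a four-stage training pipeline: continued pre-training on Portuguese and Brazilian legal corpora, long-context extension to 128K tokens, supervised fine-tuning on instruction data spanning chat, code, legal tasks, and function calling, and preference alignment. We evaluate the models on six benchmark categories: conversational capabilities in Brazilian Portuguese, knowledge of Brazilian legislation, long-context understanding, instruction following, standardized exams, and agentic capabilities including tool use and web navigation. Results show that Sabiá-4 and Sabiazinho-4 achieve a favorable cost-performance trade-off compared to other models, positioning them in the upper-left region of the pricing-accuracy chart. The models show improvements over previous generations in legal document drafting, multi-turn dialogue quality, and agentic task completion.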

[43] S-GRADES -- Studying Generalization of Student Response Assessments in Diverse Evaluative Settings

Tasfia Seuti,Sagnik Ray Choudhury

Main category: cs.CL

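TL;DR: This paper introduces S-GRADES, a benchmark that unifies the evaluation of automated grading models for student responses (both long essays and short answers). It consolidates 14 datasets, standardizes the interface and evaluation protocol, validates the benchmark with large language models, and exposes generalization and reliability problems in cross-paradigm evaluation.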

Details

Motivation: AES and ASAG have long developed separately, with scattered datasets, inconsistent metrics, and divided communities; a unified benchmark is needed to support cross-paradigm research. Method: Build S-GRADES, an open-source, extensible web-based benchmark that consolidates 14 diverse student response grading datasets behind a unified API and a standardized evaluation procedure; systematically evaluate three large language models on it with multiple prompting strategies, exemplar selection, and cross-dataset exemplar transfer. Result: Current large language models generalize poorly across response types (essays versus short answers), and the evaluation exposes clear gaps in reliability and cross-task consistency. Conclusion: S-GRADES gives educational NLP its first unified evaluation platform that is cross-paradigm, reproducible, and easy to extend, and underlines the importance of standardized evaluation for fair and robust automated assessment of student responses. Abstract: Evaluating student responses, from long essays to short factual answers, is a key challenge in educational NLP. Automated Essay Scoring (AES) focuses on holistic writing qualities such as coherence and argumentation, while Automatic Short Answer Grading (ASAG) emphasizes factual correctness and conceptual understanding. Despite their shared goal, these paradigms have progressed in isolation with fragmented datasets, inconsistent metrics, and separate communities. We introduce S-GRADES (Studying Generalization of Student Response Assessments in Diverse Evaluative Settings), a web-based benchmark that consolidates 14 diverse grading datasets under a unified interface with standardized access and reproducible evaluation protocols. The benchmark is fully open-source and designed for extensibility, enabling continuous integration of new datasets and evaluation settings. To demonstrate the utility of S-GRADES, we evaluate three state-of-the-art large language models across the benchmark using multiple reasoning strategies in prompting. We further examine the effects of exemplar selection and cross-dataset exemplar transfer. Our analyses illustrate how benchmark-driven evaluation reveals reliability and generalization gaps across essay and short-answer grading tasks, highlighting the importance of standardized, cross-paradigm assessment.

[44] GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

Zhouxiang Fang,Jiawei Zhou,Hanjie Chen

Main category: cs.CL

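TL;DR: This paper proposes GR-SAP, a framework that uses generative replay to synthesize domain-specific safety alignment data as a substitute for the inaccessible original alignment data during fine-tuning, mitigating safety alignment degradation while preserving downstream task performance.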

Details

Motivation: Fine-tuning easily compromises the safety alignment of large language models, and the original alignment data needed for joint optimization is usually inaccessible. Method: Propose Generative Replay for Safety Alignment Preservation (GR-SAP), which uses LLMs to generate domain-specific safety alignment data and integrates it during downstream adaptation; validate the approach with theoretical and empirical analyses. Result: Across multiple models and downstream tasks, GR-SAP substantially mitigates fine-tuning-induced safety degradation while maintaining comparable downstream performance. Conclusion: Generative replay can serve as a reliable proxy for the original alignment data, and GR-SAP offers a practical way to maintain safety alignment in open-weight models. Abstract: Recent studies show that the safety alignment of large language models (LLMs) can be easily compromised even by seemingly non-adversarial fine-tuning. To preserve safety alignment during fine-tuning, a widely used strategy is to jointly optimize safety and task objectives by mixing in the original alignment data, which is typically inaccessible even for open-weight LLMs. Inspired by generative replay in continual learning, we propose Generative Replay for Safety Alignment Preservation (GR-SAP), a unified framework that synthesizes domain-specific alignment data from LLMs and integrates them during downstream adaptation to preserve safety alignment. Theoretical and empirical analyses demonstrate this synthetic data serves as a reliable proxy for the original alignment data. Experiments across various models and downstream tasks show that GR-SAP substantially mitigates fine-tuning-induced safety degradation while maintaining comparable downstream performance. Our code is available at https://github.com/chili-lab/gr-sap.

[45] Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas

Tim Schopf,Michael Färber

Main category: cs.CL

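TL;DR: This paper introduces RINoBench, the first comprehensive benchmark for large-scale evaluation of research idea novelty judgments, comprising 1,381 expert-annotated research ideas and nine automated evaluation metrics. Experiments show that large language models produce human-like reasoning, yet their novelty judgments diverge significantly from human standards.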

Details

Motivation: The rapid growth of scientific literature makes manual novelty judgment through literature review laborious, subjective, and unscalable; existing automated methods lack a unified, comparable large-scale evaluation benchmark. Method: Build the RINoBench benchmark: collect 1,381 research ideas with expert-annotated novelty scores and textual justifications, and design nine automated evaluation metrics covering both scores and justifications; systematically evaluate several state-of-the-art large language models on the benchmark. Result: LLM-generated reasoning closely resembles human rationales, but the resulting novelty judgments diverge significantly from the human gold standard, even for leading reasoning-capable models. Conclusion: Current large language models cannot yet reliably replace humans in judging the novelty of research ideas; RINoBench provides the first standardized, reproducible large-scale evaluation platform for the task and supports further method development. Abstract: Judging the novelty of research ideas is crucial for advancing science, enabling the identification of unexplored directions, and ensuring contributions meaningfully extend existing knowledge rather than reiterate minor variations. However, given the exponential growth of scientific literature, manually judging the novelty of research ideas through literature reviews is labor-intensive, subjective, and infeasible at scale. Therefore, recent efforts have proposed automated approaches for research idea novelty judgment. Yet, evaluation of these approaches remains largely inconsistent and is typically based on non-standardized human evaluations, hindering large-scale, comparable evaluations. To address this, we introduce RINoBench, the first comprehensive benchmark for large-scale evaluation of research idea novelty judgments. It comprises 1,381 research ideas derived from and judged by human experts as well as nine automated evaluation metrics designed to assess both rubric-based novelty scores and textual justifications of novelty judgments. Using this benchmark, we evaluate several state-of-the-art large language models (LLMs) on their ability to judge the novelty of research ideas. Our findings reveal that while LLM-generated reasoning closely mirrors human rationales, this alignment does not reliably translate into accurate novelty judgments, which diverge significantly from human gold standard judgments - even among leading reasoning-capable models. Data and code available at: https://github.com/TimSchopf/RINoBench.

Kristy A. Carpenter,Issah A. Samori,Mathew V. Kiang,Keith Humphreys,Anna Lembke,Johannes C. Eichstaedt,Russ B. Altman

Main category: cs.CL

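TL;DR: This paper uses large language models (LLMs) to disambiguate opioid-related slang in social media text. It evaluates GPT-4, GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5 on three tasks (lexicon-based, lexicon-free, and emergent slang), and the LLMs substantially outperform traditional lexicon methods.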

Details

Motivation: Opioid-related content on social media is sparse and its slang is highly ambiguous (for example, 'smack' and 'blues' have non-drug meanings); traditional lexicon-based methods suffer from low recall and many false positives, so a more precise approach to semantic understanding at scale is needed. Method: Design three evaluation tasks: (1) lexicon-based, where the model decides whether a given slang term refers to opioids in its context; (2) lexicon-free, where the model identifies opioid-related posts from context alone; (3) emergent slang, where simulated new slang terms test generalization; compare four frontier LLMs with baseline lexicon strategies on F1, precision, recall, and related metrics on a shared dataset. Result: All LLMs outperform lexicon methods on all three tasks: F1 of 0.540-0.972 on the lexicon-based task (lexicons only 0.009-0.126); F1 of 0.544-0.769 on the lexicon-free task (lexicons at most 0.540); average F1 of 0.712 on emergent slang (lexicons below 0.5). On emergent slang the LLMs show high precision (average 0.981) and higher recall than the lexicons. Conclusion: LLMs have strong contextual reasoning and can handle low-frequency, ambiguous, and evolving domain terminology, substantially improving data quality for monitoring the opioid crisis; the approach can transfer to other low-prevalence public health topics. Abstract: Social media text shows promise for monitoring trends in the opioid overdose crisis; however, the overwhelming majority of social media text is unrelated to opioids. When leveraging social media text to monitor trends in the ongoing opioid overdose crisis, a common strategy for identifying relevant content is to use a lexicon of opioid-related terms as inclusion criteria. However, many slang terms for opioids, such as "smack" or "blues," have common non-opioid meanings, making them ambiguous. The advanced textual reasoning capability of large language models (LLMs) presents an opportunity to disambiguate these slang terms at scale. We present three tasks on which to evaluate four state-of-the-art LLMs (GPT-4, GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5): a lexicon-based setting, in which the LLM must disambiguate a specific term within the context of a given post; a lexicon-free setting, in which the LLM must identify opioid-related posts from context without a lexicon; and an emergent slang setting, in which the LLM must identify opioid-related posts with simulated new slang terms. All four LLMs showed excellent performance across all tasks. In both subtasks of the lexicon-based setting, LLM F1 scores ("fenty" subtask: 0.824-0.972; "smack" subtask: 0.540-0.862) far exceeded those of the best lexicon strategy (0.126 and 0.009, respectively). In the lexicon-free task, LLM F1 scores (0.544-0.769) surpassed those of lexicons (0.080-0.540), and LLMs demonstrated uniformly higher recall.
On emergent slang, all LLMs had higher accuracy (average: 0.784), F1 score (average: 0.712), precision (average: 0.981), and recall (average: 0.587) than the two lexicons assessed. Our results show that LLMs can be used to identify relevant content for low-prevalence topics, including but not limited to opioid references, enhancing data provided to downstream analyses and predictive models.
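To see why a lexicon baseline struggles with ambiguous slang, here is a minimal sketch of lexicon matching scored with precision, recall, and F1. The two-term lexicon and the five labeled posts are invented for illustration only; they are not drawn from the paper's data or lexicons.

```python
import re

# Hypothetical two-term lexicon and toy labeled posts (text, is_opioid_related).
LEXICON = {"smack", "blues"}
POSTS = [
    ("he gave the table a smack", False),
    ("feeling the blues after the game", False),
    ("scored some smack last night", True),
    ("those blues hit way too hard, nodded off", True),
    ("picked up percs from a friend", True),
]

def lexicon_match(text, lexicon=LEXICON):
    """Flag a post if any lowercase word token appears in the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(token in lexicon for token in tokens)

def precision_recall_f1(predictions, labels):
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

predictions = [lexicon_match(text) for text, _ in POSTS]
labels = [label for _, label in POSTS]
precision, recall, f1 = precision_recall_f1(predictions, labels)
```

On this toy set the lexicon flags both non-drug uses of "smack" and "blues" (false positives) and misses the post whose slang is not in the lexicon (a false negative), the two failure modes that context-aware disambiguation is meant to address.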

[47] Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck

Hongbin Zhang,Kehai Chen,Xuefen Bai,Youcheng Pan,Yang Xiang,Jinpeng Wang,Min Zhang

Main category: cs.CL

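TL;DR: This paper proposes DIBJudge, a framework that uses variational information compression and a cross-covariance penalty to mitigate the systematic preference of LLM judges for machine-translated text in multilingual evaluation (translationese bias), with particularly strong effects in low-resource languages.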

Details

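Motivation: Large language models used for multilingual evaluation show a systematic translationese bias: they prefer machine-translated text over human-authored references, more severely in low-resource languages; the bias stems from spurious correlations with latent manifold alignment with English and with cross-lingual predictability. Method: Propose DIBJudge, a robust fine-tuning framework that uses variational information compression to learn a minimally sufficient, judgment-critical representation and explicitly isolates spurious factors into a dedicated bias branch; add a cross-covariance penalty that suppresses statistical dependence between the robust and bias representations to encourage disentanglement. Result: Extensive experiments on multilingual reward modeling benchmarks and a dedicated translationese bias evaluation suite show that DIBJudge consistently outperforms strong baselines and substantially mitigates translationese bias. Conclusion: By disentangling genuine judgment information from spurious bias signals, DIBJudge offers a new approach to building fairer and more robust multilingual evaluation models. Abstract: Large language models (LLMs) have become a standard for multilingual evaluation, yet they exhibit a severe systematic translationese bias. In this paper, translationese bias is characterized as LLMs systematically favoring machine-translated text over human-authored references, particularly in low-resource languages. We attribute this bias to spurious correlations with (i) latent manifold alignment with English and (ii) cross-lingual predictability. To mitigate this bias, we propose DIBJudge, a robust fine-tuning framework that learns a minimally sufficient, judgment-critical representation via variational information compression, while explicitly isolating spurious factors into the dedicated bias branch. Furthermore, we incorporate a cross-covariance penalty that explicitly suppresses statistical dependence between robust and bias representations, thereby encouraging effective disentanglement. Extensive evaluations on multilingual reward modeling benchmarks and a dedicated translationese bias evaluation suite demonstrate that the proposed DIBJudge consistently outperforms strong baselines and substantially mitigates translationese bias.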
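The abstract does not give the exact form of the cross-covariance penalty. A common formulation, assumed here, is the squared Frobenius norm of the batch cross-covariance matrix between the two representation branches; the sketch below implements that version in plain Python on toy one-dimensional representations, with all names hypothetical.

```python
def center(rows):
    """Subtract the per-dimension batch mean from every row."""
    n, dims = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    return [[row[j] - means[j] for j in range(dims)] for row in rows]

def cross_covariance_penalty(robust, bias):
    """Sum of squared entries of the cross-covariance between two batches.

    Assumed form, not confirmed by the source: C = Zr^T Zb / n on centered
    batches, penalty = ||C||_F^2. It is zero when the branches are
    linearly uncorrelated across the batch.
    """
    if len(robust) != len(bias):
        raise ValueError("both batches must contain the same number of examples")
    n = len(robust)
    zr, zb = center(robust), center(bias)
    penalty = 0.0
    for i in range(len(zr[0])):
        for j in range(len(zb[0])):
            c_ij = sum(zr[k][i] * zb[k][j] for k in range(n)) / n
            penalty += c_ij ** 2
    return penalty

# Toy batches of four examples with one feature per branch.
robust_repr = [[1.0], [-1.0], [1.0], [-1.0]]
uncorrelated_bias = [[1.0], [1.0], [-1.0], [-1.0]]
penalty_uncorrelated = cross_covariance_penalty(robust_repr, uncorrelated_bias)
penalty_identical = cross_covariance_penalty(robust_repr, robust_repr)
```

The uncorrelated pair gives a penalty of 0 and the identical pair gives 1, so minimizing this term pushes the bias branch away from encoding the same linear signal as the robust branch. It captures only linear dependence, which is one reason a real system would pair it with other objectives.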

[48] Dynamic Knowledge Fusion for Multi-Domain Dialogue State Tracking

Haoxiang Su,Ruiyu Fang,Liting Jiang,Xiaomeng Huang,Shuangyong Song

Main category: cs.CL

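TL;DR: This paper proposes a dynamic knowledge fusion framework for multi-domain dialogue state tracking (DST) that addresses the difficulty of modeling dialogue history and the scarcity of annotated data, using contrastive-learning-based encoding and structured contextual prompts to improve tracking accuracy and generalization.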

Details

Motivation: Multi-domain dialogue state tracking (DST) faces two key challenges that limit model performance: dialogue history is hard to model effectively, and annotated data is limited. Method: Propose a two-stage dynamic knowledge fusion framework: in the first stage, an encoder network trained with contrastive learning encodes the dialogue history and candidate slots and selects relevant slots by correlation score; in the second stage, the structured information of the selected slots is used as contextual prompts, fusing knowledge dynamically to improve the accuracy and consistency of state tracking. Result: On multi-domain dialogue benchmarks the method notably improves tracking accuracy and generalization, confirming its effectiveness in complex dialogue scenarios. Conclusion: The dynamic knowledge fusion framework integrates dialogue context and domain knowledge more accurately and provides an efficient, robust solution for multi-domain DST. Abstract: The performance of task-oriented dialogue models is strongly tied to how well they track dialogue states, which records and updates user information across multi-turn interactions. However, current multi-domain DST encounters two key challenges: the difficulty of effectively modeling dialogue history and the limited availability of annotated data, both of which hinder model performance. To tackle the aforementioned problems, we develop a dynamic knowledge fusion framework applicable to multi-domain DST. The model operates in two stages: first, an encoder-only network trained with contrastive learning encodes dialogue history and candidate slots, selecting relevant slots based on correlation scores; second, dynamic knowledge fusion leverages the structured information of selected slots as contextual prompts to enhance the accuracy and consistency of dialogue state tracking. This design enables more accurate integration of dialogue context and domain knowledge. Results obtained from multi-domain dialogue benchmarks indicate that our method notably improves both tracking accuracy and generalization, validating its capability in handling complex dialogue scenarios.

[49] Aligning Large Language Models with Searcher Preferences

Wei Wu,Peilun Zhou,Liyi Chen,Qimeng Wang,Chengqiang Lu,Yan Gao,Yi Wu,Yao Hu,Hui Xiong

Main category: cs.CL

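TL;DR: This paper presents SearchLLM, the first large language model for open-ended generative search. It uses a hierarchical, multi-dimensional reward system and a gated aggregation strategy to optimize generation quality, robustness, and alignment with user needs, and its effectiveness and safety are validated in RedNote's AI search.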

Details

Motivation: Research on and deployment of generative search on open content platforms remain limited, with challenges including poor robustness to noisy retrieval, strict safety requirements, and diverse user needs. Method: Introduces the SearchLLM model; designs a hierarchical, multi-dimensional reward system (covering factuality, answer quality, format compliance, robustness, and user alignment); combines rule-based checks with human-calibrated LLM judges to produce an interpretable score vector; trains with a Gated Aggregation Strategy and Group Relative Policy Optimization (GRPO). Result: After deployment in the AI search entry of RedNote, offline evaluations and online A/B tests show a 1.03% increase in Valid Consumption Rate and a 2.81% reduction in Re-search Rate, while strict safety and reliability standards are met. Conclusion: SearchLLM provides a feasible and safe LLM solution for open-ended generative search and validates the joint design of hierarchical reward modeling and behavior optimization. Abstract: The paradigm shift from item-centric ranking to answer-centric synthesis is redefining the role of search engines. While recent industrial progress has applied generative techniques to closed-set item ranking in e-commerce, research and deployment of open-ended generative search on large content platforms remain limited. This setting introduces challenges, including robustness to noisy retrieval, non-negotiable safety guarantees, and alignment with diverse user needs. In this work, we introduce SearchLLM, the first large language model (LLM) for open-ended generative search. We design a hierarchical, multi-dimensional reward system that separates bottom-line constraints, including factual grounding, basic answer quality and format compliance, from behavior optimization objectives that promote robustness to noisy retrieval and alignment with user needs. Concretely, our reward model evaluates responses conditioned on the user query, session history, and retrieved evidence set, combining rule-based checks with human-calibrated LLM judges to produce an interpretable score vector over these dimensions. We introduce a Gated Aggregation Strategy to derive the training reward for optimizing SearchLLM with Group Relative Policy Optimization (GRPO). We deploy SearchLLM in the AI search entry of RedNote. Offline evaluations and online A/B tests show improved generation quality and user engagement, increasing Valid Consumption Rate by 1.03% and reducing Re-search Rate by 2.81%, while upholding strict safety and reliability standards.
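
Illustrative sketch (not from the paper): the abstract does not specify how the Gated Aggregation Strategy combines the score vector, so the gate below is a hypothetical stand-in in which any failed bottom-line constraint zeroes the reward. The second function shows the group-relative advantage normalization usually associated with GRPO; the use of the population standard deviation and the `eps` term are assumptions, and all names and thresholds are invented for illustration.

```python
from statistics import mean, pstdev


def gated_reward(constraint_scores, behavior_scores, threshold=0.5):
    """Hypothetical gate: bottom-line constraints must all pass first.

    constraint_scores: dict of bottom-line scores (e.g. factual grounding)
    behavior_scores:   dict of behavior-optimization scores
    """
    if any(score < threshold for score in constraint_scores.values()):
        return 0.0  # a failed constraint blocks all behavior reward
    return mean(behavior_scores.values())


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the group sampled for the same query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Separating the gate from the normalization keeps the two ideas independent: the gate decides what a single response earns, while the group-relative step only compares responses to the same query with one another.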

[50] Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs

Panatchakorn Anantaprayoon,Nataliia Babina,Nima Asgharbeygi,Jad Tarifi

Main category: cs.CL

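TL;DR: This paper proposes a multi-agent negotiation-based alignment framework that aligns large language models (LLMs) to the Collective Agency (CA) objective while improving their conflict-resolution ability in value-conflict scenarios, achieving scalable alignment through two-role self-play and RLAIF training.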

Details

Motivation: Existing alignment methods such as RLHF and Constitutional AI work well in single-agent settings but lack the negotiation and trade-off abilities needed in multi-stakeholder settings with conflicting values, making it hard to support collective decision-making. Method: Builds a two-role self-play dialogue framework with opposing personas, using synthetic moral-dilemma prompts and conflicting persona pairs; trains via RLAIF with GRPO, using an external LLM as the reward model, computing rewards from the CA score of the final output and applying gradients to the dialogue tokens to optimize the negotiation process. Result: The model keeps CA alignment comparable to a single-agent baseline while substantially improving conflict resolution, without degrading general language capabilities. Conclusion: Negotiation-driven deliberation training offers a practical path toward LLMs that support collective decision-making under value conflict. Abstract: The alignment of large language models (LLMs) has progressed substantially in single-agent settings through paradigms such as RLHF and Constitutional AI, with recent work exploring scalable alternatives such as RLAIF and evolving alignment objectives. However, these approaches remain limited in multi-stakeholder settings, where conflicting values arise and deliberative negotiation capabilities are required. This work proposes a multi-agent negotiation-based alignment framework that aligns LLMs to Collective Agency (CA), an existing alignment objective introduced to promote the continual expansion of agency, while simultaneously improving conflict-resolution capability. To enable scalable training, two self-play instances of the same LLM, assigned opposing personas, engage in structured turn-based dialogue to synthesize mutually beneficial solutions. We generate synthetic moral-dilemma prompts and conflicting persona pairs, and optimize the policy via RLAIF using GRPO with an external LLM reward model. While rewards are computed from CA scores assigned to the final completion, gradients are applied to dialogue tokens to directly improve deliberative interaction dynamics. Experiments show that the resulting model achieves CA alignment comparable to a single-agent baseline while substantially improving conflict-resolution performance without degrading general language capabilities. These results suggest that negotiation-driven deliberation training provides a practical path toward LLMs that better support collective decision-making in value-conflict scenarios.

[51] PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses

Minki Hong,Eunsoo Lee,Sohyun Park,Jihie Kim

Main category: cs.CL

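TL;DR: This paper proposes PEEM, a framework for joint, interpretable evaluation of prompts and model responses, with structured scores and natural-language rationales across 9 axes, and validates its alignment with accuracy, its robustness, and its usefulness for guiding optimization.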

Details

Motivation: Existing prompt evaluation relies mainly on answer correctness and offers neither explanations of failures nor actionable suggestions for improvement. Method: Proposes the PEEM framework with 3 prompt criteria (clarity/structure, linguistic quality, fairness) and 6 response criteria (accuracy, coherence, relevance, objectivity, clarity, conciseness), using an LLM evaluator to output scores on a 1-5 Likert scale with corresponding natural-language rationales. Result: PEEM's accuracy axis aligns closely with conventional accuracy (Spearman rho about 0.97); judgments are consistent across evaluator models (pairwise rho = 0.68-0.85); scores remain stable under meaning-preserving paraphrases (robustness rate about 76.7-80.6%); and zero-shot prompt rewriting driven by PEEM feedback improves downstream accuracy by up to 11.7 points. Conclusion: PEEM provides a reproducible, criterion-driven evaluation protocol for systematically diagnosing and optimizing LLM prompt engineering. Abstract: Prompt design is a primary control interface for large language models (LLMs), yet standard evaluations largely reduce performance to answer correctness, obscuring why a prompt succeeds or fails and providing little actionable guidance. We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses. PEEM defines a structured rubric with 9 axes: 3 prompt criteria (clarity/structure, linguistic quality, fairness) and 6 response criteria (accuracy, coherence, relevance, objectivity, clarity, conciseness), and uses an LLM-based evaluator to output (i) scalar scores on a 1-5 Likert scale and (ii) criterion-specific natural-language rationales grounded in the rubric. Across 7 benchmarks and 5 task models, PEEM's accuracy axis strongly aligns with conventional accuracy while preserving model rankings (aggregate Spearman rho about 0.97, Pearson r about 0.94, p < 0.001). A multi-evaluator study with four models shows consistent relative judgments (pairwise rho = 0.68-0.85), supporting evaluator-agnostic deployment. Beyond alignment, PEEM captures complementary linguistic failure modes and remains informative under prompt perturbations: prompt-quality trends track downstream accuracy under iterative rewrites, semantic adversarial manipulations induce clear score degradation, and meaning-preserving paraphrases yield high stability (robustness rate about 76.7-80.6%). Finally, using only PEEM scores and rationales as feedback, a zero-shot prompt rewriting loop improves downstream accuracy by up to 11.7 points, outperforming supervised and RL-based prompt-optimization baselines. Overall, PEEM provides a reproducible, criterion-driven protocol that links prompt formulation to response behavior and enables systematic diagnosis and optimization of LLM interactions.

[52] Human-AI Co-reasoning for Clinical Diagnosis with Evidence-Integrated Language Agent

Zhongzhen Huang,Yan Ling,Hong Chen,Ye Feng,Li Wu,Linjie Mu,Shaoting Zhang,Xiaofan Zhang,Kun Qian,Xiaomu Li

Main category: cs.CL

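TL;DR: This paper presents PULSE, a medical reasoning agent that combines a domain-tuned large language model with scientific literature retrieval to support diagnostic decisions in complex real-world endocrinology cases. On a benchmark of 82 real cases, PULSE reaches expert-level accuracy, outperforming residents and junior specialists and matching senior specialists; its performance does not decline with disease rarity, and it shows human-like adaptive reasoning. Human-AI collaboration broadens diagnostic hypotheses and improves error correction but carries a risk of automation bias.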

Details

Motivation: Physicians, especially less experienced ones, struggle to recognize rare diseases and often achieve insufficient diagnostic coverage in complex cases; the study explores the potential and limits of AI agents as assistants in real clinical settings. Method: Builds the PULSE medical reasoning agent (domain-adapted LLM plus literature retrieval), curates an evaluation benchmark of 82 real endocrinology cases, and runs controlled experiments that (1) compare diagnostic accuracy (Top@1/Top@4) against physicians of varying seniority, from residents to senior specialists; (2) analyze how AI assistance changes physicians' diagnostic behavior; and (3) compare serial and concurrent collaboration workflows. Result: PULSE reaches senior-specialist level at Top@1 and Top@4, and its performance does not drop as disease incidence falls; output length grows with case difficulty, indicating adaptive reasoning; in collaboration, physicians correct errors and broaden their differential diagnoses, but automation bias appears; both collaboration modes are effective. Conclusion: PULSE shows potential to support and augment physicians in clinical diagnosis, especially for rare diseases and complex reasoning; the study also exposes inherent risks of AI assistance, such as automation bias, and offers a methodological framework for evaluating AI medical agents and integrating them into clinical practice. Abstract: We present PULSE, a medical reasoning agent that combines a domain-tuned large language model with scientific literature retrieval to support diagnostic decision-making in complex real-world cases. To evaluate its capabilities, we curated a benchmark of 82 authentic endocrinology case reports encompassing a broad spectrum of disease types and incidence levels. In controlled experiments, we compared PULSE's performance against physicians with varying levels of expertise, from residents to senior specialists, and examined how AI assistance influenced human diagnostic reasoning. PULSE attained expert-competitive accuracy, outperforming residents and junior specialists while matching senior specialist performance at both Top@1 and Top@4 thresholds. Unlike physicians, whose accuracy declined with disease rarity, PULSE maintained stable performance across incidence tiers. The agent also exhibited adaptive reasoning, increasing output length with case difficulty in a manner analogous to the longer deliberation observed among expert clinicians. When used collaboratively, PULSE enabled physicians to correct initial errors and broaden diagnostic hypotheses, but also introduced risks of automation bias. The study explores both serial and concurrent collaboration workflows, revealing that PULSE offers robust support across common and rare presentations. These findings underscore both the promise and the limitations of language model-based agents in clinical diagnosis, and offer a framework for evaluating their role in real-world decision-making.

[53] VERI-DPO: Evidence-Aware Alignment for Clinical Summarization via Claim Verification and Direct Preference Optimization

Weixin Liu,Congning Ni,Qingyuan Song,Susannah L. Rose,Christopher Symons,Murat Kantarcioglu,Bradley A. Malin,Zhijun Yin

Main category: cs.CL

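TL;DR: This paper proposes VERI-DPO, which mines preferences via claim verification and applies Direct Preference Optimization (DPO) to substantially improve the factual consistency of Brief Hospital Course (BHC) summaries for ICU patients, lowering the rate of unsupported claims while preserving informativeness.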

Details

Motivation: LLM-based clinical summarizers tend to produce statements unsupported by EHR evidence, and alignment can cause "say-less" degeneration (overly conservative summaries that omit key information). Method: Trains a retrieval-augmented verifier on MIMIC-III-Ext-VeriFact-BHC to label claim-evidence pairs in a single-token format; uses the verifier's outputs to mine coverage-aware, contradiction-anchored, length-controlled preference pairs; distills these preferences into the summarizer with DPO. Result: On held-out patients, VERI-DPO reduces the Not Supported claim rate from 10.7% to 1.9% (local verifier judge) and from 11.6% to 6.4% (GPT-4o judge), and improves validity from 76.7% to 82.5%, while maintaining informative length. Conclusion: A verifiability-driven preference learning framework (VERI-DPO) can mitigate both hallucination and under-reporting in LLM clinical summaries, improving factuality and informativeness together. Abstract: Brief Hospital Course (BHC) narratives must be clinically useful yet faithful to fragmented EHR evidence. LLM-based clinical summarizers still introduce unsupported statements, and alignment can encourage omissions ("say-less" degeneration). We introduce VERI-DPO, which uses claim verification to mine preferences and distill them into the summarizer with Direct Preference Optimization (DPO). On MIMIC-III-Ext-VeriFact-BHC (100 ICU patients; patient-level splits), we train a retrieval-augmented verifier to label claim-evidence pairs as Supported, Not Supported, or Not Addressed via a single-token format. The verifier scores sentence-level claims from sampled BHC candidates and aggregates margins into a coverage-aware utility to mine length-controlled, contradiction-anchored preference pairs. On held-out patients, verifier-mined preferences separate candidates by contradiction density, and VERI-DPO reduces Not Supported claim rates from 10.7% to 1.9% (local verifier judge) and from 11.6% to 6.4% (GPT-4o judge), while improving validity from 76.7% to 82.5% and maintaining informative length.
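
Illustrative sketch (not from the paper): the standard per-pair DPO objective, which the abstract names but does not spell out, is the negative log-sigmoid of a scaled difference of log-probability ratios. The inputs here are summed sequence log-probabilities under the policy and a frozen reference model; the value `beta=0.1` and the function name are assumptions made for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio)).

    Each argument is a sequence log-probability; beta is an assumed setting.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen)
        - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls as the policy raises the preferred summary's likelihood relative to the rejected one.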

[54] Safe and Scalable Web Agent Learning via Recreated Websites

Hyungjoo Chae,Jungsoo Park,Alan Ritter

Main category: cs.CL

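TL;DR: VeriEnv is a framework that uses language models to automatically generate executable, verifiable synthetic website environments, so that web agents can train and evolve autonomously in a safe, controlled setting with programmatic rewards.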

Details

Motivation: Real-world websites are unsafe for training autonomous web agents, hard to reset, and give unreliable feedback, so verifiable and scalable alternative training environments are needed. Method: Proposes VeriEnv, which treats language models as environment creators that clone real websites into executable synthetic environments and exposes controlled internal access via a Python SDK, letting agents self-generate tasks with deterministic, programmatically verifiable rewards. Result: On web agent benchmarks, agents trained with VeriEnv generalize to unseen websites, achieve site-specific mastery through self-evolving training, and improve as the number of training environments grows. Conclusion: VeriEnv decouples agent learning from the risks of real-web interaction and provides a safe, verifiable, scalable training paradigm for self-evolving web agents. Abstract: Training autonomous web agents is fundamentally limited by the environments they learn from: real-world websites are unsafe to explore, hard to reset, and rarely provide verifiable feedback. We propose VeriEnv, a framework that treats language models as environment creators, automatically cloning real-world websites into fully executable, verifiable synthetic environments. By exposing controlled internal access via a Python SDK, VeriEnv enables agents to self-generate tasks with deterministic, programmatically verifiable rewards, eliminating reliance on heuristic or LLM-based judges. This design decouples agent learning from unsafe real-world interaction while enabling scalable self-evolution through environment expansion. Through experiments on web agent benchmarks, we show that agents trained with VeriEnv generalize to unseen websites, achieve site-specific mastery through self-evolving training, and benefit from scaling the number of training environments. Code and resources will be released at https://github.com/kyle8581/VeriEnv upon acceptance.

[55] AILS-NTUA at SemEval-2026 Task 8: Evaluating Multi-Turn RAG Conversations

Dimosthenis Athanasiou,Maria Lymperaiou,Giorgos Filandrianos,Athanasios Voulodimos,Giorgos Stamou

Main category: cs.CL

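TL;DR: The AILS-NTUA system for SemEval-2026 Task 8 uses a unified architecture with a query-diversity strategy and a multistage generation pipeline to address all three subtasks of multi-turn retrieval-augmented generation, ranking 1st in Task A and 2nd in Task B.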

Details

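Motivation: Addresses poor coordination between retrieval and generation and limited end-to-end performance in multi-turn retrieval-augmented generation (RAG), and explores better query reformulation and generation calibration mechanisms. Method: Adopts a "query diversity over retriever diversity" strategy: five complementary LLM-generated query reformulations are issued to a single sparse retriever and fused via variance-aware nested RRF; generation is split into evidence span extraction, dual-candidate drafting, and calibrated multi-judge selection. Result: Ranks 1st in Task A (nDCG@5 = 0.5776, +20.5% over the strongest baseline) and 2nd in Task B (HM = 0.7698); analysis shows that query diversity over a well-aligned retriever beats heterogeneous retriever ensembling and that answerability calibration is the end-to-end bottleneck. Conclusion: Within a unified architecture, query diversity and calibration at the generation stage matter more than stacking diverse retrievers; end-to-end RAG performance hinges on answerability judgment rather than retrieval coverage alone. Abstract: We present the AILS-NTUA system for SemEval-2026 Task 8 (MTRAGEval), addressing all three subtasks of multi-turn retrieval-augmented generation: passage retrieval (A), reference-grounded response generation (B), and end-to-end RAG (C). Our unified architecture is built on two principles: (i) a query-diversity-over-retriever-diversity strategy, where five complementary LLM-based query reformulations are issued to a single corpus-aligned sparse retriever and fused via variance-aware nested Reciprocal Rank Fusion; and (ii) a multistage generation pipeline that decomposes grounded generation into evidence span extraction, dual-candidate drafting, and calibrated multi-judge selection. Our system ranks 1st in Task A (nDCG@5: 0.5776, +20.5% over the strongest baseline) and 2nd in Task B (HM: 0.7698). Empirical analysis shows that query diversity over a well-aligned retriever outperforms heterogeneous retriever ensembling, and that answerability calibration, rather than retrieval coverage, is the primary bottleneck in end-to-end performance.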
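
Illustrative sketch (not from the paper): plain Reciprocal Rank Fusion over the ranked lists returned for several query reformulations. The abstract's variance-aware nested variant is not described in enough detail to reproduce, so this shows only the basic technique; `k=60` is the conventional smoothing constant, and the function name and tie-breaking rule are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids with plain RRF.

    rankings: list of ranked lists, one per query reformulation (best first)
    Returns document ids sorted by fused score, highest first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] += 1.0 / (k + rank)
    # Break ties by document id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because RRF uses ranks rather than raw retrieval scores, lists produced by differently worded queries can be combined without score calibration.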

[56] Automatic End-to-End Data Integration using Large Language Models

Aaron Steiner,Christian Bizer

Main category: cs.CL

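TL;DR: This paper presents a fully automatic data integration pipeline based on GPT-5.2 that generates schema mappings, value mappings, entity-matching training data, and data-fusion validation data; across three case studies it performs on par with, and sometimes better than, human-designed pipelines at a far lower cost.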

Details

Motivation: Reduce the manual configuration and labeling that data engineers must do to build end-to-end data integration pipelines, and explore whether LLMs can replace all human input across the pipeline. Method: Uses GPT-5.2 to generate four kinds of pipeline artifacts: schema mappings, value mappings for data normalization, training data for entity matching, and validation data for selecting conflict resolution heuristics in data fusion; evaluates on three case studies (video games, music, companies). Result: The LLM-configured pipelines yield integrated datasets comparable in size and density to those of human-designed pipelines and do better on some tasks; configuration costs about 10 US dollars per case study, far below the cost of human engineers. Conclusion: LLMs have the potential to replace human engineers in end-to-end data integration, maintaining quality while substantially cutting cost and effort. Abstract: Designing data integration pipelines typically requires substantial manual effort from data engineers to configure pipeline components and label training data. While LLMs have shown promise in handling individual steps of the integration process, their potential to replace all human input across end-to-end data integration pipelines has not been investigated. As a step toward exploring this potential, we present an automatic data integration pipeline that uses GPT-5.2 to generate all artifacts required to adapt the pipeline to specific use cases. These artifacts are schema mappings, value mappings for data normalization, training data for entity matching, and validation data for selecting conflict resolution heuristics in data fusion. We compare the performance of this LLM-based pipeline to the performance of human-designed pipelines along three case studies requiring the integration of video game, music, and company related data. Our experiments show that the LLM-based pipeline produces results similar to, and for some tasks better than, those of the human-designed pipelines. End-to-end, the human and the LLM pipelines produce integrated datasets of comparable size and density. Having the LLM configure the pipelines costs approximately \$10 per case study, which represents only a small fraction of the cost of having human data engineers perform the same tasks.

[57] End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering

Nhi Dang,Tung Le,Huy Tien Nguyen

Main category: cs.CL

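TL;DR: This paper proposes an end-to-end automatic evaluation framework that reduces manual effort in assessing the answer quality of retrieval-augmented generation (RAG) domain chatbots. The framework generates question-answer pairs from the knowledge base, uses LLMs to judge responses, and applies confidence-based filtering to flag uncertain cases; on a Vietnamese news dataset it agrees closely with human judgments while substantially lowering review overhead.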

Details

Motivation: Existing chatbot evaluation relies on manual review or static test sets, which is costly and scales poorly, and cannot support large-scale, multilingual, multi-domain deployment. Method: Builds a modular, language-agnostic automatic evaluator that (1) generates Q&A pairs from the knowledge base; (2) uses LLMs to score chatbot responses against reference answers; and (3) applies confidence-based filtering to identify high-uncertainty samples that need human review. Result: On a Vietnamese news dataset, the evaluator reaches high agreement with human judgments (e.g., a Kappa coefficient above 0.8) while cutting manual review effort by more than 70%; the system transfers across languages and domains. Conclusion: The work provides a practical, scalable evaluation approach with little reliance on manual effort, a key quality-assurance tool for deploying RAG systems reliably in real-world settings. Abstract: Large language models (LLMs) combined with retrieval augmented generation have enabled the deployment of domain-specific chatbots, but these systems remain prone to generating unsupported or incorrect answers. Reliable evaluation is therefore critical, yet manual review is costly and existing frameworks often depend on curated test sets and static metrics, limiting scalability. We propose an end-to-end automatic evaluator designed to substantially reduce human effort. Our system generates Q&A pairs directly from the underlying knowledge base, uses LLMs to judge chatbot responses against reference answers, and applies confidence-based filtering to highlight uncertain cases. Applied to a Vietnamese news dataset, the evaluator achieves high agreement with human judgments while significantly lowering review overhead. The framework is modular and language-agnostic, making it readily adaptable to diverse domains. This work introduces a practical, scalable solution for evaluating chatbots with minimal reliance on manual intervention.

[58] MUNIChus: Multilingual News Image Captioning Benchmark

Yuji Chen,Alistair Plum,Hansi Hettiarachchi,Diptesh Kanojia,Saroj Basnet,Marcos Zampieri,Tharindu Ranasinghe

Main category: cs.CL

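TL;DR: This paper presents MUNIChus, the first benchmark for multilingual news image captioning, covering 9 languages (including low-resource ones), and evaluates several state-of-the-art models, showing that the task remains challenging.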

Details

Motivation: Research on news image captioning focuses mainly on English; the lack of datasets in other languages, especially low-resource ones, limits model generalization and fair evaluation. Method: Builds MUNIChus, the first multilingual news image captioning benchmark with 9 languages (including low-resource languages such as Sinhala and Urdu), and systematically evaluates several state-of-the-art neural models on it. Result: Experiments show that multilingual news image captioning remains challenging; MUNIChus is publicly available, with over 20 models already benchmarked. Conclusion: MUNIChus fills the data gap in multilingual news image captioning and provides a foundation for future model development and evaluation. Abstract: The goal of news image captioning is to generate captions by integrating news article content with corresponding images, highlighting the relationship between textual context and visual elements. The majority of research on news image captioning focuses on English, primarily because datasets in other languages are scarce. To address this limitation, we create the first multilingual news image captioning benchmark, MUNIChus, comprising 9 languages, including several low-resource languages such as Sinhala and Urdu. We evaluate various state-of-the-art neural news image captioning models on MUNIChus and find that news image captioning remains challenging. We also make MUNIChus publicly available with over 20 models already benchmarked. MUNIChus opens new avenues for further advancements in developing and evaluating multilingual news image captioning models.

[59] Disentangling Similarity and Relatedness in Topic Models

Hanlin Xiao,Mauricio A. Álvarez,Rainer Breitling

Main category: cs.CL

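TL;DR: This paper evaluates topic models along two psycholinguistic dimensions (thematic relatedness and taxonomic similarity), builds a synthetic benchmark annotated by LLMs, and trains a neural scoring function, revealing that different topic model families capture different semantic structure and that these dimensions predict downstream task performance.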

Details

Motivation: Classical topic models such as LDA rely on word co-occurrence statistics, whereas PLM-augmented models anchor those statistics to pre-trained embedding spaces and so introduce a prior toward semantic similarity; disentangling thematic relatedness from taxonomic similarity in topic words calls for a new evaluation framework. Method: Constructs a large synthetic word-pair benchmark via LLM-based annotation, trains a neural scoring function to quantify the thematic relatedness and taxonomic similarity of topic words, and evaluates systematically across multiple corpora and topic model families. Result: Topic model families differ markedly in the semantic structure their topics capture; similarity and relatedness scores predict downstream performance depending on task requirements. Conclusion: Similarity and relatedness are essential axes for topic model evaluation, and the paper provides a reliable pipeline for characterizing them across model families and corpora. Abstract: The recent advancement of large language models has spurred a growing trend of integrating pre-trained language model (PLM) embeddings into topic models, fundamentally reshaping how topics capture semantic structure. Classical models such as Latent Dirichlet Allocation (LDA) derive topics from word co-occurrence statistics, whereas PLM-augmented models anchor these statistics to pre-trained embedding spaces, imposing a prior that also favours clustering of semantically similar words. This structural difference can be captured by the psycholinguistic dimensions of thematic relatedness and taxonomic similarity of the topic words. To disentangle these dimensions in topic models, we construct a large synthetic benchmark of word pairs using LLM-based annotation to train a neural scoring function. We apply this scorer to a comprehensive evaluation across multiple corpora and topic model families, revealing that different model families capture distinct semantic structure in their topics. We further demonstrate that similarity and relatedness scores successfully predict downstream task performance depending on task requirements. This paper establishes similarity and relatedness as essential axes for topic model evaluation and provides a reliable pipeline for characterising these across model families and corpora.

[60] Making Bielik LLM Reason (Better): A Field Report

Adam Trybus,Bartosz Bartnicki,Remigiusz Kinas

Main category: cs.CL

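TL;DR: This paper describes a research program to evaluate and improve the reasoning capabilities of Bielik, a Polish large language model, covering benchmarking, evaluation methodology, comparison with other LLMs, and future directions.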

Details

Motivation: Improve the reasoning capabilities of the Polish LLM Bielik and keep it competitive in a fast-changing, competitive AI landscape. Method: A multi-stage effort comprising initial benchmarking, creation of an evaluation methodology, and comparative analysis against other LLMs. Result: Establishes an evaluation framework for Bielik and identifies limitations of the analyses so far, pointing to directions for improvement. Conclusion: The program lays the groundwork for continued improvement of Bielik's reasoning and clarifies the challenges and directions ahead. Abstract: This paper presents a research program dedicated to evaluating and advancing the reasoning capabilities of Bielik, a Polish large language model. The study describes a number of stages of work: initial benchmarking and creation of an evaluation methodology, analysis of comparative results with other LLMs, and an outline of future prospects that takes into account the limitations of the analyses conducted so far and aims to keep Bielik in the race given the ever-changing and competitive AI landscape.

[61] Prism-$Δ$: Differential Subspace Steering for Prompt Highlighting in Large Language Models

Yuyao Ge,Shenghua Liu,Yiwei Wang,Tianyu Liu,Baolong Bi,Lingrui Mei,Jiayu Yao,Jiafeng Guo,Xueqi Cheng

Main category: cs.CL

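TL;DR: This paper proposes PRISM-Δ, which extracts discriminative steering directions by decomposing the difference between positive and negative cross-covariance matrices, assigns each attention head a soft importance weight, and extends to Value representations, substantially improving prompt highlighting while reducing the fluency cost.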

Details

Motivation: Existing prompt highlighting methods struggle to isolate directions that distinguish relevant from irrelevant context and are easily confounded by structure shared by both. Method: PRISM-Δ is a projection-based discriminative steering method: it decomposes the difference between positive and negative cross-covariance matrices to remove shared directions and retain discriminative energy; assigns each attention head a continuous softplus importance weight; and extends to Value representations to exploit content-channel signal. Result: Across 4 benchmarks and 5 models (20 configurations), it matches or exceeds the best baseline on 19, with relative gains up to +10.6%; the fluency cost is halved; long-context retrieval improves by up to +4.8%; it is compatible with FlashAttention with negligible memory overhead. Conclusion: PRISM-Δ is an efficient, scalable, low-overhead steering framework for prompt highlighting that better balances performance and generation quality. Abstract: Prompt highlighting steers a large language model to prioritize user-specified text spans during generation. A key challenge is extracting steering directions that capture the difference between relevant and irrelevant contexts, rather than shared structural patterns common to both. We propose PRISM-$Δ$ (Projection-based Relevance-Informed Steering Method), which decomposes the difference between positive and negative cross-covariance matrices to maximize discriminative energy while eliminating shared directions. Each attention head receives a continuous softplus importance weight, letting weak-but-useful heads contribute at reduced strength. The framework extends naturally to Value representations, capturing content-channel signal that Key-only methods leave unused. Across four benchmarks and five models, PRISM-$Δ$ matches or exceeds the best existing method on 19 of 20 configurations, with relative gains up to +10.6%, while halving the fluency cost of steering. PRISM-$Δ$ also scales to long-context retrieval, outperforming the best existing method by up to +4.8% relative gain. PRISM-$Δ$ is compatible with FlashAttention and adds negligible memory overhead.

[62] HeartAgent: An Autonomous Agent System for Explainable Differential Diagnosis in Cardiology

Shuang Zhou,Kai Yu,Song Wang,Wenya Xie,Zaifu Zhan,Meng-Han Tsai,Yuen-Hei Chung,Shutong Hou,Huixue Zhou,Min Zeng,Bhavadharini Ramu,Lin Yee Chen,Feng Xie,Rui Zhang

Main category: cs.CL

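TL;DR: HeartAgent is a cardiology-specific AI agent system that substantially improves diagnostic accuracy and clinical explainability through multi-agent collaboration, customized tools, and traceable reasoning paths.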

Details

Motivation: Existing AI diagnostic methods in cardiology suffer from insufficient cardiology knowledge, inadequate support for complex reasoning, and poor interpretability; more reliable and explainable diagnostic aids are needed. Method: HeartAgent is a cardiology-specific multi-agent system that integrates customized tools and curated data resources; multiple specialized sub-agents collaborate on complex reasoning and produce transparent reasoning trajectories with verifiable supporting references. Result: On the MIMIC dataset and a private electronic health records cohort, HeartAgent improves top-3 diagnostic accuracy over comparison methods by more than 36% and 20%, respectively; clinicians assisted by HeartAgent gain 26.9% in diagnostic accuracy and 22.7% in explanatory quality. Conclusion: HeartAgent provides reliable, explainable, and clinically actionable decision support for cardiovascular care, advancing the use of AI in real clinical settings. Abstract: Heart diseases remain a leading cause of morbidity and mortality worldwide, necessitating accurate and trustworthy differential diagnosis. However, existing artificial intelligence-based diagnostic methods are often limited by insufficient cardiology knowledge, inadequate support for complex reasoning, and poor interpretability. Here we present HeartAgent, a cardiology-specific agent system designed to support reliable and explainable differential diagnosis. HeartAgent integrates customized tools and curated data resources and orchestrates multiple specialized sub-agents to perform complex reasoning while generating transparent reasoning trajectories and verifiable supporting references. Evaluated on the MIMIC dataset and a private electronic health records cohort, HeartAgent improved top-3 diagnostic accuracy over established comparative methods by more than 36% and 20%, respectively. Additionally, clinicians assisted by HeartAgent demonstrated gains of 26.9% in diagnostic accuracy and 22.7% in explanatory quality compared with unaided experts. These results demonstrate that HeartAgent provides reliable, explainable, and clinically actionable decision support for cardiovascular care.

[63] mAceReason-Math: A Dataset of High-Quality Multilingual Math Problems Ready For RLVR

Konstantin Dobler,Simon Lehnerer,Federico Scozzafava,Jonathan Janke,Mohamed Ali

Main category: cs.CL

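TL;DR: This paper introduces mAceReason-Math, a dataset of high-quality translations of challenging math problems built for multilingual Reinforcement Learning with Verifiable Rewards (RLVR), covering 14 languages with more than 10,000 samples each, to address the scarcity and low difficulty of existing multilingual RLVR training data.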

Details

Motivation: Current RLVR research and datasets are English-centric; existing multilingual data was not built with RLVR or current model capabilities in mind and is often too easy to provide a useful training signal. Method: Starting from AceReason-Math, an English math corpus curated specifically for RLVR, the authors produce high-quality translations and then clean and improve them, yielding mAceReason-Math with 14 languages and more than 10,000 samples per language. Result: Releases mAceReason-Math, covering 14 languages with more than 10,000 samples each, to support multilingual RLVR training and benchmarking. Conclusion: mAceReason-Math fills the gap in high-quality, challenging multilingual RLVR training data and should facilitate reinforcement learning research on multilingual math and logical reasoning. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has been successfully applied to significantly boost the capabilities of pretrained large language models, especially in the math and logic problem domains. However, current research and available training datasets remain English-centric. While multilingual training data and benchmarks have been created in the past, they were not created with RLVR and current model capability in mind, and their level of difficulty is often too low to provide appropriate training signals for current models. To address this gap, we provide mAceReason-Math, a dataset of high-quality translations of challenging math problems sourced from a corpus specifically curated for RLVR (AceReason-Math). We further take specific care to clean and improve our translations, resulting in a coverage of 14 languages with more than 10,000 samples per language. We release the dataset to facilitate multilingual RLVR research and benchmarking in the research community.

[64] Word Recovery in Large Language Models Enables Character-Level Tokenization Robustness

Zhipeng Yang,Shu Yang,Lijie Hu,Di Wang

Main category: cs.CL

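TL;DR: This paper studies why large language models (LLMs) stay robust under non-canonical tokenization such as character-level input, and identifies and validates a core process called "word recovery": hidden states reconstruct canonical word-level token identities from character-level inputs, relying on attention among characters of the same word in early layers.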

Details

Motivation: LLMs are surprisingly robust to non-canonical tokenization such as character-level tokenization, but the underlying mechanism is unclear. Method: Combines mechanistic interpretability techniques: (1) a decoding-based method to detect word recovery; (2) subspace interventions for causal evidence; (3) fine-grained attention analysis, in particular masking attention among characters of the same word. Result: Hidden states reconstruct canonical word-level token identities; removing the corresponding subspace consistently degrades downstream task performance; in-group attention among characters of the same word in early layers is critical, and masking it substantially reduces both recovery scores and task performance. Conclusion: Word recovery is a key mechanism by which LLMs process character-level inputs and gives a mechanistic explanation for tokenization robustness. Abstract: Large language models (LLMs) trained with canonical tokenization exhibit surprising robustness to non-canonical inputs such as character-level tokenization, yet the mechanisms underlying this robustness remain unclear. We study this phenomenon through mechanistic interpretability and identify a core process we term word recovery. We first introduce a decoding-based method to detect word recovery, showing that hidden states reconstruct canonical word-level token identities from character-level inputs. We then provide causal evidence by removing the corresponding subspace from hidden states, which consistently degrades downstream task performance. Finally, we conduct a fine-grained attention analysis and show that in-group attention among characters belonging to the same canonical token is critical for word recovery: masking such attention in early layers substantially reduces both recovery scores and task performance. Together, our findings provide a mechanistic explanation for tokenization robustness and identify word recovery as a key mechanism enabling LLMs to process character-level inputs.

[65] Large Language Models as Annotators for Machine Translation Quality Estimation

Sidi Wang,Sophie Arnoult,Amir Kamran

Main category: cs.CL

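TL;DR: This paper proposes using large language models (LLMs) to generate MQM-style annotations for training COMET models for Machine Translation Quality Estimation (MTQE); with a simplified MQM scheme and the PPbMQM prompt, the annotations correlate well with human annotations and yield competitive performance for Chinese-English and English-German.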

Details

Motivation: LLMs perform well on MTQE, but their high inference cost limits practical use; meanwhile, segment-level annotations give LLMs a strong rationale and are key to good segment-level QE. Method: Proposes a simplified MQM scheme (mostly restricted to top-level categories) and PPbMQM, a systematically developed prompt-pattern-based GPT-4o prompt, and uses the LLM to generate MQM-style annotations for training a COMET model. Result: The LLM-generated annotations correlate well with human annotations; COMET trained on them achieves competitive segment-level QE performance for Chinese-English and English-German. Conclusion: Generating segment-level MQM annotations with LLMs is feasible and effective, offering a route to low-cost, high-performing MTQE systems. Abstract: Large Language Models (LLMs) have demonstrated excellent performance on Machine Translation Quality Estimation (MTQE), yet their high inference costs make them impractical for direct application. In this work, we propose applying LLMs to generate MQM-style annotations for training a COMET model: following Fernandes et al. (2023), we reckon that segment-level annotations provide a strong rationale for LLMs and are key to good segment-level QE. We propose a simplified MQM scheme, mostly restricted to top-level categories, to guide LLM selection. We present a systematic approach for the development of a GPT-4o-based prompt, called PPbMQM (Prompt-Pattern-based-MQM). We show that the resulting annotations correlate well with human annotations and that training COMET on them leads to competitive performance on segment-level QE for Chinese-English and English-German.

[66] Interpretable Chinese Metaphor Identification via LLM-Assisted MIPVU Rule Script Generation: A Comparative Protocol Study

Weihang Huang,Mengna Liu

Main category: cs.CL

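TL;DR: This paper proposes an interpretable, LLM-assisted pipeline for Chinese metaphor identification that implements four protocols (MIP/MIPVU, CMDAG, emotion-based detection, simile identification) as executable, auditable rule scripts, enables cross-protocol comparison on seven Chinese metaphor datasets, and finds that protocol choice affects results more than model choice while the pipeline stays fully transparent and reproducible.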

Details

Motivation: 现有计算隐喻识别方法多为黑箱分类器，缺乏可解释性；中文因缺乏形态线索、隐喻传统丰富且标注资源稀缺，该问题尤为突出。 Method: 构建LLM辅助的模块化流水线，将四种隐喻识别协议（MIP/MIPVU词法分析、CMDAG概念映射标注、情感驱动检测、明喻导向识别）编译为确定性规则脚本，每步可插拔LLM调用，输出结构化推理依据。 Result: 在七种中文隐喻数据集（词级、句级、片段级）上完成首个跨协议对比：MIP协议词级F1达0.472；协议间一致性差异极大（A-D kappa=0.001，B-C kappa=0.986）；所有协议100%可复现，推理正确率0.40–0.87，可编辑性0.80–1.00；主要错误源于概念域错配与语域敏感性。 Conclusion: 隐喻识别中协议选择是最大变异源，远超模型差异；基于规则脚本的架构可在保持完全透明和可审计前提下达到有竞争力的性能。 Abstract: Metaphor identification is a foundational task in figurative language processing, yet most computational approaches operate as opaque classifiers offering no insight into why an expression is judged metaphorical. This interpretability gap is especially acute for Chinese, where rich figurative traditions, absent morphological cues, and limited annotated resources compound the challenge. We present an LLM-assisted pipeline that operationalises four metaphor identification protocols--MIP/MIPVU lexical analysis, CMDAG conceptual-mapping annotation, emotion-based detection, and simile-oriented identification--as executable, human-auditable rule scripts. Each protocol is a modular chain of deterministic steps interleaved with controlled LLM calls, producing structured rationales alongside every classification decision. We evaluate on seven Chinese metaphor datasets spanning token-, sentence-, and span-level annotation, establishing the first cross-protocol comparison for Chinese metaphor identification. Within-protocol evaluation shows Protocol A (MIP) achieves an F1 of 0.472 on token-level identification, while cross-protocol analysis reveals striking divergence: pairwise Cohen's kappa between Protocols A and D is merely 0.001, whereas Protocols B and C exhibit near-perfect agreement (kappa = 0.986). An interpretability audit shows all protocols achieve 100% deterministic reproducibility, with rationale correctness from 0.40 to 0.87 and editability from 0.80 to 1.00. 
Error analysis identifies conceptual-domain mismatch and register sensitivity as dominant failure modes. Our results demonstrate that protocol choice is the single largest source of variation in metaphor identification, exceeding model-level variation, and that rule-script architectures achieve competitive performance while maintaining full transparency.
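
The cross-protocol figures above are pairwise Cohen's kappa values. As a rough illustration of how such a number is obtained, the sketch below computes kappa for two label sequences. The function name `cohen_kappa` and the toy decisions are assumptions made for this example; they are not taken from the paper's code.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items labelled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each sequence's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # convention: both sequences use one identical label
    return (p_o - p_e) / (1.0 - p_e)

# Toy metaphor (1) / literal (0) decisions from two hypothetical protocols.
protocol_x = [1, 0, 1, 1, 0, 0, 1, 0]
protocol_y = [1, 0, 1, 0, 0, 0, 1, 1]
print(cohen_kappa(protocol_x, protocol_y))
```

Because kappa subtracts chance agreement, two protocols that agree on most items can still score near zero when one of them labels almost everything the same way.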

[67] LuxBorrow: From Pompier to Pompjee, Tracing Borrowing in Luxembourgish

Nina Hosseini-Kivanani,Fred Philippy

Main category: cs.CL

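TL;DR: This paper presents LuxBorrow, a borrowing-first empirical analysis of Luxembourgish news from 1999 to 2025 that combines language identification with a borrowing-resolution pipeline. Luxembourgish remains the matrix language throughout, French is the main donor language, adaptations are mostly morphological and orthographic, and borrowing intensity increases over time.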

Details

Motivation: Prior work mostly reports macro-level code-mixing indices and lacks fine-grained quantitative analysis of borrowing types, adaptation mechanisms, and diachronic change; this paper aims to establish a borrowing-centric framework for evaluating language contact. Method: Build a pipeline that combines sentence-level language identification (LU/DE/FR/EN) with a token-level borrowing resolver restricted to Luxembourgish sentences, integrating lemmatization, a collected loanword registry, and morphological and orthographic rules; compute the code-mixing index (CMI) alongside borrowing-centric metrics such as borrowed type rate, donor entropy, and assimilation ratio. Result: 77.1% of articles include at least one donor language and 65.4% use three or four; CMI is low overall (median rising from 3.90 to only 7.00), indicating localized insertions; of 25,444 token-level adaptations, 63.8% are morphological and 35.9% orthographic; French is by far the main donor, German grows modestly, and English is negligible; CMI rises from 6.1 (1999–2007) to a peak of 8.4 in 2020. Conclusion: Luxembourgish keeps a stable matrix-language role while borrowing practice continues to deepen; evaluation of language contact should shift to a borrowing-centric view with finer lexical, morphological, and diachronic metrics rather than relying only on document-level mixing indices. Abstract: We present LuxBorrow, a borrowing-first analysis of Luxembourgish (LU) news spanning 27 years (1999-2025), covering 259,305 RTL articles and 43.7M tokens. Our pipeline combines sentence-level language identification (LU/DE/FR/EN) with a token-level borrowing resolver restricted to LU sentences, using lemmatization, a collected loanword registry, and compiled morphological and orthographic rules. Empirically, LU remains the matrix language across all documents, while multilingual practice is pervasive: 77.1% of articles include at least one donor language and 65.4% use three or four. Breadth does not imply intensity: median code-mixing index (CMI) increases from 3.90 (LU+1) to only 7.00 (LU+3), indicating localized insertions rather than balanced bilingual text. Domain and period summaries show moderate but persistent mixing, with CMI rising from 6.1 (1999-2007) to a peak of 8.4 in 2020. Token-level adaptations total 25,444 instances and exhibit a mixed profile: morphological 63.8%, orthographic 35.9%, lexical 0.3%. The most frequent individual rules are orthographic, such as on->oun and eur->er, while morphology is collectively dominant. Diachronically, code-switching intensifies, and morphologically adapted borrowings grow from a small base. French overwhelmingly supplies adapted items, with modest growth for German and negligible English. We advocate borrowing-centric evaluation, including borrowed token and type rates, donor entropy over borrowed items, and assimilation ratios, rather than relying only on document-level mixing indices.
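
To make the document-level and borrowing-centric metrics more concrete, here is a minimal sketch of a code-mixing index and a donor entropy. It assumes the commonly used utterance-level definition CMI = 100 * (1 - max_i(w_i) / (n - u)), where n is the token count, u the number of language-neutral tokens, and w_i the token count of language i. The paper's exact variant is not given in the abstract, and the function names, the `"other"` tag, and the toy data are illustrative only.

```python
import math
from collections import Counter

def code_mixing_index(lang_tags, neutral="other"):
    """CMI = 100 * (1 - max_lang_count / (n - u)); 0.0 if no token carries a language tag."""
    n = len(lang_tags)
    u = sum(1 for t in lang_tags if t == neutral)
    if n == u:
        return 0.0
    counts = Counter(t for t in lang_tags if t != neutral)
    return 100.0 * (1.0 - max(counts.values()) / (n - u))

def donor_entropy(donor_tags):
    """Shannon entropy in bits of the donor-language distribution over borrowed items."""
    counts = Counter(donor_tags)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy document: 17 LU tokens, 2 FR, 1 DE (per-token tags are assumed to be given).
tags = ["LU"] * 17 + ["FR"] * 2 + ["DE"]
print(code_mixing_index(tags))
print(donor_entropy(["FR", "FR", "DE", "EN"]))
```

A document can mix several languages and still score low on CMI when one language supplies most tokens, which matches the abstract's point that breadth does not imply intensity.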

[68] Multilingual Reasoning Gym: Multilingual Scaling of Procedural Reasoning Environments

Konstantin Dobler,Simon Lehnerer,Federico Scozzafava,Jonathan Janke,Mohamed Ali

Main category: cs.CL

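TL;DR: This paper introduces the Multilingual Reasoning Gym, an extension of Reasoning Gym that procedurally generates verifiable reasoning problems in 14 languages, offering cross-lingually parallel data generation, adjustable difficulty, and virtually unlimited problem instances; the implementation is released.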

Details

Motivation: Existing reasoning benchmarks lack multilingual support and verifiability; the work aims to advance research on multilingual reasoning models. Method: Extend Reasoning Gym through template translation (validated by native speakers in 10 languages) and targeted code or template adaptations, enabling procedural generation of verifiable reasoning problems in 14 languages. Result: The Multilingual Reasoning Gym covers 14 languages and 94 tasks, supports large-scale cross-lingually parallel data generation, works in reinforcement learning and evaluation settings, and is released as open source. Conclusion: The Multilingual Reasoning Gym offers an efficient, controllable, and verifiable benchmark and toolkit for multilingual reasoning, extending the reach of procedural reasoning environments. Abstract: We present the Multilingual Reasoning Gym, an extension of Reasoning Gym (Stojanovski et al., 2025), that procedurally generates verifiable reasoning problems across 14 languages. We translate templates for 94 tasks with native-speaker validation in 10 languages and targeted code or template adaptations to ensure linguistic naturalness. The Multilingual Reasoning Gym preserves the core benefits of the procedural generation approach used in the original Reasoning Gym, such as virtually unlimited problem instance generation and adjustable difficulty, and remains directly usable for Reinforcement Learning from Verifiable Rewards and evaluation settings. Problems in the Multilingual Reasoning Gym are parallel across languages, enabling crosslingually parallel data generation at massive scale due to the procedural nature of the environments. We release our implementation to support research into multilingual reasoning models.
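
The idea of seed-driven, cross-lingually parallel problem generation with a programmatic verifier can be sketched as follows. This is not the project's API: the templates, the addition task, and the names `generate_addition` and `verify` are hypothetical and only illustrate the general pattern.

```python
import random

# Hypothetical templates; the real project's templates and task names are not shown here.
TEMPLATES = {
    "en": "What is {a} plus {b}?",
    "de": "Was ist {a} plus {b}?",
}

def generate_addition(seed, lang, max_operand=100):
    """Return (question, answer); the same seed yields the same problem in every language."""
    rng = random.Random(seed)
    a, b = rng.randint(0, max_operand), rng.randint(0, max_operand)
    return TEMPLATES[lang].format(a=a, b=b), a + b

def verify(candidate, answer):
    """Verifiable reward: 1.0 for an exact integer match, else 0.0."""
    try:
        return 1.0 if int(candidate.strip()) == answer else 0.0
    except ValueError:
        return 0.0

question_en, answer_en = generate_addition(seed=7, lang="en")
question_de, answer_de = generate_addition(seed=7, lang="de")
print(question_en, "|", question_de)
```

Sharing the seed across languages keeps the instances parallel, while `max_operand` stands in for an adjustable difficulty knob.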

[69] PivotAttack: Rethinking the Search Trajectory in Hard-Label Text Attacks via Pivot Words

Yuzhi Liang,Shiliang Xiao,Jingsong Wei,Qiliang Lin,Xia Li

Main category: cs.CL

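TL;DR: PivotAttack is a query-efficient hard-label text attack framework that uses a multi-armed bandit algorithm to identify key token groups (Pivot Sets) and perturbs them in a targeted way, improving both attack success rate and query efficiency.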

Details

Motivation: Existing hard-label text attacks usually follow inefficient "outside-in" strategies with large search spaces and high query costs. Method: Propose PivotAttack, an "inside-out" framework that uses a multi-armed bandit algorithm to identify combinatorial token groups acting as prediction anchors (Pivot Sets) and perturbs them to induce label flips. Result: Extensive experiments on traditional models and large language models show that PivotAttack consistently outperforms state-of-the-art baselines in attack success rate and query efficiency. Conclusion: By modeling inter-word dependencies and focusing on key token groups, PivotAttack achieves more efficient and more robust hard-label text attacks. Abstract: Existing hard-label text attacks often rely on inefficient "outside-in" strategies that traverse vast search spaces. We propose PivotAttack, a query-efficient "inside-out" framework. It employs a Multi-Armed Bandit algorithm to identify Pivot Sets (combinatorial token groups acting as prediction anchors) and strategically perturbs them to induce label flips. This approach captures inter-word dependencies and minimizes query costs. Extensive experiments across traditional models and Large Language Models demonstrate that PivotAttack consistently outperforms state-of-the-art baselines in both Attack Success Rate and query efficiency.
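
As a loose illustration of how a bandit can spend hard-label queries on promising token groups, the sketch below runs the generic UCB1 rule over all token pairs of a sentence. It is not the authors' algorithm: the toy classifier, the `[MASK]` perturbation, the reward definition (1 if the label flips), and all function names are assumptions for this example.

```python
import itertools
import math

def toy_classifier(tokens):
    # Hypothetical hard-label victim: only the predicted label is observable.
    return "pos" if "great" in tokens or "love" in tokens else "neg"

def ucb1_select(counts, rewards, t, c=math.sqrt(2)):
    """Pick the arm with the highest UCB1 score; unplayed arms go first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [rewards[a] / counts[a] + c * math.sqrt(math.log(t) / counts[a])
              for a in range(len(counts))]
    return max(range(len(counts)), key=scores.__getitem__)

def find_pivot_set(tokens, budget=60, group_size=2):
    """Return the token-index group whose masking most often flips the label."""
    arms = list(itertools.combinations(range(len(tokens)), group_size))
    original = toy_classifier(tokens)
    counts, rewards = [0] * len(arms), [0.0] * len(arms)
    for t in range(1, budget + 1):
        arm = ucb1_select(counts, rewards, t)
        perturbed = ["[MASK]" if i in arms[arm] else tok
                     for i, tok in enumerate(tokens)]
        counts[arm] += 1  # one query to the victim model
        rewards[arm] += 1.0 if toy_classifier(perturbed) != original else 0.0
    best = max(range(len(arms)),
               key=lambda a: rewards[a] / counts[a] if counts[a] else 0.0)
    return arms[best]

sentence = ["i", "love", "this", "great", "movie"]
print(find_pivot_set(sentence))
```

In this toy setting neither "love" nor "great" flips the label alone, so only the pair acts as an anchor, which is the kind of inter-word dependency a group-level search can capture.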

[70] SiDiaC-v.2.0: Sinhala Diachronic Corpus Version 2.0

Nevidu Jayatilleke,Nisansa de Silva,Uthpala Nimanthi,Gagani Kulathilaka,Azra Safrullah,Johan Sofalas

Main category: cs.CL

TL;DR: SiDiaC-v.2.0 是目前最大、最全面的锡兰语历时语料库，涵盖1800–1955年出版时间及5–20世纪书写时间，含185部文学作品（24.4万词），其中59篇（7万词）标注了写作年代，并按体裁分层分类；构建过程融合FarPaHC、CCOHA等语料实践，针对低资源语言特点优化OCR后处理与文本规范化。

Details

Motivation: Fill the lack of high-quality, large-scale, well-annotated corpora for diachronic NLP on Sinhala, a low-resource language, by continuing and substantially extending SiDiaC-v.1.0 to support tasks such as syntactic analysis, text normalisation, and historical language modelling. Method: Select texts held by the National Library of Sri Lanka from the SiDiaC-v.1.0 non-filtered list and digitise them with Google Document AI OCR; apply multi-stage post-processing (fixing formatting, handling code-mixing, adding special tokens, correcting malformed tokens); follow FarPaHC and CCOHA for text normalisation and syntactic annotation strategies; apply a two-layer genre classification (Non-Fiction/Fiction, plus Religious, History, Poetry, Language, Medical, and so on). Result: SiDiaC-v.2.0 is a diachronic corpus of 244k words across 185 works, with a 70k-word subset annotated by written date; copyright compliance checks, thorough cleaning, and structured categorisation were completed, yielding a standardised diachronic benchmark resource for Sinhala NLP. Conclusion: SiDiaC-v.2.0 overcomes low-resource constraints to become key infrastructure for Sinhala diachronic linguistics and NLP, and offers reusable methodology and a practical example for later model training, language-change analysis, and diachronic corpus building in other languages. Abstract: SiDiaC-v.2.0 is the largest comprehensive Sinhala Diachronic Corpus to date, covering a period from 1800 CE to 1955 CE in terms of publication dates, and a historical span from the 5th to the 20th century CE in terms of written dates. The corpus consists of 244k words across 185 literary works that underwent thorough filtering, preprocessing, and copyright compliance checks, followed by extensive post-processing. Additionally, a subset of 59 documents totalling 70k words was annotated based on their written dates. Texts from the National Library of Sri Lanka were selected from the SiDiaC-v.1.0 non-filtered list, which was digitised using Google Document AI OCR. This was followed by post-processing to correct formatting issues, address code-mixing, include special tokens, and fix malformed tokens. The construction of SiDiaC-v.2.0 was informed by practices from other corpora, such as FarPaHC, SiDiaC-v.1.0, and CCOHA. This was particularly relevant for syntactic annotation and text normalisation strategies, given the low-resource language status shared by Faroese and Sinhala, and the similar cleaning strategies utilised in CCOHA. This corpus is categorised into two layers based on genres: primary and secondary. The primary categorisation is binary, assigning each book to either Non-Fiction or Fiction. The secondary categorisation is more detailed, grouping texts under specific genres such as Religious, History, Poetry, Language, and Medical.
Despite facing challenges due to limited resources, SiDiaC-v.2.0 serves as a comprehensive resource for Sinhala NLP, building upon the work previously done in SiDiaC-v.1.0.

[71] An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?

Jennifer D'Souza,Sameer Sadruddin,Maximilian Kähler,Andrea Salfinger,Luca Zaccagna,Francesca Incitti,Lauro Snidaro,Osma Suominen

Main category: cs.CL

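TL;DR: This paper releases a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), together with a machine-actionable GND taxonomy. The resource supports ontology-aware multi-label classification, mapping text to authority terms, and assisted cataloging, and the authors argue that evaluation should weigh accuracy, usefulness, and transparency.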

Details

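Motivation: Subject indexing is hard to sustain at scale and across languages. Method: Build and release a bilingual English/German GND-annotated corpus and a machine-actionable GND taxonomy that support several tasks, and provide a statistical profile plus qualitative error analyses of three systems. Result: A large-scale bilingual GND-annotated resource with an accompanying taxonomy that supports research on authority-grounded, AI-assisted cataloging; the initial evaluation points to performance bottlenecks in existing systems and directions for improvement. Conclusion: Advance authority-anchored AI-assisted cataloging and move evaluation from accuracy alone toward usefulness, transparency, and human-AI collaboration. Abstract: Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers' work.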

Ayan Sengupta,Shantanu Dixit,Md Shad Akhtar,Tanmoy Chakraborty

Main category: cs.CL

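TL;DR: This paper proposes ARMADA, a framework for cross-modal knowledge distillation from vision-language models (including black-box models) to language-only models, without modifying the teacher or requiring expensive multimodal pre-training, with consistent gains across natural language understanding and generative reasoning tasks.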

Details

Motivation: Traditional knowledge distillation assumes the teacher and student share the same modality; existing cross-modal distillation requires modality-specific pre-training of the teacher, which is computationally infeasible in most cases. Method: Propose ARMADA, which uses novel alignment techniques to distil knowledge across modalities without altering the teacher model or requiring multimodal pre-training, and supports transfer from black-box vision-language models to language models. Result: Validated on 12 natural language understanding tasks, 8 complex generative reasoning tasks, and 5 instruction-tuning tasks; on large models such as DeBERTa-v2-1.4B, OPT-1.3B, and the LLaMA series, gains reach up to 3.4% (NLU) and 2.6% (generative reasoning). Conclusion: Even vision-language models that lack direct textual understanding can substantially improve language-only models when distilled appropriately, which challenges conventional knowledge distillation paradigms. Abstract: Knowledge distillation (KD) methods are pivotal in compressing large pre-trained language models into smaller models, ensuring computational efficiency without significantly dropping performance. Traditional KD techniques assume homogeneity in modalities between the teacher (source) and the student (target) models. On the other hand, existing multimodal knowledge distillation methods require modality-specific pre-training of the teacher model, which is computationally infeasible in most cases. In this paper, we introduce ARMADA, an efficient cross-modal knowledge distillation framework designed to transfer knowledge from large vision-language models, including black-box models, to language-only models. Unlike existing KD techniques that rely on the internal structures of multimodal teachers or require computationally expensive pre-training, ARMADA leverages novel alignment techniques to distil knowledge without altering the teacher model, ensuring efficiency and scalability. We empirically validate ARMADA on twelve natural language understanding, eight complex generative reasoning and five instruction-tuning tasks, demonstrating consistent performance improvements in large models such as DeBERTa-v2-1.4B, OPT-1.3B, LLaMA-{3B, 7B, 8B}. ARMADA achieves up to 3.4% improvement on language understanding tasks and 2.6% boost in generative reasoning, all without requiring expensive multimodal pre-training or fine-tuning of the teacher model.
Our findings challenge conventional knowledge distillation paradigms by demonstrating that even vision-language models, despite lacking direct textual understanding, can significantly enhance language models when distilled appropriately.

[73] GLM-OCR Technical Report

Shuaiqi Duan,Yadong Xue,Weihan Wang,Zhe Su,Huan Liu,Sheng Yang,Guobing Gan,Guo Wang,Zihan Wang,Shengdong Yan,Dexin Jin,Yuxuan Zhang,Guohong Wen,Yanfeng Wang,Yutao Zhang,Xiaohan Zhang,Wenyi Hong,Yukuo Cen,Da Yin,Bin Chen,Wenmeng Yu,Xiaotao Gu,Jie Tang

Main category: cs.CL

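TL;DR: GLM-OCR is an efficient, compact 0.9B-parameter multimodal model that combines a CogViT visual encoder with a GLM language decoder and introduces a Multi-Token Prediction (MTP) mechanism and a two-stage pipeline, balancing performance and efficiency on document understanding tasks.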

Details

Motivation: Address the inefficiency of standard autoregressive decoding in deterministic OCR tasks while balancing computational efficiency and recognition performance, to serve both edge deployment and large-scale production systems. Method: Combine a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder; introduce a Multi-Token Prediction (MTP) mechanism to raise decoding throughput; adopt a two-stage system-level pipeline in which PP-DocLayout-V3 performs layout analysis, followed by parallel region-level recognition. Result: On public benchmarks and industrial scenarios, the model reaches competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction, with low memory overhead and high throughput. Conclusion: With its compact architecture, structured generation, and efficient decoding, GLM-OCR is a practical document understanding model that balances accuracy, speed, and deployment flexibility. Abstract: GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance between computational efficiency and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.

[74] LLM2Vec-Gen: Generative Embeddings from Large Language Models

Parishad BehnamGhader,Vaibhav Adlakha,Fabian David Schmidt,Nicolas Chapados,Marius Mosbach,Siva Reddy

Main category: cs.CL

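TL;DR: This paper proposes LLM2Vec-Gen, a self-supervised text embedding method that adds trainable special tokens to a large language model (LLM) and optimizes them to represent the LLM's response to the input, yielding high-quality, safer, and interpretable text embeddings.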

Details

Motivation: LLM-based embedders mainly encode the semantics of the input, yet embedding tasks require mapping diverse inputs to similar outputs, which usually relies on labelled paired data and contrastive learning; this work aims to remove the dependence on labelled data and on fine-tuning the LLM backbone while transferring the LLM's safety alignment and reasoning abilities to embedding tasks. Method: Propose LLM2Vec-Gen: with the LLM backbone frozen, add trainable special tokens to its vocabulary, append them to the input, and optimize them to represent the LLM's response in a fixed-length sequence, guided by the LLM's own completion and by distillation targets from an unsupervised embedding teacher. Result: State-of-the-art self-supervised performance on MTEB, a 9.3% improvement over the best unsupervised embedding teacher; up to a 43.2% reduction in harmful content retrieval and a 29.3% improvement in reasoning capabilities; the learned embeddings can be decoded into text and are interpretable. Conclusion: LLM2Vec-Gen offers an efficient and safer embedding-learning paradigm that needs no labelled data and preserves LLM capabilities, improving embedding quality and practicality. Abstract: LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. Typically, this input-output gap is addressed by training embedding models with paired data using contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model's potential response. Specifically, we add trainable special tokens to the LLM's vocabulary, append them to input, and optimize them to represent the LLM's response in a fixed-length sequence. Training is guided by the LLM's own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe up to 43.2% reduction in harmful content retrieval and 29.3% improvement in reasoning capabilities for embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.

[75] Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge

Mingyang Song,Mao Zheng,Chenning Xu

Main category: cs.CL

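TL;DR: This paper challenges the core assumption of the LLM-as-a-judge paradigm that high inter-evaluator agreement indicates reliable and objective evaluation. It identifies the "Evaluation Illusion" phenomenon and proposes MERG, a framework that dynamically generates evaluation rubrics grounded in domain knowledge, which raises agreement in codified domains and yields more meaningful assessment.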

Details

Motivation: Question the implicit assumption in LLM-as-a-judge that high agreement equals high reliability, and examine whether it masks surface-level consensus rather than agreement in substantive judgment. Method: Quantify the gap between model-level and sample-level agreement in a large-scale empirical study (105,600 evaluation instances) and formalize the "Evaluation Illusion"; propose MERG, a metacognitively enhanced, knowledge-driven framework for dynamic rubric generation, and test its effect on agreement across domains. Result: High model-level agreement (ρ = 0.99) masks fragile sample-level agreement (ICC = 0.67); merely sharing rubric structure restores 62% of total agreement; high-quality outputs receive the least consistent evaluations; MERG raises agreement in codified domains (Education +22%, Academic +27%) and lowers it in subjective domains, where genuine evaluative pluralism emerges. Conclusion: Agreement in LLM evaluation is often a surface illusion; rubrics should be dynamically enriched with domain knowledge rather than relying on generic criteria, with implications for reward modeling in RLAIF. Abstract: The paradigm of LLM-as-a-judge relies on a critical assumption, namely that high inter-evaluator agreement indicates reliable and objective evaluation. We present two complementary findings that challenge this assumption. First, we demonstrate that this consensus is frequently illusory. We identify and formalize Evaluation Illusion, a phenomenon where LLM judges generate sophisticated critiques yet anchor scores on shared surface heuristics rather than substantive quality. Through a large-scale study of 105,600 evaluation instances (32 LLMs × 3 frontier judges × 100 tasks × 11 temperatures), we show that model-level agreement (Spearman ρ = 0.99) masks fragile sample-level agreement (Pearson r̄ = 0.72; absolute agreement ICC = 0.67), that merely sharing rubric structure restores 62% of total agreement, and that high-quality outputs paradoxically receive the least consistent evaluations. Second, we demonstrate that dynamically generating evaluation rubrics grounded in domain knowledge produces more meaningful assessment. We introduce MERG (Metacognitive Enhanced Rubric Generation), a knowledge-driven rubric generation framework whose domain-selective effects confirm this. Agreement increases in codified domains (Education +22%, Academic +27%) where knowledge anchors evaluators on shared standards, while it decreases in subjective domains where genuine evaluative pluralism emerges.
These findings suggest that evaluation rubrics should be dynamically enriched with expert knowledge rather than relying on generic criteria, with implications for reward modeling in RLAIF.
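
The contrast between model-level and sample-level agreement can be illustrated with a small, self-contained sketch. It assumes model-level agreement is a Spearman correlation of per-model mean scores and sample-level agreement is a Pearson correlation of per-sample scores averaged over models; the paper's exact estimators (including its ICC) may differ, and the function names and toy scores are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks, with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def agreement_summary(judge_a, judge_b):
    """judge_x maps model name -> per-sample scores, in the same sample order."""
    models = sorted(judge_a)
    mean = lambda xs: sum(xs) / len(xs)
    rho = spearman([mean(judge_a[m]) for m in models],
                   [mean(judge_b[m]) for m in models])
    r_bar = mean([pearson(judge_a[m], judge_b[m]) for m in models])
    return rho, r_bar

# Toy scores: both judges rank the models identically but disagree on samples.
judge_a = {"m1": [1, 3, 2, 2], "m2": [4, 6, 5, 5], "m3": [7, 9, 8, 8]}
judge_b = {"m1": [3, 1, 2, 2], "m2": [6, 4, 5, 5], "m3": [9, 7, 8, 8]}
rho, r_bar = agreement_summary(judge_a, judge_b)
print(rho, r_bar)
```

In the toy data the two judges produce the same model ranking while their within-model sample scores move in opposite directions, an extreme version of the gap the abstract describes.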

[76] Instruction set for the representation of graphs

Ezequiel Lopez-Rubio,Mario Pascual-Gonzalez

Main category: cs.CL

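TL;DR: IsalGraph encodes the structure of any finite simple graph as a string over a nine-character instruction alphabet. Every string decodes to a valid graph, encoding runs in polynomial time, a canonical string can be generated, and string distance correlates strongly with graph edit distance, making the encoding suitable for graph similarity search, graph generation, and graph-conditioned language modelling.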

Details

Motivation: Provide a compact, isomorphism-invariant, language-model-compatible sequential representation of graph structure to support graph similarity computation, generation, and language modelling. Method: Design a small virtual machine comprising a sparse graph, a circular doubly-linked list (CDLL), and two traversal pointers; instructions from a nine-character alphabet either insert a node or edge or move a pointer through the CDLL; a greedy GraphToString algorithm encodes graphs in polynomial time, and a backtracking variant produces a canonical string. Result: On five real-world graph datasets, the Levenshtein distance between IsalGraph strings correlates strongly with graph edit distance (GED). Conclusion: IsalGraph provides a compact, isomorphism-invariant, language-model-friendly sequential encoding of graph structure with practical applications. Abstract: We present IsalGraph, a method for representing the structure of any finite, simple graph as a compact string over a nine-character instruction alphabet. The encoding is executed by a small virtual machine comprising a sparse graph, a circular doubly-linked list (CDLL) of graph-node references, and two traversal pointers. Instructions either move a pointer through the CDLL or insert a node or edge into the graph. A key design property is that every string over the alphabet decodes to a valid graph, with no invalid states reachable. A greedy GraphToString algorithm encodes any connected graph into a string in time polynomial in the number of nodes; an exhaustive-backtracking variant produces a canonical string by selecting the lexicographically smallest shortest string across all starting nodes and all valid traversal orders. We evaluate the representation on five real-world graph benchmark datasets (IAM Letter LOW/MED/HIGH, LINUX, and AIDS) and show that the Levenshtein distance between IsalGraph strings correlates strongly with graph edit distance (GED). Together, these properties make IsalGraph strings a compact, isomorphism-invariant, and language-model-compatible sequential encoding of graph structure, with direct applications in graph similarity search, graph generation, and graph-conditioned language modelling.
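
The evaluation compares graphs through the Levenshtein distance of their string encodings. The sketch below is a standard unit-cost implementation; the sample strings are placeholders, since the actual nine-character IsalGraph alphabet is not given in the abstract.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]

# Placeholder encodings; real IsalGraph strings use the paper's own alphabet.
print(levenshtein("ABCAB", "ABDAB"))
```

The row-by-row formulation keeps memory linear in the length of the second string, which helps when many pairwise distances are computed for similarity search.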

cs.CV

[77] 4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video

Jin Lyu,Liang An,Pujin Cheng,Yebin Liu,Xiaoying Tang

Main category: cs.CV

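TL;DR: This paper proposes 4DEquine, a framework that disentangles dynamic motion reconstruction from static appearance reconstruction to enable efficient 4D reconstruction of equines from monocular video.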

Details

Motivation: Mainstream 4D animal reconstruction methods jointly optimize motion and appearance over a whole video, which is time-consuming and sensitive to incomplete observation, and falls short of practical needs such as equine welfare monitoring. Method: Split 4D reconstruction into dynamic motion reconstruction (a spatio-temporal transformer with a post-optimization stage) and static appearance reconstruction (a feed-forward network that builds a 3D Gaussian avatar from as few as a single image); create the synthetic datasets VarenPoser (motion) and VarenTex (appearance) to assist training. Result: Trained only on synthetic data, 4DEquine achieves state-of-the-art performance on the real-world APT36K and AiM datasets for both geometry and appearance reconstruction; ablation studies validate each module. Conclusion: The disentangling strategy, efficient network design, and high-quality synthetic datasets together improve the accuracy, robustness, and practicality of 4D equine reconstruction from monocular video. Abstract: 4D reconstruction of the equine family (e.g. horses) from monocular video is important for animal welfare. Previous mainstream 4D animal reconstruction methods require joint optimization of motion and appearance over a whole video, which is time-consuming and sensitive to incomplete observation. In this work, we propose a novel framework called 4DEquine by disentangling the 4D reconstruction problem into two sub-problems: dynamic motion reconstruction and static appearance reconstruction. For motion, we introduce a simple yet effective spatio-temporal transformer with a post-optimization stage to regress smooth and pixel-aligned pose and shape sequences from video. For appearance, we design a novel feed-forward network that reconstructs a high-fidelity, animatable 3D Gaussian avatar from as few as a single image. To assist training, we create a large-scale synthetic motion dataset, VarenPoser, which features high-quality surface motions and diverse camera trajectories, as well as a synthetic appearance dataset, VarenTex, comprising realistic multi-view images generated through multi-view diffusion. While training only on synthetic datasets, 4DEquine achieves state-of-the-art performance on real-world APT36K and AiM datasets, demonstrating the superiority of 4DEquine and our new datasets for both geometry and appearance reconstruction. Comprehensive ablation studies validate the effectiveness of both the motion and appearance reconstruction networks. Project page: https://luoxue-star.github.io/4DEquine_Project_Page/.

[78] HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation

Daichao Zhao,Qiupu Chen,Feng He,Xin Ning,Qiankun Li

Main category: cs.CV

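TL;DR: This paper proposes HG-Lane, a framework for high-fidelity generation of lane scene images under adverse weather and lighting conditions without re-annotation, and builds a benchmark of 30,000 images; experiments show that it substantially improves existing lane detection models (such as CLRNet) under adverse conditions.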

Details

Motivation: Existing lane detection datasets (such as CULane and TuSimple) contain too few samples under extreme weather such as rain, snow, and fog and under low-light conditions such as night, so models are unreliable in complex real-world environments and pose safety risks. Method: Propose HG-Lane, a high-fidelity generation framework for lane scenes under adverse weather and lighting conditions that requires no re-annotation, and use it to construct a benchmark of 30,000 images. Result: On the HG-Lane benchmark, CLRNet's overall mF1 increases by 20.87 percent; F1@50 increases by 19.75 (overall), 8.63 (normal), 38.8 (snow), 14.96 (rain), 26.84 (fog), 21.5 (night), and 12.04 (dusk) percent. Conclusion: HG-Lane eases the scarcity of lane detection data under adverse conditions, improves model robustness and generalization, and provides a more reliable data foundation for safe autonomous driving. Abstract: Lane detection is a crucial task in autonomous driving, as it helps ensure the safe operation of vehicles. However, existing datasets such as CULane and TuSimple contain relatively limited data under extreme weather conditions, including rain, snow, and fog. As a result, detection models trained on these datasets often become unreliable in such environments, which may lead to serious safety-critical failures on the road. To address this issue, we propose HG-Lane, a High-fidelity Generation framework for Lane Scenes under adverse weather and lighting conditions without requiring re-annotation. Based on this framework, we further construct a benchmark that includes adverse weather and lighting scenarios, containing 30,000 images. Experimental results demonstrate that our method consistently and significantly improves the performance of existing lane detection networks. For example, using the state-of-the-art CLRNet, the overall mF1 score on our benchmark increases by 20.87 percent. The F1@50 score for the overall, normal, snow, rain, fog, night, and dusk categories increases by 19.75 percent, 8.63 percent, 38.8 percent, 14.96 percent, 26.84 percent, 21.5 percent, and 12.04 percent, respectively. The code and dataset are available at: https://github.com/zdc233/HG-Lane.

[79] Unbalanced Optimal Transport Dictionary Learning for Unsupervised Hyperspectral Image Clustering

Joshua Lentz,Nicholas Karris,Alex Cloninger,James M. Murphy

Main category: cs.CV

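TL;DR: This paper proposes a dictionary learning method based on unbalanced Wasserstein barycenters for unsupervised clustering of hyperspectral images, addressing the need to balance spectral profiles, class blurring, and limited robustness to noise in existing Wasserstein-space dictionary learning.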

Details

Motivation: Existing unsupervised hyperspectral clustering based on dictionary learning in Wasserstein space requires balancing the spectral profiles, which blurs classes and reduces robustness to noise and outliers. Method: Learn a dictionary with unbalanced Wasserstein barycenters to obtain a lower-dimensional representation of the data, then apply spectral clustering to that representation to learn labels without supervision. Result: Spectral clustering on the learned representation yields an effective approach for unsupervised label learning and image segmentation, intended to keep spectral classes distinct and to be more robust to noise and outliers. Conclusion: Unbalanced Wasserstein barycenters provide a more robust and discriminative low-dimensional representation learning framework for unsupervised hyperspectral image clustering. Abstract: Hyperspectral images capture vast amounts of high-dimensional spectral information about a scene, making labeling an intensive task that is resistant to out-of-the-box statistical methods. Unsupervised learning of clusters allows for automated segmentation of the scene, enabling a more rapid understanding of the image. Partitioning the spectral information contained within the data via dictionary learning in Wasserstein space has proven an effective method for unsupervised clustering. However, this approach requires balancing the spectral profiles of the data, blurring the classes, and sacrificing robustness to outliers and noise. In this paper, we suggest improving this approach by utilizing unbalanced Wasserstein barycenters to learn a lower-dimensional representation of the underlying data. The deployment of spectral clustering on the learned representation results in an effective approach for the unsupervised learning of labels.

[80] Video-Based Reward Modeling for Computer-Use Agents

Linxin Song,Jieyu Zhang,Huanxin Sheng,Taiwei Shi,Gupta Rahul,Yang Liu,Ranjay Krishna,Jian Kang,Jieyu Zhao

Main category: cs.CV

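TL;DR: This paper proposes reward modeling from execution video (ExeVRM) to assess whether a computer-use agent (CUA) has fulfilled a user instruction. It builds ExeVR-53k, a dataset of 53k samples, introduces adversarial instruction translation and spatiotemporal token pruning, and outperforms strong baselines such as GPT-5.2 and Gemini-3 Pro on tasks across multiple platforms.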

Details

Motivation: CUA evaluation is hard to scale, in particular judging whether an agent trajectory truly fulfills the user instruction; evaluation that depends on the agent's internal reasoning or actions lacks generality and scalability. Method: Build the ExeVR-53k dataset; propose adversarial instruction translation to synthesize negative samples with step-level annotations; design spatiotemporal token pruning to handle long, high-resolution execution videos efficiently; fine-tune an Execution Video Reward Model (ExeVRM). Result: ExeVRM 8B achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming GPT-5.2 and Gemini-3 Pro across Ubuntu, macOS, Windows, and Android, with more precise temporal attribution. Conclusion: Video-execution reward modeling is a scalable, model-agnostic evaluation paradigm for CUAs and opens a new path for automated agent evaluation. Abstract: Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independent of the agent's internal reasoning or actions. Although video-execution modeling is method-agnostic, it presents key challenges, including highly redundant layouts and subtle, localized cues that determine success. We introduce Execution Video Reward 53k (ExeVR-53k), a dataset of 53k high-quality video-task-reward triplets. We further propose adversarial instruction translation to synthesize negative samples with step-level annotations. To enable learning from long, high-resolution execution videos, we design spatiotemporal token pruning, which removes homogeneous regions and persistent tokens while preserving decisive UI changes. Building on these components, we fine-tune an Execution Video Reward Model (ExeVRM) that takes only a user instruction and a video-execution sequence to predict task success. Our ExeVRM 8B achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming strong proprietary models such as GPT-5.2 and Gemini-3 Pro across Ubuntu, macOS, Windows, and Android, while providing more precise temporal attribution. These results show that video-execution reward modeling can serve as a scalable, model-agnostic evaluator for CUAs.
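
One way to picture spatiotemporal token pruning is a filter that drops patches with almost no internal variation (homogeneous regions) and patches that are unchanged from the previous frame (persistent tokens). The sketch below is only a guess at the general idea under those two assumptions; the thresholds, the patch representation, and the name `prune_tokens` are hypothetical and do not come from the paper.

```python
def prune_tokens(frames, var_threshold=1e-3, diff_threshold=1e-3):
    """Return (frame_idx, patch_idx) pairs to keep.

    frames: list of frames; each frame is a list of patches;
    each patch is a list of pixel intensities in [0, 1].
    """
    kept = []
    for f, frame in enumerate(frames):
        for p, patch in enumerate(frame):
            mean = sum(patch) / len(patch)
            variance = sum((v - mean) ** 2 for v in patch) / len(patch)
            if variance < var_threshold:
                continue  # homogeneous region, e.g. blank background
            if f > 0:
                previous = frames[f - 1][p]
                change = max(abs(a - b) for a, b in zip(patch, previous))
                if change < diff_threshold:
                    continue  # persistent token, unchanged since last frame
            kept.append((f, p))
    return kept

# Three toy frames with two patches each: patch 0 is blank, patch 1 changes in frame 2.
toy_frames = [
    [[0.0, 0.0, 0.0, 0.0], [0.1, 0.9, 0.2, 0.8]],
    [[0.0, 0.0, 0.0, 0.0], [0.1, 0.9, 0.2, 0.8]],
    [[0.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.8, 0.2]],
]
print(prune_tokens(toy_frames))
```

Only the first appearance of the textured patch and its later change survive, so the token count tracks UI changes rather than video length.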

[81] Delta-K: Boosting Multi-Instance Generation via Cross-Attention Augmentation

Zitong Wang,Zijun Shen,Haohao Xu,Zhengjie Luo,Weibin Wu

Main category: cs.CV

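TL;DR: This paper proposes Delta-K, a framework that injects ΔK, a differential signal encoding the semantics of missing concepts, into the cross-attention Key space of diffusion models to reduce concept omission in complex multi-instance scenes, without training, masks, or model modifications.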

Details

Motivation: Diffusion models tend to omit concepts when generating complex multi-instance images from text; existing training-free methods only rescale attention maps and cannot establish coherent semantic representations. Method: Propose Delta-K: use a vision-language model to extract a differential key (ΔK) for the missing concepts, inject it into the shared cross-attention Key space during the early semantic planning stage of diffusion, and apply a dynamically optimized scheduling mechanism that grounds diffuse noise into stable structural anchors. Result: Delta-K consistently improves compositional alignment on both DiT and U-Net architectures without spatial masks, additional training, or architectural changes. Conclusion: Intervening at the semantic level in Key space is more effective than manipulating attention maps; Delta-K is a plug-and-play, backbone-agnostic, training-free inference enhancement. Abstract: While Diffusion Models excel in text-to-image synthesis, they often suffer from concept omission when synthesizing complex multi-instance scenes. Existing training-free methods attempt to resolve this by rescaling attention maps, which merely exacerbates unstructured noise without establishing coherent semantic representations. To address this, we propose Delta-K, a backbone-agnostic and plug-and-play inference framework that tackles omission by operating directly in the shared cross-attention Key space. Specifically, using a vision-language model, we extract a differential key ΔK that encodes the semantic signature of missing concepts. This signal is then injected during the early semantic planning stage of the diffusion process. Governed by a dynamically optimized scheduling mechanism, Delta-K grounds diffuse noise into stable structural anchors while preserving existing concepts. Extensive experiments demonstrate the generality of our approach: Delta-K consistently improves compositional alignment across both modern DiT models and classical U-Net architectures, without requiring spatial masks, additional training, or architectural modifications.

[82] FusionNet: a frame interpolation network for 4D heart models

Chujie Chang,Shoko Miyauchi,Ken'ichi Morooka,Ryo Kurazume,Oscar Martinez Mozos

Main category: cs.CV

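TL;DR: This paper proposes FusionNet, a neural network that reconstructs four-dimensional (4D) cardiac motion with high temporal resolution from cardiac magnetic resonance (CMR) images captured in a short time by estimating intermediate 3D heart shapes from adjacent shapes; it achieves a Dice coefficient above 0.897 and outperforms existing methods.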

Details

Motivation: Standard CMR exams are long (40-60 min) and uncomfortable for patients; shortening the scan lowers temporal or spatial resolution and thus diagnostic accuracy; this work focuses on the loss of temporal resolution caused by shorter scans. Method: Propose FusionNet, a neural network that estimates the 3D heart shape at intermediate time points from the shapes in adjacent frames, reconstructing high-temporal-resolution 4D cardiac motion from low-temporal-resolution CMR data. Result: FusionNet achieves a Dice coefficient above 0.897 and recovers shapes more precisely than existing methods. Conclusion: FusionNet can raise the temporal resolution of CMR acquired with short scan times, improving cardiac motion modelling and its diagnostic potential. Abstract: Cardiac magnetic resonance (CMR) imaging is widely used to visualise cardiac motion and diagnose heart disease. However, standard CMR imaging requires patients to lie still in a confined space inside a loud machine for 40-60 min, which increases patient discomfort. In addition, shorter scan times decrease either or both the temporal and spatial resolutions of cardiac motion, and thus, the diagnostic accuracy of the procedure. Of these, we focus on reduced temporal resolution and propose a neural network called FusionNet to obtain four-dimensional (4D) cardiac motion with high temporal resolution from CMR images captured in a short period of time. The model estimates intermediate 3D heart shapes based on adjacent shapes. The results of an experimental evaluation of the proposed FusionNet model showed that it achieved a performance of over 0.897 in terms of the Dice coefficient, confirming that it can recover shapes more precisely than existing methods. The code is available at: https://github.com/smiyauchi199/FusionNet.git
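
The Dice coefficient used to score the reconstructed shapes is 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. A minimal sketch follows, representing each mask as a set of voxel indices; the function name and toy voxels are illustrative and not taken from the FusionNet repository.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two collections of voxel indices."""
    pred, target = set(pred), set(target)
    if not pred and not target:
        return 1.0  # convention: two empty masks agree perfectly
    return 2.0 * len(pred & target) / (len(pred) + len(target))

# Toy 3D masks given as (x, y, z) voxel coordinates.
predicted = {(0, 0, 0), (0, 0, 1), (0, 1, 1)}
reference = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}
print(dice_coefficient(predicted, reference))
```

A score of 1.0 means the two masks coincide exactly and 0.0 means they share no voxels.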

[83] An Automated Radiomics Framework for Postoperative Survival Prediction in Colorectal Liver Metastases using Preoperative MRI

Muhammad Alberb,Jianan Chen,Hossam El-rewaidy,Paul Karanicolas,Arun Seth,Yutaka Amemiya,Anne Martel,Helen Cheung

Main category: cs.CV

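TL;DR: This study presents a fully automated AI-based framework that predicts survival after hepatectomy in patients with colorectal liver metastases (CRLM) from preoperative pre- and post-contrast MRI, combining anatomy-aware segmentation with radiomics analysis and reaching a survival prediction C-index of 0.69.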

Details

Motivation: Postoperative survival of CRLM patients varies widely, so accurate prediction is needed to avoid non-beneficial surgery and guide personalized therapy. Method: Build an end-to-end AI framework consisting of anatomy-aware segmentation (using SAMONAI, a prompt propagation algorithm that extends Segment Anything Model to 3D point-based segmentation) and radiomics (the SurvAMINN model), trained on partially annotated data and evaluated on retrospective data from 227 patients. Result: Median Dice scores of 0.96 (liver) and 0.93 (spleen), a CRLM segmentation Dice of 0.78, and a detection F1-score of 0.79; survival prediction reaches a C-index of 0.69, reported to outperform established methods and biomarkers. Conclusion: An AI framework that integrates advanced segmentation with radiomics can deliver accurate, automated prediction of postoperative CRLM survival and has potential for clinical translation. Abstract: While colorectal liver metastasis (CRLM) is potentially curable via hepatectomy, patient outcomes remain highly heterogeneous. Postoperative survival prediction is necessary to avoid non-beneficial surgeries and guide personalized therapy. In this study, we present an automated AI-based framework for postoperative CRLM survival prediction using pre- and post-contrast MRI. We performed a retrospective study of 227 CRLM patients who had gadoxetate-enhanced MRI prior to curative-intent hepatectomy between 2013 and 2020. We developed a survival prediction framework comprising an anatomy-aware segmentation pipeline followed by a radiomics pipeline. The segmentation pipeline learns liver, CRLMs, and spleen segmentation from partially-annotated data, leveraging promptable foundation models to generate pseudo-labels. To support this pipeline, we propose SAMONAI, a prompt propagation algorithm that extends Segment Anything Model to 3D point-based segmentation. Predicted pre- and post-contrast segmentations are then fed into our radiomics pipeline, which extracts per-tumor features and predicts survival using SurvAMINN, an autoencoder-based multiple instance neural network for time-to-event survival prediction. SurvAMINN jointly learns dimensionality reduction and survival prediction from right-censored data, emphasizing high-risk metastases. We compared our framework against established methods and biomarkers using univariate and multivariate Cox regression. Our segmentation pipeline achieves median Dice scores of 0.96 (liver) and 0.93 (spleen), driving a CRLM segmentation Dice score of 0.78 and a detection F1-score of 0.79.
Accurate segmentation enables our radiomics pipeline to achieve a survival prediction C-index of 0.69. Our results show the potential of integrating segmentation algorithms with radiomics-based survival analysis to deliver accurate and automated CRLM outcome prediction.
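
The abstract reports two kinds of metrics: Dice overlap for segmentation and a C-index for survival prediction from right-censored data. The sketch below shows how both are commonly computed. It is a minimal illustration, not the authors' evaluation code: the function names `dice_score` and `concordance_index` are hypothetical, and the C-index follows the standard Harrell definition, which the abstract does not confirm as the exact variant used.

```python
from itertools import combinations


def dice_score(pred, truth):
    """Dice overlap between two binary masks given as flat 0/1 sequences."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times  : observed follow-up time per patient
    events : 1 if the event was observed, 0 if the patient was censored
    risks  : predicted risk score (higher means shorter expected survival)
    """
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the earlier time is an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # ties in predicted risk count as half
    return concordant / comparable if comparable else float("nan")
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported value of 0.69.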

[84] Robotic Ultrasound Makes CBCT Alive

Feng Li,Ziyuan Li,Zhongliang Jiang,Nassir Navab,Yuan Bi

Main category: cs.CV

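TL;DR: This paper proposes a robotic-ultrasound-based framework for real-time CBCT updating: the USCorUNet network estimates the soft-tissue deformation field, which is then mapped onto the CBCT image to update dynamic navigation without additional radiation.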

Details

Motivation: Intraoperative cone-beam CT (CBCT) provides reliable 3D anatomy, but because it is static it cannot reflect soft-tissue deformation caused by respiration, probe pressure, and surgical manipulation, which leads to navigation discrepancies. Method: A deformation-aware CBCT updating framework: rigid registration is first refined with LC2; a lightweight network, USCorUNet, then learns deformation-aware correlation representations between ultrasound images under optical flow-guided supervision to estimate dense deformation fields in real time; finally, the deformation field is regularized and transferred to the CBCT reference image. Result: Experiments show real-time end-to-end CBCT slice updating with physically plausible deformation estimates, enabling dynamic refinement of navigation in robotic ultrasound-assisted interventions. Conclusion: The framework compensates for the static nature of CBCT by using ultrasound as a dynamic proxy to update CBCT in real time without additional radiation, offering a new approach to precise intraoperative navigation. Abstract: Intraoperative Cone Beam Computed Tomography (CBCT) provides a reliable 3D anatomical context essential for interventional planning. However, its static nature fails to provide continuous monitoring of soft-tissue deformations induced by respiration, probe pressure, and surgical manipulation, leading to navigation discrepancies. We propose a deformation-aware CBCT updating framework that leverages robotic ultrasound as a dynamic proxy to infer tissue motion and update static CBCT slices in real time. Starting from calibration-initialized alignment with linear correlation of linear combination (LC2)-based rigid refinement, our method establishes accurate multimodal correspondence. To capture intraoperative dynamics, we introduce the ultrasound correlation UNet (USCorUNet), a lightweight network trained with optical flow-guided supervision to learn deformation-aware correlation representations, enabling accurate, real-time dense deformation field estimation from ultrasound streams. The inferred deformation is spatially regularized and transferred to the CBCT reference to produce deformation-consistent visualizations without repeated radiation exposure. We validate the proposed approach through deformation estimation and ultrasound-guided CBCT updating experiments. Results demonstrate real-time end-to-end CBCT slice updating and physically plausible deformation estimation, enabling dynamic refinement of static CBCT guidance during robotic ultrasound-assisted interventions. The source code is publicly available at https://github.com/anonymous-codebase/us-cbct-demo.

[85] OilSAM2: Memory-Augmented SAM2 for Scalable SAR Oil Spill Detection

Shuaiyu Chen,Ming Yin,Peng Ren,Chunbo Luo,Zeyu Fu

Main category: cs.CV

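TL;DR: This paper proposes OilSAM2, a memory-augmented segmentation framework designed for oil spill monitoring in unordered SAR imagery. A hierarchical multi-scale memory bank and a structure-semantic consistent memory update strategy improve cross-image information reuse and suppress memory drift, reaching state-of-the-art performance on public datasets.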

Details

Motivation: Oil spill segmentation in SAR imagery faces large appearance variation, heterogeneous scales, and a lack of temporal continuity; existing SAM-based methods cannot effectively reuse information across scenes, while memory-augmented methods such as SAM2 rely on temporal coherence and are prone to semantic drift on unordered SAR images. Method: The OilSAM2 framework comprises (1) a hierarchical feature-aware multi-scale memory bank that explicitly models texture-, structure-, and semantic-level representations, and (2) a structure-semantic consistent memory update strategy that selectively refreshes memory according to semantic discrepancy and structural variation. Result: Experiments on two public SAR oil spill datasets show that OilSAM2 delivers stable and accurate segmentation under noisy SAR monitoring scenarios, with state-of-the-art performance. Conclusion: By combining multi-scale feature modeling with a robust memory update mechanism, OilSAM2 addresses cross-image information reuse and memory drift in unordered SAR oil spill segmentation, improving robustness and accuracy in practical monitoring. Abstract: Segmenting oil spills from Synthetic Aperture Radar (SAR) imagery remains challenging due to severe appearance variability, scale heterogeneity, and the absence of temporal continuity in real-world monitoring scenarios. While foundation models such as Segment Anything (SAM) enable prompt-driven segmentation, existing SAM-based approaches operate on single images and cannot effectively reuse information across scenes. Memory-augmented variants (e.g., SAM2) further assume temporal coherence, making them prone to semantic drift when applied to unordered SAR image collections. We propose OilSAM2, a memory-augmented segmentation framework tailored for unordered SAR oil spill monitoring. OilSAM2 introduces a hierarchical feature-aware multi-scale memory bank that explicitly models texture-, structure-, and semantic-level representations, enabling robust cross-image information reuse. To mitigate memory drift, we further propose a structure-semantic consistent memory update strategy that selectively refreshes memory based on semantic discrepancy and structural variation. Experiments on two public SAR oil spill datasets demonstrate that OilSAM2 achieves state-of-the-art segmentation performance, delivering stable and accurate results under noisy SAR monitoring scenarios. The source code is available at https://github.com/Chenshuaiyu1120/OILSAM2.

[86] Why Does It Look There? Structured Explanations for Image Classification

Jiarui Li,Zixiang Yin,Samuel J Landry,Zhengming Ding,Ramgopal R. Mettu

Main category: cs.CV

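TL;DR: This paper proposes I2X, a framework that turns unstructured interpretability (such as saliency maps) into structured explanations by quantifying prototypes during training, revealing the model's decision mechanism and guiding model optimization to improve performance.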

Details

Motivation: The black-box nature of deep learning models limits their transparency and trustworthiness; existing XAI methods mostly rely on auxiliary models and provide unstructured explanations, which compromises faithfulness to the original model. Method: The Interpretability to Explainability (I2X) framework uses post-hoc XAI methods (e.g., GradCAM) to extract prototypes, quantifies progress at selected training checkpoints, and builds structured explanations that cover both intra- and inter-class decision making. Result: Experiments on MNIST and CIFAR10 show that I2X reveals the prototype-based inference process of various image classification models; identifying uncertain prototypes, perturbing targeted samples, and fine-tuning also improves prediction accuracy across architectures and datasets. Conclusion: I2X not only explains model behavior faithfully but also offers practical guidance for model optimization, bridging interpretability and explainability in a faithful, structured, and actionable way. Abstract: Deep learning models achieve remarkable predictive performance, yet their black-box nature limits transparency and trustworthiness. Although numerous explainable artificial intelligence (XAI) methods have been proposed, they primarily provide saliency maps or concepts (i.e., unstructured interpretability). Existing approaches often rely on auxiliary models (e.g., GPT, CLIP) to describe model behavior, thereby compromising faithfulness to the original models. We propose Interpretability to Explainability (I2X), a framework that builds structured explanations directly from unstructured interpretability by quantifying progress at selected checkpoints during training using prototypes extracted from post-hoc XAI methods (e.g., GradCAM). I2X answers the question of "why does it look there" by providing a structured view of both intra- and inter-class decision making during training. Experiments on MNIST and CIFAR10 demonstrate the effectiveness of I2X in revealing the prototype-based inference process of various image classification models. Moreover, we demonstrate that I2X can be used to improve predictions across different model architectures and datasets: we can identify uncertain prototypes recognized by I2X and then use targeted perturbation of samples that allows fine-tuning to ultimately improve accuracy. Thus, I2X not only faithfully explains model behavior but also provides a practical approach to guide optimization toward desired targets.

[87] One Adapter for All: Towards Unified Representation in Step-Imbalanced Class-Incremental Learning

Xiaoyan Zhang,Jiangpeng He

Main category: cs.CV

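TL;DR: This paper proposes the One-A framework to address uneven task sizes (step imbalance) in class-incremental learning. Asymmetric subspace alignment, information-adaptive weighting, and a directional gating mechanism improve stability and adaptability while keeping inference overhead low.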

Details

Motivation: Existing class-incremental learning methods assume a balanced number of classes across tasks, but in practice task sizes vary widely, so large tasks dominate learning and small tasks inject unstable updates, degrading overall performance. Method: The unified One-A framework (1) incrementally merges task updates into a single adapter; (2) applies asymmetric subspace alignment to preserve the dominant subspaces learned from large tasks; (3) uses information-adaptive weighting to balance the contributions of the base and new adapters; and (4) uses a directional gating mechanism to selectively fuse updates along each singular direction. Result: Across multiple benchmarks and step-imbalanced streams, One-A reaches competitive accuracy with significantly lower inference overhead. Conclusion: A single, asymmetrically fused adapter can remain adaptive under dynamic task sizes while staying efficient at deployment, mitigating step imbalance. Abstract: Class-incremental learning (CIL) aims to acquire new classes over time while retaining prior knowledge, yet most setups and methods assume balanced task streams. In practice, the number of classes per task often varies significantly. We refer to this as step imbalance, where large tasks that contain more classes dominate learning and small tasks inject unstable updates. Existing CIL methods assume balanced tasks and therefore treat all tasks uniformly, producing imbalanced updates that degrade overall learning performance. To address this challenge, we propose One-A, a unified and imbalance-aware framework that incrementally merges task updates into a single adapter, maintaining constant inference cost. One-A performs asymmetric subspace alignment to preserve dominant subspaces learned from large tasks while constraining low-information updates within them. An information-adaptive weighting balances the contribution between base and new adapters, and a directional gating mechanism selectively fuses updates along each singular direction, maintaining stability in head directions and plasticity in tail ones. Across multiple benchmarks and step-imbalanced streams, One-A achieves competitive accuracy with significantly lower inference overhead, showing that a single, asymmetrically fused adapter can remain both adaptive to dynamic task sizes and efficient at deployment.

[88] Joint Imaging-ROI Representation Learning via Cross-View Contrastive Alignment for Brain Disorder Classification

Wei Liang,Lifang He

Main category: cs.CV

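TL;DR: This paper proposes a unified cross-view contrastive learning framework that jointly learns whole-brain imaging and ROI-graph representations and aligns them in a shared latent space, improving brain disorder classification and revealing their complementarity.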

Details

Motivation: Existing methods mostly use either whole-brain images or ROI graphs alone; their relative contributions and complementarity remain unclear, and a unified, controlled framework for evaluating fusion is lacking. Method: A bidirectional contrastive loss aligns the global (imaging) and local (ROI-graph) embeddings of the same subject in a shared latent space, enabling joint representation learning and systematic evaluation of multiple downstream fusion configurations. Result: On the ADHD-200 and ABIDE datasets, joint learning consistently outperforms either branch alone; interpretability analyses confirm that the two branches capture complementary discriminative patterns. Conclusion: Explicitly integrating whole-brain volumetric and ROI-level representations is a promising direction for neuroimaging-based brain disorder classification. Abstract: Brain imaging classification is commonly approached from two perspectives: modeling the full image volume to capture global anatomical context, or constructing ROI-based graphs to encode localized and topological interactions. Although both representations have demonstrated independent efficacy, their relative contributions and potential complementarity remain insufficiently understood. Existing fusion approaches are typically task-specific and do not enable controlled evaluation of each representation under consistent training settings. To address this gap, we propose a unified cross-view contrastive framework for joint imaging-ROI representation learning. Our method learns subject-level global (imaging) and local (ROI-graph) embeddings and aligns them in a shared latent space using a bidirectional contrastive objective, encouraging representations from the same subject to converge while separating those from different subjects. This alignment produces comparable embeddings suitable for downstream fusion and enables systematic evaluation of imaging-only, ROI-only, and joint configurations within a unified training protocol. Extensive experiments on the ADHD-200 and ABIDE datasets demonstrate that joint learning consistently improves classification performance over either branch alone across multiple backbone choices. Moreover, interpretability analyses reveal that imaging-based and ROI-based branches emphasize distinct yet complementary discriminative patterns, explaining the observed performance gains. These findings provide principled evidence that explicitly integrating global volumetric and ROI-level representations is a promising direction for neuroimaging-based brain disorder classification.
The source code is available at https://anonymous.4open.science/r/imaging-roi-contrastive-152C/.
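
The bidirectional contrastive objective described in this abstract can be illustrated with a symmetric InfoNCE-style loss, in which each subject's imaging embedding should be most similar to the same subject's ROI-graph embedding and vice versa. This is a sketch under assumptions: the abstract does not give the exact loss, so the cosine similarity, the temperature value, and the names `bidirectional_contrastive_loss`, `img_emb`, and `roi_emb` are illustrative choices, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def _diag_cross_entropy(sim, temperature):
    """Mean cross-entropy where row i's correct 'class' is column i."""
    loss = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_z - logits[i]
    return loss / len(sim)


def bidirectional_contrastive_loss(img_emb, roi_emb, temperature=0.1):
    """Symmetric InfoNCE: imaging-to-ROI plus ROI-to-imaging, averaged.

    img_emb[i] and roi_emb[i] are assumed to come from the same subject.
    """
    sim = [[cosine(u, v) for v in roi_emb] for u in img_emb]
    sim_t = [list(col) for col in zip(*sim)]  # transpose for the other direction
    return 0.5 * (_diag_cross_entropy(sim, temperature)
                  + _diag_cross_entropy(sim_t, temperature))
```

The loss is low when matching pairs lie on the diagonal of the similarity matrix and high when embeddings of different subjects are confused, which is the "converge while separating" behavior the abstract describes.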

[89] A Robust Deep Learning Framework for Bangla License Plate Recognition Using YOLO and Vision-Language OCR

Nayeb Hasin,Md. Arafath Rahman Nishat,Mainul Islam,Khandakar Shakib Al Hasan,Asif Newaz

Main category: cs.CV

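TL;DR: This paper proposes a robust Bangla automatic license plate recognition (ALPR) system that combines a two-stage adaptive training strategy built on YOLOv8 for plate localization with a ViT + BanglaBERT VisionEncoderDecoder architecture for text recognition. It outperforms existing methods with an accuracy of 97.83%, an IoU of 91.3%, and a character error rate of 0.1323, and its robustness is validated on a cross-domain external dataset.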

Details

Motivation: Bangla license plates have a complex character scheme and uneven layouts, which makes detection and recognition difficult; existing methods fall short of the robustness and generalization that intelligent transportation systems require in practice. Method: A two-stage adaptive training strategy built on YOLOv8 is used for plate localization and compared against U-Net and several YOLO variants; text recognition is cast as sequence generation, several VisionEncoderDecoder combinations are evaluated, and ViT is selected as the encoder with BanglaBERT as the decoder. Result: Plate localization reaches 97.83% accuracy and 91.3% IoU; text recognition reaches a character error rate of 0.1323 and a word error rate of 0.1068; performance remains stable under varying lighting, noise, and plate styles and on a cross-domain external dataset. Conclusion: The proposed system offers high accuracy, robustness, and good generalization in complex real-world scenes and is suited to intelligent transportation applications such as automated law enforcement and access control. Abstract: An Automatic License Plate Recognition (ALPR) system constitutes a crucial element in an intelligent traffic management system. However, the detection of Bangla license plates remains challenging because of the complicated character scheme and uneven layouts. This paper presents a robust Bangla License Plate Recognition system that integrates a deep learning-based object detection model for license plate localization with Optical Character Recognition for text extraction. Multiple object detection architectures, including U-Net and several YOLO (You Only Look Once) variants, are compared for license plate localization. This study proposes a novel two-stage adaptive training strategy built upon the YOLOv8 architecture to improve localization performance. The proposed approach outperforms the established models, achieving an accuracy of 97.83% and an Intersection over Union (IoU) of 91.3%. The text recognition problem is framed as a sequence generation problem with a VisionEncoderDecoder architecture, and several encoder-decoder combinations are evaluated. The ViT + BanglaBERT model gives better results at the character level, with a Character Error Rate of 0.1323 and a Word Error Rate of 0.1068. The proposed system also shows consistent performance when tested on an external dataset curated for this study. The dataset offers completely different environment and lighting conditions compared to the training sample, indicating the robustness of the proposed framework.
Overall, our proposed system provides a robust and reliable solution for Bangla license plate recognition and performs effectively across diverse real-world scenarios, including variations in lighting, noise, and plate styles. These strengths make it well suited for deployment in intelligent transportation applications such as automated law enforcement and access control.
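
The Character Error Rate and Word Error Rate reported above are conventionally defined as the Levenshtein edit distance between the reference and the recognized text, divided by the reference length. The sketch below shows that standard definition; it is not the authors' evaluation script. Treating each Unicode code point as one character is an assumption made here: Bangla grapheme clusters often span several code points, so a grapheme-level CER could differ. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion from the reference
                curr[j - 1] + 1,         # insertion into the reference
                prev[j - 1] + (r != h),  # substitution, free if symbols match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(reference, hypothesis):
    """Character error rate over Unicode code points (an assumption, see lead-in)."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())
```

Because insertions are counted, either rate can exceed 1.0 when the recognized text is much longer than the reference.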

[90] From Imitation to Intuition: Intrinsic Reasoning for Open-Instance Video Classification

Ke Zhang,Xiangchen Zhao,Yunjie Tian,Jiayu Zheng,Vishal M. Patel,Di Fu

Main category: cs.CV

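TL;DR: This paper proposes DeepIntuit, a framework with three stages (supervised alignment, reinforcement learning with Group Relative Policy Optimization (GRPO), and intuitive calibration) that moves video classification from feature imitation to intrinsic reasoning and markedly improves generalization in open-instance scenarios.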

Details

Motivation: Existing video classification models perform well on homogeneous data but generalize poorly in real-world open-instance scenarios, where intra-class variation is large and distributions are complex; vision-language models generalize well, but their intrinsic reasoning (intuition) has not been fully exploited for such tasks. Method: DeepIntuit (1) initializes reasoning capability with cold-start supervised alignment; (2) applies Group Relative Policy Optimization (GRPO) reinforcement learning to improve reasoning coherence; and (3) adds an intuitive calibration stage in which a classifier is trained on the intrinsic reasoning traces generated by the refined VLM, giving stable knowledge transfer. Result: On open-instance video classification, DeepIntuit significantly outperforms traditional video encoders and baseline VLM methods, supporting the shift from imitation to intuitive reasoning. Conclusion: Intrinsic reasoning is key to improving open-instance video classification; by combining supervised and reinforcement learning in a structured way and calibrating on reasoning traces, DeepIntuit turns the generalization strength of VLMs into reliable classification performance. Abstract: Conventional video classification models, acting as effective imitators, excel in scenarios with homogeneous data distributions. However, real-world applications often present an open-instance challenge, where intra-class variations are vast and complex, beyond existing benchmarks. While traditional video encoder models struggle to fit these diverse distributions, vision-language models (VLMs) offer superior generalization but have not fully leveraged their reasoning capabilities (intuition) for such tasks. In this paper, we bridge this gap with an intrinsic reasoning framework that evolves open-instance video classification from imitation to intuition. Our approach, namely DeepIntuit, begins with a cold-start supervised alignment to initialize reasoning capability, followed by refinement using Group Relative Policy Optimization (GRPO) to enhance reasoning coherence through reinforcement learning. Crucially, to translate this reasoning into accurate classification, DeepIntuit then introduces an intuitive calibration stage. In this stage, a classifier is trained on the intrinsic reasoning traces generated by the refined VLM, ensuring stable knowledge transfer without distribution mismatch. Extensive experiments demonstrate that for open-instance video classification, DeepIntuit benefits significantly from transcending simple feature imitation and evolving toward intrinsic reasoning. Our project is available at https://bwgzk-keke.github.io/DeepIntuit/.
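
The core idea of Group Relative Policy Optimization, used in the second stage above, is to score each sampled response relative to the other responses drawn for the same input rather than against a learned value function. The sketch below shows only that advantage computation, under assumptions: the reward values are hypothetical (for example, 1.0 when a sampled reasoning trace ends in the correct class label), the function name is illustrative, and DeepIntuit's actual reward design is not given in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same input.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat their siblings get positive advantages and the rest get negative
    ones. The population standard deviation is used here; some implementations
    use the sample standard deviation instead.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four sampled reasoning traces for one video:
# two reach the correct label (reward 1.0) and two do not (reward 0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is why reward designs with some within-group variation matter for this style of training.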

[91] Fuel Gauge: Estimating Chain-of-Thought Length Ahead of Time in Large Multimodal Models

Yuedong Yang,Xiwen Wei,Mustafa Munir,Radu Marculescu

Main category: cs.CV

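TL;DR: This paper proposes Fuel Gauge, a method that predicts chain-of-thought (CoT) length ahead of time to improve the inference efficiency and accuracy of large multimodal models (LMMs), addressing memory fragmentation and under- and over-thinking.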

Details

Motivation: Existing LMMs rely on a Chain-of-Thought (CoT) reasoning process that is uncontrollable and inefficient, wasting computational resources (memory fragmentation) and lowering accuracy (under- or over-thinking). Method: Based on the observation that CoT length is governed by a sample-independent hidden "fuel" parameter, Fuel Gauge extracts this hidden signal from inside the model and predicts CoT length ahead of time. Result: The method is validated on text-only, image-text, and video-text question answering benchmarks; on GPQA-Diamond, CoT length prediction error falls to less than half that of the baseline, and memory allocation frequency drops by 13.37x. Conclusion: Fuel Gauge is the first method to predict CoT length ahead of time; it improves the resource efficiency and reasoning quality of LMM inference systems and shows generalizability and practical value. Abstract: Reasoning Large Multi-modality Models (LMMs) have become the de facto choice for many applications. However, these models rely on a Chain-of-Thought (CoT) process that is lengthy and unpredictable at runtime, often resulting in inefficient use of computational resources (due to memory fragmentation) and sub-optimal accuracy (due to under- and over-thinking). We observe empirically that the CoT process follows a very simple form, whose behavior is independent of the specific generated samples. This suggests that the CoT length can be estimated ahead of time based on a hidden parameter representing the amount of "fuel" available to support the reasoning process. Based on this insight, we propose Fuel Gauge, the first method which extracts this hidden signal and predicts CoT length ahead of time. We demonstrate the utility of Fuel Gauge on two downstream tasks: predictive KV cache allocation, which addresses memory fragmentation in LMM serving systems, and CoT length modulation, which mitigates under-thinking and over-thinking. Extensive experiments on LMMs across text-only, image-text, and video-text question answering benchmarks demonstrate the effectiveness, generalizability, and practical value of Fuel Gauge. For example, on the GPQA-Diamond benchmark, Fuel Gauge achieves less than half the CoT length prediction error of the baseline; this translates into a 13.37x reduction in memory allocation frequency.

[92] Overcoming Visual Clutter in Vision Language Action Models via Concept-Gated Visual Distillation

Sangmim Song,Sarath Kodagoda,Marc Carmichael,Karthick Thiyagarajan

Main category: cs.CV

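TL;DR: This paper proposes Concept-Gated Visual Distillation (CGVD), a training-free, model-agnostic inference framework that uses instruction parsing, target refinement, and Fourier-based inpainting to mitigate the "Precision-Reasoning Gap" of vision-language-action (VLA) models in cluttered environments, markedly raising manipulation success rates.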

Details

Motivation: In cluttered environments, VLA models suffer from a "Precision-Reasoning Gap" caused by background-induced feature dilution (high-frequency semantic noise corrupts geometric grounding), which makes precise manipulation fail. Method: CGVD has three steps: (1) parse the instruction into safe and distractor sets; (2) apply two-layer target refinement, combining cross-validation and spatial disambiguation, to suppress false positives; and (3) use Fourier-based inpainting to generate a clean observation that suppresses semantic distractors while preserving spatial geometry and visual proprioception. Result: In highly cluttered manipulation tasks, CGVD raises the success rate from the baseline's 43.0% to 77.5%, prevents performance collapse, and significantly outperforms state-of-the-art baselines. Conclusion: CGVD shows that inference-time visual distillation is a critical prerequisite for robust robotic manipulation in clutter and offers a new approach to deploying VLA models in practice. Abstract: Vision-Language-Action (VLA) models demonstrate impressive zero-shot generalization but frequently suffer from a "Precision-Reasoning Gap" in cluttered environments. This failure is driven by background-induced feature dilution, where high-frequency semantic noise corrupts the geometric grounding required for precise manipulation. To bridge this gap, we propose Concept-Gated Visual Distillation (CGVD), a training-free, model-agnostic inference framework that stabilizes VLA policies. CGVD operates by parsing instructions into safe and distractor sets, utilizing a two-layer target refinement process, which combines cross-validation and spatial disambiguation, to explicitly penalize false positives and isolate genuine manipulation targets. We then process the scene via Fourier-based inpainting, generating a clean observation that actively suppresses semantic distractors while preserving critical spatial geometry and visual proprioception. Extensive evaluations in highly cluttered manipulation tasks demonstrate that CGVD prevents performance collapse. In environments with dense semantic distractors, our method significantly outperforms state-of-the-art baselines, achieving a 77.5% success rate compared to the baseline's 43.0%. By enforcing strict attribute adherence, CGVD establishes inference-time visual distillation as a critical prerequisite for robust robotic manipulation in clutter.

[93] EmoStory: Emotion-Aware Story Generation

Jingyuan Yang,Rucong Chen,Hui Huang

Main category: cs.CV

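TL;DR: This paper proposes EmoStory, a framework that introduces emotion into visual story generation for the first time. Its two-stage approach (emotion-driven story planning and region-aware story generation) produces visual narratives that are emotionally accurate, aligned with the prompt, and consistent in subject.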

Details

Motivation: Existing visual story generation methods lack emotional expressiveness, yet emotion is essential to narrative interpretation and visual presentation, which calls for an emotion-aware story generation model. Method: A two-stage EmoStory framework: the first stage uses an emotion agent and a writer agent to turn target emotions into coherent story prompts; the second stage uses region-aware image generation to inject emotion-related visual elements while preserving subject consistency. Result: On a new dataset covering 25 subjects and 600 emotional stories, EmoStory outperforms state-of-the-art methods in emotion accuracy, prompt alignment, and subject consistency, as shown by quantitative evaluation, qualitative analysis, and user studies. Conclusion: EmoStory achieves emotion-guided visual story generation, offers a new approach at the intersection of affective computing and generative AI, and advances more expressive and engaging AI storytelling systems. Abstract: Story generation aims to produce image sequences that depict coherent narratives while maintaining subject consistency across frames. Although existing methods have excelled in producing coherent and expressive stories, they remain largely emotion-neutral, focusing on what subject appears in a story while overlooking how emotions shape narrative interpretation and visual presentation. As stories are intended to engage audiences emotionally, we introduce emotion-aware story generation, a new task that aims to generate subject-consistent visual stories with explicit emotional directions. This task is challenging due to the abstract nature of emotions, which must be grounded in concrete visual elements and consistently expressed across a narrative through visual composition. To address these challenges, we propose EmoStory, a two-stage framework that integrates agent-based story planning and region-aware story generation. The planning stage transforms target emotions into coherent story prompts with an emotion agent and a writer agent, while the generation stage preserves subject consistency and injects emotion-related elements through region-aware composition. We evaluate EmoStory on a newly constructed dataset covering 25 subjects and 600 emotional stories. Extensive quantitative and qualitative results, along with user studies, show that EmoStory outperforms state-of-the-art story generation methods in emotion accuracy, prompt alignment, and subject consistency.

[94] StyleGallery: Training-free and Semantic-aware Personalized Style Transfer from Arbitrary Image References

Boyu He,Yunfan Ye,Chang Liu,Weishang Wu,Fang Liu,Zhiping Cai

Main category: cs.CV

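TL;DR: This paper proposes StyleGallery, a training-free, semantic-aware image style transfer framework. Adaptive semantic region segmentation, clustered region matching, and energy function-guided diffusion sampling address the limitations of existing methods: the semantic gap, reliance on extra constraints, and rigid global-local alignment.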

Details

Motivation: Existing diffusion-based image style transfer methods suffer from a semantic gap, reliance on extra constraints (such as semantic masks), and a lack of adaptive global-local alignment, which limits personalization, accuracy, and adaptability. Method: StyleGallery has three core stages: (1) semantic region segmentation (adaptive clustering on latent diffusion features, with no extra inputs); (2) clustered region matching (block filtering for precise feature alignment); and (3) style transfer optimization (energy function-guided diffusion sampling with a regional style loss). Result: On a newly built benchmark, StyleGallery outperforms state-of-the-art methods in content structure preservation, regional stylization, interpretability, and personalized customization, particularly when multiple style references are used. Conclusion: StyleGallery is a training-free, semantic-aware, general style transfer framework that flexibly supports arbitrary style reference images and improves personalization and controllability. Abstract: Despite the advancements in diffusion-based image style transfer, existing methods are commonly limited by 1) semantic gap: the style reference could miss proper content semantics, causing uncontrollable stylization; 2) reliance on extra constraints (e.g., semantic masks) restricting applicability; 3) rigid feature associations lacking adaptive global-local alignment, failing to balance fine-grained stylization and global content preservation. These limitations, particularly the inability to flexibly leverage style inputs, fundamentally restrict style transfer in terms of personalization, accuracy, and adaptability. To address these, we propose StyleGallery, a training-free and semantic-aware framework that supports arbitrary reference images as input and enables effective personalized customization. It comprises three core stages: semantic region segmentation (adaptive clustering on latent diffusion features to divide regions without extra inputs); clustered region matching (block filtering on extracted features for precise alignment); and style transfer optimization (energy function-guided diffusion sampling with regional style loss to optimize stylization). Experiments on our introduced benchmark demonstrate that StyleGallery outperforms state-of-the-art methods in content structure preservation, regional stylization, interpretability, and personalized customization, particularly when leveraging multiple style references.

[95] One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination

Zhan Fa,Yue Duan,Jian Zhang,Lei Qi,Yinghuan Shi

Main category: cs.CV

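TL;DR: This paper proposes a unified framework with two cooperating modules, Synergistic Visual Calibration (SVC) and Causal Representation Calibration (CRC). Using augmented and pruned vision tokens, they respectively strengthen visual representations and build negative samples in latent space, rebalancing the vision and language signals in multimodal large language models and markedly reducing hallucination.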

Details

Motivation: Existing training-free methods that only enhance vision or only suppress text priors face severe trade-offs, and naively combining them is also ineffective, so a unified framework is needed to address hallucination in MLLMs. Method: A unified framework built on vision tokens: the SVC module incorporates vision tokens from augmented images to strengthen visual representations; the CRC module prunes vision tokens to construct latent-space negative samples that correct internal model biases. Together they restore the vision-language balance. Result: On LLaVA-1.5, POPE accuracy improves by an average of 2 percentage points (absolute) across multiple benchmarks, with only a 1.06x inference latency overhead. Conclusion: Exploiting the dual role of vision tokens (augmentation and pruning) effectively mitigates MLLM hallucination, showing that a unified calibration strategy operating in latent space outperforms separate or image-level perturbation approaches. Abstract: Current training-free methods tackle MLLM hallucination with separate strategies: either enhancing visual signals or suppressing text inertia. However, these separate methods are insufficient due to critical trade-offs: simply enhancing vision often fails against strong language priors, while suppressing language can introduce extra image-irrelevant noise. Moreover, we find their naive combination is also ineffective, necessitating a unified framework. We propose such a framework by focusing on the core asset: the vision token. Our design leverages two key insights: (1) augmented images offer complementary visual semantics, and (2) removing vision tokens (information-gap) isolates hallucination tendencies more precisely than distorting images (modality-gap). Based on these, our framework uses vision tokens in two distinct ways, both operating on latent representations: our Synergistic Visual Calibration (SVC) module incorporates augmented tokens to strengthen visual representations, while our Causal Representation Calibration (CRC) module uses pruned tokens to create latent-space negative samples for correcting internal model biases. By harmonizing these two roles, our framework effectively restores the vision-language balance and significantly reduces object hallucinations, improving POPE accuracy by an average of 2% absolute on LLaVA-1.5 across multiple benchmarks with only a 1.06x inference latency overhead.

[96] Geometric Autoencoder for Diffusion Models

Hangyu Liu,Jianyong Wang,Yutao Sun

Main category: cs.CV

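TL;DR: This paper proposes the Geometric Autoencoder (GAE), a principled latent representation learning framework for diffusion models. With VFM-guided semantic supervision, latent normalization in place of KL divergence, and a dynamic noise sampling mechanism, it reaches state-of-the-art generation quality on ImageNet-1K and a better balance of compression, semantics, and robustness.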

Details

Motivation: Latent space design in existing latent diffusion models is largely heuristic and struggles to combine semantic discriminability, reconstruction fidelity, and latent compactness; in addition, the KL-divergence constraint of the standard VAE limits how well the latent manifold suits diffusion learning. Method: The Geometric Autoencoder (GAE) (1) analyzes several alignment paradigms with Vision Foundation Models (VFMs) to build an optimized low-dimensional semantic supervision target; (2) replaces the KL divergence of the standard VAE with latent normalization to stabilize the latent manifold for diffusion training; and (3) introduces a dynamic noise sampling mechanism to make reconstruction robust under high-intensity noise. Result: On ImageNet-1K 256×256, GAE reaches a gFID of 1.82 at 80 epochs and 1.31 at 800 epochs without Classifier-Free Guidance, significantly surpassing existing methods, and strikes a better balance between compression, semantic depth, and reconstruction stability. Conclusion: GAE offers a principled and scalable approach to latent diffusion modeling and supports the importance of semantic guidance, manifold optimization, and noise-robust design for high-quality generation. Abstract: Latent diffusion models have established a new state-of-the-art in high-resolution visual generation. Integrating Vision Foundation Model priors improves generative efficiency, yet existing latent designs remain largely heuristic. These approaches often struggle to unify semantic discriminability, reconstruction fidelity, and latent compactness. In this paper, we propose Geometric Autoencoder (GAE), a principled framework that systematically addresses these challenges. By analyzing various alignment paradigms, GAE constructs an optimized low-dimensional semantic supervision target from VFMs to provide guidance for the autoencoder. Furthermore, we leverage latent normalization that replaces the restrictive KL-divergence of standard VAEs, enabling a more stable latent manifold specifically optimized for diffusion learning. To ensure robust reconstruction under high-intensity noise, GAE incorporates a dynamic noise sampling mechanism. Empirically, GAE achieves compelling performance on the ImageNet-1K $256 \times 256$ benchmark, reaching a gFID of 1.82 at only 80 epochs and 1.31 at 800 epochs without Classifier-Free Guidance, significantly surpassing existing state-of-the-art methods. Beyond generative quality, GAE establishes a superior equilibrium between compression, semantic depth and robust reconstruction stability. These results validate our design considerations, offering a promising paradigm for latent diffusion modeling. Code and models are publicly available at https://github.com/freezing-index/Geometric-Autoencoder-for-Diffusion-Models.

[97] GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning

Ruiheng Liu,Haihong Hao,Mingfei Han,Xin Gu,Kecheng Zhang,Changlin Li,Xiaojun Chang

Main category: cs.CV

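TL;DR: This paper proposes a new framework that lets multimodal large language models (MLLMs) call on geometric information autonomously when their perception is insufficient. An independent geometry channel and a spatial-aware fine-tuning dataset markedly improve spatial reasoning without compromising 2D visual reasoning.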

Details

Motivation: Existing MLLMs have limited spatial understanding and, in particular, do not use geometric information adaptively; conventional methods inject geometric signals rigidly, adding redundant computation and ignoring whether the signals are needed. Method: (1) An independent geometry input channel is added and trained with alignment; (2) a spatial-aware supervised fine-tuning dataset is built to activate the model's internal cues so that it decides autonomously whether geometric information is needed. Result: Performance improves significantly on multiple spatial reasoning benchmarks while the original 2D visual reasoning capability is retained, giving more robust, efficient, and self-aware multimodal intelligence. Conclusion: Giving the model perceptual self-awareness (recognizing its own perceptual insufficiency and calling on geometric information as needed) is an effective new approach to improving the spatial intelligence of MLLMs. Abstract: Advancing towards artificial superintelligence requires rich and intelligent perceptual capabilities. A critical frontier in this pursuit is overcoming the limited spatial understanding of Multimodal Large Language Models (MLLMs), where geometry information is essential. Existing methods often address this by rigidly injecting geometric signals into every input, while ignoring their necessity and adding computation overhead. Contrary to this paradigm, our framework endows the model with an awareness of perceptual insufficiency, empowering it to autonomously engage geometric features in reasoning when 2D cues are deemed insufficient. To achieve this, we first introduce an independent geometry input channel to the model architecture and conduct alignment training, enabling the effective utilization of geometric features. Subsequently, to endow the model with perceptual awareness, we curate a dedicated spatial-aware supervised fine-tuning dataset. This serves to activate the model's latent internal cues, empowering it to autonomously determine the necessity of geometric information. Experiments across multiple spatial reasoning benchmarks validate this approach, demonstrating significant spatial gains without compromising 2D visual reasoning capabilities, offering a path toward more robust, efficient and self-aware multi-modal intelligence.

[98] Multi-Person Pose Estimation Evaluation Using Optimal Transportation and Improved Pose Matching

Takato Moriki,Hiromu Taketsugu,Norimichi Ukita

Main category: cs.CV

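TL;DR: This paper proposes OCpose, a new evaluation metric for multi-person pose estimation that uses optimal transportation to weigh true-positive and false-positive poses fairly. It does not rely on confidence ranking; instead, it uses confidence to make matching more reliable.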

Details

Motivation: Existing evaluation metrics focus on high-confidence poses and disregard low-confidence false positives, which distorts evaluation results; a fair metric is needed that accounts for both true and false positives without bias toward confidence. Method: OCpose, a metric based on optimal transportation, models the matching between detected and annotated poses as an optimal transportation problem; all detected poses are evaluated equally regardless of confidence, while confidence scores are used to improve the reliability of the matching scores. Result: OCpose offers an evaluation perspective different from conventional confidence ranking-based metrics and balances fairness with matching reliability. Conclusion: OCpose is a fairer and more robust metric for multi-person pose estimation that mitigates the evaluation bias caused by ignoring low-confidence false positives. Abstract: In Multi-Person Pose Estimation, many metrics place importance on the ranking of pose detection confidence scores. Current metrics tend to disregard false-positive poses with low confidence, focusing primarily on a larger number of high-confidence poses. Consequently, these metrics may yield high scores even when many false-positive poses with low confidence are detected. For fair evaluation that takes into account the tradeoff between true-positive and false-positive poses, this paper proposes Optimal Correction Cost for pose (OCpose), which evaluates detected poses against pose annotations as an optimal transportation problem. For a fair tradeoff between true-positive and false-positive poses, OCpose equally evaluates all the detected poses regardless of their confidence scores. On the other hand, the confidence score of each pose is utilized in OCpose to improve the reliability of matching scores between the estimated pose and pose annotations. As a result, OCpose provides an assessment from a different perspective than other confidence ranking-based metrics.
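
To make the matching idea concrete, the sketch below treats evaluation as a minimum-cost assignment between detected and annotated poses, where every unmatched detection (false positive) or unmatched annotation (miss) pays a fixed penalty regardless of confidence. This is a simplified illustration and not the OCpose definition: the mean keypoint distance, the `unmatched_cost` parameter, the brute-force search over permutations (practical only for a handful of poses), and the omission of confidence scores are all assumptions made here.

```python
import math
from itertools import permutations


def pose_distance(a, b):
    """Mean Euclidean distance between corresponding keypoints of two poses."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def optimal_matching_cost(detections, annotations, unmatched_cost=1.0):
    """Minimum total cost of assigning detections to annotations.

    The smaller set is padded with dummy entries so the problem is square;
    pairing with a dummy costs `unmatched_cost`. Every detection therefore
    contributes to the total, including low-confidence false positives.
    """
    n = max(len(detections), len(annotations))

    def cost(i, j):
        if i >= len(detections) or j >= len(annotations):
            return unmatched_cost  # false positive or missed annotation
        return pose_distance(detections[i], annotations[j])

    # Exhaustive search over all one-to-one assignments (fine for tiny n).
    return min(
        sum(cost(i, perm[i]) for i in range(n))
        for perm in permutations(range(n))
    )
```

With one annotated pose and two detections, one of which matches exactly, the total cost equals the single `unmatched_cost` penalty, so adding spurious detections always raises the cost instead of being ignored.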

[99] Motion Forcing: A Decoupled Framework for Robust Video Generation in Motion Dynamics

Tianshuo Xu,Zhifei Chen,Leyi Wu,Hao Lu,Ying-cong Chen

Main category: cs.CV

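TL;DR: This paper proposes Motion Forcing, a framework that decouples physical reasoning from visual synthesis through a hierarchical Point-Shape-Appearance paradigm and adds a Masked Point Recovery strategy to strengthen the model's learning of physical laws, so that video generation keeps high visual quality, physical consistency, and controllability stable in complex scenes.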

Details

Motivation: Existing video generation models can balance visual quality, physical consistency, and controllability in simple scenes, but the balance breaks down easily in complex ones (e.g., collisions or dense traffic), so a more robust generation framework is needed. Method: Proposes the Motion Forcing framework with a hierarchical Point-Shape-Appearance paradigm: the Point stage models the dynamics of sparse geometric anchors; the Shape stage expands them into dynamic depth maps that explicitly represent 3D geometry; the Appearance stage renders high-fidelity textures. A Masked Point Recovery strategy randomly masks input anchors during training and enforces reconstruction of the complete dynamic depth, pushing the model to learn latent physical laws. Result: Significantly outperforms state-of-the-art methods on autonomous driving benchmarks and keeps the three objectives stable in complex scenes; physics and robotics tasks confirm the framework's generality. Conclusion: Through explicit decoupling and physics-driven self-supervised training, Motion Forcing improves the robustness and generalization of video generation in complex scenes and offers a new paradigm for jointly optimizing the three objectives. Abstract: The ultimate goal of video generation is to satisfy a fundamental trilemma: achieving high visual quality, maintaining rigorous physical consistency, and enabling precise controllability. While recent models can maintain this balance in simple, isolated scenarios, we observe that this equilibrium is fragile and often breaks down as scene complexity increases (e.g., involving collisions or dense traffic). To address this, we introduce Motion Forcing, a framework designed to stabilize this trilemma even in complex generative tasks. Our key insight is to explicitly decouple physical reasoning from visual synthesis via a hierarchical "Point-Shape-Appearance" paradigm. This approach decomposes generation into verifiable stages: modeling complex dynamics as sparse geometric anchors (Point), expanding them into dynamic depth maps that explicitly resolve 3D geometry (Shape), and finally rendering high-fidelity textures (Appearance). Furthermore, to foster robust physical understanding, we employ a Masked Point Recovery strategy. By randomly masking input anchors during training and enforcing the reconstruction of complete dynamic depth, the model is compelled to move beyond passive pattern matching and learn latent physical laws (e.g., inertia) to infer missing trajectories. Extensive experiments on autonomous driving benchmarks show that Motion Forcing significantly outperforms state-of-the-art baselines, maintaining trilemma stability across complex scenes. Evaluations on physics and robotics further confirm our framework's generality.

[100] Frames2Residual: Spatiotemporal Decoupling for Self-Supervised Video Denoising

Mingjie Ji,Zhan Shi,Kailai Zhou,Zixuan Fu,Xun Cao

Main category: cs.CV

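TL;DR: This paper proposes Frames2Residual (F2R), a framework that decouples self-supervised video denoising into two stages, blind temporal consistency modeling and non-blind spatial texture recovery. This overcomes the broken spatiotemporal correlations and texture loss that center-pixel masking causes in existing blind-spot networks, and F2R achieves the best performance on both sRGB and raw video benchmarks.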

Details

Motivation: Existing self-supervised video denoising methods struggle to reconcile inter-frame temporal consistency with intra-frame spatial specificity; video blind-spot networks mask the center pixel to enforce noise independence, which sacrifices spatial texture information and breaks spatiotemporal correlations. Method: Proposes the F2R framework: Stage 1 uses a frame-wise blind strategy to learn inter-frame temporal consistency and produce a temporally consistent anchor; Stage 2 conditions on that anchor to recover, in a non-blind manner, the high-frequency spatial residual of the center frame, so that temporal and spatial modeling work together. Result: On sRGB and raw video denoising benchmarks, F2R significantly outperforms existing self-supervised methods. Conclusion: The spatiotemporally decoupled two-stage training paradigm mitigates the texture loss caused by the blind-spot constraint and improves both temporal consistency and detail fidelity in self-supervised video denoising. Abstract: Self-supervised video denoising methods typically extend image-based frameworks into the temporal dimension, yet they often struggle to integrate inter-frame temporal consistency with intra-frame spatial specificity. Existing Video Blind-Spot Networks (BSNs) require noise independence by masking the center pixel; this constraint prevents the use of spatial evidence for texture recovery, thereby severing spatiotemporal correlations and causing texture loss. To address this, we propose Frames2Residual (F2R), a spatiotemporal decoupling framework that explicitly divides self-supervised training into two distinct stages: blind temporal consistency modeling and non-blind spatial texture recovery. In Stage 1, a blind temporal estimator learns inter-frame consistency using a frame-wise blind strategy, producing a temporally consistent anchor. In Stage 2, a non-blind spatial refiner leverages this anchor to safely reintroduce the center frame and recover intra-frame high-frequency spatial residuals while preserving temporal stability. Extensive experiments demonstrate that our decoupling strategy allows F2R to outperform existing self-supervised methods on both sRGB and raw video benchmarks.

[101] TractoRC: A Unified Probabilistic Learning Framework for Joint Tractography Registration and Clustering

Yijie Li,Xi Zhu,Junyi Wang,Ye Wu,Lauren J. O'Donnell,Fan Zhang

Main category: cs.CV

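TL;DR: This paper proposes TractoRC, a unified probabilistic framework that jointly performs tractogram registration and streamline clustering. It learns a latent embedding space for streamline points so that the two tasks are optimized together, and introduces a transformation-equivariant self-supervised strategy to improve geometry awareness and transformation invariance.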

Details

Motivation: Existing methods usually treat tractogram registration and streamline clustering as separate tasks, even though the two share the same goal (capturing geometrically similar structures to characterize consistent white matter organization); joint modeling is needed to exploit their complementary information. Method: Proposes the TractoRC framework, which, within a shared latent embedding space, formulates registration as learning the distribution of anatomical landmarks (probabilistic keypoints) and clustering as learning streamline structural prototypes, and uses a transformation-equivariant self-supervised strategy to learn geometry-aware, transformation-invariant embeddings. Result: Experiments show that joint optimization significantly outperforms state-of-the-art methods that handle the two tasks independently. Conclusion: Jointly modeling registration and clustering lets the two tasks reinforce each other, and TractoRC offers a more consistent and robust paradigm for white matter pathway analysis. Abstract: Diffusion MRI tractography enables in vivo reconstruction of white matter (WM) pathways. Two key tasks in tractography analysis include: 1) tractogram registration that aligns streamlines across individuals, and 2) streamline clustering that groups streamlines into compact fiber bundles. Although both tasks share the goal of capturing geometrically similar structures to characterize consistent WM organization, they are typically performed independently. In this work, we propose TractoRC, a unified probabilistic framework that jointly performs tractogram registration and streamline clustering within a single optimization scheme, enabling the two tasks to leverage complementary information. TractoRC learns a latent embedding space for streamline points, which serves as a shared representation for both tasks. Within this space, both tasks are formulated as probabilistic inference over structural representations: registration learns the distribution of anatomical landmarks as probabilistic keypoints to align tractograms across subjects, and clustering learns streamline structural prototypes that capture geometric similarity to form coherent streamline clusters. To support effective learning of this shared space, we introduce a transformation-equivariant self-supervised strategy to learn geometry-aware and transformation-invariant embeddings. Experiments demonstrate that jointly optimizing registration and clustering significantly improves performance in both tasks over state-of-the-art methods that treat them independently. Code will be made publicly available at https://github.com/yishengpoxiao/TractoRC .

[102] World2Act: Latent Action Post-Training via Skill-Compositional World Models

An Dinh Vuong,Tuan Van Vo,Abdullah Sohail,Haoran Ding,Liang Ma,Xiaodan Liang,Anqing Duan,Ivan Laptev,Ian Reid

Main category: cs.CV

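TL;DR: This paper proposes World2Act, a framework that uses a contrastive matching objective to align VLA actions directly with the video-dynamics latents of a world model, reducing reliance on pixel-space supervision. Combined with an LLM-driven skill-decomposition pipeline that improves the world model's temporal consistency on tasks of arbitrary length, it significantly improves generalization in both simulation and the real world.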

Details

Motivation: Existing world-model-based post-training methods rely on pixel-space supervision and are vulnerable to pixel-level artifacts and hallucinations from imperfect rollouts; in addition, current world models struggle to generate videos of arbitrary length, which limits their use in robotic tasks whose durations vary. Method: Proposes the World2Act framework, which uses a contrastive matching loss to align VLA actions with WM video-dynamics latents, and designs an automatic LLM-driven skill-decomposition pipeline that splits high-level instructions into low-level prompts, yielding skill-compositional, temporally consistent world models (supported by RoboCasa-Skill and LIBERO-Skill). Result: Achieves state-of-the-art results on the RoboCasa and LIBERO benchmarks, improves real-world performance by 6.7%, and effectively enhances VLA policies such as GR00T-N1.6 and Cosmos Policy. Conclusion: By moving away from pixel supervision and improving the temporal scalability of world models, World2Act significantly improves the robustness and generalization of VLA policies in simulation and real environments, and offers a new direction for post-training embodied agents. Abstract: World Models (WMs) have emerged as a promising approach for post-training Vision-Language-Action (VLA) policies to improve robustness and generalization under environmental changes. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to pixel-level artifacts and hallucination from imperfect WM rollouts. We introduce World2Act, a post-training framework that aligns VLA actions directly with WM video-dynamics latents using a contrastive matching objective, reducing dependence on pixels. Post-training performance is tied to rollout quality, yet current WMs struggle with arbitrary-length video generation as they are mostly trained on fixed-length clips while robotic execution durations vary widely. To address this, we propose an automatic LLM-based skill-decomposition pipeline that segments high-level instructions into low-level prompts. Our pipeline produces RoboCasa-Skill and LIBERO-Skill, supporting skill-compositional WMs that remain temporally consistent across diverse task horizons. Empirically, applying World2Act to VLAs like GR00T-N1.6 and Cosmos Policy achieves state-of-the-art results on RoboCasa and LIBERO, and improves real-world performance by 6.7%, enhancing embodied agent generalization.
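
The abstract only names a "contrastive matching objective" without giving its form. One common choice for aligning two sets of embeddings is an InfoNCE-style loss, sketched below on plain Python lists. Treating the objective as InfoNCE, the cosine similarity, the `temperature` value, and the function names are all assumptions for illustration, not details taken from World2Act.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def info_nce(actions, latents, temperature=0.1):
    """InfoNCE loss: actions[i] should match latents[i] and no other latent."""
    total = 0.0
    for i, a in enumerate(actions):
        logits = [cosine(a, z) / temperature for z in latents]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_norm - logits[i]  # negative log-softmax of the true pair
    return total / len(actions)

# Hypothetical 2-D embeddings: aligned pairs versus deliberately swapped pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each action embedding is closest to its own latent and large when the pairs are swapped, which is the pressure a matching objective of this kind applies during post-training.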

[103] SignSparK: Efficient Multilingual Sign Language Production via Sparse Keyframe Learning

Jianhe Low,Alexandre Symeonidis-Herzig,Maksym Ivashechkin,Ozge Mercanoglu Sincan,Richard Bowden

Main category: cs.CV

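TL;DR: This paper proposes a new sign language production paradigm based on sparse keyframes. The FAST model automatically extracts precise temporal boundaries, and the conditional flow matching framework SignSparK efficiently synthesizes high-quality 3D signing motion in SMPL-X and MANO spaces. The approach supports Keyframe-to-Pose (KF2P) editing and multilingual extension, and uses 3D Gaussian Splatting for realistic rendering, significantly improving sign language production quality.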

Details

Motivation: Existing sign language production methods have clear weaknesses: direct text-to-pose models suffer from regression to the mean, while dictionary-retrieval methods produce stiff motion with disjointed transitions; a new approach is needed that delivers both realistic and fluid motion. Method: Proposes a training paradigm based on sparse keyframes; designs FAST, an ultra-efficient sign segmentation model that automatically mines precise temporal boundaries; builds SignSparK, a large-scale conditional flow matching model that synthesizes dense 3D signing sequences from keyframes in SMPL-X and MANO spaces; adopts a reconstruction-based CFM objective for high-fidelity synthesis in fewer than ten sampling steps; and integrates 3D Gaussian Splatting for photorealistic rendering. Result: SignSparK sets a new state of the art across multiple sign language production tasks and multilingual benchmarks; supports large-scale generation for four different sign languages; enables Keyframe-to-Pose (KF2P) generation with precise spatiotemporal editing; and markedly reduces regression to the mean while improving motion fluidity and linguistic accuracy. Conclusion: The sparse-keyframe modeling paradigm resolves the trade-off between realism and fluidity in sign language production; as the largest multilingual SLP framework to date, SignSparK combines efficiency, controllability, and high fidelity, offering a new paradigm for sign language avatar technology. Abstract: Generating natural and linguistically accurate sign language avatars remains a formidable challenge. Current Sign Language Production (SLP) frameworks face a stark trade-off: direct text-to-pose models suffer from regression-to-the-mean effects, while dictionary-retrieval methods produce robotic, disjointed transitions. To resolve this, we propose a novel training paradigm that leverages sparse keyframes to capture the true underlying kinematic distribution of human signing. By predicting dense motion from these discrete anchors, our approach mitigates regression-to-the-mean while ensuring fluid articulation. To realize this paradigm at scale, we first introduce FAST, an ultra-efficient sign segmentation model that automatically mines precise temporal boundaries. We then present SignSparK, a large-scale Conditional Flow Matching (CFM) framework that utilizes these extracted anchors to synthesize 3D signing sequences in SMPL-X and MANO spaces. This keyframe-driven formulation also uniquely unlocks Keyframe-to-Pose (KF2P) generation, making precise spatiotemporal editing of signing sequences possible. Furthermore, our adopted reconstruction-based CFM objective also enables high-fidelity synthesis in fewer than ten sampling steps; this allows SignSparK to scale across four distinct sign languages, establishing the largest multilingual SLP framework to date. Finally, by integrating 3D Gaussian Splatting for photorealistic rendering, we demonstrate through extensive evaluation that SignSparK establishes a new state-of-the-art across diverse SLP tasks and multilingual benchmarks.

[104] LCAMV: High-Accuracy 3D Reconstruction of Color-Varying Objects Using LCA Correction and Minimum-Variance Fusion in Structured Light

Wonbeen Oh,Jae-Sang Hyun

Main category: cs.CV

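TL;DR: This paper proposes LCAMV, which models and corrects lateral chromatic aberration pixel by pixel in both the projector and the camera, and fuses multi-channel phase data by minimum-variance estimation under a Poisson-Gaussian noise model, enabling high-accuracy 3D reconstruction of colored objects with a single projector-camera pair.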

Details

Motivation: In structured-light 3D reconstruction, lateral chromatic aberration (LCA) in optical components and uneven noise across RGB channels limit reconstruction accuracy for colored objects. Method: LCAMV has two parts: 1) analytical modeling and pixel-wise compensation of lateral chromatic aberration in the projector and the camera; 2) adaptive fusion of multi-channel phase data by minimum-variance estimation under a Poisson-Gaussian noise model. Result: Experiments on planar and non-planar colored surfaces show that LCAMV reduces depth error by up to 43.6% compared with grayscale conversion and conventional channel weighting. Conclusion: LCAMV is an efficient, robust 3D reconstruction method that needs no extra hardware or multiple exposures and is suited to high-accuracy reconstruction of nonuniformly colored objects. Abstract: Accurate 3D reconstruction of colored objects with structured light (SL) is hindered by lateral chromatic aberration (LCA) in optical components and uneven noise characteristics across RGB channels. This paper introduces lateral chromatic aberration correction and minimum-variance fusion (LCAMV), a robust 3D reconstruction method that operates with a single projector-camera pair without additional hardware or acquisition constraints. LCAMV analytically models and pixel-wise compensates LCA in both the projector and camera, then adaptively fuses multi-channel phase data using a Poisson-Gaussian noise model and minimum-variance estimation. Unlike existing methods that require extra hardware or multiple exposures, LCAMV enables fast acquisition. Experiments on planar and non-planar colored surfaces show that LCAMV outperforms grayscale conversion and conventional channel-weighting, reducing depth error by up to 43.6%. These results establish LCAMV as an effective solution for high-precision 3D reconstruction of nonuniformly colored objects.
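
Minimum-variance fusion of independent, unbiased estimates is usually done with inverse-variance weights, and the sketch below shows that textbook rule for a single pixel. It is a minimal illustration, not the LCAMV pipeline: the paper derives per-channel variances from a Poisson-Gaussian noise model, whereas here the variances are simply passed in, phase wrapping is ignored, and the function name and numbers are hypothetical.

```python
def min_variance_fuse(estimates, variances):
    """Fuse independent unbiased estimates with inverse-variance weights.

    Returns (fused_value, fused_variance); the fused variance is never larger
    than the smallest input variance.
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need exactly one variance per estimate")
    weights = [1.0 / v for v in variances]  # lower noise -> higher weight
    total = sum(weights)
    fused = sum(w * x for w, x in zip(weights, estimates)) / total
    return fused, 1.0 / total

# Hypothetical phase estimates (radians) for one pixel from the R, G, B channels;
# the G channel is assumed to be the least noisy.
phase, var = min_variance_fuse([1.00, 1.10, 0.90], [0.04, 0.01, 0.04])
```

With these numbers the fused phase is pulled toward the low-noise channel, and the fused variance falls below that of any single channel, which is why this weighting is preferred to a plain average or grayscale conversion when channel noise is uneven.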

[105] Learning to Wander: Improving the Global Image Geolocation Ability of LMMs via Actionable Reasoning

Yushuo Zheng,Huiyu Duan,Zicheng Zhang,Xiaohong Liu,Xiongkuo Min

Main category: cs.CV

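TL;DR: This paper presents WanderBench, the first global geolocation benchmark for embodied scenarios, and GeoAoT, a new geolocation framework that couples reasoning with embodied actions (such as rotation and movement) for active, exploration-based geolocation, significantly improving fine-grained localization accuracy and generalization in dynamic environments.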

Details

Motivation: Although large vision-language models possess rich world knowledge and complex reasoning abilities, their performance on geolocation has not been systematically explored; existing methods are mostly static recognition and lack embodied interaction and active exploration. Method: Builds the WanderBench benchmark, which contains 32K panoramas across six continents organized as navigable graphs; proposes the GeoAoT framework, which turns geolocation reasoning into executable action sequences (e.g., approaching landmarks, adjusting viewpoints), and designs a joint evaluation protocol that measures both geolocation accuracy and difficulty-aware questioning ability. Result: Validated on 19 large multimodal models, GeoAoT significantly outperforms baselines and stands out in fine-grained localization and generalization to dynamic environments. Conclusion: Together, WanderBench and GeoAoT establish a new "actionable, reasoning-driven" paradigm for embodied visual geolocation. Abstract: Geolocation, the task of identifying the geographic location of an image, requires abundant world knowledge and complex reasoning abilities. Though advanced large multimodal models (LMMs) have demonstrated such capabilities, their performance on the geolocation task remains unexplored. To this end, we introduce WanderBench, the first open access global geolocation benchmark designed for actionable geolocation reasoning in embodied scenarios. WanderBench contains over 32K panoramas across six continents, organized as navigable graphs that enable physical actions such as rotation and movement, transforming geolocation from static recognition into interactive exploration. Building on this foundation, we propose GeoAoT, a Geolocation framework with Action of Thought, which couples reasoning with embodied actions. Instead of generating textual reasoning chains, GeoAoT produces actionable plans, such as approaching landmarks or adjusting viewpoints, to actively reduce uncertainty. We further establish an evaluation protocol that jointly measures geolocation accuracy and difficulty-aware geolocation questioning ability. Experiments on 19 large multimodal models show that GeoAoT achieves superior fine-grained localization and stronger generalization in dynamic environments. WanderBench and GeoAoT define a new paradigm for actionable, reasoning-driven geolocation in embodied visual understanding.

[106] UniPINN: A Unified PINN Framework for Multi-task Learning of Diverse Navier-Stokes Equations

Dengdi Sun,Jie Chen,Xiao Wang,Jin Tang

Main category: cs.CV

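TL;DR: This paper proposes UniPINN, a framework that addresses three challenges physics-informed neural networks (PINNs) face in multi-flow settings: modeling shared physical laws alongside flow-specific characteristics, negative transfer, and loss imbalance. With a shared-specialized architecture, a cross-flow attention mechanism, and a dynamic weight allocation strategy, UniPINN achieves accurate, stable, and balanced predictions on several canonical flow tasks.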

Details

Motivation: Existing PINNs mainly target single-flow settings and do not extend directly to multi-flow scenarios: shared physics is hard to disentangle from flow-specific characteristics, negative transfer occurs across tasks, and differing loss scales across heterogeneous flows destabilize training. Method: Proposes the unified UniPINN framework with three components: 1) a shared-specialized network architecture that separates universal physical laws from flow-specific features; 2) a cross-flow attention mechanism that reinforces relevant patterns and suppresses irrelevant interference; 3) a dynamic weight allocation strategy that adaptively balances multi-objective losses. Result: In experiments on three canonical flows, UniPINN significantly improves prediction accuracy, unifies multi-flow modeling, mitigates negative transfer, and keeps performance balanced and training stable across flows. Conclusion: UniPINN provides a scalable, robust, and efficient PINN paradigm for multi-flow physical modeling and advances the use of PINNs in complex multi-physics problems. Abstract: Physics-Informed Neural Networks (PINNs) have shown promise in solving incompressible Navier-Stokes equations, yet existing approaches are predominantly designed for single-flow settings. When extended to multi-flow scenarios, these methods face three key challenges: (1) difficulty in simultaneously capturing both shared physical principles and flow-specific characteristics, (2) susceptibility to inter-task negative transfer that degrades prediction accuracy, and (3) unstable training dynamics caused by disparate loss magnitudes across heterogeneous flow regimes. To address these limitations, we propose UniPINN, a unified multi-flow PINN framework that integrates three complementary components: a shared-specialized architecture that disentangles universal physical laws from flow-specific features, a cross-flow attention mechanism that selectively reinforces relevant patterns while suppressing task-irrelevant interference, and a dynamic weight allocation strategy that adaptively balances loss contributions to stabilize multi-objective optimization. Extensive experiments on three canonical flows demonstrate that UniPINN effectively unifies multi-flow learning, achieving superior prediction accuracy and balanced performance across heterogeneous regimes while successfully mitigating negative transfer. The source code of this paper will be released on https://github.com/Event-AHU/OpenFusion
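
The abstract does not state the rule behind its dynamic weight allocation. To make the "disparate loss magnitudes" problem concrete, the sketch below uses one simple, widely used heuristic: weight each flow's loss by the inverse of its current magnitude so that every flow contributes equally to the total. This is an assumed stand-in, not UniPINN's strategy, and the function name and loss values are hypothetical.

```python
def balance_weights(losses, eps=1e-12):
    """Weights inversely proportional to each loss, scaled to sum to len(losses).

    In training, the weights would be treated as constants (no gradient flows
    through them) and recomputed every few steps.
    """
    inv = [1.0 / (abs(l) + eps) for l in losses]
    scale = len(losses) / sum(inv)
    return [w * scale for w in inv]

# Hypothetical per-flow losses that differ by four orders of magnitude.
losses = [100.0, 1.0, 0.01]
weights = balance_weights(losses)
contributions = [w * l for w, l in zip(weights, losses)]
```

Without reweighting, the first flow would dominate the gradient almost entirely; after reweighting, the three contributions are equal, so no single regime drives the shared parameters.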

[107] Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression

Hamidreza Dastmalchi,Aijun An,Ali Cheraghian,Hamed Barzamini

Main category: cs.CV

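TL;DR: This paper proposes CIPHER, which builds the counterfactual image dataset OHC-25K to identify and suppress vision-induced hallucinations in large vision-language models (LVLMs). Without any additional training, it corrects hidden states at inference time through a low-rank subspace projection, significantly reducing hallucination rates without harming task performance.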

Details

Motivation: Large vision-language models (LVLMs) often produce vision-induced hallucinations that are inconsistent with the image input; existing training-free methods mostly address text-induced hallucinations, and effective solutions for hallucinations arising from the visual modality are lacking. Method: CIPHER has an offline phase and an inference phase: offline, it uses a diffusion model to edit images and build the counterfactual dataset OHC-25K, then contrasts the resulting representations with those of authentic image-caption pairs to identify a low-rank subspace associated with visual hallucination; at inference, it projects the LVLM's intermediate hidden states onto the orthogonal complement of that subspace to suppress hallucinations. Result: Across multiple benchmarks, CIPHER significantly reduces hallucination rates while preserving task performance, confirming that counterfactual visual perturbations improve LVLM faithfulness. Conclusion: CIPHER is a training-free, lightweight, and efficient method designed specifically to suppress vision-induced hallucinations; it shows that such hallucinations have a structured low-rank character and offers a generalizable feature-level correction paradigm. Abstract: While large vision-language models (LVLMs) achieve strong performance on multimodal tasks, they frequently generate hallucinations -- unfaithful outputs misaligned with the visual input. To address this issue, we introduce CIPHER (Counterfactual Image Perturbations for Hallucination Extraction and Removal), a training-free method that suppresses vision-induced hallucinations via lightweight feature-level correction. Unlike prior training-free approaches that primarily focus on text-induced hallucinations, CIPHER explicitly targets hallucinations arising from the visual modality. CIPHER operates in two phases. In the offline phase, we construct OHC-25K (Object-Hallucinated Counterfactuals, 25,000 samples), a counterfactual dataset consisting of diffusion-edited images that intentionally contradict the original ground-truth captions. We pair these edited images with the unchanged ground-truth captions and process them through an LVLM to extract hallucination-related representations. Contrasting these representations with those from authentic (image, caption) pairs reveals structured, systematic shifts spanning a low-rank subspace characterizing vision-induced hallucination. In the inference phase, CIPHER suppresses hallucinations by projecting intermediate hidden states away from this subspace. Experiments across multiple benchmarks show that CIPHER significantly reduces hallucination rates while preserving task performance, demonstrating the effectiveness of counterfactual visual perturbations for improving LVLM faithfulness. Code and additional materials are available at https://hamidreza-dastmalchi.github.io/cipher-cvpr2026/.
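
"Projecting hidden states away from a subspace" can be read as removing the component of a vector that lies in the span of some basis, i.e. h' = h - U Uᵀ h for an orthonormal basis U. The sketch below implements that operation in plain Python. How CIPHER estimates the subspace from OHC-25K is not shown; the direction vectors, the hidden state, and the function names are made-up illustrations.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis

def project_away(hidden, basis):
    """Remove from `hidden` its component inside span(basis).

    `basis` must be orthonormal, so each coefficient can be computed
    independently from the original vector.
    """
    out = list(hidden)
    for b in basis:
        c = dot(hidden, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out

# Hypothetical rank-2 "hallucination" subspace inside a 3-D hidden space.
basis = orthonormalize([[1.0, 1.0, 0.0], [0.0, 0.0, 2.0]])
corrected = project_away([3.0, 1.0, 5.0], basis)
```

After the projection the corrected state has no component along either direction while everything orthogonal to the subspace is untouched, which matches the abstract's description of a lightweight feature-level correction that needs no training.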

[108] StructDamage: A Large Scale Unified Crack and Surface Defect Dataset for Robust Structural Damage Detection

Misbah Ijaz,Saif Ur Rehman Khan,Abd Ur Rehman,Sebastian Vollmer,Andreas Dengel,Muhammad Nabeel Asim

Main category: cs.CV

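TL;DR: This paper presents StructDamage, a new dataset of approximately 78,093 structural damage images spanning nine surface types, and reports baseline classification experiments with 15 deep learning models, reaching a top accuracy of 98.62%; the aim is to improve how well crack detection models generalize to real-world conditions.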

Details

Motivation: Existing public crack datasets suffer from limited geographic diversity, few surface types, inconsistent scale, and inconsistent labeling, so models generalize poorly to real-world environments. Method: Systematically aggregates and reannotates images from 32 public datasets to build StructDamage, which covers nine surface types, and benchmarks image classification with 15 deep learning architectures. Result: Twelve models exceed a macro F1-score of 0.96, and DenseNet201 reaches 98.62% classification accuracy. Conclusion: StructDamage is a comprehensive, well-structured, and thoroughly documented dataset that supports reproducible research and the fair evaluation and development of robust crack detection methods. Abstract: Automated detection and classification of structural cracks and surface defects is a critical challenge in civil engineering, infrastructure maintenance, and heritage preservation. Recent advances in Computer Vision (CV) and Deep Learning (DL) have significantly improved automatic crack detection. However, these methods rely heavily on large, diverse, and carefully curated datasets that include various crack types across different surface materials. Many existing public crack datasets lack geographic diversity, surface types, scale, and labeling consistency, making it challenging for trained algorithms to generalize effectively in real-world conditions. We provide a novel dataset, StructDamage, a curated collection of approximately 78,093 images spanning nine surface types: walls, tile, stone, road, pavement, deck, concrete, and brick. The dataset was constructed by systematically aggregating, harmonizing, and reannotating images from 32 publicly available datasets covering concrete structures, asphalt pavements, masonry walls, bridges, and historic buildings. All images are organized in a folder level classification hierarchy suitable for training Convolutional Neural Networks (CNNs) and Vision Transformers. To highlight the practical value of the dataset, we present baseline classification results using fifteen DL architectures from six model families, with twelve achieving macro F1-scores over 0.96. The best-performing model, DenseNet201, achieves 98.62% accuracy. The proposed dataset provides a comprehensive and versatile resource suitable for classification tasks. With thorough documentation and a standard structure, it is designed to promote reproducible research and support the development and fair evaluation of robust crack damage detection approaches.
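
The baselines above are reported in macro F1, which averages per-class F1 scores with equal weight so that small classes count as much as large ones. The sketch below computes it from scratch. The class names are borrowed from the surface types listed in the abstract, but the labels and predictions are toy values invented for illustration, not results from the dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example: one "brick" image is misclassified as "tile".
y_true = ["brick", "brick", "tile", "tile", "road", "road"]
y_pred = ["brick", "tile", "tile", "tile", "road", "road"]
score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here a single error lowers the F1 of two classes at once, so macro F1 ends up below plain accuracy; on an imbalanced multi-class dataset the gap between the two can be much larger, which is why both are worth reporting.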

[109] Spatial self-supervised Peak Learning and correlation-based Evaluation of peak picking in Mass Spectrometry Imaging

Philipp Weigand,Nikolas Ebert,Shad A. Mohammed,Denis Abu Sammour,Carsten Hopf,Oliver Wasenmüller

Main category: cs.CV

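TL;DR: This paper proposes an autoencoder-based spatial self-supervised peak learning neural network for peak picking in mass spectrometry imaging (MSI). It learns an attention mask from combined spatial and spectral information to select spatially structured peaks, and introduces an evaluation method based on expert-annotated segmentation masks, significantly improving the consistency and biological relevance of peak picking.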

Details

Motivation: Existing peak picking methods perform inconsistently on heterogeneous MSI data, and their evaluation is often limited to synthetic data or manually selected ion images, which do not fully reflect real-world challenges. Method: Proposes an autoencoder-based spatial self-supervised neural network that jointly uses spatial and spectral information to learn an attention mask for picking spatially structured peaks, and designs a new evaluation procedure based on expert-annotated segmentation masks. Result: Validated on four public MSI datasets, the method consistently outperforms state-of-the-art peak picking methods, and the selected peaks are more spatially structured and biologically relevant; the new evaluation procedure transfers across datasets and is robust. Conclusion: The proposed spatial self-supervised network and its accompanying evaluation framework offer a more reliable, comparable, and biologically interpretable approach to MSI peak picking. Abstract: Mass spectrometry imaging (MSI) enables label-free visualization of molecular distributions across tissue samples but generates large and complex datasets that require effective peak picking to reduce data size while preserving meaningful biological information. Existing peak picking approaches perform inconsistently across heterogeneous datasets, and their evaluation is often limited to synthetic data or manually selected ion images that do not fully represent real-world challenges in MSI. To address these limitations, we propose an autoencoder-based spatial self-supervised peak learning neural network that selects spatially structured peaks by learning an attention mask leveraging both spatial and spectral information. We further introduce an evaluation procedure based on expert-annotated segmentation masks, allowing a more representative and spatially grounded assessment of peak picking performance. We evaluate our approach on four diverse public MSI datasets using our proposed evaluation procedure. Our approach consistently outperforms state-of-the-art peak picking methods by selecting spatially structured peaks, thus demonstrating its efficacy. These results highlight the value of our spatial self-supervised network in comparison to contemporary state-of-the-art methods. The evaluation procedure can be readily applied to new MSI datasets, thereby providing a consistent and robust framework for the comparison of spatially structured peak picking methods across different datasets.

Jiahao Lyu,Pei Fu,Zhenhang Li,Weichao Zeng,Shaojie Zhan,Jiahui Yang,Can Ma,Yu Zhou,Zhenbo Luo,Jian Luan

Main category: cs.CV

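TL;DR: This paper presents IMTBench, a new benchmark for end-to-end in-image machine translation with 2,500 real-world samples covering four practical scenarios and nine languages. It introduces a cross-modal alignment metric and reveals significant performance gaps for existing models on natural scenes and low-resource languages.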

Details

Motivation: Existing IIMT benchmarks are mostly synthetic and do not reflect real-world complexity; evaluation protocols look only at single-modality metrics and overlook cross-modal consistency between the rendered text and the model output. Method: Builds the IMTBench benchmark with 2,500 image translation samples covering four practical scenarios and nine languages, and designs a multi-aspect evaluation that includes translation quality, background preservation, overall image quality, and a newly proposed cross-modal alignment score. Result: Evaluation of commercial cascade systems and closed- and open-source unified multimodal models shows large performance differences across scenarios and languages, with especially weak results on natural scenes and low-resource languages, indicating substantial room for improvement. Conclusion: IMTBench provides a more realistic and more comprehensive evaluation standard for end-to-end in-image machine translation and should help advance this emerging direction. Abstract: End-to-end In-Image Machine Translation (IIMT) aims to convert text embedded within an image into a target language while preserving the original visual context, layout, and rendering style. However, existing IIMT benchmarks are largely synthetic and thus fail to reflect real-world complexity, while current evaluation protocols focus on single-modality metrics and overlook cross-modal faithfulness between rendered text and model outputs. To address these shortcomings, we present In-image Machine Translation Benchmark (IMTBench), a new benchmark of 2,500 image translation samples covering four practical scenarios and nine languages. IMTBench supports multi-aspect evaluation, including translation quality, background preservation, overall image quality, and a cross-modal alignment score that measures consistency between the translated text produced by the model and the text rendered in the translated image. We benchmark strong commercial cascade systems, and both closed- and open-source unified multi-modal models, and observe large performance gaps across scenarios and languages, especially on natural scenes and resource-limited languages, highlighting substantial headroom for end-to-end image text translation. We hope IMTBench establishes a standardized benchmark to accelerate progress in this emerging task.

[111] UHD Image Deblurring via Autoregressive Flow with Ill-conditioned Constraints

Yucheng Xin,Dawei Zhao,Xiang Chen,Chen Wu,Pu Wang,Dianjie Lu,Guijuan Zhang,Xiuyi Jia,Zhuoran Zheng

Main category: cs.CV

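TL;DR: This paper proposes an autoregressive flow method with an ill-conditioning constraint for ultra-high-definition (UHD) image deblurring. It reconstructs progressively from coarse to fine across scales, models residuals with flow matching, and regularizes the condition number of an attention matrix to improve numerical stability and cross-scale consistency.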

Details

Motivation: UHD image deblurring must deliver both detail recovery and inference efficiency, and existing methods face a trade-off between computational cost and the ability to generate fine-grained detail. Method: Proposes a multi-scale progressive reconstruction framework based on autoregressive flow: the estimate at each scale is the upsampled result of the previous scale plus a residual for the current scale; Flow Matching models residual generation as a conditional vector field and is sampled efficiently with a few-step ODE solver (Euler/Heun); a condition-number regularizer on a feature-induced attention matrix suppresses ill-conditioning. Result: Achieves strong performance on blurred images at 4K (3840×2160) and higher resolutions, combining rich detail with efficient inference and improving numerical stability and cross-scale consistency. Conclusion: The method eases the tension between detail recovery and computational efficiency in UHD image deblurring and offers a stable, efficient, and scalable approach to high-resolution inverse problems. Abstract: Ultra-high-definition (UHD) image deblurring poses significant challenges for UHD restoration methods, which must balance fine-grained detail recovery and practical inference efficiency. Although prominent discriminative and generative methods have achieved remarkable results, a trade-off persists between computational cost and the ability to generate fine-grained detail for UHD image deblurring tasks. To further alleviate these issues, we propose a novel autoregressive flow method for UHD image deblurring with an ill-conditioned constraint. Our core idea is to decompose UHD restoration into a progressive, coarse-to-fine process: at each scale, the sharp estimate is formed by upsampling the previous-scale result and adding a current-scale residual, enabling stable, stage-wise refinement from low to high resolution. We further introduce Flow Matching to model residual generation as a conditional vector field and perform few-step ODE sampling with efficient Euler/Heun solvers, enriching details while keeping inference affordable. Since multi-step generation at UHD can be numerically unstable, we propose an ill-conditioning suppression scheme by imposing condition-number regularization on a feature-induced attention matrix, improving convergence and cross-scale consistency. Our method demonstrates promising performance on blurred images at 4K (3840×2160) or higher resolutions.
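
Few-step ODE sampling means integrating dx/dt = v(x, t) from t = 0 to t = 1 with only a handful of solver steps. The sketch below contrasts the Euler and Heun (second-order) updates on a scalar toy problem. The vector field here is a simple analytic stand-in with a known solution, not a learned flow-matching network, and all names are hypothetical.

```python
import math

def euler_step(v, x, t, h):
    """First-order update: follow the velocity at the current point."""
    return x + h * v(x, t)

def heun_step(v, x, t, h):
    """Second-order update: average the velocity at the start and at an Euler
    prediction of the end point (costs two evaluations of v per step)."""
    k1 = v(x, t)
    k2 = v(x + h * k1, t + h)
    return x + 0.5 * h * (k1 + k2)

def sample(v, x0, steps, step_fn):
    """Integrate from t=0 to t=1 in `steps` equal steps."""
    x, h = x0, 1.0 / steps
    for i in range(steps):
        x = step_fn(v, x, i * h, h)
    return x

# Toy field dx/dt = -x with exact solution x(1) = x0 * exp(-1).
field = lambda x, t: -x
exact = math.exp(-1.0)
euler_err = abs(sample(field, 1.0, 4, euler_step) - exact)
heun_err = abs(sample(field, 1.0, 4, heun_step) - exact)
```

With four steps Heun lands roughly an order of magnitude closer to the exact value than Euler on this toy field, at twice the number of function evaluations per step. That accuracy-versus-cost trade-off is what matters when each evaluation is a full network pass at UHD resolution.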

[112] Visually-Guided Controllable Medical Image Generation via Fine-Grained Semantic Disentanglement

Xin Huang,Junjie Liang,Qingshan Hou,Peng Cao,Jinzhu Yang,Xiaoli Liu,Osmar R. Zaiane

Main category: cs.CV

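TL;DR: This paper proposes a visually guided text disentanglement framework (VG-MedGen) that uses cross-modal latent alignment and a hybrid feature fusion module to disentangle text semantics in medical image synthesis, improving generation quality and downstream task performance.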

Details

Motivation: General text-to-image models applied to medical image synthesis suffer from a large modality gap and from entangled text semantics (anatomical structure confused with imaging style), which leads to poor controllability and difficult fine-tuning. Method: Proposes a visually guided text disentanglement framework: 1) a cross-modal latent alignment mechanism that uses visual priors to explicitly disentangle unstructured clinical text into independent semantic representations; 2) a Hybrid Feature Fusion Module (HFFM) that injects the disentangled features into a Diffusion Transformer (DiT) through separate channels. Result: Experiments on three datasets show that the method outperforms existing approaches in generation quality and significantly improves downstream classification performance. Conclusion: Visually guided text disentanglement mitigates the modality gap and semantic entanglement in medical image synthesis and makes generation more controllable and practical. Abstract: Medical image synthesis is crucial for alleviating data scarcity and privacy constraints. However, fine-tuning general text-to-image (T2I) models remains challenging, mainly due to the significant modality gap between complex visual details and abstract clinical text. In addition, semantic entanglement persists, where coarse-grained text embeddings blur the boundary between anatomical structures and imaging styles, thus weakening controllability during generation. To address this, we propose a Visually-Guided Text Disentanglement framework. We introduce a cross-modal latent alignment mechanism that leverages visual priors to explicitly disentangle unstructured text into independent semantic representations. Subsequently, a Hybrid Feature Fusion Module (HFFM) injects these features into a Diffusion Transformer (DiT) through separated channels, enabling fine-grained structural control. Experimental results on three datasets demonstrate that our method outperforms existing approaches in terms of generation quality and significantly improves performance on downstream classification tasks. The source code is available at https://github.com/hx111/VG-MedGen.

[113] Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in Whole-Slide Image Prognosis

Pei Liu,Xiangxiang Zeng,Tengfei Ma,Yucheng Xing,Xuanbai Ren,Yiping Liu

Main category: cs.CV

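TL;DR: This paper proposes STEPH, a new method that uses sparse task vector mixup with hypernetworks to efficiently transfer generalizable prognostic knowledge from other cancer types and improve prediction for the target cancer type, without large-scale joint training or multi-model inference.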

Details

Motivation: Training samples for a single cancer type are scarce in pathology, so models struggle to learn generalizable knowledge, which hurts prognosis prediction for highly heterogeneous tumors; existing cross-cancer learning methods are computationally expensive. Method: Proposes Sparse Task Vector Mixup with Hypernetworks (STEPH): 1) apply task vector mixup to each source-target cancer pair; 2) use hypernetworks to sparsely aggregate the mixed task vectors into an improved target model. Result: On 13 cancer datasets, STEPH improves over cancer-specific learning and an existing knowledge transfer baseline by 5.14% and 2.01%, respectively, while being more computationally efficient. Conclusion: STEPH is an efficient, lightweight approach to cross-cancer knowledge transfer that significantly improves the performance and generalization of WSI-based prognosis prediction. Abstract: Whole-Slide Images (WSIs) are widely used for estimating the prognosis of cancer patients. Current studies generally follow a cancer-specific learning paradigm. However, the available training samples for one cancer type are usually scarce in pathology. Consequently, the model often struggles to learn generalizable knowledge, thus performing worse on the tumor samples with inherent high heterogeneity. Although multi-cancer joint learning and knowledge transfer approaches have been explored recently to address it, they either rely on large-scale joint training or extensive inference across multiple models, posing new challenges in computational efficiency. To this end, this paper proposes a new scheme, Sparse Task Vector Mixup with Hypernetworks (STEPH). Unlike previous ones, it efficiently absorbs generalizable knowledge from other cancers for the target via model merging: i) applying task vector mixup to each source-target pair and then ii) sparsely aggregating task vector mixtures to obtain an improved target model, driven by hypernetworks. Extensive experiments on 13 cancer datasets show that STEPH improves over cancer-specific learning and an existing knowledge transfer baseline by 5.14% and 2.01%, respectively. Moreover, it is a more efficient solution for learning prognostic knowledge from other cancers, without requiring large-scale joint training or extensive multi-model inference. Code is publicly available at https://github.com/liupei101/STEPH.
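
In model merging, a task vector is conventionally the difference between fine-tuned and pre-trained weights, and a mixup of two task vectors is a convex combination of them. The sketch below shows that arithmetic on tiny dictionaries of floats. It is a simplified illustration under stated assumptions: in STEPH the mixing and the sparse aggregation are driven by hypernetworks, whereas here the coefficient `lam` is a fixed scalar, and the weight values and function names are hypothetical.

```python
def task_vector(finetuned, pretrained):
    """Per-parameter difference: fine-tuned weights minus pre-trained weights."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def mixup(tau_target, tau_source, lam):
    """Convex combination of two task vectors; lam=1 keeps only the target."""
    return {k: lam * tau_target[k] + (1.0 - lam) * tau_source[k] for k in tau_target}

def apply_vector(pretrained, tau):
    """Add a task vector back onto the shared pre-trained weights."""
    return {k: pretrained[k] + tau[k] for k in pretrained}

# Hypothetical weights: one shared pre-trained model, two fine-tuned variants.
pretrained = {"w": 1.0, "b": 0.0}
target_ft = {"w": 2.0, "b": 1.0}   # fine-tuned on the target cancer type
source_ft = {"w": 3.0, "b": -1.0}  # fine-tuned on a source cancer type

tau_mix = mixup(task_vector(target_ft, pretrained),
                task_vector(source_ft, pretrained), lam=0.75)
merged = apply_vector(pretrained, tau_mix)
```

Because the merge happens in weight space, the result is a single model of the original size, which is why this style of transfer avoids both joint training and running several models at inference time.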

[114] DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime

Julian Lorenz,Vladyslav Kovganko,Elias Kohout,Mrunmai Phatak,Daniel Kienzle,Rainer Lienhart

Main category: cs.CV

TL;DR: DSFlash is a low-latency, resource-efficient panoptic scene graph generation model that achieves 56 FPS on an RTX 3090 without sacrificing accuracy and can be trained in under 24 hours on a GTX 1080.

Details

Motivation: Existing Scene Graph Generation (SGG) methods lack speed and resource efficiency needed for real-world deployment—especially on edge devices—while also often limiting relationships to only salient ones. Method: The paper introduces DSFlash, a novel low-latency model for panoptic SGG, designed for high throughput, comprehensive relationship modeling, and lightweight training. Result: DSFlash processes video at 56 FPS on an RTX 3090, matches state-of-the-art accuracy, computes full scene graphs (not just salient relationships), and trains in <24 hours on a GTX 1080. Conclusion: DSFlash bridges the gap between high-performance SGG and practical deployment constraints, enabling efficient, accessible, and rich contextual understanding for downstream tasks like embodied reasoning. Abstract: Scene Graph Generation (SGG) aims to extract a detailed graph structure from an image, a representation that holds significant promise as a robust intermediate step for complex downstream tasks like reasoning for embodied agents. However, practical deployment in real-world applications - especially on resource constrained edge devices - requires speed and resource efficiency, challenges that have received limited attention in existing research. To bridge this gap, we introduce DSFlash, a low-latency model for panoptic scene graph generation designed to overcome these limitations. DSFlash can process a video stream at 56 frames per second on a standard RTX 3090 GPU, without compromising performance against existing state-of-the-art methods. Crucially, unlike prior approaches that often restrict themselves to salient relationships, DSFlash computes comprehensive scene graphs, offering richer contextual information while maintaining its superior latency. Furthermore, DSFlash is light on resources, requiring less than 24 hours to train on a single, nine-year-old GTX 1080 GPU. 
This accessibility makes DSFlash particularly well-suited for researchers and practitioners operating with limited computational resources, empowering them to adapt and fine-tune SGG models for specialized applications.

[115] Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation

Caroline Magg,Maaike A. ter Wee,Johannes G. G. Dobbe,Geert J. Streekstra,Leendert Blankevoort,Clara I. Sánchez,Hoel Kervadec

Main category: cs.CV

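TL;DR: This paper systematically benchmarks 11 promptable foundation models (FMs) on bone and implant segmentation, covering non-iterative 2D/3D prompting strategies, evaluation with real human prompts, and an observer-consistency analysis. Model performance depends strongly on prompt quality and anatomical complexity, and models are broadly prompt-sensitive and lose performance in realistic human-in-the-loop settings.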

Details

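Motivation: The number of promptable foundation models is growing rapidly, but evaluations differ in datasets, metrics, and compared models, which makes direct comparison difficult and complicates choosing the best model for a clinical task. Method: Eleven promptable FMs are tested on private and public datasets for bone and implant segmentation in four anatomical regions (wrist, shoulder, hip, and lower leg) using non-iterative 2D and 3D prompting strategies; real human prompts collected in a dedicated observer study are used for further analysis; Pareto optimality is used to select the best-performing models, and localization accuracy and rater consistency are assessed. Result: 1) Performance varies considerably across models and prompting strategies; 2) the Pareto-optimal models are SAM and SAM2.1 in 2D and nnInteractive and Med-SAM2 in 3D; 3) rater consistency decreases as anatomical complexity increases (high for wrist bones, low for pelvis, tibia, and implants); 4) performance drops with human prompts, indicating that evaluation with prompts extracted from ideal labels overestimates real-world performance; 5) all models are sensitive to prompt variations, and only two models show intra-rater robustness, which does not extend to inter-rater settings. Conclusion: In realistic human-in-the-loop settings, even high-performing FMs remain highly sensitive to variation in human prompts, and choosing the best model remains challenging; evaluation paradigms closer to clinical practice and methods for improving robustness are needed. Abstract: Promptable Foundation Models (FMs), initially introduced for natural image segmentation, have also revolutionized medical image segmentation. The increasing number of models, along with evaluations varying in datasets, metrics, and compared models, makes direct performance comparison between models difficult and complicates the selection of the most suitable model for specific clinical tasks. In our study, 11 promptable FMs are tested using non-iterative 2D and 3D prompting strategies on a private and public dataset focusing on bone and implant segmentation in four anatomical regions (wrist, shoulder, hip and lower leg). The Pareto-optimal models are identified and further analyzed using human prompts collected through a dedicated observer study. Our findings are: 1) The segmentation performance varies a lot between FMs and prompting strategies; 2) The Pareto-optimal models in 2D are SAM and SAM2.1, in 3D nnInteractive and Med-SAM2; 3) Localization accuracy and rater consistency vary with anatomical structures, with higher consistency for simple structures (wrist bones) and lower consistency for complex structures (pelvis, tibia, implants); 4) The segmentation performance drops using human prompts, suggesting that performance reported on "ideal" prompts extracted from reference labels might overestimate the performance in a human-driven setting; 5) All models were sensitive to prompt variations. 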
While two models demonstrated intra-rater robustness, it did not scale to inter-rater settings. We conclude that the selection of the most optimal FM for a human-driven setting remains challenging, with even high-performing FMs being sensitive to variations in human input prompts. Our code base for prompt extraction and model inference is available: https://github.com/CarolineMagg/segmentation-FM-benchmark/
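
The Pareto-optimal selection mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which objectives were traded off, so the (accuracy, speed) pairs and model names below are hypothetical placeholders, and this is only the generic non-dominated filter, not the authors' code.

```python
from typing import Dict, List, Tuple


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the names whose score tuple is not dominated (higher is better)."""
    return sorted(
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    )


# Hypothetical (accuracy, speed) scores; not values from the paper.
toy_scores = {
    "model_a": (0.90, 0.20),
    "model_b": (0.85, 0.60),
    "model_c": (0.80, 0.50),  # dominated by model_b on both objectives
    "model_d": (0.70, 0.90),
}
print(pareto_front(toy_scores))
```

A model stays on the front as long as no other model is at least as good on every objective and strictly better on one, which is why more than one model can be "optimal" per setting.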

[116] Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues

Mohammed Salah,Eman Ouda,Giuseppe Dell'Avvocato,Fabrizio Sarasini,Ester D'Accardi,Jorge Dias,Davor Svetinovic,Stefano Sfarra,Yusra Abdulrahman

Main category: cs.CV

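TL;DR: This paper proposes a zero-shot defect analysis framework for active infrared thermography (AIRT) built on pretrained vision-language models (VLMs) and a lightweight adapter, enabling generative understanding and localization of subsurface defects in carbon fiber-reinforced polymers (CFRP) without large annotated datasets.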

Details

Motivation: AI methods for AIRT are limited by the time-consuming and expensive construction of CFRP thermographic training datasets, so data-efficient approaches requiring little or no training are needed. Method: A language-guided cognitive defect analysis framework is proposed that couples pretrained multimodal VLM encoders with a lightweight AIRT-VLM adapter, which bridges the domain gap between thermographic and natural images and enables zero-shot defect understanding and localization. Result: Validated on 25 real CFRP impact-damage sequences, the AIRT-VLM adapter improves signal-to-noise ratio by more than 10 dB, and zero-shot detection reaches an IoU of 70%. Conclusion: The framework removes the dependence of conventional supervised learning on annotated data and offers an efficient, generalizable zero-shot paradigm for thermographic analysis in industrial non-destructive testing. Abstract: Active infrared thermography (AIRT) is currently witnessing a surge of artificial intelligence (AI) methodologies being deployed for automated subsurface defect analysis of high performance carbon fiber-reinforced polymers (CFRP). Deploying AI-based AIRT methodologies for inspecting CFRPs requires the creation of time consuming and expensive datasets of CFRP inspection sequences to train neural networks. To address this challenge, this work introduces a novel language-guided framework for cognitive defect analysis in CFRPs using AIRT and vision-language models (VLMs). Unlike conventional learning-based approaches, the proposed framework does not require developing training datasets for extensive training of defect detectors; instead, it relies solely on pretrained multimodal VLM encoders coupled with a lightweight adapter to enable generative zero-shot understanding and localization of subsurface defects. By leveraging pretrained multimodal encoders, the proposed system enables generative zero-shot understanding of thermographic patterns and automatic detection of subsurface defects. Given the domain gap between thermographic data and natural images used to train VLMs, an AIRT-VLM Adapter is proposed to enhance the visibility of defects while aligning the thermographic domain with the learned representations of VLMs. The proposed framework is validated using three representative VLMs; specifically, GroundingDINO, Qwen-VL-Chat, and CogVLM. Validation is performed on 25 CFRP inspection sequences with impacts introduced at different energy levels, reflecting realistic defects encountered in industrial scenarios. 
Experimental results demonstrate that the AIRT-VLM adapter achieves signal-to-noise ratio (SNR) gains exceeding 10 dB compared with conventional thermographic dimensionality-reduction methods, while enabling zero-shot defect detection with intersection-over-union values reaching 70%.
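
To make the two reported quantities concrete, the sketch below computes intersection-over-union for axis-aligned boxes and converts a power ratio to decibels. It assumes boxes in (x1, y1, x2, y2) form and a power-ratio definition of SNR; the abstract does not specify how SNR or IoU were computed for thermograms, so treat both helpers as illustrative assumptions.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed convention


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def power_ratio_to_db(ratio: float) -> float:
    """Convert a power ratio to decibels."""
    return 10.0 * math.log10(ratio)


# Toy boxes: the prediction covers 70 of the 100 reference pixels and nothing else.
reference = (0.0, 0.0, 10.0, 10.0)
predicted = (0.0, 0.0, 10.0, 7.0)
print(box_iou(reference, predicted))
print(power_ratio_to_db(10.0))
```

Under the power-ratio definition, a 10 dB gain corresponds to a tenfold improvement in the signal-to-noise power ratio.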

[117] P-GSVC: Layered Progressive 2D Gaussian Splatting for Scalable Image and Video

Longan Wang,Yuang Shi,Wei Tsang Ooi

Main category: cs.CV

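TL;DR: This paper proposes P-GSVC, the first layered progressive 2D Gaussian splatting framework for scalable Gaussian representation of images and videos; a joint training strategy aligns optimization across layers and markedly improves PSNR.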

Details

Motivation: Gaussian splatting lacks scalability (in quality and resolution) for image and video reconstruction; the work aims at a unified progressive representation. Method: A layered structure (a base layer plus enhancement layers) and a joint training strategy that updates the Gaussian parameters of all layers simultaneously to ensure inter-layer compatibility and stable progressive reconstruction. Result: Compared with sequential layer-wise training, PSNR improves by up to 1.9 dB for video and up to 2.6 dB for images. Conclusion: P-GSVC delivers a high-quality, high-resolution, scalable 2D Gaussian representation with strong performance and practicality in image and video reconstruction. Abstract: Gaussian splatting has emerged as a competitive explicit representation for image and video reconstruction. In this work, we present P-GSVC, the first layered progressive 2D Gaussian splatting framework that provides a unified solution for scalable Gaussian representation in both images and videos. P-GSVC organizes 2D Gaussian splats into a base layer and successive enhancement layers, enabling coarse-to-fine reconstructions. To effectively optimize this layered representation, we propose a joint training strategy that simultaneously updates Gaussians across layers, aligning their optimization trajectories to ensure inter-layer compatibility and a stable progressive reconstruction. P-GSVC supports scalability in terms of both quality and resolution. Our experiments show that the joint training strategy can gain up to 1.9 dB improvement in PSNR for video and 2.6 dB improvement in PSNR for image when compared to methods that perform sequential layer-wise training. Project page: https://longanwang-cs.github.io/PGSVC-webpage/
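
As a reading aid for the PSNR figures, here is a minimal sketch of the standard PSNR formula applied to a toy coarse-to-fine reconstruction. The pixel values and the two "layers" are made up for illustration and are unrelated to the P-GSVC implementation.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], reconstruction: Sequence[float], peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized signals."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


# Toy 4-pixel signal: a coarse base-layer reconstruction and a refined one.
reference = [10.0, 20.0, 30.0, 40.0]
base_only = [8.0, 18.0, 28.0, 38.0]           # per-pixel error 2, MSE 4
base_plus_enhancement = [9.0, 19.0, 29.0, 39.0]  # per-pixel error 1, MSE 1

gain_db = psnr(reference, base_plus_enhancement) - psnr(reference, base_only)
print(round(gain_db, 2))
```

Because PSNR is logarithmic in the mean squared error, a gain of g dB at a fixed peak value corresponds to an MSE lower by a factor of 10^(g/10), so the reported 1.9 dB and 2.6 dB gains correspond to roughly 1.55x and 1.82x lower MSE.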

[118] PET-F2I: A Comprehensive Benchmark and Parameter-Efficient Fine-Tuning of LLMs for PET/CT Report Impression Generation

Yuchen Liu,Wenbo Zhang,Liling Peng,Yichi Zhang,Yu Fu,Xin Guo,Chao Qu,Yuan Qi,Le Xue

Main category: cs.CV

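TL;DR: This paper presents the PET-F2I-41K benchmark dataset and the PET-F2I-7B model to improve the accuracy and clinical reliability of diagnostic impression generation for PET/CT reports, and designs three clinically oriented evaluation metrics.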

Details

Motivation: Writing diagnostic impressions for PET/CT reports is labor-intensive, the capability of existing LLMs in this highly specialized medical domain is underexplored, and standardized benchmarks and adapted models are lacking. Method: Build the PET-F2I-41K benchmark from more than 41,000 real-world reports; comprehensively evaluate 27 models (proprietary frontier, open-source generalist, and medical-domain LLMs); fine-tune Qwen2.5-7B-Instruct with LoRA to obtain the lightweight domain model PET-F2I-7B; propose three new clinical metrics: Entity Coverage Rate (ECR), Uncovered Entity Rate (UER), and Factual Consistency Rate (FCR). Result: PET-F2I-7B reaches 0.708 BLEU-4 and a 3.0x improvement in entity coverage over the strongest baseline, with advantages in cost, latency, and privacy; frontier and medical LLMs both perform inadequately in zero-shot settings; PET-F2I-41K becomes the first standardized evaluation framework for PET/CT impression generation. Conclusion: PET-F2I-7B confirms the effectiveness of lightweight domain fine-tuning for PET/CT report generation, and PET-F2I-41K provides a reliable benchmark and evaluation paradigm for clinically deployable automated reporting systems. Abstract: PET/CT imaging is pivotal in oncology and nuclear medicine, yet summarizing complex findings into precise diagnostic impressions is labor-intensive. While LLMs have shown promise in medical text generation, their capability in the highly specialized domain of PET/CT remains underexplored. We introduce PET-F2I-41K (PET Findings-to-Impression Benchmark), a large-scale benchmark for PET/CT impression generation using LLMs, constructed from over 41k real-world reports. Using PET-F2I-41K, we conduct a comprehensive evaluation of 27 models across proprietary frontier LLMs, open-source generalist models, and medical-domain LLMs, and we develop a domain-adapted 7B model (PET-F2I-7B) fine-tuned from Qwen2.5-7B-Instruct via LoRA. Beyond standard NLG metrics (e.g., BLEU-4, ROUGE-L, BERTScore), we propose three clinically grounded metrics - Entity Coverage Rate (ECR), Uncovered Entity Rate (UER), and Factual Consistency Rate (FCR) - to assess diagnostic completeness and factual reliability. Experiments reveal that neither frontier nor medical-domain LLMs perform adequately in zero-shot settings. In contrast, PET-F2I-7B achieves substantial gains (e.g., 0.708 BLEU-4) and a 3.0x improvement in entity coverage over the strongest baseline, while offering advantages in cost, latency, and privacy. Beyond this modeling contribution, PET-F2I-41K establishes a standardized evaluation framework to accelerate the development of reliable and clinically deployable reporting systems for PET/CT.

[119] UniStitch: Unifying Semantic and Geometric Features for Image Stitching

Yuan Mei,Lang Nie,Kang Liao,Yunqiu Xu,Chunyu Lin,Bin Xiao

Main category: cs.CV

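TL;DR: This paper proposes UniStitch, the first framework to unify semantic and geometric features for image stitching. A Neural Point Transformer (NPT) maps sparse keypoints to dense semantic maps, and an Adaptive Mixture of Experts (AMoE) module dynamically fuses the two modalities, yielding greater robustness in complex scenes and a large margin over existing state-of-the-art methods.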

Details

Motivation: Traditional stitching methods based on hand-crafted geometric features and newer methods based on deep semantic features have long developed separately without being combined; a unified framework that exploits the strengths of both is needed to improve robustness and performance. Method: UniStitch comprises 1) a Neural Point Transformer (NPT) module that converts unordered, sparse 1D keypoints into ordered, dense 2D semantic feature maps, and 2) an Adaptive Mixture of Experts (AMoE) module that dynamically weights and fuses geometric and semantic features, shifting focus according to their reliability. Result: UniStitch significantly outperforms existing state-of-the-art methods on multiple benchmarks, and the experiments support the effectiveness and generalization of dual-modality fusion in challenging scenes such as occlusion and weak texture. Conclusion: UniStitch bridges traditional and learning-based image stitching, establishes a unified multimodal paradigm, and provides an extensible foundation for robust, general-purpose stitching. Abstract: Traditional image stitching methods estimate warps from hand-crafted geometric features, whereas recent learning-based solutions leverage semantic features from neural networks instead. These two lines of research have largely diverged along separate evolution, with virtually no meaningful convergence to date. In this paper, we take a pioneering step to bridge this gap by unifying semantic and geometric features with UniStitch, a unified image stitching framework from multimodal features. To align discrete geometric features (i.e., keypoint) with continuous semantic feature maps, we present a Neural Point Transformer (NPT) module, which transforms unordered, sparse 1D geometric keypoints into ordered, dense 2D semantic maps. Then, to integrate the advantages of both representations, an Adaptive Mixture of Experts (AMoE) module is designed to fuse geometric and semantic representations. It dynamically shifts focus toward more reliable features during the fusion process, allowing the model to handle complex scenes, especially when either modality might be compromised. The fused representation can be adopted into common deep stitching pipelines, delivering significant performance gains over any single feature. Experiments show that UniStitch outperforms existing state-of-the-art methods by a large margin, paving the way for a unified paradigm between traditional and learning-based image stitching.

[120] R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment

Zhuangzi Li,Jian Jin,Shilv Cai,Weisi Lin

Main category: cs.CV

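TL;DR: This paper proposes a two-stream retrieval-augmented generation framework that improves the ability of vision-language models (VLMs) to assess computer graphics (CG) image quality, and builds the first dataset of 3500 CG images with multi-dimensional quality descriptions.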

Details

Motivation: Existing CG datasets lack systematic quality descriptions, and existing quality assessment methods cannot provide reasonable text-based explanations. Method: Define six perceptual dimensions from the user perspective and build a dataset of 3500 CG images with quality descriptions; design description-based question-answer benchmarks to evaluate VLMs; propose a two-stream retrieval-augmented generation framework that uses descriptions of similar images to improve VLM understanding and judgment of CG quality. Result: Experiments show that the method substantially improves several mainstream VLMs on CG quality assessment, and that text descriptions of similar images effectively improve a VLM's understanding of the target image's quality. Conclusion: Retrieval-augmented generation effectively compensates for current VLMs' shortcomings in fine-grained CG quality assessment and offers a new paradigm for explainable CG quality assessment. Abstract: Immersive Computer Graphics (CGs) rendering has become ubiquitous in modern daily life. However, comprehensively evaluating CG quality remains challenging for two reasons: first, existing CG datasets lack systematic descriptions of rendering quality; and second, existing CG quality assessment methods cannot provide reasonable text-based explanations. To address these issues, we first identify six key perceptual dimensions of CG quality from the user perspective and construct a dataset of 3500 CG images with corresponding quality descriptions. Each description covers CG style, content, and perceived quality along the selected dimensions. Furthermore, we use a subset of the dataset to build several question-answer benchmarks based on the descriptions in order to evaluate the responses of existing Vision Language Models (VLMs). We find that current VLMs are not sufficiently accurate in judging fine-grained CG quality, but that descriptions of visually similar images can significantly improve a VLM's understanding of a given CG image. Motivated by this observation, we adopt retrieval-augmented generation and propose a two-stream retrieval framework that effectively enhances the CG quality assessment capabilities of VLMs. Experiments on several representative VLMs demonstrate that our method substantially improves their performance on CG quality assessment.

[121] Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution

Hongsong Wang,Renxi Cheng,Chaolei Han,Jie Gui

Main category: cs.CV

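TL;DR: This paper proposes a model-agnostic paradigm for AI-generated image attribution that formulates attribution as instance retrieval rather than classification, and designs the low-bit-plane-based LIDA framework, which achieves state-of-the-art performance in zero-shot and few-shot settings.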

Details

Motivation: Existing AI-generated image attribution methods depend on specific generative models, lack generality and scalability, and struggle to keep up with newly emerging generators. Method: Reformulate attribution as an instance retrieval problem; propose the model-agnostic LIDA framework, comprising a low-bit fingerprint generation module, unsupervised pre-training, and few-shot attribution adaptation. Result: On Deepfake detection and image attribution, LIDA achieves state-of-the-art performance in both zero-shot and few-shot settings. Conclusion: LIDA is an efficient, general, and scalable model-agnostic attribution framework that offers a new approach to handling novel generative models. Abstract: With the rapid advancement of AIGC technologies, image forensics will encounter unprecedented challenges. Traditional methods are incapable of dealing with increasingly realistic images generated by rapidly evolving image generation techniques. To facilitate the identification of AI-generated images and the attribution of their source models, generative image watermarking and AI-generated image attribution have emerged as key research focuses in recent years. However, existing methods are model-dependent, requiring access to the generative models and lacking generality and scalability to new and unseen generators. To address these limitations, this work presents a new paradigm for AI-generated image attribution by formulating it as an instance retrieval problem instead of a conventional image classification problem. We propose an efficient model-agnostic framework, called Low-bIt-plane-based Deepfake Attribution (LIDA). The input to LIDA is produced by Low-Bit Fingerprint Generation module, while the training involves Unsupervised Pre-Training followed by subsequent Few-Shot Attribution Adaptation. Comprehensive experiments demonstrate that LIDA achieves state-of-the-art performance for both Deepfake detection and image attribution under zero- and few-shot settings. The code is at https://github.com/hongsong-wang/LIDA
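
The two ideas named in the abstract, low bit planes as input and attribution by retrieval, can be sketched generically as follows. This is not the LIDA implementation: the number of retained bits, the toy embeddings, and the generator labels are all hypothetical, and the actual Low-Bit Fingerprint Generation module is not described in the abstract.

```python
import math
from typing import List, Sequence, Tuple


def low_bit_planes(pixels: List[List[int]], k: int = 2) -> List[List[int]]:
    """Keep only the k least significant bits of each 8-bit pixel value."""
    mask = (1 << k) - 1
    return [[p & mask for p in row] for row in pixels]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def attribute(query: Sequence[float], gallery: List[Tuple[str, Sequence[float]]]) -> str:
    """Retrieval-style attribution: return the label of the most similar gallery entry."""
    return max(gallery, key=lambda item: cosine(query, item[1]))[0]


image = [[255, 128], [7, 4]]
print(low_bit_planes(image, k=2))

# Hypothetical reference embeddings, one per known generator.
gallery = [("generator_a", [1.0, 0.0]), ("generator_b", [0.0, 1.0])]
print(attribute([0.9, 0.1], gallery))
```

One practical consequence of the retrieval formulation is that supporting a new generator only requires adding reference entries to the gallery rather than retraining a classifier with a fixed label set.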

[122] Need for Speed: Zero-Shot Depth Completion with Single-Step Diffusion

Jakub Gregorek,Paraskevas Pegios,Nando Metzger,Konrad Schindler,Theodora Kontogianni,Lazaros Nalpantidis

Main category: cs.CV

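TL;DR: Marigold-SSD is a single-step, late-fusion depth completion framework that leverages diffusion priors while avoiding test-time optimization; by shifting the computational burden from inference to finetuning, it enables efficient and robust 3D perception.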

Details

Motivation: Address the high latency that test-time optimization imposes on diffusion models for depth completion, and narrow their efficiency gap with discriminative models. Method: Propose Marigold-SSD, a single-step, late-fusion depth completion framework that incorporates diffusion priors at the finetuning stage rather than at inference, avoiding test-time optimization. Result: Strong cross-domain generalization and zero-shot performance on four indoor and two outdoor benchmarks; significantly faster inference with a training cost of only 4.5 GPU days; robustness is also evaluated under varying input sparsity levels. Conclusion: Marigold-SSD retains the advantages of diffusion models while greatly improving inference efficiency, making them more practical for real-world low-latency settings. Abstract: We introduce Marigold-SSD, a single-step, late-fusion depth completion framework that leverages strong diffusion priors while eliminating the costly test-time optimization typically associated with diffusion-based methods. By shifting computational burden from inference to finetuning, our approach enables efficient and robust 3D perception under real-world latency constraints. Marigold-SSD achieves significantly faster inference with a training cost of only 4.5 GPU days. We evaluate our method across four indoor and two outdoor benchmarks, demonstrating strong cross-domain generalization and zero-shot performance compared to existing depth completion approaches. Our approach significantly narrows the efficiency gap between diffusion-based and discriminative models. Finally, we challenge common evaluation protocols by analyzing performance under varying input sparsity levels. Page: https://dtu-pas.github.io/marigold-ssd/

[123] Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection

Yawen Yang,Feng Li,Shuqi Kong,Yunfeng Diao,Xinjian Gao,Zenglin Shi,Meng Wang

Main category: cs.CV

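TL;DR: This paper proposes latent transition discrepancy (LTD), a method that analyzes differences in semantic attention and structural consistency between the latent representations of real and synthetic images across network layers, achieving more robust and generalizable detection of AI-generated images.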

Details

Motivation: Existing synthetic image detectors rely on model-specific artifacts or low-level statistical cues and therefore generalize poorly to unseen data, whereas real images exhibit more stable cross-layer semantic attention and structural coherence in latent space, a property that has not been fully exploited. Method: Propose latent transition discrepancy (LTD), which adaptively identifies the most discriminative network layers and quantifies the inconsistency of feature transitions across layers for real versus synthetic images. Result: On three datasets containing diverse GANs and diffusion models (DMs), LTD improves mean accuracy (mean Acc) over the base model by 14.35% and outperforms current state-of-the-art methods in detection accuracy, generalizability, and robustness. Conclusion: By modeling differences in the consistency of cross-layer latent representations, LTD offers a more fundamental and more generalizable paradigm for synthetic image detection and a new approach to the trustworthiness of AI-generated content. Abstract: Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these synthetics makes them increasingly indistinguishable from authentic photographs, posing serious security risks, such as media credibility and content manipulation. Although extensive efforts have been dedicated to detecting synthetic images, most existing approaches suffer from poor generalization to unseen data due to their reliance on model-specific artifacts or low-level statistical cues. In this work, we identify a previously unexplored distinction that real images maintain consistent semantic attention and structural coherence in their latent representations, exhibiting more stable feature transitions across network layers, whereas synthetic ones present discernible distinct patterns. Therefore, we propose a novel approach termed latent transition discrepancy (LTD), which captures the inter-layer consistency differences of real and synthetic images. LTD adaptively identifies the most discriminative layers and assesses the transition discrepancies across layers. Benefiting from the proposed inter-layer discriminative modeling, our approach exceeds the base model by 14.35% in mean Acc across three datasets containing diverse GANs and DMs. Extensive experiments demonstrate that LTD outperforms recent state-of-the-art methods, achieving superior detection accuracy, generalizability, and robustness. The code is available at https://github.com/yywencs/LTD

[124] HyPER-GAN: Hybrid Patch-Based Image-to-Image Translation for Real-Time Photorealism Enhancement

Stefanos Pasios,Nikos Nikolaidis

Main category: cs.CV

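TL;DR: This paper proposes HyPER-GAN, a lightweight image-to-image translation model with a U-Net-style generator and a hybrid patch training strategy that balances real-time inference with visual realism and semantic consistency.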

Details

Motivation: Existing generative models can make synthetic data more photorealistic but often introduce visual artifacts and are computationally expensive, which limits their use in real-time training or evaluation. Method: Propose HyPER-GAN, based on a U-Net-style generator; train it with supervision from paired synthetic and photorealism-enhanced images, and add a hybrid training strategy that incorporates matched patches from real-world data to improve realism and semantic consistency. Result: HyPER-GAN outperforms current paired image-to-image translation methods in inference latency, visual realism, and semantic robustness; experiments confirm that the hybrid training strategy improves visual quality and semantic consistency. Conclusion: HyPER-GAN is an efficient, lightweight, and practical photorealism enhancement method suited to real-time use, with code and pretrained models released publicly. Abstract: Generative models are widely employed to enhance the photorealism of synthetic data for training computer vision algorithms. However, they often introduce visual artifacts that degrade the accuracy of these algorithms and require high computational resources, limiting their applicability in real-time training or evaluation scenarios. In this paper, we propose Hybrid Patch Enhanced Realism Generative Adversarial Network (HyPER-GAN), a lightweight image-to-image translation method based on a U-Net-style generator designed for real-time inference. The model is trained using paired synthetic and photorealism-enhanced images, complemented by a hybrid training strategy that incorporates matched patches from real-world data to improve visual realism and semantic consistency. Experimental results demonstrate that HyPER-GAN outperforms state-of-the-art paired image-to-image translation methods in terms of inference latency, visual realism, and semantic robustness. Moreover, it is illustrated that the proposed hybrid training strategy indeed improves visual quality and semantic consistency compared to training the model solely with paired synthetic and photorealism-enhanced images. Code and pretrained models are publicly available for download at: https://github.com/stefanos50/HyPER-GAN

[125] Splat2Real: Novel-view Scaling for Physical AI with 3D Gaussian Splatting

Hansol Lim,Jongseong Brad Choi

Main category: cs.CV

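TL;DR: This paper proposes Splat2Real, which pretrains monocular depth under a digital-twin teacher and introduces the CN-Coverage curriculum with a quality-aware fallback to improve novel-view robustness; experiments on TUM RGB-D demonstrate stability across rendered-view budgets and relevance to downstream control.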

Details

Motivation: Physical AI faces viewpoint shift between training and deployment, so monocular RGB-to-3D perception needs novel-view robustness. Method: Cast Real2Render2Real monocular depth pretraining as imitation learning guided by a digital-twin oracle (metric depth/visibility rendered from a scene mesh); use 3D Gaussian splatting (3DGS) to supply scalable novel-view observations; propose the CN-Coverage curriculum (balancing geometry gain against an extrapolation penalty) with a quality-aware fallback; and design a GOL-Gated variant for better medium- to high-budget stability. Result: On 20 TUM RGB-D sequences, CN-Coverage markedly mitigates worst-case regressions relative to naive scaling and the Robot and Coverage policies; GOL-Gated CN-Coverage gives the strongest medium- to high-budget stability and the lowest high-novelty tail error; downstream control-proxy experiments indicate that it shifts the trade-off between safety and task progress. Conclusion: Which novel views are selected, and by what policy (e.g., CN-Coverage), matters more than simply adding more views; the method improves the generalization and reliability of monocular depth models in real physical deployment. Abstract: Physical AI faces viewpoint shift between training and deployment, and novel-view robustness is essential for monocular RGB-to-3D perception. We cast Real2Render2Real monocular depth pretraining as imitation-learning-style supervision from a digital twin oracle: a student depth network imitates expert metric depth/visibility rendered from a scene mesh, while 3DGS supplies scalable novel-view observations. We present Splat2Real, centered on novel-view scaling: performance depends more on which views are added than on raw view count. We introduce CN-Coverage, a coverage+novelty curriculum that greedily selects views by geometry gain and an extrapolation penalty, plus a quality-aware guardrail fallback for low-reliability teachers. Across 20 TUM RGB-D sequences with step-matched budgets (N=0 to 2000 additional rendered views, with N unique <= 500 and resampling for larger budgets), naive scaling is unstable; CN-Coverage mitigates worst-case regressions relative to Robot/Coverage policies, and GOL-Gated CN-Coverage provides the strongest medium-high-budget stability with the lowest high-novelty tail error. Downstream control-proxy results versus N provide embodied-relevance evidence by shifting safety/progress trade-offs under viewpoint shift.
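
The abstract describes CN-Coverage only as a greedy selection by geometry gain with an extrapolation penalty. The sketch below shows one generic way such a greedy rule could look; the scoring function, the `penalty_weight` parameter, the stopping rule, and the representation of views as sets of covered cells are all assumptions made for illustration, not the paper's algorithm.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical view description: (surface cells seen, extrapolation distance).
View = Tuple[Set[int], float]


def greedy_select(candidates: Dict[str, View], budget: int, penalty_weight: float = 1.0) -> List[str]:
    """Greedily pick views that add the most new coverage, net of a novelty penalty."""
    covered: Set[int] = set()
    chosen: List[str] = []
    remaining = dict(candidates)

    def score(view_id: str) -> float:
        cells, extrapolation = remaining[view_id]
        return len(cells - covered) - penalty_weight * extrapolation

    while remaining and len(chosen) < budget:
        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        if score(best) <= 0:
            break  # no remaining view is worth its extrapolation cost
        chosen.append(best)
        covered |= remaining.pop(best)[0]
    return chosen


views = {
    "v1": ({1, 2, 3}, 0.5),
    "v2": ({3, 4}, 0.2),
    "v3": ({5, 6, 7, 8}, 5.0),  # large coverage but far from the training views
    "v4": ({1, 2}, 0.1),
}
print(greedy_select(views, budget=3))
```

In this toy run the far-away view is never chosen even though it covers the most new cells, which mirrors the abstract's point that which views are added matters more than how many.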

[126] Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning

Jeonghyeok Do,Yun Chen,Geunhyuk Youk,Munchurl Kim

Main category: cs.CV

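TL;DR: This paper proposes SLiM, a framework that combines masked modeling with contrastive learning and drops the decoder to improve efficiency and discriminability, achieving state-of-the-art skeleton-based action recognition with a 7.89x reduction in inference cost.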

Details

Motivation: Address three problems: contrastive learning overlooks local details, MAE decoders are computationally redundant, and full-sequence processing in downstream tasks causes computational asymmetry. Method: Propose SLiM: a shared encoder jointly trained with masked modeling and contrastive learning; semantic tube masking and skeletal-aware augmentations to maintain anatomical consistency; the reconstruction decoder is removed entirely. Result: State-of-the-art performance on all downstream protocols, with inference computational cost 7.89x lower than existing MAE methods. Conclusion: SLiM is the first to achieve decoder-free masked representation learning, combining efficiency with discriminative power and offering a new paradigm for skeleton-based action recognition. Abstract: The landscape of skeleton-based action representation learning has evolved from Contrastive Learning (CL) to Masked Auto-Encoder (MAE) architectures. However, each paradigm faces inherent limitations: CL often overlooks fine-grained local details, while MAE is burdened by computationally heavy decoders. Moreover, MAE suffers from severe computational asymmetry -- benefiting from efficient masking during pre-training but requiring exhaustive full-sequence processing for downstream tasks. To resolve these bottlenecks, we propose SLiM (Skeleton Less is More), a novel unified framework that harmonizes masked modeling with contrastive learning via a shared encoder. By eschewing the reconstruction decoder, SLiM not only eliminates computational redundancy but also compels the encoder to capture discriminative features directly. SLiM is the first framework with decoder-free masked modeling of representative learning. Crucially, to prevent trivial reconstruction arising from high skeletal-temporal correlation, we introduce semantic tube masking, alongside skeletal-aware augmentations designed to ensure anatomical consistency across diverse temporal granularities. Extensive experiments demonstrate that SLiM consistently achieves state-of-the-art performance across all downstream protocols. Notably, our method delivers this superior accuracy with exceptional efficiency, reducing inference computational cost by 7.89x compared to existing MAE methods.

[127] Are Video Reasoning Models Ready to Go Outside?

Yangfan He,Changgyu Boo,Jaehong Yoon

Main category: cs.CV

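TL;DR: This paper proposes ROVA, a framework that improves the robustness of vision-language models to real-world perturbations (such as weather, occlusion, and camera motion) by modeling a robustness-aware consistency reward under spatio-temporal corruptions, combined with a difficulty-aware online training strategy; it also introduces PVRBench, a new benchmark for evaluating accuracy and reasoning under perturbation.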

Details

Motivation: Existing vision-language models degrade substantially under real-world perturbations (such as weather, occlusion, and camera motion), exposing a gap between clean evaluation and real-world robustness. Method: Propose the ROVA training framework: 1) a robustness-aware consistency reward for spatio-temporal corruptions; 2) a difficulty-aware online training strategy that dynamically estimates sample difficulty via self-reflective evaluation and prioritizes informative samples. Also build the PVRBench benchmark, which injects real-world perturbations into embodied video data to assess both accuracy and reasoning quality. Result: On PVRBench, UrbanVideo, and VisBench, ROVA improves relative accuracy by at least 24% and reasoning by over 9% compared with baselines such as QWen2.5/3-VL, InternVL2.5, and Embodied-R; the gains transfer to clean standard benchmarks with consistent improvements. Conclusion: ROVA effectively mitigates performance degradation under real-world perturbations and improves the robustness and generalization of vision-language models; PVRBench provides a more realistic standard for evaluating deployment readiness. Abstract: In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer up to 35% and 28% drops in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.

[128] How To Embed Matters: Evaluation of EO Embedding Design Choices

Luis Gilch,Isabelle Wittmann,Maximilian Nitsche,Johannes Jakubik,Arne Ewald,Thomas Brunschwiler

Main category: cs.CV

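TL;DR: This paper systematically analyzes embedding design for Earth observation (EO) tasks based on Geospatial Foundation Models (GeoFMs), examining how backbone, pretraining strategy, representation depth, spatial aggregation, and representation combination affect downstream performance, and shows that compact embeddings can compress the raw data substantially while remaining broadly useful.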

Details

Motivation: As EO data volumes surge and GeoFMs are widely adopted, a key challenge is how to obtain, aggregate, and combine task-agnostic intermediate embeddings efficiently while balancing downstream performance and pipeline scalability. Method: Using the NeuCo-Bench benchmark, systematically evaluate how backbone architecture (e.g., Transformer/ResNet), pretraining strategy (e.g., self-supervised), representation depth, spatial aggregation (e.g., mean pooling), and embedding combination affect EO task performance. Result: Transformer backbones with mean pooling are a strong default; intermediate ResNet layers can outperform the final layer; different self-supervised objectives have task-specific strengths; combining embeddings from different objectives often improves robustness; embeddings can be more than 500x smaller than the raw input data and remain usable. Conclusion: Embedding design must balance performance and efficiency, and no single configuration is best; architecture, pretraining method, and aggregation strategy should be chosen according to task requirements, and fusing embeddings from multiple sources is an effective route to robustness. Abstract: Earth observation (EO) missions produce petabytes of multispectral imagery, increasingly analyzed using large Geospatial Foundation Models (GeoFMs). Alongside end-to-end adaptation, workflows make growing use of intermediate representations as task-agnostic embeddings, enabling models to compute representations once and reuse them across downstream tasks. Consequently, when GeoFMs act as feature extractors, decisions about how representations are obtained, aggregated, and combined affect downstream performance and pipeline scalability. Understanding these trade-offs is essential for scalable embedding-based EO workflows, where compact embeddings can replace raw data while remaining broadly useful. We present a systematic analysis of embedding design in GeoFM-based EO workflows. Leveraging NeuCo-Bench, we study how backbone architecture, pretraining strategy, representation depth, spatial aggregation, and representation combination influence EO task performance. We demonstrate the usability of GeoFM embeddings by aggregating them into fixed-size representations more than 500x smaller than the raw input data. Across models, we find consistent trends: transformer backbones with mean pooling provide strong default embeddings, intermediate ResNet layers can outperform final layers, self-supervised objectives exhibit task-specific strengths, and combining embeddings from different objectives often improves robustness.
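
Mean pooling, the aggregation the abstract singles out as a strong default, can be sketched in a few lines. The token values, the input size (4 bands of 128 x 128 pixels), and the 128-dimensional embedding are hypothetical numbers chosen only to show how a size reduction above 500x can arise; they are not taken from the paper or from NeuCo-Bench.

```python
from typing import List, Sequence


def mean_pool(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Average patch-token embeddings into one fixed-size vector."""
    count = len(tokens)
    dim = len(tokens[0])
    return [sum(token[j] for token in tokens) / count for j in range(dim)]


def size_reduction(raw_value_count: int, embedding_dim: int) -> float:
    """Ratio of raw input values to embedding values (counts values, not bytes)."""
    return raw_value_count / embedding_dim


# Two toy patch tokens with two channels each.
patch_tokens = [[1.0, 2.0], [3.0, 4.0]]
print(mean_pool(patch_tokens))

# Hypothetical input: 4 spectral bands of 128 x 128 pixels, pooled to 128 values.
print(size_reduction(4 * 128 * 128, 128))
```

The pooled vector has the same length regardless of how many patches the image contains, which is what makes it reusable as a fixed-size input for downstream tasks.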

[129] A$^2$-Edit: Precise Reference-Guided Image Editing of Arbitrary Objects and Ambiguous Masks

Huayu Zheng,Guangzhao Li,Baixuan Zhao,Siqi Luo,Hantao Jiang,Guangtao Zhai,Xiaohong Liu

Main category: cs.CV

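TL;DR: This paper proposes A$^2$-Edit, a framework for editing objects of arbitrary categories using coarse masks, and builds the large-scale multi-category dataset UniEdit-500K; it introduces a Mixture of Transformer module for cross-category semantic modeling and a mask annealing training strategy for robustness, and significantly outperforms existing methods on multiple benchmarks.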

Details

Motivation: Address the severe homogenization and limited category coverage of existing image editing datasets, and support broader, more flexible replacement editing of arbitrary objects. Method: Propose the unified A$^2$-Edit editing framework; build the UniEdit-500K dataset with 8 major categories, 209 fine-grained subcategories, and about 500,000 image pairs; design a Mixture of Transformer module for dynamic expert selection and cross-category semantic collaboration; propose the Mask Annealing Training Strategy (MATS), which progressively relaxes mask precision to improve generalization and robustness. Result: Outperforms existing methods across all metrics on benchmarks such as VITON-HD and AnyInsertion. Conclusion: A$^2$-Edit offers an efficient, robust, and general paradigm for editing objects of arbitrary categories, with its dataset, model architecture, and training strategy jointly advancing image editing. Abstract: We propose A$^2$-Edit, a unified inpainting framework for arbitrary object categories, which allows users to replace any target region with a reference object using only a coarse mask. To address the issues of severe homogenization and limited category coverage in existing datasets, we construct a large-scale, multi-category dataset UniEdit-500K, which includes 8 major categories, 209 fine-grained subcategories, and a total of 500,104 image pairs. Such rich category diversity poses new challenges for the model, requiring it to automatically learn semantic relationships and distinctions across categories. To this end, we introduce the Mixture of Transformer module, which performs differentiated modeling of various object categories through dynamic expert selection, and further enhances cross-category semantic transfer and generalization through collaboration among experts. In addition, we propose a Mask Annealing Training Strategy (MATS) that progressively relaxes mask precision during training, reducing the model's reliance on accurate masks and improving robustness across diverse editing tasks. Extensive experiments on benchmarks such as VITON-HD and AnyInsertion demonstrate that A$^2$-Edit consistently outperforms existing approaches across all metrics, providing a new and efficient solution for arbitrary object editing.

[130] Bioinspired CNNs for border completion in occluded images

Catarina P. Coutinho,Aneeqa Merhab,Janko Petkovic,Ferdinando Zanchetta,Rita Fioresi

Main category: cs.CV

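TL;DR: This paper proposes BorderNet, a CNN architecture inspired by the mathematical modeling of border completion in the visual cortex; specially designed convolutional filters improve robustness to image occlusion, and effectiveness is validated on several occluded datasets.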

Details

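Motivation: Improve CNN robustness to image occlusion, drawing on the mathematical modeling of the border completion mechanism in the visual cortex. Method: Design new CNN filters based on the mathematical model of border completion in the visual cortex, build the BorderNet architecture, and evaluate it on three datasets (MNIST, Fashion-MNIST, and EMNIST) under two occlusion types, stripes and grids. Result: BorderNet outperforms the baseline in all tested settings, with gains varying with occlusion severity and dataset. Conclusion: Filter design inspired by biological vision can effectively improve CNN robustness to occlusion and provides a biological grounding for network architecture design. Abstract: We exploit the mathematical modeling of the border completion problem in the visual cortex to design convolutional neural network (CNN) filters that enhance robustness to image occlusions. We evaluate our CNN architecture, BorderNet, on three occluded datasets (MNIST, Fashion-MNIST, and EMNIST) under two types of occlusions: stripes and grids. In all cases, BorderNet demonstrates improved performance, with gains varying depending on the severity of the occlusions and the dataset.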

[131] RandMark: On Random Watermarking of Visual Foundation Models

Anna Chistyakova,Mikhail Pautov

Main category: cs.CV

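TL;DR: This paper proposes an ownership-verification method for visual foundation models (VFMs) based on random digital watermark embedding. A small encoder-decoder network embeds watermarks into the model's internal representations, and statistical detection provides reliable ownership identification.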

Details

Motivation: Visual foundation models are expensive to train and carry significant intellectual property value, so owners need an effective way to verify ownership. Method: Designs a small encoder-decoder network that randomly embeds digital watermarks into the internal representation of a hold-out set of input images, then verifies ownership through the statistical detectability of the watermark in functional copies of the model. Result: Theory and experiments both show a low false-detection rate on non-watermarked models and a low miss rate on watermarked models. Conclusion: The proposed watermarking method verifies VFM ownership effectively and robustly, offering a practical route to model copyright protection. Abstract: Being trained on large and diverse datasets, visual foundation models (VFMs) can be fine-tuned to achieve remarkable performance and efficiency in various downstream computer vision tasks. The high computational cost of data collection and training makes these models valuable assets, which motivates some VFM owners to distribute them alongside a license to protect their intellectual property rights. In this paper, we propose an approach to ownership verification of visual foundation models that leverages a small encoder-decoder network to embed digital watermarks into an internal representation of a hold-out set of input images. The method is based on random watermark embedding, which makes the watermark statistics detectable in functional copies of the watermarked model. Both theoretically and experimentally, we demonstrate that the proposed method yields a low probability of false detection for non-watermarked models and a low probability of missed detection for watermarked models.
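
To make the notion of statistical detection concrete, here is a minimal sketch of a one-sided binomial test on watermark bits. It assumes a binary watermark and that, for a non-watermarked model, each extracted bit matches the reference with probability 0.5; the abstract does not state the actual test, so `match_p_value` is a hypothetical helper rather than the authors' procedure.

```python
from math import comb

def match_p_value(extracted, reference):
    """P(at least this many matching bits) under the null of random guessing."""
    n = len(reference)
    matches = sum(a == b for a, b in zip(extracted, reference))
    return sum(comb(n, k) for k in range(matches, n + 1)) / 2 ** n

reference = [1, 0] * 16                     # hypothetical 32-bit watermark
p_copy = match_p_value(reference, reference)            # all 32 bits match
p_other = match_p_value([1] * 32, reference)            # 16 of 32 bits match
```

A small p-value is evidence that the suspect model carries the watermark; choosing the decision threshold trades false detections against missed detections.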

[132] UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations

Yaqi Zhao,Wang Lin,Zijian Zhang,Miles Yang,Jingyuan Chen,Wentao Zhang,Zhao Zhong,Liefeng Bo

Main category: cs.CV

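TL;DR: This paper proposes UniCom, a framework that unifies multimodal understanding and generation through compressed continuous representations, avoiding both the semantic loss caused by discretization and the training instability of continuous modeling.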

Details

Motivation: Existing unified multimodal models rely on discrete visual tokenizers, which lose fine-grained semantic information, while directly modeling continuous semantic representations (e.g., CLIP, SigLIP) makes high-dimensional generative modeling hard, with slow convergence and unstable training. Method: Proposes the UniCom framework, which uses an attention-based semantic compressor to distill dense features into a compact unified representation and shows that compressing the channel dimension is more effective than spatial downsampling; it also adopts the transfusion architecture to improve convergence and consistency. Result: UniCom achieves state-of-the-art generation performance among unified multimodal models and performs strongly on image-editing controllability and image consistency without a VAE. Conclusion: Compressed continuous semantic representations are an effective way to reconcile semantic fidelity with training stability in unified multimodal modeling, and UniCom provides an efficient, robust unified framework for understanding and generation. Abstract: Current unified multimodal models typically rely on discrete visual tokenizers to bridge the modality gap. However, discretization inevitably discards fine-grained semantic information, leading to suboptimal performance in visual understanding tasks. Conversely, directly modeling continuous semantic representations (e.g., CLIP, SigLIP) poses significant challenges in high-dimensional generative modeling, resulting in slow convergence and training instability. To resolve this dilemma, we introduce UniCom, a unified framework that harmonizes multimodal understanding and generation via compressed continuous representation. We empirically demonstrate that reducing channel dimension is significantly more effective than spatial downsampling for both reconstruction and generation. Accordingly, we design an attention-based semantic compressor to distill dense features into a compact unified representation. Furthermore, we validate that the transfusion architecture surpasses query-based designs in convergence and consistency. Experiments demonstrate that UniCom achieves state-of-the-art generation performance among unified models. Notably, by preserving rich semantic priors, it delivers exceptional controllability in image editing and maintains image consistency even without relying on VAE.

Rafi Ibn Sultan,Hui Zhu,Xiangyu Zhou,Chengyin Li,Prashant Khanduri,Marco Brocanelli,Dongxiao Zhu

Main category: cs.CV

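TL;DR: This paper proposes WalkGPT, a pixel-grounded vision-language model for accessible pedestrian navigation guidance. A Multi-Scale Query Projector and a Calibrated Text Projector enable joint fine-grained spatial and semantic reasoning, and the authors also build the large-scale PAVE benchmark.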

Details

Motivation: Existing large vision-language models lack explicit spatial grounding, so they hallucinate objects and reason unreliably about depth, falling short of the strict joint semantic and spatial reasoning that accessible navigation requires. Method: Proposes WalkGPT, which combines a Multi-Scale Query Projector (MSQP) and a Calibrated Text Projector (CTP) with a Region Alignment Loss to perform pixel-level segmentation and relative depth estimation without user prompts; also builds the PAVE benchmark (41k images with accessibility question-answer pairs). Result: WalkGPT performs strongly on grounded reasoning and segmentation, produces natural-language navigation guidance with segmentation masks and depth information, and clearly outperforms baseline models. Conclusion: WalkGPT is described as the first model to unify language understanding, pixel-level segmentation, and depth perception, giving accessible navigation an end-to-end, interpretable, and reliably grounded vision-language solution. Abstract: Ensuring accessible pedestrian navigation requires reasoning about both semantic and spatial aspects of complex urban scenes, a challenge that existing Large Vision-Language Models (LVLMs) struggle to meet. Although these models can describe visual content, their lack of explicit grounding leads to object hallucinations and unreliable depth reasoning, limiting their usefulness for accessibility guidance. We introduce WalkGPT, a pixel-grounded LVLM for the new task of Grounded Navigation Guide, unifying language reasoning and segmentation within a single architecture for depth-aware accessibility guidance. Given a pedestrian-view image and a navigation query, WalkGPT generates a conversational response with segmentation masks that delineate accessible and harmful features, along with relative depth estimation. The model incorporates a Multi-Scale Query Projector (MSQP) that shapes the final image tokens by aggregating them along text tokens across spatial hierarchies, and a Calibrated Text Projector (CTP), guided by a proposed Region Alignment Loss, that maps language embeddings into segmentation-aware representations. These components enable fine-grained grounding and depth inference without user-provided cues or anchor points, allowing the model to generate complete and realistic navigation guidance. We also introduce PAVE, a large-scale benchmark of 41k pedestrian-view images paired with accessibility-aware questions and depth-grounded answers. Experiments show that WalkGPT achieves strong grounded reasoning and segmentation performance. 
The source code and dataset are available on the project website (https://sites.google.com/view/walkgpt-26/home).

[134] UAV traffic scene understanding: A cross-spectral guided approach and a unified benchmark

Yu Zhang,Zhicheng Zhao,Ze Luo,Chenglong Li,Jin Tang

Main category: cs.CV

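TL;DR: This paper proposes the Cross-spectral Traffic Cognition Network (CTCNet), which combines optical and thermal infrared imagery through Prototype-Guided Knowledge Embedding (PGKE) and Quality-Aware Spectral Compensation (QASC) modules for robust UAV traffic scene understanding under difficult lighting, and builds Traffic-VQA, the first large-scale optical-thermal infrared traffic VQA dataset.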

Details

Motivation: Traffic understanding methods built on optical imagery degrade severely under poor illumination such as nighttime and fog, and current Visual Question Answering (VQA) models lack domain knowledge such as traffic regulations, so they struggle with complex traffic behaviors. Method: Proposes CTCNet with (1) a Prototype-Guided Knowledge Embedding (PGKE) module that uses high-level semantic prototypes from an external Traffic Regulation Memory (TRM) to embed domain knowledge into visual representations; (2) a Quality-Aware Spectral Compensation (QASC) module that performs bidirectional context exchange between optical and thermal infrared modalities to compensate for degraded features; and (3) Traffic-VQA, the first optical-thermal infrared traffic VQA benchmark. Result: CTCNet significantly outperforms state-of-the-art methods on both cognition and perception tasks; Traffic-VQA contains 8,180 aligned image pairs and 1.3 million QA pairs across 31 types. Conclusion: CTCNet fuses multispectral information with domain knowledge to improve traffic cognition on UAV platforms in complex environments, and Traffic-VQA provides an important foundation for follow-up research. Abstract: Traffic scene understanding from unmanned aerial vehicle (UAV) platforms is crucial for intelligent transportation systems due to its flexible deployment and wide-area monitoring capabilities. However, existing methods face significant challenges in real-world surveillance, as their heavy reliance on optical imagery leads to severe performance degradation under adverse illumination conditions like nighttime and fog. Furthermore, current Visual Question Answering (VQA) models are restricted to elementary perception tasks, lacking the domain-specific regulatory knowledge required to assess complex traffic behaviors. To address these limitations, we propose a novel Cross-spectral Traffic Cognition Network (CTCNet) for robust UAV traffic scene understanding. Specifically, we design a Prototype-Guided Knowledge Embedding (PGKE) module that leverages high-level semantic prototypes from an external Traffic Regulation Memory (TRM) to anchor domain-specific knowledge into visual representations, enabling the model to comprehend complex behaviors and distinguish fine-grained traffic violations. Moreover, we develop a Quality-Aware Spectral Compensation (QASC) module that exploits the complementary characteristics of optical and thermal modalities to perform bidirectional context exchange, effectively compensating for degraded features to ensure robust representation in complex environments. 
In addition, we construct Traffic-VQA, the first large-scale optical-thermal infrared benchmark for cognitive UAV traffic understanding, comprising 8,180 aligned image pairs and 1.3 million question-answer pairs across 31 diverse types. Extensive experiments demonstrate that CTCNet significantly outperforms state-of-the-art methods in both cognition and perception scenarios. The dataset is available at https://github.com/YuZhang-2004/UAV-traffic-scene-understanding.

[135] eLasmobranc Dataset: An Image Dataset for Elasmobranch Species Recognition and Biodiversity Monitoring

Ismael Beviá-Ballesteros,Mario Jerez-Tallón,Nieves Aranda-Garrido,Isabel Abel-Abellán,Irene Antón-Linares,Jorge Azorín-López,Marcelo Saval-Calvo,Andres Fuster-Guilló,Francisca Giménez-Casalduero

Main category: cs.CV

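TL;DR: This paper introduces the eLasmobranc Dataset, a high-quality public image dataset designed for fine-grained, species-level recognition of elasmobranchs (sharks and rays), intended to support conservation biology and AI-driven biodiversity monitoring.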

Details

Motivation: Existing visual datasets focus on detection, underwater acquisition, or coarse-grained categories, which does not meet the needs of fine-grained morphological identification of threatened elasmobranchs or of conservation planning such as ISRAs. Method: Builds an image dataset covering seven elasmobranch species from the eastern Spanish Mediterranean coast; images come mainly from standardized out-of-water acquisition (fish markets, field campaigns) and open-access sources, with expert-validated annotations, spatial and temporal metadata, and species-level information. Result: Releases eLasmobranc, described as the first public elasmobranch image dataset aimed at fine-grained morphological recognition with expert annotations, structured metadata, and clearly visible diagnostic traits; it is hosted on Zenodo. Conclusion: The dataset fills a data gap in fine-grained AI recognition of elasmobranchs and supports reproducible, conservation-oriented computer vision research. Abstract: Elasmobranch populations are experiencing significant global declines, and several species are currently classified as threatened. Reliable monitoring and species-level identification are essential to support conservation and spatial planning initiatives such as Important Shark and Ray Areas (ISRAs). However, existing visual datasets are predominantly detection-oriented, underwater-acquired, or limited to coarse-grained categories, restricting their applicability to fine-grained morphological classification. We present the eLasmobranc Dataset, a curated and publicly available image collection from seven ecologically relevant elasmobranch species inhabiting the eastern Spanish Mediterranean coast, a region where two ISRAs have been identified. Images were obtained through dedicated data collection, including field campaigns and collaborations with local fish markets and projects, as well as from open-access public sources. The dataset was constructed predominantly from images acquired outside the aquatic environment under standardized protocols to ensure clear visualization of diagnostic morphological traits. It integrates expert-validated species annotations, structured spatial and temporal metadata, and complementary species-level information. The eLasmobranc Dataset is specifically designed to support supervised species-level classification, population studies, and the development of artificial intelligence systems for biodiversity monitoring. 
By combining morphological clarity, taxonomic reliability, and public accessibility, the dataset addresses a critical gap in fine-grained elasmobranch identification and promotes reproducible research in conservation-oriented computer vision. The dataset is publicly available at https://zenodo.org/records/18549737.

[136] Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers

Wenhao Sun,Ji Li,Zhaoqiang Liu

Main category: cs.CV

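TL;DR: This paper proposes JiT, a training-free acceleration framework that dynamically selects sparse anchor tokens in the spatial domain and introduces a deterministic micro-flow ODE, substantially improving the sampling efficiency of diffusion Transformers and reaching up to a 7x speedup on FLUX.1-dev with almost no loss in quality.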

Details

Motivation: Diffusion Transformers are state of the art in image synthesis, but iterative sampling is computationally expensive. Existing acceleration methods mostly target the temporal domain and ignore the spatial redundancy inherent in generation (global structure forms before fine detail), so every spatial region receives the same computation, which is inefficient. Method: Proposes the Just-in-Time (JiT) framework: (1) a spatially approximated generative ordinary differential equation (ODE) driven by sparse anchor tokens; (2) a deterministic micro-flow, a simple and effective finite-time ODE that preserves structural coherence and statistical correctness as new tokens are added and the latent state grows in dimension. Result: On FLUX.1-dev, JiT reaches up to a 7x speedup with nearly lossless generation quality, significantly outperforming existing acceleration methods and establishing a better speed-fidelity trade-off. Conclusion: JiT is a training-free, spatially aware acceleration paradigm that exploits spatial redundancy in diffusion generation and opens a new path to practical deployment of diffusion Transformers. Abstract: Diffusion Transformers have established a new state-of-the-art in image synthesis, but the high computational cost of iterative sampling severely hampers their practical deployment. While existing acceleration methods often focus on the temporal domain, they overlook the substantial spatial redundancy inherent in the generative process, where global structures emerge long before fine-grained details are formed. The uniform computational treatment of all spatial regions represents a critical inefficiency. In this paper, we introduce Just-in-Time (JiT), a novel training-free framework that addresses this challenge by acceleration in the spatial domain. JiT formulates a spatially approximated generative ordinary differential equation (ODE) that drives the full latent state evolution based on computations from a dynamically selected, sparse subset of anchor tokens. To ensure seamless transitions as new tokens are incorporated to expand the dimensions of the latent state, we propose a deterministic micro-flow, a simple and effective finite-time ODE that maintains both structural coherence and statistical correctness. Extensive experiments on the state-of-the-art FLUX.1-dev model demonstrate that JiT achieves up to a 7x speedup with nearly lossless performance, significantly outperforming existing acceleration methods and establishing a new and superior trade-off between inference speed and generation fidelity.

[137] Event-based Photometric Stereo via Rotating Illumination and Per-Pixel Learning

Hyunwoo Kim,Won-Hoe Kim,Sanghoon Lee,Jianfei Cai,Giljoo Nam,Jae-Sang Hyun

Main category: cs.CV

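TL;DR: This paper proposes an event-camera-based photometric stereo system that uses a single light source moving along a circular trajectory and a lightweight per-pixel multi-layer neural network to predict surface normals directly from event signals without calibration, improving robustness and accuracy under high dynamic range, strong ambient light, and specular reflections.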

Details

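Motivation: Conventional frame-based photometric stereo depends on controlled lighting and is sensitive to ambient illumination, which limits real-world use. Event cameras are well suited to high dynamic range and continuously varying radiance, but existing event-based photometric stereo methods still suffer from complex calibration and limited robustness. Method: Pairs an event camera with a single light source moving along a predefined circular trajectory to form a compact, scalable system, and designs a lightweight per-pixel multi-layer neural network that predicts surface normals directly from the event stream generated by intensity changes as the light source rotates, without system calibration. Result: On benchmark datasets and self-collected real-world data, mean angular error is reduced by 7.12% relative to existing event-based photometric stereo methods, with stronger robustness in regions of sparse events, under strong ambient light, and in specular scenes. Conclusion: The event-driven approach removes the dependence on multiple synchronized light sources and system calibration, balancing accuracy and practicality and offering a new option for 3D reconstruction under complex lighting. Abstract: Photometric stereo is a technique for estimating surface normals using images captured under varying illumination. However, conventional frame-based photometric stereo methods are limited in real-world applications due to their reliance on controlled lighting, and susceptibility to ambient illumination. To address these limitations, we propose an event-based photometric stereo system that leverages an event camera, which is effective in scenarios with continuously varying scene radiance and high dynamic range conditions. Our setup employs a single light source moving along a predefined circular trajectory, eliminating the need for multiple synchronized light sources and enabling a more compact and scalable design. We further introduce a lightweight per-pixel multi-layer neural network that directly predicts surface normals from event signals generated by intensity changes as the light source rotates, without system calibration. Experimental results on benchmark datasets and real-world data collected with our data acquisition system demonstrate the effectiveness of our method, achieving a 7.12% reduction in mean angular error compared to existing event-based photometric stereo methods. In addition, our method demonstrates robustness in regions with sparse event activity, strong ambient illumination, and scenes affected by specularities.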
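
The mean angular error quoted above is the standard metric for surface-normal estimation. A minimal sketch of how it is typically computed follows; the array layout (one 3-vector per pixel) is an assumption, and the function name is illustrative.

```python
import numpy as np

def mean_angular_error_deg(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean angle in degrees between predicted and ground-truth normals."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    # Clip to guard against tiny numerical overshoot outside [-1, 1].
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

pred = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
error = mean_angular_error_deg(pred, gt)  # one exact normal, one off by 90 degrees
```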

[138] CodePercept: Code-Grounded Visual STEM Perception for MLLMs

Tongkun Guan,Zhibo Yang,Jianqiang Wan,Mingkun Yang,Zhengtao Guo,Zijian Hu,Ruilin Luo,Ruize Chen,Songtao Jiang,Peng Wang,Wei Shen,Junyang Lin,Xiaokang Yang

Main category: cs.CV

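TL;DR: Through a systematic scaling analysis, this paper finds that the bottleneck of multimodal large language models (MLLMs) in STEM visual reasoning lies mainly in perception rather than reasoning. It proposes a "code as perception" paradigm, builds ICC-1M, a dataset of one million image-caption-code triplets, and designs the STEM2Code-Eval benchmark, which uses executable code generation for deterministic, verifiable assessment of visual perception.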

Details

Motivation: Determine whether MLLM failures in STEM visual reasoning stem from perceptual deficiencies or reasoning limitations, and use the answer to find a new route to better perception. Method: Runs a scaling analysis that scales perception and reasoning components independently; proposes code as a perceptual medium; builds the ICC-1M dataset through two approaches, code-grounded caption generation and STEM image-to-code translation; and designs the STEM2Code-Eval benchmark, which measures visual perception by the ability to generate image-reconstruction code. Result: Confirms that perception is the main bottleneck in current STEM visual reasoning; ICC-1M mitigates caption hallucination and linguistic ambiguity; STEM2Code-Eval gives a more direct, verifiable assessment of perception than problem-solving accuracy. Conclusion: Improving MLLM performance on STEM visual tasks depends on strengthening low-level visual perception, and using code as a structured, executable perceptual medium is an efficient and practical way to do so. Abstract: When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium--executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. 
Unlike existing work relying on problem-solving accuracy as a proxy that only measures problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing deterministic and verifiable assessment. Code is available at https://github.com/TongkunGuan/Qwen-CodePercept.

[139] Guiding Diffusion Models with Semantically Degraded Conditions

Shilong Han,Yuming Zhang,Hongxia Wang

Main category: cs.CV

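TL;DR: This paper proposes Condition-Degradation Guidance (CDG), which replaces the null prompt of standard Classifier-Free Guidance with a semantically degraded condition, improving fine-grained semantic control and compositional accuracy in text-to-image generation.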

Details

Motivation: Standard CFG relies on a semantically vacuous null prompt (∅), which causes geometric entanglement and limits precision, especially on complex compositional tasks. Method: Proposes the CDG paradigm, which builds a semantically degraded condition c_deg by selectively degrading content tokens (not context-aggregating tokens), turning guidance into a fine "good vs. almost good" discrimination; no extra models or training are required. Result: Validated on architectures including Stable Diffusion 3, FLUX, and Qwen-Image, CDG markedly improves compositional accuracy and text-image alignment at negligible computational cost. Conclusion: Static, information-sparse negative samples should give way to adaptive, semantically aware ones, which the authors put forward as a new principle for precise semantic control. Abstract: Classifier-Free Guidance (CFG) is a cornerstone of modern text-to-image models, yet its reliance on a semantically vacuous null prompt ($\varnothing$) generates a guidance signal prone to geometric entanglement. This is a key factor limiting its precision, leading to well-documented failures in complex compositional tasks. We propose Condition-Degradation Guidance (CDG), a novel paradigm that replaces the null prompt with a strategically degraded condition, $\boldsymbol{c}_{\text{deg}}$. This reframes guidance from a coarse "good vs. null" contrast to a more refined "good vs. almost good" discrimination, thereby compelling the model to capture fine-grained semantic distinctions. We find that tokens in transformer text encoders split into two functional roles: content tokens encoding object semantics, and context-aggregating tokens capturing global context. By selectively degrading only the former, CDG constructs $\boldsymbol{c}_{\text{deg}}$ without external models or training. Validated across diverse architectures including Stable Diffusion 3, FLUX, and Qwen-Image, CDG markedly improves compositional accuracy and text-image alignment. As a lightweight, plug-and-play module, it achieves this with negligible computational overhead. Our work challenges the reliance on static, information-sparse negative samples and establishes a new principle for diffusion guidance: the construction of adaptive, semantically-aware negative samples is critical to achieving precise semantic control. Code is available at https://github.com/Ming-321/Classifier-Degradation-Guidance.
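
The contrast between the two guidance schemes can be sketched with the usual CFG extrapolation rule. The sketch assumes CDG keeps the same extrapolation form and only swaps the reference branch from the null prompt to the degraded condition, as the abstract's wording suggests; the array names are hypothetical stand-ins for model outputs.

```python
import numpy as np

def guided_prediction(pred_cond: np.ndarray, pred_ref: np.ndarray,
                      scale: float) -> np.ndarray:
    """Extrapolate from a reference prediction toward the conditional one.

    CFG: pred_ref is the model output for the null prompt.
    CDG (assumed form): pred_ref is the output for the degraded condition c_deg.
    """
    return pred_ref + scale * (pred_cond - pred_ref)

pred_cond = np.array([1.0, 2.0])   # output for the full prompt
pred_null = np.array([0.0, 0.0])   # output for the null prompt (CFG reference)
pred_deg = np.array([0.8, 1.5])    # output for the degraded prompt (CDG reference)

cfg_out = guided_prediction(pred_cond, pred_null, scale=2.0)
cdg_out = guided_prediction(pred_cond, pred_deg, scale=2.0)
```

With the degraded reference, the guidance direction `pred_cond - pred_deg` isolates only what the degraded prompt is missing, which is the "good vs. almost good" contrast described above.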

[140] Taking Shortcuts for Categorical VQA Using Super Neurons

Pierre Musacchio,Jaeyi Jeong,Dahun Kim,Jaesik Park

Main category: cs.CV

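TL;DR: This paper proposes Super Neurons (SNs), which classify downstream tasks directly from the raw scalar activations of a vision-language model (VLM) without training. Compared with Sparse Attention Vectors (SAVs), SNs are highly discriminative in shallower layers and support early exit at the first layer and first token, improving classification while speeding up inference by up to 5.10x.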

Details

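Motivation: Methods such as Sparse Attention Vectors (SAVs) are training-free but depend on attention vectors; this work looks for a simpler, more efficient, more scalable training-free route to classification that uses raw scalar neuron activations directly. Method: Proposes Super Neurons (SNs): rather than relying on attention, it probes the scalar activations of neurons across VLM layers in a task-oriented way; by analyzing activations at the first generated token, it identifies "super neurons" that are strongly discriminative even in shallow layers and uses them for zero-shot or few-shot classification, with support for extreme early exit (for example, predicting at the first layer and first token). Result: On several visually grounded downstream tasks, SNs clearly outperform training-free baselines such as SAVs; many SNs sit in the shallow layers of the LLM, which makes first-layer, first-token early exit feasible; classification accuracy improves robustly while inference is up to 5.10x faster. Conclusion: Scalar neuron activations already carry rich task-discriminative information, and Super Neurons offer a lighter, faster, training-free, high-performing way to adapt VLMs, challenging the view that attention or fine-tuning is required. Abstract: Sparse Attention Vectors (SAVs) have emerged as an excellent training-free alternative to supervised finetuning or low-rank adaptation to improve the performance of Vision Language Models (VLMs). At their heart, SAVs select a few accurate attention heads for a task of interest and use them as classifiers, rather than relying on the model's prediction. In a similar spirit, we find that directly probing the raw activations of the VLM, in the form of scalar values, is sufficient to yield accurate classifiers on diverse visually grounded downstream tasks. Shifting focus from attention vectors to scalar activations dramatically increases the search space for accurate parameters, allowing us to find more discriminative neurons immediately from the first generated token. We call such activations Super Neurons (SNs). In this probing setting, we discover that enough SNs appear in the shallower layers of the large language model to allow for extreme early exiting from the first layer of the model at the first generated token. Compared to the original network, SNs robustly improve the classification performance while achieving a speedup of up to 5.10x.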
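
Probing scalar activations for discriminative neurons can be illustrated with a toy sketch. It assumes a binary task, a small labeled support set, and a per-neuron threshold at the mean activation; none of these details come from the abstract, and the selection rule shown is a simple stand-in rather than the authors' procedure.

```python
import numpy as np

def neuron_accuracies(acts: np.ndarray, labels: np.ndarray):
    """acts: (n_samples, n_neurons) scalar activations; labels in {0, 1}."""
    thresholds = acts.mean(axis=0)          # hypothetical per-neuron threshold
    preds = acts > thresholds
    acc = (preds == labels[:, None].astype(bool)).mean(axis=0)
    # A neuron that is reliably wrong is as useful as one reliably right.
    return np.maximum(acc, 1.0 - acc), thresholds

def top_neurons(acts: np.ndarray, labels: np.ndarray, k: int = 1) -> np.ndarray:
    """Indices of the k neurons with the highest support-set accuracy."""
    acc, _ = neuron_accuracies(acts, labels)
    return np.argsort(-acc)[:k]

labels = np.array([0, 0, 1, 1])
acts = np.array([[0.5, 0.0],
                 [0.1, 0.1],
                 [0.4, 0.9],
                 [0.2, 1.0]])   # column 1 separates the classes, column 0 does not
best = top_neurons(acts, labels, k=1)
```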

[141] Phase-Interface Instance Segmentation as a Visual Sensor for Laboratory Process Monitoring

Mingyue Li,Xin Yang,Shilin Yan,Jinye Ran,Morui Zhu,Zirui Peng,Huanqing Peng,Wei Peng,Guanghua Zhang,Shuo Li,Hao Zhang

Main category: cs.CV

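TL;DR: This paper proposes LGA-RCM-YOLO, which combines Local-Global Attention with a Rectangular Self-Calibration Module to substantially improve instance segmentation of multiphase interfaces in transparent glassware on the CTG 2.0 dataset, and supports real-time monitoring of laboratory processes.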

Details

Motivation: Visual monitoring of chemical experiments in transparent glassware is difficult because of weak phase boundaries and optical artifacts, and existing segmentation methods are not robust enough. Method: Builds the CTG 2.0 dataset (3,668 images, 23 glassware categories, 5 phase-interface types) and, on top of YOLO11m-seg, proposes LGA-RCM-YOLO: Local-Global Attention (LGA) strengthens semantic representation, a Rectangular Self-Calibration Module (RCM) refines the boundaries of thin, elongated interfaces, and an auxiliary color-attribute head classifies liquid phases as colored or colorless. Result: Reaches 84.4% AP@0.5 and 58.43% AP@0.5-0.95 on CTG 2.0, which is 6.42 and 8.75 AP points above the YOLO11m baseline; inference runs at 13.67 FPS on an RTX 3060; color classification reaches 98.71% precision and 98.32% recall; the model is applied to continuous monitoring of separatory-funnel phase separation and crystallization. Conclusion: Phase-interface instance segmentation can serve as a practical visual sensor and advance laboratory automation. Abstract: Reliable visual monitoring of chemical experiments remains challenging in transparent glassware, where weak phase boundaries and optical artifacts degrade conventional segmentation. We formulate laboratory phenomena as the time evolution of phase interfaces and introduce the Chemical Transparent Glasses dataset 2.0 (CTG 2.0), a vessel-aware benchmark with 3,668 images, 23 glassware categories, and five multiphase interface types for phase-interface instance segmentation. Building on YOLO11m-seg, we propose LGA-RCM-YOLO, which combines Local-Global Attention (LGA) for robust semantic representation and a Rectangular Self-Calibration Module (RCM) for boundary refinement of thin, elongated interfaces. On CTG 2.0, the proposed model achieves 84.4% AP@0.5 and 58.43% AP@0.5-0.95, improving over the YOLO11m baseline by 6.42 and 8.75 AP points, respectively, while maintaining near real-time inference (13.67 FPS, RTX 3060). An auxiliary color-attribute head further labels liquid instances as colored or colorless with 98.71% precision and 98.32% recall. Finally, we demonstrate continuous process monitoring in separatory-funnel phase separation and crystallization, showing that phase-interface instance segmentation can serve as a practical visual sensor for laboratory automation.
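
A few derived quantities follow directly from the reported numbers. The short calculation below recovers the implied baseline scores, the per-frame latency, and the F1 score of the color head; these values are arithmetic consequences of the abstract's figures, not numbers reported by the authors.

```python
ap50, ap50_95 = 84.4, 58.43          # reported AP@0.5 and AP@0.5-0.95 (%)
gain50, gain50_95 = 6.42, 8.75       # reported gains over YOLO11m (AP points)
fps = 13.67                          # reported speed on an RTX 3060
precision, recall = 0.9871, 0.9832   # reported color-attribute head metrics

baseline50 = round(ap50 - gain50, 2)            # implied baseline AP@0.5
baseline50_95 = round(ap50_95 - gain50_95, 2)   # implied baseline AP@0.5-0.95
ms_per_frame = 1000.0 / fps                     # latency per frame in ms
f1 = 2 * precision * recall / (precision + recall)

print(baseline50, baseline50_95, round(ms_per_frame, 1), round(f1, 4))
```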

[142] The Quadratic Geometry of Flow Matching: Semantic Granularity Alignment for Text-to-Image Synthesis

Zhinan Xiong,Shunqi Yuan

Main category: cs.CV

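TL;DR: This paper proposes Semantic Granularity Alignment (SGA), which intervenes in the vector residual field to mitigate gradient conflicts, improving the optimization efficiency and output quality of generative fine-tuning.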

Details

Motivation: In standard generative fine-tuning, the data-homogeneity assumption limits the model's effective capacity, and the MSE objective implicitly optimizes residual correlations between samples without explicit control. Method: Building on a quadratic-form analysis of NTK dynamics under the Flow Matching framework, proposes Semantic Granularity Alignment (SGA), which applies targeted interventions to the vector residual field in text-to-image synthesis to mitigate gradient conflicts. Result: On DiT and U-Net architectures, SGA accelerates convergence, improves structural integrity, and improves the efficiency-quality trade-off. Conclusion: By explicitly modeling and regulating semantic-granularity interactions between samples, SGA moves beyond the usual data-homogeneity assumption and gives generative fine-tuning a more robust, efficient optimization path. Abstract: In this work, we analyze the optimization dynamics of generative fine-tuning. We observe that under the Flow Matching framework, the standard MSE objective can be formulated as a Quadratic Form governed by a dynamically evolving Neural Tangent Kernel (NTK). This geometric perspective reveals a latent Data Interaction Matrix, where diagonal terms represent independent sample learning and off-diagonal terms encode residual correlation between heterogeneous features. Although standard training implicitly optimizes these cross-term interferences, it does so without explicit control; moreover, the prevailing data-homogeneity assumption may constrain the model's effective capacity. Motivated by this insight, we propose Semantic Granularity Alignment (SGA), using Text-to-Image synthesis as a testbed. SGA engineers targeted interventions in the vector residual field to mitigate gradient conflicts. Evaluations across DiT and U-Net architectures confirm that SGA advances the efficiency-quality trade-off by accelerating convergence and improving structural integrity.

[143] PolGS++: Physically-Guided Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction

Yufei Han,Chu Zhou,Youwei Lyu,Qi Chen,Si Li,Boxin Shi,Yunpeng Jia,Heng Guo,Zhanyu Ma

Main category: cs.CV

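TL;DR: This paper proposes PolGS++, a physically guided polarimetric Gaussian Splatting framework for fast reflective surface reconstruction. By integrating a polarized BRDF model and a depth-guided visibility mask mechanism, it keeps training efficient (about 10 minutes) while substantially improving geometry and normal reconstruction on reflective surfaces.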

Details

Motivation: 3D Gaussian Splatting (3DGS) performs poorly on reflective surfaces and lags behind implicit neural methods in recovering fine geometry and surface normals, so physical modeling is needed to strengthen geometric cues. Method: Proposes PolGS++: (1) embeds a polarized BRDF (pBRDF) model into 3DGS to explicitly decouple diffuse and specular components; (2) designs a depth-guided visibility mask mechanism that enables AoP-based tangent-space consistency constraints without ray tracing. Result: Validated on synthetic and real-world datasets; training takes only about 10 minutes and clearly improves geometric detail, normal accuracy, and rendering quality on reflective surfaces, outperforming existing 3DGS methods and some implicit methods. Conclusion: Through physically guided polarimetric modeling and an efficient geometric constraint mechanism, PolGS++ achieves high-quality, efficient explicit reconstruction of reflective surfaces, offering a new approach for real-time VR and digital content creation. Abstract: Accurate reconstruction of reflective surfaces remains a fundamental challenge in computer vision, with broad applications in real-time virtual reality and digital content creation. Although 3D Gaussian Splatting (3DGS) enables efficient novel-view rendering with explicit representations, its performance on reflective surfaces still lags behind implicit neural methods, especially in recovering fine geometry and surface normals. To address this gap, we propose PolGS++, a physically-guided polarimetric Gaussian Splatting framework for fast reflective surface reconstruction. Specifically, we integrate a polarized BRDF (pBRDF) model into 3DGS to explicitly decouple diffuse and specular components, providing physically grounded reflectance modeling and stronger geometric cues for reflective surface recovery. Furthermore, we introduce a depth-guided visibility mask acquisition mechanism that enables angle-of-polarization (AoP)-based tangent-space consistency constraints in Gaussian Splatting without costly ray-tracing intersections. This physically guided design improves reconstruction quality and efficiency, requiring only about 10 minutes of training. Extensive experiments on both synthetic and real-world datasets validate the effectiveness of our method.
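
The angle of polarization (AoP) used in the constraint above is a standard quantity derived from the linear Stokes parameters. The sketch below assumes intensities captured behind polarizers at 0, 45, 90, and 135 degrees, a common setup that the abstract does not confirm for this paper.

```python
import numpy as np

def stokes_from_intensities(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-angle intensities."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)   # total intensity
    s1 = i0 - i90
    s2 = i45 - i135
    return s0, s1, s2

def aop_dolp(s0, s1, s2, eps=1e-8):
    """Angle of polarization (radians) and degree of linear polarization."""
    aop = 0.5 * np.arctan2(s2, s1)
    dolp = np.sqrt(s1 ** 2 + s2 ** 2) / (s0 + eps)
    return aop, dolp

# Fully linearly polarized light at 45 degrees (Malus's law intensities).
s0, s1, s2 = stokes_from_intensities(0.5, 1.0, 0.5, 0.0)
aop, dolp = aop_dolp(s0, s1, s2)
```

Because AoP is tied to the azimuth of the surface normal, it gives a per-pixel geometric cue that complements photometric supervision on reflective surfaces.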

[144] Backdoor Directions in Vision Transformers

Sengim Karayalcin,Marina Krcek,Pin-Yu Chen,Stjepan Picek

Main category: cs.CV

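TL;DR: This paper studies how backdoor attacks are represented in Vision Transformers (ViTs). It identifies and validates a "trigger direction" corresponding to the trigger, uses it to analyze how different trigger types (static patches vs. stealthy distributed triggers) are processed internally, explores the link to adversarial attacks, and proposes a data-free, weight-only method for detecting stealthy-trigger backdoors.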

Details

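Motivation: Understand how backdoor attacks are represented inside Vision Transformers in order to support more effective detection and defense. Method: With the trigger known, identifies a linear "trigger direction" in the model's activation space and verifies its causal role through interventions in activation and parameter space; uses the direction to trace backdoor feature propagation layer by layer; compares how static-patch and distributed triggers are processed; tests whether PGD adversarial perturbations activate or suppress the trigger mechanism; and designs a data-free, weight-based detection scheme. Result: Confirms that the trigger direction is causal and stable across datasets and attack types; shows that the two trigger types follow fundamentally different internal logic in ViTs; finds that PGD perturbations can modulate the trigger mechanism; and shows the weight-based detector is effective against stealthy-trigger attacks. Conclusion: Mechanistic interpretability provides a solid framework for diagnosing and mitigating security vulnerabilities in computer vision models. Abstract: This paper investigates how Backdoor Attacks are represented within Vision Transformers (ViTs). By assuming knowledge of the trigger, we identify a specific "trigger direction" in the model's activations that corresponds to the internal representation of the trigger. We confirm the causal role of this linear direction by showing that interventions in both activation and parameter space consistently modulate the model's backdoor behavior across multiple datasets and attack types. Using this direction as a diagnostic tool, we trace how backdoor features are processed across layers. Our analysis reveals distinct qualitative differences: static-patch triggers follow a different internal logic than stealthy, distributed triggers. We further examine the link between backdoors and adversarial attacks, specifically testing whether PGD-based perturbations (de-)activate the identified trigger mechanism. Finally, we propose a data-free, weight-based detection scheme for stealthy-trigger attacks. Our findings show that mechanistic interpretability offers a robust framework for diagnosing and addressing security vulnerabilities in computer vision.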
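
One common way to find a linear direction like the one described above is the difference of mean activations between triggered and clean inputs, followed by projecting that direction out as an intervention. The sketch below uses that recipe as an assumption; the abstract does not say how the authors estimate the direction, and the synthetic activations are purely illustrative.

```python
import numpy as np

def trigger_direction(clean_acts: np.ndarray, triggered_acts: np.ndarray) -> np.ndarray:
    """Unit vector from the clean mean to the triggered mean (assumed estimator)."""
    d = triggered_acts.mean(axis=0) - clean_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove each activation's component along the unit direction."""
    return acts - np.outer(acts @ direction, direction)

rng = np.random.default_rng(0)
clean = rng.normal(size=(64, 8))            # hypothetical clean activations
shift = np.zeros(8)
shift[0] = 3.0
triggered = clean + shift                   # trigger adds a fixed offset
direction = trigger_direction(clean, triggered)
residual = ablate(triggered, direction) @ direction
```

If the direction is causal, ablating it from the activations of triggered inputs should suppress the backdoor behavior, and adding it to clean inputs should induce it.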

[145] HanMoVLM: Large Vision-Language Models for Professional Artistic Painting Evaluation

Hongji Yang,Yucheng Zhou,Wencheng Han,Songlian Li,Xiaotong Zhao,Jianbing Shen

Main category: cs.CV

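TL;DR: This paper proposes the HanMoVLM model and the HanMo-Bench dataset to improve professional evaluation of Chinese painting by large vision-language models. An expert-validated Chain-of-Thought (CoT) and a reward function refine the reasoning process, and the model supports Test-time Scaling for generative models to raise the quality of generated Chinese paintings.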

Details

Motivation: Large vision-language models (VLMs) have general visual ability but lack professional evaluation skills in abstract domains such as Chinese painting that demand deep artistic training, so they need to be aligned with expert-level artistic judgment. Method: Builds HanMo-Bench, a dataset of authentic auction-grade Chinese paintings; proposes HanMoVLM with an expert-validated multi-step Chain-of-Thought (content identification, RoI localization, theme-specific and three-tier evaluation); designs a reward function to refine reasoning; and uses the model as a high-quality verifier for Test-time Scaling of generative models. Result: HanMoVLM agrees closely with professional experts on Chinese painting evaluation, significantly improves the artistic quality of generated paintings, and is shown to work as a key backbone for Test-time Scaling in image generation. Conclusion: The work turns a general-purpose VLM into a professional evaluator for Chinese painting, giving art-focused AI an interpretable, verifiable, and deployable evaluation paradigm and extending VLMs to high-level aesthetic tasks. Abstract: While Large Vision-Language Models (VLMs) demonstrate impressive general visual capabilities, they remain artistically blind and unable to offer professional evaluation of artworks within specific artistic domains like human experts. To bridge this gap, we transform VLMs into experts capable of professional-grade painting evaluation in the Chinese Artistic Domain, which is more abstract and demands extensive artistic training for evaluation. We introduce HanMo-Bench, a new dataset that features authentic auction-grade masterpieces and AI-generated works, grounded in real-world market valuations. To realize the rigorous judgment, we propose the HanMoVLM and construct a Chain-of-Thought (CoT) validated by experts. This CoT guides the model to perform expert-level reasoning: from content identification and Region of Interest (RoI) localization to professional evaluation, guided by both theme-specific evaluation and typical three-tier evaluation in Chinese paintings. Furthermore, we design a reward function to refine the reasoning process of the HanMoVLM to improve the accuracy. We demonstrate that HanMoVLM can serve as a critical backbone for Test-time Scaling in image generation. By acting as a high-quality verifier, HanMoVLM enables generative models to select the most artistically superior outputs from multiple candidates. Experimental results and human studies confirm that the proposed HanMoVLM effectively bridges the gap, achieving a high consistency with professional experts and significantly improving the quality of Chinese Painting generation.
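
Using a verifier to pick the best of several generated candidates is the simplest form of test-time scaling. The sketch below shows that selection step in isolation; `score` is a hypothetical stand-in for a verifier such as HanMoVLM, and the toy scores are made up for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the candidate that the verifier scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)

# Toy example: candidate ids with made-up verifier scores.
toy_scores = {"draft_a": 0.41, "draft_b": 0.87, "draft_c": 0.63}
chosen = best_of_n(list(toy_scores), toy_scores.__getitem__)
```

Generating more candidates raises the chance that at least one is strong, so output quality then depends on how well the verifier's ranking matches expert judgment.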

[146] A dataset of medication images with instance segmentation masks for preventing adverse drug events

W. I. Chu,S. Hirani,G. Tarroni,L. Li

Main category: cs.CV

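TL;DR: This paper presents the MEDISEG dataset to address the difficulty of recognizing pills in real-world settings. It provides instance segmentation annotations for 32 pill types across 8262 images, and its usefulness and generalization are validated with YOLOv8 and YOLOv9 models.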

Details

Motivation: Existing pill image datasets rarely reflect real-world complexity such as overlap, occlusion, and lighting variation, which holds back AI pill recognition models. Method: Builds MEDISEG, a large pill instance segmentation dataset (32 classes, 8262 images), trains and evaluates YOLOv8 and YOLOv9 on it, and runs few-shot detection experiments to test cross-class generalization. Result: mAP@0.5 reaches 99.5% on the 3-Pills subset and 80.1% on the 32-Pills subset; the few-shot experiments show that base training on MEDISEG significantly improves recognition of unseen pill classes in occluded multi-pill scenes. Conclusion: MEDISEG supports accurate pill recognition under full supervision and encourages transferable representations under limited supervision, making it a valuable resource for developing and benchmarking AI systems for medication safety. Abstract: Medication errors and adverse drug events (ADEs) pose significant risks to patient safety, often arising from difficulties in reliably identifying pharmaceuticals in real-world settings. AI-based pill recognition models offer a promising solution, but the lack of comprehensive datasets hinders their development. Existing pill image datasets rarely capture real-world complexities such as overlapping pills, varied lighting, and occlusions. MEDISEG addresses this gap by providing instance segmentation annotations for 32 distinct pill types across 8262 images, encompassing diverse conditions from individual pill images to cluttered dosette boxes. We trained YOLOv8 and YOLOv9 on MEDISEG to demonstrate their usability, achieving mean average precision at IoU 0.5 of 99.5 percent on the 3-Pills subset and 80.1 percent on the 32-Pills subset. We further evaluate MEDISEG under a few-shot detection protocol, demonstrating that base training on MEDISEG significantly improves recognition of unseen pill classes in occluded multi-pill scenarios compared to existing datasets. These results highlight the dataset's ability not only to support robust supervised training but also to promote transferable representations under limited supervision, making it a valuable resource for developing and benchmarking AI-driven systems for medication safety.
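
The "IoU 0.5" in the reported metric refers to the overlap threshold at which a predicted instance counts as a match. A minimal sketch of mask IoU and the threshold check follows; the toy masks are illustrative, and full mAP additionally requires ranking predictions by confidence and averaging precision over recall levels and classes.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter / union) if union else 0.0

def is_match(pred: np.ndarray, gt: np.ndarray, threshold: float = 0.5) -> bool:
    """A prediction counts as a true positive when IoU reaches the threshold."""
    return mask_iou(pred, gt) >= threshold

gt = np.zeros((4, 4), dtype=bool)
gt[:, 0:2] = True                       # ground truth: two left columns
shifted = np.zeros((4, 4), dtype=bool)
shifted[:, 1:3] = True                  # prediction shifted by one column
wider = np.zeros((4, 4), dtype=bool)
wider[:, 0:3] = True                    # prediction one column too wide
```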

[147] BALD-SAM: Disagreement-based Active Prompting in Interactive Segmentation

Prithwijit Chowdhury,Mohit Prabhushankar,Ghassan AlRegib

Main category: cs.CV

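TL;DR: This paper proposes BALD-SAM, an active prompting framework that applies Bayesian Active Learning by Disagreement (BALD) to spatial prompt selection. It quantifies epistemic uncertainty to choose the next prompt location, improving the efficiency and accuracy of interactive segmentation, and it surpasses human and even oracle prompting on thin and structurally complex objects.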

Details

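Motivation: Existing SAM prompting methods mostly automate prompts without an iterative refinement mechanism, whereas real annotation workflows rely on annotators inspecting predictions and placing additional points strategically. This work replaces that subjective judgment with a model-driven uncertainty criterion to make interactive prompting more efficient and interpretable. Method: Introduce the "active prompting" paradigm, in which locations within an image form the unlabeled pool and prompts act as queries; design the BALD-SAM framework, which keeps the SAM backbone frozen and applies Bayesian modeling (a Laplace approximation) only to a lightweight prediction head, making uncertainty estimation feasible for a large model. Result: Across 16 datasets from natural, medical, underwater, and seismic domains, BALD-SAM ranks first or second on 14 benchmarks; ablations cover 3 SAM backbones and 35 Laplace configurations (38 settings in total) and confirm robustness; on thin and complex structures it surpasses human prompting and, in several categories, oracle prompting, while consistently outperforming one-shot baselines. Conclusion: BALD-SAM offers a principled, computationally feasible active prompting scheme for SAM that generalizes across domains, showing that selecting spatial prompts by epistemic uncertainty substantially improves interactive segmentation quality and annotation efficiency. Abstract: The Segment Anything Model (SAM) has revolutionized interactive segmentation through spatial prompting. While existing work primarily focuses on automating prompts in various settings, real-world annotation workflows involve iterative refinement where annotators observe model outputs and strategically place prompts to resolve ambiguities. Current pipelines typically rely on the annotator's visual assessment of the predicted mask quality. We postulate that a principled approach for automated interactive prompting is to use a model-derived criterion to identify the most informative region for the next prompt. In this work, we establish active prompting: a spatial active learning approach where locations within images constitute an unlabeled pool and prompts serve as queries to prioritize information-rich regions, increasing the utility of each interaction. We further present BALD-SAM: a principled framework adapting Bayesian Active Learning by Disagreement (BALD) to spatial prompt selection by quantifying epistemic uncertainty. To do so, we freeze the entire model and apply Bayesian uncertainty modeling only to a small learned prediction head, making otherwise intractable uncertainty estimation practical for large multi-million parameter foundation models. Across 16 datasets spanning natural, medical, underwater, and seismic domains, BALD-SAM demonstrates strong cross-domain performance, ranking first or second on 14 of 16 benchmarks. 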
We validate these gains through a comprehensive ablation suite covering 3 SAM backbones and 35 Laplace posterior configurations, amounting to 38 distinct ablation settings. Beyond strong average performance, BALD-SAM surpasses human prompting and, in several categories, even oracle prompting, while consistently outperforming one-shot baselines in final segmentation quality, particularly on thin and structurally complex objects.
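The BALD criterion mentioned in the abstract scores a candidate by the mutual information between the prediction and the model parameters: the entropy of the mean prediction minus the mean entropy of the individual posterior samples. The sketch below illustrates that score for a binary foreground/background decision, assuming a handful of sampled foreground probabilities per pixel are already available (for instance from a Laplace posterior over a prediction head). The data layout, function names, and toy values are assumptions for illustration and do not reproduce the authors' implementation.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def bald_score(samples):
    """BALD = H[mean prediction] - mean of H[each sampled prediction]."""
    mean_p = sum(samples) / len(samples)
    expected_entropy = sum(binary_entropy(p) for p in samples) / len(samples)
    return binary_entropy(mean_p) - expected_entropy


def next_prompt_location(prob_samples):
    """Pick the pixel whose posterior samples disagree the most."""
    return max(prob_samples, key=lambda loc: bald_score(prob_samples[loc]))


# Hypothetical candidates: (row, col) -> foreground probabilities from 3 posterior samples.
candidates = {
    (0, 0): [0.5, 0.5, 0.5],    # uncertain, but every sample agrees (aleatoric)
    (0, 1): [0.05, 0.95, 0.5],  # samples disagree strongly (epistemic)
    (1, 1): [0.9, 0.9, 0.9],    # confident and consistent
}

print(next_prompt_location(candidates))
```

Note how the pixel at (0, 0) has maximal predictive entropy yet a BALD score of zero: the criterion targets disagreement between posterior samples, not ambiguity that every sample shares.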

[148] Evaluating Few-Shot Pill Recognition Under Visual Domain Shift

W. I. Chu,G. Tarroni,L. Li

Main category: cs.CV

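TL;DR: This study examines few-shot pill recognition from a deployment perspective using a two-stage object detection framework. In realistic, cluttered scenes, semantic recognition adapts quickly, but localization and recall drop markedly under overlap and occlusion.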

Details

Motivation: In real deployments, automated pill recognition systems face visually complex conditions, including cluttered scenes, overlapping pills, reflections, and diverse acquisition environments, so generalization under cross-dataset domain shift needs to improve. Method: Use a two-stage object detection framework: base training followed by few-shot fine-tuning with 1, 5, or 10 labeled examples per class; evaluate on a separate multi-object, cluttered deployment dataset, focusing on classification-centric and error-based metrics. Result: Semantic pill recognition reaches classification saturation with a single labeled example, but localization and recall decline clearly in stress tests with overlap and occlusion; models trained on visually realistic, multi-pill data are more robust in low-shot settings. Conclusion: Training data realism matters more than architectural innovation, and few-shot fine-tuning is an effective diagnostic tool for assessing deployment readiness. Abstract: Adverse drug events are a significant source of preventable harm, which has led to the development of automated pill recognition systems to enhance medication safety. Real-world deployment of these systems is hindered by visually complex conditions, including cluttered scenes, overlapping pills, reflections, and diverse acquisition environments. This study investigates few-shot pill recognition from a deployment-oriented perspective, prioritizing generalization under realistic cross-dataset domain shifts over architectural innovation. A two-stage object detection framework is employed, involving base training followed by few-shot fine-tuning. Models are adapted to novel pill classes using one, five, or ten labeled examples per class and are evaluated on a separate deployment dataset featuring multi-object, cluttered scenes. The evaluation focuses on classification-centric and error-based metrics to address heterogeneous annotation strategies. Findings indicate that semantic pill recognition adapts rapidly with few-shot supervision, with classification performance reaching saturation even with a single labeled example. However, stress testing under overlapping and occluded conditions demonstrates a marked decline in localization and recall, despite robust semantic classification. Models trained on visually realistic, multi-pill data consistently exhibit greater robustness in low-shot scenarios, underscoring the importance of training data realism and the diagnostic utility of few-shot fine-tuning for deployment readiness.

[149] On the Reliability of Cue Conflict and Beyond

Pum Jun Kim,Seung-Ah Lee,Seongho Park,Dongyoon Han,Jaejun Yoo

Main category: cs.CV

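TL;DR: This paper proposes REFINED-BIAS to address unreliable bias estimates in existing shape-texture preference evaluations, which stem from unstable stylization, inseparable cues, ambiguous metrics, and restricted class sets. By constructing recognizable, balanced cue pairs and a ranking-based sensitivity measure over the full label space, it enables fairer, more accurate, and more interpretable diagnosis of model bias.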

Details

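Motivation: Existing stylization-based cue-conflict benchmarks have three major flaws when assessing the shape-texture preference of neural networks: stylization cannot reliably produce separable and perceptually valid visual cues; ratio-based bias metrics hide absolute cue sensitivity; and restricting evaluation to preselected classes distorts model behavior over the full decision space. As a result, bias estimates are confounded with cue validity, balance, and recognizability. Method: Propose REFINED-BIAS: (1) construct cue-balanced image pairs, recognizable by both humans and models, from explicit definitions of shape and texture; (2) quantify cue-specific sensitivity over the full label space with a ranking-based measure rather than a ratio of classification accuracies; (3) package both as a unified dataset and evaluation framework. Result: Validated across multiple training strategies and architectures, REFINED-BIAS makes cross-model comparison fairer, reveals shape and texture bias more faithfully, and resolves empirical inconsistencies that earlier cue-conflict evaluations could not disambiguate. Conclusion: Reliable bias diagnosis requires cues that are separable, recognizable, and balanced, together with metrics whose meaning is clear; REFINED-BIAS provides a more robust tool for research on the interpretability of visual representations. Abstract: Understanding how neural networks rely on visual cues offers a human-interpretable view of their internal decision processes. The cue-conflict benchmark has been influential in probing shape-texture preference and in motivating the insight that stronger, human-like shape bias is often associated with improved in-domain performance. However, we find that the current stylization-based instantiation can yield unstable and ambiguous bias estimates. Specifically, stylization may not reliably instantiate perceptually valid and separable cues nor control their relative informativeness, ratio-based bias can obscure absolute cue sensitivity, and restricting evaluation to preselected classes can distort model predictions by ignoring the full decision space. Together, these factors can confound preference with cue validity, cue balance, and recognizability artifacts. We introduce REFINED-BIAS, an integrated dataset and evaluation framework for reliable and interpretable shape-texture bias diagnosis. REFINED-BIAS constructs balanced, human- and model-recognizable cue pairs using explicit definitions of shape and texture, and measures cue-specific sensitivity over the full label space via a ranking-based metric, enabling fairer cross-model comparisons. 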
Across diverse training regimes and architectures, REFINED-BIAS enables fairer cross-model comparison, more faithful diagnosis of shape and texture biases, and clearer empirical conclusions, resolving inconsistencies that prior cue-conflict evaluations could not reliably disambiguate.
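To make the contrast between a ratio-based bias and a ranking-based sensitivity concrete, the sketch below scores each cue by the mean reciprocal rank of its label among all classes. The abstract does not specify the exact metric, so mean reciprocal rank is a stand-in chosen here as an assumption, and the example logits and label names are invented for illustration.

```python
def rank_of(label, logits):
    """1-based rank of `label` when all classes are sorted by descending score."""
    target = logits[label]
    return 1 + sum(1 for score in logits.values() if score > target)


def cue_sensitivity(examples, cue):
    """Mean reciprocal rank of the cue's label, computed over the full label space.

    Assumption: mean reciprocal rank stands in for the paper's ranking-based metric.
    """
    reciprocal_ranks = [1.0 / rank_of(ex[cue], ex["logits"]) for ex in examples]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


# Hypothetical cue-conflict images: each has a shape label, a texture label,
# and the model's scores over every class (not just the two cue classes).
examples = [
    {"shape": "cat", "texture": "elephant",
     "logits": {"cat": 2.0, "elephant": 3.0, "dog": 1.0, "car": 0.5}},
    {"shape": "dog", "texture": "car",
     "logits": {"cat": 0.1, "elephant": 0.2, "dog": 4.0, "car": 0.15}},
]

shape_score = cue_sensitivity(examples, "shape")
texture_score = cue_sensitivity(examples, "texture")
print(shape_score, texture_score)
```

Because the two scores are reported separately rather than as a ratio, a model that ignores both cues receives two low scores instead of an arbitrary bias value.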

[150] UltrasoundAgents: Hierarchical Multi-Agent Evidence-Chain Reasoning for Breast Ultrasound Diagnosis

Yali Zhu,Kang Zhou,Dingbang Wu,Gaofeng Meng

Main category: cs.CV

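TL;DR: This paper proposes UltrasoundAgents, a hierarchical multi-agent framework for breast ultrasound diagnosis that mirrors the clinical workflow: localize the lesion globally, analyze key signs locally (echogenicity pattern, calcification, boundary type, margin morphology), then reason over the structured signs to output a BI-RADS category and a benign/malignant prediction. A decoupled progressive training strategy mitigates error propagation and sparse rewards in multi-agent training.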

Details

Motivation: Existing methods mostly predict end to end or provide only weakly interpretable evidence; they struggle to capture fine-grained lesion cues, lack auditability and clinical reviewability, and do not match the real diagnostic workflow. Method: Propose the hierarchical multi-agent framework UltrasoundAgents: a main agent handles full-image lesion localization and evidence integration, and a sub-agent predicts four clinical signs in the cropped region; adopt a decoupled progressive training strategy that first trains the sub-agent, then trains the main agent to reason using oracle signs, and finally applies corrective trajectory self-distillation with spatial supervision to build high-quality trajectories for fine-tuning. Result: The method clearly outperforms strong vision-language baselines in both diagnostic accuracy and sign agreement, and it outputs structured intermediate evidence with a traceable reasoning path. Conclusion: By modeling the stepwise clinical decision process and structured signs, UltrasoundAgents improves the accuracy, interpretability, and clinical usefulness of breast ultrasound diagnosis; the decoupled training strategy improves training stability and generalization of the multi-agent system. Abstract: Breast ultrasound diagnosis typically proceeds from global lesion localization to local sign assessment and then evidence integration to assign a BI-RADS category and determine benignity or malignancy. Many existing methods rely on end-to-end prediction or provide only weakly grounded evidence, which can miss fine-grained lesion cues and limit auditability and clinical review. To align with the clinical workflow and improve evidence traceability, we propose a hierarchical multi-agent framework, termed UltrasoundAgents. A main agent localizes the lesion in the full image and triggers a crop-and-zoom operation. A sub-agent analyzes the local view and predicts four clinically relevant attributes, namely echogenicity pattern, calcification, boundary type, and edge (margin) morphology. The main agent then integrates these structured attributes to perform evidence-based reasoning and output the BI-RADS category and the malignancy prediction, while producing reviewable intermediate evidence. Furthermore, hierarchical multi-agent training often suffers from error propagation, difficult credit assignment, and sparse rewards. To alleviate this and improve training stability, we introduce a decoupled progressive training strategy. We first train the attribute agent, then train the main agent with oracle attributes to learn robust attribute-based reasoning, and finally apply corrective trajectory self-distillation with spatial supervision to build high-quality trajectories for supervised fine-tuning, yielding a deployable end-to-end policy. 
Experiments show consistent gains over strong vision-language baselines in diagnostic accuracy and attribute agreement, together with structured evidence and traceable reasoning.

Lin Chen,Bolin Ni,Qi Yang,Zili Wang,Kun Ding,Ying Wang,Houwen Peng,Shiming Xiang

Main category: cs.CV

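TL;DR: This paper proposes DIPE, a position encoding method that addresses visual fading in multimodal large language models (MLLMs) under long contexts. DIPE decouples position encoding by modality interaction: it keeps relative positions within each modality while enforcing perceptual proximity across modalities, so visual information does not weaken as the text grows.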

Details

Motivation: Multimodal large language models (MLLMs) suffer from visual fading in long contexts: attention to visual tokens weakens as the text lengthens, so generation drifts away from visual constraints. The root cause is that Multimodal RoPE imposes a distance penalty on inter-modal attention. Method: Propose inter-modal Distance Invariant Position Encoding (DIPE), which keeps the natural relative position encoding for intra-modal interactions and introduces an anchored perceptual proximity for inter-modal interactions, removing the attention decay caused by growing distance between visual and text tokens. Result: When DIPE is combined with Multimodal RoPE, the model maintains stable visual grounding in long contexts and visual fading is significantly reduced, without hurting performance on short-context benchmarks. Conclusion: DIPE is a simple and effective position encoding mechanism that mitigates visual fading in MLLMs at its source and improves robustness and consistency on long-context multimodal understanding tasks. Abstract: Despite the remarkable capabilities of Multimodal Large Language Models (MLLMs), they still suffer from visual fading in long-context scenarios. Specifically, the attention to visual tokens diminishes as the text sequence lengthens, leading to text generation detached from visual constraints. We attribute this degradation to the inherent inductive bias of Multimodal RoPE, which penalizes inter-modal attention as the distance between visual and text tokens increases. To address this, we propose inter-modal Distance Invariant Position Encoding (DIPE), a simple but effective mechanism that disentangles position encoding based on modality interactions. DIPE retains the natural relative positioning for intra-modal interactions to preserve local structure, while enforcing an anchored perceptual proximity for inter-modal interactions. This strategy effectively mitigates the inter-modal distance-based penalty, ensuring that visual signals remain perceptually consistent regardless of the context length. Experimental results demonstrate that by integrating DIPE with Multimodal RoPE, the model maintains stable visual grounding in long-context scenarios, significantly alleviating visual fading while preserving performance on standard short-context benchmarks. Code is available at https://github.com/lchen1019/DIPE.
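One way to read the distance-invariance idea is in terms of the effective relative distance that the position encoding sees for each query/key pair: unchanged within a modality, pinned to a constant across modalities. The sketch below only tabulates those distances; it does not implement rotary embeddings, and the `anchor` value, the modality tags, and the function names are assumptions made for illustration rather than details from the DIPE code.

```python
def relative_distance(i, j, modality, anchor=1):
    """Effective distance between query token i and key token j.

    Same modality: the ordinary relative position i - j.
    Different modality: a fixed anchor, independent of how far apart the tokens are.
    """
    if modality[i] == modality[j]:
        return i - j
    return anchor


def distance_matrix(modality, anchor=1):
    """Tabulate effective distances for every query/key pair in the sequence."""
    n = len(modality)
    return [[relative_distance(i, j, modality, anchor) for j in range(n)]
            for i in range(n)]


# Hypothetical sequence: two image tokens followed by three text tokens.
modality = ["img", "img", "txt", "txt", "txt"]
D = distance_matrix(modality)

# The last text token sees the first image token at the anchor distance,
# whereas plain relative positions would give 4 and keep growing with context.
print(D[4][0], D[4][2])
```

Under this reading, appending more text changes intra-text distances but leaves every text-to-image distance at the anchor, which is the property the abstract calls distance invariance.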

[152] Bilevel Layer-Positioning LoRA for Real Image Dehazing

Yan Zhang,Long Ma,Yuxin Feng,Zhe Huang,Fan Zhou,Zhuo Su

Main category: cs.CV

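TL;DR: This paper proposes a text-guided dehazing method that needs no reference images. It uses CLIP's cross-modal capability for semantic alignment and combines it with the BiLaLoRA strategy for efficient, lightweight model adaptation.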

Details

Motivation: Existing learning-based real image dehazing methods generalize poorly across diverse real haze scenes, mainly because unlabeled data lacks an effective unsupervised mechanism and full-model fine-tuning is too costly. Method: Propose a haze-to-clear text-directed loss that casts dehazing as semantic alignment in latent space; design the Bilevel Layer-positioning LoRA (BiLaLoRA) strategy, which jointly learns the LoRA parameters and the best layers to inject them into. Result: The method clearly outperforms state-of-the-art approaches on multiple real-world dehazing benchmarks, achieving efficient, lightweight adaptation without paired ground truth. Conclusion: Combining text-guided cross-modal semantic alignment with LoRA whose layer placement is learned offers a new approach to unsupervised real image dehazing that balances performance and efficiency. Abstract: Learning-based real image dehazing methods have achieved notable progress, yet they still face adaptation challenges in diverse real haze scenes. These challenges mainly stem from the lack of effective unsupervised mechanisms for unlabeled data and the heavy cost of full model fine-tuning. To address these challenges, we propose the haze-to-clear text-directed loss that leverages CLIP's cross-modal capabilities to reformulate real image dehazing as a semantic alignment problem in latent space, thereby providing explicit unsupervised cross-modal guidance in the absence of reference images. Furthermore, we introduce the Bilevel Layer-positioning LoRA (BiLaLoRA) strategy, which learns the LoRA parameters and automatically searches for the injection layers, enabling targeted adaptation of critical network layers. Extensive experiments demonstrate the superiority of our method over state-of-the-art methods on multiple real-world dehazing benchmarks. The code is publicly available at https://github.com/YanZhang-zy/BiLaLoRA.
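The layer-positioning idea can be pictured as a standard LoRA update with a per-layer gate that decides whether the low-rank branch is active in that layer. The sketch below shows only that forward pass and a top-k selection over gate values; the bilevel optimization that would actually learn the gates is not shown, and the gate representation, layer names, and helper functions are hypothetical rather than taken from the BiLaLoRA repository.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, gate=1.0, scale=1.0):
    """Frozen weight W plus a gated low-rank update: W x + gate * scale * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + gate * scale * u for b, u in zip(base, update)]


def select_layers(gates, k):
    """Keep the k layers with the largest gate values (assumed learned elsewhere)."""
    top = sorted(gates, key=gates.get, reverse=True)[:k]
    return sorted(top)


# Toy rank-1 adapter on a 2x2 identity layer (all values are illustrative).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]          # r x d_in  = 1 x 2
B = [[0.5], [-0.5]]       # d_out x r = 2 x 1
x = [2.0, 3.0]

adapted = lora_forward(W, A, B, x, gate=1.0)
frozen = lora_forward(W, A, B, x, gate=0.0)
chosen = select_layers({"enc1": 0.1, "enc4": 0.9, "dec2": 0.7}, k=2)
print(adapted, frozen, chosen)
```

With the gate at zero the layer behaves exactly like the frozen backbone, so searching over gates amounts to searching over which layers receive an adapter at all.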

[153] S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs

Yuzhou Ji,Qijian Tian,He Zhu,Xiaoqi Jiang,Guangzhi Cao,Lizhuang Ma,Yuan Xie,Xin Tan

Main category: cs.CV

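TL;DR: This paper proposes Sparse to Dense lifting (S2D), which combines a diffusion model with a robust reconstruction strategy to produce high-quality, consistent 3D Gaussian Splatting (3DGS) reconstructions from extremely sparse inputs.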

Details

Motivation: Existing explicit 3D representations such as point clouds and 3D Gaussian Splatting suffer from unrealistic rendering and sharp quality loss under sparse inputs, so a method that achieves high-quality reconstruction from very few inputs is needed. Method: Propose a two-part S2D lifting pipeline: (1) an efficient one-step diffusion model, guided by the sparse point cloud, that fixes image artifacts; (2) a reconstruction strategy with random sample drop and weighted gradients that makes 3D-consistent fitting from sparse input views to dense novel views more robust. Result: Across different input sparsity levels, S2D achieves the best novel-view generation consistency and top-tier sparse-view reconstruction quality, and it reconstructs stable 3DGS scenes with the fewest captured images among existing methods. Conclusion: S2D bridges sparse point clouds and dense 3DGS representations, substantially reducing the amount of input data 3DGS requires and offering a low-cost route to 3D content generation. Abstract: Explicit 3D representations have already become an essential medium for 3D simulation and understanding. However, the two most commonly used representations, point clouds and 3D Gaussian Splatting (3DGS), suffer from non-photorealistic rendering and significant degradation under sparse inputs, respectively. In this paper, we introduce Sparse to Dense lifting (S2D), a novel pipeline that bridges the two representations and achieves high-quality 3DGS reconstruction with minimal inputs. Specifically, the S2D lifting is two-fold. We first present an efficient one-step diffusion model that lifts sparse point clouds for high-fidelity image artifact fixing. Meanwhile, to reconstruct 3D-consistent scenes, we also design a corresponding reconstruction strategy with random sample drop and weighted gradients for robust model fitting from sparse input views to dense novel views. Extensive experiments show that S2D achieves the best consistency in generating novel view guidance and first-tier sparse view reconstruction quality under different input sparsity. By reconstructing stable scenes with the fewest captures among existing methods, S2D enables minimal input requirements for 3DGS applications.

[154] Novel Architecture of RPA In Oral Cancer Lesion Detection

Revana Magdy,Joy Naoum,Ali Hamdi

Main category: cs.CV

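TL;DR: This study evaluates two RPA implementations, OC-RPAv1 and OC-RPAv2, for oral cancer lesion detection. Using a Singleton design pattern and batch processing, OC-RPAv2 cuts per-image prediction time from 0.29 seconds to 0.06 seconds, which the authors report as a 60-100x efficiency improvement over standard RPA methods.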

Details

Motivation: Improve accurate, early detection of oral cancer lesions to support effective diagnosis and treatment. Method: Compare two RPA implementations: OC-RPAv1 (one image per prediction) and OC-RPAv2 (Singleton pattern with batch processing), on a test set of 31 images. Result: OC-RPAv2 lowers the average per-image prediction time to 0.06 seconds, about 4.8 times faster than OC-RPAv1 (0.29 seconds) and 60-100 times faster than standard RPA methods. Conclusion: Design patterns such as Singleton, together with batch processing, can markedly improve the efficiency, scalability, and cost-effectiveness of RPA for oral cancer detection. Abstract: Accurate and early detection of oral cancer lesions is crucial for effective diagnosis and treatment. This study evaluates two RPA implementations, OC-RPAv1 and OC-RPAv2, using a test set of 31 images. OC-RPAv1 processes one image per prediction in an average of 0.29 seconds, while OC-RPAv2 employs a Singleton design pattern and batch processing, reducing prediction time to just 0.06 seconds per image. This represents a 60-100x efficiency improvement over standard RPA methods, showcasing that design patterns and batch processing can enhance scalability and reduce costs in oral cancer detection.
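The two ingredients named above, a Singleton that loads the model once and a batch call that reuses it, can be sketched as follows. This is a generic illustration of the pattern, not the OC-RPAv2 code: the `Predictor` class, the placeholder "model", and the toy images are all hypothetical, and the final line only reproduces the arithmetic behind the roughly 4.8x figure quoted in the summary.

```python
class Predictor:
    """Singleton wrapper: the (expensive) model is loaded once and then reused."""

    _instance = None
    load_count = 0  # counts how many times the model was actually loaded

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._load_model()
        return cls._instance

    def _load_model(self):
        Predictor.load_count += 1
        # Placeholder standing in for a real classifier (assumption for illustration).
        self.model = lambda image: "lesion" if sum(image) > 1 else "normal"

    def predict_batch(self, images):
        """Process a whole batch in one call instead of one call per image."""
        return [self.model(image) for image in images]


# Hypothetical "images" represented as tiny feature lists.
images = [[0, 0, 1], [1, 1, 1], [0, 1, 1]]
labels = Predictor().predict_batch(images)
again = Predictor().predict_batch(images[:1])  # reuses the already loaded model

# Per-image speedup implied by the reported timings (0.29 s vs 0.06 s).
speedup = 0.29 / 0.06
print(labels, Predictor.load_count, round(speedup, 1))
```

In a pipeline that would otherwise reload the model for every image, the saving comes from paying the load cost once; the batch call additionally amortizes per-call overhead.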

[155] Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment

Fanqi Yu,Matteo Tiezzi,Tommaso Apicella,Cigdem Beyan,Vittorio Murino

Main category: cs.CV

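TL;DR: This paper proposes a lifelong imitation learning framework that stores and reuses compact representations in a multimodal latent space and adds an incremental feature adjustment mechanism based on an angular margin constraint, markedly improving performance on the LIBERO benchmarks and reducing forgetting.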

Details

Motivation: Conventional experience replay struggles to support continual policy refinement under memory and data constraints; the goal is lifelong imitation learning across tasks. Method: Store and reuse experience in a multimodal latent space (vision, language, robot state); introduce an incremental feature adjustment mechanism with an angular margin constraint to stabilize how task embeddings evolve. Result: The method sets a new state of the art on the LIBERO benchmarks, with AUC gains of 10-17 points and up to 65% less forgetting; ablations confirm the contribution of each component. Conclusion: The framework achieves more stable and efficient lifelong imitation learning under realistic constraints and offers a new direction for multimodal continual learning. Abstract: We introduce a lifelong imitation learning framework that enables continual policy refinement across sequential tasks under realistic memory and data constraints. Our approach departs from conventional experience replay by operating entirely in a multimodal latent space, where compact representations of visual, linguistic, and robot state information are stored and reused to support future learning. To further stabilize adaptation, we introduce an incremental feature adjustment mechanism that regularizes the evolution of task embeddings through an angular margin constraint, preserving inter-task distinctiveness. Our method establishes a new state of the art in the LIBERO benchmarks, achieving 10-17 point gains in AUC and up to 65% less forgetting compared to previous leading methods. Ablation studies confirm the effectiveness of each component, showing consistent gains over alternative strategies. The code is available at: https://github.com/yfqi/lifelong_mlr_ifa.
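An angular margin constraint on task embeddings can be read as a penalty that activates whenever two task embeddings come closer than a chosen angle. The sketch below implements one such hinge-style penalty; the exact loss used in the paper is not given in the abstract, so the formulation, the margin value, and the toy embeddings are assumptions for illustration.

```python
import math


def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


def angular_margin_penalty(embeddings, margin):
    """Sum of hinge penalties for every pair of embeddings closer than `margin`."""
    penalty = 0.0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            penalty += max(0.0, margin - angle(embeddings[i], embeddings[j]))
    return penalty


# Hypothetical 2-D task embeddings: the third sits 45 degrees from each of the others.
tasks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
margin = math.pi / 3  # require at least 60 degrees between tasks

print(angular_margin_penalty(tasks, margin))
```

Minimizing a penalty of this kind pushes a newly learned task embedding away from earlier ones, which is one way to preserve the inter-task distinctiveness the abstract refers to.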

[156] Bridging the Skill Gap in Clinical CBCT Interpretation with CBCTRepD

Qinxin Wu,Fucheng Niu,Hengchuan Zhu,Yifan Sun,Ye Shen,Xu Li,Han Wu,Leqi Liu,Zhiwen Pan,Zuozhu Liu,Fudong Zhu,Bin Feng

Main category: cs.CV

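TL;DR: This paper presents CBCTRepD, a report-generation system for oral and maxillofacial cone-beam CT (CBCT). Built on a high-quality paired CBCT-report dataset of about 7,408 studies and evaluated with a multi-level clinical framework, it is shown to improve report structure, reduce missed findings, and assist radiologists at different experience levels.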

Details

Motivation: Generative AI has progressed quickly in medical report generation, but its use for oral and maxillofacial CBCT reporting remains limited, mainly because high-quality paired CBCT-report data is scarce and volumetric CBCT interpretation is inherently complex. Method: Curate a large bilingual paired CBCT-report dataset (7,408 studies, 55 oral disease entities, diverse acquisition settings); develop the CBCTRepD system; and establish a multi-level clinical evaluation framework that combines automatic metrics with radiologist and clinician assessment, covering both AI drafts and radiologist-edited collaborative reports. Result: CBCTRepD drafts match intermediate radiologists in writing quality and standardization; in collaboration, it raises novices toward intermediate level, helps intermediate radiologists approach senior level, and helps senior radiologists reduce omissions, including clinically important missed lesions; it also improves report structure and promotes attention to co-existing lesions across anatomical regions. Conclusion: CBCTRepD is practical and reliable, and it has the potential to serve as an effective clinical assistant for CBCT reporting across multi-level care settings. Abstract: Generative AI has advanced rapidly in medical report generation; however, its application to oral and maxillofacial CBCT reporting remains limited, largely because of the scarcity of high-quality paired CBCT-report data and the intrinsic complexity of volumetric CBCT interpretation. To address this, we introduce CBCTRepD, a bilingual oral and maxillofacial CBCT report-generation system designed for integration into routine radiologist-AI co-authoring workflows. We curated a large-scale, high-quality paired CBCT-report dataset comprising approximately 7,408 studies, covering 55 oral disease entities across diverse acquisition settings, and used it to develop the system. We further established a clinically grounded, multi-level evaluation framework that assesses both direct AI-generated drafts and radiologist-edited collaboration reports using automatic metrics together with radiologist- and clinician-centered evaluation. Using this framework, we show that CBCTRepD achieves superior report-generation performance and produces drafts with writing quality and standardization comparable to those of intermediate radiologists. More importantly, in radiologist-AI collaboration, CBCTRepD provides consistent and clinically meaningful benefits across experience levels: it helps novice radiologists improve toward intermediate-level reporting, enables intermediate radiologists to approach senior-level performance, and even assists senior radiologists by reducing omission-related errors, including clinically important missed lesions. 
By improving report structure, reducing omissions, and promoting attention to co-existing lesions across anatomical regions, CBCTRepD shows strong and reliable potential as a practical assistant for real-world CBCT reporting across multi-level care settings.

[157] Pointy - A Lightweight Transformer for Point Cloud Foundation Models

Konrad Szafer,Marek Kraft,Dominik Belter

Main category: cs.CV

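TL;DR: This paper proposes a lightweight transformer-based point cloud architecture. Trained on only 39k point clouds, it outperforms several larger foundation models trained on more than 200k samples and approaches state-of-the-art models trained on over a million multimodal samples, underscoring the value of careful architecture design and a high-quality training setup.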

Details

Motivation: Existing point cloud foundation models often depend on large-scale cross-modal supervision (language or vision), which is costly in compute and data; this work explores a more controlled, efficient, and reproducible lightweight alternative. Method: Design a lightweight, tokenizer-free transformer architecture for point clouds, and train and evaluate it within a strictly unified experimental framework that includes a standardized training regime, benchmarks, and a replication study across architectures. Result: With only 39k point clouds, the model outperforms several larger models and approaches, on multiple benchmarks, the performance of models trained on over a million multimodal samples, showing that simple backbones are competitive when carefully designed and trained. Conclusion: Point cloud representation learning does not necessarily require massive data or multimodal supervision; a lightweight architecture, high-quality data, and standardized evaluation can deliver strong, reproducible performance at low cost. Abstract: Foundation models for point cloud data have recently grown in capability, often leveraging extensive representation learning from language or vision. In this work, we take a more controlled approach by introducing a lightweight transformer-based point cloud architecture. In contrast to the heavy reliance on cross-modal supervision, our model is trained only on 39k point clouds - yet it outperforms several larger foundation models trained on over 200k training samples. Interestingly, our method approaches state-of-the-art results from models that have seen over a million point clouds, images, and text samples, demonstrating the value of a carefully curated training setup and architecture. To ensure rigorous evaluation, we conduct a comprehensive replication study that standardizes the training regime and benchmarks across multiple point cloud architectures. This unified experimental framework isolates the impact of architectural choices, allowing for transparent comparisons and highlighting the benefits of our design and other tokenizer-free architectures. Our results show that simple backbones can deliver results competitive with more complex or data-rich strategies. The implementation, including code, pre-trained models, and training protocols, is available at https://github.com/KonradSzafer/Pointy.

[158] COMIC: Agentic Sketch Comedy Generation

Susung Hong,Brian Curless,Ira Kemelmacher-Shlizerman,Steve Seitz

Main category: cs.CV

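TL;DR: This paper proposes a fully automated AI system that generates short comedy videos in the style of Saturday Night Live. Multi-agent collaboration modeled on the film production process, together with LLM critics aligned with real viewer preferences that automatically evaluate humor, yields comedy videos approaching professional quality.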

Details

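Motivation: Existing AI video generation systems cannot automatically assess the quality of comedic content, humor in particular, and they struggle to reproduce the creative iteration and diversity optimization found in real productions. Method: Build a role-based multi-agent system modeled on a production studio, and introduce LLM critics aligned with viewer preferences through a corpus of YouTube comedy videos, driving iterative competition, evaluation, and improvement of ideas. Result: The framework produces comedy videos whose quality approaches professionally produced sketches and reaches state-of-the-art performance in video generation. Conclusion: Combining viewer-preference modeling with collaborative multi-agent creation effectively improves the quality and credibility of AI-generated comedy and suggests a new design pattern for creative AI systems. Abstract: We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction of LLM critics aligned with real viewer preferences through the analysis of a corpus of comedy videos on YouTube to automatically evaluate humor. Our experiments show that our framework produces results approaching the quality of professionally produced sketches while demonstrating state-of-the-art performance in video generation.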

[159] Contrastive learning-based video quality assessment-jointed video vision transformer for video recognition

Jian Sun,Mohammad H. Mahoor

Main category: cs.CV

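TL;DR: This paper proposes SSL-V3, a video vision transformer that combines self-supervised learning with no-reference video quality assessment (VQA) to improve video classification, with the largest benefit on low-quality videos.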

Details

Motivation: Video quality strongly affects classification performance, especially in practical settings such as healthcare where videos are often blurred; existing VQA methods are limited by scarce labels and are hard to apply directly. Method: Propose SSL-V3 with a Combined-SSL mechanism that treats the video quality score as an intermediate variable, folds it into the refinement of the classification feature map, and uses the supervised classification task to optimize the no-reference VQA module in return. Result: The model reaches 94.87% classification accuracy on datasets including I-CONECT, supporting its robustness and effectiveness on low-quality videos. Conclusion: Jointly modeling no-reference VQA and video classification through self-supervision is feasible and effective, offering a new approach for quality-sensitive video understanding tasks. Abstract: Video quality significantly affects video classification. We observed this when classifying Mild Cognitive Impairment: performance was good on clear videos but worse on blurred ones. This suggested that incorporating Video Quality Assessment (VQA) may improve video classification. This paper proposes a Self-Supervised Learning-based Video Vision Transformer combined with No-reference VQA for video classification (SSL-V3). SSL-V3 leverages a Combined-SSL mechanism to integrate VQA into video classification and to address the shortage of VQA labels, which is common in video datasets and makes it impossible to provide an accurate Video Quality Score. In brief, Combined-SSL uses the video quality score as a factor that directly tunes the feature map of the video classification. The score then acts as an intersection point linking VQA and classification, so that the supervised classification task tunes the parameters of VQA. SSL-V3 achieved robust experimental results on two datasets. For example, it reached an accuracy of 94.87% on some interview videos in I-CONECT (a healthcare dataset involving facial videos), verifying SSL-V3's effectiveness.

[160] Med-DualLoRA: Local Adaptation of Foundation Models for 3D Cardiac MRI

Joan Perramon-Llussà,Amelia Jiménez-Sánchez,Grzegorz Skorupko,Fotis Avgoustidis,Carlos Martín-Isla,Karim Lekadir,Polyxeni Gkontra

Main category: cs.CV

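TL;DR: This paper proposes Med-DualLoRA, a federated fine-tuning framework for multi-center cardiac MRI disease detection. By separating globally shared and locally private LoRA modules, it improves personalization while preserving privacy and communication efficiency.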

Details

Motivation: Fine-tuning on single-center data tends to introduce bias, while centralized fine-tuning is restricted by medical data privacy; conventional federated fine-tuning performs poorly on non-IID multi-center data and incurs high communication cost. Method: Propose Med-DualLoRA, a client-aware, parameter-efficient federated fine-tuning framework based on LoRA that decomposes adaptation into a globally shared LoRA and a locally private LoRA and aggregates only the global part; fine-tune only two transformer blocks for further efficiency. Result: On the multi-center CMR datasets ACDC and M&Ms, Med-DualLoRA significantly outperforms other federated PEFT baselines (balanced accuracy 0.768, specificity 0.612) while greatly reducing communication cost. Conclusion: Med-DualLoRA offers a scalable, privacy-preserving solution for local federated adaptation of medical foundation models under realistic clinical constraints. Abstract: Foundation models (FMs) show great promise for robust downstream performance across medical imaging tasks and modalities, including cardiac magnetic resonance (CMR), following task-specific adaptation. However, adaptation using single-site data may lead to suboptimal performance and increased model bias, while centralized fine-tuning on clinical data is often infeasible due to privacy constraints. Federated fine-tuning offers a privacy-preserving alternative; yet conventional approaches struggle under heterogeneous, non-IID multi-center data and incur substantial communication overhead when adapting large models. In this work, we study federated FM fine-tuning for 3D CMR disease detection and propose Med-DualLoRA, a client-aware parameter-efficient fine-tuning (PEFT) federated framework that disentangles globally shared and local low-rank adaptations (LoRA) through additive decomposition. Global and local LoRA modules are trained locally, but only the global component is shared and aggregated across sites, keeping local adapters private. This design improves personalization while significantly reducing communication cost, and experiments show that adapting only two transformer blocks preserves performance while further improving efficiency. We evaluate our method on a multi-center state-of-the-art cine 3D CMR FM fine-tuned for disease detection using ACDC and combined M&Ms datasets, treating each vendor as a federated client. 
Med-DualLoRA achieves statistically significant improved performance (balanced accuracy 0.768, specificity 0.612) compared to other federated PEFT baselines, while maintaining communication efficiency. Our approach provides a scalable solution for local federated adaptation of medical FMs under realistic clinical constraints.
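The additive decomposition described above can be sketched with flat parameter vectors: each client holds a global and a local update, only the global part is averaged across sites, and each site then uses the sum of the shared global update and its own private local update. Weighting the average by sample count follows the usual FedAvg convention and is an assumption here, as are the client names and numbers.

```python
def effective_update(global_delta, local_delta):
    """Additive decomposition: the adapter a client applies is global + local."""
    return [g + l for g, l in zip(global_delta, local_delta)]


def aggregate_global(client_globals, weights):
    """Weighted average of the clients' global LoRA updates (FedAvg-style assumption)."""
    total = sum(weights)
    dim = len(client_globals[0])
    return [sum(w * g[k] for w, g in zip(weights, client_globals)) / total
            for k in range(dim)]


# Hypothetical clients (one per scanner vendor) with tiny 2-parameter adapters.
clients = {
    "vendor_a": {"global": [1.0, 2.0], "local": [0.5, 0.0], "n": 100},
    "vendor_b": {"global": [3.0, 0.0], "local": [-0.5, 1.0], "n": 300},
}

# Only the global components leave each site; local adapters stay private.
shared = aggregate_global([c["global"] for c in clients.values()],
                          [c["n"] for c in clients.values()])

personalized = {name: effective_update(shared, c["local"])
                for name, c in clients.items()}
print(shared, personalized)
```

Since only the global vector is transmitted, the communication cost per round scales with the size of the global adapter rather than with the full model or with both adapters.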

[161] VCR: Variance-Driven Channel Recalibration for Robust Low-Light Enhancement

Zhixin Cheng,Fangwen Zhang,Xiaotian Yin,Baoqun Yin,Haodian Wang

Main category: cs.CV

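TL;DR: This paper proposes VCR, a low-light image enhancement framework whose Channel Adaptive Adjustment (CAA) and Color Distribution Alignment (CDA) modules address the inconsistency between luminance and chrominance channels and the misaligned color distribution in the HVI space, improving enhancement results and perceptual quality.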

Details

Motivation: In low-light image enhancement, the sRGB color space entangles luminance and color and the HSV space introduces noise; HVI improves on both but still suffers from channel inconsistency and misaligned color distribution, which lead to unnatural results. Method: Propose the Variance-Driven Channel Recalibration (VCR) framework with two core modules: (1) Channel Adaptive Adjustment (CAA), which uses variance-guided feature filtering to strengthen attention to regions with high intensity and well-formed chrominance distribution; (2) Color Distribution Alignment (CDA), which enforces distribution alignment in the color feature space. Result: Experiments on several benchmark datasets show state-of-the-art performance. Conclusion: Through variance-driven channel recalibration, VCR mitigates channel inconsistency and color distribution mismatch in the HVI space, improving the fidelity and naturalness of low-light image enhancement. Abstract: Most sRGB-based LLIE methods suffer from entangled luminance and color, while the HSV color space offers insufficient decoupling at the cost of introducing significant red and black noise artifacts. Recently, the HVI color space has been proposed to address these limitations by enhancing color fidelity through chrominance polarization and intensity compression. However, existing methods can suffer from channel-level inconsistency between luminance and chrominance, and a misaligned color distribution may lead to unnatural enhancement results. To address these challenges, we propose Variance-Driven Channel Recalibration for Robust Low-Light Enhancement (VCR), a novel framework for low-light image enhancement. VCR consists of two main components: the Channel Adaptive Adjustment (CAA) module, which employs variance-guided feature filtering to enhance the model's focus on regions with high intensity and color distribution, and the Color Distribution Alignment (CDA) module, which enforces distribution alignment in the color feature space. These designs enhance perceptual quality under low-light conditions. Experimental results on several benchmark datasets demonstrate that the proposed method achieves state-of-the-art performance compared with existing methods.

[162] GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations

Boyuan Chen,Minghao Shao,Siddharth Garg,Ramesh Karri,Muhammad Shafique

Main category: cs.CV

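TL;DR: This paper proposes GroundCount, a framework that feeds the spatial localization and counting ability of CNN object detectors (such as YOLO) into vision-language models (VLMs) to reduce hallucinations in counting tasks. On Ovis2.5-2B it raises accuracy by 6.6 percentage points to 81.3% and cuts inference time by 22%. Experiments show that explicit symbolic prompts outperform implicit feature fusion, and that positional encoding and confidence handling must be adapted to the strength of the model.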

Details

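Motivation: Vision-language models (VLMs) hallucinate heavily in counting tasks, with accuracy well below that of other visual reasoning tasks, even for advanced reasoning-capable VLMs; CNN object detection models (ODMs) are efficient and accurate at spatial localization and instance counting, offering a complementary capability. Method: Propose GroundCount, a prompt-based augmentation strategy that injects explicit spatial grounding from an ODM such as YOLO (bounding boxes, counts, positions) into the VLM input as a structured prompt; run ablations comparing positional encoding, use of confidence scores, and explicit prompting versus implicit feature fusion (including cross-attention). Result: Counting accuracy on Ovis2.5-2B reaches 81.3% (+6.6pp) with 22% less inference time; four VLMs improve by 6.2 to 7.5pp; positional encoding helps strong models but hurts weak ones; removing confidence scores improves performance in 4 of 5 models; structured prompts clearly outperform feature-level fusion; one VLM degrades because its iterative reflection mechanism is incompatible with structured prompts. Conclusion: VLM counting failures stem from limitations in integrating spatial and semantic representations rather than from a particular architecture; augmentation strategies must account for compatibility with the base architecture (for example, whether it uses iterative reflection); explicit, lightweight, modular spatial grounding (ODM plus structured prompt) is more effective and efficient than complex implicit fusion. Abstract: Vision Language Models (VLMs) exhibit persistent hallucinations in counting tasks, with accuracy substantially lower than other visual reasoning tasks (excluding sentiment). This phenomenon persists even in state-of-the-art reasoning-capable VLMs. Conversely, CNN-based object detection models (ODMs) such as YOLO excel at spatial localization and instance counting with minimal computational overhead. We propose GroundCount, a framework that augments VLMs with explicit spatial grounding from ODMs to mitigate counting hallucinations. In the best case, our prompt-based augmentation strategy achieves 81.3% counting accuracy on the best-performing model (Ovis2.5-2B) - a 6.6pp improvement - while reducing inference time by 22% through elimination of hallucination-driven reasoning loops for stronger models. We conduct comprehensive ablation studies demonstrating that positional encoding is a critical component, being beneficial for stronger models but detrimental for weaker ones. Confidence scores, by contrast, introduce noise for most architectures and their removal improves performance in four of five evaluated models. We further evaluate feature-level fusion architectures, finding that explicit symbolic grounding via structured prompts outperforms implicit feature fusion despite sophisticated cross-attention mechanisms. 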
Our approach yields consistent improvements across four of five evaluated VLM architectures (6.2 to 7.5pp), with one architecture exhibiting degraded performance due to incompatibility between its iterative reflection mechanisms and structured prompts. These results suggest that counting failures stem from fundamental spatial-semantic integration limitations rather than architecture-specific deficiencies, while highlighting the importance of architectural compatibility in augmentation strategies.
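The prompt-based augmentation can be illustrated by turning a detector's output into a short structured text block placed in front of the question. The sketch below assumes detections arrive as dictionaries with a label, a box, and a score; that layout, the prompt wording, and the function name are hypothetical and not taken from the GroundCount code. Confidence scores are deliberately left out of the prompt, in line with the ablation finding that removing them helped most models, and positions are optional because they helped stronger models but hurt weaker ones.

```python
from collections import Counter


def grounding_prompt(detections, question, include_positions=True):
    """Build a structured text prompt from object detections (scores are omitted)."""
    counts = Counter(d["label"] for d in detections)
    lines = ["Detected objects:"]
    for label in sorted(counts):
        lines.append(f"- {label}: {counts[label]}")
    if include_positions:
        lines.append("Positions (x1, y1, x2, y2):")
        for d in detections:
            lines.append(f"- {d['label']} at {tuple(d['box'])}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Hypothetical detector output for one image.
detections = [
    {"label": "apple", "box": [10, 20, 50, 60], "score": 0.91},
    {"label": "apple", "box": [70, 22, 110, 64], "score": 0.88},
    {"label": "banana", "box": [30, 90, 120, 130], "score": 0.75},
]

prompt = grounding_prompt(detections, "How many apples are in the image?")
compact = grounding_prompt(detections, "How many apples are in the image?",
                           include_positions=False)
print(prompt)
```

The resulting text would be passed to the VLM together with the image; keeping the grounding in plain text is what makes the approach independent of the model's internal feature layout.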

[163] Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity

Zhengyao Fang,Zexi Jia,Yijia Zhong,Pengcheng Luo,Jinchao Zhang,Guangming Lu,Jun Yu,Wenjie Pei

Main category: cs.CV

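TL;DR: This paper proposes the Color Fidelity Dataset (CFD) and the Color Fidelity Metric (CFM) for objective evaluation of color fidelity in realistic-style text-to-image generation, and designs a training-free Color Fidelity Refinement (CFR) method that improves the color authenticity of generated images, forming a progressive framework that unifies assessment and improvement.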

Details

Motivation: Existing T2I evaluation paradigms (such as human ratings and preference-trained metrics) favor highly saturated, high-contrast images, so generations come out too vivid to look real and the color fidelity of realistic-style images is hard to measure accurately. Method: Builds the Color Fidelity Dataset (CFD) of over 1.3M images and proposes the Color Fidelity Metric (CFM), based on a multimodal encoder; also designs a training-free Color Fidelity Refinement (CFR) that improves generation by adaptively modulating the spatial-temporal guidance scale. Result: CFM outperforms existing metrics at assessing color fidelity; CFR markedly improves color authenticity in realistic-style generation across several mainstream T2I models without additional training. Conclusion: CFD, CFM, and CFR together form a unified framework for assessing and enhancing color fidelity in realistic-style T2I generation, offering a new paradigm for more objective and reliable evaluation and improvement of image realism. Abstract: Recent advances in text-to-image (T2I) generation have greatly improved visual quality, yet producing images that appear visually authentic to real-world photography remains challenging. This is partly due to biases in existing evaluation paradigms: human ratings and preference-trained metrics often favor visually vivid images with exaggerated saturation and contrast, which make generations often too vivid to be real even when prompted for realistic-style images. To address this issue, we present Color Fidelity Dataset (CFD) and Color Fidelity Metric (CFM) for objective evaluation of color fidelity in realistic-style generations. CFD contains over 1.3M real and synthetic images with ordered levels of color realism, while CFM employs a multimodal encoder to learn perceptual color fidelity. In addition, we propose a training-free Color Fidelity Refinement (CFR) that adaptively modulates spatial-temporal guidance scale in generation, thereby enhancing color authenticity. Together, CFD supports CFM for assessment, whose learned attention further guides CFR to refine T2I fidelity, forming a progressive framework for assessing and improving color fidelity in realistic-style T2I generation. The dataset and code are available at https://github.com/ZhengyaoFang/CFM.

[164] Does AI See like Art Historians? Interpreting How Vision Language Models Recognize Artistic Style

Marvin Limpijankit,Milad Alshomary,Yassin Oulad Daoud,Amith Ananthram,Tim Trombley,Elias Stengel-Eskin,Mohit Bansal,Noam M. Elcott,Kathleen McKeown

Main category: cs.CV

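TL;DR: Through an interdisciplinary collaboration, this paper studies the internal mechanisms by which vision language models (VLMs) predict artistic style and assesses how well they align with the criteria art historians use; it applies a latent-space decomposition method to extract the concepts that drive style prediction and, combining quantitative evaluation, causal analysis, and judgments by art historians, finds that most extracted concepts are semantically coherent and relevant.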

Details

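Motivation: To examine whether the visual concepts VLMs rely on when predicting artistic style match the professional criteria of art history, bridging the gap in understanding between AI and the humanities. Method: Uses a latent-space decomposition approach to identify the latent concepts that drive art style prediction, validated from several angles through quantitative evaluation, causal analysis, and expert review by art historians. Result: Art historians judged 73% of the extracted concepts to exhibit a coherent and semantically meaningful visual feature, and 90% of the concepts used to predict the style of a given artwork to be relevant; for the few cases where an irrelevant concept still led to a successful prediction, the experts offered possible explanations (e.g., the model "understands" a concept in formal terms such as dark/light contrast). Conclusion: Most of the concepts VLMs rely on for art style prediction agree closely with art-historical judgment, indicating a degree of interpretability and relevance to the humanities, although some formal-level divergences in understanding still need further explanation. Abstract: VLMs have become increasingly proficient at a range of computer vision tasks, such as visual question answering and object detection. This includes increasingly strong capabilities in the domain of art, from analyzing artwork to generation of art. In an interdisciplinary collaboration between computer scientists and art historians, we characterize the mechanisms underlying VLMs' ability to predict artistic style and assess the extent to which they align with the criteria art historians use to reason about artistic style. We employ a latent-space decomposition approach to identify concepts that drive art style prediction and conduct quantitative evaluations, causal analysis and assessment by art historians. Our findings indicate that 73% of the extracted concepts are judged by art historians to exhibit a coherent and semantically meaningful visual feature and 90% of concepts used to predict style of a given artwork were judged relevant. In cases where an irrelevant concept was used to successfully predict style, art historians identified possible reasons for its success; for example, the model might "understand" a concept in more formal terms, such as dark/light contrasts.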

[165] DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving

Shuyao Shang,Bing Zhan,Yunfei Yan,Yuqi Wang,Yingyan Li,Yasong An,Xiaoman Wang,Jierui Liu,Lu Hou,Lue Fan,Zhaoxiang Zhang,Tieniu Tan

Main category: cs.CV

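TL;DR: DynVLA is a new driving vision-language-action (VLA) model that proposes the "Dynamics CoT" paradigm, using a Dynamics Tokenizer to compress world dynamics and decoupling ego-centric from environment-centric dynamics to achieve more accurate, efficient, and physically interpretable decision-making.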

Details

Motivation: Existing Textual CoT lacks fine-grained spatiotemporal understanding, and Visual CoT introduces redundancy through dense image prediction; the complex interactions in driving scenes call for more compact, accurate, and physically plausible modeling of world dynamics. Method: Proposes the Dynamics CoT paradigm; designs a Dynamics Tokenizer that compresses future dynamics into a small number of dynamics tokens; decouples ego-centric and environment-centric dynamics; trains the model with SFT and RFT to generate dynamics tokens before generating actions. Result: On NAVSIM, Bench2Drive, and a large-scale in-house dataset, DynVLA consistently outperforms Textual CoT and Visual CoT methods. Conclusion: Dynamics CoT strikes a better balance among compactness, interpretability, and inference efficiency, improving the decision quality and practicality of driving VLA models. Abstract: We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.

[166] V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

Yan-Bo Lin,Jonah Casebeer,Long Mai,Aniruddha Mahapatra,Gedas Bertasius,Nicholas J. Bryan

Main category: cs.CV

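TL;DR: This paper proposes V2M-Zero, which needs no paired video-music data and uses intra-modal event curves to generate music that is time-aligned with video, substantially outperforming methods trained on paired data across several benchmarks.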

Details

Motivation: Existing text-to-music models lack fine-grained temporal control and struggle to generate music aligned in time with video events; the authors observe that temporal synchronization depends on "when" and "how much" change occurs rather than "what" changes, and that audio and video share temporal structure across modalities. Method: Uses pretrained music and video encoders to compute similarity within each modality separately and builds intra-modal event curves that represent temporal change; a text-to-music model is first fine-tuned on music-event curves, and at inference these are simply replaced with video-event curves, with no cross-modal training or paired data. Result: On the OES-Pub, MovieGenBench-Music, and AIST++ benchmarks, relative to paired-data baselines: audio quality improves by 5-21%, semantic alignment by 13-15%, temporal synchronization by 21-52%, and beat alignment on dance videos by 28%; a crowd-sourced subjective listening test gives consistent results. Conclusion: Modeling temporal structure from intra-modal features alone, without paired cross-modal supervision, is sufficient for effective time-aligned video-to-music generation. Abstract: Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that outputs time-aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at https://genjib.github.io/v2m_zero/
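
As an illustration of what an intra-modal event curve could look like, here is a minimal sketch that measures change between consecutive embeddings from a single encoder. The abstract only says the curves are "computed from intra-modal similarity", so the specific choice of one minus cosine similarity between adjacent time steps is an assumption, and the names `cosine` and `event_curve` are hypothetical. The embeddings themselves would come from a pretrained music or video encoder, which is not modeled here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def event_curve(embeddings):
    """Assumed form of an event curve: 1 - cosine similarity of adjacent steps.

    `embeddings` is a sequence of per-time-step feature vectors from ONE
    modality. High values mark moments where the content changes sharply.
    """
    return [1.0 - cosine(a, b) for a, b in zip(embeddings, embeddings[1:])]
```

Because the curve records only when and how much the features change, the same function can be applied to music embeddings during fine-tuning and to video embeddings at inference, which is the substitution the abstract describes.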

[167] Agentar-Fin-OCR

Siyi Qian,Xiongfei Bai,Bingtao Fu,Yichen Lu,Gaoyang Zhang,Xudong Yang,Peng Zhang

Main category: cs.CV

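TL;DR: This paper presents Agentar-Fin-OCR, a parsing system designed for financial-domain documents that converts ultra-long financial PDFs into semantically consistent, highly accurate structured outputs, and introduces the FinDocBench benchmark for evaluating model performance on financial documents.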

Details

Motivation: To address challenges specific to financial documents, such as complex layouts, structural discontinuities across pages, and cell-level referencing, and to improve the accuracy and auditability of document parsing. Method: Proposes a Cross-page Contents Consolidation algorithm to restore continuity across pages; designs a Document-level Heading Hierarchy Reconstruction (DHR) module to build a globally consistent table-of-contents tree; trains table parsing with a difficulty-adaptive curriculum learning strategy; develops a CellBBoxRegressor module that uses structural anchor tokens to localize table cells from decoder hidden states. Result: The system performs strongly on the table parsing metrics of OmniDocBench; a range of SOTA models are evaluated on the newly built FinDocBench benchmark (with metrics including TocEDS, cross-page TEDS, and C-IoU), supporting the strength of Agentar-Fin-OCR. Conclusion: Agentar-Fin-OCR and FinDocBench together provide a reliable, practical technical foundation for downstream financial document applications. Abstract: In this paper, we propose Agentar-Fin-OCR, a document parsing system tailored to financial-domain documents, transforming ultra-long financial PDFs into semantically consistent, highly accurate, structured outputs with auditing-grade provenance. To address finance-specific challenges such as complex layouts, cross-page structural discontinuities, and cell-level referencing capability, Agentar-Fin-OCR combines (1) a Cross-page Contents Consolidation algorithm to restore continuity across pages and a Document-level Heading Hierarchy Reconstruction (DHR) module to build a globally consistent Table of Contents (TOC) tree for structure-aware retrieval, and (2) a difficulty-adaptive curriculum learning training strategy for table parsing, together with a CellBBoxRegressor module that uses structural anchor tokens to localize table cells from decoder hidden states without external detectors. Experiments demonstrate that our model shows high performance on the table parsing metrics of OmniDocBench. To enable realistic evaluation in the financial vertical, we further introduce FinDocBench, a benchmark that includes six financial document categories with expert-verified annotations and evaluation metrics including Table of Contents edit-distance-based similarity (TocEDS), cross-page concatenated TEDS, and Table Cell Intersection over Union (C-IoU). We evaluate a wide range of state-of-the-art models on FinDocBench to assess their capabilities and remaining limitations on financial documents. Overall, Agentar-Fin-OCR and FinDocBench provide a practical foundation for reliable downstream financial document applications.
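
For readers unfamiliar with the quantity behind a metric such as Table Cell Intersection over Union, the sketch below computes the standard IoU of two axis-aligned boxes. The abstract does not define how C-IoU matches predicted cells to ground-truth cells or aggregates over a table, so this covers only the per-pair building block; the function name `box_iou` and the `(x1, y1, x2, y2)` box layout are assumptions for illustration.

```python
def box_iou(a, b):
    """Standard IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Degenerate (zero-area) inputs are scored as no overlap.
    return inter / union if union > 0 else 0.0
```

A cell-level score would presumably average such values over matched predicted and ground-truth cells, but that matching step is left out here because the source does not specify it.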

[168] LiTo: Surface Light Field Tokenization

Jen-Hao Rick Chang,Xiaoming Zhao,Dorian Chan,Oncel Tuzel

Main category: cs.CV

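TL;DR: This paper proposes a 3D latent representation that jointly models object geometry and view-dependent appearance: surface light field samples drawn from RGB-D images are encoded into compact latent vectors, reproducing effects such as specular highlights and Fresnel reflections realistically, and a latent flow matching model built on the representation generates 3D objects conditioned on a single image.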

Details

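Motivation: Most existing methods focus only on reconstructing 3D geometry or predicting view-independent diffuse appearance, and struggle to model realistic view-dependent optical effects such as specular highlights and Fresnel reflections. Method: Takes the surface light field samples provided by RGB-D images, randomly subsamples them, and encodes them into a compact set of 3D latent vectors; builds a unified 3D latent space that represents geometry and appearance together; then trains a conditional latent flow matching model on this representation to generate 3D objects conditioned on a single input image. Result: Outperforms existing methods in both visual quality and input fidelity, and accurately reproduces view-dependent appearance effects under complex lighting. Conclusion: The proposed joint geometry-appearance 3D latent representation and conditional generation framework effectively addresses view-dependent appearance modeling and improves high-quality 3D content generation from a single image. Abstract: We propose a 3D latent representation that jointly models object geometry and view-dependent appearance. Most prior works focus on either reconstructing 3D geometry or predicting view-independent diffuse appearance, and thus struggle to capture realistic view-dependent effects. Our approach leverages that RGB-depth images provide samples of a surface light field. By encoding random subsamples of this surface light field into a compact set of latent vectors, our model learns to represent both geometry and appearance within a unified 3D latent space. This representation reproduces view-dependent effects such as specular highlights and Fresnel reflections under complex lighting. We further train a latent flow matching model on this representation to learn its distribution conditioned on a single input image, enabling the generation of 3D objects with appearances consistent with the lighting and materials in the input. Experiments show that our approach achieves higher visual quality and better input fidelity than existing methods.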

Table of Contents

cs.CL [Back]

[1] GhazalBench: Usage-Grounded Evaluation of LLMs on Persian Ghazals

[2] Large Language Models and Book Summarization: Reading or Remembering, Which Is Better?

[3] AraModernBERT: Transtokenized Initialization and Long-Context Encoder Modeling for Arabic

[4] An Efficient Hybrid Deep Learning Approach for Detecting Online Abusive Language

[5] The Dunning-Kruger Effect in Large Language Models: An Empirical Study of Confidence Calibration

[6] Quantifying Hallucinations in Language Language Models on Medical Textbooks

[7] Evolving Demonstration Optimization for Chain-of-Thought Feature Transformation

[8] Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations

[9] The System Hallucination Scale (SHS): A Minimal yet Effective Human-Centered Instrument for Evaluating Hallucination-Related Behavior in Large Language Models

[10] A Two-Stage Architecture for NDA Analysis: LLM-based Segmentation and Transformer-based Clause Classification

[11] PoultryLeX-Net: Domain-Adaptive Dual-Stream Transformer Architecture for Large-Scale Poultry Stakeholder Modeling

[12] TAMUSA-Chat: A Domain-Adapted Large Language Model Conversational System for Research and Responsible Deployment

[13] CEI: A Benchmark for Evaluating Pragmatic Reasoning in Language Models

[14] Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives

[15] Context Over Compute Human-in-the-Loop Outperforms Iterative Chain-of-Thought Prompting in Interview Answer Quality

[16] There Are No Silly Questions: Evaluation of Offline LLM Capabilities from a Turkish Perspective

[17] Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations

[18] Automated evaluation of LLMs for effective machine translation of Mandarin Chinese to English

[19] A Retrieval-Augmented Language Assistant for Unmanned Aircraft Safety Assessment and Regulatory Compliance

[20] Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought

[21] Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America

[22] SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks

[23] Probing the Limits of the Lie Detector Approach to LLM Deception

[24] Fine-Tune, Don't Prompt, Your Language Model to Identify Biased Language in Clinical Notes

[25] SENS-ASR: Semantic Embedding injection in Neural-transducer for Streaming Automatic Speech Recognition

[26] Adaptive Engram Memory System for Indonesian Language Model: Generative AI Based on TOBA LM for Batak and Minang Language

[27] GATech at AbjadGenEval Shared Task: Multilingual Embeddings for Arabic Machine-Generated Text Classification

[28] GATech at AbjadMed: Bidirectional Encoders vs. Causal Decoders: Insights from 82-Class Arabic Medical Classification

[29] FERRET: Framework for Expansion Reliant Red Teaming

[30] Gemma Needs Help: Investigating and Mitigating Emotional Instability in LLMs

[31] Measuring and Eliminating Refusals in Military Large Language Models

[32] Evaluating Progress in Graph Foundation Models: A Comprehensive Benchmark and New Insights

[33] A Principle-Driven Adaptive Policy for Group Cognitive Stimulation Dialogue for Elderly with Cognitive Impairment

[34] TriageSim: A Conversational Emergency Triage Simulation Framework from Structured Electronic Health Records

[35] The Prediction-Measurement Gap: Toward Meaning Representations as Scientific Instruments

[36] The Generation-Recognition Asymmetry: Six Dimensions of a Fundamental Divide in Formal Language Theory

[37] Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation

[38] Lost in Backpropagation: The LM Head is a Gradient Bottleneck

[39] OpenClaw-RL: Train Any Agent Simply by Talking

[40] Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models

[41] ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation

[42] Sabiá-4 Technical Report

[43] S-GRADES -- Studying Generalization of Student Response Assessments in Diverse Evaluative Settings

[44] GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

[45] Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas

[46] Large language models can disambiguate opioid slang on social media

[47] Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck

[48] Dynamic Knowledge Fusion for Multi-Domain Dialogue State Tracking

[49] Aligning Large Language Models with Searcher Preferences

[50] Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs

[51] PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses

[52] Human-AI Co-reasoning for Clinical Diagnosis with Evidence-Integrated Language Agent

[53] VERI-DPO: Evidence-Aware Alignment for Clinical Summarization via Claim Verification and Direct Preference Optimization

[54] Safe and Scalable Web Agent Learning via Recreated Websites

[55] AILS-NTUA at SemEval-2026 Task 8: Evaluating Multi-Turn RAG Conversations

[56] Automatic End-to-End Data Integration using Large Language Models

[57] End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering

[58] MUNIChus: Multilingual News Image Captioning Benchmark

[59] Disentangling Similarity and Relatedness in Topic Models

[60] Making Bielik LLM Reason (Better): A Field Report

[61] Prism-$Δ$: Differential Subspace Steering for Prompt Highlighting in Large Language Models

[62] HeartAgent: An Autonomous Agent System for Explainable Differential Diagnosis in Cardiology

[63] mAceReason-Math: A Dataset of High-Quality Multilingual Math Problems Ready For RLVR

[64] Word Recovery in Large Language Models Enables Character-Level Tokenization Robustness

[65] Large Language Models as Annotators for Machine Translation Quality Estimation

[66] Interpretable Chinese Metaphor Identification via LLM-Assisted MIPVU Rule Script Generation: A Comparative Protocol Study

[67] LuxBorrow: From Pompier to Pompjee, Tracing Borrowing in Luxembourgish

[68] Multilingual Reasoning Gym: Multilingual Scaling of Procedural Reasoning Environments

[69] PivotAttack: Rethinking the Search Trajectory in Hard-Label Text Attacks via Pivot Words

[70] SiDiaC-v.2.0: Sinhala Diachronic Corpus Version 2.0

[71] An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?

[72] From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers

[73] GLM-OCR Technical Report

[74] LLM2Vec-Gen: Generative Embeddings from Large Language Models

[75] Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge

[76] Instruction set for the representation of graphs

cs.CV [Back]

[77] 4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video