cs.CL

[1] ChiEngMixBench: Evaluating Large Language Models on Spontaneous and Natural Chinese-English Code-Mixed Generation

Qingyan Yang,Tongxi Wang,Yunsheng Luo

Main category: cs.CL

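TL;DR: This paper introduces ChiEngMixBench, the first Chinese-English code-mixing benchmark built for authentic community contexts. It frames code-mixing as a cognitive alignment problem, evaluates models along two dimensions (Spontaneity and Naturalness), and finds that large models implicitly follow a Terminology Layering Strategy consistent with the Matrix Language Frame (MLF) theory from linguistics.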

Details

Motivation: Existing work reduces code-mixing to a translation or convertibility problem, which makes it hard to assess whether a model's switching matches human contextual habits and linguistic conventions.

Method: Build the ChiEngMixBench benchmark and propose a cognitive-alignment evaluation framework based on Spontaneity and Naturalness; design a general data construction pipeline; combine empirical analysis with linguistic theory (MLF) to explain model behavior.

Result: The proposed metrics systematically distinguish the code-mixing ability of different models; multilingual large models are found to implicitly adopt a Terminology Layering Strategy consistent with the Matrix Language Frame theory.

Conclusion: Code-mixing should be treated as a cognitive alignment problem rather than a pure conversion task; ChiEngMixBench offers a scalable, interpretable evaluation paradigm for the cross-lingual interaction abilities of multilingual large models.

Abstract: Code-mixing is increasingly prevalent in interactions between humans and large language models, yet existing work often reduces it to a translation or convertibility problem, making it difficult to assess whether a model's switching behavior is context-appropriate and aligned with human conventions. We introduce ChiEngMixBench, the first benchmark designed to evaluate code-mixing ability in authentic community contexts, built upon a general construction pipeline that enables scalable dataset development across domains and bilingual pairs. ChiEngMixBench formulates code-mixing as a cognitive alignment problem, characterized by two complementary signals: Spontaneity and Naturalness. Empirical evaluation shows that our metrics can systematically distinguish code-mixing performance across models. Beyond benchmarking, we further uncover an implicitly emergent Terminology Layering Strategy, a phenomenon consistent with the Matrix Language Frame (MLF) theory, indicating structured cognitive alignment between multilingual large language models and human communication.

[2] M3Kang: Evaluating Multilingual Multimodal Mathematical Reasoning in Vision-Language Models

Aleix Torres-Camps,Nathaniel Mitrani Hadida,Víctor Conchello Vendrell,Àlex Batlle Casellas,Arnau Padrés Masdemont,Jordi Ros-Giralt

Main category: cs.CL

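TL;DR: This paper introduces M3Kang, the first massively multilingual, multimodal mathematical reasoning dataset, built from the international Kangaroo Math Competition and covering 108 languages and grade-level difficulties. Benchmarks on several SOTA vision-language models expose weaknesses in multilingual math and diagram reasoning and show that multilingual techniques carry over to the multimodal setting. The dataset and its construction framework are open-sourced.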

Details

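Motivation: Multilingual mathematical reasoning in existing vision-language models (VLMs) is underexplored, and direct comparison with human performance is especially lacking; a high-quality, large-scale, multilingual benchmark with diagrams is needed to fill this gap.

Method: Collect 1,747 unique multiple-choice problems from the Kangaroo Math Competition, organize them by grade level, and translate them into 108 languages (some problems include diagrams essential to the solution) to build M3Kang; design a unified evaluation protocol and evaluate closed- and open-source SOTA VLMs across languages, difficulty levels, and modalities; incorporate real answer data from more than 68,000 students for human-model comparison; explore how multilingual transfer techniques adapt to multimodal models.

Result: Current VLMs still lag clearly behind humans on basic math and diagram reasoning; performance scales with language coverage and model size but not with grade level; multilingual techniques significantly improve multimodal model performance; M3Kang enables a fine-grained human-model comparison of mathematical reasoning for the first time.

Conclusion: M3Kang provides an authoritative benchmark for multilingual multimodal mathematical reasoning and reveals key bottlenecks in cultural diversity, diagram understanding, and logical generalization; the open data and framework should advance research in this direction and promote educational equity and AI accessibility.

Abstract: Although state-of-the-art vision-language models (VLMs) have demonstrated strong reasoning capabilities, their performance in multilingual mathematical reasoning remains underexplored, particularly when compared to human performance. To bridge this gap, we introduce M3Kang, the first massively multilingual, multimodal mathematical reasoning dataset for VLMs. It is derived from the Kangaroo Math Competition, the world's largest mathematics contest, which annually engages over six million participants under the age of 18 across more than 90 countries. M3Kang includes 1,747 unique multiple-choice problems organized by grade-level difficulty, with translations into 108 culturally diverse languages, some of them including diagrams essential for solving them. Using this dataset, we conduct extensive benchmarking on both closed- and open-source SOTA models. We observe that, despite recent advances, models still struggle with basic math and diagram-based reasoning, with performance scaling with language presence and model size, but not with grade level. We also find that multilingual techniques can be effectively extended to the multimodal setting, resulting in significant improvements over baseline approaches. Our analysis also incorporates performance data from over 68,000 students, enabling direct comparison with human performance. We are open-sourcing M3Kang, including the English-only subset M2Kang, along with the framework and codebase used to construct the dataset.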

[3] Domain Specific Specialization in Low-Resource Settings: The Efficacy of Offline Response-Based Knowledge Distillation in Large Language Models

Erdem Aslan,Pakize Erdoğmuş

Main category: cs.CL

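TL;DR: This paper proposes an offline response-based knowledge distillation method that uses a small, high-quality, structured synthetic dataset (only 500 lines) to substantially raise a large language model's accuracy in a specialized domain (96.7%) and suppress hallucinations under constrained resources, showing that data quality and structural alignment matter more than data volume.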

Details

Motivation: Large language models (LLMs) perform well on general tasks but tend to hallucinate when handling domain or institution-specific knowledge absent from their pre-training, so an efficient, low-resource domain adaptation method is needed.

Method: Apply offline response-based knowledge distillation and compare three data strategies: general domain adaptation (15,000 lines), unstructured knowledge injection (2,000 lines), and a context-aware synthetic dataset generated by a teacher model (500 lines); use the Unsloth library to optimize the Qwen-2.5-7B student model, sharply reducing GPU memory requirements.

Result: The 500-line context-aware synthetic dataset reaches 96.7% accuracy with robust rejection capability; the larger unstructured datasets still hallucinate persistently; GPU memory use drops from 40 GB to 16 GB.

Conclusion: In low-resource settings, data quality and structural alignment matter more than data scale, which supports the LIMA hypothesis and offers a practical path to efficient domain adaptation.

Abstract: Large Language Models (LLMs) excel in general tasks but often struggle with hallucinations when handling domain-specific or institutional knowledge absent from their pre-training. We present an offline response-based knowledge distillation method that develops high-accuracy specialized assistants under constrained hardware resources. We evaluate three distinct data strategies: general domain adaptation (15,000 lines), unstructured knowledge injection (2,000 lines), and a context-aware synthetic dataset (500 lines) generated by a teacher model. To minimize computational costs, we utilize the Unsloth library to optimize the Qwen-2.5-7B student model, reducing NVIDIA A100 GPU memory requirements from 40 GB to 16 GB. Experimental results demonstrate that while larger unstructured datasets suffer from persistent hallucinations, the 500-line context-aware dataset achieves a 96.7% accuracy rate and robust rejection capability. These findings validate the LIMA hypothesis, showing that data quality and structural alignment are more critical than quantity for domain adaptation in low-resource settings.

[4] Towards Latent Diffusion Suitable For Text

Nesta Midavaine,Christian A. Naesseth,Grigory Bartosh

Main category: cs.CL

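TL;DR: This paper introduces Neural Flow Diffusion Models (NFDM) for language generation. By learning a multivariate forward process, it improves continuous diffusion models on discrete state spaces, substantially narrows the likelihood gap with autoregressive models of the same size, and achieves sample quality comparable to previous latent diffusion models.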

Details

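Motivation: Improve sampling speed and coherence in language generation and overcome the limitations of autoregressive large language models (LLMs).

Method: Introduce Neural Flow Diffusion Models (NFDM) for language generation, extending continuous diffusion models to discrete state spaces by learning a data-driven multivariate forward process so that the forward process and generative trajectory fit the language modeling task.

Result: The likelihood gap with autoregressive models of the same size is substantially reduced, while sample quality is comparable to previous latent diffusion models.

Conclusion: NFDM offers an effective and scalable diffusion modeling paradigm for discrete language modeling, combining efficient sampling with high-quality generation.

Abstract: Language diffusion models aim to improve sampling speed and coherence over autoregressive LLMs. We introduce Neural Flow Diffusion Models for language generation, an extension of NFDM that enables the straightforward application of continuous diffusion models to discrete state spaces. NFDM learns a multivariate forward process from the data, ensuring that the forward process and generative trajectory are a good fit for language modeling. Our model substantially reduces the likelihood gap with autoregressive models of the same size, while achieving sample quality comparable to that of previous latent diffusion models.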

[5] Limits of n-gram Style Control for LLMs via Logit-Space Injection

Sami-ul Ahmed

Main category: cs.CL

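TL;DR: This paper studies a lightweight way to control an LLM's writing style at decoding time by injecting n-gram style priors in logit space. Experiments show that the method works only within a very narrow parameter range and is outperformed overall by prompting and LoRA.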

Details

Motivation: Existing LLM personalization methods (prompt engineering, LoRA) either struggle to capture writing style precisely or are computationally expensive, so a lighter, training-free alternative is worth exploring.

Method: Train 1-to-3-gram models on stylistically distinct corpora (Don Quixote, news headlines, arXiv abstracts) to build an interpolated style prior; during generation, add the weighted log-probabilities of each n-gram order that matches the context (scaled by lambda) to the LLM's original logits.

Result: On TinyLlama-1.1B, style and fluency improve together only for the Don Quixote corpus at lambda=0.1 (style perplexity down 24.7%, base-model perplexity down 51.4%); all other settings degrade performance or cause text collapse; JS divergence and token-overlap analyses further confirm the instability.

Conclusion: Logit-space injection of n-gram style priors is lightweight and tunable but not robust: it works only in a narrow range of very low lambda values, performs worse overall than prompting and LoRA, and is not suited to serve as a mainstream style control method.

Abstract: Large language models (LLMs) are typically personalized via prompt engineering or parameter-efficient fine-tuning such as LoRA. However, writing style can be difficult to distill into a single prompt, and LoRA fine-tuning requires computationally intensive training and infrastructure. We investigate a possible lightweight alternative: steering a frozen LLM with n-gram style priors injected in logit space at decoding time. We train an n-gram model on stylistically distinct corpora -- including Don Quixote, CNN/DailyMail news headlines, and arXiv abstracts -- constructing an interpolated 1-to-3-gram prior over next-token probabilities. During generation we modify the LLM's logits by adding a weighted sum of style log-probabilities from each n-gram order that matches the current context, scaled by a control parameter lambda in [0, 1]. We sweep lambda and style corpora and report style perplexity under the n-gram model, base-model perplexity as a proxy for fluency, Jensen-Shannon (JS) divergence between the original and steered token distributions, and token-overlap statistics. On TinyLlama-1.1B we identify a single narrow regime (for the Don Quixote corpus at lambda=0.1) where style perplexity improves by 24.7% and base-model perplexity improves by 51.4% relative to the frozen model. Outside this regime, and for multi-author corpora such as CNN/DailyMail and arXiv abstracts, even small nonzero lambda values generally result in worse style and fluency, and larger lambda values lead to collapse with extreme perplexities and incoherent text. Logit-space injection of n-gram style priors provides lightweight, tunable style control, but it is fragile: it operates effectively only within a narrow range of low lambda values and is consistently outperformed by prompting and LoRA.
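The logit-space injection described in this abstract can be illustrated with a toy sketch. This is not the authors' code: the word-level tokens, add-alpha smoothing, the equal `order_weights`, and the names `train_ngram_counts`, `style_bonus`, and `steer_logits` are all assumptions made for illustration, and the "LLM logits" are a uniform stand-in rather than output from TinyLlama.

```python
import math
from collections import Counter, defaultdict

def train_ngram_counts(tokens, max_order=3):
    """Count next-token frequencies for every context of length 0..max_order-1."""
    counts = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        for order in range(1, max_order + 1):
            start = i - (order - 1)
            if start < 0:
                continue  # not enough history for this order
            counts[tuple(tokens[start:i])][tok] += 1
    return counts

def style_bonus(counts, context, vocab, order_weights=(1.0, 1.0, 1.0), alpha=1.0):
    """Weighted sum of smoothed style log-probs over each order whose context was seen."""
    bonus = {tok: 0.0 for tok in vocab}
    for order, weight in enumerate(order_weights, start=1):
        if order - 1 > len(context):
            continue
        ctx = tuple(context[len(context) - (order - 1):])
        if ctx not in counts:
            continue  # this n-gram order does not match the current context
        total = sum(counts[ctx].values()) + alpha * len(vocab)
        for tok in vocab:
            bonus[tok] += weight * math.log((counts[ctx][tok] + alpha) / total)
    return bonus

def steer_logits(logits, bonus, lam):
    """Add the style bonus to the frozen model's logits, scaled by lambda."""
    return {tok: logits[tok] + lam * bonus[tok] for tok in logits}

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions over the same vocabulary."""
    def kl(a, b):
        return sum(a[t] * math.log(a[t] / b[t]) for t in a if a[t] > 0)
    mid = {t: 0.5 * (p[t] + q[t]) for t in p}
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

corpus = "the knight rode the horse and the knight slept".split()
vocab = sorted(set(corpus))
counts = train_ngram_counts(corpus)
llm_logits = {tok: 0.0 for tok in vocab}  # stand-in for a frozen LLM's logits
bonus = style_bonus(counts, ["the"], vocab)
steered = steer_logits(llm_logits, bonus, lam=0.1)
print(js_divergence(softmax(llm_logits), softmax(steered)))
```

Setting `lam=0` leaves the logits unchanged, so the JS divergence between the original and steered distributions is zero; raising it shifts probability mass toward tokens the style corpus favors after the current context.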

[6] GameTalk: Training LLMs for Strategic Conversation

Victor Conchello Vendrell,Max Ruiz Luyten,Mihaela van der Schaar

Main category: cs.CL

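TL;DR: This paper presents GameTalk, a framework that trains large language models (LLMs) to make strategic decisions through multi-turn dialogue. It adapts fine-tuning methods such as GRPO, DPO, and STaR to optimize a global conversation-level objective and significantly improves coordination, reasoning, and opponent modeling across a range of games.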

Details

Motivation: Prior work focuses on single-step decisions or static action prediction and has not studied how LLMs optimize global objectives over long, multi-turn conversations.

Method: Propose the GameTalk framework, which adapts fine-tuning methods such as GRPO, DPO, and STaR to multi-turn dialogue so that reward signals depend on the full interaction rather than on a single-step output.

Result: On a suite of increasingly complex games, GameTalk significantly outperforms baseline models, especially under reward shaping; DPO performs best.

Conclusion: Dialogue-oriented fine-tuning is an effective path toward LLMs that reason, negotiate, and act in interactive environments.

Abstract: Strategic decision-making in multi-agent settings is a key challenge for large language models (LLMs), particularly when coordination and negotiation must unfold over extended conversations. While recent work has explored the use of LLMs in isolated decision tasks, little attention has been given to optimizing long-term objectives through dialogue. We introduce GameTalk, a framework for training LLMs to make strategic decisions via multi-turn interactions. Unlike prior work that focuses on single-turn objectives or static action prediction, we train LLMs to optimize a global objective across full conversations. We achieve this by adapting fine-tuning methods like GRPO, DPO, and STaR to incorporate reward signals that depend on the entire interaction. We evaluate this approach on a suite of increasingly complex games, designed to stress different aspects of reasoning, coordination, and opponent modeling. Our results show that GameTalk significantly outperforms untrained models, especially under reward shaping, with DPO consistently yielding the strongest gains. These findings position conversational fine-tuning as a promising path for LLMs to reason, negotiate, and act in interactive environments.
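One way to picture a reward that depends on the entire interaction is to rank complete conversation rollouts by their final payoff and turn them into preference pairs for a DPO-style objective. The sketch below is an assumption about how that could look, not the procedure from the paper: the `Rollout` type, the `min_gap` threshold, and the `prompt`/`chosen`/`rejected` keys are illustrative names.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Rollout:
    prompt: str        # game description shared by all rollouts in a group
    conversation: str  # full multi-turn transcript produced by the model
    reward: float      # payoff computed only after the whole interaction ends

def build_preference_pairs(rollouts, min_gap=0.0):
    """Pair rollouts of the same game; the higher conversation-level reward is 'chosen'."""
    by_prompt = {}
    for r in rollouts:
        by_prompt.setdefault(r.prompt, []).append(r)
    pairs = []
    for prompt, group in by_prompt.items():
        for a, b in combinations(group, 2):
            if abs(a.reward - b.reward) <= min_gap:
                continue  # skip ties: they carry no preference signal
            chosen, rejected = (a, b) if a.reward > b.reward else (b, a)
            pairs.append({
                "prompt": prompt,
                "chosen": chosen.conversation,
                "rejected": rejected.conversation,
            })
    return pairs

rollouts = [
    Rollout("split 10 coins", "A: 5/5? B: deal.", 1.0),
    Rollout("split 10 coins", "A: 9/1. B: no deal.", 0.0),
    Rollout("split 10 coins", "A: 6/4? B: fine.", 1.0),
]
for pair in build_preference_pairs(rollouts):
    print(pair["chosen"], ">", pair["rejected"])
```

The point of the sketch is only that the comparison is made between whole transcripts, so no individual turn needs its own label.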

[7] Better as Generators Than Classifiers: Leveraging LLMs and Synthetic Data for Low-Resource Multilingual Classification

Branislav Pecher,Jan Cegin,Robert Belanec,Ivan Srba,Jakub Simko,Maria Bielikova

Main category: cs.CL

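TL;DR: This paper explores using large language models (LLMs) to generate multilingual synthetic data for training smaller, more efficient models. The smaller models can surpass the original large model on low-resource languages, which highlights the value of LLMs as data-generating "teachers".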

Details

Motivation: Human-labelled data is scarce for low-resource languages, so effective ways to improve small models are needed; LLMs have strong multilingual abilities and can serve as synthetic data generators, enabling a form of knowledge distillation.

Method: Use a state-of-the-art multilingual LLM to generate synthetic datasets for 11 languages and 4 classification tasks, then use them to fine-tune or instruction-tune smaller models, or as in-context examples for compact LLMs.

Result: Even small amounts of synthetic data let smaller models outperform the large model that generated the data on multilingual classification, particularly in low-resource languages.

Conclusion: LLMs are better used as data generators (teachers) than as direct classifiers; the synthetic data they produce effectively empowers smaller, more efficient multilingual models.

Abstract: Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities, making them promising tools in both high- and low-resource languages. One particularly valuable use case is generating synthetic samples that can be used to train smaller models in low-resource scenarios where human-labelled data is scarce. In this work, we investigate whether these synthetic data generation capabilities can serve as a form of distillation, producing smaller models that perform on par with or even better than massive LLMs across languages and tasks. To this end, we use a state-of-the-art multilingual LLM to generate synthetic datasets covering 11 languages and 4 classification tasks. These datasets are then used to train smaller models via fine-tuning or instruction tuning, or as synthetic in-context examples for compact LLMs. Our experiments show that even small amounts of synthetic data enable smaller models to outperform the large generator itself, particularly in low-resource languages. Overall, the results suggest that LLMs are best utilised as generators (teachers) rather than classifiers, producing data that empowers smaller and more efficient multilingual models.

[8] Generating Literature-Driven Scientific Theories at Scale

Peter Jansen,Peter Clark,Doug Downey,Daniel S. Weld

Main category: cs.CL

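TL;DR: This paper proposes a method for synthesizing qualitative and quantitative theories from a large corpus of scientific literature. Compared with relying on an LLM's parametric knowledge, the method produces theories that better match existing evidence and better predict results from subsequent papers.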

Details

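Motivation: Automated scientific discovery has focused mainly on experiment generation, while higher-level theory building remains underexplored; this work aims to fill that gap by studying how to build scientific theories automatically from a large body of literature.

Method: Propose a literature-grounded theory synthesis method that generates 2.9k theories from 13.7k source papers, and compare how literature grounding versus parametric knowledge, and accuracy-focused versus novelty-focused objectives, affect the generated theories.

Result: Theories from the literature-grounded method are significantly better than those generated from parametric LLM memory alone, both at matching existing evidence and at predicting results from 4.6k subsequently written papers.

Conclusion: Literature grounding is key to improving the quality of generated theories (both accuracy and predictive power) and offers a feasible, effective new paradigm for automated theory building.

Abstract: Contemporary automated scientific discovery has focused on agents for generating scientific experiments, while systems that perform higher-level scientific activities such as theory building remain underexplored. In this work, we formulate the problem of synthesizing theories consisting of qualitative and quantitative laws from large corpora of scientific literature. We study theory generation at scale, using 13.7k source papers to synthesize 2.9k theories, examining how generation using literature-grounding versus parametric knowledge, and accuracy-focused versus novelty-focused generation objectives change theory properties. Our experiments show that, compared to using parametric LLM memory for generation, our literature-supported method creates theories that are significantly better at both matching existing evidence and at predicting future results from 4.6k subsequently-written papers.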

[9] A Longitudinal, Multinational, and Multilingual Corpus of News Coverage of the Russo-Ukrainian War

Dikshya Mohanty,Taisiia Sabadyn,Jelwin Rodrigues,Chenlu Wang,Abhishek Kalugade,Ritwik Banerjee

Main category: cs.CL

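TL;DR: This paper introduces DNIPRO, a multilingual longitudinal corpus of 246K news articles on the Russo-Ukrainian war from February 2022 to August 2024. It spans 11 media outlets in five countries (Russia, Ukraine, U.S., U.K., and China) and three languages (English, Russian, and Chinese), comes with rich metadata and multiple types of annotation, and is intended to support transnational analysis of wartime discourse.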

Details

Motivation: Build a high-quality, multi-perspective, multilingual news corpus for systematically analyzing transnational differences in wartime media narratives, information warfare, and framing effects.

Method: Collect and curate 246K news articles from eleven outlets in five countries and three languages; design a consistent metadata standard and multiple annotation types (e.g., stance, sentiment, topical framing, contradiction), validated through rigorous human evaluation; run use-case experiments on stance detection, sentiment analysis, topical framing, and contradiction analysis.

Result: Experiments show that outlets offer markedly polarized interpretations of the same conflict events, reflecting their geopolitical positions; DNIPRO supports key research tasks such as narrative divergence, media framing, and information warfare.

Conclusion: DNIPRO is a pioneering multilingual, multi-perspective wartime news corpus that provides a solid foundational resource for computational journalism and for research on conflict narratives in the global information ecosystem.

Abstract: We introduce DNIPRO, a novel longitudinal corpus of 246K news articles documenting the Russo-Ukrainian war from Feb 2022 to Aug 2024, spanning eleven media outlets across five nation states (Russia, Ukraine, U.S., U.K., and China) and three languages (English, Russian, and Mandarin Chinese). This multilingual resource features consistent and comprehensive metadata, and multiple types of annotation with rigorous human evaluations for downstream tasks relevant to systematic transnational analyses of contentious wartime discourse. DNIPRO's distinctive value lies in its inclusion of competing geopolitical perspectives, making it uniquely suited for studying narrative divergence, media framing, and information warfare. To demonstrate its utility, we include use case experiments using stance detection, sentiment analysis, topical framing, and contradiction analysis of major conflict events within the larger war. Our explorations reveal how outlets construct competing realities, with coverage exhibiting polarized interpretations that reflect geopolitical interests. Beyond supporting computational journalism research, DNIPRO provides a foundational resource for understanding how conflicting narratives emerge and evolve across global information ecosystems.

Dikshya Mohanty,Mohammad Saqib Hasan,Syed Mostofa Monsur,Size Zheng,Benjamin Hsiao,Niranjan Balasubramanian

Main category: cs.CL

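TL;DR: This paper introduces PolyBench, a large-scale benchmark dataset for polymer design (125K+ tasks, 13M+ data points), and uses a knowledge-augmented chain-of-thought distillation method to train small language models (7B to 14B) that outperform similar-sized models and some closed-source large models on polymer design tasks.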

Details

Motivation: Current large language models lack polymer-specific knowledge and the coverage of knowledge and capabilities that polymer design requires, so they cannot effectively support polymer design tasks in AI4Science.

Method: Build PolyBench, a large-scale polymer design benchmark dataset (13M+ data points from experimental and synthetic sources, 125K+ tasks), and propose a knowledge-augmented reasoning distillation method that adds structured chain-of-thought (CoT) to the tasks; tasks are organized by increasing difficulty to support generalization tests and diagnostic evaluation.

Result: Small language models of 7B to 14B parameters trained on PolyBench significantly outperform similar-sized models and some closed-source frontier LLMs on the PolyBench test set, and also show gains on other polymer benchmarks.

Conclusion: A large-scale domain-specific benchmark such as PolyBench, combined with knowledge-augmented alignment, can significantly improve small language models on specialized scientific tasks such as polymer design, offering an efficient and practical path for AI4Science.

Abstract: Research in AI4Science has shown promise in many science applications, including polymer design. However, current LLMs prove ineffective on this problem space because: (i) most models lack polymer-specific knowledge, and (ii) existing aligned models lack coverage of knowledge and capabilities relevant to polymer design. Addressing this, we introduce PolyBench, a large-scale training and test benchmark dataset of more than 125K polymer design related tasks, leveraging a knowledge base of 13M+ data points obtained from experimental and synthetic sources to ensure broad coverage of polymers and their properties. For effective alignment using PolyBench, we introduce a knowledge-augmented reasoning distillation method that augments this dataset with structured CoT. Furthermore, tasks in PolyBench are organized from simple to complex analytical reasoning problems, enabling generalization tests and diagnostic probes across the problem space. Experiments show that small language models (SLMs), of 7B to 14B parameters, trained on PolyBench data outperform similar-sized models, and even closed-source frontier LLMs, on the PolyBench test dataset while demonstrating gains on other polymer benchmarks as well.

[11] Machine-Assisted Grading of Nationwide School-Leaving Essay Exams with LLMs and Statistical NLP

Andres Karjus,Kais Allkivi,Silvia Maine,Katarin Leppik,Krister Kruusmaa,Merilin Aruvee

Main category: cs.CL

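TL;DR: This paper examines large language models (LLMs) for automated scoring of national graduation exam essays. On two nationwide trial-exam essay datasets from Estonia, automated scoring performs comparably to human raters, and the system supports fine-grained feedback and high-stakes assessment under human oversight.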

Details

Motivation: Relieve the pressure of large-scale, time-limited human grading, especially for nationwide graduation exams; explore the feasibility and compliance of LLMs for a small language (Estonian) and for high-stakes educational assessment.

Method: Operationalize a scoring rubric from the official curriculum and compare automated scores from LLMs and statistical NLP methods with human panel scores; also evaluate bias, prompt injection risks, and potential issues with LLMs as essay writers.

Result: Automated scoring performs comparably to human raters and falls within the human scoring range; the system produces fine-grained subscores and supports personalized instructional feedback; the results confirm that a human-in-the-loop pipeline is feasible and applicable in a small-language context.

Conclusion: A rubric-driven LLM-assisted scoring system safeguarded by human oversight can be deployed at national scale and meets requirements for educational fairness, regulatory compliance, and digital transformation, which is especially relevant for digitally advanced countries with smaller language communities such as Estonia.

Abstract: Large language models (LLMs) enable rapid and consistent automated evaluation of open-ended exam responses, including dimensions of content and argumentation that have traditionally required human judgment. This is particularly important in cases where a large number of exams need to be graded in a limited time frame, such as nation-wide graduation exams in various countries. Here, we examine the applicability of automated scoring on two large datasets of trial exam essays of two full national cohorts from Estonia. We operationalize the official curriculum-based rubric and compare LLM and statistical natural language processing (NLP) based assessments with human panel scores. The results show that automated scoring can achieve performance comparable to that of human raters and tends to fall within the human scoring range. We also evaluate bias, prompt injection risks, and LLMs as essay writers. These findings demonstrate that a principled, rubric-driven, human-in-the-loop scoring pipeline is viable for high-stakes writing assessment, particularly relevant for digitally advanced societies like Estonia, which is about to adopt a fully electronic examination system. Furthermore, the system produces fine-grained subscore profiles that can be used to generate systematic, personalized feedback for instruction and exam preparation. The study provides evidence that LLM-assisted assessment can be implemented at a national scale, even in a small-language context, while maintaining human oversight and compliance with emerging educational and regulatory standards.

[12] Regional Bias in Large Language Models

M P V S Gopinadh,Kappara Lakshmi Sindhu,Soma Sekhar Pandu Ranga Raju P,Yesaswini Swarna

Main category: cs.CL

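TL;DR: This study proposes the FAZE framework to evaluate regional bias in 10 prominent large language models. GPT-3.5 shows the strongest bias (score 9.5) and Claude 3.5 Sonnet the weakest (2.5), indicating that geographic bias can seriously affect AI fairness and the reliability of cross-cultural applications.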

Details

Motivation: Regional bias is an emerging concern in AI fairness and global representation and needs systematic evaluation and quantification.

Method: Build a dataset of 100 forced-choice prompts set in contextually neutral scenarios and propose FAZE, a prompt-based evaluation framework that quantifies regional preference on a 10-point scale.

Result: Bias levels differ substantially across the 10 models: GPT-3.5 scores highest (9.5) and Claude 3.5 Sonnet lowest (2.5).

Conclusion: Regional bias can materially undermine the reliability, fairness, and inclusivity of LLMs in real cross-cultural settings; more inclusive evaluation frameworks and systematic mitigation methods are needed.

Abstract: This study investigates regional bias in large language models (LLMs), an emerging concern in AI fairness and global representation. We evaluate ten prominent LLMs: GPT-3.5, GPT-4o, Gemini 1.5 Flash, Gemini 1.0 Pro, Claude 3 Opus, Claude 3.5 Sonnet, Llama 3, Gemma 7B, Mistral 7B, and Vicuna-13B using a dataset of 100 carefully designed prompts that probe forced-choice decisions between regions under contextually neutral scenarios. We introduce FAZE, a prompt-based evaluation framework that measures regional bias on a 10-point scale, where higher scores indicate a stronger tendency to favor specific regions. Experimental results reveal substantial variation in bias levels across models, with GPT-3.5 exhibiting the highest bias score (9.5) and Claude 3.5 Sonnet scoring the lowest (2.5). These findings indicate that regional bias can meaningfully undermine the reliability, fairness, and inclusivity of LLM outputs in real-world, cross-cultural applications. This work contributes to AI fairness research by highlighting the importance of inclusive evaluation frameworks and systematic approaches for identifying and mitigating geographic biases in language models.

[13] Identity, Cooperation and Framing Effects within Groups of Real and Simulated Humans

Suhong Moon,Minwoo Kang,Joseph Suh,Mustafa Safdari,John Canny

Main category: cs.CL

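TL;DR: This paper studies how well large language models (LLMs) simulate human behavior in social dilemma games. It emphasizes improving simulation fidelity by conditioning base models on rich narrative identity backstories (deep binding) and checking consistency with instruction-tuned models. It also finds that LLMs can model contextual factors such as time, question framing, and participant pool, which helps surface details that affect human experiments but are often omitted.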

Details

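Motivation: Prior work mostly uses weak binding (e.g., prompt engineering) to simulate personas, which struggles to reproduce complex human behavior grounded in identity and context; because human behavior is shaped by rationality, identity, and context together, more faithful modeling is needed to support replicable social science experiments.

Method: Deep-bind base language models by conditioning them on detailed narrative identity backstories; use instruction-tuned models to check behavioral consistency; systematically test how well models capture contextual variables such as time (year), question framing, and participant pool, and compare fidelity against empirical human studies.

Result: Deep binding significantly improves agreement between LLM simulations and human study results; LLMs capture and respond to contextual variables such as time, framing, and group characteristics; the approach can surface key details that are often omitted from experiment reports yet affect outcomes.

Conclusion: Deep binding with narrative identities is an effective way to raise the fidelity of LLM simulations of human social behavior; LLMs can serve not only as behavior simulation tools but also as a new paradigm for exploring implicit experimental variables and improving replicability in social science.

Abstract: Humans act via a nuanced process that depends both on rational deliberation and also on identity and contextual factors. In this work, we study how large language models (LLMs) can simulate human action in the context of social dilemma games. While prior work has focused on "steering" (weak binding) of chat models to simulate personas, we analyze here how deep binding of base models with extended backstories leads to more faithful replication of identity-based behaviors. Our study has these findings: simulation fidelity vs human studies is improved by conditioning base LMs with rich context of narrative identities and checking consistency using instruction-tuned models. We show that LLMs can also model contextual factors such as time (year that a study was performed), question framing, and participant pool effects. LLMs, therefore, allow us to explore the details that affect human studies but which are often omitted from experiment descriptions, and which hamper accurate replication.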

[14] PolyAgent: Large Language Model Agent for Polymer Design

Vani Nigam,Achuth Chandrasekhar,Amir Barati Farimani

Main category: cs.CL

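TL;DR: This paper presents a closed-loop polymer structure-property prediction framework integrated in a terminal. It uses a large language model (LLM) for property prediction, property-guided polymer structure generation, and structure modification, and uses the synthetic accessibility score (SA Score) and synthetic complexity score (SC Score) to keep generated structures experimentally feasible.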

Details

Motivation: Laboratory researchers cannot easily access existing machine learning models to explore polymer structures and properties because of infrastructure limitations; traditional polymer experiments also involve long trial-and-error cycles and heavy resource use.

Method: Build a terminal-integrated closed-loop framework driven by LLM reasoning that supports property prediction, property-guided polymer structure generation, and structure modification; generated SMILES sequences are constrained by the synthetic accessibility score (SA Score) and the synthetic complexity score (SC Score) to keep monomer-level structures synthesizable.

Result: A user-friendly terminal tool for early-stage polymer discovery that quickly generates polymer structures that are both novel and synthetically feasible, and that provides interpretable computational insights.

Conclusion: The framework bridges the gap between AI models and experimental chemists and offers a practical, deployable computational solution for on-demand polymer discovery.

Abstract: On-demand polymer discovery is essential for various industries, ranging from biomedical to reinforcement materials. Experiments with polymers have a long trial-and-error process, leading to long procedures and extensive resources. For these processes, machine learning has accelerated scientific discovery at the property prediction and latent space search fronts. However, laboratory researchers cannot readily access the code and these models to extract individual structures and properties due to infrastructure limitations. We present a closed-loop polymer structure-property predictor integrated in a terminal for early-stage polymer discovery. The framework is powered by LLM reasoning to provide users with property prediction, property-guided polymer structure generation, and structure modification capabilities. The SMILES sequences are guided by the synthetic accessibility score and the synthetic complexity score (SC Score) to ensure that polymer generation is as close as possible to synthetically accessible monomer-level structures. This framework addresses the challenge of generating novel polymer structures for laboratory researchers, thereby providing computational insights into polymer research.

[15] Cross-Lingual Activation Steering for Multilingual Language Models

Rhitabrat Pokharel,Ameeta Agrawal,Tanay Nagar

Main category: cs.CL

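TL;DR: This paper proposes Cross-Lingual Activation Steering (CLAS), a training-free inference-time intervention that selectively modulates neuron activations to narrow the performance gap between dominant and non-dominant languages in large language models. It yields average improvements of 2.3% accuracy on classification and 3.4% F1 on generation benchmarks without hurting high-resource language performance. The study finds that effective cross-lingual transfer relies on functional divergence rather than strict alignment, and that performance gains correlate positively with increased language cluster separation.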

Details

Motivation: Large language models are multilingual, yet significant performance gaps persist between dominant and non-dominant languages; prior work attributes this to an imbalance between shared and language-specific neurons in multilingual representations.

Method: Propose Cross-Lingual Activation Steering (CLAS), a training-free inference-time intervention that selectively modulates neuron activations.

Result: Average improvements of 2.3% (accuracy) on classification and 3.4% (F1) on generation benchmarks while maintaining high-resource language performance; performance gains correlate positively with increased language cluster separation, indicating that cross-lingual transfer relies on functional divergence rather than strict alignment.

Conclusion: Targeted activation steering can unlock latent multilingual capacity in existing models without modifying model weights.

Abstract: Large language models exhibit strong multilingual capabilities, yet significant performance gaps persist between dominant and non-dominant languages. Prior work attributes this gap to imbalances between shared and language-specific neurons in multilingual representations. We propose Cross-Lingual Activation Steering (CLAS), a training-free inference-time intervention that selectively modulates neuron activations. We evaluate CLAS on classification and generation benchmarks, achieving average improvements of 2.3% (Acc.) and 3.4% (F1) respectively, while maintaining high-resource language performance. We discover that effective transfer operates through functional divergence rather than strict alignment; performance gains correlate with increased language cluster separation. Our results demonstrate that targeted activation steering can unlock latent multilingual capacity in existing models without modification to model weights.
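The abstract does not spell out how neurons are selected or modulated, so the following is only a generic sketch of what "selectively modulating neuron activations" could mean. The selection rule (largest mean-activation gap for the target language), the multiplicative `scale`, and the function names are assumptions, not the CLAS procedure.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def select_language_neurons(acts_by_language, target, top_k):
    """Pick the top_k neurons whose mean activation on `target` most exceeds the other languages."""
    target_mean = mean_vector(acts_by_language[target])
    other_means = [mean_vector(v) for lang, v in acts_by_language.items() if lang != target]
    others = mean_vector(other_means)
    gaps = [t - o for t, o in zip(target_mean, others)]
    return sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)[:top_k]

def steer_activations(activations, neuron_ids, scale):
    """Return a copy of `activations` with only the selected neurons multiplied by `scale`."""
    selected = set(neuron_ids)
    return [a * scale if i in selected else a for i, a in enumerate(activations)]

# Toy activations for one layer: two samples per language, three neurons.
acts = {
    "en": [[1.0, 0.0, 0.5], [1.0, 0.2, 0.5]],
    "sw": [[0.0, 1.0, 0.5], [0.2, 1.0, 0.5]],
}
neurons = select_language_neurons(acts, "sw", top_k=1)
print(neurons, steer_activations([0.1, 1.0, 0.5], neurons, scale=2.0))
```

In a real model the same edit would be applied inside the forward pass at inference time, which is why no weights need to change.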

[16] Cite-While-You-Generate: Training-Free Evidence Attribution for Multimodal Clinical Summarization

Qianqi Yan,Huy Nguyen,Sumana Srivatsa,Hari Bandi,Xin Eric Wang,Krishnaram Kenthapadi

Main category: cs.CL

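TL;DR: This paper proposes a training-free framework for generation-time source attribution that uses decoder attention to cite supporting text spans or images directly. It supports text-only and multimodal attribution and significantly improves attribution accuracy on clinical dialogue and radiology report datasets.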

Details

Motivation: Trustworthy clinical summarization requires not only fluency but also transparency about the source of each statement; existing post-hoc or retraining-based methods have limitations.

Method: Propose a training-free, generation-time source attribution framework based on decoder attention, with two multimodal attribution strategies: a raw image mode (uses image patch attention directly) and a caption-as-span mode (replaces images with generated captions to enable purely text-based alignment).

Result: On the CliConSummation and MIMIC-CXR datasets, the method clearly outperforms embedding-based and self-attribution baselines, with F1 gains of up to 15% for text-level and multimodal attribution; caption-based attribution performs close to raw-image attention while being lighter and more practical.

Conclusion: Attention-guided attribution is a promising approach for building interpretable and deployable clinical summarization systems.

Abstract: Trustworthy clinical summarization requires not only fluent generation but also transparency about where each statement comes from. We propose a training-free framework for generation-time source attribution that leverages decoder attentions to directly cite supporting text spans or images, overcoming the limitations of post-hoc or retraining-based methods. We introduce two strategies for multimodal attribution: a raw image mode, which directly uses image patch attentions, and a caption-as-span mode, which substitutes images with generated captions to enable purely text-based alignment. Evaluations on two representative domains: clinician-patient dialogues (CliConSummation) and radiology reports (MIMIC-CXR), show that our approach consistently outperforms embedding-based and self-attribution baselines, improving both text-level and multimodal attribution accuracy (e.g., +15% F1 over embedding baselines). Caption-based attribution achieves competitive performance with raw-image attention while being more lightweight and practical. These findings highlight attention-guided attribution as a promising step toward interpretable and deployable clinical summarization systems.
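A minimal sketch of attention-based citation, assuming the attention weights from each generated token to each source token are already available as plain lists. The aggregation rule (average over the generated tokens of one sentence, then sum within each source span) and the name `attribute_sentence` are illustrative assumptions and not necessarily what the paper does. In the caption-as-span mode described above, an image would simply appear as one more text span.

```python
def attribute_sentence(attn_rows, spans, top_k=1):
    """Rank source spans by the attention mass a generated sentence puts on them.

    attn_rows: one list of weights over source tokens per generated token.
    spans: {span_id: (start, end)} half-open ranges over source token positions.
    """
    n_src = len(attn_rows[0])
    # Average attention over all generated tokens of the sentence.
    avg = [sum(row[j] for row in attn_rows) / len(attn_rows) for j in range(n_src)]
    # Total attention mass that falls inside each candidate span.
    scores = {sid: sum(avg[start:end]) for sid, (start, end) in spans.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores

attn = [
    [0.1, 0.1, 0.6, 0.2],  # first generated token
    [0.2, 0.0, 0.5, 0.3],  # second generated token
]
spans = {"S1": (0, 2), "S2": (2, 4)}  # S2 could be a caption standing in for an image
cited, scores = attribute_sentence(attn, spans)
print(cited, scores)
```

Summing within a span favors longer spans; dividing by span length is an equally plausible choice and would be a one-line change.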

[17] Clarify or Answer: Reinforcement Learning for Agentic VQA with Context Under-specification

Zongwan Cao,Bingbing Wen,Lucy Lu Wang

Main category: cs.CL

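TL;DR: This paper proposes CoA (Clarify-or-Answer), a framework for context-dependent visual question answering (VQA) in real-world settings. It decides whether clarification is needed, generates a focused clarification question, and incorporates the response into its answer, which significantly improves VQA accuracy.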

Details

Motivation: Real-world VQA often involves image-question pairs that are under-specified; relying on the image alone can lead to confident but incorrect answers, so a mechanism for requesting external context is needed.

Method: Propose the CoA framework: first decide whether clarification is needed; if so, generate one focused clarification question and use the user's response to produce the final answer; build the CONTEXTCLARIFY dataset and the GRPO-CR reinforcement learning method to optimize clarification question generation.

Result: Across three VLLMs and three datasets, CoA improves consistently at both the module and system levels, raising end-to-end VQA accuracy by an average of +15.3 points (an 83% relative improvement).

Conclusion: Explicitly modeling the ask-or-answer decision and pairing it with high-quality clarification question generation effectively reduces VQA errors caused by missing context and offers a new paradigm for open-world VQA.

Abstract: Real-world visual question answering (VQA) is often context-dependent: an image-question pair may be under-specified, such that the correct answer depends on external information that is not observable in the image. In such cases, directly answering can lead to confident but incorrect predictions. We propose CoA (Clarify-or-Answer), an ask-or-answer agent that separately models the decision to ask or answer, and what to ask if needed. CoA first determines whether clarification is necessary; if so, it asks a single focused question and then incorporates the response to produce the final answer. We introduce CONTEXTCLARIFY with a set of ambiguous VQA questions and the contrast set that is non-ambiguous. We further introduce GRPO-CR (Clarification Reasoning), a reinforcement learning approach that optimizes clarification question generation with multiple reward signals encouraging well-formed, focused, non-trivial questions that resolve ambiguity. Across three VLLMs and three datasets, CoA achieves consistent improvements at both the module and system levels, improving end-to-end VQA accuracy by an average of +15.3 points (83%) over prompting-based baselines.
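The ask-or-answer control flow described in this abstract can be sketched as follows. The three model components are passed in as plain callables, and the `toy_*` functions are hypothetical stand-ins, not the paper's models or prompts.

```python
def clarify_or_answer(image, question, needs_clarification, ask, answer, get_reply):
    """Ask one focused question when the pair is under-specified, otherwise answer directly."""
    if not needs_clarification(image, question):
        return {"asked": None, "answer": answer(image, question, None)}
    clarifying_question = ask(image, question)
    reply = get_reply(clarifying_question)  # external context supplied by the user
    return {"asked": clarifying_question, "answer": answer(image, question, reply)}

# Toy stand-ins for the decision module, the question generator, and the answerer.
def toy_needs_clarification(image, question):
    return "my" in question.lower().split()

def toy_ask(image, question):
    return "Which one is yours?"

def toy_answer(image, question, context):
    return f"answer using context={context!r}"

result = clarify_or_answer(
    "photo.jpg", "Is my seat free?",
    toy_needs_clarification, toy_ask, toy_answer,
    get_reply=lambda q: "the left one",
)
print(result)
```

Keeping the decision and the question generator as separate callables mirrors the abstract's point that the two are modeled separately, so each can be evaluated or trained on its own.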

[18] Jacobian Scopes: token-level causal attributions in LLMs

Toni J. B. Liu,Baran Zadeoğlu,Nicolas Boullé,Raphaël Sarfati,Christopher J. Earls

Main category: cs.CL

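TL;DR: This paper proposes Jacobian Scopes, a set of gradient-based, token-level causal attribution methods that explain which input tokens most influence a large language model's prediction. By analyzing the linearized relation of the final hidden state with respect to the inputs, they quantify each input token's influence; three variants (Semantic, Fisher, and Temperature) target logit sensitivity, the predictive distribution, and model confidence, respectively. The methods are validated on instruction understanding, translation, and in-context learning, and reveal phenomena such as implicit political bias.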

Details

Motivation: Modern large language models have many layers and attention heads, which makes it hard to identify which prior tokens most influence a given prediction; a fine-grained, interpretable attribution method is needed.

Method: Propose Jacobian Scopes, which linearize the model using the Jacobian of the final hidden state with respect to the input tokens, and define three variants: Semantic Scope (gradient sensitivity of a specific logit), Fisher Scope (Fisher information of the full predictive distribution), and Temperature Scope (model confidence, i.e., the inverse temperature).

Result: The methods locate the key influencing tokens in instruction understanding, machine translation, and in-context learning; they show that the model relies on implicit political bias cues in some cases; they also offer a new perspective on the mechanisms behind in-context time-series forecasting.

Conclusion: Jacobian Scopes are an effective, flexible, and scalable token-level attribution tool that makes LLM decisions more transparent and interpretable and helps diagnose model bias and understand internal mechanisms.

Abstract: Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet, elucidating which prior tokens most strongly influence a given prediction remains challenging due to the proliferation of layers and attention heads in modern architectures. We propose Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions. By analyzing the linearized relations of the final hidden state with respect to inputs, Jacobian Scopes quantify how input tokens influence a model's prediction. We introduce three variants - Semantic, Fisher, and Temperature Scopes - which respectively target sensitivity of specific logits, the full predictive distribution, and model confidence (inverse temperature). Through case studies spanning instruction understanding, translation and in-context learning (ICL), we uncover interesting findings, such as when Jacobian Scopes point to implicit political biases. We believe that our proposed methods also shed light on recently debated mechanisms underlying in-context time-series forecasting. Our code and interactive demonstrations are publicly available at https://github.com/AntonioLiu97/JacobianScopes.
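To make the idea of gradient-based token attribution concrete, here is a toy sketch in the spirit of the logit-sensitivity variant. It is not the authors' implementation (their code is at the URL above): the "model" is a hypothetical two-dimensional pooling network, the influence score is assumed to be the L2 norm of the gradient of one logit with respect to each token embedding, and gradients are approximated with central finite differences instead of automatic differentiation.

```python
import math

def final_hidden(embeddings, position_weights):
    """Toy model: weighted sum of token embeddings followed by tanh."""
    dim = len(embeddings[0])
    pooled = [sum(w * e[d] for w, e in zip(position_weights, embeddings)) for d in range(dim)]
    return [math.tanh(v) for v in pooled]

def logit(embeddings, position_weights, unembed_row):
    """Logit of one target token: dot product of its unembedding row with the hidden state."""
    hidden = final_hidden(embeddings, position_weights)
    return sum(u * h for u, h in zip(unembed_row, hidden))

def token_influence(embeddings, position_weights, unembed_row, eps=1e-5):
    """Per-token L2 norm of d(logit)/d(embedding), via central finite differences."""
    scores = []
    for t in range(len(embeddings)):
        grad_sq = 0.0
        for d in range(len(embeddings[t])):
            plus = [list(e) for e in embeddings]
            minus = [list(e) for e in embeddings]
            plus[t][d] += eps
            minus[t][d] -= eps
            g = (logit(plus, position_weights, unembed_row)
                 - logit(minus, position_weights, unembed_row)) / (2 * eps)
            grad_sq += g * g
        scores.append(math.sqrt(grad_sq))
    return scores

embeddings = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.1]]  # three input tokens
position_weights = [0.1, 0.9, 0.3]                   # how strongly the toy model reads each token
unembed_row = [1.0, -0.5]                            # unembedding row of the target token
print(token_influence(embeddings, position_weights, unembed_row))
```

In this toy model the gradient for token t is proportional to its position weight, so the second token receives the largest score. With a real LLM the same quantity would come from automatic differentiation through the network.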

[19] Learning Domain Knowledge in Multimodal Large Language Models through Reinforcement Fine-Tuning

Qinglong Cao,Yuntian Chen,Chao Ma,Xiaokang Yang

Main category: cs.CL

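TL;DR: This paper finds that text-level domain knowledge injection yields little benefit on scientific multimodal tasks, and proposes a reinforcement fine-tuning framework that encodes domain knowledge as constraints and reward signals, significantly improving multimodal large language models (MLLMs) in specialized domains such as remote sensing and medical imaging.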

Details

Motivation: Existing MLLMs perform poorly in specialized domains such as remote sensing and medical imaging, and conventional injection of domain knowledge through text prompts or descriptions helps little. This suggests that models cannot internalize domain priors from language input alone and that domain knowledge must be integrated at the optimization level. Method: A reinforcement fine-tuning framework that turns domain knowledge into domain-informed constraints and reward signals in the output space, rather than supplying it as input text prompts, so that the knowledge is embedded directly in the training objective. Result: Experiments on multiple remote sensing and medical imaging datasets show consistent and significant gains, reaching state-of-the-art results on multimodal domain tasks. Conclusion: Domain knowledge must be integrated into MLLMs at the optimization level rather than the input level; current MLLMs have a fundamental limitation in modeling domain conditioning given purely as text. Abstract: Multimodal large language models (MLLMs) have shown remarkable capabilities in multimodal perception and understanding tasks. However, their effectiveness in specialized domains, such as remote sensing and medical imaging, remains limited. A natural approach to domain adaptation is to inject domain knowledge through textual instructions, prompts, or auxiliary captions. Surprisingly, we find that such input-level domain knowledge injection yields little to no improvement on scientific multimodal tasks, even when the domain knowledge is explicitly provided. This observation suggests that current MLLMs fail to internalize domain-specific priors through language alone, and that domain knowledge must be integrated at the optimization level. Motivated by this insight, we propose a reinforcement fine-tuning framework that incorporates domain knowledge directly into the learning objective. Instead of treating domain knowledge as descriptive information, we encode it as domain-informed constraints and reward signals, shaping the model's behavior in the output space. Extensive experiments across multiple datasets in remote sensing and medical domains consistently demonstrate good performance gains, achieving state-of-the-art results on multimodal domain tasks. Our results highlight the necessity of optimization-level domain knowledge integration and reveal a fundamental limitation of textual domain conditioning in current MLLMs.

[20] Exploring the Effects of Alignment on Numerical Bias in Large Language Models

Ayako Sato,Hwichan Kim,Zhousi Chen,Masato Mita,Mamoru Komachi

Main category: cs.CL

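TL;DR: This paper studies the numerical bias that arises when large language models serve as evaluators, finds that alignment (instruction tuning and preference tuning) is a major cause of the bias, and explores several mitigation strategies, of which score range adjustment works best.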

Details

Motivation: LLM-as-a-judge is widely used for evaluation but suffers from numerical bias, which hurts evaluation performance. Prior work shows that alignment reduces output diversity, which suggests that alignment may be the source of the bias. Method: Compare the outputs of LLMs before and after alignment to test whether alignment amplifies numerical bias, and try temperature scaling, distribution calibration, and score range adjustment as mitigations. Result: Experiments confirm that alignment does amplify numerical bias. Among the mitigations, score range adjustment is the most effective at reducing bias and improving evaluation performance, although it remains a heuristic. Conclusion: Alignment is an important cause of numerical bias in LLM-as-a-judge; further work is needed on choosing the optimal score range and on more robust mitigation methods. Abstract: "LLM-as-a-judge," which utilizes large language models (LLMs) as evaluators, has proven effective in many evaluation tasks. However, evaluator LLMs exhibit numerical bias, a phenomenon where certain evaluation scores are generated disproportionately often, leading to reduced evaluation performance. This study investigates the cause of this bias. Given that most evaluator LLMs are aligned through instruction tuning and preference tuning, and that prior research suggests alignment reduces output diversity, we hypothesize that numerical bias arises from alignment. To test this, we compare outputs from pre- and post-alignment LLMs, and observe that alignment indeed increases numerical bias. We also explore mitigation strategies for post-alignment LLMs, including temperature scaling, distribution calibration, and score range adjustment. Among these, score range adjustment is most effective in reducing bias and improving performance, though still heuristic. Our findings highlight the need for further work on optimal score range selection and more robust mitigation strategies.
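
Numerical bias, as described in [20], means that some scores are emitted disproportionately often. The sketch below shows one simple way such concentration could be quantified; the two statistics (share of the most frequent score, and entropy normalized by the score range) and the sample score lists are illustrative assumptions, not the metric or data used in the paper.

```python
from collections import Counter
import math


def score_concentration(scores, score_range):
    """Return (top_share, normalized_entropy) for a list of judge scores.

    top_share is the fraction taken by the most frequent score;
    normalized_entropy is 1.0 for a uniform spread over score_range and
    approaches 0.0 as the scores collapse onto a single value.
    Both statistics are illustrative choices, not the paper's metric.
    """
    counts = Counter(scores)
    n = len(scores)
    top_share = max(counts.values()) / n
    probs = [counts.get(s, 0) / n for s in score_range]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return top_share, entropy / math.log(len(score_range))


# Hypothetical judge outputs on a 1-5 scale (made-up numbers).
spread_scores = [1, 2, 3, 4, 5, 2, 3, 4, 1, 5]
skewed_scores = [4, 4, 4, 3, 4, 4, 5, 4, 4, 4]

for name, scores in [("spread", spread_scores), ("skewed", skewed_scores)]:
    top, ent = score_concentration(scores, range(1, 6))
    print(f"{name}: top_share={top:.2f} normalized_entropy={ent:.2f}")
```

A higher top share together with a lower normalized entropy indicates that the scores cluster on a few values, which is the pattern the abstract attributes to post-alignment evaluators.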

[21] Mixing Expert Knowledge: Bring Human Thoughts Back To the Game of Go

Yichuan Ma,Linyang Li,Yongkang Chen,Peiji Li,Jiasheng Ye,Qipeng Guo,Dahua Lin,Kai Chen

Main category: cs.CL

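TL;DR: This paper presents LoGos, a model that combines Go expert knowledge with general chain-of-thought reasoning through mixed fine-tuning and reinforcement learning, bringing a large language model to human professional level in Go while preserving its general reasoning ability.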

Details

Motivation: Mainstream LLMs fall far short of domain experts in specialized domains such as Go, which limits their use in professional tasks; the gap between general reasoning ability and domain expertise needs to be bridged. Method: Mixed fine-tuning on structured Go expertise and general long chain-of-thought (CoT) data serves as a cold start, followed by reinforcement learning that further integrates Go expert knowledge with general reasoning. Result: LoGos reaches human professional level in natural-language Go reasoning, strategic analysis, and next-move prediction, substantially surpassing all existing LLMs. The authors plan to release the first large-scale Go training dataset, the first Go evaluation benchmark for LLMs, and the model. Conclusion: The work shows that the general reasoning ability of LLMs can be transferred effectively to a highly specialized domain, offering a new paradigm for deploying LLMs in vertical domains. Abstract: Large language models (LLMs) have demonstrated exceptional performance in reasoning tasks such as mathematics and coding, matching or surpassing human capabilities. However, these impressive reasoning abilities face significant challenges in specialized domains. Taking Go as an example, although AlphaGo has established the high performance ceiling of AI systems in Go, mainstream LLMs still struggle to reach even beginner-level proficiency, let alone perform natural language reasoning. This performance gap between general-purpose LLMs and domain experts is significantly limiting the application of LLMs on a wider range of domain-specific tasks. In this work, we aim to bridge the divide between LLMs' general reasoning capabilities and expert knowledge in domain-specific tasks. We perform mixed fine-tuning with structured Go expertise and general long Chain-of-Thought (CoT) reasoning data as a cold start, followed by reinforcement learning to integrate expert knowledge in Go with general reasoning capabilities. Through this methodology, we present LoGos, a powerful LLM that not only maintains outstanding general reasoning abilities, but also conducts Go gameplay in natural language, demonstrating effective strategic reasoning and accurate next-move prediction. LoGos achieves performance comparable to human professional players, substantially surpassing all existing LLMs. Through this work, we aim to contribute insights on applying general LLM reasoning capabilities to specialized domains. We will release the first large-scale Go dataset for LLM training, the first LLM Go evaluation benchmark, and the first general LLM that reaches human professional-level performance in Go at: https://github.com/Entarochuan/LoGos.

[22] Graph-Anchored Knowledge Indexing for Retrieval-Augmented Generation

Zhenghao Liu,Mingyan Wu,Xinze Li,Yukun Yan,Shuo Wang,Cheng Yang,Minghe Yu,Zheni Zeng,Maosong Sun

Main category: cs.CL

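TL;DR: This paper proposes GraphAnchor, a graph-anchored knowledge indexing method that dynamically builds and updates a graph during iterative retrieval to anchor key entities and relations, improving a RAG system's ability to integrate scattered evidence and its multi-hop question answering performance.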

Details

Motivation: Existing RAG systems struggle to integrate and interpret key evidence scattered across noisy documents, which leads to insufficient reasoning or misdirected attention. Method: GraphAnchor recasts the graph from a static knowledge representation into a dynamically evolving knowledge index, updating it incrementally during iterative retrieval to anchor salient entities and relations, and generates the answer from the final graph together with all retrieved documents. Result: GraphAnchor is shown to be effective on four multi-hop question answering benchmarks, and the analysis indicates that it modulates the LLM's attention so that key information across documents is associated more effectively. Conclusion: By introducing an evolving graph index, GraphAnchor markedly improves how RAG systems model and use complex, scattered knowledge, offering a new route to mitigating hallucinations and strengthening reasoning. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a dominant paradigm for mitigating hallucinations in Large Language Models (LLMs) by incorporating external knowledge. Nevertheless, effectively integrating and interpreting key evidence scattered across noisy documents remains a critical challenge for existing RAG systems. In this paper, we propose GraphAnchor, a novel Graph-Anchored Knowledge Indexing approach that reconceptualizes graph structures from static knowledge representations into active, evolving knowledge indices. GraphAnchor incrementally updates a graph during iterative retrieval to anchor salient entities and relations, yielding a structured index that guides the LLM in evaluating knowledge sufficiency and formulating subsequent subqueries. The final answer is generated by jointly leveraging all retrieved documents and the final evolved graph. Experiments on four multi-hop question answering benchmarks demonstrate the effectiveness of GraphAnchor, and reveal that GraphAnchor modulates the LLM's attention to more effectively associate key information distributed in retrieved documents. All code and data are available at https://github.com/NEUIR/GraphAnchor.

[23] Persona Jailbreaking in Large Language Models

Jivnesh Sandhan,Fei Cheng,Tushar Sandhan,Yugo Murawaki

Main category: cs.CL

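TL;DR: This paper proposes the PHISH framework, which embeds semantic cues in user inputs to progressively hijack the induced persona of a large language model (LLM) in a black-box setting, exposing a new safety vulnerability in LLM persona consistency.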

Details

Motivation: Existing studies overlook the risk that adversarial conversational history alone can reshape an LLM's persona. Black-box persona manipulation remains unexplored, even though stable personas are critical in domains such as education, mental health, and customer support. Method: Introduce the new task of persona editing and design the PHISH framework, which, under an inference-only, black-box setting, implicitly injects semantic cues through user-side inputs to gradually induce a reverse persona; define a metric that quantifies attack success; and validate on multiple benchmarks and models. Result: Across 3 benchmarks and 8 LLMs, PHISH shifts personas predictably, triggers collateral changes in correlated traits, and has stronger effects in multi-turn conversations. In high-risk settings (mental health, tutoring, and customer support) its effectiveness is confirmed by both human and LLM-as-Judge evaluations, and it only slightly affects reasoning performance. Conclusion: PHISH exposes a new vulnerability in LLM persona robustness; current guardrails remain brittle under sustained attack, and context-resilient persona mechanisms are needed. Abstract: Large Language Models (LLMs) are increasingly deployed in domains such as education, mental health and customer support, where stable and consistent personas are critical for reliability. Yet, existing studies focus on narrative or role-playing tasks and overlook how adversarial conversational history alone can reshape induced personas. Black-box persona manipulation remains unexplored, raising concerns for robustness in realistic interactions. In response, we introduce the task of persona editing, which adversarially steers LLM traits through user-side inputs under a black-box, inference-only setting. To this end, we propose PHISH (Persona Hijacking via Implicit Steering in History), the first framework to expose a new vulnerability in LLM safety that embeds semantically loaded cues into user queries to gradually induce reverse personas. We also define a metric to quantify attack success. Across 3 benchmarks and 8 LLMs, PHISH predictably shifts personas, triggers collateral changes in correlated traits, and exhibits stronger effects in multi-turn settings. In the high-risk domains of mental health, tutoring, and customer support, PHISH reliably manipulates personas, validated by both human and LLM-as-Judge evaluations. Importantly, PHISH causes only a small reduction in reasoning benchmark performance, leaving overall utility largely intact while still enabling significant persona manipulation. While current guardrails offer partial protection, they remain brittle under sustained attack. Our findings expose new vulnerabilities in personas and highlight the need for context-resilient persona in LLMs. Our codebase and dataset are available at: https://github.com/Jivnesh/PHISH

[24] DeepEra: A Deep Evidence Reranking Agent for Scientific Retrieval-Augmented Generated Question Answering

Haotian Chen,Qingqing Long,Siyu Pu,Xiao Luo,Wei Ju,Meng Xiao,Yuanchun Zhou,Jianghua Zhao,Xuezhi Wang

Main category: cs.CL

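TL;DR: This paper proposes DeepEra, which applies step-by-step reasoning to rerank retrieved evidence in scientific question answering in order to address the semantically similar but logically irrelevant (SSLI) problem, and builds the large-scale SciRAG-SSLI dataset for systematic evaluation.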

Details

Motivation: Retrieval and reranking methods in existing RAG frameworks are vulnerable to distractor passages that are semantically similar but logically irrelevant, which lowers factual reliability and amplifies hallucinations. Method: Propose the Deep Evidence Reranking Agent (DeepEra), which integrates step-by-step reasoning to evaluate candidate passages precisely, beyond surface-level semantics; construct the SciRAG-SSLI dataset of about 300K scientific QA instances across 10 subjects, combining naturally retrieved contexts with systematically generated distractors. Result: DeepEra achieves better retrieval performance than leading rerankers; SciRAG-SSLI provides the first empirical validation of how serious and widespread the SSLI problem is in two-stage RAG. Conclusion: SSLI is a key bottleneck for the trustworthiness of scientific question answering. Through interpretable, reasoning-based reranking, DeepEra improves the factual consistency and logical robustness of RAG, offering a new paradigm for trustworthy scientific AI. Abstract: With the rapid growth of scientific literature, scientific question answering (SciQA) has become increasingly critical for exploring and utilizing scientific knowledge. Retrieval-Augmented Generation (RAG) enhances LLMs by incorporating knowledge from external sources, thereby providing credible evidence for scientific question answering. But existing retrieval and reranking methods remain vulnerable to passages that are semantically similar but logically irrelevant, often reducing factual reliability and amplifying hallucinations. To address this challenge, we propose a Deep Evidence Reranking Agent (DeepEra) that integrates step-by-step reasoning, enabling more precise evaluation of candidate passages beyond surface-level semantics. To support systematic evaluation, we construct SciRAG-SSLI (Scientific RAG - Semantically Similar but Logically Irrelevant), a large-scale dataset comprising about 300K SciQA instances across 10 subjects, constructed from a 10M scientific corpus. The dataset combines naturally retrieved contexts with systematically generated distractors to test logical robustness and factual grounding. Comprehensive evaluations confirm that our approach achieves superior retrieval performance compared to leading rerankers. To our knowledge, this work is the first to comprehensively study and empirically validate non-negligible SSLI issues in two-stage RAG frameworks.

[25] TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization

Peiji Li,Linyang Li,Handa Sun,Wenjin Mai,Yongkang Chen,Xiaozhe Li,Yue Shen,Yichuan Ma,Yiliu Sun,Jiaxi Cao,Zhishu He,Bo Wang,Xiaoqing Zheng,Zhaori Bi,Xipeng Qiu,Qipeng Guo,Kai Chen,Dahua Lin

Main category: cs.CL

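TL;DR: This paper proposes TL-GRPO, a lightweight reinforcement learning algorithm for iterative optimization tasks that uses turn-level group sampling for fine-grained optimization. On analog circuit sizing (ACS) under a limited simulation budget it clearly outperforms standard GRPO and Bayesian optimization, and a 30B model trained with it reaches state-of-the-art performance.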

Details

Motivation: Existing GRPO-based trajectory-level reinforcement learning methods cannot perform fine-grained, turn-level optimization in iterative optimization tasks such as analog circuit sizing, while black-box optimization methods discard the prior knowledge and reasoning ability that large models already have. Method: Propose Turn-Level GRPO (TL-GRPO), a lightweight reinforcement learning algorithm that uses turn-level group sampling, suited to iterative optimization tasks in which the underlying environment state stays the same across turns and the value of a trajectory is the best single-turn reward rather than the cumulative return. Result: On ACS, TL-GRPO outperforms standard GRPO and Bayesian optimization; the 30B model reaches state-of-the-art performance under the same simulation budget, showing strong generalization and practical utility. Conclusion: TL-GRPO bridges the gap between trajectory-level RL for tool-using large models and the needs of iterative optimization, providing a new paradigm for scientific optimization tasks that balances reasoning ability and optimization efficiency. Abstract: Large language models have demonstrated strong reasoning capabilities in complex tasks through tool integration, which is typically framed as a Markov Decision Process and optimized with trajectory-level RL algorithms such as GRPO. However, a common class of reasoning tasks, iterative optimization, presents distinct challenges: the agent interacts with the same underlying environment state across turns, and the value of a trajectory is determined by the best turn-level reward rather than cumulative returns. Existing GRPO-based methods cannot perform fine-grained, turn-level optimization in such settings, while black-box optimization methods discard prior knowledge and reasoning capabilities. To address this gap, we propose Turn-Level GRPO (TL-GRPO), a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization. We evaluate TL-GRPO on analog circuit sizing (ACS), a challenging scientific optimization task requiring multiple simulations and domain expertise. Results show that TL-GRPO outperforms standard GRPO and Bayesian optimization methods across various specifications. Furthermore, our 30B model trained with TL-GRPO achieves state-of-the-art performance on ACS tasks under the same simulation budget, demonstrating both strong generalization and practical utility.
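
Two ideas from [25] can be made concrete with a small sketch: a trajectory is valued by its best turn rather than by the sum of its turns, and rewards are compared within a group of candidates sampled at the same turn. The group-relative normalization below (reward minus group mean, divided by group standard deviation) is the standard GRPO-style advantage; applying it per turn, along with the function names and sample rewards, is an assumption made for illustration and is not taken from the paper's implementation.

```python
import statistics


def trajectory_value(turn_rewards):
    """Value of an iterative-optimization trajectory: its best single turn."""
    return max(turn_rewards)


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages for candidates sampled at the same turn.

    Each reward is centered on the group mean and scaled by the group
    standard deviation; eps avoids division by zero when all rewards match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four candidate actions sampled at one turn.
turn_group_rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_advantages(turn_group_rewards)
print([round(a, 2) for a in advantages])

# Best-turn value versus cumulative return for a three-turn trajectory.
rewards_over_turns = [0.1, 0.7, 0.3]
print(trajectory_value(rewards_over_turns), sum(rewards_over_turns))
```

Under a cumulative objective the third turn would still add to the return, whereas under the best-turn objective only the 0.7 result counts, which is why the abstract argues that trajectory-level credit assignment fits this setting poorly.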

[26] Timely Machine: Awareness of Time Makes Test-Time Scaling Agentic

Yichuan Ma,Linyang Li,Yongkang Chen,Peiji Li,Xiaozhe Li,Qipeng Guo,Dahua Lin,Kai Chen

Main category: cs.CL

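TL;DR: This paper proposes the Timely Machine framework, which redefines test-time scaling in terms of wall-clock time with models adjusting their strategy dynamically, and introduces the Timely-Eval benchmark and the Timely-RL method so that models adapt their reasoning to a time budget, improving performance with high-frequency and low-frequency tool calls and in time-constrained settings.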

Details

Motivation: Traditional test-time scaling based on generation length breaks down in agentic scenarios, because tool-call latency decouples inference time from generation length; existing models lack awareness of a time budget and the ability to adapt to it. Method: Propose the Timely Machine framework (wall-clock time as the test-time scale) and the Timely-Eval benchmark (covering different tool-call frequencies and latency settings), and design Timely-RL: a supervised fine-tuning cold start followed by reinforcement learning that optimizes time-aware reasoning and planning. Result: Smaller models do better under low latency with frequent interaction, while larger models win under high latency thanks to higher-quality individual interactions; Timely-RL improves time-budget awareness and performance across Timely-Eval tasks. Conclusion: Wall-clock time, not generation length, should be the core dimension of test-time scaling; Timely-RL offers a practical path toward time-aware agents and supports reliable deployment of LLMs in real tool-augmented settings. Abstract: As large language models (LLMs) increasingly tackle complex reasoning tasks, test-time scaling has become critical for enhancing capabilities. However, in agentic scenarios with frequent tool calls, the traditional generation-length-based definition breaks down: tool latency decouples inference time from generation length. We propose Timely Machine, redefining test-time as wall-clock time, where models dynamically adjust strategies based on time budgets. We introduce Timely-Eval, a benchmark spanning high-frequency tool calls, low-frequency tool calls, and time-constrained reasoning. By varying tool latency, we find smaller models excel with fast feedback through more interactions, while larger models dominate high-latency settings via superior interaction quality. Moreover, existing models fail to adapt reasoning to time budgets. We propose Timely-RL to address this gap. After cold-start supervised fine-tuning, we use reinforcement learning to enhance temporal planning. Timely-RL improves time budget awareness and consistently boosts performance across Timely-Eval. We hope our work offers a new perspective on test-time scaling for the agentic era.
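
The point in [26] that tool latency decouples inference time from generation length can be illustrated with a wall-clock budget loop. The sketch below is a minimal illustration and not the Timely Machine implementation: `run_with_time_budget`, `FakeClock`, and `make_tool` are hypothetical names, and a fake clock replaces `time.monotonic` so that the demo is deterministic.

```python
import time


def run_with_time_budget(step, budget_seconds, clock=time.monotonic):
    """Call step() until the wall-clock budget is spent; return (best, calls).

    The deadline is checked before each step, so a slow final step may
    overrun the budget slightly.
    """
    deadline = clock() + budget_seconds
    best, calls = None, 0
    while clock() < deadline:
        score = step()
        calls += 1
        if best is None or score > best:
            best = score
    return best, calls


class FakeClock:
    """Deterministic stand-in for time.monotonic, used only for the demo."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


def make_tool(clock, latency, scores):
    """Build a step function that costs `latency` seconds per tool call."""
    remaining = iter(scores)

    def step():
        clock.now += latency  # simulate waiting for the tool
        return next(remaining)

    return step


# Same 3-second budget, different tool latency.
fast_clock = FakeClock()
fast = run_with_time_budget(make_tool(fast_clock, 0.25, range(100)), 3.0, fast_clock)
slow_clock = FakeClock()
slow = run_with_time_budget(make_tool(slow_clock, 1.0, range(100)), 3.0, slow_clock)
print(fast, slow)  # (11, 12) versus (2, 3)
```

With the budget fixed, the number of interactions is set by tool latency rather than by how much text the model generates, which is the regime in which the abstract reports that interaction quality starts to matter more than interaction count.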

[27] MRAG: Benchmarking Retrieval-Augmented Generation for Bio-medicine

Wei Zhu

Main category: cs.CL

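TL;DR: This paper presents the Medical Retrieval-Augmented Generation (MRAG) benchmark and an accompanying toolkit, addressing the lack of a comprehensive evaluation benchmark for RAG systems in the medical domain. Experiments show how RAG improves the reliability, usefulness, and reasoning quality of large language models, and how it affects readability.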

Details

Motivation: Retrieval-augmented generation (RAG) is widely used in scientific and clinical question answering systems, but the medical domain still lacks a comprehensive, multilingual, multi-task RAG evaluation benchmark. Method: Build the MRAG benchmark, covering multiple tasks in English and Chinese with a corpus drawn from Wikipedia and PubMed; develop the MRAG-Toolkit for systematic exploration of RAG components; and run comparative experiments on the factors that affect RAG performance. Result: (a) RAG improves LLM reliability across MRAG tasks; (b) retrieval approach, model size, and prompting strategy all affect RAG performance; (c) RAG improves usefulness and reasoning quality but may make answers to long-form questions slightly less readable. Conclusion: The MRAG benchmark and toolkit provide a standardized evaluation platform for medical RAG research and support progress in both academia and industry. Abstract: While Retrieval-Augmented Generation (RAG) has been swiftly adopted in scientific and clinical QA systems, a comprehensive evaluation benchmark in the medical domain is lacking. To address this gap, we introduce the Medical Retrieval-Augmented Generation (MRAG) benchmark, covering various tasks in English and Chinese languages, and building a corpus with Wikipedia and PubMed. Additionally, we develop the MRAG-Toolkit, facilitating systematic exploration of different RAG components. Our experiments reveal that: (a) RAG enhances LLM reliability across MRAG tasks. (b) The performance of RAG systems is influenced by retrieval approaches, model sizes, and prompting strategies. (c) While RAG improves usefulness and reasoning quality, LLM responses may become slightly less readable for long-form questions. We will release the MRAG-Bench's dataset and toolkit with CC BY 4.0 license upon acceptance, to facilitate applications from both academia and industry.

[28] LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning

Obed Junias,Maria Leonor Pacheco

Main category: cs.CL

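TL;DR: This paper introduces the LOGICAL-COMMONSENSEQA benchmark, which recasts commonsense reasoning as logical composition (AND/OR/NEITHER) over pairs of atomic statements and reveals a marked weakness of current large models on negation-based reasoning.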

Details

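Motivation: Most commonsense reasoning benchmarks use single-label evaluation, which cannot capture the logical relations between multiple interpretations (jointly plausible, mutually exclusive, or jointly implausible) and so limits assessment of a model's actual reasoning ability. Method: Build LOGICAL-COMMONSENSEQA, which frames commonsense questions as judgments about logical operations (AND, OR, NEITHER/NOR) over two atomic statements, and evaluate a range of mainstream large models under zero-shot, few-shot, and chain-of-thought prompting. Result: Models perform reasonably on conjunctive (AND) reasoning and moderately on disjunctive (OR) reasoning, but performance drops sharply on negation-based (NEITHER/NOR) questions; no model or prompting strategy fundamentally resolves this. Conclusion: The benchmark exposes basic weaknesses of current language models in logical composition and negation, and provides a controlled, interpretable framework for advancing compositional commonsense reasoning. Abstract: Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that re-frames commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, NEITHER/NOR). Evaluating instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting, we find that while models perform reasonably on conjunctive and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions. LOGICAL-COMMONSENSEQA exposes fundamental reasoning limitations and provides a controlled framework for advancing compositional commonsense reasoning.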
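
The composition in [28] can be pictured as applying a logical operator to two plausibility judgments. The sketch below uses the ordinary truth-functional reading of AND, OR, and NEITHER/NOR; the function name `compose` and the boolean encoding are assumptions for illustration, and the benchmark's actual label scheme may differ.

```python
def compose(operator, a_plausible, b_plausible):
    """Evaluate a plausibility-level operator over two atomic statements.

    a_plausible and b_plausible are booleans stating whether each atomic
    statement is judged plausible on its own.
    """
    if operator == "AND":
        return a_plausible and b_plausible
    if operator == "OR":
        return a_plausible or b_plausible
    if operator in ("NEITHER", "NOR"):
        return not (a_plausible or b_plausible)
    raise ValueError(f"unknown operator: {operator}")


# Truth table over all four plausibility combinations.
for a in (True, False):
    for b in (True, False):
        row = {op: compose(op, a, b) for op in ("AND", "OR", "NEITHER")}
        print(a, b, row)
```

NEITHER holds only when both statements are rejected, so answering it correctly requires the model to judge two statements implausible at once, which is the case where the abstract reports the sharpest drop.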

[29] Is Length Really A Liability? An Evaluation of Multi-turn LLM Conversations using BoolQ

Karl Neergaard,Le Qiu,Emmanuele Chersoni

Main category: cs.CL

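TL;DR: This paper argues that single-prompt evaluation cannot capture the harms that may arise in real conversations. Multi-turn evaluation of three different large language models on the BoolQ dataset shows that conversation length and prompt scaffolding affect the veracity of model responses, revealing the limits of static evaluation.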

Details

Motivation: Single-prompt evaluation does not reflect the potential harms that conversational dynamics bring in the real world; evaluation closer to actual deployment is needed. Method: On the BoolQ dataset, run multi-turn evaluation experiments for three different large language models under varying conversation lengths and scaffolding conditions. Result: The models show length-dependent and scaffold-dependent vulnerabilities that single-turn testing cannot detect. Conclusion: Static single-turn evaluation has a fundamental limitation; multi-turn conversational evaluation is essential for identifying deployment-relevant model risks. Abstract: Single-prompt evaluations dominate current LLM benchmarking, yet they fail to capture the conversational dynamics where real-world harm occurs. In this study, we examined whether conversation length affects response veracity by evaluating LLM performance on the BoolQ dataset under varying length and scaffolding conditions. Our results across three distinct LLMs revealed model-specific vulnerabilities that are invisible under single-turn testing. The length-dependent and scaffold-specific effects we observed demonstrate a fundamental limitation of static evaluations, as deployment-relevant vulnerabilities could only be spotted in a multi-turn conversational setting.

[30] SearchLLM: Detecting LLM Paraphrased Text by Measuring the Similarity with Regeneration of the Candidate Source via Search Engine

Hoang-Quoc Nguyen-Son,Minh-Son Dao,Koji Zettsu

Main category: cs.CL

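TL;DR: This paper proposes SearchLLM, which uses a search engine to locate candidate original sources and compares the input text with LLM-regenerated versions of those sources to detect text paraphrased by a large language model (LLM); it works as a proxy layer that improves existing detectors.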

Details

Motivation: Existing detectors of LLM-paraphrased text perform poorly when the paraphrase closely mimics the original content and the meaning may be distorted; traditional detection struggles to tell human-written text from LLM-paraphrased text. Method: In the SearchLLM framework, the text under test is submitted to a search engine to obtain candidate original sources; each candidate is regenerated with an LLM; and the similarity between the input and the regenerated text is used to decide whether the input is LLM-paraphrased. The method plugs into existing detectors as a proxy layer. Result: On paraphrased text produced by various LLMs, SearchLLM consistently improves the accuracy of existing detectors on paraphrases that closely mimic the original, and helps them resist paraphrasing attacks. Conclusion: SearchLLM is a practical, pluggable way to strengthen detection: by adding external search and regeneration-based comparison, it compensates for the limits of purely model-internal detection on meaning-preserving paraphrases. Abstract: With the advent of large language models (LLMs), it has become common practice for users to draft text and utilize LLMs to enhance its quality through paraphrasing. However, this process can sometimes result in the loss or distortion of the original intended meaning. Due to the human-like quality of LLM-generated text, traditional detection methods often fail, particularly when text is paraphrased to closely mimic original content. In response to these challenges, we propose a novel approach named SearchLLM, designed to identify LLM-paraphrased text by leveraging search engine capabilities to locate potential original text sources. By analyzing similarities between the input and regenerated versions of candidate sources, SearchLLM effectively distinguishes LLM-paraphrased content. SearchLLM is designed as a proxy layer, allowing seamless integration with existing detectors to enhance their performance. Experimental results across various LLMs demonstrate that SearchLLM consistently enhances the accuracy of recent detectors in detecting LLM-paraphrased text that closely mimics original content. Furthermore, SearchLLM also helps the detectors prevent paraphrasing attacks.
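
The search, regenerate, and compare flow described in [30] can be sketched with injected stand-ins for the search engine and the LLM. Everything below is hypothetical scaffolding: `looks_llm_paraphrased`, the toy `search` and `regenerate` callables, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are illustrative assumptions, not the components used by SearchLLM.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1] (a stand-in similarity measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def looks_llm_paraphrased(text, search, regenerate, threshold=0.8):
    """Flag `text` if it closely matches an LLM regeneration of a found source.

    search(text) returns candidate source texts; regenerate(source) returns
    an LLM paraphrase of one source. Both are supplied by the caller.
    """
    best = 0.0
    for source in search(text):
        best = max(best, similarity(text, regenerate(source)))
    return best >= threshold, best


# Toy stand-ins for the search engine and the paraphrasing LLM.
PARAPHRASES = {"The cat sat on the mat.": "A cat was sitting on the mat."}


def toy_search(text):
    return list(PARAPHRASES)


def toy_regenerate(source):
    return PARAPHRASES[source]


print(looks_llm_paraphrased("A cat was sitting on the mat.", toy_search, toy_regenerate))
print(looks_llm_paraphrased("Quarterly revenue rose.", lambda text: [], toy_regenerate))
```

Passing the search and regeneration steps in as callables mirrors the proxy-layer idea in the abstract: the comparison logic stays the same while the backing search engine, LLM, or downstream detector can be swapped.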

[31] Curate-Train-Refine: A Closed-Loop Agentic Framework for Zero Shot Classification

Gaurav Maheshwari,Kevin El Haddad

Main category: cs.CL

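TL;DR: This paper proposes training lightweight text classifiers with supervision generated dynamically by a large language model (LLM). A closed-loop, agentic process of iterative data refinement improves zero- and few-shot classification while avoiding the high cost of deploying a large model.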

Details

Motivation: LLMs and high-capacity encoders have improved zero- and few-shot classification, but their inference cost and latency limit practical deployment. Method: An iterative, agentic loop in which the LLM curates training data, analyzes the lightweight classifier's successes and failures, and synthesizes targeted examples to correct observed errors; the closed loop progressively improves data quality and adapts the data to the downstream task and model. Result: On four widely used benchmarks the method consistently outperforms standard zero- and few-shot baselines. Conclusion: LLMs can serve effectively as data curators, enabling accurate, low-cost classification with lightweight models. Abstract: Large language models (LLMs) and high-capacity encoders have advanced zero and few-shot classification, but their inference cost and latency limit practical deployment. We propose training lightweight text classifiers using dynamically generated supervision from an LLM. Our method employs an iterative, agentic loop in which the LLM curates training data, analyzes model successes and failures, and synthesizes targeted examples to address observed errors. This closed-loop generation and evaluation process progressively improves data quality and adapts it to the downstream classifier and task. Across four widely used benchmarks, our approach consistently outperforms standard zero and few-shot baselines. These results indicate that LLMs can serve effectively as data curators, enabling accurate and efficient classification without the operational cost of large-model deployment.

[32] Retrieve-Refine-Calibrate: A Framework for Complex Claim Fact-Checking

Mingwei Sun,Qianlong Wang,Ruifeng Xu

Main category: cs.CL

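TL;DR: This paper proposes Retrieve-Refine-Calibrate (RRC), an LLM-based framework that improves fact-checking accuracy through three steps (entity identification with evidence retrieval, evidence refinement, and re-evaluation of low-confidence predictions), mitigating the noise introduced by the decomposition paradigm.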

Details

Motivation: Existing fact-checking methods based on the decomposition paradigm tend to introduce irrelevant entities or evidence, and the resulting noise lowers verification accuracy. Method: The RRC framework (1) identifies the entities in the claim and retrieves evidence relevant to them, (2) refines the evidence against the claim to remove irrelevant information, and (3) calibrates by re-evaluating low-confidence predictions. All steps are implemented with LLMs. Result: On two widely used fact-checking datasets, HOVER and FEVEROUS-S, RRC outperforms several strong baselines. Conclusion: By reducing noise and making reasoning more robust, RRC improves the accuracy and reliability of fact-checking. Abstract: Fact-checking aims to verify the truthfulness of a claim based on the retrieved evidence. Existing methods typically follow a decomposition paradigm, in which a claim is broken down into sub-claims that are individually verified. However, the decomposition paradigm may introduce noise to the verification process due to irrelevant entities or evidence, ultimately degrading verification accuracy. To address this problem, we propose a Retrieve-Refine-Calibrate (RRC) framework based on large language models (LLMs). Specifically, the framework first identifies the entities mentioned in the claim and retrieves evidence relevant to them. Then, it refines the retrieved evidence based on the claim to reduce irrelevant information. Finally, it calibrates the verification process by re-evaluating low-confidence predictions. Experiments on two popular fact-checking datasets (HOVER and FEVEROUS-S) demonstrate that our framework achieves superior performance compared with competitive baselines.
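
The three RRC stages in [32] map naturally onto a small pipeline in which only low-confidence verdicts get a second pass. The sketch below is a hedged illustration with caller-supplied stages: the function `rrc_verify`, the 0.7 confidence threshold, and the toy stubs are assumptions, and in the paper each stage is carried out by an LLM rather than by the keyword heuristics used here.

```python
def rrc_verify(claim, retrieve, refine, verify, reverify, min_confidence=0.7):
    """Retrieve evidence, refine it, verify, and re-check weak verdicts.

    Returns (label, confidence, recalibrated), where recalibrated says
    whether the low-confidence re-evaluation step was triggered.
    """
    evidence = retrieve(claim)
    refined = refine(claim, evidence)
    label, confidence = verify(claim, refined)
    recalibrated = confidence < min_confidence
    if recalibrated:
        label, confidence = reverify(claim, refined)
    return label, confidence, recalibrated


# Toy stand-ins for the LLM-backed stages.
def toy_retrieve(claim):
    return ["Paris is the capital of France.", "Bananas are yellow."]


def toy_refine(claim, evidence):
    # Keep only evidence sharing at least one word with the claim.
    claim_words = set(claim.lower().split())
    return [e for e in evidence if claim_words & set(e.lower().split())]


def toy_verify(claim, evidence):
    return "SUPPORTED", 0.55  # deliberately below the threshold


def toy_reverify(claim, evidence):
    return "SUPPORTED", 0.9


print(rrc_verify("Paris is in France.", toy_retrieve, toy_refine, toy_verify, toy_reverify))
```

Restricting the second pass to low-confidence cases keeps the extra cost proportional to how often the first verdict is uncertain, which is one plausible reason to calibrate this way.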

[33] Attention-MoA: Enhancing Mixture-of-Agents via Inter-Agent Semantic Attention and Deep Residual Synthesis

Jianyu Wen,Yang Wei,Xiongxi Yu,Changxuan Xiao,Ke Zeng

Main category: cs.CL

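TL;DR: This paper proposes Attention-MoA, a framework that adds inter-agent semantic attention and a residual module with adaptive early stopping to strengthen semantic interaction and error correction in multi-model collaboration. It significantly outperforms existing methods on several benchmarks and lets an ensemble of small open-source models surpass large proprietary models.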

Details

Motivation: Existing Mixture-of-Agents (MoA) methods have introduced dynamic routing and residual connections but lack deep semantic interaction, making it hard to correct hallucinations and logical errors. Method: Attention-MoA has two core components: Inter-agent Semantic Attention (fine-grained semantic alignment between agents) and an Inter-layer Residual Module with Adaptive Early Stopping (mitigating information degradation in deep layers and improving computational efficiency). Result: Attention-MoA leads on AlpacaEval 2.0, MT-Bench, and FLASK, with a 91.15% length-controlled win rate on AlpacaEval 2.0 and the top result on 10 of 12 FLASK capabilities; an ensemble of small models reaches an MT-Bench score of 8.83 and an AlpacaEval 2.0 LC win rate of 77.36%, surpassing Claude-4.5-Sonnet and GPT-4.1. Conclusion: By strengthening semantic collaboration and optimizing structure, Attention-MoA shows that lightweight open-source models working together can match or even surpass very large proprietary models, offering a new paradigm for efficient, controllable collaborative reasoning. Abstract: As the development of Large Language Models (LLMs) shifts from parameter scaling to inference-time collaboration, the Mixture-of-Agents (MoA) framework has emerged as a general paradigm to harness collective intelligence by layering diverse models. While recent MoA variants have introduced dynamic routing and residual connections to improve efficiency, these methods often fail to facilitate deep semantic interaction between agents, limiting the system's ability to actively correct hallucinations and refine logic. In this paper, we introduce Attention-MoA, a novel MoA-based framework that redefines collaboration through Inter-agent Semantic Attention. Complemented by an Inter-layer Residual Module with Adaptive Early Stopping Mechanism, our architecture mitigates information degradation in deep layers while improving computational efficiency. Extensive evaluations across AlpacaEval 2.0, MT-Bench, and FLASK demonstrate that Attention-MoA significantly outperforms state-of-the-art baselines, achieving a 91.15% Length-Controlled Win Rate on AlpacaEval 2.0 and dominating in 10 out of 12 capabilities on FLASK. Notably, Attention-MoA enables an ensemble of small open-source models to outperform massive proprietary models like Claude-4.5-Sonnet and GPT-4.1, achieving an MT-Bench score of 8.83 and an AlpacaEval 2.0 LC Win Rate of 77.36%.

[34] AuroraEdge-V-2B: A Faster And Stronger Edge Visual Large Language Model

Xiang Chen

Main category: cs.CL

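TL;DR: This paper presents AuroraEdge-V-2B, a compact, efficient, and robust visual large language model designed for edge deployment, together with a compression-fusion method that improves inference efficiency, keeping strong performance while substantially reducing compute cost and latency.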

Details

Motivation: VLLMs offer strong generalization and flexibility in industrial applications, but they underperform in specific domains, have many parameters, and run slowly, so they struggle to meet the real-time and resource constraints of edge deployment. Method: Propose the compact VLLM AuroraEdge-V-2B (only 2B parameters) and a compression-fusion method that reduces the number of visual tokens and the floating-point operations; the model is designed to suit edge devices. Result: AuroraEdge-V-2B scores higher on 9 benchmarks than models of comparable size (e.g., Qwen2-VL-2B), with faster inference, fewer visual tokens, and half the FLOPs, making it better suited to edge deployment. Conclusion: AuroraEdge-V-2B strikes a good balance among performance, efficiency, and deployment cost, offering a practical option for using VLLMs in resource-constrained industrial settings. Abstract: Recently, due to the advancement of multimodal technology, people are attempting to use visual large language models (VLLMs) in industrial production. Many deep learning models (DLMs) deployed in the production environment are gradually being replaced by VLLMs. Compared with DLMs, VLLMs have some advantages in industrial applications: (1) Their strong generalization ability enables them to perform well across a wide range of tasks. (2) They are flexible and can deal with unfamiliar samples through context learning quickly. However, VLLMs also have obvious drawbacks: (1) VLLMs do not perform as well as custom-developed DLMs in specific domains. (2) The number of parameters in VLLMs is generally quite large, and their deployment requires substantial computational resources. (3) VLLMs generally operate much slower than DLMs, making real-time response challenging to achieve. To better utilize VLLMs in industrial applications, we introduce AuroraEdge-V-2B in this work, a compact, robust, and high-speed VLLM designed for edge deployment. To make the model run faster, we also propose a compression-fusion method to improve inference efficiency. AuroraEdge-V-2B has the following notable features: (1) Easy deployment and faster: It has only 2B parameters and is highly suitable for edge deployment, offering better real-time performance. (2) Fewer visual tokens and cheaper: It significantly reduces the number of visual tokens in the decoding process, thereby reducing the floating-point operations by half during inference and making it cheaper to use. (3) Strong performance: It gets a higher score on 9 benchmarks than models with the same number of parameters (e.g., Qwen2-VL-2B, Qwen2.5-VL-3B, InternVL-2.5-2B).

[35] PROST-LLM: Progressively Enhancing the Speech-to-Speech Translation Capability in LLMs

Jing Xu,Jiaqi Wang,Daxin Tan,Xiao Chen

Main category: cs.CL

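TL;DR: This paper proposes PROST-LLM, which progressively improves the speech-to-speech translation (S2ST) capability of large language models through tri-task learning and a chain-of-modality method, preference pairs generated by self-sampling and back-translation, and preference optimization.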

Details

Motivation: LLMs are under-used for speech-to-speech translation (S2ST), mainly because of data scarcity. Method: A progressive strategy: (1) fine-tune the LLM on the CVSS corpus with tri-task learning and a chain-of-modality method; (2) use the fine-tuned model to generate preference pairs through self-sampling and back-translation, with no human evaluation; (3) run preference optimization on those pairs. Result: Extensive experiments confirm that PROST-LLM improves the S2ST performance of LLMs. Conclusion: PROST-LLM offers a practical and effective progressive approach to the S2ST capability gap that data scarcity causes in LLMs. Abstract: Although Large Language Models (LLMs) excel in many tasks, their application to Speech-to-Speech Translation (S2ST) is underexplored and hindered by data scarcity. To bridge this gap, we propose PROST-LLM (PROgressive Speech-to-speech Translation) to enhance the S2ST capabilities in LLMs progressively. First, we fine-tune the LLMs with the CVSS corpus, employing designed tri-task learning and chain of modality methods to boost the initial performance. Then, leveraging the fine-tuned model, we generate preference pairs through self-sampling and back-translation without human evaluation. Finally, these preference pairs are used for preference optimization to enhance the model's S2ST capability further. Extensive experiments confirm the effectiveness of our proposed PROST-LLM in improving the S2ST capability of LLMs.

[36] How Does Personalized Memory Shape LLM Behavior? Benchmarking Rational Preference Utilization in Personalized Assistants

Xueyang Feng,Weinan Gan,Xu Chen,Quanyu Dai,Yong Liu

Main category: cs.CL

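TL;DR: This paper introduces the RPEval benchmark and the RP-Reasoner method to address the way personalized memory in LLM assistants interferes with intent understanding; pragmatic reasoning is used to integrate personalized information selectively, which significantly improves performance.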

Details

Motivation: Once LLM assistants store user-preference memories, irrelevant memories often interfere with intent understanding; systematic evaluation and a remedy are needed. Method: Build the RPEval benchmark (a personalized intent reasoning dataset plus a multi-granularity evaluation protocol) and propose RP-Reasoner, which models memory utilization as a pragmatic reasoning process so that personalized information is integrated selectively. Result: RP-Reasoner significantly outperforms strong baselines on RPEval and resolves 80% of the bad cases observed in a commercial personalized assistant. Conclusion: Pragmatic reasoning can mitigate irrational personalization, and RPEval provides a reproducible, extensible evaluation standard for this area. Abstract: Large language model (LLM)-powered assistants have recently integrated memory mechanisms that record user preferences, leading to more personalized and user-aligned responses. However, irrelevant personalized memories are often introduced into the context, interfering with the LLM's intent understanding. To comprehensively investigate the dual effects of personalization, we develop RPEval, a benchmark comprising a personalized intent reasoning dataset and a multi-granularity evaluation protocol. RPEval reveals the widespread phenomenon of irrational personalization in existing LLMs and, through error pattern analysis, illustrates its negative impact on user experience. Finally, we introduce RP-Reasoner, which treats memory utilization as a pragmatic reasoning process, enabling the selective integration of personalized information. Experimental results demonstrate that our method significantly outperforms carefully designed baselines on RPEval, and resolves 80% of the bad cases observed in a large-scale commercial personalized assistant, highlighting the potential of pragmatic reasoning to mitigate irrational personalization. Our benchmark is publicly available at https://github.com/XueyangFeng/RPEval.

[37] MultiLexNorm++: A Unified Benchmark and a Generative Model for Lexical Normalization for Asian Languages

Weerayut Buaphet,Thanh-Nhi Nguyen,Risa Kondo,Tomoyuki Kajiwara,Yumin Kim,Jimin Lee,Hwanhee Lee,Holy Lovenia,Peerat Limkonchotiwat,Sarana Nutanong,Rob Van der Goot

Main category: cs.CL

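TL;DR: This paper extends the MultiLexNorm benchmark with 5 Asian languages (from different language families and in 4 scripts), finds that the existing state-of-the-art model performs worse on the new languages, and proposes a new LLM-based architecture that is more robust.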

Details

Motivation: The existing MultiLexNorm benchmark mainly covers Indo-European languages in the Latin script and under-represents Asian languages (multiple families and scripts), which limits evaluation of how well lexical normalization models generalize. Method: Build a MultiLexNorm extension covering 5 Asian languages (different families, 4 scripts); evaluate the existing SOTA model on it; propose a new architecture based on large language models (LLMs); conduct an error analysis. Result: The existing SOTA model performs markedly worse on the added Asian languages; the proposed LLM-based architecture is more robust across languages; the error analysis points to directions for future improvement. Conclusion: Lexical normalization needs benchmarks with broader language coverage; LLM-based architectures adapt across languages better than prior approaches; future work should address multi-script, multi-family modeling and fine-grained error types. Abstract: Social media data has been of interest to Natural Language Processing (NLP) practitioners for over a decade, because of its richness in information, but also challenges for automatic processing. Since language use is more informal, spontaneous, and adheres to many different sociolects, the performance of NLP models often deteriorates. One solution to this problem is to transform data to a standard variant before processing it, which is also called lexical normalization. There has been a wide variety of benchmarks and models proposed for this task. The MultiLexNorm benchmark proposed to unify these efforts, but it consists almost solely of languages from the Indo-European language family in the Latin script. Hence, we propose an extension to MultiLexNorm, which covers 5 Asian languages from different language families in 4 different scripts. We show that the previous state-of-the-art model performs worse on the new languages and propose a new architecture based on Large Language Models (LLMs), which shows more robust performance. Finally, we analyze remaining errors, revealing future directions for this task.

[38] Typologically Informed Parameter Aggregation

Stef Accou,Wessel Poelman

Main category: cs.CL

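TL;DR: This paper proposes TIPA, a training-free parameter aggregation method based on typological similarity that builds proxy language adapters to improve zero-shot cross-lingual transfer of multilingual models to low-resource and unseen languages.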

Details

Motivation: Massively multilingual language models underperform on low-resource and unseen languages, and training a separate adapter for every language is costly. Method: Typologically Informed Parameter Aggregation (TIPA) builds proxy adapters within the MAD-X framework by taking a weighted aggregate of existing language adapters, with weights set by typological similarity, enabling zero-shot cross-lingual transfer. Result: Across 5 NLP tasks and more than 230 languages, TIPA consistently outperforms or matches baselines such as English-only fine-tuning and selecting the typologically closest language adapter, with the largest gains for languages lacking a dedicated adapter. Conclusion: Typologically informed parameter aggregation is an effective training-free alternative to language-specific modules. Abstract: Massively multilingual language models enable cross-lingual generalization but underperform on low-resource and unseen languages. While adapter-based fine-tuning offers a parameter-efficient solution, training language-specific adapters at scale remains costly. We introduce Typologically Informed Parameter Aggregation (TIPA), a training-free method that constructs proxy language adapters by aggregating existing ones, weighted by typological similarity. Integrated into the MAD-X framework, these proxies enable zero-shot cross-lingual transfer without additional training. We evaluate TIPA on five NLP tasks and over 230 languages. TIPA consistently outperforms or matches baselines such as English-only fine-tuning or selecting the typologically closest language adapter. We see the largest gains for languages lacking dedicated adapters. Our results demonstrate that typologically informed aggregation provides a viable alternative to language-specific modules without any training needed.
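The aggregation idea in the TIPA entry can be illustrated with a minimal sketch. The abstract only states that existing adapters are aggregated with weights given by typological similarity; the softmax normalization, the `temperature` parameter, the function names, and the flat-list parameter layout below are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw similarity scores into weights that sum to 1 (assumed normalization)."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_adapters(adapters, similarities, temperature=1.0):
    """Build a proxy adapter as a similarity-weighted average of existing adapters.

    adapters:     {language: {param_name: [float, ...]}} (hypothetical layout)
    similarities: {language: typological similarity to the target language}
    """
    langs = sorted(adapters)
    weights = softmax([similarities[lang] for lang in langs], temperature)
    proxy = {}
    for name, reference in adapters[langs[0]].items():
        proxy[name] = [
            sum(w * adapters[lang][name][i] for w, lang in zip(weights, langs))
            for i in range(len(reference))
        ]
    return proxy


# Two toy source adapters, equally similar to the target language.
source_adapters = {"de": {"w": [1.0, 2.0]}, "nl": {"w": [3.0, 4.0]}}
proxy = aggregate_adapters(source_adapters, {"de": 0.5, "nl": 0.5})
```

With equal similarities the proxy is the plain average of the source adapters; lowering `temperature` moves the result toward the single closest language, which corresponds to the "closest adapter" baseline mentioned in the abstract.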

[39] Sycophancy Hides Linearly in the Attention Heads

Rifo Genadi,Munachiso Nwadike,Nurdaulet Mukhituly,Hilal Alquabeh,Tatsuya Hiraoka,Kentaro Inui

Main category: cs.CL

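TL;DR: This paper finds that correct-to-incorrect sycophancy signals are most linearly separable in multi-head attention activations, and uses linear probe analysis to locate a sparse set of middle-layer attention heads where intervention effectively mitigates sycophancy in large models.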

Details

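Motivation: Guided by the linear representation hypothesis, the paper investigates where sycophancy signals are represented inside the model (residual stream, MLP, attention layers) and how separable they are, in search of an interpretable, steerable mechanism. Method: Train linear probes on factual QA datasets such as TruthfulQA and analyze the linear separability of sycophancy signals in the residual stream, MLP, and attention layers; assess cross-dataset transfer of the probes; compare against previously identified "truthful" directions; analyze attention patterns to characterize the key heads. Result: Sycophancy signals are most linearly separable in multi-head attention activations; probe-based steering is most effective on a sparse subset of middle-layer attention heads; probes generalize well across factual QA tasks; the discovered direction overlaps only slightly with existing "truthful" directions; the key heads attend disproportionately to expressions of user doubt. Conclusion: Sycophantic behavior is linked to attention skewed toward user-doubt signals and can be mitigated with simple linear interventions on middle-layer attention activations, indicating that it is geometrically interpretable and controllable. Abstract: We find that correct-to-incorrect sycophancy signals are most linearly separable within multi-head attention activations. Motivated by the linear representation hypothesis, we train linear probes across the residual stream, multilayer perceptron (MLP), and attention layers to analyze where these signals emerge. Although separability appears in the residual stream and MLPs, steering using these probes is most effective in a sparse subset of middle-layer attention heads. Using TruthfulQA as the base dataset, we find that probes trained on it transfer effectively to other factual QA benchmarks. Furthermore, comparing our discovered direction to previously identified "truthful" directions reveals limited overlap, suggesting that factual accuracy, and deference resistance, arise from related but distinct mechanisms. Attention-pattern analysis further indicates that the influential heads attend disproportionately to expressions of user doubt, contributing to sycophantic shifts. Overall, these findings suggest that sycophancy can be mitigated through simple, targeted linear interventions that exploit the internal geometry of attention activations.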
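To make "steering along a linear direction" concrete, here is a toy sketch. It deliberately swaps the paper's trained linear probes for a simpler difference-of-class-means direction, and the function names, the 2-dimensional activations, and the step size `alpha` are hypothetical; it only shows the geometry of the intervention, not the authors' procedure.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def unit(vec):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def probe_direction(sycophantic, non_sycophantic):
    """Difference-of-means direction (a stand-in for a trained linear probe)."""
    mu_s = mean_vector(sycophantic)
    mu_n = mean_vector(non_sycophantic)
    return unit([a - b for a, b in zip(mu_s, mu_n)])


def projection(activation, direction):
    """Scalar component of the activation along the direction."""
    return sum(a * d for a, d in zip(activation, direction))


def steer(activation, direction, alpha):
    """Shift one head's activation by alpha away from the sycophancy direction."""
    return [a - alpha * d for a, d in zip(activation, direction)]


# Toy activations for a single attention head.
direction = probe_direction([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
steered = steer([3.0, 1.0], direction, alpha=2.0)
```

Because the direction has unit norm, the steered activation's projection onto it drops by exactly `alpha`, while components orthogonal to the direction are left unchanged.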

[40] Select or Project? Evaluating Lower-dimensional Vectors for LLM Training Data Explanations

Lukas Hinterleitner,Loris Schoenegger,Benjamin Roth

Main category: cs.CL

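TL;DR: This paper examines how to handle very high-dimensional gradients efficiently in gradient-based instance explanations for large language models (LLMs); it proposes and validates an architecture-informed approach that greedily selects a small set of model components, which outperforms the full gradient and random projection on an influence retrieval task while being more computationally efficient.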

Details

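Motivation: Gradient-based instance explanation methods are limited by the extremely high dimensionality of model gradients; in practice, influence is estimated on a parameter subset that is chosen ad hoc, without systematic evaluation or justification. Method: Build a new benchmark and compare three strategies: (1) greedily selecting a small, architecturally informed subset of model components; (2) using the full gradient; (3) randomly projecting the full gradient. The evaluation focuses on training-data influence retrieval performance and computational cost. Result: The greedily selected component subset outperforms both the full gradient and random projection on influence retrieval and is more computationally efficient. Conclusion: For instance-based explanations of large models, principled selection of key components is more effective and practical than blind dimensionality reduction such as random projection, offering a route to scalable explanation methods. Abstract: Gradient-based methods for instance-based explanation for large language models (LLMs) are hindered by the immense dimensionality of model gradients. In practice, influence estimation is restricted to a subset of model parameters to make computation tractable, but this subset is often chosen ad hoc and rarely justified by systematic evaluation. This paper investigates if it is better to create low-dimensional representations by selecting a small, architecturally informed subset of model components or by projecting the full gradients into a lower-dimensional space. Using a novel benchmark, we show that a greedily selected subset of components captures the information about training data influence needed for a retrieval task more effectively than either the full gradient or random projection. We further find that this approach is more computationally efficient than random projection, demonstrating that targeted component selection is a practical strategy for making instance-based explanations of large models more computationally feasible.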
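The two ways of shrinking a gradient that this entry compares, selecting components versus projecting, can be contrasted in a small sketch. Everything here is an illustrative assumption: the per-component dictionary layout, the Gaussian projection, the dot-product influence score, and all function names are hypothetical, and the greedy search over components described in the abstract is not reproduced.

```python
import random


def select_components(grads_by_component, chosen):
    """Concatenate the gradients of the chosen components only (a cheap slice)."""
    selected = []
    for name in chosen:
        selected.extend(grads_by_component[name])
    return selected


def random_projection(grad, out_dim, seed=0):
    """Project the full gradient with a seeded Gaussian matrix.

    Re-seeding per call regenerates the same matrix for gradients of equal
    length, so projected vectors stay comparable. Cost is O(len(grad) * out_dim).
    """
    rng = random.Random(seed)
    scale = 1.0 / (out_dim ** 0.5)
    return [scale * sum(rng.gauss(0.0, 1.0) * g for g in grad) for _ in range(out_dim)]


def dot(u, v):
    """Inner product, used here as a simple influence score."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_most_influential(test_vec, train_vecs):
    """Index of the training example whose vector has the largest dot product."""
    scores = [dot(test_vec, t) for t in train_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy gradient split into two named components.
toy_grad = {"attn.q": [1.0, 2.0], "mlp.out": [3.0]}
picked = select_components(toy_grad, ["mlp.out", "attn.q"])
projected = random_projection([1.0, 2.0, 3.0], out_dim=2)
```

The sketch makes the cost asymmetry visible: selection only reads the chosen slices, whereas projection touches every gradient entry once per output dimension, which is consistent with the efficiency claim in the abstract.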

[41] PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice

Yuzhen Shi,Huanghai Liu,Yiran Hu,Gaojie Song,Xinran Xu,Yubo Ma,Tianyi Tang,Li Zhang,Qingjing Chen,Di Feng,Wenbo Lv,Weiheng Wu,Kexin Yang,Sen Yang,Wei Wang,Rongyao Shi,Yuanyang Qiu,Yuemeng Qi,Jingwen Zhang,Xiaoyu Sui,Yifan Chen,Yi Zhang,An Yang,Bowen Yu,Dayiheng Liu,Junyang Lin,Weixing Shen,Bing Zhao,Charles L. A. Clarke,Hu Wei

Main category: cs.CL

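TL;DR: This paper presents PLawBench, a practical legal benchmark grounded in real-world legal practice that covers three task types (legal consultation, case analysis, and document generation), with 850 questions and about 12,500 fine-grained rubric items, for evaluating the fine-grained legal reasoning of large language models.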

Details

Motivation: Existing legal benchmarks are overly simplified and standardized, fail to reflect the ambiguity, complexity, and reasoning demands of real legal practice, and lack fine-grained evaluation of legal reasoning. Method: Build PLawBench on real legal workflows, with three task categories (public legal consultation, practical case analysis, legal document generation), 850 questions across 13 practical scenarios, and expert-designed fine-grained rubrics; evaluate 10 state-of-the-art LLMs with an LLM-based evaluator aligned with human expert judgments. Result: None of the evaluated models performs well on PLawBench, exposing substantial weaknesses in fine-grained legal reasoning. Conclusion: PLawBench offers an evaluation framework closer to legal practice, reveals the capability bottlenecks of current models, and points to directions for evaluating and developing legal LLMs. Abstract: As large language models (LLMs) are increasingly applied to legal domain-specific tasks, evaluating their ability to perform legal work in real-world settings has become essential. However, existing legal benchmarks rely on simplified and highly standardized tasks, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. Moreover, prior evaluations often adopt coarse, single-dimensional metrics and do not explicitly assess fine-grained legal reasoning. To address these limitations, we introduce PLawBench, a Practical Law Benchmark designed to evaluate LLMs in realistic legal practice scenarios. Grounded in real-world legal workflows, PLawBench models the core processes of legal practitioners through three task categories: public legal consultation, practical case analysis, and legal document generation. These tasks assess a model's ability to identify legal issues and key facts, perform structured legal reasoning, and generate legally coherent documents. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics, resulting in approximately 12,500 rubric items for fine-grained assessment. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs. Experimental results show that none achieves strong performance on PLawBench, revealing substantial limitations in the fine-grained legal reasoning capabilities of current LLMs and highlighting important directions for future evaluation and development of legal LLMs. Data is available at: https://github.com/skylenage/PLawbench.

[42] EMemBench: Interactive Benchmarking of Episodic Memory for VLM Agents

Xinze Li,Ziyue Zhu,Siyuan Liu,Yubo Ma,Yuhang Zang,Yixin Cao,Aixin Sun

Main category: cs.CL

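TL;DR: EMemBench is a programmatic benchmark for evaluating agents' long-term memory; it uses interactive games to generate questions from each agent's own trajectory, covers both text and visual environments, and spans multiple memory skills. Experiments show that current models still face significant bottlenecks in induction and spatial reasoning, especially in visual settings.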

Details

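Motivation: Existing benchmarks mostly rely on fixed question sets and cannot faithfully capture an agent's long-term memory of its own experience during dynamic interaction; an evaluation method that is automatically generated, verifiable, and covers multiple dimensions is needed. Method: EMemBench dynamically generates question templates with verifiable ground truth from an agent's trajectory in text and visual games, systematically covering single/multi-hop recall, induction, temporal, spatial, logical, and adversarial memory skills; memory agents with strong language models (LMs) or vision-language models (VLMs) as backbones are evaluated, with in-context prompting as the baseline. Result: Across 15 text games and multiple visual seeds, performance is far from saturated; induction and spatial reasoning are persistent bottlenecks, especially in visual environments; persistent memory clearly helps open backbones on text games but is less consistent for VLM agents; a human study confirms the benchmark's difficulty. Conclusion: EMemBench exposes key weaknesses in agents' long-term memory, particularly visually grounded episodic memory, and provides a reproducible, extensible, multimodal evaluation paradigm. Abstract: We introduce EMemBench, a programmatic benchmark for evaluating long-term memory of agents through interactive games. Rather than using a fixed set of questions, EMemBench generates questions from each agent's own trajectory, covering both text and visual game environments. Each template computes verifiable ground truth from underlying game signals, with controlled answerability and balanced coverage over memory skills: single/multi-hop recall, induction, temporal, spatial, logical, and adversarial. We evaluate memory agents with strong LMs/VLMs as backbones, using in-context prompting as baselines. Across 15 text games and multiple visual seeds, results are far from saturated: induction and spatial reasoning are persistent bottlenecks, especially in the visual setting. Persistent memory yields clear gains for open backbones on text games, but improvements are less consistent for VLM agents, suggesting that visually grounded episodic memory remains an open challenge. A human study further confirms the difficulty of EMemBench.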

[43] Better Generalizing to Unseen Concepts: An Evaluation Framework and An LLM-Based Auto-Labeled Pipeline for Biomedical Concept Recognition

Shanshan Liu,Noriki Nishida,Fei Cheng,Narumi Tokunaga,Rumana Ferdous Munne,Yuki Yamagata,Kouji Kozaki,Takehito Utsuro,Yuji Matsumoto

Main category: cs.CL

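TL;DR: This paper proposes an evaluation framework built on hierarchical concept indices and a method that uses large language models to generate auto-labeled data (ALD), with the aim of improving generalization to unseen concepts in biomedical concept recognition.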

Details

Motivation: Address poor generalization to unseen concepts in Mention-agnostic Biomedical Concept Recognition (MA-BCR), which results from the scarcity of human annotations. Method: Build an evaluation framework with hierarchical concept indices and new metrics; design a task-specific pipeline for generating LLM-based Auto-Labeled Data (ALD). Result: Experiments show that LLM-generated ALD cannot fully replace manual annotation but substantially improves generalization to unseen concepts by providing broader coverage and structural knowledge. Conclusion: ALD is a valuable complementary resource that strengthens MA-BCR models on unseen concepts. Abstract: Generalization to unseen concepts is a central challenge due to the scarcity of human annotations in Mention-agnostic Biomedical Concept Recognition (MA-BCR). This work makes two key contributions to systematically address this issue. First, we propose an evaluation framework built on hierarchical concept indices and novel metrics to measure generalization. Second, we explore LLM-based Auto-Labeled Data (ALD) as a scalable resource, creating a task-specific pipeline for its generation. Our research unequivocally shows that while LLM-generated ALD cannot fully substitute for manual annotations, it is a valuable resource for improving generalization, successfully providing models with the broader coverage and structural knowledge needed to approach recognizing unseen concepts. Code and datasets are available at https://github.com/bio-ie-tool/hi-ald.

[44] Mitigating Bias in Automated Grading Systems for ESL Learners: A Contrastive Learning Approach

Kevin Fan,Eric Yun

Main category: cs.CL

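TL;DR: This paper studies algorithmic bias in automated essay scoring (AES) systems against English as a Second Language (ESL) learners and finds that current Transformer-based models systematically under-score high-proficiency ESL writing; it proposes contrastive learning on matched essay pairs, which substantially narrows the scoring gap while maintaining scoring agreement.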

Details

Motivation: AES systems are widely used in high-stakes educational settings, but Transformer regression models trained mainly on native-speaker corpora tend to learn spurious associations between surface-level L2 features and essay quality, systematically under-scoring ESL learners (especially high-proficiency ones) and raising fairness concerns. Method: Analyze the bias of a DeBERTa-v3 model on the ASAP 2.0 and ELLIPSE datasets; propose Contrastive Learning with Matched Essay Pairs, constructing 17,161 matched ESL-native essay pairs and fine-tuning with Triplet Margin Loss to align the latent representations of the two kinds of writing. Result: The scoring gap between high-proficiency ESL essays and native essays falls from 10.3% to 6.2% (a 39.9% reduction) while Quadratic Weighted Kappa (QWK) stays at 0.76; post-hoc linguistic analysis suggests the model separates sentence complexity from grammatical error and avoids penalizing valid L2 syntactic structures. Conclusion: Contrastive learning can mitigate implicit bias against ESL learners in AES systems and improve fairness without sacrificing overall accuracy; the approach emphasizes aligning semantic representations rather than fitting surface features, offering a path toward fairer language assessment models. Abstract: As Automated Essay Scoring (AES) systems are increasingly used in high-stakes educational settings, concerns regarding algorithmic bias against English as a Second Language (ESL) learners have increased. Current Transformer-based regression models trained primarily on native-speaker corpora often learn spurious correlations between surface-level L2 linguistic features and essay quality. In this study, we conduct a bias study of a fine-tuned DeBERTa-v3 model using the ASAP 2.0 and ELLIPSE datasets, revealing a constrained score scaling for high-proficiency ESL writing where high-proficiency ESL essays receive scores 10.3% lower than Native speaker essays of identical human-rated quality. To mitigate this, we propose applying contrastive learning with a triplet construction strategy: Contrastive Learning with Matched Essay Pairs. We constructed a dataset of 17,161 matched essay pairs and fine-tuned the model using Triplet Margin Loss to align the latent representations of ESL and Native writing. Our approach reduced the high-proficiency scoring disparity by 39.9% (to a 6.2% gap) while maintaining a Quadratic Weighted Kappa (QWK) of 0.76. Post-hoc linguistic analysis suggests the model successfully disentangled sentence complexity from grammatical error, preventing the penalization of valid L2 syntactic structures.
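For readers unfamiliar with the Triplet Margin Loss named in this entry, the standard definition with Euclidean distance is max(0, d(a, p) - d(a, n) + margin). The sketch below implements that formula on plain lists. How the authors assign the three roles is not stated in the abstract, so the role assignment in the example (ESL essay as anchor, quality-matched native essay as positive, essay of different quality as negative), the 2-dimensional embeddings, and `margin=1.0` are assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical embeddings: an ESL essay, a quality-matched native essay,
# and an essay with a different human-rated score.
esl_essay = [0.0, 0.0]
matched_native = [0.0, 1.0]
different_quality = [3.0, 4.0]

satisfied = triplet_margin_loss(esl_essay, matched_native, different_quality)
violated = triplet_margin_loss(esl_essay, different_quality, matched_native)
```

The loss is zero once the matched pair is closer than the mismatched essay by at least the margin, so minimizing it pulls the ESL and native representations of equal-quality essays together.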

[45] Standardizing Longitudinal Radiology Report Evaluation via Large Language Model Annotation

Xinyi Wang,Grazziela Figueredo,Ruizhe Li,Xin Chen

Main category: cs.CL

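TL;DR: This paper proposes an LLM-based automated annotation pipeline for identifying and extracting longitudinal information (such as disease progression) from radiology reports. After evaluation on 500 manually annotated reports, Qwen2.5-32B was selected and used to build a large annotated dataset of 95,169 reports, which serves as a new benchmark for evaluating seven state-of-the-art report generation models; the annotation method clearly outperforms existing ones.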

Details

Motivation: Existing methods for annotating longitudinal information in radiology reports rely on manual lexicons and rules; they are labor-intensive, closed-source, and domain-specific, or so simple that they miss essential information. There is no unified, scalable, accurate automatic annotation tool to support fair evaluation of report generation models. Method: A two-stage LLM-driven annotation pipeline: stage one identifies sentences containing longitudinal information, and stage two extracts disease progression. Five mainstream LLMs are compared on 500 manually annotated reports; Qwen2.5-32B is selected and then used to annotate 95,169 reports from MIMIC-CXR, yielding a standardized benchmark dataset. Result: The LLM-based annotation method achieves F1-scores 11.3% and 5.3% higher than existing solutions on longitudinal information detection and disease tracking, respectively; evaluating 7 SOTA report generation models on the new annotations reveals differences in how well they model longitudinal relations. Conclusion: LLM-based automatic annotation extracts longitudinal information from radiology reports efficiently, accurately, and in a transferable way, and provides the first large-scale standardized longitudinal benchmark for report generation models, moving clinical NLP toward temporal modeling. Abstract: Longitudinal information in radiology reports refers to the sequential tracking of findings across multiple examinations over time, which is crucial for monitoring disease progression and guiding clinical decisions. Many recent automated radiology report generation methods are designed to capture longitudinal information; however, validating their performance is challenging. There is no proper tool to consistently label temporal changes in both ground-truth and model-generated texts for meaningful comparisons. Existing annotation methods are typically labor-intensive, relying on the use of manual lexicons and rules. Complex rules are closed-source, domain specific and hard to adapt, whereas overly simple ones tend to miss essential specialised information. Large language models (LLMs) offer a promising annotation alternative, as they are capable of capturing nuanced linguistic patterns and semantic similarities without extensive manual intervention. They also adapt well to new contexts. In this study, we therefore propose an LLM-based pipeline to automatically annotate longitudinal information in radiology reports. The pipeline first identifies sentences containing relevant information and then extracts the progression of diseases. We evaluate and compare five mainstream LLMs on these two tasks using 500 manually annotated reports. Considering both efficiency and performance, Qwen2.5-32B was subsequently selected and used to annotate another 95,169 reports from the public MIMIC-CXR dataset. Our Qwen2.5-32B-annotated dataset provided us with a standardized benchmark for evaluating report generation models. Using this new benchmark, we assessed seven state-of-the-art report generation models. Our LLM-based annotation method outperforms existing annotation solutions, achieving 11.3% and 5.3% higher F1-scores for longitudinal information detection and disease tracking, respectively.

[46] Do LLM hallucination detectors suffer from low-resource effect?

Debtanu Datta,Mohan Kishore Chilukuri,Yash Kumar,Saptarshi Ghosh,Muhammad Bilal Zafar

Main category: cs.CL

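TL;DR: This paper asks whether hallucination detectors for large language models (LLMs) also suffer from the low-resource effect. It finds that although task accuracy drops sharply in low-resource languages, detector performance drops far less; detectors are fairly robust in within-language and multilingual setups but perform poorly in cross-lingual settings without target-language supervision.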

Details

Motivation: Determine whether hallucination detectors are also subject to the low-resource effect, that is, whether their ability to detect hallucinations degrades significantly in low-resource languages. Method: Run experiments on five tasks across three domains (factual recall, STEM, and Humanities) with four LLMs and three hallucination detectors, comparing how task accuracy and detector accuracy change between high-resource languages (such as English) and low-resource languages (such as Bengali). Result: Task accuracy drops sharply in low-resource languages, but the drop in detector accuracy is often several times smaller; detectors are robust within language and in multilingual setups but perform poorly in cross-lingual settings without target-language supervision. Conclusion: LLMs may still encode uncertainty signals internally in low-resource languages, which lets detectors stay robust there, but in-language supervision is needed for them to work well. Abstract: LLMs, while outperforming humans in a wide range of tasks, can still fail in unanticipated ways. We focus on two pervasive failure modes: (i) hallucinations, where models produce incorrect information about the world, and (ii) the low-resource effect, where the models show impressive performance in high-resource languages like English but the performance degrades significantly in low-resource languages like Bengali. We study the intersection of these issues and ask: do hallucination detectors suffer from the low-resource effect? We conduct experiments on five tasks across three domains (factual recall, STEM, and Humanities). Experiments with four LLMs and three hallucination detectors reveal a curious finding: As expected, the task accuracies in low-resource languages experience large drops (compared to English). However, the drop in detectors' accuracy is often several times smaller than the drop in task accuracy. Our findings suggest that even in low-resource languages, the internal mechanisms of LLMs might encode signals about their uncertainty. Further, the detectors are robust within language (even for non-English) and in multilingual setups, but not in cross-lingual settings without in-language supervision.

[47] Persuasion Tokens for Editing Factual Knowledge in LLMs

Paul Youssef,Jörg Schlötterer,Christin Seifert

Main category: cs.CL

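TL;DR: This paper proposes P-Tokens, a new method that edits LLM knowledge efficiently without fact-specific demonstrations, matching or even exceeding existing in-context knowledge editing (IKE).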

Details

Motivation: Existing in-context knowledge editing (IKE) depends on long, fact-specific demonstrations that are costly to build and take up a large share of the context window. Method: Train special tokens (P-Tokens) to reproduce the effect of IKE demonstrations, so that knowledge can be edited without fact-specific demonstrations. Result: On two editing datasets and three LLMs, P-Tokens match and often exceed IKE; edits are robust to distractors, with small negative effects on neighboring facts; increasing the number of P-Tokens improves performance. Conclusion: P-Tokens address key limitations of IKE and offer a more practical, scalable alternative for editing LLM knowledge. Abstract: In-context knowledge editing (IKE) is a promising technique for updating Large Language Models (LLMs) with new information. However, IKE relies on lengthy, fact-specific demonstrations which are costly to create and consume significant context window space. In this paper, we introduce persuasion tokens (P-Tokens) -- special tokens trained to replicate the effect of IKE demonstrations, enabling efficient knowledge editing without requiring fact-specific demonstrations. We evaluate P-Tokens across two editing datasets and three LLMs, demonstrating performance comparable to, and often exceeding, IKE. We further find that editing performance is robust to distractors with small negative effects to neighboring facts, and that increasing the number of P-Tokens improves performance. Our work addresses key limitations of IKE and provides a more practical and scalable alternative for editing LLMs.

[48] Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis

Gaurav Negi,MA Waskow,Paul Buitelaar

Main category: cs.CL

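TL;DR: This paper explores large language models (LLMs) as automatic annotators to address the scarcity of domain-specific labeled data and the high cost of manual annotation in fine-grained opinion analysis (such as the ASTE and ACOS tasks); it proposes a declarative annotation pipeline and a new LLM-based label adjudication method, and experiments show that LLMs reach high annotation agreement and substantially reduce dataset construction cost.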

Details

Motivation: Fine-grained opinion analysis requires large amounts of manual annotation, and building high-quality annotated datasets across domains is expensive and slow, so automated annotation is needed. Method: Use a declarative annotation pipeline to reduce the variability of manual prompt engineering, and design a new method in which an LLM adjudicates among multiple candidate labels to produce the final annotation; test LLMs of different sizes on the ASTE and ACOS tasks. Result: As automatic annotators and adjudicators, LLMs achieve high Inter-Annotator Agreement (IAA) across models, supporting the feasibility of automatic annotation for fine-grained opinion analysis. Conclusion: LLMs can take over much of the annotation work for fine-grained opinion analysis, reducing human effort and cost and offering a practical route to annotated datasets across domains. Abstract: Fine-grained opinion analysis of text provides a detailed understanding of expressed sentiments, including the addressed entity. Although this level of detail is sound, it requires considerable human effort and substantial cost to annotate opinions in datasets for training models, especially across diverse domains and real-world applications. We explore the feasibility of LLMs as automatic annotators for fine-grained opinion analysis, addressing the shortage of domain-specific labelled datasets. In this work, we use a declarative annotation pipeline. This approach reduces the variability of manual prompt engineering when using LLMs to identify fine-grained opinion spans in text. We also present a novel methodology for an LLM to adjudicate multiple labels and produce final annotations. After trialling the pipeline with models of different sizes for the Aspect Sentiment Triplet Extraction (ASTE) and Aspect-Category-Opinion-Sentiment (ACOS) analysis tasks, we show that LLMs can serve as automatic annotators and adjudicators, achieving high Inter-Annotator Agreement across individual LLM-based annotators. This reduces the cost and human effort needed to create these fine-grained opinion-annotated datasets.

[49] SoS: Analysis of Surface over Semantics in Multilingual Text-To-Image Generation

Carolin Holtermann,Florian Schneider,Anne Lauscher

Main category: cs.CL

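TL;DR: This paper presents the first systematic analysis of the Surface-over-Semantics (SoS) tendency in text-to-image (T2I) models, in which a model reacts to the surface language form of a non-English prompt rather than its semantics and produces culturally stereotypical images. The authors build a prompt set covering 171 cultural identities in 14 languages, test 7 mainstream T2I models, introduce a new quantitative measure, and find that the SoS effect intensifies across text encoder layers and frequently correlates with visual stereotypes.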

Details

Motivation: Prior work has shown that T2I models are highly sensitive to non-English prompts and prone to culturally stereotypical images, but a systematic analysis of this Surface-over-Semantics (SoS) behavior is missing. Method: Build a prompt dataset covering 171 cultural identities, translated into 14 languages, and use it to test 7 T2I models; introduce a new quantitative measure of SoS tendency and combine it with visual analysis of how the tendency appears across text encoder layers and how it relates to stereotypical imagery. Result: All but one model show a strong SoS tendency in at least two languages; the effect intensifies across the layers of the text encoder; the tendency frequently correlates with stereotypical visual depictions. Conclusion: SoS is widespread in T2I models and reflects over-reliance on the surface form of the input language rather than its semantics; improvements in model design and training data are needed for cross-cultural fairness and robustness. Abstract: Text-to-image (T2I) models are increasingly employed by users worldwide. However, prior research has pointed to the high sensitivity of T2I towards particular input languages - when faced with languages other than English (i.e., different surface forms of the same prompt), T2I models often produce culturally stereotypical depictions, prioritizing the surface over the prompt's semantics. Yet a comprehensive analysis of this behavior, which we dub Surface-over-Semantics (SoS), is missing. We present the first analysis of T2I models' SoS tendencies. To this end, we create a set of prompts covering 171 cultural identities, translated into 14 languages, and use it to prompt seven T2I models. To quantify SoS tendencies across models, languages, and cultures, we introduce a novel measure and analyze how the tendencies we identify manifest visually. We show that all but one model exhibit strong surface-level tendency in at least two languages, with this effect intensifying across the layers of T2I text encoders. Moreover, these surface tendencies frequently correlate with stereotypical visual depictions.

[50] Trapped in the past? Disentangling fluid and crystallized intelligence of large language models using chess

Leonard S. Pleiss,Maximilian Schiffer,Robert K. von Weizsäcker

Main category: cs.CL

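TL;DR: This paper uses chess as a controlled testbed to separate the crystallized intelligence (recall) of large language models (LLMs) from their fluid intelligence (reasoning). Performance degrades systematically as reasoning demands grow and collapses to random levels on out-of-distribution tasks, indicating fundamental limits of current architectures in systematic generalization and robust fluid intelligence.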

Details

Motivation: Clarifying whether LLM capabilities stem from recall (crystallized intelligence) or genuine reasoning (fluid intelligence) requires a test environment that is well structured, quantifiable, and allows control over distribution shift. Method: Use chess as the testbed; build a taxonomy of positions by proximity to the training corpus (from common, memorizable positions to novel ones requiring first-principles reasoning); combine engine evaluations with a systematic assessment of multiple GPT generations under varying reasoning intensities. Result: Performance declines along a clear gradient as fluid intelligence demands rise; on out-of-distribution tasks it collapses to random levels; newer models bring limited gains there; reasoning-augmented inference helps, but its marginal benefit per token decreases with distributional proximity. Conclusion: Current LLM architectures face a fundamental bottleneck in systematic generalization; scale alone is unlikely to yield robust fluid intelligence, and mechanisms beyond scale are needed. Abstract: Large Language Models (LLMs) exhibit remarkable capabilities, yet it remains unclear to what extent these reflect sophisticated recall (crystallized intelligence) or reasoning ability (fluid intelligence). We introduce chess as a controlled testbed for disentangling these faculties. Leveraging the game's structure and scalable engine evaluations, we construct a taxonomy of positions varying in training corpus proximity--ranging from common states solvable by memorization to novel ones requiring first-principles reasoning. We systematically evaluate multiple GPT generations under varying reasoning intensities. Our analysis reveals a clear gradient: performance consistently degrades as fluid intelligence demands increase. Notably, in out-of-distribution tasks, performance collapses to random levels. While newer models improve, progress slows significantly for tasks outside the training distribution. Furthermore, while reasoning-augmented inference improves performance, its marginal benefit per token decreases with distributional proximity. These results suggest current architectures remain limited in systematic generalization, highlighting the need for mechanisms beyond scale to achieve robust fluid intelligence.

[51] LLM-Based Adversarial Persuasion Attacks on Fact-Checking Systems

João A. Leite,Olesya Razuvayevskaya,Kalina Bontcheva,Carolina Scarton

Main category: cs.CL

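TL;DR: This paper proposes a new way to generate adversarial attacks with persuasion techniques: a large language model rewrites claims to deceive automated fact-checking systems, and experiments show that this substantially degrades both verification and evidence retrieval.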

Details

Motivation: Existing adversarial attack frameworks do not exploit persuasion techniques, which are widely used to spread disinformation and may pose a new threat to automated fact-checking systems. Method: Use a generative LLM to rephrase claims with 15 persuasion techniques grouped into 6 categories, and apply a decoupled evaluation strategy to analyze the effect on claim verification and evidence retrieval. Result: Experiments on the FEVER and FEVEROUS benchmarks show that persuasive adversarial attacks substantially reduce verification accuracy and evidence retrieval quality. Conclusion: Persuasion techniques are a potent class of adversarial attacks, underscoring the need for more robust automated fact-checking systems. Abstract: Automated fact-checking (AFC) systems are susceptible to adversarial attacks, enabling false claims to evade detection. Existing adversarial frameworks typically rely on injecting noise or altering semantics, yet no existing framework exploits the adversarial potential of persuasion techniques, which are widely used in disinformation campaigns to manipulate audiences. In this paper, we introduce a novel class of persuasive adversarial attacks on AFCs by employing a generative LLM to rephrase claims using persuasion techniques. Considering 15 techniques grouped into 6 categories, we study the effects of persuasion on both claim verification and evidence retrieval using a decoupled evaluation strategy. Experiments on the FEVER and FEVEROUS benchmarks show that persuasion attacks can substantially degrade both verification performance and evidence retrieval. Our analysis identifies persuasion techniques as a potent class of adversarial attacks, highlighting the need for more robust AFC systems.

[52] Information Representation Fairness in Long-Document Embeddings: The Peculiar Interaction of Positional and Language Bias

Elias Schuhmacher,Andrianos Michail,Juri Opitz,Rico Sennrich,Simon Clematide

Main category: cs.CL

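TL;DR: This paper introduces a permutation-based evaluation framework, shows that current embedding models exhibit positional and language biases on long and multilingual documents, and proposes an inference-time attention calibration method to mitigate the problem.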

Details

Motivation: For every part of a document to be fairly reflected in its embedding, potential reflection biases in embedding models need to be quantified. Method: Introduce a permutation-based evaluation framework and analyze attention distributions; design an inference-time attention calibration method that evens out attention across positions. Result: State-of-the-art embedding models show systematic positional bias (early segments are over-represented) and language bias (higher-resource languages such as English are over-represented); the proposed attention calibration increases the discoverability of later segments. Conclusion: Positional and language biases are key problems for embedding models in long-document and multilingual retrieval, and attention calibration is an effective, training-free mitigation. Abstract: To be discoverable in an embedding-based search process, each part of a document should be reflected in its embedding representation. To quantify any potential reflection biases, we introduce a permutation-based evaluation framework. With this, we observe that state-of-the-art embedding models exhibit systematic positional and language biases when documents are longer and consist of multiple segments. Specifically, early segments and segments in higher-resource languages like English are over-represented, while later segments and segments in lower-resource languages are marginalized. In our further analysis, we find that the positional bias stems from front-loaded attention distributions in pooling-token embeddings, where early tokens receive more attention. To mitigate this issue, we introduce an inference-time attention calibration method that redistributes attention more evenly across document positions, increasing discoverability of later segments. Our evaluation framework and attention calibration are available at https://github.com/impresso/fair-sentence-transformers
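A permutation-based check for positional bias can be sketched as follows. The abstract does not give the framework's exact metric, so the averaging scheme, the function names, and especially the toy embedder (which hard-codes a front-loaded 1/(position+1) weighting to mimic the reported bias) are assumptions for illustration only.

```python
import itertools


def dot(u, v):
    """Inner product used as the similarity between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


def toy_embed_doc(segment_vecs):
    """Toy stand-in for a document embedder with a built-in front-loading bias."""
    dim = len(segment_vecs[0])
    doc = [0.0] * dim
    for pos, vec in enumerate(segment_vecs):
        weight = 1.0 / (pos + 1)  # earlier segments contribute more
        for i in range(dim):
            doc[i] += weight * vec[i]
    return doc


def positional_similarity(segment_vecs, embed_doc):
    """Average document-to-segment similarity per position over all orderings."""
    n = len(segment_vecs)
    totals = [0.0] * n
    count = 0
    for perm in itertools.permutations(segment_vecs):
        doc_vec = embed_doc(list(perm))
        for pos, seg in enumerate(perm):
            totals[pos] += dot(doc_vec, seg)
        count += 1
    return [t / count for t in totals]


# Three orthogonal toy segment embeddings.
segments = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
by_position = positional_similarity(segments, toy_embed_doc)
```

Because every segment visits every position equally often, content effects average out and any remaining slope in `by_position` is attributable to position alone. Enumerating all orderings is only feasible for a handful of segments; sampling permutations would be the natural substitute for longer documents.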

[53] Strategies for Span Labeling with Large Language Models

Danil Semin,Ondřej Dušek,Zdeněk Kasner

Main category: cs.CL

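TL;DR: This paper proposes LogitMatch, a constrained decoding method that improves large language models on span labeling by forcing the output to align with valid spans in the input, which resolves the span matching problems of content-matching strategies.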

Details

Motivation: Large language models (LLMs) lack an explicit mechanism for referring to specific parts of their input, which leads to ad-hoc prompting strategies for span labeling and inconsistent results. Method: Categorize existing span labeling prompting strategies into three families (tagging the input text, indexing numerical positions, matching span content) and propose LogitMatch, a logit-constrained decoding method that ensures the generated output corresponds to valid input spans. Result: Across four diverse tasks, LogitMatch improves on competitive matching-based methods by eliminating span matching issues and outperforms other strategies in some setups, while tagging remains a robust baseline. Conclusion: LogitMatch offers a more reliable constrained decoding option for span labeling with generative models, compensating for their lack of an explicit input-referencing mechanism. Abstract: Large language models (LLMs) are increasingly used for text analysis tasks, such as named entity recognition or error detection. Unlike encoder-based models, however, generative architectures lack an explicit mechanism to refer to specific parts of their input. This leads to a variety of ad-hoc prompting strategies for span labeling, often with inconsistent results. In this paper, we categorize these strategies into three families: tagging the input text, indexing numerical positions of spans, and matching span content. To address the limitations of content matching, we introduce LogitMatch, a new constrained decoding method that forces the model's output to align with valid input spans. We evaluate all methods across four diverse tasks. We find that while tagging remains a robust baseline, LogitMatch improves upon competitive matching-based methods by eliminating span matching issues and outperforms other strategies in some setups.
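The general idea of constraining decoding so that output stays aligned with input spans can be shown with a generic logit-masking sketch. This is not the LogitMatch algorithm itself, whose details are not given in the abstract; the word-level tokens, the dictionary of logits, the `<end>` marker, and both function names are hypothetical.

```python
def allowed_next_tokens(input_tokens, prefix):
    """Tokens that keep `prefix` a contiguous span of `input_tokens`."""
    if not prefix:
        return set(input_tokens)  # a span may start at any input token
    n = len(prefix)
    allowed = set()
    for start in range(len(input_tokens) - n):
        if input_tokens[start:start + n] == prefix:
            allowed.add(input_tokens[start + n])
    return allowed


def constrained_greedy_step(logits, input_tokens, prefix, end_token="<end>"):
    """Pick the highest-scoring token among those that keep the span valid.

    logits: {token: score}; tokens missing from it are treated as -inf.
    Assumes a non-empty input. A non-empty span may also stop via end_token.
    """
    allowed = allowed_next_tokens(input_tokens, prefix)
    if prefix:
        allowed.add(end_token)
    return max(allowed, key=lambda tok: logits.get(tok, float("-inf")))


# The unconstrained model prefers "dog", which never follows "the" in the input.
source = ["the", "cat", "sat", "on", "the", "mat"]
scores = {"dog": 5.0, "cat": 1.0, "mat": 0.5, "<end>": 0.0}
next_token = constrained_greedy_step(scores, source, ["the"])
```

Masking at each step guarantees that whatever the model emits is a verbatim substring of the input, which removes the post-hoc string matching that content-matching strategies rely on.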

cs.CV

[54] GR3EN: Generative Relighting for 3D Environments

Xiaoyan Xing,Philipp Henzler,Junhwa Hur,Runze Li,Jonathan T. Barron,Pratul P. Srinivasan,Dor Verbin

Main category: cs.CV

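TL;DR: This paper proposes a room-scale 3D scene relighting method based on distilling a video-to-video diffusion model, side-stepping the difficult inverse rendering problem and enabling high-quality, controllable 3D relighting of complex real-world scenes.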

Details

Motivation: Existing 3D scene relighting methods often face under-determined or ill-conditioned inverse rendering problems and struggle to produce high-quality results on complex real-world scenes, while diffusion-based methods are mostly limited to 2D images and videos or to 3D relighting of individual objects. Method: Distill the outputs of a video-to-video relighting diffusion model into a 3D reconstruction to obtain controllable relighting of room-scale 3D scenes. Result: Experiments on synthetic and real-world datasets show that the method faithfully renders novel views under new lighting conditions. Conclusion: The method avoids solving the inverse rendering problem directly and provides a flexible, high-quality solution for room-scale 3D relighting. Abstract: We present a method for relighting 3D reconstructions of large room-scale environments. Existing solutions for 3D scene relighting often require solving under-determined or ill-conditioned inverse rendering problems, and are as such unable to produce high-quality results on complex real-world scenes. Though recent progress in using generative image and video diffusion models for relighting has been promising, these techniques are either limited to 2D image and video relighting or 3D relighting of individual objects. Our approach enables controllable 3D relighting of room-scale scenes by distilling the outputs of a video-to-video relighting diffusion model into a 3D reconstruction. This side-steps the need to solve a difficult inverse rendering problem, and results in a flexible system that can relight 3D reconstructions of complex real-world scenes. We validate our approach on both synthetic and real-world datasets to show that it can faithfully render novel views of scenes under new lighting conditions.

[55] Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory

Dohun Lee,Chun-Hao Paul Huang,Xuelin Chen,Jong Chul Ye,Duygu Ceylan,Hyeonho Jeong

Main category: cs.CV

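TL;DR: This paper proposes Memory-V2V, the first framework to tackle cross-consistency in multi-turn video editing. It adds explicit memory (an external cache, accurate retrieval, and dynamic tokenization) and a learnable token compressor, substantially improving consistency and achieving a 30% speedup while maintaining or even improving task performance.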

Details

Motivation: Real-world video editing is an iterative, multi-turn process, yet existing video-to-video diffusion models struggle to stay consistent across successive edits. Method: The Memory-V2V framework (1) stores previous editing results in an external cache; (2) uses accurate retrieval and dynamic tokenization to condition the current edit on prior results; and (3) embeds a learnable token compressor in the DiT backbone that compresses redundant conditioning tokens while preserving essential visual cues. Result: The method is validated on video novel view synthesis and text-conditioned long video editing; compared with state-of-the-art baselines it significantly improves cross-consistency with minimal computational overhead and achieves a 30% overall speedup. Conclusion: Memory-V2V is the first systematic treatment of cross-consistency in multi-turn video editing; its lightweight memory-augmented design is effective, efficient, and general, offering a new paradigm for interactive video editing. Abstract: Recent foundational video-to-video diffusion models have achieved impressive results in editing user-provided videos by modifying appearance, motion, or camera movement. However, real-world video editing is often an iterative process, where users refine results across multiple rounds of interaction. In this multi-turn setting, current video editors struggle to maintain cross-consistency across sequential edits. In this work, we tackle, for the first time, the problem of cross-consistency in multi-turn video editing and introduce Memory-V2V, a simple, yet effective framework that augments existing video-to-video models with explicit memory. Given an external cache of previously edited videos, Memory-V2V employs accurate retrieval and dynamic tokenization strategies to condition the current editing step on prior results. To further mitigate redundancy and computational overhead, we propose a learnable token compressor within the DiT backbone that compresses redundant conditioning tokens while preserving essential visual cues, achieving an overall speedup of 30%. We validate Memory-V2V on challenging tasks including video novel view synthesis and text-conditioned long video editing. Extensive experiments show that Memory-V2V produces videos that are significantly more cross-consistent with minimal computational overhead, while maintaining or even improving task-specific performance over state-of-the-art baselines. Project page: https://dohunlee1.github.io/MemoryV2V

[56] FeTTL: Federated Template and Task Learning for Multi-Institutional Medical Imaging

Abhijeet Parida,Antonia Alomar,Zhifan Jiang,Pooneh Roshanitabrizi,Austin Tapp,Ziyue Xu,Syed Muhammad Anwar,Maria J. Ledesma-Carbayo,Holger R. Roth,Marius George Linguraru

Main category: cs.CV

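TL;DR: This paper proposes FeTTL, a new framework that jointly learns a global template and a task model to mitigate distribution shift in multi-institutional medical imaging data under federated learning, and that significantly outperforms existing methods on optic disc segmentation and histopathological metastasis classification.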

Details

Motivation: In healthcare, federated learning suffers performance degradation from domain shift and data heterogeneity; medical imaging is hit especially hard by differences in acquisition protocols, scanner types, and patient populations. Method: The Federated Template and Task Learning (FeTTL) framework jointly learns a global template and a task model in a federated setting to align data distributions across clients. Result: On two multi-institutional medical imaging tasks, retinal fundus optic disc segmentation and histopathological metastasis classification, FeTTL significantly outperforms state-of-the-art federated learning baselines (p-values <0.002), and the experiments confirm the importance of learning the template and the task jointly. Conclusion: FeTTL offers a principled and extensible solution to distribution shift in federated learning and supports robust model deployment in real-world multi-institutional medical settings. Abstract: Federated learning enables collaborative model training across geographically distributed medical centers while preserving data privacy. However, domain shifts and heterogeneity in data often lead to a degradation in model performance. Medical imaging applications are particularly affected by variations in acquisition protocols, scanner types, and patient populations. To address these issues, we introduce Federated Template and Task Learning (FeTTL), a novel framework designed to harmonize multi-institutional medical imaging data in federated environments. FeTTL learns a global template together with a task model to align data distributions among clients. We evaluated FeTTL on two challenging and diverse multi-institutional medical imaging tasks: retinal fundus optic disc segmentation and histopathological metastasis classification. Experimental results show that FeTTL significantly outperforms the state-of-the-art federated learning baselines (p-values <0.002) for optic disc segmentation and classification of metastases from multi-institutional data. Our experiments further highlight the importance of jointly learning the template and the task. These findings suggest that FeTTL offers a principled and extensible solution for mitigating distribution shifts in federated learning, supporting robust model deployment in real-world, multi-institutional environments.

[57] Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments

Aditya K Surikuchi,Raquel Fernández,Sandro Pezzelle

Main category: cs.CV

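TL;DR: This paper studies the ability of foundation models to identify important sub-events in videos, focusing on football games. It builds a new dataset from the importance judgments implicit in human-made highlight reels and finds that state-of-the-art multimodal models perform close to chance, tend to rely on a single dominant modality, and struggle to fuse information from multiple sources.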

Details

Motivation: Identifying the most important sub-events in a video is a basic prerequisite for narrating or summarizing multimodal events, but how well existing models handle this task is unclear. Method: A new dataset is built from the human importance preferences implicit in football game highlight reels, at no additional annotation cost; several state-of-the-art multimodal models are evaluated on it and analyzed beyond standard metrics. Result: State-of-the-art multimodal models perform only slightly above chance on this task; they tend to rely on a single dominant modality and fail to integrate multimodal information effectively. Conclusion: Modular architectures are needed to handle sample-level heterogeneity in multimodal data, along with complementary training procedures that maximize cross-modal synergy. Abstract: Foundation models are used for many real-world applications involving language generation from temporally-ordered multimodal events. In this work, we study the ability of models to identify the most important sub-events in a video, which is a fundamental prerequisite for narrating or summarizing multimodal events. Specifically, we focus on football games and evaluate models on their ability to distinguish between important and non-important sub-events in a game. To this end, we construct a new dataset by leveraging human preferences for importance implicit in football game highlight reels, without any additional annotation costs. Using our dataset, which we will publicly release to the community, we compare several state-of-the-art multimodal models and show that they are not far from chance-level performance. Analyses of models beyond standard evaluation metrics reveal their tendency to rely on a single dominant modality and their ineffectiveness in synthesizing necessary information from multiple sources. Our findings underline the importance of modular architectures that can handle sample-level heterogeneity in multimodal data and the need for complementary training procedures that can maximize cross-modal synergy.

Aline Sindel,Andreas Maier,Vincent Christlein

Main category: cs.CV

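TL;DR: This paper proposes a coarse-to-fine, non-rigid multi-modal image registration method based on craquelure (fine crack pattern) features for the art technological analysis of historical panel paintings, significantly improving registration accuracy and efficiency.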

Details

Motivation: Multi-modal images of historical panel paintings (such as visible light, infrared, and X-ray images) must be aligned pixel-wise for comprehensive analysis, but manual registration is laborious, and automatic registration is hampered by differing resolutions, very large images, non-rigid distortions, and modality-dependent content. Method: A one-stage non-rigid multi-modal registration method is proposed: a CNN jointly detects and describes sparse craquelure-based keypoints, a graph neural network (GNN) matches descriptors patch by patch, and matches are filtered by local homography reprojection error; a multi-level keypoint refinement strategy enables coarse-to-fine registration of mixed-resolution images. Result: On a newly created multi-modal panel painting dataset with a high number of keypoint annotations (five multi-modal domains, varying resolutions), the method outperforms competing keypoint-based and dense matching methods, and an ablation study confirms the effectiveness of each module. Conclusion: The craquelure-based keypoint registration framework solves non-rigid registration of large multi-modal images in art technology efficiently and accurately, providing a reliable automated tool for the digital analysis of cultural heritage. Abstract: Art technological investigations of historical panel paintings rely on acquiring multi-modal image data, including visual light photography, infrared reflectography, ultraviolet fluorescence photography, x-radiography, and macro photography. For a comprehensive analysis, the multi-modal images require pixel-wise alignment, which is still often performed manually. Multi-modal image registration can reduce this laborious manual work, is substantially faster, and enables higher precision. Due to varying image resolutions, huge image sizes, non-rigid distortions, and modality-dependent image content, registration is challenging. Therefore, we propose a coarse-to-fine non-rigid multi-modal registration method efficiently relying on sparse keypoints and thin-plate-splines. Historical paintings exhibit a fine crack pattern, called craquelure, on the paint layer, which is captured by all image systems and is well-suited as a feature for registration. In our one-stage non-rigid registration approach, we employ a convolutional neural network for joint keypoint detection and description based on the craquelure and a graph neural network for descriptor matching in a patch-based manner, and filter matches based on homography reprojection errors in local areas. For coarse-to-fine registration, we introduce a novel multi-level keypoint refinement approach to register mixed-resolution images up to the highest resolution. We created a multi-modal dataset of panel paintings with a high number of keypoint annotations, and a large test set comprising five multi-modal domains and varying image resolutions. The ablation study demonstrates the effectiveness of all modules of our refinement method. Our proposed approaches achieve the best registration results compared to competing keypoint and dense matching methods and refinement methods.
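
The match-filtering step mentioned in the abstract (discarding matches by homography reprojection error in local areas) can be illustrated with a small sketch. This is not the authors' code: it assumes the local homography `H` has already been estimated elsewhere, and the function names and the 3-pixel threshold are hypothetical choices made for illustration.

```python
import math


def project(H, point):
    """Apply a 3x3 homography H (nested lists) to a 2D point (x, y)."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)  # back from homogeneous coordinates


def filter_matches(H, matches, max_error=3.0):
    """Keep (src, dst) matches whose reprojection error is at most max_error pixels."""
    kept = []
    for src, dst in matches:
        px, py = project(H, src)
        error = math.hypot(px - dst[0], py - dst[1])
        if error <= max_error:
            kept.append((src, dst))
    return kept


# Toy homography: a pure translation by (10, 5) pixels.
H = [[1.0, 0.0, 10.0], [0.0, 1.0, 5.0], [0.0, 0.0, 1.0]]
matches = [((0.0, 0.0), (10.0, 5.0)),    # exact match
           ((1.0, 1.0), (11.0, 6.5)),    # 0.5 px off
           ((2.0, 2.0), (40.0, 40.0))]   # gross outlier
inliers = filter_matches(H, matches)
```

Applying the filter per local area rather than with one global homography is what lets such a check tolerate the non-rigid distortions the abstract describes.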

[59] Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models

Bridget Leonard,Scott O. Murray

Main category: cs.CV

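TL;DR: This paper proposes perspective tokens, which embed orientation representations grounded in human spatial cognition (body keypoints or abstract rotation representations) into multimodal language models. They substantially improve performance on level-2 visual perspective-taking tasks and indicate that existing models already have rudimentary allocentric spatial reasoning abilities but lack the appropriate internal structure.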

Details

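Motivation: Existing multimodal language models perform well on semantic vision-language tasks but fail at spatial reasoning that requires adopting another agent's visual perspective; this persistent egocentric bias raises doubts about whether they support allocentric reasoning. Method: Two kinds of perspective tokens are introduced: (1) pose cues based on embodied body keypoints and (2) abstract representations that support mental rotation. The tokens are integrated into LLaVA-1.5-13B and evaluated on synthetic and naturalistic benchmarks (Isle Bricks V2, COCO, 3DSRBench), complemented by representational analyses of changes in the model's latent orientation sensitivity. Result: Perspective tokens markedly improve accuracy on level-2 visual perspective-taking tasks, and rotation-based tokens in particular generalize to non-human reference agents; representational analyses show that fine-tuning enhances latent orientation sensitivity already present in the base model. Conclusion: Multimodal language models already contain precursors of allocentric spatial reasoning, but these must be activated by explicitly embedding cognitively grounded spatial structure, such as perspective tokens; the method is lightweight and model-agnostic and opens a path toward more human-like spatial reasoning. Abstract: Multimodal language models (MLMs) perform well on semantic vision-language tasks but fail at spatial reasoning that requires adopting another agent's visual perspective. These errors reflect a persistent egocentric bias and raise questions about whether current models support allocentric reasoning. Inspired by human spatial cognition, we introduce perspective tokens, specialized embeddings that encode orientation through either (1) embodied body-keypoint cues or (2) abstract representations supporting mental rotation. Integrating these tokens into LLaVA-1.5-13B improves performance on level-2 visual perspective-taking tasks. Across synthetic and naturalistic benchmarks (Isle Bricks V2, COCO, 3DSRBench), perspective tokens improve accuracy, with rotation-based tokens generalizing to non-human reference agents. Representational analyses reveal that fine-tuning enhances latent orientation sensitivity already present in the base model, suggesting that MLMs contain precursors of allocentric reasoning but lack appropriate internal structure. Overall, embedding cognitively grounded spatial structure directly into token space provides a lightweight, model-agnostic mechanism for perspective-taking and more human-like spatial reasoning.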

[60] VTFusion: A Vision-Text Multimodal Fusion Network for Few-Shot Anomaly Detection

Yuxin Jiang,Yunkang Cao,Yuqi Cheng,Yiheng Zhang,Weiming Shen

Main category: cs.CV

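TL;DR: This paper proposes VTFusion, a framework designed for few-shot anomaly detection (FSAD). Using adaptive feature extractors and a multimodal prediction fusion module, it addresses vision-text semantic misalignment and the domain gap in industrial inspection, and achieves state-of-the-art performance on several datasets.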

Details

Motivation: Existing FSAD methods rely on features pre-trained on natural scenes and neglect the fine-grained, domain-specific semantics that industrial inspection requires; common vision-text fusion schemes such as simple concatenation also fail to resolve semantic misalignment between modalities, leaving them vulnerable to cross-modal interference. Method: The VTFusion framework (1) introduces adaptive feature extractors for the image and text modalities that learn task-specific representations, augmented with synthetic anomalies to sharpen discriminability; and (2) builds a dedicated multimodal prediction fusion module, consisting of a fusion block that promotes cross-modal information exchange and a segmentation network that generates pixel-level anomaly maps under multimodal guidance. Result: In the 2-shot setting, image-level AUROC reaches 96.8% on MVTec AD and 86.2% on VisA; on a newly introduced real-world industrial dataset of automotive plastic parts, AUPRO reaches 93.5%. Conclusion: VTFusion bridges the domain gap between pre-trained models and industrial data and improves the semantic consistency of vision-text fusion, markedly strengthening the practicality and robustness of FSAD in real industrial scenarios. Abstract: Few-Shot Anomaly Detection (FSAD) has emerged as a critical paradigm for identifying irregularities using scarce normal references. While recent methods have integrated textual semantics to complement visual data, they predominantly rely on features pre-trained on natural scenes, thereby neglecting the granular, domain-specific semantics essential for industrial inspection. Furthermore, prevalent fusion strategies often resort to superficial concatenation, failing to address the inherent semantic misalignment between visual and textual modalities, which compromises robustness against cross-modal interference. To bridge these gaps, this study proposes VTFusion, a vision-text multimodal fusion framework tailored for FSAD. The framework rests on two core designs. First, adaptive feature extractors for both image and text modalities are introduced to learn task-specific representations, bridging the domain gap between pre-trained models and industrial data; this is further augmented by generating diverse synthetic anomalies to enhance feature discriminability. Second, a dedicated multimodal prediction fusion module is developed, comprising a fusion block that facilitates rich cross-modal information exchange and a segmentation network that generates refined pixel-level anomaly maps under multimodal guidance. VTFusion significantly advances FSAD performance, achieving image-level AUROCs of 96.8% and 86.2% in the 2-shot scenario on the MVTec AD and VisA datasets, respectively. Furthermore, VTFusion achieves an AUPRO of 93.5% on a real-world dataset of industrial automotive plastic parts introduced in this paper, further demonstrating its practical applicability in demanding industrial scenarios.

[61] ResAgent: Entropy-based Prior Point Discovery and Visual Reasoning for Referring Expression Segmentation

Yihao Wang,Jusheng Zhang,Ziyi Tang,Keze Wang,Meng Yang

Main category: cs.CV

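TL;DR: This paper proposes a new Referring Expression Segmentation (RES) framework that uses Entropy-Based Point Discovery (EBD) and Vision-Based Reasoning (VBR) to select highly informative prompt points within coarse bounding boxes and verify them through joint visual-semantic alignment, significantly improving segmentation accuracy.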

Details

Motivation: Existing RES methods based on multimodal large language models (MLLMs) have two key problems: the coarse bounding boxes produced by MLLMs lead to redundant or non-discriminative point prompts, and reliance on textual coordinate reasoning is unreliable because it cannot separate targets from visually similar distractors. Method: The proposed RES framework, ResAgent, combines Entropy-Based Point Discovery (EBD) and Vision-Based Reasoning (VBR): EBD models spatial uncertainty inside the coarse box to select high-information points; VBR verifies point correctness through visual-semantic alignment instead of text-only coordinate inference; the overall workflow is coarse-to-fine: box initialization, entropy-guided point discovery, vision-based validation, and mask decoding. Result: The method sets a new state of the art on all four benchmarks (RefCOCO, RefCOCO+, RefCOCOg, and ReasonSeg), producing more accurate and better semantically grounded segmentation masks with fewer prompts. Conclusion: EBD and VBR together overcome the bottlenecks of current RES methods in point-prompt quality and reasoning robustness, demonstrating the value of vision-led validation and information-driven point selection for RES. Abstract: Referring Expression Segmentation (RES) is a core vision-language segmentation task that enables pixel-level understanding of targets via free-form linguistic expressions, supporting critical applications such as human-robot interaction and augmented reality. Despite the progress of Multimodal Large Language Model (MLLM)-based approaches, existing RES methods still suffer from two key limitations: first, the coarse bounding boxes from MLLMs lead to redundant or non-discriminative point prompts; second, the prevalent reliance on textual coordinate reasoning is unreliable, as it fails to distinguish targets from visually similar distractors. To address these issues, we propose ResAgent, a novel RES framework integrating Entropy-Based Point Discovery (EBD) and Vision-Based Reasoning (VBR). Specifically, EBD identifies high-information candidate points by modeling spatial uncertainty within coarse bounding boxes, treating point selection as an information maximization process. VBR verifies point correctness through joint visual-semantic alignment, abandoning text-only coordinate inference for more robust validation. Built on these components, ResAgent implements a coarse-to-fine workflow: bounding box initialization, entropy-guided point discovery, vision-based validation, and mask decoding. Extensive evaluations on four benchmark datasets (RefCOCO, RefCOCO+, RefCOCOg, and ReasonSeg) demonstrate that ResAgent achieves new state-of-the-art performance across all four benchmarks, highlighting its effectiveness in generating accurate and semantically grounded segmentation masks with minimal prompts.
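
The abstract describes EBD only at a high level, so the sketch below is one possible reading rather than the paper's algorithm. It assumes, purely for illustration, that "information" is scored by the per-pixel binary entropy of a hypothetical foreground probability map and that the top-k highest-entropy pixels inside the coarse box become candidate points; `binary_entropy` and `top_entropy_points` are made-up names.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli variable with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain prediction carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def top_entropy_points(prob_map, box, k=3):
    """Return the k most uncertain (x, y) pixels inside an inclusive box.

    prob_map: 2D list of foreground probabilities indexed as prob_map[y][x].
    box: (x0, y0, x1, y1), the coarse bounding box.
    """
    x0, y0, x1, y1 = box
    scored = []
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            scored.append((binary_entropy(prob_map[y][x]), (x, y)))
    # Highest entropy first; Python's sort is stable, so ties keep scan order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [point for _, point in scored[:k]]


prob_map = [[0.0, 0.5, 1.0],
            [0.9, 0.4, 0.1]]
points = top_entropy_points(prob_map, (0, 0, 2, 1), k=2)
```

In the workflow the abstract outlines, candidates like these would then be passed to the vision-based validation step before mask decoding.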

[62] A Cosine Network for Image Super-Resolution

Chunwei Tian,Chengyuan Zhang,Bob Zhang,Zhiwu Li,C. L. Philip Chen,David Zhang

Main category: cs.CV

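TL;DR: This paper proposes a cosine network for image super-resolution (CSRNet). Odd and even heterogeneous blocks extract complementary homologous structural information, and combining linear and non-linear structural information improves robustness; a cosine annealing mechanism optimizes training. Experiments show performance competitive with state-of-the-art methods.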

Details

Motivation: A key challenge in image super-resolution is preserving and exploiting the extracted hierarchical structural information, in particular avoiding redundant homologous information and making the structural information more robust. Method: CSRNet (1) uses odd and even heterogeneous blocks to enlarge architectural differences and extract complementary homologous structural information; (2) fuses linear and non-linear structural information; and (3) applies a cosine annealing mechanism that adjusts the learning rate and performs warm restarts, so that gradient descent is less likely to get stuck in local minima. Result: CSRNet achieves performance competitive with current state-of-the-art methods on image super-resolution. Conclusion: Through heterogeneous architectural design and an optimized training strategy, CSRNet extracts and uses structural information more effectively, improving super-resolution quality and training stability. Abstract: Deep convolutional neural networks can use hierarchical information to progressively extract structural information to recover high-quality images. However, preserving the effectiveness of the obtained structural information is important in image super-resolution. In this paper, we propose a cosine network for image super-resolution (CSRNet) by improving a network architecture and optimizing the training strategy. To extract complementary homologous structural information, odd and even heterogeneous blocks are designed to enlarge the architectural differences and improve the performance of image super-resolution. Combining linear and non-linear structural information can overcome the drawback of homologous information and enhance the robustness of the obtained structural information in image super-resolution. To keep gradient descent from getting stuck in local minima, a cosine annealing mechanism is used to optimize the training procedure by performing warm restarts and adjusting the learning rate. Experimental results illustrate that the proposed CSRNet is competitive with state-of-the-art methods in image super-resolution.
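
For readers unfamiliar with cosine annealing with warm restarts, here is a minimal sketch of the standard schedule, lr = lr_min + 0.5 (lr_max - lr_min)(1 + cos(pi * t_cur / T)), with a fixed cycle length T. The cycle length and learning-rate bounds below are illustrative assumptions; the abstract does not report the values CSRNet uses.

```python
import math


def cosine_annealing_lr(step, cycle_length, lr_max=1e-3, lr_min=1e-6):
    """Learning rate at `step` under cosine annealing with warm restarts.

    The rate decays from lr_max toward lr_min along a half cosine over
    `cycle_length` steps, then jumps back to lr_max (the warm restart).
    """
    t_cur = step % cycle_length  # position inside the current cycle
    cos_term = math.cos(math.pi * t_cur / cycle_length)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cos_term)


# Two full cycles of 10 steps each.
schedule = [cosine_annealing_lr(step, cycle_length=10) for step in range(20)]
```

The periodic jump back to the maximum rate is what is meant to help the optimizer leave a poor local minimum; variants of the schedule also lengthen the cycle after each restart, which this sketch omits.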

[63] DCCS-Det: Directional Context and Cross-Scale-Aware Detector for Infrared Small Target

Shuying Li,Qiang Ma,San Zhang,Chuang Yang

Main category: cs.CV

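TL;DR: This paper proposes DCCS-Det, a new method for infrared small target detection. Its Dual-stream Saliency Enhancement (DSE) block and Latent-aware Semantic Extraction and Aggregation (LaSEA) module jointly model local and global features while suppressing semantic dilution and redundancy, significantly improving detection accuracy and robustness.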

Details

Motivation: Existing infrared small target detection methods model local and global features jointly only inadequately, which weakens target-background discrimination; they also suffer from feature redundancy and semantic dilution, which degrade target representation quality. Method: The DCCS-Det detector comprises a DSE block, which fuses localized perception with direction-aware context aggregation, and a LaSEA module, which uses cross-scale feature extraction and random pooling sampling to mitigate feature degradation, enhance discriminative features, and suppress noise. Result: The method achieves state-of-the-art detection accuracy on multiple datasets with competitive efficiency; ablation studies confirm the contributions of DSE and LaSEA to target perception and feature representation in complex scenes. Conclusion: DCCS-Det addresses inadequate local-global modeling and feature degradation in infrared small target detection, offering a new approach and a practical framework for robust, accurate IRSTD. Abstract: Infrared small target detection (IRSTD), which aims to identify small, low-contrast targets against complex backgrounds, is critical for applications like remote sensing and surveillance. However, existing methods often struggle with inadequate joint modeling of local-global features (harming target-background discrimination) or feature redundancy and semantic dilution (degrading target representation quality). To tackle these issues, we propose DCCS-Det (Directional Context and Cross-Scale Aware Detector for Infrared Small Target), a novel detector that incorporates a Dual-stream Saliency Enhancement (DSE) block and a Latent-aware Semantic Extraction and Aggregation (LaSEA) module. The DSE block integrates localized perception with direction-aware context aggregation to help capture long-range spatial dependencies and local details. On this basis, the LaSEA module mitigates feature degradation via cross-scale feature extraction and random pooling sampling strategies, enhancing discriminative features and suppressing noise. Extensive experiments show that DCCS-Det achieves state-of-the-art detection accuracy with competitive efficiency across multiple datasets. Ablation studies further validate the contributions of DSE and LaSEA in improving target perception and feature representation under complex scenarios. The official DCCS-Det code is available at https://huggingface.co/InPeerReview/InfraredSmallTargetDetection-IRSTD.DCCS

[64] AlphaFace: High Fidelity and Real-time Face Swapper Robust to Facial Pose

Jongmin Yu,Hyeontaek Oh,Zhongtian Sun,Angelica I Aviles-Rivero,Moongu Jeon,Jinhong Yang

Main category: cs.CV

TL;DR: AlphaFace is a real-time face-swapping method that uses vision-language models and CLIP embeddings with novel contrastive losses to improve identity preservation and attribute accuracy under extreme facial poses, outperforming existing methods on pose-challenging benchmarks.

Details

Motivation: Existing face-swapping methods degrade significantly under extreme facial poses; explicit geometric features add dependencies and cost, while diffusion-based methods lack real-time capability. Method: AlphaFace leverages an open-source vision-language model and CLIP image/text embeddings to introduce visual and textual semantic contrastive losses for stronger identity representation and precise attribute preservation. Result: AlphaFace achieves superior performance over state-of-the-art methods in pose-challenging scenarios on FF++, MPIE, and LPFF benchmarks, while maintaining real-time speed. Conclusion: AlphaFace effectively balances robustness to extreme poses, identity fidelity, attribute preservation, and real-time efficiency, setting a new standard for practical face swapping. Abstract: Existing face-swapping methods often deliver competitive results in constrained settings but exhibit substantial quality degradation when handling extreme facial poses. To improve facial pose robustness, explicit geometric features are applied, but this approach remains problematic since it introduces additional dependencies and increases computational cost. Diffusion-based methods have achieved remarkable results; however, they are impractical for real-time processing. We introduce AlphaFace, which leverages an open-source vision-language model and CLIP image and text embeddings to apply novel visual and textual semantic contrastive losses. AlphaFace enables stronger identity representation and more precise attribute preservation, all while maintaining real-time performance. Comprehensive experiments across FF++, MPIE, and LPFF demonstrate that AlphaFace surpasses state-of-the-art methods in pose-challenging cases. The project is publicly available at https://github.com/andrewyu90/Alphaface_Official.git

[65] MDAFNet: Multiscale Differential Edge and Adaptive Frequency Guided Network for Infrared Small Target Detection

Shuying Li,Qiang Ma,San Zhang,Wuwei Wang,Chuang Yang

Main category: cs.CV

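TL;DR: This paper proposes MDAFNet, which uses a Multi-Scale Differential Edge (MSDE) module and a Dual-Domain Adaptive Feature Enhancement (DAFE) module to address edge information loss and frequency-domain interference in infrared small target detection, significantly improving detection performance.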

Details

Motivation: In existing methods, target edge pixels gradually degrade as networks get deeper, and conventional convolution struggles to separate frequency components, so low-frequency backgrounds interfere with high-frequency target features and high-frequency noise triggers false detections. Method: MDAFNet comprises an MSDE module, which extracts and enhances edges at multiple scales to compensate for edge information lost during downsampling, and a DAFE module, which combines frequency-domain processing with simulated frequency decomposition and fusion in the spatial domain to adaptively enhance high-frequency targets and suppress high-frequency noise. Result: Experiments on multiple datasets show that MDAFNet achieves superior detection performance. Conclusion: MDAFNet mitigates edge degradation and frequency-domain interference, improving the accuracy and robustness of infrared small target detection. Abstract: Infrared small target detection (IRSTD) plays a crucial role in numerous military and civilian applications. However, existing methods often face the gradual degradation of target edge pixels as the number of network layers increases, and traditional convolution struggles to differentiate between frequency components during feature extraction, leading to low-frequency backgrounds interfering with high-frequency targets and high-frequency noise triggering false detections. To address these limitations, we propose MDAFNet (Multi-scale Differential Edge and Adaptive Frequency Guided Network for Infrared Small Target Detection), which integrates the Multi-Scale Differential Edge (MSDE) module and Dual-Domain Adaptive Feature Enhancement (DAFE) module. The MSDE module, through a multi-scale edge extraction and enhancement mechanism, effectively compensates for the cumulative loss of target edge information during downsampling. The DAFE module combines frequency domain processing mechanisms with simulated frequency decomposition and fusion mechanisms in the spatial domain to effectively improve the network's capability to adaptively enhance high-frequency targets and selectively suppress high-frequency noise. Experimental results on multiple datasets demonstrate the superior detection performance of MDAFNet.

[66] Masked Face Recognition under Different Backbones

Bo Zhang,Ming Zhang,Kun Wu,Lei Bian,Yi Lin

Main category: cs.CV

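TL;DR: This paper evaluates face recognition performance across several backbone networks with and without masks. The r100 series performs best in standard tests, while r100_mask_v2 and ViT-Small/Tiny have the edge in masked scenarios; practical deployment recommendations follow from these findings.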

Details

Motivation: In the post-pandemic era, many civil aviation passengers wear masks during security checks, which challenges traditional face recognition models and makes it urgent to assess how well different backbone networks adapt to masked faces. Method: Large-scale comparative experiments comprehensively evaluate backbones including r100, r50, r34_mask_v1, r100_mask_v2, r50_mask_v3, and ViT-Small/Tiny under standard and masked conditions. Result: In standard tests the r100 series exceeds 98% accuracy (at FAR = 0.01%), r50 ranks second, and r34_mask_v1 is worst; in masked tests r100_mask_v2 reaches 90.07%, and ViT-Small/Tiny also show notable gains. Conclusion: Backbone architecture strongly affects masked face recognition; r100_mask_v2 and lightweight ViT variants are better suited to masked scenarios, and the study provides a basis for model selection in real security deployments. Abstract: Erratum to the paper (Zhang et al., 2025): corrections to Table IV and the data in Page 3, Section A. In the post-pandemic era, a high proportion of civil aviation passengers wear masks during security checks, posing significant challenges to traditional face recognition models. The backbone network serves as the core component of face recognition models. In standard tests, r100 series models excelled (98%+ accuracy at 0.01% FAR in face comparison, high top1/top5 in search). r50 ranked second, r34_mask_v1 lagged. In masked tests, r100_mask_v2 led (90.07% accuracy), r50_mask_v3 performed best among r50 but trailed r100. ViT-Small/Tiny showed strong masked performance with gains in effectiveness. Through extensive comparative experiments, this paper conducts a comprehensive evaluation of several core backbone networks, aiming to reveal the impacts of different models on face recognition with and without masks, and provide specific deployment recommendations.
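
To make "accuracy at 0.01% FAR" concrete, the sketch below assumes it denotes the true accept rate of genuine pairs at a similarity threshold tuned so that at most the given fraction of impostor pairs is accepted. The abstract does not define the protocol, so this reading, the function name `tar_at_far`, and the toy scores are all assumptions.

```python
def tar_at_far(genuine_scores, impostor_scores, far):
    """Return (true accept rate, threshold) at a target false accept rate.

    A pair is accepted when its similarity score is strictly above the
    threshold. The threshold is chosen so that at most `far` of the
    impostor pairs pass.
    """
    ordered = sorted(impostor_scores, reverse=True)
    allowed = int(far * len(ordered))  # impostor pairs permitted to pass
    # Rejecting the (allowed + 1)-th highest impostor score and everything
    # below it leaves at most `allowed` false accepts.
    threshold = ordered[allowed] if allowed < len(ordered) else float("-inf")
    accepted = sum(1 for score in genuine_scores if score > threshold)
    return accepted / len(genuine_scores), threshold


impostor = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
genuine = [0.92, 0.85, 0.97, 0.99]
tar, threshold = tar_at_far(genuine, impostor, far=0.1)
```

At a FAR as low as 0.01%, a single false accept requires at least 10,000 impostor pairs, so estimates from smaller test sets are not meaningful.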

[67] Emotion-LLaMAv2 and MMEVerse: A New Framework and Benchmark for Multimodal Emotion Understanding

Xiaojiang Peng,Jingyi Chen,Zebang Cheng,Bao Peng,Fengyi Wu,Yifei Dong,Shuyuan Tu,Qiyu Hu,Huiting Huang,Yuxiang Lin,Jun-Yan He,Kai Wang,Zheng Lian,Zhi-Qi Cheng

Main category: cs.CV

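TL;DR: This paper presents the Emotion-LLaMAv2 model and the MMEVerse benchmark, which aim to improve multimodal large language models on emotion recognition and reasoning. An end-to-end multiview encoder, a Conv Attention pre-fusion module, and perception-to-cognition curriculum instruction tuning overcome the predecessor's reliance on explicit face detection, implicit fusion, and low-quality data; MMEVerse merges 12 public datasets, re-annotated by a multi-agent pipeline, into a large unified instruction-format dataset with 130k training clips and 36k test clips.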

Details

Motivation: Affective computing faces a scarcity of high-quality multimodal emotion annotations, a lack of standardized benchmarks, and limited emotional reasoning ability in multimodal large language models (MLLMs); the earlier Emotion-LLaMA framework depended on an external face detector, fused modalities implicitly, and was trained on small, low-quality data. Method: Emotion-LLaMAv2 (1) replaces explicit face detection with an end-to-end multiview encoder that extracts richer spatial and temporal emotional cues; (2) adds a Conv Attention pre-fusion module that enables local and global multimodal feature interaction outside the LLM backbone; and (3) introduces perception-to-cognition curriculum instruction tuning within the LLaMA2 backbone, unifying emotion recognition and free-form emotion reasoning. The accompanying MMEVerse benchmark aggregates 12 public datasets, re-annotates them with a multi-agent pipeline of Qwen2 Audio, Qwen2.5 VL, and GPT 4o, and produces 130k training clips and 36k test clips in a unified instruction format, covering 18 evaluation benchmarks. Result: Emotion-LLaMAv2 significantly outperforms baseline models on the 18 emotion understanding benchmarks covered by MMEVerse, showing higher emotion recognition accuracy and stronger free-form reasoning; MMEVerse is the first large-scale, high-quality, standardized benchmark for multimodal emotion instruction tuning and evaluation. Conclusion: Together, Emotion-LLaMAv2 and MMEVerse form an end-to-end, scalable, and reproducible paradigm for multimodal emotion understanding, advancing the practical application and systematic evaluation of MLLMs in affective computing. Abstract: Understanding human emotions from multimodal signals poses a significant challenge in affective computing and human-robot interaction. While multimodal large language models (MLLMs) have excelled in general vision-language tasks, their capabilities in emotional reasoning remain limited. The field currently suffers from a scarcity of large-scale datasets with high-quality, descriptive emotion annotations and lacks standardized benchmarks for evaluation. Our preliminary framework, Emotion-LLaMA, pioneered instruction-tuned multimodal learning for emotion reasoning but was restricted by explicit face detectors, implicit fusion strategies, and low-quality training data with limited scale. To address these limitations, we present Emotion-LLaMAv2 and the MMEVerse benchmark, establishing an end-to-end pipeline together with a standardized evaluation setting for emotion recognition and reasoning. Emotion-LLaMAv2 introduces three key advances. First, an end-to-end multiview encoder eliminates external face detection and captures nuanced emotional cues via richer spatial and temporal multiview tokens. Second, a Conv Attention pre-fusion module is designed to enable simultaneous local and global multimodal feature interactions external to the LLM backbone. Third, a perception-to-cognition curriculum instruction tuning scheme within the LLaMA2 backbone unifies emotion recognition and free-form emotion reasoning. To support large-scale training and reproducible evaluation, MMEVerse aggregates twelve publicly available emotion datasets, including IEMOCAP, MELD, DFEW, and MAFW, into a unified multimodal instruction format. The data are re-annotated via a multi-agent pipeline involving Qwen2 Audio, Qwen2.5 VL, and GPT 4o, producing 130k training clips and 36k testing clips across 18 evaluation benchmarks.

[68] VISTA-PATH: An interactive foundation model for pathology image segmentation and quantitative analysis in computational pathology

Peixian Liang,Songhao Li,Shunsuke Koga,Yutong Li,Zahra Alipour,Yucheng Tang,Daguang Xu,Zhi Huang

Main category: cs.CV

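TL;DR: This paper presents VISTA-PATH, an interactive, class-aware segmentation foundation model for pathology images that combines visual context, semantic descriptions, and expert spatial prompts for accurate multi-class segmentation, together with VISTA-PATH Data, a large-scale corpus of image-mask-text triplets. The model outperforms existing methods on multiple benchmarks, supports human-in-the-loop refinement, and improves clinically relevant tissue microenvironment analysis.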

Details

Motivation: Existing segmentation foundation models treat segmentation as a static visual prediction task, which fits pathology poorly: they cannot resolve heterogeneous structures, incorporate expert feedback, or offer clinical interpretability. Method: VISTA-PATH jointly models visual context, semantic tissue descriptions, and optional expert-provided spatial prompts such as bounding boxes; VISTA-PATH Data, a large-scale pathology segmentation corpus of over 1.6 million image-mask-text triplets covering 9 organs and 93 tissue classes, is curated; whole-slide segmentation can be refined dynamically from sparse annotation feedback. Result: The model consistently outperforms existing segmentation foundation models on multiple held-out and external benchmarks; it propagates patch-level human feedback into whole-slide segmentation; its high-fidelity, class-aware segmentations improve tissue microenvironment analysis, and the proposed Tumor Interaction Score (TIS) is significantly associated with patient survival. Conclusion: VISTA-PATH elevates pathology image segmentation from static prediction to an interactive, clinically grounded representation, providing a new foundation model for digital pathology. Abstract: Accurate semantic segmentation of histopathology images is crucial for quantitative tissue analysis and downstream clinical modeling. Recent segmentation foundation models have improved generalization through large-scale pretraining, yet remain poorly aligned with pathology because they treat segmentation as a static visual prediction task. Here we present VISTA-PATH, an interactive, class-aware pathology segmentation foundation model designed to resolve heterogeneous structures, incorporate expert feedback, and produce pixel-level segmentations that are directly meaningful for clinical interpretation. VISTA-PATH jointly conditions segmentation on visual context, semantic tissue descriptions, and optional expert-provided spatial prompts, enabling precise multi-class segmentation across heterogeneous pathology images. To support this paradigm, we curate VISTA-PATH Data, a large-scale pathology segmentation corpus comprising over 1.6 million image-mask-text triplets spanning 9 organs and 93 tissue classes. Across extensive held-out and external benchmarks, VISTA-PATH consistently outperforms existing segmentation foundation models. Importantly, VISTA-PATH supports dynamic human-in-the-loop refinement by propagating sparse, patch-level bounding-box annotation feedback into whole-slide segmentation. Finally, we show that the high-fidelity, class-aware segmentation produced by VISTA-PATH makes it a preferred model for computational pathology. It improves tissue microenvironment analysis through the proposed Tumor Interaction Score (TIS), which exhibits strong and significant associations with patient survival. Together, these results establish VISTA-PATH as a foundation model that elevates pathology image segmentation from a static prediction to an interactive and clinically grounded representation for digital pathology. Source code and demo can be found at https://github.com/zhihuanglab/VISTA-PATH.

[69] Order from Chaos: Physical World Understanding from Glitchy Gameplay Videos

Meng Cao,Haoran Tang,Haoze Zhao,Mingfei Han,Ruyang Liu,Qiang Sun,Xiaojun Chang,Ian Reid,Xiaodan Liang

Main category: cs.CV

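TL;DR: This paper proposes using physical anomalies (glitches) in gameplay videos as a supervision signal and builds the PhysGame dataset and the GameBench benchmark, significantly improving multimodal large models on physical world understanding tasks.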

Details

Motivation: Existing physical reasoning datasets rely either on real-world videos, which are costly to annotate, or on synthetic data, which lacks realism and diversity, so they cannot support human-like physical world understanding in AI. Method: A new paradigm based on glitches, visual anomalies in gameplay videos that violate physical laws, is proposed; PhysGame, an instruction-tuning dataset with more than 140,000 glitch-centric question-answer pairs, is built with a meta-information-guided prompting strategy to ensure quality; GameBench, an expert-annotated benchmark of 880 gameplay videos containing glitches, is also constructed. Result: PhysGame significantly improves transfer: Game2Real (+2.5% on PhysBench) and Game2General (+1.9% on MVBench), plus a 3.7% absolute improvement on GameBench, showing greater robustness in detecting physical implausibilities. Conclusion: Learning from gameplay anomalies is a scalable and effective new route to improving the physical world understanding of multimodal agents. Abstract: Understanding the physical world, including object dynamics, material properties, and causal interactions, remains a core challenge in artificial intelligence. Although recent multi-modal large language models (MLLMs) have demonstrated impressive general reasoning capabilities, they still fall short of achieving human-level understanding of physical principles. Existing datasets for physical reasoning either rely on real-world videos, which incur high annotation costs, or on synthetic simulations, which suffer from limited realism and diversity. In this paper, we propose a novel paradigm that leverages glitches in gameplay videos, referring to visual anomalies that violate predefined physical laws, as a rich and scalable supervision source for physical world understanding. We introduce PhysGame, a meta-information-guided instruction-tuning dataset containing 140,057 glitch-centric question-answer pairs across five physical domains and sixteen fine-grained categories. To ensure data accuracy, we design a prompting strategy that utilizes gameplay metadata such as titles and descriptions to guide high-quality QA generation. Complementing PhysGame, we construct GameBench, an expert-annotated benchmark with 880 glitch-identified gameplay videos designed to evaluate physical reasoning capabilities. Extensive experiments show that PhysGame significantly enhances both Game2Real transferability, improving the real-world physical reasoning performance of Qwen2.5VL by 2.5% on PhysBench, and Game2General transferability, yielding a 1.9% gain on the MVBench benchmark. Moreover, PhysGame-tuned models achieve a 3.7% absolute improvement on GameBench, demonstrating enhanced robustness in detecting physical implausibilities. These results indicate that learning from gameplay anomalies offers a scalable and effective pathway toward advancing physical world understanding in multimodal intelligence.

[70] Multi-View Consistent Wound Segmentation With Neural Fields

Remi Chierchia,Léo Lebrat,David Ahmedt-Aristizabal,Yulia Arzhaeva,Olivier Salvado,Clinton Fookes,Rodrigo Santa Cruz

Main category: cs.CV

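TL;DR: This paper presents WoundNeRF, which uses a NeRF SDF to estimate robust wound segmentations from automatically generated annotations and reports better accuracy than existing Vision Transformer networks and rasterisation-based algorithms.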

Details

Motivation: Wound care carries economic and logistical burdens and needs automatic, fast, and accurate tissue assessment; 2D wound segmentation has received attention, but reconstructing multi-view consistent 3D structure remains difficult. Method: Proposes WoundNeRF, a NeRF SDF-based method that estimates 3D wound segmentations from automatically generated annotations, and compares it with state-of-the-art Vision Transformer networks and conventional rasterisation-based algorithms. Result: WoundNeRF shows strong potential for recovering accurate wound segmentations and outperforms the compared methods. Conclusion: WoundNeRF is a promising paradigm for producing robust, consistent 3D wound segmentations from 2D images; the code will be released to support further research. Abstract: Wound care is often challenged by the economic and logistical burdens that consistently afflict patients and hospitals worldwide. In recent decades, healthcare professionals have sought support from computer vision and machine learning algorithms. In particular, wound segmentation has gained interest due to its ability to provide professionals with fast, automatic tissue assessment from standard RGB images. Some approaches have extended segmentation to 3D, enabling more complete and precise healing progress tracking. However, inferring multi-view consistent 3D structures from 2D images remains a challenge. In this paper, we evaluate WoundNeRF, a NeRF SDF-based method for estimating robust wound segmentations from automatically generated annotations. We demonstrate the potential of this paradigm in recovering accurate segmentations by comparing it against state-of-the-art Vision Transformer networks and conventional rasterisation-based algorithms. The code will be released to facilitate further development in this promising paradigm.

[71] Expert Knowledge-Guided Decision Calibration for Accurate Fine-Grained Tree Species Classification

Chen Long,Dian Chen,Ruifei Ding,Zhe Chen,Zhen Dong,Bisheng Yang

Main category: cs.CV

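TL;DR: This paper proposes an Expert Knowledge-Guided Classification Decision Calibration Network (EKDC-Net) that brings in external domain-expert knowledge and combines local-prior-guided knowledge extraction with uncertainty-guided decision calibration to mitigate long-tailed distributions and inter-class similarity in fine-grained tree species classification; it also introduces the large-scale CU-Tree102 dataset.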

Details

Motivation: Existing methods overlook the long-tailed distributions and high inter-class similarity inherent in tree species classification and perform poorly on few-shot or easily confused categories; inspired by how people consult experts to get past the limits of local thinking, the paper introduces external domain-expert knowledge to assist classification. Method: Proposes the EKDC-Net framework with two core modules: 1) a Local Prior Guided Knowledge Extraction Module (LPKEM), which uses CAM analysis to steer the expert toward discriminative features; and 2) an Uncertainty-Guided Decision Calibration Module (UDCM), which calibrates the local model's decisions by dynamically combining overall category uncertainty with instance-level prediction uncertainty. It also builds the CU-Tree102 dataset (102 tree species). Result: Achieves state-of-the-art performance on three benchmark datasets; as a lightweight plug-and-play module adding only 0.08M learnable parameters, it raises backbone accuracy by 6.42% and precision by 11.46%; the dataset, code, and pre-trained models are released. Conclusion: By explicitly modeling and exploiting external expert knowledge, EKDC-Net improves fine-grained tree species classification, especially in practical settings with limited data, imbalanced distributions, and confusable classes, and is practical and extensible. Abstract: Accurate fine-grained tree species classification is critical for forest inventory and biodiversity monitoring. Existing methods predominantly focus on designing complex architectures to fit local data distributions. However, they often overlook the long-tailed distributions and high inter-class similarity inherent in limited data, thereby struggling to distinguish between few-shot or confusing categories. In the process of knowledge dissemination in the human world, individuals will actively seek expert assistance to transcend the limitations of local thinking. Inspired by this, we introduce an external "Domain Expert" and propose an Expert Knowledge-Guided Classification Decision Calibration Network (EKDC-Net) to overcome these challenges. Our framework addresses two core issues: expert knowledge extraction and utilization. Specifically, we first develop a Local Prior Guided Knowledge Extraction Module (LPKEM). By leveraging Class Activation Map (CAM) analysis, LPKEM guides the domain expert to focus exclusively on discriminative features essential for classification. Subsequently, to effectively integrate this knowledge, we design an Uncertainty-Guided Decision Calibration Module (UDCM). This module dynamically corrects the local model's decisions by considering both overall category uncertainty and instance-level prediction uncertainty. Furthermore, we present a large-scale classification dataset covering 102 tree species, named CU-Tree102, to address the issue of scarce diversity in current benchmarks. Experiments on three benchmark datasets demonstrate that our approach achieves state-of-the-art performance. Crucially, as a lightweight plug-and-play module, EKDC-Net improves backbone accuracy by 6.42% and precision by 11.46% using only 0.08M additional learnable parameters. The dataset, code, and pre-trained models are available at https://github.com/WHU-USI3DV/TreeCLS.

[72] SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer

Tongcheng Fang,Hanling Zhang,Ruiqi Xie,Zhuo Han,Xin Tao,Tianchen Zhao,Pengfei Wan,Wenbo Ding,Wanli Ouyang,Xuefei Ning,Yu Wang

Main category: cs.CV

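TL;DR: This paper proposes SALAD, which adds a lightweight linear attention branch in parallel with sparse attention in diffusion Transformers and balances the two with an input-dependent gate, reaching a 1.72x inference speedup at 90% sparsity while preserving generation quality, with efficient finetuning (only 2,000 video samples and 1,600 steps).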

Details

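Motivation: Diffusion Transformers perform well in video generation, but the quadratic complexity of full attention causes high computational latency; among existing sparse attention methods, training-free ones reach limited sparsity and modest speedups, while training-based ones reach high sparsity at a large training cost. Method: Proposes SALAD, which adds a lightweight linear attention branch in parallel with the sparse attention and uses an input-dependent gating mechanism to fuse the two branch outputs dynamically; finetuning is efficient and needs little data and few training steps. Result: Reaches 90% sparsity and a 1.72x inference speedup with generation quality comparable to the full attention baseline; finetuning needs only 2,000 video samples and 1,600 steps (batch size 8). Conclusion: SALAD substantially improves diffusion Transformer inference efficiency while preserving generation quality, combining high sparsity with low training cost, and offers a practical option for deploying video generation models. Abstract: Diffusion Transformers have recently demonstrated remarkable performance in video generation. However, the long input sequences result in high computational latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed. Training-free sparse attention is constrained by limited sparsity and thus offers modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation for training. In this work, we propose SALAD, introducing a lightweight linear attention branch in parallel with the sparse attention. By incorporating an input-dependent gating mechanism to finely balance the two branches, our method attains 90% sparsity and 1.72x inference speedup, while maintaining generation quality comparable to the full attention baseline. Moreover, our finetuning process is highly efficient, requiring only 2,000 video samples and 1,600 training steps with a batch size of 8.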
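
The abstract describes a sparse attention branch, a parallel linear attention branch, and an input-dependent gate, but gives no formulas. The pure-Python sketch below shows one way such a combination could look for a single query token. Everything specific here is an assumption made for illustration and is not taken from the paper: top-k key selection as the sparsity pattern, `elu(x) + 1` as the linear-attention feature map, a scalar sigmoid gate, the additive fusion `sparse + g * linear`, and all function and parameter names.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _phi(x):
    # Positive feature map elu(x) + 1 (assumed; a common linear-attention choice).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def sparse_attention(q, keys, values, keep):
    """Softmax attention restricted to the `keep` highest-scoring keys."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [_dot(q, k) * scale for k in keys]
    top = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:keep]
    m = max(scores[j] for j in top)  # subtract the max for numerical stability
    weights = {j: math.exp(scores[j] - m) for j in top}
    z = sum(weights.values())
    return [sum(weights[j] * values[j][d] for j in top) / z
            for d in range(len(values[0]))]


def linear_attention_state(keys, values):
    """Precompute S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j) once for all queries."""
    dk, dv = len(keys[0]), len(values[0])
    S = [[0.0] * dv for _ in range(dk)]
    z = [0.0] * dk
    for k, v in zip(keys, values):
        fk = _phi(k)
        for a in range(dk):
            z[a] += fk[a]
            for d in range(dv):
                S[a][d] += fk[a] * v[d]
    return S, z


def linear_attention(q, S, z):
    """Per-query output phi(q)^T S / (phi(q)^T z); cost is independent of sequence length."""
    fq = _phi(q)
    den = _dot(fq, z)  # strictly positive because phi > 0
    return [sum(fq[a] * S[a][d] for a in range(len(fq))) / den
            for d in range(len(S[0]))]


def gated_output(x, q, keys, values, S, z, gate_w, gate_b, keep):
    """Fuse both branches with a scalar gate computed from the token input x (assumed form)."""
    g = 1.0 / (1.0 + math.exp(-(_dot(gate_w, x) + gate_b)))
    sp = sparse_attention(q, keys, values, keep)
    li = linear_attention(q, S, z)
    return [s + g * l for s, l in zip(sp, li)]
```

In this sketch, 90% sparsity would correspond to setting `keep` to roughly one tenth of the number of keys. When the gate is close to zero the output reduces to the sparse branch alone, which is one reason an additive form is convenient when starting from a pretrained model.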

[73] TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning

Daixian Liu,Jiayi Kuang,Yinghui Li,Yangning Li,Di Yin,Haoyu Cao,Xing Sun,Ying Shen,Hai-Tao Zheng,Liang Lin,Philip S. Yu

Main category: cs.CV

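TL;DR: This paper introduces the TangramPuzzle benchmark for evaluating compositional spatial reasoning in multimodal large language models (MLLMs), together with the Tangram Construction Expression (TCE) symbolic framework and two tasks: outline prediction and end-to-end code generation. Experiments show that MLLMs tend to match the target silhouette while ignoring geometric constraints, which deforms the pieces.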

Details

Motivation: Existing benchmarks for compositional spatial reasoning use simple tasks, rely on semantic approximations or coarse relative positioning, and lack rigorous metrics, so a more geometry-grounded evaluation is needed. Method: Proposes the TangramPuzzle benchmark and the Tangram Construction Expression (TCE) symbolic geometric framework, and defines two tasks: Outline Prediction (inferring the global shape from local components) and End-to-End Code Generation (solving inverse geometric assembly problems); evaluates several advanced open-source and proprietary MLLMs. Result: MLLMs generally prefer matching the target silhouette and neglect exact geometric constraints, distorting or deforming the pieces and exposing a weakness in strict spatial reasoning. Conclusion: TangramPuzzle offers a stricter, verifiable evaluation paradigm for MLLM spatial reasoning, and TCE supports machine-verifiable geometric expressions, which should help advance embodied and geometry-aware AI. Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual recognition and semantic understanding. Nevertheless, their ability to perform precise compositional spatial reasoning remains largely unexplored. Existing benchmarks often involve relatively simple tasks and rely on semantic approximations or coarse relative positioning, while their evaluation metrics are typically limited and lack rigorous mathematical formulations. To bridge this gap, we introduce TangramPuzzle, a geometry-grounded benchmark designed to evaluate compositional spatial reasoning through the lens of the classic Tangram game. We propose the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications, to mitigate the ambiguity of visual approximation. We design two complementary tasks: Outline Prediction, which demands inferring global shapes from local components, and End-to-End Code Generation, which requires solving inverse geometric assembly problems. We conduct extensive evaluation experiments on advanced open-source and proprietary models, revealing an interesting insight: MLLMs tend to prioritize matching the target silhouette while neglecting geometric constraints, leading to distortions or deformations of the pieces.

[74] AnchoredDream: Zero-Shot 360° Indoor Scene Generation from a Single View via Geometric Grounding

Runmao Yao,Junsheng Zhou,Zhen Dong,Yu-Shen Liu

Main category: cs.CV

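TL;DR: This paper proposes AnchoredDream, a zero-shot single-view indoor scene generation method that anchors 360° scene generation on high-fidelity geometry through an appearance-geometry mutual boosting mechanism and significantly outperforms existing methods in appearance consistency and geometric plausibility.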

Details

Motivation: Single-view indoor scene generation matters for real-world applications, but generating a complete 360° scene from one image is highly ill-posed; existing methods based on diffusion models and depth estimation struggle to keep both appearance consistency and geometric plausibility under large viewpoint changes. Method: Proposes the zero-shot AnchoredDream pipeline: it first performs appearance-guided geometry generation to build a reliable 3D layout, then progressively generates the full scene through warp-and-inpaint, warp-and-refine, post-optimization, and a new Grouting Block that ensures seamless transitions between the input view and generated regions. Result: Extensive experiments show that AnchoredDream outperforms existing methods by a large margin in both appearance consistency and geometric plausibility, with no finetuning or training. Conclusion: Geometric anchoring is a key route to high-quality, zero-shot single-view scene generation, which supports building on high-fidelity geometry. Abstract: Single-view indoor scene generation plays a crucial role in a range of real-world applications. However, generating a complete 360° scene from a single image remains a highly ill-posed and challenging problem. Recent approaches have made progress by leveraging diffusion models and depth estimation networks, yet they still struggle to maintain appearance consistency and geometric plausibility under large viewpoint changes, limiting their effectiveness in full-scene generation. To address this, we propose AnchoredDream, a novel zero-shot pipeline that anchors 360° scene generation on high-fidelity geometry via an appearance-geometry mutual boosting mechanism. Given a single-view image, our method first performs appearance-guided geometry generation to construct a reliable 3D scene layout. Then, we progressively generate the complete scene through a series of modules: warp-and-inpaint, warp-and-refine, post-optimization, and a novel Grouting Block, which ensures seamless transitions between the input view and generated regions. Extensive experiments demonstrate that AnchoredDream outperforms existing methods by a large margin in both appearance consistency and geometric plausibility, all in a zero-shot manner. Our results highlight the potential of geometric grounding for high-quality, zero-shot single-view scene generation.

[75] OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding

Zixian Liu,Zhaoxi Chen,Liang Pan,Ziwei Liu

Main category: cs.CV

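TL;DR: This paper proposes OnlineSI, a framework that maintains a finite spatial memory and fuses 3D point cloud and semantic information so that a multimodal large language model can keep improving its spatial understanding, targeting real-world embodied systems.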

Details

Motivation: Existing methods overlook the ability of MLLMs to keep working in a changing environment and are hard to deploy on real embodied systems. Method: Proposes the OnlineSI framework, which stores past observations in a finite spatial memory and fuses 3D point cloud information with semantic information to improve object localization and identification; introduces the Fuzzy F1-Score metric to mitigate annotation ambiguity. Result: Experiments on two representative datasets demonstrate the method's effectiveness and improved spatial understanding and reasoning in the MLLM. Conclusion: OnlineSI offers a feasible path toward real-world embodied systems with continual spatial understanding. Abstract: In recent years, researchers have increasingly been interested in how to enable Multimodal Large Language Models (MLLM) to possess spatial understanding and reasoning capabilities. However, most existing methods overlook the importance of the ability to continuously work in an ever-changing world, and lack the possibility of deployment on embodied systems in real-world environments. In this work, we introduce OnlineSI, a framework that can continuously improve its spatial understanding of its surroundings given a video stream. Our core idea is to maintain a finite spatial memory to retain past observations, ensuring the computation required for each inference does not increase as the input accumulates. We further integrate 3D point cloud information with semantic information, helping MLLM to better locate and identify objects in the scene. To evaluate our method, we introduce the Fuzzy $F_1$-Score to mitigate ambiguity, and test our method on two representative datasets. Experiments demonstrate the effectiveness of our method, paving the way towards real-world embodied systems.

[76] Semi-Supervised Hierarchical Open-Set Classification

Erik Wallin,Fredrik Kahl,Lars Hammarstrand

Main category: cs.CV

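TL;DR: This paper proposes a pseudo-labeling teacher-student framework for semi-supervised hierarchical open-set classification that uses subtree pseudo-labels and age-gating to improve generalization to unknown classes, matching fully supervised performance on iNaturalist19 with only 20 labeled samples per class.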

Details

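Motivation: Extend hierarchical open-set classification to the semi-supervised setting so that large-scale, uncurated data containing both known and unknown classes can improve model performance. Method: Proposes a teacher-student framework based on pseudo-labeling with two key components: 1) subtree pseudo-labels, which give reliable supervision in the presence of unknown data; and 2) age-gating, which mitigates overconfidence in pseudo-labels. Result: On the iNaturalist19 benchmark, the method outperforms self-supervised pretraining followed by supervised adaptation and matches the fully supervised counterpart with only 20 labeled samples per class. Conclusion: The semi-supervised framework makes effective use of the unknown-class information in unlabeled data and clearly improves hierarchical open-set classification, particularly when labeled data is scarce. Abstract: Hierarchical open-set classification handles previously unseen classes by assigning them to the most appropriate high-level category in a class taxonomy. We extend this paradigm to the semi-supervised setting, enabling the use of large-scale, uncurated datasets containing a mixture of known and unknown classes to improve the hierarchical open-set performance. To this end, we propose a teacher-student framework based on pseudo-labeling. Two key components are introduced: 1) subtree pseudo-labels, which provide reliable supervision in the presence of unknown data, and 2) age-gating, a mechanism that mitigates overconfidence in pseudo-labels. Experiments show that our framework outperforms self-supervised pretraining followed by supervised adaptation, and even matches the fully supervised counterpart when using only 20 labeled samples per class on the iNaturalist19 benchmark. Our code is available at https://github.com/walline/semihoc.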

[77] HA2F: Dual-module Collaboration-Guided Hierarchical Adaptive Aggregation Framework for Remote Sensing Change Detection

Shuying Li,Yuchen Wang,San Zhang,Chuang Yang

Main category: cs.CV

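TL;DR: This paper proposes HA2F, a dual-module collaboration-guided hierarchical adaptive aggregation framework for remote sensing change detection. Its dynamic hierarchical feature calibration module (DHFCM) and noise-adaptive feature refinement module (NAFRM) mitigate multi-temporal feature alignment deviations and radiometric and geometric noise, reaching state-of-the-art performance on several datasets.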

Details

Motivation: Existing methods trade off between extracting features from local patches and processing whole images holistically, which causes cross-temporal feature matching deviations and sensitivity to radiometric and geometric noise. Method: Proposes the HA2F framework, comprising a dynamic hierarchical feature calibration module (DHFCM), which fuses cross-level features through perception-driven feature selection to suppress irrelevant discrepancies, and a noise-adaptive feature refinement module (NAFRM), which uses a dual feature selection mechanism to generate spatial masks that highlight change-sensitive regions and suppress interference such as shadows. Result: Achieves state-of-the-art accuracy on the LEVIR-CD, WHU-CD, and SYSU-CD datasets while improving computational efficiency; ablations confirm that both DHFCM and NAFRM are effective. Conclusion: Through hierarchical adaptivity and noise-robust design, HA2F improves the accuracy and robustness of remote sensing change detection and provides more reliable support for applications such as environmental monitoring. Abstract: Remote sensing change detection (RSCD) aims to identify the spatio-temporal changes of land cover, providing critical support for multi-disciplinary applications (e.g., environmental monitoring, disaster assessment, and climate change studies). Existing methods either focus on extracting features from localized patches or process entire images holistically, which leads to cross-temporal feature matching deviations and sensitivity to radiometric and geometric noise. To address these issues, we propose a dual-module collaboration-guided hierarchical adaptive aggregation framework, namely HA2F, which consists of a dynamic hierarchical feature calibration module (DHFCM) and a noise-adaptive feature refinement module (NAFRM). The former dynamically fuses adjacent-level features through perceptual feature selection, suppressing irrelevant discrepancies to address multi-temporal feature alignment deviations. The NAFRM utilizes a dual feature selection mechanism to highlight change-sensitive regions and generate spatial masks, suppressing the interference of irrelevant regions or shadows. Extensive experiments verify the effectiveness of the proposed HA2F, which achieves state-of-the-art performance on the LEVIR-CD, WHU-CD, and SYSU-CD datasets, surpassing existing comparative methods in terms of both precision metrics and computational efficiency. In addition, ablation experiments show that DHFCM and NAFRM are effective. The official HA2F code is available at https://huggingface.co/InPeerReview/RemoteSensingChangeDetection-RSCD.HA2F.

[78] X-Aligner: Composed Visual Retrieval without the Bells and Whistles

Yuqian Zheng,Mariana-Iuliana Georgescu

Main category: cs.CV

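TL;DR: This paper proposes a new Composed Video Retrieval (CoVR) framework that uses vision-language models (VLMs) and a new cross-attention module, X-Aligner, trained in stages to strengthen the multimodal query representation; it achieves state-of-the-art performance on Webvid-CoVR and strong zero-shot generalization on CIR tasks.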

Details

Motivation: Existing CoVR frameworks usually fuse multimodal inputs in a single stage and gain little over the baseline, so more effective multimodal alignment and representation is needed. Method: Proposes a two-stage training framework built on VLMs (BLIP and BLIP-2), introduces the cross-attention module X-Aligner for progressive fusion of visual and textual inputs and alignment with the target video, and adds the caption of the visual query as an extra input to enrich the query representation. Result: Reaches a Recall@1 of 63.93% on Webvid-CoVR-Test, the current best, and shows strong zero-shot transfer on CIRCO and Fashion-IQ. Conclusion: Staged finetuning combined with X-Aligner and an auxiliary caption improves CoVR performance and generalizes across tasks (CIR), supporting the potential of VLMs for composed retrieval. Abstract: Composed Video Retrieval (CoVR) facilitates video retrieval by combining visual and textual queries. However, existing CoVR frameworks typically fuse multimodal inputs in a single stage, achieving only marginal gains over the initial baseline. To address this, we propose a novel CoVR framework that leverages the representational power of Vision Language Models (VLMs). Our framework incorporates a novel cross-attention module X-Aligner, composed of cross-attention layers that progressively fuse visual and textual inputs and align their multimodal representation with that of the target video. To further enhance the representation of the multimodal query, we incorporate the caption of the visual query as an additional input. The framework is trained in two stages to preserve the pretrained VLM representation. In the first stage, only the newly introduced module is trained, while in the second stage, the textual query encoder is also fine-tuned. We implement our framework on top of the BLIP-family architecture, namely BLIP and BLIP-2, and train it on the Webvid-CoVR data set. In addition to in-domain evaluation on Webvid-CoVR-Test, we perform zero-shot evaluations on the Composed Image Retrieval (CIR) data sets CIRCO and Fashion-IQ. Our framework achieves state-of-the-art performance on CoVR, obtaining a Recall@1 of 63.93% on Webvid-CoVR-Test, and demonstrates strong zero-shot generalization on CIR tasks.
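
The headline number above is a Recall@1 score. As a reminder of what that metric measures, here is a minimal sketch of Recall@K for retrieval with one ground-truth target per query. The function name and input layout are illustrative assumptions, not the paper's evaluation code.

```python
def recall_at_k(ranked_ids, target_ids, k=1):
    """Fraction of queries whose target appears among the top-k retrieved items.

    ranked_ids: list of lists, retrieved candidate ids per query, best first.
    target_ids: list holding the single ground-truth id for each query.
    """
    if len(ranked_ids) != len(target_ids):
        raise ValueError("one target per query is required")
    if not target_ids:
        return 0.0
    hits = sum(1 for ranked, target in zip(ranked_ids, target_ids)
               if target in ranked[:k])
    return hits / len(target_ids)
```

With three queries whose targets are ranked first, second, and not retrieved at all, `recall_at_k(..., k=1)` gives 1/3 and `recall_at_k(..., k=2)` gives 2/3.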

[79] A Lightweight Medical Image Classification Framework via Self-Supervised Contrastive Learning and Quantum-Enhanced Feature Modeling

Jingsong Xia,Siqi Wang

Main category: cs.CV

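TL;DR: This paper proposes a lightweight medical image classification framework that combines self-supervised contrastive learning with quantum-enhanced feature modeling to achieve strong classification performance with few labels and limited compute.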

Details

Motivation: Address scarce labels, limited computational resources, and weak model generalization in medical image analysis. Method: Uses MobileNetV2 as a lightweight backbone pretrained on unlabeled data in a SimCLR-style self-supervised manner; embeds a parameterized quantum circuit (PQC) as a quantum feature enhancement module to form a hybrid classical-quantum architecture; and finetunes on a small amount of labeled data. Result: With only about 2-3 million parameters and low computational cost, the method consistently outperforms classical baselines without self-supervised learning or quantum enhancement on Accuracy, AUC, and F1-score; feature visualization shows improved discriminability and representation stability. Conclusion: The method offers a practical and forward-looking solution for high-performance medical AI in resource-constrained settings. Abstract: Intelligent medical image analysis is essential for clinical decision support but is often limited by scarce annotations, constrained computational resources, and suboptimal model generalization. To address these challenges, we propose a lightweight medical image classification framework that integrates self-supervised contrastive learning with quantum-enhanced feature modeling. MobileNetV2 is employed as a compact backbone and pretrained using a SimCLR-style self-supervised paradigm on unlabeled images. A lightweight parameterized quantum circuit (PQC) is embedded as a quantum feature enhancement module, forming a hybrid classical-quantum architecture, which is subsequently fine-tuned on limited labeled data. Experimental results demonstrate that, with only approximately 2-3 million parameters and low computational cost, the proposed method consistently outperforms classical baselines without self-supervised learning or quantum enhancement in terms of Accuracy, AUC, and F1-score. Feature visualization further indicates improved discriminability and representation stability. Overall, this work provides a practical and forward-looking solution for high-performance medical artificial intelligence under resource-constrained settings.

[80] Boundary and Position Information Mining for Aerial Small Object Detection

Rongxin Huang,Guangfeng Lin,Wenbo Zhou,Zhirong Li,Wenhuan Wu

Main category: cs.CV

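TL;DR: This paper proposes a boundary and position information mining (BPIM) framework whose modules (PIG, BIG, CSF, TFF, AWF) fuse boundary, position, and multi-scale features to improve small object detection in UAV imagery. It outperforms YOLOv5-P2 on the VisDrone2021, DOTA1.0, and WiderPerson datasets and reaches state-of-the-art level.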

Details

Motivation: Small object detection in UAV imagery faces scale imbalance and blurred edges, so boundary and position information must be mined effectively. Method: Proposes the BPIM framework, comprising position information guidance (PIG), boundary information guidance (BIG), cross scale fusion (CSF), three feature fusion (TFF), and adaptive weight fusion (AWF) modules, combining attention mechanisms with cross-scale feature fusion strategies. Result: On VisDrone2021, DOTA1.0, and WiderPerson, BPIM outperforms YOLOv5-P2 and is competitive with state-of-the-art methods at a manageable computational cost. Conclusion: By fusing boundary, position, and multi-scale information, BPIM improves small object detection while balancing accuracy and efficiency. Abstract: Unmanned Aerial Vehicle (UAV) applications have become increasingly prevalent in aerial photography and object recognition. However, accurately capturing small targets in object detection remains a major challenge because of imbalanced scales and blurred edges. To address these issues, a boundary and position information mining (BPIM) framework is proposed for capturing object edge and location cues. The proposed BPIM includes a position information guidance (PIG) module for obtaining location information, a boundary information guidance (BIG) module for extracting object edges, a cross scale fusion (CSF) module for gradually assembling shallow-layer image features, a three feature fusion (TFF) module for progressively combining position and boundary information, and an adaptive weight fusion (AWF) module for flexibly merging deep-layer semantic features. BPIM can therefore integrate boundary, position, and scale information in an image for small object detection using attention mechanisms and cross-scale feature fusion strategies. Furthermore, BPIM not only improves the discrimination of contextual features through adaptive weight fusion with boundary information, but also enhances small object perception through cross-scale position fusion. On the VisDrone2021, DOTA1.0, and WiderPerson datasets, experimental results show that BPIM outperforms the YOLOv5-P2 baseline and is competitive with state-of-the-art methods at a comparable computational load.

[81] SCHIGAND: A Synthetic Facial Generation Mode Pipeline

Ananya Kadali,Sunnie Jehan-Morrison,Orasiki Wellington,Barney Evans,Precious Durojaiye,Richard Guest

Main category: cs.CV

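TL;DR: This paper presents SCHIGAND, a synthetic face generation method that combines StyleCLIP, HyperStyle, InterfaceGAN, and diffusion models to improve realism and diversity while preserving identity; it targets biometric testing and is validated with ArcFace.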

Details

Motivation: Facial datasets are hard to obtain under privacy regulations, data scarcity, and ethical concerns, which calls for high-quality synthetic images that balance realism, diversity, and identity preservation. Method: Builds the SCHIGAND synthetic face generation pipeline, which integrates StyleCLIP, HyperStyle, InterfaceGAN, and diffusion models to strengthen identity preservation while controlling intra-class variation and inter-class distinctiveness. Result: In the ArcFace evaluation, SCHIGAND-generated datasets achieve a better balance between image quality and diversity than prior generative models and can partly replace real data for facial biometrics. Conclusion: SCHIGAND offers a feasible path to privacy-compliant, scalable synthetic facial dataset generation and may supplement or even replace real data in biometric applications. Abstract: The growing demand for diverse and high-quality facial datasets for training and testing biometric systems is challenged by privacy regulations, data scarcity, and ethical concerns. Synthetic facial images offer a potential solution, yet existing generative models often struggle to balance realism, diversity, and identity preservation. This paper presents SCHIGAND, a novel synthetic face generation pipeline integrating StyleCLIP, HyperStyle, InterfaceGAN, and Diffusion models to produce highly realistic and controllable facial datasets. SCHIGAND enhances identity preservation while generating realistic intra-class variations and maintaining inter-class distinctiveness, making it suitable for biometric testing. The generated datasets were evaluated using ArcFace, a leading facial verification model, to assess their effectiveness in comparison to real-world facial datasets. Experimental results demonstrate that SCHIGAND achieves a balance between image quality and diversity, addressing key limitations of prior generative models. This research highlights the potential of SCHIGAND to supplement and, in some cases, replace real data for facial biometric applications, paving the way for privacy-compliant and scalable solutions in synthetic dataset generation.

[82] Edge-Aware Image Manipulation via Diffusion Models with a Novel Structure-Preservation Loss

Minsu Gong,Nuri Ryu,Jungseul Ok,Sunghyun Cho

Main category: cs.CV

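TL;DR: This paper proposes a training-free Structure Preservation Loss (SPL) that uses local linear models to quantify structural differences between the input and edited images; combined with post-processing, a masking strategy, and a color preservation loss, it significantly improves how well latent diffusion models preserve edge structure during image editing.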

Details

Motivation: Existing image editing methods based on latent diffusion models (LDMs) struggle to preserve pixel-level edge structure, which hurts tasks such as photorealistic style transfer and tone adjustment. Method: Proposes the Structure Preservation Loss (SPL) and integrates it directly into the LDM generative process, complemented by post-processing to reduce decoding distortions, a masking strategy for precise edit localization, and a color preservation loss that protects hues in unedited regions. Result: Experiments show that SPL significantly improves structural fidelity and achieves state-of-the-art performance in latent-diffusion-based image editing. Conclusion: SPL is an efficient, plug-and-play structure-preservation mechanism that improves the geometric and edge consistency of LDM edits without additional training. Abstract: Recent advances in image editing leverage latent diffusion models (LDMs) for versatile, text-prompt-driven edits across diverse tasks. Yet, maintaining pixel-level edge structures, which is crucial for tasks such as photorealistic style transfer or image tone adjustment, remains a challenge for latent-diffusion-based editing. To overcome this limitation, we propose a novel Structure Preservation Loss (SPL) that leverages local linear models to quantify structural differences between input and edited images. Our training-free approach integrates SPL directly into the diffusion model's generative process to ensure structural fidelity. This core mechanism is complemented by a post-processing step to mitigate LDM decoding distortions, a masking strategy for precise edit localization, and a color preservation loss to preserve hues in unedited areas. Experiments confirm SPL enhances structural fidelity, delivering state-of-the-art performance in latent-diffusion-based image editing. Our code will be publicly released at https://github.com/gongms00/SPL.

[83] Reliable Brain Tumor Segmentation Based on Spiking Neural Networks with Efficient Training

Aurora Pia Ghiardelli,Guangzhi Tang,Tao Sun

Main category: cs.CV

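TL;DR: This paper proposes a reliable, energy-efficient 3D brain tumor segmentation framework based on spiking neural networks (SNNs) that uses a multi-view ensemble and FPTT training, achieving high accuracy, well-calibrated uncertainty estimates, and an 87% reduction in computation on the BraTS datasets.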

Details

Motivation: Address the high training cost, high energy use, and limited reliability of conventional SNNs for medical image semantic segmentation, and advance their use in low-power medical IoT and point-of-care diagnostic systems. Method: Builds a multi-view ensemble of SNN models over sagittal, coronal, and axial slices to provide voxel-wise uncertainty estimation, and adopts Forward Propagation Through Time (FPTT) to make training more efficient and greatly reduce computational cost. Result: Achieves competitive segmentation accuracy and well-calibrated uncertainty on BraTS 2017 and BraTS 2023, with an 87% reduction in FLOPs. Conclusion: The SNN framework combines reliability, low energy use, and robustness, offering a feasible lightweight segmentation solution for resource-constrained medical edge devices. Abstract: We propose a reliable and energy-efficient framework for 3D brain tumor segmentation using spiking neural networks (SNNs). A multi-view ensemble of sagittal, coronal, and axial SNN models provides voxel-wise uncertainty estimation and enhances segmentation robustness. To address the high computational cost in training SNN models for semantic image segmentation, we employ Forward Propagation Through Time (FPTT), which maintains temporal learning efficiency with significantly reduced computational cost. Experiments on the Multimodal Brain Tumor Segmentation Challenges (BraTS 2017 and BraTS 2023) demonstrate competitive accuracy, well-calibrated uncertainty, and an 87% reduction in FLOPs, underscoring the potential of SNNs for reliable, low-power medical IoT and Point-of-Care systems.
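
The abstract does not say how the three view-specific models are aggregated or which uncertainty measure is used. One common recipe, shown below purely as an assumed sketch, is to average the per-view class probabilities for each voxel and report the entropy of the averaged distribution as the voxel's uncertainty. The function name and input layout are hypothetical.

```python
import math


def ensemble_voxel(probs_per_view):
    """Average class probabilities from several views for one voxel.

    probs_per_view: list of per-view probability vectors (each sums to 1),
    e.g. one each from the sagittal, coronal, and axial models.
    Returns (mean_probs, predicted_class, entropy_in_nats).
    """
    n_views = len(probs_per_view)
    n_classes = len(probs_per_view[0])
    mean = [sum(view[c] for view in probs_per_view) / n_views
            for c in range(n_classes)]
    # Predictive entropy: 0 when the views agree confidently, larger when they disagree.
    entropy = -sum(p * math.log(p) for p in mean if p > 0.0)
    pred = max(range(n_classes), key=lambda c: mean[c])
    return mean, pred, entropy
```

Voxels where the three views disagree receive a flatter averaged distribution and therefore a higher entropy, which is the kind of signal a voxel-wise uncertainty map would surface.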

[84] ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction

Ming Li,Hui Shan,Kai Zheng,Chentao Shen,Siyu Liu,Yanwei Fu,Zhen Chen,Xiangru Huang

Main category: cs.CV

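TL;DR: This paper proposes ReWeaver, a framework that reconstructs 3D garments with accurate topology and sewing patterns from sparse multi-view RGB images, addressing the weaknesses of existing methods in modeling topology and structure; it also builds GCD-TS, a large-scale synthetic dataset for training and evaluation.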

Details

Motivation: Existing garment reconstruction methods based on unstructured representations (such as 3D Gaussian Splats) cannot accurately model garment topology and sewing structure, so their outputs are unsuitable for downstream tasks such as high-fidelity physical simulation. Method: Proposes ReWeaver, which takes sparse multi-view RGB images (as few as four views) and jointly predicts seams, panels, and their connectivity in both 2D UV space and 3D space; builds the large-scale synthetic dataset GCD-TS (over 100,000 samples) for training. Result: Significantly outperforms existing methods in topology accuracy, geometry alignment, and seam-panel consistency. Conclusion: ReWeaver achieves structured, topology-accurate 3D garment reconstruction and provides a simulation-ready representation for digital avatars, virtual try-on, and robotic manipulation. Abstract: High-quality 3D garment reconstruction plays a crucial role in mitigating the sim-to-real gap in applications such as digital avatars, virtual try-on and robotic manipulation. However, existing garment reconstruction methods typically rely on unstructured representations, such as 3D Gaussian Splats, and struggle to provide accurate reconstructions of garment topology and sewing structures. As a result, the reconstructed outputs are often unsuitable for high-fidelity physical simulation. We propose ReWeaver, a novel framework for topology-accurate 3D garment and sewing pattern reconstruction from sparse multi-view RGB images. Given as few as four input views, ReWeaver predicts seams and panels as well as their connectivities in both the 2D UV space and the 3D space. The predicted seams and panels align precisely with the multi-view images, yielding structured 2D-3D garment representations suitable for 3D perception, high-fidelity physical simulation, and robotic manipulation. To enable effective training, we construct a large-scale dataset GCD-TS, comprising multi-view RGB images, 3D garment geometries, textured human body meshes and annotated sewing patterns. The dataset contains over 100,000 synthetic samples covering a wide range of complex geometries and topologies. Extensive experiments show that ReWeaver consistently outperforms existing methods in terms of topology accuracy, geometry alignment and seam-panel consistency.

[85] Affinity Contrastive Learning for Skeleton-based Human Activity Understanding

Hongda Liu,Yunfan Liu,Min Ren,Lin Sui,Yunlong Wang,Zhenan Sun

Main category: cs.CV

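TL;DR: This paper proposes ACLNet, an affinity-based contrastive learning network that builds activity superclasses and uses a dynamic temperature schedule to improve feature discrimination for tasks such as skeleton-based action recognition.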

Details

Motivation: Existing skeleton-based human activity understanding methods do not fully exploit structural inter-class similarities in contrastive learning and overlook the impact of anomalous positive samples. Method: Proposes an affinity metric that refines similarity measurements and forms activity superclasses; introduces a dynamic temperature schedule that adapts the penalty strength for different superclasses; and uses a margin-based contrastive strategy to better separate hard positive and negative samples within classes. Result: On NTU RGB+D 60/120, Kinetics-Skeleton, PKU-MMD, FineGYM, and CASIA-B, ACLNet shows superior performance in skeleton-based action recognition, gait recognition, and person re-identification. Conclusion: By modeling inter-class affinity and refining the contrastive learning mechanism, ACLNet improves feature discrimination and generalization for skeleton-driven tasks. Abstract: In skeleton-based human activity understanding, existing methods often adopt the contrastive learning paradigm to construct a discriminative feature space. However, many of these approaches fail to exploit the structural inter-class similarities and overlook the impact of anomalous positive samples. In this study, we introduce ACLNet, an Affinity Contrastive Learning Network that explores the intricate clustering relationships among human activity classes to improve feature discrimination. Specifically, we propose an affinity metric to refine similarity measurements, thereby forming activity superclasses that provide more informative contrastive signals. A dynamic temperature schedule is also introduced to adaptively adjust the penalty strength for various superclasses. In addition, we employ a margin-based contrastive strategy to improve the separation of hard positive and negative samples within classes. Extensive experiments on NTU RGB+D 60, NTU RGB+D 120, Kinetics-Skeleton, PKU-MMD, FineGYM, and CASIA-B demonstrate the superiority of our method in skeleton-based action recognition, gait recognition, and person re-identification. The source code is available at https://github.com/firework8/ACLNet.

[86] CER-HV: A CER-Based Human-in-the-Loop Framework for Cleaning Datasets Applied to Arabic-Script HTR

Sana Al-azzawi,Elisa Barney,Marcus Liwicki

Main category: cs.CV

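TL;DR: This paper proposes CER-HV, a framework that combines noise detection based on character error rate (CER) with human verification to substantially improve the quality of Arabic-script handwritten text recognition (HTR) datasets, and reports state-of-the-art results on several datasets.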

Details

Motivation: Arabic-script HTR lags behind Latin-script HTR, largely because existing datasets have poor label quality and no systematic cleaning method. Method: Proposes the CER-HV framework: 1) CER-driven noise detection using a carefully configured CRNN with early stopping; and 2) a human-in-the-loop (HITL) step that verifies high-risk samples. Result: Identifies label errors (transcription, segmentation, orientation, and non-text content) with up to 90% precision on Muharaf and 80-86% on PHTI; the CRNN achieves state-of-the-art CER on five of six datasets (for example, 8.45% on KHATT); CER-HV lowers CER by 0.3-1.8 percentage points. Conclusion: Data quality is a key bottleneck for Arabic-script HTR, and CER-HV is an efficient, general data-cleaning framework that can transfer to other text recognition tasks. Abstract: Handwritten text recognition (HTR) for Arabic-script languages still lags behind Latin-script HTR, despite recent advances in model architectures, datasets, and benchmarks. We show that data quality is a significant limiting factor in many published datasets and propose CER-HV (CER-based Ranking with Human Verification) as a framework to detect and clean label errors. CER-HV combines a CER-based noise detector, built on a carefully configured Convolutional Recurrent Neural Network (CRNN) with early stopping to avoid overfitting noisy samples, and a human-in-the-loop (HITL) step that verifies high-ranking samples. The framework reveals that several existing datasets contain previously underreported problems, including transcription, segmentation, orientation, and non-text content errors. These have been identified with up to 90 percent precision in the Muharaf and 80-86 percent in the PHTI datasets. We also show that our CRNN achieves state-of-the-art performance across five of the six evaluated datasets, reaching 8.45 percent Character Error Rate (CER) on KHATT (Arabic), 8.26 percent on PHTI (Pashto), 10.66 percent on Ajami, and 10.11 percent on Muharaf (Arabic), all without any data cleaning. We establish a new baseline of 11.3 percent CER on the PHTD (Persian) dataset. Applying CER-HV improves the evaluation CER by 0.3-0.6 percent on the cleaner datasets and 1.0-1.8 percent on the noisier ones. Although our experiments focus on documents written in Arabic-script languages, including Arabic, Persian, Urdu, Ajami, and Pashto, the framework is general and can be applied to other text recognition datasets.
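
The ranking idea behind CER-HV can be illustrated with a small sketch: compute the character error rate between each sample's label and a recognizer's prediction, then send the highest-CER samples to a human for verification. The code below assumes the predictions already exist (the paper obtains them from its CRNN). The function names and the `(sample_id, label, prediction)` layout are hypothetical, and the paper's actual ranking procedure may differ.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def rank_for_review(samples, top_k):
    """Return the ids of the top_k samples with the highest CER, worst first.

    samples: iterable of (sample_id, label_text, model_prediction).
    """
    scored = [(cer(label, pred), sid) for sid, label, pred in samples]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [sid for _, sid in scored[:top_k]]
```

A high CER does not prove that the label is wrong, since the model may simply have misread a hard sample, which is why the framework pairs the ranking with a human verification step instead of deleting samples automatically.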

[87] Using Shadows in Circular Synthetic Aperture Sonar Imaging for Target Analysis

Yann Le Gall,Nicolas Burlet,Mathieu Simon,Fabien Novella,Samantha Dugelay,Jean-Philippe Malkasse

Main category: cs.CV

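TL;DR: This paper proposes a method for recovering target shadow information from circular synthetic aperture sonar (CSAS) data: sub-aperture filtering and fixed focus shadow enhancement (FFSE) yield sharp shadows, and space carving then supports 3D reconstruction, improving underwater target recognition and analysis.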

Details

Motivation: CSAS offers 360° azimuth coverage and high-resolution imaging, but its circular trajectory causes shadows to be lost, even though shadows carry valuable complementary information for recognizing target shape. Method: Sub-aperture filtering generates images from multiple viewpoints, fixed focus shadow enhancement (FFSE) sharpens the shadows, and an interactive interface visualizes them; finally, space carving infers the target's 3D shape from the segmented shadows. Result: Shadow information was recovered from CSAS data and exploited, demonstrating its effectiveness and potential for target analysis and 3D reconstruction. Conclusion: Shadow information can strengthen CSAS for underwater target recognition and 3D reconstruction, and the proposed method offers a new approach for applications such as mine warfare. Abstract: Circular Synthetic Aperture Sonar (CSAS) provides a 360° azimuth view of the seabed, surpassing the limited aperture and mono-view image of conventional side-scan SAS. This makes CSAS a valuable tool for target recognition in mine warfare, where diversity of viewpoint is essential for reducing false alarms. CSAS processing typically produces a very high-resolution two-dimensional image. However, the parallax introduced by the circular displacement of the illuminator fills in the shadow regions, and the shadow cast by an object on the seafloor is lost in favor of azimuth coverage and resolution. Yet the shadows provide complementary information on target shape that is useful for target recognition. In this paper, we explore a way to retrieve shadow information from CSAS data to improve target analysis and carry out 3D reconstruction. Sub-aperture filtering is used to get a collection of images at various points of view along the circular trajectory, and fixed focus shadow enhancement (FFSE) is applied to obtain sharp shadows. An interactive interface is also proposed to allow human operators to visualize these shadows along the circular trajectory. A space-carving reconstruction method is applied to infer the 3D shape of the object from the segmented shadows. The results demonstrate the potential of shadows in circular SAS for improving target analysis and 3D reconstruction.

[88] A Step to Decouple Optimization in 3DGS

Renjie Ding,Yaonan Wang,Min Liu,Jialin Zhu,Jiazheng Wang,Jiahao Zhao,Wenting Shen,Feixiang He,Xiang Che

Main category: cs.CV

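TL;DR: This paper revisits the optimization of 3D Gaussian Splatting (3DGS), identifies two overlooked problems (update step coupling and gradient coupling in the moment), proposes decoupled components (Sparse Adam, Re-State Regularization, Decoupled Attribute Regularization), and finally designs an improved optimizer, AdamW-GS, that improves both efficiency and representation quality.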

Details

Motivation: Although 3DGS borrows optimization methods from deep neural networks (such as Adam), its physical meaning and structural design are distinctive, and existing optimization suffers from under-explored update step coupling and gradient coupling that hurt optimization efficiency and representation quality. Method: The work analyzes the key coupling problems in 3DGS optimization and decomposes the process into three components: Sparse Adam (sparsified parameter updates), Re-State Regularization (resetting optimizer state to mitigate state rescaling), and Decoupled Attribute Regularization; based on extensive experiments, it then re-couples the beneficial components into the AdamW-GS optimizer. Result: The decoupled components are validated under the 3DGS and 3DGS-MCMC frameworks; AdamW-GS achieves better optimization efficiency and reconstruction quality at the same time. Conclusion: Standard DNN optimizers should not be applied to 3DGS unchanged; the optimizer should be tailored to its explicit geometric nature, and the decouple, analyze, re-couple workflow is an effective way to improve optimization performance. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time novel view synthesis. Because 3DGS is an explicit representation optimized through gradient propagation among primitives, it adopts optimization practices widely accepted in deep neural networks (DNNs), such as synchronous weight updating and Adam with adaptive gradients. However, considering the physical significance and specific design of 3DGS, there are two overlooked details in its optimization: (i) update step coupling, which induces optimizer state rescaling and costly attribute updates outside the viewpoints, and (ii) gradient coupling in the moment, which may lead to under- or over-effective regularization. Nevertheless, such complex coupling is under-explored. After revisiting the optimization of 3DGS, we take a step to decouple it and recompose the process into: Sparse Adam, Re-State Regularization and Decoupled Attribute Regularization. Through a large number of experiments under the 3DGS and 3DGS-MCMC frameworks, our work provides a deeper understanding of these components. Finally, based on the empirical analysis, we re-design the optimization and propose AdamW-GS by re-coupling the beneficial components, under which better optimization efficiency and representation effectiveness are achieved simultaneously.
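
To make the idea of a sparse update concrete, here is a minimal sketch of an Adam step that touches only the primitives visible in the current view and leaves the parameters and moment estimates of all others unchanged. It is an illustration under stated assumptions, not the paper's Sparse Adam: the per-primitive step counter used for bias correction, the list-of-lists layout, and the function name `sparse_adam_step` are all hypothetical.

```python
import math


def sparse_adam_step(params, grads, m, v, steps, visible,
                     lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """Apply one Adam update in place, restricted to the indices in `visible`.

    params, grads, m, v: one list of floats per primitive.
    steps: per-primitive update counts (an assumption of this sketch).
    """
    for i in visible:
        steps[i] += 1
        for k in range(len(params[i])):
            g = grads[i][k]
            m[i][k] = b1 * m[i][k] + (1 - b1) * g        # first moment
            v[i][k] = b2 * v[i][k] + (1 - b2) * g * g    # second moment
            m_hat = m[i][k] / (1 - b1 ** steps[i])       # bias correction
            v_hat = v[i][k] / (1 - b2 ** steps[i])
            params[i][k] -= lr * m_hat / (math.sqrt(v_hat) + eps)
```

In a dense Adam step, primitives outside the view would still have their moments decayed and their attributes moved by stale momentum; skipping them avoids both the cost and the side effect.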

[89] Automated Road Crack Localization to Guide Highway Maintenance

Steffen Knoblauch,Ram Kumar Muthusamy,Pedram Ghamisi,Alexander Zipf

Main category: cs.CV

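TL;DR: This study proposes a highway crack detection framework built on open-source data (airborne imagery and OpenStreetMap). It uses YOLOv11 for crack localization and builds a Swiss Relative Highway Crack Density (RHCD) index to guide nationwide maintenance decisions; the model reaches an F1-score of 0.84 on the crack class, and RHCD correlates only weakly with temperature amplitude and traffic volume, which suggests it carries independent information.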

Details

Motivation: Climate change is increasing stress on road pavements and raising maintenance costs, so efficient, targeted maintenance strategies are needed; the potential of open-source data for highway maintenance remains largely untapped. Method: Airborne imagery and OpenStreetMap data are combined to fine-tune YOLOv11 for highway crack localization; from the detections, a Swiss Relative Highway Crack Density (RHCD) index is computed and its correlation with Long-term Land Surface Temperature Amplitudes (LT-LST-A) and Traffic Volume (TV) is analyzed. Result: The crack model achieves an F1-score of 0.84 for the positive class (crack) and 0.97 for the negative class (no crack); the RHCD index correlates weakly and negatively with LT-LST-A (r = -0.05) and weakly and positively with TV (r = 0.17); high RHCD values cluster near urban centers and intersections, which matches expectations. Conclusion: The RHCD index can serve as an independent, useful indicator for maintenance decisions, and fusing open-source data can make public-sector infrastructure management more efficient and innovative. Abstract: Highway networks are crucial for economic prosperity. Climate change-induced temperature fluctuations are exacerbating stress on road pavements, resulting in elevated maintenance costs. This underscores the need for targeted and efficient maintenance strategies. This study investigates the potential of open-source data to guide highway infrastructure maintenance. The proposed framework integrates airborne imagery and OpenStreetMap (OSM) to fine-tune YOLOv11 for highway crack localization. To demonstrate the framework's real-world applicability, a Swiss Relative Highway Crack Density (RHCD) index was calculated to inform nationwide highway maintenance. The crack classification model achieved an F1-score of $0.84$ for the positive class (crack) and $0.97$ for the negative class (no crack). The Swiss RHCD index exhibited weak correlations with Long-term Land Surface Temperature Amplitudes (LT-LST-A) (Pearson's $r = -0.05$) and Traffic Volume (TV) (Pearson's $r = 0.17$), underlining the added value of this novel index for guiding maintenance over other data. Significantly high RHCD values were observed near urban centers and intersections, providing contextual validation for the predictions. These findings highlight the value of open-source data sharing to drive innovation, ultimately enabling more efficient solutions in the public sector.

[90] Curated endoscopic retrograde cholangiopancreatography images dataset

Alda João Andrade,Mónica Martins,André Ferreira,Tarcísio Araújo,Luís Lopes,Victor Alves

Main category: cs.CV

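TL;DR: This paper presents a large ERCP endoscopic image dataset annotated and reviewed by specialist physicians. It contains nearly 40,000 images, 5,519 of them labeled, and aims to advance research on and application of AI for the automatic diagnosis of biliary and pancreatic diseases.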

Details

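Motivation: Public ERCP datasets are scarce, which limits the development and application of AI in this area, so a high-quality, large-scale, professionally annotated dataset is needed. Method: The collection comprises 19,018 raw and 19,317 processed ERCP images from 1,602 patients; 5,519 images were manually annotated by two experienced gastroenterologists (each performing more than 400 procedures a year) and reviewed by an expert with more than 20 years of experience; a classification experiment validates the dataset. Result: The outcome is a comparatively large, consistently annotated ERCP image dataset with multi-level expert review, and the classification experiment supports its usability and validity. Conclusion: The dataset can serve as an important benchmark resource for automatic ERCP analysis and AI-assisted diagnosis of biliary and pancreatic diseases, and may help advance algorithm development and clinical translation. Abstract: Endoscopic Retrograde Cholangiopancreatography (ERCP) is a key procedure in the diagnosis and treatment of biliary and pancreatic diseases. Artificial intelligence has been pointed to as one way to automate diagnosis. However, public ERCP datasets are scarce, which limits the use of such approaches. Therefore, this study aims to help fill this gap by providing a large and curated dataset. The collection is composed of 19,018 raw images and 19,317 processed images from 1,602 patients. 5,519 images are labeled, which provides a ready-to-use dataset. All images were manually inspected and annotated by two gastroenterologists with more than 5 years of experience and reviewed by another gastroenterologist with more than 20 years of experience, all of whom perform more than 400 ERCP procedures annually. The utility and validity of the dataset are demonstrated by a classification experiment. This collection aims to provide or contribute to a benchmark in automatic ERCP analysis and diagnosis of biliary and pancreatic diseases.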

[91] Flow Matching for Probabilistic Monocular 3D Human Pose Estimation

Cuong Le,Pavló Melnyk,Bastian Wandt,Mårten Wadenbäck

Main category: cs.CV

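TL;DR: This paper proposes FMPose, a probabilistic 3D human pose estimation method based on flow matching that models the 2D-to-3D pose mapping with optimal transport and graph convolutional networks and outperforms prior state-of-the-art methods on several benchmarks.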

Details

Motivation: Monocular 3D human pose estimation is highly ill-posed because of depth ambiguity, and conventional methods produce incorrect yet overconfident estimates. Method: FMPose is a flow matching-based generative model that learns, via continuous normalizing flows, the optimal transport from a simple source distribution to the distribution of plausible 3D poses; the 2D cues are modeled as the condition by graph convolutional networks, which use learnable connections between body joints as the graph structure for feature aggregation. Result: On the three common benchmarks Human3.6M, MPI-INF-3DHP, and 3DPW, FMPose shows major improvements over current state-of-the-art methods, and it generates poses faster and more accurately than diffusion-based methods. Conclusion: By combining optimal transport with graph-structured modeling, FMPose improves both the accuracy and efficiency of probabilistic 3D pose estimation and supports the suitability of flow matching for this task. Abstract: Recovering 3D human poses from a monocular camera view is a highly ill-posed problem due to the depth ambiguity. Earlier studies on 3D human pose lifting from 2D often contain incorrect-yet-overconfident 3D estimations. To mitigate the problem, emerging probabilistic approaches treat the 3D estimations as a distribution, taking into account the uncertainty measurement of the poses. Falling in a similar category, we propose FMPose, a probabilistic 3D human pose estimation method based on the flow matching generative approach. Conditioned on the 2D cues, the flow matching scheme learns the optimal transport from a simple source distribution to the plausible 3D human pose distribution via continuous normalizing flows. The 2D lifting condition is modeled via graph convolutional networks, leveraging the learnable connections between human body joints as the graph structure for feature aggregation. Compared to diffusion-based methods, FMPose with optimal transport produces faster and more accurate 3D pose generations. Experimental results show major improvements of our FMPose over current state-of-the-art methods on three common benchmarks for 3D human pose estimation, namely Human3.6M, MPI-INF-3DHP and 3DPW.
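
The flow matching objective mentioned above can be sketched in a few lines. The sketch assumes the commonly used straight-line (optimal-transport) conditional path between a source sample `x0` and a target pose `x1`; the abstract does not state which path FMPose uses, and the function names here are hypothetical. The conditioning on 2D cues and the graph convolutional network are left out.

```python
def ot_path_sample(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, and its velocity.

    x_t = (1 - t) * x0 + t * x1, target velocity u = x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    u = [b - a for a, b in zip(x0, x1)]
    return x_t, u


def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    n = len(target_velocity)
    return sum((p - u) ** 2
               for p, u in zip(predicted_velocity, target_velocity)) / n


def euler_integrate(velocity_fn, x0, n_steps):
    """Generate a sample by integrating dx/dt = velocity_fn(x, t) from t=0 to 1."""
    x = list(x0)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        v = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

During training, a network would be fit so that its output at `(x_t, t)` matches `u`; at inference, `euler_integrate` with the trained network as `velocity_fn` turns source noise into a pose hypothesis, and repeating it with different noise gives a set of hypotheses.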

[92] AutoRegressive Generation with B-rep Holistic Token Sequence Representation

Jiahao Li,Yunpeng Bai,Yongkang Dai,Hao Guo,Hongping Gan,Yilei Shi

Main category: cs.CV

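TL;DR: This paper proposes BrepARG, the first approach to encode the geometry and topology of a B-rep jointly as a holistic token sequence, which enables sequence-based autoregressive generation and achieves state-of-the-art performance with a Transformer decoder.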

Details

Motivation: Existing B-rep representation and generation methods rely on graph structures that separate geometric and topological features, so sequence models such as Transformers cannot be applied directly; a representation that models both jointly and fits sequence generation frameworks is needed. Method: BrepARG encodes a B-rep into three types of tokens: geometry tokens and position tokens (representing geometry) and face index tokens (representing topology); the holistic token sequence is built hierarchically by first constructing geometry blocks (faces and edges), then sequencing the blocks, and finally assembling the sequence for the whole B-rep; a decoder-only Transformer with causal masking performs the autoregressive modeling. Result: The method achieves state-of-the-art performance on B-rep generation and shows that representing a B-rep as a holistic token sequence is feasible. Conclusion: BrepARG opens a new direction for B-rep generation and demonstrates the effectiveness and potential of modeling geometry and topology jointly as a sequence. Abstract: Previous representation and generation approaches for the B-rep relied on graph-based representations that disentangle geometric and topological features through decoupled computational pipelines, thereby precluding the application of sequence-based generative frameworks, such as transformer architectures that have demonstrated remarkable performance. In this paper, we propose BrepARG, the first attempt to encode B-rep's geometry and topology into a holistic token sequence representation, enabling sequence-based B-rep generation with an autoregressive architecture. Specifically, BrepARG encodes B-rep into 3 types of tokens: geometry and position tokens representing geometric features, and face index tokens representing topology. Then the holistic token sequence is constructed hierarchically, starting with constructing the geometry blocks (i.e., faces and edges) using the above tokens, followed by geometry block sequencing. Finally, we assemble the holistic sequence representation for the entire B-rep. We also construct a transformer-based autoregressive model that learns the distribution over holistic token sequences via next-token prediction, using a multi-layer decoder-only architecture with causal masking. Experiments demonstrate that BrepARG achieves state-of-the-art (SOTA) performance. BrepARG validates the feasibility of representing B-rep as holistic token sequences, opening new directions for B-rep generation.

[93] CASP: Few-Shot Class-Incremental Learning with CLS Token Attention Steering Prompts

Shuai Huang,Xuhan Lin,Yuwu Lu

Main category: cs.CV

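TL;DR: This paper proposes CLS Token Attention Steering Prompts (CASP), which adds learnable class-shared bias parameters to the Q/K/V projections of the CLS token to modulate self-attention weights explicitly, and combines this with attention perturbation and Manifold Token Mixup to improve generalization and knowledge transfer in few-shot class-incremental learning (FSCIL). It outperforms existing methods without fine-tuning in the incremental phases and with low parameter overhead.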

Details

Motivation: In extreme few-shot class-incremental learning (FSCIL), a model must adapt quickly to new classes from very few samples while avoiding catastrophic forgetting; existing prompt-based methods generalize poorly, so pretrained knowledge should be used during the base session to learn robust feature representations that can be shared across tasks. Method: CASP (1) embeds class-shared trainable biases into the query, key, and value projections of the CLS token, inspired by the CLS token mechanism, to modulate self-attention; (2) adds an attention perturbation strategy; (3) performs Manifold Token Mixup in the shallow feature space, synthesizing potential new-class features to improve generalization and reserve representation capacity. Result: On CUB200, CIFAR100, and ImageNet-R, CASP outperforms state-of-the-art methods in both standard and fine-grained FSCIL settings, requires no fine-tuning during incremental phases, and significantly reduces parameter overhead. Conclusion: By steering CLS token attention in a structured way and increasing the diversity of shallow features, CASP improves knowledge transfer and long-term generalization in few-shot incremental settings and offers a new approach to efficient, lightweight continual learning. Abstract: Few-shot class-incremental learning (FSCIL) presents a core challenge in continual learning, requiring models to rapidly adapt to new classes with very limited samples while mitigating catastrophic forgetting. Recent prompt-based methods, which integrate pretrained backbones with task-specific prompts, have made notable progress. However, under extreme few-shot incremental settings, the model's ability to transfer and generalize becomes critical, and it is thus essential to leverage pretrained knowledge to learn feature representations that can be shared across future categories during the base session. Inspired by the mechanism of the CLS token, which is similar to human attention and progressively filters out task-irrelevant information, we propose the CLS Token Attention Steering Prompts (CASP). This approach introduces class-shared trainable bias parameters into the query, key, and value projections of the CLS token to explicitly modulate the self-attention weights. To further enhance generalization, we also design an attention perturbation strategy and perform Manifold Token Mixup in the shallow feature space, synthesizing potential new class features to improve generalization and reserve the representation capacity for upcoming tasks. Experiments on the CUB200, CIFAR100, and ImageNet-R datasets demonstrate that CASP outperforms state-of-the-art methods in both standard and fine-grained FSCIL settings without requiring fine-tuning during incremental phases and while significantly reducing the parameter overhead.

[94] SLD: Segmentation-Based Landmark Detection for Spinal Ligaments

Lara Blomenkamp,Ivanna Kramer,Sabine Bauer,Theresa Schöche

Main category: cs.CV

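TL;DR: This paper proposes a new method for detecting spinal ligament attachment points that combines shape-based 3D vertebra segmentation with domain-specific rules. It shows high accuracy and strong generalization across all spinal regions, with a mean absolute error of 0.7 mm and a root mean square error of 1.1 mm in validation.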

Details

Motivation: Existing automated detection methods are limited to specific spinal regions or lack accuracy, yet precise identification of ligament attachment points is essential for building reliable biomechanical spine models. Method: The method first performs shape-based segmentation of 3D vertebrae and then applies domain-specific rules to identify the different types of ligament attachment points. Result: Validation on two independent multi-patient spinal datasets gives a mean absolute error (MAE) of 0.7 mm and a root mean square error (RMSE) of 1.1 mm, outperforming existing approaches. Conclusion: The method is accurate, generalizes well across spinal regions, and can improve the reliability of biomechanical spine modeling. Abstract: In biomechanical modeling, the representation of ligament attachments is crucial for a realistic simulation of the forces acting between the vertebrae. These forces are typically modeled as vectors connecting ligament landmarks on adjacent vertebrae, making precise identification of these landmarks a key requirement for constructing reliable spine models. Existing automated detection methods are either limited to specific spinal regions or lack sufficient accuracy. This work presents a novel approach for detecting spinal ligament landmarks, which first performs shape-based segmentation of 3D vertebrae and subsequently applies domain-specific rules to identify different types of attachment points. The proposed method outperforms existing approaches by achieving high accuracy and demonstrating strong generalization across all spinal regions. Validation on two independent spinal datasets from multiple patients yielded a mean absolute error (MAE) of 0.7 mm and a root mean square error (RMSE) of 1.1 mm.
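
The two reported error measures can be related with a short worked sketch. It assumes the per-landmark error is the Euclidean distance between predicted and reference 3D positions in millimetres; the abstract does not state how the error is defined, and the function names are hypothetical.

```python
import math


def landmark_errors(predicted, reference):
    """Euclidean distance between each predicted and reference 3D landmark."""
    return [math.dist(p, r) for p, r in zip(predicted, reference)]


def mae(errors):
    """Mean absolute error: the average of the per-landmark distances."""
    return sum(errors) / len(errors)


def rmse(errors):
    """Root mean square error: penalises large individual errors more."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))
```

RMSE is never smaller than MAE on the same errors, and the gap grows when a few landmarks are far off, so an RMSE of 1.1 mm next to an MAE of 0.7 mm indicates some spread in the per-landmark errors rather than a uniform offset.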

[95] REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion

Xuewei Li,Xinghan Bao,Zhimin Chen,Xi Li

Main category: cs.CV

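TL;DR: This paper proposes REL-SF4PASS, which improves the performance and robustness of panoramic semantic segmentation through the REL depth representation, based on cylindrical coordinates, and Spherical-dynamic Multi-Modal Fusion (SMMF).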

Details

Motivation: Existing panoramic semantic segmentation (PASS) methods do not make full use of the geometry of panoramic images, particularly in how they handle spherical geometry and depth information. Method: The paper proposes the REL depth representation (Rectified Depth, Elevation-Gained Vertical Inclination Angle, and Lateral Orientation Angle) and Spherical-dynamic Multi-Modal Fusion (SMMF), which adapts to different panoramic regions and reduces the breakage caused by unrolling the cylinder side surface in ERP projection. Result: On the Stanford2D3D Panoramic dataset, average mIoU improves by 2.35%, and performance variance under 3D disturbance drops by about 70%. Conclusion: REL-SF4PASS improves the accuracy and robustness of panoramic semantic segmentation and supports the value of cylindrical-coordinate modeling and region-adaptive fusion. Abstract: As an important and challenging problem in computer vision, Panoramic Semantic Segmentation (PASS) aims to give complete scene perception based on an ultra-wide angle of view. Most PASS methods focus on spherical geometry with RGB input or use depth information in its original or HHA format, which does not make full use of panoramic image geometry. To address these shortcomings, we propose REL-SF4PASS with our REL depth representation based on cylindrical coordinates and Spherical-dynamic Multi-Modal Fusion (SMMF). REL is made up of Rectified Depth, Elevation-Gained Vertical Inclination Angle, and Lateral Orientation Angle, which together represent 3D space in cylindrical-coordinate style along with the surface normal direction. SMMF uses different fusion strategies for different regions of the panoramic image, aiming to ensure fusion diversity and to reduce the breakage introduced when the cylinder side surface is unrolled in ERP projection. Experimental results show that REL-SF4PASS considerably improves performance and robustness on the popular Stanford2D3D Panoramic benchmark. It gains a 2.35% average mIoU improvement on all 3 folds and reduces the performance variance by approximately 70% when facing 3D disturbance.

[96] Incorporating Eye-Tracking Signals Into Multimodal Deep Visual Models For Predicting User Aesthetic Experience In Residential Interiors

Chen-Ying Chien,Po-Chih Kuo

Main category: cs.CV

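TL;DR: This paper proposes a dual-branch CNN-LSTM model that fuses visual features with eye-tracking signals to predict aesthetic evaluations of residential interiors. It outperforms existing video baselines on both objective and subjective dimensions and shows that eye-tracking data, pupil responses in particular, contribute differently to different evaluation dimensions.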

Details

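Motivation: Predicting the aesthetic experience of interior spaces is difficult because perception is subjective and visual responses are complex, so better models of how people perceive and evaluate spaces are needed. Method: A dual-branch CNN-LSTM framework processes video frames (visual features) and synchronized eye-tracking signals (gaze points, pupil size, and so on) separately and fuses them at the feature level; the model is trained and evaluated on 224 interior design videos with ratings on 15 aesthetic dimensions and eye-tracking data from 28 participants. Result: The model reaches 72.2% accuracy on objective dimensions (e.g., light) and 66.8% on subjective dimensions (e.g., relaxation), outperforming video baselines; ablations show that pupil responses contribute most to objective assessments, while combining gaze and visual cues improves subjective ones; performance stays comparable when only visual input is available at deployment. Conclusion: Used as privileged information during training, eye-tracking data improve aesthetic prediction, especially on subjective dimensions, while allowing a simplified visual-only input at deployment, which opens a path to practical aesthetic assessment tools for interior design. Abstract: Understanding how people perceive and evaluate interior spaces is essential for designing environments that promote well-being. However, predicting aesthetic experiences remains difficult due to the subjective nature of perception and the complexity of visual responses. This study introduces a dual-branch CNN-LSTM framework that fuses visual features with eye-tracking signals to predict aesthetic evaluations of residential interiors. We collected a dataset of 224 interior design videos paired with synchronized gaze data from 28 participants who rated 15 aesthetic dimensions. The proposed model attains 72.2% accuracy on objective dimensions (e.g., light) and 66.8% on subjective dimensions (e.g., relaxation), outperforming state-of-the-art video baselines and showing clear gains on subjective evaluation tasks. Notably, models trained with eye-tracking retain comparable performance when deployed with visual input alone. Ablation experiments further reveal that pupil responses contribute most to objective assessments, while the combination of gaze and visual cues enhances subjective evaluations. These findings highlight the value of incorporating eye-tracking as privileged information during training, enabling more practical tools for aesthetic assessment in interior design.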

[97] ColorConceptBench: A Benchmark for Probabilistic Color-Concept Understanding in Text-to-Image Models

Chenxi Ruan,Yu Xiao,Yihan Hou,Guosheng Hu,Wei Zeng

Main category: cs.CV

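TL;DR: This paper introduces ColorConceptBench, a benchmark that evaluates how well text-to-image models capture the probabilistic relationship between implicit color concepts and colors. Experiments show that current mainstream T2I models fall clearly short on abstract semantic color associations, and that scaling or guidance does little to fix this.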

Details

Motivation: The ability of text-to-image models to associate implicit concepts with their corresponding colors has not been studied systematically, and no dedicated benchmark exists for it. Method: The authors build ColorConceptBench from 6,369 human annotations covering 1,281 implicit color concepts and evaluate seven leading T2I models through probabilistic color distributions. Result: Current T2I models show low sensitivity to abstract semantic color associations, and this weakness is largely unaffected by common interventions such as model scaling and sampling guidance. Conclusion: Human-like understanding of color semantics requires not only larger models but a fundamental change in how models learn and represent implicit meaning. Abstract: While text-to-image (T2I) models have advanced considerably, their capability to associate colors with implicit concepts remains underexplored. To address the gap, we introduce ColorConceptBench, a new human-annotated benchmark to systematically evaluate color-concept associations through the lens of probabilistic color distributions. ColorConceptBench moves beyond explicit color names or codes by probing how models translate 1,281 implicit color concepts using a foundation of 6,369 human annotations. Our evaluation of seven leading T2I models reveals that current models lack sensitivity to abstract semantics, and crucially, this limitation appears resistant to standard interventions (e.g., scaling and guidance). This demonstrates that achieving human-like color semantics requires not just larger models but a fundamental shift in how models learn and represent implicit meaning.

[98] No Validation, No Problem: Predicting Model Performance from a Single Gradient

Fangzheng Wu,Brian Summa

Main category: cs.CV

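TL;DR: This paper proposes a validation-free checkpoint selection signal that needs only a single forward-backward pass: the Frobenius norm of the classifier-head gradient, ||g||_F. Across models and tasks (image classification, detection/segmentation, diffusion models), it predicts performance well and enables near-optimal checkpoint selection and early stopping.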

Details

Motivation: The goal is to avoid relying on validation labels for checkpoint selection and early stopping, which lowers compute and annotation cost and makes the training pipeline more practical and general. Method: The Frobenius norm of the classifier-head gradient on a single batch, ||dL/dW||_F, is used as a validation-free proxy; it is combined with minimum selection over a tail window, and with head-scale or feature-scale normalization depending on the architecture (CNN or Transformer) for stability. Result: On ImageNet-1k, the method closes most of the gap to the oracle (4.24% ± 2.00% with a universal setup); it generalizes to predicting COCO detection/segmentation mAP and to monitoring diffusion models on CIFAR-10 (positively correlated with MSE, negatively with FID), with overhead well under 0.1% of an epoch. Conclusion: The gradient-norm proxy is a lightweight, drop-in replacement for validation that does not require validation labels and suits checkpoint selection and training monitoring across mainstream models and vision tasks. Abstract: We propose a validation-free checkpointing signal from a single forward-backward pass: the Frobenius norm of the classifier-head gradient on one detached-feature batch, ||g||_F = ||dL/dW||_F. Across ImageNet-1k CNNs and Transformers, this proxy is strongly negatively correlated with Top-1 accuracy and positively correlated with loss. Selecting the checkpoint with the minimum head gradient in a short tail window closes most of the gap to the oracle (4.24% +/- 2.00% with a universal setup, about 1.12% with light per-family tuning). For practical deployment, a head-scale normalization is more stable within classic CNN families (e.g., ResNets), while a feature-scale normalization works well for Transformers and modern CNNs. The same one-batch probe also predicts COCO detection/segmentation mAP. In diffusion (UNet/DDPM on CIFAR-10), it tracks progress and enables near-oracle tail-window selection; it is positively correlated with same-distribution probe MSE and negatively with FID (lower is better), so it can be used as a lightweight, label-free monitor. Validation labels are never used beyond reporting. The probe adds much less than 0.1% of an epoch and works as a drop-in for validation-free checkpoint selection and early stopping.
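
The probe can be sketched for the simplest case, a linear softmax classifier head trained with cross-entropy, where the head gradient has the closed form dL/dW = (1/B) Σ_b (p_b − y_b) f_bᵀ. The sketch below is an illustration, not the authors' code: it assumes the probe batch comes with training labels, omits the head-scale and feature-scale normalizations, and uses hypothetical function names.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def head_grad_fro_norm(W, feats, labels):
    """||dL/dW||_F for a linear head with mean cross-entropy loss.

    W: C x D weight matrix (list of rows); feats: B feature vectors of
    length D, treated as constants (detached); labels: B class indices.
    """
    C, D, B = len(W), len(W[0]), len(feats)
    G = [[0.0] * D for _ in range(C)]
    for f, y in zip(feats, labels):
        p = softmax([sum(W[c][d] * f[d] for d in range(D)) for c in range(C)])
        for c in range(C):
            err = p[c] - (1.0 if c == y else 0.0)
            for d in range(D):
                G[c][d] += err * f[d] / B
    return math.sqrt(sum(g * g for row in G for g in row))


def select_checkpoint(norms, tail_window):
    """Index of the smallest gradient norm among the last tail_window checkpoints."""
    start = max(0, len(norms) - tail_window)
    return min(range(start, len(norms)), key=lambda i: norms[i])
```

In use, `head_grad_fro_norm` would be evaluated once per saved checkpoint on a fixed batch, and `select_checkpoint` would pick the checkpoint to keep; the features are treated as constants, so only the head needs a backward pass.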

[99] GPA-VGGT: Adapting VGGT to Large scale Localization by self-Supervised learning with Geometry and Physics Aware loss

Yangfan Xu,Lilian Zhang,Xiaofeng He,Pengdong Wu,Wenqi Wu,Jun Mao

Main category: cs.CV

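TL;DR: This paper proposes a self-supervised training framework for the Visual Geometry Grounded Transformer (VGGT) that uses sequence-wise geometric constraints and a joint photometric-geometric loss to improve camera localization in large scenes without ground-truth labels.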

Details

Motivation: Existing VGGT models are trained with ground-truth labels, which limits generalization to unlabeled and unseen scenes, so a self-supervised method is needed to improve localization in large-scale environments. Method: The self-supervised framework extends conventional pair-wise geometric relations to sequence-wise geometric constraints; within each image sequence, multiple source frames are sampled and projected onto different target frames to improve temporal feature consistency; photometric consistency and geometric constraints are combined into a joint optimization loss; the VGGT model is trained end to end so that the cross-view attention layers and the camera and depth heads jointly capture multi-view geometry. Result: The model converges within hundreds of iterations and achieves significant improvements on large-scale localization. Conclusion: The proposed self-supervised method reduces the dependence on labeled data, improves the robustness and scalability of VGGT in complex real-world scenes, and offers a new direction for general visual geometry modeling. Abstract: Transformer-based general visual geometry frameworks have shown promising performance in camera pose estimation and 3D scene understanding, and recent Visual Geometry Grounded Transformer (VGGT) models in particular have shown great promise in camera pose estimation and 3D reconstruction. However, these models typically rely on ground truth labels for training, posing challenges when adapting to unlabeled and unseen scenes. In this paper, we propose a self-supervised framework to train VGGT with unlabeled data, thereby enhancing its localization capability in large-scale environments. To achieve this, we extend conventional pair-wise relations to sequence-wise geometric constraints for self-supervised learning. Specifically, in each sequence, we sample multiple source frames and geometrically project them onto different target frames, which improves temporal feature consistency. We formulate physical photometric consistency and geometric constraints as a joint optimization loss to circumvent the requirement for hard labels. By training the model with this proposed method, not only the local and global cross-view attention layers but also the camera and depth heads can effectively capture the underlying multi-view geometry. Experiments demonstrate that the model converges within hundreds of iterations and achieves significant improvements in large-scale localization. Our code will be released at https://github.com/X-yangfan/GPA-VGGT.

[100] Evaluating Large Vision-language Models for Surgical Tool Detection

Nakul Poudel,Richard Simon,Cristian A. Linte

Main category: cs.CV

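TL;DR: This paper evaluates three large vision-language models (Qwen2.5, LLaVA1.5, InternVL3.5) on surgical tool detection and finds that Qwen2.5 performs best in both zero-shot and LoRA fine-tuning settings; it is better than Grounding DINO at instrument recognition, while Grounding DINO is stronger at localization.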

Details

Motivation: Most current surgical AI systems are unimodal and struggle to understand surgical workflows as a whole; general-purpose surgical AI that integrates multimodal information and reasons about scenes in a human-like way is needed. Method: On the GraSP robotic surgery dataset, the three VLMs Qwen2.5, LLaVA1.5, and InternVL3.5 are evaluated on surgical tool detection under zero-shot and parameter-efficient LoRA fine-tuning settings and compared with the open-set detection baseline Grounding DINO. Result: Qwen2.5 has the best detection performance among the evaluated VLMs in both settings; compared with Grounding DINO, it generalizes better zero-shot and performs comparably after fine-tuning; Qwen2.5 is better at recognition, and Grounding DINO is better at localization. Conclusion: Large VLMs, Qwen2.5 in particular, show clear potential for surgical tool detection, supporting the feasibility of moving from multimodal large models toward general-purpose surgical AI. Abstract: Surgery is a highly complex process, and artificial intelligence has emerged as a transformative force in supporting surgical guidance and decision-making. However, the unimodal nature of most current AI systems limits their ability to achieve a holistic understanding of surgical workflows. This highlights the need for general-purpose surgical AI systems capable of comprehensively modeling the interrelated components of surgical scenes. Recent advances in large vision-language models that integrate multimodal data processing offer strong potential for modeling surgical tasks and providing human-like scene reasoning and understanding. Despite their promise, systematic investigations of VLMs in surgical applications remain limited. In this study, we evaluate the effectiveness of large VLMs for the fundamental surgical vision task of detecting surgical tools. Specifically, we investigate three state-of-the-art VLMs, Qwen2.5, LLaVA1.5, and InternVL3.5, on the GraSP robotic surgery dataset under both zero-shot and parameter-efficient LoRA fine-tuning settings. Our results demonstrate that Qwen2.5 consistently achieves superior detection performance in both configurations among the evaluated VLMs. Furthermore, compared with the open-set detection baseline Grounding DINO, Qwen2.5 exhibits stronger zero-shot generalization and comparable fine-tuned performance. Notably, Qwen2.5 shows superior instrument recognition, while Grounding DINO demonstrates stronger localization.

[101] LoL: Longer than Longer, Scaling Video Generation to Hour

Justin Cui,Jie Wu,Ming Li,Tao Yang,Xiaojie Li,Rui Wang,Andrew Bai,Yuanhao Ban,Cho-Jui Hsieh

Main category: cs.CV

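TL;DR: This paper proposes multi-head RoPE jitter, a lightweight, training-free method that addresses the 'sink-collapse' problem caused by a conflict between RoPE and multi-head attention in long autoregressive video generation, enabling high-quality, real-time, streaming video generation of nearly unlimited length (up to 12 hours).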

Details

Motivation: Autoregressive models for long video generation suffer from error accumulation and loss of long-term coherence; attention sink frames reduce the performance decay but introduce a new problem, sink-collapse, in which content repeatedly reverts to the sink frame and causes scene resets and cyclic motion. Method: The analysis traces sink-collapse to an inherent conflict between the periodic structure of RoPE and the multi-head attention mechanism; the proposed training-free multi-head RoPE jitter breaks the homogeneity across attention heads and suppresses long-horizon collapse. Result: Experiments show that the method alleviates sink-collapse while preserving generation quality; it is presented as the first demonstration of real-time, streaming video generation of nearly unlimited length with little quality decay, with continuous videos of up to 12 hours. Conclusion: Multi-head RoPE jitter is a simple, efficient, plug-and-play solution that offers a new way to handle long-term consistency in long video generation and makes unbounded streaming generation more practical. Abstract: Recent research in long-form video generation has shifted from bidirectional to autoregressive models, yet these methods commonly suffer from error accumulation and a loss of long-term coherence. While attention sink frames have been introduced to mitigate this performance decay, they often induce a critical failure mode we term sink-collapse: the generated content repeatedly reverts to the sink frame, resulting in abrupt scene resets and cyclic motion patterns. Our analysis reveals that sink-collapse originates from an inherent conflict between the periodic structure of Rotary Position Embedding (RoPE) and the multi-head attention mechanisms prevalent in current generative models. To address it, we propose a lightweight, training-free approach that effectively suppresses this behavior by introducing multi-head RoPE jitter that breaks inter-head attention homogenization and mitigates long-horizon collapse. Extensive experiments show that our method successfully alleviates sink-collapse while preserving generation quality. To the best of our knowledge, this work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration of this robustness, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.
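
The abstract does not say how the jitter is applied, so the following is only one plausible reading, sketched for intuition: each attention head gets a slightly different RoPE base, so the heads' rotation periods no longer line up at the same positions. The perturbation form (a random multiplicative scale on the base), its magnitude, and all function names are assumptions of this sketch, not the paper's method.

```python
import math
import random


def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE angular frequencies: base ** (-2i / head_dim)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def jittered_head_frequencies(num_heads, head_dim, base=10000.0,
                              jitter=0.05, seed=0):
    """One frequency list per head, each with a randomly rescaled base.

    The +/- jitter rescaling of the base is a hypothetical choice made
    for illustration.
    """
    rng = random.Random(seed)
    per_head = []
    for _ in range(num_heads):
        scale = 1.0 + rng.uniform(-jitter, jitter)
        per_head.append(rope_frequencies(head_dim, base * scale))
    return per_head


def apply_rope(x, position, freqs):
    """Rotate consecutive pairs of x by position * frequency."""
    out = []
    for i, w in enumerate(freqs):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(position * w), math.sin(position * w)
        out += [a * c - b * s, a * s + b * c]
    return out
```

Because the change only affects how positions are encoded at inference time, a perturbation of this kind needs no retraining, which is consistent with the training-free claim.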

[102] Reward-Forcing: Autoregressive Video Generation with Reward Feedback

Jingran Zhang,Ning Li,Yuanhao Ban,Andrew Bai,Justin Cui

Main category: cs.CV

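TL;DR: This paper proposes an autoregressive video generation method guided by reward signals, which avoids dependence on a strong teacher model and improves generation quality and efficiency; it scores 84.92 on VBench, matching and in some cases surpassing bidirectional models.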

Details

Motivation: Existing autoregressive video generation methods depend heavily on teacher models, which limits performance and leaves quality behind bidirectional models; an efficient autoregressive approach that does not depend on a teacher is worth exploring. Method: Reward signals guide the autoregressive generation process, which simplifies training while preserving high visual fidelity and temporal consistency. Result: On standard benchmarks such as VBench, the method performs comparably to existing autoregressive models and in some cases surpasses similarly sized bidirectional models; its VBench total score is 84.92, slightly above the 84.31 of state-of-the-art autoregressive methods that require heterogeneous distillation. Conclusion: Reward-guided autoregressive generation is a more efficient and scalable route that does not depend on a teacher model and balances generation quality with training simplicity. Abstract: While most prior work in video generation relies on bidirectional architectures, recent efforts have sought to adapt these models into autoregressive variants to support near real-time generation. However, such adaptations often depend heavily on teacher models, which can limit performance, particularly in the absence of a strong autoregressive teacher, resulting in output quality that typically lags behind their bidirectional counterparts. In this paper, we explore an alternative approach that uses reward signals to guide the generation process, enabling more efficient and scalable autoregressive generation. By using reward signals to guide the model, our method simplifies training while preserving high visual fidelity and temporal consistency. Through extensive experiments on standard benchmarks, we find that our approach performs comparably to existing autoregressive models and, in some cases, surpasses similarly sized bidirectional models by avoiding constraints imposed by teacher architectures. For example, on VBench, our method achieves a total score of 84.92, closely matching state-of-the-art autoregressive methods that score 84.31 but require significant heterogeneous distillation.

[103] Domain-invariant Mixed-domain Semi-supervised Medical Image Segmentation with Clustered Maximum Mean Discrepancy Alignment

Ba-Thinh Lam,Thanh-Huy Nguyen,Hoang-Thien Nguyen,Quang-Khai Bui-Tran,Nguyen Lan Vi Vu,Phat K. Huynh,Ulas Bagci,Min Xu

Main category: cs.CV

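TL;DR: This paper proposes a domain-invariant framework for mixed-domain semi-supervised medical image segmentation. A Copy-Paste Mechanism (CPM) increases data diversity, and a Cluster Maximum Mean Discrepancy (CMMD) block aligns features across domains to reduce domain shift, giving robust and precise segmentation with few labels and multiple unknown domains.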

Details

Motivation: Real medical image data often come from multiple centers or scanners, so domain labels are unknown, domain gaps are large, and expert annotations are scarce; existing semi-supervised or domain adaptation methods mostly assume a single domain shift or known domain labels and cope poorly with real mixed-domain settings. Method: The domain-invariant mixed-domain semi-supervised segmentation framework has three parts: (1) a Copy-Paste Mechanism (CPM) transfers informative regions across domains to increase data diversity; (2) Cluster Maximum Mean Discrepancy (CMMD) clusters unlabeled features and aligns them with labeled anchors through an MMD objective to learn domain-invariant representations; (3) both are integrated in a teacher-student framework and optimized jointly. Result: On the Fundus and M&Ms benchmarks, the method consistently surpasses existing semi-supervised and domain adaptation methods under very few labels and multiple unknown domain discrepancies, with better segmentation accuracy and robustness. Conclusion: The framework addresses medical image segmentation under mixed domains, few labels, and no domain labels, and offers a feasible option for clinical deployment. Abstract: Deep learning has shown remarkable progress in medical image semantic segmentation, yet its success heavily depends on large-scale expert annotations and consistent data distributions. In practice, annotations are scarce, and images are collected from multiple scanners or centers, leading to mixed-domain settings with unknown domain labels and severe domain gaps. Existing semi-supervised or domain adaptation approaches typically assume either a single domain shift or access to explicit domain indices, which rarely hold in real-world deployment. In this paper, we propose a domain-invariant mixed-domain semi-supervised segmentation framework that jointly enhances data diversity and mitigates domain bias. A Copy-Paste Mechanism (CPM) augments the training set by transferring informative regions across domains, while a Cluster Maximum Mean Discrepancy (CMMD) block clusters unlabeled features and aligns them with labeled anchors via an MMD objective, encouraging domain-invariant representations. Integrated within a teacher-student framework, our method achieves robust and precise segmentation even with very few labeled examples and multiple unknown domain discrepancies. Experiments on Fundus and M&Ms benchmarks demonstrate that our approach consistently surpasses semi-supervised and domain adaptation methods, establishing a potential solution for mixed-domain semi-supervised medical image segmentation.
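
The MMD objective at the core of the alignment step can be sketched as follows. The sketch uses the standard biased estimate of squared MMD with a Gaussian (RBF) kernel; the kernel choice, the bandwidth `sigma`, the averaging over clusters, and the function names are assumptions, since the abstract does not specify them. The clustering itself is taken as given.

```python
import math


def rbf(x, y, sigma=1.0):
    """Gaussian kernel between two feature vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between feature sets X and Y."""
    kxx = sum(rbf(a, b, sigma) for a in X for b in X) / (len(X) * len(X))
    kyy = sum(rbf(a, b, sigma) for a in Y for b in Y) / (len(Y) * len(Y))
    kxy = sum(rbf(a, b, sigma) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2.0 * kxy


def clustered_mmd2(clusters, anchors, sigma=1.0):
    """Average squared MMD between each unlabeled cluster and the labeled anchors.

    clusters: list of feature sets, one per cluster of unlabeled features
    (cluster assignment is assumed to be computed elsewhere).
    """
    return sum(mmd2(c, anchors, sigma) for c in clusters) / len(clusters)
```

Minimizing a quantity of this kind pulls every cluster of unlabeled features toward the labeled anchor distribution, which is what encourages representations that do not depend on the (unknown) source domain.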

[104] VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents

Zirui Wang,Junyi Zhang,Jiaxin Ge,Long Lian,Letian Fu,Lisa Dunlap,Ken Goldberg,XuDong Wang,Ion Stoica,David M. Chan,Sewon Min,Joseph E. Gonzalez

Main category: cs.CV

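TL;DR: This paper introduces VisGym, a suite of 17 environments for evaluating and training vision-language models (VLMs) on multi-step visual interaction tasks that require integrating perception, memory, and action. Experiments show that current frontier models perform poorly on these tasks, with clear bottlenecks in using long context and in visually rendered symbolic tasks, and that explicit goal observations, textual feedback, and exploratory demonstrations improve performance.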

Details

Motivation: Vision-language models (VLMs) lack a systematic evaluation and training framework for multi-step visual interaction, which requires long-horizon coordination of perception, memory, and action, so the limits of their abilities remain unclear. Method: Builds the VisGym suite of 17 diverse environments with flexible control over difficulty, input representation, planning horizon, and feedback; develops multi-step solvers that generate structured demonstrations for supervised finetuning; runs systematic benchmarking and ablation analyses. Result: All frontier VLMs perform poorly on VisGym, with success rates of only 46.6% in the easy configuration and 26.0% in the hard configuration; models fail to use long context effectively, and visually rendered symbolic tasks are substantially harder than their text-only versions; explicit goal observations, textual feedback, and exploratory demonstrations yield consistent gains. Conclusion: VisGym exposes key weaknesses of current VLMs in multi-step visual decision-making and provides a reproducible benchmark and concrete directions for improvement. Abstract: Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to effectively leverage long context, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings for supervised finetuning yield consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: https://visgym.github.io/.
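
The abstract contrasts an unbounded interaction history with truncated windows. The sketch below shows one way such a truncation could look; it does not use VisGym's actual interface, and the function name `truncate_history`, the `max_turns` parameter, and the (observation, action) tuple format are hypothetical.

```python
from collections import deque

def truncate_history(history, max_turns=None):
    """Return the turns shown to the model: all of them, or only the most recent max_turns."""
    if max_turns is None:
        return list(history)  # unbounded history
    # A deque with maxlen keeps only the last max_turns items of the iterable.
    return list(deque(history, maxlen=max_turns))

# Hypothetical six-step episode of (observation, action) pairs.
history = [(f"obs_{t}", f"action_{t}") for t in range(6)]

full_context = truncate_history(history)
windowed_context = truncate_history(history, max_turns=2)
print(len(full_context), windowed_context)
```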

[105] SyncLight: Controllable and Consistent Multi-View Relighting

David Serrano-Lozano,Anand Bhattad,Luis Herranz,Jean-François Lalonde,Javier Vazquez-Corral

Main category: cs.CV

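TL;DR: SyncLight is the first method for consistent, parametric relighting across multiple views of a static scene. A multi-view diffusion transformer produces high-fidelity, cross-view-consistent lighting edits in a single inference step, requires no camera pose information, and generalizes zero-shot to an arbitrary number of views.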

Details

Motivation: Existing single-view relighting methods cannot meet the strict cross-view lighting consistency required for multi-camera broadcasts, stereoscopic cinema, and virtual production. Method: Proposes SyncLight, built on a multi-view diffusion transformer trained with a latent bridge matching formulation, and constructs a large-scale hybrid dataset of synthetic and real multi-view data. Result: Relights an entire multi-view image set with high fidelity in a single inference step; generalizes zero-shot to an arbitrary number of views; maintains lighting consistency without camera pose information. Conclusion: SyncLight provides a practical, controllable, and consistent relighting workflow for multi-view capture systems, filling a key gap in multi-view consistency for generative relighting. Abstract: We present SyncLight, the first method to enable consistent, parametric relighting across multiple uncalibrated views of a static scene. While single-view relighting has advanced significantly, existing generative approaches struggle to maintain the rigorous lighting consistency essential for multi-camera broadcasts, stereoscopic cinema, and virtual production. SyncLight addresses this by enabling precise control over light intensity and color across a multi-view capture of a scene, conditioned on a single reference edit. Our method leverages a multi-view diffusion transformer trained using a latent bridge matching formulation, achieving high-fidelity relighting of the entire image set in a single inference step. To facilitate training, we introduce a large-scale hybrid dataset comprising diverse synthetic environments -- curated from existing sources and newly designed scenes -- alongside high-fidelity, real-world multi-view captures under calibrated illumination. Surprisingly, though trained only on image pairs, SyncLight generalizes zero-shot to an arbitrary number of viewpoints, effectively propagating lighting changes across all views, without requiring camera pose information. SyncLight enables practical relighting workflows for multi-view capture systems.

[106] AnyView: Synthesizing Any Novel View in Dynamic Scenes

Basile Van Hoorick,Dian Chen,Shun Iwase,Pavel Tokmakov,Muhammad Zubair Irshad,Igor Vasiljevic,Swati Gupta,Fangzhou Cheng,Sergey Zakharov,Vitor Campagnolo Guizilini

Main category: cs.CV

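TL;DR: This paper proposes AnyView, a diffusion-based framework for dynamic view synthesis. It trains a generalist spatiotemporal implicit representation on mixed data sources (monocular 2D, multi-view static 3D, and multi-view dynamic 4D) to generate zero-shot videos from arbitrary camera poses, and shows better consistency and robustness than existing methods on AnyViewBench, a new benchmark for extreme dynamic view synthesis.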

Details

Motivation: Existing generative video models struggle to maintain multi-view and spatiotemporal consistency in highly dynamic real-world environments, and most methods rely on strong geometric assumptions or substantial viewpoint overlap, which limits generalization. Method: Proposes the AnyView framework, a diffusion model trained jointly on monocular 2D, multi-view static 3D, and multi-view dynamic 4D data to build a generalist spatiotemporal implicit representation without explicit geometric priors, supporting zero-shot video generation along arbitrary camera trajectories. Result: Competitive with the state of the art on standard benchmarks; on the newly proposed AnyViewBench, which targets extreme dynamic real-world scenes, it clearly outperforms mainstream baselines and still generates realistic, plausible, and spatiotemporally consistent videos when viewpoints overlap little or not at all. Conclusion: Through weak inductive biases and multi-source supervision, AnyView improves the generalization and robustness of dynamic view synthesis and offers a new approach to free-viewpoint video generation in complex real-world scenes. Abstract: Modern generative video models excel at producing convincing, high-quality outputs, but struggle to maintain multi-view and spatiotemporal consistency in highly dynamic real-world environments. In this work, we introduce AnyView, a diffusion-based video generation framework for dynamic view synthesis with minimal inductive biases or geometric assumptions. We leverage multiple data sources with various levels of supervision, including monocular (2D), multi-view static (3D) and multi-view dynamic (4D) datasets, to train a generalist spatiotemporal implicit representation capable of producing zero-shot novel videos from arbitrary camera locations and trajectories. We evaluate AnyView on standard benchmarks, showing competitive results with the current state of the art, and propose AnyViewBench, a challenging new benchmark tailored towards extreme dynamic view synthesis in diverse real-world scenarios. In this more dramatic setting, we find that most baselines drastically degrade in performance, as they require significant overlap between viewpoints, while AnyView maintains the ability to produce realistic, plausible, and spatiotemporally consistent videos when prompted from any viewpoint. Results, data, code, and models can be viewed at: https://tri-ml.github.io/AnyView/

Table of Contents

cs.CL [Back]

[1] ChiEngMixBench: Evaluating Large Language Models on Spontaneous and Natural Chinese-English Code-Mixed Generation

[2] M3Kang: Evaluating Multilingual Multimodal Mathematical Reasoning in Vision-Language Models

[3] Domain Specific Specialization in Low-Resource Settings: The Efficacy of Offline Response-Based Knowledge Distillation in Large Language Models

[4] Towards Latent Diffusion Suitable For Text

[5] Limits of n-gram Style Control for LLMs via Logit-Space Injection

[6] GameTalk: Training LLMs for Strategic Conversation

[7] Better as Generators Than Classifiers: Leveraging LLMs and Synthetic Data for Low-Resource Multilingual Classification

[8] Generating Literature-Driven Scientific Theories at Scale

[9] A Longitudinal, Multinational, and Multilingual Corpus of News Coverage of the Russo-Ukrainian War

[10] Teaching and Evaluating LLMs to Reason About Polymer Design Related Tasks

[11] Machine-Assisted Grading of Nationwide School-Leaving Essay Exams with LLMs and Statistical NLP

[12] Regional Bias in Large Language Models

[13] Identity, Cooperation and Framing Effects within Groups of Real and Simulated Humans

[14] PolyAgent: Large Language Model Agent for Polymer Design

[15] Cross-Lingual Activation Steering for Multilingual Language Models

[16] Cite-While-You-Generate: Training-Free Evidence Attribution for Multimodal Clinical Summarization

[17] Clarify or Answer: Reinforcement Learning for Agentic VQA with Context Under-specification

[18] Jacobian Scopes: token-level causal attributions in LLMs

[19] Learning Domain Knowledge in Multimodal Large Language Models through Reinforcement Fine-Tuning

[20] Exploring the Effects of Alignment on Numerical Bias in Large Language Models

[21] Mixing Expert Knowledge: Bring Human Thoughts Back To the Game of Go

[22] Graph-Anchored Knowledge Indexing for Retrieval-Augmented Generation

[23] Persona Jailbreaking in Large Language Models

[24] DeepEra: A Deep Evidence Reranking Agent for Scientific Retrieval-Augmented Generated Question Answering

[25] TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization

[26] Timely Machine: Awareness of Time Makes Test-Time Scaling Agentic

[27] MRAG: Benchmarking Retrieval-Augmented Generation for Bio-medicine

[28] LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning

[29] Is Length Really A Liability? An Evaluation of Multi-turn LLM Conversations using BoolQ

[30] SearchLLM: Detecting LLM Paraphrased Text by Measuring the Similarity with Regeneration of the Candidate Source via Search Engine

[31] Curate-Train-Refine: A Closed-Loop Agentic Framework for Zero Shot Classification

[32] Retrieve-Refine-Calibrate: A Framework for Complex Claim Fact-Checking

[33] Attention-MoA: Enhancing Mixture-of-Agents via Inter-Agent Semantic Attention and Deep Residual Synthesis

[34] AuroraEdge-V-2B: A Faster And Stronger Edge Visual Large Language Model

[35] PROST-LLM: Progressively Enhancing the Speech-to-Speech Translation Capability in LLMs

[36] How Does Personalized Memory Shape LLM Behavior? Benchmarking Rational Preference Utilization in Personalized Assistants

[37] MultiLexNorm++: A Unified Benchmark and a Generative Model for Lexical Normalization for Asian Languages

[38] Typologically Informed Parameter Aggregation

[39] Sycophancy Hides Linearly in the Attention Heads

[40] Select or Project? Evaluating Lower-dimensional Vectors for LLM Training Data Explanations

[41] PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice

[42] EMemBench: Interactive Benchmarking of Episodic Memory for VLM Agents

[43] Better Generalizing to Unseen Concepts: An Evaluation Framework and An LLM-Based Auto-Labeled Pipeline for Biomedical Concept Recognition

[44] Mitigating Bias in Automated Grading Systems for ESL Learners: A Contrastive Learning Approach

[45] Standardizing Longitudinal Radiology Report Evaluation via Large Language Model Annotation

[46] Do LLM hallucination detectors suffer from low-resource effect?

[47] Persuasion Tokens for Editing Factual Knowledge in LLMs

[48] Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis

[49] SoS: Analysis of Surface over Semantics in Multilingual Text-To-Image Generation

[50] Trapped in the past? Disentangling fluid and crystallized intelligence of large language models using chess

[51] LLM-Based Adversarial Persuasion Attacks on Fact-Checking Systems

[52] Information Representation Fairness in Long-Document Embeddings: The Peculiar Interaction of Positional and Language Bias

[53] Strategies for Span Labeling with Large Language Models

cs.CV [Back]

[54] GR3EN: Generative Relighting for 3D Environments

[55] Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory

[56] FeTTL: Federated Template and Task Learning for Multi-Institutional Medical Imaging

[57] Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments

[58] Coarse-to-Fine Non-rigid Multi-modal Image Registration for Historical Panel Paintings based on Crack Structures

[59] Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models

[60] VTFusion: A Vision-Text Multimodal Fusion Network for Few-Shot Anomaly Detection

[61] ResAgent: Entropy-based Prior Point Discovery and Visual Reasoning for Referring Expression Segmentation

[62] A Cosine Network for Image Super-Resolution

[63] DCCS-Det: Directional Context and Cross-Scale-Aware Detector for Infrared Small Target

[64] AlphaFace: High Fidelity and Real-time Face Swapper Robust to Facial Pose

[65] MDAFNet: Multiscale Differential Edge and Adaptive Frequency Guided Network for Infrared Small Target Detection

[66] Masked Face Recognition under Different Backbones

[67] Emotion-LLaMAv2 and MMEVerse: A New Framework and Benchmark for Multimodal Emotion Understanding

[68] VISTA-PATH: An interactive foundation model for pathology image segmentation and quantitative analysis in computational pathology

[69] Order from Chaos: Physical World Understanding from Glitchy Gameplay Videos

[70] Multi-View Consistent Wound Segmentation With Neural Fields

[71] Expert Knowledge-Guided Decision Calibration for Accurate Fine-Grained Tree Species Classification

[72] SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer

[73] TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning

[74] AnchoredDream: Zero-Shot 360° Indoor Scene Generation from a Single View via Geometric Grounding

[75] OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding

[76] Semi-Supervised Hierarchical Open-Set Classification

[77] HA2F: Dual-module Collaboration-Guided Hierarchical Adaptive Aggregation Framework for Remote Sensing Change Detection