cs.CL

[1] StreetMath: Study of LLMs' Approximation Behaviors

Chiung-Yi Tseng, Somshubhra Roy, Maisha Thasin, Danyang Zhang, Blessing Effiong

Main category: cs.CL

TL;DR: This paper introduces StreetMath, a benchmark for evaluating the approximate mathematical reasoning of large language models in real-world scenarios. It finds that LLMs tend to compute exactly rather than approximate, and that exact and approximate arithmetic rely on different neural mechanisms.

Details Motivation: Existing work focuses mostly on LLM performance on exact arithmetic and overlooks approximate reasoning in informal, fast-paced mathematical settings, especially in non-autoregressive models. Method: Build the StreetMath benchmark, evaluate a range of LLM architectures on it, and apply mechanistic interpretability techniques to analyze the models' internal computational states. Result: LLMs tend to compute exact values or call external tools even on tasks that call for approximation; models sometimes reach the correct answer in early layers yet keep generating more tokens; exact and approximate arithmetic rely on different neural components. Conclusion: LLMs do not show the "cognitively miserly" street-math behavior that humans do, and their approximate reasoning mechanisms differ fundamentally from those of humans. Abstract: There is a substantial body of literature examining the mathematical reasoning capabilities of large language models (LLMs), particularly their performance on precise arithmetic operations in autoregressive architectures. However, their ability to perform approximate reasoning in informal, fast-paced mathematical operations has received far less attention, especially among non-autoregressive decoder models. Our work addresses this gap by introducing StreetMath, a benchmark designed to evaluate models' approximation abilities under real-world approximation scenarios. We conduct extensive evaluations across different LLM architectures: Qwen3-4B-Instruct-2507, Qwen3-4B-Thinking-2507, Dream-v0-Instruct-7B, Falcon-Mamba-7B-Instruct, and Mamba-GPT-3B. Furthermore, we apply mechanistic interpretability techniques to probe their internal computational states. Our analysis reveals that LLMs generally attempt to compute exact values or invoke external tools even in tasks that call for approximation. Moreover, while models sometimes reach the correct answer in early layers or steps, they still consume more tokens when solving approximation tasks. Additional experiments indicate that exact and approximate arithmetic operations rely on largely separate neural components. Drawing upon research in cognitive psychology, we argue that LLMs do not exhibit cognitive miserliness in the same way humans do in street math settings. We open source our work at https://github.com/ctseng777/StreetMath

[2] Review Based Entity Ranking using Fuzzy Logic Algorithmic Approach: Analysis

Pratik N. Kalamkar, Anupama G. Phakatkar

Main category: cs.CL

TL;DR: This paper proposes a fine-grained classification method based on the orientation and strength of opinion words. It combines fuzzy logic with syntactic dependency analysis to score reviews on specific product aspects and so improve lexicon-based sentiment analysis.

Details Motivation: Existing holistic lexicon-based approaches ignore differences in opinion strength and cannot distinguish degrees of sentiment, so a finer-grained method is needed to assess user opinions accurately. Method: A fuzzy logic algorithm classifies opinion words (adverbs, adjectives, nouns, and verbs) into strength levels (very weak, weak, moderate, strong, very strong), and syntactic dependencies identify the opinion words related to the target aspect; from these, the entity's score on that aspect is computed. Result: The approach scores individual aspects of a product or service more precisely, improving the granularity and accuracy of sentiment analysis. Conclusion: By combining fuzzy logic with syntactic analysis, the method identifies the orientation and strength of opinions in reviews at a fine granularity and improves aspect-based sentiment analysis. Abstract: Opinion mining, also called sentiment analysis, is the field of study that analyzes people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes. The holistic lexicon-based approach does not consider the strength of each opinion, i.e., whether the opinion is very strongly negative (or positive), strongly negative (or positive), moderately negative (or positive), weakly negative (or positive), or very weakly negative (or positive). In this paper, we propose an approach to rank entities based on the orientation and strength of the entity reviews and users' queries by classifying them into granularity levels (i.e., very weak, weak, moderate, strong, and very strong) and combining the opinion words (i.e., adverbs, adjectives, nouns, and verbs) that are related to the aspect of interest of a given product. We use a fuzzy logic algorithmic approach to classify opinion words into different categories, and syntactic dependency resolution to find relations for the desired aspect words. Opinion words related to certain aspects of interest are considered to find the entity score for that aspect in the review.
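
The strength classification described above can be illustrated with a minimal sketch. It assumes a signed lexicon score in [-1, 1] and evenly spaced triangular membership functions; the function names, the score range, and the breakpoints are illustrative assumptions, not the authors' actual fuzzy rules.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 outside (left, right), rising to 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical, evenly spaced strength levels over a magnitude in [0, 1].
LEVELS = {
    "very weak": (-0.25, 0.0, 0.25),
    "weak": (0.0, 0.25, 0.5),
    "moderate": (0.25, 0.5, 0.75),
    "strong": (0.5, 0.75, 1.0),
    "very strong": (0.75, 1.0, 1.25),
}


def classify_strength(score):
    """Return (orientation, level, memberships) for a signed score in [-1, 1]."""
    orientation = "positive" if score >= 0 else "negative"
    magnitude = min(abs(score), 1.0)
    memberships = {
        name: triangular(magnitude, *params) for name, params in LEVELS.items()
    }
    # Pick the level with the highest degree of membership.
    level = max(memberships, key=memberships.get)
    return orientation, level, memberships


orientation, level, _ = classify_strength(0.8)
print(orientation, level)
```

A score such as 0.8 belongs partly to "strong" and partly to "very strong"; taking the level with the highest membership is one simple way to resolve that overlap.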

[3] LASTIST: LArge-Scale Target-Independent STance dataset

DongJae Kim, Yaejin Lee, Minsu Park, Eunil Park

Main category: cs.CL

TL;DR: This paper presents LASTIST, a large-scale target-independent stance detection dataset for Korean with 563,299 labeled sentences, aimed at filling the research gap for low-resource languages in stance detection.

Details Motivation: Existing stance detection research concentrates on target-dependent tasks and English data, and high-quality benchmark datasets for low-resource languages such as Korean are lacking, which limits progress in the field. Method: Data were collected from the press releases of Korea's two major political parties to build LASTIST; the paper describes the collection and annotation process in detail and trains state-of-the-art deep learning models for experimental validation. Result: The resulting large-scale Korean stance detection dataset supports several tasks, including target-independent and diachronic stance detection, and provides an important resource for stance detection in low-resource languages. Conclusion: LASTIST advances Korean stance detection research and, in particular, supports target-independent and dynamic stance analysis in non-English settings. Abstract: Stance detection has emerged as an area of research in the field of artificial intelligence. However, most research is currently centered on the target-dependent stance detection task, which is based on a person's stance in favor of or against a specific target. Furthermore, most benchmark datasets are based on English, making it difficult to develop models in low-resource languages such as Korean, especially for an emerging field such as stance detection. This study proposes the LArge-Scale Target-Independent STance (LASTIST) dataset to fill this research gap. Collected from the press releases of Korea's two major political parties, the LASTIST dataset contains 563,299 labeled Korean sentences. We provide a detailed description of how we collected and constructed the dataset and trained state-of-the-art deep learning and stance detection models. Our LASTIST dataset is designed for various tasks in stance detection, including target-independent stance detection and diachronic evolution stance detection. The dataset is available at https://anonymous.4open.science/r/LASTIST-3721/.

[4] zFLoRA: Zero-Latency Fused Low-Rank Adapters

Dhananjaya Gowda, Seoha Song, Harshith Goka, Junhyun Lee

Main category: cs.CL

TL;DR: This paper proposes a new zero-latency fused low-rank adapter (zFLoRA) that adds little or no latency while comparing favorably with LoRA and full fine-tuning across a range of tasks.

Details Motivation: Adapters have few parameters but still add significant compute at inference time, so an efficient adaptation method with almost no latency overhead is needed. Method: zFLoRA fuses the low-rank structure into the base model to achieve zero or negligible inference latency overhead, without requiring extra compute. Result: On LLMs with 1B, 3B, and 7B parameters, zFLoRA is validated on 18 tasks, and its latency on both NPU and GPU is close to that of the base model. Conclusion: zFLoRA maintains strong task performance while adding almost no inference latency, making it a more efficient adaptation solution for multi-task deployments. Abstract: Large language models (LLMs) are increasingly deployed with task-specific adapters catering to multiple downstream applications. In such a scenario, the additional compute associated with the apparently insignificant number of adapter parameters (typically less than 1% of the base model) turns out to be disproportionately significant at inference time (up to 2.5x that of the base model). In this paper, we propose a new zero-latency fused low-rank adapter (zFLoRA) that introduces zero or negligible latency overhead on top of the base model. Experimental results on LLMs of size 1B, 3B, and 7B show that zFLoRA compares favorably against the popular supervised fine-tuning benchmarks, including low-rank adapters (LoRA) as well as full fine-tuning (FFT). Experiments are conducted on 18 different tasks across three categories, namely commonsense reasoning, math reasoning, and summary-dialogue. Latency measurements made on NPU (Samsung Galaxy S25+) as well as GPU (NVIDIA H100) platforms show that the proposed zFLoRA adapters introduce zero to negligible latency overhead.

[5] BlackboxNLP-2025 MIB Shared Task: Improving Circuit Faithfulness via Better Edge Selection

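Yaniv Nikankin, Dana Arad, Itay Itzhak, Anja Reusch, Adi Simhi, Gal Kesten-Pomeranz, Yonatan Belinkov

Main category: cs.CL

TL;DR: This paper proposes three improvements to circuit discovery in mechanistic interpretability: bootstrapping to identify edges with consistent attribution scores, a ratio-based selection strategy, and integer linear programming in place of greedy selection. Together they yield more faithful circuits and outperform prior approaches.

Details Motivation: A central challenge in mechanistic interpretability is circuit discovery, that is, determining which parts of a model perform a given task. Existing methods fall short on accuracy and faithfulness, so better ways of identifying the key network structure are needed. Method: (1) Use bootstrapping to identify edges with stable attribution scores; (2) introduce a ratio-based selection strategy that prioritizes strong positive-scoring edges to balance performance and faithfulness; (3) replace the standard greedy selection with an integer linear programming formulation. Result: The proposed methods outperform prior approaches across multiple MIB tasks and models and produce more faithful circuits. Conclusion: The three improvements raise the accuracy and faithfulness of circuit discovery and give mechanistic interpretability a stronger tool. Abstract: One of the main challenges in mechanistic interpretability is circuit discovery, determining which parts of a model perform a given task. We build on the Mechanistic Interpretability Benchmark (MIB) and propose three key improvements to circuit discovery. First, we use bootstrapping to identify edges with consistent attribution scores. Second, we introduce a simple ratio-based selection strategy to prioritize strong positive-scoring edges, balancing performance and faithfulness. Third, we replace the standard greedy selection with an integer linear programming formulation. Our methods yield more faithful circuits and outperform prior approaches across multiple MIB tasks and models. Our code is available at: https://github.com/technion-cs-nlp/MIB-Shared-Task.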
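
The first two ideas can be sketched roughly as follows. This is one possible reading, not the authors' implementation: the sign-agreement criterion, the `min_agree` threshold, the `pos_ratio` parameter, and both function names are assumptions made for illustration.

```python
import random


def bootstrap_consistent_edges(scores, n_boot=200, min_agree=0.95, seed=0):
    """Keep edges whose resampled total has the same sign in most resamples.

    scores: dict mapping edge -> list of per-example attribution scores.
    Returns a dict mapping each kept edge to its mean score.
    """
    rng = random.Random(seed)
    kept = {}
    for edge, values in scores.items():
        n = len(values)
        positive = 0
        for _ in range(n_boot):
            # Resample the examples with replacement.
            sample = [values[rng.randrange(n)] for _ in range(n)]
            if sum(sample) > 0:
                positive += 1
        frac_positive = positive / n_boot
        if max(frac_positive, 1 - frac_positive) >= min_agree:
            kept[edge] = sum(values) / n
    return kept


def ratio_select(mean_scores, k, pos_ratio=0.7):
    """Fill a share of the k slots with top positive edges, the rest by |score|."""
    n_pos = int(k * pos_ratio)
    positives = sorted(
        (e for e, s in mean_scores.items() if s > 0),
        key=lambda e: mean_scores[e],
        reverse=True,
    )[:n_pos]
    chosen = list(positives)
    rest = sorted(
        (e for e in mean_scores if e not in chosen),
        key=lambda e: abs(mean_scores[e]),
        reverse=True,
    )
    chosen.extend(rest[: k - len(chosen)])
    return chosen


example_scores = {
    "a": [1.0, 1.2, 0.9, 1.1],      # consistently positive
    "b": [-2.0, -1.8, -2.2, -2.1],  # consistently negative
    "c": [0.5, -0.5, 0.4, -0.4],    # sign flips across examples
}
stable = bootstrap_consistent_edges(example_scores)
print(sorted(stable), ratio_select(stable, k=2, pos_ratio=0.5))
```

The integer linear programming step is not shown here, since it would need a solver outside the standard library.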

[6] LISTEN to Your Preferences: An LLM Framework for Multi-Objective Selection

Adam S. Jovine, Tinghan Ye, Francis Bahk, Jingjing Wang, David B. Shmoys, Peter I. Frazier

Main category: cs.CL

TL;DR: Proposes LISTEN, a framework that uses a large language model as a zero-shot preference oracle, guided by an expert's natural-language priorities, to select the best option in multi-objective decisions and reduce the cognitive burden of traditional preference elicitation.

Details Motivation: Human experts struggle to pick the best option from a large set with multiple competing objectives, because complex, implicit preferences are hard to formalize and this bottlenecks the decision process. Method: LISTEN uses a large language model (LLM) as a zero-shot preference oracle guided by the expert's high-level priorities. Two iterative algorithms are proposed: LISTEN-U, which uses the LLM to refine a parametric utility function, and LISTEN-T, a non-parametric method that runs tournament-style selection over small batches. Result: On tasks including flight booking, shopping, and exam scheduling, LISTEN-U excels when preferences are parametrically aligned, while LISTEN-T is more robust. Conclusion: LISTEN shows the potential of steering complex multi-objective decisions directly with natural language and reduces the cognitive burden of traditional preference elicitation. Abstract: Human experts often struggle to select the best option from a large set of items with multiple competing objectives, a process bottlenecked by the difficulty of formalizing complex, implicit preferences. To address this, we introduce LISTEN, a framework that leverages a Large Language Model (LLM) as a zero-shot preference oracle, guided only by an expert's high-level priorities in natural language. To operate within LLM constraints like context windows and inference costs, we propose two iterative algorithms: LISTEN-U, which uses the LLM to refine a parametric utility function, and LISTEN-T, a non-parametric method that performs tournament-style selections over small batches of solutions. Evaluated on diverse tasks including flight booking, shopping, and exam scheduling, our results show LISTEN-U excels when preferences are parametrically aligned (a property we measure with a novel concordance metric), while LISTEN-T offers more robust performance. This work explores a promising direction for steering complex multi-objective decisions directly with natural language, reducing the cognitive burden of traditional preference elicitation.
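
The tournament idea behind LISTEN-T can be sketched in a few lines. This is an illustrative sketch only: `tournament_select`, the batch size, and `stub_oracle` are hypothetical, and the stub replaces the LLM call with a fixed scoring rule so that the example runs offline.

```python
import random


def tournament_select(items, oracle, batch_size=4, seed=0):
    """Repeatedly split the pool into small batches and keep each batch's winner.

    oracle(batch) must return one element of batch; in LISTEN-T this role is
    played by an LLM prompted with the expert's priorities.
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    rng = random.Random(seed)
    pool = list(items)
    while len(pool) > 1:
        rng.shuffle(pool)
        pool = [
            oracle(pool[i:i + batch_size])
            for i in range(0, len(pool), batch_size)
        ]
    return pool[0]


def stub_oracle(batch):
    """Hypothetical stand-in for the LLM: prefer cheap flights with few stops."""
    return min(batch, key=lambda flight: flight["price"] + 50 * flight["stops"])


flights = [
    {"id": "F1", "price": 320, "stops": 0},
    {"id": "F2", "price": 250, "stops": 2},
    {"id": "F3", "price": 280, "stops": 1},
    {"id": "F4", "price": 400, "stops": 0},
    {"id": "F5", "price": 300, "stops": 0},
    {"id": "F6", "price": 260, "stops": 1},
]
best = tournament_select(flights, stub_oracle, batch_size=3)
print(best["id"])
```

Because each call sees only a small batch, the prompt stays within the context window, at the cost of several oracle calls per round.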

[7] Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data

Haoran Deng, Yingyu Lin, Zhenghao Lin, Xiao Liu, Yizhou Sun, Yi-An Ma, Yeyun Gong

Main category: cs.CL

TL;DR: LongFilter is a data selection framework for long-context pretraining. By comparing model predictions under long and short contexts, it identifies high-quality samples that depend on long-range information and improves model performance on long-text tasks.

Details Motivation: Much of the available long-text data lacks genuine long-range dependencies, which makes training inefficient, so long-context data that actually matters must be selected. Method: LongFilter computes the information gain of model predictions under a long-context setting relative to a short-context setting, uses it to measure the importance of long-range dependencies, and filters the training data accordingly. Result: With LLaMA-3-8B, extending the context length from 8K to 64K, experiments show that LongFilter selects high-quality data efficiently and yields substantial gains on benchmarks such as HELMET, LongBench, and RULER. Conclusion: LongFilter improves the training efficiency and performance of long-context language models and confirms the key role of data quality in long-context pretraining. Abstract: Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, a significant portion of readily available long-text data lacks meaningful long-distance dependencies; most spans can be predicted using only local context. Training on such data is inefficient, making careful data selection crucial. Therefore, we introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, thereby identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.
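
The contrast between long-context and short-context predictions can be sketched as a per-sample score. The sketch assumes that per-token log-probabilities have already been computed under both settings; the mean log-probability difference, the field names, and `keep_fraction` are assumptions, and the paper's exact scoring function may differ.

```python
def long_context_gain(logp_long, logp_short):
    """Mean per-token log-probability gain from seeing the longer context."""
    if not logp_long or len(logp_long) != len(logp_short):
        raise ValueError("log-prob lists must be non-empty and of equal length")
    total = sum(lp_long - lp_short for lp_long, lp_short in zip(logp_long, logp_short))
    return total / len(logp_long)


def filter_samples(samples, keep_fraction=0.5):
    """Keep the samples whose tokens benefit most from the extended context."""
    ranked = sorted(
        samples,
        key=lambda s: long_context_gain(s["logp_long"], s["logp_short"]),
        reverse=True,
    )
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Hypothetical log-probs: sample A needs distant context, sample B does not.
corpus = [
    {"id": "A", "logp_long": [-1.0, -0.5], "logp_short": [-3.0, -2.5]},
    {"id": "B", "logp_long": [-1.0, -1.0], "logp_short": [-1.0, -1.0]},
]
print([s["id"] for s in filter_samples(corpus)])
```

A gain near zero means the tokens are predictable from local context alone, which is the kind of sample the abstract describes as inefficient to train on.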

[8] Ideology-Based LLMs for Content Moderation

Stefano Civelli, Pietro Bernardelle, Nardiena A. Pratama, Gianluca Demartini

Main category: cs.CL

TL;DR: This study examines how adopting different ideological personas affects the fairness and consistency of harmful content classification by large language models (LLMs) in content moderation systems. Overall accuracy changes little, but models show marked judgment shifts under different ideological personas, siding with same-ideology views and downplaying harm in opposing views, which reveals a risk that AI systems implicitly reinforce bias.

Details Motivation: To ensure that LLMs are fair and neutral in content moderation, and to examine whether persona conditioning introduces hidden ideological bias while headline performance stays stable. Method: Personas with different ideological leanings are introduced across LLM architectures, model sizes, and modalities (text and vision). The study analyzes their effect on harmful content classification, measures agreement and divergence across ideologies, and adds a politically targeted task to test the personas' tendency to defend their own side. Result: Different personas lead to systematic differences in harmfulness judgments; larger models align more closely with same-ideology personas and, on the politically targeted task, tend to defend their own position while downplaying harm in opposing views. Conclusion: Persona conditioning introduces subtle but important ideological bias that could lead AI moderation systems to deepen partisan divides under an appearance of neutrality, so fairness in real deployments needs scrutiny. Abstract: Large language models (LLMs) are increasingly used in content moderation systems, where ensuring fairness and neutrality is essential. In this study, we examine how persona adoption influences the consistency and fairness of harmful content classification across different LLM architectures, model sizes, and content modalities (language vs. vision). At first glance, headline performance metrics suggest that personas have little impact on overall classification accuracy. However, a closer analysis reveals important behavioral shifts. Personas with different ideological leanings display distinct propensities to label content as harmful, showing that the lens through which a model "views" input can subtly shape its judgments. Further agreement analyses highlight that models, particularly larger ones, tend to align more closely with personas from the same political ideology, strengthening within-ideology consistency while widening divergence across ideological groups. To show this effect more directly, we conducted an additional study on a politically targeted task, which confirmed that personas not only behave more coherently within their own ideology but also exhibit a tendency to defend their perspective while downplaying harmfulness in opposing views. Together, these findings highlight how persona conditioning can introduce subtle ideological biases into LLM outputs, raising concerns about the use of AI systems that may reinforce partisan perspectives under the guise of neutrality.

[9] Beyond Long Context: When Semantics Matter More than Tokens

Tarun Kumar Chawdhury, Jon D. Duke

Main category: cs.CL

TL;DR: Using entity-aware retrieval, the CLEAR method is more efficient and more accurate on clinical document question answering than traditional embedding retrieval and large-context inference, cutting token usage substantially and improving performance on long texts.

Details Motivation: Clinical documents in electronic health records are stored as base64-encoded attachments, which makes semantic question answering difficult, and traditional vector database methods tend to miss subtle clinical relationships. Method: The Clinical Entity Augmented Retrieval (CLEAR) method, which uses entity-aware retrieval, is validated on an evaluation platform of 12 realistic clinical notes against zero-shot large-context inference and chunk-based retrieval-augmented generation. Result: CLEAR was previously reported to reach an F1 score of 0.90 versus 0.86 for embedding-based retrieval while using over 70% fewer tokens. On the evaluation platform it wins 58.3% of comparisons, with an average semantic similarity of 0.878 and 78% fewer tokens, and its win rate reaches 75% on documents exceeding 65,000 tokens. Conclusion: Entity-aware retrieval improves both the efficiency and the accuracy of clinical natural language processing, and the evaluation framework can serve as a reusable, transparent benchmark for clinical question answering systems. Abstract: Electronic Health Records (EHR) store clinical documentation as base64-encoded attachments in FHIR DocumentReference resources, which makes semantic question answering difficult. Traditional vector database methods often miss nuanced clinical relationships. The Clinical Entity Augmented Retrieval (CLEAR) method, introduced by Lopez et al. 2025, uses entity-aware retrieval and achieved improved performance with an F1 score of 0.90 versus 0.86 for embedding-based retrieval, while using over 70 percent fewer tokens. We developed a Clinical Notes QA Evaluation Platform to validate CLEAR against zero-shot large-context inference and traditional chunk-based retrieval-augmented generation. The platform was tested on 12 clinical notes ranging from 10,000 to 65,000 tokens representing realistic EHR content. CLEAR achieved a 58.3 percent win rate, an average semantic similarity of 0.878, and used 78 percent fewer tokens than wide-context processing. The largest performance gains occurred on long notes, with a 75 percent win rate for documents exceeding 65,000 tokens. These findings confirm that entity-aware retrieval improves both efficiency and accuracy in clinical natural language processing. The evaluation framework provides a reusable and transparent benchmark for assessing clinical question answering systems where semantic precision and computational efficiency are critical.

[10] A Survey on Efficient Large Language Model Training: From Data-centric Perspectives

Junyu Luo, Bohan Wu, Xiao Luo, Zhiping Xiao, Yiqiao Jin, Rong-Cheng Tu, Nan Yin, Yifan Wang, Jingyang Yuan, Wei Ju, Ming Zhang

Main category: cs.CL

TL;DR: This paper presents the first systematic survey of efficient LLM post-training from a data perspective. It proposes a taxonomy covering data selection, quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems, and summarizes representative methods and future research directions.

Details Motivation: LLM post-training currently faces high data annotation costs and diminishing marginal returns from scaling data, so data-efficient post-training methods are urgently needed. Method: Taking a data-centric perspective, the paper proposes a taxonomy of data-efficient LLM post-training methods, systematically reviews existing work, and summarizes representative technical approaches. Result: The survey establishes five categories of data-efficient post-training methods, summarizes key techniques and progress in each, and identifies current challenges and open problems. Conclusion: By systematically analyzing the state and challenges of data-efficient post-training, the paper offers a clear research framework and future directions for improving data utilization in large-scale model training. Abstract: Post-training of Large Language Models (LLMs) is crucial for unlocking their task generalization potential and domain-specific capabilities. However, the current LLM post-training paradigm faces significant data challenges, including the high costs of manual annotation and diminishing marginal returns on data scales. Therefore, achieving data-efficient post-training has become a key research question. In this paper, we present the first systematic survey of data-efficient LLM post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods, covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We summarize representative approaches in each category and outline future research directions. By examining the challenges in data-efficient LLM post-training, we highlight open problems and propose potential research avenues. We hope our work inspires further exploration into maximizing the potential of data utilization in large-scale model training. Paper List: https://github.com/luo-junyu/Awesome-Data-Efficient-LLM

[11] Evaluating the Impact of LLM-Assisted Annotation in a Perspectivized Setting: the Case of FrameNet Annotation

Frederico Belcavello, Ely Matos, Arthur Lorenzi, Lisandra Bonoto, Lívia Ruiz, Luiz Fernando Pereira, Victor Herbst, Yulla Navarro, Helen de Andrade Abreu, Lívia Dutra, Tiago Timponi Torrent

Main category: cs.CL

TL;DR: This paper evaluates an LLM-based semantic role labeler for automatic and semi-automatic FrameNet-like semantic annotation. The semi-automatic setting increases frame diversity while keeping annotation coverage similar to the human-only setting, and the fully automatic setting performs worse on every metric except annotation time.

Details Motivation: LLM-based applications have great potential for creating language resources, but their impact on building annotated datasets has not been systematically evaluated, especially under a perspectivized approach to NLP. Method: A large-scale evaluation compares annotation time, coverage, and diversity across three experimental settings: manual, automatic, and semi-automatic. Result: Semi-automatic annotation yields greater frame diversity than human-only annotation with similar coverage; fully automatic annotation performs worse on every metric except speed. Conclusion: Semi-automatic annotation combines the strengths of humans and LLMs and is a more effective way to build high-quality semantically annotated datasets. Abstract: The use of LLM-based applications as a means to accelerate and/or substitute human labor in the creation of language resources and datasets is a reality. Nonetheless, despite the potential of such tools for linguistic research, comprehensive evaluation of their performance and impact on the creation of annotated datasets, especially under a perspectivized approach to NLP, is still missing. This paper helps close this gap by reporting on an extensive evaluation of the (semi-)automatization of FrameNet-like semantic annotation by the use of an LLM-based semantic role labeler. The methodology employed compares annotation time, coverage, and diversity in three experimental settings: manual, automatic, and semi-automatic annotation. Results show that the hybrid, semi-automatic annotation setting leads to increased frame diversity and similar annotation coverage when compared to the human-only setting, while the automatic setting performs considerably worse in all metrics except annotation time.

[12] RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline

André V. Duarte, Xuying li, Bin Zeng, Arlindo L. Oliveira, Lei Li, Zhuo Li

Main category: cs.CL

TL;DR: Proposes RECAP, an agentic pipeline that uses a feedback loop and a jailbreaking module to elicit and verify memorized training data from LLM outputs, with substantial gains over single-iteration approaches on the EchoTrace benchmark.

Details Motivation: When the training data of a large language model cannot be inspected, an effective method is needed to confirm whether the model has memorized specific content and to reveal its memorization behavior. Method: A feedback-driven loop uses a secondary language model to compare the output against a reference passage and to produce minimal correction hints that are fed back to the target model; a jailbreaking module overcomes alignment-induced refusals. Result: On the EchoTrace benchmark (over 30 full books), RECAP improves substantially over single-iteration approaches. For example, with GPT-4.1 the average ROUGE-L score for copyrighted text extraction rises from 0.38 to 0.47, an increase of nearly 24%. Conclusion: RECAP effectively elicits and verifies LLM memorization of training data, providing a strong tool for detecting whether a model has memorized sensitive or copyrighted content. Abstract: If we cannot inspect the training data of a large language model (LLM), how can we ever know what it has seen? We believe the most compelling evidence arises when the model itself freely reproduces the target content. As such, we propose RECAP, an agentic pipeline designed to elicit and verify memorized training data from LLM outputs. At the heart of RECAP is a feedback-driven loop, where an initial extraction attempt is evaluated by a secondary language model, which compares the output against a reference passage and identifies discrepancies. These are then translated into minimal correction hints, which are fed back into the target model to guide subsequent generations. In addition, to address alignment-induced refusals, RECAP includes a jailbreaking module that detects and overcomes such barriers. We evaluate RECAP on EchoTrace, a new benchmark spanning over 30 full books, and the results show that RECAP leads to substantial gains over single-iteration approaches. For instance, with GPT-4.1, the average ROUGE-L score for the copyrighted text extraction improved from 0.38 to 0.47 - a nearly 24% increase.
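
For readers unfamiliar with the metric, here is a minimal sketch of ROUGE-L and of the relative gain quoted in the abstract. It assumes whitespace tokenization and the balanced F1 variant of ROUGE-L; the paper's exact scoring configuration is not given in the abstract, and the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Balanced F1 over LCS-based precision and recall (whitespace tokens)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def relative_increase(before, after):
    """Relative change in percent, e.g. for comparing two average scores."""
    return (after - before) / before * 100


score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
gain = relative_increase(0.38, 0.47)
print(round(score, 3), round(gain, 1))
```

The gain is relative: 0.47 is 0.09 above 0.38 in absolute terms, and 0.09 / 0.38 is about 23.7%, which matches the "nearly 24%" in the abstract.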

[13] Revisiting Multilingual Data Mixtures in Language Model Pretraining

Negar Foroutan, Paul Teiletche, Ayush Kumar Tarun, Antoine Bosselut

Main category: cs.CL

TL;DR: The study shows that, in large-scale language model pretraining, appropriately balanced multilingual data does not degrade performance and can improve multilingual capability, and that English as a pivot language helps cross-lingual generalization.

Details Motivation: To examine how different multilingual data mixtures affect LLM pretraining, in particular the trade-off between language coverage and model performance (the curse of multilinguality). Method: Train language models with 1.1B and 3B parameters on multilingual corpora covering 25 to 400 languages, and systematically analyze the effect of the number of languages, the data mixture, and the choice of pivot language. Result: With sufficient data, combining English and multilingual data does not harm in-language performance for either group; using English as a pivot language yields gains across language families, while within-family pivot languages show no consistent advantage; increasing the number of training languages produces no significant "curse of multilinguality". Conclusion: Appropriately balanced multilingual data can strengthen language model capabilities without sacrificing performance, even in low-resource settings, challenging common assumptions about multilingual training. Abstract: The impact of different multilingual data mixtures in pretraining large language models (LLMs) has been a topic of ongoing debate, often raising concerns about potential trade-offs between language coverage and model performance (i.e., the curse of multilinguality). In this work, we investigate these assumptions by training 1.1B and 3B parameter LLMs on diverse multilingual corpora, varying the number of languages from 25 to 400. Our study challenges common beliefs surrounding multilingual training. First, we find that combining English and multilingual data does not necessarily degrade the in-language performance of either group, provided that languages have a sufficient number of tokens included in the pretraining corpus. Second, we observe that using English as a pivot language (i.e., a high-resource language that serves as a catalyst for multilingual generalization) yields benefits across language families, and contrary to expectations, selecting a pivot language from within a specific family does not consistently improve performance for languages within that family. Lastly, we do not observe a significant "curse of multilinguality" as the number of training languages increases in models at this scale. Our findings suggest that multilingual data, when balanced appropriately, can enhance language model capabilities without compromising performance, even in low-resource settings.

[14] Semantic Label Drift in Cross-Cultural Translation

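Mohsinul Kabir, Tasnim Ahmed, Md Mezbaur Rahman, Polydoros Giannouris, Sophia Ananiadou

Main category: cs.CL

TL;DR: This paper studies semantic label drift caused by cultural differences in machine translation. It finds that translation systems, including large language models, tend to alter the original labels in culturally sensitive domains, and that cultural similarity strongly affects label preservation.

Details Motivation: Machine translation research has largely focused on sentiment preservation and overlooked cultural alignment between source and target languages, a key factor that can lead to misinterpretation and cultural conflict in downstream applications. Method: A series of experiments in culturally sensitive and neutral domains evaluates how different machine translation systems (including LLMs) affect semantic labels during translation and analyzes the relationship between encoded cultural knowledge and label drift. Result: (1) MT systems, including LLMs, induce label drift, particularly in culturally sensitive domains; (2) LLMs encode cultural knowledge, and leveraging it can amplify label drift; (3) cultural similarity between source and target languages is a key determinant of label preservation. Conclusion: Ignoring cultural factors in machine translation undermines label fidelity and risks misinterpretation and cultural conflict in downstream tasks, so cultural alignment should be considered during translation. Abstract: Machine Translation (MT) is widely employed to address resource scarcity in low-resource languages by generating synthetic data from high-resource counterparts. While sentiment preservation in translation has long been studied, a critical but underexplored factor is the role of cultural alignment between source and target languages. In this paper, we hypothesize that semantic labels drift or are altered during MT due to cultural divergence. Through a series of experiments across culturally sensitive and neutral domains, we establish three key findings: (1) MT systems, including modern Large Language Models (LLMs), induce label drift during translation, particularly in culturally sensitive domains; (2) unlike earlier statistical MT tools, LLMs encode cultural knowledge, and leveraging this knowledge can amplify label drift; and (3) cultural similarity or dissimilarity between source and target languages is a crucial determinant of label preservation. Our findings highlight that neglecting cultural factors in MT not only undermines label fidelity but also risks misinterpretation and cultural conflict in downstream applications.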

[15] SymCode: A Neurosymbolic Approach to Mathematical Reasoning via Verifiable Code Generation

Sina Bagheri Nezhad, Yao Li, Ameeta Agrawal

Main category: cs.CL

TL;DR: SymCode is a neurosymbolic framework that recasts mathematical problem solving as verifiable code generation with the SymPy library, improving the accuracy and trustworthiness of large language models on complex mathematical reasoning.

Details Motivation: On complex mathematical reasoning, LLMs often produce unverifiable and arithmetically unsound solutions because they generate prose, and existing prompting methods lack a deterministic verification mechanism. Method: SymCode uses the SymPy library to turn math problems into executable, verifiable code generation tasks, combining neural models with symbolic computation so that the reasoning can be verified deterministically. Result: On benchmarks such as MATH-500 and OlympiadBench, SymCode improves accuracy by up to 13.6 percentage points over baselines and uses fewer tokens; failures shift from opaque logical fallacies to transparent programmatic errors. Conclusion: By grounding LLM reasoning in a deterministic symbolic engine, SymCode delivers more accurate and trustworthy AI in formal domains and is a key step toward reliable AI. Abstract: Large Language Models (LLMs) often struggle with complex mathematical reasoning, where prose-based generation leads to unverified and arithmetically unsound solutions. Current prompting strategies like Chain of Thought still operate within this unreliable medium, lacking a mechanism for deterministic verification. To address these limitations, we introduce SymCode, a neurosymbolic framework that reframes mathematical problem-solving as a task of verifiable code generation using the SymPy library. We evaluate SymCode on challenging benchmarks, including MATH-500 and OlympiadBench, demonstrating significant accuracy improvements of up to 13.6 percentage points over baselines. Our analysis shows that SymCode is not only more token-efficient but also fundamentally shifts model failures from opaque logical fallacies towards transparent, programmatic errors. By grounding LLM reasoning in a deterministic symbolic engine, SymCode represents a key step towards more accurate and trustworthy AI in formal domains.

[16] NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium

Dinghong Song, Jierui Xu, Weichu Yang, Pengfei Su, Dong Li

Main category: cs.CL

TL;DR: This paper designs high-performance matrix multiplication (matmul) kernels for LLM inference on the AWS Trainium architecture. Customized kernel fusion and novel caching strategies reduce data movement, maximize SRAM bandwidth, and avoid expensive matrix transposes, giving an average 1.35x speedup (up to 2.22x) at the matmul kernel level and an average 1.66x speedup (up to 2.49x) for end-to-end LLM inference.

Details Motivation: Trainium's heterogeneous architecture suits LLM training and inference, but its systolic array structure and special data layout requirements make high performance hard to achieve, especially for the critical matmul kernel. Method: A series of Trainium-specific techniques based on kernel fusion and novel caching strategies reduces data movement across the software-managed memory hierarchy, maximizes SRAM bandwidth, and avoids costly matrix transposes. Result: Experiments on nine datasets and four recent LLMs show that, compared with the state-of-the-art matmul implemented by AWS on Trainium, the approach achieves an average 1.35x speedup (up to 2.22x) at the matmul kernel level and an average 1.66x speedup (up to 2.49x) for end-to-end LLM inference. Conclusion: The Trainium-specific optimizations substantially improve matmul performance in LLM inference, overcome the challenges posed by the hardware architecture, and offer a practical path to efficient LLM deployment on the platform. Abstract: AI accelerators, customized to AI workloads, provide cost-effective and high-performance solutions for training and inference. Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides an attractive option for LLM training and inference through its heterogeneous architecture. However, leveraging the Trainium architecture for high performance can be challenging because of its systolic array architecture and special requirements on data layout. In this paper, we design high-performance matrix multiplication (matmul), a critical compute kernel, for LLM inference on Trainium. We introduce a series of techniques customized to Trainium based on kernel fusion and novel caching strategies to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transposes. Evaluating with nine datasets and four recent LLMs, we show that our system largely outperforms the state-of-the-art matmul implemented by AWS on Trainium: at the level of the matmul kernel, it achieves an average 1.35x speedup (up to 2.22x), which translates to an average 1.66x speedup (up to 2.49x) for end-to-end LLM inference.

[17] AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache

Dinghong Song, Yuan Feng, Yiwei Wang, Shangye Chen, Cyril Guyot, Filip Blagojevic, Hyeran Jeon, Pengfei Su, Dong Li

Main category: cs.CL

TL;DR: This paper proposes AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps, achieving notable speedups on both CPU and GPU with negligible accuracy loss.

Details Motivation: In applications that need only the prefill stage (such as classification, question answering, recommendation, and text embedding), self-attention is the main performance bottleneck because its cost is quadratic in sequence length, and existing methods do little to reduce it. Method: Building on the observation that semantically different sentences often produce similar attention maps across layers and heads, AttnCache builds an attention map memorization database and uses efficient caching and similarity search to identify and reuse cached attention maps at inference time, reducing self-attention computation. Result: AttnCache achieves on average a 1.2x end-to-end and 2x attention speedup on CPU, and a 1.6x end-to-end and 3x attention speedup on GPU, with minimal accuracy loss. Conclusion: AttnCache eases the self-attention bottleneck in the prefill stage and offers a practical optimization for efficient prefill-only inference workloads. Abstract: Large Language Models (LLMs) are widely used in generative applications such as chatting, code generation, and reasoning. However, many real-world workloads such as classification, question answering, recommendation, and text embedding rely solely on the prefill stage of inference, where the model encodes input sequences without performing autoregressive decoding. In these prefill-only scenarios, the self-attention computation becomes the primary performance bottleneck due to its quadratic complexity with respect to sequence length. In this paper, we observe that semantically different sentences often produce similar attention maps across layers and heads. Building on this insight, we propose AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps. Based on an attention map memorization database, AttnCache employs efficient caching and similarity search techniques to identify and reuse pre-cached attention maps during inference, thereby reducing the computational overhead of self-attention. Experimental results show that AttnCache achieves an average of 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with negligible accuracy degradation.
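
The retrieve-and-reuse idea can be illustrated with a toy cache. This is a rough sketch under stated assumptions: the key is taken to be some fixed-size feature vector of the input, lookup is a linear cosine-similarity scan, and `AttentionMapCache` and its `threshold` are hypothetical. The abstract does not specify the real system's features or index.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class AttentionMapCache:
    """Toy store of (feature vector, attention map) pairs with threshold lookup."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []

    def add(self, key, attention_map):
        self.entries.append((list(key), attention_map))

    def lookup(self, key):
        """Return the most similar cached map, or None if nothing is close enough."""
        best_sim, best_map = -1.0, None
        for cached_key, cached_map in self.entries:
            sim = cosine(key, cached_key)
            if sim > best_sim:
                best_sim, best_map = sim, cached_map
        return best_map if best_sim >= self.threshold else None


cache = AttentionMapCache(threshold=0.95)
cache.add([1.0, 0.0], [[1.0, 0.0], [0.5, 0.5]])
cache.add([0.0, 1.0], [[1.0, 0.0], [0.1, 0.9]])

hit = cache.lookup([0.99, 0.05])  # close to the first key: reuse its map
miss = cache.lookup([1.0, 1.0])   # equally far from both keys: recompute
print(hit, miss)
```

On a miss the caller would fall back to computing attention as usual, so the threshold trades speedup against fidelity to the exact attention map.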

[18] Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang, Yanfei Chen, Wei Wang, Tomas Pfister, Chen-Yu Lee

Main category: cs.CL

TL;DR: Proposes Supervised Reinforcement Learning (SRL), a framework that recasts problem solving as generating a sequence of logical "actions" and combines step-wise supervision with verifiable rewards, substantially improving small language models on multi-step reasoning tasks.

Details Motivation: Reinforcement Learning with Verifiable Rewards (RLVR) struggles when correct solutions are rarely sampled, and Supervised Fine-Tuning (SFT) tends to overfit long demonstrations, so small open-source models perform poorly on multi-step reasoning. Method: Problem solving is modeled as generating a sequence of logical actions. The model is trained to produce an internal reasoning monologue before each step, and it receives step-wise, smoother rewards based on the similarity between its actions and expert actions extracted from the SFT dataset. Result: SRL lets small models learn hard problems that were previously unlearnable with SFT or RLVR; initializing with SRL and then refining with RLVR gives the best performance; the approach generalizes well on reasoning benchmarks and agentic software engineering tasks. Conclusion: SRL is a robust and versatile training framework that combines the demonstration guidance of SFT with the exploration of RL, offering a new training paradigm for reasoning-oriented LLMs. Abstract: Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
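
A step-wise similarity reward of this kind can be sketched with Python's standard library. The use of `difflib.SequenceMatcher` as the similarity measure and the function names are assumptions for illustration; the abstract says only that rewards are based on similarity to expert actions.

```python
from difflib import SequenceMatcher


def step_reward(model_action, expert_action):
    """Similarity in [0, 1] between one model action and the expert's action."""
    return SequenceMatcher(None, model_action, expert_action).ratio()


def stepwise_rewards(model_actions, expert_actions):
    """One reward per step, pairing each model action with the expert action."""
    return [step_reward(m, e) for m, e in zip(model_actions, expert_actions)]


expert = ["subtract 3 from both sides", "divide both sides by 2", "x = 4"]
rollout = ["subtract 3 from both sides", "divide each side by 2", "x = 5"]
print([round(r, 2) for r in stepwise_rewards(rollout, expert)])
```

Unlike a single pass/fail reward on the final answer, every step here receives partial credit, so a rollout that ends in the wrong answer still provides a learning signal.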

[19] PORTool: Tool-Use LLM Training with Rewarded Tree

Feijie Wu,Weiwu Zhu,Yuxiang Zhang,Soumya Chatterjee,Jiarong Zhu,Fan Mo,Rodin Luo,Jing Gao

Main category: cs.CL

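TL;DR: Proposes PORTool, a reinforcement learning method for tool use that improves LLM performance in dynamic tool environments through tree-structured multi-trajectory exploration and step-wise rewards.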

Details Motivation: Existing tool-use LLMs are trained on static datasets and imitate a generic tool-call routine, so they do not explore alternative solutions and perform poorly in dynamic environments. Method: Within an RL framework, generates multiple tool-call trajectories that form a tree, assigns step-wise rewards based on each step's contribution to a correct answer, computes fork-relative advantages from shared steps and branch differences, and blends them with trajectory-relative advantages to train the model. Result: In experiments with 17 tools, PORTool significantly improves final accuracy and reduces the number of tool-call steps compared with other methods; ablation studies confirm the necessity and robustness of the step-wise reward design. Conclusion: By combining multi-trajectory exploration with fine-grained rewards, PORTool strengthens the exploration ability and decision quality of tool-use LLMs and suits complex, dynamic tool-interaction scenarios. Abstract: Current tool-use large language models (LLMs) are trained on static datasets, enabling them to interact with external tools and perform multi-step, tool-integrated reasoning, which produces tool-call trajectories. However, these models imitate how a query is resolved in a generic tool-call routine, thereby failing to explore possible solutions and demonstrating limited performance in an evolved, dynamic tool-call environment. In this work, we propose PORTool, a reinforcement learning (RL) method that encourages a tool-use LLM to explore various trajectories yielding the correct answer. Specifically, this method starts with generating multiple rollouts for a given query, and some of them share the first few tool-call steps, thereby forming a tree-like structure. Next, we assign rewards to each step, based on its ability to produce a correct answer and make successful tool calls. A shared step across different trajectories receives the same reward, while different steps under the same fork receive different rewards. Finally, these step-wise rewards are used to calculate fork-relative advantages, blended with trajectory-relative advantages, to train the LLM for tool use. The experiments utilize 17 tools to address user queries, covering both time-sensitive and time-invariant topics. We conduct ablation studies to systematically justify the necessity and the design robustness of step-wise rewards. Furthermore, we compare the proposed PORTool with other training approaches and demonstrate significant improvements in final accuracy and the number of tool-call steps.
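
To make the idea of blending fork-relative and trajectory-relative advantages concrete, here is a rough sketch. The abstract does not give the formulas, so everything below is an assumption: advantages are taken as reward minus the mean over the comparison group, and the two are mixed with a hypothetical `weight` parameter. It should not be read as PORTool's actual definition.

```python
from statistics import mean


def fork_relative_advantages(fork_rewards):
    """Advantage of each sibling step relative to the other steps at one fork.

    Assumption: the baseline is the mean reward of the steps that branch
    from the same shared prefix.
    """
    baseline = mean(fork_rewards)
    return [r - baseline for r in fork_rewards]


def trajectory_relative_advantages(trajectory_rewards):
    """Advantage of each full rollout relative to all rollouts of the query.

    Assumption: the baseline is the mean reward over the rollouts.
    """
    baseline = mean(trajectory_rewards)
    return [r - baseline for r in trajectory_rewards]


def blend(fork_adv, traj_adv, weight=0.5):
    """Mix the two advantages; `weight` is a hypothetical hyperparameter."""
    return weight * fork_adv + (1.0 - weight) * traj_adv


if __name__ == "__main__":
    # Two steps branch from one shared prefix: one leads to a correct answer.
    fork_adv = fork_relative_advantages([1.0, 0.0])
    traj_adv = trajectory_relative_advantages([1.0, 0.0])
    print([blend(f, t) for f, t in zip(fork_adv, traj_adv)])
```

The point of the sketch is only the structure: a step is compared both with its siblings under the same fork and, through its trajectory, with all rollouts for the query.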

[20] Rethinking Cross-lingual Alignment: Balancing Transfer and Cultural Erasure in Multilingual LLMs

HyoJung Han,Sweta Agrawal,Eleftheria Briakou

Main category: cs.CL

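TL;DR: Studies how cross-lingual alignment (CLA) promotes knowledge transfer while risking "cultural erasure", and proposes a new evaluation framework (the transfer-localization plane) and an inference-time method, Surgical Steering, that disentangles factual transfer from cultural localization at different model layers to balance the two better.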

Details Motivation: Cross-lingual alignment aids knowledge transfer but can make models ignore the cultural differences behind languages, causing cultural erasure; this trade-off needs to be evaluated systematically and addressed. Method: Introduces the transfer-localization plane as an evaluation framework and, based on an analysis of the models' internal representations, designs Surgical Steering, which applies targeted activation steering at different model layers. Result: A re-evaluation of existing CLA methods shows that they improve factual transfer at the expense of cultural localization; Surgical Steering retains knowledge transfer while better preserving culturally specific responses. Conclusion: Cross-lingual alignment must account for both knowledge transfer and cultural localization; by steering layers selectively, Surgical Steering eases the conflict between the two and points toward fairer, more culturally aware multilingual models. Abstract: Cross-lingual alignment (CLA) aims to align multilingual representations, enabling Large Language Models (LLMs) to seamlessly transfer knowledge across languages. While intuitive, this pursuit of representational convergence can, we hypothesize, inadvertently cause "cultural erasure", the functional loss of providing culturally-situated responses that should diverge based on the query language. In this work, we systematically analyze this trade-off by introducing a holistic evaluation framework, the transfer-localization plane, which quantifies both desirable knowledge transfer and undesirable cultural erasure. Using this framework, we re-evaluate recent CLA approaches and find that they consistently improve factual transfer at the direct cost of cultural localization across all six languages studied. Our investigation into the internal representations of these models reveals a key insight: universal factual transfer and culturally-specific knowledge are optimally steerable at different model layers. Based on this finding, we propose Surgical Steering, a novel inference-time method that disentangles these two objectives. By applying targeted activation steering to distinct layers, our approach achieves a better balance between the two competing dimensions, effectively overcoming the limitations of current alignment techniques.

[21] Artificial Intelligence-Enabled Analysis of Radiology Reports: Epidemiology and Consequences of Incidental Thyroid Findings

Felipe Larios,Mariana Borras-Osorio,Yuqi Wu,Ana Gabriela Claros,David Toro-Tobon,Esteban Cabezas,Ricardo Loor-Torres,Maria Mateo Chavez,Kerly Guevara Maldonado,Luis Vilatuna Andrango,Maria Lizarazo Jimenez,Ivan Mateo Alzamora,Misk Al Zahidy,Marcelo Montero,Ana Cristina Proano,Cristian Soto Jacome,Jungwei W. Fan,Oscar J. Ponce-Ponte,Megan E. Branda,Naykky Singh Ospina,Juan P. Brito

Main category: cs.CL

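TL;DR: Using natural language processing, the study analyzed radiology reports from about 115,000 patients and found incidental thyroid findings (ITFs) in 7.8% of them, 92.9% of which were nodules. ITFs were associated with higher rates of thyroid cancer detection, but most cancers were low-risk papillary carcinomas, suggesting that ITFs contribute to thyroid cancer overdiagnosis.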

Details Motivation: Incidental thyroid findings (ITFs) are increasingly common on imaging performed for non-thyroid indications, but their prevalence, features, and clinical impact remain unclear; their clinical significance and potential for overdiagnosis need systematic assessment. Method: The authors developed and validated a transformer-based natural language processing (NLP) pipeline that identifies ITFs and extracts nodule characteristics from imaging reports across multiple modalities and body regions. The study is a retrospective cohort of adults who underwent thyroid-capturing imaging at Mayo Clinic between 2017 and 2023. Result: Among 115,683 patients, 7.8% (9,077) had an ITF, and 92.9% of these were nodules. ITFs were more common in women, older adults, patients with higher BMI, and imaging ordered by oncology or internal medicine. Compared with chest CT, ITFs were more likely on neck CT, PET, and nuclear medicine scans. Nodule characteristics were poorly documented: size was reported in only 44%, and other features (such as calcifications) in fewer than 15%. Patients with ITFs had significantly higher rates of thyroid ultrasound, biopsy, thyroidectomy, and cancer diagnosis. Most cancers were papillary and were larger when detected in the ITF group. Conclusion: ITFs are common and often trigger a cascade of clinical interventions, leading to overdiagnosis of small, low-risk thyroid cancers. The study calls for standardized reporting and more cautious, selective follow-up of ITFs. Abstract: Importance: Incidental thyroid findings (ITFs) are increasingly detected on imaging performed for non-thyroid indications. Their prevalence, features, and clinical consequences remain undefined. Objective: To develop, validate, and deploy a natural language processing (NLP) pipeline to identify ITFs in radiology reports and assess their prevalence, features, and clinical outcomes. Design, Setting, and Participants: Retrospective cohort of adults without prior thyroid disease undergoing thyroid-capturing imaging at Mayo Clinic sites from July 1, 2017, to September 30, 2023. A transformer-based NLP pipeline identified ITFs and extracted nodule characteristics from image reports from multiple modalities and body regions. Main Outcomes and Measures: Prevalence of ITFs, downstream thyroid ultrasound, biopsy, thyroidectomy, and thyroid cancer diagnosis. Logistic regression identified demographic and imaging-related factors. Results: Among 115,683 patients (mean age, 56.8 [SD 17.2] years; 52.9% women), 9,077 (7.8%) had an ITF, of which 92.9% were nodules. ITFs were more likely in women, older adults, those with higher BMI, and when imaging was ordered by oncology or internal medicine. Compared with chest CT, ITFs were more likely via neck CT, PET, and nuclear medicine scans. Nodule characteristics were poorly documented, with size reported in 44% and other features in fewer than 15% (e.g. calcifications). Compared with patients without ITFs, those with ITFs had higher odds of thyroid nodule diagnosis, biopsy, thyroidectomy and thyroid cancer diagnosis. Most cancers were papillary, and larger when detected after ITFs vs no ITF. Conclusions: ITFs were common and strongly associated with cascades leading to the detection of small, low-risk cancers. These findings underscore the role of ITFs in thyroid cancer overdiagnosis and the need for standardized reporting and more selective follow-up.

[22] QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback

Taku Mikuriya,Tatsuya Ishigaki,Masayuki Kawarada,Shunya Minami,Tadashi Kadowaki,Yohichi Suzuki,Soshun Naito,Shunya Takata,Takumi Kato,Tamotsu Basseda,Reo Yamada,Hiroya Takamura

Main category: cs.CL

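TL;DR: Introduces QCoder Benchmark, a framework for evaluating large language models on quantum programming tasks that combines quantum simulator feedback with real human code submissions, revealing both the difficulties and the potential of current models in this domain.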

Details Motivation: Quantum programming involves the interplay of natural language, human knowledge, and hardware logic, yet LLMs have not been adequately evaluated in such hardware-interaction domains, so a dedicated benchmark is needed. Method: Builds QCoder Benchmark, which integrates a quantum simulator environment to obtain domain-specific feedback such as circuit depth, execution time, and error classification, and includes human-written code from real programming contests for comparison. Result: GPT-4o reaches only 18.97% accuracy, while the reasoning model o3 reaches up to 78%, exceeding the average human success rate (39.98%). Conclusion: QCoder Benchmark is an effective tool for evaluating LLMs on quantum programming and shows the clear advantage of reasoning models on such complex tasks, supporting future research. Abstract: Large language models (LLMs) have increasingly been applied to automatic programming code generation. This task can be viewed as a language generation task that bridges natural language, human knowledge, and programming logic. However, it remains underexplored in domains that require interaction with hardware devices, such as quantum programming, where human coders write Python code that is executed on a quantum computer. To address this gap, we introduce QCoder Benchmark, an evaluation framework that assesses LLMs on quantum programming with feedback from simulated hardware devices. Our benchmark offers two key features. First, it supports evaluation using a quantum simulator environment beyond conventional Python execution, allowing feedback of domain-specific metrics such as circuit depth, execution time, and error classification, which can be used to guide better generation. Second, it incorporates human-written code submissions collected from real programming contests, enabling both quantitative comparisons and qualitative analyses of LLM outputs against human-written codes. Our experiments reveal that even advanced models like GPT-4o achieve only around 18.97% accuracy, highlighting the difficulty of the benchmark. In contrast, reasoning-based models such as o3 reach up to 78% accuracy, outperforming averaged success rates of human-written codes (39.98%). We release the QCoder Benchmark dataset and public evaluation API to support further research.

[23] Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking

Feng Ju,Zeyu Qin,Rui Min,Zhitao He,Lingpeng Kong,Yi R. Fung

Main category: cs.CL

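TL;DR: Proposes a "one problem, multiple solutions" (1PNS) training paradigm to increase the reasoning diversity of large language models under Test-Time Scaling (TTS). A Reasoning Path Divergence (RPD) metric selects diverse chains of thought, and fine-tuning Qwen3-4B-Base on them yields clear pass@k gains.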

Details Motivation: The conventional "one problem, one solution" (1P1S) training practice limits output diversity and therefore the gains from test-time scaling, so a training method that encourages diverse reasoning paths is needed. Method: Proposes the "one problem, multiple solutions" (1PNS) training paradigm, introduces Reasoning Path Divergence (RPD) as a metric of semantic difference between multi-step chains of thought, and uses RPD to select the most diverse solution paths for fine-tuning. Result: Training on RPD-selected data produces more varied outputs, with an average +2.80% gain in pass@16 over a strong 1P1S baseline and a +4.99% gain on AIME24. Conclusion: The 1PNS paradigm combined with the RPD metric effectively increases the reasoning diversity of large language models and further strengthens test-time scaling. Abstract: While Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), low diversity in model outputs often becomes a bottleneck; this is partly caused by the common "one problem, one solution" (1P1S) training practice, which provides a single canonical answer and can push models toward a narrow set of reasoning paths. To address this, we propose a "one problem, multiple solutions" (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories and thus increases inference diversity. A core challenge for 1PNS is reliably measuring semantic differences between multi-step chains of thought, so we introduce Reasoning Path Divergence (RPD), a step-level metric that aligns and scores Long Chain-of-Thought solutions to capture differences in intermediate reasoning. Using RPD, we curate maximally diverse solution sets per problem and fine-tune Qwen3-4B-Base. Experiments show that RPD-selected training yields more varied outputs and higher pass@k, with an average +2.80% gain in pass@16 over a strong 1P1S baseline and a +4.99% gain on AIME24, demonstrating that 1PNS further amplifies the effectiveness of TTS. Our code is available at https://github.com/fengjujf/Reasoning-Path-Divergence .
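
For readers unfamiliar with pass@k, the sketch below shows the unbiased estimator that is widely used in code-generation and reasoning evaluation: with n samples per problem, of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The abstract does not state which estimator the authors use, so treating this as their exact metric is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples (c of them correct), is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every size-k draw contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 16 samples for one problem, 2 of them correct.
    print(pass_at_k(n=16, c=2, k=1))
    print(pass_at_k(n=16, c=2, k=16))
```

Because the metric rewards having at least one correct sample among k, a model whose samples follow more varied reasoning paths can score higher on pass@k even when its single-sample accuracy is unchanged, which is the effect the 1PNS paradigm targets.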

[24] On the Influence of Discourse Relations in Persuasive Texts

Nawar Turk,Sevag Kaspar,Leila Kosseim

Main category: cs.CL

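TL;DR: Using large language models and prompt engineering, the study analyzes the link between persuasion techniques and discourse relations, builds silver datasets annotated with both, and finds that six discourse relations play a key role in persuasive texts.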

Details Motivation: No dataset is annotated with both persuasion techniques and discourse relations; the paper uses LLMs to fill this gap and to explore how the two relate, with potential applications in identifying online propaganda and misinformation. Method: Starting from the SemEval 2023 Task 3 dataset, uses four LLMs and ten prompts to build 40 discourse relation classifiers, then ensembles them with several majority-pooling strategies to produce five silver datasets. Result: Produces five silver datasets ranging from 204 to 1,281 instances; statistical analysis shows that six discourse relations (Cause, Purpose, Contrast, Cause+Belief, Concession, and Condition) are important in persuasion techniques such as Loaded Language, Exaggeration/Minimisation, Repetition, and Doubt. Conclusion: Six specific discourse relations play a significant role in persuasive texts; this finding can help detect online propaganda and misinformation and deepens our understanding of effective communication. Abstract: This paper investigates the relationship between Persuasion Techniques (PTs) and Discourse Relations (DRs) by leveraging Large Language Models (LLMs) and prompt engineering. Since no dataset annotated with both PTs and DRs exists, we took the SemEval 2023 Task 3 dataset labelled with 19 PTs as a starting point and developed LLM-based classifiers to label each instance of the dataset with one of the 22 PDTB 3.0 level-2 DRs. In total, four LLMs were evaluated using 10 different prompts, resulting in 40 unique DR classifiers. Ensemble models using different majority-pooling strategies were used to create 5 silver datasets of instances labelled with both persuasion techniques and level-2 PDTB senses. The silver dataset sizes vary from 1,281 instances to 204 instances, depending on the majority pooling technique used. Statistical analysis of these silver datasets shows that six discourse relations (namely Cause, Purpose, Contrast, Cause+Belief, Concession, and Condition) play a crucial role in persuasive texts, especially in the use of Loaded Language, Exaggeration/Minimisation, Repetition and to cast Doubt. This insight can contribute to detecting online propaganda and misinformation, as well as to our general understanding of effective communication.

[25] MossNet: Mixture of State-Space Experts is a Multi-Head Attention

Shikhar Tuli,James Seale Smith,Haris Jeelani,Chi-Heng Lin,Abhishek Patel,Vasili Ramanishka,Yen-Chang Hsu,Hongxia Jin

Main category: cs.CL

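TL;DR: Proposes MossNet, a mixture-of-state-space-experts architecture that emulates linear multi-head attention. It outperforms comparable Transformer and SSM architectures on language modeling and downstream tasks while remaining efficient and scalable.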

Details Motivation: Existing SSM/GRM-based methods usually emulate only a single attention head, which limits expressiveness; a more expressive yet efficient alternative is needed. Method: Proposes MossNet, a mixture-of-state-space-experts architecture that applies MoE both in the channel-mixing MLP blocks and in the time-mixing SSM kernels, realizing multiple "attention heads" and emulating linear multi-head attention. Result: At the same model size and data budget, MossNet outperforms Transformer and SSM baselines on language modeling and downstream tasks; larger variants trained on trillions of tokens scale well; profiling on a Samsung Galaxy S24 Ultra and an Nvidia A100 shows faster runtime and better resource usage. Conclusion: MossNet is a promising new direction for efficient, high-performing recurrent LLM architectures. Abstract: Large language models (LLMs) have significantly advanced generative applications in natural language processing (NLP). Recent trends in model architectures revolve around efficient variants of transformers or state-space/gated-recurrent models (SSMs, GRMs). However, prevailing SSM/GRM-based methods often emulate only a single attention head, potentially limiting their expressiveness. In this work, we propose MossNet, a novel mixture-of-state-space-experts architecture that emulates a linear multi-head attention (MHA). MossNet leverages a mixture-of-experts (MoE) implementation not only in channel-mixing multi-layered perceptron (MLP) blocks but also in the time-mixing SSM kernels to realize multiple "attention heads." Extensive experiments on language modeling and downstream evaluations show that MossNet outperforms both transformer- and SSM-based architectures of similar model size and data budgets. Larger variants of MossNet, trained on trillions of tokens, further confirm its scalability and superior performance. In addition, real-device profiling on a Samsung Galaxy S24 Ultra and an Nvidia A100 GPU demonstrates favorable runtime speed and resource usage compared to similarly sized baselines. Our results suggest that MossNet is a compelling new direction for efficient, high-performing recurrent LLM architectures.

[26] Similarity-Distance-Magnitude Language Models

Allen Schmaltz

Main category: cs.CL

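TL;DR: Proposes language models built on a Similarity-Distance-Magnitude (SDM) activation layer. Existing decoder-only Transformer models are converted into SDM language models through supervised fine-tuning, improving the calibration and statistical efficiency of their generations.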

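Details Motivation: Aims to improve the calibration of language models on instruction-following and to reduce abstentions during generation, thereby improving reliability and efficiency. Method: Adds an SDM activation layer for binary classification on top of a pre-trained decoder-only Transformer, and adjusts the next-token prediction loss through supervised fine-tuning with a contrastive input encoding scheme and hard negative examples generated online. Result: Compared with strong supervised baselines, SDM language models markedly reduce abstentions, raising the proportion of generations in the high-probability region and improving statistical efficiency. Conclusion: Through a structured activation layer and an improved training procedure, SDM language models improve the quality and reliability of generations on instruction-following tasks. Abstract: We introduce Similarity-Distance-Magnitude (SDM) language models (LMs), which are sequence prediction models fine-tuned to maximize the proportion of generations in the well-calibrated, high-probability region partitioned by a final-layer SDM activation layer used for binary classification of instruction-following. We demonstrate that existing pre-trained decoder-only Transformer LMs can be readily converted into SDM LMs via supervised fine-tuning, using the final-layer SDM activation layer during training to estimate a change-of-base for a supervised next-token loss over a contrastive input encoding scheme, with additional hard negative examples generated online during training. This results in reduced abstentions (i.e., improved statistical efficiency) compared to strong supervised baselines.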

[27] RCScore: Quantifying Response Consistency in Large Language Models

Dongjun Jang,Youngchae Ahn,Hyopil Shin

Main category: cs.CL

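TL;DR: Proposes RCScore, a framework that systematically evaluates how instruction style affects LLM performance. Changing the instruction style can shift accuracy by up to 16.7 percentage points, and Cross-Response Similarity (CRS) is introduced as a proxy for model reliability.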

Details Motivation: Current LLM evaluations usually rely on a single instruction template and overlook models' sensitivity to instruction style, so they poorly reflect real-world behavior; a multi-dimensional framework that measures the effect of instruction style is needed. Method: Builds the RCScore framework, which converts benchmark problems into multiple instruction styles and analyzes how model responses change across them; proposes Cross-Response Similarity (CRS) to measure output consistency under different instructions; runs experiments on ten LLMs and four reasoning benchmarks. Result: Instruction style can change accuracy by up to 16.7 percentage points; CRS correlates strongly with task accuracy; deterministic decoding improves stylistic stability; larger models are more consistent across styles. Conclusion: RCScore is an effective tool for assessing LLM robustness to instruction style, and shows that instruction robustness and output consistency are important dimensions of model reliability. Abstract: Current LLM evaluations often rely on a single instruction template, overlooking models' sensitivity to instruction style, a critical aspect for real-world deployments. We present RCScore, a multi-dimensional framework quantifying how instruction formulation affects model responses. By systematically transforming benchmark problems into multiple instruction styles, RCScore reveals performance variations undetected by conventional metrics. Our experiments across ten LLMs on four reasoning benchmarks demonstrate that instruction style can shift accuracy by up to 16.7 percentage points. We introduce Cross-Response Similarity (CRS), a method applying RCScore metrics to measure stylistic self-consistency, and establish its strong correlation with task accuracy, suggesting consistency as a valuable proxy for model reliability. Additional findings show that deterministic decoding produces more stylistically stable outputs, and model scale correlates positively with cross-style consistency. RCScore offers a principled approach to assess instruction robustness.

[28] Don't Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation

Woojin Kim,Jaeyoung Do

Main category: cs.CL

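TL;DR: Proposes Token Timestep Allocation (TTA), which assigns token-specific timestep schedules to mitigate "update forgetting" in diffusion language models, improving the controllability and fluency of generated text.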

Details Motivation: Diffusion language models show promise for fine-grained refinement, but their controllability is fragile, mainly because uniform, context-agnostic updates erase semantic edits across timesteps, a failure mode called "update forgetting". Method: Proposes Token Timestep Allocation (TTA), which freezes critical tokens early and keeps refining uncertain tokens, giving a soft, semantics-aware token ordering based on timesteps; it supports fixed or adaptive policies and is applied only at inference time. Result: On sentiment control, TTA improves accuracy by more than 20 percent and nearly halves perplexity while using less than one fifth of the steps; on detoxification, maximum toxicity drops from 14.5 to 12.2 and perplexity from 32.0 to 26.0. Conclusion: Soft ordering through timestep allocation is the key mechanism for mitigating update forgetting and achieving stable, controllable diffusion text generation. Abstract: While diffusion language models (DLMs) enable fine-grained refinement, their practical controllability remains fragile. We identify and formally characterize a central failure mode called update forgetting, in which uniform and context agnostic updates induce token level fluctuations across timesteps, erasing earlier semantic edits and disrupting the cumulative refinement process, thereby degrading fluency and coherence. As this failure originates in uniform and context agnostic updates, effective control demands explicit token ordering. We propose Token Timestep Allocation (TTA), which realizes soft and semantic token ordering via per token timestep schedules: critical tokens are frozen early, while uncertain tokens receive continued refinement. This timestep based ordering can be instantiated as either a fixed policy or an adaptive policy driven by task signals, thereby supporting a broad spectrum of refinement strategies. Because it operates purely at inference time, it applies uniformly across various DLMs and naturally extends to diverse supervision sources. Empirically, TTA improves controllability and fluency: on sentiment control, it yields more than 20 percent higher accuracy and nearly halves perplexity using less than one fifth the steps; in detoxification, it lowers maximum toxicity (12.2 versus 14.5) and perplexity (26.0 versus 32.0). Together, these results demonstrate that softened ordering via timestep allocation is the critical lever for mitigating update forgetting and achieving stable and controllable diffusion text generation.
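
The per-token schedule idea ("critical tokens are frozen early, while uncertain tokens receive continued refinement") can be sketched as follows. The abstract does not specify how a schedule is derived, so the linear mapping from a per-token confidence score to a stop step, as well as the names `allocate_stop_steps` and `update_mask`, are assumptions made purely for illustration.

```python
import math


def allocate_stop_steps(confidence, total_steps, min_steps=1):
    """Map each token's confidence in [0, 1] to the step at which it freezes.

    Assumption: a linear schedule. Confidence 1.0 freezes after `min_steps`
    updates; confidence 0.0 keeps refining for all `total_steps`.
    """
    stops = []
    for c in confidence:
        c = min(max(c, 0.0), 1.0)  # clamp to [0, 1]
        stops.append(max(min_steps, math.ceil(total_steps * (1.0 - c))))
    return stops


def update_mask(stops, step):
    """True for tokens that may still be updated at this denoising step."""
    return [step < s for s in stops]


if __name__ == "__main__":
    # Hypothetical confidences: a critical token, a moderately certain
    # token, and a very uncertain token.
    stops = allocate_stop_steps([1.0, 0.5, 0.0], total_steps=10)
    for step in (0, 4, 9):
        print(step, update_mask(stops, step))
```

A fixed policy would compute `stops` once before sampling; an adaptive policy, in the sense the abstract describes, would recompute the confidences from task signals during generation.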

[29] What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data

Rajiv Movva,Smitha Milli,Sewon Min,Emma Pierson

Main category: cs.CL

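TL;DR: Introduces WIMHF, a method that uses sparse autoencoders to explain human feedback data. It reveals how diverse and context-dependent human preferences are across datasets, surfaces potentially unsafe preferences, and provides an effective tool for data curation and fine-grained personalization.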

Details Motivation: Human feedback data is poorly understood, so training on it can alter language models in unpredictable and undesirable ways. Prior work is usually limited to preferences over predefined attributes and lacks a way to extract the relevant features automatically. Method: Proposes WIMHF, which uses sparse autoencoders to extract interpretable features from human feedback data and characterizes both the preferences a dataset can measure and the preferences annotators actually express. Result: Validated on 7 datasets; a small number of interpretable features account for most of the preference prediction signal of black-box models; the features surface specific preferences such as informal tone, humor, and aversion to refusals; data curation yields a 37% safety gain with no loss in general performance; the features support fine-grained personalization based on annotator characteristics. Conclusion: WIMHF offers a human-centered analysis method that helps practitioners understand and use preference data, improving model safety and personalization. Abstract: Human feedback can alter language models in unpredictable and undesirable ways, as practitioners lack a clear understanding of what feedback data encodes. While prior work studies preferences over certain attributes (e.g., length or sycophancy), automatically extracting relevant features without pre-specifying hypotheses remains challenging. We introduce What's In My Human Feedback? (WIMHF), a method to explain feedback data using sparse autoencoders. WIMHF characterizes both (1) the preferences a dataset is capable of measuring and (2) the preferences that the annotators actually express. Across 7 datasets, WIMHF identifies a small number of human-interpretable features that account for the majority of the preference prediction signal achieved by black-box models. These features reveal a wide diversity in what humans prefer, and the role of dataset-level context: for example, users on Reddit prefer informality and jokes, while annotators in HH-RLHF and PRISM disprefer them. WIMHF also surfaces potentially unsafe preferences, such as that LMArena users tend to vote against refusals, often in favor of toxic content. The learned features enable effective data curation: re-labeling the harmful examples in Arena yields large safety gains (+37%) with no cost to general performance. They also allow fine-grained personalization: on the Community Alignment dataset, we learn annotator-specific weights over subjective features that improve preference prediction. WIMHF provides a human-centered analysis method for practitioners to better understand and use preference data.

[30] Towards Global Retrieval Augmented Generation: A Benchmark for Corpus-Level Reasoning

Qi Luo,Xiaonan Li,Tingshuo Fan,Xinchi Chen,Xipeng Qiu

Main category: cs.CL

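TL;DR: Introduces GlobalQA, the first benchmark for evaluating global retrieval-augmented generation (global RAG), and proposes the GlobalRAG framework, which substantially outperforms existing methods across tasks.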

Details Motivation: Existing RAG benchmarks focus on local retrieval and cannot measure a model's ability to aggregate and analyze information across an entire corpus, which many real-world applications require. Method: Designs the GlobalQA benchmark, covering four task types (counting, extremum queries, sorting, and top-k extraction), and proposes the GlobalRAG framework, which combines chunk-level retrieval, LLM-driven intelligent filters, and aggregation modules for structured information integration and precise symbolic computation. Result: Existing RAG methods perform poorly on GlobalQA (the strongest baseline reaches 1.51 F1); GlobalRAG reaches 6.63 F1 with Qwen2.5-14B, a substantial improvement. Conclusion: GlobalRAG improves LLM performance on global RAG tasks and confirms the value of multi-tool collaboration and structured aggregation for complex cross-document reasoning. Abstract: Retrieval-augmented generation (RAG) has emerged as a leading approach to reducing hallucinations in large language models (LLMs). Current RAG evaluation benchmarks primarily focus on what we call local RAG: retrieving relevant chunks from a small subset of documents to answer queries that require only localized understanding within specific text chunks. However, many real-world applications require a fundamentally different capability -- global RAG -- which involves aggregating and analyzing information across entire document collections to derive corpus-level insights (for example, "What are the top 10 most cited papers in 2023?"). In this paper, we introduce GlobalQA -- the first benchmark specifically designed to evaluate global RAG capabilities, covering four core task types: counting, extremum queries, sorting, and top-k extraction. Through systematic evaluation across different models and baselines, we find that existing RAG methods perform poorly on global tasks, with the strongest baseline achieving only 1.51 F1 score. To address these challenges, we propose GlobalRAG, a multi-tool collaborative framework that preserves structural coherence through chunk-level retrieval, incorporates LLM-driven intelligent filters to eliminate noisy documents, and integrates aggregation modules for precise symbolic computation. On the Qwen2.5-14B model, GlobalRAG achieves 6.63 F1 compared to the strongest baseline's 1.51 F1, validating the effectiveness of our method.
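
The role of an aggregation module doing "precise symbolic computation" can be illustrated with a small sketch: once relevant records have been retrieved and filtered, counting and top-k questions are answered by ordinary code instead of by the language model. The record fields (`title`, `year`, `citations`) and the helper names are hypothetical and are not taken from GlobalRAG.

```python
from heapq import nlargest


def count_matching(records, predicate):
    """Counting task: how many records satisfy the filter."""
    return sum(1 for r in records if predicate(r))


def top_k(records, key, k, predicate=lambda r: True):
    """Top-k extraction: the k filtered records with the largest key value."""
    return nlargest(k, (r for r in records if predicate(r)), key=key)


if __name__ == "__main__":
    # Hypothetical structured records extracted from retrieved chunks.
    papers = [
        {"title": "A", "year": 2023, "citations": 10},
        {"title": "B", "year": 2023, "citations": 30},
        {"title": "C", "year": 2022, "citations": 99},
    ]

    def in_2023(record):
        return record["year"] == 2023

    print(count_matching(papers, in_2023))
    best = top_k(papers, key=lambda r: r["citations"], k=2, predicate=in_2023)
    print([p["title"] for p in best])
```

Extremum queries and sorting follow the same pattern with `max`, `min`, or `sorted`; the design point is that the answer depends on every matching record in the corpus, which is why retrieving a handful of chunks is not enough for these tasks.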

[31] Pragmatic Theories Enhance Understanding of Implied Meanings in LLMs

Takuma Sato,Seiya Kawano,Koichiro Yoshino

Main category: cs.CL

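TL;DR: Proposes giving language models pragmatic theories as prompts to improve their understanding of implied meanings; experiments show the approach clearly outperforms the baseline on pragmatic reasoning tasks.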

Details Motivation: Language models need to understand implied meanings better, and existing zero-shot reasoning methods have limited effect, so the paper explores pragmatic theories as a way to strengthen model reasoning. Method: Presents overviews of Gricean pragmatics, Relevance Theory, and similar theories as prompts that guide the model through step-by-step reasoning; also tests the effect of merely mentioning the theory names. Result: Compared with zero-shot chain-of-thought prompting without theories, the new method improves scores on pragmatic reasoning tasks by up to 9.6%; merely naming the theories yields a 1-3% gain in larger models. Conclusion: Incorporating pragmatic theories into prompts is an effective in-context learning approach that helps language models understand implied meanings. Abstract: The ability to accurately interpret implied meanings plays a crucial role in human communication and language use, and language models are also expected to possess this capability. This study demonstrates that providing language models with pragmatic theories as prompts is an effective in-context learning approach for tasks to understand implied meanings. Specifically, we propose an approach in which an overview of pragmatic theories, such as Gricean pragmatics and Relevance Theory, is presented as a prompt to the language model, guiding it through a step-by-step reasoning process to derive a final interpretation. Experimental results showed that, compared to the baseline, which prompts intermediate reasoning without presenting pragmatic theories (0-shot Chain-of-Thought), our methods enabled language models to achieve up to 9.6% higher scores on pragmatic reasoning tasks. Furthermore, we show that even without explaining the details of pragmatic theories, merely mentioning their names in the prompt leads to a certain performance improvement (around 1-3%) in larger models compared to the baseline.

[32] Language Models Are Borrowing-Blind: A Multilingual Evaluation of Loanword Identification across 10 Languages

Mérilin Sousa Silva,Sina Ahmadi

Main category: cs.CL

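TL;DR: Investigates whether pretrained language models can identify loanwords and finds that, across 10 languages, models distinguish loanwords from native words poorly even with explicit instructions and contextual information.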

Details Motivation: Examines whether pretrained language models can distinguish loanwords from native words as bilingual speakers do, particularly where a dominant language exerts lexical influence on a minority language. Method: Evaluates multiple pretrained language models, including large language models, on 10 languages, testing loanword identification with explicit instructions and contextual information. Result: Models perform poorly at distinguishing loanwords from native words, indicating that current NLP systems are biased toward loanwords over native equivalents. Conclusion: Modern NLP systems show a loanword bias when handling minority languages, which has important implications for building NLP tools for minority languages and for language preservation. Abstract: Throughout language history, words are borrowed from one language to another and gradually become integrated into the recipient's lexicon. Speakers can often differentiate these loanwords from native vocabulary, particularly in bilingual communities where a dominant language continuously imposes lexical items on a minority language. This paper investigates whether pretrained language models, including large language models, possess similar capabilities for loanword identification. We evaluate multiple models across 10 languages. Despite explicit instructions and contextual information, our results show that models perform poorly in distinguishing loanwords from native ones. These findings corroborate previous evidence that modern NLP systems exhibit a bias toward loanwords rather than native equivalents. Our work has implications for developing NLP tools for minority languages and supporting language preservation in communities under lexical pressure from dominant languages.

[33] Distilling Multilingual Vision-Language Models: When Smaller Models Stay Multilingual

Sukrit Sriratanawilai,Jhayahgrit Thongwat,Romrawin Chumpu,Patomporn Payoungkhamdee,Sarana Nutanong,Peerat Limkonchotiwat

Main category: cs.CL

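TL;DR: Studies knowledge distillation for compressing multilingual vision-language models, evaluating how five distillation methods affect cross-lingual representation consistency and downstream stability. Some configurations preserve or even improve multilingual retrieval robustness while reducing model size.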

Details Motivation: Vision-language models perform unevenly across languages, and the problem worsens as models shrink, yet knowledge distillation remains underexplored in multilingual settings. Method: Runs controlled experiments comparing five knowledge distillation methods on CLIP and SigLIP2, evaluating cross-lingual consistency and performance stability on in-domain retrieval and out-of-domain visual question answering. Result: Some distillation configurations preserve or improve multilingual retrieval performance even when model size is halved, but others fail to maintain cross-task stability, exposing design-sensitive trade-offs that aggregate accuracy alone does not reveal. Conclusion: Knowledge distillation is promising for compressing multilingual VLMs, but its effect depends heavily on the configuration, so cross-lingual consistency and task stability must be considered together. Abstract: Vision-language models (VLMs) exhibit uneven performance across languages, a problem that is often exacerbated when the model size is reduced. While knowledge distillation (KD) demonstrates promising results in transferring knowledge from larger to smaller VLMs, applying KD in multilingual settings is an underexplored area. This paper presents a controlled empirical study of KD behavior across five distillation approaches, isolating their effects on cross-lingual representation consistency and downstream performance stability under model compression. We study five distillation formulations across CLIP and SigLIP2, and evaluate them on in-domain retrieval and out-of-domain visual QA. We find that some configurations preserve or even improve multilingual retrieval robustness despite halving model size, but others fail to maintain cross-task stability, exposing design-sensitive trade-offs that aggregate accuracy alone does not reveal.

[34] Do LLMs Signal When They're Right? Evidence from Neuron Agreement

Kang Chen,Yaoning Wang,Kai Xiong,Zhuoka Feng,Wenhe Sun,Haotian Chen,Yixin Cao

Main category: cs.CL

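TL;DR: Proposes Neuron Agreement Decoding (NAD), an unsupervised decoding method based on neuron activations. It uses internal signals for efficient, reliable label-free ensemble inference, performs well on math, science, and coding tasks, and sharply reduces compute.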

Details Motivation: Existing candidate-scoring methods rely on external outputs (such as probabilities and entropies) that are poorly calibrated after post-training, and they make little use of the model's internal behavior. Method: Analyzes neuron activations during LLM generation and finds that correct responses activate fewer unique neurons and show stronger cross-sample agreement; on this basis proposes NAD, which selects the best candidate using activation sparsity and neuron agreement as internal signals. Result: NAD matches majority voting on math and science benchmarks; it outperforms Avg@64 on open-ended coding tasks; it predicts correctness within the first 32 tokens and supports aggressive early stopping, cutting token usage by 99%. Conclusion: Internal neuron activation signals provide reliable, scalable, and efficient guidance for label-free ensemble decoding, and NAD offers a new route to more efficient LLM inference. Abstract: Large language models (LLMs) commonly boost reasoning via sample-evaluate-ensemble decoders, achieving label free gains without ground truth. However, prevailing strategies score candidates using only external outputs such as token probabilities, entropies, or self evaluations, and these signals can be poorly calibrated after post training. We instead analyze internal behavior based on neuron activations and uncover three findings: (1) external signals are low dimensional projections of richer internal dynamics; (2) correct responses activate substantially fewer unique neurons than incorrect ones throughout generation; and (3) activations from correct responses exhibit stronger cross sample agreement, whereas incorrect ones diverge. Motivated by these observations, we propose Neuron Agreement Decoding (NAD), an unsupervised best-of-N method that selects candidates using activation sparsity and cross sample neuron agreement, operating solely on internal signals and without requiring comparable textual outputs. NAD enables early correctness prediction within the first 32 generated tokens and supports aggressive early stopping. Across math and science benchmarks with verifiable answers, NAD matches majority voting; on open ended coding benchmarks where majority voting is inapplicable, NAD consistently outperforms Avg@64. By pruning unpromising trajectories early, NAD reduces token usage by 99% with minimal loss in generation quality, showing that internal signals provide reliable, scalable, and efficient guidance for label free ensemble decoding.
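
A best-of-N selection driven by the two internal signals named in the abstract (activation sparsity and cross-sample neuron agreement) might look like the sketch below. How NAD actually represents activations and measures agreement is not given in the abstract; representing each candidate as a set of activated neuron ids and using Jaccard similarity are assumptions, and both function names are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two sets of activated neuron ids."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def select_by_agreement(neuron_sets):
    """Index of the candidate whose activations agree most with the others.

    Assumption: agreement is the mean Jaccard similarity to all other
    candidates; ties keep the earliest candidate.
    """
    best_idx, best_score = 0, -1.0
    for i, current in enumerate(neuron_sets):
        others = [jaccard(current, other)
                  for j, other in enumerate(neuron_sets) if j != i]
        score = sum(others) / len(others) if others else 0.0
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


def select_by_sparsity(neuron_sets):
    """Index of the candidate that activates the fewest unique neurons."""
    return min(range(len(neuron_sets)), key=lambda i: len(neuron_sets[i]))


if __name__ == "__main__":
    # Hypothetical activated-neuron sets for four sampled responses.
    candidates = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2}, {7, 8, 9, 10, 11}]
    print(select_by_agreement(candidates), select_by_sparsity(candidates))
```

Neither selector looks at the generated text, which is why this style of selection remains applicable to open-ended coding tasks where answers cannot be compared for majority voting.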

[35] Unravelling the Mechanisms of Manipulating Numbers in Language Models

Michal Štefánik,Timothee Mickus,Marek Kadlčík,Bertram Højer,Michal Spiegel,Raúl Vázquez,Aman Sinha,Josef Kuchař,Philipp Mondorf

Main category: cs.CL

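TL;DR: Examines the internal mechanisms by which large language models handle numbers. Despite output errors, different models learn systematic, highly accurate, and universal number representations, and universal probes can trace the source of errors to specific layers.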

Details Motivation: Seeks to explain an apparent contradiction: large language models have accurate input embedding representations of numbers yet often make errors when actually processing numbers. Method: Analyzes the number representations in the hidden states of several large language models, builds universal probes to quantify their accuracy, and traces the information paths that lead to output errors. Result: Different models learn interchangeable, systematic number representations that are universal across contexts; the work identifies the underlying mechanisms by which models manipulate numbers and attributes errors to specific layers. Conclusion: Pre-trained large language models have a basic capacity for manipulating numbers, and more accurate probing techniques could guide architectural refinements that improve numerical accuracy. Abstract: Recent work has shown that different large language models (LLMs) converge to similar and accurate input embedding representations for numbers. These findings conflict with the documented propensity of LLMs to produce erroneous outputs when dealing with numeric information. In this work, we aim to explain this conflict by exploring how language models manipulate numbers and quantify the lower bounds of accuracy of these mechanisms. We find that despite surfacing errors, different language models learn interchangeable representations of numbers that are systematic, highly accurate and universal across their hidden states and the types of input contexts. This allows us to create universal probes for each LLM and to trace information -- including the causes of output errors -- to specific layers. Our results lay a fundamental understanding of how pre-trained LLMs manipulate numbers and outline the potential of more accurate probing techniques in addressed refinements of LLMs' architectures.

[36] Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games

Jingran Zhang,Ning Li,Justin Cui

Main category: cs.CL

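TL;DR: This study evaluates the web interaction capabilities of OpenAI's ChatGPT Atlas in browser games. Atlas performs strongly on logical reasoning tasks such as Sudoku but poorly in real-time games that require precise timing and control.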

Details Motivation: To explore how Atlas actually performs in dynamic, interactive web environments, especially its real-time interaction abilities beyond information retrieval. Method: Browser games including T-Rex Runner, Sudoku, Flappy Bird, and Stein.world serve as test scenarios, with in-game scores as the quantitative metric. Result: Atlas solves logical tasks such as Sudoku significantly faster than human baselines, but in real-time games such as Flappy Bird it struggles to get past the initial obstacles. Conclusion: Atlas has strong analytical processing capabilities but shows clear limitations in dynamic web environments that require real-time interaction. Abstract: OpenAI's ChatGPT Atlas introduces new capabilities for web interaction, enabling the model to analyze webpages, process user intents, and execute cursor and keyboard inputs directly within the browser. While its capacity for information retrieval tasks has been demonstrated, its performance in dynamic, interactive environments remains less explored. In this study, we conduct an early evaluation of Atlas's web interaction capabilities using browser-based games as test scenarios, including Google's T-Rex Runner, Sudoku, Flappy Bird, and Stein.world. We employ in-game performance scores as quantitative metrics to assess performance across different task types. Our results show that Atlas performs strongly in logical reasoning tasks like Sudoku, completing puzzles significantly faster than human baselines, but struggles substantially in real-time games requiring precise timing and motor control, often failing to progress beyond initial obstacles. These findings suggest that while Atlas demonstrates capable analytical processing, there remain notable limitations in dynamic web environments requiring real-time interaction. The website of our project can be found at https://atlas-game-eval.github.io.

[37] SCRIBE: Structured Chain Reasoning for Interactive Behaviour Explanations using Tool Calling

Fares Fawzi,Vinitra Swamy,Dominik Glandorf,Tanya Nazaretsky,Tanja Käser

Main category: cs.CL

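TL;DR: SCRIBE is a multi-hop, tool-augmented reasoning framework for generating student feedback. Implemented on small open-source 3B and 8B models via two-stage LoRA fine-tuning, it delivers feedback quality comparable or superior to much larger models in resource-constrained, privacy-sensitive educational settings.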

Details Motivation: Real-world use of language models in education faces three challenges (privacy, limited computational resources, and pedagogical validity), which call for small open-source models that run locally and produce reliable outputs. Method: The SCRIBE framework combines domain-specific tools with a self-reflective inference pipeline that supports iterative reasoning, tool calling, and error recovery; 3B and 8B models are fine-tuned with two-stage LoRA on synthetic data generated by GPT-4o. Result: In a human-aligned GPT-Judge evaluation and a user study with 108 students, 8B-SCRIBE models match or outperform much larger models on relevance and actionability, and students perceive their quality as on par with GPT-4o and Llama-3.3 70B. Conclusion: SCRIBE demonstrates the feasibility and efficiency of small, locally run language models for low-resource, privacy-sensitive educational applications. Abstract: Language models can be used to provide interactive, personalized student feedback in educational settings. However, real-world deployment faces three key challenges: privacy concerns, limited computational resources, and the need for pedagogically valid responses. These constraints require small, open-source models that can run locally and reliably ground their outputs in correct information. We introduce SCRIBE, a framework for multi-hop, tool-augmented reasoning designed to generate valid responses to student questions about feedback reports. SCRIBE combines domain-specific tools with a self-reflective inference pipeline that supports iterative reasoning, tool use, and error recovery. We distil these capabilities into 3B and 8B models via two-stage LoRA fine-tuning on synthetic GPT-4o-generated data. Evaluation with a human-aligned GPT-Judge and a user study with 108 students shows that 8B-SCRIBE models achieve comparable or superior quality to much larger models in key dimensions such as relevance and actionability, while being perceived on par with GPT-4o and Llama-3.3 70B by students. These findings demonstrate the viability of SCRIBE for low-resource, privacy-sensitive educational applications.

[38] From Amateur to Master: Infusing Knowledge into LLMs via Automated Curriculum Learning

Nishit Neema,Srinjoy Mukherjee,Sapan Shah,Gokul Ramakrishnan,Ganesh Venkatesh

Main category: cs.CL

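TL;DR: ACER is an automated curriculum-enhancement method that turns generalist LLMs into domain experts while preserving their broad capabilities, by generating a textbook-style curriculum and question-answer pairs guided by Bloom's taxonomy.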

Details Motivation: General-purpose LLMs underperform in specialized domains that require deep, principled understanding (such as economics and psychology), so their specialization needs to improve without sacrificing generality. Method: ACER first generates a table of contents for a subject and creates QA pairs of progressively increasing difficulty guided by Bloom's taxonomy, forming a synthetic curriculum; it then performs continual pretraining with an interleaved curriculum schedule that covers both content and cognitive dimensions. Result: Experiments on Llama 3.2 show an average gain of 3 percentage points on specialized MMLU subsets and 5 percentage points in difficult domains such as microeconomics; knowledge-intensive benchmarks such as ARC and GPQA improve by over 2 points, non-target domains improve by 0.7 points, and no catastrophic forgetting is observed. Conclusion: ACER provides a scalable and effective way to substantially narrow LLMs' performance gap in specialized domains without harming general capabilities. Abstract: Large Language Models (LLMs) excel at general tasks but underperform in specialized domains like economics and psychology, which require deep, principled understanding. To address this, we introduce ACER (Automated Curriculum-Enhanced Regimen) that transforms generalist models into domain experts without sacrificing their broad capabilities. ACER first synthesizes a comprehensive, textbook-style curriculum by generating a table of contents for a subject and then creating question-answer (QA) pairs guided by Bloom's taxonomy. This ensures systematic topic coverage and progressively increasing difficulty. The resulting synthetic corpus is used for continual pretraining with an interleaved curriculum schedule, aligning learning across both content and cognitive dimensions. Experiments with Llama 3.2 (1B and 3B) show significant gains in specialized MMLU subsets. In challenging domains like microeconomics, where baselines struggle, ACER boosts accuracy by 5 percentage points. Across all target domains, we observe a consistent macro-average improvement of 3 percentage points. Notably, ACER not only prevents catastrophic forgetting but also facilitates positive cross-domain knowledge transfer, improving performance on non-target domains by 0.7 points. Beyond MMLU, ACER enhances performance on knowledge-intensive benchmarks like ARC and GPQA by over 2 absolute points, while maintaining stable performance on general reasoning tasks. Our results demonstrate that ACER offers a scalable and effective recipe for closing critical domain gaps in LLMs.

[39] MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data

Mykhailo Poliakov,Nadiya Shvai

Main category: cs.CL

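TL;DR: This paper proposes MisSynth, a pipeline that uses retrieval-augmented generation (RAG) to produce synthetic fallacy samples for fine-tuning large language models (LLMs) to better recognize scientific misinformation. On the MISSCI dataset, a fine-tuned LLaMA 3.1 8B model improves its F1 score by over 35 points (absolute) relative to its baseline, indicating that synthetic data can substantially improve zero-shot classification performance.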

Details Motivation: Scientific and health-related misinformation is widespread and harmful, and it is especially hard to identify when it distorts or misinterprets research findings, so LLMs need to become better at recognizing such fallacious arguments. Method: The MisSynth framework uses retrieval-augmented generation (RAG) to produce synthetic fallacy samples, which are then used for lightweight fine-tuning of an LLM to improve scientific misinformation detection. Result: On the MISSCI test split, the fine-tuned LLaMA 3.1 8B model improves its F1 score by over 35 points (absolute) compared with the vanilla model, confirming that synthetic data improves zero-shot classification performance, even with limited computational resources. Conclusion: Augmenting training samples with synthetic data substantially improves LLMs' ability to recognize scientific fallacies even when annotated data and compute are limited, offering an effective technical path for countering health misinformation. Abstract: Health-related misinformation is very prevalent and potentially harmful. It is difficult to identify, especially when claims distort or misinterpret scientific findings. We investigate the impact of synthetic data generation and lightweight fine-tuning techniques on the ability of large language models (LLMs) to recognize fallacious arguments using the MISSCI dataset and framework. In this work, we propose MisSynth, a pipeline that applies retrieval-augmented generation (RAG) to produce synthetic fallacy samples, which are then used to fine-tune an LLM. Our results show substantial accuracy gains with fine-tuned models compared to vanilla baselines. For instance, the fine-tuned LLaMA 3.1 8B model achieved an absolute F1-score improvement of over 35% on the MISSCI test split over its vanilla baseline. We demonstrate that introducing synthetic fallacy data to augment limited annotated resources can significantly enhance zero-shot LLM classification performance on real-world scientific misinformation tasks, even with limited computational resources. The code and synthetic dataset are available at https://github.com/mxpoliakov/MisSynth.

[40] The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration

Kotaro Furuya,Yuichi Kitagawa

Main category: cs.CL

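TL;DR: Proposes an interaction-based framework for automatic team composition that builds a "language model graph" and applies community detection to find synergistic model clusters, achieving performance comparable to manually curated teams without any prior knowledge.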

Details Motivation: Because the internal characteristics of most LLMs are opaque, forming optimal multi-agent teams is difficult, which calls for an automated team composition method that needs no prior knowledge. Method: A "language model graph" is built from the semantic coherence of pairwise conversations, and community detection identifies synergistic groups of models, yielding teams automatically. Result: Experiments across diverse LLMs show that the method discovers functionally coherent groups, outperforms random baselines on downstream tasks, and approaches the performance of manually assembled teams based on known model specializations. Conclusion: The study provides a new basis for the automated design of collaborative multi-agent LLM teams. Abstract: While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition. However, forming optimal teams is a significant challenge, as the inherent opacity of most models obscures the internal characteristics necessary for effective collaboration. In this paper, we propose an interaction-centric framework for automatic team composition that does not require any prior knowledge of the models, including their internal architectures, training data, or task performance. Our method constructs a "language model graph" that maps relationships between models from the semantic coherence of pairwise conversations, and then applies community detection to identify synergistic model clusters. Our experiments with diverse LLMs demonstrate that the proposed method discovers functionally coherent groups that reflect their latent specializations. Priming conversations with specific topics identified synergistic teams which outperform random baselines on downstream benchmarks and achieve comparable accuracy to that of manually curated teams based on known model specializations. Our findings provide a new basis for the automated design of collaborative multi-agent LLM teams.
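
The graph-then-cluster idea in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions: the model names and coherence scores are made up, and connected components over thresholded edges is a deliberately simple stand-in for a real community detection algorithm, which the abstract does not specify.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cluster_models(coherence: Dict[Tuple[str, str], float],
                   threshold: float) -> List[List[str]]:
    """Group models whose pairwise conversation coherence meets the threshold."""
    graph = defaultdict(set)
    nodes = set()
    for (a, b), score in coherence.items():
        nodes.update((a, b))
        if score >= threshold:  # keep only strong edges
            graph[a].add(b)
            graph[b].add(a)

    seen, clusters = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical pairwise coherence scores between four models.
scores = {
    ("math-a", "math-b"): 0.90,
    ("code-a", "code-b"): 0.85,
    ("math-a", "code-a"): 0.30,
    ("math-b", "code-b"): 0.20,
}
print(cluster_models(scores, threshold=0.5))
```

A modularity-based method would replace the component search when the graph is dense and a hard threshold is too blunt.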

[41] On the Role of Context for Discourse Relation Classification in Scientific Writing

Stephen Wan,Wei Liu,Michael Strube

Main category: cs.CL

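TL;DR: This paper examines the task of inferring discourse structure in scientific writing, studying pretrained language models (PLMs) and large language models (LLMs) for Discourse Relation Classification (DRC) in scientific publications. Context is generally helpful for DRC, and the paper analyzes which scientific discourse relation types benefit most from it.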

Details Motivation: As generative AI is increasingly used in science workflows, a key question is how to use discourse-level information to find supporting evidence for AI-generated scientific claims, which makes inferring discourse structure in scientific writing an important task. Method: The authors run Discourse Relation Classification (DRC) experiments with pretrained language models (PLMs) and large language models (LLMs), focusing on the role of context (as defined by discourse structure) in scientific publications and assessing how much different relation types depend on it. Result: Context generally improves DRC performance, and certain scientific discourse relation types benefit from it more than others. Conclusion: In scientific text, the context provided by discourse structure effectively supports discourse relation classification, and future work could tailor context modeling to specific relation types. Abstract: With the increasing use of generative Artificial Intelligence (AI) methods to support science workflows, we are interested in the use of discourse-level information to find supporting evidence for AI-generated scientific claims. A first step towards this objective is to examine the task of inferring discourse structure in scientific writing. In this work, we present a preliminary investigation of pretrained language model (PLM) and Large Language Model (LLM) approaches for Discourse Relation Classification (DRC), focusing on scientific publications, an under-studied genre for this task. We examine how context can help with the DRC task, with our experiments showing that context, as defined by discourse structure, is generally helpful. We also present an analysis of which scientific discourse relation types might benefit most from context.

[42] OmniEduBench: A Comprehensive Chinese Benchmark for Evaluating Large Language Models in Education

Min Zhang,Hao Chen,Hao Chen,Wenqi Zhang,Didi Zhu,Xin Lin,Bo Jiang,Aimin Zhou,Fei Wu,Kun Kuang

Main category: cs.CL

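TL;DR: This paper introduces OmniEduBench, a comprehensive Chinese educational benchmark with 24.602K high-quality question-answer pairs covering two dimensions (knowledge and cultivation) and multiple question types. Experiments show that current LLMs still fall well short in educational capability.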

Details Motivation: Existing LLMs and their benchmarks focus mainly on knowledge and neglect the cultivation capabilities needed in real educational scenarios; most benchmarks are also limited to a single subject or question type and lack diversity, a problem that is especially pronounced in the Chinese context. Method: The authors build OmniEduBench, a Chinese educational benchmark with 24.602K high-quality question-answer pairs split into a knowledge dimension (18.121K) and a cultivation dimension (6.481K); each dimension has 6 fine-grained categories, covering 61 subjects and 11 common exam question types. Result: Across 11 mainstream LLMs, only Gemini-2.5 Pro exceeds 60% accuracy in the knowledge dimension, and in the cultivation dimension the best model, QWQ, still trails humans by nearly 30%. Conclusion: Current LLMs have substantial room for improvement in educational applications, and OmniEduBench offers an effective tool for comprehensively evaluating models in realistic educational scenarios. Abstract: With the rapid development of large language models (LLMs), various LLM-based works have been widely applied in educational fields. However, most existing LLMs and their benchmarks focus primarily on the knowledge dimension, largely neglecting the evaluation of cultivation capabilities that are essential for real-world educational scenarios. Additionally, current benchmarks are often limited to a single subject or question type, lacking sufficient diversity. This issue is particularly prominent within the Chinese context. To address this gap, we introduce OmniEduBench, a comprehensive Chinese educational benchmark. OmniEduBench consists of 24.602K high-quality question-answer pairs. The data is meticulously divided into two core dimensions: the knowledge dimension and the cultivation dimension, which contain 18.121K and 6.481K entries, respectively. Each dimension is further subdivided into 6 fine-grained categories, covering a total of 61 different subjects (41 in the knowledge dimension and 20 in the cultivation dimension). Furthermore, the dataset features a rich variety of question formats, including 11 common exam question types, providing a solid foundation for comprehensively evaluating LLMs' capabilities in education. Extensive experiments on 11 mainstream open-source and closed-source LLMs reveal a clear performance gap. In the knowledge dimension, only Gemini-2.5 Pro surpassed 60% accuracy, while in the cultivation dimension, the best-performing model, QWQ, still trailed human intelligence by nearly 30%. These results highlight the substantial room for improvement and underscore the challenges of applying LLMs in education.

[43] 1+1>2: A Synergistic Sparse and Low-Rank Compression Method for Large Language Models

Zeliang Zong,Kai Zhang,Zheyang Li,Wenming Tan,Ye Ren,Yiyan Zhai,Jilin Hu

Main category: cs.CL

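TL;DR: Proposes SSLC, a synergistic sparse and low-rank compression method for efficiently compressing large language models, which substantially improves compression quality and inference speed without additional training.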

Details Motivation: LLM adoption is constrained by high bandwidth and computational demands, and existing pruning and low-rank methods have not been combined or jointly optimized. Method: Low-rank approximation and sparse optimization are formulated as a unified problem and solved jointly with an iterative optimization algorithm, yielding structured compression. Result: On LLaMA and Qwen2.5 models (7B-70B), SSLC compresses by 50% with no performance drop and speeds up inference by at least 1.63×, outperforming pruning or low-rank approximation used alone. Conclusion: SSLC effectively combines the strengths of sparsity and low rank, offering a practical, state-of-the-art solution for efficient LLM deployment. Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in language comprehension and generation; however, their widespread adoption is constrained by substantial bandwidth and computational demands. While pruning and low-rank approximation have each demonstrated promising performance individually, their synergy for LLMs remains underexplored. We introduce a Synergistic Sparse and Low-Rank Compression (SSLC) method for LLMs, which leverages the strengths of both techniques: low-rank approximation compresses the model by retaining its essential structure with minimal information loss, whereas sparse optimization eliminates non-essential weights, preserving those crucial for generalization. Based on theoretical analysis, we first formulate the low-rank approximation and sparse optimization as a unified problem and solve it with an iterative optimization algorithm. Experiments on LLaMA and Qwen2.5 models (7B-70B) show that SSLC, without any additional training steps, consistently surpasses standalone methods, achieving state-of-the-art results. Notably, SSLC compresses Qwen2.5 by 50% with no performance drop and achieves at least 1.63× speedup, offering a practical solution for efficient LLM deployment.

[44] Bayesian Network Fusion of Large Language Models for Sentiment Analysis

Rasoul Amirzadeh,Dhananjay Thiruvady,Fatemeh Shiri

Main category: cs.CL

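TL;DR: Proposes the Bayesian network LLM fusion (BNLF) framework, which combines several LLMs through a probabilistic mechanism for sentiment analysis and improves accuracy and robustness.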

Details Motivation: To address the shortcomings of existing domain-specific LLMs: lack of transparency, costly fine-tuning, heavy prompt engineering, inconsistent results across domains, and significant environmental impact. Method: The BNLF framework uses a Bayesian network to perform late fusion of sentiment predictions from three LLMs: FinBERT, RoBERTa, and BERTweet. Result: Tested on three financial corpora, BNLF improves accuracy by about 6% over the baseline models and is robust to dataset variability. Conclusion: BNLF improves both the accuracy and the interpretability of sentiment classification and is suitable for sentiment analysis across domains. Abstract: Large language models (LLMs) continue to advance, with an increasing number of domain-specific variants tailored for specialised tasks. However, these models often lack transparency and explainability, can be costly to fine-tune, require substantial prompt engineering, yield inconsistent results across domains, and impose significant adverse environmental impact due to their high computational demands. To address these challenges, we propose the Bayesian network LLM fusion (BNLF) framework, which integrates predictions from three LLMs, including FinBERT, RoBERTa, and BERTweet, through a probabilistic mechanism for sentiment analysis. BNLF performs late fusion by modelling the sentiment predictions from multiple LLMs as probabilistic nodes within a Bayesian network. Evaluated across three human-annotated financial corpora with distinct linguistic and contextual characteristics, BNLF demonstrates consistent gains of about six percent in accuracy over the baseline LLMs, underscoring its robustness to dataset variability and the effectiveness of probabilistic fusion for interpretable sentiment classification.
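
To make the late-fusion idea above concrete, here is a minimal sketch of probabilistic fusion over three models' predicted labels. It assumes the simplest possible network structure (each model's prediction depends only on the true sentiment, i.e. a naive-Bayes layout) and invented conditional probability tables; the abstract does not state the actual BNLF structure or parameters, so none of the numbers below come from the paper.

```python
from typing import Dict

LABELS = ("negative", "neutral", "positive")


def confusion(accuracy: float) -> Dict[str, Dict[str, float]]:
    """Hypothetical table P(predicted | true): `accuracy` on the diagonal, rest split evenly."""
    off = (1.0 - accuracy) / (len(LABELS) - 1)
    return {t: {p: (accuracy if p == t else off) for p in LABELS} for t in LABELS}


def fuse(predictions: Dict[str, str],
         cpts: Dict[str, Dict[str, Dict[str, float]]],
         prior: Dict[str, float]) -> Dict[str, float]:
    """Posterior over the true sentiment given each model's predicted label."""
    scores = {}
    for true_label in LABELS:
        p = prior[true_label]
        for model, predicted in predictions.items():
            p *= cpts[model][true_label][predicted]
        scores[true_label] = p
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


# Assumed per-model reliabilities; these values are illustrative only.
cpts = {"FinBERT": confusion(0.8), "RoBERTa": confusion(0.7), "BERTweet": confusion(0.6)}
prior = {label: 1.0 / len(LABELS) for label in LABELS}

posterior = fuse(
    {"FinBERT": "positive", "RoBERTa": "positive", "BERTweet": "negative"},
    cpts,
    prior,
)
print(max(posterior, key=posterior.get), posterior)
```

Unlike a plain majority vote, the posterior also expresses how confident the fused prediction is, which is where the interpretability claim comes from.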

[45] A Multi-agent Large Language Model Framework to Automatically Assess Performance of a Clinical AI Triage Tool

Adam E. Flanders,Yifan Peng,Luciano Prevedello,Robyn Ball,Errol Colak,Prahlad Menon,George Shih,Hui-Ming Lin,Paras Lakhani

Main category: cs.CL

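TL;DR: This study evaluates whether an ensemble of several open-source large language models (LLMs) can assess a pixel-based AI triage tool more reliably than a single LLM. The ensemble is more consistent and reliable; Llama3.3:70b and GPT-4o are the best individual models, and the multi-model ensembles outperform GPT-4o alone on metrics such as MCC.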

Details Motivation: To explore whether an ensemble of multiple LLM agents can make retrospective evaluation of a clinical AI triage tool more accurate and reliable, overcoming the bias and instability of a single LLM. Method: Radiology reports for 29,766 non-contrast head CT exams from 14 hospitals, processed by a commercial intracranial hemorrhage (ICH) AI detection tool, were analyzed by an ensemble of eight open-source LLMs and an internal HIPAA-compliant version of GPT-4o. A single multi-shot prompt assessed the presence of ICH; 1,726 cases were manually reviewed, and the performance of individual models and ensembles was compared. Result: Llama3.3:70b and GPT-4o achieved the highest AUC (0.78) and the highest average precision (0.75 and 0.76, respectively). Llama3.3:70b had the highest F1 score (0.81) and recall (0.85), with precision 0.78, specificity 0.72, and MCC 0.57. Among ensembles, Full-9, Top-3, and Consensus reached MCCs of 0.571, 0.558, and 0.556, all above GPT-4o's 0.522, with no significant differences among the three (p > 0.05). Conclusion: Ensembles of medium to large open-source LLMs evaluate clinical AI triage tools retrospectively more consistently and reliably than a single LLM and can serve as an effective strategy for deriving ground-truth labels. Abstract: Purpose: The purpose of this study was to determine if an ensemble of multiple LLM agents could be used collectively to provide a more reliable assessment of a pixel-based AI triage tool than a single LLM. Methods: 29,766 non-contrast CT head exams from fourteen hospitals were processed by a commercial intracranial hemorrhage (ICH) AI detection tool. Radiology reports were analyzed by an ensemble of eight open-source LLMs and a HIPAA-compliant internal version of GPT-4o using a single multi-shot prompt that assessed for presence of ICH. 1,726 examples were manually reviewed. Performance characteristics of the eight open-source models and consensus were compared to GPT-4o. Three ideal consensus LLM ensembles were tested for rating the performance of the triage tool. Results: The cohort consisted of 29,766 head CT exam-report pairs. The highest AUC performance was achieved with Llama3.3:70b and GPT-4o (AUC = 0.78). The average precision was highest for Llama3.3:70b and GPT-4o (AP = 0.75 and 0.76). Llama3.3:70b had the highest F1 score (0.81) and recall (0.85), greater precision (0.78), specificity (0.72), and MCC (0.57). Using MCC (95% CI), the ideal combinations of LLMs were: Full-9 Ensemble 0.571 (0.552-0.591), Top-3 Ensemble 0.558 (0.537-0.579), Consensus 0.556 (0.539-0.574), and GPT-4o 0.522 (0.500-0.543). No statistically significant differences were observed between Top-3, Full-9, and Consensus (p > 0.05). Conclusion: An ensemble of medium- to large-sized open-source LLMs provides a more consistent and reliable method to derive a ground truth retrospective evaluation of a clinical AI triage tool over a single LLM alone.
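
Since the comparison above hinges on MCC and on combining several models' labels, a short sketch of both may help. The voting rule (strict majority) and the toy labels are assumptions for illustration; the study's actual ensemble definitions (Full-9, Top-3, Consensus) are not spelled out in the abstract. The MCC formula itself is the standard one: (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

```python
import math
from typing import List, Sequence


def majority_vote(votes: Sequence[int]) -> int:
    """Return 1 if strictly more than half of the binary votes are positive."""
    return 1 if 2 * sum(votes) > len(votes) else 0


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for binary labels (0.0 if undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical ICH labels from three models for four reports (1 = hemorrhage present).
per_model: List[List[int]] = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
]
reference = [1, 1, 0, 0]  # hypothetical manually reviewed labels

ensemble = [majority_vote(column) for column in zip(*per_model)]
print(ensemble, round(mcc(reference, ensemble), 3))
```

MCC is a reasonable headline metric here because it stays informative when positive cases are rare, which accuracy alone does not.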

[46] Inside CORE-KG: Evaluating Structured Prompting and Coreference Resolution for Knowledge Graphs

Dipak Meher,Carlotta Domeniconi

Main category: cs.CL

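TL;DR: This paper presents a systematic ablation study of the two key components of the CORE-KG framework, the type-aware coreference resolution module and domain-guided structured prompts, quantifying their roles in reducing node duplication and noise.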

Details Motivation: Legal case documents are unstructured, lexically dense, and full of ambiguous or shifting references, which makes automated knowledge graph construction difficult; existing LLM-based approaches still produce noisy graphs with duplicate nodes. Method: The authors run systematic ablation experiments on CORE-KG, removing the coreference resolution module and the structured prompt component in turn and measuring the effect of each on node duplication and noise. Result: Removing coreference resolution increases node duplication by 28.32% and noisy nodes by 4.32%; removing structured prompts increases node duplication by 4.34% and noisy nodes by 73.33%. Conclusion: Structured prompts play the dominant role in suppressing noise, while coreference resolution is more effective at reducing node duplication; together they substantially improve the extraction of structured information from complex legal texts. Abstract: Human smuggling networks are increasingly adaptive and difficult to analyze. Legal case documents offer critical insights but are often unstructured, lexically dense, and filled with ambiguous or shifting references, which pose significant challenges for automated knowledge graph (KG) construction. While recent LLM-based approaches improve over static templates, they still generate noisy, fragmented graphs with duplicate nodes due to the absence of guided extraction and coreference resolution. The recently proposed CORE-KG framework addresses these limitations by integrating a type-aware coreference module and domain-guided structured prompts, significantly reducing node duplication and legal noise. In this work, we present a systematic ablation study of CORE-KG to quantify the individual contributions of its two key components. Our results show that removing coreference resolution results in a 28.32% increase in node duplication and a 4.32% increase in noisy nodes, while removing structured prompts leads to a 4.34% increase in node duplication and a 73.33% increase in noisy nodes. These findings offer empirical insights for designing robust LLM-based pipelines for extracting structured representations from complex legal texts.

[47] Hebrew Diacritics Restoration using Visual Representation

Yair Elboher,Yuval Pinter

Main category: cs.CL

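TL;DR: This paper presents DIVRIT, a Hebrew diacritization system based on a visual language model that frames the task as zero-shot classification and restores diacritics effectively without complex linguistic analysis.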

Details Motivation: Undiacritized Hebrew is highly ambiguous, which affects pronunciation and meaning, and existing methods depend on complex linguistic features; the paper aims for a more efficient, automated diacritization method. Method: Hebrew text is fed to a visual language model as an image, and at the word level the model selects the most appropriate diacritization pattern from a dynamically generated candidate set given the context, within a zero-shot classification framework. Result: The system reaches high accuracy in the "oracle" setting, where the candidate set contains the correct form, and architectural improvements and training optimizations significantly improve generalization. Conclusion: Visual representations show strong potential for automated Hebrew diacritization, and DIVRIT offers a novel and effective solution to the task. Abstract: Diacritics restoration in Hebrew is a fundamental task for ensuring accurate word pronunciation and disambiguating textual meaning. Despite the language's high degree of ambiguity when unvocalized, recent machine learning approaches have significantly advanced performance on this task. In this work, we present DIVRIT, a novel system for Hebrew diacritization that frames the task as a zero-shot classification problem. Our approach operates at the word level, selecting the most appropriate diacritization pattern for each undiacritized word from a dynamically generated candidate set, conditioned on the surrounding textual context. A key innovation of DIVRIT is its use of a Hebrew Visual Language Model, which processes undiacritized text as an image, allowing diacritic information to be embedded directly within the input's vector representation. Through a comprehensive evaluation across various configurations, we demonstrate that the system effectively performs diacritization without relying on complex, explicit linguistic analysis. Notably, in an "oracle" setting where the correct diacritized form is guaranteed to be among the provided candidates, DIVRIT achieves a high level of accuracy. Furthermore, strategic architectural enhancements and optimized training methodologies yield significant improvements in the system's overall generalization capabilities. These findings highlight the promising potential of visual representations for accurate and automated Hebrew diacritization.

[48] The Structure of Relation Decoding Linear Operators in Large Language Models

Miranda Anna Christ,Adrián Csiszárik,Gergely Becsó,Dániel Varga

Main category: cs.CL

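TL;DR: This paper studies the structure of the linear operators that decode specific relational facts in transformer language models. These operators do not encode distinct relations; they extract coarse-grained semantic properties shared across relations (such as "country of X"), which explains their high compressibility and limited generalization.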

Details Motivation: To understand the underlying structure of linear relation decoders in transformer language models and how they generalize, and to explain why they work across multiple relations. Method: The authors extend earlier single-relation work to collections of relations, compress the set of relation decoders with order-3 tensor networks, and propose a cross-evaluation protocol that applies each decoder to the subjects of every other relation. Result: Relation decoders can be highly compressed without significant loss of accuracy; cross-evaluation shows that they extract shared coarse-grained semantic properties rather than specific relations, for example "country of capital city" and "country of food" both fall under the "country of X" property. Conclusion: Linear relational decoding in transformers is primarily property-based rather than relation-specific, and this property-centric structure explains both the decoders' compressibility and their generalization to semantically close relations. Abstract: This paper investigates the structure of linear operators introduced in Hernandez et al. [2023] that decode specific relational facts in transformer language models. We extend their single-relation findings to a collection of relations and systematically chart their organization. We show that such collections of relation decoders can be highly compressed by simple order-3 tensor networks without significant loss in decoding accuracy. To explain this surprising redundancy, we develop a cross-evaluation protocol, in which we apply each linear decoder operator to the subjects of every other relation. Our results reveal that these linear maps do not encode distinct relations, but extract recurring, coarse-grained semantic properties (e.g., country of capital city and country of food are both in the country-of-X property). This property-centric structure clarifies both the operators' compressibility and highlights why they generalize only to new relations that are semantically close. Our findings thus interpret linear relational decoding in transformer language models as primarily property-based, rather than relation-specific.

[49] InfoFlow: Reinforcing Search Agent Via Reward Density Optimization

Kun Luo,Hongjin Qian,Zheng Liu,Ziyi Xia,Shitao Xiao,Siqi Bao,Jun Zhao,Kang Liu

Main category: cs.CL

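TL;DR: This paper proposes the InfoFlow framework, which tackles low reward density in reinforcement learning through subproblem decomposition, failure-guided hints, and dual-agent refinement, significantly improving lightweight LLMs on agentic search tasks.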

Details Motivation: In deep search scenarios, exploration is costly and final rewards are sparse, so reward density in reinforcement learning is low, which limits its application. Method: The InfoFlow framework improves reward density in three ways: 1) subproblem decomposition provides denser learning signals; 2) failure-guided hints give corrective guidance to stalled trajectories; 3) a dual-agent architecture lowers exploration cost by compressing the search history. Result: On multiple agentic search benchmarks, InfoFlow significantly outperforms strong baselines and lets lightweight LLMs reach performance comparable to advanced proprietary models. Conclusion: InfoFlow effectively addresses low reward density in deep search, improving exploration efficiency and learning, with broad application potential. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach for enhancing agentic deep search. However, its application is often hindered by low Reward Density in deep search scenarios, where agents expend significant exploratory costs for infrequent and often null final rewards. In this paper, we formalize this challenge as the Reward Density Optimization problem, which aims to improve the reward obtained per unit of exploration cost. This paper introduces InfoFlow, a systematic framework that tackles this problem from three aspects. 1) Subproblem decomposition: breaking down long-range tasks to assign process rewards, thereby providing denser learning signals. 2) Failure-guided hints: injecting corrective guidance into stalled trajectories to increase the probability of successful outcomes. 3) Dual-agent refinement: employing a dual-agent architecture to offload the cognitive burden of deep exploration. A refiner agent synthesizes the search history, which effectively compresses the researcher's perceived trajectory, thereby reducing exploration cost and increasing the overall reward density. We evaluate InfoFlow on multiple agentic search benchmarks, where it significantly outperforms strong baselines, enabling lightweight LLMs to achieve performance comparable to advanced proprietary LLMs.
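
The abstract above defines reward density as reward obtained per unit of exploration cost. A tiny sketch shows how the two levers it names (adding process rewards, reducing cost) move that ratio. The per-step rewards and costs below are invented numbers, and treating density as a simple sum ratio over one trajectory is an assumption; the paper's formal definition may differ.

```python
from typing import Sequence


def reward_density(rewards: Sequence[float], costs: Sequence[float]) -> float:
    """Total reward divided by total exploration cost for one trajectory."""
    total_cost = sum(costs)
    return sum(rewards) / total_cost if total_cost else 0.0


# Hypothetical 4-step search trajectory with a single final reward.
sparse = reward_density(rewards=[0, 0, 0, 1], costs=[1, 1, 1, 1])

# Same trajectory with assumed process rewards for solved subproblems.
with_process_rewards = reward_density(rewards=[0.25, 0.25, 0.25, 1], costs=[1, 1, 1, 1])

# Same final reward, but a compressed history halves the assumed per-step cost.
with_compression = reward_density(rewards=[0, 0, 0, 1], costs=[0.5, 0.5, 0.5, 0.5])

print(sparse, with_process_rewards, with_compression)
```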

[50] Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models

Yinrong Hong,Zhiquan Tan,Kai Hu

Main category: cs.CL

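TL;DR: This paper proposes CAST, a new dynamic tree decoding method that accounts for the effect of system variables such as GPU configuration and batch size on inference cost. It speeds up LLM decoding by up to 5.2× over conventional decoding and outperforms existing state-of-the-art methods by 5% to 20% across tasks and models.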

Details Motivation: LLMs suffer significant inference latency because of their autoregressive design and large size, and existing speculative decoding methods such as EAGLE-2 and EAGLE-3 overlook key system variables such as GPU devices and batch sizes. Method: CAST accounts for inference cost, including GPU configuration and batch size, and dynamically refines the tree structure to improve decoding efficiency. Result: Experiments on six tasks and six LLMs show that CAST is up to 5.2× faster than conventional decoding and generally outperforms existing state-of-the-art techniques by 5% to 20%. Conclusion: By bringing system-level variables into dynamic tree construction, CAST substantially improves LLM inference efficiency across scenarios and has broad application prospects. Abstract: Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size. To address this, speculative decoding emerges as a solution, enabling the simultaneous generation and validation of multiple tokens. While recent approaches like EAGLE-2 and EAGLE-3 improve speculative decoding using dynamic tree structures, they often neglect the impact of crucial system variables such as GPU devices and batch sizes. Therefore, we introduce a new dynamic tree decoding approach called CAST that takes into account inference costs, including factors such as GPU configurations and batch sizes, to dynamically refine the tree structure. Through comprehensive experimentation across six diverse tasks and utilizing six distinct LLMs, our methodology demonstrates remarkable results, achieving speeds up to 5.2 times faster than conventional decoding methods. Moreover, it generally outperforms existing state-of-the-art techniques by 5% to 20%.
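
Why inference cost should influence the draft structure can be seen in a much-simplified sketch. It is not CAST itself: it uses a single draft chain instead of a tree, a constant per-token acceptance probability, and a made-up linear cost model in which verification gets more expensive per drafted token (as it might at large batch sizes). All function names and parameter values are hypothetical.

```python
def expected_tokens(accept_prob: float, draft_len: int) -> float:
    """Expected tokens per verification step for a draft chain.

    One token is always produced by the target model; drafted token i
    survives only if all i drafted tokens up to it are accepted.
    """
    return 1.0 + sum(accept_prob ** i for i in range(1, draft_len + 1))


def best_draft_length(accept_prob: float, draft_cost: float,
                      verify_base: float, verify_per_token: float,
                      max_len: int = 8) -> int:
    """Draft length that maximizes expected tokens per unit of (assumed) cost."""
    def throughput(k: int) -> float:
        cost = k * draft_cost + verify_base + k * verify_per_token
        return expected_tokens(accept_prob, k) / cost
    return max(range(1, max_len + 1), key=throughput)


# Verification cost independent of draft length (e.g. memory-bound, small batch).
cheap_verify = best_draft_length(0.8, draft_cost=0.1, verify_base=1.0, verify_per_token=0.0)

# Verification cost grows with draft length (e.g. compute-bound, large batch).
costly_verify = best_draft_length(0.8, draft_cost=0.1, verify_base=1.0, verify_per_token=0.3)

print(cheap_verify, costly_verify)
```

Under these assumed numbers the preferred draft length shrinks as per-token verification cost rises, which is the kind of system-dependent trade-off the abstract says fixed tree policies ignore.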

[51] SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

Yiqiao Jin,Rachneet Kaur,Zhen Zeng,Sumitra Ganesh,Srijan Kumar

Main category: cs.CL

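TL;DR: SlideAgent is an agentic framework for understanding multi-modal, multi-page, multi-layout documents (especially slide decks) that improves comprehension of complex visual documents through reasoning at three levels: global, page, and element.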

Details Motivation: Existing systems struggle with fine-grained reasoning across pages and elements in complex multi-page visual documents, which limits their grasp of key information. Method: SlideAgent uses specialized agents and decomposes reasoning into global, page, and element levels to build a structured, query-agnostic document representation; at inference time it selectively activates the relevant agents to produce context-aware answers. Result: Experiments show that SlideAgent improves overall performance over both proprietary models (+7.9) and open-source models (+9.8). Conclusion: Through coordinated multi-level reasoning by specialized agents, SlideAgent significantly improves understanding of multi-page visual documents and shows good generality and application potential. Abstract: Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While large language models (LLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three specialized levels (global, page, and element) to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent achieves significant improvement over both proprietary (+7.9 overall) and open-source models (+9.8 overall).

[52] Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model

Biao Zhang,Yong Cheng,Siamak Shakeri,Xinyi Wang,Min Ma,Orhan Firat

Main category: cs.CL

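TL;DR: This paper revisits encoder-decoder LLMs (RedLLM), updates them with modern training recipes, and systematically compares them with the dominant decoder-only models (DecLLM) at several scales. RedLLM delivers comparable performance with better inference efficiency.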

Details Motivation: LLMs have rapidly shifted to decoder-only architectures without a systematic evaluation of encoder-decoder models from a scaling perspective, so their potential may have been underestimated and deserves re-evaluation. Method: An enhanced encoder-decoder model (RedLLM) is pretrained on RedPajama V1 with prefix language modeling and compared at multiple scales with a decoder-only model (DecLLM) pretrained with causal language modeling; both are then instruction-tuned on FLAN. Result: Although DecLLM is more compute-efficient during pretraining, RedLLM shows comparable scaling and context-length extrapolation; after instruction tuning, RedLLM performs comparably or better on various downstream tasks with substantially better inference efficiency. Conclusion: Encoder-decoder LLMs have overlooked potential and merit further study for building more powerful and efficient models. Abstract: Recent large language model (LLM) research has undergone an architectural shift from encoder-decoder modeling to the now-dominant decoder-only modeling. This rapid transition, however, comes without a rigorous comparative analysis, especially from the scaling perspective, raising concerns that the potential of encoder-decoder models may have been overlooked. To fill this gap, we revisit encoder-decoder LLM (RedLLM), enhancing it with recent recipes from decoder-only LLM (DecLLM). We conduct a comprehensive comparison between RedLLM, pretrained with prefix language modeling (LM), and DecLLM, pretrained with causal LM, at different model scales, ranging from ~150M to ~8B. Using RedPajama V1 (1.6T tokens) for pretraining and FLAN for instruction tuning, our experiments show that RedLLM produces compelling scaling properties and surprisingly strong performance. While DecLLM is overall more compute-optimal during pretraining, RedLLM demonstrates comparable scaling and context length extrapolation capabilities. After instruction tuning, RedLLM achieves comparable and even better results on various downstream tasks while enjoying substantially better inference efficiency. We hope our findings could inspire more efforts on re-examining RedLLM, unlocking its potential for developing powerful and efficient LLMs.

[53] Evontree: Ontology Rule-Guided Self-Evolution of Large Language Models

Mingchen Tu,Zhiqiang Liu,Juan Li,Liangyurui Liu,Junjie Wang,Lei Liang,Wen Zhang

Main category: cs.CL

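TL;DR: Proposes Evontree, a framework that uses a small set of high-quality ontology rules to extract, validate, and enhance domain knowledge inside LLMs without large external datasets, yielding significant gains on medical QA tasks.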

Details Motivation: In data-sensitive domains such as healthcare, the lack of high-quality domain training data limits LLM adaptation; meanwhile, domain experts have distilled their knowledge into ontology rules that are worth exploiting. Method: Evontree first extracts a domain ontology from the raw model, detects inconsistencies using two core ontology rules, and reinforces the corrected knowledge through self-distilled fine-tuning. Result: Experiments with Llama3-8B-Instruct and Med42-v2 show that the method outperforms both the unmodified models and leading supervised baselines across multiple medical QA benchmarks, with accuracy gains of up to 3.7%. Conclusion: Evontree adapts LLMs to low-resource domains effectively, efficiently, and robustly, confirming the potential of combining explicit ontology rules with implicit model knowledge. Abstract: Large language models (LLMs) have demonstrated exceptional capabilities across multiple domains by leveraging massive pre-training and curated fine-tuning data. However, in data-sensitive fields such as healthcare, the lack of high-quality, domain-specific training corpora hinders LLMs' adaptation for specialized applications. Meanwhile, domain experts have distilled domain wisdom into ontology rules, which formalize relationships among concepts and ensure the integrity of knowledge management repositories. Viewing LLMs as implicit repositories of human knowledge, we propose Evontree, a novel framework that leverages a small set of high-quality ontology rules to systematically extract, validate, and enhance domain knowledge within LLMs, without requiring extensive external datasets. Specifically, Evontree extracts domain ontology from raw models, detects inconsistencies using two core ontology rules, and reinforces the refined knowledge via self-distilled fine-tuning. Extensive experiments on medical QA benchmarks with Llama3-8B-Instruct and Med42-v2 demonstrate consistent outperformance over both unmodified models and leading supervised baselines, achieving up to a 3.7% improvement in accuracy. These results confirm the effectiveness, efficiency, and robustness of our approach for low-resource domain adaptation of LLMs.

[54] Kimi Linear: An Expressive, Efficient Attention Architecture

Kimi Team,Yu Zhang,Zongyu Lin,Xingcheng Yao,Jiaxi Hu,Fanqing Meng,Chengyin Liu,Xin Men,Songlin Yang,Zhiyuan Li,Wentao Li,Enzhe Lu,Weizhou Liu,Yanru Chen,Weixin Xu,Longhui Yu,Yejie Wang,Yu Fan,Longguang Zhong,Enming Yuan,Dehao Zhang,Yizhi Zhang,T. Y. Liu,Haiming Wang,Shengjun Fang,Weiran He,Shaowei Liu,Yiwei Li,Jianlin Su,Jiezhong Qiu,Bo Pang,Junjie Yan,Zhejun Jiang,Weixiao Huang,Bohong Yin,Jiacheng You,Chu Wei,Zhengtao Wang,Chao Hong,Yutian Chen,Guanduo Chen,Yucheng Wang,Huabin Zheng,Feng Wang,Yibo Liu,Mengnan Dong,Zheng Zhang,Siyuan Pan,Wenhao Wu,Yuhao Wu,Longyu Guan,Jiawen Tao,Guohong Fu,Xinran Xu,Yuzhi Wang,Guokun Lai,Yuxin Wu,Xinyu Zhou,Zhilin Yang,Yulun Du

Main category: cs.CL

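TL;DR: Kimi Linear is a hybrid linear attention architecture that, for the first time, outperforms full attention across a range of scenarios; it is more efficient, supports long contexts and reinforcement learning scaling, and comes with open-source implementations and model checkpoints.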

Details Motivation: Existing linear attention mechanisms struggle to match full attention in expressiveness and hardware efficiency, particularly on long sequences and complex tasks, so an architecture is needed that stays efficient while improving quality. Method: Proposes Kimi Delta Attention (KDA), which combines a fine-grained gating mechanism with a bespoke chunkwise algorithm built on a specialized variant of DPLR transition matrices, reducing computation while staying consistent with the classical delta rule; a hybrid KDA and MLA model with 3B activated and 48B total parameters is pretrained. Result: Under an identical training recipe, Kimi Linear outperforms full-attention MLA by a sizeable margin on all evaluated tasks, cuts KV cache usage by up to 75%, and reaches up to 6 times the decoding throughput at 1M context, performing better in short-context, long-context, and RL scaling settings alike. Conclusion: Kimi Linear is a high-performance, efficient replacement for full attention, including on tasks with long inputs and outputs, with potential for practical deployment and further research. Abstract: We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA by a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
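
The abstract refers to the classical delta rule that KDA stays consistent with. As a rough illustration only, the sketch below implements the plain, ungated delta-rule update of a fixed-size associative memory, S_t = S_{t-1} - beta * (S_{t-1} k_t - v_t) k_t^T, in pure Python. It is not KDA: the fine-grained gating, the DPLR parameterization, and the chunkwise kernel are omitted, and the names `delta_rule_step` and `read_memory` are hypothetical.

```python
def delta_rule_step(S, k, v, beta):
    """One classical delta-rule write into a fixed-size memory.

    S is a d_v x d_k matrix (list of rows), k a key of length d_k,
    v a value of length d_v, beta a write strength in [0, 1].
    """
    # What the memory currently returns for this key.
    predicted = [sum(row[j] * k[j] for j in range(len(k))) for row in S]
    # Move the stored association toward v by correcting the prediction error.
    return [
        [S[i][j] - beta * (predicted[i] - v[i]) * k[j] for j in range(len(k))]
        for i in range(len(v))
    ]


def read_memory(S, q):
    """Query the memory: returns S @ q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


# The state stays d_v x d_k no matter how many tokens are written,
# unlike a KV cache that grows with sequence length.
S = [[0.0, 0.0], [0.0, 0.0]]
S = delta_rule_step(S, k=[1.0, 0.0], v=[3.0, 4.0], beta=1.0)
first_read = read_memory(S, [1.0, 0.0])
# Writing a new value under the same key overwrites the old association.
S = delta_rule_step(S, k=[1.0, 0.0], v=[5.0, 6.0], beta=1.0)
second_read = read_memory(S, [1.0, 0.0])
```

With a unit-norm key and `beta=1.0`, a second write under the same key replaces the earlier value instead of accumulating on top of it, which is the property that separates the delta rule from purely additive linear attention.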

[55] The End of Manual Decoding: Towards Truly End-to-End Language Models

Zhichao Wang,Dongyang Ma,Xinting Huang,Deng Cai,Tian Lan,Jiahao Xu,Haitao Mi,Xiaoying Tang,Yan Wang

Main category: cs.CL

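TL;DR: This paper proposes AutoDeco, an architecture that achieves truly end-to-end generation by learning to predict decoding parameters (such as temperature and top-p) dynamically, letting an LLM regulate its own sampling strategy within a single forward pass and showing an emergent ability to control decoding behavior through natural-language instructions.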

Details Motivation: Current LLMs depend on a non-differentiable decoding process whose hyperparameters must be tuned by hand, which is inflexible and hard to automate; a method that controls the decoding strategy adaptively and end to end is needed. Method: Lightweight heads are added to a standard transformer; at each generation step they predict context-specific temperature and top-p values alongside the next-token logits, turning decoding into a learnable, token-level parametric process. Result: On eight benchmarks, AutoDeco significantly outperforms default decoding strategies and matches an oracle baseline obtained by tuning on the test set; it also adjusts decoding parameters token by token in response to natural-language instructions (e.g., "generate with low randomness"). Conclusion: AutoDeco achieves end-to-end text generation in the full sense, improving generation quality and opening a new paradigm of interpretable, steerable LLM decoding driven by natural-language instructions. Abstract: The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set", a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.
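
To make the two decoding parameters concrete, here is a minimal sketch of standard temperature scaling followed by top-p (nucleus) filtering for a single decoding step, in pure Python. In AutoDeco the per-step `temperature` and `top_p` would come from the model's extra heads; here they are plain function arguments, and the function names are hypothetical. The hard cutoff shown is ordinary inference-time sampling, not the paper's training procedure.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving tokens form a distribution.
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature, top_p, rng):
    """Sample one token id using step-specific temperature and top_p."""
    probs = softmax_with_temperature(logits, temperature)
    nucleus = top_p_filter(probs, top_p)
    token_ids = list(nucleus)
    weights = [nucleus[t] for t in token_ids]
    return rng.choices(token_ids, weights=weights, k=1)[0]


rng = random.Random(0)
nucleus = top_p_filter([0.5, 0.3, 0.2], top_p=0.7)
# A very low temperature with a small top_p collapses to greedy decoding.
greedy_like = sample_token([2.0, 1.0, 0.0], temperature=0.05, top_p=0.1, rng=rng)
```

A static decoder fixes `temperature` and `top_p` for the whole sequence; the idea summarized above is to let both values change at every token.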

[56] Value Drifts: Tracing Value Alignment During LLM Post-Training

Mehar Bhatia,Shravan Nayak,Gaurav Kamath,Marius Mosbach,Karolina Stańczak,Vered Shwartz,Siva Reddy

Main category: cs.CL

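TL;DR: Studies how value alignment develops during LLM post-training, finding that the supervised fine-tuning stage largely establishes a model's values, that subsequent preference optimization rarely changes them, and that different preference optimization algorithms lead to different alignment outcomes.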

Details Motivation: Prior work mostly evaluates the value alignment of fully trained models and overlooks how values form and evolve during training, so the effect of post-training algorithms and data on alignment needs to be traced. Method: The effects of post-training algorithms and datasets are disentangled by experimenting with several supervised fine-tuning (SFT) and preference optimization algorithms on Llama-3 and Qwen-3 models of different sizes, and a synthetic preference dataset is used for controlled analysis of the magnitude and timing of value drifts. Result: The SFT phase largely establishes a model's values, and subsequent preference optimization usually fails to re-align them; even with identical preference data, different optimization algorithms produce different alignment outcomes. Conclusion: Model values are set mainly during SFT, preference optimization adjusts them only to a limited extent, and the choice of algorithm itself markedly affects alignment, which gives practical guidance for data curation and algorithm selection. Abstract: As LLMs occupy an increasingly important role in society, they are more and more confronted with questions that require them not only to draw on their general knowledge but also to align with certain human value systems. Therefore, studying the alignment of LLMs with human values has become a crucial field of inquiry. Prior work, however, mostly focuses on evaluating the alignment of fully trained models, overlooking the training dynamics by which models learn to express human values. In this work, we investigate how and at which stage value alignment arises during the course of a model's post-training. Our analysis disentangles the effects of post-training algorithms and datasets, measuring both the magnitude and timing of value drifts during training. Experimenting with Llama-3 and Qwen-3 models of different sizes and popular supervised fine-tuning (SFT) and preference optimization datasets and algorithms, we find that the SFT phase generally establishes a model's values, and subsequent preference optimization rarely re-aligns these values. Furthermore, using a synthetic preference dataset that enables controlled manipulation of values, we find that different preference optimization algorithms lead to different value alignment outcomes, even when preference data is held constant. Our findings provide actionable insights into how values are learned during post-training and help to inform data curation, as well as the selection of models and algorithms for preference optimization to improve model alignment to human values.

[57] AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

Shengnan An,Xunliang Cai,Xuezhi Cao,Xiaoyu Li,Yehao Lin,Junlin Liu,Xinxuan Lv,Dan Ma,Xuanlin Wang,Ziwen Wang,Shuang Zhou

Main category: cs.CL

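TL;DR: AMO-Bench is an advanced mathematical reasoning benchmark of 50 human-crafted, original problems at or above International Mathematical Olympiad difficulty, built to evaluate LLMs on hard math; experiments show that current models score low overall, though accuracy tends to improve with more test-time compute.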

Details Motivation: Existing math benchmarks are saturating and can no longer discriminate among top-tier LLMs, so a more challenging benchmark is needed. Method: AMO-Bench is built from 50 difficult, expert-validated, entirely original math problems, each requiring only a final answer so that grading can be automated. Result: Across 26 LLMs, the best model reaches 52.4% accuracy and most score below 40%, exposing the limits of current models on advanced mathematical reasoning; accuracy also trends upward as test-time compute increases. Conclusion: AMO-Bench shows that current LLMs still have substantial room to improve in advanced mathematical reasoning and offers an open evaluation platform for future work. Abstract: We present AMO-Bench, an Advanced Mathematical reasoning benchmark with Olympiad level or even higher difficulty, comprising 50 human-crafted problems. Existing benchmarks have widely leveraged high school math competitions for evaluating mathematical reasoning capabilities of large language models (LLMs). However, many existing math competitions are becoming less effective for assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To address this, AMO-Bench introduces more rigorous challenges by ensuring all 50 problems are (1) cross-validated by experts to meet at least the International Mathematical Olympiad (IMO) difficulty standards, and (2) entirely original problems to prevent potential performance leakage from data memorization. Moreover, each problem in AMO-Bench requires only a final answer rather than a proof, enabling automatic and robust grading for evaluation. Experimental results across 26 LLMs on AMO-Bench show that even the best-performing model achieves only 52.4% accuracy, with most LLMs scoring below 40%. Beyond these poor performances, our further analysis reveals a promising scaling trend with increasing test-time compute on AMO-Bench. These results highlight the significant room for improving the mathematical reasoning in current LLMs. We release AMO-Bench to facilitate further research into advancing the reasoning abilities of language models. https://amo-bench.github.io/

[58] Gistify! Codebase-Level Understanding via Runtime Execution

Hyunji Lee,Minseon Kim,Chinmay Singh,Matheus Pereira,Atharv Sonwane,Isadora White,Elias Stengel-Eskin,Mohit Bansal,Zhengyan Shi,Alessandro Sordoni,Marc-Alexandre Côté,Xingdi Yuan,Lucas Caccia

Main category: cs.CL

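TL;DR: Proposes Gistify, a task in which a coding LLM must produce a single minimal, self-contained file from a large codebase that reproduces a specific functionality, testing the model's understanding of codebase structure and execution flow.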

Details Motivation: As coding agents are increasingly deployed in large codebases, there is a pressing need to design challenging codebase-level evaluations automatically. Method: Given a full codebase and a specific entrypoint, the LLM must generate a minimal standalone file that reproduces the output of the original command while containing only the components needed for execution. Result: Current state-of-the-art models perform poorly on Gistify, especially when execution traces are long. Conclusion: Gistify is a challenging new task that exposes the weaknesses of current coding LLMs in understanding codebase structure and producing large code patches. Abstract: As coding agents are increasingly deployed in large codebases, the need to automatically design challenging, codebase-level evaluation is central. We propose Gistify, a task where a coding LLM must create a single, minimal, self-contained file that can reproduce a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a python command), and the generated file must replicate the output of the same command run under the full codebase, while containing only the essential components necessary to execute the provided command. Success on Gistify requires structural understanding of the codebase, accurate modeling of its execution flow, and the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long execution traces.

cs.CV [Back]

[59] Enhancing Underwater Object Detection through Spatio-Temporal Analysis and Spatial Attention Networks

Sai Likhith Karri,Ansh Saxena

Main category: cs.CV

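TL;DR: This study evaluates spatio-temporal modeling and spatial attention for underwater object detection, comparing standard YOLOv5 with the temporally enhanced T-YOLOv5 and extending T-YOLOv5 with a CBAM module to improve detection in complex scenes.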

Details Motivation: To improve underwater object detection accuracy in dynamic marine environments, particularly under sudden movements, partial occlusions, and gradual motion, the study examines the role of temporal modeling and attention mechanisms. Method: Standard YOLOv5 is first compared with the temporally enhanced T-YOLOv5; a Convolutional Block Attention Module (CBAM) is then added to T-YOLOv5, and the augmented model is evaluated on the same dataset. Result: YOLOv5 reaches an mAP@50-95 of 0.563, T-YOLOv5 0.813, and T-YOLOv5 with CBAM 0.811; T-YOLOv5 significantly improves detection reliability, and CBAM further helps in challenging scenes at a slight cost in accuracy on simpler ones. Conclusion: Temporal modeling makes T-YOLOv5 clearly better than standard YOLOv5, and the CBAM variant is more robust in complex, dynamic underwater scenes despite a small drop in simple scenes. Abstract: This study examines the effectiveness of spatio-temporal modeling and the integration of spatial attention mechanisms in deep learning models for underwater object detection. Specifically, in the first phase, the performance of the temporal-enhanced YOLOv5 variant T-YOLOv5 is evaluated in comparison with the standard YOLOv5. For the second phase, an augmented version of T-YOLOv5 is developed through the addition of a Convolutional Block Attention Module (CBAM). By examining the existing YOLOv5 and T-YOLOv5 models alongside the newly developed T-YOLOv5 with CBAM, the research highlights how temporal modeling improves detection accuracy in dynamic marine environments, particularly under conditions of sudden movements, partial occlusions, and gradual motion. The testing results showed that YOLOv5 achieved a mAP@50-95 of 0.563, while T-YOLOv5 and T-YOLOv5 with CBAM outperformed with mAP@50-95 scores of 0.813 and 0.811, respectively, highlighting their superior accuracy and generalization in detecting complex objects. The findings demonstrate that T-YOLOv5 significantly enhances detection reliability compared to the standard model, while T-YOLOv5 with CBAM further improves performance in challenging scenarios, although there is a loss of accuracy when it comes to simpler scenarios.

[60] MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency

Nicolas Dufour,Lucas Degeorge,Arijit Ghosh,Vicky Kalogeiton,David Picard

Main category: cs.CV

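TL;DR: Proposes MIRO, which conditions the model on multiple reward models during training so that it learns user preferences directly, improving generated image quality, training efficiency, and semantic fidelity.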

Details Motivation: Current text-to-image models are trained on large uncurated datasets, so their outputs do not match user preferences well; existing reward-model post-processing discards much informative data and harms diversity and efficiency. Method: The model is conditioned on multiple reward models during training so that it learns user preferences directly instead of relying on post-hoc selection. Result: MIRO achieves state-of-the-art results on the GenEval compositional benchmark and on user-preference scores (PickAScore, ImageReward, HPSv2), with markedly better visual quality and faster training. Conclusion: Conditioning on multiple rewards at training time aligns the model with user preferences more efficiently while preserving diversity and semantic consistency. Abstract: Current text-to-image generative models are trained on large uncurated datasets to enable diverse generation capabilities. However, this does not align well with user preferences. Recently, reward models have been specifically designed to perform post-hoc selection of generated images and align them to a reward, typically user preference. This discarding of informative data, together with optimizing for a single reward, tends to harm diversity, semantic fidelity and efficiency. Instead of this post-processing, we propose to condition the model on multiple reward models during training to let the model learn user preferences directly. We show that this not only dramatically improves the visual quality of the generated images but also significantly speeds up training. Our proposed method, called MIRO, achieves state-of-the-art performances on the GenEval compositional benchmark and user-preference scores (PickAScore, ImageReward, HPSv2).

[61] BikeScenes: Online LiDAR Semantic Segmentation for Bicycles

Denniz Goren,Holger Caesar

Main category: cs.CV

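TL;DR: This paper presents a 3D LiDAR segmentation approach aimed at bicycle safety and releases the BikeScenes-lidarseg dataset; experiments show that fine-tuning on this dataset substantially improves segmentation performance.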

Details Motivation: Cyclists, especially those on e-bikes, face elevated safety risks, which motivates adapting automotive perception technology to bicycle safety. Method: Using the multi-sensor 'SenseBike' platform, a 3D LiDAR semantic segmentation approach suited to bicycle environments is developed and evaluated, and the BikeScenes-lidarseg dataset, annotated with 29 semantic classes, is built for training and evaluation. Result: After fine-tuning on BikeScenes, the model reaches an mIoU of 63.6%, far above the 13.8% obtained with SemanticKITTI pre-training alone. Conclusion: Domain-specific data is essential for LiDAR segmentation in bicycle scenes, and the BikeScenes dataset is a valuable resource for cyclist-centric perception research. Abstract: The vulnerability of cyclists, exacerbated by the rising popularity of faster e-bikes, motivates adapting automotive perception technologies for bicycle safety. We use our multi-sensor 'SenseBike' research platform to develop and evaluate a 3D LiDAR segmentation approach tailored to bicycles. To bridge the automotive-to-bicycle domain gap, we introduce the novel BikeScenes-lidarseg Dataset, comprising 3021 consecutive LiDAR scans around the university campus of the TU Delft, semantically annotated for 29 dynamic and static classes. By evaluating model performance, we demonstrate that fine-tuning on our BikeScenes dataset achieves a mean Intersection-over-Union (mIoU) of 63.6%, significantly outperforming the 13.8% obtained with SemanticKITTI pre-training alone. This result underscores the necessity and effectiveness of domain-specific training. We highlight key challenges specific to bicycle-mounted, hardware-constrained perception systems and contribute the BikeScenes dataset as a resource for advancing research in cyclist-centric LiDAR segmentation.
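
For readers unfamiliar with the mIoU metric quoted above, the following is a minimal sketch of how it is commonly computed for per-point class labels, in pure Python. The function name `mean_iou` and the choice to skip classes that appear in neither prediction nor ground truth are illustrative assumptions, not details taken from the paper or its evaluation code.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union over classes for flat label lists."""
    ious = []
    for c in range(num_classes):
        # Points labeled c in both prediction and ground truth.
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Points labeled c in either prediction or ground truth.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: ignore classes absent from both
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Class 0: IoU = 1/2, class 1: IoU = 2/3, so the mean is 7/12.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
```

Because every class contributes equally to the average, rare classes weigh as much as frequent ones, which is one reason a domain shift can depress mIoU sharply.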

[62] Generative Image Restoration and Super-Resolution using Physics-Informed Synthetic Data for Scanning Tunneling Microscopy

Nikola L. Kolev,Tommaso Rodani,Neil J. Curson,Taylor J. Z. Stock,Alberto Cazzaniga

Main category: cs.CV

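TL;DR: Proposes a machine learning approach to image restoration and super-resolution for scanning tunneling microscopy, training models with a physics-informed synthetic data pipeline to substantially cut image acquisition time and reduce how often the tip must be conditioned.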

Details Motivation: Scanning tunneling microscopy (STM) is limited by tip degradation and slow serial data acquisition; during fabrication, large voltages can alter the tip apex and force frequent conditioning, which lowers throughput. Method: Starting from only 36 high-quality experimental images, a physics-informed synthetic data generation pipeline is built and used to train state-of-the-art flow-matching and diffusion models for image restoration and super-resolution. Result: The models score well on metrics such as CLIP MMD and structural similarity, restore images effectively, and reconstruct accurately from sparsely sampled data, reducing image acquisition time by a factor of two to four. Conclusion: The framework could significantly raise STM experimental throughput, reduce tip-conditioning frequency, and improve frame rates in existing high-speed STM systems. Abstract: Scanning tunnelling microscopy (STM) enables atomic-resolution imaging and atom manipulation, but its utility is often limited by tip degradation and slow serial data acquisition. Fabrication adds another layer of complexity since the tip is often subjected to large voltages, which may alter the shape of its apex, requiring it to be conditioned. Here, we propose a machine learning (ML) approach for image repair and super-resolution to alleviate both challenges. Using a dataset of only 36 pristine experimental images of Si(001):H, we demonstrate that a physics-informed synthetic data generation pipeline can be used to train several state-of-the-art flow-matching and diffusion models. Quantitative evaluation with metrics such as the CLIP Maximum Mean Discrepancy (CMMD) score and structural similarity demonstrates that our models are able to effectively restore images and offer a two- to fourfold reduction in image acquisition time by accurately reconstructing images from sparsely sampled data. Our framework has the potential to significantly increase STM experimental throughput by offering a route to reducing the frequency of tip-conditioning procedures and to enhancing frame rates in existing high-speed STM systems.

[63] SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing

Sung-Hoon Yoon,Minghan Li,Gaspard Beaudouin,Congcong Wen,Muhammad Rafay Azhar,Mengyu Wang

Main category: cs.CV

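TL;DR: This paper proposes SplitFlow, an image editing framework based on flow decomposition and aggregation that achieves high-quality zero-shot editing without explicit inversion. The target prompt is semantically decomposed into sub-prompts, an independent flow is computed for each, and the flows are softly aggregated using projection and adaptive weighting, which mitigates semantic redundancy and gradient entanglement and improves semantic fidelity and attribute disentanglement while preserving output diversity.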

Details Motivation: Existing rectified flow models suffer from inaccurate inversion and gradient entanglement in image editing, so edits drift from the target prompt; existing inversion-free methods also yield limited quality because they lack a mechanism for semantic coordination. Method: A flow decomposition-and-aggregation framework is proposed: the target prompt is semantically decomposed into sub-prompts and a rectified flow is computed independently for each; a projection and soft-aggregation mechanism, inspired by gradient conflict resolution in multi-task learning, adaptively weights the sub-flow velocity fields, suppressing redundant semantics and emphasizing distinct directions to produce a unified, semantically consistent editing trajectory. Result: Experiments show that the method outperforms existing zero-shot image editing approaches in semantic fidelity and attribute disentanglement, reflecting the target prompt more accurately while preserving output diversity. Conclusion: Through semantic decomposition and soft aggregation, SplitFlow addresses gradient entanglement and semantic drift in rectified-flow image editing without inversion, clearly improving editing quality and consistency. Abstract: Rectified flow models have become a de facto standard in image generation due to their stable sampling trajectories and high-fidelity outputs. Despite their strong generative capabilities, they face critical limitations in image editing tasks: inaccurate inversion processes for mapping real images back into the latent space, and gradient entanglement issues during editing often result in outputs that do not faithfully reflect the target prompt. Recent efforts have attempted to directly map source and target distributions via ODE-based approaches without inversion; however, these methods still yield suboptimal editing quality. In this work, we propose a flow decomposition-and-aggregation framework built upon an inversion-free formulation to address these limitations. Specifically, we semantically decompose the target prompt into multiple sub-prompts, compute an independent flow for each, and aggregate them to form a unified editing trajectory. While we empirically observe that decomposing the original flow enhances diversity in the target space, generating semantically aligned outputs still requires consistent guidance toward the full target prompt. To this end, we design a projection and soft-aggregation mechanism for flow, inspired by gradient conflict resolution in multi-task learning. This approach adaptively weights the sub-target velocity fields, suppressing semantic redundancy while emphasizing distinct directions, thereby preserving both diversity and consistency in the final edited output.
Experimental results demonstrate that our method outperforms existing zero-shot editing approaches in terms of semantic fidelity and attribute disentanglement. The code is available at https://github.com/Harvard-AI-and-Robotics-Lab/SplitFlow.

[64] Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer

Roman Beliy,Amit Zalcher,Jonathan Kogman,Navve Wasserman,Michal Irani

Main category: cs.CV

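TL;DR: Proposes Brain-IT, a brain-inspired method whose Brain Interaction Transformer (BIT) enables effective interaction between clusters of functionally similar brain voxels, markedly improving the fidelity of images reconstructed from fMRI and surpassing prior state-of-the-art methods with only a small amount of data from a new subject.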

Details Motivation: Current fMRI-based image reconstruction methods still lack faithfulness to the images actually seen; despite progress from diffusion models, semantic and structural consistency need to improve. Method: The Brain Interaction Transformer (BIT) uses functional brain-voxel clusters shared across subjects to predict high-level semantic and low-level structural features for localized image patches, which guide a diffusion model toward more accurate images. All model components are shared, which improves data efficiency. Result: The method surpasses current state-of-the-art approaches on standard objective metrics and in visual quality; with only 1 hour of fMRI data from a new subject, it matches other methods trained on 40 hours. Conclusion: Brain-IT's brain-inspired design delivers efficient, high-fidelity image reconstruction and advances cross-subject brain decoding with low data requirements. Abstract: Reconstructing images seen by people from their fMRI brain recordings provides a non-invasive window into the human brain. Despite recent progress enabled by diffusion models, current methods often lack faithfulness to the actual seen images. We present "Brain-IT", a brain-inspired approach that addresses this challenge through a Brain Interaction Transformer (BIT), allowing effective interactions between clusters of functionally-similar brain-voxels. These functional-clusters are shared by all subjects, serving as building blocks for integrating information both within and across brains. All model components are shared by all clusters and subjects, allowing efficient training with a limited amount of data. To guide the image reconstruction, BIT predicts two complementary localized patch-level image features: (i) high-level semantic features which steer the diffusion model toward the correct semantic content of the image; and (ii) low-level structural features which help to initialize the diffusion process with the correct coarse layout of the image. BIT's design enables direct flow of information from brain-voxel clusters to localized image features. Through these principles, our method achieves image reconstructions from fMRI that faithfully reconstruct the seen images, and surpass current SotA approaches both visually and by standard objective metrics. Moreover, with only 1 hour of fMRI data from a new subject, we achieve results comparable to current methods trained on full 40-hour recordings.

[65] Fine-tuning Segment Anything for Real-Time Tumor Tracking in Cine-MRI

Valentin Boussot,Cédric Hémon,Jean-Claude Nunes,Jean-Louis Dillenseger

Main category: cs.CV

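TL;DR: This study tackles the real-time tumor tracking task of the TrackRAD2025 challenge with a SAM 2.1-based segmentation method, achieving strong performance under data scarcity: a Dice score of 0.8794 on the hidden test set, ranking sixth.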

Details Motivation: To track tumors accurately in thoracic and abdominal cine-MRI sequences with very little annotated data and a strict real-time constraint (under one second). Method: Based on the SAM 2.1 b+ model, masks from the first annotated frame serve as prompts, and all modules are fine-tuned on the small labeled dataset; training uses 1024x1024 patches, a batch size of 1, standard augmentations, and a Dice + IoU loss, with a low learning rate (0.0001) applied to all modules to limit overfitting and preserve generalization. Result: The model reaches a Dice score of 0.8794 on the hidden test set, ranking 6th in the TrackRAD2025 challenge; training takes about 12 hours on an RTX A6000, and the same inference strategy is applied across anatomical sites and field strengths. Conclusion: Foundation models such as SAM 2.1, combined with prompting and light fine-tuning, can deliver high-performance tumor tracking under severe data scarcity and real-time constraints, showing strong potential for MRI-guided radiotherapy. Abstract: In this work, we address the TrackRAD2025 challenge of real-time tumor tracking in cine-MRI sequences of the thoracic and abdominal regions under strong data scarcity constraints. Two complementary strategies were explored: (i) unsupervised registration with the IMPACT similarity metric and (ii) foundation model-based segmentation leveraging SAM 2.1 and its recent variants through prompt-based interaction. Due to the one-second runtime constraint, the SAM-based method was ultimately selected. The final configuration used SAM2.1 b+ with mask-based prompts from the first annotated slice, fine-tuned solely on the small labeled subset from TrackRAD2025. Training was configured to minimize overfitting, using 1024x1024 patches (batch size 1), standard augmentations, and a balanced Dice + IoU loss. A low uniform learning rate (0.0001) was applied to all modules (prompt encoder, decoder, Hiera backbone) to preserve generalization while adapting to annotator-specific styles. Training lasted 300 epochs (~12h on RTX A6000, 48GB). The same inference strategy was consistently applied across all anatomical sites and MRI field strengths. Test-time augmentation was considered but ultimately discarded due to negligible performance gains. The final model was selected based on the highest Dice Similarity Coefficient achieved on the validation set after fine-tuning. On the hidden test set, the model reached a Dice score of 0.8794, ranking 6th overall in the TrackRAD2025 challenge.
These results highlight the strong potential of foundation models for accurate and real-time tumor tracking in MRI-guided radiotherapy.
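
The abstract mentions a balanced Dice + IoU loss without giving its exact form. A minimal sketch of one common formulation follows, in pure Python over flattened soft masks. The equal 0.5/0.5 weighting, the smoothing term `eps`, and the name `dice_iou_loss` are assumptions made for illustration and may differ from the authors' implementation.

```python
def dice_iou_loss(pred, target, eps=1e-6):
    """Equal-weight combination of soft Dice loss and soft IoU loss.

    pred holds per-pixel foreground probabilities in [0, 1];
    target holds the binary ground-truth mask (0.0 or 1.0).
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    pred_sum = sum(pred)
    target_sum = sum(target)
    # Soft Dice coefficient: 2|A∩B| / (|A| + |B|).
    dice = (2.0 * intersection + eps) / (pred_sum + target_sum + eps)
    # Soft IoU (Jaccard): |A∩B| / |A∪B|.
    iou = (intersection + eps) / (pred_sum + target_sum - intersection + eps)
    # Assumed "balanced" weighting: each term contributes half of the loss.
    return 0.5 * (1.0 - dice) + 0.5 * (1.0 - iou)


perfect = dice_iou_loss([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0])
disjoint = dice_iou_loss([1.0, 0.0], [0.0, 1.0])
```

Both terms reward overlap, but IoU penalizes partial overlap more heavily than Dice, so combining them changes how strongly near-misses are punished compared with either term alone.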

[66] Larger Hausdorff Dimension in Scanning Pattern Facilitates Mamba-Based Methods in Low-Light Image Enhancement

Xinhua Wang,Caibo Feng,Xiangjun Fu,Chunxiao Liu

Main category: cs.CV

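TL;DR: Proposes an improvement to the Mamba framework based on a Hilbert Selective Scan, which raises the Hausdorff dimension of the scanning pattern to improve low-light image enhancement.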

Details Motivation: To explore the feature space more effectively, capture fine-scale details, and improve coverage, while mitigating information inconsistencies and refining spatial locality. Method: A novel Hilbert Selective Scan mechanism increases the complexity (Hausdorff dimension) of the Mamba scanning pattern, so that local interactions are captured better while long-range dependency modeling is retained. Result: Experiments on public benchmarks show that the method significantly improves both the quantitative metrics and the visual quality of existing Mamba-based methods while reducing computational cost and inference time. Conclusion: The strategy advances low-light image enhancement and may also benefit other fields that use Mamba-based architectures. Abstract: We propose an innovative enhancement to the Mamba framework by increasing the Hausdorff dimension of its scanning pattern through a novel Hilbert Selective Scan mechanism. This mechanism explores the feature space more effectively, capturing intricate fine-scale details and improving overall coverage. As a result, it mitigates information inconsistencies while refining spatial locality to better capture subtle local interactions without sacrificing the model's ability to handle long-range dependencies. Extensive experiments on publicly available benchmarks demonstrate that our approach significantly improves both the quantitative metrics and qualitative visual fidelity of existing Mamba-based low-light image enhancement methods, all while reducing computational resource consumption and shortening inference time. We believe that this refined strategy not only advances the state-of-the-art in low-light image enhancement but also holds promise for broader applications in fields that leverage Mamba-based techniques.
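
To give a feel for what a Hilbert-ordered scan looks like, the sketch below generates the visiting order of a Hilbert curve over an n x n grid of patch positions (n a power of two), using the standard index-to-coordinate conversion. It only illustrates the ordering; how the paper's Hilbert Selective Scan feeds this order into Mamba is not described in the abstract, and the names `hilbert_index_to_xy` and `hilbert_order` are hypothetical.

```python
def hilbert_index_to_xy(n, d):
    """Map position d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two and 0 <= d < n * n.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the current sub-square so the curve stays connected.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """Return all grid cells in the order a Hilbert scan visits them."""
    return [hilbert_index_to_xy(n, d) for d in range(n * n)]


order = hilbert_order(4)
```

Unlike a row-major raster scan, which jumps across the image at the end of every row, consecutive cells in this order are always direct neighbors, so tokens that are adjacent in the sequence are also adjacent in the image.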

[67] CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments

Rishika Bhagwatkar,Syrielle Montariol,Angelika Romanou,Beatriz Borges,Irina Rish,Antoine Bosselut

Main category: cs.CV

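TL;DR: This paper introduces CAVE, the first benchmark of real-world visual anomalies, which supports three open-ended tasks (anomaly description, explanation, and justification) and uses fine-grained annotations to evaluate how well vision-language models perceive anomalies and reason with common sense.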

Details Motivation: Existing visual anomaly research is limited to industrial defects or synthetic anomalies and does not reflect the complexity and unpredictability of real-world anomalies, so a benchmark of real anomalies closer to human cognition is needed. Method: The CAVE benchmark is built from real-world visual anomalies with three tasks (description, explanation, justification) and fine-grained annotations grounded in cognitive science, covering each anomaly's visual manifestation, complexity, severity, and commonness. Result: Experiments show that even with advanced prompting strategies, state-of-the-art vision-language models struggle with visual anomaly perception and commonsense reasoning on CAVE. Conclusion: CAVE is a valuable resource for evaluating and advancing anomaly detection and understanding by vision-language models in real-world scenes. Abstract: Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks: anomaly description, explanation, and justification; with fine-grained annotations for visual grounding and categorizing anomalies based on their visual manifestations, their complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies. By offering a realistic and cognitively grounded benchmark, CAVE serves as a valuable resource for advancing research in anomaly detection and commonsense reasoning in VLMs.

[68] Climate Adaptation-Aware Flood Prediction for Coastal Cities Using Deep Learning

Bilal Hassan,Areg Karapetyan,Aaron Chung Hin Chow,Samer Madanat

Main category: cs.CV

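TL;DR: Proposes a lightweight CNN-based deep learning model that predicts coastal flooding under different sea-level rise scenarios; it performs well under data scarcity, generalizes across Abu Dhabi and San Francisco, and reduces prediction error by nearly 20% on average relative to existing methods.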

Details Motivation: Traditional hydrodynamic simulators are computationally expensive and impractical for city-scale coastal planning, while existing deep learning methods are constrained by data scarcity and high-dimensional outputs, so a more efficient and accurate flood prediction model that works across regions is needed. Method: Building on a recently proposed vision-based, low-resource deep learning framework, a lightweight convolutional neural network (CNN) is trained on different sea-level rise projections and shoreline adaptation scenarios, and its cross-region generalization is validated on datasets from Abu Dhabi and San Francisco. Result: The model lowers the mean absolute error (MAE) of predicted flood depth maps by nearly 20% on average relative to state-of-the-art methods and performs well in both geographically distinct regions. Conclusion: The lightweight CNN offers clear advantages in accuracy, computational efficiency, and cross-region applicability, and could serve as a scalable tool for flood management and climate adaptation decisions in coastal cities. Abstract: Climate change and sea-level rise (SLR) pose escalating threats to coastal cities, intensifying the need for efficient and accurate methods to predict potential flood hazards. Traditional physics-based hydrodynamic simulators, although precise, are computationally expensive and impractical for city-scale coastal planning applications. Deep Learning (DL) techniques offer promising alternatives; however, they are often constrained by challenges such as data scarcity and high-dimensional output requirements. Leveraging a recently proposed vision-based, low-resource DL framework, we develop a novel, lightweight Convolutional Neural Network (CNN)-based model designed to predict coastal flooding under variable SLR projections and shoreline adaptation scenarios. Furthermore, we demonstrate the ability of the model to generalize across diverse geographical contexts by utilizing datasets from two distinct regions: Abu Dhabi and San Francisco. Our findings demonstrate that the proposed model significantly outperforms state-of-the-art methods, reducing the mean absolute error (MAE) in predicted flood depth maps on average by nearly 20%. These results highlight the potential of our approach to serve as a scalable and practical tool for coastal flood management, empowering decision-makers to develop effective mitigation strategies in response to the growing impacts of climate change. Project Page: https://caspiannet.github.io/

[69] Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders

Ali Rasekh,Erfan Bagheri Soula,Omid Daliran,Simon Gottschalk,Mohsen Fayyaz

Main category: cs.CV

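TL;DR: Proposes STAVEQ, a new Video-LLM architecture that introduces stacked temporal attention modules in the vision encoder to improve understanding of action sequences and temporal progression in videos, significantly outperforming existing methods on multiple video question answering benchmarks.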

Details Motivation: Existing Video-LLMs fall clearly short in understanding complex temporal dynamics such as action sequences and temporal progression, and struggle with tasks that require fine-grained temporal reasoning. Method: Stacked temporal attention modules are introduced inside the vision encoder so that the model can better capture inter-frame relationships and the temporal evolution of actions before visual tokens are passed to the large language model. Result: On benchmarks including VITATECS, MVBench, and Video-MME, the method improves over existing models by up to +5.5%, substantially strengthening temporal reasoning and action recognition. Conclusion: Strengthening the temporal modeling capacity of the vision encoder addresses a key weakness of current Video-LLMs in video understanding and offers a viable direction for follow-up work. Abstract: Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. This design incorporates temporal attention in the vision encoder, enabling the model to better capture the progression of actions and the relationships between frames before passing visual tokens to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models in video question answering tasks, specifically in action recognition. We improve on benchmarks including VITATECS, MVBench, and Video-MME by up to +5.5%. By enhancing the vision encoder with temporal structure, we address a critical gap in video understanding for Video-LLMs. Project page and code are available at: https://alirasekh.github.io/STAVEQ2/.

[70] FlexICL: A Flexible Visual In-context Learning Framework for Elbow and Wrist Ultrasound Segmentation

Yuyue Zhou,Jessica Knight,Shrimanti Ghosh,Banafshe Felfeliyan,Jacob L. Jaremko,Abhilash R. Hareendranathan

Main category: cs.CV

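TL;DR: Proposes FlexICL, a flexible in-context learning framework for segmenting bony regions in ultrasound images, which significantly outperforms existing methods while using only 5% of the training data.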

Details Motivation: Pixel-level annotation of bone structures in ultrasound images is time-consuming and costly, which limits the use of deep learning in pediatric fracture diagnosis. Method: The FlexICL framework combines image concatenation techniques with multiple augmentation strategies and, in an intra-video segmentation setting, uses a small number of annotated frames to segment unseen frames. Result: On four wrist and elbow ultrasound datasets, FlexICL uses only 5% of the annotated data yet outperforms Painter, MAE-VQGAN, U-Net, and TransUNet, improving the Dice coefficient by 1-27%. Conclusion: FlexICL is an efficient and scalable approach to ultrasound image segmentation, suited to medical imaging settings where labeled data is scarce. Abstract: Elbow and wrist fractures are the most common fractures in pediatric populations. Automatic segmentation of musculoskeletal structures in ultrasound (US) can improve diagnostic accuracy and treatment planning. Fractures appear as cortical defects but require expert interpretation. Deep learning (DL) can provide real-time feedback and highlight key structures, helping lightly trained users perform exams more confidently. However, pixel-wise expert annotations for training remain time-consuming and costly. To address this challenge, we propose FlexICL, a novel and flexible in-context learning (ICL) framework for segmenting bony regions in US images. We apply it to an intra-video segmentation setting, where experts annotate only a small subset of frames, and the model segments unseen frames. We systematically investigate various image concatenation techniques and training strategies for visual ICL and introduce novel concatenation methods that significantly enhance model performance with limited labeled data. By integrating multiple augmentation strategies, FlexICL achieves robust segmentation performance across four wrist and elbow US datasets while requiring only 5% of the training images. It outperforms state-of-the-art visual ICL models like Painter, MAE-VQGAN, and conventional segmentation models like U-Net and TransUNet by 1-27% Dice coefficient on 1,252 US sweeps. These initial results highlight the potential of FlexICL as an efficient and scalable solution for US image segmentation, well suited for medical imaging use cases where labeled data is scarce.
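
The Dice coefficient reported above is the standard overlap score for segmentation masks, Dice = 2|A∩B| / (|A| + |B|). The following is a minimal sketch of that metric only, not code from the FlexICL paper; the function name `dice_coefficient` and the nested-list mask format are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A∩B| / (|A| + |B|) for two binary masks given as nested lists."""
    # Flatten both masks and coerce every pixel to 0 or 1.
    flat_pred = [int(bool(v)) for row in pred for v in row]
    flat_target = [int(bool(v)) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p & t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    if total == 0:
        # Both masks empty: treated as perfect agreement (a common convention).
        return 1.0
    return 2.0 * intersection / total
```

A score of 1.0 means the predicted and reference masks coincide and 0.0 means they share no foreground pixel, so a gain of several Dice points corresponds to noticeably better overlap with the expert annotation.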

[71] Dynamic VLM-Guided Negative Prompting for Diffusion Models

Hoyeon Chang,Seungjin Kim,Yoonseok Choi

Main category: cs.CV

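TL;DR: Proposes a new dynamic negative prompting method that uses a vision-language model (VLM) to adaptively generate negative prompts during the denoising process.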

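Details Motivation: Traditional negative prompting uses fixed negative prompts that do not adapt to context, which limits image quality and consistency with the text. Method: Intermediate image predictions are generated at specific denoising steps, and a VLM is queried to produce context-appropriate negative prompts, allowing the negative prompt to change dynamically. Result: The method is evaluated on several benchmark datasets, demonstrating the trade-off between negative guidance strength and text-image alignment. Conclusion: The proposed dynamic negative prompting method can improve the flexibility and alignment of diffusion models in text-to-image generation. Abstract: We propose a novel approach for dynamic negative prompting in diffusion models that leverages Vision-Language Models (VLMs) to adaptively generate negative prompts during the denoising process. Unlike traditional negative prompting methods that use fixed negative prompts, our method generates intermediate image predictions at specific denoising steps and queries a VLM to produce contextually appropriate negative prompts. We evaluate our approach on various benchmark datasets and demonstrate the trade-offs between negative guidance strength and text-image alignment.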

[72] Security Risk of Misalignment between Text and Image in Multi-modal Model

Xiaosen Wang,Zhijin Ge,Shaokang Wang

Main category: cs.CV

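TL;DR: Proposes PReMA, a new multi-modal attack that manipulates the output of multi-modal diffusion models by modifying only the input image and leaving the text prompt unchanged, posing a new threat especially to image-editing applications that use fixed prompts.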

Details Motivation: Existing text-to-image diffusion models are insufficiently aligned between the text and image modalities, which can lead to inappropriate or NSFW content, and their adversarial robustness has not been fully explored. Method: The Prompt-Restricted Multi-modal Attack (PReMA) generates adversarial images to steer the model output, manipulating content without modifying the text prompt. Result: PReMA is validated on image inpainting and style transfer tasks across several models, showing that it can successfully produce unintended NSFW content. Conclusion: PReMA exposes a weakness in the modality alignment of multi-modal diffusion models and constitutes a new security threat to image-editing applications that use fixed prompts. Abstract: Despite the notable advancements and versatility of multi-modal diffusion models, such as text-to-image models, their susceptibility to adversarial inputs remains underexplored. Contrary to expectations, our investigations reveal that the alignment between textual and image modalities in existing diffusion models is inadequate. This misalignment presents significant risks, especially in the generation of inappropriate or Not-Safe-For-Work (NSFW) content. To this end, we propose a novel attack called Prompt-Restricted Multi-modal Attack (PReMA) to manipulate the generated content by modifying the input image in conjunction with any specified prompt, without altering the prompt itself. PReMA is the first attack that manipulates model outputs by solely creating adversarial images, distinguishing itself from prior methods that primarily generate adversarial prompts to produce NSFW content. Consequently, PReMA poses a novel threat to the integrity of multi-modal diffusion models, particularly in image-editing applications that operate with fixed prompts. Comprehensive evaluations conducted on image inpainting and style transfer tasks across various models confirm the potent efficacy of PReMA.

[73] EgoExo-Con: Exploring View-Invariant Video Temporal Understanding

Minjoon Jung,Junbin Xiao,Junghyun Kim,Byoung-Tak Zhang,Angela Yao

Main category: cs.CV

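TL;DR: Introduces EgoExo-Con, a new benchmark for evaluating whether video large language models keep their temporal understanding consistent across viewpoints, and proposes View-GRPO to improve cross-view consistency.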

Details Motivation: To study whether existing video large language models maintain consistent temporal understanding across multi-view videos. Method: The authors build the EgoExo-Con benchmark of synchronized egocentric and exocentric video pairs, design Temporal Verification and Temporal Grounding tasks, and propose the View-GRPO reinforcement learning framework to improve cross-view consistency. Result: Existing models perform poorly on cross-view consistency; naive finetuning struggles to balance single-view performance and consistency; View-GRPO improves consistency more than SFT and GRPO. Conclusion: Consistent cross-view temporal understanding is an important challenge for Video-LLMs, and View-GRPO offers an effective way to address it. Abstract: Can Video-LLMs achieve consistent temporal understanding when videos capture the same event from different viewpoints? To study this, we introduce EgoExo-Con (Consistency), a benchmark of comprehensively synchronized egocentric and exocentric video pairs with human-refined queries in natural language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency, with results far worse than their single-view performance; (2) when naively finetuned with synchronized videos of both viewpoints, the models show improved consistency but often underperform those trained on a single view. For improvements, we propose View-GRPO, a novel reinforcement learning framework that effectively strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method demonstrates its superiority over naive SFT and GRPO, especially for improving cross-view consistency. All resources will be made publicly available.

[74] OracleAgent: A Multimodal Reasoning Agent for Oracle Bone Script Research

Caoshuo Li,Zengmao Ding,Xiaobin Hu,Bang Li,Donghao Luo,Xu Peng,Taisong Jin,Yongge Liu,Shengwei Han,Jing Yang,Xiaoping He,Feng Gao,AndyPian Wu,SevenShu,Chaoyang Wang,Chengjie Wang

Main category: cs.CV

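TL;DR: Presents OracleAgent, the first agent system for the structured management and retrieval of Oracle Bone Script information, combining large language models with a multimodal knowledge base to significantly improve the efficiency of Oracle Bone Script research.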

Details Motivation: Oracle Bone Script research involves a complex workflow and suffers from inefficient information organization and retrieval, so automated tool support is needed. Method: The authors build a multimodal knowledge base containing 1.4M single-character rubbing images and 80K interpretation texts, and design OracleAgent, an LLM-based agent system that integrates multiple analysis tools for flexible task orchestration and information retrieval. Result: Experiments show that OracleAgent outperforms mainstream multimodal large language models (e.g., GPT-4o) on multimodal reasoning and generation tasks, and a case study shows that it significantly reduces the time experts spend on research. Conclusion: OracleAgent moves Oracle Bone Script research toward automation and practical deployment and provides efficient technical support for paleography. Abstract: As one of the earliest writing systems, Oracle Bone Script (OBS) preserves the cultural and intellectual heritage of ancient civilizations. However, current OBS research faces two major challenges: (1) the interpretation of OBS involves a complex workflow comprising multiple serial and parallel sub-tasks, and (2) the efficiency of OBS information organization and retrieval remains a critical bottleneck, as scholars often spend substantial effort searching for, compiling, and managing relevant resources. To address these challenges, we present OracleAgent, the first agent system designed for the structured management and retrieval of OBS-related information. OracleAgent seamlessly integrates multiple OBS analysis tools, empowered by large language models (LLMs), and can flexibly orchestrate these components. Additionally, we construct a comprehensive domain-specific multimodal knowledge base for OBS, which is built through a rigorous multi-year process of data collection, cleaning, and expert annotation. The knowledge base comprises over 1.4M single-character rubbing images and 80K interpretation texts. OracleAgent leverages this resource through its multimodal tools to assist experts in retrieval tasks of character, document, interpretation text, and rubbing image. Extensive experiments demonstrate that OracleAgent achieves superior performance across a range of multimodal reasoning and generation tasks, surpassing leading mainstream multimodal large language models (MLLMs) (e.g., GPT-4o). Furthermore, our case study illustrates that OracleAgent can effectively assist domain experts, significantly reducing the time cost of OBS research. These results highlight OracleAgent as a significant step toward the practical deployment of OBS-assisted research and automated interpretation systems.

[75] JOGS: Joint Optimization of Pose Estimation and 3D Gaussian Splatting

Yuxuan Li,Tao Wang,Xianben Yang

Main category: cs.CV

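TL;DR: Proposes a unified framework that jointly optimizes 3D Gaussian points and camera poses without pre-calibrated inputs. By alternately updating 3D Gaussian parameters and camera poses, it significantly improves reconstruction quality and pose accuracy for novel view synthesis.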

Details Motivation: Traditional methods rely on external camera pose estimation tools such as COLMAP, which create computational bottlenecks and propagate errors. Method: A staged alternating optimization strategy is used: with poses fixed, 3D Gaussian parameters are updated through differentiable rendering; camera poses are then refined with a customized 3D optical flow algorithm that combines geometric and photometric constraints. Result: Across multiple datasets the method outperforms existing COLMAP-free methods and surpasses the standard COLMAP-based baseline, with particular gains in scenes with large viewpoint changes and sparse features. Conclusion: The proposed joint optimization framework reduces projection errors, achieves more accurate scene reconstruction and pose estimation, and removes the dependence on traditional pose estimation tools. Abstract: Traditional novel view synthesis methods heavily rely on external camera pose estimation tools such as COLMAP, which often introduce computational bottlenecks and propagate errors. To address these challenges, we propose a unified framework that jointly optimizes 3D Gaussian points and camera poses without requiring pre-calibrated inputs. Our approach iteratively refines 3D Gaussian parameters and updates camera poses through a novel co-optimization strategy, ensuring simultaneous improvements in scene reconstruction fidelity and pose accuracy. The key innovation lies in decoupling the joint optimization into two interleaved phases: first, updating 3D Gaussian parameters via differentiable rendering with fixed poses, and second, refining camera poses using a customized 3D optical flow algorithm that incorporates geometric and photometric constraints. This formulation progressively reduces projection errors, particularly in challenging scenarios with large viewpoint variations and sparse feature distributions, where traditional methods struggle. Extensive evaluations on multiple datasets demonstrate that our approach significantly outperforms existing COLMAP-free techniques in reconstruction quality, and also surpasses the standard COLMAP-based baseline in general.
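
The two interleaved phases described above follow the general pattern of alternating (block-coordinate) optimization: hold one group of unknowns fixed while updating the other, then swap. The sketch below illustrates only that pattern on a toy 1D problem and is not the JOGS algorithm: there are no Gaussians, no rendering, and no optical flow. The observation model `obs[i][j] = x[i] + t[j]`, the function name `alternate_fit`, and the choice of view 0 as a fixed reference are all assumptions made for illustration.

```python
def alternate_fit(obs, iters=200):
    """Toy alternating least squares.

    obs[i][j] is the observed 1D coordinate of scene point i in view j,
    modeled as x[i] + t[j]. View 0 is the reference, so t[0] stays at 0.
    """
    n_points, n_views = len(obs), len(obs[0])
    x = [0.0] * n_points  # "scene" unknowns
    t = [0.0] * n_views   # "pose" unknowns
    for _ in range(iters):
        # Phase 1: poses fixed, update the scene (closed-form least squares).
        for i in range(n_points):
            x[i] = sum(obs[i][j] - t[j] for j in range(n_views)) / n_views
        # Phase 2: scene fixed, update every pose except the reference view.
        for j in range(1, n_views):
            t[j] = sum(obs[i][j] - x[i] for i in range(n_points)) / n_points
    return x, t


# Usage: noise-free observations generated from known values.
true_x = [1.0, -2.0, 0.5]
true_t = [0.0, 0.3, -1.2]
observations = [[xi + tj for tj in true_t] for xi in true_x]
est_x, est_t = alternate_fit(observations)
```

Each phase is a simple subproblem even though the joint problem couples both sets of unknowns, which is the practical appeal of decoupling. Fixing one reference pose removes the ambiguity that a constant shift could otherwise move freely between scene and poses.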

[76] WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios

Runsheng Xu,Hubert Lin,Wonseok Jeon,Hao Feng,Yuliang Zou,Liting Sun,John Gorman,Kate Tolstaya,Sarah Tang,Brandyn White,Ben Sapp,Mingxing Tan,Jyh-Jing Hwang,Drago Anguelov

Main category: cs.CV

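TL;DR: Introduces WOD-E2E, a new dataset for end-to-end autonomous driving that focuses on rare and challenging long-tail scenarios, together with a new open-loop evaluation metric, Rater Feedback Score (RFS), for assessing driving systems more effectively in complex real-world situations.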

Details Motivation: Existing end-to-end driving benchmarks concentrate on nominal scenarios and do not adequately test rare but critical long-tail scenarios, while conventional metrics struggle to reflect multi-modal driving behavior and performance in complex situations. A dataset and a more suitable evaluation method targeting these challenges are therefore needed. Method: The authors build WOD-E2E, a dataset of 4,021 driving segments (about 12 hours) focused on long-tail scenarios that occur with a frequency below 0.03%, providing high-level routing information, ego states, and 360-degree camera views; they also propose the RFS metric, which scores a predicted trajectory against rater-annotated trajectory preference labels rather than relying only on distance to the logged trajectory. Result: Rater preference labels for the WOD-E2E validation set have been released, and the test set labels were used for the 2025 WOD-E2E Challenge; RFS better captures human judgments of trajectory quality and improves evaluation of model behavior in complex scenarios. Conclusion: WOD-E2E and RFS provide an important resource and evaluation tool for research on generalizable, robust, and safe end-to-end driving systems in complex real-world environments. Abstract: Vision-based end-to-end (E2E) driving has garnered significant interest in the research community due to its scalability and synergy with multimodal large language models (MLLMs). However, current E2E driving benchmarks primarily feature nominal scenarios, failing to adequately test the true potential of these systems. Furthermore, existing open-loop evaluation metrics often fall short in capturing the multi-modal nature of driving or effectively evaluating performance in long-tail scenarios. To address these gaps, we introduce the Waymo Open Dataset for End-to-End Driving (WOD-E2E). WOD-E2E contains 4,021 driving segments (approximately 12 hours), specifically curated for challenging long-tail scenarios that are rare in daily life, with an occurrence frequency of less than 0.03%. Concretely, each segment in WOD-E2E includes the high-level routing information, ego states, and 360-degree camera views from 8 surrounding cameras. To evaluate the E2E driving performance on these long-tail situations, we propose a novel open-loop evaluation metric: Rater Feedback Score (RFS). Unlike conventional metrics that measure the distance between predicted waypoints and the logs, RFS measures how closely the predicted trajectory matches rater-annotated trajectory preference labels. We have released rater preference labels for all WOD-E2E validation set segments, while the held-out test set labels have been used for the 2025 WOD-E2E Challenge. Through our work, we aim to foster state-of-the-art research into generalizable, robust, and safe end-to-end autonomous driving agents capable of handling complex real-world situations.

[77] Exploring Object-Aware Attention Guided Frame Association for RGB-D SLAM

Ali Caglayan,Nevrez Imamoglu,Oguzhan Guclu,Ali Osman Serhatoglu,Ahmet Burak Can,Ryosuke Nakamura

Main category: cs.CV

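TL;DR: Proposes integrating gradient-based attention information into CNN feature representations to improve frame association in RGB-D indoor SLAM, with the largest gains in large environments.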

Details Motivation: Existing visualization techniques can reveal where a CNN attends, but gradient-based attention information is rarely integrated explicitly into representations for semantic understanding, leaving room for improvement in tasks such as SLAM that benefit from spatial attention. Method: Layer-wise attention information is generated by combining network gradients with CNN features and is explicitly incorporated into the CNN representation to emphasize key object regions, improving frame association in SLAM. Result: Experiments show that the proposed method outperforms baseline methods on frame association, with more pronounced gains in large environments. Conclusion: Integrating task-specific gradient-based attention into CNN representations can improve RGB-D SLAM performance, supporting the practical value of attention information for visual localization. Abstract: Attention models have recently emerged as a powerful approach, demonstrating significant progress in various fields. Visualization techniques, such as class activation mapping, provide visual insights into the reasoning of convolutional neural networks (CNNs). Using network gradients, it is possible to identify regions where the network pays attention during image recognition tasks. Furthermore, these gradients can be combined with CNN features to localize more generalizable, task-specific attentive (salient) regions within scenes. However, explicit use of this gradient-based attention information integrated directly into CNN representations for semantic object understanding remains limited. Such integration is particularly beneficial for visual tasks like simultaneous localization and mapping (SLAM), where CNN representations enriched with spatially attentive object locations can enhance performance. In this work, we propose utilizing task-specific network attention for RGB-D indoor SLAM. Specifically, we integrate layer-wise attention information derived from network gradients with CNN feature representations to improve frame association performance. Experimental results indicate improved performance compared to baseline methods, particularly for large environments.

[78] FullPart: Generating each 3D Part at Full Resolution

Lihe Ding,Shaocong Dong,Yaokun Li,Chenjian Gao,Xiao Chen,Rui Han,Yihao Kuang,Hong Zhang,Bo Huang,Zhanpeng Huang,Zibin Wang,Dan Xu,Tianfan Xue

Main category: cs.CV

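TL;DR: Proposes FullPart, a new 3D part generation framework that combines implicit and explicit representations. Each part is generated in its own full-resolution voxel grid, a center-point encoding strategy maintains global coherence, and the authors release PartVerse-XL, the largest human-annotated 3D part dataset to date.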

Details Motivation: Existing 3D part generation methods fall short on geometric detail or small-part quality: implicit vector-set token representations lack geometric detail, while a shared global low-resolution voxel grid leaves small parts with too few voxels and degrades their quality. Method: The bounding box layout of the parts is first generated by an implicit box vector-set diffusion process; each part is then generated explicitly in its own full-resolution voxel grid, and a center-point encoding strategy resolves misalignment when parts of different sizes exchange information. Result: Experiments show that FullPart achieves state-of-the-art performance in 3D part generation, producing high-quality parts with intricate detail and improving small parts in particular. Conclusion: FullPart combines the strengths of implicit and explicit generation and addresses the resolution and alignment problems in part generation; together with the release of the large-scale PartVerse-XL dataset, it provides a new benchmark and resource for future 3D part generation research. Abstract: Part-based 3D generation holds great potential for various applications. Previous part generators that represent parts using implicit vector-set tokens often suffer from insufficient geometric details. Another line of work adopts an explicit voxel representation but shares a global voxel grid among all parts; this often causes small parts to occupy too few voxels, leading to degraded quality. In this paper, we propose FullPart, a novel framework that combines both implicit and explicit paradigms. It first derives the bounding box layout through an implicit box vector-set diffusion process, a task that implicit diffusion handles effectively since box tokens contain little geometric detail. Then, it generates detailed parts, each within its own fixed full-resolution voxel grid. Instead of sharing a global low-resolution space, each part in our method, even a small one, is generated at full resolution, enabling the synthesis of intricate details. We further introduce a center-point encoding strategy to address the misalignment issue when exchanging information between parts of different actual sizes, thereby maintaining global coherence. Moreover, to tackle the scarcity of reliable part data, we present PartVerse-XL, the largest human-annotated 3D part dataset to date with 40K objects and 320K parts. Extensive experiments demonstrate that FullPart achieves state-of-the-art results in 3D part generation. We will release all code, data, and models to benefit future research in 3D part generation.

[79] BasicAVSR: Arbitrary-Scale Video Super-Resolution via Image Priors and Enhanced Motion Compensation

Wei Shang,Wanying Zhang,Shuhang Gu,Pengfei Zhu,Qinghua Hu,Dongwei Ren

Main category: cs.CV

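TL;DR: Proposes BasicAVSR, a strong baseline for arbitrary-scale video super-resolution (AVSR) built from four key components: adaptive multi-scale frequency priors based on Laplacian pyramids, a flow-guided propagation unit, second-order motion compensation, and a hyper-upsampling unit. Three propagation variants cover different application scenarios. Experiments show that the method significantly outperforms existing methods in quality, generalization, and inference speed.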

Details Motivation: Video super-resolution at varying scaling factors poses challenges in spatial detail recovery, temporal consistency, and computational complexity, and existing methods struggle to balance performance with practicality, so a stronger and more flexible baseline is needed. Method: BasicAVSR generates adaptive multi-scale frequency priors from image Laplacian pyramids, aggregates spatiotemporal information with a flow-guided propagation unit, uses second-order motion compensation for more accurate inter-frame alignment, and employs a hyper-upsampling unit to produce scale-aware, content-independent upsampling kernels; three RNN propagation variants target online, limited-delay, and offline scenarios. Result: BasicAVSR significantly outperforms existing methods on multiple datasets in super-resolution quality, generalization ability, and inference speed, and adapts well across the different propagation modes. Conclusion: BasicAVSR establishes a strong and flexible baseline for arbitrary-scale video super-resolution whose modular design serves diverse application needs and can be extended to other frameworks. Abstract: Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we propose a strong baseline BasicAVSR for AVSR by integrating four key components: 1) adaptive multi-scale frequency priors generated from image Laplacian pyramids, 2) a flow-guided propagation unit to aggregate spatiotemporal information from adjacent frames, 3) a second-order motion compensation unit for more accurate spatial alignment of adjacent frames, and 4) a hyper-upsampling unit to generate scale-aware and content-independent upsampling kernels. To meet diverse application demands, we instantiate three propagation variants: (i) a unidirectional RNN unit for strictly online inference, (ii) a unidirectional RNN unit empowered with a limited lookahead that tolerates a small output delay, and (iii) a bidirectional RNN unit designed for offline tasks where computational resources are less constrained. Experimental results demonstrate the effectiveness and adaptability of our model across these different scenarios. Through extensive experiments, we show that BasicAVSR significantly outperforms existing methods in terms of super-resolution quality, generalization ability, and inference speed. Our work not only advances the state-of-the-art in AVSR but also extends its core components to multiple frameworks for diverse scenarios. The code is available at https://github.com/shangwei5/BasicAVSR.
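
The first component builds on the Laplacian pyramid, which splits a signal into band-pass detail layers plus a coarse base. The sketch below shows that decomposition on a 1D signal, assuming a box filter (pairwise averaging) in place of the usual Gaussian blur and nearest-neighbor upsampling; it is not BasicAVSR code, and all function names are hypothetical.

```python
def downsample(signal):
    """Halve the resolution by averaging neighboring pairs (box filter)."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]


def upsample(signal):
    """Double the resolution by repeating every sample (nearest neighbor)."""
    return [v for v in signal for _ in range(2)]


def build_laplacian_pyramid(signal, levels):
    """Return (bands, base): one detail band per level plus the coarsest signal."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    bands = []
    current = [float(v) for v in signal]
    for _ in range(levels):
        coarse = downsample(current)
        predicted = upsample(coarse)
        # The band stores what the coarse level cannot represent.
        bands.append([c - p for c, p in zip(current, predicted)])
        current = coarse
    return bands, current


def reconstruct(bands, base):
    """Invert the pyramid: upsample the base and add back each detail band."""
    current = base
    for band in reversed(bands):
        current = [p + d for p, d in zip(upsample(current), band)]
    return current
```

Each band isolates detail at one scale, which is why such pyramids are a natural source of multi-scale frequency information; adding the bands back to the upsampled base recovers the input.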

[80] MV-MLM: Bridging Multi-View Mammography and Language for Breast Cancer Diagnosis and Risk Prediction

Shunjie-Fabian Zheng,Hyeonjun Lee,Thijs Kooi,Ali Diba

Main category: cs.CV

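TL;DR: Proposes MV-MLM, a multi-view mammography and language model that uses synthetic radiology reports for cross-modal self-supervised learning, achieving state-of-the-art performance on breast cancer classification and risk prediction with strong data efficiency.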

Details Motivation: Acquiring large medical datasets with fine-grained annotation is costly and time-consuming, so a CAD model that can be trained effectively without real radiology reports is needed. Method: A vision-language model is built on a dataset pairing multi-view mammogram images with synthetic radiology reports; it uses cross-modal self-supervision and a joint visual-textual learning strategy, learning rich representations by aligning multi-view images with pseudo-reports. Result: Evaluated on private and public datasets, the model achieves state-of-the-art performance on three tasks (malignancy classification, subtype classification, and image-based cancer risk prediction) and shows better data efficiency than existing fully supervised or VLM baselines. Conclusion: By exploiting synthetic reports and multi-view supervision, MV-MLM can be trained efficiently without real annotated reports and improves the accuracy and generalization of breast cancer detection and risk prediction. Abstract: Large annotated datasets are essential for training robust Computer-Aided Diagnosis (CAD) models for breast cancer detection or risk prediction. However, acquiring such datasets with fine-grained annotation is both costly and time-consuming. Vision-Language Models (VLMs), such as CLIP, which are pre-trained on large image-text pairs, offer a promising solution by enhancing robustness and data efficiency in medical imaging tasks. This paper introduces a novel Multi-View Mammography and Language Model for breast cancer classification and risk prediction, trained on a dataset of paired mammogram images and synthetic radiology reports. Our MV-MLM leverages multi-view supervision to learn rich representations from extensive radiology data by employing cross-modal self-supervision across image-text pairs. This includes multiple views and the corresponding pseudo-radiology reports. We propose a novel joint visual-textual learning strategy to enhance generalization and accuracy performance over different data types and tasks to distinguish breast tissues or cancer characteristics (calcification, mass) and utilize these patterns to understand mammography images and predict cancer risk. We evaluated our method on both private and publicly available datasets, demonstrating that the proposed model achieves state-of-the-art performance in three classification tasks: (1) malignancy classification, (2) subtype classification, and (3) image-based cancer risk prediction. Furthermore, the model exhibits strong data efficiency, outperforming existing fully supervised or VLM baselines while trained on synthetic text reports and without the need for actual radiology reports.

[81] Detecting Unauthorized Vehicles using Deep Learning for Smart Cities: A Case Study on Bangladesh

Sudipto Das Sukanto,Diponker Roy,Fahim Shakil,Nirjhar Singha,Abdullah Asik,Aniket Joarder,Mridha Md Nafis Fuad,Muhammad Ibrahim

Main category: cs.CV

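TL;DR: Presents a YOLOv8-based machine learning approach for automatically detecting auto-rickshaws (motorized rickshaws) in traffic images from Bangladesh, achieving high detection accuracy in real-world scenes.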

Details Motivation: Auto-rickshaws are restricted from certain routes in city traffic, but existing surveillance systems struggle to distinguish them from non-motorized rickshaws, and manual video analysis is time-consuming, so an efficient automatic detection method is needed. Method: YOLOv8 is used for real-time object detection, trained on a dataset of 1,730 annotated images captured under various traffic conditions. Result: The model reaches an mAP50 of 83.447% with binary precision and recall both above 78%, and handles both dense and sparse traffic scenes. Conclusion: The method detects auto-rickshaws efficiently and accurately and has potential for practical use; the dataset has been released publicly to support further research. Abstract: Modes of transportation vary across countries depending on geographical location and cultural context. In South Asian countries, rickshaws are among the most common means of local transport. Based on their mode of operation, rickshaws in cities across Bangladesh can be broadly classified into non-auto (pedal-powered) and auto-rickshaws (motorized). Monitoring the movement of auto-rickshaws is necessary as traffic rules often restrict auto-rickshaws from accessing certain routes. However, existing surveillance systems make it quite difficult to monitor them due to their similarity to other vehicles, especially non-auto rickshaws, whereas manual video analysis is too time-consuming. This paper presents a machine learning-based approach to automatically detect auto-rickshaws in traffic images. In this system, we perform real-time object detection using the YOLOv8 model. For training purposes, we prepared a set of 1,730 annotated images that were captured under various traffic conditions. The results show that our proposed model performs well in real-time auto-rickshaw detection and offers an mAP50 of 83.447% and binary precision and recall values above 78%, demonstrating its effectiveness in handling both dense and sparse traffic scenarios. The dataset has been publicly released for further research.
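
The detection figures above rest on intersection over union (IoU): a predicted box counts as a true positive when its IoU with an unmatched ground-truth box reaches a threshold, which is 0.5 for mAP50. The sketch below computes IoU and a single precision/recall pair at that threshold. It is a simplified illustration, not the paper's evaluation code; the function names, the `(x1, y1, x2, y2)` box format, and the greedy matching are assumptions, and full mAP50 additionally averages precision over the whole precision-recall curve.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truth, iou_threshold=0.5):
    """Greedy matching; predictions are assumed sorted by descending confidence."""
    matched = set()
    true_pos = 0
    for pred in predictions:
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue  # each ground-truth box can be matched only once
            score = iou(pred, gt)
            if score > best_iou:
                best_iou, best_idx = score, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_pos += 1
    precision = true_pos / len(predictions) if predictions else 0.0
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

Precision penalizes spurious detections and recall penalizes missed vehicles, so the two together show whether a detector errs toward false alarms or toward misses.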

[82] CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

Jiaqi Wang,Xiao Yang,Kai Sun,Parth Suresh,Sanat Sharma,Adam Czyzewski,Derek Andersen,Surya Appini,Arkav Banerjee,Sajal Choudhary,Shervin Ghasemlou,Ziqiang Guan,Akil Iyer,Haidar Khan,Lingkun Kong,Roy Luo,Tiffany Ma,Zhen Qiao,David Tran,Wenfang Xu,Skyler Yeatman,Chen Zhou,Gunveer Gujral,Yinglong Xia,Shane Moon,Nicolas Scheffer,Nirav Shah,Eun Chang,Yue Liu,Florian Metze,Tammy Stark,Zhaleh Feizollahi,Andrea Jessee,Mangesh Pujari,Ahmed Aly,Babak Damavandi,Rakesh Wanga,Anuj Kumar,Rohit Patel,Wen-tau Yih,Xin Luna Dong

Main category: cs.CV

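TL;DR: Presents CRAG-MM, a comprehensive retrieval-augmented generation benchmark for multi-modal, multi-turn conversations in wearable-device scenarios. It contains 6.5K image question-answer triplets and 2K multi-turn conversations across 13 domains, defines three tasks covering single-source, multi-source, and multi-turn settings, and shows that existing methods still have substantial room to improve on truthfulness.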

Details Motivation: Multi-modal retrieval-augmented generation (MM-RAG) lacks a comprehensive benchmark for wearable-device scenarios, making it hard to evaluate multi-modal multi-turn conversational systems under real-world conditions. Method: The authors build a dataset of 6.5K (image, question, answer) triplets and 2K multi-turn conversations across 13 domains, including 6.2K egocentric images that mimic captures from wearable devices; they design three tasks (single-source augmentation, multi-source augmentation, multi-turn conversations) and provide the associated retrieval corpora and APIs for image-KG and webpage retrieval. Result: Straightforward RAG approaches reach only 32% and 43% truthfulness on single- and multi-turn QA, and state-of-the-art industry solutions reach only 32%/45%; KDD Cup 2025 was hosted on the benchmark, attracting about 1,000 participants and 5,000 submissions, with winning solutions improving baseline performance by 28%. Conclusion: CRAG-MM fills the gap in multi-modal RAG benchmarks for wearable scenarios, is realistic and broadly applicable, and has already had early impact in academia and industry. Abstract: Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially regarding wearables scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations -- each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement. The benchmark has hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.

[83] MoTDiff: High-resolution Motion Trajectory estimation from a single blurred image using Diffusion models

Wontae Choi,Jaelin Lee,Hyung Sup Yun,Byeungwoo Jeon,Il Yong Chun

Main category: cs.CV

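TL;DR: Proposes MoTDiff, the first diffusion-based framework for high-resolution motion trajectory estimation, which recovers a high-quality motion trajectory from a single motion-blurred image and outperforms existing methods in blind deblurring and coded exposure photography.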

Details Motivation: Existing motion representations are often of low quality, being coarse-grained and inaccurate, and cannot meet the need for precise motion information in computational imaging and computer vision applications. Method: The MoTDiff framework has two key parts: 1) a new conditional diffusion framework conditioned on multi-scale feature maps of a single blurred image; 2) a new training method that promotes precise identification of fine-grained motion trajectories, consistent estimation of the overall shape and position of the motion path, and pixel connectivity along the trajectory. Result: Experiments show that MoTDiff outperforms state-of-the-art methods in both blind image deblurring and coded exposure photography, improving trajectory estimation quality and downstream task performance. Conclusion: MoTDiff is the first framework to use diffusion models for high-resolution motion trajectory estimation, addresses the difficulty of recovering fine motion information from a single blurred image, and shows good practicality and potential for extension. Abstract: Accurate estimation of motion information is crucial in diverse computational imaging and computer vision applications. Researchers have investigated various methods to extract motion information from a single blurred image, including blur kernels and optical flow. However, existing motion representations are often of low quality, i.e., coarse-grained and inaccurate. In this paper, we propose the first high-resolution (HR) Motion Trajectory estimation framework using Diffusion models (MoTDiff). Different from existing motion representations, we aim to estimate a high-quality HR motion trajectory from a single motion-blurred image. The proposed MoTDiff consists of two key components: 1) a new conditional diffusion framework that uses multi-scale feature maps extracted from a single blurred image as a condition, and 2) a new training method that can promote precise identification of a fine-grained motion trajectory, consistent estimation of overall shape and position of a motion path, and pixel connectivity along a motion trajectory. Our experiments demonstrate that the proposed MoTDiff can outperform state-of-the-art methods in both blind image deblurring and coded exposure photography applications.

[84] ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts

Jinho Choi,Hyesu Lim,Steffen Schneider,Jaegul Choo

Main category: cs.CV

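TL;DR: ConceptScope is a scalable, automated framework that uses sparse autoencoders to analyze concepts in visual datasets, identifying and quantifying human-interpretable biases without fine-grained annotations and detecting both known and previously unknown dataset biases.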

Details Motivation: Existing methods struggle to identify concept biases in machine learning datasets systematically without costly annotations, so an automated, scalable way to discover and quantify them is needed. Method: ConceptScope uses sparse autoencoders trained on representations from vision foundation models to discover and quantify visual concepts, and categorizes concepts into target, context, and bias types according to their semantic relevance and statistical correlation to class labels. Result: ConceptScope captures a wide range of visual concepts (objects, textures, emotions, and more), produces spatial attributions aligned with semantically meaningful image regions, and detects both known biases (e.g., background bias in Waterbirds) and unannotated ones (e.g., co-occurring objects in ImageNet). Conclusion: ConceptScope offers a practical tool for dataset auditing and model diagnostics, enabling fine-grained bias identification and robustness evaluation without manual annotation. Abstract: Dataset bias, where data points are skewed to certain concepts, is ubiquitous in machine learning datasets. Yet, systematically identifying these biases is challenging without costly, fine-grained attribute annotations. We present ConceptScope, a scalable and automated framework for analyzing visual datasets by discovering and quantifying human-interpretable concepts using Sparse Autoencoders trained on representations from vision foundation models. ConceptScope categorizes concepts into target, context, and bias types based on their semantic relevance and statistical correlation to class labels, enabling class-level dataset characterization, bias identification, and robustness evaluation through concept-based subgrouping. We validate that ConceptScope captures a wide range of visual concepts, including objects, textures, backgrounds, facial attributes, emotions, and actions, through comparisons with annotated datasets. Furthermore, we show that concept activations produce spatial attributions that align with semantically meaningful image regions. ConceptScope reliably detects known biases (e.g., background bias in Waterbirds) and uncovers previously unannotated ones (e.g., co-occurring objects in ImageNet), offering a practical tool for dataset auditing and model diagnostics.

[85] Sketch2PoseNet: Efficient and Generalized Sketch to 3D Human Pose Prediction

Li Wang,Yiyu Zhuang,Yanwen Wang,Xun Cao,Chuan Guo,Xinxin Zuo,Hao Zhu

Main category: cs.CV

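TL;DR: Proposes a synthetic-data-based approach to 3D human pose estimation from sketches: a diffusion model generates SKEP-120K, an annotated sketch-3D pose dataset, on which an end-to-end data-driven framework is built that outperforms prior methods in both accuracy and speed.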

Details Motivation: Existing sketch-to-3D-pose methods are limited by the lack of large-scale annotated data and rely on optimization with heuristic rules, which is time-consuming and generalizes poorly. Method: Adopts a "learn from synthesis" strategy: a diffusion model first generates synthetic sketches that mimic sketch body structure from 2D poses (projected from 3D poses), yielding the SKEP-120K dataset; on top of this, an end-to-end framework combines 2D pose detectors, diffusion priors, and a feed-forward network for sketch feature extraction and 2D pose estimation, with several heuristic losses that enforce geometric consistency between the 3D poses and the 2D detections as well as accurate self-contacts. Result: In qualitative, quantitative, and subjective evaluations, the method substantially outperforms prior methods in both estimation accuracy and speed. Conclusion: By using synthetic data, the method addresses the scarcity of paired sketch-3D pose data and achieves efficient, accurate pose estimation across diverse sketch styles, advancing sketch-to-3D-pose estimation. Abstract: 3D human pose estimation from sketches has broad applications in computer animation and film production. Unlike traditional human pose estimation, this task presents unique challenges due to the abstract and disproportionate nature of sketches. Previous sketch-to-pose methods, constrained by the lack of large-scale sketch-3D pose annotations, primarily relied on optimization with heuristic rules, an approach that is both time-consuming and limited in generalizability. To address these challenges, we propose a novel approach leveraging a "learn from synthesis" strategy. First, a diffusion model is trained to synthesize sketch images from 2D poses projected from 3D human poses, mimicking disproportionate human structures in sketches. This process enables the creation of a synthetic dataset, SKEP-120K, consisting of 120k accurate sketch-3D pose annotation pairs across various sketch styles. Building on this synthetic dataset, we introduce an end-to-end data-driven framework for estimating human poses and shapes from diverse sketch styles. Our framework combines existing 2D pose detectors and generative diffusion priors for sketch feature extraction with a feed-forward neural network for efficient 2D pose estimation. Multiple heuristic loss functions are incorporated to guarantee geometric coherence between the derived 3D poses and the detected 2D poses while preserving accurate self-contacts. Qualitative, quantitative, and subjective evaluations collectively show that our model substantially surpasses previous ones in both estimation accuracy and speed for sketch-to-pose tasks.

[86] Developing a Multi-task Ensemble Geometric Deep Network for Supply Chain Sustainability and Risk Management

Mehdi Khaleghi,Nastaran Khaleghi,Sobhan Sheykhivand,Sebelan Danishvar

Main category: cs.CV

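TL;DR: Proposes a hybrid deep learning model based on the Chebyshev ensemble geometric network (Ch-EGN) to improve supply chain sustainability and risk management, achieving high accuracy on several tasks.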

Details Motivation: Improving supply chain sustainability and risk control requires exploiting the information dependencies in the data and classifying products and relations accurately. Method: Proposes a new Chebyshev ensemble geometric network (Ch-EGN) that combines convolutional neural networks with geometric deep learning, and runs risk prediction, product classification, and edge classification experiments on two datasets, SupplyGraph and DataCo. Result: Average accuracy reaches 98.95% for risk prediction; 100% and 98.07% for product classification (5 classes) and product relation classification (4 classes), respectively; and 92.37% for company relation classification (25 classes), outperforming existing methods overall. Conclusion: The proposed Ch-EGN model mines hidden states and dependencies in supply chain data, improving the accuracy and efficiency of sustainability management and risk prediction. Abstract: The sustainability of the supply chain plays a key role in achieving optimal performance in controlling the supply chain. The management of risks that occur in a supply chain is a fundamental problem for developing the sustainability of the network and raising the performance efficiency of the supply chain. The correct classification of products is another essential element in a sustainable supply chain. Following recent breakthroughs in deep networks, several architectural options have been deployed to analyze supply chain datasets. A novel geometric deep network is used to propose an ensemble deep network. The proposed Chebyshev ensemble geometric network (Ch-EGN) is a hybrid convolutional and geometric deep learning model. This network is proposed to leverage the information dependencies in the supply chain to derive invisible states of samples in the database. The functionality of the proposed deep network is assessed on two different databases: the SupplyGraph dataset and DataCo. The delivery status of the DataCo supply chain is predicted for risk administration. Product classification and edge classification are performed on the SupplyGraph database to enhance the sustainability of the supply network. An average accuracy of 98.95% is obtained for the ensemble network for risk management. Average accuracies of 100% and 98.07% are obtained for the sustainable supply chain in terms of 5-class product group classification and 4-class product relation classification, respectively. An average accuracy of 92.37% is attained for 25-class company relation classification. The results confirm an average improvement and efficiency of the proposed method compared to the state-of-the-art approaches.
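
The abstract does not spell out how the Chebyshev component of Ch-EGN is built. As a rough illustration only, the sketch below applies a ChebNet-style polynomial graph filter, y = sum_k theta_k T_k(L) x, using the recurrence T_0 = I, T_1 = L, T_k = 2 L T_{k-1} - T_{k-2}. The function names, the toy matrix, and the coefficients are hypothetical, and the input matrix is assumed to be an already rescaled graph Laplacian; none of this is confirmed to match the paper's architecture.

```python
def matvec(matrix, vector):
    """Multiply a dense square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def chebyshev_filter(scaled_laplacian, signal, thetas):
    """Return sum_k thetas[k] * T_k(L) @ signal via the Chebyshev recurrence.

    Assumes scaled_laplacian has eigenvalues in [-1, 1] (hypothetical input).
    """
    if not thetas:
        raise ValueError("need at least one coefficient")
    t_prev = list(signal)                      # T_0(L) x = x
    out = [thetas[0] * v for v in t_prev]
    if len(thetas) == 1:
        return out
    t_curr = matvec(scaled_laplacian, signal)  # T_1(L) x = L x
    out = [o + thetas[1] * v for o, v in zip(out, t_curr)]
    for theta in thetas[2:]:
        # T_k(L) x = 2 L T_{k-1}(L) x - T_{k-2}(L) x
        lt = matvec(scaled_laplacian, t_curr)
        t_next = [2 * a - b for a, b in zip(lt, t_prev)]
        out = [o + theta * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out
```

Each extra coefficient widens the filter's receptive field by one hop, which is why the polynomial order is usually kept small.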

[87] OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation

Hengrui Kang,Zhuangcheng Gu,Zhiyuan Zhao,Zichen Wen,Bin Wang,Weijia Li,Conghui He

Main category: cs.CV

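TL;DR: This paper presents OmniLayout-1M, the first million-scale dataset of diverse document layouts, and OmniLayout-LLM, a 0.5B-parameter model trained with a two-stage coarse-to-fine paradigm that substantially outperforms existing layout generation methods and general-purpose LLMs across multiple document domains.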

Details Motivation: Existing research on document layout generation is confined to a few document types (such as academic papers) and lacks diversity, and existing methods struggle to produce coherent layouts in complex scenarios, calling for richer data and stronger models. Method: Builds OmniLayout-1M, a million-scale dataset covering six document types, and proposes OmniLayout-LLM, trained with a two-stage Coarse-to-Fine paradigm: it first learns universal layout principles with coarse categories, then transfers to a specific domain with fine-grained annotations. Result: Experiments on multiple domains of the M$^{6}$Doc dataset show that the method substantially outperforms existing layout generation experts and recent general-purpose LLMs, particularly on complex long-sequence layout tasks. Conclusion: OmniLayout-1M and OmniLayout-LLM provide a new data foundation and an effective modeling framework for document layout generation, moving the field toward diverse open-world layouts. Abstract: Document AI has advanced rapidly and is attracting increasing attention. Yet, while most efforts have focused on document layout analysis (DLA), its generative counterpart, document layout generation, remains underexplored. A major obstacle lies in the scarcity of diverse layouts: academic papers with Manhattan-style structures dominate existing studies, while open-world genres such as newspapers and magazines remain severely underrepresented. To address this gap, we curate OmniLayout-1M, the first million-scale dataset of diverse document layouts, covering six common document types and comprising contemporary layouts collected from multiple sources. Moreover, since existing methods struggle in complex domains and often fail to arrange long sequences coherently, we introduce OmniLayout-LLM, a 0.5B model with a designed two-stage Coarse-to-Fine learning paradigm: 1) learning universal layout principles from OmniLayout-1M with coarse category definitions, and 2) transferring the knowledge to a specific domain with fine-grained annotations. Extensive experiments demonstrate that our approach achieves strong performance on multiple domains in the M$^{6}$Doc dataset, substantially surpassing both existing layout generation experts and several of the latest general-purpose LLMs. Our code, models, and dataset will be publicly released.

[88] Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models

Shiho Matta,Lis Kanashiro Pereira,Peitao Han,Fei Cheng,Shigeru Kitazawa

Main category: cs.CV

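TL;DR: This paper introduces AoT-PsyPhyBENCH, a new benchmark that evaluates whether vision-language models (VLMs) can judge the temporal direction of a video (played forward or backward). It shows that existing models fall far short of humans on physically irreversible processes and causal manual actions, exposing fundamental deficits in temporal continuity and causal understanding.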

Details Motivation: Although modern vision-language models excel at multimodal tasks, their understanding of temporal information in video is weak and under-evaluated. The authors therefore use the simple yet revealing "arrow of time" task to test whether VLMs have human-like temporal perception. Method: Builds AoT-PsyPhyBENCH, a psychophysically validated benchmark that uses the same stimuli and behavioral baselines as human experiments, and systematically evaluates open-weight and proprietary, reasoning and non-reasoning VLMs on judging temporal direction in natural videos. Result: Most VLMs perform near chance, and even the best model remains far below human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (e.g., division/addition). Conclusion: Current multimodal systems capture rich visual-semantic correlations but lack the inductive biases needed for temporal continuity and causal reasoning, and need improvement to achieve genuine physical and temporal understanding. Abstract: Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), that is, whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.

[89] Revisiting Generative Infrared and Visible Image Fusion Based on Human Cognitive Laws

Lin Guo,Xiaoqing Luo,Wei Xie,Zhancheng Zhang,Hui Li,Rui Wang,Zhenhua Feng,Xiaoning Song

Main category: cs.CV

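TL;DR: This paper proposes HCLFuse, an infrared and visible image fusion method inspired by human cognition. Using a multi-scale mask-regulated variational bottleneck encoder and a time-varying physical guidance mechanism, it produces high-quality, structurally consistent fusion results without supervision and significantly improves semantic segmentation performance.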

Details Motivation: Existing fusion methods are limited in balancing modal information and in generative capability, and their information selection lacks interpretability, which affects reliability and consistency in complex scenes. Method: Proposes HCLFuse, which designs a multi-scale mask-regulated variational bottleneck encoder for information decomposition and modeling, and combines a diffusion model with physical laws into a time-varying physical guidance mechanism that strengthens structure awareness during generation. Result: Achieves state-of-the-art qualitative and quantitative fusion performance on multiple datasets and significantly improves metrics on the downstream semantic segmentation task. Conclusion: By mimicking human cognitive mechanisms, the method improves the structural consistency, detail quality, and robustness of generative image fusion and shows good application prospects. Abstract: Existing infrared and visible image fusion methods often face the dilemma of balancing modal information. Generative fusion methods reconstruct fused images by learning from data distributions, but their generative capabilities remain limited. Moreover, the lack of interpretability in modal information selection further affects the reliability and consistency of fusion results in complex scenarios. This manuscript revisits the essence of generative image fusion under the inspiration of human cognitive laws and proposes a novel infrared and visible image fusion method, termed HCLFuse. First, HCLFuse investigates the quantification theory of information mapping in unsupervised fusion networks, which leads to the design of a multi-scale mask-regulated variational bottleneck encoder. This encoder applies posterior probability modeling and information decomposition to extract accurate and concise low-level modal information, thereby supporting the generation of high-fidelity structural details. Furthermore, the probabilistic generative capability of the diffusion model is integrated with physical laws, forming a time-varying physical guidance mechanism that adaptively regulates the generation process at different stages, thereby enhancing the ability of the model to perceive the intrinsic structure of data and reducing dependence on data quality. Experimental results show that the proposed method achieves state-of-the-art fusion performance in qualitative and quantitative evaluations across multiple datasets and significantly improves semantic segmentation metrics. This demonstrates the advantages of a generative image fusion method inspired by human cognition in enhancing structural consistency and detail quality.

[90] Exploring Complementarity and Explainability in CNNs for Periocular Verification Across Acquisition Distances

Fernando Alonso-Fernandez,Kevin Hernandez Diaz,Jose M. Buades,Kiran Raja,Josef Bigun

Main category: cs.CV

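TL;DR: This paper studies the complementarity of CNN models for periocular verification at different distances on the UBIPr database. Fusing three architectures (SqueezeNet, MobileNetv2, and ResNet50) at score level, with cosine and chi-square metrics, substantially improves performance; LIME heatmaps and Jensen-Shannon divergence are used to analyze differences in model attention and explain the complementarity, yielding a new state-of-the-art result.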

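Details Motivation: To explore the periocular recognition performance and complementarity of CNN models of different complexity at different acquisition distances, with the aim of improving accuracy at long range. Method: Trains three CNNs (SqueezeNet, MobileNetv2, and ResNet50) on the VGGFace2 dataset and evaluates them at different distances on UBIPr; uses cosine similarity and chi-square distance as matching metrics; applies score-level fusion via logistic regression; and analyzes differences in the models' attention using LIME heatmaps and Jensen-Shannon divergence. Result: ResNet50 performs best individually, but fusing the three models yields substantial gains, especially at longer distances; heatmap analysis shows that the models attend to different image regions, confirming their complementarity; the method reaches a new state of the art on UBIPr. Conclusion: CNNs with different architectures are complementary for periocular recognition; fusing several models improves recognition across distances, and attention analysis helps explain model behavior. Abstract: We study the complementarity of different CNNs for periocular verification at different distances on the UBIPr database. We train three architectures of increasing complexity (SqueezeNet, MobileNetv2, and ResNet50) on a large set of eye crops from VGGFace2. We analyse performance with cosine and chi2 metrics, compare different network initialisations, and apply score-level fusion via logistic regression. In addition, we use LIME heatmaps and Jensen-Shannon divergence to compare attention patterns of the CNNs. While ResNet50 consistently performs best individually, the fusion provides substantial gains, especially when combining all three networks. Heatmaps show that networks usually focus on distinct regions of a given image, which explains their complementarity. Our method significantly outperforms previous works on UBIPr, achieving a new state-of-the-art.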
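
To make the heatmap comparison concrete, here is a minimal sketch of a Jensen-Shannon divergence between two attention heatmaps. It assumes each heatmap is a grid of non-negative values that is first normalized into a probability distribution; the abstract does not state how the authors normalize their LIME maps, so that step and the function names are hypothetical. With base-2 logarithms the value ranges from 0 (identical maps) to 1 (no overlap).

```python
import math


def to_distribution(heatmap):
    """Flatten a 2D grid of non-negative values and normalize it to sum to 1."""
    flat = [value for row in heatmap for value in row]
    total = sum(flat)
    if total <= 0:
        raise ValueError("heatmap must contain positive mass")
    return [value / total for value in flat]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(heatmap_a, heatmap_b):
    """Jensen-Shannon divergence between two heatmaps of the same shape."""
    p = to_distribution(heatmap_a)
    q = to_distribution(heatmap_b)
    if len(p) != len(q):
        raise ValueError("heatmaps must have the same number of cells")
    # The mixture m is positive wherever p or q is, so both KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

A higher value between two networks' heatmaps for the same image would indicate that they attend to different regions, which is the kind of evidence the abstract cites for complementarity.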

[91] Beyond Imitation: Constraint-Aware Trajectory Generation with Flow Matching For End-to-End Autonomous Driving

Lin Liu,Guanyi Yu,Ziying Song,Junqiao Li,Caiyan Jia,Feiyang Jia,Peiliang Wu,Yandan Luo

Main category: cs.CV

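TL;DR: Proposes CATG, an autonomous driving planning framework based on constrained flow matching that mitigates mode collapse and imposes safety and kinematic constraints directly in the generative process, producing diverse and safe trajectories.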

Details Motivation: Existing imitation learning methods suffer from mode collapse, while generative models struggle to incorporate safety and physical constraints directly and need an extra optimization step, which limits the performance and safety of autonomous driving planning. Method: Proposes the CATG framework, which uses constrained flow matching to explicitly model and impose safety and kinematic constraints within the flow matching process, and treats driving aggressiveness as a controllable signal for adjusting trajectory style. Result: On the NavSim v2 challenge, CATG took 2nd place with an EPDMS score of 51.31 and received the Innovation Award. Conclusion: By introducing explicit constraints and controllable generation into flow matching, CATG improves the diversity, safety, and practicality of generated trajectories and shows strong potential for real-world use. Abstract: Planning is a critical component of end-to-end autonomous driving. However, prevailing imitation learning methods often suffer from mode collapse, failing to produce diverse trajectory hypotheses. Meanwhile, existing generative approaches struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. To address these limitations, we propose CATG, a novel planning framework that leverages Constrained Flow Matching. Concretely, CATG explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our primary contribution is the novel imposition of explicit constraints directly within the flow matching process, ensuring that the generated trajectories adhere to vital safety and kinematic rules. Secondly, CATG parameterizes driving aggressiveness as a control signal during generation, enabling precise manipulation of trajectory style. Notably, on the NavSim v2 challenge, CATG achieved 2nd place with an EPDMS score of 51.31 and was honored with the Innovation Award.

[92] Leveraging Large-Scale Face Datasets for Deep Periocular Recognition via Ocular Cropping

Fernando Alonso-Fernandez,Kevin Hernandez-Diaz,Jose Maria Buades Rubio,Josef Bigun

Main category: cs.CV

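TL;DR: This paper studies biometric recognition based on the periocular region, training three convolutional neural networks of varying depth and complexity on the large-scale VGGFace2 dataset and evaluating them on the VGGFace2-Pose and UFPR-Periocular databases, where it reports the lowest Equal Error Rates (EERs) to date on UFPR-Periocular, at 1-2%.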

Details Motivation: The periocular region is highly discriminative and has few acquisition constraints, but existing work mostly relies on small datasets and lacks validation at large scale. The paper therefore evaluates deep learning models for periocular recognition in large-scale, real-world settings. Method: Trains three convolutional neural networks of differing complexity on more than 1.9 million periocular images extracted from the VGGFace2 database and tests periocular recognition on two datasets, VGGFace2-Pose and UFPR-Periocular. Result: On VGGFace2-Pose, periocular EERs are 9-15%, higher than the 3-6% obtained with full-face images; on the higher-quality UFPR-Periocular dataset, EERs fall to 1-2%, the best reported on that dataset. Conclusion: Large-scale training data combined with deep convolutional networks can substantially improve periocular recognition, particularly under high-quality, well-controlled acquisition (as in UFPR-Periocular), where very low error rates are attainable, supporting practical use of periocular biometrics. Abstract: We focus on ocular biometrics, specifically the periocular region (the area around the eye), which offers high discrimination and minimal acquisition constraints. We evaluate three Convolutional Neural Network architectures of varying depth and complexity to assess their effectiveness for periocular recognition. The networks are trained on 1,907,572 ocular crops extracted from the large-scale VGGFace2 database. This significantly contrasts with existing works, which typically rely on small-scale periocular datasets for training having only a few thousand images. Experiments are conducted with ocular images from VGGFace2-Pose, a subset of VGGFace2 containing in-the-wild face images, and the UFPR-Periocular database, which consists of selfies captured via mobile devices with user guidance on the screen. Due to the uncontrolled conditions of VGGFace2, the Equal Error Rates (EERs) obtained with ocular crops range from 9-15%, noticeably higher than the 3-6% EERs achieved using full-face images. In contrast, UFPR-Periocular yields significantly better performance (EERs of 1-2%), thanks to higher image quality and more consistent acquisition protocols. To the best of our knowledge, these are the lowest reported EERs on the UFPR dataset to date.
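
Since the results above are all stated as Equal Error Rates, a small sketch of how an EER can be estimated from comparison scores may help. It assumes similarity scores (higher means more likely the same person), sweeps every observed score as a threshold, and reports the operating point where the false acceptance and false rejection rates are closest. The function name and the averaging at the crossing point are illustrative choices, not the authors' evaluation code.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER from similarity scores of genuine and impostor pairs."""
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap = None
    eer = None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # A comparison is accepted when its score reaches the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2  # midpoint where the two rates are closest
    return eer
```

With only a few scores the estimate is coarse; evaluations on large protocols typically interpolate between thresholds instead.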

[93] Towards Realistic Earth-Observation Constellation Scheduling: Benchmark and Methodology

Luting Wang,Yinghao Xiang,Hongliang Huang,Dongjun Li,Chen Gao,Si Liu

Main category: cs.CV

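TL;DR: This paper proposes a unified framework for scheduling constellations of agile Earth observation satellites, comprising AEOS-Bench, the first large-scale benchmark suite with realistic scenarios, and AEOS-Former, a Transformer-based scheduling model.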

Details Motivation: Existing methods often oversimplify satellite scheduling under large scale, dynamic environments, and strict constraints, limiting real-world performance. A more realistic, standardized benchmark and a stronger scheduling model are needed. Method: Builds a high-fidelity simulation platform to generate the AEOS-Bench suite, containing 3,907 satellite assets and 16,410 scenarios, and proposes AEOS-Former, which uses a constraint-aware attention mechanism and an internal constraint module and is trained through simulation-based iterative learning. Result: Experiments show that AEOS-Former outperforms baseline models in task completion rate and energy efficiency, and ablation studies confirm the contribution of each component. Conclusion: AEOS-Bench gives satellite scheduling research a standardized, large-scale, realistic benchmark, and AEOS-Former demonstrates efficient scheduling in complex dynamic environments. Abstract: Agile Earth Observation Satellite (AEOS) constellations offer unprecedented flexibility for monitoring the Earth's surface, but their scheduling remains challenging under large-scale scenarios, dynamic environments, and stringent constraints. Existing methods often simplify these complexities, limiting their real-world performance. We address this gap with a unified framework integrating a standardized benchmark suite and a novel scheduling model. Our benchmark suite, AEOS-Bench, contains $3,907$ finely tuned satellite assets and $16,410$ scenarios. Each scenario features $1$ to $50$ satellites and $50$ to $300$ imaging tasks. These scenarios are generated via a high-fidelity simulation platform, ensuring realistic satellite behavior such as orbital dynamics and resource constraints. Ground truth scheduling annotations are provided for each scenario. To our knowledge, AEOS-Bench is the first large-scale benchmark suite tailored for realistic constellation scheduling. Building upon this benchmark, we introduce AEOS-Former, a Transformer-based scheduling model that incorporates a constraint-aware attention mechanism. A dedicated internal constraint module explicitly models the physical and operational limits of each satellite. Through simulation-based iterative learning, AEOS-Former adapts to diverse scenarios, offering a robust solution for AEOS constellation scheduling. Experimental results demonstrate that AEOS-Former outperforms baseline models in task completion and energy efficiency, with ablation studies highlighting the contribution of each component. Code and data are available at https://github.com/buaa-colalab/AEOSBench.

[94] Exploring the correlation between the type of music and the emotions evoked: A study using subjective questionnaires and EEG

Jelizaveta Jankowska,Bożena Kostek,Fernando Alonso-Fernandez,Prayag Tiwari

Main category: cs.CV

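TL;DR: This study examines how different types of music affect human emotions, analyzing them through subjective surveys and electroencephalography (EEG) measurements.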

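Details Motivation: To understand how different music genres affect emotions, providing a scientific basis for music therapy and emotion regulation. Method: Records participants' brain activity with an EEG helmet while they listen to different music, combined with subjective questionnaires for emotion assessment. Result: The analysis shows associations between emotions and brain activity, with different music types eliciting different EEG patterns and emotional responses. Conclusion: Music type significantly affects emotional state, and EEG data are consistent with subjective reports, suggesting that combining physiological and psychological responses can effectively assess music's emotional impact. Abstract: The subject of this work is to check how different types of music affect human emotions. While participants listened to music, a subjective survey was carried out and brain activity was measured using an EEG helmet. The aim is to demonstrate the impact of different music genres on emotions. The research involved a diverse group of participants of different genders and musical preferences, which made it possible to capture a wide range of emotional responses to music. After the experiment, the relationship between the respondents' questionnaires and the EEG signals was analyzed. The analysis revealed connections between emotions and observed brain activity.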

[95] A Hybrid Framework Bridging CNN and ViT based on Theory of Evidence for Diabetic Retinopathy Grading

Junlai Qiu,Yunzhu Chen,Hao Zheng,Yawen Huang,Yuexiang Li

Main category: cs.CV

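TL;DR: Proposes a fusion paradigm based on the theory of evidence that combines the strengths of CNNs and ViTs to improve the accuracy and interpretability of diabetic retinopathy grading.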

Details Motivation: Automated DR diagnosis systems built on a single backbone (CNN or ViT) have hit a performance bottleneck and struggle to combine local and global feature extraction. Method: Proposes a fusion paradigm based on the theory of evidence, using deep evidential networks to turn features from different backbones into supporting evidence and adaptively tuning the fusion pattern accordingly. Result: Validated on two public DR datasets, the method improves grading accuracy over state-of-the-art models while offering good interpretability for feature fusion and decision-making. Conclusion: The proposed evidential fusion paradigm combines the strengths of CNNs and ViTs, improving automated DR diagnosis and model interpretability. Abstract: Diabetic retinopathy (DR) is a leading cause of vision loss among middle-aged and elderly people, which significantly impacts their daily lives and mental health. To improve the efficiency of clinical screening and enable the early detection of DR, a variety of automated DR diagnosis systems have been recently established based on convolutional neural network (CNN) or vision Transformer (ViT). However, due to the inherent shortcomings of CNN and ViT, the performance of existing methods using a single-type backbone has reached a bottleneck. One potential way to improve further is to integrate different kinds of backbones, which can fully leverage their respective strengths (i.e., the local feature extraction capability of CNN and the global feature capturing ability of ViT). To this end, we propose a novel paradigm to effectively fuse the features extracted by different backbones based on the theory of evidence. Specifically, the proposed evidential fusion paradigm transforms the features from different backbones into supporting evidence via a set of deep evidential networks. With the supporting evidence, an aggregated opinion can be formed, which is used to adaptively tune the fusion pattern between different backbones and thereby boost the performance of our hybrid model. We evaluated our method on two publicly available DR grading datasets. The experimental results demonstrate that our hybrid model not only improves the accuracy of DR grading compared to state-of-the-art frameworks, but also provides excellent interpretability for feature fusion and decision-making.
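
The abstract does not give the fusion equations. One common formulation in evidential deep learning maps non-negative per-class evidence to a subjective-logic opinion (belief masses plus an uncertainty mass) and merges two opinions with a reduced Dempster combination rule. The sketch below follows that formulation as an assumption; it is not confirmed to be the paper's method, and all function names and toy evidence values are hypothetical.

```python
def opinion_from_evidence(evidence):
    """Map non-negative class evidence to (belief masses, uncertainty mass).

    Uses Dirichlet parameters alpha_k = e_k + 1, so beliefs and uncertainty sum to 1.
    """
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty


def combine_opinions(opinion_a, opinion_b):
    """Fuse two opinions with a reduced Dempster combination rule."""
    (b1, u1), (b2, u2) = opinion_a, opinion_b
    # Conflict: mass the two sources assign to different classes.
    conflict = sum(x * y for i, x in enumerate(b1)
                   for j, y in enumerate(b2) if i != j)
    scale = 1.0 - conflict
    if scale <= 0:
        raise ValueError("opinions are in total conflict")
    beliefs = [(x * y + x * u2 + y * u1) / scale for x, y in zip(b1, b2)]
    uncertainty = (u1 * u2) / scale
    return beliefs, uncertainty
```

Under this rule a backbone that reports high uncertainty contributes little to the fused belief, which is one way a fusion pattern could adapt per sample.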

[96] GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model?

Mingyu Sung,Seungjae Ham,Kangwoo Kim,Yeokyoung Yoon,Sangseok Yun,Il-Min Kim,Jae-Mo Kang

Main category: cs.CV

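TL;DR: This paper proposes GLYPH-SR, a vision-language-guided diffusion framework that jointly optimizes scene-text legibility and perceptual image quality. With an OCR-guided control network and an alternating scheduling strategy, it substantially improves text super-resolution in complex natural scenes.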

Details Motivation: Existing super-resolution methods mostly rely on traditional metrics (e.g., PSNR/SSIM) or perceptual models that are insensitive to character-level errors, and most text-SR studies are confined to simplified benchmarks of isolated characters, overlooking the complex interaction between text and background in real scenes. As a result, text is often treated as generic texture, hurting OCR and downstream tasks. Method: Proposes GLYPH-SR, comprising a Text-SR Fusion ControlNet (TS-ControlNet) guided by OCR data and a "ping-pong" scheduler that alternates between text- and scene-centric guidance; the key components are trained on a synthetic corpus while the main SR branch stays frozen, enabling targeted text restoration. Result: On SVT, SCUT-CTW1500, and CUTE80 at x4 and x8, GLYPH-SR improves OCR F1 by up to 15.18 percentage points over diffusion/GAN baselines (SVT x8, OpenOCR) while remaining competitive on perceptual quality metrics such as MANIQA, CLIP-IQA, and MUSIQ. Conclusion: GLYPH-SR meets the demands of high legibility and high visual realism at once, delivering SR that "looks right and reads right" and offering a better solution for text-heavy scenes in practice. Abstract: Image super-resolution (SR) is fundamental to many vision systems, from surveillance and autonomy to document analysis and retail analytics, because recovering high-frequency details, especially scene-text, enables reliable downstream perception. Scene-text, i.e., text embedded in natural images such as signs, product labels, and storefronts, often carries the most actionable information; when characters are blurred or hallucinated, optical character recognition (OCR) and subsequent decisions fail even if the rest of the image appears sharp. Yet previous SR research has often been tuned to distortion (PSNR/SSIM) or learned perceptual metrics (LPIPS, MANIQA, CLIP-IQA, MUSIQ) that are largely insensitive to character-level errors. Furthermore, studies that do address text SR often focus on simplified benchmarks with isolated characters, overlooking the challenges of text within complex natural scenes. As a result, scene-text is effectively treated as generic texture. For SR to be effective in practical deployments, it is therefore essential to explicitly optimize for both text legibility and perceptual quality. We present GLYPH-SR, a vision-language-guided diffusion framework that aims to achieve both objectives jointly. GLYPH-SR utilizes a Text-SR Fusion ControlNet (TS-ControlNet) guided by OCR data, and a ping-pong scheduler that alternates between text- and scene-centric guidance. To enable targeted text restoration, we train these components on a synthetic corpus while keeping the main SR branch frozen. Across SVT, SCUT-CTW1500, and CUTE80 at x4 and x8, GLYPH-SR improves OCR F1 by up to +15.18 percentage points over diffusion/GAN baselines (SVT x8, OpenOCR) while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ. GLYPH-SR is designed to satisfy both objectives simultaneously, high readability and high visual realism, delivering SR that looks right and reads right.

[97] EEG-Driven Image Reconstruction with Saliency-Guided Diffusion Models

Igor Abramov,Ilya Makarov

Main category: cs.CV

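TL;DR: Proposes a dual-conditioning framework that combines EEG embeddings with spatial saliency maps to improve the quality and semantic consistency of EEG-driven image reconstruction.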

Details Motivation: Existing EEG-driven image reconstruction methods overlook spatial attention mechanisms, limiting fidelity and semantic coherence. Method: Uses the Adaptive Thinking Mapper (ATM) to extract EEG features and fine-tunes Stable Diffusion 2.1 with LoRA, while a ControlNet branch uses saliency maps for spatially controlled generation. Result: On THINGS-EEG, the method surpasses existing methods in the quality of both low- and high-level image features and aligns better with human visual attention. Conclusion: Attentional priors help resolve the ambiguity of EEG signals; efficient adaptation of pre-trained diffusion models enables high-fidelity reconstruction and advances neural decoding for medical diagnostics and neuroadaptive interfaces. Abstract: Existing EEG-driven image reconstruction methods often overlook spatial attention mechanisms, limiting fidelity and semantic coherence. To address this, we propose a dual-conditioning framework that combines EEG embeddings with spatial saliency maps to enhance image generation. Our approach leverages the Adaptive Thinking Mapper (ATM) for EEG feature extraction and fine-tunes Stable Diffusion 2.1 via Low-Rank Adaptation (LoRA) to align neural signals with visual semantics, while a ControlNet branch conditions generation on saliency maps for spatial control. Evaluated on THINGS-EEG, our method achieves a significant improvement in the quality of low- and high-level image features over existing approaches, while also aligning strongly with human visual attention. The results demonstrate that attentional priors resolve EEG ambiguities, enabling high-fidelity reconstructions with applications in medical diagnostics and neuroadaptive interfaces, advancing neural decoding through efficient adaptation of pre-trained diffusion models.
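
For readers unfamiliar with Low-Rank Adaptation, the sketch below shows the standard LoRA update on a single weight matrix: the frozen weight W is adjusted by a scaled low-rank product, W' = W + (alpha / r) B A, where r is the rank. The toy matrices and function names are hypothetical, and the abstract does not say which Stable Diffusion layers are adapted or which rank is used.

```python
def matmul(left, right):
    """Multiply two dense matrices given as lists of rows."""
    inner = len(right)
    cols = len(right[0])
    return [[sum(row[k] * right[k][j] for k in range(inner)) for j in range(cols)]
            for row in left]


def lora_merge(weight, lora_b, lora_a, alpha):
    """Return weight + (alpha / rank) * lora_b @ lora_a.

    weight is (out x in), lora_b is (out x rank), lora_a is (rank x in).
    """
    rank = len(lora_a)
    if any(len(row) != rank for row in lora_b):
        raise ValueError("lora_b columns must match lora_a rows (the rank)")
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]
```

Only the two small factors are trained, which is what makes this kind of fine-tuning cheap relative to updating the full weight matrix.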

[98] LoCoT2V-Bench: A Benchmark for Long-Form and Complex Text-to-Video Generation

Xiangqing Zheng,Chengyue Wu,Kehai Chen,Min Zhang

Main category: cs.CV

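TL;DR: Proposes LoCoT2V-Bench, a benchmark for long video generation under complex input conditions, with realistic, complex prompts and a multi-dimensional evaluation framework; it reveals that existing models fall short in inter-event consistency, fine-grained alignment, and high-level thematic adherence.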

Details Motivation: Existing text-to-video benchmarks mostly rely on simplified prompts and low-level metrics, without evaluating fine-grained alignment with complex prompts or abstract dimensions such as narrative coherence and thematic expression. Method: Builds a set of complex prompts from real-world videos, incorporating elements such as scene transitions and event dynamics, and designs a multi-dimensional evaluation framework with new metrics including event-level alignment, fine-grained temporal consistency, content clarity, and HERD. Result: Evaluation of nine representative long video generation models shows that current methods do well on basic visual and temporal aspects but fall clearly short on inter-event consistency, fine-grained alignment, and high-level thematic expression. Conclusion: LoCoT2V-Bench provides a comprehensive, reliable platform for evaluating long video generation and points to key directions for future improvement. Abstract: Recently, text-to-video generation has made impressive progress in producing short, high-quality clips, but evaluating long-form outputs remains a major challenge, especially when processing complex prompts. Existing benchmarks mostly rely on simplified prompts and focus on low-level metrics, overlooking fine-grained alignment with prompts and abstract dimensions such as narrative coherence and thematic expression. To address these gaps, we propose LoCoT2V-Bench, a benchmark specifically designed for long video generation (LVG) under complex input conditions. Based on various real-world videos, LoCoT2V-Bench introduces a suite of realistic and complex prompts incorporating elements like scene transitions and event dynamics. Moreover, it constructs a multi-dimensional evaluation framework that includes our newly proposed metrics such as event-level alignment, fine-grained temporal consistency, content clarity, and the Human Expectation Realization Degree (HERD), which focuses on more abstract attributes like narrative flow, emotional response, and character development. Using this framework, we conduct a comprehensive evaluation of nine representative LVG models, finding that while current methods perform well on basic visual and temporal aspects, they struggle with inter-event consistency, fine-grained alignment, and high-level thematic adherence. Overall, LoCoT2V-Bench provides a comprehensive and reliable platform for evaluating long-form complex text-to-video generation and highlights critical directions for future method improvement.

[99] A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models

Shihab Aaqil Ahamed,Udaya S. K. P. Miriya Thanthrige,Ranga Rodrigo,Muhammad Haris Khan

Main category: cs.CV

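TL;DR: This paper proposes A-TPT, a new test-time prompt tuning framework that introduces angular diversity to improve the calibration of vision-language models during unsupervised task adaptation, significantly reducing calibration error while generalizing well.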

Details Motivation: Existing test-time prompt tuning methods leave textual features insufficiently dispersed, which hurts calibration and limits model reliability and safety. Method: Proposes A-TPT, which maximizes the minimum pairwise angular distance between normalized textual features on the unit hypersphere so that the prompt-induced class features are distributed uniformly in angle, increasing angular diversity. Result: Experiments across multiple backbones and datasets show that A-TPT significantly reduces average calibration error while maintaining accuracy, with superior zero-shot calibration on natural distribution shifts and medical datasets in particular. Conclusion: Promoting angular diversity improves the calibration of vision-language models during test-time adaptation, and A-TPT offers a more reliable paradigm for unsupervised prompt tuning. Abstract: Test-time prompt tuning (TPT) has emerged as a promising technique for adapting large vision-language models (VLMs) to unseen tasks without relying on labeled data. However, the lack of dispersion between textual features can hurt calibration performance, which raises concerns about VLMs' reliability, trustworthiness, and safety. Current TPT approaches primarily focus on improving prompt calibration by either maximizing average textual feature dispersion or enforcing orthogonality constraints to encourage angular separation. However, these methods may not always achieve optimal angular separation between class-wise textual features, which means they overlook the critical role of angular diversity. To address this, we propose A-TPT, a novel TPT framework that introduces angular diversity to encourage uniformity in the distribution of normalized textual features induced by corresponding learnable prompts. This uniformity is achieved by maximizing the minimum pairwise angular distance between features on the unit hypersphere. We show that our approach consistently surpasses state-of-the-art TPT methods in reducing the aggregate average calibration error while maintaining comparable accuracy through extensive experiments with various backbones on different datasets. Notably, our approach exhibits superior zero-shot calibration performance on natural distribution shifts and generalizes well to medical datasets. We provide extensive analyses, including theoretical aspects, to establish the grounding of A-TPT. These results highlight the potency of promoting angular diversity to achieve well-dispersed textual features, significantly improving VLM calibration during test-time adaptation. Our code will be made publicly available.
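
The quantity A-TPT maximizes can be written down directly. The sketch below only evaluates it, the minimum pairwise angle between L2-normalized class text features, and does not perform the prompt optimization itself. The function names and the toy feature vectors are hypothetical.

```python
import math


def l2_normalize(vector):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def min_pairwise_angle(features):
    """Smallest angle (radians) between any two normalized feature vectors."""
    if len(features) < 2:
        raise ValueError("need at least two feature vectors")
    unit = [l2_normalize(f) for f in features]
    smallest = math.pi
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cosine = sum(a * b for a, b in zip(unit[i], unit[j]))
            cosine = max(-1.0, min(1.0, cosine))  # guard against rounding drift
            smallest = min(smallest, math.acos(cosine))
    return smallest
```

A larger value means the closest pair of class features is further apart; in a tuning loop, its negative could serve as a regularization term alongside the usual test-time objective.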

[100] PointSt3R: Point Tracking through 3D Grounded Correspondence

Rhodri Guerrier,Adam W. Harley,Dima Damen

Main category: cs.CV

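TL;DR: This paper proposes PointSt3R, a point tracking method built on 3D reconstruction models such as DUSt3R and MASt3R. By combining the reconstruction loss with dynamic correspondence training and a visibility prediction head, and fine-tuning on a small amount of synthetic data, it achieves competitive or superior point tracking performance on several datasets.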

Details Motivation: Existing point tracking methods mostly depend on temporal context; this paper instead explores the potential of emerging 3D reconstruction models, which establish 2D-3D correspondence in static scenes, for point tracking, especially without temporal context. Method: Fine-tunes MASt3R with dynamic correspondence training and a visibility head, combined with the reconstruction loss; trains and evaluates only on frame pairs where one frame contains the query point, avoiding temporal information; the training data is a small amount of synthetic dynamic and static point correspondences. Result: +33.5% over CoTracker2 on EgoPoints; 73.8 δ_avg and 85.8% occlusion accuracy on TAP-Vid-DAVIS, competitive with CoTracker2; significantly better than CoTracker3 on EgoPoints and RGB-S (61.3 vs 54.2, 87.0 vs 82.8). Conclusion: By adapting advanced 3D reconstruction models and adding dynamic correspondence training, PointSt3R achieves strong point tracking without temporal context, validating 3D grounded correspondence for point tracking. Abstract: Recent advances in foundational 3D reconstruction models, such as DUSt3R and MASt3R, have shown great potential in 2D and 3D correspondence in static scenes. In this paper, we propose to adapt them for the task of point tracking through 3D grounded correspondence. We first demonstrate that these models are competitive point trackers when focusing on static points, present in current point tracking benchmarks ($+33.5\%$ on EgoPoints vs. CoTracker2). We propose to combine the reconstruction loss with training for dynamic correspondence along with a visibility head, and fine-tuning MASt3R for point tracking using a relatively small amount of synthetic data. Importantly, we only train and evaluate on pairs of frames where one contains the query point, effectively removing any temporal context. Using a mix of dynamic and static point correspondences, we achieve competitive or superior point tracking results on four datasets (e.g. competitive on TAP-Vid-DAVIS 73.8 $\delta_{avg}$ / 85.8\% occlusion acc. for PointSt3R compared to 75.7 / 88.3\% for CoTracker2; and significantly outperform CoTracker3 on EgoPoints 61.3 vs 54.2 and RGB-S 87.0 vs 82.8). We also present results on 3D point tracking along with several ablations on training datasets and percentage of dynamic correspondences.
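
The δ_avg figures quoted above follow, as commonly defined for the TAP-Vid benchmark, the fraction of visible points whose predicted position lies within a pixel threshold of the ground truth, averaged over the thresholds 1, 2, 4, 8, and 16. A minimal sketch under that reading follows; the function name and input layout are hypothetical, and it assumes coordinates are already expressed at the benchmark's evaluation resolution.

```python
import math

# Pixel thresholds averaged by the metric (assumed TAP-Vid convention).
THRESHOLDS = (1, 2, 4, 8, 16)


def delta_avg(predicted, ground_truth, visible):
    """Average, over thresholds, of the fraction of visible points within range.

    predicted and ground_truth are sequences of (x, y) pairs; visible is a
    sequence of booleans marking points that are visible in the ground truth.
    """
    errors = [math.dist(p, g)
              for p, g, v in zip(predicted, ground_truth, visible) if v]
    if not errors:
        raise ValueError("no visible points to evaluate")
    fractions = [sum(e < t for e in errors) / len(errors) for t in THRESHOLDS]
    return sum(fractions) / len(THRESHOLDS)
```

Occluded points are skipped here because the position metric is only defined where ground truth is visible; occlusion accuracy is scored separately.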

[101] Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection

Yuanting Fan,Jun Liu,Xiaochen Chen,Bin-Bin Gao,Jian Li,Yong Liu,Jinlong Peng,Chengjie Wang

Main category: cs.CV

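TL;DR: This paper proposes FineGrainedAD, a new framework for few-shot anomaly detection that improves anomaly localization through Multi-Level Fine-Grained Semantic Caption (MFSC), Multi-Level Learnable Prompt (MLLP), and Multi-Level Semantic Alignment (MLSA).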

Details Motivation: Existing few-shot anomaly detection methods based on vision-language models lack fine-grained textual descriptions, which causes semantic misalignment between image-level descriptions and local visual anomalies and hurts localization accuracy. Method: Propose MFSC to automatically generate multi-level, fine-grained textual descriptions, and design the MLLP and MLSA modules, which achieve multi-level semantic alignment through learnable prompts and a region aggregation strategy. Result: Experiments on the MVTec-AD and VisA datasets show that the method outperforms existing methods in few-shot settings and clearly improves anomaly localization. Conclusion: By introducing fine-grained semantic descriptions and multi-level alignment, FineGrainedAD mitigates semantic misalignment and achieves better localization in few-shot anomaly detection. Abstract: Few-shot anomaly detection (FSAD) methods identify anomalous regions with few known normal samples. Most existing methods rely on the generalization ability of pre-trained vision-language models (VLMs) to recognize potentially anomalous regions through feature similarity between text descriptions and images. However, due to the lack of detailed textual descriptions, these methods can only pre-define image-level descriptions to match each visual patch token to identify potential anomalous regions, which leads to the semantic misalignment between image descriptions and patch-level visual anomalies, achieving sub-optimal localization performance. To address the above issues, we propose the Multi-Level Fine-Grained Semantic Caption (MFSC) to provide multi-level and fine-grained textual descriptions for existing anomaly detection datasets with automatic construction pipeline. Based on the MFSC, we propose a novel framework named FineGrainedAD to improve anomaly localization performance, which consists of two components: Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics into multi-level learnable prompts through automatic replacement and concatenation mechanism, while MLSA designs region aggregation strategy and multi-level alignment training to facilitate learnable prompts better align with corresponding visual regions. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings on MVTec-AD and VisA datasets.

[102] Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition

Pei Peng,MingKun Xie,Hang Hao,Tong Jin,ShengJun Huang

Main category: cs.CV

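TL;DR: Proposes a lightweight method based on causal inference that synthesizes counterfactual embeddings and estimates direct effects, substantially improving the zero-shot performance of vision-language models on context-sensitive tasks without retraining.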

Details Motivation: Address object-context shortcuts in vision-language models and improve zero-shot reliability when test scenes differ from the co-occurrences seen in training. Method: Recast the problem as causal inference; estimate object and background expectations in CLIP's representation space; generate counterfactual embeddings by recombining object features with diverse contexts drawn from external datasets, batch neighbors, or text descriptions; and simulate intervention to remove background bias. Result: Substantially higher worst-group and average accuracy on context-sensitive benchmarks, setting a new zero-shot state of the art. Conclusion: The method offers a lightweight, representation-level counterfactual framework that needs no retraining and provides a practical causal route to debiased and reliable multimodal reasoning. Abstract: Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.

[103] Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing

Xin Guo,Zhiheng Xi,Yiwen Ding,Yitao Zhai,Xiaowei Shi,Xunliang Cai,Tao Gui,Qi Zhang,Xuanjing Huang

Main category: cs.CV

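TL;DR: This paper addresses the "Matthew effect" that arises during self-improvement of large vision-language models. Distribution-reshaping and trajectory-resampling strategies re-balance head and tail data and significantly improve visual reasoning.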

Details Motivation: Existing self-improvement methods optimize simple and complex queries unevenly, which makes it hard for the model to improve its complex reasoning. Method: Introduce four efficient strategies from two perspectives, distribution-reshaping and trajectory-resampling, to re-balance head and tail data during self-improvement. Result: Experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B show that the proposed methods outperform vanilla self-improvement by 3.86 points on average. Conclusion: The head-tail re-balancing strategies mitigate the Matthew effect and support continued improvement on complex visual reasoning tasks. Abstract: Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models (LVLMs), where models explore and learn from successful trajectories iteratively. However, we identify a critical issue during this process: the model excels at generating high-quality trajectories for simple queries (i.e., head data) but struggles with more complex ones (i.e., tail data). This leads to an imbalanced optimization that drives the model to prioritize simple reasoning skills, while hindering its ability to tackle more complex reasoning tasks. Over iterations, this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew effect"--which ultimately hinders further model improvement and leads to performance bottlenecks. To counteract this challenge, we introduce four efficient strategies from two perspectives: distribution-reshaping and trajectory-resampling, to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Extensive experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks demonstrate that our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.

[104] Analysis of the Robustness of an Edge Detector Based on Cellular Automata Optimized by Particle Swarm

Vinícius Ferraria,Eurico Ruivo

Main category: cs.CV

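TL;DR: Proposes an adaptable edge detection method based on a two-dimensional cellular automaton combined with meta-heuristic optimization and transfer learning, and studies the effect of expanding the search space in the optimization phase as well as the detector's adaptability on natural images and specialized subsets.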

Details Motivation: Address the weaknesses of conventional edge detectors, namely difficulty detecting loose edges and lack of context, and improve the detector's ability to adapt to different image properties. Method: Describe the adaptable detector with a two-dimensional cellular automaton, optimize it with a meta-heuristic algorithm, and combine it with transfer learning techniques. Result: Expanding the search space of the optimization phase was not effective for the chosen image set; the model adapted to its input, but transfer learning brought no significant improvement. Conclusion: Although the model adapts well, neither a larger search space nor transfer learning improved edge detection, which suggests that more effective optimization and learning strategies are still needed. Abstract: The edge detection task is essential in image processing aiming to extract relevant information from an image. One recurring problem in this task is the weaknesses found in some detectors, such as the difficulty in detecting loose edges and the lack of context to extract relevant information from specific problems. To address these weaknesses and adapt the detector to the properties of an image, an adaptable detector described by two-dimensional cellular automaton and optimized by meta-heuristic combined with transfer learning techniques was developed. This study aims to analyze the impact of expanding the search space of the optimization phase and the robustness of the adaptability of the detector in identifying edges of a set of natural images and specialized subsets extracted from the same image set. The results obtained prove that expanding the search space of the optimization phase was not effective for the chosen image set. The study also analyzed the adaptability of the model through a series of experiments and validation techniques and found that, regardless of the validation, the model was able to adapt to the input and the transfer learning techniques applied to the model showed no significant improvements.

[105] SA$^{2}$Net: Scale-Adaptive Structure-Affinity Transformation for Spine Segmentation from Ultrasound Volume Projection Imaging

Hao Xie,Zixun Huang,Yushen Zuo,Yakun Ju,Frank H. F. Leung,N. F. Law,Kin-Man Lam,Yong-Ping Zheng,Sai Ho Ling

Main category: cs.CV

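TL;DR: Proposes SA²Net, a scale-adaptive structure-aware network for spine segmentation in ultrasound images, which improves segmentation through a scale-adaptive complementary strategy and structure-affinity transformation.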

Details Motivation: Existing methods struggle to learn the high spatial correlation among spine bone features and their structural information, which limits segmentation quality. Method: Design a scale-adaptive complementary strategy to capture cross-dimensional long-distance correlations; combine structure-affinity transformation with a Transformer decoder for structure-aware reasoning; and adopt a feature mixing loss aggregation method to improve training. Result: Experiments show that SA²Net outperforms state-of-the-art methods on spine segmentation, works with a variety of backbones, and is robust and accurate. Conclusion: SA²Net improves the segmentation quality of ultrasound spine images and has potential for intelligent scoliosis diagnosis. Abstract: Spine segmentation, based on ultrasound volume projection imaging (VPI), plays a vital role for intelligent scoliosis diagnosis in clinical applications. However, this task faces several significant challenges. Firstly, the global contextual knowledge of spines may not be well-learned if we neglect the high spatial correlation of different bone features. Secondly, the spine bones contain rich structural knowledge regarding their shapes and positions, which deserves to be encoded into the segmentation process. To address these challenges, we propose a novel scale-adaptive structure-aware network (SA$^{2}$Net) for effective spine segmentation. First, we propose a scale-adaptive complementary strategy to learn the cross-dimensional long-distance correlation features for spinal images. Second, motivated by the consistency between multi-head self-attention in Transformers and semantic level affinity, we propose structure-affinity transformation to transform semantic features with class-specific affinity and combine it with a Transformer decoder for structure-aware reasoning. In addition, we adopt a feature mixing loss aggregation method to enhance model training. This method improves the robustness and accuracy of the segmentation process. The experimental results demonstrate that our SA$^{2}$Net achieves superior segmentation performance compared to other state-of-the-art methods. Moreover, the adaptability of SA$^{2}$Net to various backbones enhances its potential as a promising tool for advanced scoliosis diagnosis using intelligent spinal image analysis. The code and experimental demo are available at https://github.com/taetiseo09/SA2Net.

[106] AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping

Wen Xie,Yanjun Zhu,Gijs Overgoor,Yakov Bart,Agata Lapedriza Garcia,Sarah Ostadabbas

Main category: cs.CV

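TL;DR: This paper proposes an automated video ad clipping framework based on an audio-visual fusion model. It is the first to frame ad clipping as a shot selection problem and emphasizes the important role of audio in advertising.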

Details Motivation: Traditional ad clipping relies on manual re-editing, which is time-consuming and labor-intensive, and automated methods tailored to advertising are lacking. Method: Propose a two-stream audio-visual fusion model that predicts frame importance to clip ads automatically, and build the ad-specific dataset AdSum204 for training and evaluation. Result: Experiments show that the model outperforms state-of-the-art methods on Average Precision, Area Under Curve, Spearman, and Kendall metrics. Conclusion: The proposed method addresses automated ad video clipping, highlights the importance of audio, and offers an efficient automated option for ad production. Abstract: Advertisers commonly need multiple versions of the same advertisement (ad) at varying durations for a single campaign. The traditional approach involves manually selecting and re-editing shots from longer video ads to create shorter versions, which is labor-intensive and time-consuming. In this paper, we introduce a framework for automated video ad clipping using video summarization techniques. We are the first to frame video clipping as a shot selection problem, tailored specifically for advertising. Unlike existing general video summarization methods that primarily focus on visual content, our approach emphasizes the critical role of audio in advertising. To achieve this, we develop a two-stream audio-visual fusion model that predicts the importance of video frames, where importance is defined as the likelihood of a frame being selected in the firm-produced short ad. To address the lack of ad-specific datasets, we present AdSum204, a novel dataset comprising 102 pairs of 30-second and 15-second ads from real advertising campaigns. Extensive experiments demonstrate that our model outperforms state-of-the-art methods across various metrics, including Average Precision, Area Under Curve, Spearman, and Kendall.
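
The Spearman and Kendall metrics named above are rank correlations between predicted frame importance and a reference importance signal. The sketch below shows how they can be computed with the standard library only. It is an illustration under assumptions, not the paper's evaluation code: the function names are hypothetical, Spearman uses the tie-free formula, and Kendall is the tau-a variant, whereas real frame-selection labels usually contain ties and would call for tie-aware variants.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rho via the tie-free formula 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


if __name__ == "__main__":
    predicted = [0.1, 0.4, 0.35, 0.8]  # hypothetical per-frame importance scores
    reference = [0.2, 0.3, 0.5, 0.9]   # hypothetical reference importance
    print(spearman(predicted, reference), kendall(predicted, reference))
```

Both values lie in [-1, 1]; a score near 1 means the model orders frames almost exactly as the reference does.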

[107] Dynamic Context-Aware Scene Reasoning Using Vision-Language Alignment in Zero-Shot Real-World Scenarios

Manjunath Prasad Holenarasipura Rajiv,B. M. Vidyavathi

Main category: cs.CV

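TL;DR: Proposes a dynamic context-aware scene reasoning framework that uses vision-language alignment for zero-shot real-world scene understanding, and reports substantially higher scene understanding accuracy in complex and unseen environments on several benchmarks.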

Details Motivation: Existing scene understanding models generalize poorly to unseen scenarios without labeled data, which limits the deployment of vision applications in dynamic real-world environments. Method: Combine pre-trained vision transformers and large language models through vision-language alignment, and use a dynamic reasoning module that fuses global scene cues with object-level interactions, guided by linguistic priors. Result: Up to 18% higher scene understanding accuracy on zero-shot benchmarks such as COCO, Visual Genome, and Open Images, with more robust behavior in ambiguous or cluttered scenes. Conclusion: The framework provides a scalable and interpretable approach to context-aware reasoning and advances zero-shot generalization in dynamic real-world settings. Abstract: In real-world environments, AI systems often face unfamiliar scenarios without labeled data, creating a major challenge for conventional scene understanding models. The inability to generalize across unseen contexts limits the deployment of vision-based applications in dynamic, unstructured settings. This work introduces a Dynamic Context-Aware Scene Reasoning framework that leverages Vision-Language Alignment to address zero-shot real-world scenarios. The goal is to enable intelligent systems to infer and adapt to new environments without prior task-specific training. The proposed approach integrates pre-trained vision transformers and large language models to align visual semantics with natural language descriptions, enhancing contextual comprehension. A dynamic reasoning module refines predictions by combining global scene cues and object-level interactions guided by linguistic priors. Extensive experiments on zero-shot benchmarks such as COCO, Visual Genome, and Open Images demonstrate up to 18% improvement in scene understanding accuracy over baseline models in complex and unseen environments. Results also show robust performance in ambiguous or cluttered scenes due to the synergistic fusion of vision and language. This framework offers a scalable and interpretable approach for context-aware reasoning, advancing zero-shot generalization in dynamic real-world settings.

[108] CATCH: A Modular Cross-domain Adaptive Template with Hook

Xinjin Li,Yulie Lu,Jinghan Cao,Yu Ma,Zhenglin Li,Yeyang Zhou

Main category: cs.CV

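TL;DR: This paper proposes CATCH, a plug-and-play cross-domain adaptation framework for visual question answering (VQA). It decouples visual and linguistic adaptation and introduces lightweight modules that enable efficient multi-domain transfer without retraining the backbone model.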

Details Motivation: Existing VQA models generalize poorly to cross-domain scenarios such as remote sensing, medical imaging, and math diagrams, and they depend on costly domain-specific fine-tuning that is neither scalable nor flexible. Method: Propose the CATCH framework, which comprises a domain classifier and a dual adapter mechanism (a Prompt Adapter for language modulation and a Visual Adapter for visual feature adjustment), injected dynamically through a unified hook interface without modifying or retraining the backbone. Result: Consistent gains on four domain-specific VQA benchmarks without retraining the backbone, including +2.3 BLEU on MathVQA, +2.6 VQA score on MedVQA-RAD, and +3.1 ROUGE on ChartQA. Conclusion: CATCH provides a scalable and extensible solution for multi-domain VQA and supports practical deployment across diverse application scenarios. Abstract: Recent advances in Visual Question Answering (VQA) have demonstrated impressive performance in natural image domains, with models like LLaVA leveraging large language models (LLMs) for open-ended reasoning. However, their generalization degrades significantly when transferred to out-of-domain scenarios such as remote sensing, medical imaging, or math diagrams, due to large distributional shifts and the lack of effective domain adaptation mechanisms. Existing approaches typically rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and not scalable across diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for cross-domain adaptation that improves the generalization of VQA models while requiring minimal changes to their core architecture. Our key idea is to decouple visual and linguistic adaptation by introducing two lightweight modules: a domain classifier to identify the input image type, and a dual adapter mechanism comprising a Prompt Adapter for language modulation and a Visual Adapter for vision feature adjustment. Both modules are dynamically injected via a unified hook interface, requiring no retraining of the backbone model. Experimental results across four domain-specific VQA benchmarks demonstrate that our framework achieves consistent performance gains without retraining the backbone model, including +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA. These results highlight that CATCH provides a scalable and extensible approach to multi-domain VQA, enabling practical deployment across diverse application domains.

[109] Emu3.5: Native Multimodal Models are World Learners

Yufeng Cui,Honghao Chen,Haoge Deng,Xu Huang,Xinghang Li,Jirong Liu,Yang Liu,Zhuoyan Luo,Jinsheng Wang,Wenxuan Wang,Yueze Wang,Chengyuan Wang,Fan Zhang,Yingli Zhao,Ting Pan,Xianduo Li,Zecheng Hao,Wenxuan Ma,Zhuo Chen,Yulong Ao,Tiejun Huang,Zhongyuan Wang,Xinlong Wang

Main category: cs.CV

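TL;DR: Emu3.5 is a large-scale multimodal world model pre-trained end-to-end with a unified next-token prediction objective on more than 10 trillion tokens of interleaved vision-language data. It natively predicts the next state across vision and language, uses Discrete Diffusion Adaptation (DiDA) to speed up inference, and shows strong multimodal generation and reasoning capabilities.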

Details Motivation: Build a world model with native multimodal prediction and efficient inference that can handle complex interleaved vision-language inputs and generate coherent multimodal outputs. Method: Pre-train end-to-end with a unified next-token prediction objective on large-scale interleaved data built from video frames and transcripts; introduce Discrete Diffusion Adaptation (DiDA) for parallel inference; and post-train with large-scale reinforcement learning to strengthen multimodal reasoning. Result: Emu3.5 performs strongly on image generation, editing, and interleaved generation tasks, with per-image inference about 20x faster. It supports long-horizon generation, X2I generation, text-rich image generation, spatiotemporally consistent world exploration, and embodied manipulation, with performance comparable to Gemini 2.5 Flash Image (Nano Banana). Conclusion: Emu3.5 is an efficient and capable native multimodal world model with broad potential in multimodal generation and reasoning, and it has been open-sourced to support community research. Abstract: We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.

[110] ResMatching: Noise-Resilient Computational Super-Resolution via Guided Conditional Flow Matching

Anirban Ray,Vera Galinova,Florian Jug

Main category: cs.CV

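TL;DR: This paper proposes ResMatching, a new computational super-resolution (CSR) method that uses guided conditional flow matching to learn stronger data priors. For fluorescence microscopy image reconstruction it achieves a strong trade-off between data fidelity and perceptual realism and provides pixel-wise uncertainty estimates.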

Details Motivation: CSR is an ill-posed problem, and traditional methods are limited by how expressive their priors are; advances in machine learning make it possible to learn stronger priors from data and improve reconstruction quality. Method: Use guided conditional flow matching to learn improved data priors, generate high-resolution images by sampling from the implicitly learned posterior, and estimate pixel-wise uncertainty. Result: Compared against 7 baselines on 4 biological structures from the BioSR dataset, ResMatching gives the best trade-off between data fidelity and perceptual realism, is particularly effective when the low-resolution inputs are noisy, and yields a calibrated posterior that supports reliable uncertainty estimates. Conclusion: ResMatching improves super-resolution reconstruction of fluorescence micrographs in a data-driven way, combines strong performance with reliability, and suits noisy real-world biological imaging. Abstract: Computational Super-Resolution (CSR) in fluorescence microscopy has, despite being an ill-posed problem, a long history. At its very core, CSR is about finding a prior that can be used to extrapolate frequencies in a micrograph that have never been imaged by the image-generating microscope. It stands to reason that, with the advent of better data-driven machine learning techniques, stronger prior can be learned and hence CSR can lead to better results. Here, we present ResMatching, a novel CSR method that uses guided conditional flow matching to learn such improved data-priors. We evaluate ResMatching on 4 diverse biological structures from the BioSR dataset and compare its results against 7 baselines. ResMatching consistently achieves competitive results, demonstrating in all cases the best trade-off between data fidelity and perceptual realism. We observe that CSR using ResMatching is particularly effective in cases where a strong prior is hard to learn, e.g. when the given low-resolution images contain a lot of noise. Additionally, we show that ResMatching can be used to sample from an implicitly learned posterior distribution and that this distribution is calibrated for all tested use-cases, enabling our method to deliver a pixel-wise data-uncertainty term that can guide future users to reject uncertain predictions.

[111] CYPRESS: Crop Yield Prediction via Regression on Prithvi's Encoder for Satellite Sensing

Shayan Nejadshamsi,Yuanyuan Zhang,Shadi Zaki,Brock Porth,Lysa Porth,Vahab Khoshdel

Main category: cs.CV

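TL;DR: This paper proposes CYPRESS, a deep learning model for high-resolution, intra-field canola yield prediction. By fine-tuning the large geospatial foundation model Prithvi-EO-2.0-600M, it produces continuous pixel-level yield predictions from multi-temporal satellite imagery and outperforms existing methods on data from the Canadian Prairies.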

Details Motivation: Accurate and timely crop yield prediction is crucial for global food security and modern agricultural management, and traditional methods often lack the scalability and granularity that precision farming requires. Method: CYPRESS takes the pre-trained large-scale geospatial foundation model Prithvi-EO-2.0-600M and adapts it to a continuous regression task, transforming multi-temporal satellite imagery into dense, pixel-level yield maps. Result: On a large dataset from the Canadian Prairies, CYPRESS outperforms existing deep learning-based yield prediction models and produces high-resolution, continuous yield maps. Conclusion: CYPRESS shows that fine-tuning foundation models is effective for specialized agricultural applications, bridges the gap between large-scale Earth observation and on-farm decision-making, and offers a scalable solution for detailed agricultural monitoring. Abstract: Accurate and timely crop yield prediction is crucial for global food security and modern agricultural management. Traditional methods often lack the scalability and granularity required for precision farming. This paper introduces CYPRESS (Crop Yield Prediction via Regression on Prithvi's Encoder for Satellite Sensing), a deep learning model designed for high-resolution, intra-field canola yield prediction. CYPRESS leverages a pre-trained, large-scale geospatial foundation model (Prithvi-EO-2.0-600M) and adapts it for a continuous regression task, transforming multi-temporal satellite imagery into dense, pixel-level yield maps. Evaluated on a comprehensive dataset from the Canadian Prairies, CYPRESS demonstrates superior performance over existing deep learning-based yield prediction models, highlighting the effectiveness of fine-tuning foundation models for specialized agricultural applications. By providing a continuous, high-resolution output, CYPRESS offers a more actionable tool for precision agriculture than conventional classification or county-level aggregation methods. This work validates a novel approach that bridges the gap between large-scale Earth observation and on-farm decision-making, offering a scalable solution for detailed agricultural monitoring.

[112] Spiking Patches: Asynchronous, Sparse, and Efficient Tokens for Event Cameras

Christoffer Koo Øhrstrøm,Ronja Güldenring,Lazaros Nalpantidis

Main category: cs.CV

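TL;DR: Proposes Spiking Patches, an event tokenization method designed for event cameras that preserves the asynchrony and spatial sparsity of the event stream, giving faster inference on gesture recognition and object detection while matching or improving accuracy.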

Details Motivation: Existing event representations such as frames or voxels turn asynchronous, sparse event data into synchronous data with reduced sparsity, losing the distinctive advantages of event cameras, so a representation that preserves these properties is needed. Method: Propose the Spiking Patches tokenizer, which divides the event stream into local spatiotemporal "spiking patches" used as tokens, and evaluate it on downstream tasks with a GNN, a PCN, and a Transformer. Result: Inference is up to 3.4x faster than with voxel-based tokens and up to 10.4x faster than with frames, with absolute accuracy gains of up to 3.8 on gesture recognition and up to 1.4 on object detection. Conclusion: Event tokenization is a new direction for event-based vision; Spiking Patches preserves the asynchronous and sparse nature of event cameras, performs well in both efficiency and accuracy, and advances event-driven methods. Abstract: We propose tokenization of events and present a tokenizer, Spiking Patches, specifically designed for event cameras. Given a stream of asynchronous and spatially sparse events, our goal is to discover an event representation that preserves these properties. Prior works have represented events as frames or as voxels. However, while these representations yield high accuracy, both frames and voxels are synchronous and decrease the spatial sparsity. Spiking Patches gives the means to preserve the unique properties of event cameras and we show in our experiments that this comes without sacrificing accuracy. We evaluate our tokenizer using a GNN, PCN, and a Transformer on gesture recognition and object detection. Tokens from Spiking Patches yield inference times that are up to 3.4x faster than voxel-based tokens and up to 10.4x faster than frames. We achieve this while matching their accuracy and even surpassing in some cases with absolute improvements up to 3.8 for gesture recognition and up to 1.4 for object detection. Thus, tokenization constitutes a novel direction in event-based vision and marks a step towards methods that preserve the properties of event cameras.

[113] PT-DETR: Small Target Detection Based on Partially-Aware Detail Focus

Bingcong Huo,Zhiming Wang

Main category: cs.CV

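TL;DR: This paper proposes PT-DETR, a new RT-DETR-based small-object detection algorithm for UAV imagery. It introduces the PADF module, the MFFF module, and the Focaler-SIoU loss to improve small-object detection accuracy and robustness against complex backgrounds, and achieves higher mAP with lower computational complexity on the VisDrone2019 dataset.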

Details Motivation: Small-object detection in UAV imagery faces complex backgrounds, severe occlusion, dense small objects, and varying lighting, and existing methods fall short in feature extraction and localization accuracy. Method: Building on RT-DETR, design the Partially-Aware Detail Focus (PADF) module to strengthen small-object feature extraction, propose the Median-Frequency Feature Fusion (MFFF) module to better capture contextual information, and introduce the Focaler-SIoU loss to improve bounding box matching and sensitivity to small-object features. Result: On the VisDrone2019 dataset, PT-DETR improves mAP over RT-DETR by 1.6% and 1.7%, with lower computational complexity and fewer parameters. Conclusion: PT-DETR is robust and practical for small-object detection and suits resource-constrained UAV platforms. Abstract: To address the challenges in UAV object detection, such as complex backgrounds, severe occlusion, dense small objects, and varying lighting conditions, this paper proposes PT-DETR based on RT-DETR, a novel detection algorithm specifically designed for small objects in UAV imagery. In the backbone network, we introduce the Partially-Aware Detail Focus (PADF) Module to enhance feature extraction for small objects. Additionally, we design the Median-Frequency Feature Fusion (MFFF) module, which effectively improves the model's ability to capture small-object details and contextual information. Furthermore, we incorporate Focaler-SIoU to strengthen the model's bounding box matching capability and increase its sensitivity to small-object features, thereby further enhancing detection accuracy and robustness. Compared with RT-DETR, our PT-DETR achieves mAP improvements of 1.6% and 1.7% on the VisDrone2019 dataset with lower computational complexity and fewer parameters, demonstrating its robustness and feasibility for small-object detection tasks.

[114] All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles

Sayed Pedram Haeri Boroujeni,Niloufar Mehrabi,Hazim Alzorgan,Ahmad Sarlak,Mahlagha Fazeli,Abolfazl Razi

Main category: cs.CV

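TL;DR: This survey reviews recent progress in object detection for autonomous vehicles, focusing on emerging paradigms in multimodal perception, contextual reasoning, and cooperative intelligence, such as vision-language models (VLMs), large language models (LLMs), and generative AI. It offers a systematic analysis of sensor fusion, dataset categorization, and transformer-based detection methods, together with future directions.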

Details Motivation: The success of autonomous vehicles depends on reliable object detection in complex multimodal environments, but current knowledge is fragmented and poorly integrated, so a systematic survey is needed to organize emerging techniques and move the field forward. Method: Systematically review autonomous vehicle sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies; propose a dataset categorization covering ego-vehicle, infrastructure-based, and cooperative datasets (V2V, V2I, V2X, I2I); and analyze advanced detection methods from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging approaches based on Vision Transformers, large language models, and vision-language models. Result: The paper builds a comprehensive framework for object detection in autonomous driving, clarifies the capabilities and limitations of existing techniques, identifies key challenges in multimodal fusion, contextual understanding, and cooperative perception, and summarizes new trends in perception driven by generative AI and large models. Conclusion: The survey provides a clear technical roadmap for object detection in autonomous driving, highlights the potential of combining large models with multi-sensor fusion, and points to opportunities in intelligent and cooperative perception. Abstract: Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability, reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics. Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.

[115] Towards Reliable Sea Ice Drift Estimation in the Arctic Deep Learning Optical Flow on RADARSAT-2

Daniela Martin,Joseph Gallego

Main category: cs.CV

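TL;DR: This paper presents the first large-scale benchmark of 48 deep learning optical flow models on RADARSAT 2 sea ice imagery. The results show that these models estimate sea ice drift with high accuracy, outperform classical methods, and provide spatially continuous motion fields for Arctic navigation and climate modeling.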

Details Motivation: Optical flow has advanced rapidly in computer vision, but its application to geophysical problems and satellite SAR imagery remains underexplored; classical methods are constrained by their mathematical models and motion assumptions and struggle in complex scenes, which motivates testing deep learning methods for sea ice drift estimation. Method: Benchmark 48 deep learning optical flow models on RADARSAT 2 ScanSAR sea ice imagery, using endpoint error (EPE) and Fl all metrics with GNSS-tracked buoy data as ground truth. Result: Several models reach sub-kilometer accuracy (EPE of 6 to 8 pixels, or 300 to 400 m), capture consistent regional drift patterns, clearly outperform classical methods, and produce spatially continuous drift fields. Conclusion: Deep learning optical flow transfers effectively to sea ice drift estimation in polar remote sensing and provides accurate, spatially continuous motion fields with broad potential for Arctic navigation and climate modeling. Abstract: Accurate estimation of sea ice drift is critical for Arctic navigation, climate research, and operational forecasting. While optical flow, a computer vision technique for estimating pixel wise motion between consecutive images, has advanced rapidly in computer vision, its applicability to geophysical problems and to satellite SAR imagery remains underexplored. Classical optical flow methods rely on mathematical models and strong assumptions about motion, which limit their accuracy in complex scenarios. Recent deep learning based approaches have substantially improved performance and are now the standard in computer vision, motivating their application to sea ice drift estimation. We present the first large scale benchmark of 48 deep learning optical flow models on RADARSAT 2 ScanSAR sea ice imagery, evaluated with endpoint error (EPE) and Fl all metrics against GNSS tracked buoys. Several models achieve sub kilometer accuracy (EPE 6 to 8 pixels, 300 to 400 m), a small error relative to the spatial scales of sea ice motion and typical navigation requirements in the Arctic. Our results demonstrate that the models are capable of capturing consistent regional drift patterns and that recent deep learning based optical flow methods, which have substantially improved motion estimation accuracy compared to classical methods, can be effectively transferred to polar remote sensing. Optical flow produces spatially continuous drift fields, providing motion estimates for every image pixel rather than at sparse buoy locations, offering new opportunities for navigation and climate modeling.
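
Endpoint error (EPE) is the mean Euclidean distance between predicted and reference displacement vectors, here evaluated at buoy locations. The sketch below computes it and converts pixels to meters. The 50 m pixel spacing is an assumption inferred from the abstract's "6 to 8 pixels, 300 to 400 m", and the function names are hypothetical; this is not the authors' evaluation code.

```python
import math

# Assumed pixel spacing in meters, inferred from "6 to 8 pixels, 300 to 400 m".
PIXEL_SPACING_M = 50.0


def endpoint_error(pred_flow, ref_flow):
    """Mean Euclidean distance (in pixels) between paired displacement vectors."""
    if not pred_flow or len(pred_flow) != len(ref_flow):
        raise ValueError("flows must be non-empty and of equal length")
    total = sum(math.dist(p, r) for p, r in zip(pred_flow, ref_flow))
    return total / len(pred_flow)


def pixels_to_meters(epe_px, spacing_m=PIXEL_SPACING_M):
    """Convert an error in pixels to meters under the assumed pixel spacing."""
    return epe_px * spacing_m


if __name__ == "__main__":
    predicted = [(3.0, 4.0), (0.0, 0.0)]  # hypothetical model displacements (dx, dy)
    buoys = [(0.0, 0.0), (6.0, 8.0)]      # hypothetical buoy-derived displacements
    epe = endpoint_error(predicted, buoys)
    print(epe, pixels_to_meters(epe))
```

Because buoys are sparse, an EPE computed this way samples the dense flow field only at buoy positions.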

[116] Improving Classification of Occluded Objects through Scene Context

Courtney M. King,Daniel D. Leeds,Damian Lyons,George Kalaitzis

Main category: cs.CV

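TL;DR: This paper proposes two scene-based information fusion methods that make Region Proposal Network-Deep Convolutional Neural Network (RPN-DCNN) object detection more robust to occlusion; experiments show gains in both recall and precision over baseline methods.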

Details Motivation: Occlusion challenges existing object recognition algorithms, and scene context can help reduce the errors it causes, so scene information is introduced to improve detection. Method: Propose two scene fusion strategies: one operates before prediction and selects a custom object network based on the identified background scene; the other operates after detection and fuses scene knowledge into the initial object scores output by the RPN. Result: On challenging datasets with partial occlusion, both recall and precision improve over baseline methods; training on a combination of occluded and unoccluded images works best. Conclusion: The method is interpretable and easy to adapt to other datasets, offering a workable way to handle occlusion and directions for future research. Abstract: The presence of occlusions has provided substantial challenges to typically-powerful object recognition algorithms. Additional sources of information can be extremely valuable to reduce errors caused by occlusions. Scene context is known to aid in object recognition in biological vision. In this work, we attempt to add robustness into existing Region Proposal Network-Deep Convolutional Neural Network (RPN-DCNN) object detection networks through two distinct scene-based information fusion techniques. We present one algorithm under each methodology: the first operates prior to prediction, selecting a custom object network to use based on the identified background scene, and the second operates after detection, fusing scene knowledge into initial object scores output by the RPN. We demonstrate our algorithms on challenging datasets featuring partial occlusions, which show overall improvement in both recall and precision against baseline methods. In addition, our experiments contrast multiple training methodologies for occlusion handling, finding that training on a combination of both occluded and unoccluded images demonstrates an improvement over the others. Our method is interpretable and can easily be adapted to other datasets, offering many future directions for research and practical applications.
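
To make the second strategy concrete, here is one way post-detection fusion could look: the detector's class scores are re-weighted by a scene-conditioned prior P(class | scene) and renormalized. This is a hypothetical sketch. The abstract does not state the fusion rule, so the multiplicative form, the `weight` exponent, the `floor` value, and all names are assumptions for illustration only.

```python
def fuse_scene_prior(object_scores, scene_prior, weight=1.0, floor=1e-6):
    """Re-weight class scores by a scene-conditioned prior and renormalize.

    object_scores: dict mapping class name to the detector's initial score.
    scene_prior: dict mapping class name to an assumed P(class | scene).
    weight: exponent controlling how strongly the prior influences the result.
    floor: prior used for classes missing from scene_prior (avoids zeroing out).
    """
    fused = {
        cls: score * max(scene_prior.get(cls, floor), floor) ** weight
        for cls, score in object_scores.items()
    }
    total = sum(fused.values())
    if total == 0:
        raise ValueError("all fused scores are zero")
    return {cls: value / total for cls, value in fused.items()}


if __name__ == "__main__":
    # Hypothetical numbers: an occluded object scored ambiguously by the detector,
    # observed in a scene recognized as a kitchen.
    scores = {"mug": 0.4, "boat": 0.6}
    kitchen_prior = {"mug": 0.9, "boat": 0.1}
    print(fuse_scene_prior(scores, kitchen_prior))
```

With `weight=0` the prior is ignored and the original scores are only renormalized, which gives a simple baseline for checking how much the scene term changes a decision.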

[117] Process Integrated Computer Vision for Real-Time Failure Prediction in Steel Rolling Mill

Vaibhav Kurrey,Sivakalyan Pujari,Gagan Raj Gupta

Main category: cs.CV

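TL;DR: Presents a machine vision-based anomaly detection system for failure prediction in a steel rolling mill. Industrial cameras and deep learning models monitor equipment operation in real time, and sensor data is combined with visual input to provide early failure warning and root-cause analysis, improving the reliability and productivity of the manufacturing system.

Details Motivation: To reduce the cost of unplanned downtime in steel rolling mills, an efficient, scalable monitoring system is needed that can predict equipment failures in advance and locate their root causes. Method: Industrial cameras are deployed along the production line to capture live video of equipment operation and hot bar motion; a centralized video server runs deep learning models for real-time inference, and the results are analyzed jointly with sensor data from data acquisition systems to identify anomalies and predict failures. Result: The system provides early prediction of equipment failures and process interruptions, pinpoints failure locations, and infers probable root causes, significantly reducing unplanned downtime and maintenance costs; it places little computational load on the PLC systems and supports scalable deployment across production lines. Conclusion: This anomaly detection system, based on machine vision and multi-source data fusion, offers a reliable and scalable failure prediction solution for industrial manufacturing environments, improving operational reliability, production efficiency, and profitability. Abstract: We present a long-term deployment study of a machine vision-based anomaly detection system for failure prediction in a steel rolling mill. The system integrates industrial cameras to monitor equipment operation, alignment, and hot bar motion in real time along the process line. Live video streams are processed on a centralized video server using deep learning models, enabling early prediction of equipment failures and process interruptions, thereby reducing unplanned breakdown costs. Server-based inference minimizes the computational load on industrial process control systems (PLCs), supporting scalable deployment across production lines with minimal additional resources. By jointly analyzing sensor data from data acquisition systems and visual inputs, the system identifies the location and probable root causes of failures, providing actionable insights for proactive maintenance. This integrated approach enhances operational reliability, productivity, and profitability in industrial manufacturing environments.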

[118] Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Ziyu Guo,Xinyan Chen,Renrui Zhang,Ruichuan An,Yu Qi,Dongzhi Jiang,Xiangtai Li,Manyuan Zhang,Hongsheng Li,Pheng-Ann Heng

Main category: cs.CV

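TL;DR: This paper studies the zero-shot visual reasoning ability of current video generation models such as Veo-3 and introduces the MME-CoF benchmark, which systematically evaluates reasoning across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic. The results show that although the models do well on short-horizon spatial coherence and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. They are not yet reliable standalone zero-shot reasoners, but can usefully complement dedicated reasoning models.

Details Motivation: To explore whether video generation models have zero-shot visual reasoning ability, and in particular their potential and limits in challenging visual reasoning scenarios. Method: Builds the MME-CoF benchmark covering 12 reasoning dimensions and uses Chain-of-Frame (CoF) evaluation in an empirical study of leading video models such as Veo-3. Result: Current video models show some reasoning ability on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, but perform poorly on long-horizon causal reasoning, strict geometric constraints, and abstract logic tasks. Conclusion: Existing video models are not yet adequate as reliable standalone zero-shot reasoners, but they can serve as auxiliary visual engines in specific scenarios, and pairing them with dedicated reasoning models is promising. Abstract: Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io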

[119] The Impact and Outlook of 3D Gaussian Splatting

Bernhard Kerbl

Main category: cs.CV

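TL;DR: This paper surveys the major advances in 3D scene representation since the introduction of 3D Gaussian Splatting (3DGS), covering efficiency improvements, extensions to dynamic representations, deeper mathematical foundations, and applications on mobile and virtual reality platforms.

Details Motivation: The introduction of 3DGS reshaped research on 3D scene representation and inspired a large body of follow-up work, so its directions of development and key technical advances need to be reviewed systematically. Method: Summarizes and analyzes the key directions of 3DGS research: resource-efficient training and rendering, the evolution toward dynamic (4DGS) representations, the mathematical foundations of appearance modeling and rendering, applications on mobile and VR platforms, extension to massive-scale scenes, and fast radiance field reconstruction via feed-forward or distributed computation. Result: 3DGS has made notable progress in efficiency, dynamic modeling, scalability, and practicality, and is becoming a foundational tool for 3D vision and graphics. Conclusion: 3DGS has grown from a breakthrough technique into a core framework driving multiple directions in 3D vision and graphics, with broad application prospects and research value. Abstract: Since its introduction, 3D Gaussian Splatting (3DGS) has rapidly transformed the landscape of 3D scene representations, inspiring an extensive body of associated research. Follow-up work includes analyses and contributions that enhance the efficiency, scalability, and real-world applicability of 3DGS. In this summary, we present an overview of several key directions that have emerged in the wake of 3DGS. We highlight advances enabling resource-efficient training and rendering, the evolution toward dynamic (or four-dimensional, 4DGS) representations, and deeper exploration of the mathematical foundations underlying its appearance modeling and rendering process. Furthermore, we examine efforts to bring 3DGS to mobile and virtual reality platforms, its extension to massive-scale environments, and recent progress toward near-instant radiance field reconstruction via feed-forward or distributed computation. Collectively, these developments illustrate how 3DGS has evolved from a breakthrough representation into a versatile and foundational tool for 3D vision and graphics.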

[120] SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models

Anushka Sivakumar,Andrew Zhang,Zaber Hakim,Chris Thomas

Main category: cs.CV

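TL;DR: This paper proposes SteerVLM, a lightweight steering module for vision-language models (VLMs) that modulates the activations linking the language and image modalities to give fine-grained inference-time control without modifying the original model weights.

Details Motivation: To improve how well VLMs follow instructions and to reduce hallucinated outputs without fine-tuning model weights, an efficient, dynamic way of steering model behavior is needed. Method: Learns from the latent embeddings of paired prompts (target and converse behaviors) to dynamically adjust the activations connecting language with image context; uses dimension-wise activation modulation and adaptive steering across layers to achieve inference-time control without static vectors or manual intervention. Result: The method learns parameters equal to only 0.14% of the original VLM's size, outperforms existing intervention techniques on several steering and hallucination mitigation benchmarks, and preserves performance on off-target tasks. Conclusion: SteerVLM offers an efficient, flexible approach to VLM control and, together with the proposed VNIA dataset, advances steering techniques for multimodal models. Abstract: This work introduces SteerVLM, a lightweight steering module designed to guide Vision-Language Models (VLMs) towards outputs that better adhere to desired instructions. Our approach learns from the latent embeddings of paired prompts encoding target and converse behaviors to dynamically adjust activations connecting the language modality with image context. This allows for fine-grained, inference-time control over complex output semantics without modifying model weights while preserving performance on off-target tasks. Our steering module requires learning parameters equal to 0.14% of the original VLM's size. Our steering module gains model control through dimension-wise activation modulation and adaptive steering across layers without requiring pre-extracted static vectors or manual tuning of intervention points. Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a multimodal dataset specifically created to facilitate the development and evaluation of VLM steering techniques. Our method outperforms existing intervention techniques on steering and hallucination mitigation benchmarks for VLMs and proposes a robust solution for multimodal model control through activation engineering.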

[121] Surpassing state of the art on AMD area estimation from RGB fundus images through careful selection of U-Net architectures and loss functions for class imbalance

Valentyna Starodub,Mantas Lukoševičius

Main category: cs.CV

TL;DR: Building on the U-Net architecture, this study improves the model structure and training pipeline to perform multi-class semantic segmentation of age-related macular degeneration (AMD) lesions in RGB fundus images, outperforming all prior ADAM challenge submissions.

Details Motivation: AMD is one of the leading causes of vision impairment in older adults, and a non-invasive, low-cost method for accurate lesion detection is needed. Existing methods still leave room for improvement on multi-class lesion segmentation. Method: Using U-Net as the base framework, compares several improvement strategies, including pre-processing techniques, encoder backbones of varying complexity, and specialized loss functions that mitigate class imbalance at the pixel and image levels. Result: The final framework configuration achieves state-of-the-art multi-class AMD lesion segmentation on the ADAM challenge dataset, surpassing all previous submissions. Conclusion: The proposed improvements substantially raise the accuracy of AMD lesion segmentation in RGB fundus images and provide an effective, reproducible technical path for clinical decision support. Abstract: Age-related macular degeneration (AMD) is one of the leading causes of irreversible vision impairment in people over the age of 60. This research focuses on semantic segmentation for AMD lesion detection in RGB fundus images, a non-invasive and cost-effective imaging technique. The results of the ADAM challenge - the most comprehensive AMD detection from RGB fundus images research competition and open dataset to date - serve as a benchmark for our evaluation. Taking the U-Net connectivity as a base of our framework, we evaluate and compare several approaches to improve the segmentation model's architecture and training pipeline, including pre-processing techniques, encoder (backbone) deep network types of varying complexity, and specialized loss functions to mitigate class imbalances on image and pixel levels. The main outcome of this research is the final configuration of the AMD detection framework, which outperforms all the prior ADAM challenge submissions on the multi-class segmentation of different AMD lesion types in non-invasive RGB fundus images. The source code used to conduct the experiments presented in this paper is made freely available.

[122] ChartAB: A Benchmark for Chart Grounding & Dense Alignment

Aniruddh Bansal,Davit Soselia,Dang Nguyen,Tianyi Zhou

Main category: cs.CV

TL;DR: This paper introduces the ChartAlign Benchmark (ChartAB), a new benchmark for comprehensively evaluating vision-language models (VLMs) on chart understanding tasks, including extracting data, localizing visualization elements, and recognizing attributes. A two-stage inference workflow further evaluates how well models align and compare elements across charts.

Details Motivation: Existing vision-language models fall short in accurately perceiving chart details and extracting fine-grained structures, which limits their use in comparing and reasoning over multiple charts. Method: Designs a JSON template to quantify the evaluation metrics for each chart grounding task, and proposes a two-stage inference workflow to evaluate models' ability to align and compare elements across charts. Result: Evaluations of several recent VLMs reveal their perception biases, weaknesses, robustness, and hallucinations in chart understanding, and show fine-grained differences among models. Conclusion: ChartAB effectively exposes the shortcomings of current VLMs in chart understanding and points to clear directions for improvement. Abstract: Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored for each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs' capability to align and compare elements/attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.

[123] HEIR: Learning Graph-Based Motion Hierarchies

Cheng Zheng,William Koch,Baiang Li,Felix Heide

Main category: cs.CV

TL;DR: This paper proposes a general, data-driven hierarchical motion modeling method that uses graph neural networks to learn structured motion relationships from data, decomposing global motion into inherited patterns and local residuals, and validates its effectiveness and interpretability on 1D, 2D, and dynamic 3D scenes.

Details Motivation: Existing motion modeling methods rely on manually defined or heuristic hierarchies with fixed motion primitives, which limits generalization across tasks. A general method is therefore needed that learns interpretable, structured motion relationships automatically from data. Method: Proposes a graph-based hierarchical representation that models motion as a differentiable graph of vertices (elemental motions) and directed edges (parent-child dependencies), infers the hierarchy with graph neural networks, and explicitly decomposes global motion into inherited components and local residuals. Result: The method is validated on 1D translation, 2D rotation, and dynamic 3D scene deformation with Gaussian splatting: it reconstructs the intrinsic motion hierarchy, and in 3D scenes it produces more realistic and interpretable deformations than the baseline. Conclusion: The method provides an adaptable, data-driven hierarchical modeling paradigm with good interpretability and generalization, applicable to a broad range of motion-centric tasks. Abstract: Hierarchical structures of motion exist across research fields, including computer vision, graphics, and robotics, where complex dynamics typically arise from coordinated interactions among simpler motion components. Existing methods to model such dynamics typically rely on manually-defined or heuristic hierarchies with fixed motion primitives, limiting their generalizability across different tasks. In this work, we propose a general hierarchical motion modeling method that learns structured, interpretable motion relationships directly from data. Our method represents observed motions using graph-based hierarchies, explicitly decomposing global absolute motions into parent-inherited patterns and local motion residuals. We formulate hierarchy inference as a differentiable graph learning problem, where vertices represent elemental motions and directed edges capture learned parent-child dependencies through graph neural networks. We evaluate our hierarchical reconstruction approach on three examples: 1D translational motion, 2D rotational motion, and dynamic 3D scene deformation via Gaussian splatting. Experimental results show that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases, and produces more realistic and interpretable deformations compared to the baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable, data-driven hierarchical modeling paradigm, our method offers a formulation applicable to a broad range of motion-centric tasks. Project Page: https://light.princeton.edu/HEIR/
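
The decomposition of absolute motion into a parent-inherited part and a local residual can be illustrated with a toy 1D example. This sketch assumes the hierarchy is already known and uses made-up node names and displacements; it does not reproduce the paper's learned, graph-neural-network-based hierarchy inference.

```python
def to_residuals(absolute, parent):
    """Local residual of each node = its absolute motion minus its parent's absolute motion."""
    return {
        node: absolute[node] - (absolute[parent[node]] if parent[node] is not None else 0.0)
        for node in absolute
    }

def to_absolute(residual, parent):
    """Rebuild absolute motion by accumulating residuals from the root down to each node."""
    cache = {}

    def resolve(node):
        if node not in cache:
            p = parent[node]
            cache[node] = residual[node] + (resolve(p) if p is not None else 0.0)
        return cache[node]

    return {node: resolve(node) for node in residual}

# Hypothetical hierarchy (child -> parent) and 1D translations; names are illustrative only.
parent = {"body": None, "arm": "body", "hand": "arm"}
absolute = {"body": 2.0, "arm": 2.5, "hand": 3.5}

residual = to_residuals(absolute, parent)
print(residual)                       # {'body': 2.0, 'arm': 0.5, 'hand': 1.0}
print(to_absolute(residual, parent))  # {'body': 2.0, 'arm': 2.5, 'hand': 3.5}
```

In this toy case most of the hand's motion is inherited from the arm and body, and only a small residual is local, which is the kind of structure the abstract describes recovering from data.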

[124] The Quest for Generalizable Motion Generation: Data, Model, and Evaluation

Jing Lin,Ruisi Wang,Junzhe Lu,Ziqi Huang,Guorui Song,Ailing Zeng,Xian Liu,Chen Wei,Wanqi Yin,Qingping Sun,Zhongang Cai,Lei Yang,Ziwei Liu

Main category: cs.CV

TL;DR: This paper presents a comprehensive framework for transferring knowledge from video generation (ViGen) to 3D human motion generation (MoGen) across data, modeling, and evaluation, substantially improving MoGen's generalization.

Details Motivation: Existing 3D human motion generation models face a bottleneck in generalization, while video generation has shown stronger generalization in modeling human behavior, so its lessons can be borrowed. Method: Introduces ViMoGen-228K, a large-scale dataset combining optical MoCap data, annotated motions from web videos, and ViGen-synthesized samples; designs ViMoGen, a flow-matching-based diffusion transformer, and its lightweight variant ViMoGen-light; and builds MBench, a hierarchical evaluation benchmark. Result: Experiments show the approach significantly outperforms existing methods in both automatic metrics and human evaluation, particularly in motion quality, prompt fidelity, and generalization. Conclusion: By systematically transferring knowledge from ViGen, the proposed framework improves the generalization of 3D motion generation and offers a new direction for future research. Abstract: Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.

[125] Scaling Image Geo-Localization to Continent Level

Philipp Lindenberger,Paul-Edouard Sarlin,Jan Hosang,Matteo Balice,Marc Pollefeys,Simon Lynen,Eduard Trulls

Main category: cs.CV

TL;DR: This paper proposes a hybrid approach that learns feature representations through a proxy classification task and combines them with aerial imagery embeddings to achieve fine-grained geo-localization at continental scale, localizing more than 68% of queries within 200 m on a dataset covering a large part of Europe.

Details Motivation: Existing image geo-localization methods are inefficient, lack coverage, or give only coarse results at global scale, so a scalable, high-precision solution is needed. Method: Trains the model on a proxy classification task so that it implicitly encodes location information, then combines the learned prototypes with embeddings of aerial imagery to enable fine-grained cross-view retrieval for ground-level images. Result: On a dataset covering a large part of Europe, more than 68% of queries are localized within 200 m. Conclusion: The hybrid approach improves fine-grained image localization over large geographic areas, combining scalability with high precision. Abstract: Determining the precise geographic location of an image at a global scale remains an unsolved challenge. Standard image retrieval techniques are inefficient due to the sheer volume of images (>100M) and fail when coverage is insufficient. Scalable solutions, however, involve a trade-off: global classification typically yields coarse results (10+ kilometers), while cross-view retrieval between ground and aerial imagery suffers from a domain gap and has been primarily studied on smaller regions. This paper introduces a hybrid approach that achieves fine-grained geo-localization across a large geographic expanse the size of a continent. We leverage a proxy classification task during training to learn rich feature representations that implicitly encode precise location information. We combine these learned prototypes with embeddings of aerial imagery to increase robustness to the sparsity of ground-level data. This enables direct, fine-grained retrieval over areas spanning multiple countries. Our extensive evaluation demonstrates that our approach can localize within 200m more than 68% of queries of a dataset covering a large part of Europe. The code is publicly available at https://scaling-geoloc.github.io.
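
A headline number such as "more than 68% of queries within 200 m" is a recall-at-distance metric. The sketch below shows one plausible way to compute it, assuming great-circle (haversine) distance on a spherical Earth; the coordinates are invented for illustration, and the paper's exact evaluation protocol may differ.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def recall_at(preds, truths, threshold_m=200.0):
    """Fraction of queries whose predicted location lies within threshold_m of the truth."""
    hits = sum(haversine_m(*p, *t) <= threshold_m for p, t in zip(preds, truths))
    return hits / len(truths)

# Hypothetical predictions and ground-truth locations as (lat, lon) in degrees.
truths = [(47.3769, 8.5417), (48.8566, 2.3522), (52.5200, 13.4050)]
preds = [(47.3775, 8.5420),   # roughly 70 m off: counts as a hit
         (48.8666, 2.3522),   # roughly 1.1 km off: counts as a miss
         (52.5200, 13.4050)]  # exact match: counts as a hit

recall = recall_at(preds, truths)
print(f"recall@200m = {recall:.3f}")
```

With these made-up points, two of three queries fall within the 200 m threshold.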

[126] SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting

Dongyue Lu,Ao Liang,Tianxin Huang,Xiao Fu,Yuyang Zhao,Baorui Ma,Liang Pan,Wei Yin,Lingdong Kong,Wei Tsang Ooi,Ziwei Liu

Main category: cs.CV

TL;DR: This paper proposes SEE4D, a pose-free trajectory-to-camera framework that separates camera control from scene modeling by rendering to a bank of fixed virtual cameras, enabling 4D content generation from casual videos.

Details Motivation: Existing methods rely on manually annotated camera poses or complex trajectory prediction, which makes them hard to apply to in-the-wild videos and prone to confusing camera motion with scene dynamics. Method: Proposes the SEE4D framework, which replaces explicit trajectory prediction with a bank of virtual cameras and pairs it with a view-conditional video inpainting model that denoises synthesized warped images and fills in missing regions, enabling 4D modeling without 3D annotations. Result: Outperforms pose- or trajectory-conditioned methods on cross-view video generation and sparse reconstruction tasks, with better generalization and performance. Conclusion: SEE4D achieves efficient 4D content generation without pose supervision and advances practical 4D world modeling from casual videos. Abstract: Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate SEE4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.

[127] Masked Diffusion Captioning for Visual Feature Learning

Chao Feng,Zihao Wei,Andrew Owens

Main category: cs.CV

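TL;DR: Proposes masked diffusion captioning (MDC), an image captioning approach based on a masked diffusion language model that learns visual features by reconstructing masked text tokens; across a range of models and datasets it is competitive with autoregressive and contrastive methods.

Details Motivation: In conventional autoregressive captioning, the visual learning signal depends on token position and auxiliary objectives are often needed; the goal is a more balanced and simpler way to learn visual features. Method: Uses an image-conditioned masked diffusion language model: text in each image-text pair is randomly masked, and a decoder conditioned on visual features is trained to reconstruct the original text. Result: Linear probing experiments show that the learned visual features perform on par with autoregressive and contrastive learning methods across multiple academic-scale models and datasets. Conclusion: MDC offers an effective new paradigm for visual feature learning that does not depend on token position or extra auxiliary objectives and transfers well to downstream tasks. Abstract: We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token's position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.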
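
The training-time corruption step described in the abstract (masking caption tokens at a randomly chosen ratio) can be sketched as follows. The uniform ratio, the independent per-token masking, the `[MASK]` symbol, and the whitespace tokenization are all assumptions made for illustration; the image-conditioned decoder that reconstructs the masked tokens is not shown.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol; the actual token used in the paper is not specified here

def mask_caption(tokens, rng):
    """Sample a mask ratio, then mask each token independently with that probability.

    Returns the corrupted sequence, a dict mapping masked positions to their
    original tokens (the reconstruction targets), and the sampled ratio.
    """
    ratio = rng.random()  # assumed uniform in [0, 1)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < ratio:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets, ratio

rng = random.Random(0)  # fixed seed so repeated runs give the same masking
caption = "a dog catches a frisbee in the park".split()
masked, targets, ratio = mask_caption(caption, rng)
print(f"ratio={ratio:.2f}", " ".join(masked))
```

Because every position is equally likely to be masked, the reconstruction loss is spread evenly across the caption, which matches the abstract's point that the learning signal does not depend on token position.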

[128] OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes

Yukun Huang,Jiwen Yu,Yanning Zhou,Jianan Wang,Xintao Wang,Pengfei Wan,Xihui Liu

Main category: cs.CV

TL;DR: This paper proposes OmniX, a panorama-based 2D lifting method for generating graphics-ready 3D scenes suitable for physically based rendering and simulation, reusing 2D generative priors for panoramic perception, generation, and completion.

Details Motivation: Existing 2D lifting methods focus on appearance generation and neglect the perception of intrinsic properties such as geometry and materials, so they struggle to meet the needs of physically realistic rendering and simulation. Method: Proposes the OmniX framework, which uses a lightweight cross-modal adapter structure to reuse 2D generative model priors for joint perception of geometry, textures, and PBR materials in panoramas, and builds a large-scale multimodal synthetic panorama dataset. Result: Achieves strong results on panoramic visual perception and graphics-ready 3D scene generation, supporting PBR, relighting, and simulation applications. Conclusion: OmniX opens a new route to generating immersive, physically realistic virtual worlds and brings panoramic perception and 3D content creation closer together. Abstract: There are two prevalent ways of constructing 3D scenes: procedural generation and 2D lifting. Among them, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. Furthermore, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.