Table of Contents
cs.CL
[1] StreetMath: Study of LLMs' Approximation Behaviors
Chiung-Yi Tseng,Somshubhra Roy,Maisha Thasin,Danyang Zhang,Blessing Effiong
Main category: cs.CL
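TL;DR: This paper introduces StreetMath, a benchmark for evaluating the approximate mathematical reasoning of large language models in real-world scenarios. It finds that existing models tend to compute exact values or call external tools rather than approximate efficiently, and that exact and approximate arithmetic rely on different neural mechanisms, suggesting that LLMs lack the "cognitive miserliness" humans show in street math.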
Details
Motivation: Existing work focuses mostly on LLM performance on exact arithmetic and overlooks approximate reasoning in informal, fast-paced settings (street math), especially for non-autoregressive models. The authors aim to fill this gap.
Method: Introduces the StreetMath benchmark, which covers a range of real-world approximation scenarios, and evaluates it extensively across LLM architectures including Qwen, Dream, and Falcon-Mamba. Mechanistic interpretability techniques are applied to analyze the models' internal computational states and neural mechanisms.
Result: LLMs still tend to attempt exact computation or call tools on approximation tasks; even when early layers already reach the correct answer, they consume more tokens. Exact and approximate operations rely on different neural components. The models do not show human-like "cognitive miserliness".
Conclusion: Current LLMs lack efficient approximate reasoning, and their architectures may not adequately support the quick estimation strategies humans use in everyday math. Future models should better emulate approximate reasoning.
Abstract: There is a substantial body of literature examining the mathematical reasoning capabilities of large language models (LLMs), particularly their performance on precise arithmetic operations in autoregressive architectures. However, their ability to perform approximate reasoning in informal, fast-paced mathematical operations has received far less attention, especially among non-autoregressive decoder models. Our work addresses this gap by introducing StreetMath, a benchmark designed to evaluate models' approximation abilities under real-world approximation scenarios. We conduct extensive evaluations across different LLM architectures: Qwen3-4B-Instruct-2507, Qwen3-4B-Thinking-2507, Dream-v0-Instruct-7B, Falcon-Mamba-7B-Instruct, and Mamba-GPT-3B. Furthermore, we apply mechanistic interpretability techniques to probe their internal computational states. Our analysis reveals that LLMs generally attempt to compute exact values or invoke external tools even in tasks that call for approximation. Moreover, while models sometimes reach the correct answer in early layers or steps, they still consume more tokens when solving approximation tasks. Additional experiments indicate that exact and approximate arithmetic operations rely on largely separate neural components. Drawing upon research in cognitive psychology, we argue that LLMs do not exhibit cognitive miserliness in the same way humans do in street math settings. We open-source our work at https://github.com/ctseng777/StreetMath
[2] Review Based Entity Ranking using Fuzzy Logic Algorithmic Approach: Analysis
Pratik N. Kalamkar,Anupama G. Phakatkar
Main category: cs.CL
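TL;DR: This paper proposes a fine-grained classification method based on the orientation and strength of opinion words, using fuzzy logic and syntactic dependency analysis to score reviews on specific product aspects for more precise entity ranking.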
Details
Motivation: Traditional lexicon-based sentiment analysis ignores differences in opinion strength and cannot capture the nuances of user opinions, so a method is needed that distinguishes strength levels and combines them with aspect information to rank entities.
Method: A fuzzy logic algorithm classifies opinion words (adverbs, adjectives, nouns, and verbs) into five strength levels (very weak, weak, moderate, very strong, strong). Syntactic dependency relations identify the opinion words tied to a specific aspect, and these are used to compute the entity score for each aspect.
Result: The method classifies review sentiment at a fine granularity and ranks entities on specific aspects according to user queries and review content, improving the precision and practicality of sentiment analysis.
Conclusion: Combining fuzzy logic with syntactic dependency analysis captures subtle differences in opinion strength more effectively and shows good potential for aspect-based opinion mining.
Abstract: Opinion mining, also called sentiment analysis, is the field of study that analyzes people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes. The holistic lexicon-based approach does not consider the strength of each opinion, i.e., whether the opinion is very strongly negative (or positive), strongly negative (or positive), moderately negative (or positive), very weakly negative (or positive), or weakly negative (or positive). In this paper, we propose an approach to rank entities based on the orientation and strength of the entity reviews and user's queries by classifying them into granularity levels (i.e., very weak, weak, moderate, very strong, and strong) by combining opinion words (i.e., adverb, adjective, noun, and verb) that are related to the aspect of interest of a certain product. We use a fuzzy logic algorithmic approach to classify opinion words into different categories, and syntactic dependency resolution to find relations for the desired aspect words. Opinion words related to certain aspects of interest are considered to find the entity score for that aspect in the review.
[3] LASTIST: LArge-Scale Target-Independent STance dataset
DongJae Kim,Yaejin Lee,Minsu Park,Eunil Park
Main category: cs.CL
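TL;DR: This paper presents LASTIST, a large-scale target-independent stance detection dataset for Korean with 563,299 labeled sentences, aimed at filling the gap in stance detection research for low-resource languages.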
Details
Motivation: Existing stance detection research concentrates on English and on target-specific stance, with little support for low-resource languages such as Korean and no target-independent stance detection dataset.
Method: Data is collected from press releases issued by Korean political parties to build the LASTIST dataset, and state-of-the-art deep learning models are trained on it for stance detection.
Result: A large-scale Korean stance detection dataset that supports multiple tasks, including target-independent stance detection and diachronic analysis of stance evolution.
Conclusion: LASTIST is an important resource for Korean stance detection research and advances stance detection for low-resource languages.
Abstract: Stance detection has emerged as an area of research in the field of artificial intelligence. However, most research is currently centered on the target-dependent stance detection task, which is based on a person's stance in favor of or against a specific target. Furthermore, most benchmark datasets are based on English, making it difficult to develop models in low-resource languages such as Korean, especially for an emerging field such as stance detection. This study proposes the LArge-Scale Target-Independent STance (LASTIST) dataset to fill this research gap. Collected from the press releases of both parties on Korean political parties, the LASTIST dataset comprises 563,299 labeled Korean sentences. We provide a detailed description of how we collected and constructed the dataset and trained state-of-the-art deep learning and stance detection models. Our LASTIST dataset is designed for various tasks in stance detection, including target-independent stance detection and diachronic evolution stance detection. We deploy our dataset on https://anonymous.4open.science/r/LASTIST-3721/.
[4] zFLoRA: Zero-Latency Fused Low-Rank Adapters
Dhananjaya Gowda,Seoha Song,Harshith Goka,Junhyun Lee
Main category: cs.CL
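TL;DR: This paper proposes a new zero-latency fused low-rank adapter (zFLoRA) that significantly reduces inference-time latency overhead while preserving LLM performance.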
Details
Motivation: Task-specific adapters add disproportionate compute overhead at inference time, so a more efficient adaptation method is needed.
Method: Proposes zFLoRA, which uses a fused low-rank structure to achieve zero or negligible latency overhead.
Result: Experiments on LLMs of size 1B, 3B, and 7B show that zFLoRA compares favorably with LoRA and full fine-tuning across multiple tasks, with almost no added latency on NPU and GPU platforms.
Conclusion: zFLoRA is an efficient, low-latency adaptation method suited to deploying LLMs in multi-task settings.
Abstract: Large language models (LLMs) are increasingly deployed with task-specific adapters catering to multiple downstream applications. In such a scenario, the additional compute associated with this apparently insignificant number of adapter parameters (typically less than 1% of the base model) turns out to be disproportionately significant during inference time (up to 2.5x that of the base model). In this paper, we propose a new zero-latency fused low-rank adapter (zFLoRA) that introduces zero or negligible latency overhead on top of the base model. Experimental results on LLMs of size 1B, 3B and 7B show that zFLoRA compares favorably against the popular supervised fine-tuning benchmarks including low-rank adapters (LoRA) as well as full fine-tuning (FFT). Experiments are conducted on 18 different tasks across three different categories, namely commonsense reasoning, math reasoning, and summary-dialogue. Latency measurements made on NPU (Samsung Galaxy S25+) as well as GPU (NVIDIA H100) platforms show that the proposed zFLoRA adapters introduce zero to negligible latency overhead.
[5] BlackboxNLP-2025 MIB Shared Task: Improving Circuit Faithfulness via Better Edge Selection
Yaniv Nikankin,Dana Arad,Itay Itzhak,Anja Reusch,Adi Simhi,Gal Kesten-Pomeranz,Yonatan Belinkov
Main category: cs.CL
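TL;DR: This paper proposes three improvements to circuit discovery in mechanistic interpretability: bootstrapping to identify consistent edges, a ratio-based selection strategy, and integer linear programming in place of greedy selection. Experiments show these methods outperform prior approaches across multiple tasks and models.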
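The bootstrapping step in this entry can be illustrated with a minimal sketch. The entry does not spell out the procedure, so everything here is an assumption for illustration: the function name `consistent_edges`, the per-example score lists, and the rule "keep an edge if its resampled mean is positive in at least 95% of resamples" are all hypothetical.

```python
import random

def consistent_edges(scores_per_edge, n_resamples=200, min_positive_frac=0.95, seed=0):
    """Keep edges whose bootstrap-mean attribution score is positive in most resamples.

    Hypothetical criterion for illustration; not the shared-task submission's rule.
    """
    rng = random.Random(seed)
    kept = []
    for edge, scores in scores_per_edge.items():
        positive = 0
        for _ in range(n_resamples):
            # Resample the per-example scores with replacement.
            sample = [rng.choice(scores) for _ in scores]
            if sum(sample) / len(sample) > 0:
                positive += 1
        if positive / n_resamples >= min_positive_frac:
            kept.append(edge)
    return kept

# Made-up attribution scores for two edges.
scores = {
    "a->b": [0.9, 1.1, 1.0, 0.8],      # consistently positive
    "b->c": [-0.5, -0.4, -0.6, -0.5],  # consistently negative
}
print(consistent_edges(scores))
```

An edge with mixed-sign scores would pass or fail depending on how often its resampled mean stays positive, which is the kind of instability the bootstrap is meant to expose.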
Details
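Motivation: A main challenge in mechanistic interpretability is circuit discovery, that is, determining which parts of a model perform a given task. Existing methods fall short on accuracy and faithfulness, so better ways of identifying the key components are needed.
Method: (1) Use bootstrapping to identify edges with stable attribution scores; (2) introduce a ratio-based selection strategy that prioritizes strong positive-scoring edges to balance performance and faithfulness; (3) replace the standard greedy selection with an integer linear programming formulation.
Result: The proposed methods yield more faithful circuits on multiple MIB tasks and models and outperform prior circuit discovery methods.
Conclusion: The three improvements make circuit discovery more accurate and reliable, giving mechanistic interpretability a more effective tool.
Abstract: One of the main challenges in mechanistic interpretability is circuit discovery, determining which parts of a model perform a given task. We build on the Mechanistic Interpretability Benchmark (MIB) and propose three key improvements to circuit discovery. First, we use bootstrapping to identify edges with consistent attribution scores. Second, we introduce a simple ratio-based selection strategy to prioritize strong positive-scoring edges, balancing performance and faithfulness. Third, we replace the standard greedy selection with an integer linear programming formulation. Our methods yield more faithful circuits and outperform prior approaches across multiple MIB tasks and models. Our code is available at: https://github.com/technion-cs-nlp/MIB-Shared-Task.
[6] LISTEN to Your Preferences: An LLM Framework for Multi-Objective Selection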
Adam S. Jovine,Tinghan Ye,Francis Bahk,Jingjing Wang,David B. Shmoys,Peter I. Frazier
Main category: cs.CL
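TL;DR: Proposes LISTEN, a framework that uses an LLM as a zero-shot preference oracle, guided by an expert's natural-language priorities, to select the best option in multi-objective decisions. It includes two iterative algorithms, LISTEN-U (which refines a parametric utility function) and LISTEN-T (tournament-style selection over small batches), and is validated on several tasks.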
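The tournament-style selection described for LISTEN-T can be sketched as follows. This is a rough illustration only: `tournament_select` and `pick_best` are hypothetical names, and `pick_best` stands in for the LLM preference oracle (here replaced by a simple price comparison so the example runs without a model).

```python
def tournament_select(options, pick_best, batch_size=4):
    """Repeatedly pick one winner per small batch until a single option remains.

    `pick_best` is a stand-in for the preference oracle: it receives a batch
    (a list of options) and returns the preferred one.
    """
    if not options:
        raise ValueError("options must be non-empty")
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    pool = list(options)
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool), batch_size):
            batch = pool[i:i + batch_size]
            winners.append(pick_best(batch))
        pool = winners
    return pool[0]

# Made-up flight options; the "oracle" simply prefers the cheapest flight.
flights = [{"id": k, "price": p} for k, p in enumerate([420, 310, 550, 299, 380, 305])]
cheapest = tournament_select(flights, lambda batch: min(batch, key=lambda f: f["price"]), batch_size=3)
print(cheapest)
```

Keeping batches small is what lets each oracle call fit within a limited context window, at the cost of several rounds of calls.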
Details
Motivation: Human experts struggle to choose from a large set of options with multiple competing objectives, because complex, implicit preferences are hard to formalize, which makes decisions inefficient.
Method: Proposes the LISTEN framework, which uses an LLM as a zero-shot preference oracle guided by the expert's high-level priorities in natural language. Two algorithms are designed: LISTEN-U, which uses the LLM to refine a parametric utility function, and LISTEN-T, a non-parametric method that performs tournament-style selection over small batches.
Result: On flight booking, shopping, and exam scheduling tasks, LISTEN-U excels when preferences are parametrically aligned, while LISTEN-T is more robust. A novel concordance metric is introduced to measure this alignment.
Conclusion: LISTEN supports complex multi-objective decisions driven by natural language, reduces the cognitive burden of traditional preference elicitation, and shows the potential of LLMs for zero-shot preference modeling.
Abstract: Human experts often struggle to select the best option from a large set of items with multiple competing objectives, a process bottlenecked by the difficulty of formalizing complex, implicit preferences. To address this, we introduce LISTEN, a framework that leverages a Large Language Model (LLM) as a zero-shot preference oracle, guided only by an expert's high-level priorities in natural language. To operate within LLM constraints like context windows and inference costs, we propose two iterative algorithms: LISTEN-U, which uses the LLM to refine a parametric utility function, and LISTEN-T, a non-parametric method that performs tournament-style selections over small batches of solutions. Evaluated on diverse tasks including flight booking, shopping, and exam scheduling, our results show LISTEN-U excels when preferences are parametrically aligned (a property we measure with a novel concordance metric), while LISTEN-T offers more robust performance. This work explores a promising direction for steering complex multi-objective decisions directly with natural language, reducing the cognitive burden of traditional preference elicitation.
[7] Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data
Haoran Deng,Yingyu Lin,Zhenghao Lin,Xiao Liu,Yizhou Sun,Yi-An Ma,Yeyun Gong
Main category: cs.CL
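TL;DR: This paper proposes LongFilter, a framework for selecting high-quality data for long-context pretraining. It contrasts model predictions under long-context and short-context settings to identify samples that depend on long-range information.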
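One simple way to picture the long-context versus short-context contrast is to compare per-token log-probabilities of the same text under the two settings. The sketch below assumes those log-probabilities have already been computed by some model; the scoring rule (mean log-probability difference), the function names, and the threshold are illustrative assumptions, not LongFilter's actual formula.

```python
def long_context_gain(logp_long, logp_short):
    """Mean per-token log-probability gain from conditioning on the long context.

    Assumed scoring rule for illustration: larger values mean the extended
    context made the tokens easier to predict.
    """
    if len(logp_long) != len(logp_short) or not logp_long:
        raise ValueError("need two non-empty, equal-length sequences")
    return sum(l - s for l, s in zip(logp_long, logp_short)) / len(logp_long)

def select_samples(samples, threshold):
    """Return names of samples whose gain exceeds a (hypothetical) threshold."""
    return [
        name
        for name, (lp_long, lp_short) in samples.items()
        if long_context_gain(lp_long, lp_short) > threshold
    ]

# Made-up log-probabilities for two documents.
samples = {
    "doc_a": ([-1.0, -0.5], [-2.0, -1.5]),  # long context helps a lot
    "doc_b": ([-1.0, -1.0], [-1.0, -1.1]),  # local context is nearly enough
}
print(select_samples(samples, threshold=0.5))
```

A document like `doc_b`, whose tokens are almost as predictable from local context alone, is the kind of sample the entry describes as inefficient for long-context training.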
Details
Motivation: Much of the available long-text data lacks meaningful long-range dependencies and can be predicted from local context alone, which makes training inefficient. Data with genuine long-range dependencies therefore needs to be selected.
Method: LongFilter contrasts model predictions under long-context and short-context settings and computes the information gain, thereby quantifying how much long-range dependencies matter, and selects training data accordingly.
Result: Experiments extending LLaMA-3-8B's context length from 8K to 64K show that LongFilter efficiently selects high-quality data and substantially improves performance on benchmarks such as HELMET, LongBench, and RULER.
Conclusion: LongFilter improves the training efficiency and performance of long-context language models and is an effective tool for data selection in long-context pretraining.
Abstract: Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, a significant portion of readily available long-text data lacks meaningful long-distance dependencies; most spans can be predicted using only local context. Training on such data is inefficient, making careful data selection crucial. Therefore, we introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, thereby identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.
[8] Ideology-Based LLMs for Content Moderation
Stefano Civelli,Pietro Bernardelle,Nardiena A. Pratama,Gianluca Demartini
Main category: cs.CL
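TL;DR: This study examines how adopting different ideological personas affects the consistency and fairness of harmful content classification across LLM architectures, model sizes, and content modalities. Although overall accuracy changes little, personas introduce subtle ideological bias: models lean toward views from the same ideology, which undermines the neutrality of their judgments.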
Details
Motivation: LLMs used in content moderation systems must be fair and neutral, and persona settings should not introduce ideological bias.
Method: Personas with different ideological leanings are introduced across LLM architectures, model sizes, and content modalities (text and vision). Their behavior in harmful content classification is compared, along with cross-ideology agreement analyses and a study on a politically targeted task.
Result: Personas with different ideologies differ markedly in whether they judge content harmful. Larger models align more closely with personas of the same ideology, and on politically targeted tasks personas tend to defend their own position while downplaying the harmfulness of opposing views.
Conclusion: Personas can introduce hidden ideological bias into seemingly neutral AI systems, which poses a risk for applications that rely on LLMs for content moderation and calls for careful design and evaluation.
Abstract: Large language models (LLMs) are increasingly used in content moderation systems, where ensuring fairness and neutrality is essential. In this study, we examine how persona adoption influences the consistency and fairness of harmful content classification across different LLM architectures, model sizes, and content modalities (language vs. vision). At first glance, headline performance metrics suggest that personas have little impact on overall classification accuracy. However, a closer analysis reveals important behavioral shifts. Personas with different ideological leanings display distinct propensities to label content as harmful, showing that the lens through which a model "views" input can subtly shape its judgments. Further agreement analyses highlight that models, particularly larger ones, tend to align more closely with personas from the same political ideology, strengthening within-ideology consistency while widening divergence across ideological groups. To show this effect more directly, we conducted an additional study on a politically targeted task, which confirmed that personas not only behave more coherently within their own ideology but also exhibit a tendency to defend their perspective while downplaying harmfulness in opposing views. Together, these findings highlight how persona conditioning can introduce subtle ideological biases into LLM outputs, raising concerns about the use of AI systems that may reinforce partisan perspectives under the guise of neutrality.
[9] Beyond Long Context: When Semantics Matter More than Tokens
Tarun Kumar Chawdhury,Jon D. Duke
Main category: cs.CL
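TL;DR: CLEAR, an entity-aware retrieval method, is more efficient and accurate for clinical document question answering than traditional embedding retrieval and large-context inference, using far fewer tokens and performing better on long documents.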
Details
Motivation: Clinical documents in electronic health records are stored as base64-encoded attachments, which makes semantic question answering difficult, and traditional vector database methods often miss nuanced clinical relationships.
Method: Validates the Clinical Entity Augmented Retrieval (CLEAR) method, which was introduced by Lopez et al. 2025 and uses entity-aware retrieval, on a purpose-built Clinical Notes QA Evaluation Platform, comparing it with zero-shot large-context inference and traditional chunk-based retrieval-augmented generation.
Result: On 12 realistic clinical notes of 10,000 to 65,000 tokens, CLEAR achieved a 58.3% win rate and an average semantic similarity of 0.878, and used 78% fewer tokens than wide-context processing. On long documents exceeding 65,000 tokens the win rate reached 75%.
Conclusion: Entity-aware retrieval improves both efficiency and accuracy in clinical natural language processing, and the evaluation framework provides a reusable, transparent benchmark for clinical question answering systems.
Abstract: Electronic Health Records (EHR) store clinical documentation as base64-encoded attachments in FHIR DocumentReference resources, which makes semantic question answering difficult. Traditional vector database methods often miss nuanced clinical relationships. The Clinical Entity Augmented Retrieval (CLEAR) method, introduced by Lopez et al. 2025, uses entity-aware retrieval and achieved improved performance with an F1 score of 0.90 versus 0.86 for embedding-based retrieval, while using over 70 percent fewer tokens. We developed a Clinical Notes QA Evaluation Platform to validate CLEAR against zero-shot large-context inference and traditional chunk-based retrieval-augmented generation. The platform was tested on 12 clinical notes ranging from 10,000 to 65,000 tokens representing realistic EHR content. CLEAR achieved a 58.3 percent win rate, an average semantic similarity of 0.878, and used 78 percent fewer tokens than wide-context processing. The largest performance gains occurred on long notes, with a 75 percent win rate for documents exceeding 65,000 tokens. These findings confirm that entity-aware retrieval improves both efficiency and accuracy in clinical natural language processing. The evaluation framework provides a reusable and transparent benchmark for assessing clinical question answering systems where semantic precision and computational efficiency are critical.
[10] A Survey on Efficient Large Language Model Training: From Data-centric Perspectives
Junyu Luo,Bohan Wu,Xiao Luo,Zhiping Xiao,Yiqiao Jin,Rong-Cheng Tu,Nan Yin,Yifan Wang,Jingyang Yuan,Wei Ju,Ming Zhang
Main category: cs.CL
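TL;DR: This paper presents the first systematic survey of data-efficient LLM post-training from a data-centric perspective. It proposes a taxonomy covering data selection, quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems, and summarizes representative methods and future research directions.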
Details
Motivation: LLM post-training currently faces high data annotation costs and diminishing marginal returns from scaling data, so data-efficient post-training methods are urgently needed.
Method: From a data-centric perspective, proposes a taxonomy of data-efficient LLM post-training methods covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems.
Result: Systematically reviews representative methods in each category and identifies the challenges and open problems of data-efficient post-training, giving follow-up research a clear technical roadmap.
Conclusion: By building a unified taxonomy, the survey points the way toward better data utilization in LLM training and aims to stimulate further research in the area.
Abstract: Post-training of Large Language Models (LLMs) is crucial for unlocking their task generalization potential and domain-specific capabilities. However, the current LLM post-training paradigm faces significant data challenges, including the high costs of manual annotation and diminishing marginal returns on data scales. Therefore, achieving data-efficient post-training has become a key research question. In this paper, we present the first systematic survey of data-efficient LLM post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods, covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We summarize representative approaches in each category and outline future research directions. By examining the challenges in data-efficient LLM post-training, we highlight open problems and propose potential research avenues. We hope our work inspires further exploration into maximizing the potential of data utilization in large-scale model training. Paper List: https://github.com/luo-junyu/Awesome-Data-Efficient-LLM
[11] Evaluating the Impact of LLM-Assisted Annotation in a Perspectivized Setting: the Case of FrameNet Annotation
Frederico Belcavello,Ely Matos,Arthur Lorenzi,Lisandra Bonoto,Lívia Ruiz,Luiz Fernando Pereira,Victor Herbst,Yulla Navarro,Helen de Andrade Abreu,Lívia Dutra,Tiago Timponi Torrent
Main category: cs.CL
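TL;DR: This paper evaluates the automatic and semi-automatic performance of an LLM-based semantic role labeler for FrameNet-like semantic annotation. The semi-automatic setting increases frame diversity while maintaining annotation coverage, and outperforms the fully automatic setting.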
Details
Motivation: Although LLMs hold great potential for building language resources, their impact on the creation of annotated datasets, especially under a perspectivized approach to NLP, has not been systematically evaluated.
Method: An extensive experimental evaluation compares annotation time, coverage, and diversity across three settings: manual, automatic, and semi-automatic annotation.
Result: The semi-automatic setting yields higher frame diversity and annotation coverage comparable to the manual setting; the fully automatic setting is worse on every metric except speed.
Conclusion: A semi-automatic strategy that combines humans and LLMs is the stronger choice for semantic annotation and improves both data quality and efficiency.
Abstract: The use of LLM-based applications as a means to accelerate and/or substitute human labor in the creation of language resources and datasets is a reality. Nonetheless, despite the potential of such tools for linguistic research, comprehensive evaluation of their performance and impact on the creation of annotated datasets, especially under a perspectivized approach to NLP, is still missing. This paper contributes to reducing this gap by reporting on an extensive evaluation of the (semi-)automatization of FrameNet-like semantic annotation by the use of an LLM-based semantic role labeler. The methodology employed compares annotation time, coverage and diversity in three experimental settings: manual, automatic and semi-automatic annotation. Results show that the hybrid, semi-automatic annotation setting leads to increased frame diversity and similar annotation coverage, when compared to the human-only setting, while the automatic setting performs considerably worse in all metrics, except for annotation time.
[12] RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline
André V. Duarte,Xuying Li,Bin Zeng,Arlindo L. Oliveira,Lei Li,Zhuo Li
Main category: cs.CL
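TL;DR: Proposes RECAP, an agentic pipeline that uses a feedback loop and a jailbreaking module to elicit and verify memorized training data from LLM outputs, substantially improving extraction of copyrighted text.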
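This entry reports extraction quality as ROUGE-L, which is built on the longest common subsequence (LCS) between a candidate and a reference. The sketch below is a minimal F1 variant over whitespace tokens; the paper's exact ROUGE-L configuration (tokenization, weighting) is not given here, so treat these choices as assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match on equality, else carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L as the harmonic mean of LCS precision and recall (assumed variant)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("a c", "a b c d"))

# Relative improvement behind the "nearly 24%" figure for 0.38 -> 0.47.
relative_gain = (0.47 - 0.38) / 0.38
print(round(relative_gain, 3))
```

The last two lines show that the quoted increase is relative to the baseline score: 0.09 / 0.38 is about 0.237.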
Details
Motivation: When the training data of an LLM cannot be inspected, how can we tell what the model has memorized? The authors argue that the strongest evidence comes from the model itself freely reproducing the content.
Method: Designs RECAP, an agentic pipeline built on a feedback loop: a secondary language model compares the output against a reference passage and generates correction hints, which iteratively refine the target model's output. A jailbreaking module is added to overcome alignment-induced refusals.
Result: On EchoTrace, a new benchmark spanning over 30 full books, RECAP substantially outperforms single-iteration approaches. For example, with GPT-4.1 the average ROUGE-L score for copyrighted text extraction rose from 0.38 to 0.47, an increase of nearly 24%.
Conclusion: RECAP effectively elicits and verifies LLM memorization of training data and provides a strong tool for probing model memory.
Abstract: If we cannot inspect the training data of a large language model (LLM), how can we ever know what it has seen? We believe the most compelling evidence arises when the model itself freely reproduces the target content. As such, we propose RECAP, an agentic pipeline designed to elicit and verify memorized training data from LLM outputs. At the heart of RECAP is a feedback-driven loop, where an initial extraction attempt is evaluated by a secondary language model, which compares the output against a reference passage and identifies discrepancies. These are then translated into minimal correction hints, which are fed back into the target model to guide subsequent generations. In addition, to address alignment-induced refusals, RECAP includes a jailbreaking module that detects and overcomes such barriers. We evaluate RECAP on EchoTrace, a new benchmark spanning over 30 full books, and the results show that RECAP leads to substantial gains over single-iteration approaches. For instance, with GPT-4.1, the average ROUGE-L score for the copyrighted text extraction improved from 0.38 to 0.47, a nearly 24% increase.
[13] Revisiting Multilingual Data Mixtures in Language Model Pretraining
Negar Foroutan,Paul Teiletche,Ayush Kumar Tarun,Antoine Bosselut
Main category: cs.CL
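TL;DR: This study examines how different multilingual data mixtures affect LLM pretraining and challenges common assumptions about multilingual training. With sufficient data, combining English and multilingual data does not hurt performance in either group; using English as a pivot language improves performance across language families, while choosing a pivot from within a family does not consistently help; and no significant "curse of multilinguality" appears at the model scales studied.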
Details
Motivation: To investigate how multilingual data mixtures affect LLM pretraining and to clarify the debate over the trade-off between language coverage and model performance (the "curse of multilinguality").
Method: Trains 1.1B and 3B parameter language models on diverse multilingual corpora covering 25 to 400 languages, and systematically analyzes model performance under different language counts and data proportions.
Result: (1) Combining English and multilingual data does not degrade in-language performance of either group, provided each language has enough training tokens; (2) using English as a pivot language improves performance across language families, but choosing a pivot from within a specific family does not consistently improve languages in that family; (3) no notable performance drop is observed as the number of training languages grows (no significant "curse of multilinguality").
Conclusion: Appropriately balanced multilingual data can strengthen a model's language capabilities without sacrificing performance, even in low-resource settings.
Abstract: The impact of different multilingual data mixtures in pretraining large language models (LLMs) has been a topic of ongoing debate, often raising concerns about potential trade-offs between language coverage and model performance (i.e., the curse of multilinguality). In this work, we investigate these assumptions by training 1.1B and 3B parameter LLMs on diverse multilingual corpora, varying the number of languages from 25 to 400. Our study challenges common beliefs surrounding multilingual training. First, we find that combining English and multilingual data does not necessarily degrade the in-language performance of either group, provided that languages have a sufficient number of tokens included in the pretraining corpus. Second, we observe that using English as a pivot language (i.e., a high-resource language that serves as a catalyst for multilingual generalization) yields benefits across language families, and contrary to expectations, selecting a pivot language from within a specific family does not consistently improve performance for languages within that family. Lastly, we do not observe a significant "curse of multilinguality" as the number of training languages increases in models at this scale. Our findings suggest that multilingual data, when balanced appropriately, can enhance language model capabilities without compromising performance, even in low-resource settings.
[14] Semantic Label Drift in Cross-Cultural Translation
Mohsinul Kabir,Tasnim Ahmed,Md Mezbaur Rahman,Polydoros Giannouris,Sophia Ananiadou
Main category: cs.CL
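TL;DR: This paper studies semantic label drift in machine translation caused by cultural differences. Translation systems, including LLMs, are more prone to label drift in culturally sensitive domains, and cultural similarity is a key factor in label preservation.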
Details
Motivation: Existing machine translation research focuses on sentiment preservation but overlooks cultural alignment between source and target languages, which can lead to semantic distortion and cultural conflict in downstream applications.
Method: Through a series of experiments in culturally sensitive and neutral domains, analyzes how semantic labels change during translation across different MT systems (including LLMs), and examines how encoded cultural knowledge affects label drift.
Result: (1) MT systems, including LLMs, induce notable label drift in culturally sensitive domains; (2) LLMs encode cultural knowledge, but leveraging it can amplify label drift; (3) cultural similarity between languages strongly affects label preservation.
Conclusion: Ignoring cultural factors in translation undermines label fidelity and raises the risk of misinterpretation and cultural conflict, so cultural alignment deserves attention when generating data for low-resource languages.
Abstract: Machine Translation (MT) is widely employed to address resource scarcity in low-resource languages by generating synthetic data from high-resource counterparts. While sentiment preservation in translation has long been studied, a critical but underexplored factor is the role of cultural alignment between source and target languages. In this paper, we hypothesize that semantic labels drift or are altered during MT due to cultural divergence. Through a series of experiments across culturally sensitive and neutral domains, we establish three key findings: (1) MT systems, including modern Large Language Models (LLMs), induce label drift during translation, particularly in culturally sensitive domains; (2) unlike earlier statistical MT tools, LLMs encode cultural knowledge, and leveraging this knowledge can amplify label drift; and (3) cultural similarity or dissimilarity between source and target languages is a crucial determinant of label preservation. Our findings highlight that neglecting cultural factors in MT not only undermines label fidelity but also risks misinterpretation and cultural conflict in downstream applications.
[15] SymCode: A Neurosymbolic Approach to Mathematical Reasoning via Verifiable Code Generation
Sina Bagheri Nezhad,Yao Li,Ameeta Agrawal
Main category: cs.CL
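TL;DR: SymCode is a neurosymbolic framework that recasts mathematical problem solving as verifiable code generation with the SymPy library, significantly improving the accuracy and trustworthiness of LLMs on complex mathematical reasoning.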
Details
Motivation: On complex mathematical reasoning, LLMs often produce unverifiable and arithmetically unsound solutions because they generate prose, and existing prompting methods lack a mechanism for deterministic verification.
Method: Proposes the SymCode framework, which uses the SymPy library to guide the LLM to generate executable, verifiable code, grounding mathematical reasoning in a deterministic symbolic engine.
Result: On challenging benchmarks such as MATH-500 and OlympiadBench, SymCode improves on baselines by up to 13.6 percentage points, uses fewer tokens, and makes errors more transparent.
Conclusion: By anchoring LLM reasoning in a deterministic symbolic system, SymCode is an important step toward more accurate and trustworthy AI in formal domains.
Abstract: Large Language Models (LLMs) often struggle with complex mathematical reasoning, where prose-based generation leads to unverified and arithmetically unsound solutions. Current prompting strategies like Chain of Thought still operate within this unreliable medium, lacking a mechanism for deterministic verification. To address these limitations, we introduce SymCode, a neurosymbolic framework that reframes mathematical problem-solving as a task of verifiable code generation using the SymPy library. We evaluate SymCode on challenging benchmarks, including MATH-500 and OlympiadBench, demonstrating significant accuracy improvements of up to 13.6 percentage points over baselines. Our analysis shows that SymCode is not only more token-efficient but also fundamentally shifts model failures from opaque logical fallacies towards transparent, programmatic errors. By grounding LLM reasoning in a deterministic symbolic engine, SymCode represents a key step towards more accurate and trustworthy AI in formal domains.
[16] NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium
Dinghong Song,Jierui Xu,Weichu Yang,Pengfei Su,Dong Li
Main category: cs.CL
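TL;DR: This paper designs high-performance matrix multiplication (matmul) kernels for LLM inference on the AWS Trainium AI accelerator. Customized kernel fusion and novel caching strategies reduce data movement, maximize SRAM bandwidth, and avoid expensive matrix transposes, yielding significant speedups across multiple datasets and LLMs.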
Details
Motivation: Trainium's heterogeneous architecture suits LLM training and inference, but its systolic array design and special data layout requirements make high-performance implementations challenging, particularly for matrix multiplication, a critical compute kernel.
Method: Proposes a series of optimizations tailored to Trainium, based on kernel fusion and novel caching strategies, to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transposes.
Result: Experiments on nine datasets and four recent LLMs show that, compared with the state-of-the-art matmul implemented by AWS on Trainium, the proposed approach achieves an average 1.35x speedup (up to 2.22x) at the matmul kernel level and an average 1.66x speedup (up to 2.49x) for end-to-end LLM inference.
Conclusion: The Trainium-specific matmul optimizations overcome the architecture's constraints and substantially improve LLM inference efficiency, showing the value of low-level kernel optimization on dedicated AI accelerators.
Abstract: AI accelerators, customized to AI workloads, provide cost-effective and high-performance solutions for training and inference. Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides an attractive option for LLM training and inference through its heterogeneous architecture. However, leveraging Trainium architecture for high performance can be challenging because of its systolic array architecture and special requirement on data layout. In this paper, we design high-performance matrix multiplication (matmul), a critical compute kernel, for LLM inference on Trainium. We introduce a series of techniques customized to Trainium based on kernel fusion and novel caching strategies to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transpose. Evaluating with nine datasets and four recent LLMs, we show that our system largely outperforms the state-of-the-art matmul implemented by AWS on Trainium: at the level of matmul kernel, it achieves an average 1.35x speedup (up to 2.22x), which translates to an average 1.66x speedup (up to 2.49x) for end-to-end LLM inference.
[17] AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
Dinghong Song,Yuan Feng,Yiwei Wang,Shangye Chen,Cyril Guyot,Filip Blagojevic,Hyeran Jeon,Pengfei Su,Dong Li
Main category: cs.CL
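TL;DR: This paper proposes AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps, achieving significant speedups on both CPU and GPU with negligible accuracy loss.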
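The retrieve-and-reuse idea can be pictured as a cache keyed by input embeddings, with a similarity threshold deciding whether a stored attention map is close enough to reuse. The sketch below is only a toy under stated assumptions: the class name `AttentionMapCache`, the cosine-similarity lookup, the linear scan (standing in for an efficient similarity search), and the 0.95 threshold are all hypothetical and not AttnCache's actual design.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class AttentionMapCache:
    """Toy cache mapping input embeddings to previously computed attention maps."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, attention_map) pairs

    def add(self, embedding, attention_map):
        self.entries.append((embedding, attention_map))

    def lookup(self, embedding):
        """Return the most similar cached map above the threshold, else None."""
        best_map, best_sim = None, self.threshold
        for key, attn in self.entries:
            sim = cosine(embedding, key)
            if sim >= best_sim:
                best_map, best_sim = attn, sim
        return best_map

cache = AttentionMapCache()
cache.add([1.0, 0.0], "map_A")          # placeholder for a real attention map
hit = cache.lookup([0.99, 0.05])        # close to the stored key: reuse
miss = cache.lookup([0.0, 1.0])         # orthogonal: recompute attention
print(hit, miss)
```

On a miss, a real system would compute self-attention as usual and could then add the result to the cache.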
Details
Motivation: In real-world workloads that need only the prefill stage, self-attention becomes the performance bottleneck because of its quadratic complexity in sequence length, so the computation in this stage needs to be optimized.
Method: Building on the observation that different sentences often produce similar attention maps, constructs an attention map memorization database and uses efficient caching and similarity search to identify and reuse cached attention maps during inference, reducing the cost of self-attention.
Result: AttnCache achieves an average 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with minimal accuracy loss.
Conclusion: AttnCache relieves the self-attention bottleneck in the prefill stage and offers efficient inference acceleration for prefill-dependent tasks such as classification, question answering, and recommendation.
Abstract: Large Language Models (LLMs) are widely used in generative applications such as chatting, code generation, and reasoning. However, many real-world workloads such as classification, question answering, recommendation, and text embedding rely solely on the prefill stage of inference, where the model encodes input sequences without performing autoregressive decoding. In these prefill-only scenarios, the self-attention computation becomes the primary performance bottleneck due to its quadratic complexity with respect to sequence length. In this paper, we observe that semantically different sentences often produce similar attention maps across layers and heads. Building on this insight, we propose AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps. Based on an attention map memorization database, AttnCache employs efficient caching and similarity search techniques to identify and reuse pre-cached attention maps during inference, thereby reducing the computational overhead of self-attention. Experimental results show that AttnCache achieves an average of 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with negligible accuracy degradation.
[18] Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Yihe Deng,I-Hung Hsu,Jun Yan,Zifeng Wang,Rujun Han,Gufeng Zhang,Yanfei Chen,Wei Wang,Tomas Pfister,Chen-Yu Lee
Main category: cs.CL
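TL;DR: Proposes SRL, a new training framework that reformulates problem solving as generating a sequence of logical "actions", combining the strengths of supervised learning and reinforcement learning to improve small language models on multi-step reasoning tasks.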
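A step-wise, similarity-based reward can be sketched with a generic string-similarity measure. The example below uses Python's standard `difflib.SequenceMatcher` as a stand-in; which similarity function SRL actually uses is not stated in this entry, so the choice and the function names are assumptions for illustration.

```python
from difflib import SequenceMatcher

def step_reward(model_action, expert_action):
    """Similarity in [0, 1] between a model action and the expert action.

    Assumed reward for illustration: difflib's ratio of matching characters.
    """
    return SequenceMatcher(None, model_action, expert_action).ratio()

def stepwise_rewards(model_actions, expert_actions):
    """One reward per aligned step, so partial progress still earns signal."""
    return [step_reward(m, e) for m, e in zip(model_actions, expert_actions)]

# Made-up action sequences for a two-step algebra problem.
expert = ["subtract 3 from both sides", "divide both sides by 2"]
model = ["subtract 3 from both sides", "multiply both sides by 2"]
print(stepwise_rewards(model, expert))
```

Unlike a single pass/fail reward on the final answer, the second step here still receives partial credit, which matches the entry's point about getting a learning signal even when every rollout is wrong.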
Details
Motivation: Small open-source LLMs perform poorly on multi-step reasoning tasks: RLVR struggles to sample correct solutions, and SFT tends to overfit long demonstrations, leaving no effective learning signal.
Method: Reformulates problem solving as generating a sequence of logical actions, introduces an internal reasoning monologue before each action, and provides smooth step-wise rewards based on similarity to expert demonstration actions, giving richer supervision at every step.
Result: SRL enables small models to learn hard problems previously unlearnable with SFT or RLVR. Initializing with SRL and then refining with RLVR gives the best performance, and the approach generalizes well to reasoning benchmarks and agentic software engineering tasks.
Conclusion: SRL is a robust and versatile training framework for reasoning-oriented LLMs that addresses the shortcomings of SFT and RLVR and improves multi-step reasoning in small models.
Abstract: Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
[19] PORTool: Tool-Use LLM Training with Rewarded Tree
Feijie Wu,Weiwu Zhu,Yuxiang Zhang,Soumya Chatterjee,Jiarong Zhu,Fan Mo,Rodin Luo,Jing Gao
Main category: cs.CL
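TL;DR: Proposes PORTool, a reinforcement learning method that improves the exploration ability and performance of tool-use large language models in dynamic environments.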
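The blending of fork-relative and trajectory-relative advantages that this entry's abstract describes might look roughly like the sketch below. Everything specific here is an assumption made for illustration: mean-centring as the baseline, the mixing weight `alpha`, and the function names are not taken from the paper.

```python
from statistics import mean


def relative_advantages(rewards):
    """Mean-centre a group of rewards.

    Applied to the steps branching from one fork, this gives fork-relative
    advantages; applied to whole rollouts, trajectory-relative advantages.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def blended_advantage(fork_adv: float, traj_adv: float, alpha: float = 0.5) -> float:
    """Mix the two signals; `alpha` is a hypothetical weight."""
    return alpha * fork_adv + (1.0 - alpha) * traj_adv
```

For example, three sibling steps with rewards 1.0, 0.0, and 0.5 get fork-relative advantages 0.5, -0.5, and 0.0, so the step that leads toward a correct answer is reinforced relative to its alternatives under the same fork.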
Details
Motivation: Existing tool-use LLMs are trained on static datasets and cannot explore multiple solution paths in dynamic environments, which limits their performance.
Method: PORTool generates multiple tool-call trajectories that form a tree, assigns step-wise rewards according to each step's contribution to a correct answer, derives fork-relative advantages from shared steps and branch differences, and blends them with trajectory-level advantages for reinforcement learning.
Result: In experiments that use 17 tools to handle user queries, PORTool significantly outperforms other training methods in final accuracy and in the number of tool-call steps; ablation studies confirm the effectiveness and robustness of the step-wise reward design.
Conclusion: By introducing tree-structured multi-trajectory exploration and fine-grained rewards, PORTool improves the reasoning and execution ability of tool-use LLMs in complex, dynamic environments.
Abstract: Current tool-use large language models (LLMs) are trained on static datasets, enabling them to interact with external tools and perform multi-step, tool-integrated reasoning, which produces tool-call trajectories. However, these models imitate how a query is resolved in a generic tool-call routine, thereby failing to explore possible solutions and demonstrating limited performance in an evolved, dynamic tool-call environment. In this work, we propose PORTool, a reinforcement learning (RL) method that encourages a tool-use LLM to explore various trajectories yielding the correct answer. Specifically, this method starts with generating multiple rollouts for a given query, and some of them share the first few tool-call steps, thereby forming a tree-like structure. Next, we assign rewards to each step, based on its ability to produce a correct answer and make successful tool calls. A shared step across different trajectories receives the same reward, while different steps under the same fork receive different rewards. Finally, these step-wise rewards are used to calculate fork-relative advantages, blended with trajectory-relative advantages, to train the LLM for tool use. The experiments utilize 17 tools to address user queries, covering both time-sensitive and time-invariant topics. We conduct ablation studies to systematically justify the necessity and the design robustness of step-wise rewards. Furthermore, we compare the proposed PORTool with other training approaches and demonstrate significant improvements in final accuracy and the number of tool-call steps.
[20] Rethinking Cross-lingual Alignment: Balancing Transfer and Cultural Erasure in Multilingual LLMs
HyoJung Han,Sweta Agrawal,Eleftheria Briakou
Main category: cs.CL
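TL;DR: Introduces a new evaluation framework, the transfer-localization plane, to analyze the trade-off between knowledge transfer and cultural erasure in cross-lingual alignment, and proposes Surgical Steering, which decouples factual transfer from cultural localization at different model layers and achieves a better balance.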
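Layer-targeted activation steering, the mechanism behind Surgical Steering in this entry, can be illustrated with a toy sketch. It assumes one activation vector per layer stored in a plain dict and treats layers independently; in a real model the addition would happen inside the forward pass so that later layers see the steered activations. The function name and the `strength` parameter are hypothetical.

```python
def steer(hidden_states, steering_vectors, strength=1.0):
    """Add a steering vector to the activations of selected layers only.

    hidden_states:    dict mapping layer index -> list of floats
    steering_vectors: dict mapping layer index -> list of floats,
                      given only for the layers that should be steered
    """
    steered = {}
    for layer, activation in hidden_states.items():
        vector = steering_vectors.get(layer)
        if vector is None:
            steered[layer] = list(activation)  # layer left untouched
        else:
            steered[layer] = [a + strength * v for a, v in zip(activation, vector)]
    return steered
```

Passing different vectors for different layer indices mirrors the abstract's finding that factual transfer and culturally specific knowledge are best steered at different layers.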
Details
Motivation: Cross-lingual alignment helps knowledge transfer but can cause the loss of culture-specific responses, i.e., "cultural erasure"; the paper aims to expose and mitigate this effect.
Method: Introduces the transfer-localization plane as an evaluation framework and proposes Surgical Steering, which applies targeted activation steering to different model layers at inference time to separate universal knowledge transfer from culturally specific responses.
Result: Re-evaluating existing cross-lingual alignment methods shows that they improve factual transfer while harming cultural localization; Surgical Steering balances the two better across multiple languages.
Conclusion: Cross-lingual alignment must account for both knowledge transfer and cultural preservation; layer-wise control mitigates cultural erasure, and Surgical Steering points toward finer-grained control of multilingual models.
Abstract: Cross-lingual alignment (CLA) aims to align multilingual representations, enabling Large Language Models (LLMs) to seamlessly transfer knowledge across languages. While this goal is intuitive, we hypothesize that the pursuit of representational convergence can inadvertently cause "cultural erasure", the functional loss of providing culturally-situated responses that should diverge based on the query language. In this work, we systematically analyze this trade-off by introducing a holistic evaluation framework, the transfer-localization plane, which quantifies both desirable knowledge transfer and undesirable cultural erasure. Using this framework, we re-evaluate recent CLA approaches and find that they consistently improve factual transfer at the direct cost of cultural localization across all six languages studied. Our investigation into the internal representations of these models reveals a key insight: universal factual transfer and culturally-specific knowledge are optimally steerable at different model layers. Based on this finding, we propose Surgical Steering, a novel inference-time method that disentangles these two objectives. By applying targeted activation steering to distinct layers, our approach achieves a better balance between the two competing dimensions, effectively overcoming the limitations of current alignment techniques.
[21] Artificial Intelligence-Enabled Analysis of Radiology Reports: Epidemiology and Consequences of Incidental Thyroid Findings
Felipe Larios,Mariana Borras-Osorio,Yuqi Wu,Ana Gabriela Claros,David Toro-Tobon,Esteban Cabezas,Ricardo Loor-Torres,Maria Mateo Chavez,Kerly Guevara Maldonado,Luis Vilatuna Andrango,Maria Lizarazo Jimenez,Ivan Mateo Alzamora,Misk Al Zahidy,Marcelo Montero,Ana Cristina Proano,Cristian Soto Jacome,Jungwei W. Fan,Oscar J. Ponce-Ponte,Megan E. Branda,Naykky Singh Ospina,Juan P. Brito
Main category: cs.CL
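TL;DR: Uses natural language processing to analyze incidental thyroid findings (ITFs) in radiology reports, finds a prevalence of 7.8%, and shows that ITFs are closely tied to downstream overdiagnosis and the detection of low-risk thyroid cancers, pointing to the need for standardized reporting and better follow-up strategies.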
Details
Motivation: Incidental thyroid abnormalities are increasingly found on imaging performed for non-thyroid indications, but their prevalence, features, and clinical impact remain unclear, so a systematic assessment is needed to improve management.
Method: Develops and validates a transformer-based NLP pipeline that identifies ITFs and extracts nodule characteristics from radiology reports across modalities and body regions, applies it to a retrospective cohort of 115,683 patients, and uses logistic regression to identify associated factors.
Result: Among 115,683 patients, 7.8% had an ITF, of which 92.9% were nodules. ITFs were more likely in women, older adults, patients with higher BMI, and imaging ordered by certain departments. Compared with chest CT, neck CT, PET, and nuclear medicine scans were more likely to reveal ITFs. Nodule characteristics were poorly documented: size was reported in only 44% and other features in fewer than 15%. Patients with ITFs had higher rates of ultrasound, biopsy, surgery, and cancer diagnosis; most cancers were papillary and were larger when detected after an ITF.
Conclusion: ITFs are common and strongly associated with downstream diagnostic cascades that lead to overdiagnosis of small, low-risk thyroid cancers, which underscores the need for standardized reporting and selective follow-up.
Abstract: Importance: Incidental thyroid findings (ITFs) are increasingly detected on imaging performed for non-thyroid indications. Their prevalence, features, and clinical consequences remain undefined. Objective: To develop, validate, and deploy a natural language processing (NLP) pipeline to identify ITFs in radiology reports and assess their prevalence, features, and clinical outcomes. Design, Setting, and Participants: Retrospective cohort of adults without prior thyroid disease undergoing thyroid-capturing imaging at Mayo Clinic sites from July 1, 2017, to September 30, 2023. A transformer-based NLP pipeline identified ITFs and extracted nodule characteristics from image reports from multiple modalities and body regions. Main Outcomes and Measures: Prevalence of ITFs, downstream thyroid ultrasound, biopsy, thyroidectomy, and thyroid cancer diagnosis. Logistic regression identified demographic and imaging-related factors. Results: Among 115,683 patients (mean age, 56.8 [SD 17.2] years; 52.9% women), 9,077 (7.8%) had an ITF, of which 92.9% were nodules. ITFs were more likely in women, older adults, those with higher BMI, and when imaging was ordered by oncology or internal medicine. Compared with chest CT, ITFs were more likely via neck CT, PET, and nuclear medicine scans. Nodule characteristics were poorly documented, with size reported in 44% and other features in fewer than 15% (e.g., calcifications). Compared with patients without ITFs, those with ITFs had higher odds of thyroid nodule diagnosis, biopsy, thyroidectomy and thyroid cancer diagnosis. Most cancers were papillary, and larger when detected after ITFs vs no ITF. Conclusions: ITFs were common and strongly associated with cascades leading to the detection of small, low-risk cancers. These findings underscore the role of ITFs in thyroid cancer overdiagnosis and the need for standardized reporting and more selective follow-up.
[22] QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback
Taku Mikuriya,Tatsuya Ishigaki,Masayuki Kawarada,Shunya Minami,Tadashi Kadowaki,Yohichi Suzuki,Soshun Naito,Shunya Takata,Takumi Kato,Tamotsu Basseda,Reo Yamada,Hiroya Takamura
Main category: cs.CL
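TL;DR: Introduces QCoder Benchmark, a framework for evaluating large language models on quantum programming tasks that combines quantum-simulator feedback with human-written code from real programming contests, revealing both the limits and the potential of current models on this difficult task.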
Details
Motivation: Quantum programming combines natural language, human knowledge, and logic that interacts with hardware; LLMs remain underexplored in domains that require hardware feedback, so a dedicated evaluation benchmark is needed.
Method: Builds QCoder Benchmark, which integrates a quantum simulator environment to obtain domain-specific feedback such as circuit depth, execution time, and error classification, and includes human-written code from real programming contests for quantitative and qualitative comparison.
Result: GPT-4o reaches only 18.97% accuracy, while the reasoning-based model o3 reaches up to 78%, exceeding the average success rate of human-written code (39.98%).
Conclusion: QCoder Benchmark is an effective tool for evaluating LLMs on quantum programming; the results indicate that reasoning is central to generation quality, and the benchmark supports further research in this area.
Abstract: Large language models (LLMs) have increasingly been applied to automatic programming code generation. This task can be viewed as a language generation task that bridges natural language, human knowledge, and programming logic. However, it remains underexplored in domains that require interaction with hardware devices, such as quantum programming, where human coders write Python code that is executed on a quantum computer. To address this gap, we introduce QCoder Benchmark, an evaluation framework that assesses LLMs on quantum programming with feedback from simulated hardware devices. Our benchmark offers two key features. First, it supports evaluation using a quantum simulator environment beyond conventional Python execution, allowing feedback of domain-specific metrics such as circuit depth, execution time, and error classification, which can be used to guide better generation. Second, it incorporates human-written code submissions collected from real programming contests, enabling both quantitative comparisons and qualitative analyses of LLM outputs against human-written code. Our experiments reveal that even advanced models like GPT-4o achieve only around 18.97% accuracy, highlighting the difficulty of the benchmark. In contrast, reasoning-based models such as o3 reach up to 78% accuracy, outperforming the average success rate of human-written code (39.98%). We release the QCoder Benchmark dataset and public evaluation API to support further research.
[23] Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking
Feng Ju,Zeyu Qin,Rui Min,Zhitao He,Lingpeng Kong,Yi R. Fung
Main category: cs.CL
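TL;DR: Proposes the "one problem, multiple solutions" (1PNS) training paradigm, combined with the Reasoning Path Divergence (RPD) metric, to improve the reasoning diversity and performance of large models under test-time scaling.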
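This entry reports its gains in pass@k (for example pass@16). As a reference point, here is a sketch of the commonly used unbiased pass@k estimator, which gives the probability that at least one of k samples drawn without replacement from n generations is correct when c of the n are correct. The abstract does not state which estimator the authors use, so treating it as this one is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k).

    n: total samples generated for a problem
    c: number of those samples that are correct
    k: evaluation budget (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the result is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The metric rewards a model for having any correct path among its samples, which is why more diverse reasoning trajectories can raise pass@k even when single-sample accuracy is unchanged.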
Details
Motivation: The conventional "one problem, one solution" (1P1S) training practice limits the diversity of a model's reasoning paths, which caps the benefit of test-time scaling.
Method: Proposes the 1PNS training paradigm and introduces the Reasoning Path Divergence (RPD) metric to quantify semantic differences between multi-step reasoning chains; RPD is used to select diverse solutions for fine-tuning.
Result: Validated on Qwen3-4B-Base: compared with a strong 1P1S baseline, average pass@16 improves by 2.80% and AIME24 improves by 4.99%, with markedly more diverse outputs.
Conclusion: 1PNS combined with RPD increases reasoning diversity and further strengthens test-time scaling, offering a new direction for training large models.
Abstract: While Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), low diversity in model outputs often becomes a bottleneck; this is partly caused by the common "one problem, one solution" (1P1S) training practice, which provides a single canonical answer and can push models toward a narrow set of reasoning paths. To address this, we propose a "one problem, multiple solutions" (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories and thus increases inference diversity. A core challenge for 1PNS is reliably measuring semantic differences between multi-step chains of thought, so we introduce Reasoning Path Divergence (RPD), a step-level metric that aligns and scores Long Chain-of-Thought solutions to capture differences in intermediate reasoning. Using RPD, we curate maximally diverse solution sets per problem and fine-tune Qwen3-4B-Base. Experiments show that RPD-selected training yields more varied outputs and higher pass@k, with an average +2.80% gain in pass@16 over a strong 1P1S baseline and a +4.99% gain on AIME24, demonstrating that 1PNS further amplifies the effectiveness of TTS. Our code is available at https://github.com/fengjujf/Reasoning-Path-Divergence .
[24] On the Influence of Discourse Relations in Persuasive Texts
Nawar Turk,Sevag Kaspar,Leila Kosseim
Main category: cs.CL
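TL;DR: Studies the relationship between persuasion techniques (PTs) and discourse relations (DRs), builds annotated datasets with large language models (LLMs) and prompt engineering, and finds that six discourse relations play a key role in persuasive texts.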
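The ensemble step in this entry, where the labels of 40 classifiers are pooled by majority, can be sketched as below. The abstract does not define its pooling strategies, so the vote-share threshold `min_share` and the function name are hypothetical; the point is only to show why a stricter rule yields a smaller silver dataset.

```python
from collections import Counter


def majority_label(votes, min_share=0.5):
    """Return the most common label if its vote share exceeds `min_share`.

    Returns None when no label is supported strongly enough, in which case
    the instance would be left out of the silver dataset.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_share else None
```

With 25 of 40 votes for one relation, the instance is kept at `min_share=0.5` but dropped at `min_share=0.75`, which is consistent with dataset sizes that shrink as the pooling rule tightens.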
Details
Motivation: No dataset is annotated with both persuasion techniques and discourse relations; the study uses LLMs to fill this gap and to explore how the two are related.
Method: Starting from the SemEval 2023 Task 3 dataset, uses 4 LLMs and 10 prompts to produce 40 DR classifiers, then applies ensemble models with majority-pooling strategies to build 5 silver datasets labelled with both PTs and PDTB 3.0 level-2 DRs.
Result: Builds several silver datasets (from 204 to 1,281 instances); statistical analysis shows that six discourse relations (Cause, Purpose, Contrast, Cause+Belief, Concession, and Condition) play an important role in persuasion techniques such as Loaded Language, Exaggeration/Minimisation, Repetition, and Doubt.
Conclusion: These six discourse relations are central to understanding how persuasion works, which can help detect online propaganda and misinformation and improve our understanding of effective communication.
Abstract: This paper investigates the relationship between Persuasion Techniques (PTs) and Discourse Relations (DRs) by leveraging Large Language Models (LLMs) and prompt engineering. Since no dataset annotated with both PTs and DRs exists, we took the SemEval 2023 Task 3 dataset labelled with 19 PTs as a starting point and developed LLM-based classifiers to label each instance of the dataset with one of the 22 PDTB 3.0 level-2 DRs. In total, four LLMs were evaluated using 10 different prompts, resulting in 40 unique DR classifiers. Ensemble models using different majority-pooling strategies were used to create 5 silver datasets of instances labelled with both persuasion techniques and level-2 PDTB senses. The silver dataset sizes vary from 1,281 instances to 204 instances, depending on the majority pooling technique used. Statistical analysis of these silver datasets shows that six discourse relations (namely Cause, Purpose, Contrast, Cause+Belief, Concession, and Condition) play a crucial role in persuasive texts, especially in the use of Loaded Language, Exaggeration/Minimisation, Repetition and to cast Doubt. This insight can contribute to detecting online propaganda and misinformation, as well as to our general understanding of effective communication.
[25] MossNet: Mixture of State-Space Experts is a Multi-Head Attention
Shikhar Tuli,James Seale Smith,Haris Jeelani,Chi-Heng Lin,Abhishek Patel,Vasili Ramanishka,Yen-Chang Hsu,Hongxia Jin
Main category: cs.CL
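TL;DR: Proposes MossNet, a new model based on a mixture-of-state-space-experts (MoSS) architecture that emulates linear multi-head attention, outperforms transformer and SSM models of similar size on language modeling and downstream tasks, scales well, and runs efficiently on real devices.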
Details
Motivation: Existing methods based on state-space models (SSMs) or gated-recurrent models (GRMs) usually emulate only a single attention head, which limits expressiveness, so a more expressive yet efficient architecture is needed for generative language models.
Method: Proposes MossNet, a mixture-of-state-space-experts (MoSS) architecture that applies mixture-of-experts (MoE) both in the channel-mixing MLP blocks and in the time-mixing SSM kernels to realize multiple "attention heads" and thereby emulate linear multi-head attention (MHA).
Result: On language modeling and downstream tasks, MossNet outperforms transformer and SSM models of the same size and data budget; larger variants trained on trillions of tokens scale well; profiling on a Samsung Galaxy S24 Ultra and an Nvidia A100 shows favorable runtime speed and resource usage.
Conclusion: MossNet is a competitive new direction for efficient, high-performing recurrent LLM architectures.
Abstract: Large language models (LLMs) have significantly advanced generative applications in natural language processing (NLP). Recent trends in model architectures revolve around efficient variants of transformers or state-space/gated-recurrent models (SSMs, GRMs). However, prevailing SSM/GRM-based methods often emulate only a single attention head, potentially limiting their expressiveness. In this work, we propose MossNet, a novel mixture-of-state-space-experts architecture that emulates a linear multi-head attention (MHA). MossNet leverages a mixture-of-experts (MoE) implementation not only in channel-mixing multi-layered perceptron (MLP) blocks but also in the time-mixing SSM kernels to realize multiple "attention heads." Extensive experiments on language modeling and downstream evaluations show that MossNet outperforms both transformer- and SSM-based architectures of similar model size and data budgets. Larger variants of MossNet, trained on trillions of tokens, further confirm its scalability and superior performance. In addition, real-device profiling on a Samsung Galaxy S24 Ultra and an Nvidia A100 GPU demonstrates favorable runtime speed and resource usage compared to similarly sized baselines. Our results suggest that MossNet is a compelling new direction for efficient, high-performing recurrent LLM architectures.
[26] Similarity-Distance-Magnitude Language Models
Allen Schmaltz
Main category: cs.CL
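TL;DR: Proposes SDM language models, which are fine-tuned so that more generations fall in the well-calibrated, high-probability region defined by a final-layer SDM activation layer that performs binary classification of instruction-following, improving statistical efficiency.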
Details
Motivation: To improve the calibration and generation efficiency of language models on instruction-following tasks and to reduce abstentions.
Method: Applies supervised fine-tuning to a pre-trained decoder-only Transformer, adds a final-layer SDM activation layer, and combines a contrastive input encoding with hard negative examples generated online to adjust the supervised next-token loss.
Result: Compared with strong supervised baselines, SDM language models abstain less often, which improves statistical efficiency.
Conclusion: SDM language models improve the generation quality and calibration of existing language models on instruction-following tasks.
Abstract: We introduce Similarity-Distance-Magnitude (SDM) language models (LMs), which are sequence prediction models fine-tuned to maximize the proportion of generations in the well-calibrated, high-probability region partitioned by a final-layer SDM activation layer used for binary classification of instruction-following. We demonstrate that existing pre-trained decoder-only Transformer LMs can be readily converted into SDM LMs via supervised fine-tuning, using the final-layer SDM activation layer during training to estimate a change-of-base for a supervised next-token loss over a contrastive input encoding scheme, with additional hard negative examples generated online during training. This results in reduced abstentions (i.e., improved statistical efficiency) compared to strong supervised baselines.
[27] RCScore: Quantifying Response Consistency in Large Language Models
Dongjun Jang,Youngchae Ahn,Hyopil Shin
Main category: cs.CL
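TL;DR: RCScore is a multi-dimensional framework that quantifies how instruction formulation affects LLM responses and reveals performance differences that conventional metrics miss; it also introduces Cross-Response Similarity (CRS) to measure stylistic self-consistency, which correlates strongly with task accuracy.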
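One way to picture a cross-response consistency score like the CRS in this entry is as the average pairwise similarity of a model's answers to the same problem posed in different instruction styles. The sketch below uses token-level Jaccard similarity purely as a stand-in; the abstract says CRS applies RCScore's own metrics, which are not reproduced here, so both the similarity measure and the function names are assumptions.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two responses, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def cross_response_similarity(responses):
    """Mean pairwise similarity over responses to differently styled prompts."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A score near 1 means the model answers the same way regardless of phrasing; the abstract reports that this kind of consistency correlates strongly with task accuracy.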
Details
Motivation: LLM evaluations usually rely on a single instruction template and overlook sensitivity to instruction style, which matters in real deployments, so a more complete framework for measuring instruction robustness is needed.
Method: Builds RCScore by systematically converting benchmark problems into multiple instruction styles, and introduces the Cross-Response Similarity (CRS) measure to quantify how consistent a model's responses are across styles.
Result: Experiments with ten LLMs on four reasoning benchmarks show that instruction style can change accuracy by up to 16.7 percentage points; deterministic decoding yields more stable outputs, and model scale correlates positively with cross-style consistency.
Conclusion: RCScore offers a systematic way to assess instruction robustness, and CRS can serve as an effective proxy for model reliability, supporting more stable and trustworthy LLMs in real-world use.
Abstract: Current LLM evaluations often rely on a single instruction template, overlooking models' sensitivity to instruction style, a critical aspect for real-world deployments. We present RCScore, a multi-dimensional framework quantifying how instruction formulation affects model responses. By systematically transforming benchmark problems into multiple instruction styles, RCScore reveals performance variations undetected by conventional metrics. Our experiments across ten LLMs on four reasoning benchmarks demonstrate that instruction style can shift accuracy by up to 16.7 percentage points. We introduce Cross-Response Similarity (CRS), a method applying RCScore metrics to measure stylistic self-consistency, and establish its strong correlation with task accuracy, suggesting consistency as a valuable proxy for model reliability. Additional findings show that deterministic decoding produces more stylistically stable outputs, and model scale correlates positively with cross-style consistency. RCScore offers a principled approach to assess instruction robustness.
[28] Don't Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation
Woojin Kim,Jaeyoung Do
Main category: cs.CL
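TL;DR: Proposes Token Timestep Allocation (TTA), which assigns token-specific timestep schedules to mitigate "update forgetting" in diffusion language models, improving the controllability and fluency of generated text.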
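A per-token timestep schedule of the kind this entry describes can be sketched as follows: tokens judged critical are frozen after a few steps, while uncertain tokens keep being refined. The linear mapping from a per-token score to a freeze step, the interpretation of that score, and the function names are all assumptions for illustration, not the paper's schedule.

```python
def allocate_freeze_steps(scores, total_steps):
    """Map each token's score in [0, 1] to the last step at which it is updated.

    Assumption: a score of 1.0 marks a critical (settled) token that should be
    frozen early, and 0.0 marks an uncertain token refined until the end.
    """
    steps = []
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
        steps.append(max(1, round((1.0 - s) * total_steps)))
    return steps


def active_tokens(freeze_steps, step):
    """Indices of tokens still being refined at `step` (1-based)."""
    return [i for i, last in enumerate(freeze_steps) if step <= last]
```

With scores 1.0, 0.5, and 0.0 over 10 steps, the tokens are updated through steps 1, 5, and 10 respectively, so an early edit to the first token cannot be overwritten by later uniform updates.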
Details
Motivation: Diffusion language models are promising for fine-grained editing but are hard to control, mainly because uniform, context-agnostic updates gradually erase earlier semantic edits, a problem called "update forgetting".
Method: Proposes Token Timestep Allocation (TTA), which gives each token its own timestep schedule: critical tokens are frozen early and uncertain tokens continue to be refined. The schedule can be fixed or adaptive to task signals; TTA operates purely at inference time and applies to various DLMs and supervision sources.
Result: On sentiment control, TTA improves accuracy by more than 20 percent and nearly halves perplexity while using less than one fifth of the steps; on detoxification, maximum toxicity drops from 14.5 to 12.2 and perplexity from 32.0 to 26.0.
Conclusion: Soft token ordering through timestep allocation is the key mechanism for mitigating update forgetting and achieving stable, controllable diffusion text generation.
Abstract: While diffusion language models (DLMs) enable fine-grained refinement, their practical controllability remains fragile. We identify and formally characterize a central failure mode called update forgetting, in which uniform and context-agnostic updates induce token-level fluctuations across timesteps, erasing earlier semantic edits and disrupting the cumulative refinement process, thereby degrading fluency and coherence. As this failure originates in uniform and context-agnostic updates, effective control demands explicit token ordering. We propose Token Timestep Allocation (TTA), which realizes soft and semantic token ordering via per-token timestep schedules: critical tokens are frozen early, while uncertain tokens receive continued refinement. This timestep-based ordering can be instantiated as either a fixed policy or an adaptive policy driven by task signals, thereby supporting a broad spectrum of refinement strategies. Because it operates purely at inference time, it applies uniformly across various DLMs and naturally extends to diverse supervision sources. Empirically, TTA improves controllability and fluency: on sentiment control, it yields more than 20 percent higher accuracy and nearly halves perplexity using less than one fifth the steps; in detoxification, it lowers maximum toxicity (12.2 versus 14.5) and perplexity (26.0 versus 32.0). Together, these results demonstrate that softened ordering via timestep allocation is the critical lever for mitigating update forgetting and achieving stable and controllable diffusion text generation.
[29] What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data
Rajiv Movva,Smitha Milli,Sewon Min,Emma Pierson
Main category: cs.CL
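TL;DR: Proposes WIMHF, a method that uses sparse autoencoders to explain human feedback data, reveals the interpretable features behind human preferences in different datasets, and demonstrates applications to safety improvement and personalization.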
Details
Motivation: Human feedback can encode complex preferences that practitioners do not clearly understand, which makes language model training unpredictable or even harmful; a method is needed that extracts the key features of feedback automatically, without pre-specified hypotheses.
Method: Proposes WIMHF, which analyzes human feedback data with sparse autoencoders and extracts sparse, interpretable latent features that characterize both the preferences a dataset can measure and the preferences annotators actually express.
Result: Validated on 7 datasets: a small number of interpretable features account for most of the preference prediction signal. The features expose differences between sources (for example, Reddit versus HH-RLHF) and surface potentially unsafe preferences (for example, LMArena users tend to vote against refusals). Re-labeling data with these features yields large safety gains (+37%) and supports fine-grained, annotator-specific personalization.
Conclusion: WIMHF is a human-centered analysis tool for understanding human feedback that helps improve model safety, interpretability, and personalization, and supports more reliable use of preference data.
Abstract: Human feedback can alter language models in unpredictable and undesirable ways, as practitioners lack a clear understanding of what feedback data encodes. While prior work studies preferences over certain attributes (e.g., length or sycophancy), automatically extracting relevant features without pre-specifying hypotheses remains challenging. We introduce What's In My Human Feedback? (WIMHF), a method to explain feedback data using sparse autoencoders. WIMHF characterizes both (1) the preferences a dataset is capable of measuring and (2) the preferences that the annotators actually express. Across 7 datasets, WIMHF identifies a small number of human-interpretable features that account for the majority of the preference prediction signal achieved by black-box models. These features reveal a wide diversity in what humans prefer, and the role of dataset-level context: for example, users on Reddit prefer informality and jokes, while annotators in HH-RLHF and PRISM disprefer them. WIMHF also surfaces potentially unsafe preferences, such as that LMArena users tend to vote against refusals, often in favor of toxic content. The learned features enable effective data curation: re-labeling the harmful examples in Arena yields large safety gains (+37%) with no cost to general performance. They also allow fine-grained personalization: on the Community Alignment dataset, we learn annotator-specific weights over subjective features that improve preference prediction. WIMHF provides a human-centered analysis method for practitioners to better understand and use preference data.
[30] Towards Global Retrieval Augmented Generation: A Benchmark for Corpus-Level Reasoning
Qi Luo,Xiaonan Li,Tingshuo Fan,Xinchi Chen,Xipeng Qiu
Main category: cs.CL
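TL;DR: Introduces GlobalQA, the first benchmark for evaluating global retrieval-augmented generation (global RAG), and proposes the GlobalRAG framework, which substantially outperforms existing methods on multi-document aggregation tasks.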
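The corpus-level aggregation that this entry targets (counting, top-k) is the kind of step that is easier to do symbolically than inside a language model. A minimal sketch follows, assuming documents are already parsed into dicts; the plain predicate `keep` stands in for the LLM-driven filter mentioned in the abstract, and the function and field names are hypothetical.

```python
from collections import Counter


def top_k_by_count(documents, key, k, keep=lambda doc: True):
    """Count values of `key` over the documents that pass `keep`; return the top k.

    documents: iterable of dicts
    keep:      predicate that drops noisy or irrelevant documents
    """
    counts = Counter(doc[key] for doc in documents if keep(doc))
    return counts.most_common(k)
```

For a query like the abstract's "top 10 most cited papers in 2023", `keep` would restrict to 2023 documents and `key` would name the cited paper, so the final ranking is computed exactly rather than estimated from a few retrieved chunks.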
Details
Motivation: Existing RAG benchmarks focus on local information retrieval and cannot assess global tasks that require analysis and aggregation across an entire document collection, so a new benchmark and method are needed.
Method: Designs the GlobalQA benchmark, covering four core task types (counting, extremum queries, sorting, and top-k extraction), and proposes the GlobalRAG framework, which combines intelligent filtering, chunk-level retrieval, and aggregation modules for precise symbolic computation and noise suppression.
Result: Existing RAG methods perform poorly on GlobalQA, with the strongest baseline reaching only 1.51 F1; GlobalRAG reaches 6.63 F1 on Qwen2.5-14B.
Conclusion: GlobalRAG improves LLM performance on global RAG tasks and supports the case for multi-tool collaboration and structured aggregation in complex analytical tasks.
Abstract: Retrieval-augmented generation (RAG) has emerged as a leading approach to reducing hallucinations in large language models (LLMs). Current RAG evaluation benchmarks primarily focus on what we call local RAG: retrieving relevant chunks from a small subset of documents to answer queries that require only localized understanding within specific text chunks. However, many real-world applications require a fundamentally different capability -- global RAG -- which involves aggregating and analyzing information across entire document collections to derive corpus-level insights (for example, "What are the top 10 most cited papers in 2023?"). In this paper, we introduce GlobalQA -- the first benchmark specifically designed to evaluate global RAG capabilities, covering four core task types: counting, extremum queries, sorting, and top-k extraction. Through systematic evaluation across different models and baselines, we find that existing RAG methods perform poorly on global tasks, with the strongest baseline achieving only 1.51 F1 score. To address these challenges, we propose GlobalRAG, a multi-tool collaborative framework that preserves structural coherence through chunk-level retrieval, incorporates LLM-driven intelligent filters to eliminate noisy documents, and integrates aggregation modules for precise symbolic computation. On the Qwen2.5-14B model, GlobalRAG achieves 6.63 F1 compared to the strongest baseline's 1.51 F1, validating the effectiveness of our method.
[31] Pragmatic Theories Enhance Understanding of Implied Meanings in LLMs
Takuma Sato,Seiya Kawano,Koichiro Yoshino
Main category: cs.CL
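TL;DR: Proposes giving language models pragmatic theories as prompts to improve their understanding of implied meanings; experiments show the approach clearly outperforms the baseline on pragmatic reasoning tasks.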
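The prompting approach in this entry amounts to placing a theory overview ahead of the task and asking for step-by-step reasoning. The sketch below shows one possible shape for such a prompt; the wording of `GRICE_OVERVIEW` and of the instructions is hypothetical and is not the prompt used in the paper.

```python
# Hypothetical one-paragraph overview; the paper's actual wording is not given here.
GRICE_OVERVIEW = (
    "Gricean pragmatics: speakers are assumed to be cooperative and to follow "
    "the maxims of Quantity, Quality, Relation, and Manner; an apparent "
    "violation of a maxim signals an implied meaning."
)


def build_prompt(utterance, context, theory_overview=GRICE_OVERVIEW):
    """Assemble a theory-first prompt followed by a step-by-step instruction."""
    return (
        f"{theory_overview}\n\n"
        f"Context: {context}\n"
        f"Utterance: {utterance}\n"
        "Using the theory above, reason step by step about what the speaker "
        "implies, then state the final interpretation."
    )
```

Replacing `theory_overview` with a single sentence that only names the theory gives the lighter variant that, according to the abstract, still helps larger models by around 1-3%.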
Details
Motivation: Language models need to understand what is implied beyond the literal words, and existing methods for guiding them through pragmatic reasoning fall short.
Method: Presents overviews of theories such as Gricean pragmatics and Relevance Theory as prompts that guide the model through step-by-step reasoning, and also tests the effect of merely mentioning the theory names.
Result: Compared with zero-shot chain-of-thought prompting without any theory, the proposed method improves scores by up to 9.6%; merely naming the theories also yields a 1-3% improvement in larger models.
Conclusion: Incorporating pragmatic theories into prompts is an effective in-context learning approach that improves the pragmatic reasoning of language models.
Abstract: The ability to accurately interpret implied meanings plays a crucial role in human communication and language use, and language models are also expected to possess this capability. This study demonstrates that providing language models with pragmatic theories as prompts is an effective in-context learning approach for tasks that require understanding implied meanings. Specifically, we propose an approach in which an overview of pragmatic theories, such as Gricean pragmatics and Relevance Theory, is presented as a prompt to the language model, guiding it through a step-by-step reasoning process to derive a final interpretation. Experimental results showed that, compared to the baseline, which prompts intermediate reasoning without presenting pragmatic theories (0-shot Chain-of-Thought), our methods enabled language models to achieve up to 9.6% higher scores on pragmatic reasoning tasks. Furthermore, we show that even without explaining the details of pragmatic theories, merely mentioning their names in the prompt leads to a certain performance improvement (around 1-3%) in larger models compared to the baseline.
[32] Language Models Are Borrowing-Blind: A Multilingual Evaluation of Loanword Identification across 10 Languages
Mérilin Sousa Silva,Sina Ahmadi
Main category: cs.CL
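TL;DR: Investigates whether pretrained language models can identify loanwords and finds that they perform poorly across many languages, consistent with modern NLP systems being biased toward loanwords over native words.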
Details
Motivation: To examine how well pretrained language models distinguish loanwords from native words in multilingual settings, especially where a dominant language exerts lexical influence on a minority language.
Method: Evaluates multiple pretrained language models (including large language models) on 10 languages, testing their ability to identify loanwords given explicit instructions and contextual information.
Result: Models distinguish loanwords from native words poorly, with or without prompts and context.
Conclusion: Current NLP systems are biased toward loanwords, which has important implications for building NLP tools for minority languages and for language preservation.
Abstract: Throughout language history, words are borrowed from one language to another and gradually become integrated into the recipient's lexicon. Speakers can often differentiate these loanwords from native vocabulary, particularly in bilingual communities where a dominant language continuously imposes lexical items on a minority language. This paper investigates whether pretrained language models, including large language models, possess similar capabilities for loanword identification. We evaluate multiple models across 10 languages. Despite explicit instructions and contextual information, our results show that models perform poorly in distinguishing loanwords from native ones. These findings corroborate previous evidence that modern NLP systems exhibit a bias toward loanwords rather than native equivalents. Our work has implications for developing NLP tools for minority languages and supporting language preservation in communities under lexical pressure from dominant languages.
[33] Distilling Multilingual Vision-Language Models: When Smaller Models Stay Multilingual
Sukrit Sriratanawilai,Jhayahgrit Thongwat,Romrawin Chumpu,Patomporn Payoungkhamdee,Sarana Nutanong,Peerat Limkonchotiwat
Main category: cs.CL
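TL;DR: Studies knowledge distillation for multilingual vision-language models, evaluating how five distillation methods affect cross-lingual representation consistency and downstream stability, and finds that some configurations preserve or even improve multilingual retrieval robustness when the model size is halved.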
Details
Motivation: Vision-language models perform unevenly across languages, and the problem worsens at smaller model sizes, yet knowledge distillation in multilingual settings remains underexplored.
Method: Runs controlled experiments on five knowledge distillation methods with CLIP and SigLIP2, isolating their effects on cross-lingual representation consistency and task stability, and evaluates on in-domain retrieval and out-of-domain visual question answering.
Result: Some distillation configurations preserve or improve multilingual retrieval performance with the model compressed by 50%, but others fail to maintain cross-task stability, exposing design-sensitive trade-offs that aggregate accuracy alone does not reveal.
Conclusion: Knowledge distillation is promising for compressing multilingual VLMs, but its effect depends heavily on the configuration; cross-lingual consistency and task stability must both be considered.
Abstract: Vision-language models (VLMs) exhibit uneven performance across languages, a problem that is often exacerbated when the model size is reduced. While knowledge distillation (KD) demonstrates promising results in transferring knowledge from larger to smaller VLMs, applying KD in multilingual settings is an underexplored area. This paper presents a controlled empirical study of KD behavior across five distillation approaches, isolating their effects on cross-lingual representation consistency and downstream performance stability under model compression. We study five distillation formulations across CLIP and SigLIP2, and evaluate them on in-domain retrieval and out-of-domain visual QA. We find that some configurations preserve or even improve multilingual retrieval robustness despite halving model size, but others fail to maintain cross-task stability, exposing design-sensitive trade-offs that aggregate accuracy alone does not reveal.
[34] Do LLMs Signal When They're Right? Evidence from Neuron Agreement
Kang Chen,Yaoning Wang,Kai Xiong,Zhuoka Feng,Wenhe Sun,Haotian Chen,Yixin Cao
Main category: cs.CL
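TL;DR: Proposes Neuron Agreement Decoding (NAD), a label-free decoding method that selects candidates without supervision using neuron activation sparsity and cross-sample agreement; it can predict correctness early in generation and sharply reduce compute.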
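The selection rule behind this entry can be sketched with sets of activated neuron ids, one set per sampled response: prefer the candidate whose activations agree most with the others, and break ties toward the candidate that activates fewer unique neurons. How the paper actually combines agreement and sparsity is not stated in the abstract, so this scoring rule and the function names are assumptions.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two sets of activated neuron ids, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def select_by_agreement(activated):
    """Return the index of the candidate with the highest mean agreement.

    activated: list of sets of neuron ids, one per candidate response.
    Ties are broken in favour of the candidate with fewer unique neurons.
    """
    if len(activated) < 2:
        raise ValueError("need at least two candidates")

    def score(i):
        others = [jaccard(activated[i], s) for j, s in enumerate(activated) if j != i]
        return (sum(others) / len(others), -len(activated[i]))

    return max(range(len(activated)), key=score)
```

Because the rule looks only at internal activations, it needs no comparable textual answers, which is what lets the approach apply to open-ended coding tasks where majority voting does not.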
Details
Motivation: Existing ensemble decoding strategies score candidates with external outputs (such as probabilities and entropies), which are poorly calibrated after post-training and make little use of the model's internal behavior.
Method: Analyzes neuron activations inside LLMs and finds that correct responses activate fewer unique neurons and show higher cross-sample agreement; based on this, NAD uses internal activation signals to rank candidates and to stop early.
Result: NAD matches majority voting on math and science benchmarks, outperforms Avg@64 on open-ended coding tasks, reduces token usage by 99% with minimal quality loss, and supports early correctness prediction within the first 32 generated tokens.
Conclusion: Internal neuron activation signals provide reliable, scalable, and efficient guidance for label-free ensemble decoding, and NAD offers a new direction for inference optimization.
Abstract: Large language models (LLMs) commonly boost reasoning via sample-evaluate-ensemble decoders, achieving label-free gains without ground truth. However, prevailing strategies score candidates using only external outputs such as token probabilities, entropies, or self-evaluations, and these signals can be poorly calibrated after post-training. We instead analyze internal behavior based on neuron activations and uncover three findings: (1) external signals are low-dimensional projections of richer internal dynamics; (2) correct responses activate substantially fewer unique neurons than incorrect ones throughout generation; and (3) activations from correct responses exhibit stronger cross-sample agreement, whereas incorrect ones diverge. Motivated by these observations, we propose Neuron Agreement Decoding (NAD), an unsupervised best-of-N method that selects candidates using activation sparsity and cross-sample neuron agreement, operating solely on internal signals and without requiring comparable textual outputs. NAD enables early correctness prediction within the first 32 generated tokens and supports aggressive early stopping. Across math and science benchmarks with verifiable answers, NAD matches majority voting; on open-ended coding benchmarks where majority voting is inapplicable, NAD consistently outperforms Avg@64. By pruning unpromising trajectories early, NAD reduces token usage by 99% with minimal loss in generation quality, showing that internal signals provide reliable, scalable, and efficient guidance for label-free ensemble decoding.
[35] Unravelling the Mechanisms of Manipulating Numbers in Language Models
Michal Štefánik,Timothee Mickus,Marek Kadlčík,Bertram Højer,Michal Spiegel,Raúl Vázquez,Aman Sinha,Josef Kuchař,Philipp Mondorf
Main category: cs.CL
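TL;DR: Shows that although large language models make errors when handling numbers, they learn systematic, highly accurate, and universal number representations, and that the sources of errors can be traced with universal probes.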
Details
Motivation: To explain why large language models frequently produce number-related errors even though they share similar and accurate number embeddings.
Method: Analyzes the internal number representations of different language models and builds universal probes to trace information flow and locate the specific layers responsible for output errors.
Result: Different models learn interchangeable number representations that are systematic and universal across contexts; the study quantifies lower bounds on the accuracy of these mechanisms and traces errors to specific layers.
Conclusion: Pretrained LLMs handle numbers in a uniform way, and more accurate probing techniques could guide targeted architectural refinements that reduce numeric errors.
Abstract: Recent work has shown that different large language models (LLMs) converge to similar and accurate input embedding representations for numbers. These findings conflict with the documented propensity of LLMs to produce erroneous outputs when dealing with numeric information. In this work, we aim to explain this conflict by exploring how language models manipulate numbers and quantify the lower bounds of accuracy of these mechanisms. We find that despite surfacing errors, different language models learn interchangeable representations of numbers that are systematic, highly accurate and universal across their hidden states and the types of input contexts. This allows us to create universal probes for each LLM and to trace information -- including the causes of output errors -- to specific layers. Our results lay a fundamental understanding of how pre-trained LLMs manipulate numbers and outline the potential of more accurate probing techniques for targeted refinements of LLMs' architectures.
[36] Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games
Jingran Zhang,Ning Li,Justin Cui
Main category: cs.CL
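TL;DR: This study evaluates the web interaction capabilities of OpenAI's ChatGPT Atlas in browser games, finding that it excels at logical reasoning tasks such as Sudoku but performs poorly in real-time games that demand precise timing and control.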
Details
Motivation: To explore how Atlas performs in dynamic, interactive web environments, particularly its real-time interaction abilities beyond information retrieval. Method: Browser games (T-Rex Runner, Sudoku, Flappy Bird, and Stein.world) serve as test scenarios, with in-game scores as the quantitative metric. Result: Atlas significantly outperforms human baselines on logical tasks such as Sudoku but struggles to get past the initial obstacles in real-time games such as Flappy Bird and T-Rex Runner. Conclusion: Atlas has strong analytical processing abilities but shows clear limitations in dynamic web environments that require real-time interaction and fine-grained control. Abstract: OpenAI's ChatGPT Atlas introduces new capabilities for web interaction, enabling the model to analyze webpages, process user intents, and execute cursor and keyboard inputs directly within the browser. While its capacity for information retrieval tasks has been demonstrated, its performance in dynamic, interactive environments remains less explored. In this study, we conduct an early evaluation of Atlas's web interaction capabilities using browser-based games as test scenarios, including Google's T-Rex Runner, Sudoku, Flappy Bird, and Stein.world. We employ in-game performance scores as quantitative metrics to assess performance across different task types. Our results show that Atlas performs strongly in logical reasoning tasks like Sudoku, completing puzzles significantly faster than human baselines, but struggles substantially in real-time games requiring precise timing and motor control, often failing to progress beyond initial obstacles. These findings suggest that while Atlas demonstrates capable analytical processing, there remain notable limitations in dynamic web environments requiring real-time interaction. The website of our project can be found at https://atlas-game-eval.github.io.
[37] SCRIBE: Structured Chain Reasoning for Interactive Behaviour Explanations using Tool Calling
Fares Fawzi,Vinitra Swamy,Dominik Glandorf,Tanya Nazaretsky,Tanja Käser
Main category: cs.CL
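TL;DR: SCRIBE is a multi-hop, tool-augmented reasoning framework for generating student feedback. Distilled into small open-source 3B and 8B models via two-stage LoRA fine-tuning, it matches or even exceeds much larger models, making it suitable for low-resource and privacy-sensitive educational settings.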
Details
Motivation: Deploying language models for educational feedback in the real world faces three challenges: privacy, limited computational resources, and pedagogical validity. This calls for small open-source models that run locally and produce reliable output. Method: SCRIBE combines domain-specific tools with a self-reflective inference pipeline that supports iterative reasoning, tool use, and error recovery; 3B and 8B models are trained with two-stage LoRA fine-tuning on synthetic data generated by GPT-4o. Result: In a human-aligned GPT-Judge evaluation and a user study with 108 students, 8B-SCRIBE models match or exceed much larger models on key dimensions such as relevance and actionability, and students rate their quality on par with GPT-4o and Llama-3.3 70B. Conclusion: SCRIBE demonstrates that small, locally run language models are viable and effective in resource-constrained, privacy-sensitive educational settings. Abstract: Language models can be used to provide interactive, personalized student feedback in educational settings. However, real-world deployment faces three key challenges: privacy concerns, limited computational resources, and the need for pedagogically valid responses. These constraints require small, open-source models that can run locally and reliably ground their outputs in correct information. We introduce SCRIBE, a framework for multi-hop, tool-augmented reasoning designed to generate valid responses to student questions about feedback reports. SCRIBE combines domain-specific tools with a self-reflective inference pipeline that supports iterative reasoning, tool use, and error recovery. We distil these capabilities into 3B and 8B models via two-stage LoRA fine-tuning on synthetic GPT-4o-generated data. Evaluation with a human-aligned GPT-Judge and a user study with 108 students shows that 8B-SCRIBE models achieve comparable or superior quality to much larger models in key dimensions such as relevance and actionability, while being perceived on par with GPT-4o and Llama-3.3 70B by students. These findings demonstrate the viability of SCRIBE for low-resource, privacy-sensitive educational applications.
[38] From Amateur to Master: Infusing Knowledge into LLMs via Automated Curriculum Learning
Nishit Neema,Srinjoy Mukherjee,Sapan Shah,Gokul Ramakrishnan,Ganesh Venkatesh
Main category: cs.CL
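TL;DR: This paper proposes ACER, an automated curriculum-enhancement method that generates textbook-style curricula and question-answer pairs guided by Bloom's taxonomy, then continually pretrains large language models on them so that they become domain experts while retaining their general capabilities. Experiments show significant gains across several specialized domains as well as positive cross-domain knowledge transfer.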
Details
Motivation: Large language models perform well on general tasks but poorly in specialized domains such as economics and psychology that require deep, principled understanding. A method is needed that raises domain expertise without harming generality. Method: ACER first generates a table of contents for a subject, then produces question-answer pairs guided by Bloom's taxonomy to build a comprehensive textbook-style curriculum; the synthetic corpus is used for interleaved continual pretraining that covers both content and cognitive levels. Result: Experiments on Llama 3.2 (1B and 3B) show significant gains on specialized MMLU subsets, including a 5-percentage-point accuracy gain in microeconomics and a 3-percentage-point macro-average gain across all target domains; performance on non-target domains improves by 0.7 points; knowledge-intensive benchmarks such as ARC and GPQA improve by more than 2 points, and general reasoning performance remains stable. Conclusion: ACER offers a scalable and effective way to close performance gaps in key specialized domains without sacrificing general capabilities, while promoting cross-domain knowledge transfer. Abstract: Large Language Models (LLMs) excel at general tasks but underperform in specialized domains like economics and psychology, which require deep, principled understanding. To address this, we introduce ACER (Automated Curriculum-Enhanced Regimen) that transforms generalist models into domain experts without sacrificing their broad capabilities. ACER first synthesizes a comprehensive, textbook-style curriculum by generating a table of contents for a subject and then creating question-answer (QA) pairs guided by Bloom's taxonomy. This ensures systematic topic coverage and progressively increasing difficulty. The resulting synthetic corpus is used for continual pretraining with an interleaved curriculum schedule, aligning learning across both content and cognitive dimensions. Experiments with Llama 3.2 (1B and 3B) show significant gains in specialized MMLU subsets. In challenging domains like microeconomics, where baselines struggle, ACER boosts accuracy by 5 percentage points. Across all target domains, we observe a consistent macro-average improvement of 3 percentage points. Notably, ACER not only prevents catastrophic forgetting but also facilitates positive cross-domain knowledge transfer, improving performance on non-target domains by 0.7 points. Beyond MMLU, ACER enhances performance on knowledge-intensive benchmarks like ARC and GPQA by over 2 absolute points, while maintaining stable performance on general reasoning tasks. Our results demonstrate that ACER offers a scalable and effective recipe for closing critical domain gaps in LLMs.
[39] MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data
Mykhailo Poliakov,Nadiya Shvai
Main category: cs.CL
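TL;DR: This paper proposes MisSynth, a pipeline that uses retrieval-augmented generation (RAG) to produce synthetic fallacy samples for fine-tuning large language models (LLMs) to better recognize scientific misinformation. On the MISSCI dataset, the fine-tuned model achieves an absolute F1 improvement of more than 35% over its vanilla baseline.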
Details
Motivation: Science-related misinformation is widespread and harmful, and is especially hard to identify when it distorts or misinterprets research findings, so LLMs need to become better at recognizing such fallacious arguments. Method: MisSynth uses retrieval-augmented generation (RAG) to produce synthetic fallacy samples and applies lightweight fine-tuning of an LLM (such as LLaMA 3.1 8B) on them. Result: On the MISSCI test split, the fine-tuned LLaMA 3.1 8B model gains more than 35% absolute F1 over its vanilla baseline; synthetic data significantly improves zero-shot classification on real-world tasks, even with limited computational resources. Conclusion: Synthetic data augmentation combined with lightweight fine-tuning effectively improves LLMs' ability to recognize scientific misinformation, offering a feasible and efficient way to mitigate health-related misinformation. Abstract: Health-related misinformation is very prevalent and potentially harmful. It is difficult to identify, especially when claims distort or misinterpret scientific findings. We investigate the impact of synthetic data generation and lightweight fine-tuning techniques on the ability of large language models (LLMs) to recognize fallacious arguments using the MISSCI dataset and framework. In this work, we propose MisSynth, a pipeline that applies retrieval-augmented generation (RAG) to produce synthetic fallacy samples, which are then used to fine-tune an LLM model. Our results show substantial accuracy gains with fine-tuned models compared to vanilla baselines. For instance, the LLaMA 3.1 8B fine-tuned model achieved an over 35% F1-score absolute improvement on the MISSCI test split over its vanilla baseline. We demonstrate that introducing synthetic fallacy data to augment limited annotated resources can significantly enhance zero-shot LLM classification performance on real-world scientific misinformation tasks, even with limited computational resources. The code and synthetic dataset are available on https://github.com/mxpoliakov/MisSynth.
[40] The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration
Kotaro Furuya,Yuichi Kitagawa
Main category: cs.CL
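TL;DR: The paper proposes an interaction-based framework for automatic team composition that builds a language model graph and applies community detection to find functionally coherent model clusters, enabling effective multi-agent LLM collaboration without prior knowledge.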
Details
Motivation: Because the internal characteristics of LLMs are opaque, forming optimal multi-agent teams is difficult, which calls for an automatic team composition method that needs no prior knowledge. Method: The authors construct a "language model graph" that maps relationships between models from the semantic coherence of pairwise conversations and apply community detection to identify synergistic model clusters; priming conversations with specific topics surfaces synergistic teams. Result: The method discovers functionally coherent groups that reflect latent specializations; the resulting teams outperform random baselines on downstream tasks and approach the accuracy of manually curated teams based on known specializations. Conclusion: The interaction-centric framework provides a new basis for the automated design of collaborative multi-agent LLM teams. Abstract: While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition. However, forming optimal teams is a significant challenge, as the inherent opacity of most models obscures the internal characteristics necessary for effective collaboration. In this paper, we propose an interaction-centric framework for automatic team composition that does not require any prior knowledge including their internal architectures, training data, or task performances. Our method constructs a "language model graph" that maps relationships between models from the semantic coherence of pairwise conversations, and then applies community detection to identify synergistic model clusters. Our experiments with diverse LLMs demonstrate that the proposed method discovers functionally coherent groups that reflect their latent specializations. Priming conversations with specific topics identified synergistic teams which outperform random baselines on downstream benchmarks and achieve comparable accuracy to that of manually-curated teams based on known model specializations. Our findings provide a new basis for the automated design of collaborative multi-agent LLM teams.
[41] On the Role of Context for Discourse Relation Classification in Scientific Writing
Stephen Wan,Wei Liu,Michael Strube
Main category: cs.CL
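TL;DR: This paper examines the task of inferring discourse structure in scientific writing, presenting a preliminary study of pretrained language models (PLMs) and large language models (LLMs) for discourse relation classification (DRC) in scientific publications. It finds that context is generally helpful for DRC and analyzes which scientific discourse relation types benefit most from it.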
Details
Motivation: As generative AI is increasingly used in science workflows, using discourse-level information to find supporting evidence for AI-generated scientific claims becomes an important problem, which makes inferring discourse structure in scientific writing worth studying. Method: The authors run DRC experiments with PLMs and LLMs, focusing on how context (as defined by discourse structure) affects the task and how much different scientific discourse relation types depend on it. Result: Context generally improves DRC performance, and some scientific discourse relation types depend on it more than others. Conclusion: Context plays a positive role in discourse relation classification for scientific text; future work could explore how discourse structure can strengthen the credibility and interpretability of AI-generated scientific claims. Abstract: With the increasing use of generative Artificial Intelligence (AI) methods to support science workflows, we are interested in the use of discourse-level information to find supporting evidence for AI generated scientific claims. A first step towards this objective is to examine the task of inferring discourse structure in scientific writing. In this work, we present a preliminary investigation of pretrained language model (PLM) and Large Language Model (LLM) approaches for Discourse Relation Classification (DRC), focusing on scientific publications, an under-studied genre for this task. We examine how context can help with the DRC task, with our experiments showing that context, as defined by discourse structure, is generally helpful. We also present an analysis of which scientific discourse relation types might benefit most from context.
[42] OmniEduBench: A Comprehensive Chinese Benchmark for Evaluating Large Language Models in Education
Min Zhang,Hao Chen,Hao Chen,Wenqi Zhang,Didi Zhu,Xin Lin,Bo Jiang,Aimin Zhou,Fei Wu,Kun Kuang
Main category: cs.CL
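TL;DR: This paper presents OmniEduBench, a comprehensive Chinese educational benchmark covering a knowledge dimension and a cultivation dimension, with 24,602 high-quality question-answer pairs for systematically evaluating large language models in educational settings.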
Details
Motivation: Existing LLMs and their benchmarks focus mainly on the knowledge dimension and neglect the cultivation capabilities that matter in real educational practice; most benchmarks are also limited to a single subject or question type and lack diversity, a problem that is especially pronounced in the Chinese context. Method: The authors build OmniEduBench, a comprehensive Chinese educational benchmark of 24,602 high-quality question-answer pairs split into a knowledge dimension (18,121) and a cultivation dimension (6,481). Each dimension is subdivided into 6 categories, covering 61 subjects and 11 common exam question types. Result: Across 11 mainstream LLMs, only Gemini-2.5 Pro exceeds 60% accuracy in the knowledge dimension, while in the cultivation dimension the best model, QWQ, still trails human performance by nearly 30%. Conclusion: Current LLMs fall well short on the cultivation side of education; OmniEduBench exposes these limitations and provides an evaluation tool and direction for future educational models. Abstract: With the rapid development of large language models (LLMs), various LLM-based works have been widely applied in educational fields. However, most existing LLMs and their benchmarks focus primarily on the knowledge dimension, largely neglecting the evaluation of cultivation capabilities that are essential for real-world educational scenarios. Additionally, current benchmarks are often limited to a single subject or question type, lacking sufficient diversity. This issue is particularly prominent within the Chinese context. To address this gap, we introduce OmniEduBench, a comprehensive Chinese educational benchmark. OmniEduBench consists of 24.602K high-quality question-answer pairs. The data is meticulously divided into two core dimensions: the knowledge dimension and the cultivation dimension, which contain 18.121K and 6.481K entries, respectively. Each dimension is further subdivided into 6 fine-grained categories, covering a total of 61 different subjects (41 in the knowledge and 20 in the cultivation). Furthermore, the dataset features a rich variety of question formats, including 11 common exam question types, providing a solid foundation for comprehensively evaluating LLMs' capabilities in education. Extensive experiments on 11 mainstream open-source and closed-source LLMs reveal a clear performance gap. In the knowledge dimension, only Gemini-2.5 Pro surpassed 60% accuracy, while in the cultivation dimension, the best-performing model, QWQ, still trailed human intelligence by nearly 30%. These results highlight the substantial room for improvement and underscore the challenges of applying LLMs in education.
[43] 1+1>2: A Synergistic Sparse and Low-Rank Compression Method for Large Language Models
Zeliang Zong,Kai Zhang,Zheyang Li,Wenming Tan,Ye Ren,Yiyan Zhai,Jilin Hu
Main category: cs.CL
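TL;DR: The paper proposes SSLC, a synergistic sparse and low-rank compression method for efficiently compressing large language models, which substantially reduces model size and speeds up inference without loss of performance.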
Details
Motivation: LLM deployment is constrained by high bandwidth and compute demands; pruning and low-rank approximation each work on their own, but their synergy remains underexplored. Method: Low-rank approximation and sparse optimization are formulated as a unified problem and solved jointly with an iterative optimization algorithm, giving compression that needs no additional training. Result: On LLaMA and Qwen2.5 (7B-70B), SSLC compresses Qwen2.5 by 50% with no performance drop and achieves at least a 1.63× speedup, surpassing standalone pruning or low-rank methods. Conclusion: SSLC combines the strengths of sparsity and low rank and offers a practical solution for efficient LLM deployment. Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in language comprehension and generation; however, their widespread adoption is constrained by substantial bandwidth and computational demands. While pruning and low-rank approximation have each demonstrated promising performance individually, their synergy for LLMs remains underexplored. We introduce a Synergistic Sparse and Low-Rank Compression (SSLC) method for LLMs, which leverages the strengths of both techniques: low-rank approximation compresses the model by retaining its essential structure with minimal information loss, whereas sparse optimization eliminates non-essential weights, preserving those crucial for generalization. Based on theoretical analysis, we first formulate the low-rank approximation and sparse optimization as a unified problem and solve it with an iterative optimization algorithm. Experiments on LLaMA and Qwen2.5 models (7B-70B) show that SSLC, without any additional training steps, consistently surpasses standalone methods, achieving state-of-the-art results. Notably, SSLC compresses Qwen2.5 by 50% with no performance drop and achieves at least 1.63× speedup, offering a practical solution for efficient LLM deployment.
[44] Bayesian Network Fusion of Large Language Models for Sentiment Analysis
Rasoul Amirzadeh,Dhananjay Thiruvady,Fatemeh Shiri
Main category: cs.CL
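TL;DR: The paper proposes the Bayesian network LLM fusion (BNLF) framework, which integrates the predictions of several large language models (FinBERT, RoBERTa, and BERTweet) through a probabilistic mechanism for sentiment analysis, achieving about 6% higher accuracy than the baseline models on three financial corpora.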
Details
Motivation: To address LLMs' shortcomings in transparency and explainability, fine-tuning cost, prompt engineering effort, inconsistent cross-domain performance, and the environmental impact of high computational demands. Method: A Bayesian network models the sentiment predictions of multiple LLMs as probabilistic nodes and performs late fusion through probabilistic inference. Result: On three human-annotated financial corpora with distinct linguistic and contextual characteristics, BNLF improves accuracy by about 6% over the individual LLMs and shows robustness to dataset variability along with good interpretability. Conclusion: By probabilistically fusing the predictions of multiple specialised LLMs, BNLF improves the accuracy and interpretability of sentiment analysis while reducing reliance on any single model and on computational resources. Abstract: Large language models (LLMs) continue to advance, with an increasing number of domain-specific variants tailored for specialised tasks. However, these models often lack transparency and explainability, can be costly to fine-tune, require substantial prompt engineering, yield inconsistent results across domains, and impose significant adverse environmental impact due to their high computational demands. To address these challenges, we propose the Bayesian network LLM fusion (BNLF) framework, which integrates predictions from three LLMs, including FinBERT, RoBERTa, and BERTweet, through a probabilistic mechanism for sentiment analysis. BNLF performs late fusion by modelling the sentiment predictions from multiple LLMs as probabilistic nodes within a Bayesian network. Evaluated across three human-annotated financial corpora with distinct linguistic and contextual characteristics, BNLF demonstrates consistent gains of about six percent in accuracy over the baseline LLMs, underscoring its robustness to dataset variability and the effectiveness of probabilistic fusion for interpretable sentiment classification.
[45] A Multi-agent Large Language Model Framework to Automatically Assess Performance of a Clinical AI Triage Tool
Adam E. Flanders,Yifan Peng,Luciano Prevedello,Robyn Ball,Errol Colak,Prahlad Menon,George Shih,Hui-Ming Lin,Paras Lakhani
Main category: cs.CL
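TL;DR: The study evaluates whether an ensemble of multiple large language models (LLMs) can assess a pixel-based AI triage tool more reliably than a single LLM. The results indicate that ensembles of medium to large open-source LLMs are more consistent and reliable for retrospective clinical AI evaluation.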
Details
Motivation: To explore the potential of multi-LLM ensembles for evaluating medical AI tools, improving reliability and consistency and overcoming the bias and instability a single LLM may show. Method: 29,766 non-contrast head CT exams from 14 hospitals were processed by a commercial intracranial hemorrhage (ICH) AI detection tool, and the radiology reports were analyzed by eight open-source LLMs and a HIPAA-compliant version of GPT-4o using a multi-shot prompt that assessed for the presence of ICH; individual models and ensemble strategies were then compared. Result: Llama3.3:70b and GPT-4o performed best (AUC = 0.78), with Llama3.3:70b leading on F1 score, recall, and precision; the Full-9, Top-3, and Consensus ensembles did not differ significantly (p > 0.05), but all achieved a higher MCC than GPT-4o alone. Conclusion: Ensembles of medium to large open-source LLMs evaluate clinical AI triage tools more consistently and reliably than a single LLM and are suitable for generating a reference standard. Abstract: Purpose: The purpose of this study was to determine if an ensemble of multiple LLM agents could be used collectively to provide a more reliable assessment of a pixel-based AI triage tool than a single LLM. Methods: 29,766 non-contrast CT head exams from fourteen hospitals were processed by a commercial intracranial hemorrhage (ICH) AI detection tool. Radiology reports were analyzed by an ensemble of eight open-source LLM models and a HIPAA compliant internal version of GPT-4o using a single multi-shot prompt that assessed for presence of ICH. 1,726 examples were manually reviewed. Performance characteristics of the eight open-source models and consensus were compared to GPT-4o. Three ideal consensus LLM ensembles were tested for rating the performance of the triage tool. Results: The cohort consisted of 29,766 head CT exam-report pairs. The highest AUC performance was achieved with Llama3.3:70b and GPT-4o (AUC = 0.78). The average precision was highest for Llama3.3:70b and GPT-4o (AP = 0.75 and 0.76). Llama3.3:70b had the highest F1 score (0.81) and recall (0.85), greater precision (0.78), specificity (0.72), and MCC (0.57). Using MCC (95% CI), the ideal combinations of LLMs were: Full-9 Ensemble 0.571 (0.552-0.591), Top-3 Ensemble 0.558 (0.537-0.579), Consensus 0.556 (0.539-0.574), and GPT-4o 0.522 (0.500-0.543). No statistically significant differences were observed between Top-3, Full-9, and Consensus (p > 0.05).
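The Matthews correlation coefficient (MCC) used in the results above to rank the ensembles is computed from the four confusion-matrix counts. The following is a minimal sketch, not the study's code: the function names are hypothetical, and the simple majority vote is only an assumed stand-in for how binary ICH labels from several LLMs might be combined, since the abstract does not specify the consensus rule.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Conventional fallback when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def majority_vote(votes: list[int]) -> int:
    """Assumed consensus rule: 1 (ICH present) only on a strict majority of 1s."""
    return int(sum(votes) * 2 > len(votes))
```

MCC ranges from -1 (total disagreement) through 0 (chance level) to 1 (perfect agreement). Unlike accuracy, it uses all four cells of the confusion matrix, which is a common reason to prefer it when the positive class is rare.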
Conclusion: An ensemble of medium to large sized open-source LLMs provides a more consistent and reliable method to derive a ground truth retrospective evaluation of a clinical AI triage tool over a single LLM alone.
[46] Inside CORE-KG: Evaluating Structured Prompting and Coreference Resolution for Knowledge Graphs
Dipak Meher,Carlotta Domeniconi
Main category: cs.CL
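TL;DR: This paper presents a systematic ablation study of the CORE-KG framework, assessing how its two key components reduce node duplication and noise in knowledge graph construction.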
Details
Motivation: Existing LLM-based methods for building knowledge graphs from legal text still suffer from duplicate nodes and legal noise because they lack effective coreference resolution and guided extraction. Method: Ablation experiments on CORE-KG separately quantify how its type-aware coreference module and its domain-guided structured prompts affect knowledge graph quality. Result: Removing coreference resolution increases node duplication by 28.32% and noisy nodes by 4.32%; removing structured prompts increases node duplication by 4.34% and noisy nodes by 73.33%. Conclusion: Structured prompts play the dominant role in suppressing noise, while coreference resolution mainly reduces node duplication; together they make structured information extraction from complex legal text considerably more robust. Abstract: Human smuggling networks are increasingly adaptive and difficult to analyze. Legal case documents offer critical insights but are often unstructured, lexically dense, and filled with ambiguous or shifting references, which pose significant challenges for automated knowledge graph (KG) construction. While recent LLM-based approaches improve over static templates, they still generate noisy, fragmented graphs with duplicate nodes due to the absence of guided extraction and coreference resolution. The recently proposed CORE-KG framework addresses these limitations by integrating a type-aware coreference module and domain-guided structured prompts, significantly reducing node duplication and legal noise. In this work, we present a systematic ablation study of CORE-KG to quantify the individual contributions of its two key components. Our results show that removing coreference resolution results in a 28.32% increase in node duplication and a 4.32% increase in noisy nodes, while removing structured prompts leads to a 4.34% increase in node duplication and a 73.33% increase in noisy nodes. These findings offer empirical insights for designing robust LLM-based pipelines for extracting structured representations from complex legal texts.
[47] Hebrew Diacritics Restoration using Visual Representation
Yair Elboher,Yuval Pinter
Main category: cs.CL
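TL;DR: This paper presents DIVRIT, a Hebrew diacritization system that frames the task as zero-shot classification and uses a visual language model to process undiacritized text as an image, significantly improving diacritization accuracy and generalization.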
Details
Motivation: Undiacritized Hebrew is highly ambiguous, and accurate diacritics restoration is essential for pronunciation and semantic disambiguation, yet existing methods rely on complex linguistic analysis that limits their generalization. Method: DIVRIT works at the word level: it dynamically generates a set of candidate diacritization patterns conditioned on context, processes the undiacritized text as an image with a Hebrew Visual Language Model, and selects the most appropriate pattern through zero-shot classification. Result: DIVRIT diacritizes effectively without complex linguistic analysis; it reaches high accuracy in an "oracle" setting where the candidate set is guaranteed to contain the correct form, and architectural and training improvements significantly strengthen generalization. Conclusion: Visual representations offer a new approach to automated Hebrew diacritization, and DIVRIT shows their potential in accuracy and scalability, advancing low-resource language processing. Abstract: Diacritics restoration in Hebrew is a fundamental task for ensuring accurate word pronunciation and disambiguating textual meaning. Despite the language's high degree of ambiguity when unvocalized, recent machine learning approaches have significantly advanced performance on this task. In this work, we present DIVRIT, a novel system for Hebrew diacritization that frames the task as a zero-shot classification problem. Our approach operates at the word level, selecting the most appropriate diacritization pattern for each undiacritized word from a dynamically generated candidate set, conditioned on the surrounding textual context. A key innovation of DIVRIT is its use of a Hebrew Visual Language Model, which processes undiacritized text as an image, allowing diacritic information to be embedded directly within the input's vector representation. Through a comprehensive evaluation across various configurations, we demonstrate that the system effectively performs diacritization without relying on complex, explicit linguistic analysis. Notably, in an "oracle" setting where the correct diacritized form is guaranteed to be among the provided candidates, DIVRIT achieves a high level of accuracy. Furthermore, strategic architectural enhancements and optimized training methodologies yield significant improvements in the system's overall generalization capabilities. These findings highlight the promising potential of visual representations for accurate and automated Hebrew diacritization.
[48] The Structure of Relation Decoding Linear Operators in Large Language Models
Miranda Anna Christ,Adrián Csiszárik,Gergely Becsó,Dániel Varga
Main category: cs.CL
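TL;DR: The paper studies the structure of the linear operators that decode specific relational facts in transformer language models and finds that they extract shared, coarse-grained semantic properties rather than specific relations, which explains both their compressibility and the limits of their generalization.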
Details
Motivation: To understand the underlying structure of linear relation decoders in transformer language models and where their generalization comes from. Method: The authors extend earlier single-relation work to collections of relations, compress the relation decoders with order-3 tensor networks, and propose a cross-evaluation protocol that tests each decoder on the subjects of other relations. Result: Collections of relation decoders can be highly compressed without significant loss of accuracy; cross-evaluation shows that they extract generic semantic properties of the "country-of-X" kind rather than specific relations. Conclusion: Linear relational decoding in transformers is primarily property-based rather than relation-specific, which explains its compressibility and its generalization to semantically close relations. Abstract: This paper investigates the structure of linear operators introduced in Hernandez et al. [2023] that decode specific relational facts in transformer language models. We extend their single-relation findings to a collection of relations and systematically chart their organization. We show that such collections of relation decoders can be highly compressed by simple order-3 tensor networks without significant loss in decoding accuracy. To explain this surprising redundancy, we develop a cross-evaluation protocol, in which we apply each linear decoder operator to the subjects of every other relation. Our results reveal that these linear maps do not encode distinct relations, but extract recurring, coarse-grained semantic properties (e.g., country of capital city and country of food are both in the country-of-X property). This property-centric structure clarifies both the operators' compressibility and highlights why they generalize only to new relations that are semantically close. Our findings thus interpret linear relational decoding in transformer language models as primarily property-based, rather than relation-specific.
[49] InfoFlow: Reinforcing Search Agent Via Reward Density Optimization
Kun Luo,Hongjin Qian,Zheng Liu,Ziyi Xia,Shitao Xiao,Siqi Bao,Jun Zhao,Kang Liu
Main category: cs.CL
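TL;DR: This paper proposes InfoFlow, a framework that addresses low reward density in reinforcement learning and improves learning efficiency in deep search through three techniques: subproblem decomposition, failure-guided hints, and dual-agent refinement.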
Details
Motivation: In deep search scenarios, exploration is costly and final rewards are rare, so reward density is low and agents learn inefficiently. Method: InfoFlow has three core components: subproblem decomposition, which provides denser process rewards; failure-guided hints, which correct stalled trajectories; and a dual-agent architecture, which lowers exploration cost by compressing the search history. Result: On multiple agentic search benchmarks, InfoFlow significantly outperforms strong baselines and lets lightweight LLMs reach performance comparable to advanced proprietary models. Conclusion: InfoFlow raises reward density and lowers exploration cost, offering a practical route to efficient deep search with low-cost models. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach for enhancing agentic deep search. However, its application is often hindered by low Reward Density in deep search scenarios, where agents expend significant exploratory costs for infrequent and often null final rewards. In this paper, we formalize this challenge as the Reward Density Optimization problem, which aims to improve the reward obtained per unit of exploration cost. This paper introduces InfoFlow, a systematic framework that tackles this problem from three aspects. 1) Subproblem decomposition: breaking down long-range tasks to assign process rewards, thereby providing denser learning signals. 2) Failure-guided hints: injecting corrective guidance into stalled trajectories to increase the probability of successful outcomes. 3) Dual-agent refinement: employing a dual-agent architecture to offload the cognitive burden of deep exploration. A refiner agent synthesizes the search history, which effectively compresses the researcher's perceived trajectory, thereby reducing exploration cost and increasing the overall reward density. We evaluate InfoFlow on multiple agentic search benchmarks, where it significantly outperforms strong baselines, enabling lightweight LLMs to achieve performance comparable to advanced proprietary LLMs.
[50] Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Yinrong Hong,Zhiquan Tan,Kai Hu
Main category: cs.CL
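TL;DR: The paper proposes CAST, a new dynamic tree decoding method that accounts for inference costs such as GPU configuration and batch size, significantly speeding up large language model inference.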
Details
Motivation: LLMs suffer significant inference latency because of their autoregressive design and large size, and existing speculative decoding methods neglect system variables such as GPU devices and batch sizes. Method: CAST is a dynamic tree decoding method that takes inference costs (such as GPU configuration and batch size) into account and dynamically refines the tree structure to improve decoding efficiency. Result: Across six tasks and six LLMs, CAST is up to 5.2 times faster than conventional decoding and generally outperforms existing state-of-the-art methods by 5% to 20%. Conclusion: By accounting for system-level variables, CAST makes speculative decoding more efficient and is a strong option for accelerating LLM inference. Abstract: Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size. To address this, speculative decoding emerges as a solution, enabling the simultaneous generation and validation of multiple tokens. While recent approaches like EAGLE-2 and EAGLE-3 improve speculative decoding using dynamic tree structures, they often neglect the impact of crucial system variables such as GPU devices and batch sizes. Therefore, we introduce a new dynamic tree decoding approach called CAST that takes into account inference costs, including factors such as GPU configurations and batch sizes, to dynamically refine the tree structure. Through comprehensive experimentation across six diverse tasks and utilizing six distinct LLMs, our methodology demonstrates remarkable results, achieving speeds up to 5.2 times faster than conventional decoding methods. Moreover, it generally outperforms existing state-of-the-art techniques by 5% to 20%.
[51] SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding
Yiqiao Jin,Rachneet Kaur,Zhen Zeng,Sumitra Ganesh,Srijan Kumar
Main category: cs.CL
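TL;DR: This paper presents SlideAgent, an agentic framework for understanding multi-modal, multi-page, multi-layout documents (especially slide decks), which significantly improves understanding of complex visual documents through reasoning at three levels: global, page, and element.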
Details
Motivation: Existing systems struggle with fine-grained reasoning across pages and elements in complex multi-page visual documents, and LLMs, despite their potential, lack an effective mechanism for structured understanding. Method: SlideAgent uses specialized agents and decomposes reasoning into global, page, and element levels to build a structured, query-agnostic document representation; at inference time it selectively activates the relevant agents and integrates their outputs into context-aware answers. Result: SlideAgent improves overall performance over both proprietary models (+7.9) and open-source models (+9.8). Conclusion: Through its multi-level, modular agent architecture, SlideAgent improves understanding of and reasoning over multi-page visual documents and shows good generality and application potential. Abstract: Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While large language models (LLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three specialized levels (global, page, and element) to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent achieves significant improvement over both proprietary (+7.9 overall) and open-source models (+9.8 overall).
[52] Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model
Biao Zhang,Yong Cheng,Siamak Shakeri,Xinyi Wang,Min Ma,Orhan Firat
Main category: cs.CL
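TL;DR: This paper revisits the encoder-decoder large language model (RedLLM) and systematically compares it with the now-dominant decoder-only model (DecLLM) across model scales, finding that RedLLM is competitive in scaling behavior, context length extrapolation, and downstream performance after instruction tuning, with better inference efficiency.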
Details
Motivation: LLM architectures have shifted from encoder-decoder to decoder-only in recent years without a rigorous comparison from the scaling perspective, so the potential of encoder-decoder models may have been underestimated; the authors aim to fill this gap. Method: The authors build RedLLM, an encoder-decoder model enhanced with recent training recipes and pretrained with prefix language modeling, and compare it against DecLLM, pretrained with causal language modeling, at scales from about 150M to about 8B, using RedPajama V1 for pretraining and FLAN for instruction tuning. Result: Although DecLLM is more compute-optimal during pretraining, RedLLM shows comparable scaling and context length extrapolation; after instruction tuning, RedLLM performs comparably or better on a range of downstream tasks with substantially better inference efficiency. Conclusion: Encoder-decoder LLMs have overlooked potential and deserve re-examination and further exploration for building more powerful and efficient LLMs. Abstract: Recent large language model (LLM) research has undergone an architectural shift from encoder-decoder modeling to the now-dominant decoder-only modeling. This rapid transition, however, comes without a rigorous comparative analysis especially from the scaling perspective, raising concerns that the potential of encoder-decoder models may have been overlooked. To fill this gap, we revisit encoder-decoder LLM (RedLLM), enhancing it with recent recipes from decoder-only LLM (DecLLM). We conduct a comprehensive comparison between RedLLM, pretrained with prefix language modeling (LM), and DecLLM, pretrained with causal LM, at different model scales, ranging from ~150M to ~8B. Using RedPajama V1 (1.6T tokens) for pretraining and FLAN for instruction tuning, our experiments show that RedLLM produces compelling scaling properties and surprisingly strong performance. While DecLLM is overall more compute-optimal during pretraining, RedLLM demonstrates comparable scaling and context length extrapolation capabilities. After instruction tuning, RedLLM achieves comparable and even better results on various downstream tasks while enjoying substantially better inference efficiency. We hope our findings could inspire more efforts on re-examining RedLLM, unlocking its potential for developing powerful and efficient LLMs.
[53] Evontree: Ontology Rule-Guided Self-Evolution of Large Language Models
Mingchen Tu,Zhiqiang Liu,Juan Li,Liangyurui Liu,Junjie Wang,Lei Liang,Wen Zhang
Main category: cs.CL
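TL;DR: This paper proposes Evontree, a framework that uses a small set of high-quality ontology rules to extract, validate, and enhance domain knowledge in large language models without extensive external data, improving accuracy on medical QA benchmarks by up to 3.7%.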
Details
Motivation: In data-sensitive fields such as healthcare, the lack of high-quality domain-specific training corpora limits the use of large language models. Domain experts, however, have already distilled their knowledge into formal ontology rules, so the key question is how to use these rules to improve a model's domain adaptation. Method: Evontree extracts a domain ontology from the raw model, detects knowledge inconsistencies using two core ontology rules, and reinforces the refined knowledge through self-distilled fine-tuning. Result: Experiments with Llama3-8B-Instruct and Med42-v2 show that the method outperforms both the unmodified models and leading supervised baselines on several medical QA benchmarks, with accuracy gains of up to 3.7%. Conclusion: Evontree enables low-resource domain adaptation of LLMs and shows that formal ontology rules can improve the consistency and specialization of a model's knowledge. Abstract: Large language models (LLMs) have demonstrated exceptional capabilities across multiple domains by leveraging massive pre-training and curated fine-tuning data. However, in data-sensitive fields such as healthcare, the lack of high-quality, domain-specific training corpora hinders LLMs' adaptation for specialized applications. Meanwhile, domain experts have distilled domain wisdom into ontology rules, which formalize relationships among concepts and ensure the integrity of knowledge management repositories. Viewing LLMs as implicit repositories of human knowledge, we propose Evontree, a novel framework that leverages a small set of high-quality ontology rules to systematically extract, validate, and enhance domain knowledge within LLMs, without requiring extensive external datasets. Specifically, Evontree extracts domain ontology from raw models, detects inconsistencies using two core ontology rules, and reinforces the refined knowledge via self-distilled fine-tuning. Extensive experiments on medical QA benchmarks with Llama3-8B-Instruct and Med42-v2 demonstrate consistent outperformance over both unmodified models and leading supervised baselines, achieving up to a 3.7% improvement in accuracy. These results confirm the effectiveness, efficiency, and robustness of our approach for low-resource domain adaptation of LLMs.
[54] Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Team,Yu Zhang,Zongyu Lin,Xingcheng Yao,Jiaxi Hu,Fanqing Meng,Chengyin Liu,Xin Men,Songlin Yang,Zhiyuan Li,Wentao Li,Enzhe Lu,Weizhou Liu,Yanru Chen,Weixin Xu,Longhui Yu,Yejie Wang,Yu Fan,Longguang Zhong,Enming Yuan,Dehao Zhang,Yizhi Zhang,T. Y. Liu,Haiming Wang,Shengjun Fang,Weiran He,Shaowei Liu,Yiwei Li,Jianlin Su,Jiezhong Qiu,Bo Pang,Junjie Yan,Zhejun Jiang,Weixiao Huang,Bohong Yin,Jiacheng You,Chu Wei,Zhengtao Wang,Chao Hong,Yutian Chen,Guanduo Chen,Yucheng Wang,Huabin Zheng,Feng Wang,Yibo Liu,Mengnan Dong,Zheng Zhang,Siyuan Pan,Wenhao Wu,Yuhao Wu,Longyu Guan,Jiawen Tao,Guohong Fu,Xinran Xu,Yuzhi Wang,Guokun Lai,Yuxin Wu,Xinyu Zhou,Zhilin Yang,Yulun Du
Main category: cs.CL
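TL;DR: Kimi Linear is a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios while being more efficient.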
Details
Motivation: To design an efficient attention architecture that outperforms full attention in short-context, long-context, and reinforcement learning scaling regimes. Method: The authors propose Kimi Delta Attention (KDA), which extends Gated DeltaNet with a finer-grained gating mechanism, together with a bespoke chunkwise algorithm that uses a specialized variant of Diagonal-Plus-Low-Rank (DPLR) transition matrices for hardware efficiency. Result: With an identical training recipe, a Kimi Linear model with 3B activated and 48B total parameters outperforms full MLA on all evaluated tasks, reduces KV cache usage by up to 75%, and achieves up to 6 times the decoding throughput at a 1M context. Conclusion: Kimi Linear can serve as a drop-in replacement for full attention architectures with better performance and efficiency, including on tasks with longer inputs and outputs. Abstract: We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA by a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times the decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
[55] The End of Manual Decoding: Towards Truly End-to-End Language Models
Zhichao Wang,Dongyang Ma,Xinting Huang,Deng Cai,Tian Lan,Jiahao Xu,Haitao Mi,Xiaoying Tang,Yan Wang
Main category: cs.CL
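TL;DR: This paper proposes AutoDeco, an architecture that learns to adjust decoding parameters such as temperature and top-p dynamically, enabling truly end-to-end generation: the model regulates its own sampling strategy within a single forward pass and shows an emergent ability to control decoding behavior from natural language instructions.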
Details
Motivation: Existing LLMs rely on a non-differentiable decoding process that requires manual hyperparameter tuning and lacks flexibility and automation, so an adaptive decoding mechanism that can be optimized end to end is needed. Method: Lightweight heads are added to the standard transformer; at each generation step they predict context-specific temperature and top-p values alongside the next-token logits, turning decoding into a learnable, token-level parametric process. Result: Experiments on eight benchmarks show that AutoDeco significantly outperforms default decoding strategies, approaches an oracle baseline tuned on the test set, and can adjust its decoding parameters token by token in response to natural language instructions such as "generate with low randomness". Conclusion: AutoDeco achieves truly end-to-end text generation, improving generation quality and opening a new paradigm in which decoding behavior is controlled through natural language instructions, which makes LLMs more steerable and interactive. Abstract: The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set", a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.
[56] Value Drifts: Tracing Value Alignment During LLM Post-Training
Mehar Bhatia,Shravan Nayak,Gaurav Kamath,Marius Mosbach,Karolina Stańczak,Vered Shwartz,Siva Reddy
Main category: cs.CL
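TL;DR: This work studies how large language models (LLMs) become aligned with human values during post-training, finding that the supervised fine-tuning (SFT) phase largely establishes a model's values, that subsequent preference optimization has limited effect, and that different algorithms lead to different alignment outcomes.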
Details
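Motivation: As LLMs play an increasingly important role in society, they must not only draw on knowledge but also align with human values. Prior work mostly evaluates the alignment of fully trained models and overlooks the dynamics during training, so it remains unclear at which post-training stage models acquire their values. Method: The authors analyze the training dynamics of Llama-3 and Qwen-3 models of different sizes under several SFT and preference optimization algorithms and datasets, disentangling the effects of algorithm and data and measuring the magnitude and timing of value drifts; they also compare preference optimization algorithms on a synthetic preference dataset that allows controlled manipulation of values. Result: The SFT phase generally establishes a model's values, and subsequent preference optimization rarely re-aligns them; even with identical preference data, different optimization algorithms produce different value alignment outcomes. Conclusion: A model's values are formed mainly during SFT, while the effect of preference optimization is limited and depends strongly on the algorithm; the findings offer practical guidance on data curation and on model and algorithm selection for improving value alignment. Abstract: As LLMs occupy an increasingly important role in society, they are more and more confronted with questions that require them not only to draw on their general knowledge but also to align with certain human value systems. Therefore, studying the alignment of LLMs with human values has become a crucial field of inquiry. Prior work, however, mostly focuses on evaluating the alignment of fully trained models, overlooking the training dynamics by which models learn to express human values. In this work, we investigate how and at which stage value alignment arises during the course of a model's post-training. Our analysis disentangles the effects of post-training algorithms and datasets, measuring both the magnitude and time of value drifts during training. Experimenting with Llama-3 and Qwen-3 models of different sizes and popular supervised fine-tuning (SFT) and preference optimization datasets and algorithms, we find that the SFT phase generally establishes a model's values, and subsequent preference optimization rarely re-aligns these values. Furthermore, using a synthetic preference dataset that enables controlled manipulation of values, we find that different preference optimization algorithms lead to different value alignment outcomes, even when preference data is held constant. 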
Our findings provide actionable insights into how values are learned during post-training and help to inform data curation, as well as the selection of models and algorithms for preference optimization to improve model alignment to human values.
[57] AMO-Bench: Large Language Models Still Struggle in High School Math Competitions
Shengnan An,Xunliang Cai,Xuezhi Cao,Xiaoyu Li,Yehao Lin,Junlin Liu,Xinxuan Lv,Dan Ma,Xuanlin Wang,Ziwen Wang,Shuang Zhou
Main category: cs.CL
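TL;DR: AMO-Bench is an advanced mathematical reasoning benchmark of 50 original, human-crafted problems at or above International Mathematical Olympiad difficulty, used to evaluate the mathematical reasoning of large language models; experiments show that current models perform poorly but exhibit a promising scaling trend with test-time compute.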
Details
Motivation: Existing mathematical reasoning benchmarks are saturating and can no longer effectively evaluate top-tier LLMs, so a more challenging evaluation tool is needed. Method: The authors build AMO-Bench, a benchmark of 50 high-difficulty, expert-validated, entirely original math problems that require only a final answer, which enables automatic grading. Result: Across 26 LLMs, the best model reaches only 52.4% accuracy and most models score below 40%, but performance improves as test-time compute increases. Conclusion: AMO-Bench shows that current LLMs still have significant room for improvement in advanced mathematical reasoning; the benchmark is publicly released to support further research. Abstract: We present AMO-Bench, an Advanced Mathematical reasoning benchmark with Olympiad level or even higher difficulty, comprising 50 human-crafted problems. Existing benchmarks have widely leveraged high school math competitions for evaluating mathematical reasoning capabilities of large language models (LLMs). However, many existing math competitions are becoming less effective for assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To address this, AMO-Bench introduces more rigorous challenges by ensuring all 50 problems are (1) cross-validated by experts to meet at least the International Mathematical Olympiad (IMO) difficulty standards, and (2) entirely original problems to prevent potential performance leakages from data memorization. Moreover, each problem in AMO-Bench requires only a final answer rather than a proof, enabling automatic and robust grading for evaluation. Experimental results across 26 LLMs on AMO-Bench show that even the best-performing model achieves only 52.4% accuracy, with most LLMs scoring below 40%. Beyond these poor performances, our further analysis reveals a promising scaling trend with increasing test-time compute on AMO-Bench. These results highlight the significant room for improving the mathematical reasoning in current LLMs. We release AMO-Bench to facilitate further research into advancing the reasoning abilities of language models. https://amo-bench.github.io/
[58] Gistify! Codebase-Level Understanding via Runtime Execution
Hyunji Lee,Minseon Kim,Chinmay Singh,Matheus Pereira,Atharv Sonwane,Isadora White,Elias Stengel-Eskin,Mohit Bansal,Zhengyan Shi,Alessandro Sordoni,Marc-Alexandre Côté,Xingdi Yuan,Lucas Caccia
Main category: cs.CL
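TL;DR: The paper proposes the Gistify task, in which a coding LLM must produce a single minimal, self-contained file that reproduces a specific functionality of a large codebase, as a way to evaluate the model's understanding of codebase structure and execution flow.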
Details
Motivation: As coding agents are increasingly deployed in large codebases, there is a pressing need to design challenging codebase-level evaluations automatically. Method: The authors propose the Gistify task: given a full codebase and a specific entrypoint, the LLM must generate a single file containing only the necessary components that reproduces the output of the same command run under the original codebase. Result: Current state-of-the-art models struggle on Gistify tasks, especially those with long execution traces. Conclusion: Gistify is a challenging new benchmark that exposes the weaknesses of existing coding LLMs in understanding codebase structure and producing large, precise code patches. Abstract: As coding agents are increasingly deployed in large codebases, the need to automatically design challenging, codebase-level evaluation is central. We propose Gistify, a task where a coding LLM must create a single, minimal, self-contained file that can reproduce a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a python command), and the generated file must replicate the output of the same command run under the full codebase, while containing only the essential components necessary to execute the provided command. Success on Gistify requires structural understanding of the codebase, accurate modeling of its execution flow, and the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long execution traces.
cs.CV [Back]
[59] Enhancing Underwater Object Detection through Spatio-Temporal Analysis and Spatial Attention Networks
Sai Likhith Karri,Ansh Saxena
Main category: cs.CV
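TL;DR: This study evaluates spatio-temporal modeling and spatial attention in deep learning models for underwater object detection, comparing YOLOv5 with the temporally enhanced T-YOLOv5 and extending T-YOLOv5 with a CBAM module to improve detection in complex scenes.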
Details
Motivation: To improve the accuracy and robustness of underwater object detection in dynamic marine environments, particularly under sudden movements, partial occlusions, and gradual motion. Method: The study first evaluates the temporally enhanced T-YOLOv5 against the standard YOLOv5, then adds a Convolutional Block Attention Module (CBAM) to T-YOLOv5 to build an augmented version, and compares the detection performance of all three. Result: YOLOv5 achieves a mAP@50-95 of 0.563, T-YOLOv5 reaches 0.813, and T-YOLOv5 with CBAM reaches 0.811; both temporal variants show better accuracy and generalization in complex scenes. Conclusion: T-YOLOv5 significantly improves detection reliability, and adding CBAM further improves performance in challenging scenarios at the cost of a slight loss of accuracy in simpler ones. Abstract: This study examines the effectiveness of spatio-temporal modeling and the integration of spatial attention mechanisms in deep learning models for underwater object detection. Specifically, in the first phase, the performance of the temporally enhanced YOLOv5 variant, T-YOLOv5, is evaluated against the standard YOLOv5. For the second phase, an augmented version of T-YOLOv5 is developed through the addition of a Convolutional Block Attention Module (CBAM). By examining the existing YOLOv5 and T-YOLOv5 models alongside the newly developed T-YOLOv5 with CBAM, the research highlights how temporal modeling improves detection accuracy in dynamic marine environments, particularly under conditions of sudden movements, partial occlusions, and gradual motion. The testing results showed that YOLOv5 achieved a mAP@50-95 of 0.563, while T-YOLOv5 and T-YOLOv5 with CBAM outperformed with mAP@50-95 scores of 0.813 and 0.811, respectively, highlighting their superior accuracy and generalization in detecting complex objects. The findings demonstrate that T-YOLOv5 significantly enhances detection reliability compared to the standard model, while T-YOLOv5 with CBAM further improves performance in challenging scenarios, although there is a loss of accuracy when it comes to simpler scenarios.
[60] MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency
Nicolas Dufour,Lucas Degeorge,Arijit Ghosh,Vicky Kalogeiton,David Picard
Main category: cs.CV
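TL;DR: The paper proposes MIRO, which conditions the model on multiple reward models during training so that it learns user preferences directly, improving both the quality of generated images and training efficiency.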
Details
Motivation: Existing text-to-image generative models can produce diverse images but do not align well with user preferences, and post-hoc alignment harms diversity, semantic fidelity, and efficiency. Method: The model is conditioned on multiple reward models during training, so that it learns user preferences directly instead of relying on selection after generation. Result: MIRO markedly improves the visual quality of generated images, speeds up training, and reaches state-of-the-art results on the GenEval compositional benchmark and on several user-preference scores (PickAScore, ImageReward, HPSv2). Conclusion: Conditioning on multiple reward models during training aligns the model with user preferences while improving generation quality, diversity, and training efficiency. Abstract: Current text-to-image generative models are trained on large uncurated datasets to enable diverse generation capabilities. However, this does not align well with user preferences. Recently, reward models have been specifically designed to perform post-hoc selection of generated images and align them to a reward, typically user preference. Discarding informative data in this way, together with optimizing for a single reward, tends to harm diversity, semantic fidelity, and efficiency. Instead of this post-processing, we propose to condition the model on multiple reward models during training to let the model learn user preferences directly. We show that this not only dramatically improves the visual quality of the generated images but also significantly speeds up training. Our proposed method, called MIRO, achieves state-of-the-art performance on the GenEval compositional benchmark and user-preference scores (PickAScore, ImageReward, HPSv2).
[61] BikeScenes: Online LiDAR Semantic Segmentation for Bicycles
Denniz Goren,Holger Caesar
Main category: cs.CV
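TL;DR: This paper presents a 3D LiDAR segmentation approach for bicycle safety and releases the BikeScenes-lidarseg dataset; experiments show that fine-tuning on this dataset raises mIoU from 13.8% to 63.6%.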
Details
Motivation: Cyclists are vulnerable road users, and more so with the growing popularity of faster e-bikes, which motivates adapting automotive perception technology to bicycle safety. Method: Using the multi-sensor 'SenseBike' platform, the authors develop a 3D LiDAR semantic segmentation approach tailored to bicycles and build the BikeScenes-lidarseg dataset, annotated with 29 semantic classes, for training and evaluation. Result: The model fine-tuned on BikeScenes reaches an mIoU of 63.6%, far above the 13.8% obtained with SemanticKITTI pre-training alone. Conclusion: Domain-specific data is essential for LiDAR semantic segmentation in bicycle scenes, and BikeScenes is a useful resource for cyclist-centric perception research. Abstract: The vulnerability of cyclists, exacerbated by the rising popularity of faster e-bikes, motivates adapting automotive perception technologies for bicycle safety. We use our multi-sensor 'SenseBike' research platform to develop and evaluate a 3D LiDAR segmentation approach tailored to bicycles. To bridge the automotive-to-bicycle domain gap, we introduce the novel BikeScenes-lidarseg Dataset, comprising 3021 consecutive LiDAR scans around the TU Delft university campus, semantically annotated for 29 dynamic and static classes. By evaluating model performance, we demonstrate that fine-tuning on our BikeScenes dataset achieves a mean Intersection-over-Union (mIoU) of 63.6%, significantly outperforming the 13.8% obtained with SemanticKITTI pre-training alone. This result underscores the necessity and effectiveness of domain-specific training. We highlight key challenges specific to bicycle-mounted, hardware-constrained perception systems and contribute the BikeScenes dataset as a resource for advancing research in cyclist-centric LiDAR segmentation.
[62] Generative Image Restoration and Super-Resolution using Physics-Informed Synthetic Data for Scanning Tunneling Microscopy
Nikola L. Kolev,Tommaso Rodani,Neil J. Curson,Taylor J. Z. Stock,Alberto Cazzaniga
Main category: cs.CV
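TL;DR: The paper proposes a machine learning approach that trains flow-matching and diffusion models on data from a physics-informed synthetic data generation pipeline to repair and super-resolve scanning tunnelling microscopy (STM) images, reducing image acquisition time and the frequency of tip conditioning.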
Details
Motivation: STM imaging is limited by tip degradation and slow serial data acquisition; during fabrication, large voltages can also alter the shape of the tip apex, so the tip must be conditioned frequently, which limits throughput. Method: Starting from only 36 pristine experimental images of Si(001):H, the authors build a physics-informed synthetic data generation pipeline and use it to train state-of-the-art flow-matching and diffusion models for image repair and super-resolution. Result: Evaluated with metrics such as the CLIP Maximum Mean Discrepancy (CMMD) score and structural similarity, the models restore images effectively and reconstruct them accurately from sparsely sampled data, reducing image acquisition time by a factor of two to four. Conclusion: The framework could significantly increase STM experimental throughput, reduce the frequency of tip conditioning, and enhance frame rates in existing high-speed STM systems. Abstract: Scanning tunnelling microscopy (STM) enables atomic-resolution imaging and atom manipulation, but its utility is often limited by tip degradation and slow serial data acquisition. Fabrication adds another layer of complexity since the tip is often subjected to large voltages, which may alter the shape of its apex, requiring it to be conditioned. Here, we propose a machine learning (ML) approach for image repair and super-resolution to alleviate both challenges. Using a dataset of only 36 pristine experimental images of Si(001):H, we demonstrate that a physics-informed synthetic data generation pipeline can be used to train several state-of-the-art flow-matching and diffusion models. Quantitative evaluation with metrics such as the CLIP Maximum Mean Discrepancy (CMMD) score and structural similarity demonstrates that our models are able to effectively restore images and offer a two- to fourfold reduction in image acquisition time by accurately reconstructing images from sparsely sampled data. Our framework has the potential to significantly increase STM experimental throughput by offering a route to reducing the frequency of tip-conditioning procedures and to enhancing frame rates in existing high-speed STM systems.
[63] SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing
Sung-Hoon Yoon,Minghan Li,Gaspard Beaudouin,Congcong Wen,Muhammad Rafay Azhar,Mengyu Wang
Main category: cs.CV
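TL;DR: The paper proposes an inversion-free framework based on flow decomposition and aggregation to address inaccurate inversion and gradient entanglement when rectified flow models are used for image editing; it semantically decomposes the prompt and adaptively aggregates the sub-flow velocity fields, improving semantic fidelity and attribute disentanglement.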
Details
Motivation: Rectified flow models suffer from inaccurate inversion and gradient entanglement in image editing, so edited results drift from the target prompt; inversion-free methods exist, but their editing quality is still suboptimal. Method: The authors propose a flow decomposition-and-aggregation framework: the target prompt is semantically decomposed into several sub-prompts, an independent flow is computed for each, and a projection and soft-aggregation mechanism adaptively fuses the sub-velocity fields, suppressing semantic redundancy and improving semantic consistency. Result: Experiments show that the method outperforms existing zero-shot image editing approaches in semantic fidelity and attribute disentanglement. Conclusion: The decomposition-aggregation mechanism balances diversity and semantic consistency in the inversion-free setting and offers a new route to high-quality image editing with rectified flow models. Abstract: Rectified flow models have become a de facto standard in image generation due to their stable sampling trajectories and high-fidelity outputs. Despite their strong generative capabilities, they face critical limitations in image editing tasks: inaccurate inversion processes for mapping real images back into the latent space, and gradient entanglement issues during editing often result in outputs that do not faithfully reflect the target prompt. Recent efforts have attempted to directly map source and target distributions via ODE-based approaches without inversion; however, these methods still yield suboptimal editing quality. In this work, we propose a flow decomposition-and-aggregation framework built upon an inversion-free formulation to address these limitations. Specifically, we semantically decompose the target prompt into multiple sub-prompts, compute an independent flow for each, and aggregate them to form a unified editing trajectory. While we empirically observe that decomposing the original flow enhances diversity in the target space, generating semantically aligned outputs still requires consistent guidance toward the full target prompt. To this end, we design a projection and soft-aggregation mechanism for flow, inspired by gradient conflict resolution in multi-task learning. This approach adaptively weights the sub-target velocity fields, suppressing semantic redundancy while emphasizing distinct directions, thereby preserving both diversity and consistency in the final edited output. Experimental results demonstrate that our method outperforms existing zero-shot editing approaches in terms of semantic fidelity and attribute disentanglement. 
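As a rough illustration of the projection and soft-aggregation idea in the SplitFlow abstract, the sketch below uses a PCGrad-style projection from multi-task learning followed by softmax weighting. It is not the authors' implementation: the function names, the use of the full-prompt velocity `ref` as the reference direction, the cosine-similarity scores, and the `temperature` parameter are all assumptions made for this example.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out_conflict(v, ref):
    """PCGrad-style step (assumed here, not taken from the paper):
    if v has a negative inner product with ref, remove the component
    of v that points against ref; otherwise return v unchanged."""
    d = dot(v, ref)
    if d >= 0:
        return list(v)
    scale = d / dot(ref, ref)
    return [x - scale * r for x, r in zip(v, ref)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def soft_aggregate(sub_velocities, ref, temperature=1.0):
    """Project each sub-prompt velocity against the reference, then
    combine them with softmax weights over cosine similarity to ref.
    The weighting scheme is a hypothetical choice for illustration."""
    projected = [project_out_conflict(v, ref) for v in sub_velocities]
    scores = [cosine(p, ref) / temperature for p in projected]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    combined = [
        sum(w * p[i] for w, p in zip(weights, projected))
        for i in range(len(ref))
    ]
    return combined, weights


# Toy 2-D example: the first sub-velocity conflicts with the reference.
ref = [1.0, 0.0]
subs = [[-1.0, 1.0], [1.0, 1.0]]
combined, weights = soft_aggregate(subs, ref)
print(combined, weights)
```

In this toy case the conflicting sub-velocity loses its component against `ref` before aggregation, so the combined vector never points away from the full-prompt direction.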
The code is available at https://github.com/Harvard-AI-and-Robotics-Lab/SplitFlow.
[64] Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer
Roman Beliy,Amit Zalcher,Jonathan Kogman,Navve Wasserman,Michal Irani
Main category: cs.CV
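TL;DR: The paper proposes Brain-IT, a brain-inspired approach in which a Brain Interaction Transformer (BIT) enables effective interactions between clusters of functionally similar brain voxels, substantially improving the faithfulness of images reconstructed from fMRI and surpassing existing methods with little data.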
Details
Motivation: Current fMRI-based image reconstruction methods still lack faithfulness to the images actually seen, particularly in semantic and structural consistency. Method: The authors propose the Brain Interaction Transformer (BIT), which uses functional clusters of brain voxels shared across subjects to predict localized patch-level high-level semantic features and low-level structural features that guide a diffusion model in reconstructing the image. All model components are shared, which supports efficient training. Result: The method surpasses current state-of-the-art approaches both visually and on standard objective metrics; with only 1 hour of fMRI data from a new subject it matches methods trained on full 40-hour recordings. Conclusion: Through its brain-inspired design, Brain-IT achieves faithful and data-efficient fMRI-to-image reconstruction, advancing non-invasive brain decoding. Abstract: Reconstructing images seen by people from their fMRI brain recordings provides a non-invasive window into the human brain. Despite recent progress enabled by diffusion models, current methods often lack faithfulness to the actual seen images. We present "Brain-IT", a brain-inspired approach that addresses this challenge through a Brain Interaction Transformer (BIT), allowing effective interactions between clusters of functionally-similar brain-voxels. These functional-clusters are shared by all subjects, serving as building blocks for integrating information both within and across brains. All model components are shared by all clusters and subjects, allowing efficient training with a limited amount of data. To guide the image reconstruction, BIT predicts two complementary localized patch-level image features: (i) high-level semantic features which steer the diffusion model toward the correct semantic content of the image; and (ii) low-level structural features which help to initialize the diffusion process with the correct coarse layout of the image. BIT's design enables direct flow of information from brain-voxel clusters to localized image features. Through these principles, our method achieves image reconstructions from fMRI that faithfully reconstruct the seen images, and surpass current SotA approaches both visually and by standard objective metrics. Moreover, with only 1 hour of fMRI data from a new subject, we achieve results comparable to current methods trained on full 40-hour recordings.
[65] Fine-tuning Segment Anything for Real-Time Tumor Tracking in Cine-MRI
Valentin Boussot,Cédric Hémon,Jean-Claude Nunes,Jean-Louis Dillenseger
Main category: cs.CV
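TL;DR: This work addresses the real-time tumor tracking task of the TrackRAD2025 challenge with a segmentation approach based on SAM 2.1, achieving accurate tracking within a very short runtime; the final model reaches a Dice score of 0.8794 on the hidden test set and ranks 6th.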
Details
Motivation: To track tumors in real time in thoracic and abdominal cine-MRI sequences under strong data scarcity, meeting the strict speed and accuracy requirements of radiotherapy guidance. Method: The prompt-based SAM 2.1 foundation model is used for segmentation, with mask prompts generated from the first annotated frame; the whole model (prompt encoder, decoder, and Hiera backbone) is fine-tuned on the small labeled subset of TrackRAD2025 using 1024x1024 patches, batch size 1, standard augmentations, a Dice + IoU loss, a learning rate of 0.0001, and 300 epochs. Result: The model reaches a Dice Similarity Coefficient of 0.8794 on the hidden test set, ranking 6th; the same inference strategy is applied across anatomical sites and MRI field strengths, and test-time augmentation was discarded because its gains were negligible. Conclusion: Foundation models such as SAM 2.1, fine-tuned on a small amount of labeled data, can deliver accurate tumor tracking under extreme data scarcity and real-time constraints, showing strong potential for MRI-guided radiotherapy. Abstract: In this work, we address the TrackRAD2025 challenge of real-time tumor tracking in cine-MRI sequences of the thoracic and abdominal regions under strong data scarcity constraints. Two complementary strategies were explored: (i) unsupervised registration with the IMPACT similarity metric and (ii) foundation model-based segmentation leveraging SAM 2.1 and its recent variants through prompt-based interaction. Due to the one-second runtime constraint, the SAM-based method was ultimately selected. The final configuration used SAM2.1 b+ with mask-based prompts from the first annotated slice, fine-tuned solely on the small labeled subset from TrackRAD2025. Training was configured to minimize overfitting, using 1024x1024 patches (batch size 1), standard augmentations, and a balanced Dice + IoU loss. A low uniform learning rate (0.0001) was applied to all modules (prompt encoder, decoder, Hiera backbone) to preserve generalization while adapting to annotator-specific styles. Training lasted 300 epochs (~12h on RTX A6000, 48GB). The same inference strategy was consistently applied across all anatomical sites and MRI field strengths. Test-time augmentation was considered but ultimately discarded due to negligible performance gains. The final model was selected based on the highest Dice Similarity Coefficient achieved on the validation set after fine-tuning. On the hidden test set, the model reached a Dice score of 0.8794, ranking 6th overall in the TrackRAD2025 challenge. 
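The "balanced Dice + IoU loss" mentioned in the TrackRAD2025 abstract can be sketched as follows. This is a minimal illustration, not the authors' training code: it assumes "balanced" means equal 0.5 weights on the two terms, operates on flattened lists of predicted foreground probabilities and binary targets, and uses a hypothetical smoothing constant `eps`.

```python
def dice_iou_loss(pred, target, eps=1e-6):
    """Equal-weight sum of soft Dice loss and soft IoU (Jaccard) loss.

    pred:   flat list of predicted foreground probabilities in [0, 1]
    target: flat list of ground-truth labels (0 or 1), same length
    eps:    smoothing term that avoids division by zero (assumed value)
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    pred_sum = sum(pred)
    target_sum = sum(target)

    # Soft Dice coefficient: 2|A ∩ B| / (|A| + |B|)
    dice = (2.0 * intersection + eps) / (pred_sum + target_sum + eps)
    # Soft IoU: |A ∩ B| / |A ∪ B|
    iou = (intersection + eps) / (pred_sum + target_sum - intersection + eps)

    # Assumption: "balanced" is read as equal weighting of both terms.
    return 0.5 * (1.0 - dice) + 0.5 * (1.0 - iou)


# A perfect mask gives a loss near 0; a disjoint mask gives a loss near 1.
print(dice_iou_loss([1.0, 0.0, 1.0, 1.0], [1, 0, 1, 1]))
print(dice_iou_loss([1.0, 0.0], [0, 1]))
```

Both terms reward overlap, but the IoU term penalizes partial overlap more strongly than Dice, which is one common reason for combining them.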
These results highlight the strong potential of foundation models for accurate and real-time tumor tracking in MRI-guided radiotherapy.
[66] Larger Hausdorff Dimension in Scanning Pattern Facilitates Mamba-Based Methods in Low-Light Image Enhancement
Xinhua Wang,Caibo Feng,Xiangjun Fu,Chunxiao Liu
Main category: cs.CV
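TL;DR: The paper improves the Mamba framework with a Hilbert Selective Scan that increases the Hausdorff dimension of the scanning pattern so the feature space is explored more effectively, improving low-light image enhancement while reducing computational cost.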
Details
Motivation: To overcome the weak fine-detail capture and limited spatial locality of existing Mamba-based low-light image enhancement, and to improve information consistency and the modeling of local interactions. Method: A new Hilbert Selective Scan mechanism increases the Hausdorff dimension of the scanning path, improving coverage of the feature space and balancing spatial locality against long-range dependencies. Result: Experiments on public benchmarks show improved quantitative metrics and visual quality together with lower computational resource consumption and shorter inference time. Conclusion: The Hilbert Selective Scan improves Mamba-based low-light image enhancement and may extend to other applications built on Mamba. Abstract: We propose an innovative enhancement to the Mamba framework by increasing the Hausdorff dimension of its scanning pattern through a novel Hilbert Selective Scan mechanism. This mechanism explores the feature space more effectively, capturing intricate fine-scale details and improving overall coverage. As a result, it mitigates information inconsistencies while refining spatial locality to better capture subtle local interactions without sacrificing the model's ability to handle long-range dependencies. Extensive experiments on publicly available benchmarks demonstrate that our approach significantly improves both the quantitative metrics and qualitative visual fidelity of existing Mamba-based low-light image enhancement methods, all while reducing computational resource consumption and shortening inference time. We believe that this refined strategy not only advances the state-of-the-art in low-light image enhancement but also holds promise for broader applications in fields that leverage Mamba-based techniques.
[67] CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments
Rishika Bhagwatkar,Syrielle Montariol,Angelika Romanou,Beatriz Borges,Irina Rish,Antoine Bosselut
Main category: cs.CV
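TL;DR: This paper introduces CAVE, the first benchmark of real-world visual anomalies, which supports anomaly description, explanation, and justification tasks and uses fine-grained annotations to advance anomaly perception and commonsense reasoning in vision-language models.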
Details
Motivation: Existing visual anomaly detection research is limited to industrial defects or synthetic anomalies and does not capture the complexity and unpredictability of real-world anomalies, so a benchmark of real anomalies that is closer to human cognition is needed. Method: The authors build the CAVE benchmark of real-world visual anomalies, define three open-ended tasks (description, explanation, justification), and add fine-grained annotations inspired by cognitive science that cover each anomaly's visual manifestation, complexity, severity, and commonness. Result: Even with advanced prompting strategies, state-of-the-art vision-language models perform poorly on CAVE, revealing weaknesses in anomaly perception and commonsense reasoning. Conclusion: As a realistic, cognitively grounded benchmark, CAVE is a valuable resource for research on anomaly detection and understanding in vision-language models. Abstract: Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks: anomaly description, explanation, and justification, with fine-grained annotations for visual grounding and categorizing anomalies based on their visual manifestations, their complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies. By offering a realistic and cognitively grounded benchmark, CAVE serves as a valuable resource for advancing research in anomaly detection and commonsense reasoning in VLMs.
[68] Climate Adaptation-Aware Flood Prediction for Coastal Cities Using Deep Learning
Bilal Hassan,Areg Karapetyan,Aaron Chung Hin Chow,Samer Madanat
Main category: cs.CV
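TL;DR: The paper proposes a lightweight CNN-based deep learning model that predicts coastal flooding under different sea-level rise scenarios; the model performs well with limited data, generalizes across Abu Dhabi and San Francisco, and reduces the average MAE by nearly 20%.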
Details
Motivation: Traditional hydrodynamic simulators are computationally expensive and impractical for city-scale coastal planning, while existing deep learning methods are constrained by data scarcity and high-dimensional outputs, so a more efficient, accurate, and scalable flood prediction model is needed. Method: Building on a recently proposed vision-based, low-resource deep learning framework, the authors develop a lightweight convolutional neural network (CNN) and train and validate it on datasets from Abu Dhabi and San Francisco to predict flood depth maps under different sea-level rise and shoreline adaptation scenarios. Result: The model reduces the mean absolute error (MAE) of predicted flood depth maps by nearly 20% on average compared with state-of-the-art methods and generalizes across the two geographic regions. Conclusion: The lightweight CNN balances accuracy and efficiency, scales well, and can serve as a practical tool for flood management and decision support in coastal cities facing climate change. Abstract: Climate change and sea-level rise (SLR) pose escalating threats to coastal cities, intensifying the need for efficient and accurate methods to predict potential flood hazards. Traditional physics-based hydrodynamic simulators, although precise, are computationally expensive and impractical for city-scale coastal planning applications. Deep Learning (DL) techniques offer promising alternatives; however, they are often constrained by challenges such as data scarcity and high-dimensional output requirements. Leveraging a recently proposed vision-based, low-resource DL framework, we develop a novel, lightweight Convolutional Neural Network (CNN)-based model designed to predict coastal flooding under variable SLR projections and shoreline adaptation scenarios. Furthermore, we demonstrate the ability of the model to generalize across diverse geographical contexts by utilizing datasets from two distinct regions: Abu Dhabi and San Francisco. Our findings demonstrate that the proposed model significantly outperforms state-of-the-art methods, reducing the mean absolute error (MAE) in predicted flood depth maps on average by nearly 20%. These results highlight the potential of our approach to serve as a scalable and practical tool for coastal flood management, empowering decision-makers to develop effective mitigation strategies in response to the growing impacts of climate change. Project Page: https://caspiannet.github.io/
[69] Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders
Ali Rasekh,Erfan Bagheri Soula,Omid Daliran,Simon Gottschalk,Mohsen Fayyaz
Main category: cs.CV
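TL;DR: This paper proposes STAVEQ2, a Video-LLM architecture that adds stacked temporal attention modules to the vision encoder, improving the model's understanding of action sequences and temporal dynamics in video and raising performance on several video question answering benchmarks by up to +5.5%.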
Details
Motivation: Existing Video-LLMs have critical limitations in understanding complex temporal dynamics and struggle to capture action sequences and temporal progression across frames, which limits their performance on video understanding tasks. Method: Stacked temporal attention modules are introduced inside the vision encoder, so the model can model temporal relationships between frames and the progression of actions before visual tokens are passed to the LLM. Result: The approach outperforms existing models on video question answering benchmarks including VITATECS, MVBench, and Video-MME, especially in action recognition, with gains of up to +5.5%. Conclusion: Strengthening the temporal modeling of the vision encoder addresses a critical gap in temporal understanding for current Video-LLMs and points to a better architectural direction for video understanding. Abstract: Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. This design incorporates temporal attention in the vision encoder, enabling the model to better capture the progression of actions and the relationships between frames before passing visual tokens to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models in video question answering tasks, specifically in action recognition. We improve on benchmarks including VITATECS, MVBench, and Video-MME by up to +5.5%. By enhancing the vision encoder with temporal structure, we address a critical gap in video understanding for Video-LLMs. Project page and code are available at: https://alirasekh.github.io/STAVEQ2/.
[70] FlexICL: A Flexible Visual In-context Learning Framework for Elbow and Wrist Ultrasound Segmentation
Yuyue Zhou,Jessica Knight,Shrimanti Ghosh,Banafshe Felfeliyan,Jacob L. Jaremko,Abhilash R. Hareendranathan
Main category: cs.CV
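TL;DR: Proposes FlexICL, a flexible in-context learning framework for segmenting bony regions in ultrasound images, which significantly outperforms existing models while using only 5% of the training data.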
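The gains for this entry are reported as a Dice coefficient. As a reminder of what that number measures, here is a minimal sketch of the standard Dice overlap between two binary masks. The nested-list mask format and the empty-mask convention are assumptions made for illustration, not details taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    # Pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    total = sum(1 for p in flat_pred if p) + sum(1 for t in flat_target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 2x3 masks: 3 foreground pixels each, 2 of them shared.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = dice_coefficient(pred, target)  # 2 * 2 / (3 + 3)
```

A reported "1-27% Dice" improvement is a difference in this score between two models, averaged over a test set.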
Details
Motivation: Automatic segmentation of bony structures in ultrasound images can improve diagnostic accuracy for pediatric elbow and wrist fractures, but pixel-level expert annotation is time-consuming and costly, so dependence on large amounts of labeled data needs to be reduced.
Method: Proposes the FlexICL framework in an intra-video segmentation setting, where experts annotate only a small number of frames; systematically studies image concatenation techniques and training strategies, introduces new concatenation methods, and combines multiple augmentation strategies for efficient segmentation.
Result: Across four wrist and elbow ultrasound datasets comprising 1,252 ultrasound sweeps, FlexICL uses only 5% of the training images and outperforms Painter, MAE-VQGAN, U-Net, and TransUNet by 1-27% in Dice coefficient.
Conclusion: FlexICL is an efficient, scalable ultrasound image segmentation approach suited to medical imaging settings where labeled data is scarce, with potential for clinical use.
Abstract: Elbow and wrist fractures are the most common fractures in pediatric populations. Automatic segmentation of musculoskeletal structures in ultrasound (US) can improve diagnostic accuracy and treatment planning. Fractures appear as cortical defects but require expert interpretation. Deep learning (DL) can provide real-time feedback and highlight key structures, helping lightly trained users perform exams more confidently. However, pixel-wise expert annotations for training remain time-consuming and costly. To address this challenge, we propose FlexICL, a novel and flexible in-context learning (ICL) framework for segmenting bony regions in US images. We apply it to an intra-video segmentation setting, where experts annotate only a small subset of frames, and the model segments unseen frames. We systematically investigate various image concatenation techniques and training strategies for visual ICL and introduce novel concatenation methods that significantly enhance model performance with limited labeled data. By integrating multiple augmentation strategies, FlexICL achieves robust segmentation performance across four wrist and elbow US datasets while requiring only 5% of the training images. It outperforms state-of-the-art visual ICL models like Painter, MAE-VQGAN, and conventional segmentation models like U-Net and TransUNet by 1-27% Dice coefficient on 1,252 US sweeps. These initial results highlight the potential of FlexICL as an efficient and scalable solution for US image segmentation well suited for medical imaging use cases where labeled data is scarce.
[71] Dynamic VLM-Guided Negative Prompting for Diffusion Models
Hoyeon Chang,Seungjin Kim,Yoonseok Choi
Main category: cs.CV
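TL;DR: Proposes a dynamic negative prompting method based on Vision-Language Models (VLMs) that adaptively generates negative prompts during the denoising process.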
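To make the idea concrete, the sketch below shows the control flow only: a negative prompt that is refreshed at chosen denoising steps, plus the usual classifier-free guidance mix in which the negative-prompt prediction takes the unconditional slot. Everything named here (`run_schedule`, `propose_negative`, `fake_vlm`, the step indices) is hypothetical; the real method would decode an intermediate image and query an actual VLM where the stub is called.

```python
def guided_noise(eps_pos, eps_neg, scale):
    """Classifier-free guidance with the negative-prompt prediction in the unconditional slot:
    eps = eps_neg + scale * (eps_pos - eps_neg)."""
    return [n + scale * (p - n) for p, n in zip(eps_pos, eps_neg)]


def run_schedule(num_steps, query_steps, propose_negative, initial_negative=""):
    """Return the negative prompt in force at each denoising step.

    propose_negative(step) is a hypothetical callback standing in for:
    predict an intermediate image, show it to a VLM, get back a negative prompt.
    """
    negative = initial_negative
    history = []
    for step in range(num_steps):
        if step in query_steps:
            # Refresh the negative prompt from the current state.
            negative = propose_negative(step)
        history.append(negative)
    return history


# Toy stand-in for the VLM: pretends to notice a different artifact at each queried step.
fake_vlm = {2: "blurry", 4: "extra fingers"}
history = run_schedule(6, query_steps={2, 4}, propose_negative=lambda s: fake_vlm[s])
mixed = guided_noise([1.0, 0.0], [0.0, 0.0], scale=7.5)
```

The `scale` argument is the negative guidance strength whose trade-off against text-image alignment the abstract refers to.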
Details
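Motivation: Traditional negative prompting uses fixed prompts, which lack contextual adaptability and limit generation quality and text alignment.
Method: Intermediate image predictions are generated at specific denoising steps, and a VLM is queried to produce context-appropriate negative prompts based on the current state.
Result: The approach is evaluated on several benchmark datasets, demonstrating the trade-off between negative guidance strength and text-image alignment.
Conclusion: Dynamic negative prompting can improve the quality and semantic consistency of generated images more flexibly than fixed negative prompting.
Abstract: We propose a novel approach for dynamic negative prompting in diffusion models that leverages Vision-Language Models (VLMs) to adaptively generate negative prompts during the denoising process. Unlike traditional negative prompting methods that use fixed negative prompts, our method generates intermediate image predictions at specific denoising steps and queries a VLM to produce contextually appropriate negative prompts. We evaluate our approach on various benchmark datasets and demonstrate the trade-offs between negative guidance strength and text-image alignment.
[72] Security Risk of Misalignment between Text and Image in Multi-modal Model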
Xiaosen Wang,Zhijin Ge,Shaokang Wang
Main category: cs.CV
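TL;DR: This paper proposes PReMA, a new multi-modal attack that manipulates the output of multi-modal diffusion models by modifying only the input image and leaving the text prompt unchanged, posing a new threat especially to image-editing applications that use fixed prompts.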
Details
Motivation: Alignment between the text and image modalities in existing text-to-image diffusion models is inadequate, which can lead to unsafe content generation, and their vulnerability to adversarial inputs has not been fully studied.
Method: Proposes the Prompt-Restricted Multi-modal Attack (PReMA), which manipulates model outputs by generating adversarial images rather than modifying the prompt, validated on image inpainting and style transfer tasks.
Result: Experiments show that PReMA effectively produces unintended outputs across multiple models, and in particular induces NSFW content under fixed prompts.
Conclusion: PReMA exposes weaknesses in modality alignment in multi-modal diffusion models and highlights security risks in practical applications; defenses need strengthening, particularly in image-editing scenarios.
Abstract: Despite the notable advancements and versatility of multi-modal diffusion models, such as text-to-image models, their susceptibility to adversarial inputs remains underexplored. Contrary to expectations, our investigations reveal that the alignment between textual and image modalities in existing diffusion models is inadequate. This misalignment presents significant risks, especially in the generation of inappropriate or Not-Safe-For-Work (NSFW) content. To this end, we propose a novel attack called Prompt-Restricted Multi-modal Attack (PReMA) to manipulate the generated content by modifying the input image in conjunction with any specified prompt, without altering the prompt itself. PReMA is the first attack that manipulates model outputs by solely creating adversarial images, distinguishing itself from prior methods that primarily generate adversarial prompts to produce NSFW content. Consequently, PReMA poses a novel threat to the integrity of multi-modal diffusion models, particularly in image-editing applications that operate with fixed prompts. Comprehensive evaluations conducted on image inpainting and style transfer tasks across various models confirm the potent efficacy of PReMA.
[73] EgoExo-Con: Exploring View-Invariant Video Temporal Understanding
Minjoon Jung,Junbin Xiao,Junghyun Kim,Byoung-Tak Zhang,Angela Yao
Main category: cs.CV
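TL;DR: This paper introduces the EgoExo-Con benchmark for evaluating how consistently Video-LLMs understand time across viewpoints, and proposes View-GRPO to improve cross-view consistency.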
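For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage that the base algorithm computes: each sampled response is scored against the mean and standard deviation of its own group. The reward function is a hypothetical toy (per-view correctness plus a bonus when both views are answered correctly) and is not the reward used by View-GRPO; the use of the population standard deviation is also an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO normalization: A_i = (r_i - mean(r)) / (std(r) + eps) within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_cross_view_reward(ego_correct, exo_correct, bonus=0.5):
    """Hypothetical reward: average per-view correctness, plus a bonus if both views agree on the right answer."""
    per_view = 0.5 * (int(ego_correct) + int(exo_correct))
    return per_view + (bonus if ego_correct and exo_correct else 0.0)


# One group of four sampled responses, scored on an (egocentric, exocentric) video pair.
group = [(True, True), (True, False), (False, False), (False, True)]
rewards = [toy_cross_view_reward(e, x) for e, x in group]
advantages = group_relative_advantages(rewards)
```

With a reward shaped like this, the response that is correct on both views gets the largest positive advantage, which is the kind of pressure toward cross-view consistency the TL;DR describes.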
Details
Motivation: To study how consistently existing video large language models understand time across multiple viewpoints; they are found to perform poorly in cross-view reasoning.
Method: Builds the EgoExo-Con benchmark of synchronized egocentric and exocentric video pairs, designs Temporal Verification and Temporal Grounding tasks, and proposes the reinforcement-learning-based View-GRPO framework to strengthen cross-view consistency.
Result: Existing models degrade substantially in cross-view consistency; naive fine-tuning improves consistency at the cost of single-view performance; View-GRPO significantly improves consistency while preserving single-view performance.
Conclusion: Cross-view temporal consistency is a weak point of current Video-LLMs, and View-GRPO offers an effective path toward addressing it.
Abstract: Can Video-LLMs achieve consistent temporal understanding when videos capture the same event from different viewpoints? To study this, we introduce EgoExo-Con (Consistency), a benchmark of comprehensively synchronized egocentric and exocentric video pairs with human-refined queries in natural language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency, with results far worse than their single-view performance; (2) when naively finetuned with synchronized videos of both viewpoints, the models show improved consistency but often underperform those trained on a single view. For improvements, we propose View-GRPO, a novel reinforcement learning framework that effectively strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method demonstrates its superiority over naive SFT and GRPO, especially for improving cross-view consistency. All resources will be made publicly available.
[74] OracleAgent: A Multimodal Reasoning Agent for Oracle Bone Script Research
Caoshuo Li,Zengmao Ding,Xiaobin Hu,Bang Li,Donghao Luo,Xu Peng,Taisong Jin,Yongge Liu,Shengwei Han,Jing Yang,Xiaoping He,Feng Gao,AndyPian Wu,SevenShu,Chaoyang Wang,Chengjie Wang
Main category: cs.CV
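TL;DR: This paper presents OracleAgent, the first agent system for the structured management and retrieval of Oracle Bone Script information, combining large language models with a multimodal knowledge base to significantly improve the efficiency of Oracle Bone Script research.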
Details
Motivation: Oracle Bone Script research faces complex workflows and inefficient information organization and retrieval, and needs automated tool support.
Method: Builds a domain-specific multimodal knowledge base containing over 1.4 million single-character rubbing images and 80 thousand interpretation texts, and designs OracleAgent, an LLM-based agent system that integrates multiple analysis tools for flexible task orchestration and information retrieval.
Result: Experiments show that OracleAgent outperforms mainstream multimodal large language models (e.g., GPT-4o) on multimodal reasoning and generation tasks, and a case study shows that it significantly reduces experts' research time.
Conclusion: OracleAgent is a key step toward automated, practical Oracle Bone Script research and has significant prospects for academic use.
Abstract: As one of the earliest writing systems, Oracle Bone Script (OBS) preserves the cultural and intellectual heritage of ancient civilizations. However, current OBS research faces two major challenges: (1) the interpretation of OBS involves a complex workflow comprising multiple serial and parallel sub-tasks, and (2) the efficiency of OBS information organization and retrieval remains a critical bottleneck, as scholars often spend substantial effort searching for, compiling, and managing relevant resources. To address these challenges, we present OracleAgent, the first agent system designed for the structured management and retrieval of OBS-related information. OracleAgent seamlessly integrates multiple OBS analysis tools, empowered by large language models (LLMs), and can flexibly orchestrate these components. Additionally, we construct a comprehensive domain-specific multimodal knowledge base for OBS, which is built through a rigorous multi-year process of data collection, cleaning, and expert annotation. The knowledge base comprises over 1.4M single-character rubbing images and 80K interpretation texts. OracleAgent leverages this resource through its multimodal tools to assist experts in retrieval tasks of character, document, interpretation text, and rubbing image. Extensive experiments demonstrate that OracleAgent achieves superior performance across a range of multimodal reasoning and generation tasks, surpassing leading mainstream multimodal large language models (MLLMs) (e.g., GPT-4o). Furthermore, our case study illustrates that OracleAgent can effectively assist domain experts, significantly reducing the time cost of OBS research.
These results highlight OracleAgent as a significant step toward the practical deployment of OBS-assisted research and automated interpretation systems.
[75] JOGS: Joint Optimization of Pose Estimation and 3D Gaussian Splatting
Yuxuan Li,Tao Wang,Xianben Yang
Main category: cs.CV
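TL;DR: Proposes a unified framework that jointly optimizes 3D Gaussian points and camera poses without pre-calibrated inputs, outperforming traditional methods in both reconstruction quality and pose accuracy.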
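The two-phase structure behind this kind of joint optimization (update the scene with poses fixed, then update poses with the scene fixed) can be illustrated on a deliberately tiny 1D problem. This is a toy analogue only: the paper's phases use differentiable rendering and a 3D optical flow algorithm, neither of which appears here, and all names and numbers below are made up for illustration.

```python
def alternate(obs, num_iters=100):
    """Recover 1D camera offsets and point positions from obs[i][j] = point_j - camera_i.

    Each iteration runs two interleaved least-squares updates.
    """
    n_cams, n_pts = len(obs), len(obs[0])
    cams = [0.0] * n_cams
    pts = [0.0] * n_pts
    for _ in range(num_iters):
        # Phase 1: poses fixed, update the scene (each point is the mean of its back-projections).
        pts = [sum(obs[i][j] + cams[i] for i in range(n_cams)) / n_cams
               for j in range(n_pts)]
        # Phase 2: scene fixed, update the poses.
        # Camera 0 stays pinned at the origin to remove the global-shift ambiguity.
        cams = [0.0] + [sum(pts[j] - obs[i][j] for j in range(n_pts)) / n_pts
                        for i in range(1, n_cams)]
    return cams, pts


# Noise-free toy data: three cameras observing four points.
true_cams = [0.0, 1.0, -2.0]
true_pts = [3.0, 5.0, 8.0, 13.0]
obs = [[p - c for p in true_pts] for c in true_cams]
cams, pts = alternate(obs)
```

Starting from all-zero poses, the estimates contract toward the true values a little more on every pass, which mirrors the "progressively reduces projection errors" behavior described in the abstract for the real system.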
Details
Motivation: Traditional novel view synthesis methods rely on external pose estimation tools such as COLMAP, which create computational bottlenecks and propagate errors.
Method: Alternates two phases of joint optimization: updating 3D Gaussian parameters via differentiable rendering with fixed poses, and refining camera poses with a customized 3D optical flow algorithm that combines geometric and photometric constraints.
Result: Significantly outperforms existing COLMAP-free methods on multiple datasets and also surpasses the standard COLMAP-based baseline, especially in scenes with large viewpoint changes and sparse features.
Conclusion: The proposed co-optimization strategy improves both scene reconstruction fidelity and camera pose accuracy, mitigating the limitations of traditional methods.
Abstract: Traditional novel view synthesis methods heavily rely on external camera pose estimation tools such as COLMAP, which often introduce computational bottlenecks and propagate errors. To address these challenges, we propose a unified framework that jointly optimizes 3D Gaussian points and camera poses without requiring pre-calibrated inputs. Our approach iteratively refines 3D Gaussian parameters and updates camera poses through a novel co-optimization strategy, ensuring simultaneous improvements in scene reconstruction fidelity and pose accuracy. The key innovation lies in decoupling the joint optimization into two interleaved phases: first, updating 3D Gaussian parameters via differentiable rendering with fixed poses, and second, refining camera poses using a customized 3D optical flow algorithm that incorporates geometric and photometric constraints. This formulation progressively reduces projection errors, particularly in challenging scenarios with large viewpoint variations and sparse feature distributions, where traditional methods struggle. Extensive evaluations on multiple datasets demonstrate that our approach significantly outperforms existing COLMAP-free techniques in reconstruction quality, and also surpasses the standard COLMAP-based baseline in general.
[76] WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios
Runsheng Xu,Hubert Lin,Wonseok Jeon,Hao Feng,Yuliang Zou,Liting Sun,John Gorman,Kate Tolstaya,Sarah Tang,Brandyn White,Ben Sapp,Mingxing Tan,Jyh-Jing Hwang,Drago Anguelov
Main category: cs.CV
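TL;DR: This paper introduces WOD-E2E, a new dataset for end-to-end autonomous driving focused on rare, challenging long-tail scenarios, together with a new open-loop evaluation metric, Rater Feedback Score (RFS), which uses trajectory preference labels from human raters to evaluate driving performance more effectively.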
Details
Motivation: Existing end-to-end driving benchmarks mostly cover nominal scenarios and do not adequately test rare but critical long-tail scenarios, and traditional evaluation metrics struggle to reflect multi-modal driving behavior and performance in complex situations. A dataset focused on challenging scenarios and a more reasonable evaluation method are therefore needed.
Method: Builds WOD-E2E, a dataset of 4,021 driving segments (about 12 hours) curated for long-tail scenarios that occur with a frequency below 0.03%; each segment includes high-level routing information, ego states, and 360-degree camera views; proposes Rater Feedback Score (RFS), a new metric that evaluates predictions against trajectory preference labels annotated by human raters.
Result: Rater preference labels are released for all WOD-E2E validation set segments, while test set labels are used for the 2025 WOD-E2E Challenge; the work provides a more challenging benchmark and an evaluation method closer to human judgment.
Conclusion: WOD-E2E and RFS give end-to-end driving research a stricter, more realistic evaluation setting and help advance generalizable, robust, and safe autonomous driving systems.
Abstract: Vision-based end-to-end (E2E) driving has garnered significant interest in the research community due to its scalability and synergy with multimodal large language models (MLLMs). However, current E2E driving benchmarks primarily feature nominal scenarios, failing to adequately test the true potential of these systems. Furthermore, existing open-loop evaluation metrics often fall short in capturing the multi-modal nature of driving or effectively evaluating performance in long-tail scenarios. To address these gaps, we introduce the Waymo Open Dataset for End-to-End Driving (WOD-E2E). WOD-E2E contains 4,021 driving segments (approximately 12 hours), specifically curated for challenging long-tail scenarios that are rare in daily life, with an occurrence frequency of less than 0.03%. Concretely, each segment in WOD-E2E includes the high-level routing information, ego states, and 360-degree camera views from 8 surrounding cameras. To evaluate the E2E driving performance on these long-tail situations, we propose a novel open-loop evaluation metric: Rater Feedback Score (RFS). Unlike conventional metrics that measure the distance between predicted waypoints and the logs, RFS measures how closely the predicted trajectory matches rater-annotated trajectory preference labels. We have released rater preference labels for all WOD-E2E validation set segments, while the held-out test set labels have been used for the 2025 WOD-E2E Challenge.
Through our work, we aim to foster state-of-the-art research into generalizable, robust, and safe end-to-end autonomous driving agents capable of handling complex real-world situations.
[77] Exploring Object-Aware Attention Guided Frame Association for RGB-D SLAM
Ali Caglayan,Nevrez Imamoglu,Oguzhan Guclu,Ali Osman Serhatoglu,Ahmet Burak Can,Ryosuke Nakamura
Main category: cs.CV
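TL;DR: This paper proposes integrating gradient-based attention information into CNN representations for RGB-D indoor SLAM, improving frame association performance.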
Details
Motivation: Existing attention visualization techniques provide visual explanations, but gradient-based attention information has not yet been explicitly integrated into CNN representations to improve semantic understanding, leaving room for improvement in tasks such as SLAM.
Method: Layer-wise attention information is generated by combining network gradients with CNN features and is integrated into the CNN representations to strengthen awareness of salient regions; the result is applied to frame matching in RGB-D SLAM.
Result: Experiments show improved frame association performance over baseline methods in large environments.
Conclusion: Explicitly integrating task-specific gradient attention information strengthens the semantic expressiveness of CNNs in complex scenes and benefits spatial perception tasks such as SLAM.
Abstract: Attention models have recently emerged as a powerful approach, demonstrating significant progress in various fields. Visualization techniques, such as class activation mapping, provide visual insights into the reasoning of convolutional neural networks (CNNs). Using network gradients, it is possible to identify regions where the network pays attention during image recognition tasks. Furthermore, these gradients can be combined with CNN features to localize more generalizable, task-specific attentive (salient) regions within scenes. However, explicit use of this gradient-based attention information integrated directly into CNN representations for semantic object understanding remains limited. Such integration is particularly beneficial for visual tasks like simultaneous localization and mapping (SLAM), where CNN representations enriched with spatially attentive object locations can enhance performance. In this work, we propose utilizing task-specific network attention for RGB-D indoor SLAM. Specifically, we integrate layer-wise attention information derived from network gradients with CNN feature representations to improve frame association performance. Experimental results indicate improved performance compared to baseline methods, particularly for large environments.
[78] FullPart: Generating each 3D Part at Full Resolution
Lihe Ding,Shaocong Dong,Yaokun Li,Chenjian Gao,Xiao Chen,Rui Han,Yihao Kuang,Hong Zhang,Bo Huang,Zhanpeng Huang,Zibin Wang,Dan Xu,Tianfan Xue
Main category: cs.CV
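TL;DR: This paper proposes FullPart, a new 3D part generation framework that combines implicit and explicit approaches, generating each part in its own full-resolution voxel grid and introducing a center-point encoding strategy to maintain global coherence; it achieves state-of-the-art performance in 3D part generation.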
Details
Motivation: Existing part-based 3D generation methods fall short in geometric detail or in the quality of small parts, and a more effective representation is needed to improve generation quality.
Method: First generates the bounding box layout with an implicit box vector-set diffusion process, then generates detailed parts each in its own full-resolution voxel grid, using a center-point encoding strategy to resolve misalignment when exchanging information between parts of different sizes.
Result: Across multiple experiments, FullPart achieves state-of-the-art results in 3D part generation and produces high-quality parts with intricate detail.
Conclusion: FullPart combines the strengths of implicit and explicit representations, addresses the limitations of existing methods on detail and small parts, and advances 3D part generation.
Abstract: Part-based 3D generation holds great potential for various applications. Previous part generators that represent parts using implicit vector-set tokens often suffer from insufficient geometric details. Another line of work adopts an explicit voxel representation but shares a global voxel grid among all parts; this often causes small parts to occupy too few voxels, leading to degraded quality. In this paper, we propose FullPart, a novel framework that combines both implicit and explicit paradigms. It first derives the bounding box layout through an implicit box vector-set diffusion process, a task that implicit diffusion handles effectively since box tokens contain little geometric detail. Then, it generates detailed parts, each within its own fixed full-resolution voxel grid. Instead of sharing a global low-resolution space, each part in our method, even a small one, is generated at full resolution, enabling the synthesis of intricate details. We further introduce a center-point encoding strategy to address the misalignment issue when exchanging information between parts of different actual sizes, thereby maintaining global coherence. Moreover, to tackle the scarcity of reliable part data, we present PartVerse-XL, the largest human-annotated 3D part dataset to date with 40K objects and 320K parts. Extensive experiments demonstrate that FullPart achieves state-of-the-art results in 3D part generation. We will release all code, data, and model to benefit future research in 3D part generation.
[79] BasicAVSR: Arbitrary-Scale Video Super-Resolution via Image Priors and Enhanced Motion Compensation
Wei Shang,Wanying Zhang,Shuhang Gu,Pengfei Zhu,Qinghua Hu,Dongwei Ren
Main category: cs.CV
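TL;DR: This paper proposes BasicAVSR, a strong baseline for arbitrary-scale video super-resolution (AVSR) with four key components: adaptive multi-scale frequency priors, a flow-guided propagation unit, second-order motion compensation, and a hyper-upsampling unit. Three propagation variants are designed for different application scenarios. Experiments show that the method significantly outperforms existing methods in quality, generalization ability, and inference speed.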
Details
Motivation: To address the challenges AVSR faces in spatial detail reproduction, temporal consistency, and computational complexity, and to provide a general, efficient baseline model.
Method: Generates adaptive multi-scale frequency priors from image Laplacian pyramids, aggregates spatiotemporal information with flow-guided propagation, adds second-order motion compensation for more accurate alignment, and designs a hyper-upsampling unit that generates scale-aware upsampling kernels; three RNN propagation variants are built for online, limited-delay, and offline scenarios.
Result: BasicAVSR significantly outperforms existing methods on multiple benchmarks, generalizes well, runs faster at inference, and its components can be extended to other frameworks.
Conclusion: BasicAVSR establishes a strong, flexible baseline for AVSR, advances the field, and has broad application potential.
Abstract: Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we propose a strong baseline BasicAVSR for AVSR by integrating four key components: 1) adaptive multi-scale frequency priors generated from image Laplacian pyramids, 2) a flow-guided propagation unit to aggregate spatiotemporal information from adjacent frames, 3) a second-order motion compensation unit for more accurate spatial alignment of adjacent frames, and 4) a hyper-upsampling unit to generate scale-aware and content-independent upsampling kernels. To meet diverse application demands, we instantiate three propagation variants: (i) a unidirectional RNN unit for strictly online inference, (ii) a unidirectional RNN unit empowered with a limited lookahead that tolerates a small output delay, and (iii) a bidirectional RNN unit designed for offline tasks where computational resources are less constrained. Experimental results demonstrate the effectiveness and adaptability of our model across these different scenarios. Through extensive experiments, we show that BasicAVSR significantly outperforms existing methods in terms of super-resolution quality, generalization ability, and inference speed. Our work not only advances the state-of-the-art in AVSR but also extends its core components to multiple frameworks for diverse scenarios.
The code is available at https://github.com/shangwei5/BasicAVSR.
[80] MV-MLM: Bridging Multi-View Mammography and Language for Breast Cancer Diagnosis and Risk Prediction
Shunjie-Fabian Zheng,Hyeonjun Lee,Thijs Kooi,Ali Diba
Main category: cs.CV
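TL;DR: Proposes MV-MLM, a vision-language model built on multi-view mammograms and synthetic reports for breast cancer classification and risk prediction, achieving state-of-the-art performance without real radiology reports.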
Details
Motivation: Acquiring large, finely annotated medical datasets is costly and time-consuming, which limits the development of CAD models, so a more data-efficient approach is needed.
Method: Builds a joint multi-view mammography and language model (MV-MLM) that uses cross-modal self-supervised learning, trains jointly on image-text pairs with synthetic radiology reports, and uses multi-view supervision and pseudo-reports to strengthen representation learning.
Result: Evaluated on private and public datasets, the model reaches SOTA performance on three tasks (malignancy classification, subtype classification, and image-based cancer risk prediction) and shows strong data efficiency, outperforming fully supervised and other VLM baselines.
Conclusion: MV-MLM achieves efficient cross-modal learning from synthetic text reports, markedly improving the generalization and accuracy of breast cancer diagnosis models and reducing dependence on real annotated reports.
Abstract: Large annotated datasets are essential for training robust Computer-Aided Diagnosis (CAD) models for breast cancer detection or risk prediction. However, acquiring such datasets with fine-detailed annotation is both costly and time-consuming. Vision-Language Models (VLMs), such as CLIP, which are pre-trained on large image-text pairs, offer a promising solution by enhancing robustness and data efficiency in medical imaging tasks. This paper introduces a novel Multi-View Mammography and Language Model for breast cancer classification and risk prediction, trained on a dataset of paired mammogram images and synthetic radiology reports. Our MV-MLM leverages multi-view supervision to learn rich representations from extensive radiology data by employing cross-modal self-supervision across image-text pairs. This includes multiple views and the corresponding pseudo-radiology reports. We propose a novel joint visual-textual learning strategy to enhance generalization and accuracy performance over different data types and tasks to distinguish breast tissues or cancer characteristics (calcification, mass) and utilize these patterns to understand mammography images and predict cancer risk. We evaluated our method on both private and publicly available datasets, demonstrating that the proposed model achieves state-of-the-art performance in three classification tasks: (1) malignancy classification, (2) subtype classification, and (3) image-based cancer risk prediction.
Furthermore, the model exhibits strong data efficiency, outperforming existing fully supervised or VLM baselines while trained on synthetic text reports and without the need for actual radiology reports.
[81] Detecting Unauthorized Vehicles using Deep Learning for Smart Cities: A Case Study on Bangladesh
Sudipto Das Sukanto,Diponker Roy,Fahim Shakil,Nirjhar Singha,Abdullah Asik,Aniket Joarder,Mridha Md Nafis Fuad,Muhammad Ibrahim
Main category: cs.CV
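TL;DR: This paper proposes a machine learning approach based on the YOLOv8 model for real-time detection of motorized rickshaws (auto-rickshaws) in traffic images. On a self-built dataset of 1,730 annotated images it reaches an mAP50 of 83.447% with precision and recall above 78%, and the dataset has been publicly released.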
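The "50" in mAP50 refers to an intersection-over-union (IoU) threshold of 0.5 for deciding whether a predicted box counts as a hit. The sketch below shows that matching step and the resulting precision and recall for a single image. It is a simplified illustration, not the evaluation code behind the reported numbers: the box format `(x1, y1, x2, y2)`, the greedy one-to-one matching, and the toy boxes are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, truths, threshold=0.5):
    """Greedy one-to-one matching; preds are assumed sorted by descending confidence."""
    matched = set()
    true_pos = 0
    for box in preds:
        best_j, best_iou = None, threshold
        for j, gt in enumerate(truths):
            if j in matched:
                continue  # each ground-truth box can be claimed only once
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            matched.add(best_j)
            true_pos += 1
    precision = true_pos / len(preds) if preds else 0.0
    recall = true_pos / len(truths) if truths else 0.0
    return precision, recall


# Toy image: two ground-truth auto-rickshaws, three predictions (one good, two spurious).
truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 10, 10), (50, 50, 60, 60), (0, 0, 4, 4)]
precision, recall = precision_recall(preds, truths)
```

mAP50 goes one step further by sweeping the confidence threshold to trace a precision-recall curve, taking the area under it per class, and averaging over classes.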
Details
Motivation: Auto-rickshaws are often subject to route restrictions, but existing surveillance systems have difficulty distinguishing them from other vehicles (especially non-auto rickshaws), and manual video analysis is time-consuming, so an automated detection method is needed.
Method: Uses YOLOv8 for real-time object detection and builds a dataset of 1,730 annotated images captured under various traffic conditions for training and testing.
Result: The model performs well in both dense and sparse traffic, with an mAP50 of 83.447% and binary precision and recall both above 78%.
Conclusion: The proposed YOLOv8 model detects auto-rickshaws automatically and effectively and has practical potential; the public dataset supports follow-up research.
Abstract: Modes of transportation vary across countries depending on geographical location and cultural context. In South Asian countries, rickshaws are among the most common means of local transport. Based on their mode of operation, rickshaws in cities across Bangladesh can be broadly classified into non-auto (pedal-powered) and auto-rickshaws (motorized). Monitoring the movement of auto-rickshaws is necessary as traffic rules often restrict auto-rickshaws from accessing certain routes. However, existing surveillance systems make it quite difficult to monitor them due to their similarity to other vehicles, especially non-auto rickshaws, while manual video analysis is too time-consuming. This paper presents a machine learning-based approach to automatically detect auto-rickshaws in traffic images. In this system, we perform real-time object detection with the YOLOv8 model. For training purposes, we prepared a set of 1,730 annotated images that were captured under various traffic conditions. The results show that our proposed model performs well in real-time auto-rickshaw detection and offers an mAP50 of 83.447% and binary precision and recall values above 78%, demonstrating its effectiveness in handling both dense and sparse traffic scenarios. The dataset has been publicly released for further research.
[82] CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark
Jiaqi Wang,Xiao Yang,Kai Sun,Parth Suresh,Sanat Sharma,Adam Czyzewski,Derek Andersen,Surya Appini,Arkav Banerjee,Sajal Choudhary,Shervin Ghasemlou,Ziqiang Guan,Akil Iyer,Haidar Khan,Lingkun Kong,Roy Luo,Tiffany Ma,Zhen Qiao,David Tran,Wenfang Xu,Skyler Yeatman,Chen Zhou,Gunveer Gujral,Yinglong Xia,Shane Moon,Nicolas Scheffer,Nirav Shah,Eun Chang,Yue Liu,Florian Metze,Tammy Stark,Zhaleh Feizollahi,Andrea Jessee,Mangesh Pujari,Ahmed Aly,Babak Damavandi,Rakesh Wanga,Anuj Kumar,Rohit Patel,Wen-tau Yih,Xin Luna Dong
Main category: cs.CV
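TL;DR: This paper presents CRAG-MM, a comprehensive multi-modal, multi-turn retrieval-augmented generation benchmark for wearable-device scenarios. It contains 6.5K (image, question, answer) triplets and 2K multi-turn conversations across 13 domains, defines three tasks for evaluating system performance, and shows that existing methods still have considerable room for improvement in realistic settings.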
Details
Motivation: Existing multi-modal retrieval-augmented generation (MM-RAG) work lacks a comprehensive benchmark for wearable-device scenarios, making it hard to evaluate systems in realistic multi-turn interaction; a challenging benchmark close to real use is needed.
Method: Builds a dataset of 6.5K (image, question, answer) triplets and 2K multi-turn conversations across 13 domains, with 6.2K egocentric images that mimic captures from wearable devices; designs three tasks (single-source augmentation, multi-source augmentation, and multi-turn conversations) and provides the associated retrieval corpus and APIs for image-KG retrieval and webpage retrieval.
Result: Straightforward RAG approaches reach only 32% and 43% truthfulness on CRAG-MM single-turn and multi-turn QA, respectively, and state-of-the-art industry solutions reach only 32%/45%; KDD Cup 2025, built on the benchmark, attracted about 1,000 participants and 5,000 submissions, with winning solutions improving baseline performance by 28%.
Conclusion: CRAG-MM fills the gap in multi-modal multi-turn RAG benchmarks for wearable scenarios; it is diverse, realistic, and challenging, has already shown potential to advance the field, and offers an important platform for future research.
Abstract: Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially regarding wearables scenarios. To fill this gap, we present CRAG-MM, a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations, each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement.
The benchmark has hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.
[83] MoTDiff: High-resolution Motion Trajectory estimation from a single blurred image using Diffusion models
Wontae Choi,Jaelin Lee,Hyung Sup Yun,Byeungwoo Jeon,Il Yong Chun
Main category: cs.CV
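TL;DR: This paper proposes MoTDiff, the first diffusion-based framework for high-resolution motion trajectory estimation, which recovers a high-quality motion trajectory from a single motion-blurred image and outperforms existing methods in blind deblurring and coded exposure photography.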
Details
Motivation: Existing methods for extracting motion information from a single blurred image (such as blur kernels or optical flow) tend to be inaccurate and coarse, and cannot meet the needs of high-precision computational imaging.
Method: Proposes the MoTDiff framework with two key parts: 1) a new conditional diffusion model conditioned on multi-scale feature maps; 2) a new training strategy that promotes precise identification of fine-grained motion trajectories, consistent shape and position of the motion path, and pixel connectivity along the trajectory.
Result: Experiments show that MoTDiff outperforms state-of-the-art methods in both blind deblurring and coded exposure photography, producing more accurate, higher-resolution motion trajectories.
Conclusion: MoTDiff is the first framework to use diffusion models for high-resolution motion trajectory estimation; it markedly improves the quality of single-image motion estimation and performs well in downstream vision tasks.
Abstract: Accurate estimation of motion information is crucial in diverse computational imaging and computer vision applications. Researchers have investigated various methods to extract motion information from a single blurred image, including blur kernels and optical flow. However, existing motion representations are often of low quality, i.e., coarse-grained and inaccurate. In this paper, we propose the first high-resolution (HR) Motion Trajectory estimation framework using Diffusion models (MoTDiff). Different from existing motion representations, we aim to estimate an HR motion trajectory of high quality from a single motion-blurred image. The proposed MoTDiff consists of two key components: 1) a new conditional diffusion framework that uses multi-scale feature maps extracted from a single blurred image as a condition, and 2) a new training method that can promote precise identification of a fine-grained motion trajectory, consistent estimation of overall shape and position of a motion path, and pixel connectivity along a motion trajectory. Our experiments demonstrate that the proposed MoTDiff can outperform state-of-the-art methods in both blind image deblurring and coded exposure photography applications.
[84] ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts
Jinho Choi,Hyesu Lim,Steffen Schneider,Jaegul Choo
Main category: cs.CV
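TL;DR: ConceptScope is a scalable, automated framework that uses sparse autoencoders to discover and quantify interpretable concepts in visual datasets, in order to identify dataset bias and evaluate model robustness.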
Details
Motivation: Datasets in machine learning are commonly skewed toward certain concepts, and fine-grained attribute annotation is costly, so a method is needed that identifies these biases systematically without manual annotation.
Method: Proposes the ConceptScope framework, which uses sparse autoencoders trained on representations from vision foundation models to discover and quantify human-interpretable visual concepts, and categorizes concepts into target, context, and bias types according to their semantic relevance and statistical correlation with class labels.
Result: ConceptScope is shown to capture a wide range of visual concepts, including objects, textures, backgrounds, facial attributes, emotions, and actions, and to produce spatial attributions aligned with semantically meaningful image regions; it detects known biases (e.g., background bias in Waterbirds) and previously unannotated ones (e.g., co-occurring objects in ImageNet).
Conclusion: ConceptScope offers a practical tool for dataset auditing and model diagnostics, enabling fine-grained dataset analysis and bias identification without manual annotation.
Abstract: Dataset bias, where data points are skewed to certain concepts, is ubiquitous in machine learning datasets. Yet, systematically identifying these biases is challenging without costly, fine-grained attribute annotations. We present ConceptScope, a scalable and automated framework for analyzing visual datasets by discovering and quantifying human-interpretable concepts using Sparse Autoencoders trained on representations from vision foundation models. ConceptScope categorizes concepts into target, context, and bias types based on their semantic relevance and statistical correlation to class labels, enabling class-level dataset characterization, bias identification, and robustness evaluation through concept-based subgrouping. We validate that ConceptScope captures a wide range of visual concepts, including objects, textures, backgrounds, facial attributes, emotions, and actions, through comparisons with annotated datasets. Furthermore, we show that concept activations produce spatial attributions that align with semantically meaningful image regions. ConceptScope reliably detects known biases (e.g., background bias in Waterbirds) and uncovers previously unannotated ones (e.g., co-occurring objects in ImageNet), offering a practical tool for dataset auditing and model diagnostics.
[85] Sketch2PoseNet: Efficient and Generalized Sketch to 3D Human Pose Prediction
Li Wang,Yiyu Zhuang,Yanwen Wang,Xun Cao,Chuan Guo,Xinxin Zuo,Hao Zhu
Main category: cs.CV
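TL;DR: Proposes an end-to-end learning approach based on synthetic data: a diffusion model is used to generate SKEP-120K, a large-scale sketch-3D pose dataset, enabling efficient and accurate estimation of human pose and shape from diverse sketch styles.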
Details
Motivation: Traditional sketch-to-3D-pose methods are constrained by the lack of large-scale annotated data and rely on optimization with heuristic rules, which is time-consuming and generalizes poorly.
Method: Adopts a "learn from synthesis" strategy: a diffusion model first generates sketches from 2D poses projected from 3D poses, producing the synthetic dataset SKEP-120K; on this dataset, an end-to-end data-driven framework combines 2D pose detectors, diffusion priors, and a feed-forward neural network, with several heuristic losses to ensure geometric coherence and accurate self-contacts.
Result: The method substantially outperforms previous methods in qualitative, quantitative, and subjective evaluations, markedly improving the accuracy and speed of sketch-to-pose estimation.
Conclusion: The synthetic-data-driven strategy addresses the scarcity of sketch-3D pose annotations and achieves efficient, accurate human pose and shape estimation across diverse sketch styles.
Abstract: 3D human pose estimation from sketches has broad applications in computer animation and film production. Unlike traditional human pose estimation, this task presents unique challenges due to the abstract and disproportionate nature of sketches. Previous sketch-to-pose methods, constrained by the lack of large-scale sketch-3D pose annotations, primarily relied on optimization with heuristic rules, an approach that is both time-consuming and limited in generalizability. To address these challenges, we propose a novel approach leveraging a "learn from synthesis" strategy. First, a diffusion model is trained to synthesize sketch images from 2D poses projected from 3D human poses, mimicking disproportionate human structures in sketches. This process enables the creation of a synthetic dataset, SKEP-120K, consisting of 120k accurate sketch-3D pose annotation pairs across various sketch styles. Building on this synthetic dataset, we introduce an end-to-end data-driven framework for estimating human poses and shapes from diverse sketch styles. Our framework combines existing 2D pose detectors and generative diffusion priors for sketch feature extraction with a feed-forward neural network for efficient 2D pose estimation. Multiple heuristic loss functions are incorporated to guarantee geometric coherence between the derived 3D poses and the detected 2D poses while preserving accurate self-contacts.
Qualitative, quantitative, and subjective evaluations collectively show that our model substantially surpasses previous ones in both estimation accuracy and speed for sketch-to-pose tasks.
[86] Developing a Multi-task Ensemble Geometric Deep Network for Supply Chain Sustainability and Risk Management
Mehdi Khaleghi,Nastaran Khaleghi,Sobhan Sheykhivand,Sebelan Danishvar
Main category: cs.CV
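TL;DR: Proposes a hybrid deep learning model based on the Chebyshev ensemble geometric network (Ch-EGN) to improve supply chain sustainability and risk management; experiments show it outperforms existing methods on several tasks.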
Details
Motivation: 为提升供应链的可持续性和运行效率,需有效管理风险并准确分类产品,传统方法难以充分挖掘供应链数据中的复杂依赖关系。 Method: 提出一种融合卷积神经网络与几何深度学习的新型Ch-EGN模型,利用图结构和Chebyshev多项式捕捉供应链中样本间的隐含状态与信息依赖,并在两个真实数据集(SupplyGraph和DataCo)上进行风险预测、产品分类和边分类任务。 Result: 在风险预测任务中平均准确率达98.95%;在产品分类(5类)和产品关系分类(4类)任务中分别达到100%和98.07%的准确率;在企业关系分类(25类)中达到92.37%的准确率,整体性能优于现有最先进方法。 Conclusion: 所提出的Ch-EGN模型能有效建模供应链中的复杂依赖关系,在风险管理和可持续性提升方面表现出卓越性能,具有实际应用潜力。 Abstract: The sustainability of supply chain plays a key role in achieving optimal performance in controlling the supply chain. The management of risks that occur in a supply chain is a fundamental problem for the purpose of developing the sustainability of the network and elevating the performance efficiency of the supply chain. The correct classification of products is another essential element in a sustainable supply chain. Acknowledging recent breakthroughs in the context of deep networks, several architectural options have been deployed to analyze supply chain datasets. A novel geometric deep network is used to propose an ensemble deep network. The proposed Chebyshev ensemble geometric network (Ch-EGN) is a hybrid convolutional and geometric deep learning. This network is proposed to leverage the information dependencies in supply chain to derive invisible states of samples in the database. The functionality of the proposed deep network is assessed on the two different databases. The SupplyGraph Dataset and DataCo are considered in this research. The prediction of delivery status of DataCo supply chain is done for risk administration. The product classification and edge classification are performed using the SupplyGraph database to enhance the sustainability of the supply network. An average accuracy of 98.95% is obtained for the ensemble network for risk management. The average accuracy of 100% and 98.07% are obtained for sustainable supply chain in terms of 5 product group classification and 4 product relation classification, respectively. 
An average accuracy of 92.37% is attained for 25-class company relation classification. The results confirm the improvement and efficiency of the proposed method compared with state-of-the-art approaches.
[87] OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation
Hengrui Kang,Zhuangcheng Gu,Zhiyuan Zhao,Zichen Wen,Bin Wang,Weijia Li,Conghui He
Main category: cs.CV
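TL;DR: Presents OmniLayout-1M, the first million-scale dataset of diverse document layouts, and OmniLayout-LLM, a 0.5B model with a coarse-to-fine learning paradigm, to address the weaknesses of existing document layout generation methods in complex domains and long-sequence arrangement.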
Details
Motivation: 现有的文档布局生成研究主要集中在学术论文等有限类型上,缺乏对报纸、杂志等开放世界文档类型的覆盖,且现有方法难以处理复杂布局和长序列排列。 Method: 构建了包含六种常见文档类型的OmniLayout-1M数据集,并提出OmniLayout-LLM模型,采用两阶段的粗到细学习范式:首先从大规模数据中学习通用布局原则,然后迁移到特定领域进行精细化生成。 Result: 在M$^{6}$Doc数据集的多个领域上,该方法显著优于现有的布局生成专家模型和多种最新通用大语言模型。 Conclusion: OmniLayout-1M和OmniLayout-LLM有效推动了多样化文档布局生成的发展,展现出强大的跨领域生成能力和应用潜力。 Abstract: Document AI has advanced rapidly and is attracting increasing attention. Yet, while most efforts have focused on document layout analysis (DLA), its generative counterpart, document layout generation, remains underexplored. A major obstacle lies in the scarcity of diverse layouts: academic papers with Manhattan-style structures dominate existing studies, while open-world genres such as newspapers and magazines remain severely underrepresented. To address this gap, we curate OmniLayout-1M, the first million-scale dataset of diverse document layouts, covering six common document types and comprising contemporary layouts collected from multiple sources. Moreover, since existing methods struggle in complex domains and often fail to arrange long sequences coherently, we introduce OmniLayout-LLM, a 0.5B model with designed two-stage Coarse-to-Fine learning paradigm: 1) learning universal layout principles from OmniLayout-1M with coarse category definitions, and 2) transferring the knowledge to a specific domain with fine-grained annotations. Extensive experiments demonstrate that our approach achieves strong performance on multiple domains in M$^{6}$Doc dataset, substantially surpassing both existing layout generation experts and several latest general-purpose LLMs. Our code, models, and dataset will be publicly released.[88] Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Shiho Matta,Lis Kanashiro Pereira,Peitao Han,Fei Cheng,Shigeru Kitazawa
Main category: cs.CV
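TL;DR: Introduces AoT-PsyPhyBENCH, a new benchmark that evaluates whether vision-language models (VLMs) can judge the temporal direction of a video (played forward or backward). Existing models fall far short of humans on physically irreversible processes and causal manual actions, exposing fundamental deficits in temporal continuity and causal understanding.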
Details
Motivation: 尽管现代视觉语言模型在多模态任务中表现出色,但其对视频中时间信息的理解能力较弱且缺乏充分评估。为此,作者旨在通过‘时间之箭’这一简单而深刻的挑战,揭示模型在时间推理方面的不足。 Method: 构建了一个经过心理物理学验证的基准AoT-PsyPhyBENCH,使用与人类实验相同的刺激材料和行为基线,系统评估多种开源和专有、推理与非推理型视觉语言模型在自然视频中判断时间方向的能力。 Result: 大多数模型的表现接近随机猜测,即使最优模型在物理不可逆过程(如自由落体、扩散/爆炸)和因果手动动作(如分割/添加)上的准确率也远低于人类。 Conclusion: 当前的多模态系统虽能捕捉丰富的视觉语义关联,但缺乏支持时间连续性和因果理解的归纳偏置,亟需改进其物理与时间推理能力。 Abstract: Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT)-whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.[89] Revisiting Generative Infrared and Visible Image Fusion Based on Human Cognitive Laws
Lin Guo,Xiaoqing Luo,Wei Xie,Zhancheng Zhang,Hui Li,Rui Wang,Zhenhua Feng,Xiaoning Song
Main category: cs.CV
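TL;DR: Proposes HCLFuse, an infrared and visible image fusion method inspired by human cognition, which combines a multi-scale mask-regulated variational bottleneck encoder with a time-varying physics-guided diffusion model to produce high-quality, structurally consistent fusion results.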
Details
Motivation: 现有融合方法在模态信息平衡和可解释性方面存在不足,生成能力有限且可靠性不高,难以应对复杂场景下的融合需求。 Method: 提出HCLFuse,设计多尺度掩码调控的变分瓶颈编码器进行信息分解与低层特征提取,并结合扩散模型与物理规律构建时变物理引导机制,实现对生成过程的自适应调控。 Result: 在多个数据集上实现了最先进的定性和定量融合性能,显著提升了语义分割指标,验证了方法在结构一致性和细节质量上的优势。 Conclusion: 受人类认知启发的生成式图像融合框架能有效提升融合图像的质量和可靠性,为多模态图像融合提供了新的思路。 Abstract: Existing infrared and visible image fusion methods often face the dilemma of balancing modal information. Generative fusion methods reconstruct fused images by learning from data distributions, but their generative capabilities remain limited. Moreover, the lack of interpretability in modal information selection further affects the reliability and consistency of fusion results in complex scenarios. This manuscript revisits the essence of generative image fusion under the inspiration of human cognitive laws and proposes a novel infrared and visible image fusion method, termed HCLFuse. First, HCLFuse investigates the quantification theory of information mapping in unsupervised fusion networks, which leads to the design of a multi-scale mask-regulated variational bottleneck encoder. This encoder applies posterior probability modeling and information decomposition to extract accurate and concise low-level modal information, thereby supporting the generation of high-fidelity structural details. Furthermore, the probabilistic generative capability of the diffusion model is integrated with physical laws, forming a time-varying physical guidance mechanism that adaptively regulates the generation process at different stages, thereby enhancing the ability of the model to perceive the intrinsic structure of data and reducing dependence on data quality. Experimental results show that the proposed method achieves state-of-the-art fusion performance in qualitative and quantitative evaluations across multiple datasets and significantly improves semantic segmentation metrics. 
This demonstrates the advantages of the proposed generative image fusion method, which draws inspiration from human cognition, in enhancing structural consistency and detail quality.
[90] Exploring Complementarity and Explainability in CNNs for Periocular Verification Across Acquisition Distances
Fernando Alonso-Fernandez,Kevin Hernandez Diaz,Jose M. Buades,Kiran Raja,Josef Bigun
Main category: cs.CV
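TL;DR: Studies the complementarity of CNNs for periocular verification at different distances on the UBIPr database; fusing three architectures (SqueezeNet, MobileNetv2, and ResNet50) and analysing their attention with LIME heatmaps yields a new best result.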
Details
Motivation: 探索不同复杂度的卷积神经网络在不同拍摄距离下眼周识别任务中的互补性,并提升跨距离验证的准确性。 Method: 使用VGGFace2数据集预训练SqueezeNet、MobileNetv2和ResNet50,在UBIPr数据集上进行眼周验证;采用余弦和卡方度量评估性能,比较不同初始化方式,并通过逻辑回归进行分数级融合;利用LIME热图和Jensen-Shannon散度分析网络注意力模式。 Result: ResNet50单独表现最优,但三者融合显著提升性能,尤其在所有网络结合时增益最大;热图显示各网络关注图像的不同区域,证明其互补性;该方法在UBIPr上达到新的SOTA水平。 Conclusion: 不同结构的CNN在眼周识别中具有互补性,分数级融合能有效提升跨距离验证性能,结合注意力分析可解释模型行为。 Abstract: We study the complementarity of different CNNs for periocular verification at different distances on the UBIPr database. We train three architectures of increasing complexity (SqueezeNet, MobileNetv2, and ResNet50) on a large set of eye crops from VGGFace2. We analyse performance with cosine and chi2 metrics, compare different network initialisations, and apply score-level fusion via logistic regression. In addition, we use LIME heatmaps and Jensen-Shannon divergence to compare attention patterns of the CNNs. While ResNet50 consistently performs best individually, the fusion provides substantial gains, especially when combining all three networks. Heatmaps show that networks usually focus on distinct regions of a given image, which explains their complementarity. Our method significantly outperforms previous works on UBIPr, achieving a new state-of-the-art.[91] Beyond Imitation: Constraint-Aware Trajectory Generation with Flow Matching For End-to-End Autonomous Driving
Lin Liu,Guanyi Yu,Ziying Song,Junqiao Li,Caiyan Jia,Feiyang Jia,Peiliang Wu,Yandan Luo
Main category: cs.CV
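TL;DR: Proposes CATG, an autonomous driving planning framework based on Constrained Flow Matching that mitigates mode collapse, incorporates safety and kinematic constraints directly into generation, and supports adjusting trajectory style via driving aggressiveness; it performed strongly in the NavSim v2 challenge.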
Details
Motivation: 现有模仿学习方法存在模式崩溃问题,生成模型难以直接引入安全和物理约束,需额外优化步骤,限制了规划性能与安全性。 Method: 提出CATG框架,利用约束流匹配技术,在流匹配过程中显式建模并施加安全与运动学约束,同时将驾驶激进程度作为可控条件信号进行轨迹生成。 Result: 在NavSim v2挑战赛上取得第二名,EPDMS得分为51.31,并获得创新奖。 Conclusion: CATG能有效生成多样化且符合安全与物理约束的轨迹,支持风格调控,具备实际应用潜力与创新性。 Abstract: Planning is a critical component of end-to-end autonomous driving. However, prevailing imitation learning methods often suffer from mode collapse, failing to produce diverse trajectory hypotheses. Meanwhile, existing generative approaches struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. To address these limitations, we propose CATG, a novel planning framework that leverages Constrained Flow Matching. Concretely, CATG explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our primary contribution is the novel imposition of explicit constraints directly within the flow matching process, ensuring that the generated trajectories adhere to vital safety and kinematic rules. Secondly, CATG parameterizes driving aggressiveness as a control signal during generation, enabling precise manipulation of trajectory style. Notably, on the NavSim v2 challenge, CATG achieved 2nd place with an EPDMS score of 51.31 and was honored with the Innovation Award.[92] Leveraging Large-Scale Face Datasets for Deep Periocular Recognition via Ocular Cropping
Fernando Alonso-Fernandez,Kevin Hernandez-Diaz,Jose Maria Buades Rubio,Josef Bigun
Main category: cs.CV
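TL;DR: Studies biometric recognition based on the periocular region using three CNNs of varying depth and complexity trained on the large-scale VGGFace2 database, a far larger training set than the small periocular datasets used in prior work. Experiments on VGGFace2-Pose and UFPR-Periocular show that periocular recognition performs better under controlled conditions (e.g., UFPR), reaching EERs of 1-2%, the lowest reported on that dataset to date.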
Details
Motivation: The periocular region is highly discriminative and imposes few acquisition constraints, making it well suited to biometric recognition, but existing work mostly relies on small-scale datasets and lacks evaluation in large-scale, real-world settings.
Method: Trains three CNNs of varying complexity on the large-scale VGGFace2 dataset, using about 1.9 million ocular images, and tests periocular recognition on the VGGFace2-Pose and UFPR-Periocular datasets.
Result: On VGGFace2-Pose, periocular recognition yields EERs of 9-15%, higher than the 3-6% obtained with full-face images; on UFPR-Periocular the EER is only 1-2%, the lowest reported to date.
Conclusion: Large-scale training improves periocular recognition models, and very low error rates are achievable on high-quality data with consistent acquisition conditions, confirming the potential of deep learning for periocular biometrics.
Abstract: We focus on ocular biometrics, specifically the periocular region (the area around the eye), which offers high discrimination and minimal acquisition constraints. We evaluate three Convolutional Neural Network architectures of varying depth and complexity to assess their effectiveness for periocular recognition. The networks are trained on 1,907,572 ocular crops extracted from the large-scale VGGFace2 database. This significantly contrasts with existing works, which typically rely on small-scale periocular datasets for training having only a few thousand images. Experiments are conducted with ocular images from VGGFace2-Pose, a subset of VGGFace2 containing in-the-wild face images, and the UFPR-Periocular database, which consists of selfies captured via mobile devices with user guidance on the screen. Due to the uncontrolled conditions of VGGFace2, the Equal Error Rates (EERs) obtained with ocular crops range from 9-15%, noticeably higher than the 3-6% EERs achieved using full-face images. In contrast, UFPR-Periocular yields significantly better performance (EERs of 1-2%), thanks to higher image quality and more consistent acquisition protocols. To the best of our knowledge, these are the lowest reported EERs on the UFPR dataset to date.
[93] Towards Realistic Earth-Observation Constellation Scheduling: Benchmark and Methodology
Luting Wang,Yinghao Xiang,Hongliang Huang,Dongjun Li,Chen Gao,Si Liu
Main category: cs.CV
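TL;DR: Proposes AEOS-Bench, the first large-scale, high-fidelity benchmark suite for scheduling agile Earth observation satellite constellations, and AEOS-Former, a constraint-aware Transformer-based scheduling model that improves task completion rate and energy efficiency over baselines.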
Details
Motivation: 现有方法在处理大规模场景、动态环境和严格约束下的敏捷地球观测卫星调度问题时往往过于简化,导致实际性能受限,缺乏统一、真实的基准测试平台来评估调度算法的有效性。 Method: 构建了一个包含3907个卫星资源和16410个场景的标准化基准套件AEOS-Bench,所有场景均通过高保真仿真平台生成,并提供真实调度标注;在此基础上提出AEOS-Former,一种基于Transformer并引入约束感知注意力机制和内部约束模块的调度模型,通过仿真迭代学习适应多样化场景。 Result: 实验表明AEOS-Former在任务完成率和能源效率方面优于基线模型,消融研究验证了各组件的有效贡献。AEOS-Bench是首个面向真实星座调度的大规模基准套件。 Conclusion: 所提出的AEOS-Bench为卫星调度研究提供了可靠基准,AEOS-Former结合约束建模与深度学习的方法有效提升了复杂环境下星座调度性能,推动了自动化卫星管理的发展。 Abstract: Agile Earth Observation Satellites (AEOSs) constellations offer unprecedented flexibility for monitoring the Earth's surface, but their scheduling remains challenging under large-scale scenarios, dynamic environments, and stringent constraints. Existing methods often simplify these complexities, limiting their real-world performance. We address this gap with a unified framework integrating a standardized benchmark suite and a novel scheduling model. Our benchmark suite, AEOS-Bench, contains $3,907$ finely tuned satellite assets and $16,410$ scenarios. Each scenario features $1$ to $50$ satellites and $50$ to $300$ imaging tasks. These scenarios are generated via a high-fidelity simulation platform, ensuring realistic satellite behavior such as orbital dynamics and resource constraints. Ground truth scheduling annotations are provided for each scenario. To our knowledge, AEOS-Bench is the first large-scale benchmark suite tailored for realistic constellation scheduling. Building upon this benchmark, we introduce AEOS-Former, a Transformer-based scheduling model that incorporates a constraint-aware attention mechanism. A dedicated internal constraint module explicitly models the physical and operational limits of each satellite. Through simulation-based iterative learning, AEOS-Former adapts to diverse scenarios, offering a robust solution for AEOS constellation scheduling. 
Experimental results demonstrate that AEOS-Former outperforms baseline models in task completion and energy efficiency, with ablation studies highlighting the contribution of each component. Code and data are provided in https://github.com/buaa-colalab/AEOSBench.
[94] Exploring the correlation between the type of music and the emotions evoked: A study using subjective questionnaires and EEG
Jelizaveta Jankowska,Bożena Kostek,Fernando Alonso-Fernandez,Prayag Tiwari
Main category: cs.CV
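TL;DR: Investigates how different music genres affect human emotions, analysing emotional responses by combining subjective questionnaires with electroencephalography (EEG) measurements.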
Details
Motivation: 了解不同音乐类型如何影响人类情绪,为音乐治疗和情感计算提供科学依据。 Method: 使用EEG头盔记录参与者听不同音乐时的脑活动,并结合主观问卷进行情绪评估,随后分析问卷结果与EEG信号之间的关系。 Result: 发现了情绪与大脑活动之间的关联,不同音乐类型引发的情绪反应在脑电信号上有明显差异。 Conclusion: 不同音乐类型显著影响人类情绪,且可通过EEG检测到相应的脑活动变化,验证了音乐对情绪的调节作用。 Abstract: The subject of this work is to check how different types of music affect human emotions. While listening to music, a subjective survey and brain activity measurements were carried out using an EEG helmet. The aim is to demonstrate the impact of different music genres on emotions. The research involved a diverse group of participants of different gender and musical preferences. This had the effect of capturing a wide range of emotional responses to music. After the experiment, a relationship analysis of the respondents' questionnaires with EEG signals was performed. The analysis revealed connections between emotions and observed brain activity.[95] A Hybrid Framework Bridging CNN and ViT based on Theory of Evidence for Diabetic Retinopathy Grading
Junlai Qiu,Yunzhu Chen,Hao Zheng,Yawen Huang,Yuexiang Li
Main category: cs.CV
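TL;DR: Proposes a fusion paradigm based on the theory of evidence that combines the strengths of CNNs and ViTs to improve the accuracy and interpretability of diabetic retinopathy grading.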
Details
Motivation: Existing automated DR diagnosis systems built on a single backbone (CNN or ViT) have reached a performance bottleneck and struggle to combine local and global feature extraction.
Method: Proposes a fusion paradigm based on the theory of evidence, in which deep evidential networks convert the features extracted by different backbones into supporting evidence that is then used to adaptively adjust the fusion pattern.
Result: Experiments on two public DR datasets show that the method outperforms state-of-the-art frameworks, improving grading accuracy while providing good interpretability for feature fusion and decision-making.
Conclusion: The proposed evidential fusion paradigm effectively integrates the strengths of CNNs and ViTs, improving both the performance and the interpretability of automated DR diagnosis.
Abstract: Diabetic retinopathy (DR) is a leading cause of vision loss among middle-aged and elderly people, which significantly impacts their daily lives and mental health. To improve the efficiency of clinical screening and enable the early detection of DR, a variety of automated DR diagnosis systems have been recently established based on convolutional neural network (CNN) or vision Transformer (ViT). However, owing to the inherent limitations of CNNs and ViTs, the performance of existing methods that use a single type of backbone has reached a bottleneck. One potential route to further improvement is to integrate different kinds of backbones, which can fully leverage their respective strengths (i.e., the local feature extraction capability of CNN and the global feature capturing ability of ViT). To this end, we propose a novel paradigm to effectively fuse the features extracted by different backbones based on the theory of evidence. Specifically, the proposed evidential fusion paradigm transforms the features from different backbones into supporting evidence via a set of deep evidential networks. With the supporting evidence, the aggregated opinion can be accordingly formed, which can be used to adaptively tune the fusion pattern between different backbones and accordingly boost the performance of our hybrid model. We evaluated our method on two publicly available DR grading datasets.
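To make "supporting evidence" and "aggregated opinion" concrete, here is a minimal sketch of one formulation that is common in evidential deep learning: each backbone's non-negative evidence is mapped to a subjective-logic opinion (per-class belief plus an uncertainty mass), and two opinions are merged with a reduced Dempster's combination rule. This is an assumption for illustration; the abstract does not state which combination rule the authors use, and the names `opinion_from_evidence` and `combine_opinions` are hypothetical.

```python
from typing import List, Sequence, Tuple

Opinion = Tuple[List[float], float]  # (per-class belief masses, uncertainty mass)


def opinion_from_evidence(evidence: Sequence[float]) -> Opinion:
    """Map non-negative per-class evidence to a subjective-logic opinion.

    Uses Dirichlet parameters alpha_k = e_k + 1, so beliefs and uncertainty sum to 1.
    """
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes  # Dirichlet strength S = sum(alpha)
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty


def combine_opinions(first: Opinion, second: Opinion) -> Opinion:
    """Merge two opinions with a reduced Dempster's rule (assumed, not from the source)."""
    b1, u1 = first
    b2, u2 = second
    # Conflict: belief mass the two sources assign to different classes.
    conflict = sum(b1[i] * b2[j]
                   for i in range(len(b1)) for j in range(len(b2)) if i != j)
    norm = 1.0 - conflict
    beliefs = [(b1[k] * b2[k] + b1[k] * u2 + b2[k] * u1) / norm
               for k in range(len(b1))]
    uncertainty = (u1 * u2) / norm
    return beliefs, uncertainty


# Hypothetical evidence from a CNN branch and a ViT branch over three DR grades.
cnn_opinion = opinion_from_evidence([9.0, 1.0, 0.0])
vit_opinion = opinion_from_evidence([4.0, 0.0, 0.0])
fused_beliefs, fused_uncertainty = combine_opinions(cnn_opinion, vit_opinion)
```

Under this rule a backbone that reports high uncertainty contributes less to the fused beliefs, which is one way a fusion pattern can adapt per sample, and the fused uncertainty gives a readable signal about how much the decision should be trusted.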
The experimental results demonstrate that our hybrid model not only improves the accuracy of DR grading compared with state-of-the-art frameworks, but also provides excellent interpretability for feature fusion and decision-making.
[96] GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model?
Mingyu Sung,Seungjae Ham,Kangwoo Kim,Yeokyoung Yoon,Sangseok Yun,Il-Min Kim,Jae-Mo Kang
Main category: cs.CV
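TL;DR: Proposes GLYPH-SR, a vision-language-guided diffusion framework that uses an OCR-guided ControlNet and an alternating scheduling mechanism to improve image super-resolution while markedly enhancing scene-text legibility, balancing perceptual quality and text recognition performance.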
Details
Motivation: Existing SR methods are mostly optimized for distortion or perceptual metrics that are insensitive to character-level errors, and most text SR work is confined to simplified benchmarks of isolated characters, overlooking the challenge of recovering scene text in complex natural scenes, so text is often treated as generic texture. For SR to be useful in practice, text legibility and visual quality must be optimized together.
Method: Proposes the GLYPH-SR framework, comprising a Text-SR Fusion ControlNet (TS-ControlNet) guided by OCR data and a "ping-pong" scheduler that alternates between text- and scene-centric guidance; these components are trained on a synthetic corpus while the main SR branch stays frozen, enabling targeted text restoration.
Result: On SVT, SCUT-CTW1500, and CUTE80 at x4 and x8, OCR F1 improves by up to +15.18 percentage points over diffusion/GAN baselines (SVT x8, OpenOCR) while MANIQA, CLIP-IQA, and MUSIQ remain competitive.
Conclusion: GLYPH-SR achieves high legibility and high visual realism at the same time, producing SR images that both "look right" and "read right", which supports reliable deployment of SR in practical vision systems.
Abstract: Image super-resolution (SR) is fundamental to many vision systems, from surveillance and autonomy to document analysis and retail analytics, because recovering high-frequency details, especially scene-text, enables reliable downstream perception. Scene-text, i.e., text embedded in natural images such as signs, product labels, and storefronts, often carries the most actionable information; when characters are blurred or hallucinated, optical character recognition (OCR) and subsequent decisions fail even if the rest of the image appears sharp. Yet previous SR research has often been tuned to distortion (PSNR/SSIM) or learned perceptual metrics (LPIPS, MANIQA, CLIP-IQA, MUSIQ) that are largely insensitive to character-level errors. Furthermore, studies that do address text SR often focus on simplified benchmarks with isolated characters, overlooking the challenges of text within complex natural scenes. As a result, scene-text is effectively treated as generic texture. For SR to be effective in practical deployments, it is therefore essential to explicitly optimize for both text legibility and perceptual quality. We present GLYPH-SR, a vision-language-guided diffusion framework that aims to achieve both objectives jointly. GLYPH-SR utilizes a Text-SR Fusion ControlNet (TS-ControlNet) guided by OCR data, and a ping-pong scheduler that alternates between text- and scene-centric guidance.
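As a rough illustration of what alternating between text- and scene-centric guidance could look like, the sketch below assigns a guidance mode to each denoising step. It is not the GLYPH-SR scheduler: the abstract gives no schedule details, so the fixed alternation period, the mode labels, and the function name `ping_pong_schedule` are all assumptions.

```python
from typing import List


def ping_pong_schedule(num_steps: int, period: int = 1, start: str = "text") -> List[str]:
    """Return a guidance mode ("text" or "scene") for each denoising step.

    The mode flips every `period` steps, beginning with `start`.
    This fixed alternation is a hypothetical stand-in for the paper's scheduler.
    """
    if num_steps < 0 or period < 1:
        raise ValueError("num_steps must be >= 0 and period must be >= 1")
    if start not in ("text", "scene"):
        raise ValueError("start must be 'text' or 'scene'")
    other = "scene" if start == "text" else "text"
    return [start if (step // period) % 2 == 0 else other
            for step in range(num_steps)]


# Strict alternation over four steps, then two-step blocks over five steps.
per_step = ping_pong_schedule(4)
blocked = ping_pong_schedule(5, period=2)
```

In a sampling loop, the mode chosen for a step would select which conditioning signal (OCR-derived text or a scene description) is passed to the guidance branch at that step.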
To enable targeted text restoration, we train these components on a synthetic corpus while keeping the main SR branch frozen. Across SVT, SCUT-CTW1500, and CUTE80 at x4 and x8, GLYPH-SR improves OCR F1 by up to +15.18 percentage points over diffusion/GAN baselines (SVT x8, OpenOCR) while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ. GLYPH-SR is designed to satisfy both objectives simultaneously, high readability and high visual realism, delivering SR that looks right and reads right.
[97] EEG-Driven Image Reconstruction with Saliency-Guided Diffusion Models
Igor Abramov,Ilya Makarov
Main category: cs.CV
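TL;DR: Proposes a dual-conditioning framework that combines EEG embeddings with spatial saliency maps to improve the quality and semantic consistency of EEG-driven image reconstruction.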
Details
Motivation: Existing EEG-driven image reconstruction methods ignore spatial attention mechanisms, which limits the fidelity and semantic coherence of the reconstructed images.
Method: Uses the Adaptive Thinking Mapper (ATM) to extract EEG features and fine-tunes Stable Diffusion 2.1 with LoRA to align neural signals with visual semantics, while a ControlNet branch provides spatial control conditioned on saliency maps.
Result: Validated on THINGS-EEG, the method surpasses existing approaches in the quality of both low- and high-level image features and aligns better with human visual attention.
Conclusion: Introducing attentional priors alleviates the ambiguity of EEG signals and enables high-fidelity image reconstruction, advancing neural decoding based on pre-trained diffusion models.
Abstract: Existing EEG-driven image reconstruction methods often overlook spatial attention mechanisms, limiting fidelity and semantic coherence. To address this, we propose a dual-conditioning framework that combines EEG embeddings with spatial saliency maps to enhance image generation. Our approach leverages the Adaptive Thinking Mapper (ATM) for EEG feature extraction and fine-tunes Stable Diffusion 2.1 via Low-Rank Adaptation (LoRA) to align neural signals with visual semantics, while a ControlNet branch conditions generation on saliency maps for spatial control. Evaluated on THINGS-EEG, our method achieves a significant improvement in the quality of low- and high-level image features over existing approaches, while also aligning strongly with human visual attention. The results demonstrate that attentional priors resolve EEG ambiguities, enabling high-fidelity reconstructions with applications in medical diagnostics and neuroadaptive interfaces, advancing neural decoding through efficient adaptation of pre-trained diffusion models.
[98] LoCoT2V-Bench: A Benchmark for Long-Form and Complex Text-to-Video Generation
Xiangqing Zheng,Chengyue Wu,Kehai Chen,Min Zhang
Main category: cs.CV
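TL;DR: Proposes LoCoT2V-Bench, a multi-dimensional benchmark for long video generation under complex input conditions, revealing shortcomings of existing models in inter-event consistency, fine-grained alignment, and high-level thematic expression.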
Details
Motivation: Existing benchmarks mostly rely on simplified prompts and focus on low-level metrics, and cannot assess abstract dimensions such as narrative coherence and thematic expression in long video generation under complex prompts.
Method: Builds a prompt set from real-world videos that incorporates complex elements such as scene transitions and event dynamics, and designs a multi-dimensional evaluation framework covering event-level alignment, temporal consistency, content clarity, and the Human Expectation Realization Degree (HERD).
Result: An evaluation of nine representative long video generation models shows that current methods do reasonably well on basic visual and temporal aspects but have clear weaknesses in inter-event consistency, fine-grained alignment, and high-level thematic adherence.
Conclusion: LoCoT2V-Bench provides a reliable evaluation platform for long-form complex text-to-video generation and points to key directions for future model improvement.
Abstract: Recently, text-to-video generation has made impressive progress in producing short, high-quality clips, but evaluating long-form outputs remains a major challenge, especially when processing complex prompts. Existing benchmarks mostly rely on simplified prompts and focus on low-level metrics, overlooking fine-grained alignment with prompts and abstract dimensions such as narrative coherence and thematic expression. To address these gaps, we propose LoCoT2V-Bench, a benchmark specifically designed for long video generation (LVG) under complex input conditions. Based on various real-world videos, LoCoT2V-Bench introduces a suite of realistic and complex prompts incorporating elements like scene transitions and event dynamics. Moreover, it constructs a multi-dimensional evaluation framework that includes our newly proposed metrics such as event-level alignment, fine-grained temporal consistency, content clarity, and the Human Expectation Realization Degree (HERD) that focuses on more abstract attributes like narrative flow, emotional response, and character development. Using this framework, we conduct a comprehensive evaluation of nine representative LVG models, finding that while current methods perform well on basic visual and temporal aspects, they struggle with inter-event consistency, fine-grained alignment, and high-level thematic adherence. Overall, LoCoT2V-Bench provides a comprehensive and reliable platform for evaluating long-form complex text-to-video generation and highlights critical directions for future method improvement.
[99] A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models
Shihab Aaqil Ahamed,Udaya S. K. P. Miriya Thanthrige,Ranga Rodrigo,Muhammad Haris Khan
Main category: cs.CV
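TL;DR: Proposes A-TPT, a new test-time prompt tuning framework that introduces angular diversity to improve the calibration of vision-language models during unsupervised task adaptation, substantially reducing calibration error and generalizing well.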
Details
Motivation: 现有测试时提示调优方法在文本特征间缺乏足够的角度分离,影响模型校准性能,限制了模型的可靠性与安全性。 Method: 提出A-TPT框架,通过最大化单位超球面上归一化文本特征间的最小成对角度距离,促进提示生成特征的角度多样性与均匀分布。 Result: 在多个骨干网络和数据集上,A-TPT持续优于当前最优TPT方法,显著降低平均校准误差,同时保持相当的准确性;在自然分布偏移和医学数据集上表现出优越的零样本校准性能。 Conclusion: 推动角度多样性是实现良好分散的文本特征的关键,能有效提升视觉-语言模型在测试时适应中的校准效果。 Abstract: Test-time prompt tuning (TPT) has emerged as a promising technique for adapting large vision-language models (VLMs) to unseen tasks without relying on labeled data. However, the lack of dispersion between textual features can hurt calibration performance, which raises concerns about VLMs' reliability, trustworthiness, and safety. Current TPT approaches primarily focus on improving prompt calibration by either maximizing average textual feature dispersion or enforcing orthogonality constraints to encourage angular separation. However, these methods may not always have optimal angular separation between class-wise textual features, which implies overlooking the critical role of angular diversity. To address this, we propose A-TPT, a novel TPT framework that introduces angular diversity to encourage uniformity in the distribution of normalized textual features induced by corresponding learnable prompts. This uniformity is achieved by maximizing the minimum pairwise angular distance between features on the unit hypersphere. We show that our approach consistently surpasses state-of-the-art TPT methods in reducing the aggregate average calibration error while maintaining comparable accuracy through extensive experiments with various backbones on different datasets. Notably, our approach exhibits superior zero-shot calibration performance on natural distribution shifts and generalizes well to medical datasets. We provide extensive analyses, including theoretical aspects, to establish the grounding of A-TPT. These results highlight the potency of promoting angular diversity to achieve well-dispersed textual features, significantly improving VLM calibration during test-time adaptation. 
Our code will be made publicly available.
[100] PointSt3R: Point Tracking through 3D Grounded Correspondence
Rhodri Guerrier,Adam W. Harley,Dima Damen
Main category: cs.CV
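TL;DR: Proposes PointSt3R, a point tracking method built on 3D reconstruction models (such as DUSt3R and MASt3R). By combining the reconstruction loss with dynamic correspondence training and a visibility head, and fine-tuning on a small amount of synthetic data, it achieves point tracking performance that is superior or comparable to existing methods on multiple datasets.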
Details
Motivation: 现有的点跟踪方法主要依赖时序上下文,而本文旨在利用强大的3D重建模型在静态场景中的2D-3D对应能力,探索其在点跟踪任务中的潜力,并提升对动态和静态点的跟踪效果。 Method: 将MASt3R模型结合重建损失与动态对应训练,并添加可见性预测头;仅使用包含查询点的帧对进行训练和评估,避免引入时序信息;利用合成数据进行微调。 Result: 在EgoPoints上相比CoTracker2提升33.5%,在TAP-Vid-DAVIS上达到73.8 δ_avg和85.8%遮挡准确率,接近CoTracker2;在EgoPoints和RGB-S上显著优于CoTracker3(61.3 vs 54.2,87.0 vs 82.8)。 Conclusion: 通过适配先进的3D重建模型并引入动态对应训练,可在无时序上下文的情况下实现高性能点跟踪,验证了3D接地对应在点跟踪任务中的有效性。 Abstract: Recent advances in foundational 3D reconstruction models, such as DUSt3R and MASt3R, have shown great potential in 2D and 3D correspondence in static scenes. In this paper, we propose to adapt them for the task of point tracking through 3D grounded correspondence. We first demonstrate that these models are competitive point trackers when focusing on static points, present in current point tracking benchmarks ($+33.5\%$ on EgoPoints vs. CoTracker2). We propose to combine the reconstruction loss with training for dynamic correspondence along with a visibility head, and fine-tuning MASt3R for point tracking using a relatively small amount of synthetic data. Importantly, we only train and evaluate on pairs of frames where one contains the query point, effectively removing any temporal context. Using a mix of dynamic and static point correspondences, we achieve competitive or superior point tracking results on four datasets (e.g. competitive on TAP-Vid-DAVIS 73.8 $\delta_{avg}$ / 85.8\% occlusion acc. for PointSt3R compared to 75.7 / 88.3\% for CoTracker2; and significantly outperform CoTracker3 on EgoPoints 61.3 vs 54.2 and RGB-S 87.0 vs 82.8). We also present results on 3D point tracking along with several ablations on training datasets and percentage of dynamic correspondences.[101] Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection
Yuanting Fan,Jun Liu,Xiaochen Chen,Bin-Bin Gao,Jian Li,Yong Liu,Jinlong Peng,Chengjie Wang
Main category: cs.CV
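TL;DR: Proposes the FineGrainedAD framework, which improves localization performance in few-shot anomaly detection through multi-level fine-grained semantic descriptions.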
Details
Motivation: Because they lack fine-grained textual descriptions, existing VLM-based methods suffer from semantic misalignment between image-level descriptions and local visual anomalies, which hurts anomaly localization.
Method: Constructs the Multi-Level Fine-Grained Semantic Caption (MFSC) and proposes the FineGrainedAD framework, which contains two modules, Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA); fine-grained semantics are introduced through an automatic replacement and concatenation mechanism, together with a region aggregation strategy and multi-level alignment training.
Result: On the MVTec-AD and VisA datasets, FineGrainedAD outperforms existing methods in few-shot settings and improves anomaly localization.
Conclusion: By introducing multi-level fine-grained semantic descriptions and an alignment mechanism, FineGrainedAD alleviates semantic misalignment and achieves better localization in few-shot anomaly detection.
Abstract: Few-shot anomaly detection (FSAD) methods identify anomalous regions with few known normal samples. Most existing methods rely on the generalization ability of pre-trained vision-language models (VLMs) to recognize potentially anomalous regions through feature similarity between text descriptions and images. However, due to the lack of detailed textual descriptions, these methods can only pre-define image-level descriptions to match each visual patch token to identify potential anomalous regions, which leads to semantic misalignment between image descriptions and patch-level visual anomalies and thus to sub-optimal localization performance. To address the above issues, we propose the Multi-Level Fine-Grained Semantic Caption (MFSC) to provide multi-level and fine-grained textual descriptions for existing anomaly detection datasets with an automatic construction pipeline. Based on the MFSC, we propose a novel framework named FineGrainedAD to improve anomaly localization performance, which consists of two components: Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics into multi-level learnable prompts through an automatic replacement and concatenation mechanism, while MLSA designs a region aggregation strategy and multi-level alignment training to help learnable prompts better align with corresponding visual regions. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings on MVTec-AD and VisA datasets.
[102] Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Pei Peng,MingKun Xie,Hang Hao,Tong Jin,ShengJun Huang
Main category: cs.CV
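TL;DR: Proposes a lightweight causal inference method that needs no retraining or prompt design; it mitigates object-context shortcuts in vision-language models by synthesizing counterfactual embeddings and estimating direct effects, improving zero-shot performance.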
Details
Motivation: Addresses object-context shortcuts in vision-language models, which arise from the co-occurrence of objects and contexts in training data, in order to improve zero-shot reliability in new environments.
Method: Recasts the issue as a causal inference problem, estimates object and background expectations in CLIP's representation space, generates counterfactual embeddings by combining object features with diverse contexts drawn from external datasets, batch neighbors, or text descriptions, and uses Total Direct Effect estimation and simulated intervention to remove activation caused by the background alone.
Result: Substantially improves worst-group and average accuracy on several context-sensitive benchmarks, establishing a new zero-shot state of the art.
Conclusion: The method offers an effective representation-level counterfactual analysis framework that enables more reliable, debiased multimodal reasoning without additional training or prompt engineering.
Abstract: Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
[103] Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing
Xin Guo,Zhiheng Xi,Yiwen Ding,Yitao Zhai,Xiaowei Shi,Xunliang Cai,Tao Gui,Qi Zhang,Xuanjing Huang
Main category: cs.CV
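TL;DR: Proposes a remedy for the "Matthew effect" that arises during self-improvement of large vision-language models: distribution-reshaping and trajectory-resampling strategies re-balance head and tail data and significantly improve visual reasoning.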
Details
Motivation: Existing self-improvement methods optimize simple and complex queries unevenly, which makes it hard for the model to improve its complex reasoning ability.
Method: Proposes four efficient strategies from two perspectives, distribution-reshaping and trajectory-resampling, to re-balance head and tail data during self-improvement.
Result: Experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B show an average gain of 3.86 points over vanilla self-improvement.
Conclusion: The proposed head-tail re-balancing strategies effectively mitigate the Matthew effect and support continued improvement on complex visual reasoning tasks.
Abstract: Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models (LVLMs), where models explore and learn from successful trajectories iteratively. However, we identify a critical issue during this process: the model excels at generating high-quality trajectories for simple queries (i.e., head data) but struggles with more complex ones (i.e., tail data). This leads to an imbalanced optimization that drives the model to prioritize simple reasoning skills, while hindering its ability to tackle more complex reasoning tasks. Over iterations, this imbalance becomes increasingly pronounced, a dynamic we term the "Matthew effect", which ultimately hinders further model improvement and leads to performance bottlenecks. To counteract this challenge, we introduce four efficient strategies from two perspectives: distribution-reshaping and trajectory-resampling, to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Extensive experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks demonstrate that our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
[104] Analysis of the Robustness of an Edge Detector Based on Cellular Automata Optimized by Particle Swarm
Vinícius Ferraria,Eurico Ruivo
Main category: cs.CV
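TL;DR: Proposes an adaptive edge detection method based on a two-dimensional cellular automaton combined with meta-heuristic optimization and transfer learning; experiments show that expanding the optimization search space was not effective for the chosen image set and that transfer learning brought no significant improvement.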
Details
Motivation: Addresses the weaknesses of traditional edge detectors in detecting loose edges and in lacking contextual information, and improves the detector's adaptability to different image properties.
Method: Models an adaptive edge detector with a two-dimensional cellular automaton, optimizes it with a meta-heuristic algorithm, and applies transfer learning techniques to improve generalization.
Result: Expanding the search space of the optimization phase did not improve performance; the model adapted well to the input under every validation setting, but transfer learning brought no significant improvement.
Conclusion: The proposed adaptive detector adapts well to its input, but transfer learning and an expanded search space have limited effect in the current setup, so the strategy needs further refinement.
Abstract: Edge detection is an essential image processing task that aims to extract relevant information from an image. One recurring problem in this task is the weaknesses found in some detectors, such as the difficulty in detecting loose edges and the lack of context to extract relevant information from specific problems. To address these weaknesses and adapt the detector to the properties of an image, an adaptable detector described by a two-dimensional cellular automaton and optimized by a meta-heuristic combined with transfer learning techniques was developed. This study aims to analyze the impact of expanding the search space of the optimization phase and the robustness of the adaptability of the detector in identifying edges of a set of natural images and specialized subsets extracted from the same image set. The results obtained show that expanding the search space of the optimization phase was not effective for the chosen image set. The study also analyzed the adaptability of the model through a series of experiments and validation techniques and found that, regardless of the validation, the model was able to adapt to the input and the transfer learning techniques applied to the model showed no significant improvements.
[105] SA$^{2}$Net: Scale-Adaptive Structure-Affinity Transformation for Spine Segmentation from Ultrasound Volume Projection Imaging
Hao Xie,Zixun Huang,Yushen Zuo,Yakun Ju,Frank H. F. Leung,N. F. Law,Kin-Man Lam,Yong-Ping Zheng,Sai Ho Ling
Main category: cs.CV
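TL;DR: Proposes SA²Net, a scale-adaptive structure-aware network for spine segmentation in ultrasound volume projection imaging; a scale-adaptive complementary strategy and a structure-affinity transformation improve segmentation, giving superior accuracy and robustness for intelligent scoliosis diagnosis.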
Details
Motivation: Spine segmentation is crucial for intelligent scoliosis diagnosis, but it is hampered by insufficient learning of the spatial correlation among bone features and by the difficulty of encoding structural knowledge of the spine.
Method: Proposes SA²Net: a scale-adaptive complementary strategy learns cross-dimensional long-distance correlation features; a structure-affinity transformation is combined with a Transformer decoder for structure-aware reasoning; a feature mixing loss aggregation method strengthens training.
Result: Experiments show that SA²Net outperforms existing state-of-the-art methods in segmentation performance and adapts well to various backbones.
Conclusion: SA²Net effectively improves the accuracy and robustness of spine segmentation and has potential for advanced scoliosis diagnosis through intelligent spinal image analysis.
Abstract: Spine segmentation, based on ultrasound volume projection imaging (VPI), plays a vital role for intelligent scoliosis diagnosis in clinical applications. However, this task faces several significant challenges. Firstly, the global contextual knowledge of spines may not be well-learned if we neglect the high spatial correlation of different bone features. Secondly, the spine bones contain rich structural knowledge regarding their shapes and positions, which deserves to be encoded into the segmentation process. To address these challenges, we propose a novel scale-adaptive structure-aware network (SA$^{2}$Net) for effective spine segmentation. First, we propose a scale-adaptive complementary strategy to learn the cross-dimensional long-distance correlation features for spinal images. Second, motivated by the consistency between multi-head self-attention in Transformers and semantic level affinity, we propose structure-affinity transformation to transform semantic features with class-specific affinity and combine it with a Transformer decoder for structure-aware reasoning. In addition, we adopt a feature mixing loss aggregation method to enhance model training. This method improves the robustness and accuracy of the segmentation process. The experimental results demonstrate that our SA$^{2}$Net achieves superior segmentation performance compared to other state-of-the-art methods. Moreover, the adaptability of SA$^{2}$Net to various backbones enhances its potential as a promising tool for advanced scoliosis diagnosis using intelligent spinal image analysis.
The code and experimental demo are available at https://github.com/taetiseo09/SA2Net.
[106] AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping
Wen Xie,Yanjun Zhu,Gijs Overgoor,Yakov Bart,Agata Lapedriza Garcia,Sarah Ostadabbas
Main category: cs.CV
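TL;DR: Proposes an automated video ad clipping framework based on an audio-visual fusion model; it is the first to frame ad clipping as a shot selection problem and emphasizes the critical role of audio in advertising.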
Details
Motivation: Traditional ad clipping is manual, time-consuming, and labor-intensive, and existing video summarization methods focus mostly on visual content while neglecting the importance of audio in ads.
Method: Proposes a two-stream audio-visual fusion model that predicts frame importance to generate short video ads automatically, and builds the AdSum204 dataset of 102 pairs of real ads for training and evaluation.
Result: Experiments show that the model outperforms state-of-the-art methods on Average Precision, Area Under Curve, Spearman, and Kendall metrics.
Conclusion: The method generates high-quality short video ads automatically and clearly outperforms general video summarization methods, confirming the importance of audio in ad clipping.
Abstract: Advertisers commonly need multiple versions of the same advertisement (ad) at varying durations for a single campaign. The traditional approach involves manually selecting and re-editing shots from longer video ads to create shorter versions, which is labor-intensive and time-consuming. In this paper, we introduce a framework for automated video ad clipping using video summarization techniques. We are the first to frame video clipping as a shot selection problem, tailored specifically for advertising. Unlike existing general video summarization methods that primarily focus on visual content, our approach emphasizes the critical role of audio in advertising. To achieve this, we develop a two-stream audio-visual fusion model that predicts the importance of video frames, where importance is defined as the likelihood of a frame being selected in the firm-produced short ad. To address the lack of ad-specific datasets, we present AdSum204, a novel dataset comprising 102 pairs of 30-second and 15-second ads from real advertising campaigns. Extensive experiments demonstrate that our model outperforms state-of-the-art methods across various metrics, including Average Precision, Area Under Curve, Spearman, and Kendall.
[107] Dynamic Context-Aware Scene Reasoning Using Vision-Language Alignment in Zero-Shot Real-World Scenarios
Manjunath Prasad Holenarasipura Rajiv,B. M. Vidyavathi
Main category: cs.CV
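TL;DR: Proposes a dynamic context-aware scene reasoning framework that uses vision-language alignment for zero-shot real-world scene understanding, with accuracy gains of up to 18% on several benchmarks.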
Details
Motivation: Conventional scene understanding models generalize poorly to unseen scenarios without labeled data, which limits the deployment of vision applications in dynamic environments.
Method: Combines pre-trained vision transformers and large language models; vision-language alignment and a dynamic reasoning module fuse global scene cues with object-level interactions, using linguistic priors to improve contextual understanding.
Result: On zero-shot benchmarks such as COCO, Visual Genome, and Open Images, scene understanding accuracy in complex and unseen environments improves by up to 18%, with robust performance in ambiguous or cluttered scenes.
Conclusion: The framework provides scalable, interpretable context-aware reasoning and significantly improves zero-shot generalization of AI systems in dynamic real-world scenes.
Abstract: In real-world environments, AI systems often face unfamiliar scenarios without labeled data, creating a major challenge for conventional scene understanding models. The inability to generalize across unseen contexts limits the deployment of vision-based applications in dynamic, unstructured settings. This work introduces a Dynamic Context-Aware Scene Reasoning framework that leverages Vision-Language Alignment to address zero-shot real-world scenarios. The goal is to enable intelligent systems to infer and adapt to new environments without prior task-specific training. The proposed approach integrates pre-trained vision transformers and large language models to align visual semantics with natural language descriptions, enhancing contextual comprehension. A dynamic reasoning module refines predictions by combining global scene cues and object-level interactions guided by linguistic priors. Extensive experiments on zero-shot benchmarks such as COCO, Visual Genome, and Open Images demonstrate up to 18% improvement in scene understanding accuracy over baseline models in complex and unseen environments. Results also show robust performance in ambiguous or cluttered scenes due to the synergistic fusion of vision and language. This framework offers a scalable and interpretable approach for context-aware reasoning, advancing zero-shot generalization in dynamic real-world settings.
[108] CATCH: A Modular Cross-domain Adaptive Template with Hook
Xinjin Li,Yulie Lu,Jinghan Cao,Yu Ma,Zhenglin Li,Yeyang Zhou
Main category: cs.CV
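TL;DR: Proposes CATCH, a plug-and-play framework for cross-domain adaptation in Visual Question Answering (VQA) that decouples visual and linguistic adaptation and adds lightweight modules to improve performance without retraining the backbone model.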
Details
Motivation: Existing VQA models generalize poorly to cross-domain scenarios (such as remote sensing, medical imaging, and math diagrams) and depend on costly domain-specific fine-tuning, which lacks scalability and flexibility.
Method: Proposes the CATCH framework, which contains a domain classifier and a dual adapter mechanism (a Prompt Adapter for language modulation and a Visual Adapter for vision feature adjustment). The modules are injected dynamically through a unified hook interface, with no retraining of the backbone model.
Result: Achieves consistent gains on four domain-specific VQA benchmarks, including +2.3 BLEU on MathVQA, +2.6 VQA score on MedVQA-RAD, and +3.1 ROUGE on ChartQA.
Conclusion: CATCH provides a scalable and extensible solution for multi-domain VQA that supports practical deployment across diverse application scenarios.
Abstract: Recent advances in Visual Question Answering (VQA) have demonstrated impressive performance in natural image domains, with models like LLaVA leveraging large language models (LLMs) for open-ended reasoning. However, their generalization degrades significantly when transferred to out-of-domain scenarios such as remote sensing, medical imaging, or math diagrams, due to large distributional shifts and the lack of effective domain adaptation mechanisms. Existing approaches typically rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and not scalable across diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for cross-domain adaptation that improves the generalization of VQA models while requiring minimal changes to their core architecture. Our key idea is to decouple visual and linguistic adaptation by introducing two lightweight modules: a domain classifier to identify the input image type, and a dual adapter mechanism comprising a Prompt Adapter for language modulation and a Visual Adapter for vision feature adjustment. Both modules are dynamically injected via a unified hook interface, requiring no retraining of the backbone model. Experimental results across four domain-specific VQA benchmarks demonstrate that our framework achieves consistent performance gains without retraining the backbone model, including +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA.
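To make the idea of injecting adapters through a hook interface concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: `FrozenBackbone`, `classify_domain`, `ADAPTERS`, and `answer` are hypothetical names, the "features" are toy lists of floats, and the domain rule is a stand-in for a learned classifier. It only illustrates that adapters can be attached and detached around a backbone whose own computation is never changed.

```python
from typing import Callable, Dict, List

Hook = Callable[[List[float]], List[float]]


class FrozenBackbone:
    """Toy stand-in for a VQA backbone; its own computation is never modified."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Hook]] = {"visual": [], "prompt": []}

    def register_hook(self, point: str, hook: Hook) -> Callable[[], None]:
        """Attach a hook at an injection point and return a function that removes it."""
        self._hooks[point].append(hook)
        return lambda: self._hooks[point].remove(hook)

    def _apply(self, point: str, features: List[float]) -> List[float]:
        for hook in self._hooks[point]:
            features = hook(features)
        return features

    def forward(self, image_feats: List[float], prompt_feats: List[float]) -> float:
        visual = self._apply("visual", list(image_feats))
        prompt = self._apply("prompt", list(prompt_feats))
        return sum(visual) + sum(prompt)  # placeholder for answer decoding


def classify_domain(image_feats: List[float]) -> str:
    # Hypothetical rule standing in for a learned domain classifier.
    return "medical" if sum(image_feats) < 0 else "chart"


# Hypothetical per-domain adapters: one visual hook and one prompt hook each.
ADAPTERS: Dict[str, Dict[str, Hook]] = {
    "medical": {
        "visual": lambda f: [x * 0.5 for x in f],
        "prompt": lambda f: [x + 1.0 for x in f],
    },
    "chart": {
        "visual": lambda f: [x * 2.0 for x in f],
        "prompt": lambda f: [x + 0.5 for x in f],
    },
}


def answer(backbone: FrozenBackbone, image_feats: List[float], prompt_feats: List[float]) -> float:
    """Pick adapters by domain, inject them for one call, then detach them."""
    domain = classify_domain(image_feats)
    removers = [backbone.register_hook(p, h) for p, h in ADAPTERS[domain].items()]
    try:
        return backbone.forward(image_feats, prompt_feats)
    finally:
        for remove in removers:
            remove()
```

Because the hooks are removed after each call, the backbone behaves identically before and after adaptation, which is the property the plug-and-play description relies on.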
These results highlight that CATCH provides a scalable and extensible approach to multi-domain VQA, enabling practical deployment across diverse application domains.
[109] Emu3.5: Native Multimodal Models are World Learners
Yufeng Cui,Honghao Chen,Haoge Deng,Xu Huang,Xinghang Li,Jirong Liu,Yang Liu,Zhuoyan Luo,Jinsheng Wang,Wenxuan Wang,Yueze Wang,Chengyuan Wang,Fan Zhang,Yingli Zhao,Ting Pan,Xianduo Li,Zecheng Hao,Wenxuan Ma,Zhuo Chen,Yulong Ao,Tiejun Huang,Zhongyuan Wang,Xinlong Wang
Main category: cs.CV
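TL;DR: Emu3.5 is a large-scale multimodal world model pre-trained end-to-end with a unified next-token prediction objective on more than 10 trillion tokens of interleaved vision-language data, so it natively accepts and generates interleaved vision-language inputs and outputs. Combined with large-scale reinforcement learning post-training and Discrete Diffusion Adaptation (DiDA), it improves multimodal reasoning, generation, and inference efficiency, performs strongly on image generation, editing, and complex text-rich image generation, and supports spatiotemporally consistent world exploration and open-world embodied manipulation.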
Details
Motivation: To build a general world model that natively supports joint modeling of vision and language, enables coherent cross-modal generation and reasoning, and overcomes the limits of traditional methods in multimodal sequence modeling and inference efficiency.
Method: Pre-trains the model end-to-end with a unified next-token prediction objective on very large-scale interleaved vision-language data built from internet video frames and transcripts; introduces Discrete Diffusion Adaptation (DiDA) for parallel image generation to speed up inference; and post-trains with large-scale reinforcement learning to strengthen multimodal reasoning.
Result: Emu3.5 performs comparably to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing and better on interleaved generation tasks; it supports long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation; it enables spatiotemporally consistent world exploration and open-world manipulation; DiDA accelerates per-image inference by about 20x.
Conclusion: Emu3.5 is a strong native multimodal world model that combines efficient inference with strong generation, points to a viable path toward general vision-language world models, and is open-sourced to support community research.
Abstract: We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks.
We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.
[110] ResMatching: Noise-Resilient Computational Super-Resolution via Guided Conditional Flow Matching
Anirban Ray,Vera Galinova,Florian Jug
Main category: cs.CV
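TL;DR: Proposes ResMatching, a computational super-resolution method based on guided conditional flow matching that learns stronger data priors, achieves a good balance between fidelity and perceptual realism on images of biological structures, and provides pixel-wise uncertainty estimates.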
Details
Motivation: Computational super-resolution is an ill-posed problem that relies on prior knowledge to extrapolate frequencies that were never sampled; with advances in machine learning, stronger priors can be learned in a data-driven way to improve results.
Method: Uses guided conditional flow matching to learn improved data priors for super-resolution reconstruction of fluorescence micrographs, and samples from the implicitly learned posterior distribution to estimate pixel-wise uncertainty.
Result: Evaluated on 4 biological structures from the BioSR dataset against 7 baselines, it achieves competitive results with the best trade-off between data fidelity and perceptual realism, is particularly effective on noisy low-resolution inputs, and provides calibrated uncertainty maps that help assess how trustworthy a prediction is.
Conclusion: ResMatching learns priors effectively from data, improves computational super-resolution with both high fidelity and perceptual quality, and supports uncertainty quantification, which makes it useful in practice.
Abstract: Computational Super-Resolution (CSR) in fluorescence microscopy has, despite being an ill-posed problem, a long history. At its very core, CSR is about finding a prior that can be used to extrapolate frequencies in a micrograph that have never been imaged by the image-generating microscope. It stands to reason that, with the advent of better data-driven machine learning techniques, stronger priors can be learned and hence CSR can lead to better results. Here, we present ResMatching, a novel CSR method that uses guided conditional flow matching to learn such improved data-priors. We evaluate ResMatching on 4 diverse biological structures from the BioSR dataset and compare its results against 7 baselines. ResMatching consistently achieves competitive results, demonstrating in all cases the best trade-off between data fidelity and perceptual realism. We observe that CSR using ResMatching is particularly effective in cases where a strong prior is hard to learn, e.g. when the given low-resolution images contain a lot of noise. Additionally, we show that ResMatching can be used to sample from an implicitly learned posterior distribution and that this distribution is calibrated for all tested use-cases, enabling our method to deliver a pixel-wise data-uncertainty term that can guide future users to reject uncertain predictions.
[111] CYPRESS: Crop Yield Prediction via Regression on Prithvi's Encoder for Satellite Sensing
Shayan Nejadshamsi,Yuanyuan Zhang,Shadi Zaki,Brock Porth,Lysa Porth,Vahab Khoshdel
Main category: cs.CV
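TL;DR: Proposes CYPRESS, a deep learning model for high-resolution, intra-field canola yield prediction that fine-tunes the large-scale geospatial foundation model Prithvi-EO-2.0-600M and outperforms existing models.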
Details
Motivation: Traditional crop yield prediction methods lack the scalability and granularity that precision agriculture requires, so a new method is needed that delivers timely, accurate, high-resolution predictions.
Method: CYPRESS takes the pre-trained geospatial foundation model Prithvi-EO-2.0-600M, adapts it to a continuous regression task, and transforms multi-temporal satellite imagery into dense pixel-level yield maps.
Result: On a dataset from the Canadian Prairies, CYPRESS outperforms existing deep learning models for canola yield prediction and produces continuous, high-resolution yield maps.
Conclusion: CYPRESS shows that fine-tuning foundation models is feasible for fine-grained agricultural monitoring and offers a scalable way to connect large-scale Earth observation with on-farm decision-making.
Abstract: Accurate and timely crop yield prediction is crucial for global food security and modern agricultural management. Traditional methods often lack the scalability and granularity required for precision farming. This paper introduces CYPRESS (Crop Yield Prediction via Regression on Prithvi's Encoder for Satellite Sensing), a deep learning model designed for high-resolution, intra-field canola yield prediction. CYPRESS leverages a pre-trained, large-scale geospatial foundation model (Prithvi-EO-2.0-600M) and adapts it for a continuous regression task, transforming multi-temporal satellite imagery into dense, pixel-level yield maps. Evaluated on a comprehensive dataset from the Canadian Prairies, CYPRESS demonstrates superior performance over existing deep learning-based yield prediction models, highlighting the effectiveness of fine-tuning foundation models for specialized agricultural applications. By providing a continuous, high-resolution output, CYPRESS offers a more actionable tool for precision agriculture than conventional classification or county-level aggregation methods. This work validates a novel approach that bridges the gap between large-scale Earth observation and on-farm decision-making, offering a scalable solution for detailed agricultural monitoring.
[112] Spiking Patches: Asynchronous, Sparse, and Efficient Tokens for Event Cameras
Christoffer Koo Øhrstrøm,Ronja Güldenring,Lazaros Nalpantidis
Main category: cs.CV
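TL;DR: Proposes Spiking Patches, an event tokenization method designed for event cameras that preserves the asynchrony and spatial sparsity of the event stream; on gesture recognition and object detection it gives faster inference than voxel-based and frame-based methods while matching or improving accuracy.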
Details
Motivation: Existing event representations (such as frames or voxels) turn asynchronous, sparse event data into synchronous data with reduced spatial sparsity, losing the distinctive advantages of event cameras, so a representation that preserves these properties is needed.
Method: Designs an event tokenizer, Spiking Patches, that converts asynchronous sparse event streams into tokens that preserve these properties, and evaluates the tokens on downstream tasks with a GNN, a PCN, and a Transformer.
Result: On gesture recognition and object detection, tokens from Spiking Patches give inference up to 3.4x faster than voxel-based tokens and up to 10.4x faster than frames, with comparable or higher accuracy and absolute improvements of up to 3.8 (gesture recognition) and 1.4 (object detection).
Conclusion: Event tokenization is a new direction in event-based vision; Spiking Patches preserves the asynchrony and sparsity of event cameras while combining efficient inference with strong performance.
Abstract: We propose tokenization of events and present a tokenizer, Spiking Patches, specifically designed for event cameras. Given a stream of asynchronous and spatially sparse events, our goal is to discover an event representation that preserves these properties. Prior works have represented events as frames or as voxels. However, while these representations yield high accuracy, both frames and voxels are synchronous and decrease the spatial sparsity. Spiking Patches gives the means to preserve the unique properties of event cameras and we show in our experiments that this comes without sacrificing accuracy. We evaluate our tokenizer using a GNN, PCN, and a Transformer on gesture recognition and object detection. Tokens from Spiking Patches yield inference times that are up to 3.4x faster than voxel-based tokens and up to 10.4x faster than frames. We achieve this while matching their accuracy and even surpassing it in some cases, with absolute improvements up to 3.8 for gesture recognition and up to 1.4 for object detection. Thus, tokenization constitutes a novel direction in event-based vision and marks a step towards methods that preserve the properties of event cameras.
[113] PT-DETR: Small Target Detection Based on Partially-Aware Detail Focus
Bingcong Huo,Zhiming Wang
Main category: cs.CV
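TL;DR: Proposes PT-DETR, a new small-object detection algorithm for UAV imagery based on RT-DETR; by introducing the PADF module, the MFFF module, and the Focaler-SIoU loss, it improves the accuracy and robustness of small-object detection against complex backgrounds and outperforms RT-DETR on the VisDrone2019 dataset with lower computational complexity and fewer parameters.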
Details
Motivation: UAV object detection faces complex backgrounds, severe occlusion, dense small objects, and varying lighting; existing methods fall short in small-object feature extraction and localization accuracy, so detection of small objects needs to improve.
Method: Building on RT-DETR, PT-DETR (1) adds the Partially-Aware Detail Focus (PADF) module to strengthen the backbone's feature extraction for small objects; (2) adds the Median-Frequency Feature Fusion (MFFF) module to better fuse small-object details with contextual information; (3) adopts the Focaler-SIoU loss to strengthen bounding box matching and sensitivity to small-object features.
Result: On the VisDrone2019 dataset, PT-DETR improves mAP over RT-DETR by 1.6% and 1.7% with lower computational complexity and fewer parameters, supporting its effectiveness and feasibility for small-object detection.
Conclusion: By improving feature extraction, the fusion strategy, and the loss function, PT-DETR improves small-object detection in UAV imagery and has practical application potential.
Abstract: To address the challenges in UAV object detection, such as complex backgrounds, severe occlusion, dense small objects, and varying lighting conditions, this paper proposes PT-DETR based on RT-DETR, a novel detection algorithm specifically designed for small objects in UAV imagery. In the backbone network, we introduce the Partially-Aware Detail Focus (PADF) Module to enhance feature extraction for small objects. Additionally, we design the Median-Frequency Feature Fusion (MFFF) module, which effectively improves the model's ability to capture small-object details and contextual information. Furthermore, we incorporate Focaler-SIoU to strengthen the model's bounding box matching capability and increase its sensitivity to small-object features, thereby further enhancing detection accuracy and robustness. Compared with RT-DETR, our PT-DETR achieves mAP improvements of 1.6% and 1.7% on the VisDrone2019 dataset with lower computational complexity and fewer parameters, demonstrating its robustness and feasibility for small-object detection tasks.
[114] All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles
Sayed Pedram Haeri Boroujeni,Niloufar Mehrabi,Hazim Alzorgan,Ahmad Sarlak,Mahlagha Fazeli,Abolfazl Razi
Main category: cs.CV
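TL;DR: Surveys recent progress in object detection for autonomous vehicles, focusing on emerging paradigms in multimodal perception, contextual reasoning, and cooperative intelligence, such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI, and offers a systematic analysis of sensor fusion, dataset categorization, and Transformer-based detection methods together with future directions.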
Details
Motivation: The success of autonomous driving depends on reliable object detection in complex multimodal environments, yet current knowledge is fragmented and not integrated. The survey aims to bridge the gaps among multimodal perception, contextual reasoning, and cooperative intelligence.
Method: Systematically reviews autonomous vehicle sensors (camera, ultrasonic, LiDAR, radar) and their fusion strategies; proposes a structured categorization of autonomous driving datasets (e.g., V2V, V2I, V2X, I2I); and analyzes detection methods from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to new paradigms based on Vision Transformers, large language models, and vision-language models.
Result: Builds a comprehensive framework for object detection in autonomous driving, clarifies the capabilities and limitations of existing techniques, highlights the potential of LLM/VLM-driven perception frameworks and multi-source sensor fusion, and provides comparability across methods through cross-dataset analysis.
Conclusion: The survey offers a forward-looking view of object detection for autonomous driving, stresses the importance of combining generative AI with multimodal perception, and identifies key paths toward cooperative, semantically enriched intelligent systems.
Abstract: Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability, reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics.
Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.
[115] Towards Reliable Sea Ice Drift Estimation in the Arctic Deep Learning Optical Flow on RADARSAT-2
Daniela Martin,Joseph Gallego
Main category: cs.CV
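TL;DR: Presents the first large-scale benchmark of 48 deep learning optical flow models on RADARSAT-2 ScanSAR sea ice imagery; the models estimate sea ice drift with sub-kilometer accuracy, clearly outperform classical methods, and produce spatially continuous drift fields, opening new opportunities for Arctic navigation and climate modeling.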
Details
Motivation: Optical flow methods have advanced rapidly in computer vision but remain underused for geophysical problems and satellite SAR imagery; classical methods rely on strong assumptions that limit accuracy in complex scenarios, which motivates testing deep learning methods for sea ice drift estimation.
Method: Benchmarks 48 deep learning optical flow models at large scale on RADARSAT-2 ScanSAR sea ice imagery, using endpoint error (EPE) and Fl-all metrics with GNSS buoy data as ground truth.
Result: Several models reach sub-kilometer accuracy (EPE of 6 to 8 pixels, i.e. 300 to 400 m), capture consistent regional drift patterns, and clearly outperform classical methods.
Conclusion: Deep learning optical flow transfers effectively to polar remote sensing and produces spatially continuous sea ice drift fields that compensate for sparse buoy observations, with potential for broad use in Arctic navigation and climate modeling.
Abstract: Accurate estimation of sea ice drift is critical for Arctic navigation, climate research, and operational forecasting. While optical flow, a computer vision technique for estimating pixel-wise motion between consecutive images, has advanced rapidly in computer vision, its applicability to geophysical problems and to satellite SAR imagery remains underexplored. Classical optical flow methods rely on mathematical models and strong assumptions about motion, which limit their accuracy in complex scenarios. Recent deep-learning-based approaches have substantially improved performance and are now the standard in computer vision, motivating their application to sea ice drift estimation. We present the first large-scale benchmark of 48 deep learning optical flow models on RADARSAT-2 ScanSAR sea ice imagery, evaluated with endpoint error (EPE) and Fl-all metrics against GNSS-tracked buoys. Several models achieve sub-kilometer accuracy (EPE 6 to 8 pixels, 300 to 400 m), a small error relative to the spatial scales of sea ice motion and typical navigation requirements in the Arctic. Our results demonstrate that the models are capable of capturing consistent regional drift patterns and that recent deep-learning-based optical flow methods, which have substantially improved motion estimation accuracy compared to classical methods, can be effectively transferred to polar remote sensing.
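As a rough illustration of the endpoint error (EPE) figures quoted above, the sketch below computes EPE as the mean Euclidean distance between predicted and reference displacement vectors, which is the conventional definition. The 50 m pixel spacing is an assumption inferred from the abstract's own equivalence of 6 to 8 pixels with 300 to 400 m; the function names and sample vectors are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence, Tuple

Vector = Tuple[float, float]

# Assumed pixel spacing in metres, inferred from "6 to 8 pixels, 300 to 400 m".
PIXEL_SPACING_M = 50.0


def endpoint_error(pred: Sequence[Vector], ref: Sequence[Vector]) -> float:
    """Mean Euclidean distance, in pixels, between predicted and reference flow vectors."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and of equal length")
    total = sum(math.hypot(pu - ru, pv - rv) for (pu, pv), (ru, rv) in zip(pred, ref))
    return total / len(pred)


def epe_in_metres(epe_pixels: float, spacing_m: float = PIXEL_SPACING_M) -> float:
    """Convert an EPE in pixels to metres under the assumed pixel spacing."""
    return epe_pixels * spacing_m


# Hypothetical drift vectors (pixels) at two buoy locations.
predicted = [(3.0, 4.0), (10.0, 0.0)]
reference = [(0.0, 0.0), (10.0, 7.0)]

epe_px = endpoint_error(predicted, reference)  # errors of 5 and 7 pixels, mean 6
epe_m = epe_in_metres(epe_px)                  # 300 m under the assumed spacing
```

In a buoy-based evaluation such as the one described, the reference vectors would only be available at buoy positions, so the mean is taken over those sparse locations rather than over every pixel.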
Optical flow produces spatially continuous drift fields, providing motion estimates for every image pixel rather than at sparse buoy locations, offering new opportunities for navigation and climate modeling.
[116] Improving Classification of Occluded Objects through Scene Context
Courtney M. King,Daniel D. Leeds,Damian Lyons,George Kalaitzis
Main category: cs.CV
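TL;DR: Proposes two scene-based information fusion techniques that make RPN-DCNN object detection networks more robust to occlusion: selecting a scene-adapted network before prediction and fusing scene knowledge after detection, improving both recall and precision.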
Details
Motivation: Occlusion severely degrades object recognition, and existing methods do not make effective use of scene context to mitigate it. Inspired by the role of scene context in biological vision, the work introduces scene information to make detection models more robust to occlusion.
Method: Proposes two scene information fusion strategies: before prediction, select a custom object detection network based on the background scene; after detection, fuse scene knowledge into the initial object scores output by the RPN. Different training strategies are also compared, and training on a mix of occluded and unoccluded images works best.
Result: On challenging datasets with partial occlusion, the methods improve both recall and precision over baselines, and the approach is interpretable and adapts to other datasets.
Conclusion: Scene context effectively improves object detection under occlusion; the two fusion methods are simple and effective and offer workable directions for further research and practical use.
Abstract: The presence of occlusions has provided substantial challenges to typically powerful object recognition algorithms. Additional sources of information can be extremely valuable to reduce errors caused by occlusions. Scene context is known to aid in object recognition in biological vision. In this work, we attempt to add robustness into existing Region Proposal Network-Deep Convolutional Neural Network (RPN-DCNN) object detection networks through two distinct scene-based information fusion techniques. We present one algorithm under each methodology: the first operates prior to prediction, selecting a custom object network to use based on the identified background scene, and the second operates after detection, fusing scene knowledge into initial object scores output by the RPN. We demonstrate our algorithms on challenging datasets featuring partial occlusions, which show overall improvement in both recall and precision against baseline methods. In addition, our experiments contrast multiple training methodologies for occlusion handling, finding that training on a combination of both occluded and unoccluded images demonstrates an improvement over the others. Our method is interpretable and can easily be adapted to other datasets, offering many future directions for research and practical applications.
[117] Process Integrated Computer Vision for Real-Time Failure Prediction in Steel Rolling Mill
Vaibhav Kurrey,Sivakalyan Pujari,Gagan Raj Gupta
Main category: cs.CV
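TL;DR: Presents a machine vision-based anomaly detection system for failure prediction in a steel rolling mill; industrial cameras and deep learning models monitor equipment in real time, and sensor and visual data are combined to locate failures and analyze root causes, improving reliability and productivity.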
Details
Motivation: In industrial settings such as steel rolling mills, unplanned downtime leads to high maintenance costs, so an intelligent monitoring system is needed that predicts equipment failures early and provides actionable insights.
Method: Deploys industrial cameras to capture live video of the production line, runs deep learning models for real-time inference on a centralized video server, and jointly analyzes sensor data from data acquisition systems with visual inputs to predict failures and locate root causes.
Result: The system gives early warning of equipment anomalies and process interruptions, reduces unplanned downtime and maintenance costs, and shows that deployment can scale across production lines with low resource overhead.
Conclusion: The integrated vision-sensor analysis framework improves operational reliability, productivity, and profitability in industrial manufacturing and has potential for large-scale use in complex industrial settings.
Abstract: We present a long-term deployment study of a machine vision-based anomaly detection system for failure prediction in a steel rolling mill. The system integrates industrial cameras to monitor equipment operation, alignment, and hot bar motion in real time along the process line. Live video streams are processed on a centralized video server using deep learning models, enabling early prediction of equipment failures and process interruptions, thereby reducing unplanned breakdown costs. Server-based inference minimizes the computational load on industrial process control systems (PLCs), supporting scalable deployment across production lines with minimal additional resources. By jointly analyzing sensor data from data acquisition systems and visual inputs, the system identifies the location and probable root causes of failures, providing actionable insights for proactive maintenance. This integrated approach enhances operational reliability, productivity, and profitability in industrial manufacturing environments.
[118] The Impact and Outlook of 3D Gaussian Splatting
Bernhard Kerbl
Main category: cs.CV
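TL;DR: This paper surveys the rapid development of 3D Gaussian Splatting (3DGS) in 3D scene representation since its introduction, covering key advances in efficiency, dynamic representations, mathematical foundations, mobile and VR applications, large-scale environments, and fast radiance field reconstruction.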
Details
Motivation: The introduction of 3DGS sparked broad research interest and advanced 3D vision and graphics, but the follow-up research directions and technical evolution need to be organized systematically.
Method: The paper summarizes and categorizes follow-up work on 3DGS, analyzing technical paths in training and rendering efficiency, dynamic extensions (4DGS), the theory of appearance modeling, mobile deployment, large-scale scene handling, and fast reconstruction via feed-forward or distributed computation.
Result: Several key research directions are identified, showing how 3DGS developed from a novel representation into a foundational tool for 3D vision and graphics.
Conclusion: 3DGS has become a versatile and foundational 3D representation framework with broad application prospects and room for technical extension.
Abstract: Since its introduction, 3D Gaussian Splatting (3DGS) has rapidly transformed the landscape of 3D scene representations, inspiring an extensive body of associated research. Follow-up work includes analyses and contributions that enhance the efficiency, scalability, and real-world applicability of 3DGS. In this summary, we present an overview of several key directions that have emerged in the wake of 3DGS. We highlight advances enabling resource-efficient training and rendering, the evolution toward dynamic (or four-dimensional, 4DGS) representations, and deeper exploration of the mathematical foundations underlying its appearance modeling and rendering process. Furthermore, we examine efforts to bring 3DGS to mobile and virtual reality platforms, its extension to massive-scale environments, and recent progress toward near-instant radiance field reconstruction via feed-forward or distributed computation. Collectively, these developments illustrate how 3DGS has evolved from a breakthrough representation into a versatile and foundational tool for 3D vision and graphics.
[119] SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models
Anushka Sivakumar,Andrew Zhang,Zaber Hakim,Chris Thomas
Main category: cs.CV
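TL;DR: This paper proposes SteerVLM, a lightweight steering module for vision-language models (VLMs) that adjusts the activations connecting the language modality with image context, giving fine-grained inference-time control over output semantics without modifying model weights.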
Details
Motivation: Existing VLMs fall short in instruction following, particularly for complex semantic control and hallucination reduction, and they lack an efficient, flexible intervention method that requires no fine-tuning.
Method: SteerVLM learns from the latent embeddings of paired prompts that encode a target behavior and its converse, and dynamically adjusts cross-modal activations. It uses dimension-wise activation modulation and adaptive steering across layers, with learned parameters equal to only 0.14% of the original model's size. The authors also build VNIA, a multimodal dataset for developing and evaluating steering techniques.
Result: The method outperforms existing intervention techniques on VLM steering and hallucination mitigation benchmarks, achieving effective model control while preserving performance on off-target tasks.
Conclusion: SteerVLM offers an efficient, flexible, low-overhead approach to multimodal model control, using activation engineering to achieve fine-grained semantic control at inference time.
Abstract: This work introduces SteerVLM, a lightweight steering module designed to guide Vision-Language Models (VLMs) towards outputs that better adhere to desired instructions. Our approach learns from the latent embeddings of paired prompts encoding target and converse behaviors to dynamically adjust activations connecting the language modality with image context. This allows for fine-grained, inference-time control over complex output semantics without modifying model weights while preserving performance on off-target tasks. Our steering module requires learning parameters equal to 0.14% of the original VLM's size. Our steering module gains model control through dimension-wise activation modulation and adaptive steering across layers without requiring pre-extracted static vectors or manual tuning of intervention points. Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a multimodal dataset specifically created to facilitate the development and evaluation of VLM steering techniques. Our method outperforms existing intervention techniques on steering and hallucination mitigation benchmarks for VLMs and proposes a robust solution for multimodal model control through activation engineering.
[120] Surpassing state of the art on AMD area estimation from RGB fundus images through careful selection of U-Net architectures and loss functions for class imbalance
Valentyna Starodub,Mantas Lukoševičius
Main category: cs.CV
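TL;DR: Building on the U-Net architecture, this study improves semantic segmentation of age-related macular degeneration (AMD) lesions in RGB fundus images through changes to the network structure, encoder choice, pre-processing, and loss functions, outperforming all prior methods on the ADAM challenge dataset.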
Details
Motivation: AMD is a leading cause of vision impairment in older adults, and non-invasive, low-cost automatic lesion detection methods are urgently needed. Existing methods still leave room for improvement in multi-class lesion segmentation.
Method: Using U-Net as the base, the study compares different pre-processing techniques and encoders (backbones) of varying complexity, and applies specialized loss functions to mitigate class imbalance at both the image level and the pixel level.
Result: The final framework outperforms all previous submissions on the ADAM challenge's multi-class AMD lesion segmentation task.
Conclusion: The improved U-Net framework effectively strengthens AMD lesion detection in RGB fundus images and shows potential for clinical use. The source code is freely available to support follow-up research.
Abstract: Age-related macular degeneration (AMD) is one of the leading causes of irreversible vision impairment in people over the age of 60. This research focuses on semantic segmentation for AMD lesion detection in RGB fundus images, a non-invasive and cost-effective imaging technique. The results of the ADAM challenge - the most comprehensive AMD detection from RGB fundus images research competition and open dataset to date - serve as a benchmark for our evaluation. Taking the U-Net connectivity as a base of our framework, we evaluate and compare several approaches to improve the segmentation model's architecture and training pipeline, including pre-processing techniques, encoder (backbone) deep network types of varying complexity, and specialized loss functions to mitigate class imbalances on image and pixel levels. The main outcome of this research is the final configuration of the AMD detection framework, which outperforms all the prior ADAM challenge submissions on the multi-class segmentation of different AMD lesion types in non-invasive RGB fundus images. The source code used to conduct the experiments presented in this paper is made freely available.
[121] ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Aniruddh Bansal,Davit Soselia,Dang Nguyen,Tianyi Zhou
Main category: cs.CV
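TL;DR: This paper introduces the ChartAlign Benchmark (ChartAB), a new benchmark for comprehensively evaluating vision-language models (VLMs) on chart grounding tasks, including extracting tabular data, localizing visualization elements, and recognizing various attributes. A two-stage inference workflow additionally evaluates the models' ability to align and compare elements across multiple charts.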
Details
Motivation: Existing vision-language models fall short in accurately perceiving chart details and extracting fine-grained structures, which limits their use in multi-chart comparison and reasoning. A dedicated benchmark is therefore needed to systematically evaluate and expose these weaknesses in chart understanding.
Method: The authors design a JSON template to facilitate the computation of evaluation metrics and build a dataset covering charts of diverse types and complexities. They also propose a two-stage inference workflow to evaluate alignment and comparison across charts.
Result: Evaluations of several recent VLMs reveal perception biases, fragility, and hallucinations in chart understanding, and show marked differences among models on fine-grained tasks.
Conclusion: ChartAB effectively exposes key shortcomings of current VLMs in chart grounding and points to specific directions for improving chart analysis and reasoning in future models.
Abstract: Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored for each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs' capability to align and compare elements/attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.
[122] HEIR: Learning Graph-Based Motion Hierarchies
Cheng Zheng,William Koch,Baiang Li,Felix Heide
Main category: cs.CV
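TL;DR: This paper proposes a general, data-driven hierarchical motion modeling method that uses graph neural networks to learn structured motion relationships from data. It reconstructs the intrinsic motion hierarchy in the 1D and 2D cases and produces more realistic, interpretable deformations on dynamic 3D scenes.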
Details
Motivation: Existing methods rely on manually defined or heuristic hierarchies, which generalize poorly across tasks, so a method is needed that learns interpretable motion hierarchies automatically from data.
Method: Motion is represented as a graph-based hierarchy that decomposes global motion into patterns inherited from parent nodes plus local residuals. Within a differentiable graph learning framework, graph neural networks learn the parent-child dependencies between nodes.
Result: The method is evaluated on 1D translational motion, 2D rotational motion, and dynamic 3D Gaussian splatting scenes. It reconstructs the intrinsic motion hierarchy in the 1D and 2D cases and produces more realistic and interpretable deformations than the baseline on the 3D scenes.
Conclusion: The method provides a flexible, data-driven hierarchical modeling paradigm applicable to a broad range of motion-centric tasks.
Abstract: Hierarchical structures of motion exist across research fields, including computer vision, graphics, and robotics, where complex dynamics typically arise from coordinated interactions among simpler motion components. Existing methods to model such dynamics typically rely on manually-defined or heuristic hierarchies with fixed motion primitives, limiting their generalizability across different tasks. In this work, we propose a general hierarchical motion modeling method that learns structured, interpretable motion relationships directly from data. Our method represents observed motions using graph-based hierarchies, explicitly decomposing global absolute motions into parent-inherited patterns and local motion residuals. We formulate hierarchy inference as a differentiable graph learning problem, where vertices represent elemental motions and directed edges capture learned parent-child dependencies through graph neural networks. We evaluate our hierarchical reconstruction approach on three examples: 1D translational motion, 2D rotational motion, and dynamic 3D scene deformation via Gaussian splatting. Experimental results show that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases, and produces more realistic and interpretable deformations compared to the baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable, data-driven hierarchical modeling paradigm, our method offers a formulation applicable to a broad range of motion-centric tasks. Project Page: https://light.princeton.edu/HEIR/
[123] The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
Jing Lin,Ruisi Wang,Junzhe Lu,Ziqi Huang,Guorui Song,Ailing Zeng,Xian Liu,Chen Wei,Wanqi Yin,Qingping Sun,Zhongang Cai,Lei Yang,Ziwei Liu
Main category: cs.CV
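TL;DR: This paper presents a comprehensive framework for transferring knowledge from video generation (ViGen) to 3D human motion generation (MoGen) across data, modeling, and evaluation, significantly improving the generalization of MoGen.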
Details
Motivation: Existing 3D human motion generation models face a bottleneck in generalization, whereas video generation has shown stronger generalization in modeling human behavior. The authors therefore draw on advances in ViGen to improve MoGen.
Method: (1) Build ViMoGen-228K, a large-scale dataset that combines optical MoCap data, annotated motions from web videos, and samples generated by ViGen models. (2) Propose ViMoGen, a flow-matching-based diffusion transformer that fuses priors from multiple sources through gated multimodal conditioning, along with ViMoGen-light, a lightweight distilled variant for efficiency. (3) Design MBench, a hierarchical benchmark for fine-grained evaluation of motion quality, prompt fidelity, and generalization ability.
Result: Experiments show that the framework significantly outperforms existing methods in both automatic metrics and human evaluation, with particularly strong semantic diversity and generalization.
Conclusion: By systematically transferring knowledge from video generation, the work improves the generalization and semantic richness of 3D motion generation and provides new data, models, and evaluation standards for future MoGen research.
Abstract: Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.
[124] Scaling Image Geo-Localization to Continent Level
Philipp Lindenberger,Paul-Edouard Sarlin,Jan Hosang,Matteo Balice,Marc Pollefeys,Simon Lynen,Eduard Trulls
Main category: cs.CV
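TL;DR: This paper proposes a hybrid approach that learns feature representations through a proxy classification task and combines them with aerial imagery embeddings to achieve fine-grained image geo-localization at continental scale.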
Details
Motivation: Existing methods struggle to localize images precisely at global scale: standard retrieval is inefficient and fails when coverage is insufficient, global classification yields coarse results, and cross-view retrieval is limited by the domain gap and has mostly been studied on small regions.
Method: A proxy classification task trains the model to implicitly encode location information. The learned prototypes are then combined with aerial imagery embeddings to enable robust, fine-grained retrieval for ground-level images.
Result: On a dataset covering a large part of Europe, more than 68% of queries are localized to within 200 m.
Conclusion: The method achieves efficient, fine-grained image geo-localization over a large geographic area, going beyond the coarse results of scalable classification-based approaches. The code is publicly available.
Abstract: Determining the precise geographic location of an image at a global scale remains an unsolved challenge. Standard image retrieval techniques are inefficient due to the sheer volume of images (>100M) and fail when coverage is insufficient. Scalable solutions, however, involve a trade-off: global classification typically yields coarse results (10+ kilometers), while cross-view retrieval between ground and aerial imagery suffers from a domain gap and has been primarily studied on smaller regions. This paper introduces a hybrid approach that achieves fine-grained geo-localization across a large geographic expanse the size of a continent. We leverage a proxy classification task during training to learn rich feature representations that implicitly encode precise location information. We combine these learned prototypes with embeddings of aerial imagery to increase robustness to the sparsity of ground-level data. This enables direct, fine-grained retrieval over areas spanning multiple countries. Our extensive evaluation demonstrates that our approach can localize within 200m more than 68% of queries of a dataset covering a large part of Europe. The code is publicly available at https://scaling-geoloc.github.io.
[125] Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Ziyu Guo,Xinyan Chen,Renrui Zhang,Ruichuan An,Yu Qi,Dongzhi Jiang,Xiangtai Li,Manyuan Zhang,Hongsheng Li,Pheng-Ann Heng
Main category: cs.CV
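TL;DR: This paper studies how current video generation models such as Veo-3 perform on zero-shot visual reasoning tasks. It builds the MME-CoF benchmark and evaluates reasoning across 12 dimensions, finding that the models do well on short-horizon spatial coherence but remain limited in long-horizon causal reasoning and abstract logic, and are not yet reliable as standalone zero-shot reasoners.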
Details
Motivation: The study explores whether video generation models can reason zero-shot in challenging visual reasoning scenarios, and what this reveals about the world knowledge they encode and their reasoning potential.
Method: The authors build the MME-CoF benchmark and conduct a systematic empirical evaluation of advanced video models such as Veo-3 across 12 reasoning dimensions, analyzing their behavior under the Chain-of-Frame (CoF) reasoning paradigm.
Result: Current video models show some reasoning ability on short-horizon spatial coherence and locally consistent dynamics, but have clear limitations in long-horizon causal reasoning, geometric constraints, and abstract logic.
Conclusion: Existing video models are not yet reliable as standalone zero-shot reasoners, but they show promise as complementary visual engines alongside dedicated reasoning models.
Abstract: Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io
[126] SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting
Dongyue Lu,Ao Liang,Tianxin Huang,Xiao Fu,Yuyang Zhao,Baorui Ma,Liang Pan,Wei Yin,Lingdong Kong,Wei Tsang Ooi,Ziwei Liu
Main category: cs.CV
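TL;DR: This paper proposes SEE4D, a pose-free trajectory-to-camera framework that generates 4D content from casual videos by rendering to fixed virtual cameras and applying view-conditional video inpainting.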
Details
Motivation: Existing methods depend on manually annotated camera poses or complex trajectory prediction, which makes them hard to apply to in-the-wild videos and prone to entangling camera motion with scene dynamics.
Method: Frames are rendered to a bank of virtual cameras, and a view-conditional video inpainting model trained by denoising fills in missing regions without 3D annotations. A spatiotemporal autoregressive inference pipeline is designed to keep generation coherent.
Result: The method outperforms pose- or trajectory-conditioned approaches on cross-view video generation and sparse reconstruction tasks, with better generalization and performance.
Conclusion: SEE4D enables 4D scene modeling without pose supervision and advances practical 4D world modeling from casual videos.
Abstract: Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate SEE4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.
[127] Masked Diffusion Captioning for Visual Feature Learning
Chao Feng,Zihao Wei,Andrew Owens
Main category: cs.CV
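TL;DR: This paper proposes masked diffusion captioning (MDC), an image captioning approach based on an image-conditioned masked diffusion language model that learns visual features by reconstructing masked text tokens. The learned features are competitive across a range of downstream tasks.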
Details
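Motivation: In conventional autoregressive captioning, the strength of the visual learning signal depends heavily on token position, and auxiliary objectives are often required. This work aims for a more efficient and stable way to learn visual features.
Method: Using an image-conditioned masked diffusion language model, text tokens in each image-text pair are randomly masked during training, and a decoder conditioned on visual features is trained to reconstruct the original text.
Result: Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those from autoregressive and contrastive approaches.
Conclusion: MDC offers a new paradigm for visual feature learning in which the learning signal does not depend on token position and the need for auxiliary objectives is reduced, with good generality and competitive performance.
Abstract: We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token's position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
[128] OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes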
Yukun Huang,Jiwen Yu,Yanning Zhou,Jianan Wang,Xintao Wang,Pengfei Wan,Xihui Liu
Main category: cs.CV
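TL;DR: This paper proposes OmniX, a panorama-based 2D lifting framework for generating graphics-ready 3D scenes suitable for physically based rendering and simulation, reusing 2D generative priors for panoramic perception, generation, and completion.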