cs.CL
[1] StreetMath: Study of LLMs' Approximation Behaviors
Chiung-Yi Tseng,Somshubhra Roy,Maisha Thasin,Danyang Zhang,Blessing Effiong
Main category: cs.CL
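TL;DR: This paper introduces the StreetMath benchmark for evaluating the approximate mathematical reasoning of large language models in real-world scenarios. It finds that LLMs still tend to compute exact values in approximation tasks, consuming more compute, and that exact and approximate arithmetic rely on different neural mechanisms, indicating that LLMs lack the "cognitive miserliness" humans show in street math.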
Details
Motivation: Prior work has focused on LLM performance on exact arithmetic and has largely overlooked informal, fast mathematical reasoning (that is, approximation), especially in non-autoregressive models. This paper aims to fill that gap.
Method: Introduces the StreetMath benchmark, evaluates a range of LLM architectures (including Qwen, Dream, and Falcon-Mamba), and applies mechanistic interpretability techniques to analyze the models' internal computational states, comparing the neural mechanisms behind exact and approximate arithmetic.
Result: Experiments show that LLMs still tend to compute exact values or invoke external tools in approximation tasks; even when early layers have already reached the correct answer, the models spend additional tokens. Exact and approximate arithmetic rely on different neural components, and LLMs do not show the cognitive miserliness humans display in street math.
Conclusion: LLMs are inefficient at approximate mathematical reasoning: they fail to switch to low-cost approximate thinking in appropriate contexts as humans do, which reveals a limitation in the cognitive flexibility of current LLMs.
Abstract: There is a substantial body of literature examining the mathematical reasoning capabilities of large language models (LLMs), particularly their performance on precise arithmetic operations in autoregressive architectures. However, their ability to perform approximate reasoning in informal, fast-paced mathematical operations has received far less attention, especially among non-autoregressive decoder models. Our work addresses this gap by introducing StreetMath, a benchmark designed to evaluate models' approximation abilities under real-world approximation scenarios. We conduct extensive evaluations across different LLM architectures: Qwen3-4B-Instruct-2507, Qwen3-4B-Thinking-2507, Dream-v0-Instruct-7B, Falcon-Mamba-7B-Instruct, and Mamba-GPT-3B. Furthermore, we apply mechanistic interpretability techniques to probe their internal computational states. Our analysis reveals that LLMs generally attempt to compute exact values or invoke external tools even in tasks that call for approximation. Moreover, while models sometimes reach the correct answer in early layers or steps, they still consume more tokens when solving approximation tasks. Additional experiments indicate that exact and approximate arithmetic operations rely on largely separate neural components. Drawing upon research in cognitive psychology, we argue that LLMs do not exhibit cognitive miserliness in the same way humans do in street math settings. We open source our work at https://github.com/ctseng777/StreetMath
[2] Review Based Entity Ranking using Fuzzy Logic Algorithmic Approach: Analysis
Pratik N. Kalamkar,Anupama G. Phakatkar
Main category: cs.CL
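TL;DR: This paper proposes a fine-grained sentiment classification method based on fuzzy logic and syntactic dependency analysis that grades the strength of opinion words in product reviews and ranks entities accordingly.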
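As a rough illustration of the strength grading this entry describes, the sketch below maps an opinion-strength score to one of five levels with triangular fuzzy membership functions. The [0, 1] score scale, the breakpoints, and the names `triangular`, `LEVELS`, `memberships`, and `classify` are all assumptions made for illustration; the paper's own membership functions are not given in this listing.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 1 at `peak`, falling linearly to 0 at `left` and `right`."""
    if x == peak:
        return 1.0
    if x <= left or x >= right:
        return 0.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical fuzzy sets over a strength score in [0, 1] (assumed breakpoints).
LEVELS = {
    "very weak": (-0.25, 0.0, 0.25),
    "weak": (0.0, 0.25, 0.5),
    "moderate": (0.25, 0.5, 0.75),
    "strong": (0.5, 0.75, 1.0),
    "very strong": (0.75, 1.0, 1.25),
}


def memberships(score):
    """Degree of membership of `score` in every strength level."""
    return {name: triangular(score, *abc) for name, abc in LEVELS.items()}


def classify(score):
    """Pick the level with the highest membership degree."""
    degrees = memberships(score)
    return max(degrees, key=degrees.get)
```

A score of 0.6 belongs partly to "moderate" and partly to "strong"; `classify` simply returns whichever degree is larger, which is one of several possible defuzzification choices.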
Details
Motivation: Existing lexicon-based sentiment analysis methods ignore differences in opinion strength and so cannot capture the nuances of user sentiment; a fine-grained method that distinguishes levels of sentiment strength is needed.
Method: Combines opinion words (such as adverbs, adjectives, nouns, and verbs) related to a specific product aspect, uses a fuzzy logic algorithm to classify them into strength levels (such as very weak, weak, moderate, very strong, and strong), and uses syntactic dependency relations to identify the words related to the target aspect.
Result: Entity scores can be computed from the orientation and strength of the opinion words tied to a specific aspect in the reviews, enabling finer-grained entity ranking.
Conclusion: By introducing sentiment-strength grading and syntactic dependency analysis, the method improves the precision of lexicon-based sentiment analysis and suits applications that need fine-grained sentiment understanding.
Abstract: Opinion mining, also called sentiment analysis, is the field of study that analyzes people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes. The holistic lexicon-based approach does not consider the strength of each opinion, i.e., whether the opinion is very strongly negative (or positive), strongly negative (or positive), moderately negative (or positive), very weakly negative (or positive), or weakly negative (or positive). In this paper, we propose an approach to rank entities based on the orientation and strength of the entity reviews and user's queries by classifying them into granularity levels (i.e., very weak, weak, moderate, very strong, and strong) by combining opinion words (i.e., adverb, adjective, noun, and verb) that are related to the aspect of interest of a certain product. We use a fuzzy logic algorithmic approach to classify opinion words into different categories and syntactic dependency resolution to find relations for desired aspect words. Opinion words related to certain aspects of interest are considered to find the entity score for that aspect in the review.
[3] LASTIST: LArge-Scale Target-Independent STance dataset
DongJae Kim,Yaejin Lee,Minsu Park,Eunil Park
Main category: cs.CL
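TL;DR: This paper presents LASTIST, a large-scale Korean stance detection dataset of 563,299 labeled sentences that supports target-independent stance detection and diachronic stance evolution analysis.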
Details
Motivation: Existing stance detection research concentrates on English and on target-specific stance identification; it offers little support for low-resource languages such as Korean and lacks datasets for target-independent stance detection.
Method: Collects data from press releases issued by Korean political parties to build the LASTIST dataset, and trains state-of-the-art deep learning models for stance detection.
Result: Provides a large-scale Korean stance detection dataset that supports multiple tasks, including target-independent stance detection and diachronic stance evolution analysis.
Conclusion: The LASTIST dataset fills a gap in Korean stance detection and helps advance stance detection research for low-resource languages.
Abstract: Stance detection has emerged as an area of research in the field of artificial intelligence. However, most research is currently centered on the target-dependent stance detection task, which is based on a person's stance in favor of or against a specific target. Furthermore, most benchmark datasets are based on English, making it difficult to develop models in low-resource languages such as Korean, especially for an emerging field such as stance detection. This study proposes the LArge-Scale Target-Independent STance (LASTIST) dataset to fill this research gap. Collected from the press releases of both parties on Korean political parties, the LASTIST dataset comprises 563,299 labeled Korean sentences. We provide a detailed description of how we collected and constructed the dataset and trained state-of-the-art deep learning and stance detection models. Our LASTIST dataset is designed for various tasks in stance detection, including target-independent stance detection and diachronic evolution stance detection. We deploy our dataset at https://anonymous.4open.science/r/LASTIST-3721/.
[4] zFLoRA: Zero-Latency Fused Low-Rank Adapters
Dhananjaya Gowda,Seoha Song,Harshith Goka,Junhyun Lee
Main category: cs.CL
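TL;DR: This paper proposes a new zero-latency fused low-rank adapter (zFLoRA) that preserves large language model performance while markedly reducing inference-time latency overhead.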
Details
Motivation: Existing task-specific adapters have few parameters but still add significant compute at inference time, which limits efficient deployment.
Method: Proposes zFLoRA, which fuses the low-rank structure so that latency increases by zero or a negligible amount, with no extra inference time.
Result: On LLMs of 1B, 3B, and 7B parameters, zFLoRA performs better than LoRA and full fine-tuning across 18 tasks in commonsense reasoning, math reasoning, and summary-dialogue, with almost no added latency on NPU and GPU.
Conclusion: zFLoRA effectively removes the extra latency introduced by adapters and enables efficient multi-task deployment.
Abstract: Large language models (LLMs) are increasingly deployed with task-specific adapters catering to multiple downstream applications. In such a scenario, the additional compute associated with this apparently insignificant number of adapter parameters (typically less than 1% of the base model) turns out to be disproportionately significant during inference time (up to 2.5x that of the base model). In this paper, we propose a new zero-latency fused low-rank adapter (zFLoRA) that introduces zero or negligible latency overhead on top of the base model. Experimental results on LLMs of size 1B, 3B and 7B show that zFLoRA compares favorably against the popular supervised fine-tuning benchmarks including low-rank adapters (LoRA) as well as full fine-tuning (FFT). Experiments are conducted on 18 different tasks across three different categories, namely commonsense reasoning, math reasoning and summary-dialogue. Latency measurements made on NPU (Samsung Galaxy S25+) as well as GPU (NVIDIA H100) platforms show that the proposed zFLoRA adapters introduce zero to negligible latency overhead.
[5] BlackboxNLP-2025 MIB Shared Task: Improving Circuit Faithfulness via Better Edge Selection
Yaniv Nikankin,Dana Arad,Itay Itzhak,Anja Reusch,Adi Simhi,Gal Kesten-Pomeranz,Yonatan Belinkov
Main category: cs.CL
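TL;DR: Proposes three improvements to circuit discovery in mechanistic interpretability: bootstrapping to identify consistent edges, a ratio-based selection strategy, and integer linear programming in place of greedy selection. Experiments show the new methods outperform prior approaches across multiple tasks and models.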
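The first improvement in this entry, bootstrapping for consistent attribution scores, can be sketched roughly as follows. This is only one plausible reading: it assumes per-example attribution scores are available for each edge, resamples the examples with replacement, and keeps edges whose resampled total is positive in at least `min_frac` of the resamples. The function name, the data layout, and both thresholds are hypothetical and are not taken from the authors' code.

```python
import random


def consistent_edges(scores, n_boot=200, min_frac=0.95, seed=0):
    """Return edges whose bootstrapped attribution stays positive.

    scores: dict mapping an edge name to a list of per-example attribution
            scores (all lists the same length). Hypothetical input format.
    """
    rng = random.Random(seed)
    n_examples = len(next(iter(scores.values())))
    positive_counts = {edge: 0 for edge in scores}
    for _ in range(n_boot):
        # Resample example indices with replacement (one bootstrap draw).
        idx = [rng.randrange(n_examples) for _ in range(n_examples)]
        for edge, vals in scores.items():
            if sum(vals[i] for i in idx) > 0:
                positive_counts[edge] += 1
    return {e for e, c in positive_counts.items() if c / n_boot >= min_frac}
```

An edge whose score flips sign from example to example is dropped even if a few of its individual scores are large, which is the intuition behind filtering on consistency before selecting a circuit.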
Details
Motivation: Accurately identifying the substructure of a model that performs a given task (the circuit) is a core challenge in mechanistic interpretability, and existing methods fall short on both accuracy and faithfulness.
Method: (1) Use bootstrapping to identify edges with stable attribution scores; (2) propose a ratio-based selection strategy that prioritizes high-scoring positive edges; (3) replace the traditional greedy selection with integer linear programming.
Result: The improved methods achieve better circuit discovery performance across multiple MIB tasks and models and produce more faithful circuits.
Conclusion: The three proposed techniques improve the accuracy and reliability of circuit discovery and provide stronger analysis tools for mechanistic interpretability.
Abstract: One of the main challenges in mechanistic interpretability is circuit discovery: determining which parts of a model perform a given task. We build on the Mechanistic Interpretability Benchmark (MIB) and propose three key improvements to circuit discovery. First, we use bootstrapping to identify edges with consistent attribution scores. Second, we introduce a simple ratio-based selection strategy to prioritize strong positive-scoring edges, balancing performance and faithfulness. Third, we replace the standard greedy selection with an integer linear programming formulation. Our methods yield more faithful circuits and outperform prior approaches across multiple MIB tasks and models. Our code is available at: https://github.com/technion-cs-nlp/MIB-Shared-Task.
[6] LISTEN to Your Preferences: An LLM Framework for Multi-Objective Selection
Adam S. Jovine,Tinghan Ye,Francis Bahk,Jingjing Wang,David B. Shmoys,Peter I. Frazier
Main category: cs.CL
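TL;DR: Proposes the LISTEN framework, which uses a large language model as a zero-shot preference oracle and lets experts guide multi-objective selection in natural language. It includes two iterative algorithms, LISTEN-U (which refines a parametric utility function) and LISTEN-T (non-parametric tournament-style selection), validated on a variety of tasks.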
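The tournament-style selection attributed to LISTEN-T in this entry can be pictured with the minimal sketch below. It assumes the preference oracle is a callable `pick_best(batch)` that returns one item from a small batch; in the paper that role is played by an LLM, whereas here it is a plain Python function. The name `tournament_select` and the shuffling and batching details are assumptions, not the authors' algorithm.

```python
import random


def tournament_select(items, pick_best, batch_size=4, seed=0):
    """Repeatedly split the pool into small batches and keep each batch's winner.

    pick_best: stand-in for the preference oracle (hypothetical interface);
               it receives a non-empty list and returns one of its elements.
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 so the pool shrinks")
    rng = random.Random(seed)
    pool = list(items)
    if not pool:
        raise ValueError("items must not be empty")
    while len(pool) > 1:
        rng.shuffle(pool)
        # One oracle call per batch keeps every prompt small.
        pool = [pick_best(pool[i:i + batch_size])
                for i in range(0, len(pool), batch_size)]
    return pool[0]
```

Because each oracle call only sees `batch_size` options, the approach fits within a limited context window at the cost of several rounds of calls.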
Details
Motivation: When facing many options with multiple objectives, human experts struggle to formalize complex, implicit preferences, which makes decisions inefficient; a new way of expressing preferences that lowers the cognitive burden is needed.
Method: Proposes the LISTEN framework, which uses an LLM as a zero-shot preference oracle guided by the expert's natural-language priorities. Two algorithms are designed: LISTEN-U uses the LLM to refine a parametric utility function, and LISTEN-T performs non-parametric selection through small-batch tournaments.
Result: On flight booking, shopping, and exam scheduling tasks, LISTEN-U excels when preferences are parametrically aligned (assessed with a newly proposed concordance metric), while LISTEN-T is more robust.
Conclusion: LISTEN supports steering complex multi-objective decisions directly in natural language, reduces the cognitive burden of traditional preference elicitation, and shows the potential of LLMs in decision support.
Abstract: Human experts often struggle to select the best option from a large set of items with multiple competing objectives, a process bottlenecked by the difficulty of formalizing complex, implicit preferences. To address this, we introduce LISTEN, a framework that leverages a Large Language Model (LLM) as a zero-shot preference oracle, guided only by an expert's high-level priorities in natural language. To operate within LLM constraints like context windows and inference costs, we propose two iterative algorithms: LISTEN-U, which uses the LLM to refine a parametric utility function, and LISTEN-T, a non-parametric method that performs tournament-style selections over small batches of solutions. Evaluated on diverse tasks including flight booking, shopping, and exam scheduling, our results show LISTEN-U excels when preferences are parametrically aligned (a property we measure with a novel concordance metric), while LISTEN-T offers more robust performance. This work explores a promising direction for steering complex multi-objective decisions directly with natural language, reducing the cognitive burden of traditional preference elicitation.
[7] Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data
Haoran Deng,Yingyu Lin,Zhenghao Lin,Xiao Liu,Yizhou Sun,Yi-An Ma,Yeyun Gong
Main category: cs.CL
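TL;DR: This paper proposes LongFilter, a framework for selecting high-quality data suited to long-context pretraining. It measures information gain by contrasting model predictions under long and short contexts, identifying samples that depend on long-range dependencies.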
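One simple way to picture the long-versus-short contrast in this entry is as a per-token difference in log-probability. The sketch below assumes the log-probabilities of the same target tokens have already been computed twice, once with the full context and once with a truncated one; the averaging, the threshold, and the names `long_context_gain` and `filter_samples` are illustrative assumptions rather than LongFilter's actual scoring rule.

```python
def long_context_gain(logp_long, logp_short):
    """Mean per-token log-probability gain from seeing the extended context."""
    if len(logp_long) != len(logp_short) or not logp_long:
        raise ValueError("need two non-empty lists of equal length")
    return sum(l - s for l, s in zip(logp_long, logp_short)) / len(logp_long)


def filter_samples(samples, threshold=0.5):
    """Keep sample names whose gain exceeds `threshold` (assumed cutoff).

    samples: dict name -> (logp_long, logp_short), a hypothetical layout.
    """
    return [name for name, (lp_long, lp_short) in samples.items()
            if long_context_gain(lp_long, lp_short) > threshold]
```

A document whose tokens are predicted equally well from local context alone scores near zero and would be filtered out, matching the motivation that such data adds little for long-context training.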
Details
Motivation: Much existing long-text data lacks genuine long-range dependencies, which makes training inefficient; a method is needed to select the data that truly requires long context.
Method: LongFilter contrasts model predictions under long-context and short-context settings, computes the information gain, identifies samples that depend on long-range context, and uses them for data selection.
Result: Experiments extending LLaMA-3-8B's context length from 8K to 64K show that LongFilter efficiently selects high-quality data and yields substantial gains on benchmarks such as HELMET, LongBench, and RULER.
Conclusion: LongFilter improves the training efficiency and performance of long-context language models and is a practical data selection solution for long-context pretraining.
Abstract: Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, a significant portion of readily available long-text data lacks meaningful long-distance dependencies; most spans can be predicted using only local context. Training on such data is inefficient, making careful data selection crucial. Therefore, we introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, thereby identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.
[8] Ideology-Based LLMs for Content Moderation
Stefano Civelli,Pietro Bernardelle,Nardiena A. Pratama,Gianluca Demartini
Main category: cs.CL
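TL;DR: This study examines how LLMs adopting different ideological personas affect the fairness and consistency of harmful content classification in content moderation systems. Overall accuracy changes little, but different personas lead to markedly different harmfulness judgments, and models align more closely with views that match their persona's ideology, exposing potential bias.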
Details
Motivation: Ensuring the fairness and neutrality of LLMs in content moderation is essential, yet how persona settings affect model judgments remains unclear, so the potential bias personas introduce needs investigation.
Method: The study examines how personas with different ideological leanings affect harmful content classification across LLM architectures, model sizes, and modalities (language and vision), and verifies model-persona alignment through agreement analyses and a politically targeted task.
Result: Different personas lead to different propensities to label content as harmful; larger models align more closely with personas of the same ideology and, on the cross-ideology task, tend to defend their own position while downplaying harmfulness in opposing views.
Conclusion: Persona settings introduce subtle ideological bias and may lead LLMs to reinforce partisan views under an appearance of neutrality, a warning for the fair use of AI in content moderation.
Abstract: Large language models (LLMs) are increasingly used in content moderation systems, where ensuring fairness and neutrality is essential. In this study, we examine how persona adoption influences the consistency and fairness of harmful content classification across different LLM architectures, model sizes, and content modalities (language vs. vision). At first glance, headline performance metrics suggest that personas have little impact on overall classification accuracy. However, a closer analysis reveals important behavioral shifts. Personas with different ideological leanings display distinct propensities to label content as harmful, showing that the lens through which a model "views" input can subtly shape its judgments. Further agreement analyses highlight that models, particularly larger ones, tend to align more closely with personas from the same political ideology, strengthening within-ideology consistency while widening divergence across ideological groups. To show this effect more directly, we conducted an additional study on a politically targeted task, which confirmed that personas not only behave more coherently within their own ideology but also exhibit a tendency to defend their perspective while downplaying harmfulness in opposing views. Together, these findings highlight how persona conditioning can introduce subtle ideological biases into LLM outputs, raising concerns about the use of AI systems that may reinforce partisan perspectives under the guise of neutrality.
[9] Beyond Long Context: When Semantics Matter More than Tokens
Tarun Kumar Chawdhury,Jon D. Duke
Main category: cs.CL
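TL;DR: CLEAR achieves higher accuracy and efficiency in clinical document question answering through entity-aware retrieval, using over 70% fewer tokens than traditional methods and performing better on long documents.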
Details
Motivation: Clinical documents in EHRs are stored as base64-encoded attachments, which makes semantic question answering difficult, and traditional vector database methods struggle to capture nuanced clinical relationships.
Method: Validates the Clinical Entity Augmented Retrieval (CLEAR) method, which uses entity-aware retrieval, by building a Clinical Notes QA Evaluation Platform and comparing CLEAR against zero-shot large-context inference and traditional chunk-based retrieval-augmented generation.
Result: On 12 clinical notes, CLEAR achieved a 58.3% win rate and an average semantic similarity of 0.878 while using 78% fewer tokens than large-context processing; on long documents exceeding 65,000 tokens the win rate reached 75%.
Conclusion: Entity-aware retrieval improves the efficiency and accuracy of clinical natural language processing, and the evaluation framework provides a reusable, transparent benchmark for clinical question answering systems.
Abstract: Electronic Health Records (EHR) store clinical documentation as base64-encoded attachments in FHIR DocumentReference resources, which makes semantic question answering difficult. Traditional vector database methods often miss nuanced clinical relationships. The Clinical Entity Augmented Retrieval (CLEAR) method, introduced by Lopez et al. 2025, uses entity-aware retrieval and achieved improved performance with an F1 score of 0.90 versus 0.86 for embedding-based retrieval, while using over 70 percent fewer tokens. We developed a Clinical Notes QA Evaluation Platform to validate CLEAR against zero-shot large-context inference and traditional chunk-based retrieval-augmented generation. The platform was tested on 12 clinical notes ranging from 10,000 to 65,000 tokens representing realistic EHR content. CLEAR achieved a 58.3 percent win rate, an average semantic similarity of 0.878, and used 78 percent fewer tokens than wide-context processing. The largest performance gains occurred on long notes, with a 75 percent win rate for documents exceeding 65,000 tokens. These findings confirm that entity-aware retrieval improves both efficiency and accuracy in clinical natural language processing. The evaluation framework provides a reusable and transparent benchmark for assessing clinical question answering systems where semantic precision and computational efficiency are critical.
[10] A Survey on Efficient Large Language Model Training: From Data-centric Perspectives
Junyu Luo,Bohan Wu,Xiao Luo,Zhiping Xiao,Yiqiao Jin,Rong-Cheng Tu,Nan Yin,Yifan Wang,Jingyang Yuan,Wei Ju,Ming Zhang
Main category: cs.CL
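TL;DR: This paper presents the first systematic survey of efficient LLM post-training from a data perspective. It proposes a taxonomy covering data selection, quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems, summarizes representative methods in each category, and outlines future research directions.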
Details
Motivation: Current LLM post-training faces high data annotation costs and diminishing marginal returns from data scale, so data-efficient post-training methods are urgently needed.
Method: Proposes a data-centric taxonomy that divides data-efficient LLM post-training methods into five categories (data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems) and systematically reviews representative techniques in each.
Result: Establishes the first systematic taxonomy for data-efficient LLM post-training, summarizes existing representative methods, and analyzes the strengths and limitations of each.
Conclusion: By systematically mapping the technical paths of data-efficient post-training, the paper identifies open problems and proposes potential research directions, which may further improve data utilization in large-scale model training.
Abstract: Post-training of Large Language Models (LLMs) is crucial for unlocking their task generalization potential and domain-specific capabilities. However, the current LLM post-training paradigm faces significant data challenges, including the high costs of manual annotation and diminishing marginal returns on data scale. Therefore, achieving data-efficient post-training has become a key research question. In this paper, we present the first systematic survey of data-efficient LLM post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods, covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We summarize representative approaches in each category and outline future research directions. By examining the challenges in data-efficient LLM post-training, we highlight open problems and propose potential research avenues. We hope our work inspires further exploration into maximizing the potential of data utilization in large-scale model training. Paper List: https://github.com/luo-junyu/Awesome-Data-Efficient-LLM
[11] Evaluating the Impact of LLM-Assisted Annotation in a Perspectivized Setting: the Case of FrameNet Annotation
Frederico Belcavello,Ely Matos,Arthur Lorenzi,Lisandra Bonoto,Lívia Ruiz,Luiz Fernando Pereira,Victor Herbst,Yulla Navarro,Helen de Andrade Abreu,Lívia Dutra,Tiago Timponi Torrent
Main category: cs.CL
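TL;DR: This paper evaluates the (semi-)automation of FrameNet-style semantic annotation with an LLM-based semantic role labeler, comparing manual, automatic, and semi-automatic annotation in terms of annotation time, coverage, and diversity.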
Details
Motivation: LLMs show promise for creating language resources and datasets, but their performance and impact have not been systematically evaluated in the context of perspectivized NLP research; this paper aims to fill that gap.
Method: Compares three experimental settings (manual, automatic, and semi-automatic) to evaluate annotation time, coverage, and frame diversity in FrameNet-like semantic annotation with an LLM.
Result: Semi-automatic annotation yields greater frame diversity than human-only annotation, similar coverage, and significant time savings; fully automatic annotation performs worse on every metric except time.
Conclusion: Semi-automatic annotation combines human accuracy with LLM efficiency and is the better option for building high-quality semantically annotated datasets.
Abstract: The use of LLM-based applications as a means to accelerate and/or substitute human labor in the creation of language resources and datasets is a reality. Nonetheless, despite the potential of such tools for linguistic research, comprehensive evaluation of their performance and impact on the creation of annotated datasets, especially under a perspectivized approach to NLP, is still missing. This paper contributes to reducing this gap by reporting on an extensive evaluation of the (semi-)automatization of FrameNet-like semantic annotation by the use of an LLM-based semantic role labeler. The methodology employed compares annotation time, coverage and diversity in three experimental settings: manual, automatic and semi-automatic annotation. Results show that the hybrid, semi-automatic annotation setting leads to increased frame diversity and similar annotation coverage, when compared to the human-only setting, while the automatic setting performs considerably worse in all metrics, except for annotation time.
[12] RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline
André V. Duarte,Xuying Li,Bin Zeng,Arlindo L. Oliveira,Lei Li,Zhuo Li
Main category: cs.CL
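TL;DR: Proposes RECAP, which uses a feedback-driven loop and a jailbreaking module to extract and verify memorized training data from LLM outputs, and substantially outperforms single-iteration approaches on the EchoTrace benchmark.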
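The metric and the headline number in this entry can be checked with a short sketch. It computes a simplified ROUGE-L F1 from the longest common subsequence of whitespace-separated tokens, plus the relative gain behind the reported move from 0.38 to 0.47. The tokenization and the plain F1 weighting are simplifying assumptions; the paper's exact ROUGE-L configuration is not stated in this listing, and the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def relative_gain(before, after):
    """Relative improvement, e.g. 0.38 -> 0.47 gives roughly 0.237."""
    return (after - before) / before
```

With the abstract's figures, (0.47 - 0.38) / 0.38 is about 0.237, which matches the "nearly 24%" increase quoted for GPT-4.1.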
Details
Motivation: When an LLM's training data cannot be inspected, how can we confirm what the model has memorized? A method that effectively reveals memorized content is needed.
Method: Designs an agentic pipeline called RECAP built around a feedback-driven loop: a secondary language model compares the target model's output with a reference passage, generates minimal correction hints, and feeds them back to the target model to improve subsequent generations. A jailbreaking module is added to overcome alignment-induced refusals.
Result: Experiments on the EchoTrace benchmark (spanning over 30 full books) show that RECAP substantially outperforms single-iteration approaches. For example, GPT-4.1's average ROUGE-L score on copyrighted text extraction rose from 0.38 to 0.47, an increase of nearly 24%.
Conclusion: RECAP effectively elicits and verifies LLM memorization of training data, providing a strong tool for detecting whether a model has memorized specific content.
Abstract: If we cannot inspect the training data of a large language model (LLM), how can we ever know what it has seen? We believe the most compelling evidence arises when the model itself freely reproduces the target content. As such, we propose RECAP, an agentic pipeline designed to elicit and verify memorized training data from LLM outputs. At the heart of RECAP is a feedback-driven loop, where an initial extraction attempt is evaluated by a secondary language model, which compares the output against a reference passage and identifies discrepancies. These are then translated into minimal correction hints, which are fed back into the target model to guide subsequent generations. In addition, to address alignment-induced refusals, RECAP includes a jailbreaking module that detects and overcomes such barriers. We evaluate RECAP on EchoTrace, a new benchmark spanning over 30 full books, and the results show that RECAP leads to substantial gains over single-iteration approaches. For instance, with GPT-4.1, the average ROUGE-L score for the copyrighted text extraction improved from 0.38 to 0.47, a nearly 24% increase.
[13] Revisiting Multilingual Data Mixtures in Language Model Pretraining
Negar Foroutan,Paul Teiletche,Ayush Kumar Tarun,Antoine Bosselut
Main category: cs.CL
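TL;DR: By training language models with 1.1 billion and 3 billion parameters, this study examines the effect of multilingual pretraining data mixtures and finds that appropriately balanced multilingual data does not harm model performance and can strengthen cross-lingual capabilities, especially for low-resource languages.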
Details
Motivation: Investigates how multilingual data mixtures affect LLM performance, in particular whether a "curse of multilinguality" exists and what role English plays as a pivot language.
Method: Trains 1.1B and 3B parameter language models on multilingual corpora covering 25 to 400 languages, systematically varying the number of languages and the data proportions, and evaluates performance across language families.
Result: (1) Combining English with multilingual data does not degrade the performance of either language group, provided each language has enough training tokens; (2) using English as a pivot language improves performance across language families, whereas choosing a pivot language from within a specific family does not consistently improve performance for languages in that family; (3) no significant "curse of multilinguality" is observed as the number of training languages increases.
Conclusion: When appropriately balanced, multilingual data can enhance language model capabilities without sacrificing performance, even in low-resource settings.
Abstract: The impact of different multilingual data mixtures in pretraining large language models (LLMs) has been a topic of ongoing debate, often raising concerns about potential trade-offs between language coverage and model performance (i.e., the curse of multilinguality). In this work, we investigate these assumptions by training 1.1B and 3B parameter LLMs on diverse multilingual corpora, varying the number of languages from 25 to 400. Our study challenges common beliefs surrounding multilingual training. First, we find that combining English and multilingual data does not necessarily degrade the in-language performance of either group, provided that languages have a sufficient number of tokens included in the pretraining corpus. Second, we observe that using English as a pivot language (i.e., a high-resource language that serves as a catalyst for multilingual generalization) yields benefits across language families, and contrary to expectations, selecting a pivot language from within a specific family does not consistently improve performance for languages within that family. Lastly, we do not observe a significant "curse of multilinguality" as the number of training languages increases in models at this scale. Our findings suggest that multilingual data, when balanced appropriately, can enhance language model capabilities without compromising performance, even in low-resource settings.
[14] Semantic Label Drift in Cross-Cultural Translation
Mohsinul Kabir,Tasnim Ahmed,Md Mezbaur Rahman,Polydoros Giannouris,Sophia Ananiadou
Main category: cs.CL
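TL;DR: This paper studies semantic label drift in machine translation caused by cultural differences, finding that translation systems, including LLMs, tend to alter original labels in culturally sensitive domains and that cultural similarity strongly affects label fidelity.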
Details
Motivation: Addresses the semantic label drift that arises in low-resource languages when machine translation ignores cultural alignment, with the goal of improving translation accuracy and applicability in culturally sensitive contexts.
Method: Runs a series of experiments across culturally sensitive and neutral domains, analyzes the label changes introduced by different MT systems (including LLMs), and examines how cultural similarity between source and target languages affects label preservation.
Result: (1) MT systems, including LLMs, induce significant label drift in culturally sensitive domains; (2) LLMs encode cultural knowledge, but this can amplify label drift; (3) cultural similarity between languages is a key factor in label preservation.
Conclusion: Neglecting cultural factors not only harms label fidelity but can also cause misinterpretation and cultural conflict in downstream applications, so MT must take cultural alignment into account.
Abstract: Machine Translation (MT) is widely employed to address resource scarcity in low-resource languages by generating synthetic data from high-resource counterparts. While sentiment preservation in translation has long been studied, a critical but underexplored factor is the role of cultural alignment between source and target languages. In this paper, we hypothesize that semantic labels drift or are altered during MT due to cultural divergence. Through a series of experiments across culturally sensitive and neutral domains, we establish three key findings: (1) MT systems, including modern Large Language Models (LLMs), induce label drift during translation, particularly in culturally sensitive domains; (2) unlike earlier statistical MT tools, LLMs encode cultural knowledge, and leveraging this knowledge can amplify label drift; and (3) cultural similarity or dissimilarity between source and target languages is a crucial determinant of label preservation. Our findings highlight that neglecting cultural factors in MT not only undermines label fidelity but also risks misinterpretation and cultural conflict in downstream applications.
[15] SymCode: A Neurosymbolic Approach to Mathematical Reasoning via Verifiable Code Generation
Sina Bagheri Nezhad,Yao Li,Ameeta Agrawal
Main category: cs.CL
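TL;DR: SymCode is a neurosymbolic framework that solves math problems by generating verifiable code with the SymPy library, significantly improving the accuracy and transparency of large language models on complex mathematical reasoning tasks.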
Details
Motivation: LLMs perform poorly on complex mathematical reasoning, and traditional text-generation approaches lack a deterministic verification mechanism, which makes solutions unreliable.
Method: Proposes the SymCode framework, which reframes mathematical problem solving as verifiable code generation and uses SymPy for symbolic computation, enabling deterministic verification of the reasoning process.
Result: On benchmarks such as MATH-500 and OlympiadBench, SymCode improves accuracy by up to 13.6 percentage points over baselines, is more token-efficient, and produces more transparent errors.
Conclusion: By anchoring LLM reasoning in a deterministic symbolic engine, SymCode advances more accurate and trustworthy AI in formal domains.
Abstract: Large Language Models (LLMs) often struggle with complex mathematical reasoning, where prose-based generation leads to unverified and arithmetically unsound solutions. Current prompting strategies like Chain of Thought still operate within this unreliable medium, lacking a mechanism for deterministic verification. To address these limitations, we introduce SymCode, a neurosymbolic framework that reframes mathematical problem-solving as a task of verifiable code generation using the SymPy library. We evaluate SymCode on challenging benchmarks, including MATH-500 and OlympiadBench, demonstrating significant accuracy improvements of up to 13.6 percentage points over baselines. Our analysis shows that SymCode is not only more token-efficient but also fundamentally shifts model failures from opaque logical fallacies towards transparent, programmatic errors. By grounding LLM reasoning in a deterministic symbolic engine, SymCode represents a key step towards more accurate and trustworthy AI in formal domains.
[16] NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium
Dinghong Song,Jierui Xu,Weichu Yang,Pengfei Su,Dong Li
Main category: cs.CL
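TL;DR: This paper designs high-performance matrix multiplication (matmul) for LLM inference on the AWS Trainium architecture. Kernel fusion and novel caching strategies reduce data movement, raise SRAM bandwidth utilization, and avoid expensive matrix transposes, yielding significant speedups across multiple datasets and LLMs.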
Details
Motivation: Trainium's systolic array architecture and special data layout requirements make it hard to use at high performance, and optimizing key compute kernels such as matrix multiplication is a pressing need.
Method: Customizes matrix multiplication for Trainium using kernel fusion and novel caching strategies, reducing data movement across the software-managed memory hierarchy, maximizing SRAM bandwidth, and avoiding matrix transpose overhead.
Result: Evaluation on nine datasets and four recent LLMs shows that, compared with AWS's state-of-the-art implementation on Trainium, the method achieves an average 1.35x speedup (up to 2.22x) at the matmul kernel level and an average 1.66x speedup (up to 2.49x) for end-to-end LLM inference.
Conclusion: The Trainium-specific optimizations significantly improve matmul performance in LLM inference, overcome hardware architecture constraints, and offer a practical path to efficient inference on AI accelerators.
Abstract: AI accelerators, customized to AI workloads, provide cost-effective and high-performance solutions for training and inference. Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides an attractive option for LLM training and inference through its heterogeneous architecture. However, leveraging the Trainium architecture for high performance can be challenging because of its systolic array architecture and special data layout requirements. In this paper, we design high-performance matrix multiplication (matmul), a critical compute kernel, for LLM inference on Trainium. We introduce a series of techniques customized to Trainium based on kernel fusion and novel caching strategies to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transpose. Evaluating with nine datasets and four recent LLMs, we show that our system largely outperforms the state-of-the-art matmul implemented by AWS on Trainium: at the level of matmul kernel, it achieves an average 1.35x speedup (up to 2.22x), which translates to an average 1.66x speedup (up to 2.49x) for end-to-end LLM inference.
[17] AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
Dinghong Song,Yuan Feng,Yiwei Wang,Shangye Chen,Cyril Guyot,Filip Blagojevic,Hyeran Jeon,Pengfei Su,Dong Li
Main category: cs.CL
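TL;DR: Proposes AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps, significantly speeding up inference on CPU and GPU with negligible accuracy loss.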
Details
Motivation: In prefill-only scenarios, self-attention becomes the performance bottleneck because of its quadratic complexity in sequence length, so its computation needs to be made more efficient.
Method: Builds a caching mechanism on an attention map memorization database and uses efficient caching and similarity search to identify and reuse cached attention maps during inference, reducing self-attention overhead.
Result: AttnCache achieves an average 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with minimal accuracy loss.
Conclusion: AttnCache effectively accelerates LLM inference in prefill-only scenarios, substantially reduces self-attention cost, and has practical deployment value.
Abstract: Large Language Models (LLMs) are widely used in generative applications such as chatting, code generation, and reasoning. However, many real-world workloads such as classification, question answering, recommendation, and text embedding rely solely on the prefill stage of inference, where the model encodes input sequences without performing autoregressive decoding. In these prefill-only scenarios, the self-attention computation becomes the primary performance bottleneck due to its quadratic complexity with respect to sequence length. In this paper, we observe that semantically different sentences often produce similar attention maps across layers and heads. Building on this insight, we propose AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps. Based on an attention map memorization database, AttnCache employs efficient caching and similarity search techniques to identify and reuse pre-cached attention maps during inference, thereby reducing the computational overhead of self-attention. Experimental results show that AttnCache achieves an average of 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with negligible accuracy degradation.
[18] Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Yihe Deng,I-Hung Hsu,Jun Yan,Zifeng Wang,Rujun Han,Gufeng Zhang,Yanfei Chen,Wei Wang,Tomas Pfister,Chen-Yu Lee
Main category: cs.CL
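TL;DR: Proposes Supervised Reinforcement Learning (SRL), a new framework that reformulates problem solving as generating a sequence of logical "actions" and combines step-wise supervision with an internal reasoning monologue, significantly improving small language models on multi-step reasoning tasks.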
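The step-wise similarity reward this entry describes can be pictured with the sketch below, which scores each model action against the expert action at the same step. Using `difflib.SequenceMatcher` as the similarity measure is purely an assumption for illustration, as are the function names and the choice to align steps by position; the listing does not say which similarity function SRL actually uses.

```python
from difflib import SequenceMatcher


def step_reward(model_action, expert_action):
    """Similarity in [0, 1] between one model action and the expert's action."""
    return SequenceMatcher(None, model_action, expert_action).ratio()


def stepwise_rewards(model_actions, expert_actions):
    """One reward per aligned step; extra steps on either side are ignored."""
    return [step_reward(m, e) for m, e in zip(model_actions, expert_actions)]
```

Unlike a single pass/fail reward on the final answer, this gives a graded signal at every step, so a rollout that ends in a wrong answer can still be rewarded for the steps that resemble the expert's.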
Details
Motivation: Existing methods such as Supervised Fine-Tuning (SFT) tend to overfit long demonstrations, while Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are not sampled, making it hard to train small models for multi-step reasoning.
Method: Models problem solving as generating a sequence of logical actions, introduces an internal reasoning monologue, and provides step-wise, smooth reward signals based on the similarity between the model's actions and expert actions, using expert demonstrations from the SFT data for supervision.
Result: SRL lets small models learn complex problems they could not learn through SFT or RLVR; initializing with SRL and then refining with RLVR gives the best performance; and SRL generalizes well to agentic software engineering tasks.
Conclusion: SRL is a powerful, general training framework that offers an effective way to improve the multi-step reasoning of small language models, combining flexibility and practicality.
Abstract: Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
[19] PORTool: Tool-Use LLM Training with Rewarded Tree
Feijie Wu,Weiwu Zhu,Yuxiang Zhang,Soumya Chatterjee,Jiarong Zhu,Fan Mo,Rodin Luo,Jing Gao
Main category: cs.CL
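TL;DR: Proposes PORTool, a reinforcement learning method that improves the reasoning ability and efficiency of tool-use LLMs through tree-structured multi-trajectory exploration and step-wise rewards.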
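To make the tree-structured, step-wise reward idea concrete, here is a minimal sketch. It rests on assumptions that are not taken from the paper: each step's reward is the success rate of the rollouts passing through it, and the fork-relative advantage is that reward minus the mean over sibling steps. The names `step_rewards` and `fork_relative_advantages` are hypothetical, and the blending with trajectory-relative advantages is left out.

```python
from collections import defaultdict


def step_rewards(rollouts):
    """Map each step prefix to the share of rollouts through it that succeed.

    rollouts: list of (steps, is_correct); steps is a tuple of tool-call labels.
    Assumption: a step's reward is the success rate of its subtree.
    """
    visits = defaultdict(int)
    wins = defaultdict(int)
    for steps, is_correct in rollouts:
        for depth in range(1, len(steps) + 1):
            prefix = tuple(steps[:depth])  # one node in the rollout tree
            visits[prefix] += 1
            wins[prefix] += int(is_correct)
    return {prefix: wins[prefix] / visits[prefix] for prefix in visits}


def fork_relative_advantages(rewards):
    """Subtract the mean reward of the sibling steps under the same parent."""
    siblings = defaultdict(list)
    for prefix in rewards:
        siblings[prefix[:-1]].append(prefix)
    advantages = {}
    for children in siblings.values():
        mean = sum(rewards[c] for c in children) / len(children)
        for c in children:
            advantages[c] = rewards[c] - mean
    return advantages


# Hypothetical rollouts: two share the first step "search", one starts with "browse".
rollouts = [
    (("search", "calc"), True),
    (("search", "lookup"), False),
    (("browse",), False),
]
rewards = step_rewards(rollouts)
advantages = fork_relative_advantages(rewards)
```

Because rewards are keyed by prefix, a step shared by several trajectories gets a single reward, while different steps under the same fork are compared only against each other.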
Details
Motivation: Existing tool-use LLMs depend on static datasets and imitate fixed tool-call routines, so they cannot explore alternative solutions in dynamic environments, which limits their performance.
Method: Within a reinforcement learning framework, generate multiple tool-call trajectories that share prefixes and thus form a tree; assign step-wise rewards based on each step's contribution to a correct answer, so that shared steps receive the same reward and different steps under the same fork receive different rewards; and train the model with fork-relative advantages blended with trajectory-relative advantages.
Result: In experiments covering 17 tools and both time-sensitive and time-invariant questions, PORTool significantly improves final accuracy and reduces the number of tool-call steps relative to other training approaches; ablation studies confirm that the step-wise reward design is effective and necessary.
Conclusion: By introducing tree-structured multi-trajectory exploration and fine-grained rewards, PORTool strengthens the exploration ability and reasoning efficiency of tool-use LLMs in dynamic environments and offers a new approach to improving tool use on complex tasks.
Abstract: Current tool-use large language models (LLMs) are trained on static datasets, enabling them to interact with external tools and perform multi-step, tool-integrated reasoning, which produces tool-call trajectories. However, these models imitate how a query is resolved in a generic tool-call routine, thereby failing to explore possible solutions and demonstrating limited performance in an evolved, dynamic tool-call environment. In this work, we propose PORTool, a reinforcement learning (RL) method that encourages a tool-use LLM to explore various trajectories yielding the correct answer. Specifically, this method starts with generating multiple rollouts for a given query, and some of them share the first few tool-call steps, thereby forming a tree-like structure. Next, we assign rewards to each step, based on its ability to produce a correct answer and make successful tool calls. A shared step across different trajectories receives the same reward, while different steps under the same fork receive different rewards. Finally, these step-wise rewards are used to calculate fork-relative advantages, blended with trajectory-relative advantages, to train the LLM for tool use. The experiments utilize 17 tools to address user queries, covering both time-sensitive and time-invariant topics. We conduct ablation studies to systematically justify the necessity and the design robustness of step-wise rewards. Furthermore, we compare the proposed PORTool with other training approaches and demonstrate significant improvements in final accuracy and the number of tool-call steps.
[20] Rethinking Cross-lingual Alignment: Balancing Transfer and Cultural Erasure in Multilingual LLMs
HyoJung Han,Sweta Agrawal,Eleftheria Briakou
Main category: cs.CL
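TL;DR: Studies how cross-lingual alignment (CLA), while promoting knowledge transfer, can cause "cultural erasure"; proposes a new evaluation framework (the transfer-localization plane) and Surgical Steering, an inference-time method that applies targeted activation steering at different model layers to better balance knowledge transfer and cultural localization.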
Details
Motivation: Cross-lingual alignment helps multilingual knowledge transfer, but it can lead models to ignore the cultural differences behind languages and fail to produce responses suited to a specific cultural context, so this cultural erasure needs to be evaluated systematically and addressed.
Method: Propose the transfer-localization plane as an evaluation framework and analyze the roles that different model layers play in factual transfer and cultural localization; on that basis, design Surgical Steering, which applies differentiated activation steering to different layers at inference time.
Result: A re-evaluation of existing CLA methods finds that they generally harm cultural localization while improving factual transfer; Surgical Steering markedly improves the cultural fit of responses while preserving knowledge transfer.
Conclusion: Cross-lingual alignment must trade off knowledge transfer against cultural preservation; Surgical Steering achieves a better balance through layer-wise control and points to a new direction for building LLMs that are both multilingual and culturally sensitive.
Abstract: Cross-lingual alignment (CLA) aims to align multilingual representations, enabling Large Language Models (LLMs) to seamlessly transfer knowledge across languages. We hypothesize that this pursuit of representational convergence, while intuitive, can inadvertently cause "cultural erasure", the functional loss of providing culturally-situated responses that should diverge based on the query language. In this work, we systematically analyze this trade-off by introducing a holistic evaluation framework, the transfer-localization plane, which quantifies both desirable knowledge transfer and undesirable cultural erasure. Using this framework, we re-evaluate recent CLA approaches and find that they consistently improve factual transfer at the direct cost of cultural localization across all six languages studied. Our investigation into the internal representations of these models reveals a key insight: universal factual transfer and culturally-specific knowledge are optimally steerable at different model layers. Based on this finding, we propose Surgical Steering, a novel inference-time method that disentangles these two objectives. By applying targeted activation steering to distinct layers, our approach achieves a better balance between the two competing dimensions, effectively overcoming the limitations of current alignment techniques.
[21] Artificial Intelligence-Enabled Analysis of Radiology Reports: Epidemiology and Consequences of Incidental Thyroid Findings
Felipe Larios,Mariana Borras-Osorio,Yuqi Wu,Ana Gabriela Claros,David Toro-Tobon,Esteban Cabezas,Ricardo Loor-Torres,Maria Mateo Chavez,Kerly Guevara Maldonado,Luis Vilatuna Andrango,Maria Lizarazo Jimenez,Ivan Mateo Alzamora,Misk Al Zahidy,Marcelo Montero,Ana Cristina Proano,Cristian Soto Jacome,Jungwei W. Fan,Oscar J. Ponce-Ponte,Megan E. Branda,Naykky Singh Ospina,Juan P. Brito
Main category: cs.CL
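TL;DR: Uses natural language processing to analyze radiology reports and finds that incidental thyroid findings (ITFs) are common and associated with overdiagnosis.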
Details
Motivation: To clarify the prevalence, features, and clinical consequences of incidental thyroid findings.
Method: Develop and validate a transformer-based NLP pipeline that identifies ITFs in imaging reports from multiple modalities and extracts nodule characteristics.
Result: Among 115,683 patients, 7.8% had an ITF, of which 92.9% were nodules; patients with ITFs were more likely to undergo further work-up and treatment, and most cancers were papillary and larger at detection.
Conclusion: ITFs are common and often lead to overdiagnosis of small, low-risk thyroid cancers, which calls for standardized reporting and more selective follow-up.
Abstract: Importance: Incidental thyroid findings (ITFs) are increasingly detected on imaging performed for non-thyroid indications. Their prevalence, features, and clinical consequences remain undefined. Objective: To develop, validate, and deploy a natural language processing (NLP) pipeline to identify ITFs in radiology reports and assess their prevalence, features, and clinical outcomes. Design, Setting, and Participants: Retrospective cohort of adults without prior thyroid disease undergoing thyroid-capturing imaging at Mayo Clinic sites from July 1, 2017, to September 30, 2023. A transformer-based NLP pipeline identified ITFs and extracted nodule characteristics from image reports from multiple modalities and body regions. Main Outcomes and Measures: Prevalence of ITFs, downstream thyroid ultrasound, biopsy, thyroidectomy, and thyroid cancer diagnosis. Logistic regression identified demographic and imaging-related factors. Results: Among 115,683 patients (mean age, 56.8 [SD 17.2] years; 52.9% women), 9,077 (7.8%) had an ITF, of which 92.9% were nodules. ITFs were more likely in women, older adults, those with higher BMI, and when imaging was ordered by oncology or internal medicine. Compared with chest CT, ITFs were more likely via neck CT, PET, and nuclear medicine scans. Nodule characteristics were poorly documented, with size reported in 44% and other features in fewer than 15% (e.g. calcifications). Compared with patients without ITFs, those with ITFs had higher odds of thyroid nodule diagnosis, biopsy, thyroidectomy and thyroid cancer diagnosis. Most cancers were papillary, and larger when detected after ITFs vs no ITF. Conclusions: ITFs were common and strongly associated with cascades leading to the detection of small, low-risk cancers. These findings underscore the role of ITFs in thyroid cancer overdiagnosis and the need for standardized reporting and more selective follow-up.
[22] QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback
Taku Mikuriya,Tatsuya Ishigaki,Masayuki Kawarada,Shunya Minami,Tadashi Kadowaki,Yohichi Suzuki,Soshun Naito,Shunya Takata,Takumi Kato,Tamotsu Basseda,Reo Yamada,Hiroya Takamura
Main category: cs.CL
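TL;DR: Presents QCoder Benchmark, a framework for evaluating LLMs on quantum programming tasks that combines quantum-simulator feedback with human-written code from real programming contests, and shows that existing models still find the domain challenging.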
Details
Motivation: Quantum programming involves interaction with hardware, so conventional code-generation evaluation falls short, and the field lacks a dedicated benchmark.
Method: Build QCoder Benchmark, which integrates a quantum simulator environment to obtain metrics such as circuit depth, execution time, and error classification, and incorporates human-written code from real programming contests for comparison.
Result: GPT-4o reaches only 18.97% accuracy, whereas the reasoning-based model o3 reaches 78%, exceeding the average human success rate (39.98%).
Conclusion: QCoder Benchmark is an effective tool for evaluating LLMs on quantum programming; it shows that reasoning models have a clear advantage on this complex task and supports future research in the area.
Abstract: Large language models (LLMs) have increasingly been applied to automatic programming code generation. This task can be viewed as a language generation task that bridges natural language, human knowledge, and programming logic. However, it remains underexplored in domains that require interaction with hardware devices, such as quantum programming, where human coders write Python code that is executed on a quantum computer. To address this gap, we introduce QCoder Benchmark, an evaluation framework that assesses LLMs on quantum programming with feedback from simulated hardware devices. Our benchmark offers two key features. First, it supports evaluation using a quantum simulator environment beyond conventional Python execution, allowing feedback of domain-specific metrics such as circuit depth, execution time, and error classification, which can be used to guide better generation. Second, it incorporates human-written code submissions collected from real programming contests, enabling both quantitative comparisons and qualitative analyses of LLM outputs against human-written code. Our experiments reveal that even advanced models like GPT-4o achieve only around 18.97% accuracy, highlighting the difficulty of the benchmark. In contrast, reasoning-based models such as o3 reach up to 78% accuracy, outperforming the average success rate of human-written code (39.98%). We release the QCoder Benchmark dataset and public evaluation API to support further research.
[23] Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking
Feng Ju,Zeyu Qin,Rui Min,Zhitao He,Lingpeng Kong,Yi R. Fung
Main category: cs.CL
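TL;DR: Proposes a "one problem, multiple solutions" (1PNS) training paradigm, combined with the Reasoning Path Divergence (RPD) metric, to improve the reasoning diversity and performance of LLMs under test-time scaling.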
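This entry reports its gains as pass@k (specifically pass@16). For readers unfamiliar with the metric, the sketch below shows the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n sampled solutions of which c are correct. Whether the authors compute pass@16 with exactly this estimator is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples, c of which are correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # in which case every size-k subset contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 4 samples, 1 correct.
print(pass_at_k(4, 1, 1))
print(pass_at_k(4, 1, 4))
```

The estimator rewards having at least one correct sample among k draws, which is why more diverse outputs can raise pass@k even when single-sample accuracy changes little.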
Details
Motivation: The conventional "one problem, one solution" (1P1S) training practice limits the diversity of a model's reasoning paths, which in turn limits the benefit of test-time scaling.
Method: Introduce the "one problem, multiple solutions" (1PNS) training paradigm and propose the Reasoning Path Divergence (RPD) metric to quantify semantic differences between multi-step reasoning chains; use RPD to select the most divergent solutions for fine-tuning.
Result: On Qwen3-4B-Base, average pass@16 improves by +2.80% over a strong 1P1S baseline and by +4.99% on AIME24, with more diverse outputs and better reasoning performance.
Conclusion: 1PNS combined with RPD improves both the diversity and the accuracy of LLM reasoning and further strengthens test-time scaling.
Abstract: While Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), low diversity in model outputs often becomes a bottleneck; this is partly caused by the common "one problem, one solution" (1P1S) training practice, which provides a single canonical answer and can push models toward a narrow set of reasoning paths. To address this, we propose a "one problem, multiple solutions" (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories and thus increases inference diversity. A core challenge for 1PNS is reliably measuring semantic differences between multi-step chains of thought, so we introduce Reasoning Path Divergence (RPD), a step-level metric that aligns and scores Long Chain-of-Thought solutions to capture differences in intermediate reasoning. Using RPD, we curate maximally diverse solution sets per problem and fine-tune Qwen3-4B-Base. Experiments show that RPD-selected training yields more varied outputs and higher pass@k, with an average +2.80% gain in pass@16 over a strong 1P1S baseline and a +4.99% gain on AIME24, demonstrating that 1PNS further amplifies the effectiveness of TTS. Our code is available at https://github.com/fengjujf/Reasoning-Path-Divergence .
[24] On the Influence of Discourse Relations in Persuasive Texts
Nawar Turk,Sevag Kaspar,Leila Kosseim
Main category: cs.CL
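TL;DR: Uses LLMs and prompt engineering to explore the relationship between persuasion techniques and discourse relations and, by building silver-standard datasets, finds that six discourse relations play a key role in persuasive texts.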
Details
Motivation: Because no dataset is annotated with both persuasion techniques and discourse relations, the study aims to fill this gap and reveal the connection between the two.
Method: Starting from the SemEval 2023 Task 3 dataset, use four LLMs and ten prompts to produce 40 discourse relation classifiers, then build several silver-standard datasets with ensemble models and majority-voting strategies.
Result: Five silver-standard datasets are built, ranging from 204 to 1,281 instances; statistical analysis shows that six discourse relations, including Cause, Purpose, and Contrast, are significantly present in persuasion techniques such as Loaded Language and Exaggeration/Minimisation.
Conclusion: Six discourse relations play an important role in persuasive texts; this finding can help detect online propaganda and misinformation and improve our understanding of effective communication.
Abstract: This paper investigates the relationship between Persuasion Techniques (PTs) and Discourse Relations (DRs) by leveraging Large Language Models (LLMs) and prompt engineering. Since no dataset annotated with both PTs and DRs exists, we took the SemEval 2023 Task 3 dataset labelled with 19 PTs as a starting point and developed LLM-based classifiers to label each instance of the dataset with one of the 22 PDTB 3.0 level-2 DRs. In total, four LLMs were evaluated using 10 different prompts, resulting in 40 unique DR classifiers. Ensemble models using different majority-pooling strategies were used to create 5 silver datasets of instances labelled with both persuasion techniques and level-2 PDTB senses. The silver dataset sizes vary from 1,281 instances to 204 instances, depending on the majority pooling technique used. Statistical analysis of these silver datasets shows that six discourse relations (namely Cause, Purpose, Contrast, Cause+Belief, Concession, and Condition) play a crucial role in persuasive texts, especially in the use of Loaded Language, Exaggeration/Minimisation, Repetition and to cast Doubt. This insight can contribute to detecting online propaganda and misinformation, as well as to our general understanding of effective communication.
[25] MossNet: Mixture of State-Space Experts is a Multi-Head Attention
Shikhar Tuli,James Seale Smith,Haris Jeelani,Chi-Heng Lin,Abhishek Patel,Vasili Ramanishka,Yen-Chang Hsu,Hongxia Jin
Main category: cs.CL
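TL;DR: Proposes MossNet, a new mixture-of-state-space-experts architecture that emulates linear multi-head attention; it outperforms comparable models on language modeling and downstream tasks while scaling well and running efficiently.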
Details
Motivation: Existing SSM/GRM-based methods usually emulate only a single attention head, which limits expressiveness, so a more expressive yet efficient architecture is needed.
Method: MossNet uses a mixture-of-experts (MoE) design not only in the channel-mixing MLP blocks but also in the time-mixing SSM kernels, emulating multiple "attention heads" and thereby realizing linear multi-head attention.
Result: On language modeling and downstream tasks, MossNet outperforms Transformer and SSM architectures of similar size; larger variants trained on trillions of tokens scale well; and profiling on a Samsung Galaxy S24 Ultra and an Nvidia A100 shows favorable runtime speed and resource usage.
Conclusion: MossNet is a promising new direction for efficient, high-performing recurrent LLMs.
Abstract: Large language models (LLMs) have significantly advanced generative applications in natural language processing (NLP). Recent trends in model architectures revolve around efficient variants of transformers or state-space/gated-recurrent models (SSMs, GRMs). However, prevailing SSM/GRM-based methods often emulate only a single attention head, potentially limiting their expressiveness. In this work, we propose MossNet, a novel mixture-of-state-space-experts architecture that emulates a linear multi-head attention (MHA). MossNet leverages a mixture-of-experts (MoE) implementation not only in channel-mixing multi-layered perceptron (MLP) blocks but also in the time-mixing SSM kernels to realize multiple "attention heads." Extensive experiments on language modeling and downstream evaluations show that MossNet outperforms both transformer- and SSM-based architectures of similar model size and data budgets. Larger variants of MossNet, trained on trillions of tokens, further confirm its scalability and superior performance. In addition, real-device profiling on a Samsung Galaxy S24 Ultra and an Nvidia A100 GPU demonstrates favorable runtime speed and resource usage compared to similarly sized baselines. Our results suggest that MossNet is a compelling new direction for efficient, high-performing recurrent LLM architectures.
[26] Similarity-Distance-Magnitude Language Models
Allen Schmaltz
Main category: cs.CL
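TL;DR: Proposes language models built on a Similarity-Distance-Magnitude (SDM) activation layer, converting existing Transformer decoder models into SDM language models via supervised fine-tuning to increase the proportion of generations in the high-probability region and reduce abstentions.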
Details
Motivation: To improve the calibration and generation efficiency of language models on instruction-following tasks and reduce abstentions (refusals to generate).
Method: Add an SDM activation layer to a pre-trained Transformer decoder for binary classification of instruction-following; fine-tune with a contrastive input encoding scheme and hard negative examples generated online, adjusting the change-of-base of the next-token prediction loss.
Result: SDM language models abstain markedly less than strong supervised baselines, improving statistical efficiency and calibration.
Conclusion: An SDM activation layer combined with supervised fine-tuning improves the generation quality and reliability of existing language models, with particular promise for tasks that need high-confidence outputs.
Abstract: We introduce Similarity-Distance-Magnitude (SDM) language models (LMs), which are sequence prediction models fine-tuned to maximize the proportion of generations in the well-calibrated, high-probability region partitioned by a final-layer SDM activation layer used for binary classification of instruction-following. We demonstrate that existing pre-trained decoder-only Transformer LMs can be readily converted into SDM LMs via supervised fine-tuning, using the final-layer SDM activation layer during training to estimate a change-of-base for a supervised next-token loss over a contrastive input encoding scheme, with additional hard negative examples generated online during training. This results in reduced abstentions (i.e., improved statistical efficiency) compared to strong supervised baselines.
[27] RCScore: Quantifying Response Consistency in Large Language Models
Dongjun Jang,Youngchae Ahn,Hyopil Shin
Main category: cs.CL
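TL;DR: RCScore is a multi-dimensional framework that quantifies how instruction phrasing affects LLM responses, revealing performance differences that conventional metrics miss.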
Details
Motivation: Existing LLM evaluations usually rely on a single instruction template and overlook models' sensitivity to instruction style, which is critical in real-world use.
Method: Systematically transform benchmark problems into multiple instruction styles and introduce Cross-Response Similarity (CRS) to measure stylistic consistency.
Result: Experiments with ten LLMs on four reasoning benchmarks show that instruction style can shift accuracy by up to 16.7 percentage points; deterministic decoding yields more stable outputs, and model scale correlates positively with cross-style consistency.
Conclusion: RCScore offers a systematic way to assess instruction robustness, and stylistic consistency can serve as a useful proxy for model reliability.
Abstract: Current LLM evaluations often rely on a single instruction template, overlooking models' sensitivity to instruction style, a critical aspect for real-world deployments. We present RCScore, a multi-dimensional framework quantifying how instruction formulation affects model responses. By systematically transforming benchmark problems into multiple instruction styles, RCScore reveals performance variations undetected by conventional metrics. Our experiments across ten LLMs on four reasoning benchmarks demonstrate that instruction style can shift accuracy by up to 16.7 percentage points. We introduce Cross-Response Similarity (CRS), a method applying RCScore metrics to measure stylistic self-consistency, and establish its strong correlation with task accuracy, suggesting consistency as a valuable proxy for model reliability. Additional findings show that deterministic decoding produces more stylistically stable outputs, and model scale correlates positively with cross-style consistency. RCScore offers a principled approach to assess instruction robustness.
[28] Don't Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation
Woojin Kim,Jaeyoung Do
Main category: cs.CL
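TL;DR: Proposes Token Timestep Allocation (TTA), which assigns each token its own timestep schedule to mitigate update forgetting in diffusion language models, improving the controllability and fluency of text generation.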
Details
Motivation: Diffusion language models hold promise for fine-grained editing, but their controllability is poor, mainly because uniform, context-agnostic updates erase semantic edits across timesteps, a problem called "update forgetting".
Method: Propose Token Timestep Allocation (TTA), which gives different tokens different timestep schedules to realize soft, semantics-aware token ordering: critical tokens are frozen early, while uncertain tokens keep being refined. The method applies at inference time, supports fixed or adaptive policies, and works across various DLMs and supervision signals.
Result: TTA markedly improves controllability and fluency: on sentiment control, accuracy rises by more than 20 percent and perplexity is nearly halved while using less than one fifth of the steps; on detoxification, maximum toxicity drops from 14.5 to 12.2 and perplexity from 32.0 to 26.0.
Conclusion: Soft ordering through timestep allocation is the key mechanism for mitigating update forgetting and achieving stable, controllable diffusion text generation.
Abstract: While diffusion language models (DLMs) enable fine-grained refinement, their practical controllability remains fragile. We identify and formally characterize a central failure mode called update forgetting, in which uniform and context-agnostic updates induce token-level fluctuations across timesteps, erasing earlier semantic edits and disrupting the cumulative refinement process, thereby degrading fluency and coherence. As this failure originates in uniform and context-agnostic updates, effective control demands explicit token ordering. We propose Token Timestep Allocation (TTA), which realizes soft and semantic token ordering via per-token timestep schedules: critical tokens are frozen early, while uncertain tokens receive continued refinement. This timestep-based ordering can be instantiated as either a fixed policy or an adaptive policy driven by task signals, thereby supporting a broad spectrum of refinement strategies. Because it operates purely at inference time, it applies uniformly across various DLMs and naturally extends to diverse supervision sources. Empirically, TTA improves controllability and fluency: on sentiment control, it yields more than 20 percent higher accuracy and nearly halves perplexity using less than one fifth the steps; in detoxification, it lowers maximum toxicity (12.2 versus 14.5) and perplexity (26.0 versus 32.0). Together, these results demonstrate that softened ordering via timestep allocation is the critical lever for mitigating update forgetting and achieving stable and controllable diffusion text generation.
[29] What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data
Rajiv Movva,Smitha Milli,Sewon Min,Emma Pierson
Main category: cs.CL
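TL;DR: Proposes WIMHF, a method that uses sparse autoencoders to explain human feedback data, revealing the diversity of human preferences across datasets and the influence of context, and enabling improvements in safety and personalization.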
Details
Motivation: Because practitioners lack an understanding of what human feedback data encodes, feedback can change language models in unpredictable and undesirable ways, so a method is needed that automatically extracts the relevant features to better understand and use such data.
Method: Propose WIMHF (What's In My Human Feedback?), which analyzes human feedback data with sparse autoencoders to identify both the preferences a dataset can measure and the preferences annotators actually express.
Result: Validated on 7 datasets, WIMHF shows that a small number of interpretable features account for most of the preference prediction signal of black-box models; it reveals the diversity of human preferences and the influence of dataset context; and it surfaces potentially unsafe preferences, with re-labeling of harmful examples yielding large safety gains (+37%), while also supporting fine-grained personalization.
Conclusion: WIMHF provides a human-centered analysis method that helps practitioners better understand and use preference data, improving model safety and personalization.
Abstract: Human feedback can alter language models in unpredictable and undesirable ways, as practitioners lack a clear understanding of what feedback data encodes. While prior work studies preferences over certain attributes (e.g., length or sycophancy), automatically extracting relevant features without pre-specifying hypotheses remains challenging. We introduce What's In My Human Feedback? (WIMHF), a method to explain feedback data using sparse autoencoders. WIMHF characterizes both (1) the preferences a dataset is capable of measuring and (2) the preferences that the annotators actually express. Across 7 datasets, WIMHF identifies a small number of human-interpretable features that account for the majority of the preference prediction signal achieved by black-box models. These features reveal a wide diversity in what humans prefer, and the role of dataset-level context: for example, users on Reddit prefer informality and jokes, while annotators in HH-RLHF and PRISM disprefer them. WIMHF also surfaces potentially unsafe preferences, such as that LMArena users tend to vote against refusals, often in favor of toxic content. The learned features enable effective data curation: re-labeling the harmful examples in Arena yields large safety gains (+37%) with no cost to general performance. They also allow fine-grained personalization: on the Community Alignment dataset, we learn annotator-specific weights over subjective features that improve preference prediction. WIMHF provides a human-centered analysis method for practitioners to better understand and use preference data.
[30] Towards Global Retrieval Augmented Generation: A Benchmark for Corpus-Level Reasoning
Qi Luo,Xiaonan Li,Tingshuo Fan,Xinchi Chen,Xipeng Qiu
Main category: cs.CL
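TL;DR: Proposes GlobalQA, the first benchmark for evaluating global retrieval-augmented generation (global RAG), together with the GlobalRAG framework, which improves performance on cross-document aggregation tasks.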
Details
Motivation: Existing RAG evaluations focus on local information retrieval and do not reflect real-world needs to analyze and aggregate information across an entire document collection, so benchmarks and methods that target global RAG are needed.
Method: Design the GlobalQA benchmark, covering four task types (counting, extremum queries, sorting, and top-k extraction), and propose the GlobalRAG framework, which combines multi-tool collaboration, LLM-driven intelligent filtering, and aggregation modules to achieve structure-preserving chunk-level retrieval and precise symbolic computation.
Result: Existing RAG methods perform poorly on global tasks (the strongest baseline reaches only 1.51 F1), whereas GlobalRAG reaches 6.63 F1 on the Qwen2.5-14B model, a substantial improvement.
Conclusion: GlobalRAG improves LLM performance on global information-integration tasks and demonstrates both the potential of and the need for system designs that target global RAG.
Abstract: Retrieval-augmented generation (RAG) has emerged as a leading approach to reducing hallucinations in large language models (LLMs). Current RAG evaluation benchmarks primarily focus on what we call local RAG: retrieving relevant chunks from a small subset of documents to answer queries that require only localized understanding within specific text chunks. However, many real-world applications require a fundamentally different capability -- global RAG -- which involves aggregating and analyzing information across entire document collections to derive corpus-level insights (for example, "What are the top 10 most cited papers in 2023?"). In this paper, we introduce GlobalQA -- the first benchmark specifically designed to evaluate global RAG capabilities, covering four core task types: counting, extremum queries, sorting, and top-k extraction. Through systematic evaluation across different models and baselines, we find that existing RAG methods perform poorly on global tasks, with the strongest baseline achieving only 1.51 F1 score. To address these challenges, we propose GlobalRAG, a multi-tool collaborative framework that preserves structural coherence through chunk-level retrieval, incorporates LLM-driven intelligent filters to eliminate noisy documents, and integrates aggregation modules for precise symbolic computation. On the Qwen2.5-14B model, GlobalRAG achieves 6.63 F1 compared to the strongest baseline's 1.51 F1, validating the effectiveness of our method.
[31] Pragmatic Theories Enhance Understanding of Implied Meanings in LLMs
Takuma Sato,Seiya Kawano,Koichiro Yoshino
Main category: cs.CL
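TL;DR: Proposes supplying pragmatic theories to language models as prompts to improve their understanding of implied meanings; experiments show the method clearly outperforms the baseline on pragmatic reasoning tasks.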
Details
Motivation: Language models need to understand implied meaning better, and existing methods perform only modestly at such inference, so the study explores prompting that incorporates pragmatic theories.
Method: Present an overview of theories such as Gricean pragmatics and Relevance Theory to the language model as a prompt, guiding it through step-by-step reasoning; also compare against merely mentioning the theory names.
Result: Compared with a 0-shot Chain-of-Thought baseline that does not introduce pragmatic theories, the proposed method improves scores by up to 9.6%; merely mentioning the theory names also improves larger models by about 1-3%.
Conclusion: Incorporating pragmatic theories into prompts improves language models' understanding of implied meanings and is an effective in-context learning strategy.
Abstract: The ability to accurately interpret implied meanings plays a crucial role in human communication and language use, and language models are also expected to possess this capability. This study demonstrates that providing language models with pragmatic theories as prompts is an effective in-context learning approach for tasks that require understanding implied meanings. Specifically, we propose an approach in which an overview of pragmatic theories, such as Gricean pragmatics and Relevance Theory, is presented as a prompt to the language model, guiding it through a step-by-step reasoning process to derive a final interpretation. Experimental results showed that, compared to the baseline, which prompts intermediate reasoning without presenting pragmatic theories (0-shot Chain-of-Thought), our methods enabled language models to achieve up to 9.6% higher scores on pragmatic reasoning tasks. Furthermore, we show that even without explaining the details of pragmatic theories, merely mentioning their names in the prompt leads to a certain performance improvement (around 1-3%) in larger models compared to the baseline.
[32] Language Models Are Borrowing-Blind: A Multilingual Evaluation of Loanword Identification across 10 Languages
Mérilin Sousa Silva,Sina Ahmadi
Main category: cs.CL
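TL;DR: Studies how well pretrained language models identify loanwords and finds that, across 10 languages, models struggle to distinguish loanwords from native words even with explicit instructions and contextual information, showing a bias toward loanwords; this matters for developing NLP tools for minority languages and for language preservation.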
Details
Motivation: To investigate whether pretrained language models can distinguish loanwords from native words as humans do, particularly where a dominant language exerts lexical influence on a minority language in bilingual communities.
Method: Evaluate multiple pretrained language models (including large language models) on 10 languages, testing their ability to identify loanwords given explicit instructions and contextual information.
Result: Models perform poorly at distinguishing loanwords from native words, supporting earlier evidence that modern NLP systems favor loanwords over native equivalents.
Conclusion: Current NLP systems show a loanword bias when handling minority languages and need to improve to support linguistic diversity and the preservation of minority languages.
Abstract: Throughout language history, words are borrowed from one language to another and gradually become integrated into the recipient's lexicon. Speakers can often differentiate these loanwords from native vocabulary, particularly in bilingual communities where a dominant language continuously imposes lexical items on a minority language. This paper investigates whether pretrained language models, including large language models, possess similar capabilities for loanword identification. We evaluate multiple models across 10 languages. Despite explicit instructions and contextual information, our results show that models perform poorly in distinguishing loanwords from native ones. These findings corroborate previous evidence that modern NLP systems exhibit a bias toward loanwords rather than native equivalents. Our work has implications for developing NLP tools for minority languages and supporting language preservation in communities under lexical pressure from dominant languages.
[33] Distilling Multilingual Vision-Language Models: When Smaller Models Stay Multilingual
Sukrit Sriratanawilai,Jhayahgrit Thongwat,Romrawin Chumpu,Patomporn Payoungkhamdee,Sarana Nutanong,Peerat Limkonchotiwat
Main category: cs.CL
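TL;DR: Studies knowledge distillation for compressing multilingual vision-language models, evaluating how five distillation methods affect cross-lingual representation consistency and downstream task stability; some configurations preserve or even improve multilingual retrieval performance at half the model size, but with trade-offs in cross-task stability.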
Details
Motivation: Vision-language models perform unevenly across languages, a problem that worsens as models shrink, and knowledge distillation remains underexplored in multilingual settings.
Method: In controlled experiments, compare five knowledge distillation methods on CLIP and SigLIP2 models, evaluating cross-lingual consistency and stability on in-domain retrieval and out-of-domain visual question answering.
Result: Some distillation configurations preserve or improve multilingual retrieval performance when model size is halved, but others fail to maintain cross-task stability, exposing design-sensitive trade-offs that aggregate accuracy alone does not reveal.
Conclusion: Knowledge distillation is promising for compressing multilingual VLMs, but the distillation method must be chosen carefully to balance cross-lingual performance and task stability.
Abstract: Vision-language models (VLMs) exhibit uneven performance across languages, a problem that is often exacerbated when the model size is reduced. While knowledge distillation (KD) demonstrates promising results in transferring knowledge from larger to smaller VLMs, applying KD in multilingual settings is an underexplored area. This paper presents a controlled empirical study of KD behavior across five distillation approaches, isolating their effects on cross-lingual representation consistency and downstream performance stability under model compression. We study five distillation formulations across CLIP and SigLIP2, and evaluate them on in-domain retrieval and out-of-domain visual QA. We find that some configurations preserve or even improve multilingual retrieval robustness despite halving model size, but others fail to maintain cross-task stability, exposing design-sensitive trade-offs that aggregate accuracy alone does not reveal.
[34] Do LLMs Signal When They're Right? Evidence from Neuron Agreement
Kang Chen,Yaoning Wang,Kai Xiong,Zhuoka Feng,Wenhe Sun,Haotian Chen,Yixin Cao
Main category: cs.CL
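TL;DR: Proposes Neuron Agreement Decoding (NAD), an unsupervised decoding method based on neuron activations that uses internal signals for efficient, accurate label-free ensemble reasoning; it performs strongly on math, science, and coding tasks and can greatly reduce computational cost.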
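As a rough picture of selecting a candidate from internal signals alone, the sketch below scores each sampled response by how strongly its set of activated neurons agrees with the other samples, breaking ties in favor of sparser activation. Everything here is an assumption made for illustration: representing activations as sets of integer neuron IDs, using Jaccard overlap as the agreement measure, and the names `jaccard` and `select_by_agreement`. The paper's actual scoring rule is not given in this listing.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of activated neuron IDs."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def select_by_agreement(activated):
    """Return the index of the candidate that agrees most with the others.

    activated: list of sets, one set of neuron IDs per sampled response.
    Ties are broken in favor of the candidate with fewer activated neurons.
    """
    def score(i):
        others = [jaccard(activated[i], s)
                  for j, s in enumerate(activated) if j != i]
        agreement = sum(others) / len(others) if others else 0.0
        return (agreement, -len(activated[i]))

    return max(range(len(activated)), key=score)


# Hypothetical activations for three sampled responses.
candidates = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8, 9, 10, 11}]
best = select_by_agreement(candidates)
```

No text outputs are compared at any point, which is what lets this style of selection apply to open-ended tasks where majority voting over answers is not possible.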
Details
Motivation: Existing reasoning-enhancement methods that score external outputs can be poorly calibrated after post-training and make little use of a model's internal behavior, so a more reliable and efficient label-free ensemble decoding strategy is needed.
Method: Analyze neuron activations during LLM generation, finding that correct responses activate fewer unique neurons and show stronger cross-sample agreement; on this basis propose NAD, which selects the best candidate using the internal signals of activation sparsity and neuron agreement.
Result: NAD matches majority voting on math and science benchmarks and outperforms Avg@64 on open-ended coding tasks; it can predict correctness within the first 32 generated tokens and supports aggressive early stopping, reducing token usage by 99%.
Conclusion: Internal neuron activation signals provide reliable, scalable, and efficient guidance for label-free ensemble decoding, and NAD offers a new route to more efficient and practical LLM reasoning.
Abstract: Large language models (LLMs) commonly boost reasoning via sample-evaluate-ensemble decoders, achieving label-free gains without ground truth. However, prevailing strategies score candidates using only external outputs such as token probabilities, entropies, or self-evaluations, and these signals can be poorly calibrated after post-training. We instead analyze internal behavior based on neuron activations and uncover three findings: (1) external signals are low-dimensional projections of richer internal dynamics; (2) correct responses activate substantially fewer unique neurons than incorrect ones throughout generation; and (3) activations from correct responses exhibit stronger cross-sample agreement, whereas incorrect ones diverge. Motivated by these observations, we propose Neuron Agreement Decoding (NAD), an unsupervised best-of-N method that selects candidates using activation sparsity and cross-sample neuron agreement, operating solely on internal signals and without requiring comparable textual outputs. NAD enables early correctness prediction within the first 32 generated tokens and supports aggressive early stopping. Across math and science benchmarks with verifiable answers, NAD matches majority voting; on open-ended coding benchmarks where majority voting is inapplicable, NAD consistently outperforms Avg@64. By pruning unpromising trajectories early, NAD reduces token usage by 99% with minimal loss in generation quality, showing that internal signals provide reliable, scalable, and efficient guidance for label-free ensemble decoding.
[35] Unravelling the Mechanisms of Manipulating Numbers in Language Models
Michal Štefánik,Timothee Mickus,Marek Kadlčík,Bertram Højer,Michal Spiegel,Raúl Vázquez,Aman Sinha,Josef Kuchař,Philipp Mondorf
Main category: cs.CL
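TL;DR: Examines how LLMs represent numbers internally and finds that, despite output errors, different models learn systematic, highly accurate, and universal number representations, and that universal probes can trace errors to their source.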
Details
Motivation: To explain why LLMs frequently make number-related errors even though their embedding representations of numbers are accurate.
Method: Analyze the internal number representations of several LLMs and build universal probes to trace information flow and quantify lower bounds on the accuracy of the underlying mechanisms.
Result: Different models learn interchangeable, systematic number representations that are universal across contexts; universal probes can be built for each model and used to locate the specific layers responsible for output errors.
Conclusion: Pre-trained LLMs have highly consistent and accurate internal mechanisms for manipulating numbers, and more accurate probing techniques may in future help refine model architectures to reduce numerical errors.
Abstract: Recent work has shown that different large language models (LLMs) converge to similar and accurate input embedding representations for numbers. These findings conflict with the documented propensity of LLMs to produce erroneous outputs when dealing with numeric information. In this work, we aim to explain this conflict by exploring how language models manipulate numbers and quantify the lower bounds of accuracy of these mechanisms. We find that despite surfacing errors, different language models learn interchangeable representations of numbers that are systematic, highly accurate and universal across their hidden states and the types of input contexts. This allows us to create universal probes for each LLM and to trace information -- including the causes of output errors -- to specific layers. Our results lay a fundamental understanding of how pre-trained LLMs manipulate numbers and outline the potential of more accurate probing techniques in addressed refinements of LLMs' architectures.
[36] Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games
Jingran Zhang,Ning Li,Justin Cui
Main category: cs.CL
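TL;DR: This study evaluates the web interaction capabilities of OpenAI's ChatGPT Atlas in browser games. Atlas excels at logical reasoning tasks such as Sudoku but performs poorly in real-time games that demand precise timing and control.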
Details
Motivation: Examine how Atlas performs in dynamic, interactive web environments, in particular its capacity for real-time interaction beyond information retrieval.
Method: Use browser games (Google's T-Rex Runner, Sudoku, Flappy Bird, and Stein.world) as test scenarios, with in-game scores as the quantitative metric.
Result: Atlas clearly outperforms human baselines on logical tasks such as Sudoku, but in real-time games such as Flappy Bird it struggles to get past the initial obstacles.
Conclusion: Atlas has strong analytical processing abilities but still shows clear limitations in dynamic web environments that require real-time interaction and fine-grained control.
Abstract: OpenAI's ChatGPT Atlas introduces new capabilities for web interaction, enabling the model to analyze webpages, process user intents, and execute cursor and keyboard inputs directly within the browser. While its capacity for information retrieval tasks has been demonstrated, its performance in dynamic, interactive environments remains less explored. In this study, we conduct an early evaluation of Atlas's web interaction capabilities using browser-based games as test scenarios, including Google's T-Rex Runner, Sudoku, Flappy Bird, and Stein.world. We employ in-game performance scores as quantitative metrics to assess performance across different task types. Our results show that Atlas performs strongly in logical reasoning tasks like Sudoku, completing puzzles significantly faster than human baselines, but struggles substantially in real-time games requiring precise timing and motor control, often failing to progress beyond initial obstacles. These findings suggest that while Atlas demonstrates capable analytical processing, there remain notable limitations in dynamic web environments requiring real-time interaction. The website of our project can be found at https://atlas-game-eval.github.io.
[37] SCRIBE: Structured Chain Reasoning for Interactive Behaviour Explanations using Tool Calling
Fares Fawzi,Vinitra Swamy,Dominik Glandorf,Tanya Nazaretsky,Tanja Käser
Main category: cs.CL
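TL;DR: SCRIBE is a multi-hop, tool-augmented reasoning framework for generating student feedback. It runs locally on small open-source models and, in privacy-sensitive and resource-constrained educational settings, performs on par with much larger models.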
Details
Motivation: Using language models for personalized feedback in education faces challenges of privacy, compute, and pedagogical validity, which calls for small open-source models that can run locally.
Method: Propose the SCRIBE framework, which combines domain-specific tools with a self-reflective inference pipeline, and distil it into 3B and 8B models via two-stage LoRA fine-tuning on synthetic data.
Result: The 8B-SCRIBE model matches or outperforms much larger models on relevance and actionability, and students in the user study rated it on par with GPT-4o and Llama-3.3 70B.
Conclusion: SCRIBE can be deployed effectively in low-resource, privacy-sensitive educational applications and is practically viable.
Abstract: Language models can be used to provide interactive, personalized student feedback in educational settings. However, real-world deployment faces three key challenges: privacy concerns, limited computational resources, and the need for pedagogically valid responses. These constraints require small, open-source models that can run locally and reliably ground their outputs in correct information. We introduce SCRIBE, a framework for multi-hop, tool-augmented reasoning designed to generate valid responses to student questions about feedback reports. SCRIBE combines domain-specific tools with a self-reflective inference pipeline that supports iterative reasoning, tool use, and error recovery. We distil these capabilities into 3B and 8B models via two-stage LoRA fine-tuning on synthetic GPT-4o-generated data. Evaluation with a human-aligned GPT-Judge and a user study with 108 students shows that 8B-SCRIBE models achieve comparable or superior quality to much larger models in key dimensions such as relevance and actionability, while being perceived on par with GPT-4o and Llama-3.3 70B by students. These findings demonstrate the viability of SCRIBE for low-resource, privacy-sensitive educational applications.
[38] From Amateur to Master: Infusing Knowledge into LLMs via Automated Curriculum Learning
Nishit Neema,Srinjoy Mukherjee,Sapan Shah,Gokul Ramakrishnan,Ganesh Venkatesh
Main category: cs.CL
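TL;DR: ACER is an automated curriculum-enhanced method that turns generalist LLMs into domain experts. It performs continual pretraining on question-answer pairs generated from a textbook-style structure and Bloom's taxonomy, substantially improving specialized-domain performance without sacrificing general capabilities.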
Details
Motivation: Large language models perform well on general tasks but fall short in specialized domains that require deep, principled understanding (such as economics and psychology), so an effective way to close this domain gap is needed.
Method: Propose the ACER framework. It first generates a table of contents for a subject, then creates question-answer pairs level by level following Bloom's taxonomy to build a textbook-style curriculum. The model is then continually pretrained with an interleaved curriculum schedule that covers both content and cognitive levels systematically.
Result: Experiments on Llama 3.2 (1B and 3B) show an average gain of 3 percentage points on specialized MMLU subsets and up to 5 percentage points in difficult domains such as microeconomics. Non-target domains improve by 0.7 points, knowledge-intensive benchmarks such as ARC and GPQA improve by more than 2 points, and general reasoning performance stays stable.
Conclusion: ACER offers a scalable and efficient way to systematically strengthen LLMs in specialized domains without forgetting general knowledge, and it yields positive cross-domain knowledge transfer.
Abstract: Large Language Models (LLMs) excel at general tasks but underperform in specialized domains like economics and psychology, which require deep, principled understanding. To address this, we introduce ACER (Automated Curriculum-Enhanced Regimen) that transforms generalist models into domain experts without sacrificing their broad capabilities. ACER first synthesizes a comprehensive, textbook-style curriculum by generating a table of contents for a subject and then creating question-answer (QA) pairs guided by Bloom's taxonomy. This ensures systematic topic coverage and progressively increasing difficulty. The resulting synthetic corpus is used for continual pretraining with an interleaved curriculum schedule, aligning learning across both content and cognitive dimensions. Experiments with Llama 3.2 (1B and 3B) show significant gains in specialized MMLU subsets. In challenging domains like microeconomics, where baselines struggle, ACER boosts accuracy by 5 percentage points. Across all target domains, we observe a consistent macro-average improvement of 3 percentage points. Notably, ACER not only prevents catastrophic forgetting but also facilitates positive cross-domain knowledge transfer, improving performance on non-target domains by 0.7 points. Beyond MMLU, ACER enhances performance on knowledge-intensive benchmarks like ARC and GPQA by over 2 absolute points, while maintaining stable performance on general reasoning tasks. Our results demonstrate that ACER offers a scalable and effective recipe for closing critical domain gaps in LLMs.
[39] MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data
Mykhailo Poliakov,Nadiya Shvai
Main category: cs.CL
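TL;DR: The paper proposes MisSynth, a pipeline that uses retrieval-augmented generation (RAG) to produce synthetic fallacy samples for fine-tuning large language models (LLMs), improving their ability to recognize fallacious arguments in scientific misinformation. On the MISSCI dataset, fine-tuned models improve substantially over their base versions; for example, the LLaMA 3.1 8B model gains more than 35% in absolute F1 score.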
Details
Motivation: Health-related misinformation is widespread and potentially harmful, and it is especially hard to identify when it distorts or misinterprets scientific findings. Effective methods are therefore needed to improve models' ability to recognize such fallacious arguments.
Method: Propose the MisSynth pipeline, which uses retrieval-augmented generation (RAG) to produce synthetic fallacy samples and then applies lightweight fine-tuning to an LLM on those samples.
Result: On the MISSCI test split, the fine-tuned LLaMA 3.1 8B model improves its F1 score by more than 35% (absolute) over the vanilla model, indicating that synthetic data can substantially improve zero-shot classification performance.
Conclusion: Augmenting limited annotated resources with synthetic data can substantially improve LLM performance on real-world scientific misinformation tasks while requiring little compute, which makes the approach practical.
Abstract: Health-related misinformation is very prevalent and potentially harmful. It is difficult to identify, especially when claims distort or misinterpret scientific findings. We investigate the impact of synthetic data generation and lightweight fine-tuning techniques on the ability of large language models (LLMs) to recognize fallacious arguments using the MISSCI dataset and framework. In this work, we propose MisSynth, a pipeline that applies retrieval-augmented generation (RAG) to produce synthetic fallacy samples, which are then used to fine-tune an LLM. Our results show substantial accuracy gains with fine-tuned models compared to vanilla baselines. For instance, the LLaMA 3.1 8B fine-tuned model achieved an over 35% F1-score absolute improvement on the MISSCI test split over its vanilla baseline. We demonstrate that introducing synthetic fallacy data to augment limited annotated resources can significantly enhance zero-shot LLM classification performance on real-world scientific misinformation tasks, even with limited computational resources. The code and synthetic dataset are available at https://github.com/mxpoliakov/MisSynth.
[40] The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration
Kotaro Furuya,Yuichi Kitagawa
Main category: cs.CL
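TL;DR: The paper proposes an interaction-centric framework for automatic team composition. It builds a language model graph and applies community detection to discover synergistic model clusters, enabling effective multi-agent collaboration without prior knowledge of the models.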
Details
Motivation: Because the internal characteristics of most LLMs are opaque, it is hard to form optimal multi-agent teams, so an automatic team composition method that needs no prior knowledge is required.
Method: Build a "language model graph" that maps relationships between models from the semantic coherence of their pairwise conversations, then apply community detection to identify synergistic model clusters.
Result: The method discovers functionally coherent groups of models that reflect their latent specializations. On downstream benchmarks the resulting teams outperform random baselines and approach the accuracy of manually curated teams built on known specializations.
Conclusion: The work provides a new basis for the automated design of collaborative multi-agent LLM teams.
Abstract: While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition. However, forming optimal teams is a significant challenge, as the inherent opacity of most models obscures the internal characteristics necessary for effective collaboration. In this paper, we propose an interaction-centric framework for automatic team composition that does not require any prior knowledge of the models, including their internal architectures, training data, or task performance. Our method constructs a "language model graph" that maps relationships between models from the semantic coherence of pairwise conversations, and then applies community detection to identify synergistic model clusters. Our experiments with diverse LLMs demonstrate that the proposed method discovers functionally coherent groups that reflect their latent specializations. Priming conversations with specific topics identified synergistic teams which outperform random baselines on downstream benchmarks and achieve comparable accuracy to that of manually-curated teams based on known model specializations. Our findings provide a new basis for the automated design of collaborative multi-agent LLM teams.
[41] On the Role of Context for Discourse Relation Classification in Scientific Writing
Stephen Wan,Wei Liu,Michael Strube
Main category: cs.CL
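TL;DR: The paper studies the task of inferring discourse structure in scientific writing. It presents a preliminary investigation of pretrained language models (PLMs) and large language models (LLMs) for discourse relation classification (DRC) in scientific publications, finds that context is generally helpful for DRC, and analyzes which scientific discourse relation types benefit most from context.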
Details
Motivation: As generative AI is increasingly used in science workflows, discourse-level information is needed to find supporting evidence for AI-generated scientific claims, which makes inferring discourse structure in scientific writing an important task.
Method: Apply pretrained language models (PLMs) and large language models (LLMs) to discourse relation classification (DRC), focusing on the role of context (as defined by discourse structure), and evaluate experimentally how much different scientific discourse relation types depend on context.
Result: Context generally improves DRC performance, and certain scientific discourse relation types benefit from it more than others.
Conclusion: Context plays a positive role in discourse relation classification for scientific text. Future work could make further use of discourse structure to improve the credibility and explainability of AI-generated scientific claims.
Abstract: With the increasing use of generative Artificial Intelligence (AI) methods to support science workflows, we are interested in the use of discourse-level information to find supporting evidence for AI generated scientific claims. A first step towards this objective is to examine the task of inferring discourse structure in scientific writing. In this work, we present a preliminary investigation of pretrained language model (PLM) and Large Language Model (LLM) approaches for Discourse Relation Classification (DRC), focusing on scientific publications, an under-studied genre for this task. We examine how context can help with the DRC task, with our experiments showing that context, as defined by discourse structure, is generally helpful. We also present an analysis of which scientific discourse relation types might benefit most from context.
[42] OmniEduBench: A Comprehensive Chinese Benchmark for Evaluating Large Language Models in Education
Min Zhang,Hao Chen,Hao Chen,Wenqi Zhang,Didi Zhu,Xin Lin,Bo Jiang,Aimin Zhou,Fei Wu,Kun Kuang
Main category: cs.CL
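TL;DR: The paper introduces OmniEduBench, a comprehensive Chinese educational benchmark with 24.602K high-quality question-answer pairs spanning two dimensions (knowledge and cultivation) and 61 subjects. It aims to address gaps in how existing benchmarks evaluate LLMs' educational capabilities. Experiments show that current LLMs still lag well behind humans on cultivation capabilities.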
Details
Motivation: Existing LLMs and their benchmarks focus mostly on knowledge and neglect the cultivation capabilities that matter in real educational scenarios. They are also often limited to a single subject or question type and lack diversity, a problem that is especially pronounced in the Chinese educational context.
Method: Build OmniEduBench, a Chinese educational benchmark with 24.602K high-quality question-answer pairs split into a knowledge dimension (18.121K) and a cultivation dimension (6.481K). Each dimension is subdivided into 6 categories, covering 61 subjects and 11 common question types. Eleven mainstream LLMs are evaluated extensively.
Result: In the knowledge dimension only Gemini-2.5 Pro exceeds 60% accuracy, while in the cultivation dimension the best-performing model, QWQ, still trails humans by nearly 30%, indicating clear shortcomings in educational applications, especially cultivation capabilities.
Conclusion: OmniEduBench provides a reliable benchmark for evaluating the overall capabilities of LLMs in Chinese educational scenarios, exposes a substantial weakness in cultivation capabilities, and points to directions and challenges for future improvement.
Abstract: With the rapid development of large language models (LLMs), various LLM-based works have been widely applied in educational fields. However, most existing LLMs and their benchmarks focus primarily on the knowledge dimension, largely neglecting the evaluation of cultivation capabilities that are essential for real-world educational scenarios. Additionally, current benchmarks are often limited to a single subject or question type, lacking sufficient diversity. This issue is particularly prominent within the Chinese context. To address this gap, we introduce OmniEduBench, a comprehensive Chinese educational benchmark. OmniEduBench consists of 24.602K high-quality question-answer pairs. The data is meticulously divided into two core dimensions: the knowledge dimension and the cultivation dimension, which contain 18.121K and 6.481K entries, respectively. Each dimension is further subdivided into 6 fine-grained categories, covering a total of 61 different subjects (41 in the knowledge and 20 in the cultivation). Furthermore, the dataset features a rich variety of question formats, including 11 common exam question types, providing a solid foundation for comprehensively evaluating LLMs' capabilities in education. Extensive experiments on 11 mainstream open-source and closed-source LLMs reveal a clear performance gap. In the knowledge dimension, only Gemini-2.5 Pro surpassed 60% accuracy, while in the cultivation dimension, the best-performing model, QWQ, still trailed human intelligence by nearly 30%. These results highlight the substantial room for improvement and underscore the challenges of applying LLMs in education.
[43] 1+1>2: A Synergistic Sparse and Low-Rank Compression Method for Large Language Models
Zeliang Zong,Kai Zhang,Zheyang Li,Wenming Tan,Ye Ren,Yiyan Zhai,Jilin Hu
Main category: cs.CL
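TL;DR: The paper proposes SSLC, a synergistic sparse and low-rank compression method for efficient LLM deployment that substantially compresses models and speeds up inference without any additional training.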
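The idea of combining a low-rank term with a sparse term can be illustrated with a generic alternating scheme. The sketch below is not the SSLC algorithm from the paper, whose unified formulation is not given in this digest. It assumes a plain objective (approximate `W` by a rank-1 matrix plus a `k`-sparse matrix) and alternates a power-iteration rank-1 fit with top-`k` magnitude selection. All function names and the toy matrix are hypothetical.

```python
import math

Matrix = list[list[float]]


def subtract(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise difference a - b."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def frobenius(a: Matrix) -> float:
    """Frobenius norm of a matrix."""
    return math.sqrt(sum(x * x for row in a for x in row))


def rank1_approx(m: Matrix, iters: int = 100) -> Matrix:
    """Approximate the best rank-1 fit of m via power iteration on m^T m."""
    rows, cols = len(m), len(m[0])
    v = [1.0] * cols  # fixed start vector keeps the sketch deterministic
    for _ in range(iters):
        u = [sum(m[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(m[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:  # start vector fell into the null space; return zeros
            return [[0.0] * cols for _ in range(rows)]
        v = [x / norm for x in v]
    u = [sum(m[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return [[ui * vj for vj in v] for ui in u]


def keep_top_k(m: Matrix, k: int) -> Matrix:
    """Keep the k largest-magnitude entries of m and zero out the rest."""
    ranked = sorted(
        ((abs(x), i, j) for i, row in enumerate(m) for j, x in enumerate(row)),
        reverse=True,
    )
    out = [[0.0] * len(m[0]) for _ in m]
    for _, i, j in ranked[:k]:
        out[i][j] = m[i][j]
    return out


def sparse_plus_low_rank(w: Matrix, k: int, rounds: int = 10):
    """Alternate between the low-rank fit and the sparse fit of the residual."""
    sparse = [[0.0] * len(w[0]) for _ in w]
    low_rank = rank1_approx(w)
    for _ in range(rounds):
        low_rank = rank1_approx(subtract(w, sparse))
        sparse = keep_top_k(subtract(w, low_rank), k)
    return low_rank, sparse


# Toy weight matrix: a rank-1 pattern with one large outlier at position (0, 0).
W = [[11.0, 1.0, 2.0], [2.0, 2.0, 4.0], [3.0, 3.0, 6.0]]
low_rank, sparse = sparse_plus_low_rank(W, k=1)
combined_error = frobenius(subtract(subtract(W, low_rank), sparse))
low_rank_only_error = frobenius(subtract(W, rank1_approx(W)))
print(combined_error <= low_rank_only_error + 1e-6)
```

Each step of the alternation cannot increase the reconstruction error under this toy objective, so the combined fit is never worse than the rank-1 fit alone. That is the sense in which the two components complement each other here; the gains reported in the paper come from its own formulation and are not reproduced by this sketch.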
Details
Motivation: LLM adoption is constrained by high bandwidth and compute demands. Existing compression methods such as pruning and low-rank approximation are mostly used on their own, and their synergy has not been fully explored.
Method: Formulate low-rank approximation and sparse optimization as a single unified problem and solve it jointly with an iterative optimization algorithm to compress the model.
Result: Validated on LLaMA and Qwen2.5 models, SSLC compresses Qwen2.5 by 50% with no performance drop and achieves at least a 1.63× speedup.
Conclusion: SSLC effectively combines the strengths of sparsity and low-rank structure, improving compression efficiency and inference speed and offering a practical solution for efficient LLM deployment.
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in language comprehension and generation; however, their widespread adoption is constrained by substantial bandwidth and computational demands. While pruning and low-rank approximation have each demonstrated promising performance individually, their synergy for LLMs remains underexplored. We introduce Synergistic Sparse and Low-Rank Compression (SSLC) for LLMs, which leverages the strengths of both techniques: low-rank approximation compresses the model by retaining its essential structure with minimal information loss, whereas sparse optimization eliminates non-essential weights, preserving those crucial for generalization. Based on theoretical analysis, we first formulate the low-rank approximation and sparse optimization as a unified problem and solve it with an iterative optimization algorithm. Experiments on LLaMA and Qwen2.5 models (7B-70B) show that SSLC, without any additional training steps, consistently surpasses standalone methods, achieving state-of-the-art results. Notably, SSLC compresses Qwen2.5 by 50% with no performance drop and achieves at least 1.63× speedup, offering a practical solution for efficient LLM deployment.
[44] Bayesian Network Fusion of Large Language Models for Sentiment Analysis
Rasoul Amirzadeh,Dhananjay Thiruvady,Fatemeh Shiri
Main category: cs.CL
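TL;DR: The paper proposes the Bayesian network LLM fusion (BNLF) framework, which combines the predictions of several domain-specific language models through a probabilistic mechanism for sentiment analysis. On three financial corpora it improves accuracy by about 6% over the baseline models.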
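Late fusion of several classifiers' labels through a probabilistic model can be sketched with the simplest possible Bayesian network: one hidden node for the true sentiment and one observed node per model, each conditionally independent given the true label. This naive-Bayes structure is an assumption made for illustration, since the digest does not describe the network structure BNLF actually uses. The accuracy values and the `diagonal_cpt` and `fuse` helpers below are likewise hypothetical.

```python
LABELS = ("negative", "neutral", "positive")


def diagonal_cpt(accuracy: float) -> dict:
    """Build P(predicted | true) assuming errors are spread evenly over wrong labels."""
    wrong = (1.0 - accuracy) / (len(LABELS) - 1)
    return {
        true: {pred: (accuracy if pred == true else wrong) for pred in LABELS}
        for true in LABELS
    }


def fuse(predictions: dict, cpts: dict, prior: dict) -> dict:
    """Posterior over the true label given each model's predicted label."""
    scores = {}
    for true in LABELS:
        p = prior[true]
        for model, pred in predictions.items():
            p *= cpts[model][true][pred]  # conditional independence assumption
        scores[true] = p
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


# Hypothetical per-model accuracies, not figures from the paper.
cpts = {
    "FinBERT": diagonal_cpt(0.8),
    "RoBERTa": diagonal_cpt(0.7),
    "BERTweet": diagonal_cpt(0.6),
}
prior = {label: 1.0 / len(LABELS) for label in LABELS}

# Two models say "positive", one says "negative".
posterior = fuse(
    {"FinBERT": "positive", "RoBERTa": "positive", "BERTweet": "negative"},
    cpts,
    prior,
)
best = max(posterior, key=posterior.get)
print(best, posterior)
```

Unlike a plain majority vote, the posterior weights each model by how reliable it is assumed to be, and the conditional probability tables can be inspected directly, which is one route to the interpretability the entry emphasizes.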
Details
Motivation: Address the problems of existing LLMs in transparency, explainability, fine-tuning cost, prompt engineering effort, inconsistent cross-domain performance, and the environmental impact of their high computational demands.
Method: Use a Bayesian network to perform late fusion of the sentiment predictions of three language models (FinBERT, RoBERTa, and BERTweet), modelling each model's output as a probabilistic node in the network.
Result: On three human-annotated financial corpora with distinct linguistic and contextual characteristics, BNLF improves accuracy by about 6% over the individual models, showing robustness to dataset variability and interpretable sentiment classification.
Conclusion: By probabilistically fusing the predictions of several language models, BNLF improves the accuracy and interpretability of sentiment analysis while reducing reliance on any single model and the associated environmental impact.
Abstract: Large language models (LLMs) continue to advance, with an increasing number of domain-specific variants tailored for specialised tasks. However, these models often lack transparency and explainability, can be costly to fine-tune, require substantial prompt engineering, yield inconsistent results across domains, and impose significant adverse environmental impact due to their high computational demands. To address these challenges, we propose the Bayesian network LLM fusion (BNLF) framework, which integrates predictions from three LLMs, including FinBERT, RoBERTa, and BERTweet, through a probabilistic mechanism for sentiment analysis. BNLF performs late fusion by modelling the sentiment predictions from multiple LLMs as probabilistic nodes within a Bayesian network. Evaluated across three human-annotated financial corpora with distinct linguistic and contextual characteristics, BNLF demonstrates consistent gains of about six percent in accuracy over the baseline LLMs, underscoring its robustness to dataset variability and the effectiveness of probabilistic fusion for interpretable sentiment classification.
[45] A Multi-agent Large Language Model Framework to Automatically Assess Performance of a Clinical AI Triage Tool
Adam E. Flanders,Yifan Peng,Luciano Prevedello,Robyn Ball,Errol Colak,Prahlad Menon,George Shih,Hui-Ming Lin,Paras Lakhani
Main category: cs.CL
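TL;DR: This study asks whether an ensemble of multiple large language models (LLMs) can assess a pixel-based AI triage tool more reliably than a single LLM. Ensembles of open-source LLMs gave more stable and reliable assessments of an intracranial hemorrhage AI detection tool. Llama3.3:70b and GPT-4o were the strongest individual models, and multi-model ensembles outperformed a single model on metrics such as MCC.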
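The Matthews correlation coefficient (MCC) used to compare the ensembles is computed from the four confusion-matrix counts as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The minimal sketch below pairs that formula with a simple majority vote over binary labels. The labels are toy values invented for illustration, and the helper names are hypothetical; the study's actual consensus rules may differ.

```python
import math


def majority_vote(votes: list[int]) -> int:
    """Return 1 if more than half of the binary votes are 1, else 0."""
    return int(sum(votes) * 2 > len(votes))


def confusion_counts(pred: list[int], truth: list[int]) -> tuple[int, int, int, int]:
    """Count (tp, tn, fp, fn) for binary predictions against reference labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, tn, fp, fn


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels (1 = hemorrhage reported, 0 = not reported); not data from the study.
truth = [1, 1, 1, 0, 0, 0]
model_a = [1, 1, 0, 0, 0, 1]
model_b = [1, 0, 1, 0, 1, 0]
model_c = [0, 1, 1, 0, 0, 0]

ensemble = [majority_vote(list(v)) for v in zip(model_a, model_b, model_c)]
mcc_single = mcc(*confusion_counts(model_a, truth))
mcc_ensemble = mcc(*confusion_counts(ensemble, truth))
print(mcc_single, mcc_ensemble)
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance than accuracy. In this toy data each model makes errors on different reports, so the vote cancels them out; real ensembles only help to the extent that model errors are not strongly correlated.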
Details
Motivation: Improve the accuracy and reliability of retrospective evaluations of clinical AI triage tools by exploring the advantages of a multi-LLM ensemble over a single LLM.
Method: 29,766 non-contrast head CT exams from 14 hospitals were processed by a commercial AI tool for detecting intracranial hemorrhage (ICH). The radiology reports were analyzed by eight open-source LLMs and a HIPAA-compliant version of GPT-4o, using a single multi-shot prompt to decide whether ICH was present. The agreement of the individual models and ensemble strategies with GPT-4o was compared and their performance evaluated.
Result: Llama3.3:70b and GPT-4o had the highest AUC (0.78) and the highest average precision (0.75 and 0.76, respectively). Llama3.3:70b performed best on F1 score (0.81), recall (0.85), precision (0.78), specificity (0.72), and MCC (0.57). On MCC, the Full-9 ensemble, the Top-3 ensemble, and the Consensus ensemble all outperformed GPT-4o, with no significant differences among the three.
Conclusion: Ensembles of medium to large open-source LLMs give a more consistent and reliable retrospective evaluation of clinical AI triage tools than a single LLM.
Abstract: Purpose: The purpose of this study was to determine if an ensemble of multiple LLM agents could be used collectively to provide a more reliable assessment of a pixel-based AI triage tool than a single LLM. Methods: 29,766 non-contrast CT head exams from fourteen hospitals were processed by a commercial intracranial hemorrhage (ICH) AI detection tool. Radiology reports were analyzed by an ensemble of eight open-source LLM models and a HIPAA compliant internal version of GPT-4o using a single multi-shot prompt that assessed for presence of ICH. 1,726 examples were manually reviewed. Performance characteristics of the eight open-source models and consensus were compared to GPT-4o. Three ideal consensus LLM ensembles were tested for rating the performance of the triage tool. Results: The cohort consisted of 29,766 head CT exam-report pairs. The highest AUC performance was achieved with Llama3.3:70b and GPT-4o (AUC = 0.78). The average precision was highest for Llama3.3:70b and GPT-4o (AP = 0.75 and 0.76). Llama3.3:70b had the highest F1 score (0.81) and recall (0.85), greater precision (0.78), specificity (0.72), and MCC (0.57). Using MCC (95% CI), the ideal combinations of LLMs were: Full-9 Ensemble 0.571 (0.552-0.591), Top-3 Ensemble 0.558 (0.537-0.579), Consensus 0.556 (0.539-0.574), and GPT-4o 0.522 (0.500-0.543). No statistically significant differences were observed between Top-3, Full-9, and Consensus (p > 0.05). Conclusion: An ensemble of medium to large sized open-source LLMs provides a more consistent and reliable method to derive a ground truth retrospective evaluation of a clinical AI triage tool over a single LLM alone.
[46] Inside CORE-KG: Evaluating Structured Prompting and Coreference Resolution for Knowledge Graphs
Dipak Meher,Carlotta Domeniconi
Main category: cs.CL
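TL;DR: The paper presents a systematic ablation study of the CORE-KG framework, measuring how much each of its core components contributes to reducing node duplication and noise in knowledge graph construction.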
Details
Motivation: Existing LLM-based approaches still produce duplicate nodes and noise when building knowledge graphs from complex legal texts, because they lack effective coreference resolution and guided extraction.
Method: Run ablation experiments to isolate the effect of CORE-KG's two key components: the type-aware coreference module and the domain-guided structured prompts.
Result: Removing coreference resolution increases node duplication by 28.32% and noisy nodes by 4.32%. Removing structured prompts increases node duplication by 4.34% and noisy nodes by 73.33%.
Conclusion: Structured prompts matter most for reducing noise, while coreference resolution mainly reduces node duplication. Together they substantially improve the accuracy and robustness of knowledge extraction from legal texts.
Abstract: Human smuggling networks are increasingly adaptive and difficult to analyze. Legal case documents offer critical insights but are often unstructured, lexically dense, and filled with ambiguous or shifting references, which pose significant challenges for automated knowledge graph (KG) construction. While recent LLM-based approaches improve over static templates, they still generate noisy, fragmented graphs with duplicate nodes due to the absence of guided extraction and coreference resolution. The recently proposed CORE-KG framework addresses these limitations by integrating a type-aware coreference module and domain-guided structured prompts, significantly reducing node duplication and legal noise. In this work, we present a systematic ablation study of CORE-KG to quantify the individual contributions of its two key components. Our results show that removing coreference resolution results in a 28.32% increase in node duplication and a 4.32% increase in noisy nodes, while removing structured prompts leads to a 4.34% increase in node duplication and a 73.33% increase in noisy nodes. These findings offer empirical insights for designing robust LLM-based pipelines for extracting structured representations from complex legal texts.
[47] Hebrew Diacritics Restoration using Visual Representation
Yair Elboher,Yuval Pinter
Main category: cs.CL
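TL;DR: The paper presents DIVRIT, a Hebrew diacritics restoration system built on a visual language model. It frames the task as zero-shot classification and achieves efficient, accurate diacritization without complex linguistic analysis.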
Details
Motivation: Undiacritized Hebrew is highly ambiguous, which affects pronunciation and meaning, so efficient automatic diacritics restoration is needed.
Method: DIVRIT works at the word level. It feeds undiacritized text as an image to a Hebrew visual language model, dynamically generates candidate diacritization patterns, and selects the most appropriate one given the surrounding context. The approach uses a zero-shot classification framing and does not rely on explicit linguistic rules.
Result: DIVRIT reaches high accuracy in the "oracle" setting, and architectural enhancements and training optimizations significantly improve generalization, supporting the usefulness of visual representations for Hebrew diacritization.
Conclusion: Visual language models offer a new approach to automatic Hebrew diacritization, and DIVRIT shows good performance and practical potential without relying on complex linguistic analysis.
Abstract: Diacritics restoration in Hebrew is a fundamental task for ensuring accurate word pronunciation and disambiguating textual meaning. Despite the language's high degree of ambiguity when unvocalized, recent machine learning approaches have significantly advanced performance on this task. In this work, we present DIVRIT, a novel system for Hebrew diacritization that frames the task as a zero-shot classification problem. Our approach operates at the word level, selecting the most appropriate diacritization pattern for each undiacritized word from a dynamically generated candidate set, conditioned on the surrounding textual context. A key innovation of DIVRIT is its use of a Hebrew Visual Language Model, which processes undiacritized text as an image, allowing diacritic information to be embedded directly within the input's vector representation. Through a comprehensive evaluation across various configurations, we demonstrate that the system effectively performs diacritization without relying on complex, explicit linguistic analysis. Notably, in an "oracle" setting where the correct diacritized form is guaranteed to be among the provided candidates, DIVRIT achieves a high level of accuracy. Furthermore, strategic architectural enhancements and optimized training methodologies yield significant improvements in the system's overall generalization capabilities. These findings highlight the promising potential of visual representations for accurate and automated Hebrew diacritization.
[48] The Structure of Relation Decoding Linear Operators in Large Language Models
Miranda Anna Christ,Adrián Csiszárik,Gergely Becsó,Dániel Varga
Main category: cs.CL
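TL;DR: The paper studies the structure of the linear operators, introduced by Hernandez et al. [2023], that decode specific relational facts in transformer language models. It extends the analysis from a single relation to many and finds that these relation decoders can be highly compressed by simple order-3 tensor networks with almost no loss in accuracy. Using a cross-evaluation protocol, the authors find that the operators do not encode specific relations but extract recurring, coarse-grained semantic properties (such as "country of X"), which explains both their compressibility and their generalization to semantically close relations. The paper concludes that linear relational decoding in transformers is primarily property-based rather than relation-specific.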
Details
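Motivation: Understand the underlying structure of the linear operators that decode relational facts in transformer language models, in particular how they are organized across multiple relations and where their redundancy comes from.
Method: Extend the single-relation analysis to a collection of relations, compress the relation decoders with simple order-3 tensor networks, and introduce a cross-evaluation protocol that applies each decoder to the subjects of every other relation to test whether it extracts a general semantic property rather than a specific relation.
Result: The relation decoders can be highly compressed without significant loss in decoding accuracy. Cross-evaluation shows that they extract coarse-grained semantic properties shared across relations (such as "country of X"), which explains both the compressibility and the limited generalization.
Conclusion: Linear relational decoding in transformers rests mainly on shared semantic properties rather than on individual relations, so it should be understood as a property-driven rather than a relation-specific mechanism.
Abstract: This paper investigates the structure of linear operators introduced in Hernandez et al. [2023] that decode specific relational facts in transformer language models. We extend their single-relation findings to a collection of relations and systematically chart their organization. We show that such collections of relation decoders can be highly compressed by simple order-3 tensor networks without significant loss in decoding accuracy. To explain this surprising redundancy, we develop a cross-evaluation protocol, in which we apply each linear decoder operator to the subjects of every other relation. Our results reveal that these linear maps do not encode distinct relations, but extract recurring, coarse-grained semantic properties (e.g., country of capital city and country of food are both in the country-of-X property). This property-centric structure clarifies both the operators' compressibility and highlights why they generalize only to new relations that are semantically close. Our findings thus interpret linear relational decoding in transformer language models as primarily property-based, rather than relation-specific.
[49] InfoFlow: Reinforcing Search Agent Via Reward Density Optimization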
Kun Luo,Hongjin Qian,Zheng Liu,Ziyi Xia,Shitao Xiao,Siqi Bao,Jun Zhao,Kang Liu
Main category: cs.CL
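TL;DR: The paper proposes the InfoFlow framework, which tackles low reward density in reinforcement learning through subproblem decomposition, failure-guided hints, and dual-agent refinement, substantially improving the performance of lightweight LLMs on agentic search tasks.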
Details
Motivation: In deep search scenarios, exploration is costly and final rewards are rare, so reward density is low, which limits the use of reinforcement learning.
Method: Propose the InfoFlow framework, which optimizes reward density from three angles: subproblem decomposition, failure-guided hints, and dual-agent refinement.
Result: InfoFlow significantly outperforms strong baselines on multiple agentic search benchmarks and enables lightweight LLMs to reach performance comparable to advanced proprietary models.
Conclusion: InfoFlow effectively addresses reward sparsity in deep search, improving exploration efficiency and reward density and offering a viable path to efficient search with low-cost models.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach for enhancing agentic deep search. However, its application is often hindered by low Reward Density in deep search scenarios, where agents expend significant exploratory costs for infrequent and often null final rewards. In this paper, we formalize this challenge as the Reward Density Optimization problem, which aims to improve the reward obtained per unit of exploration cost. This paper introduces InfoFlow, a systematic framework that tackles this problem from three aspects. 1) Subproblem decomposition: breaking down long-range tasks to assign process rewards, thereby providing denser learning signals. 2) Failure-guided hints: injecting corrective guidance into stalled trajectories to increase the probability of successful outcomes. 3) Dual-agent refinement: employing a dual-agent architecture to offload the cognitive burden of deep exploration. A refiner agent synthesizes the search history, which effectively compresses the researcher's perceived trajectory, thereby reducing exploration cost and increasing the overall reward density. We evaluate InfoFlow on multiple agentic search benchmarks, where it significantly outperforms strong baselines, enabling lightweight LLMs to achieve performance comparable to advanced proprietary LLMs.
[50] Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Yinrong Hong,Zhiquan Tan,Kai Hu
Main category: cs.CL
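TL;DR: The paper proposes CAST, a new dynamic tree decoding method that optimizes inference by accounting for system variables such as GPU configuration and batch size, substantially speeding up LLM decoding.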
Details
Motivation: LLMs suffer significant inference latency because of their autoregressive design and large parameter counts, and existing speculative decoding methods do not fully account for system-level factors.
Method: Propose CAST, which dynamically adjusts the tree structure according to inference costs (such as GPU configuration and batch size) to generate and verify multiple tokens in parallel more efficiently.
Result: Experiments on six tasks and six different LLMs show that CAST is up to 5.2 times faster than conventional decoding and generally outperforms existing state-of-the-art methods by 5% to 20%.
Conclusion: By adjusting the dynamic tree structure in a system-aware way, CAST effectively improves LLM inference efficiency and has broad applicability.
Abstract: Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size. To address this, speculative decoding emerges as a solution, enabling the simultaneous generation and validation of multiple tokens. While recent approaches like EAGLE-2 and EAGLE-3 improve speculative decoding using dynamic tree structures, they often neglect the impact of crucial system variables such as GPU devices and batch sizes. Therefore, we introduce a new dynamic tree decoding approach called CAST that takes into account inference costs, including factors such as GPU configurations and batch sizes, to dynamically refine the tree structure. Through comprehensive experimentation across six diverse tasks and utilizing six distinct LLMs, our methodology demonstrates remarkable results, achieving speeds up to 5.2 times faster than conventional decoding methods. Moreover, it generally outperforms existing state-of-the-art techniques by 5% to 20%.
[51] SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding
Yiqiao Jin,Rachneet Kaur,Zhen Zeng,Sumitra Ganesh,Srijan Kumar
Main category: cs.CL
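TL;DR: The paper proposes SlideAgent, a general agentic framework for understanding multi-modal, multi-page, multi-layout documents, especially slide decks. By performing specialized reasoning at three levels (global, page, and element), it substantially improves the understanding of complex visual documents.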
Details
Motivation: Existing LLMs struggle with fine-grained reasoning across pages and elements in complex multi-page visual documents, so a more effective framework for document understanding is needed.
Method: SlideAgent uses specialized agents and decomposes reasoning into three levels (global, page, and element) to build a structured, query-agnostic representation. At inference time it selectively activates the relevant agents to produce coherent, context-aware answers.
Result: Experiments show that SlideAgent improves overall performance over both proprietary models (+7.9) and open-source models (+9.8).
Conclusion: Through its multi-level specialized agents, SlideAgent effectively improves the understanding of multi-page visual documents, especially in fine-grained reasoning and cross-page information integration.
Abstract: Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While large language models (LLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three specialized levels (global, page, and element) to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent achieves significant improvement over both proprietary (+7.9 overall) and open-source models (+9.8 overall).
[52] Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model
Biao Zhang,Yong Cheng,Siamak Shakeri,Xinyi Wang,Min Ma,Orhan Firat
Main category: cs.CL
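TL;DR: The paper revisits the encoder-decoder LLM (RedLLM). A systematic comparison with the now-dominant decoder-only LLM (DecLLM) across model scales shows that RedLLM has strong scaling properties, context-length extrapolation, and post-instruction-tuning performance, along with better inference efficiency.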
Details
Motivation: LLM architectures have shifted from encoder-decoder to decoder-only in recent years without a rigorous comparison from the scaling perspective, so the potential of encoder-decoder models may have been underestimated. The paper aims to fill this gap.
Method: Enhance the encoder-decoder model (RedLLM) with recent recipes from decoder-only models and pretrain it with prefix language modeling. Compare it systematically against decoder-only models pretrained with causal language modeling at scales from about 150M to about 8B, using RedPajama V1 for pretraining and FLAN for instruction tuning.
Result: Although decoder-only models are more compute-optimal during pretraining, encoder-decoder models show comparable scaling and context-length extrapolation capabilities. After instruction tuning, they achieve comparable or better results on a range of downstream tasks with substantially better inference efficiency.
Conclusion: Encoder-decoder LLMs have overlooked potential and deserve further research and development toward more powerful and efficient models.
Abstract: Recent large language model (LLM) research has undergone an architectural shift from encoder-decoder modeling to the now-dominant decoder-only modeling. This rapid transition, however, comes without a rigorous comparative analysis, especially from the scaling perspective, raising concerns that the potential of encoder-decoder models may have been overlooked. To fill this gap, we revisit encoder-decoder LLM (RedLLM), enhancing it with recent recipes from decoder-only LLM (DecLLM). We conduct a comprehensive comparison between RedLLM, pretrained with prefix language modeling (LM), and DecLLM, pretrained with causal LM, at different model scales, ranging from ~150M to ~8B. Using RedPajama V1 (1.6T tokens) for pretraining and FLAN for instruction tuning, our experiments show that RedLLM produces compelling scaling properties and surprisingly strong performance. While DecLLM is overall more compute-optimal during pretraining, RedLLM demonstrates comparable scaling and context length extrapolation capabilities. After instruction tuning, RedLLM achieves comparable and even better results on various downstream tasks while enjoying substantially better inference efficiency. We hope our findings could inspire more efforts on re-examining RedLLM, unlocking its potential for developing powerful and efficient LLMs.
[53] Evontree: Ontology Rule-Guided Self-Evolution of Large Language Models
Mingchen Tu,Zhiqiang Liu,Juan Li,Liangyurui Liu,Junjie Wang,Lei Liang,Wen Zhang
Main category: cs.CL
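TL;DR: This paper proposes Evontree, a new framework that uses a small set of high-quality ontology rules to systematically extract, validate, and enhance domain knowledge in large language models without large external datasets, significantly improving model performance on medical QA tasks.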
Details
Motivation: In data-sensitive domains such as healthcare, the lack of high-quality domain training data limits the use of LLMs. Domain experts, meanwhile, have distilled their knowledge into ontology rules, and how to combine these rules effectively with LLMs becomes the key question.
Method: Evontree first extracts a domain ontology from the raw model, detects knowledge inconsistencies using two core ontology rules, and reinforces the corrected knowledge through self-distilled fine-tuning.
Result: Experiments with Llama3-8B-Instruct and Med42-v2 show that the method outperforms both the unmodified models and leading supervised baselines on multiple medical QA benchmarks, with accuracy gains of up to 3.7%.
Conclusion: Evontree enables effective, efficient, and robust domain adaptation of LLMs in low-resource settings, confirming the potential of combining ontology rules with implicit knowledge repositories.
Abstract: Large language models (LLMs) have demonstrated exceptional capabilities across multiple domains by leveraging massive pre-training and curated fine-tuning data. However, in data-sensitive fields such as healthcare, the lack of high-quality, domain-specific training corpora hinders LLMs' adaptation for specialized applications. Meanwhile, domain experts have distilled domain wisdom into ontology rules, which formalize relationships among concepts and ensure the integrity of knowledge management repositories. Viewing LLMs as implicit repositories of human knowledge, we propose Evontree, a novel framework that leverages a small set of high-quality ontology rules to systematically extract, validate, and enhance domain knowledge within LLMs, without requiring extensive external datasets. Specifically, Evontree extracts domain ontology from raw models, detects inconsistencies using two core ontology rules, and reinforces the refined knowledge via self-distilled fine-tuning. Extensive experiments on medical QA benchmarks with Llama3-8B-Instruct and Med42-v2 demonstrate consistent outperformance over both unmodified models and leading supervised baselines, achieving up to a 3.7% improvement in accuracy. These results confirm the effectiveness, efficiency, and robustness of our approach for low-resource domain adaptation of LLMs.
[54] Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Team,Yu Zhang,Zongyu Lin,Xingcheng Yao,Jiaxi Hu,Fanqing Meng,Chengyin Liu,Xin Men,Songlin Yang,Zhiyuan Li,Wentao Li,Enzhe Lu,Weizhou Liu,Yanru Chen,Weixin Xu,Longhui Yu,Yejie Wang,Yu Fan,Longguang Zhong,Enming Yuan,Dehao Zhang,Yizhi Zhang,T. Y. Liu,Haiming Wang,Shengjun Fang,Weiran He,Shaowei Liu,Yiwei Li,Jianlin Su,Jiezhong Qiu,Bo Pang,Junjie Yan,Zhejun Jiang,Weixiao Huang,Bohong Yin,Jiacheng You,Chu Wei,Zhengtao Wang,Chao Hong,Yutian Chen,Guanduo Chen,Yucheng Wang,Huabin Zheng,Feng Wang,Yibo Liu,Mengnan Dong,Zheng Zhang,Siyuan Pan,Wenhao Wu,Yuhao Wu,Longyu Guan,Jiawen Tao,Guohong Fu,Xinran Xu,Yuzhi Wang,Guokun Lai,Yuxin Wu,Xinyu Zhou,Zhilin Yang,Yulun Du
Main category: cs.CL
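TL;DR: Kimi Linear is a new hybrid linear attention architecture that, for the first time, outperforms full attention across a variety of scenarios, with higher efficiency and performance.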
Details
Motivation: To improve computational efficiency in long-context and reinforcement learning settings while preserving expressiveness, overcoming the memory and compute overhead of conventional full attention.
Method: Proposes Kimi Delta Attention (KDA), which combines a fine-grained gating mechanism with a bespoke chunkwise algorithm and uses a specialized variant of DPLR transition matrices to improve hardware efficiency. The model is pretrained with a layerwise hybrid of KDA and MLA.
Result: Under an identical training setup, Kimi Linear significantly outperforms the full-attention model across tasks, reduces KV cache usage by up to 75%, and achieves up to 6 times the decoding throughput at 1M context.
Conclusion: Kimi Linear can serve as an efficient replacement for full attention architectures, performing better in short-context, long-context, and reinforcement learning settings, and is well suited to tasks with long input and output sequences.
Abstract: We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
[55] The End of Manual Decoding: Towards Truly End-to-End Language Models
Zhichao Wang,Dongyang Ma,Xinting Huang,Deng Cai,Tian Lan,Jiahao Xu,Haitao Mi,Xiaoying Tang,Yan Wang
Main category: cs.CL
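TL;DR: This paper proposes AutoDeco, a new architecture that achieves truly end-to-end generation by learning to dynamically predict context-specific decoding parameters (such as temperature and top-p). The model can self-regulate its sampling strategy within a single forward pass and shows an emergent ability to control decoding behavior through natural-language instructions.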
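To make the idea of per-step decoding parameters concrete, the following is a minimal sketch of temperature plus top-p (nucleus) sampling in which both values are supplied for each step. It is not AutoDeco's implementation: the function name `sample_token` is hypothetical, and in an AutoDeco-style model the `temperature` and `top_p` arguments would presumably come from the learned heads rather than from the caller.

```python
import math
import random


def sample_token(logits, temperature, top_p, rng=random):
    """Sample one token id using a per-step temperature and top-p.

    Assumption: `temperature` and `top_p` are plain arguments here; a model
    with learned decoding heads would predict them at every step.
    """
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / max(temperature, 1e-6) for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus filtering: keep the smallest set of most likely tokens
    # whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample from the kept tokens in proportion to their probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

With a very small `top_p` or a very low `temperature`, only the most likely token survives the filter, so the call becomes effectively greedy; larger values widen the candidate set.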
Details
Motivation: Although existing LLMs are called "end-to-end", they actually depend on a non-differentiable decoding process that requires manual tuning (for example of temperature and top-p), which limits automation and adaptability. A method that learns and dynamically adjusts its own decoding strategy is therefore needed for truly end-to-end generation.
Method: Lightweight heads are added to the standard Transformer so that each generation step predicts the next-token logits together with the corresponding temperature and top-p values. This turns the decoding strategy into a learnable, token-level parametric process that can be jointly optimized by backpropagation.
Result: Experiments on eight benchmarks show that AutoDeco significantly outperforms default decoding strategies and approaches an oracle baseline obtained by tuning on the test set. The model also shows the ability to adjust its decoding parameters according to natural-language instructions (e.g., "generate with low randomness"), enabling fine-grained decoding control.
Conclusion: AutoDeco achieves end-to-end text generation in the full sense, improving generation quality and giving the model self-regulation and instruction-driven decoding, which offers a new paradigm for steerable, interactive LLM decoding.
Abstract: The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set", a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.
[56] Value Drifts: Tracing Value Alignment During LLM Post-Training
Mehar Bhatia,Shravan Nayak,Gaurav Kamath,Marius Mosbach,Karolina Stańczak,Vered Shwartz,Siva Reddy
Main category: cs.CL
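TL;DR: This study examines how value alignment evolves during LLM post-training. It finds that the supervised fine-tuning stage largely establishes a model's values, that subsequent preference optimization rarely re-aligns them, and that different preference optimization algorithms lead to different alignment outcomes on the same data.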
Details
Motivation: As LLMs play an increasingly important role in society, they need not only to draw on knowledge but also to align with human values. Prior work, however, mostly evaluates the alignment of the final model and overlooks how alignment evolves during training, so it is necessary to study how and when value alignment forms during post-training.
Method: The effects of post-training algorithms and datasets are disentangled by applying common supervised fine-tuning (SFT) and preference optimization algorithms and datasets to Llama-3 and Qwen-3 models of different sizes, measuring the magnitude and timing of value drifts during training. A synthetic preference dataset with controllable values is used to analyze the effect of different algorithms.
Result: The SFT stage largely establishes a model's values, and subsequent preference optimization rarely changes the established value orientation. Even with the same preference data, different preference optimization algorithms produce different value alignment outcomes.
Conclusion: The study identifies the key stages and mechanisms of value learning during post-training and offers practical guidance for data curation, model selection, and the design of optimization algorithms, helping to improve alignment with human values.
Abstract: As LLMs occupy an increasingly important role in society, they are more and more confronted with questions that require them not only to draw on their general knowledge but also to align with certain human value systems. Therefore, studying the alignment of LLMs with human values has become a crucial field of inquiry. Prior work, however, mostly focuses on evaluating the alignment of fully trained models, overlooking the training dynamics by which models learn to express human values. In this work, we investigate how and at which stage value alignment arises during the course of a model's post-training. Our analysis disentangles the effects of post-training algorithms and datasets, measuring both the magnitude and time of value drifts during training. Experimenting with Llama-3 and Qwen-3 models of different sizes and popular supervised fine-tuning (SFT) and preference optimization datasets and algorithms, we find that the SFT phase generally establishes a model's values, and subsequent preference optimization rarely re-aligns these values. Furthermore, using a synthetic preference dataset that enables controlled manipulation of values, we find that different preference optimization algorithms lead to different value alignment outcomes, even when preference data is held constant. Our findings provide actionable insights into how values are learned during post-training and help to inform data curation, as well as the selection of models and algorithms for preference optimization to improve model alignment to human values.
[57] AMO-Bench: Large Language Models Still Struggle in High School Math Competitions
Shengnan An,Xunliang Cai,Xuezhi Cao,Xiaoyu Li,Yehao Lin,Junlin Liu,Xinxuan Lv,Dan Ma,Xuanlin Wang,Ziwen Wang,Shuang Zhou
Main category: cs.CL
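TL;DR: AMO-Bench is an advanced mathematical reasoning benchmark of 50 human-crafted, original problems at or above International Mathematical Olympiad difficulty, designed to evaluate how LLMs reason on very hard math problems. Experiments show that current models perform poorly, but accuracy tends to improve as compute increases, which indicates room for improvement.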
Details
Motivation: Existing math benchmarks (such as AIME24/25) are saturated by top models and can no longer evaluate the mathematical reasoning of leading LLMs effectively, so a more challenging benchmark is needed.
Method: 50 fully original math problems are designed and cross-validated by experts to be at least IMO difficulty. Each problem requires only a final answer, which supports automatic grading. 26 LLMs are tested and analyzed.
Result: Across the 26 LLMs, even the best-performing model reaches only 52.4% accuracy on AMO-Bench, and most models score below 40%. Performance does trend upward as test-time compute increases.
Conclusion: AMO-Bench shows that current LLMs still have substantial room to improve in advanced mathematical reasoning, and its release should help drive research on the reasoning abilities of language models.
Abstract: We present AMO-Bench, an Advanced Mathematical reasoning benchmark with Olympiad level or even higher difficulty, comprising 50 human-crafted problems. Existing benchmarks have widely leveraged high school math competitions for evaluating mathematical reasoning capabilities of large language models (LLMs). However, many existing math competitions are becoming less effective for assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To address this, AMO-Bench introduces more rigorous challenges by ensuring all 50 problems are (1) cross-validated by experts to meet at least the International Mathematical Olympiad (IMO) difficulty standards, and (2) entirely original problems to prevent potential performance leakage from data memorization. Moreover, each problem in AMO-Bench requires only a final answer rather than a proof, enabling automatic and robust grading for evaluation. Experimental results across 26 LLMs on AMO-Bench show that even the best-performing model achieves only 52.4% accuracy on AMO-Bench, with most LLMs scoring below 40%. Beyond this poor performance, our further analysis reveals a promising scaling trend with increasing test-time compute on AMO-Bench. These results highlight the significant room for improving the mathematical reasoning in current LLMs. We release AMO-Bench to facilitate further research into advancing the reasoning abilities of language models. https://amo-bench.github.io/
[58] Gistify! Codebase-Level Understanding via Runtime Execution
Hyunji Lee,Minseon Kim,Chinmay Singh,Matheus Pereira,Atharv Sonwane,Isadora White,Elias Stengel-Eskin,Mohit Bansal,Zhengyan Shi,Alessandro Sordoni,Marc-Alexandre Côté,Xingdi Yuan,Lucas Caccia
Main category: cs.CL
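TL;DR: Proposes the Gistify task, in which a coding LLM must produce a minimal, self-contained file from a large codebase that reproduces a specific functionality, evaluating the model's understanding of codebase structure and execution flow.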
Details
Motivation: As coding agents are widely deployed in large codebases, there is a pressing need for a way to automatically design challenging, codebase-level evaluations.
Method: Proposes the Gistify task: given a full codebase and a specific entrypoint, the LLM must generate a single file containing only the necessary components that reproduces the output of that entrypoint under the original codebase.
Result: Experiments show that current state-of-the-art models perform poorly on Gistify, especially when execution traces are long.
Conclusion: Gistify is a challenging new evaluation task that exposes the shortcomings of existing coding LLMs in understanding codebase structure and generating long code patches.
Abstract: As coding agents are increasingly deployed in large codebases, the need to automatically design challenging, codebase-level evaluation is central. We propose Gistify, a task where a coding LLM must create a single, minimal, self-contained file that can reproduce a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a python command), and the generated file must replicate the output of the same command run under the full codebase, while containing only the essential components necessary to execute the provided command. Success on Gistify requires structural understanding of the codebase, accurate modeling of its execution flow, and the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long execution traces.
cs.CV [Back]
[59] Enhancing Underwater Object Detection through Spatio-Temporal Analysis and Spatial Attention Networks
Sai Likhith Karri,Ansh Saxena
Main category: cs.CV
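TL;DR: This study proposes and evaluates T-YOLOv5 and an improved variant combined with CBAM, aiming to increase the accuracy of underwater object detection in dynamic marine environments.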
Details
Motivation: To address the detection difficulties caused by fast motion, partial occlusion, and complex backgrounds in underwater environments, and to improve on the detection performance of the existing YOLOv5 model.
Method: A temporally enhanced T-YOLOv5 model is introduced first, and a Convolutional Block Attention Module (CBAM) is then integrated into it. YOLOv5, T-YOLOv5, and T-YOLOv5+CBAM are compared on the mAP@50-95 metric.
Result: YOLOv5 achieves an mAP@50-95 of 0.563, T-YOLOv5 raises it to 0.813, and T-YOLOv5+CBAM reaches 0.811. Temporal modeling therefore improves performance significantly; adding CBAM helps further in complex scenes but slightly reduces accuracy in simple ones.
Conclusion: T-YOLOv5 significantly improves the reliability of underwater object detection. With CBAM it performs better in challenging scenes but may overfit in simple ones, so model complexity has to be weighed against the application.
Abstract: This study examines the effectiveness of spatio-temporal modeling and the integration of spatial attention mechanisms in deep learning models for underwater object detection. Specifically, in the first phase, the performance of the temporally enhanced YOLOv5 variant T-YOLOv5 is evaluated in comparison with the standard YOLOv5. For the second phase, an augmented version of T-YOLOv5 is developed through the addition of a Convolutional Block Attention Module (CBAM). By examining the effectiveness of the pre-existing YOLOv5 and T-YOLOv5 models and of the newly developed T-YOLOv5 with CBAM, the research highlights how temporal modeling improves detection accuracy in dynamic marine environments, particularly under conditions of sudden movements, partial occlusions, and gradual motion. The testing results showed that YOLOv5 achieved a mAP@50-95 of 0.563, while T-YOLOv5 and T-YOLOv5 with CBAM outperformed with mAP@50-95 scores of 0.813 and 0.811, respectively, highlighting their superior accuracy and generalization in detecting complex objects. The findings demonstrate that T-YOLOv5 significantly enhances detection reliability compared to the standard model, while T-YOLOv5 with CBAM further improves performance in challenging scenarios, although there is a loss of accuracy when it comes to simpler scenarios.
[60] MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency
Nicolas Dufour,Lucas Degeorge,Arijit Ghosh,Vicky Kalogeiton,David Picard
Main category: cs.CV
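TL;DR: Proposes MIRO, which introduces multiple reward models during training so the model learns user preferences directly, improving the quality, diversity, and training efficiency of text-to-image generation.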
Details
Motivation: Existing text-to-image generative models rely on large uncurated datasets and are hard to align with user preferences; post-hoc selection wastes data and harms diversity and semantic fidelity.
Method: The model is conditioned on multiple reward models during training, so that the generative model learns user preferences directly instead of relying on selection after generation.
Result: MIRO significantly improves the visual quality of generated images, speeds up training, and reaches state-of-the-art results on several benchmarks, including GenEval, PickAScore, ImageReward, and HPSv2.
Conclusion: Training conditioned on multiple reward models aligns with user preferences more efficiently while preserving diversity and semantic consistency, and it outperforms post-hoc alignment methods.
Abstract: Current text-to-image generative models are trained on large uncurated datasets to enable diverse generation capabilities. However, this does not align well with user preferences. Recently, reward models have been specifically designed to perform post-hoc selection of generated images and align them to a reward, typically user preference. This discarding of informative data, together with optimizing for a single reward, tends to harm diversity, semantic fidelity and efficiency. Instead of this post-processing, we propose to condition the model on multiple reward models during training to let the model learn user preferences directly. We show that this not only dramatically improves the visual quality of the generated images but also significantly speeds up the training. Our proposed method, called MIRO, achieves state-of-the-art performance on the GenEval compositional benchmark and user-preference scores (PickAScore, ImageReward, HPSv2).
[61] BikeScenes: Online LiDAR Semantic Segmentation for Bicycles
Denniz Goren,Holger Caesar
Main category: cs.CV
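TL;DR: This paper proposes a 3D LiDAR segmentation approach aimed at bicycle safety and releases the BikeScenes-lidarseg dataset. Experiments show that fine-tuning on this dataset significantly improves segmentation performance.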
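This entry reports its results as mean Intersection-over-Union (mIoU). As a reminder of what that number measures, here is a minimal sketch that computes mIoU from per-point class labels. The function name `mean_iou` is hypothetical, and skipping classes absent from both prediction and ground truth is an assumption; benchmark toolkits differ on that detail.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two equal-length label sequences.

    Assumption: a class that appears in neither `pred` nor `target`
    is skipped instead of being counted as IoU 0 or 1.
    """
    ious = []
    for c in range(num_classes):
        # Points labelled c in both sequences.
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Points labelled c in at least one sequence.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with two classes: IoU is 1/2 for class 0 and 2/3 for class 1.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Because every class contributes equally to the average, rare classes weigh as much as frequent ones, which is why mIoU is a common choice for segmentation benchmarks with imbalanced classes.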
Details
Motivation: The vulnerability of cyclists is heightened by the spread of e-bikes, which motivates adapting automotive perception technology to bicycle safety.
Method: The multi-sensor 'SenseBike' platform is used to develop and evaluate a 3D LiDAR segmentation approach suited to bicycles, and the BikeScenes-lidarseg dataset is introduced to narrow the automotive-to-bicycle domain gap.
Result: After fine-tuning on the BikeScenes dataset, mean Intersection-over-Union (mIoU) reaches 63.6%, far above the 13.8% obtained with SemanticKITTI pre-training alone.
Conclusion: Domain-specific training is essential for LiDAR semantic segmentation in bicycle scenarios, and the BikeScenes dataset is an important resource for cyclist-centric perception research.
Abstract: The vulnerability of cyclists, exacerbated by the rising popularity of faster e-bikes, motivates adapting automotive perception technologies for bicycle safety. We use our multi-sensor 'SenseBike' research platform to develop and evaluate a 3D LiDAR segmentation approach tailored to bicycles. To bridge the automotive-to-bicycle domain gap, we introduce the novel BikeScenes-lidarseg Dataset, comprising 3021 consecutive LiDAR scans around the university campus of TU Delft, semantically annotated for 29 dynamic and static classes. By evaluating model performance, we demonstrate that fine-tuning on our BikeScenes dataset achieves a mean Intersection-over-Union (mIoU) of 63.6%, significantly outperforming the 13.8% obtained with SemanticKITTI pre-training alone. This result underscores the necessity and effectiveness of domain-specific training. We highlight key challenges specific to bicycle-mounted, hardware-constrained perception systems and contribute the BikeScenes dataset as a resource for advancing research in cyclist-centric LiDAR segmentation.
[62] Generative Image Restoration and Super-Resolution using Physics-Informed Synthetic Data for Scanning Tunneling Microscopy
Nikola L. Kolev,Tommaso Rodani,Neil J. Curson,Taylor J. Z. Stock,Alberto Cazzaniga
Main category: cs.CV
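TL;DR: Proposes a machine-learning approach to image repair and super-resolution for scanning tunneling microscopy, training models with a physics-informed synthetic data generation pipeline, which significantly reduces image acquisition time and the frequency of tip conditioning.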
Details
Motivation: Scanning tunneling microscopy (STM) is limited by tip degradation and slow data acquisition. During fabrication the tip is exposed to large voltages that can alter its shape, so it needs frequent conditioning, which limits throughput.
Method: Starting from a dataset of only 36 high-quality experimental images, a physics-informed synthetic data generation pipeline is built and used to train state-of-the-art flow-matching and diffusion models for image repair and super-resolution.
Result: The models perform well on metrics such as CLIP MMD and structural similarity, reconstruct images accurately from sparsely sampled data, and reduce image acquisition time by a factor of 2 to 4.
Conclusion: The framework could significantly increase STM experimental throughput, reduce the number of tip-conditioning procedures, and raise frame rates in existing high-speed STM systems.
Abstract: Scanning tunnelling microscopy (STM) enables atomic-resolution imaging and atom manipulation, but its utility is often limited by tip degradation and slow serial data acquisition. Fabrication adds another layer of complexity since the tip is often subjected to large voltages, which may alter the shape of its apex, requiring it to be conditioned. Here, we propose a machine learning (ML) approach for image repair and super-resolution to alleviate both challenges. Using a dataset of only 36 pristine experimental images of Si(001):H, we demonstrate that a physics-informed synthetic data generation pipeline can be used to train several state-of-the-art flow-matching and diffusion models. Quantitative evaluation with metrics such as the CLIP Maximum Mean Discrepancy (CMMD) score and structural similarity demonstrates that our models are able to effectively restore images and offer a two- to fourfold reduction in image acquisition time by accurately reconstructing images from sparsely sampled data. Our framework has the potential to significantly increase STM experimental throughput by offering a route to reducing the frequency of tip-conditioning procedures and to enhancing frame rates in existing high-speed STM systems.
[63] SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing
Sung-Hoon Yoon,Minghan Li,Gaspard Beaudouin,Congcong Wen,Muhammad Rafay Azhar,Mengyu Wang
Main category: cs.CV
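TL;DR: Proposes an image editing framework based on flow decomposition and aggregation that needs no explicit inversion. It semantically decomposes the prompt and adaptively fuses the resulting sub-flows, improving semantic fidelity and attribute disentanglement in editing.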
Details
Motivation: Existing image editing methods struggle to keep edits semantically consistent and diverse because of inaccurate inversion and gradient entanglement, and existing inversion-free methods still give unsatisfactory editing quality.
Method: The target prompt is semantically decomposed into multiple sub-prompts, an independent flow is computed for each, and the flows are fused adaptively through a projection and soft-aggregation mechanism inspired by multi-task learning to form a unified editing trajectory.
Result: Experiments show that the method outperforms existing approaches on zero-shot image editing, with better semantic fidelity and attribute disentanglement.
Conclusion: The proposed flow decomposition-and-aggregation framework resolves the tension between semantic redundancy and consistency in inversion-free editing, producing high-quality, diverse, and faithful edits.
Abstract: Rectified flow models have become a de facto standard in image generation due to their stable sampling trajectories and high-fidelity outputs. Despite their strong generative capabilities, they face critical limitations in image editing tasks: inaccurate inversion processes for mapping real images back into the latent space, and gradient entanglement issues during editing often result in outputs that do not faithfully reflect the target prompt. Recent efforts have attempted to directly map source and target distributions via ODE-based approaches without inversion; however, these methods still yield suboptimal editing quality. In this work, we propose a flow decomposition-and-aggregation framework built upon an inversion-free formulation to address these limitations. Specifically, we semantically decompose the target prompt into multiple sub-prompts, compute an independent flow for each, and aggregate them to form a unified editing trajectory. While we empirically observe that decomposing the original flow enhances diversity in the target space, generating semantically aligned outputs still requires consistent guidance toward the full target prompt. To this end, we design a projection and soft-aggregation mechanism for flow, inspired by gradient conflict resolution in multi-task learning. This approach adaptively weights the sub-target velocity fields, suppressing semantic redundancy while emphasizing distinct directions, thereby preserving both diversity and consistency in the final edited output. Experimental results demonstrate that our method outperforms existing zero-shot editing approaches in terms of semantic fidelity and attribute disentanglement. The code is available at https://github.com/Harvard-AI-and-Robotics-Lab/SplitFlow.
[64] Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer
Roman Beliy,Amit Zalcher,Jonathan Kogman,Navve Wasserman,Michal Irani
Main category: cs.CV
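TL;DR: Proposes "Brain-IT", a brain-inspired approach that uses a Brain Interaction Transformer (BIT) to enable effective interactions between clusters of functionally similar brain voxels. It significantly improves the fidelity of images reconstructed from fMRI data and surpasses current state-of-the-art methods both visually and on objective metrics.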
Details
Motivation: Current diffusion-based methods for reconstructing images from fMRI still lack fidelity to the images actually seen, so more accurate and efficient methods are needed to improve reconstruction quality.
Method: Introduces the Brain Interaction Transformer (BIT), which integrates information using functional brain-voxel clusters shared across subjects and guides a diffusion model by predicting localized high-level (semantic) and low-level (structural) image features, enabling accurate reconstruction from brain signals to images.
Result: The method surpasses current state-of-the-art approaches on standard objective metrics and in visual quality, and with only 1 hour of fMRI data from a new subject it matches what earlier methods achieve with 40 hours of training data.
Conclusion: Through brain-inspired design and an efficient feature-guidance mechanism, Brain-IT achieves more faithful and efficient image reconstruction from fMRI, advancing non-invasive brain decoding.
Abstract: Reconstructing images seen by people from their fMRI brain recordings provides a non-invasive window into the human brain. Despite recent progress enabled by diffusion models, current methods often lack faithfulness to the actual seen images. We present "Brain-IT", a brain-inspired approach that addresses this challenge through a Brain Interaction Transformer (BIT), allowing effective interactions between clusters of functionally-similar brain-voxels. These functional-clusters are shared by all subjects, serving as building blocks for integrating information both within and across brains. All model components are shared by all clusters and subjects, allowing efficient training with a limited amount of data. To guide the image reconstruction, BIT predicts two complementary localized patch-level image features: (i) high-level semantic features which steer the diffusion model toward the correct semantic content of the image; and (ii) low-level structural features which help to initialize the diffusion process with the correct coarse layout of the image. BIT's design enables direct flow of information from brain-voxel clusters to localized image features. Through these principles, our method achieves image reconstructions from fMRI that faithfully reconstruct the seen images, and surpass current SotA approaches both visually and by standard objective metrics. Moreover, with only 1 hour of fMRI data from a new subject, we achieve results comparable to current methods trained on full 40-hour recordings.
[65] Fine-tuning Segment Anything for Real-Time Tumor Tracking in Cine-MRI
Valentin Boussot,Cédric Hémon,Jean-Claude Nunes,Jean-Louis Dillenseger
Main category: cs.CV
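TL;DR: This study addresses the real-time tumor tracking task of the TrackRAD2025 challenge with a segmentation method based on the SAM 2.1 foundation model. Under data scarcity it reaches a Dice score of 0.8794 and ranks 6th, showing the potential of foundation models for real-time tumor tracking in MRI-guided radiotherapy.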
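The abstract for this entry mentions a "balanced Dice + IoU loss" and reports a Dice score. The sketch below shows one plausible form of such a loss on flattened soft masks. It is an illustration only: the function name `dice_iou_loss`, the equal 0.5/0.5 weighting, and the smoothing constant `eps` are assumptions, not details taken from the paper.

```python
def dice_iou_loss(pred, target, eps=1e-6):
    """Equal-weight Dice + IoU loss for flattened masks with values in [0, 1].

    Assumption: "balanced" is read as a 0.5 / 0.5 weighting of the two terms.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    pred_sum = sum(pred)
    target_sum = sum(target)

    # Soft Dice coefficient: 2|A ∩ B| / (|A| + |B|).
    dice = (2.0 * intersection + eps) / (pred_sum + target_sum + eps)
    # Soft IoU (Jaccard index): |A ∩ B| / |A ∪ B|.
    iou = (intersection + eps) / (pred_sum + target_sum - intersection + eps)

    return 0.5 * (1.0 - dice) + 0.5 * (1.0 - iou)


# A perfect prediction gives a loss near 0; disjoint masks give a loss near 1.
perfect = dice_iou_loss([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
disjoint = dice_iou_loss([1.0, 0.0], [0.0, 1.0])
```

Dice and IoU are monotonically related, but their gradients differ, so combining them is a common way to balance sensitivity to small and large overlap errors.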
Details
Motivation: To track tumors in real time in thoracic and abdominal cine-MRI sequences under strong data scarcity, meeting the clinical need for efficient and accurate tracking.
Method: Segmentation uses foundation models based on SAM 2.1 and its variants, with mask prompts generated from the first annotated frame. The prompt encoder, decoder, and Hiera backbone are fine-tuned on a small labeled dataset. Training uses 1024x1024 patches, batch size 1, standard augmentations, and a Dice + IoU loss, with a learning rate of 0.0001 for 300 epochs.
Result: The model reaches a Dice Similarity Coefficient of 0.8794 on the hidden test set, ranking 6th in the TrackRAD2025 challenge. It meets the one-second inference time limit and behaves consistently across anatomical sites and magnetic field strengths.
Conclusion: Foundation models such as SAM 2.1 can deliver accurate, real-time tumor tracking after fine-tuning on a small amount of labeled data, and have potential for broad use in MRI-guided radiotherapy.
Abstract: In this work, we address the TrackRAD2025 challenge of real-time tumor tracking in cine-MRI sequences of the thoracic and abdominal regions under strong data scarcity constraints. Two complementary strategies were explored: (i) unsupervised registration with the IMPACT similarity metric and (ii) foundation model-based segmentation leveraging SAM 2.1 and its recent variants through prompt-based interaction. Due to the one-second runtime constraint, the SAM-based method was ultimately selected. The final configuration used SAM2.1 b+ with mask-based prompts from the first annotated slice, fine-tuned solely on the small labeled subset from TrackRAD2025. Training was configured to minimize overfitting, using 1024x1024 patches (batch size 1), standard augmentations, and a balanced Dice + IoU loss. A low uniform learning rate (0.0001) was applied to all modules (prompt encoder, decoder, Hiera backbone) to preserve generalization while adapting to annotator-specific styles. Training lasted 300 epochs (~12h on RTX A6000, 48GB). The same inference strategy was consistently applied across all anatomical sites and MRI field strengths. Test-time augmentation was considered but ultimately discarded due to negligible performance gains. The final model was selected based on the highest Dice Similarity Coefficient achieved on the validation set after fine-tuning. On the hidden test set, the model reached a Dice score of 0.8794, ranking 6th overall in the TrackRAD2025 challenge. These results highlight the strong potential of foundation models for accurate and real-time tumor tracking in MRI-guided radiotherapy.
[66] Larger Hausdorff Dimension in Scanning Pattern Facilitates Mamba-Based Methods in Low-Light Image Enhancement
Xinhua Wang,Caibo Feng,Xiangjun Fu,Chunxiao Liu
Main category: cs.CV
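TL;DR: Proposes an improvement to the Mamba framework based on a Hilbert Selective Scan, which raises the Hausdorff dimension of the scanning pattern to explore the feature space more effectively and improve low-light image enhancement.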
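For readers unfamiliar with Hilbert-curve scanning, the sketch below generates a Hilbert traversal order for a square grid using the standard distance-to-coordinate conversion. It only illustrates the locality property of a Hilbert scan: it is not the paper's Hilbert Selective Scan mechanism, the function names are hypothetical, and the grid side is assumed to be a power of two.

```python
def hilbert_d2xy(n, d):
    """Map distance d along a Hilbert curve to (x, y) on an n x n grid.

    Assumption: n is a power of two and 0 <= d < n * n.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the current quadrant so sub-curves join end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """Return all cells of an n x n grid in Hilbert-curve order."""
    return [hilbert_d2xy(n, d) for d in range(n * n)]


# Consecutive cells in the scan are always grid neighbours, unlike a
# row-major (raster) scan, which jumps across the image at each row end.
order = hilbert_order(8)
steps = [abs(x1 - x0) + abs(y1 - y0)
         for (x0, y0), (x1, y1) in zip(order, order[1:])]
```

A sequence model that consumes patches in this order sees spatially adjacent patches at adjacent positions, which is the intuition behind replacing raster scans with space-filling curves.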
Details
Motivation: Existing Mamba frameworks suffer from information inconsistencies and insufficient spatial locality in low-light image enhancement, which limits their ability to capture fine-grained features.
Method: A new Hilbert Selective Scan mechanism raises the Hausdorff dimension of the scanning path, which improves coverage of the feature space, spatial locality, and long-range dependency modeling.
Result: On public benchmarks the method significantly improves quantitative metrics and visual quality while reducing computational resource consumption and inference time.
Conclusion: The method advances low-light image enhancement and also suggests a direction for improving other Mamba-based applications.
Abstract: We propose an innovative enhancement to the Mamba framework by increasing the Hausdorff dimension of its scanning pattern through a novel Hilbert Selective Scan mechanism. This mechanism explores the feature space more effectively, capturing intricate fine-scale details and improving overall coverage. As a result, it mitigates information inconsistencies while refining spatial locality to better capture subtle local interactions without sacrificing the model's ability to handle long-range dependencies. Extensive experiments on publicly available benchmarks demonstrate that our approach significantly improves both the quantitative metrics and qualitative visual fidelity of existing Mamba-based low-light image enhancement methods, all while reducing computational resource consumption and shortening inference time. We believe that this refined strategy not only advances the state-of-the-art in low-light image enhancement but also holds promise for broader applications in fields that leverage Mamba-based techniques.
[67] CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments
Rishika Bhagwatkar,Syrielle Montariol,Angelika Romanou,Beatriz Borges,Irina Rish,Antoine Bosselut
Main category: cs.CV
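TL;DR: This paper introduces CAVE, the first benchmark of real-world visual anomalies. It supports three open-ended tasks (anomaly description, explanation, and justification) and provides fine-grained annotations for evaluating how well vision-language models detect and understand anomalies.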
Details
Motivation: Existing visual anomaly research is limited to industrial defects or synthetic anomalies and does not reflect the complexity and unpredictability of real-world anomalies, so a realistic benchmark closer to human cognition is needed.
Method: The CAVE benchmark dataset is built from visual anomalies in real scenes, with three tasks (description, explanation, justification) and fine-grained annotations grounded in cognitive science that cover each anomaly's visual manifestation, complexity, severity, and commonness.
Result: Experiments show that even with advanced prompting strategies, current state-of-the-art vision-language models still perform poorly at visual anomaly perception and commonsense reasoning.
Conclusion: As a realistic, cognitively grounded benchmark, CAVE is an important resource for advancing research on anomaly detection and commonsense reasoning in vision-language models.
Abstract: Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks: anomaly description, explanation, and justification, with fine-grained annotations for visual grounding and categorizing anomalies based on their visual manifestations, their complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies. By offering a realistic and cognitively grounded benchmark, CAVE serves as a valuable resource for advancing research in anomaly detection and commonsense reasoning in VLMs.
[68] Climate Adaptation-Aware Flood Prediction for Coastal Cities Using Deep Learning
Bilal Hassan,Areg Karapetyan,Aaron Chung Hin Chow,Samer Madanat
Main category: cs.CV
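TL;DR: Proposes a lightweight CNN-based deep learning model for predicting coastal flooding under different sea-level rise scenarios and shoreline adaptation options. The model performs well with limited data, generalizes across two regions (Abu Dhabi and San Francisco), and reduces prediction error by nearly 20% on average compared with existing methods.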
Details
Motivation: Traditional physics-based hydrodynamic simulators are computationally expensive and hard to apply to city-scale coastal planning, while deep learning methods are often limited by data scarcity and high-dimensional outputs. An efficient, accurate flood prediction model that works at large scale is therefore needed.
Method: Building on a recently proposed vision-based, low-resource deep learning framework, a lightweight convolutional neural network (CNN) model is constructed and trained and validated on multi-region datasets from Abu Dhabi and San Francisco to predict flood depth maps under different sea-level rise and shoreline adaptation scenarios.
Result: The proposed model reduces the mean absolute error (MAE) of predicted flood depth maps by nearly 20% compared with state-of-the-art methods and generalizes well across geographic regions.
Conclusion: The lightweight CNN model provides a scalable, practical tool for coastal flood prediction that can help decision-makers develop effective mitigation strategies for the impacts of climate change.
Abstract: Climate change and sea-level rise (SLR) pose escalating threats to coastal cities, intensifying the need for efficient and accurate methods to predict potential flood hazards. Traditional physics-based hydrodynamic simulators, although precise, are computationally expensive and impractical for city-scale coastal planning applications. Deep Learning (DL) techniques offer promising alternatives; however, they are often constrained by challenges such as data scarcity and high-dimensional output requirements. Leveraging a recently proposed vision-based, low-resource DL framework, we develop a novel, lightweight Convolutional Neural Network (CNN)-based model designed to predict coastal flooding under variable SLR projections and shoreline adaptation scenarios. Furthermore, we demonstrate the ability of the model to generalize across diverse geographical contexts by utilizing datasets from two distinct regions: Abu Dhabi and San Francisco. Our findings demonstrate that the proposed model significantly outperforms state-of-the-art methods, reducing the mean absolute error (MAE) in predicted flood depth maps on average by nearly 20%. These results highlight the potential of our approach to serve as a scalable and practical tool for coastal flood management, empowering decision-makers to develop effective mitigation strategies in response to the growing impacts of climate change. Project Page: https://caspiannet.github.io/
[69] Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders
Ali Rasekh,Erfan Bagheri Soula,Omid Daliran,Simon Gottschalk,Mohsen Fayyaz
Main category: cs.CV
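TL;DR: This paper proposes STAVEQ2, a new Video-LLM architecture that adds stacked temporal attention modules to the vision encoder. It significantly improves the model's understanding of action sequences and temporal progression in video, with gains of up to +5.5% on several video question answering benchmarks.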
Details
Motivation: Existing Video-LLMs are limited in understanding complex temporal dynamics and struggle to capture action sequences and inter-frame temporal relationships, which holds back their performance on video understanding tasks.
Method: Stacked temporal attention modules are added to the vision encoder so the model can better capture temporal progression and dependencies between frames before the visual tokens are passed to the large language model.
Result: The method significantly outperforms existing models on video question answering benchmarks including VITATECS, MVBench, and Video-MME, with gains of up to +5.5%, particularly on action recognition.
Conclusion: Strengthening the temporal modeling of the vision encoder addresses a key weakness of current Video-LLMs in temporal understanding and points to a better architectural direction for video understanding.
Abstract: Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. This design incorporates temporal attention in the vision encoder, enabling the model to better capture the progression of actions and the relationships between frames before passing visual tokens to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models in video question answering tasks, specifically in action recognition. We improve on benchmarks including VITATECS, MVBench, and Video-MME by up to +5.5%. By enhancing the vision encoder with temporal structure, we address a critical gap in video understanding for Video-LLMs. Project page and code are available at: https://alirasekh.github.io/STAVEQ2/.
[70] FlexICL: A Flexible Visual In-context Learning Framework for Elbow and Wrist Ultrasound Segmentation
Yuyue Zhou,Jessica Knight,Shrimanti Ghosh,Banafshe Felfeliyan,Jacob L. Jaremko,Abhilash R. Hareendranathan
Main category: cs.CV
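TL;DR: Proposes FlexICL, a flexible in-context learning framework for automatic segmentation of bony regions in ultrasound images, which significantly outperforms existing models while using only 5% of the training data.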
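FlexICL's gains are reported as Dice coefficient, so a short sketch of that metric may help when reading the numbers. The function name and the nested-list mask format are assumptions for illustration; the entry does not describe the authors' evaluation code.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for binary masks.

    pred, target: equally sized nested lists of 0/1 values.
    Two empty masks are treated as a perfect match (a common convention,
    assumed here).
    """
    intersection = sum(p & t
                       for pred_row, target_row in zip(pred, target)
                       for p, t in zip(pred_row, target_row))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A score of 1.0 means the predicted and reference masks coincide, and 0.0 means they share no foreground pixels.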
Details
Motivation: To reduce reliance on large volumes of pixel-level expert annotation and address the high labeling cost of segmenting ultrasound images of pediatric elbow and wrist fractures. Method: A flexible in-context learning (ICL) framework that combines several image concatenation techniques and data augmentation strategies, applied to an intra-video segmentation setting in which annotating a small number of frames is enough to segment unseen frames. Result: On four wrist and elbow ultrasound datasets, FlexICL performs strongly with only 5% of the training images, improving the Dice coefficient by 1-27% over Painter, MAE-VQGAN, U-Net, and TransUNet across 1,252 ultrasound sweeps. Conclusion: FlexICL is an efficient, scalable approach to ultrasound image segmentation that is especially suited to medical imaging settings where labeled data is scarce. Abstract: Elbow and wrist fractures are the most common fractures in pediatric populations. Automatic segmentation of musculoskeletal structures in ultrasound (US) can improve diagnostic accuracy and treatment planning. Fractures appear as cortical defects but require expert interpretation. Deep learning (DL) can provide real-time feedback and highlight key structures, helping lightly trained users perform exams more confidently. However, pixel-wise expert annotations for training remain time-consuming and costly. To address this challenge, we propose FlexICL, a novel and flexible in-context learning (ICL) framework for segmenting bony regions in US images. We apply it to an intra-video segmentation setting, where experts annotate only a small subset of frames, and the model segments unseen frames. We systematically investigate various image concatenation techniques and training strategies for visual ICL and introduce novel concatenation methods that significantly enhance model performance with limited labeled data. By integrating multiple augmentation strategies, FlexICL achieves robust segmentation performance across four wrist and elbow US datasets while requiring only 5% of the training images. It outperforms state-of-the-art visual ICL models such as Painter and MAE-VQGAN, and conventional segmentation models such as U-Net and TransUNet, by 1-27% Dice coefficient on 1,252 US sweeps. These initial results highlight the potential of FlexICL as an efficient and scalable solution for US image segmentation, well suited to medical imaging use cases where labeled data is scarce.
[71] Dynamic VLM-Guided Negative Prompting for Diffusion Models
Hoyeon Chang,Seungjin Kim,Yoonseok Choi
Main category: cs.CV
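TL;DR: Proposes a new method for diffusion models that uses a vision-language model (VLM) to adaptively generate negative prompts during the denoising process.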
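A rough sketch of how a negative prompt could be refreshed during sampling is given below. It assumes the common convention in which the negative-prompt prediction replaces the unconditional branch of classifier-free guidance; `predict_noise`, `preview_image`, `ask_vlm`, and `update` are hypothetical callables supplied by the caller and are not taken from the paper.

```python
def guided_noise(eps_pos, eps_neg, scale):
    """Guidance that pushes away from the negative-prompt prediction.

    Computes eps_neg + scale * (eps_pos - eps_neg) element-wise.
    """
    return [n + scale * (p - n) for p, n in zip(eps_pos, eps_neg)]


def run_sampler(x, prompt, num_steps, query_steps,
                predict_noise, preview_image, ask_vlm, update, scale=7.5):
    """Denoising loop that re-queries a VLM at selected steps.

    All four callables are hypothetical placeholders:
      predict_noise(x, t, text) -> noise estimate (list of floats)
      preview_image(x, t)       -> intermediate image prediction
      ask_vlm(image, prompt)    -> negative prompt string
      update(x, eps, t)         -> next latent
    """
    negative = ""          # start with an empty negative prompt
    history = []
    for t in reversed(range(num_steps)):
        if t in query_steps:
            negative = ask_vlm(preview_image(x, t), prompt)
        history.append((t, negative))
        eps = guided_noise(predict_noise(x, t, prompt),
                           predict_noise(x, t, negative), scale)
        x = update(x, eps, t)
    return x, history
```

Raising `scale` strengthens the push away from the negative prompt, which is the knob behind the trade-off between negative guidance strength and text-image alignment mentioned in the abstract.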
Details
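Motivation: Traditional negative prompting uses fixed prompts that cannot adapt during generation, which limits how far generation quality and text alignment can be optimized. Method: Intermediate image predictions are generated at specific denoising steps, and a VLM is queried to produce contextually appropriate negative prompts, yielding dynamic negative prompting. Result: The method is evaluated on several benchmark datasets, showing the trade-off between negative guidance strength and text-image alignment. Conclusion: The proposed method increases the flexibility and generation quality of diffusion models and improves text-image consistency by generating negative prompts dynamically. Abstract: We propose a novel approach for dynamic negative prompting in diffusion models that leverages Vision-Language Models (VLMs) to adaptively generate negative prompts during the denoising process. Unlike traditional negative prompting methods that use fixed negative prompts, our method generates intermediate image predictions at specific denoising steps and queries a VLM to produce contextually appropriate negative prompts. We evaluate our approach on various benchmark datasets and demonstrate the trade-offs between negative guidance strength and text-image alignment.
[72] Security Risk of Misalignment between Text and Image in Multi-modal Model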
Xiaosen Wang,Zhijin Ge,Shaokang Wang
Main category: cs.CV
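TL;DR: Proposes PReMA, a new attack that manipulates the output of multi-modal diffusion models by generating adversarial images only, without modifying the text prompt, and that poses a new threat to image-editing applications.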
Details
Motivation: Text and image modalities are insufficiently aligned in existing diffusion models, which can lead to unsafe generated content, so their vulnerability to adversarial inputs needs to be examined. Method: The Prompt-Restricted Multi-modal Attack (PReMA) manipulates generated content by modifying the input image in conjunction with a specified prompt, leaving the prompt itself unchanged. Result: PReMA is shown to be effective on image inpainting and style transfer tasks across multiple models and succeeds in producing inappropriate content. Conclusion: PReMA is a new type of threat that exposes the security risk of multi-modal diffusion models in fixed-prompt settings. Abstract: Despite the notable advancements and versatility of multi-modal diffusion models, such as text-to-image models, their susceptibility to adversarial inputs remains underexplored. Contrary to expectations, our investigations reveal that the alignment between textual and image modalities in existing diffusion models is inadequate. This misalignment presents significant risks, especially in the generation of inappropriate or Not-Safe-For-Work (NSFW) content. To this end, we propose a novel attack called Prompt-Restricted Multi-modal Attack (PReMA) to manipulate the generated content by modifying the input image in conjunction with any specified prompt, without altering the prompt itself. PReMA is the first attack that manipulates model outputs by solely creating adversarial images, distinguishing itself from prior methods that primarily generate adversarial prompts to produce NSFW content. Consequently, PReMA poses a novel threat to the integrity of multi-modal diffusion models, particularly in image-editing applications that operate with fixed prompts. Comprehensive evaluations conducted on image inpainting and style transfer tasks across various models confirm the potent efficacy of PReMA.
[73] EgoExo-Con: Exploring View-Invariant Video Temporal Understanding
Minjoon Jung,Junbin Xiao,Junghyun Kim,Byoung-Tak Zhang,Angela Yao
Main category: cs.CV
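TL;DR: This paper introduces EgoExo-Con, a new benchmark for evaluating how consistently Video-LLMs understand time across viewpoints, and proposes the View-GRPO framework to improve cross-view consistency.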
Details
Motivation: To study whether existing Video-LLMs maintain consistent temporal understanding across multi-view videos and to expose their weaknesses in cross-view reasoning. Method: The authors build the EgoExo-Con benchmark of synchronized egocentric and exocentric video pairs with temporal verification and temporal grounding tasks, and propose View-GRPO, a reinforcement learning framework that strengthens consistent understanding across viewpoints during training. Result: Existing models show poor cross-view consistency, and naive fine-tuning hurts single-view performance; View-GRPO improves consistency while outperforming conventional SFT and GRPO. Conclusion: Cross-view consistency of temporal understanding is a weak point of current Video-LLMs, and View-GRPO offers an effective way to address it. Abstract: Can Video-LLMs achieve consistent temporal understanding when videos capture the same event from different viewpoints? To study this, we introduce EgoExo-Con (Consistency), a benchmark of comprehensively synchronized egocentric and exocentric video pairs with human-refined queries in natural language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency, with results far worse than their single-view performances. (2) When naively finetuned with synchronized videos of both viewpoints, the models show improved consistency but often underperform those trained on a single view. For improvements, we propose View-GRPO, a novel reinforcement learning framework that effectively strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method demonstrates its superiority over naive SFT and GRPO, especially for improving cross-view consistency. All resources will be made publicly available.
[74] OracleAgent: A Multimodal Reasoning Agent for Oracle Bone Script Research
Caoshuo Li,Zengmao Ding,Xiaobin Hu,Bang Li,Donghao Luo,Xu Peng,Taisong Jin,Yongge Liu,Shengwei Han,Jing Yang,Xiaoping He,Feng Gao,AndyPian Wu,SevenShu,Chaoyang Wang,Chengjie Wang
Main category: cs.CV
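TL;DR: This paper presents OracleAgent, the first agent system for structured management and retrieval of Oracle Bone Script information, which combines large language models with a multimodal knowledge base to markedly improve the efficiency of Oracle Bone Script research.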
Details
Motivation: Oracle Bone Script research involves complex workflows and inefficient information organization and retrieval, and urgently needs automated tool support. Method: The authors build a multimodal knowledge base containing 1.4 million single-character rubbing images and 80 thousand interpretation texts, and design OracleAgent, an LLM-based agent system that integrates multiple analysis tools for flexible task orchestration and information retrieval. Result: Experiments show that OracleAgent outperforms mainstream multimodal large language models (such as GPT-4o) on multimodal reasoning and generation tasks, and a case study shows that it substantially reduces the time experts spend on research. Conclusion: OracleAgent moves Oracle Bone Script research toward automation and practical use and offers a new paradigm for intelligent assistance in cultural heritage research. Abstract: As one of the earliest writing systems, Oracle Bone Script (OBS) preserves the cultural and intellectual heritage of ancient civilizations. However, current OBS research faces two major challenges: (1) the interpretation of OBS involves a complex workflow comprising multiple serial and parallel sub-tasks, and (2) the efficiency of OBS information organization and retrieval remains a critical bottleneck, as scholars often spend substantial effort searching for, compiling, and managing relevant resources. To address these challenges, we present OracleAgent, the first agent system designed for the structured management and retrieval of OBS-related information. OracleAgent seamlessly integrates multiple OBS analysis tools, empowered by large language models (LLMs), and can flexibly orchestrate these components. Additionally, we construct a comprehensive domain-specific multimodal knowledge base for OBS, which is built through a rigorous multi-year process of data collection, cleaning, and expert annotation. The knowledge base comprises over 1.4M single-character rubbing images and 80K interpretation texts. OracleAgent leverages this resource through its multimodal tools to assist experts in retrieval tasks of character, document, interpretation text, and rubbing image. Extensive experiments demonstrate that OracleAgent achieves superior performance across a range of multimodal reasoning and generation tasks, surpassing leading mainstream multimodal large language models (MLLMs) (e.g., GPT-4o). Furthermore, our case study illustrates that OracleAgent can effectively assist domain experts, significantly reducing the time cost of OBS research.
These results highlight OracleAgent as a significant step toward the practical deployment of OBS-assisted research and automated interpretation systems.
[75] JOGS: Joint Optimization of Pose Estimation and 3D Gaussian Splatting
Yuxuan Li,Tao Wang,Xianben Yang
Main category: cs.CV
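TL;DR: Proposes a unified framework that jointly optimizes 3D Gaussian points and camera poses without pre-calibrated inputs; by alternately updating 3D Gaussian parameters and camera poses, it significantly improves reconstruction quality and pose accuracy for novel view synthesis.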
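The alternating scheme (update scene parameters with poses fixed, then update poses with the scene fixed) is an instance of block coordinate minimization. The toy sketch below shows only that control flow on a two-variable quadratic; the objective, the variable names, and the closed-form updates are assumptions for illustration and have nothing to do with the actual rendering or optical-flow losses used by JOGS.

```python
def toy_loss(scene, pose):
    """Hypothetical coupled objective standing in for a reprojection error."""
    return (scene - 2.0 * pose) ** 2 + (pose - 1.0) ** 2


def alternate(num_rounds=200):
    """Alternate exact minimization over each block while the other is fixed."""
    scene, pose = 0.0, 0.0
    for _ in range(num_rounds):
        # Phase 1: pose fixed, minimize over the scene parameter.
        scene = 2.0 * pose
        # Phase 2: scene fixed, minimize over the pose parameter
        # (set d/dpose of toy_loss to zero and solve for pose).
        pose = (4.0 * scene + 2.0) / 10.0
    return scene, pose
```

On this toy problem the loss never increases from one phase to the next, and the iterates approach the joint minimizer at scene = 2, pose = 1.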
Details
Motivation: Traditional methods depend on external pose estimation tools such as COLMAP, which create computational bottlenecks and propagate errors, so an end-to-end joint optimization method is needed to improve robustness and accuracy. Method: A staged alternating optimization strategy: first, 3D Gaussian parameters are updated through differentiable rendering with poses held fixed; then camera poses are refined with a customized 3D optical flow algorithm that combines geometric and photometric constraints. Result: On multiple datasets the method outperforms existing COLMAP-free methods and generally surpasses the standard COLMAP-based baseline, with especially strong results under large viewpoint changes and sparse features. Conclusion: The proposed joint optimization framework reduces projection error, achieves more accurate scene reconstruction and pose estimation, and advances novel view synthesis that does not rely on external pose estimation. Abstract: Traditional novel view synthesis methods heavily rely on external camera pose estimation tools such as COLMAP, which often introduce computational bottlenecks and propagate errors. To address these challenges, we propose a unified framework that jointly optimizes 3D Gaussian points and camera poses without requiring pre-calibrated inputs. Our approach iteratively refines 3D Gaussian parameters and updates camera poses through a novel co-optimization strategy, ensuring simultaneous improvements in scene reconstruction fidelity and pose accuracy. The key innovation lies in decoupling the joint optimization into two interleaved phases: first, updating 3D Gaussian parameters via differentiable rendering with fixed poses, and second, refining camera poses using a customized 3D optical flow algorithm that incorporates geometric and photometric constraints. This formulation progressively reduces projection errors, particularly in challenging scenarios with large viewpoint variations and sparse feature distributions, where traditional methods struggle. Extensive evaluations on multiple datasets demonstrate that our approach significantly outperforms existing COLMAP-free techniques in reconstruction quality, and also surpasses the standard COLMAP-based baseline in general.
[76] WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios
Runsheng Xu,Hubert Lin,Wonseok Jeon,Hao Feng,Yuliang Zou,Liting Sun,John Gorman,Kate Tolstaya,Sarah Tang,Brandyn White,Ben Sapp,Mingxing Tan,Jyh-Jing Hwang,Drago Anguelov
Main category: cs.CV
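TL;DR: This paper introduces WOD-E2E, a new dataset for end-to-end autonomous driving that focuses on rare long-tail scenarios, together with a new open-loop evaluation metric, Rater Feedback Score (RFS), to better assess how driving systems perform in complex real-world situations.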
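For context on what RFS is contrasted with, the sketch below computes average displacement error (ADE), a conventional open-loop metric that measures the distance between predicted waypoints and the logged trajectory. It is not RFS, whose scoring rule is not specified in this entry, and the function name and tuple-based waypoint format are assumptions.

```python
import math


def average_displacement_error(predicted, logged):
    """Mean Euclidean distance between matching waypoints.

    predicted, logged: equal-length lists of (x, y) tuples in metres.
    """
    if len(predicted) != len(logged) or not logged:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, q) for p, q in zip(predicted, logged)) / len(logged)
```

A metric of this kind penalizes any deviation from the single logged path, even when the alternative manoeuvre is equally reasonable, which is the limitation that motivates scoring against rater-annotated trajectory preferences instead.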
Details
Motivation: Existing end-to-end driving benchmarks concentrate on nominal scenarios and do not adequately test rare but critical long-tail cases, and traditional metrics fail to reflect the quality of driving behavior, so a dataset closer to real challenges and a more effective evaluation method are needed. Method: The authors build WOD-E2E, a dataset of 4,021 driving segments of rare long-tail scenarios with high-level routing information, ego states, and 360-degree camera views, and propose the RFS metric, which scores performance by comparing predicted trajectories with rater-annotated trajectory preference labels. Result: Rater preference labels are released for the WOD-E2E validation set, while test set labels are used for the 2025 WOD-E2E Challenge; the work provides a high-quality evaluation framework for long-tail scenarios. Conclusion: WOD-E2E and RFS give end-to-end autonomous driving research a more challenging and realistic testbed and support the development of more generalizable, robust, and safe driving systems. Abstract: Vision-based end-to-end (E2E) driving has garnered significant interest in the research community due to its scalability and synergy with multimodal large language models (MLLMs). However, current E2E driving benchmarks primarily feature nominal scenarios, failing to adequately test the true potential of these systems. Furthermore, existing open-loop evaluation metrics often fall short in capturing the multi-modal nature of driving or effectively evaluating performance in long-tail scenarios. To address these gaps, we introduce the Waymo Open Dataset for End-to-End Driving (WOD-E2E). WOD-E2E contains 4,021 driving segments (approximately 12 hours), specifically curated for challenging long-tail scenarios that are rare in daily life, with an occurrence frequency of less than 0.03%. Concretely, each segment in WOD-E2E includes the high-level routing information, ego states, and 360-degree camera views from 8 surrounding cameras. To evaluate the E2E driving performance on these long-tail situations, we propose a novel open-loop evaluation metric: Rater Feedback Score (RFS). Unlike conventional metrics that measure the distance between predicted waypoints and the logs, RFS measures how closely the predicted trajectory matches rater-annotated trajectory preference labels. We have released rater preference labels for all WOD-E2E validation set segments, while the held-out test set labels have been used for the 2025 WOD-E2E Challenge.
Through our work, we aim to foster state-of-the-art research into generalizable, robust, and safe end-to-end autonomous driving agents capable of handling complex real-world situations.
[77] Exploring Object-Aware Attention Guided Frame Association for RGB-D SLAM
Ali Caglayan,Nevrez Imamoglu,Oguzhan Guclu,Ali Osman Serhatoglu,Ahmet Burak Can,Ryosuke Nakamura
Main category: cs.CV
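TL;DR: This paper proposes integrating gradient-based attention information into CNN representations for RGB-D indoor SLAM, improving frame association performance.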
Details
Motivation: Existing attention visualization techniques provide visual explanations, but gradient-based attention information is rarely modeled explicitly in semantic understanding tasks, especially in tasks that need spatial awareness such as SLAM. Method: Layer-wise attention information is extracted by combining network gradients with CNN features and is then integrated into the CNN representations to strengthen the focus on key object regions. Result: Experiments show improved performance over baseline methods, particularly for frame association in large environments. Conclusion: Task-specific attention can improve the quality of CNN representations in RGB-D SLAM and help raise localization and mapping accuracy. Abstract: Attention models have recently emerged as a powerful approach, demonstrating significant progress in various fields. Visualization techniques, such as class activation mapping, provide visual insights into the reasoning of convolutional neural networks (CNNs). Using network gradients, it is possible to identify regions where the network pays attention during image recognition tasks. Furthermore, these gradients can be combined with CNN features to localize more generalizable, task-specific attentive (salient) regions within scenes. However, explicit use of this gradient-based attention information integrated directly into CNN representations for semantic object understanding remains limited. Such integration is particularly beneficial for visual tasks like simultaneous localization and mapping (SLAM), where CNN representations enriched with spatially attentive object locations can enhance performance. In this work, we propose utilizing task-specific network attention for RGB-D indoor SLAM. Specifically, we integrate layer-wise attention information derived from network gradients with CNN feature representations to improve frame association performance. Experimental results indicate improved performance compared to baseline methods, particularly for large environments.
[78] FullPart: Generating each 3D Part at Full Resolution
Lihe Ding,Shaocong Dong,Yaokun Li,Chenjian Gao,Xiao Chen,Rui Han,Yihao Kuang,Hong Zhang,Bo Huang,Zhanpeng Huang,Zibin Wang,Dan Xu,Tianfan Xue
Main category: cs.CV
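TL;DR: This paper proposes FullPart, a new 3D part generation framework that combines implicit and explicit representations and generates each part in its own full-resolution voxel grid, markedly improving geometric detail; it also introduces the PartVerse-XL dataset to support further research.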
Details
Motivation: Existing 3D part generation methods fall short on geometric detail: methods based on implicit vector sets lack detail, while explicit methods that share one global low-resolution voxel grid struggle to generate small parts. Method: FullPart first generates the bounding box layout of the parts through an implicit box vector-set diffusion process, then generates each detailed part in its own full-resolution voxel grid, and introduces a center-point encoding strategy to align information exchanged between parts of different sizes. Result: Experiments show that FullPart achieves state-of-the-art performance in 3D part generation and produces high-quality parts with intricate detail, with a particular improvement for small parts. Conclusion: FullPart combines the strengths of implicit and explicit generation, achieves high-fidelity part generation while maintaining global coherence, and releases the large-scale annotated dataset PartVerse-XL to advance 3D part generation. Abstract: Part-based 3D generation holds great potential for various applications. Previous part generators that represent parts using implicit vector-set tokens often suffer from insufficient geometric details. Another line of work adopts an explicit voxel representation but shares a global voxel grid among all parts; this often causes small parts to occupy too few voxels, leading to degraded quality. In this paper, we propose FullPart, a novel framework that combines both implicit and explicit paradigms. It first derives the bounding box layout through an implicit box vector-set diffusion process, a task that implicit diffusion handles effectively since box tokens contain little geometric detail. Then, it generates detailed parts, each within its own fixed full-resolution voxel grid. Instead of sharing a global low-resolution space, each part in our method, even a small one, is generated at full resolution, enabling the synthesis of intricate details. We further introduce a center-point encoding strategy to address the misalignment issue when exchanging information between parts of different actual sizes, thereby maintaining global coherence. Moreover, to tackle the scarcity of reliable part data, we present PartVerse-XL, the largest human-annotated 3D part dataset to date with 40K objects and 320K parts. Extensive experiments demonstrate that FullPart achieves state-of-the-art results in 3D part generation. We will release all code, data, and models to benefit future research in 3D part generation.
[79] BasicAVSR: Arbitrary-Scale Video Super-Resolution via Image Priors and Enhanced Motion Compensation
Wei Shang,Wanying Zhang,Shuhang Gu,Pengfei Zhu,Qinghua Hu,Dongwei Ren
Main category: cs.CV
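TL;DR: This paper proposes BasicAVSR, a strong baseline for arbitrary-scale video super-resolution (AVSR) with four key components: adaptive multi-scale frequency priors based on Laplacian pyramids, a flow-guided propagation unit, a second-order motion compensation unit, and a hyper-upsampling unit. Three propagation variants are designed for different application scenarios. Experiments show that the method significantly outperforms existing methods in quality, generalization, and inference speed.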
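The Laplacian pyramid behind the frequency priors can be sketched on a 1D signal. This is a simplified illustration only: it assumes pairwise averaging for downsampling and nearest-neighbour upsampling, whereas image pyramids normally use Gaussian filtering in 2D, and none of the function names come from the BasicAVSR code.

```python
def downsample(signal):
    """Average adjacent pairs; the length must be even."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]


def upsample(signal):
    """Nearest-neighbour upsampling by a factor of two."""
    return [value for value in signal for _ in range(2)]


def laplacian_pyramid(signal, levels):
    """Split a signal into band-pass detail layers plus a coarse residual."""
    bands = []
    current = list(signal)
    for _ in range(levels):
        coarse = downsample(current)
        # Detail lost by going to the coarser level.
        bands.append([a - b for a, b in zip(current, upsample(coarse))])
        current = coarse
    return bands, current


def reconstruct(bands, residual):
    """Invert the pyramid by adding detail back from coarse to fine."""
    current = residual
    for band in reversed(bands):
        current = [a + b for a, b in zip(upsample(current), band)]
    return current
```

Each band isolates detail at one scale, so a model can be conditioned on whichever frequency range matters for the requested scaling factor, and the decomposition is exactly invertible.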
Details
Motivation: Video super-resolution at varying scaling factors raises challenges in spatial detail recovery, temporal consistency, and computational complexity, and existing methods struggle to balance performance and efficiency, so a strong and flexible baseline is needed to address these problems together. Method: BasicAVSR generates adaptive multi-scale frequency priors from image Laplacian pyramids, aggregates spatiotemporal information with a flow-guided propagation unit, improves inter-frame alignment with second-order motion compensation, and uses a hyper-upsampling unit to produce scale-aware, content-independent upsampling kernels; three RNN propagation variants support online, limited-delay, and offline scenarios. Result: BasicAVSR significantly outperforms existing methods on multiple datasets, with strong super-resolution quality, good generalization, and faster inference, and adapts well to the different application scenarios. Conclusion: BasicAVSR provides an effective, general baseline for arbitrary-scale video super-resolution, and its core components extend to multiple frameworks. Abstract: Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we propose a strong baseline BasicAVSR for AVSR by integrating four key components: 1) adaptive multi-scale frequency priors generated from image Laplacian pyramids, 2) a flow-guided propagation unit to aggregate spatiotemporal information from adjacent frames, 3) a second-order motion compensation unit for more accurate spatial alignment of adjacent frames, and 4) a hyper-upsampling unit to generate scale-aware and content-independent upsampling kernels. To meet diverse application demands, we instantiate three propagation variants: (i) a unidirectional RNN unit for strictly online inference, (ii) a unidirectional RNN unit empowered with a limited lookahead that tolerates a small output delay, and (iii) a bidirectional RNN unit designed for offline tasks where computational resources are less constrained. Experimental results demonstrate the effectiveness and adaptability of our model across these different scenarios. Through extensive experiments, we show that BasicAVSR significantly outperforms existing methods in terms of super-resolution quality, generalization ability, and inference speed. Our work not only advances the state-of-the-art in AVSR but also extends its core components to multiple frameworks for diverse scenarios.
The code is available at https://github.com/shangwei5/BasicAVSR.
[80] MV-MLM: Bridging Multi-View Mammography and Language for Breast Cancer Diagnosis and Risk Prediction
Shunjie-Fabian Zheng,Hyeonjun Lee,Thijs Kooi,Ali Diba
Main category: cs.CV
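TL;DR: Proposes a Multi-View Mammography and Language Model (MV-MLM) that uses synthetic radiology reports for cross-modal self-supervised learning, achieving state-of-the-art performance in breast cancer classification and risk prediction with strong data efficiency.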
Details
Motivation: Acquiring large datasets with fine-grained annotation is costly and time-consuming, which holds back breast cancer diagnosis models, so a more efficient way to exploit unlabeled or synthetic text data is needed. Method: A cross-modal self-supervised learning framework is built on multi-view mammogram images and synthetic radiology reports, using a joint visual-textual learning strategy with multi-view supervision over image-text pairs to learn rich representations. Result: Evaluated on private and public datasets, the model reaches state-of-the-art performance on malignancy classification, subtype classification, and image-based cancer risk prediction, and shows strong data efficiency, outperforming fully supervised and other VLM baselines. Conclusion: By using synthetic reports and multi-view cross-modal learning, MV-MLM improves the accuracy and generalization of breast cancer detection and risk prediction without requiring real radiology reports. Abstract: Large annotated datasets are essential for training robust Computer-Aided Diagnosis (CAD) models for breast cancer detection or risk prediction. However, acquiring such datasets with fine-detailed annotation is both costly and time-consuming. Vision-Language Models (VLMs), such as CLIP, which are pre-trained on large image-text pairs, offer a promising solution by enhancing robustness and data efficiency in medical imaging tasks. This paper introduces a novel Multi-View Mammography and Language Model for breast cancer classification and risk prediction, trained on a dataset of paired mammogram images and synthetic radiology reports. Our MV-MLM leverages multi-view supervision to learn rich representations from extensive radiology data by employing cross-modal self-supervision across image-text pairs. This includes multiple views and the corresponding pseudo-radiology reports. We propose a novel joint visual-textual learning strategy to enhance generalization and accuracy over different data types and tasks, to distinguish breast tissues or cancer characteristics (calcification, mass), and to utilize these patterns to understand mammography images and predict cancer risk. We evaluated our method on both private and publicly available datasets, demonstrating that the proposed model achieves state-of-the-art performance in three classification tasks: (1) malignancy classification, (2) subtype classification, and (3) image-based cancer risk prediction.
Furthermore, the model exhibits strong data efficiency, outperforming existing fully supervised or VLM baselines while trained on synthetic text reports and without the need for actual radiology reports.
[81] Detecting Unauthorized Vehicles using Deep Learning for Smart Cities: A Case Study on Bangladesh
Sudipto Das Sukanto,Diponker Roy,Fahim Shakil,Nirjhar Singha,Abdullah Asik,Aniket Joarder,Mridha Md Nafis Fuad,Muhammad Ibrahim
Main category: cs.CV
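TL;DR: This paper presents a YOLOv8-based machine learning method for real-time detection of motorized rickshaws (auto-rickshaws) in traffic images from Bangladesh, addressing the difficulty conventional surveillance has in telling motorized and non-motorized rickshaws apart.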
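The reported mAP50, precision, and recall all rest on matching predicted boxes to ground-truth boxes at an intersection-over-union (IoU) threshold of 0.5. The sketch below shows that matching step in plain Python; it assumes greedy one-to-one matching with predictions already sorted by confidence, and it does not reproduce the full average-precision computation or any YOLOv8 library API.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truth, threshold=0.5):
    """Greedy matching: each ground-truth box can be matched at most once.

    predictions are assumed to be sorted by descending confidence.
    """
    unmatched = list(ground_truth)
    true_positives = 0
    for box in predictions:
        best = max(unmatched, key=lambda gt: iou(box, gt), default=None)
        if best is not None and iou(box, best) >= threshold:
            true_positives += 1
            unmatched.remove(best)
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

Average precision at IoU 0.5 is then obtained by sweeping the confidence threshold and integrating precision over recall, a step left out here for brevity.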
Details
Motivation: Traffic rules bar auto-rickshaws from certain routes, but existing surveillance systems cannot reliably distinguish motorized from non-motorized rickshaws, and manual video analysis is slow and labor-intensive, so an automated detection method is needed. Method: The YOLOv8 model is used for real-time object detection, trained on a dataset of 1,730 annotated images captured under a variety of traffic conditions. Result: The model performs well in real-time detection, reaching an mAP50 of 83.447% with binary precision and recall above 78%, and works in both dense and sparse traffic. Conclusion: The proposed YOLOv8 model detects auto-rickshaws automatically and effectively, has practical potential, and comes with a public dataset that supports follow-up research. Abstract: Modes of transportation vary across countries depending on geographical location and cultural context. In South Asian countries, rickshaws are among the most common means of local transport. Based on their mode of operation, rickshaws in cities across Bangladesh can be broadly classified into non-auto (pedal-powered) and auto-rickshaws (motorized). Monitoring the movement of auto-rickshaws is necessary as traffic rules often restrict auto-rickshaws from accessing certain routes. However, existing surveillance systems make it quite difficult to monitor them due to their similarity to other vehicles, especially non-auto rickshaws, whereas manual video analysis is too time-consuming. This paper presents a machine learning-based approach to automatically detect auto-rickshaws in traffic images. In this system, we perform real-time object detection using the YOLOv8 model. For training purposes, we prepared a set of 1,730 annotated images that were captured under various traffic conditions. The results show that our proposed model performs well in real-time auto-rickshaw detection and offers an mAP50 of 83.447% and binary precision and recall values above 78%, demonstrating its effectiveness in handling both dense and sparse traffic scenarios. The dataset has been publicly released for further research.
[82] CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark
Jiaqi Wang,Xiao Yang,Kai Sun,Parth Suresh,Sanat Sharma,Adam Czyzewski,Derek Andersen,Surya Appini,Arkav Banerjee,Sajal Choudhary,Shervin Ghasemlou,Ziqiang Guan,Akil Iyer,Haidar Khan,Lingkun Kong,Roy Luo,Tiffany Ma,Zhen Qiao,David Tran,Wenfang Xu,Skyler Yeatman,Chen Zhou,Gunveer Gujral,Yinglong Xia,Shane Moon,Nicolas Scheffer,Nirav Shah,Eun Chang,Yue Liu,Florian Metze,Tammy Stark,Zhaleh Feizollahi,Andrea Jessee,Mangesh Pujari,Ahmed Aly,Babak Damavandi,Rakesh Wanga,Anuj Kumar,Rohit Patel,Wen-tau Yih,Xin Luna Dong
Main category: cs.CV
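TL;DR: This paper presents CRAG-MM, a comprehensive benchmark for multi-modal, multi-turn retrieval-augmented generation in wearable-device scenarios. It contains 6.5K (image, question, answer) triplets and 2K multi-turn conversations across 13 domains and defines three tasks that evaluate single-source, multi-source, and multi-turn performance. Experiments show that existing methods still have substantial room to improve in truthfulness and answer quality.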
Details
Motivation: Multi-modal retrieval-augmented generation (MM-RAG) lacks a comprehensive benchmark for wearable-device scenarios, which makes it hard to evaluate models in real-world multi-turn visual conversations, so a challenging benchmark close to practical use is needed. Method: The authors build the CRAG-MM benchmark with 6.5K (image, question, answer) triplets and 2K multi-turn conversations across 13 domains, including 6.2K egocentric images; they design three tasks (single-source, multi-source, and multi-turn conversation), each with a retrieval corpus and APIs for image-KG and webpage retrieval; question design accounts for realistic factors such as image-quality issues, question type, entity popularity, and information dynamism. Result: Straightforward RAG approaches reach only 32% and 43% truthfulness on single-turn and multi-turn QA, and state-of-the-art industry solutions reach only 32%/45%; KDD Cup 2025 was hosted on the benchmark, attracting about 1,000 participants and 5,000 submissions, with winning solutions improving baseline performance by 28%. Conclusion: CRAG-MM is the first comprehensive multi-modal RAG benchmark for wearable-device scenarios; it exposes the shortcomings of existing methods, supports progress in multi-modal retrieval-augmented generation, and has had early impact in academia and industry. Abstract: Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially regarding wearables scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations -- each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement.
The benchmark has hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.
[83] MoTDiff: High-resolution Motion Trajectory estimation from a single blurred image using Diffusion models
Wontae Choi,Jaelin Lee,Hyung Sup Yun,Byeungwoo Jeon,Il Yong Chun
Main category: cs.CV
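TL;DR: Proposes MoTDiff, a diffusion-based framework for high-resolution motion trajectory estimation that recovers high-quality motion trajectories from a single motion-blurred image and outperforms existing methods in blind deblurring and coded exposure photography.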
Details
Motivation: Existing methods for extracting motion information from a single blurred image (such as blur kernels and optical flow) yield low-quality results and cannot recover fine motion trajectories accurately, so a new method that produces high-resolution, precise motion representations is needed. Method: The MoTDiff framework has two key parts: 1) a new conditional diffusion model conditioned on multi-scale feature maps of the single blurred image; 2) a new training method that promotes precise identification of fine-grained motion trajectories, consistent overall shape and position of the motion path, and pixel connectivity along the trajectory. Result: Experiments show that MoTDiff outperforms current state-of-the-art methods in both blind image deblurring and coded exposure photography and produces higher-quality motion trajectory estimates. Conclusion: MoTDiff is the first framework to use diffusion models for high-resolution motion trajectory estimation; it improves the accuracy and quality of motion information recovered from a single blurred image and has broad application potential. Abstract: Accurate estimation of motion information is crucial in diverse computational imaging and computer vision applications. Researchers have investigated various methods to extract motion information from a single blurred image, including blur kernels and optical flow. However, existing motion representations are often of low quality, i.e., coarse-grained and inaccurate. In this paper, we propose the first high-resolution (HR) Motion Trajectory estimation framework using Diffusion models (MoTDiff). Different from existing motion representations, we aim to estimate a high-quality HR motion trajectory from a single motion-blurred image. The proposed MoTDiff consists of two key components: 1) a new conditional diffusion framework that uses multi-scale feature maps extracted from a single blurred image as a condition, and 2) a new training method that can promote precise identification of a fine-grained motion trajectory, consistent estimation of overall shape and position of a motion path, and pixel connectivity along a motion trajectory. Our experiments demonstrate that the proposed MoTDiff can outperform state-of-the-art methods in both blind image deblurring and coded exposure photography applications.
[84] ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts
Jinho Choi,Hyesu Lim,Steffen Schneider,Jaegul Choo
Main category: cs.CV
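TL;DR: ConceptScope is a scalable, automated framework that uses sparse autoencoders and vision foundation models to discover and quantify interpretable concepts in visual datasets, for identifying dataset bias and evaluating model robustness.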
Details
Motivation: Data points in machine learning datasets are commonly skewed toward certain concepts, and fine-grained attribute annotation is expensive, so a method that identifies dataset bias systematically without manual annotation is needed. Method: The ConceptScope framework trains sparse autoencoders on representations from vision foundation models to discover and quantify human-interpretable visual concepts automatically, and categorizes concepts as target, context, or bias according to their semantic relevance and statistical correlation with class labels, enabling class-level dataset analysis and concept-based subgroup evaluation. Result: Experiments show that ConceptScope captures a wide range of visual concepts, including objects, textures, backgrounds, facial attributes, emotions, and actions, and produces spatial attributions aligned with semantically meaningful image regions; it reliably detects known biases (such as background bias in Waterbirds) and uncovers new unannotated ones (such as co-occurring objects in ImageNet). Conclusion: ConceptScope is a practical tool for dataset auditing and model diagnostics that enables systematic bias analysis and robustness evaluation of visual datasets without manual annotation. Abstract: Dataset bias, where data points are skewed to certain concepts, is ubiquitous in machine learning datasets. Yet, systematically identifying these biases is challenging without costly, fine-grained attribute annotations. We present ConceptScope, a scalable and automated framework for analyzing visual datasets by discovering and quantifying human-interpretable concepts using Sparse Autoencoders trained on representations from vision foundation models. ConceptScope categorizes concepts into target, context, and bias types based on their semantic relevance and statistical correlation to class labels, enabling class-level dataset characterization, bias identification, and robustness evaluation through concept-based subgrouping. We validate that ConceptScope captures a wide range of visual concepts, including objects, textures, backgrounds, facial attributes, emotions, and actions, through comparisons with annotated datasets. Furthermore, we show that concept activations produce spatial attributions that align with semantically meaningful image regions. ConceptScope reliably detects known biases (e.g., background bias in Waterbirds) and uncovers previously unannotated ones (e.g., co-occurring objects in ImageNet), offering a practical tool for dataset auditing and model diagnostics.
[85] Sketch2PoseNet: Efficient and Generalized Sketch to 3D Human Pose Prediction
Li Wang,Yiyu Zhuang,Yanwen Wang,Xun Cao,Chuan Guo,Xinxin Zuo,Hao Zhu
Main category: cs.CV
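TL;DR: Proposes a 3D human pose estimation method based on synthetic data: a diffusion model is used to generate SKEP-120K, a large-scale dataset of paired sketches and 3D poses, and an end-to-end data-driven framework is built on it that substantially outperforms previous methods in both accuracy and speed.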
Details
Motivation: Existing sketch-to-3D-pose methods are limited by the lack of large-scale annotated data and rely on optimization with heuristic rules, which is slow and generalizes poorly. Method: A "learn from synthesis" strategy: first, a diffusion model is trained to generate sketches from 2D poses projected from 3D poses, producing the synthetic SKEP-120K dataset; then an end-to-end framework combines a 2D pose detector, diffusion priors, and a feed-forward neural network for pose estimation, with multiple loss terms that enforce geometric coherence and accurate self-contacts. Result: In qualitative, quantitative, and subjective evaluations, the method substantially outperforms previous approaches in estimation accuracy and speed and works across many sketch styles. Conclusion: Synthetic data resolves the scarcity of sketch-3D pose annotations and enables efficient, accurate, and robust sketch-to-3D human pose estimation. Abstract: 3D human pose estimation from sketches has broad applications in computer animation and film production. Unlike traditional human pose estimation, this task presents unique challenges due to the abstract and disproportionate nature of sketches. Previous sketch-to-pose methods, constrained by the lack of large-scale sketch-3D pose annotations, primarily relied on optimization with heuristic rules, an approach that is both time-consuming and limited in generalizability. To address these challenges, we propose a novel approach leveraging a "learn from synthesis" strategy. First, a diffusion model is trained to synthesize sketch images from 2D poses projected from 3D human poses, mimicking disproportionate human structures in sketches. This process enables the creation of a synthetic dataset, SKEP-120K, consisting of 120k accurate sketch-3D pose annotation pairs across various sketch styles. Building on this synthetic dataset, we introduce an end-to-end data-driven framework for estimating human poses and shapes from diverse sketch styles. Our framework combines existing 2D pose detectors and generative diffusion priors for sketch feature extraction with a feed-forward neural network for efficient 2D pose estimation. Multiple heuristic loss functions are incorporated to guarantee geometric coherence between the derived 3D poses and the detected 2D poses while preserving accurate self-contacts.
Qualitative, quantitative, and subjective evaluations collectively show that our model substantially surpasses previous ones in both estimation accuracy and speed for sketch-to-pose tasks.[86] Developing a Multi-task Ensemble Geometric Deep Network for Supply Chain Sustainability and Risk Management
Mehdi Khaleghi,Nastaran Khaleghi,Sobhan Sheykhivand,Sebelan Danishvar
Main category: cs.CV
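TL;DR: Proposes a hybrid deep learning model based on the Chebyshev ensemble geometric network (Ch-EGN) to improve supply chain sustainability and risk management, reaching near-optimal or optimal accuracy on several tasks.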
Details
Motivation: Improving supply chain sustainability and operational efficiency requires effective risk management and accurate product classification, and traditional methods struggle to exploit the complex dependencies in supply chain data. Method: Proposes the Chebyshev ensemble geometric network (Ch-EGN), a hybrid model that fuses convolutional neural networks with geometric deep learning and uses graph structure to capture hidden states and information dependencies among samples; it is evaluated on risk prediction, product classification, and edge classification on two real-world datasets (SupplyGraph and DataCo). Result: Average accuracy reaches 98.95% for risk prediction, 100% for 5-group product classification, 98.07% for 4-class product relation classification, and 92.37% for 25-class company relation classification, outperforming existing state-of-the-art methods overall. Conclusion: The proposed Ch-EGN model effectively mines the complex structural information in supply chains and markedly improves the accuracy of risk management and classification tasks, offering an efficient technical route toward sustainable supply chains. Abstract: The sustainability of the supply chain plays a key role in achieving optimal performance in controlling the supply chain. The management of risks that occur in a supply chain is a fundamental problem for developing the sustainability of the network and raising the performance efficiency of the supply chain. The correct classification of products is another essential element in a sustainable supply chain. Acknowledging recent breakthroughs in the context of deep networks, several architectural options have been deployed to analyze supply chain datasets. A novel geometric deep network is used to propose an ensemble deep network. The proposed Chebyshev ensemble geometric network (Ch-EGN) is a hybrid convolutional and geometric deep learning model. This network is proposed to leverage the information dependencies in the supply chain to derive invisible states of samples in the database. The functionality of the proposed deep network is assessed on two different databases. The SupplyGraph Dataset and DataCo are considered in this research. The prediction of delivery status of the DataCo supply chain is done for risk administration. The product classification and edge classification are performed using the SupplyGraph database to enhance the sustainability of the supply network. An average accuracy of 98.95% is obtained for the ensemble network for risk management. Average accuracies of 100% and 98.07% are obtained for sustainable supply chain in terms of 5 product group classification and 4 product relation classification, respectively. An average accuracy of 92.37% is attained for 25 company relation classification. The results confirm an average improvement and efficiency of the proposed method compared to the state-of-the-art approaches.
[87] OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation
Hengrui Kang,Zhuangcheng Gu,Zhiyuan Zhao,Zichen Wen,Bin Wang,Weijia Li,Conghui He
Main category: cs.CV
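TL;DR: Introduces OmniLayout-1M, the first million-scale dataset of diverse document layouts, and OmniLayout-LLM, a 0.5B-parameter model trained with a two-stage coarse-to-fine paradigm to generate complex and varied document layouts, substantially outperforming existing methods across multiple domains.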
Details
Motivation: Existing work on document layout generation is limited by a lack of layout diversity: it concentrates on a few types such as academic papers, while real-world layouts such as newspapers and magazines are understudied, so a broader dataset and a stronger generative model are needed. Method: Builds the OmniLayout-1M dataset, covering six common document types, and designs OmniLayout-LLM with a two-stage coarse-to-fine learning paradigm: stage one learns universal layout principles from coarse categories, and stage two transfers to a specific domain using fine-grained annotations. Result: Experiments on multiple domains of the M$^{6}$Doc dataset show that the method substantially outperforms existing layout generation methods and several mainstream large language models. Conclusion: OmniLayout-1M and OmniLayout-LLM provide a new data foundation and modeling paradigm for document layout generation, moving the field toward more open and diverse real-world scenarios. Abstract: Document AI has advanced rapidly and is attracting increasing attention. Yet, while most efforts have focused on document layout analysis (DLA), its generative counterpart, document layout generation, remains underexplored. A major obstacle lies in the scarcity of diverse layouts: academic papers with Manhattan-style structures dominate existing studies, while open-world genres such as newspapers and magazines remain severely underrepresented. To address this gap, we curate OmniLayout-1M, the first million-scale dataset of diverse document layouts, covering six common document types and comprising contemporary layouts collected from multiple sources. Moreover, since existing methods struggle in complex domains and often fail to arrange long sequences coherently, we introduce OmniLayout-LLM, a 0.5B model with a designed two-stage Coarse-to-Fine learning paradigm: 1) learning universal layout principles from OmniLayout-1M with coarse category definitions, and 2) transferring the knowledge to a specific domain with fine-grained annotations. Extensive experiments demonstrate that our approach achieves strong performance on multiple domains in the M$^{6}$Doc dataset, substantially surpassing both existing layout generation experts and several latest general-purpose LLMs. Our code, models, and dataset will be publicly released.
[88] Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Shiho Matta,Lis Kanashiro Pereira,Peitao Han,Fei Cheng,Shigeru Kitazawa
Main category: cs.CV
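TL;DR: Introduces AoT-PsyPhyBENCH, a psychophysically validated benchmark for evaluating whether vision-language models (VLMs) can judge the direction of time (forward or reversed playback) in natural videos. Existing VLMs perform near chance on physically irreversible processes and causal manual actions, far behind humans, exposing a fundamental deficit in temporal continuity and causal understanding.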
Details
Motivation: Although modern vision-language models perform well on multimodal tasks, their understanding of temporal information in video remains weak and insufficiently evaluated. The paper uses the simple but revealing "arrow of time" task to expose the shortcomings of current models in temporal and causal reasoning. Method: Builds AoT-PsyPhyBENCH, a psychophysically validated benchmark that reuses the stimuli and behavioral baselines from human experiments, and systematically evaluates open-weight and proprietary, reasoning and non-reasoning VLMs on judging video playback direction. Result: Most models perform near chance, and even the best remains well below human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (e.g., division/addition). Temporal direction that humans recognize almost instantly is hard for current VLMs to capture. Conclusion: Current multimodal systems capture rich visual-semantic correlations but lack the inductive biases needed for temporal continuity and causal understanding. AoT-PsyPhyBENCH is released to drive further progress in the physical and temporal reasoning of VLMs. Abstract: Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), that is, whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
[89] Revisiting Generative Infrared and Visible Image Fusion Based on Human Cognitive Laws
Lin Guo,Xiaoqing Luo,Wei Xie,Zhancheng Zhang,Hui Li,Rui Wang,Zhenhua Feng,Xiaoning Song
Main category: cs.CV
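TL;DR: Proposes HCLFuse, an infrared and visible image fusion method inspired by human cognition, which combines a multi-scale mask-regulated variational bottleneck encoder with a time-varying, physics-guided diffusion model to produce high-quality, structurally consistent fused images.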
Details
Motivation: Existing fusion methods fall short in balancing modal information and in interpretability, making it hard to stay reliable and consistent in complex scenes; the paper aims to improve the fidelity and interpretability of generative image fusion. Method: Proposes HCLFuse, which designs a multi-scale mask-regulated variational bottleneck encoder for information decomposition and low-level feature extraction, and couples a diffusion model with physical laws to form a time-varying physical guidance mechanism that strengthens the model's perception of the intrinsic structure of the data. Result: Achieves state-of-the-art fusion performance in qualitative and quantitative evaluations on multiple datasets and significantly improves semantic segmentation metrics. Conclusion: By combining human cognitive mechanisms with physical laws, HCLFuse shows superior structure preservation and detail quality in generative image fusion and has good application prospects. Abstract: Existing infrared and visible image fusion methods often face the dilemma of balancing modal information. Generative fusion methods reconstruct fused images by learning from data distributions, but their generative capabilities remain limited. Moreover, the lack of interpretability in modal information selection further affects the reliability and consistency of fusion results in complex scenarios. This manuscript revisits the essence of generative image fusion under the inspiration of human cognitive laws and proposes a novel infrared and visible image fusion method, termed HCLFuse. First, HCLFuse investigates the quantification theory of information mapping in unsupervised fusion networks, which leads to the design of a multi-scale mask-regulated variational bottleneck encoder. This encoder applies posterior probability modeling and information decomposition to extract accurate and concise low-level modal information, thereby supporting the generation of high-fidelity structural details. Furthermore, the probabilistic generative capability of the diffusion model is integrated with physical laws, forming a time-varying physical guidance mechanism that adaptively regulates the generation process at different stages, thereby enhancing the ability of the model to perceive the intrinsic structure of data and reducing dependence on data quality. Experimental results show that the proposed method achieves state-of-the-art fusion performance in qualitative and quantitative evaluations across multiple datasets and significantly improves semantic segmentation metrics. This demonstrates the advantages of this cognition-inspired generative image fusion method in enhancing structural consistency and detail quality.
[90] Exploring Complementarity and Explainability in CNNs for Periocular Verification Across Acquisition Distances
Fernando Alonso-Fernandez,Kevin Hernandez Diaz,Jose M. Buades,Kiran Raja,Josef Bigun
Main category: cs.CV
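TL;DR: Studies the complementarity of CNN models for periocular verification at different distances on the UBIPr database. Fusing three architectures (SqueezeNet, MobileNetv2, and ResNet50) yields substantial gains, and LIME heatmaps are used to analyze how the models' attention differs.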
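The abstract of this entry mentions comparing the CNNs' attention patterns with Jensen-Shannon divergence over LIME heatmaps. As a rough illustration only, the sketch below computes a base-2 Jensen-Shannon divergence between two small heatmaps. It assumes each heatmap is a 2D list of weights, that negative weights are clipped to zero, and that each map has at least one positive entry; the paper's actual preprocessing is not described in this listing, and the function names are hypothetical.

```python
import math

def to_distribution(heatmap):
    """Flatten a 2D heatmap and normalize it to sum to 1.

    Assumption: negative weights are clipped to zero before normalizing.
    """
    flat = [max(value, 0.0) for row in heatmap for value in row]
    total = sum(flat)
    return [value / total for value in flat]

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(heatmap_a, heatmap_b):
    """Jensen-Shannon divergence (base 2), bounded between 0 and 1."""
    p = to_distribution(heatmap_a)
    q = to_distribution(heatmap_b)
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)
```

With the base-2 logarithm, a value near 0 means two networks attend to the same regions, and a value near 1 means their attention barely overlaps, which is the kind of evidence the abstract cites for complementarity.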
Details
Motivation: To explore how CNN models of different complexity complement each other in periocular recognition at different acquisition distances, and to improve cross-distance verification performance. Method: Trains three CNNs (SqueezeNet, MobileNetv2, and ResNet50) on eye crops from VGGFace2 and performs periocular verification on UBIPr; matching is evaluated with cosine similarity and chi-square distance; score-level fusion is done with logistic regression; and LIME heatmaps with Jensen-Shannon divergence are used to compare the models' attention. Result: ResNet50 performs best individually, but fusing all three brings substantial gains; the models focus on different image regions, which confirms their complementarity; and the method sets a new state of the art on UBIPr. Conclusion: CNNs with different structures are complementary for periocular recognition, multi-model fusion effectively improves cross-distance verification, and the attention analysis further supports the complementarity. Abstract: We study the complementarity of different CNNs for periocular verification at different distances on the UBIPr database. We train three architectures of increasing complexity (SqueezeNet, MobileNetv2, and ResNet50) on a large set of eye crops from VGGFace2. We analyse performance with cosine and chi2 metrics, compare different network initialisations, and apply score-level fusion via logistic regression. In addition, we use LIME heatmaps and Jensen-Shannon divergence to compare attention patterns of the CNNs. While ResNet50 consistently performs best individually, the fusion provides substantial gains, especially when combining all three networks. Heatmaps show that networks usually focus on distinct regions of a given image, which explains their complementarity. Our method significantly outperforms previous works on UBIPr, achieving a new state-of-the-art.
[91] Beyond Imitation: Constraint-Aware Trajectory Generation with Flow Matching For End-to-End Autonomous Driving
Lin Liu,Guanyi Yu,Ziying Song,Junqiao Li,Caiyan Jia,Feiyang Jia,Peiliang Wu,Yandan Luo
Main category: cs.CV
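TL;DR: Proposes CATG, an autonomous driving planning framework based on constrained flow matching that mitigates mode collapse and incorporates safety and kinematic constraints directly into the generative process.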
Details
Motivation: Existing imitation learning methods suffer from mode collapse, while generative models struggle to incorporate safety and physical constraints directly and need an extra optimization step. Method: Models the generative process with Constrained Flow Matching, imposes explicit constraints within the flow matching process, and treats driving aggressiveness as a controllable signal. Result: In the NavSim v2 challenge, CATG placed 2nd with an EPDMS score of 51.31 and received the Innovation Award. Conclusion: CATG generates diverse trajectories that satisfy safety and physical constraints, showing strong practicality and novelty. Abstract: Planning is a critical component of end-to-end autonomous driving. However, prevailing imitation learning methods often suffer from mode collapse, failing to produce diverse trajectory hypotheses. Meanwhile, existing generative approaches struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. To address these limitations, we propose CATG, a novel planning framework that leverages Constrained Flow Matching. Concretely, CATG explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our primary contribution is the novel imposition of explicit constraints directly within the flow matching process, ensuring that the generated trajectories adhere to vital safety and kinematic rules. Secondly, CATG parameterizes driving aggressiveness as a control signal during generation, enabling precise manipulation of trajectory style. Notably, on the NavSim v2 challenge, CATG achieved 2nd place with an EPDMS score of 51.31 and was honored with the Innovation Award.
[92] Leveraging Large-Scale Face Datasets for Deep Periocular Recognition via Ocular Cropping
Fernando Alonso-Fernandez,Kevin Hernandez-Diaz,Jose Maria Buades Rubio,Josef Bigun
Main category: cs.CV
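TL;DR: Studies biometric recognition from the periocular region using three convolutional neural networks of different depth and complexity, trained on the large-scale VGGFace2 dataset and tested on VGGFace2-Pose and UFPR-Periocular. Owing to differences in image quality and acquisition conditions, Equal Error Rates (EERs) are 9-15% on VGGFace2 but 1-2% on UFPR-Periocular, the lowest reported so far.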
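Since this entry is reported entirely in terms of Equal Error Rate, a small sketch of how an EER can be estimated from verification scores may help. It assumes higher scores mean a more likely match and scans only the observed scores as thresholds; the function name and the toy scores in the usage note are hypothetical, and production evaluations usually interpolate between thresholds rather than picking the closest one.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER: the operating point where the false acceptance rate
    (impostors scoring at or above the threshold) equals the false rejection
    rate (genuine pairs scoring below it).

    Assumption: higher score = more similar. Returns a fraction in [0, 1].
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, best_eer = None, None
    for threshold in thresholds:
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Report the midpoint of FAR and FRR at the closest crossing.
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer
```

For example, with genuine scores `[0.9, 0.8, 0.4, 0.7]` and impostor scores `[0.1, 0.5, 0.3, 0.2]`, one genuine pair and one impostor pair fall on the wrong side of the threshold 0.5, so the estimate is 0.25 (25%).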
Details
Motivation: The periocular region is highly discriminative and has few acquisition constraints, but existing studies mostly rely on small datasets and lack validation at scale, so the performance of deep networks for periocular recognition on large-scale data needs to be assessed. Method: Trains three convolutional neural networks of different complexity on about 1.9 million ocular crops extracted from VGGFace2, and runs periocular recognition experiments on VGGFace2-Pose and UFPR-Periocular to evaluate performance under different acquisition conditions. Result: On VGGFace2-Pose, the periocular EER is 9-15%; on UFPR-Periocular it drops to 1-2%, the best performance reported on that dataset. Conclusion: Large-scale training data helps periocular recognition, but uncontrolled conditions remain challenging; high-quality, consistently acquired data markedly improves accuracy, and the method achieves state-of-the-art results on UFPR-Periocular. Abstract: We focus on ocular biometrics, specifically the periocular region (the area around the eye), which offers high discrimination and minimal acquisition constraints. We evaluate three Convolutional Neural Network architectures of varying depth and complexity to assess their effectiveness for periocular recognition. The networks are trained on 1,907,572 ocular crops extracted from the large-scale VGGFace2 database. This significantly contrasts with existing works, which typically rely on small-scale periocular datasets for training having only a few thousand images. Experiments are conducted with ocular images from VGGFace2-Pose, a subset of VGGFace2 containing in-the-wild face images, and the UFPR-Periocular database, which consists of selfies captured via mobile devices with user guidance on the screen. Due to the uncontrolled conditions of VGGFace2, the Equal Error Rates (EERs) obtained with ocular crops range from 9-15%, noticeably higher than the 3-6% EERs achieved using full-face images. In contrast, UFPR-Periocular yields significantly better performance (EERs of 1-2%), thanks to higher image quality and more consistent acquisition protocols. To the best of our knowledge, these are the lowest reported EERs on the UFPR dataset to date.
[93] Towards Realistic Earth-Observation Constellation Scheduling: Benchmark and Methodology
Luting Wang,Yinghao Xiang,Hongliang Huang,Dongjun Li,Chen Gao,Si Liu
Main category: cs.CV
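TL;DR: Introduces AEOS-Bench, the first large-scale benchmark for realistic scheduling of agile Earth observation satellite constellations, and AEOS-Former, a constraint-aware Transformer-based scheduling model that outperforms baselines in task completion and energy efficiency.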
Details
Motivation: Existing methods tend to simplify the scheduling of agile Earth observation satellites under large-scale scenarios, dynamic environments, and strict constraints, which limits real-world performance; a unified, realistic benchmark and an efficient scheduling model are missing. Method: Builds AEOS-Bench, a high-fidelity simulation benchmark with 3,907 satellite assets and 16,410 scenarios, and proposes AEOS-Former, a constraint-aware Transformer-based scheduling model with a dedicated internal constraint module that explicitly models each satellite's physical and operational limits and adapts to diverse scenarios through simulation-based iterative learning. Result: AEOS-Former outperforms baseline models in task completion and energy efficiency, and ablations confirm the contribution of each component; AEOS-Bench is the first large-scale benchmark for realistic constellation scheduling. Conclusion: AEOS-Bench and AEOS-Former provide an effective evaluation platform and a high-performing solution for scheduling agile Earth observation satellite constellations in complex, dynamic environments. Abstract: Agile Earth Observation Satellite (AEOS) constellations offer unprecedented flexibility for monitoring the Earth's surface, but their scheduling remains challenging under large-scale scenarios, dynamic environments, and stringent constraints. Existing methods often simplify these complexities, limiting their real-world performance. We address this gap with a unified framework integrating a standardized benchmark suite and a novel scheduling model. Our benchmark suite, AEOS-Bench, contains $3,907$ finely tuned satellite assets and $16,410$ scenarios. Each scenario features $1$ to $50$ satellites and $50$ to $300$ imaging tasks. These scenarios are generated via a high-fidelity simulation platform, ensuring realistic satellite behavior such as orbital dynamics and resource constraints. Ground truth scheduling annotations are provided for each scenario. To our knowledge, AEOS-Bench is the first large-scale benchmark suite tailored for realistic constellation scheduling. Building upon this benchmark, we introduce AEOS-Former, a Transformer-based scheduling model that incorporates a constraint-aware attention mechanism. A dedicated internal constraint module explicitly models the physical and operational limits of each satellite. Through simulation-based iterative learning, AEOS-Former adapts to diverse scenarios, offering a robust solution for AEOS constellation scheduling. Experimental results demonstrate that AEOS-Former outperforms baseline models in task completion and energy efficiency, with ablation studies highlighting the contribution of each component. Code and data are provided in https://github.com/buaa-colalab/AEOSBench.
[94] Exploring the correlation between the type of music and the emotions evoked: A study using subjective questionnaires and EEG
Jelizaveta Jankowska,Bożena Kostek,Fernando Alonso-Fernandez,Prayag Tiwari
Main category: cs.CV
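TL;DR: Investigates how different types of music affect human emotions, measuring participants' emotional responses with subjective surveys and electroencephalography (EEG).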
Details
Motivation: To understand how different music genres affect people's emotions and thereby shed light on the relationship between music and emotion. Method: Records participants' brain activity with an EEG helmet while they listen to different music, and analyzes it together with subjective questionnaires. Result: The analysis shows connections between emotions and brain activity; different types of music elicit different EEG patterns and emotional feedback. Conclusion: Music type influences emotion, and combining EEG with questionnaires can capture and corroborate these emotional changes. Abstract: The subject of this work is to check how different types of music affect human emotions. While participants listened to music, a subjective survey and brain activity measurements were carried out using an EEG helmet. The aim is to demonstrate the impact of different music genres on emotions. The research involved a diverse group of participants of different genders and musical preferences. This had the effect of capturing a wide range of emotional responses to music. After the experiment, a relationship analysis of the respondents' questionnaires with EEG signals was performed. The analysis revealed connections between emotions and observed brain activity.
[95] A Hybrid Framework Bridging CNN and ViT based on Theory of Evidence for Diabetic Retinopathy Grading
Junlai Qiu,Yunzhu Chen,Hao Zheng,Yawen Huang,Yuexiang Li
Main category: cs.CV
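TL;DR: Proposes a fusion paradigm based on the theory of evidence that combines the strengths of CNN and ViT to improve the accuracy and interpretability of diabetic retinopathy grading.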
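To make the idea of evidence-based fusion more concrete, the sketch below shows one formulation that is common in the evidential deep learning literature: non-negative per-class evidence is turned into a subjective-logic opinion (belief masses plus an uncertainty mass), and two opinions are merged with a reduced Dempster combination rule. This is an illustration under those assumptions, not the paper's confirmed procedure; the listing does not specify the exact combination rule, and the function names and the evidence values in the usage note are hypothetical.

```python
def opinion_from_evidence(evidence):
    """Map non-negative per-class evidence to (belief masses, uncertainty).

    Uses Dirichlet parameters alpha_k = e_k + 1, so the Dirichlet strength is
    S = sum(e_k) + K, belief is b_k = e_k / S, and uncertainty is u = K / S.
    """
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes
    belief = [e / strength for e in evidence]
    return belief, num_classes / strength

def combine_opinions(opinion_a, opinion_b):
    """Reduced Dempster combination of two opinions over the same classes."""
    belief_a, u_a = opinion_a
    belief_b, u_b = opinion_b
    n = len(belief_a)
    # Conflict: mass the two sources assign to different classes.
    conflict = sum(belief_a[i] * belief_b[j]
                   for i in range(n) for j in range(n) if i != j)
    scale = 1.0 / (1.0 - conflict)
    belief = [scale * (belief_a[i] * belief_b[i]
                       + belief_a[i] * u_b
                       + belief_b[i] * u_a) for i in range(n)]
    return belief, scale * u_a * u_b
```

As a usage example, `combine_opinions(opinion_from_evidence([8.0, 1.0, 0.0]), opinion_from_evidence([5.0, 0.0, 2.0]))` could stand in for fusing a CNN branch and a ViT branch on a three-grade problem: the fused uncertainty is lower than either branch's own, and a branch with zero evidence leaves the other branch's opinion unchanged.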
Details
Motivation: Automated DR diagnosis systems built on a single backbone (CNN or ViT) have hit a performance bottleneck that is hard to push further, and they offer little interpretability for feature fusion and decision-making. Method: Proposes an evidence-theory-based fusion paradigm in which deep evidential networks turn the features extracted by different backbones into supporting evidence, from which an aggregated opinion is formed and used to adaptively adjust the fusion pattern across backbones. Result: Validated on two public DR grading datasets, the method achieves higher grading accuracy than current state-of-the-art models while providing good interpretability for feature fusion and decision-making. Conclusion: The proposed evidential fusion paradigm effectively integrates the strengths of CNN and ViT, improving both the performance and the interpretability of automated diabetic retinopathy diagnosis, with potential for clinical use. Abstract: Diabetic retinopathy (DR) is a leading cause of vision loss among middle-aged and elderly people, which significantly impacts their daily lives and mental health. To improve the efficiency of clinical screening and enable the early detection of DR, a variety of automated DR diagnosis systems have been recently established based on convolutional neural network (CNN) or vision Transformer (ViT). However, due to the inherent shortcomings of CNN and ViT, the performance of existing methods using a single-type backbone has reached a bottleneck. One potential way to improve further is to integrate different kinds of backbones, which can fully leverage their respective strengths (i.e., the local feature extraction capability of CNN and the global feature capturing ability of ViT). To this end, we propose a novel paradigm to effectively fuse the features extracted by different backbones based on the theory of evidence. Specifically, the proposed evidential fusion paradigm transforms the features from different backbones into supporting evidence via a set of deep evidential networks. With the supporting evidence, the aggregated opinion can be accordingly formed, which can be used to adaptively tune the fusion pattern between different backbones and accordingly boost the performance of our hybrid model. We evaluated our method on two publicly available DR grading datasets. The experimental results demonstrate that our hybrid model not only improves the accuracy of DR grading, compared to the state-of-the-art frameworks, but also provides excellent interpretability for feature fusion and decision-making.
[96] GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model?
Mingyu Sung,Seungjae Ham,Kangwoo Kim,Yeokyoung Yoon,Sangseok Yun,Il-Min Kim,Jae-Mo Kang
Main category: cs.CV
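TL;DR: Proposes GLYPH-SR, a vision-language-guided diffusion framework that aims to improve both text legibility and visual realism in image super-resolution. It introduces an OCR-guided TS-ControlNet and a ping-pong scheduler to recover scene text effectively in complex natural scenes.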
Details
Motivation: Existing super-resolution techniques mostly target overall image quality metrics (such as PSNR/SSIM) or perceptual quality and neglect character-level accuracy, which degrades OCR performance; existing text super-resolution work is mostly confined to simplified benchmarks of isolated characters and does not address the real challenges of text in natural scenes. A method that optimizes text legibility and perceptual quality together is therefore needed. Method: Proposes the GLYPH-SR framework, comprising a Text-SR Fusion ControlNet (TS-ControlNet) guided by OCR data and a ping-pong scheduler that alternates between text-centric and scene-centric guidance; these components are trained on a synthetic corpus while the main SR branch stays frozen, enabling targeted text restoration. Result: On SVT, SCUT-CTW1500, and CUTE80 at x4 and x8 upscaling, GLYPH-SR improves OCR F1 by up to +15.18 percentage points over diffusion/GAN baselines (SVT x8, OpenOCR) while keeping MANIQA, CLIP-IQA, and MUSIQ competitive. Conclusion: GLYPH-SR meets the requirements of high legibility and high visual realism at once, delivering super-resolution results that both "look right" and "read right", suitable for vision systems in real deployments. Abstract: Image super-resolution (SR) is fundamental to many vision systems, from surveillance and autonomy to document analysis and retail analytics, because recovering high-frequency details, especially scene-text, enables reliable downstream perception. Scene-text, i.e., text embedded in natural images such as signs, product labels, and storefronts, often carries the most actionable information; when characters are blurred or hallucinated, optical character recognition (OCR) and subsequent decisions fail even if the rest of the image appears sharp. Yet previous SR research has often been tuned to distortion (PSNR/SSIM) or learned perceptual metrics (LPIPS, MANIQA, CLIP-IQA, MUSIQ) that are largely insensitive to character-level errors. Furthermore, studies that do address text SR often focus on simplified benchmarks with isolated characters, overlooking the challenges of text within complex natural scenes. As a result, scene-text is effectively treated as generic texture. For SR to be effective in practical deployments, it is therefore essential to explicitly optimize for both text legibility and perceptual quality. We present GLYPH-SR, a vision-language-guided diffusion framework that aims to achieve both objectives jointly. GLYPH-SR utilizes a Text-SR Fusion ControlNet (TS-ControlNet) guided by OCR data, and a ping-pong scheduler that alternates between text- and scene-centric guidance. To enable targeted text restoration, we train these components on a synthetic corpus while keeping the main SR branch frozen. Across SVT, SCUT-CTW1500, and CUTE80 at x4 and x8, GLYPH-SR improves OCR F1 by up to +15.18 percentage points over diffusion/GAN baselines (SVT x8, OpenOCR) while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ. GLYPH-SR is designed to satisfy both objectives simultaneously (high readability and high visual realism), delivering SR that looks right and reads right.
[97] EEG-Driven Image Reconstruction with Saliency-Guided Diffusion Models
Igor Abramov,Ilya Makarov
Main category: cs.CV
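TL;DR: Proposes a dual-conditioning framework that combines EEG embeddings with spatial saliency maps to improve the quality and semantic consistency of EEG-driven image reconstruction.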
Details
Motivation: Existing EEG-driven image reconstruction methods ignore spatial attention mechanisms, which limits the fidelity and semantic coherence of the reconstructions. Method: Uses the Adaptive Thinking Mapper (ATM) to extract EEG features, fine-tunes Stable Diffusion 2.1 with LoRA, and adds a ControlNet branch that conditions generation on saliency maps for spatial control. Result: On the THINGS-EEG dataset, the method outperforms existing approaches in the quality of both low-level and high-level image features and aligns closely with human visual attention. Conclusion: Attentional priors help resolve the ambiguity of EEG signals and enable high-fidelity image reconstruction, advancing neural decoding built on pre-trained diffusion models. Abstract: Existing EEG-driven image reconstruction methods often overlook spatial attention mechanisms, limiting fidelity and semantic coherence. To address this, we propose a dual-conditioning framework that combines EEG embeddings with spatial saliency maps to enhance image generation. Our approach leverages the Adaptive Thinking Mapper (ATM) for EEG feature extraction and fine-tunes Stable Diffusion 2.1 via Low-Rank Adaptation (LoRA) to align neural signals with visual semantics, while a ControlNet branch conditions generation on saliency maps for spatial control. Evaluated on THINGS-EEG, our method achieves a significant improvement in the quality of low- and high-level image features over existing approaches while aligning strongly with human visual attention. The results demonstrate that attentional priors resolve EEG ambiguities, enabling high-fidelity reconstructions with applications in medical diagnostics and neuroadaptive interfaces, advancing neural decoding through efficient adaptation of pre-trained diffusion models.
[98] LoCoT2V-Bench: A Benchmark for Long-Form and Complex Text-to-Video Generation
Xiangqing Zheng,Chengyue Wu,Kehai Chen,Min Zhang
Main category: cs.CV
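TL;DR: Introduces LoCoT2V-Bench, a benchmark for long video generation under complex input conditions, with realistic complex prompts and a multi-dimensional evaluation scheme; it reveals that existing models fall short in inter-event consistency, fine-grained alignment, and high-level thematic expression.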
Details
Motivation: Existing text-to-video benchmarks mostly rely on simplified prompts and low-level metrics and do not assess abstract dimensions such as narrative coherence and thematic expression under complex prompts, so they fail to reflect the real performance of long video generation. Method: Builds a set of complex prompts from real-world videos, incorporating elements such as scene transitions and event dynamics, and proposes a multi-dimensional evaluation framework with new metrics including event-level alignment, fine-grained temporal consistency, content clarity, and the Human Expectation Realization Degree (HERD). Result: An evaluation of nine representative long video generation models shows that current methods do reasonably well on basic visual and temporal coherence but are clearly deficient in inter-event consistency, fine-grained content alignment, and high-level thematic adherence. Conclusion: LoCoT2V-Bench provides a comprehensive and reliable evaluation platform for complex long video generation and points to key directions for improving future models. Abstract: Recently, text-to-video generation has made impressive progress in producing short, high-quality clips, but evaluating long-form outputs remains a major challenge, especially when processing complex prompts. Existing benchmarks mostly rely on simplified prompts and focus on low-level metrics, overlooking fine-grained alignment with prompts and abstract dimensions such as narrative coherence and thematic expression. To address these gaps, we propose LoCoT2V-Bench, a benchmark specifically designed for long video generation (LVG) under complex input conditions. Based on various real-world videos, LoCoT2V-Bench introduces a suite of realistic and complex prompts incorporating elements like scene transitions and event dynamics. Moreover, it constructs a multi-dimensional evaluation framework that includes our newly proposed metrics such as event-level alignment, fine-grained temporal consistency, content clarity, and the Human Expectation Realization Degree (HERD) that focuses on more abstract attributes like narrative flow, emotional response, and character development. Using this framework, we conduct a comprehensive evaluation of nine representative LVG models, finding that while current methods perform well on basic visual and temporal aspects, they struggle with inter-event consistency, fine-grained alignment, and high-level thematic adherence. Overall, LoCoT2V-Bench provides a comprehensive and reliable platform for evaluating long-form complex text-to-video generation and highlights critical directions for future method improvement.
[99] A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models
Shihab Aaqil Ahamed,Udaya S. K. P. Miriya Thanthrige,Ranga Rodrigo,Muhammad Haris Khan
Main category: cs.CV
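TL;DR: Proposes A-TPT, a new test-time prompt tuning framework that introduces angular diversity to improve the calibration of vision-language models on unseen tasks, markedly reducing calibration error while maintaining accuracy.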
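The abstract of this entry describes the core objective as maximizing the minimum pairwise angular distance between normalized textual features on the unit hypersphere. The following sketch computes that quantity for a handful of feature vectors and wraps it as a loss to minimize. It is a plain-Python illustration with hypothetical function names; how A-TPT weights this term or combines it with other test-time objectives is not stated in this listing.

```python
import math
from itertools import combinations

def normalize(vector):
    """Project a vector onto the unit hypersphere (L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def min_pairwise_angle(features):
    """Smallest angle (in radians) between any two normalized feature vectors."""
    unit = [normalize(f) for f in features]
    angles = []
    for a, b in combinations(unit, 2):
        cosine = sum(x * y for x, y in zip(a, b))
        cosine = max(-1.0, min(1.0, cosine))  # guard against rounding drift
        angles.append(math.acos(cosine))
    return min(angles)

def angular_diversity_loss(features):
    """Negative minimum angle: minimizing this pushes the closest pair apart."""
    return -min_pairwise_angle(features)
```

Taking the minimum rather than the average focuses the objective on the two closest class features, which is the distinction the abstract draws against methods that maximize average dispersion.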
Details
Motivation: Existing test-time prompt tuning methods leave textual features insufficiently dispersed, which hurts calibration and limits model reliability and safety. Method: Proposes the A-TPT framework, which maximizes the minimum pairwise angular distance between normalized textual features on the unit hypersphere to make the feature distribution more uniform and thereby increase angular diversity. Result: Experiments show that A-TPT outperforms current state-of-the-art TPT methods across multiple backbones and datasets, markedly reducing the aggregate average calibration error, and exhibits strong zero-shot calibration generalization on natural distribution shifts and medical datasets. Conclusion: Promoting angular diversity effectively improves VLM calibration during test-time adaptation, and A-TPT offers a new approach to robust adaptation without labeled data. Abstract: Test-time prompt tuning (TPT) has emerged as a promising technique for adapting large vision-language models (VLMs) to unseen tasks without relying on labeled data. However, the lack of dispersion between textual features can hurt calibration performance, which raises concerns about VLMs' reliability, trustworthiness, and safety. Current TPT approaches primarily focus on improving prompt calibration by either maximizing average textual feature dispersion or enforcing orthogonality constraints to encourage angular separation. However, these methods may not always have optimal angular separation between class-wise textual features, which implies overlooking the critical role of angular diversity. To address this, we propose A-TPT, a novel TPT framework that introduces angular diversity to encourage uniformity in the distribution of normalized textual features induced by corresponding learnable prompts. This uniformity is achieved by maximizing the minimum pairwise angular distance between features on the unit hypersphere. We show that our approach consistently surpasses state-of-the-art TPT methods in reducing the aggregate average calibration error while maintaining comparable accuracy through extensive experiments with various backbones on different datasets. Notably, our approach exhibits superior zero-shot calibration performance on natural distribution shifts and generalizes well to medical datasets. We provide extensive analyses, including theoretical aspects, to establish the grounding of A-TPT. These results highlight the potency of promoting angular diversity to achieve well-dispersed textual features, significantly improving VLM calibration during test-time adaptation. Our code will be made publicly available.
[100] PointSt3R: Point Tracking through 3D Grounded Correspondence
Rhodri Guerrier,Adam W. Harley,Dima Damen
Main category: cs.CV
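TL;DR: Proposes applying 3D reconstruction models (such as DUSt3R and MASt3R) to point tracking. By combining the reconstruction loss with dynamic correspondence training and a visibility head, and fine-tuning MASt3R on a small amount of synthetic data, it achieves competitive or better point tracking performance on multiple datasets.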
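The abstract below reports results as $\delta_{avg}$, the position-accuracy metric used by the TAP-Vid benchmark family. As a hedged sketch of how that number is typically computed: for points that are visible in the ground truth, take the fraction predicted within 1, 2, 4, 8, and 16 pixels of the true location, and average the five fractions. The function name is hypothetical, and the sketch assumes coordinates are already expressed at the benchmark's evaluation resolution and that a strict "less than" comparison is used.

```python
import math

def delta_avg(predicted, ground_truth, visible, thresholds=(1, 2, 4, 8, 16)):
    """Average position accuracy over pixel thresholds, as a percentage.

    predicted, ground_truth: sequences of (x, y) point locations.
    visible: sequence of booleans; only ground-truth-visible points count.
    """
    distances = [math.dist(p, g)
                 for p, g, v in zip(predicted, ground_truth, visible) if v]
    fractions = [sum(d < t for d in distances) / len(distances)
                 for t in thresholds]
    return 100.0 * sum(fractions) / len(thresholds)
```

A prediction that is 3 pixels off passes the 4, 8, and 16 pixel thresholds but not the 1 and 2 pixel ones, so it contributes 60 to the average; this is why $\delta_{avg}$ rewards sub-pixel precision without ignoring coarse correctness.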
Details
Motivation: Existing 3D reconstruction models perform well in static scenes but cannot handle dynamic points, so the authors extend them to point tracking to make them applicable to dynamic scenes. Method: Combines the 3D reconstruction loss with dynamic correspondence training, adds a visibility prediction head, fine-tunes MASt3R on a small amount of synthetic data, and trains and evaluates only on frame pairs where one frame contains the query point, avoiding any reliance on temporal context. Result: Achieves competitive or better results on four datasets, for example close to CoTracker2 on TAP-Vid-DAVIS and significantly better than CoTracker3 on EgoPoints and RGB-S (EgoPoints: 61.3 vs 54.2, RGB-S: 87.0 vs 82.8). Conclusion: Adapting state-of-the-art 3D reconstruction models and adding dynamic correspondence training enables high-performing point tracking without temporal context, showing the potential of such models for video point tracking. Abstract: Recent advances in foundational 3D reconstruction models, such as DUSt3R and MASt3R, have shown great potential in 2D and 3D correspondence in static scenes. In this paper, we propose to adapt them for the task of point tracking through 3D grounded correspondence. We first demonstrate that these models are competitive point trackers when focusing on static points, present in current point tracking benchmarks ($+33.5\%$ on EgoPoints vs. CoTracker2). We propose to combine the reconstruction loss with training for dynamic correspondence along with a visibility head, and fine-tuning MASt3R for point tracking using a relatively small amount of synthetic data. Importantly, we only train and evaluate on pairs of frames where one contains the query point, effectively removing any temporal context. Using a mix of dynamic and static point correspondences, we achieve competitive or superior point tracking results on four datasets (e.g. competitive on TAP-Vid-DAVIS 73.8 $\delta_{avg}$ / 85.8\% occlusion acc. for PointSt3R compared to 75.7 / 88.3\% for CoTracker2; and significantly outperform CoTracker3 on EgoPoints 61.3 vs 54.2 and RGB-S 87.0 vs 82.8). We also present results on 3D point tracking along with several ablations on training datasets and percentage of dynamic correspondences.
[101] Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection
Yuanting Fan,Jun Liu,Xiaochen Chen,Bin-Bin Gao,Jian Li,Yong Liu,Jinlong Peng,Chengjie Wang
Main category: cs.CV
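TL;DR: Proposes FineGrainedAD, a new few-shot anomaly detection framework that improves anomaly localization through the Multi-Level Fine-Grained Semantic Caption (MFSC), Multi-Level Learnable Prompt (MLLP), and Multi-Level Semantic Alignment (MLSA).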
Details
Motivation: Existing few-shot anomaly detection methods based on vision-language models lack fine-grained textual descriptions, which causes semantic misalignment between image-level descriptions and local visual anomalies and hurts localization accuracy. Method: Proposes MFSC to automatically generate multi-level fine-grained textual descriptions, and designs the MLLP and MLSA modules, which use learnable prompts and a region aggregation strategy to achieve multi-level semantic alignment. Result: Experiments on MVTec-AD and VisA show that the method outperforms existing methods in few-shot settings and markedly improves anomaly localization. Conclusion: By introducing fine-grained semantic descriptions and a multi-level alignment mechanism, FineGrainedAD alleviates semantic misalignment and achieves better performance in few-shot anomaly detection. Abstract: Few-shot anomaly detection (FSAD) methods identify anomalous regions with few known normal samples. Most existing methods rely on the generalization ability of pre-trained vision-language models (VLMs) to recognize potentially anomalous regions through feature similarity between text descriptions and images. However, due to the lack of detailed textual descriptions, these methods can only pre-define image-level descriptions to match each visual patch token to identify potential anomalous regions, which leads to semantic misalignment between image descriptions and patch-level visual anomalies and results in sub-optimal localization performance. To address the above issues, we propose the Multi-Level Fine-Grained Semantic Caption (MFSC) to provide multi-level and fine-grained textual descriptions for existing anomaly detection datasets with an automatic construction pipeline. Based on the MFSC, we propose a novel framework named FineGrainedAD to improve anomaly localization performance, which consists of two components: Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics into multi-level learnable prompts through an automatic replacement and concatenation mechanism, while MLSA designs a region aggregation strategy and multi-level alignment training to help learnable prompts better align with corresponding visual regions. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings on MVTec-AD and VisA datasets.
[102] Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Pei Peng,MingKun Xie,Hang Hao,Tong Jin,ShengJun Huang
Main category: cs.CV
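TL;DR: Proposes a lightweight causal-inference method that synthesizes counterfactual embeddings to mitigate object-context shortcuts in vision-language models, improving zero-shot performance and robustness.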
Details
Motivation: Addresses the drop in zero-shot reliability of vision-language models caused by object-context co-occurrence bias when test-time scene contexts differ from those seen in training. Method: Frames the problem as causal inference, estimates object and background expectations in CLIP's representation space, generates counterfactual embeddings by recombining object features with diverse alternative contexts, and removes background interference through Total Direct Effect estimation and simulated intervention. Result: Without retraining or prompt design, substantially improves worst-group and average accuracy on context-sensitive benchmarks, reaching a new zero-shot state of the art. Conclusion: The method offers a practical, representation-level causal counterfactual framework that supports debiased and reliable multimodal reasoning. Abstract: Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
[103] Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing
Xin Guo,Zhiheng Xi,Yiwen Ding,Yitao Zhai,Xiaowei Shi,Xunliang Cai,Tao Gui,Qi Zhang,Xuanjing Huang
Main category: cs.CV
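TL;DR: This paper proposes distribution-reshaping and trajectory-resampling strategies to counteract the "Matthew effect" that arises when simple tasks dominate the self-improvement of large vision-language models, thereby improving complex reasoning.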
Details
Motivation: Existing self-improvement methods are found to optimize simple and complex queries unevenly, which makes it hard for the model to improve its complex reasoning. Method: Introduces four efficient strategies from two perspectives, distribution-reshaping and trajectory-resampling, to re-balance head and tail data. Result: Experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B show that the proposed methods outperform vanilla self-improvement by 3.86 points on average. Conclusion: The proposed head-tail re-balancing strategies effectively mitigate the Matthew effect and significantly improve the model's visual reasoning. Abstract: Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models (LVLMs), where models explore and learn from successful trajectories iteratively. However, we identify a critical issue during this process: the model excels at generating high-quality trajectories for simple queries (i.e., head data) but struggles with more complex ones (i.e., tail data). This leads to an imbalanced optimization that drives the model to prioritize simple reasoning skills, while hindering its ability to tackle more complex reasoning tasks. Over iterations, this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew effect"--which ultimately hinders further model improvement and leads to performance bottlenecks. To counteract this challenge, we introduce four efficient strategies from two perspectives: distribution-reshaping and trajectory-resampling, to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Extensive experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks demonstrate that our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
[104] Analysis of the Robustness of an Edge Detector Based on Cellular Automata Optimized by Particle Swarm
Vinícius Ferraria,Eurico Ruivo
Main category: cs.CV
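TL;DR: Proposes an adaptable edge detector based on a two-dimensional cellular automaton combined with meta-heuristic optimization and transfer learning; experiments show that expanding the optimization search space was not effective for the chosen image set, and transfer learning brought no significant improvement.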
Details
Motivation: Addresses weaknesses of traditional edge detectors, namely difficulty detecting loose edges and lack of context, and improves the detector's adaptability to different image properties. Method: Describes an adaptable detector with a two-dimensional cellular automaton, optimizes it with a meta-heuristic, and adds transfer learning techniques to improve performance; model adaptability is evaluated by expanding the optimization search space and applying several validation methods. Result: Expanding the optimization search space did not improve performance; the model adapts well to its input, but the transfer learning techniques brought no significant improvement. Conclusion: The proposed adaptable edge detector adapts effectively to input images, but the current optimization strategy and transfer learning methods have limited effect on this task, and more effective optimization and learning mechanisms need further exploration. Abstract: The edge detection task is essential in image processing aiming to extract relevant information from an image. One recurring problem in this task is the weaknesses found in some detectors, such as the difficulty in detecting loose edges and the lack of context to extract relevant information from specific problems. To address these weaknesses and adapt the detector to the properties of an image, an adaptable detector described by two-dimensional cellular automaton and optimized by meta-heuristic combined with transfer learning techniques was developed. This study aims to analyze the impact of expanding the search space of the optimization phase and the robustness of the adaptability of the detector in identifying edges of a set of natural images and specialized subsets extracted from the same image set. The results obtained prove that expanding the search space of the optimization phase was not effective for the chosen image set. The study also analyzed the adaptability of the model through a series of experiments and validation techniques and found that, regardless of the validation, the model was able to adapt to the input and the transfer learning techniques applied to the model showed no significant improvements.
[105] SA$^{2}$Net: Scale-Adaptive Structure-Affinity Transformation for Spine Segmentation from Ultrasound Volume Projection Imaging
Hao Xie,Zixun Huang,Yushen Zuo,Yakun Ju,Frank H. F. Leung,N. F. Law,Kin-Man Lam,Yong-Ping Zheng,Sai Ho Ling
Main category: cs.CV
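TL;DR: Proposes SA²Net, a scale-adaptive structure-aware network for spine segmentation in ultrasound images, which improves segmentation performance through a scale-adaptive complementary strategy and a structure-affinity transformation.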
Details
Motivation: Existing methods fail to fully capture the spatial correlations and structural information among spinal bone features, which limits segmentation quality. Method: Designs a scale-adaptive complementary strategy to learn cross-dimensional long-distance correlation features; combines a structure-affinity transformation with a Transformer decoder for structure-aware reasoning, and adopts a feature mixing loss aggregation method to improve training. Result: Experiments show that SA²Net outperforms current state-of-the-art methods on spine segmentation and adapts well to different backbones. Conclusion: SA²Net effectively improves the accuracy and robustness of spine ultrasound image segmentation and is a promising tool for intelligent scoliosis diagnosis. Abstract: Spine segmentation, based on ultrasound volume projection imaging (VPI), plays a vital role for intelligent scoliosis diagnosis in clinical applications. However, this task faces several significant challenges. Firstly, the global contextual knowledge of spines may not be well-learned if we neglect the high spatial correlation of different bone features. Secondly, the spine bones contain rich structural knowledge regarding their shapes and positions, which deserves to be encoded into the segmentation process. To address these challenges, we propose a novel scale-adaptive structure-aware network (SA$^{2}$Net) for effective spine segmentation. First, we propose a scale-adaptive complementary strategy to learn the cross-dimensional long-distance correlation features for spinal images. Second, motivated by the consistency between multi-head self-attention in Transformers and semantic level affinity, we propose structure-affinity transformation to transform semantic features with class-specific affinity and combine it with a Transformer decoder for structure-aware reasoning. In addition, we adopt a feature mixing loss aggregation method to enhance model training. This method improves the robustness and accuracy of the segmentation process. The experimental results demonstrate that our SA$^{2}$Net achieves superior segmentation performance compared to other state-of-the-art methods. Moreover, the adaptability of SA$^{2}$Net to various backbones enhances its potential as a promising tool for advanced scoliosis diagnosis using intelligent spinal image analysis.
The code and experimental demo are available at https://github.com/taetiseo09/SA2Net.
[106] AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping
Wen Xie,Yanjun Zhu,Gijs Overgoor,Yakov Bart,Agata Lapedriza Garcia,Sarah Ostadabbas
Main category: cs.CV
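TL;DR: This paper proposes a framework for automated video ad clipping built on an audio-visual fusion model; it frames ad clipping as a shot selection problem for the first time and emphasizes the important role of audio in advertising.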
Details
Motivation: Traditional ad clipping relies on manual re-editing, which is time-consuming and labor-intensive, and automated methods tailored to advertising are lacking. Method: Proposes a two-stream audio-visual fusion model that clips automatically by predicting frame importance (the likelihood of a frame being selected for the short ad), and builds a new ad dataset, AdSum204, for training and evaluation. Result: Experiments show that the model outperforms existing state-of-the-art methods on Average Precision, Area Under Curve, Spearman, and Kendall metrics. Conclusion: The proposed method effectively generates high-quality short video ads, confirms the key role of audio in ad clipping, and provides a new benchmark for ad-level video summarization. Abstract: Advertisers commonly need multiple versions of the same advertisement (ad) at varying durations for a single campaign. The traditional approach involves manually selecting and re-editing shots from longer video ads to create shorter versions, which is labor-intensive and time-consuming. In this paper, we introduce a framework for automated video ad clipping using video summarization techniques. We are the first to frame video clipping as a shot selection problem, tailored specifically for advertising. Unlike existing general video summarization methods that primarily focus on visual content, our approach emphasizes the critical role of audio in advertising. To achieve this, we develop a two-stream audio-visual fusion model that predicts the importance of video frames, where importance is defined as the likelihood of a frame being selected in the firm-produced short ad. To address the lack of ad-specific datasets, we present AdSum204, a novel dataset comprising 102 pairs of 30-second and 15-second ads from real advertising campaigns. Extensive experiments demonstrate that our model outperforms state-of-the-art methods across various metrics, including Average Precision, Area Under Curve, Spearman, and Kendall.
[107] Dynamic Context-Aware Scene Reasoning Using Vision-Language Alignment in Zero-Shot Real-World Scenarios
Manjunath Prasad Holenarasipura Rajiv,B. M. Vidyavathi
Main category: cs.CV
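TL;DR: Proposes a dynamic context-aware scene reasoning framework that uses vision-language alignment for zero-shot real-world scene understanding, significantly improving accuracy on multiple benchmarks.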
Details
Motivation: Traditional scene understanding models generalize poorly to unseen, unlabeled real-world scenes, which limits the deployment of vision applications in dynamic environments. Method: Combines pre-trained vision transformers and large language models, fuses visual semantics with natural language descriptions through vision-language alignment, and designs a dynamic reasoning module that uses linguistic priors to integrate global scene cues and object-level interactions. Result: On zero-shot benchmarks such as COCO, Visual Genome, and Open Images, scene understanding accuracy improves by up to 18%, with stronger robustness in complex and ambiguous scenes. Conclusion: The framework offers a scalable and interpretable approach to context-aware reasoning and advances zero-shot generalization in dynamic real-world scenes. Abstract: In real-world environments, AI systems often face unfamiliar scenarios without labeled data, creating a major challenge for conventional scene understanding models. The inability to generalize across unseen contexts limits the deployment of vision-based applications in dynamic, unstructured settings. This work introduces a Dynamic Context-Aware Scene Reasoning framework that leverages Vision-Language Alignment to address zero-shot real-world scenarios. The goal is to enable intelligent systems to infer and adapt to new environments without prior task-specific training. The proposed approach integrates pre-trained vision transformers and large language models to align visual semantics with natural language descriptions, enhancing contextual comprehension. A dynamic reasoning module refines predictions by combining global scene cues and object-level interactions guided by linguistic priors. Extensive experiments on zero-shot benchmarks such as COCO, Visual Genome, and Open Images demonstrate up to 18% improvement in scene understanding accuracy over baseline models in complex and unseen environments. Results also show robust performance in ambiguous or cluttered scenes due to the synergistic fusion of vision and language. This framework offers a scalable and interpretable approach for context-aware reasoning, advancing zero-shot generalization in dynamic real-world settings.
[108] CATCH: A Modular Cross-domain Adaptive Template with Hook
Xinjin Li,Yulie Lu,Jinghan Cao,Yu Ma,Zhenglin Li,Yeyang Zhou
Main category: cs.CV
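TL;DR: This paper proposes CATCH, a plug-and-play cross-domain adaptation framework for Visual Question Answering (VQA) that decouples visual and linguistic adaptation and introduces lightweight modules to improve performance without retraining the backbone model.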
Details
Motivation: Existing VQA models generalize poorly in cross-domain scenarios (such as remote sensing, medical imaging, and math diagrams) and rely on costly domain-specific fine-tuning, lacking a scalable general adaptation mechanism. Method: Proposes the CATCH framework, comprising a domain classifier and a dual adapter mechanism (a Prompt Adapter for language modulation and a Visual Adapter for visual feature adjustment), dynamically injected through a unified hook interface without modifying or retraining the backbone model. Result: Achieves significant gains on four domain-specific VQA benchmarks, e.g. +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA, without retraining the backbone model. Conclusion: CATCH offers a scalable, flexible, and efficient solution for multi-domain VQA that supports plug-and-play deployment and significantly improves cross-domain generalization. Abstract: Recent advances in Visual Question Answering (VQA) have demonstrated impressive performance in natural image domains, with models like LLaVA leveraging large language models (LLMs) for open-ended reasoning. However, their generalization degrades significantly when transferred to out-of-domain scenarios such as remote sensing, medical imaging, or math diagrams, due to large distributional shifts and the lack of effective domain adaptation mechanisms. Existing approaches typically rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and not scalable across diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for cross-domain adaptation that improves the generalization of VQA models while requiring minimal changes to their core architecture. Our key idea is to decouple visual and linguistic adaptation by introducing two lightweight modules: a domain classifier to identify the input image type, and a dual adapter mechanism comprising a Prompt Adapter for language modulation and a Visual Adapter for vision feature adjustment. Both modules are dynamically injected via a unified hook interface, requiring no retraining of the backbone model. Experimental results across four domain-specific VQA benchmarks demonstrate that our framework achieves consistent performance gains without retraining the backbone model, including +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA.
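The hook-based injection described in this abstract can be pictured with a small dispatch pattern. The sketch below is only an illustration under stated assumptions and is not the authors' implementation: `frozen_backbone`, `HookRegistry`, `classify_domain`, and the toy adapters are all hypothetical names, and a dictionary lookup stands in for the learned domain classifier.

```python
from typing import Callable, Dict, List

Features = List[float]


def frozen_backbone(prompt: str, features: Features) -> str:
    """Hypothetical stand-in for a frozen VQA model; it is never modified here."""
    return f"{prompt} | feat_sum={sum(features):.1f}"


class HookRegistry:
    """Hypothetical unified hook interface: per-domain adapters wrap the backbone call."""

    def __init__(self) -> None:
        self.prompt_adapters: Dict[str, Callable[[str], str]] = {}
        self.visual_adapters: Dict[str, Callable[[Features], Features]] = {}

    def register(
        self,
        domain: str,
        prompt_adapter: Callable[[str], str],
        visual_adapter: Callable[[Features], Features],
    ) -> None:
        self.prompt_adapters[domain] = prompt_adapter
        self.visual_adapters[domain] = visual_adapter

    def run(self, domain: str, prompt: str, features: Features) -> str:
        # Unknown domains fall back to identity adapters, i.e. the plain backbone.
        adapt_prompt = self.prompt_adapters.get(domain, lambda s: s)
        adapt_visual = self.visual_adapters.get(domain, lambda f: f)
        return frozen_backbone(adapt_prompt(prompt), adapt_visual(features))


def classify_domain(image_tag: str) -> str:
    """Toy domain classifier (assumption): a lookup table instead of a learned model."""
    return {"xray": "medical", "plot": "chart"}.get(image_tag, "natural")


registry = HookRegistry()
registry.register(
    "medical",
    lambda s: "[radiology] " + s,       # toy prompt adapter
    lambda f: [2.0 * x for x in f],     # toy visual adapter
)

medical_answer = registry.run(classify_domain("xray"), "Is there a fracture?", [0.5, 1.0])
natural_answer = registry.run(classify_domain("photo"), "What is this?", [0.5, 1.0])
```

The design point the sketch tries to capture is that adding a domain means registering two small callables, while the backbone function itself stays untouched.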
These results highlight that CATCH provides a scalable and extensible approach to multi-domain VQA, enabling practical deployment across diverse application domains.
[109] Emu3.5: Native Multimodal Models are World Learners
Yufeng Cui,Honghao Chen,Haoge Deng,Xu Huang,Xinghang Li,Jirong Liu,Yang Liu,Zhuoyan Luo,Jinsheng Wang,Wenxuan Wang,Yueze Wang,Chengyuan Wang,Fan Zhang,Yingli Zhao,Ting Pan,Xianduo Li,Zecheng Hao,Wenxuan Ma,Zhuo Chen,Yulong Ao,Tiejun Huang,Zhongyuan Wang,Xinlong Wang
Main category: cs.CV
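TL;DR: Emu3.5 is a large-scale multimodal world model pre-trained end-to-end with a unified next-token prediction objective on more than 10 trillion tokens of interleaved vision-language data. It natively predicts the next state across vision and language, uses Discrete Diffusion Adaptation (DiDA) to improve inference efficiency, and shows strong multimodal generation and world-modeling capabilities.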
Details
Motivation: Aims to build a world model that handles both vision and language and predicts natively across modalities, enabling more natural and coherent cross-modal generation and reasoning. Method: Pre-trains end-to-end with a unified next-token prediction objective on large-scale interleaved data built from video frames and transcripts; introduces Discrete Diffusion Adaptation (DiDA) for parallel image generation to speed up inference; and post-trains with large-scale reinforcement learning to strengthen multimodal reasoning. Result: Emu3.5 performs comparably to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing and better on interleaved generation tasks; it supports long-horizon vision-language generation, any-to-image (X2I) generation, and text-rich image generation, shows spatiotemporally consistent world exploration and open-world embodied manipulation, and speeds up inference by about 20x. Conclusion: Emu3.5 is an efficient and powerful native multimodal world model that combines high-performance generation with general world-modeling potential; an open-source version has been released to support community research. Abstract: We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks.
We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.
[110] ResMatching: Noise-Resilient Computational Super-Resolution via Guided Conditional Flow Matching
Anirban Ray,Vera Galinova,Florian Jug
Main category: cs.CV
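TL;DR: This paper proposes ResMatching, a new computational super-resolution method that uses guided conditional flow matching to learn stronger data priors; it achieves excellent performance on fluorescence microscopy images and provides pixel-wise uncertainty estimates.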
Details
Motivation: Because computational super-resolution is an ill-posed problem, traditional methods are limited by how expressive their priors are; with advances in machine learning, stronger data-driven priors are needed to improve reconstruction quality. Method: Uses a guided conditional flow matching framework that performs super-resolution by learning an implicit posterior over the data, and supports sampling from that posterior to estimate uncertainty. Result: On 4 biological structures from the BioSR dataset, outperforms 7 baselines and achieves the best trade-off between data fidelity and perceptual realism, particularly at low signal-to-noise ratios; it can also output pixel-wise uncertainty maps. Conclusion: ResMatching effectively learns strong priors, improves super-resolution of fluorescence microscopy images, and provides reliable uncertainty estimates through a calibrated posterior, making it practically useful. Abstract: Computational Super-Resolution (CSR) in fluorescence microscopy has, despite being an ill-posed problem, a long history. At its very core, CSR is about finding a prior that can be used to extrapolate frequencies in a micrograph that have never been imaged by the image-generating microscope. It stands to reason that, with the advent of better data-driven machine learning techniques, stronger priors can be learned and hence CSR can lead to better results. Here, we present ResMatching, a novel CSR method that uses guided conditional flow matching to learn such improved data-priors. We evaluate ResMatching on 4 diverse biological structures from the BioSR dataset and compare its results against 7 baselines. ResMatching consistently achieves competitive results, demonstrating in all cases the best trade-off between data fidelity and perceptual realism. We observe that CSR using ResMatching is particularly effective in cases where a strong prior is hard to learn, e.g. when the given low-resolution images contain a lot of noise. Additionally, we show that ResMatching can be used to sample from an implicitly learned posterior distribution and that this distribution is calibrated for all tested use-cases, enabling our method to deliver a pixel-wise data-uncertainty term that can guide future users to reject uncertain predictions.
[111] CYPRESS: Crop Yield Prediction via Regression on Prithvi's Encoder for Satellite Sensing
Shayan Nejadshamsi,Yuanyuan Zhang,Shadi Zaki,Brock Porth,Lysa Porth,Vahab Khoshdel
Main category: cs.CV
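TL;DR: This paper proposes CYPRESS, a deep learning method based on the pre-trained geospatial foundation model Prithvi-EO-2.0-600M for high-resolution, intra-field canola yield prediction.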
Details
Motivation: Accurate and timely crop yield prediction is crucial for global food security and modern agricultural management, and traditional methods often fall short of the scalability and granularity that precision farming requires. Method: CYPRESS takes the pre-trained large-scale geospatial foundation model Prithvi-EO-2.0-600M and adapts it to a continuous regression task, transforming multi-temporal satellite imagery into dense, pixel-level yield maps. Result: Evaluated on a comprehensive dataset from the Canadian Prairies, CYPRESS outperforms existing deep learning-based yield prediction models and produces continuous, high-resolution output. Conclusion: The study validates a new approach that connects large-scale Earth observation with on-farm decision-making and offers a scalable solution for detailed agricultural monitoring. Abstract: Accurate and timely crop yield prediction is crucial for global food security and modern agricultural management. Traditional methods often lack the scalability and granularity required for precision farming. This paper introduces CYPRESS (Crop Yield Prediction via Regression on Prithvi's Encoder for Satellite Sensing), a deep learning model designed for high-resolution, intra-field canola yield prediction. CYPRESS leverages a pre-trained, large-scale geospatial foundation model (Prithvi-EO-2.0-600M) and adapts it for a continuous regression task, transforming multi-temporal satellite imagery into dense, pixel-level yield maps. Evaluated on a comprehensive dataset from the Canadian Prairies, CYPRESS demonstrates superior performance over existing deep learning-based yield prediction models, highlighting the effectiveness of fine-tuning foundation models for specialized agricultural applications. By providing a continuous, high-resolution output, CYPRESS offers a more actionable tool for precision agriculture than conventional classification or county-level aggregation methods. This work validates a novel approach that bridges the gap between large-scale Earth observation and on-farm decision-making, offering a scalable solution for detailed agricultural monitoring.
[112] Spiking Patches: Asynchronous, Sparse, and Efficient Tokens for Event Cameras
Christoffer Koo Øhrstrøm,Ronja Güldenring,Lazaros Nalpantidis
Main category: cs.CV
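TL;DR: Proposes Spiking Patches, a tokenization method designed specifically for event cameras that preserves the asynchrony and spatial sparsity of the event stream; on gesture recognition and object detection it achieves faster inference than voxel-based and frame-based methods while matching or even improving accuracy.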
Details
Motivation: Existing event representations (such as frames or voxels) convert the asynchronous, sparse event stream into a synchronous, dense representation, which undermines the unique advantages of event cameras; a new representation that preserves the original properties is needed. Method: Proposes the Spiking Patches tokenizer, which partitions the event stream into spatiotemporally sparse patches ("spiking patches") and produces event tokens that preserve asynchrony and spatial sparsity; the tokens are used with a graph neural network (GNN), a point cloud network (PCN), and a Transformer on downstream tasks. Result: On gesture recognition and object detection, Spiking Patches gives inference up to 3.4x faster than voxel-based representations and up to 10.4x faster than frames, with comparable or better accuracy: absolute improvements of up to 3.8 (gesture recognition) and 1.4 (object detection). Conclusion: Event tokenization is a new paradigm for event-based vision; Spiking Patches preserves the core properties of event cameras and outperforms traditional representations in efficiency and performance, advancing event-based vision. Abstract: We propose tokenization of events and present a tokenizer, Spiking Patches, specifically designed for event cameras. Given a stream of asynchronous and spatially sparse events, our goal is to discover an event representation that preserves these properties. Prior works have represented events as frames or as voxels. However, while these representations yield high accuracy, both frames and voxels are synchronous and decrease the spatial sparsity. Spiking Patches gives the means to preserve the unique properties of event cameras and we show in our experiments that this comes without sacrificing accuracy. We evaluate our tokenizer using a GNN, PCN, and a Transformer on gesture recognition and object detection. Tokens from Spiking Patches yield inference times that are up to 3.4x faster than voxel-based tokens and up to 10.4x faster than frames. We achieve this while matching their accuracy and even surpassing in some cases with absolute improvements up to 3.8 for gesture recognition and up to 1.4 for object detection. Thus, tokenization constitutes a novel direction in event-based vision and marks a step towards methods that preserve the properties of event cameras.
[113] PT-DETR: Small Target Detection Based on Partially-Aware Detail Focus
Bingcong Huo,Zhiming Wang
Main category: cs.CV
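TL;DR: This paper proposes PT-DETR, a new RT-DETR-based algorithm for small-object detection in UAV imagery. By introducing the PADF module, the MFFF module, and the Focaler-SIoU loss function, it significantly improves small-object detection accuracy and robustness under challenges such as complex backgrounds, severe occlusion, and varying lighting.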
Details
Motivation: UAV object detection faces complex backgrounds, severe occlusion, dense small objects, and varying lighting; existing methods fall short in feature extraction and small-object matching and struggle to detect small objects. Method: Adds three key components to RT-DETR: 1) a Partially-Aware Detail Focus (PADF) module that strengthens the backbone's feature extraction for small objects; 2) a Median-Frequency Feature Fusion (MFFF) module that improves fusion of small-object details with contextual information; 3) the Focaler-SIoU loss function, which strengthens bounding box matching and sensitivity to small-object features. Result: On the VisDrone2019 dataset, PT-DETR improves mAP by 1.6% and 1.7% over RT-DETR with lower computational complexity and fewer parameters, confirming its performance on small-object detection. Conclusion: By improving feature extraction, fusion, and loss design, PT-DETR raises the detection accuracy and robustness for small objects in UAV imagery and has potential for practical use. Abstract: To address the challenges in UAV object detection, such as complex backgrounds, severe occlusion, dense small objects, and varying lighting conditions, this paper proposes PT-DETR based on RT-DETR, a novel detection algorithm specifically designed for small objects in UAV imagery. In the backbone network, we introduce the Partially-Aware Detail Focus (PADF) Module to enhance feature extraction for small objects. Additionally, we design the Median-Frequency Feature Fusion (MFFF) module, which effectively improves the model's ability to capture small-object details and contextual information. Furthermore, we incorporate Focaler-SIoU to strengthen the model's bounding box matching capability and increase its sensitivity to small-object features, thereby further enhancing detection accuracy and robustness. Compared with RT-DETR, our PT-DETR achieves mAP improvements of 1.6% and 1.7% on the VisDrone2019 dataset with lower computational complexity and fewer parameters, demonstrating its robustness and feasibility for small-object detection tasks.
[114] All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles
Sayed Pedram Haeri Boroujeni,Niloufar Mehrabi,Hazim Alzorgan,Ahmad Sarlak,Mahlagha Fazeli,Abolfazl Razi
Main category: cs.CV
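TL;DR: This paper surveys recent progress in object detection for autonomous vehicles, focusing on emerging paradigms such as vision-language models, large language models, and generative AI. It systematically analyzes sensor fusion, dataset categorization, and Transformer-based detection methods, and lays out a clear roadmap of current capabilities, open challenges, and future opportunities.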
Details
Motivation: Because knowledge is fragmented across multimodal perception, contextual reasoning, and cooperative intelligence, object detection in autonomous vehicles still faces critical challenges, and emerging AI techniques need to be integrated to make detection more reliable. Method: Systematically reviews autonomous-vehicle sensors (camera, ultrasonic, LiDAR, radar) and their fusion strategies, proposes a structured categorization of datasets, and analyzes advanced detection methods from 2D/3D pipelines to hybrid sensor fusion, with particular attention to emerging approaches based on Vision Transformers, large language models, and vision-language models. Result: Establishes a comprehensive analytical framework covering sensors, datasets, and detection methods, and clarifies the capabilities and limitations of current techniques, especially perception performance in dynamic environments and the potential for integration with LLM/VLM frameworks. Conclusion: The survey offers a forward-looking view of object detection in autonomous driving, emphasizes the central role of multimodal large models and cooperative perception in future research, and gives a clear technical roadmap for follow-up work. Abstract: Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability, reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics.
Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.
[115] Towards Reliable Sea Ice Drift Estimation in the Arctic Deep Learning Optical Flow on RADARSAT-2
Daniela Martin,Joseph Gallego
Main category: cs.CV
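TL;DR: This paper presents the first large-scale benchmark of 48 deep learning optical flow models on RADARSAT 2 ScanSAR sea ice imagery. The results show that deep learning methods can reach sub-kilometer accuracy, significantly outperform classical methods, and produce spatially continuous drift fields, opening new possibilities for Arctic navigation and climate modeling.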
Details
Motivation: Although optical flow has advanced rapidly in computer vision, its application to geophysical problems and satellite SAR imagery remains underexplored; classical methods rely on strong assumptions that limit accuracy in complex scenarios, so the suitability of more advanced deep learning methods for sea ice drift estimation needs to be explored. Method: Benchmarks 48 deep learning optical flow models at large scale on RADARSAT 2 ScanSAR sea ice imagery, evaluating them with endpoint error (EPE) and Fl all metrics against GNSS buoy data. Result: Several models reach sub-kilometer accuracy (EPE of 6 to 8 pixels, i.e. 300 to 400 m), capture consistent regional drift patterns, and give continuous motion estimates for every pixel, an advantage over sparse traditional observations. Conclusion: Deep learning optical flow methods transfer effectively to polar remote sensing, significantly improving the accuracy and spatial continuity of sea ice drift estimation, with potential for wide use in Arctic navigation and climate research. Abstract: Accurate estimation of sea ice drift is critical for Arctic navigation, climate research, and operational forecasting. While optical flow, a computer vision technique for estimating pixel wise motion between consecutive images, has advanced rapidly in computer vision, its applicability to geophysical problems and to satellite SAR imagery remains underexplored. Classical optical flow methods rely on mathematical models and strong assumptions about motion, which limit their accuracy in complex scenarios. Recent deep learning based approaches have substantially improved performance and are now the standard in computer vision, motivating their application to sea ice drift estimation. We present the first large scale benchmark of 48 deep learning optical flow models on RADARSAT 2 ScanSAR sea ice imagery, evaluated with endpoint error (EPE) and Fl all metrics against GNSS tracked buoys. Several models achieve sub kilometer accuracy (EPE 6 to 8 pixels, 300 to 400 m), a small error relative to the spatial scales of sea ice motion and typical navigation requirements in the Arctic. Our results demonstrate that the models are capable of capturing consistent regional drift patterns and that recent deep learning based optical flow methods, which have substantially improved motion estimation accuracy compared to classical methods, can be effectively transferred to polar remote sensing.
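For readers unfamiliar with the metric, endpoint error is the mean Euclidean distance between predicted and reference displacement vectors. The sketch below is a minimal illustration, not the benchmark's evaluation code: the function names and sample vectors are hypothetical, and the 50 m pixel spacing is an assumption inferred from the abstract's own figures (6 to 8 pixels corresponding to 300 to 400 m).

```python
import math
from typing import List, Tuple

Vector = Tuple[float, float]  # (dx, dy) displacement in pixels

# Assumption: 50 m per pixel, inferred from "EPE 6 to 8 pixels, 300 to 400 m".
PIXEL_SPACING_M = 50.0


def endpoint_error(pred: List[Vector], truth: List[Vector]) -> float:
    """Mean Euclidean distance, in pixels, between predicted and reference vectors."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and of equal length")
    total = sum(
        math.hypot(px - tx, py - ty)
        for (px, py), (tx, ty) in zip(pred, truth)
    )
    return total / len(pred)


def epe_in_metres(epe_px: float, spacing_m: float = PIXEL_SPACING_M) -> float:
    """Convert an endpoint error from pixels to metres for a given pixel spacing."""
    return epe_px * spacing_m


# Hypothetical buoy comparison: per-buoy errors of 5 px and 10 px average to 7.5 px.
predicted = [(3.0, 4.0), (0.0, 0.0)]
reference = [(0.0, 0.0), (6.0, 8.0)]
epe_px = endpoint_error(predicted, reference)
epe_m = epe_in_metres(epe_px)
```

Under the assumed spacing, the sample error of 7.5 pixels corresponds to 375 m, which falls inside the sub-kilometer range the abstract reports.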
Optical flow produces spatially continuous drift fields, providing motion estimates for every image pixel rather than at sparse buoy locations, offering new opportunities for navigation and climate modeling.
[116] Improving Classification of Occluded Objects through Scene Context
Courtney M. King,Daniel D. Leeds,Damian Lyons,George Kalaitzis
Main category: cs.CV
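TL;DR: This paper proposes two methods based on scene information fusion to make RPN-DCNN object detection networks more robust to occlusion; experiments show that they outperform baseline methods in both recall and precision.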
Details
Motivation: Occlusion severely degrades object recognition algorithms, and scene context can help mitigate this, so the paper uses scene information to make detection models more robust under occlusion. Method: Proposes two scene information fusion strategies: one selects a custom object network based on the background scene before prediction, and the other fuses scene knowledge into the initial object scores output by the RPN after detection. Result: The methods are validated on challenging datasets with partial occlusions and improve both recall and precision over baseline methods; training on a combination of occluded and unoccluded images is found to work best. Conclusion: The proposed method is interpretable and easily transferred to other datasets, offering a practical solution for handling occlusion and directions for future research. Abstract: The presence of occlusions has provided substantial challenges to typically-powerful object recognition algorithms. Additional sources of information can be extremely valuable to reduce errors caused by occlusions. Scene context is known to aid in object recognition in biological vision. In this work, we attempt to add robustness into existing Region Proposal Network-Deep Convolutional Neural Network (RPN-DCNN) object detection networks through two distinct scene-based information fusion techniques. We present one algorithm under each methodology: the first operates prior to prediction, selecting a custom object network to use based on the identified background scene, and the second operates after detection, fusing scene knowledge into initial object scores output by the RPN. We demonstrate our algorithms on challenging datasets featuring partial occlusions, which show overall improvement in both recall and precision against baseline methods. In addition, our experiments contrast multiple training methodologies for occlusion handling, finding that training on a combination of both occluded and unoccluded images demonstrates an improvement over the others. Our method is interpretable and can easily be adapted to other datasets, offering many future directions for research and practical applications.
[117] Process Integrated Computer Vision for Real-Time Failure Prediction in Steel Rolling Mill
Vaibhav Kurrey,Sivakalyan Pujari,Gagan Raj Gupta
Main category: cs.CV
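TL;DR: Proposes a machine vision-based anomaly detection system for failure prediction in steel rolling mills, which combines sensor data with visual inputs to monitor equipment operation in real time and analyze the root causes of failures.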
Details
Motivation: To cut the cost of unplanned downtime in steel rolling mills and improve production reliability and efficiency, a solution that predicts equipment failures early is needed. Method: Deploys industrial cameras along the production line to monitor equipment operation, alignment, and hot bar motion in real time, processes the live video streams with deep learning models on a centralized video server, and analyzes them jointly with sensor data from data acquisition systems. Result: The system predicts equipment failures and process interruptions in advance, identifies the location and probable root causes of failures, supports scalable deployment across production lines, and places little computational load on PLC systems. Conclusion: The integrated vision-and-sensor anomaly detection system improves operational reliability, productivity, and profitability in industrial manufacturing environments. Abstract: We present a long-term deployment study of a machine vision-based anomaly detection system for failure prediction in a steel rolling mill. The system integrates industrial cameras to monitor equipment operation, alignment, and hot bar motion in real time along the process line. Live video streams are processed on a centralized video server using deep learning models, enabling early prediction of equipment failures and process interruptions, thereby reducing unplanned breakdown costs. Server-based inference minimizes the computational load on industrial process control systems (PLCs), supporting scalable deployment across production lines with minimal additional resources. By jointly analyzing sensor data from data acquisition systems and visual inputs, the system identifies the location and probable root causes of failures, providing actionable insights for proactive maintenance. This integrated approach enhances operational reliability, productivity, and profitability in industrial manufacturing environments.
[118] The Impact and Outlook of 3D Gaussian Splatting
Bernhard Kerbl
Main category: cs.CV
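TL;DR: Since its introduction, 3D Gaussian Splatting (3DGS) has developed rapidly and driven several key advances in 3D scene representation, including efficiency improvements, dynamic representations, deeper mathematical foundations, and applications on mobile and VR platforms; it has become an important foundational tool for 3D vision and graphics.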
Details
Motivation: 3DGS has attracted a great deal of research interest, but it still has room to improve in efficiency, dynamics, scalability, and practical application, which has led researchers to explore several key technical directions. Method: Surveys the main directions of follow-up research on 3DGS, including resource-efficient training and rendering, the evolution toward dynamic (4DGS) representations, analysis of the mathematical foundations of appearance modeling and rendering, deployment on mobile and VR platforms, extension to large-scale scenes, and fast radiance field reconstruction via feed-forward or distributed computation. Result: Concludes that 3DGS has grown from a breakthrough representation into a versatile foundational tool that supports efficient training, dynamic modeling, cross-platform applications, and fast reconstruction. Conclusion: 3DGS continues to deepen theoretically and shows broad potential in practical applications, and is steadily becoming a core technology in 3D vision and graphics. Abstract: Since its introduction, 3D Gaussian Splatting (3DGS) has rapidly transformed the landscape of 3D scene representations, inspiring an extensive body of associated research. Follow-up work includes analyses and contributions that enhance the efficiency, scalability, and real-world applicability of 3DGS. In this summary, we present an overview of several key directions that have emerged in the wake of 3DGS. We highlight advances enabling resource-efficient training and rendering, the evolution toward dynamic (or four-dimensional, 4DGS) representations, and deeper exploration of the mathematical foundations underlying its appearance modeling and rendering process. Furthermore, we examine efforts to bring 3DGS to mobile and virtual reality platforms, its extension to massive-scale environments, and recent progress toward near-instant radiance field reconstruction via feed-forward or distributed computation. Collectively, these developments illustrate how 3DGS has evolved from a breakthrough representation into a versatile and foundational tool for 3D vision and graphics.
[119] SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models
Anushka Sivakumar,Andrew Zhang,Zaber Hakim,Chris Thomas
Main category: cs.CV
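TL;DR: This paper proposes SteerVLM, a lightweight steering module that dynamically adjusts the activations of vision-language models (VLMs) at inference time so that they follow instructions better, without modifying model weights and while preserving performance on off-target tasks.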
Details
Motivation: Existing VLM intervention methods struggle to provide fine-grained, inference-time control without weight updates, and they typically rely on static vectors or manual tuning. A more flexible and efficient way to steer complex output semantics precisely is therefore needed.
Method: SteerVLM learns from the latent embeddings of paired prompts (target behavior and converse behavior) to dynamically adjust the activations connecting the language modality with image context. It uses dimension-wise activation modulation and adaptive steering across layers, with learned parameters amounting to only 0.14% of the original VLM's size. The paper also introduces the VNIA dataset for evaluating VLM steering techniques.
Result: SteerVLM outperforms existing intervention techniques on VLM steering and hallucination mitigation benchmarks, achieving efficient inference-time control without degrading performance on off-target tasks.
Conclusion: Through activation engineering, SteerVLM offers an efficient, flexible solution for multimodal model control that leaves model weights unchanged, and together with the VNIA dataset it advances VLM steering techniques.
Abstract: This work introduces SteerVLM, a lightweight steering module designed to guide Vision-Language Models (VLMs) towards outputs that better adhere to desired instructions. Our approach learns from the latent embeddings of paired prompts encoding target and converse behaviors to dynamically adjust activations connecting the language modality with image context. This allows for fine-grained, inference-time control over complex output semantics without modifying model weights while preserving performance on off-target tasks. Our steering module requires learning parameters equal to 0.14% of the original VLM's size. Our steering module gains model control through dimension-wise activation modulation and adaptive steering across layers without requiring pre-extracted static vectors or manual tuning of intervention points. Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a multimodal dataset specifically created to facilitate the development and evaluation of VLM steering techniques. Our method outperforms existing intervention techniques on steering and hallucination mitigation benchmarks for VLMs and proposes a robust solution for multimodal model control through activation engineering.
[120] Surpassing state of the art on AMD area estimation from RGB fundus images through careful selection of U-Net architectures and loss functions for class imbalance
Valentyna Starodub,Mantas Lukoševičius
Main category: cs.CV
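TL;DR: Building on the U-Net architecture, this study improves the model structure and training pipeline to achieve multi-class semantic segmentation of age-related macular degeneration (AMD) lesions in RGB fundus images that outperforms prior methods.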
Details
Motivation: AMD is a leading cause of vision impairment in people over 60, so a non-invasive, low-cost, and accurate lesion detection method is urgently needed.
Method: With U-Net as the base, the study compares different pre-processing techniques, encoder backbone networks, and specialized loss functions for class imbalance at the pixel and image levels.
Result: The proposed framework achieves state-of-the-art multi-class lesion segmentation on the ADAM challenge dataset, surpassing all prior submissions.
Conclusion: The improved U-Net framework raises the accuracy of AMD lesion detection in RGB fundus images and has potential for clinical use; the code is open source.
Abstract: Age-related macular degeneration (AMD) is one of the leading causes of irreversible vision impairment in people over the age of 60. This research focuses on semantic segmentation for AMD lesion detection in RGB fundus images, a non-invasive and cost-effective imaging technique. The results of the ADAM challenge, the most comprehensive research competition and open dataset to date for AMD detection from RGB fundus images, serve as a benchmark for our evaluation. Taking the U-Net connectivity as a base of our framework, we evaluate and compare several approaches to improve the segmentation model's architecture and training pipeline, including pre-processing techniques, encoder (backbone) deep network types of varying complexity, and specialized loss functions to mitigate class imbalances on image and pixel levels. The main outcome of this research is the final configuration of the AMD detection framework, which outperforms all the prior ADAM challenge submissions on the multi-class segmentation of different AMD lesion types in non-invasive RGB fundus images. The source code used to conduct the experiments presented in this paper is made freely available.
[121] ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Aniruddh Bansal,Davit Soselia,Dang Nguyen,Tianyi Zhou
Main category: cs.CV
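TL;DR: This paper introduces a new benchmark, the "ChartAlign Benchmark (ChartAB)", for comprehensively evaluating vision-language models (VLMs) on chart grounding and alignment tasks, including extracting tabular data, localizing visualization elements, and recognizing various chart attributes. With a JSON template and a two-stage inference workflow, the benchmark measures more accurately how well VLMs understand and compare fine-grained details within a single chart and across multiple charts. The experimental analysis reveals perception biases, weaknesses, and hallucinations in current VLMs' chart understanding and identifies key skills that need improvement.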
Details
Motivation: Existing vision-language models fall short in perceiving chart details and extracting fine-grained structures, and they struggle to align and compare multiple charts accurately, which limits their use in data visualization and reasoning tasks. A dedicated benchmark is needed to systematically evaluate and expose these weaknesses.
Method: Proposes the ChartAlign Benchmark (ChartAB), which comprises three core tasks over charts of diverse types and complexities: tabular data extraction, visualization element localization, and attribute recognition. A dedicated JSON template is designed to compute the evaluation metrics for each task, and a two-stage inference workflow supports alignment and comparison of elements/attributes across charts.
Result: Evaluations on several recent VLMs show that current models commonly suffer from perception biases, poor robustness, and hallucinations, perform poorly on fine-grained chart understanding tasks, and differ markedly from one another.
Conclusion: ChartAB exposes key weaknesses of VLMs in chart understanding and gives clear direction for improving chart alignment, comparison, and fine-grained parsing.
Abstract: Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored for each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs' capability to align and compare elements/attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.
[122] HEIR: Learning Graph-Based Motion Hierarchies
Cheng Zheng,William Koch,Baiang Li,Felix Heide
Main category: cs.CV
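TL;DR: This paper proposes a general hierarchical motion modeling method that uses graph neural networks to learn structured motion relationships from data, decomposes global motion into parent-inherited patterns and local residuals, and validates the approach on 1D, 2D, and 3D dynamic scenes.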
Details
Motivation: Existing methods rely on manually defined or heuristic hierarchies and generalize poorly across tasks, so a method that automatically learns interpretable motion hierarchies from data is needed.
Method: Uses a graph-based hierarchical representation in which motion is modeled as vertices (elemental motions) and directed edges (parent-child dependencies), and infers the hierarchy with graph neural networks in a differentiable graph learning framework.
Result: The method reconstructs the intrinsic motion hierarchy in the 1D translation and 2D rotation cases, and produces more realistic and interpretable deformations than the baseline on dynamic 3D Gaussian splatting scenes.
Conclusion: The proposed data-driven hierarchical modeling paradigm is adaptable and extensible, and applies to a broad range of motion-centric tasks.
Abstract: Hierarchical structures of motion exist across research fields, including computer vision, graphics, and robotics, where complex dynamics typically arise from coordinated interactions among simpler motion components. Existing methods to model such dynamics typically rely on manually-defined or heuristic hierarchies with fixed motion primitives, limiting their generalizability across different tasks. In this work, we propose a general hierarchical motion modeling method that learns structured, interpretable motion relationships directly from data. Our method represents observed motions using graph-based hierarchies, explicitly decomposing global absolute motions into parent-inherited patterns and local motion residuals. We formulate hierarchy inference as a differentiable graph learning problem, where vertices represent elemental motions and directed edges capture learned parent-child dependencies through graph neural networks. We evaluate our hierarchical reconstruction approach on three examples: 1D translational motion, 2D rotational motion, and dynamic 3D scene deformation via Gaussian splatting. Experimental results show that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases, and produces more realistic and interpretable deformations compared to the baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable, data-driven hierarchical modeling paradigm, our method offers a formulation applicable to a broad range of motion-centric tasks. Project Page: https://light.princeton.edu/HEIR/
[123] The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
Jing Lin,Ruisi Wang,Junzhe Lu,Ziqi Huang,Guorui Song,Ailing Zeng,Xian Liu,Chen Wei,Wanqi Yin,Qingping Sun,Zhongang Cai,Lei Yang,Ziwei Liu
Main category: cs.CV
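TL;DR: Proposes a comprehensive framework that transfers knowledge from video generation (ViGen) to 3D human motion generation (MoGen) across data, modeling, and evaluation, substantially improving MoGen's generalization.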
Details
Motivation: Existing 3D human motion generation models face a bottleneck in generalization, whereas video generation has shown stronger generalization in modeling human behavior, which motivates cross-domain knowledge transfer.
Method: Builds the large-scale dataset ViMoGen-228K, combining optical MoCap data, annotated web videos, and ViGen-generated samples. Proposes ViMoGen, a flow-matching-based diffusion transformer that unifies priors from multiple sources through gated multimodal conditioning, along with its lightweight variant ViMoGen-light. Designs the hierarchical benchmark MBench for fine-grained evaluation of motion quality, prompt fidelity, and generalization ability.
Result: Experiments show that the framework significantly outperforms existing approaches in both automatic metrics and human evaluation, particularly in generalization and semantic diversity.
Conclusion: By systematically transferring knowledge from ViGen, the work advances 3D motion generation at the data, model, and evaluation levels, and provides a high-quality dataset and a standardized benchmark for future research.
Abstract: Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.
[124] Scaling Image Geo-Localization to Continent Level
Philipp Lindenberger,Paul-Edouard Sarlin,Jan Hosang,Matteo Balice,Marc Pollefeys,Simon Lynen,Eduard Trulls
Main category: cs.CV
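TL;DR: This paper proposes a hybrid approach that learns feature representations through a proxy classification task and combines them with aerial imagery embeddings to achieve fine-grained geo-localization at continental scale, localizing more than 68% of queries within 200 m on a dataset covering a large part of Europe.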
Details
Motivation: Existing image geo-localization methods at global scale are inefficient, lack coverage, or yield only coarse results, and cross-view retrieval is limited by the domain gap and by small study regions, which makes accurate large-scale localization difficult.
Method: Trains the model on a proxy classification task to implicitly encode precise location information, and combines the learned prototypes with aerial imagery embeddings for fine-grained retrieval, increasing robustness to the sparsity of ground-level data.
Result: On a dataset covering a large part of Europe, more than 68% of queries are localized within 200 m, significantly better than existing scalable methods.
Conclusion: The hybrid approach addresses the challenge of fine-grained image geo-localization over large areas, balances scalability with accuracy, and has potential for practical use.
Abstract: Determining the precise geographic location of an image at a global scale remains an unsolved challenge. Standard image retrieval techniques are inefficient due to the sheer volume of images (>100M) and fail when coverage is insufficient. Scalable solutions, however, involve a trade-off: global classification typically yields coarse results (10+ kilometers), while cross-view retrieval between ground and aerial imagery suffers from a domain gap and has been primarily studied on smaller regions. This paper introduces a hybrid approach that achieves fine-grained geo-localization across a large geographic expanse the size of a continent. We leverage a proxy classification task during training to learn rich feature representations that implicitly encode precise location information. We combine these learned prototypes with embeddings of aerial imagery to increase robustness to the sparsity of ground-level data. This enables direct, fine-grained retrieval over areas spanning multiple countries. Our extensive evaluation demonstrates that our approach can localize within 200m more than 68% of queries of a dataset covering a large part of Europe. The code is publicly available at https://scaling-geoloc.github.io.
[125] Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Ziyu Guo,Xinyan Chen,Renrui Zhang,Ruichuan An,Yu Qi,Dongzhi Jiang,Xiangtai Li,Manyuan Zhang,Hongsheng Li,Pheng-Ann Heng
Main category: cs.CV
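TL;DR: This paper studies how current video generation models (such as Veo-3) perform on zero-shot visual reasoning tasks and builds a benchmark named MME-CoF that covers 12 reasoning dimensions. The study finds that although these models show promise in short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. They are not yet reliable as standalone zero-shot reasoners, but they can serve as useful complementary visual engines alongside dedicated reasoning models.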
Details
Motivation: To explore whether video generation models have zero-shot visual reasoning ability, and in particular their potential and limits in challenging visual reasoning scenarios.
Method: Builds the MME-CoF benchmark and systematically evaluates leading video models such as Veo-3 across 12 reasoning dimensions, analyzing their behavior under the Chain-of-Frame (CoF) reasoning paradigm.
Result: Current video models perform well on short-horizon spatial and locally consistent dynamic reasoning but show clear shortcomings in long-horizon causal reasoning, geometric constraints, and abstract logic.
Conclusion: Existing video models are not yet reliable as standalone zero-shot reasoners, but they can work alongside dedicated reasoning models as complementary visual engines.
Abstract: Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io
[126] SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting
Dongyue Lu,Ao Liang,Tianxin Huang,Xiao Fu,Yuyang Zhao,Baorui Ma,Liang Pan,Wei Yin,Lingdong Kong,Wei Tsang Ooi,Ziwei Liu
Main category: cs.CV
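TL;DR: Proposes SEE4D, a pose-free trajectory-to-camera framework that generates 4D content from casual videos by rendering to fixed virtual cameras and applying view-conditional video inpainting.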
Details
Motivation: Existing methods depend on manually annotated camera poses or complex trajectory prediction, which makes them hard to apply to in-the-wild videos and complicates modeling.
Method: Replaces explicit trajectory prediction with a bank of virtual cameras, uses a view-conditional video inpainting model trained by denoising to fill missing regions, and designs a spatiotemporal autoregressive inference pipeline for coherent generation.
Result: The method outperforms pose- or trajectory-conditioned baselines on cross-view video generation and sparse reconstruction tasks, with better generalization and performance.
Conclusion: SEE4D achieves high-quality 4D scene modeling without 3D supervision, advancing practical 4D world modeling from casual videos.
Abstract: Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate SEE4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.
[127] Masked Diffusion Captioning for Visual Feature Learning
Chao Feng,Zihao Wei,Andrew Owens
Main category: cs.CV
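TL;DR: Proposes an image captioning method based on an image-conditioned masked diffusion language model, called masked diffusion captioning (MDC), which learns visual features by reconstructing masked text. Linear probing experiments across a variety of models and datasets show performance comparable to autoregressive and contrastive methods.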
Details
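Motivation: In conventional autoregressive image captioning, the visual learning signal depends heavily on sequence position and often requires auxiliary objectives, which limits how well visual features are learned. A more effective way to learn visual features is therefore needed.
Method: Uses an image-conditioned masked diffusion language model (MDC). During training, text tokens in each image-text pair are randomly masked, and a decoder conditioned on visual features is trained to reconstruct the original text, thereby learning visual features.
Result: Linear probing experiments on several academic-scale models and datasets show that the visual features learned by MDC are competitive with those from autoregressive and contrastive learning methods.
Conclusion: MDC offers a new framework for image captioning and visual feature learning whose learning signal does not depend on text sequence position, which reduces the need for auxiliary objectives while still learning effective visual representations.
Abstract: We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token's position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
[128] OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes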
Yukun Huang,Jiwen Yu,Yanning Zhou,Jianan Wang,Xintao Wang,Pengfei Wan,Xihui Liu
Main category: cs.CV
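TL;DR: This paper proposes the OmniX framework, which uses 2D generative priors to perceive panoramic geometry, textures, and PBR materials, and generates high-quality 3D scenes suitable for physically based rendering and simulation.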