cs.CL

[1] LLMs for Game Theory: Entropy-Guided In-Context Learning and Adaptive CoT Reasoning

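Tommaso Felice Banfi, Sashenka Gamage

Main category: cs.CL

TL;DR: Proposes an LLM-based adaptive reasoning framework that uses entropy-guided chain-of-thought and dynamic context retrieval to substantially improve decision quality in game tasks such as Tic-Tac-Toe.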

Motivation: Conventional chain-of-thought methods use a fixed context and a fixed reasoning path in discrete game tasks, so they cope poorly with uncertainty that varies from state to state. A more flexible reasoning mechanism is needed.

Method: Combines in-context learning with entropy-guided chain-of-thought (CoT) reasoning, adjusting the number of retrieved examples and reasoning paths according to token-level uncertainty: concise reasoning when uncertainty is low, expanded multi-path exploration when it is high.

Result: Over 100 games against a sub-optimal algorithmic opponent, the average outcome rises from -11.6% for the baseline LLM to +9.5%, while the number of LLM queries per game stays low. Statistical validation shows the improvement is significant, and token entropy is negatively correlated with move optimality.

Conclusion: Uncertainty-guided adaptive reasoning effectively improves LLM performance in sequential decision-making environments and offers an efficient, dynamic way to optimize LLM reasoning.

Abstract: We propose a novel LLM-based framework for reasoning in discrete, game-theoretic tasks, illustrated with Tic-Tac-Toe. The method integrates in-context learning with entropy-guided chain-of-thought (CoT) reasoning and adaptive context retrieval. The model dynamically adjusts both the number of retrieved examples and reasoning paths according to token-level uncertainty: concise reasoning with minimal context is used when uncertainty is low, whereas higher uncertainty triggers expanded multi-path CoT exploration. Experimental evaluation against a sub-optimal algorithmic opponent shows that entropy-aware adaptive reasoning substantially improves decision quality, increasing the average game outcome from -11.6% with the baseline LLM to +9.5% with entropy-guided adaptive reasoning over 100 games (win = +1, tie = 0, loss = -1), while maintaining a relatively low number of LLM queries per game. Statistical validation confirms that the improvement is significant, and correlation analysis reveals a negative association between token-level entropy and move optimality. These findings demonstrate that uncertainty-guided adaptive reasoning effectively enhances LLM performance in sequential decision-making environments.
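The entropy-gated switch described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the threshold, the example and path budgets, and the function names `token_entropy`, `plan_reasoning`, and `average_outcome` are all hypothetical. Only the win/tie/loss scoring (+1, 0, -1) comes from the abstract.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def plan_reasoning(step_distributions, threshold=0.5):
    """Pick a reasoning budget from the mean token-level entropy.

    The threshold and both budgets are hypothetical values for illustration.
    """
    entropies = [token_entropy(d) for d in step_distributions]
    mean_h = sum(entropies) / len(entropies)
    if mean_h < threshold:
        # Low uncertainty: concise reasoning with minimal context.
        return {"examples": 1, "cot_paths": 1, "entropy": mean_h}
    # High uncertainty: more retrieved examples and multi-path CoT.
    return {"examples": 4, "cot_paths": 3, "entropy": mean_h}


def average_outcome(results):
    """Average game outcome with win = +1, tie = 0, loss = -1."""
    score = {"win": 1, "tie": 0, "loss": -1}
    return sum(score[r] for r in results) / len(results)


confident = [[0.97, 0.01, 0.01, 0.01]]   # one sharply peaked distribution
uncertain = [[0.25, 0.25, 0.25, 0.25]]   # one uniform distribution
print(plan_reasoning(confident)["cot_paths"], plan_reasoning(uncertain)["cot_paths"])
```

With these toy inputs the peaked distribution stays on the single-path branch and the uniform one triggers the multi-path branch. In a real system the distributions would come from the model's own token probabilities.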

[2] BYOL: Bring Your Own Language Into LLMs

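Syed Waqas Zamir, Wassim Hamidouche, Boulbaba Ben Amor, Luana Marotti, Inbal Becker-Reshef, Juan Lavista Ferres

Main category: cs.CL

TL;DR: Proposes the BYOL (Bring Your Own Language) framework to address the weak performance of large language models on low-resource and extreme-low-resource languages. Using a tiered resource classification with tailored data augmentation and translation pathways, it improves LLM performance on under-resourced languages and releases accompanying benchmarks and models.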

Motivation: Because global language resources are severely imbalanced, most low-resource languages are poorly served by existing LLMs, which leads to cultural misalignment and limited usability. A scalable, language-aware framework is needed to support these languages.

Method: BYOL classifies languages into four tiers by digital resources and designs a pathway for each tier. For low-resource languages it uses a full pipeline of data cleaning, synthetic generation, continual pretraining, and finetuning. For extreme-low-resource languages it builds a translation-mediated pathway that uses a tailored machine translation system to provide LLM access.

Result: An average improvement of about 12% on Chichewa and Maori, and a 4 BLEU gain over a commercial baseline for the Inuktitut translation system. English and multilingual capabilities are preserved through weight-space model merging.

Conclusion: BYOL offers a scalable LLM development approach for languages at different resource levels, markedly improves results for low-resource and extreme-low-resource languages, and advances fairer, more inclusive multilingual AI.

Abstract: Large Language Models (LLMs) exhibit strong multilingual capabilities, yet remain fundamentally constrained by the severe imbalance in global language resources. While over 7,000 languages are spoken worldwide, only a small subset (fewer than 100) has sufficient digital presence to meaningfully influence modern LLM training. This disparity leads to systematic underperformance, cultural misalignment, and limited accessibility for speakers of low-resource and extreme-low-resource languages. To address this gap, we introduce Bring Your Own Language (BYOL), a unified framework for scalable, language-aware LLM development tailored to each language's digital footprint. BYOL begins with a language resource classification that maps languages into four tiers (Extreme-Low, Low, Mid, High) using curated web-scale corpora, and uses this classification to select the appropriate integration pathway. For low-resource languages, we propose a full-stack data refinement and expansion pipeline that combines corpus cleaning, synthetic text generation, continual pretraining, and supervised finetuning. Applied to Chichewa and Maori, this pipeline yields language-specific LLMs that achieve approximately 12 percent average improvement over strong multilingual baselines across 12 benchmarks, while preserving English and multilingual capabilities via weight-space model merging.
For extreme-low-resource languages, we introduce a translation-mediated inclusion pathway, and show on Inuktitut that a tailored machine translation system improves over a commercial baseline by 4 BLEU, enabling high-accuracy LLM access when direct language modeling is infeasible. Finally, we release human-translated versions of the Global MMLU-Lite benchmark in Chichewa, Maori, and Inuktitut, and make our codebase and models publicly available at https://github.com/microsoft/byol .

[3] A Concise Agent is Less Expert: Revealing Side Effects of Using Style Features on Conversational Agents

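Young-Min Cho, Yuan Yuan, Sharath Chandra Guntuku, Lyle Ungar

Main category: cs.CL

TL;DR: The first systematic study of cross-feature side effects of style-feature prompts in LLM conversational agents. It finds that style features are entangled in complex ways rather than orthogonal, and introduces the CASSE dataset to support future research.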

Motivation: Style features such as friendly, helpful, or concise are widely used in prompts to steer LLM behavior, yet their potential side effects are not well understood.

Method: Surveys 127 papers from the ACL Anthology to identify 12 commonly used style features, then runs a causal evaluation with controlled synthetic dialogues in task-oriented and open-domain settings, using a pairwise LLM-as-a-Judge framework.

Result: Prompting for one style significantly affects others (for example, asking for conciseness lowers perceived expertise), which indicates consistent, structured side effects among style features. Prompt-based and activation-steering mitigation strategies struggle to serve the primary goal and limit side effects at the same time.

Conclusion: Style features are not independently controllable in LLMs, and existing methods cannot achieve precise multi-objective style control. More principled multi-objective approaches are needed for safe, precise stylistic steering.

Abstract: Style features such as friendly, helpful, or concise are widely used in prompts to steer the behavior of Large Language Model (LLM) conversational agents, yet their unintended side effects remain poorly understood. In this work, we present the first systematic study of cross-feature stylistic side effects. We conduct a comprehensive survey of 127 conversational agent papers from ACL Anthology and identify 12 frequently used style features. Using controlled, synthetic dialogues across task-oriented and open-domain settings, we quantify how prompting for one style feature causally affects others via a pairwise LLM-as-a-Judge evaluation framework. Our results reveal consistent and structured side effects; for example, prompting for conciseness significantly reduces perceived expertise. They demonstrate that style features are deeply entangled rather than orthogonal. To support future research, we introduce CASSE (Conversational Agent Stylistic Side Effects), a dataset capturing these complex interactions. We further evaluate prompt-based and activation-steering-based mitigation strategies and find that while they can partially restore suppressed traits, they often degrade the primary intended style. These findings challenge the assumption of faithful style control in LLMs and highlight the need for multi-objective and more principled approaches to safe, targeted stylistic steering in conversational agents.

[4] Reasoning Models Generate Societies of Thought

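Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, James Evans

Main category: cs.CL

TL;DR: Argues that the sophisticated reasoning ability of LLMs stems not only from longer chains of thought but, more importantly, from simulating multi-agent interaction (a "society of thought") that brings diverse perspectives and debate, which improves reasoning accuracy.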

Motivation: To explore the mechanisms behind sophisticated reasoning in LLMs and explain why reasoning models outperform conventional instruction-tuned models on cognitive tasks.

Method: Combines quantitative analysis with mechanistic interpretability methods to analyze the reasoning traces of models such as DeepSeek-R1 and QwQ-32B, and uses reinforcement learning experiments to study the relationship between conversational behaviors and reasoning accuracy.

Result: Reasoning models show greater diversity of cognitive perspectives and activate more heterogeneous personality- and expertise-related features. The multi-agent structure appears as conversational behaviors such as question-answering, perspective shifts, and reconciliation of conflicting views, and it brings an advantage in reasoning accuracy.

Conclusion: Gains in reasoning ability stem from the "society of thought" form of organization, that is, structured interaction among diverse internal perspectives, analogous to collective intelligence in human groups. This points to new directions for building more effective AI reasoning systems.

Abstract: Large language models have achieved remarkable capabilities across domains, yet mechanisms underlying sophisticated reasoning remain elusive. Recent reasoning models outperform comparable instruction-tuned models on complex cognitive tasks, attributed to extended computation through longer chains of thought. Here we show that enhanced reasoning emerges not from extended computation alone, but from simulating multi-agent-like interactions -- a society of thought -- which enables diversification and debate among internal cognitive perspectives characterized by distinct personality traits and domain expertise. Through quantitative analysis and mechanistic interpretability methods applied to reasoning traces, we find that reasoning models like DeepSeek-R1 and QwQ-32B exhibit much greater perspective diversity than instruction-tuned models, activating broader conflict between heterogeneous personality- and expertise-related features during reasoning. This multi-agent structure manifests in conversational behaviors, including question-answering, perspective shifts, and the reconciliation of conflicting views, and in socio-emotional roles that characterize sharp back-and-forth conversations, together accounting for the accuracy advantage in reasoning tasks. Controlled reinforcement learning experiments reveal that base models increase conversational behaviors when rewarded solely for reasoning accuracy, and fine-tuning models with conversational scaffolding accelerates reasoning improvement over base models. These findings indicate that the social organization of thought enables effective exploration of solution spaces.
We suggest that reasoning models establish a computational parallel to collective intelligence in human groups, where diversity enables superior problem-solving when systematically structured, which suggests new opportunities for agent organization to harness the wisdom of crowds.

[5] EncodeRec: An Embedding Backbone for Recommendation Systems

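Guy Hadad, Neomi Rabaev, Bracha Shapira

Main category: cs.CL

TL;DR: EncodeRec is a new approach for recommender systems that keeps the pretrained language model's parameters frozen and learns compact, informative embeddings directly from item descriptions, effectively aligning textual representations with recommendation objectives.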

Motivation: Embeddings produced by existing pretrained language models (PLMs) have two problems in recommender systems: they lack structure and discriminability, and they struggle to capture domain-specific semantics. An embedding learning method better suited to recommendation is needed.

Method: EncodeRec freezes the language model parameters while the recommender system is trained and learns only compact embeddings adapted to the recommendation objective, which improves efficiency while retaining semantic fidelity. It learns embeddings directly from item descriptions and supports sequential recommendation and semantic ID tokenization.

Result: On several core recommendation benchmarks, EncodeRec substantially outperforms PLM-based and other embedding-model baselines, both as a backbone for sequential recommendation models and for semantic ID tokenization.

Conclusion: Embedding adaptation plays a key role in connecting general-purpose language models with practical recommender systems, and EncodeRec provides an efficient and effective solution.

Abstract: Recent recommender systems increasingly leverage embeddings from large pre-trained language models (PLMs). However, such embeddings exhibit two key limitations: (1) PLMs are not explicitly optimized to produce structured and discriminative embedding spaces, and (2) their representations remain overly generic, often failing to capture the domain-specific semantics crucial for recommendation tasks. We present EncodeRec, an approach designed to align textual representations with recommendation objectives while learning compact, informative embeddings directly from item descriptions. EncodeRec keeps the language model parameters frozen during recommender system training, making it computationally efficient without sacrificing semantic fidelity. Experiments across core recommendation benchmarks demonstrate its effectiveness both as a backbone for sequential recommendation models and for semantic ID tokenization, showing substantial gains over PLM-based and embedding model baselines. These results underscore the pivotal role of embedding adaptation in bridging the gap between general-purpose language models and practical recommender systems.

[6] DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference

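Parisa Rabbani, Priyam Sahoo, Ruben Mathew, Aishee Mondal, Harshita Ketharaman, Nimet Beyza Bozdag, Dilek Hakkani-Tür

Main category: cs.CL

TL;DR: Shows that LLMs exhibit "dialogic deference" when evaluating dialogue: identical content yields markedly different verdicts depending on the framing of the question (judging the statement itself vs. judging the speaker). The authors propose the DialDefer framework and the Dialogic Deference Score (DDS) to quantify the effect, and find judgment shifts of up to 87 percentage points across multiple domains and models while accuracy changes very little. Human-vs-LLM attribution is the main driver, which suggests evaluation needs calibration rather than accuracy optimization alone.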

Motivation: LLMs are widely used as third-party judges, but how reliably they judge speakers in a dialogue context is unclear. This paper asks whether the same claim under different framings leads LLMs to inconsistent judgments, which would reveal a systematic bias.

Method: Proposes the DialDefer framework and introduces the Dialogic Deference Score (DDS) to quantify the judgment shift between two framings, "is this statement correct" and "is this speaker correct". Experiments cover nine domains, more than 3,000 instances, and four models, with ablations on contributing factors such as speaker identity (human or AI).

Result: Conversational framing causes large judgment shifts (|DDS| up to 87pp, p < .0001) while overall accuracy changes by less than 2pp. The effect is amplified 2-4x on real Reddit conversations. The same model ranges from DDS = -53 on graduate-level science to +58 on social judgment, shifting toward deference or skepticism depending on domain. Human-vs-LLM attribution causes the largest shift (17.7pp). Mitigation strategies reduce deference but may over-correct into skepticism.

Conclusion: The reliability of LLMs as judges is strongly affected by conversational framing, and relying on accuracy alone hides systematic judgment shifts. Such bias should be treated as a calibration problem, and evaluation design should account for context and attribution to improve fairness and robustness.

Abstract: LLMs are increasingly used as third-party judges, yet their reliability when evaluating speakers in dialogue remains poorly understood. We show that LLMs judge identical claims differently depending on framing: the same content elicits different verdicts when presented as a statement to verify ("Is this statement correct?") versus attributed to a speaker ("Is this speaker correct?"). We call this dialogic deference and introduce DialDefer, a framework for detecting and mitigating these framing-induced judgment shifts. Our Dialogic Deference Score (DDS) captures directional shifts that aggregate accuracy obscures. Across nine domains, 3k+ instances, and four models, conversational framing induces large shifts (|DDS| up to 87pp, p < .0001) while accuracy remains stable (<2pp), with effects amplifying 2-4x on naturalistic Reddit conversations. Models can shift toward agreement (deference) or disagreement (skepticism) depending on domain -- the same model ranges from DDS = -53 on graduate-level science to +58 on social judgment. Ablations reveal that human-vs-LLM attribution drives the largest shifts (17.7pp swing), suggesting models treat disagreement with humans as more costly than with AI. Mitigation attempts reduce deference but can over-correct into skepticism, framing this as a calibration problem beyond accuracy optimization.

[7] Neural Induction of Finite-State Transducers

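Michael Ginn, Alexis Palmer, Mans Hulden

Main category: cs.CL

TL;DR: Proposes a new method that automatically constructs unweighted finite-state transducers (FSTs) from the hidden-state geometry of a recurrent neural network, substantially outperforming classical FST learning algorithms on morphological inflection, grapheme-to-phoneme conversion, and historical normalization.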

Motivation: Building finite-state transducers (FSTs) by hand is difficult, and existing automatic construction methods have limited effectiveness. A more efficient and accurate automatic method is needed.

Method: Uses the hidden-state geometry learned by a recurrent neural network (RNN) to automatically construct unweighted FSTs.

Result: Experiments on several real-world datasets show the constructed FSTs are accurate and robust, with accuracy on held-out test sets up to 87% higher than classical FST learning algorithms.

Conclusion: The method combines the expressive power of neural networks with the efficient inference of FSTs, providing a high-performing automatic modeling approach for string rewriting tasks.

Abstract: Finite-State Transducers (FSTs) are effective models for string-to-string rewriting tasks, often providing the efficiency necessary for high-performance applications, but constructing transducers by hand is difficult. In this work, we propose a novel method for automatically constructing unweighted FSTs following the hidden state geometry learned by a recurrent neural network. We evaluate our methods on real-world datasets for morphological inflection, grapheme-to-phoneme prediction, and historical normalization, showing that the constructed FSTs are highly accurate and robust for many datasets, substantially outperforming classical transducer learning algorithms by up to 87% accuracy on held-out test sets.

[8] Massively Multilingual Joint Segmentation and Glossing

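Michael Ginn, Lindia Tjuatja, Enora Rice, Ali Marashian, Maria Valentini, Jasmine Xu, Graham Neubig, Alexis Palmer

Main category: cs.CL

TL;DR: Introduces PolyGloss, a neural model that jointly predicts morphological segmentation and interlinear glosses. It improves on existing models in both interpretability and accuracy and can be adapted quickly to new datasets via low-rank adaptation.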

Motivation: Existing models such as GlossLM perform well on glossing but do not predict morpheme boundaries, so linguists find the output hard to trust and use.

Method: Extends the GlossLM training corpus and pretrains PolyGloss, a multilingual sequence-to-sequence model that performs segmentation and glossing jointly; the model can then be adapted to new data with low-rank adaptation.

Result: PolyGloss outperforms GlossLM on glossing and beats various open-source LLMs on segmentation, glossing, and the alignment between the two tasks.

Conclusion: Jointly modeling segmentation and glossing improves interpretability and practical usefulness. PolyGloss gives language documentation work a more reliable, easily adapted automation tool.

Abstract: Automated interlinear gloss prediction with neural networks is a promising approach to accelerate language documentation efforts. However, while state-of-the-art models like GlossLM achieve high scores on glossing benchmarks, user studies with linguists have found critical barriers to the usefulness of such models in real-world scenarios. In particular, existing models typically generate morpheme-level glosses but assign them to whole words without predicting the actual morpheme boundaries, making the predictions less interpretable and thus untrustworthy to human annotators. We conduct the first study on neural models that jointly predict interlinear glosses and the corresponding morphological segmentation from raw text. We run experiments to determine the optimal way to train models that balance segmentation and glossing accuracy, as well as the alignment between the two tasks. We extend the training corpus of GlossLM and pretrain PolyGloss, a family of seq2seq multilingual models for joint segmentation and glossing that outperforms GlossLM on glossing and beats various open-source LLMs on segmentation, glossing, and alignment. In addition, we demonstrate that PolyGloss can be quickly adapted to a new dataset via low-rank adaptation.

[9] Selecting Language Models for Social Science: Start Small, Start Open, and Validate

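Dustin S. Stoltz, Marshall A. Taylor, Sanuj Kumar

Main category: cs.CL

TL;DR: Discusses the validity, reliability, reproducibility, and replicability considerations social scientists should weigh when selecting pretrained language models, and argues for starting with small, open models and building task-specific benchmarks to validate the computational pipeline.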

Motivation: Faced with a large number of available pretrained language models, social scientists lack systematic selection criteria, especially for ensuring that results are scientifically sound and replicable.

Method: Guided by validity, reliability, reproducibility, and replicability, the paper analyzes the influence of model openness, model footprint, training data, and model architecture and fine-tuning, and argues for small, open models plus delimited benchmarks for ex-post validation.

Result: Ex-ante benchmarks alone do not guarantee a model's validity in social science applications, and replicability is the more pressing selection criterion. Reliably replicating a finding requires reliably reproducing the task.

Conclusion: Social scientists should start with smaller, open models and build targeted benchmarks to validate the entire computational pipeline, improving the replicability and rigor of their research.

Abstract: Currently, there are thousands of large pretrained language models (LLMs) available to social scientists. How do we select among them? Using validity, reliability, reproducibility, and replicability as guides, we explore the significance of: (1) model openness, (2) model footprint, (3) training data, and (4) model architectures and fine-tuning. While ex-ante tests of validity (i.e., benchmarks) are often privileged in these discussions, we argue that social scientists cannot altogether avoid validating computational measures (ex-post). Replicability, in particular, is a more pressing guide for selecting language models. Being able to reliably replicate a particular finding that entails the use of a language model necessitates reliably reproducing a task. To this end, we propose starting with smaller, open models, and constructing delimited benchmarks to demonstrate the validity of the entire computational pipeline.

[10] Multi-Stage Patient Role-Playing Framework for Realistic Clinical Interactions

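Shijie Jiang, Zefan Zhang, Kehua Zhu, Tian Bai, Ruihong Zhao

Main category: cs.CL

TL;DR: Presents Ch-PatientSim, the first Chinese patient simulation dataset built from realistic clinical scenarios, and proposes a training-free Multi-Stage Patient Role-Playing (MSPRP) framework to make LLM-simulated patient behavior more realistic and personalized.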

Motivation: Existing patient simulation methods rely on generic or LLM-generated dialogue data, which lacks authenticity and diversity and holds back clinical LLMs and medical diagnostic education. A more realistic Chinese patient simulation dataset and better persona expression are needed.

Method: Builds Ch-PatientSim, a realistic Chinese patient simulation dataset with a five-dimensional persona structure, and addresses class imbalance by augmenting data with few-shot generation followed by manual verification. Proposes the training-free MSPRP framework, which decomposes interaction into three stages to improve the personalization and realism of responses.

Result: Experiments on several state-of-the-art LLMs show that most produce overly formal responses lacking personality. With MSPRP, models improve significantly across multiple dimensions of patient simulation.

Conclusion: Ch-PatientSim provides a more realistic and diverse benchmark for evaluating Chinese clinical LLMs, and MSPRP improves personalized simulation without training, which matters for clinical AI and medical education.

Abstract: The simulation of realistic clinical interactions plays a pivotal role in advancing clinical Large Language Models (LLMs) and supporting medical diagnostic education. Existing approaches and benchmarks rely on generic or LLM-generated dialogue data, which limits the authenticity and diversity of doctor-patient interactions. In this work, we propose the first Chinese patient simulation dataset (Ch-PatientSim), constructed from realistic clinical interaction scenarios to comprehensively evaluate the performance of models in emulating patient behavior. Patients are simulated based on a five-dimensional persona structure. To address persona class imbalance, a portion of the dataset is augmented using few-shot generation, followed by manual verification. We evaluate various state-of-the-art LLMs and find that most produce overly formal responses that lack individual personality. To address this limitation, we propose a training-free Multi-Stage Patient Role-Playing (MSPRP) framework, which decomposes interactions into three stages to ensure both personalization and realism in model responses. Experimental results demonstrate that our approach significantly improves model performance across multiple dimensions of patient simulation.

[11] Steering Language Models Before They Speak: Logit-Level Interventions

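Hyeseon An, Shinwoo Park, Hyundong Jin, Yo-Sub Han

Main category: cs.CL

TL;DR: Proposes a training-free, inference-time logit intervention that steers LLM generation using a statistical token score table built from z-normalized log-odds over labeled corpora, improving control across tasks such as style, formality, and toxicity.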

Motivation: Existing steering methods have drawbacks: prompting lacks fine-grained control, and activation-based methods need access to internal layers. A training-free method with consistent, fine-grained control is needed.

Method: Builds a token score table from z-normalized log-odds of labeled corpora and intervenes on the logits at inference time to shift the decoding distribution, giving training-free controllable generation.

Result: Validated on three datasets covering writing complexity, formality, and toxicity, with gains of up to +47 percentage points in accuracy and a 50x F1 improvement, showing strong consistency and multi-task applicability.

Conclusion: Statistically grounded logit steering achieves broad, consistent, task-agnostic control over generation without training, offering an efficient and practical route to controllable LLM generation.

Abstract: Steering LLMs is essential for specialized applications such as style-sensitive text rewriting, user-adaptive communication, and toxicity mitigation. Current steering methods, such as prompting-based and activation-based approaches, are widely used to guide model behavior. However, activation-based techniques require deep access to internal layers, while prompting-based steering often fails to provide consistent or fine-grained control. To address these limitations, we propose a training-free inference-time logit intervention for controllable generation. Our approach utilizes a statistical token score table derived from z-normalized log-odds of labeled corpora to shift the decoding distribution. Empirical evaluations across three diverse datasets focusing on writing complexity, formality, and toxicity demonstrate that our method effectively steers output characteristics, confirming its broad applicability and task-agnostic nature. Our results show that statistically grounded logit steering can achieve large, consistent, and multi-task control gains: up to +47 percentage points in accuracy and a 50x F1 improvement.
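A token score table of this kind can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the paper's code: the add-one smoothing, the whitespace tokenizer, the `strength` multiplier, and the function names `token_scores` and `steer_logits` are all hypothetical choices. It computes a smoothed log-odds difference per token between a target corpus and a contrast corpus, z-normalizes the result, and adds the scaled score to the logits before the softmax.

```python
import math
from collections import Counter


def token_scores(target_docs, other_docs, smoothing=1.0):
    """z-normalized, smoothed log-odds of each token for the target label."""
    tgt = Counter(t for d in target_docs for t in d.split())
    oth = Counter(t for d in other_docs for t in d.split())
    vocab = sorted(set(tgt) | set(oth))
    n_t = sum(tgt.values()) + smoothing * len(vocab)
    n_o = sum(oth.values()) + smoothing * len(vocab)
    raw = {}
    for tok in vocab:
        p_t = (tgt[tok] + smoothing) / n_t
        p_o = (oth[tok] + smoothing) / n_o
        # Difference of log-odds between the two corpora.
        raw[tok] = math.log(p_t / (1 - p_t)) - math.log(p_o / (1 - p_o))
    mean = sum(raw.values()) / len(raw)
    std = math.sqrt(sum((v - mean) ** 2 for v in raw.values()) / len(raw)) or 1.0
    return {tok: (v - mean) / std for tok, v in raw.items()}


def steer_logits(logits, scores, strength=2.0):
    """Shift each logit by strength * score; unseen tokens are left unchanged."""
    return {tok: l + strength * scores.get(tok, 0.0) for tok, l in logits.items()}


def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(l - m) for t, l in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}


formal = ["we kindly request assistance", "we kindly appreciate your reply"]
informal = ["gimme help now", "hey gimme that"]
scores = token_scores(formal, informal)
probs = softmax(steer_logits({"kindly": 0.0, "gimme": 0.0}, scores))
print(probs["kindly"] > probs["gimme"])
```

Because the intervention only touches the final logits, it needs no gradient access or internal activations, which is the property the abstract contrasts with activation-based steering.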

[12] ZPD Detector: Data Selection via Capability-Difficulty Alignment for Large Language Models

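Bo Yang, Yunkui Chen, Lanfei Feng, Yu Zhang, Shijian Li

Main category: cs.CL

TL;DR: Proposes ZPD Detector, a dynamic data selection framework based on the Zone of Proximal Development theory that models the match between sample difficulty and model capability to improve data efficiency in LLM training.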

Motivation: As LLM training costs rise and high-quality data becomes scarcer, existing static data selection methods cannot capture the changing relationship between model and data. A selection mechanism that adapts as training proceeds is needed.

Method: Drawing on the Zone of Proximal Development (ZPD) theory from education and on Item Response Theory (IRT), ZPD Detector (1) calibrates difficulty, (2) estimates model capability, and (3) computes a capability-difficulty matching score to dynamically select the most informative samples at each stage.

Result: The method dynamically identifies training samples suited to the model's current capability level, improving data utilization efficiency and offering a new perspective on training strategy design.

Conclusion: By modeling the dynamic alignment between model capability and sample difficulty, ZPD Detector uses data more efficiently and illustrates the potential of educational theory for data scheduling in deep learning.

Abstract: As the cost of training large language models continues to increase and high-quality training data become increasingly scarce, selecting high-value samples or synthesizing effective training data under limited data budgets has emerged as a critical research problem. Most existing data selection methods rely on static criteria, such as difficulty, uncertainty, or heuristics, and fail to model the evolving relationship between the model and the data. Inspired by the educational theory of the Zone of Proximal Development (ZPD), we propose ZPD Detector, a data selection framework that adopts a bidirectional perspective between models and data by explicitly modeling the alignment between sample difficulty and the model's current capability. ZPD Detector integrates difficulty calibration, model capability estimation based on Item Response Theory (IRT), and a capability-difficulty matching score to dynamically identify the most informative samples at each learning stage, improving data utilization efficiency; moreover, this dynamic matching strategy provides new insights into training strategy design. All code and data will be released after our work is accepted, to support reproducible research.
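The abstract does not define the capability-difficulty matching score, so the following is only one plausible reading. It assumes a one-parameter (Rasch) IRT model and uses the item information p(1 - p) as a stand-in matching score, which peaks when sample difficulty equals model ability. The names `p_correct`, `match_score`, and `select_samples`, and the toy difficulty values, are hypothetical.

```python
import math


def p_correct(theta, b):
    """Rasch (1PL) IRT: probability that ability theta solves difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def match_score(theta, b):
    """Item information p * (1 - p); largest when b is close to theta.

    Used here as an assumed stand-in for a capability-difficulty match.
    """
    p = p_correct(theta, b)
    return p * (1.0 - p)


def select_samples(theta, difficulties, k):
    """Return the k sample ids whose difficulty best matches ability theta."""
    ranked = sorted(
        difficulties,
        key=lambda name: match_score(theta, difficulties[name]),
        reverse=True,
    )
    return ranked[:k]


pool = {"easy": -2.0, "medium": 0.1, "hard": 3.0}  # hypothetical calibrated difficulties
print(select_samples(0.0, pool, 1))   # early in training: ability near 0
print(select_samples(2.5, pool, 1))   # later in training: ability has grown
```

Re-estimating `theta` as training proceeds and re-ranking the pool gives a selection that moves with the model, which is the dynamic behavior the abstract contrasts with static difficulty or uncertainty criteria.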

[13] When Personalization Misleads: Understanding and Mitigating Hallucinations in Personalized LLMs

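Zhongxiang Sun, Yi Zhan, Chenglei Shen, Weijie Yu, Xiao Zhang, Ming He, Jun Xu

Main category: cs.CL

TL;DR: Proposes FPPS, a lightweight inference-time method that mitigates factual distortion in personalized LLMs while preserving personalization, and builds PFQABench, the first benchmark for jointly evaluating factual and personalized question answering.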

Motivation: Personalized LLMs may hallucinate answers that contradict facts because of a user's history and preferences, which harms factual reliability. Factual accuracy must be maintained alongside personalization.

Method: Proposes Factuality-Preserving Personalized Steering (FPPS), which disentangles personalization from factual representations at inference time to reduce factual distortion, and builds the PFQABench benchmark to jointly evaluate personalized and factual question answering.

Result: Experiments across multiple LLM backbones and personalization methods show that FPPS substantially improves factual accuracy while maintaining personalized performance.

Conclusion: FPPS effectively mitigates personalization-induced hallucination, strengthening factual reliability while preserving personalized behavior, and offers a workable approach to safe personalized LLMs.

Abstract: Personalized large language models (LLMs) adapt model behavior to individual users to enhance user satisfaction, yet personalization can inadvertently distort factual reasoning. We show that when personalized LLMs face factual queries, the model can generate answers aligned with a user's prior history rather than the objective truth. These personalization-induced hallucinations degrade factual reliability and may propagate incorrect beliefs; they arise from representational entanglement between personalization and factual representations. To address this issue, we propose Factuality-Preserving Personalized Steering (FPPS), a lightweight inference-time approach that mitigates personalization-induced factual distortions while preserving personalized behavior. We further introduce PFQABench, the first benchmark designed to jointly evaluate factual and personalized question answering under personalization. Experiments across multiple LLM backbones and personalization methods show that FPPS substantially improves factual accuracy while maintaining personalized performance.

[14] Redefining Machine Simultaneous Interpretation: From Incremental Translation to Human-Like Strategies

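Qianen Zhang, Zeyu Yang, Satoshi Nakamura

Main category: cs.CL

TL;DR: Proposes an LLM-based simultaneous machine translation framework that extends the action space with adaptive actions (sentence cutting, dropping, partial summarization, and pronominalization) to achieve low-latency translation while preserving semantic fidelity.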

Motivation: Conventional simultaneous machine translation uses only READ/WRITE actions and struggles to maintain high-quality translation under strict real-time constraints. More flexible adaptive mechanisms are needed.

Method: Introduces four adaptive actions (Sentence_Cut, Drop, Partial_Summarization, Pronominalization), constructs training references in an LLM through action-aware prompting, and designs a latency-aware TTS evaluation pipeline.

Result: On the ACL60/60 English-Chinese, English-German, and English-Japanese benchmarks, the framework improves semantic metrics and lowers delay relative to reference translations and salami-based baselines. Combining Drop and Sentence_Cut in particular improves the balance between fluency and latency.

Conclusion: Extending the action space is an effective path for LLM-driven simultaneous machine translation and helps narrow the gap between human and machine interpretation.

Abstract: Simultaneous Machine Translation (SiMT) requires high-quality translations under strict real-time constraints, which traditional policies with only READ/WRITE actions cannot fully address. We extend the action space of SiMT with four adaptive actions: Sentence_Cut, Drop, Partial_Summarization and Pronominalization, which enable real-time restructuring, omission, and simplification while preserving semantic fidelity. We adapt these actions in a large language model (LLM) framework and construct training references through action-aware prompting. To evaluate both quality and word-level monotonicity, we further develop a latency-aware TTS pipeline that maps textual outputs to speech with realistic timing. Experiments on the ACL60/60 English-Chinese, English-German and English-Japanese benchmarks show that our framework consistently improves semantic metrics and achieves lower delay compared to reference translations and salami-based baselines. Notably, combining Drop and Sentence_Cut leads to consistent improvements in the balance between fluency and latency. These results demonstrate that enriching the action space of LLM-based SiMT provides a promising direction for bridging the gap between human and machine interpretation.

[15] NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems

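Jiayu Liu, Rui Wang, Qing Zong, Qingcheng Zeng, Tianshi Zheng, Haochen Shi, Dadi Guo, Baixuan Xu, Chunyang Li, Yangqiu Song

Main category: cs.CL

TL;DR: Studies confidence calibration of LLMs in retrieval-augmented generation (RAG) and finds that noisy retrieved context makes models overconfident. Proposes NAACL Rules and the NAACL framework, which uses supervised fine-tuning on about 2K HotpotQA examples to give models noise awareness, substantially improving calibration both in-domain and out-of-domain.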

Motivation: Accurately assessing model confidence is essential when deploying LLMs in critical factual domains. Confidence calibration in existing RAG settings suffers from noisy context, so model overconfidence under noise needs to be addressed.

Method: Proposes NAACL Rules as a principled foundation and designs the NAACL framework, which synthesizes supervision from about 2,000 HotpotQA examples and uses supervised fine-tuning (SFT) to give models intrinsic noise awareness without relying on a stronger teacher model.

Result: NAACL substantially improves calibration on four benchmarks, improving ECE by 10.9% in-domain and 8.0% out-of-domain.

Conclusion: NAACL bridges the gap between retrieval noise and verbal calibration, offering a workable path toward LLMs that are both accurate and epistemically reliable.

Abstract: Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance due to noisy retrieved contexts. Specifically, contradictory or irrelevant evidence tends to inflate the model's false certainty, leading to severe overconfidence. To address this, we propose NAACL Rules (Noise-AwAre Confidence CaLibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NAACL, a noise-aware calibration framework that synthesizes supervision from about 2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NAACL equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NAACL yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NAACL paves the way for both accurate and epistemically reliable LLMs.
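The ECE figures quoted above refer to expected calibration error, for which lower is better. As a reminder of what the metric measures, here is a minimal sketch of the standard equal-width binned estimator; the bin count and the function name are illustrative choices, and the paper's exact evaluation setup is not given in this digest.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: stated confidences in [0, 1]
    correct:     1 if the corresponding answer was right, else 0
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(h for _, h in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece


# An overconfident model: says 0.95 four times but is right only twice.
print(expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 0, 0]))
```

The toy input mirrors the failure mode the abstract describes: stated confidence stays high while accuracy drops, so the gap between the two shows up directly in the score.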

[16] Finding the Translation Switch: Discovering and Exploiting the Task-Initiation Features in LLMs

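Xinwei Wu, Heng Liu, Xiaohu Zhao, Yuqi Ren, Linlong Xu, Longyue Wang, Deyi Xiong, Weihua Luo, Kaifu Zhang

Main category: cs.CL

TL;DR: Proposes a new framework based on sparse autoencoders (SAEs) and a PCA-based consistency metric for identifying translation-specific features in LLMs. It finds a set of "translation initiation" features and verifies through causal intervention that they are central to the model's translation ability. Building on this mechanism, it designs a data selection strategy targeting "mechanistically hard" samples that improves fine-tuning data efficiency and suppresses hallucination; the mechanism also transfers to larger models of the same family.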

Motivation: LLMs can translate without fine-tuning, but the internal mechanism is opaque, and this innate translation ability has been neither explained nor effectively exploited.

Method: Uses sparse autoencoders (SAEs) to extract activation patterns, recalls candidate features by co-activation frequency, then filters for functionally coherent translation-related features with a PCA-based consistency metric. Validates feature function by causal intervention and, on that basis, proposes a data selection strategy for "mechanistically hard" samples for efficient fine-tuning.

Result: Identifies a set of "translation initiation" features, and intervention experiments show they causally affect translation behavior. The data selection strategy built on them significantly improves fine-tuning efficiency and reduces hallucination. The mechanism transfers to larger models of the same family.

Conclusion: The study uncovers a core mechanism of innate translation ability in LLMs, the translation initiation features, opens a new route to understanding model internals, and shows how mechanistic analysis can guide efficient training toward more robust and efficient language models.

Abstract: Large Language Models (LLMs) frequently exhibit strong translation abilities, even without task-specific fine-tuning. However, the internal mechanisms governing this innate capability remain largely opaque. To demystify this process, we leverage Sparse Autoencoders (SAEs) and introduce a novel framework for identifying task-specific features. Our method first recalls features that are frequently co-activated on translation inputs and then filters them for functional coherence using a PCA-based consistency metric. This framework successfully isolates a small set of **translation initiation** features. Causal interventions demonstrate that amplifying these features steers the model towards correct translation, while ablating them induces hallucinations and off-task outputs, confirming they represent a core component of the model's innate translation competency. Moving from analysis to application, we leverage this mechanistic insight to propose a new data selection strategy for efficient fine-tuning. Specifically, we prioritize training on **mechanistically hard** samples, namely those that fail to naturally activate the translation initiation features. Experiments show this approach significantly improves data efficiency and suppresses hallucinations. Furthermore, we find these mechanisms are transferable to larger models of the same family. Our work not only decodes a core component of the translation mechanism in LLMs but also provides a blueprint for using internal model mechanisms to create more robust and efficient models.
The code is available at https://github.com/flamewei123/AAAI26-translation-Initiation-Features.

[17] From Interpretability to Performance: Optimizing Retrieval Heads for Long-Context Language Models

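Youmi Ma, Naoaki Okazaki

Main category: cs.CL

TL;DR: Proposes RetMask, a method grounded in mechanistic interpretability that generates training signals by masking retrieval heads. It substantially improves LLM performance on long-context tasks, validating the function of retrieval heads and their potential for performance gains.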

Motivation: To explore the role of retrieval heads in improving LLM long-context ability, filling the gap between what is known about their function and actual performance gains.

Method: RetMask generates training signals by contrasting normal model outputs with outputs from the model with retrieval heads masked, and uses them to improve long-context performance.

Result: For Llama-3.1 at 128K context length, HELMET improves by +2.28 points, with +70% on generation with citation and +32% on passage re-ranking, while general-task performance is preserved. Experiments across three model families show clear gains for models with concentrated retrieval heads and limited gains for models with distributed ones.

Conclusion: Retrieval heads do play a key role in extracting information from context, their organization affects how much performance improves, and mechanistic insights can be turned into practical performance gains.

Abstract: Advances in mechanistic interpretability have identified special attention heads, known as retrieval heads, that are responsible for retrieving information from the context. However, the role of these retrieval heads in improving model performance remains unexplored. This work investigates whether retrieval heads can be leveraged to enhance the long-context capabilities of LLMs. Specifically, we propose RetMask, a method that generates training signals by contrasting normal model outputs with those from an ablated variant in which the retrieval heads are masked. This mechanism-based approach achieves substantial improvements: +2.28 points on HELMET at 128K for Llama-3.1, with +70% gains on generation with citation and +32% on passage re-ranking, while preserving performance on general tasks. Experiments across three model families reveal that the effectiveness depends on retrieval head organization: models with concentrated patterns of retrieval heads respond strongly, while those with distributed patterns show limited gains. This mechanistic relationship validates the function of retrieval heads and demonstrates that mechanistic insights can be transformed into performance enhancements.

[18] Budget-Aware Anytime Reasoning with LLM-Synthesized Preference Data

Xuanming Zhang,Shwan Ashrafi,Aziza Mirsaidova,Amir Rezaeian,Miguel Ballesteros,Lydia B. Chilton,Zhou Yu,Dan Roth

Main category: cs.CL

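TL;DR: This paper proposes an anytime reasoning framework and the Anytime Index, a metric for the reasoning efficiency of large language models under limited computation budgets. It also achieves inference-time self-improvement with LLM-synthesized preference data, improving reasoning quality and efficiency for several models on multiple datasets.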

Details Motivation: When computation is limited, large language models need to produce useful partial solutions quickly within a fixed reasoning budget rather than perform costly exhaustive reasoning, to meet the needs of real-world tasks such as trip planning. Method: Introduces an anytime reasoning framework and the Anytime Index, and proposes an inference-time self-improvement method based on LLM-synthesized preference data, so that models learn from comparisons of their own reasoning and produce better intermediate outputs. Result: On the NaturalPlan (Trip), AIME, and GPQA datasets, the Grok-3, GPT-oss, GPT-4.1/4o, and LLaMA models all show consistent gains in reasoning quality and efficiency. Conclusion: The proposed framework and method improve the reasoning efficiency and solution quality of large language models under constrained computation budgets and are of practical value. Abstract: We study the reasoning behavior of large language models (LLMs) under limited computation budgets. In such settings, producing useful partial solutions quickly is often more practical than exhaustive reasoning, which incurs high inference costs. Many real-world tasks, such as trip planning, require models to deliver the best possible output within a fixed reasoning budget. We introduce an anytime reasoning framework and the Anytime Index, a metric that quantifies how effectively solution quality improves as reasoning tokens increase. To further enhance efficiency, we propose an inference-time self-improvement method using LLM-synthesized preference data, where models learn from their own reasoning comparisons to produce better intermediate solutions. Experiments on NaturalPlan (Trip), AIME, and GPQA datasets show consistent gains across Grok-3, GPT-oss, GPT-4.1/4o, and LLaMA models, improving both reasoning quality and efficiency under budget constraints.
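
The abstract above does not give a formula for the Anytime Index. As a rough illustration only, the sketch below uses a stand-in that is an assumption rather than the authors' definition: the area under a quality-versus-token-budget curve, normalized by the budget span, so that a model whose quality rises early scores higher. The function name `anytime_index_sketch` and the sample numbers are hypothetical.

```python
def anytime_index_sketch(budgets, qualities):
    """Normalized area under a quality-vs-budget curve (trapezoidal rule).

    Assumption: this is an illustrative stand-in, not the paper's metric.
    budgets   -- reasoning-token budgets at which quality was measured
    qualities -- solution quality at each budget, e.g. in [0, 1]
    """
    if len(budgets) != len(qualities) or len(budgets) < 2:
        raise ValueError("need at least two (budget, quality) pairs")
    points = sorted(zip(budgets, qualities))
    span = points[-1][0] - points[0][0]
    if span <= 0:
        raise ValueError("budgets must cover a positive range")
    area = 0.0
    # Sum trapezoids between consecutive measurement points.
    for (b0, q0), (b1, q1) in zip(points, points[1:]):
        area += 0.5 * (q0 + q1) * (b1 - b0)
    return area / span


# Two hypothetical models that reach the same final quality:
slow = anytime_index_sketch([0, 100, 200], [0.0, 0.5, 1.0])
fast = anytime_index_sketch([0, 100, 200], [0.0, 1.0, 1.0])
```

Under this stand-in, both models end at the same quality, but the one that gets there with fewer reasoning tokens receives the higher score, which is the behavior the abstract describes the metric as rewarding.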

[19] Spectral Characterization and Mitigation of Sequential Knowledge Editing Collapse

Chi Zhang,Mengqi Zhang,Xiaotian Ye,Runxi Cheng,Zisheng Zhou,Ying Zhou,Pengjie Ren,Zhumin Chen

Main category: cs.CL

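TL;DR: This paper uses spectral analysis to show that the general abilities of large language models are closely tied to the dominant singular directions of their pretrained weight matrices, and proposes REVIVE, a plug-and-play framework that stabilizes model performance during sequential knowledge editing by protecting those directions.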

Details Motivation: Sequential knowledge editing often causes a catastrophic collapse of a large language model's general abilities, especially with parameter-modifying methods, yet the degradation mechanism remains unclear, so the problem needs to be understood in depth and addressed. Method: Spectral analysis is used to study how sequential knowledge editing affects weight matrices. The proposed REVIVE framework represents parameter updates in the spectral basis of the original weights and filters out components that would interfere with the dominant singular subspace, protecting the model's general abilities. Result: Experiments across multiple models and benchmarks show that REVIVE improves editing efficacy and largely preserves general abilities under long sequences of edits (up to 20,000 edits). Conclusion: A model's general abilities depend on the dominant singular directions of its weight matrices, and by explicitly protecting these directions REVIVE achieves stable and effective sequential knowledge editing. Abstract: Sequential knowledge editing in large language models often causes catastrophic collapse of the model's general abilities, especially for parameter-modifying methods. Existing approaches mitigate this issue through heuristic constraints on parameter updates, yet the mechanisms underlying such degradation remain insufficiently understood. In this work, we present a spectral analysis of sequential knowledge editing and show that a model's general abilities are closely associated with dominant singular directions of pretrained weight matrices. These directions are highly sensitive to perturbations and are progressively disrupted by repeated edits, closely tracking the collapse in both editing efficacy and general performance. Building on this insight, we propose REVIVE, a plug-and-play framework that stabilizes sequential editing by explicitly preserving the dominant singular subspace. REVIVE represents parameter updates in the spectral basis of the original weights and filters components that would interfere with the protected region. Extensive experiments across multiple models and benchmarks show that REVIVE consistently improves editing efficacy while substantially preserving general abilities under long-horizon sequential editing, including extreme settings with up to 20,000 edits.
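
The idea of expressing an update in the spectral basis of the original weights and dropping the parts that touch the dominant subspace can be sketched with NumPy. This is a minimal sketch under stated assumptions, not REVIVE's actual filter: the function name `filter_update`, the hard zeroing rule, and the choice of `k` are all hypothetical, and the abstract does not say how the protected region is chosen.

```python
import numpy as np


def filter_update(weight, delta, k):
    """Remove from `delta` every component that touches the top-k singular
    directions of `weight` (hypothetical filter, for illustration only)."""
    # Full SVD so that U and V are complete orthonormal bases.
    u, _, vt = np.linalg.svd(weight, full_matrices=True)
    # Coordinates of the update in the spectral basis: delta = U @ coeffs @ Vt.
    coeffs = u.T @ delta @ vt.T
    # Zero every coefficient whose row or column index lies in the
    # protected (dominant) subspace.
    coeffs[:k, :] = 0.0
    coeffs[:, :k] = 0.0
    # Map the filtered coefficients back to parameter space.
    return u @ coeffs @ vt


rng = np.random.default_rng(0)
w = rng.normal(size=(6, 5))   # stand-in for a pretrained weight matrix
d = rng.normal(size=(6, 5))   # stand-in for an editing update
filtered = filter_update(w, d, k=2)
```

After filtering, the update has no component along the top-2 left or right singular vectors of `w`, so adding it to `w` leaves the action of the weights on those directions unchanged.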

[20] CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs

Yuanxiang Liu,Songze Li,Xiaoke Guo,Zhaoyan Gong,Qifei Zhang,Huajun Chen,Wen Zhang

Main category: cs.CL

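TL;DR: This paper proposes CoG, a training-free framework that mimics the dual-process theory of intuitive and deliberate thinking. It combines a Relational Blueprint Guidance module with a Failure-Aware Refinement module to improve the reasoning accuracy and robustness of knowledge-graph-augmented large language models.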

Details Motivation: Large language models suffer from hallucinations and reliability problems in reasoning tasks, and existing knowledge-graph-augmented methods use a single search strategy, which makes them vulnerable to neighborhood noise and structural misalignment and leads to reasoning stagnation. Method: The CoG framework is designed around Dual-Process Theory: 1) the Relational Blueprint Guidance module acts as the fast, intuitive process and uses relational blueprints as soft structural constraints to stabilize the search direction; 2) the Failure-Aware Refinement module acts as the analytical process and triggers evidence-conditioned reflection and controlled backtracking when reasoning stalls. Result: Experiments on three benchmarks show that CoG significantly outperforms state-of-the-art methods in both accuracy and efficiency. Conclusion: By mimicking the human dual-process cognitive mechanism, CoG improves the stability and flexibility of reasoning in KG-augmented LLMs and offers a new approach to building more reliable reasoning systems. Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities but often grapple with reliability challenges like hallucinations. While Knowledge Graphs (KGs) offer explicit grounding, existing paradigms of KG-augmented LLMs typically exhibit cognitive rigidity, applying homogeneous search strategies that render them vulnerable to instability under neighborhood noise and structural misalignment, leading to reasoning stagnation. To address these challenges, we propose CoG, a training-free framework inspired by Dual-Process Theory that mimics the interplay between intuition and deliberation. First, functioning as the fast, intuitive process, the Relational Blueprint Guidance module leverages relational blueprints as interpretable soft structural constraints to rapidly stabilize the search direction against noise. Second, functioning as the prudent, analytical process, the Failure-Aware Refinement module intervenes upon encountering reasoning impasses. It triggers evidence-conditioned reflection and executes controlled backtracking to overcome reasoning stagnation. Experimental results on three benchmarks demonstrate that CoG significantly outperforms state-of-the-art approaches in both accuracy and efficiency.

[21] Efficient Multilingual Name Type Classification Using Convolutional Networks

Davor Lauc

Main category: cs.CL

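TL;DR: This paper proposes Onomas-CNN X, a convolutional neural network that classifies proper names by language and entity type across many languages. With accuracy comparable to XLM-RoBERTa, it is 46 times faster and uses 46 times less energy.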

Details Motivation: Lightweight, efficient multilingual name classification is needed, which raises the question of whether a specialized CNN architecture with sufficient training data can compete with large pre-trained models on a focused NLP task. Method: Parallel convolution branches are combined with depthwise-separable convolutions and a hierarchical classification mechanism to build Onomas-CNN X, an efficient CNN suited to CPUs. Result: On a large dataset covering 104 languages and four entity types, the model reaches 92.1% accuracy and processes 2,813 names per second on a single CPU core, 46 times faster than XLM-RoBERTa, with energy consumption reduced by a factor of 46. Conclusion: When training data is sufficient, specialized CNN architectures can still match large pre-trained models on focused NLP tasks while being far more efficient and energy-saving. Abstract: We present a convolutional neural network approach for classifying proper names by language and entity type. Our model, Onomas-CNN X, combines parallel convolution branches with depthwise-separable operations and hierarchical classification to process names efficiently on CPU hardware. We evaluate the architecture on a large multilingual dataset covering 104 languages and four entity types (person, organization, location, other). Onomas-CNN X achieves 92.1% accuracy while processing 2,813 names per second on a single CPU core, 46 times faster than fine-tuned XLM-RoBERTa with comparable accuracy. The model reduces energy consumption by a factor of 46 compared to transformer baselines. Our experiments demonstrate that specialized CNN architectures remain competitive with large pre-trained models for focused NLP tasks when sufficient training data exists.

[22] Integrity Shield A System for Ethical AI Use & Authorship Transparency in Assessments

Ashish Raj Shekhar,Shiven Agarwal,Priyanuj Bordoloi,Yash Shah,Tejas Anvekar,Vivek Gupta

Main category: cs.CL

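TL;DR: Integrity Shield is a document-layer watermarking system that embeds detectable watermarks without changing a document's visual appearance. It prevents large language models from answering shielded exams and achieves high rates of answer blocking and signature recovery.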

Details Motivation: Large language models can parse exam PDFs and answer them directly, which threatens academic integrity, and existing watermarking techniques fail with black-box models and in real teaching settings, so a way to protect assessment materials without controlling the model is urgently needed. Method: Integrity Shield embeds schema-aware, item-level invisible watermarks into assessment PDFs so that large language models cannot answer the questions correctly, and stable signatures can be recovered from responses to trace where an answer came from. Result: Across 30 exams spanning STEM, humanities, and medical reasoning, Integrity Shield achieves 91-94% exam-level blocking and 89-93% signature retrieval, with stable results on four commercial MLLMs. Conclusion: Integrity Shield offers a practical and reliable defense that counters the risk of LLM-enabled academic cheating without changing how exam files look, while also supporting accountability. Abstract: Large Language Models (LLMs) can now solve entire exams directly from uploaded PDF assessments, raising urgent concerns about academic integrity and the reliability of grades and credentials. Existing watermarking techniques either operate at the token level or assume control over the model's decoding process, making them ineffective when students query proprietary black-box systems with instructor-provided documents. We present Integrity Shield, a document-layer watermarking system that embeds schema-aware, item-level watermarks into assessment PDFs while keeping their human-visible appearance unchanged. These watermarks consistently prevent MLLMs from answering shielded exam PDFs and encode stable, item-level signatures that can be reliably recovered from model or student responses. Across 30 exams spanning STEM, humanities, and medical reasoning, Integrity Shield achieves exceptionally high prevention (91-94% exam-level blocking) and strong detection reliability (89-93% signature retrieval) across four commercial MLLMs. Our demo showcases an interactive interface where instructors upload an exam, preview watermark behavior, and inspect pre/post AI performance & authorship evidence.

[23] The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora

Taja Kuzman Pungeršek,Peter Rupnik,Vít Suchomel,Nikola Ljubešić

Main category: cs.CL

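TL;DR: This paper presents the CLASSLA-web 2.0 corpora, large web corpora of 17.0 billion words in seven languages built by continuously crawling South Slavic and related national top-level domains, now also annotated with topic labels. The content is largely new compared with the previous version, but machine-generated sites are degrading the quality of web content.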

Details Motivation: To obtain large-scale text data for the less-resourced South Slavic languages, building on the success of CLASSLA-web 1.0 by establishing a web crawling infrastructure that can be run iteratively. Method: Multiple national top-level domains are crawled systematically in continuous iterations, and the newly collected texts are automatically annotated with topic labels and genre categories. Result: The CLASSLA-web 2.0 corpus collection is released, with 17.0 billion words in 38.1 million texts across seven languages; only one-fifth of the texts overlap with the previous version; a large amount of machine-generated content is found to affect corpus quality. Conclusion: Re-crawling yields new text and greatly expands the corpora, but the growing number of automatically generated sites creates data quality challenges that future work must address. Abstract: Crawling national top-level domains has proven to be highly effective for collecting texts in less-resourced languages. This approach has been recently used for South Slavic languages and resulted in the largest general corpora for this language group: the CLASSLA-web 1.0 corpora. Building on this success, we established a continuous crawling infrastructure for iterative national top-level domain crawling across South Slavic and related webs. We present the first outcome of this crawling infrastructure: the CLASSLA-web 2.0 corpus collection, with substantially larger web corpora containing 17.0 billion words in 38.1 million texts in seven languages: Bosnian, Bulgarian, Croatian, Macedonian, Montenegrin, Serbian, and Slovenian. In addition to genre categories, the new version is also automatically annotated with topic labels. Comparing CLASSLA-web 2.0 with its predecessor reveals that only one-fifth of the texts overlap, showing that re-crawling after just two years yields largely new content. However, while the new web crawls bring growing gains, we also notice growing pains: a manual inspection of top domains reveals a visible degradation of web content, as machine-generated sites now contribute a significant portion of texts.

[24] DOREMI: Optimizing Long Tail Predictions in Document-Level Relation Extraction

Laura Menotti,Stefano Marchesin,Gianmaria Silvello

Main category: cs.CL

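TL;DR: This paper proposes DOREMI, a framework that addresses the long-tail distribution problem in document-level relation extraction by improving performance on rare relations with a small number of targeted manual annotations.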

Details Motivation: Document-level relation extraction relies on cross-sentence context and its relation types follow a long-tail distribution, so rare relations have few examples and models perform poorly on them. Method: The DOREMI framework iteratively selects the most informative examples for manual annotation to strengthen learning of underrepresented relations, and it can be integrated into any existing DocRE model. Result: DOREMI reduces reliance on noisy data and is effective at mitigating long-tail bias, improving performance on rare relations and overall generalization. Conclusion: DOREMI is an efficient and scalable way to mitigate the long-tail problem in document-level relation extraction. Abstract: Document-Level Relation Extraction (DocRE) presents significant challenges due to its reliance on cross-sentence context and the long-tail distribution of relation types, where many relations have scarce training examples. In this work, we introduce DOcument-level Relation Extraction optiMizing the long taIl (DOREMI), an iterative framework that enhances underrepresented relations through minimal yet targeted manual annotations. Unlike previous approaches that rely on large-scale noisy data or heuristic denoising, DOREMI actively selects the most informative examples to improve training efficiency and robustness. DOREMI can be applied to any existing DocRE model and is effective at mitigating long-tail biases, offering a scalable solution to improve generalization on rare relations.

[25] T$^\star$: Progressive Block Scaling for MDM Through Trajectory Aware RL

Hanchen Xia,Baoyou Chen,Yutang Ge,Guojiang Zhao,Siyu Zhu

Main category: cs.CL

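TL;DR: T$^\star$ is a TraceRL-based training curriculum that progressively scales the block size in masked diffusion language models, enabling highly parallel decoding with little performance loss.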

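Details Motivation: Increasing the decoding parallelism of masked diffusion language models while preserving performance requires an effective way to scale up the block size gradually. Method: T$^\star$ is a progressive, TraceRL-based training curriculum that starts from a small-block model and transitions smoothly to larger blocks; an alternative decoding schedule $\hat{\rm S}$ is also explored. Result: T$^\star$ enables higher-parallelism decoding with minimal performance degradation on math reasoning benchmarks, and it can converge to an alternative decoding schedule $\hat{\rm S}$ with comparable performance. Conclusion: T$^\star$ gives masked diffusion language models an effective block-size scaling strategy that balances decoding efficiency and model performance. Abstract: We present T$^\star$, a simple TraceRL-based training curriculum for progressive block-size scaling in masked diffusion language models (MDMs). Starting from an AR-initialized small-block MDM, T$^\star$ transitions smoothly to larger blocks, enabling higher-parallelism decoding with minimal performance degradation on math reasoning benchmarks. Moreover, further analysis suggests that T$^\star$ can converge to an alternative decoding schedule $\hat{\rm S}$ that achieves comparable performance.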

[26] MultiCaption: Detecting disinformation using multilingual visual claims

Rafael Martins Frade,Rrubaa Panchendrarajan,Arkaitz Zubiaga

Main category: cs.CL

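TL;DR: This paper presents MultiCaption, a dataset for detecting contradictions between visual claims in multilingual, multimodal settings, and shows experimentally that the task is challenging and requires task-specific fine-tuning.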

Details Motivation: Existing automated fact-checking methods are limited by the lack of datasets that reflect real-world complexity, especially the spread of misleading content across languages and media. Method: MultiCaption, a multilingual multimodal dataset of 11,088 visual claims in 64 languages, is built, with several labeling strategies used to determine whether pairs of claims contradict each other; experiments use transformer-based architectures, natural language inference models, and large language models. Result: MultiCaption is more challenging than standard NLI tasks and requires task-specific fine-tuning for strong performance; the gains from multilingual training and testing show the dataset's potential for building multilingual fact-checking systems without machine translation. Conclusion: MultiCaption is a valuable resource for disinformation detection in multilingual, multimodal settings and advances automated fact-checking systems. Abstract: Online disinformation poses an escalating threat to society, driven increasingly by the rapid spread of misleading content across both multimedia and multilingual platforms. While automated fact-checking methods have advanced in recent years, their effectiveness remains constrained by the scarcity of datasets that reflect these real-world complexities. To address this gap, we first present MultiCaption, a new dataset specifically designed for detecting contradictions in visual claims. Pairs of claims referring to the same image or video were labeled through multiple strategies to determine whether they contradict each other. The resulting dataset comprises 11,088 visual claims in 64 languages, offering a unique resource for building and evaluating misinformation-detection systems in truly multimodal and multilingual environments. We then provide comprehensive experiments using transformer-based architectures, natural language inference models, and large language models, establishing strong baselines for future research. The results show that MultiCaption is more challenging than standard NLI tasks, requiring task-specific finetuning for strong performance. Moreover, the gains from multilingual training and testing highlight the dataset's potential for building effective multilingual fact-checking pipelines without relying on machine translation.

[27] Language of Thought Shapes Output Diversity in Large Language Models

Shaoyang Xu,Wenxuan Zhang

Main category: cs.CL

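TL;DR: This paper proposes increasing output diversity by changing a large language model's "language of thought" (the language used for thinking, as opposed to the output language). Thinking languages farther from English yield markedly more diverse outputs, sampling across multiple languages improves diversity further, and the approach brings practical benefits in coverage of cultural knowledge and values.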

Details Motivation: To explore a new, structural way of increasing the output diversity of large language models in support of pluralism and creativity. Method: The concept of a "language of thought" is introduced, and Single-Language Sampling and Mixed-Language Sampling are compared; with the output language held fixed to English, the effect of different non-English thinking languages on output diversity is evaluated. Result: Thinking languages farther from English yield larger diversity gains; aggregating samples across multiple languages produces compositional effects that further improve diversity; linguistic heterogeneity raises the model's diversity ceiling. Conclusion: Controlling the language of thought is an effective and structural new way to increase the output diversity of large language models, with broad potential in pluralistic alignment scenarios. Abstract: Output diversity is crucial for Large Language Models as it underpins pluralism and creativity. In this work, we reveal that controlling the language used during model thinking (the language of thought) provides a novel and structural source of output diversity. Our preliminary study shows that different thinking languages occupy distinct regions in a model's thinking space. Based on this observation, we study two repeated sampling strategies under multilingual thinking, Single-Language Sampling and Mixed-Language Sampling, and conduct diversity evaluation on outputs that are controlled to be in English, regardless of the thinking language used. Across extensive experiments, we demonstrate that switching the thinking language from English to non-English languages consistently increases output diversity, with a clear and consistent positive correlation such that languages farther from English in the thinking space yield larger gains. We further show that aggregating samples across multiple thinking languages yields additional improvements through compositional effects, and that scaling sampling with linguistic heterogeneity expands the model's diversity ceiling. Finally, we show that these findings translate into practical benefits in pluralistic alignment scenarios, leading to broader coverage of cultural knowledge and value orientations in LLM outputs. Our code is publicly available at https://github.com/iNLP-Lab/Multilingual-LoT-Diversity.

[28] FactCorrector: A Graph-Inspired Approach to Long-Form Factuality Correction of Large Language Models

Javier Carnerero-Cano,Massimiliano Pronesti,Radu Marinescu,Tigran Tchrakian,James Barry,Jasmina Gajcin,Yufang Hou,Alessandra Pascale,Elizabeth Daly

Main category: cs.CL

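TL;DR: This paper proposes FactCorrector, a post-hoc factuality correction method for LLMs that adapts across domains without retraining, and builds the VELI5 benchmark, which contains systematically injected errors and ground-truth corrections, for evaluation. Experiments show that the method significantly improves factual precision while preserving relevance.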

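Details Motivation: Large language models (LLMs) are widely used in knowledge-intensive applications but often generate factually incorrect responses. Correcting LLMs with feedback is a promising remedy, but effective post-hoc correction frameworks and standard evaluation benchmarks are lacking. Method: FactCorrector is a post-hoc method that generates corrections from structured feedback about factuality and adapts across domains without retraining; the accompanying VELI5 benchmark contains systematically injected factual errors with corresponding ground-truth corrections, supporting rigorous evaluation of correction methods. Result: Experiments on VELI5 and several popular long-form factuality datasets show that FactCorrector significantly improves factual precision while preserving response relevance, outperforming strong baselines. Conclusion: FactCorrector is an efficient, general post-hoc solution for improving the factual accuracy of LLM outputs, and the VELI5 benchmark is an important evaluation tool for future research. Abstract: Large language models (LLMs) are widely used in knowledge-intensive applications but often generate factually incorrect responses. A promising approach to rectify these flaws is correcting LLMs using feedback. Therefore, in this paper, we introduce FactCorrector, a new post-hoc correction method that adapts across domains without retraining and leverages structured feedback about the factuality of the original response to generate a correction. To support rigorous evaluations of factuality correction methods, we also develop the VELI5 benchmark, a novel dataset containing systematically injected factual errors and ground-truth corrections. Experiments on VELI5 and several popular long-form factuality datasets show that the FactCorrector approach significantly improves factual precision while preserving relevance, outperforming strong baselines. We release our code at https://ibm.biz/factcorrector.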

[29] How DDAIR you? Disambiguated Data Augmentation for Intent Recognition

Galo Castillo-López,Alexis Lombard,Nasredine Semmar,Gaël de Chalendar

Main category: cs.CL

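TL;DR: Proposes DDAIR, which uses sentence embeddings to detect and regenerate class-ambiguous examples in LLM-generated intent recognition data, improving classification performance in low-resource scenarios.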

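Details Motivation: When used for data augmentation, large language models may generate examples that are ambiguous across classes, which hurts intent recognition, especially in low-resource scenarios and where intent boundaries are blurred. Method: Sentence Transformers measure the semantic similarity between generated examples and their target intent, identify ambiguous examples that are closer to another intent, and correct them through an iterative re-generation mechanism. Result: Experiments show that sentence embeddings help identify and reduce ambiguity in generated examples, which suggests potential for improving classification performance. Conclusion: Sentence-embedding-assisted data augmentation can mitigate class ambiguity in LLM-generated examples and has potential for low-resource intent recognition tasks with loosely or broadly defined intents. Abstract: Large Language Models (LLMs) are effective for data augmentation in classification tasks like intent detection. In some cases, they inadvertently produce examples that are ambiguous with regard to untargeted classes. We present DDAIR (Disambiguated Data Augmentation for Intent Recognition) to mitigate this problem. We use Sentence Transformers to detect ambiguous class-guided augmented examples generated by LLMs for intent recognition in low-resource scenarios. We identify synthetic examples that are semantically more similar to another intent than to their target one. We also provide an iterative re-generation method to mitigate such ambiguities. Our findings show that sentence embeddings effectively help to (re)generate less ambiguous examples, and suggest promising potential to improve classification performance in scenarios where intents are loosely or broadly defined.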
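
The detection step described above, flagging a synthetic example that is semantically closer to another intent than to its target, can be sketched in plain Python. This is a minimal sketch under assumptions: embeddings are taken as already computed (the paper uses Sentence Transformers, which are not called here), each intent is represented by the centroid of a few seed embeddings, and the names `find_ambiguous`, `seeds`, and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def find_ambiguous(synthetic, seed_embeddings):
    """Return (index, closest_intent) for each synthetic example whose
    embedding is closer to another intent's centroid than to its target.

    synthetic       -- list of (embedding, target_intent) pairs
    seed_embeddings -- dict mapping intent -> list of seed embeddings
    Assumption: centroid comparison is an illustrative rule, not
    necessarily the one used by DDAIR.
    """
    centroids = {name: centroid(vecs) for name, vecs in seed_embeddings.items()}
    flagged = []
    for idx, (emb, target) in enumerate(synthetic):
        sims = {name: cosine(emb, c) for name, c in centroids.items()}
        closest = max(sims, key=sims.get)
        if closest != target:
            flagged.append((idx, closest))
    return flagged


# Toy 2-D "embeddings" standing in for real sentence embeddings.
seeds = {
    "book_flight": [[1.0, 0.0], [0.9, 0.1]],
    "cancel_flight": [[0.0, 1.0], [0.1, 0.9]],
}
generated = [([0.8, 0.2], "book_flight"), ([0.2, 0.8], "book_flight")]
ambiguous = find_ambiguous(generated, seeds)
```

In an iterative setup, the flagged indices would be sent back to the generator for re-generation and checked again; that loop is omitted here because the abstract does not describe its stopping rule.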

[30] Reasoning in Trees: Improving Retrieval-Augmented Generation for Multi-Hop Question Answering

Yuling Shi,Maolin Sun,Zijun Liu,Mo Yang,Yixiong Fang,Tianran Sun,Xiaodong Gu

Main category: cs.CL

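TL;DR: This paper proposes RT-RAG, a retrieval-augmented generation framework guided by reasoning trees that tackles incoherent reasoning and error propagation in complex multi-hop question answering. It builds reasoning trees through structured entity analysis and a consensus mechanism, then traverses them bottom-up for query rewriting and evidence collection, substantially outperforming existing methods.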

Details Motivation: Existing LLM-based multi-hop QA methods suffer from inaccurate query decomposition and error propagation along the reasoning chain during self-guided retrieval, which results in poor reasoning coherence. Method: RT-RAG first decomposes a multi-hop question into explicit reasoning trees, using structured entity analysis and a consensus mechanism to select the best tree; it then traverses the tree bottom-up, gathering high-quality evidence step by step through iterative query rewriting and refinement. Result: RT-RAG outperforms state-of-the-art methods by 7.0% F1 and 6.0% EM, improving multi-hop QA performance. Conclusion: By introducing a reasoning-tree structure and bottom-up evidence collection, RT-RAG markedly improves reasoning consistency and accuracy in multi-hop QA and gives RAG systems a more reliable framework for multi-step reasoning. Abstract: Retrieval-Augmented Generation (RAG) has demonstrated significant effectiveness in enhancing large language models (LLMs) for complex multi-hop question answering (QA). For multi-hop QA tasks, current iterative approaches predominantly rely on LLMs to self-guide and plan multi-step exploration paths during retrieval, leading to substantial challenges in maintaining reasoning coherence across steps from inaccurate query decomposition and error propagation. To address these issues, we introduce Reasoning Tree Guided RAG (RT-RAG), a novel hierarchical framework for complex multi-hop QA. RT-RAG systematically decomposes multi-hop questions into explicit reasoning trees, minimizing inaccurate decomposition through structured entity analysis and consensus-based tree selection that clearly separates core queries, known entities, and unknown entities. Subsequently, a bottom-up traversal strategy employs iterative query rewriting and refinement to collect high-quality evidence, thereby mitigating error propagation. Comprehensive experiments show that RT-RAG substantially outperforms state-of-the-art methods by 7.0% F1 and 6.0% EM, demonstrating the effectiveness of RT-RAG in complex multi-hop QA.
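
The bottom-up traversal with query rewriting can be illustrated on a toy reasoning tree. This is a minimal sketch, not RT-RAG itself: the `Node` class, the `{0}`-style placeholders for unknown entities, and the dictionary lookup standing in for retrieval plus answering are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Question template; "{0}", "{1}", ... are filled with child answers.
    # Literal braces in a question would need escaping for str.format.
    question: str
    children: list = field(default_factory=list)


def solve(node, answer_fn):
    """Answer a reasoning tree bottom-up.

    Children are solved first; their answers are substituted into the
    parent's question (the query rewriting step) before it is answered.
    """
    child_answers = [solve(child, answer_fn) for child in node.children]
    rewritten = node.question.format(*child_answers)
    return answer_fn(rewritten)


# Hypothetical stand-in for retrieval + answering: a tiny lookup table.
toy_kb = {
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London",
}

tree = Node("Where was {0} born?", [Node("Who directed Inception?")])
answer = solve(tree, toy_kb.__getitem__)
```

Because the sub-question is resolved before the parent query is rewritten, the parent is always answered with a concrete entity instead of an unresolved reference, which is the intuition behind using bottom-up traversal to limit error propagation.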

[31] One LLM to Train Them All: Multi-Task Learning Framework for Fact-Checking

Malin Astrid Larsson,Harald Fosen Grunnaleite,Vinay Setty

Main category: cs.CL

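TL;DR: This paper proposes multi-task learning (MTL) to make automated fact-checking with small open-source large language models more efficient. A single jointly trained model performs claim detection, evidence ranking, and stance detection, and it substantially outperforms zero-/few-shot settings.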

Details Motivation: Large proprietary models perform well on automated fact-checking, but their closed nature, complexity, and high cost limit sustainability, while fine-tuning a separate small open-source model for each task is also costly. A more efficient alternative is needed. Method: A multi-task learning strategy is applied to small decoder-only LLMs (e.g., Qwen3-4b), exploring three approaches (classification heads, causal language modeling heads, and instruction-tuning) and evaluating them across model sizes and task orders. Result: Multi-task models do not universally surpass single-task baselines, but they improve substantially over zero-/few-shot settings, with relative gains of up to 44% for claim detection, 54% for evidence re-ranking, and 31% for stance detection. Conclusion: Multi-task learning is a viable way to build efficient, sustainable automated fact-checking systems, and the paper offers empirically grounded guidelines for practitioners. Abstract: Large language models (LLMs) are reshaping automated fact-checking (AFC) by enabling unified, end-to-end verification pipelines rather than isolated components. While large proprietary models achieve strong performance, their closed weights, complexity, and high costs limit sustainability. Fine-tuning smaller open-weight models for individual AFC tasks can help but requires multiple specialized models, resulting in high costs. We propose **multi-task learning (MTL)** as a more efficient alternative that fine-tunes a single model to perform claim detection, evidence ranking, and stance detection jointly. Using small decoder-only LLMs (e.g., Qwen3-4b), we explore three MTL strategies: classification heads, causal language modeling heads, and instruction-tuning, and evaluate them across model sizes, task orders, and standard non-LLM baselines. While multitask models do not universally surpass single-task baselines, they yield substantial improvements, achieving up to **44%**, **54%**, and **31%** relative gains for claim detection, evidence re-ranking, and stance detection, respectively, over zero-/few-shot settings. Finally, we also provide practical, empirically grounded guidelines to help practitioners apply MTL with LLMs for automated fact-checking.

[32] Membership Inference on LLMs in the Wild

Jiatong Yi,Yanyang Li

Main category: cs.CL

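TL;DR: This paper proposes SimMIA, a new membership inference attack framework for large language models in the black-box setting that achieves strong inference performance using only generated text, and builds a new benchmark, WikiMIA-25, for evaluation.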

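Details Motivation: Existing membership inference attacks either depend on model internals or generalize poorly in text-only black-box settings, which makes it hard to audit the training-data membership privacy risks of large language models. Method: The SimMIA framework uses an advanced sampling strategy and scoring mechanism to perform membership inference in a strict black-box setting where only generated text is accessible; a new benchmark, WikiMIA-25, is built for evaluation. Result: SimMIA achieves state-of-the-art performance in the black-box setting, rivaling methods that rely on internal model information. Conclusion: SimMIA improves membership inference without access to internal model states and provides a practical tool for assessing training-data privacy leakage in modern closed-source large language models. Abstract: Membership Inference Attacks (MIAs) act as a crucial auditing tool for the opaque training data of Large Language Models (LLMs). However, existing techniques predominantly rely on inaccessible model internals (e.g., logits) or suffer from poor generalization across domains in strict black-box settings where only generated text is available. In this work, we propose SimMIA, a robust MIA framework tailored for this text-only regime by leveraging an advanced sampling strategy and scoring mechanism. Furthermore, we present WikiMIA-25, a new benchmark curated to evaluate MIA performance on modern proprietary LLMs. Experiments demonstrate that SimMIA achieves state-of-the-art results in the black-box setting, rivaling baselines that exploit internal model information.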

[33] F-Actor: Controllable Conversational Behaviour in Full-Duplex Models

Maike Züfle,Ondrej Klejch,Nicholas Sanders,Jan Niehues,Alexandra Birch,Tsz Kin Lam

Main category: cs.CL

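TL;DR: This paper presents the first open, instruction-following full-duplex conversational speech model. It can be trained efficiently under typical academic resource constraints and supports control over speaker voice, topic, conversational behaviour, and dialogue initiation.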

Details Motivation: Current spoken conversational systems lack natural conversational behaviour that adapts dynamically to context, which makes them less natural and less usable. Method: The audio encoder is kept frozen and only the language model is fine-tuned, using a single-stage training protocol on 2,000 hours of data, without relying on large-scale pretraining or multi-stage optimization. Result: The model follows instructions to control speaker voice, topic, conversational behaviour (such as backchanneling and interruptions), and dialogue initiation, providing controllable full-duplex conversation at modest training cost. Conclusion: The model offers a reproducible research basis for controllable full-duplex speech systems, and the model and code will be released to support related research. Abstract: Spoken conversational systems require more than accurate speech generation to have human-like conversations: to feel natural and engaging, they must produce conversational behaviour that adapts dynamically to the context. Current spoken conversational systems, however, rarely allow such customization, limiting their naturalness and usability. In this work, we present the first open, instruction-following full-duplex conversational speech model that can be trained efficiently under typical academic resource constraints. By keeping the audio encoder frozen and finetuning only the language model, our model requires just 2,000 hours of data, without relying on large-scale pretraining or multi-stage optimization. The model can follow explicit instructions to control speaker voice, conversation topic, conversational behaviour (e.g., backchanneling and interruptions), and dialogue initiation. We propose a single-stage training protocol and systematically analyze design choices. Both the model and training code will be released to enable reproducible research on controllable full-duplex speech systems.

[34] Idea First, Code Later: Disentangling Problem Solving from Code Generation in Evaluating LLMs for Competitive Programming

Sama Hadhoud,Alaa Elsetohy,Frederikus Hudi,Jan Christian Blaise Cruz,Steven Halim,Alham Fikri Aji

Main category: cs.CL

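TL;DR: This paper argues that natural-language solution write-ups (editorials) should be central to competitive programming evaluation, separating algorithmic reasoning from code implementation. It proposes a new evaluation approach and a dataset of 83 problems to better measure how large models perform at problem solving versus implementation.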

Details Motivation: Existing evaluations conflate algorithmic reasoning with code implementation and therefore cannot accurately assess the true problem-solving ability of large models in competitive programming, so an evaluation that separates the two is needed. Method: Natural-language editorials are used as an intermediate step in solving; gold editorials guide models in generating solutions, and expert annotations together with an LLM-as-a-judge protocol compare generated and gold editorials to analyze error types. Result: Gold editorials raise solve rates for some models, yet models still struggle with implementation; a clear gap remains between generated and gold editorials, showing that algorithm design is still a bottleneck; a new dataset of 83 ICPC-style problems with full test suites is introduced. Conclusion: Competitive programming evaluation should explicitly separate problem solving from code implementation, and future research and benchmarks should be designed around natural-language reasoning to assess and improve models' algorithmic thinking more accurately. Abstract: Large Language Models (LLMs) increasingly succeed on competitive programming problems, yet existing evaluations conflate algorithmic reasoning with code-level implementation. We argue that competitive programming is fundamentally a problem-solving task and propose centering natural-language editorials in both solution generation and evaluation. Generating an editorial prior to code improves solve rates for some LLMs, with substantially larger gains when using expertly written gold editorials. However, even with gold editorials, models continue to struggle with implementation, while the gap between generated and gold editorials reveals a persistent problem-solving bottleneck in specifying correct and complete algorithms. Beyond pass/fail metrics, we diagnose reasoning errors by comparing model-generated editorials to gold standards using expert annotations and validate an LLM-as-a-judge protocol for scalable evaluation. We introduce a dataset of 83 ICPC-style problems with gold editorials and full test suites, and evaluate 19 LLMs, arguing that future benchmarks should explicitly separate problem solving from implementation.

[35] Neural Chain-of-Thought Search: Searching the Optimal Reasoning Path to Enhance Large Language Models

Guoming Ling,Zhongzhan Huang,Yupei Lin,Junxin Li,Shanshan Zhong,Hefeng Wu,Liang Lin

Main category: cs.CL

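TL;DR: Proposes Neural Chain-of-Thought Search (NCoTS), a framework that dynamically searches for the optimal reasoning path, improving accuracy while substantially shortening the reasoning.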

Details Motivation: Existing chain-of-thought reasoning lacks foresight and easily gets trapped in redundant, suboptimal reasoning paths, which limits the efficiency and performance of large models. Method: Reasoning is modeled as a dynamic search for the optimal thinking strategy; candidate reasoning operators are evaluated with a dual-factor heuristic (correctness and computational cost), and the solution space is characterized quantitatively to find sparse, superior paths. Result: A Pareto improvement is achieved across multiple reasoning benchmarks, with accuracy up by over 3.5% and generation length down by over 22%. Conclusion: By introducing a search mechanism to optimize the reasoning path, NCoTS achieves more efficient and more accurate chain-of-thought reasoning. Abstract: Chain-of-Thought reasoning has significantly enhanced the problem-solving capabilities of Large Language Models. Unfortunately, current models generate reasoning steps sequentially without foresight, often becoming trapped in suboptimal reasoning paths with redundant steps. In contrast, we introduce Neural Chain-of-Thought Search (NCoTS), a framework that reformulates reasoning as a dynamic search for the optimal thinking strategy. By quantitatively characterizing the solution space, we reveal the existence of sparse superior reasoning paths that are simultaneously more accurate and concise than standard outputs. Our method actively navigates towards these paths by evaluating candidate reasoning operators using a dual-factor heuristic that optimizes for both correctness and computational cost. Consequently, NCoTS achieves a Pareto improvement across diverse reasoning benchmarks, boosting accuracy by over 3.5% while reducing generation length by over 22%. Our code and data are available at https://github.com/MilkThink-Lab/Neural-CoT-Search.

[36] How Much Would a Clinician Edit This Draft? Evaluating LLM Alignment for Patient Message Response Drafting

Parker Seegmiller,Joseph Gatto,Sarah E. Greer,Ganza Belise Isingizwe,Rohan Ray,Timothy E. Burdick,Sarah Masud Preum

Main category: cs.CL

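TL;DR: This study examines the use of large language models (LLMs) to help clinicians draft replies to patient portal messages. It proposes a new thematic taxonomy and an evaluation framework for measuring how well LLM drafts align with what clinicians actually need, and finds that personalized adaptation is required to make them clinically useful.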

Details Motivation: Although LLMs show promise for drafting replies to patient messages, it is unclear whether they actually reduce clinicians' workload, and there is a risk of misalignment with individual clinicians' styles and clinical needs. Method: A new taxonomy of thematic elements in clinician responses is developed, an evaluation framework for the editing load clinicians face with LLM-drafted responses is built, and an expert-annotated dataset is released; local and commercial LLMs are evaluated under different adaptation techniques (thematic prompting, retrieval-augmented generation, supervised fine-tuning, and direct preference optimization). Result: LLMs draft some thematic elements well but perform poorly on others, particularly asking patients questions to elicit further information; theme-driven adaptation improves results across most themes; there is substantial epistemic uncertainty in aligning LLM drafts with clinician responses, which makes one-size-fits-all alignment difficult. Conclusion: LLMs must be adapted to individual clinician preferences to enable reliable and responsible use in patient-clinician communication workflows. Abstract: Large language models (LLMs) show promise in drafting responses to patient portal messages, yet their integration into clinical workflows raises various concerns, including whether they would actually save clinicians time and effort in their portal workload. We investigate LLM alignment with individual clinicians through a comprehensive evaluation of the patient message response drafting task. We develop a novel taxonomy of thematic elements in clinician responses and propose a novel evaluation framework for assessing clinician editing load of LLM-drafted responses at both content and theme levels. We release an expert-annotated dataset and conduct large-scale evaluations of local and commercial LLMs using various adaptation techniques including thematic prompting, retrieval-augmented generation, supervised fine-tuning, and direct preference optimization. Our results reveal substantial epistemic uncertainty in aligning LLM drafts with clinician responses. While LLMs demonstrate capability in drafting certain thematic elements, they struggle with clinician-aligned generation in other themes, particularly question asking to elicit further information from patients. Theme-driven adaptation strategies yield improvements across most themes. Our findings underscore the necessity of adapting LLMs to individual clinician preferences to enable reliable and responsible use in patient-clinician communication workflows.

[37] Reward Modeling for Scientific Writing Evaluation

Furkan Şahinuç,Subhabrata Dutta,Iryna Gurevych

Main category: cs.CL

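TL;DR: This paper proposes cost-efficient, open-source reward models for evaluating scientific writing, trained with a two-stage framework that improves LLM-based evaluation across multiple tasks and under dynamic criteria.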

Details Motivation: Existing LLM-based evaluators are optimized mainly for general-purpose benchmarks and adapt poorly to scientific writing, which depends on domain knowledge and on multi-faceted, task-specific criteria; fine-tuning for each task is also costly. Method: Proposes a two-stage training framework that first optimizes scientific evaluation preferences and then refines reasoning capabilities; a multi-aspect evaluation design and joint training across tasks enable fine-grained assessment and robustness to dynamic criteria. Result: Experiments show that the training regime strongly improves LLM-based scientific writing evaluation, and the models generalize across tasks and to previously unseen scientific writing evaluation settings. Conclusion: The framework yields a reusable scientific writing evaluator that needs no task-specific retraining, offering an efficient option for low-resource settings. Abstract: Scientific writing is an expert-domain task that demands deep domain knowledge, task-specific requirements and reasoning capabilities that leverage the domain knowledge to satisfy the task specifications. While scientific text generation has been widely studied, its evaluation remains a challenging and open problem. It is critical to develop models that can be reliably deployed for evaluating diverse open-ended scientific writing tasks while adhering to their distinct requirements. However, existing LLM-based judges and reward models are primarily optimized for general-purpose benchmarks with fixed scoring rubrics and evaluation criteria. Consequently, they often fail to reason over sparse knowledge of scientific domains when interpreting task-dependent and multi-faceted criteria. Moreover, fine-tuning for each individual task is costly and impractical for low-resource settings. To bridge these gaps, we propose cost-efficient, open-source reward models tailored for scientific writing evaluation. We introduce a two-stage training framework that initially optimizes scientific evaluation preferences and then refines reasoning capabilities. Our multi-aspect evaluation design and joint training across diverse tasks enable fine-grained assessment and robustness to dynamic criteria and scoring rubrics. Experimental analysis shows that our training regime strongly improves LLM-based scientific writing evaluation. Our models generalize effectively across tasks and to previously unseen scientific writing evaluation settings, allowing a single trained evaluator to be reused without task-specific retraining.

[38] Evaluating LLM Behavior in Hiring: Implicit Weights, Fairness Across Groups, and Alignment with Human Preferences

Morgane Hoffmann,Emma Jouffroy,Warren Jouanneau,Marc Palyart,Charles Pebereau

Main category: cs.CL

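TL;DR: This paper proposes a framework for evaluating how a large language model (LLM) weighs different criteria in hiring decisions. Using real freelancer data and a full factorial design, it measures how much weight the LLM gives to match-relevant attributes and examines whether those weights agree with economic principles, recruiter preferences, and societal norms.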

Details Motivation: It is unclear how LLMs assign importance to each attribute in hiring decisions and whether those assignments agree with economic principles, recruiter preferences, or societal norms. Method: Builds synthetic datasets from real freelance-marketplace data and applies a full factorial experimental design to estimate the weight the LLM places on each criterion when evaluating freelancer-project fit, then analyzes how these weights vary across project contexts and demographic subgroups. Result: The LLM weighs core productivity signals such as skills and experience, but also interprets certain features beyond their explicit matching value; average discrimination against minority groups is minimal, but intersectional analysis shows that productivity signals carry different weights across demographic groups. Conclusion: The study exposes the decision logic of an LLM in recruitment and describes an experimental setup that could be used to assess alignment between model and human decisions, providing a methodological basis for studying fairness and transparency of AI in hiring. Abstract: General-purpose Large Language Models (LLMs) show significant potential in recruitment applications, where decisions require reasoning over unstructured text, balancing multiple criteria, and inferring fit and competence from indirect productivity signals. Yet, it is still uncertain how LLMs assign importance to each attribute and whether such assignments are in line with economic principles, recruiter preferences or broader societal norms. We propose a framework to evaluate an LLM's decision logic in recruitment, by drawing on established economic methodologies for analyzing human hiring behavior. We build synthetic datasets from real freelancer profiles and project descriptions from a major European online freelance marketplace and apply a full factorial design to estimate how an LLM weighs different match-relevant criteria when evaluating freelancer-project fit. We identify which attributes the LLM prioritizes and analyze how these weights vary across project contexts and demographic subgroups. Finally, we explain how a comparable experimental setup could be implemented with human recruiters to assess alignment between model and human decisions. Our findings reveal that the LLM weighs core productivity signals, such as skills and experience, but interprets certain features beyond their explicit matching value. While showing minimal average discrimination against minority groups, intersectional effects reveal that productivity signals carry different weights between demographic groups.
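A full factorial design crosses every level of every attribute, so the effect of each attribute can be estimated separately from the others. The sketch below illustrates the idea in Python; the attribute names and levels in `FACTORS` are hypothetical placeholders, since the abstract does not list the factors actually used.

```python
from itertools import product

# Hypothetical attributes and levels, for illustration only.
# The factors used in the paper are not given in the abstract.
FACTORS = {
    "skill_match": ["low", "high"],
    "experience_years": [1, 5, 10],
    "daily_rate": ["below_budget", "above_budget"],
}


def full_factorial(factors):
    """Return one profile (a dict) for every combination of factor levels."""
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


profiles = full_factorial(FACTORS)
print(len(profiles))  # 2 * 3 * 2 combinations
```

Each generated profile would then be scored by the model, and the scores regressed on the attribute levels to recover implicit weights; that estimation step is not shown here.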

[39] Relational Linearity is a Predictor of Hallucinations

Yuetian Lu,Yihong Liu,Hinrich Schütze

Main category: cs.CL

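TL;DR: Large language models frequently hallucinate answers to questions about synthetic entities, especially for linear relations, whose more abstract storage makes it hard for the model to assess its own knowledge. Experiments show a strong correlation (r ∈ [.78, .82]) between relational linearity and hallucination rate, suggesting that better representations of factual knowledge could reduce hallucination.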

Details Motivation: LLMs frequently hallucinate when asked about unknown synthetic entities; the paper investigates the underlying mechanism, in particular how relational linearity affects knowledge storage and self-assessment. Method: Creates SyntHal, a dataset of 6000 synthetic entities covering six relations; measures the linearity of each relation with $\Delta\cos$ and the per-relation hallucination rate on four models, then analyzes the correlation between the two. Result: Medium-size models such as Gemma-7B-IT frequently hallucinate on questions about synthetic entities; relational linearity and hallucination rate are strongly correlated (r ∈ [.78, .82]), supporting the hypothesis that linear relations are more prone to hallucination. Conclusion: The linearity of a relation is a factor in hallucination behavior: facts of linear relations are stored more abstractly, which makes it harder for the model to judge whether it knows them. The finding suggests new ways to manage hallucination and new directions for improving the representation of factual knowledge. Abstract: Hallucination is a central failure mode in large language models (LLMs). We focus on hallucinations of answers to questions like: "Which instrument did Glenn Gould play?", but we ask these questions for synthetic entities that are unknown to the model. Surprisingly, we find that medium-size models like Gemma-7B-IT frequently hallucinate, i.e., they have difficulty recognizing that the hallucinated fact is not part of their knowledge. We hypothesize that an important factor in causing these hallucinations is the linearity of the relation: linear relations tend to be stored more abstractly, making it difficult for the LLM to assess its knowledge; the facts of nonlinear relations tend to be stored more directly, making knowledge assessment easier. To investigate this hypothesis, we create SyntHal, a dataset of 6000 synthetic entities for six relations. In our experiments with four models, we determine, for each relation, the hallucination rate on SyntHal and also measure its linearity, using $\Delta\cos$. We find a strong correlation ($r \in [.78,.82]$) between relational linearity and hallucination rate, providing evidence for our hypothesis that the underlying storage of triples of a relation is a factor in how well a model can self-assess its knowledge. This finding has implications for how to manage hallucination behavior and suggests new research directions for improving the representation of factual knowledge in LLMs.
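The reported $r$ is a correlation coefficient computed across relations, with one linearity score and one hallucination rate per relation. A minimal sketch of a Pearson correlation in plain Python follows, assuming Pearson's $r$ is the intended statistic; the two input lists are made-up illustrative numbers, not values from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up per-relation values, for illustration only (NOT results from the paper).
linearity = [0.10, 0.25, 0.40, 0.55, 0.70, 0.90]
hallucination_rate = [0.20, 0.35, 0.30, 0.60, 0.65, 0.80]
print(round(pearson_r(linearity, hallucination_rate), 3))
```

With only six relations per model, such a coefficient rests on six points, which is worth keeping in mind when reading the reported range.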

[40] The unreasonable effectiveness of pattern matching

Gary Lupyan,Blaise Agüera y Arcas

Main category: cs.CL

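TL;DR: Large language models (LLMs) can recover meaning from "Jabberwocky" language, which indicates that pattern matching is a core capability rather than mere language mimicry or a blurry copy of the Web.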

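Details Motivation: To examine whether LLMs are best understood as language mimics, databases, or a blurry version of the Web, and to understand their ability to process language whose words are meaningless but whose structure is preserved. Method: Tests LLMs on "Jabberwocky" language, in which content words are replaced by nonsense strings, and analyzes whether they can recover meaning from structural patterns. Result: LLMs can infer the meaning of such sentences and translate them accurately, showing strong recognition of structural patterns. Conclusion: Pattern matching is not only a core mechanism of LLMs but also a key ingredient of intelligence, not an alternative to "real" intelligence. Abstract: We report on an astonishing ability of large language models (LLMs) to make sense of "Jabberwocky" language in which most or all content words have been randomly replaced by nonsense strings, e.g., translating "He dwushed a ghanc zawk" to "He dragged a spare chair". This result addresses ongoing controversies regarding how to best think of what LLMs are doing: are they a language mimic, a database, a blurry version of the Web? The ability of LLMs to recover meaning from structural patterns speaks to the unreasonable effectiveness of pattern-matching. Pattern-matching is not an alternative to "real" intelligence, but rather a key ingredient.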
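The kind of input described in the abstract can be approximated with a few lines of Python. This is only a rough sketch under stated assumptions: the `FUNCTION_WORDS` list is a tiny hypothetical stand-in for a real function-word lexicon, the nonsense generator simply alternates consonants and vowels, and inflections such as the "-ed" kept in the paper's "dwushed" example are not preserved.

```python
import random

# Tiny hypothetical function-word list; a real study would use a proper lexicon.
FUNCTION_WORDS = {"a", "an", "the", "he", "she", "it", "they", "of", "to", "in", "on", "and"}
CONSONANTS = "bcdfghjklmnpqrstvwz"
VOWELS = "aeiou"


def nonsense(length, rng):
    """Build a pronounceable-looking nonsense string of the given length."""
    return "".join(
        rng.choice(CONSONANTS if i % 2 == 0 else VOWELS) for i in range(length)
    )


def jabberwockify(sentence, seed=0):
    """Replace every word that is not a function word with a nonsense string."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    out = []
    for word in sentence.split():
        if word.lower() in FUNCTION_WORDS:
            out.append(word)
            continue
        new_word = nonsense(len(word), rng)
        while new_word == word.lower():  # avoid reproducing the original by chance
            new_word = nonsense(len(word), rng)
        out.append(new_word)
    return " ".join(out)


print(jabberwockify("He dragged a spare chair", seed=1))
```

The function words and word order survive, which is the structural signal a model would have to rely on when asked to translate the result back into ordinary English.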

[41] Hierarchical Orthogonal Residual Spread for Precise Massive Editing in Large Language Models

Xiaojie Gu,Guangxu Chen,Yuheng Yang,Jingxin Han,Andi Zhang

Main category: cs.CL

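TL;DR: This paper proposes HORSE, a hierarchical orthogonal residual spread method for knowledge editing in large language models. It reduces noisy gradients to make edits more stable, and is validated on multiple models and datasets.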

Details Motivation: LLMs face safety concerns, and existing model editing methods, while effective, are computationally expensive and prone to knowledge conflicts, so a more efficient and stable editing method is needed. Method: Proposes HORSE (Hierarchical Orthogonal Residual SprEad), which applies a hierarchical orthogonal residual spread to the information matrix to reduce gradient noise and enable more stable knowledge edits. Result: A theoretical comparison and experiments on two datasets across multiple LLMs show that HORSE performs precise massive editing and maintains good performance across diverse scenarios. Conclusion: HORSE offers an efficient and stable approach to knowledge editing in LLMs that reduces interference during editing. Abstract: Large language models (LLMs) exhibit exceptional performance across various domains, yet they face critical safety concerns. Model editing has emerged as an effective approach to mitigate these issues. Existing model editing methods often focus on optimizing an information matrix that blends new and old knowledge. While effective, these approaches can be computationally expensive and may cause conflicts. In contrast, we shift our attention to Hierarchical Orthogonal Residual SprEad of the information matrix, which reduces noisy gradients and enables more stable edits from a different perspective. We demonstrate the effectiveness of our method HORSE through a clear theoretical comparison with several popular methods and extensive experiments conducted on two datasets across multiple LLMs. The results show that HORSE maintains precise massive editing across diverse scenarios. The code is available at https://github.com/XiaojieGu/HORSE

[42] Predict the Retrieval! Test time adaptation for Retrieval Augmented Generation

Xin Sun,Zhongqi Chen,Qiang Liu,Shu Wu,Bowen Song,Weiqiang Wang,Zilei Wang,Liang Wang

Main category: cs.CL

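TL;DR: This paper proposes TTARAG, a test-time adaptation method that improves retrieval-augmented generation (RAG) in specialized domains by dynamically updating the language model's parameters during inference, yielding substantial gains across six specialized domains.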

Details Motivation: Because of distribution shift, existing RAG systems generalize poorly when adapted to specialized domains, and their performance in the target domain needs to improve. Method: Proposes TTARAG, in which the model learns to predict the retrieved content and thereby adjusts the language model's parameters during inference, adapting automatically to the target domain. Result: Extensive experiments across six specialized domains show substantial performance improvements over baseline RAG systems. Conclusion: TTARAG is a simple yet effective way to mitigate distribution shift for RAG systems in specialized domains and to improve their question-answering performance. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for enhancing large language models' question-answering capabilities through the integration of external knowledge. However, when adapting RAG systems to specialized domains, challenges arise from distribution shifts, resulting in suboptimal generalization performance. In this work, we propose TTARAG, a test-time adaptation method that dynamically updates the language model's parameters during inference to improve RAG system performance in specialized domains. Our method introduces a simple yet effective approach where the model learns to predict retrieved content, enabling automatic parameter adjustment to the target domain. Through extensive experiments across six specialized domains, we demonstrate that TTARAG achieves substantial performance improvements over baseline RAG systems. Code available at https://github.com/sunxin000/TTARAG.

[43] CTest-Metric: A Unified Framework to Assess Clinical Validity of Metrics for CT Report Generation

Vanshali Sharma,Andrea Mia Bejar,Gorkem Durak,Ulas Bagci

Main category: cs.CL

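TL;DR: This paper presents CTest-Metric, the first unified framework for assessing quality metrics for CT radiology report generation (RRG). It comprises three modules (writing style generalizability, synthetic error injection, and correlation with experts) and is used to evaluate eight widely used metrics systematically.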

Details Motivation: There is no unified framework for assessing the robustness and clinical applicability of radiology report generation metrics, even as critical medical tasks are increasingly automated with generative AI. Method: Proposes the CTest-Metric framework with three modules: a Writing Style Generalizability (WSG) test based on LLM rephrasing, a Synthetic Error Injection (SEI) test at graded severities, and a Metrics-vs-Expert (MvE) correlation analysis using clinician ratings on 175 "disagreement" cases. Eight widely used metrics are evaluated across seven LLMs built on a CT-CLIP encoder. Result: Lexical NLG metrics are highly sensitive to stylistic variation; GREEN Score aligns best with expert judgments (Spearman~0.70), while CRG correlates negatively; BERTScore-F1 is least sensitive to factual error injection. Conclusion: CTest-Metric provides a reproducible benchmark for metric assessment in RRG, exposes the strengths and limitations of existing metrics, and supports the development of more clinically feasible metrics. Abstract: In the generative AI era, where even critical medical tasks are increasingly automated, radiology report generation (RRG) continues to rely on suboptimal metrics for quality assessment. Developing domain-specific metrics has therefore been an active area of research, yet it remains challenging due to the lack of a unified, well-defined framework to assess their robustness and applicability in clinical contexts. To address this, we present CTest-Metric, a first unified metric assessment framework with three modules determining the clinical feasibility of metrics for CT RRG. The modules test: (i) Writing Style Generalizability (WSG) via LLM-based rephrasing; (ii) Synthetic Error Injection (SEI) at graded severities; and (iii) Metrics-vs-Expert correlation (MvE) using clinician ratings on 175 "disagreement" cases. Eight widely used metrics (BLEU, ROUGE, METEOR, BERTScore-F1, F1-RadGraph, RaTEScore, GREEN Score, CRG) are studied across seven LLMs built on a CT-CLIP encoder. Using our novel framework, we found that lexical NLG metrics are highly sensitive to stylistic variations; GREEN Score aligns best with expert judgments (Spearman~0.70), while CRG shows negative correlation; and BERTScore-F1 is least sensitive to factual error injection. We will release the framework, code, and allowable portion of the anonymized evaluation data (rephrased/error-injected CT reports), to facilitate reproducible benchmarking and future metric development.
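The Metrics-vs-Expert module reports a Spearman correlation, which compares the rank order of metric scores with the rank order of clinician ratings. A minimal plain-Python sketch is given below, assuming the standard definition (Pearson correlation of average ranks, with ties sharing their mean rank); the function names are illustrative, and the paper's exact procedure is not described in the abstract.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

Because only ranks matter, a metric that is monotonically but non-linearly related to expert ratings still scores 1.0, which is why rank correlation suits metrics on arbitrary scales.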

[44] Do explanations generalize across large reasoning models?

Koyena Pal,David Bau,Chandan Singh

Main category: cs.CL

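TL;DR: This paper studies whether chain-of-thought (CoT) explanations produced by large reasoning models (LRMs) generalize across models. It finds that CoT explanations increase consistency between LRMs, that this consistency correlates with human preference rankings and reinforcement-learning post-training, and that a sentence-level ensembling strategy improves consistency further.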

Details Motivation: To determine whether natural-language explanations produced by LRMs capture general patterns of the underlying problem rather than patterns esoteric to a particular model, a question that is crucial for applications such as scientific discovery that involve understanding or discovering new concepts. Method: Evaluates a specific notion of generalization, namely whether an explanation produced by one LRM induces the same behavior when given to other LRMs, analyzes the conditions that affect consistency, and proposes a sentence-level ensembling method. Result: CoT explanations often generalize across models and increase consistency between LRMs; this consistency is positively correlated with human preference rankings and with reinforcement-learning post-training; the proposed sentence-level ensembling strategy improves consistency further. Conclusion: Caution is warranted when using LRM explanations to derive new insights; the paper outlines a framework for characterizing the generalization of LRM explanations. Abstract: Large reasoning models (LRMs) produce a textual chain of thought (CoT) in the process of solving a problem, which serves as a potentially powerful tool to understand the problem by surfacing a human-readable, natural-language explanation. However, it is unclear whether these explanations generalize, i.e. whether they capture general patterns about the underlying problem rather than patterns which are esoteric to the LRM. This is a crucial question in understanding or discovering new concepts, e.g. in AI for science. We study this generalization question by evaluating a specific notion of generalizability: whether explanations produced by one LRM induce the same behavior when given to other LRMs. We find that CoT explanations often exhibit this form of generalization (i.e. they increase consistency between LRMs) and that this increased generalization is correlated with human preference rankings and post-training with reinforcement learning. We further analyze the conditions under which explanations yield consistent answers and propose a straightforward, sentence-level ensembling strategy that improves consistency. Taken together, these results prescribe caution when using LRM explanations to yield new insights and outline a framework for characterizing LRM explanation generalization.

[45] How Long Is a Piece of String? A Brief Empirical Analysis of Tokenizers

Jonathan Roberts,Kai Han,Samuel Albanie

Main category: cs.CL

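TL;DR: This paper presents a comprehensive empirical analysis of tokenization in modern large language models, showing that token lengths vary significantly across models and text domains and challenging common heuristics about token length.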

Details Motivation: Tokens are widely used as a stable currency for comparing models, their inputs and outputs, and for estimating inference pricing, but tokenization differs across models and text domains, which makes naive interpretation of token counts problematic; this variation needs to be quantified. Method: A broad empirical analysis of how sequences are compressed into tokens across different distributions of textual data. Result: Common heuristics about token length are overly simplistic, and tokenization differs significantly across models and text domains. Conclusion: The results give a clearer, more intuitive picture of tokenization in contemporary LLMs and call for caution when using tokens as a unit of comparison. Abstract: Frontier LLMs are increasingly utilised across academia, society and industry. A commonly used unit for comparing models, their inputs and outputs, and estimating inference pricing is the token. In general, tokens are used as a stable currency, assumed to be broadly consistent across tokenizers and contexts, enabling direct comparisons. However, tokenization varies significantly across models and domains of text, making naive interpretation of token counts problematic. We quantify this variation by providing a comprehensive empirical analysis of tokenization, exploring the compression of sequences to tokens across different distributions of textual data. Our analysis challenges commonly held heuristics about token lengths, finding them to be overly simplistic. We hope the insights of our study add clarity and intuition toward tokenization in contemporary LLMs.
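A simple way to make "compression of sequences to tokens" concrete is the average number of characters per token. The sketch below assumes that statistic and uses two toy tokenizers (whitespace and character level) as stand-ins; they are hypothetical placeholders, not the tokenizers of any model studied in the paper.

```python
def chars_per_token(text, tokenize):
    """Average number of characters represented by each token."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("tokenizer produced no tokens")
    return len(text) / len(tokens)


# Toy stand-in tokenizers, for illustration only.
def whitespace_tokenize(text):
    return text.split()


def char_tokenize(text):
    return list(text)


sample = "the cat sat"
for name, tok in [("whitespace", whitespace_tokenize), ("char", char_tokenize)]:
    print(name, round(chars_per_token(sample, tok), 2))
```

Swapping in a real tokenizer's encode function for `tokenize`, and running the same text through several of them, gives the kind of per-model, per-domain comparison the abstract describes.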

cs.CV [Back]

[46] Future Optical Flow Prediction Improves Robot Control & Video Generation

Kanchana Ranasinghe,Honglu Zhou,Yu Fang,Luyu Yang,Le Xue,Ran Xu,Caiming Xiong,Silvio Savarese,Michael S Ryoo,Juan Carlos Niebles

Main category: cs.CV

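TL;DR: This paper introduces FOFPred, a language-conditioned optical flow forecasting model that combines a vision-language model with a diffusion architecture. It learns future dense motion representations from large-scale, noisy web data and transfers across domains to robot control and video generation.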

Details Motivation: Forecasting generalizable, spatially dense motion representations such as optical flow remains difficult, and learning such forecasting from noisy real-world data is relatively unexplored; the paper explores scalable learning from large but unstructured web video-text data. Method: Proposes FOFPred, which unifies a Vision-Language Model (VLM) with a diffusion architecture and uses crucial data preprocessing together with strong image pretraining to extract meaningful signal from web-scale human-activity video-caption data, enabling language-conditioned future optical flow prediction. Result: On two downstream tasks, robotic manipulation and language-driven video generation, FOFPred shows strong cross-domain performance, supporting its effectiveness for future motion prediction and its generative fidelity. Conclusion: A unified VLM-Diffusion architecture trained on scalable web data is an effective solution for language-conditioned future optical flow prediction and supports its use in control and generation tasks. Abstract: Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data remains relatively unexplored. We introduce FOFPred, a novel language-conditioned optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. This unique combination enables strong multimodal reasoning with pixel-level generative fidelity for future motion prediction. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signals from this noisy video-caption data, we employ crucial data preprocessing techniques and our unified architecture with strong image pretraining. The resulting trained model is then extended to tackle two distinct downstream tasks in control and generation. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and scalable learning from diverse web data for future optical flow prediction.

[47] ICONIC-444: A 3.1-Million-Image Dataset for OOD Detection Research

Gerhard Krumpl,Henning Avenhaus,Horst Possegger

Main category: cs.CV

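TL;DR: ICONIC-444 is a large-scale industrial image dataset with over 3.1 million RGB images spanning 444 classes, designed to support out-of-distribution (OOD) detection research across varying difficulty levels.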

Details Motivation: OOD detection research is limited by the lack of large, high-quality datasets with clearly defined OOD categories, especially across difficulty levels ranging from near- to far-OOD. Method: Builds the ICONIC-444 dataset from images captured with a prototype industrial sorting machine, defines four reference tasks for benchmarking OOD detection methods, and provides baseline results for 22 state-of-the-art post-hoc OOD detection methods. Result: ICONIC-444 contains over 3.1 million images in 444 classes, supports both fine- and coarse-grained computer vision tasks, and offers structured, diverse data for OOD detection. Conclusion: ICONIC-444 complements existing OOD detection datasets in scale, realism, and range of task complexity, and is intended to advance OOD detection research. Abstract: Current progress in out-of-distribution (OOD) detection is limited by the lack of large, high-quality datasets with clearly defined OOD categories across varying difficulty levels (near- to far-OOD) that support both fine- and coarse-grained computer vision tasks. To address this limitation, we introduce ICONIC-444 (Image Classification and OOD Detection with Numerous Intricate Complexities), a specialized large-scale industrial image dataset containing over 3.1 million RGB images spanning 444 classes tailored for OOD detection research. Captured with a prototype industrial sorting machine, ICONIC-444 closely mimics real-world tasks. It complements existing datasets by offering structured, diverse data suited for rigorous OOD evaluation across a spectrum of task complexities. We define four reference tasks within ICONIC-444 to benchmark and advance OOD detection research and provide baseline results for 22 state-of-the-art post-hoc OOD detection methods.

[48] A Unified 3D Object Perception Framework for Real-Time Outside-In Multi-Camera Systems

Yizhou Wang,Sameer Pusegaonkar,Yuxing Wang,Anqi Li,Vishal Kumar,Chetan Sethi,Ganapathy Aiyer,Yun He,Kartikay Thakkar,Swapnil Rathi,Bhushan Rupde,Zheng Tang,Sujit Biswas

Main category: cs.CV

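TL;DR: This paper presents an adapted Sparse4D framework optimized for large-scale industrial infrastructure environments, targeting accurate 3D object perception and multi-target multi-camera (MTMC) tracking.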

Details Motivation: Transferring "inside-out" autonomous driving models to "outside-in" static camera networks is hampered by heterogeneous camera placements and severe occlusion, which make stable identity tracking and perception difficult. Method: Introduces geometric priors in absolute world coordinates and an occlusion-aware ReID embedding module; uses the NVIDIA COSMOS framework for generative data augmentation to narrow the Sim2Real gap; develops a TensorRT plugin for Multi-Scale Deformable Aggregation (MSDA) for hardware acceleration. Result: Achieves a state-of-the-art HOTA of 45.22 on the AI City Challenge 2025 benchmark; hardware acceleration gives a 2.15× speedup, allowing a single Blackwell-class GPU to process over 64 concurrent camera streams. Conclusion: The approach addresses occlusion and deployment challenges for 3D perception and multi-target tracking in complex industrial environments, balancing accuracy and real-time performance, with potential for large-scale deployment. Abstract: Accurate 3D object perception and multi-target multi-camera (MTMC) tracking are fundamental for the digital transformation of industrial infrastructure. However, transitioning "inside-out" autonomous driving models to "outside-in" static camera networks presents significant challenges due to heterogeneous camera placements and extreme occlusion. In this paper, we present an adapted Sparse4D framework specifically optimized for large-scale infrastructure environments. Our system leverages absolute world-coordinate geometric priors and introduces an occlusion-aware ReID embedding module to maintain identity stability across distributed sensor networks. To bridge the Sim2Real domain gap without manual labeling, we employ a generative data augmentation strategy using the NVIDIA COSMOS framework, creating diverse environmental styles that enhance the model's appearance-invariance. Evaluated on the AI City Challenge 2025 benchmark, our camera-only framework achieves a state-of-the-art HOTA of $45.22$. Furthermore, we address real-time deployment constraints by developing an optimized TensorRT plugin for Multi-Scale Deformable Aggregation (MSDA). Our hardware-accelerated implementation achieves a $2.15\times$ speedup on modern GPU architectures, enabling a single Blackwell-class GPU to support over 64 concurrent camera streams.

[49] Can Vision-Language Models Understand Construction Workers? An Exploratory Study

Hieu Bui,Nathaniel E. Chodosh,Arash Tavakoli

Main category: cs.CV

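TL;DR: This study evaluates three leading vision-language models (GPT-4o, Florence 2, and LLaVa-1.5) on recognizing worker actions and emotions in static construction-site images. GPT-4o performs best, but all models struggle to separate semantically close categories and need further improvement before they are reliable in practice.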

Details Motivation: Labeled data from construction sites is scarce, and monitoring worker actions and emotional states is critical for safety and productivity, so general-purpose models that can interpret human behavior without extensive domain-specific training are needed to support safe human-robot collaboration. Method: Uses a dataset of 1,000 images annotated with ten action and ten emotion categories, and tests GPT-4o, Florence 2, and LLaVa-1.5 through standardized inference pipelines and multiple evaluation metrics (such as F1-score and accuracy), with confusion matrices used to analyze error patterns. Result: GPT-4o performs best, with an average F1-score of 0.756 and accuracy of 0.799 in action recognition and an F1-score of 0.712 and accuracy of 0.773 in emotion recognition; Florence 2 is second (action F1: 0.497, emotion F1: 0.414); LLaVa-1.5 is lowest overall (action F1: 0.466, emotion F1: 0.461). All models struggle to distinguish semantically close categories, such as collaborating in teams versus communicating with supervisors. Conclusion: General-purpose vision-language models offer a baseline capability for human behavior recognition in construction settings, with GPT-4o showing the most promise, but improvements such as domain adaptation, temporal modeling, or multimodal sensing may be needed for reliable real-world deployment. Abstract: As robotics become increasingly integrated into construction workflows, their ability to interpret and respond to human behavior will be essential for enabling safe and effective collaboration. Vision-Language Models (VLMs) have emerged as a promising tool for visual understanding tasks and offer the potential to recognize human behaviors without extensive domain-specific training. This capability makes them particularly appealing in the construction domain, where labeled data is scarce and monitoring worker actions and emotional states is critical for safety and productivity. In this study, we evaluate the performance of three leading VLMs, GPT-4o, Florence 2, and LLaVa-1.5, in detecting construction worker actions and emotions from static site images. Using a curated dataset of 1,000 images annotated across ten action and ten emotion categories, we assess each model's outputs through standardized inference pipelines and multiple evaluation metrics. GPT-4o consistently achieved the highest scores across both tasks, with an average F1-score of 0.756 and accuracy of 0.799 in action recognition, and an F1-score of 0.712 and accuracy of 0.773 in emotion recognition. Florence 2 performed moderately, with F1-scores of 0.497 for action and 0.414 for emotion, while LLaVa-1.5 showed the lowest overall performance, with F1-scores of 0.466 for action and 0.461 for emotion.
Confusion matrix analyses revealed that all models struggled to distinguish semantically close categories, such as collaborating in teams versus communicating with supervisors. While the results indicate that general-purpose VLMs can offer a baseline capability for human behavior recognition in construction environments, further improvements, such as domain adaptation, temporal modeling, or multimodal sensing, may be needed for real-world reliability.
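Accuracy and an averaged F1-score can diverge noticeably on multi-class tasks like these, because F1 penalizes classes that are rarely predicted correctly. The sketch below computes both from label lists in plain Python, assuming that "average F1-score" means the unweighted (macro) average over classes; the abstract does not state which averaging was used.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the reference label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging; see lead-in)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with references `["a", "a", "b", "b"]` and predictions `["a", "b", "b", "b"]`, accuracy is 0.75 while macro F1 is about 0.733, since class `a` is recalled only half of the time.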

[50] One Model, Many Behaviors: Training-Induced Effects on Out-of-Distribution Detection

Gerhard Krumpl,Henning Avenhaus,Horst Possegger

Main category: cs.CV

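TL;DR: This study examines how different training strategies for ImageNet-trained ResNet-50 models affect the performance of 21 state-of-the-art OOD detection methods. It finds a non-monotonic relationship between ID accuracy and OOD detection performance and a strong interdependence between training strategy, detector choice, and OOD performance, so that no single method is best in all cases.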

Details Motivation: Despite steady progress in OOD detection methods, their interplay with modern training pipelines, which aim to maximize ID accuracy and generalization, remains poorly understood; the paper studies this relationship systematically. Method: Fixes the architecture to ResNet-50 and benchmarks 21 state-of-the-art post-hoc OOD detection methods on 56 ImageNet-trained models obtained with different training strategies, evaluating on eight OOD test sets. Result: ID accuracy and OOD detection performance are related non-monotonically: OOD performance initially improves with accuracy but declines once advanced training recipes push accuracy beyond the baseline; training strategy, detector choice, and OOD performance are strongly interdependent. Conclusion: Higher ID accuracy does not always mean better OOD detection; the OOD detection method should be chosen with the specific training strategy in mind rather than assuming a universally optimal method. Abstract: Out-of-distribution (OOD) detection is crucial for deploying robust and reliable machine-learning systems in open-world settings. Despite steady advances in OOD detectors, their interplay with modern training pipelines that maximize in-distribution (ID) accuracy and generalization remains under-explored. We investigate this link through a comprehensive empirical study. Fixing the architecture to the widely adopted ResNet-50, we benchmark 21 post-hoc, state-of-the-art OOD detection methods across 56 ImageNet-trained models obtained via diverse training strategies and evaluate them on eight OOD test sets. Contrary to the common assumption that higher ID accuracy implies better OOD detection performance, we uncover a non-monotonic relationship: OOD performance initially improves with accuracy but declines once advanced training recipes push accuracy beyond the baseline. Moreover, we observe a strong interdependence between training strategy, detector choice, and resulting OOD performance, indicating that no single method is universally optimal.
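A post-hoc OOD detector assigns a confidence score to each input using an already trained classifier, without retraining it. The sketch below shows one widely used baseline, the maximum softmax probability (MSP), together with AUROC as a threshold-free evaluation measure. Whether MSP is among the 21 benchmarked methods, and which evaluation measure the paper reports, are assumptions here rather than facts taken from the abstract.

```python
from math import exp


def msp_score(logits):
    """Maximum softmax probability; higher means 'more in-distribution'."""
    peak = max(logits)
    exps = [exp(z - peak) for z in logits]  # subtract the max for numerical stability
    return max(exps) / sum(exps)


def auroc(id_scores, ood_scores):
    """Probability that a random ID sample scores higher than a random OOD sample."""
    if not id_scores or not ood_scores:
        raise ValueError("need at least one ID score and one OOD score")
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5  # ties count as half
    return wins / (len(id_scores) * len(ood_scores))
```

The pairwise AUROC loop is quadratic in the number of samples, which is fine for a sketch but would normally be replaced by a sort-based computation on full test sets.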

[51] Effects of Different Attention Mechanisms Applied on 3D Models in Video Classification

Mohammad Rasras,Iuliana Marin,Serban Radu,Irina Mocanu

Main category: cs.CV

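TL;DR: This paper studies how reducing the temporal information captured while increasing frame resolution affects 3D ResNet models (MC3, R3D, R(2+1)D) for human action recognition, and compares several attention mechanisms added to them. The best variant reaches 88.98% accuracy on UCF101.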

Details Motivation: To explore how increasing spatial resolution while reducing temporal information affects action recognition performance, and to analyze the effect of different attention modules on such models. Method: Builds designs similar to MC3, R3D, and R(2+1)D with a dropout layer added before the final classifier, then creates ten variants of each design that include different attention blocks (such as CBAM, TCN, multi-head attention, and channel attention). Result: On UCF101, the variant that adds multi-head attention to the modified R(2+1)D reaches 88.98% accuracy; variants with similar overall performance behave differently in class-level accuracy. Conclusion: The missing temporal features significantly affect the performance of the increased-resolution models, and the attention mechanisms, despite similar overall gains, have different effects at the class level. Abstract: Human action recognition has become an important research focus in computer vision due to the wide range of applications where it is used. 3D ResNet-based CNN models, particularly MC3, R3D, and R(2+1)D, have different convolutional filters to extract spatiotemporal features. This paper investigates the impact of reducing the captured knowledge from temporal data, while increasing the resolution of the frames. To establish this experiment, we created similar designs to the three originals, but with a dropout layer added before the final classifier. We then developed ten new variants of each of these three designs. The variants include special attention blocks within their architecture, such as convolutional block attention module (CBAM), temporal convolution networks (TCN), in addition to multi-headed and channel attention mechanisms. The purpose is to observe the extent of the influence each of these blocks has on performance for the restricted-temporal models. The results of testing all the models on UCF101 show an accuracy of 88.98% for the variant with multi-headed attention added to the modified R(2+1)D. We conclude that the missing temporal features significantly affect the performance of the newly created increased-resolution models. The variants had different behavior on class-level accuracy, despite the similarity of their enhancements to the overall performance.

[52] Medical SAM3: A Foundation Model for Universal Prompt-Driven Medical Image Segmentation

Chongcong Jiang,Tianxingjian Ding,Chuhan Song,Jiachen Tu,Ziyang Yan,Yihua Shao,Zhenyi Wang,Yuzhang Shang,Tianyu Han,Yu Tian

Main category: cs.CV

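TL;DR: Medical SAM3 is a universal foundation model for medical image segmentation obtained by fully fine-tuning SAM3 on large-scale, heterogeneous 2D and 3D medical imaging datasets. It substantially improves segmentation across organs, modalities, and dimensionalities, particularly in challenging scenarios involving semantic ambiguity, complex morphology, and long-range 3D context.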

Details Motivation: The direct applicability of the original SAM3 to medical image segmentation is limited by severe domain shift, the absence of spatial prompts, and the difficulty of reasoning over complex anatomical and volumetric structures, so full model adaptation is needed. Method: Fine-tunes all of SAM3's parameters on 33 datasets spanning 10 medical imaging modalities, with paired segmentation masks and text prompts, so that the model learns robust domain-specific representations while keeping prompt-driven flexibility. Result: Medical SAM3 achieves consistent and significant gains across organs, imaging modalities, and dimensionalities, particularly in challenging scenarios with semantic ambiguity, complex morphology, and long-range 3D context. Conclusion: Medical SAM3 is a universal, text-guided segmentation foundation model for medical imaging, and the results show that holistic model adaptation is important for robust prompt-driven segmentation under severe domain shift. Abstract: Promptable segmentation foundation models such as SAM3 have demonstrated strong generalization capabilities through interactive and concept-based prompting. However, their direct applicability to medical image segmentation remains limited by severe domain shifts, the absence of privileged spatial prompts, and the need to reason over complex anatomical and volumetric structures. Here we present Medical SAM3, a foundation model for universal prompt-driven medical image segmentation, obtained by fully fine-tuning SAM3 on large-scale, heterogeneous 2D and 3D medical imaging datasets with paired segmentation masks and text prompts. Through a systematic analysis of vanilla SAM3, we observe that its performance degrades substantially on medical data, with its apparent competitiveness largely relying on strong geometric priors such as ground-truth-derived bounding boxes. These findings motivate full model adaptation beyond prompt engineering alone. By fine-tuning SAM3's model parameters on 33 datasets spanning 10 medical imaging modalities, Medical SAM3 acquires robust domain-specific representations while preserving prompt-driven flexibility. Extensive experiments across organs, imaging modalities, and dimensionalities demonstrate consistent and significant performance gains, particularly in challenging scenarios characterized by semantic ambiguity, complex morphology, and long-range 3D context.
Our results establish Medical SAM3 as a universal, text-guided segmentation foundation model for medical imaging and highlight the importance of holistic model adaptation for achieving robust prompt-driven segmentation under severe domain shift. Code and model will be made available at https://github.com/AIM-Research-Lab/Medical-SAM3.

[53] FrankenMotion: Part-level Human Motion Generation and Composition

Chuqiao Li,Xianghui Xie,Yong Cao,Andreas Geiger,Gerard Pons-Moll

Main category: cs.CV

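TL;DR: This paper introduces FrankenMotion, described as the first human motion generation framework built on atomic, temporally-aware part-level text annotations. By constructing a high-quality fine-grained dataset, it enables precise control over individual body parts in both time and space.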

Details Motivation: Existing text-to-motion methods rely mainly on sequence-level or action-level descriptions and lack fine-grained control over individual body parts, which limits the controllability and flexibility of the generated motion. Method: Uses large language models to build a high-quality motion dataset with atomic, temporally-aware part-level text annotations, and proposes FrankenMotion, a diffusion-based part-aware generation framework in which each body part is guided by its own temporally-structured text prompt. Result: FrankenMotion outperforms all baseline models adapted and retrained for the new setting, can compose motions unseen during training, and provides finer spatial and temporal control. Conclusion: FrankenMotion is, to the authors' knowledge, the first text-to-motion method with atomic, temporally-aware part-level control, improving the controllability and generalization of motion generation. Abstract: Human motion generation from text prompts has made remarkable progress in recent years. However, existing methods primarily rely on either sequence-level or action-level descriptions due to the absence of fine-grained, part-level motion annotations. This limits their controllability over individual body parts. In this work, we construct a high-quality motion dataset with atomic, temporally-aware part-level text annotations, leveraging the reasoning capabilities of large language models (LLMs). Unlike prior datasets that either provide synchronized part captions with fixed time segments or rely solely on global sequence labels, our dataset captures asynchronous and semantically distinct part movements at fine temporal resolution. Based on this dataset, we introduce a diffusion-based part-aware motion generation framework, namely FrankenMotion, where each body part is guided by its own temporally-structured textual prompt. This is, to our knowledge, the first work to provide atomic, temporally-aware part-level motion annotations and have a model that allows motion generation with both spatial (body part) and temporal (atomic action) control. Experiments demonstrate that FrankenMotion outperforms all previous baseline models adapted and retrained for our setting, and our model can compose motions unseen during training. Our code and dataset will be publicly available upon publication.

[54] Classification of Chest XRay Diseases through image processing and analysis techniques

Santiago Martínez Novoa,María Catalina Ibáñez,Lina Gómez Mesa,Jeremias Kramer

Main category: cs.CV

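TL;DR: This paper studies several methods for multi-class diagnosis from chest X-ray images, including DenseNet121, develops an open-source web application for comparing them, analyzes the shortcomings of the methods, and suggests improvements.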

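Details Motivation: To improve the accuracy and practicality of multi-class disease diagnosis from chest X-ray images, the performance of existing deep learning methods needs to be systematically evaluated and compared. Method: Deep learning models such as DenseNet121 are applied to multi-class chest X-ray images, performance tests compare the different methods, and an open-source web application is developed to support visualization and interactive analysis. Result: The experimental comparison reveals performance differences among methods on the multi-class chest X-ray task and identifies limitations of current methods. Conclusion: Although models such as DenseNet121 are effective to a degree, there is room for improvement; future work can target model robustness, generalization, and clinical applicability. Abstract: Chest X-ray imaging is one of the most prevalent forms of radiological examination used for diagnosing thoracic diseases. In this study, we offer a concise overview of several methods employed for multi-class classification of chest X-ray images, including DenseNet121. In addition, we deploy an open-source web-based application. We conduct tests to compare the different methods and assess how well they work. We also look closely at the weaknesses of the methods we propose and suggest ideas for improving them in the future. Our code is available at: https://github.com/AML4206-MINE20242/Proyecto_AML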

[55] Self-learned representation-guided latent diffusion model for breast cancer classification in deep ultraviolet whole surface images

Pouya Afshin,David Helminiak,Tianling Niu,Julie M. Jorns,Tina Yen,Bing Yu,Dong Hye Ye

Main category: cs.CV

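TL;DR: Proposes a self-supervised-learning-guided latent diffusion model (SSL-guided LDM) that generates high-quality synthetic DUV image data to augment training on deep ultraviolet fluorescence scanning microscopy images from breast-conserving surgery, combined with a Vision Transformer for high-accuracy WSI classification.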

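Details Motivation: The scarcity of annotated deep ultraviolet fluorescence images limits the use of deep learning models for intraoperative margin assessment in breast-conserving surgery, so an effective way to generate high-quality synthetic data is needed to improve model performance. Method: A fine-tuned DINO serves as a teacher model that extracts semantic features to guide a latent diffusion model in generating synthetic DUV images rich in cellular structural detail; real and synthetic images are combined to train a Vision Transformer, and patch predictions are aggregated for WSI-level classification. Result: The method reaches 96.47% accuracy under 5-fold cross-validation and lowers the FID score to 45.72, significantly outperforming class-conditioned generation baselines. Conclusion: The method effectively mitigates medical image data scarcity; semantically guided synthetic data generation markedly improves WSI classification and shows potential for clinical use. Abstract: Breast-Conserving Surgery (BCS) requires precise intraoperative margin assessment to preserve healthy tissue. Deep Ultraviolet Fluorescence Scanning Microscopy (DUV-FSM) offers rapid, high-resolution surface imaging for this purpose; however, the scarcity of annotated DUV data hinders the training of robust deep learning models. To address this, we propose a Self-Supervised Learning (SSL)-guided Latent Diffusion Model (LDM) to generate high-quality synthetic training patches. By guiding the LDM with embeddings from a fine-tuned DINO teacher, we inject rich semantic details of cellular structures into the synthetic data. We combine real and synthetic patches to fine-tune a Vision Transformer (ViT), utilizing patch prediction aggregation for WSI-level classification. Experiments using 5-fold cross-validation demonstrate that our method achieves 96.47% accuracy and reduces the FID score to 45.72, significantly outperforming class-conditioned baselines.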
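
The abstract does not spell out how patch predictions are aggregated into a WSI-level label. The following is a minimal Python sketch, assuming a simple majority vote over thresholded patch probabilities; the function name, the 0.5 threshold, and the label strings are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def aggregate_patch_predictions(patch_probs, threshold=0.5):
    """Combine per-patch malignancy probabilities into one image-level label.

    patch_probs: list of floats in [0, 1], one per patch (hypothetical
    classifier outputs). Assumes a majority vote; other rules are possible.
    """
    if not patch_probs:
        raise ValueError("need at least one patch prediction")
    # Threshold each patch, then count the votes per class.
    votes = Counter("malignant" if p >= threshold else "benign" for p in patch_probs)
    malignant_fraction = votes["malignant"] / len(patch_probs)
    label = "malignant" if malignant_fraction > 0.5 else "benign"
    return label, malignant_fraction
```

A margin-assessment setting might prefer a more conservative rule (for example, flagging an image when any sufficiently large cluster of patches is positive); the vote above is only the simplest choice.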

[56] RobuMTL: Enhancing Multi-Task Learning Robustness Against Weather Conditions

Tasneem Shaffee,Sherief Reda

Main category: cs.CV

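TL;DR: This paper proposes RobuMTL, a new architecture that improves the robustness of multi-task learning for autonomous systems under adverse weather by dynamically selecting task-specific low-rank adaptation modules with a mixture-of-experts mechanism.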

Details Motivation: In real-world environments, adverse weather can severely degrade model performance, so more robust multi-task learning methods are needed to keep autonomous systems reliable. Method: The RobuMTL framework dynamically selects task-specific hierarchical low-rank adaptation (LoRA) modules and a LoRA expert squad based on input perturbations, achieving adaptive specialization in a mixture-of-experts fashion. Result: On PASCAL, the average relative improvement is +2.8% under single perturbations and up to +44.4% under mixed weather conditions; on NYUD-v2, the average improvement across tasks is +9.7%. Conclusion: RobuMTL markedly improves the robustness and performance of multi-task models under complex real-world conditions, validating adaptive module selection. Abstract: Robust Multi-Task Learning (MTL) is crucial for autonomous systems operating in real-world environments, where adverse weather conditions can severely degrade model performance and reliability. In this paper, we introduce RobuMTL, a novel architecture designed to adaptively address visual degradation by dynamically selecting task-specific hierarchical Low-Rank Adaptation (LoRA) modules and a LoRA expert squad based on input perturbations in a mixture-of-experts fashion. Our framework enables adaptive specialization based on input characteristics, improving robustness across diverse real-world conditions. To validate our approach, we evaluated it on the PASCAL and NYUD-v2 datasets and compared it against single-task models, standard MTL baselines, and state-of-the-art methods. On the PASCAL benchmark, RobuMTL delivers a +2.8% average relative improvement under single perturbations and up to +44.4% under mixed weather conditions compared to the MTL baseline. On NYUD-v2, RobuMTL achieves a +9.7% average relative improvement across tasks. The code is available at GitHub.
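
As a rough illustration of combining LoRA experts in a mixture-of-experts fashion, the sketch below adds gate-weighted low-rank updates to a frozen base layer. The gate weights stand in for the output of a hypothetical perturbation-aware router; the abstract does not describe the actual hierarchy or routing, so all names and shapes here are assumptions.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def moe_lora_forward(W, experts, gate_weights, x):
    """Compute y = W x + sum_e g_e * B_e (A_e x).

    W: frozen base weight (d_out x d_in).
    experts: list of (A, B) pairs with A of shape r x d_in and B of shape d_out x r.
    gate_weights: one scalar per expert (assumed to come from a router).
    """
    out = matvec(W, x)
    for (A, B), g in zip(experts, gate_weights):
        delta = matvec(B, matvec(A, x))  # low-rank update of rank r
        out = [o + g * d for o, d in zip(out, delta)]
    return out
```

Setting one gate weight to 1 and the rest to 0 reduces this to hard selection of a single expert; fractional weights blend experts, which is one plausible way to handle mixed weather conditions.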

[57] Sparse Data Tree Canopy Segmentation: Fine-Tuning Leading Pretrained Models on Only 150 Images

David Szczecina,Hudson Sun,Anthony Bertnyk,Niloofar Azad,Kyle Gao,Lincoln Linlin Xu

Main category: cs.CV

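TL;DR: Evaluates five representative architectures for tree canopy segmentation under extreme data scarcity and finds that pretrained convolutional models (such as YOLOv11 and Mask R-CNN) significantly outperform transformer-based models.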

Details Motivation: Because annotated data is scarce in real-world settings, it is important to determine which deep learning models can detect tree canopies effectively on small, imbalanced datasets. Method: Five representative architectures (YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2) are evaluated on a small, imbalanced dataset of only 150 annotated images, with analysis of training strategies, augmentation policies, and model behavior. Result: Pretrained convolutional models, especially YOLOv11 and Mask R-CNN, generalize markedly better than transformer-based models; DeepLabv3, Swin-UNet, and DINOv2 perform poorly, likely because of differences in task type, the high data requirements of Vision Transformers, and the lack of strong inductive biases. Conclusion: In low-data regimes, transformer-based architectures perform poorly without substantial pretraining or augmentation, while lightweight CNN-based methods remain the most reliable option for canopy detection on limited imagery. Abstract: Tree canopy detection from aerial imagery is an important task for environmental monitoring, urban planning, and ecosystem analysis. Simulating real-life data annotation scarcity, the Solafune Tree Canopy Detection competition provides a small and imbalanced dataset of only 150 annotated images, posing significant challenges for training deep models without severe overfitting. In this work, we evaluate five representative architectures, YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2, to assess their suitability for canopy segmentation under extreme data scarcity. Our experiments show that pretrained convolution-based models, particularly YOLOv11 and Mask R-CNN, generalize significantly better than pretrained transformer-based models. DeepLabv3, Swin-UNet, and DINOv2 underperform, likely due to differences between semantic and instance segmentation tasks, the high data requirements of Vision Transformers, and the lack of strong inductive biases. These findings confirm that transformer-based architectures struggle in low-data regimes without substantial pretraining or augmentation and that differences between semantic and instance segmentation further affect model performance. We provide a detailed analysis of training strategies, augmentation policies, and model behavior under the small-data constraint and demonstrate that lightweight CNN-based methods remain the most reliable for canopy detection on limited imagery.

[58] PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis

K Lokesh,Abhirama Subramanyam Penamakuri,Uday Agarwal,Apoorva Challa,Shreya K Gowda,Somesh Gupta,Anand Mishra

Main category: cs.CV

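TL;DR: Proposes a Pre-Consultation Dialogue Framework (PCDF) that improves medical diagnostic accuracy by simulating diagnostic dialogues between vision-language models, combining images with patient symptoms.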

Details Motivation: Traditional AI medical diagnosis relies mainly on image analysis and lacks patient-reported symptoms, which limits diagnostic accuracy; an interaction style closer to real clinical practice is therefore needed. Method: A dialogue is set up between two vision-language models (VLMs): DocVLM generates questions from the image and the dialogue history, and PatientVLM answers using a symptom profile derived from the ground-truth diagnosis; a clinical validation assesses the realism of the generated symptoms. Result: Clinicians judged the synthetic symptoms to be clinically relevant, comprehensive in coverage, and realistic; the multi-turn DocVLM-PatientVLM dialogues yield paired image-dialogue diagnostic data that can be used to fine-tune the model. Conclusion: Dialogue-based supervision substantially outperforms image-only training, demonstrating the value of realistic symptom elicitation for AI-assisted diagnosis. Abstract: Traditionally, AI research in medical diagnosis has largely centered on image analysis. While this has led to notable advancements, the absence of patient-reported symptoms continues to hinder diagnostic accuracy. To address this, we propose a Pre-Consultation Dialogue Framework (PCDF) that mimics real-world diagnostic procedures, where doctors iteratively query patients before reaching a conclusion. Specifically, we simulate diagnostic dialogues between two vision-language models (VLMs): a DocVLM, which generates follow-up questions based on the image and dialogue history, and a PatientVLM, which responds using a symptom profile derived from the ground-truth diagnosis. We additionally conducted a small-scale clinical validation of the synthetic symptoms generated by our framework, with licensed clinicians confirming their clinical relevance, symptom coverage, and overall realism. These findings indicate that the resulting DocVLM-PatientVLM interactions form coherent, multi-turn consultations paired with images and diagnoses, which we then use to fine-tune the DocVLM. This dialogue-based supervision leads to substantial gains over image-only training, highlighting the value of realistic symptom elicitation for diagnosis.

[59] MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement

Meidan Ding,Jipeng Zhang,Wenxuan Wang,Haiqin Zhong,Xiaoling Luo,Wenting Chen,Linlin Shen

Main category: cs.CV

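TL;DR: This paper proposes MMedExpert-R1, a new method that strengthens the complex clinical reasoning of medical vision-language models through domain-specific adaptation and guideline-based reinforcement learning.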

Details Motivation: Existing medical vision-language models fall short on complex clinical reasoning; deep reasoning data is scarce, multi-specialty alignment is difficult, and standard reinforcement learning algorithms cannot model the diversity of clinical reasoning. Method: A high-quality multi-specialty dataset, MMedExpert, with 10K samples is constructed; Domain-Specific Adaptation (DSA) produces specialty-specific LoRA modules, Guideline-Based Advantages (GBA) model different clinical reasoning perspectives, and Conflict-Aware Capability Integration merges the expert models. Result: Experiments show that the model scores 27.50 on MedXpert-MM and 83.03 on OmniMedVQA, outperforming existing methods. Conclusion: MMedExpert-R1 effectively improves medical vision-language models on complex multi-specialty reasoning tasks and lays a foundation for reliable multimodal medical reasoning systems. Abstract: Medical Vision-Language Models (MedVLMs) excel at perception tasks but struggle with complex clinical reasoning required in real-world scenarios. While reinforcement learning (RL) has been explored to enhance reasoning capabilities, existing approaches face critical mismatches: the scarcity of deep reasoning data, cold-start limits multi-specialty alignment, and standard RL algorithms fail to model clinical reasoning diversity. We propose MMedExpert-R1, a novel reasoning MedVLM that addresses these challenges through domain-specific adaptation and clinical guideline reinforcement. We construct MMedExpert, a high-quality dataset of 10K samples across four specialties with step-by-step reasoning traces. Our Domain-Specific Adaptation (DSA) creates specialty-specific LoRA modules to provide diverse initialization, while Guideline-Based Advantages (GBA) explicitly models different clinical reasoning perspectives to align with real-world diagnostic strategies. Conflict-Aware Capability Integration then merges these specialized experts into a unified agent, ensuring robust multi-specialty alignment. Comprehensive experiments demonstrate state-of-the-art performance, with our 7B model achieving 27.50 on MedXpert-MM and 83.03 on OmniMedVQA, establishing a robust foundation for reliable multimodal medical reasoning systems.

[60] IDDR-NGP: Incorporating Detectors for Distractor Removal with Instant Neural Radiance Field

Xianliang Huang,Jiajie Gou,Shuhang Chen,Zhizhou Zhong,Jihong Guan,Shuigeng Zhou

Main category: cs.CV

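TL;DR: This paper proposes IDDR-NGP, a unified distractor removal method for 3D scenes that handles many types of distractors and restores high-quality 3D scenes by combining implicit 3D representations with 2D detectors.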

Details Motivation: Existing methods usually target one specific type of distractor, and a general solution that handles many distractor types in a unified way is missing. Method: Implicit 3D representations are combined with 2D detectors, an LPIPS loss and a multi-view compensation loss (MVCL) are designed, and the method operates directly on Instant-NGP, optimizing the rendering results end to end. Result: The effectiveness and robustness of IDDR-NGP are validated on a new benchmark dataset containing synthetic and real-world distractors; experiments show that the method removes many types of distractors efficiently, with results comparable to existing SOTA desnowing methods. Conclusion: IDDR-NGP is the first unified distractor removal framework; it aggregates information across views to restore high-quality 3D scenes and applies to a wide range of distractor types. Abstract: This paper presents the first unified distractor removal method, named IDDR-NGP, which directly operates on Instant-NGP. The method is able to remove a wide range of distractors in 3D scenes, such as snowflakes, confetti, defoliation and petals, whereas existing methods usually focus on a specific type of distractor. By incorporating implicit 3D representations with 2D detectors, we demonstrate that it is possible to efficiently restore 3D scenes from multiple corrupted images. We design the learned perceptual image patch similarity (LPIPS) loss and the multi-view compensation loss (MVCL) to jointly optimize the rendering results of IDDR-NGP, which could aggregate information from multi-view corrupted images. All of them can be trained in an end-to-end manner to synthesize high-quality 3D scenes. To support the research on distractor removal in implicit 3D representations, we build a new benchmark dataset that consists of both synthetic and real-world distractors. To validate the effectiveness and robustness of IDDR-NGP, we provide a wide range of distractors with corresponding annotated labels added to both realistic and synthetic scenes. Extensive experimental results demonstrate the effectiveness and robustness of IDDR-NGP in removing multiple types of distractors. In addition, our approach achieves results comparable with the existing SOTA desnow methods and is capable of accurately removing both realistic and synthetic distractors.

[61] Your One-Stop Solution for AI-Generated Video Detection

Long Ma,Zihao Xue,Yan Wang,Zhiyuan Yan,Jin Xu,Xiaorui Jiang,Haiyang Yu,Yong Liao,Zhen Bi

Main category: cs.CV

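TL;DR: This paper presents AIGVDBench, a comprehensive and representative benchmark for AI-generated video detection covering 31 state-of-the-art generation models and more than 440,000 videos; it runs over 1,500 evaluations on 33 existing detectors, presents 8 in-depth analyses, and identifies 4 novel findings.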

Details Motivation: Existing datasets are limited in scale, built with outdated generative models, and lack diversity; current benchmarks largely stop at dataset construction and lack systematic in-depth analysis, making it hard to keep pace with modern generative techniques. Method: A large-scale, high-quality dataset with 31 state-of-the-art generation models and more than 440,000 videos is built, a multi-dimensional evaluation protocol is designed, more than 1,500 experiments are run on 33 detectors from four categories, and 8 in-depth analyses are carried out. Result: More than 1,500 evaluations were completed and 4 novel findings identified, showing how detectors differ in diversity, generalization, and robustness. Conclusion: AIGVDBench provides a solid foundation for AI-generated video detection, exposes the limitations of existing methods, and points to directions for future research; the project is open-sourced. Abstract: Recent advances in generative modeling can create remarkably realistic synthetic videos, making it increasingly difficult for humans to distinguish them from real ones and necessitating reliable detection methods. However, two key limitations hinder the development of this field. From the dataset perspective, existing datasets are often limited in scale and constructed using outdated or narrowly scoped generative models, making it difficult to capture the diversity and rapid evolution of modern generative techniques. Moreover, the dataset construction process frequently prioritizes quantity over quality, neglecting essential aspects such as semantic diversity, scenario coverage, and technological representativeness. From the benchmark perspective, current benchmarks largely remain at the stage of dataset creation, leaving many fundamental issues and in-depth analysis yet to be systematically explored. Addressing this gap, we propose AIGVDBench, a benchmark designed to be comprehensive and representative, covering 31 state-of-the-art generation models and over 440,000 videos. By executing more than 1,500 evaluations on 33 existing detectors belonging to four distinct categories, this work presents 8 in-depth analyses from multiple perspectives and identifies 4 novel findings that offer valuable insights for future research. We hope this work provides a solid foundation for advancing the field of AI-generated video detection. Our benchmark is open-sourced at https://github.com/LongMa-2025/AIGVDBench.

[62] M3DDM+: An improved video outpainting by a modified masking strategy

Takuya Murakawa,Takumi Fukuzawa,Ning Ding,Toru Tamaki

Main category: cs.CV

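TL;DR: M3DDM+ is an improved video outpainting framework that uses a uniform mask direction and width during training and fine-tunes the pretrained model, mitigating M3DDM's quality degradation in information-limited scenarios and improving visual fidelity and temporal coherence.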

Details Motivation: M3DDM uses inconsistent masking strategies at training and inference time in video outpainting, which causes blur and temporal incoherence when camera motion is limited or the outpainting region is large. Method: M3DDM+ applies a uniform mask direction and width to all frames during training and fine-tunes the pretrained M3DDM model on that basis, aligning the training and inference masking strategies. Result: Experiments show that M3DDM+ markedly improves generation quality and temporal coherence in challenging scenarios while retaining the original computational efficiency. Conclusion: By resolving the training-inference masking mismatch, M3DDM+ strengthens video outpainting when information is scarce and is an efficient improvement over M3DDM. Abstract: M3DDM provides a computationally efficient framework for video outpainting via latent diffusion modeling. However, it exhibits significant quality degradation -- manifested as spatial blur and temporal inconsistency -- under challenging scenarios characterized by limited camera motion or large outpainting regions, where inter-frame information is limited. We identify the cause as a training-inference mismatch in the masking strategy: M3DDM's training applies random mask directions and widths across frames, whereas inference requires consistent directional outpainting throughout the video. To address this, we propose M3DDM+, which applies uniform mask direction and width across all frames during training, followed by fine-tuning of the pretrained M3DDM model. Experiments demonstrate that M3DDM+ substantially improves visual fidelity and temporal coherence in information-limited scenarios while maintaining computational efficiency. The code is available at https://github.com/tamaki-lab/M3DDM-Plus.
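
The masking mismatch described above can be made concrete with a small sketch. The Python code below contrasts per-frame random masks with a single mask shared by every frame of a clip. It is an illustration under stated assumptions (binary band masks along one image edge, hypothetical function names), not the released M3DDM+ code.

```python
import random

DIRECTIONS = ("left", "right", "top", "bottom")


def make_mask(height, width, direction, band):
    """Return a height x width grid: 1 marks pixels to outpaint, 0 marks known pixels."""
    mask = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if ((direction == "left" and x < band)
                    or (direction == "right" and x >= width - band)
                    or (direction == "top" and y < band)
                    or (direction == "bottom" and y >= height - band)):
                mask[y][x] = 1
    return mask


def random_per_frame_masks(num_frames, height, width, max_band, rng):
    """Mismatched strategy: every frame draws its own direction and width."""
    return [make_mask(height, width, rng.choice(DIRECTIONS), rng.randint(1, max_band))
            for _ in range(num_frames)]


def uniform_masks(num_frames, height, width, max_band, rng):
    """Aligned strategy: draw direction and width once, reuse for the whole clip."""
    direction = rng.choice(DIRECTIONS)
    band = rng.randint(1, max_band)
    return [make_mask(height, width, direction, band) for _ in range(num_frames)]
```

With per-frame random masks, a region hidden in one frame is often visible in a neighboring frame, so the model can copy it; with uniform masks the same region is hidden in every frame, which matches what inference looks like.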

[63] PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models

Qiyuan Zhang,Biao Gong,Shuai Tan,Zheng Zhang,Yujun Shen,Xing Zhu,Yuyuan Li,Kelu Yao,Chunhua Shen,Changqing Zou

Main category: cs.CV

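TL;DR: This paper proposes a physics-aware reinforcement learning paradigm that, for the first time, enforces physical collision rules in high-dimensional spaces to improve the physical realism of video generation models, and introduces the unified Mimicry-Discovery Cycle (MDcycle) framework, which allows substantial fine-tuning while preserving the ability to use physics-grounded feedback.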

Details Motivation: Existing transformer-based video generation models ignore physical principles such as object rigidity during pixel-level denoising, so generated videos lack physical realism in rigid-body motion and collisions, which limits their use in applications that require physical accuracy. Method: A physics-aware reinforcement learning paradigm imposes physical collision constraints directly in high-dimensional spaces; the MDcycle framework combines mimicry and discovery to retain physics-grounded feedback during fine-tuning; a new benchmark, PhysRVGBench, is built for evaluation. Result: Experiments show that the method markedly improves the physical plausibility of generated videos on the new PhysRVGBench benchmark, with strong results in both qualitative and quantitative evaluation. Conclusion: Embedding physical laws directly in the optimization of the generative model, rather than treating them as conditions, effectively improves the physical realism of video generation and offers a new direction for physically consistent visual generation models. Abstract: Physical principles are fundamental to realistic visual simulation, but remain a significant oversight in transformer-based video generation. This gap highlights a critical limitation in rendering rigid body motion, a core tenet of classical mechanics. While computer graphics and physics-based simulators can easily model such collisions using Newtonian formulas, modern pretrain-finetune paradigms discard the concept of object rigidity during pixel-level global denoising. Even perfectly correct mathematical constraints are treated as suboptimal solutions (i.e., conditions) during model optimization in post-training, fundamentally limiting the physical realism of generated videos. Motivated by these considerations, we introduce, for the first time, a physics-aware reinforcement learning paradigm for video generation models that enforces physical collision rules directly in high-dimensional spaces, ensuring the physics knowledge is strictly applied rather than treated as conditions. Subsequently, we extend this paradigm to a unified framework, termed Mimicry-Discovery Cycle (MDcycle), which allows substantial fine-tuning while fully preserving the model's ability to leverage physics-grounded feedback. To validate our approach, we construct a new benchmark, PhysRVGBench, and perform extensive qualitative and quantitative experiments to thoroughly assess its effectiveness.

[64] CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation

Shuai Tan,Biao Gong,Ke Ma,Yutong Feng,Qiyuan Zhang,Yan Wang,Yujun Shen,Hengshuang Zhao

Main category: cs.CV

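TL;DR: Proposes CoDance, an Unbind-Rebind framework for animating character images with arbitrary subject counts, types, and spatial layouts, addressing the spatial misalignment and motion binding problems of existing multi-subject animation methods.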

Details Motivation: In multi-subject animation, existing methods are constrained by rigid spatial binding and cannot accurately bind motion to the intended subjects, so they struggle with arbitrary subject counts, character types, and spatial misalignment. Method: The CoDance framework consists of an Unbind module, which uses a pose shift encoder to introduce stochastic perturbations and learn a location-agnostic motion representation, and a Rebind module, which uses semantic guidance from text prompts and spatial guidance from masks to bind motion precisely to the target characters. Result: Experiments on the newly built CoDanceBench and existing datasets show that CoDance achieves SOTA performance on multi-subject animation and generalizes well across diverse character combinations and spatial layouts. Conclusion: The Unbind-Rebind mechanism effectively resolves spatial misalignment and motion assignment in multi-subject image animation, markedly improving flexibility and robustness; the code and model weights will be open-sourced. Abstract: Character image animation is gaining significant importance across various domains, driven by the demand for robust and flexible multi-subject rendering. While existing methods excel in single-person animation, they struggle to handle arbitrary subject counts, diverse character types, and spatial misalignment between the reference image and the driving poses. We attribute these limitations to an overly rigid spatial binding that forces strict pixel-wise alignment between the pose and reference, and an inability to consistently rebind motion to intended subjects. To address these challenges, we propose CoDance, a novel Unbind-Rebind framework that enables the animation of arbitrary subject counts, types, and spatial configurations conditioned on a single, potentially misaligned pose sequence. Specifically, the Unbind module employs a novel pose shift encoder to break the rigid spatial binding between the pose and the reference by introducing stochastic perturbations to both poses and their latent features, thereby compelling the model to learn a location-agnostic motion representation. To ensure precise control and subject association, we then devise a Rebind module, leveraging semantic guidance from text prompts and spatial guidance from subject masks to direct the learned motion to intended characters. Furthermore, to facilitate comprehensive evaluation, we introduce a new multi-subject CoDanceBench. Extensive experiments on CoDanceBench and existing datasets show that CoDance achieves SOTA performance, exhibiting remarkable generalization across diverse subjects and spatial layouts. The code and weights will be open-sourced.

[65] Graph Smoothing for Enhanced Local Geometry Learning in Point Cloud Analysis

Shangbo Yuan,Jie Xu,Ping Hu,Xiaofeng Zhu,Na Zhao

Main category: cs.CV

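TL;DR: Proposes a new method that combines a graph smoothing module with an enhanced local geometry learning module to optimize the graph structure in 3D point cloud analysis, mitigating sparse connections at boundary points and noisy connections in junction areas.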

Details Motivation: When processing 3D point clouds, existing graph-based methods often end up with suboptimal graph structures because of sparse connections at boundary points and noisy connections in junction areas, which hurts analysis performance. Method: A graph smoothing module optimizes the graph structure and reduces the influence of unreliable connections; on the optimized graph, shape features are extracted from eigenvector-based adaptive geometric descriptors and distribution features are obtained through a cylindrical coordinate transformation, enhancing local geometry learning. Result: Experiments on real-world datasets show strong performance on point cloud learning tasks including classification, part segmentation, and semantic segmentation. Conclusion: By optimizing the graph structure and fusing local geometric information, the method markedly improves the accuracy and robustness of 3D point cloud analysis. Abstract: Graph-based methods have proven to be effective in capturing relationships among points for 3D point cloud analysis. However, these methods often suffer from suboptimal graph structures, particularly due to sparse connections at boundary points and noisy connections in junction areas. To address these challenges, we propose a novel method that integrates a graph smoothing module with an enhanced local geometry learning module. Specifically, we identify the limitations of conventional graph structures, particularly in handling boundary points and junction areas. In response, we introduce a graph smoothing module designed to optimize the graph structure and minimize the negative impact of unreliable sparse and noisy connections. Based on the optimized graph structure, we improve the feature extraction function with local geometry information. This includes shape features derived from adaptive geometric descriptors based on eigenvectors and distribution features obtained through cylindrical coordinate transformation. Experimental results on real-world datasets validate the effectiveness of our method in various point cloud learning tasks, i.e., classification, part segmentation, and semantic segmentation.
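
For readers unfamiliar with the cylindrical coordinate transformation mentioned above, the sketch below converts a neighbor's offset from a center point into (radius, angle, height). Choosing the z axis as the cylinder axis is an assumption made for illustration; the abstract does not say which axis or local frame the authors use.

```python
import math


def to_cylindrical(center, neighbor):
    """Express a neighbor relative to a center point as (rho, theta, z).

    rho: distance in the xy-plane, theta: angle in radians from the x axis,
    z: height offset. The z axis as cylinder axis is an assumed convention.
    """
    dx, dy, dz = (n - c for n, c in zip(neighbor, center))
    return math.hypot(dx, dy), math.atan2(dy, dx), dz
```

Applying this to every neighbor in a local graph yields a description of how points are distributed around the center, which is the kind of distribution feature the abstract refers to.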

[66] Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

Shaofeng Yin,Jiaxin Ge,Zora Zhiruo Wang,Xiuyu Li,Michael J. Black,Trevor Darrell,Angjoo Kanazawa,Haiwen Feng

Main category: cs.CV

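TL;DR: This paper presents VIGA (Vision-as-Inverse-Graphics Agent), which converts images into editable graphics programs through an iterative closed loop of write, run, render, compare, and revise; it supports many tasks without fine-tuning and substantially improves existing models on multimodal reasoning tasks.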

Details Motivation: Existing vision-language models (VLMs) lack fine-grained spatial and physical grounding and struggle to achieve vision-as-inverse-graphics, so a framework with iterative multimodal reasoning is needed to close the gap. Method: VIGA uses a closed-loop write-run-render-compare-revise procedure, combined with a skill library that alternates generation and verification and an evolving context memory containing plans, code diffs, and render history, to reconstruct and edit general scenes without auxiliary modules. Result: VIGA improves over one-shot baselines by 35.32% on BlenderGym, 117.17% on SlideBench, and 124.70% on the newly proposed BlenderBench, confirming its effectiveness on long-horizon multimodal reasoning tasks. Conclusion: VIGA is a task-agnostic and model-agnostic vision-as-inverse-graphics framework; iterative reasoning markedly improves VLM performance on complex spatial and physical understanding tasks and provides a unified protocol for evaluating foundation VLMs. Abstract: Vision-as-inverse-graphics, the concept of reconstructing an image as an editable graphics program, is a long-standing goal of computer vision. Yet even strong VLMs aren't able to achieve this in one shot, as they lack fine-grained spatial and physical grounding capability. Our key insight is that closing this gap requires interleaved multimodal reasoning through iterative execution and verification. Stemming from this, we present VIGA (Vision-as-Inverse-Graphics Agent), which starts from an empty world and reconstructs or edits scenes through a closed-loop write-run-render-compare-revise procedure. To support long-horizon reasoning, VIGA combines (i) a skill library that alternates generator and verifier roles and (ii) an evolving context memory that contains plans, code diffs, and render history. VIGA is task-agnostic as it doesn't require auxiliary modules, covering a wide range of tasks such as 3D reconstruction, multi-step scene editing, 4D physical interaction, and 2D document editing. Empirically, we found VIGA substantially improves one-shot baselines on BlenderGym (35.32%) and SlideBench (117.17%). Moreover, VIGA is also model-agnostic as it doesn't require finetuning, enabling a unified protocol to evaluate heterogeneous foundation VLMs. To better support this protocol, we introduce BlenderBench, a challenging benchmark that stress-tests interleaved multimodal reasoning with a graphics engine, where VIGA improves by 124.70%.
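
The write-run-render-compare-revise procedure can be pictured as a generic control loop. The Python sketch below is only a schematic under stated assumptions: the four callables are hypothetical stand-ins for the generator, the graphics engine, and the verifier, and the memory is a plain list rather than the structured context memory the abstract describes.

```python
def closed_loop_edit(target, write, run_and_render, compare, revise,
                     max_steps=10, tol=0.0):
    """Iterate write -> run/render -> compare -> revise until the render matches.

    Hypothetical callables:
      write(target) -> initial program
      run_and_render(program) -> rendered output
      compare(render, target) -> non-negative error
      revise(program, render, target, memory) -> new program
    """
    memory = []                      # evolving history of (program, error)
    program = write(target)
    for _ in range(max_steps):
        render = run_and_render(program)
        error = compare(render, target)
        memory.append((program, error))
        if error <= tol:             # verifier accepts the current program
            break
        program = revise(program, render, target, memory)
    return program, memory
```

As a toy usage, let a "program" be an integer whose "render" is twice its value; starting from 0 with a target of 10 and a revise step that nudges the integer by one, the loop stops at 5 after six iterations.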

[67] SoLA-Vision: Fine-grained Layer-wise Linear Softmax Hybrid Attention

Ruibang Li,Guan Luo,Yiwei Zhang,Jin Gao,Bing Li,Weiming Hu

Main category: cs.CV

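TL;DR: Proposes SoLA-Vision, a vision model with fine-grained hybridization of linear and softmax attention that improves accuracy while keeping computational cost low.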

Details Motivation: Standard softmax self-attention is expensive at high resolution, while linear attention is efficient but less expressive; a design that balances efficiency and performance is needed. Method: The differences between linear and softmax attention are analyzed from a layer-stacking perspective, different hybridization patterns are explored experimentally, and SoLA-Vision, an architecture with flexible layer-wise hybridization, is proposed. Result: SoLA-Vision outperforms purely linear and other hybrid models on ImageNet-1K and on dense prediction tasks, achieving a better accuracy-efficiency trade-off with only a few softmax layers. Conclusion: Fine-grained layer-wise hybrid attention outperforms rigid intra-block designs, and SoLA-Vision offers a new paradigm for efficient visual modeling. Abstract: Standard softmax self-attention excels in vision tasks but incurs quadratic complexity O(N^2), limiting high-resolution deployment. Linear attention reduces the cost to O(N), yet its compressed state representations can impair modeling capacity and accuracy. We present an analytical study that contrasts linear and softmax attention for visual representation learning from a layer-stacking perspective. We further conduct systematic experiments on layer-wise hybridization patterns of linear and softmax attention. Our results show that, compared with rigid intra-block hybrid designs, fine-grained layer-wise hybridization can match or surpass performance while requiring fewer softmax layers. Building on these findings, we propose SoLA-Vision (Softmax-Linear Attention Vision), a flexible layer-wise hybrid attention backbone that enables fine-grained control over how linear and softmax attention are integrated. By strategically inserting a small number of global softmax layers, SoLA-Vision achieves a strong trade-off between accuracy and computational cost. On ImageNet-1K, SoLA-Vision outperforms purely linear and other hybrid attention models. On dense prediction tasks, it consistently surpasses strong baselines by a considerable margin. Code will be released.
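
To make the O(N^2) versus O(N) contrast concrete, the sketch below implements both attention types in plain Python, plus a helper that lays out a layer-wise hybrid pattern. The elu(x) + 1 feature map is a common choice for linear attention and is assumed here; the abstract does not state which kernel or which layer positions SoLA-Vision uses.

```python
import math


def softmax_attention(Q, K, V):
    """Quadratic cost: every query scores every key (an N x N weight matrix)."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)                       # subtract max for stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def feature_map(x):
    """elu(x) + 1: positive features (assumed kernel choice)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def linear_attention(Q, K, V):
    """Linear cost: keys and values are compressed once into a d x d_v state."""
    d, dv = len(Q[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]        # sum over n of phi(k_n) v_n^T
    z = [0.0] * d                             # sum over n of phi(k_n)
    for k, v in zip(K, V):
        fk = feature_map(k)
        for i in range(d):
            z[i] += fk[i]
            for j in range(dv):
                S[i][j] += fk[i] * v[j]
    out = []
    for q in Q:
        fq = feature_map(q)
        denom = sum(a * b for a, b in zip(fq, z))
        out.append([sum(fq[i] * S[i][j] for i in range(d)) / denom
                    for j in range(dv)])
    return out


def layer_pattern(num_layers, softmax_positions):
    """Label each layer as linear or softmax (positions are hypothetical)."""
    return ["softmax" if i in softmax_positions else "linear"
            for i in range(num_layers)]
```

The fixed-size state `S` is the "compressed state representation" the abstract mentions: it is what makes linear attention cheap, and also what limits how sharply a query can single out one key.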

[68] Democratizing planetary-scale analysis: An ultra-lightweight Earth embedding database for accurate and flexible global land monitoring

Shuang Chen,Jie Wang,Shuai Yuan,Jiayang Li,Yu Xia,Yuanhong Liao,Junbo Wei,Jincheng Yuan,Xiaoqing Xu,Xiaolin Zhu,Peng Zhu,Hongsheng Zhang,Yuyu Zhou,Haohuan Fu,Huabing Huang,Bin Chen,Fan Dai,Peng Gong

Main category: cs.CV

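TL;DR: This paper presents ESD, an ultra-lightweight global Earth embedding database that compresses multi-source remote sensing data into information-dense latent vectors, reducing data volume about 340-fold and enabling long-term global-scale analysis on standard workstations.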

Details Motivation: The sheer volume of satellite remote sensing data makes computation and storage too costly and restricts global-scale research. Method: Using Landsat and MODIS data, the ESDNet network and Finite Scalar Quantization (FSQ) convert high-dimensional observations into low-dimensional embedding vectors in a unified latent space, with 12 temporal steps representing the annual cycle. Result: The data is compressed about 340-fold, so one year of global land data requires only about 2.4 TB; reconstruction fidelity is high (MAE: 0.0130), and land-cover classification accuracy reaches 79.74% (versus 76.92% for raw data). Conclusion: ESD provides an efficient, accessible data foundation for global-scale Earth observation research, advances geospatial artificial intelligence, and supports few-shot learning and long-term consistent analysis. Abstract: The rapid evolution of satellite-borne Earth Observation (EO) systems has revolutionized terrestrial monitoring, yielding petabyte-scale archives. However, the immense computational and storage requirements for global-scale analysis often preclude widespread use, hindering planetary-scale studies. To address these barriers, we present Embedded Seamless Data (ESD), an ultra-lightweight, 30-m global Earth embedding database spanning the 25-year period from 2000 to 2024. By transforming high-dimensional, multi-sensor observations from the Landsat series (5, 7, 8, and 9) and MODIS Terra into information-dense, quantized latent vectors, ESD distills essential geophysical and semantic features into a unified latent space. Utilizing the ESDNet architecture and Finite Scalar Quantization (FSQ), the dataset achieves a transformative ~340-fold reduction in data volume compared to raw archives. This compression allows the entire global land surface for a single year to be encapsulated within approximately 2.4 TB, enabling decadal-scale global analysis on standard local workstations. Rigorous validation demonstrates high reconstructive fidelity (MAE: 0.0130; RMSE: 0.0179; CC: 0.8543). By condensing the annual phenological cycle into 12 temporal steps, the embeddings provide inherent denoising and a semantically organized space that outperforms raw reflectance in land-cover classification, achieving 79.74% accuracy (vs. 76.92% for raw fusion). With robust few-shot learning capabilities and longitudinal consistency, ESD provides a versatile foundation for democratizing planetary-scale research and advancing next-generation geospatial artificial intelligence.
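
Two things in this entry lend themselves to a quick sketch: how Finite Scalar Quantization turns a continuous latent into a few integers, and what the quoted storage figures imply. The Python code below shows a simplified FSQ forward pass (odd level counts only, no straight-through gradient) and back-of-envelope arithmetic. The level counts are illustrative, and the implied raw volume assumes the ~340-fold ratio applies to a single year, which the abstract does not state explicitly.

```python
import math


def fsq_quantize(z, levels):
    """Bound each latent coordinate with tanh, then round to an integer code.

    levels[i] is the number of allowed values for coordinate i. This sketch
    handles odd level counts only; even counts need an extra half-step offset.
    """
    codes = []
    for zi, num_levels in zip(z, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch handles odd level counts only")
        half = (num_levels - 1) / 2
        codes.append(round(half * math.tanh(zi)))  # integer in [-half, half]
    return codes


def codebook_size(levels):
    """The implicit codebook is the product of the per-coordinate level counts."""
    size = 1
    for num_levels in levels:
        size *= num_levels
    return size


# Storage arithmetic from the figures quoted in the abstract.
embedding_tb_per_year = 2.4
compression_ratio = 340
implied_raw_tb_per_year = embedding_tb_per_year * compression_ratio  # about 816 TB
embeddings_tb_25_years = embedding_tb_per_year * 25                  # about 60 TB
```

Under these assumptions, the full 2000 to 2024 embedding archive would occupy roughly 60 TB, which is the scale at which a well-equipped local workstation becomes plausible.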

[69] ATATA: One Algorithm to Align Them All

Boyi Pang,Savva Ignatyev,Vladimir Ippolitov,Ramil Khafizov,Yurii Melnik,Oleg Voynov,Maksim Nakhodnov,Aibek Alanov,Xiaopeng Fan,Peter Wonka,Evgeny Burnaev

Main category: cs.CV

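TL;DR: Proposes a multi-modal joint inference algorithm based on Rectified Flow that generates structurally aligned samples efficiently, with faster inference and high quality across image, video, and 3D shape generation.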

Details Motivation: Existing methods do not fully address structural alignment in joint generation, and Score Distillation Sampling (SDS) is time-consuming, prone to mode collapse, and often yields cartoonish results. Method: A new multi-modal algorithm based on Rectified Flow models jointly transports a segment of samples in a structured latent space; it can be built on an arbitrary base model and supports fast inference. Result: On image, video, and 3D generation tasks, the method achieves a high degree of structural alignment and high visual quality; it is much faster than SDS-style methods, improves on existing methods for image and video generation, and delivers comparable 3D quality while running orders of magnitude faster. Conclusion: The method offers an efficient, high-quality solution for multi-modal joint generation, markedly speeding up inference while preserving generation quality. Abstract: We suggest a new multi-modal algorithm for joint inference of paired structurally aligned samples with Rectified Flow models. While some existing methods propose a codependent generation process, they do not view the problem of joint generation from a structural alignment perspective. Recent work uses Score Distillation Sampling to generate aligned 3D models, but SDS is known to be time-consuming, prone to mode collapse, and often provides cartoonish results. By contrast, our suggested approach relies on the joint transport of a segment in the sample space, yielding faster computation at inference time. Our approach can be built on top of an arbitrary Rectified Flow model operating on the structured latent space. We show the applicability of our method to the domains of image, video, and 3D shape generation using state-of-the-art baselines and evaluate it against both editing-based and joint inference-based competing approaches. We demonstrate a high degree of structural alignment for the sample pairs obtained with our method and a high visual quality of the samples. Our method improves the state-of-the-art for image and video generation pipelines. For 3D generation, it is able to show comparable quality while working orders of magnitude faster.

[70] Bio-inspired fine-tuning for selective transfer learning in image classification

Ana Davila,Jacinto Colan,Yasuhisa Hasegawa

Main category: cs.CV

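TL;DR: BioTune is an adaptive fine-tuning technique based on evolutionary optimization that improves transfer learning for cross-domain image classification by choosing which layers to freeze and adjusting learning rates, outperforming existing methods across multiple datasets and CNN architectures.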

Details Motivation: Because of discrepancies between source and target domains, conventional transfer learning struggles to transfer knowledge effectively with limited labeled data, and more flexible fine-tuning strategies are needed to adapt to different domains and distribution shifts. Method: BioTune uses an evolutionary optimization algorithm to decide automatically which network layers to freeze and to adjust the learning rates of the unfrozen layers, giving adaptive control over the transfer learning process. Result: BioTune is validated on nine image classification datasets (natural and medical images) and four CNN architectures; its accuracy and efficiency exceed those of advanced methods such as AutoRGN and LoRA, and ablation studies confirm the role of each component. Conclusion: BioTune is an efficient, flexible fine-tuning method for transfer learning that adapts to different data characteristics and distribution shifts and performs strongly across a wide range of scenarios. Abstract: Deep learning has significantly advanced image analysis across diverse domains but often depends on large, annotated datasets for success. Transfer learning addresses this challenge by utilizing pre-trained models to tackle new tasks with limited labeled data. However, discrepancies between source and target domains can hinder effective transfer learning. We introduce BioTune, a novel adaptive fine-tuning technique utilizing evolutionary optimization. BioTune enhances transfer learning by optimally choosing which layers to freeze and adjusting learning rates for unfrozen layers. Through extensive evaluation on nine image classification datasets, spanning natural and specialized domains such as medical imaging, BioTune demonstrates superior accuracy and efficiency over state-of-the-art fine-tuning methods, including AutoRGN and LoRA, highlighting its adaptability to various data characteristics and distribution changes. Additionally, BioTune consistently achieves top performance across four different CNN architectures, underscoring its flexibility. Ablation studies provide valuable insights into the impact of BioTune's key components on overall performance. The source code is available at https://github.com/davilac/BioTune.
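
The idea of searching over freeze decisions and learning rates can be sketched with a very small evolutionary loop. The Python code below is a generic elitist search, not BioTune's actual algorithm: the genome layout (one freeze flag per layer plus a single shared learning rate), the mutation scheme, and the `fitness` callable (a stand-in for validation accuracy after a short fine-tuning run) are all assumptions.

```python
import random


def evolve(fitness, num_layers, lr_choices, pop_size=8, generations=20, seed=0):
    """Search for a (frozen_flags, learning_rate) pair that maximizes fitness.

    fitness: hypothetical callable scoring a genome; higher is better.
    """
    rng = random.Random(seed)

    def random_genome():
        frozen = [rng.random() < 0.5 for _ in range(num_layers)]
        return (frozen, rng.choice(lr_choices))

    def mutate(genome):
        frozen, lr = genome
        frozen = list(frozen)                 # copy so parents stay unchanged
        i = rng.randrange(num_layers)
        frozen[i] = not frozen[i]             # flip one freeze decision
        if rng.random() < 0.3:
            lr = rng.choice(lr_choices)       # occasionally resample the rate
        return (frozen, lr)

    population = [random_genome() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # elitism: keep the best half
        children = [mutate(rng.choice(parents))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)
```

In practice each fitness evaluation means a fine-tuning run, so the population size and generation count trade search quality against compute; per-layer learning rates, as the abstract suggests, would extend the genome from one rate to one rate per unfrozen layer.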

[71] Image-Text Knowledge Modeling for Unsupervised Multi-Scenario Person Re-Identification

Zhiqi Pang,Lingling Zhao,Yang Liu,Chunyu Wang,Gaurav Sharma

Main category: cs.CV

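TL;DR: Proposes a new task, unsupervised multi-scenario person re-identification (UMS-ReID), and a three-stage framework based on image-text knowledge modeling (ITKM) that uses vision-language models to improve ReID performance across scenarios such as cross-resolution and clothing change.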

Details Motivation: Conventional person re-identification methods mostly target a single specific scenario and generalize poorly to the variety of real deployments, so an unsupervised ReID framework that works uniformly across heterogeneous scenarios is needed. Method: Proposes the ITKM framework. Stage I introduces a scenario embedding into the CLIP image encoder and fine-tunes it to absorb multi-scenario knowledge. Stage II optimizes text embeddings associated with pseudo-labels and adds a multi-scenario separation loss to increase the divergence between text representations. Stage III builds reliable positive pairs with cluster-level and instance-level heterogeneous matching modules and uses a dynamic text representation update strategy to keep image and text supervision consistent. Result: Experiments on several cross-scenario ReID datasets show that ITKM outperforms existing scenario-specific methods and improves overall performance by integrating multi-scenario knowledge. Conclusion: ITKM provides an effective and general solution for unsupervised multi-scenario person re-identification and shows the potential of vision-language models for complex ReID tasks. Abstract: We propose unsupervised multi-scenario (UMS) person re-identification (ReID) as a new task that expands ReID across diverse scenarios (cross-resolution, clothing change, etc.) within a single coherent framework. To tackle UMS-ReID, we introduce image-text knowledge modeling (ITKM) -- a three-stage framework that effectively exploits the representational power of vision-language models. We start with a pre-trained CLIP model with an image encoder and a text encoder. In Stage I, we introduce a scenario embedding in the image encoder and fine-tune the encoder to adaptively leverage knowledge from multiple scenarios. In Stage II, we optimize a set of learned text embeddings to associate with pseudo-labels from Stage I and introduce a multi-scenario separation loss to increase the divergence between inter-scenario text representations. In Stage III, we first introduce cluster-level and instance-level heterogeneous matching modules to obtain reliable heterogeneous positive pairs (e.g., a visible image and an infrared image of the same person) within each scenario. Next, we propose a dynamic text representation update strategy to maintain consistency between text and image supervision signals. Experimental results across multiple scenarios demonstrate the superiority and generalizability of ITKM; it not only outperforms existing scenario-specific methods but also enhances overall performance by integrating knowledge from multiple scenarios.

[72] Language-Agnostic Visual Embeddings for Cross-Script Handwriting Retrieval

Fangke Chen,Tianhao Dong,Sirry Chen,Guobin Zhang,Yishu Zhang,Yining Chen

Main category: cs.CV

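TL;DR: Proposes a lightweight asymmetric dual-encoder framework for cross-lingual handwritten text retrieval that jointly optimizes instance-level alignment and class-level semantic consistency to learn style-invariant visual embeddings, reaching state-of-the-art performance on several benchmarks at a much lower computational cost.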

Details Motivation: Handwritten word retrieval is vital for digital archives, but existing methods are challenged by large handwriting variability and cross-lingual semantic gaps; large vision-language models are too computationally expensive to deploy on edge devices. Method: Proposes a lightweight asymmetric dual-encoder framework that learns unified, style-invariant visual embeddings by jointly optimizing instance-level alignment and class-level semantic consistency, anchoring the visual embeddings to language-agnostic semantic prototypes to enforce invariance across scripts and writing styles. Result: The method outperforms 28 baselines on within-language retrieval and reaches state-of-the-art accuracy; it also performs strongly on explicit cross-lingual retrieval, validating the learned representations. Conclusion: The framework delivers efficient and accurate cross-lingual handwritten text retrieval with far fewer parameters, making it suitable for resource-constrained settings. Abstract: Handwritten word retrieval is vital for digital archives but remains challenging due to large handwriting variability and cross-lingual semantic gaps. While large vision-language models offer potential solutions, their prohibitive computational costs hinder practical edge deployment. To address this, we propose a lightweight asymmetric dual-encoder framework that learns unified, style-invariant visual embeddings. By jointly optimizing instance-level alignment and class-level semantic consistency, our approach anchors visual embeddings to language-agnostic semantic prototypes, enforcing invariance across scripts and writing styles. Experiments show that our method outperforms 28 baselines and achieves state-of-the-art accuracy on within-language retrieval benchmarks. We further conduct explicit cross-lingual retrieval, where the query language differs from the target language, to validate the effectiveness of the learned cross-lingual representations. Achieving strong performance with only a fraction of the parameters required by existing models, our framework enables accurate and resource-efficient cross-script handwriting retrieval.

[73] FTDMamba: Frequency-Assisted Temporal Dilation Mamba for Unmanned Aerial Vehicle Video Anomaly Detection

Cheng-Zhuang Liu,Si-Bao Chen,Qing-Ling Shu,Chris Ding,Jin Tang,Bin Luo

Main category: cs.CV

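TL;DR: Proposes FTDMamba, a network for UAV video anomaly detection (UAV VAD) that combines frequency-domain analysis with sequence modeling to disentangle coupled motion in dynamic backgrounds, and builds MUVAD, a new large-scale dynamic-background dataset, achieving state-of-the-art performance.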

Details Motivation: Existing VAD methods produce false detections on UAV videos with dynamic backgrounds because of motion coupling, and they do not jointly model spatiotemporal correlations across multiple scales; large-scale datasets for dynamic scenes are also missing. Method: Proposes the FTDMamba network with two core modules: (1) a Frequency Decoupled Spatiotemporal Correlation Module, which separates motion patterns through frequency analysis and models global spatiotemporal dependencies; and (2) a Temporal Dilation Mamba Module, which uses Mamba to learn fine-grained temporal dynamics and local spatial structure across multiple temporal receptive fields. A new dataset, MUVAD, with about 220K frames is also built. Result: State-of-the-art performance on two public static benchmarks and on the new MUVAD dataset. Conclusion: FTDMamba addresses the anomaly detection difficulties caused by multi-source motion coupling in dynamic-background UAV video, and the new dataset supports further research in this direction. Abstract: Recent advances in video anomaly detection (VAD) mainly focus on ground-based surveillance or unmanned aerial vehicle (UAV) videos with static backgrounds, whereas research on UAV videos with dynamic backgrounds remains limited. Unlike static scenarios, dynamically captured UAV videos exhibit multi-source motion coupling, where the motion of objects and UAV-induced global motion are intricately intertwined. Consequently, existing methods may misclassify normal UAV movements as anomalies or fail to capture true anomalies concealed within dynamic backgrounds. Moreover, many approaches do not adequately address the joint modeling of inter-frame continuity and local spatial correlations across diverse temporal scales. To overcome these limitations, we propose the Frequency-Assisted Temporal Dilation Mamba (FTDMamba) network for UAV VAD, including two core components: (1) a Frequency Decoupled Spatiotemporal Correlation Module, which disentangles coupled motion patterns and models global spatiotemporal dependencies through frequency analysis; and (2) a Temporal Dilation Mamba Module, which leverages Mamba's sequence modeling capability to jointly learn fine-grained temporal dynamics and local spatial structures across multiple temporal receptive fields. Additionally, unlike existing UAV VAD datasets which focus on static backgrounds, we construct a large-scale Moving UAV VAD dataset (MUVAD), comprising 222,736 frames with 240 anomaly events across 12 anomaly types. Extensive experiments demonstrate that FTDMamba achieves state-of-the-art (SOTA) performance on two public static benchmarks and the new MUVAD dataset. The code and MUVAD dataset will be available at: https://github.com/uavano/FTDMamba.

[74] X-Distill: Cross-Architecture Vision Distillation for Visuomotor Learning

Maanping Shao,Feihong Zhang,Gu Zhang,Baiye Cheng,Zhengrong Xue,Huazhe Xu

Main category: cs.CV

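TL;DR: Proposes X-Distill, which uses cross-architecture knowledge distillation to transfer DINOv2's visual representations into a compact ResNet-18, improving robotic manipulation performance in data-efficient settings.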

Details Motivation: In data-scarce robot learning, large ViTs generalize well but are hard to train, while small CNNs are easier to optimize, so the strengths of the two need to be balanced. Method: Uses offline cross-architecture knowledge distillation: DINOv2 is first distilled into a ResNet-18 on ImageNet, and the distilled encoder is then jointly fine-tuned with a diffusion policy head on the target tasks. Result: On 34 simulated benchmarks and 5 real-world tasks, X-Distill outperforms a from-scratch ResNet and a fine-tuned DINOv2 encoder, and also surpasses 3D encoders that use point clouds as well as much larger vision-language models. Conclusion: A simple distillation strategy can combine the strengths of ViTs and CNNs and reach state-of-the-art robotic manipulation performance under data-efficient conditions. Abstract: Visuomotor policies often leverage large pre-trained Vision Transformers (ViTs) for their powerful generalization capabilities. However, their significant data requirements present a major challenge in the data-scarce context of most robotic learning settings, where compact CNNs with strong inductive biases can be more easily optimized. To address this trade-off, we introduce X-Distill, a simple yet highly effective method that synergizes the strengths of both architectures. Our approach involves an offline, cross-architecture knowledge distillation, transferring the rich visual representations of a large, frozen DINOv2 teacher to a compact ResNet-18 student on the general-purpose ImageNet dataset. This distilled encoder, now endowed with powerful visual priors, is then jointly fine-tuned with a diffusion policy head on the target manipulation tasks. Extensive experiments on 34 simulated benchmarks and 5 challenging real-world tasks demonstrate that our method consistently outperforms policies equipped with from-scratch ResNet or fine-tuned DINOv2 encoders. Notably, X-Distill also surpasses 3D encoders that utilize privileged point cloud observations or much larger Vision-Language Models. Our work highlights the efficacy of a simple, well-founded distillation strategy for achieving state-of-the-art performance in data-efficient robotic manipulation.
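
Feature-level distillation from a frozen teacher can be sketched in a few lines. The abstract does not state the loss or the projection used, so everything here is an assumption: a mean-squared-error objective, a linear projection `proj` from student to teacher feature space, and random arrays standing in for encoder outputs. Only the projection is updated in this toy; a real setup would also train the student backbone.

```python
import numpy as np

def distill_loss(student_feat, teacher_feat, proj):
    # MSE between projected student features and frozen teacher features (assumed objective).
    pred = student_feat @ proj
    return float(np.mean((pred - teacher_feat) ** 2))

def distill_step(student_feat, teacher_feat, proj, lr=0.1):
    # One gradient-descent step on the projection matrix only.
    diff = student_feat @ proj - teacher_feat
    grad = 2.0 * student_feat.T @ diff / diff.size  # d(mean squared error)/d(proj)
    return proj - lr * grad

rng = np.random.default_rng(0)
student = rng.normal(size=(16, 8))    # stand-in for student (CNN) features
teacher = rng.normal(size=(16, 12))   # stand-in for frozen teacher (ViT) features
proj = np.zeros((8, 12))

loss_before = distill_loss(student, teacher, proj)
proj = distill_step(student, teacher, proj)
loss_after = distill_loss(student, teacher, proj)
```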

[75] Efficient On-Board Processing of Oblique UAV Video for Rapid Flood Extent Mapping

Vishisht Sharma,Sam Leroux,Lisa Landuyt,Nick Witvrouwen,Pieter Simoens

Main category: cs.CV

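TL;DR: Proposes Temporal Token Reuse (TTR), an adaptive inference framework that exploits spatiotemporal redundancy in oblique aerial video to accelerate video segmentation on embedded devices, substantially reducing inference latency with almost no loss in accuracy.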

Details Motivation: Because of the strict size, weight, and power (SWaP) limits of UAVs, on-board processing of high-resolution oblique aerial video is computationally bottlenecked, making real-time, low-latency inference difficult. Method: TTR represents image patches as tokens, uses a lightweight similarity metric to identify static regions dynamically, and propagates their precomputed deep features, skipping redundant backbone computation. Result: Validation on standard benchmarks and a newly curated oblique floodwater dataset shows that TTR reduces inference latency by 30% on edge devices with minimal loss in segmentation accuracy (mIoU drop below 0.5%). Conclusion: TTR shifts the operational Pareto frontier, enabling high-fidelity, real-time oblique video understanding for time-critical remote sensing tasks. Abstract: Effective disaster response depends on speed, and oblique aerial video is the primary modality for initial scouting due to its ability to maximize spatial coverage and situational awareness in limited flight time. However, the on-board processing of high-resolution oblique streams is severely bottlenecked by the strict Size, Weight, and Power (SWaP) constraints of Unmanned Aerial Vehicles (UAVs). The computational density required to process these wide-field-of-view streams precludes low-latency inference on standard edge hardware. To address this, we propose Temporal Token Reuse (TTR), an adaptive inference framework capable of accelerating video segmentation on embedded devices. TTR exploits the intrinsic spatiotemporal redundancy of aerial video by formulating image patches as tokens; it utilizes a lightweight similarity metric to dynamically identify static regions and propagate their precomputed deep features, thereby bypassing redundant backbone computations. We validate the framework on standard benchmarks and a newly curated Oblique Floodwater Dataset designed for hydrological monitoring. Experimental results on edge-grade hardware demonstrate that TTR achieves a 30% reduction in inference latency with negligible degradation in segmentation accuracy (< 0.5% mIoU). These findings confirm that TTR effectively shifts the operational Pareto frontier, enabling high-fidelity, real-time oblique video understanding for time-critical remote sensing missions.
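
The token-reuse idea (compare each patch with the previous frame, recompute features only where the patch changed) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the similarity metric is a mean absolute pixel difference, the threshold `tol` is arbitrary, and `toy_backbone` is a hypothetical per-token function, whereas a real backbone mixes information across tokens.

```python
import numpy as np

def split_patches(frame, patch):
    # Split an (H, W) frame into non-overlapping patch tokens, row-major over the patch grid.
    h, w = frame.shape
    grid = frame.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)
    return grid.reshape(-1, patch * patch)

def toy_backbone(tokens):
    # Hypothetical stand-in for an expensive feature extractor: one feature per token.
    return tokens.mean(axis=1, keepdims=True)

def reuse_tokens(prev_frame, cur_frame, prev_feats, backbone, patch=4, tol=0.01):
    prev_p = split_patches(prev_frame, patch)
    cur_p = split_patches(cur_frame, patch)
    change = np.mean(np.abs(cur_p - prev_p), axis=1)  # cheap per-token change score
    dynamic = change > tol                            # tokens that must be recomputed
    feats = prev_feats.copy()                         # static tokens keep cached features
    if dynamic.any():
        feats[dynamic] = backbone(cur_p[dynamic])
    return feats, dynamic

prev = np.zeros((8, 8))
cur = prev.copy()
cur[0:4, 4:8] += 1.0  # only the top-right patch changes
cached = toy_backbone(split_patches(prev, 4))
feats, dynamic = reuse_tokens(prev, cur, cached, toy_backbone)
```

Here only one of the four tokens is passed through the backbone; the saving grows with the fraction of the frame that stays static.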

[76] SAMannot: A Memory-Efficient, Local, Open-source Framework for Interactive Video Instance Segmentation based on SAM2

Gergely Dinya,András Gelencsér,Krisztina Kupán,Clemens Küpper,Kristóf Karacs,Anna Gelencsér-Horváth

Main category: cs.CV

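TL;DR: SAMannot is an open-source, local video instance segmentation framework that combines Segment Anything Model 2 (SAM2) with a human-in-the-loop workflow to support efficient, private, and low-cost high-precision video annotation.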

Details Motivation: Existing video segmentation workflows are limited by time-consuming manual annotation, costly commercial platforms, or the privacy risks of cloud services, so the research demand for high-fidelity video instance segmentation is hard to meet. Method: Develops the SAMannot framework, which integrates SAM2 with a modified dependency and adds a processing layer to reduce computational overhead; it provides persistent instance identity management, a "lock-and-refine" workflow with barrier frames, and an auto-prompting mechanism based on mask skeletonization. Result: Enables efficient local video annotation with output in YOLO and PNG formats plus interaction logs; its effectiveness is verified on animal behavior tracking and on subsets of the LVOS and DAVIS datasets. Conclusion: SAMannot offers a scalable, private, and cost-effective alternative to commercial platforms for complex video annotation tasks. Abstract: Current research workflows for precise video segmentation are often forced into a compromise between labor-intensive manual curation, costly commercial platforms, and/or privacy-compromising cloud-based services. The demand for high-fidelity video instance segmentation in research is often hindered by the bottleneck of manual annotation and the privacy concerns of cloud-based tools. We present SAMannot, an open-source, local framework that integrates the Segment Anything Model 2 (SAM2) into a human-in-the-loop workflow. To address the high resource requirements of foundation models, we modified the SAM2 dependency and implemented a processing layer that minimizes computational overhead and maximizes throughput, ensuring a highly responsive user interface. Key features include persistent instance identity management, an automated "lock-and-refine" workflow with barrier frames, and a mask-skeletonization-based auto-prompting mechanism. SAMannot facilitates the generation of research-ready datasets in YOLO and PNG formats alongside structured interaction logs. Verified through animal behavior tracking use-cases and subsets of the LVOS and DAVIS benchmark datasets, the tool provides a scalable, private, and cost-effective alternative to commercial platforms for complex video annotation tasks.

[77] Context-Aware Semantic Segmentation via Stage-Wise Attention

Antoine Carreaud,Elias Naha,Arthur Chansel,Nina Lahellec,Jan Skaloud,Adrien Gressin

Main category: cs.CV

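TL;DR: Proposes CASWiT, a dual-branch Swin Transformer architecture for semantic segmentation of ultra-high-resolution remote sensing images that adds global context and a cross-scale fusion module, outperforming existing methods on large-scale datasets.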

Details Motivation: Transformer models are memory-hungry on ultra-high-resolution images, which limits either the contextual scope or the spatial resolution, so a method is needed that combines long-range dependencies with detailed feature extraction. Method: Proposes CASWiT, with a context encoder that processes a downsampled neighborhood and a high-resolution encoder that processes high-resolution patches; a cross-scale fusion module combines them through cross-attention and gated feature injection. A SimMIM-style pretraining strategy masks 75% of the high-resolution tokens and the corresponding low-resolution center region and reconstructs the original image. Result: 65.83% mIoU on IGN FLAIR-HUB, 1.78 points above RGB baselines; 49.1% mIoU on URUR, 0.9% above the current SoTA. Conclusion: CASWiT addresses the balance between context and detail in ultra-high-resolution segmentation and, together with the pretraining strategy, achieves leading results on several remote sensing datasets. Abstract: Semantic segmentation of ultra-high-resolution (UHR) images is essential in remote sensing applications such as aerial mapping and environmental monitoring. Transformer-based models struggle in this setting because memory grows quadratically with token count, constraining either the contextual scope or the spatial resolution. We introduce CASWiT (Context-Aware Stage-Wise Transformer), a dual-branch, Swin-based architecture that injects global cues into fine-grained UHR features. A context encoder processes a downsampled neighborhood to capture long-range dependencies, while a high-resolution encoder extracts detailed features from UHR patches. A cross-scale fusion module, combining cross-attention and gated feature injection, enriches high-resolution tokens with context. Beyond architecture, we propose a SimMIM-style pretraining. We mask 75% of the high-resolution image tokens and the low-resolution center region that spatially corresponds to the UHR patch, then train the shared dual-encoder with a small decoder to reconstruct the initial UHR image. Extensive experiments on the large-scale IGN FLAIR-HUB aerial dataset demonstrate the effectiveness of CASWiT. Our method achieves 65.83% mIoU, outperforming RGB baselines by 1.78 points. On URUR, CASWiT achieves 49.1% mIoU, surpassing the current SoTA by +0.9% under the official evaluation protocol. All code is available at: https://huggingface.co/collections/heig-vd-geo/caswit.
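
Cross-attention followed by gated feature injection can be written compactly. The sketch below is a simplified reading of that description, not CASWiT's module: it omits the learned query/key/value projections, uses a single head, and treats the gate as one scalar logit (`gate_logit`), all of which are assumptions.

```python
import numpy as np

def cross_attention(hr_tokens, ctx_tokens):
    # High-resolution tokens act as queries; context tokens act as keys and values.
    d = hr_tokens.shape[-1]
    scores = hr_tokens @ ctx_tokens.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ ctx_tokens

def gated_injection(hr_tokens, ctx_tokens, gate_logit):
    # A sigmoid gate controls how much context is added to each high-resolution token.
    gate = 1.0 / (1.0 + np.exp(-gate_logit))
    return hr_tokens + gate * cross_attention(hr_tokens, ctx_tokens)

rng = np.random.default_rng(0)
hr = rng.normal(size=(6, 4))    # 6 high-resolution tokens, feature dim 4
ctx = rng.normal(size=(3, 4))   # 3 context tokens from the downsampled view
fused = gated_injection(hr, ctx, gate_logit=0.0)  # gate = 0.5
```

With a strongly negative gate logit the module reduces to the identity on the high-resolution tokens, which is one reason gated residual designs are convenient to add to a pretrained branch.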

[78] Enhancing Vision Language Models with Logic Reasoning for Situational Awareness

Pavana Pradeep,Krishna Kant,Suya Yu

Main category: cs.CV

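TL;DR: Proposes combining vision-language models (VLMs) with traditional computer vision methods and explicit logic reasoning to strengthen situational awareness, improving fine-grained event detail extraction, accuracy, and the generation of justifications for VLM outputs.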

Details Motivation: Situational awareness applications need to identify infrequent but significant events with high reliability and accuracy, extract fine-grained details, and assess recognition quality, and existing VLMs fall short in these respects. Method: Integrates VLMs with traditional computer vision methods through explicit logic reasoning, and designs an intelligent fine-tuning strategy that also yields justifications for VLM outputs during inference. Result: The proposed intelligent fine-tuning mechanism substantially improves accuracy over uninformed selection and can, during inference, either confirm the validity of a VLM output or indicate why it may be questionable. Conclusion: By fusing VLMs with traditional methods and adding logic reasoning and intelligent fine-tuning, the approach improves accuracy, interpretability, and fine-grained information extraction for situational awareness. Abstract: Vision-Language Models (VLMs) offer the ability to generate high-level, interpretable descriptions of complex activities from images and videos, making them valuable for situational awareness (SA) applications. In such settings, the focus is on identifying infrequent but significant events with high reliability and accuracy, while also extracting fine-grained details and assessing recognition quality. In this paper, we propose an approach that integrates VLMs with traditional computer vision methods through explicit logic reasoning to enhance SA in three key ways: (a) extracting fine-grained event details, (b) employing an intelligent fine-tuning (FT) strategy that achieves substantially higher accuracy than uninformed selection, and (c) generating justifications for VLM outputs during inference. We demonstrate that our intelligent FT mechanism improves the accuracy and provides a valuable means, during inference, to either confirm the validity of the VLM output or indicate why it may be questionable.

[79] Beer-Lambert Autoencoder for Unsupervised Stain Representation Learning and Deconvolution in Multi-immunohistochemical Brightfield Histology Images

Mark Eastwood,Thomas McKee,Zedong Hu,Sabine Tejpar,Fayyaz Minhas

Main category: cs.CV

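TL;DR: Proposes a data-driven encoder-decoder architecture for stain separation in multiplex immunohistochemistry (mIHC) images that separates the concentration maps of five stains more cleanly than traditional methods.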

Details Motivation: Classical Beer-Lambert deconvolution becomes under-determined and unstable with more than three stains, making it hard to separate stain contributions in multiplex IHC images. Method: Uses a compact U-Net encoder to predict K nonnegative concentration channels, with a differentiable Beer-Lambert forward model and a learnable stain matrix as the decoder; training is unsupervised with a perceptual reconstruction loss. Result: On a colorectal mIHC dataset with 5 stains, the method achieves excellent RGB reconstruction and significantly reduces inter-channel bleed-through. Conclusion: The method separates multiple stains in mIHC images cleanly, outperforms matrix-based deconvolution, and is suited to quantitative analysis and assessment of marker expression. Abstract: Separating the contributions of individual chromogenic stains in RGB histology whole slide images (WSIs) is essential for stain normalization, quantitative assessment of marker expression, and cell-level readouts in immunohistochemistry (IHC). Classical Beer-Lambert (BL) color deconvolution is well-established for two- or three-stain settings, but becomes under-determined and unstable for multiplex IHC (mIHC) with K>3 chromogens. We present a simple, data-driven encoder-decoder architecture that learns cohort-specific stain characteristics for mIHC RGB WSIs and yields crisp, well-separated per-stain concentration maps. The encoder is a compact U-Net that predicts K nonnegative concentration channels; the decoder is a differentiable BL forward model with a learnable stain matrix initialized from typical chromogen hues. Training is unsupervised with a perceptual reconstruction objective augmented by loss terms that discourage unnecessary stain mixing. On a colorectal mIHC panel comprising 5 stains (H, CDX2, MUC2, MUC5, CD8) we show excellent RGB reconstruction, and significantly reduced inter-channel bleed-through compared with matrix-based deconvolution. Code and model are available at https://github.com/measty/StainQuant.git.
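
The Beer-Lambert forward model and the reason classical deconvolution breaks down for K>3 can be shown numerically. In the sketch below the stain matrices are illustrative values chosen for the example (not the paper's learned matrix), and the background intensity `i0=255` is an assumption. With K=3 the optical-density system is square and invertible; with K=5 the stain matrix has rank at most 3, so RGB alone cannot determine five concentrations.

```python
import numpy as np

def bl_forward(conc, stain_matrix, i0=255.0):
    # Beer-Lambert: optical density is linear in stain concentration,
    # and transmitted intensity decays exponentially with optical density.
    od = conc @ stain_matrix              # (pixels, K) @ (K, 3) -> (pixels, 3)
    return i0 * np.exp(-od)

def bl_inverse(rgb, stain_matrix, i0=255.0, eps=1e-6):
    # Classical matrix-based deconvolution: least-squares solve in optical-density space.
    od = -np.log(np.maximum(rgb, eps) / i0)
    conc_t, *_ = np.linalg.lstsq(stain_matrix.T, od.T, rcond=None)
    return conc_t.T

# Illustrative 3-stain matrix (rows: stains, columns: R, G, B optical density).
stains3 = np.array([[0.65, 0.70, 0.29],
                    [0.07, 0.99, 0.11],
                    [0.27, 0.57, 0.78]])
conc = np.array([[0.5, 0.2, 0.1],
                 [0.0, 1.0, 0.3]])
recovered = bl_inverse(bl_forward(conc, stains3), stains3)  # matches conc for K=3

# With five stains the matrix is 5x3, so its rank cannot exceed 3.
stains5 = np.random.default_rng(0).uniform(0.05, 1.0, size=(5, 3))
rank5 = np.linalg.matrix_rank(stains5)
```

This is the gap a learned encoder is meant to fill: it brings spatial and cohort-level priors that the per-pixel linear system lacks.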

[80] Assessing Building Heat Resilience Using UAV and Street-View Imagery with Coupled Global Context Vision Transformer

Steffen Knoblauch,Ram Kumar Muthusamy,Hao Li,Iddy Chazua,Benedcto Adamu,Innocent Maholi,Alexander Zipf

Main category: cs.CV

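TL;DR: Proposes a machine learning framework (CGCViT) that fuses UAV and street-view imagery to assess heat-relevant building attributes, and uses HotSat-1 thermal infrared data to relate building materials to heat exposure risk, revealing household-level inequalities in heat exposure in Dar es Salaam.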

Details Motivation: Urban centers, especially in the Global South, face growing human heat exposure due to climate change and dense construction, yet scalable methods for assessing heat-relevant building attributes are lacking. Method: Proposes a coupled global context vision transformer (CGCViT) that fuses openly available UAV and street-view (SV) imagery through dual-modality cross-view learning, and uses thermal infrared (TIR) measurements from HotSat-1 to link building attributes to heat risk. Result: The dual-modality approach outperforms the best single-modality models by up to 9.3%; vegetation around buildings, brighter roofing, and roofing made of concrete, wood, or clay (rather than metal or tarpaulin) are significantly associated with lower TIR values; deployment in Dar es Salaam, Tanzania reveals household-level differences in heat exposure tied to socio-economic inequality in building materials. Conclusion: Localized, data-driven risk assessment plays a key role in shaping equitable climate adaptation strategies, and machine learning can help identify and address heat exposure inequalities linked to socio-economic disadvantage. Abstract: Climate change is intensifying human heat exposure, particularly in densely built urban centers of the Global South. Low-cost construction materials and high thermal-mass surfaces further exacerbate this risk. Yet scalable methods for assessing such heat-relevant building attributes remain scarce. We propose a machine learning framework that fuses openly available unmanned aerial vehicle (UAV) and street-view (SV) imagery via a coupled global context vision transformer (CGCViT) to learn heat-relevant representations of urban structures. Thermal infrared (TIR) measurements from HotSat-1 are used to quantify the relationship between building attributes and heat-associated health risks. Our dual-modality cross-view learning approach outperforms the best single-modality models by up to 9.3%, demonstrating that UAV and SV imagery provide valuable complementary perspectives on urban structures. The presence of vegetation surrounding buildings (versus no vegetation), brighter roofing (versus darker roofing), and roofing made of concrete, clay, or wood (versus metal or tarpaulin) are all significantly associated with lower HotSat-1 TIR values. Deployed across the city of Dar es Salaam, Tanzania, the proposed framework illustrates how household-level inequalities in heat exposure - often linked to socio-economic disadvantage and reflected in building materials - can be identified and addressed using machine learning. Our results point to the critical role of localized, data-driven risk assessment in shaping climate adaptation strategies that deliver equitable outcomes.

[81] Think-Clip-Sample: Slow-Fast Frame Selection for Video Understanding

Wenhui Tan,Ruihua Song,Jiaze Li,Jianzhong Ju,Zhenbo Luo

Main category: cs.CV

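TL;DR: Proposes TCS, a training-free framework that uses multi-query reasoning and clip-level slow-fast sampling to improve multi-modal large language models on long video understanding, substantially raising accuracy while lowering inference cost.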

Details Motivation: Existing MLLMs are limited on long videos by computational constraints and weak frame selection strategies, and struggle to cover both local detail and global context. Method: Designs the Think-Clip-Sample (TCS) framework, which uses multi-query reasoning to generate several complementary queries and a clip-level slow-fast sampling strategy to balance dense local detail against sparse global context adaptively. Result: Experiments on MLVU, LongVideoBench, and VideoMME show that TCS improves different MLLMs by up to 6.9% and can reach comparable accuracy with 50% of the inference time. Conclusion: TCS is an efficient and effective approach to long video understanding that improves multi-modal large language models without any additional training. Abstract: Recent progress in multi-modal large language models (MLLMs) has significantly advanced video understanding. However, their performance on long-form videos remains limited by computational constraints and suboptimal frame selection. We present Think-Clip-Sample (TCS), a training-free framework that enhances long video understanding through two key components: (i) Multi-Query Reasoning, which generates multiple queries to capture complementary aspects of the question and video; and (ii) Clip-level Slow-Fast Sampling, which adaptively balances dense local details and sparse global context. Extensive experiments on MLVU, LongVideoBench, and VideoMME demonstrate that TCS consistently improves performance across different MLLMs, boosting accuracy by up to 6.9%, and can achieve comparable accuracy with 50% less inference time, highlighting both the efficiency and the efficacy of TCS on long video understanding.
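
One simple way to balance dense local detail against sparse global context under a fixed frame budget is sketched below. It is an illustration only: how TCS selects clips and splits its budget is not given in the abstract, so the `clips` input, the `dense_fraction` parameter, and the even per-clip split are all assumptions.

```python
def evenly_spaced(start, end, n):
    # n frame indices spread evenly over [start, end], inclusive.
    if n <= 0:
        return []
    if n == 1:
        return [(start + end) // 2]
    step = (end - start) / (n - 1)
    return [int(round(start + i * step)) for i in range(n)]

def slow_fast_sample(num_frames, clips, budget, dense_fraction=0.75):
    # Spend most of the budget densely inside relevant clips,
    # and the remainder sparsely across the whole video.
    n_dense = int(budget * dense_fraction)
    n_sparse = budget - n_dense
    picked = set(evenly_spaced(0, num_frames - 1, n_sparse))  # sparse global context
    per_clip = n_dense // max(len(clips), 1)
    for start, end in clips:
        picked.update(evenly_spaced(start, end, per_clip))    # dense local detail
    return sorted(picked)

# Hypothetical example: a 1000-frame video with two clips judged relevant to the question.
frames = slow_fast_sample(1000, [(100, 130), (600, 640)], budget=16)
```

Because duplicates are merged, the function may return fewer frames than the budget when a sparse index happens to fall inside a clip.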

[82] Heterogeneous Uncertainty-Guided Composed Image Retrieval with Fine-Grained Probabilistic Learning

Haomiao Tang,Jinpeng Wang,Minyi Zhao,Guanghao Meng,Ruisheng Luo,Long Chen,Shu-Tao Xia

Main category: cs.CV

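TL;DR: Proposes a Heterogeneous Uncertainty-Guided (HUG) paradigm that improves the robustness of composed image retrieval through fine-grained probabilistic learning and customized uncertainty estimation.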

Details Motivation: Intrinsic noise in CIR triplets creates uncertainty that threatens model robustness; existing probabilistic learning methods fall short because they model instances holistically and treat queries and targets homogeneously. Method: Designs the HUG paradigm, which represents queries and targets with Gaussian embeddings, customizes heterogeneous uncertainty estimation for multi-modal queries and uni-modal targets, and introduces a dynamic weighting mechanism and uncertainty-guided objectives. Result: Experiments on benchmark datasets show that HUG outperforms state-of-the-art baselines and has stronger discriminative learning ability. Conclusion: HUG improves the robustness and performance of CIR models under noise and confirms the value of fine-grained uncertainty modeling. Abstract: Composed Image Retrieval (CIR) enables image search by combining a reference image with modification text. Intrinsic noise in CIR triplets incurs intrinsic uncertainty and threatens the model's robustness. Probabilistic learning approaches have shown promise in addressing such issues; however, they fall short for CIR due to their instance-level holistic modeling and homogeneous treatment of queries and targets. This paper introduces a Heterogeneous Uncertainty-Guided (HUG) paradigm to overcome these limitations. HUG utilizes a fine-grained probabilistic learning framework, where queries and targets are represented by Gaussian embeddings that capture detailed concepts and uncertainties. We customize heterogeneous uncertainty estimations for multi-modal queries and uni-modal targets. Given a query, we capture uncertainties not only regarding uni-modal content quality but also multi-modal coordination, followed by a provable dynamic weighting mechanism to derive comprehensive query uncertainty. We further design uncertainty-guided objectives, including query-target holistic contrast and fine-grained contrasts with comprehensive negative sampling strategies, which effectively enhance discriminative learning. Experiments on benchmarks demonstrate HUG's effectiveness beyond state-of-the-art baselines, with faithful analysis justifying the technical contributions.

[83] SUG-Occ: An Explicit Semantics and Uncertainty Guided Sparse Learning Framework for Real-Time 3D Occupancy Prediction

Hanlin Wu,Pengfei Lin,Ehsan Javanmardi,Nanren Bao,Bo Qian,Hao Si,Manabu Tsukada

Main category: cs.CV

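TL;DR: Proposes SUG-Occ, a semantics- and uncertainty-guided sparse learning framework for efficient 3D semantic occupancy prediction that balances accuracy and computational efficiency.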

Details Motivation: 3D semantic occupancy prediction enables fine-grained scene understanding, but its computation and memory overhead make real-time deployment difficult; exploiting the sparsity of 3D scenes can address this. Method: Proposes the SUG-Occ framework: (1) semantic and uncertainty priors suppress projections from free space, and an explicit unsigned distance encoding strengthens geometric consistency; (2) a cascade sparse completion module combines hyper cross sparse convolution with generative upsampling for coarse-to-fine reasoning; (3) an OCR-based mask decoder aggregates global semantics through lightweight query-context interactions, avoiding costly attention over volumetric features. Result: Experiments on SemanticKITTI show a 7.34% improvement in accuracy and a 57.8% gain in efficiency over the baselines. Conclusion: By explicitly exploiting semantics, uncertainty, and sparsity, SUG-Occ improves efficiency substantially while keeping geometric and semantic completeness, making it suitable for real-time autonomous driving. Abstract: As autonomous driving moves toward full scene understanding, 3D semantic occupancy prediction has emerged as a crucial perception task, offering voxel-level semantics beyond traditional detection and segmentation paradigms. However, such a refined representation for scene understanding incurs prohibitive computation and memory overhead, posing a major barrier to practical real-time deployment. To address this, we propose SUG-Occ, an explicit Semantics and Uncertainty Guided Sparse Learning Enabled 3D Occupancy Prediction Framework, which exploits the inherent sparsity of 3D scenes to reduce redundant computation while maintaining geometric and semantic completeness. Specifically, we first utilize semantic and uncertainty priors to suppress projections from free space during view transformation while employing an explicit unsigned distance encoding to enhance geometric consistency, producing a structurally consistent sparse 3D representation. Second, we design a cascade sparse completion module via hyper cross sparse convolution and generative upsampling to enable efficient coarse-to-fine reasoning. Finally, we devise an object contextual representation (OCR) based mask decoder that aggregates global semantic context from sparse features and refines voxel-wise predictions via lightweight query-context interactions, avoiding expensive attention operations over volumetric features. Extensive experiments on the SemanticKITTI benchmark demonstrate that the proposed approach outperforms the baselines, achieving a 7.34% improvement in accuracy and a 57.8% gain in efficiency.

[84] Wetland mapping from sparse annotations with satellite image time series and temporal-aware segment anything model

Shuai Yuan,Tianwu Lin,Shuang Chen,Yu Xia,Peng Qin,Xiangyu Liu,Xiaoqing Xu,Nan Xu,Hongsheng Zhang,Jie Wang,Peng Gong

Main category: cs.CV

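TL;DR: Proposes WetSAM, a SAM-based dual-branch framework that uses satellite image time series and sparse point labels for high-accuracy wetland mapping, substantially outperforming existing methods.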

Details Motivation: Existing deep learning models perform poorly with sparse point labels, single-date imagery cannot cope with seasonal and inter-annual wetland dynamics, and foundation models such as SAM cannot model temporal information, which leads to fragmented segmentation. Method: Designs a dual-branch structure: a temporally prompted branch extracts wetland characteristics and disentangles phenological variability through hierarchical adapters and dynamic temporal aggregation; a spatial branch generates reliable pseudo-labels with a temporally constrained region-growing strategy; a bidirectional consistency regularization jointly optimizes both branches. Result: Experiments across eight global regions (about 5,000 km² each) show that WetSAM reaches an average F1-score of 85.58%, substantially outperforming existing methods and producing accurate, structurally consistent wetland segmentation. Conclusion: WetSAM combines temporal information with sparse supervision to deliver scalable, high-resolution wetland mapping at low annotation cost, with strong generalization and application potential. Abstract: Accurate wetland mapping is essential for ecosystem monitoring, yet dense pixel-level annotation is prohibitively expensive and practical applications usually rely on sparse point labels, under which existing deep learning models perform poorly, while strong seasonal and inter-annual wetland dynamics further render single-date imagery inadequate and lead to significant mapping errors; although foundation models such as SAM show promising generalization from point prompts, they are inherently designed for static images and fail to model temporal information, resulting in fragmented masks in heterogeneous wetlands. To overcome these limitations, we propose WetSAM, a SAM-based framework that integrates satellite image time series for wetland mapping from sparse point supervision through a dual-branch design, where a temporally prompted branch extends SAM with hierarchical adapters and dynamic temporal aggregation to disentangle wetland characteristics from phenological variability, and a spatial branch employs a temporally constrained region-growing strategy to generate reliable dense pseudo-labels, while a bidirectional consistency regularization jointly optimizes both branches. Extensive experiments across eight global regions of approximately 5,000 km² each demonstrate that WetSAM substantially outperforms state-of-the-art methods, achieving an average F1-score of 85.58%, and delivering accurate and structurally consistent wetland segmentation with minimal labeling effort, highlighting its strong generalization capability and potential for scalable, low-cost, high-resolution wetland mapping.

[85] SME-YOLO: A Real-Time Detector for Tiny Defect Detection on PCB Surfaces

Meng Han

Main category: cs.CV

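TL;DR: Proposes SME-YOLO, a small-target multi-scale enhanced detection framework based on YOLOv11n for PCB surface defect detection; by introducing NWDLoss, the EUCB module, and the MSFA module, it outperforms existing methods on the PKU-PCB dataset.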

Details Motivation: PCB defects are typically tiny, similar in texture, and unevenly distributed, so conventional detectors struggle to reach high precision, particularly when localizing and identifying tiny defects. Method: Proposes the SME-YOLO framework: (1) a Normalized Wasserstein Distance loss (NWDLoss) reduces the sensitivity of IoU to positional deviations of tiny objects; (2) an Efficient Upsampling Convolution Block (EUCB) replaces the original upsampling module, using multi-scale convolutions to recover spatial resolution and preserve edge and texture detail; (3) a Multi-Scale Focused Attention (MSFA) module adaptively strengthens perception within key scale intervals and fuses local fine-grained features with global context. Result: On the PKU-PCB dataset, SME-YOLO improves mAP by 2.2% and Precision by 4% over the YOLOv11n baseline, reaching state-of-the-art performance. Conclusion: Through its loss function, upsampling structure, and attention design, SME-YOLO substantially improves detection of tiny PCB defects, supporting its use in complex industrial inspection. Abstract: Surface defects on Printed Circuit Boards (PCBs) directly compromise product reliability and safety. However, achieving high-precision detection is challenging because PCB defects are typically characterized by tiny sizes, high texture similarity, and uneven scale distributions. To address these challenges, this paper proposes a novel framework based on YOLOv11n, named SME-YOLO (Small-target Multi-scale Enhanced YOLO). First, we employ the Normalized Wasserstein Distance Loss (NWDLoss). This metric effectively mitigates the sensitivity of Intersection over Union (IoU) to positional deviations in tiny objects. Second, the original upsampling module is replaced by the Efficient Upsampling Convolution Block (EUCB). By utilizing multi-scale convolutions, the EUCB gradually recovers spatial resolution and enhances the preservation of edge and texture details for tiny defects. Finally, this paper proposes the Multi-Scale Focused Attention (MSFA) module. Tailored to the specific spatial distribution of PCB defects, this module adaptively strengthens perception within key scale intervals, achieving efficient fusion of local fine-grained features and global context information. Experimental results on the PKU-PCB dataset demonstrate that SME-YOLO achieves state-of-the-art performance. Specifically, compared to the baseline YOLOv11n, SME-YOLO improves mAP by 2.2% and Precision by 4%, validating the effectiveness of the proposed method.
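
Why a Wasserstein-based measure helps with tiny boxes can be seen by comparing it with IoU on a small shifted box. The sketch below follows the commonly used Normalized Wasserstein Distance formulation, in which a box (cx, cy, w, h) is modeled as a 2D Gaussian with mean (cx, cy) and covariance diag(w²/4, h²/4); the constant `c` is dataset-dependent and the value used here is an arbitrary choice for illustration, not one taken from SME-YOLO.

```python
import math

def iou(a, b):
    # Boxes are (cx, cy, w, h).
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nwd(a, b, c=12.8):
    # Squared 2-Wasserstein distance between the two box Gaussians,
    # then an exponential normalization into (0, 1].
    w2_sq = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
             + ((a[2] - b[2]) / 2) ** 2 + ((a[3] - b[3]) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)

def nwd_loss(a, b, c=12.8):
    return 1.0 - nwd(a, b, c)

tiny = (10.0, 10.0, 4.0, 4.0)
shifted = (14.0, 10.0, 4.0, 4.0)  # same size, moved 4 px to the right
overlap = iou(tiny, shifted)      # boxes only touch, so IoU gives no signal
similarity = nwd(tiny, shifted)   # still decreases smoothly with distance
```

For a 4-pixel box, a 4-pixel shift already drives IoU to zero, so an IoU-based loss cannot rank this prediction against one that is 40 pixels off, whereas the Wasserstein-based similarity can.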

[86] Topology-Guaranteed Image Segmentation: Enforcing Connectivity, Genus, and Width Constraints

Wenxiao Li,Xue-Cheng Tai,Jun Liu

Main category: cs.CV

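TL;DR: Proposes a new mathematical framework that incorporates width information into the description of topological structures, combining persistent homology with PDE smoothing for variational image segmentation and neural network design, and preserving both topological invariants and width features in the segmentation.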

Details Motivation: Traditional topological methods such as persistent homology do not capture the width information of image structures (for example thickness and length), so they cannot meet the combined topological and geometric requirements of practical segmentation. Method: Proposes a new topological framework that incorporates width information, using persistent homology together with smoothing concepts from partial differential equations to modify the local extrema of upper-level sets; the framework is embedded in variational segmentation models and neural networks, with structure preserved through constraints on topological energies. Result: The method preserves topological invariants such as connectivity and genus while retaining width attributes such as line thickness and length; numerical experiments confirm its topological fidelity and its modeling of geometric properties. Conclusion: The framework brings width information into topological description, improves the modeling of complex structures in image segmentation, and offers a form of topological regularization with both topological and geometric meaning. Abstract: Existing research highlights the crucial role of topological priors in image segmentation, particularly in preserving essential structures such as connectivity and genus. Accurately capturing these topological features often requires incorporating width-related information, including the thickness and length inherent to the image structures. However, traditional mathematical definitions of topological structures lack this dimensional width information, limiting methods like persistent homology from fully addressing practical segmentation needs. To overcome this limitation, we propose a novel mathematical framework that explicitly integrates width information into the characterization of topological structures. This method leverages persistent homology, complemented by smoothing concepts from partial differential equations (PDEs), to modify local extrema of upper-level sets. This approach enables the resulting topological structures to inherently capture width properties. We incorporate this enhanced topological description into variational image segmentation models. With suitable loss functions, we are also able to design neural networks that can segment images with the required topological and width properties. Through variational constraints on the relevant topological energies, our approach successfully preserves essential topological invariants such as connectivity and genus counts, simultaneously ensuring that segmented structures retain critical width attributes, including line thickness and length. Numerical experiments demonstrate the effectiveness of our method, showcasing its capability to maintain topological fidelity while explicitly embedding width characteristics into segmented image structures.

[87] PubMed-OCR: PMC Open Access OCR Annotations

Hunter Heidenreich,Yosheb Getachew,Olivia Dinica,Ben Elliott

Main category: cs.CV

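TL;DR: PubMed-OCR is an OCR-centric corpus of scientific articles built from PubMed Central Open Access PDFs. It contains 209.5K articles, supports layout-aware modeling and coordinate-grounded QA, and is released in a compact JSON format.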

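Details Motivation: Layout-aware analysis of scientific literature, coordinate-grounded question answering, and evaluation of OCR-dependent pipelines all need a large-scale, finely annotated OCR corpus. Method: Page images are extracted from PubMed Central Open Access PDFs and annotated with Google Cloud Vision, producing JSON data with word-, line-, and paragraph-level bounding boxes; the corpus characteristics are then analyzed statistically. Result: The corpus covers 209.5K articles, 1.5M pages, and about 1.3B words with detailed layout annotations. The authors analyze journal coverage and layout features and note limitations, including reliance on a single OCR engine and heuristic line reconstruction. Conclusion: PubMed-OCR is a resource for layout-aware models, coordinate-grounded QA systems, and OCR pipeline evaluation; the data and schema are released publicly, and extensions are invited. Abstract: PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions.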

[88] Map2Thought: Explicit 3D Spatial Reasoning via Metric Cognitive Maps

Xiangjun Gao,Zhensong Zhang,Dave Zhenyu Chen,Songcen Xu,Long Quan,Eduardo Pérez-Pellitero,Youngkyoon Jang

Main category: cs.CV

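TL;DR: Proposes Map2Thought, a framework that uses Metric-CogMap and Cog-CoT to give 3D vision-language models explicit, interpretable spatial reasoning, and that outperforms existing methods even with reduced supervision.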

Details Motivation: To improve the spatial reasoning of 3D vision-language models and make their geometric and relational understanding interpretable. Method: Proposes the Metric Cognitive Map (Metric-CogMap), which unifies a discrete grid with a continuous metric-scale representation, and Cognitive Chain-of-Thought (Cog-CoT), which reasons explicitly through vector operations, bounding-box distances, and occlusion-aware appearance order cues. Result: On VSI-Bench, the method reaches 59.9% accuracy with half of the supervision (close to the 60.9% fully supervised baseline) and outperforms state-of-the-art methods by 5.3%, 4.8%, and 4.0% on the 10%, 25%, and 50% training subsets, respectively. Conclusion: Map2Thought delivers efficient, interpretable 3D spatial reasoning that remains strong under low supervision, showing the potential of structured representations and explicit reasoning in 3D VLMs. Abstract: We propose Map2Thought, a framework that enables explicit and interpretable spatial reasoning for 3D VLMs. The framework is grounded in two key components: Metric Cognitive Map (Metric-CogMap) and Cognitive Chain-of-Thought (Cog-CoT). Metric-CogMap provides a unified spatial representation by integrating a discrete grid for relational reasoning with a continuous, metric-scale representation for precise geometric understanding. Building upon the Metric-CogMap, Cog-CoT performs explicit geometric reasoning through deterministic operations, including vector operations, bounding-box distances, and occlusion-aware appearance order cues, producing interpretable inference traces grounded in 3D structure. Experimental results show that Map2Thought enables explainable 3D understanding, achieving 59.9% accuracy using only half the supervision, closely matching the 60.9% baseline trained with the full dataset. It consistently outperforms state-of-the-art methods by 5.3%, 4.8%, and 4.0% under 10%, 25%, and 50% training subsets, respectively, on the VSI-Bench.
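
As an illustration of the kind of deterministic geometric operation the abstract mentions, the following minimal sketch computes the minimum distance between two axis-aligned 3D bounding boxes. The abstract does not define its bounding-box distance, so the axis-aligned assumption, the `(min_corner, max_corner)` box layout, and the function name `aabb_distance` are all hypothetical choices made for this example, not the authors' implementation.

```python
import math

def aabb_distance(box_a, box_b):
    """Minimum Euclidean distance between two axis-aligned 3D boxes.

    Each box is a (min_corner, max_corner) pair of (x, y, z) tuples.
    This layout is an assumption for illustration only.
    """
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    gaps = []
    for lo_a, hi_a, lo_b, hi_b in zip(a_min, a_max, b_min, b_max):
        # Per-axis gap: positive when the intervals are separated, else 0.
        gaps.append(max(lo_b - hi_a, lo_a - hi_b, 0.0))
    return math.sqrt(sum(g * g for g in gaps))

# Two unit cubes separated by 1 unit along x.
chair = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
table = ((2.0, 0.0, 0.0), (3.0, 1.0, 1.0))
print(aabb_distance(chair, table))
```

Overlapping or touching boxes give a distance of zero under this definition, which is one reasonable convention among several.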

[89] PRISM-CAFO: Prior-conditioned Remote-sensing Infrastructure Segmentation and Mapping for CAFOs

Oishee Bintey Hoque,Nibir Chandra Mandal,Kyle Luong,Amanda Wilson,Samarth Swarup,Madhav Marathe,Abhijin Adiga

Main category: cs.CV

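TL;DR: Presents an infrastructure-first, explainable pipeline for identifying and characterizing Concentrated Animal Feeding Operations (CAFOs) from aerial and satellite imagery. It combines a domain-tuned object detector, SAM2 mask generation, and a spatial cross-attention classifier, and surpasses the best baseline by up to 15%.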

Details Motivation: Large-scale livestock operations pose risks to human health and the environment and are vulnerable to infectious diseases and extreme weather, so accurate and scalable mapping is essential for monitoring and managing these facilities. Method: A YOLOv8 detector first finds candidate infrastructure (e.g., barns, feedlots, manure lagoons); SAM2 masks are then derived from these boxes and filtered by component-specific criteria. Structured descriptors are extracted and fused with deep visual features in a lightweight spatial cross-attention classifier, which outputs CAFO type predictions along with mask-level attributions tied to visible infrastructure. Result: The method performs well across several U.S. regions; the Swin-B+PRISM-CAFO model exceeds the best baseline by up to 15%, and gradient-activation analyses validate the effectiveness of the domain priors. Conclusion: The infrastructure-first, explainable CAFO identification framework balances accuracy and interpretability and shows good generalization and practical potential. Abstract: Large-scale livestock operations pose significant risks to human health and the environment, while also being vulnerable to threats such as infectious diseases and extreme weather events. As the number of such operations continues to grow, accurate and scalable mapping has become increasingly important. In this work, we present an infrastructure-first, explainable pipeline for identifying and characterizing Concentrated Animal Feeding Operations (CAFOs) from aerial and satellite imagery. Our method (1) detects candidate infrastructure (e.g., barns, feedlots, manure lagoons, silos) with a domain-tuned YOLOv8 detector, then derives SAM2 masks from these boxes and filters them by component-specific criteria, (2) extracts structured descriptors (e.g., counts, areas, orientations, and spatial relations) and fuses them with deep visual features using a lightweight spatial cross-attention classifier, and (3) outputs both CAFO type predictions and mask-level attributions that link decisions to visible infrastructure. Through comprehensive evaluation, we show that our approach achieves state-of-the-art performance, with Swin-B+PRISM-CAFO surpassing the best performing baseline by up to 15\%. Beyond strong predictive performance across diverse U.S. regions, we run systematic gradient--activation analyses that quantify the impact of domain priors and show ho

[90] MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models

Xiaoran Fan,Zhichao Sun,Tao Ji,Lixing Shen,Tao Gui

Main category: cs.CV

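TL;DR: Proposes MHA2MLA-VLM, a parameter-efficient, multimodal-aware framework for converting off-the-shelf vision-language models (VLMs) to the Multi-Head Latent Attention (MLA) architecture in order to compress the KV cache and accelerate inference.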

Details Motivation: As VLMs take on more complex multimodal tasks, the rapidly growing KV cache creates memory and compute bottlenecks at inference time, and adapting existing VLMs to MLA without re-pretraining has received little study. Method: Two core techniques are proposed: a modality-adaptive partial-RoPE strategy that selectively masks nonessential dimensions, and a modality-decoupled low-rank approximation that compresses the visual and textual KV spaces separately. These are combined with parameter-efficient fine-tuning that minimizes output activation error. Result: Experiments on three representative VLMs show that the method restores original performance with very little supervised data, substantially reduces KV cache size, and integrates seamlessly with KV quantization. Conclusion: MHA2MLA-VLM converts VLMs to MLA efficiently and with low loss, offering a practical route to deploying large multimodal models. Abstract: As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA) offers an effective means to compress the KV cache and accelerate inference, adapting existing VLMs to the MLA architecture without costly pretraining remains largely unexplored. In this work, we present MHA2MLA-VLM, a parameter-efficient and multimodal-aware framework for converting off-the-shelf VLMs to MLA. Our approach features two core techniques: (1) a modality-adaptive partial-RoPE strategy that supports both traditional and multimodal settings by selectively masking nonessential dimensions, and (2) a modality-decoupled low-rank approximation method that independently compresses the visual and textual KV spaces. Furthermore, we introduce parameter-efficient fine-tuning to minimize adaptation cost and demonstrate that minimizing output activation error, rather than parameter distance, substantially reduces performance loss. Extensive experiments on three representative VLMs show that MHA2MLA-VLM restores original model performance with minimal supervised data, significantly reduces KV cache footprint, and integrates seamlessly with KV quantization.

[91] Generative Scenario Rollouts for End-to-End Autonomous Driving

Rajeev Yasarla,Deepti Hegde,Shizhong Han,Hsin-Pai Cheng,Yunxiao Shi,Meysam Sadeghigooghari,Shweta Mahajan,Apratim Bhattacharyya,Litian Liu,Risheek Garrepalli,Thomas Svantesson,Fatih Porikli,Hong Cai

Main category: cs.CV

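TL;DR: Proposes Generative Scenario Rollouts (GeRo), a plug-and-play framework that strengthens long-horizon planning and multi-agent scene generation in vision-language-action (VLA) models for end-to-end autonomous driving. Using language-conditioned autoregressive generation and a consistency loss, GeRo produces text-aligned, temporally coherent predictions of future scenes and yields significant gains on several metrics.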

Details Motivation: Existing VLA models mostly rely on imitation learning from sparse trajectory annotations, under-use their potential as generative models, and fall short in long-horizon planning and language-aligned generation. Method: A VLA model is first trained to encode ego-vehicle and traffic-agent dynamics into latent tokens under joint supervision from planning, motion, and language tasks. A language-conditioned autoregressive rollout strategy then generates future latent tokens and textual responses, and a rollout-consistency loss stabilizes predictions and preserves text-action alignment. Result: On Bench2Drive, GeRo improves driving score and success rate by +15.7 and +26.2, respectively; combined with reinforcement learning, it achieves state-of-the-art performance in both closed-loop and open-loop settings and shows strong zero-shot robustness. Conclusion: Generative, language-conditioned reasoning can serve as a foundation for safer and more interpretable end-to-end autonomous driving systems, and GeRo demonstrates the promise of this direction. Abstract: Vision-Language-Action (VLA) models are emerging as highly effective planning models for end-to-end autonomous driving systems. However, current works mostly rely on imitation learning from sparse trajectory annotations and under-utilize their potential as generative models. We propose Generative Scenario Rollouts (GeRo), a plug-and-play framework for VLA models that jointly performs planning and generation of language-grounded future traffic scenes through an autoregressive rollout strategy. First, a VLA model is trained to encode ego vehicle and agent dynamics into latent tokens under supervision from planning, motion, and language tasks, facilitating text-aligned generation. Next, GeRo performs language-conditioned autoregressive generation. Given multi-view images, a scenario description, and ego-action questions, it generates future latent tokens and textual responses to guide long-horizon rollouts. A rollout-consistency loss stabilizes predictions using ground truth or pseudo-labels, mitigating drift and preserving text-action alignment. This design enables GeRo to perform temporally consistent, language-grounded rollouts that support long-horizon reasoning and multi-agent planning. On Bench2Drive, GeRo improves driving score and success rate by +15.7 and +26.2, respectively. By integrating reinforcement learning with generative rollouts, GeRo achieves state-of-the-art closed-loop and open-loop performance, demonstrating strong zero-shot robustness. These results highlight the promise of generative, language-conditioned reasoning as a foundation for safer and more interpretable end-to-end autonomous driving.

[92] ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes

Emily Steiner,Jianhao Zheng,Henry Howard-Jenkins,Chris Xie,Iro Armeni

Main category: cs.CV

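TL;DR: Introduces and formalizes the task of temporally sparse 4D indoor semantic instance segmentation (SIS), presents ReScene4D for temporally consistent instance segmentation and tracking under sparse observations, and defines the t-mAP metric for temporal consistency. Experiments show state-of-the-art performance on the 3RScan dataset.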

Details Motivation: Existing 3DSIS methods lack temporal reasoning and depend on a discrete matching step, while 4D LiDAR methods rely on high-frequency temporal data and are ill-suited to the long-term evolution of indoor environments, so a 4D SIS method for sparse temporal sampling is needed. Method: Proposes ReScene4D, which adds cross-observation information sharing to 3DSIS architectures to jointly segment, identify, and temporally associate object instances, maintaining temporal consistency without dense temporal sampling. Result: ReScene4D is validated on the 3RScan dataset, where it outperforms existing methods in both standard 3DSIS quality and temporal consistency; a new metric, t-mAP, is introduced to measure temporal identity consistency. Conclusion: ReScene4D handles sparse 4D semantic instance segmentation of indoor scenes that evolve over long periods, improves the temporal consistency of instance tracking as well as segmentation quality, and sets a new benchmark for understanding dynamic indoor environments. Abstract: Indoor environments evolve as objects move, appear, or disappear. Capturing these dynamics requires maintaining temporally consistent instance identities across intermittently captured 3D scans, even when changes are unobserved. We introduce and formalize the task of temporally sparse 4D indoor semantic instance segmentation (SIS), which jointly segments, identifies, and temporally associates object instances. This setting poses a challenge for existing 3DSIS methods, which require a discrete matching step due to their lack of temporal reasoning, and for 4D LiDAR approaches, which perform poorly due to their reliance on high-frequency temporal measurements that are uncommon in the longer-horizon evolution of indoor environments. We propose ReScene4D, a novel method that adapts 3DSIS architectures for 4DSIS without needing dense observations. It explores strategies to share information across observations, demonstrating that this shared context not only enables consistent instance tracking but also improves standard 3DSIS quality. To evaluate this task, we define a new metric, t-mAP, that extends mAP to reward temporal identity consistency. ReScene4D achieves state-of-the-art performance on the 3RScan dataset, establishing a new benchmark for understanding evolving indoor scenes.

[93] ShapeR: Robust Conditional 3D Shape Generation from Casual Captures

Yawar Siddiqui,Duncan Frost,Samir Aroudj,Armen Avetisyan,Henry Howard-Jenkins,Daniel DeTone,Pierre Moulon,Qirui Wu,Zhengqin Li,Julian Straub,Richard Newcombe,Jakob Engel

Main category: cs.CV

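TL;DR: ShapeR is a new method for conditional 3D object shape generation from casually captured image sequences. It uses SLAM, 3D detection, and vision-language models to extract multimodal information, generates high-fidelity 3D shapes with a trained transformer, and significantly outperforms existing methods in real-world scenes.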

Details Motivation: Existing 3D shape generation methods depend on clean, unoccluded inputs and struggle with the complexity and occlusion of casually captured real-world data. Method: Off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models are combined to extract, for each object in an image sequence, sparse SLAM points, multi-view images, and machine-generated captions. A rectified flow transformer generates 3D shapes conditioned on these modalities, and compositional augmentation, curriculum training, and strategies for handling background clutter improve robustness. Result: On a new benchmark of 178 real-world objects, ShapeR improves Chamfer distance by 2.7x over the state of the art. Conclusion: ShapeR handles casually captured real-world data and generates high-quality 3D object shapes, moving 3D generation closer to practical use. Abstract: Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we leverage off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in-the-wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving an improvement of 2.7x in Chamfer distance compared to state of the art.
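
For readers unfamiliar with the Chamfer distance reported above, here is a minimal sketch of one common form of the metric. Definitions vary between papers (squared versus unsquared distances, sums versus means), and the abstract does not say which variant ShapeR uses. This version assumes the symmetric sum of mean nearest-neighbour Euclidean distances, and the function name `chamfer_distance` is chosen for this example.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Assumed variant: the sum of the two directed mean nearest-neighbour
    Euclidean distances. Other papers use squared distances or plain sums.
    """
    def one_sided(src, dst):
        # Mean distance from each point in src to its nearest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(points_a, points_b) + one_sided(points_b, points_a)

# Brute-force O(N*M) comparison, fine for small illustrative point sets.
generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0)]
print(chamfer_distance(generated, reference))
```

Lower values mean the two point sets lie closer together, so an "improvement" in Chamfer distance corresponds to a reduction.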

[94] UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation

Ruiheng Zhang,Jingfeng Yao,Huangxuan Zhao,Hao Yan,Xiao He,Lei Chen,Zhou Wei,Yong Luo,Zengmao Wang,Lefei Zhang,Dacheng Tao,Bo Du

Main category: cs.CV

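TL;DR: UniX is a next-generation unified medical foundation model for chest X-ray understanding and generation. It decouples an autoregressive branch (for understanding) from a diffusion branch (for generation) and adds a cross-modal self-attention mechanism so the two tasks cooperate, substantially improving both understanding and generation with fewer parameters.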

Details Motivation: Existing parameter-shared autoregressive architectures trade off performance between medical image understanding and generation, and struggle to meet the needs of semantic abstraction and pixel-level reconstruction at the same time. Method: UniX assigns understanding to an autoregressive branch and generation to a diffusion branch, uses cross-modal self-attention to guide generation dynamically with understanding features, and is optimized with rigorous data cleaning and a multi-stage training strategy. Result: On two benchmarks, UniX improves understanding performance (Micro-F1) by 46.1% and generation quality (FD-RadDino) by 24.2% while using only a quarter of the parameters of LLM-CXR, with performance on par with task-specific models. Conclusion: UniX establishes a scalable, synergistic paradigm for medical image understanding and generation and points to a new direction for medical foundation models. Abstract: Despite recent progress, medical foundation models still struggle to unify visual understanding and generation, as these tasks have inherently conflicting goals: semantic abstraction versus pixel-level reconstruction. Existing approaches, typically based on parameter-shared autoregressive architectures, frequently lead to compromised performance in one or both tasks. To address this, we present UniX, a next-generation unified medical foundation model for chest X-ray understanding and generation. UniX decouples the two tasks into an autoregressive branch for understanding and a diffusion branch for high-fidelity generation. Crucially, a cross-modal self-attention mechanism is introduced to dynamically guide the generation process with understanding features. Coupled with a rigorous data cleaning pipeline and a multi-stage training strategy, this architecture enables synergistic collaboration between tasks while leveraging the strengths of diffusion models for superior generation. On two representative benchmarks, UniX achieves a 46.1% improvement in understanding performance (Micro-F1) and a 24.2% gain in generation quality (FD-RadDino), using only a quarter of the parameters of LLM-CXR. By achieving performance on par with task-specific models, our work establishes a scalable paradigm for synergistic medical image understanding and generation. Codes and models are available at https://github.com/ZrH42/UniX.
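
The understanding result above is reported as Micro-F1. As a reminder of how that metric is typically computed for multi-label predictions, the sketch below pools true positives, false positives, and false negatives over all samples and labels before computing F1. The set-based input format, the function name `micro_f1`, and the finding labels in the example are illustrative assumptions, not taken from the UniX evaluation code.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label predictions.

    y_true and y_pred are equal-length sequences of label sets,
    one set per sample (an assumed input format for illustration).
    """
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        tp += len(true_labels & pred_labels)   # correctly predicted labels
        fp += len(pred_labels - true_labels)   # predicted but not present
        fn += len(true_labels - pred_labels)   # present but not predicted
    denom = 2 * tp + fp + fn
    # Convention: return 0.0 when there are no labels and no predictions.
    return 2 * tp / denom if denom else 0.0

# Hypothetical finding labels for two chest X-ray reports.
truth = [{"edema", "effusion"}, {"pneumonia"}]
preds = [{"edema"}, {"pneumonia", "atelectasis"}]
print(micro_f1(truth, preds))
```

Because counts are pooled before the ratio is taken, frequent labels weigh more heavily in Micro-F1 than in a macro average over labels.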