Table of Contents
cs.CL
[1] LLMs for Game Theory: Entropy-Guided In-Context Learning and Adaptive CoT Reasoning
Tommaso Felice Banfi,Sashenka Gamage
Main category: cs.CL
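TL;DR: Proposes an LLM-based adaptive reasoning framework that uses entropy-guided chain-of-thought and dynamic context retrieval to substantially improve decision quality in game tasks such as Tic-Tac-Toe.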
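As a rough illustration of the entropy-guided adjustment summarized in this entry, the sketch below computes the Shannon entropy of next-token distributions and maps it to a reasoning plan. The thresholds, path counts, and example counts are hypothetical values chosen for illustration; the paper's actual settings are not given here.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(step_distributions):
    """Average token-level entropy over the decoding steps of one answer."""
    total = sum(token_entropy(p) for p in step_distributions)
    return total / len(step_distributions)


def plan_reasoning(entropy, low=0.5, high=1.0):
    """Map uncertainty to a reasoning plan.

    The thresholds and counts are hypothetical, not taken from the paper:
    low entropy -> one short CoT path with minimal context,
    high entropy -> several CoT paths with more retrieved examples.
    """
    if entropy < low:
        return {"cot_paths": 1, "retrieved_examples": 1}
    if entropy < high:
        return {"cot_paths": 3, "retrieved_examples": 2}
    return {"cot_paths": 5, "retrieved_examples": 4}


# A confident move (peaked distribution) versus an uncertain one (uniform).
confident = mean_entropy([[0.97, 0.01, 0.01, 0.01]])
uncertain = mean_entropy([[0.25, 0.25, 0.25, 0.25]])
print(plan_reasoning(confident), plan_reasoning(uncertain))
```

Keeping the plan as a plain function of entropy makes the query budget easy to audit, since each board position triggers at most the number of paths listed for its band.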
Details
Motivation: To improve the reasoning ability of large language models on discrete, game-theoretic tasks, in particular their ability to make optimal decisions under varying uncertainty.
Method: Combines in-context learning, entropy-guided chain-of-thought (CoT) reasoning, and adaptive context retrieval, dynamically adjusting the number of reasoning paths and retrieved examples according to token-level uncertainty.
Result: Over 100 games against a sub-optimal algorithmic opponent, the average outcome rises from -11.6% for the baseline LLM to +9.5%, while the number of LLM queries per game stays low; statistical validation shows the improvement is significant, and token-level entropy is negatively correlated with move optimality.
Conclusion: Uncertainty-guided adaptive reasoning effectively improves LLM performance in sequential decision-making environments.
Abstract: We propose a novel LLM-based framework for reasoning in discrete, game-theoretic tasks, illustrated with Tic-Tac-Toe. The method integrates in-context learning with entropy-guided chain-of-thought (CoT) reasoning and adaptive context retrieval. The model dynamically adjusts both the number of retrieved examples and reasoning paths according to token-level uncertainty: concise reasoning with minimal context is used when uncertainty is low, whereas higher uncertainty triggers expanded multi-path CoT exploration. Experimental evaluation against a sub-optimal algorithmic opponent shows that entropy-aware adaptive reasoning substantially improves decision quality, increasing the average game outcome from -11.6% with the baseline LLM to +9.5% with entropy-guided adaptive reasoning over 100 games (win = +1, tie = 0, loss = -1), while maintaining a relatively low number of LLM queries per game. Statistical validation confirms that the improvement is significant, and correlation analysis reveals a negative association between token-level entropy and move optimality. These findings demonstrate that uncertainty-guided adaptive reasoning effectively enhances LLM performance in sequential decision-making environments.
[2] BYOL: Bring Your Own Language Into LLMs
Syed Waqas Zamir,Wassim Hamidouche,Boulbaba Ben Amor,Luana Marotti,Inbal Becker-Reshef,Juan Lavista Ferres
Main category: cs.CL
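TL;DR: Proposes the Bring Your Own Language (BYOL) framework to address the weak performance of large language models on low-resource and extreme-low-resource languages. Using tiered resource classification, tailored data augmentation, and a translation-mediated pathway, it improves target-language performance and releases evaluation benchmarks for several low-resource languages.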
Details
Motivation: Because global language resources are distributed very unevenly, existing large models perform poorly on low-resource languages, with weak cultural alignment and limited accessibility; a scalable, language-aware modeling framework is needed to close this gap.
Method: Proposes the BYOL framework, which first classifies languages into four resource tiers by digital footprint. For low-resource languages it applies a full-stack pipeline of corpus cleaning, synthetic text generation, continual pretraining, and supervised fine-tuning; for extreme-low-resource languages it designs a translation-mediated access pathway; and it preserves multilingual capability through weight-space model merging.
Result: On Chichewa and Maori, the tailored models improve over strong multilingual baselines by about 12% on average; on Inuktitut, the tailored translation system beats a commercial baseline by 4 BLEU, markedly improving the accuracy of LLM access, while English and other-language performance is preserved.
Conclusion: BYOL offers a scalable, customized LLM development path for languages at different resource levels, mitigating language inequality and advancing truly multilingual AI.
Abstract: Large Language Models (LLMs) exhibit strong multilingual capabilities, yet remain fundamentally constrained by the severe imbalance in global language resources. While over 7,000 languages are spoken worldwide, only a small subset (fewer than 100) has sufficient digital presence to meaningfully influence modern LLM training. This disparity leads to systematic underperformance, cultural misalignment, and limited accessibility for speakers of low-resource and extreme-low-resource languages. To address this gap, we introduce Bring Your Own Language (BYOL), a unified framework for scalable, language-aware LLM development tailored to each language's digital footprint. BYOL begins with a language resource classification that maps languages into four tiers (Extreme-Low, Low, Mid, High) using curated web-scale corpora, and uses this classification to select the appropriate integration pathway. For low-resource languages, we propose a full-stack data refinement and expansion pipeline that combines corpus cleaning, synthetic text generation, continual pretraining, and supervised finetuning. Applied to Chichewa and Maori, this pipeline yields language-specific LLMs that achieve approximately 12 percent average improvement over strong multilingual baselines across 12 benchmarks, while preserving English and multilingual capabilities via weight-space model merging. For extreme-low-resource languages, we introduce a translation-mediated inclusion pathway, and show on Inuktitut that a tailored machine translation system improves over a commercial baseline by 4 BLEU, enabling high-accuracy LLM access when direct language modeling is infeasible. Finally, we release human-translated versions of the Global MMLU-Lite benchmark in Chichewa, Maori, and Inuktitut, and make our codebase and models publicly available at https://github.com/microsoft/byol.
[3] A Concise Agent is Less Expert: Revealing Side Effects of Using Style Features on Conversational Agents
Young-Min Cho,Yuan Yuan,Sharath Chandra Guntuku,Lyle Ungar
Main category: cs.CL
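TL;DR: Presents the first systematic study of cross-feature side effects among style features in LLM conversational agents, finds that style features in prompts are deeply entangled rather than orthogonal (for example, prompting for conciseness significantly reduces perceived expertise), and introduces the CASSE dataset to support future research.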
Details
Motivation: Style features such as friendly, helpful, or concise are widely used in prompts to steer LLM behavior, but their potential side effects are unclear, so the interactions among them need systematic study.
Method: Identifies 12 commonly used style features from an analysis of 127 ACL Anthology papers, then runs controlled experiments in task-oriented and open-domain dialogue settings, using synthetic dialogues and a pairwise LLM-as-a-Judge evaluation framework to quantify the causal effects between style features.
Result: Finds significant, structured cross-style side effects (for example, prompting for conciseness significantly reduces perceived expertise), indicating deep entanglement among style features; builds the CASSE dataset to record these interactions; and finds that existing prompting or activation steering methods can partially restore suppressed traits but often harm the primary target style.
Conclusion: Challenges the assumption that LLMs can faithfully and independently control style features and stresses the need for more principled, multi-objective approaches to safe, precise style steering.
Abstract: Style features such as friendly, helpful, or concise are widely used in prompts to steer the behavior of Large Language Model (LLM) conversational agents, yet their unintended side effects remain poorly understood. In this work, we present the first systematic study of cross-feature stylistic side effects. We conduct a comprehensive survey of 127 conversational agent papers from ACL Anthology and identify 12 frequently used style features. Using controlled, synthetic dialogues across task-oriented and open-domain settings, we quantify how prompting for one style feature causally affects others via a pairwise LLM-as-a-Judge evaluation framework. Our results reveal consistent and structured side effects; for example, prompting for conciseness significantly reduces perceived expertise. They demonstrate that style features are deeply entangled rather than orthogonal. To support future research, we introduce CASSE (Conversational Agent Stylistic Side Effects), a dataset capturing these complex interactions. We further evaluate prompt-based and activation-steering-based mitigation strategies and find that while they can partially restore suppressed traits, they often degrade the primary intended style. These findings challenge the assumption of faithful style control in LLMs and highlight the need for multi-objective and more principled approaches to safe, targeted stylistic steering in conversational agents.
[4] Reasoning Models Generate Societies of Thought
Junsol Kim,Shiyang Lai,Nino Scherrer,Blaise Agüera y Arcas,James Evans
Main category: cs.CL
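TL;DR: Argues that the complex reasoning ability of large language models stems not only from longer chains of thought but, more importantly, from simulating multi-agent interaction (a "society of thought") that diversifies and stages debate among internal cognitive perspectives, thereby improving reasoning accuracy.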
Details
Motivation: To explore the mechanisms behind complex reasoning in large language models, especially why reasoning models outperform ordinary instruction-tuned models on cognitive tasks.
Method: Combines quantitative analysis with mechanistic interpretability methods to analyze the reasoning traces of reasoning models such as DeepSeek-R1 and QwQ-32B, and uses reinforcement learning experiments to study the relationship between conversational behaviors and reasoning accuracy.
Result: Reasoning models show greater perspective diversity and activate broader conflict between heterogeneous personality- and expertise-related features; the multi-agent structure appears as conversational behaviors such as question answering, perspective shifts, and reconciliation of views, accompanied by socio-emotional role interactions, and significantly improves reasoning accuracy.
Conclusion: Gains in reasoning come from the "social organization of thought": structured integration of diverse internal perspectives yields collective intelligence similar to that of human groups, pointing to new directions for building more efficient AI reasoning systems.
Abstract: Large language models have achieved remarkable capabilities across domains, yet mechanisms underlying sophisticated reasoning remain elusive. Recent reasoning models outperform comparable instruction-tuned models on complex cognitive tasks, attributed to extended computation through longer chains of thought. Here we show that enhanced reasoning emerges not from extended computation alone, but from simulating multi-agent-like interactions -- a society of thought -- which enables diversification and debate among internal cognitive perspectives characterized by distinct personality traits and domain expertise. Through quantitative analysis and mechanistic interpretability methods applied to reasoning traces, we find that reasoning models like DeepSeek-R1 and QwQ-32B exhibit much greater perspective diversity than instruction-tuned models, activating broader conflict between heterogeneous personality- and expertise-related features during reasoning. This multi-agent structure manifests in conversational behaviors, including question-answering, perspective shifts, and the reconciliation of conflicting views, and in socio-emotional roles that characterize sharp back-and-forth conversations, together accounting for the accuracy advantage in reasoning tasks. Controlled reinforcement learning experiments reveal that base models increase conversational behaviors when rewarded solely for reasoning accuracy, and fine-tuning models with conversational scaffolding accelerates reasoning improvement over base models. These findings indicate that the social organization of thought enables effective exploration of solution spaces. We suggest that reasoning models establish a computational parallel to collective intelligence in human groups, where diversity enables superior problem-solving when systematically structured, which suggests new opportunities for agent organization to harness the wisdom of crowds.
[5] EncodeRec: An Embedding Backbone for Recommendation Systems
Guy Hadad,Neomi Rabaev,Bracha Shapira
Main category: cs.CL
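TL;DR: EncodeRec is a new approach for recommender systems that keeps the pre-trained language model's parameters frozen and learns compact, informative embeddings directly from item descriptions, aligning textual representations with recommendation objectives.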
Details
Motivation: Embedding spaces from existing pre-trained language models (PLMs) lack structure and discriminability when used in recommender systems and struggle to capture domain-specific semantics.
Method: Proposes EncodeRec, which keeps the language model parameters fixed while training the recommender and learns compact, recommendation-adapted embeddings directly from item descriptions.
Result: Across several core recommendation benchmarks, EncodeRec substantially outperforms PLM-based and other embedding-model baselines, both as a backbone for sequential recommendation models and for semantic ID tokenization.
Conclusion: Embedding adaptation plays a key role in connecting general-purpose language models with practical recommender systems.
Abstract: Recent recommender systems increasingly leverage embeddings from large pre-trained language models (PLMs). However, such embeddings exhibit two key limitations: (1) PLMs are not explicitly optimized to produce structured and discriminative embedding spaces, and (2) their representations remain overly generic, often failing to capture the domain-specific semantics crucial for recommendation tasks. We present EncodeRec, an approach designed to align textual representations with recommendation objectives while learning compact, informative embeddings directly from item descriptions. EncodeRec keeps the language model parameters frozen during recommender system training, making it computationally efficient without sacrificing semantic fidelity. Experiments across core recommendation benchmarks demonstrate its effectiveness both as a backbone for sequential recommendation models and for semantic ID tokenization, showing substantial gains over PLM-based and embedding model baselines. These results underscore the pivotal role of embedding adaptation in bridging the gap between general-purpose language models and practical recommender systems.
[6] DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference
Parisa Rabbani,Priyam Sahoo,Ruben Mathew,Aishee Mondal,Harshita Ketharaman,Nimet Beyza Bozdag,Dilek Hakkani-Tür
Main category: cs.CL
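TL;DR: Shows that large language models (LLMs) exhibit "dialogic deference" when evaluating dialogue: identical content elicits different judgments depending on whether it is framed as a statement to verify or attributed to a speaker. The authors propose the DialDefer framework and the Dialogic Deference Score (DDS) to detect and mitigate this bias. Experiments spanning nine domains, more than 3,000 instances, and four models show that conversational framing causes large judgment shifts (|DDS| up to 87 percentage points) while accuracy changes very little. The shifts are amplified further on real conversations (such as Reddit), and the same model can lean toward deference or skepticism depending on the domain. Human-vs-LLM attribution is the main driver, suggesting that models treat contradicting a human as more costly. Mitigation strategies reduce deference but tend to tip into over-skepticism, indicating a need for calibration-focused evaluation beyond accuracy.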
Details
Motivation: To examine the reliability of LLMs as third-party judges in dialogue settings, in particular how their verdicts on identical content vary with framing, and to show that accuracy-based evaluation can mask systematic bias.
Method: Proposes the DialDefer framework and the Dialogic Deference Score (DDS) to quantify framing-induced judgment shifts; runs controlled experiments testing four models on more than 3,000 instances across nine domains, comparing verdicts under the "statement verification" and "speaker attribution" framings, and uses attribution ablations to analyze contributing factors.
Result: Conversational framing induces large judgment shifts (|DDS| up to 87pp, p < .0001) while overall accuracy changes by less than 2pp; shifts are amplified 2-4x on naturalistic conversations; the same model behaves differently across domains (e.g., DDS = -53 on science and DDS = +58 on social judgment); human-vs-LLM attribution produces the largest shift (17.7pp). Mitigation reduces deference but can induce over-skepticism.
Conclusion: LLMs show a systematic framing effect in dialogue judgment, dialogic deference, which accuracy does not capture and which requires new metrics such as DDS; the finding calls for rethinking the reliability of LLM judges and shifting evaluation from accuracy alone toward calibration of judgments.
Abstract: LLMs are increasingly used as third-party judges, yet their reliability when evaluating speakers in dialogue remains poorly understood. We show that LLMs judge identical claims differently depending on framing: the same content elicits different verdicts when presented as a statement to verify ("Is this statement correct?") versus attributed to a speaker ("Is this speaker correct?"). We call this dialogic deference and introduce DialDefer, a framework for detecting and mitigating these framing-induced judgment shifts. Our Dialogic Deference Score (DDS) captures directional shifts that aggregate accuracy obscures. Across nine domains, 3k+ instances, and four models, conversational framing induces large shifts (|DDS| up to 87pp, p < .0001) while accuracy remains stable (<2pp), with effects amplifying 2-4x on naturalistic Reddit conversations. Models can shift toward agreement (deference) or disagreement (skepticism) depending on domain -- the same model ranges from DDS = -53 on graduate-level science to +58 on social judgment. Ablations reveal that human-vs-LLM attribution drives the largest shifts (17.7pp swing), suggesting models treat disagreement with humans as more costly than with AI. Mitigation attempts reduce deference but can over-correct into skepticism, framing this as a calibration problem beyond accuracy optimization.
[7] Neural Induction of Finite-State Transducers
Michael Ginn,Alexis Palmer,Mans Hulden
Main category: cs.CL
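TL;DR: Proposes a new method that automatically constructs unweighted finite-state transducers (FSTs) from the hidden-state geometry of a recurrent neural network, substantially outperforming traditional FST learning algorithms on morphological inflection, grapheme-to-phoneme conversion, and historical normalization.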
Details
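Motivation: Building finite-state transducers (FSTs) by hand is difficult and time-consuming, while existing automatic learning algorithms have limited performance and fall short of high-accuracy requirements.
Method: Uses the hidden-state geometry learned by a recurrent neural network (RNN) to automatically construct unweighted FSTs, transferring the network's knowledge into an interpretable, efficient symbolic model.
Result: Validated on several real-world datasets; the constructed FSTs are highly accurate and robust, improving on classical FST learning algorithms by up to 87% accuracy on held-out test sets.
Conclusion: The method combines the expressive power of neural networks with the efficiency and interpretability of FSTs, offering a new route to high-performance string rewriting systems.
Abstract: Finite-State Transducers (FSTs) are effective models for string-to-string rewriting tasks, often providing the efficiency necessary for high-performance applications, but constructing transducers by hand is difficult. In this work, we propose a novel method for automatically constructing unweighted FSTs following the hidden state geometry learned by a recurrent neural network. We evaluate our methods on real-world datasets for morphological inflection, grapheme-to-phoneme prediction, and historical normalization, showing that the constructed FSTs are highly accurate and robust for many datasets, substantially outperforming classical transducer learning algorithms by up to 87% accuracy on held-out test sets.
[8] Massively Multilingual Joint Segmentation and Glossing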
Michael Ginn,Lindia Tjuatja,Enora Rice,Ali Marashian,Maria Valentini,Jasmine Xu,Graham Neubig,Alexis Palmer
Main category: cs.CL
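TL;DR: Presents PolyGloss, a neural model that jointly predicts morpheme segmentation and interlinear glosses, addressing the lack of interpretability and trustworthiness of existing models in practical linguistic work.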
Details
Motivation: Existing automatic glossing models such as GlossLM score well on benchmarks but do not predict morpheme boundaries, so linguists find their output hard to trust and use.
Method: Designs and trains PolyGloss, a multilingual sequence-to-sequence model that jointly learns morphological segmentation and glossing, extending the GlossLM corpus and using low-rank adaptation to transfer to new datasets.
Result: PolyGloss outperforms GlossLM on glossing and beats various open-source LLMs on segmentation, glossing, and alignment between the two tasks.
Conclusion: Jointly modeling segmentation and glossing improves interpretability and practical usefulness; PolyGloss gives language documentation work a more reliable and easily adapted tool.
Abstract: Automated interlinear gloss prediction with neural networks is a promising approach to accelerate language documentation efforts. However, while state-of-the-art models like GlossLM achieve high scores on glossing benchmarks, user studies with linguists have found critical barriers to the usefulness of such models in real-world scenarios. In particular, existing models typically generate morpheme-level glosses but assign them to whole words without predicting the actual morpheme boundaries, making the predictions less interpretable and thus untrustworthy to human annotators. We conduct the first study on neural models that jointly predict interlinear glosses and the corresponding morphological segmentation from raw text. We run experiments to determine the optimal way to train models that balance segmentation and glossing accuracy, as well as the alignment between the two tasks. We extend the training corpus of GlossLM and pretrain PolyGloss, a family of seq2seq multilingual models for joint segmentation and glossing that outperforms GlossLM on glossing and beats various open-source LLMs on segmentation, glossing, and alignment. In addition, we demonstrate that PolyGloss can be quickly adapted to a new dataset via low-rank adaptation.
[9] Selecting Language Models for Social Science: Start Small, Start Open, and Validate
Dustin S. Stoltz,Marshall A. Taylor,Sanuj Kumar
Main category: cs.CL
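TL;DR: Discusses how social scientists can choose among the many available pretrained language models, stressing validity, reliability, reproducibility, and replicability, and recommending small open models together with task-specific benchmarks that validate the computational pipeline.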
Details
Motivation: Faced with a large number of available pretrained language models, social scientists lack clear selection criteria and need systematic guidance to ensure scientifically sound, credible results.
Method: Using validity, reliability, reproducibility, and replicability as a framework, analyzes the effects of model openness, model footprint, training data, and model architecture and fine-tuning, and advocates small open models and delimited benchmarks.
Result: Relying on benchmarks alone (ex-ante validity) is not enough to guarantee research quality; ex-post validation and replicability matter more.
Conclusion: Social scientists should start with small open models and build task-specific benchmarks to demonstrate the validity and replicability of the entire computational pipeline, improving research rigor.
Abstract: Currently, there are thousands of large pretrained language models (LLMs) available to social scientists. How do we select among them? Using validity, reliability, reproducibility, and replicability as guides, we explore the significance of: (1) model openness, (2) model footprint, (3) training data, and (4) model architectures and fine-tuning. While ex-ante tests of validity (i.e., benchmarks) are often privileged in these discussions, we argue that social scientists cannot altogether avoid validating computational measures (ex-post). Replicability, in particular, is a more pressing guide for selecting language models. Being able to reliably replicate a particular finding that entails the use of a language model necessitates reliably reproducing a task. To this end, we propose starting with smaller, open models, and constructing delimited benchmarks to demonstrate the validity of the entire computational pipeline.
[10] Multi-Stage Patient Role-Playing Framework for Realistic Clinical Interactions
Shijie Jiang,Zefan Zhang,Kehua Zhu,Tian Bai,Ruihong Zhao
Main category: cs.CL
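TL;DR: Introduces Ch-PatientSim, the first Chinese patient simulation dataset built from realistic clinical scenarios and structured around a five-dimensional persona, and proposes a training-free Multi-Stage Patient Role-Playing (MSPRP) framework that improves the personalization and realism of LLM patient simulation.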
Details
Motivation: Existing clinical interaction simulation relies on generic or LLM-generated dialogue data that lacks authenticity and diversity, which holds back clinical LLMs and medical education.
Method: Builds Ch-PatientSim, a realistic Chinese patient simulation dataset based on a five-dimensional persona structure, augmenting it with few-shot generation followed by manual verification to address class imbalance; proposes the MSPRP framework, which decomposes interaction into three stages to improve realism and personality without training.
Result: Evaluation of several mainstream LLMs finds responses that are generally too formal and lack personality; experiments show that MSPRP significantly improves patient simulation across multiple dimensions.
Conclusion: Ch-PatientSim provides a more realistic and diverse benchmark for evaluating patient behavior simulation, and MSPRP improves LLMs' ability to produce personalized, human-like responses in clinical interactions, with potential applications in medical education and clinical model evaluation.
Abstract: The simulation of realistic clinical interactions plays a pivotal role in advancing clinical Large Language Models (LLMs) and supporting medical diagnostic education. Existing approaches and benchmarks rely on generic or LLM-generated dialogue data, which limits the authenticity and diversity of doctor-patient interactions. In this work, we propose the first Chinese patient simulation dataset (Ch-PatientSim), constructed from realistic clinical interaction scenarios to comprehensively evaluate the performance of models in emulating patient behavior. Patients are simulated based on a five-dimensional persona structure. To address persona class imbalance, a portion of the dataset is augmented using few-shot generation, followed by manual verification. We evaluate various state-of-the-art LLMs and find that most produce overly formal responses that lack individual personality. To address this limitation, we propose a training-free Multi-Stage Patient Role-Playing (MSPRP) framework, which decomposes interactions into three stages to ensure both personalization and realism in model responses. Experimental results demonstrate that our approach significantly improves model performance across multiple dimensions of patient simulation.
[11] Steering Language Models Before They Speak: Logit-Level Interventions
Hyeseon An,Shinwoo Park,Hyundong Jin,Yo-Sub Han
Main category: cs.CL
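TL;DR: Proposes a training-free, inference-time logit intervention that steers LLM generation using a statistical token score table built from z-normalized log-odds over labeled corpora, overcoming limitations of existing prompting-based and activation-based methods.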
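To make the idea of a z-normalized log-odds score table concrete, here is a minimal sketch under stated assumptions: whitespace tokenization, add-one smoothing, and a single scaling factor `alpha` are all illustrative choices, and the helper names `token_scores` and `steer_logits` are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def token_scores(target_docs, other_docs, smoothing=1.0):
    """Z-normalized log-odds of each token in the target vs. the other corpus.

    Whitespace tokenization and add-one smoothing are assumptions made
    for this sketch only.
    """
    target = Counter(tok for doc in target_docs for tok in doc.split())
    other = Counter(tok for doc in other_docs for tok in doc.split())
    vocab = sorted(set(target) | set(other))
    n_target = sum(target.values()) + smoothing * len(vocab)
    n_other = sum(other.values()) + smoothing * len(vocab)

    raw = {}
    for tok in vocab:
        p_t = (target[tok] + smoothing) / n_target
        p_o = (other[tok] + smoothing) / n_other
        raw[tok] = math.log(p_t / (1 - p_t)) - math.log(p_o / (1 - p_o))

    mean = sum(raw.values()) / len(raw)
    var = sum((x - mean) ** 2 for x in raw.values()) / len(raw)
    std = math.sqrt(var) or 1.0  # avoid dividing by zero for flat tables
    return {tok: (x - mean) / std for tok, x in raw.items()}


def steer_logits(logits, scores, alpha=1.0):
    """Shift each token's logit by alpha times its table score."""
    return {tok: l + alpha * scores.get(tok, 0.0) for tok, l in logits.items()}


formal = ["we request assistance", "we request confirmation"]
casual = ["gimme help", "gimme that"]
scores = token_scores(formal, casual)
steered = steer_logits({"request": 0.0, "gimme": 0.0}, scores, alpha=2.0)
print(steered)
```

Because the table is computed once from labeled text and applied as an additive shift, no model weights or internal activations are touched; the strength of steering is controlled only by `alpha`.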
Details
Motivation: Existing steering methods such as prompt engineering and activation intervention either give coarse control or require access to internal layers, limiting their use in style-sensitive rewriting, user-adaptive interaction, and toxicity mitigation.
Method: Builds a statistical token score table from z-normalized log-odds and intervenes directly on the logit distribution at inference time, achieving controllable generation without training.
Result: Validated on three tasks (writing complexity, formality, and toxicity control), with gains of up to +47 percentage points in accuracy and a 50x improvement in F1.
Conclusion: The method is broadly applicable and task-agnostic, delivering robust, consistent, multi-task control and offering an efficient, practical route to controllable LLM generation.
Abstract: Steering LLMs is essential for specialized applications such as style-sensitive text rewriting, user-adaptive communication, and toxicity mitigation. Current steering methods, such as prompting-based and activation-based approaches, are widely used to guide model behavior. However, activation-based techniques require deep access to internal layers, while prompting-based steering often fails to provide consistent or fine-grained control. To address these limitations, we propose a training-free inference-time logit intervention for controllable generation. Our approach utilizes a statistical token score table derived from z-normalized log-odds of labeled corpora to shift the decoding distribution. Empirical evaluations across three diverse datasets focusing on writing complexity, formality, and toxicity demonstrate that our method effectively steers output characteristics, confirming its broad applicability and task-agnostic nature. Our results show that statistically grounded logit steering can achieve large, consistent, and multi-task control gains: up to +47%p accuracy and 50x F1 improvement.
[12] ZPD Detector: Data Selection via Capability-Difficulty Alignment for Large Language Models
Bo Yang,Yunkui Chen,Lanfei Feng,Yu Zhang,Shijian Li
Main category: cs.CL
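TL;DR: Proposes ZPD Detector, a dynamic data selection framework grounded in the theory of the Zone of Proximal Development, which improves training-data efficiency by modeling the match between sample difficulty and model capability.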
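The capability-difficulty matching idea can be sketched with a standard two-parameter IRT model. The matching score used below is the item's Fisher information at the model's current ability, which peaks when difficulty is close to ability; this is an assumption made for illustration, and the score defined in the paper may differ. The names `match_score` and `select_samples` are hypothetical.

```python
import math


def p_correct(theta, difficulty, discrimination=1.0):
    """2PL IRT: probability that a model with ability theta solves the item."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


def match_score(theta, difficulty, discrimination=1.0):
    """Fisher information of the item at ability theta.

    Used here as a stand-in matching score (an assumption): it is largest
    when the item is neither too easy nor too hard for the current model.
    """
    p = p_correct(theta, difficulty, discrimination)
    return discrimination ** 2 * p * (1.0 - p)


def select_samples(theta, difficulties, k):
    """Return the k sample ids whose difficulty best matches ability theta."""
    ranked = sorted(
        difficulties,
        key=lambda sid: match_score(theta, difficulties[sid]),
        reverse=True,
    )
    return ranked[:k]


pool = {"easy": -3.0, "fit": 0.1, "hard": 3.0}
print(select_samples(0.0, pool, k=1))   # early in training
print(select_samples(2.5, pool, k=1))   # after ability has grown
```

Re-estimating `theta` between training stages and re-ranking the pool is what makes the selection dynamic rather than a fixed difficulty filter.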
Details
Motivation: Existing data selection methods mostly rely on static criteria, cannot adapt to changes in model capability during training, and do not adequately consider the evolving relationship between model and data.
Method: Estimates model capability and sample difficulty with Item Response Theory (IRT), adds difficulty calibration and a capability-difficulty matching score, and takes a bidirectional view to dynamically identify the most informative samples at each learning stage.
Result: ZPD Detector significantly improves data utilization efficiency under limited data budgets, dynamically selects the most suitable training samples, and offers new ideas for training strategy design.
Conclusion: By emulating the ZPD concept from educational psychology, ZPD Detector achieves more efficient data selection and validates dynamic matching of model capability and sample difficulty.
Abstract: As the cost of training large language models continues to increase and high-quality training data become increasingly scarce, selecting high-value samples or synthesizing effective training data under limited data budgets has emerged as a critical research problem. Most existing data selection methods rely on static criteria, such as difficulty, uncertainty, or heuristics, and fail to model the evolving relationship between the model and the data. Inspired by the educational theory of the Zone of Proximal Development (ZPD), we propose ZPD Detector, a data selection framework that adopts a bidirectional perspective between models and data by explicitly modeling the alignment between sample difficulty and the model's current capability. ZPD Detector integrates difficulty calibration, model capability estimation based on Item Response Theory (IRT), and a capability-difficulty matching score to dynamically identify the most informative samples at each learning stage, improving data utilization efficiency; moreover, this dynamic matching strategy provides new insights into training strategy design. All code and data will be released after our work is accepted to support reproducible research.
[13] When Personalization Misleads: Understanding and Mitigating Hallucinations in Personalized LLMs
Zhongxiang Sun,Yi Zhan,Chenglei Shen,Weijie Yu,Xiao Zhang,Ming He,Jun Xu
Main category: cs.CL
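TL;DR: Proposes FPPS, a lightweight inference-time method that mitigates factual distortion in personalized large language models while preserving personalization, and builds PFQABench, the first benchmark to jointly evaluate factual and personalized question answering.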
Details
Motivation: Personalized LLMs can hallucinate against the facts because of a user's past preferences, harming factual reliability; the entanglement between personalization and factual representations needs to be resolved.
Method: Proposes Factuality-Preserving Personalized Steering (FPPS), which separates personalization from factual representations at inference time to reduce factual distortion, and builds the PFQABench benchmark to jointly evaluate personalized and factual question answering.
Result: Experiments across multiple LLM backbones and personalization methods show that FPPS substantially improves factual accuracy while maintaining personalized performance.
Conclusion: FPPS mitigates personalization-induced hallucinations and strengthens factuality without sacrificing personalization, offering a practical approach to safe, reliable personalized LLMs.
Abstract: Personalized large language models (LLMs) adapt model behavior to individual users to enhance user satisfaction, yet personalization can inadvertently distort factual reasoning. We show that when personalized LLMs face factual queries, they can generate answers aligned with a user's prior history rather than the objective truth, resulting in personalization-induced hallucinations that degrade factual reliability and may propagate incorrect beliefs. This arises from representational entanglement between personalization and factual representations. To address this issue, we propose Factuality-Preserving Personalized Steering (FPPS), a lightweight inference-time approach that mitigates personalization-induced factual distortions while preserving personalized behavior. We further introduce PFQABench, the first benchmark designed to jointly evaluate factual and personalized question answering under personalization. Experiments across multiple LLM backbones and personalization methods show that FPPS substantially improves factual accuracy while maintaining personalized performance.
[14] Redefining Machine Simultaneous Interpretation: From Incremental Translation to Human-Like Strategies
Qianen Zhang,Zeyu Yang,Satoshi Nakamura
Main category: cs.CL
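TL;DR: Proposes an LLM-based simultaneous machine translation framework that extends the action space with adaptive actions (sentence cutting, dropping, partial summarization, and pronominalization) to achieve low-latency translation while preserving semantic fidelity.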
Details
Motivation: Traditional simultaneous machine translation uses only READ/WRITE actions and struggles to balance quality and latency under strict real-time constraints; more flexible adaptive mechanisms are needed.
Method: Introduces four new actions (Sentence_Cut, Drop, Partial_Summarization, Pronominalization) into an LLM framework and builds training references through action-aware prompting; designs a latency-aware TTS pipeline for evaluation.
Result: On the ACL60/60 English-Chinese, English-German, and English-Japanese datasets, the method beats baselines on both semantic metrics and latency; combining Drop and Sentence_Cut in particular improves the balance between fluency and latency.
Conclusion: Extending the action space offers an effective path for LLM-driven simultaneous translation and helps narrow the gap between human and machine interpretation.
Abstract: Simultaneous Machine Translation (SiMT) requires high-quality translations under strict real-time constraints, which traditional policies with only READ/WRITE actions cannot fully address. We extend the action space of SiMT with four adaptive actions: Sentence_Cut, Drop, Partial_Summarization and Pronominalization, which enable real-time restructuring, omission, and simplification while preserving semantic fidelity. We adapt these actions in a large language model (LLM) framework and construct training references through action-aware prompting. To evaluate both quality and word-level monotonicity, we further develop a latency-aware TTS pipeline that maps textual outputs to speech with realistic timing. Experiments on the ACL60/60 English-Chinese, English-German and English-Japanese benchmarks show that our framework consistently improves semantic metrics and achieves lower delay compared to reference translations and salami-based baselines. Notably, combining Drop and Sentence_Cut leads to consistent improvements in the balance between fluency and latency. These results demonstrate that enriching the action space of LLM-based SiMT provides a promising direction for bridging the gap between human and machine interpretation.
[15] NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems
Jiayu Liu,Rui Wang,Qing Zong,Qingcheng Zeng,Tianshi Zheng,Haochen Shi,Dadi Guo,Baixuan Xu,Chunyang Li,Yangqiu Song
Main category: cs.CL
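TL;DR: Proposes NAACL, a noise-aware calibration framework that addresses LLM overconfidence caused by noisy context in retrieval-augmented generation (RAG); supervised fine-tuning on about 2K HotpotQA examples substantially improves confidence calibration both in-domain and out-of-domain.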
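Calibration in this entry is reported as expected calibration error (ECE). For readers unfamiliar with the metric, the sketch below shows the standard equal-width-bin ECE over stated confidences in [0, 1]; the bin count of 10 is a common default and an assumption here, not a setting confirmed by the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean |accuracy - confidence| per bin.

    confidences: stated confidence per answer, each in [0, 1]
    correct:     1 if the answer was right, 0 otherwise
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[idx].append((conf, hit))

    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(h for _, h in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece


# An overconfident model: says 0.9 every time but is right 3 times out of 4.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
print(round(overconfident, 2))
```

Lower is better: a model whose stated confidence matches its hit rate in every bin scores 0, while overconfidence under noisy retrieval shows up as a gap between average confidence and accuracy.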
Details
Motivation: Accurately assessing model confidence is essential when deploying LLMs in critical factual domains. In RAG settings, however, noise in the retrieved context leads to poor calibration, especially overconfidence.
Method: The authors systematically study calibration on four benchmarks and find that noisy context makes models overconfident in wrong answers. They propose NAACL Rules, build the noise-aware calibration framework NAACL, generate supervision from about 2,000 HotpotQA examples, and use supervised fine-tuning (SFT) to give models intrinsic noise awareness.
Result: NAACL improves ECE (expected calibration error) by 10.9% in-domain and 8.0% out-of-domain, significantly outperforming existing methods, without relying on a stronger teacher model.
Conclusion: NAACL offers an effective solution to noise-induced confidence bias in RAG, improving the trustworthiness and reliability of LLMs in real-world settings.
Abstract: Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance due to noisy retrieved contexts. Specifically, contradictory or irrelevant evidence tends to inflate the model's false certainty, leading to severe overconfidence. To address this, we propose NAACL Rules (Noise-AwAre Confidence CaLibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NAACL, a noise-aware calibration framework that synthesizes supervision from about 2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NAACL equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NAACL yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NAACL paves the way for both accurate and epistemically reliable LLMs.
[16] Finding the Translation Switch: Discovering and Exploiting the Task-Initiation Features in LLMs
Xinwei Wu,Heng Liu,Xiaohu Zhao,Yuqi Ren,Linlong Xu,Longyue Wang,Deyi Xiong,Weihua Luo,Kaifu Zhang
Main category: cs.CL
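TL;DR: Uses sparse autoencoders (SAEs) and a PCA-based consistency metric to identify "translation initiation" features responsible for translation in LLMs and verifies their function with causal interventions; further proposes a data selection strategy based on mechanistic difficulty that improves fine-tuning efficiency and suppresses hallucinations, with the mechanism transferring to larger models in the same family.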
Details
Motivation: To uncover the internal mechanism that lets LLMs translate without fine-tuning and to understand how their opaque internal representations support translation.
Method: Extracts features with sparse autoencoders (SAEs), recalls candidate features from co-activation patterns, and filters for functionally coherent features with a PCA-based consistency metric; verifies their role through causal interventions (amplification or ablation) and uses them to design a fine-tuning data selection strategy that prioritizes "mechanistically hard" samples.
Result: Isolates a small set of "translation initiation" features; interventions show they causally affect translation behavior; the resulting data selection method significantly improves fine-tuning data efficiency and reduces hallucinations; the mechanism transfers to larger models in the same family.
Conclusion: LLMs contain an interpretable, intervenable translation initiation mechanism; exploiting such mechanisms both deepens understanding of how models work and guides the design of more efficient, robust training methods.
Abstract: Large Language Models (LLMs) frequently exhibit strong translation abilities, even without task-specific fine-tuning. However, the internal mechanisms governing this innate capability remain largely opaque. To demystify this process, we leverage Sparse Autoencoders (SAEs) and introduce a novel framework for identifying task-specific features. Our method first recalls features that are frequently co-activated on translation inputs and then filters them for functional coherence using a PCA-based consistency metric. This framework successfully isolates a small set of **translation initiation** features. Causal interventions demonstrate that amplifying these features steers the model towards correct translation, while ablating them induces hallucinations and off-task outputs, confirming they represent a core component of the model's innate translation competency. Moving from analysis to application, we leverage this mechanistic insight to propose a new data selection strategy for efficient fine-tuning. Specifically, we prioritize training on **mechanistically hard** samples, that is, those that fail to naturally activate the translation initiation features. Experiments show this approach significantly improves data efficiency and suppresses hallucinations. Furthermore, we find these mechanisms are transferable to larger models of the same family. Our work not only decodes a core component of the translation mechanism in LLMs but also provides a blueprint for using internal model mechanisms to create more robust and efficient models. The code is available at https://github.com/flamewei123/AAAI26-translation-Initiation-Features.
[17] From Interpretability to Performance: Optimizing Retrieval Heads for Long-Context Language Models
Youmi Ma,Naoaki Okazaki
Main category: cs.CL
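TL;DR: Proposes RetMask, a method that uses retrieval heads to strengthen the long-context ability of LLMs. It generates training signals by contrasting the outputs of the normal model with those of a variant whose retrieval heads are masked; RetMask yields substantial gains on several tasks, notably on Llama-3.1, and shows that the organization of retrieval heads affects its effectiveness.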
Details
Motivation: Retrieval heads are known to extract information from context, but their potential for improving model performance is unclear; this work explores and exploits the mechanism to strengthen long-context processing.
Method: Proposes RetMask, which masks retrieval heads and contrasts the resulting outputs with the normal model's outputs to generate training signals, improving the model's use of long context.
Result: +2.28 points on HELMET at 128K, +70% on generation with citation, and +32% on passage re-ranking, with general-task performance preserved; experiments across three model families show that models with concentrated retrieval-head patterns respond more strongly.
Conclusion: Retrieval heads play a key role in long-context modeling, their organization affects the size of the gain, and mechanistic understanding can be turned into practical performance improvements.
Abstract: Advances in mechanistic interpretability have identified special attention heads, known as retrieval heads, that are responsible for retrieving information from the context. However, the role of these retrieval heads in improving model performance remains unexplored. This work investigates whether retrieval heads can be leveraged to enhance the long-context capabilities of LLMs. Specifically, we propose RetMask, a method that generates training signals by contrasting normal model outputs with those from an ablated variant in which the retrieval heads are masked. This mechanism-based approach achieves substantial improvements: +2.28 points on HELMET at 128K for Llama-3.1, with +70% gains on generation with citation and +32% on passage re-ranking, while preserving performance on general tasks. Experiments across three model families reveal that the effectiveness depends on retrieval head organization: models with concentrated patterns of retrieval heads respond strongly, while those with distributed patterns show limited gains. This mechanistic relationship validates the function of retrieval heads and demonstrates that mechanistic insights can be transformed into performance enhancements.
[18] Budget-Aware Anytime Reasoning with LLM-Synthesized Preference Data
Xuanming Zhang,Shwan Ashrafi,Aziza Mirsaidova,Amir Rezaeian,Miguel Ballesteros,Lydia B. Chilton,Zhou Yu,Dan Roth
Main category: cs.CL
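TL;DR: This paper proposes an anytime reasoning framework and the Anytime Index for evaluating the reasoning efficiency of large language models under limited computation budgets. It also enables inference-time self-improvement through preference data generated by the model itself, so that intermediate solutions keep improving under limited resources.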
Details
Motivation: Many real-world tasks must be completed within a fixed reasoning budget, so it is important to study how to improve LLM reasoning efficiency under limited computational resources. Method: An anytime reasoning framework and the Anytime Index quantify reasoning effectiveness, and LLM-synthesized preference data drive inference-time self-improvement to raise the quality of intermediate outputs. Result: Experiments on the NaturalPlan (Trip), AIME, and GPQA datasets show that the method significantly improves reasoning quality and efficiency across Grok-3, GPT-oss, GPT-4.1/4o, and LLaMA models. Conclusion: The proposed framework and metric improve the practicality and reasoning performance of LLMs in resource-constrained settings. Abstract: We study the reasoning behavior of large language models (LLMs) under limited computation budgets. In such settings, producing useful partial solutions quickly is often more practical than exhaustive reasoning, which incurs high inference costs. Many real-world tasks, such as trip planning, require models to deliver the best possible output within a fixed reasoning budget. We introduce an anytime reasoning framework and the Anytime Index, a metric that quantifies how effectively solution quality improves as reasoning tokens increase. To further enhance efficiency, we propose an inference-time self-improvement method using LLM-synthesized preference data, where models learn from their own reasoning comparisons to produce better intermediate solutions. Experiments on NaturalPlan (Trip), AIME, and GPQA datasets show consistent gains across Grok-3, GPT-oss, GPT-4.1/4o, and LLaMA models, improving both reasoning quality and efficiency under budget constraints.
[19] Spectral Characterization and Mitigation of Sequential Knowledge Editing Collapse
Chi Zhang,Mengqi Zhang,Xiaotian Ye,Runxi Cheng,Zisheng Zhou,Ying Zhou,Pengjie Ren,Zhumin Chen
Main category: cs.CL
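TL;DR: This paper proposes REVIVE, a plug-and-play framework. Spectral analysis reveals why large language models degrade under sequential knowledge editing, and REVIVE stabilizes editing by protecting the dominant singular subspace when parameters are updated.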
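The summary does not give REVIVE's exact filter, so the following is only one plausible way to write down "represent the update in the spectral basis and drop components that touch the protected subspace". The cutoff $k$, the hard zeroing rule, and the symbols $C$, $\tilde{C}$, and $\Delta\tilde{W}$ are illustrative assumptions, not the authors' definitions. With a full SVD of the original weight matrix $W \in \mathbb{R}^{m \times n}$ and a proposed update $\Delta W$:

```latex
\begin{align}
W &= U \Sigma V^{\top}, \qquad U \in \mathbb{R}^{m \times m},\ V \in \mathbb{R}^{n \times n} \text{ orthogonal} \\
C &= U^{\top} \Delta W \, V
  && \text{(update expressed in the spectral basis of } W\text{)} \\
\tilde{C}_{ij} &=
  \begin{cases}
    0, & i \le k \ \text{or}\ j \le k \\
    C_{ij}, & \text{otherwise}
  \end{cases}
  && \text{(assumed filter: drop anything touching the top-}k\text{ directions)} \\
\Delta \tilde{W} &= U \tilde{C} V^{\top}
\end{align}
```

Under this assumed rule, $U_k^{\top} \Delta\tilde{W} = 0$ and $\Delta\tilde{W} V_k = 0$, where $U_k$ and $V_k$ hold the top-$k$ singular vectors, so the filtered update has no component along the protected directions.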
Details
Motivation: To understand the mechanism by which sequential knowledge editing causes a model's general abilities to collapse, and to address the lack of theoretical grounding in existing methods. Method: Spectral analysis is used to study how the singular directions of pretrained weight matrices change; the proposed REVIVE framework represents parameter updates in the spectral basis and filters out components that would interfere with the dominant singular subspace. Result: Experiments show that REVIVE significantly improves the efficacy of sequential editing across multiple models and benchmarks, and still preserves general abilities in extreme settings with up to 20,000 edits. Conclusion: A model's general abilities are closely tied to the dominant singular directions of its weight matrices, and protecting these directions effectively mitigates the collapse caused by sequential knowledge editing. Abstract: Sequential knowledge editing in large language models often causes catastrophic collapse of the model's general abilities, especially for parameter-modifying methods. Existing approaches mitigate this issue through heuristic constraints on parameter updates, yet the mechanisms underlying such degradation remain insufficiently understood. In this work, we present a spectral analysis of sequential knowledge editing and show that a model's general abilities are closely associated with dominant singular directions of pretrained weight matrices. These directions are highly sensitive to perturbations and are progressively disrupted by repeated edits, closely tracking the collapse in both editing efficacy and general performance. Building on this insight, we propose REVIVE, a plug-and-play framework that stabilizes sequential editing by explicitly preserving the dominant singular subspace. REVIVE represents parameter updates in the spectral basis of the original weights and filters components that would interfere with the protected region. Extensive experiments across multiple models and benchmarks show that REVIVE consistently improves editing efficacy while substantially preserving general abilities under long-horizon sequential editing, including extreme settings with up to 20,000 edits.
[20] CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs
Yuanxiang Liu,Songze Li,Xiaoke Guo,Zhaoyan Gong,Qifei Zhang,Huajun Chen,Wen Zhang
Main category: cs.CL
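TL;DR: This paper proposes CoG, a training-free framework that combines a Relational Blueprint Guidance module with a Failure-Aware Refinement module to improve the reasoning stability and efficiency of knowledge-graph-augmented large language models in noisy environments.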
Details
Motivation: Existing KG-augmented LLMs exhibit cognitive rigidity when facing neighborhood noise and structural misalignment, which easily leads to reasoning stagnation and a lack of reliability and flexibility. Method: Inspired by Dual-Process Theory, CoG has two modules: (1) the Relational Blueprint Guidance module, acting as the fast, intuitive process, uses relational blueprints as soft structural constraints on the search; (2) the Failure-Aware Refinement module, acting as the analytical process, triggers evidence-conditioned reflection and controlled backtracking when reasoning stalls. Result: Experiments on three benchmarks show that CoG significantly outperforms state-of-the-art methods in both accuracy and reasoning efficiency. Conclusion: By mimicking the interplay between intuition and deliberation, CoG effectively improves the robustness and adaptability of KG-guided LLM reasoning and offers a new direction for training-free knowledge augmentation. Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities but often grapple with reliability challenges like hallucinations. While Knowledge Graphs (KGs) offer explicit grounding, existing paradigms of KG-augmented LLMs typically exhibit cognitive rigidity, applying homogeneous search strategies that render them vulnerable to instability under neighborhood noise and structural misalignment, leading to reasoning stagnation. To address these challenges, we propose CoG, a training-free framework inspired by Dual-Process Theory that mimics the interplay between intuition and deliberation. First, functioning as the fast, intuitive process, the Relational Blueprint Guidance module leverages relational blueprints as interpretable soft structural constraints to rapidly stabilize the search direction against noise. Second, functioning as the prudent, analytical process, the Failure-Aware Refinement module intervenes upon encountering reasoning impasses. It triggers evidence-conditioned reflection and executes controlled backtracking to overcome reasoning stagnation. Experimental results on three benchmarks demonstrate that CoG significantly outperforms state-of-the-art approaches in both accuracy and efficiency.
[21] Efficient Multilingual Name Type Classification Using Convolutional Networks
Davor Lauc
Main category: cs.CL
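TL;DR: Proposes Onomas-CNN X, a convolutional neural network model that classifies multilingual proper names by language and entity type; with accuracy comparable to XLM-RoBERTa, it is 46 times faster and uses significantly less energy.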
Details
Motivation: To classify proper names by language and entity type efficiently and with low energy use, especially in resource-constrained CPU environments. Method: Parallel convolution branches and depthwise-separable convolutions are combined with a hierarchical classification strategy to build the lightweight, efficient CNN model Onomas-CNN X. Result: On a large dataset covering 104 languages and four entity types, the model reaches 92.1% accuracy and processes 2,813 names per second on a single CPU core, 46 times faster than XLM-RoBERTa and with 46 times lower energy consumption. Conclusion: When sufficient training data exists, specialized CNN architectures remain competitive with large pre-trained models on focused NLP tasks, with significant speed and energy-efficiency advantages. Abstract: We present a convolutional neural network approach for classifying proper names by language and entity type. Our model, Onomas-CNN X, combines parallel convolution branches with depthwise-separable operations and hierarchical classification to process names efficiently on CPU hardware. We evaluate the architecture on a large multilingual dataset covering 104 languages and four entity types (person, organization, location, other). Onomas-CNN X achieves 92.1% accuracy while processing 2,813 names per second on a single CPU core, 46 times faster than fine-tuned XLM-RoBERTa with comparable accuracy. The model reduces energy consumption by a factor of 46 compared to transformer baselines. Our experiments demonstrate that specialized CNN architectures remain competitive with large pre-trained models for focused NLP tasks when sufficient training data exists.
[22] Integrity Shield A System for Ethical AI Use & Authorship Transparency in Assessments
Ashish Raj Shekhar,Shiven Agarwal,Priyanuj Bordoloi,Yash Shah,Tejas Anvekar,Vivek Gupta
Main category: cs.CL
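TL;DR: Integrity Shield is a document-layer watermarking system that embeds schema-aware, item-level watermarks, invisible to the human eye, into exam PDFs; it effectively prevents cheating with large language models while keeping the documents visually unchanged for humans.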
Details
Motivation: Because large language models can now parse and answer exam questions directly from PDFs, academic integrity is under serious threat. Existing watermarking techniques do not apply, since they cannot act on black-box models or require control over the model, so a way to protect assessment content without intervening in the model is urgently needed. Method: Integrity Shield embeds watermarks at the document level, encoding the watermark information into the structure of each item. Without changing the visible appearance of the PDF, it prevents LLMs from correctly answering protected items and allows a stable signature to be recovered from responses for detection. Result: On 30 exams spanning STEM, humanities, and medical reasoning, Integrity Shield achieves 91-94% exam-level blocking and 89-93% signature retrieval accuracy, showing strong prevention and reliable detection across four commercial LLMs. Conclusion: Integrity Shield is a practical and efficient solution that guards against LLM misuse in academic assessment without requiring model access, protecting the trustworthiness of grading and credentials. Abstract: Large Language Models (LLMs) can now solve entire exams directly from uploaded PDF assessments, raising urgent concerns about academic integrity and the reliability of grades and credentials. Existing watermarking techniques either operate at the token level or assume control over the model's decoding process, making them ineffective when students query proprietary black-box systems with instructor-provided documents. We present Integrity Shield, a document-layer watermarking system that embeds schema-aware, item-level watermarks into assessment PDFs while keeping their human-visible appearance unchanged. These watermarks consistently prevent MLLMs from answering shielded exam PDFs and encode stable, item-level signatures that can be reliably recovered from model or student responses. Across 30 exams spanning STEM, humanities, and medical reasoning, Integrity Shield achieves exceptionally high prevention (91-94% exam-level blocking) and strong detection reliability (89-93% signature retrieval) across four commercial MLLMs. Our demo showcases an interactive interface where instructors upload an exam, preview watermark behavior, and inspect pre/post AI performance & authorship evidence.
[23] The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora
Taja Kuzman Pungeršek,Peter Rupnik,Vít Suchomel,Nikola Ljubešić
Main category: cs.CL
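TL;DR: This paper presents the CLASSLA-web 2.0 corpora, built by continuously crawling the national top-level domains of South Slavic and related countries; the larger collection covers 7 languages and 17 billion words and adds automatic topic annotation. However, only 20% of the content overlaps after re-crawling two years later, and web quality has degraded, with markedly more machine-generated content.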
Details
Motivation: To obtain large-scale text data for less-resourced languages, building on the earlier successful crawl of South Slavic languages by establishing sustainable infrastructure for iterative national top-level domain crawling. Method: A continuous crawling infrastructure iteratively crawls the national top-level domains of South Slavic and related languages; the new corpora are automatically annotated with topic labels and with multilingual tokenization and syntactic annotation. Result: The CLASSLA-web 2.0 corpora are released, with 17.0 billion words in 38.1 million texts across 7 languages; only about 20% of the texts overlap with version 1.0; the share of machine-generated content has clearly risen and web content quality has degraded. Conclusion: Continuous crawling effectively expands corpora for low-resource languages, but the degradation of web content calls for vigilance, and content quality filtering and evaluation need to be strengthened in future work. Abstract: Crawling national top-level domains has proven to be highly effective for collecting texts in less-resourced languages. This approach has been recently used for South Slavic languages and resulted in the largest general corpora for this language group: the CLASSLA-web 1.0 corpora. Building on this success, we established a continuous crawling infrastructure for iterative national top-level domain crawling across South Slavic and related webs. We present the first outcome of this crawling infrastructure: the CLASSLA-web 2.0 corpus collection, with substantially larger web corpora containing 17.0 billion words in 38.1 million texts in seven languages: Bosnian, Bulgarian, Croatian, Macedonian, Montenegrin, Serbian, and Slovenian. In addition to genre categories, the new version is also automatically annotated with topic labels. Comparing CLASSLA-web 2.0 with its predecessor reveals that only one-fifth of the texts overlap, showing that re-crawling after just two years yields largely new content. However, while the new web crawls bring growing gains, we also notice growing pains: a manual inspection of top domains reveals a visible degradation of web content, as machine-generated sites now contribute a significant portion of texts.
[24] DOREMI: Optimizing Long Tail Predictions in Document-Level Relation Extraction
Laura Menotti,Stefano Marchesin,Gianmaria Silvello
Main category: cs.CL
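TL;DR: This paper proposes DOREMI, a framework that strengthens underrepresented relations in document-level relation extraction through a small number of targeted manual annotations, mitigating the long-tail distribution problem and improving generalization on rare relations.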
Details
Motivation: Document-level relation extraction is challenged by cross-sentence context dependence and the long-tail distribution of relation types, with rare relations in particular lacking sufficient training examples. Method: DOREMI iteratively selects the most informative examples for a small amount of manual annotation, strengthening how existing DocRE models learn long-tail relations without relying on large-scale noisy data or heuristic denoising. Result: DOREMI mitigates long-tail bias, improves generalization on rare relations, applies to any existing DocRE model, and scales well. Conclusion: Through minimal but precise annotation, DOREMI improves training efficiency and model robustness, providing a scalable approach to the long-tail problem in document-level relation extraction. Abstract: Document-Level Relation Extraction (DocRE) presents significant challenges due to its reliance on cross-sentence context and the long-tail distribution of relation types, where many relations have scarce training examples. In this work, we introduce DOcument-level Relation Extraction optiMizing the long taIl (DOREMI), an iterative framework that enhances underrepresented relations through minimal yet targeted manual annotations. Unlike previous approaches that rely on large-scale noisy data or heuristic denoising, DOREMI actively selects the most informative examples to improve training efficiency and robustness. DOREMI can be applied to any existing DocRE model and is effective at mitigating long-tail biases, offering a scalable solution to improve generalization on rare relations.
[25] T$^\star$: Progressive Block Scaling for MDM Through Trajectory Aware RL
Hanchen Xia,Baoyou Chen,Yutang Ge,Guojiang Zhao,Siyu Zhu
Main category: cs.CL
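TL;DR: Proposes T$^\star$, a TraceRL-based training curriculum for progressive block-size scaling in masked diffusion language models, enabling highly parallel decoding with little performance loss.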
Details
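Motivation: To increase the decoding parallelism of masked diffusion language models while maintaining performance on math reasoning tasks. Method: The T$^\star$ training curriculum, built on TraceRL, starts from an AR-initialized small-block MDM and transitions smoothly to larger blocks. Result: T$^\star$ effectively supports training with larger blocks, enabling highly parallel decoding with little degradation on math reasoning benchmarks; it can also converge to an alternative decoding schedule $\hat{\rm S}$ with comparable performance. Conclusion: T$^\star$ provides an efficient block-scaling scheme for MDMs that balances parallelism and performance. Abstract: We present T$^\star$, a simple \textsc{TraceRL}-based training curriculum for progressive block-size scaling in masked diffusion language models (MDMs). Starting from an AR-initialized small-block MDM, T$^\star$ transitions smoothly to larger blocks, enabling higher-parallelism decoding with minimal performance degradation on math reasoning benchmarks. Moreover, further analysis suggests that T$^\star$ can converge to an alternative decoding schedule $\hat{\rm S}$ that achieves comparable performance.
[26] MultiCaption: Detecting disinformation using multilingual visual claims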
Rafael Martins Frade,Rrubaa Panchendrarajan,Arkaitz Zubiaga
Main category: cs.CL
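TL;DR: This paper introduces the MultiCaption dataset for detecting contradictions between visual claims in multilingual, multimodal settings, establishes baselines through experiments with several model types, and demonstrates its potential for building multilingual fact-checking systems without machine translation.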
Details
Motivation: Existing automated fact-checking methods are limited by the scarcity of datasets that reflect real-world complexity, especially in multilingual and multimodal scenarios. Method: MultiCaption is built with 11,088 visual claims in 64 languages; pairs of claims about the same image or video are labeled through multiple annotation strategies as contradictory or not, and experiments are run with transformer-based models, natural language inference models, and large language models. Result: MultiCaption proves more challenging than standard NLI tasks and requires task-specific fine-tuning for good performance; multilingual training and testing bring gains, showing the dataset's potential for multilingual fact-checking. Conclusion: MultiCaption is a valuable resource for disinformation detection in multilingual, multimodal settings and supports the development of effective multilingual fact-checking systems that do not rely on machine translation. Abstract: Online disinformation poses an escalating threat to society, driven increasingly by the rapid spread of misleading content across both multimedia and multilingual platforms. While automated fact-checking methods have advanced in recent years, their effectiveness remains constrained by the scarcity of datasets that reflect these real-world complexities. To address this gap, we first present MultiCaption, a new dataset specifically designed for detecting contradictions in visual claims. Pairs of claims referring to the same image or video were labeled through multiple strategies to determine whether they contradict each other. The resulting dataset comprises 11,088 visual claims in 64 languages, offering a unique resource for building and evaluating misinformation-detection systems in truly multimodal and multilingual environments. We then provide comprehensive experiments using transformer-based architectures, natural language inference models, and large language models, establishing strong baselines for future research. The results show that MultiCaption is more challenging than standard NLI tasks, requiring task-specific finetuning for strong performance. Moreover, the gains from multilingual training and testing highlight the dataset's potential for building effective multilingual fact-checking pipelines without relying on machine translation.
[27] Language of Thought Shapes Output Diversity in Large Language Models
Shaoyang Xu,Wenxuan Zhang
Main category: cs.CL
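TL;DR: This paper proposes improving output diversity by changing a large language model's "language of thought". Thinking in languages more distant from English significantly increases output diversity, and sampling across multiple thinking languages yields compositional gains, broadening the model's cultural coverage and value pluralism.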
Details
Motivation: To explore new mechanisms that affect LLM output diversity, in particular the influence of language on the thinking process, in order to strengthen model creativity and cultural inclusiveness. Method: A multilingual thinking framework compares Single-Language Sampling and Mixed-Language Sampling strategies and, with outputs held to English, evaluates how different thinking languages affect output diversity. Result: Using non-English thinking languages consistently increases output diversity, and the farther a language is from English in the thinking space, the larger the gain; multilingual sampling has compositional effects that raise diversity further. Conclusion: The language of thought is an effective, structural lever for controlling output diversity; multilingual thinking sampling raises the diversity ceiling and also promotes alignment with plural cultures and values. Abstract: Output diversity is crucial for Large Language Models as it underpins pluralism and creativity. In this work, we reveal that controlling the language used during model thinking (the language of thought) provides a novel and structural source of output diversity. Our preliminary study shows that different thinking languages occupy distinct regions in a model's thinking space. Based on this observation, we study two repeated sampling strategies under multilingual thinking, Single-Language Sampling and Mixed-Language Sampling, and conduct diversity evaluation on outputs that are controlled to be in English, regardless of the thinking language used. Across extensive experiments, we demonstrate that switching the thinking language from English to non-English languages consistently increases output diversity, with a clear and consistent positive correlation such that languages farther from English in the thinking space yield larger gains. We further show that aggregating samples across multiple thinking languages yields additional improvements through compositional effects, and that scaling sampling with linguistic heterogeneity expands the model's diversity ceiling. Finally, we show that these findings translate into practical benefits in pluralistic alignment scenarios, leading to broader coverage of cultural knowledge and value orientations in LLM outputs. Our code is publicly available at https://github.com/iNLP-Lab/Multilingual-LoT-Diversity.
[28] FactCorrector: A Graph-Inspired Approach to Long-Form Factuality Correction of Large Language Models
Javier Carnerero-Cano,Massimiliano Pronesti,Radu Marinescu,Tigran Tchrakian,James Barry,Jasmina Gajcin,Yufang Hou,Alessandra Pascale,Elizabeth Daly
Main category: cs.CL
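TL;DR: This paper proposes FactCorrector, a post-hoc factuality correction method for LLMs that adapts across domains without retraining, and builds the VELI5 benchmark, which contains systematically injected errors and ground-truth corrections, for evaluation. Experiments show that the method significantly improves factual accuracy while preserving relevance.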
Details
Motivation: Large language models (LLMs) are widely used in knowledge-intensive applications but often generate factually incorrect responses; addressing this requires an effective feedback-driven post-hoc correction mechanism. Method: FactCorrector uses structured feedback on the factuality of the original response to produce a correction; the VELI5 benchmark dataset, containing systematically injected factual errors and the corresponding ground-truth corrections, supports rigorous evaluation of correction methods. Result: On VELI5 and several popular long-form factuality datasets, FactCorrector significantly improves factual precision while preserving response relevance, outperforming strong baselines. Conclusion: FactCorrector is an effective and general post-hoc correction method that improves LLM factual accuracy without retraining, offering a viable path to reliable knowledge-intensive applications. Abstract: Large language models (LLMs) are widely used in knowledge-intensive applications but often generate factually incorrect responses. A promising approach to rectify these flaws is correcting LLMs using feedback. Therefore, in this paper, we introduce FactCorrector, a new post-hoc correction method that adapts across domains without retraining and leverages structured feedback about the factuality of the original response to generate a correction. To support rigorous evaluations of factuality correction methods, we also develop the VELI5 benchmark, a novel dataset containing systematically injected factual errors and ground-truth corrections. Experiments on VELI5 and several popular long-form factuality datasets show that the FactCorrector approach significantly improves factual precision while preserving relevance, outperforming strong baselines. We release our code at https://ibm.biz/factcorrector.
[29] How DDAIR you? Disambiguated Data Augmentation for Intent Recognition
Galo Castillo-López,Alexis Lombard,Nasredine Semmar,Gaël de Chalendar
Main category: cs.CL
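TL;DR: Proposes DDAIR, which uses sentence embeddings to detect and regenerate ambiguous examples among LLM-generated intent recognition data, with the aim of improving classification in low-resource scenarios.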
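The detection step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each intent is represented by the centroid of a few seed-example embeddings, and it uses hand-made 2-D vectors in place of real Sentence Transformers outputs. The names `find_ambiguous`, `seeds`, and `synthetic` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def find_ambiguous(synthetic, seed_embeddings):
    """Flag synthetic examples that sit closer to another intent than to their target.

    synthetic: list of (text, target_intent, embedding)
    seed_embeddings: dict mapping intent -> list of seed embeddings
    Assumption: an intent is summarized by the centroid of its seed embeddings.
    """
    centroids = {intent: centroid(vecs) for intent, vecs in seed_embeddings.items()}
    flagged = []
    for text, target, emb in synthetic:
        sims = {intent: cosine(emb, c) for intent, c in centroids.items()}
        closest = max(sims, key=sims.get)
        if closest != target:
            flagged.append((text, target, closest))
    return flagged


# Toy 2-D embeddings standing in for sentence-embedding vectors (illustrative only).
seeds = {
    "book_flight": [[1.0, 0.1], [0.9, 0.0]],
    "cancel_flight": [[0.1, 1.0], [0.0, 0.9]],
}
synthetic = [
    ("I need a ticket to Rome", "book_flight", [0.95, 0.05]),
    ("Please drop my reservation", "book_flight", [0.2, 0.9]),
]
flagged = find_ambiguous(synthetic, seeds)
print(flagged)
```

In an iterative setup, the flagged examples would be sent back to the LLM for re-generation and checked again; that loop is omitted here.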
Details
Motivation: When used for data augmentation, LLMs may generate examples that are ambiguous across classes, which hurts intent recognition, especially in low-resource scenarios and where intent boundaries are fuzzy. Method: Sentence Transformers detect generated examples whose semantics do not match the target intent, and an iterative re-generation mechanism corrects the ambiguous examples. Result: Sentence embeddings effectively help identify ambiguous examples and regenerate less ambiguous ones, and the results suggest potential to improve classification performance, especially where intents are loosely or broadly defined. Conclusion: Sentence embeddings combined with iterative re-generation can mitigate class ambiguity in LLM-generated data, offering an efficient data augmentation strategy for low-resource intent recognition. Abstract: Large Language Models (LLMs) are effective for data augmentation in classification tasks like intent detection. In some cases, they inadvertently produce examples that are ambiguous with regard to untargeted classes. We present DDAIR (Disambiguated Data Augmentation for Intent Recognition) to mitigate this problem. We use Sentence Transformers to detect ambiguous class-guided augmented examples generated by LLMs for intent recognition in low-resource scenarios. We identify synthetic examples that are semantically more similar to another intent than to their target one. We also provide an iterative re-generation method to mitigate such ambiguities. Our findings show that sentence embeddings effectively help to (re)generate less ambiguous examples, and suggest promising potential to improve classification performance in scenarios where intents are loosely or broadly defined.
[30] Reasoning in Trees: Improving Retrieval-Augmented Generation for Multi-Hop Question Answering
Yuling Shi,Maolin Sun,Zijun Liu,Mo Yang,Yixiong Fang,Tianran Sun,Xiaodong Gu
Main category: cs.CL
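TL;DR: Proposes RT-RAG, a new hierarchical framework for complex multi-hop question answering that decomposes questions into reasoning trees and uses a bottom-up traversal strategy to reduce error propagation, significantly outperforming existing methods.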
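The bottom-up idea can be shown with a toy tree. This is a minimal sketch under strong assumptions, not RT-RAG itself: the `Node` class, the `{0}` placeholder convention for unknown entities, and the dictionary-backed `toy_retrieve` are all hypothetical stand-ins for the LLM-driven decomposition, query rewriting, and retrieval in the actual system.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One sub-question; placeholders such as {0} are filled by child answers."""
    query: str
    children: list = field(default_factory=list)


def toy_retrieve(query, knowledge):
    """Stand-in for retrieval plus answer extraction (hypothetical helper)."""
    return knowledge.get(query, "unknown")


def solve(node, knowledge):
    """Post-order traversal: answer the children first, then rewrite this query."""
    child_answers = [solve(child, knowledge) for child in node.children]
    rewritten = node.query.format(*child_answers)
    return toy_retrieve(rewritten, knowledge)


# Toy knowledge base standing in for a document index.
knowledge = {
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London",
}

# "Where was the director of Inception born?" as a two-level reasoning tree.
tree = Node("Where was {0} born?", [Node("Who directed Inception?")])
answer = solve(tree, knowledge)
print(answer)
```

Because a parent query is only rewritten after its children are resolved, each retrieval call sees a fully specified question, which is the intuition behind limiting error propagation.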
Details
Motivation: Existing iterative methods are prone to inaccurate query decomposition and error propagation during multi-step retrieval, which weakens reasoning coherence, so a more structured approach is needed to improve multi-hop QA. Method: RT-RAG systematically decomposes multi-hop questions into explicit reasoning trees, combining structured entity analysis with consensus-based tree selection, and applies bottom-up query rewriting and evidence collection. Result: RT-RAG outperforms the best existing methods by 7.0% F1 and 6.0% EM. Conclusion: Through structured reasoning trees and iterative refinement, RT-RAG effectively improves the accuracy and robustness of complex multi-hop QA. Abstract: Retrieval-Augmented Generation (RAG) has demonstrated significant effectiveness in enhancing large language models (LLMs) for complex multi-hop question answering (QA). For multi-hop QA tasks, current iterative approaches predominantly rely on LLMs to self-guide and plan multi-step exploration paths during retrieval, leading to substantial challenges in maintaining reasoning coherence across steps from inaccurate query decomposition and error propagation. To address these issues, we introduce Reasoning Tree Guided RAG (RT-RAG), a novel hierarchical framework for complex multi-hop QA. RT-RAG systematically decomposes multi-hop questions into explicit reasoning trees, minimizing inaccurate decomposition through structured entity analysis and consensus-based tree selection that clearly separates core queries, known entities, and unknown entities. Subsequently, a bottom-up traversal strategy employs iterative query rewriting and refinement to collect high-quality evidence, thereby mitigating error propagation. Comprehensive experiments show that RT-RAG substantially outperforms state-of-the-art methods by 7.0% F1 and 6.0% EM, demonstrating the effectiveness of RT-RAG in complex multi-hop QA.
[31] One LLM to Train Them All: Multi-Task Learning Framework for Fact-Checking
Malin Astrid Larsson,Harald Fosen Grunnaleite,Vinay Setty
Main category: cs.CL
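TL;DR: This paper proposes multi-task learning (MTL) to make LLM-based automated fact-checking more efficient: a single small open-weight model jointly performs several subtasks, cutting costs while achieving substantial performance gains.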
Details
Motivation: Large proprietary models perform well, but closed weights, high cost, and complexity limit their sustainability; fine-tuning several small models separately, in turn, leads to high maintenance costs. A more efficient and economical solution is needed. Method: Multi-task learning (MTL) is applied to small decoder-only LLMs (e.g., Qwen3-4b) with three strategies: classification heads, causal language modeling heads, and instruction tuning; performance is evaluated across model sizes and task orders. Result: Multi-task models do not universally surpass single-task baselines, but relative to zero-/few-shot settings they achieve relative gains of up to 44%, 54%, and 31% on claim detection, evidence re-ranking, and stance detection, respectively. Conclusion: Multi-task learning is a viable way to build efficient, low-cost automated fact-checking systems, and the paper offers practitioners empirically grounded guidelines for applying it. Abstract: Large language models (LLMs) are reshaping automated fact-checking (AFC) by enabling unified, end-to-end verification pipelines rather than isolated components. While large proprietary models achieve strong performance, their closed weights, complexity, and high costs limit sustainability. Fine-tuning smaller open weight models for individual AFC tasks can help but requires multiple specialized models resulting in high costs. We propose \textbf{multi-task learning (MTL)} as a more efficient alternative that fine-tunes a single model to perform claim detection, evidence ranking, and stance detection jointly. Using small decoder-only LLMs (e.g., Qwen3-4b), we explore three MTL strategies: classification heads, causal language modeling heads, and instruction-tuning, and evaluate them across model sizes, task orders, and standard non-LLM baselines. While multitask models do not universally surpass single-task baselines, they yield substantial improvements, achieving up to \textbf{44\%}, \textbf{54\%}, and \textbf{31\%} relative gains for claim detection, evidence re-ranking, and stance detection, respectively, over zero-/few-shot settings. Finally, we also provide practical, empirically grounded guidelines to help practitioners apply MTL with LLMs for automated fact-checking.
[32] Membership Inference on LLMs in the Wild
Jiatong Yi,Yanyang Li
Main category: cs.CL
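TL;DR: This paper proposes SimMIA, a new membership inference attack framework for large language models in the black-box setting, and builds a new benchmark dataset, WikiMIA-25; experiments show state-of-the-art performance using only generated text.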
Details
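Motivation: Most existing membership inference attacks rely on model internals (such as logits) and generalize poorly in strict black-box settings where only generated text is available, so an effective MIA method for text-only output is needed. Method: The SimMIA framework uses an advanced sampling strategy and scoring mechanism to perform membership inference from generated text alone, without access to the model's internal state; a new benchmark, WikiMIA-25, is built to evaluate MIA performance on modern proprietary LLMs. Result: SimMIA achieves state-of-the-art attack performance in the black-box setting, rivals baselines that depend on internal model information, and generalizes better in cross-domain scenarios. Conclusion: SimMIA effectively addresses the challenge of membership inference from generated text alone and provides a practical, powerful tool for auditing the data privacy risks of large language models. Abstract: Membership Inference Attacks (MIAs) act as a crucial auditing tool for the opaque training data of Large Language Models (LLMs). However, existing techniques predominantly rely on inaccessible model internals (e.g., logits) or suffer from poor generalization across domains in strict black-box settings where only generated text is available. In this work, we propose SimMIA, a robust MIA framework tailored for this text-only regime by leveraging an advanced sampling strategy and scoring mechanism. Furthermore, we present WikiMIA-25, a new benchmark curated to evaluate MIA performance on modern proprietary LLMs. Experiments demonstrate that SimMIA achieves state-of-the-art results in the black-box setting, rivaling baselines that exploit internal model information.
[33] F-Actor: Controllable Conversational Behaviour in Full-Duplex Models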
Maike Züfle,Ondrej Klejch,Nicholas Sanders,Jan Niehues,Alexandra Birch,Tsz Kin Lam
Main category: cs.CL
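TL;DR: Proposes the first open, instruction-following full-duplex conversational speech model, which trains efficiently on just 2,000 hours of data and supports flexible control over speaker voice, topic, and conversational behaviour.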
Details
Motivation: Existing spoken conversational systems lack customizable conversational behaviour that adapts dynamically to the context, which limits their naturalness and usability. Method: The audio encoder is kept frozen and only the language model is fine-tuned, using a single-stage training protocol that trains efficiently on a relatively small amount of data. Result: The model follows instructions to control voice characteristics, topic, and conversational behaviour (such as interruptions and backchanneling), with low training resource requirements. Conclusion: The approach lowers the barrier to training controllable full-duplex speech systems and supports reproducible open research. Abstract: Spoken conversational systems require more than accurate speech generation to have human-like conversations: to feel natural and engaging, they must produce conversational behaviour that adapts dynamically to the context. Current spoken conversational systems, however, rarely allow such customization, limiting their naturalness and usability. In this work, we present the first open, instruction-following full-duplex conversational speech model that can be trained efficiently under typical academic resource constraints. By keeping the audio encoder frozen and finetuning only the language model, our model requires just 2,000 hours of data, without relying on large-scale pretraining or multi-stage optimization. The model can follow explicit instructions to control speaker voice, conversation topic, conversational behaviour (e.g., backchanneling and interruptions), and dialogue initiation. We propose a single-stage training protocol and systematically analyze design choices. Both the model and training code will be released to enable reproducible research on controllable full-duplex speech systems.
[34] Idea First, Code Later: Disentangling Problem Solving from Code Generation in Evaluating LLMs for Competitive Programming
Sama Hadhoud,Alaa Elsetohy,Frederikus Hudi,Jan Christian Blaise Cruz,Steven Halim,Alham Fikri Aji
Main category: cs.CL
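TL;DR: This paper argues that natural-language editorials should be central in competitive programming, to separate algorithmic reasoning from code implementation; evaluating 19 LLMs shows that bottlenecks in implementation and algorithm design remain even with gold editorials.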
Details
Motivation: Existing evaluations conflate algorithmic reasoning with code implementation and cannot accurately measure a model's problem-solving ability, so a sounder way to separate the two is needed. Method: A new dataset of 83 ICPC-style problems with gold editorials and full test suites is introduced; models generate an editorial before generating code, and reasoning errors are analyzed with expert annotations and an LLM-as-a-judge protocol. Result: Some LLMs solve more problems when they write an editorial before the code, and gold editorials bring larger gains; models still struggle markedly with code implementation and with designing complete, correct algorithms, and a gap between generated and gold editorials is observed. Conclusion: Future LLM evaluations should explicitly separate problem solving from implementation; natural-language editorials are a key tool for assessing algorithmic reasoning. Abstract: Large Language Models (LLMs) increasingly succeed on competitive programming problems, yet existing evaluations conflate algorithmic reasoning with code-level implementation. We argue that competitive programming is fundamentally a problem-solving task and propose centering natural-language editorials in both solution generation and evaluation. Generating an editorial prior to code improves solve rates for some LLMs, with substantially larger gains when using expertly written gold editorials. However, even with gold editorials, models continue to struggle with implementation, while the gap between generated and gold editorials reveals a persistent problem-solving bottleneck in specifying correct and complete algorithms. Beyond pass/fail metrics, we diagnose reasoning errors by comparing model-generated editorials to gold standards using expert annotations and validate an LLM-as-a-judge protocol for scalable evaluation. We introduce a dataset of 83 ICPC-style problems with gold editorials and full test suites, and evaluate 19 LLMs, arguing that future benchmarks should explicitly separate problem solving from implementation.
[35] Neural Chain-of-Thought Search: Searching the Optimal Reasoning Path to Enhance Large Language Models
Guoming Ling,Zhongzhan Huang,Yupei Lin,Junxin Li,Shanshan Zhong,Hefeng Wu,Liang Lin
Main category: cs.CL
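TL;DR: This paper proposes Neural Chain-of-Thought Search (NCoTS), which models reasoning as a dynamic search for the optimal thinking strategy, addressing the redundant, foresight-free reasoning paths of current large models. Using a dual-factor heuristic that optimizes both correctness and computational cost, it raises accuracy by over 3.5% while cutting generation length by over 22% across multiple benchmarks.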
Details
Motivation: Existing chain-of-thought methods generate steps one at a time without global planning and easily fall into suboptimal, redundant reasoning paths, so a dynamic search mechanism with foresight is needed to find better paths. Method: Reasoning is reformulated as a dynamic search for the optimal thinking strategy; the solution space is characterized quantitatively, and candidate reasoning operators are evaluated with a dual-factor heuristic that balances correctness and computational cost, steering the search toward sparse but superior reasoning paths. Result: NCoTS achieves a Pareto improvement across multiple reasoning benchmarks: average accuracy rises by over 3.5% while generation length falls by over 22%. Conclusion: By introducing a search mechanism with foresight, NCoTS finds reasoning paths that are both more accurate and more concise, offering a new paradigm for efficient LLM reasoning. Abstract: Chain-of-Thought reasoning has significantly enhanced the problem-solving capabilities of Large Language Models. Unfortunately, current models generate reasoning steps sequentially without foresight, often becoming trapped in suboptimal reasoning paths with redundant steps. In contrast, we introduce Neural Chain-of-Thought Search (NCoTS), a framework that reformulates reasoning as a dynamic search for the optimal thinking strategy. By quantitatively characterizing the solution space, we reveal the existence of sparse superior reasoning paths that are simultaneously more accurate and concise than standard outputs. Our method actively navigates towards these paths by evaluating candidate reasoning operators using a dual-factor heuristic that optimizes for both correctness and computational cost. Consequently, NCoTS achieves a Pareto improvement across diverse reasoning benchmarks, boosting accuracy by over 3.5% while reducing generation length by over 22%. Our code and data are available at https://github.com/MilkThink-Lab/Neural-CoT-Search.
[36] How Much Would a Clinician Edit This Draft? Evaluating LLM Alignment for Patient Message Response Drafting
Parker Seegmiller,Joseph Gatto,Sarah E. Greer,Ganza Belise Isingizwe,Rohan Ray,Timothy E. Burdick,Sarah Masud Preum
Main category: cs.CL
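TL;DR: This study examines the use of large language models (LLMs) to help clinicians draft replies to patient portal messages. It proposes a new thematic taxonomy and an evaluation framework that assess how well LLM drafts align with clinicians in terms of editing load, and it applies several adaptation techniques; the results indicate that LLMs must be adapted to individual clinician preferences for reliable use.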
Details
Motivation: Although LLMs show potential for drafting replies to patient messages, it is unclear whether they actually reduce clinician workload, and they risk misalignment with individual clinicians' styles and needs, so systematic evaluation and personalized adaptation are required. Method: A taxonomy of thematic elements in clinician responses is built, a new framework for assessing clinician editing load at the content and theme levels is proposed, an expert-annotated dataset is released, and local and commercial LLMs are evaluated at scale with adaptation techniques including thematic prompting, retrieval-augmented generation, supervised fine-tuning, and direct preference optimization. Result: LLMs draft some thematic elements well but align poorly on themes that require asking patients questions to elicit further information; theme-driven adaptation improves alignment on most themes; substantial epistemic uncertainty remains overall. Conclusion: To ensure reliable and responsible use of LLMs in patient-clinician communication workflows, they must be adapted to individual clinician preferences. Abstract: Large language models (LLMs) show promise in drafting responses to patient portal messages, yet their integration into clinical workflows raises various concerns, including whether they would actually save clinicians time and effort in their portal workload. We investigate LLM alignment with individual clinicians through a comprehensive evaluation of the patient message response drafting task. We develop a novel taxonomy of thematic elements in clinician responses and propose a novel evaluation framework for assessing clinician editing load of LLM-drafted responses at both content and theme levels. We release an expert-annotated dataset and conduct large-scale evaluations of local and commercial LLMs using various adaptation techniques including thematic prompting, retrieval-augmented generation, supervised fine-tuning, and direct preference optimization. Our results reveal substantial epistemic uncertainty in aligning LLM drafts with clinician responses. While LLMs demonstrate capability in drafting certain thematic elements, they struggle with clinician-aligned generation in other themes, particularly question asking to elicit further information from patients. Theme-driven adaptation strategies yield improvements across most themes. Our findings underscore the necessity of adapting LLMs to individual clinician preferences to enable reliable and responsible use in patient-clinician communication workflows.
[37] Reward Modeling for Scientific Writing Evaluation
Furkan Şahinuç,Subhabrata Dutta,Iryna Gurevych
Main category: cs.CL
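TL;DR: Proposes low-cost, open-source reward models for scientific writing evaluation, using a two-stage training framework to improve LLM-based evaluation across multiple tasks and dynamic criteria.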
Details
Motivation: Existing LLM-based evaluation methods perform poorly on scientific writing, which demands deep domain knowledge and multi-faceted criteria, and fine-tuning a model for every task is costly and impractical.
Method: Designs a two-stage training framework that first optimizes scientific evaluation preferences and then strengthens reasoning; uses a multi-aspect evaluation design and joint training across tasks to enable fine-grained assessment and adaptation to dynamic scoring rubrics.
Result: Experiments show that the approach strongly improves LLM-based scientific writing evaluation; the models generalize across tasks and to unseen evaluation settings and can be reused without task-specific retraining.
Conclusion: The proposed open-source reward models evaluate diverse scientific writing tasks efficiently and reliably, generalize well, and suit low-resource settings.
Abstract: Scientific writing is an expert-domain task that demands deep domain knowledge, task-specific requirements and reasoning capabilities that leverage the domain knowledge to satisfy the task specifications. While scientific text generation has been widely studied, its evaluation remains a challenging and open problem. It is critical to develop models that can be reliably deployed for evaluating diverse open-ended scientific writing tasks while adhering to their distinct requirements. However, existing LLM-based judges and reward models are primarily optimized for general-purpose benchmarks with fixed scoring rubrics and evaluation criteria. Consequently, they often fail to reason over sparse knowledge of scientific domains when interpreting task-dependent and multi-faceted criteria. Moreover, fine-tuning for each individual task is costly and impractical for low-resource settings. To bridge these gaps, we propose cost-efficient, open-source reward models tailored for scientific writing evaluation. We introduce a two-stage training framework that initially optimizes scientific evaluation preferences and then refines reasoning capabilities. Our multi-aspect evaluation design and joint training across diverse tasks enable fine-grained assessment and robustness to dynamic criteria and scoring rubrics. Experimental analysis shows that our training regime strongly improves LLM-based scientific writing evaluation. Our models generalize effectively across tasks and to previously unseen scientific writing evaluation settings, allowing a single trained evaluator to be reused without task-specific retraining.
[38] Evaluating LLM Behavior in Hiring: Implicit Weights, Fairness Across Groups, and Alignment with Human Preferences
Morgane Hoffmann,Emma Jouffroy,Warren Jouanneau,Marc Palyart,Charles Pebereau
Main category: cs.CL
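TL;DR: Proposes a framework for evaluating how large language models (LLMs) weigh different matching criteria in hiring decisions. Using real freelancer data and a full factorial design, it analyzes how much weight the LLM gives to productivity signals such as skills and experience and how this varies across demographic groups, finding minimal average discrimination against minority groups but intersectional bias.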
Details
Motivation: To study whether LLM hiring decisions are in line with economic principles, recruiter preferences, or societal norms, and to make explicit how the model assigns weight to different attributes.
Method: Builds synthetic datasets from real data of a major European freelance marketplace and applies a full factorial design to estimate the weight the LLM places on each criterion when evaluating freelancer-project fit, then analyzes how context and demographic subgroups affect those weights.
Result: The LLM prioritizes core productivity signals such as skills and experience but gives certain features weight beyond their explicit matching value; average discrimination against minority groups is minimal, but intersectional analysis shows that signal weights differ between groups.
Conclusion: The framework reveals the LLM's decision logic in hiring and offers an extensible path for assessing alignment between model and human decisions.
Abstract: General-purpose Large Language Models (LLMs) show significant potential in recruitment applications, where decisions require reasoning over unstructured text, balancing multiple criteria, and inferring fit and competence from indirect productivity signals. Yet, it is still uncertain how LLMs assign importance to each attribute and whether such assignments are in line with economic principles, recruiter preferences or broader societal norms. We propose a framework to evaluate an LLM's decision logic in recruitment, by drawing on established economic methodologies for analyzing human hiring behavior. We build synthetic datasets from real freelancer profiles and project descriptions from a major European online freelance marketplace and apply a full factorial design to estimate how an LLM weighs different match-relevant criteria when evaluating freelancer-project fit. We identify which attributes the LLM prioritizes and analyze how these weights vary across project contexts and demographic subgroups. Finally, we explain how a comparable experimental setup could be implemented with human recruiters to assess alignment between model and human decisions. Our findings reveal that the LLM weighs core productivity signals, such as skills and experience, but interprets certain features beyond their explicit matching value. While showing minimal average discrimination against minority groups, intersectional effects reveal that productivity signals carry different weights between demographic groups.
[39] Relational Linearity is a Predictor of Hallucinations
Yuetian Lu,Yihong Liu,Hinrich Schütze
Main category: cs.CL
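TL;DR: Finds that large language models readily hallucinate on questions about synthetic entities; for linear relations in particular, whose facts are stored more abstractly, models have more difficulty assessing their own knowledge, leading to higher hallucination rates.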
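The correlation analysis this entry reports (per-relation linearity against per-relation hallucination rate) can be illustrated with a minimal Pearson correlation sketch. The function name `pearson_r` and all numbers below are hypothetical placeholders, not data from the paper; the snippet only shows how such an $r$ value is computed.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for six relations (illustration only, not from the paper).
linearity = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85]
hallucination_rate = [0.20, 0.35, 0.30, 0.60, 0.65, 0.80]
r = pearson_r(linearity, hallucination_rate)
print(f"r = {r:.2f}")
```

With only six relations, as in the dataset described, a coefficient like this rests on six points per model, so it is best read as suggestive evidence.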
Details
Motivation: To explore why large language models hallucinate when asked about unknown synthetic entities, and in particular how relational linearity affects knowledge storage and self-assessment.
Method: Builds SyntHal, a dataset of 6000 synthetic entities, measures the hallucination rate and the linearity ($\Delta\cos$) of six relations, and runs experiments on four models.
Result: Finds a strong correlation between relational linearity and hallucination rate ($r \in [0.78, 0.82]$), suggesting that linear relations make it harder for a model to judge whether it holds the relevant knowledge.
Conclusion: The linearity of a relation is an important factor in an LLM's ability to self-assess its knowledge; the finding opens new directions for reducing hallucination and suggests improving how factual knowledge is represented.
Abstract: Hallucination is a central failure mode in large language models (LLMs). We focus on hallucinations of answers to questions like: "Which instrument did Glenn Gould play?", but we ask these questions for synthetic entities that are unknown to the model. Surprisingly, we find that medium-size models like Gemma-7B-IT frequently hallucinate, i.e., they have difficulty recognizing that the hallucinated fact is not part of their knowledge. We hypothesize that an important factor in causing these hallucinations is the linearity of the relation: linear relations tend to be stored more abstractly, making it difficult for the LLM to assess its knowledge; the facts of nonlinear relations tend to be stored more directly, making knowledge assessment easier. To investigate this hypothesis, we create SyntHal, a dataset of 6000 synthetic entities for six relations. In our experiments with four models, we determine, for each relation, the hallucination rate on SyntHal and also measure its linearity, using $\Delta\cos$. We find a strong correlation ($r \in [.78,.82]$) between relational linearity and hallucination rate, providing evidence for our hypothesis that the underlying storage of triples of a relation is a factor in how well a model can self-assess its knowledge. This finding has implications for how to manage hallucination behavior and suggests new research directions for improving the representation of factual knowledge in LLMs.
[40] The unreasonable effectiveness of pattern matching
Gary Lupyan,Blaise Agüera y Arcas
Main category: cs.CL
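TL;DR: Large language models (LLMs) can recover meaning from "Jabberwocky" language (content words replaced by nonsense strings) via structural patterns, indicating that pattern matching is not mere linguistic mimicry but a key ingredient of intelligence.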
Details
Motivation: To address the controversy over what LLMs really are: a language mimic, a database, or a blurry version of the Web.
Method: Tests LLMs on "Jabberwocky" language in which content words are randomly replaced by nonsense strings and analyzes their ability to recover meaning.
Result: LLMs can understand and translate nonsense text based on the structural patterns of sentences, showing strong pattern-matching ability.
Conclusion: Pattern matching is not a substitute for "real" intelligence but one of its key ingredients.
Abstract: We report on an astonishing ability of large language models (LLMs) to make sense of "Jabberwocky" language in which most or all content words have been randomly replaced by nonsense strings, e.g., translating "He dwushed a ghanc zawk" to "He dragged a spare chair". This result addresses ongoing controversies regarding how to best think of what LLMs are doing: are they a language mimic, a database, a blurry version of the Web? The ability of LLMs to recover meaning from structural patterns speaks to the unreasonable effectiveness of pattern-matching. Pattern-matching is not an alternative to "real" intelligence, but rather a key ingredient.
[41] Hierarchical Orthogonal Residual Spread for Precise Massive Editing in Large Language Models
Xiaojie Gu,Guangxu Chen,Yuheng Yang,Jingxin Han,Andi Zhang
Main category: cs.CL
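TL;DR: Proposes HORSE, a hierarchical orthogonal residual spread method for LLM model editing that reduces noisy gradients while enabling stable and precise massive editing.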
Details
Motivation: LLMs carry safety risks, and existing editing methods are computationally expensive and prone to knowledge conflicts, so a more efficient editing method is needed.
Method: Introduces the Hierarchical Orthogonal Residual Spread (HORSE) framework, which applies orthogonal decomposition and hierarchical updates to the information matrix to reduce gradient noise and improve edit stability.
Result: Experiments on multiple LLMs and two datasets show that HORSE supports massive edits while preserving edit precision and outperforms several popular editing methods.
Conclusion: HORSE offers an efficient and stable solution for LLM editing, with a sound theoretical basis and practical promise.
Abstract: Large language models (LLMs) exhibit exceptional performance across various domains, yet they face critical safety concerns. Model editing has emerged as an effective approach to mitigate these issues. Existing model editing methods often focus on optimizing an information matrix that blends new and old knowledge. While effective, these approaches can be computationally expensive and may cause conflicts. In contrast, we shift our attention to Hierarchical Orthogonal Residual SprEad of the information matrix, which reduces noisy gradients and enables more stable edits from a different perspective. We demonstrate the effectiveness of our method HORSE through a clear theoretical comparison with several popular methods and extensive experiments conducted on two datasets across multiple LLMs. The results show that HORSE maintains precise massive editing across diverse scenarios. The code is available at https://github.com/XiaojieGu/HORSE
[42] Predict the Retrieval! Test time adaptation for Retrieval Augmented Generation
Xin Sun,Zhongqi Chen,Qiang Liu,Shu Wu,Bowen Song,Weiqiang Wang,Zilei Wang,Liang Wang
Main category: cs.CL
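TL;DR: Proposes TTARAG, a test-time adaptation method that dynamically updates the language model's parameters during inference to improve retrieval-augmented generation in specialized domains.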
Details
Motivation: RAG systems generalize poorly in specialized domains because of distribution shift, so a method that adapts dynamically to the target domain is needed.
Method: Introduces a simple yet effective approach in which the model learns to predict the retrieved content, so that its parameters adjust automatically to the target domain at inference time.
Result: Extensive experiments across six specialized domains show that TTARAG substantially outperforms baseline RAG systems.
Conclusion: TTARAG effectively improves RAG performance in specialized domains and shows good practical potential.
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for enhancing large language models' question-answering capabilities through the integration of external knowledge. However, when adapting RAG systems to specialized domains, challenges arise from distribution shifts, resulting in suboptimal generalization performance. In this work, we propose TTARAG, a test-time adaptation method that dynamically updates the language model's parameters during inference to improve RAG system performance in specialized domains. Our method introduces a simple yet effective approach where the model learns to predict retrieved content, enabling automatic parameter adjustment to the target domain. Through extensive experiments across six specialized domains, we demonstrate that TTARAG achieves substantial performance improvements over baseline RAG systems. Code available at https://github.com/sunxin000/TTARAG.
[43] CTest-Metric: A Unified Framework to Assess Clinical Validity of Metrics for CT Report Generation
Vanshali Sharma,Andrea Mia Bejar,Gorkem Durak,Ulas Bagci
Main category: cs.CL
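TL;DR: Presents CTest-Metric, the first unified framework for assessing quality metrics for CT radiology report generation (RRG). It comprises three modules: writing style generalizability testing, synthetic error injection, and correlation with expert ratings. GREEN Score aligns best with clinical expert judgments, while CRG correlates negatively.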
Details
Motivation: Quality metrics for radiology report generation lack a unified, clinically grounded assessment framework, which makes it hard to judge their reliability and applicability in real medical settings.
Method: Designs the CTest-Metric framework with three modules: LLM-based rephrasing for Writing Style Generalizability (WSG), graded Synthetic Error Injection (SEI), and Metrics-vs-Expert correlation (MvE); evaluates eight widely used metrics across seven LLMs built on a CT-CLIP encoder.
Result: Lexical NLG metrics are sensitive to stylistic variation; GREEN Score correlates best with expert judgments (Spearman~0.70), while CRG correlates negatively; BERTScore-F1 is least sensitive to factual error injection.
Conclusion: CTest-Metric provides a reproducible standard for assessing RRG metrics, exposes the limitations of existing metrics, and supports the development of automatic evaluation that better matches clinical needs.
Abstract: In the generative AI era, where even critical medical tasks are increasingly automated, radiology report generation (RRG) continues to rely on suboptimal metrics for quality assessment. Developing domain-specific metrics has therefore been an active area of research, yet it remains challenging due to the lack of a unified, well-defined framework to assess their robustness and applicability in clinical contexts. To address this, we present CTest-Metric, a first unified metric assessment framework with three modules determining the clinical feasibility of metrics for CT RRG. The modules test: (i) Writing Style Generalizability (WSG) via LLM-based rephrasing; (ii) Synthetic Error Injection (SEI) at graded severities; and (iii) Metrics-vs-Expert correlation (MvE) using clinician ratings on 175 "disagreement" cases. Eight widely used metrics (BLEU, ROUGE, METEOR, BERTScore-F1, F1-RadGraph, RaTEScore, GREEN Score, CRG) are studied across seven LLMs built on a CT-CLIP encoder. Using our novel framework, we found that lexical NLG metrics are highly sensitive to stylistic variations; GREEN Score aligns best with expert judgments (Spearman~0.70), while CRG shows negative correlation; and BERTScore-F1 is least sensitive to factual error injection. We will release the framework, code, and allowable portion of the anonymized evaluation data (rephrased/error-injected CT reports), to facilitate reproducible benchmarking and future metric development.
[44] Do explanations generalize across large reasoning models?
Koyena Pal,David Bau,Chandan Singh
Main category: cs.CL
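TL;DR: Studies whether chain-of-thought (CoT) explanations produced by large reasoning models (LRMs) generalize across models. The explanations often increase consistency between different LRMs, and this effect correlates with human preference rankings and reinforcement-learning post-training, but caution is needed when using them to derive new insights.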
Details
Motivation: To determine whether natural-language explanations generated by LRMs capture general patterns of the underlying problem rather than patterns peculiar to one model, which is crucial in settings such as AI for science where new concepts are to be discovered.
Method: Evaluates whether CoT explanations produced by one LRM induce the same behavior in other LRMs, analyzes the factors that affect consistency, and proposes a sentence-level ensembling strategy to improve it.
Result: CoT explanations generally increase consistency between LRMs; this consistency correlates with human preference rankings and reinforcement-learning post-training; under certain conditions explanations lead to consistent answers, and sentence-level ensembling improves this further.
Conclusion: LRM-generated explanations generalize to some degree, but caution is warranted when using them to obtain new insights; the paper provides a framework for characterizing explanation generalization.
Abstract: Large reasoning models (LRMs) produce a textual chain of thought (CoT) in the process of solving a problem, which serves as a potentially powerful tool to understand the problem by surfacing a human-readable, natural-language explanation. However, it is unclear whether these explanations generalize, i.e. whether they capture general patterns about the underlying problem rather than patterns which are esoteric to the LRM. This is a crucial question in understanding or discovering new concepts, e.g. in AI for science. We study this generalization question by evaluating a specific notion of generalizability: whether explanations produced by one LRM induce the same behavior when given to other LRMs. We find that CoT explanations often exhibit this form of generalization (i.e. they increase consistency between LRMs) and that this increased generalization is correlated with human preference rankings and post-training with reinforcement learning. We further analyze the conditions under which explanations yield consistent answers and propose a straightforward, sentence-level ensembling strategy that improves consistency. Taken together, these results prescribe caution when using LRM explanations to yield new insights and outline a framework for characterizing LRM explanation generalization.
[45] How Long Is a Piece of String? A Brief Empirical Analysis of Tokenizers
Jonathan Roberts,Kai Han,Samuel Albanie
Main category: cs.CL
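TL;DR: Provides a comprehensive empirical analysis of tokenization in current LLMs, revealing large differences in token lengths across models and text domains and challenging common simplifying assumptions about token counts.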
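To make the point about token counts concrete, here is a toy sketch of measuring compression (characters per token) for the same text under different tokenizers. The two tokenizers below are deliberately simplistic stand-ins invented for illustration; they are not the tokenizers studied in the paper, and real subword tokenizers behave differently.

```python
def whitespace_tokens(text):
    """Toy tokenizer 1: split on whitespace."""
    return text.split()

def char_bigram_tokens(text):
    """Toy tokenizer 2: fixed-size chunks of two characters."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]

def chars_per_token(text, tokenize):
    """Compression ratio: average number of characters covered by one token."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text produced no tokens")
    return len(text) / len(tokens)

prose = "The quick brown fox jumps over the lazy dog"
code = "for(i=0;i<n;i++){x+=a[i];}"

for name, sample in [("prose", prose), ("code", code)]:
    for tokenize in (whitespace_tokens, char_bigram_tokens):
        ratio = chars_per_token(sample, tokenize)
        print(f"{name:5s} {tokenize.__name__:20s} {ratio:.2f} chars/token")
```

Even with these toy schemes, the same string yields very different token counts depending on the tokenizer and the kind of text, which is the effect the entry describes at scale.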
Details
Motivation: Because tokenization differs markedly across models and text types, treating the token as a uniform unit of measure can mislead; a systematic study is needed to clarify the practical impact.
Method: Quantifies the tokenization behavior of several mainstream models over multiple distributions of textual data, measuring the compression of sequences to tokens and comparing tokenizers.
Result: Token counts vary significantly across models and contexts; common rules of thumb for estimating tokens are overly simplistic and do not hold generally.
Conclusion: Tokens cannot be treated as a stable unit across models and text types; the results aim to improve understanding and intuition about tokenization in modern LLMs.
Abstract: Frontier LLMs are increasingly utilised across academia, society and industry. A commonly used unit for comparing models, their inputs and outputs, and estimating inference pricing is the token. In general, tokens are used as a stable currency, assumed to be broadly consistent across tokenizers and contexts, enabling direct comparisons. However, tokenization varies significantly across models and domains of text, making naive interpretation of token counts problematic. We quantify this variation by providing a comprehensive empirical analysis of tokenization, exploring the compression of sequences to tokens across different distributions of textual data. Our analysis challenges commonly held heuristics about token lengths, finding them to be overly simplistic. We hope the insights of our study add clarity and intuition toward tokenization in contemporary LLMs.
cs.CV [Back]
[46] Future Optical Flow Prediction Improves Robot Control & Video Generation
Kanchana Ranasinghe,Honglu Zhou,Yu Fang,Luyu Yang,Le Xue,Ran Xu,Caiming Xiong,Silvio Savarese,Michael S Ryoo,Juan Carlos Niebles
Main category: cs.CV
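TL;DR: Proposes FOFPred, a new language-conditioned optical flow forecasting model that combines a vision-language model (VLM) with a diffusion architecture, trains on noisy web-scale human activity data, and transfers across domains to robotic manipulation and video generation.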
Details
Motivation: Learning generalizable, dense future motion representations such as optical flow from noisy real-world data remains challenging and relatively unexplored. The paper asks how large but unstructured web data can be used effectively for high-quality future motion prediction.
Method: Proposes FOFPred, a unified architecture combining a vision-language model and a diffusion model; applies key data preprocessing to extract useful signal from noisy video-caption pairs and relies on strong image pretraining. The model is trained on large-scale human activity data and applied to language-driven downstream tasks.
Result: FOFPred performs strongly on two language-driven downstream tasks, robotic manipulation and video generation, confirming its cross-domain applicability and its effectiveness for future optical flow prediction.
Conclusion: A unified VLM-diffusion architecture trained on large-scale unstructured data enables high-quality, language-conditioned future optical flow prediction and provides a strong motion representation for control and generation tasks.
Abstract: Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data remains relatively unexplored. We introduce FOFPred, a novel language-conditioned optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. This unique combination enables strong multimodal reasoning with pixel-level generative fidelity for future motion prediction. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signals from this noisy video-caption data, we employ crucial data preprocessing techniques and our unified architecture with strong image pretraining. The resulting trained model is then extended to tackle two distinct downstream tasks in control and generation. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and scalable learning from diverse web data for future optical flow prediction.
[47] ICONIC-444: A 3.1-Million-Image Dataset for OOD Detection Research
Gerhard Krumpl,Henning Avenhaus,Horst Possegger
Main category: cs.CV
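TL;DR: Introduces ICONIC-444, a large-scale industrial image dataset with over 3.1 million RGB images and 444 classes, designed for OOD detection research. It supports benchmark tasks at several complexity levels and provides baseline results for 22 state-of-the-art OOD detection methods.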
Details
Motivation: OOD detection research is limited by the lack of large, high-quality datasets with clearly defined OOD categories, especially across difficulty levels (near- to far-OOD) and across fine- and coarse-grained vision tasks.
Method: Builds ICONIC-444, a large-scale industrial image dataset captured with a prototype industrial sorting machine, covering 444 classes and over 3.1 million images, and defines four reference tasks for evaluating OOD detection methods.
Result: Provides baseline results for 22 recent post-hoc OOD detection methods on ICONIC-444, demonstrating the dataset's usefulness and difficulty for OOD detection research.
Conclusion: ICONIC-444 fills a gap in existing datasets for real-world OOD detection and offers a structured, diverse new benchmark for evaluating and improving OOD detection algorithms.
Abstract: Current progress in out-of-distribution (OOD) detection is limited by the lack of large, high-quality datasets with clearly defined OOD categories across varying difficulty levels (near- to far-OOD) that support both fine- and coarse-grained computer vision tasks. To address this limitation, we introduce ICONIC-444 (Image Classification and OOD Detection with Numerous Intricate Complexities), a specialized large-scale industrial image dataset containing over 3.1 million RGB images spanning 444 classes tailored for OOD detection research. Captured with a prototype industrial sorting machine, ICONIC-444 closely mimics real-world tasks. It complements existing datasets by offering structured, diverse data suited for rigorous OOD evaluation across a spectrum of task complexities. We define four reference tasks within ICONIC-444 to benchmark and advance OOD detection research and provide baseline results for 22 state-of-the-art post-hoc OOD detection methods.
[48] A Unified 3D Object Perception Framework for Real-Time Outside-In Multi-Camera Systems
Yizhou Wang,Sameer Pusegaonkar,Yuxing Wang,Anqi Li,Vishal Kumar,Chetan Sethi,Ganapathy Aiyer,Yun He,Kartikay Thakkar,Swapnil Rathi,Bhushan Rupde,Zheng Tang,Sujit Biswas
Main category: cs.CV
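TL;DR: Presents a Sparse4D framework optimized for large-scale infrastructure environments, providing accurate 3D object perception and multi-target multi-camera tracking. With world-coordinate geometric priors, an occlusion-aware ReID module, and generative data augmentation to bridge the Sim2Real gap, the system achieves a HOTA of 45.22 on the AI City Challenge 2025, and TensorRT acceleration lets a single GPU support more than 64 video streams.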
Details
Motivation: Moving "inside-out" autonomous driving models to "outside-in" static camera networks runs into heterogeneous camera placements and severe occlusion, which make stable identity and 3D perception difficult; a solution optimized for infrastructure environments is therefore needed.
Method: Builds on the Sparse4D framework, adds absolute world-coordinate geometric priors to unify the spatial representation, and designs an occlusion-aware ReID embedding module to improve cross-camera identity consistency; uses NVIDIA COSMOS for generative data augmentation to improve appearance invariance; and develops a TensorRT plugin for Multi-Scale Deformable Aggregation for hardware acceleration.
Result: Achieves a HOTA of 45.22 on the AI City Challenge 2025 benchmark, the state of the art among camera-only approaches; TensorRT optimization gives a 2.15x speedup, letting a single Blackwell-class GPU process more than 64 camera streams in real time.
Conclusion: The method addresses occlusion and heterogeneity in multi-camera perception and tracking for complex industrial environments, balances accuracy with real-time operation, and advances practical deployment of vision systems in the digital transformation of industrial infrastructure.
Abstract: Accurate 3D object perception and multi-target multi-camera (MTMC) tracking are fundamental for the digital transformation of industrial infrastructure. However, transitioning "inside-out" autonomous driving models to "outside-in" static camera networks presents significant challenges due to heterogeneous camera placements and extreme occlusion. In this paper, we present an adapted Sparse4D framework specifically optimized for large-scale infrastructure environments. Our system leverages absolute world-coordinate geometric priors and introduces an occlusion-aware ReID embedding module to maintain identity stability across distributed sensor networks. To bridge the Sim2Real domain gap without manual labeling, we employ a generative data augmentation strategy using the NVIDIA COSMOS framework, creating diverse environmental styles that enhance the model's appearance-invariance. Evaluated on the AI City Challenge 2025 benchmark, our camera-only framework achieves a state-of-the-art HOTA of $45.22$. Furthermore, we address real-time deployment constraints by developing an optimized TensorRT plugin for Multi-Scale Deformable Aggregation (MSDA). Our hardware-accelerated implementation achieves a $2.15\times$ speedup on modern GPU architectures, enabling a single Blackwell-class GPU to support over 64 concurrent camera streams.
[49] Can Vision-Language Models Understand Construction Workers? An Exploratory Study
Hieu Bui,Nathaniel E. Chodosh,Arash Tavakoli
Main category: cs.CV
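TL;DR: Evaluates three leading vision-language models (GPT-4o, Florence 2, and LLaVa-1.5) on recognizing worker actions and emotions in static construction site images. GPT-4o performs best, but distinguishing semantically close categories remains difficult, indicating that general-purpose models need further adaptation for real construction environments.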
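The accuracy and F1 figures this entry reports are derived from per-class prediction counts. The sketch below shows one common way to compute accuracy and a macro-averaged F1 from a confusion matrix; the averaging scheme is an assumption (the entry does not say how its "average F1-score" is averaged), and the 2x2 matrix is a made-up example rather than the study's data.

```python
def accuracy(cm):
    """Fraction of correct predictions; cm[i][j] = count of true class i predicted as j."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def macro_f1(cm):
    """Unweighted mean of per-class F1 scores (macro averaging, assumed here)."""
    n = len(cm)
    f1_scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true k, predicted as something else
        fp = sum(cm[r][k] for r in range(n)) - tp   # predicted k, actually something else
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n

# Hypothetical two-class confusion matrix (illustration only).
cm = [[8, 2],
      [1, 9]]
print(f"accuracy = {accuracy(cm):.3f}, macro F1 = {macro_f1(cm):.3f}")
```

Off-diagonal cells of such a matrix are also where confusions between semantically close categories show up, which is the kind of analysis the entry mentions.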
Details
Motivation: Labeled data is scarce in construction, and monitoring worker actions and emotional states is critical for safety and productivity, so models that can understand human behavior without extensive domain-specific training are needed. Vision-language models (VLMs) are a potential solution.
Method: Uses a dataset of 1,000 images covering ten action and ten emotion categories, evaluates GPT-4o, Florence 2, and LLaVa-1.5 through standardized inference pipelines and multiple metrics, and analyzes error patterns with confusion matrices.
Result: GPT-4o reaches an F1-score of 0.756 and accuracy of 0.799 on action recognition and an F1-score of 0.712 and accuracy of 0.773 on emotion recognition; Florence 2 and LLaVa-1.5 score lower, and all models struggle to separate semantically close behavior categories.
Conclusion: General-purpose VLMs offer a baseline capability for human behavior recognition in construction scenes, but reliable real-world use still calls for improvements such as domain adaptation, temporal modeling, or multimodal fusion.
Abstract: As robotics become increasingly integrated into construction workflows, their ability to interpret and respond to human behavior will be essential for enabling safe and effective collaboration. Vision-Language Models (VLMs) have emerged as a promising tool for visual understanding tasks and offer the potential to recognize human behaviors without extensive domain-specific training. This capability makes them particularly appealing in the construction domain, where labeled data is scarce and monitoring worker actions and emotional states is critical for safety and productivity. In this study, we evaluate the performance of three leading VLMs, GPT-4o, Florence 2, and LLaVa-1.5, in detecting construction worker actions and emotions from static site images. Using a curated dataset of 1,000 images annotated across ten action and ten emotion categories, we assess each model's outputs through standardized inference pipelines and multiple evaluation metrics. GPT-4o consistently achieved the highest scores across both tasks, with an average F1-score of 0.756 and accuracy of 0.799 in action recognition, and an F1-score of 0.712 and accuracy of 0.773 in emotion recognition. Florence 2 performed moderately, with F1-scores of 0.497 for action and 0.414 for emotion, while LLaVa-1.5 showed the lowest overall performance, with F1-scores of 0.466 for action and 0.461 for emotion. Confusion matrix analyses revealed that all models struggled to distinguish semantically close categories, such as collaborating in teams versus communicating with supervisors. While the results indicate that general-purpose VLMs can offer a baseline capability for human behavior recognition in construction environments, further improvements, such as domain adaptation, temporal modeling, or multimodal sensing, may be needed for real-world reliability.
[50] One Model, Many Behaviors: Training-Induced Effects on Out-of-Distribution Detection
Gerhard Krumpl,Henning Avenhaus,Horst Possegger
Main category: cs.CV
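TL;DR: Studies the relationship between 21 state-of-the-art OOD detection methods and modern training strategies across 56 ResNet-50 models trained on ImageNet, finding a non-monotonic relationship between ID accuracy and OOD detection performance.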
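For readers unfamiliar with post-hoc OOD detection, the sketch below shows maximum softmax probability (MSP) scoring, a widely used baseline that needs only a trained classifier's logits. Whether MSP is among the 21 detectors benchmarked in this entry is not stated here, so treat it purely as an illustration of the post-hoc idea; the function names, logits, and the 0.5 threshold are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_score(logits):
    """Maximum softmax probability: higher means more in-distribution-like."""
    return max(softmax(logits))

def is_ood(logits, threshold=0.5):
    """Flag an input as OOD when the classifier's top probability is below the threshold.

    The threshold is a placeholder; in practice it is tuned on held-out data.
    """
    return msp_score(logits) < threshold

confident_logits = [10.0, 0.0, 0.0]   # hypothetical ID-like output
flat_logits = [1.0, 1.0, 1.0]         # hypothetical uncertain output
print(msp_score(confident_logits), is_ood(confident_logits))
print(msp_score(flat_logits), is_ood(flat_logits))
```

Because a score like this depends entirely on how the classifier's logits are shaped, it is plausible that different training recipes change detector behavior, which is the interaction the entry investigates.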
Details
Motivation: To explore how modern training strategies affect OOD detection performance and to challenge the common assumption that higher ID accuracy implies better OOD detection.
Method: Fixes the architecture to ResNet-50 and evaluates 21 post-hoc OOD detection methods on eight OOD test sets across 56 ImageNet models trained with different strategies.
Result: OOD performance first rises and then falls as ID accuracy increases; training strategy, detector choice, and OOD performance are strongly interdependent, and no single method is best in all cases.
Conclusion: Modern training strategies raise ID accuracy but may harm OOD detection, so the OOD detection method should be chosen with the training strategy in mind.
Abstract: Out-of-distribution (OOD) detection is crucial for deploying robust and reliable machine-learning systems in open-world settings. Despite steady advances in OOD detectors, their interplay with modern training pipelines that maximize in-distribution (ID) accuracy and generalization remains under-explored. We investigate this link through a comprehensive empirical study. Fixing the architecture to the widely adopted ResNet-50, we benchmark 21 post-hoc, state-of-the-art OOD detection methods across 56 ImageNet-trained models obtained via diverse training strategies and evaluate them on eight OOD test sets. Contrary to the common assumption that higher ID accuracy implies better OOD detection performance, we uncover a non-monotonic relationship: OOD performance initially improves with accuracy but declines once advanced training recipes push accuracy beyond the baseline. Moreover, we observe a strong interdependence between training strategy, detector choice, and resulting OOD performance, indicating that no single method is universally optimal.
[51] Effects of Different Attention Mechanisms Applied on 3D Models in Video Classification
Mohammad Rasras,Iuliana Marin,Serban Radu,Irina Mocanu
Main category: cs.CV
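TL;DR: Investigates how raising frame resolution while reducing the temporal information captured affects 3D ResNet-style models (MC3, R3D, R(2+1)D) in human action recognition, and analyzes the role of added attention mechanisms. The results show that missing temporal features significantly affect performance.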
Details
Motivation: To explore how increasing spatial resolution while reducing temporal information affects action recognition models, and to assess how effective different attention mechanisms are in temporally restricted models.
Method: Builds reduced-temporal models with a dropout layer based on MC3, R3D, and R(2+1)D, designs ten variants of each that add attention modules such as CBAM, TCN, multi-head attention, and channel attention, and evaluates them on UCF101.
Result: On UCF101, the modified R(2+1)D variant with multi-head attention reaches 88.98% accuracy; the attention modules give similar overall gains but behave differently at the class level.
Conclusion: Missing temporal features significantly affect high-resolution action recognition models; attention mechanisms help somewhat, but their effect varies by class, which indicates that temporal modeling remains essential.
Abstract: Human action recognition has become an important research focus in computer vision due to the wide range of applications where it is used. 3D ResNet-based CNN models, particularly MC3, R3D, and R(2+1)D, have different convolutional filters to extract spatiotemporal features. This paper investigates the impact of reducing the captured knowledge from temporal data, while increasing the resolution of the frames. To establish this experiment, we created similar designs to the three originals, but with a dropout layer added before the final classifier. We then developed ten new versions for each one of these three designs. The variants include special attention blocks within their architecture, such as convolutional block attention module (CBAM), temporal convolution networks (TCN), in addition to multi-headed and channel attention mechanisms. The purpose behind that is to observe the extent of the influence each of these blocks has on performance for the restricted-temporal models. The results of testing all the models on UCF101 have shown accuracy of 88.98% for the variant with multi-headed attention added to the modified R(2+1)D. This paper concludes that missing temporal features significantly affect the performance of the newly created increased-resolution models. The variants had different behavior on class-level accuracy, despite the similarity of their enhancements to the overall performance.
[52] Medical SAM3: A Foundation Model for Universal Prompt-Driven Medical Image Segmentation
Chongcong Jiang,Tianxingjian Ding,Chuhan Song,Jiachen Tu,Ziyang Yan,Yihua Shao,Zhenyi Wang,Yuzhang Shang,Tianyu Han,Yu Tian
Main category: cs.CV
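TL;DR: Medical SAM3 is a universal medical image segmentation foundation model obtained by fully fine-tuning SAM3 on large-scale, heterogeneous 2D and 3D medical imaging datasets. It substantially improves segmentation across organs, modalities, and dimensionalities, especially in challenging cases involving semantic ambiguity, complex morphology, and long-range 3D context.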
Details
Motivation: Vanilla SAM3 is of limited direct use for medical image segmentation because of severe domain shift, the absence of privileged spatial prompts, and difficulty reasoning over complex anatomical and volumetric structures; a full-model adaptation suited to medical imaging is therefore needed.
Method: Fine-tunes all parameters of SAM3 on 33 datasets spanning 10 medical imaging modalities, with paired segmentation masks and text prompts, so that the model learns robust domain-specific representations while keeping prompt-driven flexibility.
Result: Medical SAM3 shows consistent and significant gains across organs, imaging modalities, and dimensionalities, and outperforms existing methods particularly in challenging scenarios with semantic ambiguity, complex morphology, and long-range 3D context.
Conclusion: Medical SAM3 is a universal, text-guided segmentation foundation model for medical imaging; it shows that robust prompt-driven segmentation under severe domain shift depends on holistic model adaptation rather than prompt engineering alone.
Abstract: Promptable segmentation foundation models such as SAM3 have demonstrated strong generalization capabilities through interactive and concept-based prompting. However, their direct applicability to medical image segmentation remains limited by severe domain shifts, the absence of privileged spatial prompts, and the need to reason over complex anatomical and volumetric structures. Here we present Medical SAM3, a foundation model for universal prompt-driven medical image segmentation, obtained by fully fine-tuning SAM3 on large-scale, heterogeneous 2D and 3D medical imaging datasets with paired segmentation masks and text prompts. Through a systematic analysis of vanilla SAM3, we observe that its performance degrades substantially on medical data, with its apparent competitiveness largely relying on strong geometric priors such as ground-truth-derived bounding boxes. These findings motivate full model adaptation beyond prompt engineering alone. By fine-tuning SAM3's model parameters on 33 datasets spanning 10 medical imaging modalities, Medical SAM3 acquires robust domain-specific representations while preserving prompt-driven flexibility. Extensive experiments across organs, imaging modalities, and dimensionalities demonstrate consistent and significant performance gains, particularly in challenging scenarios characterized by semantic ambiguity, complex morphology, and long-range 3D context. Our results establish Medical SAM3 as a universal, text-guided segmentation foundation model for medical imaging and highlight the importance of holistic model adaptation for achieving robust prompt-driven segmentation under severe domain shift. Code and model will be made available at https://github.com/AIM-Research-Lab/Medical-SAM3.
[53] FrankenMotion: Part-level Human Motion Generation and Composition
Chuqiao Li,Xianghui Xie,Yong Cao,Andreas Geiger,Gerard Pons-Moll
Main category: cs.CV
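TL;DR: Proposes FrankenMotion, the first diffusion model to support human motion generation from atomic, temporally-aware, part-level text descriptions, and constructs a corresponding high-quality dataset.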
Details
Motivation: Existing text-driven motion generation methods are limited by the lack of fine-grained, part-level motion annotations, which makes precise control over individual body parts difficult.
Method: Uses large language models to build a motion dataset with temporally-aware part-level annotations and proposes a diffusion-based part-aware motion generation framework in which each body part is guided by its own temporally structured text prompt.
Result: FrankenMotion outperforms all baseline models in the new setting, can compose motions unseen during training, and offers both spatial (body part) and temporal (atomic action) control.
Conclusion: The work is the first to achieve fine-grained part-level motion control and contributes a new dataset and modeling paradigm for text-driven human motion generation.
Abstract: Human motion generation from text prompts has made remarkable progress in recent years. However, existing methods primarily rely on either sequence-level or action-level descriptions due to the absence of fine-grained, part-level motion annotations. This limits their controllability over individual body parts. In this work, we construct a high-quality motion dataset with atomic, temporally-aware part-level text annotations, leveraging the reasoning capabilities of large language models (LLMs). Unlike prior datasets that either provide synchronized part captions with fixed time segments or rely solely on global sequence labels, our dataset captures asynchronous and semantically distinct part movements at fine temporal resolution. Based on this dataset, we introduce a diffusion-based part-aware motion generation framework, namely FrankenMotion, where each body part is guided by its own temporally-structured textual prompt. This is, to our knowledge, the first work to provide atomic, temporally-aware part-level motion annotations and have a model that allows motion generation with both spatial (body part) and temporal (atomic action) control. Experiments demonstrate that FrankenMotion outperforms all previous baseline models adapted and retrained for our setting, and our model can compose motions unseen during training. Our code and dataset will be publicly available upon publication.
[54] Classification of Chest XRay Diseases through image processing and analysis techniques
Santiago Martínez Novoa,María Catalina Ibáñez,Lina Gómez Mesa,Jeremias Kramer
Main category: cs.CV
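TL;DR: This paper studies several methods for multi-class diagnosis from chest X-ray images, including DenseNet121, develops an open-source web application for comparing them, analyzes the weaknesses of the methods, and suggests improvements.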
Details
Motivation: To improve the accuracy of automatic diagnosis of multiple thoracic diseases from chest X-ray images, different deep learning methods need to be systematically evaluated and compared. Method: Several deep learning models, including DenseNet121, are applied to multi-class classification of chest X-ray images, and an open-source web application is used to compare the methods. Result: The experiments compare the methods on the multi-class task and identify the strengths and weaknesses of each. Conclusion: Some methods perform well on multi-class chest X-ray classification, but there is still room for improvement, and future work can target the weaknesses identified. Abstract: Chest X-ray images are one of the most prevalent forms of radiological examination used for diagnosing thoracic diseases. In this study, we offer a concise overview of several methods for multi-class classification of chest X-ray images, including DenseNet121. In addition, we deploy an open-source web-based application. We conduct tests to compare the methods and measure how well they work. We also examine the weaknesses of the methods we propose and suggest ideas for improving them in the future. Our code is available at: https://github.com/AML4206-MINE20242/Proyecto_AML
[55] Self-learned representation-guided latent diffusion model for breast cancer classification in deep ultraviolet whole surface images
Pouya Afshin,David Helminiak,Tianling Niu,Julie M. Jorns,Tina Yen,Bing Yu,Dong Hye Ye
Main category: cs.CV
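TL;DR: Proposes a self-supervised-learning-guided latent diffusion model (SSL-guided LDM) that generates high-quality synthetic deep ultraviolet fluorescence scanning microscopy images for margin assessment in breast-conserving surgery, substantially improving Vision Transformer classification performance.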
Details
Motivation: Annotated deep ultraviolet fluorescence image data is scarce, which makes it hard to train robust deep learning models for intraoperative margin assessment. Method: Self-supervised learning (DINO) extracts semantic information about cellular structures and guides a latent diffusion model to generate richly detailed synthetic images; a Vision Transformer is fine-tuned on real and synthetic data, and patch predictions are aggregated for WSI-level classification. Result: The method reaches 96.47% accuracy in 5-fold cross-validation and lowers the FID score to 45.72, significantly outperforming class-conditioned generation baselines. Conclusion: The method effectively mitigates the shortage of annotated medical image data, improves the accuracy of intraoperative margin assessment in BCS, and has potential for clinical use. Abstract: Breast-Conserving Surgery (BCS) requires precise intraoperative margin assessment to preserve healthy tissue. Deep Ultraviolet Fluorescence Scanning Microscopy (DUV-FSM) offers rapid, high-resolution surface imaging for this purpose; however, the scarcity of annotated DUV data hinders the training of robust deep learning models. To address this, we propose a Self-Supervised Learning (SSL)-guided Latent Diffusion Model (LDM) to generate high-quality synthetic training patches. By guiding the LDM with embeddings from a fine-tuned DINO teacher, we inject rich semantic details of cellular structures into the synthetic data. We combine real and synthetic patches to fine-tune a Vision Transformer (ViT), utilizing patch prediction aggregation for WSI-level classification. Experiments using 5-fold cross-validation demonstrate that our method achieves 96.47% accuracy and reduces the FID score to 45.72, significantly outperforming class-conditioned baselines.
[56] RobuMTL: Enhancing Multi-Task Learning Robustness Against Weather Conditions
Tasneem Shaffee,Sherief Reda
Main category: cs.CV
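TL;DR: This paper proposes RobuMTL, a new architecture that improves the robustness of multi-task learning for autonomous systems under adverse weather by dynamically selecting task-specific low-rank adaptation (LoRA) modules and expert combinations.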
Details
Motivation: In real environments, adverse weather significantly degrades model performance, so more robust multi-task learning methods are needed to cope with visual degradation. Method: A mixture-of-experts architecture dynamically selects task-specific hierarchical LoRA modules and a LoRA expert squad based on input perturbations, adapting to different input conditions. Result: On PASCAL, the method gives a +2.8% average relative improvement under single perturbations and up to +44.4% under mixed weather conditions; on NYUD-v2, it gives a +9.7% average improvement across tasks. Conclusion: RobuMTL effectively improves the robustness and performance of multi-task models in complex real-world environments and shows good application potential. Abstract: Robust Multi-Task Learning (MTL) is crucial for autonomous systems operating in real-world environments, where adverse weather conditions can severely degrade model performance and reliability. In this paper, we introduce RobuMTL, a novel architecture designed to adaptively address visual degradation by dynamically selecting task-specific hierarchical Low-Rank Adaptation (LoRA) modules and a LoRA expert squad based on input perturbations in a mixture-of-experts fashion. Our framework enables adaptive specialization based on input characteristics, improving robustness across diverse real-world conditions. To validate our approach, we evaluated it on the PASCAL and NYUD-v2 datasets and compared it against single-task models, standard MTL baselines, and state-of-the-art methods. On the PASCAL benchmark, RobuMTL delivers a +2.8% average relative improvement under single perturbations and up to +44.4% under mixed weather conditions compared to the MTL baseline. On NYUD-v2, RobuMTL achieves a +9.7% average relative improvement across tasks. The code is available at GitHub.
[57] Sparse Data Tree Canopy Segmentation: Fine-Tuning Leading Pretrained Models on Only 150 Images
David Szczecina,Hudson Sun,Anthony Bertnyk,Niloofar Azad,Kyle Gao,Lincoln Linlin Xu
Main category: cs.CV
TL;DR: Under extreme data scarcity, pretrained convolution-based models such as YOLOv11 and Mask R-CNN outperform transformer-based models on tree canopy segmentation.
Details
Motivation: Because annotated data is scarce in practice, it is important to study how well different deep learning models suit tree canopy detection on small, imbalanced datasets. Method: Five representative models (YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, DINOv2) are evaluated on a small dataset of only 150 annotated images, and their generalization under data scarcity is analyzed. Result: Pretrained convolution-based models, especially YOLOv11 and Mask R-CNN, significantly outperform transformer-based models; the latter perform poorly because of their high data requirements, lack of strong inductive biases, and the differences between semantic and instance segmentation tasks. Conclusion: In low-data settings, lightweight CNN methods remain the most reliable choice for tree canopy detection in remote sensing imagery, and transformer architectures need more pretraining or augmentation to be applied effectively. Abstract: Tree canopy detection from aerial imagery is an important task for environmental monitoring, urban planning, and ecosystem analysis. Simulating real-life data annotation scarcity, the Solafune Tree Canopy Detection competition provides a small and imbalanced dataset of only 150 annotated images, posing significant challenges for training deep models without severe overfitting. In this work, we evaluate five representative architectures, YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2, to assess their suitability for canopy segmentation under extreme data scarcity. Our experiments show that pretrained convolution-based models, particularly YOLOv11 and Mask R-CNN, generalize significantly better than pretrained transformer-based models. DeepLabv3, Swin-UNet, and DINOv2 underperform, likely due to differences between semantic and instance segmentation tasks, the high data requirements of Vision Transformers, and the lack of strong inductive biases. These findings confirm that transformer-based architectures struggle in low-data regimes without substantial pretraining or augmentation and that differences between semantic and instance segmentation further affect model performance. We provide a detailed analysis of training strategies, augmentation policies, and model behavior under the small-data constraint and demonstrate that lightweight CNN-based methods remain the most reliable for canopy detection on limited imagery.
[58] PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis
K Lokesh,Abhirama Subramanyam Penamakuri,Uday Agarwal,Apoorva Challa,Shreya K Gowda,Somesh Gupta,Anand Mishra
Main category: cs.CV
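TL;DR: Proposes a Pre-Consultation Dialogue Framework (PCDF) that improves medical diagnostic accuracy by simulating consultation dialogues between vision-language models, combining images with patient symptom information.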
Details
Motivation: Traditional AI medical diagnosis relies mainly on image analysis and lacks patient symptom information, which limits diagnostic accuracy. A framework that simulates the real consultation process is needed to close this gap. Method: A dialogue system is built from a DocVLM and a PatientVLM: the DocVLM generates questions from the medical image and the dialogue history, and the PatientVLM answers from a symptom profile derived from the ground-truth diagnosis. The DocVLM is fine-tuned on the resulting multi-turn dialogue data, and a small-scale clinical validation assesses the clinical relevance and realism of the generated symptoms. Result: Licensed clinicians judged the synthetic symptoms to be clinically relevant, to have good coverage, and to be realistic; the DocVLM-PatientVLM framework produces coherent multi-turn consultations paired with images; dialogue-based supervision substantially outperforms image-only training. Conclusion: Dialogue-based AI consultation that incorporates patient symptoms effectively improves diagnosis from medical images, confirming the value of realistic symptom elicitation in AI-assisted diagnosis. Abstract: Traditionally, AI research in medical diagnosis has largely centered on image analysis. While this has led to notable advancements, the absence of patient-reported symptoms continues to hinder diagnostic accuracy. To address this, we propose a Pre-Consultation Dialogue Framework (PCDF) that mimics real-world diagnostic procedures, where doctors iteratively query patients before reaching a conclusion. Specifically, we simulate diagnostic dialogues between two vision-language models (VLMs): a DocVLM, which generates follow-up questions based on the image and dialogue history, and a PatientVLM, which responds using a symptom profile derived from the ground-truth diagnosis. We additionally conducted a small-scale clinical validation of the synthetic symptoms generated by our framework, with licensed clinicians confirming their clinical relevance, symptom coverage, and overall realism. These findings indicate that the resulting DocVLM-PatientVLM interactions form coherent, multi-turn consultations paired with images and diagnoses, which we then use to fine-tune the DocVLM. This dialogue-based supervision leads to substantial gains over image-only training, highlighting the value of realistic symptom elicitation for diagnosis.
[59] MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement
Meidan Ding,Jipeng Zhang,Wenxuan Wang,Haiqin Zhong,Xiaoling Luo,Wenting Chen,Linlin Shen
Main category: cs.CV
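TL;DR: Proposes MMedExpert-R1, a new method that strengthens the complex clinical reasoning of medical vision-language models through domain-specific adaptation and guideline-based reinforcement.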
Details
Motivation: Existing reinforcement learning methods for medical reasoning face three challenges: deep reasoning data is scarce, cold-start limits multi-specialty alignment, and the diversity of clinical reasoning cannot be modeled. Method: A high-quality multi-specialty dataset, MMedExpert, with 10K samples is constructed; Domain-Specific Adaptation (DSA) produces specialized LoRA modules; Guideline-Based Advantages (GBA) are designed to capture different clinical reasoning perspectives; and Conflict-Aware Capability Integration merges the expert models. Result: The model scores 27.50 on MedXpert-MM and 83.03 on OmniMedVQA, significantly outperforming existing methods. Conclusion: Through domain adaptation and reinforcement learning guided by clinical guidelines, MMedExpert-R1 improves the accuracy and consistency of complex multi-specialty clinical reasoning and offers a new paradigm for reliable multimodal medical reasoning systems. Abstract: Medical Vision-Language Models (MedVLMs) excel at perception tasks but struggle with complex clinical reasoning required in real-world scenarios. While reinforcement learning (RL) has been explored to enhance reasoning capabilities, existing approaches face critical mismatches: deep reasoning data is scarce, cold-start limits multi-specialty alignment, and standard RL algorithms fail to model clinical reasoning diversity. We propose MMedExpert-R1, a novel reasoning MedVLM that addresses these challenges through domain-specific adaptation and clinical guideline reinforcement. We construct MMedExpert, a high-quality dataset of 10K samples across four specialties with step-by-step reasoning traces. Our Domain-Specific Adaptation (DSA) creates specialty-specific LoRA modules to provide diverse initialization, while Guideline-Based Advantages (GBA) explicitly models different clinical reasoning perspectives to align with real-world diagnostic strategies. Conflict-Aware Capability Integration then merges these specialized experts into a unified agent, ensuring robust multi-specialty alignment. Comprehensive experiments demonstrate state-of-the-art performance, with our 7B model achieving 27.50 on MedXpert-MM and 83.03 on OmniMedVQA, establishing a robust foundation for reliable multimodal medical reasoning systems.
[60] IDDR-NGP: Incorporating Detectors for Distractor Removal with Instant Neural Radiance Field
Xianliang Huang,Jiajie Gou,Shuhang Chen,Zhizhou Zhong,Jihong Guan,Shuigeng Zhou
Main category: cs.CV
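TL;DR: This paper proposes IDDR-NGP, a unified method for removing distractors from 3D scenes that handles many distractor types and restores high-quality 3D scenes from multi-view corrupted images by combining implicit 3D representations with 2D detectors.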
Details
Motivation: Existing methods usually target one specific type of distractor; they lack generality and a unified treatment, and struggle with the many distractor types found in complex real scenes. Method: IDDR-NGP builds on the Instant-NGP framework, introduces an LPIPS loss and a multi-view compensation loss (MVCL), combines 2D detectors with implicit 3D representations, and optimizes the rendering results end to end. Result: On a newly built benchmark dataset containing synthetic and real distractors, the method is shown to be effective and robust; it removes distractors such as snow, confetti, and fallen leaves, and performs comparably to existing SOTA desnowing methods. Conclusion: IDDR-NGP is the first unified distractor removal framework; it efficiently restores 3D scenes affected by multiple distractor types within an implicit neural graphics representation and shows good generalization and application prospects. Abstract: This paper presents the first unified distractor removal method, named IDDR-NGP, which directly operates on Instant-NGP. The method is able to remove a wide range of distractors in 3D scenes, such as snowflakes, confetti, defoliation and petals, whereas existing methods usually focus on a specific type of distractor. By incorporating implicit 3D representations with 2D detectors, we demonstrate that it is possible to efficiently restore 3D scenes from multiple corrupted images. We introduce the learned perceptual image patch similarity (LPIPS) loss and the multi-view compensation loss (MVCL) to jointly optimize the rendering results of IDDR-NGP, which can aggregate information from multi-view corrupted images. All of them can be trained in an end-to-end manner to synthesize high-quality 3D scenes. To support research on distractor removal in implicit 3D representations, we build a new benchmark dataset that consists of both synthetic and real-world distractors. To validate the effectiveness and robustness of IDDR-NGP, we provide a wide range of distractors with corresponding annotated labels added to both realistic and synthetic scenes. Extensive experimental results demonstrate the effectiveness and robustness of IDDR-NGP in removing multiple types of distractors. In addition, our approach achieves results comparable with the existing SOTA desnow methods and is capable of accurately removing both realistic and synthetic distractors.
[61] Your One-Stop Solution for AI-Generated Video Detection
Long Ma,Zihao Xue,Yan Wang,Zhiyuan Yan,Jin Xu,Xiaorui Jiang,Haiyang Yu,Yong Liao,Zhen Bi
Main category: cs.CV
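TL;DR: This paper presents AIGVDBench, a comprehensive and representative benchmark for AI-generated video detection that covers 31 state-of-the-art generation models and more than 440,000 videos, runs more than 1,500 evaluations on 33 existing detectors, and offers 8 in-depth analyses and 4 novel findings to advance the field.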
Details
Motivation: Existing datasets are limited in scale, built with outdated generative models, and lack diversity, and there is no systematic benchmark for analyzing detector performance in depth. A more comprehensive and representative benchmark is needed to advance AI-generated video detection. Method: A large-scale, high-quality dataset covering 31 state-of-the-art generation models and more than 440,000 videos is built, together with a multi-dimensional evaluation framework; more than 1,500 experiments are run on 33 detectors from four categories, and 8 in-depth analyses are carried out. Result: The large-scale benchmark evaluation yields 4 novel findings and reveals the limitations of existing detectors in cross-model generalization, adversarial robustness, and scenario adaptability. Conclusion: AIGVDBench provides a solid foundation for AI-generated video detection and encourages future work on data quality, technology coverage, and in-depth analysis; the project is open source. Abstract: Recent advances in generative modeling can create remarkably realistic synthetic videos, making it increasingly difficult for humans to distinguish them from real ones and necessitating reliable detection methods. However, two key limitations hinder the development of this field. From the dataset perspective, existing datasets are often limited in scale and constructed using outdated or narrowly scoped generative models, making it difficult to capture the diversity and rapid evolution of modern generative techniques. Moreover, the dataset construction process frequently prioritizes quantity over quality, neglecting essential aspects such as semantic diversity, scenario coverage, and technological representativeness. From the benchmark perspective, current benchmarks largely remain at the stage of dataset creation, leaving many fundamental issues and in-depth analyses yet to be systematically explored. Addressing this gap, we propose AIGVDBench, a benchmark designed to be comprehensive and representative, covering 31 state-of-the-art generation models and over 440,000 videos. By executing more than 1,500 evaluations on 33 existing detectors belonging to four distinct categories, this work presents 8 in-depth analyses from multiple perspectives and identifies 4 novel findings that offer valuable insights for future research. We hope this work provides a solid foundation for advancing the field of AI-generated video detection. Our benchmark is open-sourced at https://github.com/LongMa-2025/AIGVDBench.
[62] M3DDM+: An improved video outpainting by a modified masking strategy
Takuya Murakawa,Takumi Fukuzawa,Ning Ding,Toru Tamaki
Main category: cs.CV
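TL;DR: M3DDM+ is an improved video outpainting framework that uses a uniform mask direction and width during training and fine-tunes the pretrained model, addressing M3DDM's quality degradation in information-limited scenarios and improving visual fidelity and temporal consistency.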
Details
Motivation: M3DDM shows spatial blur and temporal inconsistency when camera motion is limited or the outpainting region is large, caused by a mismatch between the masking strategies used in training and inference. Method: M3DDM+ applies a mask with uniform direction and width to all frames during training and fine-tunes the pretrained M3DDM model on that basis. Result: Experiments show that M3DDM+ substantially improves visual quality and temporal coherence in information-limited cases while maintaining computational efficiency. Conclusion: By removing the training-inference mask inconsistency, M3DDM+ strengthens video outpainting in challenging scenarios. Abstract: M3DDM provides a computationally efficient framework for video outpainting via latent diffusion modeling. However, it exhibits significant quality degradation, manifested as spatial blur and temporal inconsistency, under challenging scenarios characterized by limited camera motion or large outpainting regions, where inter-frame information is limited. We identify the cause as a training-inference mismatch in the masking strategy: M3DDM's training applies random mask directions and widths across frames, whereas inference requires consistent directional outpainting throughout the video. To address this, we propose M3DDM+, which applies uniform mask direction and width across all frames during training, followed by fine-tuning of the pretrained M3DDM model. Experiments demonstrate that M3DDM+ substantially improves visual fidelity and temporal coherence in information-limited scenarios while maintaining computational efficiency. The code is available at https://github.com/tamaki-lab/M3DDM-Plus.
[63] PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models
Qiyuan Zhang,Biao Gong,Shuai Tan,Zheng Zhang,Yujun Shen,Xing Zhu,Yuyuan Li,Kelu Yao,Chunhua Shen,Changqing Zou
Main category: cs.CV
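TL;DR: This paper proposes a physics-aware reinforcement learning paradigm that, for the first time, enforces physical collision rules directly in high-dimensional spaces to improve the physical realism of video generation models, and introduces the MDcycle framework, which allows efficient fine-tuning while preserving the ability to use physics-grounded feedback.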
Details
Motivation: Existing transformer-based video generation models ignore physical principles such as object rigidity during pixel-level denoising, so rigid body motion is simulated unrealistically, and physical laws serve only as optimization conditions instead of enforced constraints. Method: A physics-aware reinforcement learning paradigm enforces physical collision rules directly in high-dimensional spaces; a unified Mimicry-Discovery Cycle (MDcycle) framework supports large-scale fine-tuning while preserving the ability to use physics-grounded feedback. Result: A new benchmark, PhysRVGBench, is constructed, and experiments show that the method markedly improves the physical plausibility and visual quality of generated videos in both qualitative and quantitative evaluation. Conclusion: The study bridges video generation and classical mechanics, shows the benefit of treating physical laws as hard constraints instead of soft conditions, and offers a new paradigm for physical consistency in generative models. Abstract: Physical principles are fundamental to realistic visual simulation, but remain a significant oversight in transformer-based video generation. This gap highlights a critical limitation in rendering rigid body motion, a core tenet of classical mechanics. While computer graphics and physics-based simulators can easily model such collisions using Newtonian formulas, modern pretrain-finetune paradigms discard the concept of object rigidity during pixel-level global denoising. Even perfectly correct mathematical constraints are treated as suboptimal solutions (i.e., conditions) during model optimization in post-training, fundamentally limiting the physical realism of generated videos. Motivated by these considerations, we introduce, for the first time, a physics-aware reinforcement learning paradigm for video generation models that enforces physical collision rules directly in high-dimensional spaces, ensuring the physics knowledge is strictly applied rather than treated as conditions. Subsequently, we extend this paradigm to a unified framework, termed Mimicry-Discovery Cycle (MDcycle), which allows substantial fine-tuning while fully preserving the model's ability to leverage physics-grounded feedback. To validate our approach, we construct a new benchmark, PhysRVGBench, and perform extensive qualitative and quantitative experiments to thoroughly assess its effectiveness.
[64] CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation
Shuai Tan,Biao Gong,Ke Ma,Yutong Feng,Qiyuan Zhang,Yan Wang,Yujun Shen,Hengshuang Zhao
Main category: cs.CV
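TL;DR: Proposes CoDance, a framework that uses an Unbind-Rebind mechanism to animate character images with arbitrary subject counts, types, and spatial layouts, addressing the limitations of existing methods in multi-subject animation.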
Details
Motivation: Existing methods for multi-subject animation are limited by rigid spatial binding and cannot accurately associate motion with the intended characters, so they struggle with variation in subject count, subject type, and spatial misalignment. Method: CoDance consists of an Unbind module, which introduces a pose shift encoder and stochastic perturbations to learn a location-agnostic motion representation, and a Rebind module, which uses text prompts and masks for semantic and spatial guidance to rebind motion precisely. Result: Experiments on the newly built CoDanceBench and on existing datasets show that CoDance achieves SOTA performance in multi-subject animation and generalizes well. Conclusion: Through its Unbind-Rebind strategy, CoDance resolves the spatial misalignment and subject association problems in multi-subject character animation and supports flexible, robust multi-character animation generation. Abstract: Character image animation is gaining significant importance across various domains, driven by the demand for robust and flexible multi-subject rendering. While existing methods excel in single-person animation, they struggle to handle arbitrary subject counts, diverse character types, and spatial misalignment between the reference image and the driving poses. We attribute these limitations to an overly rigid spatial binding that forces strict pixel-wise alignment between the pose and reference, and an inability to consistently rebind motion to intended subjects. To address these challenges, we propose CoDance, a novel Unbind-Rebind framework that enables the animation of arbitrary subject counts, types, and spatial configurations conditioned on a single, potentially misaligned pose sequence. Specifically, the Unbind module employs a novel pose shift encoder to break the rigid spatial binding between the pose and the reference by introducing stochastic perturbations to both poses and their latent features, thereby compelling the model to learn a location-agnostic motion representation. To ensure precise control and subject association, we then devise a Rebind module, leveraging semantic guidance from text prompts and spatial guidance from subject masks to direct the learned motion to intended characters. Furthermore, to facilitate comprehensive evaluation, we introduce a new multi-subject CoDanceBench. Extensive experiments on CoDanceBench and existing datasets show that CoDance achieves SOTA performance, exhibiting remarkable generalization across diverse subjects and spatial layouts. The code and weights will be open-sourced.
[65] Graph Smoothing for Enhanced Local Geometry Learning in Point Cloud Analysis
Shangbo Yuan,Jie Xu,Ping Hu,Xiaofeng Zhu,Na Zhao
Main category: cs.CV
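TL;DR: Proposes a new method that combines a graph smoothing module with an enhanced local geometry learning module to optimize the graph structure in 3D point cloud analysis, addressing sparse connections at boundary points and noisy connections in junction areas.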
Details
Motivation: Existing graph-based methods for 3D point clouds produce suboptimal graph structures because boundary points are sparsely connected and junction areas contain noisy connections, which degrades the analysis. Method: A graph smoothing module optimizes the graph structure and reduces the influence of unreliable connections; on the optimized graph, shape features from eigenvector-based adaptive geometric descriptors and distribution features from a cylindrical coordinate transformation are extracted to enhance local geometry learning. Result: Experiments on real-world datasets show strong performance on point cloud learning tasks including classification, part segmentation, and semantic segmentation. Conclusion: By optimizing the graph structure and incorporating local geometric information, the method markedly improves the accuracy and robustness of 3D point cloud analysis. Abstract: Graph-based methods have proven to be effective in capturing relationships among points for 3D point cloud analysis. However, these methods often suffer from suboptimal graph structures, particularly due to sparse connections at boundary points and noisy connections in junction areas. To address these challenges, we propose a novel method that integrates a graph smoothing module with an enhanced local geometry learning module. Specifically, we identify the limitations of conventional graph structures, particularly in handling boundary points and junction areas. In response, we introduce a graph smoothing module designed to optimize the graph structure and minimize the negative impact of unreliable sparse and noisy connections. Based on the optimized graph structure, we improve the feature extraction function with local geometry information. This information includes shape features derived from adaptive geometric descriptors based on eigenvectors and distribution features obtained through cylindrical coordinate transformation. Experimental results on real-world datasets validate the effectiveness of our method in various point cloud learning tasks, i.e., classification, part segmentation, and semantic segmentation.
[66] Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
Shaofeng Yin,Jiaxin Ge,Zora Zhiruo Wang,Xiuyu Li,Michael J. Black,Trevor Darrell,Angjoo Kanazawa,Haiwen Feng
Main category: cs.CV
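TL;DR: This paper presents VIGA (Vision-as-Inverse-Graphics Agent), which reconstructs images as editable graphics programs through a closed-loop write-run-render-compare-revise procedure, is task-agnostic and model-agnostic, and substantially improves performance on several benchmarks.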
Details
Motivation: Existing vision-language models (VLMs) lack fine-grained spatial and physical grounding and cannot reconstruct an image as a graphics program in one shot. Method: VIGA uses an iterative closed-loop procedure (write-run-render-compare-revise) combined with a skill library and an evolving context memory to support long-horizon multimodal reasoning. Result: Improvements of 35.32% on BlenderGym, 117.17% on SlideBench, and 124.70% on BlenderBench. Conclusion: VIGA achieves task-general vision-as-inverse-graphics without fine-tuning and advances work in this direction. Abstract: Vision-as-inverse-graphics, the concept of reconstructing an image as an editable graphics program, is a long-standing goal of computer vision. Yet even strong VLMs aren't able to achieve this in one shot as they lack fine-grained spatial and physical grounding capability. Our key insight is that closing this gap requires interleaved multimodal reasoning through iterative execution and verification. Stemming from this, we present VIGA (Vision-as-Inverse-Graphics Agent) that starts from an empty world and reconstructs or edits scenes through a closed-loop write-run-render-compare-revise procedure. To support long-horizon reasoning, VIGA combines (i) a skill library that alternates generator and verifier roles and (ii) an evolving context memory that contains plans, code diffs, and render history. VIGA is task-agnostic as it doesn't require auxiliary modules, covering a wide range of tasks such as 3D reconstruction, multi-step scene editing, 4D physical interaction, and 2D document editing. Empirically, we found VIGA substantially improves over one-shot baselines on BlenderGym (35.32%) and SlideBench (117.17%). Moreover, VIGA is also model-agnostic as it doesn't require finetuning, enabling a unified protocol to evaluate heterogeneous foundation VLMs. To better support this protocol, we introduce BlenderBench, a challenging benchmark that stress-tests interleaved multimodal reasoning with a graphics engine, where VIGA improves by 124.70%.
[67] SoLA-Vision: Fine-grained Layer-wise Linear Softmax Hybrid Attention
Ruibang Li,Guan Luo,Yiwei Zhang,Jin Gao,Bing Li,Weiming Hu
Main category: cs.CV
TL;DR: Proposes SoLA-Vision, a vision model that mixes linear and softmax attention at fine granularity, improving accuracy while keeping computational cost low.
Details
Motivation: Standard softmax self-attention is computationally expensive at high resolution, while linear attention is efficient but less expressive; a balance between their accuracy and efficiency is needed. Method: The differences between linear and softmax attention are analyzed from a layer-stacking perspective, layer-wise hybridization patterns are studied systematically, and the layer-wise hybrid SoLA-Vision architecture is proposed. Result: SoLA-Vision outperforms purely linear and other hybrid models on ImageNet-1K and dense prediction tasks, achieving better performance with only a small number of softmax layers. Conclusion: Fine-grained layer-wise hybrid attention outperforms rigid block designs and balances accuracy with computational efficiency. Abstract: Standard softmax self-attention excels in vision tasks but incurs quadratic complexity O(N^2), limiting high-resolution deployment. Linear attention reduces the cost to O(N), yet its compressed state representations can impair modeling capacity and accuracy. We present an analytical study that contrasts linear and softmax attention for visual representation learning from a layer-stacking perspective. We further conduct systematic experiments on layer-wise hybridization patterns of linear and softmax attention. Our results show that, compared with rigid intra-block hybrid designs, fine-grained layer-wise hybridization can match or surpass performance while requiring fewer softmax layers. Building on these findings, we propose SoLA-Vision (Softmax-Linear Attention Vision), a flexible layer-wise hybrid attention backbone that enables fine-grained control over how linear and softmax attention are integrated. By strategically inserting a small number of global softmax layers, SoLA-Vision achieves a strong trade-off between accuracy and computational cost. On ImageNet-1K, SoLA-Vision outperforms purely linear and other hybrid attention models. On dense prediction tasks, it consistently surpasses strong baselines by a considerable margin. Code will be released.
[68] Democratizing planetary-scale analysis: An ultra-lightweight Earth embedding database for accurate and flexible global land monitoring
Shuang Chen,Jie Wang,Shuai Yuan,Jiayang Li,Yu Xia,Yuanhong Liao,Junbo Wei,Jincheng Yuan,Xiaoqing Xu,Xiaolin Zhu,Peng Zhu,Hongsheng Zhang,Yuyu Zhou,Haohuan Fu,Huabing Huang,Bin Chen,Fan Dai,Peng Gong
Main category: cs.CV
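TL;DR: This paper presents ESD, an ultra-lightweight global Earth embedding database that compresses multi-source remote sensing data into information-dense latent vectors, achieving a roughly 340-fold reduction in data volume and supporting long-time-series, global-scale analysis on ordinary workstations.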
Details
Motivation: The enormous volume of satellite remote sensing data and its high computational and storage demands limit the accessibility of global-scale research, so an efficient, lightweight data representation is needed to lower the barrier to use. Method: Using Landsat and MODIS data, the ESDNet architecture and Finite Scalar Quantization (FSQ) convert high-dimensional observations into low-dimensional, quantized latent vectors in a unified latent space, condensing the annual phenological cycle into 12 temporal steps. Result: The data is compressed by a factor of about 340, so a single year of global land data needs only about 2.4 TB; reconstruction error is low (MAE: 0.0130), and land-cover classification accuracy reaches 79.74% (versus 76.92% for raw data), with strong few-shot learning capability and temporal consistency. Conclusion: ESD provides an efficient, accessible data foundation for global-scale Earth observation analysis, advances geospatial artificial intelligence, and helps democratize planetary-scale research. Abstract: The rapid evolution of satellite-borne Earth Observation (EO) systems has revolutionized terrestrial monitoring, yielding petabyte-scale archives. However, the immense computational and storage requirements for global-scale analysis often preclude widespread use, hindering planetary-scale studies. To address these barriers, we present Embedded Seamless Data (ESD), an ultra-lightweight, 30-m global Earth embedding database spanning the 25-year period from 2000 to 2024. By transforming high-dimensional, multi-sensor observations from the Landsat series (5, 7, 8, and 9) and MODIS Terra into information-dense, quantized latent vectors, ESD distills essential geophysical and semantic features into a unified latent space. Utilizing the ESDNet architecture and Finite Scalar Quantization (FSQ), the dataset achieves a transformative ~340-fold reduction in data volume compared to raw archives. This compression allows the entire global land surface for a single year to be encapsulated within approximately 2.4 TB, enabling decadal-scale global analysis on standard local workstations. Rigorous validation demonstrates high reconstructive fidelity (MAE: 0.0130; RMSE: 0.0179; CC: 0.8543). By condensing the annual phenological cycle into 12 temporal steps, the embeddings provide inherent denoising and a semantically organized space that outperforms raw reflectance in land-cover classification, achieving 79.74% accuracy (vs. 76.92% for raw fusion).
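The Finite Scalar Quantization step named in the abstract can be illustrated with a minimal NumPy sketch. This is not the ESD or ESDNet implementation: the function names `fsq_quantize` and `codes_to_index`, the level configuration `[5, 5, 3]`, and the restriction to odd level counts are all assumptions chosen for illustration. The idea shown is the general FSQ recipe: bound each latent dimension with `tanh`, round it to a small fixed set of integer levels, and read the resulting tuple as a single codebook index.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Quantize each latent dimension to a fixed number of integer levels.

    Assumption for this sketch: every entry of `levels` is odd, so the
    levels are symmetric around zero and no half-step offset is needed.
    """
    levels = np.asarray(levels)
    assert np.all(levels % 2 == 1), "this sketch only handles odd level counts"
    half = (levels - 1) / 2              # e.g. 5 levels -> codes in [-2, 2]
    bounded = np.tanh(z) * half          # squash each dimension into (-half, half)
    return np.round(bounded).astype(np.int64)

def codes_to_index(codes, levels):
    """Map a tuple of per-dimension codes to one integer codebook index."""
    levels = np.asarray(levels)
    digits = codes + (levels - 1) // 2   # shift codes to 0 .. L-1
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))  # mixed-radix weights
    return (digits * basis).sum(axis=-1)

# Hypothetical 3-dimensional latents with 5 * 5 * 3 = 75 possible codes.
levels = [5, 5, 3]
z = np.array([[0.0, 0.0, 0.0], [10.0, -10.0, 10.0]])
codes = fsq_quantize(z, levels)
print(codes)
print(codes_to_index(codes, levels))
```

Because the codebook is implicit in the level counts, each quantized vector can be stored as one small integer, which is the kind of property that makes large reductions in data volume plausible; the actual bit budget used by ESD is not stated in the abstract.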
With robust few-shot learning capabilities and longitudinal consistency, ESD provides a versatile foundation for democratizing planetary-scale research and advancing next-generation geospatial artificial intelligence.
[69] ATATA: One Algorithm to Align Them All
Boyi Pang,Savva Ignatyev,Vladimir Ippolitov,Ramil Khafizov,Yurii Melnik,Oleg Voynov,Maksim Nakhodnov,Aibek Alanov,Xiaopeng Fan,Peter Wonka,Evgeny Burnaev
Main category: cs.CV
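TL;DR: This paper proposes a new multi-modal joint inference algorithm based on Rectified Flow models for generating structurally aligned sample pairs; compared with existing methods such as SDS, it is faster, gives better structural alignment, and produces higher visual quality.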
Details
Motivation: Existing joint generation methods do not model the problem from a structural alignment perspective, and Score Distillation Sampling (SDS) is time-consuming, prone to mode collapse, and tends to give cartoonish results. Method: A multi-modal Rectified Flow joint inference algorithm based on the joint transport of a segment is proposed; it can be built on top of any Rectified Flow model operating on a structured latent space. Result: The method is validated on image, video, and 3D shape generation, showing a high degree of structural alignment and good visual quality; it reaches SOTA for image and video generation, and for 3D generation it gives comparable quality while running orders of magnitude faster. Conclusion: The method offers an efficient, high-quality approach to structurally aligned multi-modal joint generation and clearly outperforms existing editing-based and joint-inference-based methods. Abstract: We suggest a new multi-modal algorithm for joint inference of paired structurally aligned samples with Rectified Flow models. While some existing methods propose a codependent generation process, they do not view the problem of joint generation from a structural alignment perspective. Recent work uses Score Distillation Sampling to generate aligned 3D models, but SDS is known to be time-consuming, prone to mode collapse, and often provides cartoonish results. By contrast, our suggested approach relies on the joint transport of a segment in the sample space, yielding faster computation at inference time. Our approach can be built on top of an arbitrary Rectified Flow model operating on the structured latent space. We show the applicability of our method to the domains of image, video, and 3D shape generation using state-of-the-art baselines and evaluate it against both editing-based and joint inference-based competing approaches. We demonstrate a high degree of structural alignment for the sample pairs obtained with our method and a high visual quality of the samples. Our method improves the state-of-the-art for image and video generation pipelines. For 3D generation, it is able to show comparable quality while working orders of magnitude faster.
[70] Bio-inspired fine-tuning for selective transfer learning in image classification
Ana Davila,Jacinto Colan,Yasuhisa Hasegawa
Main category: cs.CV
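TL;DR: BioTune is an adaptive fine-tuning technique based on evolutionary optimization that improves transfer learning for cross-domain image classification and outperforms existing methods across multiple datasets and network architectures.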
Details
Motivation: Discrepancies between source and target domains degrade transfer learning; the aim is to address this and improve model adaptation when labeled data is limited. Method: BioTune uses an evolutionary algorithm to adaptively choose which network layers to freeze and to optimize learning rates for the unfrozen layers, giving more efficient fine-tuning. Result: BioTune is validated on nine image classification datasets (including specialized domains such as medical imaging) and four CNN architectures, with accuracy and efficiency better than state-of-the-art methods such as AutoRGN and LoRA; ablation studies analyze the effect of each component. Conclusion: BioTune adapts and generalizes well, copes with different data distributions and model architectures, and is an efficient and flexible fine-tuning framework for transfer learning. Abstract: Deep learning has significantly advanced image analysis across diverse domains but often depends on large, annotated datasets for success. Transfer learning addresses this challenge by utilizing pre-trained models to tackle new tasks with limited labeled data. However, discrepancies between source and target domains can hinder effective transfer learning. We introduce BioTune, a novel adaptive fine-tuning technique utilizing evolutionary optimization. BioTune enhances transfer learning by optimally choosing which layers to freeze and adjusting learning rates for unfrozen layers. Through extensive evaluation on nine image classification datasets, spanning natural and specialized domains such as medical imaging, BioTune demonstrates superior accuracy and efficiency over state-of-the-art fine-tuning methods, including AutoRGN and LoRA, highlighting its adaptability to various data characteristics and distribution changes. Additionally, BioTune consistently achieves top performance across four different CNN architectures, underscoring its flexibility. Ablation studies provide valuable insights into the impact of BioTune's key components on overall performance. The source code is available at https://github.com/davilac/BioTune.
[71] Image-Text Knowledge Modeling for Unsupervised Multi-Scenario Person Re-Identification
Zhiqi Pang,Lingling Zhao,Yang Liu,Chunyu Wang,Gaurav Sharma
Main category: cs.CV
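TL;DR: Proposes unsupervised multi-scenario (UMS) person re-identification as a new task and designs the image-text knowledge modeling (ITKM) framework, which uses vision-language models to improve ReID performance across multiple scenarios.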
Details
Motivation: Traditional person ReID methods are usually confined to a single or specific scenario and struggle with complex cases such as cross-resolution matching and clothing change; a unified unsupervised multi-scenario framework is lacking. Method: Proposes the three-stage ITKM framework: Stage I fine-tunes the image encoder with an added scenario embedding; Stage II optimizes text embeddings and associates them with pseudo-labels, adding a multi-scenario separation loss to increase the divergence between text representations; Stage III designs cluster-level and instance-level heterogeneous matching modules to obtain reliable positive pairs and uses a dynamic text representation update strategy to keep image and text supervision consistent. Result: Experiments across multiple scenarios show that ITKM outperforms existing scenario-specific methods and improves overall performance by integrating multi-scenario knowledge. Conclusion: ITKM effectively addresses unsupervised multi-scenario person ReID and confirms the feasibility and advantages of cross-scenario knowledge modeling with vision-language models. Abstract: We propose unsupervised multi-scenario (UMS) person re-identification (ReID) as a new task that expands ReID across diverse scenarios (cross-resolution, clothing change, etc.) within a single coherent framework. To tackle UMS-ReID, we introduce image-text knowledge modeling (ITKM) -- a three-stage framework that effectively exploits the representational power of vision-language models. We start with a pre-trained CLIP model with an image encoder and a text encoder. In Stage I, we introduce a scenario embedding in the image encoder and fine-tune the encoder to adaptively leverage knowledge from multiple scenarios. In Stage II, we optimize a set of learned text embeddings to associate with pseudo-labels from Stage I and introduce a multi-scenario separation loss to increase the divergence between inter-scenario text representations. In Stage III, we first introduce cluster-level and instance-level heterogeneous matching modules to obtain reliable heterogeneous positive pairs (e.g., a visible image and an infrared image of the same person) within each scenario. Next, we propose a dynamic text representation update strategy to maintain consistency between text and image supervision signals. Experimental results across multiple scenarios demonstrate the superiority and generalizability of ITKM; it not only outperforms existing scenario-specific methods but also enhances overall performance by integrating knowledge from multiple scenarios.[72] Language-Agnostic Visual Embeddings for Cross-Script Handwriting Retrieval
Fangke Chen,Tianhao Dong,Sirry Chen,Guobin Zhang,Yishu Zhang,Yining Chen
Main category: cs.CV
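TL;DR: Proposes a lightweight asymmetric dual-encoder framework for cross-lingual handwritten word retrieval that jointly optimizes instance-level alignment and class-level semantic consistency to learn unified, style-invariant visual embeddings.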
Details
Motivation: Existing large vision-language models are hard to deploy on edge devices because of their computational cost, and large handwriting variability and cross-lingual semantic gaps make handwritten word retrieval difficult. Method: Designs a lightweight asymmetric dual-encoder framework that jointly optimizes instance-level alignment and class-level semantic consistency, anchoring visual embeddings to language-agnostic semantic prototypes. Result: Outperforms 28 baselines on within-language retrieval and reaches SOTA performance; explicit cross-lingual retrieval confirms the effectiveness of the learned representations. Conclusion: With very few parameters, the framework delivers efficient, accurate handwritten word retrieval across languages and writing styles and is suited to deployment in resource-constrained environments. Abstract: Handwritten word retrieval is vital for digital archives but remains challenging due to large handwriting variability and cross-lingual semantic gaps. While large vision-language models offer potential solutions, their prohibitive computational costs hinder practical edge deployment. To address this, we propose a lightweight asymmetric dual-encoder framework that learns unified, style-invariant visual embeddings. By jointly optimizing instance-level alignment and class-level semantic consistency, our approach anchors visual embeddings to language-agnostic semantic prototypes, enforcing invariance across scripts and writing styles. Experiments show that our method outperforms 28 baselines and achieves state-of-the-art accuracy on within-language retrieval benchmarks. We further conduct explicit cross-lingual retrieval, where the query language differs from the target language, to validate the effectiveness of the learned cross-lingual representations. Achieving strong performance with only a fraction of the parameters required by existing models, our framework enables accurate and resource-efficient cross-script handwriting retrieval.[73] FTDMamba: Frequency-Assisted Temporal Dilation Mamba for Unmanned Aerial Vehicle Video Anomaly Detection
Cheng-Zhuang Liu,Si-Bao Chen,Qing-Ling Shu,Chris Ding,Jin Tang,Bin Luo
Main category: cs.CV
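TL;DR: Proposes the FTDMamba network for UAV video anomaly detection under dynamic backgrounds, modeling multi-scale spatiotemporal dependencies with frequency decoupling and a Temporal Dilation Mamba module, and builds MUVAD, described as the first large-scale dynamic-background UAV anomaly detection dataset.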
Details
Motivation: Existing video anomaly detection methods mainly target ground-based or UAV videos with static backgrounds; they struggle with the false alarms and missed detections caused by the coupling of object motion and the UAV's own motion under dynamic backgrounds, and they lack joint modeling of inter-frame continuity and local spatial correlations across multiple temporal scales. Method: Proposes the FTDMamba network with two core modules: (1) a Frequency Decoupled Spatiotemporal Correlation Module, which uses frequency-domain analysis to disentangle motion and model global spatiotemporal dependencies; and (2) a Temporal Dilation Mamba Module, which uses Mamba's sequence modeling to jointly learn fine-grained temporal dynamics and local spatial structures across multiple receptive fields. The authors also build MUVAD, a large-scale dynamic-background UAV dataset with 222,736 frames, 240 events, and 12 anomaly types. Result: FTDMamba reaches SOTA performance on two public static benchmarks and on the newly built MUVAD dataset. Conclusion: FTDMamba addresses the key challenges of UAV video anomaly detection under dynamic backgrounds, confirms the value of frequency decoupling and temporal dilation Mamba modeling, and moves the field toward more realistic and complex scenes. Abstract: Recent advances in video anomaly detection (VAD) mainly focus on ground-based surveillance or unmanned aerial vehicle (UAV) videos with static backgrounds, whereas research on UAV videos with dynamic backgrounds remains limited. Unlike static scenarios, dynamically captured UAV videos exhibit multi-source motion coupling, where the motion of objects and UAV-induced global motion are intricately intertwined. Consequently, existing methods may misclassify normal UAV movements as anomalies or fail to capture true anomalies concealed within dynamic backgrounds. Moreover, many approaches do not adequately address the joint modeling of inter-frame continuity and local spatial correlations across diverse temporal scales. To overcome these limitations, we propose the Frequency-Assisted Temporal Dilation Mamba (FTDMamba) network for UAV VAD, including two core components: (1) a Frequency Decoupled Spatiotemporal Correlation Module, which disentangles coupled motion patterns and models global spatiotemporal dependencies through frequency analysis; and (2) a Temporal Dilation Mamba Module, which leverages Mamba's sequence modeling capability to jointly learn fine-grained temporal dynamics and local spatial structures across multiple temporal receptive fields. Additionally, unlike existing UAV VAD datasets which focus on static backgrounds, we construct a large-scale Moving UAV VAD dataset (MUVAD), comprising 222,736 frames with 240 anomaly events across 12 anomaly types.
Extensive experiments demonstrate that FTDMamba achieves state-of-the-art (SOTA) performance on two public static benchmarks and the new MUVAD dataset. The code and MUVAD dataset will be available at: https://github.com/uavano/FTDMamba.[74] X-Distill: Cross-Architecture Vision Distillation for Visuomotor Learning
Maanping Shao,Feihong Zhang,Gu Zhang,Baiye Cheng,Zhengrong Xue,Huazhe Xu
Main category: cs.CV
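TL;DR: Proposes X-Distill, which uses cross-architecture knowledge distillation to transfer the visual representations of a large ViT (DINOv2) into a compact ResNet-18, improving robotic manipulation performance when data are scarce.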
Details
Motivation: Data are scarce in robot learning: large models such as ViTs generalize well but are hard to train, while small models such as CNNs are easy to optimize but less expressive, so the strengths of the two need to be balanced. Method: Uses offline cross-architecture knowledge distillation: DINOv2 is first distilled into ResNet-18 on ImageNet, and the distilled encoder is then jointly fine-tuned with a diffusion policy head on the target manipulation tasks. Result: On 34 simulated benchmarks and 5 real-world tasks, the method outperforms from-scratch ResNet and fine-tuned DINOv2 encoders, and surpasses 3D encoders that use point clouds or much larger VLMs. Conclusion: A simple cross-architecture distillation strategy can combine the representational power of large models with the data efficiency of small ones, achieving state-of-the-art robotic manipulation performance. Abstract: Visuomotor policies often leverage large pre-trained Vision Transformers (ViTs) for their powerful generalization capabilities. However, their significant data requirements present a major challenge in the data-scarce context of most robotic learning settings, where compact CNNs with strong inductive biases can be more easily optimized. To address this trade-off, we introduce X-Distill, a simple yet highly effective method that synergizes the strengths of both architectures. Our approach involves an offline, cross-architecture knowledge distillation, transferring the rich visual representations of a large, frozen DINOv2 teacher to a compact ResNet-18 student on the general-purpose ImageNet dataset. This distilled encoder, now endowed with powerful visual priors, is then jointly fine-tuned with a diffusion policy head on the target manipulation tasks. Extensive experiments on 34 simulated benchmarks and 5 challenging real-world tasks demonstrate that our method consistently outperforms policies equipped with from-scratch ResNet or fine-tuned DINOv2 encoders. Notably, X-Distill also surpasses 3D encoders that utilize privileged point cloud observations or much larger Vision-Language Models. Our work highlights the efficacy of a simple, well-founded distillation strategy for achieving state-of-the-art performance in data-efficient robotic manipulation.[75] Efficient On-Board Processing of Oblique UAV Video for Rapid Flood Extent Mapping
Vishisht Sharma,Sam Leroux,Lisa Landuyt,Nick Witvrouwen,Pieter Simoens
Main category: cs.CV
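TL;DR: Proposes Temporal Token Reuse (TTR), an adaptive inference framework that exploits the spatiotemporal redundancy of oblique aerial video to accelerate video segmentation on embedded devices, significantly reducing latency while maintaining high accuracy.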
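The token-reuse idea that this entry's abstract describes (compare each patch with the previous frame, recompute features only where the patch changed) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the mean-absolute-difference metric, the `threshold` value, and the `backbone` callable are all hypothetical stand-ins.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Patch = Sequence[float]   # flattened pixel values of one image patch (one token)
Feature = List[float]     # deep feature computed for one patch


def mean_abs_diff(a: Patch, b: Patch) -> float:
    """Lightweight change metric (assumed here): mean absolute pixel difference."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def reuse_tokens(
    patches: List[Patch],
    prev_patches: Optional[List[Patch]],
    prev_features: Optional[List[Feature]],
    backbone: Callable[[Patch], Feature],
    threshold: float = 0.05,  # hypothetical value, would need tuning
) -> Tuple[List[Feature], int]:
    """Return per-patch features and the number of backbone calls actually made."""
    features: List[Feature] = []
    calls = 0
    for i, patch in enumerate(patches):
        is_static = (
            prev_patches is not None
            and prev_features is not None
            and mean_abs_diff(patch, prev_patches[i]) < threshold
        )
        if is_static:
            # Static region: propagate the cached feature, skip the backbone.
            features.append(prev_features[i])
        else:
            # Changed region (or first frame): recompute the feature.
            features.append(backbone(patch))
            calls += 1
    return features, calls
```

On the first frame every patch goes through the backbone; on later frames the call count drops in proportion to how much of the scene is static, which is where a latency saving would come from.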
Details
Motivation: Because of UAVs' strict Size, Weight, and Power (SWaP) constraints, processing high-resolution oblique aerial video on edge devices is computationally dense, and real-time inference is hard to achieve. Method: TTR represents image patches as tokens, uses a lightweight similarity metric to dynamically identify static regions, and propagates their precomputed deep features, skipping redundant backbone computation. Result: Validation on standard benchmarks and a newly built oblique floodwater dataset shows that TTR reduces inference latency by 30% on edge hardware with minimal loss in segmentation accuracy (mIoU drop below 0.5%). Conclusion: TTR shifts the Pareto frontier of operational efficiency, making high-quality, real-time oblique video understanding possible for time-critical remote sensing missions. Abstract: Effective disaster response relies on rapid disaster response, where oblique aerial video is the primary modality for initial scouting due to its ability to maximize spatial coverage and situational awareness in limited flight time. However, the on-board processing of high-resolution oblique streams is severely bottlenecked by the strict Size, Weight, and Power (SWaP) constraints of Unmanned Aerial Vehicles (UAVs). The computational density required to process these wide-field-of-view streams precludes low-latency inference on standard edge hardware. To address this, we propose Temporal Token Reuse (TTR), an adaptive inference framework capable of accelerating video segmentation on embedded devices. TTR exploits the intrinsic spatiotemporal redundancy of aerial video by formulating image patches as tokens; it utilizes a lightweight similarity metric to dynamically identify static regions and propagate their precomputed deep features, thereby bypassing redundant backbone computations. We validate the framework on standard benchmarks and a newly curated Oblique Floodwater Dataset designed for hydrological monitoring. Experimental results on edge-grade hardware demonstrate that TTR achieves a 30% reduction in inference latency with negligible degradation in segmentation accuracy (< 0.5% mIoU). These findings confirm that TTR effectively shifts the operational Pareto frontier, enabling high-fidelity, real-time oblique video understanding for time-critical remote sensing missions.[76] SAMannot: A Memory-Efficient, Local, Open-source Framework for Interactive Video Instance Segmentation based on SAM2
Gergely Dinya,András Gelencsér,Krisztina Kupán,Clemens Küpper,Kristóf Karacs,Anna Gelencsér-Horváth
Main category: cs.CV
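TL;DR: SAMannot is an open-source, local video instance segmentation framework that combines Segment Anything Model 2 (SAM2) with a human-in-the-loop workflow to support efficient, privacy-preserving, high-precision video annotation.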
Details
Motivation: Existing video annotation approaches are limited by time-consuming manual labeling, costly commercial platforms, or the privacy risks of cloud services; research lacks a solution that balances efficiency, cost, and privacy. Method: Develops the SAMannot framework, modifying the SAM2 dependency and adding a processing layer to reduce computational overhead; designs an automated "lock-and-refine" workflow with barrier frames and a mask-skeletonization-based auto-prompting mechanism, and implements persistent instance identity management. Result: The framework is validated on animal behavior tracking use cases and on subsets of the LVOS and DAVIS datasets; it supports generating datasets in YOLO and PNG formats together with structured interaction logs, and is responsive and scalable. Conclusion: SAMannot offers a scalable, private, and cost-effective alternative for video annotation, suited to high-fidelity video instance segmentation in complex research settings. Abstract: Current research workflows for precise video segmentation are often forced into a compromise between labor-intensive manual curation, costly commercial platforms, and/or privacy-compromising cloud-based services. The demand for high-fidelity video instance segmentation in research is often hindered by the bottleneck of manual annotation and the privacy concerns of cloud-based tools. We present SAMannot, an open-source, local framework that integrates the Segment Anything Model 2 (SAM2) into a human-in-the-loop workflow. To address the high resource requirements of foundation models, we modified the SAM2 dependency and implemented a processing layer that minimizes computational overhead and maximizes throughput, ensuring a highly responsive user interface. Key features include persistent instance identity management, an automated "lock-and-refine" workflow with barrier frames, and a mask-skeletonization-based auto-prompting mechanism. SAMannot facilitates the generation of research-ready datasets in YOLO and PNG formats alongside structured interaction logs. Verified through animal behavior tracking use-cases and subsets of the LVOS and DAVIS benchmark datasets, the tool provides a scalable, private, and cost-effective alternative to commercial platforms for complex video annotation tasks.[77] Context-Aware Semantic Segmentation via Stage-Wise Attention
Antoine Carreaud,Elias Naha,Arthur Chansel,Nina Lahellec,Jan Skaloud,Adrien Gressin
Main category: cs.CV
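TL;DR: Proposes CASWiT, a dual-branch Swin Transformer architecture for semantic segmentation of ultra-high-resolution images; with a cross-scale fusion module and a SimMIM-style pretraining strategy, it achieves leading performance on remote sensing datasets.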
Details
Motivation: When processing ultra-high-resolution images, Transformer models face memory consumption that grows quadratically with the number of tokens, which limits either contextual scope or spatial resolution; a method that combines global context with fine-grained features is therefore needed. Method: Proposes CASWiT, which comprises a context encoder that processes a downsampled neighborhood and a high-resolution encoder that processes high-resolution image patches; a cross-scale fusion module combines cross-attention with gated feature injection; and a SimMIM-style pretraining scheme masks 75% of the high-resolution tokens together with the corresponding low-resolution center region and reconstructs the original UHR image. Result: Reaches 65.83% mIoU on the IGN FLAIR-HUB dataset, 1.78 points above RGB baselines, and 49.1% mIoU on the URUR dataset, 0.9% above the current SoTA. Conclusion: Through its dual-branch structure and cross-scale feature fusion, CASWiT markedly improves semantic segmentation of ultra-high-resolution images and shows its effectiveness on large remote sensing datasets. Abstract: Semantic ultra high resolution image (UHR) segmentation is essential in remote sensing applications such as aerial mapping and environmental monitoring. Transformer-based models struggle in this setting because memory grows quadratically with token count, constraining either the contextual scope or the spatial resolution. We introduce CASWiT (Context-Aware Stage-Wise Transformer), a dual-branch, Swin-based architecture that injects global cues into fine-grained UHR features. A context encoder processes a downsampled neighborhood to capture long-range dependencies, while a high resolution encoder extracts detailed features from UHR patches. A cross-scale fusion module, combining cross-attention and gated feature injection, enriches high-resolution tokens with context. Beyond architecture, we propose a SimMIM-style pretraining. We mask 75% of the high-resolution image tokens and the low-resolution center region that spatially corresponds to the UHR patch, then train the shared dual-encoder with small decoder to reconstruct the UHR initial image. Extensive experiments on the large-scale IGN FLAIR-HUB aerial dataset demonstrate the effectiveness of CASWiT. Our method achieves 65.83% mIoU, outperforming RGB baselines by 1.78 points. On URUR, CASWiT achieves 49.1% mIoU, surpassing the current SoTA by +0.9% under the official evaluation protocol. All codes are provided on: https://huggingface.co/collections/heig-vd-geo/caswit.[78] Enhancing Vision Language Models with Logic Reasoning for Situational Awareness
Pavana Pradeep,Krishna Kant,Suya Yu
Main category: cs.CV
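TL;DR: Proposes an approach that combines vision-language models (VLMs) with traditional computer vision methods and strengthens situational awareness through logic reasoning, improving fine-grained event detail extraction, accuracy through intelligent fine-tuning, and the generation of justifications for VLM outputs.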
Details
Motivation: Situational awareness applications need to identify rare but important events with high reliability and accuracy, extract fine-grained information, and assess recognition quality; existing VLMs fall short in these respects. Method: Combines VLMs with traditional computer vision methods, introduces an explicit logic reasoning mechanism, applies an intelligent fine-tuning strategy to raise accuracy, and generates explanations of the VLM outputs during inference. Result: The proposed intelligent fine-tuning mechanism substantially improves model accuracy over uninformed selection and, at inference time, provides grounds for confirming or questioning the validity of an output. Conclusion: By integrating VLMs with traditional methods and adding logic reasoning and intelligent fine-tuning, the approach improves the accuracy, interpretability, and reliability of situational awareness. Abstract: Vision-Language Models (VLMs) offer the ability to generate high-level, interpretable descriptions of complex activities from images and videos, making them valuable for situational awareness (SA) applications. In such settings, the focus is on identifying infrequent but significant events with high reliability and accuracy, while also extracting fine-grained details and assessing recognition quality. In this paper, we propose an approach that integrates VLMs with traditional computer vision methods through explicit logic reasoning to enhance SA in three key ways: (a) extracting fine-grained event details, (b) employing an intelligent fine-tuning (FT) strategy that achieves substantially higher accuracy than uninformed selection, and (c) generating justifications for VLM outputs during inference. We demonstrate that our intelligent FT mechanism improves the accuracy and provides a valuable means, during inferencing, to either confirm the validity of the VLM output or indicate why it may be questionable.[79] Beer-Lambert Autoencoder for Unsupervised Stain Representation Learning and Deconvolution in Multi-immunohistochemical Brightfield Histology Images
Mark Eastwood,Thomas McKee,Zedong Hu,Sabine Tejpar,Fayyaz Minhas
Main category: cs.CV
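TL;DR: Proposes a data-driven encoder-decoder architecture for stain separation in multiplex immunohistochemistry (mIHC) images that separates stains more cleanly and reduces crosstalk compared with traditional methods.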
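The decoder in this entry is described in its abstract as a differentiable Beer-Lambert forward model with a stain matrix. A minimal per-pixel sketch of that forward model is shown here, assuming the common convention that optical density adds linearly across stains and that intensity decays exponentially with it. The natural-log base, the `i0` background intensity, and the example stain rows are assumptions made for illustration; they are not values from the paper.

```python
import math
from typing import List, Sequence


def beer_lambert_rgb(
    concentrations: Sequence[float],        # K non-negative stain concentrations
    stain_matrix: Sequence[Sequence[float]],  # K rows of per-channel optical density
    i0: float = 255.0,                      # assumed background (unstained) intensity
) -> List[float]:
    """Render one RGB pixel: I_c = I0 * exp(-sum_k c_k * M[k][c])."""
    rgb = []
    for channel in range(3):
        # Total optical density in this channel is a linear mix of the stains.
        od = sum(c * row[channel] for c, row in zip(concentrations, stain_matrix))
        rgb.append(i0 * math.exp(-od))
    return rgb
```

With K greater than 3 stains, many different concentration vectors give the same three channel values, which is the under-determination that the abstract attributes to classical deconvolution for multiplex panels.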
Details
Motivation: In multiplex immunohistochemistry (mIHC), once more than three stains are present the classical Beer-Lambert method becomes under-determined and unstable, and the contributions of the individual stains are hard to separate. Method: Designs a compact U-Net encoder that predicts K non-negative concentration channels, paired with a differentiable Beer-Lambert forward model as the decoder and a learnable stain matrix; training is unsupervised, using a perceptual reconstruction loss plus regularization terms that discourage stain mixing. Result: On colorectal mIHC data with 5 stains, the method achieves excellent RGB reconstruction and significantly reduces inter-channel bleed-through compared with traditional matrix-based deconvolution. Conclusion: The method separates multiple stains in mIHC images effectively and is applicable to practical pathology image analysis. Abstract: Separating the contributions of individual chromogenic stains in RGB histology whole slide images (WSIs) is essential for stain normalization, quantitative assessment of marker expression, and cell-level readouts in immunohistochemistry (IHC). Classical Beer-Lambert (BL) color deconvolution is well-established for two- or three-stain settings, but becomes under-determined and unstable for multiplex IHC (mIHC) with K>3 chromogens. We present a simple, data-driven encoder-decoder architecture that learns cohort-specific stain characteristics for mIHC RGB WSIs and yields crisp, well-separated per-stain concentration maps. The encoder is a compact U-Net that predicts K nonnegative concentration channels; the decoder is a differentiable BL forward model with a learnable stain matrix initialized from typical chromogen hues. Training is unsupervised with a perceptual reconstruction objective augmented by loss terms that discourage unnecessary stain mixing. On a colorectal mIHC panel comprising 5 stains (H, CDX2, MUC2, MUC5, CD8) we show excellent RGB reconstruction, and significantly reduced inter-channel bleed-through compared with matrix-based deconvolution. Code and model are available at https://github.com/measty/StainQuant.git.[80] Assessing Building Heat Resilience Using UAV and Street-View Imagery with Coupled Global Context Vision Transformer
Steffen Knoblauch,Ram Kumar Muthusamy,Hao Li,Iddy Chazua,Benedcto Adamu,Innocent Maholi,Alexander Zipf
Main category: cs.CV
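TL;DR: Proposes a machine learning framework (CGCViT) that fuses UAV and street-view imagery to assess heat-relevant attributes of urban buildings; combined with HotSat-1 thermal infrared data, it relates materials to heat exposure, finds that vegetation, light colors, and certain roofing materials are associated with lower heat risk, and identifies household-level inequalities in heat exposure linked to socio-economic disadvantage in Dar es Salaam, Tanzania.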
Details
Motivation: Urbanization and climate change are intensifying human heat exposure in urban centers of the Global South, and low-cost construction materials and high thermal-mass surfaces make it worse, yet scalable methods for assessing heat-relevant building attributes are lacking. Method: Proposes a dual-modality cross-view learning framework built on a coupled global context vision transformer (CGCViT) that fuses openly available UAV and street-view (SV) imagery, and uses thermal infrared (TIR) data from HotSat-1 to relate building characteristics to heat-related health risks. Result: The dual-modality model outperforms the best single-modality model by up to 9.3%; surrounding vegetation, brighter roofing, and roofing made of concrete, clay, or wood (versus metal or tarpaulin) are significantly associated with lower TIR values; applied to Dar es Salaam, the framework identifies building-level differences in heat exposure linked to socio-economic disadvantage. Conclusion: Localized, data-driven risk assessment is essential for designing equitable climate adaptation strategies, and the framework offers a scalable, socially equitable technical path for managing urban heat risk. Abstract: Climate change is intensifying human heat exposure, particularly in densely built urban centers of the Global South. Low-cost construction materials and high thermal-mass surfaces further exacerbate this risk. Yet scalable methods for assessing such heat-relevant building attributes remain scarce. We propose a machine learning framework that fuses openly available unmanned aerial vehicle (UAV) and street-view (SV) imagery via a coupled global context vision transformer (CGCViT) to learn heat-relevant representations of urban structures. Thermal infrared (TIR) measurements from HotSat-1 are used to quantify the relationship between building attributes and heat-associated health risks. Our dual-modality cross-view learning approach outperforms the best single-modality models by up to 9.3%, demonstrating that UAV and SV imagery provide valuable complementary perspectives on urban structures. The presence of vegetation surrounding buildings (versus no vegetation), brighter roofing (versus darker roofing), and roofing made of concrete, clay, or wood (versus metal or tarpaulin) are all significantly associated with lower HotSat-1 TIR values. Deployed across the city of Dar es Salaam, Tanzania, the proposed framework illustrates how household-level inequalities in heat exposure - often linked to socio-economic disadvantage and reflected in building materials - can be identified and addressed using machine learning.
Our results point to the critical role of localized, data-driven risk assessment in shaping climate adaptation strategies that deliver equitable outcomes.[81] Think-Clip-Sample: Slow-Fast Frame Selection for Video Understanding
Wenhui Tan,Ruihua Song,Jiaze Li,Jianzhong Ju,Zhenbo Luo
Main category: cs.CV
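TL;DR: Proposes TCS, a training-free framework that improves long-video understanding in multi-modal large language models through multi-query reasoning and clip-level slow-fast sampling, raising performance on several benchmarks and cutting inference time by 50%.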
Details
Motivation: Existing MLLMs are limited in long-video understanding by computational cost and poor frame selection, and struggle to capture both local detail and global context. Method: Proposes the Think-Clip-Sample (TCS) framework, which combines Multi-Query Reasoning, generating several complementary queries, with a Clip-level Slow-Fast Sampling strategy that adaptively balances dense local detail and sparse global context. Result: Experiments on MLVU, LongVideoBench, and VideoMME show that TCS improves accuracy by up to 6.9% and reaches comparable performance with 50% less inference time. Conclusion: TCS improves both the efficiency and the effectiveness of multi-modal large models on long-video understanding and is general and practical. Abstract: Recent progress in multi-modal large language models (MLLMs) has significantly advanced video understanding. However, their performance on long-form videos remains limited by computational constraints and suboptimal frame selection. We present Think-Clip-Sample (TCS), a training-free framework that enhances long video understanding through two key components: (i) Multi-Query Reasoning, which generates multiple queries to capture complementary aspects of the question and video; and (ii) Clip-level Slow-Fast Sampling, which adaptively balances dense local details and sparse global context. Extensive experiments on MLVU, LongVideoBench, and VideoMME demonstrate that TCS consistently improves performance across different MLLMs, boosting accuracy by up to 6.9%, and is capable of achieving comparable accuracy with 50% less inference time, highlighting both the efficiency and efficacy of TCS on long video understanding.[82] Heterogeneous Uncertainty-Guided Composed Image Retrieval with Fine-Grained Probabilistic Learning
Haomiao Tang,Jinpeng Wang,Minyi Zhao,Guanghao Meng,Ruisheng Luo,Long Chen,Shu-Tao Xia
Main category: cs.CV
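TL;DR: Proposes a Heterogeneous Uncertainty-Guided (HUG) paradigm to address intrinsic noise and uncertainty in composed image retrieval (CIR), improving model robustness and retrieval performance through a fine-grained probabilistic learning framework.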
Details
Motivation: Intrinsic noise in CIR triplets induces uncertainty and threatens model robustness; existing probabilistic learning methods underperform on CIR because they model instances holistically and treat queries and targets homogeneously. Method: Proposes the HUG paradigm, which uses a fine-grained probabilistic learning framework that represents queries and targets as Gaussian embeddings capturing concepts and uncertainties; tailors heterogeneous uncertainty estimation to multi-modal queries and uni-modal targets separately; designs a dynamic weighting mechanism to fuse multiple uncertainties; and introduces uncertainty-guided objectives with fine-grained contrastive learning strategies. Result: Experiments on multiple benchmarks show that HUG significantly outperforms current state-of-the-art baselines, and analysis supports the value of its technical contributions. Conclusion: Through heterogeneous uncertainty modeling and fine-grained contrastive learning, HUG improves the robustness and accuracy of CIR and offers a new way to handle noise and uncertainty in real-world settings. Abstract: Composed Image Retrieval (CIR) enables image search by combining a reference image with modification text. Intrinsic noise in CIR triplets incurs intrinsic uncertainty and threatens the model's robustness. Probabilistic learning approaches have shown promise in addressing such issues; however, they fall short for CIR due to their instance-level holistic modeling and homogeneous treatment of queries and targets. This paper introduces a Heterogeneous Uncertainty-Guided (HUG) paradigm to overcome these limitations. HUG utilizes a fine-grained probabilistic learning framework, where queries and targets are represented by Gaussian embeddings that capture detailed concepts and uncertainties. We customize heterogeneous uncertainty estimations for multi-modal queries and uni-modal targets. Given a query, we capture uncertainties not only regarding uni-modal content quality but also multi-modal coordination, followed by a provable dynamic weighting mechanism to derive comprehensive query uncertainty. We further design uncertainty-guided objectives, including query-target holistic contrast and fine-grained contrasts with comprehensive negative sampling strategies, which effectively enhance discriminative learning. Experiments on benchmarks demonstrate HUG's effectiveness beyond state-of-the-art baselines, with faithful analysis justifying the technical contributions.[83] SUG-Occ: An Explicit Semantics and Uncertainty Guided Sparse Learning Framework for Real-Time 3D Occupancy Prediction
Hanlin Wu,Pengfei Lin,Ehsan Javanmardi,Nanren Bao,Bo Qian,Hao Si,Manabu Tsukada
Main category: cs.CV
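TL;DR: Proposes SUG-Occ, a semantics and uncertainty guided sparse learning framework for efficient 3D semantic occupancy prediction that balances accuracy and computational efficiency.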
Details
Motivation: 3D semantic occupancy prediction provides fine-grained scene understanding, but its computation and memory costs are high and real-time deployment is difficult; the sparsity of 3D scenes should be exploited to cut redundant computation. Method: Proposes the SUG-Occ framework: (1) semantic and uncertainty priors suppress projections from free space, and an unsigned distance encoding strengthens geometric consistency; (2) a cascade sparse completion module uses hyper cross sparse convolution and generative upsampling for coarse-to-fine reasoning; (3) a lightweight OCR-based mask decoder refines voxel-wise predictions through query-context interactions. Result: Experiments on the SemanticKITTI dataset show a 7.34% improvement in accuracy and a 57.8% gain in efficiency over the baselines. Conclusion: By explicitly using semantics and uncertainty guided sparse learning, SUG-Occ markedly improves efficiency while keeping accuracy high, advancing the practical use of 3D occupancy prediction. Abstract: As autonomous driving moves toward full scene understanding, 3D semantic occupancy prediction has emerged as a crucial perception task, offering voxel-level semantics beyond traditional detection and segmentation paradigms. However, such a refined representation for scene understanding incurs prohibitive computation and memory overhead, posing a major barrier to practical real-time deployment. To address this, we propose SUG-Occ, an explicit Semantics and Uncertainty Guided Sparse Learning Enabled 3D Occupancy Prediction Framework, which exploits the inherent sparsity of 3D scenes to reduce redundant computation while maintaining geometric and semantic completeness. Specifically, we first utilize semantic and uncertainty priors to suppress projections from free space during view transformation while employing an explicit unsigned distance encoding to enhance geometric consistency, producing a structurally consistent sparse 3D representation. Secondly, we design a cascade sparse completion module via hyper cross sparse convolution and generative upsampling to enable efficient coarse-to-fine reasoning. Finally, we devise an object contextual representation (OCR) based mask decoder that aggregates global semantic context from sparse features and refines voxel-wise predictions via lightweight query-context interactions, avoiding expensive attention operations over volumetric features.
Extensive experiments on the SemanticKITTI benchmark demonstrate that the proposed approach outperforms the baselines, achieving a 7.34% improvement in accuracy and a 57.8% gain in efficiency.[84] Wetland mapping from sparse annotations with satellite image time series and temporal-aware segment anything model
Shuai Yuan,Tianwu Lin,Shuang Chen,Yu Xia,Peng Qin,Xiangyu Liu,Xiaoqing Xu,Nan Xu,Hongsheng Zhang,Jie Wang,Peng Gong
Main category: cs.CV
TL;DR: 本文提出WetSAM,一种基于SAM框架并结合卫星时间序列的湿地制图方法,在稀疏点监督下实现高精度、结构一致的分割,显著优于现有方法。
Details
Motivation: Existing deep learning models perform poorly with sparse labels, single-date imagery cannot cope with the seasonal and inter-annual dynamics of wetlands, and foundation models such as SAM cannot model temporal information, which leads to fragmented segmentation. Method: Proposes WetSAM with a dual-branch architecture: a temporally prompted branch extends SAM with hierarchical adapters and dynamic temporal aggregation to disentangle phenological variability; a spatial branch uses a temporally constrained region-growing strategy to generate reliable pseudo-labels; and a bidirectional consistency regularization jointly optimizes the two branches. Result: Experiments across eight global regions (about 5,000 km² each) show that WetSAM reaches an average F1-score of 85.58%, significantly outperforming existing methods, and delivers accurate, structurally consistent wetland segmentation with minimal labeling cost. Conclusion: WetSAM shows strong generalization and temporal modeling under sparse supervision and has potential for large-scale, low-cost, high-resolution wetland monitoring. Abstract: Accurate wetland mapping is essential for ecosystem monitoring, yet dense pixel-level annotation is prohibitively expensive and practical applications usually rely on sparse point labels, under which existing deep learning models perform poorly, while strong seasonal and inter-annual wetland dynamics further render single-date imagery inadequate and lead to significant mapping errors; although foundation models such as SAM show promising generalization from point prompts, they are inherently designed for static images and fail to model temporal information, resulting in fragmented masks in heterogeneous wetlands. To overcome these limitations, we propose WetSAM, a SAM-based framework that integrates satellite image time series for wetland mapping from sparse point supervision through a dual-branch design, where a temporally prompted branch extends SAM with hierarchical adapters and dynamic temporal aggregation to disentangle wetland characteristics from phenological variability, and a spatial branch employs a temporally constrained region-growing strategy to generate reliable dense pseudo-labels, while a bidirectional consistency regularization jointly optimizes both branches.
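As a rough illustration of what a temporally constrained region-growing step might look like, the sketch below grows a region from one labeled point and admits a neighboring pixel only if its whole time series stays close to the seed's. It is not the WetSAM procedure: the mean-absolute-difference criterion, the `max_dist` threshold, and the 4-connectivity are assumptions chosen for brevity.

```python
from collections import deque
from typing import List, Set, Tuple

Series = List[float]     # one pixel's values over the image time series
Pixel = Tuple[int, int]  # (row, col)


def series_distance(a: Series, b: Series) -> float:
    """Assumed similarity criterion: mean absolute difference over time."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def grow_region(cube: List[List[Series]], seed: Pixel, max_dist: float) -> Set[Pixel]:
    """Breadth-first growth from a point label into a dense pseudo-label region."""
    rows, cols = len(cube), len(cube[0])
    reference = cube[seed[0]][seed[1]]
    region: Set[Pixel] = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region:
                # Temporal constraint: the neighbor's series must match the seed's.
                if series_distance(cube[nr][nc], reference) <= max_dist:
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region
```

Comparing full time series rather than single-date values is what keeps a seasonally flooded pixel from being merged with a permanently dry neighbor that happens to look similar on one date.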
Extensive experiments across eight global regions of approximately 5,000 km² each demonstrate that WetSAM substantially outperforms state-of-the-art methods, achieving an average F1-score of 85.58%, and delivering accurate and structurally consistent wetland segmentation with minimal labeling effort, highlighting its strong generalization capability and potential for scalable, low-cost, high-resolution wetland mapping.[85] SME-YOLO: A Real-Time Detector for Tiny Defect Detection on PCB Surfaces
Meng Han
Main category: cs.CV
TL;DR: Proposes SME-YOLO, a small-target multi-scale enhanced detection framework based on YOLOv11n, to improve detection accuracy for tiny defects on PCB surfaces.
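One component listed for this entry is the Normalized Wasserstein Distance loss, motivated by IoU's sensitivity to small positional shifts on tiny boxes. The sketch below uses the commonly cited formulation in which each box `(cx, cy, w, h)` is modeled as a 2D Gaussian and the distance is mapped through an exponential. Whether SME-YOLO uses exactly this form is an assumption, and the constant `c` is dataset dependent; 12.8 here is only a placeholder.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h)


def nwd(a: Box, b: Box, c: float = 12.8) -> float:
    """Normalized Wasserstein distance between two boxes modeled as Gaussians."""
    (ax, ay, aw, ah), (bx, by, bw, bh) = a, b
    # Squared 2-Wasserstein distance between N(center, diag(w^2/4, h^2/4)) Gaussians.
    w2_sq = (ax - bx) ** 2 + (ay - by) ** 2 + ((aw - bw) / 2) ** 2 + ((ah - bh) / 2) ** 2
    return math.exp(-math.sqrt(w2_sq) / c)


def nwd_loss(a: Box, b: Box, c: float = 12.8) -> float:
    """Loss form: 0 for identical boxes, approaching 1 as boxes move apart."""
    return 1.0 - nwd(a, b, c)


def iou(a: Box, b: Box) -> float:
    """Plain IoU for comparison, on the same (cx, cy, w, h) boxes."""
    ax0, ax1, ay0, ay1 = a[0] - a[2] / 2, a[0] + a[2] / 2, a[1] - a[3] / 2, a[1] + a[3] / 2
    bx0, bx1, by0, by1 = b[0] - b[2] / 2, b[0] + b[2] / 2, b[1] - b[3] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0
```

For two 4x4 boxes whose centers are 4 pixels apart, IoU is already zero and gives no gradient signal about how far apart they are, whereas the NWD value still varies smoothly with the offset.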
Details
Motivation: PCB defects are usually small, similar in texture, and unevenly distributed, so traditional methods struggle to detect them with high precision and are particularly weak at localizing and recognizing tiny defects. Method: Introduces three improvements: (1) a Normalized Wasserstein Distance loss (NWDLoss) that reduces IoU's sensitivity to positional deviations of tiny objects; (2) an Efficient Upsampling Convolution Block (EUCB) that replaces the original upsampling module and uses multi-scale convolutions to better recover spatial resolution and preserve edge and texture details; (3) a Multi-Scale Focused Attention (MSFA) module that adaptively strengthens perception within key scale intervals and fuses local fine-grained features with global context. Result: Experiments on the PKU-PCB dataset show that, compared with the YOLOv11n baseline, SME-YOLO improves mAP by 2.2% and Precision by 4%, reaching state-of-the-art performance. Conclusion: Through loss function optimization, an improved upsampling structure, and attention mechanism design, SME-YOLO markedly improves detection accuracy for tiny PCB defects. Abstract: Surface defects on Printed Circuit Boards (PCBs) directly compromise product reliability and safety. However, achieving high-precision detection is challenging because PCB defects are typically characterized by tiny sizes, high texture similarity, and uneven scale distributions. To address these challenges, this paper proposes a novel framework based on YOLOv11n, named SME-YOLO (Small-target Multi-scale Enhanced YOLO). First, we employ the Normalized Wasserstein Distance Loss (NWDLoss). This metric effectively mitigates the sensitivity of Intersection over Union (IoU) to positional deviations in tiny objects. Second, the original upsampling module is replaced by the Efficient Upsampling Convolution Block (EUCB). By utilizing multi-scale convolutions, the EUCB gradually recovers spatial resolution and enhances the preservation of edge and texture details for tiny defects. Finally, this paper proposes the Multi-Scale Focused Attention (MSFA) module. Tailored to the specific spatial distribution of PCB defects, this module adaptively strengthens perception within key scale intervals, achieving efficient fusion of local fine-grained features and global context information. Experimental results on the PKU-PCB dataset demonstrate that SME-YOLO achieves state-of-the-art performance. Specifically, compared to the baseline YOLOv11n, SME-YOLO improves mAP by 2.2% and Precision by 4%, validating the effectiveness of the proposed method.[86] Topology-Guaranteed Image Segmentation: Enforcing Connectivity, Genus, and Width Constraints
Wenxiao Li,Xue-Cheng Tai,Jun Liu
Main category: cs.CV
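TL;DR: Proposes a new topological description framework that incorporates width information; by enhancing persistent homology with PDE smoothing ideas, it supports variational image segmentation and neural network design while preserving connectivity, genus, and geometric features such as thickness.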
Details
Motivation: Traditional topological priors cannot describe the width of structures (such as thickness and length), which limits how accurately they model real structures in image segmentation and makes it hard to preserve topological and geometric properties together. Method: Proposes a topological framework that incorporates width information: PDE smoothing modifies the local extrema of upper-level sets so that persistent homology captures width features, and the enhanced description is embedded in variational segmentation models and neural networks, with joint optimization through constraints on topological energies. Result: Experiments show that the method preserves topological invariants of the segmentation such as connectivity and genus while accurately retaining width properties such as line thickness and length. Conclusion: The method integrates width information into the description of topological structures, keeping topology and key geometric attributes consistent in image segmentation and improving structural accuracy. Abstract: Existing research highlights the crucial role of topological priors in image segmentation, particularly in preserving essential structures such as connectivity and genus. Accurately capturing these topological features often requires incorporating width-related information, including the thickness and length inherent to the image structures. However, traditional mathematical definitions of topological structures lack this dimensional width information, limiting methods like persistent homology from fully addressing practical segmentation needs. To overcome this limitation, we propose a novel mathematical framework that explicitly integrates width information into the characterization of topological structures. This method leverages persistent homology, complemented by smoothing concepts from partial differential equations (PDEs), to modify local extrema of upper-level sets. This approach enables the resulting topological structures to inherently capture width properties. We incorporate this enhanced topological description into variational image segmentation models. Using suitable loss functions, we are also able to design neural networks that can segment images with the required topological and width properties. Through variational constraints on the relevant topological energies, our approach successfully preserves essential topological invariants such as connectivity and genus counts, simultaneously ensuring that segmented structures retain critical width attributes, including line thickness and length.
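The intuition that smoothing makes level-set topology sensitive to width can be seen in a toy example. The sketch below is not the paper's persistent-homology construction; it only counts connected components of an upper-level set before and after a few explicit heat-equation steps. The grid size, step size `dt`, step count, and threshold are arbitrary choices for illustration.

```python
from collections import deque
from typing import List

Grid = List[List[float]]


def heat_step(u: Grid, dt: float = 0.2) -> Grid:
    """One explicit finite-difference step of u_t = Laplacian(u), replicated borders."""
    rows, cols = len(u), len(u[0])
    out = [row[:] for row in u]
    for r in range(rows):
        for c in range(cols):
            neighbours = (
                u[max(r - 1, 0)][c] + u[min(r + 1, rows - 1)][c]
                + u[r][max(c - 1, 0)] + u[r][min(c + 1, cols - 1)]
            )
            out[r][c] = u[r][c] + dt * (neighbours - 4.0 * u[r][c])
    return out


def smooth(u: Grid, steps: int, dt: float = 0.2) -> Grid:
    for _ in range(steps):
        u = heat_step(u, dt)
    return u


def count_components(u: Grid, level: float) -> int:
    """Number of 4-connected components of the upper-level set {u >= level}."""
    rows, cols = len(u), len(u[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if u[r][c] >= level and not seen[r][c]:
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and not seen[ny][nx] and u[ny][nx] >= level):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def demo_image(rows: int = 8, cols: int = 40) -> Grid:
    """A 1-pixel-wide line (column 5) and a 7-pixel-wide bar (columns 20 to 26)."""
    img = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        img[r][5] = 1.0
        for c in range(20, 27):
            img[r][c] = 1.0
    return img
```

On the raw image the thin line and the thick bar count equally as components. After smoothing, the thin line's peak falls below the 0.5 level while the bar's center stays above it, so the component count depends on width rather than on presence alone.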
Numerical experiments demonstrate the effectiveness of our method, showcasing its capability to maintain topological fidelity while explicitly embedding width characteristics into segmented image structures.
[87] PubMed-OCR: PMC Open Access OCR Annotations
Hunter Heidenreich,Yosheb Getachew,Olivia Dinica,Ben Elliott
Main category: cs.CV
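TL;DR: PubMed-OCR is an OCR-centric corpus built from PubMed Central Open Access PDFs, containing page images from 209.5K articles with word-, line-, and paragraph-level bounding-box annotations; it supports tasks such as layout-aware modeling and coordinate-grounded QA.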
Details
Motivation: To support automated processing and analysis of scientific literature, the authors build a large-scale, structured corpus with OCR annotations that can advance layout-aware models and OCR-dependent systems. Method: Each page image from PubMed Central Open Access PDFs is annotated with Google Cloud Vision, and the word-, line-, and paragraph-level bounding boxes are stored in a compact JSON format. Result: The resulting corpus covers 209.5K articles, about 1.5M pages, and about 1.3B words. The authors analyze journal coverage and layout features and note limitations, including reliance on a single OCR engine and heuristic line reconstruction. Conclusion: PubMed-OCR is a resource for layout-aware modeling, coordinate-grounded QA, and evaluation of OCR pipelines; the data and schema are publicly released, and extensions are invited. Abstract: PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions.
[88] Map2Thought: Explicit 3D Spatial Reasoning via Metric Cognitive Maps
Xiangjun Gao,Zhensong Zhang,Dave Zhenyu Chen,Songcen Xu,Long Quan,Eduardo Pérez-Pellitero,Youngkyoon Jang
Main category: cs.CV
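TL;DR: Proposes Map2Thought, a framework that uses Metric-CogMap and Cog-CoT to give 3D vision-language models explicit, interpretable spatial reasoning, outperforming existing methods even with reduced supervision.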
Details
Motivation: To improve the spatial reasoning ability and interpretability of 3D vision-language models, addressing the weaknesses of existing methods in geometric understanding and reasoning transparency. Method: Metric-CogMap unifies a discrete relational representation with a continuous geometric one, and Cog-CoT reasons over this representation using vector operations, bounding-box distance computations, and occlusion-aware appearance-order cues. Result: On VSI-Bench, the method reaches 59.9% accuracy with only half the supervision, close to the full-data baseline (60.9%); on the 10%, 25%, and 50% training subsets it outperforms the state of the art by 5.3%, 4.8%, and 4.0%, respectively. Conclusion: Map2Thought achieves interpretable 3D spatial reasoning and remains strong under low supervision, supporting the value of structured representations and explicit reasoning in 3D VLMs. Abstract: We propose Map2Thought, a framework that enables explicit and interpretable spatial reasoning for 3D VLMs. The framework is grounded in two key components: Metric Cognitive Map (Metric-CogMap) and Cognitive Chain-of-Thought (Cog-CoT). Metric-CogMap provides a unified spatial representation by integrating a discrete grid for relational reasoning with a continuous, metric-scale representation for precise geometric understanding. Building upon the Metric-CogMap, Cog-CoT performs explicit geometric reasoning through deterministic operations, including vector operations, bounding-box distances, and occlusion-aware appearance order cues, producing interpretable inference traces grounded in 3D structure. Experimental results show that Map2Thought enables explainable 3D understanding, achieving 59.9% accuracy using only half the supervision, closely matching the 60.9% baseline trained with the full dataset. It consistently outperforms state-of-the-art methods by 5.3%, 4.8%, and 4.0% under 10%, 25%, and 50% training subsets, respectively, on the VSI-Bench.
[89] PRISM-CAFO: Prior-conditioned Remote-sensing Infrastructure Segmentation and Mapping for CAFOs
Oishee Bintey Hoque,Nibir Chandra Mandal,Kyle Luong,Amanda Wilson,Samarth Swarup,Madhav Marathe,Abhijin Adiga
Main category: cs.CV
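TL;DR: Proposes an infrastructure-based, explainable pipeline for identifying and characterizing Concentrated Animal Feeding Operations (CAFOs) from aerial and satellite imagery, reaching state-of-the-art performance on multiple metrics.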
Details
Motivation: Large-scale livestock operations pose risks to human health and the environment and are vulnerable to disease and extreme weather, so accurate and scalable mapping is increasingly important. Method: A domain-tuned YOLOv8 detects candidate infrastructure (e.g., barns and feedlots), and SAM2 generates masks that are then filtered; structured descriptors are extracted and fused with deep visual features, a lightweight spatial cross-attention classifier performs classification, and the pipeline outputs type predictions together with attribution masks. Result: The method performs well across diverse U.S. regions, with Swin-B+PRISM-CAFO improving on the best baseline by up to 15%, and gradient-activation analyses quantify the influence of domain priors. Conclusion: The infrastructure-first framework delivers both high performance and explainability for CAFO identification, with good generalization and practical potential. Abstract: Large-scale livestock operations pose significant risks to human health and the environment, while also being vulnerable to threats such as infectious diseases and extreme weather events. As the number of such operations continues to grow, accurate and scalable mapping has become increasingly important. In this work, we present an infrastructure-first, explainable pipeline for identifying and characterizing Concentrated Animal Feeding Operations (CAFOs) from aerial and satellite imagery. Our method (1) detects candidate infrastructure (e.g., barns, feedlots, manure lagoons, silos) with a domain-tuned YOLOv8 detector, then derives SAM2 masks from these boxes and filters component-specific criteria, (2) extracts structured descriptors (e.g., counts, areas, orientations, and spatial relations) and fuses them with deep visual features using a lightweight spatial cross-attention classifier, and (3) outputs both CAFO type predictions and mask-level attributions that link decisions to visible infrastructure. Through comprehensive evaluation, we show that our approach achieves state-of-the-art performance, with Swin-B+PRISM-CAFO surpassing the best performing baseline by up to 15%. Beyond strong predictive performance across diverse U.S. regions, we run systematic gradient-activation analyses that quantify the impact of domain priors and show ho
[90] MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models
Xiaoran Fan,Zhichao Sun,Tao Ji,Lixing Shen,Tao Gui
Main category: cs.CV
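TL;DR: Proposes MHA2MLA-VLM, a parameter-efficient and multimodal-aware framework that converts off-the-shelf vision-language models (VLMs) to the Multi-Head Latent Attention (MLA) architecture in order to compress the KV cache and accelerate inference.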
Details
Motivation: As VLMs take on more complex multimodal tasks, the rapidly growing KV cache creates memory and compute bottlenecks at inference time, and adapting existing VLMs to MLA without costly pretraining remains largely unexplored. Method: Two core techniques are proposed: a modality-adaptive partial-RoPE strategy that selectively masks nonessential dimensions, and a modality-decoupled low-rank approximation that compresses the visual and textual KV spaces separately. These are combined with parameter-efficient fine-tuning that minimizes output activation error rather than parameter distance to reduce performance loss. Result: Experiments on three representative VLMs show that the method restores original model performance with minimal supervised data, significantly reduces the KV cache footprint, and integrates seamlessly with KV quantization. Conclusion: MHA2MLA-VLM achieves efficient KV cache compression and fast inference without costly pretraining, and is both general and practical. Abstract: As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA) offers an effective means to compress the KV cache and accelerate inference, adapting existing VLMs to the MLA architecture without costly pretraining remains largely unexplored. In this work, we present MHA2MLA-VLM, a parameter-efficient and multimodal-aware framework for converting off-the-shelf VLMs to MLA. Our approach features two core techniques: (1) a modality-adaptive partial-RoPE strategy that supports both traditional and multimodal settings by selectively masking nonessential dimensions, and (2) a modality-decoupled low-rank approximation method that independently compresses the visual and textual KV spaces. Furthermore, we introduce parameter-efficient fine-tuning to minimize adaptation cost and demonstrate that minimizing output activation error, rather than parameter distance, substantially reduces performance loss. Extensive experiments on three representative VLMs show that MHA2MLA-VLM restores original model performance with minimal supervised data, significantly reduces KV cache footprint, and integrates seamlessly with KV quantization.
[91] Generative Scenario Rollouts for End-to-End Autonomous Driving
Rajeev Yasarla,Deepti Hegde,Shizhong Han,Hsin-Pai Cheng,Yunxiao Shi,Meysam Sadeghigooghari,Shweta Mahajan,Apratim Bhattacharyya,Litian Liu,Risheek Garrepalli,Thomas Svantesson,Fatih Porikli,Hong Cai
Main category: cs.CV
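TL;DR: Proposes Generative Scenario Rollouts (GeRo), a plug-and-play framework for Vision-Language-Action (VLA) models that uses an autoregressive rollout strategy to generate language-grounded future traffic scenes while planning, substantially improving long-horizon reasoning and multi-agent planning in end-to-end autonomous driving.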
Details
Motivation: Existing VLA models for autonomous driving rely mainly on imitation learning from sparse trajectory annotations and under-use their potential as generative models, with little support for language alignment or long-term consistent generation. Method: A VLA model is first trained to encode ego-vehicle and surrounding-agent dynamics into latent tokens, supervised by planning, motion, and language tasks so that generation is text-aligned. GeRo then performs language-conditioned autoregressive generation: given multi-view images, a scenario description, and ego-action questions, it generates future latent tokens and textual responses that guide long-horizon rollouts, and a rollout-consistency loss stabilizes predictions and preserves text-action alignment. Result: On Bench2Drive, GeRo improves driving score and success rate by +15.7 and +26.2, respectively; combined with reinforcement learning, it reaches state-of-the-art closed-loop and open-loop performance and shows strong zero-shot robustness. Conclusion: Generative, language-conditioned reasoning can serve as a foundation for safer and more interpretable end-to-end autonomous driving. Abstract: Vision-Language-Action (VLA) models are emerging as highly effective planning models for end-to-end autonomous driving systems. However, current works mostly rely on imitation learning from sparse trajectory annotations and under-utilize their potential as generative models. We propose Generative Scenario Rollouts (GeRo), a plug-and-play framework for VLA models that jointly performs planning and generation of language-grounded future traffic scenes through an autoregressive rollout strategy. First, a VLA model is trained to encode ego vehicle and agent dynamics into latent tokens under supervision from planning, motion, and language tasks, facilitating text-aligned generation. Next, GeRo performs language-conditioned autoregressive generation. Given multi-view images, a scenario description, and ego-action questions, it generates future latent tokens and textual responses to guide long-horizon rollouts. A rollout-consistency loss stabilizes predictions using ground truth or pseudo-labels, mitigating drift and preserving text-action alignment. This design enables GeRo to perform temporally consistent, language-grounded rollouts that support long-horizon reasoning and multi-agent planning. On Bench2Drive, GeRo improves driving score and success rate by +15.7 and +26.2, respectively. By integrating reinforcement learning with generative rollouts, GeRo achieves state-of-the-art closed-loop and open-loop performance, demonstrating strong zero-shot robustness.
These results highlight the promise of generative, language-conditioned reasoning as a foundation for safer and more interpretable end-to-end autonomous driving.
[92] ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes
Emily Steiner,Jianhao Zheng,Henry Howard-Jenkins,Chris Xie,Iro Armeni
Main category: cs.CV
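TL;DR: Proposes ReScene4D, a method for temporally sparse 4D indoor semantic instance segmentation (SIS) that segments and tracks instances consistently over time without dense observations, introduces a new evaluation metric, t-mAP, and achieves state-of-the-art performance on the 3RScan dataset.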
Details
Motivation: Existing 3D semantic instance segmentation methods lack temporal reasoning and need a discrete matching step, while existing 4D methods depend on high-frequency temporal data and are poorly suited to the long-term evolution of indoor environments. A method is therefore needed that keeps instance identities consistent under sparse temporal sampling. Method: ReScene4D adapts 3DSIS architectures to the 4DSIS task, shares information across observations to achieve temporally consistent instance segmentation and tracking, and introduces a new metric, t-mAP, to measure identity consistency over time. Result: ReScene4D achieves state-of-the-art performance on the 3RScan dataset, improving not only temporal consistency but also standard 3DSIS segmentation quality. Conclusion: ReScene4D addresses 4D indoor semantic instance segmentation under sparse temporal sampling and establishes a new benchmark for understanding evolving indoor scenes. Abstract: Indoor environments evolve as objects move, appear, or disappear. Capturing these dynamics requires maintaining temporally consistent instance identities across intermittently captured 3D scans, even when changes are unobserved. We introduce and formalize the task of temporally sparse 4D indoor semantic instance segmentation (SIS), which jointly segments, identifies, and temporally associates object instances. This setting poses a challenge for existing 3DSIS methods, which require a discrete matching step due to their lack of temporal reasoning, and for 4D LiDAR approaches, which perform poorly due to their reliance on high-frequency temporal measurements that are uncommon in the longer-horizon evolution of indoor environments. We propose ReScene4D, a novel method that adapts 3DSIS architectures for 4DSIS without needing dense observations. It explores strategies to share information across observations, demonstrating that this shared context not only enables consistent instance tracking but also improves standard 3DSIS quality. To evaluate this task, we define a new metric, t-mAP, that extends mAP to reward temporal identity consistency. ReScene4D achieves state-of-the-art performance on the 3RScan dataset, establishing a new benchmark for understanding evolving indoor scenes.
[93] ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
Yawar Siddiqui,Duncan Frost,Samir Aroudj,Armen Avetisyan,Henry Howard-Jenkins,Daniel DeTone,Pierre Moulon,Qirui Wu,Zhengqin Li,Julian Straub,Richard Newcombe,Jakob Engel
Main category: cs.CV
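TL;DR: ShapeR is a new method for conditional 3D object shape generation from casually captured image sequences. It combines SLAM, 3D detection, and vision-language models to extract multimodal information and uses a rectified flow transformer to generate high-fidelity 3D shapes, significantly outperforming existing methods in real-world scenes.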
Details
Motivation: Existing 3D shape generation methods depend on clean, unoccluded inputs and struggle with the complexity and background clutter of casually captured real-world data, so a more robust method is needed. Method: Off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models extract, for each object in an image sequence, sparse SLAM points, posed multi-view images, and machine-generated captions; a rectified flow transformer trained to condition on these modalities generates the 3D shape; on-the-fly compositional augmentations, a curriculum training scheme, and strategies for handling background clutter improve robustness. Result: On a new benchmark of 178 in-the-wild objects, ShapeR achieves a 2.7x improvement in Chamfer distance over the state of the art. Conclusion: ShapeR handles casually captured real-world data effectively, using multimodal conditioning to produce high-fidelity, metrically accurate 3D object shapes that significantly outperform existing methods. Abstract: Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we leverage off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in-the-wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving an improvement of 2.7x in Chamfer distance compared to state of the art.
[94] UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation
Ruiheng Zhang,Jingfeng Yao,Huangxuan Zhao,Hao Yan,Xiao He,Lei Chen,Zhou Wei,Yong Luo,Zengmao Wang,Lefei Zhang,Dacheng Tao,Bo Du
Main category: cs.CV
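TL;DR: UniX is a next-generation unified medical foundation model for chest X-ray understanding and generation. It decouples an autoregressive branch (for understanding) from a diffusion branch (for generation) and introduces a cross-modal self-attention mechanism so the two tasks cooperate, significantly improving performance while using fewer parameters.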