cs.CL

[1] Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People

Gabriel Grand, Valerio Pepe, Jacob Andreas, Joshua B. Tenenbaum

Main category: cs.CL

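TL;DR: This paper proposes Monte Carlo inference strategies based on Bayesian Experimental Design to make language models more rational decision-makers in information-seeking tasks, and validates them on collaborative games (Collaborative Battleship and Guess Who?).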

Details

Motivation: Examine whether language-model agents can rationally form data-driven hypotheses and make targeted guesses in high-stakes applications such as scientific discovery and diagnosis, especially when resources are limited. Existing models perform poorly at generating questions, grounding answers in context, and selecting high-value actions, so improvements are needed.

Method: Introduces a strategic dialogue task, Collaborative Battleship, and draws on insights from human behavior to evaluate LM agents; introduces novel Monte Carlo inference methods based on Bayesian Experimental Design (BED) to improve information-gathering efficiency and answer accuracy, and tests generalization on Guess Who?.

Result: In Collaborative Battleship, Spotter accuracy improves by up to 14.7% absolute and Captain expected information gain by up to 0.227 bits (94.2% of the noise ceiling); targeting F1 improves by 0.303-0.374; Llama-4-Scout's win rate rises from 8% to 82% against humans and from 0% to 67% against GPT-5, at about 1% of GPT-5's cost; on Guess Who?, accuracy improves by 28.3-42.4 percentage points.

Conclusion: The proposed BED-based inference strategies substantially improve the rational decision-making and information-seeking of LM agents in information-limited settings, apply broadly, and let smaller models outperform both humans and frontier models at very low cost.

Abstract: Many high-stakes applications of AI require forming data-driven hypotheses and making targeted guesses; e.g., in scientific and diagnostic settings. Given limited resources, to what extent do agents based on language models (LMs) act rationally? We develop methods to benchmark and enhance agentic information-seeking, drawing on insights from human behavior. First, we introduce a strategic decision-oriented dialogue task called Collaborative Battleship, in which a partially-informed Captain must balance exploration (asking questions) and action (taking shots), while a fully-informed Spotter must provide accurate answers under an information bottleneck. Compared to human players (N=42), we find that LM agents struggle to ground answers in context, generate informative questions, and select high-value actions. Next, to address these gaps, we develop novel Monte Carlo inference strategies for LMs based on principles from Bayesian Experimental Design (BED). For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling). Combined, these components yield sharper targeting (+0.303-0.374 F1), and enable weaker LMs, such as Llama-4-Scout, to outperform both humans (8% -> 82% win rate) and frontier models (0% -> 67% win rate vs. GPT-5) at ~1% of GPT-5's cost. We replicate these findings on Guess Who? where our methods significantly boost accuracy (+28.3-42.4 p.p.), demonstrating their general applicability for building rational information-seeking agents.
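
The expected information gain (EIG) that the abstract reports for Captain questions can be illustrated with a small Monte Carlo estimate. This is a minimal sketch, not the authors' implementation: it assumes equally weighted hypothesis samples, a yes/no question, and a symmetric answer-flip probability `noise`; the names `expected_information_gain`, `answers_yes`, and the toy board list are hypothetical.

```python
import math
from typing import Callable, Sequence


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_information_gain(
    samples: Sequence[object],
    answers_yes: Callable[[object], bool],
    noise: float = 0.0,
) -> float:
    """EIG (bits) of a yes/no question under equally weighted samples.

    Assumption: the answer is flipped with probability `noise`, so
    EIG = H(observed answer) - H(noise).
    """
    p_yes = sum(1 for h in samples if answers_yes(h)) / len(samples)
    p_observed = p_yes * (1 - noise) + (1 - p_yes) * noise
    return binary_entropy(p_observed) - binary_entropy(noise)


# Toy hypothesis space: eight candidate ship positions (hypothetical).
boards = list(range(8))
eig_half = expected_information_gain(boards, lambda b: b < 4)   # splits samples 4/4
eig_one = expected_information_gain(boards, lambda b: b == 0)   # splits samples 1/7
```

A question that splits the sampled hypotheses evenly scores higher than one that isolates a single hypothesis, which is the intuition behind preferring informative questions.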

[2] Code-enabled language models can outperform reasoning models on diverse tasks

Cedegao E. Zhang, Cédric Colas, Gabriel Poesia, Joshua B. Tenenbaum, Jacob Andreas

Main category: cs.CL

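TL;DR: This paper proposes CodeAdapt, which combines the CodeAct framework with few-shot in-context learning so that standard instruct language models, without fine-tuning, match or even surpass reasoning models across diverse tasks while using compute more efficiently.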

Details

Motivation: Reasoning models are effective but costly to train and run; this paper investigates whether standard language models, without fine-tuning, can reach equal or stronger reasoning ability through a simple recipe.

Method: CodeAdapt combines CodeAct (the LM interleaves natural-language reasoning with code execution) with few-shot in-context learning from as few as five examples, eliciting reasoning from standard instruct LMs across multiple domains.

Result: Across four matched LM-RM pairs, CodeAdapt enables three LMs to outperform their corresponding RMs on average over eight tasks (by up to 22.9%) while being 10-81% more token efficient; averaged over the four models, it performs better on six tasks (by up to 35.7%).

Conclusion: CodeAdapt is an efficient, general recipe for enhancing reasoning; the results suggest that code-enabled LMs are cognitively grounded and powerful systems that may provide a strong foundation for in-weight reinforcement learning.

Abstract: Reasoning models (RMs), language models (LMs) trained with reinforcement learning to produce long-form natural language reasoning, have been remarkably successful, but they still require large amounts of computation and data to train, and can be slow and expensive to run. In this paper, we show that standard instruct LMs can already be elicited to be strong reasoners at a level comparable to or even surpassing their corresponding RMs (e.g., DeepSeek V3 vs R1) without finetuning, across diverse domains from instruction following and creative generation to mathematical reasoning. This is achieved by CodeAdapt, our simple recipe that combines the CodeAct framework, where LMs interleave natural language reasoning with code execution in a multi-step fashion, with few-shot bootstrap in-context learning from as few as five training problems. Analyzing four matched pairs of LMs and RMs, we find that CodeAdapt enables three LMs to outperform the corresponding RMs on average over eight tasks (up to 22.9%) while being 10-81% more token efficient, and delivers superior performance on six tasks when averaged over the four models (up to 35.7%). Furthermore, the code-augmented reasoning traces display rich and varied problem-solving strategies. Our findings support that (1) CodeAdapt-style learning and reasoning may be robust and domain general and (2) code-enabled LMs are cognitively grounded and powerful systems, potentially providing a strong foundation for in-weight reinforcement learning.
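
The multi-step interleaving of reasoning and code execution described in the abstract can be sketched as a small loop. This is an illustrative sketch only: `codeact_loop`, `run_code`, and `scripted_policy` are hypothetical names, the "model" is a scripted stand-in rather than an LM, and `exec` is used without sandboxing, which would be unsafe for untrusted model output.

```python
import contextlib
import io
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str]  # (kind, content), kind in {"task", "code", "observation", "answer"}


def run_code(source: str, namespace: dict) -> str:
    """Execute a snippet and return what it printed (or the error)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, namespace)  # assumption: trusted code only
    except Exception as exc:
        return f"Error: {exc!r}"
    return buffer.getvalue().strip()


def codeact_loop(
    policy: Callable[[List[Step]], Step], task: str, max_steps: int = 5
) -> Tuple[Optional[str], List[Step]]:
    """Alternate between policy turns and code execution until an answer."""
    history: List[Step] = [("task", task)]
    namespace: dict = {}  # state persists across code steps
    for _ in range(max_steps):
        kind, content = policy(history)
        history.append((kind, content))
        if kind == "answer":
            return content, history
        if kind == "code":
            history.append(("observation", run_code(content, namespace)))
    return None, history


def scripted_policy(history: List[Step]) -> Step:
    """Stand-in for an LM: write code first, then answer with its output."""
    last_kind, last_content = history[-1]
    if last_kind == "task":
        return "code", "print(sum(i * i for i in range(1, 11)))"
    return "answer", last_content


answer, trace = codeact_loop(scripted_policy, "Sum the squares of 1..10.")
```

In a real system the policy would be an LM call whose prompt includes the few-shot examples and the running history; the loop structure stays the same.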

[3] FicSim: A Dataset for Multi-Faceted Semantic Similarity in Long-Form Fiction

Natasha Johnson, Amanda Bertsch, Maria-Emil Deal, Emma Strubell

Main category: cs.CL

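TL;DR: This paper introduces FICSIM, a dataset for evaluating how well language models capture semantic similarity in long-form fiction, with an emphasis on author involvement and suitability for fine-grained literary tasks.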

Details

Motivation: Existing embedding similarity datasets focus on short texts and coarse-grained similarity, which makes them unsuitable for the long, complex texts analyzed in computational literary studies; using public-domain literature also raises data contamination concerns.

Method: Assemble and release FICSIM, a dataset of recently written long-form fiction with similarity scores along 12 axes informed by author-produced metadata and validated by digital humanities scholars, and evaluate a suite of embedding models on it.

Result: Existing embedding models tend to focus on surface-level features rather than the semantic categories that are useful for computational literary studies.

Conclusion: FICSIM is a more suitable resource for evaluating language models in the literary domain, and the work underscores the importance of protecting author agency during data collection.

Abstract: As language models become capable of processing increasingly long and complex texts, there has been growing interest in their application within computational literary studies. However, evaluating the usefulness of these models for such tasks remains challenging due to the cost of fine-grained annotation for long-form texts and the data contamination concerns inherent in using public-domain literature. Current embedding similarity datasets are not suitable for evaluating literary-domain tasks because of a focus on coarse-grained similarity and primarily on very short text. We assemble and release FICSIM, a dataset of long-form, recently written fiction, including scores along 12 axes of similarity informed by author-produced metadata and validated by digital humanities scholars. We evaluate a suite of embedding models on this task, demonstrating a tendency across models to focus on surface-level features over semantic categories that would be useful for computational literary studies tasks. Throughout our data-collection process, we prioritize author agency and rely on continual, informed author consent.

[4] Do LLMs Truly Understand When a Precedent Is Overruled?

Li Zhang, Jaromir Savelka, Kevin Ashley

Main category: cs.CL

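TL;DR: This paper evaluates state-of-the-art large language models on identifying overruling relationships in U.S. Supreme Court cases, reveals three key limitations on long legal documents (era sensitivity, shallow reasoning, and context-dependent reasoning failures), and contributes a long-context benchmark that is closer to real legal tasks.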

Details

Motivation: Most existing evaluations rely on simplified synthetic tasks that do not reflect the complexity of real-world legal document understanding, and long-context benchmarks for realistic, high-stakes scenarios are lacking.

Method: Build a dataset of 236 case pairs and evaluate LLMs on identifying overruling relationships, a complex long-document task that is foundational to common law.

Result: The models show three limitations: (1) era sensitivity: performance degrades on historical cases; (2) shallow reasoning: reliance on surface logic rather than deep legal comprehension; (3) context-dependent reasoning failures: temporally impossible relationships are produced in complex open-ended tasks.

Conclusion: The work contributes a more realistic benchmark for long-context legal understanding, exposes the shortcomings of current LLMs on practical legal reasoning, and highlights the need to improve temporal consistency and depth of reasoning.

Abstract: Large language models (LLMs) with extended context windows show promise for complex legal reasoning tasks, yet their ability to understand long legal documents remains insufficiently evaluated. Developing long-context benchmarks that capture realistic, high-stakes tasks remains a significant challenge in the field, as most existing evaluations rely on simplified synthetic tasks that fail to represent the complexity of real-world document understanding. Overruling relationships are foundational to common-law doctrine and commonly found in judicial opinions. They provide a focused and important testbed for long-document legal understanding that closely resembles what legal professionals actually do. We present an assessment of state-of-the-art LLMs on identifying overruling relationships from U.S. Supreme Court cases using a dataset of 236 case pairs. Our evaluation reveals three critical limitations: (1) era sensitivity -- the models show degraded performance on historical cases compared to modern ones, revealing fundamental temporal bias in their training; (2) shallow reasoning -- models rely on shallow logical heuristics rather than deep legal comprehension; and (3) context-dependent reasoning failures -- models produce temporally impossible relationships in complex open-ended tasks despite maintaining basic temporal awareness in simple contexts. Our work contributes a benchmark that addresses the critical gap in realistic long-context evaluation, providing an environment that mirrors the complexity and stakes of actual legal reasoning tasks.
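
The "temporally impossible relationships" mentioned in the abstract suggest a simple sanity check on model output: a case cannot overrule one decided after it. The sketch below illustrates that check under stated assumptions; `OverrulingClaim`, its fields, and the example cases are hypothetical and not taken from the paper's dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OverrulingClaim:
    """A model-predicted relationship: `overruling_case` overrules `overruled_case`."""
    overruling_case: str
    overruling_year: int
    overruled_case: str
    overruled_year: int


def is_temporally_possible(claim: OverrulingClaim) -> bool:
    # Year granularity is an assumption; same-year pairs are allowed here.
    return claim.overruling_year >= claim.overruled_year


def impossible_claims(claims: List[OverrulingClaim]) -> List[OverrulingClaim]:
    """Return the predictions that violate temporal order."""
    return [c for c in claims if not is_temporally_possible(c)]


predictions = [
    OverrulingClaim("Case B (hypothetical)", 1985, "Case A (hypothetical)", 1920),
    OverrulingClaim("Case A (hypothetical)", 1920, "Case B (hypothetical)", 1985),
]
flagged = impossible_claims(predictions)
```

A filter like this only catches ordering violations; it says nothing about whether a temporally valid prediction is legally correct.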

[5] Irish-BLiMP: A Linguistic Benchmark for Evaluating Human and Language Model Performance in a Low-Resource Setting

Josh McGiff, Khanh-Tung Tran, William Mulcahy, Dáibhidh Ó Luinín, Jake Dalzell, Róisín Ní Bhroin, Adam Burke, Barry O'Sullivan, Hoang D. Nguyen, Nikola S. Nikolov

Main category: cs.CL

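TL;DR: This paper presents Irish-BLiMP, the first fine-grained benchmark for evaluating linguistic competence in Irish, containing 1020 minimal pairs across 11 linguistic features. Humans outperform existing large language models (LLMs) on every feature, with 16.6% higher accuracy on average, and a substantial 18.1% gap separates open- and closed-source models; the strongest model, gpt-5, reaches 73.5% accuracy against 90.1% for humans. Humans and models struggle with different aspects of the grammar, which points to limits in the representations the models learn.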

Details

Motivation: Irish is an endangered language that lacks tools for systematically evaluating the grammatical competence of language models, so a fine-grained evaluation framework for this low-resource language is needed.

Method: Drawing on linguistic literature and grammar references, a team of fluent Irish speakers manually constructed and reviewed 1020 minimal pairs covering 11 linguistic features; existing LLMs and human participants were then tested on their grammatical knowledge.

Result: Humans outperform all models on every linguistic feature, with 16.6% higher accuracy on average; an 18.1% performance gap separates open- and closed-source LLMs; the strongest model, gpt-5, reaches 73.5% accuracy against 90.1% for humans; humans and models show different patterns of difficulty across grammatical structures.

Conclusion: Irish-BLiMP provides the first systematic framework for evaluating the grammatical competence of LLMs in Irish, exposes the limits of current models on a low-resource language, and offers an important benchmark for future research.

Abstract: We present Irish-BLiMP (Irish Benchmark of Linguistic Minimal Pairs), the first dataset and framework designed for fine-grained evaluation of linguistic competence in the Irish language, an endangered language. Drawing on a variety of linguistic literature and grammar reference works, we manually constructed and reviewed 1020 minimal pairs across a taxonomy of 11 linguistic features, through a team of fluent Irish speakers. We evaluate both existing Large Language Models (LLMs) and fluent human participants on their syntactic knowledge of Irish. Our findings show that humans outperform all models across all linguistic features, achieving 16.6% higher accuracy on average. Moreover, a substantial performance gap of 18.1% persists between open- and closed-source LLMs, with even the strongest model (gpt-5) reaching only 73.5% accuracy compared to 90.1% by humans. Interestingly, human participants and models struggle on different aspects of Irish grammar, thus highlighting a difference in representation learned by the models. Overall, Irish-BLiMP provides the first systematic framework for evaluating the grammatical competence of LLMs in Irish and offers a valuable benchmark for advancing research on linguistic understanding in low-resource languages.
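
Minimal-pair benchmarks are commonly scored by checking whether a model prefers the acceptable sentence of each pair. The abstract does not state the exact scoring protocol, so the sketch below is an assumption: `minimal_pair_accuracy`, the `score` callable (for example a sentence log-probability), and the placeholder sentences are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

MinimalPair = Tuple[str, str]  # (acceptable sentence, unacceptable sentence)


def minimal_pair_accuracy(
    pairs: List[MinimalPair], score: Callable[[str], float]
) -> float:
    """Fraction of pairs where the acceptable sentence gets the higher score."""
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Placeholder sentences and scores stand in for real Irish data and a real model.
toy_scores: Dict[str, float] = {
    "good-1": -10.0, "bad-1": -12.5,
    "good-2": -8.0, "bad-2": -7.5,   # the toy model gets this pair wrong
    "good-3": -9.0, "bad-3": -15.0,
    "good-4": -11.0, "bad-4": -11.5,
}
toy_pairs = [(f"good-{i}", f"bad-{i}") for i in range(1, 5)]
accuracy = minimal_pair_accuracy(toy_pairs, toy_scores.__getitem__)
```

Grouping pairs by linguistic feature before calling the function would give the per-feature breakdown that the abstract compares between humans and models.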

[6] Can Confidence Estimates Decide When Chain-of-thought is Necessary for LLMs?

Samuel Lewis-Lim, Xingwei Tan, Zhixue Zhao, Nikolaos Aletras

Main category: cs.CL

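TL;DR: This paper studies confidence-gated chain-of-thought (CoT), which triggers reasoning only when the model lacks confidence in its direct answer so as to cut redundant CoT, and systematically evaluates the effectiveness and limitations of training-free confidence estimation methods.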

Details

Motivation: Although chain-of-thought prompting improves LLM reasoning, its added compute cost and its lack of benefit on simple tasks limit its practicality, so an adaptive mechanism is needed to decide when to use CoT.

Method: Propose a confidence-gated CoT framework, evaluate four training-free confidence estimation methods, compare them with a random baseline and an ideal oracle, and run experiments across multiple datasets and models.

Result: Existing training-free confidence estimates reduce redundant CoT and outperform randomly triggered CoT, but their effectiveness is inconsistent across datasets and models, which limits practical use.

Conclusion: Current confidence estimation methods show potential for cutting unnecessary reasoning, but their behavior is unstable; more reliable adaptive gating mechanisms are needed for broad deployment.

Abstract: Chain-of-thought (CoT) prompting has emerged as a common technique for enhancing the reasoning abilities of large language models (LLMs). While extended reasoning can boost accuracy on complex tasks, it is often unnecessary and substantially increases token usage, limiting the practicality of reasoning models in many scenarios. Recent models, such as GPT-OSS and Qwen3, expose controls that enable users to adjust the length of CoT or determine whether it is used at all. Yet, it remains unclear when CoT should be used: on some tasks it improves performance, while on others it provides little benefit or even harms performance. We address this challenge with confidence-gated CoT, where a model invokes reasoning only when confidence in its direct answer is low. To this end, we present the first systematic study of training-free confidence estimation methods for CoT gating. Specifically, we evaluate four training-free confidence estimation methods and compare them to a random baseline and an oracle that always knows when CoT is needed. Through extensive experiments, we show that existing training-free confidence measures can reduce redundant CoT and outperform randomly invoked CoT. However, the utility of individual confidence measures is inconsistent, varying with both the dataset and the model, underscoring the difficulty of deploying confidence-gated CoT in practice. By analysing both strengths and failure modes, our study highlights the potential and limitations of current methods and paves the way toward more reliable adaptive gating of CoT.
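
The gating rule in the abstract (invoke reasoning only when confidence in the direct answer is low) can be written as a short dispatcher. This is a sketch under assumptions: the fixed `threshold`, the function names, and the stub "models" are hypothetical, and the abstract does not say which of the four confidence estimators would supply the confidence value.

```python
from typing import Callable, Dict, Tuple


def gated_answer(
    question: str,
    direct: Callable[[str], Tuple[str, float]],  # returns (answer, confidence in [0, 1])
    reason: Callable[[str], str],                # slower chain-of-thought path
    threshold: float = 0.8,                      # assumed fixed gate
) -> Tuple[str, bool]:
    """Return (answer, used_cot). CoT runs only if direct confidence is low."""
    answer, confidence = direct(question)
    if confidence >= threshold:
        return answer, False
    return reason(question), True


# Stub models standing in for real LLM calls.
direct_table: Dict[str, Tuple[str, float]] = {
    "2 + 2": ("4", 0.99),
    "17 * 23": ("381", 0.35),  # low confidence, and wrong
}


def stub_direct(question: str) -> Tuple[str, float]:
    return direct_table[question]


def stub_reason(question: str) -> str:
    left, right = question.split(" * ")
    return str(int(left) * int(right))


easy = gated_answer("2 + 2", stub_direct, stub_reason)
hard = gated_answer("17 * 23", stub_direct, stub_reason)
```

The random baseline and oracle in the abstract correspond to replacing the confidence test with a coin flip or with knowledge of whether the direct answer is correct.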

[7] Input Matters: Evaluating Input Structure's Impact on LLM Summaries of Sports Play-by-Play

Barkavi Sundararajan, Somayajulu Sripada, Ehud Reiter

Main category: cs.CL

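TL;DR: This study examines how input structure (row-structured, JSON, unstructured) affects factual errors when LLMs summarize NBA games, and finds that JSON input substantially lowers hallucination and error rates.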

Details

Motivation: In accuracy-critical domains such as sports reporting, ensuring that LLM-generated text is faithful to the input data is a key concern.

Method: Generate game summaries with two models, Llama-3.1-70B and Qwen2.5-72B, compare error rates across three input formats, quantify them through manual annotation of 3,312 errors, and test significance with a two-way repeated measures ANOVA and Tukey HSD post hoc tests.

Result: Relative to unstructured input, JSON input reduces error rates by 69% for Llama and 65% for Qwen, and row-structured input by 54% and 51%; input structure explains over 80% of the variance in error rates, and the differences between all formats are significant.

Conclusion: Structured input, especially JSON, substantially reduces factual errors in LLM-generated text; structured data formats are recommended where accuracy is critical.

Abstract: A major concern when deploying LLMs in accuracy-critical domains such as sports reporting is that the generated text may not faithfully reflect the input data. We quantify how input structure affects hallucinations and other factual errors in LLM-generated summaries of NBA play-by-play data, across three formats: row-structured, JSON and unstructured. We manually annotated 3,312 factual errors across 180 game summaries produced by two models, Llama-3.1-70B and Qwen2.5-72B. Input structure has a strong effect: JSON input reduces error rates by 69% for Llama and 65% for Qwen compared to unstructured input, while row-structured input reduces errors by 54% for Llama and 51% for Qwen. A two-way repeated measures ANOVA shows that input structure accounts for over 80% of the variance in error rates, with Tukey HSD post hoc tests confirming statistically significant differences between all input formats.
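
To make the three input formats concrete, the sketch below renders one play-by-play event as JSON, as a delimited row, and as free text. The field names, the delimiter, and the sentence template are hypothetical; the abstract does not specify the exact serializations the authors used.

```python
import json
from typing import Dict, Union

Event = Dict[str, Union[str, int]]


def as_json(event: Event) -> str:
    """Key-value structure preserved explicitly."""
    return json.dumps(event, sort_keys=True)


def as_row(event: Event) -> str:
    """One delimited row per event, with labelled cells."""
    return " | ".join(f"{key}: {event[key]}" for key in sorted(event))


def as_unstructured(event: Event) -> str:
    """Free text with no explicit field boundaries."""
    return (
        f"With {event['clock']} left in quarter {event['quarter']}, "
        f"{event['player']} of {event['team']} {event['action']}."
    )


# Hypothetical event; names are placeholders, not real game data.
event: Event = {
    "quarter": 2,
    "clock": "04:31",
    "team": "Team A",
    "player": "Player X",
    "action": "makes a 3-point jump shot",
}
prompts = {
    "json": as_json(event),
    "row": as_row(event),
    "unstructured": as_unstructured(event),
}
```

All three strings carry the same facts; only the explicitness of the field boundaries differs, which is the variable the study manipulates.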

[8] Reasoning's Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection

Atoosa Chegini, Hamid Kazemi, Garrett Souza, Maria Safi, Yang Song, Samy Bengio, Sinead Williamson, Mehrdad Farajtabar

Main category: cs.CL

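TL;DR: This study systematically evaluates LLM reasoning under strict low false positive rate (FPR) requirements. On safety detection and hallucination detection, enabling reasoning (Think On) raises overall accuracy but underperforms disabling it (Think Off) when high precision is required; token-based scoring beats self-verbalized confidence, and a simple ensemble of the two modes captures the strengths of both.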

Details

Motivation: Reasoning is widely used to improve LLM accuracy, but its suitability for real applications that demand high precision and few false alarms is unclear, so its behavior on precision-sensitive tasks needs systematic evaluation.

Method: On two classification tasks, safety detection and hallucination detection, compare Think On and Think Off in fine-tuned and zero-shot settings, evaluate different confidence scoring methods (token-based scoring vs. self-verbalized confidence), and test a simple ensemble that combines the two reasoning modes.

Result: Enabling reasoning improves overall accuracy but performs worse under low-FPR (high-precision) requirements; disabling reasoning is better in precision-sensitive settings; token-based scoring substantially outperforms self-verbalized confidence; the ensemble of the two modes obtains the advantages of both.

Conclusion: Reasoning is a double-edged sword: it helps average accuracy but is often unsuitable where strict precision is required; in deployment, the reasoning mode should be chosen according to FPR tolerance, and token-based scoring should be preferred.

Abstract: Reasoning has become a central paradigm for large language models (LLMs), consistently boosting accuracy across diverse benchmarks. Yet its suitability for precision-sensitive tasks remains unclear. We present the first systematic study of reasoning for classification tasks under strict low false positive rate (FPR) regimes. Our analysis covers two tasks--safety detection and hallucination detection--evaluated in both fine-tuned and zero-shot settings, using standard LLMs and Large Reasoning Models (LRMs). Our results reveal a clear trade-off: Think On (reasoning-augmented) generation improves overall accuracy, but underperforms at the low-FPR thresholds essential for practical use. In contrast, Think Off (no reasoning during inference) dominates in these precision-sensitive regimes, with Think On surpassing only when higher FPRs are acceptable. In addition, we find token-based scoring substantially outperforms self-verbalized confidence for precision-sensitive deployments. Finally, a simple ensemble of the two modes recovers the strengths of each. Taken together, our findings position reasoning as a double-edged tool: beneficial for average accuracy, but often ill-suited for applications requiring strict precision.
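
Evaluating a detector at a low-FPR operating point means fixing a false-positive budget and reading off the recall it allows. The sketch below shows one conservative way to compute this; it is not the authors' evaluation code, and `tpr_at_fpr` and the toy scores are hypothetical.

```python
import math
from typing import Sequence


def tpr_at_fpr(scores: Sequence[float], labels: Sequence[int], max_fpr: float) -> float:
    """Recall (TPR) at the loosest threshold whose FPR stays within `max_fpr`.

    An example is flagged when its score is strictly above the threshold,
    so ties with a negative example are resolved conservatively.
    """
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    # Number of negatives we may flag; the epsilon guards against float rounding.
    allowed = math.floor(max_fpr * len(negatives) + 1e-9)
    threshold = negatives[allowed] if allowed < len(negatives) else float("-inf")
    return sum(1 for s in positives if s > threshold) / len(positives)


# Toy detector scores (higher = more likely unsafe); 1 = unsafe, 0 = safe.
scores = [0.95, 0.9, 0.8, 0.5, 0.4, 0.35, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 1, 0, 1, 0, 0, 0]
recall_strict = tpr_at_fpr(scores, labels, max_fpr=0.0)
recall_loose = tpr_at_fpr(scores, labels, max_fpr=0.2)
```

Here a single confidently wrong negative (score 0.9) caps recall at 0.25 under a zero-FPR budget, while tolerating one false positive in five raises it to 0.75. This shows how a model with higher average accuracy can still lose at strict operating points.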

[9] Dynamic Retriever for In-Context Knowledge Editing via Policy Optimization

Mahmud Wasif Nafee, Maiqi Jiang, Haipeng Chen, Yanfu Zhang

Main category: cs.CL

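TL;DR: This paper proposes DR-IKE, a dynamic retrieval framework for in-context knowledge editing that trains a BERT retriever with reinforcement learning to select high-value demonstrations by editing reward and adaptively adjusts prompt length, improving edit success while reducing latency.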

Details

Motivation: Existing in-context knowledge editing methods rely on static demonstration sets, face a quantity-quality trade-off, and cannot adapt to task difficulty, which limits editing performance.

Method: The DR-IKE framework (1) trains a BERT retriever with REINFORCE to rank demonstrations by editing reward, and (2) introduces a learnable threshold that prunes low-value demonstrations so that prompt length adapts dynamically. The whole process updates no model weights and uses only forward passes.

Result: On the COUNTERFACT benchmark, edit success improves by up to 17.1% and latency drops by 41.6%, while accuracy on unrelated queries is preserved.

Conclusion: DR-IKE achieves efficient, scalable, and adaptive in-context knowledge editing that suits black-box LLMs and balances performance with efficiency.

Abstract: Large language models (LLMs) excel at factual recall yet still propagate stale or incorrect knowledge. In-context knowledge editing offers a gradient-free remedy suitable for black-box APIs, but current editors rely on static demonstration sets chosen by surface-level similarity, leading to two persistent obstacles: (i) a quantity-quality trade-off, and (ii) lack of adaptivity to task difficulty. We address these issues by dynamically selecting supporting demonstrations according to their utility for the edit. We propose Dynamic Retriever for In-Context Knowledge Editing (DR-IKE), a lightweight framework that (1) trains a BERT retriever with REINFORCE to rank demonstrations by editing reward, and (2) employs a learnable threshold to prune low-value examples, shortening the prompt when the edit is easy and expanding it when the task is hard. DR-IKE performs editing without modifying model weights, relying solely on forward passes for compatibility with black-box LLMs. On the COUNTERFACT benchmark, it improves edit success by up to 17.1%, reduces latency by 41.6%, and preserves accuracy on unrelated queries, demonstrating scalable and adaptive knowledge editing. The code is available at https://github.com/mwnafee/DR-IKE .
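
The pruning step described in the abstract (drop demonstrations whose retriever score falls below a threshold, so the prompt shrinks for easy edits) can be sketched as follows. This is not the DR-IKE code: the scores are hard-coded stand-ins for a trained retriever's output, the threshold is a plain constant rather than a learned parameter, and the prompt template is hypothetical.

```python
from typing import List, Tuple

ScoredDemo = Tuple[str, float]  # (demonstration text, retriever utility score)


def select_demonstrations(scored: List[ScoredDemo], threshold: float) -> List[str]:
    """Keep demonstrations scoring at or above the threshold, best first."""
    kept = [(demo, s) for demo, s in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [demo for demo, _ in kept]


def build_prompt(demos: List[str], new_fact: str, query: str) -> str:
    """Assemble an in-context editing prompt (template is an assumption)."""
    return "\n".join([*demos, f"New fact: {new_fact}", f"Question: {query}"])


candidates: List[ScoredDemo] = [
    ("demo A", 0.91),
    ("demo B", 0.15),
    ("demo C", 0.62),
    ("demo D", 0.40),
]
chosen = select_demonstrations(candidates, threshold=0.5)
prompt = build_prompt(chosen, "<edited fact>", "<query about the fact>")
```

Because only the prompt changes, the edited model can remain a black box, which matches the forward-pass-only constraint stated in the abstract.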

[10] Bridging Language Gaps with Adaptive RAG: Improving Indonesian Language Question Answering

William Christian, Daniel Adamlu, Adrian Yu, Derwin Suhartono

Main category: cs.CL

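TL;DR: This study applies an Adaptive RAG system to Indonesian question answering, using machine translation for data augmentation; the question complexity classifier proves reliable, but the multi-retrieval strategy shows inconsistencies.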

Details

Motivation: Existing retrieval-augmented generation (RAG) systems are largely limited to English, and low-resource languages such as Indonesian lack comparable research.

Method: Use an Adaptive RAG system in which a question complexity classifier selects the answering strategy, and use machine translation as data augmentation to mitigate the shortage of Indonesian data.

Result: The question complexity classifier is reliable, but the multi-retrieval answering strategy shows significant inconsistencies that hurt the overall evaluation.

Conclusion: The study shows both the promise and the challenges of question answering in a low-resource language and suggests directions for future improvement.

Abstract: Question Answering (QA) has seen significant improvements with the advancement of machine learning models. Further studies enhanced these question answering systems by retrieving external information, an approach called Retrieval-Augmented Generation (RAG), to produce more accurate and informative answers. However, this state-of-the-art performance is predominantly in the English language. To address this gap, we make an effort to bridge language gaps by applying the Adaptive RAG system to the Indonesian language. The Adaptive RAG system integrates a classifier whose task is to distinguish the question complexity, which in turn determines the strategy for answering the question. To overcome the limited availability of Indonesian-language datasets, our study employs machine translation as a data augmentation approach. Experiments show a reliable question complexity classifier; however, we observed significant inconsistencies in the multi-retrieval answering strategy, which negatively impacted the overall evaluation when this strategy was applied. These findings highlight both the promise and challenges of question answering in a low-resource language, suggesting directions for future improvement.
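
The routing idea in the abstract (a classifier estimates question complexity, and the result selects the answering strategy) can be sketched as a dispatcher. The three levels used here (no retrieval, single retrieval, iterative multi-retrieval) are an assumption about how the strategies are divided, and every function below is a hypothetical stub rather than part of the paper's system.

```python
from typing import Callable, List


def answer_adaptively(
    question: str,
    classify: Callable[[str], str],             # returns "simple" | "moderate" | "complex"
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    max_hops: int = 3,
) -> str:
    """Pick an answering strategy from the predicted question complexity."""
    level = classify(question)
    if level == "simple":
        return generate(question, [])                  # no retrieval
    if level == "moderate":
        return generate(question, retrieve(question))  # single retrieval
    context: List[str] = []                            # multi-retrieval
    query = question
    for _ in range(max_hops):
        new = [p for p in retrieve(query) if p not in context]
        if not new:
            break
        context.extend(new)
        query = question + " " + " ".join(new)         # refine query with new evidence
    return generate(question, context)


# Toy stubs: complexity by word count, retrieval keyed on query length.
def toy_classify(q: str) -> str:
    n = len(q.split())
    return "simple" if n <= 4 else "moderate" if n <= 8 else "complex"


def toy_retrieve(query: str) -> List[str]:
    return [f"passage {len(query.split())}"]


def toy_generate(q: str, ctx: List[str]) -> str:
    return f"answer using {len(ctx)} passage(s)"


short_q = "What is RAG?"
medium_q = "How does a retriever rank passages"
long_q = "Which of these two hypothetical systems answers questions faster and why exactly"
```

Calling `answer_adaptively(long_q, toy_classify, toy_retrieve, toy_generate)` takes the multi-retrieval branch, which is the strategy the abstract reports as inconsistent in practice.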

[11] CDrugRed: A Chinese Drug Recommendation Dataset for Discharge Medications in Metabolic Diseases

Juntao Li, Haobin Yuan, Ling Luo, Yan Jiang, Fan Wang, Ping Zhang, Huiyi Lv, Jian Wang, Yuanyuan Sun, Hongfei Lin

Main category: cs.CL

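TL;DR: This paper presents CDrugRed, the first publicly available Chinese electronic health record dataset for recommending discharge medications in metabolic diseases, containing 5,894 de-identified patient records. Benchmarks of several large language models on the dataset show that existing models leave substantial room for improvement, which highlights the complexity of clinical drug recommendation.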

Details

Motivation: The scarcity of public, real-world electronic health record (EHR) datasets in languages other than English severely limits the development of EHR-based intelligent drug recommendation systems, particularly for Chinese. Building a high-quality Chinese drug recommendation dataset is therefore important.

Method: The authors build CDrugRed, a Chinese drug recommendation dataset of 5,894 discharge records from 3,190 patients covering demographics, medical history, clinical course, and diagnoses, and benchmark several state-of-the-art large language models on the discharge medication recommendation task.

Result: Supervised fine-tuning improves performance, but the best model reaches only an F1 score of 0.5648 and a Jaccard score of 0.4477, which shows that the task remains challenging.

Conclusion: CDrugRed is the first public Chinese discharge medication recommendation dataset; it is a valuable resource for developing more robust and accurate drug recommendation systems and advances medical AI research beyond English.

Abstract: Intelligent drug recommendation based on Electronic Health Records (EHRs) is critical for improving the quality and efficiency of clinical decision-making. By leveraging large-scale patient data, drug recommendation systems can assist physicians in selecting the most appropriate medications according to a patient's medical history, diagnoses, laboratory results, and comorbidities. However, the advancement of such systems is significantly hampered by the scarcity of publicly available, real-world EHR datasets, particularly in languages other than English. In this work, we present CDrugRed, the first publicly available Chinese drug recommendation dataset focused on discharge medications for metabolic diseases. The dataset includes 5,894 de-identified records from 3,190 patients, containing comprehensive information such as patient demographics, medical history, clinical course, and discharge diagnoses. We assess the utility of CDrugRed by benchmarking several state-of-the-art large language models (LLMs) on the discharge medication recommendation task. Experimental results show that while supervised fine-tuning improves model performance, there remains substantial room for improvement, with the best model achieving an F1 score of 0.5648 and a Jaccard score of 0.4477. This result highlights the complexity of the clinical drug recommendation task and establishes CDrugRed as a challenging and valuable resource for developing more robust and accurate drug recommendation systems. The dataset is publicly available to the research community under the data usage agreements at https://github.com/DUTIR-BioNLP/CDrugRed.
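
The F1 and Jaccard scores reported in the abstract compare a predicted medication set with the gold discharge list. The sketch below computes both for a single record; how the authors average across records is not stated in the abstract, and the function name and drug lists are illustrative assumptions.

```python
from typing import Iterable, Tuple


def set_f1_and_jaccard(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float]:
    """Set-level F1 and Jaccard between predicted and gold medication lists."""
    p, g = set(predicted), set(gold)
    overlap = len(p & g)
    precision = overlap / len(p) if p else 0.0
    recall = overlap / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    union = len(p | g)
    jaccard = overlap / union if union else 0.0
    return f1, jaccard


# Illustrative record: 2 of 3 predictions are correct, 2 of 4 gold drugs are found.
predicted_meds = ["metformin", "insulin glargine", "atorvastatin"]
gold_meds = ["metformin", "atorvastatin", "aspirin", "empagliflozin"]
f1, jaccard = set_f1_and_jaccard(predicted_meds, gold_meds)
```

For this record F1 is 4/7 and Jaccard is 2/5. Jaccard is never larger than F1 on the same pair of sets, which is consistent with the two best-model scores reported above.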

[12] Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only

Qingru Zhang, Liang Qiu, Ilgee Hong, Zhenghao Xu, Tianyi Liu, Shiyang Li, Rongzhi Zhang, Zheng Li, Lihong Li, Bing Yin, Chao Zhang, Jianshu Chen, Haoming Jiang, Tuo Zhao

Main category: cs.CL

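TL;DR: Proposes Self-Rewarding PPO, which combines the strengths of SFT and PPO and uses a self-rewarding mechanism to improve the generalization and alignment of large models when data is scarce.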

Details

Motivation: Supervised fine-tuning (SFT) tends to overfit and generalize poorly when data is scarce; its alignment performance without preference annotations needs improvement.

Method: Design an implicit reward function based on the log policy ratio between the SFT model and the pretrained model, and combine it with PPO for on-policy fine-tuning.

Result: Across multiple NLP tasks, Self-Rewarding PPO clearly outperforms traditional SFT, with the largest advantage in low-data settings.

Conclusion: Through its self-rewarding mechanism, Self-Rewarding PPO improves the generalization, data efficiency, and robustness with which LLMs learn from demonstration data.

Abstract: Supervised fine-tuning (SFT) has emerged as a crucial method for aligning large language models (LLMs) with human-annotated demonstrations. However, SFT, being an off-policy approach similar to behavior cloning, often struggles with overfitting and poor out-of-domain generalization, especially in limited-data scenarios. To address these limitations, we propose Self-Rewarding PPO, a novel fine-tuning method that leverages on-policy techniques to enhance generalization performance. Our approach combines the strengths of SFT and proximal policy optimization (PPO) to achieve more effective alignment from demonstration data. At its core is a reward function designed as the log policy ratio between the SFT model and the pretrained base model. This function serves as an implicit reward signal, using the pretrained policy as a baseline and the SFT policy as a target. By doing so, it enables on-policy fine-tuning without relying on human preference annotations. The integration of this self-rewarding mechanism with PPO addresses key limitations of SFT, improving generalization, data efficiency, and robustness. Our empirical evaluation across a range of natural language processing tasks demonstrates that Self-Rewarding PPO consistently outperforms traditional SFT methods. The results highlight the effectiveness of our approach in aligning LLMs using demonstration data, particularly in scenarios where high-quality annotated data is scarce.
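
The reward described in the abstract is the log policy ratio between the SFT model and the pretrained base model. For a sampled response this reduces to a sum of per-token log-probability differences, as the sketch below shows. The function name and the toy numbers are assumptions, and any scaling, clipping, or KL terms that the full method may use are omitted.

```python
from typing import Sequence


def log_ratio_reward(
    sft_token_logprobs: Sequence[float],
    base_token_logprobs: Sequence[float],
) -> float:
    """Sequence-level reward r(x, y) = log pi_SFT(y|x) - log pi_base(y|x).

    Both inputs are per-token log-probabilities of the SAME response tokens,
    scored by the SFT model and by the pretrained base model respectively.
    """
    if len(sft_token_logprobs) != len(base_token_logprobs):
        raise ValueError("both models must score the same tokens")
    return sum(s - b for s, b in zip(sft_token_logprobs, base_token_logprobs))


# Toy response of three tokens: the SFT model finds it more likely than the base model.
sft_lp = [-0.2, -0.1, -0.3]
base_lp = [-1.0, -0.9, -0.8]
reward = log_ratio_reward(sft_lp, base_lp)
```

A positive reward marks responses that the SFT model prefers over the base model, so PPO can push the policy toward demonstration-like behavior on its own samples without preference labels.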

[13] The Gray Zone of Faithfulness: Taming Ambiguity in Unfaithfulness Detection

Qiang Ding, Lvzhou Luo, Yixuan Cao, Ping Luo

Main category: cs.CL

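TL;DR: This paper proposes a new annotation framework for summary faithfulness that introduces an "Out-Dependent" category to resolve the annotation ambiguity caused by the unclear boundary of permissible external knowledge in existing benchmarks, and uses it to build VeriGray, a new unfaithfulness detection benchmark; experiments show that current large models still hallucinate noticeably.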

Details

Motivation: Existing benchmarks for the faithfulness of LLM summaries suffer from annotation ambiguity, especially because the permissible use of external knowledge (such as common sense) is ill-defined, which leads to inconsistent labels; a clearer framework is needed to assess faithfulness accurately.

Method: Propose a new faithfulness annotation framework with an intermediate "Out-Dependent" category for cases that require external knowledge to verify; build a new benchmark dataset, VeriGray, on this framework; and analyze summaries generated by multiple large models statistically and experimentally.

Result: Even SOTA models such as GPT-5 hallucinate in about 6% of summary sentences, and on average about 8% of sentences fall into the Out-Dependent category; existing baseline methods perform poorly on the benchmark, which shows that it is challenging.

Conclusion: By refining the annotation categories, VeriGray mitigates annotation ambiguity in faithfulness evaluation, exposes the shortcomings of current large models in summary faithfulness, and provides a more challenging benchmark for future research.

Abstract: Ensuring that Large Language Models (LLMs) generate summaries faithful to a given source document is essential for real-world applications. While prior research has explored LLM faithfulness, existing benchmarks suffer from annotation ambiguity, primarily due to the ill-defined boundary of permissible external knowledge in generated outputs. For instance, common sense is often incorporated into responses and labeled as "faithful", yet the acceptable extent of such knowledge remains unspecified, leading to inconsistent annotations. To address this issue, we propose a novel faithfulness annotation framework, which introduces an intermediate category, Out-Dependent, to classify cases where external knowledge is required for verification. Using this framework, we construct VeriGray (Verification with the Gray Zone) -- a new unfaithfulness detection benchmark in summarization. Statistics reveal that even SOTA LLMs, such as GPT-5, exhibit hallucinations ($\sim 6\%$ of sentences) in summarization tasks. Moreover, a substantial proportion ($\sim 8\%$ on average of models) of generated sentences fall into the Out-Dependent category, underscoring the importance of resolving annotation ambiguity in unfaithfulness detection benchmarks. Experiments demonstrate that our benchmark poses significant challenges to multiple baseline methods, indicating considerable room for future improvement.

[14] Large Language Models Meet Text-Attributed Graphs: A Survey of Integration Frameworks and Applications

Guangxin Su, Hanchen Wang, Jianwei Wang, Wenjie Zhang, Ying Zhang, Jian Pei

Main category: cs.CL

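TL;DR: This paper surveys research on integrating large language models (LLMs) with text-attributed graphs (TAGs) and proposes a taxonomy from an orchestration perspective that covers two directions, LLM for TAG and TAG for LLM, along with sequential, parallel, and multi-module orchestration strategies.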

Details

Motivation: LLMs excel at semantic understanding and generation but lack structured reasoning; TAGs offer explicit relational structure but limited semantic expressiveness. Combining the two yields complementary strengths, improving representation learning and the interpretability of reasoning.

Method: Propose a new taxonomy that divides LLM-TAG integration into two directions, LLM for TAG and TAG for LLM, and systematically review orchestration strategies (sequential, parallel, and multi-module frameworks), pretraining, prompting, and parameter-efficient fine-tuning.

Result: Summarize application progress in recommendation systems, biomedical analysis, and knowledge-intensive question answering, and collect existing datasets and empirical findings that demonstrate the effectiveness and potential of LLM-TAG integration.

Conclusion: The synergy between LLMs and TAGs is promising; future work should address more efficient orchestration frameworks, scalability, dynamic graphs, and the challenges of real-world deployment.

Abstract: Large Language Models (LLMs) have achieved remarkable success in natural language processing through strong semantic understanding and generation. However, their black-box nature limits structured and multi-hop reasoning. In contrast, Text-Attributed Graphs (TAGs) provide explicit relational structures enriched with textual context, yet often lack semantic depth. Recent research shows that combining LLMs and TAGs yields complementary benefits: enhancing TAG representation learning and improving the reasoning and interpretability of LLMs. This survey provides the first systematic review of LLM--TAG integration from an orchestration perspective. We introduce a novel taxonomy covering two fundamental directions: LLM for TAG, where LLMs enrich graph-based tasks, and TAG for LLM, where structured graphs improve LLM reasoning. We categorize orchestration strategies into sequential, parallel, and multi-module frameworks, and discuss advances in TAG-specific pretraining, prompting, and parameter-efficient fine-tuning. Beyond methodology, we summarize empirical insights, curate available datasets, and highlight diverse applications across recommendation systems, biomedical analysis, and knowledge-intensive question answering. Finally, we outline open challenges and promising research directions, aiming to guide future work at the intersection of language and graph learning.

[15] Social Simulations with Large Language Model Risk Utopian Illusion

Ning Bian, Xianpei Han, Hongyu Lin, Baolei Wu, Jun Wang

Main category: cs.CL

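TL;DR: This paper proposes a systematic framework for analyzing the behavior of large language models (LLMs) in social simulation and finds that LLMs tend to exhibit social desirability bias, role bias, the primacy effect, and positivity bias, producing overly idealized "Utopian" interactions rather than realistic human behavior.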

Details

Motivation: LLMs show promise in simulating human behavior, but how far they diverge from real human behavior in social contexts is unclear, which risks scientific misinterpretation and unintended consequences in applications; their social behavior therefore needs systematic evaluation.

Method: Simulate multi-agent interactions through chatroom-style conversations, analyze them along five linguistic dimensions, and run extensive experiments on eight representative LLMs from three families.

Result: LLMs do not faithfully reproduce real human behavior; they exhibit social desirability bias, social role bias, the primacy effect, and positivity bias, so the simulated societies become idealized and lack the complexity and variability of real interpersonal interaction.

Conclusion: Current LLMs show significant biases in social simulation; more socially grounded models are needed to capture the diversity of human social behavior accurately.

Abstract: Reliable simulation of human behavior is essential for explaining, predicting, and intervening in our society. Recent advances in large language models (LLMs) have shown promise in emulating human behaviors, interactions, and decision-making, offering a powerful new lens for social science studies. However, the extent to which LLMs diverge from authentic human behavior in social contexts remains underexplored, posing risks of misinterpretation in scientific studies and unintended consequences in real-world applications. Here, we introduce a systematic framework for analyzing LLMs' behavior in social simulation. Our approach simulates multi-agent interactions through chatroom-style conversations and analyzes them across five linguistic dimensions, providing a simple yet effective method to examine emergent social cognitive biases. We conduct extensive experiments involving eight representative LLMs across three families. Our findings reveal that LLMs do not faithfully reproduce genuine human behavior but instead reflect overly idealized versions of it, shaped by the social desirability bias. In particular, LLMs show social role bias, primacy effect, and positivity bias, resulting in "Utopian" societies that lack the complexity and variability of real human interactions. These findings call for more socially grounded LLMs that capture the diversity of human social behavior.

[16] Estonian Native Large Language Model Benchmark

Helena Grete Lillepalu, Tanel Alumäe

Main category: cs.CL

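TL;DR: This paper introduces a new benchmark for evaluating large language models (LLMs) in Estonian, built on seven diverse tasks from native datasets that cover language understanding, knowledge, summarization, and more, and systematically evaluates a range of models on it.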

Details

Motivation: Benchmarks for Estonian LLMs are scarce, and no comprehensive comparison of different LLMs on Estonian tasks has been carried out.

Method: Build seven datasets from native Estonian sources and evaluate 32 models in total, including base models, open-source instruction-tuned models, and commercial models, using both human evaluation and LLM-as-a-judge.

Result: Six base models and 26 instruction-tuned models were evaluated; human scores show moderate to high correlation with benchmark results, and Claude 3.7 Sonnet as a judge aligns strongly with human ratings.

Conclusion: The new benchmark evaluates Estonian LLM performance effectively, and top-performing LLMs can serve as reliable automatic evaluators for model assessment in low-resource languages like this one.

Abstract: The availability of LLM benchmarks for the Estonian language is limited, and a comprehensive evaluation comparing the performance of different LLMs on Estonian tasks has yet to be conducted. We introduce a new benchmark for evaluating LLMs in Estonian, based on seven diverse datasets. These datasets assess general and domain-specific knowledge, understanding of Estonian grammar and vocabulary, summarization abilities, contextual comprehension, and more. The datasets are all generated from native Estonian sources without using machine translation. We compare the performance of base models, instruction-tuned open-source models, and commercial models. Our evaluation includes 6 base models and 26 instruction-tuned models. To assess the results, we employ both human evaluation and LLM-as-a-judge methods. Human evaluation scores showed moderate to high correlation with benchmark evaluations, depending on the dataset. Claude 3.7 Sonnet, used as an LLM judge, demonstrated strong alignment with human ratings, indicating that top-performing LLMs can effectively support the evaluation of Estonian-language models.

Nishan Chatterjee,Veronika Bajt,Ana Zwitter Vitez,Senja Pollak

Main category: cs.CL

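TL;DR: This paper proposes a methodology that combines natural language processing with sociological insights to analyze migration discourse, hate speech, and persuasion strategies in far-right tweets in English and French, in order to reveal how right-wing extremism spreads on social media.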

Details Motivation: To understand how, against the backdrop of rising right-wing populism in Europe, social media promotes the spread of extremist ideologies and affects political outcomes. Method: Applies state-of-the-art natural language processing techniques, combined with a sociological perspective, to the English and French far-right tweets of the MIGR-TWIT corpus. Result: Reveals patterns of discourse around migration, features of hate speech, and persuasion techniques used by right-wing actors. Conclusion: A cross-disciplinary approach deepens understanding of the social dynamics of right-wing extremism on social media and offers new perspectives for addressing the related challenges. Abstract: The rise of right-wing populism in Europe has brought to the forefront the significance of analysing social media discourse to understand the dissemination of extremist ideologies and their impact on political outcomes. Twitter, as a platform for interaction and mobilisation, provides a unique window into the everyday communication of far-right supporters. In this paper, we propose a methodology that uses state-of-the-art natural language processing techniques with sociological insights to analyse the MIGR-TWIT corpus of far-right tweets in English and French. We aim to uncover patterns of discourse surrounding migration, hate speech, and persuasion techniques employed by right and far-right actors. By integrating linguistic, sociological, and computational approaches, we seek to offer cross-disciplinary insights into societal dynamics and contribute to a better understanding of contemporary challenges posed by right-wing extremism on social media platforms.

[18] DispatchMAS: Fusing taxonomy and artificial intelligence agents for emergency medical services

Xiang Li,Huizi Yu,Wenkong Wang,Yiran Wu,Jiayan Zhou,Wenyue Hua,Xinxin Lin,Wenjia Tan,Lexuan Zhu,Bingyi Chen,Guang Chen,Ming-Li Chen,Yang Zhou,Zhao Li,Themistocles L. Assimes,Yongfeng Zhang,Qingyun Wu,Xin Ma,Lingyao Li,Lizhou Fan

Main category: cs.CL

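TL;DR: This study develops and evaluates a multi-agent system, grounded in a clinical taxonomy and powered by large language models, for simulating realistic emergency dispatch scenarios. Results show strong guidance efficacy and dispatch effectiveness with high fidelity and clinical plausibility, supporting its use for training, protocol evaluation, and real-time decision support.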

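Details Motivation: Emergency dispatch is challenged by caller distress, ambiguous information, and high cognitive load on dispatchers; existing workflows are error-prone, so intelligent tools are needed to support dispatch decisions and improve response efficiency and accuracy. Method: Builds a clinical taxonomy from MIMIC-III data (32 chief complaints and 6 caller identities) and a six-phase call protocol; develops a multi-agent system with Caller and Dispatcher agents on the AutoGen framework, grounded in a fact commons to keep interactions clinically plausible; and uses a hybrid evaluation that combines manual assessment of 100 simulated cases by four physicians (guidance efficacy and dispatch effectiveness) with automated linguistic analysis (sentiment, readability, politeness). Result: Human evaluation shows high dispatch effectiveness (94% contacting the correct potential other agents) and good guidance efficacy (advice provided in 91% of cases), with high physician ratings and good inter-rater agreement (Gwet's AC1 > 0.70); algorithmic analysis shows predominantly neutral dialogue (73.7% neutral sentiment, 90.4% neutral emotion), high readability (Flesch 80.9), and a polite style (60.0% polite, 0% impolite). Conclusion: The taxonomy-grounded multi-agent system simulates diverse, clinically plausible emergency dispatch scenarios with high fidelity, has potential for dispatcher training, protocol testing, and future real-time decision support, and outlines a viable path for safely integrating AI agents into emergency response workflows. Abstract: Objective: Emergency medical dispatch (EMD) is a high-stakes process challenged by caller distress, ambiguity, and cognitive load. Large Language Models (LLMs) and Multi-Agent Systems (MAS) offer opportunities to augment dispatchers. This study aimed to develop and evaluate a taxonomy-grounded, LLM-powered multi-agent system for simulating realistic EMD scenarios. Methods: We constructed a clinical taxonomy (32 chief complaints, 6 caller identities from MIMIC-III) and a six-phase call protocol. Using this framework, we developed an AutoGen-based MAS with Caller and Dispatcher Agents. The system grounds interactions in a fact commons to ensure clinical plausibility and mitigate misinformation. We used a hybrid evaluation framework: four physicians assessed 100 simulated cases for "Guidance Efficacy" and "Dispatch Effectiveness," supplemented by automated linguistic analysis (sentiment, readability, politeness). Results: Human evaluation, with substantial inter-rater agreement (Gwet's AC1 > 0.70), confirmed the system's high performance. It demonstrated excellent Dispatch Effectiveness (e.g., 94 % contacting the correct potential other agents) and Guidance Efficacy (advice provided in 91 % of cases), both rated highly by physicians. 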
Algorithmic metrics corroborated these findings, indicating a predominantly neutral affective profile (73.7 % neutral sentiment; 90.4 % neutral emotion), high readability (Flesch 80.9), and a consistently polite style (60.0 % polite; 0 % impolite). Conclusion: Our taxonomy-grounded MAS simulates diverse, clinically plausible dispatch scenarios with high fidelity. Findings support its use for dispatcher training, protocol evaluation, and as a foundation for real-time decision support. This work outlines a pathway for safely integrating advanced AI agents into emergency response workflows.

[19] Correlation Dimension of Auto-Regressive Large Language Models

Xin Du,Kumiko Tanaka-Ishii

Main category: cs.CL

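TL;DR: This paper proposes correlation dimension, a fractal-geometric measure, to quantify the epistemological complexity of text as perceived by a language model, compensating for the limitations of local evaluation metrics such as perplexity. Experiments show that the measure reveals three phases during pretraining, reflects contextual complexity, predicts the tendency to hallucinate, and detects multiple forms of degeneration in generated text.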

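Details Motivation: Conventional language model evaluation metrics such as perplexity consider only local prediction accuracy and cannot capture the long-range structural complexity of text, which makes anomalous generation behavior such as repetition and incoherence hard to explain. A new measure is therefore needed to better understand the generative dynamics of language models. Method: Introduces correlation dimension, a concept from fractal geometry, to quantify the global complexity of text by analyzing how a language model perceives its self-similarity and hierarchical recurrence structure. The method applies to autoregressive architectures (e.g., Transformer and Mamba), is computationally efficient, and is robust to 4-bit quantization. Result: Experiments show that correlation dimension (1) reveals three distinct phases during pretraining; (2) reflects context-dependent changes in complexity; (3) indicates a model's tendency to hallucinate; and (4) reliably detects multiple forms of generation degeneration. Conclusion: Correlation dimension offers a novel, effective, and robust framework for measuring global complexity when evaluating large language models, helps explain their generative mechanisms, and provides a new perspective for future model training and evaluation. Abstract: Large language models (LLMs) have achieved remarkable progress in natural language generation, yet they continue to display puzzling behaviors -- such as repetition and incoherence -- even when exhibiting low perplexity. This highlights a key limitation of conventional evaluation metrics, which emphasize local prediction accuracy while overlooking long-range structural complexity. We introduce correlation dimension, a fractal-geometric measure of self-similarity, to quantify the epistemological complexity of text as perceived by a language model. This measure captures the hierarchical recurrence structure of language, bridging local and global properties in a unified framework. Through extensive experiments, we show that correlation dimension (1) reveals three distinct phases during pretraining, (2) reflects context-dependent complexity, (3) indicates a model's tendency toward hallucination, and (4) reliably detects multiple forms of degeneration in generated text. The method is computationally efficient, robust to model quantization (down to 4-bit precision), broadly applicable across autoregressive architectures (e.g., Transformer and Mamba), and provides fresh insight into the generative dynamics of LLMs.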
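
As a rough illustration of the quantity named in this abstract, the sketch below estimates a correlation dimension with the classical Grassberger-Procaccia recipe: compute the correlation integral C(r), the fraction of point pairs closer than r, and take the slope of log C(r) against log r. It is only a sketch under assumptions: the abstract does not say which vectors the authors measure or how they fit the slope, so the point clouds, the Euclidean distance, and the function names here are all hypothetical stand-ins.

```python
import numpy as np

def correlation_integral(points, radius):
    """Fraction of distinct point pairs whose Euclidean distance is below `radius`."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(len(points), k=1)  # each unordered pair once
    return float((dists[upper] < radius).mean())

def correlation_dimension(points, radii):
    """Least-squares slope of log C(r) versus log r over the given radii."""
    c = np.array([correlation_integral(points, r) for r in radii])
    mask = c > 0  # log is undefined where no pair falls inside the radius
    slope, _ = np.polyfit(np.log(np.asarray(radii)[mask]), np.log(c[mask]), 1)
    return float(slope)

# Toy point clouds (assumption: stand-ins for model-derived vectors).
rng = np.random.default_rng(0)
line = np.zeros((400, 3))
line[:, 0] = rng.random(400)            # points on a 1-D segment
square = np.zeros((400, 3))
square[:, :2] = rng.random((400, 2))    # points on a 2-D patch

radii = np.geomspace(0.02, 0.2, 8)
dim_line = correlation_dimension(line, radii)
dim_square = correlation_dimension(square, radii)
```

On these toy clouds the estimate should come out near 1 for the segment and somewhat below 2 for the patch, because boundary effects pull the slope down at the larger radii.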

[20] Sparser Block-Sparse Attention via Token Permutation

Xinghao Wang,Pengyu Wang,Dong Zhang,Chenkun Tan,Shaojun Zhou,Zhaoxiang Liu,Shiguo Lian,Fangxu Liu,Kai Song,Xipeng Qiu

Main category: cs.CL

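TL;DR: This paper proposes Permuted Block-Sparse Attention (PBS-Attn), a method that exploits the permutation properties of attention to increase block-level sparsity and thereby improve the computational efficiency of LLM prefilling on long sequences. Experiments show that it achieves an end-to-end speedup of up to 2.75x while keeping accuracy close to that of full attention.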

Details Motivation: Because the time and memory complexity of self-attention grows quadratically with sequence length, extending the context length of large language models is computationally very expensive. Block-sparse attention reduces the cost by skipping computation for some blocks, but its effectiveness is limited by the underlying attention patterns, which can leave block-level sparsity insufficient and computation redundant. A more efficient way to increase sparsity is therefore needed. Method: Proposes PBS-Attn, a plug-and-play improvement to block-sparse attention that permutes the sequence so that important key and value positions are concentrated, which increases block-level sparsity. Custom permuted-FlashAttention kernels are developed for an efficient implementation. Result: Experiments on several real-world long-context datasets show that PBS-Attn outperforms existing block-sparse attention methods in model accuracy and comes close to the full attention baseline. With the custom kernels it achieves an end-to-end prefilling speedup of up to 2.75x. Conclusion: By exploiting the permutation invariance of attention, PBS-Attn makes block-sparse attention more efficient and practical, offering a viable solution for long-context processing in large language models. Abstract: Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computation for a subset of these blocks. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we propose Permuted Block-Sparse Attention (PBS-Attn), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity and enhance the computational efficiency of LLM prefilling. We conduct comprehensive experiments on challenging real-world long-context datasets, demonstrating that PBS-Attn consistently outperforms existing block-sparse attention methods in model accuracy and closely matches the full attention baseline. 
Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to $2.75\times$ in long-context prefilling, confirming its practical viability. Code available at https://github.com/xinghaow99/pbs-attn
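
The "permutation properties of attention" mentioned in this abstract can be checked numerically. The sketch below is not the PBS-Attn algorithm or its kernels; it only demonstrates, for plain non-causal softmax attention (an assumption, since the abstract does not describe how causal masking is handled), that reordering keys and values together leaves the output unchanged, while reordering queries reorders the output rows in the same way. All names are illustrative.

```python
import numpy as np

def attention(q, k, v):
    """Dense, non-causal scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 16, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

kv_perm = rng.permutation(n)  # one shared permutation for keys and values
q_perm = rng.permutation(n)   # an independent permutation for queries

baseline = attention(q, k, v)
kv_permuted = attention(q, k[kv_perm], v[kv_perm])  # same output as baseline
q_permuted = attention(q[q_perm], k, v)             # baseline rows, reordered
```

Because the result does not depend on key order, a method is free to choose an ordering in which the important keys for a query block sit in as few key blocks as possible, which is the kind of freedom the abstract describes exploiting.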

[21] PARL: Prompt-based Agents for Reinforcement Learning

Yarik Menchaca Resendiz,Roman Klinger

Main category: cs.CL

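TL;DR: This paper proposes PARL, a method that uses large language models as reinforcement learning agents through prompting. Without any fine-tuning it performs well on non-linguistic tasks, but its performance is limited on tasks that require complex mathematical operations or decoding of states and actions.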

Details Motivation: Existing work mostly studies large language models on supervised or unsupervised natural language tasks and rarely evaluates them as reinforcement learning agents, especially on non-linguistic structured reasoning tasks. Method: Proposes PARL, which encodes states, actions, and rewards in the prompt so that a large language model can act as a reinforcement learning agent and learn by trial and error without fine-tuning. Result: Evaluated on three standard reinforcement learning tasks, PARL matches or outperforms traditional RL agents in simple environments, but its performance is limited on tasks that involve complex mathematical operations or decoding of states and actions. Conclusion: Large language models can take part in reinforcement learning tasks effectively through prompting, especially in simple environments, but remain limited when complex computation and precise state parsing are required; their structured reasoning ability needs further improvement. Abstract: Large language models (LLMs) have demonstrated high performance on tasks expressed in natural language, particularly in zero- or few-shot settings. These are typically framed as supervised (e.g., classification) or unsupervised (e.g., clustering) problems. However, limited work evaluates LLMs as agents in reinforcement learning (RL) tasks (e.g., playing games), where learning occurs through interaction with an environment and a reward system. While prior work focused on representing tasks that rely on a language representation, we study structured, non-linguistic reasoning - such as interpreting positions in a grid world. We therefore introduce PARL (Prompt-based Agent for Reinforcement Learning), a method that uses LLMs as RL agents through prompting, without any fine-tuning. PARL encodes actions, states, and rewards in the prompt, enabling the model to learn through trial-and-error interaction. We evaluate PARL on three standard RL tasks that do not entirely rely on natural language. We show that it can match or outperform traditional RL agents in simple environments by leveraging pretrained knowledge. However, we identify performance limitations in tasks that require complex mathematical operations or decoding states and actions.
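
To make the idea of encoding actions, states, and rewards in the prompt concrete, here is a minimal sketch of a prompt builder. The abstract does not give PARL's actual prompt template, so the `Step` record, the `build_prompt` function, and the text layout below are hypothetical; the model call itself is left out.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One past interaction with the environment (hypothetical record type)."""
    state: str
    action: str
    reward: float

def build_prompt(task, actions, history, current_state):
    """Serialize the task, the action set, and the interaction history as text."""
    lines = [task, "Available actions: " + ", ".join(actions), "History:"]
    for i, step in enumerate(history, start=1):
        lines.append(
            f"{i}. state={step.state} action={step.action} reward={step.reward}"
        )
    lines.append(f"Current state: {current_state}")
    lines.append("Next action:")
    return "\n".join(lines)

# Toy grid-world episode (illustrative values only).
history = [
    Step(state="(0, 0)", action="right", reward=-1.0),
    Step(state="(0, 1)", action="down", reward=-1.0),
]
prompt = build_prompt(
    task="Reach the goal cell of a 3x3 grid in as few moves as possible.",
    actions=["up", "down", "left", "right"],
    history=history,
    current_state="(1, 1)",
)
```

In a loop of this kind, the model's reply would be parsed into an action, the environment would return a new state and reward, and the new `Step` would be appended to `history` before the next prompt is built, so that learning happens in context rather than through weight updates.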

[22] Efficient semantic uncertainty quantification in language models via diversity-steered sampling

Ji Won Park,Kyunghyun Cho

Main category: cs.CL

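TL;DR: Proposes a diversity-steered sampler that introduces a semantic-similarity penalty to improve the sample efficiency of uncertainty estimation for large language models in free-form question answering.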

Details Motivation: Accurately estimating the semantic aleatoric and epistemic uncertainty of large language models in free-form question answering is challenging, and stable estimates usually require many generated samples. Method: Designs a diversity-steered sampler that injects a continuous semantic-similarity penalty during decoding using a lightly fine-tuned natural language inference model, and combines importance reweighting and control variates to debias the downstream uncertainty estimates and reduce their variance. Result: On four question answering benchmarks, the method covers more semantic clusters with the same number of samples and matches or exceeds the baselines. Conclusion: The framework is modular and requires no gradient access to the base LLM, so it can serve as a plug-and-play enhancement for uncertainty estimation in risk-sensitive model deployments. Abstract: Accurately estimating semantic aleatoric and epistemic uncertainties in large language models (LLMs) is particularly challenging in free-form question answering (QA), where obtaining stable estimates often requires many expensive generations. We introduce a diversity-steered sampler that discourages semantically redundant outputs during decoding, covers both autoregressive and masked diffusion paradigms, and yields substantial sample-efficiency gains. The key idea is to inject a continuous semantic-similarity penalty into the model's proposal distribution using a natural language inference (NLI) model lightly finetuned on partial prefixes or intermediate diffusion states. We debias downstream uncertainty estimates with importance reweighting and shrink their variance with control variates. Across four QA benchmarks, our method matches or surpasses baselines while covering more semantic clusters with the same number of samples. Being modular and requiring no gradient access to the base LLM, the framework promises to serve as a drop-in enhancement for uncertainty estimation in risk-sensitive model deployments.
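
The debiasing step mentioned here rests on a standard idea: samples drawn from a steered proposal q can still estimate an expectation under the original model p if each sample is weighted by p(x)/q(x). The sketch below shows a generic self-normalized importance-weighted estimate; it is not the paper's estimator (the abstract does not specify one, and control variates are omitted), and the toy distributions are made up for illustration.

```python
import math

def self_normalized_estimate(values, log_p, log_q):
    """Estimate E_p[f] from samples drawn under q, using weights p/q."""
    log_w = [lp - lq for lp, lq in zip(log_p, log_q)]
    shift = max(log_w)                      # subtract the max for stability
    weights = [math.exp(lw - shift) for lw in log_w]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

# Toy case: two outcomes, target p = (0.5, 0.5), proposal q = (0.8, 0.2).
# The ten "samples" appear in exactly the proportions q would produce.
samples = [0] * 8 + [1] * 2
log_p = [math.log(0.5)] * 10
log_q = [math.log(0.8)] * 8 + [math.log(0.2)] * 2

naive_mean = sum(samples) / len(samples)
reweighted = self_normalized_estimate(samples, log_p, log_q)
```

The unweighted mean reflects the proposal's skew toward outcome 0, while the reweighted estimate recovers the target mean of 0.5.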

[23] Typoglycemia under the Hood: Investigating Language Models' Understanding of Scrambled Words

Gianluca Sperduti,Alejandro Moreo

Main category: cs.CL

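TL;DR: This paper studies why NLP models still perform well under typoglycemia (words whose internal letters are scrambled yet remain readable to humans). It finds two main reasons: relatively few English words collide after scrambling, and colliding words usually appear in contexts that are easy to tell apart.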

Details Motivation: To explain why NLP models still perform well when character order is scrambled, and in particular how performance is maintained when several distinct words map to the same representation. Method: Analyzes the British National Corpus to quantify word collapse and ambiguity under typoglycemia; evaluates BERT's ability to disambiguate collapsed forms; and runs a probing experiment with BERT variants trained from scratch on clean and scrambled Wikipedia text. Result: Under typoglycemia only a small number of English words collide, and these words mostly occur in contexts where they are easy to disambiguate; BERT's performance drop on scrambled input is smaller than expected. Conclusion: NLP models are robust under typoglycemia mainly because word collisions are rare and the contexts of colliding words differ strongly, which makes disambiguation easy, so high performance can be maintained even when character order is ignored. Abstract: Research in linguistics has shown that humans can read words with internally scrambled letters, a phenomenon recently dubbed typoglycemia. Some specific NLP models have recently been proposed that similarly demonstrate robustness to such distortions by ignoring the internal order of characters by design. This raises a fundamental question: how can models perform well when many distinct words (e.g., form and from) collapse into identical representations under typoglycemia? Our work, focusing exclusively on the English language, seeks to shed light on the underlying aspects responsible for this robustness. We hypothesize that the main reasons have to do with the fact that (i) relatively few English words collapse under typoglycemia, and that (ii) collapsed words tend to occur in contexts so distinct that disambiguation becomes trivial. In our analysis, we (i) analyze the British National Corpus to quantify word collapse and ambiguity under typoglycemia, (ii) evaluate BERT's ability to disambiguate collapsing forms, and (iii) conduct a probing experiment by comparing variants of BERT trained from scratch on clean versus typoglycemic Wikipedia text; our results reveal that the performance degradation caused by scrambling is smaller than expected.
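
The notion of words collapsing under typoglycemia can be made concrete with a small sketch. It assumes the usual definition, in which the first and last letters stay fixed and only the inner letters are shuffled, so two words collide when they share first letter, last letter, and the multiset of inner letters. The function names and the tiny vocabulary are illustrative and are not taken from the paper's code.

```python
import random
from collections import defaultdict

def scramble(word, rng):
    """Shuffle the inner letters of `word`, keeping the first and last fixed."""
    if len(word) <= 3:
        return word
    inner = list(word[1:-1])
    rng.shuffle(inner)
    return word[0] + "".join(inner) + word[-1]

def collapse_key(word):
    """Order-insensitive signature: first letter, sorted inner letters, last letter."""
    if len(word) <= 3:
        return word
    return word[0] + "".join(sorted(word[1:-1])) + word[-1]

def collapsed_groups(vocab):
    """Group words that become indistinguishable once inner order is ignored."""
    groups = defaultdict(set)
    for word in vocab:
        groups[collapse_key(word)].add(word)
    return {key: words for key, words in groups.items() if len(words) > 1}

vocab = ["form", "from", "salt", "slat", "model", "the"]
groups = collapsed_groups(vocab)
scrambled = scramble("typoglycemia", random.Random(0))
```

Running `collapsed_groups` over a corpus vocabulary is one way to quantify how many word types collide, which is the kind of count the first analysis in the abstract refers to.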

[24] TripTide: A Benchmark for Adaptive Travel Planning under Disruptions

Priyanshu Karmakar,Soumyabrata Chaudhuri,Shubhojit Mallick,Manish Gupta,Abhik Jana,Shreya Ghosh

Main category: cs.CL

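TL;DR: TripTide is the first benchmark for evaluating the ability of large language models (LLMs) to revise itineraries under realistic travel disruptions. It measures adaptability, preservation of intent, and responsiveness through automatic metrics, LLM-as-a-judge scoring, and expert evaluation, and reveals that LLM robustness declines on longer itineraries.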

Details Motivation: Existing LLM-based travel planning systems lack a systematic evaluation of how well they revise itineraries under real-world disruptions such as flight cancellations or weather closures; a benchmark that measures model adaptability and robustness is needed. Method: Proposes the TripTide benchmark, which models disruption severity and traveler tolerance, and designs a threefold evaluation: automatic metrics (Preservation of Intent, Responsiveness, Adaptability), automatic scoring with an LLM as judge, and manual expert evaluation of semantic, spatial, sequential, and responsive quality. Result: Experiments show that spatial deviations are larger for short trips but shrink as trips get longer, and that LLMs maintain good sequential consistency and semantic stability; however, disruption-handling ability declines as the original plan grows longer, exposing the limits of LLMs in complex long-horizon planning. Conclusion: TripTide provides an effective benchmark for evaluating adaptability, personalization, and resilience in LLM travel planning under uncertainty, exposes the weakness of current models on long and complex disruptions, and motivates research on more robust itinerary systems. Abstract: Recent efforts like TripCraft and TravelPlanner have advanced the use of Large Language Models (LLMs) for personalized, constraint-aware travel itinerary generation. Yet, real travel often faces disruptions. To address this, we present TripTide, the first benchmark evaluating LLMs' ability to revise itineraries under realistic disruptions. TripTide models key dimensions such as disruption severity and traveler tolerance, enabling nuanced assessment of LLM adaptability to events like flight cancellations, weather closures, or overbooked attractions. We conduct a threefold evaluation. First, we introduce automatic metrics including Preservation of Intent (how well the revised plan maintains feasibility and goals), Responsiveness (promptness and appropriateness of disruption handling), and Adaptability (semantic, spatial, and sequential divergence between original and revised plans). Second, we apply an LLM-as-a-judge approach to automatically assess revision quality. Third, we perform manual expert evaluation to verify whether revisions preserve semantic, spatial, sequential, and responsive aspects. Our experiments show that LLMs maintain strong sequential consistency and semantic stability, while spatial deviations are larger for shorter trips but decrease with longer ones, indicating that extended plans encourage better geographic coherence. However, disruption-handling ability declines as plan length increases, highlighting limits in LLM robustness. 
TripTide establishes a benchmark for evaluating adaptability, personalization, and resilience in LLM-based travel planning under real-world uncertainty.

[25] Multi-turn Training with Basic Human Feedback Helps Little on LLM Reasoning

Qiang Liu,Wuganjing Song,Zhenzhou Lin,Feifan Chen,Qiaolong Cai,Chen Li,Yongduo Sui

Main category: cs.CL

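TL;DR: This study compares the effect of single-turn and multi-turn training with human feedback on the reasoning ability of large language models. Single-turn training performs well in both single- and multi-turn evaluations, whereas multi-turn training degrades single-turn reasoning performance.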

Details Motivation: Real applications often involve multi-turn human-machine interaction, yet current large models are usually trained with single-turn reinforcement learning, so training and deployment conditions are mismatched; the study asks whether multi-turn training is necessary. Method: Compares conventional single-turn training with three multi-turn training strategies and evaluates them on single-turn and multi-turn reasoning tasks. Result: Models trained in a single-turn setting generalize effectively to both single- and multi-turn evaluations, while models trained with multi-turn strategies degrade significantly on single-turn reasoning tasks. Conclusion: For tasks with complete information, robust single-turn training is more effective and reliable than multi-turn training, which brings limited gains and may even harm reasoning ability. Abstract: The reasoning capabilities of Large Language Models (LLMs) are typically developed through single-turn reinforcement learning, whereas real-world applications often involve multi-turn interactions with human feedback, leading to a potential mismatch between training and deployment conditions. In this work, we study whether multi-turn training with human feedback is necessary for reasoning tasks. We compare conventional single-turn training with three multi-turn strategies and reach conclusions contrary to previous research. We find that models trained in a single-turn setting generalize effectively to both single- and multi-turn evaluations, while models trained with multi-turn strategies exhibit a significant degradation in single-turn reasoning performance. These results suggest that for tasks with complete information, robust single-turn training remains more effective and reliable, as multi-turn training with basic feedback provides limited benefits and can even degrade reasoning capabilities.

Jenny Kunz

Main category: cs.CL

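TL;DR: This paper introduces a manually written question-answering benchmark about Sweden-related personalities and events, intended to address the shortcomings of existing benchmarks translated from US-centric ones. The dataset can be used to evaluate factual recall across models of different sizes and degrees of Swedish coverage, and it supports analysis of cross-lingual factual consistency. The study finds that smaller models with stronger Swedish coverage recall Sweden-related facts about as well as a multilingual model three times larger; in addition, continued pre-training on Swedish generally improves factual knowledge but also causes some previously known information to be forgotten. The results indicate that the dataset can serve as a diagnostic tool for studying language adaptation and knowledge retention in multilingual models.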

Details Motivation: Many existing Swedish benchmarks are translated from US-centric benchmarks and are therefore unsuitable for testing knowledge that is particularly important or specific to Sweden. To fill this gap, the author sets out to create a question-answering benchmark that better fits the Swedish context. Method: Manually builds a question-answering dataset focused on Sweden-related personalities and events, inspired by a popular radio program and by major sports events in Sweden. The dataset includes English translations and can be used to evaluate factual recall across model sizes and degrees of Swedish coverage and to analyze cross-lingual factual consistency. Result: Smaller models with stronger Swedish coverage recall Sweden-related facts about as well as a multilingual model three times larger; continued pre-training on Swedish data improves knowledge of Swedish facts but also causes some existing knowledge to be forgotten. Conclusion: The dataset effectively supports research on knowledge retention and factual consistency in multilingual models during language adaptation and has potential as a diagnostic tool. Abstract: Many Swedish benchmarks are translated US-centric benchmarks, and therefore not suitable for testing knowledge that is particularly relevant, or even specific, to Sweden. We therefore introduce a manually written question-answering benchmark specifically targeted to Sweden-related personalities and events, many of which receive very limited coverage in international media. Our annotators drew inspiration from a popular radio program featuring public figures from culture and media, as well as major sports events in Sweden. The dataset can be used to measure factual recall across models of varying sizes and degrees of Swedish coverage, and allows probing cross-lingual factual consistency, as it contains English translations. Using the dataset, we find that smaller models with stronger Swedish coverage perform comparably to a three times larger multilingual model in recalling Sweden-related facts. We also observe that continued pre-training on Swedish generally improves factual knowledge but also leads to forgetting of a part of the previously known information. These results demonstrate the dataset's potential as a diagnostic tool for studying language adaptation and knowledge retention in multilingual models.

[27] SindBERT, the Sailor: Charting the Seas of Turkish NLP

Raphael Scheible-Schmitt,Stefan Schweter

Main category: cs.CL

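TL;DR: SindBERT is the first large-scale RoBERTa-based encoder for Turkish. Trained from scratch on 312 GB of Turkish text, it is competitive on several NLP tasks but shows no clear scaling advantage, suggesting that current benchmarks may already be saturated and underlining the importance of corpus quality and diversity.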

Details Motivation: To fill the gap in large-scale pretrained models for morphologically rich languages such as Turkish and to advance Turkish natural language processing. Method: Trains the RoBERTa-architecture SindBERT model from scratch on 312 GB of Turkish text (mC4, OSCAR23, and Wikipedia), releases base and large versions, and evaluates them on part-of-speech tagging, named entity recognition, offensive language detection, and the TurBLiMP linguistic acceptability task. Result: The large variant achieves the best score on two of the four tasks, but no consistent scaling advantage appears overall; the scaling trend is flat, as also observed for XLM-R and EuroBERT, and corpus quality and diversity turn out to matter more than data volume. Conclusion: SindBERT provides an open resource for Turkish NLP and shows empirically that, for morphologically rich languages, simply scaling up the model has limited effect and the quality of corpus composition is crucial. Abstract: Transformer models have revolutionized NLP, yet many morphologically rich languages remain underrepresented in large-scale pre-training efforts. With SindBERT, we set out to chart the seas of Turkish NLP, providing the first large-scale RoBERTa-based encoder for Turkish. Trained from scratch on 312 GB of Turkish text (mC4, OSCAR23, Wikipedia), SindBERT is released in both base and large configurations, representing the first large-scale encoder-only language model available for Turkish. We evaluate SindBERT on part-of-speech tagging, named entity recognition, offensive language detection, and the TurBLiMP linguistic acceptability benchmark. Our results show that SindBERT performs competitively with existing Turkish and multilingual models, with the large variant achieving the best scores in two of four tasks but showing no consistent scaling advantage overall. This flat scaling trend, also observed for XLM-R and EuroBERT, suggests that current Turkish benchmarks may already be saturated. At the same time, comparisons with smaller but more curated models such as BERTurk highlight that corpus quality and diversity can outweigh sheer data volume. Taken together, SindBERT contributes both as an openly released resource for Turkish NLP and as an empirical case study on the limits of scaling and the central role of corpus composition in morphologically rich languages. The SindBERT models are released under the MIT license and made available in both fairseq and Hugging Face formats.

[28] HalleluBERT: Let every token that has meaning bear its weight

Raphael Scheible-Schmitt

Main category: cs.CL

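TL;DR: HalleluBERT is a family of RoBERTa-based Hebrew encoders trained from scratch on 49.1 GB of Hebrew text. It outperforms existing models on NER and sentiment classification, setting a new state of the art for Hebrew.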

Details Motivation: Existing Hebrew models such as HeBERT, AlephBERT, and HeRo are limited by corpus size, vocabulary, or training depth, and a large-scale, extensively trained RoBERTa model is missing. Method: Trains the RoBERTa-based HalleluBERT models (base and large) from scratch on 49.1 GB of deduplicated Hebrew web text and Wikipedia, using a Hebrew-specific byte-level BPE vocabulary. Result: On named entity recognition (NER) and sentiment classification benchmarks, HalleluBERT outperforms monolingual and multilingual baselines. Conclusion: HalleluBERT establishes a new state of the art for Hebrew and demonstrates the benefits of fully converged monolingual pretraining. Abstract: Transformer-based models have advanced NLP, yet Hebrew still lacks an extensively trained large-scale RoBERTa encoder. Existing models such as HeBERT, AlephBERT, and HeRo are limited by corpus size, vocabulary, or training depth. We present HalleluBERT, a RoBERTa-based encoder family (base and large) trained from scratch on 49.1 GB of deduplicated Hebrew web text and Wikipedia with a Hebrew-specific byte-level BPE vocabulary. Evaluated on NER and sentiment classification benchmarks, HalleluBERT outperforms both monolingual and multilingual baselines. HalleluBERT sets a new state of the art for Hebrew and highlights the benefits of fully converged monolingual pretraining.

[29] Vision Language Models for Dynamic Human Activity Recognition in Healthcare Settings

Abderrazek Abid,Thanh-Cong Ho,Fakhri Karray

Main category: cs.CL

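TL;DR: This paper explores vision language models (VLMs) for human activity recognition (HAR) in remote health monitoring. It introduces a descriptive caption dataset and comprehensive evaluation methods, and shows experimentally that VLMs can match or even surpass traditional deep learning models in accuracy.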

Details Motivation: Although VLMs show promise in healthcare, their use for human activity recognition remains underexplored, and existing methods struggle to evaluate their dynamic outputs. Method: Builds a descriptive caption dataset, proposes comprehensive evaluation methods for VLMs on HAR tasks, and runs comparative experiments against state-of-the-art deep learning models. Result: VLMs perform comparably to traditional models on HAR tasks and in some cases better. Conclusion: The study provides a strong benchmark for applying VLMs in intelligent healthcare systems and broadens their possibilities in remote health monitoring. Abstract: As generative AI continues to evolve, Vision Language Models (VLMs) have emerged as promising tools in various healthcare applications. One area that remains relatively underexplored is their use in human activity recognition (HAR) for remote health monitoring. VLMs offer notable strengths, including greater flexibility and the ability to overcome some of the constraints of traditional deep learning models. However, a key challenge in applying VLMs to HAR lies in the difficulty of evaluating their dynamic and often non-deterministic outputs. To address this gap, we introduce a descriptive caption dataset and propose comprehensive evaluation methods to assess VLMs in HAR. Through comparative experiments with state-of-the-art deep learning models, our findings demonstrate that VLMs achieve comparable performance and, in some cases, even surpass conventional approaches in terms of accuracy. This work contributes a strong benchmark and opens new possibilities for the integration of VLMs into intelligent healthcare systems.

[30] Redefining Retrieval Evaluation in the Era of LLMs

Giovanni Trappolini,Florin Cuconasu,Simone Filice,Yoelle Maarek,Fabrizio Silvestri

Main category: cs.CL

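TL;DR: This paper proposes UDCG, a new information retrieval evaluation metric that addresses the mismatch between traditional IR metrics and the way large language models (LLMs) consume results in retrieval-augmented generation (RAG) systems. By introducing a utility-aware annotation schema and an LLM-oriented positional discount, it substantially improves correlation with end-to-end answer accuracy.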

Details Motivation: Traditional IR metrics rest on assumptions about human user behavior; in RAG systems they break down because LLMs process the retrieved documents as a whole and are harmed by irrelevant distractors, so the metrics fail to predict RAG performance accurately. Method: Proposes a utility- and distraction-aware annotation framework and designs the UDCG metric, which uses an LLM-oriented positional discount to account for both the positive contribution of relevant passages and the negative impact of distractors. Result: Experiments on five datasets and six large language models show that UDCG improves correlation with end-to-end answer accuracy by up to 36% over traditional metrics. Conclusion: UDCG better matches the characteristics of LLMs as consumers of retrieval results, provides an effective tool for reliable evaluation of RAG systems, and advances the alignment of IR evaluation with LLM applications. Abstract: Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Additionally, traditional IR metrics do not account for related but irrelevant documents that actively degrade generation quality, rather than merely being ignored. Due to these two major misalignments, namely human vs. machine position discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones. Building on this foundation, we propose UDCG (Utility and Distraction-aware Cumulative Gain), a metric using an LLM-oriented positional discount to directly optimize the correlation with the end-to-end answer accuracy. Experiments on five datasets and six LLMs demonstrate that UDCG improves correlation by up to 36% compared to traditional metrics. Our work provides a critical step toward aligning IR evaluation with LLM consumers and enables more reliable assessment of RAG components.
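
For reference, the human-oriented position discount that this abstract contrasts with can be seen in standard nDCG, sketched below with the common logarithmic discount. This is the traditional baseline metric only; the abstract does not give the UDCG formula or its LLM-oriented discount, so neither is reproduced here, and the function names are illustrative.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

ideal_order = ndcg([3, 2, 1])     # best document first
reversed_order = ndcg([1, 2, 3])  # best document last, so the score drops
```

Two properties of this formulation are what the abstract argues against for RAG: the discount assumes a reader who pays less attention to lower ranks, and a document with zero relevance contributes nothing rather than a penalty, so a distracting passage is treated as harmless.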

[31] REMONI: An Autonomous System Integrating Wearables and Multimodal Large Language Models for Enhanced Remote Health Monitoring

Thanh Cong Ho,Farah Kharrat,Abderrazek Abid,Fakhri Karray

Main category: cs.CL

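TL;DR: This paper proposes REMONI, an autonomous remote health monitoring system that combines multimodal large language models (MLLMs), the Internet of Things, and wearable devices to monitor patients' vital signs, activity, and emotion in real time and to answer healthcare workers through natural-language interaction. The system is scalable and has potential for real-world use.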

Details Motivation: Existing research on remote health monitoring concentrates on data collection and anomaly detection but falls clearly short in human-machine interaction, lacking semantic understanding of patients' emotions and activities and natural interaction with medical staff. Method: Builds a system that integrates wearable devices, cameras, and the Internet of Things, uses multimodal large language models (MLLMs) to process vital signs, accelerometer data, and video, and combines an anomaly detection module (for example, fall detection) with a natural language processing component based on prompt engineering to understand the patient's state and answer questions. Result: A fully functional prototype was developed; experiments indicate that the system supports real-time monitoring and recognition of emotion and activity, lets healthcare workers query the patient's state by interacting with an intelligent agent through a web application, and is implementable and scalable. Conclusion: REMONI fills the human-machine interaction gap in remote health monitoring, raises the system's level of intelligence through multimodal large models, and may reduce the workload of medical staff and lower healthcare costs. Abstract: With the widespread adoption of wearable devices in our daily lives, the demand and appeal for remote patient monitoring have significantly increased. Most research in this field has concentrated on collecting sensor data, visualizing it, and analyzing it to detect anomalies in specific diseases such as diabetes, heart disease and depression. However, this domain has a notable gap in the aspect of human-machine interaction. This paper proposes REMONI, an autonomous REmote health MONItoring system that integrates multimodal large language models (MLLMs), the Internet of Things (IoT), and wearable devices. The system automatically and continuously collects vital signs, accelerometer data from a special wearable (such as a smartwatch), and visual data in patient video clips collected from cameras. This data is processed by an anomaly detection module, which includes a fall detection model and algorithms to identify and alert caregivers of the patient's emergency conditions. A distinctive feature of our proposed system is the natural language processing component, developed with MLLMs capable of detecting and recognizing a patient's activity and emotion while responding to healthcare workers' inquiries. Additionally, prompt engineering is employed to integrate all patient information seamlessly. As a result, doctors and nurses can access real-time vital signs and the patient's current state and mood by interacting with an intelligent agent through a user-friendly web application. 
Our experiments demonstrate that our system is implementable and scalable for real-life scenarios, potentially reducing the workload of medical professionals and healthcare costs. A full-fledged prototype illustrating the functionalities of the system has been developed and is being tested to demonstrate the robustness of its various capabilities.

[32] MRO: Enhancing Reasoning in Diffusion Language Models via Multi-Reward Optimization

Chenglong Wang,Yang Gan,Hang Zhou,Chi Hu,Yongyu Mu,Kai Song,Murun Yang,Bei Li,Chunliang Zhang,Tongran Liu,Jingbo Zhu,Zhengtao Yu,Tong Xiao

Main category: cs.CL

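TL;DR: This paper proposes Multi-Reward Optimization (MRO), which improves the reasoning performance of diffusion language models by strengthening token correlation.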

Details Motivation: Existing diffusion language models lag behind traditional autoregressive large models in reasoning performance, especially when the number of denoising steps is reduced, mainly because tokens are generated independently during denoising and the correlation between them is ignored. Method: Proposes MRO, which uses test-time scaling, rejection sampling, and reinforcement learning with several carefully designed rewards to optimize token correlation directly; group step and importance sampling strategies are introduced to reduce reward variance and improve sampling efficiency. Result: Experiments show that MRO improves reasoning performance and also speeds up sampling significantly while maintaining high performance. Conclusion: By modeling the correlation between tokens explicitly, MRO narrows the reasoning gap of diffusion language models and opens a new direction for their use in efficient reasoning. Abstract: Recent advances in diffusion language models (DLMs) have presented a promising alternative to traditional autoregressive large language models (LLMs). However, DLMs still lag behind LLMs in reasoning performance, especially as the number of denoising steps decreases. Our analysis reveals that this shortcoming arises primarily from the independent generation of masked tokens across denoising steps, which fails to capture the token correlation. In this paper, we define two types of token correlation: intra-sequence correlation and inter-sequence correlation, and demonstrate that enhancing these correlations improves reasoning performance. To this end, we propose a Multi-Reward Optimization (MRO) approach, which encourages DLMs to consider the token correlation during the denoising process. More specifically, our MRO approach leverages test-time scaling, rejection sampling, and reinforcement learning to directly optimize the token correlation with multiple elaborate rewards. Additionally, we introduce group step and importance sampling strategies to mitigate reward variance and enhance sampling efficiency. Through extensive experiments, we demonstrate that MRO not only improves reasoning performance but also achieves significant sampling speedups while maintaining high performance on reasoning benchmarks.

[33] Brain-tuning Improves Generalizability and Efficiency of Brain Alignment in Speech Models

Omer Moussa,Mariya Toneva

Main category: cs.CL

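TL;DR: Proposes a scalable, generalizable multi-participant brain-tuning method that fine-tunes pretrained speech language models to jointly predict the fMRI responses of multiple participants. It substantially improves brain alignment at both the individual and population level and improves performance on downstream semantic tasks.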

Details Motivation: Existing brain alignment methods depend on specific participants and are limited by the amount of data available, which makes it hard to generalize to new participants and to run population-level analyses. Method: Proposes a multi-participant brain-tuning method that fine-tunes pretrained speech language models to jointly predict the fMRI responses of multiple participants. Result: The method reduces the fMRI data needed for new participants by a factor of 5, improves overall brain alignment by up to 50%, generalizes strongly to new datasets, and improves downstream semantic task performance. Conclusion: Multi-participant brain-tuning yields a bidirectional benefit between neuroscience and artificial intelligence and helps bridge the gap between the two fields. Abstract: Pretrained language models are remarkably effective in aligning with human brain responses elicited by natural language stimuli, positioning them as promising model organisms for studying language processing in the brain. However, existing approaches for both estimating and improving this brain alignment are participant-dependent and highly affected by the amount of data available per participant, hindering both generalization to new participants and population-level analyses. In this work, we address these limitations by introducing a scalable, generalizable brain-tuning method, in which we fine-tune pretrained speech language models to jointly predict fMRI responses from multiple participants. We demonstrate that the resulting brain-tuned models exhibit strong individual brain alignment while generalizing across participants. Specifically, our method leads to 1) a 5-fold decrease in the amount of fMRI data needed to predict brain data from new participants, 2) up to a 50% increase in the overall brain alignment, and 3) strong generalization to new unseen datasets. Furthermore, this multi-participant brain-tuning additionally improves downstream performance on semantic tasks, suggesting that training using brain data from multiple participants leads to more generalizable semantic representations. Taken together, these findings demonstrate a bidirectional benefit between neuroscience and AI, helping bridge the gap between the two fields. We make our code and models publicly available at https://github.com/bridge-ai-neuro/multi-brain-tuning.

[34] InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation

Likun Tan,Kuan-Wei Huang,Joy Shi,Kevin Wu

Main category: cs.CL

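TL;DR: This paper studies the mechanisms behind hallucination in retrieval-augmented generation (RAG). It finds that hallucinations arise mainly when later-layer feed-forward network (FFN) modules inject too much parametric knowledge, and it proposes a mechanistic detection method based on external context scores and parametric knowledge scores that is reported to perform well against several large models and to transfer across models.

Details Motivation: In current RAG systems, model outputs are often inconsistent with the retrieved content, and existing methods struggle to separate the contributions of external context and parametric knowledge, so a more precise hallucination detection mechanism is needed. Method: Compute external context scores and parametric knowledge scores across the layers and attention heads of Qwen3-0.6b, explore hallucination detection grounded in mechanistic interpretability, and train regression-based classifiers to predict hallucinations. Result: The method is evaluated against state-of-the-art LLMs (GPT-5, GPT-4.1) and detection baselines (RAGAS, TruLens, RefChecker) and is reported to perform strongly; classifiers trained on Qwen3-0.6b signals transfer to GPT-4.1-mini responses. Conclusion: Mechanistic signals can serve as efficient, generalizable hallucination indicators for RAG systems and have practical potential. Abstract: Retrieval-Augmented Generation (RAG) integrates external knowledge to mitigate hallucinations, yet models often generate outputs inconsistent with retrieved content. Accurate hallucination detection requires disentangling the contributions of external context and parametric knowledge, which prior methods typically conflate. We investigate the mechanisms underlying RAG hallucinations and find they arise when later-layer FFN modules disproportionately inject parametric knowledge into the residual stream. To address this, we explore a mechanistic detection approach based on external context scores and parametric knowledge scores. Using Qwen3-0.6b, we compute these scores across layers and attention heads and train regression-based classifiers to predict hallucinations. Our method is evaluated against state-of-the-art LLMs (GPT-5, GPT-4.1) and detection baselines (RAGAS, TruLens, RefChecker). Furthermore, classifiers trained on Qwen3-0.6b signals generalize to GPT-4.1-mini responses, demonstrating the potential of proxy-model evaluation. Our results highlight mechanistic signals as efficient, generalizable predictors for hallucination detection in RAG systems.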

[35] Document Understanding, Measurement, and Manipulation Using Category Theory

Jared Claypoole,Yunye Gong,Noson S. Yanofsky,Ajay Divakaran

Main category: cs.CL

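TL;DR: This paper applies category theory to extracting multimodal document structure. It proposes a document representation based on question-answer pairs, an orthogonal decomposition of information, methods to measure and enumerate information, new summarization techniques, and document extension (exegesis), and it uses large pretrained models together with a self-supervised method to improve them.

Details Motivation: Understanding and organizing the information in multimodal documents calls for a formal mathematical framework that can extract structure and quantify content while also improving the capabilities of pretrained models. Method: Model a document as a category of question-answer pairs; propose an orthogonalization procedure that separates information into non-overlapping pieces; build information measurement, summarization, and document extension methods on top of it; and design RLVR-based self-supervised learning that improves large models using consistency constraints derived from category theory, such as composability and closure. Result: A category-theoretic framework for structuring documents; information partitioning and measurement; new summarization and document extension techniques; and an implementation with large pretrained models, together with a proposed self-supervised method that improves them through consistency constraints. Conclusion: Category theory provides a strong mathematical foundation for modeling document structure. It supports precise measurement and manipulation of information and offers a unified framework for summarization, content extension, and self-supervised improvement of large models. Abstract: We apply category theory to extract multimodal document structure which leads us to develop information theoretic measures, content summarization and extension, and self-supervised improvement of large pretrained models. We first develop a mathematical representation of a document as a category of question-answer pairs. Second, we develop an orthogonalization procedure to divide the information contained in one or more documents into non-overlapping pieces. The structures extracted in the first and second steps lead us to develop methods to measure and enumerate the information contained in a document. We also build on those steps to develop new summarization techniques, as well as to develop a solution to a new problem viz. exegesis resulting in an extension of the original document. Our question-answer pair methodology enables a novel rate distortion analysis of summarization techniques. We implement our techniques using large pretrained models, and we propose a multimodal extension of our overall mathematical framework. Finally, we develop a novel self-supervised method using RLVR to improve large pretrained models using consistency constraints such as composability and closure under certain operations that stem naturally from our category theoretic framework.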

[36] Are the LLMs Capable of Maintaining at Least the Language Genus?

Sandra Mitrović,David Kletz,Ljiljana Dolamic,Fabio Rinaldi

Main category: cs.CL

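TL;DR: The study examines whether the multilingual behavior of large language models (LLMs) is sensitive to language genus. It finds genus-level effects in both language switching and knowledge consistency, but training data availability remains the main factor shaping multilingual performance.

Details Motivation: To explore whether genealogical language structure influences the multilingual behavior of LLMs, a question that prior work has left underexplored. Method: Using the MultiQ dataset, analyze whether models tend to switch to genealogically related languages when they fail to stay in the prompt language, and compare knowledge consistency within and across genera. Result: LLMs do show genus-level effects, but these differ across model families and are strongly conditioned by the availability of training resources. Conclusion: LLMs encode some genus-level structure, but training data imbalance remains the primary factor determining their multilingual performance. Abstract: Large Language Models (LLMs) display notable variation in multilingual behavior, yet the role of genealogical language structure in shaping this variation remains underexplored. In this paper, we investigate whether LLMs exhibit sensitivity to linguistic genera by extending prior analyses on the MultiQ dataset. We first check if models prefer to switch to genealogically related languages when prompt language fidelity is not maintained. Next, we investigate whether knowledge consistency is better preserved within than across genera. We show that genus-level effects are present but strongly conditioned by training resource availability. We further observe distinct multilingual strategies across LLM families. Our findings suggest that LLMs encode aspects of genus-level structure, but training data imbalances remain the primary factor shaping their multilingual performance.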

[37] From Polyester Girlfriends to Blind Mice: Creating the First Pragmatics Understanding Benchmarks for Slovene

Mojca Brglez,Špela Vintar

Main category: cs.CL

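TL;DR: This paper introduces SloPragEval and SloPragMega, the first pragmatics understanding benchmarks for Slovene, with 405 multiple-choice questions in total, designed to evaluate how well large language models understand non-literal, culture-specific language. Current models have improved at understanding nuanced language but still struggle to infer implied meaning, and there is a significant gap between proprietary and open-source models.

Details Motivation: As large language models keep improving at surface-level linguistic competence, more challenging evaluations are needed to test their pragmatic understanding (context, culture, and linguistic norms), especially for non-literal and culture-related expressions. Method: Build two new Slovene pragmatics understanding benchmarks, SloPragEval and SloPragMega, containing 405 multiple-choice questions; collect human performance data through a campaign to establish a human baseline; and run pilot evaluations of several large language models. Result: Current models have improved markedly at understanding nuanced language but still have difficulty inferring the speaker's implied meaning in non-literal and especially culture-specific utterances; proprietary models clearly outperform open-source ones. Conclusion: Benchmarks that target nuanced language understanding and knowledge of the target culture should be designed with care, preferably built from native data and validated with human responses, so that they measure pragmatic competence more faithfully. Abstract: Large language models are demonstrating increasing capabilities, excelling at benchmarks once considered very difficult. As their capabilities grow, there is a need for more challenging evaluations that go beyond surface-level linguistic competence. Namely, language competence involves not only syntax and semantics but also pragmatics, i.e., understanding situational meaning as shaped by context as well as linguistic and cultural norms. To contribute to this line of research, we introduce SloPragEval and SloPragMega, the first pragmatics understanding benchmarks for Slovene that contain altogether 405 multiple-choice questions. We discuss the difficulties of translation, describe the campaign to establish a human baseline, and report pilot evaluations with LLMs. Our results indicate that current models have greatly improved in understanding nuanced language but may still fail to infer implied speaker meaning in non-literal utterances, especially those that are culture-specific. We also observe a significant gap between proprietary and open-source models. Finally, we argue that benchmarks targeting nuanced language understanding and knowledge of the target culture must be designed with care, preferably constructed from native data, and validated with human responses.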

[38] Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist

Kellen Parker van Dam,Abishek Stephen

Main category: cs.CL

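TL;DR: Proposes unsupervised anomaly detection methods that identify phonotactic inconsistencies in wordlists, with the goal of improving data quality in low-resource language documentation.

Details Motivation: Lexical data collection often contains transcription errors and undocumented borrowings that can mislead linguistic analysis, so effective ways to identify them are needed. Method: Apply unsupervised anomaly detection with character-level and syllable-level phonotactic features to a multilingual dataset of Kokborok varieties with Bangla. Result: Syllable-aware features significantly outperform character-level baselines, and the high-recall approach effectively flags entries that need verification, although precision and recall remain modest because the anomalies are subtle. Conclusion: The method gives fieldworkers a systematic tool for improving language data quality and is particularly suited to documenting low-resource languages. Abstract: Lexical data collection in language documentation often contains transcription errors and undocumented borrowings that can mislead linguistic analysis. We present unsupervised anomaly detection methods to identify phonotactic inconsistencies in wordlists, applying them to a multilingual dataset of Kokborok varieties with Bangla. Using character-level and syllable-level phonotactic features, our algorithms identify potential transcription errors and borrowings. While precision and recall remain modest due to the subtle nature of these anomalies, syllable-aware features significantly outperform character-level baselines. The high-recall approach provides fieldworkers with a systematic method to flag entries requiring verification, supporting data quality improvement in low-resourced language documentation.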
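To make the character-level idea concrete, here is a minimal sketch that ranks wordlist entries by how surprising their character bigrams are. It is not the authors' algorithm: the add-one-smoothed bigram model, the function names, and the toy wordlist (invented strings, not Kokborok data) are all assumptions for illustration.

```python
import math
from collections import Counter

def bigrams(word):
    """Character bigrams with '#' as a word-boundary marker."""
    padded = "#" + word + "#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def fit_bigram_model(words):
    """Count bigrams and their left contexts over the whole wordlist."""
    pair_counts, context_counts = Counter(), Counter()
    alphabet = {"#"}
    for word in words:
        alphabet.update(word)
        for bg in bigrams(word):
            pair_counts[bg] += 1
            context_counts[bg[0]] += 1
    return pair_counts, context_counts, len(alphabet)

def surprisal(word, model):
    """Mean negative log2 probability per bigram, with add-one smoothing."""
    pair_counts, context_counts, vocab_size = model
    grams = bigrams(word)
    total = 0.0
    for bg in grams:
        prob = (pair_counts[bg] + 1) / (context_counts[bg[0]] + vocab_size)
        total -= math.log2(prob)
    return total / len(grams)

def rank_by_surprisal(words):
    """Most phonotactically unusual entries first (candidates for review)."""
    model = fit_bigram_model(words)
    return sorted(words, key=lambda w: surprisal(w, model), reverse=True)

# Hypothetical toy wordlist: six similar forms and one odd entry.
wordlist = ["tala", "lata", "tata", "lala", "alta", "talat", "xqzx"]
ranking = rank_by_surprisal(wordlist)
print(ranking[0])  # the entry a fieldworker would check first
```

A syllable-level variant would replace `bigrams` with a syllabifier for the target language, which is language-specific and therefore left out of this sketch.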

[39] RETuning: Upgrading Inference-Time Scaling for Stock Movement Prediction with Large Language Models

Xueyuan Lin,Cehao Yang,Ye Ma,Ming Li,Rongjunchen Zhang,Yang Ni,Xiaojun Wu,Chengjin Xu,Jian Guo,Hui Xiong

Main category: cs.CL

TL;DR: This paper studies large language models (LLMs) for stock movement prediction. It proposes Reflective Evidence Tuning (RETuning) to strengthen the model's independent reasoning and builds a large-scale, multi-source A-share dataset to validate it.

Details Motivation: Existing LLMs perform poorly at stock movement prediction: they tend to follow analysts' opinions instead of analyzing independently, do not weigh opposing evidence, and do not make full use of their reasoning ability. Method: Reflective Evidence Tuning (RETuning) dynamically builds an analytical framework while generating the chain of thought (CoT), integrates multiple information sources, organizes and scores evidence for a price rise or fall, and finally reflects to derive the prediction, reducing the influence of contextual bias. Result: Experiments show that RETuning unlocks the LLM's reasoning ability in the financial domain and improves three-class (up, hold, down) prediction; inference-time scaling remains effective after 6 months and on out-of-distribution stocks. Conclusion: RETuning enables more systematic, independent reasoning in stock prediction, improves prediction reliability, and offers a new direction for applying LLMs in finance. Abstract: Recently, large language models (LLMs) have demonstrated outstanding reasoning capabilities on mathematical and coding tasks. However, their application to financial tasks, especially the most fundamental task of stock movement prediction, remains underexplored. We study a three-class classification problem (up, hold, down) and, by analyzing existing reasoning responses, observe that: (1) LLMs follow analysts' opinions rather than exhibit a systematic, independent analytical logic (CoTs). (2) LLMs list summaries from different sources without weighing adversarial evidence, yet such counterevidence is crucial for reliable prediction. This shows that the model does not make good use of its reasoning ability to complete the task. To address this, we propose Reflective Evidence Tuning (RETuning), a cold-start method prior to reinforcement learning, to enhance prediction ability. While generating CoT, RETuning encourages dynamically constructing an analytical framework from diverse information sources, organizing and scoring evidence for price up or down based on that framework, rather than on contextual viewpoints, and finally reflecting to derive the prediction. This approach maximally aligns the model with its learned analytical framework, ensuring independent logical reasoning and reducing undue influence from context. We also build a large-scale dataset spanning all of 2024 for 5,123 A-share stocks, with long contexts (32K tokens) and over 200K samples.
In addition to price and news, it incorporates analysts' opinions, quantitative reports, fundamental data, macroeconomic indicators, and similar stocks. Experiments show that RETuning successfully unlocks the model's reasoning ability in the financial domain. Inference-time scaling still works even after 6 months or on out-of-distribution stocks, since the models gain valuable insights about stock movement prediction.

[40] The Universal Landscape of Human Reasoning

Qiguang Chen,Jinhao Liu,Libo Qin,Yimeng Zhang,Yihao Liang,Shangxu Ren,Chengyu Luan,Dengyun Peng,Hanjing Li,Jiannan Guan,Zheng Yan,Jiaqi Wang,Mengkang Hu,Yantao Du,Zhi Chen,Xie Chen,Wanxiang Che

Main category: cs.CL

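TL;DR: This paper proposes Information Flow Tracking (IF-Track), which uses large language models to quantify how information changes during human reasoning. It is presented as the first method to model human reasoning behavior in a single metric space, revealing reasoning features, error patterns, and individual differences, and connecting artificial intelligence with theories of human cognition.

Details Motivation: Existing models do not give a unified, quantitative description of human reasoning dynamics and offer little insight into how information accumulates and transforms during reasoning. Method: IF-Track uses a large language model as a probabilistic encoder to quantify information entropy and information gain at each reasoning step, with fine-grained analyses across diverse tasks. Result: IF-Track models general patterns of human reasoning, captures essential reasoning features, identifies systematic error patterns, characterizes individual differences, and reveals both the alignment between artificial and human cognition and the way LLMs affect the human reasoning process. Conclusion: IF-Track provides a unified quantitative framework for human reasoning, bridges theory and measurement, deepens understanding of reasoning mechanisms, and advances work at the intersection of cognitive science and AI. Abstract: Understanding how information is dynamically accumulated and transformed in human reasoning has long challenged cognitive psychology, philosophy, and artificial intelligence. Existing accounts, from classical logic to probabilistic models, illuminate aspects of output or individual modelling, but do not offer a unified, quantitative description of general human reasoning dynamics. To address this, we introduce Information Flow Tracking (IF-Track), which uses large language models (LLMs) as a probabilistic encoder to quantify information entropy and gain at each reasoning step. Through fine-grained analyses across diverse tasks, our method is the first to successfully model the universal landscape of human reasoning behaviors within a single metric space. We show that IF-Track captures essential reasoning features, identifies systematic error patterns, and characterizes individual differences. Applied to the discussion of advanced psychological theory, we first reconcile single- versus dual-process theories in IF-Track and then discover the alignment of artificial and human cognition and how LLMs are reshaping the human reasoning process. This approach establishes a quantitative bridge between theory and measurement, offering mechanistic insights into the architecture of reasoning.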
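The two quantities named above, entropy and information gain per reasoning step, can be illustrated with a small sketch. The abstract does not give IF-Track's exact definitions, so this assumes the textbook ones: Shannon entropy of a distribution over candidate answers, and gain as the drop in entropy from one step to the next. The three-step distributions are hypothetical stand-ins for what an LLM encoder might output.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def track(steps):
    """Return (entropy, information gain) for each step's distribution.

    The gain of the first step is defined as 0 because there is no
    earlier step to compare against.
    """
    trajectory = []
    previous = None
    for dist in steps:
        h = entropy(dist)
        gain = 0.0 if previous is None else previous - h
        trajectory.append((h, gain))
        previous = h
    return trajectory

# Hypothetical belief over four candidate answers after each reasoning step.
steps = [
    [0.25, 0.25, 0.25, 0.25],  # no information yet: 2 bits of uncertainty
    [0.5, 0.5, 0.0, 0.0],      # two candidates ruled out: 1 bit
    [1.0, 0.0, 0.0, 0.0],      # answer determined: 0 bits
]
trajectory = track(steps)
for step, (h, gain) in enumerate(trajectory):
    print(f"step {step}: entropy={h:.2f} bits, gain={gain:.2f} bits")
```

Plotting each step as a point in (entropy, gain) space is one plausible reading of the "single metric space" mentioned in the abstract, though that reading is also an assumption.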

cs.CV [Back]

[41] Preventing Shortcuts in Adapter Training via Providing the Shortcuts

Anujraaj Argo Goyal,Guocheng Gordon Qian,Huseyin Coskun,Aarush Gupta,Himmy Tam,Daniil Ostashev,Ju Hu,Dhritiman Sagar,Sergey Tulyakov,Kfir Aberman,Kuan-Chieh Jackson Wang

Main category: cs.CV

TL;DR: Proposes Shortcut-Rerouted Adapter Training, which adds auxiliary modules (such as ControlNet or LoRA) during training to handle confounding factors explicitly and removes them at inference, improving the adapter's ability to disentangle the target attribute and improving generation quality, diversity, and prompt adherence.

Details Motivation: Current adapter training tends to entangle the target attribute with unrelated visual factors such as pose, expression, and lighting, which hurts generalization and prompt adherence. Method: During training, auxiliary modules (such as ControlNet or LoRA) explicitly model and absorb the confounding factors so that the main adapter focuses on the target attribute; the auxiliary modules are removed at inference. Result: On facial and full-body identity injection tasks, the method improves image quality, diversity, and adherence to the text prompt. Conclusion: To obtain disentangled representations, the most effective approach may be to actively provide the shortcuts that the main model should not learn, a finding that suggests a new design principle for the era of large models. Abstract: Adapter-based training has emerged as a key mechanism for extending the capabilities of powerful foundation image generators, enabling personalized and stylized text-to-image synthesis. These adapters are typically trained to capture a specific target attribute, such as subject identity, using single-image reconstruction objectives. However, because the input image inevitably contains a mixture of visual factors, adapters are prone to entangle the target attribute with incidental ones, such as pose, expression, and lighting. This spurious correlation problem limits generalization and obstructs the model's ability to adhere to the input text prompt. In this work, we uncover a simple yet effective solution: provide the very shortcuts we wish to eliminate during adapter training. In Shortcut-Rerouted Adapter Training, confounding factors are routed through auxiliary modules, such as ControlNet or LoRA, eliminating the incentive for the adapter to internalize them. The auxiliary modules are then removed during inference. When applied to tasks like facial and full-body identity injection, our approach improves generation quality, diversity, and prompt adherence. These results point to a general design principle in the era of large models: when seeking disentangled representations, the most effective path may be to establish shortcuts for what should NOT be learned.

[42] Video-As-Prompt: Unified Semantic Control for Video Generation

Yuxuan Bian,Xin Chen,Zenan Li,Tiancheng Zhi,Shen Sang,Linjie Luo,Qiang Xu

Main category: cs.CV

TL;DR: This paper proposes Video-As-Prompt (VAP), a new paradigm that uses a reference video as a semantic prompt to achieve unified, generalizable control of video generation, producing high-quality controllable video under many semantic conditions without condition-specific finetuning.

Details Motivation: Existing control methods for video generation introduce artifacts or depend on task-specific finetuning or architectures, so they lack generality and robustness; a unified, generalizable approach to semantic control is needed. Method: VAP uses a reference video as a direct semantic prompt and guides a frozen Video Diffusion Transformer (DiT) through a plug-and-play Mixture-of-Transformers (MoT) expert. It uses a temporally biased position embedding to avoid spurious mapping priors, and the architecture prevents catastrophic forgetting. The authors also build VAP-Data, a large-scale dataset of over 100K paired videos, for training and evaluation. Result: VAP sets a new state of the art among open-source methods, reaching a 38.7% user preference rate that approaches leading commercial models, and shows strong zero-shot generalization and support for multiple downstream applications. Conclusion: VAP offers an effective solution for general-purpose, controllable video generation and marks an important step toward unified, scalable semantic control. Abstract: Unified, generalizable semantic control in video generation remains a critical open challenge. Existing methods either introduce artifacts by enforcing inappropriate pixel-wise priors from structure-based controls, or rely on non-generalizable, condition-specific finetuning or task-specific architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes this problem as in-context generation. VAP leverages a reference video as a direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture prevents catastrophic forgetting and is guided by a temporally biased position embedding that eliminates spurious mapping priors for robust context retrieval. To power this approach and catalyze future research, we built VAP-Data, the largest dataset for semantic-controlled video generation with over 100K paired videos across 100 semantic conditions. As a single unified model, VAP sets a new state-of-the-art for open-source methods, achieving a 38.7% user preference rate that rivals leading condition-specific commercial models. VAP's strong zero-shot generalization and support for various downstream applications mark a significant advance toward general-purpose, controllable video generation.

[43] Focal Modulation and Bidirectional Feature Fusion Network for Medical Image Segmentation

Moin Safdar,Shahzaib Iqbal,Mehwish Mehmood,Mubeen Ghafoor,Tariq M. Khan,Imran Razzak

Main category: cs.CV

TL;DR: Proposes FM-BFF-Net, which combines CNN and Transformer components and improves medical image segmentation through focal modulation attention and a bidirectional feature fusion module.

Details Motivation: Convolutional neural networks struggle to capture global context and long-range dependencies, which limits accurate segmentation of complex boundaries and multi-scale structures. Method: Combine CNN and Transformer components, with a focal modulation attention mechanism and a bidirectional feature fusion module that strengthen context awareness and cross-scale feature interaction. Result: Experiments on eight public datasets show that FM-BFF-Net outperforms recent state-of-the-art methods on the Jaccard index and Dice coefficient. Conclusion: FM-BFF-Net improves boundary precision and robustness to variations in lesion size, shape, and contrast, and shows good generality and application potential. Abstract: Medical image segmentation is essential for clinical applications such as disease diagnosis, treatment planning, and disease development monitoring because it provides precise morphological and spatial information on anatomical structures that directly influence treatment decisions. Convolutional neural networks significantly impact image segmentation; however, since convolution operations are local, capturing global contextual information and long-range dependencies is still challenging. Their capacity to precisely segment structures with complicated borders and a variety of sizes is impacted by this restriction. Since transformers use self-attention methods to capture global context and long-range dependencies efficiently, integrating transformer-based architecture with CNNs is a feasible approach to overcoming these challenges. To address these challenges, we propose the Focal Modulation and Bidirectional Feature Fusion Network for Medical Image Segmentation, referred to as FM-BFF-Net in the remainder of this paper. The network combines convolutional and transformer components, employs a focal modulation attention mechanism to refine context awareness, and introduces a bidirectional feature fusion module that enables efficient interaction between encoder and decoder representations across scales. Through this design, FM-BFF-Net enhances boundary precision and robustness to variations in lesion size, shape, and contrast.
Extensive experiments on eight publicly available datasets, including polyp detection, skin lesion segmentation, and ultrasound imaging, show that FM-BFF-Net consistently surpasses recent state-of-the-art methods in Jaccard index and Dice coefficient, confirming its effectiveness and adaptability for diverse medical imaging scenarios.
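For readers unfamiliar with the two reported metrics, the sketch below computes the Dice coefficient and Jaccard index for a pair of binary masks using their standard definitions. The flattened toy masks and the convention of returning 1.0 when both masks are empty are assumptions of this example, not details taken from the paper's evaluation code.

```python
def dice_and_jaccard(pred, target):
    """Dice coefficient and Jaccard index for two binary masks.

    pred and target are flat sequences of 0/1 labels of equal length
    (a 2D mask can be flattened row by row).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty; treating this as perfect agreement is a
        # common convention, chosen here as an assumption.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + target_area)
    jaccard = intersection / union
    return dice, jaccard

# Hypothetical 6-pixel masks: 2 pixels overlap, each mask has 3 foreground pixels.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
dice, jaccard = dice_and_jaccard(pred, target)
print(f"Dice={dice:.3f}, Jaccard={jaccard:.3f}")
```

The two metrics are monotonically related by Dice = 2J / (1 + J), so they always rank a pair of predictions on the same image in the same order; reporting both mainly helps comparison with prior work that uses one or the other.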

[44] Generative Point Tracking with Flow Matching

Mattie Tesfaldet,Adam W. Harley,Konstantinos G. Derpanis,Derek Nowrouzezahrai,Christopher Pal

Main category: cs.CV

TL;DR: This paper proposes GenPT, a generative point tracking framework for modelling multi-modal trajectories. By combining a new flow matching formulation with a best-first search strategy, it achieves state-of-the-art tracking accuracy under occlusion.

Details Motivation: Existing discriminative models can only regress a single prediction under visual occlusion and cannot capture multi-modal uncertainty, so a tracker that models multi-modal trajectories is needed. Method: Generative Point Tracker (GenPT) is trained with a new flow matching framework that combines the iterative refinement of discriminative trackers, a window-dependent prior, and a variance schedule tuned for point coordinates; at inference it uses a best-first search strategy guided by the model's own confidence. Result: State-of-the-art results on the PointOdyssey, Dynamic Replica, and TAP-Vid benchmarks, with especially strong performance on occluded points and competitive accuracy on visible points; a newly introduced TAP-Vid variant with additional occlusions confirms the model's ability to capture multi-modality. Conclusion: GenPT captures the multi-modal nature of point trajectories, markedly improves tracking under occlusion while keeping good accuracy on visible points, and provides a more robust generative solution for video point tracking. Abstract: Tracking a point through a video can be a challenging task due to uncertainty arising from visual obfuscations, such as appearance changes and occlusions. Although current state-of-the-art discriminative models excel in regressing long-term point trajectory estimates -- even through occlusions -- they are limited to regressing to a mean (or mode) in the presence of uncertainty, and fail to capture multi-modality. To overcome this limitation, we introduce Generative Point Tracker (GenPT), a generative framework for modelling multi-modal trajectories. GenPT is trained with a novel flow matching formulation that combines the iterative refinement of discriminative trackers, a window-dependent prior for cross-window consistency, and a variance schedule tuned specifically for point coordinates. We show how our model's generative capabilities can be leveraged to improve point trajectory estimates by utilizing a best-first search strategy on generated samples during inference, guided by the model's own confidence of its predictions. Empirically, we evaluate GenPT against the current state of the art on the standard PointOdyssey, Dynamic Replica, and TAP-Vid benchmarks. Further, we introduce a TAP-Vid variant with additional occlusions to assess occluded point tracking performance and highlight our model's ability to capture multi-modality.
GenPT is capable of capturing the multi-modality in point trajectories, which translates to state-of-the-art tracking accuracy on occluded points, while maintaining competitive tracking accuracy on visible points compared to extant discriminative point trackers.

[45] 3DReasonKnee: Advancing Grounded Reasoning in Medical Vision Language Models

Sraavya Sambara,Sung Eun Kim,Xiaoman Zhang,Luyang Luo,Shreya Johri,Mohammed Baharoon,Du Hyun Ro,Pranav Rajpurkar

Main category: cs.CV

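TL;DR: This paper presents 3DReasonKnee, the first 3D grounded reasoning dataset for medical images. It contains 494k high-quality quintuples derived from 7,970 3D knee MRI volumes, supports localization of anatomical regions and step-by-step reasoning, and aims to advance multimodal medical AI toward clinically aligned decision-making.

Details Motivation: Current vision-language models (VLMs) struggle to localize anatomical regions in 3D medical images and reason about them step by step, which is a key requirement of clinical diagnostic assessment. Existing 3D datasets lack annotations that support this grounded reasoning ability, so a dataset with clinically relevant, high-quality reasoning chains is needed to enable trustworthy clinician-AI collaboration. Method: The authors build 3DReasonKnee, with 494,000 high-quality quintuples from 7,970 3D knee MRI volumes. Each sample includes the 3D MRI, a diagnostic question targeting a specific anatomical region, a 3D bounding box, clinician-generated diagnostic reasoning steps, and a structured severity assessment. Creating and validating the dataset took over 450 hours of expert time for manual segmentation and reasoning-chain generation. They also establish the ReasonKnee-Bench benchmark to evaluate VLM localization and diagnostic accuracy and benchmark five state-of-the-art VLMs. Result: 3DReasonKnee is the first dataset to support grounded reasoning in 3D medical images and provides rich reasoning pathways annotated by clinical experts. ReasonKnee-Bench offers insight into how well existing VLMs handle 3D localization and severity assessment and provides baseline performance for future work. The dataset serves as a repository of orthopedic surgeons' diagnostic expertise and as an important testbed for multimodal medical AI systems. Conclusion: 3DReasonKnee fills the gap in grounded reasoning datasets for 3D medical images. With high-quality expert annotations and clinically relevant reasoning chains, it supports trustworthy use of VLMs in clinical practice and is an important step toward AI systems capable of 3D, localized, clinically aligned decision-making. Abstract: Current Vision-Language Models (VLMs) struggle to ground anatomical regions in 3D medical images and reason about them in a step-by-step manner, a key requirement of real-world diagnostic assessment. This ability is essential for aligning model outputs with the diagnostic workflows clinicians use in practice, enabling trustworthy clinician-AI collaboration. Existing 3D datasets provide localization labels, but none support this "grounded reasoning" ability. To address this gap, we introduce 3DReasonKnee, the first 3D grounded reasoning dataset for medical images, which provides 494k high-quality quintuples derived from 7,970 3D knee MRI volumes. Each quintuple includes: (1) the 3D MRI volume, (2) a diagnostic question targeting a specific anatomical region, (3) a 3D bounding box localizing the relevant anatomical structures, (4) clinician-generated diagnostic reasoning steps that explicitly detail the 3D reasoning process, and (5) structured severity assessments for the relevant anatomical region. The creation and validation of 3DReasonKnee, involving over 450 hours of expert clinician time for manually segmenting MRIs and generating reasoning chains, ensures its superior quality and clinical relevance.
We establish ReasonKnee-Bench to evaluate localization and diagnostic accuracy, providing insight into VLM ability to perform grounding and severity assessment across anatomical regions and diagnostic inquiries. We benchmark five state-of-the-art VLMs, providing baseline performance for ReasonKnee-Bench. By providing this unique resource of expert-annotated 3D reasoning pathways, 3DReasonKnee serves as a repository of orthopedic surgeons' diagnostic expertise and offers a vital testbed for advancing multimodal medical AI systems towards 3D, clinically aligned, localized decision-making capabilities. The dataset can be found at: https://huggingface.co/datasets/rajpurkarlab/3DReasonKnee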

[46] Thermal Polarimetric Multi-view Stereo

Takahiro Kushida,Kenichiro Tanaka

Main category: cs.CV

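TL;DR: This paper proposes a new method for detailed 3D shape reconstruction using thermal polarization cues; the method is independent of illumination and material properties.

Details Motivation: Existing 3D reconstruction methods based on visible-light polarization suffer from ambiguities caused by illumination and material and have difficulty recovering the details of transparent or heterogeneous objects. Method: Formulate a general theory of polarization observation, show that long-wave infrared (LWIR) polarimetric imaging avoids the ambiguities of visible polarization analysis, and propose a method that recovers 3D shape from multi-view thermal polarimetric images. Result: Experiments show that the method reconstructs fine details of transparent, translucent, and heterogeneous objects and outperforms existing techniques. Conclusion: Thermal-polarization-based 3D reconstruction is more robust and more accurate under complex materials and lighting, and opens a new direction for 3D shape recovery. Abstract: This paper introduces a novel method for detailed 3D shape reconstruction utilizing thermal polarization cues. Unlike state-of-the-art methods, the proposed approach is independent of illumination and material properties. In this paper, we formulate a general theory of polarization observation and show that long-wave infrared (LWIR) polarimetric imaging is free from the ambiguities that affect visible polarization analyses. Subsequently, we propose a method for recovering detailed 3D shapes using multi-view thermal polarimetric images. Experimental results demonstrate that our approach effectively reconstructs fine details in transparent, translucent, and heterogeneous objects, outperforming existing techniques.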
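As background on what a polarization cue is, the sketch below recovers the linear Stokes parameters, the degree of linear polarization (DoLP), and the angle of linear polarization (AoLP) for one pixel from intensities behind polarizers at 0, 45, 90, and 135 degrees. These are standard polarimetry formulas; whether the paper's LWIR camera uses this four-angle setup is an assumption, and the synthetic pixel values are hypothetical.

```python
import math

def linear_stokes(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-angle intensities."""
    s0 = (i0 + i45 + i90 + i135) / 2.0  # total intensity
    s1 = i0 - i90                        # horizontal vs vertical
    s2 = i45 - i135                      # +45 vs -45 degrees
    return s0, s1, s2

def dolp_aolp(i0, i45, i90, i135):
    """Degree and angle (radians, in (-pi/2, pi/2]) of linear polarization."""
    s0, s1, s2 = linear_stokes(i0, i45, i90, i135)
    dolp = math.hypot(s1, s2) / s0
    aolp = 0.5 * math.atan2(s2, s1)
    return dolp, aolp

def simulate_pixel(s0, dolp, aolp, polarizer_angle):
    """Intensity of partially polarized light behind an ideal linear polarizer."""
    return 0.5 * s0 * (1.0 + dolp * math.cos(2.0 * (polarizer_angle - aolp)))

# Hypothetical pixel: 30% polarized at 30 degrees.
true_dolp, true_aolp = 0.3, math.radians(30.0)
readings = [simulate_pixel(2.0, true_dolp, true_aolp, math.radians(a))
            for a in (0.0, 45.0, 90.0, 135.0)]
dolp, aolp = dolp_aolp(*readings)
print(f"DoLP={dolp:.3f}, AoLP={math.degrees(aolp):.1f} deg")
```

In shape-from-polarization work, the AoLP constrains the azimuth of the surface normal and the DoLP relates to its zenith angle; how the paper resolves the resulting ambiguities in the LWIR band is described only at a high level in the abstract.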

[47] VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models

Jesimon Barreto,Carlos Caetano,André Araujo,William Robson Schwartz

Main category: cs.CV

TL;DR: This paper proposes VESSA, a self-supervised fine-tuning method for domain adaptation of visual foundation models. It uses only unlabeled multi-view object-centric videos and a self-distillation paradigm to improve downstream classification performance.

Details Motivation: Visual foundation models underperform under distribution shift and label scarcity, where supervised fine-tuning is hard to carry out, and continued self-supervised learning has not yet proven effective for vision encoder models. Method: VESSA performs self-supervised fine-tuning on multi-view object-centric videos within a self-distillation framework, combined with careful tuning of prediction heads and parameter-efficient adaptation techniques to avoid forgetting pretrained knowledge. Result: Experiments with 3 visual foundation models on 2 datasets show that VESSA consistently outperforms the base models and prior adaptation methods on downstream classification. Conclusion: VESSA adapts visual foundation models to new domains without annotations, improves their generalization in those domains, and has good application prospects. Abstract: Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. However, they may underperform in domains with distribution shifts and scarce labels, where supervised fine-tuning may be infeasible. While continued self-supervised learning for model adaptation is common for generative language models, this strategy has not proven effective for vision-centric encoder models. To address this challenge, we introduce a novel formulation of self-supervised fine-tuning for vision foundation models, where the model is adapted to a new domain without requiring annotations, leveraging only short multi-view object-centric videos. Our method is referred to as VESSA: Video-based objEct-centric Self-Supervised Adaptation for visual foundation models. VESSA's training technique is based on a self-distillation paradigm, where it is critical to carefully tune prediction heads and deploy parameter-efficient adaptation techniques - otherwise, the model may quickly forget its pretrained knowledge and reach a degraded state. VESSA benefits significantly from multi-view object observations sourced from different frames in an object-centric video, efficiently learning robustness to varied capture conditions, without the need for annotations. Through comprehensive experiments with 3 vision foundation models on 2 datasets, VESSA demonstrates consistent improvements in downstream classification tasks, compared to the base models and previous adaptation methods. Code is publicly available at https://github.com/jesimonbarreto/VESSA.
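The abstract names a self-distillation paradigm without spelling out its form. The sketch below shows one common instantiation, assumed here for illustration and not confirmed as VESSA's design: a student matches a teacher's sharpened output on a different frame of the same object, and the teacher's weights track the student's through an exponential moving average (EMA). The temperatures, momentum, and toy logits are hypothetical.

```python
import math

def softmax(logits, temperature):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher's and the student's distributions.

    In training, the two logit vectors would come from two different frames
    (views) of the same object, and no gradient flows through the teacher.
    """
    teacher_probs = softmax(teacher_logits, teacher_temp)
    student_probs = softmax(student_logits, student_temp)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

def ema_update(teacher_weights, student_weights, momentum=0.99):
    """Move each teacher weight a small step toward the student weight."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]

# Hypothetical logits for two frames of the same object.
teacher_view = [2.0, 0.0, -1.0]
agreeing_student = [2.0, 0.0, -1.0]
disagreeing_student = [-1.0, 0.0, 2.0]
low = distillation_loss(agreeing_student, teacher_view)
high = distillation_loss(disagreeing_student, teacher_view)
print(low < high)
```

A slowly moving teacher is one way to limit drift away from pretrained knowledge, which fits the abstract's warning about forgetting, but the actual safeguards the authors report are head tuning and parameter-efficient adaptation.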

[48] BioDet: Boosting Industrial Object Detection with Image Preprocessing Strategies

Jiaqi Hu,Hongli Xu,Junwen Huang,Peter KT Yu,Slobodan Ilic,Benjamin Busam

Main category: cs.CV

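TL;DR: Proposes a standardized, plug-in pipeline for 2D detection of unseen objects in industrial settings. Low-light enhancement and background removal guided by open-vocabulary detection with foundation models reduce domain shift and background artifacts, significantly improving detection accuracy with negligible inference overhead.

Details Motivation: Existing 6D pose estimation pipelines degrade under challenging conditions such as clutter, poor lighting, and complex backgrounds, and the main bottleneck is the detection stage. Method: Starting from current SOTA baselines, combine low-light image enhancement with background removal driven by open-vocabulary detection with foundation models, suppressing the false positives in raw SAM outputs and making detections more reliable. Result: Experiments on real-world industrial bin-picking benchmarks from BOP show that the method significantly improves detection accuracy with very small inference overhead. Conclusion: The method is effective and practical; it significantly improves 2D detection of unseen objects in industrial environments and in turn benefits downstream 6D pose estimation. Abstract: Accurate 6D pose estimation is essential for robotic manipulation in industrial environments. Existing pipelines typically rely on off-the-shelf object detectors followed by cropping and pose refinement, but their performance degrades under challenging conditions such as clutter, poor lighting, and complex backgrounds, making detection the critical bottleneck. In this work, we introduce a standardized and plug-in pipeline for 2D detection of unseen objects in industrial settings. Based on current SOTA baselines, our approach reduces domain shift and background artifacts through low-light image enhancement and background removal guided by open-vocabulary detection with foundation models. This design suppresses the false positives prevalent in raw SAM outputs, yielding more reliable detections for downstream pose estimation. Extensive experiments on real-world industrial bin-picking benchmarks from BOP demonstrate that our method significantly boosts detection accuracy while incurring negligible inference overhead, showing the effectiveness and practicality of the proposed method.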

[49] Deep learning-based automated damage detection in concrete structures using images from earthquake events

Abdullah Turer,Yongsheng Bai,Halil Sezen,Alper Yilmaz

Main category: cs.CV

TL;DR: Using images collected after the 2023 Turkey earthquakes, this study builds a hybrid deep learning framework that automatically detects exposed steel reinforcement and damage levels in concrete structures after an earthquake.

Details Motivation: Timely assessment of structural integrity after an earthquake is crucial for public safety and emergency response. Manual inspection is slow and constrained by post-disaster conditions, so a fast, automated damage identification method is needed. Method: A YOLOv11 model detects cracking, concrete spalling, and exposed reinforcement, and a second fine-tuned YOLO model classifies damage levels. With data augmentation, fine-tuning, and testing on public datasets, the authors also build an automated classification framework that identifies whether an image shows the inside or outside of a building and which structural component it shows. Result: Several deep learning models were trained and validated on real earthquake images, identifying damage types and assigning damage levels automatically, and were combined into a hybrid automated damage assessment system. Conclusion: Combining image collection, annotation, and deep learning makes rapid, reliable automated damage detection achievable across diverse damage contexts, with broad potential for post-disaster emergency assessment. Abstract: Timely assessment of the integrity of structures after seismic events is crucial for public safety and emergency response. This study focuses on assessing the structural damage conditions using deep learning methods to detect exposed steel reinforcement in concrete buildings and bridges after large earthquakes. Steel bars are typically exposed after concrete spalling or large flexural or shear cracks. The amount and distribution of exposed steel reinforcement is an indication of structural damage and degradation. To automatically detect exposed steel bars, new datasets of images collected after the 2023 Turkey Earthquakes were labeled to represent a wide variety of damaged concrete structures. The proposed method builds upon a deep learning framework, enhanced with fine-tuning, data augmentation, and testing on public datasets. An automated classification framework is developed that can be used to identify inside/outside buildings and structural components. Then, a YOLOv11 (You Only Look Once) model is trained to detect cracking and spalling damage and exposed bars. Another YOLO model is finetuned to distinguish different categories of structural damage levels. All these trained models are used to create a hybrid framework to automatically and reliably determine the damage levels from input images. This research demonstrates that rapid and automated damage detection following disasters is achievable across diverse damage contexts by utilizing image data collection, annotation, and deep learning approaches.

[50] ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models

Pranav Saxena,Jimmy Chiun

Main category: cs.CV

TL;DR: This paper proposes ZING-3D, a framework that uses pretrained foundation models to generate open-vocabulary 3D scene graphs in a zero-shot manner, with incremental updates and geometric grounding in 3D.

Details Motivation: Existing 3D scene graph generation methods are mostly limited to single-view settings, cannot be updated incrementally, and lack explicit 3D geometric grounding, so they fall short of what embodied agents need to understand complex environments. Method: Use a pretrained vision-language model (VLM) to reason out a 2D scene graph, then use depth information to project nodes (object features, 3D locations, semantic context) and edges (spatial and semantic relations, inter-object distances) into 3D space, modelling semantics and geometry jointly. Result: Experiments on the Replica and HM3D datasets show that ZING-3D captures spatial and relational knowledge effectively without task-specific training. Conclusion: ZING-3D provides open-vocabulary, incrementally updatable, geometrically grounded 3D scene understanding suited to downstream applications such as robotics. Abstract: Understanding and reasoning about complex 3D environments requires structured scene representations that capture not only objects but also their semantic and spatial relationships. While recent works on 3D scene graph generation have leveraged pretrained VLMs without task-specific fine-tuning, they are largely confined to single-view settings, fail to support incremental updates as new observations arrive, and lack explicit geometric grounding in 3D space, all of which are essential for embodied scenarios. In this paper, we propose ZING-3D, a framework that leverages the vast knowledge of pretrained foundation models to enable open-vocabulary recognition and generate a rich semantic representation of the scene in a zero-shot manner while also enabling incremental updates and geometric grounding in 3D space, making it suitable for downstream robotics applications. Our approach leverages VLM reasoning to generate a rich 2D scene graph, which is grounded in 3D using depth information. Nodes represent open-vocabulary objects with features, 3D locations, and semantic context, while edges capture spatial and semantic relations with inter-object distances. Our experiments on scenes from the Replica and HM3D datasets show that ZING-3D is effective at capturing spatial and relational knowledge without the need for task-specific training.
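The grounding step described above, lifting 2D scene-graph nodes into 3D with depth and attaching inter-object distances to edges, can be sketched with a pinhole camera model. This is an illustration under assumptions, not the ZING-3D implementation: the intrinsics, the detections (pixel center plus metric depth), and the relation triple stand in for what a VLM and a depth sensor would supply.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth into camera-frame 3D coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def ground_scene_graph(detections, relations, intrinsics):
    """Attach 3D positions to nodes and Euclidean distances to edges.

    detections: {label: (u, v, depth)} object centers in pixels and meters.
    relations:  [(subject, predicate, object)] triples from the 2D graph.
    intrinsics: (fx, fy, cx, cy) of the pinhole camera.
    """
    fx, fy, cx, cy = intrinsics
    nodes = {label: backproject(u, v, depth, fx, fy, cx, cy)
             for label, (u, v, depth) in detections.items()}
    edges = [(subj, pred, obj, math.dist(nodes[subj], nodes[obj]))
             for subj, pred, obj in relations]
    return nodes, edges

# Hypothetical single-frame input.
detections = {"chair": (320, 240, 2.0), "table": (420, 240, 2.0)}
relations = [("chair", "next to", "table")]
nodes, edges = ground_scene_graph(
    detections, relations, intrinsics=(500.0, 500.0, 320.0, 240.0))
print(nodes)
print(edges)
```

Supporting incremental updates would additionally require transforming each frame's points into a shared world frame with the camera pose and merging nodes that refer to the same object, which this single-frame sketch leaves out.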

[51] WaveSeg: Enhancing Segmentation Precision via High-Frequency Prior and Mamba-Driven Spectrum Decomposition

Guoan Xu,Yang Xiao,Wenjing Jia,Guangwei Gao,Guo-Jun Qi,Chia-Wen Lin

Main category: cs.CV

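TL;DR: This paper proposes WaveSeg, a novel decoder architecture that jointly optimizes feature refinement in the spatial and wavelet domains. It combines a high-frequency prior, a multi-scale fusion mechanism (DDO), a Spectrum Decomposition Attention (SDA) block, and reparameterized convolutions to improve semantic segmentation accuracy and detail preservation.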

Details Motivation: Existing semantic segmentation networks mostly rely on powerful pretrained encoders but use simple decoders, which leads to a suboptimal trade-off between semantic context and detail preservation. The paper therefore aims to design an efficient decoder that strengthens both semantic information and boundary details. Method: The WaveSeg decoder (1) uses high-frequency image components as priors to reinforce boundaries; (2) applies Dual Domain Operation (DDO) for multi-scale fusion; (3) introduces a Mamba-based Spectrum Decomposition Attention (SDA) block to enhance long-range modeling; (4) uses reparameterized convolutions to preserve low-frequency semantic integrity in the wavelet domain; and (5) applies residual-guided fusion to produce high-fidelity feature maps. Result: Experiments on several standard datasets show that WaveSeg outperforms current state-of-the-art methods both quantitatively and qualitatively, giving efficient and precise segmentation. Conclusion: By combining a wavelet-domain frequency prior with Mamba-based attention, WaveSeg balances semantic understanding against detail recovery and offers a high-performing, efficient decoder design for semantic segmentation. Abstract: While recent semantic segmentation networks heavily rely on powerful pretrained encoders, most employ simplistic decoders, leading to suboptimal trade-offs between semantic context and fine-grained detail preservation. To address this, we propose a novel decoder architecture, WaveSeg, which jointly optimizes feature refinement in spatial and wavelet domains. Specifically, high-frequency components are first learned from input images as explicit priors to reinforce boundary details at early stages. A multi-scale fusion mechanism, Dual Domain Operation (DDO), is then applied, and the novel Spectrum Decomposition Attention (SDA) block is proposed, which is developed to leverage Mamba's linear-complexity long-range modeling to enhance high-frequency structural details. Meanwhile, reparameterized convolutions are applied to preserve low-frequency semantic integrity in the wavelet domain. Finally, a residual-guided fusion integrates multi-scale features with boundary-aware representations at native resolution, producing semantically and structurally rich feature maps. Extensive experiments on standard benchmarks demonstrate that WaveSeg, leveraging wavelet-domain frequency prior with Mamba-based attention, consistently outperforms state-of-the-art approaches both quantitatively and qualitatively, achieving efficient and precise segmentation.
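
To make the low-frequency/high-frequency split concrete, here is a minimal sketch of a one-level 2D Haar decomposition. The abstract does not say which wavelet WaveSeg uses, so the Haar basis, the averaging normalization, and the band names below are assumptions chosen for illustration.

```python
def haar2d(img):
    """One-level 2D Haar split of an even-sized grayscale image (list of rows).

    Returns four half-resolution bands: low (local average) and three
    high-frequency detail bands (left-right, top-bottom, diagonal differences).
    """
    low, horiz, vert, diag = [], [], [], []
    for r in range(0, len(img), 2):
        row_low, row_h, row_v, row_d = [], [], [], []
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            row_low.append((a + b + d + e) / 4)  # smooth content (semantics)
            row_h.append((a - b + d - e) / 4)    # left-right change (vertical edges)
            row_v.append((a + b - d - e) / 4)    # top-bottom change (horizontal edges)
            row_d.append((a - b - d + e) / 4)    # diagonal detail
        low.append(row_low)
        horiz.append(row_h)
        vert.append(row_v)
        diag.append(row_d)
    return low, horiz, vert, diag

# A 2x2 patch containing a vertical edge: only the left-right band responds.
low, horiz, vert, diag = haar2d([[0, 1], [0, 1]])
print(low, horiz, vert, diag)
```

Boundary information concentrates in the three detail bands, which is why a decoder can treat them separately from the smooth band.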

[52] Knowledge-Driven Vision-Language Model for Plexus Detection in Hirschsprung's Disease

Youssef Megahed,Atallah Madi,Dina El Demellawy,Adrian D. C. Chan

Main category: cs.CV

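TL;DR: Proposes a multimodal framework that combines expert-derived textual concepts with a vision-language model to guide myenteric plexus classification, outperforming conventional CNN models in accuracy, precision, and specificity.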

Details Motivation: Deep learning models perform well in histopathology classification but lack interpretability and do not mirror how physicians make decisions, so clinical expertise needs to be incorporated to make models more interpretable and clinically relevant. Method: Built on a Contrastive Language-Image Pre-training-based vision-language model, the approach uses text prompts generated by large language models and reviewed by experts, encodes them with QuiltNet, and aligns clinical semantic cues with visual features to classify the plexus. Result: The model reaches an accuracy of 83.9%, a precision of 86.6%, and a specificity of 87.6%, outperforming CNN models including VGG-19, ResNet-18, and ResNet-50 on these metrics. Conclusion: Multimodal learning that incorporates expert knowledge gives stronger performance and higher clinical relevance in histopathology classification and shows considerable potential for medical image analysis. Abstract: Hirschsprung's disease is defined as the congenital absence of ganglion cells in some segment(s) of the colon. The muscle cannot make coordinated movements to propel stool in that section, most commonly leading to obstruction. The diagnosis and treatment for this disease require a clear identification of different region(s) of the myenteric plexus, where ganglion cells should be present, on the microscopic view of the tissue slide. While deep learning approaches, such as Convolutional Neural Networks, have performed very well in this task, they are often treated as black boxes, with minimal understanding gained from them, and may not conform to how a physician makes decisions. In this study, we propose a novel framework that integrates expert-derived textual concepts into a Contrastive Language-Image Pre-training-based vision-language model to guide plexus classification. Using prompts derived from expert sources (e.g., medical textbooks and papers) generated by large language models and reviewed by our team before being encoded with QuiltNet, our approach aligns clinically relevant semantic cues with visual features. Experimental results show that the proposed model demonstrated superior discriminative capability across different classification metrics as it outperformed CNN-based models, including VGG-19, ResNet-18, and ResNet-50, achieving an accuracy of 83.9%, a precision of 86.6%, and a specificity of 87.6%. These findings highlight the potential of multi-modal learning in histopathology and underscore the value of incorporating expert knowledge for more clinically relevant model outputs.
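
For readers less familiar with the three reported metrics, the sketch below shows how accuracy, precision, and specificity follow from binary confusion-matrix counts. The counts are hypothetical and are not the paper's data.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, and specificity from binary confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # all correct / all samples
        "precision": tp / (tp + fp),                  # correct positives / predicted positives
        "specificity": tn / (tn + fp),                # correct negatives / actual negatives
    }

# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=40, fp=10, tn=40, fn=10)
print(metrics)
```

Specificity is the metric most sensitive to false positives among actual negatives, which matters when a false "plexus present" call could mislead a diagnosis.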

[53] HistRetinex: Optimizing Retinex model in Histogram Domain for Efficient Low-Light Image Enhancement

Jingtian Zhao,Xueli Xie,Jianxiang Xi,Xiaogang Yang,Haoxuan Sun

Main category: cs.CV

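TL;DR: Proposes HistRetinex, a fast low-light image enhancement method based on a Retinex model in the histogram domain, improving both processing speed and visual quality.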

Details Motivation: Conventional Retinex methods are slow on large images, so efficiency needs to improve. Method: The Retinex model is extended from the spatial domain to the histogram domain: histogram location and count matrices are constructed, and a two-level optimization model is solved for the illumination and reflectance histograms. Result: The method takes only 1.86 seconds on 1000*664 images, saving at least 6.67 seconds compared with existing methods, while performing better in visual quality and metrics. Conclusion: HistRetinex greatly increases processing speed while keeping strong enhancement quality, making it suitable for efficient low-light image enhancement. Abstract: Retinex-based low-light image enhancement methods are widely used due to their excellent performance. However, most of them are time-consuming for large-sized images. This paper extends the Retinex model from the spatial domain to the histogram domain, and proposes a novel histogram-based Retinex model for fast low-light image enhancement, named HistRetinex. Firstly, we define the histogram location matrix and the histogram count matrix, which establish the relationship among histograms of the illumination, reflectance and the low-light image. Secondly, based on the prior information and the histogram-based Retinex model, we construct a novel two-level optimization model. Through solving the optimization model, we give the iterative formulas of the illumination histogram and the reflectance histogram, respectively. Finally, we enhance the low-light image through matching its histogram with the one provided by HistRetinex. Experimental results demonstrate that HistRetinex outperforms existing enhancement methods in both visibility and performance metrics, while executing in 1.86 seconds on 1000*664 resolution images, achieving a minimum time saving of 6.67 seconds.
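
The final step, matching the low-light image's histogram to a target histogram, is classic histogram specification. The sketch below shows that step only, under the assumption that the target histogram is already available (in the paper it would come from the HistRetinex optimization, which is not reproduced here); the function names and the tiny 4-level example are illustrative.

```python
def cdf(hist):
    """Cumulative distribution of a histogram, normalised to end at 1."""
    total = sum(hist)
    acc, out = 0, []
    for h in hist:
        acc += h
        out.append(acc / total)
    return out

def match_histogram(pixels, target_hist, levels=256):
    """Map integer pixel values so their histogram follows target_hist."""
    src_hist = [0] * levels
    for p in pixels:
        src_hist[p] += 1
    src_cdf, tgt_cdf = cdf(src_hist), cdf(target_hist)
    lut, j = [], 0
    for s in src_cdf:
        # Both CDFs are non-decreasing, so the target pointer only moves forward.
        while j < levels - 1 and tgt_cdf[j] < s:
            j += 1
        lut.append(j)
    return [lut[p] for p in pixels]

# Dark 4-level image pushed toward a brighter (hypothetical) target histogram.
enhanced = match_histogram([0, 0, 1, 1], target_hist=[0, 0, 2, 2], levels=4)
print(enhanced)
```

Because the lookup table has only `levels` entries, the per-pixel cost is a single table read, which is consistent with the speed argument made for working in the histogram domain.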

[54] PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments

Weijie Zhou,Xuantang Xiong,Yi Peng,Manli Tao,Chaoyang Zhao,Honghui Dong,Ming Tang,Jinqiao Wang

Main category: cs.CV

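TL;DR: This paper introduces the Active Visual Reasoning (AVR) task, which models real-world settings with incomplete information and improves the visual reasoning of multimodal large language models through a closed loop of perception, reasoning, and action in interactive environments. It releases the CLEVR-AVR benchmark and the AVR-152k dataset, and proposes PhysVLM-AVR, which achieves state-of-the-art performance on several tasks.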

Details Motivation: Existing MLLMs perform visual reasoning in static, fully observable settings and struggle with the incomplete information caused by occlusion or a limited field of view in the real world. Humans gather information through active exploration, so visual reasoning systems that can actively acquire information are needed. Method: The paper defines the AVR task framework, designs the CLEVR-AVR simulation benchmark and the large-scale AVR-152k dataset with Chain-of-Thought annotations, and trains PhysVLM-AVR, an MLLM that closes the perception-reasoning-action loop. Result: PhysVLM-AVR achieves state-of-the-art performance on CLEVR-AVR, OpenEQA, RoboVQA, GeoMath, and Geometry30K. Experiments also show that current embodied MLLMs can detect missing information but struggle to acquire and integrate new information through interaction. Conclusion: Active visual reasoning is a key direction for improving MLLM reasoning in partially observable environments. The AVR framework and dataset provide a foundation for future research and expose the weakness of current models in active information acquisition and integration. Abstract: Visual reasoning in multimodal large language models (MLLMs) has primarily been studied in static, fully observable settings, limiting their effectiveness in real-world environments where information is often incomplete due to occlusion or limited field of view. Humans, in contrast, actively explore and interact with their environment (moving, examining, and manipulating objects) to gather information through a closed-loop process integrating perception, reasoning, and action. Inspired by this human capability, we introduce the Active Visual Reasoning (AVR) task, extending visual reasoning to partially observable, interactive environments. AVR requires agents to: (1) actively acquire information via sequential physical actions, (2) integrate observations across multiple steps for coherent reasoning, and (3) dynamically adjust decisions based on evolving visual feedback. To rigorously evaluate AVR, we introduce CLEVR-AVR, a simulation benchmark featuring multi-round interactive environments designed to assess both reasoning correctness and information-gathering efficiency. We present AVR-152k, a large-scale dataset that offers rich Chain-of-Thought (CoT) annotations detailing iterative reasoning for uncertainty identification, action-conditioned information gain prediction, and information-maximizing action selection, crucial for training agents in a higher-order Markov Decision Process.
Building on this, we develop PhysVLM-AVR, an MLLM achieving state-of-the-art performance on CLEVR-AVR, embodied reasoning (OpenEQA, RoboVQA), and passive visual reasoning (GeoMath, Geometry30K). Our analysis also reveals that current embodied MLLMs, despite detecting information incompleteness, struggle to actively acquire and integrate new information through interaction, highlighting a fundamental gap in active reasoning capabilities.

[55] Urban 3D Change Detection Using LiDAR Sensor for HD Map Maintenance and Smart Mobility

Hezam Albagami,Haitian Wang,Xinyu Wang,Muhammad Ibrahim,Zainy M. Malakan,Abdullah M. Alqamdi,Mohammed H. Alghamdi,Ajmal Mian

Main category: cs.CV

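TL;DR: Proposes an object-centric, uncertainty-aware bi-temporal change detection pipeline for city-scale LiDAR point clouds. Using multi-resolution registration, semantic and instance segmentation, and class-constrained matching, it handles split and merge cases and outperforms existing methods in accuracy, mF1, and mIoU.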

Details Motivation: Existing change detection methods are sensitive to small vertical bias, ground slope, and viewpoint differences, struggle to preserve object identity, and do not model uncertainty, which leaves split and merge cases unresolved. Method: The two epochs are registered with multi-resolution NDT followed by point-to-plane ICP, and a local level of detection is estimated from registration covariance and surface roughness. Geometric proxies seed cross-epoch associations, which are refined with semantic and instance segmentation and class-constrained bipartite matching. Tiled processing bounds memory, and instance-level change decisions fuse 3D overlap, normal-direction displacement, height and volume differences, and a histogram distance. Result: On 15 Subiaco blocks the method reaches 95.2% accuracy, 90.4% mF1, and 82.6% mIoU, exceeding Triplet KPConv by 0.2 to 0.8 percentage points, with the IoU of the Decreased class improving by 7.6 points to 74.8%. Conclusion: The method delivers accurate, robust city-scale LiDAR change detection, handles instance splits and merges, and suppresses false detections while preserving narrow ground changes, making it suitable for practical uses such as HD map updating. Abstract: High-definition 3D city maps underpin smart transportation, digital twins, and autonomous driving, where object-level change detection across bi-temporal LiDAR enables HD map maintenance, construction monitoring, and reliable localization. Classical DSM differencing and image-based methods are sensitive to small vertical bias, ground slope, and viewpoint mismatch and yield cellwise outputs without object identity. Point-based neural models and voxel encodings demand large memory, assume near-perfect pre-alignment, degrade thin structures, and seldom enforce class-consistent association, which leaves split or merge cases unresolved and ignores uncertainty. We propose an object-centric, uncertainty-aware pipeline for city-scale LiDAR that aligns epochs with multi-resolution NDT followed by point-to-plane ICP, normalizes height, and derives a per-location level of detection from registration covariance and surface roughness to calibrate decisions and suppress spurious changes. Geometry-only proxies seed cross-epoch associations that are refined by semantic and instance segmentation and a class-constrained bipartite assignment with augmented dummies to handle splits and merges while preserving per-class counts. Tiled processing bounds memory without eroding narrow ground changes, and instance-level decisions combine 3D overlap, normal-direction displacement, and height and volume differences with a histogram distance, all gated by the local level of detection to remain stable under partial overlap and sampling variation.
On 15 representative Subiaco blocks the method attains 95.2% accuracy, 90.4% mF1, and 82.6% mIoU, exceeding Triplet KPConv by 0.2 percentage points in accuracy, 0.2 in mF1, and 0.8 in mIoU, with the largest gain on Decreased where IoU reaches 74.8% and improves by 7.6 points.

[56] Controllable-LPMoE: Adapting to Challenging Object Segmentation via Dynamic Local Priors from Mixture-of-Experts

Yanguang Sun,Jiawei Lian,Jian Yang,Lei Luo

Main category: cs.CV

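TL;DR: This paper proposes Controllable-LPMoE, a lightweight fine-tuning method based on dynamic priors. It dynamically controls local priors to enhance the fine-grained perception of large-scale foundation models on specific segmentation tasks, with far fewer trainable parameters and higher efficiency.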

Details Motivation: Full-parameter fine-tuning of large models is computationally expensive, and existing approaches that pair a frozen model with learnable prompts lack semantic priors and adapt poorly. Method: A lightweight dynamic mixed local priors extractor combines heterogeneous convolutions with a gating network to generate expert priors dynamically. A bi-directional interaction adapter then uses cosine-aligned deformable attention and channel-oriented adaptive scale enhancement to let frozen and trainable features interact efficiently. Result: The method outperforms 31 SOTA methods on multiple binary object segmentation tasks, confirming its segmentation performance and adaptability. Conclusion: Controllable-LPMoE achieves efficient, flexible fine-tuning with fewer trainable parameters and offers a new paradigm for applying large models to downstream segmentation tasks. Abstract: Large-scale foundation models provide powerful feature representations for downstream object segmentation tasks. However, when adapted to specific tasks through full-parameter fine-tuning, the enormous number of parameters being updated often results in significant computational overhead, creating a bottleneck in training efficiency. Although existing methods attempt to fine-tune frozen models by directly embedding trainable prompts, these prompts lack inherent semantic priors, limiting the adaptability of large-scale models. In this paper, we propose a novel dynamic priors-based fine-tuning paradigm with fewer trainable parameters, dubbed Controllable-LPMoE, which adaptively modulates frozen foundation models by dynamically controlling local priors to enhance fine-grained perception for specific segmentation tasks. More specifically, we construct a lightweight dynamic mixed local priors extractor that captures diverse local priors from input images through heterogeneous convolutions while employing a gating network to dynamically output expert priors required for the subsequent fine-tuning. Furthermore, we design a bi-directional interaction adapter that employs cosine-aligned deformable attention and channel-oriented adaptive scale enhancement to interact and restructure between frozen and trainable features, achieving efficient fine-tuning. Extensive experiments validate the superiority of our Controllable-LPMoE approach (https://github.com/CSYSI/Controllable-LPMoE), demonstrating excellent segmentation performance compared to 31 state-of-the-art (SOTA) methods and adaptability to multiple binary object segmentation tasks.

[57] SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation

Alec Helbling,Shruti Palaskar,Kundan Krishna,Polo Chau,Leon Gatys,Joseph Yitan Cheng

Main category: cs.CV

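TL;DR: This paper proposes SafetyPairs, a scalable framework for generating counterfactual image pairs that isolate what makes an image safe or unsafe: targeted edits to key features flip the safety label. It also builds a new benchmark of over 3,020 SafetyPair images for evaluating and improving the safety judgments of vision-language models.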

Details Motivation: Existing image safety datasets are too coarse and do not isolate the specific features that cause a safety difference, making it hard to measure how sensitive models are to subtle changes. Method: Image editing models generate counterfactual image pairs (SafetyPairs) that differ only in safety-relevant features, systematically flipping the safety label while leaving unrelated details unchanged. Result: The new benchmark covers 9 safety categories and contains over 3,020 SafetyPair images. It exposes weaknesses in the safety judgments of vision-language models and, used as data augmentation, improves the sample efficiency of training lightweight guard models. Conclusion: SafetyPairs is the first systematic resource for studying fine-grained image safety distinctions; it supports model evaluation and also improves the training of safety models. Abstract: What exactly makes a particular image unsafe? Systematically differentiating between benign and problematic images is a challenging problem, as subtle changes to an image, such as an insulting gesture or symbol, can drastically alter its safety implications. However, existing image safety datasets are coarse and ambiguous, offering only broad safety labels without isolating the specific features that drive these differences. We introduce SafetyPairs, a scalable framework for generating counterfactual pairs of images that differ only in the features relevant to the given safety policy, thus flipping their safety label. By leveraging image editing models, we make targeted changes to images that alter their safety labels while leaving safety-irrelevant details unchanged. Using SafetyPairs, we construct a new safety benchmark, which serves as a powerful source of evaluation data that highlights weaknesses in vision-language models' abilities to distinguish between subtly different images. Beyond evaluation, we find our pipeline serves as an effective data augmentation strategy that improves the sample efficiency of training lightweight guard models. We release a benchmark containing over 3,020 SafetyPair images spanning a diverse taxonomy of 9 safety categories, providing the first systematic resource for studying fine-grained image safety distinctions.

[58] NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

Longtian Qiu,Shan Ning,Jiaxuan Sun,Xuming He

Main category: cs.CV

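TL;DR: Proposes NoisyGRPO, a multimodal reinforcement learning framework that introduces controllable noise and Bayesian advantage estimation to improve the generalization and robustness of multimodal large models in chain-of-thought reasoning.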

Details Motivation: When existing reinforcement learning frameworks are used to improve the general chain-of-thought reasoning of multimodal large language models, they generalize poorly beyond the training distribution. Method: (1) Noise-injected exploration policy: Gaussian noise is added to visual inputs to encourage exploration. (2) Bayesian advantage estimation: advantage estimation is cast as a Bayesian inference problem, with the noise level as the prior and the trajectory reward as the likelihood. Result: On standard CoT quality, general capability, and hallucination benchmarks, NoisyGRPO substantially improves generalization and robustness, especially for small-scale MLLMs such as Qwen2.5-VL 3B. Conclusion: Through noise-enhanced exploration and Bayesian advantage estimation, NoisyGRPO improves the stability and generalization of multimodal large models on complex reasoning tasks. Abstract: Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood. This Bayesian modeling fuses both sources of information to compute a robust posterior estimate of trajectory advantage, effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones. Experiments on standard CoT quality, general capability, and hallucination benchmarks demonstrate that NoisyGRPO substantially improves generalization and robustness, especially in RL settings with small-scale MLLMs such as Qwen2.5-VL 3B. The project page is available at https://artanic30.github.io/project_pages/NoisyGRPO/.
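
The abstract does not give the formulas, so the following is only a rough sketch of how a noise-level prior and a reward observation could be fused. It assumes Gaussian prior and likelihood (a precision-weighted update), a hypothetical mapping `prior mean = 1 - noise level`, unit variances, and simple group-mean centering; none of these choices are confirmed by the source.

```python
import random

def add_gaussian_noise(pixels, sigma, rng):
    """Perturb pixel values in [0, 1] with zero-mean Gaussian noise, then clip."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]

def gaussian_posterior(prior_mean, prior_var, obs, obs_var):
    """Precision-weighted fusion of a Gaussian prior with one Gaussian observation."""
    precision = 1.0 / prior_var + 1.0 / obs_var
    mean = (prior_mean / prior_var + obs / obs_var) / precision
    return mean, 1.0 / precision

def posterior_advantages(noise_levels, rewards, prior_var=1.0, obs_var=1.0):
    """Assumed prior: cleaner input (lower noise) implies a higher expected quality."""
    post = [gaussian_posterior(1.0 - n, prior_var, r, obs_var)[0]
            for n, r in zip(noise_levels, rewards)]
    baseline = sum(post) / len(post)
    return [p - baseline for p in post]  # centred within the sampled group

rng = random.Random(0)
noisy = add_gaussian_noise([0.2, 0.5, 0.9], sigma=0.3, rng=rng)
# Two trajectories with equal reward: the one from the noisier input is ranked lower.
adv = posterior_advantages(noise_levels=[0.0, 1.0], rewards=[1.0, 1.0])
print(noisy, adv)
```

The point of the example is the qualitative behaviour described in the abstract: with identical rewards, the trajectory generated from the cleaner image receives the larger advantage.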

[59] Digital Contrast CT Pulmonary Angiography Synthesis from Non-contrast CT for Pulmonary Vascular Disease

Ying Ming,Yue Lin,Longfei Zhao,Gengwan Li,Zuopeng Tan,Bing Li,Sheng Xie,Wei Song,Qiqi Xu

Main category: cs.CV

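TL;DR: Proposes a CycleGAN-based cascaded synthesis method that generates digital contrast CTPA from non-contrast CT, reducing the risks of iodinated contrast agents while performing well on vessel enhancement, image fidelity, and downstream clinical tasks.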

Details Motivation: CTPA relies on iodinated contrast agents, which can cause nephrotoxicity and allergic reactions and pose a particular risk to high-risk patients, so a way to image the pulmonary vasculature without contrast agent is needed. Method: A cascaded synthesizer based on Cycle-Consistent GANs is trained and validated on 410 paired NCCT and CTPA scans from three centers: 249 pairs for internal training and validation, and 161 pairs as an external test set for assessing generalization and downstream clinical tasks. Result: The method outperforms existing SOTA methods on quantitative metrics (validation: MAE 156.28, PSNR 20.71, SSIM 0.98; test: MAE 165.12, PSNR 20.27, SSIM 0.98) and shows good vessel enhancement and structural preservation visually. For pulmonary artery and vein segmentation, Dice, clDice, and clRecall are all markedly better than with NCCT inputs. The ICC for vessel volume against CTPA rises from 0.70 with NCCT to 0.81 with DCCTPA, indicating better enhancement of small vessels. Conclusion: The proposed DCCTPA generation method reproduces the vessel enhancement of real CTPA with good image fidelity and clinical usability, and may reduce reliance on contrast agents and extend CT to high-risk populations. Abstract: Computed Tomography Pulmonary Angiography (CTPA) is the reference standard for diagnosing pulmonary vascular diseases such as Pulmonary Embolism (PE) and Chronic Thromboembolic Pulmonary Hypertension (CTEPH). However, its reliance on iodinated contrast agents poses risks including nephrotoxicity and allergic reactions, particularly in high-risk patients. This study proposes a method to generate Digital Contrast CTPA (DCCTPA) from Non-Contrast CT (NCCT) scans using a cascaded synthesizer based on Cycle-Consistent Generative Adversarial Networks (CycleGAN). In total, 410 paired CTPA and NCCT scans were retrospectively obtained from three centers. The model was trained and validated internally on 249 paired images. An additional dataset comprising 161 paired images served as the test set for evaluating model generalization and validating downstream clinical tasks. Compared with state-of-the-art (SOTA) methods, the proposed method achieved the best overall performance on quantitative metrics (validation: MAE 156.28, PSNR 20.71, SSIM 0.98; test: MAE 165.12, PSNR 20.27, SSIM 0.98) and in qualitative visualization, demonstrating valid vessel enhancement, superior image fidelity and structural preservation. The approach was further applied to downstream tasks of pulmonary vessel segmentation and vascular quantification.
On the test set, the average Dice, clDice, and clRecall for pulmonary artery and vein segmentation were 0.70, 0.71, 0.73 and 0.70, 0.72, 0.75, respectively, all markedly improved compared with NCCT inputs. The Intraclass Correlation Coefficient (ICC) for vessel volume between DCCTPA and CTPA was significantly better than that between NCCT and CTPA (average ICC: 0.81 vs 0.70), indicating effective vascular enhancement in DCCTPA, especially for small vessels.
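
As a reminder of what the reported image and segmentation metrics measure, here is a small sketch of MAE, PSNR, and Dice on flat lists of values. It is a generic illustration with made-up numbers; clDice and clRecall additionally require vessel centerline extraction and are not covered, and the `data_range` argument is an assumption about how PSNR is normalised.

```python
import math

def mae(a, b):
    """Mean absolute error between two equally long intensity sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def psnr(a, b, data_range):
    """Peak signal-to-noise ratio in dB for a given intensity range."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return float("inf") if mse == 0 else 10 * math.log10(data_range ** 2 / mse)

def dice(pred, target):
    """Dice overlap of two binary masks given as flat 0/1 sequences."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2 * inter / total if total else 1.0

# Made-up values for illustration only.
print(mae([0, 0], [10, 10]))             # average absolute intensity error
print(psnr([0, 0], [10, 10], 100))       # higher is better
print(dice([1, 1, 0, 0], [1, 0, 0, 0]))  # 1.0 means perfect overlap
```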

[60] Towards Physics-informed Spatial Intelligence with Human Priors: An Autonomous Driving Pilot Study

Guanlin Wu,Boyan Su,Yang Zhao,Pu Wang,Yichen Lin,Hao Frank Yang

Main category: cs.CV

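TL;DR: This paper proposes the Spatial Intelligence Grid (SIG), a structured grid schema that explicitly encodes object layouts, inter-object relations, and physical priors to improve the evaluation and training of visual-spatial intelligence (VSI) in foundation models. Compared with conventional text-based question answering, SIG represents scene structure more faithfully and separates language priors from spatial ability. SIG-based representations yield larger, more stable, and more comprehensive gains for multimodal large models, and the paper releases SIGBench, a benchmark of 1.4K driving frames.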

Details Motivation: Existing methods often proxy visual-spatial intelligence with purely textual prompts and VQA scoring, which invites linguistic shortcuts, obscures geometry, and makes it hard to attribute results to genuine spatial ability. A more faithful, structured representation is needed to measure and train spatial intelligence accurately. Method: The Spatial Intelligence Grid (SIG) is introduced as a complementary channel to text, using a grid structure to explicitly encode object layouts, relations, and physical priors. New SIG-based evaluation metrics separate spatial ability from language priors. Mainstream multimodal large models are tested under few-shot in-context learning, and SIGBench is built and released with SIG annotations of real driving scenes plus human gaze traces. Result: SIG representations outperform the VQA-only approach on all VSI metrics, with larger, more stable, and more comprehensive gains. SIGBench provides an evaluation platform for both machine and human-like attention tasks. Conclusion: SIG is a promising data-labeling and training schema that improves how foundation models learn and are evaluated on visual-spatial intelligence, particularly in settings such as autonomous driving that require precise spatial reasoning. Abstract: How to integrate and verify spatial intelligence in foundation models remains an open challenge. Current practice often proxies Visual-Spatial Intelligence (VSI) with purely textual prompts and VQA-style scoring, which obscures geometry, invites linguistic shortcuts, and weakens attribution to genuinely spatial skills. We introduce Spatial Intelligence Grid (SIG): a structured, grid-based schema that explicitly encodes object layouts, inter-object relations, and physically grounded priors. As a complementary channel to text, SIG provides a faithful, compositional representation of scene structure for foundation-model reasoning. Building on SIG, we derive SIG-informed evaluation metrics that quantify a model's intrinsic VSI, which separates spatial capability from language priors. In few-shot in-context learning with state-of-the-art multimodal LLMs (e.g., GPT- and Gemini-family models), SIG yields consistently larger, more stable, and more comprehensive gains across all VSI metrics compared to VQA-only representations, indicating its promise as a data-labeling and training schema for learning VSI. We also release SIGBench, a benchmark of 1.4K driving frames annotated with ground-truth SIG labels and human gaze traces, supporting both grid-based machine VSI tasks and attention-driven, human-like VSI tasks in autonomous-driving scenarios.

[61] Blockwise Flow Matching: Improving Flow Matching Models For Efficient High-Quality Generation

Dogyun Park,Taehoon Lee,Minseok Joo,Hyunwoo J. Kim

Main category: cs.CV

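TL;DR: Proposes Blockwise Flow Matching (BFM), a framework that splits the generative trajectory into segments, each modeled by a small dedicated block, to improve inference efficiency and sample quality.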

Details Motivation: Conventional Flow Matching models use a single large network to learn the whole generative process, which makes it hard to capture the distinct signal characteristics of different timesteps and leads to high inference cost. Method: The generative trajectory is partitioned into multiple temporal segments, each modeled by a small specialized velocity block. A Semantic Feature Guidance module and a lightweight Feature Residual Approximation strategy are also introduced. Result: On ImageNet 256x256, BFM achieves a 2.1x to 4.9x reduction in inference complexity over existing methods at comparable generation performance. Conclusion: Through blockwise modeling and semantic guidance, BFM improves both the efficiency and the sample quality of Flow Matching and establishes a better Pareto frontier. Abstract: Recently, Flow Matching models have pushed the boundaries of high-fidelity data generation across a wide range of domains. They typically employ a single large network to learn the entire generative trajectory from noise to data. Despite its effectiveness, this design struggles to capture distinct signal characteristics across timesteps simultaneously and incurs substantial inference costs due to the iterative evaluation of the entire model. To address these limitations, we propose Blockwise Flow Matching (BFM), a novel framework that partitions the generative trajectory into multiple temporal segments, each modeled by smaller but specialized velocity blocks. This blockwise design enables each block to specialize effectively in its designated interval, improving inference efficiency and sample quality. To further enhance generation fidelity, we introduce a Semantic Feature Guidance module that explicitly conditions velocity blocks on semantically rich features aligned with pretrained representations. Additionally, we propose a lightweight Feature Residual Approximation strategy that preserves semantic quality while significantly reducing inference cost. Extensive experiments on ImageNet 256x256 demonstrate that BFM establishes a substantially improved Pareto frontier over existing Flow Matching methods, achieving 2.1x to 4.9x accelerations in inference complexity at comparable generation performance. Code is available at https://github.com/mlvlab/BFM.
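
The blockwise idea, where only the small block responsible for the current time segment is evaluated at each integration step, can be sketched with a toy Euler sampler. The constant-velocity "blocks", the segment boundaries, and all function names are hypothetical stand-ins; the real velocity blocks are neural networks and the paper's sampler may differ.

```python
def make_block(slope):
    """Hypothetical stand-in for a small velocity network: v(x, t) = slope."""
    return lambda x, t: slope

def pick_block(t, boundaries):
    """Index of the temporal segment [boundaries[i], boundaries[i+1]) containing t."""
    for i in range(len(boundaries) - 1):
        if boundaries[i] <= t < boundaries[i + 1]:
            return i
    return len(boundaries) - 2  # t equals the final boundary

def sample(x0, blocks, boundaries, steps):
    """Euler integration from boundaries[0] to boundaries[-1], one block per step."""
    x = x0
    dt = (boundaries[-1] - boundaries[0]) / steps
    for k in range(steps):
        t = boundaries[0] + k * dt
        v = blocks[pick_block(t, boundaries)](x, t)  # only this small block runs
        x += v * dt
    return x

boundaries = [0.0, 0.5, 1.0]               # two temporal segments
blocks = [make_block(2.0), make_block(4.0)]
x1 = sample(0.0, blocks, boundaries, steps=4)
print(x1)
```

Since each step touches one small block instead of one large network, the per-step cost scales with the block size, which is the source of the efficiency gain the abstract describes.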

[62] TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection

Qihang Zhou,Binbin Gao,Guansong Pang,Xin Wang,Jiming Chen,Shibo He

Main category: cs.CV

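TL;DR: Proposes TokenCLIP, a zero-shot anomaly detection framework that dynamically aligns visual tokens with learnable textual spaces at the token level, using an optimal transport formulation for fine-grained semantic matching.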

Details Motivation: Existing methods rely on a single textual space to align with visual semantics across diverse objects and domains, which makes it hard to capture varied anomaly semantics accurately. Method: The token-agnostic textual space is expanded into a set of orthogonal subspaces, and each visual token is dynamically assigned to a combination of subspaces according to semantic affinity. The assignment is formulated as an optimal transport problem, and top-k masking sparsifies the resulting plan. Result: Experiments show superior performance, with finer-grained anomaly detection and better generalization across objects and domains. Conclusion: Through dynamic, customized token-level textual alignment, TokenCLIP improves the accuracy and flexibility of zero-shot anomaly detection. Abstract: Adapting CLIP for anomaly detection on unseen objects has shown strong potential in a zero-shot manner. However, existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. The indiscriminate alignment hinders the model from accurately capturing varied anomaly semantics. We propose TokenCLIP, a token-wise adaptation framework that enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning. Rather than mapping all visual tokens to a single, token-agnostic textual space, TokenCLIP aligns each token with a customized textual subspace that represents its visual characteristics. Explicitly assigning a unique learnable textual space to each token is computationally intractable and prone to insufficient optimization. We instead expand the token-agnostic textual space into a set of orthogonal subspaces, and then dynamically assign each token to a subspace combination guided by semantic affinity, which jointly supports customized and efficient token-wise adaptation. To this end, we formulate dynamic alignment as an optimal transport problem, where all visual tokens in an image are transported to textual subspaces based on semantic similarity. The transport constraints of OT ensure sufficient optimization across subspaces and encourage them to focus on different semantics. Solving the problem yields a transport plan that adaptively assigns each token to semantically relevant subspaces. A top-k masking is then applied to sparsify the plan and specialize subspaces for distinct visual regions. Extensive experiments demonstrate the superiority of TokenCLIP.
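
One common way to solve such a token-to-subspace assignment is entropic optimal transport with Sinkhorn iterations, followed by top-k masking. The sketch below assumes uniform marginals, a regularisation strength `eps`, a made-up similarity matrix, and row renormalisation after masking; the abstract does not state which OT solver or which of these settings TokenCLIP uses.

```python
import math

def sinkhorn(sim, eps=0.5, iters=200):
    """Entropic OT plan between n tokens and m subspaces with uniform marginals."""
    n, m = len(sim), len(sim[0])
    K = [[math.exp(s / eps) for s in row] for row in sim]  # higher similarity, more mass
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def topk_mask(plan, k):
    """Keep each token's k largest transport weights, zero the rest, renormalise rows."""
    out = []
    for row in plan:
        keep = sorted(range(len(row)), key=row.__getitem__, reverse=True)[:k]
        kept = [row[j] if j in keep else 0.0 for j in range(len(row))]
        total = sum(kept)
        out.append([x / total for x in kept])
    return out

# Hypothetical token-to-subspace similarities (4 visual tokens, 2 subspaces).
sim = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.1, 0.9]]
plan = sinkhorn(sim)
sparse = topk_mask(plan, k=1)
print(plan, sparse)
```

The column constraint is what forces every subspace to receive mass, matching the abstract's remark that the OT constraints ensure sufficient optimization across subspaces; top-k masking then makes each token commit to a few of them.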

[63] KBE-DME: Dynamic Multimodal Evaluation via Knowledge Enhanced Benchmark Evolution

Junzhe Zhang,Huixuan Zhang,Xiaojun Wan

Main category: cs.CV

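TL;DR: Proposes Knowledge-enhanced Benchmark Evolution (KBE), a dynamic multimodal evaluation framework that addresses the data contamination and saturation problems of existing static benchmarks for multimodal large models.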

Details Motivation: Existing static benchmarks are at risk of data contamination and saturation, which makes performance evaluations of multimodal large language models inaccurate or misleading. Method: VQA samples are represented as graphs. KBE analyzes the original static benchmark and expands it with multimodal knowledge: it reconstructs questions by re-selecting visual information in the image and augments questions with external textual knowledge, giving dynamic evaluation with controllable difficulty. Result: Experiments show that KBE mitigates data contamination and saturation and assesses MLLM capabilities more comprehensively. Conclusion: KBE provides a controllable, dynamically evolving evaluation framework for multimodal large language models that improves the reliability and coverage of evaluation. Abstract: The rapid progress of multimodal large language models (MLLMs) calls for more reliable evaluation protocols. Existing static benchmarks suffer from the potential risk of data contamination and saturation, leading to inflated or misleading performance evaluations. To address these issues, we first apply a graph formulation to represent a static or dynamic VQA sample. With the formulation, we propose Knowledge-enhanced Benchmark Evolution (KBE), a dynamic multimodal evaluation framework. KBE first analyzes the original static benchmark, then expands it by integrating multimodal knowledge, transforming the static benchmark into a controllable, dynamic evolving version. Crucially, KBE can both reconstruct questions by re-selecting visual information in the original image and expand existing questions with external textual knowledge. It enables difficulty-controllable evaluation by adjusting the degree of question exploration. Extensive experiments demonstrate that KBE alleviates the risks of data contamination and saturation and provides a more comprehensive assessment of MLLM capabilities.

[64] 3rd Place Solution to ICCV LargeFineFoodAI Retrieval

Yang Zhong,Zhiming Wang,Zhaoyang Li,Jinyu Ma,Xiang Li

Main category: cs.CV

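TL;DR: This paper presents the third-place solution to the ICCV LargeFineFoodAI Retrieval competition. Four base models are trained with a combination of ArcFace and Circle losses, TTA and ensembling improve the feature representation, and a new reranking method based on diffusion and k-reciprocal reranking is proposed. The solution scored 0.81219 and 0.81191 mAP@100 on the public and private leaderboards, respectively.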

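Details Motivation: Improve feature representation and retrieval accuracy in large-scale food image retrieval. Method: Four base models are trained independently with a weighted sum of ArcFace and Circle loss, combined with TTA and model ensembling, and a reranking method based on diffusion and k-reciprocal reranking is proposed. Result: The method reaches 0.81219 and 0.81191 mAP@100 on the public and private leaderboards, respectively. Conclusion: The approach performs well on food image retrieval and supports the effectiveness of combining loss functions, model ensembling, and the new reranking strategy. Abstract: This paper introduces the 3rd place solution to the ICCV LargeFineFoodAI Retrieval Competition on Kaggle. Four basic models are independently trained with the weighted sum of ArcFace and Circle loss, then TTA and Ensemble are successively applied to improve feature representation ability. In addition, a new reranking method for retrieval is proposed based on diffusion and k-reciprocal reranking. Finally, our method scored 0.81219 and 0.81191 mAP@100 on the public and private leaderboard, respectively.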
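
The core notion behind k-reciprocal reranking is that two items are trusted as neighbors only if each appears in the other's top-k list. The sketch below computes just that neighbor set on a toy distance matrix; it is not the authors' reranking method, which also involves diffusion and is not specified in the abstract.

```python
def top_k(dist, i, k):
    """Indices of the k nearest items to item i, excluding i itself."""
    order = sorted((j for j in range(len(dist)) if j != i), key=lambda j: dist[i][j])
    return set(order[:k])

def k_reciprocal(dist, i, k):
    """Items j that are in i's top-k and also have i in their own top-k."""
    return {j for j in top_k(dist, i, k) if i in top_k(dist, j, k)}

# Toy 1D embeddings; distances are absolute differences.
points = [0.0, 1.0, 1.5, 10.0]
dist = [[abs(a - b) for b in points] for a in points]

print(k_reciprocal(dist, 1, k=1))  # items 1 and 2 are mutual nearest neighbors
print(k_reciprocal(dist, 0, k=1))  # item 0's nearest neighbor does not return the favour
```

Filtering by reciprocity removes one-sided matches, such as an outlier whose nearest neighbor is a dense cluster member, before any distances are re-weighted.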

[65] 3rd Place Solution to Large-scale Fine-grained Food Recognition

Yang Zhong,Yifan Yao,Tong Luo,Youcai Zhang,Yaqian Li

Main category: cs.CV

TL;DR: This paper proposes a method combining Arcface loss and Circle loss for fine-grained food recognition, which won 3rd place in the LargeFineFoodAI-ICCV Workshop challenge on Kaggle.

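Details Motivation: Fine-grained food recognition matters for food analysis in the health domain, and recognition accuracy must improve to support practical applications. Method: Models are trained with a combination of Arcface loss and Circle loss under carefully tuned configurations, and the final result is obtained by model ensembling. Result: The method improves fine-grained food recognition performance and took 3rd place in the Kaggle competition. Conclusion: A suitable combination of Arcface and Circle loss improves the accuracy of fine-grained food recognition, underscoring the importance of loss function design for this task. Abstract: Food analysis is becoming a hot topic in the health area, in which the fine-grained food recognition task plays an important role. In this paper, we describe the details of our solution to the LargeFineFoodAI-ICCV Workshop-Recognition challenge held on Kaggle. We find a proper combination of Arcface loss[1] and Circle loss[9] can bring improvement to the performance. With Arcface and the combined loss, the model was trained with carefully tuned configurations and ensembled to get the final results. Our solution won the 3rd place in the competition.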

[66] Improved Training Technique for Shortcut Models

Anh Nguyen,Viet Nguyen,Duc Vu,Trung Dao,Chi Tran,Toan Tran,Anh Tran

Main category: cs.CV

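TL;DR: This paper proposes iSM, a unified training framework that addresses five key problems of shortcut generative models: compounding guidance, inflexible fixed guidance, frequency bias, divergent self-consistency, and curvy flow trajectories. By introducing Intrinsic Guidance, a Multi-Level Wavelet Loss, Scaling Optimal Transport, and a Twin EMA strategy, it substantially improves one-step, few-step, and multi-step generation on ImageNet 256×256.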

Details Motivation: Shortcut models support one-step, few-step, and multi-step sampling, but compounding guidance, frequency bias, and a conflict between EMA training and self-consistency limit their performance and hold back wider adoption, so a systematic solution is needed. Method: The iSM framework makes four core improvements: (1) Intrinsic Guidance for dynamically adjustable guidance strength; (2) a Multi-Level Wavelet Loss to mitigate frequency bias; (3) Scaling Optimal Transport to learn straighter, more stable generative paths; and (4) a Twin EMA strategy to reconcile EMA training with self-consistency. Result: On ImageNet 256×256, iSM clearly outperforms baseline shortcut models in one-step, few-step, and multi-step generation, with substantial FID improvements. Conclusion: By systematically resolving the key bottlenecks of shortcut models, iSM makes them an efficient and competitive class of generative models and advances non-adversarial generative modeling. Abstract: Shortcut models represent a promising, non-adversarial paradigm for generative modeling, uniquely supporting one-step, few-step, and multi-step sampling from a single trained network. However, their widespread adoption has been stymied by critical performance bottlenecks. This paper tackles the five core issues that held shortcut models back: (1) the hidden flaw of compounding guidance, which we are the first to formalize, causing severe image artifacts; (2) inflexible fixed guidance that restricts inference-time control; (3) a pervasive frequency bias driven by a reliance on low-level distances in the direct domain, which biases reconstructions toward low frequencies; (4) divergent self-consistency arising from a conflict with EMA training; and (5) curvy flow trajectories that impede convergence. To address these challenges, we introduce iSM, a unified training framework that systematically resolves each limitation. Our framework is built on four key improvements: Intrinsic Guidance provides explicit, dynamic control over guidance strength, resolving both compounding guidance and inflexibility. A Multi-Level Wavelet Loss mitigates frequency bias to restore high-frequency details. Scaling Optimal Transport (sOT) reduces training variance and learns straighter, more stable generative paths. Finally, a Twin EMA strategy reconciles training stability with self-consistency.
Extensive experiments on ImageNet 256 x 256 demonstrate that our approach yields substantial FID improvements over baseline shortcut models across one-step, few-step, and multi-step generation, making shortcut models a viable and competitive class of generative models.

[67] Topology Sculptor, Shape Refiner: Discrete Diffusion Model for High-Fidelity 3D Meshes Generation

Kaiyu Song,Hanjiang Lai,Yaqing Zhang,Chuangjian Cai,Yan Pan,Kun Yue,Jian Yin

Main category: cs.CV

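TL;DR: This paper proposes TSSR, a new method for generating artist-style 3D meshes based on Discrete Diffusion Models (DDMs). Using parallel generation and three key techniques (decoupled training with hybrid inference, an improved hourglass architecture with RoPE, and a connection loss), it generates high-quality meshes with up to 10,000 faces at a resolution of $1024^3$ on complex datasets.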

Details Motivation: To achieve highly accurate mesh token prediction while using parallel generation to improve efficiency, overcoming the sequential bottleneck of traditional autoregressive methods. Method: Adopts a decoupled training and hybrid inference strategy that splits generation into a topology sculpting stage and a shape refinement stage; designs an improved hourglass architecture with bidirectional attention and rotary positional embeddings (RoPE); and introduces a connection loss as a topological constraint to improve generation quality. Result: Experiments on complex datasets show that TSSR generates high-quality artist-style 3D meshes with up to 10,000 faces at a spatial resolution of $1024^3$, with strong detail and topological accuracy. Conclusion: Through parallel diffusion modeling and several architectural innovations, TSSR markedly improves the quality and efficiency of artist-style 3D mesh generation, offering an effective approach to high-resolution 3D content creation. Abstract: In this paper, we introduce Topology Sculptor, Shape Refiner (TSSR), a novel method for generating high-quality, artist-style 3D meshes based on Discrete Diffusion Models (DDMs). Our primary motivation for TSSR is to achieve highly accurate token prediction while enabling parallel generation, a significant advantage over sequential autoregressive methods. By allowing TSSR to "see" all mesh tokens concurrently, we unlock a new level of efficiency and control. We leverage this parallel generation capability through three key innovations: 1) Decoupled Training and Hybrid Inference, which distinctly separates the DDM-based generation into a topology sculpting stage and a subsequent shape refinement stage. This strategic decoupling enables TSSR to effectively capture both intricate local topology and overarching global shape. 2) An Improved Hourglass Architecture, featuring bidirectional attention enriched by face-vertex-sequence level Rotational Positional Embeddings (RoPE), thereby capturing richer contextual information across the mesh structure. 3) A novel Connection Loss, which acts as a topological constraint to further enhance the realism and fidelity of the generated meshes. Extensive experiments on complex datasets demonstrate that TSSR generates high-quality 3D artist-style meshes, capable of achieving up to 10,000 faces at a remarkable spatial resolution of $1024^3$. The code will be released at: https://github.com/psky1111/Tencent-TSSR.

[68] Towards Physically Executable 3D Gaussian for Embodied Navigation

Bingchen Miao,Rong Wei,Zhiqi Ge,Xiaoquan Sun,Shiqi Gao,Jingzhe Zhu,Renhan Wang,Siliang Tang,Jun Xiao,Rui Tang,Juncheng Li

Main category: cs.CV

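TL;DR: Proposes SAGE-3D, a new paradigm that upgrades 3D Gaussian Splatting into an executable, semantically and physically aligned environment for Visual-Language Navigation (VLN), validated with the newly released InteriorGS dataset and the SAGE-Bench benchmark.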

Details Motivation: 3D Gaussian Splatting (3DGS) excels at real-time rendering but lacks fine-grained semantics and physical executability, which limits its use in Visual-Language Navigation (VLN). Method: Proposes SAGE-3D with two components: (1) Object-Centric Semantic Grounding, which adds object-level fine-grained annotations to 3DGS; and (2) Physics-Aware Execution Jointing, which embeds collision objects into 3DGS and builds rich physical interfaces. Also releases InteriorGS, a dataset of 1K indoor scenes, and SAGE-Bench, the first 3DGS-based VLN benchmark, with 2M VLN data. Result: Experiments show that 3DGS scene data is harder to converge on but generalizes strongly, improving baseline performance by 31% on the VLN-CE Unseen task. Conclusion: SAGE-3D improves the semantic and physical alignment of 3DGS for visual-language navigation and offers a new paradigm for building more realistic executable virtual environments. Abstract: 3D Gaussian Splatting (3DGS), a 3D representation method with photorealistic real-time rendering capabilities, is regarded as an effective tool for narrowing the sim-to-real gap. However, it lacks fine-grained semantics and physical executability for Visual-Language Navigation (VLN). To address this, we propose SAGE-3D (Semantically and Physically Aligned Gaussian Environments for 3D Navigation), a new paradigm that upgrades 3DGS into an executable, semantically and physically aligned environment. It comprises two components: (1) Object-Centric Semantic Grounding, which adds object-level fine-grained annotations to 3DGS; and (2) Physics-Aware Execution Jointing, which embeds collision objects into 3DGS and constructs rich physical interfaces. We release InteriorGS, containing 1K object-annotated 3DGS indoor scene data, and introduce SAGE-Bench, the first 3DGS-based VLN benchmark with 2M VLN data. Experiments show that 3DGS scene data is more difficult to converge, while exhibiting strong generalizability, improving baseline performance by 31% on the VLN-CE Unseen task. The data and code will be available soon.

[69] FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning

Lu Zhang,Jiazuo Yu,Haomiao Xiong,Ping Hu,Yunzhi Zhuge,Huchuan Lu,You He

Main category: cs.CV

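TL;DR: Proposes FineRS, a two-stage multi-modal large language model framework for jointly reasoning about and segmenting extremely small objects in high-resolution scenes, improving performance through a coarse-to-fine strategy and a reinforcement learning mechanism.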

Details Motivation: Because of restricted input resolution, existing multi-modal large language models struggle to understand fine visual details in high-resolution images, especially extremely small objects embedded in cluttered backgrounds. Method: Proposes the FineRS framework with two stages, Global Semantic Exploration (GSE) and Localized Perceptual Refinement (LPR), and introduces a locate-informed retrospective reward that uses LPR outputs to optimize GSE, enabling end-to-end joint reasoning and segmentation. Result: Experiments on the self-built FineRS-4k dataset and public datasets show that the method outperforms state-of-the-art MLLM-based approaches on both instruction-guided segmentation and visual reasoning. Conclusion: With its coarse-to-fine two-stage design and reinforcement learning strategy, FineRS improves how well multi-modal large language models understand and localize extremely small objects in high-resolution images. Abstract: Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities across a wide range of vision-language tasks. However, due to the restricted input resolutions, MLLMs face significant challenges in precisely understanding and localizing visual details in high-resolution images -- particularly when dealing with extra-small objects embedded in cluttered contexts. To address this issue, we propose FineRS, a two-stage MLLM-based reinforcement learning framework for jointly reasoning and segmenting extremely small objects within high-resolution scenes. FineRS adopts a coarse-to-fine pipeline comprising Global Semantic Exploration (GSE) and Localized Perceptual Refinement (LPR). Specifically, GSE performs instruction-guided reasoning to generate a textual response and a coarse target region, while LPR refines this region to produce an accurate bounding box and segmentation mask. To couple the two stages, we introduce a locate-informed retrospective reward, where LPR's outputs are used to optimize GSE for more robust coarse region exploration. Additionally, we present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation on subtle, small-scale targets in complex high-resolution scenes. Experimental results on FineRS-4k and public datasets demonstrate that our method consistently outperforms state-of-the-art MLLM-based approaches on both instruction-guided segmentation and visual reasoning tasks.

[70] VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

Shufan Shen,Junshu Sun,Qingming Huang,Shuhui Wang

Main category: cs.CV

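TL;DR: Proposes VL-SAE, a sparse autoencoder for interpreting and enhancing cross-modal alignment in vision-language models. By mapping multi-modal representations onto a unified concept set, it improves both interpretability and downstream task performance.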

Details Motivation: The alignment of cross-modal representations in existing vision-language models is poorly interpretable because there has been no effective way to map multi-modal semantics into a unified concept set. Method: Designs VL-SAE with a distance-based encoder and two modality-specific decoders; during self-supervised training, cosine similarity measures the semantic similarity of multi-modal representations, and semantically similar representations are encouraged to produce consistent neuron activations, establishing the neuron-concept correlation. Result: Experiments on multiple VLMs, including CLIP and LLaVA, validate VL-SAE for interpreting and enhancing vision-language alignment, with performance gains on downstream tasks such as zero-shot image classification and hallucination elimination. Conclusion: VL-SAE disentangles and interprets the semantic concepts behind vision-language alignment and strengthens models through concept-level alignment, opening a new route to interpretability for multi-modal models. Abstract: The alignment of vision-language representations endows current Vision-Language Models (VLMs) with strong multi-modal reasoning capabilities. However, the interpretability of the alignment component remains uninvestigated due to the difficulty in mapping the semantics of multi-modal representations into a unified concept set. To address this problem, we propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations. Each neuron in its hidden layer correlates to a concept represented by semantically similar images and texts, thereby interpreting these representations with a unified concept set. To establish the neuron-concept correlation, we encourage semantically similar representations to exhibit consistent neuron activations during self-supervised training. First, to measure the semantic similarity of multi-modal representations, we perform their alignment in an explicit form based on cosine similarity. Second, we construct the VL-SAE with a distance-based encoder and two modality-specific decoders to ensure the activation consistency of semantically similar representations. Experiments across multiple VLMs (e.g., CLIP, LLaVA) demonstrate the superior capability of VL-SAE in interpreting and enhancing the vision-language alignment. For interpretation, the alignment between vision and language representations can be understood by comparing their semantics with concepts.
For enhancement, the alignment can be strengthened by aligning vision-language representations at the concept level, contributing to performance improvements in downstream tasks, including zero-shot image classification and hallucination elimination. Code is available at https://github.com/ssfgunner/VL-SAE.
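The cosine-similarity measure mentioned in the abstract can be illustrated with a minimal sketch. The function name `cosine_similarity` and the toy vectors are illustrative assumptions, not taken from the VL-SAE code base; the sketch only shows how the similarity between an image embedding and a text embedding might be scored.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings, for illustration only.
image_embedding = [0.2, 0.7, 0.1]
text_embedding = [0.4, 1.4, 0.2]  # same direction, different scale
print(cosine_similarity(image_embedding, text_embedding))
```

Because the score depends only on direction, rescaling either embedding leaves it unchanged, which is one reason cosine similarity is a common choice for comparing representations from different modalities.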

[71] Morphologically Intelligent Perturbation Prediction with FORM

Reed Naidoo,Matt De Vries,Olga Fourkioti,Vicky Bousgouni,Mar Arias-Garcia,Maria Portillo-Malumbres,Chris Bakal

Main category: cs.CV

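TL;DR: This paper proposes FORM, a machine learning framework for predicting perturbation-induced changes in 3D cellular structure. It combines a morphology encoder with a diffusion-based perturbation trajectory module, is trained on large-scale 3D cell data, supports unconditional generation and conditional simulation, and is accompanied by the MorphoEval suite for quantifying morphological change.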

Details Motivation: Existing approaches to modelling cellular responses are restricted to 2D representations and struggle to capture the complexity of cell morphology under perturbation, which holds back accurate virtual cell models. Method: FORM comprises a morphology encoder, trained via a multi-channel VQGAN to learn compact 3D representations of cell shape, and a diffusion-based perturbation trajectory module that models how morphology evolves across perturbation conditions; it is trained on over 65,000 multi-fluorescence 3D cell volumes. Result: FORM supports unconditional morphology generation, conditional simulation of perturbed states, prediction of downstream signalling activity, simulation of combinatorial perturbation effects, and prediction of morphodynamic transitions between unseen perturbations; MorphoEval quantifies morphological change along structural, statistical, and biological dimensions. Conclusion: Together, FORM and MorphoEval work toward the 3D virtual cell by linking morphology, perturbation, and function through high-resolution predictive simulation. Abstract: Understanding how cells respond to external stimuli is a central challenge in biomedical research and drug development. Current computational frameworks for modelling cellular responses remain restricted to two-dimensional representations, limiting their capacity to capture the complexity of cell morphology under perturbation. This dimensional constraint poses a critical bottleneck for the development of accurate virtual cell models. Here, we present FORM, a machine learning framework for predicting perturbation-induced changes in three-dimensional cellular structure. FORM consists of two components: a morphology encoder, trained end-to-end via a novel multi-channel VQGAN to learn compact 3D representations of cell shape, and a diffusion-based perturbation trajectory module that captures how morphology evolves across perturbation conditions. Trained on a large-scale dataset of over 65,000 multi-fluorescence 3D cell volumes spanning diverse chemical and genetic perturbations, FORM supports both unconditional morphology synthesis and conditional simulation of perturbed cell states. Beyond generation, FORM can predict downstream signalling activity, simulate combinatorial perturbation effects, and model morphodynamic transitions between states of unseen perturbations. To evaluate performance, we introduce MorphoEval, a benchmarking suite that quantifies perturbation-induced morphological changes in structural, statistical, and biological dimensions.
Together, FORM and MorphoEval work toward the realisation of the 3D virtual cell by linking morphology, perturbation, and function through high-resolution predictive simulation.

[72] CT-CLIP: A Multi-modal Fusion Framework for Robust Apple Leaf Disease Recognition in Complex Environments

Lemin Liu,Fangchao Hu,Honghua Jiang,Yaru Chen,Limin Liu,Yongliang Qiao

Main category: cs.CV

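TL;DR: Proposes CT-CLIP, a multi-branch CNN-Transformer-CLIP framework for apple leaf disease recognition in complex orchard environments. It combines local detail and global structural features, adds an adaptive fusion module and multimodal image-text learning, and markedly improves recognition accuracy under few-shot conditions.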

Details Motivation: Traditional multi-scale feature fusion methods struggle with the phenotypic heterogeneity and lesion diversity of apple leaf diseases and model the relationship between local and global features inadequately, limiting accuracy in complex environments. Method: Proposes CT-CLIP: a CNN extracts local lesion detail features, a Vision Transformer captures global structural relationships, and an Adaptive Feature Fusion Module (AFFM) fuses them dynamically; multimodal image-text learning based on pre-trained CLIP aligns visual features with semantic disease descriptions. Result: Achieves accuracies of 97.38% and 96.12% on a public apple disease dataset and a self-built dataset, respectively, outperforming several baselines, particularly with complex backgrounds and under few-shot conditions. Conclusion: CT-CLIP fuses local and global features and adds multimodal semantic alignment, markedly improving apple disease recognition in complex environments and offering a practical approach to automated agricultural disease recognition. Abstract: In complex orchard environments, the phenotypic heterogeneity of different apple leaf diseases, characterized by significant variation among lesions, poses a challenge to traditional multi-scale feature fusion methods. These methods only integrate multi-layer features extracted by convolutional neural networks (CNNs) and fail to adequately account for the relationships between local and global features. Therefore, this study proposes a multi-branch recognition framework named CNN-Transformer-CLIP (CT-CLIP). The framework synergistically employs a CNN to extract local lesion detail features and a Vision Transformer to capture global structural relationships. An Adaptive Feature Fusion Module (AFFM) then dynamically fuses these features, achieving optimal coupling of local and global information and effectively addressing the diversity in lesion morphology and distribution. Additionally, to mitigate interference from complex backgrounds and significantly enhance recognition accuracy under few-shot conditions, this study proposes a multimodal image-text learning approach. By leveraging pre-trained CLIP weights, it achieves deep alignment between visual features and disease semantic descriptions. Experimental results show that CT-CLIP achieves accuracies of 97.38% and 96.12% on a publicly available apple disease dataset and a self-built dataset, respectively, outperforming several baseline methods.
The proposed CT-CLIP demonstrates strong capabilities in recognizing agricultural diseases, significantly enhances identification accuracy under complex environmental conditions, and provides an innovative and practical solution for automated disease recognition in agricultural applications.

[73] Dynamic Semantic-Aware Correlation Modeling for UAV Tracking

Xinyu Zhou,Tongxin Pan,Lingyi Hong,Pinxue Guo,Haijing Guo,Zhaoyu Chen,Kaixun Jiang,Wenqiang Zhang

Main category: cs.CV

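TL;DR: Proposes a UAV tracking framework with dynamic semantic-aware correlation modeling. A Dynamic Semantic Relevance Generator, combined with the Transformer correlation map, improves semantic awareness and thereby localization accuracy and robustness in challenging scenes; a pruning method increases speed, allowing flexible trade-offs between speed and accuracy.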

Details Motivation: Existing UAV tracking methods focus heavily on speed and lack semantic awareness, which leads to poor performance under challenges such as camera motion, fast motion, and low resolution. Method: Proposes a Dynamic Semantic Relevance Generator that, combined with the Transformer correlation map, mines semantic relevance; designs a pruning method to speed up inference; and builds several model variants that balance speed and accuracy. Result: Achieves competitive performance on multiple UAV tracking datasets, supporting the method's accuracy and robustness. Conclusion: The method improves semantic awareness in UAV tracking, yielding higher accuracy and robustness under typical challenges, while the model variants allow flexible deployment under different compute budgets. Abstract: UAV tracking can be widely applied in scenarios such as disaster rescue, environmental monitoring, and logistics transportation. However, existing UAV tracking methods predominantly emphasize speed and lack exploration in semantic awareness, which hinders the search region from extracting accurate localization information from the template. This limitation results in suboptimal performance under typical UAV tracking challenges such as camera motion, fast motion, and low resolution. To address this issue, we propose a dynamic semantic-aware correlation modeling tracking framework. The core of our framework is a Dynamic Semantic Relevance Generator, which, in combination with the correlation map from the Transformer, explores semantic relevance. The approach enhances the search region's ability to extract important information from the template, improving accuracy and robustness under the aforementioned challenges. Additionally, to enhance the tracking speed, we design a pruning method for the proposed framework. Therefore, we present multiple model variants that achieve trade-offs between speed and accuracy, enabling flexible deployment according to the available computational resources. Experimental results validate the effectiveness of our method, achieving competitive performance on multiple UAV tracking datasets. The code is available at https://github.com/zxyyxzz/DSATrack.

[74] Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding

Anupam Pani,Yanchao Yang

Main category: cs.CV

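TL;DR: Proposes a gaze-regularized framework that uses human gaze signals during training to enhance vision-language models (VLMs) on egocentric understanding tasks, markedly improving future event prediction and current activity understanding.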

Details Motivation: To use eye gaze, an important cue for attention and behavioral intent, to improve the predictive ability and robustness of vision-language models in egocentric video understanding. Method: Proposes a gaze-regularized attention mechanism that uses gaze data only during training and aligns the regions the model attends to with human visual gaze; the method adapts to a range of attention-based VLMs. Result: Improves semantic prediction scores by up to 11 on fine-grained future event prediction and by about 7 on current activity understanding, clearly outperforming baselines trained without gaze regularization. Conclusion: Training with human gaze signals strengthens VLM understanding in egocentric scenarios and lays a foundation for applications such as assistive robots and human-machine collaboration. Abstract: Eye gaze offers valuable cues about attention, short-term intent, and future actions, making it a powerful signal for modeling egocentric behavior. In this work, we propose a gaze-regularized framework that enhances VLMs for two key egocentric understanding tasks: fine-grained future event prediction and current activity understanding. Unlike prior approaches that rely solely on visual inputs or use gaze as an auxiliary input signal, our method uses gaze only during training. We introduce a gaze-regularized attention mechanism that aligns model focus with human visual gaze. This design is flexible and modular, allowing it to generalize across multiple VLM architectures that utilize attention. Experimental results show that our approach improves semantic prediction scores by up to 11 for future event prediction and around 7 for current activity understanding, compared to the corresponding baseline models trained without gaze regularization. These results highlight the value of gaze-guided training in improving the accuracy and robustness of egocentric VLMs. Overall, this work establishes a foundation for using human gaze to enhance the predictive capabilities of VLMs in real-world scenarios like assistive robots and human-machine collaboration. Code and additional information are available at: https://github.com/anupampani/Gaze-VLM
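One way to picture a training-time penalty that aligns model attention with human gaze is a divergence between the two distributions. The abstract does not state the form of the regularizer, so the KL divergence below, the function names, and the toy patch weights are all assumptions made for illustration rather than the authors' formulation.

```python
import math


def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def gaze_attention_kl(gaze_heatmap, attention, eps=1e-8):
    """KL(gaze || attention) over image patches (assumed regularizer).

    Both inputs are flat lists of non-negative per-patch weights; `eps`
    guards against log(0) when the model ignores a fixated patch.
    """
    if len(gaze_heatmap) != len(attention):
        raise ValueError("inputs must cover the same patches")
    p = normalize(gaze_heatmap)
    q = normalize(attention)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Hypothetical 4-patch example: gaze concentrates on patch 1.
gaze = [0.1, 0.7, 0.1, 0.1]
aligned_attention = [0.1, 0.7, 0.1, 0.1]
misaligned_attention = [0.7, 0.1, 0.1, 0.1]
print(gaze_attention_kl(gaze, aligned_attention))     # no penalty when aligned
print(gaze_attention_kl(gaze, misaligned_attention))  # positive penalty
```

A term like this would be added to the task loss during training only, which matches the abstract's statement that gaze is not needed at inference time.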

[75] Why Registration Quality Matters: Enhancing sCT Synthesis with IMPACT-Based Registration

Valentin Boussot,Cédric Hémon,Jean-Claude Nunes,Jean-Louis Dillenseger

Main category: cs.CV

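TL;DR: This paper presents a 2.5D U-Net++ model built with the KonfAI framework for generating synthetic CT (sCT) from MRI and CBCT, trained with an L1 loss combined with IMPACT-Synth, a perceptual loss based on SAM and TotalSegmentator, and compares how two registration methods affect sCT synthesis.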

Details Motivation: Registration errors affect both training and evaluation in supervised learning, inflating performance metrics while degrading anatomical fidelity. The work aims to improve the anatomical consistency and robustness of sCT synthesis through better registration and loss design. Method: Uses a 2.5D U-Net++ with a ResNet-34 encoder, trained jointly across anatomical regions and fine-tuned per region; AdamW optimizer on normalized, body-masked patches with random flipping as the only augmentation; the loss combines pixel-wise L1 with the IMPACT-Synth perceptual loss; final predictions use test-time augmentation and five-fold ensembling; compares Elastix (mutual information) with IMPACT (feature-based) registration. Result: On local test sets, IMPACT-based registration gives higher anatomical consistency and lower MAE than mutual-information-based registration; on the public validation set, however, models trained on Elastix-aligned data score higher, revealing a registration bias in the evaluation pipeline. Conclusion: Registration strategy strongly influences both training and evaluation of sCT synthesis models; by promoting anatomically consistent alignment, IMPACT helps mitigate this bias and supports more robust and generalizable models. Abstract: We participated in the SynthRAD2025 challenge (Tasks 1 and 2) with a unified pipeline for synthetic CT (sCT) generation from MRI and CBCT, implemented using the KonfAI framework. Our model is a 2.5D U-Net++ with a ResNet-34 encoder, trained jointly across anatomical regions and fine-tuned per region. The loss function combined pixel-wise L1 loss with IMPACT-Synth, a perceptual loss derived from SAM and TotalSegmentator to enhance structural fidelity. Training was performed using AdamW (initial learning rate = 0.001, halved every 25k steps) on patch-based, normalized, body-masked inputs (320x320 for MRI, 256x256 for CBCT), with random flipping as the only augmentation. No post-processing was applied. Final predictions leveraged test-time augmentation and five-fold ensembling. The best model was selected based on validation MAE. Two registration strategies were evaluated: (i) Elastix with mutual information, consistent with the challenge pipeline, and (ii) IMPACT, a feature-based similarity metric leveraging pretrained segmentation networks. On the local test sets, IMPACT-based registration achieved more accurate and anatomically consistent alignments than mutual-information-based registration, resulting in improved sCT synthesis with lower MAE and more realistic anatomical structures. On the public validation set, however, models trained with Elastix-aligned data achieved higher scores, reflecting a registration bias favoring alignment strategies consistent with the evaluation pipeline.
This highlights how registration errors can propagate into supervised learning, influencing both training and evaluation, and potentially inflating performance metrics at the expense of anatomical fidelity. By promoting anatomically consistent alignment, IMPACT helps mitigate this bias and supports the development of more robust and generalizable sCT synthesis models.
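The learning-rate schedule quoted in the abstract (initial rate 0.001, halved every 25k steps) can be written out as a short sketch. The function name and the staircase behaviour at the boundary (the rate drops exactly at steps 25,000, 50,000, and so on) are assumptions; the abstract gives only the initial value and the halving interval.

```python
def step_halving_lr(step, initial_lr=1e-3, halve_every=25_000):
    """Learning rate at a given optimizer step under a halving staircase."""
    if step < 0:
        raise ValueError("step must be non-negative")
    # Integer division counts how many full halving intervals have elapsed.
    return initial_lr * 0.5 ** (step // halve_every)


for step in (0, 24_999, 25_000, 50_000, 100_000):
    print(step, step_halving_lr(step))
```

Under these assumptions the rate after 100k steps is 0.001 / 16 = 0.0000625, which gives a sense of how quickly the step size shrinks over a long run.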

[76] BADiff: Bandwidth Adaptive Diffusion Model

Xi Zhang,Hanwei Zhu,Yan Zhong,Jiamang Wang,Weisi Lin

Main category: cs.CV

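TL;DR: Proposes a diffusion model framework that adapts generation quality to real-time bandwidth constraints. End-to-end training conditions the denoising process on bandwidth, enabling early-stop sampling that keeps good visual quality at low bandwidth.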

Details Motivation: Conventional diffusion models run a fixed number of denoising steps and ignore downstream transmission limits; under limited bandwidth this wastes computation and degrades image quality, which is a poor fit for practical cloud-to-device scenarios. Method: Introduces a joint end-to-end training strategy that conditions the diffusion model on a target quality level derived from the available bandwidth, with a lightweight quality embedding guiding the denoising trajectory, so that the denoising process is adaptively modulated and sampling can stop early. Result: Experiments show that, compared with naive early stopping, the method produces images with higher visual fidelity across bandwidth conditions while requiring only minimal architectural changes. Conclusion: The method offers an effective approach to efficient image delivery in bandwidth-constrained environments, improving generation efficiency while preserving perceptual quality. Abstract: In this work, we propose a novel framework to enable diffusion models to adapt their generation quality based on real-time network bandwidth constraints. Traditional diffusion models produce high-fidelity images by performing a fixed number of denoising steps, regardless of downstream transmission limitations. However, in practical cloud-to-device scenarios, limited bandwidth often necessitates heavy compression, leading to loss of fine textures and wasted computation. To address this, we introduce a joint end-to-end training strategy where the diffusion model is conditioned on a target quality level derived from the available bandwidth. During training, the model learns to adaptively modulate the denoising process, enabling early-stop sampling that maintains perceptual quality appropriate to the target transmission condition. Our method requires minimal architectural changes and leverages a lightweight quality embedding to guide the denoising trajectory. Experimental results demonstrate that our approach significantly improves the visual fidelity of bandwidth-adapted generations compared to naive early-stopping, offering a promising solution for efficient image delivery in bandwidth-constrained environments. Code is available at: https://github.com/xzhang9308/BADiff.

[77] TerraGen: A Unified Multi-Task Layout Generation Framework for Remote Sensing Data Augmentation

Datao Tang,Hao Wang,Yudeng Xin,Hui Qiao,Dongsheng Jiang,Yin Li,Zhiheng Yu,Xiangyong Cao

Main category: cs.CV

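TL;DR: Proposes TerraGen, a unified layout-to-image generation framework for data augmentation across multiple remote sensing vision tasks, with spatial control and modeling of geographic information.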

Details Motivation: Existing generative models handle remote sensing tasks in isolation and do not model geospatial information or share across tasks. Method: Designs a geographic-spatial layout encoder that unifies bounding box and segmentation mask inputs, combined with multi-scale injection and a mask-weighted loss to model spatial constraints. Result: Validated on a large-scale multi-task dataset of 45k images, it achieves the best generated image quality and significantly improves downstream task performance. Conclusion: TerraGen generalizes well across tasks and can serve as a universal remote sensing data augmentation tool, performing well in both full-data and few-shot settings. Abstract: Remote sensing vision tasks require extensive labeled data across multiple, interconnected domains. However, current generative data augmentation frameworks are task-isolated, i.e., each vision task requires training an independent generative model, and ignore the modeling of geographical information and spatial constraints. To address these issues, we propose TerraGen, a unified layout-to-image generation framework that enables flexible, spatially controllable synthesis of remote sensing imagery for various high-level vision tasks, e.g., detection, segmentation, and extraction. Specifically, TerraGen introduces a geographic-spatial layout encoder that unifies bounding box and segmentation mask inputs, combined with a multi-scale injection scheme and mask-weighted loss to explicitly encode spatial constraints, from global structures to fine details. Also, we construct the first large-scale multi-task remote sensing layout generation dataset containing 45k images and establish a standardized evaluation protocol for this task. Experimental results show that TerraGen achieves the best image generation quality across diverse tasks. Additionally, TerraGen can be used as a universal data-augmentation generator, enhancing downstream task performance significantly and demonstrating robust cross-task generalisation in both full-data and few-shot scenarios.

[78] Depth-Supervised Fusion Network for Seamless-Free Image Stitching

Zhiying Jiang,Ruhao Yan,Zengxi Zhang,Bowei Zhang,Jinyuan Liu

Main category: cs.CV

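TL;DR: Proposes a depth-consistency-constrained seamless image stitching method. A multi-stage alignment mechanism with global depth regularization improves cross-view alignment accuracy, graph-based optimization with soft-seam diffusion yields natural fusion, and a reparameterization strategy improves efficiency.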

Details Motivation: To handle the large parallax caused by variations in object depth and avoid ghosting and misalignment in stitched images. Method: Uses a multi-stage alignment mechanism with global depth regularization constraints to improve alignment accuracy; in the fusion stage, finds an optimal seam via graph-based optimization and diffuses a soft-seam region to mitigate parallax-induced misalignment; adds a reparameterization strategy to reduce computational overhead. Result: Experiments show that the method surpasses existing methods in stitching quality and naturalness while also running more efficiently. Conclusion: The method addresses image stitching under large parallax, maintaining high performance while markedly improving computational efficiency, and suits seamless stitching of complex scenes. Abstract: Image stitching synthesizes images captured from multiple perspectives into a single image with a broader field of view. The significant variations in object depth often lead to large parallax, resulting in ghosting and misalignment in the stitched results. To address this, we propose a depth-consistency-constrained seamless-free image stitching method. First, to tackle the multi-view alignment difficulties caused by parallax, a multi-stage mechanism combined with global depth regularization constraints is developed to enhance the alignment accuracy of the same apparent target across different depth ranges. Second, during the multi-view image fusion process, an optimal stitching seam is determined through graph-based low-cost computation, and a soft-seam region is diffused to precisely locate transition areas, thereby effectively mitigating alignment errors induced by parallax and achieving natural and seamless stitching results. Furthermore, considering the computational overhead in the shift regression process, a reparameterization strategy is incorporated to optimize the structural design, significantly improving algorithm efficiency while maintaining optimal performance. Extensive experiments demonstrate the superior performance of the proposed method against the existing methods. Code is available at https://github.com/DLUT-YRH/DSFN.

[79] Head Pursuit: Probing Attention Specialization in Multimodal Transformers

Lorenzo Basile,Valentino Maiorca,Diego Doimo,Francesco Locatello,Alberto Cazzaniga

Main category: cs.CV

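TL;DR: This study reinterprets the probing of intermediate activations through a signal processing lens, reveals how attention heads in language and vision-language models specialize in semantic or visual attributes, and shows that editing as few as 1% of the heads can reliably steer specific concepts in generated output.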

Details Motivation: To understand the internal mechanisms of language and multimodal models, in particular how attention heads specialize in specific semantic or visual attributes. Method: Building on an established interpretability method, reinterprets probing intermediate activations with the final decoding layer from a signal processing perspective, analyzing multiple samples to systematically score and rank attention heads by relevance to target concepts. Result: Finds consistent head-level specialization in both unimodal and multimodal transformers; editing about 1% of the heads is enough to suppress or enhance target concepts in model output, validated on question answering, toxicity mitigation, image classification, and captioning. Conclusion: Attention layers have an interpretable and controllable structure, which provides simple and effective tools for understanding and editing large-scale generative models. Abstract: Language and vision-language models have shown impressive performance across a wide range of tasks, but their internal mechanisms remain only partly understood. In this work, we study how individual attention heads in text-generative models specialize in specific semantic or visual attributes. Building on an established interpretability method, we reinterpret the practice of probing intermediate activations with the final decoding layer through the lens of signal processing. This lets us analyze multiple samples in a principled way and rank attention heads based on their relevance to target concepts. Our results show consistent patterns of specialization at the head level across both unimodal and multimodal transformers. Remarkably, we find that editing as few as 1% of the heads, selected using our method, can reliably suppress or enhance targeted concepts in the model output. We validate our approach on language tasks such as question answering and toxicity mitigation, as well as vision-language tasks including image classification and captioning. Our findings highlight an interpretable and controllable structure within attention layers, offering simple tools for understanding and editing large-scale generative models.
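The selection step described above (rank heads by relevance to a concept, then edit roughly the top 1%) can be sketched as follows. How the relevance scores are computed is the paper's contribution and is not reproduced here; the `relevance` dictionary, the function name, and the synthetic scores are placeholders assumed for illustration.

```python
def select_top_heads(relevance, fraction=0.01):
    """Return the (layer, head) pairs with the highest relevance scores.

    `relevance` maps (layer, head) to a score; at least one head is kept
    whenever any scores are provided.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(fraction * len(relevance)))
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    return ranked[:k]


# Hypothetical model with 10 layers of 20 heads and made-up scores.
scores = {(layer, head): float(layer * 20 + head)
          for layer in range(10) for head in range(20)}
print(select_top_heads(scores))  # 1% of 200 heads: the 2 highest-scoring
```

The returned pairs would then be the targets of whatever intervention is applied, for example scaling or zeroing those heads' outputs to suppress a concept.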

[80] MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence

Yue Feng,Jinwei Hu,Qijia Lu,Jiawei Niu,Li Tan,Shuo Yuan,Ziyi Yan,Yizhen Jia,Qingzhi He,Shiping Ge,Ethan Q. Chen,Wentong Li,Limin Wang,Jie Qin

Main category: cs.CV

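TL;DR: This paper proposes the multi-modal untrimmed video retrieval task and a new benchmark, MUVR, which aims to retrieve untrimmed videos containing relevant segments from long-video platforms using multi-modal queries (long text, tags, and mask prompts). MUVR features a practical retrieval paradigm, multi-level visual correspondence, and comprehensive evaluation criteria, and contains 53K untrimmed videos with several query types. Experiments across many models reveal the limitations of existing methods on untrimmed videos and multi-modal queries.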

Details Motivation: Existing video retrieval tasks do not meet long-video platforms' need for fine-grained, multi-modal queries, and lack sound matching definitions and evaluation for untrimmed videos, so a benchmark closer to real applications is needed. Method: Proposes the MUVR benchmark, which supports multi-modal queries based on long text descriptions, video tags, and mask prompts; builds visual correspondence at six levels (copy, event, scene, instance, action, and others); and provides three versions (Base, Filter, QA) for evaluating different models, plus a Reranking Score for the reranking ability of MLLMs. The dataset contains 53K untrimmed videos from Bilibili, 1,050 multi-modal queries, and 84K matches. Result: Evaluation of 3 state-of-the-art video retrieval models, 6 image-based VLMs, and 10 MLLMs on MUVR shows that existing methods are limited on untrimmed videos and multi-modal queries, and that MLLMs fall short in multi-video understanding and reranking. Conclusion: MUVR provides a new benchmark for multi-modal untrimmed video retrieval on long-video platforms and exposes weaknesses of current models in complex query understanding, video content matching, and reranking, pointing to directions for future research. Abstract: We propose the Multi-modal Untrimmed Video Retrieval task, along with a new benchmark (MUVR) to advance video retrieval for long-video platforms. MUVR aims to retrieve untrimmed videos containing relevant segments using multi-modal queries. It has the following features: 1) Practical retrieval paradigm: MUVR supports video-centric multi-modal queries, expressing fine-grained retrieval needs through long text descriptions, video tag prompts, and mask prompts. It adopts a one-to-many retrieval paradigm and focuses on untrimmed videos, tailored for long-video platform applications. 2) Multi-level visual correspondence: To cover common video categories (e.g., news, travel, dance) and precisely define retrieval matching criteria, we construct multi-level visual correspondence based on core video content (e.g., news events, travel locations, dance moves) which users are interested in and want to retrieve. It covers six levels: copy, event, scene, instance, action, and others. 3) Comprehensive evaluation criteria: We develop 3 versions of MUVR (i.e., Base, Filter, QA). MUVR-Base/Filter evaluates retrieval models, while MUVR-QA assesses MLLMs in a question-answering format. We also propose a Reranking Score to evaluate the reranking ability of MLLMs. MUVR consists of 53K untrimmed videos from the video platform Bilibili, with 1,050 multi-modal queries and 84K matches. Extensive evaluations of 3 state-of-the-art video retrieval models, 6 image-based VLMs, and 10 MLLMs are conducted.
MUVR reveals the limitations of retrieval methods in processing untrimmed videos and multi-modal queries, as well as MLLMs in multi-video understanding and reranking. Our code and benchmark are available at https://github.com/debby-0527/MUVR.

[81] Bridging the gap to real-world language-grounded visual concept learning

Whie Jung,Semin Kim,Junee Kim,Seunghoon Hong

Main category: cs.CV

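TL;DR: Proposes a scalable framework that adaptively identifies image-related concept axes in real-world scenes and grounds visual concepts along them, without predefined concepts or additional per-concept model parameters, and shows superior editing and compositional generalization on several datasets.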

Details Motivation: Existing language-grounded visual concept learning methods are limited to a few predefined primitive axes (such as color and shape) and are mostly studied on synthetic data, so they fail to capture the rich variety of visual concepts in the real world. Method: Uses a pretrained vision-language model and a universal prompting strategy to discover image-related semantic axes automatically; a universal concept encoder adaptively binds visual features to these axes, and a compositional anchoring objective is optimized so that each axis can be manipulated independently. Result: Validated on subsets of ImageNet, CelebA-HQ, and AFHQ, the method shows superior editing across diverse real-world concepts and outperforms existing visual concept learning and text-based editing methods in compositional generalization. Conclusion: The framework discovers and grounds rich visual concepts without prior knowledge or additional parameters, and shows good scalability and practical potential. Abstract: Human intelligence effortlessly interprets visual scenes along a rich spectrum of semantic dimensions. However, existing approaches to language-grounded visual concept learning are limited to a few predefined primitive axes, such as color and shape, and are typically explored in synthetic datasets. In this work, we propose a scalable framework that adaptively identifies image-related concept axes and grounds visual concepts along these axes in real-world scenes. Leveraging a pretrained vision-language model and our universal prompting strategy, our framework identifies diverse image-related axes without any prior knowledge. Our universal concept encoder adaptively binds visual features to the discovered axes without introducing additional model parameters for each concept. To ground visual concepts along the discovered axes, we optimize a compositional anchoring objective, which ensures that each axis can be independently manipulated without affecting others. We demonstrate the effectiveness of our framework on subsets of ImageNet, CelebA-HQ, and AFHQ, showcasing superior editing capabilities across diverse real-world concepts that are too varied to be manually predefined. Our method also exhibits strong compositional generalization, outperforming existing visual concept learning and text-based editing methods. The code is available at https://github.com/whieya/Language-grounded-VCL.

[82] ArtiLatent: Realistic Articulated 3D Object Generation via Structured Latents

Honghua Chen,Yushi Lan,Yongwei Chen,Xingang Pan

Main category: cs.CV

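TL;DR: Proposes ArtiLatent, a generative framework for synthesizing human-made 3D objects with fine-grained geometry, accurate articulation, and realistic appearance.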

Details Motivation: Existing methods for generating articulated 3D objects struggle to deliver geometric detail, articulation accuracy, and visual realism at the same time, and fall short in particular on part occlusion and motion-dependent visibility changes. Method: A variational autoencoder embeds sparse voxel representations and articulation properties (joint type, axis, origin, range, part category) into a unified latent space, over which a latent diffusion model is trained for diverse and physically plausible sampling; an articulation-aware Gaussian decoder decodes appearance conditioned on articulation state to handle motion-induced visibility changes. Result: Experiments on the PartNet-Mobility and ACD datasets show that ArtiLatent outperforms existing methods in geometric consistency and appearance fidelity and generates more realistic articulated objects. Conclusion: ArtiLatent offers a scalable solution for articulated 3D object synthesis and manipulation, and substantially improves the geometric accuracy and visual realism of generated results. Abstract: We propose ArtiLatent, a generative framework that synthesizes human-made 3D objects with fine-grained geometry, accurate articulation, and realistic appearance. Our approach jointly models part geometry and articulation dynamics by embedding sparse voxel representations and associated articulation properties, including joint type, axis, origin, range, and part category, into a unified latent space via a variational autoencoder. A latent diffusion model is then trained over this space to enable diverse yet physically plausible sampling. To reconstruct photorealistic 3D shapes, we introduce an articulation-aware Gaussian decoder that accounts for articulation-dependent visibility changes (e.g., revealing the interior of a drawer when opened). By conditioning appearance decoding on articulation state, our method assigns plausible texture features to regions that are typically occluded in static poses, significantly improving visual realism across articulation configurations. Extensive experiments on furniture-like objects from PartNet-Mobility and ACD datasets demonstrate that ArtiLatent outperforms existing approaches in geometric consistency and appearance fidelity. Our framework provides a scalable solution for articulated 3D object synthesis and manipulation.

[83] Anisotropic Pooling for LUT-realizable CNN Image Restoration

Xi Zhang,Xiaolin Wu

Main category: cs.CV

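TL;DR: Proposes an anisotropic pooling strategy to improve look-up table (LUT) based convolutional neural network (CNN) image restoration, with clear gains over conventional average pooling.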

Details Motivation: Existing LUT-based CNN image restoration methods fuse the look-up results of small patches at different orientations by average pooling, which handles anisotropic signal structures poorly and limits restoration quality. Method: Introduces generalized median pooling, then learns data-dependent anisotropic pooling coefficients that adaptively weigh the contributions of differently oriented pixel patches. Result: Experiments on multiple image restoration benchmarks show that the method outperforms existing LUT-realizable CNN methods in both perceptual quality and numerical metrics. Conclusion: Anisotropic pooling effectively improves LUT-based CNNs on image restoration and offers a new direction for efficient, high-quality model design. Abstract: Table look-up realization of image restoration CNNs has the potential of achieving competitive image quality while being much faster and more resource-frugal than the straightforward CNN implementation. The main technical challenge facing the LUT-based CNN algorithm designers is to manage the table size without overly restricting the receptive field. The prevailing strategy is to reuse the table for small pixel patches of different orientations (apparently assuming a degree of isotropy) and then fuse the look-up results. The fusion is currently done by average pooling, which we find ill-suited to anisotropic signal structures. To alleviate the problem, we investigate and discuss anisotropic pooling methods to replace naive averaging for improving the performance of the current LUT-realizable CNN restoration methods. First, we introduce the method of generalized median pooling which leads to measurable gains over average pooling. We then extend this idea by learning data-dependent pooling coefficients for each orientation, so that they can adaptively weigh the contributions of differently oriented pixel patches. Experimental results on various restoration benchmarks show that our anisotropic pooling strategy yields both perceptually and numerically superior results compared to existing LUT-realizable CNN methods.
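
As a rough illustration of the pooling choices described in this abstract, the sketch below fuses look-up results from differently oriented patches in three ways. It is not the authors' implementation: the plain element-wise median stands in for their generalized median pooling, the softmax-normalized weights are an assumed form of the learned data-dependent coefficients, and all function names are hypothetical.

```python
import numpy as np

def average_pool(preds):
    """Baseline fusion: plain mean over the orientation axis (axis 0)."""
    return preds.mean(axis=0)

def median_pool(preds):
    """Stand-in for generalized median pooling: element-wise median."""
    return np.median(preds, axis=0)

def weighted_pool(preds, logits):
    """Assumed data-dependent fusion: softmax over orientations, then a weighted sum."""
    shifted = logits - logits.max(axis=0, keepdims=True)  # for numerical stability
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)
    return (weights * preds).sum(axis=0)

# Four orientations predict the same pixel; one orientation is an outlier.
preds = np.array([[1.0], [2.0], [9.0], [2.0]])
print(average_pool(preds))                         # pulled toward the outlier
print(median_pool(preds))                          # ignores the outlier
print(weighted_pool(preds, np.zeros_like(preds)))  # equal logits reduce to the mean
```

With equal logits the weighted variant reduces to average pooling, so any learned deviation from uniform weights is what lets it favor the orientations that match the local structure.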

[84] OpenHype: Hyperbolic Embeddings for Hierarchical Open-Vocabulary Radiance Fields

Lisa Weijler,Sebastian Koch,Fabio Poiesi,Timo Ropinski,Pedro Hermosilla

Main category: cs.CV

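TL;DR: Proposes OpenHype, a new method that represents scene hierarchies in a continuous hyperbolic latent space, using the properties of hyperbolic geometry to encode multi-scale relationships naturally for efficient, adaptive 3D scene understanding.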

Details Motivation: Existing methods that explicitly model hierarchies either incur long inference times or rely on predefined discrete hierarchies, and they generalize poorly to the complex, varied structures of the real world. Method: Models the hierarchical structure of 3D scenes with a continuous hyperbolic latent space on top of an implicit representation, using hyperbolic geometry to support multi-level feature encoding and traversal along geodesics in latent space. Result: Outperforms current state-of-the-art methods on standard benchmarks, with higher efficiency and better adaptability. Conclusion: OpenHype offers an efficient and flexible hierarchical modeling scheme for 3D scene understanding that overcomes the inference-speed and generalization limits of earlier methods. Abstract: Modeling the inherent hierarchical structure of 3D objects and 3D scenes is highly desirable, as it enables a more holistic understanding of environments for autonomous agents. Accomplishing this with implicit representations, such as Neural Radiance Fields, remains an unexplored challenge. Existing methods that explicitly model hierarchical structures often face significant limitations: they either require multiple rendering passes to capture embeddings at different levels of granularity, significantly increasing inference time, or rely on predefined, closed-set discrete hierarchies that generalize poorly to the diverse and nuanced structures encountered by agents in the real world. To address these challenges, we propose OpenHype, a novel approach that represents scene hierarchies using a continuous hyperbolic latent space. By leveraging the properties of hyperbolic geometry, OpenHype naturally encodes multi-scale relationships and enables smooth traversal of hierarchies through geodesic paths in latent space. Our method outperforms state-of-the-art approaches on standard benchmarks, demonstrating superior efficiency and adaptability in 3D scene understanding.
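
To make the appeal of hyperbolic geometry for hierarchies concrete, here is a minimal sketch of the geodesic distance in the Poincaré ball. The abstract does not say which model of hyperbolic space OpenHype uses, so the Poincaré ball and the function name `poincare_distance` are assumptions chosen for illustration.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    def sq_norm(x):
        return sum(c * c for c in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)

# Distances grow rapidly near the boundary, which leaves room for many
# fine-grained "leaf" embeddings far from a coarse "root" near the origin.
print(poincare_distance((0.0, 0.0), (0.5, 0.0)))
print(poincare_distance((0.0, 0.0), (0.99, 0.0)))
```

Under this model, coarse concepts can sit near the origin and finer ones toward the boundary, so walking along a geodesic moves between levels of granularity.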

[85] PhysWorld: From Real Videos to World Models of Deformable Objects via Physics-Aware Demonstration Synthesis

Yu Yang,Zhilu Zhang,Xiang Zhang,Yihan Zeng,Hui Li,Wangmeng Zuo

Main category: cs.CV

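TL;DR: Proposes PhysWorld, a framework that uses a simulator to generate physically consistent and diverse demonstrations, addressing the challenge of modeling deformable object dynamics when real-world video data is scarce.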

Details Motivation: Learning physics-consistent dynamics models from limited real-world video is especially hard for deformable objects with spatially varying physical properties. Method: Builds a physics-consistent digital twin in an MPM simulator via constitutive model selection and global-to-local optimization of physical properties; applies part-aware perturbations to generate diverse motion patterns; trains a lightweight GNN-based world model on the synthetic data, and uses real video to further refine the physical properties. Result: PhysWorld achieves accurate and fast future prediction for various deformable objects, runs inference 47 times faster than the recent state-of-the-art method PhysTwin, and generalizes well to novel interactions. Conclusion: By combining physical simulation with lightweight neural modeling, PhysWorld effectively addresses deformable object dynamics modeling under data scarcity and performs well in accuracy, speed, and generalization. Abstract: Interactive world models that simulate object dynamics are crucial for robotics, VR, and AR. However, it remains a significant challenge to learn physics-consistent dynamics models from limited real-world video data, especially for deformable objects with spatially-varying physical properties. To overcome the challenge of data scarcity, we propose PhysWorld, a novel framework that utilizes a simulator to synthesize physically plausible and diverse demonstrations to learn efficient world models. Specifically, we first construct a physics-consistent digital twin within an MPM simulator via constitutive model selection and global-to-local optimization of physical properties. Subsequently, we apply part-aware perturbations to the physical properties and generate various motion patterns for the digital twin, synthesizing extensive and diverse demonstrations. Finally, using these demonstrations, we train a lightweight GNN-based world model that is embedded with physical properties. The real video can be used to further refine the physical properties. PhysWorld achieves accurate and fast future predictions for various deformable objects, and also generalizes well to novel interactions. Experiments show that PhysWorld has competitive performance while enabling inference speeds 47 times faster than the recent state-of-the-art method, i.e., PhysTwin.

[86] MoniTor: Exploiting Large Language Models with Instruction for Online Video Anomaly Detection

Shengtian Yang,Yue Feng,Yingshi Liu,Jingrou Zhang,Jie Qin

Main category: cs.CV

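TL;DR: Proposes MoniTor, a training-free, memory-based method for online video anomaly detection that combines vision-language models with an LSTM-inspired prediction mechanism and captures temporal dependencies dynamically through a scoring queue and an anomaly prior, improving online VAD performance.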

Details Motivation: Online video anomaly detection (online VAD) has received little attention because of real-time constraints and computational cost, while recent progress driven by large models has focused on offline settings and lacks effective modeling for real-time detection. Method: Proposes the MoniTor framework, which feeds streaming video input to pretrained vision-language models; introduces an LSTM-inspired prediction mechanism to model past states; and designs a scoring queue and an anomaly prior that dynamically store recent scores and cover the anomaly types of the monitoring scenario, guiding the large language model's temporal judgments. Result: On the two large-scale datasets UCF-Crime and XD-Violence, MoniTor outperforms existing unsupervised methods and is competitive with weakly supervised methods, without any training. Conclusion: MoniTor offers an effective solution for training-free online video anomaly detection; its memory mechanism and dynamic scoring strategy markedly improve recognition of anomalous behavior in complex real-world scenes. Abstract: Video Anomaly Detection (VAD) aims to locate unusual activities or behaviors within videos. Recently, offline VAD has garnered substantial research attention, which has been invigorated by the progress in large language models (LLMs) and vision-language models (VLMs), offering the potential for a more nuanced understanding of anomalies. However, online VAD has seldom received attention due to real-time constraints and computational intensity. In this paper, we introduce a novel Memory-based online scoring queue scheme for Training-free VAD (MoniTor), to address the inherent complexities in online VAD. Specifically, MoniTor applies a streaming input to VLMs, leveraging the capabilities of pre-trained large-scale models. To capture temporal dependencies more effectively, we incorporate a novel prediction mechanism inspired by Long Short-Term Memory (LSTM) networks. This ensures the model can effectively model past states and leverage previous predictions to identify anomalous behaviors, and thereby better understand the current frame. Moreover, we design a scoring queue and an anomaly prior to dynamically store recent scores and cover all anomalies in the monitoring scenario, providing guidance for LLMs to distinguish between normal and abnormal behaviors over time. We evaluate MoniTor on two large datasets (i.e., UCF-Crime and XD-Violence) containing various surveillance and real-world scenarios. The results demonstrate that MoniTor outperforms state-of-the-art methods and is competitive with weakly supervised methods without training. Code is available at https://github.com/YsTvT/MoniTor.
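
The scoring queue mentioned in this abstract can be pictured as a fixed-length buffer of recent anomaly scores. The sketch below is only a guess at the general idea, not MoniTor's actual design: the class name `ScoringQueue`, the window length, and the use of a running mean as temporal context are all hypothetical.

```python
from collections import deque

class ScoringQueue:
    """Fixed-length memory of recent per-frame anomaly scores (illustrative only)."""

    def __init__(self, maxlen=4):
        # deque with maxlen drops the oldest score automatically when full
        self.scores = deque(maxlen=maxlen)

    def update(self, score):
        """Store the newest score and return the mean over the current window."""
        self.scores.append(score)
        return sum(self.scores) / len(self.scores)

queue = ScoringQueue(maxlen=2)
for raw_score in (0.0, 1.0, 1.0):
    print(raw_score, queue.update(raw_score))
```

In an online setting, a summary of this window could be passed back to the model alongside the current frame so that a single noisy score does not flip the decision.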

[87] VidSplice: Towards Coherent Video Inpainting via Explicit Spaced Frame Guidance

Ming Xie,Junqiu Yu,Qiaole Dong,Xiangyang Xue,Yanwei Fu

Main category: cs.CV

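TL;DR: Proposes VidSplice, a framework that decouples video inpainting into two sub-tasks, multi-frame consistent image inpainting and masked-area motion propagation, and introduces spaced-frame priors to strengthen spatiotemporal consistency.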

Details Motivation: Existing methods built on image-to-video (I2V) priors perform poorly under severe content degradation and struggle to guarantee spatiotemporal stability, leaving the later part of the video insufficiently controlled. Method: Decomposes video inpainting into two sub-tasks, multi-frame consistent image inpainting and masked-area motion propagation; designs a CoSpliced Module that implements a first-frame propagation strategy and diffuses the initial frame content through a splicing mechanism; introduces a context controller module that encodes coherent priors and injects them into the I2V generative backbone to constrain content distortion during generation. Result: Experiments show that VidSplice is competitive across diverse video inpainting scenarios and significantly improves foreground alignment and motion stability over existing methods. Conclusion: By introducing spaced-frame priors and a splicing mechanism, VidSplice effectively strengthens spatiotemporal consistency in video inpainting and surpasses existing methods in content fidelity and motion stability. Abstract: Recent video inpainting methods often employ image-to-video (I2V) priors to model temporal consistency across masked frames. While effective in moderate cases, these methods struggle under severe content degradation and tend to overlook spatiotemporal stability, resulting in insufficient control over the latter parts of the video. To address these limitations, we decouple video inpainting into two sub-tasks: multi-frame consistent image inpainting and masked area motion propagation. We propose VidSplice, a novel framework that introduces spaced-frame priors to guide the inpainting process with spatiotemporal cues. To enhance spatial coherence, we design a CoSpliced Module that performs a first-frame propagation strategy, diffusing the initial frame content into subsequent reference frames through a splicing mechanism. Additionally, we introduce a delicate context controller module that encodes coherent priors after frame duplication and injects the spliced video into the I2V generative backbone, effectively constraining content distortion during generation. Extensive evaluations demonstrate that VidSplice achieves competitive performance across diverse video inpainting scenarios. Moreover, its design significantly improves both foreground alignment and motion stability, outperforming existing approaches.

[88] CXR-LanIC: Language-Grounded Interpretable Classifier for Chest X-Ray Diagnosis

Yiming Tang,Wenjia Zhong,Rushi Shah,Dianbo Liu

Main category: cs.CV

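TL;DR: Proposes CXR-LanIC, an interpretable classification framework that uses task-aligned pattern discovery to extract about 5,000 monosemantic visual patterns from a chest X-ray diagnostic model, enabling accurate and transparent AI diagnosis and supporting trustworthy decisions in clinical deployment.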

Details Motivation: Deep learning performs well in chest X-ray diagnosis, but its black-box nature limits clinical use. The paper aims to make such models interpretable so that clinicians can trust and verify AI diagnoses. Method: Trains transcoder-based sparse autoencoders on a BiomedCLIP diagnostic classifier, using an ensemble of 100 transcoders on multimodal embeddings from the MIMIC-CXR dataset, to decompose medical image representations into interpretable visual patterns that can be linked to natural language descriptions for explanation. Result: Discovers about 5,000 monosemantic visual patterns spanning cardiac, pulmonary, pleural, and other categories, each activating consistently on images that share specific radiological features; the model reaches competitive diagnostic accuracy on five key findings, and each prediction decomposes into 20-50 interpretable patterns with verifiable activation galleries. Conclusion: By extracting interpretable features that bear directly on clinical decisions from a classifier trained on specific diagnostic objectives, CXR-LanIC shows that medical AI systems can be both accurate and interpretable, which supports safe clinical deployment. Abstract: Deep learning models have achieved remarkable accuracy in chest X-ray diagnosis, yet their widespread clinical adoption remains limited by the black-box nature of their predictions. Clinicians require transparent, verifiable explanations to trust automated diagnoses and identify potential failure modes. We introduce CXR-LanIC (Language-Grounded Interpretable Classifier for Chest X-rays), a novel framework that addresses this interpretability challenge through task-aligned pattern discovery. Our approach trains transcoder-based sparse autoencoders on a BiomedCLIP diagnostic classifier to decompose medical image representations into interpretable visual patterns. By training an ensemble of 100 transcoders on multimodal embeddings from the MIMIC-CXR dataset, we discover approximately 5,000 monosemantic patterns spanning cardiac, pulmonary, pleural, structural, device, and artifact categories. Each pattern exhibits consistent activation behavior across images sharing specific radiological features, enabling transparent attribution where predictions decompose into 20-50 interpretable patterns with verifiable activation galleries. CXR-LanIC achieves competitive diagnostic accuracy on five key findings while providing the foundation for natural language explanations through planned large multimodal model annotation.
Our key innovation lies in extracting interpretable features from a classifier trained on specific diagnostic objectives rather than general-purpose embeddings, ensuring discovered patterns are directly relevant to clinical decision-making. This demonstrates that medical AI systems can be both accurate and interpretable, supporting safer clinical deployment through transparent, clinically grounded explanations.

[89] ITC-RWKV: Interactive Tissue-Cell Modeling with Recurrent Key-Value Aggregation for Histopathological Subtyping

Yating Huang,Qijun Yang,Lintao Xiang,Hujun Yin

Main category: cs.CV

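TL;DR: Proposes a dual-stream architecture that combines macroscale tissue features with cell-level representations; using an efficient receptance-weighted key-value aggregation model and a bidirectional tissue-cell interaction module, it outperforms existing methods on four histopathological subtype classification benchmarks.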

Details Motivation: Existing pathology foundation models capture global tissue context but do not model cell-level features, which limits them on fine-grained tasks such as cancer subtype classification. Method: Designs a dual-stream architecture that processes tissue and cell features separately; uses a recurrent transformer with receptance-weighted key-value aggregation to aggregate cell information efficiently; and introduces a bidirectional tissue-cell interaction module for mutual attention between localized cellular cues and the surrounding tissue environment. Result: Outperforms existing models on four histopathological subtype classification benchmarks, confirming the importance of cell-level aggregation and tissue-cell interaction. Conclusion: Effective modeling of cell-level features and bidirectional tissue-cell interaction are critical for fine-grained computational pathology. Abstract: Accurate interpretation of histopathological images demands integration of information across spatial and semantic scales, from nuclear morphology and cellular textures to global tissue organization and disease-specific patterns. Although recent foundation models in pathology have shown strong capabilities in capturing global tissue context, their omission of cell-level feature modeling remains a key limitation for fine-grained tasks such as cancer subtype classification. To address this, we propose a dual-stream architecture that models the interplay between macroscale tissue features and aggregated cellular representations. To efficiently aggregate information from large cell sets, we propose a receptance-weighted key-value aggregation model, a recurrent transformer that captures inter-cell dependencies with linear complexity. Furthermore, we introduce a bidirectional tissue-cell interaction module to enable mutual attention between localized cellular cues and their surrounding tissue environment. Experiments on four histopathological subtype classification benchmarks show that the proposed method outperforms existing models, demonstrating the critical role of cell-level aggregation and tissue-cell interaction in fine-grained computational pathology.

[90] GRAP-MOT: Unsupervised Graph-based Position Weighted Person Multi-camera Multi-object Tracking in a Highly Congested Space

Marek Socha,Michał Marczyk,Aleksander Kempski,Michał Cogiel,Paweł Foszner,Radosław Zawiski,Michał Staniszewski

Main category: cs.CV

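TL;DR: GRAP-MOT is a new method for person multi-object tracking (MOT) in closed areas with overlapping multi-camera views; it improves tracking by updating identity labels online and adding a person position estimation module.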

Details Motivation: In multi-camera video of closed areas, frequent person occlusion degrades conventional MOT methods, so a more robust solution is needed. Method: Proposes a graph-weighted online identity update mechanism that combines feature extraction, track association, and community search, and integrates a person position estimation module to improve tracking accuracy. Result: GRAP-MOT is shown to be superior on recordings from a self-built closed-area model and on public, highly congested real datasets, and the analysis indicates that IDF1 is better suited than MOTA for evaluating this kind of task. Conclusion: GRAP-MOT improves multi-object tracking accuracy under heavy occlusion, position information clearly improves the results, and IDF1 is recommended as the more appropriate metric. Abstract: GRAP-MOT is a new approach for solving the person MOT problem dedicated to videos of closed areas with overlapping multi-camera views, where person occlusion frequently occurs. Our novel graph-weighted solution updates a person's identification label online based on tracks and the person's characteristic features. To find the best solution, we deeply investigated all elements of the MOT process, including feature extraction, tracking, and community search. Furthermore, GRAP-MOT is equipped with a person's position estimation module, which gives additional key information to the MOT method, ensuring better results than methods without position data. We tested GRAP-MOT on recordings acquired in a closed-area model and on publicly available real datasets that fulfil the requirement of a highly congested space, showing the superiority of our proposition. Finally, we analyzed existing metrics used to compare MOT algorithms and concluded that IDF1 is more adequate than MOTA in such comparisons. We made our code, along with the acquired dataset, publicly available.
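
For readers comparing the two metrics named in this abstract, the sketch below computes both from their standard definitions. The function and argument names are illustrative, and the counts are made up for the example rather than taken from the paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / number of ground-truth objects."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN), computed on identity-level matches."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)

# Made-up counts: identity switches are one small term in MOTA, whereas IDF1
# scores how consistently each identity is kept over whole trajectories.
print(mota(false_negatives=10, false_positives=5, id_switches=5, num_gt=100))
print(idf1(idtp=80, idfp=20, idfn=20))
```

Because MOTA is dominated by detection errors (FN and FP), a tracker that detects well but swaps identities under occlusion can still score highly, which is one common argument for preferring IDF1 in crowded scenes.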

[91] An Automatic Detection Method for Hematoma Features in Placental Abruption Ultrasound Images Based on Few-Shot Learning

Xiaoqing Liu,Jitai Han,Hua Yan,Peng Li,Sida Tang,Ying Li,Kaiwen Zhang,Min Yu

Main category: cs.CV

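TL;DR: Proposes EH-YOLOv11n, an improved model based on few-shot learning, for automatic detection of hematoma features in placental ultrasound images, aiming at early and accurate diagnosis of placental abruption.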

Details Motivation: Conventional ultrasound diagnosis depends on physician experience and suffers from subjective bias and inconsistent results, so an automated, high-accuracy diagnostic aid is needed. Method: Builds on YOLOv11n with few-shot learning, combines wavelet convolution and coordinate convolution to strengthen frequency-domain and spatial feature extraction, and uses a cascaded group attention mechanism to suppress ultrasound artifacts and occlusion interference, improving localization accuracy. Result: Experiments report a detection accuracy of 78%, which is 2.5% higher than YOLOv11n and 13.7% higher than YOLOv8, with better precision-recall curves, confidence scores, and behavior under occlusion. Conclusion: EH-YOLOv11n combines high accuracy with real-time processing and provides a reliable option for computer-aided diagnosis of placental abruption, with significant clinical value. Abstract: Placental abruption is a severe complication during pregnancy, and its early accurate diagnosis is crucial for ensuring maternal and fetal safety. Traditional ultrasound diagnostic methods heavily rely on physician experience, leading to issues such as subjective bias and diagnostic inconsistencies. This paper proposes an improved model, EH-YOLOv11n (Enhanced Hemorrhage-YOLOv11n), based on small-sample learning, aiming to achieve automatic detection of hematoma features in placental ultrasound images. The model enhances performance through multidimensional optimization: it integrates wavelet convolution and coordinate convolution to strengthen frequency and spatial feature extraction, and it incorporates a cascaded group attention mechanism to suppress ultrasound artifacts and occlusion interference, thereby improving bounding box localization accuracy. Experimental results demonstrate a detection accuracy of 78%, representing a 2.5% improvement over YOLOv11n and a 13.7% increase over YOLOv8. The model exhibits significant superiority in precision-recall curves, confidence scores, and occlusion scenarios. Combining high accuracy with real-time processing, this model provides a reliable solution for computer-aided diagnosis of placental abruption, holding significant clinical application value.

[92] GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLMs

Guanghao Zheng,Bowen Shi,Mingxing Xu,Ruoyu Sun,Peisen Zhao,Zhibo Zhang,Wenrui Dai,Junni Zou,Hongkai Xiong,Xiaopeng Zhang,Qi Tian

Main category: cs.CV

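TL;DR: Proposes GranViT, a new Vision Transformer that uses region-level autoregressive training to combine fine-grained feature extraction with semantic alignment to large language models. The authors also build Gran-29M, a large-scale dataset with fine-grained annotations; with a pretraining-adaptation framework and a self-distillation mechanism, the model achieves state-of-the-art results on fine-grained recognition, multimodal visual question answering, and OCR understanding.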

Details Motivation: Existing vision encoders focus on global image representations and neglect fine-grained regional analysis; scarce fine-grained annotated data and the lack of a matching pretraining paradigm leave them weak at fine-grained perception. Method: Proposes the GranViT model and builds the Gran-29M dataset with 2 million images and 180 million region-level annotations; uses bounding-box-to-text and text-to-bounding-box regression tasks for pretraining and adaptation; and adds a self-distillation mechanism that strengthens the vision encoder's localization constraints and regional reasoning. Result: Experiments show that GranViT surpasses existing vision encoders on multiple benchmarks, transfers well, and reaches state-of-the-art performance on fine-grained recognition, multimodal VQA, and OCR understanding. Conclusion: Through fine-grained region-level pretraining and self-distillation, GranViT improves the vision encoder's local perception and semantic alignment and markedly strengthens multimodal large models on complex vision-language tasks. Abstract: Vision encoders are indispensable for allowing impressive performance of Multi-modal Large Language Models (MLLMs) in vision language tasks such as visual question answering and reasoning. However, existing vision encoders focus on global image representations but overlook fine-grained regional analysis. They are limited in fine-grained perception due to the scarcity of fine-grained annotated data and the lack of a fine-grained pre-training paradigm. In this paper, we propose GranViT, a novel Vision Transformer that integrates fine-grained feature extraction with semantic alignment to Large Language Models (LLMs) via region-level autoregressive training. We first construct Gran-29M, a dataset comprising 2 million natural and OCR images paired with over 180 million high-quality region-level annotations, to enable large-scale fine-grained pretraining. Consequently, we develop a pretraining-adaptation framework along with a self-distillation mechanism to train fine-grained GranViT on Gran-29M. We fully exploit the fine-grained annotations from Gran-29M, using bounding-box-to-caption regression to enhance localized visual representation of the vision encoder in the pretraining and caption-to-bounding-box regression to improve vision feature utilization and localization for LLM in the adaptation. We further incorporate a self-distillation mechanism that imposes explicit localization constraints on the vision encoder to strengthen its regional reasoning capability. Extensive experiments show that GranViT surpasses existing vision encoders and attains strong transferability to varying LLMs.
Remarkably, it achieves state-of-the-art results on fine-grained recognition, multimodal VQA, and OCR understanding.

[93] Towards a Golden Classifier-Free Guidance Path via Foresight Fixed Point Iterations

Kaibo Wang,Jianda Mao,Tong Wu,Yang Xiang

Main category: cs.CV

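TL;DR: Proposes a unified perspective that reframes conditional guidance as fixed point iteration and introduces Foresight Guidance (FSG), which outperforms existing methods in image quality and computational efficiency.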

Details Motivation: Existing conditional guidance methods rest on divergent theoretical interpretations, which limits the design space and obscures key design choices; a unified perspective is needed. Method: Treats conditional guidance as fixed point iteration that seeks a "golden path" for the latents, and proposes FSG, which prioritizes longer-interval subproblems in early diffusion stages and solves them with multiple iterations. Result: Experiments across multiple datasets and model architectures show that FSG surpasses current state-of-the-art methods in both image quality and computational efficiency. Conclusion: FSG offers a new perspective on conditional guidance and shows the potential of adaptive design for improving diffusion models. Abstract: Classifier-Free Guidance (CFG) is an essential component of text-to-image diffusion models, and understanding and advancing its operational mechanisms remains a central focus of research. Existing approaches stem from divergent theoretical interpretations, thereby limiting the design space and obscuring key design choices. To address this, we propose a unified perspective that reframes conditional guidance as fixed point iterations, seeking to identify a golden path where latents produce consistent outputs under both conditional and unconditional generation. We demonstrate that CFG and its variants constitute a special case of single-step short-interval iteration, which is theoretically proven to exhibit inefficiency. To this end, we introduce Foresight Guidance (FSG), which prioritizes solving longer-interval subproblems in early diffusion stages with increased iterations. Extensive experiments across diverse datasets and model architectures validate the superiority of FSG over state-of-the-art methods in both image quality and computational efficiency. Our work offers novel perspectives for conditional guidance and unlocks the potential of adaptive design.
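
For orientation, the standard classifier-free guidance rule and the generic form of a fixed point iteration are written out below. The notation ($\epsilon_\theta$ for the noise predictor, $c$ for the condition, $\varnothing$ for the null condition, $w$ for the guidance scale, $F$ for the iterated map) is ours, and the abstract does not specify the particular operator the authors iterate, so $F$ is left abstract.

$$
\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \left( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \right)
$$

$$
x^{(k+1)} = F\left(x^{(k)}\right), \qquad x^{\ast} = F\left(x^{\ast}\right)
$$

Read this way, the "golden path" in the abstract corresponds to latents at which the conditional and unconditional predictions agree, so the guidance term $\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)$ vanishes.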

[94] Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video

Ciara Rowles,Varun Jampani,Simon Donné,Shimon Vainer,Julian Parker,Zach Evans

Main category: cs.CV

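TL;DR: Foley Control is a lightweight approach to video-guided Foley generation that connects pretrained video and audio models and learns only a small cross-attention bridge for audio-video synchronization, while keeping good controllability and a modular design.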

Details Motivation: Existing multi-modal systems usually need many trainable parameters and retrain the whole model; Foley Control aims for high-quality video-to-Foley audio generation with far fewer parameters and without retraining the audio prior. Method: Connects V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio model through compact cross-attention, inserting video cross-attention after the text cross-attention and pooling video tokens to reduce memory use and stabilize training. Result: On curated video-audio benchmarks, Foley Control achieves competitive temporal and semantic alignment with far fewer trainable parameters than existing methods, while preserving prompt-driven controllability and production-friendly modularity. Conclusion: Foley Control provides an efficient, modular solution for video-guided Foley generation; its bridge design allows components to be swapped or upgraded without end-to-end retraining, and it could extend to other audio modalities such as speech. Abstract: Foley Control is a lightweight approach to video-guided Foley that keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. We connect V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio (T2A) model by inserting compact video cross-attention after the model's existing text cross-attention, so prompts set global semantics while video refines timing and local dynamics. The frozen backbones retain strong marginals (video; audio given text) and the bridge learns the audio-video dependency needed for synchronization -- without retraining the audio prior. To cut memory and stabilize training, we pool video tokens before conditioning. On curated video-audio benchmarks, Foley Control delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multi-modal systems, while preserving prompt-driven controllability and production-friendly modularity (swap/upgrade encoders or the T2A backbone without end-to-end retraining). Although we focus on Video-to-Foley, the same bridge design can potentially extend to other audio modalities (e.g., speech).

[95] Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image Generation

Yifu Luo,Penghui Du,Bo Li,Sinan Du,Tiantian Zhang,Yongzhe Chang,Kai Wu,Kun Gai,Xueqian Wang

Main category: cs.CV

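TL;DR: Proposes Chunk-GRPO, a chunk-level optimization method for text-to-image generation that groups consecutive steps into "chunks" to capture the temporal dynamics of flow matching, overcoming the limits of standard GRPO in advantage attribution and temporal modeling.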

Details Motivation: Existing GRPO methods attribute advantages inaccurately and ignore the temporal dynamics of the generation process, which limits their performance. Method: Proposes Chunk-GRPO, which moves optimization from the step level to the chunk level, adds an optional weighted sampling strategy, and optimizes the policy per chunk to better model the temporal structure of generation. Result: Extensive experiments show that Chunk-GRPO outperforms baseline methods in both preference alignment and image quality. Conclusion: Chunk-level optimization is a better paradigm for GRPO-based text-to-image generation, with clear potential for performance gains. Abstract: Group Relative Policy Optimization (GRPO) has shown strong potential for flow-matching-based text-to-image (T2I) generation, but it faces two key limitations: inaccurate advantage attribution, and the neglect of temporal dynamics of generation. In this work, we argue that shifting the optimization paradigm from the step level to the chunk level can effectively alleviate these issues. Building on this idea, we propose Chunk-GRPO, the first chunk-level GRPO-based approach for T2I generation. The insight is to group consecutive steps into coherent 'chunks' that capture the intrinsic temporal dynamics of flow matching, and to optimize policies at the chunk level. In addition, we introduce an optional weighted sampling strategy to further enhance performance. Extensive experiments show that Chunk-GRPO achieves superior results in both preference alignment and image quality, highlighting the promise of chunk-level optimization for GRPO-based methods.
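
Two ingredients from this abstract are easy to sketch: the group-relative advantage that gives GRPO its name, and the grouping of consecutive sampling steps into chunks. The code below is an illustration under assumptions, not the paper's algorithm: the chunk sizes are arbitrary, both function names are hypothetical, and the abstract does not say how chunk boundaries are actually chosen.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sample's reward by the mean and std of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def split_into_chunks(num_steps, chunk_sizes):
    """Partition step indices 0..num_steps-1 into consecutive chunks."""
    if sum(chunk_sizes) != num_steps:
        raise ValueError("chunk sizes must sum to the number of steps")
    chunks, start = [], 0
    for size in chunk_sizes:
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks

# One group of four sampled images with scalar rewards, and a 6-step sampler
# split into two chunks (sizes chosen arbitrarily for the example).
print(group_relative_advantages([1.0, 3.0, 2.0, 2.0]))
print(split_into_chunks(6, [2, 4]))
```

In a chunk-level scheme, the policy update would then be applied per chunk of steps rather than per individual step, with every step in a chunk sharing the sample's advantage.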

[96] MATrack: Efficient Multiscale Adaptive Tracker for Real-Time Nighttime UAV Operations

Xuzhao Li,Xuchen Li,Shiyu Hu

Main category: cs.CV

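TL;DR: Proposes MATrack, a multiscale adaptive system designed for nighttime UAV tracking; its three core modules work together to deliver significant gains under low light, cluttered backgrounds, and viewpoint changes, and its reliability is validated on a real UAV platform.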

Details Motivation: Nighttime UAV tracking faces low-light conditions, cluttered backgrounds, and frequent viewpoint changes, and existing methods suffer from visual artifacts, high computational cost, and poor use of dynamic object information. Method: Proposes the MATrack system with three core modules: Multiscale Hierarchy Blende (MHB), Adaptive Key Token Gate, and Nighttime Template Calibrator (NTC), which respectively enhance feature consistency between static and dynamic templates, identify object information accurately in complex backgrounds, and keep tracking stable over long sequences. Result: On the UAVDark135 benchmark, MATrack exceeds state-of-the-art methods by 5.9% in precision, 5.4% in normalized precision, and 4.2% in AUC, while running in real time at 81 FPS. Conclusion: MATrack addresses the key technical problems of nighttime UAV tracking, shows high reliability and stability in practice, and suits critical robotics tasks such as nighttime search and rescue and border patrol. Abstract: Nighttime UAV tracking faces significant challenges in real-world robotics operations. Low-light conditions not only limit visual perception capabilities, but cluttered backgrounds and frequent viewpoint changes also cause existing trackers to drift or fail during deployment. To address these difficulties, researchers have proposed solutions based on low-light enhancement and domain adaptation. However, these methods still have notable shortcomings in actual UAV systems: low-light enhancement often introduces visual artifacts, domain adaptation methods are computationally expensive, and existing lightweight designs struggle to fully leverage dynamic object information. Based on an in-depth analysis of these key issues, we propose MATrack, a multiscale adaptive system designed specifically for nighttime UAV tracking. MATrack tackles the main technical challenges of nighttime tracking through the collaborative work of three core modules: Multiscale Hierarchy Blende (MHB) enhances feature consistency between static and dynamic templates. Adaptive Key Token Gate accurately identifies object information within complex backgrounds. Nighttime Template Calibrator (NTC) ensures stable tracking performance over long sequences. Extensive experiments show that MATrack achieves a significant performance improvement. On the UAVDark135 benchmark, its precision, normalized precision and AUC surpass state-of-the-art (SOTA) methods by 5.9%, 5.4% and 4.2% respectively, while maintaining a real-time processing speed of 81 FPS.
Further tests on a real-world UAV platform validate the system's reliability, demonstrating that MATrack can provide stable and effective nighttime UAV tracking support for critical robotics applications such as nighttime search and rescue and border patrol.

[97] Restore Text First, Enhance Image Later: Two-Stage Scene Text Image Super-Resolution with Glyph Structure Guidance

Minxing Luo,Linlong Fan,Wang Qiushi,Ge Wu,Yiyan Luo,Yuhang Yu,Jinwei Chen,Yaxing Wang,Qingnan Fan,Jian Yang

Main category: cs.CV

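TL;DR: Proposes TIGER, a two-stage text-image guided super-resolution framework that follows a "text first, image later" paradigm to improve image quality while keeping text readable, addressing the text distortion of existing methods.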

Details Motivation: Existing generative super-resolution methods work well on natural images but distort text, creating a trade-off between image quality and text readability; a method is needed that secures both text clarity and overall image quality. Method: TIGER uses a two-stage framework: the first stage focuses on precise restoration of text structure (glyph reconstruction), and the second stage uses the reconstructed text structure to guide super-resolution of the whole image. This glyph-to-image guidance balances high fidelity with visual consistency. The authors also build UltraZoom-ST, the first scene text dataset supporting extreme zoom (×14.29), for training and evaluation. Result: Experiments show that TIGER reaches state-of-the-art performance on multiple metrics, markedly improving text readability while maintaining excellent overall image quality. Conclusion: TIGER breaks the trade-off between image quality and text readability and offers an effective, practical solution for super-resolution of images containing text. Abstract: Current generative super-resolution methods show strong performance on natural images but distort text, creating a fundamental trade-off between image quality and textual readability. To address this, we introduce TIGER (Text-Image Guided supEr-Resolution), a novel two-stage framework that breaks this trade-off through a "text-first, image-later" paradigm. TIGER explicitly decouples glyph restoration from image enhancement: it first reconstructs precise text structures and then uses them to guide subsequent full-image super-resolution. This glyph-to-image guidance ensures both high fidelity and visual consistency. To support comprehensive training and evaluation, we also contribute UltraZoom-ST (UltraZoom-Scene Text), the first scene text dataset with extreme zoom (×14.29). Extensive experiments show that TIGER achieves state-of-the-art performance, enhancing readability while preserving overall image quality.

[98] Automated interictal epileptic spike detection from simple and noisy annotations in MEG data

Pauline Mouches,Julien Jung,Armand Demasson,Agnès Guinard,Romain Bouet,Rosalie Marchal,Romain Quentin

Main category: cs.CV

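TL;DR: Proposes deep learning methods for automatic detection of interictal epileptic spikes in MEG using ANN and CNN models; under realistic clinical conditions with only temporal, single-expert annotations they outperform an existing method, and an interactive machine learning strategy improves annotation quality, showing good robustness and clinical potential.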

Details Motivation: Manual detection of interictal epileptic spikes in MEG is time-consuming and error-prone, and existing automated methods are hard to use clinically because they either need extensively annotated data or lack robustness on atypical data. Method: Proposes two deep learning models (a feature-based ANN and a CNN) trained on data from 59 patients, uses an interactive machine learning strategy that iteratively improves annotation quality from intermediate model outputs, and evaluates on 10 independent test patients. Result: The proposed CNN (F1=0.46) and ANN (F1=0.44) both outperform the existing state-of-the-art model, and the interactive learning experiments indicate robustness to noisy annotations. Conclusion: Deep learning models with simple architectures are robust on complex, imperfectly annotated MEG data; combined with interactive machine learning they can improve annotation efficiency and model performance, with potential as practical clinical tools. Abstract: In drug-resistant epilepsy, presurgical evaluation can be considered. Magnetoencephalography (MEG) has been shown to be an effective exam to inform the localization of the epileptogenic zone through the localization of interictal epileptic spikes. Manual detection of these pathological biomarkers remains a tedious and error-prone task due to the high dimensionality of MEG recordings, and interrater agreement has been reported to be only moderate. Current automated methods are unsuitable for clinical practice, either requiring extensively annotated data or lacking robustness on non-typical data. In this work, we demonstrate that deep learning models can be used for detecting interictal spikes in MEG recordings, even when only temporal and single-expert annotations are available, which represents real-world clinical practice. We propose two model architectures: a feature-based artificial neural network (ANN) and a convolutional neural network (CNN), trained on a database of 59 patients, and evaluated against a state-of-the-art model to classify short time windows of signal. In addition, we employ an interactive machine learning strategy to iteratively improve our data annotation quality using intermediary model outputs. Both proposed models outperform the state-of-the-art model (F1-scores: CNN=0.46, ANN=0.44) when tested on 10 holdout test patients. The interactive machine learning strategy demonstrates that our models are robust to noisy annotations. Overall, results highlight the robustness of models with simple architectures when analyzing complex and imperfectly annotated data.
Our method of interactive machine learning offers great potential for faster data annotation, while our models represent useful and efficient tools for automated interictal spikes detection.

[99] S3OD: Towards Generalizable Salient Object Detection with Synthetic Data

Orest Kupyn,Hirokatsu Kataoka,Christian Rupprecht

Main category: cs.CV

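TL;DR: This paper presents a method that substantially improves the generalization of salient object detection through large-scale synthetic data generation and an ambiguity-aware architecture. It builds S3OD, a dataset of over 139,000 high-resolution images, and designs a multi-mask decoder to handle annotation ambiguity. Models trained only on synthetic data reduce error by 20-50% in cross-dataset evaluation, and fine-tuned versions reach state-of-the-art performance on the DIS and HR-SOD benchmarks.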

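Details Motivation: Tasks such as salient object detection depend on expensive pixel-precise annotations, which forces separate models to be trained for related subtasks and limits generalization; a method is needed that breaks this data barrier and improves cross-task generalization. Method: Introduces the S3OD dataset, built with a multi-modal diffusion pipeline that generates high-resolution images and extracts labels from diffusion and DINO-v3 features; an iterative generation framework prioritizes challenging categories; a streamlined multi-mask decoder handles ambiguity in the annotations. Result: Models trained only on synthetic data achieve a 20-50% error reduction in cross-dataset testing, and fine-tuned versions reach state-of-the-art performance on the DIS and HR-SOD benchmarks. Conclusion: With high-quality synthetic data and a purpose-built architecture, the method addresses the data limits caused by high annotation cost in salient object detection and markedly improves generalization across subtasks. Abstract: Salient object detection exemplifies data-bounded tasks where expensive pixel-precise annotations force separate model training for related subtasks like DIS and HR-SOD. We present a method that dramatically improves generalization through large-scale synthetic data generation and ambiguity-aware architecture. We introduce S3OD, a dataset of over 139,000 high-resolution images created through our multi-modal diffusion pipeline that extracts labels from diffusion and DINO-v3 features. The iterative generation framework prioritizes challenging categories based on model performance. We propose a streamlined multi-mask decoder that naturally handles the inherent ambiguity in salient object detection by predicting multiple valid interpretations. Models trained solely on synthetic data achieve 20-50% error reduction in cross-dataset generalization, while fine-tuned versions reach state-of-the-art performance across DIS and HR-SOD benchmarks.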

[100] Modest-Align: Data-Efficient Alignment for Vision-Language Models

Jiaxiang Liu,Yuan Wang,Jiawei Du,Joey Tianyi Zhou,Mingkun Xu,Zuozhu Liu

Main category: cs.CV

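TL;DR: Proposes Modest-Align, a lightweight cross-modal alignment framework that uses random perturbation and embedding smoothing to achieve robust and efficient alignment with low-resource and noisy data.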

Details Motivation: Existing contrastive learning methods tend to overfit and become overconfident on low-quality or weakly correlated image-text pairs, which degrades performance in resource-constrained settings. Method: Introduces two strategies: Random Perturbation, which simulates uncertainty, and Embedding Smoothing, which calibrates similarity distributions in the embedding space. Result: Experiments on multiple benchmark datasets show that Modest-Align outperforms existing methods in retrieval tasks, achieving competitive results with over 100x less training data and 600x less GPU time than CLIP. Conclusion: Modest-Align offers a practical and scalable solution for low-resource cross-modal alignment in real-world settings. Abstract: Cross-modal alignment aims to map heterogeneous modalities into a shared latent space, as exemplified by models like CLIP, which benefit from large-scale image-text pretraining for strong recognition capabilities. However, when operating in resource-constrained settings with limited or low-quality data, these models often suffer from overconfidence and degraded performance due to the prevalence of ambiguous or weakly correlated image-text pairs. Current contrastive learning approaches, which rely on single positive pairs, further exacerbate this issue by reinforcing overconfidence on uncertain samples. To address these challenges, we propose Modest-Align, a lightweight alignment framework designed for robustness and efficiency. Our approach leverages two complementary strategies -- Random Perturbation, which introduces controlled noise to simulate uncertainty, and Embedding Smoothing, which calibrates similarity distributions in the embedding space. These mechanisms collectively reduce overconfidence and improve performance on noisy or weakly aligned samples. Extensive experiments across multiple benchmark datasets demonstrate that Modest-Align outperforms state-of-the-art methods in retrieval tasks, achieving competitive results with over 100x less training data and 600x less GPU time than CLIP. Our method offers a practical and scalable solution for cross-modal alignment in real-world, low-resource scenarios.

[101] Epipolar Geometry Improves Video Generation Models

Orest Kupyn,Fabian Manhardt,Federico Tombari,Christian Rupprecht

Main category: cs.CV

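TL;DR: This paper combines classical epipolar geometry constraints with modern video diffusion models, enforcing geometric consistency through preference-based optimization to improve the spatial stability and visual quality of generated videos.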

Details Motivation: Existing video generation models suffer from geometric inconsistencies, unstable motion, and visual artifacts that break the illusion of realistic 3D scenes, so fundamental geometric principles are needed to improve them. Method: Diffusion models are aligned with pairwise epipolar geometry constraints via preference-based optimization, which enforces the constraints without requiring end-to-end differentiability. Result: Classical geometric constraints provide more stable optimization signals than modern learned metrics, whose noisy targets compromise alignment quality; the method trains on static scenes and generalizes to diverse dynamic content. Conclusion: By bridging data-driven deep learning with classical geometric computer vision, the paper presents a practical and efficient approach that generates spatially consistent videos without sacrificing visual quality. Abstract: Video generation models have progressed tremendously through large latent diffusion transformers trained with rectified flow techniques. Yet these models still struggle with geometric inconsistencies, unstable motion, and visual artifacts that break the illusion of realistic 3D scenes. 3D-consistent video generation could significantly impact numerous downstream applications in generation and reconstruction tasks. We explore how epipolar geometry constraints improve modern video diffusion models. Despite massive training data, these models fail to capture fundamental geometric principles underlying visual content. We align diffusion models using pairwise epipolar geometry constraints via preference-based optimization, directly addressing unstable camera trajectories and geometric artifacts through mathematically principled geometric enforcement. Our approach efficiently enforces geometric principles without requiring end-to-end differentiability. Evaluation demonstrates that classical geometric constraints provide more stable optimization signals than modern learned metrics, which produce noisy targets that compromise alignment quality. Training on static scenes with dynamic cameras ensures high-quality measurements while the model generalizes effectively to diverse dynamic content. By bridging data-driven deep learning with classical geometric computer vision, we present a practical method for generating spatially consistent videos without compromising visual quality.
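
To make the idea of a pairwise epipolar constraint concrete, here is a minimal sketch of the Sampson error, a standard first-order measure of how far a point correspondence is from satisfying x2^T F x1 = 0. This is a generic textbook quantity, not necessarily the score the authors use: the abstract does not say how the constraint is measured or how preference pairs are built. The fundamental matrix `F_SIDEWAYS` and the helper `prefer_lower_error` are hypothetical and purely illustrative.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def transpose(m):
    """Transpose a 3x3 matrix given as a list of rows."""
    return [[m[j][i] for j in range(3)] for i in range(3)]


def sampson_error(F, x1, x2):
    """Sampson error of the correspondence x1 <-> x2 under fundamental matrix F.

    x1 and x2 are (u, v) pixel coordinates in frame 1 and frame 2.
    The value is zero exactly when the epipolar constraint x2^T F x1 = 0 holds.
    """
    p1 = [x1[0], x1[1], 1.0]
    p2 = [x2[0], x2[1], 1.0]
    Fp1 = mat_vec(F, p1)              # epipolar line of x1 in frame 2
    Ftp2 = mat_vec(transpose(F), p2)  # epipolar line of x2 in frame 1
    residual = sum(p2[i] * Fp1[i] for i in range(3))
    denom = Fp1[0] ** 2 + Fp1[1] ** 2 + Ftp2[0] ** 2 + Ftp2[1] ** 2
    if denom == 0.0:
        raise ValueError("degenerate fundamental matrix for this correspondence")
    return residual ** 2 / denom


def prefer_lower_error(errors_a, errors_b):
    """Hypothetical preference rule: pick the sample with the lower mean error."""
    mean_a = sum(errors_a) / len(errors_a)
    mean_b = sum(errors_b) / len(errors_b)
    return "a" if mean_a <= mean_b else "b"


# Illustrative F for a camera translating along the x axis with identity
# intrinsics and no rotation: matching points must stay on the same image row.
F_SIDEWAYS = [[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]]
```

With `F_SIDEWAYS`, a match that stays on the same row has zero error, while a match that drifts vertically is penalized. A score of this kind can rank two generated videos without being differentiated through, which is consistent with the abstract's remark that end-to-end differentiability is not required.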

[102] DAP-MAE: Domain-Adaptive Point Cloud Masked Autoencoder for Effective Cross-Domain Learning

Ziqi Gao,Qiufu Li,Linlin Shen

Main category: cs.CV

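TL;DR: Proposes the Domain-Adaptive Point Cloud Masked Autoencoder (DAP-MAE), which uses a heterogeneous domain adapter and a domain feature generator to adaptively integrate knowledge from cross-domain data, and performs well on a range of downstream tasks.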

Details Motivation: Existing pre-training on cross-domain point cloud data loses performance because the learned prior knowledge does not match the downstream tasks, so cross-domain knowledge transfer needs to be made more effective. Method: Designs a heterogeneous domain adapter that uses an adaptation mode during pre-training and a fusion mode during fine-tuning, and adds a domain feature generator to guide feature adaptation. Result: Reaches 95.18% classification accuracy on ScanObjectNN and 88.45% facial expression recognition accuracy on Bosphorus, and applies to four different point cloud analysis tasks. Conclusion: DAP-MAE effectively integrates knowledge from cross-domain point cloud data and achieves excellent performance on multiple downstream tasks with only one pre-training. Abstract: Compared to 2D data, the scale of point cloud data in different domains available for training, is quite limited. Researchers have been trying to combine these data of different domains for masked autoencoder (MAE) pre-training to leverage such a data scarcity issue. However, the prior knowledge learned from mixed domains may not align well with the downstream 3D point cloud analysis tasks, leading to degraded performance. To address such an issue, we propose the Domain-Adaptive Point Cloud Masked Autoencoder (DAP-MAE), an MAE pre-training method, to adaptively integrate the knowledge of cross-domain datasets for general point cloud analysis. In DAP-MAE, we design a heterogeneous domain adapter that utilizes an adaptation mode during pre-training, enabling the model to comprehensively learn information from point clouds across different domains, while employing a fusion mode in the fine-tuning to enhance point cloud features. Meanwhile, DAP-MAE incorporates a domain feature generator to guide the adaptation of point cloud features to various downstream tasks. With only one pre-training, DAP-MAE achieves excellent performance across four different point cloud analysis tasks, reaching 95.18% in object classification on ScanObjectNN and 88.45% in facial expression recognition on Bosphorus.

[103] A Dynamic Knowledge Distillation Method Based on the Gompertz Curve

Han Yang,Guangjun Qin

Main category: cs.CV

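TL;DR: This paper proposes Gompertz-CNN, a dynamic knowledge distillation framework based on the Gompertz growth model. A stage-aware strategy dynamically adjusts the distillation loss weight, combined with Wasserstein distance and gradient matching, and the method outperforms traditional distillation on CIFAR-10 and CIFAR-100.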

Details Motivation: Traditional knowledge distillation methods do not adequately account for how the student model's capacity evolves during training, which leads to suboptimal knowledge transfer. Method: Uses the Gompertz growth model to dynamically adjust the distillation loss weight, measures feature-level discrepancies with Wasserstein distance, and applies gradient matching to align the backward propagation behavior of teacher and student, all combined in a multi-loss objective. Result: On CIFAR-10 and CIFAR-100, accuracy improves by up to 8% and 4%, respectively, over traditional distillation methods. Conclusion: By modeling the student's learning progression, Gompertz-CNN transfers knowledge more effectively, supporting the value of dynamic weighting in knowledge distillation. Abstract: This paper introduces a novel dynamic knowledge distillation framework, Gompertz-CNN, which integrates the Gompertz growth model into the training process to address the limitations of traditional knowledge distillation. Conventional methods often fail to capture the evolving cognitive capacity of student models, leading to suboptimal knowledge transfer. To overcome this, we propose a stage-aware distillation strategy that dynamically adjusts the weight of distillation loss based on the Gompertz curve, reflecting the student's learning progression: slow initial growth, rapid mid-phase improvement, and late-stage saturation. Our framework incorporates Wasserstein distance to measure feature-level discrepancies and gradient matching to align backward propagation behaviors between teacher and student models. These components are unified under a multi-loss objective, where the Gompertz curve modulates the influence of distillation losses over time. Extensive experiments on CIFAR-10 and CIFAR-100 using various teacher-student architectures (e.g., ResNet50 and MobileNet_v2) demonstrate that Gompertz-CNN consistently outperforms traditional distillation methods, achieving up to 8% and 4% accuracy gains on CIFAR-10 and CIFAR-100, respectively.
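
The stage-aware weighting described above can be sketched with the standard Gompertz function g(t) = a * exp(-b * exp(-c * t)), which starts near zero, rises quickly in the middle, and saturates at a. The sketch below is an assumption-laden illustration: the parameter values `a`, `b`, `c`, the normalization of the epoch to [0, 1], and the way the weight enters the total loss are all hypothetical, since the abstract does not give the paper's exact parameterization.

```python
import math


def gompertz_weight(epoch, total_epochs, a=1.0, b=5.0, c=8.0):
    """Gompertz-shaped weight for the distillation loss at a given epoch.

    a is the saturation level, b shifts the curve along the time axis,
    and c sets the growth rate. All three defaults are illustrative.
    """
    # Normalize training progress to t in [0, 1].
    t = epoch / max(total_epochs - 1, 1)
    return a * math.exp(-b * math.exp(-c * t))


def combined_loss(task_loss, distill_loss, epoch, total_epochs):
    """Hypothetical total loss: task loss plus Gompertz-weighted distillation loss."""
    return task_loss + gompertz_weight(epoch, total_epochs) * distill_loss


if __name__ == "__main__":
    # Print the schedule at a few points over a 100-epoch run.
    for epoch in (0, 25, 50, 75, 99):
        print(epoch, round(gompertz_weight(epoch, 100), 4))
```

With these defaults the weight stays small early in training, so the student is driven mostly by the task loss, and approaches `a` late in training, mirroring the slow start, rapid middle phase, and saturation described in the abstract.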

[104] Group Inertial Poser: Multi-Person Pose and Global Translation from Sparse Inertial Sensors and Ultra-Wideband Ranging

Ying Xue,Jiaxi Jiang,Rayan Armani,Dominik Hollidt,Yi-Chi Liao,Christian Holz

Main category: cs.CV

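TL;DR: This paper presents Group Inertial Poser, a new method for multi-person pose and global translation estimation that fuses inertial measurement units (IMUs) with ultra-wideband (UWB) ranging. Adding inter-sensor distances overcomes the limits of purely IMU-based methods in global localization and relative positioning. The paper also releases GIP-DB, the first IMU+UWB dataset for two-person motion capture.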

Details Motivation: Purely IMU-based methods have no spatial reference to the environment, so they struggle to estimate relative positions between individuals and global translation, which limits their use in multi-person motion tracking. Additional distance sensing is needed to improve spatial consistency. Method: Group Inertial Poser fuses IMU observations with absolute inter-sensor distances obtained from UWB and feeds them into structured state-space models for temporal motion modeling; a two-step optimization further uses the estimated distances to track global trajectories. Result: The method outperforms previous state-of-the-art methods in accuracy and robustness on both synthetic and real-world data; the authors build and release GIP-DB, a dataset with 200 minutes of motion recordings from 14 participants. Conclusion: Fusing IMU and UWB improves the accuracy and practicality of multi-person motion capture and shows promise for full-body tracking in the wild. Abstract: Tracking human full-body motion using sparse wearable inertial measurement units (IMUs) overcomes the limitations of occlusion and instrumentation of the environment inherent in vision-based approaches. However, purely IMU-based tracking compromises translation estimates and accurate relative positioning between individuals, as inertial cues are inherently self-referential and provide no direct spatial reference for others. In this paper, we present a novel approach for robustly estimating body poses and global translation for multiple individuals by leveraging the distances between sparse wearable sensors - both on each individual and across multiple individuals. Our method Group Inertial Poser estimates these absolute distances between pairs of sensors from ultra-wideband ranging (UWB) and fuses them with inertial observations as input into structured state-space models to integrate temporal motion patterns for precise 3D pose estimation. Our novel two-step optimization further leverages the estimated distances for accurately tracking people's global trajectories through the world. We also introduce GIP-DB, the first IMU+UWB dataset for two-person tracking, which comprises 200 minutes of motion recordings from 14 participants. In our evaluation, Group Inertial Poser outperforms previous state-of-the-art methods in accuracy and robustness across synthetic and real-world data, showing the promise of IMU+UWB-based multi-human motion capture in the wild. Code, models, dataset: https://github.com/eth-siplab/GroupInertialPoser

[105] Long-tailed Species Recognition in the NACTI Wildlife Dataset

Zehua Liu,Tilo Burghardt

Main category: cs.CV

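TL;DR: This paper systematically studies long-tail recognition (LTR) methods for species recognition on the North America Camera Trap Images (NACTI) dataset. Improved loss functions and regularization raise accuracy substantially, and the study evaluates how well the model generalizes under distribution shift.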

Details Motivation: The NACTI dataset has severe long-tailed class imbalance, and conventional methods perform poorly on tail classes, so effective long-tail recognition techniques are needed to improve overall performance and generalization. Method: Building on the PyTorch Wildlife model, the study systematically evaluates several LTR loss functions and regularization methods and tunes the learning-rate scheduler, with experiments on the NACTI dataset. Result: The best configuration reaches 99.40% Top-1 accuracy on the NACTI test split, compared with a 95.51% baseline; on a Reduced-Bias Test set built from the ENA-Detection dataset it reaches 52.55% accuracy (up from 51.20%), indicating stronger generalization. Conclusion: LTR methods clearly improve performance and robustness in wildlife image recognition, but tail classes still break down under severe domain shift, so bias and generalization remain open challenges. Abstract: As most "in the wild" data collections of the natural world, the North America Camera Trap Images (NACTI) dataset shows severe long-tailed class imbalance, noting that the largest 'Head' class alone covers >50% of the 3.7M images in the corpus. Building on the PyTorch Wildlife model, we present a systematic study of Long-Tail Recognition methodologies for species recognition on the NACTI dataset covering experiments on various LTR loss functions plus LTR-sensitive regularisation. Our best configuration achieves 99.40% Top-1 accuracy on our NACTI test data split, substantially improving over a 95.51% baseline using standard cross-entropy with Adam. This also improves on previously reported top performance in MLWIC2 at 96.8% albeit using partly unpublished (potentially different) partitioning, optimiser, and evaluation protocols. To evaluate domain shifts (e.g. night-time captures, occlusion, motion-blur) towards other datasets we construct a Reduced-Bias Test set from the ENA-Detection dataset where our experimentally optimised long-tail enhanced model achieves leading 52.55% accuracy (up from 51.20% with WCE loss), demonstrating stronger generalisation capabilities under distribution shift. We document the consistent improvements of LTR-enhancing scheduler choices in this NACTI wildlife domain, particularly when in tandem with state-of-the-art LTR losses. We finally discuss qualitative and quantitative shortcomings that LTR methods cannot sufficiently address, including catastrophic breakdown for 'Tail' classes under severe domain shift. For maximum reproducibility we publish all dataset splits, key code, and full network weights.
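
The "WCE loss" used as a comparison point in the abstract presumably refers to weighted cross-entropy, one of the simplest long-tail losses. The following is a minimal sketch under that assumption, using inverse-frequency class weights; the abstract does not state which weighting scheme the paper uses, and the function names and class counts here are hypothetical.

```python
import math


def inverse_frequency_weights(class_counts):
    """Per-class weights proportional to 1 / frequency.

    Normalized so that a perfectly balanced dataset gives every class weight 1.
    """
    total = sum(class_counts)
    num_classes = len(class_counts)
    return [total / (num_classes * n) for n in class_counts]


def weighted_cross_entropy(probs, label, weights):
    """Weighted cross-entropy for one sample.

    probs is the predicted class distribution, label the true class index.
    """
    return -weights[label] * math.log(probs[label])


if __name__ == "__main__":
    # Hypothetical two-class split: a 'Head' class with 90 images, a 'Tail' class with 10.
    weights = inverse_frequency_weights([90, 10])
    print(weights)
    # The same uncertain prediction costs far more when the true class is the tail class.
    print(weighted_cross_entropy([0.5, 0.5], 0, weights))
    print(weighted_cross_entropy([0.5, 0.5], 1, weights))
```

Up-weighting rare classes in this way makes mistakes on tail species contribute more to the gradient, which is the basic mechanism the more elaborate LTR losses in the study refine.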

[106] Self-Supervised Learning of Synapse Types from EM Images

Aarav Shetty,Gary B Huang

Main category: cs.CV

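TL;DR: Proposes a method for classifying synapses in Drosophila EM images without supervised class labels and without knowing the number of synapse types in advance, based on the assumption that nearby synapses in the same neuron are more similar to each other.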

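Details Motivation: Traditional synapse classification relies on supervised learning and needs examples of each class up front; this work aims to discover synapse classes automatically, avoiding the dependence on prior knowledge and better covering the diversity of synapse structure. Method: Uses the assumption that nearby synapses in the same neuron are morphologically more similar to build a similarity measure from EM image data, then clusters synapses into classes without specifying the number of classes in advance. Result: The method separates synapses into classes on Drosophila data without class labels, identifies distinct synapse types automatically, and offers a principled way to select ground truth that spans the range of synapse structure. Conclusion: The method is a viable way to classify synapses without supervised labels and has potential for discovering new synapse types and supporting connectomics research. Abstract: Separating synapses into different classes based on their appearance in EM images has many applications in biology. Examples may include assigning a neurotransmitter to a particular class, or separating synapses whose strength can be modulated from those whose strength is fixed. Traditionally, this has been done in a supervised manner, giving the classification algorithm examples of the different classes. Here we instead separate synapses into classes based only on the observation that nearby synapses in the same neuron are likely more similar than synapses chosen randomly from different cells. We apply our methodology to data from Drosophila. Our approach has the advantage that the number of synapse types does not need to be known in advance. It may also provide a principled way to select ground-truth that spans the range of synapse structure.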

[107] Foundation Models in Dermatopathology: Skin Tissue Classification

Riya Gupta,Yiwei Zong,Dennis H. Murphree

Main category: cs.CV

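TL;DR: This study evaluates two foundation models, UNI and Virchow2, for classifying dermatopathology whole-slide images (WSIs). Features extracted with Virchow2 combined with logistic regression reach 90% accuracy, showing the potential for automated WSI classification.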

Details Motivation: Dermatopathology whole-slide images (WSIs) are being generated rapidly, so automated methods are needed to process and classify them efficiently and accurately, improving diagnostic throughput and scalability. Method: UNI and Virchow2 are used as feature extractors to obtain patch-level embeddings, which are mean-aggregated into slide-level features; several machine learning classifiers (logistic regression, gradient-boosted trees, random forest) are then trained, with experiments tracked in WandB.ai. Result: Virchow2 outperforms UNI across most classifiers and reaches 90% accuracy with logistic regression, although the difference is not statistically significant; data augmentation and image normalization were explored to improve robustness, and mean aggregation yields reliable slide-level representations. Conclusion: Foundation models, Virchow2 in particular, show strong potential for dermatopathology WSI classification and provide a scalable, effective basis for automated diagnosis and future slide-level representation learning. Abstract: The rapid generation of whole-slide images (WSIs) in dermatopathology necessitates automated methods for efficient processing and accurate classification. This study evaluates the performance of two foundation models, UNI and Virchow2, as feature extractors for classifying WSIs into three diagnostic categories: melanocytic, basaloid, and squamous lesions. Patch-level embeddings were aggregated into slide-level features using a mean-aggregation strategy and subsequently used to train multiple machine learning classifiers, including logistic regression, gradient-boosted trees, and random forest models. Performance was assessed using precision, recall, true positive rate, false positive rate, and the area under the receiver operating characteristic curve (AUROC) on the test set. Results demonstrate that patch-level features extracted using Virchow2 outperformed those extracted via UNI across most slide-level classifiers, with logistic regression achieving the highest accuracy (90%) for Virchow2, though the difference was not statistically significant. The study also explored data augmentation techniques and image normalization to enhance model robustness and generalizability. The mean-aggregation approach provided reliable slide-level feature representations. All experimental results and metrics were tracked and visualized using WandB.ai, facilitating reproducibility and interpretability. This research highlights the potential of foundation models for automated WSI classification, providing a scalable and effective approach for dermatopathological diagnosis while paving the way for future advancements in slide-level representation learning.
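
The mean-aggregation step is simple enough to sketch directly: a slide is represented by the element-wise average of its patch embeddings. This is a minimal illustration with made-up two-dimensional embeddings; real UNI or Virchow2 embeddings have far more dimensions, and the function name is hypothetical.

```python
def mean_aggregate(patch_embeddings):
    """Average a list of equal-length patch embeddings into one slide-level vector."""
    if not patch_embeddings:
        raise ValueError("a slide needs at least one patch embedding")
    dim = len(patch_embeddings[0])
    if any(len(p) != dim for p in patch_embeddings):
        raise ValueError("all patch embeddings must have the same dimension")
    n = len(patch_embeddings)
    return [sum(p[d] for p in patch_embeddings) / n for d in range(dim)]


if __name__ == "__main__":
    # Toy slide with three patches and 2-dimensional embeddings (illustrative values).
    slide = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(mean_aggregate(slide))
```

Each slide-level vector would then be one training row for a downstream classifier such as logistic regression. Mean aggregation discards spatial arrangement and gives every patch equal weight, which keeps the pipeline simple at the cost of ignoring which regions of the slide are diagnostically relevant.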

[108] WorldGrow: Generating Infinite 3D World

Sikuang Li,Chen Yang,Jiemin Fang,Taoran Yi,Jia Lu,Jiazhong Cen,Lingxi Xie,Wei Shen,Qi Tian

Main category: cs.CV

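TL;DR: This paper proposes WorldGrow, a hierarchical framework for unbounded 3D scene generation that uses structured scene block generation to address the limits of existing methods in geometric consistency, scalability, and scene-level generation.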

Details Motivation: Existing 3D generation methods fall short in cross-view consistency, scalability, and scene-level generation, which makes it hard to generate large, continuous, realistic 3D environments. Method: WorldGrow has three core components: a data curation pipeline that extracts high-quality scene blocks, a 3D block inpainting mechanism that supports context-aware extension, and a coarse-to-fine generation strategy that keeps the global layout plausible and local detail faithful. Result: On the 3D-FRONT dataset the method achieves SOTA geometry reconstruction performance and supports infinite scene extension, with photorealistic and structurally consistent outputs. Conclusion: WorldGrow can efficiently generate large-scale, coherent, realistic 3D environments and shows potential for building future world models. Abstract: We tackle the challenge of generating the infinitely extendable 3D world -- large, continuous environments with coherent geometry and realistic appearance. Existing methods face key challenges: 2D-lifting approaches suffer from geometric and appearance inconsistencies across views, 3D implicit representations are hard to scale up, and current 3D foundation models are mostly object-centric, limiting their applicability to scene-level generation. Our key insight is leveraging strong generation priors from pre-trained 3D models for structured scene block generation. To this end, we propose WorldGrow, a hierarchical framework for unbounded 3D scene synthesis. Our method features three core components: (1) a data curation pipeline that extracts high-quality scene blocks for training, making the 3D structured latent representations suitable for scene generation; (2) a 3D block inpainting mechanism that enables context-aware scene extension; and (3) a coarse-to-fine generation strategy that ensures both global layout plausibility and local geometric/textural fidelity. Evaluated on the large-scale 3D-FRONT dataset, WorldGrow achieves SOTA performance in geometry reconstruction, while uniquely supporting infinite scene generation with photorealistic and structurally consistent outputs. These results highlight its capability for constructing large-scale virtual environments and potential for building future world models.

[109] On Thin Ice: Towards Explainable Conservation Monitoring via Attribution and Perturbations

Jiayi Zhou,Günel Aghakishiyeva,Saagar Arya,Julian Dale,James David Poling,Holly R. Houliston,Jamie N. Womble,Gregory D. Larsen,David W. Johnston,Brinnae Bent

Main category: cs.CV

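TL;DR: This study applies several post-hoc explanation methods (HiResCAM, LayerCAM, LIME, and perturbation-based explanations) to a Faster R-CNN model to increase trust in computer vision for ecological monitoring. It detects harbor seals in aerial imagery from Glacier Bay National Park and assesses the explanations on localization fidelity, faithfulness, and diagnostic utility.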

Details Motivation: Ecologists lack trust in "black-box" neural-network models, which limits the use of computer vision in conservation monitoring, so explainability methods are needed to make models more transparent and trustworthy. Method: A Faster R-CNN is trained to detect seals, and explanations are generated with gradient-based class activation mapping (HiResCAM, LayerCAM), LIME, and perturbation-based methods; explanation quality is assessed along three axes: localization fidelity, faithfulness, and diagnostic utility. Result: Explanations concentrate on seal torsos and contours rather than the background, and removing the seals reduces detection confidence, which supports the faithfulness of the explanations; the analysis also shows that the model often confuses black ice and rocks with seals, exposing systematic error patterns. Conclusion: Pairing object detection with post-hoc explainability makes detectors more auditable and useful, moving them toward trustworthy decision-support tools for conservation monitoring. Abstract: Computer vision can accelerate ecological research and conservation monitoring, yet adoption in ecology lags in part because of a lack of trust in black-box neural-network-based models. We seek to address this challenge by applying post-hoc explanations to provide evidence for predictions and document limitations that are important to field deployment. Using aerial imagery from Glacier Bay National Park, we train a Faster R-CNN to detect pinnipeds (harbor seals) and generate explanations via gradient-based class activation mapping (HiResCAM, LayerCAM), local interpretable model-agnostic explanations (LIME), and perturbation-based explanations. We assess explanations along three axes relevant to field use: (i) localization fidelity: whether high-attribution regions coincide with the animal rather than background context; (ii) faithfulness: whether deletion/insertion tests produce changes in detector confidence; and (iii) diagnostic utility: whether explanations reveal systematic failure modes. Explanations concentrate on seal torsos and contours rather than surrounding ice/rock, and removal of the seals reduces detection confidence, providing model-evidence for true positives. The analysis also uncovers recurrent error sources, including confusion between seals and black ice and rocks. We translate these findings into actionable next steps for model development, including more targeted data curation and augmentation. By pairing object detection with post-hoc explainability, we can move beyond "black-box" predictions toward auditable, decision-supporting tools for conservation monitoring.
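
The faithfulness axis rests on deletion tests: mask the pixels an explanation ranks highest and check that the detector's confidence drops. Below is a minimal sketch of that test on a toy grid. The scoring function, the fill value, and the deleted fraction are hypothetical stand-ins; in the paper the score would come from the trained Faster R-CNN, whose interface the abstract does not describe.

```python
def deletion_drop(image, attribution, score_fn, fraction=0.2, fill=0.0):
    """Confidence drop after deleting the most highly attributed pixels.

    image and attribution are 2D lists of the same shape; score_fn maps an
    image to a scalar confidence. A faithful attribution map should produce
    a larger drop than an unfaithful one.
    """
    rows, cols = len(image), len(image[0])
    # Rank pixel positions from highest to lowest attribution.
    ranked = sorted(
        ((attribution[r][c], r, c) for r in range(rows) for c in range(cols)),
        reverse=True,
    )
    k = max(1, int(len(ranked) * fraction))
    perturbed = [row[:] for row in image]  # copy so the input is left untouched
    for _, r, c in ranked[:k]:
        perturbed[r][c] = fill
    return score_fn(image) - score_fn(perturbed)


def toy_score(image):
    """Hypothetical detector confidence: total brightness of the image."""
    return sum(sum(row) for row in image)


if __name__ == "__main__":
    image = [[1.0, 0.0], [0.0, 0.0]]          # the "object" is the top-left pixel
    good_attr = [[1.0, 0.0], [0.0, 0.0]]      # attribution on the object
    bad_attr = [[0.0, 1.0], [0.0, 0.0]]       # attribution on the background
    print(deletion_drop(image, good_attr, toy_score, fraction=0.25))
    print(deletion_drop(image, bad_attr, toy_score, fraction=0.25))
```

An insertion test is the mirror image: start from a blank image, add back the top-ranked pixels, and check that confidence rises.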

[110] BachVid: Training-Free Video Generation with Consistent Background and Character

Han Yan,Xibin Song,Yifu Wang,Hongdong Li,Pan Ji,Chao Ma

Main category: cs.CV

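TL;DR: BachVid is a training-free text-to-video generation method that needs no reference images. It caches and reuses intermediate variables from a Diffusion Transformer to keep characters and backgrounds consistent across multiple videos.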

Details Motivation: Existing methods for generating multiple videos with consistent characters and backgrounds rely on reference images or extensive training, and usually address only character consistency, without handling background consistency well. Method: A systematic analysis of DiT's attention mechanism and intermediate features shows that it can extract foreground masks and identify matching points during denoising; building on this, the method first generates an identity video and caches its intermediate variables, then injects them into the corresponding positions of newly generated videos to keep foreground and background consistent. Result: Experiments show that BachVid achieves consistent characters and backgrounds across videos without additional training and without reference images. Conclusion: BachVid is the first training-free method that achieves foreground- and background-consistent multi-video generation without reference images, offering an efficient new approach to consistent video generation. Abstract: Diffusion Transformers (DiTs) have recently driven significant progress in text-to-video (T2V) generation. However, generating multiple videos with consistent characters and backgrounds remains a significant challenge. Existing methods typically rely on reference images or extensive training, and often only address character consistency, leaving background consistency to image-to-video models. We introduce BachVid, the first training-free method that achieves consistent video generation without needing any reference images. Our approach is based on a systematic analysis of DiT's attention mechanism and intermediate features, revealing its ability to extract foreground masks and identify matching points during the denoising process. Our method leverages this finding by first generating an identity video and caching the intermediate variables, and then inject these cached variables into corresponding positions in newly generated videos, ensuring both foreground and background consistency across multiple videos. Experimental results demonstrate that BachVid achieves robust consistency in generated videos without requiring additional training, offering a novel and efficient solution for consistent video generation without relying on reference images or additional training.

[111] Visual Diffusion Models are Geometric Solvers

Nir Goren,Shai Yehezkel,Omer Dahary,Andrey Voynov,Or Patashnik,Daniel Cohen-Or

Main category: cs.CV

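TL;DR: This paper presents a new approach that uses visual diffusion models to solve geometric problems directly in pixel space. It is applied to the Inscribed Square Problem, the Steiner Tree Problem, and the Simple Polygon Problem, showing a new link between generative modeling and geometric problem solving.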

Details Motivation: To explore whether visual diffusion models can reason about geometry directly in image space, without specialized architectures, and so offer a new route to classic hard geometric problems. Method: Each problem instance is treated as an image, and a standard visual diffusion model is trained to turn Gaussian noise into an image of an approximately correct solution, recasting geometric reasoning as image generation. Result: The model generates geometric structures that closely match the exact solutions and shows strong learning and reasoning ability on several classic hard geometric problems. Conclusion: Visual diffusion models can serve as general geometric solvers; operating in image space provides a simple and practical framework that may extend to a wider class of geometric problems. Abstract: In this paper we show that visual diffusion models can serve as effective geometric solvers: they can directly reason about geometric problems by working in pixel space. We first demonstrate this on the Inscribed Square Problem, a long-standing problem in geometry that asks whether every Jordan curve contains four points forming a square. We then extend the approach to two other well-known hard geometric problems: the Steiner Tree Problem and the Simple Polygon Problem. Our method treats each problem instance as an image and trains a standard visual diffusion model that transforms Gaussian noise into an image representing a valid approximate solution that closely matches the exact one. The model learns to transform noisy geometric structures into correct configurations, effectively recasting geometric reasoning as image generation. Unlike prior work that necessitates specialized architectures and domain-specific adaptations when applying diffusion to parametric geometric representations, we employ a standard visual diffusion model that operates on the visual representation of the problem. This simplicity highlights a surprising bridge between generative modeling and geometric problem solving. Beyond the specific problems studied here, our results point toward a broader paradigm: operating in image space provides a general and practical framework for approximating notoriously hard problems, and opens the door to tackling a far wider class of challenging geometric tasks.

[112] Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent

Christy Li,Josep Lopez Camuñas,Jake Thomas Touchet,Jacob Andreas,Agata Lapedriza,Antonio Torralba,Tamar Rott Shaham

Main category: cs.CV

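TL;DR: Proposes an automated framework, built around a self-reflective agent, for detecting which specific visual attributes a vision model relies on in image recognition.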

Details Motivation: To ensure model robustness, prevent overfitting, and avoid spurious correlations, it is necessary to detect whether a vision model relies on unintended visual features. Method: A self-reflective agent iteratively generates and tests hypotheses about the visual attributes a model may rely on, refining them based on experimental outcomes and a self-evaluation protocol. Result: On a new benchmark of 130 models, self-reflection yields a significant improvement over non-reflective baselines, and the agent identifies real-world visual attribute dependencies in CLIP's vision encoder and the YOLOv8 object detector. Conclusion: A self-reflective agent can effectively reveal the implicit dependencies of vision models, improving understanding and interpretability of model behavior. Abstract: When a vision model performs image recognition, which visual attributes drive its predictions? Detecting unintended reliance on specific visual features is critical for ensuring model robustness, preventing overfitting, and avoiding spurious correlations. We introduce an automated framework for detecting such dependencies in trained vision models. At the core of our method is a self-reflective agent that systematically generates and tests hypotheses about visual attributes that a model may rely on. This process is iterative: the agent refines its hypotheses based on experimental outcomes and uses a self-evaluation protocol to assess whether its findings accurately explain model behavior. When inconsistencies arise, the agent self-reflects over its findings and triggers a new cycle of experimentation. We evaluate our approach on a novel benchmark of 130 models designed to exhibit diverse visual attribute dependencies across 18 categories. Our results show that the agent's performance consistently improves with self-reflection, with a significant performance increase over non-reflective baselines. We further demonstrate that the agent identifies real-world visual attribute dependencies in state-of-the-art models, including CLIP's vision encoder and the YOLOv8 object detector.