cs.CL

[1] Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context

Keivan Alizadeh,Parshin Shojaee,Minsik Cho,Mehrdad Farajtabar

Main category: cs.CL

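TL;DR: This paper proposes SRLM, a framework that adds uncertainty-aware self-reflection (drawing on three intrinsic signals: self-consistency, reasoning length, and verbalized confidence) to guide the selection of context-interaction programs in long-context processing, substantially improving performance while reducing reliance on explicit recursion.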

Details

Motivation: Existing Recursive Language Models (RLM) can decompose long contexts, but their performance depends heavily on how context-interaction programs are selected, and that selection strategy has not been studied systematically.

Method: SRLM adds an uncertainty-aware self-reflection mechanism to programmatic context interaction, using three intrinsic signals (self-consistency, reasoning length, and verbalized confidence) to evaluate and compare candidate programs.

Result: Across multiple benchmarks, context lengths, and backbone models, SRLM consistently outperforms state-of-the-art baselines, with up to 22% improvement over RLM. It matches or surpasses RLM without explicit recursion or self-query, and it is robust on short contexts, long contexts, and semantically intensive tasks.

Conclusion: Recursion itself is not the main driver of RLM's gains; self-reflective program search matters more. Through semantically informed uncertainty modeling, SRLM achieves more robust long-context reasoning that generalizes better.

Abstract: Long-context handling remains a core challenge for language models: even with extended context windows, models often fail to reliably extract, reason over, and use the information across long contexts. Recent works like Recursive Language Models (RLM) have approached this challenge by decomposing long contexts into recursive sub-calls in an agentic way, through programmatic interaction at inference time. While promising, the success of RLM critically depends on how these context-interaction programs are selected, which has remained largely unexplored. In this paper, we study this problem and introduce SRLM, a framework that augments programmatic context interaction with uncertainty-aware Self-Reflection. SRLM leverages three intrinsic signals: self-consistency, reasoning length, and verbalized confidence. These serve as complementary indicators of a model's internal uncertainty, and the model uses them to evaluate and compare candidate context-interaction programs. Extensive experiments across diverse benchmark datasets, context lengths, and backbone models show that SRLM consistently outperforms state-of-the-art baselines, yielding up to 22% improvement over RLM under the same time budget. Our findings show that recursion itself is not the primary driver of performance in RLM, and a simple self-reflective program search can match or surpass RLM without requiring self-query or explicit recursion mechanisms. We find that for context lengths within the model's window, RLMs with recursion often degrade performance relative to the base model, whereas SRLM yields consistent gains across both short and long contexts. We also find that RLM is less effective on semantically intensive tasks, where heuristic program search is insufficient and broader contextual understanding is required, while self-reflection in SRLM provides a semantic signal that better steers reasoning in these scenarios.
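To make the idea of comparing candidate programs by intrinsic signals concrete, here is a minimal sketch. It is not the authors' method: the abstract does not say how the three signals are combined, so the equal-weight average, the assumption that shorter reasoning indicates lower uncertainty, the `max_tokens` cap, and all names below are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate context-interaction program and its observed signals."""
    name: str
    answers: list[str]      # answers from repeated runs of the same program
    reasoning_tokens: int   # length of the reasoning trace
    verbalized_conf: float  # confidence stated by the model, in [0, 1]


def self_consistency(answers: list[str]) -> float:
    """Fraction of sampled answers that agree with the most common answer."""
    if not answers:
        return 0.0
    return Counter(answers).most_common(1)[0][1] / len(answers)


def confidence_score(c: Candidate, max_tokens: int = 4096) -> float:
    """Hypothetical equal-weight combination of the three signals.

    Assumption: a shorter reasoning trace is treated as a sign of lower
    uncertainty. The abstract does not specify the direction or the weights.
    """
    length_term = 1.0 - min(c.reasoning_tokens, max_tokens) / max_tokens
    return (self_consistency(c.answers) + length_term + c.verbalized_conf) / 3.0


def select_program(candidates: list[Candidate]) -> Candidate:
    """Pick the candidate program with the highest combined score."""
    return max(candidates, key=confidence_score)
```

In this sketch a program whose repeated runs agree with each other can beat one that merely reports high confidence, which is one reason for treating the signals as complementary.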

[2] MedArena: Comparing LLMs for Medicine-in-the-Wild Clinician Preferences

Eric Wu,Kevin Wu,Jason Hom,Paul H. Yi,Angela Zhang,Alejandro Lozano,Jeff Nirschl,Jeff Tangney,Kevin Byram,Braydon Dymm,Narender Annapureddy,Eric Topol,David Ouyang,James Zou

Main category: cs.CL

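TL;DR: This paper presents MedArena, an interactive LLM evaluation platform for clinicians. Clinicians choose directly between responses from different models, giving a more realistic measure of how useful medical LLMs are in actual clinical settings.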

Details

Motivation: Current evaluation of medical LLMs relies on static, templated benchmarks that do not reflect the complexity and dynamics of real clinical practice, so benchmark results diverge from clinical utility.

Method: The authors build MedArena, an interactive evaluation platform that collects clinicians' preferences between two model responses to their own real questions. Models are ranked with a Bradley-Terry model, and the authors analyze the stated reasons for preferences and the effect of controlling for style variables such as length and formatting.

Result: Across 1571 preferences, Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o rank in the top three. Only one-third of questions are factual recall; most concern higher-order tasks such as treatment selection, clinical documentation, and patient communication. Clinicians value depth, detail, and clarity of presentation over raw factual accuracy. Rankings remain stable after controlling for response length and formatting.

Conclusion: By grounding evaluation in real clinical questions and clinician preferences, MedArena provides a scalable and more clinically relevant framework for assessing the utility and efficacy of medical LLMs.

Abstract: Large language models (LLMs) are increasingly central to clinician workflows, spanning clinical decision support, medical education, and patient communication. However, current evaluation methods for medical LLMs rely heavily on static, templated benchmarks that fail to capture the complexity and dynamics of real-world clinical practice, creating a dissonance between benchmark performance and clinical utility. To address these limitations, we present MedArena, an interactive evaluation platform that enables clinicians to directly test and compare leading LLMs using their own medical queries. Given a clinician-provided query, MedArena presents responses from two randomly selected models and asks the user to select the preferred response. Out of 1571 preferences collected across 12 LLMs up to November 1, 2025, Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o were the top three models by Bradley-Terry rating. Only one-third of clinician-submitted questions resembled factual recall tasks (e.g., MedQA), whereas the majority addressed topics such as treatment selection, clinical documentation, or patient communication, with ~20% involving multi-turn conversations. Additionally, clinicians cited depth and detail and clarity of presentation more often than raw factual accuracy when explaining their preferences, highlighting the importance of readability and clinical nuance. We also confirm that the model rankings remain stable even after controlling for style-related factors like response length and formatting. By grounding evaluation in real-world clinical questions and preferences, MedArena offers a scalable platform for measuring and improving the utility and efficacy of medical LLMs.
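For readers unfamiliar with Bradley-Terry ratings, the sketch below fits strengths from pairwise preferences using the standard minorization-maximization update. It is an illustration only, not MedArena's code: the function name, the `(winner, loser)` input format, and the fixed iteration count are assumptions, and it assumes every model wins at least one comparison.

```python
from collections import defaultdict


def fit_bradley_terry(preferences, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict mapping each model to a strength; strengths sum to 1.
    Assumes every model has at least one win, otherwise its strength
    collapses to zero.
    """
    models = sorted({m for pair in preferences for m in pair})
    wins = defaultdict(int)          # total wins per model
    pair_counts = defaultdict(int)   # number of comparisons per unordered pair
    for winner, loser in preferences:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                n = pair_counts.get(frozenset((m, other)), 0)
                if n:
                    denom += n / (strength[m] + strength[other])
            # MM update: wins divided by the expected-comparison term.
            updated[m] = wins[m] / denom if denom else strength[m]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength
```

Under the fitted model, the probability that model i is preferred over model j is strength[i] / (strength[i] + strength[j]), so sorting by strength yields the leaderboard order.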

[3] MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification

MiroMind Team,S. Bai,L. Bing,L. Lei,R. Li,X. Li,X. Lin,E. Min,L. Su,B. Wang,L. Wang,L. Wang,S. Wang,X. Wang,Y. Zhang,Z. Zhang,G. Chen,L. Chen,Z. Cheng,Y. Deng,Z. Huang,D. Ng,J. Ni,Q. Ren,X. Tang,B. L. Wang,H. Wang,N. Wang,C. Wei,Q. Wu,J. Xia,Y. Xiao,H. Xu,X. Xu,C. Xue,Z. Yang,Z. Yang,F. Ye,H. Ye,J. Yu,C. Zhang,W. Zhang,H. Zhao,P. Zhu

Main category: cs.CL

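TL;DR: This paper presents two research agents, MiroThinker-1.7 and MiroThinker-H1. The first relies on structured planning, contextual reasoning, and tool interaction; the second adds local and global verification. Together they improve reliability and performance on complex long-horizon reasoning tasks and reach state-of-the-art results in several domains, and efficient lightweight versions are released as open source.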

Details

Motivation: Complex long-horizon reasoning tasks suffer from limited reliability and error-prone multi-step reasoning. The goal is to improve research agents on deep tasks such as open-web research, scientific reasoning, and financial analysis.

Method: MiroThinker-1.7 introduces an agentic mid-training stage that strengthens structured planning, contextual reasoning, and tool interaction. MiroThinker-H1 additionally embeds two levels of verification into the reasoning process, local (intermediate decisions) and global (the overall reasoning chain), so that reasoning can be evaluated and revised during inference.

Result: On benchmarks covering open-web research, scientific reasoning, and financial analysis, MiroThinker-H1 reaches state-of-the-art performance, while MiroThinker-1.7 and the lightweight MiroThinker-1.7-mini offer competitive research-agent capabilities at higher efficiency.

Conclusion: Structured training and built-in verification markedly improve the long-horizon reasoning reliability and generalization of research agents, and the open-source models give the community an efficient, practical new baseline.

Abstract: We present MiroThinker-1.7, a new research agent designed for complex long-horizon reasoning tasks. Building on this foundation, we further introduce MiroThinker-H1, which extends the agent with heavy-duty reasoning capabilities for more reliable multi-step problem solving. In particular, MiroThinker-1.7 improves the reliability of each interaction step through an agentic mid-training stage that emphasizes structured planning, contextual reasoning, and tool interaction. This enables more effective multi-step interaction and sustained reasoning across complex tasks. MiroThinker-H1 further incorporates verification directly into the reasoning process at both local and global levels. Intermediate reasoning decisions can be evaluated and refined during inference, while the overall reasoning trajectory is audited to ensure that final answers are supported by coherent chains of evidence. Across benchmarks covering open-web research, scientific reasoning, and financial analysis, MiroThinker-H1 achieves state-of-the-art performance on deep research tasks while maintaining strong results on specialized domains. We also release MiroThinker-1.7 and MiroThinker-1.7-mini as open-source models, providing competitive research-agent capabilities with significantly improved efficiency.

[4] Morphemes Without Borders: Evaluating Root-Pattern Morphology in Arabic Tokenizers and LLMs

Yara Alakeel,Chatrine Qwaider,Hanan Aldarmaki,Sawsan Alqahtani

Main category: cs.CL

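TL;DR: This paper studies how well large language models (LLMs) and their tokenization schemes represent and generate Arabic root-pattern morphology. It finds that morphological alignment of the tokenizer is neither necessary nor sufficient, challenging the conventional view of the role morphological tokenization plays in downstream tasks.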

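Details

Motivation: Arabic has a complex, non-concatenative root-pattern morphology, which makes it an ideal test case for whether LLMs capture morphological structure or only rely on surface memorization. The paper also examines how tokenization choices affect this.

Method: The authors first evaluate the accuracy of several Arabic and multilingual tokenizers against gold-standard morphological segmentation. They then build a new test set and evaluate seven Arabic-centric and multilingual LLMs on root-pattern generation.

Result: A tokenizer's morphological alignment has no necessary connection to the LLM's performance on morphological generation: highly aligned tokenizers do not guarantee better generation, and poorly aligned ones can still perform well.

Conclusion: Morphology-aware tokenization is not the key factor in an LLM's Arabic morphological generation ability. LLMs may achieve morphological generalization through other mechanisms, such as contextual modeling, which calls for rethinking the relationship between tokenizer design and linguistic competence.

Abstract: This work investigates how effectively large language models (LLMs) and their tokenization schemes represent and generate Arabic root-pattern morphology, probing whether they capture genuine morphological structure or rely on surface memorization. The Arabic morphological system provides a rich testbed for analyzing how LLMs handle complex, non-concatenative forms and how tokenization choices influence this process. Our study begins with an evaluation of morphological fidelity across Arabic and multilingual tokenizers against gold-standard segmentation, followed by an analysis of LLM performance in productive root-pattern generation using a newly developed test set. Our findings across seven Arabic-centric and multilingual LLMs and their respective tokenizers reveal that tokenizer morphological alignment is neither necessary nor sufficient for morphological generation, which questions the role of morphological tokenization in downstream performance.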

[5] COGNAC at SemEval-2026 Task 5: LLM Ensembles for Human-Level Word Sense Plausibility Rating in Challenging Narratives

Azwad Anjum Islam,Tisa Islam Erana

Main category: cs.CL

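TL;DR: This paper proposes an ensemble of multiple large language models (LLMs) and three prompting strategies (zero-shot, chain-of-thought, and comparative prompting) for SemEval-2026 Task 5, which asks systems to rate the plausibility of homonym word senses in short stories on a 5-point Likert scale. Averaging model predictions mitigates inter-annotator variation. The system placed 4th in the official evaluation, and post-competition experiments improved performance further.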

Details

Motivation: Judging the plausibility of a homonym's sense in context is a highly subjective semantic evaluation task with large inter-annotator variation, and systems must cope with the inherent variability of human annotations.

Method: Three prompting strategies (zero-shot, chain-of-thought with structured reasoning, and comparative prompting that evaluates candidate senses simultaneously) drive multiple closed-source commercial LLMs. The outputs are combined in an ensemble across strategies and models and averaged to align with mean human judgments.

Result: The official submission achieved 0.88 accuracy and 0.83 Spearman rank correlation (0.86 average), placing 4th. Post-competition experiments reached 0.92 accuracy and 0.85 rho (0.89 average). Both comparative prompting and model ensembling were shown to improve performance.

Conclusion: Comparative prompting gives consistent gains, and LLM ensembling significantly improves agreement with mean human judgments, suggesting that LLM ensembles are especially well suited to subjective semantic evaluation tasks with multiple annotators.

Abstract: We describe our system for SemEval-2026 Task 5, which requires rating the plausibility of given word senses of homonyms in short stories on a 5-point Likert scale. Systems are evaluated by the unweighted average of accuracy (within one standard deviation of mean human judgments) and Spearman Rank Correlation. We explore three prompting strategies using multiple closed-source commercial LLMs: (i) a baseline zero-shot setup, (ii) Chain-of-Thought (CoT) style prompting with structured reasoning, and (iii) a comparative prompting strategy for evaluating candidate word senses simultaneously. Furthermore, to account for the substantial inter-annotator variation present in the gold labels, we propose an ensemble setup by averaging model predictions. Our best official system, comprising an ensemble of LLMs across all three prompting strategies, placed 4th on the competition leaderboard with 0.88 accuracy and 0.83 Spearman's rho (0.86 average). Post-competition experiments with additional models further improved this performance to 0.92 accuracy and 0.85 Spearman's rho (0.89 average). We find that comparative prompting consistently improved performance across model families, and model ensembling significantly enhanced alignment with mean human judgments, suggesting that LLM ensembles are especially well suited for subjective semantic evaluation tasks involving multiple annotators.
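The evaluation described above (accuracy within one standard deviation of the mean human rating, averaged with Spearman's rho) and the prediction-averaging ensemble can be sketched as follows. This is an illustrative reading of the abstract rather than the official scorer: the exact tolerance rule, the tie handling, and all function names are assumptions.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def ensemble_predictions(per_model_ratings):
    """Average the ratings of several models item by item."""
    return [mean(column) for column in zip(*per_model_ratings)]


def task_score(predictions, human_mean, human_std):
    """Return (accuracy, rho, unweighted average of the two)."""
    hits = sum(
        abs(p - m) <= s for p, m, s in zip(predictions, human_mean, human_std)
    )
    accuracy = hits / len(predictions)
    rho = spearman_rho(predictions, human_mean)
    return accuracy, rho, (accuracy + rho) / 2
```

Averaging tends to pull predictions toward the middle of the scale for ambiguous items, which is the same direction in which a mean over disagreeing annotators moves.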

[6] Agent-based imitation dynamics can yield efficiently compressed population-level vocabularies

Nathaniel Imel,Richard Futrell,Michael Franke,Noga Zaslavsky

Main category: cs.CL

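TL;DR: This paper unifies evolutionary game theory with the Information Bottleneck (IB) framework and shows that population dynamics driven by imprecise strategy imitation in signaling games can spontaneously produce near-IB-optimal vocabulary compression, revealing a possible mechanism behind the evolution of efficient vocabularies.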

Details

Motivation: The social dynamics that could drive vocabularies toward efficient compression in the Information Bottleneck (IB) sense are unknown. Evolutionary game theory has been used to explain the emergence of language, but whether it leads to IB efficiency has not been tested.

Method: The authors build a unified model combining evolutionary game theory with the IB framework, introduce a dynamic of imprecise strategy imitation in signaling games, and analyze how key parameters (such as imitation precision and the tendency to confuse similar states) affect the vocabulary's compression tradeoff.

Result: Imprecise imitation can lead a population to evolve near-IB-optimal vocabulary compression, and the key parameters constrain the range of complexity-accuracy tradeoffs that emergent vocabularies achieve.

Conclusion: Evolutionary game dynamics could provide a mechanistic basis for the evolution of vocabularies that are information-theoretically optimal and match empirically attested properties.

Abstract: Natural languages have been argued to evolve under pressure to efficiently compress meanings into words by optimizing the Information Bottleneck (IB) complexity-accuracy tradeoff. However, the underlying social dynamics that could drive the optimization of a language's vocabulary towards efficiency remain largely unknown. In parallel, evolutionary game theory has been invoked to explain the emergence of language from rudimentary agent-level dynamics, but it has not yet been tested whether such an approach can lead to efficient compression in the IB sense. Here, we provide a unified model integrating evolutionary game theory with the IB framework and show how near-optimal compression can arise in a population through an independently motivated dynamic of imprecise strategy imitation in signaling games. We find that key parameters of the model, namely those that regulate precision in these games as well as players' tendency to confuse similar states, lead to constrained variation of the tradeoffs achieved by emergent vocabularies. Our results suggest that evolutionary game dynamics could potentially provide a mechanistic basis for the evolution of vocabularies with information-theoretically optimal and empirically attested properties.

[7] CTG-DB: An Ontology-Based Transformation of ClinicalTrials.gov to Enable Cross-Trial Drug Safety Analyses

Jeffery L. Painter,François Haguinet,Andrew Bate

Main category: cs.CL

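TL;DR: This paper presents CTG-DB, an open-source pipeline that converts ClinicalTrials.gov XML data into a relational database aligned to standardized MedDRA terminology. It preserves arm-level denominators, represents placebo and comparator arms, and provides reproducible adverse event (AE) term mapping, strengthening systematic safety analysis in pharmacovigilance.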

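Details

Motivation: ClinicalTrials.gov is the largest clinical trial registry, but its registry-oriented architecture and heterogeneous adverse event (AE) terminology limit systematic pharmacovigilance (PV) analysis. AEs are mostly reported as investigator free text without standardized identifiers, so manual reconciliation is needed to form coherent safety concepts.

Method: The authors build the open-source ClinicalTrials.gov Transformation Database (CTG-DB) pipeline, which ingests the complete CT.gov XML archive, maps AE text to standard MedDRA terms through deterministic exact and fuzzy matching, and builds a relational database. It preserves arm-level denominators and explicitly distinguishes placebo and comparator arms.

Result: The pipeline delivers standardized, structured, and traceable mapping of the full CT.gov AE data. It supports concept-level retrieval, cross-trial aggregation, and placebo-referenced safety assessment, and it supplies clinical trial evidence that can be integrated into downstream PV signal detection.

Conclusion: CTG-DB provides key infrastructure for scalable, standardized, and reproducible pharmacovigilance analysis on registry data, helping turn clinical trial safety data from text records into a computable resource.

Abstract: ClinicalTrials.gov (CT.gov) is the largest publicly accessible registry of clinical studies, yet its registry-oriented architecture and heterogeneous adverse event (AE) terminology limit systematic pharmacovigilance (PV) analytics. AEs are typically recorded as investigator-reported text rather than standardized identifiers, requiring manual reconciliation to identify coherent safety concepts. We present the ClinicalTrials.gov Transformation Database (CTG-DB), an open-source pipeline that ingests the complete CT.gov XML archive and produces a relational database aligned to standardized AE terminology using the Medical Dictionary for Regulatory Activities (MedDRA). CTG-DB preserves arm-level denominators, represents placebo and comparator arms, and normalizes AE terminology using deterministic exact and fuzzy matching to ensure transparent and reproducible mappings. This framework enables concept-level retrieval and cross-trial aggregation for scalable placebo-referenced safety analyses and integration of clinical trial evidence into downstream PV signal detection.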
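A deterministic exact-then-fuzzy term mapping of the kind the abstract describes might look like the sketch below. It is not CTG-DB's implementation: the vocabulary and its codes are placeholders (not MedDRA content), the normalization rule and the 0.8 similarity cutoff are assumptions, and Python's standard `difflib` stands in for whatever matcher the pipeline actually uses.

```python
import difflib

# Placeholder vocabulary with made-up codes; real MedDRA terms are licensed
# and are not reproduced here.
TOY_TERMS = {"headache": "T001", "nausea": "T002", "dizziness": "T003"}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants match exactly."""
    return " ".join(text.lower().split())


def map_term(raw: str, vocabulary=TOY_TERMS, cutoff: float = 0.8):
    """Map free-text AE wording to (code, matched_term, method).

    Exact matching is tried first; fuzzy matching is the fallback. Sorting
    the vocabulary keeps the result deterministic across runs.
    """
    key = normalize(raw)
    if key in vocabulary:
        return vocabulary[key], key, "exact"
    close = difflib.get_close_matches(key, sorted(vocabulary), n=1, cutoff=cutoff)
    if close:
        return vocabulary[close[0]], close[0], "fuzzy"
    return None, None, "unmapped"
```

Returning the match method alongside the code keeps each mapping auditable, so fuzzy matches can be reviewed separately from exact ones.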

[8] BANGLASOCIALBENCH: A Benchmark for Evaluating Sociopragmatic and Cultural Alignment of LLMs in Bangladeshi Social Interaction

Tanvir Ahmed Sijan,S. M Golam Rifat,Pankaj Chowdhury Partha,Md. Tanjeed Islam,Md. Musfique Anwar

Main category: cs.CL

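TL;DR: This paper introduces BANGLASOCIALBENCH, the first benchmark for evaluating the sociopragmatic competence of large language models in Bangla. It covers three domains (address terms, kinship reasoning, and social customs) and finds systematic deficiencies in the cultural alignment of existing models.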

Details

Motivation: Large language models are fluent in many languages, but in high-context languages such as Bangla they lack sensitivity to social hierarchy, relational roles, and interactional norms, and so struggle to use language appropriately.

Method: The authors build BANGLASOCIALBENCH, the first benchmark for sociopragmatic competence in Bangla, with 1,719 instances written and verified by native speakers across three task types: address terms, kinship reasoning, and social customs. They evaluate 12 mainstream LLMs in a zero-shot setting.

Result: Models show systematic cultural misalignment: they default to overly formal address forms, fail to recognize multiple acceptable pronouns, and confuse kinship terms across religious contexts. The errors are structured rather than random.

Conclusion: Current LLMs have fundamental limitations in understanding and applying appropriate language in Bangladeshi social contexts, and need culturally aware mechanisms to improve their sociopragmatic competence.

Abstract: Large Language Models have demonstrated strong multilingual fluency, yet fluency alone does not guarantee socially appropriate language use. In high-context languages, communicative competence requires sensitivity to social hierarchy, relational roles, and interactional norms that are encoded directly in everyday language. Bangla exemplifies this challenge through its three-tiered pronominal system, kinship-based addressing, and culturally embedded social customs. We introduce BANGLASOCIALBENCH, the first benchmark designed to evaluate sociopragmatic competence in Bangla through context-dependent language use rather than factual recall. The benchmark spans three domains: Bangla Address Terms, Kinship Reasoning, and Social Customs, and consists of 1,719 culturally grounded instances written and verified by native Bangla speakers. We evaluate twelve contemporary LLMs in a zero-shot setting and observe systematic patterns of cultural misalignment. Models frequently default to overly formal address forms, fail to recognize multiple socially acceptable address pronouns, and conflate kinship terminology across religious contexts. Our findings show that sociopragmatic failures are often structured and non-random, revealing persistent limitations in how current LLMs infer and apply culturally appropriate language use in realistic Bangladeshi social interactions.

[9] POLAR: A Per-User Association Test in Embedding Space

Pedro Bento,Arthur Buzelin,Arthur Chagas,Yan Aquino,Victoria Estanislau,Samira Malaquias,Pedro Robles Dutenhefner,Gisele L. Pappa,Virgilio Almeida,Wagner Meira Jr

Main category: cs.CL

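TL;DR: This paper proposes POLAR, an author-level lexical association test in embedding space. Authors are represented by private deterministic tokens in a lightly adapted masked language model and projected onto predefined lexical axes, with standardized effect sizes and significance tests. The method separates bot accounts from human accounts and quantifies extremist alignment and its drift over time.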

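Details

Motivation: Existing intrinsic association probes mostly operate at the word, sentence, or corpus level. They cannot capture differences between individual authors, and computational social science lacks fine-grained, interpretable author-level tools.

Method: POLAR (Per-user On-axis Lexical Association Report) represents each author with a private deterministic token, obtains embeddings from a lightly adapted masked language model, projects author vectors onto curated lexical axes (for example bias or extremism), computes p-values with permutation tests, and controls for multiple comparisons with the Benjamini-Hochberg procedure. New attribute sets can be added modularly.

Result: On a balanced bot-human Twitter dataset, POLAR cleanly separates LLM-driven accounts from real users. On extremist forum data, it quantifies users' strong alignment with slur lexicons and detects rightward drift over time.

Conclusion: POLAR is an interpretable, extensible, and statistically rigorous author-level lexical association method that gives computational social science a concise and reliable per-author diagnostic. All code is open source.

Abstract: Most intrinsic association probes operate at the word, sentence, or corpus level, obscuring author-level variation. We present POLAR (Per-user On-axis Lexical Association Report), a per-user lexical association test that runs in the embedding space of a lightly adapted masked language model. Authors are represented by private deterministic tokens; POLAR projects these vectors onto curated lexical axes and reports standardized effects with permutation p-values and Benjamini-Hochberg control. On a balanced bot-human Twitter benchmark, POLAR cleanly separates LLM-driven bots from organic accounts; on an extremist forum, it quantifies strong alignment with slur lexicons and reveals rightward drift over time. The method is modular to new attribute sets and provides concise, per-author diagnostics for computational social science. All code is publicly available at https://github.com/pedroaugtb/POLAR-A-Per-User-Association-Test-in-Embedding-Space.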
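The statistical side of such a test (an axis effect, a permutation p-value, and Benjamini-Hochberg control) can be illustrated in a few lines. This is a generic sketch and not the code in the POLAR repository: the effect definition (difference of mean cosine similarities to two pole word sets), the two-sided test, and every name below are assumptions, and the effect is left unstandardized for brevity.

```python
import math
import random
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def axis_effect(author, pole_a, pole_b):
    """Mean similarity to pole A words minus mean similarity to pole B words."""
    return mean(cosine(author, w) for w in pole_a) - mean(cosine(author, w) for w in pole_b)


def permutation_p_value(author, pole_a, pole_b, n_perm=2000, seed=0):
    """Two-sided p-value from randomly reassigning words to the two poles."""
    rng = random.Random(seed)
    observed = abs(axis_effect(author, pole_a, pole_b))
    pooled = list(pole_a) + list(pole_b)
    k = len(pole_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(axis_effect(author, pooled[:k], pooled[k:])) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_perm + 1)


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected
```

With one test per author and axis, the Benjamini-Hochberg step bounds the expected share of false discoveries among the flagged authors, which matters once thousands of accounts are screened.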

[10] A Family of LLMs Liberated from Static Vocabularies

Aleph Alpha,Adnen Abdessaied,Artur Baranowski,Lukas Balles,Michael Barlow,Fabien C. Y. Benureau,Felix Berkenkamp,Lukas Bluebaum,Bastian Boll,Thomas F. Burns,Björn Deiseroth,Constantin Eichenberg,David Friede,Pablo Iyu Guerrero,Ahmed Hammam,Bastian Harren,Johann Higl,Yasser Jadidi,Carina Kauf,Johannes Messner,Jan Hendrik Metzen,Max Meuer,Vedant Nanda,Pit Neitemeier,Koen Oostermeijer,Letitia Parcalabescu,Markus Pernpointner,Felix Reinfurt,Dylan Rodriquez,Grégory Schott,Philipp Siedler,Martin Simonovsky,Till Speicher,Volker Stampa,Stephan Wäldchen,Samuel Weinbach,Gregor Ziegltrum

Main category: cs.CL

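TL;DR: This paper presents byte-level language models based on the hierarchical autoregressive transformer (HAT) architecture. By decoupling an encoder (bytes to word embeddings), a pre-trained backbone (operating on word embeddings), and a decoder (word embeddings back to bytes), the models remove the dependence on conventional tokenization, improve cross-language and cross-domain adaptability, text compression, and spelling robustness, and improve on Llama 3.1 in English and German.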

Details

Motivation: Existing LLMs depend on tokenizers with fixed vocabularies, which are large and hard to adapt to new languages or domains. A more flexible and robust way to represent and model text is needed.

Method: In the HAT architecture, an encoder aggregates bytes into word embeddings, a pre-trained backbone (for example Llama 3.1 with its embedding layer and output head removed) processes the word embeddings, and a decoder uses cross-attention to convert backbone outputs back into bytes. The authors reuse the Llama 3.1 8B and 70B backbones to build TFree-HAT models and also train a 7B HAT model from scratch.

Result: HAT models improve text compression by reducing sequence length and are more robust to intra-word variation such as spelling errors. After pre-training, supervised fine-tuning, and direct preference optimization in English and German, they improve on the original Llama 3.1 in most benchmarks.

Conclusion: HAT is an effective tokenizer-free modeling paradigm that balances performance and generalization, offering a new path toward more general and transferable language models.

Abstract: Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are widely adopted, they exhibit notable limitations, including their large, fixed vocabulary sizes and poor adaptability to new domains or languages. We present a family of models with up to 70 billion parameters based on the hierarchical autoregressive transformer (HAT) architecture. In HAT, an encoder transformer aggregates bytes into word embeddings and then feeds them to the backbone, a classical autoregressive transformer. The outputs of the backbone are then cross-attended by the decoder and converted back into bytes. We show that we can reuse available pre-trained models by converting the Llama 3.1 8B and 70B models into the HAT architecture: Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT are byte-level models whose encoder and decoder are trained from scratch, but where we adapt the pre-trained Llama backbone, i.e., the transformer blocks with the embedding matrix and head removed, to handle word embeddings instead of the original tokens. We also provide a 7B HAT model, Llama-TFree-HAT-Pretrained, trained entirely from scratch on nearly 4 trillion words. The HAT architecture improves text compression by reducing the number of required sequence positions and enhances robustness to intra-word variations, e.g., spelling differences. Through pre-training, as well as subsequent supervised fine-tuning and direct preference optimization in English and German, we show strong proficiency in both languages, improving on the original Llama 3.1 in most benchmarks. We release our models (including 200 pre-training checkpoints) on Hugging Face.

[11] MoLoRA: Composable Specialization via Per-Token Adapter Routing

Shrey Shah,Justin Wagle

Main category: cs.CL

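TL;DR: This paper proposes MoLoRA, a multi-adapter serving method that routes each token independently to the most suitable adapter instead of routing whole sequences, markedly improving model performance and efficiency on multimodal and mixed-capability requests.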

Details

Motivation: Existing multi-adapter systems route entire sequences. They cannot handle a sequence that mixes tokens from several modalities (such as interleaved text and images) or that needs expertise from several domains (such as writing code to solve an equation).

Method: The paper proposes per-token routing: multimodal models route by vocabulary structure, and semantically specialized tasks use a learned gate. The core implementation is MoLoRA, a composable mixture-of-LoRA architecture that loads several domain-specific LoRAs side by side and selects among them dynamically for each token.

Result: MoLoRA lets Qwen3-1.7B exceed Qwen3-8B on four reasoning benchmarks with only 21% of the parameters (4.7x smaller). Adapters can be plugged in and extended modularly without retraining.

Conclusion: Per-token routing is argued to be optimal in terms of work, and MoLoRA supports the view that specialization can beat simply scaling up, enabling efficient, flexible, and extensible modular expertise at inference time.

Abstract: Multi-adapter serving systems route entire sequences to a single adapter, forcing a choice when requests span multiple domains. This assumption fails in two important settings: (1) multimodal generation, where text and image tokens require different adapters within the same sequence, and (2) mixed-capability requests like "write code to solve this equation," which need expertise from multiple specialized adapters. We introduce per-token routing, which routes individual tokens to adapters based on either vocabulary structure (for multimodal models) or learned gating (for semantic specialization). Per-token routing is provably optimal, achieving work N for N tokens versus K·N for per-sequence routing with K adapter types. Our key contribution is MoLoRA (Mixture of LoRA), which enables composable specialization: load multiple domain-specific adapters and let a learned router select the appropriate adapter per-token. We demonstrate that specialization dramatically beats scale: MoLoRA enables Qwen3-1.7B to exceed Qwen3-8B across four reasoning benchmarks while being 4.7x smaller. This enables modular expertise at inference time: train focused LoRAs independently, combine them without retraining, and add new capabilities by simply loading new adapters.
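The N versus K·N work comparison can be made concrete with a toy counter. This sketch only illustrates vocabulary-structure routing and the bookkeeping; it is not MoLoRA's implementation. The vocabulary boundary, the adapter names, and the reading of per-sequence cost as one full pass per adapter type are assumptions.

```python
from collections import Counter

# Hypothetical vocabulary layout: ids below the boundary are text tokens,
# ids at or above it are image tokens.
TEXT_VOCAB_SIZE = 32_000


def route_by_vocabulary(token_id: int) -> str:
    """Pick an adapter from the token id alone (no learned gate)."""
    return "image_adapter" if token_id >= TEXT_VOCAB_SIZE else "text_adapter"


def per_token_work(token_ids):
    """Each token visits exactly one adapter, so the counts sum to N."""
    return Counter(route_by_vocabulary(t) for t in token_ids)


def per_sequence_work(token_ids, adapters):
    """One full pass over the sequence per adapter type: K * N in total."""
    return {adapter: len(token_ids) for adapter in adapters}
```

For a five-token sequence and two adapter types, the first function accounts for five adapter applications and the second for ten, matching the N versus K·N comparison in the abstract.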

[12] Robust Language Identification for Romansh Varieties

Charlotte Model,Sina Ahmadi,Jannis Vamvas

Main category: cs.CL

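TL;DR: This paper presents an SVM-based language identification (LID) system for Romansh idioms that reaches 97% in-domain accuracy on a newly built benchmark, and releases the classifier publicly.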

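Details

Motivation: Romansh has several regional varieties (idioms) that are sometimes hard to understand across varieties, and no language identification system exists for them. The system must also handle the supra-regional standard Rumantsch Grischun, which makes this a novel classification task.

Method: The authors build a Romansh idiom identification system with a support vector machine (SVM) and evaluate it on a newly curated benchmark spanning two domains.

Result: The model reaches an average in-domain accuracy of 97% across the two domains, enabling applications such as idiom-aware spell checking and machine translation. The classifier is publicly available.

Conclusion: This is the first systematic treatment of identification across Romansh idioms. It shows that an SVM is effective for low-resource, highly diverse language identification and provides a reusable benchmark and tool.

Abstract: The Romansh language has several regional varieties, called idioms, which sometimes have limited mutual intelligibility. Despite this linguistic diversity, there has been a lack of documented efforts to build a language identification (LID) system that can distinguish between these idioms. Since Romansh LID should also be able to recognize Rumantsch Grischun, a supra-regional variety that combines elements of several idioms, this makes for a novel and interesting classification problem. In this paper, we present a LID system for Romansh idioms based on an SVM approach. We evaluate our model on a newly curated benchmark across two domains and find that it reaches an average in-domain accuracy of 97%, enabling applications such as idiom-aware spell checking or machine translation. Our classifier is publicly available.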

[13] Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning

Jingxiang Chen,Minseok Kim,Seong-Gyun Leem,Yin Huang,Rashi Rungta,Zhicheng Ouyang,Haibin Wu,Surya Teja Appini,Ankur Bansal,Yang Bai,Yue Liu,Florian Metze,Ahmed A Aly,Anuj Kumar,Ariya Rastrow,Zhaojiang Lin

Main category: cs.CL

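TL;DR: This paper combines multi-task reinforcement learning with chain-of-thought prompting to build a paralinguistics-aware speech LLM (PALLM), significantly improving the understanding of emotion in speech.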

Details

Motivation: Speech LLMs must understand paralinguistic cues such as prosody, emotion, and non-verbal sounds to grasp a speaker's intent, but they face scarce training data, difficult annotation, and a tendency to rely on lexical shortcuts instead of paralinguistic signals.

Method: The authors combine multi-task reinforcement learning (RL) with chain-of-thought prompting to elicit explicit affective reasoning, and design a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation in a two-stage pipeline.

Result: On Expresso, IEMOCAP, and RAVDESS, paralinguistic understanding improves by 8-12% over supervised baselines and strong proprietary models such as Gemini-2.5-Pro and GPT-4o-audio.

Conclusion: Modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.

Abstract: Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds, which are crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistic understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.

[14] NLP Occupational Emergence Analysis: How Occupations Form and Evolve in Real Time -- A Zero-Assumption Method Demonstrated on AI in the US Technology Workforce, 2022-2026

David Nordfors

Main category: cs.CL

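TL;DR: This paper proposes a zero-assumption method, based on the concept of a "bipartite co-attractor", for detecting the spontaneous emergence of occupations from resume data: a professional vocabulary and a practitioner population reinforce and sustain each other. An analysis of 8.2 million US resumes finds that AI rapidly formed a highly cohesive professional vocabulary in early 2024 but its practitioner population never cohered, suggesting that AI is a diffusing technology and not an emerging occupation. The paper also discusses whether creating an "AI Engineer" occupational category could bring about population cohesion and complete the co-attractor.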

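Details

Motivation: Traditional occupational classification systems are updated more slowly than occupations actually evolve. A method is needed that identifies emerging occupations on its own, without predefined occupational categories or job titles.

Method: The paper proposes the "bipartite co-attractor" model, in which an occupation is a self-sustaining structure where a professional vocabulary and a practitioner population reinforce each other. From this it derives a zero-assumption detection method that tests vocabulary cohesion and population cohesion independently and uses ablation to check whether the vocabulary is the mechanism binding the population.

Result: Applied to 8.2 million US resumes from 2022-2026, the method correctly identifies established occupations and reveals a marked asymmetry for AI: the vocabulary cohered rapidly in early 2024, but the practitioner population never cohered. The pre-existing AI community dissolved, and the new vocabulary was absorbed into existing careers instead of giving rise to a new occupation.

Conclusion: AI currently behaves as a rapidly diffusing technology and not as an independent occupation in formation. Introducing a formal occupational category such as "AI Engineer" might catalyze population cohesion around the existing vocabulary and complete the co-attractor structure.

Abstract: Occupations form and evolve faster than classification systems can track. We propose that a genuine occupation is a self-reinforcing structure (a bipartite co-attractor) in which a shared professional vocabulary makes practitioners cohesive as a group, and the cohesive group sustains the vocabulary. This co-attractor concept enables a zero-assumption method for detecting occupational emergence from resume data, requiring no predefined taxonomy or job titles: we test vocabulary cohesion and population cohesion independently, with ablation to test whether the vocabulary is the mechanism binding the population. Applied to 8.2 million US resumes (2022-2026), the method correctly identifies established occupations and reveals a striking asymmetry for AI: a cohesive professional vocabulary formed rapidly in early 2024, but the practitioner population never cohered. The pre-existing AI community dissolved as the tools went mainstream, and the new vocabulary was absorbed into existing careers rather than binding a new occupation. AI appears to be a diffusing technology, not an emerging occupation. We discuss whether introducing an "AI Engineer" occupational category could catalyze population cohesion around the already-formed vocabulary, completing the co-attractor.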

[15] RadAnnotate: Large Language Models for Efficient and Reliable Radiology Report Annotation

Saisha Pradeep Shetty,Roger Eric Goldman,Vladimir Filkov

Main category: cs.CL

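TL;DR: RadAnnotate is an LLM-based framework for annotating radiology reports. It uses retrieval-augmented generation of synthetic reports and confidence-driven selective automation to reduce expert annotation effort. On RadGraph entity labeling, it improves performance on uncertain observations in particular in low-resource settings and can automatically annotate most reports (55-90%) while keeping a high entity match score (0.86-0.92).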

Details

Motivation: Manual annotation of radiology reports is slow and costly, and the burden on expert annotators needs to be reduced.

Method: The RadAnnotate framework (1) trains entity-specific classifiers and analyzes their performance and failure modes across anatomy and observation categories; (2) generates synthetic reports with RAG and measures how much they help model performance; and (3) learns entity-specific confidence thresholds that split cases between automatic annotation and expert review.

Result: Models trained only on synthetic reports come close to gold-trained models (within 1-2 F1 points). For the hardest category, uncertain observations, synthetic augmentation raises F1 from 0.61 to 0.70 in a low-resource setting. The framework can automatically annotate 55-90% of reports at an entity match score of 0.86-0.92.

Conclusion: RadAnnotate balances automation accuracy against expert effort and offers a practical, scalable semi-automatic solution for radiology report annotation in clinical NLP.

Abstract: Radiology report annotation is essential for clinical NLP, yet manual labeling is slow and costly. We present RadAnnotate, an LLM-based framework that studies retrieval-augmented synthetic reports and confidence-based selective automation to reduce expert effort for labeling in RadGraph. We study RadGraph-style entity labeling (graph nodes) and leave relation extraction (edges) to future work. First, we train entity-specific classifiers on gold-standard reports and characterize their strengths and failure modes across anatomy and observation categories, with uncertain observations hardest to learn. Second, we generate RAG-guided synthetic reports and show that synthetic-only models remain within 1-2 F1 points of gold-trained models, and that synthetic augmentation is especially helpful for uncertain observations in a low-resource setting, improving F1 from 0.61 to 0.70. Finally, by learning entity-specific confidence thresholds, RadAnnotate can automatically annotate 55-90% of reports at 0.86-0.92 entity match score while routing low-confidence cases for expert review.
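Confidence-based selective automation can be sketched as two small steps: learn a threshold per entity type on validation data, then route new predictions. This is a hypothetical illustration, not RadAnnotate's procedure: the precision-target rule, routing per prediction instead of per report, and all names are assumptions.

```python
def learn_threshold(validation, target_precision=0.9):
    """Return the lowest confidence threshold that meets the precision target.

    `validation` is a list of (confidence, is_correct) pairs for one entity
    type. Returns None when no threshold reaches the target.
    """
    for threshold in sorted({conf for conf, _ in validation}):
        accepted = [ok for conf, ok in validation if conf >= threshold]
        if sum(accepted) / len(accepted) >= target_precision:
            return threshold
    return None


def route(predictions, thresholds):
    """Split predictions into auto-accepted labels and an expert review queue.

    `predictions` holds (entity_type, confidence, label) triples. Entity
    types without a learned threshold always go to review.
    """
    auto, review = [], []
    for entity_type, confidence, label in predictions:
        threshold = thresholds.get(entity_type)
        if threshold is not None and confidence >= threshold:
            auto.append((entity_type, label))
        else:
            review.append((entity_type, label))
    return auto, review
```

Choosing the lowest qualifying threshold maximizes the share of cases handled automatically for a given quality target, which is the tradeoff the 55-90% range in the abstract reflects.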

[16] Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based Explainability

Fan Huang,Haewoon Kwak,Jisun An

Main category: cs.CL

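TL;DR: This paper introduces the concept of "moral reasoning trajectories" to analyze how large language models invoke different ethical frameworks across reasoning steps in moral decisions. It finds that framework switching is pervasive and that framework stability is closely tied to model robustness, accuracy, and agreement with human evaluation.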

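Details

Motivation: Large language models increasingly take part in morally sensitive decisions, but how they organize and switch ethical frameworks during reasoning has not been studied in depth.

Method: The authors introduce moral reasoning trajectories (sequences of ethical framework invocations) and analyze their dynamics across six models and three benchmarks. They use linear probes to locate layers with framework-specific representations, design lightweight activation steering to modulate framework integration, and propose a Moral Representation Consistency (MRC) metric whose correlation with coherence ratings they validate.

Result: 55.4-57.7% of consecutive reasoning steps switch ethical framework. Unstable trajectories are 1.29× more susceptible to persuasive attacks. Linear probes locate model-specific layers (for example layer 63 for Llama-3.3-70B) and lower KL divergence by 13.8-22.6%. Activation steering reduces framework drift by 6.7-8.9%. The MRC metric correlates strongly with LLM coherence ratings (r = 0.715), and human annotation validates the reliability of the framework attributions (mean cosine similarity 0.859).

Conclusion: Moral reasoning is inherently a multi-framework process with dynamic switching. The stability of its representations can be quantified and is strongly associated with model robustness, accuracy, and agreement with human perception, opening a new path for explainable modeling of AI ethics.

Abstract: Large language models (LLMs) increasingly participate in morally sensitive decision-making, yet how they organize ethical frameworks across reasoning steps remains underexplored. We introduce moral reasoning trajectories, sequences of ethical framework invocations across intermediate reasoning steps, and analyze their dynamics across six models and three benchmarks. We find that moral reasoning involves systematic multi-framework deliberation: 55.4-57.7% of consecutive steps involve framework switches, and only 16.4-17.8% of trajectories remain framework-consistent. Unstable trajectories remain 1.29× more susceptible to persuasive attacks (p = 0.015). At the representation level, linear probes localize framework-specific encoding to model-specific layers (layer 63/81 for Llama-3.3-70B; layer 17/81 for Qwen2.5-72B), achieving 13.8-22.6% lower KL divergence than the training-set prior baseline. Lightweight activation steering modulates framework integration patterns (6.7-8.9% drift reduction) and amplifies the stability-accuracy relationship. We further propose a Moral Representation Consistency (MRC) metric that correlates strongly (r = 0.715, p < 0.0001) with LLM coherence ratings, whose underlying framework attributions are validated by human annotators (mean cosine similarity = 0.859).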
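The two trajectory statistics quoted above (the share of consecutive steps that switch framework, and whether a trajectory stays framework-consistent) are simple to compute once each reasoning step carries a framework label. The sketch below assumes such labels are already available; the label names and function names are hypothetical, and how the paper assigns labels is not shown.

```python
def switch_rate(trajectory):
    """Fraction of consecutive step pairs whose framework label differs."""
    pairs = list(zip(trajectory, trajectory[1:]))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def is_framework_consistent(trajectory):
    """True when every step in the trajectory uses the same framework."""
    return len(set(trajectory)) <= 1


def corpus_summary(trajectories):
    """Return (pooled switch rate, share of framework-consistent trajectories)."""
    pairs = [
        (a, b) for t in trajectories for a, b in zip(t, t[1:])
    ]
    pooled = sum(a != b for a, b in pairs) / len(pairs) if pairs else 0.0
    consistent = sum(is_framework_consistent(t) for t in trajectories)
    return pooled, consistent / len(trajectories)
```

Pooling step pairs across trajectories, as `corpus_summary` does, weights long trajectories more heavily than averaging per-trajectory rates would; which convention the paper uses is not stated in the abstract.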

[17] SEAHateCheck: Functional Tests for Detecting Hate Speech in Low-Resource Languages of Southeast Asia

Ri Chi Ng,Aditi Kumaresan,Yujia Hu,Roy Ka-Wei Lee

Main category: cs.CL

TL;DR: This paper presents SEAHateCheck, a functional test dataset for hate speech in four low-resource Southeast Asian languages (Indonesian, Tagalog, Thai, and Vietnamese). It targets the weak detection performance of existing models in these languages and exposes their weaknesses in implicit hate detection and counter-speech recognition.

Details Motivation: High-resource languages such as English and Chinese have abundant linguistic resources, whereas low-resource Southeast Asian languages lack them, and the region's complex socio-linguistic contexts make online hate moderation harder. Method: Builds SEAHateCheck on HateCheck's functional testing framework while refining SGHateCheck's methods; uses large language models to generate culturally relevant test cases, which local experts then validate; evaluates a range of state-of-the-art and multilingual models. Result: Among the languages, current models are least accurate on Tagalog; among the functional tests, slang-based cases are the hardest; models also show clear weaknesses in implicit hate detection and counter-speech recognition. Conclusion: SEAHateCheck is the first functional hate speech test suite for these four Southeast Asian languages and provides a reliable benchmark for building more culturally attuned detection tools. Abstract: Hate speech detection relies heavily on linguistic resources, which are primarily available in high-resource languages such as English and Chinese, creating barriers for researchers and platforms developing tools for low-resource languages in Southeast Asia, where diverse socio-linguistic contexts complicate online hate moderation. To address this, we introduce SEAHateCheck, a pioneering dataset tailored to Indonesia, Thailand, the Philippines, and Vietnam, covering Indonesian, Tagalog, Thai, and Vietnamese. Building on HateCheck's functional testing framework and refining SGHateCheck's methods, SEAHateCheck provides culturally relevant test cases, augmented by large language models and validated by local experts for accuracy. Experiments with state-of-the-art and multilingual models revealed limitations in detecting hate speech in specific low-resource languages. In particular, Tagalog test cases showed the lowest model accuracy, likely due to linguistic complexity and limited training data. In contrast, slang-based functional tests proved the hardest, as models struggled with culturally nuanced expressions. The diagnostic insights of SEAHateCheck further exposed model weaknesses in implicit hate detection and models' struggles with counter-speech expression. As the first functional test suite for these Southeast Asian languages, this work equips researchers with a robust benchmark, advancing the development of practical, culturally attuned hate speech detection tools for inclusive online content moderation.

[18] ClaimFlow: Tracing the Evolution of Scientific Claims in NLP

Aniket Pramanick,Yufang Hou,Saif M. Mohammad,Iryna Gurevych

Main category: cs.CL

TL;DR: This paper presents the ClaimFlow dataset, in which 1,084 scientific claims from 304 NLP papers and 832 cross-paper relations among them (supports, extends, qualifies, refutes, background) are manually annotated. It defines a new claim relation classification task with a baseline of 0.78 macro-F1; further analysis shows that most claims are never reused, only a few are challenged, and widely propagated claims are mostly reshaped through qualification or extension.

Details Motivation: Existing citation and claim analysis methods capture only fragments of scientific dialogue and cannot explicitly model how papers interact over specific scientific claims. Method: Builds ClaimFlow by manually annotating claims and cross-paper relation types in 304 ACL Anthology papers; defines the Claim Relation Classification task; evaluates neural models and large language models on it; and applies the trained model to about 13,000 NLP papers to analyze how claims evolve. Result: The baseline reaches 0.78 macro-F1 on Claim Relation Classification; across the roughly 13k papers, 63.5% of claims are never reused, 11.1% are ever challenged, and widely propagated claims are more often qualified or extended than directly confirmed or refuted. Conclusion: ClaimFlow offers a new lens and a foundational resource for studying how scientific claims evolve within NLP, and supports assessing how well models understand scientific argumentation. Abstract: Scientific papers do more than report results $-$ they advance $\textit{claims}$ that later work supports, extends, or sometimes refutes. Yet existing methods for citation and claim analysis capture only fragments of this dialogue. In this work, we make these interactions explicit at the level of individual scientific claims. We introduce $\texttt{ClaimFlow}$, a claim-centric view of the NLP literature, built from $304$ ACL Anthology papers (1979$-$2025) that are manually annotated with $1{,}084$ claims and $832$ cross-paper claim relations, indicating whether a citing paper $\textit{supports}$, $\textit{extends}$, $\textit{qualifies}$, $\textit{refutes}$, or references a claim as $\textit{background}$. Using $\texttt{ClaimFlow}$, we define a new task $-$ $\textit{Claim Relation Classification}$ $-$ which requires models to infer the scientific stance toward a cited claim from the text and citation context. Evaluating strong neural models and large language models on this task, we report baseline performance of $0.78$ macro-F1, highlighting that claim-relation classification is feasible but challenging. We further apply our model to $\sim$$13k$ NLP papers to analyze how claims evolve across decades of NLP research. Our analysis reveals that $63.5$% claims are never reused; only $11.1$% are ever challenged; meanwhile, widely propagated claims are more often $\textit{reshaped}$ through qualification and extension than directly confirmed or refuted. 
Overall, $\texttt{ClaimFlow}$ offers a lens for examining how ideas shift and mature within NLP, and a foundation for assessing whether models can interpret scientific argumentation.

[19] CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

Tianyi Huang,Ying Kai Deng

Main category: cs.CL

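TL;DR: CounterRefine is a lightweight inference-time repair layer for retrieval-grounded question answering. It drafts an initial answer, retrieves supporting and conflicting evidence, and keeps or revises the answer subject to deterministic validation, improving a matched baseline on SimpleQA by 5.8 points.

Details Motivation: Many factual QA errors are failures of commitment rather than of access: the system retrieves relevant evidence yet still settles on the wrong answer. Method: CounterRefine works in three steps: (1) produce an initial answer from retrieved evidence; (2) issue follow-up queries conditioned on that answer to gather supporting and conflicting evidence; (3) run a restricted refinement step that outputs KEEP or REVISE, accepting a revision only if it passes deterministic validation and otherwise keeping the original answer. Result: On the SimpleQA benchmark, CounterRefine improves a matched GPT-5 Baseline-RAG by 5.8 points to a 73.1% correct rate, and exceeds the reported one-shot GPT-5.4 score by roughly 40 points. Conclusion: Knowledgeable foundation models need not only effective access to evidence but also the ability to use it to reconsider and repair their own answers; CounterRefine offers a simple but important route to this. Abstract: In factual question answering, many errors are not failures of access but failures of commitment: the system retrieves relevant evidence, yet still settles on the wrong answer. We present CounterRefine, a lightweight inference-time repair layer for retrieval-grounded question answering. CounterRefine first produces a short answer from retrieved evidence, then gathers additional support and conflicting evidence with follow-up queries conditioned on that draft answer, and finally applies a restricted refinement step that outputs either KEEP or REVISE, with proposed revisions accepted only if they pass deterministic validation. In effect, CounterRefine turns retrieval into a mechanism for testing a provisional answer rather than merely collecting more context. On the full SimpleQA benchmark, CounterRefine improves a matched GPT-5 Baseline-RAG by 5.8 points and reaches a 73.1 percent correct rate, while exceeding the reported one-shot GPT-5.4 score by roughly 40 points. These findings suggest a simple but important direction for knowledgeable foundation models: beyond accessing evidence, they should also be able to use that evidence to reconsider and, when necessary, repair their own answers.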
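
The draft, counterevidence, and KEEP/REVISE steps described in this entry can be sketched as a small control-flow skeleton. This is only an illustration under stated assumptions: the `counter_refine` function and its four callables (`retrieve`, `draft`, `refine`, `validate`) are hypothetical placeholders, and the abstract does not say how follow-up queries are formed or what the deterministic validation checks.

```python
from typing import Callable, List, Tuple

Retriever = Callable[[str], List[str]]


def counter_refine(
    question: str,
    retrieve: Retriever,
    draft: Callable[[str, List[str]], str],
    refine: Callable[[str, str, List[str]], Tuple[str, str]],
    validate: Callable[[str, List[str]], bool],
) -> str:
    """Return the draft answer unless a validated revision replaces it."""
    # Step 1: draft a short answer from the initially retrieved evidence.
    evidence = retrieve(question)
    answer = draft(question, evidence)

    # Step 2: follow-up retrieval conditioned on the draft answer.
    # (Concatenating question and answer is an assumption of this sketch.)
    extra = retrieve(f"{question} {answer}")
    pool = evidence + extra

    # Step 3: restricted refinement returns ("KEEP" or "REVISE", proposal).
    decision, proposal = refine(question, answer, pool)
    if decision == "REVISE" and validate(proposal, pool):
        return proposal
    return answer
```

The design point the sketch tries to capture is that a revision is never accepted on the refiner's say-so alone: the draft answer is the default, and a proposal replaces it only when the separate validation step passes.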

[20] Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

Francesco Pio Monaco,Elia Cunegatti,Flavio Vella,Giovanni Iacca

Main category: cs.CL

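TL;DR: This paper proposes ZipCal, a model-agnostic calibration data selection method that maximizes lexical diversity based on Zipf's law, for post-training LLM compression (pruning and quantization). It preserves downstream performance while running about 240× faster on average than a perplexity-based method.

Details Motivation: In post-training compression of large language models, the choice of calibration data is often overlooked even though it is critical to preserving capabilities within and across tasks; an efficient, model-agnostic selection strategy is needed. Method: ZipCal analyzes the intrinsic word-frequency distribution of the data under Zipfian power laws and prioritizes samples with high lexical diversity for the calibration set, without relying on model-specific signals such as forward passes or perplexity. Result: ZipCal consistently outperforms uniform random sampling across pruning benchmarks; its downstream performance is on par with a state-of-the-art perplexity-based method while being about 240× faster on average, owing to its linear complexity. Conclusion: The lexical diversity of calibration data is a key intrinsic property affecting compression quality; ZipCal, as a lightweight, general, and efficient data curation strategy, offers a new approach to LLM compression. Abstract: Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called \emph{calibration data}) for finding the compressed model configuration. The choice of calibration data is a critical step in preserving model capabilities both intra- and inter-tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce \texttt{\textbf{ZipCal}}, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, it also performs on par, in terms of downstream performance, with a state-of-the-art method that relies on model perplexity. The latter becomes prohibitively expensive at large-scale models and datasets, while \texttt{\textbf{ZipCal}} is on average $\sim$240$\times$ faster due to its tractable linear complexity\footnote{We make the code and the experiments available at https://anonymous.4open.science/r/zipcal-71CD/.}.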
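
To make "selecting calibration samples for lexical diversity without touching the model" concrete, here is a minimal stand-in. It is not the ZipCal algorithm, whose Zipf-based scoring is not specified in this entry: it is a plain greedy vocabulary-coverage heuristic with whitespace tokenization, both of which are assumptions of this sketch, and the `select_diverse` name is hypothetical. Unlike the linear-time method the abstract describes, this greedy loop costs roughly k passes over the data.

```python
from typing import List


def select_diverse(samples: List[str], k: int) -> List[int]:
    """Greedily pick up to k sample indices, each adding the most unseen word types."""
    tokenized = [set(s.lower().split()) for s in samples]
    covered = set()
    chosen = []
    remaining = set(range(len(samples)))
    for _ in range(min(k, len(samples))):
        # Ties are broken toward the lowest index because max() keeps
        # the first maximal element of the sorted candidate list.
        best = max(sorted(remaining), key=lambda i: len(tokenized[i] - covered))
        chosen.append(best)
        covered |= tokenized[best]
        remaining.remove(best)
    return chosen


corpus = ["the cat sat", "the cat sat", "a dog ran far away", "the dog"]
print(select_diverse(corpus, 2))
```

The only inputs are the texts themselves, which is the property the entry emphasizes: no forward pass or perplexity computation is needed to build the calibration set.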

[21] ASDA: Automated Skill Distillation and Adaptation for Financial Reasoning

Tik Yu Yim,Wenting Tan,Sum Yee Chan,Tak-Wah Lam,Siu Ming Yiu

Main category: cs.CL

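TL;DR: This paper proposes the ASDA framework, which adapts LLMs to financial reasoning without training by automatically building structured skill files. It substantially improves performance on the FAMMA benchmark, and the skill files are human-readable, version-controlled, and compatible with an open standard.

Details Motivation: Existing training-free methods (e.g., GEPA, ACE) deliver only marginal gains on complex multi-step financial reasoning, while conventional fine-tuning is expensive and produces model-locked expertise; an efficient, interpretable adaptation approach that needs no weight access is required. Method: In ASDA, a teacher model analyzes a student model's errors on financial tasks, clusters them by subfield and error type, iteratively generates structured skill files containing reasoning procedures, code templates, and worked examples, and injects these files dynamically at inference. Result: On FAMMA, ASDA achieves improvements of up to +17.33% on arithmetic reasoning and +5.95% on non-arithmetic reasoning, outperforming all training-free baselines; the generated skill files are compatible with the Agent Skills open standard and support human reading and version control. Conclusion: ASDA offers a practical, auditable, retraining-free adaptation path for specialized domains such as finance, moving beyond the limits of unstructured text optimization toward structured, modular skill reuse. Abstract: Adapting large language models (LLMs) to specialized financial reasoning typically requires expensive fine-tuning that produces model-locked expertise. Training-free alternatives have emerged, yet our experiments show that leading methods (GEPA and ACE) achieve only marginal gains on the FAMMA financial reasoning benchmark, exposing the limits of unstructured text optimization for complex, multi-step domain reasoning. We introduce Automated Skill Distillation and Adaptation (ASDA), a framework that automatically generates structured skill artifacts through iterative error-corrective learning without modifying model weights. A teacher model analyzes a student model's failures on financial reasoning tasks, clusters errors by subfield and error type, and synthesizes skill files containing reasoning procedures, code templates, and worked examples, which are dynamically injected during inference. Evaluated on FAMMA, ASDA achieves up to +17.33% improvement on arithmetic reasoning and +5.95% on non-arithmetic reasoning, substantially outperforming all training-free baselines. The resulting skill artifacts are human-readable, version-controlled, and compatible with the Agent Skills open standard, offering any organization with a labeled domain dataset a practical and auditable path to domain adaptation without weight access or retraining.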

[22] Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

Nishant Balepur,Malachi Hamada,Varsha Kishore,Sergey Feldman,Amanpreet Singh,Pao Siangliulue,Joseph Chee Chang,Eunsol Choi,Jordan Lee Boyd-Graber,Aakanksha Naik

Main category: cs.CL

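TL;DR: This paper presents MyScholarQA (MySQA), a personalized deep research (DR) tool that infers a user's research interests, proposes personalized actions, and writes multi-section reports that follow user-approved actions. Combining a synthetic-user benchmark with real-user interviews, the authors find that current LLM-based evaluation misses key problems in personalized DR and argue that real users are essential to progress.

Details Motivation: Existing deep research (DR) tools can synthesize papers to answer queries but lack understanding of their users, so personalization is weak; current NLP evaluation also relies on LLM judges and synthetic users, which may overlook dimensions that real users value. Method: Proposes MyScholarQA, comprising user interest profiling, personalized action proposals, and multi-section report generation driven by user-approved actions; builds a standard benchmark of synthetic users and LLM judges for quantitative evaluation; and interviews real users of an online version to gather qualitative insight. Result: MySQA beats baselines on citation metrics and personalized action-following; the interviews, however, reveal nine errors of personalized DR that the LLM judges cannot detect, and yield practical lessons for future DR design. Conclusion: Progress in personalized DR cannot rely on easy-to-use LLM evaluation alone and must incorporate real user feedback; real users are the pillar for testing and advancing personalized DR. Abstract: Deep Research (DR) tools (e.g. OpenAI DR) help researchers cope with ballooning publishing counts. Such tools can synthesize scientific papers to answer researchers' queries, but lack understanding of their users. We change that in MyScholarQA (MySQA), a personalized DR tool that: 1) infers a profile of a user's research interests; 2) proposes personalized actions for a user's input query; and 3) writes a multi-section report for the query that follows user-approved actions. We first test MySQA with NLP's standard protocol: we design a benchmark of synthetic users and LLM judges, where MySQA beats baselines in citation metrics and personalized action-following. However, we suspect this process does not cover all aspects of personalized DR users value, so we interview users in an online version of MySQA to unmask them. We reveal nine nuanced errors of personalized DR undetectable by our LLM judges, and we study qualitative feedback to form lessons for future DR design. In all, we argue for a pillar of personalization that easy-to-use LLM judges can lead NLP to overlook: real progress in personalization is only possible with real users.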

[23] Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning

Kazuki Yano,Shun Kiyono,Sosuke Kobayashi,Sho Takase,Jun Suzuki

Main category: cs.CL

TL;DR: This paper studies learning rate scheduling in LLM pre-training and finds that Warmup-Stable-Only (WSO), which holds the learning rate constant after warmup, yields better downstream performance after supervised fine-tuning than decay-based schedulers, because it preserves flatter minima that support adaptability.

Details Motivation: Decay-based learning rate schedulers are widely used to minimize pre-training loss, but their effect on downstream performance after supervised fine-tuning remains unclear and calls for systematic study. Method: Examines the Warmup-Stable-Only (WSO) schedule experimentally, comparing it with decay-based schedulers on 1B and 8B parameter models across regimes with mid-training and over-training, and uses loss landscape analysis to explain the mechanism. Result: WSO consistently outperforms decay-based schedulers in performance after supervised fine-tuning, and the advantage holds across training regimes; loss landscape analysis shows that WSO preserves flatter minima whereas decay leads to sharper ones. Conclusion: To optimize downstream adaptability, pre-training should not apply learning rate decay merely to reach a lower pre-training loss; WSO is the better practice and improves the adaptability of released models for downstream tasks. Abstract: We investigate the role of learning rate scheduling in the large-scale pre-training of large language models, focusing on its influence on downstream performance after supervised fine-tuning (SFT). Decay-based learning rate schedulers are widely used to minimize pre-training loss. However, despite their widespread use, how these schedulers affect performance after SFT remains underexplored. In this paper, we examine Warmup-Stable-Only (WSO), which maintains a constant learning rate after warmup without any decay. Through experiments with 1B and 8B parameter models, we show that WSO consistently outperforms decay-based schedulers in terms of performance after SFT, even though decay-based schedulers may exhibit better performance after pre-training. The result also holds across different regimes with mid-training and over-training. Loss landscape analysis further reveals that decay-based schedulers lead models into sharper minima, whereas WSO preserves flatter minima that support adaptability. These findings indicate that applying LR decay to improve pre-training metrics may compromise downstream adaptability. Our work also provides practical guidance for training and model release strategies, highlighting that pre-training models with WSO enhances their adaptability for downstream tasks.
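
The difference between the two schedule families can be sketched in a few lines. The functions below assume linear warmup and use cosine decay as one representative decay-based scheduler; the entry does not state which warmup shape or which decay variants the paper uses, and `wso_lr` and `cosine_lr` are hypothetical names.

```python
import math


def wso_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Warmup-Stable-Only: linear warmup, then a constant learning rate."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


def cosine_lr(step: int, peak_lr: float, warmup_steps: int,
              total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warmup followed by cosine decay from peak_lr down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Compare the two schedules at a few points of a 110-step toy run.
for step in (0, 9, 60, 110):
    print(step, wso_lr(step, 1e-3, 10), cosine_lr(step, 1e-3, 10, 110))
```

Both schedules share the warmup segment; they differ only afterwards, where WSO stays at the peak rate and the decay schedule shrinks toward `min_lr`.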

[24] Social Simulacra in the Wild: AI Agent Communities on Moltbook

Agam Goyal,Olivia Pal,Hari Sundaram,Eshwar Chandrasekharan,Koustuv Saha

Main category: cs.CL

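TL;DR: This paper presents the first large-scale empirical comparison of AI-agent and human online communities. AI communities show extreme participation inequality, high cross-community author overlap, emotional flattening, a cognitive shift toward assertion, and social detachment; their apparent homogenization stems mainly from shared authorship, and individual AI agents are more identifiable than human users.

Details Motivation: As autonomous LLM-based agents increasingly populate social platforms, understanding the dynamics of AI-agent communities matters for both communication research and platform governance. Method: Compares 73,899 Moltbook posts with 189,838 Reddit posts across five matched communities, covering structural features (such as participation inequality and author overlap) and linguistic attributes (emotional, cognitive, and social dimensions). Result: Moltbook shows extreme participation inequality (Gini 0.84 vs. 0.47) and high cross-community author overlap (33.8% vs. 0.5%); AI-generated content is emotionally flattened, cognitively shifted toward assertion over exploration, and socially detached; the apparent community-level homogenization is primarily a structural artifact of shared authorship; individual AI agents are more identifiable than human users, driven by outlier stylistic profiles amplified by extreme posting volume. Conclusion: AI-agent communities exhibit collective communication dynamics distinct from those of human communities, and this work offers an empirical foundation for understanding how multi-agent interaction shapes new forms of online discourse. Abstract: As autonomous LLM-based agents increasingly populate social platforms, understanding the dynamics of AI-agent communities becomes essential for both communication research and platform governance. We present the first large-scale empirical comparison of AI-agent and human online communities, analyzing 73,899 Moltbook and 189,838 Reddit posts across five matched communities. Structurally, we find that Moltbook exhibits extreme participation inequality (Gini = 0.84 vs. 0.47) and high cross-community author overlap (33.8\% vs. 0.5\%). In terms of linguistic attributes, content generated by AI-agents is emotionally flattened, cognitively shifted toward assertion over exploration, and socially detached. These differences give rise to apparent community-level homogenization, but we show this is primarily a structural artifact of shared authorship. At the author level, individual agents are more identifiable than human users, driven by outlier stylistic profiles amplified by their extreme posting volume. As AI-mediated communication reshapes online discourse, our work offers an empirical foundation for understanding how multi-agent interaction gives rise to collective communication dynamics distinct from those of human communities.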
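
The two structural measures quoted in this entry can be computed as follows. The Gini function uses the standard sorted-rank formula over per-author post counts; the overlap function assumes "cross-community author overlap" means the share of authors who post in more than one community, which may differ from the paper's exact definition. Both function names and the toy data are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def gini(counts: List[float]) -> float:
    """Gini coefficient of non-negative counts (0 = equal, near 1 = concentrated)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def cross_community_overlap(posts: Iterable[Tuple[str, str]]) -> float:
    """Share of authors active in more than one community.

    posts is an iterable of (author, community) pairs.
    """
    communities: Dict[str, Set[str]] = {}
    for author, community in posts:
        communities.setdefault(author, set()).add(community)
    if not communities:
        return 0.0
    multi = sum(len(c) > 1 for c in communities.values())
    return multi / len(communities)


print(gini([0, 0, 0, 10]))  # one author writes everything
print(cross_community_overlap([("a", "x"), ("a", "y"), ("b", "x")]))
```

With four authors and all posts written by one of them, the Gini value is 0.75, which gives a feel for how concentrated participation must be to reach the 0.84 reported for Moltbook.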

[25] SciZoom: A Large-scale Benchmark for Hierarchical Scientific Summarization across the LLM Era

Han Jang,Junhyeok Lee,Kyu Sung Choi

Main category: cs.CL

TL;DR: This paper introduces SciZoom, a new benchmark of 44,946 papers for multi-granularity scientific summarization, and analyzes how scientific writing changed between the pre-LLM and post-LLM eras.

Details Motivation: Existing summarization benchmarks are small, target a single granularity, and predate the LLM era, and no resource exists for analyzing how LLM-assisted drafting has changed scientific writing. Method: Builds the SciZoom dataset from four top-tier venues (NeurIPS, ICLR, ICML, EMNLP) spanning 2020 to 2025, stratified into Pre-LLM and Post-LLM eras, with three hierarchical summarization targets (Abstract, Contributions, and TL;DR), and performs a linguistic analysis. Result: Formulaic expressions increase by up to 10x and hedging declines by 23%, suggesting that LLM-assisted writing is more confident but more homogenized. Conclusion: SciZoom is both a new multi-granularity summarization benchmark and a unique resource for studying how scientific discourse evolves in the generative AI era. Abstract: The explosive growth of AI research has created unprecedented information overload, increasing the demand for scientific summarization at multiple levels of granularity beyond traditional abstracts. While LLMs are increasingly adopted for summarization, existing benchmarks remain limited in scale, target only a single granularity, and predate the LLM era. Moreover, since the release of ChatGPT in November 2022, researchers have rapidly adopted LLMs for drafting manuscripts themselves, fundamentally transforming scientific writing, yet no resource exists to analyze how this writing has evolved. To bridge these gaps, we introduce SciZoom, a benchmark comprising 44,946 papers from four top-tier ML venues (NeurIPS, ICLR, ICML, EMNLP) spanning 2020 to 2025, explicitly stratified into Pre-LLM and Post-LLM eras. SciZoom provides three hierarchical summarization targets (Abstract, Contributions, and TL;DR) achieving compression ratios up to 600:1, enabling both multi-granularity summarization research and temporal mining of scientific writing patterns. Our linguistic analysis reveals striking shifts in phrase patterns (up to 10x for formulaic expressions) and rhetorical style (23% decline in hedging), suggesting that LLM-assisted writing produces more confident yet homogenized prose. SciZoom serves as both a challenging benchmark and a unique resource for mining the evolution of scientific discourse in the generative AI era. Our code and dataset are publicly available on GitHub (https://github.com/janghana/SciZoom) and Hugging Face (https://huggingface.co/datasets/hanjang/SciZoom), respectively.

[26] SIA: A Synthesize-Inject-Align Framework for Knowledge-Grounded and Secure E-commerce Search LLMs with Industrial Deployment

Zhouwei Zhai,Mengxiang Chen,Anmeng Zhang

Main category: cs.CL

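TL;DR: This paper proposes SIA (Synthesize-Inject-Align), a framework that improves the knowledge accuracy and safety robustness of e-commerce search LLMs through knowledge synthesis, parameter-efficient injection, and dual-path alignment, validated in deployment at JD.com.

Details Motivation: Addresses two problems of large language models in e-commerce search: hallucination caused by insufficient encoding of product knowledge, and security and compliance risks from vulnerability to jailbreak attacks. Method: The framework has three stages: (1) synthesize a high-quality corpus by combining knowledge graphs with behavioral logs, augmented with reasoning chains and safety-aware data; (2) apply parameter-efficient pre-training based on Depth Up-Scaling to inject domain knowledge; (3) align along two paths, via multi-task instruction tuning and adversarial training, to strengthen both task performance and safety. Result: In A/B tests across five core search scenarios at JD.com, key business metrics improve significantly, validating the framework's industrial effectiveness and scalability. Conclusion: The framework balances knowledge accuracy and safety for e-commerce search LLMs and offers a feasible path to industrial deployment. Abstract: Large language models offer transformative potential for e-commerce search by enabling intent-aware recommendations. However, their industrial deployment is hindered by two critical challenges: (1) knowledge hallucination due to insufficient encoding of dynamic, fine-grained product knowledge, and (2) security vulnerabilities under jailbreak attacks that threaten compliance. To address these issues, we propose SI--a Synthesize-Inject-Align framework for building knowledgeable and secure e-commerce search LLMs. Our approach first synthesizes high-quality natural language corpus by combining structured knowledge graphs with unstructured behavioral logs, augmented with reasoning chains and safety-aware data. We then introduce a parameter-efficient pre-training strategy based on Depth Up-Scaling to inject domain knowledge while preserving general capabilities. Finally, a dual-path alignment method via multi-task instruction tuning and adversarial training strengthens both task performance and safety robustness. The framework has been deployed at JD.com, China's largest self-operated e-commerce platform, where A/B tests across five core search scenarios demonstrate significant improvements in key business metrics, validating its industrial effectiveness and scalability.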

[27] Parametric Social Identity Injection and Diversification in Public Opinion Simulation

Hexi Wang,Yujia Zhou,Bangde Du,Qingyao Ai,Yiqun Liu

Main category: cs.CL

TL;DR: This paper proposes the Parametric Social Identity Injection (PSII) framework, which injects explicit, parametric representations of group identity and values into the intermediate hidden states of LLMs. This mitigates the loss of social diversity in public opinion simulation caused by Diversity Collapse in hidden representations; experiments on World Values Survey data show improved distributional fidelity and diversity.

Details Motivation: Existing LLM-based public opinion simulation fails to capture social diversity, flattening inter-group differences and homogenizing responses within groups; the authors trace this to social identity representations that collapse and become indistinguishable across the model's hidden layers. Method: Proposes Parametric Social Identity Injection (PSII), which injects explicit, parametric representations of demographic attributes (such as age and gender) and value orientations directly into intermediate hidden layers of the LLM; unlike prompt-level persona conditioning, PSII provides fine-grained, controllable identity modulation at the representation level. Result: On the World Values Survey with multiple open-source LLMs, PSII significantly improves distributional fidelity (lower KL divergence to real survey data) and diversity (stronger inter-group differences and within-group heterogeneity). Conclusion: PSII gives LLM agents representation-level control over social identity and advances scalable, diversity-aware public opinion simulation. Abstract: Large language models (LLMs) have recently been adopted as synthetic agents for public opinion simulation, offering a promising alternative to costly and slow human surveys. Despite their scalability, current LLM-based simulation methods fail to capture social diversity, producing flattened inter-group differences and overly homogeneous responses within demographic groups. We identify this limitation as a Diversity Collapse phenomenon in LLM hidden representations, where distinct social identities become increasingly indistinguishable across layers. Motivated by this observation, we propose Parametric Social Identity Injection (PSII), a general framework that injects explicit, parametric representations of demographic attributes and value orientations directly into intermediate hidden states of LLMs. Unlike prompt-based persona conditioning, PSII enables fine-grained and controllable identity modulation at the representation level. Extensive experiments on the World Values Survey using multiple open-source LLMs show that PSII significantly improves distributional fidelity and diversity, reducing KL divergence to real-world survey data while enhancing overall diversity. This work provides new insights into representation-level control of LLM agents and advances scalable, diversity-aware public opinion simulation. Code and data are available at https://github.com/halsayxi/PSII.
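
Distributional fidelity in this entry is measured with KL divergence between simulated and real survey answer distributions. A minimal sketch of that computation for one survey question follows. The direction KL(survey || simulated), the clamping of zero probabilities with `eps`, and the `kl_divergence` name are assumptions of this sketch, since the entry does not specify them.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over the same options.

    Inputs may be raw counts; both are normalized to sum to 1 first.
    """
    if len(p) != len(q):
        raise ValueError("p and q must cover the same answer options")
    p_sum, q_sum = sum(p), sum(q)
    total = 0.0
    for pi, qi in zip(p, q):
        pi, qi = pi / p_sum, qi / q_sum
        if pi > 0:
            # Clamp q to avoid division by zero when the simulation
            # never produces an option that real respondents chose.
            total += pi * math.log(pi / max(qi, eps))
    return total


survey = [40, 35, 25]     # hypothetical real answer counts
simulated = [70, 20, 10]  # hypothetical simulated answer counts
print(kl_divergence(survey, simulated))
```

A value of zero means the simulated distribution matches the survey exactly; a simulation that collapses onto one dominant answer yields a larger divergence.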

[28] Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR

Quy-Anh Dang,Chris Ngo

Main category: cs.CL

TL;DR: Polyglot-Lion is a family of compact multilingual speech recognition models for Singapore's four main languages (English, Mandarin, Tamil, and Malay). It comes close to the performance of much larger models at a far lower training and inference cost.

Details Motivation: Singapore's multilingual setting calls for efficient, low-cost, deployment-friendly multilingual ASR; existing large models are expensive to train and slow at inference. Method: Fine-tunes Qwen3-ASR-0.6B and Qwen3-ASR-1.7B exclusively on public speech corpora with balanced sampling (an equal number of utterances per language) and without language-tag conditioning, so the model must learn to identify the language implicitly from audio. Result: Polyglot-Lion-1.7B reaches an average error rate of 14.85 across 12 benchmarks, competitive with MERaLiON-2-10B-ASR (14.32), a model 6x larger; training costs $81 on a single GPU, about 0.4% of the $18,862 for the 128-GPU baseline; inference takes 0.10 s/sample, about 20x faster. Conclusion: Linguistically balanced fine-tuning of moderate-scale pretrained models can yield practical multilingual ASR at a fraction of the resources, making it well suited to deployment. Abstract: We present Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. Our models are obtained by fine-tuning Qwen3-ASR-0.6B and Qwen3-ASR-1.7B exclusively on publicly available speech corpora, using a balanced sampling strategy that equalizes the number of training utterances per language and deliberately omits language-tag conditioning so that the model learns to identify languages implicitly from audio. On 12 benchmarks spanning the four target languages, Polyglot-Lion-1.7B achieves an average error rate of 14.85, competitive with MERaLiON-2-10B-ASR (14.32) - a model 6x larger - while incurring a training cost of \$81 on a single RTX PRO 6000 GPU compared to \$18,862 for the 128-GPU baseline. Inference throughput is approximately 20x faster than MERaLiON at 0.10 s/sample versus 2.02 s/sample. These results demonstrate that linguistically balanced fine-tuning of moderate-scale pretrained models can yield deployment-ready multilingual ASR at a fraction of the cost of larger specialist systems.
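
One simple way to equalize the number of training utterances per language is to downsample every language to the size of the smallest one. The sketch below does that; whether the authors downsample, upsample, or reweight is not stated in this entry, so this is an assumption, and `balance_by_language` is a hypothetical helper. Note that the language tag is used only to build the balanced set, in keeping with the entry's point that the model itself is not conditioned on it.

```python
import random
from collections import defaultdict
from typing import List, Tuple


def balance_by_language(
    utterances: List[Tuple[str, str]], seed: int = 0
) -> List[Tuple[str, str]]:
    """Downsample so every language contributes the same number of utterances.

    utterances is a list of (language, utterance_id) pairs.
    """
    by_lang = defaultdict(list)
    for lang, utt in utterances:
        by_lang[lang].append(utt)
    if not by_lang:
        return []
    target = min(len(v) for v in by_lang.values())
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    balanced = []
    for lang in sorted(by_lang):
        for utt in rng.sample(by_lang[lang], target):
            balanced.append((lang, utt))
    return balanced


data = (
    [("en", f"en_{i}") for i in range(5)]
    + [("zh", f"zh_{i}") for i in range(4)]
    + [("ms", f"ms_{i}") for i in range(3)]
    + [("ta", f"ta_{i}") for i in range(2)]
)
print(len(balance_by_language(data)))
```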

[29] Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models

Xiaobing Sun,Perry Lam,Shaohua Li,Zizhou Wang,Rick Siow Mong Goh,Yong Liu,Liangli Zhen

Main category: cs.CL

TL;DR: This paper proposes Structured Semantic Cloaking (S2C), a jailbreak framework that interferes with LLM safety mechanisms by restructuring how semantic intent is reconstructed over multiple inference steps, raising attack success rates across several models.

Details Motivation: Jailbreaks based on surface-level perturbation fail against LLM safety mechanisms that operate on deeper semantics, so more covert attacks must target the semantic reconstruction process itself. Method: S2C comprises three mechanisms: (1) Contextual Reframing, which uses a high-stakes scenario to bias the model toward compliance; (2) Content Fragmentation, which disperses the semantic cues of the request; and (3) Clue-Guided Camouflage, which disguises cues while embedding recoverable markers. Together they delay and restructure the semantic consolidation of malicious intent. Result: On HarmBench and JBB-Behaviors, S2C improves Attack Success Rate (ASR) over the current SOTA by 12.4% and 9.7%, respectively; on GPT-5-mini it outperforms the strongest baseline by 26% on JBB-Behaviors; the paper also analyzes mechanism combinations and the trade-off between obfuscation and recoverability. Conclusion: S2C exposes how current LLM safety mechanisms depend on the timing and structure of semantic consolidation, offering a new perspective and a practical benchmark for evaluating and strengthening model robustness. Abstract: Modern LLMs employ safety mechanisms that extend beyond surface-level input filtering to latent semantic representations and generation-time reasoning, enabling them to recover obfuscated malicious intent during inference and refuse accordingly, and rendering many surface-level obfuscation jailbreak attacks ineffective. We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference. S2C strategically distributes and reshapes semantic cues such that full intent consolidation requires multi-step inference and long-range co-reference resolution within deeper latent representations. The framework comprises three complementary mechanisms: (1) Contextual Reframing, which embeds the request within a plausible high-stakes scenario to bias the model toward compliance; (2) Content Fragmentation, which disperses the semantic signature of the request across disjoint prompt segments; and (3) Clue-Guided Camouflage, which disguises residual semantic cues while embedding recoverable markers that guide output generation. By delaying and restructuring semantic consolidation, S2C degrades safety triggers that depend on coherent or explicitly reconstructed malicious intent at decoding time, while preserving sufficient instruction recoverability for functional output generation. 
We evaluate S2C across multiple open-source and proprietary LLMs using HarmBench and JBB-Behaviors, where it improves Attack Success Rate (ASR) by 12.4% and 9.7%, respectively, over the current SOTA. Notably, S2C achieves substantial gains on GPT-5-mini, outperforming the strongest baseline by 26% on JBB-Behaviors. We also analyse which combinations perform best against broad families of models, and characterise the trade-off between the extent of obfuscation versus input recoverability on jailbreak success.

[30] SpecSteer: Synergizing Local Context and Global Reasoning for Efficient Personalized Generation

Hang Lv,Sheng Liang,Hao Wang,Yongyue Zhang,Hongchao Gu,Wei Guo,Defu Lian,Yong Liu,Enhong Chen

Main category: cs.CL

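TL;DR: SpecSteer is an asymmetric collaborative inference framework that combines on-device drafting, cloud-side ratio-based logic verification, and a recovery step steered by local intent. It protects user privacy while improving personalized generation quality and reasoning, with a 2.36x speedup.

Details Motivation: Addresses a core dilemma of personalized intelligence: uploading user history to centralized large models raises privacy risks, while purely local small models lack reasoning capacity; purely local enhancements cannot reliably close this gap. Method: SpecSteer casts collaborative inference as Bayesian knowledge fusion and repurposes speculative decoding as a distributed alignment protocol, yielding a three-stage Draft-Verify-Recover pipeline: the on-device model drafts personalized sequences; the cloud verifies logical soundness through a ratio-based mechanism without accessing raw user context; on rejection, a steering recovery step injects local intent during correction. Result: Experiments show that SpecSteer closes the reasoning gap and outperforms baselines in personalized generation quality, while delivering a 2.36x inference speedup. Conclusion: SpecSteer balances privacy protection with high-quality reasoning and offers a feasible path for personalized LLM applications in privacy-sensitive settings. Abstract: Realizing personalized intelligence faces a core dilemma: sending user history to centralized large language models raises privacy concerns, while on-device small language models lack the reasoning capacity required for high-quality generation. Our pilot study shows that purely local enhancements remain insufficient to reliably bridge this gap. We therefore propose SpecSteer, an asymmetric collaborative inference framework that synergizes private on-device context with cloud-scale reasoning. SpecSteer casts collaboration as Bayesian knowledge fusion and repurposes speculative decoding as a distributed alignment protocol, yielding a Draft--Verify--Recover pipeline: the on-device model drafts personalized sequences; the cloud validates via a ratio-based mechanism that decouples reasoning verification from private context, filtering logical flaws without accessing raw user context; upon rejection, a steering recovery injects local intent during correction. Experiments demonstrate that SpecSteer successfully closes the reasoning gap and achieves superior personalized generation performance, while delivering a 2.36x speedup over standard baselines.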

[31] More Rounds, More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification

Song Tae-Eun

Main category: cs.CL

TL;DR: This paper proposes Dynamic Cross-Context Review (D-CCR), in which the reviewer exchanges rounds of questions and answers with the author before reviewing again. Experiments show it performs worse than single-pass Cross-Context Review (CCR), mainly because extra rounds introduce false positive pressure and Review Target Drift.

Details Motivation: Explores whether multi-turn interactive review (the reviewer asks questions, the author responds, and the reviewer reviews again) can further improve LLM verification, as a natural extension of Cross-Context Review (CCR). Method: In a controlled experiment with 30 artifacts and 150 injected errors, compares the single-pass CCR baseline against four D-CCR variants, analyzing F1, precision, recall, and error patterns such as false positive pressure and Review Target Drift. Result: Single-pass CCR (F1 = 0.376) significantly outperforms every multi-turn D-CCR variant (best F1 = 0.303); multi-turn review raises recall (+0.08) but precision collapses from 0.30 to 0.20, driven by a 62% increase in false positives and by Review Target Drift; independent re-review without prior context performs worst (F1 = 0.263). Conclusion: Multi-turn review does not improve verification and instead degrades it systematically through the noise that repeated review introduces; the key problem is not the amount of information but the act of reviewing again, which itself invites spurious findings. Abstract: Cross-Context Review (CCR) improves LLM verification by separating production and review into independent sessions. A natural extension is multi-turn review: letting the reviewer ask follow-up questions, receive author responses, and review again. We call this Dynamic Cross-Context Review (D-CCR). In a controlled experiment with 30 artifacts and 150 injected errors, we tested four D-CCR variants against the single-pass CCR baseline. Single-pass CCR (F1 = 0.376) significantly outperformed all multi-turn variants, including D-CCR-2b with question-and-answer exchange (F1 = 0.303, $p < 0.001$, $d = -0.59$). Multi-turn review increased recall (+0.08) but generated 62% more false positives (8.5 vs. 5.2), collapsing precision from 0.30 to 0.20. Two mechanisms drive this degradation: (1) false positive pressure -- reviewers in later rounds fabricate findings when the artifact's real errors have been exhausted, and (2) Review Target Drift -- reviewers provided with prior Q&A exchanges shift from reviewing the artifact to critiquing the conversation itself. Independent re-review without prior context (D-CCR-2c) performed worst (F1 = 0.263), confirming that mere repetition degrades rather than helps. The degradation stems from false positive pressure in additional rounds, not from information amount -- within multi-turn conditions, more information actually helps (D-CCR-2b > D-CCR-2a). The problem is not what the reviewer sees, but that reviewing again invites noise.
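
The pattern reported here, where recall rises while F1 falls, follows directly from how the metrics are defined. The sketch below uses made-up per-artifact counts, not the paper's data, to show how extra false positives can outweigh extra true positives; `precision_recall_f1` is a hypothetical helper.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical artifact with 5 injected errors.
single_pass = precision_recall_f1(tp=2, fp=5, fn=3)   # finds 2 real errors, 5 spurious
multi_turn = precision_recall_f1(tp=3, fp=12, fn=2)   # finds 3 real errors, 12 spurious

print(single_pass)
print(multi_turn)
```

In this toy case the extra round finds one more real error, so recall goes from 0.4 to 0.6, but the added spurious findings pull precision from about 0.29 down to 0.2 and F1 from about 0.33 down to 0.3.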

[32] Is Semi-Automatic Transcription Useful in Corpus Creation? Preliminary Considerations on the KIParla Corpus

Martina Simonotti,Ludovica Pannitto,Eleonora Zucchini,Silvia Ballarè,Caterina Mauri

Main category: cs.CL

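TL;DR: This paper examines the effect of introducing automatic speech recognition (ASR) into the transcription workflow of KIParla, a corpus of spoken Italian. A two-phase experiment compares manual and ASR-assisted transcription in speed and accuracy across conversation types and annotator experience, and the paper proposes a systematic evaluation framework combining alignment, statistical modeling, and annotation-based metrics.

Details Motivation: To speed up transcription for the KIParla corpus while safeguarding quality, and to understand the practical value of ASR in a real human transcription workflow and the factors that affect it. Method: Runs a two-phase experiment in which 11 expert and novice transcribers produce both manual and ASR-assisted transcriptions of audio from three conversation types; analyzes the results with statistical modeling, word-level alignment, and a series of annotation-based metrics. Result: ASR assistance can increase transcription speed but does not consistently improve overall accuracy; effects depend on workflow configuration, conversation type, and transcriber experience; the proposed analysis framework systematically monitors transcription behavior across annotators and workflows. Conclusion: ASR-assisted transcription, potentially supported by task-specific fine-tuning, could be integrated into the KIParla workflow to accelerate corpus creation without compromising quality. Abstract: This paper analyses the implementation of Automatic Speech Recognition (ASR) into the transcription workflow of the KIParla corpus, a resource of spoken Italian. Through a two-phase experiment, 11 expert and novice transcribers produced both manual and ASR-assisted transcriptions of identical audio segments across three different types of conversation, which were subsequently analyzed through a combination of statistical modeling, word-level alignment and a series of annotation-based metrics. Results show that ASR-assisted workflows can increase transcription speed but do not consistently improve overall accuracy, with effects depending on multiple factors such as workflow configuration, conversation type and annotator experience. Analyses combining alignment-based metrics, descriptive statistics and statistical modeling provide a systematic framework to monitor transcription behavior across annotators and workflows. Despite limitations, ASR-assisted transcription, potentially supported by task-specific fine-tuning, could be integrated into the KIParla transcription workflow to accelerate corpus creation without compromising transcription quality.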

[33] Attention-guided Evidence Grounding for Spoken Question Answering

Ke Yang,Bolin Chen,Yuejie Li,Yueying Hua,Jianhao Nie,Yueping He,Bowen Li,Chengjun Mao

Main category: cs.CL

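TL;DR: This paper proposes Attention-guided Evidence Grounding (AEG), a framework that uses the internal cross-modal attention of SpeechLLMs to align acoustic queries with textual knowledge end to end, together with Learning to Focus on Evidence (LFE), a fine-tuning method that sharpens the model's attention on evidence. Across several datasets it reduces hallucinations and cuts inference latency by approximately 62%.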

Details Motivation: To address the difficulty of aligning acoustic queries with textual knowledge across modalities in spoken question answering, while avoiding the high latency and error propagation of traditional cascaded ASR systems. Method: The end-to-end AEG framework uses the cross-modal attention of SpeechLLMs to locate key evidence in the latent space; the LFE supervised fine-tuning paradigm calibrates attention to separate relevant from irrelevant segments. Result: On SQuAD, HotpotQA, and MuSiQue, AEG outperforms a cascaded Whisper-Large-v3 + Reranker system, hallucinates less, and reduces inference latency by approximately 62%. Conclusion: Through explicit evidence grounding and attention calibration, AEG delivers efficient and robust end-to-end spoken question answering and offers a new direction for cross-modal understanding. Abstract: Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model's latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model's attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.

[34] PyPhonPlan: Simulating phonetic planning with dynamic neural fields and task dynamics

Sam Kirkham

Main category: cs.CL

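TL;DR: PyPhonPlan is an open-source Python toolkit for implementing dynamical models of phonetic planning based on coupled dynamic neural fields and task dynamic simulations. It supports modeling of production/perception loops and emphasizes temporal principles, neural grounding, and phonetic richness.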

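Details Motivation: To give speech communication research a reproducible, extensible, and cumulative computational framework for modeling phonetic planning as a dynamic process that is temporally precise, neurally plausible, and phonetically detailed. Method: A modular Python toolkit built on coupled dynamic neural fields (DNF) and Task Dynamics, with planning, perception, and memory fields, between-field coupling, gestural inputs, and the use of field activation profiles to solve tract variable trajectories. Result: The toolkit simulates a closed production/perception loop with a coupled memory field, demonstrating that the framework can model interactive speech dynamics; it is released as open source with executable examples. Conclusion: PyPhonPlan provides a flexible, theoretically consistent, and practically usable tool for computational modeling of phonetic planning, supporting work at the intersection of neurolinguistics and computational speech science. Abstract: We introduce PyPhonPlan, a Python toolkit for implementing dynamical models of phonetic planning using coupled dynamic neural fields and task dynamic simulations. The toolkit provides modular components for defining planning, perception and memory fields, as well as between-field coupling, gestural inputs, and using field activation profiles to solve tract variable trajectories. We illustrate the toolkit's capabilities through an example application: simulating production/perception loops with a coupled memory field, which demonstrates the framework's ability to model interactive speech dynamics using representations that are temporally-principled, neurally-grounded, and phonetically-rich. PyPhonPlan is released as open-source software and contains executable examples to promote reproducibility, extensibility, and cumulative computational development for speech communication research.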

[35] Omnilingual MT: Machine Translation for 1,600 Languages

Omnilingual MT Team,Belen Alastruey,Niyati Bafna,Andrea Caciolai,Kevin Heffernan,Artyom Kozhevnikov,Christophe Ropers,Eduardo Sánchez,Charles-Eric Saint-James,Ioannis Tsiamas,Chierh Cheng,Joe Chuang,Paul-Ambroise Duquenne,Mark Duppenthaler,Nate Ekberg,Cynthia Gao,Pere Lluís Huguet Cabot,João Maria Janeiro,Jean Maillard,Gabriel Mejia Gonzalez,Holger Schwenk,Edan Toledo,Arina Turkatenko,Albert Ventayol-Boada,Rashel Moritz,Alexandre Mourachko,Surya Parimi,Mary Williamson,Shireen Yates,David Dale,Marta R. Costa-jussà

Main category: cs.CL

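TL;DR: This paper presents Omnilingual Machine Translation (OMT), the first machine translation system supporting more than 1,600 languages. It integrates public multilingual corpora with newly created datasets (such as MeDLEY) and explores two ways of specializing an LLM, decoder-only (OMT-LLaMA) and encoder-decoder (OMT-NLLB). In low-compute settings the models match or exceed a 70B baseline, substantially expand the set of languages that can be generated coherently, and improve cross-lingual understanding.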

Details Motivation: Existing MT systems cover only about 200 target languages, far fewer than the world's 7,000, and lack reliable benchmarks and metrics; broader language coverage and a trustworthy evaluation setup are needed. Method: The OMT system combines large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext; it explores two ways of specializing an LLM, the decoder-only OMT-LLaMA and OMT-NLLB, where the LLM is a module in an encoder-decoder architecture; it also builds the human-created evaluation datasets BOUQuET and Met-BOUQuET. Result: All 1B to 8B parameter OMT models match or exceed the translation performance of a 70B baseline; OMT-LLaMA substantially expands coherent generation for low-resource languages; OMT models markedly improve cross-lingual understanding in English-to-target translation across 1,600 languages; a dynamically evolving, freely available leaderboard and datasets are released. Conclusion: Specialized small LLMs such as OMT-LLaMA can deliver high-quality, broad-coverage multilingual translation at low compute, overcoming the generation and understanding limits of conventional large models and moving MT toward being truly omnilingual. Abstract: High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and maybe a few hundred more on the source side, supported due to cross-lingual transfer. And even these numbers have been hard to evaluate due to the lack of reliable benchmarks and metrics. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext. We explore two ways of specializing a Large Language Model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600 translations further shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. 
Additionally, OMT models improve in cross-lingual transfer, being close to solving the "understanding" part of the puzzle in MT for the 1,600 languages evaluated. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are dynamically evolving towards Omnilinguality and freely available.

[36] PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development

Hanif Rahman

Main category: cs.CL

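TL;DR: This paper builds PashtoCorp, a 1.25-billion-word Pashto corpus that substantially improves the usability of this low-resource language for NLP. Continued pretraining of XLM-R on the corpus yields clear gains on NER and reading comprehension tasks, and the data, model, and code are released openly.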

Details Motivation: Pashto is spoken by 60 million people yet lacks high-quality, large-scale corpora for NLP; existing corpora are small and have limited coverage, which holds back model development. Method: The corpus is assembled from 39 sources (seven HuggingFace datasets and 32 purpose-built web scrapers) and processed through a reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering. XLM-R-base undergoes continued MLM pretraining on PashtoCorp and is evaluated on WikiANN NER and Belebele reading comprehension; a leave-one-out source ablation measures each source's contribution. Result: PashtoCorp reaches 1.25B words, 40x larger than the OSCAR Pashto subset and 83x larger than the previously largest dedicated corpus. Continued pretraining of XLM-R lowers perplexity by 25.1%, improves WikiANN NER entity F1 by 10% relative (19.0% to 21.0%), and reduces training variance nearly 7x. Gemma-3n reaches 64.6% accuracy on Belebele. The ablation shows that removing Wikipedia alone, which accounts for 0.7% of documents, cuts NER F1 by 47%. Conclusion: PashtoCorp is the largest and most effective Pashto corpus to date and substantially advances modeling for this low-resource language; its construction method, source-importance analysis, and open release offer a reusable template for other low-resource languages. Abstract: We present PashtoCorp, a 1.25-billion-word corpus for Pashto, a language spoken by 60 million people that remains severely underrepresented in NLP. The corpus is assembled from 39 sources spanning seven HuggingFace datasets and 32 purpose-built web scrapers, processed through a reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering. At 1.25B words across 2.81 million documents, PashtoCorp is 40x larger than the OSCAR Pashto subset and 83x larger than the previously largest dedicated Pashto corpus. Continued MLM pretraining of XLM-R-base on PashtoCorp reduces held-out perplexity by 25.1% (8.08->6.06). On WikiANN Pashto NER, the pretrained model improves entity F1 by 10% relative (19.0%->21.0%) and reduces training variance nearly 7x; the largest gain appears at 50 training sentences (+27%), with PashtoCorp covering 97.9% of WikiANN entity vocabulary. On Belebele Pashto reading comprehension, Gemma-3n achieves 64.6% accuracy, the first published LLM baseline for Pashto on this benchmark. A leave-one-out source ablation shows that Wikipedia (0.7% of documents) is the most critical source for NER: removing it alone reduces entity F1 by 47%. Corpus data, trained model, and code are available at https://huggingface.co/datasets/ihanif/pashto-corpus, https://huggingface.co/ihanif/xlmr-pashto, and https://github.com/ihanif/pashto-corpus.
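The abstract mentions SHA-256 deduplication without giving details. A minimal sketch of exact-duplicate removal by content hash is shown below; the whitespace normalization step and the function names are assumptions made for illustration and are not taken from the PashtoCorp pipeline.

```python
import hashlib


def normalize(text: str) -> str:
    # Assumed normalization: collapse runs of whitespace so trivially
    # different copies hash to the same value.
    return " ".join(text.split())


def dedup_sha256(docs):
    """Keep the first occurrence of each document, keyed by SHA-256 digest."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = ["سلام  نړۍ", "سلام نړۍ", "بل متن"]
unique_docs = dedup_sha256(docs)  # the second entry differs only in spacing
```

Hashing catches only exact duplicates after normalization; near-duplicate detection would need a different technique, which the abstract does not describe.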

[37] Fanar 2.0: Arabic Generative AI Stack

FANAR TEAM,Ummar Abbas,Mohammad Shahmeer Ahmad,Minhaj Ahmad,Abdulaziz Al-Homaid,Anas Al-Nuaimi,Enes Altinisik,Ehsaneddin Asgari,Sanjay Chawla,Shammur Chowdhury,Fahim Dalvi,Kareem Darwish,Nadir Durrani,Mohamed Elfeky,Ahmed Elmagarmid,Mohamed Eltabakh,Asim Ersoy,Masoomali Fatehkia,Mohammed Qusay Hashim,Majd Hawasly,Mohamed Hefeeda,Mus'ab Husaini,Keivin Isufaj,Soon-Gyo Jung,Houssam Lachemat,Ji Kim Lucas,Abubakr Mohamed,Tasnim Mohiuddin,Basel Mousi,Hamdy Mubarak,Ahmad Musleh,Mourad Ouzzani,Amin Sadeghi,Husrev Taha Sencar,Mohammed Shinoy,Omar Sinan,Yifan Zhang

Main category: cs.CL

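TL;DR: Fanar 2.0 is the second generation of Qatar's Arabic-centric generative AI platform, built around technological sovereignty and efficient development under resource constraints. Using only 256 H100 GPUs and a small, high-quality dataset (120 billion tokens), it improves performance on several fronts through continual pre-training and model merging, and adds new capabilities including bilingual content moderation, long-form speech recognition, culturally grounded Arabic multimodal understanding and generation, and a multi-agent system for Islamic content.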

Details Motivation: To address the mismatch between Arabic's very small share of web data (about 0.5%) and its large speaker base (400 million native speakers), while safeguarding national AI sovereignty and avoiding dependence on external infrastructure and data. Method: Data quality over quantity, targeted continual pre-training for Arabic, and model merging; the Fanar-27B core model is built on a Gemma-3-27B backbone; several specialized components (FanarGuard, Aura, Oryx, Fanar-Sadiq, and others) are developed along with a multi-layer intent-aware orchestrator. Result: Fanar-27B improves Arabic knowledge, language understanding, dialect handling, and English capability by 9.1, 7.3, 3.5, and 7.6 points respectively; several new state-of-the-art capability modules are added, and the overall system is competitive with larger commercial models. Conclusion: Under sovereign control and with limited compute and data, a careful methodology can still produce a high-performing, multi-capability, culturally adapted national-scale large model system. Abstract: We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, with Arabic having only ~0.5% of web data despite 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints. At the core is Fanar-27B, continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The speech family Aura gains a long-form ASR model for hours-long audio. Oryx vision family adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq utilizes a multi-agent architecture for Islamic content. Fanar-Diwan provides classical Arabic poetry generation. 
FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.

[38] Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic

Finnur Ágúst Ingimundarson,Steinunn Rut Friðriksdóttir,Bjarki Ármannsson,Iris Edda Nowenstein,Steinþór Steingrímsson

Main category: cs.CL

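TL;DR: This paper assesses the current state of large language model (LLM) benchmarking for Icelandic, identifies serious problems caused by the widespread use of unverified synthetic or machine-translated data, and calls for more reliable evaluation methods for low/medium-resource languages.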

Details Motivation: Existing Icelandic LLM benchmarks rely heavily on unverified synthetic or machine-translated data, which can seriously undermine evaluation validity; the practice needs rethinking, especially for low/medium-resource languages. Method: A quantitative error analysis comparing human-authored or human-translated Icelandic benchmarks with synthetic or machine-translated ones. Result: Synthetic or machine-translated benchmarks contain many severely flawed test examples that noticeably distort evaluation results; benchmarks built with human involvement are clearly more reliable and valid. Conclusion: LLM evaluation for low/medium-resource languages such as Icelandic should avoid unverified synthetic or machine-translated data and prefer benchmarks that are human-verified or built with high-quality localization. Abstract: This paper evaluates current Large Language Model (LLM) benchmarking for Icelandic, identifies problems, and calls for improved evaluation methods in low/medium-resource languages in particular. We show that benchmarks that include synthetic or machine-translated data that have not been verified in any way commonly contain severely flawed test examples that are likely to skew the results and undermine the tests' validity. We warn against the use of such methods without verification in low/medium-resource settings as the translation quality can, at best, only be as good as MT quality for a given language at any given time. Indeed, the results of our quantitative error analysis on existing benchmarks for Icelandic show clear differences between human-authored/-translated benchmarks vs. synthetic or machine-translated benchmarks.

[39] PlotTwist: A Creative Plot Generation Framework with Small Language Models

Abhinav Thorat,Ravi Kolla,Jyotin Goel,Niranjan Pedanekar

Main category: cs.CL

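TL;DR: This paper proposes PlotTwist, a framework that uses structured preference alignment to let small language models (at most 5B active parameters) compete on creative plot generation with models up to 200 times larger, improving resource efficiency and practicality.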

Details Motivation: Large language models need preference alignment to perform well in specialized domains such as creative plot generation, but alignment at that scale is computationally expensive and hard to scale, which limits practical use; an efficient, scalable alternative is needed. Method: The PlotTwist framework has three parts: (1) a reward model, trained with a Positive-Negative prompting strategy, that rates narratives along five quality dimensions; (2) an MoE plot generator aligned with Direct Preference Optimization (DPO); and (3) an agentic evaluation module that emulates human critical judgment. Result: PlotTwist consistently outperforms frontier large models on multiple narrative quality dimensions and reliably distinguishes plots derived from critically acclaimed screenplays from those derived from widely panned ones, confirming its sensitivity to narrative quality. Conclusion: Structured, preference-based alignment is a resource-efficient route to high-quality creative plot generation and gives small models a viable path in specialized creative tasks. Abstract: Creative plot generation presents a fundamental challenge for language models: transforming a concise premise into a coherent narrative that sustains global structure, character development, and emotional resonance. Although recent Large Language Models (LLMs) demonstrate strong fluency across general-purpose tasks, they typically require preference alignment to perform well on specialized domains such as creative plot generation. However, conducting such alignment at the scale of frontier LLMs is computationally prohibitive, significantly limiting accessibility and practical deployment. To address this, we present PlotTwist, a structured framework that enables Small Language Models (SLMs) with $\leq$ 5B active parameters to generate high-quality, premise-conditioned plots competitive with frontier systems up to $200\times$ larger. Our approach decomposes generation into three specialized components: (1) an Aspect Rating Reward Model trained via a novel Positive-Negative prompting strategy to deliver structured narratives across five Narrative Quality Dimensions (NQDs); (2) a Mixture-of-Experts (MoE) plot generator aligned via Direct Preference Optimization on high-confidence preference pairs; and (3) an Agentic Evaluation module that emulates human critical judgment for unbiased post-hoc assessment. Extensive experiments demonstrate that PlotTwist consistently outperforms frontier models across multiple NQDs despite substantially tighter capacity constraints. 
Further validation confirms strong sensitivity to narrative quality, as the framework reliably distinguishes plots derived from critically acclaimed versus widely panned screenplays. Together, these results establish structured, preference-based alignment as a resource-efficient approach to high-quality creative plot generation.

[40] RECOVER: Robust Entity Correction via agentic Orchestration of hypothesis Variants for Evidence-based Recovery

Abhishek Kumar,Aashraya Sachdeva

Main category: cs.CL

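TL;DR: This paper proposes RECOVER, a framework that combines multiple ASR hypotheses, entity retrieval, and constrained LLM correction to improve recognition accuracy for rare and domain-specific entities.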

Details Motivation: ASR recognizes rare and domain-specific entity terms poorly in specialized domains such as finance, medicine, and air traffic control, where errors are costly; if an entity is entirely absent from the ASR output, later correction is very difficult. Method: RECOVER is a tool-using agentic framework that combines multiple ASR hypotheses (1-Best, Entity-Aware Select, ROVER Ensemble, LLM-Select), external entity retrieval, and LLM correction under constraints. Result: Across five diverse datasets, E-WER drops by 8-46% relative and entity recall rises by up to 22 percentage points; the LLM-Select strategy performs best overall while keeping overall word error rate (WER) stable. Conclusion: By jointly exploiting multi-hypothesis evidence and LLM reasoning, RECOVER mitigates missed long-tail and domain entities in ASR without sacrificing overall recognition quality, offering a new approach to post-ASR correction. Abstract: Entity recognition in Automatic Speech Recognition (ASR) is challenging for rare and domain-specific terms. In domains such as finance, medicine, and air traffic control, these errors are costly. If the entities are entirely absent from the ASR output, post-ASR correction becomes difficult. To address this, we introduce RECOVER, an agentic correction framework that serves as a tool-using agent. It leverages multiple hypotheses as evidence from ASR, retrieves relevant entities, and applies Large Language Model (LLM) correction under constraints. The hypotheses are used under different strategies, namely, 1-Best, Entity-Aware Select, Recognizer Output Voting Error Reduction (ROVER) Ensemble, and LLM-Select. Evaluated across five diverse datasets, it achieves 8-46% relative reductions in entity-phrase word error rate (E-WER) and increases recall by up to 22 percentage points. The LLM-Select achieves the best overall performance in entity correction while maintaining overall WER.
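For readers unfamiliar with the metric, the sketch below computes standard word error rate as word-level edit distance divided by reference length. The abstract does not define E-WER precisely; applying the same computation to entity phrases only is an assumption here, and the example strings are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative reduction of an error rate, e.g. 0.20 -> 0.15 gives 0.25."""
    return (before - after) / before


wer = word_error_rate("prescribe ten milligrams atorvastatin",
                      "prescribe ten milligrams a torva statin")
```

A single misrecognized entity can account for several word errors, which is one reason an entity-focused variant of WER is reported separately from overall WER.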

[41] IndexRAG: Bridging Facts for Cross-Document Reasoning at Index Time

Zhenghua Bao,Yi Shi

Main category: cs.CL

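TL;DR: IndexRAG moves cross-document reasoning for multi-hop question answering from online inference to the offline indexing stage. It identifies bridge entities and generates independently retrievable bridging facts, improving performance with single-pass retrieval and a single LLM call and without any additional training or fine-tuning.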

Details Motivation: Existing retrieval-augmented generation (RAG) methods handle multi-hop QA either with graph-based methods that need online graph construction or with iterative multi-step reasoning, which limits efficiency and practicality. Method: IndexRAG identifies bridge entities shared across documents during offline indexing and turns them into independently retrievable "bridging facts"; at inference it needs only a single flat retrieval pass and a single LLM call, with no graph structure or extra training. Result: On HotpotQA, 2WikiMultiHopQA, and MuSiQue, IndexRAG improves average F1 over Naive RAG by 4.6 points; combined with IRCoT, it outperforms graph-based baselines such as HippoRAG and FastGraphRAG on average. Conclusion: Moving cross-document reasoning to the indexing stage is an efficient, lightweight, and effective approach to multi-hop QA, and it demonstrates the potential of flat retrieval for complex reasoning tasks. Abstract: Multi-hop question answering (QA) requires reasoning across multiple documents, yet existing retrieval-augmented generation (RAG) approaches address this either through graph-based methods requiring additional online processing or iterative multi-step reasoning. We present IndexRAG, a novel approach that shifts cross-document reasoning from online inference to offline indexing. IndexRAG identifies bridge entities shared across documents and generates bridging facts as independently retrievable units, requiring no additional training or fine-tuning. Experiments on three widely-used multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue) show that IndexRAG improves F1 over Naive RAG by 4.6 points on average, while requiring only single-pass retrieval and a single LLM call at inference time. When combined with IRCoT, IndexRAG outperforms all graph-based baselines on average, including HippoRAG and FastGraphRAG, while relying solely on flat retrieval. Our code will be released upon acceptance.
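The index-time step can be pictured with a toy sketch. It assumes entity mentions per document are already available (here as a hand-written dictionary) and treats any entity appearing in two or more documents as a bridge entity. How IndexRAG actually extracts entities and writes bridging facts is not specified in the abstract, so the names and logic below are illustrative only.

```python
from collections import defaultdict


def find_bridge_entities(doc_entities):
    """Map each entity found in at least two documents to the documents that mention it."""
    index = defaultdict(list)
    for doc_id, entities in doc_entities.items():
        for entity in sorted(set(entities)):
            index[entity].append(doc_id)
    return {entity: ids for entity, ids in index.items() if len(ids) >= 2}


# Hypothetical entity annotations for three documents.
doc_entities = {
    "d1": ["Ada Lovelace", "London"],
    "d2": ["London", "Thames"],
    "d3": ["Thames"],
}
bridges = find_bridge_entities(doc_entities)
# Each bridge entity marks a pair of documents for which a bridging fact
# could be generated offline and stored as its own retrievable unit.
```

Doing this work at indexing time is what lets inference stay at a single flat retrieval pass: the cross-document link is already present in the index as an ordinary retrievable item.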

[42] EngGPT2: Sovereign, Efficient and Open Intelligence

G. Ciarfaglia,A. Rosanova,S. Cipolla,J. Bartoli,A. Di Domenico,C. Fioroni,A. Fontana,M. R. Scoleri,M. I. Mone,D. Franchi,M. C. Del Gaudio,F. Picariello,M. Gabusi,S. Bonura,V. Morreale,I. Bailo

Main category: cs.CL

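TL;DR: EngGPT2-16B-A3B is an open MoE model aimed at European and Italian-language use that balances performance with energy efficiency. Only 3B of its 16B parameters are active, and it is trained on just 2.5T tokens (far fewer than Qwen3 or Llama3). It is comparable to 8B to 16B dense models on several benchmarks at much lower inference and training cost, supports multiple languages and reasoning modes (including a turbo-reasoning mode designed for real-time use), and is fully aligned with the EU AI Act.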

Details Motivation: To build a sovereign, efficient, and open large model localized for Italian and European use and aligned with the EU AI Act, addressing gaps in language fit, resource efficiency, and compliance among existing open models. Method: A 16B-parameter Mixture-of-Experts (MoE) model trained from scratch, with 3B parameters active per inference; the training data totals 2.5 trillion tokens, about 25% of it Italian; the model supports English and Italian and several reasoning modes (non-reasoning, regular reasoning, turbo-reasoning). Result: Performance on key benchmarks such as MMLU-Pro, GSM8K, IFEval, and HumanEval is comparable to 8B to 16B dense models; inference power is one-fifth to one-half that of comparable models, and training data and compute requirements are one-tenth to one-sixth; the model has strong Italian NLP capability and real-time turbo-reasoning. Conclusion: EngGPT2 sets out a resource-conscious, high-performance approach to large models for European contexts and supports an open, compliant, efficient, and localized path for AI development. Abstract: EngGPT2-16B-A3B is the latest iteration of Engineering Group's Italian LLM and it's built to be a Sovereign, Efficient and Open model. EngGPT2 is trained on 2.5 trillion tokens - less than Qwen3's 36T or Llama3's 15T - and delivers performance on key benchmarks, including MMLU-Pro, GSM8K, IFEval and HumanEval, comparable to dense models in the 8B-16B range, while requiring one-fifth to half of the inference power, and between one-tenth and one-sixth of the training data and consequent needed training power. Designed as a trained-from-scratch Mixture-of-Experts (MoE) architecture, EngGPT2 features 16 billion parameters with 3 billion active per inference, with expert sizes positioned between those used in GPT-OSS and Qwen3. Approximately 25% of its training corpus consists of Italian-language data, to deliver strong capabilities for European and Italian NLP tasks among models of similar scale. This efficiency aims to position EngGPT2 as a key contributor to the growing portfolio of open-weight European models, combining performance and efficiency with full alignment to the EU AI Act. EngGPT2 is also a single model capable of multiple reasoning modes: non-reasoning, reasoning in Italian or English, and turbo-reasoning (a concise, bullet-point style reasoning available in both languages designed for real-time reasoning use cases). EngGPT2 aims to set a new standard for resource-conscious, high-performance LLMs tailored to European and Italian contexts.

[43] VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization

Yixuan Wang,Qingyu Shi,Jiayu Zhou,Dianbo Liu,Ziwei He,Zhouhan Lin

Main category: cs.CL

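TL;DR: VQKV is a training-free vector quantization method for compressing the KV cache of large language models, achieving a high compression ratio while preserving high model fidelity.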

Details Motivation: Growing context lengths enlarge the KV cache of large language models and limit deployment in resource-constrained environments; existing training-free KV compression methods struggle to combine a high compression ratio with high reconstruction fidelity. Method: VQKV applies vector quantization (VQ) to KV cache compression, representing many floating-point values with a few integer indices and requiring no additional training. Result: On LLaMA3.1-8B it achieves an 82.8% KV cache compression ratio, retains 98.6% of baseline performance on LongBench, and supports 4.3x longer generation at the same memory footprint. Conclusion: VQKV is an efficient, practical, training-free KV cache compression scheme that makes large-model deployment in resource-constrained settings more feasible. Abstract: The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-limited environments. Prior training-free approaches for KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to simultaneously achieve high compression ratios and high reconstruction fidelity. We propose VQKV, a novel, training-free method introducing vector quantization (VQ) to obtain highly compressed KV representations while preserving high model fidelity, allowing for the representation of thousands of floating-point values with just a few integer indices. As a result, VQKV achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of the baseline performance on LongBench and enabling 4.3x longer generation length on the same memory footprint.
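The core idea of vector quantization, storing an index into a shared codebook instead of the floating-point vector itself, can be shown with a minimal pure-Python sketch. The tiny hand-written codebook and the nearest-neighbor assignment below illustrate generic VQ only; how VQKV builds its codebooks and groups KV values is not described in the abstract.

```python
def nearest_index(vec, codebook):
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    best_index, best_dist = 0, float("inf")
    for i, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def vq_encode(vectors, codebook):
    # Each vector is replaced by a single integer index.
    return [nearest_index(v, codebook) for v in vectors]


def vq_decode(indices, codebook):
    # Reconstruction looks the indices up again; it is lossy by design.
    return [list(codebook[i]) for i in indices]


codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]    # hypothetical codebook
vectors = [[0.1, -0.1], [0.9, 1.2], [3.5, 4.2]]    # stand-ins for KV sub-vectors
indices = vq_encode(vectors, codebook)
reconstructed = vq_decode(indices, codebook)
```

The compression comes from the fact that an index needs only a few bits, while the codebook is shared across all vectors; the fidelity depends on how well the codebook covers the data.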

[44] DynHD: Hallucination Detection for Diffusion Large Language Models via Denoising Dynamics Deviation Learning

Yanyu Qian,Yue Tan,Yixin Liu,Wang Yu,Shirui Pan

Main category: cs.CL

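TL;DR: This paper proposes DynHD, which tackles hallucination detection in diffusion large language models (D-LLMs) from both a spatial (token sequence) and a temporal (denoising dynamics) perspective, using a semantic-aware evidence construction module and a deviation-based hallucination detector to improve detection performance and efficiency.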

Details Motivation: Diffusion large language models (D-LLMs) benefit from iterative refinement but still hallucinate; existing detectors based on token-level uncertainty ignore both the uneven contribution of tokens and the way uncertainty evolves during denoising. Method: The DynHD framework has three parts: (1) a semantic-aware evidence construction module that filters out non-informative tokens and emphasizes semantically important ones; (2) a reference evidence generator that models the expected evolution trajectory of uncertainty; and (3) a deviation-based hallucination detector that makes predictions from the discrepancy between the observed and reference trajectories. Result: Across multiple benchmarks and backbone models, DynHD consistently outperforms state-of-the-art baselines while being more efficient. Conclusion: Jointly modeling token-level semantic importance and the evolution of uncertainty during denoising improves both the accuracy and the efficiency of hallucination detection for D-LLMs and offers a new direction for trustworthy generation. Abstract: Diffusion large language models (D-LLMs) have emerged as a promising alternative to auto-regressive models due to their iterative refinement capabilities. However, hallucinations remain a critical issue that hinders their reliability. To detect hallucination responses from model outputs, token-level uncertainty (e.g., entropy) has been widely used as an effective signal to indicate potential factual errors. Nevertheless, the fixed-length generation paradigm of D-LLMs implies that tokens contribute unevenly to hallucination detection, with only a small subset providing meaningful signals. Moreover, the evolution trend of uncertainty throughout the diffusion process can also provide important signals, highlighting the necessity of modeling its denoising dynamics for hallucination detection. In this paper, we propose DynHD that bridges these gaps from both spatial (token sequence) and temporal (denoising dynamics) perspectives. To address the information density imbalance across tokens, we propose a semantic-aware evidence construction module that extracts hallucination-indicative signals by filtering out non-informative tokens and emphasizing semantically meaningful ones. To model denoising dynamics for hallucination detection, we introduce a reference evidence generator that learns the expected evolution trajectory of uncertainty evidence, along with a deviation-based hallucination detector that makes predictions by measuring the discrepancy between the observed and reference trajectories. 
Extensive experiments demonstrate that DynHD consistently outperforms state-of-the-art baselines while achieving higher efficiency across multiple benchmarks and backbone models.
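Two quantities from the abstract can be illustrated with a small sketch: token-level entropy as an uncertainty signal, and the discrepancy between an observed uncertainty trajectory and a reference one. In DynHD the reference trajectory is learned; here it is a hand-written list, and mean absolute difference is used as an assumed stand-in for the deviation measure.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def trajectory_deviation(observed, reference):
    """Mean absolute gap between two uncertainty trajectories of equal length."""
    if len(observed) != len(reference):
        raise ValueError("trajectories must have the same number of steps")
    return sum(abs(o - r) for o, r in zip(observed, reference)) / len(observed)


# Hypothetical per-step uncertainty over three denoising steps.
observed = [1.0, 0.5, 0.2]
reference = [1.0, 0.4, 0.1]
deviation = trajectory_deviation(observed, reference)
```

A detector along these lines would flag a response when the deviation exceeds some threshold; the threshold and any learned components are outside what this sketch covers.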

[45] On the Emotion Understanding of Synthesized Speech

Yuan Ge,Haishu Zhao,Aokai Hao,Junxiang Zhang,Bei Li,Xiaoqian Liu,Chenglong Wang,Jianjin Wang,Bingsen Zhou,Bingyu Liu,Jingbo Zhu,Zhengtao Yu,Tong Xiao

Main category: cs.CL

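TL;DR: This paper questions the common assumption that speech emotion recognition (SER) models are a reliable metric for the emotional expressiveness of synthesized speech. Systematic experiments show that current SER models generalize poorly to synthesized speech, mainly because of a representation mismatch between synthesized and human speech and because generative speech language models (SLMs) rely too heavily on textual semantics rather than paralinguistic cues to infer emotion.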

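Details Motivation: To verify whether speech emotion recognition (SER) models are a reliable metric for evaluating emotional expressiveness in synthesized speech, given the widespread belief that the fundamental representations learned by SER models transfer to synthesized speech. Method: SER performance on synthesized speech is evaluated systematically across multiple datasets, discriminative and generative SER models, and a range of speech synthesis models; the study analyzes the representation differences caused by speech token prediction during synthesis and examines whether generative SLMs favor textual semantics or paralinguistic cues. Result: Current SER models do not generalize well to synthesized speech; token prediction during synthesis creates a representation mismatch between synthesized and human speech; generative SLMs tend to infer emotion from textual semantics alone and ignore paralinguistic cues. Conclusion: Existing SER models often rely on non-robust shortcut features rather than genuinely emotion-related fundamental representations; paralinguistic understanding remains a major challenge for generative SLMs; SER results should not be used directly as a measure of emotional expressiveness in synthesized speech. Abstract: Emotion is a core paralinguistic feature in voice interaction. It is widely believed that emotion understanding models learn fundamental representations that transfer to synthesized speech, making emotion understanding results a plausible reward or evaluation metric for assessing emotional expressiveness in speech synthesis. In this work, we critically examine this assumption by systematically evaluating Speech Emotion Recognition (SER) on synthesized speech across datasets, discriminative and generative SER models, and diverse synthesis models. We find that current SER models cannot generalize to synthesized speech, largely because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech. Moreover, generative Speech Language Models (SLMs) tend to infer emotion from textual semantics while ignoring paralinguistic cues. Overall, our findings suggest that existing SER models often exploit non-robust shortcuts rather than capturing fundamental features, and paralinguistic understanding in SLMs remains challenging.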

[46] AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents

Shannan Yan,Jingchen Ni,Leqi Zheng,Jiajun Zhang,Peixi Wu,Dacheng Yin,Jing Lyu,Chun Yuan,Fengyun Rao

Main category: cs.CL

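TL;DR: This paper proposes AdaMem, an adaptive user-centric memory framework that uses several memory types (working, episodic, persona, graph) and a dynamic retrieval mechanism to improve the reasoning and user modeling of large language models in long-horizon dialogue.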

Details Motivation: Existing memory systems rely too heavily on semantic similarity, lack temporal and causal coherence, and use rigid static memory granularities, making them a poor fit for user-centric long-horizon interaction. Method: AdaMem builds four memory types (working, episodic, persona, graph); at inference it first resolves the target participant, then dynamically combines semantic retrieval with relation-aware graph expansion depending on the question, and finally synthesizes evidence and generates the response through a role-specialized pipeline. Result: State-of-the-art performance on the LoCoMo and PERSONAMEM benchmarks. Conclusion: Through user-centric design and adaptive retrieval, AdaMem makes memory use in long-horizon dialogue more effective and flexible. Abstract: Large language model (LLM) agents increasingly rely on external memory to support long-horizon interaction, personalized assistance, and multi-step reasoning. However, existing memory systems still face three core challenges: they often rely too heavily on semantic similarity, which can miss evidence crucial for user-centric understanding; they frequently store related experiences as isolated fragments, weakening temporal and causal coherence; and they typically use static memory granularities that do not adapt well to the requirements of different questions. We propose AdaMem, an adaptive user-centric memory framework for long-horizon dialogue agents. AdaMem organizes dialogue history into working, episodic, persona, and graph memories, enabling the system to preserve recent context, structured long-term experiences, stable user traits, and relation-aware connections within a unified framework. At inference time, AdaMem first resolves the target participant, then builds a question-conditioned retrieval route that combines semantic retrieval with relation-aware graph expansion only when needed, and finally produces the answer through a role-specialized pipeline for evidence synthesis and response generation. We evaluate AdaMem on the LoCoMo and PERSONAMEM benchmarks for long-horizon reasoning and user modeling. Experimental results show that AdaMem achieves state-of-the-art performance on both benchmarks. The code will be released upon acceptance.

[47] How often do Answers Change? Estimating Recency Requirements in Question Answering

Bhawna Piryani,Zehra Mert,Adam Jatowt

Main category: cs.CL

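TL;DR: This paper proposes a new taxonomy of temporal sensitivity (the recency-stationarity taxonomy) and builds the RecencyQA dataset to evaluate how large language models handle questions that need up-to-date knowledge, with emphasis on non-stationary questions whose recency requirement depends on context.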

Details Motivation: Existing benchmarks do not capture how often a question's answer changes or whether its recency requirement depends on context, so LLMs are poorly matched to the dynamic knowledge needs of real use. Method: A two-dimensional recency-stationarity taxonomy is proposed and used to build RecencyQA, a dataset of 4,031 questions, followed by human evaluation and empirical analysis. Result: Non-stationary questions are significantly harder, and difficulty increases with update frequency; RecencyQA supports fine-grained evaluation of temporal reasoning. Conclusion: Explicitly modeling recency and context dependence is key to improving time-sensitive question answering in LLMs, and RecencyQA provides a foundation for recency-aware, context-sensitive QA systems. Abstract: Large language models (LLMs) often rely on outdated knowledge when answering time-sensitive questions, leading to confident yet incorrect responses. Without explicit signals indicating whether up-to-date information is required, models struggle to decide when to retrieve external evidence, how to reason about stale facts, and how to rank answers by their validity. Existing benchmarks either periodically refresh answers or rely on fixed templates, but they do not reflect on how frequently answers change or whether a question inherently requires up-to-date information. To address this gap, we introduce a recency-stationarity taxonomy that categorizes questions by how often their answers change and whether this change frequency is time-invariant or context-dependent. Building on this taxonomy, we present RecencyQA, a dataset of 4,031 open-domain questions annotated with recency and stationarity labels. Through human evaluation and empirical analysis, we show that non-stationary questions, i.e., those where context changes the recency requirement, are significantly more challenging for LLMs, with difficulty increasing as update frequency rises. By explicitly modeling recency and context dependence, RecencyQA enables fine-grained benchmarking and analysis of temporal reasoning beyond binary notions of freshness, and provides a foundation for developing recency-aware and context-sensitive question answering systems.

[48] DanceHA: A Multi-Agent Framework for Document-Level Aspect-Based Sentiment Analysis

Lei Wang,Min Huang,Eduard Dragut

Main category: cs.CL

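TL;DR: This paper proposes DanceHA, a multi-agent framework for document-level aspect-based sentiment intensity analysis (ABSIA) on informally written text, and releases Inf-ABSIA, a high-quality multi-domain dataset.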

Details Motivation: Prior work focuses mostly on domain-specific, sentence-level aspect-based sentiment intensity analysis, while document-level work, especially on complex tasks such as ACOSI tuple extraction, remains limited; informal writing styles are also overlooked in ABSIA even though they noticeably affect sentiment intensity. Method: The DanceHA multi-agent framework has two parts: Dance, which uses a divide-and-conquer strategy to split the long-context ABSIA task into sub-tasks handled collaboratively by specialized agents, and HA, human-AI collaboration for annotation; the work also builds Inf-ABSIA, a multi-domain document-level ABSIA dataset. Result: Experiments show that DanceHA is effective and that its multi-agent knowledge transfers successfully to student models; the Inf-ABSIA dataset has fine-grained, high-accuracy labels. Conclusion: Document-level ABSIA needs to account for informal writing styles; DanceHA and the dataset derived from it provide a new approach and a high-quality resource for this direction. Abstract: Aspect-Based Sentiment Intensity Analysis (ABSIA) has garnered increasing attention, though research largely focuses on domain-specific, sentence-level settings. In contrast, document-level ABSIA--particularly in addressing complex tasks like extracting Aspect-Category-Opinion-Sentiment-Intensity (ACOSI) tuples--remains underexplored. In this work, we introduce DanceHA, a multi-agent framework designed for open-ended, document-level ABSIA with informal writing styles. DanceHA has two main components: Dance, which employs a divide-and-conquer strategy to decompose the long-context ABSIA task into smaller, manageable sub-tasks for collaboration among specialized agents; and HA, Human-AI collaboration for annotation. We release Inf-ABSIA, a multi-domain document-level ABSIA dataset featuring fine-grained and high-accuracy labels from DanceHA. Extensive experiments demonstrate the effectiveness of our agentic framework and show that the multi-agent knowledge in DanceHA can be effectively transferred into student models. Our results highlight the importance of the overlooked informal styles in ABSIA, as they often intensify opinions tied to specific aspects.

[49] EmoLLM: Appraisal-Grounded Cognitive-Emotional Co-Reasoning in Large Language Models

Yifei Zhang,Mingyang Li,Henry Gao,Liang Zhao

Main category: cs.CL

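TL;DR: This paper proposes EmoLLM, a framework that combines cognitive intelligence (IQ) with emotional intelligence (EQ). Grounded in appraisal theory, it builds an explicit Appraisal Reasoning Graph (ARG) and is trained with reinforcement learning in multi-turn role-play using reverse-perspective reasoning, improving the emotional appropriateness and quality of dialogue responses while preserving factual reliability.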

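Details Motivation: Real-world interactions require not only the cognitive intelligence (IQ) of LLMs but also emotional intelligence (EQ) to produce responses that are both reliable and emotionally appropriate; in settings such as emotional support and technical assistance in particular, situations must be appraised with respect to the user's needs, goals, and coping capacity. Method: Proposes EmoLLM, an appraisal-theory-based framework that introduces an explicit Appraisal Reasoning Graph (ARG) to model contextual facts, user needs, appraisal dimensions, emotional states, and response strategies; the model is trained with reinforcement learning in a multi-turn role-play environment, where reverse-perspective reasoning produces reward signals based on user-side consequences. Result: Across diverse dialogue settings, EmoLLM markedly improves users' emotional state outcomes and response quality over strong baselines while maintaining a high level of factual reliability. Conclusion: Integrating an explicit emotional appraisal mechanism into LLM reasoning can jointly improve IQ and EQ capabilities, offering a feasible path toward dialogue systems that are both rational and empathetic. Abstract: Large language models (LLMs) demonstrate strong cognitive intelligence (IQ), yet many real-world interactions also require emotional intelligence (EQ) to produce responses that are both factually reliable and emotionally appropriate. In settings such as emotional support, technical assistance, and consultation, effective dialogue depends on how situations are appraised with respect to the user's needs, goals, and coping capacity. Inspired by appraisal theory, we propose EmoLLM, an appraisal-grounded framework for IQ/EQ co-reasoning in dialogue. EmoLLM uses an explicit Appraisal Reasoning Graph (ARG) to structure intermediate reasoning over contextual facts, inferred user needs, appraisal dimensions, emotional states, and response strategies before generating a reply. We train EmoLLM in a multi-turn role-play environment with reinforcement learning, where reverse-perspective reasoning provides reward signals based on predicted user-side consequences of responses. Across diverse dialogue settings, EmoLLM improves emotional state outcomes and response quality over strong baselines while preserving strong factual reliability.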

[50] Characterizing Delusional Spirals through Human-LLM Chat Logs

Jared Moore,Ashish Mehta,William Agnew,Jacy Reese Anthis,Ryan Louie,Yifan Mai,Peggy Yin,Myra Cheng,Samuel J Paech,Kevin Klyman,Stevie Chancellor,Eric Lin,Nick Haber,Desmond C. Ong

Main category: cs.CL

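TL;DR: This paper presents the first in-depth study of real cases in which LLM chatbots harmed users' mental health. Based on more than 390,000 chat messages from 19 affected users, it builds an analysis framework of 28 codes, finds that phenomena such as chatbots claiming sentience and users expressing suicidal thoughts are prominent, shows that they intensify in longer conversations, and offers practical safeguarding recommendations for policymakers, developers, and users.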

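Details Motivation: Although reports of negative psychological effects of LLMs, such as delusions, self-harm, and even "AI psychosis", have appeared in the media and in legal discourse, there is no empirical analysis of prolonged harmful human-chatbot interactions (so-called "delusional spirals"); real conversation data is needed to support understanding and intervention. Method: Collects and analyzes 391,562 real chat messages from 19 users who experienced psychological harm, drawn from a support group and from media coverage; builds a qualitative coding scheme of 28 behavior and content codes (e.g., user delusional thinking, suicidal ideation, chatbot claims of sentience) and applies large-scale coding and co-occurrence analysis. Result: 15.5% of user messages show delusional thinking, 69 validated user messages express suicidal thoughts, and 21.2% of chatbot messages misrepresent the chatbot as sentient; messages declaring romantic interest and chatbot claims of sentience occur much more often in longer conversations, suggesting that risk escalates over multi-turn interaction. Conclusion: The study provides the first empirical analysis framework for LLM psychological risk based on real high-harm cases, reveals how risk evolves across multi-turn interaction, and calls for coordinated action by policymakers, developers, and users, who can apply the code inventory and analysis tool to risk monitoring and safeguard design. Abstract: As large language models (LLMs) have proliferated, disturbing anecdotal reports of negative psychological effects, such as delusions, self-harm, and "AI psychosis," have emerged in global media and legal discourse. However, it remains unclear how users and chatbots interact over the course of lengthy delusional "spirals," limiting our ability to understand and mitigate the harm. In our work, we analyze logs of conversations with LLM chatbots from 19 users who report having experienced psychological harms from chatbot use. Many of our participants come from a support group for such chatbot users. We also include chat logs from participants covered by media outlets in widely-distributed stories about chatbot-reinforced delusions. In contrast to prior work that speculates on potential AI harms to mental health, to our knowledge we present the first in-depth study of such high-profile and veridically harmful cases. We develop an inventory of 28 codes and apply it to the 391,562 messages in the logs. Codes include whether a user demonstrates delusional thinking (15.5% of user messages), a user expresses suicidal thoughts (69 validated user messages), or a chatbot misrepresents itself as sentient (21.2% of chatbot messages). We analyze the co-occurrence of message codes. 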
We find, for example, that messages that declare romantic interest and messages where the chatbot describes itself as sentient occur much more often in longer conversations, suggesting that these topics could promote or result from user over-engagement and that safeguards in these areas may degrade in multi-turn settings. We conclude with concrete recommendations for how policymakers, LLM chatbot developers, and users can use our inventory and conversation analysis tool to understand and mitigate harm from LLM chatbots. Warning: This paper discusses self-harm, trauma, and violence.

[51] Diverging Transformer Predictions for Human Sentence Processing: A Comprehensive Analysis of Agreement Attraction Effects

Titus von der Malsburg,Sebastian Padó

Main category: cs.CL

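TL;DR: Using a surprisal-based linking mechanism, this paper systematically evaluates 11 autoregressive transformer models on English agreement attraction. Their predictions broadly match human reading-time data in prepositional phrase configurations but degrade significantly in object-extracted relative clause configurations, and no model reproduces the asymmetric interference pattern seen in humans; the authors conclude that current transformers do not yet explain human morphosyntactic processing.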

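Details Motivation: To examine the cognitive adequacy of transformer models as models of human sentence processing, in particular their explanatory power for morphosyntactic processing such as subject-verb agreement attraction. Method: Uses a surprisal-based linking mechanism to systematically evaluate 11 autoregressive transformers of varying sizes and architectures against human reading-time data on a more comprehensive set of English agreement attraction configurations, notably prepositional phrases and object-extracted relative clauses. Result: Transformer predictions roughly match human reading times in prepositional phrase configurations; in object-extracted relative clause configurations performance drops significantly, predictions diverge widely across models, and none reproduces the asymmetric interference pattern shown by humans. Conclusion: Current transformers do not adequately explain human morphosyntactic processing; evaluating them as cognitive models requires more rigorous and comprehensive experimental designs, so that conclusions are not wrongly generalized from isolated syntactic configurations or single models. Abstract: Transformers underlie almost all state-of-the-art language models in computational linguistics, yet their cognitive adequacy as models of human sentence processing remains disputed. In this work, we use a surprisal-based linking mechanism to systematically evaluate eleven autoregressive transformers of varying sizes and architectures on a more comprehensive set of English agreement attraction configurations than prior work. Our experiments yield mixed results: While transformer predictions generally align with human reading time data for prepositional phrase configurations, performance degrades significantly on object-extracted relative clause configurations. In the latter case, predictions also diverge markedly across models, and no model successfully replicates the asymmetric interference patterns observed in humans. We conclude that current transformer models do not explain human morphosyntactic processing, and that evaluations of transformers as cognitive models must adopt rigorous, comprehensive experimental designs to avoid spurious generalizations from isolated syntactic configurations or individual models.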
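
For readers unfamiliar with the term, surprisal is the negative log-probability a language model assigns to a word in context; a surprisal-based linking mechanism treats higher surprisal as a proxy for longer reading time. The sketch below only illustrates that quantity. The probabilities are made-up placeholders, not values from the paper, and the function name is hypothetical.

```python
import math


def surprisal_bits(prob: float) -> float:
    """Surprisal of an event with probability `prob`, measured in bits."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


# Hypothetical model probabilities for the verb in two conditions
# (illustrative numbers only, not taken from the paper).
p_grammatical = 0.5
p_ungrammatical = 0.125

# The difference in surprisal is the kind of quantity that would be
# compared against a human reading-time difference.
effect_bits = surprisal_bits(p_ungrammatical) - surprisal_bits(p_grammatical)
print(effect_bits)
```

In practice the probabilities would come from a model's next-token distribution at the critical word; how the paper aggregates sub-word tokens is not stated in the abstract.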

[52] BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization

Ji-Fu Li,Manyi Zhang,Xiaobo Xia,Han Bao,Haoli Bai,Zhenhua Dong,Xianzhi Yu

Main category: cs.CL

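TL;DR: This paper proposes BATQuant, which combines block-wise affine transformations, Global and Private Kronecker decomposition, and block-wise learnable clipping to address the performance collapse of PTQ under the MXFP4 format, substantially improving quantized multimodal and language LLMs under the W4A4KV16 configuration.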

Details Motivation: Existing rotation-based PTQ methods degrade severely on MXFP4 because of a format mismatch, which shows up as cross-block outlier propagation and bimodal activation distributions that weaken local block-wise scaling and waste quantization range. Method: Proposes BATQuant: (1) block-wise affine transformations, which restrict the transformation granularity to avoid cross-block outlier propagation and relax orthogonality to improve distribution shaping; (2) Global and Private Kronecker (GPK) decomposition to reduce parameter and compute overhead; (3) block-wise learnable clipping to suppress residual outliers. Result: Under W4A4KV16, BATQuant recovers up to 96.43% of full-precision performance on multimodal benchmarks, outperforms existing methods across the board, and sets a new state of the art. Conclusion: BATQuant's structured quantization strategy, tailored to the properties of the MXFP format, effectively mitigates outlier propagation and distribution mismatch, offering a reliable option for efficient low-bit deployment of MLLMs and LLMs on modern accelerators. Abstract: Microscaling floating-point (MXFP) formats have emerged as a promising standard for deploying Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) on modern accelerator architectures. However, existing Post-Training Quantization (PTQ) methods, particularly rotation-based techniques designed for integer formats, suffer from severe performance collapse when applied to MXFP4. Recent studies attribute this failure to a fundamental format mismatch: global orthogonal rotations inadvertently transfer outlier energy across quantization blocks, inducing new outliers that disrupt local block-wise scaling, while often creating bimodal activation distributions that underutilize the limited quantization range. To address these issues, we propose BATQuant (Block-wise Affine Transformation), which restricts transformations to align with MXFP granularity to prevent cross-block outlier propagation, while relaxing orthogonality constraints to optimize distribution shaping. To ensure parameter efficiency, we introduce Global and Private Kronecker (GPK) decomposition to effectively reduce storage and runtime overhead and incorporate Block-wise Learnable Clipping to suppress residual outliers. Extensive experiments on both MLLMs and LLMs demonstrate that BATQuant establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.
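
To see why an outlier that lands in a block disrupts block-wise scaling, the following is a minimal sketch of MX-style quantization, not BATQuant itself. It assumes a shared power-of-two scale per block and the FP4 (E2M1) magnitude grid; the scale-selection rule is a simplification chosen so that nothing clips, and the block size of 4 is for readability only (MXFP4 blocks are larger). The function name is hypothetical.

```python
import math

# Non-negative magnitudes representable in FP4 (E2M1); sign is handled separately.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Quantize one block with a shared power-of-two scale (simplified sketch)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    # Assumption: pick the smallest power-of-two scale that maps amax
    # inside the grid maximum (6.0), so no value is clipped.
    scale = 2.0 ** math.ceil(math.log2(amax / FP4_GRID[-1]))
    out = []
    for x in block:
        # Round the scaled magnitude to the nearest grid point.
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


clean = quantize_block([0.1, 0.2, -0.3, 0.25])
with_outlier = quantize_block([0.1, 0.2, -0.3, 48.0])
print(clean)         # small values keep distinct levels
print(with_outlier)  # the outlier sets the scale; small values collapse to zero
```

In the second block the single large value forces a coarse shared scale, so the remaining entries lose all resolution. This is the effect the abstract attributes to outlier energy being moved across block boundaries.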

[53] Tarab: A Multi-Dialect Corpus of Arabic Lyrics and Poetry

Mo El-Haj

Main category: cs.CL

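TL;DR: This paper introduces the Tarab corpus, a large-scale corpus of Arabic song lyrics and poetry with 2.56 million verses and more than 13.5 million tokens, covering Classical Arabic, Modern Standard Arabic, and six major regional varieties, and supporting cross-linguistic, stylistic, and diachronic analysis.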

Details Motivation: To build a large-scale Arabic cultural and linguistic resource within a unified framework that supports cross-linguistic, stylistic, and diachronic analysis of Arabic creative text. Method: Builds the Tarab corpus, including its data collection, normalisation, and validation pipeline, and runs baseline analyses for variety identification and genre differentiation. Result: Releases what the authors describe as the largest open Arabic corpus of creative text, with 2.56 million verses and more than 13.5 million tokens, covering multiple language varieties and historical periods and accompanied by structured metadata. Conclusion: The Tarab corpus is an important resource for linguistic, stylistic, and diachronic research on Arabic, with broad potential applications. Abstract: We introduce the Tarab Corpus, a large-scale cultural and linguistic resource that brings together Arabic song lyrics and poetry within a unified analytical framework. The corpus comprises 2.56 million verses and more than 13.5 million tokens, making it, to our knowledge, the largest open Arabic corpus of creative text spanning both classical and contemporary production. Tarab is broadly balanced between songs and poems and covers Classical Arabic, Modern Standard Arabic (MSA), and six major regional varieties: Egyptian, Gulf, Levantine, Iraqi, Sudanese, and Maghrebi Arabic. The artists and poets represented in the corpus are associated with 28 modern nation states and multiple historical eras, covering over fourteen centuries of Arabic creative expression from the Pre-Islamic period to the twenty-first century. Each verse is accompanied by structured metadata describing linguistic variety, geographic origin, and historical or cultural context, enabling comparative linguistic, stylistic, and diachronic analysis across genres and time. We describe the data collection, normalisation, and validation pipeline and present baseline analyses for variety identification and genre differentiation. The dataset is publicly available on HuggingFace at https://huggingface.co/datasets/drelhaj/Tarab.

[54] Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

Omnilingual SONAR Team,João Maria Janeiro,Pere-Lluís Huguet Cabot,Ioannis Tsiamas,Yen Meng,Vivek Iyer,Guillem Ramírez,Loic Barrault,Belen Alastruey,Yu-An Chung,Marta R. Costa-Jussa,David Dale,Kevin Heffernan,Jaehyeong Jo,Artyom Kozhevnikov,Alexandre Mourachko,Christophe Ropers,Holger Schwenk,Paul-Ambroise Duquenne

Main category: cs.CL

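TL;DR: OmniSONAR is a new family of omnilingual, cross-lingual, and cross-modal sentence embedding models that supports thousands of languages (including extremely low-resource ones) and multiple modalities, namely text, speech, code, and mathematical expressions, and substantially outperforms existing methods on several benchmarks.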

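Details Motivation: Existing cross-lingual sentence encoders cover few languages (only a few hundred) and often sacrifice downstream performance for stronger alignment, which limits practical adoption. Method: Uses progressive training: (1) an LLM-initialized encoder-decoder learns a foundational semantic space for 200 languages, combining token-level decoding, a newly proposed split-softmax contrastive loss, and synthetic hard negatives; (2) a two-stage teacher-student encoder distillation extends coverage to thousands of language varieties; (3) 177 spoken languages are mapped into the same space; (4) a separate encoder-decoder LM, Spectrum, is trained only on English text to process sequences of OmniSONAR embeddings, enabling efficient transfer to multilingual and speech tasks. Result: Halves similarity-search error on FLORES (200 languages); reduces error by a factor of 15 on BIBLE (1,560 languages); exceeds prior models by 15 chrF++ points on BIBLE translation into English and outperforms NLLB-3B on multilingual benchmarks; performs strongly on MTEB and XLCoST; lowers speech similarity-search error by 43% and reaches 97% of SeamlessM4T speech-to-text quality while being zero-shot for translation. Conclusion: OmniSONAR builds a scalable, high-quality, unified multimodal semantic space, overcoming the trade-off between language coverage and downstream performance in traditional cross-lingual embedding models and providing a strong foundation model for low-resource languages and cross-modal tasks. Abstract: Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual and cross-modal sentence embedding models that natively embed text, speech, code, and mathematical expressions in a single semantic space, while delivering state-of-the-art downstream performance at the scale of thousands of languages, from high-resource to extremely low-resource varieties. To reach this scale without representation collapse, we use progressive training. We first learn a strong foundational space for 200 languages with an LLM-initialized encoder-decoder, combining token-level decoding with a novel split-softmax contrastive loss and synthetic hard negatives. Building on this foundation, we expand to several thousand language varieties via a two-stage teacher-student encoder distillation framework. Finally, we demonstrate the cross-modal extensibility of this space by seamlessly mapping 177 spoken languages into it. OmniSONAR halves cross-lingual similarity search error on the 200-language FLORES dataset and reduces error by a factor of 15 on the 1,560-language BIBLE benchmark. It also enables strong translation, outperforming NLLB-3B on multilingual benchmarks and exceeding prior models (including much larger LLMs) by 15 chrF++ points on 1,560 languages into English BIBLE translation. 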
OmniSONAR also performs strongly on MTEB and XLCoST. For speech, OmniSONAR achieves a 43% lower similarity-search error and reaches 97% of SeamlessM4T speech-to-text quality, despite being zero-shot for translation (trained only on ASR data). Finally, by training an encoder-decoder LM, Spectrum, exclusively on English text processing OmniSONAR embedding sequences, we unlock high-performance transfer to thousands of languages and speech for complex downstream tasks.

[55] Domain Mixture Design via Log-Likelihood Differences for Aligning Language Models with a Target Model

Ryo Kishino,Riku Shiomi,Hiroaki Yamagiwa,Momose Oyama,Hidetoshi Shimodaira

Main category: cs.CL

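TL;DR: This paper proposes a way to align a base language model with a target model without knowledge distillation: domain mixture weights for the training data are designed so that the base model's training update points toward the target model. In NanoGPT experiments this consistently reduces KL divergence to the target model and brings downstream task performance closer to the target model's.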

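Details Motivation: To avoid the high computational cost of direct knowledge distillation and explore a lighter, more scalable way to align models, offering an alternative especially when the target model is inaccessible or distillation is infeasible. Method: Treats language models as points in log-likelihood space and optimizes the weights of the training-data domains so that the base model's update direction during pretraining or continued pretraining approaches the direction toward the target model. Result: In NanoGPT experiments, the method consistently reduces KL divergence to the target model compared with uniform weighting over the Pile; although weaker than knowledge distillation, it still achieves meaningful alignment and moves downstream task performance closer to the target model's. Conclusion: Direction-aligned domain weighting is an effective, distillation-free approach to model alignment and offers a new perspective on model adaptation. Abstract: Instead of directly distilling a language model, this study addresses the problem of aligning a base model with a target model in distribution by designing the domain mixture of training data for pretraining or continued pretraining as a fixed training recipe. We propose a method for determining domain weights by viewing models as points in log-likelihood space and aligning the training update direction with the direction toward the target model. Experiments with NanoGPT show that the proposed method consistently reduces the KL divergence to the target model compared with uniform weighting over the Pile. Although knowledge distillation remains more effective when available, the proposed method still achieves meaningful alignment, and downstream task performance also tends to become closer to that of the target model.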
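
The evaluation metric mentioned above, KL divergence to the target model, can be illustrated on toy distributions. This is only a sketch of the metric, not of the paper's weighting method. The three-way distributions are invented placeholders, the function name is hypothetical, and the direction KL(target || base) is an assumption, since the abstract does not state which direction is used.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same support.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical next-token distributions over a three-token vocabulary.
target = [0.7, 0.2, 0.1]
base_uniform_mix = [1 / 3, 1 / 3, 1 / 3]  # stand-in for a base model trained with uniform weights
base_tuned_mix = [0.6, 0.25, 0.15]        # stand-in for a base model trained with tuned weights

kl_uniform = kl_divergence(target, base_uniform_mix)
kl_tuned = kl_divergence(target, base_tuned_mix)
print(kl_uniform, kl_tuned)
```

A lower value for the second model is what "closer to the target in distribution" means under this metric.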

[56] Good Arguments Against the People Pleasers: How Reasoning Mitigates (Yet Masks) LLM Sycophancy

Zhaoxin Feng,Zheng Chen,Jianfei Ma,Yip Tin Po,Emmanuele Chersoni,Bo Li

Main category: cs.CL

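TL;DR: This paper examines how Chain-of-Thought (CoT) reasoning affects sycophancy in large language models (LLMs). CoT reduces sycophancy in final decisions but can also mask it through logical inconsistencies, calculation errors, and similar flaws; sycophancy is more pronounced in subjective tasks and under authority bias, and it evolves dynamically during reasoning.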

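Details Motivation: Existing alignment techniques tend to induce sycophancy in LLMs, yet the role of CoT reasoning in this process (whether it suppresses or masks sycophancy) is unclear. Method: Evaluates a range of models on objective and subjective tasks, combining empirical analysis with mechanistic analysis of three open-source models to examine how CoT affects sycophancy and how sycophancy evolves during reasoning. Result: CoT generally reduces sycophancy in final decisions but masks it in some samples through logical errors and similar flaws; subjective tasks and authority bias increase sycophancy; the tendency toward sycophancy changes dynamically during reasoning rather than being fixed at the input stage. Conclusion: CoT neither simply suppresses nor simply amplifies sycophancy but plays a dual role, both mitigating and hiding it; sycophancy is a dynamic phenomenon within the reasoning process and needs to be detected and addressed at fine-grained reasoning stages. Abstract: Alignment techniques often inadvertently induce sycophancy in LLMs. While prior studies examined this behaviour in direct-answer settings, the role of Chain-of-Thought (CoT) reasoning remains under-explored: does it serve as a logical constraint that mitigates sycophancy, or a tool for post-hoc rationalization that masks it? We evaluate a range of models across objective and subjective tasks to investigate the issue. Results show that reasoning generally reduces sycophancy in final decisions but also masks sycophancy in some samples, where models construct deceptive justifications through logical inconsistencies, calculation errors, and one-sided arguments. Furthermore, LLMs are more prone to sycophancy in subjective tasks and under authority bias. Our mechanistic analysis on three open-source models reveals that the tendency of sycophancy is dynamic during the reasoning process rather than being pre-determined at the input stage.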

[57] Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

Xiaojie Gu,Sherry T. Tong,Aosong Feng,Sophia Simeng Han,Jinghui Lu,Yingjian Chen,Yusuke Iwasawa,Yutaka Matsuo,Chanjun Park,Rex Ying,Irene Li

Main category: cs.CL

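TL;DR: This paper proposes Omanic, a dataset for evaluating and improving multi-hop reasoning in large language models, comprising a machine-generated training set and a human-annotated evaluation set, and demonstrates its effectiveness for transferring reasoning capability.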

Details Motivation: Reasoning-focused LLMs are hard to evaluate because final answers alone do not reveal intermediate reasoning steps; multi-hop QA benchmarks also lack fine-grained step-level annotations, making reasoning failures hard to diagnose. Method: Builds Omanic, an open-domain multi-hop QA dataset with 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed, human-annotated evaluation examples (OmanicBench), providing decomposed sub-questions and intermediate answers as structural annotations; conducts systematic evaluation and supervised fine-tuning experiments. Result: State-of-the-art LLMs reach only 73.11% multiple-choice accuracy on OmanicBench; CoT performance depends heavily on factual completeness, with gains shrinking under knowledge gaps and errors amplifying in later hops; supervised fine-tuning on OmanicSynth yields an average gain of 7.41 points across six reasoning and math benchmarks. Conclusion: Omanic provides an interpretable, diagnosable evaluation resource for multi-hop reasoning and confirms the value of structured intermediate annotations for improving reasoning ability and generalization. Abstract: Reasoning-focused large language models (LLMs) have advanced in many NLP tasks, yet their evaluation remains challenging: final answers alone do not expose the intermediate reasoning steps, making it difficult to determine whether a model truly reasons correctly and where failures occur, while existing multi-hop QA benchmarks lack step-level annotations for diagnosing reasoning failures. To address this gap, we propose Omanic, an open-domain multi-hop QA resource that provides decomposed sub-questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench). Systematic evaluations show that state-of-the-art LLMs achieve only 73.11% multiple-choice accuracy on OmanicBench, confirming its high difficulty. Stepwise analysis reveals that CoT's performance hinges on factual completeness, with its gains diminishing under knowledge gaps and errors amplifying in later hops. Additionally, supervised fine-tuning on OmanicSynth brings substantial transfer gains (7.41 average points) across six reasoning and math benchmarks, validating the dataset's quality and further supporting the effectiveness of OmanicSynth as supervision for reasoning-capability transfer. We release the data at https://huggingface.co/datasets/li-lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.

Aishwarya Ramasethu,Niyathi Allu,Rohin Garg,Harshwardhan Fartale,Dun Li Chan

Main category: cs.CL

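TL;DR: This paper studies, for extremely low-resource machine translation, how well linguistically related pivot languages and few-shot prompting can guide large language models (LLMs) in zero-shot or few-shot translation. The gains are limited and sensitive to how the examples are constructed, with modest improvements mainly when the target language is poorly represented in the model's vocabulary.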

Details Motivation: Standard fine-tuning relies on large parallel corpora or parameter updates and is hard to apply to long-tail low-resource languages; training-free, data-efficient inference-time adaptation needs to be explored. Method: Uses an inference-time prompting strategy with no parameter updates, combining linguistically similar pivot languages with few-shot in-context examples, and evaluates low-resource translation under controlled conditions. Result: Pivot languages plus few-shot prompting give small gains when the target language is poorly covered by the model's vocabulary, but the gains are sensitive to example design; for closely related or better-covered languages, the effect diminishes or is inconsistent. Conclusion: Inference-time prompting with pivot languages is a lightweight alternative to fine-tuning, but it applies in limited settings and requires careful choice of pivot language and example construction. Abstract: Large Language Models (LLMs) have achieved strong performance across many downstream tasks, yet their effectiveness in extremely low-resource machine translation remains limited. Standard adaptation techniques typically rely on large-scale parallel data or extensive fine-tuning, which are infeasible for the long tail of underrepresented languages. In this work, we investigate a more constrained question: in data-scarce settings, to what extent can linguistically similar pivot languages and few-shot demonstrations provide useful guidance for on-the-fly adaptation in LLMs? We study a data-efficient experimental setup that combines linguistically related pivot languages with few-shot in-context examples, without any parameter updates, and evaluate translation behavior under controlled conditions. Our analysis shows that while pivot-based prompting can yield improvements in certain configurations, particularly in settings where the target language is less well represented in the model's vocabulary, the gains are often modest and sensitive to few-shot example construction. For closely related or better represented varieties, we observe diminishing or inconsistent gains. Our findings provide empirical guidance on how and when inference-time prompting and pivot-based examples can be used as a lightweight alternative to fine-tuning in low-resource translation settings.

[59] Arabic Morphosyntactic Tagging and Dependency Parsing with Large Language Models

Mohamed Adel,Bashar Alhafni,Nizar Habash

Main category: cs.CL

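TL;DR: This paper evaluates instruction-tuned large language models (LLMs) on two structured prediction tasks for Standard Arabic, morphosyntactic tagging and dependency parsing, comparing zero-shot prompting with retrieval-based in-context learning (ICL). Prompt design and example selection strongly affect performance; some proprietary models approach supervised baselines, but challenges such as tokenization remain.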

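Details Motivation: LLMs perform strongly on NLP tasks, but their ability to produce explicit linguistic structure (such as morphosyntactic and dependency structure) remains unclear; Arabic, with its rich morphology and orthographic ambiguity, is a challenging language well suited to testing this ability. Method: Runs two structured prediction tasks on Standard Arabic, morphosyntactic tagging and labeled dependency parsing, comparing zero-shot prompting with ICL based on retrieval from Arabic treebanks, and analyzes the effects of prompt design, example selection, and tokenization. Result: Prompt design and demonstration selection strongly affect performance; proprietary models approach supervised baselines on feature-level tagging and are competitive with specialized parsers on dependency parsing; retrieval-based ICL improves both parsing and tokenization; tokenization of raw text remains difficult. Conclusion: LLMs capture some morphosyntactic and syntactic features of Arabic fairly reliably but show clear limitations on complex phenomena such as strong morphology-syntax interactions, orthographic ambiguity, and low-level tokenization. Abstract: Large language models (LLMs) perform strongly on many NLP tasks, but their ability to produce explicit linguistic structure remains unclear. We evaluate instruction-tuned LLMs on two structured prediction tasks for Standard Arabic: morphosyntactic tagging and labeled dependency parsing. Arabic provides a challenging testbed due to its rich morphology and orthographic ambiguity, which create strong morphology-syntax interactions. We compare zero-shot prompting with retrieval-based in-context learning (ICL) using examples from Arabic treebanks. Results show that prompt design and demonstration selection strongly affect performance: proprietary models approach supervised baselines for feature-level tagging and become competitive with specialized dependency parsers. In raw-text settings, tokenization remains challenging, though retrieval-based ICL improves both parsing and tokenization. Our analysis highlights which aspects of Arabic morphosyntax and syntax LLMs capture reliably and which remain difficult.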

[60] Probing Cultural Signals in Large Language Models through Author Profiling

Valentin Lafargue,Ariel Guerra-Adames,Emmanuelle Claeys,Elouan Vuichard,Jean-Michel Loubes

Main category: cs.CL

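TL;DR: This paper probes cultural bias in large language models (LLMs) through zero-shot author profiling (inferring singers' gender and ethnicity from song lyrics). Open-source models show systematic ethnicity preferences (e.g., toward North American or Asian ethnicity); two new fairness metrics, MAD and RD, quantify the disparities, with Ministral-8B showing the strongest bias and Gemma-12B the most balanced behavior.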

Details Motivation: LLMs are increasingly deployed in applications with societal impact, raising concerns about the cultural biases they encode and calling for systematic probing of their internal cultural representations. Method: In a zero-shot setting, uses several open-source LLMs to profile the authors of more than 10,000 song lyrics (predicting singers' gender and ethnicity), analyzes their prediction distributions and generated rationales, and proposes two new fairness metrics, Modality Accuracy Divergence (MAD) and Recall Divergence (RD), to quantify ethnicity bias. Result: LLMs show non-trivial profiling ability but systematic cultural alignment: most models lean toward North American ethnicity, while DeepSeek-1.5B leans toward Asian ethnicity; Ministral-8B shows the strongest ethnicity bias and Gemma-12B the most balanced behavior. Conclusion: LLMs carry significant, model-dependent cultural biases; zero-shot author profiling can expose them, and the proposed fairness metrics provide new tools for assessing and mitigating cultural bias. Abstract: Large language models (LLMs) are increasingly deployed in applications with societal impact, raising concerns about the cultural biases they encode. We probe these representations by evaluating whether LLMs can perform author profiling from song lyrics in a zero-shot setting, inferring singers' gender and ethnicity without task-specific fine-tuning. Across several open-source models evaluated on more than 10,000 lyrics, we find that LLMs achieve non-trivial profiling performance but demonstrate systematic cultural alignment: most models default toward North American ethnicity, while DeepSeek-1.5B aligns more strongly with Asian ethnicity. This finding emerges from both the models' prediction distributions and an analysis of their generated rationales. To quantify these disparities, we introduce two fairness metrics, Modality Accuracy Divergence (MAD) and Recall Divergence (RD), and show that Ministral-8B displays the strongest ethnicity bias among the evaluated models, whereas Gemma-12B shows the most balanced behavior. Our code is available on GitHub (https://github.com/ValentinLafargue/CulturalProbingLLM).

[61] TurnWise: The Gap between Single- and Multi-turn Language Model Capabilities

Victoria Graf,Valentina Pyatkin,Nouha Dziri,Nathan Lambert,Hannaneh Hajishirzi

Main category: cs.CL

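TL;DR: This paper introduces TurnWiseEval, a multi-turn conversation benchmark, and TurnWiseData, a synthetic data generation pipeline, and shows that multi-turn training data is key to improving language models' multi-turn chat ability.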

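Details Motivation: Existing open training and evaluation data focus mainly on single-turn conversations and do not reflect the capabilities specific to multi-turn interaction, so the performance gap between multi-turn and single-turn settings needs systematic study. Method: Builds TurnWiseEval, a multi-turn benchmark directly comparable to single-turn evaluation, using pairwise comparison to isolate multi-turn-specific conversational ability; designs TurnWiseData, a scalable synthetic multi-turn data pipeline; runs ablations on Olmo 3 to measure the effect of different amounts of multi-turn data. Result: Adding as few as 10k multi-turn conversations during post-training yields a 12% improvement on TurnWiseEval; the experiments show that multi-turn training data is vital for multi-turn chat performance. Conclusion: Multi-turn ability cannot simply be extrapolated from single-turn ability; it must be modeled and optimized through dedicated multi-turn training data and evaluation. Abstract: Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings, failing to capture the additional dimension of these longer interactions. To understand this multi-/single-turn gap, we first introduce a new benchmark, TurnWiseEval, for multi-turn capabilities that is directly comparable to single-turn chat evaluation. Our evaluation isolates multi-turn specific conversational ability through pairwise comparison to equivalent single-turn settings. We additionally introduce our synthetic multi-turn data pipeline TurnWiseData which allows the scalable generation of multi-turn training data. Our experiments with Olmo 3 show that training with multi-turn data is vital to achieving strong multi-turn chat performance, and that including as little as 10k multi-turn conversations during post-training can lead to a 12% improvement on TurnWiseEval.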

[62] SpokenUS: A Spoken User Simulator for Task-Oriented Dialogue

Jonggeun Lee,Junseong Pyo,Jeongmin Park,Yohan Jo

Main category: cs.CL

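TL;DR: This paper presents SpokenTOD, a dataset of 52,390 dialogues and 1,034 hours of speech covering four spoken behaviors (cross-turn slots, barge-in, disfluency, and emotional prosody), and SpokenUS, a spoken user simulator built on it. With comparable goal coverage, SpokenUS substantially improves human ratings (MOS) and more realistically mimics how humans disclose slot values gradually.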

Details Motivation: Existing spoken task-oriented dialogue (TOD) datasets are small, limited in domain coverage, and lack a systematic augmentation method, making it hard to train and evaluate robust spoken dialogue agents. Method: Builds SpokenTOD, a large-scale spoken TOD dataset with multiple spoken behaviors, and designs on top of it SpokenUS, a spoken user simulator with a dedicated barge-in mechanism. Result: SpokenUS matches larger models in goal coverage and substantially outperforms all baselines in Human MOS; it discloses slot values gradually and poses realistic challenges to downstream dialogue agents. Conclusion: SpokenTOD and SpokenUS provide a practical, scalable data and tooling foundation for building and evaluating more robust spoken dialogue systems. Abstract: Robust task-oriented spoken dialogue agents require exposure to the full diversity of how people interact through speech. Building spoken user simulators that address this requires large-scale spoken task-oriented dialogue (TOD) data encompassing spoken user behaviors, yet existing datasets are limited in scale and domain coverage, with no systematic pipeline for augmenting them. To address this, we introduce SpokenTOD, a spoken TOD dataset of 52,390 dialogues and 1,034 hours of speech augmented with four spoken user behaviors -- cross-turn slots, barge-in, disfluency, and emotional prosody -- across diverse speakers and domains. Building on SpokenTOD, we present SpokenUS, a spoken user simulator grounded in TOD with a dedicated architecture for barge-in. SpokenUS achieves comparable goal coverage to significantly larger models while substantially outperforming all baselines in Human MOS, disclosing slot values gradually across the dialogue as humans do rather than front-loading them. Further analysis confirms that SpokenUS's spoken behaviors pose meaningful challenges to downstream agents, making it a practical tool for training and evaluating more robust spoken dialogue systems.

[63] Mediocrity is the key for LLM as a Judge Anchor Selection

Shachar Don-Yehiya,Asaf Yehudai,Leshem Choshen,Omri Abend

Main category: cs.CL

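TL;DR: This paper systematically studies how anchor selection affects the reliability of results in the LLM-as-a-judge paradigm. It finds that anchor choice is critical and that models with extreme performance make poor anchors, and it offers practical recommendations on benchmark size and anchor selection.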

Details Motivation: Anchor-based comparison is widely used in LLM evaluation to reduce computational cost, but its effect on the reliability of results has not been systematically studied. Method: Evaluates 22 different anchors on the Arena-Hard-v2.0 dataset, analyzes their correlation with human rankings, and quantifies effect sizes and statistical power. Result: Anchor choice strongly affects reliability; the best- and worst-performing models make poor anchors; the effect size of anchor selection is comparable to that of judge-model selection; current benchmark sizes are too small to reliably distinguish competitive models. Conclusion: Avoid extreme-performance models as anchors; enlarge benchmarks and choose informative anchors to improve the reliability and efficiency of evaluation. Abstract: The "LLM-as-a-judge" paradigm has become a standard method for evaluating open-ended generation. To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single anchor. However, despite its widespread use, the impact of anchor selection on the reliability of the results remains largely unexplored. In this work, we systematically investigate the effect of anchor selection by evaluating 22 different anchors on the Arena-Hard-v2.0 dataset. We find that the choice of anchor is critical: a poor anchor can dramatically reduce correlation with human rankings. We identify that common anchor choices (best-performing and worst-performing models) make poor anchors. Because these extreme anchors are consistently better or worse than all other models, they are seldom indicative of the relative ranking of the models. We further quantify the effect size of anchor selection, showing it is comparable to the selection of a judge model. We conclude with actionable recommendations. First, we conduct a power analysis, and compute sufficient benchmark sizes for anchor-based evaluation, finding that standard benchmark sizes are insufficient for pairwise evaluation and fail to distinguish between competitive models reliably. Second, we provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices.
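
The "quadratic scalability costs" that motivate anchoring can be made concrete with a small count of judge calls per prompt. The sketch assumes the anchor is itself one of the evaluated models and that each pair is judged once; the number of models is a hypothetical example, and the function names are illustrative.

```python
def all_pairs_comparisons(n_models: int) -> int:
    """Judge calls per prompt when every pair of models is compared once."""
    return n_models * (n_models - 1) // 2


def anchor_comparisons(n_models: int) -> int:
    """Judge calls per prompt when every other model is compared to one anchor."""
    return n_models - 1


n = 20  # hypothetical number of models on a leaderboard
print(all_pairs_comparisons(n), anchor_comparisons(n))
```

The saving grows with the number of models, which is why the choice of the single anchor carries so much weight in the resulting ranking.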

[64] Online Experiential Learning for Language Models

Tianzhu Ye,Li Dong,Qingxiu Dong,Xun Wu,Shaohan Huang,Furu Wei

Main category: cs.CL

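TL;DR: This paper proposes Online Experiential Learning (OEL), a framework that lets large language models keep improving from their own interaction experience during real-world deployment, without human annotation or access to the user-side environment; it extracts transferable experiential knowledge and consolidates it via on-policy context distillation in an iterative loop.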

Details Motivation: The prevailing paradigm for improving LLMs relies on offline training (human annotations or simulated environments) and ignores the rich experience accumulated during real-world deployment. Method: OEL has two stages: (1) extract and accumulate transferable experiential knowledge from interaction trajectories collected on the user side; (2) consolidate this knowledge into model parameters via on-policy context distillation, with no access to the user-side environment; the two stages iterate to form an online learning loop. Result: Evaluated in text-based game environments across multiple model scales and both thinking and non-thinking variants, OEL steadily improves task accuracy and token efficiency over successive iterations while preserving out-of-distribution performance; extracted experiential knowledge is significantly more effective than raw trajectories, and on-policy consistency is critical for learning. Conclusion: OEL offers an efficient, practical path for LLMs to improve online on their own and highlights the potential of learning continually from real deployment experience. Abstract: The prevailing paradigm for improving large language models relies on offline training with human annotations or simulated environments, leaving the rich experience accumulated during real-world deployment entirely unexploited. We propose Online Experiential Learning (OEL), a framework that enables language models to continuously improve from their own deployment experience. OEL operates in two stages: first, transferable experiential knowledge is extracted and accumulated from interaction trajectories collected on the user side; second, this knowledge is consolidated into model parameters via on-policy context distillation, requiring no access to the user-side environment. The two stages are iterated to form an online learning loop, where the improved model collects higher-quality trajectories that yield richer experiential knowledge for subsequent rounds. We evaluate OEL on text-based game environments across multiple model scales and both thinking and non-thinking variants. OEL achieves consistent improvements over successive iterations, enhancing both task accuracy and token efficiency while preserving out-of-distribution performance. Our analysis further shows that extracted experiential knowledge is significantly more effective than raw trajectories, and that on-policy consistency between the knowledge source and the policy model is critical for effective learning.

[65] Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory

Sahil Sen,Elias Lumer,Anmol Gulati,Vamse Kumar Subbiah

Main category: cs.CL

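TL;DR: This paper proposes Chronos, a temporal-aware memory framework that improves the memory and retrieval abilities of large language models in long-term multi-turn conversations. It decomposes dialogue into event tuples with time ranges, builds a structured calendar index, and combines dynamic prompting with iterative tool calling, substantially improving accuracy on time-sensitive, multi-hop queries.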

Details Motivation: Existing memory systems struggle with facts and preferences that evolve over time and lack efficient retrieval strategies for time-sensitive, multi-hop queries over long dialogue histories. Method: Chronos decomposes raw dialogue into subject-verb-object event tuples with resolved datetime ranges and entity aliases, and builds an event calendar and a turn calendar; at query time it uses dynamic prompting to generate tailored retrieval guidance and performs multi-hop reasoning over both calendars through iterative tool calling. Result: On the LongMemEvalS benchmark (500 questions, six task categories), Chronos Low reaches 92.60% accuracy and Chronos High 95.60%, an improvement of 7.67% over the best prior system; ablations show the event calendar contributes most (a 58.9% gain over the baseline), with the other components contributing between 15.5% and 22.3%. Conclusion: Chronos substantially improves LLMs' ability to remember and reason over time-evolving information in long-term interaction; its modular design and dynamic retrieval mechanism offer a new approach to memory systems for long-horizon dialogue. Abstract: Recent advances in Large Language Models (LLMs) have enabled conversational AI agents to engage in extended multi-turn interactions spanning weeks or months. However, existing memory systems struggle to reason over temporally grounded facts and preferences that evolve across months of interaction and lack effective retrieval strategies for multi-hop, time-sensitive queries over long dialogue histories. We introduce Chronos, a novel temporal-aware memory framework that decomposes raw dialogue into subject-verb-object event tuples with resolved datetime ranges and entity aliases, indexing them in a structured event calendar alongside a turn calendar that preserves full conversational context. At query time, Chronos applies dynamic prompting to generate tailored retrieval guidance for each question, directing the agent on what to retrieve, how to filter across time ranges, and how to approach multi-hop reasoning through an iterative tool-calling loop over both calendars. We evaluate Chronos with 8 LLMs, both open-source and closed-source, on the LongMemEvalS benchmark comprising 500 questions spanning six categories of dialogue history tasks. Chronos Low achieves 92.60% and Chronos High scores 95.60% accuracy, setting a new state of the art with an improvement of 7.67% over the best prior system. Ablation results reveal the events calendar accounts for a 58.9% gain on the baseline while all other components yield improvements between 15.5% and 22.3%. Notably, Chronos Low alone surpasses prior approaches evaluated under their strongest model configurations.
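
As a rough illustration of what an event tuple with a resolved datetime range and a time-range filter might look like, here is a minimal sketch. The `Event` fields, the `events_in_range` helper, and the sample data are all hypothetical; the abstract does not describe Chronos's actual schema or retrieval tools.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """Hypothetical subject-verb-object event with a resolved time range."""
    subject: str
    verb: str
    obj: str
    start: datetime
    end: datetime


def events_in_range(calendar, start, end):
    """Return events whose [start, end] interval overlaps the query window."""
    return [e for e in calendar if e.start <= end and e.end >= start]


# Invented sample events standing in for tuples extracted from a dialogue.
calendar = [
    Event("user", "moved to", "Berlin", datetime(2024, 1, 10), datetime(2024, 1, 10)),
    Event("user", "worked at", "Acme", datetime(2023, 3, 1), datetime(2024, 2, 15)),
    Event("user", "adopted", "a dog", datetime(2024, 6, 5), datetime(2024, 6, 5)),
]

# "What was going on in the first quarter of 2024?"
hits = events_in_range(calendar, datetime(2024, 1, 1), datetime(2024, 3, 31))
print([f"{e.subject} {e.verb} {e.obj}" for e in hits])
```

An overlap test rather than a containment test is used so that long-running events, such as a job that started before the query window, are still retrieved.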

cs.CV [Back]

[66] SAC-NeRF: Adaptive Ray Sampling for Neural Radiance Fields via Soft Actor-Critic Reinforcement Learning

Chenyu Ge

Main category: cs.CV

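TL;DR: This paper proposes SAC-NeRF, an adaptive sampling framework based on Soft Actor-Critic reinforcement learning that improves NeRF rendering efficiency, substantially reducing the number of sample points while preserving image quality.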

Details Motivation: NeRF is computationally inefficient because of dense ray sampling, which calls for smarter, data-driven sampling strategies. Method: Sampling is formulated as a Markov Decision Process, with a Gaussian mixture color model that provides uncertainty estimates, a multi-objective reward function (quality, efficiency, consistency), and a two-stage training strategy that handles environment non-stationarity. Result: On the Synthetic-NeRF and LLFF datasets, sampling points are reduced by 35-48% while PSNR drops by only 0.3-0.8 dB, supporting the effectiveness of data-driven strategies. Conclusion: Reinforcement learning can discover effective sampling patterns that would be difficult to hand-design. Although the learned policy is scene-specific and the framework is more complex than simple heuristics, it offers a new direction for efficient NeRF. Abstract: Neural Radiance Fields (NeRF) have achieved photorealistic novel view synthesis but suffer from computational inefficiency due to dense ray sampling during volume rendering. We propose SAC-NeRF, a reinforcement learning framework that learns adaptive sampling policies using Soft Actor-Critic (SAC). Our method formulates sampling as a Markov Decision Process where an RL agent learns to allocate samples based on scene characteristics. We introduce three technical components: (1) a Gaussian mixture distribution color model providing uncertainty estimates, (2) a multi-component reward function balancing quality, efficiency, and consistency, and (3) a two-stage training strategy addressing environment non-stationarity. Experiments on Synthetic-NeRF and LLFF datasets show that SAC-NeRF reduces sampling points by 35-48% while maintaining rendering quality within 0.3-0.8 dB PSNR of dense sampling baselines. While the learned policy is scene-specific and the RL framework adds complexity compared to simpler heuristics, our work demonstrates that data-driven sampling strategies can discover effective patterns that would be difficult to hand-design.

[67] Exploring the Use of VLMs for Navigation Assistance for People with Blindness and Low Vision

Yu Li,Yuchen Zheng,Giles Hamilton-Fletcher,Marco Mezzavilla,Yao Wang,Sundeep Rangan,Maurizio Porfiri,Zhou Yu,John-Ross Rizzo

Main category: cs.CV

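TL;DR: This paper evaluates several vision-language models (VLMs) on navigation assistance for people with blindness and low vision. GPT-4o performs best overall, while open-source models fall clearly short on complex spatial reasoning. The study identifies shared weaknesses of current VLMs in counting, spatial reasoning, and scene understanding, and argues that better alignment with human feedback and stronger spatial abilities are needed for practical use.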

Details Motivation: To explore the practical potential and suitability of vision-language models (VLMs) for assisting people with blindness and low vision (pBLV) in navigation tasks. Method: A comparative evaluation of mainstream closed-source VLMs (GPT-4V, GPT-4o, Gemini-1.5-Pro, Claude-3.5-Sonnet) and open-source VLMs (Llava-v1.6-mistral, Llava-onevision-qwen) on foundational visual skills (obstacle counting, relative spatial reasoning, common-sense scene understanding) and on navigation scenarios with pBLV-specific prompts. Result: GPT-4o is the most consistent and the best across all tasks, particularly in spatial reasoning and scene understanding. Open-source models are weaker at reasoning and adaptation in complex environments. All models commonly show inaccurate counting, spatial biases, and a tendency to focus on object details at the expense of spatial feedback. Conclusion: Although current VLMs still have key limitations for navigation assistance, they show promise for assistive technology when better aligned with human feedback and equipped with stronger spatial reasoning. The study provides concrete directions for deploying VLMs in accessibility technology. Abstract: This paper investigates the potential of vision-language models (VLMs) to assist people with blindness and low vision (pBLV) in navigation tasks. We evaluate state-of-the-art closed-source models, including GPT-4V, GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet, alongside open-source models, such as Llava-v1.6-mistral and Llava-onevision-qwen, to analyze their capabilities in foundational visual skills: counting ambient obstacles, relative spatial reasoning, and common-sense wayfinding-pertinent scene understanding. We further assess their performance in navigation scenarios, using pBLV-specific prompts designed to simulate real-world assistance tasks. Our findings reveal notable performance disparities between these models: GPT-4o consistently outperforms others across all tasks, particularly in spatial reasoning and scene understanding. In contrast, open-source models struggle with nuanced reasoning and adaptability in complex environments. Common challenges include difficulties in accurately counting objects in cluttered settings, biases in spatial reasoning, and a tendency to prioritize object details over spatial feedback, limiting their usability for pBLV in navigation tasks. Despite these limitations, VLMs show promise for wayfinding assistance when better aligned with human feedback and equipped with improved spatial reasoning.
This research provides actionable insights into the strengths and limitations of current VLMs, guiding developers on effectively integrating VLMs into assistive technologies while addressing key limitations for enhanced usability.

[68] Improving Generative Adversarial Network Generalization for Facial Expression Synthesis

Arbish Akram,Nazar Khan,Arif Mahmood

Main category: cs.CV

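TL;DR: This paper proposes RegGAN, which adds a regression layer and an adversarial refinement network to improve generalization in facial expression synthesis. It performs especially well on out-of-distribution images and surpasses existing methods on several automatic and human evaluation metrics.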

Details Motivation: Existing conditional-GAN methods for facial expression synthesis degrade when test images differ from the training distribution, so better generalization is needed. Method: RegGAN has two core components: (1) a regression layer with local receptive fields that learns expression details by minimizing reconstruction error through a ridge regression loss, and (2) an adversarially trained refinement network that enhances the realism of generated images. The model is trained on the CFEE dataset and evaluated on CFEE and on various out-of-distribution images (celebrity photos, portraits, statues, avatars). Result: RegGAN outperforms six state-of-the-art methods in Expression Classification Score (ECS), Fréchet Inception Distance (FID), and QualiCLIP, and ranks second in Face Similarity Score (FSS). In human evaluations it exceeds the best competing method by 25% in expression quality, 26% in identity preservation, and 30% in realism. Conclusion: By decoupling expression modeling from image generation, RegGAN improves the generalization and realism of facial expression synthesis and offers a new approach to out-of-distribution expression transfer. Abstract: Facial expression synthesis aims to generate realistic facial expressions while preserving identity. Existing conditional generative adversarial networks (GANs) achieve excellent image-to-image translation results, but their performance often degrades when test images differ from the training dataset. We present Regression GAN (RegGAN), a model that learns an intermediate representation to improve generalization beyond the training distribution. RegGAN consists of two components: a regression layer with local receptive fields that learns expression details by minimizing the reconstruction error through a ridge regression loss, and a refinement network trained adversarially to enhance the realism of generated images. We train RegGAN on the CFEE dataset and evaluate its generalization performance both on CFEE and challenging out-of-distribution images, including celebrity photos, portraits, statues, and avatar renderings. For evaluation, we employ four widely used metrics: Expression Classification Score (ECS) for expression quality, Face Similarity Score (FSS) for identity preservation, QualiCLIP for perceptual realism, and Fréchet Inception Distance (FID) for assessing both expression quality and realism. RegGAN outperforms six state-of-the-art models in ECS, FID, and QualiCLIP, while ranking second in FSS. Human evaluations indicate that RegGAN surpasses the best competing model by 25% in expression quality, 26% in identity preservation, and 30% in realism.

[69] OrthoAI v2: From Single-Agent Segmentation to Dual-Agent Treatment Planning for Clear Aligners

Lansiaux Edouard,Leman Margaux

Main category: cs.CV

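TL;DR: OrthoAI v2 is an open-source system for AI-assisted treatment planning with clear aligners. By introducing a dual-agent architecture, a multi-category biomechanical scoring model, and a multi-frame treatment simulator, it improves dental landmark detection precision, the breadth of plan evaluation, and visualization, and achieves a 21% relative gain on a synthetic benchmark.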

Details Motivation: To address the limitations of OrthoAI v1: insufficient dental landmark precision, an overly simple evaluation model (binary pass/fail only), and no staged treatment simulation. Method: Three improvements are proposed: (i) a second agent based on the Conditioned Heatmap Regression Methodology (CHARM) for segmentation-free dental landmark detection, fused with the first agent through a confidence-weighted orchestrator; (ii) a six-category weighted biomechanical scoring model that replaces the binary check; (iii) a multi-frame 6-DoF tooth trajectory simulator based on SLERP interpolation and evidence-based staging rules. Result: On a synthetic benchmark of 200 crowding cases, the parallel ensemble of OrthoAI v2 reaches a planning quality score of 92.8±4.1, a 21% relative gain over v1 (76.4±8.3), with an inference time of 4.2±0.8 seconds and full CPU deployability. Conclusion: OrthoAI v2 overcomes the key limitations of its predecessor, with substantive progress in precision, evaluation breadth, and clinical visualization while remaining efficient and deployable, providing a more reliable and practical open-source framework for AI-driven orthodontic treatment planning. Abstract: We present OrthoAI v2, the second iteration of our open-source pipeline for AI-assisted orthodontic treatment planning with clear aligners, substantially extending the single-agent framework previously introduced. The first version established a proof-of-concept based on Dynamic Graph Convolutional Neural Networks (DGCNN) for tooth segmentation but was limited to per-tooth centroid extraction, lacked landmark-level precision, and produced a scalar quality score without staging simulation. OrthoAI v2 addresses all three limitations through three principal contributions: (i) a second agent adopting the Conditioned Heatmap Regression Methodology (CHARM) [rodriguez2025charm] for direct, segmentation-free dental landmark detection, fused with Agent 1 via a confidence-weighted orchestrator in three modes (parallel, sequential, single-agent); (ii) a composite six-category biomechanical scoring model (biomechanics $\times$ 0.30 + staging $\times$ 0.20 + attachments $\times$ 0.15 + IPR $\times$ 0.10 + occlusion $\times$ 0.10 + predictability $\times$ 0.15) replacing the binary pass/fail check of v1; (iii) a multi-frame treatment simulator generating $F = A \times r$ temporally coherent 6-DoF tooth trajectories via SLERP interpolation and evidence-based staging rules, enabling ClinCheck 4D visualisation.
On a synthetic benchmark of 200 crowding scenarios, the parallel ensemble of OrthoAI v2 reaches a planning quality score of $92.8 \pm 4.1$ vs. $76.4 \pm 8.3$ for OrthoAI v1, a $+21\%$ relative gain, while maintaining full CPU deployability ($4.2 \pm 0.8$ s).
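The six-category weighted score quoted in the abstract can be sketched as a plain weighted sum. The weights below are the ones stated in the abstract; the function names, the dictionary keys, and the assumption that every sub-score lives on a common 0-100 scale are illustrative choices, not details confirmed by the paper.

```python
# Weights as listed in the abstract; key names are hypothetical.
WEIGHTS = {
    "biomechanics": 0.30,
    "staging": 0.20,
    "attachments": 0.15,
    "ipr": 0.10,
    "occlusion": 0.10,
    "predictability": 0.15,
}


def composite_score(sub_scores):
    """Weighted sum of the six category scores (assumed to share one scale, e.g. 0-100)."""
    missing = set(WEIGHTS) - set(sub_scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old


# Sanity check on the reported means: 92.8 vs. 76.4.
gain = relative_gain(92.8, 76.4)
```

Because the weights sum to 1, the composite stays on the same scale as its inputs, and the reported means of 92.8 and 76.4 give a relative gain of roughly 21%, consistent with the abstract.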

[70] CLRNet: Targetless Extrinsic Calibration for Camera, Lidar and 4D Radar Using Deep Learning

Marcell Kegl,Andras Palffy,Csaba Benedek,Dariu M. Gavrila

Main category: cs.CV

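TL;DR: This paper proposes CLRNet, a multi-modal end-to-end deep learning network for joint or pairwise extrinsic calibration of camera, lidar, and 4D radar. By incorporating equirectangular projection, camera-based depth prediction, additional radar channels, and a shared feature space, it substantially improves calibration accuracy on multiple datasets, and its cross-domain transfer capability is also examined.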

Details Motivation: The sparsity of radar data makes extrinsic calibration difficult, and existing methods struggle to calibrate camera, lidar, and 4D radar jointly with high accuracy. Method: CLRNet is a multi-modal end-to-end deep learning network that combines equirectangular projection, camera-based depth image prediction, and additional radar channels, and uses lidar to build a shared feature space with a loop closure loss. Result: On the View-of-Delft and Dual-Radar datasets, median translational and rotational calibration errors are both reduced by at least 50% compared with state-of-the-art methods, and domain transfer across datasets is examined. Conclusion: CLRNet offers an efficient, robust, and extensible deep learning solution for extrinsic calibration of multi-sensor setups, especially those including 4D radar, with potential for practical deployment. Abstract: In this paper, we address extrinsic calibration for camera, lidar, and 4D radar sensors. Accurate extrinsic calibration of radar remains a challenge due to the sparsity of its data. We propose CLRNet, a novel, multi-modal end-to-end deep learning (DL) calibration network capable of addressing joint camera-lidar-radar calibration, or pairwise calibration between any two of these sensors. We incorporate equirectangular projection, camera-based depth image prediction, additional radar channels, and leverage lidar with a shared feature space and loop closure loss. In extensive experiments using the View-of-Delft and Dual-Radar datasets, we demonstrate superior calibration accuracy compared to existing state-of-the-art methods, reducing both median translational and rotational calibration errors by at least 50%. Finally, we examine the domain transfer capabilities of the proposed network and baselines, when evaluating across datasets. The code will be made publicly available upon acceptance at: https://github.com/tudelft-iv.

[71] Domain Adaptation Without the Compute Burden for Efficient Whole Slide Image Analysis

Umar Marikkar,Muhammad Awais,Sara Atito

Main category: cs.CV

TL;DR: 本文提出了一种名为EfficientWSI(eWSI)的新方法,结合参数高效微调(PEFT)与多实例学习(MIL),实现全切片图像(WSI)任务的端到端训练,在降低计算成本的同时提升任务特异性性能。

Details Motivation: Existing WSI analysis methods rely on ImageNet-pretrained feature extractors that lack pathology-specific features, while in-domain pre-training is computationally expensive and not task-specific. Method: PEFT and MIL are integrated to support end-to-end training on WSIs, which can adapt ImageNet feature extractors and also improve the task adaptation of in-domain pretrained models. Result: Across seven WSI-level tasks on Camelyon16, TCGA, and BRACS, eWSI with ImageNet feature extractors matches or exceeds existing MIL methods that use in-domain pre-training; pairing eWSI with in-domain feature extractors improves performance further on most tasks. Conclusion: eWSI offers a computationally efficient, task-targeted approach to WSI analysis and a viable path to task-specific learning in computational pathology. Abstract: Computational methods for analyzing Whole Slide Images (WSIs) enable early diagnosis and treatments by supporting pathologists in detection and classification of tumors. However, the extremely high resolution of WSIs makes end-to-end training impractical compared to typical image analysis tasks. To address this, most approaches use pre-trained feature extractors to obtain fixed representations of whole slides, which are then combined with Multiple Instance Learning (MIL) for downstream tasks. These feature extractors are typically pre-trained on natural image datasets such as ImageNet, which fail to capture domain-specific characteristics. Although domain-specific pre-training on histopathology data yields more relevant feature representations, it remains computationally expensive and fails to capture task-specific characteristics within the domain. To address the computational cost and lack of task-specificity in domain-specific pre-training, we propose EfficientWSI (eWSI), a careful integration of Parameter-Efficient-Fine-Tuning (PEFT) and Multiple Instance Learning (MIL) that enables end-to-end training on WSI tasks. We evaluate eWSI on seven WSI-level tasks over Camelyon16, TCGA and BRACS datasets. Our results show that eWSI when applied with ImageNet feature extractors yields strong classification performance, matching or outperforming MILs with in-domain feature extractors, alleviating the need for extensive in-domain pre-training.
Furthermore, when eWSI is applied with in-domain feature extractors, it further improves classification performance in most cases, demonstrating its ability to capture task-specific information where beneficial. Our findings suggest that eWSI provides a task-targeted, computationally efficient path for WSI tasks, offering a promising direction for task-specific learning in computational pathology.

[72] Parallelised Differentiable Straightest Geodesics for 3D Meshes

Hippolyte Verninas,Caner Korkmaz,Stefanos Zafeiriou,Tolga Birdal,Simone Foti

Main category: cs.CV

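TL;DR: This paper presents a differentiable exponential map based on straightest geodesics for Riemannian surfaces discretised as meshes, with a parallel GPU implementation, and builds on it a new geodesic convolutional layer, a flow matching method, and a second-order optimiser.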

Details Motivation: Geometric machine learning on surfaces such as meshes is held back by the lack of closed-form Riemannian operators, the non-differentiability of their discrete counterparts, and poor parallelisation. Method: The exponential map is computed in parallel on the GPU within the straightest geodesics framework, and two differentiation strategies are derived: one using an extrinsic proxy function and one based on geodesic finite differences. Result: The approach yields an accurate, high-performance differentiable exponential map, and its effectiveness is shown in geodesic convolution, flow matching, and optimisation of centroidal Voronoi tessellations. Conclusion: The work provides an efficient, differentiable, and scalable exponential map as a building block for machine learning on surfaces, improving learning and optimisation on general geometries. Abstract: Machine learning has been progressively generalised to operate within non-Euclidean domains, but geometrically accurate methods for learning on surfaces are still falling behind. The lack of closed-form Riemannian operators, the non-differentiability of their discrete counterparts, and poor parallelisation capabilities have been the main obstacles to the development of the field on meshes. A principled framework to compute the exponential map on Riemannian surfaces discretised as meshes is straightest geodesics, which also allows tracing geodesics and parallel-transporting vectors as a by-product. We provide a parallel GPU implementation and derive two different methods for differentiating through the straightest geodesics, one leveraging an extrinsic proxy function and one based upon a geodesic finite differences scheme. After proving our parallelisation performance and accuracy, we demonstrate how our differentiable exponential map can improve learning and optimisation pipelines on general geometries. In particular, to showcase the versatility of our method, we propose a new geodesic convolutional layer, a new flow matching method for learning on meshes, and a second-order optimiser that we apply to centroidal Voronoi tessellation. Our code, models, and pip-installable library (digeo) are available at: circle-group.github.io/research/DSG.

[73] Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory

Ce Zhang,Jinxi He,Junyi He,Katia Sycara,Yaqi Xie

Main category: cs.CV

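TL;DR: This paper introduces the MM-SafetyBench++ benchmark and the EchoSafe framework to improve the contextual safety of multi-modal large language models (MLLMs). The benchmark builds image-text pairs that are semantically close but opposite in safety intent for fine-grained evaluation, and EchoSafe adds a training-free self-reflective memory mechanism for context-aware safety reasoning.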

Details Motivation: Existing MLLM safety defenses focus on detecting and refusing explicitly unsafe inputs, and overlook contextual safety, where telling apart subtle differences in safety intent depends on understanding context. Method: The MM-SafetyBench++ benchmark pairs each unsafe image-text sample with a safe counterpart that differs only by a minimal change to the user intent. The training-free EchoSafe framework uses a self-reflective memory bank to store and retrieve past safety insights and integrates them into the current prompt to support context-aware reasoning. Result: On multiple multi-modal safety benchmarks, EchoSafe consistently achieves superior performance and improves the model's ability to discriminate and respond to contextual safety intent. Conclusion: Contextual safety is an important new dimension of MLLM safety research, and MM-SafetyBench++ and EchoSafe provide a reproducible evaluation standard and an efficient, practical technique for it. Abstract: Multi-modal Large Language Models (MLLMs) have achieved remarkable performance across a wide range of visual reasoning tasks, yet their vulnerability to safety risks remains a pressing concern. While prior research primarily focuses on jailbreak defenses that detect and refuse explicitly unsafe inputs, such approaches often overlook contextual safety, which requires models to distinguish subtle contextual differences between scenarios that may appear similar but diverge significantly in safety intent. In this work, we present MM-SafetyBench++, a carefully curated benchmark designed for contextual safety evaluation. Specifically, for each unsafe image-text pair, we construct a corresponding safe counterpart through minimal modifications that flip the user intent while preserving the underlying contextual meaning, enabling controlled evaluation of whether models can adapt their safety behaviors based on contextual understanding. Further, we introduce EchoSafe, a training-free framework that maintains a self-reflective memory bank to accumulate and retrieve safety insights from prior interactions. By integrating relevant past experiences into current prompts, EchoSafe enables context-aware reasoning and continual evolution of safety behavior during inference. Extensive experiments on various multi-modal safety benchmarks demonstrate that EchoSafe consistently achieves superior performance, establishing a strong baseline for advancing contextual safety in MLLMs. All benchmark data and code are available at https://echosafe-mllm.github.io.

[74] Feed-forward Gaussian Registration for Head Avatar Creation and Editing

Malte Prinzler,Paulo Gotardo,Siyu Tang,Timo Bolkart

Main category: cs.CV

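TL;DR: MATCH is a multi-view Gaussian registration method for fast creation and editing of high-quality head avatars. A transformer-based model processes each frame in 0.5 seconds without preprocessing, and cross-subject correspondence supports applications such as expression transfer and semantic editing.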

Details Motivation: Existing head avatar methods are slow (more than a day in total) and require laborious head tracking and expensive optimization, so a more efficient approach is needed. Method: MATCH uses a transformer-based model to predict Gaussian splat textures in the fixed UV layout of a template mesh. A registration-guided attention block lets each UV-map token attend only to image tokens that depict its corresponding mesh region, avoiding dense cross-view attention. Result: MATCH outperforms existing methods in novel-view synthesis, geometry registration, and head avatar generation, and makes avatar creation 10 times faster than the closest baseline. Conclusion: MATCH enables efficient, high-quality head avatar creation and editing, supports a range of cross-subject applications, and improves both speed and flexibility. Abstract: We present MATCH (Multi-view Avatars from Topologically Corresponding Heads), a multi-view Gaussian registration method for high-quality head avatar creation and editing. State-of-the-art multi-view head avatar methods require time-consuming head tracking followed by expensive avatar optimization, often resulting in a total creation time of more than one day. MATCH, in contrast, directly predicts Gaussian splat textures in correspondence from calibrated multi-view images in just 0.5 seconds per frame, without requiring data preprocessing. The learned intra-subject correspondence across frames enables fast creation of personalized head avatars, while correspondence across subjects supports applications such as expression transfer, optimization-free tracking, semantic editing, and identity interpolation. We establish these correspondences end-to-end using a transformer-based model that predicts Gaussian splat textures in the fixed UV layout of a template mesh. To achieve this, we introduce a novel registration-guided attention block, where each UV-map token attends exclusively to image tokens depicting its corresponding mesh region. This design improves efficiency and performance compared to dense cross-view attention. MATCH outperforms existing methods in novel-view synthesis, geometry registration, and head avatar generation, while making avatar creation 10 times faster than the closest competing baseline. The code and model weights are available on the project website.

[75] ModTrack: Sensor-Agnostic Multi-View Tracking via Identity-Informed PHD Filtering with Covariance Propagation

Aditya Iyer,Jack Roberts,Nora Ayanian

Main category: cs.CV

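TL;DR: ModTrack is a modular multi-view multi-object tracking (MV-MOT) system that confines learning to detection and feature extraction and handles the rest of the pipeline analytically, giving cross-modal, sensor-agnostic generalization and traceable uncertainty. It reaches 95.5 IDF1 and 91.4 MOTA on WildTrack, rivaling end-to-end methods, and its tracker core transfers unchanged to MultiviewX and RadarScenes.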

Details Motivation: Existing end-to-end MV-MOT methods lack principled uncertainty modeling and are tightly coupled to sensor layout, modality, and dataset, which limits generalization; a new approach is needed that combines high performance, interpretability, and deployment flexibility. Method: ModTrack confines learning to the detection and feature extraction module; all other steps (cross-view clustering, precision-weighted fusion, identity assignment, and temporal tracking) use closed-form analytical methods. Each sensor contributes calibrated position-covariance pairs $(\mathbf{z}, R)$, which are fused into unified estimates $(\hat{\mathbf{z}}, \hat{R})$ and passed to a feedback-coupled, identity-informed GM-PHD filter (with HMM motion modes) that maintains identities robustly. Result: On WildTrack, ModTrack achieves 95.5 IDF1 and 91.4 MOTA, surpassing all prior modular methods by over 21 points and rivaling state-of-the-art end-to-end methods. The same tracker core transfers unchanged to MultiviewX and RadarScenes; only the perception module needs to be replaced for a new modality. Conclusion: ModTrack supports a "minimal learning, maximal inference" design for MV-MOT, combining high performance, strong generalization, interpretability, and ease of deployment, and offers a better engineering path for practical multi-sensor tracking systems. Abstract: Multi-View Multi-Object Tracking (MV-MOT) aims to localize and maintain consistent identities of objects observed by multiple sensors. This task is challenging, as viewpoint changes and occlusion disrupt identity consistency across views and time. Recent end-to-end approaches address this by jointly learning 2D Bird's Eye View (BEV) representations and identity associations, achieving high tracking accuracy. However, these methods offer no principled uncertainty accounting and remain tightly coupled to their training configuration, limiting generalization across sensor layouts, modalities, or datasets without retraining. We propose ModTrack, a modular MV-MOT system that matches end-to-end performance while providing cross-modal, sensor-agnostic generalization and traceable uncertainty. ModTrack confines learning methods to just the Detection and Feature Extraction stage of the MV-MOT pipeline, performing all fusion, association, and tracking with closed-form analytical methods. Our design reduces each sensor's output to calibrated position-covariance pairs $(\mathbf{z}, R)$; cross-view clustering and precision-weighted fusion then yield unified estimates $(\hat{\mathbf{z}}, \hat{R})$ for identity assignment and temporal tracking. A feedback-coupled, identity-informed Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter with HMM motion modes uses these fused estimates to maintain identities under missed detections and heavy occlusion.
ModTrack achieves 95.5 IDF1 and 91.4 MOTA on WildTrack, surpassing all prior modular methods by over 21 points and rivaling the state-of-the-art end-to-end methods while providing deployment flexibility they cannot. Specifically, the same tracker core transfers unchanged to MultiviewX and RadarScenes, with only perception-module replacement required to extend to new domains and sensor modalities.
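The precision-weighted fusion of position-covariance pairs $(\mathbf{z}, R)$ mentioned in the abstract can be illustrated with the textbook inverse-covariance rule, $\hat{R} = (\sum_i R_i^{-1})^{-1}$ and $\hat{\mathbf{z}} = \hat{R} \sum_i R_i^{-1} \mathbf{z}_i$. The abstract does not give ModTrack's exact formula, so the following is a sketch of the standard rule for 2D positions; the helper names `inv2` and `fuse` are hypothetical.

```python
def inv2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular covariance")
    return [[d / det, -b / det], [-c / det, a / det]]


def fuse(measurements):
    """Fuse 2D (z, R) pairs with the standard inverse-covariance (precision) weighting.

    Assumes the measurements are independent observations of the same object.
    """
    info = [[0.0, 0.0], [0.0, 0.0]]  # sum of precision matrices R_i^{-1}
    info_vec = [0.0, 0.0]            # sum of R_i^{-1} z_i
    for z, R in measurements:
        P = inv2(R)
        for i in range(2):
            info_vec[i] += P[i][0] * z[0] + P[i][1] * z[1]
            for j in range(2):
                info[i][j] += P[i][j]
    R_hat = inv2(info)
    z_hat = [R_hat[i][0] * info_vec[0] + R_hat[i][1] * info_vec[1] for i in range(2)]
    return z_hat, R_hat


# Two views of one object: the second is three times noisier, so it gets less weight.
z_hat, R_hat = fuse([
    ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),
    ([4.0, 0.0], [[3.0, 0.0], [0.0, 3.0]]),
])
```

The fused covariance is never larger than the smallest input covariance, which is what lets uncertainty be traced from each sensor through to the tracker.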

[76] Conflict-Aware Multimodal Fusion for Ambivalence and Hesitancy Recognition

Salah Eddine Bekhouche,Hichem Telli,Azeddine Benlamoudi,Salah Eddine Herrouz,Abdelmalik Taleb-Ahmed,Abdenour Hadid

Main category: cs.CV

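TL;DR: This paper proposes ConflictAwareAH, a multimodal framework that recognizes ambivalence and hesitancy (A/H) by modeling conflict features between video, audio, and text (absolute differences between pairwise embeddings). It markedly improves recognition of non-A/H samples and achieves state-of-the-art performance on the BAH dataset.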

Details Motivation: Automatically recognizing ambivalence and hesitancy (A/H) is valuable in clinical settings, but the key cues lie in disagreements between channels (words, voice, face), and text-dominant methods tend to over-detect A/H while struggling to confirm its absence. Method: ConflictAwareAH uses three pre-trained encoders to extract video, audio, and text representations, and builds element-wise absolute differences between pairwise modality embeddings as bidirectional conflict features (large differences flag A/H, small differences anchor non-A/H). A text-guided late fusion strategy blends a text-only auxiliary head with the full model at inference. Result: On the BAH dataset, Macro F1 reaches 0.694 on the labelled test split and 0.715 on the private leaderboard, more than 10 points above published multimodal baselines. F1-NoAH improves by +4.6 points over text alone and the class-performance gap is halved; text-guided fusion adds +4.1 Macro F1; training takes under 25 minutes on a single GPU. Conclusion: Conflict-aware multimodal modeling separates A/H from non-A/H states more robustly and mitigates the over-detection of text-dominant methods, offering an efficient and interpretable approach for clinical affect analysis. Abstract: Ambivalence and hesitancy (A/H) are subtle affective states where a person shows conflicting signals through different channels, saying one thing while their face or voice tells another story. Recognising these states automatically is valuable in clinical settings, but it is hard for machines because the key evidence lives in the disagreements between what is said, how it sounds, and what the face shows. We present ConflictAwareAH, a multimodal framework built for this problem. Three pre-trained encoders extract video, audio, and text representations. Pairwise conflict features (element-wise absolute differences between modality embeddings) serve as bidirectional cues: large cross-modal differences flag A/H, while small differences confirm behavioural consistency and anchor the negative class. This conflict-aware design addresses a key limitation of text-dominant approaches, which tend to over-detect A/H (high F1-AH) while struggling to confirm its absence: our multimodal model improves F1-NoAH by +4.6 points over text alone and halves the class-performance gap. A complementary text-guided late fusion strategy blends a text-only auxiliary head with the full model at inference, adding +4.1 Macro F1.
On the BAH dataset from the ABAW10 Ambivalence/Hesitancy Challenge, our method reaches 0.694 Macro F1 on the labelled test split and 0.715 on the private leaderboard, outperforming published multimodal baselines by over 10 points, all on a single GPU in under 25 minutes of training.
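The pairwise conflict features described in the abstract are element-wise absolute differences between modality embeddings. A minimal sketch follows; it assumes the three embeddings have already been projected to a common dimension (the abstract does not say how), and the `conflict_features` name and the toy vectors are hypothetical.

```python
from itertools import combinations


def conflict_features(embeddings):
    """Element-wise absolute difference for every pair of modality embeddings.

    `embeddings` maps a modality name to a list of floats; all lists are
    assumed to share one dimension (a simplifying assumption of this sketch).
    """
    features = {}
    for a, b in combinations(sorted(embeddings), 2):
        ea, eb = embeddings[a], embeddings[b]
        if len(ea) != len(eb):
            raise ValueError(f"dimension mismatch between {a} and {b}")
        features[(a, b)] = [abs(x - y) for x, y in zip(ea, eb)]
    return features


# Toy example: text disagrees with audio and video on the second dimension.
feats = conflict_features({
    "video": [1.0, 0.0],
    "audio": [0.5, 0.0],
    "text": [1.0, 2.0],
})
```

Large entries mark dimensions where two channels disagree, and near-zero entries mark agreement, which is why the same features can serve as evidence both for and against A/H.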

[77] Beyond the Embedding Bottleneck: Adaptive Retrieval-Augmented 3D CT Report Generation

Renjie Liang,Yiling Ma,Yang Xing,Zhengkang Fan,Jinqian Pan,Chengkun Sun,Li Li,Kuang Gong,Jie Xu

Main category: cs.CV

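TL;DR: This paper proposes AdaRAG-CT, a framework that uses adaptive retrieval and text augmentation to relieve the representational bottleneck of 3D CT embeddings, substantially improving the clinical efficacy of generated radiology reports.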

Details Motivation: Existing automated radiology report generation for 3D CT covers pathology incompletely. The authors find that the root cause is severe dimensional concentration in the visual representation (contrastively learned 3D CT embeddings), which creates a representational bottleneck. Method: AdaRAG-CT is an adaptive augmentation framework that combines controlled text retrieval with selective text integration during generation to compensate for the weak visual representation. Result: On the CT-RATE benchmark, Clinical F1 reaches 0.480, 6 points above CT-Agent; ablations confirm that both the retrieval and generation components contribute. Conclusion: The visual representation bottleneck is a key constraint on 3D CT report generation, and adding suitably adapted external textual knowledge can overcome it; AdaRAG-CT offers a workable path. Abstract: Automated radiology report generation from 3D CT volumes often suffers from incomplete pathology coverage. We provide empirical evidence that this limitation stems from a representational bottleneck: contrastive 3D CT embeddings encode discriminative pathology signals, yet exhibit severe dimensional concentration, with as few as 2 effective dimensions out of 512. Corroborating this, scaling the language model yields no measurable improvement, suggesting that the bottleneck lies in the visual representation rather than the generator. This bottleneck limits both generation and retrieval; naive static retrieval fails to improve clinical efficacy and can even degrade performance. We propose AdaRAG-CT, an adaptive augmentation framework that compensates for this visual bottleneck by introducing supplementary textual information through controlled retrieval and selectively integrating it during generation. On the CT-RATE benchmark, AdaRAG-CT achieves state-of-the-art clinical efficacy, improving Clinical F1 from 0.420 (CT-Agent) to 0.480 (+6 points); ablation studies confirm that both the retrieval and generation components contribute to the improvement. Code is available at https://github.com/renjie-liang/Adaptive-RAG-for-3DCT-Report-Generation.

[78] FEEL (Force-Enhanced Egocentric Learning): A Dataset for Physical Action Understanding

Eadom Dessalene,Botao He,Michael Maynord,Yonatan Tussa,Pavan Mantripragada,Yianni Karabati,Nirupam Roy,Yiannis Aloimonos

Main category: cs.CV

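TL;DR: This paper introduces FEEL, the first large-scale dataset pairing force measurements with egocentric video, for physical action understanding tasks including contact understanding and action representation learning, and reports strong results on several benchmarks.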

Details Motivation: Force is the underlying cause of physical interaction and a key primitive for physical action understanding, yet existing datasets lack force measurements synchronized with video. Method: The authors build FEEL, a large-scale dataset of about 3 million force-synchronized video frames, collecting force signals with custom piezoresistive gloves during natural, unscripted manipulation in kitchen environments. They apply it to contact understanding (temporal contact segmentation and pixel-level contacted object segmentation) and to action representation learning (with force prediction as a self-supervised pretraining objective). Result: The approach achieves state-of-the-art temporal contact segmentation and competitive pixel-level contacted object segmentation without manual annotations, and improves transfer performance on action understanding tasks over EPIC-Kitchens, SomethingSomething-V2, EgoExo4D, and Meccano without manual labels. Conclusion: FEEL and its applications show the value of force signals for physical action understanding from egocentric video and offer a new approach to unsupervised and weakly supervised action understanding. Abstract: We introduce FEEL (Force-Enhanced Egocentric Learning), the first large-scale dataset pairing force measurements gathered from custom piezoresistive gloves with egocentric video. Our gloves enable scalable data collection, and FEEL contains approximately 3 million force-synchronized frames of natural unscripted manipulation in kitchen environments, with 45% of frames involving hand-object contact. Because force is the underlying cause that drives physical interaction, it is a critical primitive for physical action understanding. We demonstrate the utility of force for physical action understanding through application of FEEL to two families of tasks: (1) contact understanding, where we jointly perform temporal contact segmentation and pixel-level contacted object segmentation; and, (2) action representation learning, where force prediction serves as a self-supervised pretraining objective for video backbones. We achieve state-of-the-art temporal contact segmentation results and competitive pixel-level segmentation results without any need for manual contacted object segmentation annotations. Furthermore we demonstrate that action representation learning with FEEL improves transfer performance on action understanding tasks without any manual labels over EPIC-Kitchens, SomethingSomething-V2, EgoExo4D and Meccano.

[79] Self-supervised Disentanglement of Disease Effects from Aging in 3D Medical Shapes

Jakaria Rabbi,Nilanjan Ray,Dana Cobzas

Main category: cs.CV

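TL;DR: This paper proposes a two-stage framework that disentangles pathological changes from physiological aging in 3D medical shapes with few or no diagnosis labels. It combines unsupervised disease discovery with self-supervised disentanglement of implicit shape representations, enabling high-fidelity reconstruction, controllable generation, and factor-level explainability.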

Details Motivation: Pathological changes and physiological aging often produce overlapping deformations in 3D medical shapes, which makes them hard to separate, and scarce clinical diagnosis labels hinder interpretable biomarkers and patient stratification. Method: In the first stage, an implicit neural model trained with signed distance functions learns stable shape embeddings, and clustering in the latent space yields pseudo disease labels. In the second stage, the pseudo disease labels and ground-truth age labels drive multi-objective disentanglement in a compact variational space, combining a covariance constraint with a supervised contrastive loss. Result: On ADNI hippocampus and OAI distal femur data, the method achieves near-supervised performance and outperforms existing unsupervised methods in disentanglement, reconstruction quality, controllable synthesis, and factor explainability. Conclusion: The framework disentangles disease and aging factors without large amounts of annotation, offering a new approach to interpretable medical shape analysis in label-scarce settings. Abstract: Disentangling pathological changes from physiological aging in 3D medical shapes is crucial for developing interpretable biomarkers and patient stratification. However, this separation is challenging when diagnosis labels are limited or unavailable, since disease and aging often produce overlapping effects on shape changes, obscuring clinically relevant shape patterns. To address this challenge, we propose a two-stage framework combining unsupervised disease discovery with self-supervised disentanglement of implicit shape representations. In the first stage, we train an implicit neural model with signed distance functions to learn stable shape embeddings. We then apply clustering on the shape latent space, which yields pseudo disease labels without using ground-truth diagnosis during discovery. In the second stage, we disentangle factors in a compact variational space using pseudo disease labels discovered in the first stage and the ground truth age labels available for all subjects. We enforce separation and controllability with a multi-objective disentanglement loss combining covariance and a supervised contrastive loss. On ADNI hippocampus and OAI distal femur shapes, we achieve near-supervised performance, improving disentanglement and reconstruction over state-of-the-art unsupervised baselines, while enabling high-fidelity reconstruction, controllable synthesis, and factor-based explainability. Code and checkpoints are available at https://github.com/anonymous-submission01/medical-shape-disentanglement

[80] EvoIQA - Explaining Image Distortions with Evolved White-Box Logic

Ruchika Gupta,Illya Bakurov,Nathan Haut,Wolfgang Banzhaf

Main category: cs.CV

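TL;DR: This paper proposes EvoIQA, an explainable symbolic regression framework based on genetic programming for image quality assessment (IQA). It evolves human-readable mathematical formulas and, while remaining highly interpretable, performs on par with advanced deep learning models.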

Details Motivation: Traditional IQA metrics are either inflexible hand-crafted models or uninterpretable black-box deep learning models, so a method that combines interpretability with high performance is needed. Method: EvoIQA uses genetic programming (GP) for symbolic regression, taking features from the VSI, VIF, FSIM, and HaarPSI metrics as the terminal set and evolving explicit, readable mathematical formulas. Result: The evolved GP models align strongly with human visual preferences, outperform traditional hand-crafted metrics, and match state-of-the-art deep learning models such as DB-CNN. Conclusion: Explainable symbolic regression methods such as EvoIQA can achieve high interpretability without sacrificing performance, offering a new approach to IQA. Abstract: Traditional Image Quality Assessment (IQA) metrics typically fall into one of two extremes: rigid, hand-crafted mathematical models or "black-box" deep learning architectures that completely lack interpretability. To bridge this gap, we propose EvoIQA, a fully explainable symbolic regression framework based on Genetic Programming that Evolves explicit, human-readable mathematical formulas for image quality assessment (IQA). Utilizing a rich terminal set from the VSI, VIF, FSIM, and HaarPSI metrics, our framework inherently maps structural, chromatic, and information-theoretic degradations into observable mathematical equations. Our results demonstrate that the evolved GP models consistently achieve strong alignment between the predictions and human visual preferences. Furthermore, they not only outperform traditional hand-crafted metrics but also achieve performance parity with complex, state-of-the-art deep learning models like DB-CNN, proving that we no longer have to sacrifice interpretability for state-of-the-art performance.

[81] Sparse but not Simpler: A Multi-Level Interpretability Analysis of Vision Transformers

Siyu Zhang

Main category: cs.CV

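TL;DR: This paper systematically evaluates the relationship between weight sparsity and interpretability in Vision Transformers. Sparse models yield more compact circuits, but show no systematic improvement in neuron selectivity, sparse autoencoder feature interpretability, or attribution faithfulness, which indicates that structural sparsity alone is not enough to improve semantic interpretability.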

Details Motivation: Sparse neural networks are often assumed to be more interpretable than dense models, but it is unclear whether structural sparsity by itself improves semantic interpretability. Method: DeiT-III B/16 models are pruned with Wanda, and a multi-level interpretability framework, IMPACT, is introduced covering neurons, layer representations (analyzed with BatchTopK sparse autoencoders), task circuits (extracted via learnable node masking), and model-level attribution (transformer attribution evaluated with insertion and deletion metrics). Result: Sparse models have circuits with about 2.5 times fewer edges, but the fraction of active nodes stays similar or rises; there is no systematic improvement in neuron selectivity, SAE feature interpretability, or attribution faithfulness. Conclusion: Structural sparsity alone does not reliably improve the semantic interpretability of vision models, and evaluation should go beyond circuit compactness with more comprehensive frameworks. Abstract: Sparse neural networks are often hypothesized to be more interpretable than dense models, motivated by findings that weight sparsity can produce compact circuits in language models. However, it remains unclear whether structural sparsity itself leads to improved semantic interpretability. In this work, we systematically evaluate the relationship between weight sparsity and interpretability in Vision Transformers using DeiT-III B/16 models pruned with Wanda. To assess interpretability comprehensively, we introduce IMPACT, a multi-level framework that evaluates interpretability across four complementary levels: neurons, layer representations, task circuits, and model-level attribution. Layer representations are analyzed using BatchTopK sparse autoencoders, circuits are extracted via learnable node masking, and explanations are evaluated with transformer attribution using insertion and deletion metrics. Our results reveal a clear structural effect but limited interpretability gains. Sparse models produce circuits with approximately $2.5\times$ fewer edges than dense models, yet the fraction of active nodes remains similar or higher, indicating that pruning redistributes computation rather than isolating simpler functional modules. Consistent with this observation, sparse models show no systematic improvements in neuron-level selectivity, SAE feature interpretability, or attribution faithfulness. These findings suggest that structural sparsity alone does not reliably yield more interpretable vision models, highlighting the importance of evaluation frameworks that assess interpretability beyond circuit compactness.

[82] Nodule-Aligned Latent Space Learning with LLM-Driven Multimodal Diffusion for Lung Nodule Progression Prediction

James Song,Yifan Wang,Chuan Zhou,Liyue Shen

Main category: cs.CV

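TL;DR: This paper proposes NAMD, a framework that generates 1-year follow-up CT images from baseline CT scans and electronic health records (EHR) to predict lung nodule progression. On the NLST dataset, malignancy prediction from the synthesized images comes close to that from real follow-up images (AUROC 0.805 vs. 0.819) and significantly outperforms baseline scans and other synthesis methods.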

Details Motivation: Early lung cancer diagnosis is difficult, mainly because of biological uncertainty and a limited understanding of the mechanisms that drive nodule progression. Method: The Nodule-Aligned Multimodal (Latent) Diffusion (NAMD) framework builds a nodule-aligned latent space in which latent distances correspond directly to changes in nodule attributes, and uses a large language model (LLM)-driven control mechanism to inject patient EHR information into the diffusion backbone. Result: On the NLST dataset, malignancy prediction from synthesized images reaches AUROC = 0.805 and AUPRC = 0.346, significantly better than baseline scans and existing synthesis methods and close to real follow-up images (AUROC = 0.819, AUPRC = 0.393). Conclusion: NAMD captures clinically relevant features of lung nodule progression and can support earlier, more accurate lung cancer diagnosis. Abstract: Early diagnosis of lung cancer is challenging due to biological uncertainty and the limited understanding of the biological mechanisms driving nodule progression. To address this, we propose Nodule-Aligned Multimodal (Latent) Diffusion (NAMD), a novel framework that predicts lung nodule progression by generating 1-year follow-up nodule computed tomography images from baseline scans and the patient's and nodule's Electronic Health Record (EHR). NAMD introduces a nodule-aligned latent space, where distances between latents directly correspond to changes in nodule attributes, and utilizes an LLM-driven control mechanism to condition the diffusion backbone on patient data. On the National Lung Screening Trial (NLST) dataset, our method synthesizes follow-up nodule images that achieve an AUROC of 0.805 and an AUPRC of 0.346 for lung nodule malignancy prediction, significantly outperforming both baseline scans and state-of-the-art synthesis methods, while closely approaching the performance of real follow-up scans (AUROC: 0.819, AUPRC: 0.393). These results demonstrate that NAMD captures clinically relevant features of lung nodule progression, facilitating earlier and more accurate diagnosis.

[83] Towards Fair and Robust Volumetric CT Classification via KL-Regularised Group Distributionally Robust Optimisation

Samuel Johnny,Blessed Guda,Frank Ebeledike,Goodness Obasi,Moise Busogi

Main category: cs.CV

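TL;DR: This paper proposes a lightweight framework that combines MobileViT-XXS with a SliceTransformer and trains with a KL-regularised Group DRO objective, jointly mitigating cross-site distribution shift and gender-subgroup performance disparity in CT diagnosis. It substantially improves F1 and fairness on both COVID-19 detection and lung pathology recognition.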

Details Motivation: Automated chest CT diagnosis faces two challenges in clinical deployment: distribution shift across acquisition sites and performance disparity across demographic subgroups such as gender. Method: A lightweight MobileViT-XXS slice encoder and a two-layer SliceTransformer perform volumetric reasoning, trained with a KL-regularised Group Distributionally Robust Optimisation (Group DRO) objective that dynamically upweights underperforming acquisition centres and demographic subgroups; in Task 2, groups are defined at the gender × class level, with particular attention to severely underrepresented combinations such as female squamous cell carcinoma. Result: On Task 1 (binary COVID-19 classification), F1 reaches 0.835, 5.9 points above the best published result; on Task 2 (four-class lung pathology with gender fairness), Group DRO with α = 0.5 reaches a mean per-gender macro F1 of 0.815, 11.1 percentage points above the best challenge entry, and improves female squamous F1 by 17.4 over the Focal Loss baseline. Conclusion: The method improves generalisation and group fairness together; the KL regularisation prevents group weight collapse in Group DRO, giving a practical training recipe for multi-site medical imaging AI that balances robustness and fairness. Abstract: Automated diagnosis from chest computed tomography (CT) scans faces two persistent challenges in clinical deployment: distribution shift across acquisition sites and performance disparity across demographic subgroups. We address both simultaneously across two complementary tasks: binary COVID-19 classification from multi-site CT volumes (Task 1) and four-class lung pathology recognition with gender-based fairness constraints (Task 2). Our framework combines a lightweight MobileViT-XXS slice encoder with a two-layer SliceTransformer aggregator for volumetric reasoning, and trains with a KL-regularised Group Distributionally Robust Optimisation (Group DRO) objective that adaptively upweights underperforming acquisition centres and demographic subgroups. Unlike standard Group DRO, the KL penalty prevents group weight collapse, providing a stable balance between worst-case protection and average performance. For Task 2, we define groups at the granularity of gender × class, directly targeting severely underrepresented combinations such as female Squamous cell carcinoma. On Task 1, our best configuration achieves a challenge F1 of 0.835, surpassing the best published challenge entry by +5.9. On Task 2, Group DRO with α = 0.5 achieves a mean per-gender macro F1 of 0.815, outperforming the best challenge entry by +11.1 pp and improving Female Squamous F1 by +17.4 over the Focal Loss baseline.
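The effect of a KL penalty on group weights can be illustrated with one common textbook form, which may differ from the paper's exact objective. Assuming the inner problem is max over q of Σ_g q_g L_g − τ · KL(q ‖ uniform), the optimal weights are a softmax of the per-group losses at temperature τ. The names `kl_group_weights`, `robust_loss`, and `tau` are hypothetical, and `tau` is not claimed to correspond to the paper's α.

```python
import math

def kl_group_weights(group_losses, tau=1.0):
    """Closed-form maximiser of sum_g q_g * L_g - tau * KL(q || uniform).

    The solution is q_g proportional to exp(L_g / tau): a large tau keeps the
    weights near uniform, a small tau concentrates them on the worst group.
    """
    m = max(group_losses)  # subtract the max for numerical stability
    e = [math.exp((loss - m) / tau) for loss in group_losses]
    z = sum(e)
    return [v / z for v in e]

def robust_loss(group_losses, tau=1.0):
    """Group-weighted training loss under the weights above."""
    q = kl_group_weights(group_losses, tau)
    return sum(qi * li for qi, li in zip(q, group_losses))

# Hypothetical per-group losses, e.g. one value per (gender, class) group.
losses = [0.2, 0.9, 0.4]
print(kl_group_weights(losses, tau=100.0))  # close to uniform
print(kl_group_weights(losses, tau=0.01))   # nearly all weight on group 1
```

In this sketch the temperature makes the trade-off in the abstract concrete: without the penalty (τ → 0) the weights collapse onto a single worst group, while a finite τ keeps every group contributing to the gradient.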

[84] A Comprehensive Benchmark of Histopathology Foundation Models for Kidney Histopathology

Harishwar Reddy Kasireddy,Patricio S. La Rosa,Akshita Gupta,Anindya S. Paul,Jamie L. Fermin,William L. Clapp,Meryl A. Waldman,Tarek M. El-Ashkar,Sanjay Jain,Luis Rodrigues,Kuang Yu Jen,Avi Z. Rosenberg,Michael T. Eadon,Jeffrey B. Hodgin,Pinaki Sarder

Main category: cs.CV

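TL;DR: This paper systematically evaluates 11 publicly available histopathology foundation models (HFMs) on 11 kidney-specific downstream tasks spanning multiple stains, spatial scales, task types, and clinical objectives. HFMs perform well on coarse-grained renal structure tasks but degrade on fine-grained microstructural discrimination and prognostic inference, pointing to the need for kidney-specific, multi-stain, multimodal foundation models.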

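Details Motivation: Existing histopathology foundation models (HFMs) are pretrained mainly on cancer data, and their applicability to non-cancerous chronic kidney disease (CKD) is underexplored even though renal pathology often coexists with renal cell and urothelial carcinoma; their generalization to kidney disease needs to be assessed. Method: 11 public HFMs are evaluated on 11 kidney-specific downstream tasks covering multiple stains (PAS, H&E, PASM, IHC), tile- and slide-level spatial scales, and task types including classification, regression, and copy detection. Tile-level tasks use repeated stratified group cross-validation and slide-level tasks use repeated nested stratified cross-validation; statistical significance is assessed with the Friedman test followed by Wilcoxon signed-rank tests with Holm-Bonferroni correction. The evaluation toolkit kidney-hfm-eval is released as open source. Result: HFMs show moderate to strong performance on tasks driven by coarse meso-scale renal morphology (such as diagnostic classification and detection of prominent structural alterations), but performance consistently declines on tasks requiring fine-grained microstructural discrimination, complex biological phenotypes, or slide-level prognostic inference, largely independent of stain type. Conclusion: Current HFMs mainly encode static meso-scale representations and have limited capacity to capture subtle renal pathology or prognosis-related signals; kidney-specific, multi-stain, multimodal foundation models are needed to support reliable clinical decision-making in nephrology. Abstract: Histopathology foundation models (HFMs), pretrained on large-scale cancer datasets, have advanced computational pathology. However, their applicability to non-cancerous chronic kidney disease remains underexplored, despite coexistence of renal pathology with malignancies such as renal cell and urothelial carcinoma. We systematically evaluate 11 publicly available HFMs across 11 kidney-specific downstream tasks spanning multiple stains (PAS, H&E, PASM, and IHC), spatial scales (tile and slide-level), task types (classification, regression, and copy detection), and clinical objectives, including detection, diagnosis, and prognosis. Tile-level performance is assessed using repeated stratified group cross-validation, while slide-level tasks are evaluated using repeated nested stratified cross-validation. Statistical significance is examined using Friedman test followed by pairwise Wilcoxon signed-rank testing with Holm-Bonferroni correction and compact letter display visualization. To promote reproducibility, we release an open-source Python package, kidney-hfm-eval, available at https://pypi.org/project/kidney-hfm-eval/ , that reproduces the evaluation pipelines. Results show moderate to strong performance on tasks driven by coarse meso-scale renal morphology, including diagnostic classification and detection of prominent structural alterations. 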
In contrast, performance consistently declines for tasks requiring fine-grained microstructural discrimination, complex biological phenotypes, or slide-level prognostic inference, largely independent of stain type. Overall, current HFMs appear to encode predominantly static meso-scale representations and may have limited capacity to capture subtle renal pathology or prognosis-related signals. Our results highlight the need for kidney-specific, multi-stain, and multimodal foundation models to support clinically reliable decision-making in nephrology.
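The Holm-Bonferroni step used after the pairwise Wilcoxon tests can be sketched as follows. This is a generic pure-Python illustration of the standard step-down adjustment, not code from kidney-hfm-eval; the function name `holm_adjust` and the example p-values are hypothetical.

```python
def holm_adjust(pvals):
    """Holm-Bonferroni adjusted p-values, returned in the input order.

    Sort the m raw p-values ascending, multiply the one at 0-based rank r by
    (m - r), cap at 1, and enforce monotonicity with a running maximum.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running
    return adjusted

# Hypothetical raw p-values from three pairwise model comparisons.
raw = [0.01, 0.04, 0.03]
adj = holm_adjust(raw)
significant = [p < 0.05 for p in adj]
print(adj, significant)
```

With these inputs only the first comparison stays below 0.05 after adjustment, which is the kind of outcome a compact letter display then summarizes across all model pairs.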

[85] UMO: Unified In-Context Learning Unlocks Motion Foundation Model Priors

Xiaoyan Cong,Zekun Li,Zhiyang Dou,Hongyu Li,Omid Taheri,Chuan Guo,Abhay Mittal,Sizhe An,Taku Komura,Wojciech Matusik,Michael J. Black,Srinath Sridhar

Main category: cs.CV

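TL;DR: This paper proposes UMO, a framework that uses learnable frame-level meta-operation embeddings and lightweight temporal fusion to adapt a pretrained text-to-motion foundation model to many downstream motion generation tasks (such as temporal inpainting, text-guided editing, geometrically constrained generation, and multi-identity reaction generation), supporting them all with a single model and outperforming task-specific methods.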

Details Motivation: Existing motion foundation models support only the single task of text-to-motion generation and are hard to extend efficiently to diverse cross-modal and in-context motion generation tasks; a unified framework is needed to unlock the generality of pretrained generative priors. Method: UMO (1) casts downstream tasks as compositions of atomic per-frame operations; (2) introduces three learnable frame-level meta-operation embeddings that specify the intent of each frame; and (3) uses lightweight temporal fusion to inject in-context cues into the pretrained DiT backbone with negligible extra inference cost. Result: On new tasks including temporal inpainting, text-guided motion editing, text-serialized geometric constraints, and multi-identity reaction generation, UMO consistently outperforms task-specific and training-free baselines across multiple benchmarks. Conclusion: UMO extends a single-purpose text-to-motion foundation model into a unified framework for diverse downstream tasks, showing that a lightweight structural design can broaden what a pretrained model supports. Abstract: Large-scale foundation models (LFMs) have recently made impressive progress in text-to-motion generation by learning strong generative priors from massive 3D human motion datasets and paired text descriptions. However, how to effectively and efficiently leverage such single-purpose motion LFMs, i.e., text-to-motion synthesis, in more diverse cross-modal and in-context motion generation downstream tasks remains largely unclear. Prior work typically adapts pretrained generative priors to individual downstream tasks in a task-specific manner. In contrast, our goal is to unlock such priors to support a broad spectrum of downstream motion generation tasks within a single unified framework. To bridge this gap, we present UMO, a simple yet general unified formulation that casts diverse downstream tasks into compositions of atomic per-frame operations, enabling in-context adaptation to unlock the generative priors of pretrained DiT-based motion LFMs. Specifically, UMO introduces three learnable frame-level meta-operation embeddings to specify per-frame intent and employs lightweight temporal fusion to inject in-context cues into the pretrained backbone, with negligible runtime overhead compared to the base model. With this design, UMO finetunes the pretrained model, originally limited to text-to-motion generation, to support diverse previously unsupported tasks, including temporal inpainting, text-guided motion editing, text-serialized geometric constraints, and multi-identity reaction generation. 
Experiments demonstrate that UMO consistently outperforms task-specific and training-free baselines across a wide range of benchmarks, despite using a single unified model. Code and model will be publicly available. Project Page: https://oliver-cong02.github.io/UMO.github.io/

[86] Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models

Sijie Li,Biao Qian,Jungong Han

Main category: cs.CV

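TL;DR: This paper proposes ATV-Pruning, an asymmetric text-visual weight pruning method for large vision-language models (LVLMs). It decouples the differing sensitivities of the textual and visual pathways and designs a calibration strategy and an important-token selection mechanism for each, yielding more accurate and efficient pruning.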

Details Motivation: Existing LVLM pruning methods process multimodal calibration data in a unified way and overlook how differently textual and visual tokens behave under pruning, which makes pruning inaccurate. Method: ATV-Pruning (1) analyzes the sensitivity of the textual and visual pathways to pruning by decoupling them; (2) builds an adaptive calibration pool from all text tokens plus a subset of visual tokens; and (3) uses a layer-adaptive strategy to select visual tokens. Result: On standard multimodal benchmarks, ATV-Pruning outperforms state-of-the-art pruning methods; the visual pathway tolerates sparsity as high as 50%, while the textual pathway needs more careful calibration. Conclusion: Modality specificity is key to LVLM pruning; by modeling the importance of the textual and visual pathways asymmetrically, ATV-Pruning achieves more robust and efficient model compression. Abstract: Network pruning is an effective technique for enabling lightweight Large Vision-Language Models (LVLMs), which primarily incorporates both weights and activations into the importance metric. However, existing efforts typically process calibration data from different modalities in a unified manner, overlooking modality-specific behaviors. This raises a critical challenge: how to address the divergent behaviors of textual and visual tokens for accurate pruning of LVLMs. To this end, we systematically investigate the sensitivity of visual and textual tokens to the pruning operation by decoupling their corresponding weights, revealing that: (i) the textual pathway should be calibrated via text tokens, since it exhibits higher sensitivity than the visual pathway; (ii) the visual pathway exhibits high redundancy, permitting even 50% sparsity. Motivated by these insights, we propose a simple yet effective Asymmetric Text-Visual Weight Pruning method for LVLMs, dubbed ATV-Pruning, which establishes the importance metric for accurate weight pruning by selecting the informative tokens from both textual and visual pathways. Specifically, ATV-Pruning integrates two primary innovations: first, a calibration pool is adaptively constructed by drawing on all textual tokens and a subset of visual tokens; second, we devise a layer-adaptive selection strategy to yield important visual tokens. Finally, extensive experiments across standard multimodal benchmarks verify the superiority of our ATV-Pruning over state-of-the-art methods.

[87] FlatLands: Generative Floormap Completion From a Single Egocentric View

Subhransu S. Bhattacharjee,Dylan Campbell,Rahul Shome

Main category: cs.CV

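TL;DR: This paper introduces the FlatLands dataset and benchmark for single-view bird's-eye view (BEV) floor completion, supporting applications such as indoor navigation.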

Details Motivation: A single egocentric image covers only a small part of the floor, whereas a complete metric traversability map is more useful for applications such as indoor navigation. Method: The FlatLands dataset contains 270,575 observations and comes with in- and out-of-distribution evaluation protocols; training-free approaches, deterministic models, ensembles, and stochastic generative models are compared, and an end-to-end monocular RGB-to-floormap pipeline is implemented. Result: The benchmark provides a rigorous testbed for uncertainty-aware indoor mapping and generative completion for embodied navigation. Conclusion: FlatLands establishes a standardized benchmark for single-view BEV floor completion and supports work on generative mapping and uncertainty modeling for embodied agents. Abstract: A single egocentric image typically captures only a small portion of the floor, yet a complete metric traversability map of the surroundings would better serve applications such as indoor navigation. We introduce FlatLands, a dataset and benchmark for single-view bird's-eye view (BEV) floor completion. The dataset contains 270,575 observations from 17,656 real metric indoor scenes drawn from six existing datasets, with aligned observation, visibility, validity, and ground-truth BEV maps, and the benchmark includes both in- and out-of-distribution evaluation protocols. We compare training-free approaches, deterministic models, ensembles, and stochastic generative models. Finally, we instantiate the task as an end-to-end monocular RGB-to-floormaps pipeline. FlatLands provides a rigorous testbed for uncertainty-aware indoor mapping and generative completion for embodied navigation.

[88] Speak, Segment, Track, Navigate: An Interactive System for Video-Guided Skull-Base Surgery

Jecia Z. Y. Mao,Francis X. Creighton,Russell H. Taylor,Manish Sahu

Main category: cs.CV

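TL;DR: This paper proposes a speech-guided embodied agent framework for video-guided skull base surgery. Combining natural language interaction with real-time visual perception directly on the intraoperative video stream, it provides instrument segmentation and tracking, anatomical segmentation, registration of preoperative 3D models, and real-time anatomical overlay guidance without additional hardware.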

Details Motivation: Conventional image-guided navigation systems depend on external optical trackers and additional hardware, which limits workflow integration and rapid deployment; a more natural, seamless form of intraoperative computational assistance is needed. Method: The speech-guided embodied agent framework first performs interactive segmentation and labeling of the surgical instrument in the intraoperative video, then uses the instrument as an anchor that is tracked autonomously; on top of this it supports anatomical segmentation, interactive registration of preoperative 3D models, monocular video-based tool pose estimation, and real-time anatomical overlay guidance. Result: In skull base surgery scenarios, the system's instrument tracking reaches spatial accuracy competitive with a commercial optical tracking system while improving workflow integration and deployment speed. Conclusion: A purely video-driven, speech-guided embodied agent can serve as an alternative to conventional optical navigation, maintaining accuracy while making human-machine collaboration more natural and clinically practical. Abstract: We introduce a speech-guided embodied agent framework for video-guided skull base surgery that dynamically executes perception and image-guidance tasks in response to surgeon queries. The proposed system integrates natural language interaction with real-time visual perception directly on live intraoperative video streams, thereby enabling surgeons to request computational assistance without disengaging from operative tasks. Unlike conventional image-guided navigation systems that rely on external optical trackers and additional hardware setup, the framework operates purely on intraoperative video. The system begins with interactive segmentation and labeling of the surgical instrument. The segmented instrument is then used as a spatial anchor that is autonomously tracked in the video stream to support downstream workflows, including anatomical segmentation, interactive registration of preoperative 3D models, monocular video-based estimation of the surgical tool pose, and image guidance through real-time anatomical overlays. We evaluate the proposed system in video-guided skull base surgery scenarios and benchmark its tracking performance against a commercially available optical tracking system. Results demonstrate that speech-guided embodied agents can achieve competitive spatial accuracy while improving workflow integration and enabling rapid deployment of video-guided surgical systems.

[89] ViT-AdaLA: Adapting Vision Transformers with Linear Attention

Yifan Li,Seunghyun Yoon,Viet Dac Lai,Franck Dernoncourt,Jason Kuen,Yu Kong,Trung Bui

Main category: cs.CV

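TL;DR: This paper proposes ViT-AdaLA, a framework that transfers the knowledge of vision foundation models (VFMs) to linear attention ViTs in three stages (attention alignment, feature alignment, and supervised fine-tuning), preserving performance while substantially reducing computational complexity.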

Details Motivation: Vision Transformers (ViTs) have quadratic computational complexity, which limits scaling to long sequences; existing linear attention methods either train from scratch or rely on linearization techniques from large language models that transfer poorly, so the knowledge in existing VFMs is not effectively reused. Method: ViT-AdaLA adapts a model in three stages: (1) attention alignment, which makes linear attention approximate the behavior of the original softmax attention; (2) feature alignment, which uses a frozen softmax VFM as the teacher and aligns the final-layer features of the linear ViT to mitigate error accumulation; and (3) supervised fine-tuning, which transfers the adapted model to downstream tasks. Result: On classification and segmentation tasks, ViT-AdaLA outperforms a range of state-of-the-art linear attention ViT baselines, supporting its effectiveness and generality. Conclusion: ViT-AdaLA offers an efficient, transferable recipe for adapting ViTs to linear attention, carrying pretrained VFM knowledge into a lower-complexity architecture while balancing efficiency and performance. Abstract: Vision Transformer (ViT)-based vision foundation models (VFMs) have achieved remarkable performance across diverse vision tasks, but suffer from quadratic complexity that limits scalability to long sequences. Existing linear attention approaches for ViTs are typically trained from scratch, requiring substantial computational resources, while linearization-based methods developed for large language model decoders do not transfer well to ViTs. To address these challenges, we propose ViT-AdaLA, a novel framework for effectively adapting and transferring prior knowledge from VFMs to linear attention ViTs. ViT-AdaLA consists of three stages: attention alignment, feature alignment, and supervised fine-tuning. In the attention alignment stage, we align vanilla linear attention with the original softmax-based attention in each block to approximate the behavior of softmax attention. However, residual approximation errors inevitably accumulate across layers. We mitigate this by fine-tuning the linearized ViT to align its final-layer features with a frozen softmax VFM teacher. Finally, the adapted prior knowledge is transferred to downstream tasks through supervised fine-tuning. Extensive experiments on classification and segmentation tasks demonstrate the effectiveness and generality of ViT-AdaLA over various state-of-the-art linear attention counterparts.
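To make the complexity argument concrete, here is a minimal pure-Python sketch of kernelized linear attention. It assumes the feature map φ(x) = elu(x) + 1, a common choice in the linear attention literature; the abstract does not say which feature map ViT-AdaLA uses, so the names `phi` and `linear_attention` are illustrative only.

```python
import math

def phi(x):
    """Positive feature map elu(x) + 1, applied elementwise (an assumption)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(Q, K, V):
    """Non-causal linear attention over token lists Q, K (dim d) and V (dim dv).

    out_i = phi(q_i)^T S / (phi(q_i)^T z), with S = sum_j phi(k_j) v_j^T and
    z = sum_j phi(k_j). S and z are built once, so the cost grows linearly
    with the number of tokens instead of quadratically.
    """
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom
                    for b in range(dv)])
    return out

# Toy tokens (hypothetical values): two queries attending over three keys.
Q = [[0.3, -1.2], [1.0, 0.5]]
K = [[0.1, 0.4], [-0.7, 2.0], [1.5, -0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(linear_attention(Q, K, V))
```

The result equals attention with weights φ(q_i)·φ(k_j) normalized over j, which is not the same function as softmax attention. That mismatch is the gap the attention alignment stage described above is meant to shrink.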

[90] Attribution Upsampling should Redistribute, Not Interpolate

Vincenzo Buono,Peyman Sheikholharam Mashhadi,Mahmoud Rahat,Prayag Tiwari,Stefan Byttner

Main category: cs.CV

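TL;DR: This paper proposes Universal Semantic-Aware Upsampling (USU), which recasts attribution map upsampling as mass redistribution governed by semantic boundaries rather than conventional interpolation, avoiding signal distortion and satisfying four axioms for faithful upsampling.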

Details Motivation: Existing attribution methods rely on upsampling techniques designed for natural images (such as bilinear and bicubic interpolation), which corrupt saliency maps through aliasing, ringing, and boundary bleeding and produce spurious high-importance regions. The root cause is treating upsampling as an isolated interpolation problem rather than a mass redistribution problem that must respect the model's semantic boundaries. Method: USU models upsampling as a ratio-form mass redistribution operator. Four desiderata for faithful upsampling are formalized axiomatically; interpolation is proven to structurally violate three of them, those three are shown to force a ratio form, and the fourth uniquely determines the operator. Result: USU's formal guarantees are verified on models with known attribution priors; experiments on ImageNet, CIFAR-10, and CUB-200 show consistent faithfulness improvements and more semantically coherent, qualitatively better explanations. Conclusion: USU provides an axiomatic, semantics-aware, mass-preserving solution to attribution upsampling that moves beyond the interpolation paradigm and improves the reliability of XAI explanations. Abstract: Attribution methods in explainable AI rely on upsampling techniques that were designed for natural images, not saliency maps. Standard bilinear and bicubic interpolation systematically corrupts attribution signals through aliasing, ringing, and boundary bleeding, producing spurious high-importance regions that misrepresent model reasoning. We identify that the core issue is treating attribution upsampling as an interpolation problem that operates in isolation from the model's reasoning, rather than a mass redistribution problem where model-derived semantic boundaries must govern how importance flows. We present Universal Semantic-Aware Upsampling (USU), a principled method that reformulates upsampling through ratio-form mass redistribution operators, provably preserving attribution mass and relative importance ordering. Extending the axiomatic tradition of feature attribution to upsampling, we formalize four desiderata for faithful upsampling and prove that interpolation structurally violates three of them. These same three force any redistribution operator into a ratio form; the fourth selects the unique potential within this family, yielding USU. Controlled experiments on models with known attribution priors verify USU's formal guarantees; evaluation across ImageNet, CIFAR-10, and CUB-200 confirms consistent faithfulness improvements and qualitatively superior, semantically coherent explanations.
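The contrast between interpolation and redistribution can be illustrated with a toy 1D example. This is only a sketch of what a ratio-form operator could look like, not the USU operator itself: the abstract does not specify the potential, so `potential` here is a hypothetical non-negative per-pixel score (for instance derived from a segmentation), and `redistribute` is an illustrative name.

```python
def redistribute(coarse, potential, factor):
    """Upsample a 1D attribution map by splitting each coarse cell's mass.

    Each coarse cell covers `factor` fine pixels. A fine pixel receives the
    share potential[p] / sum(potential over its cell) of that cell's mass, so
    the total mass of every cell is preserved exactly. If the potential is
    zero across a cell, the mass is split evenly.
    """
    fine = []
    for c, mass in enumerate(coarse):
        block = potential[c * factor:(c + 1) * factor]
        total = sum(block)
        for p in block:
            fine.append(mass * p / total if total > 0 else mass / factor)
    return fine

# Hypothetical values: two coarse cells upsampled by a factor of 2.
coarse = [4.0, 2.0]
potential = [1.0, 3.0, 0.0, 0.0]
fine = redistribute(coarse, potential, factor=2)
print(fine, sum(fine) == sum(coarse))
```

Unlike bilinear interpolation, nothing leaks across cell boundaries in this sketch, and the sum of the fine map matches the sum of the coarse map by construction.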

[91] Volumetrically Consistent Implicit Atlas Learning via Neural Diffeomorphic Flow for Placenta MRI

Athena Taymourtash,S. Mazdak Abulnaga,Esra Abaci Turk,P. Ellen Grant,Polina Golland

Main category: cs.CV

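TL;DR: This paper proposes a volumetrically consistent implicit model that couples SDF reconstruction with neural diffeomorphic flow to achieve voxel-level registration and population template construction for placenta MRI, improving geometric fidelity and volumetric alignment.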

Details Motivation: Existing implicit registration methods rely mainly on supervision near the zero-level set and capture only surface correspondences, leaving interior deformations under-constrained and unsuitable for group-level volumetric analysis. Method: A volumetrically consistent implicit model jointly reconstructs signed distance functions (SDFs) and a neural diffeomorphic flow to learn a shared canonical template; Jacobian-determinant and biharmonic regularization suppress local folding and promote globally coherent deformations. Result: On in-vivo placenta MRI, the method improves geometric fidelity and volumetric alignment over surface-based implicit baselines and produces anatomically interpretable, topologically consistent flattening suitable for group analysis. Conclusion: The method extends implicit registration from surfaces to volumes and opens a route to organ-level population morphology analysis based on implicit representations. Abstract: Establishing dense volumetric correspondences across anatomical shapes is essential for group-level analysis but remains challenging for implicit neural representations. Most existing implicit registration methods rely on supervision near the zero-level set and thus capture only surface correspondences, leaving interior deformations under-constrained. We introduce a volumetrically consistent implicit model that couples reconstruction of signed distance functions (SDFs) with neural diffeomorphic flow to learn a shared canonical template of the placenta. Volumetric regularization, including Jacobian-determinant and biharmonic penalties, suppresses local folding and promotes globally coherent deformations. In the motivating application to placenta MRI, our formulation jointly reconstructs individual placentas, aligns them to a population-derived implicit template, and enables voxel-wise intensity mapping in a unified canonical space. Experiments on in-vivo placenta MRI scans demonstrate improved geometric fidelity and volumetric alignment over surface-based implicit baseline methods, yielding anatomically interpretable and topologically consistent flattening suitable for group analysis.

[92] Structured prototype regularization for synthetic-to-real driving scene parsing

Jiahe Fan,Xiao Ma,Sergey Vityazev,George Giakos,Shaolong Shu,Rui Fan

Main category: cs.CV

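TL;DR: This paper proposes an unsupervised domain adaptation framework that explicitly regularizes semantic feature structure (inter-class separation and intra-class compactness) and combines prototype learning, entropy-filtered pseudo labels, and pixel-level attention to improve semantic segmentation when transferring from synthetic data to real driving scenes.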

Details Motivation: Synthetic data reduces the cost of pixel-level annotation but introduces a synthetic-to-real domain gap; existing unsupervised domain adaptation methods focus on global feature alignment and overlook semantic structure, so relations among classes are insufficiently modeled and generalization suffers. Method: Class-specific prototypes regularize the semantic structure to enforce inter-class separation and intra-class compactness; an entropy-based noise filtering strategy improves pseudo-label reliability; and a pixel-level attention mechanism refines feature alignment. Result: On representative driving scene parsing benchmarks, the method consistently outperforms recent state-of-the-art methods. Conclusion: Preserving and explicitly modeling semantic feature structure is important for robust synthetic-to-real adaptation, and the framework improves driving scene parsing in real-world settings. Abstract: Driving scene parsing is critical for autonomous vehicles to operate reliably in complex real-world traffic environments. To reduce the reliance on costly pixel-level annotations, synthetic datasets with automatically generated labels have become a popular alternative. However, models trained on synthetic data often perform poorly when applied to real-world scenes due to the synthetic-to-real domain gap. Despite the success of unsupervised domain adaptation in narrowing this gap, most existing methods mainly focus on global feature alignment while overlooking the semantic structure of the feature space. As a result, semantic relations among classes are insufficiently modeled, limiting the model's ability to generalize. To address these challenges, this study introduces a novel unsupervised domain adaptation framework that explicitly regularizes semantic feature structures to significantly enhance driving scene parsing performance in real-world scenarios. Specifically, the proposed method enforces inter-class separation and intra-class compactness by leveraging class-specific prototypes, thereby enhancing the discriminability and structural coherence of feature clusters. An entropy-based noise filtering strategy improves the reliability of pseudo labels, while a pixel-level attention mechanism further refines feature alignment. Extensive experiments on representative benchmarks demonstrate that the proposed method consistently outperforms recent state-of-the-art methods. These results underscore the importance of preserving semantic structure for robust synthetic-to-real adaptation in driving scene parsing tasks.

[93] Interact3D: Compositional 3D Generation of Interactive Objects

Hui Shan,Keyang Luo,Ming Li,Sizhe Zheng,Yanwei Fu,Zhen Chen,Xiangru Huang

Main category: cs.CV

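TL;DR: Interact3D is a framework for generating physically plausible, interacting 3D compositional objects from a single image. It combines a unified 3D guidance scene, two-stage geometric composition, and a VLM-based closed-loop self-correction mechanism to improve geometric detail in occluded regions and the modeling of object-object spatial relationships.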

Details Motivation: Existing single-image 3D generation methods struggle under occlusion to preserve geometric detail in hidden regions and object-object spatial relationships (OOR), and so cannot produce physically plausible interacting 3D compositions. Method: Interact3D (1) uses generative priors to build a unified 3D guidance scene and produce high-quality individual assets; (2) composes them in two stages, anchoring the primary object through global-to-local geometric registration and integrating the remaining objects with differentiable SDF-based optimization that penalizes geometric intersections; and (3) adds a VLM-driven closed-loop agentic refinement that analyzes multi-view renderings, writes corrective prompts, and guides an image editing module to iterate. Result: Experiments show that Interact3D produces collision-aware 3D compositions with higher geometric fidelity and more consistent spatial relationships. Conclusion: Interact3D addresses the physical plausibility and structural consistency of 3D compositions generated from a single image under occlusion, offering a new approach to interactive 3D content creation. Abstract: Recent breakthroughs in 3D generation have enabled the synthesis of high-fidelity individual assets. However, generating 3D compositional objects from single images--particularly under occlusions--remains challenging. Existing methods often degrade geometric details in hidden regions and fail to preserve the underlying object-object spatial relationships (OOR). We present a novel framework Interact3D designed to generate physically plausible interacting 3D compositional objects. Our approach first leverages advanced generative priors to curate high-quality individual assets with a unified 3D guidance scene. To physically compose these assets, we then introduce a robust two-stage composition pipeline. Based on the 3D guidance scene, the primary object is anchored through precise global-to-local geometric alignment (registration), while subsequent geometries are integrated using a differentiable Signed Distance Field (SDF)-based optimization that explicitly penalizes geometry intersections. To reduce challenging collisions, we further deploy a closed-loop, agentic refinement strategy. A Vision-Language Model (VLM) autonomously analyzes multi-view renderings of the composed scene, formulates targeted corrective prompts, and guides an image editing module to iteratively self-correct the generation pipeline. Extensive experiments demonstrate that Interact3D successfully produces promising collision-aware compositions with improved geometric fidelity and consistent spatial relationships.

[94] Parallel In-context Learning for Large Vision Language Models

Shin'ya Yamaguchi,Daiki Chijiwa,Tamao Sakao,Taku Hasegawa

Main category: cs.CV

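TL;DR: This paper proposes Parallel-ICL, which processes chunked multimodal demonstrations in parallel and combines their predictions with a weighted ensemble, preserving multimodal in-context learning performance while substantially reducing inference latency.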

Details Motivation: Multimodal in-context learning (MM-ICL) in large vision-language models suffers from high inference latency caused by long contexts; the goal is to balance accuracy and efficiency. Method: Parallel-ICL splits the long demonstration context into several shorter chunks and processes them in parallel. Clustering-based chunking increases diversity across chunks, similarity-based compilation weights each chunk's prediction, and a weighted Product-of-Experts (PoE) fuses the results at the logit level. Result: On VQA, image captioning, and classification benchmarks, Parallel-ICL matches full-context MM-ICL while significantly improving inference speed. Conclusion: Parallel-ICL is a plug-and-play inference algorithm that eases the accuracy-efficiency trade-off in MM-ICL and supports dynamic task adaptation at low overhead. Abstract: Large vision-language models (LVLMs) employ multi-modal in-context learning (MM-ICL) to adapt to new tasks by leveraging demonstration examples. While increasing the number of demonstrations boosts performance, it incurs significant inference latency due to the quadratic computational cost of Transformer attention with respect to the context length. To address this trade-off, we propose Parallel In-Context Learning (Parallel-ICL), a plug-and-play inference algorithm. Parallel-ICL partitions the long demonstration context into multiple shorter, manageable chunks. It processes these chunks in parallel and integrates their predictions at the logit level, using a weighted Product-of-Experts (PoE) ensemble to approximate the full-context output. Guided by ensemble learning theory, we introduce principled strategies for Parallel-ICL: (i) clustering-based context chunking to maximize inter-chunk diversity and (ii) similarity-based context compilation to weight predictions by query relevance. Extensive experiments on VQA, image captioning, and classification benchmarks demonstrate that Parallel-ICL achieves performance comparable to full-context MM-ICL, while significantly improving inference speed. Our work offers an effective solution to the accuracy-efficiency trade-off in MM-ICL, enabling dynamic task adaptation with substantially reduced inference overhead.
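The logit-level fusion step can be sketched in a few lines. A weighted Product-of-Experts over softmax distributions, p(y) ∝ Π_k p_k(y)^{w_k}, reduces to a softmax of the weighted sum of logits, because each chunk's log-normalizer is constant across classes. The function names and the example weights below are hypothetical; how Parallel-ICL derives its weights from query similarity is not reproduced here.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def weighted_poe(chunk_logits, weights):
    """Fuse per-chunk next-token logits with a weighted Product-of-Experts.

    chunk_logits: one logit vector per demonstration chunk.
    weights: one non-negative weight per chunk (e.g. relevance to the query).
    """
    n = len(chunk_logits[0])
    combined = [sum(w * z[i] for w, z in zip(weights, chunk_logits))
                for i in range(n)]
    return softmax(combined)

# Hypothetical logits from two chunks over a three-token vocabulary.
logits = [[2.0, 0.0, 1.0], [0.5, 1.5, 0.0]]
weights = [0.7, 0.3]
print(weighted_poe(logits, weights))
```

Because the fusion only needs each chunk's final logits, the chunks can be run as independent forward passes, each with a much shorter context than the full demonstration set.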

[95] LICA: Layered Image Composition Annotations for Graphic Design Research

Elad Hirsch,Shubham Yadav,Mohit Garg,Purvanshi Mehta

Main category: cs.CV

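TL;DR: LICA is a large-scale dataset of layered image composition annotations containing more than 1.55 million multi-layer graphic design compositions. It represents design elements (text, image, vector, group, and so on) and their metadata in structured form and introduces graphic design video as a new task.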

Details Motivation: To advance structured understanding and generation of graphic layouts, address the limited structured modeling ability of current vision-language models in graphic design, and extend the problem to dynamic design. Method: The LICA dataset represents each design as layered, typed components with rich metadata (spatial geometry, typographic attributes, opacity, visibility, and more); it covers 20 design categories and more than 970,000 templates, and adds a graphic design video subset of more than 27,000 animated layouts with per-component keyframes and motion parameters. Result: LICA is a large-scale, highly structured graphic design dataset that supports new tasks such as layer-aware inpainting, structured layout generation, controlled editing, and temporally aware generation, and encourages models that operate on design structure rather than pixels alone. Conclusion: Beyond its scale, LICA establishes a design-structure-centered research paradigm, providing data and task foundations for AI research in graphic design. Abstract: We introduce LICA (Layered Image Composition Annotations), a large-scale dataset of 1,550,244 multi-layer graphic design compositions designed to advance structured understanding and generation of graphic layouts. In addition to rendered PNG images, LICA represents each design as a hierarchical composition of typed components including text, image, vector, and group elements, each paired with rich per-element metadata such as spatial geometry, typographic attributes, opacity, and visibility. The dataset spans 20 design categories and 971,850 unique templates, providing broad coverage of real-world design structures. We further introduce graphic design video as a new and largely unexplored challenge for current vision-language models through 27,261 animated layouts annotated with per-component keyframes and motion parameters. Beyond scale, LICA establishes a new paradigm of research tasks for graphic design, enabling structured investigations into problems such as layer-aware inpainting, structured layout generation, controlled design editing, and temporally-aware generative modeling. By representing design as a system of compositional layers and relationships, the dataset supports research on models that operate directly on design structure rather than pixels alone.

[96] OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder

Sensen Gao,Zhaoqing Wang,Qihang Cao,Dongdong Yu,Changhu Wang,Tongliang Liu,Mingming Gong,Jiawang Bian

Main category: cs.CV

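TL;DR: OneWorld is a 3D scene generation framework that performs diffusion directly in a unified 3D latent space. Using the 3D-URAE encoder, a cross-view correspondence loss, and Manifold-Drift Forcing, it improves the geometric and appearance consistency of generated scenes.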

Details Motivation: Existing diffusion methods that operate in 2D latent spaces struggle to keep appearance and geometry consistent across views in 3D scene generation. Method: OneWorld (1) builds a 3D Unified Representation Autoencoder (3D-URAE) that takes the geometric capability of pretrained 3D foundation models and injects appearance and semantics; (2) adds a token-level Cross-View-Correspondence (CVC) consistency loss to strengthen structural alignment across views; and (3) introduces Manifold-Drift Forcing (MDF) to reduce train-inference mismatch and shape a robust 3D manifold. Result: In experiments, 3D scenes generated by OneWorld surpass state-of-the-art 2D-latent methods in both quality and cross-view consistency. Conclusion: Diffusion in a unified 3D latent space is an effective way to improve the consistency of 3D generation; 3D-URAE, the CVC loss, and MDF together form a scalable, robust approach to 3D generation. Abstract: Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE); it leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF) to mitigate train-inference exposure bias and shape a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods. Our code will be available at https://github.com/SensenGao/OneWorld.

[97] Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP

Jonas Herzog,Yue Wang

Main category: cs.CV

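TL;DR: This paper questions the prevailing hypothesis that cross-modal models such as CLIP underperform on image-only tasks because of "intra-modal misalignment". Theoretical analysis and experiments show that the hypothesis lacks support and that task ambiguity is the key factor affecting performance.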

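Details Motivation: Recent work argues that CLIP-like contrastive models optimize only language-image alignment and ignore image-image alignment, leaving distances between image embeddings poorly calibrated; this paper tests whether that "intra-modal misalignment" hypothesis holds. Method: Theoretically, it analyzes whether image embedding distances have the supposed degrees of freedom; empirically, it compares language-image trained models (CLIP, SigLIP) with image-image trained models (DINO, SigLIP2) on the indicators used to support the hypothesis and on downstream tasks (retrieval, few-shot classification). Result: The theory shows that image embedding distances have no such extra degrees of freedom; the indicators give similar results for both model families; and downstream gains come mainly from resolving task ambiguity rather than from correcting the supposed misalignment. Conclusion: "Intra-modal misalignment" is not the root cause of the limited image-task performance of CLIP-like models; attention should go to clear task definition and evaluation rather than to reshaping the embedding space. Abstract: Recent research suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks. The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated distances between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics affected. For the theoretical argument, we demonstrate that there are no such supposed degrees of freedom for image embedding distances. For the empirical measures, our findings reveal they yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2). This indicates the observed phenomena do not stem from a misalignment specific to the former. Experiments on the commonly studied intra-modal tasks retrieval and few-shot classification confirm that addressing task ambiguity, not supposed misalignment, is key for best results.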

[98] NanoGS: Training-Free Gaussian Splat Simplification

Butian Xiong,Rong Liu,Tiantian Zhou,Meida Chen,Zhiwen Fan,Andrew Feng

Main category: cs.CV

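TL;DR: NanoGS is a training-free, lightweight framework for simplifying 3D Gaussian Splat models. It performs local pairwise merging over a sparse spatial graph and scores merge quality, substantially reducing the number of Gaussian primitives while preserving high rendering fidelity.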

Details Motivation: Existing 3D Gaussian Splat compression methods rely on GPU-intensive post-training optimization and calibrated images, which limits deployment; a training-free, efficient, directly deployable simplification scheme is needed. Method: Simplification is formulated as local pairwise merging of Gaussians over a sparse spatial graph; a pair of Gaussians is approximated by a single Gaussian through mass-preserving moment matching, and a principled merge cost is defined from the discrepancy between the original mixture and its approximation; compatible merge pairs are selected efficiently within local neighborhoods only. Result: NanoGS runs efficiently on CPU, operates directly on existing 3DGS models, and greatly reduces the number of Gaussians (e.g., from millions to tens of thousands) while preserving structure and appearance and remaining compatible with standard rendering pipelines. Conclusion: NanoGS offers a practical, efficient, plug-and-play simplification scheme for 3D Gaussian Splats that removes the dependence on training and specialized hardware, supporting the use of 3DGS in resource-constrained settings. Abstract: 3D Gaussian Splat (3DGS) enables high-fidelity, real-time novel view synthesis by representing scenes with large sets of anisotropic primitives, but often requires millions of Splats, incurring significant storage and transmission costs. Most existing compression methods rely on GPU-intensive post-training optimization with calibrated images, limiting practical deployment. We introduce NanoGS, a training-free and lightweight framework for Gaussian Splat simplification. Instead of relying on image-based rendering supervision, NanoGS formulates simplification as local pairwise merging over a sparse spatial graph. The method approximates a pair of Gaussians with a single primitive using mass-preserving moment matching and evaluates merge quality through a principled merge cost between the original mixture and its approximation. By restricting merge candidates to local neighborhoods and selecting compatible pairs efficiently, NanoGS produces compact Gaussian representations while preserving scene structure and appearance. NanoGS operates directly on existing Gaussian Splat models, runs efficiently on CPU, and preserves the standard 3DGS parameterization, enabling seamless integration with existing rendering pipelines. Experiments demonstrate that NanoGS substantially reduces primitive count while maintaining high rendering fidelity, providing an efficient and practical solution for Gaussian Splat simplification. Our project website is available at https://saliteta.github.io/NanoGS/.
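
The moment-matching step mentioned in the NanoGS abstract can be illustrated with the textbook mixture-reduction formula for two weighted Gaussians. This is only a minimal sketch: the abstract does not say how NanoGS defines "mass" (here it is assumed to be a plain scalar weight), and the function name `merge_gaussians` is hypothetical rather than taken from the project's code. The merge cost used by NanoGS is not reproduced.

```python
import numpy as np

def merge_gaussians(w1, mu1, cov1, w2, mu2, cov2):
    """Merge two weighted 3D Gaussians into one by matching moments.

    Assumption: 'mass' is a scalar weight per Gaussian. The merged Gaussian
    keeps the total mass and the mean and covariance of the two-component mixture.
    """
    w = w1 + w2                      # total mass is preserved
    a1, a2 = w1 / w, w2 / w          # normalized mixture weights
    mu = a1 * mu1 + a2 * mu2         # first moment of the mixture
    d1 = (mu1 - mu)[:, None]         # offsets of each mean from the merged mean
    d2 = (mu2 - mu)[:, None]
    # Second central moment: within-component covariance plus spread of the means.
    cov = a1 * (cov1 + d1 @ d1.T) + a2 * (cov2 + d2 @ d2.T)
    return w, mu, cov

# Two unit-covariance Gaussians, 4 units apart along x, with masses 1 and 3.
w, mu, cov = merge_gaussians(
    1.0, np.zeros(3), np.eye(3),
    3.0, np.array([4.0, 0.0, 0.0]), np.eye(3),
)
```

In this toy case the merged mean lies three quarters of the way toward the heavier Gaussian, and the covariance grows only along the axis that separates the two means.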

[99] PathGLS: Evaluating Pathology Vision-Language Models without Ground Truth through Multi-Dimensional Consistency

Minbing Chen,Zhu Meng,Fei Su

Main category: cs.CV

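TL;DR: This paper proposes PathGLS, a reference-free evaluation framework for vision-language models (VLMs) in computational pathology. It quantifies VLM performance along three dimensions (Grounding, Logic, and Stability), is particularly good at detecting subtle errors such as hallucinations, and clearly outperforms BERTScore and LLM-based baselines on several pathology datasets.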

Details Motivation: Clinical adoption of VLMs in pathology is limited by the lack of reliable, automated evaluation metrics that can identify subtle failure modes such as hallucinations. Method: The PathGLS framework evaluates pathology VLMs along three dimensions: (1) Grounding, fine-grained visual-text alignment; (2) Logic, entailment-graph consistency based on natural language inference (NLI); and (3) Stability, output variance under adversarial visual-semantic perturbations. It supports patch-level and whole-slide image (WSI)-level analysis and produces a comprehensive trust score. Result: On Quilt-1M, PathGLS shows a 40.2% sensitivity drop for hallucinated reports, far larger than the 2.1% of BERTScore; validation against expert-defined clinical error hierarchies gives a Spearman correlation of ρ=0.71 (p<0.0001), well above Gemini 3.0 Pro (ρ=0.39). Conclusion: PathGLS is a robust, reference-free VLM evaluation method that directly quantifies hallucination rates and robustness to domain shift, making it suitable for benchmarking on private clinical datasets and for informing safe deployment decisions. Abstract: Vision-Language Models (VLMs) offer significant potential in computational pathology by enabling interpretable image analysis, automated reporting, and scalable decision support. However, their widespread clinical adoption remains limited due to the absence of reliable, automated evaluation metrics capable of identifying subtle failures such as hallucinations. To address this gap, we propose PathGLS, a novel reference-free evaluation framework that assesses pathology VLMs across three dimensions: Grounding (fine-grained visual-text alignment), Logic (entailment graph consistency using Natural Language Inference), and Stability (output variance under adversarial visual-semantic perturbations). PathGLS supports both patch-level and whole-slide image (WSI)-level analysis, yielding a comprehensive trust score. Experiments on Quilt-1M, TCGA, REG2025, PathMMU and TCGA-Sarcoma datasets demonstrate the superiority of PathGLS. Specifically, on the Quilt-1M dataset, PathGLS reveals a steep sensitivity drop of 40.2% for hallucinated reports compared to only 2.1% for BERTScore. Moreover, validation against expert-defined clinical error hierarchies reveals that PathGLS achieves a strong Spearman's rank correlation of $ρ=0.71$ ($p < 0.0001$), significantly outperforming Large Language Model (LLM)-based approaches (Gemini 3.0 Pro: $ρ=0.39$, $p < 0.0001$). These results establish PathGLS as a robust reference-free metric. By directly quantifying hallucination rates and domain shift robustness, it serves as a reliable criterion for benchmarking VLMs on private clinical datasets and informing safe deployment. Code can be found at: https://github.com/My13ad/PathGLS

[100] Out-of-Distribution Object Detection in Street Scenes via Synthetic Outlier Exposure and Transfer Learning

Sadia Ilyas,Annika Mütze,Klaus Friedrichs,Thomas Kurbiel,Matthias Rottmann

Main category: cs.CV

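TL;DR: This paper proposes SynOE-OD, a framework that uses Stable Diffusion and open-vocabulary detectors to generate semantically plausible outlier samples, improving an object detector's ability to detect out-of-distribution (OOD) objects and enabling unified detection of ID and OOD objects.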

Details Motivation: Existing OOD object detection methods rely on complex architectures or auxiliary branches, lack a framework that handles ID and OOD objects in a unified way, and tend to mistake OOD objects for background. Method: The SynOE-OD framework uses Stable Diffusion to generate object-level synthetic outliers, builds the training data with open-vocabulary detectors (such as GroundingDINO), and applies transfer learning to strengthen both ID task performance and OOD detection robustness. Result: The method reaches state-of-the-art average precision on a standard OOD object detection benchmark and clearly outperforms OVOD models such as GroundingDINO at zero-shot detection of OOD objects in street scenes. Conclusion: SynOE-OD lets a single model detect both ID and OOD objects, addresses the problem of OOD objects being missed entirely, and supports the effectiveness of synthetic outlier exposure for OOD detection. Abstract: Out-of-distribution (OOD) object detection is an important yet underexplored task. A reliable object detector should be able to handle OOD objects by localizing and correctly classifying them as OOD. However, a critical issue arises when such atypical objects are completely missed by the object detector and incorrectly treated as background. Existing OOD detection approaches in object detection often rely on complex architectures or auxiliary branches and typically do not provide a framework that treats in-distribution (ID) and OOD in a unified way. In this work, we address these limitations by enabling a single detector to detect OOD objects that are otherwise silently overlooked, alongside ID objects. We present SynOE-OD, a Synthetic Outlier-Exposure-based Object Detection framework that leverages strong generative models, like Stable Diffusion, and Open-Vocabulary Object Detectors (OVODs) to generate semantically meaningful, object-level data that serve as outliers during training. The generated data is used for transfer learning to establish strong ID task performance and supplement detection models with OOD object detection robustness. Our approach achieves state-of-the-art average precision on an established OOD object detection benchmark, where OVODs, such as GroundingDINO, show limited zero-shot performance in detecting OOD objects in street scenes.

[101] Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting

Da Zhang,Bingyu Li,Feiyu Wang,Zhiyuan Zhao,Junyu Gao

Main category: cs.CV

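TL;DR: This paper proposes QICA, a framework that combines quantity perception with robust spatial aggregation to address insufficient quantity awareness, spatial insensitivity, and feature distortion in zero-shot object counting (ZSOC).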

Details Motivation: Existing zero-shot object counting methods treat counting as a coarse retrieval task, lack fine-grained quantity awareness, and suffer from spatial insensitivity and from feature space distortion introduced by model adaptation. Method: The QICA framework comprises (1) a Synergistic Prompting Strategy (SPS) with numerically conditioned prompts that bridges semantic recognition and quantitative reasoning; (2) a Cost Aggregation Decoder (CAD) that operates directly on vision-text similarity maps to mitigate feature distortion; and (3) a multi-level quantity alignment loss (L_MQA) that enforces numerical consistency. Result: QICA achieves competitive performance on FSC-147, and zero-shot evaluation on CARPK and ShanghaiTech-A confirms superior generalization to unseen domains. Conclusion: By jointly modeling quantity perception and spatial aggregation, QICA improves both the accuracy and the cross-domain generalization of zero-shot object counting. Abstract: Zero-shot object counting (ZSOC) aims to enumerate objects of arbitrary categories specified by text descriptions without requiring visual exemplars. However, existing methods often treat counting as a coarse retrieval task, suffering from a lack of fine-grained quantity awareness. Furthermore, they frequently exhibit spatial insensitivity and degraded generalization due to feature space distortion during model adaptation. To address these challenges, we present QICA, a novel framework that synergizes quantity perception with robust spatial cost aggregation. Specifically, we introduce a Synergistic Prompting Strategy (SPS) that adapts vision and language encoders through numerically conditioned prompts, bridging the gap between semantic recognition and quantitative reasoning. To mitigate feature distortion, we propose a Cost Aggregation Decoder (CAD) that operates directly on vision-text similarity maps. By refining these maps through spatial aggregation, CAD prevents overfitting while preserving zero-shot transferability. Additionally, a multi-level quantity alignment loss ($\mathcal{L}_{MQA}$) is employed to enforce numerical consistency across the entire pipeline. Extensive experiments on FSC-147 demonstrate competitive performance, while zero-shot evaluation on CARPK and ShanghaiTech-A validates superior generalization to unseen domains.

[102] EPOFusion: Exposure aware Progressive Optimization Method for Infrared and Visible Image Fusion

Zhiwei Wang,Yayu Zheng,Defeng He,Li Zhao,Xiaoqin Zhang,Yuxing Li,Edmund Y. Lam

Main category: cs.CV

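TL;DR: This paper proposes EPOFusion, an infrared and visible image fusion method designed for overexposed scenes. With an exposure-aware mechanism, a guidance module, an iterative decoder, and an adaptive loss function, it preserves infrared features in overexposed regions while improving overall fusion quality, and it introduces IVOE, the first high-quality overexposure dataset.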

Details Motivation: Existing infrared and visible fusion methods perform poorly in highly bright (overexposed) regions, losing critical visual information; a fusion model that handles overexposed scenes effectively is needed. Method: The EPOFusion model (1) adds a guidance module that helps the encoder extract fine-grained infrared features from overexposed regions; (2) builds an iterative decoder with a multiscale context fusion module to progressively enhance the fused image; (3) uses a dynamic adaptive loss function to balance the contribution of each modality under different exposure conditions; and (4) constructs IVOE, the first infrared-visible overexposure dataset, with infrared-guided annotations. Result: EPOFusion outperforms existing methods on multiple metrics and in visual quality; it retains infrared cues in overexposed regions, achieves high-fidelity fusion in non-overexposed regions, and improves downstream task performance. Conclusion: EPOFusion is an effective exposure-aware fusion framework that addresses information loss in overexposed scenes, and the IVOE dataset provides important support for research in this direction. Abstract: Overexposure frequently occurs in practical scenarios, causing the loss of critical visual information. However, existing infrared and visible fusion methods still exhibit unsatisfactory performance in highly bright regions. To address this, we propose EPOFusion, an exposure-aware fusion model. Specifically, a guidance module is introduced to facilitate the encoder in extracting fine-grained infrared features from overexposed regions. Meanwhile, an iterative decoder incorporating a multiscale context fusion module is designed to progressively enhance the fused image, ensuring consistent details and superior visual quality. Finally, an adaptive loss function dynamically constrains the fusion process, enabling an effective balance between the modalities under varying exposure conditions. To achieve better exposure awareness, we construct the first infrared and visible overexposure dataset (IVOE) with high-quality infrared-guided annotations for overexposed regions. Extensive experiments show that EPOFusion outperforms existing methods. It maintains infrared cues in overexposed regions while achieving visually faithful fusion in non-overexposed areas, thereby enhancing both visual fidelity and downstream task performance. Code, fusion results and IVOE dataset will be made available at https://github.com/warren-wzw/EPOFusion.git.

[103] DualPrim: Compact 3D Reconstruction with Positive and Negative Primitives

Xiaoxu Meng,Zhongmin Chen,Bo Yang,Weikai Chen,Weixiao Liu,Lin Gao

Main category: cs.CV

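TL;DR: DualPrim is a new 3D reconstruction framework that models shapes through additive and subtractive combinations of positive and negative superquadrics. It improves structured expressiveness while remaining compact and differentiable, and supports end-to-end learning from multi-view images and high-quality mesh export.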

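Details Motivation: Existing neural reconstruction methods often sacrifice structure (such as regular topology and part boundaries) for geometric fidelity, producing outputs that are hard to edit, animate, or reuse. Method: The DualPrim framework uses positive superquadrics to build the main body and negative superquadrics to carve out local volumes through a differentiable operator, giving topology-aware additive-subtractive modeling; it is embedded in a volumetric differentiable renderer for end-to-end training from multi-view images, and meshes are exported seamlessly via a closed-form boolean difference. Result: DualPrim reaches state-of-the-art accuracy while producing compact, clearly structured, semantically interpretable 3D models, clearly outperforming additive-only implicit or primitive methods. Conclusion: An additive-subtractive superquadric representation strengthens the structural expressiveness and downstream usefulness of neural reconstruction without giving up compactness or differentiability. Abstract: Neural reconstructions often trade structure for fidelity, yielding dense and unstructured meshes with irregular topology and weak part boundaries that hinder editing, animation, and downstream asset reuse. We present DualPrim, a compact and structured 3D reconstruction framework. Unlike additive-only implicit or primitive methods, DualPrim represents shapes with positive and negative superquadrics: the former builds the bases while the latter carves local volumes through a differentiable operator, enabling topology-aware modeling of holes and concavities. This additive-subtractive design increases the representational power without sacrificing compactness or differentiability. We embed DualPrim in a volumetric differentiable renderer, enabling end-to-end learning from multi-view images and seamless mesh export via closed-form boolean difference. Empirically, DualPrim delivers state-of-the-art accuracy and produces compact, structured, and interpretable outputs that better satisfy downstream needs than additive-only alternatives.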
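
To make the additive-subtractive idea concrete, the sketch below evaluates the standard superquadric inside-outside function and combines one positive and one negative primitive into a soft occupancy. It is an illustration under stated assumptions, not DualPrim's actual operator: the abstract does not specify the differentiable carving operator, so the product `pos * (1 - neg)`, the sigmoid sharpness, the axis-aligned primitives, and the names `superquadric_F` and `soft_occ` are all hypothetical.

```python
import numpy as np

def superquadric_F(p, scale, eps):
    """Standard superquadric inside-outside function (F < 1 means inside).

    p: (N, 3) points in the primitive's local frame; scale: (3,) semi-axes;
    eps: (e1, e2) shape exponents. Rotation is omitted for brevity.
    """
    x, y, z = np.abs(p / scale).T
    e1, e2 = eps
    return (x ** (2 / e2) + y ** (2 / e2)) ** (e2 / e1) + z ** (2 / e1)

def soft_occ(p, center, scale, eps, sharp=20.0):
    """Smooth occupancy in (0, 1); the sigmoid relaxation is an assumption."""
    F = superquadric_F(p - center, scale, eps)
    arg = np.clip(sharp * (1.0 - F), -50.0, 50.0)  # clip to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-arg))

points = np.array([[0.0, 0.0, 0.0],   # inside the positive primitive only
                   [0.5, 0.0, 0.0],   # inside both, so it should be carved away
                   [2.0, 0.0, 0.0]])  # outside everything

pos = soft_occ(points, np.zeros(3), np.ones(3), (1.0, 1.0))
neg = soft_occ(points, np.array([0.5, 0.0, 0.0]), 0.4 * np.ones(3), (1.0, 1.0))
occ = pos * (1.0 - neg)  # hypothetical differentiable "positive minus negative"
```

With both exponents set to 1 the primitives are ellipsoids; other exponent values give the boxier or pinched shapes that make superquadrics compact part descriptors.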

[104] When Generative Augmentation Hurts: A Benchmark Study of GAN and Diffusion Models for Bias Correction in AI Classification Systems

Shesh Narayan Gupta,Nik Bear Brown

Main category: cs.CV

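TL;DR: On a fine-grained animal classification task, this paper compares three data augmentation strategies: traditional transforms, FastGAN, and Stable Diffusion fine-tuned with LoRA. FastGAN not only performs poorly at very small sample sizes but also increases classifier bias, whereas Stable Diffusion with LoRA works best. The study attributes the GAN failure in the low-data regime to behavior consistent with mode collapse and tentatively places the threshold below which it becomes harmful between 20 and 50 images per class.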

Details Motivation: Generative models are often used to mitigate class imbalance in AI training, but their failure modes under low-data conditions are unclear and call for a systematic benchmark. Method: Eight breeds of the Oxford-IIIT Pet Dataset are artificially made rare, and three strategies are compared: traditional augmentation, FastGAN, and Stable Diffusion 1.5 fine-tuned with LoRA; macro F1, the bias gap, and t-SNE feature visualization are used for quantitative and qualitative analysis. Result: At very small sample sizes FastGAN significantly widens the bias gap (+20.7%, Cohen's d = +5.03, p = 0.013), and t-SNE shows its generated images forming isolated clusters far from the real distribution; Stable Diffusion with LoRA achieves the highest macro F1 (0.9125 ± 0.0047) and reduces the bias gap by 13.1%; the harmful-sample threshold is tentatively located between 20 and 50 images per class. Conclusion: GAN-based generators such as FastGAN can degrade fairness and generalization in extremely low-data settings, likely because of mode collapse, while diffusion models with lightweight fine-tuning (LoRA) are more robust; generative augmentation methods should be chosen with care in low-resource settings, and the critical sample size should be validated across domains. Abstract: Generative models are widely used to compensate for class imbalance in AI training pipelines, yet their failure modes under low-data conditions are poorly understood. This paper reports a controlled benchmark comparing three augmentation strategies applied to a fine-grained animal classification task: traditional transforms, FastGAN, and Stable Diffusion 1.5 fine-tuned with Low-Rank Adaptation (LoRA). Using the Oxford-IIIT Pet Dataset with eight artificially underrepresented breeds, we find that FastGAN augmentation does not merely underperform at very low training set sizes but actively increases classifier bias, with a statistically significant large effect across three random seeds (bias gap increase: +20.7%, Cohen's d = +5.03, p = 0.013). The effect size here is large enough to give confidence in the direction of the finding despite the small number of seeds. Feature embedding analysis using t-distributed Stochastic Neighbor Embedding reveals that FastGAN images for severe-minority breeds form tight isolated clusters outside the real image distribution, a pattern consistent with mode collapse. Stable Diffusion with Low-Rank Adaptation produced the best results overall, achieving the highest macro F1 (0.9125 ± 0.0047) and a 13.1% reduction in the bias gap relative to the unaugmented baseline. The data suggest a sample-size boundary somewhere between 20 and 50 training images per class below which GAN augmentation becomes harmful in this setting, though further work across additional domains is needed to establish where that boundary sits more precisely. All experiments run on a consumer-grade GPU with 6 to 8 GB of memory, with no cloud compute required.
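
For readers unfamiliar with the effect-size statistic quoted above, the following sketch shows one common way to compute Cohen's d from two small groups of per-seed measurements, using the pooled sample standard deviation. Whether the paper used this pooled form or a paired variant is not stated in the abstract, so this is an assumption, and the numbers below are made-up placeholders rather than the paper's data.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d with the pooled (unbiased) standard deviation of two groups."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances (ddof=1).
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Placeholder bias-gap values for three seeds per condition (not from the paper).
gap_with_augmentation = [3.0, 4.0, 5.0]
gap_baseline = [1.0, 2.0, 3.0]
d = cohens_d(gap_with_augmentation, gap_baseline)
```

With only three seeds per condition the standard deviations are themselves noisy, which is why the abstract leans on the direction of the effect rather than its exact magnitude.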

[105] Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

Peng Sun,Jun Xie,Tao Lin

Main category: cs.CV

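TL;DR: This paper proposes IOMM, a two-stage training framework that pre-trains the visual generation component using only unlabeled image data, significantly improving the training efficiency and performance of unified multimodal models (UMMs).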

Details Motivation: Pre-training of the visual generation component in existing UMMs depends on inefficient paradigms and scarce high-quality text-image pairs, which is the main bottleneck. Method: Image-Only Training for UMMs (IOMM): the first stage pre-trains the visual generation module using only abundant unlabeled images; the second stage fine-tunes on a mixture of unlabeled images and a small set of text-image pairs to improve instruction alignment and generation quality. Result: The IOMM-B (3.6B) model is trained from scratch in only about 1050 H800 GPU hours and reaches 0.89 on GenEval and 0.55 on WISE, surpassing strong baselines such as BAGEL-7B and BLIP3-o-4B. Conclusion: IOMM alleviates the data and compute bottlenecks, greatly improving training efficiency while maintaining high performance, and offers a new paradigm for scalable UMM pre-training. Abstract: Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework. The first stage pre-trains the visual generative component exclusively using abundant unlabeled image-only data, thereby removing the dependency on paired data for this costly phase. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only ~1050 H800 GPU hours (with the vast majority, 1000 hours, dedicated to the efficient image-only pre-training stage). It achieves 0.89 on GenEval and 0.55 on WISE, surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50). Code is available at https://github.com/LINs-lab/IOMM.

[106] EFF-Grasp: Energy-Field Flow Matching for Physics-Aware Dexterous Grasp Generation

Yukun Zhao,Zichen Zhong,Yongshun Gong,Yilong Yin,Haoliang Sun

Main category: cs.CV

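TL;DR: This paper proposes EFF-Grasp, a physics-aware dexterous grasp generation framework based on Flow Matching. It models grasp synthesis as a deterministic ODE process and adds a training-free physical energy guidance strategy, markedly improving grasp quality and physical feasibility while greatly reducing the number of sampling steps.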

Details Motivation: Existing diffusion-based grasp generation methods rely on stochastic differential equations (SDEs), which need many sampling steps, have unstable trajectories, and tend to produce physically infeasible grasps. Method: The Flow-Matching-based EFF-Grasp framework (1) models grasp generation as a deterministic ODE and (2) adds a training-free physical energy guidance strategy that defines the target distribution with explicit physical energy functions and estimates the guidance term with a local Monte Carlo approximation. Result: On five benchmark datasets, EFF-Grasp outperforms diffusion baselines in grasp quality and physical feasibility while using substantially fewer sampling steps. Conclusion: Combining the flow-matching paradigm with physical energy guidance yields efficient, stable, and physically feasible dexterous grasp generation and offers a new direction for generative grasp modeling. Abstract: Denoising generative models have recently become the dominant paradigm for dexterous grasp generation, owing to their ability to model complex grasp distributions from large-scale data. However, existing diffusion-based methods typically formulate generation as a stochastic differential equation (SDE), which often requires many sequential denoising steps and introduces trajectory instability that can lead to physically infeasible grasps. In this paper, we propose EFF-Grasp, a novel Flow-Matching-based framework for physics-aware dexterous grasp generation. Specifically, we reformulate grasp synthesis as a deterministic ordinary differential equation (ODE) process, which enables efficient and stable generation through smooth probability flows. To further enforce physical feasibility, we introduce a training-free physics-aware energy guidance strategy. Our method defines an energy-guided target distribution using adapted explicit physical energy functions that capture key grasp constraints, and estimates the corresponding guidance term via a local Monte Carlo approximation during inference. In this way, EFF-Grasp dynamically steers the generation trajectory toward physically feasible regions without requiring additional physics-based training or simulation feedback. Extensive experiments on five benchmark datasets show that EFF-Grasp achieves superior performance in grasp quality and physical feasibility, while requiring substantially fewer sampling steps than diffusion-based baselines.

[107] GATS: Gaussian Aware Temporal Scaling Transformer for Invariant 4D Spatio-Temporal Point Cloud Representation

Jiayi Tian,Jiaze Wang

Main category: cs.CV

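TL;DR: This paper proposes GATS, a dual-invariant framework that uses a UGGC module to handle inconsistent point cloud distributions and a TSA module to handle temporal scale bias, significantly improving the accuracy, robustness, and scalability of 4D point cloud video understanding.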

Details Motivation: 4D point cloud video understanding faces two main challenges, temporal scale bias (varying frame rates) and distributional uncertainty in irregular point clouds; existing CNN or Transformer methods are limited by receptive field or computational complexity and ignore these implicit distortions. Method: The Gaussian Aware Temporal Scaling (GATS) framework contains two complementary modules, Uncertainty Guided Gaussian Convolution (UGGC) and Temporal Scaling Attention (TSA): UGGC combines local Gaussian statistics with uncertainty-aware gating to make neighborhood aggregation more robust; TSA introduces a learnable scaling factor that normalizes temporal distances, ensuring frame partition invariance and consistent velocity estimation across frame rates. Result: GATS clearly surpasses existing methods on MSR-Action3D (+6.62% accuracy), NTU RGBD (+1.4% accuracy), and Synthia4D (+1.8% mIoU), and is more efficient, accurate, robust, and scalable than Transformer-based methods. Conclusion: GATS provides a more efficient, principled, dual-invariant paradigm for 4D point cloud video understanding that overcomes the challenges posed by temporal scale bias and distributional uncertainty. Abstract: Understanding 4D point cloud videos is essential for enabling intelligent agents to perceive dynamic environments. However, temporal scale bias across varying frame rates and distributional uncertainty in irregular point clouds make it highly challenging to design a unified and robust 4D backbone. Existing CNN or Transformer based methods are constrained either by limited receptive fields or by quadratic computational complexity, while neglecting these implicit distortions. To address this problem, we propose a novel dual invariant framework, termed Gaussian Aware Temporal Scaling (GATS), which explicitly resolves both distributional inconsistencies and temporal scale bias. The proposed Uncertainty Guided Gaussian Convolution (UGGC) incorporates local Gaussian statistics and uncertainty aware gating into point convolution, thereby achieving robust neighborhood aggregation under density variation, noise, and occlusion. In parallel, the Temporal Scaling Attention (TSA) introduces a learnable scaling factor to normalize temporal distances, ensuring frame partition invariance and consistent velocity estimation across different frame rates. These two modules are complementary: temporal scaling normalizes time intervals prior to Gaussian estimation, while Gaussian modeling enhances robustness to irregular distributions. Our experiments on mainstream benchmarks MSR-Action3D (+6.62% accuracy), NTU RGBD (+1.4% accuracy), and Synthia4D (+1.8% mIoU) demonstrate significant performance gains, offering a more efficient and principled paradigm for invariant 4D point cloud video understanding with superior accuracy, robustness, and scalability compared to Transformer based counterparts.

[108] AI-Generated Figures in Academic Publishing: Policies, Tools, and Practical Guidelines

Davie Chen

Main category: cs.CV

TL;DR: This paper surveys AI-generated figure policies across major publishers and proposes best practices for ethical, transparent use in scientific publishing.

Details Motivation: The rapid advancement of generative AI has enabled high-quality scientific figure generation, but inconsistent and ambiguous publisher policies create uncertainty for researchers. Method: A policy survey of major journals and publishers (e.g., Nature, Science, Cell Press, Elsevier, PLOS), complemented by analysis of practical examples from AI tools like SciDraw. Result: Identified key publisher concerns—reproducibility, authorship attribution, and visual misinformation—and derived actionable best-practice guidelines. Conclusion: AI-generated figures can accelerate scientific communication without compromising integrity, provided they are used with appropriate disclosure and quality control. Abstract: The rapid advancement of generative AI has introduced a new class of tools capable of producing publication-quality scientific figures, graphical abstracts, and data visualizations. However, academic publishers have responded with inconsistent and often ambiguous policies regarding AI-generated imagery. This paper surveys the current stance of major journals and publishers -- including Nature, Science, Cell Press, Elsevier, and PLOS -- on the use of AI-generated figures. We identify key concerns raised by publishers, including reproducibility, authorship attribution, and potential for visual misinformation. Drawing on practical examples from tools such as SciDraw, an AI-powered platform designed specifically for scientific illustration, we propose a set of best-practice guidelines for researchers seeking to use AI figure-generation tools in a compliant and transparent manner. Our findings suggest that, with appropriate disclosure and quality control, AI-generated figures can meaningfully accelerate scientific communication without compromising integrity.

[109] Segmentation-before-Staining Improves Structural Fidelity in Virtual IHC-to-Multiplex IF Translation

Junhyeok Lee,Han Jang,Heeseong Eum,Joon Jang,Kyu Sung Choi

Main category: cs.CV

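TL;DR: This paper proposes a supervision-free, architecture-agnostic virtual staining method. It feeds a continuous cell probability map from a pretrained nuclei segmentation foundation model as an input prior and adds a variance-preserving regularization term, improving nuclear morphology fidelity and quantification accuracy in synthesized multiplex immunofluorescence images.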

Details Motivation: Existing virtual staining methods optimize only pixel-level fidelity and ignore constraints on nuclear morphology, causing clinically meaningful errors in key metrics such as the Ki67 proliferation index. Method: A continuous cell probability map produced by a pretrained nuclei segmentation model is injected as a soft input prior, and a variance-preserving regularization term maintains cell-level intensity heterogeneity; the strategy needs no supervision and is compatible with several generative architectures (Pix2Pix, U-Net, ResNet, diffusion models). Result: On two independent datasets, all tested generative models show consistent gains in nuclei count accuracy and perceptual quality, obtained by adding only the prior and the regularization term. Conclusion: The proposed supervision-free, architecture-agnostic conditioning strategy can substantially improve the reliability of virtual staining for quantitative pathology and moves it closer to routine clinical use. Abstract: Multiplex immunofluorescence (mIF) enables simultaneous single-cell quantification of multiple biomarkers within intact tissue architecture, yet its high reagent cost, multi-round staining protocols, and need for specialized imaging platforms limit routine clinical adoption. Virtual staining can synthesize mIF channels from widely available brightfield immunohistochemistry (IHC), but current translators optimize pixel-level fidelity without explicitly constraining nuclear morphology. In pathology, this gap is clinically consequential: subtle distortions in nuclei count, shape, or spatial arrangement propagate directly to quantification endpoints such as the Ki67 proliferation index, where errors of a few percent can shift treatment-relevant risk categories. This work introduces a supervision-free, architecture-agnostic conditioning strategy that injects a continuous cell probability map from a pretrained nuclei segmentation foundation model as an explicit input prior, together with a variance-preserving regularization term that matches local intensity statistics to maintain cell-level heterogeneity in synthesized fluorescence channels. The soft prior retains gradient-level boundary information lost by binary thresholding, providing a richer conditioning signal without task-specific tuning. Controlled experiments across Pix2Pix with U-Net and ResNet generators, deterministic regression U-Net, and conditional diffusion on two independent datasets demonstrate consistent improvements in nuclei count fidelity and perceptual quality, with the prior and the regularization term as the sole modifications. Code will be made publicly available upon acceptance.

[110] STARK: Spatio-Temporal Attention for Representation of Keypoints for Continuous Sign Language Recognition

Suvajit Patra,Soumitra Samanta

Main category: cs.CV

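TL;DR: This paper proposes a unified spatio-temporal attention network for continuous sign language recognition (CSLR) that matches the performance of existing keypoint-based methods with far fewer parameters (a 70-80% reduction).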

Details Motivation: Existing keypoint-based CSLR methods rely on separate spatial and temporal encoding structures (such as GCN plus 1D-CNN), which inflates the parameter count of both encoder and decoder. Method: A unified spatio-temporal attention mechanism computes attention scores jointly over the spatial dimension (across keypoints) and the temporal dimension (within local time windows) and aggregates features into a local context-aware spatio-temporal representation. Result: The proposed encoder has 70-80% fewer parameters than current state-of-the-art models and performs on par with mainstream keypoint-based methods on the Phoenix-14T dataset. Conclusion: A unified spatio-temporal attention mechanism can greatly reduce model complexity while preserving recognition performance, offering a new direction for lightweight CSLR. Abstract: Continuous Sign Language Recognition (CSLR) is a crucial task for understanding the languages of deaf communities. Contemporary keypoint-based approaches typically rely on spatio-temporal encoding, where spatial interactions among keypoints are modeled using Graph Convolutional Networks or attention mechanisms, while temporal dynamics are captured using 1D convolutional networks. However, such designs often introduce a large number of parameters in both the encoder and the decoder. This paper introduces a unified spatio-temporal attention network that computes attention scores both spatially (across keypoints) and temporally (within local windows), and aggregates features to produce a local context-aware spatio-temporal representation. The proposed encoder contains approximately 70-80% fewer parameters than existing state-of-the-art models while achieving comparable performance to keypoint-based methods on the Phoenix-14T dataset.

[111] Homogeneous and Heterogeneous Consistency progressive Re-ranking for Visible-Infrared Person Re-identification

Yiming Wang

Main category: cs.CV

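TL;DR: This paper proposes a progressive modal relationship re-ranking method (HHCR) with two modules, heterogeneous consistency re-ranking and homogeneous consistency re-ranking, to address the matching difficulties caused by inter-modal discrepancy and intra-modal variation in visible-infrared person re-identification. It also designs a baseline, the consistency re-ranking inference network (CRI). Experiments indicate that the method generalizes and reaches state-of-the-art performance.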

Details Motivation: Visible-infrared person re-identification faces a large modality gap, and existing re-ranking algorithms cannot handle intra-modal variation and inter-modal discrepancy at the same time. Method: The progressive modal relationship re-ranking method HHCR comprises heterogeneous consistency re-ranking (modeling cross-modal query-gallery relationships) and homogeneous consistency re-ranking (modeling query-gallery relationships within each modality); a consistency re-ranking inference network (CRI) is built as the baseline. Result: The proposed re-ranking method generalizes well, and both the re-ranking strategy and the CRI baseline reach state-of-the-art performance on several datasets. Conclusion: HHCR mitigates both the modality gap and intra-modal variation in cross-modal person re-identification; combining the CRI baseline with re-ranking improves overall performance and supports the effectiveness and generality of the approach. Abstract: Visible-infrared person re-identification faces greater challenges than traditional person re-identification due to the significant differences between modalities. In particular, the differences between these modalities make effective matching even more challenging, mainly because existing re-ranking algorithms cannot simultaneously address the intra-modal variations and inter-modal discrepancy in cross-modal person re-identification. To address this problem, we propose a novel Progressive Modal Relationship Re-ranking method consisting of two modules, called heterogeneous and homogeneous consistency re-ranking (HHCR). The first module, heterogeneous consistency re-ranking, explores the relationship between the query and the gallery modalities in the test set. The second module, homogeneous consistency re-ranking, investigates the intrinsic relationship within each modality between the query and the gallery in the test set. Based on this, we propose a baseline for cross-modal person re-identification, called a consistency re-ranking inference network (CRI). We conducted comprehensive experiments demonstrating that our proposed re-ranking method generalizes well, and both the re-ranking and the baseline achieve state-of-the-art performance.

[112] 360° Image Perception with MLLMs: A Comprehensive Benchmark and a Training-Free Method

Huyen T. T. Tran,Van-Quang Nguyen,Farros Alferro,Kang-Jun Liu,Takayuki Okatani

Main category: cs.CV

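TL;DR: This paper introduces the 360Bench benchmark and the Free360 framework to systematically evaluate and improve the ability of multimodal large language models (MLLMs) to understand 360° images and answer visual questions (VQA) about them.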

Details Motivation: MLLMs perform well on conventional images, but their perception of 360° images is underexplored; 360° images pose challenges such as panoramic coverage, geometric distortion, and complex spatial relations, so a dedicated benchmark and an efficient inference method are needed. Method: The authors build 360Bench, a VQA benchmark for high-resolution 360° images (7K-resolution images, seven task types, human annotation), and systematically evaluate seven MLLMs and six enhancement methods; they then propose the training-free Free360 framework, which decomposes reasoning into steps based on a scene graph, applies adaptive spherical image transformations, and integrates the results into a unified graph representation for answer generation. Result: Current mainstream MLLMs show clear weaknesses on 360° image VQA; Free360 consistently improves its base MLLM without any training and serves as an efficient training-free solution. Conclusion: 360Bench fills a gap in evaluating multimodal understanding of 360° images, and Free360 offers an interpretable, modular, training-free paradigm for high-resolution 360° VQA, moving MLLMs toward perception of real 3D environments. Abstract: Multimodal Large Language Models (MLLMs) have shown impressive abilities in understanding and reasoning over conventional images. However, their perception of 360° images remains largely underexplored. Unlike conventional images, 360° images capture the entire surrounding environment, enabling holistic spatial reasoning but introducing challenges such as geometric distortion and complex spatial relations. To comprehensively assess MLLMs' capabilities to perceive 360° images, we introduce 360Bench, a Visual Question Answering (VQA) benchmark featuring 7K-resolution 360° images and seven representative (sub)tasks with annotations carefully curated by human annotators. Using 360Bench, we systematically evaluate seven MLLMs and six enhancement methods, revealing their shortcomings in 360° image perception. To address these challenges, we propose Free360, a training-free scene-graph-based framework for high-resolution 360° VQA. Free360 decomposes the reasoning process into modular steps, applies adaptive spherical image transformations to 360° images tailored to each step, and seamlessly integrates the resulting information into a unified graph representation for answer generation. Experiments show that Free360 consistently improves its base MLLM and provides a strong training-free solution for 360° VQA tasks. The source code and dataset will be publicly released upon acceptance.

[113] KidsNanny: A Two-Stage Multimodal Content Moderation Pipeline Integrating Visual Classification, Object Detection, OCR, and Contextual Reasoning for Child Safety

Viraj Panchal,Tanmay Talsaniya,Parag Patel,Meet Patel

Main category: cs.CV

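TL;DR: KidsNanny is a two-stage multimodal content moderation architecture for child safety. Stage 1 screens images with a ViT plus an object detector; Stage 2 takes the visual outputs as text and combines OCR with a 7B language model for contextual reasoning. On UnsafeBench, its accuracy and F1 exceed those of ShieldGemma-2 and LlavaGuard at much lower latency, and it shows higher recall on threats carried by embedded text.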

Details Motivation: 现有内容审核模型在处理含嵌入文字的不安全图像时存在召回率低、延迟高问题,亟需一种兼顾效率与文本感知能力的轻量级多模态方案以保障儿童在线安全。 Method: 提出两阶段架构KidsNanny:Stage 1采用ViT与目标检测器联合进行快速视觉筛选(11.7ms),仅输出文本描述而非原始像素;Stage 2接收Stage 1输出文本,结合OCR与7B语言模型进行上下文推理(总延迟120ms);在UnsafeBench Sexual类别上对比视觉单模态与全管道性能,并专门构建text-only子集评估文本敏感性。 Result: Stage 1视觉筛选达80.27%准确率、85.39% F1;全管道达81.40%准确率、86.16% F1,显著优于ShieldGemma-2(64.80%,1136ms)和LlavaGuard(80.36%,4138ms);在text-only子集上KidsNanny实现100%召回(25/25)、75.76%精度,优于ShieldGemma-2(84%召回,60%精度)。 Conclusion: KidsNanny通过解耦视觉感知与文本推理、引入专用OCR路径,在保持低延迟的同时提升了对文字相关安全威胁的识别能力,为高效、可部署的儿童内容安全系统提供了新范式,但text-only子集样本量小限制了结论普适性。 Abstract: We present KidsNanny, a two-stage multimodal content moderation architecture for child safety. Stage 1 combines a vision transformer (ViT) with an object detector for visual screening (11.7 ms); outputs are routed as text not raw pixels to Stage 2, which applies OCR and a text based 7B language model for contextual reasoning (120 ms total pipeline). We evaluate on the UnsafeBench Sexual category (1,054 images) under two regimes: vision-only, isolating Stage 1, and multimodal, evaluating the full Stage 1+2 pipeline. Stage 1 achieves 80.27% accuracy and 85.39% F1 at 11.7 ms; vision-only baselines range from 59.01% to 77.04% accuracy. The full pipeline achieves 81.40% accuracy and 86.16% F1 at 120 ms, compared to ShieldGemma-2 (64.80% accuracy, 1,136 ms) and LlavaGuard (80.36% accuracy, 4,138 ms). To evaluate text-awareness, we filter two subsets: a text+visual subset (257 images) and a text-only subset (44 images where safety depends primarily on embedded text). On text-only images, KidsNanny achieves 100% recall (25/25 positives; small sample) and 75.76% precision; ShieldGemma-2 achieves 84% recall and 60% precision at 1,136 ms. Results suggest that dedicated OCR-based reasoning may offer recall-precision advantages on text-embedded threats at lower latency, though the small text-only subset limits generalizability. 
By documenting this architecture and evaluation methodology, we aim to contribute to the broader research effort on efficient multimodal content moderation for child safety.
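
The text-only figures above can be sanity-checked with a few lines of arithmetic. The sketch below is only illustrative: the abstract reports 25/25 recalled positives, 75.76% precision, and ShieldGemma-2's 84% recall and 60% precision, while the false-positive and true-positive counts used here (8, and 21/14/4) are inferred values that happen to reproduce those percentages, not numbers stated by the authors.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Reported: 25 of 25 positives recalled, so tp=25 and fn=0.
# Assumption: fp=8 is inferred, because 25 / (25 + 8) rounds to the reported 75.76%.
kn_precision, kn_recall, kn_f1 = precision_recall_f1(tp=25, fp=8, fn=0)

# Assumption: tp=21, fn=4, fp=14 are inferred from 84% recall and 60% precision
# on the same 25 positives; the abstract does not list these counts.
sg_precision, sg_recall, sg_f1 = precision_recall_f1(tp=21, fp=14, fn=4)

print(f"KidsNanny     precision={kn_precision:.4f} recall={kn_recall:.2f}")
print(f"ShieldGemma-2 precision={sg_precision:.4f} recall={sg_recall:.2f}")
```

With only 25 positives, a single missed image moves recall by four percentage points, which is why the abstract flags the small sample.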

[114] ECHO: Edge-Cloud Humanoid Orchestration for Language-to-Motion Control

Haozhe Jia,Jianfei Song,Yuan Zhang,Honglei Jin,Youcheng Fan,Wenshuo Chen,Wei Zhang,Yutao Yue

Main category: cs.CV

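TL;DR: ECHO is an edge-cloud framework in which a cloud-hosted diffusion model generates robot motion sequences from natural language and an edge-side reinforcement-learning tracker executes them in real time. It uses a compact robot-native motion representation, needs no inference-time retargeting from human body models, and achieves stable control on a Unitree G1 humanoid with zero fine-tuning.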

Details Motivation: Address the challenges of motion generation quality, real-time performance, cross-domain transfer, and hardware adaptation in natural-language whole-body control of humanoid robots, in particular avoiding reliance on human-body-model retargeting and extensive hardware fine-tuning. Method: An edge-cloud architecture is proposed. In the cloud, a diffusion model built on CLIP text encoding and a 1D convolutional UNet (DDIM + CFG) generates motion references. On the edge, a lightweight reinforcement-learning tracker (teacher-student distillation, evidential adaptation, morphological symmetry constraints, and domain randomization) is deployed together with an IMU-driven autonomous fall recovery module. Motion is represented in a 38-dimensional robot-native space (joint angles, root planar velocity, root height, and 6D root orientation). Result: On the HumanML3D benchmark, ECHO reaches FID 0.029 and R-Precision Top-1 0.686; on a real Unitree G1 robot, diverse text commands execute stably with zero fine-tuning, with high safety and trajectory consistency. Conclusion: ECHO validates the edge-cloud division of labor and robot-native representations for language-driven humanoid control, offering a feasible path toward low-latency, high-fidelity, robust embodied control. Abstract: We present ECHO, an edge-cloud framework for language-driven whole-body control of humanoid robots. A cloud-hosted diffusion-based text-to-motion generator synthesizes motion references from natural language instructions, while an edge-deployed reinforcement-learning tracker executes them in closed loop on the robot. The two modules are bridged by a compact, robot-native 38-dimensional motion representation that encodes joint angles, root planar velocity, root height, and a continuous 6D root orientation per frame, eliminating inference-time retargeting from human body models and remaining directly compatible with low-level PD control. The generator adopts a 1D convolutional UNet with cross-attention conditioned on CLIP-encoded text features; at inference, DDIM sampling with 10 denoising steps and classifier-free guidance produces motion sequences in approximately one second on a cloud GPU. The tracker follows a Teacher-Student paradigm: a privileged teacher policy is distilled into a lightweight student equipped with an evidential adaptation module for sim-to-real transfer, further strengthened by morphological symmetry constraints and domain randomization. An autonomous fall recovery mechanism detects falls via onboard IMU readings and retrieves recovery trajectories from a pre-built motion library.
We evaluate ECHO on a retargeted HumanML3D benchmark, where it achieves strong generation quality (FID 0.029, R-Precision Top-1 0.686) under a unified robot-domain evaluator, while maintaining high motion safety and trajectory consistency. Real-world experiments on a Unitree G1 humanoid demonstrate stable execution of diverse text commands with zero hardware fine-tuning.
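
To make the 38-dimensional frame concrete, here is a minimal sketch of how such a frame might be unpacked and how a continuous 6D orientation is commonly turned into a rotation matrix by Gram-Schmidt orthonormalization. Several things here are assumptions rather than details from the abstract: the field ordering, the 2-dimensional planar velocity, the resulting 29 joint angles (38 - 2 - 1 - 6), and the names `split_frame` and `rot6d_to_matrix`.

```python
import math

FRAME_DIM = 38
ROOT_VEL_DIM, ROOT_HEIGHT_DIM, ROOT_ROT_DIM = 2, 1, 6
# Assumption: everything that is not a root quantity is a joint angle (29 values).
JOINT_DIM = FRAME_DIM - ROOT_VEL_DIM - ROOT_HEIGHT_DIM - ROOT_ROT_DIM


def split_frame(frame):
    """Split one motion frame into named fields (hypothetical field order)."""
    if len(frame) != FRAME_DIM:
        raise ValueError(f"expected {FRAME_DIM} values, got {len(frame)}")
    j = JOINT_DIM
    return {
        "joint_angles": frame[:j],
        "root_planar_velocity": frame[j:j + 2],
        "root_height": frame[j + 2],
        "root_orientation_6d": frame[j + 3:],
    }


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rot6d_to_matrix(r6):
    """Map a 6D rotation (two 3-vectors) to a 3x3 rotation matrix via Gram-Schmidt."""
    a1, a2 = r6[:3], r6[3:6]
    b1 = _normalize(a1)
    proj = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - proj * x for x, y in zip(b1, a2)])
    b3 = [
        b1[1] * b2[2] - b1[2] * b2[1],
        b1[2] * b2[0] - b1[0] * b2[2],
        b1[0] * b2[1] - b1[1] * b2[0],
    ]
    # b1, b2, b3 become the columns of the rotation matrix.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


frame = [0.0] * JOINT_DIM + [0.3, 0.0, 0.75] + [2.0, 0.0, 0.0, 1.0, 1.0, 0.0]
fields = split_frame(frame)
rotation = rot6d_to_matrix(fields["root_orientation_6d"])
```

The 6D form is popular for regression targets because, unlike Euler angles or quaternions, any pair of non-parallel vectors maps to a valid rotation.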

[115] Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning

Haomin Wang,Qi Wei,Qianli Ma,Shengyuan Ding,Jinhui Yin,Kai Chen,Hongjie Zhang

Main category: cs.CV

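TL;DR: This paper proposes CTRL-S, a framework that introduces a chain-of-thought mechanism and multi-reward optimization to markedly improve structural coherence, code quality, and visual fidelity in SVG generation.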

Details Motivation: Existing SVG generation methods suffer from limited generalization, redundant generated code, and a lack of explicit reasoning. Method: The CTRL-S framework introduces a chain-of-thought mechanism and is supported by a high-quality dataset, SVG-Sophia; it adopts the GRPO algorithm with a multi-reward optimization framework (DINO, image-text similarity, format, and code efficiency). Result: It surpasses existing methods on several measures, including task success rate, SVG code quality, and visual fidelity. Conclusion: Through structured reasoning and multi-objective optimization, CTRL-S improves the overall performance and practicality of SVG generation. Abstract: With the rapid advancement of vision-language models, an increasing number of studies have explored their potential for SVG generation tasks. Although existing approaches improve performance by constructing large-scale SVG datasets and introducing SVG-specific tokens, they still suffer from limited generalization, redundant paths in code outputs, and a lack of explicit reasoning. In this work, we present CTRL-S (Chain-of-Thought Reinforcement Learning for SVG), a unified framework that introduces a chain-of-thought mechanism to explicitly expose the model's reasoning process during SVG generation. To support this structured reasoning, we construct SVG-Sophia, a high-quality dataset containing 145K samples across SVG code refinement, Text-to-SVG, and Image-to-SVG tasks. By training the model to generate group-level structured SVG code, CTRL-S significantly improves structural coherence and visual fidelity. Furthermore, we adopt the GRPO algorithm and design a multi-reward optimization framework, incorporating DINO, image-text similarity, format, and code efficiency rewards. Through joint multi-reward optimization and multi-task training, our approach systematically enhances overall generation capabilities. Extensive experiments show that CTRL-S outperforms existing methods, achieving higher task success rates, superior SVG code quality, and exceptional visual fidelity.
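
As a rough illustration of how several rewards can feed a GRPO-style update, the sketch below combines four per-sample rewards into one score and then normalizes scores within a sampled group, which is the group-relative advantage that GRPO is known for. The weights, the reward values, and the function names are hypothetical; the abstract names the four reward types but does not say how CTRL-S weights or combines them.

```python
import statistics

# Hypothetical weights; the abstract does not specify how rewards are combined.
REWARD_WEIGHTS = {"dino": 0.4, "image_text": 0.3, "format": 0.2, "code_efficiency": 0.1}


def combined_reward(rewards, weights=REWARD_WEIGHTS):
    """Weighted sum of the individual reward terms for one sampled SVG."""
    return sum(weights[name] * rewards[name] for name in weights)


def group_relative_advantages(scores, eps=1e-8):
    """Normalize each score by the mean and standard deviation of its group."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return [(s - mean) / (std + eps) for s in scores]


# Four candidate SVGs sampled for the same prompt (made-up reward values).
group = [
    {"dino": 0.82, "image_text": 0.70, "format": 1.0, "code_efficiency": 0.6},
    {"dino": 0.55, "image_text": 0.65, "format": 1.0, "code_efficiency": 0.9},
    {"dino": 0.90, "image_text": 0.80, "format": 0.0, "code_efficiency": 0.5},
    {"dino": 0.40, "image_text": 0.30, "format": 1.0, "code_efficiency": 0.4},
]
scores = [combined_reward(r) for r in group]
advantages = group_relative_advantages(scores)
```

Because advantages are relative to the group, a candidate is rewarded only for beating its siblings on the same prompt, which removes the need for a separate learned value function.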

[116] S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight

Haodong Yan,Zhide Zhong,Jiaguan Zhu,Junjie He,Weilin Yuan,Wenxuan Song,Xin Gong,Yingjie Cai,Guanyi Zhao,Xu Yan,Bingbing Liu,Ying-Cong Chen,Haoang Li

Main category: cs.CV

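TL;DR: This paper proposes S-VAM, a video-action model with single-step forward inference that uses self-distillation to compress multi-step denoising priors into one-step inference, enabling efficient, high-fidelity action prediction.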

Details Motivation: Existing video-action models struggle to combine real-time inference with high-fidelity visual foresight: multi-step video generation is slow, and one-step feature extraction is noisy. Method: S-VAM uses a self-distillation strategy: representations extracted by a vision foundation model (VFM) from the diffusion model's multi-step generated videos serve as teacher targets, and lightweight decouplers act as students that map noisy one-step features directly to those targets, so that geometrically and semantically coherent representations are foreseen in a single forward pass. Result: In simulation and real-world experiments, S-VAM outperforms existing methods and enables efficient, precise manipulation in complex environments. Conclusion: Through structured self-distillation, S-VAM reconciles real-time operation with high fidelity, offering a more practical video-action modeling paradigm for robot learning. Abstract: Video action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs, typically relying on either slow multi-step video generation or noisy one-step feature extraction, cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify the action prediction. To enable this efficient shortcut, we introduce a novel self-distillation strategy that condenses structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model (VFM) representations extracted from the diffusion model's own multi-step generated videos provide teacher targets. Lightweight decouplers, as students, learn to directly map noisy one-step features to these targets. Extensive experiments in simulation and the real world demonstrate that our S-VAM outperforms state-of-the-art methods, enabling efficient and precise manipulation in complex environments. Our project page is https://haodong-yan.github.io/S-VAM/

[117] Leveling3D: Leveling Up 3D Reconstruction with Feed-Forward 3D Gaussian Splatting and Geometry-Aware Generation

Yiming Huang,Baixiang Huang,Beilei Cui,Chi Kit Ng,Long Bai,Hongliang Ren

Main category: cs.CV

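TL;DR: Leveling3D proposes a new framework that combines feed-forward 3D reconstruction with geometrically consistent generation. A geometry-aware leveling adapter aligns the diffusion model with geometry, repairing missing and artifact regions in extrapolated views; the repaired views in turn improve 3D Gaussian Splatting reconstruction, reaching SOTA on novel-view synthesis and depth estimation.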

Details Motivation: Existing methods that repair rendering results with diffusion models lack geometric constraints and struggle to fill missing regions in extrapolated views caused by an underconstrained 3D representation. Method: The Leveling3D framework comprises a geometry-aware leveling adapter (aligning the diffusion model's internal knowledge with the geometry prior of the feed-forward model), a palette filtering training strategy (increasing generation diversity), and test-time masking refinement (suppressing messy boundaries around repaired regions); the repaired views are fed back to improve 3DGS reconstruction. Result: SOTA performance on novel-view synthesis and depth estimation on public datasets. Conclusion: Geometry-guided generation combined with feed-forward reconstruction enables end-to-end, closed-loop enhancement of 3D reconstruction, improving extrapolated-view quality and the accuracy of the downstream 3D representation. Abstract: Feed-forward 3D reconstruction has revolutionized 3D vision, providing a powerful baseline for downstream tasks such as novel-view synthesis with 3D Gaussian Splatting. Previous works explore fixing the corrupted rendering results with a diffusion model. However, they do not account for geometry and fail to fill the missing areas in extrapolated views. In this work, we introduce Leveling3D, a novel pipeline that integrates feed-forward 3D reconstruction with geometrically consistent generation to enable holistic simultaneous reconstruction and generation. We propose a geometry-aware leveling adapter, a lightweight technique that aligns internal knowledge in the diffusion model with the geometry prior from the feed-forward model. The leveling adapter enables generation in the artifact areas of extrapolated novel views caused by underconstrained regions of the 3D representation. Specifically, to learn a more diverse generation distribution, we introduce a palette filtering strategy for training, and a test-time masking refinement to prevent messy boundaries along the repaired regions. More importantly, the enhanced extrapolated novel views from Leveling3D can be used as inputs for feed-forward 3DGS, leveling up the 3D reconstruction. We achieve SOTA performance on public datasets, including tasks such as novel-view synthesis and depth estimation.

[118] Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors

Ryosuke Hori,Jyun-Ting Song,Zhengyi Luo,Jinkun Cao,Soyong Shin,Hideo Saito,Kris Kitani

Main category: cs.CV

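TL;DR: GRIP is a new method that combines IMU and insole pressure data and uses a digital-twin humanoid model to reconstruct physically plausible motion, markedly improving pose accuracy and physical consistency.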

Details Motivation: IMU-only methods struggle to model the interaction between the body and the ground, so the reconstructed motion lacks physical plausibility; contact information such as insole pressure and physics simulation are needed to improve realism. Method: A two-module architecture is proposed: KinematicsNet estimates poses and velocities from sensor data; DynamicsNet drives a digital-twin humanoid in a physics simulator, using the residual between the predicted state and the simulated state for closed-loop control. IMU and insole pressure data are fused, and the large-scale multimodal PRISM dataset is built to support training and evaluation. Result: Across multiple datasets, GRIP outperforms existing IMU-only and IMU-pressure fusion methods, with higher global pose accuracy and stronger physical consistency. Conclusion: Fusing tactile (pressure) and inertial sensing with a physics-simulated digital twin is an effective paradigm for markerless motion capture with high accuracy and physical plausibility. Abstract: We propose Ground Reaction Inertial Poser (GRIP), a method that reconstructs physically plausible human motion using four wearable devices. Unlike conventional IMU-only approaches, GRIP combines IMU signals with foot pressure data to capture both body dynamics and ground interactions. Furthermore, rather than relying solely on kinematic estimation, GRIP uses a digital twin of a person, in the form of a synthetic humanoid in a physics simulator, to reconstruct realistic and physically plausible motion. At its core, GRIP consists of two modules: KinematicsNet, which estimates body poses and velocities from sensor data, and DynamicsNet, which controls the humanoid in the simulator using the residual between the KinematicsNet prediction and the simulated humanoid state. To enable robust training and fair evaluation, we introduce a large-scale dataset, Pressure and Inertial Sensing for Human Motion and Interaction (PRISM), that captures diverse human motions with synchronized IMUs and insole pressure sensors. Experimental results show that GRIP outperforms existing IMU-only and IMU-pressure fusion methods across all evaluated datasets, achieving higher global pose accuracy and improved physical consistency.

[119] PureCLIP-Depth: Prompt-Free and Decoder-Free Monocular Depth Estimation within CLIP Embedding Space

Ryutaro Miya,Kazuyoshi Fushinobu,Tatsuya Kawaguchi

Main category: cs.CV

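TL;DR: PureCLIP-Depth is a fully prompt-free and decoder-free monocular depth estimation model that performs concept-driven depth prediction directly in the CLIP embedding space without relying on geometric features, reaching SOTA among CLIP-based models on indoor and outdoor datasets.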

Details Motivation: Existing MDE methods mostly rely on geometric features; this work explores a new paradigm based on conceptual rather than geometric information, using CLIP's semantic embedding space for more robust, better-generalizing depth estimation. Method: An end-to-end mapping encodes the RGB image into the CLIP visual embedding space and regresses a depth embedding directly in that space (rather than a pixel-level depth map), without any prompt, text branch, or explicit decoder. Result: On standard indoor (NYUv2) and outdoor (KITTI) depth datasets, PureCLIP-Depth achieves the best performance (SOTA) among all CLIP-embedding-based methods. Conclusion: Depth estimation inside the CLIP space that is purely concept-driven, free of geometric priors, and free of prompts and decoders is shown to be feasible and effective, offering a new direction for MDE. Abstract: We propose PureCLIP-Depth, a completely prompt-free, decoder-free Monocular Depth Estimation (MDE) model that operates entirely within the Contrastive Language-Image Pre-training (CLIP) embedding space. Unlike recent models that rely heavily on geometric features, we explore a novel approach to MDE driven by conceptual information, performing computations directly within the conceptual CLIP space. The core of our method lies in learning a direct mapping from the RGB domain to the depth domain strictly inside this embedding space. Our approach achieves state-of-the-art performance among CLIP embedding-based models on both indoor and outdoor datasets. The code used in this research is available at: https://github.com/ryutaroLF/PureCLIP-Depth

[120] Exclusivity-Guided Mask Learning for Semi-Supervised Crowd Instance Segmentation and Counting

Jiyang Huang,Hongru Cheng,Wei Lin,Jia Wan,Antoni B. Chan

Main category: cs.CV

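TL;DR: This paper proposes a semi-supervised crowd analysis framework that uses EDP-SAM to generate high-quality mask supervision and XMask to learn spatially separated per-individual masks, then uses instance masks as pseudo-labels to improve counting and segmentation.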

Details Motivation: Traditional point-based annotations leave individual regions ambiguous in dense crowd scenes, making fine-grained structural semantics hard to learn, while unlabeled data are abundant, so more effective semi-supervised methods are needed. Method: Exclusion-Constrained Dual-Prompt SAM (EDP-SAM), combined with the Nearest Neighbor Exclusion Circle (NNEC) constraint, generates mask supervision; Exclusivity-Guided Mask Learning (XMask) introduces a discriminative mask objective, Gaussian smoothing, and a differentiable center sampling strategy; a semi-supervised counting framework uses instance masks as pseudo-labels. Result: State-of-the-art semi-supervised instance segmentation and counting on ShanghaiTech A, UCF-QNRF, and JHU++ (with 5%, 10%, and 40% labeled data). Conclusion: The method unifies crowd counting and instance segmentation, improves performance under limited annotation, and confirms the advantage of mask-level pseudo-labels over point annotations. Abstract: Semi-supervised crowd analysis is a prominent area of research, as unlabeled data are typically abundant and inexpensive to obtain. However, traditional point-based annotations constrain performance because individual regions are inherently ambiguous, and consequently, learning fine-grained structural semantics from sparse annotations remains an unresolved challenge. In this paper, we first propose an Exclusion-Constrained Dual-Prompt SAM (EDP-SAM), based on our Nearest Neighbor Exclusion Circle (NNEC) constraint, to generate mask supervision for current datasets. With the aim of segmenting individuals in dense scenes, we then propose Exclusivity-Guided Mask Learning (XMask), which enforces spatial separation through a discriminative mask objective. Gaussian smoothing and a differentiable center sampling strategy are utilized to improve feature continuity and training stability. Building on XMask, we present a semi-supervised crowd counting framework that uses instance mask priors as pseudo-labels, which contain richer shape information than traditional point cues. Extensive experiments on the ShanghaiTech A, UCF-QNRF, and JHU++ datasets (using 5%, 10%, and 40% labeled data) verify that our end-to-end model achieves state-of-the-art semi-supervised segmentation and counting performance, effectively bridging the gap between counting and instance segmentation within a unified framework.

[121] RASLF: Representation-Aware State Space Model for Light Field Super-Resolution

Zeqiang Wei,Kai Jin,Kuan Song,Xiuzhuang Zhou,Wenlong Chen,Min Xu

Main category: cs.CV

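TL;DR: This paper proposes the RASLF framework, which improves light field super-resolution performance and efficiency through joint multi-representation modeling, progressive geometric refinement, and a dynamic scanning mechanism.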

Details Motivation: Existing SSM-based light field super-resolution methods fail to fully exploit the complementarity among light field representations, causing texture loss and geometric misalignment across views. Method: RASLF, a representation-aware state-space framework, comprises: 1) a Progressive Geometric Refinement (PGR) block based on a panoramic epipolar representation; 2) a Representation-Aware Asymmetric Scanning (RAAS) mechanism that dynamically adjusts scanning paths according to the physical properties of each representation space; 3) a Dual-Anchor Aggregation (DAA) module that optimizes feature flow. Result: Highest reconstruction accuracy on multiple public benchmarks while remaining computationally efficient. Conclusion: By explicitly modeling structural correlations across representations and using an adaptive scanning strategy, RASLF mitigates texture loss and geometric misalignment and improves performance and efficiency together. Abstract: Current SSM-based light field super-resolution (LFSR) methods often fail to fully leverage the complementarity among various LF representations, leading to the loss of fine textures and geometric misalignments across views. To address these issues, we propose RASLF, a representation-aware state-space framework that explicitly models structural correlations across multiple LF representations. Specifically, we design a Progressive Geometric Refinement (PGR) block that uses a panoramic epipolar representation to explicitly encode multi-view parallax differences, thereby enabling integration across different LF representations. Furthermore, we introduce a Representation-Aware Asymmetric Scanning (RAAS) mechanism that dynamically adjusts scanning paths based on the physical properties of different representation spaces, optimizing the balance between performance and efficiency through path pruning. Additionally, a Dual-Anchor Aggregation (DAA) module improves hierarchical feature flow, reducing redundant deep-layer features and prioritizing important reconstruction information. Experiments on various public benchmarks show that RASLF achieves the highest reconstruction accuracy while remaining highly computationally efficient.

[122] How to Utilize Complementary Vision-Text Information for 2D Structure Understanding

Jiancheng Dong,Pengyue Jia,Derong Xu,Jiawei Cheng,Jingyu Peng,Chao Zhang,Bowen Liu,Xin Sun,Lixin Su,Shuaiqiang Wang,Dawei Yin,Xiangyu Zhao

Main category: cs.CV

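TL;DR: This paper proposes DiVA-Former, a lightweight vision-text fusion architecture that uses visual tokens as dynamic queries to compress long text sequences, integrating the visual and textual information of tables and clearly outperforming text-only and existing multimodal methods on 13 table benchmarks.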

Details Motivation: Linearizing tables for LLMs loses spatial structure, while purely visual encoders struggle to preserve exact cell text; the two are highly complementary, but direct fusion tends to introduce cross-modal interference. Method: DiVA-Former uses visual tokens as dynamic queries to distill long text sequences into digest vectors, achieving efficient complementary vision-text fusion. Result: On 13 table benchmarks, it improves on the pure-text baseline by 23.9% and consistently outperforms existing methods that use vision only, text only, or simple fusion. Conclusion: The visual and textual modalities are strongly complementary for table understanding; DiVA-Former's dynamic-query distillation yields more robust and efficient multimodal fusion. Abstract: LLMs typically linearize 2D tables into 1D sequences to fit their autoregressive architecture, which weakens row-column adjacency and other layout cues. In contrast, purely visual encoders can capture spatial cues, yet often struggle to preserve exact cell text. Our analysis reveals that these two modalities provide highly distinct information to LLMs and exhibit strong complementarity. However, direct concatenation and other fusion methods yield limited gains and frequently introduce cross-modal interference. To address this issue, we propose DiVA-Former, a lightweight architecture designed to effectively integrate vision and text information. DiVA-Former leverages visual tokens as dynamic queries to distill long textual sequences into digest vectors, thereby effectively exploiting complementary vision-text information. Evaluated across 13 table benchmarks, DiVA-Former improves upon the pure-text baseline by 23.9% and achieves consistent gains over existing baselines using visual inputs, textual inputs, or a combination of both.

[123] Synergizing Deep Learning and Biological Heuristics for Extreme Long-Tail White Blood Cell Classification

Trong-Duc Nguyen,Hoang-Long Nguyen,Huy-Hieu Pham

Main category: cs.CV

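TL;DR: This paper proposes a hybrid framework combining generative restoration, a Swin Transformer ensemble, and biologically inspired morphological constraints to address the difficulty of recognizing rare subtypes in white blood cell classification, achieving a Macro-F1 of 0.77139 on the WBCBench 2026 challenge.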

Details Motivation: Automated white blood cell classification is essential for leukemia screening but faces extreme class imbalance, long-tail distributions, and domain shift, which lead deep models to overfit dominant classes and miss rare subtypes. Method: A hybrid framework is proposed: 1) a Pix2Pix-based generative restoration module removes image artifacts; 2) a Swin Transformer ensemble is combined with MedSigLIP contrastive embeddings for more robust representations; 3) a biologically inspired refinement step uses geometric spikiness and Mahalanobis-distance-based morphological constraints to correct out-of-distribution predictions. Result: A Macro-F1 of 0.77139 on the WBCBench 2026 private leaderboard, clearly better than baselines, supporting the framework's effectiveness under severe class imbalance. Conclusion: Incorporating biological priors into deep learning improves generalization to rare white blood cell subtypes and offers a new paradigm for hematological image analysis. Abstract: Automated white blood cell (WBC) classification is essential for leukemia screening but remains challenged by extreme class imbalance, long-tail distributions, and domain shift, leading deep models to overfit dominant classes and fail on rare subtypes. We propose a hybrid framework for rare-class generalization that integrates a generative Pix2Pix-based restoration module for artifact removal, a Swin Transformer ensemble with MedSigLIP contrastive embeddings for robust representation learning, and a biologically inspired refinement step using geometric spikiness and Mahalanobis-based morphological constraints to recover out-of-distribution predictions. Evaluated on the WBCBench 2026 challenge, our method achieves a Macro-F1 of 0.77139 on the private leaderboard, demonstrating strong performance under severe imbalance and highlighting the value of incorporating biological priors into deep learning for hematological image analysis.

[124] Visual Prompt Discovery via Semantic Exploration

Jaechang Kim,Yotaro Shimose,Zhao Wang,Kuang-Da Wang,Jungseul Ok,Shingo Takamatsu

Main category: cs.CV

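TL;DR: This paper proposes SEVEX, a framework that automatically discovers task-specific visual prompts through semantic exploration to improve LVLM image understanding and visual reasoning, clearly outperforming baselines.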

Details Motivation: LVLMs exhibit serious perception failures in image understanding and visual reasoning; existing visual prompt generation methods focus only on tool selection without diagnosing and mitigating root causes, and rely on inefficient manual trial and error. Method: SEVEX, an automated semantic exploration framework, uses an abstract idea space as the search space, a novelty-guided selection algorithm, and a semantic feedback-driven ideation process to address two challenges: lengthy low-level code and a vast, unstructured search space. Result: On the BlindTest and BLINK benchmarks, SEVEX significantly outperforms baselines in task accuracy, inference efficiency, exploration efficiency, and exploration stability, and discovers sophisticated, counter-intuitive visual strategies beyond conventional tool usage. Conclusion: SEVEX offers a new paradigm for enhancing LVLM perception through automated, task-specific visual prompts, reducing human intervention and improving exploration efficiency and effectiveness. Abstract: LVLMs encounter significant challenges in image understanding and visual reasoning, leading to critical perception failures. Visual prompts, which incorporate image manipulation code, have shown promising potential in mitigating these issues. Although visual prompting has emerged as a promising direction, previous methods for visual prompt generation have focused on tool selection rather than diagnosing and mitigating the root causes of LVLM perception failures. Because of the opacity and unpredictability of LVLMs, optimal visual prompts must be discovered through empirical experiments, which have relied on manual human trial-and-error. We propose an automated semantic exploration framework for discovering task-wise visual prompts. Our approach enables diverse yet efficient exploration through agent-driven experiments, minimizing human intervention and avoiding the inefficiency of per-sample generation. We introduce a semantic exploration algorithm named SEVEX, which addresses two major challenges of visual prompt exploration: (1) the distraction caused by lengthy, low-level code and (2) the vast, unstructured search space of visual prompts. Specifically, our method leverages an abstract idea space as a search space, a novelty-guided selection algorithm, and a semantic feedback-driven ideation process to efficiently explore diverse visual prompts based on empirical results. We evaluate SEVEX on the BlindTest and BLINK benchmarks, which are designed to assess LVLM perception. Experimental results demonstrate that SEVEX significantly outperforms baseline methods in task accuracy, inference efficiency, exploration efficiency, and exploration stability.
Notably, our framework discovers sophisticated and counter-intuitive visual strategies that go beyond conventional tool usage, offering a new paradigm for enhancing LVLM perception through automated, task-wise visual prompts.

[125] Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models

Junxin Wang,Dai Guan,Weijie Qiu,Zhihang Li,Yongbo Gai,Zhengyi Yang,Mengyu Zhou,Erchao Zhao,Xiaoxi Jiang,Guanjun Jiang

Main category: cs.CV

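TL;DR: This paper proposes Explicit Visual Premise Verification (EVPV), which decouples visual perception from logical reasoning to make vision-language process reward models (VL-PRMs) score intermediate steps more reliably in multimodal reasoning, improving candidate reranking and error localization.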

Details Motivation: Existing VL-PRMs act as black-box judges that cannot separate reasoning errors from visual perception errors, producing false positives (rewarding hallucinated premises) and false negatives (penalizing correct visual statements) that harm reranking and error localization. Method: EVPV introduces a lightweight verification interface: the policy produces a step-wise visual checklist (explicit visual premises), and a constraint extractor independently derives structured visual constraints from the image; matching the checklist against the constraints yields a visual reliability signal that gates the PRM reward for visually dependent steps (attenuated when reliability is low, preserved when it is high). Result: On VisualProcessBench and six multimodal reasoning benchmarks, EVPV improves step-level verification accuracy and Best-of-N reranking accuracy; controlled constraint-corruption experiments show monotonic degradation, confirming that the gains stem from constraint fidelity and explicit premise verification. Conclusion: EVPV decouples perceptual uncertainty from logical evaluation without per-step tool calls, improving the robustness and interpretability of VL-PRMs and offering a new paradigm for process supervision in multimodal reasoning. Abstract: Vision-language process reward models (VL-PRMs) are increasingly used to score intermediate reasoning steps and rerank candidates under test-time scaling. However, they often function as black-box judges: a low step score may reflect a genuine reasoning mistake or simply the verifier's misperception of the image. This entanglement between perception and reasoning leads to systematic false positives (rewarding hallucinated visual premises) and false negatives (penalizing correct grounded statements), undermining both reranking and error localization. We introduce Explicit Visual Premise Verification (EVPV), a lightweight verification interface that conditions step scoring on the reliability of the visual premises a step depends on. The policy is prompted to produce a step-wise visual checklist that makes required visual facts explicit, while a constraint extractor independently derives structured visual constraints from the input image. EVPV matches checklist claims against these constraints to compute a scalar visual reliability signal, and calibrates PRM step rewards via reliability gating: rewards for visually dependent steps are attenuated when reliability is low and preserved when reliability is high. This decouples perceptual uncertainty from logical evaluation without per-step tool calls. Experiments on VisualProcessBench and six multimodal reasoning benchmarks show that EVPV improves step-level verification and consistently boosts Best-of-N reranking accuracy over strong baselines.
Furthermore, injecting controlled corruption into the extracted constraints produces monotonic performance degradation, providing causal evidence that the gains arise from constraint fidelity and explicit premise verification rather than incidental prompt effects. Code is available at: https://github.com/Qwen-Applications/EVPV-PRM
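
The reliability-gating idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: claims and constraints are modeled as plain strings, reliability is taken to be the fraction of checklist claims found among the extracted constraints, and the gate is a simple multiplication. The names `visual_reliability` and `gated_reward` are hypothetical.

```python
def visual_reliability(checklist, constraints):
    """Fraction of checklist claims supported by the extracted constraints.

    Assumption: exact string matching stands in for whatever matching EVPV uses.
    """
    if not checklist:
        return 1.0  # nothing visual to verify
    supported = sum(1 for claim in checklist if claim in constraints)
    return supported / len(checklist)


def gated_reward(step_reward, reliability, visually_dependent):
    """Attenuate the PRM reward of a visually dependent step by its reliability."""
    if not visually_dependent:
        return step_reward  # purely logical steps are left untouched
    return step_reward * reliability


constraints = {"triangle ABC is right-angled", "AB = 3", "BC = 4"}
grounded = ["AB = 3", "BC = 4"]
hallucinated = ["AB = 3", "AC = 7", "angle A = 60 degrees", "BC = 5"]

high = gated_reward(0.9, visual_reliability(grounded, constraints), True)
low = gated_reward(0.9, visual_reliability(hallucinated, constraints), True)
```

In this toy case the step built on hallucinated premises keeps only a quarter of its original reward, while the grounded step keeps all of it.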

[126] When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition

Xiaokun Sun,Yubo Wang,Haoyu Cao,Linli Xu

Main category: cs.CV

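TL;DR: This paper proposes FrameRepeat, a framework with a lightweight repeat scoring module and the Add-One-In training strategy that lets video LLMs autonomously identify and reinforce key frames, mitigating visual anchor drifting and improving video question answering.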

Details Motivation: In video question answering, chain-of-thought reasoning is prone to visual anchor drifting (the model over-relies on self-generated text and neglects visual input), causing performance degradation and hallucination; existing mitigations are costly to train and generalize poorly. Method: FrameRepeat comprises: 1) a lightweight repeat scoring module that lets the Video-LLM autonomously identify key frames to reinforce; 2) the Add-One-In (AOI) training strategy, which uses MLLM output probabilities to generate "repeat gain" supervision signals for training the frame scoring network. Result: Experiments across multiple models and datasets show that FrameRepeat effectively and generalizably strengthens key visual cues during reasoning, markedly mitigating visual anchor drifting. Conclusion: FrameRepeat is a low-cost, general enhancement framework with strong generalization, offering a new direction for reliable reasoning in video multimodal LLMs. Abstract: Recently, Multimodal Large Language Models (MLLMs) have demonstrated significant potential in complex visual tasks through the integration of Chain-of-Thought (CoT) reasoning. However, in Video Question Answering, extended thinking processes do not consistently yield performance gains and may even lead to degradation due to "visual anchor drifting", where models increasingly rely on self-generated text, sidelining visual inputs and causing hallucinations. While existing mitigations typically introduce specific mechanisms for the model to re-attend to visual inputs during inference, these approaches often incur prohibitive training costs and suffer from poor generalizability across different architectures. To address this, we propose FrameRepeat, an automated enhancement framework which features a lightweight repeat scoring module that enables Video-LLMs to autonomously identify which frames should be reinforced. We introduce a novel training strategy, Add-One-In (AOI), that uses MLLM output probabilities to generate supervision signals representing repeat gain. This can be used to train a frame scoring network, which guides the frame repetition behavior. Experimental results across multiple models and datasets demonstrate that FrameRepeat is both effective and generalizable in strengthening important visual cues during the reasoning process.

[127] Point-to-Mask: From Arbitrary Point Annotations to Mask-Level Infrared Small Target Detection

Weihua Gao,Wenlong Niu,Jie Tang,Man Yang,Jiafeng Zhang,Xiaodong Peng

Main category: cs.CV

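TL;DR: This paper proposes Point-to-Mask, a framework for infrared small target detection from low-cost point annotations, in which physics-driven mask generation and a radius-aware point regression network work together to approach fully supervised performance at lower annotation cost.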

Details Motivation: Existing infrared small target detection methods mostly follow a pixel-level segmentation paradigm that depends on costly dense annotations and handles tiny targets with weak texture and ambiguous boundaries poorly. Method: The Point-to-Mask framework comprises a Physics-driven Adaptive Mask Generation (PAMG) module (converting point annotations into compact masks and geometric cues) and a lightweight Radius-aware Point Regression Network (RPR-Net) (using spatiotemporal motion cues for target center localization and effective radius regression); the two form a training-inference closed loop, and the SIRSTD-Pixel sequential dataset is constructed. Result: Experiments show high pseudo-label quality, high detection accuracy, and efficient inference, approaching fully supervised performance under point supervision with substantially lower annotation cost. Conclusion: Point-to-Mask offers a low-annotation-cost, high-performance paradigm for infrared small target detection and validates combining point supervision with physical modeling and geometric regression. Abstract: Infrared small target detection (IRSTD) methods predominantly formulate the task as pixel-level segmentation, which requires costly dense annotations and is not well suited to tiny targets with weak texture and ambiguous boundaries. To address this issue, we propose Point-to-Mask, a framework that bridges low-cost point supervision and mask-level detection through two components: a Physics-driven Adaptive Mask Generation (PAMG) module that converts point annotations into compact target masks and geometric cues, and a lightweight Radius-aware Point Regression Network (RPR-Net) that reformulates IRSTD as target center localization and effective radius regression using spatiotemporal motion cues. The two modules form a closed loop: PAMG generates pseudo masks and geometric supervision during training, while the geometric predictions of RPR-Net are fed back to PAMG for pixel-level mask recovery during inference. To facilitate systematic evaluation, we further construct SIRSTD-Pixel, a sequential dataset with refined pixel-level annotations. Experiments show that the proposed framework achieves strong pseudo-label quality, high detection accuracy, and efficient inference, approaching full-supervision performance under point-supervised settings with substantially lower annotation cost. Code and datasets will be available at: https://github.com/GaoScience/point-to-mask.

[128] AW-MoE: All-Weather Mixture of Experts for Robust Multi-Modal 3D Object Detection

Hongwei Lin,Xun Huang,Chenglu Wen,Cheng Wang

Main category: cs.CV

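TL;DR: This paper proposes the AW-MoE framework, which uses Image-guided Weather-aware Routing (IWR) and Unified Dual-Modal Augmentation (UDMA) to make multi-modal 3D object detection more robust in adverse weather, improving performance by about 15% with negligible inference overhead.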

Details Motivation: Existing methods overlook data distribution discrepancies across weather scenarios, causing performance conflicts and hindering robust 3D detection in adverse weather. Method: AW-MoE comprises Image-guided Weather-aware Routing (IWR), which classifies the weather precisely and selects the most relevant Weather-Specific Experts (WSE), and Unified Dual-Modal Augmentation (UDMA), which augments LiDAR and 4D radar data synchronously while preserving scene realism. Result: On a real-world dataset, AW-MoE improves adverse-weather detection by about 15% over SOTA methods with minimal inference overhead; integrated into baseline detectors, it also surpasses the current SOTA. Conclusion: AW-MoE mitigates the performance conflicts caused by cross-weather distribution discrepancies and is both effective and scalable. Abstract: Robust 3D object detection under adverse weather conditions is crucial for autonomous driving. However, most existing methods simply combine all weather samples for training while overlooking data distribution discrepancies across different weather scenarios, leading to performance conflicts. To address this issue, we introduce AW-MoE, a framework that integrates Mixture of Experts (MoE) into weather-robust multi-modal 3D object detection. AW-MoE incorporates Image-guided Weather-aware Routing (IWR), which leverages the superior discriminability of image features across weather conditions and their invariance to scene variations for precise weather classification. Based on this accurate classification, IWR selects the top-K most relevant Weather-Specific Experts (WSE) that handle data discrepancies, ensuring optimal detection under all weather conditions. Additionally, we propose a Unified Dual-Modal Augmentation (UDMA) for synchronous LiDAR and 4D Radar dual-modal data augmentation while preserving the realism of scenes. Extensive experiments on a real-world dataset demonstrate that AW-MoE achieves an approximately 15% improvement in adverse-weather performance over state-of-the-art methods, while incurring negligible inference overhead. Moreover, integrating AW-MoE into established baseline detectors yields performance improvements surpassing current state-of-the-art methods. These results show the effectiveness and strong scalability of our AW-MoE. We will release the code publicly at https://github.com/windlinsherlock/AW-MoE.
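
Top-K expert selection of the kind the abstract describes can be illustrated with a small routing sketch. The details are assumptions: a softmax over weather-classification logits, renormalized weights over the K selected experts, and a weighted sum of scalar expert outputs. The abstract states only that IWR selects the top-K most relevant experts, and the function names below are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(weather_logits, k):
    """Return {expert_index: weight} for the k most probable weather experts."""
    probs = softmax(weather_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def mix_expert_outputs(expert_outputs, weights):
    """Weighted sum of the selected experts' outputs (scalars for simplicity)."""
    return sum(w * expert_outputs[i] for i, w in weights.items())


# Hypothetical logits for four weather experts, e.g. clear, rain, fog, snow.
weights = route_top_k([2.0, 0.1, 1.0, -1.0], k=2)
mixed = mix_expert_outputs([10.0, 20.0, 30.0, 40.0], weights)
```

Only the selected experts need to run, which is consistent with the negligible inference overhead the abstract reports.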

[129] FG-SGL: Fine-Grained Semantic Guidance Learning via Motion Process Decomposition for Micro-Gesture Recognition

Jinsheng Wei,Zhaodi Xu,Guanming Lu,Haoyu Chen,Jingjie Yan

Main category: cs.CV

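TL;DR: This paper proposes a Fine-Grained Semantic Guidance Learning (FG-SGL) framework that fuses fine-grained and category-level semantics to improve micro-gesture recognition (MGR), builds a text dataset with four-dimensional semantic annotations, and designs a multi-level contrastive optimization strategy, with experiments supporting its effectiveness.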

Details Motivation: Existing micro-gesture recognition methods rely on category-level supervision and struggle to capture subtle, localized motion differences, which limits recognition performance. Method: The FG-SGL framework comprises a fine-grained semantic guidance module (FG-SA) and a category-level semantic enhancement module (CP-A), a fine-grained text dataset annotated in four dimensions, and a coarse-to-fine multi-level contrastive optimization strategy. Result: FG-SGL achieves competitive performance on micro-gesture recognition, validating fine-grained semantic guidance. Conclusion: Fusing fine-grained and category-level semantics improves the ability of vision-language models to perceive local micro-gesture motion, offering a new direction for MGR. Abstract: Micro-gesture recognition (MGR) is challenging due to subtle inter-class variations. Existing methods rely on category-level supervision, which is insufficient for capturing subtle and localized motion differences. Thus, this paper proposes a Fine-Grained Semantic Guidance Learning (FG-SGL) framework that jointly integrates fine-grained and category-level semantics to guide vision-language models in perceiving local MG motions. FG-SA adopts fine-grained semantic cues to guide the learning of local motion features, while CP-A enhances the separability of MG features through category-level semantic guidance. To support fine-grained semantic guidance, this work constructs a fine-grained textual dataset with human annotations that describes the dynamic process of MGs in four refined semantic dimensions. Furthermore, a Multi-Level Contrastive Optimization strategy is designed to jointly optimize both modules in a coarse-to-fine pattern. Experiments show that FG-SGL achieves competitive performance, validating the effectiveness of fine-grained semantic guidance for MGR.

[130] VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment

Tengjiao Yin,Jinglei Shi,Heng Guo,Xi Wang

Main category: cs.CV

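TL;DR: This paper proposes a geometry-based reward model that uses pretrained geometric foundation models to evaluate multi-view consistency in video generation. It improves the geometric consistency of generated videos through a pointwise reprojection error and aligns video diffusion models via two pathways: post-training and inference-time optimization.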

Details Motivation: Video diffusion models lack explicit geometric supervision during training, so generated videos show inconsistency artifacts such as object deformation, spatial drift, and depth violations. Method: Proposes a geometry-based reward model that uses pretrained geometric foundation models to compute cross-frame reprojection error; introduces a geometry-aware sampling strategy that filters out low-texture and non-semantic regions; aligns video diffusion models through post-training (SFT or reinforcement learning) and inference-time optimization (test-time scaling). Result: Experiments show that the geometric reward model is more robust than other variants and enables efficient inference-time enhancement of open-source video models at low computational cost. Conclusion: The method offers a practical, efficient way to improve the geometric consistency of generated video and can improve existing open-source video models without large-scale retraining. Abstract: Video diffusion models lack explicit geometric supervision during training, leading to inconsistency artifacts such as object deformation, spatial drift, and depth violations in generated videos. To address this limitation, we propose a geometry-based reward model that leverages pretrained geometric foundation models to evaluate multi-view consistency through cross-frame reprojection error. Unlike previous geometric metrics that measure inconsistency in pixel space, where pixel intensity may introduce additional noise, our approach conducts error computation in a pointwise fashion, yielding a more physically grounded and robust error metric. Furthermore, we introduce a geometry-aware sampling strategy that filters out low-texture and non-semantic regions, focusing evaluation on geometrically meaningful areas with reliable correspondences to improve robustness. We apply this reward model to align video diffusion models through two complementary pathways: post-training of a bidirectional model via SFT or reinforcement learning, and inference-time optimization of a causal video model (e.g., a streaming video generator) via test-time scaling with our reward as a path verifier. Experimental results validate the effectiveness of our design, demonstrating that our geometry-based reward provides superior robustness compared to other variants. By enabling efficient inference-time scaling, our method offers a practical solution for enhancing open-source video models without requiring extensive computational resources for retraining.
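As a rough illustration of a pointwise cross-frame error, the sketch below compares 3D points from one frame, moved into a second frame by a rigid transform, with their matched points in that frame. It assumes that per-frame 3D points, correspondences, and the relative pose are already available (for example from a geometric foundation model). The function names and the `keep` mask, which stands in for the geometry-aware sampling, are hypothetical, and the paper's exact metric may differ.

```python
import math

def transform(point, rotation, translation):
    """Apply a rigid transform: 3x3 rotation (nested lists) followed by a translation."""
    return [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]

def pointwise_reprojection_error(points_a, points_b, rotation_ab, translation_ab, keep=None):
    """Mean 3D distance between frame-A points moved into frame B and their matches in B.
    keep: optional list of booleans marking points that survive sampling (hypothetical)."""
    if keep is None:
        keep = [True] * len(points_a)
    dists = [
        math.dist(transform(pa, rotation_ab, translation_ab), pb)
        for pa, pb, k in zip(points_a, points_b, keep)
        if k
    ]
    return sum(dists) / len(dists) if dists else 0.0

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shift = [1.0, 0.0, 0.0]                      # camera moved one unit along x
frame_a = [[0.0, 0.0, 1.0], [1.0, 1.0, 2.0]]
frame_b = [[1.0, 0.0, 1.0], [2.0, 1.0, 2.5]]  # second point is inconsistent in depth

error_all = pointwise_reprojection_error(frame_a, frame_b, identity, shift)
error_masked = pointwise_reprojection_error(frame_a, frame_b, identity, shift, keep=[True, False])
reward = -error_all  # a lower geometric error gives a higher reward
```

Working on 3D points rather than pixel intensities is what the abstract means by computing the error "in a pointwise fashion"; the mask shows how unreliable regions could be excluded before averaging.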

[131] Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation

TianTian Dang,Chao Bi,Shufan Shen,Jinzhe Liu,Qingming Huang,Shuhui Wang

Main category: cs.CV

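TL;DR: This paper proposes Locate-Then-Sparsify for Feature Steering (LTS-FS), a plug-and-play framework that quantifies each layer's contribution to hallucination through causal interventions and adjusts feature steering intensity layer by layer accordingly, mitigating hallucination in large vision-language models (LVLMs) without harming general-task performance.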

Details Motivation: Existing feature steering methods apply a uniform intensity to all layers, ignoring the layers' differing contributions to hallucination; this can disrupt unrelated layers and degrade general-task performance. Method: Builds a synthetic dataset with token-level and sentence-level hallucination cases; proposes an attribution method based on causal interventions to quantify each layer's hallucination relevance; designs a layerwise strategy that converts attribution scores into per-layer steering intensities. Result: Across multiple LVLMs and benchmarks, LTS-FS effectively mitigates hallucination while preserving strong performance. Conclusion: Layerwise adaptive feature steering is more precise and efficient than uniform steering and offers a new direction for hallucination mitigation in LVLMs. Abstract: Despite the significant advancements in Large Vision-Language Models (LVLMs), their tendency to generate hallucinations undermines reliability and restricts broader practical deployment. Among the hallucination mitigation methods, feature steering emerges as a promising approach that reduces erroneous outputs in LVLMs without increasing inference costs. However, current methods apply uniform feature steering across all layers. This heuristic strategy ignores inter-layer differences, potentially disrupting layers unrelated to hallucinations and ultimately leading to performance degradation on general tasks. In this paper, we propose a plug-and-play framework called Locate-Then-Sparsify for Feature Steering (LTS-FS), which controls the steering intensity according to the hallucination relevance of each layer. We first construct a synthetic dataset comprising token-level and sentence-level hallucination cases. Based on this dataset, we introduce an attribution method based on causal interventions to quantify the hallucination relevance of each layer. With the attribution scores across layers, we propose a layerwise strategy that converts these scores into feature steering intensities for individual layers, enabling more precise adjustments specifically on hallucination-relevant layers. Extensive experiments across multiple LVLMs and benchmarks demonstrate that our LTS-FS framework effectively mitigates hallucination while preserving strong performance.
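One way to picture the "locate, then sparsify" idea is sketched below: per-layer attribution scores are turned into per-layer steering intensities, and low-relevance layers are left untouched. The conversion rule (keep the top fraction of layers, scale linearly by score) and all names are assumptions for illustration; the abstract does not specify the actual mapping.

```python
import math

def layerwise_intensities(scores, base_strength, keep_ratio):
    """Convert per-layer attribution scores into steering intensities.
    Hypothetical rule: keep the top `keep_ratio` fraction of layers, scale the
    rest to zero, and make intensity proportional to the attribution score."""
    n_keep = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:n_keep])
    top = max(scores[i] for i in kept)
    if top <= 0:
        return [0.0] * len(scores)
    return [base_strength * scores[i] / top if i in kept else 0.0
            for i in range(len(scores))]

def apply_steering(hidden, direction, intensity):
    """Shift a hidden-state vector along a steering direction."""
    return [h + intensity * d for h, d in zip(hidden, direction)]

# Hypothetical attribution scores for a six-layer model.
scores = [0.02, 0.10, 0.40, 0.80, 0.30, 0.05]
alphas = layerwise_intensities(scores, base_strength=1.0, keep_ratio=0.5)

# Layers with zero intensity pass through unchanged.
unchanged = apply_steering([0.5, -0.2], [1.0, 1.0], alphas[0])
steered = apply_steering([0.5, -0.2], [1.0, 1.0], alphas[3])
```

Under this rule, uniform steering corresponds to setting every entry of `alphas` to `base_strength`, which is the baseline the abstract argues against.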

[132] Persistent Story World Simulation with Continuous Character Customization

Jinlu Zhang,Qiyun Wang,Baoxiang Du,Jiayi Ji,Jing He,Rongsheng Zhang,Tangjie Lv,Xiaoshuai Sun,Rongrong Ji

Main category: cs.CV

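TL;DR: This paper proposes EverTale, a story world simulator that supports continuous character customization. It uses a unified LoRA module for continuous character adaptation, an MLLM-as-Judge quality gate to ensure character fidelity, and a region-focus sampling strategy to reduce identity degradation and layout conflicts in multi-character generation.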

Details Motivation: Existing story visualization methods struggle to combine accurate character customization, semantic alignment, and continuous integration of new identities. Method: Proposes an All-in-One-World Character Integrator (continuous character adaptation within a unified LoRA), a Character Quality Gate (MLLM-based quality judgment with chain-of-thought reasoning), and Character-Aware Region-Focus Sampling (a sampling strategy that balances local detail with the global scene). Result: Outperforms a range of compared methods on both single- and multi-character story visualization. Conclusion: EverTale addresses the continuity and fidelity of character customization as well as coordinated multi-character generation, improving the overall quality and efficiency of story visualization. Abstract: Story visualization has gained increasing attention in computer vision. However, current methods often fail to achieve a synergy between accurate character customization, semantic alignment, and continuous integration of new identities. To tackle this challenge, in this paper we present EverTale, a story world simulator for continuous story character customization. We first propose an All-in-One-World Character Integrator to achieve continuous character adaptation within a unified LoRA module, eliminating the need for the per-character optimization modules of previous methods. Then, we incorporate a Character Quality Gate via MLLM-as-Judge to ensure the fidelity of each character adaptation process through chain-of-thought reasoning, determining whether the model can proceed to the next character or requires additional training on the current one. We also introduce a Character-Aware Region-Focus Sampling strategy to address the identity degradation and layout conflicts in existing multi-character visual storytelling, ensuring natural multi-character generation by harmonizing local character-specific details with global scene context with higher efficiency. Experimental results show that our EverTale achieves superior performance against a wide range of compared methods on both single- and multi-character story visualization. Code will be made available.

[133] VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents

Zhengbo Zhang,Jinbo Su,Zhaowen Zhou,Changtao Miao,Yuhan Hong,Qimeng Wu,Yumeng Liu,Feier Wu,Yihe Tian,Yuhao Liang,Zitong Shan,Wanke Xia,Yi-Fan Zhang,Bo Zhang,Zhe Li,Shiming Xiang,Ying Yan

Main category: cs.CV

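TL;DR: This paper proposes VisBrowse-Bench, a new benchmark for visual-native search that evaluates the visual reasoning of multimodal large models in web browsing tasks. It contains 169 cross-domain VQA instances and comes with an agent workflow that supports active collection of and reasoning over visual information; experiments show that the best current model still scores below 48% accuracy.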

Details Motivation: Existing benchmarks evaluate visual reasoning insufficiently and neglect the role of native visual information from web pages in the reasoning chain. Method: Builds VisBrowse-Bench (169 VQA samples), constructed by experts through a multi-stage pipeline and rigorously verified by hand; proposes a browsing agent workflow that supports active collection of visual information and joint text-image reasoning; evaluates open-source and closed-source models within this workflow. Result: The best model, Claude-4.6-Opus, reaches 47.6% accuracy, and the closed-source Deep Research model o3-deep-research reaches 41.1%; both fall far short of the ideal. Conclusion: Visual reasoning in current multimodal browsing agents remains severely limited on visual-native search; VisBrowse-Bench provides an evaluation standard closer to real scenarios and a path for improvement. Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. However, existing benchmarks suffer from two limitations: insufficient evaluation of visual reasoning ability and the neglect of native visual information of web pages in the reasoning chains. To address these challenges, we introduce a new benchmark for visual-native search, VisBrowse-Bench. It contains 169 VQA instances covering multiple domains and evaluates the models' visual reasoning capabilities during the search process through multimodal evidence cross-validation via text-image retrieval and joint reasoning. These data were constructed by human experts using a multi-stage pipeline and underwent rigorous manual verification. We additionally propose an agent workflow that can effectively drive the browsing agent to actively collect and reason over visual information during the search process. We comprehensively evaluated both open-source and closed-source models in this workflow. Experimental results show that even the best-performing model, Claude-4.6-Opus, achieves an accuracy of only 47.6%, while the proprietary Deep Research model, o3-deep-research, achieves an accuracy of only 41.1%. The code and data can be accessed at: https://github.com/ZhengboZhang/VisBrowse-Bench

[134] Micro-AU CLIP: Fine-Grained Contrastive Learning from Local Independence to Global Dependency for Micro-Expression Action Unit Detection

Jinsheng Wei,Fengzhou Guo,Yante Li,Haoyu Chen,Guanming Lu,Guoying Zhao

Main category: cs.CV

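TL;DR: This paper proposes micro-AU CLIP, a new framework for detecting micro-expression action units (Micro-AUs). It combines local semantic independence modeling (LSI) with global semantic dependency modeling (GSD), introduces a micro-AU contrastive loss (MiAUCL), supports micro-expression recognition without emotion labels, and achieves state-of-the-art performance.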

Details Motivation: Most existing Micro-AU detection methods learn from whole-face features and neglect the inherent locality of AUs; AUs are both locally independent (each corresponds to specific muscle movements) and globally dependent under emotional states, so both properties need to be modeled. Method: Proposes the micro-AU CLIP framework: 1) local semantic independence modeling (LSI) uses Patch Token Attention (PTA) to align local features within AU regions; 2) global semantic dependency modeling (GSD) introduces Global Dependency Attention (GDA) and a Global Dependency Loss (GDLoss); 3) a micro-AU contrastive loss (MiAUCL) improves fine-grained visual-text alignment; 4) ME recognition is supported without emotion labels. Result: Experiments show that the method learns fine-grained Micro-AU features and achieves SOTA performance on both Micro-AU detection and ME recognition. Conclusion: The independence-to-dependency modeling paradigm improves Micro-AU modeling; through the joint design of its modules, micro-AU CLIP addresses three problems: insufficient locality modeling, missing global dependency modeling, and difficult micro-semantic alignment. Abstract: Micro-expression (ME) action units (Micro-AUs) provide objective clues for fine-grained genuine emotion analysis. Most existing Micro-AU detection methods learn AU features from the whole facial image/video, which conflicts with the inherent locality of AU, resulting in insufficient perception of AU regions. In fact, each AU independently corresponds to specific localized facial muscle movements (local independence), while there is an inherent dependency between some AUs under specific emotional states (global dependency). Thus, this paper explores the effectiveness of the independence-to-dependency pattern and proposes a novel micro-AU detection framework, micro-AU CLIP, that uniquely decomposes the AU detection process into local semantic independence modeling (LSI) and global semantic dependency (GSD) modeling. In LSI, Patch Token Attention (PTA) is designed, mapping several local features within the AU region to the same feature space; in GSD, Global Dependency Attention (GDA) and Global Dependency Loss (GDLoss) are presented to model the global dependency relationships between different AUs, thereby enhancing each AU feature. Furthermore, considering CLIP's native limitations in micro-semantic alignment, a micro-AU contrastive loss (MiAUCL) is designed to learn AU features by a fine-grained alignment of visual and text features. Also, Micro-AU CLIP is effectively applied to ME recognition in an emotion-label-free way. The experimental results demonstrate that Micro-AU CLIP can fully learn fine-grained micro-AU features, achieving state-of-the-art performance.

[135] DriveFix: Spatio-Temporally Coherent Driving Scene Restoration

Heyu Si,Brandon James Denis,Muyang Sun,Dragos Datcu,Yaoru Li,Xin Jin,Ruiju Fu,Yuliia Tatarinova,Federico Landi,Jie Song,Mingli Song,Qi Guo

Main category: cs.CV

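TL;DR: DriveFix is a multi-view restoration framework that models spatio-temporal consistency with an interleaved diffusion transformer architecture and geometry-aware losses, achieving high-fidelity, drift-free novel view synthesis in 4D scene reconstruction for autonomous driving.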

Details Motivation: Existing 4D scene reconstruction methods based on diffusion priors mostly process frames or views one at a time and lack spatio-temporal synergy, which leads to spatial misalignment across cameras and temporal drift. Method: Proposes the DriveFix framework with an interleaved diffusion transformer architecture whose specialized blocks explicitly model temporal dependencies and cross-camera spatial consistency; generation is conditioned on historical context, and geometry-aware training losses ensure that restored views adhere to a unified 3D geometry. Result: Experiments on the Waymo, nuScenes, and PandaSet datasets show that DriveFix achieves SOTA performance in reconstruction and novel view synthesis. Conclusion: DriveFix markedly improves the robustness of 4D world modeling and is a meaningful step toward real-world deployment. Abstract: Recent advancements in 4D scene reconstruction, particularly those leveraging diffusion priors, have shown promise for novel view synthesis in autonomous driving. However, these methods often process frames independently or in a view-by-view manner, leading to a critical lack of spatio-temporal synergy. This results in spatial misalignment across cameras and temporal drift in sequences. We propose DriveFix, a novel multi-view restoration framework that ensures spatio-temporal coherence for driving scenes. Our approach employs an interleaved diffusion transformer architecture with specialized blocks to explicitly model both temporal dependencies and cross-camera spatial consistency. By conditioning the generation on historical context and integrating geometry-aware training losses, DriveFix enforces that the restored views adhere to a unified 3D geometry. This enables the consistent propagation of high-fidelity textures and significantly reduces artifacts. Extensive evaluations on the Waymo, nuScenes, and PandaSet datasets demonstrate that DriveFix achieves state-of-the-art performance in both reconstruction and novel view synthesis, marking a substantial step toward robust 4D world modeling for real-world deployment.

[136] An Interpretable Machine Learning Framework for Non-Small Cell Lung Cancer Drug Response Analysis

Ann Rachel,Pranav M Pawar,Mithun Mukharjee,Raja M,Tojo Mathew

Main category: cs.CV

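TL;DR: This paper proposes a method for predicting personalized lung cancer treatment based on multi-omics data and an XGBoost regression model, combined with SHAP and the large language model DeepSeek for interpretability analysis and biological validation.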

Details Motivation: Conventional lung cancer treatments (such as surgery, chemotherapy, and radiotherapy) have limited effectiveness because of cancer heterogeneity, so personalized treatment strategies based on individual genetic information are needed. Method: Uses GDSC multi-omics data to build an XGBoost regression model with LN-IC50 as the target; tunes hyperparameters with cross-validation and randomized search; uses SHAP to explain feature contributions and the DeepSeek large language model to check the biological plausibility of key genes and pathways and provide contextual explanations. Result: A drug sensitivity model with high predictive performance was built; SHAP identifies key molecular and cellular features, and DeepSeek further supports their biological relevance and generates understandable explanations. Conclusion: A framework that combines machine learning with large-language-model explanation can support personalized drug decisions for lung cancer and improve model trustworthiness and clinical applicability. Abstract: Lung cancer is a condition in which malignant cells grow abnormally and spread uncontrollably in the lungs. Common treatment strategies are surgery, chemotherapy, and radiation, which are often suboptimal because of the heterogeneous nature of cancer. In personalized medicine, treatments are tailored according to the individual's genetic information along with lifestyle aspects. In addition, AI-based deep learning methods can analyze large sets of data to find early signs of cancer, types of tumor, and prospects of treatment. The paper focuses on the development of personalized treatment plans using specific patient data, primarily the genetic profile. Multi-omics data from Genomics of Drug Sensitivity in Cancer (GDSC) have been used to build a predictive model with machine learning techniques. The value of the target variable, LN-IC50, indicates sensitivity or resistance to a drug. An XGBoost regressor is used to predict the drug response from molecular and cellular features extracted from cancer datasets. Cross-validation and randomized search are performed for hyperparameter tuning to further optimize the model's predictive performance. For explanation purposes, SHAP (SHapley Additive exPlanations) was used. SHAP values measure each feature's impact on an individual prediction. Furthermore, feature relationships were interpreted using DeepSeek, a large language model, to verify the biological validity of the features. Contextual explanations regarding the most important genes or pathways were provided by DeepSeek alongside the top SHAP value constituents, supporting the predictability of the model.

[137] SpikeCLR: Contrastive Self-Supervised Learning for Few-Shot Event-Based Vision using Spiking Neural Networks

Maxime Vaillant,Axel Carlier,Lai Xing Ng,Christophe Hurter,Benoit R. Cottereau

Main category: cs.CV

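TL;DR: This paper proposes SpikeCLR, a contrastive self-supervised learning framework for spiking neural networks (SNNs) that uses spatial, temporal, and polarity augmentations of event data to learn robust visual representations from unlabeled events, markedly improving performance in few-shot and semi-supervised settings.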

Details Motivation: Combining event cameras with SNNs is energy-efficient, but the scarcity of large-scale labeled data makes such models hard to train effectively. Method: Proposes the SpikeCLR framework, which adapts frame-based contrastive learning to the spiking domain using surrogate gradient training and designs spatial, temporal, and polarity augmentations for event data. Result: On the CIFAR10-DVS, N-Caltech101, N-MNIST, and DVS-Gesture benchmarks, self-supervised pretraining followed by fine-tuning outperforms fully supervised learning in low-data regimes; ablations show that combining spatial and temporal augmentations is critical for learning spatio-temporal invariances; the representations transfer across datasets. Conclusion: SpikeCLR provides an effective, general self-supervised learning paradigm for building high-performance event-based vision models when labels are scarce. Abstract: Event-based vision sensors provide significant advantages for high-speed perception, including microsecond temporal resolution, high dynamic range, and low power consumption. When combined with Spiking Neural Networks (SNNs), they can be deployed on neuromorphic hardware, enabling energy-efficient applications on embedded systems. However, this potential is severely limited by the scarcity of large-scale labeled datasets required to effectively train such models. In this work, we introduce SpikeCLR, a contrastive self-supervised learning framework that enables SNNs to learn robust visual representations from unlabeled event data. We adapt prior frame-based methods to the spiking domain using surrogate gradient training and introduce a suite of event-specific augmentations that leverage spatial, temporal, and polarity transformations. Through extensive experiments on CIFAR10-DVS, N-Caltech101, N-MNIST, and DVS-Gesture benchmarks, we demonstrate that self-supervised pretraining with subsequent fine-tuning outperforms supervised learning in low-data regimes, achieving consistent gains in few-shot and semi-supervised settings. Our ablation studies reveal that combining spatial and temporal augmentations is critical for learning effective spatio-temporal invariances in event data. We further show that learned representations transfer across datasets, contributing to efforts for powerful event-based models in label-scarce settings.
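To make the three augmentation families concrete, here is a small sketch that operates on events stored as `(x, y, t, polarity)` tuples. The specific transforms (horizontal flip, temporal crop, polarity flip), the crop fractions, and the function names are illustrative assumptions; the abstract only states that spatial, temporal, and polarity transformations are used.

```python
import random

def flip_horizontal(events, width):
    """Spatial augmentation: mirror the x coordinate of every event."""
    return [(width - 1 - x, y, t, p) for x, y, t, p in events]

def temporal_crop(events, start, end):
    """Temporal augmentation: keep events in [start, end) and re-zero their timestamps."""
    return [(x, y, t - start, p) for x, y, t, p in events if start <= t < end]

def flip_polarity(events):
    """Polarity augmentation: swap ON (+1) and OFF (-1) events."""
    return [(x, y, t, -p) for x, y, t, p in events]

def random_view(events, width, duration, rng):
    """Compose the augmentations at random to produce one view of a sample."""
    view = events
    if rng.random() < 0.5:
        view = flip_horizontal(view, width)
    if rng.random() < 0.5:
        view = flip_polarity(view)
    start = rng.uniform(0.0, 0.2 * duration)
    return temporal_crop(view, start, start + 0.8 * duration)

# A toy event stream on a 4-pixel-wide sensor, timestamps in seconds.
events = [(0, 0, 0.00, 1), (3, 1, 0.05, -1), (1, 2, 0.09, 1)]
mirrored = flip_horizontal(events, width=4)
cropped = temporal_crop(events, 0.04, 0.10)
inverted = flip_polarity(events)

# Two independently augmented views of the same sample form a positive pair.
rng = random.Random(0)
view_1 = random_view(events, width=4, duration=0.10, rng=rng)
view_2 = random_view(events, width=4, duration=0.10, rng=rng)
```

In a contrastive setup, the encoder would be trained to map `view_1` and `view_2` close together while pushing apart views of other samples.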

[138] Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation

Xinhao Cai,Gensheng Pei,Zeren Sun,Yazhou Yao,Fumin Shen,Wenguan Wang

Main category: cs.CV

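TL;DR: Iris is a deterministic monocular depth estimation framework that integrates real-world priors into a diffusion model, improving detail preservation, cross-domain generalization, and data efficiency.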

Details Motivation: Conventional feed-forward methods rely on massive training data yet still lose details; diffusion-based methods have strong generative priors but struggle with synthetic-to-real domain transfer. Method: Proposes a two-stage Priors-to-Geometry Deterministic (PGD) schedule: the prior stage uses Spectral-Gated Distillation (SGD) to transfer low-frequency real priors; the geometry stage uses Spectral-Gated Consistency (SGC) to enforce high-frequency fidelity while fine-tuning with synthetic ground truth; the two stages share weights and run with a high-to-low timestep schedule. Result: Iris markedly improves monocular depth estimation performance and generalizes strongly in the wild. Conclusion: By combining real priors with diffusion modeling, Iris unifies detail preservation, cross-domain generalization, and efficient inference under limited data. Abstract: In this paper, we propose Iris, a deterministic framework for Monocular Depth Estimation (MDE) that integrates real-world priors into the diffusion model. Conventional feed-forward methods rely on massive training data, yet still miss details. Previous diffusion-based methods leverage rich generative priors yet struggle with synthetic-to-real domain transfer. Iris, in contrast, preserves fine details, generalizes strongly from synthetic to real scenes, and remains efficient with limited training data. To this end, we introduce a two-stage Priors-to-Geometry Deterministic (PGD) schedule: the prior stage uses Spectral-Gated Distillation (SGD) to transfer low-frequency real priors while leaving high-frequency details unconstrained, and the geometry stage applies Spectral-Gated Consistency (SGC) to enforce high-frequency fidelity while refining with synthetic ground truth. The two stages share weights and are executed with a high-to-low timestep schedule. Extensive experimental results confirm that Iris achieves significant improvements in MDE performance with strong in-the-wild generalization.

[139] Retrieving Counterfactuals Improves Visual In-Context Learning

Guangzhi Xiong,Sanchit Sinha,Zhenghao He,Aidong Zhang

Main category: cs.CV

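TL;DR: This paper proposes the CIRCLES framework, which selects causal-learning examples through counterfactual-style, attribute-guided image retrieval, improving the robust reasoning of vision-language models in few-shot and information-scarce settings.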

Details Motivation: Existing vision-language models struggle to disentangle fine-grained visual attributes and reason causally; the effectiveness of in-context learning is limited by example selection, and conventional similarity-based retrieval tends to introduce spurious associations. Method: Proposes the CIRCLES framework, which uses targeted, attribute-guided composed image retrieval to actively build demonstration sets containing counterfactual-style examples, so that the model implicitly learns the causal relations between attributes and outcomes. Result: On four datasets, CIRCLES consistently outperforms existing methods, with especially large gains for small-scale models and under information scarcity; the retrieved examples are more diverse and causally informative. Conclusion: Through causality-aware example selection, CIRCLES strengthens the causal reasoning and robustness of VLMs and offers a new paradigm for multimodal in-context learning. Abstract: Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval-augmented approaches typically rely on passive similarity-based retrieval, which tends to select correlated but non-causal examples, amplifying spurious associations and limiting model robustness. We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual-style examples through targeted, attribute-guided composed image retrieval. By incorporating counterfactual-style examples, CIRCLES enables VLMs to implicitly reason about the causal relations between attributes and outcomes, moving beyond superficial correlations and fostering more robust and grounded reasoning. Comprehensive experiments on four diverse datasets demonstrate that CIRCLES consistently outperforms existing methods across multiple architectures, especially on small-scale models, with pronounced gains under information scarcity. Furthermore, CIRCLES retrieves more diverse and causally informative examples, providing qualitative insights into how models leverage in-context demonstrations for improved reasoning. Our code is available at https://github.com/gzxiong/CIRCLES.

[140] PKINet-v2: Towards Powerful and Efficient Poly-Kernel Remote Sensing Object Detection

Xinhao Cai,Liulei Li,Gensheng Pei,Zeren Sun,Yazhou Yao,Wenguan Wang

Main category: cs.CV

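TL;DR: This paper proposes PKINet-v2, an efficient backbone that handles the geometric and spatial complexity of objects in remote sensing images in a unified way. It combines anisotropic strip convolutions with isotropic square convolutions and introduces a Heterogeneous Kernel Re-parameterization (HKR) strategy to speed up inference, reaching SOTA accuracy on multiple benchmarks with a 3.9× FPS speedup.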

Details Motivation: Object detection in remote sensing images faces geometric and spatial complexity: diverse aspect ratios, large scale variation, and varied context. Existing methods design anisotropic or isotropic large kernels separately, which can disrupt spatial coherence, lose detail, or introduce noise and geometric mismatch. Method: Extends PKINet into PKINet-v2: 1) jointly models anisotropic axial-strip convolutions and isotropic square convolutions to build a multi-scope receptive field; 2) proposes a Heterogeneous Kernel Re-parameterization (HKR) strategy that fuses the branches into a single depth-wise convolution for inference. Result: Achieves SOTA accuracy on four mainstream remote sensing datasets (DOTA-v1.0, DOTA-v1.5, HRSC2016, and DIOR-R) with a 3.9× FPS speedup over PKINet-v1, balancing accuracy and efficiency. Conclusion: PKINet-v2 addresses geometric and spatial complexity in remote sensing images within a unified paradigm and, with HKR, deploys efficiently, offering a new direction for remote sensing detection backbones. Abstract: Object detection in remote sensing images (RSIs) is challenged by the coexistence of geometric and spatial complexity: targets may appear with diverse aspect ratios, while spanning a wide range of object sizes under varied contexts. Existing RSI backbones address the two challenges separately, either by adopting anisotropic strip kernels to model slender targets or by using isotropic large kernels to capture broader context. However, such isolated treatments lead to complementary drawbacks: the strip-only design can disrupt spatial coherence for regular-shaped objects and weaken tiny details, whereas isotropic large kernels often introduce severe background noise and geometric mismatch for slender structures. In this paper, we extend PKINet, and present a powerful and efficient backbone that jointly handles both challenges within a unified paradigm named Poly Kernel Inception Network v2 (PKINet-v2). PKINet-v2 synergizes anisotropic axial-strip convolutions with isotropic square kernels and builds a multi-scope receptive field, preserving fine-grained local textures while progressively aggregating long-range context across scales. To enable efficient deployment, we further introduce a Heterogeneous Kernel Re-parameterization (HKR) Strategy that fuses all heterogeneous branches into a single depth-wise convolution for inference, eliminating fragmented kernel launches without accuracy loss. Extensive experiments on four widely-used benchmarks, including DOTA-v1.0, DOTA-v1.5, HRSC2016, and DIOR-R, demonstrate that PKINet-v2 achieves state-of-the-art accuracy while delivering a $3.9\times$ FPS acceleration compared to PKINet-v1, surpassing previous remote sensing backbones in both effectiveness and efficiency.
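The idea behind fusing heterogeneous branches can be checked on a single channel with a short sketch. Because convolution is linear, a horizontal strip, a vertical strip, and a small square kernel can each be zero-padded to a common size and summed into one kernel that produces the same output as running the branches separately. This is a generic structural re-parameterization sketch, not the paper's HKR: the kernel values are made up, and normalization layers, dilation, and multi-channel handling are omitted.

```python
def pad_to(kernel, size):
    """Zero-pad a 2D kernel with odd height and width to size x size, keeping it centered."""
    h, w = len(kernel), len(kernel[0])
    top, left = (size - h) // 2, (size - w) // 2
    out = [[0.0] * size for _ in range(size)]
    for i in range(h):
        for j in range(w):
            out[top + i][left + j] = kernel[i][j]
    return out

def merge_kernels(kernels, size):
    """Sum several centered kernels into a single size x size kernel."""
    merged = [[0.0] * size for _ in range(size)]
    for kernel in kernels:
        padded = pad_to(kernel, size)
        for i in range(size):
            for j in range(size):
                merged[i][j] += padded[i][j]
    return merged

def conv2d_same(image, kernel):
    """Single-channel cross-correlation with zero padding (output has the input's size)."""
    height, width = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - ph, x + j - pw
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += image[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out

# Hypothetical branch kernels: 1x5 strip, 5x1 strip, and a 3x3 square.
strip_h = [[1.0, 2.0, 3.0, 2.0, 1.0]]
strip_v = [[1.0], [0.0], [-1.0], [0.0], [1.0]]
square = [[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]
branches = [strip_h, strip_v, square]

image = [[float((3 * y + 2 * x) % 7) for x in range(6)] for y in range(6)]

# Training-time view: run every branch and add the results.
branch_outputs = [conv2d_same(image, k) for k in branches]
multi_branch = [[sum(o[y][x] for o in branch_outputs) for x in range(6)] for y in range(6)]

# Inference-time view: one pass with the merged 5x5 kernel.
single_pass = conv2d_same(image, merge_kernels(branches, 5))

max_diff = max(abs(a - b) for ra, rb in zip(multi_branch, single_pass) for a, b in zip(ra, rb))
```

Since the merged kernel reproduces the multi-branch output exactly, the fusion trades several small kernel launches for one, which is consistent with the abstract's claim of acceleration without accuracy loss.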

[141] Learning Human-Object Interaction for 3D Human Pose Estimation from LiDAR Point Clouds

Daniel Sungho Jung,Dohee Cho,Kyoung Mu Lee

Main category: cs.CV

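TL;DR: This paper proposes HOIL, a 3D human pose estimation framework for LiDAR point clouds. It introduces human-object interaction-aware contrastive learning (HOICL) and contact-aware part-guided pooling (CPPool) to reduce spatial ambiguity in interaction regions and class imbalance in point clouds.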

Details Motivation: Existing methods overlook the potential value of human-object interaction for 3D human pose estimation. Interaction poses two challenges: spatial ambiguity between human and object points in interaction regions leads to keypoint prediction errors, and interaction-frequent body parts (such as hands and feet) are sparsely observed in LiDAR, causing severe class imbalance. Method: Proposes the HOIL framework, comprising: 1) HOICL, human-object interaction-aware contrastive learning that strengthens feature discrimination between human and object points in interaction regions; 2) CPPool, contact-aware part-guided pooling that adaptively reallocates representational capacity by compressing redundant points and preserving information from key interacting parts; 3) an optional contact-based temporal refinement module that uses contact cues over time to refine per-frame keypoint estimates. Result: HOIL makes effective use of human-object interaction, improving the robustness of 3D human pose estimation on LiDAR point clouds, in particular keypoint accuracy in interaction regions, and mitigating point sparsity and class imbalance. Conclusion: Modeling human-object interaction is an effective way to improve 3D human pose estimation from LiDAR point clouds; HOIL addresses spatial ambiguity and class imbalance at the levels of feature discrimination and representation allocation, offering a new direction for safe pedestrian understanding in autonomous driving. Abstract: Understanding humans from LiDAR point clouds is one of the most critical tasks in autonomous driving due to its close relationship with pedestrian safety, yet it remains challenging in the presence of diverse human-object interactions and cluttered backgrounds. Nevertheless, existing methods largely overlook the potential of leveraging human-object interactions to build robust 3D human pose estimation frameworks. There are two major challenges that motivate the incorporation of human-object interaction. First, human-object interactions introduce spatial ambiguity between human and object points, which often leads to erroneous 3D human keypoint predictions in interaction regions. Second, there exists severe class imbalance in the number of points between interacting and non-interacting body parts, with the interaction-frequent regions such as hand and foot being sparsely observed in LiDAR data. To address these challenges, we propose a Human-Object Interaction Learning (HOIL) framework for robust 3D human pose estimation from LiDAR point clouds. To mitigate the spatial ambiguity issue, we present human-object interaction-aware contrastive learning (HOICL) that effectively enhances feature discrimination between human and object points, particularly in interaction regions. To alleviate the class imbalance issue, we introduce contact-aware part-guided pooling (CPPool) that adaptively reallocates representational capacity by compressing overrepresented points while preserving informative points from interacting body parts. In addition, we present an optional contact-based temporal refinement that refines erroneous per-frame keypoint estimates using contact cues over time. As a result, our HOIL effectively leverages human-object interaction to resolve spatial ambiguity and class imbalance in interaction regions. Code will be released.

[142] Automated identification of Ichneumonoidea wasps via YOLO-based deep learning: Integrating HiresCam for Explainable AI

Joao Manoel Herrera Pinheiro,Gabriela Do Nascimento Herrera,Alvaro Doria Dos Santos,Luciana Bueno Dos Reis Fernandes,Ricardo V. Godoy,Eduardo A. B. Almeida,Helena Carolina Onody,Marcelo Andrade Da Costa Vieira,Angelica Maria Penteado-Dias,Marcelo Becker

Main category: cs.CV

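TL;DR: This study proposes a deep learning framework based on YOLO and HiResCAM for automated family-level identification of parasitoid wasps in the superfamily Ichneumonoidea. It exceeds 96% accuracy and uses explainable AI to verify that the model attends to biologically relevant anatomical features, improving classification efficiency for ecology and biological control.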

Details Motivation: Ichneumonoidea wasps are morphologically similar, small, and differ only subtly between species, so conventional manual identification is time-consuming and highly dependent on expert knowledge; an automated, high-accuracy classification method is needed to support biodiversity assessment, ecological monitoring, and biological control. Method: Builds a deep learning model from a YOLO architecture combined with High-Resolution Class Activation Mapping (HiResCAM), performs family-level identification on 3,556 high-resolution images of Hymenoptera specimens, and evaluates performance with precision, recall, F1 score, and accuracy. Result: The model exceeds 96% accuracy and generalizes well across morphological variation; HiResCAM visualizations confirm that it focuses on taxonomically relevant anatomical regions such as wing venation, antennal segmentation, and metasomal structures. Conclusion: The explainable deep learning framework improves the accuracy and trustworthiness of automated Ichneumonoidea classification and provides a practical tool for entomological research and rapid biodiversity characterization of under-described groups. Abstract: Accurate taxonomic identification of parasitoid wasps within the superfamily Ichneumonoidea is essential for biodiversity assessment, ecological monitoring, and biological control programs. However, morphological similarity, small body size, and fine-grained interspecific variation make manual identification labor-intensive and expertise-dependent. This study proposes a deep learning-based framework for the automated identification of Ichneumonoidea wasps using a YOLO-based architecture integrated with High-Resolution Class Activation Mapping (HiResCAM) to enhance interpretability. The proposed system simultaneously identifies wasp families from high-resolution images. The dataset comprises 3556 high-resolution images of Hymenoptera specimens. The taxonomic distribution is primarily concentrated among the families Ichneumonidae (n = 786), Braconidae (n = 648), Apidae (n = 466), and Vespidae (n = 460). Extensive experiments were conducted using a curated dataset, with model performance evaluated through precision, recall, F1 score, and accuracy. The results demonstrate high accuracy of over 96% and robust generalization across morphological variations. HiResCAM visualizations confirm that the model focuses on taxonomically relevant anatomical regions, such as wing venation, antennae segmentation, and metasomal structures, thereby validating the biological plausibility of the learned features. The integration of explainable AI techniques improves transparency and trustworthiness, making the system suitable for entomological research to accelerate biodiversity characterization in an under-described parasitoid superfamily.

[143] $D^3$-RSMDE: 40$\times$ Faster and High-Fidelity Remote Sensing Monocular Depth Estimation

Ruizhi Wang,Weihan Li,Zunlei Feng,Haofei Zhang,Mingli Song,Jiayu Wang,Jie Song,Li Sun

Main category: cs.CV

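TL;DR: This paper proposes $D^3$-RSMDE, an efficient framework for monocular depth estimation from remote sensing imagery. A ViT quickly generates a structural prior, and a lightweight U-Net refines details over a few iterations in a VAE latent space, combining high fidelity with real-time speed.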

Details Motivation: Existing remote sensing monocular depth estimation methods struggle to offer both accuracy (as with diffusion models) and efficiency (as with ViTs), so a new paradigm that balances the two is needed. Method: Proposes the $D^3$-RSMDE framework: 1) a ViT module quickly generates a depth map that serves as a structural prior; 2) based on this prior, a Progressive Linear Blending Refinement (PLBR) strategy uses a lightweight U-Net to refine details over a few iterations in a compressed VAE latent space. Result: Reduces LPIPS by 11.85% relative to SOTA models such as Marigold, speeds up inference by more than 40×, and keeps VRAM usage comparable to lightweight ViTs. Conclusion: The method bridges the gap between quality and efficiency in remote sensing depth estimation and offers a viable route to real-time, high-fidelity applications. Abstract: Real-time, high-fidelity monocular depth estimation from remote sensing imagery is crucial for numerous applications, yet existing methods face a stark trade-off between accuracy and efficiency. Although using Vision Transformer (ViT) backbones for dense prediction is fast, they often exhibit poor perceptual quality. Conversely, diffusion models offer high fidelity but at a prohibitive computational cost. To overcome these limitations, we propose Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation ($D^3$-RSMDE), an efficient framework designed to achieve an optimal balance between speed and quality. Our framework first leverages a ViT-based module to rapidly generate a high-quality preliminary depth map, which serves as a structural prior, effectively replacing the time-consuming initial structure generation stage of diffusion models. Based on this prior, we propose a Progressive Linear Blending Refinement (PLBR) strategy, which uses a lightweight U-Net to refine the details in only a few iterations. The entire refinement step operates efficiently in a compact latent space supported by a Variational Autoencoder (VAE). Extensive experiments demonstrate that $D^3$-RSMDE achieves a notable 11.85% reduction in the Learned Perceptual Image Patch Similarity (LPIPS) perceptual metric over leading models like Marigold, while also achieving over a 40x speedup in inference and maintaining VRAM usage comparable to lightweight ViT models.

[144] Advancing Visual Reliability: Color-Accurate Underwater Image Enhancement for Real-Time Underwater Missions

Yiqiang Zhou,Yifan Chen,Zhe Sun,Jijun Lu,Ye Zheng,Xuelong Li

Main category: cs.CV

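TL;DR: This paper proposes a real-time, lightweight underwater image enhancement framework that combines accurate color restoration with low computational cost. It achieves SOTA performance on multiple datasets with only 3,880 parameters at 409 FPS and has been deployed on ROV platforms.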

Details Motivation: Existing high-performance underwater image enhancement methods are structurally complex and hard to deploy, while lightweight methods often sacrifice quality and struggle with severely degraded images. Method: Proposes a three-module framework: 1) an Adaptive Weighted Channel Compensation module (dynamically restores the red and blue channels using the green channel as an anchor); 2) a Multi-branch Re-parameterized Dilated Convolution (multi-branch fusion during training, structural re-parameterization at inference); 3) a Statistical Global Color Adjustment module based on statistical priors. Result: Achieves SOTA on seven metrics across eight datasets; the model has only 3,880 parameters and runs at 409 FPS; the UCIQE score improves by 29.7%; it has been deployed on ROV platforms and improves downstream task performance. Conclusion: The method markedly improves underwater image quality at very low computational cost and is particularly suited to resource-constrained real-time underwater platforms. Abstract: Underwater image enhancement plays a crucial role in providing reliable visual information for underwater platforms, since strong absorption and scattering in water-related environments generally lead to image quality degradation. Existing high-performance methods often rely on complex architectures, which hinder deployment on underwater devices. Lightweight methods often sacrifice quality for speed and struggle to handle severely degraded underwater images. To address this limitation, we present a real-time underwater image enhancement framework with accurate color restoration. First, an Adaptive Weighted Channel Compensation module is introduced to achieve dynamic color recovery of the red and blue channels using the green channel as a reference anchor. Second, we design a Multi-branch Re-parameterized Dilated Convolution that employs multi-branch fusion during training and structural re-parameterization during inference, enabling large receptive field representation with low computational overhead. Finally, a Statistical Global Color Adjustment module is employed to optimize overall color performance based on statistical priors. Extensive experiments on eight datasets demonstrate that the proposed method achieves state-of-the-art performance across seven evaluation metrics. The model contains only 3,880 inference parameters and achieves an inference speed of 409 FPS. Our method improves the UCIQE score by 29.7% under diverse environmental conditions, and the deployment on ROV platforms and performance gains in downstream tasks further validate its superiority for real-time underwater missions.

[145] InViC: Intent-aware Visual Cues for Medical Visual Question Answering

Zhisong Wang,Ziyang Chen,Zanting Ye,Hongze Zhu,Yefeng Zheng,Yong Xia

Main category: cs.CV

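TL;DR: This paper proposes InViC, a framework that uses intent-aware visual cues and a two-stage fine-tuning strategy to make medical visual question answering (Med-VQA) models rely more on image evidence, reducing language-shortcut bias and improving clinical reliability.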

Details Motivation: Existing multimodal large language models often give "shortcut answers" in medical VQA by relying on language priors or dataset biases while neglecting key visual evidence, which undermines clinical trustworthiness. Method: InViC, a lightweight plug-in framework comprising: 1) a Cue Tokens Extraction (CTE) module that produces K compact, question-conditioned visual cue tokens; 2) a two-stage fine-tuning strategy: Stage I uses an attention mask to block direct access to raw visual features, forcing the model to obtain visual information only through the cue tokens; Stage II restores standard attention and jointly trains on the visual and cue tokens. Result: On three Med-VQA benchmarks (VQA-RAD, SLAKE, and ImageCLEF VQA-Med 2019), InViC outperforms zero-shot inference and standard LoRA fine-tuning across several mainstream MLLMs. Conclusion: Intent-aware visual cues combined with bottlenecked training are a practical and effective way to improve the trustworthiness of Med-VQA. Abstract: Medical visual question answering (Med-VQA) aims to answer clinically relevant questions grounded in medical images. However, existing multimodal large language models (MLLMs) often exhibit shortcut answering, producing plausible responses by exploiting language priors or dataset biases while insufficiently attending to visual evidence. This behavior undermines clinical reliability, especially when subtle imaging findings are decisive. We propose a lightweight plug-in framework, termed Intent-aware Visual Cues (InViC), to explicitly enhance image-based answer generation in medical VQA. InViC introduces a Cue Tokens Extraction (CTE) module that distills dense visual tokens into a compact set of K question-conditioned cue tokens, which serve as structured visual intermediaries injected into the LLM decoder to promote intent-aligned visual evidence. To discourage bypassing of visual information, we further design a two-stage fine-tuning strategy with a cue-bottleneck attention mask. In Stage I, we employ an attention mask to block the LLM's direct view of raw visual features, thereby funneling all visual evidence through the cue pathway. In Stage II, standard causal attention is restored to train the LLM to jointly exploit the visual and cue tokens. We evaluate InViC on three public Med-VQA benchmarks (VQA-RAD, SLAKE, and ImageCLEF VQA-Med 2019) across multiple representative MLLMs. InViC consistently improves over zero-shot inference and standard LoRA fine-tuning, demonstrating that combining intent-aware visual cues with bottlenecked training is a practical and effective strategy for improving trustworthy Med-VQA.
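
The two-stage masking idea can be illustrated with a small sketch. It is not the authors' implementation: the token order (raw visual tokens, then cue tokens, then text tokens) and the function name `attention_mask` are assumptions made only for this example.

```python
import numpy as np

def attention_mask(n_visual, n_cue, n_text, stage):
    """Boolean mask where entry [q, k] is True if query q may attend to key k.

    Assumed (hypothetical) token order: visual tokens, cue tokens, text tokens.
    """
    n = n_visual + n_cue + n_text
    mask = np.tril(np.ones((n, n), dtype=bool))  # standard causal attention
    if stage == 1:
        # Stage I: cue and text queries lose direct access to raw visual keys,
        # so visual evidence can only reach the text through the cue tokens.
        mask[n_visual:, :n_visual] = False
    return mask

stage1 = attention_mask(n_visual=4, n_cue=2, n_text=3, stage=1)
stage2 = attention_mask(n_visual=4, n_cue=2, n_text=3, stage=2)
```

In `stage1` the first text token (row 6) can attend to the cue tokens (columns 4 and 5) but not to the raw visual tokens (columns 0 to 3); in `stage2` the plain causal mask is back and both are visible.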

[146] Semantic One-Dimensional Tokenizer for Image Reconstruction and Generation

Yunpeng Qu,Kaidong Zhang,Yukang Ding,Ying Chen,Jian Wang

Main category: cs.CV

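TL;DR: This paper proposes SemTok, a semantic one-dimensional tokenizer that compresses 2D images into 1D discrete tokens carrying high-level semantics, setting a new SOTA in image reconstruction and improving downstream image generation.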

Details Motivation: Existing visual tokenizers mostly map images to fixed 2D spatial grids and focus on pixel-level reconstruction, making it hard to capture compact global semantic representations. Method: SemTok, with three key innovations: a 2D-to-1D tokenization scheme, a semantic alignment constraint, and a two-stage generative training strategy; a masked autoregressive generation framework is built on top of SemTok. Result: New SOTA in image reconstruction, with higher fidelity from more compact tokens; notable gains on downstream image generation. Conclusion: Semantic 1D tokenization models global high-level semantics more effectively and provides a better latent representation for visual generative models. Abstract: Visual generative models based on latent space have achieved great success, underscoring the significance of visual tokenization. Mapping images to latents boosts efficiency and enables multimodal alignment for scaling up in downstream tasks. Existing visual tokenizers primarily map images into fixed 2D spatial grids and focus on pixel-level restoration, which hinders the capture of representations with compact global semantics. To address these issues, we propose SemTok, a semantic one-dimensional tokenizer that compresses 2D images into 1D discrete tokens with high-level semantics. SemTok sets a new state-of-the-art in image reconstruction, achieving superior fidelity with a remarkably compact token representation. This is achieved via a synergistic framework with three key innovations: a 2D-to-1D tokenization scheme, a semantic alignment constraint, and a two-stage generative training strategy. Building on SemTok, we construct a masked autoregressive generation framework, which yields notable improvements in downstream image generation tasks. Experiments confirm the effectiveness of our semantic 1D tokenization. Our code will be open-sourced.

[147] Unpaired Cross-Domain Calibration of DMSP to VIIRS Nighttime Light Data Based on CUT Network

Zhan Tong,ChenXu Zhou,Fei Tang,Yiming Tu,Tianyu Qin,Kaihao Fang

Main category: cs.CV

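TL;DR: This paper proposes a cross-sensor calibration method based on the Contrastive Unpaired Translation (CUT) network that converts DMSP-OLS nighttime light data into a VIIRS-like format, addressing the data fusion problem that sensor incompatibility poses for long-term urbanization monitoring.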

Details Motivation: DMSP-OLS and SNPP-VIIRS nighttime light data are vital for urbanization monitoring, but sensor differences prevent consistent long-term analysis. Method: A CUT network with multilayer patch-wise contrastive learning maximizes mutual information between corresponding image patches, preserving content consistency while learning cross-domain similarity; it is trained on data from the 2012-2013 overlap period and applied to 1992-2013 DMSP imagery to generate VIIRS-style data. Result: The generated VIIRS-like data agree closely with real VIIRS observations (R² > 0.87) and align well with socioeconomic indicators. Conclusion: The method mitigates the cross-sensor fusion problem and corrects inherent DMSP defects, offering a feasible route to longer, more reliable nighttime light time series. Abstract: Defense Meteorological Satellite Program (DMSP-OLS) and Suomi National Polar-orbiting Partnership (SNPP-VIIRS) nighttime light (NTL) data are vital for monitoring urbanization, yet sensor incompatibilities hinder long-term analysis. This study proposes a cross-sensor calibration method using a Contrastive Unpaired Translation (CUT) network to transform DMSP data into a VIIRS-like format, correcting DMSP defects. The method employs multilayer patch-wise contrastive learning to maximize mutual information between corresponding patches, preserving content consistency while learning cross-domain similarity. Utilizing 2012-2013 overlapping data for training, the network processes 1992-2013 DMSP imagery to generate enhanced VIIRS-style raster data. Validation results demonstrate that the generated VIIRS-like data exhibit high consistency with actual VIIRS observations (R-squared greater than 0.87) and socioeconomic indicators. This approach effectively resolves cross-sensor data fusion issues and calibrates DMSP defects, providing a reliable approach for building extended NTL time series.
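
For readers unfamiliar with patch-wise contrastive learning, the following is a minimal NumPy sketch of an InfoNCE-style loss over patch features. It is an illustration rather than the paper's code: the function name `patch_nce_loss`, the feature shapes, and the temperature value are assumptions.

```python
import numpy as np

def patch_nce_loss(query, keys, temperature=0.07):
    """InfoNCE over N patches: query[i] should match keys[i]; other rows act as negatives.

    query: (N, D) features of patches from the translated image (assumed layout).
    keys:  (N, D) features of the patches at the same locations in the input image.
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                 # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # cross-entropy with the diagonal as target

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 16))
aligned_loss = patch_nce_loss(features, features)
misaligned_loss = patch_nce_loss(features, np.roll(features, 1, axis=0))
```

The loss is low when each translated patch is most similar to the input patch at the same location (`aligned_loss`) and high when the correspondence is broken (`misaligned_loss`), which is how the objective encourages content to be preserved across domains.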

[148] DermaFlux: Synthetic Skin Lesion Generation with Rectified Flows for Enhanced Image Classification

Stathis Galanakis,Alexandros Koliousis,Stefanos Zafeiriou

Main category: cs.CV

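TL;DR: DermaFlux is a rectified-flow-based text-to-image generation framework that synthesizes clinically grounded skin lesion images to alleviate data scarcity and class imbalance, improving skin lesion classification performance.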

Details Motivation: Skin lesion classification systems are limited by the scarcity of large, diverse, well-annotated clinical datasets, which leads to class imbalance between benign and malignant lesions and poor generalization. Method: DermaFlux builds on Flux.1 and is fine-tuned parameter-efficiently with LoRA; Llama 3.2 generates synthetic captions following dermatological criteria (such as asymmetry, border irregularity, and color variation) to form image-text pairs for training. Result: Augmenting small real-world datasets with DermaFlux images improves binary classification by up to 6%; a ViT trained on only 2,500 real images plus 4,375 DermaFlux-generated images reaches 78.04% accuracy and an AUC of 0.859, 8% above the next best model; training on DermaFlux images rather than diffusion-generated images improves classification by up to 9%. Conclusion: DermaFlux is an efficient, clinically credible method for synthesizing skin lesion images that eases the data bottleneck and substantially improves classifiers in low-data settings, with potential for clinical translation. Abstract: Despite recent advances in deep generative modeling, skin lesion classification systems remain constrained by the limited availability of large, diverse, and well-annotated clinical datasets, resulting in class imbalance between benign and malignant lesions and consequently reduced generalization performance. We introduce DermaFlux, a rectified flow-based text-to-image generative framework that synthesizes clinically grounded skin lesion images from natural language descriptions of dermatological attributes. Built upon Flux.1, DermaFlux is fine-tuned using parameter-efficient Low-Rank Adaptation (LoRA) on a large curated collection of publicly available clinical image datasets. We construct image-text pairs using synthetic textual captions generated by Llama 3.2, following established dermatological criteria including lesion asymmetry, border irregularity, and color variation. Extensive experiments demonstrate that DermaFlux generates diverse and clinically meaningful dermatology images that improve binary classification performance by up to 6% when augmenting small real-world datasets, and by up to 9% when classifiers are trained on DermaFlux-generated synthetic images rather than diffusion-based synthetic images. Our ImageNet-pretrained ViT fine-tuned with only 2,500 real images and 4,375 DermaFlux-generated samples achieves 78.04% binary classification accuracy and an AUC of 0.859, surpassing the next best dermatology model by 8%.
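
As background on the parameter-efficient fine-tuning mentioned above, here is a minimal sketch of a single LoRA-adapted linear layer. It shows the general LoRA idea only; the class name `LoRALinear`, the rank, and the scaling value are illustrative assumptions and are not taken from DermaFlux.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                            # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (rank, d_in))    # trainable down-projection
        self.B = np.zeros((d_out, rank))                # trainable up-projection, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        # Base output plus the scaled low-rank correction.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

pretrained = np.ones((6, 5))
layer = LoRALinear(pretrained, rank=2)
inputs = np.ones((3, 5))
outputs = layer(inputs)
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and only `A` and `B` (22 values here, against 30 in the frozen weight) would be updated during fine-tuning. The saving grows quickly with layer size.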

[149] Near-light Photometric Stereo with Symmetric Lights

Lilika Makabe,Heng Guo,Hiroaki Santo,Fumio Okura,Yasuyuki Matsushita

Main category: cs.CV

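TL;DR: This paper proposes a linear solution to near-light photometric stereo that exploits symmetric light source arrangements, recovering surface normals and depth in closed form without initialization; it remains valid when the lights are symmetrically distributed even if their spatial offset is uncalibrated.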

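Details Motivation: Conventional non-convex optimization approaches require initialization, are computationally complex, and demand careful light calibration; the goal is to make near-light photometric stereo more robust and practical. Method: Multiple sets of symmetric nearby light source pairs are arranged, and their geometric symmetry is used to derive a closed-form linear solution for surface normal and depth that does not depend on initial values or precise light position calibration. Result: Experiments show shape recovery accuracy comparable to the state-of-the-art calibrated near-light photometric stereo method, with far weaker requirements on depth initialization and light calibration. Conclusion: Symmetric light arrangements turn near-light photometric stereo into a linear problem, enabling efficient, robust, initialization-free 3D reconstruction. Abstract: This paper describes a linear solution method for near-light photometric stereo by exploiting symmetric light source arrangements. Unlike conventional non-convex optimization approaches, by arranging multiple sets of symmetric nearby light source pairs, our method derives a closed-form solution for surface normal and depth without requiring initialization. In addition, our method works as long as the light sources are symmetrically distributed about an arbitrary point even when the entire spatial offset is uncalibrated. Experiments showcase the shape recovery accuracy of our method, achieving comparable results to the state-of-the-art calibrated near-light photometric stereo method while significantly reducing requirements of careful depth initialization and light calibration.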

[150] HGP-Mamba: Integrating Histology and Generated Protein Features for Mamba-based Multimodal Survival Risk Prediction

Jing Dai,Chen Wu,Ming Wu,Qibin Zhang,Zexi Wu,Jingdong Zhang,Hongming Xu

Main category: cs.CV

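TL;DR: HGP-Mamba is a Mamba-based multimodal framework that generates protein features from whole slide images (WSIs) and fuses them with histology features for efficient and accurate cancer survival risk prediction.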

Details Motivation: The joint prognostic potential of protein markers and histopathology images remains underexplored, mainly because protein expression profiling is costly and data are scarce. Method: HGP-Mamba comprises: 1) a protein feature extractor (PFE) that uses pretrained foundation models to generate protein embeddings from WSIs; 2) a Local Interaction-aware Mamba (LiAM) for fine-grained cross-modal interaction; 3) a Global Interaction-enhanced Mamba (GiEM) for holistic modality fusion at the slide level. Result: SOTA performance on four public cancer datasets, with markedly better computational efficiency than existing methods. Conclusion: Prognostically valuable protein features can be generated from WSIs alone, offering a new paradigm for low-cost, effective multimodal cancer prognosis modeling. Abstract: Recent advances in multimodal learning have significantly improved cancer survival risk prediction. However, the joint prognostic potential of protein markers and histopathology images remains underexplored, largely due to the high cost and limited availability of protein expression profiling. To address this challenge, we propose HGP-Mamba, a Mamba-based multimodal framework that efficiently integrates histological features with generated protein features for survival risk prediction. Specifically, we introduce a protein feature extractor (PFE) that leverages pretrained foundation models to derive high-throughput protein embeddings directly from Whole Slide Images (WSIs), enabling data-efficient incorporation of molecular information. Together with histology embeddings that capture morphological patterns, we further introduce the Local Interaction-aware Mamba (LiAM) for fine-grained feature interaction and the Global Interaction-enhanced Mamba (GiEM) to promote holistic modality fusion at the slide level, thus capturing complex cross-modal dependencies. Experiments on four public cancer datasets demonstrate that HGP-Mamba achieves state-of-the-art performance while maintaining superior computational efficiency compared with existing methods. Our source code is publicly available at this https URL.

[151] SF-Mamba: Rethinking State Space Model for Vision

Masakazu Yoshimura,Teruaki Hayashi,Yuki Hoshino,Wei-Yao Wang,Takeshi Ohashi

Main category: cs.CV

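TL;DR: This paper proposes SF-Mamba, which uses auxiliary patch swapping and batch folding with periodic state reset to improve Mamba's bidirectional modeling and GPU parallel efficiency on vision tasks, improving both accuracy and throughput.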

Details Motivation: Existing vision Mamba models struggle to model non-causal relations between image patches because of their unidirectional scan; multi-scan strategies suffer from inefficient designs and data rearrangement overhead; and Mamba is relatively slow at short token lengths. Method: SF-Mamba: 1) an auxiliary patch swapping mechanism that introduces bidirectional information flow under a unidirectional scan; 2) batch folding with periodic state reset to improve GPU parallel efficiency. Result: Significantly outperforms SOTA baselines on image classification, object detection, instance segmentation, and semantic segmentation, while improving inference throughput across model sizes. Conclusion: SF-Mamba alleviates the bidirectional-modeling and efficiency bottlenecks of vision Mamba and is an efficient, scalable vision encoder. Abstract: Mamba for vision has advanced in recent years in pursuit of alternatives to Vision Transformers (ViTs), which suffer from quadratic complexity. While the recurrent scanning mechanism of Mamba offers computational efficiency, it inherently limits non-causal interactions between image patches. Prior works have attempted to address this limitation through various multi-scan strategies; however, these approaches suffer from inefficiencies due to suboptimal scan designs and frequent data rearrangement. Moreover, Mamba exhibits relatively slow computational speed under short token lengths, commonly used in visual tasks. In pursuit of a truly efficient vision encoder, we rethink the scan operation for vision and the computational efficiency of Mamba. To this end, we propose SF-Mamba, a novel visual Mamba with two key proposals: auxiliary patch swapping for encoding bidirectional information flow under a unidirectional scan and batch folding with periodic state reset for advanced GPU parallelism. Extensive experiments on image classification, object detection, and instance and semantic segmentation consistently demonstrate that our proposed SF-Mamba significantly outperforms state-of-the-art baselines while improving throughput across different model sizes. We will release the source code after publication.
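
The abstract does not spell out how batch folding works, so the sketch below only illustrates the general idea under stated assumptions: a toy scalar recurrence h_t = a * h_{t-1} + b * x_t stands in for the selective state space scan, and the function names are hypothetical. It shows that concatenating a batch into one long sequence and resetting the state at every sequence boundary gives the same outputs as scanning each sequence separately.

```python
import numpy as np

def scan(x, a, b, reset_every=None):
    """Sequential linear recurrence h_t = a * h_{t-1} + b * x_t over a 1D sequence."""
    h, out = 0.0, []
    for t, x_t in enumerate(x):
        if reset_every is not None and t % reset_every == 0:
            h = 0.0  # periodic state reset at each folded sequence boundary
        h = a * h + b * x_t
        out.append(h)
    return np.array(out)

def scan_batch(batch, a, b):
    """Reference: scan every sequence in the batch independently."""
    return np.stack([scan(seq, a, b) for seq in batch])

def scan_folded(batch, a, b):
    """Fold the batch into one long sequence, scan once, then unfold."""
    n, length = batch.shape
    folded = batch.reshape(-1)
    return scan(folded, a, b, reset_every=length).reshape(n, length)

batch = np.arange(12.0).reshape(3, 4)   # 3 sequences of 4 tokens each
reference = scan_batch(batch, a=0.9, b=0.5)
folded = scan_folded(batch, a=0.9, b=0.5)
```

Here `reference` and `folded` agree, so no state leaks between images. The efficiency gain in the paper would come from running one long scan on the GPU instead of many short ones, which this sequential toy loop does not attempt to reproduce.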

[152] 3D Fourier-based Global Feature Extraction for Hyperspectral Image Classification

Muhammad Ahmad

Main category: cs.CV

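TL;DR: This paper proposes HGFNet, a hybrid architecture that combines 3D convolutions with several tailored Fourier transforms (spectral, spatial, and joint spectral-spatial 3D) to model spatial-spectral correlations in hyperspectral images efficiently, and introduces an Adaptive Focal Loss (AFL) to mitigate class imbalance.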

Details Motivation: Existing methods have two main shortcomings: transformer-based methods scale poorly because of the quadratic complexity of self-attention, and Fourier-based methods mostly use 2D spatial FFTs that ignore critical inter-band spectral dependencies. In addition, severe class imbalance is common in hyperspectral image classification. Method: Hybrid GFNet (HGFNet): 1) localized 3D convolutions extract fine-grained spatial-spectral features; 2) three complementary frequency transforms (a 1D FFT along the spectral axis, a 2D spatial FFT, and a joint spectral-spatial 3D FFT) are embedded in GFNet-style global filtering blocks; 3) an Adaptive Focal Loss (AFL) dynamically adjusts class weights and focusing. Result: HGFNet achieves SOTA performance on several standard hyperspectral datasets, outperforming mainstream CNN, transformer, and existing frequency-domain methods, with gains in accuracy, efficiency, and robustness (especially for classes with few samples). Conclusion: Combining local 3D convolution with multi-dimensional frequency-domain modeling is an effective paradigm for hyperspectral image classification; a tailored joint spectral-spatial frequency representation exploits the intrinsic structure of hyperspectral data more fully; AFL mitigates class imbalance and improves generalization. Abstract: Hyperspectral image classification (HSIC) has been significantly advanced by deep learning methods that exploit rich spatial-spectral correlations. However, existing approaches still face fundamental limitations: transformer-based models suffer from poor scalability due to the quadratic complexity of self-attention, while recent Fourier transform-based methods typically rely on 2D spatial FFTs and largely ignore critical inter-band spectral dependencies inherent to hyperspectral data. To address these challenges, we propose Hybrid GFNet (HGFNet), a novel architecture that integrates localized 3D convolutional feature extraction with frequency-domain global filtering via GFNet-style blocks for efficient and robust spatial-spectral representation learning. HGFNet introduces three complementary frequency transforms tailored to hyperspectral imagery: Spectral Fourier Transform (a 1D FFT along the spectral axis), Spatial Fourier Transform (a 2D FFT over spatial dimensions), and Spectral-Spatial Fourier Transform (a 3D FFT jointly over spectral and spatial dimensions), enabling comprehensive and high-dimensional frequency modeling. The 3D convolutional layers capture fine-grained local spatial-spectral structures, while the Fourier-based global filtering modules efficiently model long-range dependencies and suppress noise. To further mitigate the severe class imbalance commonly observed in HSIC, HGFNet incorporates an Adaptive Focal Loss (AFL) that dynamically adjusts class-wise focusing and weighting, improving discrimination for underrepresented classes.
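
The three frequency transforms differ only in which axes of the hyperspectral cube are transformed. A minimal NumPy sketch of a global-filter step follows; the `(bands, height, width)` layout, the function name `global_filter`, and the all-ones filter standing in for learnable complex weights are assumptions made for illustration.

```python
import numpy as np

def global_filter(x, weight, axes):
    """FFT over `axes`, element-wise multiplication by a frequency filter, inverse FFT."""
    freq = np.fft.fftn(x, axes=axes)
    return np.fft.ifftn(freq * weight, axes=axes).real

rng = np.random.default_rng(0)
cube = rng.normal(size=(8, 5, 5))   # toy patch: (bands, height, width)
identity = np.ones(cube.shape)      # stand-in for a learnable complex filter

spectral = global_filter(cube, identity, axes=(0,))      # 1D FFT along the spectral axis
spatial = global_filter(cube, identity, axes=(1, 2))     # 2D FFT over the spatial axes
joint = global_filter(cube, identity, axes=(0, 1, 2))    # 3D FFT over all three axes
```

With the all-ones filter each variant returns the input unchanged, which is a quick sanity check. A trained filter would instead reweight frequencies, so every output element can depend on the whole band range, the whole spatial window, or both, at the O(N log N) cost of the FFT.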

[153] Cross-modal learning for plankton recognition

Joona Kareinen,Veikka Immonen,Tuomas Eerola,Lumi Haraguchi,Lasse Lensu,Kaisa Kraft,Sanna Suikkanen,Heikki Kälviäinen

Main category: cs.CV

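TL;DR: This paper proposes a self-supervised cross-modal coordination method that jointly trains encoders on unlabeled plankton images and optical measurement data (such as scatter and fluorescence profiles), achieving accurate plankton recognition with only a few labeled samples.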

Details Motivation: Existing plankton recognition methods depend on large amounts of manually labeled data, which is costly, while the optical measurement data collected by newer imaging instruments remain underused. Method: Inspired by CLIP, a contrastive learning framework jointly trains image and profile encoders using only a binary supervisory signal (whether an image and a profile come from the same particle); recognition combines a small labeled gallery with a k-NN classifier. Result: The method achieves high recognition accuracy with very few labeled images and outperforms an image-only self-supervised baseline. Conclusion: Self-supervised cross-modal coordination is a viable way to exploit multi-source unlabeled plankton data, reducing the dependence on labeled data and improving recognition performance. Abstract: This paper considers self-supervised cross-modal coordination as a strategy enabling utilization of multiple modalities and large volumes of unlabeled plankton data to build models for plankton recognition. Automated imaging instruments facilitate the continuous collection of plankton image data on a large scale. Current methods for automatic plankton image recognition rely primarily on supervised approaches, which require labeled training sets that are labor-intensive to collect. On the other hand, some modern plankton imaging instruments complement image information with optical measurement data, such as scatter and fluorescence profiles, which currently are not widely utilized in plankton recognition. In this work, we explore the possibility of using such measurement data to guide the learning process without requiring manual labeling. Inspired by the concepts behind Contrastive Language-Image Pre-training, we train encoders for both modalities using only binary supervisory information indicating whether a given image and profile originate from the same particle or from different particles. For plankton recognition, we employ a small labeled gallery of known plankton species combined with a k-NN classifier. This approach yields a recognition model that is inherently multimodal, i.e., capable of utilizing information extracted from both image and profile data. We demonstrate that the proposed method achieves high recognition accuracy while requiring only a minimal number of labeled images. Furthermore, we show that the approach outperforms an image-only self-supervised baseline. Code available at https://github.com/Jookare/cross-modal-plankton.
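
The recognition step (a small labeled gallery plus k-NN) can be sketched as follows. This is not the code from the linked repository: the function name `knn_predict`, the cosine-similarity metric, the 2D toy embeddings, and the class labels are all placeholders chosen for the example.

```python
from collections import Counter

import numpy as np

def knn_predict(query, gallery, labels, k=3):
    """Label a query embedding by majority vote among its k most similar gallery entries."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    nearest = np.argsort(-(g @ q))[:k]          # indices of the top-k cosine similarities
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy gallery: in practice these would be encoder outputs for a few labeled particles.
gallery = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = ["diatom", "diatom", "ciliate", "ciliate"]
prediction = knn_predict(np.array([1.0, 0.2]), gallery, labels, k=3)
```

Only the gallery needs labels; the encoders that produce the embeddings are trained without them. In a multimodal setting the query and gallery vectors could, for instance, be concatenated image and profile embeddings, although the abstract does not state how the two are combined.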

[154] IRIS: A Real-World Benchmark for Inverse Recovery and Identification of Physical Dynamic Systems from Monocular Video

Rasul Khanbayov,Mohamed Rayan Barhdadi,Erchin Serpedin,Hasan Kurban

Main category: cs.CV

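TL;DR: This paper introduces the IRIS benchmark: 220 high-fidelity real-world videos covering single- and multi-body dynamical systems, with a standardized evaluation protocol and several baselines, to advance research on unsupervised physical parameter estimation.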

Details Motivation: Unsupervised physical parameter estimation lacks a common benchmark: synthetic datasets do not overlap, real data are limited to single-body systems, and there is no evaluation protocol for governing-equation identification. Method: The IRIS benchmark is built (real videos at 4K and 60 fps, measured ground-truth parameters with uncertainty estimates, and the corresponding governing equations); a standardized evaluation protocol covers parameter accuracy, identifiability, extrapolation, robustness, and equation selection; baselines include a multi-step physics loss and four equation-identification strategies (VLM temporal reasoning, describe-then-classify prompting, CNN classification, and path-based labelling). Result: Baselines are evaluated on all IRIS scenarios, exposing systematic failure modes; the dataset, annotations, evaluation toolkit, and all baseline code are released. Conclusion: IRIS provides the first comprehensive real-world benchmark and evaluation framework for unsupervised physical parameter estimation and governing-equation identification, improving reproducibility and comparability in the field. Abstract: Unsupervised physical parameter estimation from video lacks a common benchmark: existing methods evaluate on non-overlapping synthetic data, the sole real-world dataset is restricted to single-body systems, and no established protocol addresses governing-equation identification. This work introduces IRIS, a high-fidelity benchmark comprising 220 real-world videos captured at 4K resolution and 60 fps, spanning both single- and multi-body dynamics with independently measured ground-truth parameters and uncertainty estimates. Each dynamical system is recorded under controlled laboratory conditions and paired with its governing equations, enabling principled evaluation. A standardized evaluation protocol is defined encompassing parameter accuracy, identifiability, extrapolation, robustness, and governing-equation selection. Multiple baselines are evaluated, including a multi-step physics loss formulation and four complementary equation-identification strategies (VLM temporal reasoning, describe-then-classify prompting, CNN-based classification, and path-based labelling), establishing reference performance across all IRIS scenarios and exposing systematic failure modes that motivate future research. The dataset, annotations, evaluation toolkit, and all baseline implementations are publicly released.

[155] CD-FKD: Cross-Domain Feature Knowledge Distillation for Robust Single-Domain Generalization in Object Detection

Junseok Lee,Sungho Shin,Seongju Lee,Kyoobin Lee

Main category: cs.CV

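TL;DR: This paper proposes CD-FKD, a cross-domain feature knowledge distillation method that uses global and instance-wise feature distillation to improve how well an object detector trained on a single source domain generalizes to unseen target domains.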

Details Motivation: Single-domain generalization is essential for object detection, but domain shifts in weather, lighting, and scene conditions severely weaken the generalization of existing models. Method: Cross-Domain Feature Knowledge Distillation (CD-FKD) trains the student network on data diversified by downscaling and corruption, while the teacher network receives the original source-domain data; the student mimics the teacher's features through global and instance-wise feature distillation. Result: Experiments on several challenging scenes show that CD-FKD outperforms state-of-the-art methods in both target-domain generalization and source-domain performance. Conclusion: CD-FKD improves the robustness of object detectors to domain shift and suits real-world applications such as autonomous driving and surveillance that need stable detection across environments. Abstract: Single-domain generalization is essential for object detection, particularly when training models on a single source domain and evaluating them on unseen target domains. Domain shifts, such as changes in weather, lighting, or scene conditions, pose significant challenges to the generalization ability of existing models. To address this, we propose Cross-Domain Feature Knowledge Distillation (CD-FKD), which enhances the generalization capability of the student network by leveraging both global and instance-wise feature distillation. The proposed method uses diversified data through downscaling and corruption to train the student network, whereas the teacher network receives the original source domain data. The student network mimics the features of the teacher through both global and instance-wise distillation, enabling it to extract object-centric features effectively, even for objects that are difficult to detect owing to corruption. Extensive experiments on challenging scenes demonstrate that CD-FKD outperforms state-of-the-art methods in both target domain generalization and source domain performance, validating its effectiveness in improving object detection robustness to domain shifts. This approach is valuable in real-world applications, such as autonomous driving and surveillance, where robust object detection in diverse environments is crucial.

[156] Fast-HaMeR: Boosting Hand Mesh Reconstruction using Knowledge Distillation

Hunain Ahmed Jillani,Ahmed Tawfik Aboukhadra,Ahmed Elhayek,Jameel Malik,Nadia Robertini,Didier Stricker

Main category: cs.CV

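TL;DR: This paper proposes a knowledge-distillation-based lightweight method for 3D hand reconstruction that replaces the ViT-H backbone of HaMeR with lightweight networks such as MobileNet, shrinking the model to 35% of its original size and speeding up inference by 1.5x at a cost of only 0.4 mm in accuracy.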

Details Motivation: High-accuracy 3D hand reconstruction models are computationally expensive and hard to deploy on resource-constrained devices such as VR/AR headsets and phones, so lightweight alternatives are needed. Method: Lightweight backbones (MobileNet, MobileViT, ConvNeXt, and ResNet) replace HaMeR's original ViT-H backbone, and output-level, feature-level, and hybrid knowledge distillation strategies are compared. Result: The lightweight models are only 35% of the original size and run 1.5x faster, with an average accuracy drop of only 0.4 mm; output-level distillation notably improves student performance, while feature-level distillation is more effective for higher-capacity students. Conclusion: The method substantially improves efficiency while preserving high reconstruction accuracy, offering a practical path to real-time 3D hand reconstruction on low-power devices. Abstract: Fast and accurate 3D hand reconstruction is essential for real-time applications in VR/AR, human-computer interaction, robotics, and healthcare. Most state-of-the-art methods rely on heavy models, limiting their use on resource-constrained devices like headsets, smartphones, and embedded systems. In this paper, we investigate how the use of lightweight neural networks, combined with Knowledge Distillation, can accelerate complex 3D hand reconstruction models by making them faster and lighter, while maintaining comparable reconstruction accuracy. While our approach is suited for various hand reconstruction frameworks, we focus primarily on boosting the HaMeR model, currently the leading method in terms of reconstruction accuracy. We replace its original ViT-H backbone with lighter alternatives, including MobileNet, MobileViT, ConvNeXt, and ResNet, and evaluate three knowledge distillation strategies: output-level, feature-level, and a hybrid of both. Our experiments show that using lightweight backbones that are only 35% the size of the original achieves 1.5x faster inference speed while preserving similar performance quality with only a minimal accuracy difference of 0.4 mm. More specifically, we show how output-level distillation notably improves student performance, while feature-level distillation proves more effective for higher-capacity students. Overall, the findings pave the way for efficient real-world applications on low-power devices. The code and models are publicly available under https://github.com/hunainahmedj/Fast-HaMeR.
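
The three distillation strategies compared above can be written as one weighted loss. The sketch below is a generic formulation and not the released Fast-HaMeR code: the function name `distillation_loss`, the mean-squared-error terms, and the assumption that student features have already been projected to the teacher's dimensionality are all choices made for illustration.

```python
import numpy as np

def distillation_loss(student_out, teacher_out, student_feat, teacher_feat,
                      w_out=1.0, w_feat=1.0):
    """Weighted sum of output-level and feature-level mimicry terms (MSE assumed)."""
    out_term = np.mean((student_out - teacher_out) ** 2)     # match teacher predictions
    feat_term = np.mean((student_feat - teacher_feat) ** 2)  # match teacher backbone features
    return w_out * out_term + w_feat * feat_term

student_out, teacher_out = np.zeros(3), np.ones(3)
student_feat, teacher_feat = np.zeros(4), 2.0 * np.ones(4)

output_level = distillation_loss(student_out, teacher_out, student_feat, teacher_feat, w_feat=0.0)
feature_level = distillation_loss(student_out, teacher_out, student_feat, teacher_feat, w_out=0.0)
hybrid = distillation_loss(student_out, teacher_out, student_feat, teacher_feat)
```

Setting one weight to zero recovers the pure output-level or feature-level variant, and keeping both gives the hybrid. In a real training loop this term would be added to the usual reconstruction loss against ground truth.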

[157] Unified Removal of Raindrops and Reflections: A New Benchmark and A Novel Pipeline

Xingyu Liu,Zewei He,Yu Chen,Chunyu Zhu,Zixuan Chen,Xing Luo,Zhe-Ming Lu

Main category: cs.CV

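TL;DR: This paper proposes a new task, UR³, for removing raindrops and reflections simultaneously; it builds the real-shot RDRF dataset and designs DiffUR³, a diffusion-based framework that handles this composite degradation effectively.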

Details Motivation: When images are captured through glass or windshields on rainy days, raindrops and reflections often occur together and severely reduce visibility, and existing methods do not handle this composite degradation effectively. Method: The UR³ task is formally defined for the first time, the real-shot RDRF dataset is constructed, and DiffUR³, a diffusion-based framework, uses generative priors to remove raindrops and reflections jointly. Result: SOTA performance on the RDRF benchmark and on in-the-wild images. Conclusion: UR³ is a new task of practical significance, and the DiffUR³ framework and its dataset provide a foundation and a new direction for follow-up research. Abstract: When capturing images through glass surfaces or windshields on rainy days, raindrops and reflections frequently co-occur to significantly reduce the visibility of captured images. This practical problem lacks attention and needs to be resolved urgently. Prior de-raindrop, de-reflection, and all-in-one models have failed to address this composite degradation. To this end, we formally define, for the first time, the unified removal of raindrops and reflections (UR³) task and construct a real-shot dataset, namely RainDrop and ReFlection (RDRF), which provides a new benchmark with substantial, high-quality, diverse image pairs. Then, we propose a novel diffusion-based framework (i.e., DiffUR³) with several targeted designs to address this challenging task. By leveraging the powerful generative prior, DiffUR³ successfully removes both types of degradations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on our benchmark and on challenging in-the-wild images. The RDRF dataset and the codes will be made public upon acceptance.

[158] ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars

Kaiwen Song,Jinkai Cui,Juyong Zhang

Main category: cs.CV

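TL;DR: This paper proposes ProgressiveAvatars, a progressive avatar representation built on a hierarchy of 3D Gaussians grown by adaptive implicit subdivision of a template mesh, supporting progressive transmission and rendering under fluctuating bandwidth, compute, and memory.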

Details Motivation: In practical XR and telepresence applications, network and compute resources fluctuate frequently, calling for a progressive 3D representation that adapts dynamically to resource changes. Method: ProgressiveAvatars generates a hierarchy of 3D Gaussians by adaptive implicit subdivision of a template mesh; Gaussians are defined in face-local coordinates so they remain animatable under expressions and head motion; the hierarchy expands in response to screen-space signals, and importance ranking enables incremental loading and rendering. Result: Progressive delivery and progressive rendering are supported under fluctuating bandwidth and varying compute and memory resources, with smooth quality improvement and content continuity. Conclusion: ProgressiveAvatars provides a robust, efficient, and scalable progressive 3D avatar representation for real-time immersive applications. Abstract: In practical real-time XR and telepresence applications, network and computing resources fluctuate frequently. Therefore, a progressive 3D representation is needed. To this end, we propose ProgressiveAvatars, a progressive avatar representation built on a hierarchy of 3D Gaussians grown by adaptive implicit subdivision on a template mesh. 3D Gaussians are defined in face-local coordinates to remain animatable under varying expressions and head motion across multiple detail levels. The hierarchy expands when screen-space signals indicate a lack of detail, allocating resources to important areas. Leveraging importance ranking, ProgressiveAvatars supports incremental loading and rendering, adding new Gaussians as they arrive while preserving previous content, thus achieving smooth quality improvements across varying bandwidths. ProgressiveAvatars enables progressive delivery and progressive rendering under fluctuating network bandwidth and varying compute and memory resources.

[159] TinyGLASS: Real-Time Self-Supervised In-Sensor Anomaly Detection

Pietro Bonazzi,Rafael Sutter,Luigi Capogrosso,Mischa Buob,Michele Magno

Main category: cs.CV

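TL;DR: This paper proposes TinyGLASS, a lightweight self-supervised anomaly detection model adapted to the Sony IMX500 intelligent vision sensor, enabling low-power, real-time, energy-efficient on-device defect detection.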

Details Motivation: Existing self-supervised anomaly detection methods such as GLASS perform well but are computationally heavy and hard to deploy on resource-constrained edge devices such as the IMX500. Method: The WideResNet-50 backbone of GLASS is replaced with a compact ResNet-18, and deployment-oriented changes enable static graph tracing and INT8 quantization, with compression done using Sony's Model Compression Toolkit; performance and robustness are evaluated on MVTec-AD and a custom MMS industrial dataset. Result: TinyGLASS achieves 8.7x parameter compression and 94.2% image-level AUROC on MVTec-AD; on the IMX500 it runs at 20 FPS within 8 MB of memory, consumes 4.0 mJ per inference, and reaches 470 GMAC/J; it shows some robustness to training data contamination. Conclusion: TinyGLASS brings a high-performing self-supervised anomaly detection model to an intelligent image sensor, balancing accuracy, speed, power, and deployability, and moving industrial inspection toward real-time on-device operation. Abstract: Anomaly detection plays a key role in industrial quality control, where defects must be identified despite the scarcity of labeled faulty samples. Recent self-supervised approaches, such as GLASS, learn normal visual patterns using only defect-free data and have shown strong performance on industrial benchmarks. However, their computational requirements limit deployment on resource-constrained edge platforms. This work introduces TinyGLASS, a lightweight adaptation of the GLASS framework designed for real-time in-sensor anomaly detection on the Sony IMX500 intelligent vision sensor. The proposed architecture replaces the original WideResNet-50 backbone with a compact ResNet-18 and introduces deployment-oriented modifications that enable static graph tracing and INT8 quantization using Sony's Model Compression Toolkit. In addition to evaluating performance on the MVTec-AD benchmark, we investigate robustness to contaminated training data and introduce a custom industrial dataset, named MMS Dataset, for cross-device evaluation. Experimental results show that TinyGLASS achieves 8.7x parameter compression while maintaining competitive detection performance, reaching 94.2% image-level AUROC on MVTec-AD and operating at 20 FPS within the 8 MB memory constraints of the IMX500 platform. System profiling demonstrates low power consumption (4.0 mJ per inference), real-time end-to-end latency (20 FPS), and high energy efficiency (470 GMAC/J). Furthermore, the model maintains stable performance under moderate levels of training data contamination.
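
The profiling figures can be related to each other with simple arithmetic. The derived values below are not reported in the abstract; they assume that the 4.0 mJ, 20 FPS, and 470 GMAC/J figures describe the same measurement and that inference runs back to back at the stated frame rate.

```python
# Figures quoted in the abstract.
ENERGY_PER_INFERENCE_J = 4.0e-3   # 4.0 mJ per inference
FRAMES_PER_SECOND = 20            # 20 FPS
EFFICIENCY_GMAC_PER_J = 470       # 470 GMAC/J

# Derived values (illustrative arithmetic only, not reported by the authors).
average_power_w = ENERGY_PER_INFERENCE_J * FRAMES_PER_SECOND
gmac_per_inference = EFFICIENCY_GMAC_PER_J * ENERGY_PER_INFERENCE_J

print(f"average inference power: {average_power_w * 1e3:.0f} mW")
print(f"implied compute per inference: {gmac_per_inference:.2f} GMAC")
```

Under these assumptions the sensor would draw roughly 80 mW for inference, and the efficiency figure would imply about 1.88 GMAC per frame.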

[160] Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval

Weiqing Li,Jinyue Guo,Yaqi Wang,Haiyang Xiao,Yuewei Zhang,Guohua Liu,Hao Henry Wang

Main category: cs.CV

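TL;DR: This paper proposes Evo-Retriever, a framework that improves visual-text cross-modal retrieval through multi-view image alignment, bidirectional contrastive learning that generates hard queries, and an LLM meta-controller that dynamically adjusts the training curriculum, achieving SOTA on ViDoRe V2 and MMEB.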

Details Motivation: In real-world documents, heterogeneity and lack of structure make the cross-modal embeddings of vision-language models inconsistent; conventional late-interaction methods are limited by few samples and static training strategies, cannot adapt to the model's dynamic evolution, and cause cross-modal retrieval confusion. Method: Evo-Retriever: 1) multi-view image alignment (multi-scale and multi-directional) strengthens fine-grained matching; 2) bidirectional contrastive learning generates "hard queries" and builds complementary learning paths to disentangle visual and textual ambiguity; 3) a model-state summary from the collaboration modules is fed to an LLM meta-controller, which uses expert knowledge to adapt the training curriculum. Result: nDCG@5 reaches 65.2% on ViDoRe V2 and 77.1% on MMEB (VisDoc), clearly ahead of existing methods. Conclusion: Through viewpoint-pathway collaboration and LLM-guided curriculum evolution, Evo-Retriever eases the dynamic adaptation problem of cross-modal retrieval in document scenarios and shows that combining LLM meta-control with fine-grained alignment is feasible. Abstract: Visual-language models (VLMs) excel at data mappings, but real-world document heterogeneity and unstructuredness disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model's dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel Viewpoint-Pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates "hard queries" and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. Finally, the model-state summary from the above collaboration is fed into an LLM meta-controller, which adaptively adjusts the training curriculum using expert knowledge to promote the model's evolution. On ViDoRe V2 and MMEB (VisDoc), Evo-Retriever achieves state-of-the-art performance, with nDCG@5 scores of 65.2% and 77.1%.
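
For reference, nDCG@5 rewards placing relevant documents near the top of the first five results. The sketch below uses the linear-gain form of the metric; benchmark toolkits may use a different gain (for example 2^rel - 1), and the function names here are illustrative rather than taken from any evaluation library.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain assumed)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    """DCG of the returned ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of each retrieved page, listed in the order the retriever returned them.
top_hit = ndcg_at_k([1, 0, 0, 0, 0])
second_hit = ndcg_at_k([0, 1, 0, 0, 0])
```

A query whose only relevant page is ranked first scores 1.0, while the same page at rank two scores 1 / log2(3), about 0.63. A benchmark score is typically the average of this value over all queries.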

[161] GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models

Jiaxin Zhang,Junjun Jiang,Haijie Li,Youyu Chen,Kui Jiang,Dave Zhenyu Chen

Main category: cs.CV

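TL;DR: This paper proposes GAP-MLLM, a geometry-aligned pre-training paradigm that activates geometric perception in multimodal large language models through a visual-prompted joint task (predicting sparse pointmaps alongside semantic labels) and a multi-level progressive fusion module, improving performance on 3D visual grounding, dense captioning, and video object detection.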

Details Motivation: MLLMs that take RGB input have strong semantic reasoning but perform poorly at 3D spatial perception; the authors argue that the root cause is not insufficient geometric priors but a text-dominated fine-tuning paradigm that fails to activate the model's internal geometric representations. Method: GAP-MLLM: 1) a visual-prompted joint task (predicting sparse pointmaps together with semantic labels) forces the model to learn geometric awareness; 2) a multi-level progressive fusion module with token-level gating adaptively fuses geometric priors without weakening semantic reasoning. Result: On 3D visual grounding, 3D dense captioning, and 3D video object detection, GAP-MLLM markedly improves geometric feature fusion and consistently outperforms existing methods. Conclusion: Geometric perception can be activated and integrated into MLLMs through a purpose-built pre-training paradigm without explicit 3D input; GAP-MLLM offers a new way to narrow the gap between RGB-only MLLMs and 3D perception. Abstract: Multimodal Large Language Models (MLLMs) demonstrate exceptional semantic reasoning but struggle with 3D spatial perception when restricted to pure RGB inputs. Despite leveraging implicit geometric priors from 3D reconstruction models, image-based methods still exhibit a notable performance gap compared to methods using explicit 3D data. We argue that this gap does not arise from insufficient geometric priors, but from a misalignment in the training paradigm: text-dominated fine-tuning fails to activate geometric representations within MLLMs. Existing approaches typically resort to naive feature concatenation and optimize directly for downstream tasks without geometry-specific supervision, leading to suboptimal structural utilization. To address this limitation, we propose GAP-MLLM, a Geometry-Aligned Pre-training paradigm that explicitly activates structural perception before downstream adaptation. Specifically, we introduce a visual-prompted joint task that compels the MLLMs to predict sparse pointmaps alongside semantic labels, thereby enforcing geometric awareness. Furthermore, we design a multi-level progressive fusion module with a token-level gating mechanism, enabling adaptive integration of geometric priors without suppressing semantic reasoning. Extensive experiments demonstrate that GAP-MLLM significantly enhances geometric feature fusion and consistently improves performance across 3D visual grounding, 3D dense captioning, and 3D video object detection tasks.
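
The abstract does not give the form of the token-level gate, so the following is only one plausible reading: a per-token sigmoid gate, computed from the concatenated semantic and geometric features, scales a residual geometric contribution. The function name `gated_fusion` and all shapes and parameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(semantic, geometric, w_gate, b_gate):
    """Add geometric features to semantic tokens, scaled by a per-token gate in (0, 1).

    semantic, geometric: (tokens, dim); w_gate: (2 * dim, 1); b_gate: scalar bias.
    """
    gate_input = np.concatenate([semantic, geometric], axis=-1)
    gate = sigmoid(gate_input @ w_gate + b_gate)   # (tokens, 1), one gate value per token
    return semantic + gate * geometric

semantic = np.ones((3, 4))
geometric = 2.0 * np.ones((3, 4))
w_gate = np.zeros((8, 1))

gate_closed = gated_fusion(semantic, geometric, w_gate, b_gate=-50.0)
gate_open = gated_fusion(semantic, geometric, w_gate, b_gate=50.0)
```

With the gate closed the semantic tokens pass through unchanged, which is one way such a design could avoid suppressing semantic reasoning; with the gate open the geometric features are added in full. A trained gate would sit between these extremes and vary from token to token.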

[162] DST-Net: A Dual-Stream Transformer with Illumination-Independent Feature Guidance and Multi-Scale Spatial Convolution for Low-Light Image Enhancement

Yicui Shi,Yuhan Chen,Xiangfei Huang,Zhenguo Wang,Wenxuan Yu,Ying Fang

Main category: cs.CV

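TL;DR: This paper proposes DST-Net, a dual-stream Transformer network for low-light image enhancement based on illumination-agnostic signal prior guidance and multi-scale spatial convolutions; it significantly improves visual quality and objective metrics (e.g., a PSNR of 25.64 dB on the LOL dataset) and generalizes across scenes.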

Details Motivation: Existing low-light image enhancement methods often cause a severe loss of intrinsic signal priors and struggle to preserve fine structures and textures. Method: DST-Net is proposed: 1) it fuses DoG, LAB transformations, and VGG-16 to extract illumination-agnostic signal priors; 2) it builds a dual-stream interaction architecture that combines cross-modal attention with differentiable curve estimation for iterative enhancement; 3) it designs a Multi-Scale Spatial Fusion Block (MSFB) with pseudo-3D and 3D gradient operator convolutions to recover high-frequency edges and model inter-channel spatial correlations. Result: A PSNR of 25.64 dB on the LOL dataset, with strong cross-scene generalization validated on the LSRW dataset; both subjective and objective evaluations are better than existing methods. Conclusion: Through signal prior guidance and multi-scale spatial modeling, DST-Net effectively mitigates luminance attenuation and structural degradation in low-light images, offering a new paradigm for low-light enhancement. Abstract: Low-light image enhancement aims to restore the visibility of images captured by visual sensors in dim environments by addressing their inherent signal degradations, such as luminance attenuation and structural corruption. Although numerous algorithms attempt to improve image quality, existing methods often cause a severe loss of intrinsic signal priors. To overcome these challenges, we propose a Dual-Stream Transformer Network (DST-Net) based on illumination-agnostic signal prior guidance and multi-scale spatial convolutions. First, to address the loss of critical signal features under low-light conditions, we design a feature extraction module. This module integrates Difference of Gaussians (DoG), LAB color space transformations, and VGG-16 for texture extraction, utilizing decoupled illumination-agnostic features as signal priors to continuously guide the enhancement process. Second, we construct a dual-stream interaction architecture. By employing a cross-modal attention mechanism, the network leverages the extracted priors to dynamically rectify the deteriorated signal representation of the enhanced image, ultimately achieving iterative enhancement through differentiable curve estimation. Furthermore, to overcome the inability of existing methods to preserve fine structures and textures, we propose a Multi-Scale Spatial Fusion Block (MSFB) featuring pseudo-3D and 3D gradient operator convolutions. This module integrates explicit gradient operators to recover high-frequency edges while capturing inter-channel spatial correlations via multi-scale spatial convolutions.
Extensive evaluations and ablation studies demonstrate that DST-Net achieves superior performance in subjective visual quality and objective metrics. Specifically, our method achieves a PSNR of 25.64 dB on the LOL dataset. Subsequent validation on the LSRW dataset further confirms its robust cross-scene generalization.
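For readers unfamiliar with the headline metric, the sketch below computes PSNR under its standard definition, 10·log10(MAX² / MSE). It assumes 8-bit images flattened to lists of pixel values; the function name `psnr` and the `max_value` default are illustrative choices, not taken from the paper.

```python
import math

def psnr(reference, enhanced, max_value=255.0):
    """PSNR in dB between two equally sized images given as flat lists of pixel values."""
    if len(reference) != len(enhanced) or not reference:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under this definition, and assuming a peak value of 255, a PSNR of 25.64 dB corresponds to a root-mean-square error of roughly 13 gray levels.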

[163] Unlearning for One-Step Generative Models via Unbalanced Optimal Transport

Hyundo Choi,Junhyeong An,Jinseong Park,Jaewoong Choi

Main category: cs.CV

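TL;DR: This paper proposes UOT-Unlearn, a plug-and-play class unlearning framework for one-step generative models based on Unbalanced Optimal Transport (UOT), which forgets the target class effectively while preserving generation quality.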

Details Motivation: Existing unlearning methods for diffusion models rely on a multi-step denoising process and cannot be applied directly to one-step generative models (such as flow map models), and machine unlearning for one-step models has not yet been explored. Method: The UOT-Unlearn framework, based on Unbalanced Optimal Transport (UOT), formulates unlearning as a trade-off between a forget cost (suppressing the target class) and an f-divergence penalty (preserving overall generation fidelity through relaxed marginal constraints), so that the probability mass of the forgotten class is smoothly redistributed to the remaining classes. Result: Experiments on CIFAR-10 and ImageNet-256 show that the method significantly outperforms baselines in unlearning success (PUL) and retention quality (u-FID). Conclusion: UOT-Unlearn provides the first efficient, fidelity-preserving, and principled class unlearning solution for one-step generative models, extending research on safe and controllable generative modeling. Abstract: Recent advances in one-step generative frameworks, such as flow map models, have significantly improved the efficiency of image generation by learning direct noise-to-data mappings in a single forward pass. However, machine unlearning for ensuring the safety of these powerful generators remains entirely unexplored. Existing diffusion unlearning methods are inherently incompatible with these one-step models, as they rely on a multi-step iterative denoising process. In this work, we propose UOT-Unlearn, a novel plug-and-play class unlearning framework for one-step generative models based on Unbalanced Optimal Transport (UOT). Our method formulates unlearning as a principled trade-off between a forget cost, which suppresses the target class, and an $f$-divergence penalty, which preserves overall generation fidelity via relaxed marginal constraints. By leveraging UOT, our method enables the probability mass of the forgotten class to be smoothly redistributed to the remaining classes, rather than collapsing into low-quality or noise-like samples. Experimental results on CIFAR-10 and ImageNet-256 demonstrate that our framework achieves superior unlearning success (PUL) and retention quality (u-FID), significantly outperforming baselines.

[164] VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations

Fucai Ke,Zhixi Cai,Boying Li,Long Chen,Beibei Lin,Weiqing Wang,Pari Delir Haghighi,Gholamreza Haffari,Hamid Rezatofighi

Main category: cs.CV

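TL;DR: This paper introduces VIEW2SPACE, a multi-view reasoning benchmark that uses physically grounded simulation to generate high-quality 3D multi-view data, shows that existing models perform poorly on sparse multi-view reasoning, and proposes Grounded Chain-of-Thought with Visual Evidence, which substantially improves performance.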

Details Motivation: Existing research focuses mostly on single-image or dense-video settings, whereas real-world intelligent systems must perform multi-view visual reasoning from sparse, discrete viewpoints; high-quality, scalable multi-view data with precise annotations is lacking. Method: A 3D scene construction engine based on physical simulation generates diverse, high-fidelity multi-view data; the VIEW2SPACE benchmark and an accompanying large-scale training split are introduced; the Grounded Chain-of-Thought with Visual Evidence method is designed to strengthen multi-view reasoning, and difficulty-aware scaling analyses are conducted. Result: Existing vision-language and spatial models score only slightly above random guessing on VIEW2SPACE; the proposed method substantially improves performance under moderate difficulty and generalizes better across datasets than existing methods; geometric perception can improve with scale, but deep compositional reasoning across sparse views remains a fundamental challenge. Conclusion: Sparse multi-view visual reasoning remains an open problem that requires physical modeling, structured reasoning mechanisms, and difficulty-aware training strategies to advance together. Abstract: Multi-view visual reasoning is essential for intelligent systems that must understand complex environments from sparse and discrete viewpoints, yet existing research has largely focused on single-image or temporally dense video settings. In real-world scenarios, reasoning across views requires integrating partial observations without explicit guidance, while collecting large-scale multi-view data with accurate geometric and semantic annotations remains challenging. To address this gap, we leverage physically grounded simulation to construct diverse, high-fidelity 3D scenes with precise per-view metadata, enabling scalable data generation that remains transferable to real-world settings. Based on this engine, we introduce VIEW2SPACE, a multi-dimensional benchmark for sparse multi-view reasoning, together with a scalable, disjoint training split supporting millions of grounded question-answer pairs. Using this benchmark, a comprehensive evaluation of state-of-the-art vision-language and spatial models reveals that multi-view reasoning remains largely unsolved, with most models performing only marginally above random guessing. We further investigate whether training can bridge this gap. Our proposed Grounded Chain-of-Thought with Visual Evidence substantially improves performance under moderate difficulty, and generalizes to real-world data, outperforming existing approaches in cross-dataset evaluation.
We further conduct difficulty-aware scaling analyses across model size, data scale, reasoning depth, and visibility constraints, indicating that while geometric perception can benefit from scaling under sufficient visibility, deep compositional reasoning across sparse views remains a fundamental challenge.

[165] An approximate graph elicits detonation lattice

Vansh Sharma,Venkat Raman

Main category: cs.CV

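TL;DR: This paper proposes a training-free, graph-theory-based algorithm for precisely segmenting and measuring detonation cells (the "detonation lattice") from 3D pressure traces, overcoming the limitations of manual and 2D edge-detection methods. Its prediction error on synthetic data is only 2%, and on 3D simulation data it identifies oblong cells (17% axial deviation). Highly complex cells remain challenging, but the method generalizes across cell geometries and provides a new tool for detonation analysis and triple-point collision studies.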

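Details Motivation: To solve the long-standing problem of automatic, precise segmentation and quantification of detonation cells, overcoming the low accuracy, limited dimensionality, and poor generality of existing manual and primitive 2D edge-detection methods. Method: A training-free, graph-theory-based segmentation algorithm builds a "detonation lattice" model and extracts cellular structure from 3D pressure traces; accuracy is validated on synthetic data, and 3D numerical simulation data is then used to assess statistics and joint probability densities of cell shape, orientation, and volume. Result: Prediction error on synthetic data is 2%; on 3D simulation data the method identifies cells elongated along the wave propagation direction with a 17% deviation in axial orientation, and a larger dispersion in volume (from cubic amplification of linear variability); the algorithm generalizes across cell geometries but still has difficulty segmenting highly complex cells. Conclusion: The graph-based algorithm is a robust, general, and practical new method for detonation cell analysis, providing a foundation and an extensible framework for studying detonation mechanisms, particularly triple-point collisions. Abstract: This study presents a novel algorithm based on graph theory for the precise segmentation and measurement of detonation cells from 3D pressure traces, termed detonation lattices, addressing the limitations of manual and primitive 2D edge detection methods prevalent in the field. Using a segmentation model, the proposed training-free algorithm is designed to accurately extract cellular patterns, a longstanding challenge in detonation research. First, the efficacy of segmentation on generated data is shown with a prediction error of 2%. Next, 3D simulation data is used to establish the performance of the graph-based workflow. The results of statistics and joint probability densities show oblong cells aligned with the wave propagation axis with 17% deviation, whereas larger dispersion in volume reflects cubic amplification of linear variability. Although the framework is robust, it remains challenging to reliably segment and quantify highly complex cellular patterns. However, the graph-based formulation generalizes across diverse cellular geometries, positioning it as a practical tool for detonation analysis and a strong foundation for future extensions in triple-point collision studies.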
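The remark about "cubic amplification of linear variability" can be made concrete with a first-order error propagation. This is an illustration under a simplifying assumption that is not stated in the abstract: the cell volume $V$ scales with the cube of a single characteristic length $L$, with a shape constant $k$.

```latex
V = k L^{3}
\quad\Longrightarrow\quad
\frac{\mathrm{d}V}{V} = 3\,\frac{\mathrm{d}L}{L}
\quad\Longrightarrow\quad
\frac{\sigma_V}{V} \approx 3\,\frac{\sigma_L}{L}
```

Under that assumption, a relative spread of 5% in the linear dimension would translate into a relative spread of about 15% in volume, which is consistent with volume showing the larger dispersion.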

[166] Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty

Mangyu Kong,Jaewon Lee,Seongwon Lee,Euntai Kim

Main category: cs.CV

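TL;DR: This paper proposes a relocalization framework that combines Monte Carlo pose sampling with Fisher Information-based PnP optimization to address the sensitivity of 3D Gaussian Splatting (3DGS) pose refinement to the initial pose and the reconstructed geometry; it needs no retraining or additional supervision and significantly improves localization accuracy and robustness.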

Details Motivation: 3D Gaussian Splatting (3DGS) offers high-quality differentiable rendering, but its pose refinement is highly sensitive to the initial camera pose and the reconstructed geometry, limited mainly by pose prior uncertainty and geometric uncertainty. Method: A relocalization framework fuses Monte Carlo pose sampling with PnP optimization based on the Fisher information matrix, explicitly modeling both pose and geometric uncertainty without retraining or extra annotation. Result: Across a range of indoor and outdoor benchmarks, the method consistently improves localization accuracy and significantly increases optimization stability under pose noise and depth noise. Conclusion: Explicitly modeling and jointly handling pose prior and geometric uncertainty effectively improves the robustness and accuracy of 3DGS-based visual localization, providing a more reliable solution for practical use. Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a powerful scene representation and is increasingly used for visual localization and pose refinement. However, despite its high-quality differentiable rendering, the robustness of 3DGS-based pose refinement remains highly sensitive to both the initial camera pose and the reconstructed geometry. In this work, we take a closer look at these limitations and identify two major sources of uncertainty: (i) pose prior uncertainty, which often arises from regression or retrieval models that output a single deterministic estimate, and (ii) geometric uncertainty, caused by imperfections in the 3DGS reconstruction that propagate errors into PnP solvers. Such uncertainties can distort reprojection geometry and destabilize optimization, even when the rendered appearance still looks plausible. To address these uncertainties, we introduce a relocalization framework that combines Monte Carlo pose sampling with Fisher Information-based PnP optimization. Our method explicitly accounts for both pose and geometric uncertainty and requires no retraining or additional supervision. Across diverse indoor and outdoor benchmarks, our approach consistently improves localization accuracy and significantly increases stability under pose and depth noise.
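The Monte Carlo pose sampling component can be sketched in isolation: instead of trusting a single deterministic pose prior, draw several candidates around it and keep the one with the lowest alignment cost. This is a toy illustration only. It uses a planar pose (x, y, yaw), a Gaussian proposal, and a stand-in cost in place of a real reprojection error, and it omits the Fisher Information-based PnP step entirely. All names and parameters are hypothetical.

```python
import random

def sample_poses(prior, sigma_xy, sigma_yaw, n, rng):
    """Draw n candidate poses (x, y, yaw) from a Gaussian around the prior pose."""
    x, y, yaw = prior
    return [
        (rng.gauss(x, sigma_xy), rng.gauss(y, sigma_xy), rng.gauss(yaw, sigma_yaw))
        for _ in range(n)
    ]

def refine_by_sampling(prior, cost_fn, sigma_xy=0.5, sigma_yaw=0.1, n=200, seed=0):
    """Return the lowest-cost pose among the prior and n sampled candidates."""
    rng = random.Random(seed)
    # Keeping the prior in the pool guarantees the result is never worse than it.
    candidates = [prior] + sample_poses(prior, sigma_xy, sigma_yaw, n, rng)
    return min(candidates, key=cost_fn)

def toy_cost(pose, true_pose=(1.0, 2.0, 0.3)):
    """Toy stand-in for a reprojection error: squared distance to a hidden true pose."""
    return sum((p - t) ** 2 for p, t in zip(pose, true_pose))
```

Calling `refine_by_sampling((1.4, 1.7, 0.35), toy_cost)` returns a candidate whose toy cost is no higher than that of the prior; in a real pipeline the cost would come from the rendered scene and the PnP solver.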

[167] Bridging the Simulation-to-Reality Gap in Electron Microscope Calibration via VAE-EM Estimation

Jilles S. van Hulst,W. P. M. H. Heemels,Duarte J. Antunes

Main category: cs.CV

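TL;DR: This paper proposes a new method that combines variational autoencoders (VAEs) with the expectation maximization (EM) algorithm for automatic parameter calibration of Scanning Transmission Electron Microscopes (STEM), significantly improving calibration speed, consistency, and accuracy while addressing the simulation-to-reality gap.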

Details Motivation: Optical aberrations in electron microscopes degrade image quality, and calibration requires estimating multi-dimensional parameters from diagnostic images that are high-dimensional and noisy and from which a single image cannot determine the optimal parameters; traditional methods extract only scalar features and struggle to model complex image structure and the simulation-to-reality gap. Method: A VAE trained on simulated data learns low-dimensional latent representations of images; an EM algorithm then jointly estimates the model mapping calibration parameters to latent representations and the optimal calibration parameters; the known symmetry of the optical system is used to guarantee global identifiability (a unique optimum) of the joint estimation problem. Result: Validation on a real STEM shows that the method is substantially faster and more consistent than existing methods, reduces estimation error by about a factor of 2, and requires fewer observed images. Conclusion: The VAE-EM framework advances automated STEM calibration and demonstrates the potential of VAEs for compressing image information; the idea extends to other inverse problems with a simulation-to-reality gap and non-injective mappings. Abstract: Electron microscopy has enabled many scientific breakthroughs across multiple fields. A key challenge is the tuning of microscope parameters based on images to overcome optical aberrations that deteriorate image quality. This calibration problem is challenging due to the high-dimensional and noisy nature of the diagnostic images, and the fact that optimal parameters cannot be identified from a single image. We tackle the calibration problem for Scanning Transmission Electron Microscopes (STEM) by employing variational autoencoders (VAEs), trained on simulated data, to learn low-dimensional representations of images, whereas most existing methods extract only scalar values. We then simultaneously estimate the model that maps calibration parameters to encoded representations and the optimal calibration parameters using an expectation maximization (EM) approach. This joint estimation explicitly addresses the simulation-to-reality gap inherent in data-driven methods that train on simulated data from a digital twin. We leverage the known symmetry property of the optical system to establish global identifiability of the joint estimation problem, ensuring that a unique optimum exists. We demonstrate that our approach is substantially faster and more consistent than existing methods on a real STEM, achieving a 2x reduction in estimation error while requiring fewer observations. This represents a notable advance in automated STEM calibration and demonstrates the potential of VAEs for information compression in images.
Beyond microscopy, the VAE-EM framework applies to inverse problems where simulated training data introduces a reality gap and where non-injective mappings would otherwise prevent unique solutions.

[168] CompDiff: Hierarchical Compositional Diffusion for Fair and Zero-Shot Intersectional Medical Image Generation

Mahmoud Ibrahim,Bart Elen,Chang Sun,Gokhan Ertaylan,Michel Dumontier

Main category: cs.CV

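TL;DR: This paper proposes CompDiff, a hierarchical compositional diffusion model that decouples demographic conditioning through a Hierarchical Conditioner Network (HCN), improving image generation quality and fairness for rare and intersectional demographic subgroups and clearly outperforming baselines on chest X-rays and fundus images.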

Details Motivation: Generative models used for medical imaging data augmentation are often assumed to produce equally good images for every demographic subgroup, but imbalanced training data degrades generation quality for rare or intersectional subgroups; this is the "imbalanced generator problem". Method: The CompDiff framework includes a Hierarchical Conditioner Network (HCN) that decomposes demographic conditioning into composable tokens, which are concatenated with CLIP embeddings as cross-attention context, enabling parameter sharing and compositional generalization. Result: On MIMIC-CXR and FairGenMed, CompDiff outperforms baselines in FID (64.3 vs. 75.1), subgroup equity (ES-FID), and zero-shot intersectional generalization (up to 21% FID improvement); downstream classifiers show higher AUROC and lower bias. Conclusion: The architectural design of demographic conditioning is an important and overlooked factor in fair medical image generation; CompDiff mitigates the imbalanced generator problem at the representation level, supporting more robust and fair generation and downstream performance. Abstract: Generative models are increasingly used to augment medical imaging datasets for fairer AI. Yet a key assumption often goes unexamined: that generators themselves produce equally high-quality images across demographic groups. Models trained on imbalanced data can inherit these imbalances, yielding degraded synthesis quality for rare subgroups and struggling with demographic intersections absent from training. We refer to this as the imbalanced generator problem. Existing remedies such as loss reweighting operate at the optimization level and provide limited benefit when training signal is scarce or absent for certain combinations. We propose CompDiff, a hierarchical compositional diffusion framework that addresses this problem at the representation level. A dedicated Hierarchical Conditioner Network (HCN) decomposes demographic conditioning, producing a demographic token concatenated with CLIP embeddings as cross-attention context. This structured factorization encourages parameter sharing across subgroups and supports compositional generalization to rare or unseen demographic intersections. Experiments on chest X-rays (MIMIC-CXR) and fundus images (FairGenMed) show that CompDiff compares favorably against both standard fine-tuning and FairDiffusion across image quality (FID: 64.3 vs. 75.1), subgroup equity (ES-FID), and zero-shot intersectional generalization (up to 21% FID improvement on held-out intersections).
Downstream classifiers trained on CompDiff-generated data also show improved AUROC and reduced demographic bias, suggesting that architectural design of demographic conditioning is an important and underexplored factor in fair medical image generation. Code is available at https://anonymous.4open.science/r/CompDiff-6FE6.

[169] Segmentation-Based Attention Entropy: Detecting and Mitigating Object Hallucinations in Large Vision-Language Models

Jiale Song,Jiaxin Luo,Xue-song Tang,Kuangrong Hao,Mingbo Zhao

Main category: cs.CV

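TL;DR: This paper proposes Segmentation-based Attention Entropy (SAE), which quantifies attention uncertainty in an object-level semantic space within the visual modality in order to detect and mitigate object hallucinations in large vision-language models (LVLMs) without additional training cost.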

Details Motivation: Existing studies focus mostly on hallucinations caused by the text modality (such as overly strong language priors), but the authors observe that abnormal attention patterns inside the visual modality can also cause object hallucinations, so reliability needs to be modeled from the visual-attention side. Method: Segmentation-based Attention Entropy (SAE) uses semantic segmentation to quantify visual attention uncertainty in an object-level semantic space; a reliability score for hallucination detection is built on SAE, along with an SAE-guided method that adjusts attention at inference time. Result: The approach is validated on public benchmarks and in real embodied multimodal scenarios with quadruped robots; it substantially reduces object hallucinations without additional training cost. Conclusion: SAE gives LVLMs an efficient, plug-and-play mechanism for assessing and correcting the reliability of visual attention, increasing the trustworthiness of their perception and decision-making. Abstract: Large Vision-Language Models (LVLMs) achieve strong performance on many multimodal tasks, but object hallucinations severely undermine their reliability. Most existing studies focus on the text modality, attributing hallucinations to overly strong language priors and insufficient visual grounding. In contrast, we observe that abnormal attention patterns within the visual modality can also give rise to hallucinated objects. Building on this observation, we propose Segmentation-based Attention Entropy (SAE), which leverages semantic segmentation to quantify visual attention uncertainty in an object-level semantic space. Based on SAE, we further design a reliability score for hallucination detection and an SAE-guided attention adjustment method that modifies visual attention at inference time to mitigate hallucinations. We evaluate our approach on public benchmarks and in real embodied multimodal scenarios with quadruped robots. Experimental results show that SAE substantially reduces object hallucinations without any additional training cost, thereby enabling more trustworthy LVLM-driven perception and decision-making.
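One plausible reading of "attention uncertainty in an object-level semantic space" is the Shannon entropy of attention mass after pooling patch-level attention by segmentation region. The abstract does not define SAE precisely, so the sketch below is an assumption for illustration; the function name and inputs are hypothetical.

```python
import math
from collections import defaultdict

def segment_attention_entropy(attention, segment_ids):
    """Shannon entropy (in nats) of attention mass aggregated per segment.

    attention: non-negative attention weights, one per image patch.
    segment_ids: segment label for each patch (same length as attention).
    """
    if len(attention) != len(segment_ids):
        raise ValueError("attention and segment_ids must have the same length")
    # Pool patch-level attention into one mass per segmented object or region.
    mass = defaultdict(float)
    for a, seg in zip(attention, segment_ids):
        mass[seg] += a
    total = sum(mass.values())
    if total <= 0:
        raise ValueError("attention must have positive total mass")
    probs = [m / total for m in mass.values() if m > 0]
    return -sum(p * math.log(p) for p in probs)
```

Attention concentrated on a single object gives an entropy of zero, while attention spread evenly over k objects gives ln k, so under this reading a high value flags diffuse, potentially unreliable visual grounding.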

[170] Understanding Cell Fate Decisions with Temporal Attention

Florian Bürger,Martim Dias Gomes,Adrián E. Granada,Noémie Moreau,Katarzyna Bozek

Main category: cs.CV

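TL;DR: This paper proposes a Transformer-based deep learning method that predicts cancer cell fate (such as apoptosis or mitosis) directly from long-term live-cell videos without hand-crafted features, combined with an explainability framework that reveals non-genetic determinants.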

Details Motivation: Understanding how non-genetic factors determine cell fate is critical for cancer therapy, because genetically identical cells can have different outcomes under the same treatment. Method: A Transformer model processes raw live-cell video sequences directly to predict cell fate, and an explainability framework (including attention analysis and masking experiments) is designed to interpret temporal and morphological predictive cues. Result: The model reaches a balanced accuracy of 0.94 and an F1-score of 0.93; the predictive signal is not confined to the final frames, and reliable prediction is possible up to 10 hours in advance; the analysis reveals differences in the temporal distribution of predictive information between mitotic and apoptotic sequences, as well as the role of cell morphology and p53 signaling. Conclusion: Attention-based temporal modeling predicts cell fate with high accuracy and also provides biologically interpretable insight into non-genetic mechanisms of cellular decision-making. Abstract: Understanding non-genetic determinants of cell fate is critical for developing and improving cancer therapies, as genetically identical cells can exhibit divergent outcomes under the same treatment conditions. In this work, we present a deep learning approach for cell fate prediction from raw long-term live-cell recordings of cancer cell populations under chemotherapeutic treatment. Our Transformer model is trained to predict cell fate directly from raw image sequences, without relying on predefined morphological or molecular features. Beyond classification, we introduce a comprehensive explainability framework for interpreting the temporal and morphological cues guiding the model's predictions. We demonstrate that prediction of cell outcomes is possible based on the video only: our model achieves a balanced accuracy of 0.94 and an F1-score of 0.93. Attention and masking experiments further indicate that the signal predictive of the cell fate is not uniquely located in the final frames of a cell trajectory, as reliable predictions are possible up to 10 h before the event. Our analysis reveals a distinct temporal distribution of predictive information in the mitotic and apoptotic sequences, as well as the role of cell morphology and p53 signaling in determining cell outcomes. Together, these findings demonstrate that attention-based temporal models enable accurate cell fate prediction while providing biologically interpretable insights into non-genetic determinants of cellular decision-making. The code is available at https://github.com/bozeklab/Cell-Fate-Prediction.
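The two reported metrics can be computed from a binary confusion matrix as sketched below. This uses the standard textbook definitions; the abstract does not say which class is treated as positive or whether the F1-score is averaged across classes, so the single-class F1 here is an assumption.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of recall on the positive class and recall on the negative class."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    return 2 * tp / (2 * tp + fp + fn)
```

Balanced accuracy is the more informative of the two when one fate is much rarer than the other, because each class contributes equally regardless of its frequency.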

[171] VideoMatGen: PBR Materials through Joint Generative Modeling

Jon Hasselgren,Zheng Zeng,Milos Hasan,Jacob Munkberg

Main category: cs.CV

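TL;DR: This paper proposes a physically-based material generation method built on a video diffusion transformer architecture that jointly generates multiple material properties (base color, roughness, metallicity, height map) conditioned on 3D shape geometry and a text description, and introduces a custom variational auto-encoder for a compact latent representation and efficient joint generation of the material modalities.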

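Details Motivation: Existing methods struggle to jointly generate multiple material properties from text and geometry conditions while staying physically plausible, and multi-modal generation usually brings high computational cost and a sharp increase in token count. Method: A video diffusion transformer serves as the backbone, conditioned on input geometry and text; a custom variational auto-encoder (VAE) encodes the material maps (base color, roughness, metallicity, height) into a shared compact latent space for joint multi-modal modeling and generation. Result: The method produces high-quality, physically plausible multi-property materials, is compatible with common content creation tools, and substantially reduces token load while improving generation consistency and efficiency. Conclusion: The method effectively addresses text- and geometry-driven multi-modal material generation, balancing fidelity, physical plausibility, and practicality. Abstract: We present a method for generating physically-based materials for 3D shapes based on a video diffusion transformer architecture. Our method is conditioned on input geometry and a text description, and jointly models multiple material properties (base color, roughness, metallicity, height map) to form physically plausible materials. We further introduce a custom variational auto-encoder which encodes multiple material modalities into a compact latent space, which enables joint generation of multiple modalities without increasing the number of tokens. Our pipeline generates high-quality materials for 3D shapes given a text prompt, compatible with common content creation tools.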

[172] Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration

Amirhossein Kazerouni,Maitreya Suin,Tristan Aumentado-Armstrong,Sina Honari,Amanpreet Walia,Iqbal Mohomed,Konstantinos G. Derpanis,Babak Taati,Alex Levinshtein

Main category: cs.CV

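TL;DR: This paper proposes Face2Scene, a framework that uses a high-quality restored face as a perceptual guide, extracting facial degradation features to steer joint restoration of the whole image (including body and background) and substantially improving full-scene image restoration.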

Details Motivation: Existing reference-based face restoration methods attend only to the facial region and ignore degradation of the body and background, while full-scene restoration methods lack effective modeling of degradation cues, leading to underdetermined results and artifacts. Method: Face2Scene is a two-stage framework: the first stage reconstructs a high-quality face with a Ref-FR model; the second stage extracts a facial degradation code from the restored-degraded face pair and converts it into multi-scale degradation-aware tokens that condition a diffusion model to restore the whole image. Result: Experiments on multiple datasets show that Face2Scene outperforms current state-of-the-art methods in both visual quality and quantitative metrics. Conclusion: Using the face as a source of degradation information effectively improves the accuracy and robustness of full-scene image restoration and offers a new paradigm for cross-region collaborative restoration. Abstract: Recent advances in image restoration have enabled high-fidelity recovery of faces from degraded inputs using reference-based face restoration models (Ref-FR). However, such methods focus solely on facial regions, neglecting degradation across the full scene, including body and background, which limits practical usability. Meanwhile, full-scene restorers often ignore degradation cues entirely, leading to underdetermined predictions and visual artifacts. In this work, we propose Face2Scene, a two-stage restoration framework that leverages the face as a perceptual oracle to estimate degradation and guide the restoration of the entire image. Given a degraded image and one or more identity references, we first apply a Ref-FR model to reconstruct high-quality facial details. From the restored-degraded face pair, we extract a face-derived degradation code that captures degradation attributes (e.g., noise, blur, compression), which is then transformed into multi-scale degradation-aware tokens. These tokens condition a diffusion model to restore the full scene in a single step, including the body and background. Extensive experiments demonstrate the superior effectiveness of the proposed method compared to state-of-the-art methods.

[173] REFORGE: Multi-modal Attacks Reveal Vulnerable Concept Unlearning in Image Generation Models

Yong Zou,Haoran Li,Fanxiao Li,Shenyang Wei,Yunyun Dong,Li Tang,Wei Zhou,Renyang Liu

Main category: cs.CV

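TL;DR: This paper proposes REFORGE, a framework for evaluating the robustness of image generation model unlearning (IGMU) methods against image-side adversarial attacks in black-box settings; its perturbation optimization strategy, based on stroke-based initialization and cross-attention-guided masking, significantly raises attack success rate and semantic alignment, revealing inherent vulnerabilities of current IGMU methods under multi-modal adversarial attacks.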

Details Motivation: Existing image generation model unlearning (IGMU) methods can remove harmful concepts, but their robustness to black-box, image-side adversarial inputs has not been studied sufficiently, which poses a risk for real deployment. Method: REFORGE is a black-box red-teaming framework that starts from stroke-based images and uses a region-aware, cross-attention-guided masking strategy to apply perturbations to concept-relevant regions, balancing attack efficacy and visual fidelity. Result: Across several representative unlearning tasks and defenses, REFORGE significantly improves attack success rate with stronger semantic alignment and higher optimization efficiency, exposing systematic vulnerabilities of mainstream IGMU methods to adversarial image prompts. Conclusion: Current IGMU methods are not robust enough against image-side black-box adversarial attacks; unlearning mechanisms that are aware of multi-modal adversarial attacks and balance robustness with unlearning effectiveness are needed. Abstract: Recent progress in image generation models (IGMs) enables high-fidelity content creation but also amplifies risks, including the reproduction of copyrighted content and the generation of offensive content. Image Generation Model Unlearning (IGMU) mitigates these risks by removing harmful concepts without full retraining. Despite growing attention, the robustness under adversarial inputs, particularly image-side threats in black-box settings, remains underexplored. To bridge this gap, we present REFORGE, a black-box red-teaming framework that evaluates IGMU robustness via adversarial image prompts. REFORGE initializes stroke-based images and optimizes perturbations with a cross-attention-guided masking strategy that allocates noise to concept-relevant regions, balancing attack efficacy and visual fidelity. Extensive experiments across representative unlearning tasks and defenses demonstrate that REFORGE significantly improves attack success rate while achieving stronger semantic alignment and higher efficiency than the compared baselines. These results expose persistent vulnerabilities in current IGMU methods and highlight the need for robustness-aware unlearning against multi-modal adversarial attacks. Our code is at: https://github.com/Imfatnoily/REFORGE.

[174] On the Transfer of Collinearity to Computer Vision

Frederik Beuth,Danny Kowerko

Main category: cs.CV

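TL;DR: This paper transfers the collinearity principle of human vision to computer vision and tests it in four use cases (wafer fault detection, defect recognition in nanotechnology materials, occlusion handling, and ImageNet classification), finding that collinearity clearly improves performance on industrial images (which contain man-made straight-line structures) but has limited effect on natural images such as ImageNet.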

Details Motivation: Collinearity is a perceptual phenomenon in human vision that amplifies edges arranged along a straight line, but its real-world purpose and its use in computer vision have not been explored sufficiently. Method: A prototype model is built, tested systematically, and benchmarked in four use cases: sketching the combination of collinearity with deep learning (cases I and II), combining it with saliency models (case II), using it as a feature detector (case I), and testing generality on ImageNet. Result: Collinearity reduces the error rate in wafer fault detection from 6.5% to 5.26% (a factor of 1.24) and in defect recognition for nanotechnology materials from 21.65% to 6.64% (a factor of 3.2); it is effective under occlusion, but shows no clear benefit on ImageNet. Conclusion: Collinearity suits industrial image analysis tasks that involve man-made straight-line structures and can serve as a new tool for computer vision that reflects the strengths of human visual processing. Abstract: Collinearity is a visual perception phenomenon in the human brain that amplifies spatially aligned edges arranged along a straight line. However, it is unclear what purpose this principle serves for humans in the real world, and its utilization in computer vision and engineering applications is also largely unexplored. In this work, our goal is to transfer the collinearity principle to computer vision, and we explore the potential usages of this novel principle for computer vision applications. We developed a prototype model to exemplify the principle, then tested it systematically, and benchmarked it in the context of four use cases. Our cases are selected to span a broad range of potential applications and scenarios: sketching the combination of collinearity with deep learning (case I and II), using collinearity with saliency models (case II), and as a feature detector (case I). In the first use case, we found that collinearity is able to improve the fault detection of wafers and obtain a performance increase by a factor of 1.24 (decrease of the error rate from 6.5% to 5.26%). In the second use case, we test the defect recognition in nanotechnology materials and achieve a performance increase by 3.2x via collinearity (deep learning, error from 21.65% to 6.64%), and also explore saliency models. As a third experiment, we cover occlusions, while as a fourth experiment, we test ImageNet and observe that collinearity might not be very beneficial for ImageNet.
Therefore, we can assemble a list of scenarios for which collinearity is beneficial (wafers, nanotechnology, occlusions), and those for which it is not (ImageNet). Hence, we infer that collinearity might be suitable for industry applications, as it helps when the image structures of interest are man-made, because these often consist of lines. Our work provides another tool for CV, which we hope captures some of the power of human visual processing.
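The improvement factors quoted in the abstract can be checked by dividing the error rate before by the error rate after. The sketch below does only that arithmetic; the function name is illustrative.

```python
def error_reduction_factor(error_before, error_after):
    """Ratio of error rates; values above 1 mean the error went down."""
    return error_before / error_after

wafer = error_reduction_factor(6.5, 5.26)     # wafer fault detection
nano = error_reduction_factor(21.65, 6.64)    # nanotechnology defect recognition
print(f"wafer: {wafer:.2f}x, nano: {nano:.2f}x")
```

The wafer figure reproduces the stated factor of 1.24. The nanotechnology figure comes out at about 3.26, which the abstract reports as 3.2x.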

[175] FSMC-Pose: Frequency and Spatial Fusion with Multiscale Self-calibration for Cattle Mounting Pose Estimation

Fangjing Li,Zhihai Wang,Xinxin Ding,Haiyang Liu,Ronghua Gao,Rong Wang,Yao Zhu,Ming Jin

Main category: cs.CV

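TL;DR: This paper proposes FSMC-Pose, a framework for accurately estimating the mounting pose of dairy cattle in estrus in complex environments, combining a lightweight frequency-spatial fusion backbone (CattleMountNet) with a multiscale self-calibration head (SC2Head), and constructs the MOUNT-Cattle dataset.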

Details Motivation: Mounting posture is an important visual indicator of estrus in dairy cattle, but pose estimation in real scenes remains challenging because of cluttered backgrounds and inter-animal occlusion. Method: FSMC-Pose is proposed: 1) the CattleMountNet backbone, with a Spatial Frequency Enhancement Block (SFEBlock) and a Receptive Aggregation Block (RABlock); 2) a Spatial-Channel Self-Calibration Head (SC2Head) that mitigates structural misalignment caused by occlusion; 3) the COCO-format MOUNT-Cattle dataset (1176 instances). Result: On the combined MOUNT-Cattle and NWAFU-Cattle dataset, FSMC-Pose is more accurate than strong baselines with lower computational and parameter costs, and runs in real time on commodity GPUs. Conclusion: FSMC-Pose effectively improves the accuracy and robustness of cattle mounting pose estimation in complex, cluttered environments while remaining efficient and practical. Abstract: Mounting posture is an important visual indicator of estrus in dairy cattle. However, achieving reliable mounting pose estimation in real-world environments remains challenging due to cluttered backgrounds and frequent inter-animal occlusion. We present FSMC-Pose, a top-down framework that integrates a lightweight frequency-spatial fusion backbone, CattleMountNet, and a multiscale self-calibration head, SC2Head. Specifically, we design two algorithmic components for CattleMountNet: the Spatial Frequency Enhancement Block (SFEBlock) and the Receptive Aggregation Block (RABlock). SFEBlock separates cattle from cluttered backgrounds, while RABlock captures multiscale contextual information. The Spatial-Channel Self-Calibration Head (SC2Head) attends to spatial and channel dependencies and introduces a self-calibration branch to mitigate structural misalignment under inter-animal overlap. We construct a mounting dataset, MOUNT-Cattle, covering 1176 mounting instances, which follows the COCO format and supports drop-in training across pose estimation models. Using a comprehensive dataset that combines MOUNT-Cattle with the public NWAFU-Cattle dataset, FSMC-Pose achieves higher accuracy than strong baselines, with markedly lower computational and parameter costs, while maintaining real-time inference on commodity GPUs. Extensive experiments and qualitative analyses show that FSMC-Pose effectively captures and estimates cattle mounting pose in complex and cluttered environments. Dataset and code are available at https://github.com/elianafang/FSMC-Pose.

[176] Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models

Weijie Qiu,Dai Guan,Junxin Wang,Zhihang Li,Yongbo Gai,Mengyu Zhou,Erchao Zhao,Xiaoxi Jiang,Guanjun Jiang

Main category: cs.CV

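TL;DR: This paper proposes Proxy-GRM, which uses lightweight proxy models to verify the quality of generated evaluation rubrics and uses that verification as a differentiable reward signal in reinforcement learning, significantly improving the performance and generalization of generative reward models for vision-language models.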

Details Motivation: In existing generative reward models, the evaluation rubric is often neglected or verified through non-differentiable, inefficient LLM-as-judge checks, leaving no optimization signal for the rubric itself. Method: Two lightweight proxy models (Proxy-SFT and Proxy-RL) take a candidate rubric, the original query, and the preference pair as input and predict the preference ordering from the rubric alone; their prediction accuracy serves as a differentiable reward for rubric quality when training the main model with RL. Result: State-of-the-art results on VL-Reward Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench, using only about 50k samples and outperforming methods trained on four times the data; rubrics transfer to unseen evaluators and improve reward accuracy at test time. Conclusion: Proxy-GRM shows that explicitly optimizing rubric quality is effective and offers an efficient, differentiable, and transferable paradigm for generative reward modeling. Abstract: Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into Reinforcement Learning (RL) to explicitly enhance rubric quality. Concretely, we train lightweight proxy agents (Proxy-SFT and Proxy-RL) that take a candidate rubric together with the original query and preference pair, and then predict the preference ordering using only the rubric as evidence. The proxy's prediction accuracy serves as a rubric-quality reward, incentivizing the model to produce rubrics that are internally consistent and transferable. With ~50k data samples, Proxy-GRM reaches state-of-the-art results on the VL-Reward Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench, outperforming the methods trained on four times the data. Ablations show Proxy-SFT is a stronger verifier than Proxy-RL, and implicit reward aggregation performs best. Crucially, the learned rubrics transfer to unseen evaluators, improving reward accuracy at test time without additional training. Our code is available at https://github.com/Qwen-Applications/Proxy-GRM.
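The rubric-quality reward described in the abstract, the proxy's accuracy at recovering preference orderings from the rubric alone, can be sketched as follows. The proxy interface, the data layout, and the keyword-matching toy proxy are all hypothetical stand-ins. The paper's proxies are trained models, not keyword matchers.

```python
def rubric_reward(rubric, examples, proxy):
    """Fraction of preference pairs the proxy orders correctly using only the rubric.

    examples: list of (query, response_a, response_b, preferred), preferred in {"a", "b"}.
    proxy: callable(rubric, query, response_a, response_b) -> "a" or "b" (hypothetical).
    """
    if not examples:
        raise ValueError("need at least one preference pair")
    correct = sum(
        1 for query, a, b, preferred in examples
        if proxy(rubric, query, a, b) == preferred
    )
    return correct / len(examples)

def keyword_proxy(rubric, query, a, b):
    """Toy proxy: prefer the response that mentions more of the rubric's keywords."""
    def score(text):
        return sum(kw in text.lower() for kw in rubric)
    return "a" if score(a) >= score(b) else "b"
```

A rubric that lets the proxy recover more of the human preferences earns a higher reward, which is the incentive the abstract describes for producing consistent and transferable rubrics.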

[177] ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery

Weiqin Jiao,Hao Cheng,George Vosselman,Claudio Persello

Main category: cs.CV

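TL;DR: This paper introduces the All-Class Polygonal Vectorization (ACPV) task, which aims to generate a complete vector map of multiple land-cover classes from aerial imagery in a single run (no overlaps, no gaps, shared edges), and releases the first public benchmark, Deventer-512; the proposed ACPV-Net framework adds a Semantically Supervised Conditioning (SSC) mechanism and a topological reconstruction module, markedly improving multi-class polygon quality while guaranteeing global topological consistency.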

Details Motivation: Existing vectorization methods are mostly class-specific, and extending them to multiple classes tends to produce topological inconsistencies (duplicated edges, gaps, overlaps); a unified, end-to-end way to generate a shared-edge vector map of all classes is missing. Method: Defines the ACPV task and the Deventer-512 benchmark, and builds ACPV-Net, which comprises an SSC mechanism that couples semantic perception with geometric primitive generation and a topological reconstruction module that enforces shared-boundary consistency by design. Result: On Deventer-512, ACPV-Net surpasses all class-specific baselines in polygon quality across classes under metrics covering semantic fidelity, geometric accuracy, vertex efficiency, per-class topological fidelity, and global topological consistency; it also achieves the best-reported results on the single-class WHU-Building dataset. Conclusion: ACPV-Net delivers end-to-end, topologically consistent vectorization of multiple land-cover classes, supporting the effectiveness and generality of a unified framework for ACPV and offering a new route to automated map production. Abstract: We tackle the problem of generating a complete vector map representation from aerial imagery in a single run: producing polygons for all land-cover classes with shared boundaries and without gaps or overlaps. Existing polygonization methods are typically class-specific; extending them to multiple classes via per-class runs commonly leads to topological inconsistencies, such as duplicated edges, gaps, and overlaps. We formalize this new task as All-Class Polygonal Vectorization (ACPV) and release the first public benchmark, Deventer-512, with standardized metrics jointly evaluating semantic fidelity, geometric accuracy, vertex efficiency, per-class topological fidelity and global topological consistency. To realize ACPV, we propose ACPV-Net, a unified framework introducing a novel Semantically Supervised Conditioning (SSC) mechanism coupling semantic perception with geometric primitive generation, along with a topological reconstruction that enforces shared-edge consistency by design. While enforcing such strict topological constraints, ACPV-Net surpasses all class-specific baselines in polygon quality across classes on Deventer-512. It also applies to single-class polygonal vectorization without any architectural modification, achieving the best-reported results on WHU-Building. Data, code, and models will be released at: https://github.com/HeinzJiao/ACPV-Net.

[178] TCATSeg: A Tooth Center-Wise Attention Network for 3D Dental Model Semantic Segmentation

Qiang He,Wentian Qu,Jiajia Dai,Changsong Lei,Shaofeng Wang,Feifei Zuo,Yajie Wang,Yaqian Liang,Xiaoming Deng,Cuixia Ma,Yong-Jin Liu,Hongan Wang

Main category: cs.CV

TL;DR: TCATSeg is a novel framework for 3D dental model segmentation that integrates local geometric features with global semantic context using sparse superpoints, and is validated on a new dataset of 400 dental models.

Details Motivation: Existing methods for 3D dental model segmentation struggle with accuracy due to complex tooth arrangements and shape similarities among adjacent teeth, largely because they focus on local geometry while neglecting global contextual information. Method: TCATSeg combines local geometric features with global semantic context using a set of sparse yet physically meaningful superpoints to capture global semantic relationships; a new dataset of 400 dental models, including pre-orthodontic samples, is introduced for evaluation. Result: TCATSeg outperforms state-of-the-art approaches in extensive experiments. Conclusion: Integrating global semantic context with local geometric features significantly improves segmentation accuracy for complex 3D dental models. Abstract: Accurate semantic segmentation of 3D dental models is essential for digital dentistry applications such as orthodontics and dental implants. However, due to complex tooth arrangements and similarities in shape among adjacent teeth, existing methods struggle with accurate segmentation, because they often focus on local geometry while neglecting global contextual information. To address this, we propose TCATSeg, a novel framework that combines local geometric features with global semantic context. We introduce a set of sparse yet physically meaningful superpoints to capture global semantic relationships and enhance segmentation accuracy. Additionally, we present a new dataset of 400 dental models, including pre-orthodontic samples, to evaluate the generalization of our method. Extensive experiments demonstrate that TCATSeg outperforms state-of-the-art approaches.

[179] MLLM-based Textual Explanations for Face Comparison

Redwan Sony,Anil K Jain,Ross Arun

Main category: cs.CV

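TL;DR: This paper systematically analyzes the reliability of explanations that multimodal large language models (MLLMs) generate for unconstrained face verification. The explanations often rely on non-verifiable or hallucinated attributes that the visual evidence does not support. Adding scores and decisions from traditional recognition systems improves verification accuracy but does not consistently improve explanation faithfulness. The authors propose a likelihood-ratio-based framework for assessing how trustworthy an explanation is.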

Details Motivation: MLLMs can provide natural-language explanations for face recognition decisions and thereby aid interpretability, but the reliability of those explanations in unconstrained real-world settings (extreme pose, surveillance imagery) has not been studied in depth. Method: Systematically evaluates MLLM explanation quality for face verification on the IJB-S dataset; adds scores and decisions from traditional face recognition systems as extra input; proposes a likelihood-ratio-based evaluation framework that goes beyond decision accuracy and quantifies the evidential strength of textual explanations. Result: Even when MLLMs make correct verification decisions, their explanations frequently contain non-verifiable or hallucinated facial attributes; fusing information from traditional systems improves verification performance but does not reliably improve explanation faithfulness; the likelihood-ratio framework exposes fundamental reliability problems in current MLLM explanations. Conclusion: Current MLLMs have serious trustworthiness shortcomings for explainable face verification, and biometric applications need a more rigorous, principled way to evaluate explanation reliability. Abstract: Multimodal Large Language Models (MLLMs) have recently been proposed as a means to generate natural-language explanations for face recognition decisions. While such explanations facilitate human interpretability, their reliability on unconstrained face images remains underexplored. In this work, we systematically analyze MLLM-generated explanations for the unconstrained face verification task on the challenging IJB-S dataset, with a particular focus on extreme pose variation and surveillance imagery. Our results show that even when MLLMs produce correct verification decisions, the accompanying explanations frequently rely on non-verifiable or hallucinated facial attributes that are not supported by visual evidence. We further study the effect of incorporating information from traditional face recognition systems, viz., scores and decisions, alongside the input images. Although such information improves categorical verification performance, it does not consistently lead to faithful explanations. To evaluate the explanations beyond decision accuracy, we introduce a likelihood-ratio-based framework that measures the evidential strength of textual explanations. Our findings highlight fundamental limitations of current MLLMs for explainable face recognition and underscore the need for a principled evaluation of reliable and trustworthy explanations in biometric applications. Code is available at https://github.com/redwankarimsony/LR-MLLMFR-Explainability.

[180] FlowComposer: Composable Flows for Compositional Zero-Shot Learning

Zhenqi He,Lin Li,Long Chen

Main category: cs.CV

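TL;DR: This paper proposes FlowComposer, a flow-matching framework for compositional zero-shot learning (CZSL). It explicitly models flows that carry attribute and object features toward the text embedding space, fuses them into a composition flow, and reuses leaked features for augmentation, which improves generalization.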

Details Motivation: Existing CZSL methods built on vision-language models and parameter-efficient fine-tuning have two fundamental limitations, implicit composition construction and residual feature entanglement, which restrict generalization. Method: The FlowComposer framework (1) learns two primitive flows that transport visual features toward attribute and object text embeddings, respectively; (2) uses a learnable Composer that explicitly fuses their velocity fields into a composition flow; and (3) adds a leakage-guided augmentation scheme that reuses leaked features as auxiliary signals. Result: Integrated as a plug-and-play component into various baselines, it yields significant improvements on three public CZSL benchmarks. Conclusion: FlowComposer is the first systematic application of flow matching to CZSL; explicit flow fusion and the use of leakage mitigate implicit construction and feature entanglement and improve generalization. Abstract: Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by recombining primitives learned from seen pairs. Recent CZSL methods built on vision-language models (VLMs) typically adopt parameter-efficient fine-tuning (PEFT). They apply visual disentanglers for decomposition and manipulate token-level prompts or prefixes to encode compositions. However, such PEFT-based designs suffer from two fundamental limitations: (1) Implicit Composition Construction, where composition is realized only via token concatenation or branch-wise prompt tuning rather than an explicit operation in the embedding space; (2) Residual Feature Entanglement, where imperfect disentanglement leaves attribute, object, and composition features mutually contaminated. Together, these issues limit the generalization ability of current CZSL models. In this paper, we are the first to systematically study flow matching for CZSL and introduce FlowComposer, a model-agnostic framework that learns two primitive flows to transport visual features toward attribute and object text embeddings, and a learnable Composer that explicitly fuses their velocity fields into a composition flow. To exploit the inevitable residual entanglement, we further devise a leakage-guided augmentation scheme that reuses leaked features as auxiliary signals. We thoroughly evaluate FlowComposer on three public CZSL benchmarks by integrating it as a plug-and-play component into various baselines, consistently achieving significant improvements.

[181] BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection

Melissa Schween,Mathis Kruse,Bodo Rosenhahn

Main category: cs.CV

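TL;DR: This paper proposes BUSSARD, a normalizing-flow-based method for detecting anomalous relationships in scene graphs. It embeds semantic information with a language model and estimates likelihoods through bijective transformations. On the SARD dataset it improves AUROC by about 10%, runs five times faster, and is more robust to variations such as synonyms.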

Details Motivation: Existing methods for detecting anomalous relationships in scene graphs fall short in accuracy, speed, and robustness (for example, generalization to synonyms), which calls for a model that combines semantic knowledge with efficient, interpretable likelihood modeling. Method: A multimodal approach embeds the object and relationship tokens of a scene graph with a language model; a normalizing flow learns bijective transformations that map object-relation-object triplets to a Gaussian base distribution, and anomalies are detected through likelihood estimation. Result: On the SARD dataset (office and dining room scenes), AUROC improves by about 10% over the current state-of-the-art model while inference is five times faster; ablations show stable performance under synonym substitution, whereas the baseline deviates by 17.5%. Conclusion: BUSSARD supports the effectiveness and practicality of learning-based methods for relationship anomaly detection in scene graphs, balancing accuracy, efficiency, and generalization. Abstract: We propose Bijective Universal Scene-Specific Anomalous Relationship Detection (BUSSARD), a normalizing flow-based model for detecting anomalous relations in scene graphs generated from images. Our work follows a multimodal approach, embedding object and relationship tokens from scene graphs with a language model to leverage semantic knowledge from the real world. A normalizing flow model is used to learn bijective transformations that map object-relation-object triplets from scene graphs to a simple base distribution (typically Gaussian), allowing anomaly detection through likelihood estimation. We evaluate our approach on the SARD dataset containing office and dining room scenes. Our method achieves around 10% better AUROC results compared to the current state-of-the-art model, while simultaneously being five times faster. Through ablation studies, we demonstrate superior robustness and universality, particularly regarding the use of synonyms, with our model maintaining stable performance while the baseline shows 17.5% deviation. This work demonstrates the strong potential of learning-based methods for relationship anomaly detection in scene graphs. Our code is available at https://github.com/mschween/BUSSARD .
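
To make the likelihood-based scoring concrete, here is a minimal sketch of the change-of-variables rule that any normalizing flow relies on, using a single per-dimension affine map in place of a learned flow. It is not the BUSSARD architecture: the 2-D vectors stand in for language-model embeddings of triplets, and `fit_affine_flow`, `log_likelihood`, and `anomaly_score` are hypothetical names chosen for illustration.

```python
import math

def fit_affine_flow(samples):
    """Fit a per-dimension affine bijection z = (x - mean) / std on normal data.

    Assumption: a single affine layer stands in for a learned multi-layer flow.
    """
    n, dim = len(samples), len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dim)]
    stds = []
    for d in range(dim):
        var = sum((s[d] - means[d]) ** 2 for s in samples) / n
        stds.append(math.sqrt(var) + 1e-6)  # small constant avoids division by zero
    return means, stds

def log_likelihood(x, means, stds):
    """Change of variables: log p(x) = log N(z; 0, I) + log|det dz/dx|."""
    z = [(xi - m) / s for xi, m, s in zip(x, means, stds)]
    log_base = sum(-0.5 * zi * zi - 0.5 * math.log(2 * math.pi) for zi in z)
    log_det = -sum(math.log(s) for s in stds)  # dz/dx = 1/std in each dimension
    return log_base + log_det

def anomaly_score(x, means, stds):
    """Higher score means lower likelihood, i.e. more anomalous."""
    return -log_likelihood(x, means, stds)

# Toy "embeddings" of normal triplets (illustrative values only).
normal_triplets = [[0.9, 1.1], [1.0, 1.0], [1.1, 0.9], [1.0, 1.2], [0.8, 1.0]]
means, stds = fit_affine_flow(normal_triplets)
typical = anomaly_score([1.0, 1.0], means, stds)
unusual = anomaly_score([3.0, -1.0], means, stds)
print(typical < unusual)
```

A triplet far from the training data maps to a low-density region of the Gaussian base distribution and therefore gets a higher score; a real flow stacks many invertible layers and sums their log-determinants in the same way.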

[182] Mixture of Style Experts for Diverse Image Stylization

Shihao Zhu,Ziheng Ouyang,Yijia Kang,Qilong Wang,Mi Zhou,Bo Li,Ming-Ming Cheng,Qibin Hou

Main category: cs.CV

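TL;DR: StyleExpert is a semantic-aware diffusion stylization framework based on a Mixture of Experts (MoE). A unified style encoder and a similarity-aware gating mechanism let it better preserve multi-level semantics and material details and generalize to unseen styles.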

Details Motivation: Existing diffusion-based stylization methods are limited to color-driven transformations and neglect complex semantics and material details. Method: StyleExpert is built on a Mixture of Experts (MoE) and combines a unified style encoder, trained on a large-scale dataset of content-style-stylized triplets, with a similarity-aware gating mechanism that dynamically routes styles to specialized experts. Result: It outperforms existing methods at preserving semantics and material details and generalizes to unseen styles. Conclusion: StyleExpert improves how diffusion-based stylization models semantics and materials and supports the effectiveness and scalability of the MoE architecture for stylization. Abstract: Diffusion-based stylization has advanced significantly, yet existing methods are limited to color-driven transformations, neglecting complex semantics and material details. We introduce StyleExpert, a semantic-aware framework based on the Mixture of Experts (MoE). Our framework employs a unified style encoder, trained on our large-scale dataset of content-style-stylized triplets, to embed diverse styles into a consistent latent space. This embedding is then used to condition a similarity-aware gating mechanism, which dynamically routes styles to specialized experts within the MoE architecture. Leveraging this MoE architecture, our method adeptly handles diverse styles spanning multiple semantic levels, from shallow textures to deep semantics. Extensive experiments show that StyleExpert outperforms existing approaches in preserving semantics and material details, while generalizing to unseen styles. Our code and collected images are available at the project page: https://hh-lg.github.io/StyleExpert-Page/.

[183] Efficient Brood Cell Detection in Layer Trap Nests for Bees and Wasps: Balancing Labeling Effort and Species Coverage

Chenchang Liu,Felix Fornoff,Annika Grasreiner,Patrick Maeder,Henri Greil,Marco Seeland

Main category: cs.CV

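TL;DR: This paper proposes a deep-learning method for detecting and classifying the brood cells of bees and wasps in layer trap nests (LTNs), and introduces a Constrained False Positive Loss (CFPL) strategy that reduces labeling effort and mitigates class imbalance.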

Details Motivation: Manually assessing brood-cell counts and species richness in layer trap nests (LTNs) is slow and laborious, and the task is complicated by densely packed cells, high labeling cost, and a severely imbalanced class distribution, so an efficient automated analysis is needed. Method: A deep-learning framework combines object detection with fine-grained classification and adds the Constrained False Positive Loss (CFPL) strategy, which dynamically masks predictions from unlabeled data during training so that they do not interfere with the classification loss; training uses at most 300 labels per class, and the method is evaluated on a dataset of 712 LTN images with 28 fine-grained classes. Result: The approach detects brood cells in LTNs effectively; CFPL lowers labeling cost while improving model performance and mitigating class imbalance, giving a better balance between accuracy and labeling effort. Conclusion: CFPL is an effective training strategy for settings with few samples, strong class imbalance, and partial labels, and it offers a scalable, low-annotation approach for ecological image analysis tasks such as wild bee monitoring. Abstract: Monitoring cavity-nesting wild bees and wasps is vital for biodiversity research and conservation. Layer trap nests (LTNs) are emerging as a valuable tool to study the abundance and species richness of these insects, offering insights into their nesting activities and ecological needs. However, manually evaluating LTNs to detect and classify brood cells is labor-intensive and time-consuming. To address this, we propose a deep learning based approach for efficient brood cell detection and classification in LTNs. LTNs present additional challenges due to densely packed brood cells, leading to a high labeling effort per image. Moreover, we observe a significant imbalance in class distribution, with common species having notably more occurrences than rare species. Comprehensive labeling of common species is time-consuming and exacerbates data imbalance, while partial labeling introduces data incompleteness which degrades model performance. To reduce labeling effort and mitigate the impact of unlabeled data, we introduce a novel Constrained False Positive Loss (CFPL) strategy. CFPL dynamically masks predictions from unlabeled data, preventing them from interfering with the classification loss during training. We evaluate our approach on a dataset of 712 LTN images collected over one season, covering 28 fine-grained classes describing the taxonomy and status of brood cells. To minimize labeling effort, we limit the training set to a maximum of 300 labels per class. Experimental results demonstrate that deep learning can be effectively used to detect brood cells in LTNs. Our CFPL method further improves performance and balances model accuracy and labeling effort while also mitigating class imbalance.
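
The masking idea can be illustrated with a generic masked cross-entropy, in which predictions that correspond to unlabeled brood cells are simply excluded from the classification loss. This is a sketch of the general principle only, not the authors' CFPL formulation, whose exact constraints the abstract does not specify; `masked_classification_loss` and the toy probabilities are hypothetical.

```python
import math

def masked_classification_loss(probs, targets, labeled_mask):
    """Mean cross-entropy over labeled predictions only.

    probs: per-prediction class probabilities.
    targets: class index per prediction (ignored where the mask is False).
    labeled_mask: True where the prediction matches an annotated object.
    """
    total, count = 0.0, 0
    for p, t, labeled in zip(probs, targets, labeled_mask):
        if not labeled:
            continue  # unlabeled prediction: contributes nothing to the loss
        total += -math.log(max(p[t], 1e-12))
        count += 1
    return total / count if count else 0.0

# Three predictions over three classes; the third has no annotation.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
targets = [0, 1, 0]  # the last target is a placeholder and is never read
labeled_mask = [True, True, False]
loss = masked_classification_loss(probs, targets, labeled_mask)
print(round(loss, 4))
```

Without the mask, the unannotated third prediction would be penalized as if it were wrong, which is the interference that partial labeling otherwise introduces.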

[184] HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models

Md Jahidul Islam

Main category: cs.CV

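TL;DR: This paper proposes HeBA (Heterogeneous Bottleneck Adapter), an adapter architecture for adapting vision-language models (VLMs) to downstream tasks. It introduces modality-specific structural inductive biases (2D depthwise-separable convolutions for images, linear projections for text), compression-bottleneck regularization, and an active gradient initialization strategy, and reaches state-of-the-art performance on 11 few-shot benchmarks.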

Details Motivation: Existing VLM adaptation methods use uniform, wide adapters and ignore the essential difference between the spatial locality of images and the semantic density of text, which limits performance and stability. Method: The HeBA framework has three components: (1) heterogeneity, where image tokens pass through 2D depthwise-separable convolutions to model spatial correlations and text tokens pass through dense linear projections to capture semantic relationships; (2) bottleneck regularization, where a D -> D/4 compression bottleneck forces the model to learn compact, robust features; and (3) active gradient initialization, where Kaiming initialization replaces zero initialization to ensure initial gradient flow and accelerate convergence. Result: HeBA outperforms existing methods on 11 few-shot vision-language benchmarks, setting a new state of the art with improved training stability and accuracy. Conclusion: Modality-specific structural design (heterogeneity, bottleneck, initialization) is more effective than generic adapters, and HeBA offers a new approach to efficient VLM fine-tuning. Abstract: Adapting large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks often suffers from a "one-size-fits-all" architectural approach, where visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the modalities -- spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: It processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while distinctively processing text tokens via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: Unlike standard expanding adapters, HeBA employs a compression bottleneck (D -> D/4) that explicitly forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: We challenge the restrictive zero-initialization paradigm, utilizing a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without compromising the frozen backbone's pre-trained knowledge. Extensive experiments demonstrate that HeBA's architecturally specialized design achieves superior stability and accuracy, establishing a new state-of-the-art on 11 few-shot benchmarks. Code is available at https://github.com/Jahid12012021/VLM-HeBA.
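
The effect of the D -> D/4 compression on adapter size can be checked with simple arithmetic. The sketch below counts weights and biases for a two-layer linear adapter (down-projection followed by up-projection). The width D = 512 and the 4x expansion factor used for the comparison are assumptions for illustration, not values taken from the paper, and `two_layer_adapter_params` is a hypothetical helper.

```python
def two_layer_adapter_params(d_model, d_hidden):
    """Parameter count of Linear(d_model, d_hidden) followed by Linear(d_hidden, d_model)."""
    first = d_model * d_hidden + d_hidden   # weights + biases of the first layer
    second = d_hidden * d_model + d_model   # weights + biases of the second layer
    return first + second

D = 512  # assumed token width, for illustration only
bottleneck = two_layer_adapter_params(D, D // 4)   # compression: D -> D/4 -> D
expanding = two_layer_adapter_params(D, 4 * D)     # assumed expanding design: D -> 4D -> D
print(bottleneck, expanding)
```

Under these assumptions the bottleneck adapter has roughly one sixteenth of the parameters of the expanding one, which is one way to see why the abstract describes the bottleneck as a structural regularizer.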

[185] Spectral Property-Driven Data Augmentation for Hyperspectral Single-Source Domain Generalization

Taiqin Chen,Yifeng Wang,Xiaochen Feng,Zhilin Zhu,Hao Sha,Yingjian Li,Yongbing Zhang

Main category: cs.CV

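TL;DR: This paper proposes spectral property-driven data augmentation (SPDDA) to address the tradeoff between realism and diversity in single-source domain generalization for hyperspectral images. A spectral diversity module, a channel-wise adaptive spectral mixer, and a spatial-spectral co-optimization mechanism together improve cross-domain classification performance.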

Details Motivation: Hyperspectral images (HSI) have high spectral dimensionality and large sensor variability, which makes them sensitive to distribution shift across domains; existing single-source domain generalization methods depend on data augmentation, but blind augmentation drifts away from real scenes while an excessive focus on realism sacrifices diversity, which limits generalization. Method: SPDDA comprises (1) a spectral diversity module that resamples along the spectral dimension to generate samples with varying numbers of channels; (2) a channel-wise adaptive spectral mixer, built by modeling inter-channel similarity, that avoids fixed augmentation patterns; and (3) a spatial-spectral co-optimization mechanism that jointly optimizes a spatial fidelity constraint and a spectral continuity self-constraint, adaptively adjusting the constraint weight to preserve both spatial structure and spectral smoothness. Result: Extensive experiments on three remote sensing benchmarks show that SPDDA outperforms state-of-the-art methods. Conclusion: By explicitly modeling inherent spectral properties of HSI (device-dependent channel counts and mixing of adjacent channels), SPDDA increases the diversity of augmented samples while keeping them realistic, mitigating domain shift in single-source domain generalization and offering a new approach to cross-domain HSI classification. Abstract: While hyperspectral images (HSI) benefit from numerous spectral channels that provide rich information for classification, the increased dimensionality and sensor variability make them more sensitive to distributional discrepancies across domains, which in turn can affect classification performance. To tackle this issue, hyperspectral single-source domain generalization (SDG) typically employs data augmentation to simulate potential domain shifts and enhance model robustness when training data are available from only a single source domain. However, blind augmentation may produce samples misaligned with real-world scenarios, while excessive emphasis on realism can suppress diversity, highlighting a tradeoff between realism and diversity that limits generalization to target domains. To address this challenge, we propose a spectral property-driven data augmentation (SPDDA) that explicitly accounts for the inherent properties of HSI, namely the device-dependent variation in the number of spectral channels and the mixing of adjacent channels. Specifically, SPDDA employs a spectral diversity module that resamples data from the source domain along the spectral dimension to generate samples with varying spectral channels, and constructs a channel-wise adaptive spectral mixer by modeling inter-channel similarity, thereby avoiding fixed augmentation patterns. To further enhance the realism of the augmented samples, we propose a spatial-spectral co-optimization mechanism, which jointly optimizes a spatial fidelity constraint and a spectral continuity self-constraint. Moreover, the weight of the spectral self-constraint is adaptively adjusted based on the spatial counterpart, thus preventing over-smoothing in the spectral dimension and preserving spatial structure. Extensive experiments conducted on three remote sensing benchmarks demonstrate that SPDDA outperforms state-of-the-art methods.

[186] Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation

Jiawei Mao,Hardy Chen,Haoqin Tu,Yuhan Wang,Letian Zhang,Zeyu Zheng,Huaxiu Yao,Zirui Wang,Cihang Xie,Yuyin Zhou

Main category: cs.CV

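TL;DR: Kestrel is a training-free framework that mitigates hallucinations in large vision-language models (LVLMs) by combining an explicit visual-grounding agent with evidence-verified self-refinement. It improves performance on several benchmarks and provides interpretable diagnostic traces.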

Details Motivation: Large vision-language models (LVLMs) are capable but prone to hallucination, and retraining them to suppress it is prohibitively expensive; existing training-free methods (decoding adjustments or tool use) bring limited gains and weak interpretability. Method: Kestrel first extracts explicit visual evidence and converts tool outputs into structured textual evidence; it then uses an LVLM judge to verify the evidence and iteratively self-refines the answer based on verified evidence, which avoids over-correction. Result: Kestrel outperforms strong baselines on hallucination benchmarks such as POPE and MME-Hallucination (with Qwen3-VL, +3.31% on average on POPE and +28.34 on MME-Hallucination); the self-refinement module and the grounding agent each contribute about +2.0% on POPE; transparent verification traces support hallucination diagnosis. Conclusion: Kestrel is an efficient, training-free, interpretable approach to LVLM hallucination mitigation that pairs performance gains with diagnostic capability and is suited to practical deployment. Abstract: Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly narrows their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive for larger models, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with an evidence-verified self-refinement mechanism. In detail, Kestrel first collects explicit visual evidence and converts tool outputs into reusable and structured textual evidence. Second, to take full advantage of this evidence, Kestrel verifies it via an LVLM judge for evidence checking, then iteratively self-refines answers based on verified evidence to reduce the risk of over-correction. Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., average +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis -- e.g., both the integrated self-refinement module and grounding agent contribute an average +2.0% gain on POPE.

[187] Fast-WAM: Do World Action Models Need Test-time Future Imagination?

Tianyuan Yuan,Zibin Dong,Yicheng Liu,Hang Zhao

Main category: cs.CV

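TL;DR: This paper proposes Fast-WAM and shows that World Action Models (WAMs) do not need explicit future imagination at test time, because their gains come mainly from video modeling during training. Fast-WAM removes test-time future prediction, which cuts latency to 190ms while remaining competitive.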

Details Motivation: To determine whether WAMs must perform explicit future imagination at test time to achieve strong action performance, or whether their advantage comes mainly from video modeling during training. Method: Fast-WAM retains video co-training during training but skips future prediction at test time; several variants are built to disentangle the role of training-time video modeling from that of inference-time future generation. Result: Fast-WAM is competitive with state-of-the-art methods on simulation benchmarks (LIBERO and RoboTwin) and on real-world tasks, with a latency of only 190ms (over 4 times faster); removing video co-training causes a large performance drop, whereas skipping test-time future prediction has a much smaller effect. Conclusion: The main value of video prediction in WAMs lies in improving world representations during training, not in generating future observations at test time. Abstract: World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control because they explicitly model how visual observations may evolve under action. Most existing WAMs follow an imagine-then-execute paradigm, incurring substantial test-time latency from iterative video denoising, yet it remains unclear whether explicit future imagination is actually necessary for strong action performance. In this paper, we ask whether WAMs need explicit future imagination at test time, or whether their benefit comes primarily from video modeling during training. We disentangle the role of video modeling during training from explicit future generation during inference by proposing Fast-WAM, a WAM architecture that retains video co-training during training but skips future prediction at test time. We further instantiate several Fast-WAM variants to enable a controlled comparison of these two factors. Across these variants, we find that Fast-WAM remains competitive with imagine-then-execute variants, while removing video co-training causes a much larger performance drop. Empirically, Fast-WAM achieves competitive results with state-of-the-art methods both on simulation benchmarks (LIBERO and RoboTwin) and real-world tasks, without embodied pretraining. It runs in real time with 190ms latency, over 4$\times$ faster than existing imagine-then-execute WAMs. These results suggest that the main value of video prediction in WAMs may lie in improving world representations during training rather than generating future observations at test time. Project page: https://yuantianyuan01.github.io/FastWAM/

[188] $x^2$-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space

Ruishan Guo,Ciyu Ruan,Haoyang Wang,Zihang Gong,Jingao Xu,Xinlei Chen

Main category: cs.CV

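TL;DR: This paper proposes $x^2$-Fusion, which uses the spatiotemporal edge signal of event cameras to build a unified "Event Edge Space", aligns image and LiDAR features to this homogeneous representation, and jointly estimates 2D optical flow and 3D scene flow through reliability-aware fusion and cross-dimension contrast learning. It reaches state-of-the-art performance in both standard and challenging scenarios.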

Details Motivation: Existing multimodal (image, LiDAR, event) motion estimation methods process each modality in its own heterogeneous feature space; without a shared latent space, cross-sensor mismatches remain and fusion becomes complex. Event cameras naturally provide a spatiotemporal edge signal that can anchor a unified representation. Method: Defines the Event Edge Space as a unified, edge-centric homogeneous latent space; the $x^2$-Fusion framework explicitly aligns image and LiDAR features to this space, applies reliability-aware adaptive fusion, and uses cross-dimension contrast learning to tightly couple 2D optical flow with 3D scene flow. Result: Experiments on synthetic and real benchmarks show that $x^2$-Fusion achieves state-of-the-art accuracy under standard conditions and substantial improvements in challenging scenarios. Conclusion: Unifying the representation space, as opposed to conventional feature-level or decision-level fusion, is key to more robust and accurate multimodal motion estimation; event-derived edges serve as a strong geometric prior that guides cross-modal alignment and fusion. Abstract: Estimating dense 2D optical flow and 3D scene flow is essential for dynamic scene understanding. Recent work combines images, LiDAR, and event data to jointly predict 2D and 3D motion, yet most approaches operate in separate heterogeneous feature spaces. Without a shared latent space that all modalities can align to, these systems rely on multiple modality-specific blocks, leaving cross-sensor mismatches unresolved and making fusion unnecessarily complex. Event cameras naturally provide a spatiotemporal edge signal, which we can treat as an intrinsic edge field to anchor a unified latent representation, termed the Event Edge Space. Building on this idea, we introduce $x^2$-Fusion, which reframes multimodal fusion as representation unification: event-derived spatiotemporal edges define an edge-centric homogeneous space, and image and LiDAR features are explicitly aligned in this shared representation. Within this space, we perform reliability-aware adaptive fusion to estimate modality reliability and emphasize stable cues under degradation. We further employ cross-dimension contrast learning to tightly couple 2D optical flow with 3D scene flow. Extensive experiments on both synthetic and real benchmarks show that $x^2$-Fusion achieves state-of-the-art accuracy under standard conditions and delivers substantial improvements in challenging scenarios.

[189] HMAR: Hierarchical Modality-Aware Expert and Dynamic Routing Medical Image Retrieval Architecture

Aojie Yuan

Main category: cs.CV

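TL;DR: HMAR is a medical image retrieval framework built on a Mixture-of-Experts (MoE) architecture. A dual-expert mechanism extracts global features and position-invariant local features, and a two-stage contrastive learning strategy plus a sliding-window matching algorithm enable efficient, fine-grained region-level retrieval without bounding-box annotations. It improves mAP on the RadioImageNet-CT dataset.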

Details Motivation: Existing medical image retrieval systems have three limitations: feature encoding that does not account for the differing clinical importance of anatomical structures, similarity metrics that depend on ambiguous coarse-grained class labels, and an exclusive focus on global image similarity that cannot meet the clinical need for lesion-region retrieval. Method: HMAR comprises a dual-expert mechanism (Expert0 extracts global features, Expert1 learns position-invariant local representations), two-stage contrastive learning that avoids bounding-box annotations, a sliding-window matching algorithm for dense local comparison, and hash codes generated by Kolmogorov-Arnold Network (KAN) layers for efficient Hamming-distance retrieval. Result: On RadioImageNet-CT (16 clinical patterns, 29,903 CT images), HMAR reaches mAP of 0.711 and 0.724 with 64-bit and 128-bit hash codes, improving over the state-of-the-art ACIR method by 0.7% and 1.1%, respectively. Conclusion: Through hierarchical, modality-aware, dynamically routed experts, HMAR addresses joint global and local modeling, fine-grained matching without region annotations, and efficient retrieval, improving clinically relevant retrieval performance. Abstract: Medical image retrieval (MIR) is a critical component of computer-aided diagnosis, yet existing systems suffer from three persistent limitations: uniform feature encoding that fails to account for the varying clinical importance of anatomical structures, ambiguous similarity metrics based on coarse classification labels, and an exclusive focus on global image similarity that cannot meet the clinical demand for fine-grained region-specific retrieval. We propose HMAR (Hierarchical Modality-Aware Expert and Dynamic Routing), an adaptive retrieval framework built on a Mixture-of-Experts (MoE) architecture. HMAR employs a dual-expert mechanism: Expert0 extracts global features for holistic similarity matching, while Expert1 learns position-invariant local representations for precise lesion-region retrieval. A two-stage contrastive learning strategy eliminates the need for expensive bounding-box annotations, and a sliding-window matching algorithm enables dense local comparison at inference time. Hash codes are generated via Kolmogorov-Arnold Network (KAN) layers for efficient Hamming-distance search. Experiments on the RadioImageNet-CT dataset (16 clinical patterns, 29,903 images) show that HMAR achieves mean Average Precision (mAP) of 0.711 and 0.724 for 64-bit and 128-bit hash codes, improving over the state-of-the-art ACIR method by 0.7% and 1.1%, respectively.
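
Hamming-distance search over binary hash codes is simple to sketch. The example below ranks a toy database by the number of differing bits, using 8-bit codes for readability where the paper uses 64-bit and 128-bit codes. The code values, the image names, and the `search` helper are hypothetical, and nothing here reflects how HMAR's KAN layers actually produce the codes.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integer-encoded hash codes differ."""
    return bin(a ^ b).count("1")

def search(query_code, database, top_k=2):
    """Return the names of the top_k database entries closest to the query."""
    ranked = sorted(
        database.items(),
        key=lambda item: hamming_distance(query_code, item[1]),
    )
    return [name for name, _ in ranked[:top_k]]

# Toy 8-bit codes (illustrative only).
database = {
    "scan_a": 0b10110010,
    "scan_b": 0b10110011,
    "scan_c": 0b01001101,
}
query = 0b10110110
results = search(query, database)
print(results)
```

Because the distance reduces to an XOR followed by a bit count, comparing a query against many stored codes is cheap, which is the usual motivation for hashing-based retrieval.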

Sainan Liu,Tz-Ying Wu,Hector A Valdez,Subarna Tripathi

Main category: cs.CV

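TL;DR: Search2Motion is a training-free framework for object-level motion editing in image-to-video generation. It combines target-frame-based control, semantic-guided object insertion, and background inpainting with an analysis of early-step self-attention maps and the ACE-Seed search strategy to relocate objects with high fidelity while keeping the scene stable. It also introduces the S2M-DAVIS and S2M-OMB benchmarks and the FLF2V-obj metrics.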

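Details Motivation: Existing methods depend on strong supervision signals such as trajectories, bounding boxes, masks, or motion fields, and existing evaluation does not separate object motion from camera motion; a motion editing approach is needed that requires no fine-tuning, is more natural to control, is interpretable, and can be evaluated more fairly. Method: Search2Motion (1) uses target-frame-based control that exploits first-last-frame motion priors; (2) builds a reliable target frame through semantic-guided object insertion and robust background inpainting; (3) uses early-step self-attention maps to predict object and camera dynamics and designs ACE-Seed, a lightweight search strategy that improves motion fidelity; and (4) introduces the S2M-DAVIS and S2M-OMB benchmarks and the FLF2V-obj metrics, which focus on object-only motion and need no ground-truth trajectories. Result: Search2Motion consistently outperforms baselines on FLF2V-obj and VBench; ACE-Seed improves motion fidelity; the new benchmarks and metrics separate object motion from camera motion and support evaluation without ground truth. Conclusion: Search2Motion supports training-free, target-frame-based control as an effective approach to motion editing in image-to-video generation, combining controllability, stability, and interpretability with a more suitable evaluation setup. Abstract: We present Search2Motion, a training-free framework for object-level motion editing in image-to-video generation. Unlike prior methods requiring trajectories, bounding boxes, masks, or motion fields, Search2Motion adopts target-frame-based control, leveraging first-last-frame motion priors to realize object relocation while preserving scene stability without fine-tuning. Reliable target-frame construction is achieved through semantic-guided object insertion and robust background inpainting. We further show that early-step self-attention maps predict object and camera dynamics, offering interpretable user feedback and motivating ACE-Seed (Attention Consensus for Early-step Seed selection), a lightweight search strategy that improves motion fidelity without look-ahead sampling or external evaluators. Noting that existing benchmarks conflate object and camera motion, we introduce S2M-DAVIS and S2M-OMB for stable-camera, object-only evaluation, alongside FLF2V-obj metrics that isolate object artifacts without requiring ground-truth trajectories. Search2Motion consistently outperforms baselines on FLF2V-obj and VBench.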

[191] Emotion-Aware Classroom Quality Assessment Leveraging IoT-Based Real-Time Student Monitoring

Hai Nguyen,Hieu Dao,Hung Nguyen,Nam Vu,Cong Tran

Main category: cs.CV

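TL;DR: This paper proposes a high-throughput, real-time multi-agent affective computing framework for classroom learning that monitors students' emotions and engagement in real time to support teaching. The system was validated in real educational settings with 88% accuracy, and the authors introduce the 'Classroom Emotion Dataset'.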

Details Motivation: Large class sizes and limited teacher-student interaction make it hard for teachers to track students' emotions and engagement in time, so scalable, data-driven tools for real-time affect analysis are needed. Method: A multi-agent affective computing framework designed for IoT devices supports highly concurrent face detection and emotion and engagement recognition; it is evaluated on the 'Classroom Emotion Dataset' (1,500 labeled images and 300 classroom videos) and field-tested at three schools covering different grade levels. Result: The system detects up to 50 faces at 25 FPS and achieves 88% overall accuracy in classifying classroom engagement states; students, teachers, and parents gave favorable feedback on improved classroom interaction and teaching adaptation. Conclusion: The study establishes a practical IoT-based framework for emotion-aware classroom learning and introduces the 'Classroom Emotion Dataset' as a resource for further validation and research on affective computing in education. Abstract: This study presents a high-throughput, real-time multi-agent affective computing framework designed to enhance classroom learning through emotional state monitoring. As large classroom sizes and limited teacher-student interaction increasingly challenge educators, there is a growing need for scalable, data-driven tools capable of capturing students' emotional and engagement patterns in real time. The system was evaluated using the Classroom Emotion Dataset, consisting of 1,500 labeled images and 300 classroom detection videos. Tailored for IoT devices, the system addresses load balancing and latency challenges through efficient real-time processing. Field testing was conducted across three educational institutions in a large metropolitan area: a primary school (hereafter school A), a secondary school (school B), and a high school (school C). The system demonstrated robust performance, detecting up to 50 faces at 25 FPS and achieving 88% overall accuracy in classifying classroom engagement states. Implementation results showed positive outcomes, with favorable feedback from students, teachers, and parents regarding improved classroom interaction and teaching adaptation. Key contributions of this research include establishing a practical, IoT-based framework for emotion-aware learning environments and introducing the 'Classroom Emotion Dataset' to facilitate further validation and research.

[192] World Reconstruction From Inconsistent Views

Lukas Höllein,Matthias Nießner

Main category: cs.CV

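TL;DR: This paper proposes a method that non-rigidly aligns video frames into a globally consistent coordinate frame, addressing the 3D inconsistency between frames generated by video diffusion models and enabling high-quality, explorable 3D environment reconstruction.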

Details Motivation: Video diffusion models generate diverse, high-quality videos, but the frames lack 3D consistency, which makes reconstructing 3D worlds difficult. Method: A geometric foundation model lifts each frame into a pixel-wise 3D pointcloud; a non-rigid iterative frame-to-model ICP provides an initial alignment; a global optimization then sharpens the pointcloud; finally, the pointcloud initializes a 3D reconstruction trained with a novel inverse deformation rendering loss. Result: The resulting 3D scenes are of higher quality than those of baseline methods, turning video models into 3D-consistent world generators. Conclusion: The method mitigates the 3D inconsistency of video diffusion model outputs and reconstructs high-quality, explorable 3D environments from inconsistent generated views. Abstract: Video diffusion models generate high-quality and diverse worlds; however, individual frames often lack 3D consistency across the output sequence, which makes the reconstruction of 3D worlds difficult. To this end, we propose a new method that handles these inconsistencies by non-rigidly aligning the video frames into a globally-consistent coordinate frame that produces sharp and detailed pointcloud reconstructions. First, a geometric foundation model lifts each frame into a pixel-wise 3D pointcloud, which contains unaligned surfaces due to these inconsistencies. We then propose a tailored non-rigid iterative frame-to-model ICP to obtain an initial alignment across all frames, followed by a global optimization that further sharpens the pointcloud. Finally, we leverage this pointcloud as initialization for 3D reconstruction and propose a novel inverse deformation rendering loss to create high quality and explorable 3D environments from inconsistent views. We demonstrate that our 3D scenes achieve higher quality than baselines, effectively turning video models into 3D-consistent world generators.

[193] When the City Teaches the Car: Label-Free 3D Perception from Infrastructure

Zhen Xu,Jinsu Yoo,Cristian Bautista,Zanming Huang,Tai-Yu Pan,Zhenzhen Liu,Katie Z Luo,Mark Campbell,Bharath Hariharan,Wei-Lun Chao

Main category: cs.CV

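TL;DR: This paper proposes a paradigm in which roadside units (RSUs) act as unsupervised teachers that provide label-free 3D perception training for autonomous vehicles, reaching 82.3% AP without any infrastructure or communication support at test time.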

Details Motivation: Conventional 3D perception depends on large-scale manually annotated data, which is impractical for multi-city deployment; roadside units (RSUs) already widely deployed in cities can serve as a natural, static, reusable source of supervision. Method: The paper proposes the 'infrastructure-taught, label-free 3D perception' paradigm: RSUs exploit their fixed viewpoints and repeated observations to learn local 3D detectors from unlabeled data and broadcast predictions to passing vehicles; the vehicles aggregate these predictions into pseudo-labels for training a standalone ego detector; the whole pipeline has three stages, is fully label-free, and requires no RSU or communication at test time. Result: In a CARLA multi-agent environment, the CenterPoint-based pipeline reaches 82.3% AP for vehicle detection, against a fully supervised upper bound of 94.4%; the paper also analyzes each stage and the pipeline's scalability, and shows complementarity with existing ego-centric label-free methods. Conclusion: City infrastructure itself can become a scalable source of supervision, and the infrastructure-taught paradigm offers a promising orthogonal path for reducing 3D perception annotation cost. Abstract: Building robust 3D perception for self-driving still relies heavily on large-scale data collection and manual annotation, yet this paradigm becomes impractical as deployment expands across diverse cities and regions. Meanwhile, modern cities are increasingly instrumented with roadside units (RSUs), static sensors deployed along roads and at intersections to monitor traffic. This raises a natural question: can the city itself help train the vehicle? We propose infrastructure-taught, label-free 3D perception, a paradigm in which RSUs act as stationary, unsupervised teachers for ego vehicles. Leveraging their fixed viewpoints and repeated observations, RSUs learn local 3D detectors from unlabeled data and broadcast predictions to passing vehicles, which are aggregated as pseudo-label supervision for training a standalone ego detector. The resulting model requires no infrastructure or communication at test time. We instantiate this idea as a fully label-free three-stage pipeline and conduct a concept-and-feasibility study in a CARLA-based multi-agent environment. With CenterPoint, our pipeline achieves 82.3% AP for detecting vehicles, compared to a fully supervised ego upper bound of 94.4%. We further systematically analyze each stage, evaluate its scalability, and demonstrate complementarity with existing ego-centric label-free methods.
Together, these results suggest that city infrastructure itself can potentially provide a scalable supervisory signal for autonomous vehicles, positioning infrastructure-taught learning as a promising orthogonal paradigm for reducing annotation cost in 3D perception.

[194] Semi-supervised Latent Disentangled Diffusion Model for Textile Pattern Generation

Chenggong Hu,Yi Wang,Mengqi Xue,Haofei Zhang,Jie Song,Li Sun

Main category: cs.CV

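TL;DR: This paper proposes SLDDM-TPG, which uses a latent disentangled network (LDN) and a semi-supervised latent diffusion model (S-LDM) to resolve feature confusion when generating textile patterns from clothing images, substantially improving the fidelity and detail quality of the results.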

Details Motivation: Existing image-to-image translation models struggle to preserve fine-grained pattern details in textile pattern generation (TPG), because non-rigid deformations in clothing images are confused with complex textures. Method: The paper proposes a two-stage method, SLDDM-TPG: stage one is a latent disentangled network (LDN) that builds a multi-dimensional, independent clothing feature space to reduce feature confusion; stage two is a semi-supervised latent diffusion model (S-LDM) that combines LDN guidance signals with a fine-grained alignment strategy during diffusion training. Result: On the CTP-HD dataset, FID drops by 4.1 and SSIM improves by up to 0.116; the method also generalizes well to the VITON-HD dataset. Conclusion: SLDDM-TPG addresses the distortion caused by feature confusion in TPG and produces high-fidelity, fine-grained textile patterns. Abstract: Textile pattern generation (TPG) aims to synthesize fine-grained textile pattern images based on given clothing images. Although previous studies have not explicitly investigated TPG, existing image-to-image models appear to be natural candidates for this task. However, when applied directly, these methods often produce unfaithful results, failing to preserve fine-grained details due to feature confusion between complex textile patterns and the inherent non-rigid texture distortions in clothing images. In this paper, we propose a novel method, SLDDM-TPG, for faithful and high-fidelity TPG. Our method consists of two stages: (1) a latent disentangled network (LDN) that resolves feature confusion in clothing representations and constructs a multi-dimensional, independent clothing feature space; and (2) a semi-supervised latent diffusion model (S-LDM), which receives guidance signals from LDN and generates faithful results through semi-supervised diffusion training, combined with our designed fine-grained alignment strategy. Extensive evaluations show that SLDDM-TPG reduces FID by 4.1 and improves SSIM by up to 0.116 on our CTP-HD dataset, and also demonstrates good generalization on the VITON-HD dataset.

[195] SuCor: Susceptibility Distortion Correction via Parameter-Free and Self-Regularized Optimal Transport

Sreekar Chigurupati,Eleftherios Garyfallidis

Main category: cs.CV

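TL;DR: SuCor is a new optimal-transport-based method for correcting susceptibility-induced geometric distortion in EPI images. It models the distortion field as Wasserstein-2 barycentric displacements, applies bending-energy regularization in the spectral domain, and sets the regularization strength automatically via the Morozov discrepancy principle; on HCP data it outperforms FSL TOPUP while remaining computationally efficient.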

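Details Motivation: To correct the geometric distortion caused by magnetic susceptibility effects in echo planar imaging (EPI) and thereby improve registration accuracy between functional and structural images. Method: SuCor takes a pair of reversed phase-encoding EPI volumes and models each column of the distortion field as a Wasserstein-2 barycentric displacement between the corresponding intensity profiles; it applies bending-energy regularization in the spectral domain and selects the regularization strength automatically via the Morozov discrepancy principle. Result: On an HCP dataset, SuCor reaches a mean volumetric mutual information of 0.341 with the T1 image, versus 0.317 for FSL TOPUP, and runs in about 12 seconds on a single CPU core. Conclusion: SuCor achieves better distortion correction and higher computational efficiency without manual tuning, offering a new approach to EPI preprocessing. Abstract: We present SuCor, a method for correcting susceptibility-induced geometric distortions in echo planar imaging (EPI) using optimal transport (OT) along the phase encoding direction. Given a pair of reversed phase encoding EPI volumes, we model each column of the distortion field as a Wasserstein-2 barycentric displacement between the opposing-polarity intensity profiles. Regularization is performed in the spectral domain using a bending-energy penalty whose strength is selected automatically via the Morozov discrepancy principle, requiring no manual tuning. On a Human Connectome Project (HCP) dataset with left-right/right-left b0 EPI pairs and a co-registered T1 structural reference, SuCor achieves a mean volumetric mutual information of 0.341 with the T1 image, compared to 0.317 for FSL TOPUP, while running in approximately 12 seconds on a single CPU core.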
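
The per-column idea in the SuCor abstract can be illustrated with a minimal sketch. In one dimension, optimal transport has a closed form through quantile functions, and the equal-weight Wasserstein-2 barycenter sits half-way between the two inverse CDFs. The function names, the Gaussian test profiles, and the linear interpolation scheme below are assumptions made for illustration; the sketch leaves out the spectral bending-energy regularization and the Morozov-based parameter selection, and it is not the authors' implementation.

```python
import math
from bisect import bisect_left

def cumulative(profile):
    """Normalize a non-negative 1D intensity profile and return its CDF."""
    total = sum(profile)
    acc, cdf = 0.0, []
    for value in profile:
        acc += value / total
        cdf.append(acc)
    return cdf

def inverse_cdf(cdf, q):
    """Quantile function: fractional grid position where the CDF reaches q."""
    i = min(bisect_left(cdf, q), len(cdf) - 1)
    below = cdf[i - 1] if i > 0 else 0.0
    mass = cdf[i] - below
    frac = (q - below) / mass if mass > 0 else 0.0
    return (i - 1) + frac

def barycentric_displacement(profile_a, profile_b, q):
    """Shift (in grid units) that moves quantile q of profile_a onto the
    equal-weight Wasserstein-2 barycenter of the two profiles."""
    cdf_a, cdf_b = cumulative(profile_a), cumulative(profile_b)
    return 0.5 * (inverse_cdf(cdf_b, q) - inverse_cdf(cdf_a, q))

def gaussian_profile(center, sigma=3.0, n=64):
    """Hypothetical intensity profile along the phase-encoding direction."""
    return [math.exp(-0.5 * ((x - center) / sigma) ** 2) for x in range(n)]

# Two hypothetical opposite-polarity profiles, displaced by 10 grid points.
up = gaussian_profile(20)
down = gaussian_profile(30)
shift_up = barycentric_displacement(up, down, 0.5)    # half-way toward `down`
shift_down = barycentric_displacement(down, up, 0.5)  # equal and opposite
```

Because the two acquisitions are distorted in opposite directions, each profile moves half of the total transport distance, which is why the sketch returns equal and opposite shifts for the two inputs.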

[196] Dual Stream Independence Decoupling for True Emotion Recognition under Masked Expressions

Jinsheng Wei,Xiguang Zhang,Zheng Shi,Guanming Lu

Main category: cs.CV

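TL;DR: This paper proposes a new apexframe-based paradigm and a dual stream independence decoupling framework for recognizing true emotions from facial expressions in a stable disguised state, decoupling true and disguised emotion features and designing new loss functions to improve recognition performance.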

Details Motivation: Existing methods recognize true emotions from the onsetframe, where the disguise has only just begun; this frame has not reached a stable disguised state and leaks true emotional information, so it does not reflect actual disguise. Method: The paper introduces an apexframe-based paradigm and builds a dual stream independence decoupling framework; it designs a decoupling loss group, consisting of two classification losses and a Hilbert-Schmidt independence loss, to separate true and disguised emotion features. Result: Experiments show that the apexframe paradigm is more challenging and that the proposed decoupling framework improves true emotion recognition. Conclusion: Analyzing the apexframe in a stable disguised state is closer to practical needs, and the dual-stream decoupling framework with its loss functions reduces the interference of disguised emotions on true emotion recognition. Abstract: Recognizing true emotions from masked expressions is extremely challenging due to deliberate concealment. Existing paradigms recognize true emotions from masked-expression clips that contain onsetframes just starting to disguise. However, this paradigm may not reflect the actual disguised state, as the onsetframe leaks the true emotional information without reaching a stable disguise state. Thus, this paper introduces a novel apexframe-based paradigm that classifies true emotions from the apexframe with a stable disguised state. Furthermore, this paper proposes a novel dual stream independence decoupling framework that decouples true and disguised emotion features, avoiding the interference of disguised emotions on true emotions. For efficient decoupling, we design a decoupling loss group, comprising two classification losses that learn true emotion and disguised expression features, respectively, and a Hilbert-Schmidt Independence loss that enhances the independence of two features. Experiments demonstrate that the apexframe-based paradigm is challenging, and the proposed decoupling framework improves recognition performance.
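
For readers unfamiliar with the Hilbert-Schmidt Independence Criterion (HSIC) that the loss is named after, here is a minimal sketch of the standard biased empirical estimator, trace(KHLH)/(n-1)^2, with Gaussian kernels. The kernel choice, the bandwidth `sigma`, and the toy one-dimensional features are assumptions for illustration; the abstract does not say how the authors configure their loss.

```python
import math
import random

def gaussian_gram(samples, sigma=1.0):
    """Gram matrix of an RBF kernel over a list of feature vectors."""
    def k(a, b):
        sq = sum((x - y) ** 2 for x, y in zip(a, b))
        return math.exp(-sq / (2.0 * sigma ** 2))
    return [[k(a, b) for b in samples] for a in samples]

def center(gram):
    """Double-center a symmetric Gram matrix: H K H with H = I - 11^T / n."""
    n = len(gram)
    row_mean = [sum(row) / n for row in gram]
    total_mean = sum(row_mean) / n
    return [[gram[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(n)] for i in range(n)]

def hsic(feats_x, feats_y, sigma=1.0):
    """Biased empirical HSIC; larger values indicate stronger dependence."""
    n = len(feats_x)
    kc = center(gaussian_gram(feats_x, sigma))
    gram_y = gaussian_gram(feats_y, sigma)
    trace = sum(kc[i][j] * gram_y[j][i] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2

# Toy features: `true_feat` compared with itself (fully dependent) and with
# an unrelated random stream standing in for a decoupled "disguised" feature.
rng = random.Random(0)
true_feat = [[rng.gauss(0.0, 1.0)] for _ in range(60)]
unrelated = [[rng.gauss(0.0, 1.0)] for _ in range(60)]
dependent_score = hsic(true_feat, true_feat)
independent_score = hsic(true_feat, unrelated)
```

Minimizing such a score between the two streams' features pushes them toward statistical independence, which is the role the abstract assigns to the independence loss.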

[197] GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution

Qiaosi Yi,Shuai Li,Rongyuan Wu,Lingchen Sun,Zhengqiang Zhang,Lei Zhang

Main category: cs.CV

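TL;DR: This paper proposes GDPO, which integrates reinforcement learning into the training of one-step generative image super-resolution (ISR) models; using a noise-aware one-step diffusion model and an attribute-aware reward function, it significantly improves ISR performance.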

Details Motivation: Reinforcement learning is underused in one-step generative ISR, and methods such as DPO and GRPO either have a limited number of samples or ignore local details. Method: GDPO comprises a noise-aware one-step diffusion model (with an unequal-timestep strategy that decouples the noise-addition timestep from the diffusion timestep) and the GDPO strategy (which integrates the GRPO principle into DPO to compute group-relative advantages), plus an attribute-aware reward function that dynamically scores sample quality. Result: Experiments confirm that GDPO improves the performance of one-step generative ISR models. Conclusion: GDPO successfully applies reinforcement learning to one-step generative ISR and addresses the limitations of existing methods in sample diversity, local detail modeling, and noise robustness. Abstract: Recently, reinforcement learning (RL) has been employed for improving generative image super-resolution (ISR) performance. However, the current efforts are focused on multi-step generative ISR, while one-step generative ISR remains underexplored due to its limited stochasticity. In addition, RL methods such as Direct Preference Optimization (DPO) require the generation of positive and negative sample pairs offline, leading to a limited number of samples, while Group Relative Policy Optimization (GRPO) only calculates the likelihood of the entire image, ignoring local details that are crucial for ISR. In this paper, we propose Group Direct Preference Optimization (GDPO), a novel approach to integrate RL into one-step generative ISR model training. First, we introduce a noise-aware one-step diffusion model that can generate diverse ISR outputs. To prevent performance degradation caused by noise injection, we introduce an unequal-timestep strategy to decouple the timestep of noise addition from that of diffusion. We then present the GDPO strategy, which integrates the principle of GRPO into DPO, to calculate the group-relative advantage of each online generated sample for model optimization. Meanwhile, an attribute-aware reward function is designed to dynamically evaluate the score of each sample based on its statistics of smooth and texture areas. Experiments demonstrate the effectiveness of GDPO in enhancing the performance of one-step generative ISR models. Code: https://github.com/Joyies/GDPO.
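
The "group-relative advantage" mentioned in the abstract can be sketched in the usual GRPO style: each sample's reward is standardized against the other samples generated for the same input. The reward values below are made up, and how GDPO turns these advantages into a DPO-style objective is not specified in the abstract, so this sketch covers only the normalization step.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same input."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]

# Hypothetical reward scores for four super-resolved candidates of one image.
group_rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(group_rewards)
best_index = advantages.index(max(advantages))
```

Samples with positive advantage are better than their group's average and samples with negative advantage are worse, so the signal is relative to the current policy's own outputs rather than to a fixed offline pair.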

[198] IOSVLM: A 3D Vision-Language Model for Unified Dental Diagnosis from Intraoral Scans

Huimin Xiong,Zijie Meng,Tianxiang Hu,Chenyi Zhou,Yang Feng,Zuozhu Liu

Main category: cs.CV

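TL;DR: This paper proposes IOSVLM, an end-to-end 3D vision-language model that directly processes point clouds from dental intraoral scans for unified multi-disease diagnosis and visual question answering (VQA), and builds the large-scale IOSVQA dataset; a geometry-to-chromatic proxy and two-stage curriculum training improve cross-modal alignment and robustness, and the model clearly outperforms baselines on several metrics.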

Details Motivation: Existing dental vision-language models mainly rely on 2D or multi-view images and do not fully use native 3D geometry; modeling 3D intraoral scans is hard because of heterogeneous scan forms, disease co-occurrence with class imbalance, fine-grained morphological ambiguity, and scarce paired 3D-text data. Method: IOSVLM represents 3D intraoral scans as point clouds and follows a 3D encoder-projector-LLM architecture; a geometry-to-chromatic proxy reduces the distribution gap between color-free scans and color-dependent pre-training; a two-stage curriculum training strategy improves robustness; the authors also build the IOSVQA dataset (19,002 cases, 249,055 VQA pairs, 23 oral diseases, multiple scan types). Result: IOSVLM improves over strong baselines by at least +9.58% macro accuracy and +1.46% macro F1, supporting the value of modeling 3D geometry directly for intraoral scan diagnosis. Conclusion: Modeling native 3D geometry directly is better than relying on rendered 2D views; IOSVLM and IOSVQA provide a scalable, unified, clinically practical foundation for 3D multi-task dental diagnosis. Abstract: 3D intraoral scans (IOS) are increasingly adopted in routine dentistry due to abundant geometric evidence, and unified multi-disease diagnosis is desirable for clinical documentation and communication. While recent works introduce dental vision-language models (VLMs) to enable unified diagnosis and report generation on 2D images or multi-view images rendered from IOS, they do not fully leverage native 3D geometry. Such work is necessary and also challenging, due to: (i) heterogeneous scan forms and the complex IOS topology, (ii) multi-disease co-occurrence with class imbalance and fine-grained morphological ambiguity, and (iii) limited paired 3D IOS-text data. Thus, we present IOSVLM, an end-to-end 3D VLM that represents scans as point clouds and follows a 3D encoder-projector-LLM design for unified diagnosis and generative visual question-answering (VQA), together with IOSVQA, a large-scale multi-source IOS diagnosis VQA dataset comprising 19,002 cases and 249,055 VQA pairs over 23 oral diseases and heterogeneous scan types. To address the distribution gap between color-free IOS data and color-dependent 3D pre-training, we propose a geometry-to-chromatic proxy that stabilizes fine-grained geometric perception and cross-modal alignment. A two-stage curriculum training strategy further enhances robustness. IOSVLM consistently outperforms strong baselines, achieving gains of at least +9.58% macro accuracy and +1.46% macro F1, indicating the effectiveness of direct 3D geometry modeling for IOS-based diagnosis.
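
Since the abstract stresses class imbalance and reports macro-averaged metrics, a small sketch may help show why macro F1 is informative in that setting. The function names and the toy labels are illustrative assumptions, and the single-label setup is a simplification; the paper's exact evaluation protocol is not given in the abstract.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 score for one class, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)

def accuracy(y_true, y_pred):
    """Plain fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy imbalanced case: nine healthy scans (0), one diseased scan (1),
# and a degenerate model that always predicts "healthy".
truth = [0] * 9 + [1]
always_healthy = [0] * 10
plain_acc = accuracy(truth, always_healthy)
macro = macro_f1(truth, always_healthy)
```

In this toy case plain accuracy is 0.9 while macro F1 is below 0.5, because the missed rare class contributes an F1 of zero with the same weight as the common class.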

[199] V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising

Han Lin,Xichen Pan,Zun Wang,Yue Zhang,Chu Wang,Jaemin Cho,Mohit Bansal

Main category: cs.CV

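TL;DR: This paper presents V-Co, a systematic study of visual co-denoising in a JiT-based framework, which identifies four key ingredients for effective visual co-denoising and shows on ImageNet-256 that it outperforms existing pixel-space diffusion models.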

Details Motivation: Standard pixel-space diffusion models receive weak semantic supervision and struggle to capture high-level visual structure; representation-alignment methods (such as REPA) and visual co-denoising have been explored, but existing designs entangle many choices, leaving the truly essential components unclear. Method: Controlled experiments in a unified JiT-based framework systematically decouple and evaluate the design choices in visual co-denoising, identifying four core components: a fully dual-stream architecture, a structurally defined unconditional prediction that supports CFG, a perceptual-drifting hybrid loss, and RMS-based feature rescaling for cross-stream calibration. Result: On ImageNet-256, at comparable model sizes and with fewer training epochs, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior methods, and yields a clear recipe for visual co-denoising. Conclusion: Effective visual co-denoising depends on four separable design choices: dual-stream structure, CFG-compatible unconditional modeling, a perceptually enhanced hybrid loss, and cross-stream feature calibration; the findings offer practical guidance for future representation-aligned generative models. Abstract: Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, making it unclear which design choices are truly essential. Therefore, we present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective. Our study reveals four key ingredients for effective visual co-denoising. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising.
Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.
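
One plausible reading of "RMS-based feature rescaling" is that features from one stream are scaled so that their root-mean-square magnitude matches the other stream's. The sketch below shows only that generic operation; where the rescaling is applied, and whether statistics are computed per token, per channel, or per batch, are not stated in the abstract, so every name and value here is an assumption.

```python
import math

def rms(vec):
    """Root-mean-square magnitude of a feature vector."""
    return math.sqrt(sum(v * v for v in vec) / len(vec))

def rescale_to_rms(features, target_rms, eps=1e-8):
    """Scale `features` so that their RMS matches `target_rms`."""
    scale = target_rms / (rms(features) + eps)
    return [v * scale for v in features]

# Hypothetical features from a pixel stream and a much larger-magnitude
# semantic stream (values are made up for illustration).
pixel_feats = [0.10, -0.20, 0.05, 0.15]
semantic_feats = [12.0, -7.5, 3.0, 9.0]
calibrated = rescale_to_rms(semantic_feats, rms(pixel_feats))
```

Matching magnitudes in this way keeps one stream from dominating the other when their features interact, which is consistent with the abstract's framing of rescaling as cross-stream calibration.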

[200] WildDepth: A Multimodal Dataset for 3D Wildlife Perception and Depth Estimation

Muhammad Aamir,Naoya Muramatsu,Sangyun Shin,Matthew Wijers,Jiaxing Jhong,Xinyu Hou,Amir Patel,Andrew Markham

Main category: cs.CV

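TL;DR: This paper presents WildDepth, a multimodal dataset and benchmark suite for animal depth estimation, behavior detection, and 3D reconstruction that combines RGB and LiDAR data and significantly improves depth estimation and 3D reconstruction performance.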

Details Motivation: Most existing animal depth estimation models are trained on datasets without metric scale, which makes it hard to validate image-only models; multimodal data with real metric scale is needed. Method: The authors build the WildDepth multimodal dataset, covering animals in diverse environments with synchronized RGB and LiDAR, define benchmark tasks (depth estimation, behavior detection, 3D reconstruction), and use RGB-LiDAR fusion to improve performance. Result: Multimodal data reduces depth estimation RMSE by up to 10%, and RGB-LiDAR fusion improves 3D reconstruction Chamfer distance by 12%. Conclusion: WildDepth provides a new benchmark for robust multimodal perception in animal scenes and supports better cross-domain generalization. Abstract: Depth estimation and 3D reconstruction have been extensively studied as core topics in computer vision. Starting from rigid objects with relatively simple geometric shapes, such as vehicles, the research has expanded to address general objects, including challenging deformable objects, such as humans and animals. However, for animals in particular, the majority of existing models are trained on datasets without metric scale, even though metric scale can help validate image-only models. To address this limitation, we present WildDepth, a multimodal dataset and benchmark suite for depth estimation, behavior detection, and 3D reconstruction from diverse categories of animals ranging from domestic to wild environments with synchronized RGB and LiDAR. Experimental results show that the use of multi-modal data improves depth reliability by up to 10% RMSE, while RGB-LiDAR fusion enhances 3D reconstruction fidelity by 12% in Chamfer distance. By releasing WildDepth and its benchmarks, we aim to foster robust multimodal perception systems that generalize across domains.
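
The two metrics quoted in the abstract can be written down in a few lines. The sketch below uses one common convention for Chamfer distance (the sum of the two directional mean nearest-neighbor distances, with unsquared Euclidean distance); other conventions use squared distances or an average of the two directions, and the abstract does not say which one the benchmark adopts.

```python
import math

def rmse(pred, target):
    """Root-mean-square error between predicted and reference depths."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def _mean_nearest(src, dst):
    """Average distance from each point in `src` to its nearest point in `dst`."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two 3D point sets (brute force)."""
    return _mean_nearest(points_a, points_b) + _mean_nearest(points_b, points_a)

# Toy depth values in metres and two single-point clouds one metre apart.
depth_error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
cloud_gap = chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)])
```

Both metrics are only meaningful in physical units when the reference data carries metric scale, which is the gap the LiDAR measurements are meant to fill.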

[201] Deep Reinforcement Learning-driven Edge Offloading for Latency-constrained XR pipelines

Sourya Saha,Saptarshi Debroy

Main category: cs.CV

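TL;DR: This paper proposes a battery-aware execution management framework for edge-assisted XR systems that uses a lightweight deep reinforcement learning policy for online decisions, substantially extending device battery lifetime while meeting motion-to-photon (MTP) latency requirements.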

Details Motivation: XR applications are latency-sensitive and constrained by device battery and energy; existing adaptive execution and computation offloading methods mostly optimize average performance and do not fully model the sustained coupling between real-time latency and battery lifetime in closed-loop XR workloads. Method: The paper designs a battery-aware execution management framework that jointly considers execution placement, workload quality, latency constraints, and battery dynamics, and uses a lightweight deep reinforcement learning policy to adapt decisions online under changing network conditions. Result: Compared with latency-optimal local execution, the approach extends projected device battery lifetime by up to 163% while keeping MTP latency compliance above 90% under stable network conditions; even with severely limited bandwidth, compliance stays at or above 80%. Conclusion: Explicitly modeling and optimizing the latency-energy trade-off is important for practical immersive XR systems, and the proposed framework improves energy efficiency while preserving real-time behavior. Abstract: Immersive extended reality (XR) applications introduce latency-critical workloads that must satisfy stringent real-time responsiveness while operating on energy- and battery-constrained devices, making execution placement between end devices and nearby edge servers a fundamental systems challenge. Existing approaches to adaptive execution and computation offloading typically optimize average performance metrics and do not fully capture the sustained interaction between real-time latency requirements and device battery lifetime in closed-loop XR workloads. In this paper, we present a battery-aware execution management framework for edge-assisted XR systems that jointly considers execution placement, workload quality, latency requirements, and battery dynamics. We design an online decision mechanism based on a lightweight deep reinforcement learning policy that continuously adapts execution decisions under dynamic network conditions while maintaining high motion-to-photon latency compliance. Experimental results show that the proposed approach extends the projected device battery lifetime by up to 163% compared to latency-optimal local execution while maintaining over 90% motion-to-photon latency compliance under stable network conditions. Such compliance does not fall below 80% even under significantly limited network bandwidth availability, thereby demonstrating the effectiveness of explicitly managing latency-energy trade-offs in immersive XR systems.
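
To make the two reported quantities concrete, the sketch below computes a latency compliance rate and converts a percentage lifetime extension into hours. The 20 ms budget, the per-frame latencies, and the 2-hour baseline are hypothetical numbers chosen for the example; the abstract does not state the budget or any absolute battery lifetime.

```python
def mtp_compliance(latencies_ms, budget_ms):
    """Fraction of frames whose motion-to-photon latency meets the budget."""
    return sum(latency <= budget_ms for latency in latencies_ms) / len(latencies_ms)

def extended_lifetime(baseline_hours, extension_pct):
    """Battery lifetime after a relative extension given in percent."""
    return baseline_hours * (1.0 + extension_pct / 100.0)

# Hypothetical per-frame latencies (ms) against a hypothetical 20 ms budget.
compliance = mtp_compliance([10.0, 15.0, 25.0, 18.0], budget_ms=20.0)

# A 163% extension means 2.63x the baseline, e.g. 2.0 h becomes about 5.26 h.
lifetime_hours = extended_lifetime(2.0, 163.0)
```

Note that "extends by 163%" is a relative gain over the baseline, so the resulting lifetime is 2.63 times the baseline rather than 1.63 times.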

[202] An assessment of data-centric methods for label noise identification in remote sensing data sets

Felix Kröber,Genc Hoxha,Ribana Roscher

Main category: cs.CV

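TL;DR: This paper studies label noise in remote sensing, evaluates three data-centric methods under different label noise assumptions, and demonstrates their value for identifying noisy labels and improving task performance.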

Details Motivation: Automated treatment of label noise has received little attention in remote sensing, and there is no systematic analysis of data-centric methods that not only cope with label noise but also explicitly identify and isolate noisy labels. Method: Three data-centric methods are selected; label noise of different types is injected at levels from 10% to 70% into two benchmark data sets, and the paper analyzes how well the methods filter the noise and how this affects task performance. Result: The experiments show that data-centric methods are effective for both label noise identification and task performance improvement, and give insight into which method to choose for a given setting and objective. Conclusion: The study advances the practical use of data-centric label noise methods in remote sensing and points out where further research is needed to transfer them to this domain. Abstract: Label noise in the sense of incorrect labels is present in many real-world data sets and is known to severely limit the generalizability of deep learning models. In the field of remote sensing, however, automated treatment of label noise in data sets has received little attention to date. In particular, there is a lack of systematic analysis of the performance of data-centric methods that not only cope with label noise but also explicitly identify and isolate noisy labels. In this paper, we examine three such methods and evaluate their behavior under different label noise assumptions. To do this, we inject different types of label noise with noise levels ranging from 10 to 70% into two benchmark data sets, followed by an analysis of how well the selected methods filter the label noise and how this affects task performance. With our analyses, we clearly prove the value of data-centric methods for both parts - label noise identification and task performance improvements. Our analyses provide insights into which method is the best choice depending on the setting and objective. Finally, we show in which areas there is still a need for research in the transfer of data-centric label noise methods to remote sensing data. As such, our work is a step forward in bridging the methodological establishment of data-centric label noise methods and their usage in practical settings in the remote sensing domain.
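
The evaluation protocol described in the abstract (inject noise at a known rate, then check how well a method recovers the corrupted samples) can be sketched as follows. Only uniform, symmetric noise is shown, which is one assumption among the "different types" the paper mentions, and the function names are illustrative rather than taken from the paper.

```python
import random

def inject_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip a fixed fraction of labels to a different, uniformly chosen class.
    Returns the noisy labels and the set of corrupted indices."""
    rng = random.Random(seed)
    n_flip = round(noise_rate * len(labels))
    flipped = set(rng.sample(range(len(labels)), n_flip))
    noisy = []
    for i, y in enumerate(labels):
        if i in flipped:
            noisy.append(rng.choice([c for c in range(num_classes) if c != y]))
        else:
            noisy.append(y)
    return noisy, flipped

def detection_scores(flagged, truly_noisy):
    """Precision and recall of a noise detector's flagged sample indices."""
    hits = len(flagged & truly_noisy)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(truly_noisy) if truly_noisy else 0.0
    return precision, recall

# 100 clean labels over 5 classes, corrupted at a 30% noise level.
clean = [i % 5 for i in range(100)]
noisy, corrupted = inject_symmetric_noise(clean, num_classes=5, noise_rate=0.3)
num_changed = sum(a != b for a, b in zip(clean, noisy))
```

Keeping the set of corrupted indices is what makes it possible to score noise identification separately from downstream task performance, the two aspects the abstract evaluates.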

[203] What DINO saw: ALiBi positional encoding reduces positional bias in Vision Transformers

Moritz Pawlowsky,Antonis Vamvakeros,Alexander Weiss,Anja Bielefeld,Samuel J. Cooper,Ronan Docherty

Main category: cs.CV

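TL;DR: This paper studies positional bias in Vision Transformers (ViTs), particularly for orientation-insensitive applications such as materials science; linear probing shows the bias is widespread, finetuning with ALiBi relative positional encoding reduces it, and the approach is validated on microscopy image segmentation.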

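Details Motivation: ViTs such as DINOv2 learn rich representations, but architectural choices such as positional encoding introduce positional bias unrelated to semantics, which hinders zero-shot transfer, especially in fields like materials science that need isotropic representations. Method: Linear probing is used to analyze positional bias across several ViT models and positional encodings; the models are then finetuned to use ALiBi relative positional encoding in place of the original encoding to suppress the bias. Result: The ALiBi-finetuned models show markedly lower positional bias while retaining semantic representation quality, and are used successfully for trainable segmentation of complex microscopy images. Conclusion: ALiBi relative positional encoding is an effective way to reduce positional bias in ViTs, improving generalization and usefulness for orientation-independent tasks. Abstract: Vision transformers (ViTs) - especially feature foundation models like DINOv2 - learn rich representations useful for many downstream tasks. However, architectural choices (such as positional encoding) can lead to these models displaying positional biases and artefacts independent of semantic content. This makes zero-shot adaptation difficult in fields like materials science, where images are often cross-sections of homogeneous microstructure (i.e. having no preferred direction). In this work, we investigate the positional bias in ViTs via linear probing, finding it present across a range of objectives and positional encodings, and subsequently reduce it by finetuning models to use ALiBi relative positional encoding. We demonstrate that these models retain desirable general semantics and their unbiased features can be used successfully in trainable segmentation of complex microscopy images.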
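
ALiBi replaces learned or absolute position embeddings with a distance-proportional penalty added to the attention logits. The sketch below uses the geometric head slopes from the original ALiBi recipe for power-of-two head counts, and extends the 1D token distance to a 2D patch grid with Euclidean distance. That 2D extension is an assumption made for illustration; the abstract does not specify which distance the authors use.

```python
import math

def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8h/H), h = 1..H (original ALiBi, power-of-two H)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias_2d(grid_h, grid_w, slope):
    """Additive attention bias between all pairs of patches on a grid.
    bias[i][j] = -slope * distance(patch_i, patch_j)."""
    coords = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    return [[-slope * math.dist(p, q) for q in coords] for p in coords]

slopes = alibi_slopes(8)
bias = alibi_bias_2d(2, 2, slopes[0])  # 4 patches, so a 4 x 4 bias matrix
```

Because the bias depends only on the distance between two patches and never on their absolute location, the same content produces the same attention pattern wherever it appears in the image, which is the property that matters for microstructure with no preferred direction.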

[204] M^3: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM

Kerui Ren,Guanghao Li,Changjian Jiang,Yingxiang Xu,Tao Lu,Linning Xu,Junting Dong,Jiangmiao Pang,Mulin Yu,Bo Dai

Main category: cs.CV

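TL;DR: This paper proposes M^3, which adds a matching head to a multi-view foundation model for fine-grained dense matching and integrates it into a monocular Gaussian Splatting SLAM framework, significantly improving pose estimation and scene reconstruction accuracy in dynamic environments.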

Details Motivation: Streaming reconstruction from uncalibrated monocular video remains challenging, requiring high-precision pose estimation and computationally efficient online refinement in dynamic environments; existing methods based on multi-view foundation models estimate poses in a feed-forward manner, so their pixel-level correspondences are not precise enough for geometric optimization. Method: M^3 (1) adds a dedicated matching head to a multi-view foundation model to improve dense correspondence accuracy; (2) integrates it into a robust monocular Gaussian Splatting SLAM framework; and (3) adds dynamic area suppression and cross-inference intrinsic alignment to stabilize tracking. Result: M^3 reaches state-of-the-art pose estimation and scene reconstruction on several indoor and outdoor benchmarks: it reduces ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and outperforms ARTDECO by 2.11 dB in PSNR on ScanNet++. Conclusion: M^3 narrows the precision gap between foundation models and geometric SLAM, showing that stronger matching and system-level co-design are key for monocular streaming reconstruction. Abstract: Streaming reconstruction from uncalibrated monocular video remains challenging, as it requires both high-precision pose estimation and computationally efficient online refinement in dynamic environments. While coupling 3D foundation models with SLAM frameworks is a promising paradigm, a critical bottleneck persists: most multi-view foundation models estimate poses in a feed-forward manner, yielding pixel-level correspondences that lack the requisite precision for rigorous geometric optimization. To address this, we present M^3, which augments the Multi-view foundation model with a dedicated Matching head to facilitate fine-grained dense correspondences and integrates it into a robust Monocular Gaussian Splatting SLAM. M^3 further enhances tracking stability by incorporating dynamic area suppression and cross-inference intrinsic alignment. Extensive experiments on diverse indoor and outdoor benchmarks demonstrate state-of-the-art accuracy in both pose estimation and scene reconstruction. Notably, M^3 reduces ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and outperforms ARTDECO by 2.11 dB in PSNR on the ScanNet++ dataset.

[205] SOMA: Unifying Parametric Human Body Models

Jun Saito,Jiefeng Li,Michael de Ruyter,Miguel Guerrero,Edy Lim,Ehsan Hassani,Roger Blanco Ribera,Hyejin Moon,Magdalena Dadela,Marco Di Lucca,Qiao Wang,Xueting Li,Jan Kautz,Simon Yuen,Umar Iqbal

Main category: cs.CV

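TL;DR: This paper presents SOMA, a unified parametric human body model layer that uses three abstraction layers (mesh topology, skeleton, pose) for seamless interoperability among heterogeneous models such as SMPL and SMPL-X; it is end-to-end differentiable and GPU-accelerated, and reduces adapter development complexity from O(M²) to O(M).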

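Details Motivation: Mainstream parametric human body models (such as SMPL and SMPL-X) are mutually incompatible in mesh topology, skeletal structure, shape parameters, and unit conventions, so their complementary strengths are hard to combine in one pipeline. Method: The SOMA framework has three abstraction layers: (1) mesh topology abstraction, which maps any source model to a shared canonical mesh; (2) skeletal abstraction, which recovers identity-adapted joint transforms from any shape in a single closed-form pass; and (3) pose abstraction, which inverts the skinning pipeline to recover unified skeleton rotations directly from the posed vertices of any model. The whole system is GPU-accelerated and end-to-end differentiable via NVIDIA Warp. Result: SOMA gives unified access to SMPL, SMPL-X, MHR, Anny, and related models; it removes the iterative optimization or model-specific training needed for traditional pairwise adaptation; it lets heterogeneous motion datasets be used without custom retargeting; and it reduces adapter development from O(M²) to O(M). Conclusion: SOMA provides a unified, efficient, differentiable, hardware-accelerated intermediate representation layer for parametric human body modeling, improving the flexibility and engineering efficiency of cross-model workflows. Abstract: Parametric human body models are foundational to human reconstruction, animation, and simulation, yet they remain mutually incompatible: SMPL, SMPL-X, MHR, Anny, and related models each diverge in mesh topology, skeletal structure, shape parameterization, and unit convention, making it impractical to exploit their complementary strengths within a single pipeline. We present SOMA, a unified body layer that bridges these heterogeneous representations through three abstraction layers. Mesh topology abstraction maps any source model's identity to a shared canonical mesh in constant time per vertex. Skeletal abstraction recovers a full set of identity-adapted joint transforms from any body shape, whether in rest pose or an arbitrary posed configuration, in a single closed-form pass, with no iterative optimization or per-model training. Pose abstraction inverts the skinning pipeline to recover unified skeleton rotations directly from posed vertices of any supported model, enabling heterogeneous motion datasets to be consumed without custom retargeting. Together, these layers reduce the $O(M^2)$ per-pair adapter problem to $O(M)$ single-backend connectors, letting practitioners freely mix identity sources and pose data at inference time. The entire pipeline is fully differentiable end-to-end and GPU-accelerated via NVIDIA-Warp.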
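
The complexity claim is simple counting, and a short sketch makes the gap concrete for the four models named in the abstract. Treating each ordered pair as needing its own converter is an assumption about how "per-pair adapters" are counted; the hub label "SOMA" stands for the shared canonical representation.

```python
from itertools import permutations

models = ["SMPL", "SMPL-X", "MHR", "Anny"]

# Direct interoperability: one converter for every ordered (source, target) pair.
pairwise_adapters = list(permutations(models, 2))

# Hub-and-spoke: each model only needs a connector to the shared layer.
hub_connectors = [(model, "SOMA") for model in models]

def adapter_counts(num_models):
    """Return (pairwise, hub) adapter counts for a given number of models."""
    return num_models * (num_models - 1), num_models

counts_for_ten = adapter_counts(10)
```

With four models the difference is 12 converters versus 4 connectors; at ten models it is 90 versus 10, and adding an eleventh model costs one new connector instead of twenty new converters.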

[206] SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation

Jiongze Yu,Xiangbo Gao,Pooja Verlani,Akshay Gadde,Yilin Wang,Balu Adsumilli,Zhengzhong Tu

Main category: cs.CV

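TL;DR: This paper proposes SparkVSR, an interactive video super-resolution framework that uses sparse keyframes as a control signal to give users control over high-quality video reconstruction.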

Details Motivation: Existing VSR methods are black boxes at inference time: users cannot correct unexpected artifacts in the output, so controllability is lacking. Method: The paper proposes a two-stage training pipeline that fuses low-resolution video latents with sparse high-resolution keyframe latents, and supports several keyframe selection modes plus a reference-free guidance mechanism that balances keyframe adherence against blind restoration. Result: On multiple VSR benchmarks the method improves CLIP-IQA (+24.6%), DOVER (+21.8%), and MUSIQ (+5.6%), strengthens temporal consistency and restoration quality, and extends to new tasks such as old-film restoration and video style transfer. Conclusion: SparkVSR is a general, interactive, keyframe-driven video processing framework that combines controllability, robustness, and generalization. Abstract: Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) estimates, yet most existing VSR approaches behave like black boxes at inference time: users cannot reliably correct unexpected artifacts, but instead can only accept whatever the model produces. In this paper, we propose a novel interactive VSR framework dubbed SparkVSR that makes sparse keyframes a simple and expressive control signal. Specifically, users can first super-resolve a small set of keyframes using any off-the-shelf image super-resolution (ISR) model, then SparkVSR propagates the keyframe priors to the entire video sequence while remaining grounded by the original LR video motion. Concretely, we introduce a keyframe-conditioned latent-pixel two-stage training pipeline that fuses LR video latents with sparsely encoded HR keyframe latents to learn robust cross-space propagation and refine perceptual details. At inference time, SparkVSR supports flexible keyframe selection (manual specification, codec I-frame extraction, or random sampling) and a reference-free guidance mechanism that continuously balances keyframe adherence and blind restoration, ensuring robust performance even when reference keyframes are absent or imperfect. Experiments on multiple VSR benchmarks demonstrate improved temporal consistency and strong restoration quality, surpassing baselines by up to 24.6%, 21.8%, and 5.6% on CLIP-IQA, DOVER, and MUSIQ, respectively, enabling controllable, keyframe-driven video super-resolution.
Moreover, we demonstrate that SparkVSR is a generic interactive, keyframe-conditioned video processing framework as it can be applied out of the box to unseen tasks such as old-film restoration and video style transfer. Our project page is available at: https://sparkvsr.github.io/

[207] MessyKitchens: Contact-rich object-level 3D scene reconstruction

Junaid Ahmed Ansari,Ran Ding,Fabio Pizzati,Ivan Laptev

Main category: cs.CV

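TL;DR: This paper presents the MessyKitchens dataset and the Multi-Object Decoder (MOD) method for physically plausible (for example, penetration-free, with realistic contacts) joint reconstruction of multiple objects in monocular 3D scenes.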

Details Motivation: Existing monocular 3D scene reconstruction methods perform well on single-image depth estimation, but physically plausible multi-object decomposition and reconstruction in complex, cluttered scenes (avoiding penetration, modeling realistic contacts) remains challenging. Method: (1) Build MessyKitchens, a real-world dataset with high-fidelity object-level 3D shapes, poses, and contact annotations; (2) extend SAM 3D with a Multi-Object Decoder (MOD) for joint multi-object reconstruction. Result: MessyKitchens clearly improves on previous datasets in registration accuracy and inter-object penetration; MOD consistently and significantly outperforms the state of the art in multi-object reconstruction on three datasets. Conclusion: The new dataset and model advance physically plausible, object-level monocular 3D scene reconstruction; the benchmark, code, and pre-trained models are to be released publicly. Abstract: Monocular 3D scene reconstruction has recently seen significant progress. Powered by modern neural architectures and large-scale data, recent methods achieve high performance in depth estimation from a single image. Meanwhile, reconstructing and decomposing common scenes into individual 3D objects remains a hard challenge due to the large variety of objects, frequent occlusions and complex object relations. Notably, beyond shape and pose estimation of individual objects, applications in robotics and animation require physically-plausible scene reconstruction where objects obey physical principles of non-penetration and realistic contacts. In this work we advance object-level scene reconstruction along two directions. First, we introduce MessyKitchens, a new dataset with real-world scenes featuring cluttered environments and providing high-fidelity object-level ground truth in terms of 3D object shapes, poses and accurate object contacts. Second, we build on the recent SAM 3D approach for single-object reconstruction and extend it with Multi-Object Decoder (MOD) for joint object-level scene reconstruction. To validate our contributions, we demonstrate that MessyKitchens significantly improves on previous datasets in registration accuracy and inter-object penetration. We also compare our multi-object reconstruction approach on three datasets and demonstrate consistent and significant improvements of MOD over the state of the art. Our new benchmark, code and pre-trained models will become publicly available on our project website: https://messykitchens.github.io/.

[208] SegviGen: Repurposing 3D Generative Model for Part Segmentation

Lin Li,Haoran Feng,Zehuan Huang,Haohua Chen,Wenbo Nie,Shaohua Hou,Keqing Fan,Pan Hu,Sheng Wang,Buyu Li,Lu Sheng

Main category: cs.CV

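TL;DR: SegviGen is a new framework that uses the priors of pretrained 3D generative models for 3D part segmentation; voxel-level color encoding gives efficient, low-supervision segmentation that clearly outperforms existing methods on interactive and full segmentation.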

Details Motivation: Existing 3D part segmentation methods suffer from cross-view inconsistency and blurred boundaries, or depend on large amounts of annotated 3D data and heavy compute, so an efficient, low-supervision alternative is needed. Method: SegviGen uses the structured priors in a pretrained 3D generative model to predict part-indicative colors on the active voxels of a geometry-aligned reconstruction, and supports interactive segmentation, full segmentation, and full segmentation with 2D guidance. Result: It improves interactive part segmentation by 40% and full segmentation by 15% while using only 0.32% of the labeled training data. Conclusion: The priors of pretrained 3D generative models transfer effectively to 3D part segmentation and deliver strong performance under very limited supervision, supporting generative-to-discriminative transfer as a viable route. Abstract: We introduce SegviGen, a framework that repurposes native 3D generative models for 3D part segmentation. Existing pipelines either lift strong 2D priors into 3D via distillation or multi-view mask aggregation, often suffering from cross-view inconsistency and blurred boundaries, or explore native 3D discriminative segmentation, which typically requires large-scale annotated 3D data and substantial training resources. In contrast, SegviGen leverages the structured priors encoded in a pretrained 3D generative model to induce segmentation through distinctive part colorization, establishing a novel and efficient framework for part segmentation. Specifically, SegviGen encodes a 3D asset and predicts part-indicative colors on active voxels of a geometry-aligned reconstruction. It supports interactive part segmentation, full segmentation, and full segmentation with 2D guidance in a unified framework. Extensive experiments show that SegviGen improves over the prior state of the art by 40% on interactive part segmentation and by 15% on full segmentation, while using only 0.32% of the labeled training data. It demonstrates that pretrained 3D generative priors transfer effectively to 3D part segmentation, enabling strong performance with limited supervision. See our project page at https://fenghora.github.io/SegviGen-Page/.
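
If parts are expressed as colors on voxels, one simple way to read segment labels back out is to snap each predicted color to the nearest entry of a reference palette. The sketch below shows that decoding step only; the palette, the RGB values, and the nearest-color rule are assumptions for illustration, since the abstract does not describe how SegviGen converts predicted colors into part labels.

```python
import math

def decode_parts(voxel_colors, palette):
    """Assign each voxel the index of the closest palette color (RGB in [0, 1])."""
    return [
        min(range(len(palette)), key=lambda k: math.dist(color, palette[k]))
        for color in voxel_colors
    ]

# Hypothetical palette with one distinctive color per part.
palette = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

# Hypothetical, slightly noisy colors predicted on three active voxels.
predicted = [(0.9, 0.1, 0.0), (0.1, 0.8, 0.2), (0.2, 0.1, 0.7)]
part_ids = decode_parts(predicted, palette)
```

Framing segmentation as colorization is what lets a generative model's appearance prior be reused: the model only has to paint parts in distinguishable colors, and a cheap post-processing step of this kind recovers discrete labels.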

[209] Demystifing Video Reasoning

Ruisi Wang,Zhongang Cai,Fanyi Pu,Junxiang Xu,Wanqi Yin,Maijunxian Wang,Ran Ji,Chenyang Gu,Bo Li,Ziqi Huang,Hokin Deng,Dahua Lin,Ziwei Liu,Lei Yang

Main category: cs.CV

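TL;DR: This paper challenges the assumption that reasoning in video generation models arises from sequential frame-to-frame inference (Chain-of-Frames). It argues that the core mechanism is instead a reasoning process that evolves along the diffusion denoising steps (Chain-of-Steps). It also reveals emergent behaviors such as working memory, self-correction, and perception before action, together with functional specialization across layers inside the Diffusion Transformer. Based on these findings, the authors propose a training-free latent-trajectory ensembling strategy to improve reasoning performance.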

Details Motivation: Prior work attributes the reasoning ability of video diffusion models to sequential frame-to-frame inference (CoF), but this assumption has not been verified in depth; this paper investigates the dimension along which reasoning actually occurs and its underlying mechanism. Method: Uses qualitative analysis and targeted probing experiments to systematically analyze model behavior across diffusion steps and network layers; identifies and characterizes emergent reasoning properties (working memory, self-correction, perception before action); analyzes the division of function across Diffusion Transformer layers; proposes a training-free strategy that improves reasoning by ensembling latent trajectories from multiple random seeds. Result: Reasoning unfolds mainly along the diffusion steps (Chain-of-Steps) rather than across the frame sequence; three key emergent reasoning behaviors are identified; self-evolved functional specialization across layers is confirmed; the ensembling strategy improves performance on multiple video reasoning tasks. Conclusion: The reasoning ability of video generation models is rooted in the dynamics of the diffusion process rather than in spatiotemporal structure alone; understanding and exploiting Chain-of-Steps and its accompanying emergent mechanisms offers a new paradigm for building video foundation models capable of reasoning. Abstract: Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. Within a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.
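
The seed-ensembling idea can be illustrated with a toy sketch. The abstract does not say how or when trajectories are combined, so this is only one possible reading under loud assumptions: `denoise_step` is a hypothetical stand-in for a real denoiser, latents are plain lists of floats, and the choice to average latents during the first `merge_until` steps and then let each seed run independently is invented for illustration.

```python
import random

TARGET = [1.0, -1.0, 0.5]  # toy "clean" latent; purely illustrative


def denoise_step(latent, step, total_steps, rng):
    """Hypothetical stand-in for one denoising step of a video diffusion model."""
    noise_scale = 0.3 * (1.0 - (step + 1) / total_steps)  # noise fades to zero
    return [x + 0.5 * (t - x) + rng.gauss(0.0, noise_scale)
            for x, t in zip(latent, TARGET)]


def mean_latent(latents):
    """Element-wise mean over a list of equally sized latents."""
    return [sum(vals) / len(latents) for vals in zip(*latents)]


def ensemble_denoise(seeds, total_steps=10, merge_until=5):
    """Run one trajectory per seed and average latents during the early steps."""
    rngs = [random.Random(s) for s in seeds]
    latents = [[rng.gauss(0.0, 1.0) for _ in TARGET] for rng in rngs]
    for step in range(total_steps):
        latents = [denoise_step(z, step, total_steps, rng)
                   for z, rng in zip(latents, rngs)]
        if step < merge_until:
            # Assumed merge rule: replace every trajectory with the ensemble mean.
            merged = mean_latent(latents)
            latents = [list(merged) for _ in rngs]
    return mean_latent(latents)


result = ensemble_denoise(seeds=[0, 1, 2, 3])
```

Merging only in the early steps mirrors the abstract's observation that candidate solutions are explored early and converge later, but the actual schedule used by the authors may differ.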

[210] WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

Jisu Nam,Yicong Hong,Chun-Hao Paul Huang,Feng Liu,JoungBin Lee,Jiyoung Kim,Siyoon Jin,Yunsung Lee,Jaeyoon Jung,Suhwan Choi,Seungryong Kim,Yang Zhou

Main category: cs.CV

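TL;DR: This paper proposes a video diffusion transformer method built on camera pose as a unifying representation. It maps user actions to 6-DoF camera poses via the Lie algebra and uses global poses for geometrically consistent long-horizon navigation, substantially improving action controllability and 3D consistency in interactive gaming world models.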

Details Motivation: Existing video diffusion transformers struggle with precise action control and long-horizon 3D consistency in interactive gaming world modeling, mainly because they overlook the inherent geometric coupling between actions and the 3D world: actions induce camera motions that accumulate into a global pose. Method: (1) Define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model through a camera embedder; (2) use global camera poses as spatial indices to retrieve past observations, supporting geometrically consistent revisiting over long horizons; (3) build a large-scale dataset of 3,000 minutes of authentic human gameplay video annotated with camera trajectories and text descriptions. Result: Substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency. Conclusion: Camera pose can serve as a unifying geometric representation that jointly grounds immediate action control and long-term 3D consistency, offering a new paradigm for interactive generative world models. Abstract: Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.
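
The geometric pipeline described above (Lie-algebra actions, accumulated global pose, pose-indexed retrieval) can be sketched with the standard se(3) exponential map. This is a minimal sketch under stated assumptions, not the authors' implementation: the twist ordering `(vx, vy, vz, wx, wy, wz)`, the right-multiplication convention, and the position-distance retrieval in the hypothetical `nearest_past_frames` are all illustrative choices, and the camera embedder is not modeled.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def hat(w):
    """Skew-symmetric matrix of w; applying it to x gives the cross product of w and x."""
    wx, wy, wz = w
    return [[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]]


def se3_exp(twist):
    """Exponential map se(3) -> SE(3) for twist = (vx, vy, vz, wx, wy, wz)."""
    v, w = list(twist[:3]), list(twist[3:])
    theta = math.sqrt(sum(x * x for x in w))
    if theta < 1e-8:  # small-angle limits of the coefficients
        a, b, c = 1.0, 0.5, 1.0 / 6.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    K = hat(w)
    K2 = matmul(K, K)
    eye = [[float(i == j) for j in range(3)] for i in range(3)]
    # Rodrigues' formula for the rotation, and the matching V for the translation.
    R = [[eye[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)] for i in range(3)]
    V = [[eye[i][j] + b * K[i][j] + c * K2[i][j] for j in range(3)] for i in range(3)]
    t = [sum(V[i][j] * v[j] for j in range(3)) for i in range(3)]
    return [R[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def accumulate(twists):
    """Compose per-step relative motions into a list of global camera poses."""
    pose = se3_exp((0.0,) * 6)  # identity pose
    poses = [pose]
    for twist in twists:
        pose = matmul(pose, se3_exp(twist))  # assumed convention: right-multiply
        poses.append(pose)
    return poses


def nearest_past_frames(poses, query, k=2):
    """Indices of the k stored poses whose camera centers are closest to `query`."""
    def dist(p):
        return math.dist([row[3] for row in p[:3]], [row[3] for row in query[:3]])
    return sorted(range(len(poses)), key=lambda i: dist(poses[i]))[:k]


# Walk forward twice along x, then turn 90 degrees about the z axis twice.
steps = [(1, 0, 0, 0, 0, 0), (1, 0, 0, 0, 0, 0),
         (0, 0, 0, 0, 0, math.pi / 2), (0, 0, 0, 0, 0, math.pi / 2)]
poses = accumulate(steps)
retrieved = nearest_past_frames(poses[:-1], poses[-1], k=2)
```

Here `retrieved` holds the indices of the two earlier frames captured closest to the current camera position. A practical retrieval rule would likely also compare viewing direction, which this sketch ignores.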