cs.CL

[1] Uncovering Competency Gaps in Large Language Models and Their Benchmarks

Matyas Bohacek,Nino Scherrer,Nicholas Dufour,Thomas Leung,Christoph Bregler,Stephanie C. Y. Chan

Main category: cs.CL

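TL;DR: This paper proposes a new method based on sparse autoencoders (SAEs) that automatically uncovers "model gaps" and "benchmark gaps" in large language models and their benchmarks. By grounding evaluation in the model's internal representations, it enables comparison across benchmarks and exposes fine-grained issues hidden behind aggregate scores.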

Details

Motivation: Aggregate benchmark metrics can mask a model's weaknesses in specific sub-areas as well as uneven coverage in the benchmarks themselves, so a finer-grained evaluation method grounded in the model's internal representations is needed to reveal these hidden gaps.

Method: The approach uses sparse autoencoders (SAEs) to extract the model's concept activations and combines them with saliency-weighted performance scores across data from multiple benchmarks, yielding a concept-level decomposition of model performance and enabling cross-benchmark comparison.

Result: Validated on two open-source models and ten benchmarks, the method identified model weaknesses on concepts opposed to sycophancy (e.g., politely refusing, setting boundaries) and on safety-related concepts. It also found that many benchmarks over-represent obedience-related concepts while missing core content.

Conclusion: The method offers an interpretable, fine-grained complement to LLM evaluation: it can reveal the reasons behind aggregate scores and guide future benchmark design, complementing rather than replacing conventional metrics.

Abstract: The evaluation of large language models (LLMs) relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics for a given capability, but those aggregated metrics can obscure (i) particular sub-areas where the LLMs are weak ("model gaps") and (ii) imbalanced coverage in the benchmarks themselves ("benchmark gaps"). We propose a new method that uses sparse autoencoders (SAEs) to automatically uncover both types of gaps. By extracting SAE concept activations and computing saliency-weighted performance scores across benchmark data, the method grounds evaluation in the model's internal representations and enables comparison across benchmarks. As examples demonstrating our approach, we applied the method to two popular open-source models and ten benchmarks. We found that these models consistently underperformed on concepts that stand in contrast to sycophantic behaviors (e.g., politely refusing a request or asserting boundaries) and concepts connected to safety discussions. These model gaps align with observations previously surfaced in the literature; our automated, unsupervised method was able to recover them without manual supervision. We also observed benchmark gaps: many of the evaluated benchmarks over-represented concepts related to obedience, authority, or instruction-following, while missing core concepts that should fall within their intended scope. In sum, our method offers a representation-grounded approach to evaluation, enabling concept-level decomposition of benchmark scores. Rather than replacing conventional aggregated metrics, CG complements them by providing a concept-level decomposition that can reveal why a model scored as it did and how benchmarks could evolve to better reflect their intended scope. Code is available at https://competency-gaps.github.io.
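The abstract does not give the formula for its saliency-weighted scores, so the following is only a minimal sketch of one plausible reading: each benchmark example contributes its score to a concept in proportion to how strongly that concept activates. The function names, the concept labels, and the use of raw activations as weights are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List


def concept_scores(activations: List[Dict[str, float]],
                   scores: List[float]) -> Dict[str, float]:
    """Activation-weighted mean score per concept (assumed weighting scheme)."""
    weighted_sum: Dict[str, float] = {}
    weight_total: Dict[str, float] = {}
    for acts, score in zip(activations, scores):
        for concept, weight in acts.items():
            if weight <= 0.0:
                continue  # inactive concepts contribute nothing
            weighted_sum[concept] = weighted_sum.get(concept, 0.0) + weight * score
            weight_total[concept] = weight_total.get(concept, 0.0) + weight
    return {c: weighted_sum[c] / weight_total[c] for c in weight_total}


def concept_coverage(activations: List[Dict[str, float]]) -> Dict[str, float]:
    """Share of total activation mass per concept: a rough proxy for coverage."""
    mass: Dict[str, float] = {}
    for acts in activations:
        for concept, weight in acts.items():
            mass[concept] = mass.get(concept, 0.0) + max(weight, 0.0)
    total = sum(mass.values())
    return {c: m / total for c, m in mass.items()} if total else {}


# Hypothetical per-example concept activations and correctness (1.0 = correct).
examples = [
    {"polite_refusal": 0.9, "arithmetic": 0.1},
    {"arithmetic": 1.0},
    {"polite_refusal": 0.6},
]
correct = [0.0, 1.0, 1.0]

per_concept = concept_scores(examples, correct)
coverage = concept_coverage(examples)
```

Under this reading, a low value in `per_concept` would flag a candidate model gap, while a very small share in `coverage` would flag a candidate benchmark gap.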

[2] SA-DiffuSeq: Addressing Computational and Scalability Challenges in Long-Document Generation with Sparse Attention

Alexandros Christoforos,Chadbourne Davis

Main category: cs.CL

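TL;DR: SA-DiffuSeq is a diffusion framework that incorporates sparse attention to generate long text efficiently, substantially reducing computational complexity while preserving generation quality.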

Details

Motivation: Existing diffusion models incur high computational cost and memory overhead on long sequences, which makes them hard to scale to long-text generation.

Method: SA-DiffuSeq introduces sparse attention into the diffusion process and designs a soft absorbing state to stabilize diffusion trajectories, improving sampling efficiency and long-range dependency modeling.

Result: Experiments show that SA-DiffuSeq outperforms existing diffusion models in training efficiency and sampling speed, with particularly strong results on long-sequence generation.

Conclusion: Introducing structured sparsity into diffusion models is an effective direction for efficient, expressive long-text generation.

Abstract: Diffusion-based approaches to long-form text generation suffer from prohibitive computational cost and memory overhead as sequence length increases. We introduce SA-DiffuSeq, a diffusion framework that integrates sparse attention to fundamentally improve scalability for long-document modeling. By selectively allocating attention within the diffusion process, SA-DiffuSeq significantly reduces computational complexity while maintaining semantic coherence and generation quality. A key component of our method is a soft absorbing state tailored to sparse attention dynamics, which stabilizes diffusion trajectories and accelerates sequence reconstruction. This design improves sampling efficiency and enhances precision in long-range dependency modeling. Extensive experiments demonstrate that SA-DiffuSeq consistently surpasses state-of-the-art diffusion baselines in both training efficiency and sampling speed, with especially strong gains on extended sequences. These properties make SA-DiffuSeq well suited for demanding long-form applications such as scientific writing, large-scale code generation, and multi-turn long-context dialogue. Overall, our results indicate that incorporating structured sparsity into diffusion models is a promising direction for efficient and expressive long-text generation.
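To make the cost argument concrete, here is a small sketch of why sparse attention scales better than dense attention. The abstract does not say which sparsity pattern SA-DiffuSeq uses, so the local sliding window below is an assumed stand-in, and `local_attention_mask` and `attended_pairs` are hypothetical helper names.

```python
from typing import List


def local_attention_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j."""
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]


def attended_pairs(mask: List[List[bool]]) -> int:
    """Number of query-key pairs that are actually computed."""
    return sum(sum(row) for row in mask)


seq_len, window = 8, 1
mask = local_attention_mask(seq_len, window)
dense_pairs = seq_len * seq_len      # dense attention grows quadratically
sparse_pairs = attended_pairs(mask)  # a fixed window grows roughly linearly
```

With a fixed window the number of computed pairs grows linearly in sequence length, which is the kind of saving the abstract attributes to selective attention allocation.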

[3] TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior

Gül Sena Altıntaş,Malikeh Ehghaghi,Brian Lester,Fengyuan Liu,Wanru Zhao,Marco Ciccone,Colin Raffel

Main category: cs.CL

TL;DR: This paper presents TokSuite, a collection of models and a benchmark for studying how tokenizers affect language models.

Details

Motivation: Because the impact of tokenization is hard to measure in isolation, its role in language model performance remains unclear.

Method: The authors train fourteen models that use different tokenizers but are otherwise identical, and build a new benchmark that evaluates model performance under real-world perturbations.

Result: TokSuite enables robust decoupling of the tokenizer's influence and reveals the strengths and weaknesses of a range of popular tokenizers.

Conclusion: TokSuite supports a deeper understanding of the tokenizer's role in language models and advances research in this area.

Abstract: Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood due to the challenge of measuring the impact of tokenization in isolation. To address this need, we present TokSuite, a collection of models and a benchmark that supports research into tokenization's influence on LMs. Specifically, we train fourteen models that use different tokenizers but are otherwise identical, using the same architecture, dataset, training budget, and initialization. Additionally, we curate and release a new benchmark that specifically measures model performance subject to real-world perturbations that are likely to influence tokenization. Together, TokSuite allows robust decoupling of the influence of a model's tokenizer, supporting a series of novel findings that elucidate the respective benefits and shortcomings of a wide range of popular tokenizers.

[4] Adversarial Training for Failure-Sensitive User Simulation in Mental Health Dialogue Optimization

Ziyi Zhu,Olivier Tieleman,Caitlin A. Stamatis,Luka Smyth,Thomas D. Hull,Daniel R. Cahn,Matteo Malgaroli

Main category: cs.CL

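TL;DR: The paper proposes an adversarially trained user simulator framework for evaluating task-oriented dialogue systems in mental health support chatbots, markedly increasing the simulator's realism and its ability to uncover system failures.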

Details

Motivation: Existing user simulators struggle to reflect real human behavior and are weak at exposing failures in dialogue systems, so more realistic simulation is needed to evaluate task-oriented dialogue systems effectively.

Method: An adversarial training framework iteratively improves the simulator through competition between a generator (the user simulator) and a discriminator. Models are fine-tuned and iteratively trained in a mental health support dialogue setting.

Result: Fine-tuned simulators significantly outperform zero-shot base models and surface system issues more effectively. Adversarial training improves behavioral diversity, distributional alignment, and predictive validity. Simulated failure rates correlate strongly with real failure rates across chatbot configurations, with low divergence in failure-mode distributions. Discriminator accuracy drops sharply after three adversarial iterations, indicating improved realism.

Conclusion: Adversarial training is an effective route to highly realistic user simulators for mental health support, enabling rapid, reliable, low-cost evaluation before deployment.

Abstract: Realistic user simulation is crucial for training and evaluating task-oriented dialogue (TOD) systems, yet creating simulators that accurately replicate human behavior remains challenging. A key property of effective simulators is their ability to expose failure modes of the systems they evaluate. We present an adversarial training framework that iteratively improves user simulator realism through a competitive dynamic between a generator (user simulator) and a discriminator. Applied to mental health support chatbots, our approach demonstrates that fine-tuned simulators dramatically outperform zero-shot base models at surfacing system issues, and adversarial training further enhances diversity, distributional alignment, and predictive validity. The resulting simulator achieves a strong correlation between simulated and real failure occurrence rates across diverse chatbot configurations while maintaining low distributional divergence of failure modes. Discriminator accuracy decreases drastically after three adversarial iterations, suggesting improved realism. These results provide evidence that adversarial training is a promising approach for creating realistic user simulators in mental health support TOD domains, enabling rapid, reliable, and cost-effective system evaluation before deployment.

[5] Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles

Ramatu Oiza Abdulsalam,Segun Aroyehun

Main category: cs.CL

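TL;DR: By comparing responses from expert tutors, novice tutors, and large language models in math tutoring, the study finds that LLMs approach expert levels of perceived pedagogical quality but differ systematically in instructional strategy and linguistic profile. For example, they use restating and revoicing less often and produce longer, more polite responses.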

Details

Motivation: To examine whether large language models in math tutoring actually align with the instructional practice of expert human tutors.

Method: A controlled, turn-level comparison analyzes how expert tutors, novice tutors, and multiple LLMs respond to the same math tutoring conversation turns, assessing both instructional strategies and linguistic characteristics.

Result: LLMs approach expert levels of perceived pedagogical quality but underuse restating and revoicing, and produce longer, more lexically diverse, and more polite responses. Statistically, restating and revoicing, lexical diversity, and pressing for accuracy are positively associated with pedagogical quality, while more polite and agentic language is negatively associated.

Conclusion: Although LLMs reach expert-like perceived pedagogical quality, they rely on different instructional and linguistic strategies, which underscores the need to analyze specific instructional behaviors and linguistic features when evaluating intelligent tutoring systems.

Abstract: Recent work has explored the use of large language models for generating tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice. We examine this question using a controlled, turn-level comparison in which expert human tutors, novice human tutors, and multiple large language models respond to the same set of math remediation conversation turns. We examine both instructional strategies and linguistic characteristics of tutoring responses, including restating and revoicing, pressing for accuracy, lexical diversity, readability, politeness, and agency. We find that large language models approach expert levels of perceived pedagogical quality on average but exhibit systematic differences in their instructional and linguistic profiles. In particular, large language models tend to underuse restating and revoicing strategies characteristic of expert human tutors, while producing longer, more lexically diverse, and more polite responses. Statistical analyses show that restating and revoicing, lexical diversity, and pressing for accuracy are positively associated with perceived pedagogical quality, whereas higher levels of agentic and polite language are negatively associated. Overall, recent large language models exhibit levels of perceived pedagogical quality comparable to expert human tutors, while relying on different instructional and linguistic strategies. These findings underscore the value of analyzing instructional strategies and linguistic characteristics when evaluating tutoring responses across human tutors and intelligent tutoring systems.

[6] Investigating Model Editing for Unlearning in Large Language Models

Shariqah Hossain,Lalana Kagal

Main category: cs.CL

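TL;DR: This paper explores whether model editing algorithms (ROME, IKE, and WISE) can be used for machine unlearning. It finds that they can outperform traditional unlearning methods in certain settings but still struggle to control the scope of forgetting while preserving model performance.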

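Details

Motivation: Existing unlearning methods are either inefficient for large language models or fail to fully remove the target information without affecting knowledge that should be retained, so the paper asks whether model editing algorithms can unlearn more effectively.

Method: The study applies three model editing algorithms (ROME, IKE, and WISE), designs new editing targets suited to unlearning, and evaluates forgetting quality and the impact on model performance across settings.

Result: Model editing outperforms baseline unlearning methods in some cases and achieves higher-quality forgetting, but it still cannot fully control the scope of forgetting and may damage overall model performance.

Conclusion: Model editing algorithms show promise for machine unlearning, but they need further improvement in precisely bounding what is forgotten and in protecting unrelated knowledge.

Abstract: Machine unlearning aims to remove unwanted information from a model, but many methods are inefficient for LLMs with large numbers of parameters or fail to fully remove the intended information without degrading performance on knowledge that should be retained. Model editing algorithms solve a similar problem of changing information in models, but they focus on redirecting inputs to a new target rather than removing that information altogether. In this work, we explore the editing algorithms ROME, IKE, and WISE and design new editing targets for an unlearning setting. Through this investigation, we show that model editing approaches can exceed baseline unlearning methods in terms of quality of forgetting, depending on the setting. Like traditional unlearning techniques, they struggle to encapsulate the scope of what is to be unlearned without damage to the overall model performance.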

[7] Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?

Zhengyang Shan,Aaron Mueller

Main category: cs.CL

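TL;DR: The study examines how independent demographic bias mechanisms are from demographic recognition in language models, and proposes targeted debiasing through sparse autoencoder feature ablation that preserves recognition performance.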

Details

Motivation: To determine whether demographic bias arises from task-specific mechanisms rather than from basic demographic recognition, so that debiasing interventions can be targeted.

Method: A multi-task evaluation framework combines attribution-based and correlation-based methods to locate bias features, with sparse autoencoder feature ablation experiments on Gemma-2-9B.

Result: Attribution-based ablation mitigates race and gender profession stereotypes without harming name recognition, and correlation-based ablation works better for education bias. Removing attribution features in education tasks, however, causes "prior collapse" and increases overall bias.

Conclusion: Demographic bias stems from task-specific mechanisms rather than absolute demographic markers, and mechanistic inference-time interventions can debias surgically without compromising core model capabilities.

Abstract: We investigate how independent demographic bias mechanisms are from general demographic recognition in language models. Using a multi-task evaluation setup where demographics are associated with names, professions, and education levels, we measure whether models can be debiased while preserving demographic detection capabilities. We compare attribution-based and correlation-based methods for locating bias features. We find that targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading recognition performance: attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy, whereas correlation-based ablations are more effective for education bias. Qualitative analysis further reveals that removing attribution features in education tasks induces "prior collapse", thus increasing overall bias. This highlights the need for dimension-specific interventions. Overall, our results show that demographic bias arises from task-specific mechanisms rather than absolute demographic markers, and that mechanistic inference-time interventions can enable surgical debiasing without compromising core model capabilities.
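As a rough illustration of what a sparse autoencoder feature ablation involves, the sketch below encodes an activation vector into SAE features and then subtracts the decoder contribution of selected features from the activation. This is one common way to ablate a feature at inference time; the abstract does not state which variant the authors used, and the toy weights, dimensions, and function names here are assumptions.

```python
from typing import Iterable, List


def encode(x: List[float], w_enc: List[List[float]],
           b_enc: List[float]) -> List[float]:
    """SAE encoder: ReLU(W_enc x + b_enc), one value per feature."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w_enc, b_enc)]


def ablate(x: List[float], feats: List[float], w_dec: List[List[float]],
           ablate_ids: Iterable[int]) -> List[float]:
    """Remove the chosen features' decoder directions from the activation."""
    out = list(x)
    for k in ablate_ids:
        for d in range(len(out)):
            out[d] -= feats[k] * w_dec[k][d]
    return out


# Toy 2-dimensional activation with 3 hypothetical SAE features.
x = [2.0, -1.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one decoder direction per feature

feats = encode(x, w_enc, b_enc)
x_ablated = ablate(x, feats, w_dec, ablate_ids=[2])  # ablate feature 2 only
```

Because only the selected feature's direction is subtracted, everything else in the activation is left in place, which is the property that makes such an intervention "surgical" in the abstract's sense.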

[8] Semantic Deception: When Reasoning Models Can't Compute an Addition

Nathaniël de Leeuw,Marceau Nahon,Mathis Reymond,Raja Chatila,Mehdi Khamassi

Main category: cs.CL

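TL;DR: By introducing semantic deceptions and novel symbol systems, the study tests large language models on abstract symbolic reasoning and finds that they are easily swayed by surface semantics, revealing limits in genuine symbol manipulation.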

Details

Motivation: To investigate whether LLMs possess human-like abstract reasoning, particularly in decision tasks where human values are at stake, and to avoid the ethical and societal risks of wrongly attributing reasoning abilities to them.

Method: Digits and operators are redefined as novel symbols, and "semantic deception" scenarios with misleading semantic associations are constructed. Four LLMs are asked to perform simple calculations, which tests their ability to follow instructions and resist semantic interference.

Result: Even on very simple tasks, semantic cues significantly degrade model performance, indicating that models lean on surface semantics rather than true symbolic abstraction and that chain-of-thought reasoning may amplify reliance on statistical correlations.

Conclusion: Current LLMs have fundamental limitations in symbolic reasoning and over-rely on semantic associations from their training data. This poses risks in critical decision-making settings and cautions against treating their behavior as genuine reasoning.

Abstract: Large language models (LLMs) are increasingly used in situations where human values are at stake, such as decision-making tasks that involve reasoning when performed by humans. We investigate the so-called reasoning capabilities of LLMs over novel symbolic representations by introducing an experimental framework that tests their ability to process and manipulate unfamiliar symbols. We introduce semantic deceptions: situations in which symbols carry misleading semantic associations due to their form, such as being embedded in specific contexts, designed to probe whether LLMs can maintain symbolic abstraction or whether they default to exploiting learned semantic associations. We redefine standard digits and mathematical operators using novel symbols, and task LLMs with solving simple calculations expressed in this altered notation. The objective is: (1) to assess LLMs' capacity for abstraction and manipulation of arbitrary symbol systems; (2) to evaluate their ability to resist misleading semantic cues that conflict with the task's symbolic logic. Through experiments with four LLMs we show that semantic cues can significantly deteriorate reasoning models' performance on very simple tasks. They reveal limitations in current LLMs' ability for symbolic manipulations and highlight a tendency to over-rely on surface-level semantics, suggesting that chain-of-thought reasoning may amplify reliance on statistical correlations. Even in situations where LLMs seem to correctly follow instructions, semantic cues still impact basic capabilities. These limitations raise ethical and societal concerns, undermining the widespread and pernicious tendency to attribute reasoning abilities to LLMs and suggesting how LLMs might fail, in particular in decision-making contexts where robust symbolic reasoning is essential and should not be compromised by residual semantic associations inherited from the model's training.
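The redefinition of digits can be pictured with a small sketch. The two symbol tables below are invented for illustration (the abstract does not list the paper's actual symbols): a neutral table maps digits to arbitrary letters, while a deceptive table reuses digit glyphs with different meanings, so that the surface form pulls toward the wrong reading.

```python
from typing import Dict, Tuple


def make_codec(symbols: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """symbols[d] is the glyph that stands for digit d (requires 10 distinct glyphs)."""
    assert len(symbols) == 10 and len(set(symbols)) == 10
    enc = {str(d): s for d, s in enumerate(symbols)}
    dec = {s: d for d, s in enc.items()}
    return enc, dec


def encode_number(n: int, enc: Dict[str, str]) -> str:
    return "".join(enc[ch] for ch in str(n))


def decode_number(text: str, dec: Dict[str, str]) -> int:
    return int("".join(dec[ch] for ch in text))


# Hypothetical symbol tables, not taken from the paper.
neutral_enc, neutral_dec = make_codec("qwxzjkvbpm")      # arbitrary letters
deceptive_enc, deceptive_dec = make_codec("3812069547")  # "3" means 0, "8" means 1, ...

a, b = 12, 5
prompt = f"{encode_number(a, deceptive_enc)} + {encode_number(b, deceptive_enc)}"
expected = encode_number(a + b, deceptive_enc)  # correct answer in the new notation
naive = str(int(encode_number(a, deceptive_enc)) +
            int(encode_number(b, deceptive_enc)))  # answer if glyphs are read literally
```

A solver that respects the redefinition must produce `expected`; one that falls back on the familiar meaning of the glyphs produces `naive`, which is the kind of failure the semantic-deception setup is built to detect.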

[9] EssayCBM: Rubric-Aligned Concept Bottleneck Models for Transparent Essay Grading

Kumar Satvik Chaudhary,Chengshuai Zhao,Fan Zhang,Yung Hin Tse,Garima Agrawal,Yuli Deng,Huan Liu

Main category: cs.CL

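TL;DR: This paper presents EssayCBM, an interpretable essay grading framework that produces transparent grades by assessing eight writing concepts and lets instructors intervene and adjust them, yielding actionable feedback.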

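Details

Motivation: To address the lack of transparency and interpretability in automated grading systems (especially black-box systems based on large language models) and help educators and students understand how grades are produced.

Method: An encoder-based multi-task model uses dedicated prediction heads to score eight writing concepts (such as thesis clarity and evidence use). A lightweight network then aggregates the concept scores into a final grade, and a human-in-the-loop interface allows concept scores to be adjusted.

Result: EssayCBM matches black-box performance while offering greater interpretability and flexibility, letting instructors adjust concept scores and see the effect on the overall grade in real time.

Conclusion: EssayCBM delivers interpretable, intervenable automated essay grading that balances performance with transparency and supports human-AI collaborative assessment in education.

Abstract: Understanding how automated grading systems evaluate essays remains a significant challenge for educators and students, especially when large language models function as black boxes. We introduce EssayCBM, a rubric-aligned framework that prioritizes interpretability in essay assessment. Instead of predicting grades directly from text, EssayCBM evaluates eight writing concepts, such as Thesis Clarity and Evidence Use, through dedicated prediction heads on an encoder. These concept scores form a transparent bottleneck, and a lightweight network computes the final grade using only concepts. Instructors can adjust concept predictions and instantly view the updated grade, enabling accountable human-in-the-loop evaluation. EssayCBM matches black-box performance while offering actionable, concept-level feedback through an intuitive web interface.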
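The bottleneck idea can be sketched in a few lines: the final grade is computed only from concept scores, so overriding a concept immediately changes the grade. The linear aggregator, the weights, and the two-concept subset below are illustrative assumptions; the abstract says only that a "lightweight network" maps eight concepts to the grade.

```python
from typing import Dict


def final_grade(concepts: Dict[str, float], weights: Dict[str, float],
                bias: float = 0.0) -> float:
    """Grade depends only on concept scores (the transparent bottleneck)."""
    return bias + sum(weights[name] * concepts[name] for name in weights)


def with_override(concepts: Dict[str, float],
                  overrides: Dict[str, float]) -> Dict[str, float]:
    """Instructor intervention: replace selected concept predictions."""
    return {**concepts, **overrides}


# Two of the eight concepts named in the abstract; scores and weights are made up.
predicted = {"thesis_clarity": 3.0, "evidence_use": 2.0}
weights = {"thesis_clarity": 0.5, "evidence_use": 0.5}

grade_before = final_grade(predicted, weights)
adjusted = with_override(predicted, {"evidence_use": 4.0})
grade_after = final_grade(adjusted, weights)
```

Because the text never reaches the grading function directly, every change in the grade can be traced to a change in a named concept score.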

[10] MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs

Zhan Qu,Michael Färber

Main category: cs.CL

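TL;DR: MediEval is a benchmark for evaluating medical LLMs that combines real patient data with a unified knowledge base. The paper also proposes CoRFu, a fine-tuning method that markedly improves model accuracy and safety.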

Details

Motivation: Existing evaluations of medical LLMs cannot test factuality and contextual consistency together, and they lack a systematic assessment of reliability and safety.

Method: MediEval links MIMIC-IV electronic health records to knowledge bases such as UMLS, generates factual and counterfactual statements, and evaluates them in a four-quadrant framework. CoRFu, a DPO-based fine-tuning method, uses an asymmetric penalty to reduce dangerous errors.

Result: Current LLMs commonly exhibit critical failures such as hallucinated support and truth inversion. CoRFu improves macro-F1 by 16.4 points over the base model and eliminates truth inversion errors.

Conclusion: Jointly evaluating knowledge grounding and contextual consistency exposes the risks of medical LLMs more fully, and CoRFu improves both safety and performance, supporting more reliable use in clinical settings.

Abstract: Large Language Models (LLMs) are increasingly applied to medicine, yet their adoption is limited by concerns over reliability and safety. Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. We introduce MediEval, a benchmark that links MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies. MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework that jointly considers knowledge grounding and contextual consistency. Using this framework, we identify critical failure modes, including hallucinated support and truth inversion, that current proprietary, open-source, and domain-specific LLMs frequently exhibit. To address these risks, we propose Counterfactual Risk-Aware Fine-tuning (CoRFu), a DPO-based method with an asymmetric penalty targeting unsafe confusions. CoRFu improves by +16.4 macro-F1 points over the base model and eliminates truth inversion errors, demonstrating both higher accuracy and substantially greater safety.
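Since the headline result is reported in macro-F1 points, a short reminder of how macro-F1 is computed may help: F1 is calculated per class and then averaged without weighting by class frequency, so rare classes count as much as common ones. The label names below are hypothetical placeholders and are not MediEval's actual label set.

```python
from typing import List


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_values = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values) if f1_values else 0.0


# Hypothetical labels for illustration only.
gold = ["supported", "supported", "refuted", "refuted"]
pred = ["supported", "refuted", "refuted", "refuted"]
score = macro_f1(gold, pred)
```

A gain of 16.4 points on this metric corresponds to a rise of 0.164 on the 0-to-1 scale returned by `macro_f1`.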

[11] Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

NVIDIA,:,Aaron Blakeman,Aaron Grattafiori,Aarti Basant,Abhibha Gupta,Abhinav Khattar,Adi Renduchintala,Aditya Vavre,Akanksha Shukla,Akhiad Bercovich,Aleksander Ficek,Aleksandr Shaposhnikov,Alex Kondratenko,Alexander Bukharin,Alexandre Milesi,Ali Taghibakhshi,Alisa Liu,Amelia Barton,Ameya Sunil Mahabaleshwarkar,Amir Klein,Amit Zuker,Amnon Geifman,Amy Shen,Anahita Bhiwandiwalla,Andrew Tao,Ann Guan,Anubhav Mandarwal,Arham Mehta,Ashwath Aithal,Ashwin Poojary,Asif Ahamed,Asma Kuriparambil Thekkumpate,Ayush Dattagupta,Banghua Zhu,Bardiya Sadeghi,Barnaby Simkin,Ben Lanir,Benedikt Schifferer,Besmira Nushi,Bilal Kartal,Bita Darvish Rouhani,Boris Ginsburg,Brandon Norick,Brandon Soubasis,Branislav Kisacanin,Brian Yu,Bryan Catanzaro,Carlo del Mundo,Chantal Hwang,Charles Wang,Cheng-Ping Hsieh,Chenghao Zhang,Chenhan Yu,Chetan Mungekar,Chintan Patel,Chris Alexiuk,Christopher Parisien,Collin Neale,Damon Mosk-Aoyama,Dan Su,Dane Corneil,Daniel Afrimi,Daniel Rohrer,Daniel Serebrenik,Daria Gitman,Daria Levy,Darko Stosic,David Mosallanezhad,Deepak Narayanan,Dhruv Nathawani,Dima Rekesh,Dina Yared,Divyanshu Kakwani,Dong Ahn,Duncan Riach,Dusan Stosic,Edgar Minasyan,Edward Lin,Eileen Long,Eileen Peters Long,Elena Lantz,Ellie Evans,Elliott Ning,Eric Chung,Eric Harper,Eric Tramel,Erick Galinkin,Erik Pounds,Evan Briones,Evelina Bakhturina,Faisal Ladhak,Fay Wang,Fei Jia,Felipe Soares,Feng Chen,Ferenc Galko,Frankie Siino,Gal Hubara Agam,Ganesh Ajjanagadde,Gantavya Bhatt,Gargi Prasad,George Armstrong,Gerald Shen,Gorkem Batmaz,Grigor Nalbandyan,Haifeng Qian,Harsh Sharma,Hayley Ross,Helen Ngo,Herman Sahota,Hexin Wang,Himanshu Soni,Hiren Upadhyay,Huizi Mao,Huy C Nguyen,Huy Q Nguyen,Iain Cunningham,Ido Shahaf,Igor Gitman,Ilya Loshchilov,Ivan Moshkov,Izzy Putterman,Jan Kautz,Jane Polak Scowcroft,Jared Casper,Jatin Mitra,Jeffrey Glick,Jenny Chen,Jesse Oliver,Jian Zhang,Jiaqi Zeng,Jie Lou,Jimmy Zhang,Jining Huang,Joey Conway,Joey Guman,John Kamalu,Johnny Greco,Jonathan Cohen,Joseph Jennings,Joyjit 
Daw,Julien Veron Vialard,Junkeun Yi,Jupinder Parmar,Kai Xu,Kan Zhu,Kari Briski,Katherine Cheung,Katherine Luna,Keshav Santhanam,Kevin Shih,Kezhi Kong,Khushi Bhardwaj,Krishna C. Puvvada,Krzysztof Pawelec,Kumar Anik,Lawrence McAfee,Laya Sleiman,Leon Derczynski,Li Ding,Lucas Liebenwein,Luis Vega,Maanu Grover,Maarten Van Segbroeck,Maer Rodrigues de Melo,Makesh Narsimhan Sreedhar,Manoj Kilaru,Maor Ashkenazi,Marc Romeijn,Mark Cai,Markus Kliegl,Maryam Moosaei,Matvei Novikov,Mehrzad Samadi,Melissa Corpuz,Mengru Wang,Meredith Price,Michael Boone,Michael Evans,Miguel Martinez,Mike Chrzanowski,Mohammad Shoeybi,Mostofa Patwary,Nabin Mulepati,Natalie Hereth,Nave Assaf,Negar Habibi,Neta Zmora,Netanel Haber,Nicola Sessions,Nidhi Bhatia,Nikhil Jukar,Nikki Pope,Nikolai Ludwig,Nima Tajbakhsh,Nirmal Juluru,Oleksii Hrinchuk,Oleksii Kuchaiev,Olivier Delalleau,Oluwatobi Olabiyi,Omer Ullman Argov,Ouye Xie,Parth Chadha,Pasha Shamis,Pavlo Molchanov,Pawel Morkisz,Peter Dykas,Peter Jin,Pinky Xu,Piotr Januszewski,Pranav Prashant Thombre,Prasoon Varshney,Pritam Gundecha,Qing Miao,Rabeeh Karimi Mahabadi,Ran El-Yaniv,Ran Zilberstein,Rasoul Shafipour,Rich Harang,Rick Izzo,Rima Shahbazyan,Rishabh Garg,Ritika Borkar,Ritu Gala,Riyad Islam,Roger Waleffe,Rohit Watve,Roi Koren,Ruoxi Zhang,Russell J. 
Hewett,Ryan Prenger,Ryan Timbrook,Sadegh Mahdavi,Sahil Modi,Samuel Kriman,Sanjay Kariyappa,Sanjeev Satheesh,Saori Kaji,Satish Pasumarthi,Sean Narentharen,Sean Narenthiran,Seonmyeong Bak,Sergey Kashirsky,Seth Poulos,Shahar Mor,Shanmugam Ramasamy,Shantanu Acharya,Shaona Ghosh,Sharath Turuvekere Sreenivas,Shelby Thomas,Shiqing Fan,Shreya Gopal,Shrimai Prabhumoye,Shubham Pachori,Shubham Toshniwal,Shuoyang Ding,Siddharth Singh,Simeng Sun,Smita Ithape,Somshubra Majumdar,Soumye Singhal,Stefania Alborghetti,Stephen Ge,Sugam Dipak Devare,Sumeet Kumar Barua,Suseella Panguluri,Suyog Gupta,Sweta Priyadarshi,Syeda Nahida Akter,Tan Bui,Teodor-Dumitru Ene,Terry Kong,Thanh Do,Tijmen Blankevoort,Tom Balough,Tomer Asida,Tomer Bar Natan,Tugrul Konuk,Twinkle Vashishth,Udi Karpas,Ushnish De,Vahid Noorozi,Vahid Noroozi,Venkat Srinivasan,Venmugil Elango,Vijay Korthikanti,Vitaly Kurin,Vitaly Lavrukhin,Wanli Jiang,Wasi Uddin Ahmad,Wei Du,Wei Ping,Wenfei Zhou,Will Jennings,William Zhang,Wojciech Prazuch,Xiaowei Ren,Yashaswi Karnati,Yejin Choi,Yev Meyer,Yi-Fu Wu,Yian Zhang,Ying Lin,Yonatan Geifman,Yonggan Fu,Yoshi Subara,Yoshi Suhara,Yubo Gao,Zach Moshe,Zhen Dong,Zihan Liu,Zijia Chen,Zijie Yan

Main category: cs.CL

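TL;DR: Nemotron 3 Nano 30B-A3B is a Mixture-of-Experts hybrid Mamba-Transformer language model pretrained on 25 trillion tokens. It is more accurate than its predecessor while activating less than half of the parameters per forward pass, delivers up to 3.3x higher inference throughput, supports contexts of up to 1M tokens, has stronger agentic, reasoning, and chat abilities, and has been openly released.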

Details

Motivation: To improve the inference efficiency and performance of language models, reducing the number of activated parameters while strengthening performance on agentic tasks, complex reasoning, and long contexts.

Method: A Mixture-of-Experts hybrid Mamba-Transformer architecture is pretrained on 25 trillion tokens (including more than 3 trillion new tokens), followed by supervised fine-tuning and large-scale reinforcement learning.

Result: Compared with Nemotron 2 Nano, the model is more accurate while activating less than half of the parameters per forward pass. Inference throughput is up to 3.3x higher than that of models such as GPT-OSS-20B and Qwen3-30B-A3B, accuracy on popular benchmarks is higher, and contexts of up to 1M tokens are supported.

Conclusion: Nemotron 3 Nano markedly improves performance while remaining highly efficient, marking a notable advance in efficient language modeling that is well suited to complex tasks and real-world deployment.

Abstract: We present Nemotron 3 Nano 30B-A3B, a Mixture-of-Experts hybrid Mamba-Transformer language model. Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2, followed by supervised fine-tuning and large-scale RL on diverse environments. Nemotron 3 Nano achieves better accuracy than our previous generation Nemotron 2 Nano while activating less than half of the parameters per forward pass. It achieves up to 3.3x higher inference throughput than similarly-sized open models like GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507, while also being more accurate on popular benchmarks. Nemotron 3 Nano demonstrates enhanced agentic, reasoning, and chat abilities and supports context lengths up to 1M tokens. We release both our pretrained Nemotron 3 Nano 30B-A3B Base and post-trained Nemotron 3 Nano 30B-A3B checkpoints on Hugging Face.

[12] How important is Recall for Measuring Retrieval Quality?

Shelly Schwartz,Oleg Vasilyev,Randy Sawaya

Main category: cs.CL

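TL;DR: This paper proposes a simple retrieval quality measure that does not require knowing the total number of relevant documents, and validates through experiments on multiple datasets how well it correlates with LLM judgments of response quality.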

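Details

Motivation: In realistic retrieval settings the knowledge base is large and constantly evolving, so the total number of documents relevant to a query is usually unknown and recall cannot be computed. Effective alternative metrics are therefore needed to assess retrieval quality.

Method: Several existing strategies are evaluated by measuring the correlation between retrieval quality metrics and LLM-based judgments of response quality, and a new, simple retrieval quality measure is introduced.

Result: Experiments on multiple datasets with a small number of relevant documents (2-15) show that the new measure performs well without knowledge of the total number of relevant documents.

Conclusion: The proposed measure can effectively stand in for recall-dependent evaluation and suits real-world, large, evolving knowledge bases.

Abstract: In realistic retrieval settings with large and evolving knowledge bases, the total number of documents relevant to a query is typically unknown, and recall cannot be computed. In this paper, we evaluate several established strategies for handling this limitation by measuring the correlation between retrieval quality metrics and LLM-based judgments of response quality, where responses are generated from the retrieved documents. We conduct experiments across multiple datasets with a relatively low number of relevant documents (2-15). We also introduce a simple retrieval quality measure that performs well without requiring knowledge of the total number of relevant documents.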
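The limitation described here is easy to see in code: recall needs the total number of relevant documents as an input, whereas precision at k can be computed from the retrieved list alone. The sketch below contrasts the two standard metrics; it is not the new measure proposed in the paper, which the abstract does not define, and the document IDs are made up.

```python
from typing import List, Set


def precision_at_k(retrieved: List[str], judged_relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant; no corpus-wide count needed."""
    top = retrieved[:k]
    return sum(doc in judged_relevant for doc in top) / len(top) if top else 0.0


def recall_at_k(retrieved: List[str], judged_relevant: Set[str], k: int,
                total_relevant: int) -> float:
    """Fraction of all relevant documents found; requires total_relevant."""
    hits = sum(doc in judged_relevant for doc in retrieved[:k])
    return hits / total_relevant if total_relevant else 0.0


retrieved = ["d1", "d7", "d3", "d9"]
judged_relevant = {"d1", "d3"}  # relevance judged only for retrieved documents

p_at_4 = precision_at_k(retrieved, judged_relevant, k=4)
# Recall is only computable if the corpus-wide total is known (assumed 5 here).
r_at_4 = recall_at_k(retrieved, judged_relevant, k=4, total_relevant=5)
```

In an evolving knowledge base the `total_relevant` argument is exactly the quantity that is unavailable, which is why measures that depend only on the retrieved list are attractive.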

[13] NVIDIA Nemotron 3: Efficient and Open Intelligence

NVIDIA,:,Aaron Blakeman,Aaron Grattafiori,Aarti Basant,Abhibha Gupta,Abhinav Khattar,Adi Renduchintala,Aditya Vavre,Akanksha Shukla,Akhiad Bercovich,Aleksander Ficek,Aleksandr Shaposhnikov,Alex Kondratenko,Alexander Bukharin,Alexandre Milesi,Ali Taghibakhshi,Alisa Liu,Amelia Barton,Ameya Sunil Mahabaleshwarkar,Amir Klein,Amit Zuker,Amnon Geifman,Amy Shen,Anahita Bhiwandiwalla,Andrew Tao,Anjulie Agrusa,Ankur Verma,Ann Guan,Anubhav Mandarwal,Arham Mehta,Ashwath Aithal,Ashwin Poojary,Asif Ahamed,Asit Mishra,Asma Kuriparambil Thekkumpate,Ayush Dattagupta,Banghua Zhu,Bardiya Sadeghi,Barnaby Simkin,Ben Lanir,Benedikt Schifferer,Besmira Nushi,Bilal Kartal,Bita Darvish Rouhani,Boris Ginsburg,Brandon Norick,Brandon Soubasis,Branislav Kisacanin,Brian Yu,Bryan Catanzaro,Carlo del Mundo,Chantal Hwang,Charles Wang,Cheng-Ping Hsieh,Chenghao Zhang,Chenhan Yu,Chetan Mungekar,Chintan Patel,Chris Alexiuk,Christopher Parisien,Collin Neale,Cyril Meurillon,Damon Mosk-Aoyama,Dan Su,Dane Corneil,Daniel Afrimi,Daniel Lo,Daniel Rohrer,Daniel Serebrenik,Daria Gitman,Daria Levy,Darko Stosic,David Mosallanezhad,Deepak Narayanan,Dhruv Nathawani,Dima Rekesh,Dina Yared,Divyanshu Kakwani,Dong Ahn,Duncan Riach,Dusan Stosic,Edgar Minasyan,Edward Lin,Eileen Long,Eileen Peters Long,Elad Segal,Elena Lantz,Ellie Evans,Elliott Ning,Eric Chung,Eric Harper,Eric Tramel,Erick Galinkin,Erik Pounds,Evan Briones,Evelina Bakhturina,Evgeny Tsykunov,Faisal Ladhak,Fay Wang,Fei Jia,Felipe Soares,Feng Chen,Ferenc Galko,Frank Sun,Frankie Siino,Gal Hubara Agam,Ganesh Ajjanagadde,Gantavya Bhatt,Gargi Prasad,George Armstrong,Gerald Shen,Gorkem Batmaz,Grigor Nalbandyan,Haifeng Qian,Harsh Sharma,Hayley Ross,Helen Ngo,Herbert Hum,Herman Sahota,Hexin Wang,Himanshu Soni,Hiren Upadhyay,Huizi Mao,Huy C Nguyen,Huy Q Nguyen,Iain Cunningham,Ido Galil,Ido Shahaf,Igor Gitman,Ilya Loshchilov,Itamar Schen,Itay Levy,Ivan Moshkov,Izik Golan,Izzy Putterman,Jan Kautz,Jane Polak Scowcroft,Jared Casper,Jatin Mitra,Jeffrey Glick,Jenny 
Chen,Jesse Oliver,Jian Zhang,Jiaqi Zeng,Jie Lou,Jimmy Zhang,Jinhang Choi,Jining Huang,Joey Conway,Joey Guman,John Kamalu,Johnny Greco,Jonathan Cohen,Joseph Jennings,Joyjit Daw,Julien Veron Vialard,Junkeun Yi,Jupinder Parmar,Kai Xu,Kan Zhu,Kari Briski,Katherine Cheung,Katherine Luna,Keith Wyss,Keshav Santhanam,Kevin Shih,Kezhi Kong,Khushi Bhardwaj,Kirthi Shankar,Krishna C. Puvvada,Krzysztof Pawelec,Kumar Anik,Lawrence McAfee,Laya Sleiman,Leon Derczynski,Li Ding,Lizzie Wei,Lucas Liebenwein,Luis Vega,Maanu Grover,Maarten Van Segbroeck,Maer Rodrigues de Melo,Mahdi Nazemi,Makesh Narsimhan Sreedhar,Manoj Kilaru,Maor Ashkenazi,Marc Romeijn,Marcin Chochowski,Mark Cai,Markus Kliegl,Maryam Moosaei,Matt Kulka,Matvei Novikov,Mehrzad Samadi,Melissa Corpuz,Mengru Wang,Meredith Price,Michael Andersch,Michael Boone,Michael Evans,Miguel Martinez,Mikail Khona,Mike Chrzanowski,Minseok Lee,Mohammad Dabbah,Mohammad Shoeybi,Mostofa Patwary,Nabin Mulepati,Najeeb Nabwani,Natalie Hereth,Nave Assaf,Negar Habibi,Neta Zmora,Netanel Haber,Nicola Sessions,Nidhi Bhatia,Nikhil Jukar,Nikki Pope,Nikolai Ludwig,Nima Tajbakhsh,Nir Ailon,Nirmal Juluru,Nishant Sharma,Oleksii Hrinchuk,Oleksii Kuchaiev,Olivier Delalleau,Oluwatobi Olabiyi,Omer Ullman Argov,Omri Puny,Oren Tropp,Ouye Xie,Parth Chadha,Pasha Shamis,Paul Gibbons,Pavlo Molchanov,Pawel Morkisz,Peter Dykas,Peter Jin,Pinky Xu,Piotr Januszewski,Pranav Prashant Thombre,Prasoon Varshney,Pritam Gundecha,Przemek Tredak,Qing Miao,Qiyu Wan,Rabeeh Karimi Mahabadi,Rachit Garg,Ran El-Yaniv,Ran Zilberstein,Rasoul Shafipour,Rich Harang,Rick Izzo,Rima Shahbazyan,Rishabh Garg,Ritika Borkar,Ritu Gala,Riyad Islam,Robert Hesse,Roger Waleffe,Rohit Watve,Roi Koren,Ruoxi Zhang,Russell Hewett,Russell J. 
Hewett,Ryan Prenger,Ryan Timbrook,Sadegh Mahdavi,Sahil Modi,Samuel Kriman,Sangkug Lim,Sanjay Kariyappa,Sanjeev Satheesh,Saori Kaji,Satish Pasumarthi,Saurav Muralidharan,Sean Narentharen,Sean Narenthiran,Seonmyeong Bak,Sergey Kashirsky,Seth Poulos,Shahar Mor,Shanmugam Ramasamy,Shantanu Acharya,Shaona Ghosh,Sharath Turuvekere Sreenivas,Shelby Thomas,Shiqing Fan,Shreya Gopal,Shrimai Prabhumoye,Shubham Pachori,Shubham Toshniwal,Shuoyang Ding,Siddharth Singh,Simeng Sun,Smita Ithape,Somshubra Majumdar,Soumye Singhal,Stas Sergienko,Stefania Alborghetti,Stephen Ge,Sugam Dipak Devare,Sumeet Kumar Barua,Suseella Panguluri,Suyog Gupta,Sweta Priyadarshi,Syeda Nahida Akter,Tan Bui,Teodor-Dumitru Ene,Terry Kong,Thanh Do,Tijmen Blankevoort,Tim Moon,Tom Balough,Tomer Asida,Tomer Bar Natan,Tomer Ronen,Tugrul Konuk,Twinkle Vashishth,Udi Karpas,Ushnish De,Vahid Noorozi,Vahid Noroozi,Venkat Srinivasan,Venmugil Elango,Victor Cui,Vijay Korthikanti,Vinay Rao,Vitaly Kurin,Vitaly Lavrukhin,Vladimir Anisimov,Wanli Jiang,Wasi Uddin Ahmad,Wei Du,Wei Ping,Wenfei Zhou,Will Jennings,William Zhang,Wojciech Prazuch,Xiaowei Ren,Yashaswi Karnati,Yejin Choi,Yev Meyer,Yi-Fu Wu,Yian Zhang,Yigong Qin,Ying Lin,Yonatan Geifman,Yonggan Fu,Yoshi Subara,Yoshi Suhara,Yubo Gao,Zach Moshe,Zhen Dong,Zhongbo Zhu,Zihan Liu,Zijia Chen,Zijie Yan

Main category: cs.CL

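TL;DR: The Nemotron 3 family comprises Nano, Super, and Ultra. The models use a hybrid Mamba-Transformer architecture, support contexts of up to 1 million tokens, and offer strong reasoning, conversational, and agentic capabilities. Nano has been released, and Super and Ultra will follow in the coming months. The model weights, training software, recipes, and redistributable data will all be openly released.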

Details

Motivation: To develop efficient, high-performing language models that support long contexts, complex reasoning, and multi-step tool use across deployment scales.

Method: The family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture. Super and Ultra are trained with NVFP4 and incorporate LatentMoE, and the two larger models also include MTP layers. All models are post-trained with multi-environment reinforcement learning, which supports efficient text generation and reasoning budget control.

Result: The Nemotron 3 family performs strongly in throughput, context length (up to 1M tokens), and reasoning. Nano outperforms comparable models while remaining cost-efficient, Super suits collaborative agents and high-volume workloads, and Ultra reaches state-of-the-art accuracy and reasoning performance.

Conclusion: Through its architecture and training methods, the Nemotron 3 family balances performance, efficiency, and scalability, and advances the open release and practical deployment of large models.

Abstract: We introduce the Nemotron 3 family of models: Nano, Super, and Ultra. These models deliver strong agentic, reasoning, and conversational capabilities. The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens. Super and Ultra models are trained with NVFP4 and incorporate LatentMoE, a novel approach that improves model quality. The two larger models also include MTP layers for faster text generation. All Nemotron 3 models are post-trained using multi-environment reinforcement learning, which enables reasoning and multi-step tool use and supports granular reasoning budget control. Nano, the smallest model, outperforms comparable models in accuracy while remaining extremely cost-efficient for inference. Super is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Ultra, the largest model, provides state-of-the-art accuracy and reasoning performance. Nano is released together with its technical report and this white paper, while Super and Ultra will follow in the coming months. We will openly release the model weights, pre- and post-training software, recipes, and all data for which we hold redistribution rights.

[14] Architectural Trade-offs in Small Language Models Under Compute Constraints

Shivraj Singh Bhatti

Main category: cs.CL

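TL;DR: This study systematically examines small language models under strict compute constraints, analyzing how architectural choices and training budget affect performance. Attention-based models remain more efficient per FLOP even at small scale, but increasing depth or context length without sufficient optimization can degrade performance, and techniques that work in large language models (such as RoPE) do not necessarily transfer to small models.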

Details Motivation: Identify what drives the performance of small language models in resource-constrained settings and understand the trade-offs between architectural design and training cost. Method: Starting from a linear next-token predictor, progressively introduce nonlinearities, self-attention, and multi-layer transformer architectures; run character-level modeling on Tiny Shakespeare and word-level modeling on PTB and WikiText-2; evaluate models by test negative log-likelihood, parameter count, and training FLOPs. Result: Attention-based models are more efficient per FLOP than MLPs at small scale; increasing depth or context length without sufficient optimization can hurt performance; techniques such as RoPE that work in large models do not necessarily transfer to small ones. Conclusion: Small language model design must weigh architectural complexity against optimization effort; architectural improvements cannot simply be copied from large models, and low-compute regimes need dedicated tuning. Abstract: We present a systematic empirical study of small language models under strict compute constraints, analyzing how architectural choices and training budget interact to determine performance. Starting from a linear next-token predictor, we progressively introduce nonlinearities, self-attention, and multi-layer transformer architectures, evaluating each on character-level modeling of Tiny Shakespeare and word-level modeling of Penn Treebank (PTB) and WikiText-2. We compare models using test negative log-likelihood (NLL), parameter count, and approximate training FLOPs to characterize accuracy-efficiency trade-offs. Our results show that attention-based models dominate MLPs in per-FLOP efficiency even at small scale, while increasing depth or context without sufficient optimization can degrade performance. We further examine rotary positional embeddings (RoPE), finding that architectural techniques successful in large language models do not necessarily transfer to small-model regimes.
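The abstract compares models by test NLL and approximate training FLOPs but does not say how the FLOPs are estimated. As a rough illustration only, the sketch below uses the common rule of thumb of about 6 FLOPs per parameter per training token for dense models, which may differ from the author's accounting, and converts a per-token NLL in nats to perplexity. The function names and the configuration values are hypothetical.

```python
import math

def approx_training_flops(n_params, n_tokens):
    """Rule-of-thumb estimate (an assumption, not the paper's method):
    roughly 2 FLOPs per parameter per token for the forward pass and
    4 for the backward pass. Ignores attention's context-length term."""
    return 6.0 * n_params * n_tokens

def perplexity(nll_nats_per_token):
    """Perplexity is the exponential of the mean per-token NLL (in nats)."""
    return math.exp(nll_nats_per_token)

# Hypothetical configuration, not taken from the paper.
flops = approx_training_flops(n_params=1_000_000, n_tokens=1_000_000)
print(f"{flops:.1e} training FLOPs")
print(f"perplexity at 1.5 nats/token: {perplexity(1.5):.2f}")
```

Dividing the NLL improvement between two models by the difference in estimated FLOPs gives one simple per-FLOP efficiency comparison of the kind the abstract describes.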

[15] Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation

Kaiyuan Liu,Shaotian Yan,Rui Miao,Bing Wang,Chen Shen,Jun Zhang,Jieping Ye

Main category: cs.CL

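TL;DR: This paper proposes a provenance tracing framework for reasoning distillation that analyzes where each action of the distilled model originates, and a teacher-guided data selection method that improves the model's test-time generalization.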

Details Motivation: Existing reasoning distillation methods lack a detailed analysis of where the distilled model's capabilities come from, and it is unclear whether the student stays consistent with the teacher in novel test-time contexts, which raises generalization concerns. Method: A cross-model Reasoning Distillation Provenance Tracing framework compares the predictive probabilities of the teacher, the original student, and the distilled student under the same context, classifies each output action to trace its origin, and motivates a training-data selection method based on teacher-student divergence. Result: Experiments show that the distilled model does generate teacher-originated actions at test time and that these actions correlate with its performance gains; the data selection method is effective across multiple teacher and student models. Conclusion: The provenance framework helps explain how reasoning distillation works, improves the interpretability and effectiveness of distilled models, and offers a new direction for future research. Abstract: Reasoning distillation has attracted increasing attention. It typically leverages a large teacher model to generate reasoning paths, which are then used to fine-tune a student model so that it mimics the teacher's behavior in training contexts. However, previous approaches have lacked a detailed analysis of the origins of the distilled model's capabilities. It remains unclear whether the student can maintain consistent behaviors with the teacher in novel test-time contexts, or whether it regresses to its original output patterns, raising concerns about the generalization of distilled models. To analyze this question, we introduce a cross-model Reasoning Distillation Provenance Tracing framework. For each action (e.g., a sentence) produced by the distilled model, we obtain the predictive probabilities assigned by the teacher, the original student, and the distilled model under the same context. By comparing these probabilities, we classify each action into different categories. By systematically disentangling the provenance of each action, we experimentally demonstrate that, in test-time contexts, the distilled model can indeed generate teacher-originated actions, which correlate with and plausibly explain the observed performance of the distilled model. Building on this analysis, we further propose a teacher-guided data selection method. Unlike prior approaches that rely on heuristics, our method directly compares teacher-student divergences on the training data, providing a principled selection criterion. We validate the effectiveness of our approach across multiple representative teacher models and diverse student models. The results highlight the utility of our provenance-tracing framework and underscore its promise for reasoning distillation. We hope to share Reasoning Distillation Provenance Tracing and our insights into reasoning distillation with the community.
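The abstract says each action is classified by comparing the probabilities that the teacher and the original student assign to it, but it does not list the categories or the decision rule. The sketch below is one hypothetical way such a rule could look: the category names, the single threshold `tau`, and the function name are all assumptions made for illustration, not the paper's definitions.

```python
def classify_action(p_teacher, p_student, tau=0.5):
    """Label an action already produced by the distilled model.

    p_teacher / p_student: probability the teacher and the ORIGINAL student
    assign to that action under the same context. The threshold and labels
    below are illustrative assumptions only.
    """
    teacher_likes = p_teacher >= tau
    student_likes = p_student >= tau
    if teacher_likes and student_likes:
        return "shared"              # both source models would produce it
    if teacher_likes:
        return "teacher-originated"  # plausibly acquired through distillation
    if student_likes:
        return "student-originated"  # retained from the original student
    return "novel"                   # favoured by neither source model

print(classify_action(p_teacher=0.8, p_student=0.05))
```

Counting how often each label occurs on test-time outputs would give a coarse picture of how much teacher behavior carries over, which is the kind of question the framework is designed to answer.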

[16] Foundation Model-based Evaluation of Neuropsychiatric Disorders: A Lifespan-Inclusive, Multi-Modal, and Multi-Lingual Study

Zhongren Dong,Haotian Guo,Weixiang Xu,Huan Zhao,Zixing Zhang

Main category: cs.CL

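TL;DR: FEND is a foundation model-based multi-modal framework that integrates speech and text for multi-lingual, lifespan-inclusive detection of Alzheimer's disease, depression, and autism spectrum disorder. It provides a unified evaluation benchmark and exposes key challenges in multi-modal fusion.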

Details Motivation: Existing detection methods for neuropsychiatric disorders fall short in multi-lingual generalization and lack a unified evaluation framework and a systematic analysis of multi-modal fusion. Method: The FEND framework combines 13 multi-lingual datasets covering English, Chinese, Greek, French, and Dutch to systematically evaluate multi-modal fusion for AD, depression, and ASD detection. Result: Multi-modal fusion excels in AD and depression detection but underperforms on ASD because of dataset heterogeneity; modality imbalance is prevalent, and cross-corpus experiments show degraded performance in multi-lingual and task-heterogeneous settings. Conclusion: FEND advances automated, lifespan-inclusive, multi-lingual neuropsychiatric disorder assessment, and researchers are encouraged to adopt it for fair comparisons and reproducible research. Abstract: Neuropsychiatric disorders, such as Alzheimer's disease (AD), depression, and autism spectrum disorder (ASD), are characterized by linguistic and acoustic abnormalities, offering potential biomarkers for early detection. Despite the promise of multi-modal approaches, challenges like multi-lingual generalization and the absence of a unified evaluation framework persist. To address these gaps, we propose FEND (Foundation model-based Evaluation of Neuropsychiatric Disorders), a comprehensive multi-modal framework integrating speech and text modalities for detecting AD, depression, and ASD across the lifespan. Leveraging 13 multi-lingual datasets spanning English, Chinese, Greek, French, and Dutch, we systematically evaluate multi-modal fusion performance. Our results show that multi-modal fusion excels in AD and depression detection but underperforms in ASD due to dataset heterogeneity. We also identify modality imbalance as a prevalent issue, where multi-modal fusion fails to surpass the best mono-modal models. Cross-corpus experiments reveal robust performance in task- and language-consistent scenarios but noticeable degradation in multi-lingual and task-heterogeneous settings. By providing extensive benchmarks and a detailed analysis of performance-influencing factors, FEND advances the field of automated, lifespan-inclusive, and multi-lingual neuropsychiatric disorder assessment. We encourage researchers to adopt the FEND framework for fair comparisons and reproducible research.

[17] Neural Probe-Based Hallucination Detection for Large Language Models

Shize Liang,Hongzhi Wang

Main category: cs.CL

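TL;DR: This paper proposes a neural framework based on MLP probes for token-level hallucination detection with the large language model's parameters frozen; nonlinear modeling and a multi-objective loss significantly improve detection performance.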

Details Motivation: Large language models are prone to hallucination, and existing methods based on uncertainty estimation and external knowledge retrieval suffer from high-confidence errors and dependence on knowledge coverage, so a more efficient and accurate detector is needed. Method: Lightweight MLP probes model high-level hidden states nonlinearly; a multi-objective joint loss improves detection stability; a layer position-probe performance response model, searched with Bayesian optimization, selects the best insertion layer automatically. Result: Experiments on LongFact, HealthBench, and TriviaQA show that the method significantly outperforms state-of-the-art approaches in accuracy, recall, and detection under low false-positive conditions. Conclusion: The nonlinear MLP probe framework offers an effective, practical, and scalable solution for real-time, lightweight hallucination detection in large language models. Abstract: Large language models (LLMs) excel at text generation and knowledge question-answering tasks, but they are prone to generating hallucinated content, severely limiting their application in high-risk domains. Current hallucination detection methods based on uncertainty estimation and external knowledge retrieval suffer from the limitation that they still produce erroneous content at high confidence levels and rely heavily on retrieval efficiency and knowledge coverage. In contrast, probe methods that leverage the model's hidden-layer states offer real-time and lightweight advantages. However, traditional linear probes struggle to capture nonlinear structures in deep semantic spaces. To overcome these limitations, we propose a neural network-based framework for token-level hallucination detection. By freezing language model parameters, we employ lightweight MLP probes to perform nonlinear modeling of high-level hidden states. A multi-objective joint loss function is designed to enhance detection stability and semantic disambiguation. Additionally, we establish a layer position-probe performance response model, using Bayesian optimization to automatically search for optimal probe insertion layers and achieve superior training results. Experimental results on LongFact, HealthBench, and TriviaQA demonstrate that MLP probes significantly outperform state-of-the-art methods in accuracy, recall, and detection capability under low false-positive conditions.
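To make the idea of a nonlinear probe concrete, here is a minimal sketch of a one-hidden-layer MLP that maps a single token's frozen hidden state to a hallucination probability. The layer sizes, the ReLU activation, and the weight layout are assumptions chosen for illustration; the abstract does not specify the probe architecture or its loss.

```python
import math

def mlp_probe(hidden_state, w1, b1, w2, b2):
    """Forward pass of a tiny probe: Linear -> ReLU -> Linear -> sigmoid.

    hidden_state: list of d floats taken from a frozen LLM layer.
    w1: h rows of d weights, b1: h biases, w2: h weights, b2: scalar bias.
    All shapes are illustrative assumptions.
    """
    activations = [
        max(0.0, sum(w * x for w, x in zip(row, hidden_state)) + bias)
        for row, bias in zip(w1, b1)
    ]
    logit = sum(w * a for w, a in zip(w2, activations)) + b2
    return 1.0 / (1.0 + math.exp(-logit))  # probability the token is hallucinated

# Toy weights for a 3-dimensional hidden state and 2 hidden units.
score = mlp_probe([0.2, -0.1, 0.4],
                  w1=[[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]], b1=[0.0, 0.0],
                  w2=[2.0, -1.0], b2=-0.5)
print(round(score, 3))
```

Only the probe weights would be trained; the language model that produces `hidden_state` stays frozen, which is what keeps the approach lightweight.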

[18] MultiMind at SemEval-2025 Task 7: Crosslingual Fact-Checked Claim Retrieval via Multi-Source Alignment

Mohammad Mahdi Abootorabi,Alireza Ghahramani Kure,Mohammadali Mohammadkhani,Sina Elahimanesh,Mohammad Ali Ali Panah

Main category: cs.CL

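TL;DR: This paper proposes TriAligner, a dual-encoder model for multilingual and crosslingual fact-checked claim retrieval that combines contrastive learning, native-language and English-translation inputs across modalities, and hard negative sampling, and significantly outperforms baselines in retrieval accuracy.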

Details Motivation: In an era of rapidly spreading misinformation, effective multilingual fact-checking is critical, but existing methods fall short in crosslingual alignment and representation learning. Method: A dual-encoder architecture with contrastive learning uses both the original-language text and its English translation across modalities, with LLM-based data augmentation and hard negative sampling to improve representation quality. Result: On monolingual and crosslingual benchmarks, the method significantly outperforms baseline models in retrieval accuracy and fact-checking performance. Conclusion: By combining multilingual alignment with enhanced training strategies, TriAligner improves multilingual fact-checked claim retrieval and shows good robustness and practical potential. Abstract: This paper presents our system for SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval. In an era where misinformation spreads rapidly, effective fact-checking is increasingly critical. We introduce TriAligner, a novel approach that leverages a dual-encoder architecture with contrastive learning and incorporates both native and English translations across different modalities. Our method effectively retrieves claims across multiple languages by learning the relative importance of different sources in alignment. To enhance robustness, we employ efficient data preprocessing and augmentation using large language models while incorporating hard negative sampling to improve representation learning. We evaluate our approach on monolingual and crosslingual benchmarks, demonstrating significant improvements in retrieval accuracy and fact-checking performance over baselines.
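Dual-encoder retrievers trained with contrastive learning are commonly optimized with an in-batch InfoNCE-style loss. The sketch below shows that generic loss on toy vectors; it is not TriAligner's actual objective, which the abstract does not spell out, and the temperature value and function names are assumptions.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]

def info_nce(post_vecs, claim_vecs, temperature=0.05):
    """In-batch contrastive loss: post i should match claim i, and every
    other claim in the batch serves as a negative."""
    posts = [_normalize(v) for v in post_vecs]
    claims = [_normalize(v) for v in claim_vecs]
    total = 0.0
    for i, post in enumerate(posts):
        logits = [_dot(post, claim) / temperature for claim in claims]
        top = max(logits)  # subtract the max for numerical stability
        log_partition = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_partition - logits[i]
    return total / len(posts)

aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Hard negative sampling, as mentioned in the abstract, would amount to placing claims that are similar but incorrect into the same batch so that the loss has to separate them.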

[19] Reflection Pretraining Enables Token-Level Self-Correction in Biological Sequence Models

Xiang Zhang,Jiaqi Wei,Yuejin Yang,Zijie Qiu,Yuhan Chen,Zhiqiang Gao,Muhammad Abdul-Mageed,Laks V. S. Lakshmanan,Wanli Ouyang,Chenyu You,Siqi Sun

Main category: cs.CL

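TL;DR: This paper introduces the concept of language expressiveness and a reflection pretraining method that, for the first time in a biological sequence model, enables intermediate reasoning beyond simple answer tokens, substantially improving the reasoning ability and performance of protein language models.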

Details Motivation: Because protein and RNA language models have limited token spaces, Chain-of-Thought (CoT) prompting cannot be applied to them, which restricts their capacity for complex reasoning. Method: Define the concept of language expressiveness and introduce reflection pretraining, which lets the model generate auxiliary "thinking tokens" for intermediate reasoning. Result: Theoretically, the augmented token set significantly increases the expressiveness of biological language; experimentally, the method teaches protein models to self-correct and outperforms standard pretraining. Conclusion: Reflection pretraining overcomes the limited expressiveness of biological language models, strengthens their reasoning, and opens a path to CoT-style reasoning in non-natural-language domains. Abstract: Chain-of-Thought (CoT) prompting has significantly advanced task-solving capabilities in natural language processing with large language models. Unlike standard prompting, CoT encourages the model to generate intermediate reasoning steps, non-answer tokens, that help guide the model toward more accurate final outputs. These intermediate steps enable more complex reasoning processes such as error correction, memory management, future planning, and self-reflection. However, applying CoT to non-natural language domains, such as protein and RNA language models, is not yet possible, primarily due to the limited expressiveness of their token spaces (e.g., amino acid tokens). In this work, we propose and define the concept of language expressiveness: the ability of a given language, using its tokens and grammar, to encode information. We show that the limited expressiveness of protein language severely restricts the applicability of CoT-style reasoning. To overcome this, we introduce reflection pretraining, for the first time in a biological sequence model, which enables the model to engage in intermediate reasoning through the generation of auxiliary "thinking tokens" beyond simple answer tokens. Theoretically, we demonstrate that our augmented token set significantly enhances biological language expressiveness, thereby improving the overall reasoning capacity of the model. Experimentally, our pretraining approach teaches protein models to self-correct and leads to substantial performance gains compared to standard pretraining.

[20] Automatic Replication of LLM Mistakes in Medical Conversations

Oleksii Proniakin,Diego Fajardo,Ruslan Nazarenko,Razvan Marinescu

Main category: cs.CL

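TL;DR: MedMistake is an automated pipeline that extracts the mistakes large language models make in patient-doctor conversations and converts them into a benchmark of single-shot QA pairs.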

Details Motivation: Existing clinical evaluation methods make it hard to reproduce specific LLM mistakes and depend on manual effort, with no efficient automated alternative. Method: The MedMistake pipeline (1) generates complex conversations between an LLM patient and an LLM doctor, (2) evaluates them along multiple dimensions with a committee of two LLM judges, and (3) converts the identified mistakes into simplified single-shot QA pairs. Result: The authors release MedMistake-All, a dataset of 3,390 QA pairs, and evaluate 12 frontier LLMs on 211 expert-validated questions (MedMistake-Bench), finding that GPT, Claude, and Grok perform best. Conclusion: MedMistake offers a scalable way to generate benchmarks of medical reasoning mistakes automatically, which can help improve the reliability and safety of LLMs in clinical settings. Abstract: Large language models (LLMs) are increasingly evaluated in clinical settings using multi-dimensional rubrics which quantify reasoning quality, safety, and patient-centeredness. Yet, replicating specific mistakes in other LLMs is not straightforward and often requires manual effort. We introduce MedMistake, an automatic pipeline that extracts mistakes LLMs make in patient-doctor conversations and converts them into a benchmark of single-shot QA pairs. Our pipeline (1) creates complex, conversational data between an LLM patient and LLM doctor, (2) runs an evaluation with a committee of 2 LLM judges across a variety of dimensions and (3) creates simplified single-shot QA scenarios from those mistakes. We release MedMistake-All, a dataset of 3,390 single-shot QA pairs where GPT-5 and Gemini 2.5 Pro are currently failing to answer correctly, as judged by two LLM judges. We used medical experts to validate a subset of 211/3390 questions (MedMistake-Bench), which we used to run a final evaluation of 12 frontier LLMs: Claude Opus 4.5, Claude Sonnet 4.5, DeepSeek-Chat, Gemini 2.5 Pro, Gemini 3 Pro, GPT-4o, GPT-5, GPT-5.1, GPT-5.2, Grok 4, Grok 4.1, Mistral Large. We found that GPT models, Claude and Grok obtained the best performance on MedMistake-Bench. We release both the doctor-validated benchmark (MedMistake-Bench), as well as the full dataset (MedMistake-All) at https://huggingface.co/datasets/TheLumos/MedicalMistakeBenchmark.

[21] Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation

Wei-Rui Chen,Vignesh Kothapalli,Ata Fatahibaarzi,Hejian Sang,Shao Tang,Qingquan Song,Zhipeng Wang,Muhammad Abdul-Mageed

Main category: cs.CL

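TL;DR: This paper studies how to make reasoning distillation from a large language model to a smaller one more efficient by supervising only the chain-of-thought (CoT) segment, and proposes a truncation protocol that retains about 94% of performance while cutting compute by about 50%.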

Details Motivation: Distilling over long sequences that contain prompt, chain-of-thought, and answer segments is computationally expensive, so the paper examines how allocating supervision across segments affects student performance in order to improve training efficiency. Method: Analyze the effect of supervising different segments (P, CoT, A), propose distilling over the CoT tokens only, and design a truncation protocol to measure the trade-off between sequence length and performance. Result: Training on only the first 50% of tokens in each sequence retains about 94% of full-sequence performance on math benchmarks on average while reducing training time, memory usage, and FLOPs by about 50% each. Conclusion: Reasoning distillation should prioritize early reasoning tokens, and selective supervision of the CoT segment is an effective lever for trading compute against quality. Abstract: Distilling the reasoning capabilities from a large language model (LLM) to a smaller student model often involves training on substantial amounts of reasoning data. However, distillation over lengthy sequences with prompt (P), chain-of-thought (CoT), and answer (A) segments makes the process computationally expensive. In this work, we investigate how the allocation of supervision across different segments (P, CoT, A) affects student performance. Our analysis shows that selective knowledge distillation over only the CoT tokens can be effective when the prompt and answer information is encompassed by it. Building on this insight, we establish a truncation protocol to quantify computation-quality tradeoffs as a function of sequence length. We observe that training on only the first $50\%$ of tokens of every training sequence can retain, on average, $\approx94\%$ of full-sequence performance on math benchmarks while reducing training time, memory usage, and FLOPs by about $50\%$ each. These findings suggest that reasoning distillation benefits from prioritizing early reasoning tokens and provide a simple lever for computation-quality tradeoffs. Code is available at https://github.com/weiruichen01/distilling-the-essence.
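The truncation protocol described above amounts to keeping a fixed leading fraction of each training sequence. A minimal sketch of that preprocessing step follows; the function name, the rounding rule, and the choice to keep at least one token are assumptions, and the authors' repository may implement it differently.

```python
def truncate_sequence(token_ids, keep_fraction=0.5):
    """Keep only the leading `keep_fraction` of a training sequence.

    Rounds down, but keeps at least one token for non-empty inputs
    (both choices are illustrative assumptions).
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    keep = max(1, int(len(token_ids) * keep_fraction))
    return token_ids[:keep]

sequence = list(range(10))          # stand-in for prompt + CoT + answer token ids
print(truncate_sequence(sequence))  # the first half of the sequence
```

Because the cut is made on the full sequence, the answer segment at the end is usually dropped first, consistent with the abstract's finding that early reasoning tokens carry most of the training signal.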

[22] Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy

Xiaofeng Shi,Qian Kou,Yuduo Li,Hua Zhou

Main category: cs.CL

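TL;DR: Proposes SFTKey, a two-stage fine-tuning method that separately optimizes the key answer portion, significantly improving LLM reasoning accuracy while preserving the chain-of-thought format.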

Details Motivation: In conventional supervised fine-tuning, the model may focus too heavily on lengthy chain-of-thought sequences and neglect the crucial answer portion, which lowers evaluation performance. Method: Two-stage training: the first stage applies conventional SFT to ensure the correct output format; the second stage fine-tunes only the key answer portion to improve accuracy. Result: Experiments across multiple benchmarks and model families show an average accuracy improvement of more than 5% over conventional SFT while preserving the ability to generate correctly formatted output. Conclusion: SFTKey balances chain-of-thought learning with optimization on answer-relevant tokens and advances LLM fine-tuning. Abstract: With the rapid advancement of Large Language Models (LLMs), the Chain-of-Thought (CoT) component has become significant for complex reasoning tasks. However, in conventional Supervised Fine-Tuning (SFT), the model could allocate disproportionately more attention to CoT sequences with excessive length. This reduces focus on the much shorter but essential Key portion, the final answer, whose correctness directly determines task success and evaluation quality. To address this limitation, we propose SFTKey, a two-stage training scheme. In the first stage, conventional SFT is applied to ensure proper output format, while in the second stage, only the Key portion is fine-tuned to improve accuracy. Extensive experiments across multiple benchmarks and model families demonstrate that SFTKey achieves an average accuracy improvement exceeding 5\% over conventional SFT, while preserving the ability to generate correct formats. Overall, this study advances LLM fine-tuning by explicitly balancing CoT learning with additional optimization on answer-relevant tokens.
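One plausible way to restrict the second stage to the Key portion is to mask every other position out of the loss. The sketch below uses the label value -100, which is the default `ignore_index` of PyTorch's cross-entropy loss; whether SFTKey is implemented this way is an assumption, and the helper names and the `key_start` argument are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss

def key_only_labels(token_ids, key_start):
    """Stage-2 labels: supervise only tokens from `key_start` onward
    (the final answer); earlier positions are excluded from the loss."""
    return [tok if i >= key_start else IGNORE_INDEX
            for i, tok in enumerate(token_ids)]

def masked_nll(token_log_probs, labels):
    """Mean negative log-likelihood over the supervised positions only."""
    kept = [-lp for lp, label in zip(token_log_probs, labels)
            if label != IGNORE_INDEX]
    return sum(kept) / len(kept)

labels = key_only_labels([5, 6, 7, 8], key_start=2)
print(labels)
print(masked_nll([-1.0, -1.0, -2.0, -4.0], labels))
```

In stage 1 the same loss would be computed with labels covering the whole response, so the two stages differ only in which positions contribute to the gradient.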

[23] Semantic Refinement with LLMs for Graph Representations

Safal Thapaliya,Zehong Wang,Jiazheng Li,Ziming Li,Yanfang Ye,Chuxu Zhang

Main category: cs.CL

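TL;DR: Proposes DAS, a data-adaptive semantic refinement framework that couples a fixed graph neural network with a large language model in a closed feedback loop to handle structure-semantics heterogeneity in graph-structured data.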

Details Motivation: The source of predictive signal varies widely across graph domains, so models with a fixed inductive bias cannot generalize optimally, and existing model-side fixes remain limited. Method: The DAS framework couples a fixed GNN and an LLM in a closed feedback loop: the GNN provides implicit supervisory signals that guide the LLM's semantic refinement, and the refined semantics are fed back to update the graph learner. Result: Evaluated on both text-rich and text-free graphs, the method improves consistently on structure-dominated graphs while remaining competitive on semantics-rich graphs. Conclusion: Data-centric semantic adaptation is an effective response to structure-semantics heterogeneity in graph representation learning. Abstract: Graph-structured data exhibit substantial heterogeneity in where their predictive signals originate: in some domains, node-level semantics dominate, while in others, structural patterns play a central role. This structure-semantics heterogeneity implies that no graph learning model with a fixed inductive bias can generalize optimally across diverse graph domains. However, most existing methods address this challenge from the model side by incrementally injecting new inductive biases, which remains fundamentally limited given the open-ended diversity of real-world graphs. In this work, we take a data-centric perspective and treat node semantics as a task-adaptive variable. We propose a Data-Adaptive Semantic Refinement framework (DAS) for graph representation learning, which couples a fixed graph neural network (GNN) and a large language model (LLM) in a closed feedback loop. The GNN provides implicit supervisory signals to guide the LLM's semantic refinement, and the refined semantics are fed back to update the same graph learner. We evaluate our approach on both text-rich and text-free graphs. Results show consistent improvements on structure-dominated graphs while remaining competitive on semantics-rich graphs, demonstrating the effectiveness of data-centric semantic adaptation under structure-semantics heterogeneity.

[24] Semi-Supervised Learning for Large Language Models Safety and Content Moderation

Eduard Stefan Dinuta,Iustin Sirbu,Traian Rebedea

Main category: cs.CL

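TL;DR: This paper proposes using semi-supervised learning to improve LLM safety classification, combining labeled and unlabeled data with task-specific data augmentation to significantly improve performance.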

Details Motivation: Existing LLM safety classifiers depend on large amounts of labeled data that are hard to acquire, prone to labeling errors, and often synthetic, so a more data-efficient approach is needed. Method: Apply semi-supervised learning techniques that train on both labeled and unlabeled data, and introduce task-specific data augmentation strategies. Result: Experiments show significant gains on safety classification for both the prompts given to LLMs and their responses, with task-specific augmentation performing better than general-purpose augmentation. Conclusion: Semi-supervised learning combined with task-specific augmentation is an effective way to improve LLM safety while reducing reliance on large labeled datasets. Abstract: Safety for Large Language Models (LLMs) has been an ongoing research focus since their emergence and is even more relevant nowadays with the increasing capacity of those models. Currently, there are several guardrails in place for all public LLMs and multiple proposed datasets for training safety classifiers. However, training these safety classifiers relies on large quantities of labeled data, which can be problematic to acquire, prone to labeling errors, or often include synthetic data. To address these issues, we suggest a different approach: utilizing semi-supervised learning techniques, which leverage both labeled and unlabeled data, to improve the performance on the safety task. We analyze the improvements that these techniques can offer for both prompts given to Large Language Models and the responses to those requests. Moreover, since augmentation is the central part of semi-supervised algorithms, we demonstrate the importance of using task-specific augmentations, which significantly increase the performance when compared to general-purpose augmentation techniques.

[25] ClarifyMT-Bench: Benchmarking and Improving Multi-Turn Clarification for Conversational Large Language Models

Sichun Luo,Yi Huang,Mukai Li,Shichang Meng,Fengyuan Liu,Zefa Hu,Junlan Feng,Qi Liu

Main category: cs.CL

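TL;DR: Introduces ClarifyMT-Bench, a multi-turn clarification benchmark built on a five-dimensional ambiguity taxonomy and six user personas, and proposes ClarifyAgent to improve how LLMs clarify under ambiguity.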

Details Motivation: Existing clarification evaluations for LLMs focus on single-turn or cooperative-user settings and do not reflect how models handle ambiguity in realistic multi-turn open-domain dialogue, so a more realistic benchmark is needed. Method: Build a dataset of 6,120 multi-turn dialogues that combines a five-dimensional ambiguity taxonomy with six simulated user personas, generated through a hybrid LLM-human pipeline; propose ClarifyAgent, which decomposes clarification into perception, forecasting, tracking, and planning modules. Result: An evaluation of ten mainstream LLMs shows a consistent tendency to answer prematurely and degrading performance as dialogues deepen; ClarifyAgent substantially improves the robustness of clarification behavior across ambiguity conditions. Conclusion: ClarifyMT-Bench provides a reproducible foundation for studying when LLMs should ask, when they should answer, and how to handle ambiguity in real conversations, and ClarifyAgent shows the potential of agentic designs in complex interactions. Abstract: Large language models (LLMs) are increasingly deployed as conversational assistants in open-domain, multi-turn settings, where users often provide incomplete or ambiguous information. However, existing LLM-focused clarification benchmarks primarily assume single-turn interactions or cooperative users, limiting their ability to evaluate clarification behavior in realistic settings. We introduce ClarifyMT-Bench, a benchmark for multi-turn clarification grounded in a five-dimensional ambiguity taxonomy and a set of six behaviorally diverse simulated user personas. Through a hybrid LLM-human pipeline, we construct 6,120 multi-turn dialogues capturing diverse ambiguity sources and interaction patterns. Evaluating ten representative LLMs uncovers a consistent under-clarification bias: LLMs tend to answer prematurely, and performance degrades as dialogue depth increases. To mitigate this, we propose ClarifyAgent, an agentic approach that decomposes clarification into perception, forecasting, tracking, and planning, substantially improving robustness across ambiguity conditions. ClarifyMT-Bench establishes a reproducible foundation for studying when LLMs should ask, when they should answer, and how to navigate ambiguity in real-world human-LLM interactions.

[26] SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation

Mahi Luthra,Jiayi Shen,Maxime Poli,Angelo Ortiz,Yosuke Higuchi,Youssef Benchekroun,Martin Gleize,Charles-Eric Saint-James,Dongyan Lin,Phillip Rust,Angel Villar,Surya Parimi,Vanessa Stark,Rashel Moritz,Juan Pino,Yann LeCun,Emmanuel Dupoux

Main category: cs.CL

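TL;DR: This paper introduces SpidR-Adapt, a speech representation learning method for rapid adaptation to very low-resource languages; a meta-learning formulation with bi-level optimization enables efficient cross-lingual transfer from less than 1 hour of audio.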

Details Motivation: Human infants acquire the basic units of a new language from very little speech exposure, whereas current self-supervised speech models need large amounts of data; the paper aims to narrow this efficiency gap. Method: Cast low-resource speech representation learning as a meta-learning problem, propose the multi-task adaptive pre-training (MAdaPT) protocol within a bi-level optimization framework, introduce first-order bi-level optimization (FOBLO) to reduce compute cost, and stabilize training with a robust initialization obtained through interleaved supervision. Result: After training on less than 1 hour of target-language audio, SpidR-Adapt surpasses in-domain models on phonemic discriminability (ABX) and spoken language modeling (sWUGGY, sBLIMP, tSC), with more than 100 times the data efficiency of standard training. Conclusion: The method offers a practical, architecture-agnostic path toward biologically inspired, data-efficient speech representations. Abstract: Human infants, with only a few hundred hours of speech exposure, acquire basic units of new languages, highlighting a striking efficiency gap compared to the data-hungry self-supervised speech models. To address this gap, this paper introduces SpidR-Adapt for rapid adaptation to new languages using minimal unlabeled data. We cast such low-resource speech representation learning as a meta-learning problem and construct a multi-task adaptive pre-training (MAdaPT) protocol which formulates the adaptation process as a bi-level optimization framework. To enable scalable meta-training under this framework, we propose a novel heuristic solution, first-order bi-level optimization (FOBLO), avoiding heavy computation costs. Finally, we stabilize meta-training by using a robust initialization through interleaved supervision which alternates self-supervised and supervised objectives. Empirically, SpidR-Adapt achieves rapid gains in phonemic discriminability (ABX) and spoken language modeling (sWUGGY, sBLIMP, tSC), improving over in-domain language models after training on less than 1h of target-language audio, over $100\times$ more data-efficient than standard training. These findings highlight a practical, architecture-agnostic path toward biologically inspired, data-efficient representations. We open-source the training code and model checkpoints at https://github.com/facebookresearch/spidr-adapt.

[27] SMART SLM: Structured Memory and Reasoning Transformer, A Small Language Model for Accurate Document Assistance

Divij Dudeja,Mayukha Pal

Main category: cs.CL

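TL;DR: SMART is an efficient model for information extraction and reasoning over engineering manuals (EMs). It processes documents hierarchically, combining a syntax-aware fact extractor, a compact indexed memory module, and a lightweight Transformer decoder, which improves accuracy and reduces hallucinations.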

Details Motivation: Conventional transformers treat engineering manuals as flat token sequences, which leads to wrong numeric answers and inefficient memorization; users also struggle to find information quickly in long, dense manuals, so a more efficient, accurate, and interpretable framework is needed. Method: SMART has three core components: (1) a syntax-aware fact extractor (Grammarian) based on a Tree-LSTM that extracts subject-relation-object triples from sentences; (2) a compact indexed memory built on a MANN that encodes facts as 384-dimensional vectors linked to their source; and (3) a 6-layer Transformer decoder that fuses retrieved facts into the answer. It supports two inference modes: a fast path for known documents and a dynamic RAG-assisted path for new documents. Result: With only 45.51M parameters (reported as 64% fewer than GPT-2 and 69% fewer than BERT), SMART achieves 21.3% higher accuracy than GPT-2 on engineering-manual tasks, answers in under a second, handles new documents via FAISS top-20 retrieval with a 64-slot memory limit, and reduces hallucinations. Conclusion: Through structured memory and reasoning, SMART improves engineering-manual question answering with far fewer parameters and is suited to practical deployment on technical documents. Abstract: Users of Engineering Manuals (EMs) find them difficult to read because they are long and dense, combining written documents, step-by-step procedures, and standard parameter lists for engineering equipment. Off-the-shelf transformers, especially compact ones, treat this material as a flat stream of tokens. This approach leads to confident but incorrect numeric answers and forces the models to memorize separate facts inefficiently. SMART (Structured Memory and Reasoning Transformer) offers a different and practical solution to this problem. SMART structures its processing hierarchically around three main components: (1) a syntax-aware Fact Extractor (Grammarian), a Tree-LSTM that extracts facts as subject-relation-object triples from EM sentences; (2) a compact indexed memory, a MANN (Memory Augmented Neural Network), that indexes these subject-relation-object facts as 384-dimensional vectors associated with the source of the information; and (3) a 6-layer Transformer that learns to fuse the previously retrieved facts into its generated response. The entire SMART model utilizes 45.51M parameters, which is 64% less than GPT-2 (124M) and 69% less than BERT (133M), and it achieves a 21.3% higher accuracy than GPT-2, indicating that SMART fits the data better with the least amount of processing requirements. SMART employs dual modes of inference: an indexed fast path for known documents (sub-second answer times) and an indexed dynamic path assisted by RAG for new uploads (FAISS top-20 results with memory capped at 64 slots). In real-world deployment, this framework leads to better-supported results with fewer hallucinations than comparable small transformer models.

[28] Parallel Token Prediction for Language Models

Felix Draxler,Justus Will,Farrin Marouf Sofian,Theofanis Karaletsos,Sameer Singh,Stephan Mandt

Main category: cs.CL

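TL;DR: Proposes Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models that jointly predicts multiple dependent tokens in a single transformer call, substantially reducing autoregressive decoding latency and achieving state-of-the-art speculative decoding performance in experiments.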

Details Motivation: Autoregressive decoding has high latency, and existing multi-token prediction methods commonly rely on restrictive independence assumptions, so a more efficient and flexible parallel generation framework is needed. Method: Incorporate the sampling procedure into the model so that it can jointly predict multiple dependent tokens in one forward pass; PTP can be trained by distilling an existing model or through teacher-free inverse autoregressive training. Result: PTP is proven able to represent arbitrary autoregressive sequence distributions and achieves state-of-the-art speculative decoding on Spec-Bench with Vicuna-7B, accepting over four tokens per step. Conclusion: PTP is a general and effective framework for parallel sequence generation that enables fast generation of long sequences without loss of modeling power. Abstract: We propose Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models. PTP jointly predicts multiple dependent tokens in a single transformer call by incorporating the sampling procedure into the model. This reduces the latency bottleneck of autoregressive decoding, and avoids the restrictive independence assumptions common in existing multi-token prediction methods. We prove that PTP can represent arbitrary autoregressive sequence distributions. PTP is trained either by distilling an existing model or through inverse autoregressive training without a teacher. Experimentally, we achieve state-of-the-art speculative decoding performance on Vicuna-7B by accepting over four tokens per step on Spec-Bench. The universality of our framework indicates that parallel generation of long sequences is feasible without loss of modeling power.
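One way to read "incorporating the sampling procedure into the model" is that the randomness used for sampling is treated as an explicit input. With inverse-CDF sampling, each token is a deterministic function of the prefix and a uniform noise value, so a whole block of tokens is a deterministic function of the context and the noise values. The sketch below illustrates only this observation on a toy three-token model; the toy distribution and all names are hypothetical, and the paper's actual parameterization is not given in the abstract.

```python
def inverse_cdf_sample(probs, u):
    """Return the first index whose cumulative probability exceeds u in [0, 1)."""
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point round-off

def toy_next_token_probs(prefix):
    """Hypothetical 3-token model that tends to repeat the previous token."""
    if not prefix:
        return [0.5, 0.3, 0.2]
    probs = [0.2, 0.2, 0.2]
    probs[prefix[-1]] = 0.6
    return probs

def decode_with_noise(noise):
    """Sequential decoding driven by fixed noise values. The output depends
    only on the noise, which is the mapping a parallel predictor would
    need to learn to produce in a single call."""
    sequence = []
    for u in noise:
        sequence.append(inverse_cdf_sample(toy_next_token_probs(sequence), u))
    return sequence

print(decode_with_noise([0.1, 0.9, 0.5]))
```

Because later tokens still depend on earlier ones through the shared noise, this view does not require assuming that the tokens in a block are independent.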

[29] Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks

Xinhe Wang,Jin Huang,Xingjian Zhang,Tianhao Wang,Jiaqi W. Ma

Main category: cs.CL

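TL;DR: This paper challenges the prevailing view and argues that the performance gap on ARC-style reasoning benchmarks stems mainly from limitations in visual perception rather than from weak machine reasoning, and it tests this hypothesis with a two-stage experiment that separates perception from reasoning.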

Details Motivation: Poor model performance on ARC and similar reasoning tasks is commonly attributed to weak reasoning, but the authors question this explanation and propose that visual perception limits cause the gap. Method: A two-stage pipeline: in the first stage each image is independently converted into a natural-language description (perception); in the second stage a model induces and applies rules from those descriptions (reasoning), which isolates the two processes and prevents cross-image information leakage. Result: On Mini-ARC, ACRE, and Bongard-LOGO, the two-stage approach clearly outperforms end-to-end one-stage evaluation; manual analysis shows that about 80% of failures stem from perception errors rather than reasoning errors. Conclusion: ARC-style benchmarks conflate perceptual and reasoning challenges, and observed gaps may overstate deficiencies in machine reasoning; future evaluations should decouple perception from reasoning. Abstract: Reasoning benchmarks such as the Abstraction and Reasoning Corpus (ARC) and ARC-AGI are widely used to assess progress in artificial intelligence and are often interpreted as probes of core, so-called "fluid" reasoning abilities. Despite their apparent simplicity for humans, these tasks remain challenging for frontier vision-language models (VLMs), a gap commonly attributed to deficiencies in machine reasoning. We challenge this interpretation and hypothesize that the gap arises primarily from limitations in visual perception rather than from shortcomings in inductive reasoning. To verify this hypothesis, we introduce a two-stage experimental pipeline that explicitly separates perception and reasoning. In the perception stage, each image is independently converted into a natural-language description, while in the reasoning stage a model induces and applies rules using these descriptions. This design prevents leakage of cross-image inductive signals and isolates reasoning from perception bottlenecks. Across three ARC-style datasets, Mini-ARC, ACRE, and Bongard-LOGO, we show that the perception capability is the dominant factor underlying the observed performance gap by comparing the two-stage pipeline against standard end-to-end one-stage evaluation. Manual inspection of reasoning traces in the VLM outputs further reveals that approximately 80 percent of model failures stem from perception errors. Together, these results demonstrate that ARC-style benchmarks conflate perceptual and reasoning challenges and that observed performance gaps may overstate deficiencies in machine reasoning. Our findings underscore the need for evaluation protocols that disentangle perception from reasoning when assessing progress in machine intelligence.

[30] C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling

Jin Qin,Zihan Liao,Ziyin Zhang,Hang Yu,Peng Di,Rui Wang

Main category: cs.CL

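TL;DR: This paper presents C2LLM, a family of code embedding models built on Qwen-2.5-Coder that uses a Pooling by Multihead Attention (PMA) module to produce sequence embeddings; it improves on conventional approaches in several respects and achieves the best performance among similarly sized models on the MTEB-Code benchmark.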

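Details Motivation: Conventional EOS-based sequence embeddings suffer from an information bottleneck, make limited use of the LLM's causal representations, and cannot flexibly adapt the embedding dimension, which limits the quality and applicability of code embeddings. Method: Build C2LLM models at 0.5B and 7B scale on Qwen-2.5-Coder and add a Pooling by Multihead Attention (PMA) module that generates the sequence embedding from token embeddings, using the causal representations learned in pretraining, aggregating information from the whole sequence, and allowing flexible embedding dimensions. Result: Trained on three million publicly available examples, C2LLM reaches state-of-the-art results among similarly sized models on MTEB-Code, with C2LLM-7B ranking first on the overall leaderboard. Conclusion: With the PMA module, C2LLM overcomes the information bottleneck of conventional code embeddings, combines use of causal representations with full-sequence aggregation, supports flexible embedding dimensions, and improves code retrieval and understanding. Abstract: We present C2LLM (Contrastive Code Large Language Models), a family of code embedding models in both 0.5B and 7B sizes. Building upon Qwen-2.5-Coder backbones, C2LLM adopts a Pooling by Multihead Attention (PMA) module for generating sequence embeddings from token embeddings, effectively 1) utilizing the LLM's causal representations acquired during pretraining, while also 2) being able to aggregate information from all tokens in the sequence, breaking the information bottleneck in EOS-based sequence embeddings, and 3) supporting flexible adaptation of embedding dimension, serving as an alternative to MRL. Trained on three million publicly available data points, C2LLM models set new records on MTEB-Code among models of similar sizes, with C2LLM-7B ranking 1st on the overall leaderboard.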
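Pooling by multihead attention can be pictured as a learnable query vector that cross-attends over all token embeddings, so the sequence embedding draws on every position instead of the final EOS state alone. The sketch below is a simplified illustration with a single query and no learned key, value, or output projections; those omissions, and all names, are assumptions made for brevity and do not describe C2LLM's actual module.

```python
import math

def _softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pma_pool(token_embeddings, query, num_heads):
    """Pool T token vectors (dimension d) into one d-dimensional vector.

    Each head attends over its own slice of the feature dimension, using
    scaled dot-product scores between the query and every token.
    """
    dim = len(query)
    assert dim % num_heads == 0, "dimension must divide evenly across heads"
    head_dim = dim // num_heads
    pooled = []
    for head in range(num_heads):
        lo, hi = head * head_dim, (head + 1) * head_dim
        scores = [
            sum(q * k for q, k in zip(query[lo:hi], token[lo:hi])) / math.sqrt(head_dim)
            for token in token_embeddings
        ]
        weights = _softmax(scores)
        for j in range(lo, hi):
            pooled.append(sum(w * token[j] for w, token in zip(weights, token_embeddings)))
    return pooled

tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
print(pma_pool(tokens, query=[0.0, 0.0, 0.0, 0.0], num_heads=2))
```

With an all-zero query the attention weights are uniform and the result reduces to mean pooling; a trained query would instead weight informative tokens more heavily.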

[31] Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty

Ziyu Chen,Xinbei Jiang,Peng Sun,Tao Lin

Main category: cs.CL

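TL;DR: This paper is the first to formalize the sensitivity of masked diffusion model (MDM) generation quality to decoding order. It introduces Denoising Entropy, a computable metric of predictive uncertainty along a generative path, and builds on it two algorithms for optimizing the decoding path that significantly improve generation quality.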

Details Motivation: MDMs are flexible for non-autoregressive generation, but output quality depends heavily on decoding order, and there is no mechanism for modeling or controlling the uncertainty that accumulates along a generative path. Method: Denoising Entropy is proposed to quantify the cumulative predictive uncertainty during generation, and two algorithms are designed to optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Result: Experiments show that guidance based on Denoising Entropy significantly improves accuracy on complex reasoning, planning, and code generation tasks. Conclusion: Denoising Entropy is a principled tool for understanding and controlling generation in MDMs, turning model uncertainty from a liability into an advantage for finding high-quality solutions. Abstract: Masked Diffusion Models (MDMs) offer flexible, non-autoregressive generation, but this freedom introduces a challenge: final output quality is highly sensitive to the decoding order. We are the first to formalize this issue, attributing the variability in output quality to the cumulative predictive uncertainty along a generative path. To quantify this uncertainty, we introduce Denoising Entropy, a computable metric that serves as an internal signal for evaluating the generative process. Leveraging this metric, we propose two algorithms designed to optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Experiments demonstrate that our entropy-guided methods significantly improve generation quality, consistently boosting accuracy on challenging reasoning, planning, and code benchmarks. Our work establishes Denoising Entropy as a principled tool for understanding and controlling generation, effectively turning the uncertainty in MDMs from a liability into a key advantage for discovering high-quality solutions.
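
The abstract does not give the exact definition of Denoising Entropy, so the following is only a sketch of the post-hoc selection idea under an assumed definition: at each decoding step, take the mean Shannon entropy of the model's predictive distributions over the still-masked positions, and sum that over the path. The names `path_entropy` and `select_path` are hypothetical.

```python
import math
from typing import List

Distribution = List[float]      # predictive distribution over the vocabulary
Step = List[Distribution]       # one distribution per still-masked position
Path = List[Step]               # one entry per decoding step


def entropy(probs: Distribution) -> float:
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def path_entropy(path: Path) -> float:
    # ASSUMED definition: per-step mean entropy, accumulated along the path.
    total = 0.0
    for step in path:
        if step:
            total += sum(entropy(dist) for dist in step) / len(step)
    return total


def select_path(paths: List[Path]) -> int:
    # Post-hoc selection: keep the candidate with the least accumulated uncertainty.
    return min(range(len(paths)), key=lambda i: path_entropy(paths[i]))


confident = [[[1.0, 0.0]], [[0.9, 0.1]]]
uncertain = [[[0.5, 0.5]], [[0.5, 0.5]]]
print(select_path([uncertain, confident]))
```

A real-time guidance variant would use the same per-step quantity to decide which position to unmask next, rather than scoring completed paths.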

cs.CV [Back]

[32] VL4Gaze: Unleashing Vision-Language Models for Gaze Following

Shijing Wang,Chaoqun Cui,Yaping Huang,Hyung Jin Chang,Yihua Cheng

Main category: cs.CV

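TL;DR: This paper presents VL4Gaze, the first large-scale benchmark dataset for studying the gaze understanding capabilities of vision-language models (VLMs). It contains 489K question-answer pairs spanning four tasks. Evaluation shows that existing VLMs struggle to understand gaze reliably without task-specific supervision, while training on the dataset substantially improves performance.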

Details Motivation: Current vision-language models lack systematic evaluation of, and training for, human gaze understanding, even though gaze is essential to social interaction and attention inference; a dedicated benchmark is therefore needed to explore whether VLMs can develop gaze understanding from general-purpose training. Method: The VL4Gaze dataset is built with 124K images and 489K automatically generated question-answer pairs, organized into four complementary tasks: gaze object description, gaze direction description, gaze point location, and ambiguous question recognition. A range of VLMs is evaluated comprehensively under in-context learning and fine-tuning settings. Result: Even large-scale VLMs struggle to reliably infer gaze semantics and spatial location without task-specific supervision, whereas training on VL4Gaze yields substantial and consistent gains on all tasks. Conclusion: Gaze understanding does not readily emerge from general-purpose vision-language pre-training and requires targeted multi-task supervision; VL4Gaze is an important resource for advancing VLMs in social visual understanding. Abstract: Human gaze provides essential cues for interpreting attention, intention, and social interaction in visual scenes, yet gaze understanding remains largely unexplored in current vision-language models (VLMs). While recent VLMs achieve strong scene-level reasoning across a range of visual tasks, there exists no benchmark that systematically evaluates or trains them for gaze interpretation, leaving open the question of whether gaze understanding can emerge from general-purpose vision-language pre-training. To address this gap, we introduce VL4Gaze, the first large-scale benchmark designed to investigate, evaluate, and unlock the potential of VLMs for gaze understanding. VL4Gaze contains 489K automatically generated question-answer pairs across 124K images and formulates gaze understanding as a unified VQA problem through four complementary tasks: (1) gaze object description, (2) gaze direction description, (3) gaze point location, and (4) ambiguous question recognition. We comprehensively evaluate both commercial and open-source VLMs under in-context learning and fine-tuning settings. The results show that even large-scale VLMs struggle to reliably infer gaze semantics and spatial localization without task-specific supervision. In contrast, training on VL4Gaze brings substantial and consistent improvements across all tasks, highlighting the importance of targeted multi-task supervision for developing gaze understanding capabilities in VLMs. We will release the dataset and code to support further research and development in this direction.

[33] TrashDet: Iterative Neural Architecture Search for Efficient Waste Detection

Tony Tran,Bin Hu

Main category: cs.CV

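TL;DR: This paper presents a waste detection method for TinyML constraints. Using a hardware-aware neural architecture search framework on the TACO dataset, it builds the deployable TrashDet family of detectors, which significantly outperform existing methods in accuracy, energy efficiency, and latency.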

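Details Motivation: Efficient waste detection on resource-constrained edge and IoT devices involves trade-offs among model size, energy consumption, and accuracy, and existing methods struggle to meet strict TinyML constraints; a hardware-aware, scalable approach to detector design is therefore needed. Method: A Once-for-All-style ResDets supernet is combined with an iterative evolutionary search that alternates between optimizing the backbone and the neck/head, with a population passthrough mechanism and an accuracy predictor to reduce search cost and improve stability, yielding a family of TrashDet models for different hardware budgets. Result: On a five-class TACO subset, TrashDet-l reaches 19.5 mAP50 with 30.5M parameters, up to 3.6 mAP50 better than prior methods; the TrashDet family spans 1.2M to 30.5M parameters with mAP50 between 11.4 and 19.5; on the MAX78002 microcontroller, TrashDet-ResNet achieves 7525 μJ per inference, 26.7 ms latency, and 37.45 FPS, TrashDet-MBNet improves mAP50 by 10.2%, and overall energy, latency, and average power drop by up to 88%, 78%, and 53% relative to baselines. Conclusion: The work demonstrates the effectiveness of hardware-aware neural architecture search for TinyML object detection; the TrashDet family strikes a strong balance between accuracy and efficiency and offers scalable, high-performing waste detection for resource-constrained devices. Abstract: This paper addresses trash detection on the TACO dataset under strict TinyML constraints using an iterative hardware-aware neural architecture search framework targeting edge and IoT devices. The proposed method constructs a Once-for-All-style ResDets supernet and performs iterative evolutionary search that alternates between backbone and neck/head optimization, supported by a population passthrough mechanism and an accuracy predictor to reduce search cost and improve stability. This framework yields a family of deployment-ready detectors, termed TrashDets. On a five-class TACO subset (paper, plastic, bottle, can, cigarette), the strongest variant, TrashDet-l, achieves 19.5 mAP50 with 30.5M parameters, improving accuracy by up to 3.6 mAP50 over prior detectors while using substantially fewer parameters. The TrashDet family spans 1.2M to 30.5M parameters with mAP50 values between 11.4 and 19.5, providing scalable detector options for diverse TinyML deployment budgets on resource-constrained hardware. 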
On the MAX78002 microcontroller with the TrashNet dataset, two specialized variants, TrashDet-ResNet and TrashDet-MBNet, jointly dominate the ai87-fpndetector baseline, with TrashDet-ResNet achieving 7525 μJ energy per inference at 26.7 ms latency and 37.45 FPS, and TrashDet-MBNet improving mAP50 by 10.2%; together they reduce energy consumption by up to 88%, latency by up to 78%, and average power by up to 53% compared to existing TinyML detectors.
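
The TrashDet-ResNet figures quoted here are mutually consistent, which a short calculation can confirm. The throughput check uses only the reported latency; the average-power line is a derived value that assumes power equals energy divided by latency over the inference window, and it is not a number reported in the abstract.

```python
# Figures quoted in the abstract for TrashDet-ResNet on the MAX78002.
energy_per_inference_uj = 7525.0   # microjoules
latency_ms = 26.7                  # milliseconds

# Throughput if inferences run back to back: one inference per latency window.
fps = 1000.0 / latency_ms

# Assumption: average power over the inference window = energy / latency.
# This derived value is NOT a figure reported in the abstract.
avg_power_mw = (energy_per_inference_uj * 1e-6) / (latency_ms * 1e-3) * 1e3

print(f"{fps:.2f} FPS, ~{avg_power_mw:.0f} mW during inference")
```

The computed throughput matches the reported 37.45 FPS, suggesting the FPS figure is simply the reciprocal of the latency.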

[34] OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective

Markus Gross,Sai B. Matha,Aya Fahmy,Rui Song,Daniel Cremers,Henri Meess

Main category: cs.CV

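TL;DR: This paper presents OccuFly, the first real-world, camera-based aerial semantic scene completion (SSC) benchmark, for 3D perception by UAVs over urban, industrial, and rural scenes at different altitudes and in different seasons. A LiDAR-free reconstruction framework performs efficient 2D-to-3D label transfer, substantially reducing manual annotation cost, and the benchmark poses new challenges for aerial 3D scene understanding from elevated viewpoints.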

Details Motivation: Existing SSC research focuses mainly on terrestrial scenes (e.g., autonomous driving), while aerial scenes (e.g., UAV flight) are little studied and depend on LiDAR sensors, which are constrained by UAV payload, energy, and regulations. LiDAR point clouds are also severely sparse from elevated viewpoints, so a camera-based SSC solution suited to UAV platforms is needed. Method: OccuFly is a camera-based aerial SSC benchmark dataset captured at altitudes of 50m, 40m, and 30m, covering all four seasons and multiple environments. Traditional 3D reconstruction lifts partially annotated 2D masks into the reconstructed point cloud, automating label transfer and avoiding extensive manual 3D annotation. The data format follows existing conventions for easy integration. Result: The first real-world, camera-based aerial SSC benchmark, OccuFly, is released with 22 semantic classes and multi-scene, multi-season data. The LiDAR-free framework lowers 3D annotation cost and supports deployment on resource-constrained UAVs. Experiments evaluate existing state-of-the-art methods on the dataset and reveal challenges specific to elevated viewpoints. Conclusion: OccuFly offers a feasible camera-based aerial SSC solution for UAV platforms and advances aerial 3D scene understanding. The work fills a gap in aerial SSC and, through its lightweight annotation pipeline, provides a scalable data generation paradigm for future research. Abstract: Semantic Scene Completion (SSC) is crucial for 3D perception in mobile robotics, as it enables holistic scene understanding by jointly estimating dense volumetric occupancy and per-voxel semantics. Although SSC has been widely studied in terrestrial domains such as autonomous driving, aerial scenarios like autonomous flying remain largely unexplored, thereby limiting progress on downstream applications. Furthermore, LiDAR sensors represent the primary modality for SSC data generation, which poses challenges for most uncrewed aerial vehicles (UAVs) due to flight regulations, mass and energy constraints, and the sparsity of LiDAR-based point clouds from elevated viewpoints. To address these limitations, we introduce OccuFly, the first real-world, camera-based aerial SSC benchmark, captured at altitudes of 50m, 40m, and 30m during spring, summer, fall, and winter. OccuFly covers urban, industrial, and rural scenarios, provides 22 semantic classes, and the data format adheres to established conventions to facilitate seamless integration with existing research. Crucially, we propose a LiDAR-free data generation framework based on camera modality, which is ubiquitous on modern UAVs. By utilizing traditional 3D reconstruction, our framework automates label transfer by lifting a subset of annotated 2D masks into the reconstructed point cloud, thereby substantially minimizing manual 3D annotation effort. 
Finally, we benchmark the state-of-the-art on OccuFly and highlight challenges specific to elevated viewpoints, yielding a comprehensive vision benchmark for holistic aerial 3D scene understanding.

[35] NULLBUS: Multimodal Mixed-Supervision for Breast Ultrasound Segmentation via Nullable Global-Local Prompts

Raja Mallina,Bryar Shareef

Main category: cs.CV

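TL;DR: Proposes NullBUS, a framework that uses nullable prompts to enable mixed-supervision segmentation of breast ultrasound images with or without text prompts, substantially improving the use of multimodal data and segmentation performance.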

Details Motivation: Public breast ultrasound datasets often lack reliable metadata or reports, which limits the training and robustness of prompt-based segmentation methods. Method: NullBUS is a multimodal framework that supports mixed supervision. It introduces a nullable prompt mechanism, using learnable null embeddings and presence masks, to fall back to image-only mode when text is absent and to fuse multimodal information when text is present. Result: Evaluated on a unified pool of three public BUS datasets, NullBUS reaches a mean IoU of 0.8568 and a mean Dice of 0.9103, state-of-the-art performance. Conclusion: NullBUS makes effective use of data both with and without prompts, improving the robustness and practicality of breast ultrasound segmentation, especially in real-world settings with incomplete metadata. Abstract: Breast ultrasound (BUS) segmentation provides lesion boundaries essential for computer-aided diagnosis and treatment planning. While promptable methods can improve segmentation performance and tumor delineation when text or spatial prompts are available, many public BUS datasets lack reliable metadata or reports, constraining training to small multimodal subsets and reducing robustness. We propose NullBUS, a multimodal mixed-supervision framework that learns from images with and without prompts in a single model. To handle missing text, we introduce nullable prompts, implemented as learnable null embeddings with presence masks, enabling fallback to image-only evidence when metadata are absent and the use of text when present. Evaluated on a unified pool of three public BUS datasets, NullBUS achieves a mean IoU of 0.8568 and a mean Dice of 0.9103, demonstrating state-of-the-art performance under mixed prompt availability.
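
The nullable-prompt mechanism can be illustrated with a few lines of Python. This is a sketch of the general idea (a presence mask that switches between a text embedding and a learned null embedding), not the NullBUS code; `resolve_prompt` is a hypothetical name and the embeddings are plain lists rather than trainable tensors.

```python
from typing import List, Optional, Tuple


def resolve_prompt(
    text_embedding: Optional[List[float]],
    null_embedding: List[float],
) -> Tuple[List[float], float]:
    """Return the prompt vector to feed the model plus its presence mask.

    In training, `null_embedding` would be a learnable parameter; here it
    is a fixed list so the example stays self-contained.
    """
    present = 1.0 if text_embedding is not None else 0.0
    text = text_embedding if text_embedding is not None else [0.0] * len(null_embedding)
    # Mask-weighted blend: text when present, the null embedding otherwise.
    prompt = [present * t + (1.0 - present) * n for t, n in zip(text, null_embedding)]
    return prompt, present


null_vec = [0.1, 0.2]
print(resolve_prompt([1.0, 2.0], null_vec))  # sample with a report
print(resolve_prompt(None, null_vec))        # sample without metadata
```

Writing the choice as a blend rather than an `if` keeps it differentiable with respect to the null embedding, which is what lets a single model train on a mixed pool of prompted and unprompted images.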

[36] Learning to Sense for Driving: Joint Optics-Sensor-Model Co-Design for Semantic Segmentation

Reeshad Khan and John Gauch

Main category: cs.CV

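TL;DR: Proposes a task-driven, end-to-end camera-perception co-design framework that jointly optimizes optics, sensor modeling, and a lightweight semantic segmentation network, substantially improving segmentation performance and robustness in autonomous driving scenes.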

Details Motivation: In conventional autonomous driving systems, camera design is decoupled from the perception task; fixed optics and image signal processing pipelines discard key information and introduce artifacts that hinder machine perception, limiting downstream performance. Method: An end-to-end differentiable framework from RAW data to semantic segmentation that integrates a learnable color filter array (CFA), a realistic cellphone-scale lens model, Poisson-Gaussian noise modeling, and quantization, all optimized jointly with a lightweight segmentation network. Result: Consistent mIoU gains on KITTI-360, especially for thin objects and low-light-sensitive classes; the model has only about 1M parameters and runs at about 28 FPS, making it deployable at the edge; visualizations show that co-design sharpens boundaries and maintains accuracy under blur, noise, and low bit-depth. Conclusion: Full-stack co-optimization of optics, sensors, and networks is an effective route to efficient, reliable, and deployable perception for autonomous driving. Abstract: Traditional autonomous driving pipelines decouple camera design from downstream perception, relying on fixed optics and handcrafted ISPs that prioritize human-viewable imagery rather than machine semantics. This separation discards information during demosaicing, denoising, or quantization, while forcing models to adapt to sensor artifacts. We present a task-driven co-design framework that unifies optics, sensor modeling, and lightweight semantic segmentation networks into a single end-to-end RAW-to-task pipeline. Building on DeepLens[19], our system integrates realistic cellphone-scale lens models, learnable color filter arrays, Poisson-Gaussian noise processes, and quantization, all optimized directly for segmentation objectives. Evaluations on KITTI-360 show consistent mIoU improvements over fixed pipelines, with optics modeling and CFA learning providing the largest gains, especially for thin or low-light-sensitive classes. Importantly, these robustness gains are achieved with a compact ~1M-parameter model running at ~28 FPS, demonstrating edge deployability. Visual and quantitative analyses further highlight how co-designed sensors adapt acquisition to semantic structure, sharpening boundaries and maintaining accuracy under blur, noise, and low bit-depth. Together, these findings establish full-stack co-optimization of optics, sensors, and networks as a principled path toward efficient, reliable, and deployable perception in autonomous systems.
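
For readers unfamiliar with the sensor stages named above, the following sketch simulates one pixel passing through Poisson shot noise, Gaussian read noise, and uniform quantization. It is a generic forward model, not the paper's differentiable pipeline; the parameter names and default values (`full_well`, `read_sigma`, `bits`) are illustrative assumptions.

```python
import math
import random


def sample_poisson(mean: float, rng: random.Random) -> int:
    # Knuth's multiplication method; adequate for the small means used here.
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def quantize(value: float, bits: int) -> float:
    # Clip to [0, 1], then snap to the nearest of 2**bits uniform levels.
    levels = (1 << bits) - 1
    clipped = min(1.0, max(0.0, value))
    return round(clipped * levels) / levels


def sense_pixel(
    irradiance: float,          # normalized scene intensity in [0, 1]
    rng: random.Random,
    full_well: float = 50.0,    # assumed photon count at irradiance 1.0
    read_sigma: float = 0.01,   # assumed read-noise standard deviation
    bits: int = 8,
) -> float:
    photons = sample_poisson(irradiance * full_well, rng)          # shot noise
    reading = photons / full_well + rng.gauss(0.0, read_sigma)     # read noise
    return quantize(reading, bits)                                 # ADC


rng = random.Random(0)
print([sense_pixel(0.5, rng) for _ in range(4)])
```

In an end-to-end setup each stage would need a differentiable or surrogate-gradient form so that gradients from the segmentation loss can reach the optics and CFA parameters; this sketch covers only the forward simulation.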

[37] CHAMMI-75: pre-training multi-channel models with heterogeneous microscopy images

Vidit Agrawal,John Peters,Tyler N. Thompson,Mohammad Vali Sanian,Chau Pham,Nikita Moshkov,Arshad Kazi,Aditya Pillai,Jack Freeman,Byunguk Kang,Samouil L. Farhi,Ernest Fraenkel,Ron Stewart,Lassi Paavolainen,Bryan A. Plummer,Juan C. Caicedo

Main category: cs.CV

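TL;DR: CHAMMI-75 is an open dataset of heterogeneous multi-channel microscopy images from 75 different biological studies, intended for developing channel-adaptive cellular morphology models that can be reused across studies.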

Details Motivation: Existing cellular morphology models are usually trained on a single imaging type, so they generalize poorly across technical specifications or experimental conditions and are hard to reuse across studies. Method: Multi-channel microscopy images from 75 diverse biological studies are collected and curated from public sources to build the CHAMMI-75 dataset, which is used to train and evaluate channel-adaptive cellular morphology models. Result: Experiments show that training with CHAMMI-75 improves performance on multi-channel bioimaging tasks, mainly owing to the dataset's high diversity of microscopy modalities. Conclusion: CHAMMI-75 lays a foundation for the next generation of general-purpose cellular morphology models applicable to a wide range of biological studies. Abstract: Quantifying cell morphology using images and machine learning has proven to be a powerful tool to study the response of cells to treatments. However, models used to quantify cellular morphology are typically trained with a single microscopy imaging type. This results in specialized models that cannot be reused across biological studies because the technical specifications do not match (e.g., different number of channels), or because the target experimental conditions are out of distribution. Here, we present CHAMMI-75, an open access dataset of heterogeneous, multi-channel microscopy images from 75 diverse biological studies. We curated this resource from publicly available sources to investigate cellular morphology models that are channel-adaptive and can process any microscopy image type. Our experiments show that training with CHAMMI-75 can improve performance in multi-channel bioimaging tasks primarily because of its high diversity in microscopy modalities. This work paves the way to create the next generation of cellular morphology models for biological studies.

[38] Input-Adaptive Visual Preprocessing for Efficient Fast Vision-Language Model Inference

Putu Indah Githa Cahyani,Komang David Dananjaya Suartana,Novanto Yudistira

Main category: cs.CV

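TL;DR: This paper proposes an adaptive visual preprocessing method that reduces visual redundancy through content-aware resolution selection and cropping, significantly lowering inference time and visual token count without modifying the FastVLM architecture or retraining.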

Details Motivation: Existing vision-language models incur high inference latency and computational cost on high-resolution images, and static preprocessing causes redundant computation even for simple images. Method: An adaptive preprocessing method that dynamically adjusts input resolution and spatial coverage based on image content characteristics, combining content-aware analysis, adaptive resolution selection, and content-aware cropping, integrated into FastVLM without changing its structure. Result: On a subset of DocVQA, per-image inference time falls by more than 50%, mean full generation time decreases, and visual token count falls consistently by more than 55%. Conclusion: Input-aware preprocessing is an effective, lightweight strategy for improving the efficiency of vision-language models in deployment. Abstract: Vision-Language Models (VLMs) have demonstrated strong performance on multimodal reasoning tasks, but their deployment remains challenging due to high inference latency and computational cost, particularly when processing high-resolution visual inputs. While recent architectures such as FastVLM improve efficiency through optimized vision encoders, existing pipelines still rely on static visual preprocessing, leading to redundant computation for visually simple inputs. In this work, we propose an adaptive visual preprocessing method that dynamically adjusts input resolution and spatial coverage based on image content characteristics. The proposed approach combines content-aware image analysis, adaptive resolution selection, and content-aware cropping to reduce visual redundancy prior to vision encoding. Importantly, the method is integrated with FastVLM without modifying its architecture or requiring retraining. We evaluate the proposed method on a subset of the DocVQA dataset in an inference-only setting, focusing on efficiency-oriented metrics. Experimental results show that adaptive preprocessing reduces per-image inference time by over 50%, lowers mean full generation time, and achieves a consistent reduction of more than 55% in visual token count compared to the baseline pipeline. These findings demonstrate that input-aware preprocessing is an effective and lightweight strategy for improving deployment-oriented efficiency of vision-language models. To facilitate reproducibility, our implementation is provided as a fork of the FastVLM repository, incorporating the files for the proposed method, and is available at https://github.com/kmdavidds/mlfastlm.

[39] ALIVE: An Avatar-Lecture Interactive Video Engine with Content-Aware Retrieval for Real-Time Interaction

Md Zabirul Islam,Md Motaleb Hossen Manik,Ge Wang

Main category: cs.CV

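TL;DR: ALIVE is an interactive video learning system that runs entirely locally. Using speech recognition, large language models, and neural avatar technology, it provides lecture-aware real-time question answering to improve the learning experience of recorded courses.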

Details Motivation: Conventional recorded lectures lack a mechanism for real-time clarification, so confused students must search externally; existing interactive systems mostly rely on the cloud, lack content awareness, or fail to integrate retrieval with avatar-delivered explanations. Method: The ALIVE system combines ASR transcription, LLM refinement, neural avatar generation, content-aware retrieval that aligns semantics with timestamps, and multimodal text/voice interaction, all running on local hardware, with lightweight embedding models, FAISS retrieval, and segmented preloading to stay responsive. Result: The system is validated on a medical imaging course, showing accurate retrieval, low-latency responses, and a good user experience, with real-time explanations delivered as text or by avatar. Conclusion: ALIVE shows that combining local, multimodal AI with content-aware retrieval can significantly increase the teaching value of recorded lectures and offers an extensible path toward next-generation interactive learning environments. Abstract: Traditional lecture videos offer flexibility but lack mechanisms for real-time clarification, forcing learners to search externally when confusion arises. Recent advances in large language models and neural avatars provide new opportunities for interactive learning, yet existing systems typically lack lecture awareness, rely on cloud-based services, or fail to integrate retrieval and avatar-delivered explanations in a unified, privacy-preserving pipeline. We present ALIVE, an Avatar-Lecture Interactive Video Engine that transforms passive lecture viewing into a dynamic, real-time learning experience. ALIVE operates fully on local hardware and integrates (1) Avatar-delivered lecture generated through ASR transcription, LLM refinement, and neural talking-head synthesis; (2) A content-aware retrieval mechanism that combines semantic similarity with timestamp alignment to surface contextually relevant lecture segments; and (3) Real-time multimodal interaction, enabling students to pause the lecture, ask questions through text or voice, and receive grounded explanations either as text or as avatar-delivered responses. To maintain responsiveness, ALIVE employs lightweight embedding models, FAISS-based retrieval, and segmented avatar synthesis with progressive preloading. We demonstrate the system on a complete medical imaging course, evaluate its retrieval accuracy, latency characteristics, and user experience, and show that ALIVE provides accurate, content-aware, and engaging real-time support. 
ALIVE illustrates how multimodal AI, when combined with content-aware retrieval and local deployment, can significantly enhance the pedagogical value of recorded lectures, offering an extensible pathway toward next-generation interactive learning environments.
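
One way to picture retrieval that "combines semantic similarity with timestamp alignment" is a weighted score over lecture segments. The abstract does not state how the two signals are combined, so the blend below (a weight `alpha` and an exponential decay with `time_scale`) is purely an assumption, `rank_segments` is a hypothetical name, and a brute-force loop stands in for the FAISS index the system uses.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_segments(
    query_vec: Sequence[float],
    pause_time: float,                                  # seconds into the lecture
    segments: List[Tuple[float, Sequence[float]]],      # (start_seconds, embedding)
    alpha: float = 0.7,                                 # assumed semantic weight
    time_scale: float = 60.0,                           # assumed decay, in seconds
) -> List[int]:
    scored = []
    for index, (start, embedding) in enumerate(segments):
        semantic = cosine(query_vec, embedding)
        # Segments near where the student paused get a temporal bonus.
        temporal = math.exp(-abs(pause_time - start) / time_scale)
        scored.append((alpha * semantic + (1.0 - alpha) * temporal, index))
    return [index for _, index in sorted(scored, reverse=True)]


segments = [(0.0, [1.0, 0.0]), (300.0, [1.0, 0.0]), (310.0, [0.0, 1.0])]
print(rank_segments([1.0, 0.0], pause_time=305.0, segments=segments))
```

Two segments that match the question equally well are separated by how close they are to the pause point, which is the intuition behind making retrieval lecture-aware.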

[40] Lightweight framework for underground pipeline recognition and spatial localization based on multi-view 2D GPR images

Haotian Lv,Chao Li,Jiangbo Dai,Yuhui Zhang,Zepeng Fan,Yiqiu Tan,Dawei Wang,Binglei Xie

Main category: cs.CV

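TL;DR: This paper proposes an intelligent 3D underground pipeline detection framework based on joint analysis of B/C/D-Scan views. Combining an improved DCO-YOLO model with a 3D-DIoU matching algorithm, it improves small-target detection accuracy and multi-view feature association, reaching 96.7% mAP on real urban data.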

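Details Motivation: 3D ground-penetrating radar (GPR) pipeline detection suffers from weak association between multi-view features, low recognition accuracy for small-scale targets, and insufficient robustness in complex scenes; this work aims to improve detection accuracy and reliability. Method: 1) A B/C/D-Scan three-view joint analysis strategy, with a three-view feature evaluation method for 3D pipelines built by cross-validating FDTD forward simulations against measured data; 2) the DCO-YOLO framework, which integrates DySample, CGLU, and OutlookAttention modules to strengthen cross-dimensional feature association and edge feature extraction; 3) a 3D-DIoU spatial matching algorithm that introduces 3D geometric constraints and a center distance penalty to associate multi-view annotations automatically; 4) a three-view fusion strategy to resolve ambiguities in single-view detection. Result: On real urban underground pipeline data, the method reaches accuracy, recall, and mean average precision (mAP) of 96.2%, 93.3%, and 96.7% in complex multi-pipeline scenes, improvements of 2.0%, 2.1%, and 0.9% over the baseline; ablations confirm the joint contribution of the modules, and Grad-CAM++ visualizations show the model focusing more on pipeline geometry. Conclusion: By combining deep learning optimization strategies with the physical characteristics of 3D GPR, the study offers an efficient and reliable framework for intelligent recognition and localization of underground pipelines, with clear gains in small-target detection and multi-view fusion. Abstract: To address the issues of weak correlation between multi-view features, low recognition accuracy of small-scale targets, and insufficient robustness in complex scenarios in underground pipeline detection using 3D GPR, this paper proposes a 3D pipeline intelligent detection framework. First, based on a B/C/D-Scan three-view joint analysis strategy, a three-view feature evaluation method for three-dimensional pipelines is established by cross-validating forward simulation results obtained using FDTD methods with actual measurement data. Second, the DCO-YOLO framework is proposed, which integrates DySample, CGLU, and OutlookAttention cross-dimensional correlation mechanisms into the original YOLOv11 algorithm, significantly improving the small-scale pipeline edge feature extraction capability. Furthermore, a 3D-DIoU spatial feature matching algorithm is proposed, which integrates three-dimensional geometric constraints and center distance penalty terms to achieve automated association of multi-view annotations. The three-view fusion strategy resolves inherent ambiguities in single-view detection. Experiments based on real urban underground pipeline data show that the proposed method achieves accuracy, recall, and mean average precision of 96.2%, 93.3%, and 96.7%, respectively, in complex multi-pipeline scenarios, which are 2.0%, 2.1%, and 0.9% higher than the baseline model. 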
Ablation experiments validated the synergistic optimization effect of the dynamic feature enhancement module, and Grad-CAM++ heatmap visualization demonstrated that the improved model focuses significantly better on pipeline geometric features. This study integrates deep learning optimization strategies with the physical characteristics of 3D GPR, offering an efficient and reliable novel technical framework for the intelligent recognition and localization of underground pipelines.
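
The abstract describes 3D-DIoU as combining three-dimensional geometric constraints with a center distance penalty but gives no formula. A natural reading, assumed here, is the standard DIoU score lifted to axis-aligned 3D boxes: volumetric IoU minus the squared center distance over the squared diagonal of the smallest enclosing box. The function name `diou_3d` and the box layout are illustrative choices, not taken from the paper.

```python
from typing import Sequence

Box = Sequence[float]  # axis-aligned box as (x1, y1, z1, x2, y2, z2)


def volume(box: Box) -> float:
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def diou_3d(a: Box, b: Box) -> float:
    # Intersection volume: product of per-axis overlaps (zero if any axis misses).
    inter = 1.0
    for axis in range(3):
        overlap = min(a[axis + 3], b[axis + 3]) - max(a[axis], b[axis])
        inter *= max(0.0, overlap)
    iou = inter / (volume(a) + volume(b) - inter)

    # Squared distance between box centers.
    center_sq = sum(
        ((a[axis] + a[axis + 3]) / 2.0 - (b[axis] + b[axis + 3]) / 2.0) ** 2
        for axis in range(3)
    )
    # Squared diagonal of the smallest box enclosing both.
    diag_sq = sum(
        (max(a[axis + 3], b[axis + 3]) - min(a[axis], b[axis])) ** 2
        for axis in range(3)
    )
    return iou - center_sq / diag_sq


unit = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
shifted = (2.0, 0.0, 0.0, 3.0, 1.0, 1.0)
print(diou_3d(unit, unit), diou_3d(unit, shifted))
```

Unlike plain IoU, the score still distinguishes near misses from far misses when boxes do not overlap, which is useful when matching annotations of the same pipe across B-, C-, and D-Scan views.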

[41] NeRV360: Neural Representation for 360-Degree Videos with a Viewport Decoder

Daichi Arai,Kyohei Unno,Yasuko Sugito,Yuichi Kusakabe

Main category: cs.CV

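TL;DR: NeRV360 is an end-to-end implicit neural representation compression framework for high-resolution 360-degree video. By decoding only the user's viewport and introducing a spatial-temporal affine transform module, it significantly reduces memory consumption and speeds up decoding.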

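Details Motivation: Existing implicit neural video representations (NeRV) use too much memory and decode too slowly on high-resolution 360-degree video to support real-time applications, so a more efficient compression and decoding framework is needed. Method: The NeRV360 framework integrates viewport extraction into decoding so that only the user-selected viewport is decoded, and introduces a spatial-temporal affine transform module for decoding conditioned on viewpoint and time, avoiding full-frame reconstruction. Result: On 6K-resolution videos, NeRV360 uses 7 times less memory and decodes 2.5 times faster than HNeRV, a representative prior work, while delivering better image quality on objective metrics. Conclusion: Through conditional decoding and viewport-specific reconstruction, NeRV360 addresses the memory and speed bottlenecks of high-resolution 360-degree video and offers a viable option for real-time immersive video applications. Abstract: Implicit neural representations for videos (NeRV) have shown strong potential for video compression. However, applying NeRV to high-resolution 360-degree videos causes high memory usage and slow decoding, making real-time applications impractical. We propose NeRV360, an end-to-end framework that decodes only the user-selected viewport instead of reconstructing the entire panoramic frame. Unlike conventional pipelines, NeRV360 integrates viewport extraction into decoding and introduces a spatial-temporal affine transform module for conditional decoding based on viewpoint and time. Experiments on 6K-resolution videos show that NeRV360 achieves a 7-fold reduction in memory consumption and a 2.5-fold increase in decoding speed compared to HNeRV, a representative prior work, while delivering better image quality in terms of objective metrics.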

[42] Beyond Weight Adaptation: Feature-Space Domain Injection for Cross-Modal Ship Re-Identification

Tingfeng Xian,Wenlve Zhou,Zhiheng Zhou,Zhelin Li

Main category: cs.CV

TL;DR: 本文提出了一种基于视觉基础模型(VFM)的域表示注入(DRI)方法,用于解决跨模态船舶重识别中的模态差异问题,在保持VFM冻结的同时通过轻量级模块实现特征空间的参数高效微调,取得了最先进的性能。

Details Motivation: Existing cross-modal ship re-identification methods rely on large-scale paired data for explicit modality alignment, which is costly and hard to scale; generic parameter-efficient fine-tuning methods also perform poorly on low-capacity models, so a more efficient and general solution is needed. Method: Grounded in the Platonic Representation Hypothesis, Domain Representation Injection (DRI) optimizes in feature space: a learnable Offset Encoder extracts modality and identity features from the raw input, and a Modulator, guided by the context of intermediate layers, adaptively injects these representations into the VFM's intermediate layers to reshape the feature distribution dynamically while the backbone stays fully frozen. Result: On HOSS-ReID, the method reaches 57.9% and 60.5% mAP with only 1.54M and 7.05M trainable parameters, respectively, clearly outperforming existing methods and balancing state-of-the-art performance with very low parameter overhead. Conclusion: By injecting domain-specific representations in feature space, DRI uses a frozen vision foundation model to address cross-modal discrepancy, offering a new perspective on parameter-efficient fine-tuning with good prospects for application and extension. Abstract: Cross-Modality Ship Re-Identification (CMS Re-ID) is critical for achieving all-day and all-weather maritime target tracking, yet it is fundamentally challenged by significant modality discrepancies. Mainstream solutions typically rely on explicit modality alignment strategies; however, this paradigm heavily depends on constructing large-scale paired datasets for pre-training. To address this, grounded in the Platonic Representation Hypothesis, we explore the potential of Vision Foundation Models (VFMs) in bridging modality gaps. Recognizing the suboptimal performance of existing generic Parameter-Efficient Fine-Tuning (PEFT) methods that operate within the weight space, particularly on limited-capacity models, we shift the optimization perspective to the feature space and propose a novel PEFT strategy termed Domain Representation Injection (DRI). Specifically, while keeping the VFM fully frozen to maximize the preservation of general knowledge, we design a lightweight, learnable Offset Encoder to extract domain-specific representations rich in modality and identity attributes from raw inputs. Guided by the contextual information of intermediate features at different layers, a Modulator adaptively transforms these representations. Subsequently, they are injected into the intermediate layers via additive fusion, dynamically reshaping the feature distribution to adapt to the downstream task without altering the VFM's pre-trained weights. 
Extensive experimental results demonstrate the superiority of our method, achieving State-of-the-Art (SOTA) performance with minimal trainable parameters. For instance, on the HOSS-ReID dataset, we attain 57.9% and 60.5% mAP using only 1.54M and 7.05M parameters, respectively. The code is available at https://github.com/TingfengXian/DRI.

[43] DGSAN: Dual-Graph Spatiotemporal Attention Network for Pulmonary Nodule Malignancy Prediction

Xiao Yu,Zhaojie Fang,Guanyu Zhou,Yin Shen,Huoling Luo,Ye Li,Ahmed Elazab,Xiang Wan,Ruiquan Ge,Changmiao Wang

Main category: cs.CV

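TL;DR: Proposes a Dual-Graph Spatiotemporal Attention Network (DGSAN) that fuses multimodal and multi-temporal information to improve pulmonary nodule classification accuracy, and builds the new NLST-cmst dataset; experiments show the method outperforms existing state-of-the-art approaches in both accuracy and computational efficiency.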

Details Motivation: Existing methods for fusing multimodal and multi-temporal information (such as vector concatenation and simple mutual attention) are inefficient and of limited effectiveness, and fail to capture the spatiotemporal changes of pulmonary nodules; a more effective fusion mechanism is needed to improve early diagnosis. Method: The DGSAN model comprises a Global-Local Feature Encoder, a Dual-Graph Construction method (organizing inter-modal and intra-modal graphs), and a Hierarchical Cross-Modal Graph Fusion Module for efficient multimodal spatiotemporal feature fusion; a new multimodal dataset, NLST-cmst, is built from NLST and CSTL data for validation. Result: On NLST-cmst and CSTL-derived datasets, DGSAN significantly outperforms existing state-of-the-art methods at pulmonary nodule classification while being computationally efficient. Conclusion: With its dual-graph structure and hierarchical graph fusion strategy, DGSAN improves multimodal, multi-temporal pulmonary nodule analysis and provides strong technical support for early lung cancer diagnosis. Abstract: Lung cancer continues to be the leading cause of cancer-related deaths globally. Early detection and diagnosis of pulmonary nodules are essential for improving patient survival rates. Although previous research has integrated multimodal and multi-temporal information, outperforming single modality and single time point, the fusion methods are limited to inefficient vector concatenation and simple mutual attention, highlighting the need for more effective multimodal information fusion. To address these challenges, we introduce a Dual-Graph Spatiotemporal Attention Network, which leverages temporal variations and multimodal data to enhance the accuracy of predictions. Our methodology involves developing a Global-Local Feature Encoder to better capture the local, global, and fused characteristics of pulmonary nodules. Additionally, a Dual-Graph Construction method organizes multimodal features into inter-modal and intra-modal graphs. Furthermore, a Hierarchical Cross-Modal Graph Fusion Module is introduced to refine feature integration. We also compiled a novel multimodal dataset named the NLST-cmst dataset as a comprehensive source of support for related research. Our extensive experiments, conducted on both the NLST-cmst and curated CSTL-derived datasets, demonstrate that our DGSAN significantly outperforms state-of-the-art methods in classifying pulmonary nodules with exceptional computational efficiency.

[44] Benchmarking and Enhancing VLM for Compressed Image Understanding

Zifu Zhang,Tongda Xu,Siqi Li,Shengxi Li,Yue Zhang,Mai Xu,Yan Wang

Main category: cs.CV

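TL;DR: This paper presents the first comprehensive benchmark for evaluating vision-language models (VLMs) on compressed images, analyzes the causes of performance degradation, and proposes a universal VLM adaptor that improves performance by 10%-30% on images compressed with different codecs and bitrates.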

Details Motivation: As vision-language models (VLMs) advance and demand for their applications grows, efficient compression of image inputs becomes increasingly important. Existing VLMs mainly process high-bitrate compressed images, and their ability to understand low-bitrate compressed images has not been fully explored. A systematic benchmark is therefore needed to evaluate VLMs on compressed images and to find effective ways to improve them. Method: A large-scale benchmark of more than one million compressed images is built, covering several widely used image codecs and diverse tasks; the performance gap is attributed to information loss during compression and to generalization failure of the VLM, and the root causes are analyzed; a universal VLM adaptor is designed and trained to improve understanding of low-bitrate compressed images. Result: The degradation caused by information loss is hard to avoid, but the part caused by generalization failure can be effectively mitigated with the adaptor; experiments show that a single adaptor improves VLM performance by 10%-30% across codecs and bitrates. Conclusion: The study reveals both the challenges and the potential of VLMs on compressed images; the benchmark and universal adaptor provide a foundation and a practical solution for applying VLMs in real low-bandwidth settings. Abstract: With the rapid development of Vision-Language Models (VLMs) and the growing demand for their applications, efficient compression of the image inputs has become increasingly important. Existing VLMs predominantly digest and understand high-bitrate compressed images, while their ability to interpret low-bitrate compressed images has yet to be explored. In this paper, we introduce the first comprehensive benchmark to evaluate the ability of VLMs on compressed images, covering widely used image codecs and a diverse set of tasks, and encompassing over one million compressed images. Next, we analyse the source of the performance gap by categorising it into a) the information loss during compression and b) generalisation failure of the VLM. We visualize these gaps with concrete examples and identify that for compressed images, only the generalization gap can be mitigated. Finally, we propose a universal VLM adaptor to enhance model performance on images compressed by existing codecs. Consequently, we demonstrate that a single adaptor can improve VLM performance across images with varying codecs and bitrates by 10%-30%. We believe that our benchmark and enhancement method provide valuable insights and contribute toward bridging the gap between VLMs and compressed images.

[45] PanoGrounder: Bridging 2D and 3D with Panoramic Scene Representations for VLM-based 3D Visual Grounding

Seongmin Jung,Seongho Choi,Gunwoo Jeon,Minsu Cho,Jongwoo Lim

Main category: cs.CV

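TL;DR: Proposes PanoGrounder, a generalizable 3D visual grounding framework based on multi-modal panoramic representations and pretrained 2D vision-language models; it achieves SOTA on ScanRefer and Nr3D and generalizes strongly to unseen data and text rephrasings.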

Details Motivation: Conventional 3D visual grounding models depend on explicit 3D geometry and generalize poorly because 3D vision-language datasets are scarce and their reasoning ability is limited. The aim is to use the strong reasoning of modern vision-language models to improve the generalization of 3DVG. Method: PanoGrounder uses panoramic renderings enriched with 3D semantic and geometric features as an intermediate representation between 2D and 3D, in a three-stage pipeline: place a compact set of panoramic viewpoints according to the scene layout, ground the text query in each view with a VLM, and fuse the multi-view predictions into a 3D bounding box via lifting. Result: State-of-the-art performance on ScanRefer and Nr3D, with strong generalization to unseen 3D datasets and different text phrasings. Conclusion: By combining panoramic representations with pretrained 2D VLMs, PanoGrounder improves the generalization and reasoning of 3D visual grounding and provides a stronger bridge from vision-language perception to robotics applications. Abstract: 3D Visual Grounding (3DVG) is a critical bridge from vision-language perception to robotics, requiring both language understanding and 3D scene reasoning. Traditional supervised models leverage explicit 3D geometry but exhibit limited generalization, owing to the scarcity of 3D vision-language datasets and the limited reasoning capabilities compared to modern vision-language models (VLMs). We propose PanoGrounder, a generalizable 3DVG framework that couples multi-modal panoramic representation with pretrained 2D VLMs for strong vision-language reasoning. Panoramic renderings, augmented with 3D semantic and geometric features, serve as an intermediate representation between 2D and 3D, and offer two major benefits: (i) they can be directly fed to VLMs with minimal adaptation and (ii) they retain long-range object-to-object relations thanks to their 360-degree field of view. We devise a three-stage pipeline that places a compact set of panoramic viewpoints considering the scene layout and geometry, grounds a text query on each panoramic rendering with a VLM, and fuses per-view predictions into a single 3D bounding box via lifting. Our approach achieves state-of-the-art results on ScanRefer and Nr3D, and demonstrates superior generalization to unseen 3D datasets and text rephrasings.

[46] Self-supervised Multiplex Consensus Mamba for General Image Fusion

Yingying Wang,Rongjin Zhuang,Hui Zheng,Xuanhua He,Ke Cao,Xiaotong Tu,Xinghao Ding

Main category: cs.CV

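TL;DR: This paper proposes SMC-Mamba, a self-supervised multiplex consensus Mamba framework for general image fusion. It fuses multimodal information through modality-agnostic feature enhancement and a multiplex consensus cross-modal Mamba module, and introduces a bi-level self-supervised contrastive learning loss that preserves high-frequency information while improving downstream task performance.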

Details Motivation: General image fusion must integrate complementary information from different modalities to support a variety of downstream tasks without added complexity, and existing methods struggle to combine broad applicability with high performance. Method: The SMC-Mamba framework comprises a Modality-Agnostic Feature Enhancement (MAFE) module and a Multiplex Consensus Cross-modal Mamba (MCCM) module, with a Bi-level Self-supervised Contrastive Learning Loss (BSCL) to strengthen feature representation and cross-modal interaction. Result: The method surpasses existing state-of-the-art approaches on infrared-visible, medical, multi-focus, and multi-exposure fusion, and significantly improves downstream vision task performance. Conclusion: Through its self-supervised multiplex consensus mechanism, SMC-Mamba achieves efficient general image fusion with superior performance across fusion scenarios and downstream tasks. Abstract: Image fusion integrates complementary information from different modalities to generate high-quality fused images, thereby enhancing downstream tasks such as object detection and semantic segmentation. Unlike task-specific techniques that primarily focus on consolidating inter-modal information, general image fusion needs to address a wide range of tasks while improving performance without increasing complexity. To achieve this, we propose SMC-Mamba, a Self-supervised Multiplex Consensus Mamba framework for general image fusion. Specifically, the Modality-Agnostic Feature Enhancement (MAFE) module preserves fine details through adaptive gating and enhances global representations via spatial-channel and frequency-rotational scanning. The Multiplex Consensus Cross-modal Mamba (MCCM) module enables dynamic collaboration among experts, reaching a consensus to efficiently integrate complementary information from multiple modalities. The cross-modal scanning within MCCM further strengthens feature interactions across modalities, facilitating seamless integration of critical information from both sources. Additionally, we introduce a Bi-level Self-supervised Contrastive Learning Loss (BSCL), which preserves high-frequency information without increasing computational overhead while simultaneously boosting performance in downstream tasks. Extensive experiments demonstrate that our approach outperforms state-of-the-art (SOTA) image fusion algorithms in tasks such as infrared-visible, medical, multi-focus, and multi-exposure fusion, as well as downstream visual tasks.

[47] Quantile Rendering: Efficiently Embedding High-dimensional Feature on 3D Gaussian Splatting

Yoonwoo Jeong,Cheng Sun,Frank Wang,Minsu Cho,Jaesung Choe

Main category: cs.CV

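TL;DR: This paper proposes Quantile Rendering (Q-Render), a new rendering strategy, and the Gaussian Splatting Network (GS-Net), which together handle the high-dimensional features of 3D Gaussians efficiently while maintaining high fidelity, improving open-vocabulary segmentation. Experiments show that the method outperforms prior work on ScanNet and LeRF and renders in real time with an approximately 43.7x speedup.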

Details Motivation: Existing 3D open-vocabulary segmentation methods rely on codebooks or compression when rendering high-dimensional features, which loses information and degrades segmentation quality; a rendering method that handles high-dimensional features efficiently and without loss is therefore needed. Method: Quantile Rendering (Q-Render) renders efficiently by sparsely sampling only the 3D Gaussians with dominant influence along each ray; a generalizable 3D neural network, GS-Net, is built to predict Gaussian features. Result: The method outperforms the current state of the art on ScanNet and LeRF while achieving an approximately 43.7x rendering speedup on 512-dimensional feature maps. Conclusion: Combining Q-Render with GS-Net yields an efficient, high-quality solution for 3D open-vocabulary segmentation that markedly improves rendering efficiency and segmentation performance and has potential for practical use. Abstract: Recent advancements in computer vision have successfully extended Open-vocabulary segmentation (OVS) to the 3D domain by leveraging 3D Gaussian Splatting (3D-GS). Despite this progress, efficiently rendering the high-dimensional features required for open-vocabulary queries poses a significant challenge. Existing methods employ codebooks or feature compression, causing information loss, thereby degrading segmentation quality. To address this limitation, we introduce Quantile Rendering (Q-Render), a novel rendering strategy for 3D Gaussians that efficiently handles high-dimensional features while maintaining high fidelity. Unlike conventional volume rendering, which densely samples all 3D Gaussians intersecting each ray, Q-Render sparsely samples only those with dominant influence along the ray. By integrating Q-Render into a generalizable 3D neural network, we also propose Gaussian Splatting Network (GS-Net), which predicts Gaussian features in a generalizable manner. Extensive experiments on ScanNet and LeRF demonstrate that our framework outperforms state-of-the-art methods, while enabling real-time rendering with an approximately 43.7x speedup on 512-D feature maps. Code will be made publicly available.

[48] Transductive Visual Programming: Evolving Tool Libraries from Experience for Spatial Reasoning

Shengguang Wu,Xiaohan Wang,Yuhui Zhang,Hao Zhu,Serena Yeung-Levy

Main category: cs.CV

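TL;DR: This paper proposes Transductive Visual Programming (TVP), a visual programming framework that improves spatial reasoning in 3D scenes by building new tools from experience; it achieves state-of-the-art performance and shows strong tool reuse and generalization.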

Details Motivation: Existing visual programming methods rely on fixed toolsets or speculative tool generation, which yields suboptimal programs and poor tool utilization and makes it hard to handle 3D spatial reasoning tasks that require precise geometric calculation. Method: TVP first solves problems with basic tools and stores the solutions in an Example Library, then abstracts recurring patterns from them into reusable higher-level tools, forming an evolving Tool Library so that later tasks can be solved with more powerful, self-learned tools. Result: On Omni3D-Bench, TVP outperforms GPT-4o by 22% and the previous best system by 11%; the learned tools are used as core dependencies 5x more often than inductively created tools, and they generalize strongly to unseen tasks such as SpatialScore-Hard. Conclusion: Experience-driven transductive tool creation is an effective paradigm for building self-evolving visual programming agents and substantially improves performance on complex spatial reasoning tasks. Abstract: Spatial reasoning in 3D scenes requires precise geometric calculations that challenge vision-language models. Visual programming addresses this by decomposing problems into steps calling specialized tools, yet existing methods rely on either fixed toolsets or speculative tool induction before solving problems, resulting in suboptimal programs and poor utilization of induced tools. We present Transductive Visual Programming (TVP), a novel framework that builds new tools from its own experience rather than speculation. TVP first solves problems using basic tools while accumulating experiential solutions into an Example Library, then abstracts recurring patterns from these programs into reusable higher-level tools for an evolving Tool Library. This allows TVP to tackle new problems with increasingly powerful tools learned from experience. On Omni3D-Bench, TVP achieves state-of-the-art performance, outperforming GPT-4o by 22% and the previous best visual programming system by 11%. Our transductively learned tools are used 5x more frequently as core program dependencies than inductively created ones, demonstrating more effective tool discovery and reuse. The evolved tools also show strong generalization to unseen spatial tasks, achieving superior performance on benchmarks from the SpatialScore-Hard collection without any testset-specific modification. Our work establishes experience-driven transductive tool creation as a powerful paradigm for building self-evolving visual programming agents that effectively tackle challenging spatial reasoning tasks. We release our code at https://transductive-visualprogram.github.io/.

[49] Reasoning-Driven Amodal Completion: Collaborative Agents and Perceptual Evaluation

Hongxing Fan,Shuyu Zhao,Jiayang Ao,Lu Sheng

Main category: cs.CV

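TL;DR: Proposes a collaborative multi-agent reasoning framework that addresses semantic consistency and structural integrity in amodal completion by decoupling semantic planning from visual synthesis and adding a self-correcting verification mechanism and diverse hypothesis generation; it significantly outperforms existing methods.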

Details Motivation: Existing progressive approaches to amodal completion suffer from inference instability and error accumulation, making it hard to maintain semantic consistency and structural integrity. Method: A collaborative multi-agent reasoning framework decouples semantic planning from visual synthesis; specialized agents reason upfront to produce a structured plan; a self-correcting Verification Agent (based on Chain-of-Thought reasoning) and a Diverse Hypothesis Generator are introduced; and a new evaluation metric, MAC-Score, is proposed. Result: Experiments on multiple datasets show that the method significantly outperforms current state-of-the-art approaches in structural completeness and semantic consistency; MAC-Score agrees closely with human judgment and ground truth. Conclusion: The framework addresses the key challenges of amodal completion, achieves semantically and visually coherent single-pass synthesis, and provides a new benchmark and evaluation standard for future research. Abstract: Amodal completion, the task of inferring invisible object parts, faces significant challenges in maintaining semantic consistency and structural integrity. Prior progressive approaches are inherently limited by inference instability and error accumulation. To tackle these limitations, we present a Collaborative Multi-Agent Reasoning Framework that explicitly decouples Semantic Planning from Visual Synthesis. By employing specialized agents for upfront reasoning, our method generates a structured, explicit plan before pixel generation, enabling visually and semantically coherent single-pass synthesis. We integrate this framework with two critical mechanisms: (1) a self-correcting Verification Agent that employs Chain-of-Thought reasoning to rectify visible region segmentation and identify residual occluders strictly within the Semantic Planning phase, and (2) a Diverse Hypothesis Generator that addresses the ambiguity of invisible regions by offering diverse, plausible semantic interpretations, surpassing the limited pixel-level variations of standard random seed sampling. Furthermore, addressing the limitations of traditional metrics in assessing inferred invisible content, we introduce the MAC-Score (MLLM Amodal Completion Score), a novel human-aligned evaluation metric. Validated against human judgment and ground truth, these metrics establish a robust standard for assessing structural completeness and semantic consistency with visible context. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods across multiple datasets. Our project is available at: https://fanhongxing.github.io/remac-page.

[50] Beyond Artifacts: Real-Centric Envelope Modeling for Reliable AI-Generated Image Detection

Ruiqi Liu,Yi Han,Zhengbo Zhang,Liwei Yao,Zhiyuan Yan,Jialiang Shen,ZhiJin Chen,Boyi Sun,Lubin Weng,Jing Dong,Yan Wang,Shu Wu

Main category: cs.CV

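TL;DR: This paper proposes Real-centric Envelope Modeling (REM), a new paradigm for synthetic image detection that models the distribution of real images rather than generator artifacts to improve robustness and generalization under real-world degradation. It also builds RealChain, a benchmark covering multiple generators with simulated real-world degradation; experiments show that REM significantly outperforms existing methods.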

Details Motivation: Existing synthetic image detectors tend to overfit to generator-specific artifacts and are highly sensitive to real-world degradations (chain degradations such as repeated sharing and post-processing), so their performance drops in practice. Method: REM generates near-real samples through feature-level perturbations during self-reconstruction and uses an envelope estimator with cross-domain consistency to learn a boundary enclosing the real image manifold, thereby modeling the distribution of real images. Result: Across eight benchmark evaluations, REM improves on existing state-of-the-art methods by 7.5% on average and generalizes exceptionally well on the severely degraded RealChain benchmark. Conclusion: By shifting from reliance on generator artifacts to modeling the real image distribution, REM offers a more reliable and robust approach to synthetic image detection under real-world conditions. Abstract: The rapid progress of generative models has intensified the need for reliable and robust detection under real-world conditions. However, existing detectors often overfit to generator-specific artifacts and remain highly sensitive to real-world degradations. As generative architectures evolve and images undergo multi-round cross-platform sharing and post-processing (chain degradations), these artifact cues become obsolete and harder to detect. To address this, we propose Real-centric Envelope Modeling (REM), a new paradigm that shifts detection from learning generator artifacts to modeling the robust distribution of real images. REM introduces feature-level perturbations in self-reconstruction to generate near-real samples, and employs an envelope estimator with cross-domain consistency to learn a boundary enclosing the real image manifold. We further build RealChain, a comprehensive benchmark covering both open-source and commercial generators with simulated real-world degradation. Across eight benchmark evaluations, REM achieves an average improvement of 7.5% over state-of-the-art methods, and notably maintains exceptional generalization on the severely degraded RealChain benchmark, establishing a solid foundation for synthetic image detection under real-world conditions. The code and the RealChain benchmark will be made publicly available upon acceptance of the paper.

[51] SPOT!: Map-Guided LLM Agent for Unsupervised Multi-CCTV Dynamic Object Tracking

Yujin Noh,Inho Jake Park,Chigon Hwang

Main category: cs.CV

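TL;DR: This paper proposes SPOT, a training-free, map-guided LLM agent that uses road structure and camera layout information to track vehicle trajectories continuously through the blind spots between multiple cameras.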

Details Motivation: Existing CCTV-based vehicle tracking systems are limited in linking a vehicle's trajectory continuously across cameras; in the blind spots caused by camera spacing and limited fields of view, object ID switching and trajectory loss are especially common. Method: Road waypoints and CCTV locations are organized as queryable document chunks in 2D spatial coordinates. Combining the vehicle's position in world coordinates, its direction, speed, and driving patterns with map spatial information, a beam search at the intersection level predicts the camera region the vehicle is most likely to enter next. Result: Experiments in a virtual city built with the CARLA simulator show that SPOT accurately predicts the camera at which a vehicle first reappears after a blind spot and maintains continuous trajectories more effectively than existing techniques. Conclusion: Without any training, SPOT maintains continuous vehicle trajectories in multi-camera environments, improving the reliability and practicality of video surveillance in complex urban scenes. Abstract: CCTV-based vehicle tracking systems face structural limitations in continuously connecting the trajectories of the same vehicle across multiple camera environments. In particular, blind spots occur due to the intervals between CCTVs and limited Fields of View (FOV), which leads to object ID switching and trajectory loss, thereby reducing the reliability of real-time path prediction. This paper proposes SPOT (Spatial Prediction Over Trajectories), a map-guided LLM agent capable of tracking vehicles even in blind spots of multi-CCTV environments without prior training. The proposed method represents road structures (Waypoints) and CCTV placement information as documents based on 2D spatial coordinates and organizes them through chunking techniques to enable real-time querying and inference. Furthermore, it transforms the vehicle's position into the actual world coordinate system using the relative position and FOV information of objects observed in CCTV images. By combining map spatial information with the vehicle's moving direction, speed, and driving patterns, a beam search is performed at the intersection level to derive candidate CCTV locations where the vehicle is most likely to enter after the blind spot. Experimental results based on the CARLA simulator in a virtual city environment confirmed that the proposed method accurately predicts the next appearing CCTV even in blind spot sections, maintaining continuous vehicle trajectories more effectively than existing techniques.
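
The intersection-level beam search described above can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in rather than SPOT's implementation: the road graph, the node names, the camera IDs, and the transition probabilities (which in the paper would presumably be derived from direction, speed, and driving patterns) are all invented for illustration.

```python
import heapq


def beam_search_next_camera(graph, cameras, start, beam_width=2, max_steps=5):
    """Rank cameras a vehicle may reach from `start`, best score first.

    graph:   {node: [(next_node, transition_probability), ...]}  (hypothetical)
    cameras: {node: camera_id} for road nodes covered by a camera (hypothetical)
    """
    beam = [(1.0, start)]  # (path probability, current node)
    found = {}             # camera_id -> best path probability seen so far
    for _ in range(max_steps):
        candidates = []
        for prob, node in beam:
            for nxt, p in graph.get(node, []):
                score = prob * p
                if nxt in cameras:
                    cam = cameras[nxt]
                    found[cam] = max(found.get(cam, 0.0), score)
                else:
                    candidates.append((score, nxt))
        if not candidates:
            break
        # Keep only the `beam_width` most probable partial paths.
        beam = heapq.nlargest(beam_width, candidates)
    return sorted(found.items(), key=lambda item: item[1], reverse=True)


# Toy road graph: intersection A leads to B or C; cameras sit at D, E, and F.
graph = {
    "A": [("B", 0.7), ("C", 0.3)],
    "B": [("D", 0.6), ("E", 0.4)],
    "C": [("F", 1.0)],
}
cameras = {"D": "cam_2", "E": "cam_3", "F": "cam_4"}
ranking = beam_search_next_camera(graph, cameras, "A")
print([cam for cam, _ in ranking])
```

The beam width bounds how many partial routes are kept per step, which is what keeps the search cheap enough for repeated querying; a width of 1 reduces to a greedy walk.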

[52] XGrid-Mapping: Explicit Implicit Hybrid Grid Submaps for Efficient Incremental Neural LiDAR Mapping

Zeqing Song,Zhongmiao Yan,Junyuan Deng,Songpengcheng Xia,Xiang Mu,Jingyi Xu,Qi Wu,Ling Pei

Main category: cs.CV

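TL;DR: Proposes XGrid-Mapping, a hybrid grid framework combining explicit and implicit representations for efficient large-scale incremental neural LiDAR mapping. A sparse grid provides structural guidance, an implicit dense grid enriches the scene representation, and a distillation-based overlap alignment strategy and a dynamic removal module are introduced, enabling high-quality, efficient real-time mapping.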

Details Motivation: Most existing neural LiDAR mapping methods rely on dense implicit representations and underuse geometric structure, while voxel-based methods struggle to run in real time; an incremental mapping framework that balances efficiency and expressiveness is needed. Method: XGrid-Mapping combines a sparse explicit grid (providing geometric priors) with an implicit dense grid (enriching scene representation), uses the VDB structure and a submap organization to reduce computational cost, applies a distillation-based overlap alignment strategy to keep submaps consistent, and adds a dynamic removal module to improve sampling efficiency and robustness. Result: Experiments show that the method clearly outperforms existing approaches while maintaining high mapping quality, overcomes the efficiency bottleneck of voxel methods, and achieves large-scale real-time incremental mapping. Conclusion: By fusing explicit and implicit representations, XGrid-Mapping strikes a good balance between efficiency and accuracy, advances high-performance neural LiDAR mapping, and suits large-scale autonomous systems. Abstract: Large-scale incremental mapping is fundamental to the development of robust and reliable autonomous systems, as it underpins incremental environmental understanding with sequential inputs for navigation and decision-making. LiDAR is widely used for this purpose due to its accuracy and robustness. Recently, neural LiDAR mapping has shown impressive performance; however, most approaches rely on dense implicit representations and underutilize geometric structure, while existing voxel-guided methods struggle to achieve real-time performance. To address these challenges, we propose XGrid-Mapping, a hybrid grid framework that jointly exploits explicit and implicit representations for efficient neural LiDAR mapping. Specifically, the strategy combines a sparse grid, providing geometric priors and structural guidance, with an implicit dense grid that enriches scene representation. By coupling the VDB structure with a submap-based organization, the framework reduces computational load and enables efficient incremental mapping on a large scale. To mitigate discontinuities across submaps, we introduce a distillation-based overlap alignment strategy, in which preceding submaps supervise subsequent ones to ensure consistency in overlapping regions. To further enhance robustness and sampling efficiency, we incorporate a dynamic removal module. Extensive experiments show that our approach delivers superior mapping quality while overcoming the efficiency limitations of voxel-guided methods, thereby outperforming existing state-of-the-art mapping methods.

[53] X-ray Insights Unleashed: Pioneering the Enhancement of Multi-Label Long-Tail Data

Xinquan Yang,Jinheng Xie,Yawen Huang,Yuexiang Li,Huimin Huang,Hao Zheng,Xian Wu,Yefeng Zheng,Linlin Shen

Main category: cs.CV

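TL;DR: This paper proposes a new data synthesis pipeline that uses abundant normal X-rays to strengthen the representation of tail lesions. A pre-trained diffusion model inpaints the head lesions in diseased X-rays, preserving the tail classes as augmented training data, and a Large Language Model Knowledge Guidance module together with a Progressive Incremental Learning strategy stabilizes inpainting fine-tuning. Comprehensive evaluation on the MIMIC and CheXpert lung datasets shows that the method sets a new performance benchmark.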

Details Motivation: Because rare lesion samples are scarce, existing diffusion-based methods are limited in their generative capability, which leaves diagnostic precision unsatisfactory; a new method that effectively strengthens the representation of tail lesions is needed. Method: A new data synthesis pipeline trains a diffusion model on a large number of normal X-rays to generate normal X-ray images, then uses that model to inpaint the head lesions in diseased X-rays, preserving the tail classes as augmented data; a Large Language Model Knowledge Guidance (LKG) module and a Progressive Incremental Learning (PIL) strategy are introduced to stabilize inpainting fine-tuning. Result: Experiments on the public lung datasets MIMIC and CheXpert show that the method outperforms existing methods and sets a new benchmark. Conclusion: By exploiting normal X-ray samples and an improved diffusion-based inpainting strategy, the method effectively strengthens the data representation of tail lesions and significantly improves diagnostic precision for long-tailed pulmonary anomalies. Abstract: Long-tailed pulmonary anomalies in chest radiography present formidable diagnostic challenges. Despite the recent strides in diffusion-based methods for enhancing the representation of tail lesions, the paucity of rare lesion exemplars curtails the generative capabilities of these approaches, thereby leaving the diagnostic precision less than optimal. In this paper, we propose a novel data synthesis pipeline designed to augment tail lesions utilizing a copious supply of conventional normal X-rays. Specifically, a sufficient quantity of normal samples is amassed to train a diffusion model capable of generating normal X-ray images. This pre-trained diffusion model is subsequently utilized to inpaint the head lesions present in the diseased X-rays, thereby preserving the tail classes as augmented training data. Additionally, we propose the integration of a Large Language Model Knowledge Guidance (LKG) module alongside a Progressive Incremental Learning (PIL) strategy to stabilize the inpainting fine-tuning process. Comprehensive evaluations conducted on the public lung datasets MIMIC and CheXpert demonstrate that the proposed method sets a new benchmark in performance.

[54] PUFM++: Point Cloud Upsampling via Enhanced Flow Matching

Zhi-Song Liu,Chenhang He,Roland Maier,Andreas Rupp

Main category: cs.CV

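TL;DR: PUFM++ is an enhanced flow-matching framework for reconstructing dense, accurate point clouds from sparse, noisy, and partial observations, with improvements in geometric fidelity, robustness, and consistency with downstream tasks.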

Details Motivation: Existing methods struggle to guarantee geometric accuracy and support for downstream tasks when the input is incomplete, noisy, or sparse, so a more robust, higher-fidelity point cloud upsampling method is needed. Method: PUFM++ uses a two-stage flow-matching strategy: the first stage learns a direct, straight-path flow from sparse inputs to dense targets, and the second refines it using noise-perturbed samples. A data-driven adaptive time scheduler improves sampling efficiency, on-manifold constraints imposed during sampling keep points aligned with the surface, and a recurrent interface network (RIN) strengthens hierarchical feature interactions. Result: Experiments on synthetic benchmarks and real-world scans show that PUFM++ reaches state-of-the-art quantitative accuracy and visual quality, significantly outperforming existing methods. Conclusion: By improving both the training and sampling of flow matching, PUFM++ achieves higher fidelity, robustness, and practicality in point cloud upsampling and provides more reliable support for surface-based downstream tasks. Abstract: Recent advances in generative modeling have demonstrated strong promise for high-quality point cloud upsampling. In this work, we present PUFM++, an enhanced flow-matching framework for reconstructing dense and accurate point clouds from sparse, noisy, and partial observations. PUFM++ improves flow matching along three key axes: (i) geometric fidelity, (ii) robustness to imperfect input, and (iii) consistency with downstream surface-based tasks. We introduce a two-stage flow-matching strategy that first learns a direct, straight-path flow from sparse inputs to dense targets, and then refines it using noise-perturbed samples to better approximate the terminal marginal distribution. To accelerate and stabilize inference, we propose a data-driven adaptive time scheduler that improves sampling efficiency based on interpolation behavior. We further impose on-manifold constraints during sampling to ensure that generated points remain aligned with the underlying surface. Finally, we incorporate a recurrent interface network (RIN) to strengthen hierarchical feature interactions and boost reconstruction quality. Extensive experiments on synthetic benchmarks and real-world scans show that PUFM++ sets a new state of the art in point cloud upsampling, delivering superior visual fidelity and quantitative accuracy across a wide range of tasks. Code and pretrained models are publicly available at https://github.com/Holmes-Alan/Enhanced_PUFM.
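
To make the "straight-path flow" idea concrete, here is a minimal sketch of the generic flow-matching construction for a single 3D point. It is not the PUFM++ code: the function names are hypothetical, the `oracle` lambda stands in for a trained velocity network, and the fixed-step Euler loop is a placeholder for the adaptive time scheduler mentioned in the abstract.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Constant velocity of the straight path; the usual regression target."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


sparse_point = [0.0, 0.0, 0.0]   # hypothetical source sample
dense_point = [1.0, 2.0, -1.0]   # hypothetical target sample

# Stand-in for a trained network: it returns the exact straight-path velocity.
oracle = lambda x, t: target_velocity(sparse_point, dense_point)

midpoint = interpolate(sparse_point, dense_point, 0.5)
sampled = euler_sample(sparse_point, oracle)
print(midpoint, sampled)
```

Because the velocity along a straight path is constant, even a few Euler steps land on the target when the velocity is predicted well, which is the usual argument for why straight paths make sampling cheap.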

[55] MVInverse: Feed-forward Multi-view Inverse Rendering in Seconds

Xiangzuo Wu,Chengwei Ren,Jun Zhou,Xiu Li,Yuan Liu

Main category: cs.CV

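TL;DR: Proposes a feed-forward multi-view inverse rendering framework that recovers geometry, materials, and illumination consistently through alternating cross-view attention, and uses a consistency-based finetuning strategy to improve generalization to real-world scenes.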

Details Motivation: Existing single-view methods ignore cross-view relationships and produce inconsistent results, while multi-view optimization methods depend on slow differentiable rendering and per-scene optimization, which is computationally expensive and hard to scale. Method: A feed-forward network directly predicts spatially varying albedo, metallic, roughness, diffuse shading, and surface normals from sequences of RGB images; alternating attention across views captures intra-view long-range lighting interactions and inter-view material consistency. Result: Experiments on benchmark datasets show state-of-the-art multi-view consistency, material and normal estimation quality, and generalization to real images. Conclusion: The method achieves efficient, consistent multi-view inverse rendering, and finetuning on unlabeled real-world videos further improves robustness and consistency in practice. Abstract: Multi-view inverse rendering aims to recover geometry, materials, and illumination consistently across multiple viewpoints. When applied to multi-view images, existing single-view approaches often ignore cross-view relationships, leading to inconsistent results. In contrast, multi-view optimization methods rely on slow differentiable rendering and per-scene refinement, making them computationally expensive and hard to scale. To address these limitations, we introduce a feed-forward multi-view inverse rendering framework that directly predicts spatially varying albedo, metallic, roughness, diffuse shading, and surface normals from sequences of RGB images. By alternating attention across views, our model captures both intra-view long-range lighting interactions and inter-view material consistency, enabling coherent scene-level reasoning within a single forward pass. Due to the scarcity of real-world training data, models trained on existing synthetic datasets often struggle to generalize to real-world scenes. To overcome this limitation, we propose a consistency-based finetuning strategy that leverages unlabeled real-world videos to enhance both multi-view coherence and robustness under in-the-wild conditions. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance in terms of multi-view consistency, material and normal estimation quality, and generalization to real-world imagery.

[56] Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations

Jinghan Li,Yang Jin,Hao Jiang,Yadong Mu,Yang Song,Kun Xu

Main category: cs.CV

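TL;DR: This paper proposes NExT-Vid, a new autoregressive visual generative pretraining framework that jointly models images and videos through masked next-frame prediction, improving visual representation learning.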

Details Motivation: Existing autoregressive visual pretraining methods suffer from inaccurate semantic localization and poor generation quality, and most visual methods still rely on masked modeling that ignores temporal information. Method: The NExT-Vid framework uses a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and a conditioned flow-matching decoder to improve generation quality and diversity. Result: Experiments on large-scale pretrained models show that the method consistently outperforms previous generative pretraining methods on downstream classification via attentive probing. Conclusion: Through context-isolated flow-matching pretraining, NExT-Vid improves joint modeling of images and videos and strengthens visual representation learning. Abstract: Recent advances in pretraining general foundation models have significantly improved performance across diverse downstream tasks. While autoregressive (AR) generative models like GPT have revolutionized NLP, most visual generative pretraining methods still rely on BERT-style masked modeling, which often disregards the temporal information essential for video analysis. The few existing autoregressive visual pretraining methods suffer from issues such as inaccurate semantic localization and poor generation quality, leading to poor semantics. In this work, we propose NExT-Vid, a novel autoregressive visual generative pretraining framework that utilizes masked next-frame prediction to jointly model images and videos. NExT-Vid introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and a conditioned flow-matching decoder to enhance generation quality and diversity. Through context-isolated flow-matching pretraining, our approach achieves strong representations. Extensive experiments on large-scale pretrained models demonstrate that our proposed method consistently outperforms previous generative pretraining methods for visual representation learning via attentive probing in downstream classification.

[57] Granular-ball Guided Masking: Structure-aware Data Augmentation

Shuyin Xia,Fan Chen,Dawei Dai,Meng Yang,Junwei Han,Xinbo Gao,Guoyin Wang

Main category: cs.CV

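TL;DR: Proposes GBGM, a structure-aware masking augmentation method based on granular-ball computing that uses a coarse-to-fine hierarchical masking strategy to preserve important semantic regions while improving model robustness and performance.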

Details Motivation: Existing mask-based data augmentation methods lack structural awareness and can discard key semantic information, hurting model performance. Method: Granular-ball Computing is used for structural analysis, and an adaptive hierarchical masking strategy (GBGM) preserves important regions and suppresses redundant ones at different granularities. Result: Experiments on multiple benchmarks show that GBGM consistently improves image classification accuracy and masked image reconstruction, and applies to different architectures such as CNNs and Vision Transformers. Conclusion: GBGM is a simple, general, and effective structure-aware data augmentation method that offers a new approach to improving model robustness. Abstract: Deep learning models have achieved remarkable success in computer vision, but they still rely heavily on large-scale labeled data and tend to overfit when data are limited or distributions shift. Data augmentation, particularly mask-based information dropping, can enhance robustness by forcing models to explore complementary cues; however, existing approaches often lack structural awareness and may discard essential semantics. We propose Granular-ball Guided Masking (GBGM), a structure-aware augmentation strategy guided by Granular-ball Computing (GBC). GBGM adaptively preserves semantically rich, structurally important regions while suppressing redundant areas through a coarse-to-fine hierarchical masking process, producing augmentations that are both representative and discriminative. Extensive experiments on multiple benchmarks demonstrate consistent improvements in classification accuracy and masked image reconstruction, confirming the effectiveness and broad applicability of the proposed method. Simple and model-agnostic, it integrates seamlessly into CNNs and Vision Transformers and provides a new paradigm for structure-aware data augmentation.

[58] FluencyVE: Marrying Temporal-Aware Mamba with Bypass Attention for Video Editing

Mingshu Cai,Yixuan Li,Osamu Yoshie,Yuya Ieiri

Main category: cs.CV

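TL;DR: Proposes FluencyVE, an efficient Mamba-based one-shot video editing method that replaces the temporal attention mechanism, enabling global frame-level attention at lower computational cost.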

Details Motivation: Existing video editing methods built on pretrained text-to-image models suffer from temporal inconsistency and high computational overhead, which makes them hard to extend effectively to video editing. Method: The linear time-series module Mamba is integrated into a Stable Diffusion-based video editing model in place of the temporal attention layer; low-rank approximation matrices replace the query and key weight matrices in the causal attention, and a weighted averaging technique updates the attention scores during training. Result: The method performs well on attribute, subject, and location editing in real-world videos, significantly reducing the computational burden while preserving generative capability. Conclusion: FluencyVE is a simple yet effective video editing method that achieves efficient, temporally consistent video editing by introducing the Mamba architecture. Abstract: Large-scale text-to-image diffusion models have achieved unprecedented success in image generation and editing. However, extending this success to video editing remains challenging. Recent video editing efforts have adapted pretrained text-to-image models by adding temporal attention mechanisms to handle video tasks. Unfortunately, these methods continue to suffer from temporal inconsistency issues and high computational overheads. In this study, we propose FluencyVE, which is a simple yet effective one-shot video editing approach. FluencyVE integrates the linear time-series module, Mamba, into a video editing model based on pretrained Stable Diffusion models, replacing the temporal attention layer. This enables global frame-level attention while reducing the computational costs. In addition, we employ low-rank approximation matrices to replace the query and key weight matrices in the causal attention, and use a weighted averaging technique during training to update the attention scores. This approach largely preserves the generative power of the text-to-image model while effectively reducing the computational burden. Experiments and analyses demonstrate promising results in editing various attributes, subjects, and locations in real-world videos.

[59] Efficient and Robust Video Defense Framework against 3D-field Personalized Talking Face

Rui-qing Sun,Xingshan Yao,Tian Lan,Hui-Yang Zhao,Jia-Ling Shi,Chen-Hao Cui,Zhijing Wu,Chen Yang,Xian-Ling Mao

Main category: cs.CV

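TL;DR: Proposes an efficient defense framework against 3D-field video-referenced talking face generation that protects portrait videos by perturbing the 3D information acquisition process while maintaining high fidelity.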

Details Motivation: Existing image-based defenses are computationally expensive, severely degrade video quality, and fail to disrupt 3D information; no effective defense framework exists to prevent malicious misuse of personal portrait videos by 3D-field TFG techniques. Method: A similarity-guided parameter sharing mechanism improves computational efficiency, and a multi-scale dual-domain attention module jointly optimizes spatial-frequency perturbations. Result: Experiments show strong defense capability, a 47x acceleration over the fastest baseline, high-fidelity video quality, and robustness to scaling operations and state-of-the-art purification attacks. Conclusion: The method defends efficiently against 3D-field TFG techniques while protecting privacy, offering a practical and secure solution for protecting portrait videos. Abstract: State-of-the-art 3D-field video-referenced Talking Face Generation (TFG) methods synthesize high-fidelity personalized talking-face videos in real time by modeling 3D geometry and appearance from reference portrait video. This capability raises significant privacy concerns regarding malicious misuse of personal portraits. However, no efficient defense framework exists to protect such videos against 3D-field TFG methods. While image-based defenses could apply per-frame 2D perturbations, they incur prohibitive computational costs and severe video quality degradation, and they fail to disrupt 3D information for video protection. To address this, we propose a novel and efficient video defense framework against 3D-field TFG methods, which protects portrait video by perturbing the 3D information acquisition process while maintaining high-fidelity video quality. Specifically, our method introduces: (1) a similarity-guided parameter sharing mechanism for computational efficiency, and (2) a multi-scale dual-domain attention module to jointly optimize spatial-frequency perturbations. Extensive experiments demonstrate that our proposed framework exhibits strong defense capability and achieves a 47x acceleration over the fastest baseline while maintaining high fidelity. Moreover, it remains robust against scaling operations and state-of-the-art purification attacks, and the effectiveness of our design choices is further validated through ablation studies. Our project is available at https://github.com/Richen7418/VDF.

[60] Multi-Attribute guided Thermal Face Image Translation based on Latent Diffusion Model

Mingshu Cai,Osamu Yoshie,Yuya Ieiri

Main category: cs.CV

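TL;DR: This paper proposes a latent diffusion-based method for generating visible-light face images from infrared images; combining a multi-attribute classifier with a Self-attn Mamba module, it improves image quality and identity preservation for cross-modal face recognition.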

Details Motivation: Because of the significant domain shift between infrared and visible-light images, existing face recognition models degrade severely on infrared images, and traditional generative methods tend to cause feature loss and distortion. Method: A latent diffusion-based model incorporates a multi-attribute classifier to extract key facial attributes and a Self-attn Mamba module to enhance global modeling of cross-modal features and speed up inference. Result: Experiments on two benchmark datasets show state-of-the-art performance in both image quality and identity preservation. Conclusion: The method mitigates feature loss and distortion in infrared-to-visible face translation and significantly improves the accuracy of heterogeneous face recognition and the quality of generated images. Abstract: Modern surveillance systems increasingly rely on multi-wavelength sensors and deep neural networks to recognize faces in infrared images captured at night. However, most facial recognition models are trained on visible light datasets, leading to substantial performance degradation on infrared inputs due to significant domain shifts. Early feature-based methods for infrared face recognition proved ineffective, prompting researchers to adopt generative approaches that convert infrared images into visible light images for improved recognition. This paradigm, known as Heterogeneous Face Recognition (HFR), faces challenges such as model and modality discrepancies, leading to distortion and feature loss in generated images. To address these limitations, this paper introduces a novel latent diffusion-based model designed to generate high-quality visible face images from thermal inputs while preserving critical identity features. A multi-attribute classifier is incorporated to extract key facial attributes from visible images, mitigating feature loss during infrared-to-visible image restoration. Additionally, we propose the Self-attn Mamba module, which enhances global modeling of cross-modal features and significantly improves inference speed. Experimental results on two benchmark datasets demonstrate the superiority of our approach, achieving state-of-the-art performance in both image quality and identity preservation.

[61] Next-Scale Prediction: A Self-Supervised Approach for Real-World Image Denoising

Yiwen Shan,Haiyu Zhao,Peng Hu,Xi Peng,Yuanbiao Gou

Main category: cs.CV

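TL;DR: This paper proposes Next-Scale Prediction (NSP), a new self-supervised image denoising method that constructs cross-scale training pairs to decouple noise decorrelation from detail preservation, resolving the conflict in conventional blind-spot networks between denoising and retaining high-frequency detail, and naturally supporting super-resolution of noisy images.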

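Details Motivation: Existing blind-spot network methods based on pixel-shuffle downsampling struggle to balance noise decorrelation against detail preservation: aggressive downsampling destroys details, while mild downsampling fails to remove correlated noise. A new method is needed to break this long-standing trade-off. Method: The Next-Scale Prediction (NSP) framework feeds low-resolution, fully decorrelated sub-images to a blind-spot network trained to predict high-resolution images that retain high-frequency detail, constructing cross-scale self-supervised training pairs that separate noise suppression from detail preservation. Result: Experiments show that NSP achieves state-of-the-art self-supervised performance on real-world denoising benchmarks, significantly alleviates the conflict between noise decorrelation and detail preservation, and naturally enables super-resolution of noisy images. Conclusion: NSP offers an effective new paradigm for self-supervised real-world image denoising: it decouples noise removal from detail preservation, performs well on multiple real-world datasets, and supports super-resolution without retraining. Abstract: Self-supervised real-world image denoising remains a fundamental challenge, arising from the antagonistic trade-off between decorrelating spatially structured noise and preserving high-frequency details. Existing blind-spot network (BSN) methods rely on pixel-shuffle downsampling (PD) to decorrelate noise, but aggressive downsampling fragments fine structures, while milder downsampling fails to remove correlated noise. To address this, we introduce Next-Scale Prediction (NSP), a novel self-supervised paradigm that decouples noise decorrelation from detail preservation. NSP constructs cross-scale training pairs, where BSN takes low-resolution, fully decorrelated sub-images as input to predict high-resolution targets that retain fine details. As a by-product, NSP naturally supports super-resolution of noisy images without retraining or modification. Extensive experiments demonstrate that NSP achieves state-of-the-art self-supervised denoising performance on real-world benchmarks, significantly alleviating the long-standing conflict between noise decorrelation and detail preservation.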
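
Pixel-shuffle downsampling (PD), which the abstract names as the standard decorrelation step, is easy to show on a toy image. The sketch below is a plain-Python illustration with a hypothetical function name; it covers only the PD step, not the NSP training procedure.

```python
def pixel_shuffle_downsample(image, stride):
    """Split a 2D image (list of rows) into stride * stride sub-images.

    Sub-image (dy, dx) keeps the pixels image[y][x] with
    y % stride == dy and x % stride == dx, so neighbouring pixels in a
    sub-image were `stride` pixels apart in the original image.
    """
    height, width = len(image), len(image[0])
    if height % stride or width % stride:
        raise ValueError("image size must be divisible by stride")
    subs = {}
    for dy in range(stride):
        for dx in range(stride):
            subs[(dy, dx)] = [row[dx::stride] for row in image[dy::stride]]
    return subs


# 4x4 toy image whose pixel values equal their row-major index.
image = [[y * 4 + x for x in range(4)] for y in range(4)]
subs = pixel_shuffle_downsample(image, 2)
print(subs[(0, 0)], subs[(1, 1)])
```

A larger stride pulls neighbouring samples further apart, which weakens spatial noise correlation but also breaks up fine structures; that is the trade-off the abstract describes.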

[62] A Large-Depth-Range Layer-Based Hologram Dataset for Machine Learning-Based 3D Computer-Generated Holography

Jaehong Lee,You Chan No,YoungWoo Kim,Duksu Kim

Main category: cs.CV

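TL;DR: This paper presents KOREATECH-CGH, a large public hologram dataset of 6,000 pairs of RGB-D images and complex holograms, and introduces amplitude projection, a post-processing technique that improves reconstruction quality at large depth ranges; experiments on ML-based hologram generation and super-resolution validate its usefulness.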

Details Motivation: The lack of high-quality, large-scale hologram datasets constrains machine learning-based computer-generated holography (ML-CGH), so a publicly available dataset covering a wide range of 3D scenes is needed to advance the field. Method: The KOREATECH-CGH dataset contains 6,000 paired RGB-D images and complex holograms at multiple resolutions (256×256 to 2048×2048); the proposed amplitude projection replaces the amplitude component of the hologram wavefield at each depth layer while preserving phase to improve reconstruction quality; and hologram generation and super-resolution experiments with state-of-the-art ML models validate the dataset's utility. Result: Amplitude projection reaches 27.01 dB PSNR and 0.87 SSIM, improvements of 2.03 dB and 0.04 over an existing method; experiments confirm that KOREATECH-CGH is effective for training and evaluating next-generation ML-CGH systems. Conclusion: KOREATECH-CGH provides a high-quality, diverse benchmark dataset for ML-CGH; combined with amplitude projection, it significantly improves hologram reconstruction quality for large-depth scenes and supports standardization and further progress in the field. Abstract: Machine learning-based computer-generated holography (ML-CGH) has advanced rapidly in recent years, yet progress is constrained by the limited availability of high-quality, large-scale hologram datasets. To address this, we present KOREATECH-CGH, a publicly available dataset comprising 6,000 pairs of RGB-D images and complex holograms across resolutions ranging from 256×256 to 2048×2048, with depth ranges extending to the theoretical limits of the angular spectrum method for wide 3D scene coverage. To improve hologram quality at large depth ranges, we introduce amplitude projection, a post-processing technique that replaces amplitude components of hologram wavefields at each depth layer while preserving phase. This approach enhances reconstruction fidelity, achieving 27.01 dB PSNR and 0.87 SSIM, surpassing a recent optimized silhouette-masking layer-based method by 2.03 dB and 0.04 SSIM, respectively. We further validate the utility of KOREATECH-CGH through experiments on hologram generation and super-resolution using state-of-the-art ML models, confirming its applicability for training and evaluating next-generation ML-CGH systems.
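
The core operation of amplitude projection, replacing amplitude while preserving phase, together with the PSNR metric quoted above, can be sketched as follows. This is an illustrative reading of the one-sentence description in the abstract, not the authors' pipeline: the function names, the toy 2×2 wavefield, and the target amplitudes are hypothetical, and propagation between depth layers is omitted.

```python
import cmath
import math


def amplitude_projection(field, target_amplitude):
    """Replace the amplitude of every complex sample while keeping its phase."""
    return [
        [amp * cmath.exp(1j * cmath.phase(value)) for value, amp in zip(f_row, a_row)]
        for f_row, a_row in zip(field, target_amplitude)
    ]


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized 2D real arrays."""
    flat_ref = [v for row in reference for v in row]
    flat_est = [v for row in estimate for v in row]
    mse = sum((r - e) ** 2 for r, e in zip(flat_ref, flat_est)) / len(flat_ref)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


# Hypothetical 2x2 wavefield at one depth layer and the desired amplitudes.
field = [[1 + 1j, -2j], [0.5 + 0j, -3 + 4j]]
target = [[1.0, 0.5], [0.25, 0.75]]
projected = amplitude_projection(field, target)

print([[round(abs(v), 3) for v in row] for row in projected])
print(psnr([[0.0, 1.0]], [[0.0, 0.9]]))
```

In a layer-based method this projection would presumably be applied once per depth layer after propagating the wavefield to that layer; that loop is left out here because the abstract does not specify it.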

[63] Matrix Completion Via Reweighted Logarithmic Norm Minimization

Zhijie Wang,Liangtian He,Qinghua Zhang,Jifei Miao,Liang-Jian Deng,Jun Liu

Main category: cs.CV

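TL;DR: Proposes a new reweighted logarithmic norm as a more effective nonconvex surrogate for low-rank matrix completion, solved with ADMM, and reports better image inpainting performance than existing methods.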

Details Motivation: The nuclear norm, used as a convex approximation of the rank function in low-rank matrix completion, over-shrinks singular values and yields suboptimal solutions, so a more accurate nonconvex surrogate is needed. Method: A new reweighted logarithmic norm is designed as a nonconvex surrogate for the rank, and the alternating direction method of multipliers (ADMM) is used to solve the resulting optimization problem efficiently. Result: In image inpainting experiments, the proposed method outperforms state-of-the-art low-rank matrix completion methods in both visual quality and quantitative metrics. Conclusion: The reweighted logarithmic norm approximates the low-rank property of a matrix more accurately and, combined with the ADMM solver, shows superior performance in practice. Abstract: Low-rank matrix completion (LRMC) has demonstrated remarkable success in a wide range of applications. To address the NP-hard nature of the rank minimization problem, the nuclear norm is commonly used as a convex and computationally tractable surrogate for the rank function. However, this approach often yields suboptimal solutions due to the excessive shrinkage of singular values. In this letter, we propose a novel reweighted logarithmic norm as a more effective nonconvex surrogate, which provides a closer approximation than many existing alternatives. We efficiently solve the resulting optimization problem by employing the alternating direction method of multipliers (ADMM). Experimental results on image inpainting demonstrate that the proposed method achieves superior performance compared to state-of-the-art LRMC approaches, both in terms of visual quality and quantitative metrics.
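The "excessive shrinkage" argument can be illustrated directly on a list of singular values. The sketch below compares the uniform soft-thresholding associated with the nuclear norm against a reweighted threshold with weights 1 / (σ + ε), which is the usual choice in iteratively reweighted schemes for a log-sum penalty. The abstract does not give the exact form of the proposed reweighted logarithmic norm, so these weights, the value of `eps`, and the function names are assumptions, and the ADMM loop around this step is not shown.

```python
def soft_threshold(singular_values, tau):
    """Nuclear-norm shrinkage: subtract the same tau from every value."""
    return [max(s - tau, 0.0) for s in singular_values]


def reweighted_threshold(singular_values, tau, eps=1e-3):
    """Log-surrogate style shrinkage: weight 1 / (s + eps) per value.

    Large singular values get a small weight and are barely shrunk;
    small ones get a large weight and are pushed to zero.
    """
    return [max(s - tau / (s + eps), 0.0) for s in singular_values]


# Hypothetical singular values: two dominant components and two small ones.
sigma = [10.0, 5.0, 0.5, 0.1]
uniform = soft_threshold(sigma, 1.0)
weighted = reweighted_threshold(sigma, 1.0)
print(uniform)
print(weighted)
```

In this toy case both rules zero out the two small values, but the reweighted rule leaves the dominant values much closer to their original size, which is the behaviour a nonconvex surrogate is meant to provide.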

[64] Optical Flow-Guided 6DoF Object Pose Tracking with an Event Camera

Zibin Liu,Banglei Guan,Yang Shang,Shunkun Liang,Zhenbao Yu,Qifeng Yu

Main category: cs.CV

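TL;DR: Proposes an optical flow-guided 6DoF object pose tracking method for event cameras that combines 2D-3D hybrid feature extraction with optical-flow-based association and optimization, outperforming existing methods in accuracy and robustness.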

Details Motivation: Traditional cameras face motion blur, noise, occlusion, and lighting changes in pose tracking; event cameras are promising but need effective algorithms. Method: A 2D-3D hybrid feature extraction strategy detects corners and edges in the event stream; corner optical flow is found by maximizing the event-associated probability within a spatio-temporal window; the flow then guides the association between corners and edges, and the 6DoF pose is iteratively optimized by minimizing point-to-edge distances. Result: The method is validated on simulated and real event data and achieves higher accuracy and robustness than existing event-camera methods. Conclusion: The proposed method exploits the strengths of event cameras to deliver accurate, robust, continuous 6DoF object pose tracking. Abstract: Object pose tracking is one of the pivotal technologies in multimedia, attracting ever-growing attention in recent years. Existing methods employing traditional cameras encounter numerous challenges such as motion blur, sensor noise, partial occlusion, and changing lighting conditions. The emerging bio-inspired sensors, particularly event cameras, possess advantages such as high dynamic range and low latency, which hold the potential to address the aforementioned challenges. In this work, we present an optical flow-guided 6DoF object pose tracking method with an event camera. A 2D-3D hybrid feature extraction strategy is first utilized to detect corners and edges from events and object models, which characterizes object motion precisely. Then, we search for the optical flow of corners by maximizing the event-associated probability within a spatio-temporal window, and establish the correlation between corners and edges guided by optical flow. Furthermore, by minimizing the distances between corners and edges, the 6DoF object pose is iteratively optimized to achieve continuous pose tracking. Experimental results on both simulated and real events demonstrate that our method outperforms event-based state-of-the-art methods in terms of both accuracy and robustness.

[65] DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors

Kaustubh Kundu,Hrishav Bakul Barua,Lucy Robertson-Bell,Zhixi Cai,Kalin Stefanov

Main category: cs.CV

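TL;DR: This paper proposes DexAvatar, a new framework that reconstructs biomechanically accurate fine-grained hand articulations and body movements from monocular sign language videos, significantly improving hand and body pose estimation.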

Details Motivation: Most existing sign language datasets are video-based and lack accurate 3D information, and current state-of-the-art 3D human pose estimation methods are vulnerable to self-occlusion, noise, and motion blur on sign language videos, which leads to poor reconstruction quality. Method: DexAvatar uses learned 3D hand and body priors to guide the reconstruction of fine-grained hand articulations and body movements from in-the-wild monocular sign language videos. Result: On the SGNify motion capture dataset, DexAvatar improves body and hand pose estimation by 35.11% over the state of the art. Conclusion: DexAvatar effectively improves the accuracy of 3D pose reconstruction from monocular sign language videos and provides high-quality data support for sign language generation. Abstract: The trend in sign language generation is centered around data-driven generative methods that require vast amounts of precise 2D and 3D human pose data to achieve an acceptable generation quality. However, currently, most sign language datasets are video-based and limited to automatically reconstructed 2D human poses (i.e., keypoints) and lack accurate 3D information. Furthermore, the existing state of the art for automatic 3D human pose estimation from sign language videos is prone to self-occlusion, noise, and motion blur effects, resulting in poor reconstruction quality. In response to this, we introduce DexAvatar, a novel framework to reconstruct bio-mechanically accurate fine-grained hand articulations and body movements from in-the-wild monocular sign language videos, guided by learned 3D hand and body priors. DexAvatar achieves strong performance on the SGNify motion capture dataset, the only benchmark available for this task, reaching an improvement of 35.11% in the estimation of body and hand poses compared to the state of the art. The official website of this work is: https://github.com/kaustesseract/DexAvatar.

[66] Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control

Minghao Han,YiChen Liu,Yizhou Liu,Zizhi Chen,Jingqun Tang,Xuecheng Wu,Dingkang Yang,Lihua Zhang

Main category: cs.CV

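TL;DR: This paper proposes UniPath, a semantics-driven pathology image generation framework that achieves fine-grained, controllable generation through a multi-stream control mechanism, and builds a large-scale dataset and evaluation hierarchy, significantly improving generation quality and semantic consistency.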

Details Motivation: Existing generative models in pathology merely simulate pixels, lack fine-grained semantic control, and are constrained by data scarcity and terminological heterogeneity, whereas understanding models already reach diagnostic-level competence; the two need to be bridged to enable controllable generation. Method: Multi-Stream Control is proposed, comprising a raw-text stream, a high-level semantics stream (which extracts Diagnostic Semantic Tokens using a frozen pathology MLLM), and a prototype stream (which provides morphology-level control via a prototype bank); a large-scale dataset of 2.65M image-text pairs and a finely annotated 68K high-quality subset are curated; and a four-tier evaluation hierarchy is established. Result: UniPath reaches a Patho-FID of 80.9 (51% better than the second-best), and its fine-grained semantic control reaches 98.7% of the real-image level, significantly outperforming existing methods. Conclusion: UniPath brings mature pathology understanding into generative models, achieving high-fidelity, controllable pathology image generation and advancing the unification of understanding and generation in computational pathology. Abstract: In computational pathology, understanding and generation have evolved along disparate paths: advanced understanding models already exhibit diagnostic-level competence, whereas generative models largely simulate pixels. Progress remains hindered by three coupled factors: the scarcity of large, high-quality image-text corpora; the lack of precise, fine-grained semantic control, which forces reliance on non-semantic cues; and terminological heterogeneity, where diverse phrasings for the same diagnostic concept impede reliable text conditioning. We introduce UniPath, a semantics-driven pathology image generation framework that leverages mature diagnostic understanding to enable controllable generation. UniPath implements Multi-Stream Control: a Raw-Text stream; a High-Level Semantics stream that uses learnable queries to a frozen pathology MLLM to distill paraphrase-robust Diagnostic Semantic Tokens and to expand prompts into diagnosis-aware attribute bundles; and a Prototype stream that affords component-level morphological control via a prototype bank. On the data front, we curate a 2.65M image-text corpus and a finely annotated, high-quality 68K subset to alleviate data scarcity. For a comprehensive assessment, we establish a four-tier evaluation hierarchy tailored to pathology. Extensive experiments demonstrate UniPath's SOTA performance, including a Patho-FID of 80.9 (51% better than the second-best) and fine-grained semantic control achieving 98.7% of the real-image.
The meticulously curated datasets, complete source code, and pre-trained model weights developed in this study will be made openly accessible to the public.

[67] Multimodal Skeleton-Based Action Representation Learning via Decomposition and Composition

Hongsong Wang,Heng Fei,Bingxuan Dai,Jie Gui

Main category: cs.CV

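TL;DR: Proposes Decomposition and Composition, a self-supervised multimodal skeleton-based action representation learning framework whose decomposition and composition strategies balance efficiency and performance in multimodal action recognition.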

Details Motivation: Existing methods for multimodal human action understanding struggle to achieve both model efficiency and performance: simple late fusion is computationally expensive, while early fusion underperforms. Method: A Decomposition strategy decomposes the fused multimodal features into unimodal features and aligns them with the ground-truth unimodal features; a Composition strategy integrates multiple unimodal features and uses them as a self-supervised signal to strengthen multimodal representation learning. Result: Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets show that the method strikes a good balance between computational cost and model performance. Conclusion: The proposed method effectively resolves the trade-off between efficiency and effectiveness in multimodal action recognition and outperforms conventional fusion strategies. Abstract: Multimodal human action understanding is a significant problem in computer vision, with the central challenge being the effective utilization of the complementarity among diverse modalities while maintaining model efficiency. However, most existing methods rely on simple late fusion to enhance performance, which results in substantial computational overhead. Although early fusion with a shared backbone for all modalities is efficient, it struggles to achieve excellent performance. To address the dilemma of balancing efficiency and effectiveness, we introduce a self-supervised multimodal skeleton-based action representation learning framework, named Decomposition and Composition. The Decomposition strategy meticulously decomposes the fused multimodal features into distinct unimodal features, subsequently aligning them with their respective ground truth unimodal counterparts. On the other hand, the Composition strategy integrates multiple unimodal features, leveraging them as self-supervised guidance to enhance the learning of multimodal representations. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets demonstrate that the proposed method strikes an excellent balance between computational cost and model performance.

[68] UniPR-3D: Towards Universal Visual Place Recognition with Visual Geometry Grounded Transformer

Tianchen Deng,Xun Chen,Ziming Li,Hongming Shen,Danwei Wang,Javier Civera,Hesheng Wang

Main category: cs.CV

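TL;DR: This paper presents UniPR-3D, the first visual place recognition (VPR) architecture that effectively integrates multi-view information. Built on a VGGT backbone, it combines 2D and 3D feature representations through dedicated aggregation modules and a multi-frame strategy, setting a new state of the art in both performance and cross-environment generalization.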

Details Motivation: VPR has traditionally been treated as a single-image retrieval task; multiple views offer advantages but remain underexplored, and existing methods generalize poorly across environments, so a VPR framework that effectively integrates multi-view information and generalizes well is needed. Method: UniPR-3D builds on a VGGT backbone, uses the 3D tokens and intermediate 2D tokens it produces, designs dedicated aggregation modules for 2D and 3D features, and combines single- and multi-frame aggregation with a variable-length sequence retrieval strategy to construct the final descriptor. Result: Experiments show that UniPR-3D outperforms existing methods on both single-view and multi-view benchmarks, markedly improving VPR performance and cross-environment generalization and confirming the effectiveness of geometry-grounded tokens. Conclusion: UniPR-3D is the first VPR method to successfully integrate multi-view 3D representations; by jointly exploiting 2D texture details and 3D geometric information with flexible aggregation strategies, it achieves more robust and general visual place recognition. Abstract: Visual Place Recognition (VPR) has been traditionally formulated as a single-image retrieval task. Using multiple views offers clear advantages, yet this setting remains relatively underexplored and existing methods often struggle to generalize across diverse environments. In this work we introduce UniPR-3D, the first VPR architecture that effectively integrates information from multiple views. UniPR-3D builds on a VGGT backbone capable of encoding multi-view 3D representations, which we adapt by designing feature aggregators and fine-tune for the place recognition task. To construct our descriptor, we jointly leverage the 3D tokens and intermediate 2D tokens produced by VGGT. Based on their distinct characteristics, we design dedicated aggregation modules for 2D and 3D features, allowing our descriptor to capture fine-grained texture cues while also reasoning across viewpoints. To further enhance generalization, we incorporate both single- and multi-frame aggregation schemes, along with a variable-length sequence retrieval strategy. Our experiments show that UniPR-3D sets a new state of the art, outperforming both single- and multi-view baselines and highlighting the effectiveness of geometry-grounded tokens for VPR. Our code and models will be made publicly available on GitHub: https://github.com/dtc111111/UniPR-3D.

[69] Hierarchical Modeling Approach to Fast and Accurate Table Recognition

Takaya Kawakatsu

Main category: cs.CV

TL;DR: Proposes a multi-task model that uses non-causal attention and a parallel inference algorithm for efficient table recognition.

Details Motivation: Existing table recognition models perform well, but their inference is slow and their effectiveness has not been fully explained. Method: Non-causal attention is used to capture the entire table structure, and a parallel inference algorithm is designed to speed up cell content recognition. Result: On two large public datasets, the new model demonstrates its superiority both visually and statistically. Conclusion: The proposed method substantially speeds up inference while maintaining high accuracy, offering a more efficient solution for recognizing tables in documents. Abstract: The extraction and use of diverse knowledge from numerous documents is a pressing challenge in intelligent information retrieval. Documents contain elements that require different recognition methods. Table recognition typically consists of three subtasks, namely table structure, cell position and cell content recognition. Recent models have achieved excellent recognition with a combination of multi-task learning, local attention, and mutual learning. However, their effectiveness has not been fully explained, and they require a long period of time for inference. This paper presents a novel multi-task model that utilizes non-causal attention to capture the entire table structure, and a parallel inference algorithm for faster cell content inference. The superiority is demonstrated both visually and statistically on two large public datasets.
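The difference between causal and non-causal attention comes down to which positions each token is allowed to look at. The sketch below builds both masks for a short sequence; it shows only the general masking idea, and the function name `attention_mask` and the boolean-matrix format are assumptions rather than anything taken from the paper's model.

```python
def attention_mask(num_tokens, causal):
    """Return mask[i][j] == True when token i may attend to token j."""
    return [
        [(j <= i) if causal else True for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


causal_mask = attention_mask(3, causal=True)   # lower-triangular
full_mask = attention_mask(3, causal=False)    # every token sees all tokens
for row in causal_mask:
    print(row)
```

With the non-causal mask, a token describing an early cell can also attend to tokens for later rows and columns, which is one plausible reading of "capturing the entire table structure".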

[70] T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

Zhe Cao,Tao Wang,Jiaming Wang,Yanghai Wang,Yuanxing Zhang,Jialu Chen,Miao Deng,Jiahao Wang,Yubin Guo,Chenxi Liao,Yize Zhang,Zhaoxiang Zhang,Jiaheng Liu

Main category: cs.CV

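TL;DR: Proposes T2AV-Compass, a unified benchmark for comprehensive evaluation of text-to-audio-video generation systems. It contains 500 diverse, complex prompts and combines objective metrics with MLLM-based subjective judging, revealing that existing models still fall well short in realism, cross-modal consistency, and related aspects.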

Details Motivation: Evaluation of existing T2AV generation systems is fragmented and lacks a comprehensive measure of cross-modal alignment, instruction following, and perceptual realism under complex prompts. Method: A diverse set of 500 prompts is built from a taxonomy, and a dual-level evaluation framework is designed: objective signal-level metrics (video quality, audio quality, cross-modal alignment) and a subjective MLLM-as-a-Judge protocol (instruction following and realism). Result: An evaluation of 11 representative T2AV systems finds that even the strongest models remain far below human level in audio realism, fine-grained synchronization, and instruction following. Conclusion: T2AV-Compass is a challenging, diagnostic testbed that can effectively drive progress in text-to-audio-video generation. Abstract: Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in audio realism, fine-grained synchronization, instruction following, etc. These results indicate significant room for improvement in future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.

[71] UniRec-0.1B: Unified Text and Formula Recognition with 0.1B Parameters

Yongkun Du,Zhineng Chen,Yazhen Xie,Weikang Bai,Hao Feng,Wei Shi,Yuchen Su,Can Huang,Yu-Gang Jiang

Main category: cs.CV

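TL;DR: This paper proposes UniRec-0.1B, a lightweight unified recognition model with only 0.1B parameters that recognizes text and formulas efficiently and accurately and performs well across multiple levels of document structure.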

Details Motivation: Existing vision-language models can recognize text and formulas in a unified way, but they are large and computationally expensive, which limits their use; a lightweight, efficient unified recognition model is therefore needed. Method: The authors build UniRec40M, a large-scale dataset of 40 million samples; introduce hierarchical supervision training to improve understanding of multi-level structure; and design a semantic-decoupled tokenizer that separates text and formula representations. Result: On a comprehensive benchmark covering Chinese and English documents from multiple domains and at multiple levels, UniRec-0.1B outperforms general-purpose VLMs and leading document parsing expert models while achieving a 2-9× speedup. Conclusion: UniRec-0.1B achieves efficient, accurate unified recognition of text and formulas with a very small parameter count, making it practical and broadly applicable. Abstract: Text and formulas constitute the core informational components of many documents. Accurately and efficiently recognizing both is crucial for developing robust and generalizable document parsing systems. Recently, vision-language models (VLMs) have achieved impressive unified recognition of text and formulas. However, they are large and computationally demanding, restricting their usage in many applications. In this paper, we propose UniRec-0.1B, a unified recognition model with only 0.1B parameters. It is capable of performing text and formula recognition at multiple levels, including characters, words, lines, paragraphs, and documents. To implement this task, we first establish UniRec40M, a large-scale dataset comprising 40 million text, formula, and mixed samples, enabling the training of a powerful yet lightweight model. Secondly, we identify two challenges when building such a lightweight but unified expert model. They are: structural variability across hierarchies and semantic entanglement between textual and formulaic content. To tackle these, we introduce a hierarchical supervision training that explicitly guides structural comprehension, and a semantic-decoupled tokenizer that separates text and formula representations. Finally, we develop a comprehensive evaluation benchmark covering Chinese and English documents from multiple domains and with multiple levels. Experimental results on this and public benchmarks demonstrate that UniRec-0.1B outperforms both general-purpose VLMs and leading document parsing expert models, while achieving a 2-9× speedup, validating its effectiveness and efficiency.
Codebase and Dataset: https://github.com/Topdu/OpenOCR.

[72] FreeInpaint: Tuning-free Prompt Alignment and Visual Rationality Enhancement in Image Inpainting

Chao Gong,Dong Li,Yingwei Pan,Jingjing Chen,Ting Yao,Tao Mei

Main category: cs.CV

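TL;DR: Proposes FreeInpaint, a tuning-free, plug-and-play text-guided image inpainting method that directly optimizes diffusion latents during inference to improve the prompt alignment and visual rationality of generated images.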

Details Motivation: Existing methods struggle to maintain both prompt alignment and visual rationality, and they rely on model fine-tuning or additional training. Method: A prior-guided noise optimization method optimizes the initial noise to steer the model's attention toward valid inpainting regions; a composite guidance objective tailored to inpainting enhances prompt alignment and visual rationality by optimizing the intermediate latents at each step. Result: FreeInpaint's effectiveness and robustness are validated across various inpainting diffusion models and evaluation metrics, with marked gains in consistency with the text prompt and in visual naturalness. Conclusion: FreeInpaint is a general, efficient plug-and-play inpainting framework that improves the inpainting performance of existing diffusion models without any training. Abstract: Text-guided image inpainting endeavors to generate new content within specified regions of images using textual prompts from users. The primary challenge is to accurately align the inpainted areas with the user-provided prompts while maintaining a high degree of visual fidelity. While existing inpainting methods have produced visually convincing results by leveraging the pre-trained text-to-image diffusion models, they still struggle to uphold both prompt alignment and visual rationality simultaneously. In this work, we introduce FreeInpaint, a plug-and-play tuning-free approach that directly optimizes the diffusion latents on the fly during inference to improve the faithfulness of the generated images. Technically, we introduce a prior-guided noise optimization method that steers model attention towards valid inpainting regions by optimizing the initial noise. Furthermore, we meticulously design a composite guidance objective tailored specifically for the inpainting task. This objective efficiently directs the denoising process, enhancing prompt alignment and visual rationality by optimizing intermediate latents at each step. Through extensive experiments involving various inpainting diffusion models and evaluation metrics, we demonstrate the effectiveness and robustness of our proposed FreeInpaint.

[73] MarineEval: Assessing the Marine Intelligence of Vision-Language Models

Yuk-Kwan Wong,Tuan-An To,Jipeng Zhang,Ziqiang Zheng,Sai-Kit Yeung

Main category: cs.CV

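TL;DR: This paper presents MarineEval, the first large-scale marine vision-language model dataset and benchmark, with 2,000 image-based question-answering pairs for evaluating how well existing vision-language models answer marine-domain questions. Experiments show that existing models answer domain-specific questions poorly and leave substantial room for improvement.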

Details Motivation: The paper asks whether existing vision-language models (VLMs) can act as experts in the marine domain, which demands specialized knowledge, and accurately answer questions that pose domain-specific challenges. Method: The authors construct MarineEval, a large-scale marine VLM dataset and benchmark with 2,000 image-based question-answering pairs spanning 7 task dimensions and 20 capacity dimensions, with data quality and domain requirements verified by marine domain experts, and use it to benchmark 17 existing VLMs. Result: The evaluation of 17 existing VLMs on MarineEval shows that current models perform poorly on marine domain-specific questions, have significant limitations, and leave large room for improvement. Conclusion: Existing vision-language models cannot yet handle marine question answering that requires deep domain knowledge; future work needs to improve their understanding and reasoning in specialized domains. Abstract: We have witnessed promising progress led by large language models (LLMs) and further vision language models (VLMs) in handling various queries as a general-purpose assistant. VLMs, as a bridge to connect the visual world and language corpus, receive both visual content and various text-only user instructions to generate corresponding responses. Though great success has been achieved by VLMs in various fields, in this work, we ask whether the existing VLMs can act as domain experts, accurately answering marine questions, which require significant domain expertise and address special domain challenges/requirements. To comprehensively evaluate the effectiveness and explore the boundary of existing VLMs, we construct the first large-scale marine VLM dataset and benchmark called MarineEval, with 2,000 image-based question-answering pairs. During our dataset construction, we ensure the diversity and coverage of the constructed data: 7 task dimensions and 20 capacity dimensions. The domain requirements are specially integrated into the data construction and further verified by the corresponding marine domain experts. We comprehensively benchmark 17 existing VLMs on our MarineEval and also investigate the limitations of existing models in answering marine research questions. The experimental results reveal that existing VLMs cannot effectively answer the domain-specific questions, and there is still substantial room for further performance improvements. We hope our new benchmark and observations will facilitate future research. Project Page: http://marineeval.hkustvgd.com/

[74] TGC-Net: A Structure-Aware and Semantically-Aligned Framework for Text-Guided Medical Image Segmentation

Gaoren Lin,Huangxuan Zhao,Yuan Xiong,Lefei Zhang,Bo Du,Wentao Zhu

Main category: cs.CV

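TL;DR: This paper proposes TGC-Net, a CLIP-based framework for text-guided medical image segmentation that introduces multi-scale structural enhancement, medical knowledge injection, and cross-modal calibration modules, achieving state-of-the-art performance with fewer trainable parameters.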

Details Motivation: Existing methods rely on unaligned image and text encoders, which makes multimodal fusion complex; applying CLIP directly to medical images suffers from insufficient preservation of anatomical structures, weak modeling of clinical descriptions, and domain-specific semantic misalignment. Method: TGC-Net comprises three core components: (1) a Semantic-Structural Synergy Encoder (SSE) that adds a CNN branch to the ViT for multi-scale structural refinement; (2) a Domain-Augmented Text Encoder (DATE) that injects medical knowledge derived from large language models; and (3) a Vision-Language Calibration Module (VLCM) that refines cross-modal correspondence in a unified feature space. Result: Experiments on five chest X-ray and thoracic CT datasets show that TGC-Net reaches state-of-the-art performance with fewer trainable parameters, with notable Dice gains on challenging benchmarks. Conclusion: TGC-Net addresses the limitations of CLIP in medical image segmentation and, through parameter-efficient, task-specific adaptation, achieves accurate text-guided segmentation with good application potential. Abstract: Text-guided medical segmentation enhances segmentation accuracy by utilizing clinical reports as auxiliary information. However, existing methods typically rely on unaligned image and text encoders, which necessitate complex interaction modules for multimodal fusion. While CLIP provides a pre-aligned multimodal feature space, its direct application to medical imaging is limited by three main issues: insufficient preservation of fine-grained anatomical structures, inadequate modeling of complex clinical descriptions, and domain-specific semantic misalignment. To tackle these challenges, we propose TGC-Net, a CLIP-based framework focusing on parameter-efficient, task-specific adaptations. Specifically, it incorporates a Semantic-Structural Synergy Encoder (SSE) that augments CLIP's ViT with a CNN branch for multi-scale structural refinement, a Domain-Augmented Text Encoder (DATE) that injects large-language-model-derived medical knowledge, and a Vision-Language Calibration Module (VLCM) that refines cross-modal correspondence in a unified feature space. Experiments on five datasets across chest X-ray and thoracic CT modalities demonstrate that TGC-Net achieves state-of-the-art performance with substantially fewer trainable parameters, including notable Dice gains on challenging benchmarks.
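Since the results are reported as Dice gains, a short reminder of how the Dice coefficient is usually computed for binary segmentation masks may help. This is the standard definition, 2|A ∩ B| / (|A| + |B|); the flat-list mask format and the convention of returning 1.0 for two empty masks are assumptions, and the paper's exact evaluation protocol may differ.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for flat binary masks (0/1 values)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly; this convention is an assumption.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 5-pixel masks: 2 pixels overlap, each mask has 3 foreground pixels.
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
score = dice_coefficient(pred, target)
print(round(score, 4))  # 2 * 2 / (3 + 3)
```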

[75] ORCA: Object Recognition and Comprehension for Archiving Marine Species

Yuk-Kwan Wong,Haixin Liang,Zeyu Ma,Yiwei Chen,Ziqiang Zheng,Rinaldi Gotama,Pascal Sebastian,Lauren D. Sparks,Sai-Kit Yeung

Main category: cs.CV

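TL;DR: ORCA is a multi-modal benchmark for marine research containing 14,647 images, 478 species, and fine-grained visual and textual annotations, intended to advance research on object detection, instance captioning, and visual grounding.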

Details Motivation: Marine visual understanding is currently limited by insufficient training data and by the lack of a framework that systematically connects domain challenges with computer vision tasks, which restricts effective model application. Method: The authors present the ORCA multi-modal benchmark dataset, containing a large number of marine images with bounding boxes and expert-verified instance captions, and evaluate 18 state-of-the-art models on three tasks: closed-set and open-vocabulary object detection, instance captioning, and visual grounding. Result: Experiments reveal the challenges posed by species diversity, morphological overlap, and domain specificity, showing that existing models still struggle with marine understanding tasks. Conclusion: ORCA provides a comprehensive benchmark for marine visual understanding and helps advance data-driven methods in this field. Abstract: Marine visual understanding is essential for monitoring and protecting marine ecosystems, enabling automatic and scalable biological surveys. However, progress is hindered by limited training data and the lack of a systematic task formulation that aligns domain-specific marine challenges with well-defined computer vision tasks, thereby limiting effective model application. To address this gap, we present ORCA, a multi-modal benchmark for marine research comprising 14,647 images from 478 species, with 42,217 bounding box annotations and 22,321 expert-verified instance captions. The dataset provides fine-grained visual and textual annotations that capture morphology-oriented attributes across diverse marine species. To catalyze methodological advances, we evaluate 18 state-of-the-art models on three tasks: object detection (closed-set and open-vocabulary), instance captioning, and visual grounding. Results highlight key challenges, including species diversity, morphological overlap, and specialized domain demands, underscoring the difficulty of marine understanding. ORCA thus establishes a comprehensive benchmark to advance research in the marine domain. Project Page: http://orca.hkustvgd.com/.

[76] A Turn Toward Better Alignment: Few-Shot Generative Adaptation with Equivariant Feature Rotation

Chenghao Xu,Qi Liu,Jiexi Yan,Muli Yang,Cheng Deng

Main category: cs.CV

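TL;DR: This paper proposes Equivariant Feature Rotation (EFR), a new method for few-shot image generation that aligns the source and target domains in a self-rotated proxy feature space, significantly improving generative performance.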

Details Motivation: Because of the discrepancy between the distribution structures of the source and target domains and the scarcity of target samples, existing few-shot image generation methods struggle to align the distributions effectively, producing distorted or uninformative content. Method: Equivariant Feature Rotation (EFR) performs adaptive rotations within a parameterized Lie group to map source and target features into an equivariant proxy feature space, where alignment is carried out at two levels so as to preserve intra-domain structure and enable knowledge transfer. Result: Experiments on several commonly used datasets show that the method significantly outperforms existing methods in both generation quality and domain adaptation. Conclusion: By constructing an equivariant proxy feature space, EFR effectively mitigates the distribution discrepancy between domains and offers a more robust domain adaptation framework for few-shot image generation. Abstract: Few-shot image generation aims to effectively adapt a source generative model to a target domain using very few training images. Most existing approaches introduce consistency constraints, typically through instance-level or distribution-level loss functions, to directly align the distribution patterns of source and target domains within their respective latent spaces. However, these strategies often fall short: overly strict constraints can amplify the negative effects of the domain gap, leading to distorted or uninformative content, while overly relaxed constraints may fail to leverage the source domain effectively. This limitation primarily stems from the inherent discrepancy in the underlying distribution structures of the source and target domains. The scarcity of target samples further compounds this issue by hindering accurate estimation of the target domain's distribution. To overcome these limitations, we propose Equivariant Feature Rotation (EFR), a novel adaptation strategy that aligns source and target domains at two complementary levels within a self-rotated proxy feature space. Specifically, we perform adaptive rotations within a parameterized Lie Group to transform both source and target features into an equivariant proxy space, where alignment is conducted. These learnable rotation matrices serve to bridge the domain gap by preserving intra-domain structural information without distortion, while the alignment optimization facilitates effective knowledge transfer from the source to the target domain.
Comprehensive experiments on a variety of commonly used datasets demonstrate that our method significantly enhances the generative performance within the targeted domain.
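The claim that rotation matrices preserve intra-domain structure "without distortion" rests on a basic property of rotations: they leave all pairwise distances unchanged. The sketch below checks this for the simplest rotation group, SO(2), parameterized by a single angle. The 2-D setting, the fixed angle, and the function names are assumptions for illustration; the abstract does not specify which Lie group, dimensionality, or parameterization EFR uses, and in the paper the rotations are learned.

```python
import math


def rotate(features, theta):
    """Apply the SO(2) rotation with angle theta to a list of 2-D features."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in features]


def pairwise_distances(features):
    """Euclidean distances between every ordered pair of feature vectors."""
    return [math.dist(a, b) for a in features for b in features]


# Hypothetical source-domain features.
source = [(1.0, 0.0), (0.0, 2.0), (-1.5, 0.5)]
rotated = rotate(source, theta=0.7)

before = pairwise_distances(source)
after = pairwise_distances(rotated)
print(max(abs(d1 - d2) for d1, d2 in zip(before, after)))  # floating-point noise only
```

Because the relative geometry of the features is unchanged, any alignment performed after the rotation moves the whole set rigidly instead of warping it.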

[77] Towards Arbitrary Motion Completing via Hierarchical Continuous Representation

Chenghao Xu,Guangtao Lyu,Qi Liu,Jiexi Yan,Muli Yang,Cheng Deng

Main category: cs.CV

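TL;DR: This paper proposes NAME, a hierarchical implicit representation framework based on implicit neural representations (INRs) that models human motion sequences continuously, supporting interpolation, inbetweening, and extrapolation at arbitrary frame rates.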

Details Motivation: Physical motion is inherently continuous, and higher frame rates improve the temporal coherence of motion sequences, but existing methods cannot flexibly model motion at different frame rates; a human motion representation that is continuous and can be sampled freely is therefore needed. Method: The authors propose NAME, a parametric activation-induced hierarchical implicit representation framework that introduces a multi-scale temporal encoding mechanism to capture complex temporal patterns and a learnable parametric activation function based on Fourier transformations to strengthen the expressiveness of the MLP decoder. Result: Experiments on multiple benchmark datasets show strong performance and robustness in motion interpolation, extrapolation, and reconstruction at different frame rates. Conclusion: The NAME framework effectively models human motion continuously, supports accurate generation at arbitrary frame rates, and outperforms conventional discrete sequence modeling for motion representation. Abstract: Physical motions are inherently continuous, and higher camera frame rates typically contribute to improved smoothness and temporal coherence. For the first time, we explore continuous representations of human motion sequences, featuring the ability to interpolate, inbetween, and even extrapolate any input motion sequences at arbitrary frame rates. To achieve this, we propose a novel parametric activation-induced hierarchical implicit representation framework, referred to as NAME, based on Implicit Neural Representations (INRs). Our method introduces a hierarchical temporal encoding mechanism that extracts features from motion sequences at multiple temporal scales, enabling effective capture of intricate temporal patterns. Additionally, we integrate a custom parametric activation function, powered by Fourier transformations, into the MLP-based decoder to enhance the expressiveness of the continuous representation. This parametric formulation significantly augments the model's ability to represent complex motion behaviors with high accuracy. Extensive evaluations across several benchmark datasets demonstrate the effectiveness and robustness of our proposed approach.
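What makes an implicit representation "arbitrary frame rate" is that the decoder is queried with a continuous timestamp instead of a frame index. The sketch below shows that generic pattern: timestamps are generated for any requested frame rate and encoded with sin/cos features at several frequencies before they would be passed to a decoder. It does not reproduce NAME's hierarchical temporal encoding or its parametric activation; the function names, the frequency schedule, and the number of frequencies are all assumptions.

```python
import math


def fourier_time_features(t, num_frequencies=4):
    """Encode a continuous timestamp t (in seconds) as sin/cos features."""
    features = []
    for k in range(num_frequencies):
        angle = (2.0 ** k) * math.pi * t
        features.extend((math.sin(angle), math.cos(angle)))
    return features


def query_times(duration, fps):
    """Timestamps at which a continuous representation would be sampled."""
    return [i / fps for i in range(int(round(duration * fps)) + 1)]


# The same one-second motion can be queried at 30 fps or 120 fps.
times_30 = query_times(1.0, 30)
times_120 = query_times(1.0, 120)
encoded = [fourier_time_features(t) for t in times_120]
print(len(times_30), len(times_120), len(encoded[0]))
```

Interpolation and inbetweening correspond to querying timestamps between the observed frames, and extrapolation to querying timestamps beyond the last one.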

[78] UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement

Tanghui Jia,Dongyu Yan,Dehao Hao,Yang Li,Kaiyi Zhang,Xianyi He,Lanjiong Li,Jinnan Chen,Lutao Jiang,Qishen Yin,Long Quan,Ying-Cong Chen,Li Yuan

Main category: cs.CV

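TL;DR: UltraShape 1.0 is a scalable 3D diffusion framework that generates high-fidelity 3D geometry through a two-stage pipeline, with a novel data processing method and voxel-based refinement.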

Details Motivation: To raise the geometric quality of public 3D datasets and generate high-quality, detailed 3D shapes, it is necessary to address low-quality samples, holes, and thin structures in existing data and to improve the generative model's ability to synthesize detail. Method: A two-stage generation pipeline first synthesizes a coarse global structure and then refines it. A new watertight processing method and a high-quality data filtering pipeline are proposed; spatial localization is decoupled from geometric detail synthesis in the diffusion process through voxel-based refinement, in which voxel queries derived from the coarse geometry provide explicit positional anchors (encoded via RoPE) so that the model can focus on generating local geometric detail. Result: Trained only on publicly available 3D datasets, UltraShape 1.0 achieves strong geometric quality, and extensive evaluations show that it is competitive with existing open-source methods in both data processing quality and geometry generation. Conclusion: UltraShape 1.0 is an efficient 3D diffusion framework that generates high-quality 3D geometry from public data; its data processing and refinement strategies improve the fidelity and detail of the results, and all code and models will be released to support future research. Abstract: In this report, we introduce UltraShape 1.0, a scalable 3D diffusion framework for high-fidelity 3D geometry generation. The proposed approach adopts a two-stage generation pipeline: a coarse global structure is first synthesized and then refined to produce detailed, high-quality geometry. To support reliable 3D generation, we develop a comprehensive data processing pipeline that includes a novel watertight processing method and high-quality data filtering. This pipeline improves the geometric quality of publicly available 3D datasets by removing low-quality samples, filling holes, and thickening thin structures, while preserving fine-grained geometric details. To enable fine-grained geometry refinement, we decouple spatial localization from geometric detail synthesis in the diffusion process. We achieve this by performing voxel-based refinement at fixed spatial locations, where voxel queries derived from coarse geometry provide explicit positional anchors encoded via RoPE, allowing the diffusion model to focus on synthesizing local geometric details within a reduced, structured solution space. Our model is trained exclusively on publicly available 3D datasets, achieving strong geometric quality despite limited training resources. Extensive evaluations demonstrate that UltraShape 1.0 performs competitively with existing open-source methods in both data processing quality and geometry generation. All code and trained models will be released to support future research.
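The abstract mentions that voxel positions are encoded via RoPE (rotary position embedding). As a reminder of what that encoding does, here is a minimal one-dimensional sketch: consecutive pairs of a vector are rotated by angles proportional to the position, so the dot product of two encoded vectors depends only on their position difference. The 1-D position, the `base` value, and the function names are assumptions; the report does not describe here how the three voxel coordinates are combined.

```python
import math


def rope_1d(vector, position, base=10000.0):
    """Rotate consecutive pairs of `vector` by position-dependent angles."""
    if len(vector) % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, len(vector), 2):
        theta = position * base ** (-i / len(vector))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        out.extend((c * x - s * y, s * x + c * y))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query/key vectors; only the offset between positions matters.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]
score_a = dot(rope_1d(q, 3), rope_1d(k, 1))
score_b = dot(rope_1d(q, 5), rope_1d(k, 3))
print(abs(score_a - score_b))  # floating-point noise only
```

This relative-position property is one reason such an encoding can serve as an explicit positional anchor for queries placed at fixed spatial locations.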

[79] VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs

Brigitta Malagurski Törtei,Yasser Dahou,Ngoc Dung Huynh,Wamiq Reyaz Para,Phúc H. Lê Khac,Ankit Singh,Sofian Chaybouti,Sanath Narayan

Main category: cs.CV

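TL;DR: VisRes Bench is a benchmark for studying visual reasoning in naturalistic settings that exposes the limitations of current vision-language models in perceptual and relational reasoning.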

Details Motivation: The paper investigates to what extent vision-language models perform visual reasoning rather than relying on linguistic priors. Method: The VisRes Bench benchmark has three levels of complexity: Level 1 tests perceptual completion and global image matching, Level 2 tests rule-based inference over a single attribute, and Level 3 tests compositional reasoning over multiple attributes. Result: Across more than 19,000 controlled images, state-of-the-art vision-language models perform near random under subtle perceptual perturbations, indicating limited abstraction ability. Conclusion: VisRes provides a unified framework that helps advance abstract visual reasoning in multimodal research. Abstract: Vision-Language Models (VLMs) have achieved remarkable progress across tasks such as visual question answering and image captioning. Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To address this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities. VisRes isolates distinct reasoning abilities across its levels. Level 1 probes perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation; Level 2 tests rule-based inference over a single attribute (e.g., color, count, orientation); and Level 3 targets compositional reasoning that requires integrating multiple visual attributes. Across more than 19,000 controlled task images, we find that state-of-the-art VLMs perform near random under subtle perceptual perturbations, revealing limited abstraction beyond pattern recognition. We conclude by discussing how VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research.

[80] Human Motion Estimation with Everyday Wearables

Siqi Zhu,Yixuan Li,Junfu Li,Qi Wu,Zan Wang,Haozhe Ma,Wei Liang

Main category: cs.CV

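TL;DR: This paper presents EveryWear, a lightweight human motion capture approach based entirely on everyday wearables (smartphone, smartwatch, earbuds, and smart glasses) that estimates full-body motion without calibration.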

Details Motivation: Existing wearable-based human motion estimation methods suffer from poor wearability, expensive hardware, and cumbersome calibration, which limits their use in daily life. Method: A multimodal teacher-student framework fuses visual information from egocentric cameras with inertial signals from consumer devices and is trained directly on real-world data, avoiding the sim-to-real gap. Result: Experiments show that the method outperforms baseline models on practical full-body motion estimation. The authors also release the Ego-Elec dataset, which covers 56 daily activities with ground-truth 3D annotations. Conclusion: EveryWear offers a feasible route to practical human motion capture on everyday wearables and advances markerless, calibration-free motion estimation. Abstract: While on-body device-based human motion estimation is crucial for applications such as XR interaction, existing methods often suffer from poor wearability, expensive hardware, and cumbersome calibration, which hinder their adoption in daily life. To address these challenges, we present EveryWear, a lightweight and practical human motion capture approach based entirely on everyday wearables: a smartphone, smartwatch, earbuds, and smart glasses equipped with one forward-facing and two downward-facing cameras, requiring no explicit calibration before use. We introduce Ego-Elec, a 9-hour real-world dataset covering 56 daily activities across 17 diverse indoor and outdoor environments, with ground-truth 3D annotations provided by motion capture (MoCap), to facilitate robust research and benchmarking in this direction. Our approach employs a multimodal teacher-student framework that integrates visual cues from egocentric cameras with inertial signals from consumer devices. By training directly on real-world data rather than synthetic data, our model effectively eliminates the sim-to-real gap that constrains prior work. Experiments demonstrate that our method outperforms baseline models, validating its effectiveness for practical full-body motion estimation.

[81] Latent Implicit Visual Reasoning

Kelvin Li,Chuyi Shang,Leonid Karlinsky,Rogerio Feris,Trevor Darrell,Roei Herzig

Main category: cs.CV

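TL;DR: Proposes a task-agnostic mechanism that lets large multimodal models discover and use visual reasoning tokens on their own, without explicit supervision, achieving state-of-the-art performance on a range of vision-centric tasks.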

Details Motivation: Existing large multimodal models are mostly text-centric and struggle with predominantly visual reasoning tasks, while existing remedies depend on manually annotated intermediate visual steps, which generalize poorly and are costly. Method: A training mechanism without explicit supervision lets the model learn visual reasoning tokens automatically; these tokens attend globally and re-encode the image in a task-adaptive way. Result: The method outperforms direct fine-tuning on multiple vision-centric tasks, performs well on tasks where intermediate abstractions are hard to define, and supports multi-task instruction tuning. Conclusion: The proposed mechanism improves the visual reasoning ability of multimodal models without hand-annotated intermediate steps, and it generalizes and scales well. Abstract: While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.

[82] Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval

Dao Sy Duy Minh,Huynh Trung Kiet,Nguyen Lam Phu Quy,Phu-Hoa Pham,Tran Chi Nguyen

Main category: cs.CV

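TL;DR: Proposes a lightweight two-stage image retrieval framework that uses event-centric entity extraction to incorporate temporal and contextual information, substantially outperforming baselines on OpenEvents v1.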

Details Motivation: Existing image-text retrieval falls short when faced with vague queries, linguistic variability, and scalability requirements, and its effectiveness is especially limited in real-world settings. Method: A two-stage retrieval pipeline: the first stage filters candidates with BM25 over salient entities, and the second stage uses BEiT-3 models for multimodal semantic modeling and reranking. Result: The method reaches a mean average precision of 0.559 on the OpenEvents v1 benchmark, substantially above prior methods. Conclusion: Combining event-guided filtering with long-text vision-language modeling improves retrieval accuracy and efficiency in complex real-world scenarios. Abstract: Retrieving images from natural language descriptions is a core task at the intersection of computer vision and natural language processing, with wide-ranging applications in search engines, media archiving, and digital content management. However, real-world image-text retrieval remains challenging due to vague or context-dependent queries, linguistic variability, and the need for scalable solutions. In this work, we propose a lightweight two-stage retrieval pipeline that leverages event-centric entity extraction to incorporate temporal and contextual signals from real-world captions. The first stage performs efficient candidate filtering using BM25 based on salient entities, while the second stage applies BEiT-3 models to capture deep multimodal semantics and rerank the results. Evaluated on the OpenEvents v1 benchmark, our method achieves a mean average precision of 0.559, substantially outperforming prior baselines. These results highlight the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in complex, real-world scenarios. Our code is available at https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval
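The filter-then-rerank idea in this entry can be illustrated with a minimal sketch. It is not the authors' implementation: `bm25_scores` is a plain Okapi BM25 written from the standard formula, and `rerank_fn` is a hypothetical stand-in for the BEiT-3 scoring stage, which is not reproduced here. Captions are assumed to be already reduced to lists of extracted entity tokens.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def two_stage_retrieve(query_entities, captions, rerank_fn, top_k=2):
    """Stage 1: keep the top_k BM25 candidates. Stage 2: reorder them with rerank_fn.

    rerank_fn(index) -> float is a hypothetical placeholder for a heavier
    multimodal scorer; it is an assumption of this sketch.
    """
    scores = bm25_scores(query_entities, captions)
    ranked = sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)
    candidates = ranked[:top_k]
    return sorted(candidates, key=rerank_fn, reverse=True)


captions = [
    ["paris", "olympics", "2024", "opening"],
    ["tokyo", "marathon"],
    ["paris", "fashion", "week"],
]
fake_reranker = {0: 0.2, 2: 0.9}  # made-up scores for illustration only
order = two_stage_retrieve(["paris", "olympics"], captions, lambda i: fake_reranker.get(i, 0.0))
```

The cheap lexical stage bounds how many items the expensive model has to score, which is where the efficiency of such a pipeline would come from.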

[83] SegMo: Segment-aligned Text to 3D Human Motion Generation

Bowen Dang,Lin Wu,Xiaohang Yang,Zheng Yuan,Zhixiang Chen

Main category: cs.CV

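TL;DR: This paper proposes SegMo, a fine-grained text-to-motion generation framework that decomposes text and motion sequences into semantic segments and aligns them at the segment level, improving text-driven motion generation quality and cross-modal retrieval.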

Details Motivation: Existing methods align text descriptions with human motion only at the sequence level and ignore the internal semantic structure of both modalities, so the correspondence is not fine-grained enough. Text and motion can both be decomposed naturally into semantically coherent segments, which can serve as the basic units for finer alignment. Method: The SegMo framework has three modules: (1) text segment extraction, which decomposes complex descriptions into temporally ordered simple action phrases; (2) motion segment extraction, which partitions a complete motion sequence into the corresponding motion segments; and (3) fine-grained text-motion alignment, which aligns text and motion segments through contrastive learning. Result: SegMo reaches a TOP 1 score of 0.553 on the HumanML3D test set, improving on a strong baseline, and supports retrieval tasks such as motion grounding and motion-to-text retrieval. Conclusion: Through segment-level alignment, SegMo achieves finer text-motion correspondence and performs well in both generation quality and cross-modal retrieval, supporting the use of semantic structure to improve cross-modal alignment. Abstract: Generating 3D human motions from textual descriptions is an important research problem with broad applications in video games, virtual reality, and augmented reality. Recent methods align the textual description with human motion at the sequence level, neglecting the internal semantic structure of modalities. However, both motion descriptions and motion sequences can be naturally decomposed into smaller and semantically coherent segments, which can serve as atomic alignment units to achieve finer-grained correspondence. Motivated by this, we propose SegMo, a novel Segment-aligned text-conditioned human Motion generation framework to achieve fine-grained text-motion alignment. Our framework consists of three modules: (1) Text Segment Extraction, which decomposes complex textual descriptions into temporally ordered phrases, each representing a simple atomic action; (2) Motion Segment Extraction, which partitions complete motion sequences into corresponding motion segments; and (3) Fine-grained Text-Motion Alignment, which aligns text and motion segments with contrastive learning. Extensive experiments demonstrate that SegMo improves the strong baseline on two widely used datasets, achieving an improved TOP 1 score of 0.553 on the HumanML3D test set. Moreover, thanks to the learned shared embedding space for text and motion segments, SegMo can also be applied to retrieval-style tasks such as motion grounding and motion-to-text retrieval.
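To make the segment-level contrastive alignment in module (3) concrete, here is a minimal sketch. The abstract only says "contrastive learning"; the InfoNCE form, the cosine similarity, and the temperature value below are assumptions of this sketch, and the embeddings are toy vectors rather than outputs of any real encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(text_embs, motion_embs, temperature=0.1):
    """Mean text-to-motion InfoNCE loss.

    Text segment i and motion segment i are treated as the positive pair;
    every other motion segment in the batch acts as a negative.
    """
    losses = []
    for i, t in enumerate(text_embs):
        logits = [cosine(t, m) / temperature for m in motion_embs]
        # Log-sum-exp with the max subtracted for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


text_segments = [[1.0, 0.0], [0.0, 1.0]]       # e.g. "walk forward", "sit down"
aligned_motion = [[1.0, 0.0], [0.0, 1.0]]      # matching order
misaligned_motion = [[0.0, 1.0], [1.0, 0.0]]   # swapped order

loss_aligned = info_nce(text_segments, aligned_motion)
loss_misaligned = info_nce(text_segments, misaligned_motion)
```

The loss is small when each phrase sits closest to its own motion segment and large when the pairing is swapped, which is the behavior a segment-level alignment objective needs.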

[84] DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

Jiawei Liu,Junqiao Li,Jiangfan Deng,Gen Li,Siyu Zhou,Zetao Fang,Shanshan Lao,Zengde Deng,Jianing Zhu,Tingting Ma,Jiayi Li,Yunqiu Wang,Qian He,Xinglong Wu

Main category: cs.CV

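TL;DR: This paper proposes the DreaMontage framework, which achieves high-quality, long-duration, frame-guided, seamless one-shot video generation through a modified DiT architecture, visual expression fine-tuning, and segment-wise autoregressive inference.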

Details Motivation: Existing video generation methods mostly produce "one-shot" videos by simple concatenation, which struggles to maintain visual smoothness and temporal coherence, and real one-shot filming is expensive, so a virtual generation approach that overcomes these limits is needed. Method: 1) A lightweight intermediate-conditioning mechanism is added to the DiT architecture and combined with an adaptive tuning strategy to enable arbitrary-frame control; 2) a high-quality dataset is built for a visual expression SFT stage, and a tailored DPO scheme optimizes motion plausibility and transition smoothness; 3) a segment-wise autoregressive (SAR) inference strategy generates long video sequences efficiently. Result: Experiments show strong visual quality, temporal coherence, and computational efficiency, turning fragmented visual material into vivid, coherent one-shot videos. Conclusion: DreaMontage offers an effective solution for expressive, long-duration, seamless one-shot video generation and advances virtual film production. Abstract: The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner.
Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.

[85] AnyAD: Unified Any-Modality Anomaly Detection in Incomplete Multi-Sequence MRI

Changwei Wu,Yifei Chen,Yuxin Du,Mingxuan Liu,Jinying Zong,Beining Wu,Jie Dong,Feiwei Qin,Yunkang Cao,Qiyuan Tian

Main category: cs.CV

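TL;DR: This paper proposes a unified Any-Modality anomaly detection (AD) framework that performs robust anomaly detection and localization under any combination of MRI modalities, addressing missing modalities and scarce annotations in clinical practice.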

Details Motivation: Annotated abnormal cases are scarce and key imaging modalities are often missing in clinical practice, so existing single-class or multi-class anomaly detection models generalize poorly to unseen modality combinations, which limits their clinical scalability. Method: A dual-pathway DINOv2 encoder with a feature distribution alignment mechanism aligns incomplete-modality features with full-modality features. An Intrinsic Normal Prototypes (INPs) extractor and an INP-guided decoder reconstruct only normal structures and amplify abnormal regions. Training uses randomized modality masking and indirect feature completion, so the model adapts to any modality configuration without retraining. Result: On the BraTS2018, MU-Glioma-Post, and Pretreat-MetsToBrain-Masks datasets, the method outperforms state-of-the-art industrial and medical AD baselines across 7 modality combinations and generalizes well. Conclusion: The study establishes a scalable paradigm for multimodal medical anomaly detection under real-world incomplete-modality conditions. Abstract: Reliable anomaly detection in brain MRI remains challenging due to the scarcity of annotated abnormal cases and the frequent absence of key imaging modalities in real clinical workflows. Existing single-class or multi-class anomaly detection (AD) models typically rely on fixed modality configurations, require repetitive training, or fail to generalize to unseen modality combinations, limiting their clinical scalability. In this work, we present a unified Any-Modality AD framework that performs robust anomaly detection and localization under arbitrary MRI modality availability. The framework integrates a dual-pathway DINOv2 encoder with a feature distribution alignment mechanism that statistically aligns incomplete-modality features with full-modality representations, enabling stable inference even with severe modality dropout. To further enhance semantic consistency, we introduce an Intrinsic Normal Prototypes (INPs) extractor and an INP-guided decoder that reconstruct only normal anatomical patterns while naturally amplifying abnormal deviations. Through randomized modality masking and indirect feature completion during training, the model learns to adapt to all modality configurations without re-training. Extensive experiments on BraTS2018, MU-Glioma-Post, and Pretreat-MetsToBrain-Masks demonstrate that our approach consistently surpasses state-of-the-art industrial and medical AD baselines across 7 modality combinations, achieving superior generalization. This study establishes a scalable paradigm for multimodal medical AD under real-world, imperfect modality conditions.
Our source code is available at https://github.com/wuchangw/AnyAD.
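The randomized modality masking mentioned for training can be sketched as follows. This is an illustration under assumptions, not the released code: the sequence names are placeholders, and sampling uniformly over non-empty subsets is one possible choice that the abstract does not specify. As a side observation, three sequences give 2^3 - 1 = 7 non-empty subsets, which matches the count of 7 combinations reported above, though whether the evaluation actually uses three sequences is a guess.

```python
import random
from itertools import combinations


def all_modality_subsets(modalities):
    """Enumerate every non-empty subset of the available MRI sequences."""
    subsets = []
    for size in range(1, len(modalities) + 1):
        subsets.extend(combinations(modalities, size))
    return subsets


def random_modality_mask(modalities, rng):
    """Pick one non-empty subset uniformly and return a keep/drop flag per sequence."""
    kept = set(rng.choice(all_modality_subsets(modalities)))
    return {name: name in kept for name in modalities}


# Placeholder sequence names; real datasets define their own.
sequences = ("seq_a", "seq_b", "seq_c")
subsets = all_modality_subsets(sequences)
mask = random_modality_mask(sequences, random.Random(0))
```

During training, a mask like this would decide which inputs are zeroed or withheld for a given sample, so the model sees every availability pattern it may face at inference time.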

[86] ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision

Weiqi Li,Zehao Zhang,Liang Lin,Guangrun Wang

Main category: cs.CV

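TL;DR: This paper proposes Attention-Conditional Diffusion (ACD), a new framework that achieves direct conditional control in video diffusion models through attention supervision, improving condition alignment, temporal coherence, and visual quality.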

Details Motivation: Existing classifier-free guidance offers limited control over conditioning signals in video synthesis, while classifier-based guidance tends to produce adversarial artifacts, so a method with more precise, direct conditional control is needed. Method: The ACD framework achieves direct conditional control by supervising the model's attention maps with external control signals (such as a sparse 3D-aware object layout), and adds a dedicated Layout ControlNet and an automated annotation pipeline to support scalable layout integration. Result: Experiments on several benchmark video generation datasets show that ACD aligns with conditions better than existing methods while keeping good temporal consistency and visual fidelity. Conclusion: ACD provides an effective new paradigm for conditional video synthesis, with stronger controllability and more accurate condition adherence through attention supervision. Abstract: Controllability is a fundamental requirement in video synthesis, where accurate alignment with conditioning signals is essential. Existing classifier-free guidance methods typically achieve conditioning indirectly by modeling the joint distribution of data and conditions, which often results in limited controllability over the specified conditions. Classifier-based guidance enforces conditions through an external classifier, but the model may exploit this mechanism to raise the classifier score without genuinely satisfying the intended condition, resulting in adversarial artifacts and limited effective controllability. In this paper, we propose Attention-Conditional Diffusion (ACD), a novel framework for direct conditional control in video diffusion models via attention supervision. By aligning the model's attention maps with external control signals, ACD achieves better controllability. To support this, we introduce a sparse 3D-aware object layout as an efficient conditioning signal, along with a dedicated Layout ControlNet and an automated annotation pipeline for scalable layout integration. Extensive experiments on benchmark video generation datasets demonstrate that ACD delivers superior alignment with conditioning inputs while preserving temporal coherence and visual fidelity, establishing an effective paradigm for conditional video synthesis.

[87] GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation

Snehal Singh Tomar,Alexandros Graikos,Arjun Krishna,Dimitris Samaras,Klaus Mueller

Main category: cs.CV

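TL;DR: Proposes a two-stage method that factorizes image sequence generation into low-resolution sequence generation and per-frame high-resolution super-resolution, built on the DiT architecture, for efficient, high-quality, long-sequence generation.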

Details Motivation: Existing image sequence generation models operate directly on high-resolution temporal tensors, which is inefficient, makes long sequences hard to model, and uses training data poorly, so a more effective representation and generation scheme is needed. Method: A DiT first learns to generate a coarse image sequence on low-resolution grid images, using self-attention to capture correlations between frames; each frame is then super-resolved individually to recover detail. The pipeline requires no change to the 2D generator architecture. Result: Across several datasets the method achieves better generation quality than the SoTA with inference at least twice as fast, supports arbitrary-length sequences, trains more efficiently, and generalizes well across data domains. Conclusion: Decoupling image sequence generation into low-resolution structure generation and high-resolution detail completion is a more efficient and capable paradigm that overcomes several bottlenecks of the current SoTA. Abstract: Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (SoTA)? In this work, we address this question in the context of generative models and aim to devise a more effective way of modeling image sequence data. Observing the inefficiencies and bottlenecks of current SoTA image sequence generation methods, we show that rather than working with large tensors, we can improve the generation process by factorizing it into first generating the coarse sequence at low resolution and then refining the individual frames at high resolution. We train a generative model solely on grid images comprising subsampled frames. Yet, we learn to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames. In effect, our formulation extends a 2D image generator to operate as a low-resolution 3D image-sequence generator without introducing any architectural modifications. Subsequently, we super-resolve each frame individually to add the sequence-independent high-resolution details. This approach offers several advantages and can overcome key limitations of the SoTA in this domain. Compared to existing image sequence generation models, our method achieves superior synthesis quality and improved coherence across sequences. It also delivers high-fidelity generation of arbitrary-length sequences and increased efficiency in inference time and training data usage.
Furthermore, our straightforward formulation enables our method to generalize effectively across diverse data domains, which typically require additional priors and supervision to model in a generative context. Our method consistently outperforms SoTA in quality and inference speed (at least twice as fast) across datasets.
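The "grid images comprising subsampled frames" representation can be pictured with a small sketch. The row-major layout, the stride-based subsampling, and the function names are assumptions made for illustration; the abstract does not state how frames are arranged in the grid. Frames are plain nested lists of pixel values so the example needs no external library.

```python
def subsample(frames, stride):
    """Keep every stride-th frame of the sequence."""
    return frames[::stride]


def frames_to_grid(frames, rows, cols):
    """Tile rows*cols equally sized frames into one 2D grid image (row-major order)."""
    if len(frames) != rows * cols:
        raise ValueError("need exactly rows * cols frames")
    grid = []
    for r in range(rows):
        row_frames = frames[r * cols:(r + 1) * cols]
        for y in range(len(row_frames[0])):
            line = []
            for frame in row_frames:
                line.extend(frame[y])
            grid.append(line)
    return grid


def grid_to_frames(grid, rows, cols):
    """Inverse of frames_to_grid: cut the grid image back into individual frames."""
    height = len(grid) // rows
    width = len(grid[0]) // cols
    frames = []
    for r in range(rows):
        for c in range(cols):
            frames.append([grid[r * height + y][c * width:(c + 1) * width] for y in range(height)])
    return frames


# Eight tiny 2x2 frames whose pixels all equal the frame index.
sequence = [[[i, i], [i, i]] for i in range(8)]
kept = subsample(sequence, 2)          # frames 0, 2, 4, 6
grid = frames_to_grid(kept, 2, 2)      # one 4x4 image
recovered = grid_to_frames(grid, 2, 2)
```

Because the grid is an ordinary 2D image, an unmodified 2D generator can be trained on it, and each recovered tile can then be passed to a separate super-resolution step.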

[88] Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential

Shihao Zou,Jingjing Li,Wei Ji,Jincai Huang,Kai Wang,Guo Dan,Weixin Si,Yi Pan

Main category: cs.CV

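TL;DR: This paper proposes SpikeSurgSeg, the first spike-driven video Transformer framework for surgical scene segmentation, with potential for real-time processing on non-GPU platforms.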

Details Motivation: Existing deep learning models are computationally heavy and power-hungry, which makes real-time deployment in resource-constrained surgical environments difficult, and labeled surgical data is scarce, which limits the performance of current methods. Method: An SNN-oriented masked autoencoding pretraining strategy for surgical scenes learns robust spatiotemporal representations through layer-wise tube masking, and is combined with a lightweight spike-driven segmentation head that produces temporally consistent, low-latency predictions. Result: On EndoVis18 and the in-house SurgBleed dataset, SpikeSurgSeg reaches mIoU comparable to state-of-the-art ANN models while reducing inference latency by at least 8×, and it is more than 20× faster than most foundation-model baselines. Conclusion: SpikeSurgSeg keeps segmentation accuracy high while cutting latency and energy use substantially, showing strong potential for time-critical surgical scene segmentation. Abstract: Modern surgical systems increasingly rely on intelligent scene understanding to provide timely situational awareness for enhanced intra-operative safety. Within this pipeline, surgical scene segmentation plays a central role in accurately perceiving operative events. Although recent deep learning models, particularly large-scale foundation models, achieve remarkable segmentation accuracy, their substantial computational demands and power consumption hinder real-time deployment in resource-constrained surgical environments. To address this limitation, we explore emerging spiking neural networks (SNNs) as a promising paradigm for highly efficient surgical intelligence. However, their performance is still constrained by the scarcity of labeled surgical data and the inherently sparse nature of surgical video representations. To this end, we propose SpikeSurgSeg, the first spike-driven video Transformer framework tailored for surgical scene segmentation with real-time potential on non-GPU platforms. To address the limited availability of surgical annotations, we introduce a surgical-scene masked autoencoding pretraining strategy for SNNs that enables robust spatiotemporal representation learning via layer-wise tube masking. Building on this pretrained backbone, we further adopt a lightweight spike-driven segmentation head that produces temporally consistent predictions while preserving the low-latency characteristics of SNNs. Extensive experiments on EndoVis18 and our in-house SurgBleed dataset demonstrate that SpikeSurgSeg achieves mIoU comparable to SOTA ANN-based models while reducing inference latency by at least 8×.
Notably, it delivers over 20× acceleration relative to most foundation-model baselines, underscoring its potential for time-critical surgical scene segmentation.

[89] Post-Processing Mask-Based Table Segmentation for Structural Coordinate Extraction

Suren Bandara

Main category: cs.CV

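TL;DR: Proposes a new table edge detection method based on multi-scale signal processing: row and column transitions are modeled as one-dimensional signals and processed with Gaussian convolution and statistical thresholding, which suppresses noise while preserving structural edges and clearly improves table segmentation accuracy on low-resolution or noisy images.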

Details Motivation: Accurately identifying row and column boundaries of tables remains difficult in low-resolution or noisy images. Existing methods are sensitive to noise, lose resolution, or are computationally costly, and Transformer-based methods adapt poorly to incomplete or degraded data. Method: Row and column transitions in the table mask are modeled as one-dimensional signals, smoothed at multiple scales with Gaussian convolutions of increasing variance, and filtered by statistical thresholding to suppress noise; signal peaks are then detected and mapped back to image coordinates to locate the boundaries precisely. Result: On the PubLayNet-1M benchmark, using TableNet with PyTesseract OCR, applying the method to column boundary detection raises CASA (Cell-Aware Segmentation Accuracy) from 67% to 76%; the method is robust to resolution changes and supports zero-padding and scaling strategies. Conclusion: The method is more robust and accurate under noise and resolution changes and produces optimized structured table output suitable for downstream analysis. Abstract: Structured data extraction from tables plays a crucial role in document image analysis for scanned documents and digital archives. Although many methods have been proposed to detect table structures and extract cell contents, accurately identifying table segment boundaries (rows and columns) remains challenging, particularly in low-resolution or noisy images. In many real-world scenarios, table data are incomplete or degraded, limiting the adaptability of transformer-based methods to noisy inputs. Mask-based edge detection techniques have shown greater robustness under such conditions, as their sensitivity can be adjusted through threshold tuning; however, existing approaches typically apply masks directly to images, leading to noise sensitivity, resolution loss, or high computational cost. This paper proposes a novel multi-scale signal-processing method for detecting table edges from table masks. Row and column transitions are modeled as one-dimensional signals and processed using Gaussian convolution with progressively increasing variances, followed by statistical thresholding to suppress noise while preserving stable structural edges. Detected signal peaks are mapped back to image coordinates to obtain accurate segment boundaries. Experimental results show that applying the proposed approach to column edge detection improves Cell-Aware Segmentation Accuracy (CASA), a layout-aware metric evaluating both textual correctness and correct cell placement, from 67% to 76% on the PubLayNet-1M benchmark when using TableNet with PyTesseract OCR.
The method is robust to resolution variations through zero-padding and scaling strategies and produces optimized structured tabular outputs suitable for downstream analysis.
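The signal-processing steps described in this entry can be sketched in a few lines. This is an illustrative reading, not the author's code: the sigma values, the mean-plus-z-standard-deviations threshold, and the rule that a peak must reappear within one index at every scale are all assumptions chosen to show the idea of keeping only edges that are stable across scales.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about three sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma):
    """Convolve the signal with a Gaussian, treating values beyond the ends as zero."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def detect_edges(signal, sigmas=(1.0, 2.0), z=1.0):
    """Return indices that are above-threshold local maxima at every smoothing scale."""
    stable = None
    for sigma in sigmas:
        s = smooth(signal, sigma)
        mean = sum(s) / len(s)
        std = math.sqrt(sum((v - mean) ** 2 for v in s) / len(s))
        threshold = mean + z * std  # assumed statistical threshold
        peaks = {
            i for i in range(1, len(s) - 1)
            if s[i] > threshold and s[i] > s[i - 1] and s[i] >= s[i + 1]
        }
        if stable is None:
            stable = peaks
        else:
            # Keep only peaks that persist (within one index) at the coarser scale.
            stable = {p for p in stable if any(abs(p - q) <= 1 for q in peaks)}
    return sorted(stable)


# Column-transition signal: two strong separators and one weak noise blip.
transitions = [0.0] * 30
transitions[5] = 1.0
transitions[10] = 10.0
transitions[20] = 10.0
edges = detect_edges(transitions)
```

The returned indices are positions along the mask axis; mapping them to image coordinates would only require undoing whatever scaling or padding was applied to the mask.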

[90] AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents

Yue Cao,Yingyao Wang,Pi Bu,Jingxuan Xing,Wei Jiang,Zekun Zhu,Junpeng Ma,Sashuai Zhou,Tong Lu,Jun Song,Yu Cheng,Yuning Jiang,Bo Zheng

Main category: cs.CV

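TL;DR: AndroidLens is a new evaluation framework for mobile GUI agents with 571 long-latency tasks covering complex real-world scenarios; it proposes static and dynamic evaluation methods and reveals serious shortcomings of existing models in task success rate and task progress.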

Details Motivation: Existing benchmarks for mobile GUI agents are limited by narrow application coverage, simple tasks, and coarse metrics, so they do not reflect real-world complexity; a more comprehensive and challenging evaluation framework is needed. Method: The AndroidLens framework contains 571 long-horizon tasks across 38 domains in Chinese and English. Static evaluation preserves real-world anomalies and allows multiple valid solution paths, and a milestone-based dynamic evaluation measures progress at fine granularity with Average Task Progress (ATP). Result: Even the best current models reach only a 12.7% task success rate and 50.47% ATP, exposing clear weaknesses in handling environmental anomalies, adaptive exploration, and long-term memory retention. Conclusion: AndroidLens provides a more realistic and challenging evaluation platform that measures the real capability of mobile GUI agents more accurately and points to key directions for future work. Abstract: Graphical user interface (GUI) agents can substantially improve productivity by automating frequently executed long-latency tasks on mobile devices. However, existing evaluation benchmarks are still constrained to limited applications, simple tasks, and coarse-grained metrics. To address this, we introduce AndroidLens, a challenging evaluation framework for mobile GUI agents, comprising 571 long-latency tasks in both Chinese and English environments, each requiring an average of more than 26 steps to complete. The framework features: (1) tasks derived from real-world user scenarios across 38 domains, covering complex types such as multi-constraint, multi-goal, and domain-specific tasks; (2) static evaluation that preserves real-world anomalies and allows multiple valid paths to reduce bias; and (3) dynamic evaluation that employs a milestone-based scheme for fine-grained progress measurement via Average Task Progress (ATP). Our evaluation indicates that even the best models reach only a 12.7% task success rate and 50.47% ATP. We also underscore key challenges in real-world environments, including environmental anomalies, adaptive exploration, and long-term memory retention.
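The gap between a 12.7% success rate and 50.47% ATP is easier to read with a small sketch of how a milestone-based progress score can differ from binary success. The definition used here (fraction of milestones reached per task, averaged over tasks) is an assumption for illustration; the abstract does not give the exact ATP formula, and the sample numbers are made up.

```python
def average_task_progress(results):
    """Mean fraction of milestones reached.

    results: list of (milestones_reached, total_milestones) pairs, one per task.
    """
    return sum(reached / total for reached, total in results) / len(results)


def task_success_rate(results):
    """Fraction of tasks in which every milestone was reached."""
    return sum(1 for reached, total in results if reached == total) / len(results)


# Made-up outcomes for three tasks: one finished, one half done, one not started.
outcomes = [(4, 4), (2, 4), (0, 5)]
atp = average_task_progress(outcomes)
success = task_success_rate(outcomes)
```

An agent that regularly gets partway through long tasks scores well on progress but poorly on success, which is the pattern the reported numbers suggest.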

[91] TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning

Varun Belagali,Saarthak Kapse,Pierre Marza,Srijan Das,Zilinghan Li,Sofiène Boutaj,Pushpak Pati,Srikar Yellapragada,Tarak Nath Nandi,Ravi K Madduri,Joel Saltz,Prateek Prasanna,Stergios Christodoulidis,Maria Vakalopoulou,Dimitris Samaras

Main category: cs.CV

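TL;DR: This paper proposes TICON, a Transformer-based tile representation contextualizer that unifies and enriches embeddings from any tile-level foundation model and sets new state-of-the-art results on a range of computational pathology tasks.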

Details Motivation: Existing tile-encoder pipelines ignore the context of a tile within the whole slide image, and different tasks call for different tile encoders, so a general contextualization framework that can handle embeddings from many sources is missing. Method: TICON uses a single shared Transformer encoder, pretrained with masked modeling, to contextualize embeddings from any tile-level foundation model, combined with a sliding-window strategy to capture global context. Result: TICON clearly improves performance on several tile-level benchmarks (HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (Patho-Bench), reaching new SOTA results; a slide-level foundation model built on top of it, using only 11K WSIs, outperforms existing SOTA models pretrained with up to 350K WSIs. Conclusion: TICON provides a general, efficient contextualization framework that integrates and enriches representations from multiple tile-level models, improves both local and global task performance, and advances foundation models in computational pathology. Abstract: The interpretation of small tiles in large whole slide images (WSI) often needs a larger image context. We introduce TICON, a transformer-based tile representation contextualizer that produces rich, contextualized embeddings for "any" application in computational pathology. Standard tile encoder-based pipelines, which extract embeddings of tiles stripped from their context, fail to model the rich slide-level information essential for both local and global tasks. Furthermore, different tile-encoders excel at different downstream tasks. Therefore, a unified model is needed to contextualize embeddings derived from "any" tile-level foundation model. TICON addresses this need with a single, shared encoder, pretrained using a masked modeling objective to simultaneously unify and contextualize representations from diverse tile-level pathology foundation models. Our experiments demonstrate that TICON-contextualized embeddings significantly improve performance across many different tasks, establishing new state-of-the-art results on tile-level benchmarks (i.e., HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (i.e., Patho-Bench). Finally, we pretrain an aggregator on TICON to form a slide-level foundation model, using only 11K WSIs, outperforming SoTA slide-level foundation models pretrained with up to 350K WSIs.

[92] Fast SAM2 with Text-Driven Token Pruning

Avilasha Mandal,Chaoning Zhang,Fachrina Dewi Puspitasari,Xudong Wang,Jiaquan Zhang,Caiyan Qin,Guoqing Wang,Yang Yang,Heng Tao Shen

Main category: cs.CV

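TL;DR: This paper proposes a text-guided token pruning framework that speeds up inference for the video object segmentation model SAM2 by selectively reducing token density before temporal propagation, cutting compute and memory cost with almost no loss in segmentation accuracy.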

Details Motivation: Models such as SAM2 perform very well in video segmentation, but processing temporally dense visual tokens makes compute and memory costs too high for practical deployment. Method: A lightweight routing mechanism is inserted after visual encoding and before temporal propagation. It ranks tokens by combining local visual context, semantic relevance derived from textual descriptions, and uncertainty cues, then prunes them so that only the most relevant tokens take part in later processing. Result: Compared with unpruned SAM2, inference is up to 42.50% faster and GPU memory use is 37.41% lower, while J and F scores remain competitive. Conclusion: Early token selection improves the scalability of Transformer-based video segmentation systems and offers an efficient, prompt-aware solution for real-time and resource-constrained applications. Abstract: Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation, yet its practical deployment remains limited by the high computational and memory cost of processing dense visual tokens across time. SAM2 pipelines typically propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object, resulting in reduced scalability due to quadratic memory attention overhead. In this work, we introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation, without modifying the underlying segmentation architecture. Operating after visual encoding and before memory-based propagation, our method ranks tokens using a lightweight routing mechanism that integrates local visual context, semantic relevance derived from object-centric textual descriptions (either user-provided or automatically generated), and uncertainty cues that help preserve ambiguous or boundary-critical regions. By retaining only the most informative tokens for downstream processing, the proposed approach reduces redundant computation while maintaining segmentation fidelity.
Extensive experiments across multiple challenging video segmentation benchmarks demonstrate that post-encoder token pruning provides a practical and effective pathway to efficient, prompt-aware video segmentation, achieving up to 42.50 percent faster inference and 37.41 percent lower GPU memory usage compared to the unpruned baseline SAM2, while preserving competitive J and F performance. These results highlight the potential of early token selection to improve the scalability of transformer-based video segmentation systems for real-time and resource-constrained applications.
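The ranking-and-pruning step can be sketched as a weighted score followed by top-k selection. The linear combination, the equal default weights, and the fixed keep ratio are assumptions of this sketch; the abstract names the three cues but not how the routing mechanism combines them, and the per-token scores below are invented.

```python
def prune_tokens(visual, semantic, uncertainty, keep_ratio=0.5, weights=(1.0, 1.0, 1.0)):
    """Return the indices of the tokens to keep, in their original order.

    visual, semantic, uncertainty: per-token scores for the three cues.
    weights: assumed mixing weights for (visual, semantic, uncertainty).
    """
    w_visual, w_semantic, w_uncertainty = weights
    scores = [
        w_visual * v + w_semantic * s + w_uncertainty * u
        for v, s, u in zip(visual, semantic, uncertainty)
    ]
    keep = max(1, int(keep_ratio * len(scores)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    # Restore spatial order so downstream modules see tokens in a consistent layout.
    return sorted(top)


visual_context = [0.1, 0.9, 0.2, 0.1]
text_relevance = [0.0, 0.8, 0.1, 0.7]
boundary_uncertainty = [0.0, 0.1, 0.9, 0.1]
kept_indices = prune_tokens(visual_context, text_relevance, boundary_uncertainty)
```

Token 2 survives mainly because of its uncertainty score, which shows how such a cue can protect ambiguous or boundary regions that a purely semantic ranking would drop.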

[93] Streaming Video Instruction Tuning

Jiaer Xia,Peixian Chen,Mengdan Zhang,Xing Sun,Kaiyang Zhou

Main category: cs.CV

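TL;DR: Streamo is a real-time streaming video LLM that performs a range of streaming video tasks, such as real-time narration, action understanding, event captioning, and time-sensitive question answering.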

Details Motivation: Existing online video models are usually limited to question answering or captioning and lack unified support for multiple streaming video tasks. Method: The authors build Streamo-Instruct-465K, a large-scale instruction-following dataset, and use end-to-end training to model multiple tasks in a unified way. Result: Streamo shows strong temporal reasoning, responsive interaction, and broad generalization across several streaming video benchmarks. Conclusion: Streamo bridges the gap between offline video perception models and real-time multimodal assistants and advances unified, intelligent understanding of continuous video streams. Abstract: We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.

[94] Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

Li-Zhong Szu-Tu,Ting-Lin Wu,Chia-Jui Chang,He Syu,Yu-Lun Liu

Main category: cs.CV

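TL;DR: This paper shows that state-of-the-art vision-language models (VLMs) have a significant popularity bias when recognizing buildings: accuracy on famous buildings is up to 34% higher than on ordinary ones, indicating reliance on memorization rather than generalizable understanding. The authors introduce YearGuessr, the largest open benchmark for this task, with 55,546 building images annotated with construction year, GPS location, page views, and other multi-modal information, and propose popularity-aware interval accuracy metrics to quantify the bias. An evaluation of more than 30 models, including the proposed YearCLIP, confirms that VLMs reason poorly about little-known buildings.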

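Details Motivation: To expose and systematically study the bias that vision-language models show toward objects of different popularity, and to push models from memorization toward real understanding and reasoning. Method: The authors build YearGuessr, a large-scale multi-modal building dataset with continuous construction-year labels (1001-2024), GPS coordinates, and page-view counts as a proxy for popularity; they frame construction-year prediction as ordinal regression and propose popularity-aware interval accuracy metrics. Result: Current VLMs achieve up to 34% higher accuracy on famous buildings than on ordinary ones and perform poorly on buildings that are not widely known, exposing their reliance on memorization; results for YearCLIP and the other models support the usefulness of the benchmark. Conclusion: Existing vision-language models show a strong popularity bias in building recognition, relying heavily on frequent or popular samples in their training data and lacking the ability to reason about unfamiliar subjects; more robust and fair models with real spatio-temporal understanding are needed. Abstract: We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: https://sytwu.github.io/BeyondMemo/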
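A popularity-aware interval accuracy can be sketched as "fraction of year predictions within a tolerance, computed separately for popular and ordinary buildings". The tolerance, the page-view threshold that splits the two groups, and the sample records are assumptions of this sketch; the paper's actual metric definition and binning are not given in the abstract.

```python
def interval_accuracy(pairs, tolerance):
    """Fraction of (predicted_year, true_year) pairs within +/- tolerance years."""
    hits = sum(1 for predicted, actual in pairs if abs(predicted - actual) <= tolerance)
    return hits / len(pairs)


def popularity_gap(records, tolerance, view_threshold):
    """Interval accuracy for popular vs. ordinary buildings, and their difference.

    records: list of (predicted_year, true_year, page_views).
    Assumes both groups are non-empty.
    """
    popular = [(p, t) for p, t, views in records if views >= view_threshold]
    ordinary = [(p, t) for p, t, views in records if views < view_threshold]
    acc_popular = interval_accuracy(popular, tolerance)
    acc_ordinary = interval_accuracy(ordinary, tolerance)
    return acc_popular, acc_ordinary, acc_popular - acc_ordinary


# Invented predictions for illustration only.
records = [
    (1890, 1889, 1_000_000),  # famous, close guess
    (1650, 1648, 500_000),    # famous, close guess
    (1900, 1950, 120),        # ordinary, far off
    (1805, 1800, 80),         # ordinary, just inside the tolerance
]
acc_popular, acc_ordinary, gap = popularity_gap(records, tolerance=5, view_threshold=10_000)
```

A positive gap means the model is more accurate on well-known buildings, which is the kind of signal the benchmark uses as evidence of memorization.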

[95] HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming

Haonan Qiu,Shikun Liu,Zijian Zhou,Zhaochong An,Weiming Ren,Zhiheng Liu,Jonas Schult,Sen He,Shoufa Chen,Yuren Cong,Tao Xiang,Ziwei Liu,Juan-Manuel Perez-Rua

Main category: cs.CV

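TL;DR: HiStream is an efficient high-resolution video generation framework that reduces redundancy through spatial, temporal, and timestep compression, substantially speeding up generation while keeping quality high.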

Details Motivation: The quadratic complexity of diffusion models makes high-resolution video generation too expensive for practical use. Method: The HiStream framework compresses along three axes: spatial compression (denoise at low resolution, then refine at high resolution), temporal compression (process chunk by chunk with a fixed-size anchor cache), and timestep compression (use fewer denoising steps for later chunks). Result: On 1080p benchmarks, HiStream denoises up to 76.2x faster than the Wan2.1 baseline with negligible quality loss; HiStream+ combines all three optimizations and reaches a 107.5x speedup, trading off speed against quality. Conclusion: HiStream makes high-resolution video generation efficient and scalable, supporting its practical use in digital media and film. Abstract: High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible. To address this, we introduce HiStream, an efficient autoregressive framework that systematically reduces redundancy across three axes: i) Spatial Compression: denoising at low resolution before refining at high resolution with cached features; ii) Temporal Compression: a chunk-by-chunk strategy with a fixed-size anchor cache, ensuring stable inference speed; and iii) Timestep Compression: applying fewer denoising steps to subsequent, cache-conditioned chunks. On 1080p benchmarks, our primary HiStream model (i+ii) achieves state-of-the-art visual quality while demonstrating up to 76.2x faster denoising compared to the Wan2.1 baseline and negligible quality loss. Our faster variant, HiStream+, applies all three optimizations (i+ii+iii), achieving a 107.5x acceleration over the baseline, offering a compelling trade-off between speed and quality, thereby making high-resolution video generation both practical and scalable.
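The interaction of temporal and timestep compression can be pictured as a scheduling loop. Everything specific here is an assumption made for illustration: the abstract does not say what the anchor cache holds, so this sketch keeps the first chunk as a permanent anchor plus a short window of recent chunks, and the step counts (8 for the first chunk, 2 afterwards) are invented. No denoising is performed; the function only records what each chunk would be conditioned on.

```python
from collections import deque


def stream_schedule(num_chunks, first_steps=8, later_steps=2, recent_window=2):
    """Plan chunk-by-chunk generation with a bounded conditioning cache.

    Returns a list of (chunk_index, denoising_steps, context_chunk_indices).
    """
    anchor = None                           # first chunk, kept for the whole stream
    recent = deque(maxlen=recent_window)    # most recent chunks, oldest evicted
    schedule = []
    for index in range(num_chunks):
        # Later chunks are conditioned on the cache, so they get fewer steps.
        steps = first_steps if index == 0 else later_steps
        context = ([anchor] if anchor is not None else []) + list(recent)
        schedule.append((index, steps, context))
        if anchor is None:
            anchor = index
        else:
            recent.append(index)
    return schedule


plan = stream_schedule(5)
```

Because the context never grows beyond the anchor plus the recent window, the per-chunk cost stays constant however long the video gets, which is consistent with the stable inference speed the abstract attributes to a fixed-size cache.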