Table of Contents
cs.CL
[1] Uncovering Competency Gaps in Large Language Models and Their Benchmarks
Matyas Bohacek,Nino Scherrer,Nicholas Dufour,Thomas Leung,Christoph Bregler,Stephanie C. Y. Chan
Main category: cs.CL
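TL;DR: Proposes an automated method based on sparse autoencoders (SAEs) that uses a model's internal representations to expose concept-level gaps in large language models and in benchmarks, complementing conventional aggregate metrics with finer-grained evaluation.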
Details
Motivation: Aggregate benchmark metrics can hide weaknesses in specific sub-areas of a model (model gaps) and uneven coverage in the benchmarks themselves (benchmark gaps), so a finer-grained evaluation grounded in the model's internal representations is needed.
Method: Extract concept activations with sparse autoencoders (SAEs), combine them with saliency-weighted performance scores across data from multiple benchmarks, identify concept-level gaps in models and benchmarks, and compare across benchmarks.
Result: Validated on two open-source models and ten benchmarks. The models underperformed on concepts such as refusing requests, asserting boundaries, and safety; several benchmarks over-represented obedience-related concepts while missing core content within their intended scope.
Conclusion: The method offers a representation-grounded evaluation paradigm that decomposes benchmark scores at the concept level. It helps explain why a model scored as it did and guides benchmark improvement, complementing rather than replacing conventional metrics.
Abstract: The evaluation of large language models (LLMs) relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics for a given capability, but those aggregated metrics can obscure (i) particular sub-areas where the LLMs are weak ("model gaps") and (ii) imbalanced coverage in the benchmarks themselves ("benchmark gaps"). We propose a new method that uses sparse autoencoders (SAEs) to automatically uncover both types of gaps. By extracting SAE concept activations and computing saliency-weighted performance scores across benchmark data, the method grounds evaluation in the model's internal representations and enables comparison across benchmarks. As examples demonstrating our approach, we applied the method to two popular open-source models and ten benchmarks. We found that these models consistently underperformed on concepts that stand in contrast to sycophantic behaviors (e.g., politely refusing a request or asserting boundaries) and concepts connected to safety discussions. These model gaps align with observations previously surfaced in the literature; our automated, unsupervised method was able to recover them without manual supervision. We also observed benchmark gaps: many of the evaluated benchmarks over-represented concepts related to obedience, authority, or instruction-following, while missing core concepts that should fall within their intended scope. In sum, our method offers a representation-grounded approach to evaluation, enabling concept-level decomposition of benchmark scores.
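As a rough illustration of what a saliency-weighted, concept-level score could look like, the sketch below computes an activation-weighted mean of per-example correctness for each concept. This is an assumption about the general shape of such a score, not the authors' formula: the function name `concept_scores`, the toy activation matrix, and the choice to return `None` for a concept that never fires are all hypothetical.

```python
def concept_scores(activations, correct):
    """Activation-weighted mean correctness per concept (illustrative only).

    activations: one row per benchmark example; each row holds non-negative
                 concept activations (stand-ins for SAE feature activations).
    correct:     per-example scores in [0, 1].
    """
    n_concepts = len(activations[0])
    scores = []
    for c in range(n_concepts):
        weight = sum(row[c] for row in activations)
        if weight == 0:
            # The concept never fires on this benchmark: no score can be
            # computed, which is one way a coverage gap could show up.
            scores.append(None)
            continue
        weighted = sum(row[c] * s for row, s in zip(activations, correct))
        scores.append(weighted / weight)
    return scores


# Toy data: 3 examples, 3 concepts (all values are made up).
activations = [
    [1.0, 0.0, 0.0],
    [3.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
]
correct = [1.0, 0.0, 1.0]
scores = concept_scores(activations, correct)
print(scores)
```

In this toy setup a low score for a concept that fires often would hint at a model gap, while a concept with zero total activation would hint at a benchmark gap.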
Rather than replacing conventional aggregated metrics, CG complements them by providing a concept-level decomposition that can reveal why a model scored as it did and how benchmarks could evolve to better reflect their intended scope. Code is available at https://competency-gaps.github.io.
[2] SA-DiffuSeq: Addressing Computational and Scalability Challenges in Long-Document Generation with Sparse Attention
Alexandros Christoforos,Chadbourne Davis
Main category: cs.CL
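TL;DR: SA-DiffuSeq is a diffusion framework with sparse attention that aims to make long-text generation more efficient and scalable. Selective attention allocation and a soft absorbing state reduce computational complexity while preserving generation quality.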
Details
Motivation: Diffusion models for long-text generation suffer from high computational cost and memory overhead and scale poorly as sequences grow, so a more efficient modeling approach is needed.
Method: SA-DiffuSeq combines sparse attention with the diffusion process, introduces a soft absorbing state to stabilize diffusion trajectories, and improves sequence reconstruction efficiency and long-range dependency modeling.
Result: Experiments show that SA-DiffuSeq outperforms existing diffusion models in training efficiency and sampling speed, especially on long-sequence tasks.
Conclusion: Introducing structured sparsity into diffusion models is an effective path toward efficient and expressive long-text generation.
Abstract: Diffusion-based approaches to long-form text generation suffer from prohibitive computational cost and memory overhead as sequence length increases. We introduce SA-DiffuSeq, a diffusion framework that integrates sparse attention to fundamentally improve scalability for long-document modeling. By selectively allocating attention within the diffusion process, SA-DiffuSeq significantly reduces computational complexity while maintaining semantic coherence and generation quality. A key component of our method is a soft absorbing state tailored to sparse attention dynamics, which stabilizes diffusion trajectories and accelerates sequence reconstruction. This design improves sampling efficiency and enhances precision in long-range dependency modeling. Extensive experiments demonstrate that SA-DiffuSeq consistently surpasses state-of-the-art diffusion baselines in both training efficiency and sampling speed, with especially strong gains on extended sequences. These properties make SA-DiffuSeq well suited for demanding long-form applications such as scientific writing, large-scale code generation, and multi-turn long-context dialogue. Overall, our results indicate that incorporating structured sparsity into diffusion models is a promising direction for efficient and expressive long-text generation.
[3] TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior
Gül Sena Altıntaş,Malikeh Ehghaghi,Brian Lester,Fengyuan Liu,Wanru Zhao,Marco Ciccone,Colin Raffel
Main category: cs.CL
TL;DR: Presents TokSuite, a collection of models and a benchmark for studying how tokenizers affect language models.
Details
Motivation: The role of tokenization in language model performance and behavior is poorly understood because its effect is hard to measure in isolation, so a systematic study is needed.
Method: Train fourteen models that use different tokenizers but are otherwise identical, and build a new benchmark that measures model performance under real-world perturbations.
Result: TokSuite robustly decouples the influence of the tokenizer and reveals the strengths and weaknesses of a range of popular tokenizers.
Conclusion: TokSuite supports a deeper understanding of how tokenization affects language models and provides empirical evidence for choosing and designing tokenizers.
Abstract: Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood due to the challenge of measuring the impact of tokenization in isolation. To address this need, we present TokSuite, a collection of models and a benchmark that supports research into tokenization's influence on LMs. Specifically, we train fourteen models that use different tokenizers but are otherwise identical, using the same architecture, dataset, training budget, and initialization. Additionally, we curate and release a new benchmark that specifically measures model performance subject to real-world perturbations that are likely to influence tokenization. Together, TokSuite allows robust decoupling of the influence of a model's tokenizer, supporting a series of novel findings that elucidate the respective benefits and shortcomings of a wide range of popular tokenizers.
[4] Adversarial Training for Failure-Sensitive User Simulation in Mental Health Dialogue Optimization
Ziyi Zhu,Olivier Tieleman,Caitlin A. Stamatis,Luka Smyth,Thomas D. Hull,Daniel R. Cahn,Matteo Malgaroli
Main category: cs.CL
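TL;DR: Proposes an adversarially trained user simulator framework for evaluating task-oriented dialogue systems in mental health support chatbots, substantially improving simulation realism and the ability to surface failure modes.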
Details
Motivation: Existing user simulators struggle to replicate human behavior accurately and are especially weak at exposing system flaws, so more realistic simulation is needed to evaluate task-oriented dialogue systems effectively.
Method: An adversarial training framework iteratively improves user simulator realism through competition between a generator (the user simulator) and a discriminator; the simulator is fine-tuned and evaluated in a mental health support chatbot setting.
Result: Fine-tuned simulators significantly outperform zero-shot base models at surfacing system issues. Adversarial training improves diversity, distributional alignment, and predictive validity. Simulated and real failure rates correlate strongly across chatbot configurations, with low divergence in the distribution of failure modes. Discriminator accuracy drops sharply after three adversarial iterations, suggesting improved realism.
Conclusion: Adversarial training is a promising way to build more realistic, reliable, and efficient user simulators for mental health support, enabling fast, low-cost system evaluation before deployment.
Abstract: Realistic user simulation is crucial for training and evaluating task-oriented dialogue (TOD) systems, yet creating simulators that accurately replicate human behavior remains challenging. A key property of effective simulators is their ability to expose failure modes of the systems they evaluate. We present an adversarial training framework that iteratively improves user simulator realism through a competitive dynamic between a generator (user simulator) and a discriminator. Applied to mental health support chatbots, our approach demonstrates that fine-tuned simulators dramatically outperform zero-shot base models at surfacing system issues, and adversarial training further enhances diversity, distributional alignment, and predictive validity. The resulting simulator achieves a strong correlation between simulated and real failure occurrence rates across diverse chatbot configurations while maintaining low distributional divergence of failure modes. Discriminator accuracy decreases drastically after three adversarial iterations, suggesting improved realism. These results provide evidence that adversarial training is a promising approach for creating realistic user simulators in mental health support TOD domains, enabling rapid, reliable, and cost-effective system evaluation before deployment.
[5] Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles
Ramatu Oiza Abdulsalam,Segun Aroyehun
Main category: cs.CL
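TL;DR: By comparing responses from expert tutors, novice tutors, and large language models in math tutoring, the study finds that large language models approach expert levels of perceived pedagogical quality but differ systematically in instructional strategies and linguistic characteristics.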
Details
Motivation: Examine how closely the instructional behavior of large language models in math tutoring matches that of expert human tutors.
Method: A controlled, turn-level comparison in which expert tutors, novice tutors, and multiple large language models respond to the same math remediation conversation turns, followed by analysis of instructional strategies and linguistic characteristics.
Result: Large language models approach expert levels of perceived pedagogical quality but use restating and revoicing less, and produce longer, more lexically diverse, and more polite responses. Statistical analyses show that restating and revoicing, lexical diversity, and pressing for accuracy are positively associated with perceived pedagogical quality, whereas higher levels of agentic and polite language are negatively associated.
Conclusion: Although large language models reach expert-like perceived pedagogical quality, they rely on different instructional and linguistic strategies than human experts, which underlines the need to analyze specific instructional behaviors and linguistic characteristics when evaluating tutoring systems.
Abstract: Recent work has explored the use of large language models for generating tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice. We examine this question using a controlled, turn-level comparison in which expert human tutors, novice human tutors, and multiple large language models respond to the same set of math remediation conversation turns. We examine both instructional strategies and linguistic characteristics of tutoring responses, including restating and revoicing, pressing for accuracy, lexical diversity, readability, politeness, and agency. We find that large language models approach expert levels of perceived pedagogical quality on average but exhibit systematic differences in their instructional and linguistic profiles. In particular, large language models tend to underuse restating and revoicing strategies characteristic of expert human tutors, while producing longer, more lexically diverse, and more polite responses. Statistical analyses show that restating and revoicing, lexical diversity, and pressing for accuracy are positively associated with perceived pedagogical quality, whereas higher levels of agentic and polite language are negatively associated. Overall, recent large language models exhibit levels of perceived pedagogical quality comparable to expert human tutors, while relying on different instructional and linguistic strategies.
These findings underscore the value of analyzing instructional strategies and linguistic characteristics when evaluating tutoring responses across human tutors and intelligent tutoring systems.
[6] Investigating Model Editing for Unlearning in Large Language Models
Shariqah Hossain,Lalana Kagal
Main category: cs.CL
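TL;DR: Explores applying model editing algorithms (ROME, IKE, and WISE) to machine unlearning. In some settings they outperform traditional unlearning methods, but controlling the scope of forgetting while preserving model performance remains difficult.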
Details
Motivation: Existing machine unlearning methods are inefficient for LLMs with many parameters, or cannot fully remove the target information without affecting knowledge that should be retained. The paper asks whether model editing algorithms can unlearn more effectively.
Method: Apply the model editing algorithms ROME, IKE, and WISE with new editing targets designed for the unlearning setting, and evaluate forgetting quality and retention of knowledge.
Result: In some settings, model editing approaches exceed baseline unlearning methods in quality of forgetting, but they still struggle to delimit the scope of what is forgotten and can harm overall model performance.
Conclusion: Model editing algorithms show potential for machine unlearning, but precise control over the scope of forgetting and protection of unrelated knowledge need further work.
Abstract: Machine unlearning aims to remove unwanted information from a model, but many methods are inefficient for LLMs with large numbers of parameters or fail to fully remove the intended information without degrading performance on knowledge that should be retained. Model editing algorithms solve a similar problem of changing information in models, but they focus on redirecting inputs to a new target rather than removing that information altogether. In this work, we explore the editing algorithms ROME, IKE, and WISE and design new editing targets for an unlearning setting. Through this investigation, we show that model editing approaches can exceed baseline unlearning methods in terms of quality of forgetting depending on the setting. Like traditional unlearning techniques, they struggle to encapsulate the scope of what is to be unlearned without damage to the overall model performance.
[7] Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?
Zhengyang Shan,Aaron Mueller
Main category: cs.CL
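TL;DR: Sparse autoencoder feature ablations can reduce bias in language models without harming demographic recognition, indicating that demographic bias arises from task-specific mechanisms rather than absolute demographic markers.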
Details
Motivation: Investigate whether demographic bias mechanisms in language models are independent of general demographic recognition, and look for effective debiasing methods.
Method: A multi-task evaluation setup combined with attribution-based and correlation-based methods for locating bias features, with sparse autoencoder feature ablation experiments on Gemma-2-9B.
Result: Attribution-based ablations reduce race and gender profession stereotypes while preserving name recognition accuracy; correlation-based ablations are more effective for education bias. Removing attribution features in education tasks induces "prior collapse", which increases overall bias.
Conclusion: Demographic bias arises from task-specific mechanisms, and mechanistic inference-time interventions can debias precisely without harming core model capabilities.
Abstract: We investigate how independent demographic bias mechanisms are from general demographic recognition in language models. Using a multi-task evaluation setup where demographics are associated with names, professions, and education levels, we measure whether models can be debiased while preserving demographic detection capabilities. We compare attribution-based and correlation-based methods for locating bias features. We find that targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading recognition performance: attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy, whereas correlation-based ablations are more effective for education bias. Qualitative analysis further reveals that removing attribution features in education tasks induces "prior collapse", thus increasing overall bias. This highlights the need for dimension-specific interventions. Overall, our results show that demographic bias arises from task-specific mechanisms rather than absolute demographic markers, and that mechanistic inference-time interventions can enable surgical debiasing without compromising core model capabilities.
[8] Semantic Deception: When Reasoning Models Can't Compute an Addition
Nathaniël de Leeuw,Marceau Nahon,Mathis Reymond,Raja Chatila,Mehdi Khamassi
Main category: cs.CL
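TL;DR: Studies the reasoning ability of large language models (LLMs) over novel symbolic representations, introducing "semantic deception" to test whether they can maintain symbolic abstraction. LLMs prove susceptible to surface-level semantics, which exposes limits in symbolic manipulation and raises ethical and societal concerns about attributing genuine reasoning ability to them.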
Details
Motivation: Examine whether LLMs have genuine symbolic abstraction and reasoning ability rather than relying on statistical associations in their training data, particularly to avoid misjudgment in decision-making settings where human values are at stake.
Method: An experimental framework that redefines digits and operators as new symbols, constructs semantically deceptive situations, and tests four LLMs on simple calculations written in unambiguous but misleading notation.
Result: Semantic cues significantly degrade LLM performance on simple tasks. Even when models appear to follow instructions, surface semantics still interfere, and chain-of-thought may amplify reliance on statistical correlations.
Conclusion: Current LLMs have fundamental limits in symbolic reasoning and over-rely on surface semantics. Genuine reasoning ability should not be attributed to them lightly, and this poses risks in decision-making contexts that need robust reasoning.
Abstract: Large language models (LLMs) are increasingly used in situations where human values are at stake, such as decision-making tasks that involve reasoning when performed by humans. We investigate the so-called reasoning capabilities of LLMs over novel symbolic representations by introducing an experimental framework that tests their ability to process and manipulate unfamiliar symbols. We introduce semantic deceptions: situations in which symbols carry misleading semantic associations due to their form, such as being embedded in specific contexts, designed to probe whether LLMs can maintain symbolic abstraction or whether they default to exploiting learned semantic associations. We redefine standard digits and mathematical operators using novel symbols, and task LLMs with solving simple calculations expressed in this altered notation. The objective is: (1) to assess LLMs' capacity for abstraction and manipulation of arbitrary symbol systems; (2) to evaluate their ability to resist misleading semantic cues that conflict with the task's symbolic logic. Through experiments with four LLMs we show that semantic cues can significantly deteriorate reasoning models' performance on very simple tasks. They reveal limitations in current LLMs' ability for symbolic manipulations and highlight a tendency to over-rely on surface-level semantics, suggesting that chain-of-thought may amplify reliance on statistical correlations. Even in situations where LLMs seem to correctly follow instructions, semantic cues still impact basic capabilities.
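To make the idea of rewriting a calculation in altered notation concrete, here is a minimal sketch. Both symbol tables are hypothetical and are not the paper's notation: `NEUTRAL` maps digits and `+` to letters with no arithmetic meaning, while `DECEPTIVE` reuses familiar glyphs with shifted meanings, as one possible way a symbol's form could carry a misleading association.

```python
DIGITS = "0123456789"

# Hypothetical symbol tables (assumptions for illustration only).
NEUTRAL = dict(zip(DIGITS + "+", "abcdefghij" + "k"))
# Each digit is written with the glyph of the next digit, and addition is
# written with the minus sign, so surface reading suggests a different problem.
DECEPTIVE = dict(zip(DIGITS + "+", "1234567890" + "-"))


def encode(expr, mapping):
    """Rewrite an expression character by character using the mapping."""
    return "".join(mapping.get(ch, ch) for ch in expr)


def decode(expr, mapping):
    """Invert encode(); assumes the mapping is one-to-one."""
    inverse = {v: k for k, v in mapping.items()}
    return "".join(inverse.get(ch, ch) for ch in expr)


problem = "12+7"
neutral_prompt = encode(problem, NEUTRAL)      # letters only
deceptive_prompt = encode(problem, DECEPTIVE)  # looks like a subtraction
expected_answer = encode(str(12 + 7), DECEPTIVE)

print(neutral_prompt, deceptive_prompt, expected_answer)
```

Under the deceptive table the prompt reads like an ordinary subtraction, so a solver that follows the glyphs rather than the stated definitions would compute a different value from the one the symbol system requires.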
These limitations raise ethical and societal concerns, undermining the widespread and pernicious tendency to attribute reasoning abilities to LLMs and suggesting how LLMs might fail, in particular in decision-making contexts where robust symbolic reasoning is essential and should not be compromised by residual semantic associations inherited from the model's training.
[9] EssayCBM: Rubric-Aligned Concept Bottleneck Models for Transparent Essay Grading
Kumar Satvik Chaudhary,Chengshuai Zhao,Fan Zhang,Yung Hin Tse,Garima Agrawal,Yuli Deng,Huan Liu
Main category: cs.CL
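TL;DR: Presents EssayCBM, an interpretable automated essay grading framework that produces a grade from eight evaluated writing concepts and supports instructor intervention and real-time feedback.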
Details
Motivation: Most existing automated grading systems are black-box models that lack transparency and offer little actionable feedback, which limits their use in education.
Method: An encoder-based multi-task architecture with separate prediction heads for eight writing concepts (such as thesis clarity and evidence use); a lightweight network then aggregates the concept scores into the final grade.
Result: EssayCBM matches black-box grading performance while providing concept-level interpretable output, and lets instructors adjust concept scores to change the final grade, enabling human-in-the-loop evaluation.
Conclusion: EssayCBM improves the transparency and intervenability of grading systems without sacrificing performance, supporting the adoption of interpretable AI in education.
Abstract: Understanding how automated grading systems evaluate essays remains a significant challenge for educators and students, especially when large language models function as black boxes. We introduce EssayCBM, a rubric-aligned framework that prioritizes interpretability in essay assessment. Instead of predicting grades directly from text, EssayCBM evaluates eight writing concepts, such as Thesis Clarity and Evidence Use, through dedicated prediction heads on an encoder. These concept scores form a transparent bottleneck, and a lightweight network computes the final grade using only concepts. Instructors can adjust concept predictions and instantly view the updated grade, enabling accountable human-in-the-loop evaluation. EssayCBM matches black-box performance while offering actionable, concept-level feedback through an intuitive web interface.
[10] MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs
Zhan Qu,Michael Färber
Main category: cs.CL
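TL;DR: Presents MediEval, a benchmark for medical large language models that combines real electronic health records with a unified knowledge base, and proposes the CoRFu fine-tuning method to improve accuracy and safety in medical reasoning.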
Details
Motivation: Existing evaluations of medical LLMs cannot measure factual accuracy and consistency with the clinical context at the same time, and lack a systematic assessment of reliability and safety.
Method: Build the MediEval benchmark by linking MIMIC-IV electronic health records to knowledge sources such as UMLS, and evaluate knowledge grounding and contextual consistency in a four-quadrant framework. Propose CoRFu, a DPO-based fine-tuning method with an asymmetric penalty that mitigates hallucination and truth inversion.
Result: MediEval shows that current LLMs frequently exhibit failure modes such as hallucinated support and truth inversion. CoRFu improves macro-F1 by 16.4 points over the base model and eliminates truth inversion errors.
Conclusion: An evaluation framework that jointly checks knowledge and contextual consistency is essential for medical LLMs, and CoRFu offers an effective path to safer models.
Abstract: Large Language Models (LLMs) are increasingly applied to medicine, yet their adoption is limited by concerns over reliability and safety. Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. We introduce MediEval, a benchmark that links MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies. MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework that jointly considers knowledge grounding and contextual consistency. Using this framework, we identify critical failure modes, including hallucinated support and truth inversion, that current proprietary, open-source, and domain-specific LLMs frequently exhibit. To address these risks, we propose Counterfactual Risk-Aware Fine-tuning (CoRFu), a DPO-based method with an asymmetric penalty targeting unsafe confusions. CoRFu improves by +16.4 macro-F1 points over the base model and eliminates truth inversion errors, demonstrating both higher accuracy and substantially greater safety.
[11] Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning
NVIDIA,:,Aaron Blakeman,Aaron Grattafiori,Aarti Basant,Abhibha Gupta,Abhinav Khattar,Adi Renduchintala,Aditya Vavre,Akanksha Shukla,Akhiad Bercovich,Aleksander Ficek,Aleksandr Shaposhnikov,Alex Kondratenko,Alexander Bukharin,Alexandre Milesi,Ali Taghibakhshi,Alisa Liu,Amelia Barton,Ameya Sunil Mahabaleshwarkar,Amir Klein,Amit Zuker,Amnon Geifman,Amy Shen,Anahita Bhiwandiwalla,Andrew Tao,Ann Guan,Anubhav Mandarwal,Arham Mehta,Ashwath Aithal,Ashwin Poojary,Asif Ahamed,Asma Kuriparambil Thekkumpate,Ayush Dattagupta,Banghua Zhu,Bardiya Sadeghi,Barnaby Simkin,Ben Lanir,Benedikt Schifferer,Besmira Nushi,Bilal Kartal,Bita Darvish Rouhani,Boris Ginsburg,Brandon Norick,Brandon Soubasis,Branislav Kisacanin,Brian Yu,Bryan Catanzaro,Carlo del Mundo,Chantal Hwang,Charles Wang,Cheng-Ping Hsieh,Chenghao Zhang,Chenhan Yu,Chetan Mungekar,Chintan Patel,Chris Alexiuk,Christopher Parisien,Collin Neale,Damon Mosk-Aoyama,Dan Su,Dane Corneil,Daniel Afrimi,Daniel Rohrer,Daniel Serebrenik,Daria Gitman,Daria Levy,Darko Stosic,David Mosallanezhad,Deepak Narayanan,Dhruv Nathawani,Dima Rekesh,Dina Yared,Divyanshu Kakwani,Dong Ahn,Duncan Riach,Dusan Stosic,Edgar Minasyan,Edward Lin,Eileen Long,Eileen Peters Long,Elena Lantz,Ellie Evans,Elliott Ning,Eric Chung,Eric Harper,Eric Tramel,Erick Galinkin,Erik Pounds,Evan Briones,Evelina Bakhturina,Faisal Ladhak,Fay Wang,Fei Jia,Felipe Soares,Feng Chen,Ferenc Galko,Frankie Siino,Gal Hubara Agam,Ganesh Ajjanagadde,Gantavya Bhatt,Gargi Prasad,George Armstrong,Gerald Shen,Gorkem Batmaz,Grigor Nalbandyan,Haifeng Qian,Harsh Sharma,Hayley Ross,Helen Ngo,Herman Sahota,Hexin Wang,Himanshu Soni,Hiren Upadhyay,Huizi Mao,Huy C Nguyen,Huy Q Nguyen,Iain Cunningham,Ido Shahaf,Igor Gitman,Ilya Loshchilov,Ivan Moshkov,Izzy Putterman,Jan Kautz,Jane Polak Scowcroft,Jared Casper,Jatin Mitra,Jeffrey Glick,Jenny Chen,Jesse Oliver,Jian Zhang,Jiaqi Zeng,Jie Lou,Jimmy Zhang,Jining Huang,Joey Conway,Joey Guman,John Kamalu,Johnny Greco,Jonathan Cohen,Joseph Jennings,Joyjit 
Daw,Julien Veron Vialard,Junkeun Yi,Jupinder Parmar,Kai Xu,Kan Zhu,Kari Briski,Katherine Cheung,Katherine Luna,Keshav Santhanam,Kevin Shih,Kezhi Kong,Khushi Bhardwaj,Krishna C. Puvvada,Krzysztof Pawelec,Kumar Anik,Lawrence McAfee,Laya Sleiman,Leon Derczynski,Li Ding,Lucas Liebenwein,Luis Vega,Maanu Grover,Maarten Van Segbroeck,Maer Rodrigues de Melo,Makesh Narsimhan Sreedhar,Manoj Kilaru,Maor Ashkenazi,Marc Romeijn,Mark Cai,Markus Kliegl,Maryam Moosaei,Matvei Novikov,Mehrzad Samadi,Melissa Corpuz,Mengru Wang,Meredith Price,Michael Boone,Michael Evans,Miguel Martinez,Mike Chrzanowski,Mohammad Shoeybi,Mostofa Patwary,Nabin Mulepati,Natalie Hereth,Nave Assaf,Negar Habibi,Neta Zmora,Netanel Haber,Nicola Sessions,Nidhi Bhatia,Nikhil Jukar,Nikki Pope,Nikolai Ludwig,Nima Tajbakhsh,Nirmal Juluru,Oleksii Hrinchuk,Oleksii Kuchaiev,Olivier Delalleau,Oluwatobi Olabiyi,Omer Ullman Argov,Ouye Xie,Parth Chadha,Pasha Shamis,Pavlo Molchanov,Pawel Morkisz,Peter Dykas,Peter Jin,Pinky Xu,Piotr Januszewski,Pranav Prashant Thombre,Prasoon Varshney,Pritam Gundecha,Qing Miao,Rabeeh Karimi Mahabadi,Ran El-Yaniv,Ran Zilberstein,Rasoul Shafipour,Rich Harang,Rick Izzo,Rima Shahbazyan,Rishabh Garg,Ritika Borkar,Ritu Gala,Riyad Islam,Roger Waleffe,Rohit Watve,Roi Koren,Ruoxi Zhang,Russell J. 
Hewett,Ryan Prenger,Ryan Timbrook,Sadegh Mahdavi,Sahil Modi,Samuel Kriman,Sanjay Kariyappa,Sanjeev Satheesh,Saori Kaji,Satish Pasumarthi,Sean Narentharen,Sean Narenthiran,Seonmyeong Bak,Sergey Kashirsky,Seth Poulos,Shahar Mor,Shanmugam Ramasamy,Shantanu Acharya,Shaona Ghosh,Sharath Turuvekere Sreenivas,Shelby Thomas,Shiqing Fan,Shreya Gopal,Shrimai Prabhumoye,Shubham Pachori,Shubham Toshniwal,Shuoyang Ding,Siddharth Singh,Simeng Sun,Smita Ithape,Somshubra Majumdar,Soumye Singhal,Stefania Alborghetti,Stephen Ge,Sugam Dipak Devare,Sumeet Kumar Barua,Suseella Panguluri,Suyog Gupta,Sweta Priyadarshi,Syeda Nahida Akter,Tan Bui,Teodor-Dumitru Ene,Terry Kong,Thanh Do,Tijmen Blankevoort,Tom Balough,Tomer Asida,Tomer Bar Natan,Tugrul Konuk,Twinkle Vashishth,Udi Karpas,Ushnish De,Vahid Noorozi,Vahid Noroozi,Venkat Srinivasan,Venmugil Elango,Vijay Korthikanti,Vitaly Kurin,Vitaly Lavrukhin,Wanli Jiang,Wasi Uddin Ahmad,Wei Du,Wei Ping,Wenfei Zhou,Will Jennings,William Zhang,Wojciech Prazuch,Xiaowei Ren,Yashaswi Karnati,Yejin Choi,Yev Meyer,Yi-Fu Wu,Yian Zhang,Ying Lin,Yonatan Geifman,Yonggan Fu,Yoshi Subara,Yoshi Suhara,Yubo Gao,Zach Moshe,Zhen Dong,Zihan Liu,Zijia Chen,Zijie Yan
Main category: cs.CL
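TL;DR: Nemotron 3 Nano 30B-A3B is a Mixture-of-Experts hybrid Mamba-Transformer language model pretrained on 25 trillion tokens. It outperforms the previous generation while activating less than half of the parameters, delivers up to 3.3x higher inference throughput, and supports contexts of up to 1M tokens.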
Details
Motivation: Develop a more efficient, higher-performing language model that improves inference speed and accuracy while activating fewer parameters, and that surpasses existing open models in long-context, reasoning, and agentic abilities.
Method: A Mixture-of-Experts architecture combining Mamba and Transformer layers, pretrained on 25 trillion tokens (including more than 3 trillion new unique tokens), followed by supervised fine-tuning and large-scale reinforcement learning.
Result: More accurate than Nemotron 2 Nano while activating less than half of the parameters per forward pass; up to 3.3x higher inference throughput than similarly sized open models; more accurate on popular benchmarks; supports contexts of up to 1M tokens with stronger reasoning, agentic, and chat abilities.
Conclusion: Nemotron 3 Nano 30B-A3B improves efficiency, performance, and functionality, and suits complex tasks and long-context scenarios. The checkpoints are released on Hugging Face.
Abstract: We present Nemotron 3 Nano 30B-A3B, a Mixture-of-Experts hybrid Mamba-Transformer language model. Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2, followed by supervised fine-tuning and large-scale RL on diverse environments. Nemotron 3 Nano achieves better accuracy than our previous generation Nemotron 2 Nano while activating less than half of the parameters per forward pass. It achieves up to 3.3x higher inference throughput than similarly-sized open models like GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507, while also being more accurate on popular benchmarks. Nemotron 3 Nano demonstrates enhanced agentic, reasoning, and chat abilities and supports context lengths up to 1M tokens. We release both our pretrained Nemotron 3 Nano 30B-A3B Base and post-trained Nemotron 3 Nano 30B-A3B checkpoints on Hugging Face.
[12] How important is Recall for Measuring Retrieval Quality?
Shelly Schwartz,Oleg Vasilyev,Randy Sawaya
Main category: cs.CL
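TL;DR: Proposes a retrieval quality measure that does not require knowing the total number of relevant documents, and evaluates it alongside existing strategies by correlating retrieval metrics with LLM judgments of response quality.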
Details
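Motivation: In realistic settings the knowledge base is large and evolving, so the total number of documents relevant to a query is usually unknown and recall cannot be computed.
Method: Experiments across multiple datasets compare existing strategies with the proposed simple retrieval quality measure. An LLM judges the quality of responses generated from the retrieved documents, and the correlation between those judgments and each retrieval metric is analyzed.
Result: The proposed measure performs well in settings with relatively few relevant documents (2-15) and does not depend on the total number of relevant documents.
Conclusion: The method offers a practical way to evaluate retrieval quality when the total number of relevant documents is unknown.
Abstract: In realistic retrieval settings with large and evolving knowledge bases, the total number of documents relevant to a query is typically unknown, and recall cannot be computed. In this paper, we evaluate several established strategies for handling this limitation by measuring the correlation between retrieval quality metrics and LLM-based judgments of response quality, where responses are generated from the retrieved documents. We conduct experiments across multiple datasets with a relatively low number of relevant documents (2-15). We also introduce a simple retrieval quality measure that performs well without requiring knowledge of the total number of relevant documents.
[13] NVIDIA Nemotron 3: Efficient and Open Intelligence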
NVIDIA,:,Aaron Blakeman,Aaron Grattafiori,Aarti Basant,Abhibha Gupta,Abhinav Khattar,Adi Renduchintala,Aditya Vavre,Akanksha Shukla,Akhiad Bercovich,Aleksander Ficek,Aleksandr Shaposhnikov,Alex Kondratenko,Alexander Bukharin,Alexandre Milesi,Ali Taghibakhshi,Alisa Liu,Amelia Barton,Ameya Sunil Mahabaleshwarkar,Amir Klein,Amit Zuker,Amnon Geifman,Amy Shen,Anahita Bhiwandiwalla,Andrew Tao,Anjulie Agrusa,Ankur Verma,Ann Guan,Anubhav Mandarwal,Arham Mehta,Ashwath Aithal,Ashwin Poojary,Asif Ahamed,Asit Mishra,Asma Kuriparambil Thekkumpate,Ayush Dattagupta,Banghua Zhu,Bardiya Sadeghi,Barnaby Simkin,Ben Lanir,Benedikt Schifferer,Besmira Nushi,Bilal Kartal,Bita Darvish Rouhani,Boris Ginsburg,Brandon Norick,Brandon Soubasis,Branislav Kisacanin,Brian Yu,Bryan Catanzaro,Carlo del Mundo,Chantal Hwang,Charles Wang,Cheng-Ping Hsieh,Chenghao Zhang,Chenhan Yu,Chetan Mungekar,Chintan Patel,Chris Alexiuk,Christopher Parisien,Collin Neale,Cyril Meurillon,Damon Mosk-Aoyama,Dan Su,Dane Corneil,Daniel Afrimi,Daniel Lo,Daniel Rohrer,Daniel Serebrenik,Daria Gitman,Daria Levy,Darko Stosic,David Mosallanezhad,Deepak Narayanan,Dhruv Nathawani,Dima Rekesh,Dina Yared,Divyanshu Kakwani,Dong Ahn,Duncan Riach,Dusan Stosic,Edgar Minasyan,Edward Lin,Eileen Long,Eileen Peters Long,Elad Segal,Elena Lantz,Ellie Evans,Elliott Ning,Eric Chung,Eric Harper,Eric Tramel,Erick Galinkin,Erik Pounds,Evan Briones,Evelina Bakhturina,Evgeny Tsykunov,Faisal Ladhak,Fay Wang,Fei Jia,Felipe Soares,Feng Chen,Ferenc Galko,Frank Sun,Frankie Siino,Gal Hubara Agam,Ganesh Ajjanagadde,Gantavya Bhatt,Gargi Prasad,George Armstrong,Gerald Shen,Gorkem Batmaz,Grigor Nalbandyan,Haifeng Qian,Harsh Sharma,Hayley Ross,Helen Ngo,Herbert Hum,Herman Sahota,Hexin Wang,Himanshu Soni,Hiren Upadhyay,Huizi Mao,Huy C Nguyen,Huy Q Nguyen,Iain Cunningham,Ido Galil,Ido Shahaf,Igor Gitman,Ilya Loshchilov,Itamar Schen,Itay Levy,Ivan Moshkov,Izik Golan,Izzy Putterman,Jan Kautz,Jane Polak Scowcroft,Jared Casper,Jatin Mitra,Jeffrey Glick,Jenny 
Chen,Jesse Oliver,Jian Zhang,Jiaqi Zeng,Jie Lou,Jimmy Zhang,Jinhang Choi,Jining Huang,Joey Conway,Joey Guman,John Kamalu,Johnny Greco,Jonathan Cohen,Joseph Jennings,Joyjit Daw,Julien Veron Vialard,Junkeun Yi,Jupinder Parmar,Kai Xu,Kan Zhu,Kari Briski,Katherine Cheung,Katherine Luna,Keith Wyss,Keshav Santhanam,Kevin Shih,Kezhi Kong,Khushi Bhardwaj,Kirthi Shankar,Krishna C. Puvvada,Krzysztof Pawelec,Kumar Anik,Lawrence McAfee,Laya Sleiman,Leon Derczynski,Li Ding,Lizzie Wei,Lucas Liebenwein,Luis Vega,Maanu Grover,Maarten Van Segbroeck,Maer Rodrigues de Melo,Mahdi Nazemi,Makesh Narsimhan Sreedhar,Manoj Kilaru,Maor Ashkenazi,Marc Romeijn,Marcin Chochowski,Mark Cai,Markus Kliegl,Maryam Moosaei,Matt Kulka,Matvei Novikov,Mehrzad Samadi,Melissa Corpuz,Mengru Wang,Meredith Price,Michael Andersch,Michael Boone,Michael Evans,Miguel Martinez,Mikail Khona,Mike Chrzanowski,Minseok Lee,Mohammad Dabbah,Mohammad Shoeybi,Mostofa Patwary,Nabin Mulepati,Najeeb Nabwani,Natalie Hereth,Nave Assaf,Negar Habibi,Neta Zmora,Netanel Haber,Nicola Sessions,Nidhi Bhatia,Nikhil Jukar,Nikki Pope,Nikolai Ludwig,Nima Tajbakhsh,Nir Ailon,Nirmal Juluru,Nishant Sharma,Oleksii Hrinchuk,Oleksii Kuchaiev,Olivier Delalleau,Oluwatobi Olabiyi,Omer Ullman Argov,Omri Puny,Oren Tropp,Ouye Xie,Parth Chadha,Pasha Shamis,Paul Gibbons,Pavlo Molchanov,Pawel Morkisz,Peter Dykas,Peter Jin,Pinky Xu,Piotr Januszewski,Pranav Prashant Thombre,Prasoon Varshney,Pritam Gundecha,Przemek Tredak,Qing Miao,Qiyu Wan,Rabeeh Karimi Mahabadi,Rachit Garg,Ran El-Yaniv,Ran Zilberstein,Rasoul Shafipour,Rich Harang,Rick Izzo,Rima Shahbazyan,Rishabh Garg,Ritika Borkar,Ritu Gala,Riyad Islam,Robert Hesse,Roger Waleffe,Rohit Watve,Roi Koren,Ruoxi Zhang,Russell Hewett,Russell J. 
Hewett,Ryan Prenger,Ryan Timbrook,Sadegh Mahdavi,Sahil Modi,Samuel Kriman,Sangkug Lim,Sanjay Kariyappa,Sanjeev Satheesh,Saori Kaji,Satish Pasumarthi,Saurav Muralidharan,Sean Narentharen,Sean Narenthiran,Seonmyeong Bak,Sergey Kashirsky,Seth Poulos,Shahar Mor,Shanmugam Ramasamy,Shantanu Acharya,Shaona Ghosh,Sharath Turuvekere Sreenivas,Shelby Thomas,Shiqing Fan,Shreya Gopal,Shrimai Prabhumoye,Shubham Pachori,Shubham Toshniwal,Shuoyang Ding,Siddharth Singh,Simeng Sun,Smita Ithape,Somshubra Majumdar,Soumye Singhal,Stas Sergienko,Stefania Alborghetti,Stephen Ge,Sugam Dipak Devare,Sumeet Kumar Barua,Suseella Panguluri,Suyog Gupta,Sweta Priyadarshi,Syeda Nahida Akter,Tan Bui,Teodor-Dumitru Ene,Terry Kong,Thanh Do,Tijmen Blankevoort,Tim Moon,Tom Balough,Tomer Asida,Tomer Bar Natan,Tomer Ronen,Tugrul Konuk,Twinkle Vashishth,Udi Karpas,Ushnish De,Vahid Noorozi,Vahid Noroozi,Venkat Srinivasan,Venmugil Elango,Victor Cui,Vijay Korthikanti,Vinay Rao,Vitaly Kurin,Vitaly Lavrukhin,Vladimir Anisimov,Wanli Jiang,Wasi Uddin Ahmad,Wei Du,Wei Ping,Wenfei Zhou,Will Jennings,William Zhang,Wojciech Prazuch,Xiaowei Ren,Yashaswi Karnati,Yejin Choi,Yev Meyer,Yi-Fu Wu,Yian Zhang,Yigong Qin,Ying Lin,Yonatan Geifman,Yonggan Fu,Yoshi Subara,Yoshi Suhara,Yubo Gao,Zach Moshe,Zhen Dong,Zhongbo Zhu,Zihan Liu,Zijia Chen,Zijie Yan
Main category: cs.CL
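TL;DR: The Nemotron 3 family comprises Nano, Super, and Ultra. The models use a hybrid Mamba-Transformer architecture, support contexts of up to 1M tokens, and offer strong reasoning, conversational, and agentic capabilities. Super and Ultra are trained with NVFP4, incorporate LatentMoE to improve quality, and include MTP layers for faster generation. All models are post-trained with multi-environment reinforcement learning and support multi-step tool use and reasoning budget control. Nano is efficient and accurate, Super targets collaborative agents, and Ultra leads in performance. Model weights and training resources will be released openly over time.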
Details
Motivation: Develop high-performance, efficient large language models that meet the needs of complex reasoning, long contexts, and practical applications such as IT automation, while advancing open research.
Method: A Mixture-of-Experts hybrid Mamba-Transformer architecture. Super and Ultra are trained with NVFP4, incorporate LatentMoE to improve model quality, and include MTP layers to speed up text generation. All models are post-trained with multi-environment reinforcement learning and support multi-step tool use and reasoning budget control.
Result: The Nemotron 3 family performs strongly on reasoning, conversational, and agentic tasks. Nano reaches high accuracy at low cost; Super suits high-volume collaborative workloads; Ultra delivers state-of-the-art accuracy and reasoning performance. The models support contexts of up to 1M tokens with high throughput.
Conclusion: Through its architecture and training methods, the Nemotron 3 family balances performance, efficiency, and scalability. The models and training resources will be released openly to advance open AI.
Abstract: We introduce the Nemotron 3 family of models: Nano, Super, and Ultra. These models deliver strong agentic, reasoning, and conversational capabilities. The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens. Super and Ultra models are trained with NVFP4 and incorporate LatentMoE, a novel approach that improves model quality. The two larger models also include MTP layers for faster text generation. All Nemotron 3 models are post-trained using multi-environment reinforcement learning, enabling reasoning and multi-step tool use, and they support granular reasoning budget control. Nano, the smallest model, outperforms comparable models in accuracy while remaining extremely cost-efficient for inference. Super is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Ultra, the largest model, provides state-of-the-art accuracy and reasoning performance. Nano is released together with its technical report and this white paper, while Super and Ultra will follow in the coming months. We will openly release the model weights, pre- and post-training software, recipes, and all data for which we hold redistribution rights.
[14] Architectural Trade-offs in Small Language Models Under Compute Constraints
Shivraj Singh Bhatti
Main category: cs.CL
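TL;DR: This study systematically examines how architectural choices and training budget affect the performance of small language models under strict compute constraints. It finds that attention-based models are more efficient per FLOP than MLPs at small scale, that increasing depth or context length without sufficient optimization can degrade performance, and that techniques successful in large language models, such as rotary positional embeddings (RoPE), do not necessarily carry over to small models.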
Details
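Motivation: Explore how small language models can reach the best performance under limited compute through different architectural designs and training strategies, filling a research gap on which techniques transfer effectively between large and small models. Method: Starting from a linear next-token predictor, the study progressively adds nonlinearities, self-attention, and multi-layer Transformer architectures, performs character-level and word-level modeling on Tiny Shakespeare, PTB, and WikiText-2, and evaluates each model by test negative log-likelihood (NLL), parameter count, and approximate training FLOPs. Result: Attention-based models show higher per-FLOP efficiency than MLPs at small scale; increasing depth or context length without matching optimization degrades performance; techniques that work in large models, such as RoPE, may be ineffective or even detrimental in small models. Conclusion: Efficient architecture design for small language models requires carefully balancing model complexity against the degree of optimization; porting large-model techniques does not always pay off, and model components should be designed and evaluated specifically for low-compute settings. Abstract: We present a systematic empirical study of small language models under strict compute constraints, analyzing how architectural choices and training budget interact to determine performance. Starting from a linear next-token predictor, we progressively introduce nonlinearities, self-attention, and multi-layer transformer architectures, evaluating each on character-level modeling of Tiny Shakespeare and word-level modeling of Penn Treebank (PTB) and WikiText-2. We compare models using test negative log-likelihood (NLL), parameter count, and approximate training FLOPs to characterize accuracy-efficiency trade-offs. Our results show that attention-based models dominate MLPs in per-FLOP efficiency even at small scale, while increasing depth or context without sufficient optimization can degrade performance. We further examine rotary positional embeddings (RoPE), finding that architectural techniques successful in large language models do not necessarily transfer to small-model regimes.
[15] Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation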
Kaiyuan Liu,Shaotian Yan,Rui Miao,Bing Wang,Chen Shen,Jun Zhang,Jieping Ye
Main category: cs.CL
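TL;DR: This paper proposes a cross-model provenance tracing framework for reasoning distillation that analyzes where a distilled model's test-time behavior comes from. It finds that the distilled model does generate teacher-originated behavior, and builds on this to propose a teacher-guided data selection method, improving both the interpretability and the performance of reasoning distillation.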
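As a rough illustration of the provenance idea in this entry, the sketch below labels one generated action by comparing the log-probabilities that the teacher, the original student, and the distilled model assign to it under the same context. The function name, the `margin` threshold, and every category label other than "teacher-originated" are hypothetical choices made for this sketch; the paper's actual classification rule is not given in the digest.

```python
def classify_action(logp_teacher, logp_student, logp_distilled, margin=1.0):
    """Label one action by whose predictions it resembles (hypothetical rule)."""
    # Assumption: a model "supports" the action if its log-probability is within
    # `margin` nats of the log-probability assigned by the distilled model.
    teacher_like = logp_teacher >= logp_distilled - margin
    student_like = logp_student >= logp_distilled - margin
    if teacher_like and student_like:
        return "shared"
    if teacher_like:
        return "teacher-originated"
    if student_like:
        return "student-originated"
    return "novel"
```

Aggregating these labels over all actions in a set of test-time generations would give a per-category breakdown that could then be compared against task accuracy.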
Details
Motivation: Existing reasoning distillation methods lack an in-depth analysis of where the student model's capabilities come from; it is unclear whether, in new test scenarios, the student continues the teacher's behavior or regresses to its original patterns, so a tracing mechanism is needed to understand how distilled behavior generalizes. Method: Proposes the Reasoning Distillation Provenance Tracing framework, which compares the predictive probabilities of the teacher, the original student, and the distilled student under the same context to classify the provenance of each generated action, and designs a teacher-guided data selection method on this basis. Result: Experiments show that at test time the distilled model can generate teacher-originated actions and that these actions correlate with its performance gains; the proposed data selection method is effective across multiple teacher and student models and outperforms heuristic methods. Conclusion: The provenance framework helps explain where a student's capabilities come from in reasoning distillation, confirms that teacher behavior can be preserved at test time, and points toward more efficient and interpretable distillation methods. Abstract: Reasoning distillation has attracted increasing attention. It typically leverages a large teacher model to generate reasoning paths, which are then used to fine-tune a student model so that it mimics the teacher's behavior in training contexts. However, previous approaches have lacked a detailed analysis of the origins of the distilled model's capabilities. It remains unclear whether the student can maintain consistent behaviors with the teacher in novel test-time contexts, or whether it regresses to its original output patterns, raising concerns about the generalization of distillation models. To analyse this question, we introduce a cross-model Reasoning Distillation Provenance Tracing framework. For each action (e.g., a sentence) produced by the distilled model, we obtain the predictive probabilities assigned by the teacher, the original student, and the distilled model under the same context. By comparing these probabilities, we classify each action into different categories. By systematically disentangling the provenance of each action, we experimentally demonstrate that, in test-time contexts, the distilled model can indeed generate teacher-originated actions, which correlate with and plausibly explain the observed performance of the distilled model. Building on this analysis, we further propose a teacher-guided data selection method. Unlike prior approaches that rely on heuristics, our method directly compares teacher-student divergences on the training data, providing a principled selection criterion.
We validate the effectiveness of our approach across multiple representative teacher models and diverse student models. The results highlight the utility of our provenance-tracing framework and underscore its promise for reasoning distillation. We hope to share Reasoning Distillation Provenance Tracing and our insights into reasoning distillation with the community.
[16] Foundation Model-based Evaluation of Neuropsychiatric Disorders: A Lifespan-Inclusive, Multi-Modal, and Multi-Lingual Study
Zhongren Dong,Haotian Guo,Weixiang Xu,Huan Zhao,Zixing Zhang
Main category: cs.CL
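TL;DR: FEND is a foundation-model-based multi-modal framework that integrates speech and text for multi-lingual assessment of Alzheimer's disease, depression, and autism spectrum disorder across the lifespan. The study uses 13 multi-lingual datasets to systematically evaluate multi-modal fusion and highlights key challenges such as modality imbalance and data heterogeneity.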
Details
Motivation: Neuropsychiatric disorders such as Alzheimer's disease, depression, and autism exhibit linguistic and acoustic abnormalities, but no unified multi-lingual evaluation framework exists, and multi-modal methods face challenges in generalization and fairness. Method: Proposes the FEND framework, which combines speech and text modalities, systematically evaluates multi-modal fusion on 13 datasets covering English, Chinese, Greek, French, and Dutch, and runs cross-corpus experiments to analyze the effects of task consistency, language consistency, and modality balance. Result: Multi-modal fusion performs well for Alzheimer's disease and depression detection but underperforms for autism detection because of dataset heterogeneity; modality imbalance is widespread, with fusion failing to surpass the best mono-modal model; cross-corpus experiments show robust performance when task and language are consistent and degraded performance in multi-lingual and task-heterogeneous settings. Conclusion: FEND provides a benchmark framework for automated, lifespan-inclusive, multi-lingual assessment of neuropsychiatric disorders and supports fair comparison and reproducible research; the authors encourage its adoption. Abstract: Neuropsychiatric disorders, such as Alzheimer's disease (AD), depression, and autism spectrum disorder (ASD), are characterized by linguistic and acoustic abnormalities, offering potential biomarkers for early detection. Despite the promise of multi-modal approaches, challenges like multi-lingual generalization and the absence of a unified evaluation framework persist. To address these gaps, we propose FEND (Foundation model-based Evaluation of Neuropsychiatric Disorders), a comprehensive multi-modal framework integrating speech and text modalities for detecting AD, depression, and ASD across the lifespan. Leveraging 13 multi-lingual datasets spanning English, Chinese, Greek, French, and Dutch, we systematically evaluate multi-modal fusion performance. Our results show that multi-modal fusion excels in AD and depression detection but underperforms in ASD due to dataset heterogeneity. We also identify modality imbalance as a prevalent issue, where multi-modal fusion fails to surpass the best mono-modal models. Cross-corpus experiments reveal robust performance in task- and language-consistent scenarios but noticeable degradation in multi-lingual and task-heterogeneous settings. By providing extensive benchmarks and a detailed analysis of performance-influencing factors, FEND advances the field of automated, lifespan-inclusive, and multi-lingual neuropsychiatric disorder assessment. We encourage researchers to adopt the FEND framework for fair comparisons and reproducible research.
[17] Neural Probe-Based Hallucination Detection for Large Language Models
Shize Liang,Hongzhi Wang
Main category: cs.CL
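TL;DR: This paper proposes a neural framework based on lightweight MLP probes for token-level hallucination detection with the large language model's parameters frozen; nonlinear modeling and a multi-objective loss substantially improve detection performance.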
Details
Motivation: Large language models are prone to hallucination, and existing methods based on uncertainty estimation and external knowledge retrieval are limited by high-confidence errors and by their dependence on knowledge coverage, so more efficient and accurate detection is needed. Method: With the language model's parameters frozen, lightweight MLP probes perform nonlinear modeling of high-level hidden states; a multi-objective joint loss is designed, and Bayesian optimization searches for the best probe insertion layer. Result: Experiments on LongFact, HealthBench, and TriviaQA show that MLP probes significantly outperform existing state-of-the-art methods in accuracy, recall, and detection capability under low false-positive conditions. Conclusion: Nonlinear probes combined with this optimization strategy give efficient, stable token-level hallucination detection and offer reliable technical support for deploying large language models in high-risk domains. Abstract: Large language models (LLMs) excel at text generation and knowledge question-answering tasks, but they are prone to generating hallucinated content, severely limiting their application in high-risk domains. Current hallucination detection methods based on uncertainty estimation and external knowledge retrieval suffer from the limitation that they still produce erroneous content at high confidence levels and rely heavily on retrieval efficiency and knowledge coverage. In contrast, probe methods that leverage the model's hidden-layer states offer real-time and lightweight advantages. However, traditional linear probes struggle to capture nonlinear structures in deep semantic spaces. To overcome these limitations, we propose a neural network-based framework for token-level hallucination detection. By freezing language model parameters, we employ lightweight MLP probes to perform nonlinear modeling of high-level hidden states. A multi-objective joint loss function is designed to enhance detection stability and semantic disambiguation. Additionally, we establish a layer position-probe performance response model, using Bayesian optimization to automatically search for optimal probe insertion layers and achieve superior training results. Experimental results on LongFact, HealthBench, and TriviaQA demonstrate that MLP probes significantly outperform state-of-the-art methods in accuracy, recall, and detection capability under low false-positive conditions.
[18] MultiMind at SemEval-2025 Task 7: Crosslingual Fact-Checked Claim Retrieval via Multi-Source Alignment
Mohammad Mahdi Abootorabi,Alireza Ghahramani Kure,Mohammadali Mohammadkhani,Sina Elahimanesh,Mohammad Ali Ali Panah
Main category: cs.CL
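TL;DR: This paper presents TriAligner, a new approach to multilingual and crosslingual fact-checked claim retrieval that uses a dual-encoder architecture with contrastive learning and combines native-language text with English translations to improve retrieval accuracy.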
Details
Motivation: As misinformation spreads rapidly, effective fact-checking is increasingly important and calls for claim retrieval systems that work in multilingual and crosslingual settings. Method: Uses a dual-encoder architecture with contrastive learning, combines native-language text with English translations, applies large language models for data preprocessing and augmentation, and adds hard negative sampling to improve representation learning. Result: Evaluated on monolingual and crosslingual benchmarks, the method significantly outperforms baseline models in retrieval accuracy and fact-checking performance. Conclusion: By fusing information across modalities and improving representation learning, TriAligner performs well on multilingual and crosslingual fact-checking tasks and shows strong robustness and practical potential. Abstract: This paper presents our system for SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval. In an era where misinformation spreads rapidly, effective fact-checking is increasingly critical. We introduce TriAligner, a novel approach that leverages a dual-encoder architecture with contrastive learning and incorporates both native and English translations across different modalities. Our method effectively retrieves claims across multiple languages by learning the relative importance of different sources in alignment. To enhance robustness, we employ efficient data preprocessing and augmentation using large language models while incorporating hard negative sampling to improve representation learning. We evaluate our approach on monolingual and crosslingual benchmarks, demonstrating significant improvements in retrieval accuracy and fact-checking performance over baselines.
[19] Reflection Pretraining Enables Token-Level Self-Correction in Biological Sequence Models
Xiang Zhang,Jiaqi Wei,Yuejin Yang,Zijie Qiu,Yuhan Chen,Zhiqiang Gao,Muhammad Abdul-Mageed,Laks V. S. Lakshmanan,Wanli Ouyang,Chenyu You,Siqi Sun
Main category: cs.CL
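TL;DR: This paper introduces the concept of language expressiveness and proposes reflection pretraining, which strengthens the reasoning ability of biological sequence models by generating auxiliary "thinking tokens", thereby overcoming the limited expressiveness of protein language.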
Details
Motivation: Because the token spaces of protein and RNA language models have limited expressiveness, Chain-of-Thought (CoT) prompting cannot currently be applied to non-natural-language domains; this paper aims to remove that limitation. Method: Proposes and defines the concept of language expressiveness, and introduces reflection pretraining, which enables intermediate reasoning in a biological sequence model for the first time by generating auxiliary "thinking tokens" beyond simple answer tokens. Result: Theoretically, the augmented token set is shown to significantly enhance the expressiveness of biological language; experimentally, the pretraining approach lets protein models self-correct and yields substantial gains over standard pretraining. Conclusion: Reflection pretraining effectively improves the reasoning ability of biological sequence models and opens a path to applying complex reasoning techniques in low-expressiveness language systems. Abstract: Chain-of-Thought (CoT) prompting has significantly advanced task-solving capabilities in natural language processing with large language models. Unlike standard prompting, CoT encourages the model to generate intermediate reasoning steps, non-answer tokens, that help guide the model toward more accurate final outputs. These intermediate steps enable more complex reasoning processes such as error correction, memory management, future planning, and self-reflection. However, applying CoT to non-natural language domains, such as protein and RNA language models, is not yet possible, primarily due to the limited expressiveness of their token spaces (e.g., amino acid tokens). In this work, we propose and define the concept of language expressiveness: the ability of a given language, using its tokens and grammar, to encode information. We show that the limited expressiveness of protein language severely restricts the applicability of CoT-style reasoning. To overcome this, we introduce reflection pretraining, for the first time in a biological sequence model, which enables the model to engage in intermediate reasoning through the generation of auxiliary "thinking tokens" beyond simple answer tokens. Theoretically, we demonstrate that our augmented token set significantly enhances biological language expressiveness, thereby improving the overall reasoning capacity of the model. Experimentally, our pretraining approach teaches protein models to self-correct and leads to substantial performance gains compared to standard pretraining.
[20] Automatic Replication of LLM Mistakes in Medical Conversations
Oleksii Proniakin,Diego Fajardo,Ruslan Nazarenko,Razvan Marinescu
Main category: cs.CL
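TL;DR: This paper presents MedMistake, an automatic pipeline that builds a medical QA benchmark from the mistakes large language models (LLMs) make in simulated patient-doctor conversations. It releases MedMistake-All, with 3,390 questions, and the expert-validated subset MedMistake-Bench, for evaluating the clinical reasoning of frontier LLMs.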
Details
Motivation: LLM evaluation in clinical settings mostly relies on hand-designed multi-dimensional rubrics, which makes it hard to systematically reproduce model mistakes and turn them into measurable test cases; an automated way to extract and standardize typical LLM mistakes in medical conversations is therefore needed. Method: MedMistake works in three steps: it first generates complex conversations between an LLM patient and an LLM doctor; a committee of two LLM judges then evaluates the conversations across multiple dimensions to identify mistakes; finally, these mistakes are converted into concise single-shot QA pairs, yielding a reproducible benchmark. Result: The authors build MedMistake-All, a dataset of 3,390 QA pairs that GPT-5 and Gemini 2.5 Pro currently fail to answer correctly, as judged by two LLM judges; 211 samples validated by medical experts form MedMistake-Bench, which is used to evaluate 12 frontier LLMs, with GPT models, Claude, and Grok performing best. Conclusion: MedMistake provides a scalable, automated framework for discovering, distilling, and evaluating LLM weaknesses in medical conversations, and the released datasets can serve as a resource for future clinical LLM development and safety evaluation. Abstract: Large language models (LLMs) are increasingly evaluated in clinical settings using multi-dimensional rubrics which quantify reasoning quality, safety, and patient-centeredness. Yet, replicating specific mistakes in other LLMs is not straightforward and often requires manual effort. We introduce MedMistake, an automatic pipeline that extracts mistakes LLMs make in patient-doctor conversations and converts them into a benchmark of single-shot QA pairs. Our pipeline (1) creates complex, conversational data between an LLM patient and LLM doctor, (2) runs an evaluation with a committee of 2 LLM judges across a variety of dimensions and (3) creates simplified single-shot QA scenarios from those mistakes. We release MedMistake-All, a dataset of 3,390 single-shot QA pairs where GPT-5 and Gemini 2.5 Pro are currently failing to answer correctly, as judged by two LLM judges. We used medical experts to validate a subset of 211/3390 questions (MedMistake-Bench), which we used to run a final evaluation of 12 frontier LLMs: Claude Opus 4.5, Claude Sonnet 4.5, DeepSeek-Chat, Gemini 2.5 Pro, Gemini 3 Pro, GPT-4o, GPT-5, GPT-5.1, GPT-5.2, Grok 4, Grok 4.1, Mistral Large. We found that GPT models, Claude and Grok obtained the best performance on MedMistake-Bench.
We release both the doctor-validated benchmark (MedMistake-Bench) and the full dataset (MedMistake-All) at https://huggingface.co/datasets/TheLumos/MedicalMistakeBenchmark.
[21] Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation
Wei-Rui Chen,Vignesh Kothapalli,Ata Fatahibaarzi,Hejian Sang,Shao Tang,Qingquan Song,Zhipeng Wang,Muhammad Abdul-Mageed
Main category: cs.CL
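TL;DR: This paper studies how selective supervision and sequence truncation can reduce compute cost while preserving most of the performance when distilling reasoning ability from a large language model into a smaller one.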
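The truncation idea in this entry (training on only the first fraction of each sequence) can be sketched in a few lines. This is a minimal illustration, not the authors' protocol: the function name and the `keep_fraction` parameter are hypothetical, and rounding down with a one-token minimum is an assumption of this sketch.

```python
def truncate_for_distillation(token_ids, keep_fraction=0.5):
    """Keep only the leading fraction of a training sequence (hypothetical helper)."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Assumption: round down, but never keep fewer than one token.
    keep = max(1, int(len(token_ids) * keep_fraction))
    return token_ids[:keep]
```

Because compute in a training step grows with sequence length, halving every sequence is the kind of change that could plausibly produce the roughly 50% savings in time, memory, and FLOPs that the abstract reports.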
Details
Motivation: Conventional knowledge distillation trains the small model on long sequences (prompt, chain of thought, and answer), which is computationally expensive; this paper explores how to allocate supervision more efficiently to reduce that cost. Method: Supervises only the chain-of-thought (CoT) portion and designs a truncation protocol to measure the compute-performance tradeoff at different sequence lengths; the experiments analyze the effect of training on only the first 50% of tokens. Result: On math benchmarks, training on only the first 50% of the tokens of each full sequence retains about 94% of performance on average while cutting training time, memory, and FLOPs by about 50%. Conclusion: Reasoning distillation should prioritize early reasoning tokens; selective supervision and sequence truncation offer an effective lever for trading compute against model performance. Abstract: Distilling the reasoning capabilities from a large language model (LLM) to a smaller student model often involves training on substantial amounts of reasoning data. However, distillation over lengthy sequences with prompt (P), chain-of-thought (CoT), and answer (A) segments makes the process computationally expensive. In this work, we investigate how the allocation of supervision across different segments (P, CoT, A) affects student performance. Our analysis shows that selective knowledge distillation over only the CoT tokens can be effective when the prompt and answer information is encompassed by it. Building on this insight, we establish a truncation protocol to quantify computation-quality tradeoffs as a function of sequence length. We observe that training on only the first 50% of tokens of every training sequence can retain, on average, approximately 94% of full-sequence performance on math benchmarks while reducing training time, memory usage, and FLOPs by about 50% each. These findings suggest that reasoning distillation benefits from prioritizing early reasoning tokens and provides a simple lever for computation-quality tradeoffs. Code is available at https://github.com/weiruichen01/distilling-the-essence.
[22] Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy
Xiaofeng Shi,Qian Kou,Yuduo Li,Hua Zhou
Main category: cs.CL
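TL;DR: This paper proposes SFTKey, a two-stage fine-tuning method that addresses the tendency of conventional supervised fine-tuning to neglect the key answer portion when the chain of thought is long. By running a second optimization pass over the answer portion only, SFTKey significantly improves model accuracy.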
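One plausible way to restrict a second fine-tuning stage to the answer portion is to mask every other position out of the loss. The sketch below builds such labels; it assumes the common convention of marking ignored positions with -100 (the default `ignore_index` of PyTorch's `CrossEntropyLoss`), and the helper name and the `key_start` argument are hypothetical. The digest does not say how SFTKey implements this step.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by PyTorch's CrossEntropyLoss by default

def key_only_labels(token_ids, key_start):
    """Build labels that supervise only the tokens from key_start onward."""
    # Prompt and chain-of-thought positions are masked; answer (Key) positions keep their ids.
    return [tok if i >= key_start else IGNORE_INDEX for i, tok in enumerate(token_ids)]
```

For example, with a five-token sequence whose answer starts at index 3, only the last two positions would contribute to the loss.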
Details
Motivation: In conventional supervised fine-tuning, the model tends to focus on the lengthy chain-of-thought (CoT) portion and neglect the key answer (Key) portion that determines task success, which lowers evaluation quality. Method: Proposes SFTKey, a two-stage training scheme: the first stage applies conventional SFT to ensure a correct output format, and the second stage fine-tunes only the final answer (Key) portion to improve accuracy. Result: Experiments across multiple benchmarks and model families show that SFTKey improves average accuracy by more than 5% over conventional SFT while preserving the ability to generate the correct output format. Conclusion: By explicitly balancing chain-of-thought learning with optimization of answer-relevant tokens, SFTKey improves model performance on complex reasoning tasks and advances fine-tuning techniques for large language models. Abstract: With the rapid advancement of Large Language Models (LLMs), the Chain-of-Thought (CoT) component has become significant for complex reasoning tasks. However, in conventional Supervised Fine-Tuning (SFT), the model could allocate disproportionately more attention to CoT sequences with excessive length. This reduces focus on the much shorter but essential Key portion, the final answer, whose correctness directly determines task success and evaluation quality. To address this limitation, we propose SFTKey, a two-stage training scheme. In the first stage, conventional SFT is applied to ensure proper output format, while in the second stage, only the Key portion is fine-tuned to improve accuracy. Extensive experiments across multiple benchmarks and model families demonstrate that SFTKey achieves an average accuracy improvement exceeding 5% over conventional SFT, while preserving the ability to generate correct formats. Overall, this study advances LLM fine-tuning by explicitly balancing CoT learning with additional optimization on answer-relevant tokens.
[23] Semantic Refinement with LLMs for Graph Representations
Safal Thapaliya,Zehong Wang,Jiazheng Li,Ziming Li,Yanfang Ye,Chuxu Zhang
Main category: cs.CL
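TL;DR: Proposes DAS, a data-adaptive semantic refinement framework that couples a graph neural network and a large language model in a closed feedback loop to achieve task-adaptive optimization of node semantics in graph representation learning.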
Details
Motivation: Graph-structured data differ widely across domains in where their predictive signal comes from, and models with a fixed inductive bias struggle to generalize optimally across diverse graph domains. Method: Proposes the DAS framework, which couples a fixed GNN and a large language model in a closed feedback loop: the GNN provides implicit supervisory signals that guide the LLM's semantic refinement, and the refined semantics are fed back to update the GNN. Result: The method performs well on both structure-dominated and semantics-rich graphs, with consistent gains on structure-dominated graphs while remaining competitive on semantics-rich graphs. Conclusion: Dynamically adapting semantic information from the data side handles the structure-semantics heterogeneity of graph data more effectively than fixing model biases, supporting the value of data-centric semantic adaptation. Abstract: Graph-structured data exhibit substantial heterogeneity in where their predictive signals originate: in some domains, node-level semantics dominate, while in others, structural patterns play a central role. This structure-semantics heterogeneity implies that no graph learning model with a fixed inductive bias can generalize optimally across diverse graph domains. However, most existing methods address this challenge from the model side by incrementally injecting new inductive biases, which remains fundamentally limited given the open-ended diversity of real-world graphs. In this work, we take a data-centric perspective and treat node semantics as a task-adaptive variable. We propose a Data-Adaptive Semantic Refinement framework DAS for graph representation learning, which couples a fixed graph neural network (GNN) and a large language model (LLM) in a closed feedback loop. The GNN provides implicit supervisory signals to guide the semantic refinement of LLM, and the refined semantics are fed back to update the same graph learner. We evaluate our approach on both text-rich and text-free graphs. Results show consistent improvements on structure-dominated graphs while remaining competitive on semantics-rich graphs, demonstrating the effectiveness of data-centric semantic adaptation under structure-semantics heterogeneity.
[24] Semi-Supervised Learning for Large Language Models Safety and Content Moderation
Eduard Stefan Dinuta,Iustin Sirbu,Traian Rebedea
Main category: cs.CL
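TL;DR: Proposes using semi-supervised learning, which combines labeled and unlabeled data, to improve large language model safety, and stresses the importance of task-specific augmentation strategies.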
Details
Motivation: Existing safety classifiers depend on large amounts of labeled data, which are hard to acquire, prone to labeling errors, and often include synthetic data. Method: Applies semi-supervised learning techniques and introduces task-specific data augmentation strategies for both the prompts given to large language models and their responses. Result: Task-specific augmentation significantly outperforms general-purpose augmentation and yields better performance on the safety task. Conclusion: Semi-supervised learning combined with task-specific augmentation is an efficient and practical way to improve large language model safety. Abstract: Safety for Large Language Models (LLMs) has been an ongoing research focus since their emergence and is even more relevant nowadays with the increasing capacity of those models. Currently, there are several guardrails in place for all public LLMs and multiple proposed datasets for training safety classifiers. However, training these safety classifiers relies on large quantities of labeled data, which can be problematic to acquire, prone to labeling errors, or often include synthetic data. To address these issues, we suggest a different approach: utilizing semi-supervised learning techniques, which leverage both labeled and unlabeled data, to improve the performance on the safety task. We analyze the improvements that these techniques can offer for both prompts given to Large Language Models and the responses to those requests. Moreover, since augmentation is the central part of semi-supervised algorithms, we demonstrate the importance of using task-specific augmentations, which significantly increase the performance when compared to general-purpose augmentation techniques.
[25] ClarifyMT-Bench: Benchmarking and Improving Multi-Turn Clarification for Conversational Large Language Models
Sichun Luo,Yi Huang,Mukai Li,Shichang Meng,Fengyuan Liu,Zefa Hu,Junlan Feng,Qi Liu
Main category: cs.CL
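TL;DR: This paper introduces ClarifyMT-Bench, a benchmark for evaluating the clarification behavior of large language models in multi-turn dialogue. It shows that existing models commonly answer prematurely and degrade as dialogues get deeper, and proposes the ClarifyAgent framework to improve robustness under ambiguity.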
Details
Motivation: Existing clarification benchmarks mostly assume single-turn interactions or cooperative users and fail to reflect the complex ambiguity of real open-domain, multi-turn dialogue, so a more realistic evaluation framework is needed. Method: Builds ClarifyMT-Bench, a multi-turn clarification benchmark based on a five-dimensional ambiguity taxonomy and six simulated user personas, generates 6,120 multi-turn dialogues through a hybrid LLM-human pipeline, and proposes the ClarifyAgent framework, which decomposes clarification into four modules: perception, forecasting, tracking, and planning. Result: Evaluating ten representative LLMs reveals a consistent under-clarification bias: models tend to answer prematurely, and performance drops as the number of dialogue turns grows; ClarifyAgent substantially improves performance across ambiguity conditions. Conclusion: ClarifyMT-Bench offers a reproducible basis for studying when LLMs should ask, when they should answer, and how to handle ambiguity in real human-LLM interaction, and ClarifyAgent shows the potential of structured agent design for improving clarification behavior. Abstract: Large language models (LLMs) are increasingly deployed as conversational assistants in open-domain, multi-turn settings, where users often provide incomplete or ambiguous information. However, existing LLM-focused clarification benchmarks primarily assume single-turn interactions or cooperative users, limiting their ability to evaluate clarification behavior in realistic settings. We introduce ClarifyMT-Bench, a benchmark for multi-turn clarification grounded in a five-dimensional ambiguity taxonomy and a set of six behaviorally diverse simulated user personas. Through a hybrid LLM-human pipeline, we construct 6,120 multi-turn dialogues capturing diverse ambiguity sources and interaction patterns. Evaluating ten representative LLMs uncovers a consistent under-clarification bias: LLMs tend to answer prematurely, and performance degrades as dialogue depth increases. To mitigate this, we propose ClarifyAgent, an agentic approach that decomposes clarification into perception, forecasting, tracking, and planning, substantially improving robustness across ambiguity conditions. ClarifyMT-Bench establishes a reproducible foundation for studying when LLMs should ask, when they should answer, and how to navigate ambiguity in real-world human-LLM interactions.
[26] SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation
Mahi Luthra,Jiayi Shen,Maxime Poli,Angelo Ortiz,Yosuke Higuchi,Youssef Benchekroun,Martin Gleize,Charles-Eric Saint-James,Dongyan Lin,Phillip Rust,Angel Villar,Surya Parimi,Vanessa Stark,Rashel Moritz,Juan Pino,Yann LeCun,Emmanuel Dupoux
Main category: cs.CL
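TL;DR: This paper proposes SpidR-Adapt, which uses a meta-learning framework to adapt to a new language with less than one hour of unlabeled speech, improving data efficiency in speech representation learning by more than 100×; the code and models are open-sourced.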
Details
Motivation: Human infants acquire the basic units of a language from very little speech exposure, whereas current self-supervised speech models need large amounts of data, leaving a marked efficiency gap. This paper aims to narrow that gap and enable efficient low-resource speech representation learning. Method: Casts low-resource speech representation learning as a meta-learning problem, proposes the multi-task adaptive pre-training (MAdaPT) protocol within a bi-level optimization framework, designs first-order bi-level optimization (FOBLO) to reduce compute cost, and stabilizes training through a robust initialization obtained with interleaved supervision. Result: After training on less than one hour of target-language audio, SpidR-Adapt surpasses in-domain models on phonemic discriminability (ABX) and spoken language modeling (sWUGGY, sBLIMP, tSC), with data efficiency more than 100× that of standard training. Conclusion: SpidR-Adapt offers a practical, architecture-agnostic path toward biologically inspired, data-efficient speech representation learning. Abstract: Human infants, with only a few hundred hours of speech exposure, acquire basic units of new languages, highlighting a striking efficiency gap compared to the data-hungry self-supervised speech models. To address this gap, this paper introduces SpidR-Adapt for rapid adaptation to new languages using minimal unlabeled data. We cast such low-resource speech representation learning as a meta-learning problem and construct a multi-task adaptive pre-training (MAdaPT) protocol which formulates the adaptation process as a bi-level optimization framework. To enable scalable meta-training under this framework, we propose a novel heuristic solution, first-order bi-level optimization (FOBLO), avoiding heavy computation costs. Finally, we stabilize meta-training by using a robust initialization through interleaved supervision which alternates self-supervised and supervised objectives. Empirically, SpidR-Adapt achieves rapid gains in phonemic discriminability (ABX) and spoken language modeling (sWUGGY, sBLIMP, tSC), improving over in-domain language models after training on less than 1h of target-language audio, over 100× more data-efficient than standard training. These findings highlight a practical, architecture-agnostic path toward biologically inspired, data-efficient representations. We open-source the training code and model checkpoints at https://github.com/facebookresearch/spidr-adapt.
[27] SMART SLM: Structured Memory and Reasoning Transformer, A Small Language Model for Accurate Document Assistance
Divij Dudeja,Mayukha Pal
Main category: cs.CL
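TL;DR: SMART is an efficient model designed for the complex format of engineering manuals (EMs); hierarchical processing and structured memory substantially improve its accuracy and inference efficiency.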
Details
Motivation: Existing general-purpose models perform poorly on engineering manuals because of their high information density and complex format; they tend to give wrong numeric answers and use memory inefficiently. Method: SMART uses three stages: (1) a syntax-aware fact extractor based on a Tree LSTM extracts subject-relation-object relations from sentences; (2) a MANN encodes these relations as 384-dimensional vectors and stores them in an index; (3) a 6-layer Transformer fuses the retrieved facts to generate an answer. It supports two inference modes: a fast path for known documents and a dynamic path for new documents. Result: SMART has only 45.51M parameters, 64% to 69% fewer than GPT-2 and BERT, achieves 21.3% higher accuracy than GPT-2, and in real deployment delivers sub-second responses with fewer hallucinations. Conclusion: Through structured memory and hierarchical reasoning, SMART substantially improves question answering over engineering manuals while lowering compute requirements, making it an effective framework for document processing in specialized domains. Abstract: Users of Engineering Manuals (EMs) find them difficult to read because they are long and have a dense format which includes written documents, step-by-step procedures, and standard parameter lists for engineering equipment. Off-the-shelf transformers, especially compact ones, treat this material as a flat stream of tokens. This approach leads to confident but incorrect numeric answers and forces the models to memorize separate facts inefficiently. SMART (Structured Memory and Reasoning Transformer) offers a different and practical solution to the above problem. SMART structures its processing hierarchically and is built from three main components: (1) a syntax-aware Fact Extractor (Grammarian) Tree LSTM which extracts facts as subject relation object relations from EM sentences, (2) a compact indexed memory MANN (Memory Augmented Neural Network) that indexes these Rational Subject Relation Objects as 384-dimensional vectors that are associated with the source of the information, and (3) a 6-layer Transformer that learns to fuse the previously retrieved facts into its generated response. The entire SMART model uses 45.51M parameters, which is 64% fewer than GPT-2 (124M) and 69% fewer than BERT (133M), and it achieves a 21.3% higher accuracy than GPT-2, indicating that SMART fits the data better with the least amount of processing requirements. SMART employs dual modes of inference: an indexed fast path for known documents (sub-second answer times) and an indexed dynamic path assisted by RAG for new uploads (FAISS top-20 results with memory capped at 64 slots).
In real-world deployment, this framework leads to better-supported results with fewer hallucinations than comparable small transformer models.
[28] Parallel Token Prediction for Language Models
Felix Draxler,Justus Will,Farrin Marouf Sofian,Theofanis Karaletsos,Sameer Singh,Stephan Mandt
Main category: cs.CL
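TL;DR: Proposes Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models that jointly predicts multiple dependent tokens in a single Transformer call, significantly reducing the latency bottleneck of autoregressive decoding.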
Details
Motivation: Address the high latency of autoregressive decoding during sequence generation and overcome the independence assumptions common in existing multi-token prediction methods. Method: By incorporating the sampling procedure into the model, PTP jointly predicts multiple dependent tokens in a single Transformer call; the framework can be trained either by distilling an existing model or through inverse autoregressive training without a teacher. Result: Achieves state-of-the-art speculative decoding performance on Vicuna-7B, accepting more than four tokens per step on Spec-Bench, which supports the effectiveness and generality of the framework. Conclusion: PTP shows that parallel generation of long sequences is feasible without loss of modeling power and has broad potential applications. Abstract: We propose Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models. PTP jointly predicts multiple dependent tokens in a single transformer call by incorporating the sampling procedure into the model. This reduces the latency bottleneck of autoregressive decoding, and avoids the restrictive independence assumptions common in existing multi-token prediction methods. We prove that PTP can represent arbitrary autoregressive sequence distributions. PTP is trained either by distilling an existing model or through inverse autoregressive training without a teacher. Experimentally, we achieve state-of-the-art speculative decoding performance on Vicuna-7B by accepting over four tokens per step on Spec-Bench. The universality of our framework indicates that parallel generation of long sequences is feasible without loss of modeling power.
[29] Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks
Xinhe Wang,Jin Huang,Xingjian Zhang,Tianhao Wang,Jiaqi W. Ma
Main category: cs.CL
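TL;DR: This paper challenges the prevailing view by arguing that the performance gap on ARC-style reasoning benchmarks stems mainly from limitations in visual perception rather than from weak machine reasoning. Using a two-stage experiment that separates perception from reasoning, the authors show that perception is the main factor behind model performance, point out that current benchmarks conflate perceptual and reasoning challenges, and call for clearer evaluation methods.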
Details
Motivation: Prior work generally takes AI's poor results on reasoning tasks such as ARC as evidence of deficient reasoning ability. The authors question this reading and propose that visual perception problems may cause the gap, so what these benchmarks actually measure needs to be re-examined. Method: Designs a two-stage experimental pipeline: the first stage independently converts each image into a natural-language description (perception), and the second stage uses these descriptions for rule induction and application (reasoning). The design isolates perception from reasoning and avoids cross-image information leakage; it is compared against the standard end-to-end approach on three datasets, Mini-ARC, ACRE, and Bongard-LOGO, and the reasoning traces in VLM outputs are analyzed manually. Result: Model performance improves markedly with the two-stage pipeline; manual analysis finds that about 80% of failures come from perception errors rather than reasoning errors, indicating that perception is the main cause of the current bottleneck. Conclusion: ARC-style benchmarks conflate perceptual and reasoning challenges, and the machine reasoning deficiencies they appear to show may be overstated. Future evaluation protocols should decouple perception from reasoning to measure progress in machine intelligence more accurately. Abstract: Reasoning benchmarks such as the Abstraction and Reasoning Corpus (ARC) and ARC-AGI are widely used to assess progress in artificial intelligence and are often interpreted as probes of core, so-called "fluid" reasoning abilities. Despite their apparent simplicity for humans, these tasks remain challenging for frontier vision-language models (VLMs), a gap commonly attributed to deficiencies in machine reasoning. We challenge this interpretation and hypothesize that the gap arises primarily from limitations in visual perception rather than from shortcomings in inductive reasoning. To verify this hypothesis, we introduce a two-stage experimental pipeline that explicitly separates perception and reasoning. In the perception stage, each image is independently converted into a natural-language description, while in the reasoning stage a model induces and applies rules using these descriptions. This design prevents leakage of cross-image inductive signals and isolates reasoning from perception bottlenecks. Across three ARC-style datasets, Mini-ARC, ACRE, and Bongard-LOGO, we show that the perception capability is the dominant factor underlying the observed performance gap by comparing the two-stage pipeline against standard end-to-end one-stage evaluation. Manual inspection of reasoning traces in the VLM outputs further reveals that approximately 80 percent of model failures stem from perception errors.
Together, these results demonstrate that ARC-style benchmarks conflate perceptual and reasoning challenges and that observed performance gaps may overstate deficiencies in machine reasoning. Our findings underscore the need for evaluation protocols that disentangle perception from reasoning when assessing progress in machine intelligence.
[30] C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling
Jin Qin,Zihan Liao,Ziyin Zhang,Hang Yu,Peng Di,Rui Wang
Main category: cs.CL
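TL;DR: C2LLM is a code embedding model built on Qwen-2.5-Coder that uses a Pooling by Multihead Attention (PMA) module to generate sequence embeddings, achieving leading results among models of similar size on the MTEB-Code benchmark.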
Details
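Motivation: Conventional EOS-based sequence embeddings suffer from an information bottleneck and make it hard to adjust the embedding dimension flexibly, which limits the performance and adaptability of code embedding models. Method: C2LLM models are built on Qwen-2.5-Coder at 0.5B and 7B scales, with a Pooling by Multihead Attention (PMA) module that generates sequence embeddings from token embeddings. The module makes full use of the causal representations acquired during pretraining, aggregates information from the whole sequence, and supports flexible adaptation of the embedding dimension. Result: After training on three million publicly available data samples, C2LLM sets new records on the MTEB-Code benchmark among models of similar size, with C2LLM-7B ranking first on the overall leaderboard. Conclusion: The PMA module effectively alleviates the information bottleneck of conventional methods and improves code embedding quality; C2LLM performs well across evaluation tasks, supporting its potential as an efficient code embedding model. Abstract: We present C2LLM (Contrastive Code Large Language Models), a family of code embedding models in both 0.5B and 7B sizes. Building upon Qwen-2.5-Coder backbones, C2LLM adopts a Pooling by Multihead Attention (PMA) module for generating sequence embeddings from token embeddings, effectively 1) utilizing the LLM's causal representations acquired during pretraining, while also 2) being able to aggregate information from all tokens in the sequence, breaking the information bottleneck in EOS-based sequence embeddings, and 3) supporting flexible adaptation of embedding dimension, serving as an alternative to MRL. Trained on three million publicly available data samples, C2LLM models set new records on MTEB-Code among models of similar sizes, with C2LLM-7B ranking 1st on the overall leaderboard.
[31] Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty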
Ziyu Chen,Xinbei Jiang,Peng Sun,Tao Lin
Main category: cs.CL
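TL;DR: This paper is the first to propose Denoising Entropy, a computable metric of the cumulative predictive uncertainty along the generative path of masked diffusion models, and builds on it two algorithms for optimizing the decoding path that significantly improve generation quality.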
Details
Motivation: Masked diffusion models offer flexible non-autoregressive generation, but their output quality is highly sensitive to the decoding order, and there has been no mechanism for modeling and controlling how uncertainty evolves along the generative path. Method: Denoising Entropy is introduced to quantify the cumulative predictive uncertainty during generation, and two algorithms for optimizing the decoding path are built on it: a post-hoc selection method and a real-time guidance strategy. Result: On multiple reasoning, planning, and code generation benchmarks, the proposed entropy-guided methods significantly improve generation accuracy and overall quality. Conclusion: Denoising Entropy provides a principled tool for understanding and controlling generation in masked diffusion models, turning model uncertainty from a liability into an advantage for discovering high-quality solutions. Abstract: Masked Diffusion Models (MDMs) offer flexible, non-autoregressive generation, but this freedom introduces a challenge: final output quality is highly sensitive to the decoding order. We are the first to formalize this issue, attributing the variability in output quality to the cumulative predictive uncertainty along a generative path. To quantify this uncertainty, we introduce Denoising Entropy, a computable metric that serves as an internal signal for evaluating the generative process. Leveraging this metric, we propose two algorithms designed to optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Experiments demonstrate that our entropy-guided methods significantly improve generation quality, consistently boosting accuracy on challenging reasoning, planning, and code benchmarks. Our work establishes Denoising Entropy as a principled tool for understanding and controlling generation, effectively turning the uncertainty in MDMs from a liability into a key advantage for discovering high-quality solutions.
cs.CV [Back]
[32] VL4Gaze: Unleashing Vision-Language Models for Gaze Following
Shijing Wang,Chaoqun Cui,Yaping Huang,Hyung Jin Chang,Yihua Cheng
Main category: cs.CV
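TL;DR: This paper introduces VL4Gaze, the first large-scale benchmark for evaluating and improving the gaze understanding of vision-language models (VLMs), and shows that current VLMs struggle to reliably infer gaze semantics and spatial localization without task-specific supervision.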
Details
Motivation: Current vision-language models lack systematic evaluation and training for understanding human gaze, even though gaze is a key cue for interpreting attention, intention, and social interaction. Method: The authors build VL4Gaze, a large-scale dataset of 489K question-answer pairs over 124K images, which unifies gaze understanding as a visual question answering (VQA) task with four subtasks: gaze object description, gaze direction description, gaze point location, and ambiguous question recognition. A range of VLMs is evaluated under in-context learning and fine-tuning settings. Result: Even large-scale VLMs struggle to reliably infer gaze semantics and spatial location without task-specific supervision, whereas training on VL4Gaze substantially improves performance on all tasks. Conclusion: Developing gaze understanding in VLMs requires targeted multi-task supervision, and VL4Gaze provides an important resource for advancing research in this area. Abstract: Human gaze provides essential cues for interpreting attention, intention, and social interaction in visual scenes, yet gaze understanding remains largely unexplored in current vision-language models (VLMs). While recent VLMs achieve strong scene-level reasoning across a range of visual tasks, there exists no benchmark that systematically evaluates or trains them for gaze interpretation, leaving open the question of whether gaze understanding can emerge from general-purpose vision-language pre-training. To address this gap, we introduce VL4Gaze, the first large-scale benchmark designed to investigate, evaluate, and unlock the potential of VLMs for gaze understanding. VL4Gaze contains 489K automatically generated question-answer pairs across 124K images and formulates gaze understanding as a unified VQA problem through four complementary tasks: (1) gaze object description, (2) gaze direction description, (3) gaze point location, and (4) ambiguous question recognition. We comprehensively evaluate both commercial and open-source VLMs under in-context learning and fine-tuning settings. The results show that even large-scale VLMs struggle to reliably infer gaze semantics and spatial localization without task-specific supervision. In contrast, training on VL4Gaze brings substantial and consistent improvements across all tasks, highlighting the importance of targeted multi-task supervision for developing gaze understanding capabilities in VLMs. We will release the dataset and code to support further research and development in this direction.
[33] TrashDet: Iterative Neural Architecture Search for Efficient Waste Detection
Tony Tran,Bin Hu
Main category: cs.CV
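TL;DR: This paper proposes TrashDets, a trash detection approach for TinyML devices built on the TACO dataset and a hardware-aware neural architecture search framework; an iterative evolutionary search optimizes the detectors, which significantly outperform existing methods in accuracy, energy consumption, and latency.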
Details
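Motivation: Efficient trash detection on resource-constrained edge and IoT devices faces trade-offs among model size, power consumption, and accuracy. Conventional methods struggle to meet strict TinyML constraints, so a hardware-aware, scalable, automated model design method is needed. Method: A Once-for-All-style ResDets supernet is combined with an iterative evolutionary search that alternates between optimizing the backbone and the neck/head, supported by a population passthrough mechanism and an accuracy predictor that reduce search cost and improve stability. The search yields a TrashDet model family suited to different hardware budgets. Result: On a five-class TACO subset, TrashDet-l reaches 19.5 mAP50 with 30.5M parameters, up to 3.6 mAP50 above prior methods. The model family spans 1.2M to 30.5M parameters with mAP50 between 11.4 and 19.5. On the MAX78002 microcontroller, TrashDet-ResNet achieves 7525 μJ energy per inference, 26.7 ms latency, and 37.45 FPS, while TrashDet-MBNet improves mAP50 by 10.2%; relative to the baseline, energy consumption, latency, and average power drop by up to 88%, 78%, and 53%, respectively. Conclusion: The TrashDets framework effectively produces efficient trash detection models for TinyML, striking a strong balance between accuracy and efficiency across hardware budgets, significantly outperforming existing methods, and showing potential for large-scale deployment on edge devices. Abstract: This paper addresses trash detection on the TACO dataset under strict TinyML constraints using an iterative hardware-aware neural architecture search framework targeting edge and IoT devices. The proposed method constructs a Once-for-All-style ResDets supernet and performs iterative evolutionary search that alternates between backbone and neck/head optimization, supported by a population passthrough mechanism and an accuracy predictor to reduce search cost and improve stability. This framework yields a family of deployment-ready detectors, termed TrashDets. On a five-class TACO subset (paper, plastic, bottle, can, cigarette), the strongest variant, TrashDet-l, achieves 19.5 mAP50 with 30.5M parameters, improving accuracy by up to 3.6 mAP50 over prior detectors while using substantially fewer parameters. The TrashDet family spans 1.2M to 30.5M parameters with mAP50 values between 11.4 and 19.5, providing scalable detector options for diverse TinyML deployment budgets on resource-constrained hardware. 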
On the MAX78002 microcontroller with the TrashNet dataset, two specialized variants, TrashDet-ResNet and TrashDet-MBNet, jointly dominate the ai87-fpndetector baseline, with TrashDet-ResNet achieving 7525 μJ energy per inference at 26.7 ms latency and 37.45 FPS, and TrashDet-MBNet improving mAP50 by 10.2%; together they reduce energy consumption by up to 88%, latency by up to 78%, and average power by up to 53% compared to existing TinyML detectors.
[34] OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective
Markus Gross,Sai B. Matha,Aya Fahmy,Rui Song,Daniel Cremers,Henri Meess
Main category: cs.CV
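TL;DR: This paper introduces OccuFly, the first real-world, camera-based aerial semantic scene completion (SSC) benchmark, aimed at the challenge of 3D scene understanding from high-altitude UAV viewpoints.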
Details
Motivation: Existing SSC research concentrates on terrestrial scenes such as autonomous driving, while aerial scenes such as UAV flight are little studied. SSC data generation also relies on LiDAR sensors, which are constrained in UAV applications by regulations, weight, energy consumption, and point-cloud sparsity. A LiDAR-free SSC dataset and method suited to UAV platforms is therefore needed. Method: The OccuFly benchmark is built from UAV camera images captured across multiple seasons and altitudes (30 to 50 m). Traditional 3D reconstruction lifts partially annotated 2D masks into the point cloud, automating label transfer and reducing manual 3D annotation cost. The framework uses the camera modality only, matching mainstream UAV configurations. Result: The released dataset covers urban, industrial, and rural scenes with 22 semantic classes and follows common data formats. State-of-the-art methods are benchmarked, revealing challenges specific to elevated viewpoints, such as small distant objects and heavy occlusion. Conclusion: OccuFly provides the first LiDAR-free aerial SSC benchmark for UAV platforms, advancing vision-based 3D scene understanding in aerial settings and supporting future autonomous flight systems. Abstract: Semantic Scene Completion (SSC) is crucial for 3D perception in mobile robotics, as it enables holistic scene understanding by jointly estimating dense volumetric occupancy and per-voxel semantics. Although SSC has been widely studied in terrestrial domains such as autonomous driving, aerial scenarios like autonomous flying remain largely unexplored, thereby limiting progress on downstream applications. Furthermore, LiDAR sensors represent the primary modality for SSC data generation, which poses challenges for most uncrewed aerial vehicles (UAVs) due to flight regulations, mass and energy constraints, and the sparsity of LiDAR-based point clouds from elevated viewpoints. To address these limitations, we introduce OccuFly, the first real-world, camera-based aerial SSC benchmark, captured at altitudes of 50m, 40m, and 30m during spring, summer, fall, and winter. OccuFly covers urban, industrial, and rural scenarios, provides 22 semantic classes, and the data format adheres to established conventions to facilitate seamless integration with existing research. Crucially, we propose a LiDAR-free data generation framework based on camera modality, which is ubiquitous on modern UAVs. By utilizing traditional 3D reconstruction, our framework automates label transfer by lifting a subset of annotated 2D masks into the reconstructed point cloud, thereby substantially minimizing manual 3D annotation effort. 
Finally, we benchmark the state-of-the-art on OccuFly and highlight challenges specific to elevated viewpoints, yielding a comprehensive vision benchmark for holistic aerial 3D scene understanding.
[35] NULLBUS: Multimodal Mixed-Supervision for Breast Ultrasound Segmentation via Nullable Global-Local Prompts
Raja Mallina,Bryar Shareef
Main category: cs.CV
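TL;DR: Proposes NullBUS, a framework that uses nullable prompts to enable mixed-supervision segmentation of breast ultrasound images with or without text prompts, significantly improving the use of multimodal data and segmentation performance.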
Details
Motivation: Existing breast ultrasound datasets often lack reliable text metadata, which limits the training and robustness of prompt-based segmentation methods; heterogeneous data with and without prompts needs to be used effectively. Method: The NullBUS framework introduces learnable null embeddings with presence masks to model missing text prompts, and fuses image and text cues in a single model for mixed-supervision training. Result: In a pooled evaluation on three public BUS datasets, NullBUS reaches a mean IoU of 0.8568 and a mean Dice of 0.9103, outperforming existing methods. Conclusion: Through its nullable prompt mechanism, NullBUS supports multimodal mixed supervision and maintains high performance when prompt information is incomplete, improving clinical practicality and model generalization. Abstract: Breast ultrasound (BUS) segmentation provides lesion boundaries essential for computer-aided diagnosis and treatment planning. While promptable methods can improve segmentation performance and tumor delineation when text or spatial prompts are available, many public BUS datasets lack reliable metadata or reports, constraining training to small multimodal subsets and reducing robustness. We propose NullBUS, a multimodal mixed-supervision framework that learns from images with and without prompts in a single model. To handle missing text, we introduce nullable prompts, implemented as learnable null embeddings with presence masks, enabling fallback to image-only evidence when metadata are absent and the use of text when present. Evaluated on a unified pool of three public BUS datasets, NullBUS achieves a mean IoU of 0.8568 and a mean Dice of 0.9103, demonstrating state-of-the-art performance under mixed prompt availability.
[36] Learning to Sense for Driving: Joint Optics-Sensor-Model Co-Design for Semantic Segmentation
Reeshad Khan,John Gauch
Main category: cs.CV
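TL;DR: Proposes a task-driven, end-to-end RAW-to-semantic-segmentation co-design framework that unifies optics, sensor modeling, and a lightweight network, significantly improving autonomous-driving perception.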
Details
Motivation: In conventional autonomous driving systems, camera design is decoupled from the perception task: fixed optics and image signal processing (ISP) prioritize human vision over machine semantic understanding, which loses information and forces models to adapt to sensor artifacts. Method: An end-to-end RAW-to-task joint optimization framework integrates realistic cellphone-scale lens models (based on DeepLens), a learnable color filter array (CFA), a Poisson-Gaussian noise model, and quantization, and optimizes them together with a lightweight semantic segmentation network directly against the segmentation objective. Result: On KITTI-360, mIoU improves markedly over fixed pipelines, especially for thin objects and low-light-sensitive classes. Optics modeling and CFA learning bring the largest gains. The model has only about 1M parameters and runs at about 28 FPS, making it deployable at the edge. Visual and quantitative analyses show that co-design adapts image acquisition to semantic structure, sharpening boundaries and maintaining accuracy under blur, noise, and low bit depth. Conclusion: Full-stack co-optimization of optics, sensors, and networks is an effective route to efficient, reliable, and deployable autonomous-driving perception systems. Abstract: Traditional autonomous driving pipelines decouple camera design from downstream perception, relying on fixed optics and handcrafted ISPs that prioritize human-viewable imagery rather than machine semantics. This separation discards information during demosaicing, denoising, or quantization, while forcing models to adapt to sensor artifacts. We present a task-driven co-design framework that unifies optics, sensor modeling, and lightweight semantic segmentation networks into a single end-to-end RAW-to-task pipeline. Building on DeepLens[19], our system integrates realistic cellphone-scale lens models, learnable color filter arrays, Poisson-Gaussian noise processes, and quantization, all optimized directly for segmentation objectives. Evaluations on KITTI-360 show consistent mIoU improvements over fixed pipelines, with optics modeling and CFA learning providing the largest gains, especially for thin or low-light-sensitive classes. Importantly, these robustness gains are achieved with a compact ~1M-parameter model running at ~28 FPS, demonstrating edge deployability. Visual and quantitative analyses further highlight how co-designed sensors adapt acquisition to semantic structure, sharpening boundaries and maintaining accuracy under blur, noise, and low bit-depth. 
Together, these findings establish full-stack co-optimization of optics, sensors, and networks as a principled path toward efficient, reliable, and deployable perception in autonomous systems.
[37] CHAMMI-75: pre-training multi-channel models with heterogeneous microscopy images
Vidit Agrawal,John Peters,Tyler N. Thompson,Mohammad Vali Sanian,Chau Pham,Nikita Moshkov,Arshad Kazi,Aditya Pillai,Jack Freeman,Byunguk Kang,Samouil L. Farhi,Ernest Fraenkel,Ron Stewart,Lassi Paavolainen,Bryan A. Plummer,Juan C. Caicedo
Main category: cs.CV
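TL;DR: CHAMMI-75 is an open dataset of heterogeneous multi-channel microscopy images from 75 different biological studies, intended for developing channel-adaptive cellular morphology models that can be reused across studies.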
Details
Motivation: Existing cellular morphology models usually depend on a single imaging type, so they generalize poorly across technical specifications or experimental conditions and are hard to reuse across studies. Method: Multi-channel microscopy images from 75 diverse biological studies are gathered from public sources to build the CHAMMI-75 dataset, which is used to train and evaluate channel-adaptive cellular morphology models. Result: Experiments show that models trained with CHAMMI-75 perform better on multi-channel bioimaging tasks, mainly because of the dataset's high diversity in microscopy modalities. Conclusion: CHAMMI-75 lays the groundwork for the next generation of general-purpose cellular morphology models applicable to a wide range of biological studies. Abstract: Quantifying cell morphology using images and machine learning has proven to be a powerful tool to study the response of cells to treatments. However, models used to quantify cellular morphology are typically trained with a single microscopy imaging type. This results in specialized models that cannot be reused across biological studies because the technical specifications do not match (e.g., different number of channels), or because the target experimental conditions are out of distribution. Here, we present CHAMMI-75, an open access dataset of heterogeneous, multi-channel microscopy images from 75 diverse biological studies. We curated this resource from publicly available sources to investigate cellular morphology models that are channel-adaptive and can process any microscopy image type. Our experiments show that training with CHAMMI-75 can improve performance in multi-channel bioimaging tasks primarily because of its high diversity in microscopy modalities. This work paves the way to create the next generation of cellular morphology models for biological studies.
[38] Input-Adaptive Visual Preprocessing for Efficient Fast Vision-Language Model Inference
Putu Indah Githa Cahyani,Komang David Dananjaya Suartana,Novanto Yudistira
Main category: cs.CV
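TL;DR: Proposes a content-aware adaptive visual preprocessing method that dynamically adjusts input resolution and crop region to reduce visual redundancy, cutting inference time by more than 50% and visual tokens by more than 55% without modifying the FastVLM architecture.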
Details
Motivation: Existing VLM inference pipelines rely on static visual preprocessing, which spends redundant computation even on simple images and falls short of efficient-deployment requirements. Method: Content-aware analysis, adaptive resolution selection, and content-aware cropping are combined to dynamically adjust the resolution and spatial coverage of the input image; the method is integrated into FastVLM without changing its structure or retraining. Result: Experiments on a DocVQA subset show per-image inference time reduced by more than 50%, lower overall generation time, and a consistent reduction of more than 55% in visual token count. Conclusion: Input-aware adaptive preprocessing is an effective and lightweight way to significantly improve VLM inference efficiency in deployment scenarios. Abstract: Vision-Language Models (VLMs) have demonstrated strong performance on multimodal reasoning tasks, but their deployment remains challenging due to high inference latency and computational cost, particularly when processing high-resolution visual inputs. While recent architectures such as FastVLM improve efficiency through optimized vision encoders, existing pipelines still rely on static visual preprocessing, leading to redundant computation for visually simple inputs. In this work, we propose an adaptive visual preprocessing method that dynamically adjusts input resolution and spatial coverage based on image content characteristics. The proposed approach combines content-aware image analysis, adaptive resolution selection, and content-aware cropping to reduce visual redundancy prior to vision encoding. Importantly, the method is integrated with FastVLM without modifying its architecture or requiring retraining. We evaluate the proposed method on a subset of the DocVQA dataset in an inference-only setting, focusing on efficiency-oriented metrics. Experimental results show that adaptive preprocessing reduces per-image inference time by over 50%, lowers mean full generation time, and achieves a consistent reduction of more than 55% in visual token count compared to the baseline pipeline. These findings demonstrate that input-aware preprocessing is an effective and lightweight strategy for improving deployment-oriented efficiency of vision-language models. 
To facilitate reproducibility, our implementation is provided as a fork of the FastVLM repository, incorporating the files for the proposed method, and is available at https://github.com/kmdavidds/mlfastlm.
[39] ALIVE: An Avatar-Lecture Interactive Video Engine with Content-Aware Retrieval for Real-Time Interaction
Md Zabirul Islam,Md Motaleb Hossen Manik,Ge Wang
Main category: cs.CV
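TL;DR: ALIVE is an interactive video learning engine that runs entirely on local hardware and turns conventional recorded lectures into a dynamic learning experience through neural avatars, content-aware retrieval, and real-time multimodal interaction.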
Details
Motivation: Conventional recorded lectures offer no mechanism for real-time clarification, so students must search externally when confused; existing interactive systems often lack awareness of the lecture content, depend on cloud processing, or fail to integrate retrieval with avatar-delivered explanations. Method: The ALIVE system combines ASR transcription, LLM refinement, neural avatar generation, content-aware retrieval based on semantic similarity and timestamp alignment, and real-time responses to text or voice questions, all running on local hardware. Lightweight embedding models, FAISS retrieval, and segmented synthesis with preloading keep the system responsive. Result: The system is validated on a complete medical imaging course; the evaluation shows high retrieval accuracy, low latency, and a good user experience, with accurate and contextually relevant real-time explanations. Conclusion: ALIVE shows that combining local deployment, content-aware retrieval, and multimodal AI can significantly raise the pedagogical value of recorded lectures and offers an extensible path toward next-generation interactive learning environments. Abstract: Traditional lecture videos offer flexibility but lack mechanisms for real-time clarification, forcing learners to search externally when confusion arises. Recent advances in large language models and neural avatars provide new opportunities for interactive learning, yet existing systems typically lack lecture awareness, rely on cloud-based services, or fail to integrate retrieval and avatar-delivered explanations in a unified, privacy-preserving pipeline. We present ALIVE, an Avatar-Lecture Interactive Video Engine that transforms passive lecture viewing into a dynamic, real-time learning experience. ALIVE operates fully on local hardware and integrates (1) Avatar-delivered lecture generated through ASR transcription, LLM refinement, and neural talking-head synthesis; (2) A content-aware retrieval mechanism that combines semantic similarity with timestamp alignment to surface contextually relevant lecture segments; and (3) Real-time multimodal interaction, enabling students to pause the lecture, ask questions through text or voice, and receive grounded explanations either as text or as avatar-delivered responses. To maintain responsiveness, ALIVE employs lightweight embedding models, FAISS-based retrieval, and segmented avatar synthesis with progressive preloading. We demonstrate the system on a complete medical imaging course, evaluate its retrieval accuracy, latency characteristics, and user experience, and show that ALIVE provides accurate, content-aware, and engaging real-time support. 
ALIVE illustrates how multimodal AI, when combined with content-aware retrieval and local deployment, can significantly enhance the pedagogical value of recorded lectures, offering an extensible pathway toward next-generation interactive learning environments.
[40] Lightweight framework for underground pipeline recognition and spatial localization based on multi-view 2D GPR images
Haotian Lv,Chao Li,Jiangbo Dai,Yuhui Zhang,Zepeng Fan,Yiqiu Tan,Dawei Wang,Binglei Xie
Main category: cs.CV
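TL;DR: This paper proposes an intelligent 3D underground pipeline detection framework based on joint B/C/D-Scan three-view analysis. Combining an improved DCO-YOLO model with a 3D-DIoU matching algorithm, it improves small-target detection accuracy and multi-view feature fusion and reaches 96.7% mAP on real urban data, outperforming the baseline model.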
Details
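Motivation: Underground pipeline detection with 3D ground-penetrating radar (GPR) suffers from weak correlation between multi-view features, low recognition accuracy for small targets, and insufficient robustness in complex scenes; this paper aims to improve detection accuracy and reliability. Method: 1) A B/C/D-Scan three-view joint analysis strategy is proposed, and a three-dimensional feature evaluation method is built by cross-validating FDTD forward simulations against measured data; 2) the DCO-YOLO framework is constructed, introducing DySample, CGLU, and OutlookAttention to strengthen cross-dimensional feature correlation and small-target edge feature extraction; 3) a 3D-DIoU spatial feature matching algorithm is designed, combining three-dimensional geometric constraints with a center-distance penalty term to automatically associate multi-view annotations and fuse features. Result: Experiments on real urban underground pipeline data show accuracy, recall, and mAP of 96.2%, 93.3%, and 96.7% in complex multi-pipeline scenes, which are 2.0%, 2.1%, and 0.9% above the baseline model. Ablation experiments confirm the synergistic effect of the modules, and Grad-CAM++ visualization shows that the model focuses more on pipeline geometric features. Conclusion: By combining deep learning optimization strategies with the physical characteristics of 3D GPR, the study addresses weak multi-view feature correlation, missed small targets, and interference in complex environments, providing an efficient and reliable technical framework for intelligent recognition and localization of underground pipelines. Abstract: To address the issues of weak correlation between multi-view features, low recognition accuracy of small-scale targets, and insufficient robustness in complex scenarios in underground pipeline detection using 3D GPR, this paper proposes a 3D pipeline intelligent detection framework. First, based on a B/C/D-Scan three-view joint analysis strategy, a three-dimensional pipeline three-view feature evaluation method is established by cross-validating forward simulation results obtained using FDTD methods with actual measurement data. Second, the DCO-YOLO framework is proposed, which integrates DySample, CGLU, and OutlookAttention cross-dimensional correlation mechanisms into the original YOLOv11 algorithm, significantly improving the small-scale pipeline edge feature extraction capability. Furthermore, a 3D-DIoU spatial feature matching algorithm is proposed, which integrates three-dimensional geometric constraints and center distance penalty terms to achieve automated association of multi-view annotations. The three-view fusion strategy resolves inherent ambiguities in single-view detection. Experiments based on real urban underground pipeline data show that the proposed method achieves accuracy, recall, and mean average precision of 96.2%, 93.3%, and 96.7%, respectively, in complex multi-pipeline scenarios, which are 2.0%, 2.1%, and 0.9% higher than the baseline model. 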
Ablation experiments validated the synergistic optimization effect of the dynamic feature enhancement module, and Grad-CAM++ heatmap visualization demonstrated that the improved model significantly enhanced its ability to focus on pipeline geometric features. This study integrates deep learning optimization strategies with the physical characteristics of 3D GPR, offering an efficient and reliable novel technical framework for the intelligent recognition and localization of underground pipelines.
[41] NeRV360: Neural Representation for 360-Degree Videos with a Viewport Decoder
Daichi Arai,Kyohei Unno,Yasuko Sugito,Yuichi Kusakabe
Main category: cs.CV
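TL;DR: NeRV360 is an end-to-end implicit neural representation framework for high-resolution 360-degree video that decodes only the user's viewport and adds a spatial-temporal affine transform module, significantly reducing memory consumption and speeding up decoding while delivering better image quality.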
Details
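Motivation: Existing implicit neural representations for videos (NeRV) incur high memory usage and slow decoding on high-resolution 360-degree video, making real-time applications impractical; a more efficient conditional decoding method is needed. Method: The NeRV360 framework integrates viewport extraction into the decoding process and adds a spatial-temporal affine transform module for decoding conditioned on viewpoint and time, reconstructing only the viewport the user is currently watching rather than the full panoramic frame. Result: Experiments on 6K-resolution videos show that, compared with the representative prior work HNeRV, NeRV360 reduces memory consumption 7-fold and increases decoding speed 2.5-fold while scoring better on objective image quality metrics. Conclusion: Through conditional viewport decoding, NeRV360 addresses the efficiency bottleneck of implicit neural representations for high-resolution 360-degree video and offers a feasible route to real-time applications. Abstract: Implicit neural representations for videos (NeRV) have shown strong potential for video compression. However, applying NeRV to high-resolution 360-degree videos causes high memory usage and slow decoding, making real-time applications impractical. We propose NeRV360, an end-to-end framework that decodes only the user-selected viewport instead of reconstructing the entire panoramic frame. Unlike conventional pipelines, NeRV360 integrates viewport extraction into decoding and introduces a spatial-temporal affine transform module for conditional decoding based on viewpoint and time. Experiments on 6K-resolution videos show that NeRV360 achieves a 7-fold reduction in memory consumption and a 2.5-fold increase in decoding speed compared to HNeRV, a representative prior work, while delivering better image quality in terms of objective metrics.
[42] Beyond Weight Adaptation: Feature-Space Domain Injection for Cross-Modal Ship Re-Identification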
Tingfeng Xian,Wenlve Zhou,Zhiheng Zhou,Zhelin Li
Main category: cs.CV
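TL;DR: This paper proposes a new cross-modal ship re-identification method based on vision foundation models (VFMs) that overcomes modality discrepancies by injecting domain representations in feature space (Domain Representation Injection, DRI). It requires no fine-tuning of the VFM weights and achieves SOTA performance with only a small number of trainable parameters.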
Details
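Motivation: Existing cross-modal ship re-identification methods rely on large-scale paired data for explicit modality alignment, and such data is hard to obtain; generic parameter-efficient fine-tuning (PEFT) methods also perform poorly on low-capacity models. A more efficient paradigm that does not need large amounts of paired data is therefore required. Method: Grounded in the Platonic Representation Hypothesis, the vision foundation model (VFM) is kept frozen. A lightweight Offset Encoder extracts modality and identity features from the raw input, a Modulator adaptively transforms these features according to the contextual information of intermediate layers, and the result is injected into the VFM's intermediate layers by additive fusion, dynamically adjusting the feature distribution to fit the downstream task. Result: On the HOSS-ReID dataset, the method reaches 57.9% and 60.5% mAP with only 1.54M and 7.05M trainable parameters, respectively, significantly outperforming existing methods and achieving state-of-the-art (SOTA) performance. Conclusion: By performing parameter-efficient fine-tuning in feature space, DRI makes effective use of a frozen vision foundation model for cross-modal ship re-identification, supports the advantage of feature-space optimization over weight-space fine-tuning, and offers a new approach to cross-modal learning in resource-constrained settings. Abstract: Cross-Modality Ship Re-Identification (CMS Re-ID) is critical for achieving all-day and all-weather maritime target tracking, yet it is fundamentally challenged by significant modality discrepancies. Mainstream solutions typically rely on explicit modality alignment strategies; however, this paradigm heavily depends on constructing large-scale paired datasets for pre-training. To address this, grounded in the Platonic Representation Hypothesis, we explore the potential of Vision Foundation Models (VFMs) in bridging modality gaps. Recognizing the suboptimal performance of existing generic Parameter-Efficient Fine-Tuning (PEFT) methods that operate within the weight space, particularly on limited-capacity models, we shift the optimization perspective to the feature space and propose a novel PEFT strategy termed Domain Representation Injection (DRI). Specifically, while keeping the VFM fully frozen to maximize the preservation of general knowledge, we design a lightweight, learnable Offset Encoder to extract domain-specific representations rich in modality and identity attributes from raw inputs. Guided by the contextual information of intermediate features at different layers, a Modulator adaptively transforms these representations. Subsequently, they are injected into the intermediate layers via additive fusion, dynamically reshaping the feature distribution to adapt to the downstream task without altering the VFM's pre-trained weights. 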
Extensive experimental results demonstrate the superiority of our method, achieving State-of-the-Art (SOTA) performance with minimal trainable parameters. For instance, on the HOSS-ReID dataset, we attain 57.9% and 60.5% mAP using only 1.54M and 7.05M parameters, respectively. The code is available at https://github.com/TingfengXian/DRI.
[43] DGSAN: Dual-Graph Spatiotemporal Attention Network for Pulmonary Nodule Malignancy Prediction
Xiao Yu,Zhaojie Fang,Guanyu Zhou,Yin Shen,Huoling Luo,Ye Li,Ahmed Elazab,Xiang Wan,Ruiquan Ge,Changmiao Wang
Main category: cs.CV
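TL;DR: Proposes a Dual-Graph Spatiotemporal Attention Network (DGSAN) that fuses multimodal and multi-temporal information to improve pulmonary nodule classification accuracy; the method significantly outperforms existing techniques on the newly built NLST-cmst dataset.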
Details
Motivation: Existing methods that fuse multimodal and multi-temporal information are inefficient and limited to simple vector concatenation and mutual attention, so they cannot fully exploit pulmonary nodule features, which hurts early diagnostic accuracy. Method: A Global-Local Feature Encoder is designed, a dual-graph structure (inter-modal and intra-modal graphs) is constructed, and a Hierarchical Cross-Modal Graph Fusion Module is introduced to achieve efficient multimodal spatiotemporal feature fusion. Result: Experiments on the NLST-cmst and CSTL-derived datasets show that DGSAN significantly outperforms current state-of-the-art methods on pulmonary nodule classification while remaining computationally efficient. Conclusion: DGSAN fuses multimodal and multi-temporal information more effectively, improving the accuracy and reliability of benign-versus-malignant nodule discrimination and providing strong technical support for early lung cancer diagnosis. Abstract: Lung cancer continues to be the leading cause of cancer-related deaths globally. Early detection and diagnosis of pulmonary nodules are essential for improving patient survival rates. Although previous research has integrated multimodal and multi-temporal information, outperforming single-modality and single-time-point approaches, the fusion methods are limited to inefficient vector concatenation and simple mutual attention, highlighting the need for more effective multimodal information fusion. To address these challenges, we introduce a Dual-Graph Spatiotemporal Attention Network, which leverages temporal variations and multimodal data to enhance the accuracy of predictions. Our methodology involves developing a Global-Local Feature Encoder to better capture the local, global, and fused characteristics of pulmonary nodules. Additionally, a Dual-Graph Construction method organizes multimodal features into inter-modal and intra-modal graphs. Furthermore, a Hierarchical Cross-Modal Graph Fusion Module is introduced to refine feature integration. We also compiled a novel multimodal dataset named the NLST-cmst dataset as a comprehensive source of support for related research. Our extensive experiments, conducted on both the NLST-cmst and curated CSTL-derived datasets, demonstrate that our DGSAN significantly outperforms state-of-the-art methods in classifying pulmonary nodules with exceptional computational efficiency.
[44] Benchmarking and Enhancing VLM for Compressed Image Understanding
Zifu Zhang,Tongda Xu,Siqi Li,Shengxi Li,Yue Zhang,Mai Xu,Yan Wang
Main category: cs.CV
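TL;DR: This paper presents the first comprehensive benchmark for evaluating vision-language models (VLMs) on compressed images, analyzes the sources of the performance gap, and proposes a universal VLM adaptor that improves VLM performance by 10%-30% on compressed images across codecs and bitrates.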
Details
Motivation: As vision-language models (VLMs) develop and demand for their applications grows, efficient compression of image inputs is becoming increasingly important, yet existing VLMs mainly process high-bitrate compressed images and their ability to understand low-bitrate compressed images has not been fully explored. Method: A benchmark of over one million compressed images is built, covering several widely used image codecs and diverse tasks; the performance gap is analyzed by separating information loss from VLM generalization failure; a universal VLM adaptor is proposed to improve model performance on compressed images. Result: Of the two sources of the gap on compressed images, only the generalization gap can be mitigated; the proposed universal adaptor improves VLM performance by 10%-30% across codecs and bitrates. Conclusion: The proposed benchmark and enhancement method offer valuable insights and a solution for bridging the gap between VLMs and compressed images. Abstract: With the rapid development of Vision-Language Models (VLMs) and the growing demand for their applications, efficient compression of the image inputs has become increasingly important. Existing VLMs predominantly digest and understand high-bitrate compressed images, while their ability to interpret low-bitrate compressed images has yet to be explored. In this paper, we introduce the first comprehensive benchmark to evaluate the ability of VLMs on compressed images, covering widely used image codecs and a diverse set of tasks and encompassing over one million compressed images. Next, we analyse the source of the performance gap by categorising the gap into a) the information loss during compression and b) the generalisation failure of VLMs. We visualize these gaps with concrete examples and identify that for compressed images, only the generalization gap can be mitigated. Finally, we propose a universal VLM adaptor to enhance model performance on images compressed by existing codecs. Consequently, we demonstrate that a single adaptor can improve VLM performance across images with varying codecs and bitrates by 10%-30%. We believe that our benchmark and enhancement method provide valuable insights and contribute toward bridging the gap between VLMs and compressed images.
[45] PanoGrounder: Bridging 2D and 3D with Panoramic Scene Representations for VLM-based 3D Visual Grounding
Seongmin Jung,Seongho Choi,Gunwoo Jeon,Minsu Cho,Jongwoo Lim
Main category: cs.CV
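TL;DR: Proposes PanoGrounder, a generalizable 3D visual grounding framework built on panoramic representations and pretrained 2D vision-language models, which reaches SOTA on multiple datasets and generalizes strongly.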
Details
Motivation: Conventional 3D visual grounding models depend on explicit 3D geometry and are held back by scarce 3D vision-language datasets and limited reasoning ability, so they generalize poorly. Method: Panoramic renderings augmented with 3D semantic and geometric features serve as an intermediate representation between 2D and 3D and are paired with a pretrained 2D vision-language model in a three-stage pipeline: place panoramic viewpoints, ground the text query on each panoramic rendering, and fuse the per-view predictions to improve 3D grounding. Result: The method achieves SOTA results on ScanRefer and Nr3D and generalizes better to unseen 3D datasets and text rephrasings. Conclusion: Through a panoramic intermediate representation, PanoGrounder effectively combines 2D VLMs with 3D scene understanding, significantly improving the performance and generalization of 3D visual grounding. Abstract: 3D Visual Grounding (3DVG) is a critical bridge from vision-language perception to robotics, requiring both language understanding and 3D scene reasoning. Traditional supervised models leverage explicit 3D geometry but exhibit limited generalization, owing to the scarcity of 3D vision-language datasets and the limited reasoning capabilities compared to modern vision-language models (VLMs). We propose PanoGrounder, a generalizable 3DVG framework that couples multi-modal panoramic representation with pretrained 2D VLMs for strong vision-language reasoning. Panoramic renderings, augmented with 3D semantic and geometric features, serve as an intermediate representation between 2D and 3D, and offer two major benefits: (i) they can be directly fed to VLMs with minimal adaptation and (ii) they retain long-range object-to-object relations thanks to their 360-degree field of view. We devise a three-stage pipeline that places a compact set of panoramic viewpoints considering the scene layout and geometry, grounds a text query on each panoramic rendering with a VLM, and fuses per-view predictions into a single 3D bounding box via lifting. Our approach achieves state-of-the-art results on ScanRefer and Nr3D, and demonstrates superior generalization to unseen 3D datasets and text rephrasings.
[46] Self-supervised Multiplex Consensus Mamba for General Image Fusion
Yingying Wang,Rongjin Zhuang,Hui Zheng,Xuanhua He,Ke Cao,Xiaotong Tu,Xinghao Ding
Main category: cs.CV
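TL;DR: Proposes SMC-Mamba, a self-supervised multiplex consensus Mamba framework for general image fusion. Modality-agnostic feature enhancement and a multiplex consensus cross-modal Mamba module integrate complementary multimodal information, and a bi-level self-supervised contrastive learning loss preserves high-frequency information while improving downstream task performance. Experiments show the method outperforms state-of-the-art methods across a range of image fusion tasks.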
Details
Motivation: General image fusion must adapt to many tasks and improve performance without adding complexity, whereas existing methods mostly target specific tasks and struggle to combine broad applicability with efficiency. Method: Proposes the SMC-Mamba framework, comprising the MAFE module (adaptive gating to preserve local detail, plus spatial-channel and frequency-rotational scanning to enhance global representations) and the MCCM module (cross-modal feature interaction through dynamic collaboration among multiple experts and cross-modal scanning), together with the BSCL loss, which preserves high-frequency information in a self-supervised manner. Result: Surpasses SOTA methods on infrared-visible, medical, multi-focus, and multi-exposure fusion, and also improves downstream tasks such as object detection and semantic segmentation. Conclusion: Through its self-supervised multiplex consensus mechanism, SMC-Mamba achieves efficient general image fusion, performs well across fusion scenarios and downstream tasks, and shows strong generality and application potential. Abstract: Image fusion integrates complementary information from different modalities to generate high-quality fused images, thereby enhancing downstream tasks such as object detection and semantic segmentation. Unlike task-specific techniques that primarily focus on consolidating inter-modal information, general image fusion needs to address a wide range of tasks while improving performance without increasing complexity. To achieve this, we propose SMC-Mamba, a Self-supervised Multiplex Consensus Mamba framework for general image fusion. Specifically, the Modality-Agnostic Feature Enhancement (MAFE) module preserves fine details through adaptive gating and enhances global representations via spatial-channel and frequency-rotational scanning. The Multiplex Consensus Cross-modal Mamba (MCCM) module enables dynamic collaboration among experts, reaching a consensus to efficiently integrate complementary information from multiple modalities. The cross-modal scanning within MCCM further strengthens feature interactions across modalities, facilitating seamless integration of critical information from both sources. Additionally, we introduce a Bi-level Self-supervised Contrastive Learning Loss (BSCL), which preserves high-frequency information without increasing computational overhead while simultaneously boosting performance in downstream tasks. Extensive experiments demonstrate that our approach outperforms state-of-the-art (SOTA) image fusion algorithms in tasks such as infrared-visible, medical, multi-focus, and multi-exposure fusion, as well as downstream visual tasks.
[47] Quantile Rendering: Efficiently Embedding High-dimensional Feature on 3D Gaussian Splatting
Yoonwoo Jeong,Cheng Sun,Frank Wang,Minsu Cho,Jaesung Choe
Main category: cs.CV
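TL;DR: Proposes Quantile Rendering (Q-Render), a new rendering strategy, and Gaussian Splatting Network (GS-Net) to handle high-dimensional features of 3D Gaussians efficiently while maintaining high fidelity, addressing the information loss in open-vocabulary segmentation. Experiments show the method outperforms existing techniques on ScanNet and LeRF with a speedup of about 43.7x.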
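To make the dense-versus-sparse contrast in this entry concrete, the sketch below computes conventional front-to-back alpha-compositing weights for the Gaussians hit by one ray, then renders a feature from only the few Gaussians with the largest weights. The selection rule used here (top-k by weight, rescaled to the original total weight) is an assumption chosen for illustration; it is not Q-Render's actual quantile-based rule, which the abstract does not spell out. All function names and inputs are hypothetical.

```python
def compositing_weights(alphas):
    """Front-to-back weights w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights, transmittance = [], 1.0
    for a in alphas:
        weights.append(a * transmittance)
        transmittance *= 1.0 - a
    return weights


def render_dense(alphas, features):
    """Conventional volume rendering: blend every Gaussian on the ray."""
    w = compositing_weights(alphas)
    dim = len(features[0])
    return [sum(w[i] * features[i][d] for i in range(len(w))) for d in range(dim)]


def render_sparse(alphas, features, k):
    """Illustrative stand-in: blend only the k Gaussians with the largest weights."""
    w = compositing_weights(alphas)
    keep = sorted(range(len(w)), key=lambda i: w[i], reverse=True)[:k]
    total_all = sum(w)
    total_kept = sum(w[i] for i in keep)
    # Rescale so the kept weights sum to the original total weight.
    scale = total_all / total_kept if total_kept > 0 else 0.0
    dim = len(features[0])
    return [scale * sum(w[i] * features[i][d] for i in keep) for d in range(dim)]
```

The cost of the dense version grows with the number of intersected Gaussians times the feature dimension, which is why touching fewer Gaussians matters most for high-dimensional features such as the 512-D maps mentioned in the abstract.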
Details
Motivation: Existing 3D open-vocabulary segmentation methods rely on codebooks or compression when rendering high-dimensional features, which loses information and degrades segmentation quality; a method is needed that handles high-dimensional features efficiently without sacrificing accuracy. Method: Proposes Quantile Rendering (Q-Render), which sparsely samples only the 3D Gaussians with dominant influence along each ray instead of densely sampling every intersected Gaussian as conventional volume rendering does; also builds GS-Net, a generalizable 3D neural network that predicts Gaussian features. Result: Outperforms current state-of-the-art methods on ScanNet and LeRF while enabling real-time rendering with a speedup of about 43.7x on 512-dimensional feature maps. Conclusion: Combining Q-Render with GS-Net handles high-dimensional semantic features of 3D Gaussians efficiently at high fidelity, giving a fast and accurate framework for open-vocabulary 3D segmentation. Abstract: Recent advancements in computer vision have successfully extended Open-vocabulary segmentation (OVS) to the 3D domain by leveraging 3D Gaussian Splatting (3D-GS). Despite this progress, efficiently rendering the high-dimensional features required for open-vocabulary queries poses a significant challenge. Existing methods employ codebooks or feature compression, causing information loss, thereby degrading segmentation quality. To address this limitation, we introduce Quantile Rendering (Q-Render), a novel rendering strategy for 3D Gaussians that efficiently handles high-dimensional features while maintaining high fidelity. Unlike conventional volume rendering, which densely samples all 3D Gaussians intersecting each ray, Q-Render sparsely samples only those with dominant influence along the ray. By integrating Q-Render into a generalizable 3D neural network, we also propose Gaussian Splatting Network (GS-Net), which predicts Gaussian features in a generalizable manner. Extensive experiments on ScanNet and LeRF demonstrate that our framework outperforms state-of-the-art methods, while enabling real-time rendering with an approximately 43.7x speedup on 512-D feature maps. Code will be made publicly available.
[48] Transductive Visual Programming: Evolving Tool Libraries from Experience for Spatial Reasoning
Shengguang Wu,Xiaohan Wang,Yuhui Zhang,Hao Zhu,Serena Yeung-Levy
Main category: cs.CV
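TL;DR: Proposes Transductive Visual Programming (TVP), a new visual programming framework that builds new tools from experience rather than generating them speculatively, achieving state-of-the-art performance on 3D spatial reasoning tasks.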
Details
Motivation: Existing visual programming methods rely on fixed or speculatively induced toolsets, which yields suboptimal programs and low tool utilization and makes complex spatial reasoning tasks hard to handle. Method: TVP first solves problems with basic tools and stores successful solutions in an Example Library, then abstracts recurring patterns from them into reusable higher-level tools, forming an evolving Tool Library. Result: On Omni3D-Bench, outperforms GPT-4o by 22% and the previous best visual programming system by 11%; the learned tools are used 5x more frequently and generalize strongly to unseen tasks such as SpatialScore-Hard. Conclusion: Experience-driven transductive tool creation is a powerful paradigm for self-evolving visual programming agents that tackle complex spatial reasoning problems. Abstract: Spatial reasoning in 3D scenes requires precise geometric calculations that challenge vision-language models. Visual programming addresses this by decomposing problems into steps calling specialized tools, yet existing methods rely on either fixed toolsets or speculative tool induction before solving problems, resulting in suboptimal programs and poor utilization of induced tools. We present Transductive Visual Programming (TVP), a novel framework that builds new tools from its own experience rather than speculation. TVP first solves problems using basic tools while accumulating experiential solutions into an Example Library, then abstracts recurring patterns from these programs into reusable higher-level tools for an evolving Tool Library. This allows TVP to tackle new problems with increasingly powerful tools learned from experience. On Omni3D-Bench, TVP achieves state-of-the-art performance, outperforming GPT-4o by 22% and the previous best visual programming system by 11%. Our transductively learned tools are used 5x more frequently as core program dependencies than inductively created ones, demonstrating more effective tool discovery and reuse. The evolved tools also show strong generalization to unseen spatial tasks, achieving superior performance on benchmarks from the SpatialScore-Hard collection without any testset-specific modification. Our work establishes experience-driven transductive tool creation as a powerful paradigm for building self-evolving visual programming agents that effectively tackle challenging spatial reasoning tasks. We release our code at https://transductive-visualprogram.github.io/.
[49] Reasoning-Driven Amodal Completion: Collaborative Agents and Perceptual Evaluation
Hongxing Fan,Shuyu Zhao,Jiayang Ao,Lu Sheng
Main category: cs.CV
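TL;DR: Proposes a collaborative multi-agent reasoning framework for the semantic consistency and structural integrity problems in amodal completion; by decoupling semantic planning from visual synthesis and adding a self-correcting verification mechanism and diverse hypothesis generation, it significantly outperforms existing methods.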
Details
Motivation: Existing progressive approaches to amodal completion suffer from inference instability and error accumulation, making it hard to maintain semantic consistency and structural integrity. Method: Proposes a collaborative multi-agent framework that separates semantic planning from visual synthesis; specialized agents reason upfront to produce a structured plan; a self-correcting Verification Agent (based on chain-of-thought reasoning) and a Diverse Hypothesis Generator improve robustness and diversity. Result: Significantly surpasses current state-of-the-art methods on multiple datasets; introduces a new evaluation metric, MAC-Score, validated against human judgment and ground truth, that better measures structural completeness and semantic consistency. Conclusion: By explicitly decoupling planning from generation and introducing agent collaboration and a new evaluation standard, the framework improves the quality and trustworthiness of amodal completion. Abstract: Amodal completion, the task of inferring invisible object parts, faces significant challenges in maintaining semantic consistency and structural integrity. Prior progressive approaches are inherently limited by inference instability and error accumulation. To tackle these limitations, we present a Collaborative Multi-Agent Reasoning Framework that explicitly decouples Semantic Planning from Visual Synthesis. By employing specialized agents for upfront reasoning, our method generates a structured, explicit plan before pixel generation, enabling visually and semantically coherent single-pass synthesis. We integrate this framework with two critical mechanisms: (1) a self-correcting Verification Agent that employs Chain-of-Thought reasoning to rectify visible region segmentation and identify residual occluders strictly within the Semantic Planning phase, and (2) a Diverse Hypothesis Generator that addresses the ambiguity of invisible regions by offering diverse, plausible semantic interpretations, surpassing the limited pixel-level variations of standard random seed sampling. Furthermore, addressing the limitations of traditional metrics in assessing inferred invisible content, we introduce the MAC-Score (MLLM Amodal Completion Score), a novel human-aligned evaluation metric. Validated against human judgment and ground truth, this metric establishes a robust standard for assessing structural completeness and semantic consistency with visible context. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods across multiple datasets. Our project is available at: https://fanhongxing.github.io/remac-page.
[50] Beyond Artifacts: Real-Centric Envelope Modeling for Reliable AI-Generated Image Detection
Ruiqi Liu,Yi Han,Zhengbo Zhang,Liwei Yao,Zhiyuan Yan,Jialiang Shen,ZhiJin Chen,Boyi Sun,Lubin Weng,Jing Dong,Yan Wang,Shu Wu
Main category: cs.CV
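TL;DR: Proposes REM, a new paradigm for synthetic image detection that models the distribution of real images rather than generator artifacts, giving stronger generalization under real-world degradations, and builds the RealChain benchmark covering a variety of degradation scenarios.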
Details
Motivation: Existing detectors depend too heavily on generator-specific artifact cues and are highly sensitive to real-world image degradations (such as multi-platform sharing and post-processing), so their performance drops in practice. Method: Proposes Real-centric Envelope Modeling (REM), which generates near-real samples through feature-level perturbations in self-reconstruction and uses an envelope estimator with cross-domain consistency to learn a boundary enclosing the real image manifold, avoiding reliance on generator artifacts. Result: Improves on existing state-of-the-art methods by 7.5% on average across eight benchmarks and generalizes exceptionally well on the severely degraded RealChain benchmark. Conclusion: By focusing on modeling the real image distribution, REM offers a more robust and more generalizable paradigm for synthetic image detection and a solid foundation for detection in real-world settings. Abstract: The rapid progress of generative models has intensified the need for reliable and robust detection under real-world conditions. However, existing detectors often overfit to generator-specific artifacts and remain highly sensitive to real-world degradations. As generative architectures evolve and images undergo multi-round cross-platform sharing and post-processing (chain degradations), these artifact cues become obsolete and harder to detect. To address this, we propose Real-centric Envelope Modeling (REM), a new paradigm that shifts detection from learning generator artifacts to modeling the robust distribution of real images. REM introduces feature-level perturbations in self-reconstruction to generate near-real samples, and employs an envelope estimator with cross-domain consistency to learn a boundary enclosing the real image manifold. We further build RealChain, a comprehensive benchmark covering both open-source and commercial generators with simulated real-world degradation. Across eight benchmark evaluations, REM achieves an average improvement of 7.5% over state-of-the-art methods, and notably maintains exceptional generalization on the severely degraded RealChain benchmark, establishing a solid foundation for synthetic image detection under real-world conditions. The code and the RealChain benchmark will be made publicly available upon acceptance of the paper.
[51] SPOT!: Map-Guided LLM Agent for Unsupervised Multi-CCTV Dynamic Object Tracking
Yujin Noh,Inho Jake Park,Chigon Hwang
Main category: cs.CV
TL;DR: Proposes SPOT, a map-guided LLM agent that tracks vehicle trajectories continuously through blind spots in multi-CCTV environments without prior training.
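This entry's abstract describes a beam search at the intersection level to find the CCTV a vehicle is most likely to enter after a blind spot. A minimal sketch of that idea follows, assuming a road graph whose edges carry turn probabilities. In SPOT those likelihoods would come from map information combined with direction, speed, and driving patterns; here they are plain hypothetical numbers, and the function name, graph layout, and parameters are illustrative only.

```python
import math


def beam_search_next_cctv(graph, cctv_nodes, start, beam_width=2, max_steps=4):
    """Rank CCTV nodes by the probability of the best path found from `start`.

    graph: {node: [(neighbor, probability), ...]}  (hypothetical turn probabilities)
    cctv_nodes: set of nodes covered by a camera
    """
    beam = [(0.0, [start])]  # (log-probability of path, path)
    reached = {}             # cctv node -> best log-probability seen
    for _ in range(max_steps):
        candidates = []
        for logp, path in beam:
            for nxt, p in graph.get(path[-1], []):
                if p <= 0 or nxt in path:  # skip impossible turns and loops
                    continue
                new_logp = logp + math.log(p)
                if nxt in cctv_nodes:
                    # The vehicle re-enters camera coverage; stop expanding here.
                    reached[nxt] = max(reached.get(nxt, -math.inf), new_logp)
                else:
                    candidates.append((new_logp, path + [nxt]))
        # Keep only the most likely partial paths.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if not beam:
            break
    ranked = [(node, math.exp(lp)) for node, lp in reached.items()]
    return sorted(ranked, key=lambda x: x[1], reverse=True)
```

A small beam width keeps the search cheap at each intersection, at the cost of possibly pruning a path that would have become the most likely one later.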
Details
Motivation: Gaps between CCTVs and limited fields of view (FOV) create blind spots in which vehicles suffer ID switching and trajectory loss, reducing the reliability of real-time path prediction. Method: Represents road structure and CCTV placement as documents based on 2D spatial coordinates, organized by chunking to support real-time querying; converts vehicle positions into the real-world coordinate system using the relative position of objects in CCTV images and FOV information; and, combining moving direction, speed, and driving patterns, runs a beam search at the intersection level to predict the next CCTV the vehicle is most likely to enter. Result: Experiments in a virtual city built with the CARLA simulator show that the method accurately predicts the CCTV where a vehicle is most likely to reappear after a blind spot, maintaining trajectory continuity more effectively than existing techniques. Conclusion: SPOT addresses blind spots in multi-camera vehicle tracking without training, markedly improving trajectory continuity and the reliability of path prediction. Abstract: CCTV-based vehicle tracking systems face structural limitations in continuously connecting the trajectories of the same vehicle across multiple camera environments. In particular, blind spots occur due to the intervals between CCTVs and limited Fields of View (FOV), which leads to object ID switching and trajectory loss, thereby reducing the reliability of real-time path prediction. This paper proposes SPOT (Spatial Prediction Over Trajectories), a map-guided LLM agent capable of tracking vehicles even in blind spots of multi-CCTV environments without prior training. The proposed method represents road structures (Waypoints) and CCTV placement information as documents based on 2D spatial coordinates and organizes them through chunking techniques to enable real-time querying and inference. Furthermore, it transforms the vehicle's position into the actual world coordinate system using the relative position and FOV information of objects observed in CCTV images. By combining map spatial information with the vehicle's moving direction, speed, and driving patterns, a beam search is performed at the intersection level to derive candidate CCTV locations where the vehicle is most likely to enter after the blind spot. Experimental results based on the CARLA simulator in a virtual city environment confirmed that the proposed method accurately predicts the next appearing CCTV even in blind spot sections, maintaining continuous vehicle trajectories more effectively than existing techniques.
[52] XGrid-Mapping: Explicit Implicit Hybrid Grid Submaps for Efficient Incremental Neural LiDAR Mapping
Zeqing Song,Zhongmiao Yan,Junyuan Deng,Songpengcheng Xia,Xiang Mu,Jingyi Xu,Qi Wu,Ling Pei
Main category: cs.CV
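TL;DR: Proposes XGrid-Mapping, a hybrid grid framework combining explicit and implicit representations for efficient large-scale incremental neural LiDAR mapping: a sparse grid supplies geometric priors, an implicit dense grid enriches the scene representation, and a distillation-based overlap alignment strategy and a dynamic removal module enable high-quality, efficient real-time mapping.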
Details
Motivation: Existing neural LiDAR mapping methods either rely on dense implicit representations and neglect geometric structure, or, as voxel-guided methods, are too computationally expensive for real-time performance; an efficient and accurate solution for large-scale incremental mapping is needed. Method: Proposes XGrid-Mapping, combining a sparse grid (geometric priors) with an implicit dense grid (richer representation); uses the VDB structure with a submap-based organization to reduce computational load; introduces a distillation-based overlap alignment strategy for consistency across submaps and a dynamic removal module to improve sampling efficiency and robustness. Result: Experiments show that the method maintains high mapping quality while substantially improving efficiency, overcoming the performance bottleneck of voxel-guided methods and outperforming existing state-of-the-art methods on large-scale incremental LiDAR mapping. Conclusion: By fusing explicit and implicit representations, XGrid-Mapping achieves efficient, consistent, and scalable neural LiDAR mapping, providing a reliable basis for environment perception in large-scale autonomous systems. Abstract: Large-scale incremental mapping is fundamental to the development of robust and reliable autonomous systems, as it underpins incremental environmental understanding with sequential inputs for navigation and decision-making. LiDAR is widely used for this purpose due to its accuracy and robustness. Recently, neural LiDAR mapping has shown impressive performance; however, most approaches rely on dense implicit representations and underutilize geometric structure, while existing voxel-guided methods struggle to achieve real-time performance. To address these challenges, we propose XGrid-Mapping, a hybrid grid framework that jointly exploits explicit and implicit representations for efficient neural LiDAR mapping. Specifically, the strategy combines a sparse grid, providing geometric priors and structural guidance, with an implicit dense grid that enriches scene representation. By coupling the VDB structure with a submap-based organization, the framework reduces computational load and enables efficient incremental mapping on a large scale. To mitigate discontinuities across submaps, we introduce a distillation-based overlap alignment strategy, in which preceding submaps supervise subsequent ones to ensure consistency in overlapping regions. To further enhance robustness and sampling efficiency, we incorporate a dynamic removal module. Extensive experiments show that our approach delivers superior mapping quality while overcoming the efficiency limitations of voxel-guided methods, thereby outperforming existing state-of-the-art mapping methods.
[53] X-ray Insights Unleashed: Pioneering the Enhancement of Multi-Label Long-Tail Data
Xinquan Yang,Jinheng Xie,Yawen Huang,Yuexiang Li,Huimin Huang,Hao Zheng,Xian Wu,Yefeng Zheng,Linlin Shen
Main category: cs.CV
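TL;DR: Proposes a new data synthesis pipeline that uses abundant normal X-rays to strengthen the representation of tail lesions: a pretrained diffusion model inpaints the head lesions in diseased X-rays, preserving the tail classes as augmented training data, and a large language model knowledge guidance module plus a progressive incremental learning strategy stabilize the inpainting fine-tuning. Experiments on the MIMIC and CheXpert lung datasets show superior performance.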
Details
Motivation: Because rare lesion samples are scarce, existing diffusion-based methods are limited in their ability to generate tail lesions, leaving diagnostic precision unsatisfactory; a new way to strengthen tail-lesion representation is needed. Method: Proposes a new data synthesis pipeline: first trains a diffusion model on a large number of normal X-rays to generate normal X-ray images; then uses this pretrained model to inpaint head-lesion regions in diseased X-rays, preserving tail lesions as augmented data; and introduces a Large Language Model Knowledge Guidance (LKG) module and a Progressive Incremental Learning (PIL) strategy to stabilize fine-tuning. Result: Comprehensive evaluation on the public lung datasets MIMIC and CheXpert shows that the method clearly outperforms existing methods on tail-lesion recognition and sets a new performance benchmark. Conclusion: The method addresses the shortage of rare lesion samples among long-tailed pulmonary anomalies; using normal images to strengthen tail-lesion representation improves diagnostic performance and offers a new approach to long-tailed distributions in medical imaging. Abstract: Long-tailed pulmonary anomalies in chest radiography present formidable diagnostic challenges. Despite the recent strides in diffusion-based methods for enhancing the representation of tail lesions, the paucity of rare lesion exemplars curtails the generative capabilities of these approaches, thereby leaving the diagnostic precision less than optimal. In this paper, we propose a novel data synthesis pipeline designed to augment tail lesions utilizing a copious supply of conventional normal X-rays. Specifically, a sufficient quantity of normal samples is amassed to train a diffusion model capable of generating normal X-ray images. This pre-trained diffusion model is subsequently utilized to inpaint the head lesions present in the diseased X-rays, thereby preserving the tail classes as augmented training data. Additionally, we propose the integration of a Large Language Model Knowledge Guidance (LKG) module alongside a Progressive Incremental Learning (PIL) strategy to stabilize the inpainting fine-tuning process. Comprehensive evaluations conducted on the public lung datasets MIMIC and CheXpert demonstrate that the proposed method sets a new benchmark in performance.
[54] PUFM++: Point Cloud Upsampling via Enhanced Flow Matching
Zhi-Song Liu,Chenhang He,Roland Maier,Andreas Rupp
Main category: cs.CV
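TL;DR: PUFM++ is an enhanced flow-matching framework for reconstructing dense, accurate point clouds from sparse, noisy, and partial observations, with improvements in geometric fidelity, robustness, and consistency with downstream tasks.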
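The first stage described in this entry's abstract learns a direct, straight-path flow from sparse inputs to dense targets. The sketch below shows the standard straight-path flow-matching construction for a single point, assuming linear interpolation between a source point `x0` and a target point `x1`. It is a generic illustration, not PUFM++'s implementation: the pairing of sparse and dense points, the noise-perturbed refinement stage, the adaptive time scheduler, and the on-manifold constraints are all omitted, and every name here is hypothetical.

```python
def straight_path_sample(x0, x1, t):
    """Point on the straight path x_t = (1 - t) * x0 + t * x1 and its velocity x1 - x0."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return xt, velocity


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between a predicted velocity and the straight-path target."""
    xt, target = straight_path_sample(x0, x1, t)
    pred = predict_velocity(xt, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_integrate(predict_velocity, x0, steps=10):
    """Move a point from t = 0 to t = 1 with fixed-size Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Because the target velocity is constant along a straight path, a well-trained model can in principle be integrated with few steps, which is the usual motivation for this construction.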
Details
Motivation: Existing point cloud upsampling methods, given sparse, noisy, and incomplete inputs, suffer from geometric distortion, generated points that drift off the surface, and weak support for downstream tasks; a more robust, higher-fidelity generative model is needed. Method: Proposes a two-stage flow-matching strategy: stage one learns a direct straight-path flow from sparse inputs to dense targets, and stage two refines the terminal marginal distribution using noise-perturbed samples; adds a data-driven adaptive time scheduler for sampling efficiency, imposes on-manifold constraints during sampling to keep generated points on the object surface, and uses a recurrent interface network (RIN) to strengthen hierarchical feature interactions. Result: Achieves the best performance on both synthetic benchmarks and real-world scans, clearly ahead of existing methods, with higher visual quality and quantitative accuracy. Conclusion: With these improvements, PUFM++ substantially raises the quality and robustness of point cloud upsampling, sets the current state of the art, and offers better consistency for downstream surface reconstruction tasks. Abstract: Recent advances in generative modeling have demonstrated strong promise for high-quality point cloud upsampling. In this work, we present PUFM++, an enhanced flow-matching framework for reconstructing dense and accurate point clouds from sparse, noisy, and partial observations. PUFM++ improves flow matching along three key axes: (i) geometric fidelity, (ii) robustness to imperfect input, and (iii) consistency with downstream surface-based tasks. We introduce a two-stage flow-matching strategy that first learns a direct, straight-path flow from sparse inputs to dense targets, and then refines it using noise-perturbed samples to approximate the terminal marginal distribution better. To accelerate and stabilize inference, we propose a data-driven adaptive time scheduler that improves sampling efficiency based on interpolation behavior. We further impose on-manifold constraints during sampling to ensure that generated points remain aligned with the underlying surface. Finally, we incorporate a recurrent interface network (RIN) to strengthen hierarchical feature interactions and boost reconstruction quality. Extensive experiments on synthetic benchmarks and real-world scans show that PUFM++ sets a new state of the art in point cloud upsampling, delivering superior visual fidelity and quantitative accuracy across a wide range of tasks. Code and pretrained models are publicly available at https://github.com/Holmes-Alan/Enhanced_PUFM.
[55] MVInverse: Feed-forward Multi-view Inverse Rendering in Seconds
Xiangzuo Wu,Chengwei Ren,Jun Zhou,Xiu Li,Yuan Liu
Main category: cs.CV
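TL;DR: Proposes a feed-forward multi-view inverse rendering framework that recovers consistent geometry, materials, and illumination through alternating cross-view attention, combined with a consistency-based finetuning strategy that improves generalization to real-world scenes.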
Details
Motivation: Existing single-view inverse rendering methods ignore cross-view relationships and so produce inconsistent results, while multi-view optimization methods depend on slow differentiable rendering and per-scene optimization, making them computationally expensive and hard to scale. Method: Proposes a feed-forward network that predicts spatially varying albedo, metallic, roughness, diffuse shading, and surface normals directly from RGB image sequences; alternating attention across views models intra-view long-range lighting interactions and inter-view material consistency; a consistency-based finetuning strategy uses unlabeled real-world videos to improve robustness and multi-view consistency. Result: Achieves state-of-the-art multi-view consistency and material and normal estimation quality on multiple benchmark datasets, and generalizes better to real-world images. Conclusion: The method keeps inference efficient while markedly improving the consistency and practicality of multi-view inverse rendering, outperforming existing methods especially on real-world scenes. Abstract: Multi-view inverse rendering aims to recover geometry, materials, and illumination consistently across multiple viewpoints. When applied to multi-view images, existing single-view approaches often ignore cross-view relationships, leading to inconsistent results. In contrast, multi-view optimization methods rely on slow differentiable rendering and per-scene refinement, making them computationally expensive and hard to scale. To address these limitations, we introduce a feed-forward multi-view inverse rendering framework that directly predicts spatially varying albedo, metallic, roughness, diffuse shading, and surface normals from sequences of RGB images. By alternating attention across views, our model captures both intra-view long-range lighting interactions and inter-view material consistency, enabling coherent scene-level reasoning within a single forward pass. Due to the scarcity of real-world training data, models trained on existing synthetic datasets often struggle to generalize to real-world scenes. To overcome this limitation, we propose a consistency-based finetuning strategy that leverages unlabeled real-world videos to enhance both multi-view coherence and robustness under in-the-wild conditions. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance in terms of multi-view consistency, material and normal estimation quality, and generalization to real-world imagery.
[56] Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations
Jinghan Li,Yang Jin,Hao Jiang,Yadong Mu,Yang Song,Kun Xu
Main category: cs.CV
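TL;DR: Proposes NExT-Vid, a new autoregressive visual generative pretraining framework that jointly models images and videos through masked next-frame prediction, improving visual representation learning.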
Details
Motivation: Existing autoregressive visual pretraining methods suffer from inaccurate semantic localization and poor generation quality, and most visual methods still rely on masked modeling that ignores temporal information. Method: Proposes the NExT-Vid framework, which uses masked next-frame prediction, introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and uses a conditioned flow-matching decoder to improve generation quality and diversity. Result: Experiments on large-scale pretrained models show that, under attentive probing on downstream classification, the method consistently outperforms previous generative pretraining methods. Conclusion: Through context-isolated flow-matching pretraining, NExT-Vid achieves stronger visual representations and advances joint modeling of images and videos. Abstract: Recent advances in pretraining general foundation models have significantly improved performance across diverse downstream tasks. While autoregressive (AR) generative models like GPT have revolutionized NLP, most visual generative pretraining methods still rely on BERT-style masked modeling, which often disregards the temporal information essential for video analysis. The few existing autoregressive visual pretraining methods suffer from issues such as inaccurate semantic localization and poor generation quality, leading to poor semantics. In this work, we propose NExT-Vid, a novel autoregressive visual generative pretraining framework that utilizes masked next-frame prediction to jointly model images and videos. NExT-Vid introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and a conditioned flow-matching decoder to enhance generation quality and diversity. Through context-isolated flow-matching pretraining, our approach achieves strong representations. Extensive experiments on large-scale pretrained models demonstrate that our proposed method consistently outperforms previous generative pretraining methods for visual representation learning via attentive probing in downstream classification.
[57] Granular-ball Guided Masking: Structure-aware Data Augmentation
Shuyin Xia,Fan Chen,Dawei Dai,Meng Yang,Junwei Han,Xinbo Gao,Guoyin Wang
Main category: cs.CV
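TL;DR: Proposes GBGM, a structure-aware mask-based augmentation method built on Granular-ball Computing; a coarse-to-fine hierarchical masking strategy adaptively preserves important semantic regions, improving model robustness and classification performance.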
Details
Motivation: Existing mask-based augmentation methods lack structural awareness and tend to discard key semantic information, so models perform poorly when data are limited or distributions shift. Method: Uses Granular-ball Computing (GBC) to guide mask generation with a coarse-to-fine hierarchical masking strategy that preserves structural integrity while suppressing redundant regions, yielding structure-aware data augmentation. Result: Validated on multiple benchmarks, GBGM consistently improves image classification accuracy and masked image reconstruction, and is compatible with CNNs and Vision Transformers. Conclusion: GBGM is a simple, general, model-agnostic structure-aware augmentation method that offers a new paradigm for data augmentation. Abstract: Deep learning models have achieved remarkable success in computer vision, but they still rely heavily on large-scale labeled data and tend to overfit when data are limited or distributions shift. Data augmentation, particularly mask-based information dropping, can enhance robustness by forcing models to explore complementary cues; however, existing approaches often lack structural awareness and may discard essential semantics. We propose Granular-ball Guided Masking (GBGM), a structure-aware augmentation strategy guided by Granular-ball Computing (GBC). GBGM adaptively preserves semantically rich, structurally important regions while suppressing redundant areas through a coarse-to-fine hierarchical masking process, producing augmentations that are both representative and discriminative. Extensive experiments on multiple benchmarks demonstrate consistent improvements in classification accuracy and masked image reconstruction, confirming the effectiveness and broad applicability of the proposed method. Simple and model-agnostic, it integrates seamlessly into CNNs and Vision Transformers and provides a new paradigm for structure-aware data augmentation.
[58] FluencyVE: Marrying Temporal-Aware Mamba with Bypass Attention for Video Editing
Mingshu Cai,Yixuan Li,Osamu Yoshie,Yuya Ieiri
Main category: cs.CV
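TL;DR: Proposes FluencyVE, a Mamba-based one-shot video editing method that replaces the temporal attention mechanism to achieve efficient, temporally consistent video editing.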
Details
Motivation: Existing video editing methods built on pretrained text-to-image models suffer from temporal inconsistency and high computational overhead. Method: Integrates Mamba, a linear time-series model, into a Stable Diffusion-based video editing framework in place of the temporal attention layer, and uses low-rank approximation matrices and a weighted averaging technique to reduce computation. Result: Achieves high-quality editing of various attributes, subjects, and locations in real-world videos while substantially lowering computational cost and improving temporal consistency. Conclusion: FluencyVE is a simple yet effective method that improves the efficiency and fluency of video editing while preserving generative power. Abstract: Large-scale text-to-image diffusion models have achieved unprecedented success in image generation and editing. However, extending this success to video editing remains challenging. Recent video editing efforts have adapted pretrained text-to-image models by adding temporal attention mechanisms to handle video tasks. Unfortunately, these methods continue to suffer from temporal inconsistency issues and high computational overheads. In this study, we propose FluencyVE, a simple yet effective one-shot video editing approach. FluencyVE integrates the linear time-series module, Mamba, into a video editing model based on pretrained Stable Diffusion models, replacing the temporal attention layer. This enables global frame-level attention while reducing the computational costs. In addition, we employ low-rank approximation matrices to replace the query and key weight matrices in the causal attention, and use a weighted averaging technique during training to update the attention scores. This approach largely preserves the generative power of the text-to-image model while effectively reducing the computational burden. Experiments and analyses demonstrate promising results in editing various attributes, subjects, and locations in real-world videos.
[59] Efficient and Robust Video Defense Framework against 3D-field Personalized Talking Face
Rui-qing Sun,Xingshan Yao,Tian Lan,Hui-Yang Zhao,Jia-Ling Shi,Chen-Hao Cui,Zhijing Wu,Chen Yang,Xian-Ling Mao
Main category: cs.CV
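TL;DR: Proposes an efficient defense framework against 3D-field video-referenced talking face generation methods, protecting portrait videos by perturbing the 3D information acquisition process while maintaining high-fidelity video quality.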
Details
Motivation: Existing image-based defenses are computationally expensive, degrade video quality, and fail to disrupt 3D information; no effective defense framework targets 3D-field TFG methods. Method: Introduces a similarity-guided parameter sharing mechanism for efficiency and a multi-scale dual-domain attention module that jointly optimizes spatial-frequency perturbations, protecting 3D information efficiently. Result: Experiments show strong defense capability, a 47x acceleration over the fastest baseline, high fidelity, and robustness against scaling operations and state-of-the-art purification attacks. Conclusion: The framework is efficient and practical for protecting personal portrait videos from misuse by 3D-field TFG methods and has good application prospects. Abstract: State-of-the-art 3D-field video-referenced Talking Face Generation (TFG) methods synthesize high-fidelity personalized talking-face videos in real time by modeling 3D geometry and appearance from a reference portrait video. This capability raises significant privacy concerns regarding malicious misuse of personal portraits. However, no efficient defense framework exists to protect such videos against 3D-field TFG methods. While image-based defenses could apply per-frame 2D perturbations, they incur prohibitive computational costs and severe video quality degradation, and fail to disrupt 3D information for video protection. To address this, we propose a novel and efficient video defense framework against 3D-field TFG methods, which protects portrait video by perturbing the 3D information acquisition process while maintaining high-fidelity video quality. Specifically, our method introduces: (1) a similarity-guided parameter sharing mechanism for computational efficiency, and (2) a multi-scale dual-domain attention module to jointly optimize spatial-frequency perturbations. Extensive experiments demonstrate that our proposed framework exhibits strong defense capability and achieves a 47x acceleration over the fastest baseline while maintaining high fidelity. Moreover, it remains robust against scaling operations and state-of-the-art purification attacks, and the effectiveness of our design choices is further validated through ablation studies. Our project is available at https://github.com/Richen7418/VDF.
[60] Multi-Attribute guided Thermal Face Image Translation based on Latent Diffusion Model
Mingshu Cai,Osamu Yoshie,Yuya Ieiri
Main category: cs.CV
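TL;DR: Proposes a new method based on a latent diffusion model and a Self-attn Mamba module for generating high-quality visible-light face images from thermal images, preserving identity features while substantially improving heterogeneous face recognition performance.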
Details
Motivation: Because of the large domain shift between infrared and visible images, existing face recognition models degrade severely on infrared images, and traditional generative methods tend to cause feature loss and distortion. Method: Uses a latent diffusion model to generate visible-light face images, adds a multi-attribute classifier to preserve key facial attributes, and designs a Self-attn Mamba module that strengthens cross-modal feature modeling and speeds up inference. Result: Achieves state-of-the-art image quality and identity preservation on two benchmark datasets. Conclusion: The method mitigates feature loss and modality discrepancy in infrared-to-visible face image translation, substantially improving both the accuracy of heterogeneous face recognition and generation quality. Abstract: Modern surveillance systems increasingly rely on multi-wavelength sensors and deep neural networks to recognize faces in infrared images captured at night. However, most facial recognition models are trained on visible light datasets, leading to substantial performance degradation on infrared inputs due to significant domain shifts. Early feature-based methods for infrared face recognition proved ineffective, prompting researchers to adopt generative approaches that convert infrared images into visible light images for improved recognition. This paradigm, known as Heterogeneous Face Recognition (HFR), faces challenges such as model and modality discrepancies, leading to distortion and feature loss in generated images. To address these limitations, this paper introduces a novel latent diffusion-based model designed to generate high-quality visible face images from thermal inputs while preserving critical identity features. A multi-attribute classifier is incorporated to extract key facial attributes from visible images, mitigating feature loss during infrared-to-visible image restoration. Additionally, we propose the Self-attn Mamba module, which enhances global modeling of cross-modal features and significantly improves inference speed. Experimental results on two benchmark datasets demonstrate the superiority of our approach, achieving state-of-the-art performance in both image quality and identity preservation.
[61] Next-Scale Prediction: A Self-Supervised Approach for Real-World Image Denoising
Yiwen Shan,Haiyu Zhao,Peng Hu,Xi Peng,Yuanbiao Gou
Main category: cs.CV
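TL;DR: Proposes Next-Scale Prediction (NSP), a new self-supervised method for real-world image denoising that uses cross-scale training pairs to decouple noise decorrelation from detail preservation, substantially improving denoising performance and supporting super-resolution without retraining.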
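This entry's abstract refers to pixel-shuffle downsampling (PD), the step blind-spot network methods use to decorrelate noise. A minimal sketch of PD on a single-channel image stored as a list of rows is given below; the function name and data layout are assumptions for illustration, and the sketch does not cover NSP's cross-scale training itself.

```python
def pixel_shuffle_downsample(image, stride):
    """Split an H x W image into stride * stride sub-images by strided sampling.

    Sub-image (dy, dx) takes every `stride`-th pixel starting at row dy, column dx,
    so pixels that were neighbours in the input land in different sub-images.
    """
    height, width = len(image), len(image[0])
    if height % stride or width % stride:
        raise ValueError("height and width must be divisible by stride")
    sub_images = []
    for dy in range(stride):
        for dx in range(stride):
            sub_images.append([row[dx::stride] for row in image[dy::stride]])
    return sub_images
```

A larger stride separates neighbouring pixels further, which weakens spatially correlated noise more, but each sub-image then samples the scene more coarsely. That is the trade-off between noise decorrelation and detail preservation that the abstract describes.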
Details
Motivation: Existing blind-spot network methods face a hard-to-balance trade-off between decorrelating spatially structured noise and preserving high-frequency details, which limits denoising quality. Method: Proposes the Next-Scale Prediction (NSP) framework, which feeds low-resolution, fully decorrelated sub-images to a blind-spot network trained to predict high-resolution images that retain fine details, forming cross-scale training pairs. Result: Reaches state-of-the-art performance on multiple real-world denoising benchmarks, eases the conflict between noise decorrelation and detail preservation, and naturally supports super-resolution of noisy images. Conclusion: NSP offers a new paradigm for self-supervised denoising that separates noise removal from detail preservation and improves on existing methods in both performance and functionality. Abstract: Self-supervised real-world image denoising remains a fundamental challenge, arising from the antagonistic trade-off between decorrelating spatially structured noise and preserving high-frequency details. Existing blind-spot network (BSN) methods rely on pixel-shuffle downsampling (PD) to decorrelate noise, but aggressive downsampling fragments fine structures, while milder downsampling fails to remove correlated noise. To address this, we introduce Next-Scale Prediction (NSP), a novel self-supervised paradigm that decouples noise decorrelation from detail preservation. NSP constructs cross-scale training pairs, where BSN takes low-resolution, fully decorrelated sub-images as input to predict high-resolution targets that retain fine details. As a by-product, NSP naturally supports super-resolution of noisy images without retraining or modification. Extensive experiments demonstrate that NSP achieves state-of-the-art self-supervised denoising performance on real-world benchmarks, significantly alleviating the long-standing conflict between noise decorrelation and detail preservation.
[62] A Large-Depth-Range Layer-Based Hologram Dataset for Machine Learning-Based 3D Computer-Generated Holography
Jaehong Lee,You Chan No,YoungWoo Kim,Duksu Kim
Main category: cs.CV
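TL;DR: This paper presents KOREATECH-CGH, a large-scale public dataset of paired RGB-D images and complex holograms covering multiple resolutions and a large depth range, introduces an amplitude projection method to improve reconstruction quality, and validates the dataset on machine-learning hologram generation and super-resolution tasks.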
Details
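Motivation: The development of machine learning-based computer-generated holography (ML-CGH) is held back by the lack of high-quality, large-scale hologram datasets, so a public, high-quality dataset suited to a wide range of 3D scenes is needed to advance the field. Method: The KOREATECH-CGH dataset contains 6,000 pairs of multi-resolution (256*256 to 2048*2048) RGB-D images and complex holograms, with depth ranges reaching the theoretical limit of the angular spectrum method; an amplitude projection post-processing technique replaces the amplitude component of the hologram wavefield at each depth layer while preserving phase, improving reconstruction fidelity. Result: Amplitude projection achieves 27.01 dB PSNR and 0.87 SSIM over a large depth range, surpassing a recent optimized silhouette-masking layer-based method by 2.03 dB and 0.04 SSIM, respectively; experiments with advanced ML models on hologram generation and super-resolution validate the dataset's usefulness. Conclusion: KOREATECH-CGH provides a high-quality, diverse training and evaluation resource for ML-CGH; combined with amplitude projection, it significantly improves hologram reconstruction quality for complex 3D scenes and supports the development of next-generation machine-learning-driven holography systems. Abstract: Machine learning-based computer-generated holography (ML-CGH) has advanced rapidly in recent years, yet progress is constrained by the limited availability of high-quality, large-scale hologram datasets. To address this, we present KOREATECH-CGH, a publicly available dataset comprising 6,000 pairs of RGB-D images and complex holograms across resolutions ranging from 256*256 to 2048*2048, with depth ranges extending to the theoretical limits of the angular spectrum method for wide 3D scene coverage. To improve hologram quality at large depth ranges, we introduce amplitude projection, a post-processing technique that replaces amplitude components of hologram wavefields at each depth layer while preserving phase. This approach enhances reconstruction fidelity, achieving 27.01 dB PSNR and 0.87 SSIM, surpassing a recent optimized silhouette-masking layer-based method by 2.03 dB and 0.04 SSIM, respectively. We further validate the utility of KOREATECH-CGH through experiments on hologram generation and super-resolution using state-of-the-art ML models, confirming its applicability for training and evaluating next-generation ML-CGH systems.
[63] Matrix Completion Via Reweighted Logarithmic Norm Minimization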
Zhijie Wang,Liangtian He,Qinghua Zhang,Jifei Miao,Liang-Jian Deng,Jun Liu
Main category: cs.CV
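TL;DR: Proposes a new reweighted logarithmic norm as a more effective nonconvex surrogate for low-rank matrix completion, solves it with ADMM, and shows better performance than existing methods on image inpainting.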
Details
Motivation: The nuclear norm, used as a convex relaxation of the rank function, shrinks singular values excessively and therefore yields suboptimal solutions, so a more accurate nonconvex surrogate is needed to improve low-rank matrix completion. Method: A new reweighted logarithmic norm is proposed as the nonconvex surrogate, and the resulting optimization problem is solved efficiently with the alternating direction method of multipliers (ADMM). Result: In image inpainting experiments, the proposed method outperforms current state-of-the-art low-rank matrix completion methods in both visual quality and quantitative metrics. Conclusion: The proposed reweighted logarithmic norm approximates the rank function more closely, significantly improves low-rank matrix completion, and shows strong application potential. Abstract: Low-rank matrix completion (LRMC) has demonstrated remarkable success in a wide range of applications. To address the NP-hard nature of the rank minimization problem, the nuclear norm is commonly used as a convex and computationally tractable surrogate for the rank function. However, this approach often yields suboptimal solutions due to the excessive shrinkage of singular values. In this letter, we propose a novel reweighted logarithmic norm as a more effective nonconvex surrogate, which provides a closer approximation than many existing alternatives. We efficiently solve the resulting optimization problem by employing the alternating direction method of multipliers (ADMM). Experimental results on image inpainting demonstrate that the proposed method achieves superior performance compared to state-of-the-art LRMC approaches, both in terms of visual quality and quantitative metrics.
[64] Optical Flow-Guided 6DoF Object Pose Tracking with an Event Camera
Zibin Liu,Banglei Guan,Yang Shang,Shunkun Liang,Zhenbao Yu,Qifeng Yu
Main category: cs.CV
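TL;DR: Proposes an optical flow-guided 6DoF object pose tracking method for event cameras that optimizes pose through 2D-3D hybrid feature extraction and optical-flow association, outperforming existing methods in accuracy and robustness.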
Details
Motivation: Traditional cameras face motion blur, noise, occlusion, and lighting changes in pose tracking; event cameras have the potential to address these but need effective algorithms. Method: A 2D-3D hybrid feature extraction strategy detects corners and edges in the event stream; the optical flow of corners is found by maximizing the event-associated probability within a spatio-temporal window; corners are associated with edges under optical-flow guidance; and the 6DoF pose is iteratively optimized by minimizing corner-to-edge distances. Result: Experiments on simulated and real event data show that the method outperforms existing event-based pose tracking methods in accuracy and robustness. Conclusion: The optical flow-guided hybrid feature approach makes effective use of the strengths of event cameras and achieves accurate, robust, continuous 6DoF object pose tracking. Abstract: Object pose tracking is one of the pivotal technologies in multimedia, attracting ever-growing attention in recent years. Existing methods employing traditional cameras encounter numerous challenges such as motion blur, sensor noise, partial occlusion, and changing lighting conditions. The emerging bio-inspired sensors, particularly event cameras, possess advantages such as high dynamic range and low latency, which hold the potential to address the aforementioned challenges. In this work, we present an optical flow-guided 6DoF object pose tracking method with an event camera. A 2D-3D hybrid feature extraction strategy is first used to detect corners and edges from events and object models, which characterizes object motion precisely. Then, we search for the optical flow of corners by maximizing the event-associated probability within a spatio-temporal window, and establish the correlation between corners and edges guided by optical flow. Furthermore, by minimizing the distances between corners and edges, the 6DoF object pose is iteratively optimized to achieve continuous pose tracking. Experimental results on both simulated and real events demonstrate that our method outperforms event-based state-of-the-art methods in terms of both accuracy and robustness.
[65] DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors
Kaustubh Kundu,Hrishav Bakul Barua,Lucy Robertson-Bell,Zhixi Cai,Kalin Stefanov
Main category: cs.CV
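TL;DR: This paper proposes DexAvatar, a new framework that reconstructs bio-mechanically accurate fine-grained hand articulations and body movements from monocular sign language videos by introducing 3D hand and body priors, improving on the state of the art by 35.11% on the existing benchmark dataset.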
Details
Motivation: Current sign language datasets mostly provide 2D poses and lack accurate 3D information, and existing 3D pose estimation methods handle occlusion, noise, and motion blur poorly, which limits the quality of sign language generation. Method: The DexAvatar framework uses learned 3D hand and body priors to reconstruct fine-grained body and hand movements from in-the-wild monocular sign language videos. Result: State-of-the-art performance on the SGNify motion capture dataset, with body and hand pose estimation improved by 35.11% over the previous best method. Conclusion: DexAvatar significantly improves 3D sign language pose reconstruction from monocular video and provides high-quality 3D pose data for data-driven sign language generation. Abstract: The trend in sign language generation is centered around data-driven generative methods that require vast amounts of precise 2D and 3D human pose data to achieve an acceptable generation quality. However, currently, most sign language datasets are video-based and limited to automatically reconstructed 2D human poses (i.e., keypoints) and lack accurate 3D information. Furthermore, the existing state of the art for automatic 3D human pose estimation from sign language videos is prone to self-occlusion, noise, and motion blur effects, resulting in poor reconstruction quality. In response to this, we introduce DexAvatar, a novel framework to reconstruct bio-mechanically accurate fine-grained hand articulations and body movements from in-the-wild monocular sign language videos, guided by learned 3D hand and body priors. DexAvatar achieves strong performance on the SGNify motion capture dataset, the only benchmark available for this task, reaching an improvement of 35.11% in the estimation of body and hand poses compared to the state-of-the-art. The official website of this work is: https://github.com/kaustesseract/DexAvatar.
[66] Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control
Minghao Han,YiChen Liu,Yizhou Liu,Zizhi Chen,Jingqun Tang,Xuecheng Wu,Dingkang Yang,Lihua Zhang
Main category: cs.CV
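TL;DR: UniPath is a semantics-driven pathology image generation framework that uses multi-stream control for fine-grained, controllable generation, addressing data scarcity, insufficient semantic control, and terminological heterogeneity, and reaching SOTA performance in pathology image generation.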
Details
Motivation: Existing generative models mainly simulate pixels and lack fine-grained semantic control; they are further limited by the scarcity of high-quality image-text data and the diversity of diagnostic terminology, which makes reliable text-conditioned generation difficult. Method: The UniPath framework uses multi-stream control: a raw-text stream, a high-level semantics stream (which uses a frozen pathology MLLM to extract robust diagnostic semantic tokens), and a prototype stream (which provides morphological control via a prototype bank). The authors also build a large-scale corpus of 2.65M image-text pairs with a 68K high-quality annotated subset, and establish a four-tier evaluation hierarchy. Result: A Patho-FID of 80.9 (51% better than the second-best), with fine-grained semantic control reaching 98.7% of what real images achieve. Conclusion: By incorporating mature diagnostic understanding, UniPath achieves controllable pathology image generation, significantly outperforms existing methods in generation quality and semantic consistency, and moves understanding and generation toward unification. Abstract: In computational pathology, understanding and generation have evolved along disparate paths: advanced understanding models already exhibit diagnostic-level competence, whereas generative models largely simulate pixels. Progress remains hindered by three coupled factors: the scarcity of large, high-quality image-text corpora; the lack of precise, fine-grained semantic control, which forces reliance on non-semantic cues; and terminological heterogeneity, where diverse phrasings for the same diagnostic concept impede reliable text conditioning. We introduce UniPath, a semantics-driven pathology image generation framework that leverages mature diagnostic understanding to enable controllable generation. UniPath implements Multi-Stream Control: a Raw-Text stream; a High-Level Semantics stream that uses learnable queries to a frozen pathology MLLM to distill paraphrase-robust Diagnostic Semantic Tokens and to expand prompts into diagnosis-aware attribute bundles; and a Prototype stream that affords component-level morphological control via a prototype bank. On the data front, we curate a 2.65M image-text corpus and a finely annotated, high-quality 68K subset to alleviate data scarcity. For a comprehensive assessment, we establish a four-tier evaluation hierarchy tailored to pathology. Extensive experiments demonstrate UniPath's SOTA performance, including a Patho-FID of 80.9 (51% better than the second-best) and fine-grained semantic control achieving 98.7% of the real-image.
The meticulously curated datasets, complete source code, and pre-trained model weights developed in this study will be made openly accessible to the public.
[67] Multimodal Skeleton-Based Action Representation Learning via Decomposition and Composition
Hongsong Wang,Heng Fei,Bingxuan Dai,Jie Gui
Main category: cs.CV
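TL;DR: Proposes Decomposition and Composition, a self-supervised multimodal skeleton-based action representation learning framework that uses decomposition and composition strategies to improve multimodal action recognition while remaining efficient.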
Details
Motivation: Existing methods for multimodal human action understanding struggle to balance efficiency and effectiveness: late fusion is computationally expensive, while early fusion underperforms. Method: A decomposition strategy splits the fused multimodal features into unimodal features and aligns them with the ground-truth unimodal features; a composition strategy integrates multiple unimodal features as a self-supervised signal to strengthen multimodal representation learning. Result: Experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets show that the method strikes a good balance between computational cost and model performance. Conclusion: The proposed framework effectively resolves the efficiency-performance trade-off in multimodal action recognition and outperforms traditional fusion methods. Abstract: Multimodal human action understanding is a significant problem in computer vision, with the central challenge being the effective utilization of the complementarity among diverse modalities while maintaining model efficiency. However, most existing methods rely on simple late fusion to enhance performance, which results in substantial computational overhead. Although early fusion with a shared backbone for all modalities is efficient, it struggles to achieve excellent performance. To address the dilemma of balancing efficiency and effectiveness, we introduce a self-supervised multimodal skeleton-based action representation learning framework, named Decomposition and Composition. The Decomposition strategy meticulously decomposes the fused multimodal features into distinct unimodal features, subsequently aligning them with their respective ground truth unimodal counterparts. On the other hand, the Composition strategy integrates multiple unimodal features, leveraging them as self-supervised guidance to enhance the learning of multimodal representations. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets demonstrate that the proposed method strikes an excellent balance between computational cost and model performance.
[68] UniPR-3D: Towards Universal Visual Place Recognition with Visual Geometry Grounded Transformer
Tianchen Deng,Xun Chen,Ziming Li,Hongming Shen,Danwei Wang,Javier Civera,Hesheng Wang
Main category: cs.CV
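TL;DR: This paper proposes UniPR-3D, the first visual place recognition (VPR) architecture that effectively integrates multi-view information; it builds on a VGGT backbone with 2D and 3D feature aggregation modules and achieves state-of-the-art performance across diverse environments.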
Details
Motivation: Traditional VPR is mostly formulated as single-image retrieval; multiple views offer advantages but are underexplored, and existing methods generalize poorly, so a new architecture is needed that integrates multi-view information effectively and generalizes well across environments. Method: UniPR-3D uses a VGGT backbone to encode multi-view 3D representations, designs dedicated 2D and 3D feature aggregation modules, and builds its descriptor by combining single- and multi-frame aggregation schemes with a variable-length sequence retrieval strategy. Result: Experiments show that UniPR-3D reaches the state of the art on both single-view and multi-view benchmarks, significantly outperforming existing methods and confirming the effectiveness of geometry-grounded tokens for VPR. Conclusion: By jointly leveraging 2D and 3D tokens with dedicated aggregation mechanisms, UniPR-3D achieves robust multi-view visual place recognition that generalizes well, offering a new and effective paradigm for VPR. Abstract: Visual Place Recognition (VPR) has been traditionally formulated as a single-image retrieval task. Using multiple views offers clear advantages, yet this setting remains relatively underexplored and existing methods often struggle to generalize across diverse environments. In this work we introduce UniPR-3D, the first VPR architecture that effectively integrates information from multiple views. UniPR-3D builds on a VGGT backbone capable of encoding multi-view 3D representations, which we adapt by designing feature aggregators and fine-tune for the place recognition task. To construct our descriptor, we jointly leverage the 3D tokens and intermediate 2D tokens produced by VGGT. Based on their distinct characteristics, we design dedicated aggregation modules for 2D and 3D features, allowing our descriptor to capture fine-grained texture cues while also reasoning across viewpoints. To further enhance generalization, we incorporate both single- and multi-frame aggregation schemes, along with a variable-length sequence retrieval strategy. Our experiments show that UniPR-3D sets a new state of the art, outperforming both single- and multi-view baselines and highlighting the effectiveness of geometry-grounded tokens for VPR. Our code and models will be made publicly available on GitHub: https://github.com/dtc111111/UniPR-3D.
[69] Hierarchical Modeling Approach to Fast and Accurate Table Recognition
Takaya Kawakatsu
Main category: cs.CV
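TL;DR: Proposes a multi-task model that uses non-causal attention and a parallel inference algorithm to recognize table structure and content more efficiently.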
Details
Motivation: Existing table recognition models perform well, but they are slow at inference and their effectiveness has not been fully explained. Method: Non-causal attention captures the overall table structure, and a parallel inference algorithm speeds up cell content recognition. Result: On two large public datasets, the new model is shown to be superior both visually and statistically. Conclusion: The proposed model significantly speeds up inference while maintaining high accuracy, offering an effective solution for efficiently extracting table information from documents. Abstract: The extraction and use of diverse knowledge from numerous documents is a pressing challenge in intelligent information retrieval. Documents contain elements that require different recognition methods. Table recognition typically consists of three subtasks, namely table structure, cell position and cell content recognition. Recent models have achieved excellent recognition with a combination of multi-task learning, local attention, and mutual learning. However, their effectiveness has not been fully explained, and they require a long period of time for inference. This paper presents a novel multi-task model that utilizes non-causal attention to capture the entire table structure, and a parallel inference algorithm for faster cell content inference. The superiority is demonstrated both visually and statistically on two large public datasets.
[70] T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
Zhe Cao,Tao Wang,Jiaming Wang,Yanghai Wang,Yuanxing Zhang,Jialu Chen,Miao Deng,Jiahao Wang,Yubin Guo,Chenxi Liao,Yize Zhang,Zhaoxiang Zhang,Jiaheng Liu
Main category: cs.CV
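TL;DR: This paper presents T2AV-Compass, a unified benchmark for comprehensive evaluation of text-to-audio-video (T2AV) generation systems; it contains 500 diverse, complex prompts and combines objective metrics with MLLM-based subjective judging, revealing significant shortcomings of existing models in cross-modal consistency, instruction following, and realism.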
Details
Motivation: Evaluation of T2AV generation systems is fragmented and relies on unimodal metrics or narrow benchmarks, which cannot properly measure cross-modal alignment, instruction following, and perceptual realism under complex prompts. Method: T2AV-Compass builds 500 semantically rich and physically plausible complex prompts through a taxonomy-driven pipeline and defines a dual-level evaluation framework that combines signal-level objective metrics (video quality, audio quality, cross-modal alignment) with MLLM-based subjective judging (instruction following, realism). Result: Extensive evaluation of 11 representative T2AV systems shows that even the strongest models remain far below human level in audio realism, fine-grained synchronization, and instruction following. Conclusion: Current T2AV systems still have substantial room for improvement, and T2AV-Compass, as a challenging and diagnostic testbed, can help advance the field. Abstract: Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in audio realism, fine-grained synchronization, and instruction following, among others. These results indicate significant room for improvement in future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.
[71] UniRec-0.1B: Unified Text and Formula Recognition with 0.1B Parameters
Yongkun Du,Zhineng Chen,Yazhen Xie,Weikang Bai,Hao Feng,Wei Shi,Yuchen Su,Can Huang,Yu-Gang Jiang
Main category: cs.CV
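TL;DR: This paper proposes UniRec-0.1B, a lightweight unified text and formula recognition model with only 0.1B parameters that recognizes text and formulas in Chinese and English documents efficiently and accurately at multiple levels.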
Details
Motivation: Existing vision-language models can recognize text and formulas in a unified way, but they are large and computationally expensive, which limits their use; a lighter, more efficient unified recognition model is needed. Method: The authors build UniRec40M, a large-scale dataset of 40 million samples; propose hierarchical supervision training to handle structural variability across hierarchies; and design a semantic-decoupled tokenizer that separates text and formula representations. Result: On the authors' own and public benchmarks, UniRec-0.1B outperforms general-purpose VLMs and specialized document parsing models on multilingual, multi-domain, multi-level document recognition while running 2-9× faster. Conclusion: Through the joint design of data, training strategy, and tokenizer, UniRec-0.1B achieves efficient and accurate unified text and formula recognition with very few parameters, making it practical and broadly applicable. Abstract: Text and formulas constitute the core informational components of many documents. Accurately and efficiently recognizing both is crucial for developing robust and generalizable document parsing systems. Recently, vision-language models (VLMs) have achieved impressive unified recognition of text and formulas. However, they are large-sized and computationally demanding, restricting their usage in many applications. In this paper, we propose UniRec-0.1B, a unified recognition model with only 0.1B parameters. It is capable of performing text and formula recognition at multiple levels, including characters, words, lines, paragraphs, and documents. To implement this task, we first establish UniRec40M, a large-scale dataset comprising 40 million text, formula, and mixed samples, enabling the training of a powerful yet lightweight model. Secondly, we identify two challenges when building such a lightweight but unified expert model. They are: structural variability across hierarchies and semantic entanglement between textual and formulaic content. To tackle these, we introduce a hierarchical supervision training that explicitly guides structural comprehension, and a semantic-decoupled tokenizer that separates text and formula representations. Finally, we develop a comprehensive evaluation benchmark covering Chinese and English documents from multiple domains and with multiple levels. Experimental results on this and public benchmarks demonstrate that UniRec-0.1B outperforms both general-purpose VLMs and leading document parsing expert models, while achieving a 2-9× speedup, validating its effectiveness and efficiency.
Codebase and Dataset: https://github.com/Topdu/OpenOCR.
[72] FreeInpaint: Tuning-free Prompt Alignment and Visual Rationality Enhancement in Image Inpainting
Chao Gong,Dong Li,Yingwei Pan,Jingjing Chen,Ting Yao,Tao Mei
Main category: cs.CV
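TL;DR: This paper proposes FreeInpaint, a tuning-free, plug-and-play image inpainting method that improves prompt alignment and visual rationality by directly optimizing the diffusion latents during inference.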
Details
Motivation: Existing image inpainting methods built on pre-trained text-to-image diffusion models struggle to guarantee both alignment with the text prompt and visual rationality at the same time. Method: A prior-guided noise optimization method optimizes the initial noise so that the model attends to valid inpainting regions; a composite guidance objective tailored to inpainting optimizes the intermediate latents at each denoising step to strengthen prompt alignment and visual rationality. Result: Extensive experiments across various inpainting diffusion models and evaluation metrics confirm the effectiveness and robustness of FreeInpaint in improving the faithfulness of generated images. Conclusion: FreeInpaint is a tuning-free, plug-and-play inpainting framework that effectively improves prompt alignment and visual rationality and applies to a range of diffusion models. Abstract: Text-guided image inpainting endeavors to generate new content within specified regions of images using textual prompts from users. The primary challenge is to accurately align the inpainted areas with the user-provided prompts while maintaining a high degree of visual fidelity. While existing inpainting methods have produced visually convincing results by leveraging the pre-trained text-to-image diffusion models, they still struggle to uphold both prompt alignment and visual rationality simultaneously. In this work, we introduce FreeInpaint, a plug-and-play tuning-free approach that directly optimizes the diffusion latents on the fly during inference to improve the faithfulness of the generated images. Technically, we introduce a prior-guided noise optimization method that steers model attention towards valid inpainting regions by optimizing the initial noise. Furthermore, we meticulously design a composite guidance objective tailored specifically for the inpainting task. This objective efficiently directs the denoising process, enhancing prompt alignment and visual rationality by optimizing intermediate latents at each step. Through extensive experiments involving various inpainting diffusion models and evaluation metrics, we demonstrate the effectiveness and robustness of our proposed FreeInpaint.
[73] MarineEval: Assessing the Marine Intelligence of Vision-Language Models
Yuk-Kwan Wong,Tuan-An To,Jipeng Zhang,Ziqiang Zheng,Sai-Kit Yeung
Main category: cs.CV
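TL;DR: This paper presents MarineEval, the first large-scale marine-domain dataset and benchmark for vision-language models (VLMs), with 2,000 image-based question-answering pairs for evaluating how well existing VLMs answer marine questions that require domain expertise. Experiments show that existing VLMs still fall significantly short in this domain and need improvement.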
Details
Motivation: To examine whether existing vision-language models (VLMs) can act as experts in the marine domain, which requires deep specialist knowledge, and accurately answer questions that pose domain-specific challenges. Method: The authors build MarineEval, a large-scale marine VLM dataset and benchmark with 2,000 image-based question-answering pairs covering 7 task dimensions and 20 capacity dimensions, with data quality verified by marine domain experts, and systematically benchmark 17 existing VLMs on it. Result: Experiments show that current VLMs perform poorly on specialist marine questions, with clear limitations and large room for improvement. Conclusion: Existing VLMs cannot yet serve as marine domain experts; MarineEval provides an important benchmark and direction for future research. Abstract: We have witnessed promising progress led by large language models (LLMs) and further vision language models (VLMs) in handling various queries as a general-purpose assistant. VLMs, as a bridge to connect the visual world and language corpus, receive both visual content and various text-only user instructions to generate corresponding responses. Though great success has been achieved by VLMs in various fields, in this work, we ask whether the existing VLMs can act as domain experts, accurately answering marine questions, which require significant domain expertise and address special domain challenges/requirements. To comprehensively evaluate the effectiveness and explore the boundary of existing VLMs, we construct the first large-scale marine VLM dataset and benchmark called MarineEval, with 2,000 image-based question-answering pairs. During our dataset construction, we ensure the diversity and coverage of the constructed data: 7 task dimensions and 20 capacity dimensions. The domain requirements are specially integrated into the data construction and further verified by the corresponding marine domain experts. We comprehensively benchmark 17 existing VLMs on our MarineEval and also investigate the limitations of existing models in answering marine research questions. The experimental results reveal that existing VLMs cannot effectively answer the domain-specific questions, and there is still substantial room for further performance improvements. We hope our new benchmark and observations will facilitate future research. Project Page: http://marineeval.hkustvgd.com/
[74] TGC-Net: A Structure-Aware and Semantically-Aligned Framework for Text-Guided Medical Image Segmentation
Gaoren Lin,Huangxuan Zhao,Yuan Xiong,Lefei Zhang,Bo Du,Wentao Zhu
Main category: cs.CV
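TL;DR: TGC-Net is a CLIP-based text-guided medical image segmentation framework whose parameter-efficient modules address fine-grained structure preservation, modeling of complex clinical descriptions, and domain semantic alignment, achieving state-of-the-art performance on multiple datasets.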
Details
Motivation: Existing methods lack alignment between image and text encoders, which makes multimodal fusion difficult; applying CLIP directly to medical images suffers from insufficient structure preservation, weak modeling of clinical descriptions, and domain semantic mismatch. Method: TGC-Net has three core components: 1) a Semantic-Structural Synergy Encoder (SSE), which augments the ViT with a CNN branch for multi-scale structural refinement; 2) a Domain-Augmented Text Encoder (DATE), which injects medical knowledge derived from large language models; and 3) a Vision-Language Calibration Module (VLCM), which refines cross-modal alignment in a unified feature space. Result: Experiments on five chest X-ray and thoracic CT datasets show that TGC-Net reaches state-of-the-art segmentation performance with fewer trainable parameters, with notable Dice gains on challenging benchmarks. Conclusion: Through efficient task-specific adaptation, TGC-Net makes CLIP considerably more applicable to medical image segmentation and achieves accurate text-guided segmentation. Abstract: Text-guided medical segmentation enhances segmentation accuracy by utilizing clinical reports as auxiliary information. However, existing methods typically rely on unaligned image and text encoders, which necessitate complex interaction modules for multimodal fusion. While CLIP provides a pre-aligned multimodal feature space, its direct application to medical imaging is limited by three main issues: insufficient preservation of fine-grained anatomical structures, inadequate modeling of complex clinical descriptions, and domain-specific semantic misalignment. To tackle these challenges, we propose TGC-Net, a CLIP-based framework focusing on parameter-efficient, task-specific adaptations. Specifically, it incorporates a Semantic-Structural Synergy Encoder (SSE) that augments CLIP's ViT with a CNN branch for multi-scale structural refinement, a Domain-Augmented Text Encoder (DATE) that injects large-language-model-derived medical knowledge, and a Vision-Language Calibration Module (VLCM) that refines cross-modal correspondence in a unified feature space. Experiments on five datasets across chest X-ray and thoracic CT modalities demonstrate that TGC-Net achieves state-of-the-art performance with substantially fewer trainable parameters, including notable Dice gains on challenging benchmarks.
[75] ORCA: Object Recognition and Comprehension for Archiving Marine Species
Yuk-Kwan Wong,Haixin Liang,Zeyu Ma,Yiwei Chen,Ziqiang Zheng,Rinaldi Gotama,Pascal Sebastian,Lauren D. Sparks,Sai-Kit Yeung
Main category: cs.CV
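TL;DR: ORCA is a multi-modal benchmark for marine research comprising 14,647 images, 42,217 bounding box annotations, and 22,321 expert-verified instance captions, intended to advance research on marine visual understanding.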
Details
Motivation: Research on marine visual understanding is limited by scarce training data and by the lack of a task formulation that systematically connects domain challenges with computer vision tasks, which restricts effective model application. Method: The ORCA multi-modal benchmark dataset provides fine-grained visual and textual annotations, and 18 advanced models are evaluated on three tasks: object detection (closed-set and open-vocabulary), instance captioning, and visual grounding. Result: Experiments expose the challenges posed by species diversity, morphological overlap, and domain specificity, confirming the limitations of current methods for marine understanding. Conclusion: ORCA provides a comprehensive benchmark for marine visual understanding and helps advance further research in the field. Abstract: Marine visual understanding is essential for monitoring and protecting marine ecosystems, enabling automatic and scalable biological surveys. However, progress is hindered by limited training data and the lack of a systematic task formulation that aligns domain-specific marine challenges with well-defined computer vision tasks, thereby limiting effective model application. To address this gap, we present ORCA, a multi-modal benchmark for marine research comprising 14,647 images from 478 species, with 42,217 bounding box annotations and 22,321 expert-verified instance captions. The dataset provides fine-grained visual and textual annotations that capture morphology-oriented attributes across diverse marine species. To catalyze methodological advances, we evaluate 18 state-of-the-art models on three tasks: object detection (closed-set and open-vocabulary), instance captioning, and visual grounding. Results highlight key challenges, including species diversity, morphological overlap, and specialized domain demands, underscoring the difficulty of marine understanding. ORCA thus establishes a comprehensive benchmark to advance research in the marine domain. Project Page: http://orca.hkustvgd.com/.
[76] A Turn Toward Better Alignment: Few-Shot Generative Adaptation with Equivariant Feature Rotation
Chenghao Xu,Qi Liu,Jiexi Yan,Muli Yang,Cheng Deng
Main category: cs.CV
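TL;DR: Proposes Equivariant Feature Rotation (EFR), a new method that addresses domain adaptation in few-shot image generation by performing two-level alignment in a self-rotated proxy feature space.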
Details
Motivation: Existing few-shot image generation methods cannot align distributions accurately because the source and target domains differ greatly in distribution structure and target samples are scarce, which leads to distorted content or insufficient knowledge transfer. Method: Learnable rotation matrices parameterized in a Lie group adaptively map source and target features into an equivariant proxy feature space, where both instance-level and distribution-level alignment are performed to preserve intra-domain structure and enable knowledge transfer. Result: Experiments on several commonly used datasets show that the method significantly improves generative performance in the target domain and outperforms existing few-shot generation methods. Conclusion: By constructing an equivariant proxy feature space, EFR effectively mitigates the negative impact of inter-domain distribution differences and provides a more robust domain adaptation framework for few-shot image generation. Abstract: Few-shot image generation aims to effectively adapt a source generative model to a target domain using very few training images. Most existing approaches introduce consistency constraints, typically through instance-level or distribution-level loss functions, to directly align the distribution patterns of source and target domains within their respective latent spaces. However, these strategies often fall short: overly strict constraints can amplify the negative effects of the domain gap, leading to distorted or uninformative content, while overly relaxed constraints may fail to leverage the source domain effectively. This limitation primarily stems from the inherent discrepancy in the underlying distribution structures of the source and target domains. The scarcity of target samples further compounds this issue by hindering accurate estimation of the target domain's distribution. To overcome these limitations, we propose Equivariant Feature Rotation (EFR), a novel adaptation strategy that aligns source and target domains at two complementary levels within a self-rotated proxy feature space. Specifically, we perform adaptive rotations within a parameterized Lie Group to transform both source and target features into an equivariant proxy space, where alignment is conducted. These learnable rotation matrices serve to bridge the domain gap by preserving intra-domain structural information without distortion, while the alignment optimization facilitates effective knowledge transfer from the source to the target domain.
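As a rough illustration of why a rotation can preserve intra-domain structure "without distortion", the sketch below parameterizes a rotation matrix through the exponential map of a skew-symmetric matrix and checks that pairwise feature distances are unchanged. The abstract does not say which Lie group or parameterization EFR uses; SO(n) with the exponential map is an assumption here, and the names `skew_from_params`, `rotation_from_params`, and `pairwise_dists`, as well as the toy data, are hypothetical.

```python
import numpy as np

def skew_from_params(theta, n):
    """Pack n*(n-1)/2 free parameters into an n x n skew-symmetric matrix."""
    A = np.zeros((n, n))
    A[np.triu_indices(n, k=1)] = theta
    return A - A.T

def rotation_from_params(theta, n):
    """Exponential map from so(n) to SO(n) (assumed parameterization).

    For real skew-symmetric A, the matrix 1j*A is Hermitian, so
    exp(A) = V diag(exp(-1j*w)) V^H with (w, V) = eigh(1j*A).
    """
    A = skew_from_params(theta, n)
    w, V = np.linalg.eigh(1j * A)
    R = (V * np.exp(-1j * w)) @ V.conj().T
    return R.real  # the imaginary part is only rounding error

def pairwise_dists(X):
    """Euclidean distances between all rows of X."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Hypothetical toy data: 5 feature vectors of dimension 4.
rng = np.random.default_rng(0)
n = 4
theta = rng.normal(size=n * (n - 1) // 2)  # would be learnable during training
R = rotation_from_params(theta, n)
feats = rng.normal(size=(5, n))
proxy = feats @ R.T  # rotate every feature into the proxy space

print(np.allclose(R @ R.T, np.eye(n)))                             # orthogonality check
print(np.allclose(pairwise_dists(feats), pairwise_dists(proxy)))   # distances preserved
```

Because the free parameters live in an unconstrained vector `theta`, a gradient-based optimizer could update them while the resulting matrix stays a rotation by construction; how EFR actually trains its rotations is not stated in the abstract.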
Comprehensive experiments on a variety of commonly used datasets demonstrate that our method significantly enhances the generative performance within the targeted domain.
[77] Towards Arbitrary Motion Completing via Hierarchical Continuous Representation
Chenghao Xu,Guangtao Lyu,Qi Liu,Jiexi Yan,Muli Yang,Cheng Deng
Main category: cs.CV
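TL;DR: This paper proposes NAME, a hierarchical implicit representation framework based on Implicit Neural Representations (INRs) that models human motion sequences continuously and supports interpolation, in-betweening, and extrapolation at arbitrary frame rates.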
Details
Motivation: Physical motion is inherently continuous and higher frame rates improve the temporal coherence of motion sequences, so a method that represents human motion continuously is needed to overcome the limits of discrete frame rates. Method: NAME is a new hierarchical implicit neural representation framework driven by a parametric activation function; it combines a multi-scale temporal encoding mechanism with a Fourier-transform-based parametric activation function to strengthen the MLP decoder's ability to express complex motion patterns. Result: Experiments on several benchmark datasets show that the method performs well at continuous representation, interpolation, and extrapolation of motion sequences, with good smoothness and temporal consistency. Conclusion: The proposed NAME framework models human motion continuously, supports output at arbitrary frame rates, and shows strong performance and robustness in motion capture and generation tasks. Abstract: Physical motions are inherently continuous, and higher camera frame rates typically contribute to improved smoothness and temporal coherence. For the first time, we explore continuous representations of human motion sequences, featuring the ability to interpolate, inbetween, and even extrapolate any input motion sequences at arbitrary frame rates. To achieve this, we propose a novel parametric activation-induced hierarchical implicit representation framework, referred to as NAME, based on Implicit Neural Representations (INRs). Our method introduces a hierarchical temporal encoding mechanism that extracts features from motion sequences at multiple temporal scales, enabling effective capture of intricate temporal patterns. Additionally, we integrate a custom parametric activation function, powered by Fourier transformations, into the MLP-based decoder to enhance the expressiveness of the continuous representation. This parametric formulation significantly augments the model's ability to represent complex motion behaviors with high accuracy. Extensive evaluations across several benchmark datasets demonstrate the effectiveness and robustness of our proposed approach.
[78] UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement
Tanghui Jia,Dongyu Yan,Dehao Hao,Yang Li,Kaiyi Zhang,Xianyi He,Lanjiong Li,Jinnan Chen,Lutao Jiang,Qishen Yin,Long Quan,Ying-Cong Chen,Li Yuan
Main category: cs.CV
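TL;DR: UltraShape 1.0 is a scalable two-stage 3D diffusion framework that generates high-quality 3D geometry through an improved data processing pipeline and by decoupling spatial localization from geometric detail synthesis.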
Details
Motivation: Existing 3D generative models still fall short in geometric quality and detail fidelity and depend on high-quality training data; this work aims to build an efficient, scalable framework that improves 3D generation on public datasets. Method: A two-stage pipeline first generates a coarse structure and then applies voxel-based refinement; the approach introduces watertight processing and data filtering, and within the diffusion process uses voxel queries at fixed locations encoded via RoPE to decouple spatial localization from geometric detail synthesis. Result: Trained only on public 3D datasets, UltraShape 1.0 reaches geometric quality and generation results comparable to or better than existing open-source methods. Conclusion: UltraShape 1.0 shows that high-fidelity 3D generation is feasible with limited resources through systematic data processing and structured diffusion modeling, advancing open-source 3D generation research. Abstract: In this report, we introduce UltraShape 1.0, a scalable 3D diffusion framework for high-fidelity 3D geometry generation. The proposed approach adopts a two-stage generation pipeline: a coarse global structure is first synthesized and then refined to produce detailed, high-quality geometry. To support reliable 3D generation, we develop a comprehensive data processing pipeline that includes a novel watertight processing method and high-quality data filtering. This pipeline improves the geometric quality of publicly available 3D datasets by removing low-quality samples, filling holes, and thickening thin structures, while preserving fine-grained geometric details. To enable fine-grained geometry refinement, we decouple spatial localization from geometric detail synthesis in the diffusion process. We achieve this by performing voxel-based refinement at fixed spatial locations, where voxel queries derived from coarse geometry provide explicit positional anchors encoded via RoPE, allowing the diffusion model to focus on synthesizing local geometric details within a reduced, structured solution space. Our model is trained exclusively on publicly available 3D datasets, achieving strong geometric quality despite limited training resources. Extensive evaluations demonstrate that UltraShape 1.0 performs competitively with existing open-source methods in both data processing quality and geometry generation. All code and trained models will be released to support future research.
[79] VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs
Brigitta Malagurski Törtei,Yasser Dahou,Ngoc Dung Huynh,Wamiq Reyaz Para,Phúc H. Lê Khac,Ankit Singh,Sofian Chaybouti,Sanath Narayan
Main category: cs.CV
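TL;DR: VisRes Bench is a new benchmark for studying visual reasoning in naturalistic settings; it reveals the limitations of current vision-language models in perceptual and relational reasoning.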
Details
Motivation: It is unclear to what extent vision-language models rely on linguistic priors rather than genuine visual reasoning, and a systematic evaluation of visual reasoning in naturalistic settings is lacking.
Method: Proposes the VisRes Bench benchmark with three levels of complexity: Level 1 tests perceptual completion and image matching under perturbations; Level 2 tests rule-based inference over a single attribute; Level 3 examines compositional reasoning over multiple attributes. Evaluation uses more than 19,000 controlled images.
Result: State-of-the-art vision-language models perform near random under subtle perceptual perturbations, showing limited abstraction ability and a reliance on pattern recognition rather than genuine visual reasoning. The levels effectively separate distinct reasoning abilities.
Conclusion: VisRes Bench provides a unified framework for advancing abstract visual reasoning in multimodal research and highlights the shortcomings of existing models in real visual understanding.
Abstract: Vision-Language Models (VLMs) have achieved remarkable progress across tasks such as visual question answering and image captioning. Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To address this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities. VisRes isolates distinct reasoning abilities across its levels. Level 1 probes perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation; Level 2 tests rule-based inference over a single attribute (e.g., color, count, orientation); and Level 3 targets compositional reasoning that requires integrating multiple visual attributes. Across more than 19,000 controlled task images, we find that state-of-the-art VLMs perform near random under subtle perceptual perturbations, revealing limited abstraction beyond pattern recognition. We conclude by discussing how VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research.
[80] Human Motion Estimation with Everyday Wearables
Siqi Zhu,Yixuan Li,Junfu Li,Qi Wu,Zan Wang,Haozhe Ma,Wei Liang
Main category: cs.CV
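TL;DR: This paper proposes EveryWear, a lightweight human motion capture method based on everyday wearables (a smartphone, smartwatch, earbuds, and smart glasses) that estimates full-body motion without calibration, and validates it on the real-world dataset Ego-Elec.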
Details
Motivation: Existing wearable-based human motion estimation methods suffer from poor wearability, expensive hardware, and cumbersome calibration, which limits their use in daily life.
Method: EveryWear fuses visual information from egocentric cameras with inertial signals from consumer devices in a multimodal teacher-student framework, and trains directly on real-world data to eliminate the sim-to-real gap.
Result: The authors build Ego-Elec, a 9-hour real-world dataset covering 56 daily activities across 17 indoor and outdoor environments, and show that the method outperforms baseline models on practical full-body motion estimation.
Conclusion: EveryWear offers a practical route to full-body motion capture in everyday settings, advancing ubiquitous motion sensing that needs neither dedicated hardware nor calibration.
Abstract: While on-body device-based human motion estimation is crucial for applications such as XR interaction, existing methods often suffer from poor wearability, expensive hardware, and cumbersome calibration, which hinder their adoption in daily life. To address these challenges, we present EveryWear, a lightweight and practical human motion capture approach based entirely on everyday wearables: a smartphone, smartwatch, earbuds, and smart glasses equipped with one forward-facing and two downward-facing cameras, requiring no explicit calibration before use. We introduce Ego-Elec, a 9-hour real-world dataset covering 56 daily activities across 17 diverse indoor and outdoor environments, with ground-truth 3D annotations provided by the motion capture (MoCap), to facilitate robust research and benchmarking in this direction. Our approach employs a multimodal teacher-student framework that integrates visual cues from egocentric cameras with inertial signals from consumer devices. By training directly on real-world data rather than synthetic data, our model effectively eliminates the sim-to-real gap that constrains prior work. Experiments demonstrate that our method outperforms baseline models, validating its effectiveness for practical full-body motion estimation.
[81] Latent Implicit Visual Reasoning
Kelvin Li,Chuyi Shang,Leonid Karlinsky,Rogerio Feris,Trevor Darrell,Roei Herzig
Main category: cs.CV
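TL;DR: Proposes a task-agnostic mechanism that lets large multimodal models discover and use visual reasoning tokens on their own, without explicit supervision, achieving state-of-the-art performance on a range of vision-centric tasks.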
Details
Motivation: Existing LMMs are overly text-centric and rely on language for reasoning, so they struggle with predominantly visual reasoning tasks. Current methods depend on manually annotated intermediate visual steps, which limits generalization and raises annotation costs.
Method: Designs a mechanism that needs no explicit supervision, in which the LMM learns to generate visual reasoning tokens. These tokens attend globally and re-encode the image in a task-adaptive way to extract relevant visual information.
Result: The method outperforms direct fine-tuning on multiple vision-centric tasks, does well on tasks where intermediate abstractions are hard to define, and also supports multi-task instruction tuning.
Conclusion: The method enables LMMs to perform visual reasoning autonomously without explicit supervision, improving performance and generalization across diverse visual tasks.
Abstract: While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.
[82] Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval
Dao Sy Duy Minh,Huynh Trung Kiet,Nguyen Lam Phu Quy,Phu-Hoa Pham,Tran Chi Nguyen
Main category: cs.CV
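TL;DR: Proposes a lightweight two-stage image retrieval framework that uses event-centric entity extraction to improve retrieval from natural language descriptions, substantially outperforming prior methods on the OpenEvents v1 benchmark.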
Details
Motivation: Real-world image-text retrieval remains challenging because of vague queries, linguistic variability, and the need for scalability.
Method: The first stage filters candidates with BM25 over salient entities; the second stage applies BEiT-3 models for deep multimodal semantic modeling and reranking.
Result: Achieves a mean average precision of 0.559 on the OpenEvents v1 benchmark, substantially outperforming prior baselines.
Conclusion: Combining event-guided filtering with long-text vision-language modeling effectively improves the accuracy and efficiency of image retrieval in complex real-world scenarios.
Abstract: Retrieving images from natural language descriptions is a core task at the intersection of computer vision and natural language processing, with wide-ranging applications in search engines, media archiving, and digital content management. However, real-world image-text retrieval remains challenging due to vague or context-dependent queries, linguistic variability, and the need for scalable solutions. In this work, we propose a lightweight two-stage retrieval pipeline that leverages event-centric entity extraction to incorporate temporal and contextual signals from real-world captions. The first stage performs efficient candidate filtering using BM25 based on salient entities, while the second stage applies BEiT-3 models to capture deep multimodal semantics and rerank the results. Evaluated on the OpenEvents v1 benchmark, our method achieves a mean average precision of 0.559, substantially outperforming prior baselines. These results highlight the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in complex, real-world scenarios. Our code is available at https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval
[83] SegMo: Segment-aligned Text to 3D Human Motion Generation
Bowen Dang,Lin Wu,Xiaohang Yang,Zheng Yuan,Zhixiang Chen
Main category: cs.CV
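TL;DR: This paper proposes SegMo, a text-to-3D human motion generation framework that decomposes text and motion sequences into semantic segments and aligns them at a fine-grained level, improving generation quality and cross-modal retrieval.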
Details
Motivation: Existing methods align text descriptions with human motion only at the sequence level, ignoring the semantic structure within each modality. Yet both text and motion decompose naturally into semantically coherent segments that can serve as basic units for finer alignment.
Method: The SegMo framework has three modules: (1) text segment extraction, which decomposes complex descriptions into ordered atomic action phrases; (2) motion segment extraction, which partitions complete motion sequences into corresponding segments; (3) fine-grained text-motion alignment, which aligns segments across modalities with contrastive learning.
Result: Outperforms a strong baseline on datasets including HumanML3D, reaching a TOP 1 score of 0.553 on the test set, and supports downstream tasks such as motion grounding and motion-to-text retrieval.
Conclusion: By aligning at the level of semantic segments, SegMo achieves more precise text-to-motion generation and builds a shared embedding space for text and motion segments, which broadens the model's applicability across tasks.
Abstract: Generating 3D human motions from textual descriptions is an important research problem with broad applications in video games, virtual reality, and augmented reality. Recent methods align the textual description with human motion at the sequence level, neglecting the internal semantic structure of modalities. However, both motion descriptions and motion sequences can be naturally decomposed into smaller and semantically coherent segments, which can serve as atomic alignment units to achieve finer-grained correspondence. Motivated by this, we propose SegMo, a novel Segment-aligned text-conditioned human Motion generation framework to achieve fine-grained text-motion alignment. Our framework consists of three modules: (1) Text Segment Extraction, which decomposes complex textual descriptions into temporally ordered phrases, each representing a simple atomic action; (2) Motion Segment Extraction, which partitions complete motion sequences into corresponding motion segments; and (3) Fine-grained Text-Motion Alignment, which aligns text and motion segments with contrastive learning. Extensive experiments demonstrate that SegMo improves the strong baseline on two widely used datasets, achieving an improved TOP 1 score of 0.553 on the HumanML3D test set. Moreover, thanks to the learned shared embedding space for text and motion segments, SegMo can also be applied to retrieval-style tasks such as motion grounding and motion-to-text retrieval.
[84] DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation
Jiawei Liu,Junqiao Li,Jiangfan Deng,Gen Li,Siyu Zhou,Zetao Fang,Shanshan Lao,Zengde Deng,Jianing Zhu,Tingting Ma,Jiayi Li,Yunqiu Wang,Qian He,Xinglong Wu
Main category: cs.CV
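TL;DR: This paper proposes the DreaMontage framework, which achieves high-quality, long-duration one-shot video generation through an improved DiT architecture, visual expression fine-tuning, and a segment-wise autoregressive inference strategy.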
Details
Motivation: One-shot filming is expensive and constrained by real-world conditions, and existing video generation methods struggle to maintain temporal coherence and visual smoothness.
Method: 1) Adds a lightweight intermediate-conditioning mechanism and an adaptive tuning strategy to DiT; 2) builds a high-quality dataset and applies Visual Expression SFT and a Tailored DPO scheme to improve motion rationality and transition smoothness; 3) designs a Segment-wise Auto-Regressive (SAR) inference strategy to generate long sequences efficiently.
Result: Experiments show strong visual quality, temporal coherence, and computational efficiency; the method can turn fragmented visual materials into coherent, expressive, long one-shot videos.
Conclusion: DreaMontage provides an effective solution for arbitrary frame-guided long-duration video generation, substantially improving the automation and practicality of one-shot video generation.
Abstract: The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner.
Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.
[85] AnyAD: Unified Any-Modality Anomaly Detection in Incomplete Multi-Sequence MRI
Changwei Wu,Yifei Chen,Yuxin Du,Mingxuan Liu,Jinying Zong,Beining Wu,Jie Dong,Feiwei Qin,Yunkang Cao,Qiyuan Tian
Main category: cs.CV
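TL;DR: Proposes a unified Any-Modality anomaly detection (AD) framework that performs robust anomaly detection and localization under any combination of MRI modalities without retraining, substantially improving clinical scalability.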
Details
Motivation: Annotated abnormal cases are scarce and key imaging modalities are often missing in clinical practice, so existing anomaly detection models generalize poorly when modality availability varies, which limits real-world use.
Method: Combines a dual-pathway DINOv2 encoder and a feature distribution alignment mechanism with an Intrinsic Normal Prototypes (INPs) extractor and an INP-guided decoder. Training uses randomized modality masking and indirect feature completion so that the model adapts to any missing-modality configuration.
Result: On the BraTS2018, MU-Glioma-Post, and Pretreat-MetsToBrain-Masks datasets, the method outperforms state-of-the-art industrial and medical AD baselines across 7 modality combinations.
Conclusion: The framework offers a scalable new paradigm for medical anomaly detection under incomplete multimodal conditions in the real world.
Abstract: Reliable anomaly detection in brain MRI remains challenging due to the scarcity of annotated abnormal cases and the frequent absence of key imaging modalities in real clinical workflows. Existing single-class or multi-class anomaly detection (AD) models typically rely on fixed modality configurations, require repetitive training, or fail to generalize to unseen modality combinations, limiting their clinical scalability. In this work, we present a unified Any-Modality AD framework that performs robust anomaly detection and localization under arbitrary MRI modality availability. The framework integrates a dual-pathway DINOv2 encoder with a feature distribution alignment mechanism that statistically aligns incomplete-modality features with full-modality representations, enabling stable inference even with severe modality dropout. To further enhance semantic consistency, we introduce an Intrinsic Normal Prototypes (INPs) extractor and an INP-guided decoder that reconstruct only normal anatomical patterns while naturally amplifying abnormal deviations. Through randomized modality masking and indirect feature completion during training, the model learns to adapt to all modality configurations without re-training. Extensive experiments on BraTS2018, MU-Glioma-Post, and Pretreat-MetsToBrain-Masks demonstrate that our approach consistently surpasses state-of-the-art industrial and medical AD baselines across 7 modality combinations, achieving superior generalization. This study establishes a scalable paradigm for multimodal medical AD under real-world, imperfect modality conditions.
Our source code is available at https://github.com/wuchangw/AnyAD.
[86] ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision
Weiqi Li,Zehao Zhang,Liang Lin,Guangrun Wang
Main category: cs.CV
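TL;DR: This paper proposes Attention-Conditional Diffusion (ACD), a new framework that achieves direct conditional control in video diffusion models through attention supervision, substantially improving the alignment between generated videos and conditioning signals.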
Details
Motivation: Existing classifier-free guidance methods struggle to control conditions precisely in video synthesis, while classifier-based guidance tends to produce adversarial artifacts, so a more effective method of direct conditional control is needed.
Method: Introduces attention supervision by aligning the model's attention maps with external control signals (such as a sparse 3D-aware object layout), and designs a dedicated Layout ControlNet and an automated annotation pipeline to support scalable layout integration.
Result: Experiments on several benchmark video generation datasets show that ACD outperforms existing methods in conditional alignment, temporal coherence, and visual quality.
Conclusion: ACD offers an effective new paradigm for conditional video synthesis, with stronger controllability and higher generation quality.
Abstract: Controllability is a fundamental requirement in video synthesis, where accurate alignment with conditioning signals is essential. Existing classifier-free guidance methods typically achieve conditioning indirectly by modeling the joint distribution of data and conditions, which often results in limited controllability over the specified conditions. Classifier-based guidance enforces conditions through an external classifier, but the model may exploit this mechanism to raise the classifier score without genuinely satisfying the intended condition, resulting in adversarial artifacts and limited effective controllability. In this paper, we propose Attention-Conditional Diffusion (ACD), a novel framework for direct conditional control in video diffusion models via attention supervision. By aligning the model's attention maps with external control signals, ACD achieves better controllability. To support this, we introduce a sparse 3D-aware object layout as an efficient conditioning signal, along with a dedicated Layout ControlNet and an automated annotation pipeline for scalable layout integration. Extensive experiments on benchmark video generation datasets demonstrate that ACD delivers superior alignment with conditioning inputs while preserving temporal coherence and visual fidelity, establishing an effective paradigm for conditional video synthesis.
[87] GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation
Snehal Singh Tomar,Alexandros Graikos,Arjun Krishna,Dimitris Samaras,Klaus Mueller
Main category: cs.CV
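TL;DR: This paper proposes a new image sequence generation method that first generates a coarse low-resolution sequence and then refines individual frames at high resolution, improving generation quality, sequence coherence, and efficiency.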
Details
Motivation: Existing image sequence generation models operate directly on high-resolution tensors, which is inefficient and hard to model, making it difficult to achieve both sequence coherence and detail quality.
Method: Uses a Diffusion Transformer (DiT) to model inter-frame correlations on low-resolution grid images and generate a coarse sequence, then super-resolves each frame independently to recover high-resolution detail, with no architectural changes.
Result: Outperforms SoTA methods across datasets, with higher generation quality and inference at least twice as fast; supports arbitrary-length sequences, trains more efficiently, and generalizes well across domains.
Conclusion: Decoupling image sequence generation into low-resolution sequence modeling and high-resolution per-frame super-resolution is a more effective, efficient, and general paradigm that can overcome current SoTA bottlenecks.
Abstract: Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (SoTA)? In this work, we address this question in the context of generative models and aim to devise a more effective way of modeling image sequence data. Observing the inefficiencies and bottlenecks of current SoTA image sequence generation methods, we showcase that rather than working with large tensors, we can improve the generation process by factorizing it into first generating the coarse sequence at low resolution and then refining the individual frames at high resolution. We train a generative model solely on grid images comprising subsampled frames. Yet, we learn to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames. In effect, our formulation extends a 2D image generator to operate as a low-resolution 3D image-sequence generator without introducing any architectural modifications. Subsequently, we super-resolve each frame individually to add the sequence-independent high-resolution details. This approach offers several advantages and can overcome key limitations of the SoTA in this domain. Compared to existing image sequence generation models, our method achieves superior synthesis quality and improved coherence across sequences. It also delivers high-fidelity generation of arbitrary-length sequences and increased efficiency in inference time and training data usage.
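As a rough illustration of the "grid images comprising subsampled frames" idea in the abstract above, the sketch below tiles a short frame sequence into a single 2D grid image and splits it back. It is not taken from the paper: the function names `frames_to_grid` and `grid_to_frames`, the row-major tile order, and the use of plain strided spatial subsampling are all assumptions made for the example.

```python
import numpy as np

def frames_to_grid(frames, rows, cols, stride=2):
    """Tile T = rows * cols frames into one low-resolution grid image.

    frames: array of shape (T, H, W, C).
    stride: spatial subsampling factor (a naive assumption for this sketch).
    """
    t, _, _, c = frames.shape
    if t != rows * cols:
        raise ValueError("number of frames must equal rows * cols")
    small = frames[:, ::stride, ::stride]            # (T, h, w, C)
    h, w = small.shape[1:3]
    # Arrange frames row-major: frame index = r * cols + col.
    grid = small.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c)

def grid_to_frames(grid, rows, cols):
    """Inverse of the tiling step: cut the grid image back into frames."""
    gh, gw, c = grid.shape
    h, w = gh // rows, gw // cols
    tiles = grid.reshape(rows, h, cols, w, c).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(rows * cols, h, w, c)

frames = np.arange(4 * 8 * 8 * 3, dtype=np.float32).reshape(4, 8, 8, 3)
grid = frames_to_grid(frames, rows=2, cols=2)        # one (8, 8, 3) image
coarse = grid_to_frames(grid, rows=2, cols=2)        # four (4, 4, 3) frames
```

Under this layout a 2D image generator sees all frames at once in a single image, which is the property the abstract attributes to the grid formulation; each recovered coarse frame would then be passed to a separate super-resolution step.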
Furthermore, our straightforward formulation enables our method to generalize effectively across diverse data domains, which typically require additional priors and supervision to model in a generative context. Our method consistently outperforms SoTA in quality and inference speed (at least twice-as-fast) across datasets.
[88] Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential
Shihao Zou,Jingjing Li,Wei Ji,Jincai Huang,Kai Wang,Guo Dan,Weixin Si,Yi Pan
Main category: cs.CV
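TL;DR: This paper proposes SpikeSurgSeg, the first spike-driven video Transformer framework for surgical scene segmentation, with the potential for real-time processing on non-GPU platforms.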
Details
Motivation: Existing deep learning models perform well on surgical scene segmentation, but their high compute and power demands limit real-time deployment in resource-constrained environments. In addition, scarce labeled data and the inherent sparsity of surgical video constrain the use of spiking neural networks (SNNs).
Method: Proposes a surgical-scene masked autoencoding pretraining strategy for SNNs that learns robust spatiotemporal representations through layer-wise tube masking, combined with a lightweight spike-driven segmentation head that preserves low latency.
Result: On EndoVis18 and the in-house SurgBleed dataset, SpikeSurgSeg achieves mIoU comparable to state-of-the-art ANN models while reducing inference latency by at least 8×, and it is more than 20× faster than most foundation-model baselines.
Conclusion: SpikeSurgSeg substantially reduces energy use and latency while maintaining high accuracy, showing strong potential for practical deployment in time-critical surgical scene segmentation.
Abstract: Modern surgical systems increasingly rely on intelligent scene understanding to provide timely situational awareness for enhanced intra-operative safety. Within this pipeline, surgical scene segmentation plays a central role in accurately perceiving operative events. Although recent deep learning models, particularly large-scale foundation models, achieve remarkable segmentation accuracy, their substantial computational demands and power consumption hinder real-time deployment in resource-constrained surgical environments. To address this limitation, we explore emerging spiking neural networks (SNNs) as a promising paradigm for highly efficient surgical intelligence. However, their performance is still constrained by the scarcity of labeled surgical data and the inherently sparse nature of surgical video representations. To this end, we propose SpikeSurgSeg, the first spike-driven video Transformer framework tailored for surgical scene segmentation with real-time potential on non-GPU platforms. To address the limited availability of surgical annotations, we introduce a surgical-scene masked autoencoding pretraining strategy for SNNs that enables robust spatiotemporal representation learning via layer-wise tube masking. Building on this pretrained backbone, we further adopt a lightweight spike-driven segmentation head that produces temporally consistent predictions while preserving the low-latency characteristics of SNNs. Extensive experiments on EndoVis18 and our in-house SurgBleed dataset demonstrate that SpikeSurgSeg achieves mIoU comparable to SOTA ANN-based models while reducing inference latency by at least 8×.
Notably, it delivers over 20× acceleration relative to most foundation-model baselines, underscoring its potential for time-critical surgical scene segmentation.
[89] Post-Processing Mask-Based Table Segmentation for Structural Coordinate Extraction
Suren Bandara
Main category: cs.CV
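TL;DR: Proposes a new table edge detection method based on multi-scale signal processing; Gaussian convolution and statistical thresholding suppress noise while preserving structural edges, substantially improving table segmentation accuracy on low-resolution or noisy images.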
Details
Motivation: Accurately identifying row and column boundaries of tables remains difficult in low-resolution or noisy images. Existing Transformer-based methods adapt poorly to noisy inputs, and traditional mask-based edge detection is sensitive to noise and resolution.
Method: Models row and column transitions in the table mask as one-dimensional signals, smooths them at multiple scales using Gaussian convolution with progressively increasing variance, denoises with statistical thresholding, locates edges by detecting signal peaks, and maps the peaks back to image coordinates to obtain precise table boundaries.
Result: On the PubLayNet-1M dataset with TableNet and PyTesseract OCR, column edge detection raises Cell-Aware Segmentation Accuracy (CASA) from 67% to 76%. The method is robust to resolution changes and supports zero-padding and scaling strategies.
Conclusion: The method improves the accuracy and robustness of table structure extraction, particularly for low-quality scanned documents, and the structured tabular output it produces benefits downstream analysis.
Abstract: Structured data extraction from tables plays a crucial role in document image analysis for scanned documents and digital archives. Although many methods have been proposed to detect table structures and extract cell contents, accurately identifying table segment boundaries (rows and columns) remains challenging, particularly in low-resolution or noisy images. In many real-world scenarios, table data are incomplete or degraded, limiting the adaptability of transformer-based methods to noisy inputs. Mask-based edge detection techniques have shown greater robustness under such conditions, as their sensitivity can be adjusted through threshold tuning; however, existing approaches typically apply masks directly to images, leading to noise sensitivity, resolution loss, or high computational cost. This paper proposes a novel multi-scale signal-processing method for detecting table edges from table masks. Row and column transitions are modeled as one-dimensional signals and processed using Gaussian convolution with progressively increasing variances, followed by statistical thresholding to suppress noise while preserving stable structural edges. Detected signal peaks are mapped back to image coordinates to obtain accurate segment boundaries. Experimental results show that applying the proposed approach to column edge detection improves Cell-Aware Segmentation Accuracy (CASA), a layout-aware metric evaluating both textual correctness and correct cell placement, from 67% to 76% on the PubLayNet-1M benchmark when using TableNet with PyTesseract OCR.
The method is robust to resolution variations through zero-padding and scaling strategies and produces optimized structured tabular outputs suitable for downstream analysis.
[90] AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents
Yue Cao,Yingyao Wang,Pi Bu,Jingxuan Xing,Wei Jiang,Zekun Zhu,Junpeng Ma,Sashuai Zhou,Tong Lu,Jun Song,Yu Cheng,Yuning Jiang,Bo Zheng
Main category: cs.CV
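TL;DR: AndroidLens is a new evaluation framework for mobile GUI agents comprising 571 long-latency tasks that cover complex operations in real-world scenarios. Evaluation shows that current models perform poorly, highlighting several challenges in real environments.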
Details
Motivation: Existing evaluation benchmarks for mobile GUI agents are limited to a few applications, simple tasks, and coarse-grained metrics, and cannot realistically assess complex, long-horizon tasks, so a more comprehensive and challenging evaluation framework is needed.
Method: Proposes the AndroidLens framework with 571 bilingual (Chinese and English) tasks across 38 domains, averaging more than 26 steps per task. Static evaluation preserves real-world anomalies and allows multiple valid paths; it is combined with a milestone-based dynamic evaluation that measures fine-grained progress via Average Task Progress (ATP).
Result: Even the best models reach only a 12.7% task success rate and 50.47% ATP on AndroidLens. The framework surfaces key challenges including environmental anomalies, adaptive exploration, and long-term memory retention.
Conclusion: AndroidLens provides an evaluation platform for mobile GUI agents that is closer to the real world, raises both the difficulty and the comprehensiveness of evaluation, exposes serious shortcomings of existing models on complex tasks, and points to directions for future research.
Abstract: Graphical user interface (GUI) agents can substantially improve productivity by automating frequently executed long-latency tasks on mobile devices. However, existing evaluation benchmarks are still constrained to limited applications, simple tasks, and coarse-grained metrics. To address this, we introduce AndroidLens, a challenging evaluation framework for mobile GUI agents, comprising 571 long-latency tasks in both Chinese and English environments, each requiring an average of more than 26 steps to complete. The framework features: (1) tasks derived from real-world user scenarios across 38 domains, covering complex types such as multi-constraint, multi-goal, and domain-specific tasks; (2) static evaluation that preserves real-world anomalies and allows multiple valid paths to reduce bias; and (3) dynamic evaluation that employs a milestone-based scheme for fine-grained progress measurement via Average Task Progress (ATP). Our evaluation indicates that even the best models reach only a 12.7% task success rate and 50.47% ATP. We also underscore key challenges in real-world environments, including environmental anomalies, adaptive exploration, and long-term memory retention.
[91] TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning
Varun Belagali,Saarthak Kapse,Pierre Marza,Srijan Das,Zilinghan Li,Sofiène Boutaj,Pushpak Pati,Srikar Yellapragada,Tarak Nath Nandi,Ravi K Madduri,Joel Saltz,Prateek Prasanna,Stergios Christodoulidis,Maria Vakalopoulou,Dimitris Samaras
Main category: cs.CV
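TL;DR: This paper proposes TICON, a Transformer-based tile representation contextualizer that unifies and enhances embeddings from any tile-level foundation model, achieving state-of-the-art performance across a range of computational pathology tasks.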
Details
Motivation: Existing tile-encoder pipelines ignore global context, and different tasks call for different encoders; there is no single model that unifies embeddings from multiple sources and supplies context.
Method: TICON uses a single shared Transformer encoder, pretrained with a masked modeling objective, to unify and contextualize embeddings from different tile-level foundation models. A slide-level aggregator is then pretrained on TICON to form a slide-level foundation model.
Result: TICON substantially improves performance on tile-level benchmarks (HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (Patho-Bench), setting new state-of-the-art results. Its slide-level model, pretrained on only 11K WSIs, outperforms existing state-of-the-art models pretrained on up to 350K WSIs.
Conclusion: TICON addresses the lack of context in local embeddings for pathology image analysis, providing a unified contextualization framework across tasks and models and improving downstream performance with far less pretraining data.
Abstract: The interpretation of small tiles in large whole slide images (WSI) often needs a larger image context. We introduce TICON, a transformer-based tile representation contextualizer that produces rich, contextualized embeddings for "any" application in computational pathology. Standard tile encoder-based pipelines, which extract embeddings of tiles stripped from their context, fail to model the rich slide-level information essential for both local and global tasks. Furthermore, different tile-encoders excel at different downstream tasks. Therefore, a unified model is needed to contextualize embeddings derived from "any" tile-level foundation model. TICON addresses this need with a single, shared encoder, pretrained using a masked modeling objective to simultaneously unify and contextualize representations from diverse tile-level pathology foundation models. Our experiments demonstrate that TICON-contextualized embeddings significantly improve performance across many different tasks, establishing new state-of-the-art results on tile-level benchmarks (i.e., HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (i.e., Patho-Bench). Finally, we pretrain an aggregator on TICON to form a slide-level foundation model, using only 11K WSIs, outperforming SoTA slide-level foundation models pretrained with up to 350K WSIs.
[92] Fast SAM2 with Text-Driven Token Pruning
Avilasha Mandal,Chaoning Zhang,Fachrina Dewi Puspitasari,Xudong Wang,Jiaquan Zhang,Caiyan Qin,Guoqing Wang,Yang Yang,Heng Tao Shen
Main category: cs.CV
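TL;DR: This paper proposes a text-guided token pruning framework that improves the inference efficiency of the video object segmentation model SAM2. By selectively dropping unimportant visual tokens before temporal propagation, it substantially reduces compute and memory cost with almost no loss in segmentation performance.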
Details
Motivation: Models such as SAM2 perform well on video segmentation, but processing large numbers of spatiotemporal visual tokens makes compute and memory costs too high, limiting practical deployment, especially in resource-constrained settings.
Method: Adds a lightweight routing mechanism after visual encoding and before temporal propagation. It ranks tokens by combining local visual context, semantic relevance derived from textual descriptions, and uncertainty cues, and keeps only the most relevant tokens for downstream processing.
Result: Across several video segmentation benchmarks, the method achieves up to 42.50% faster inference and 37.41% lower GPU memory usage than the original SAM2 while keeping competitive J and F scores.
Conclusion: Post-encoder token pruning improves the scalability of video segmentation systems without modifying the segmentation architecture, offering a practical path for Transformer-based models in real-time and resource-constrained settings.
Abstract: Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation, yet its practical deployment remains limited by the high computational and memory cost of processing dense visual tokens across time. The SAM2 pipelines typically propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object, resulting in reduced scalability due to quadratic memory attention overhead. In this work, we introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation, without modifying the underlying segmentation architecture. Operating after visual encoding and before memory-based propagation, our method ranks tokens using a lightweight routing mechanism that integrates local visual context, semantic relevance derived from object-centric textual descriptions (either user-provided or automatically generated), and uncertainty cues that help preserve ambiguous or boundary-critical regions. By retaining only the most informative tokens for downstream processing, the proposed approach reduces redundant computation while maintaining segmentation fidelity.
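The ranking-and-retention step described above can be illustrated with a small sketch. It assumes that three per-token scores (visual context, text relevance, uncertainty) have already been computed by some upstream component, and it combines them with a fixed weighted sum. The paper describes a lightweight routing mechanism, so the fixed weights, the min-max normalization, and the name `prune_tokens` are stand-ins chosen for this example.

```python
import numpy as np

def _min_max(x):
    """Scale scores to [0, 1]; constant inputs map to all zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def prune_tokens(tokens, visual, text, uncertainty,
                 keep_ratio=0.5, weights=(1.0, 1.0, 1.0)):
    """Keep the highest-scoring tokens, preserving their original order.

    tokens: (N, D) visual tokens from the image encoder.
    visual, text, uncertainty: (N,) per-token scores (assumed precomputed).
    keep_ratio, weights: hypothetical knobs for this sketch.
    """
    w_v, w_t, w_u = weights
    score = (w_v * _min_max(visual)
             + w_t * _min_max(text)
             + w_u * _min_max(uncertainty))
    k = max(1, int(round(keep_ratio * len(tokens))))
    keep = np.sort(np.argsort(-score)[:k])   # top-k indices, in spatial order
    return tokens[keep], keep

tokens = np.arange(24, dtype=np.float32).reshape(6, 4)
visual = np.zeros(6)                                  # uninformative here
text = np.array([0.9, 0.1, 0.8, 0.2, 0.0, 0.3])       # relevance to the prompt
uncertainty = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0])
kept, keep_idx = prune_tokens(tokens, visual, text, uncertainty)
```

Token 4 has no text relevance but is kept because of its high uncertainty score, which mirrors the stated goal of preserving ambiguous or boundary-critical regions. Returning the kept indices lets a downstream module map pruned tokens back to their spatial positions.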
Extensive experiments across multiple challenging video segmentation benchmarks demonstrate that post-encoder token pruning provides a practical and effective pathway to efficient, prompt-aware video segmentation, achieving up to 42.50 percent faster inference and 37.41 percent lower GPU memory usage compared to the unpruned baseline SAM2, while preserving competitive J and F performance. These results highlight the potential of early token selection to improve the scalability of transformer-based video segmentation systems for real-time and resource-constrained applications.
[93] Streaming Video Instruction Tuning
Jiaer Xia,Peixian Chen,Mengdan Zhang,Xing Sun,Kaiyang Zhou
Main category: cs.CV
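TL;DR: Streamo is a real-time streaming video LLM that serves as a general-purpose interactive assistant and supports a wide range of streaming video tasks.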
Details
Motivation: Existing online video models are usually limited to question answering or captioning and lack unified support for diverse real-time streaming tasks.
Method: Builds Streamo-Instruct-465K, a large-scale instruction-following dataset, and trains end-to-end to model multiple tasks in a unified way.
Result: Streamo shows strong temporal reasoning, responsive interaction, and generalization, performing well across a variety of streaming benchmarks.
Conclusion: Streamo bridges the gap between offline video perception models and real-time multimodal assistants, advancing unified, intelligent video understanding in continuous video streams.
Abstract: We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.
[94] Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models
Li-Zhong Szu-Tu,Ting-Lin Wu,Chia-Jui Chang,He Syu,Yu-Lun Liu
Main category: cs.CV
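TL;DR: This paper shows that vision-language models (VLMs) perform better on famous buildings, with a popularity bias of up to 34%, indicating reliance on memorization rather than generalizable understanding. To study the problem systematically, the authors introduce the YearGuessr dataset (55,546 building images) and popularity-aware evaluation metrics. Experiments cover more than 30 models, including the proposed YearCLIP, and show that VLM performance drops markedly on lesser-known buildings, exposing a flaw in their reasoning capabilities.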
Details
Motivation: To expose and quantify popularity bias in current vision-language models, and to test whether the models truly generalize rather than rely on memorizing well-known subjects.
Method: Builds YearGuessr, a large open benchmark of 55,546 building images from 157 countries, annotated with construction year (1001–2024), GPS coordinates, and page-view counts as a proxy for popularity. Construction year prediction is framed as ordinal regression, and popularity-aware interval accuracy metrics are designed to quantify the bias.
Result: VLMs are up to 34% more accurate on famous buildings than on ordinary ones; performance drops markedly on buildings that are not widely recognized; evaluation with the new metrics confirms a widespread dependence on popularity.
Conclusion: Current vision-language models have a serious popularity bias and rely too heavily on memorization rather than logical or visual reasoning, which limits generalization in real-world settings. Future models need to focus more on understanding less popular subjects.
Abstract: We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: https://sytwu.github.io/BeyondMemo/
[95] HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming
Haonan Qiu,Shikun Liu,Zijian Zhou,Zhaochong An,Weiming Ren,Zhiheng Liu,Jonas Schult,Sen He,Shoufa Chen,Yuren Cong,Tao Xiang,Ziwei Liu,Juan-Manuel Perez-Rua
Main category: cs.CV
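TL;DR: This paper proposes HiStream, an efficient high-resolution video generation framework that substantially accelerates denoising through compression along three dimensions (spatial, temporal, and timestep) while maintaining high quality, achieving speedups of up to 107.5× over existing methods.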