cs.CL

[1] Uncovering Competency Gaps in Large Language Models and Their Benchmarks

Matyas Bohacek,Nino Scherrer,Nicholas Dufour,Thomas Leung,Christoph Bregler,Stephanie C. Y. Chan

Main category: cs.CL

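TL;DR: This paper proposes a new method based on sparse autoencoders (SAEs) that automatically uncovers "model gaps" and "benchmark gaps" in large language models and their benchmarks. By grounding evaluation in the model's internal representations, it enables fine-grained, concept-level analysis across benchmarks.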

Details Motivation: Aggregated benchmark metrics can hide capability deficits in specific sub-areas as well as uneven coverage within the benchmarks themselves, so a finer-grained evaluation grounded in the model's internal representations is needed. Method: Sparse autoencoders (SAEs) extract the model's concept activations, which are combined with saliency-weighted performance scores computed over data from multiple benchmarks to identify gaps in both models and benchmarks. Result: Experiments on two open-source models and ten benchmarks show that the models underperform on concepts opposed to sycophantic behavior (e.g., politely refusing a request) and on safety-related concepts; many benchmarks over-represent obedience-related concepts while omitting core concepts they are meant to cover. Conclusion: The method provides a representation-grounded tool for fine-grained decomposition of evaluation results, revealing the reasons behind aggregate scores and guiding benchmark improvement as a complement to conventional evaluation. Abstract: The evaluation of large language models (LLMs) relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics for a given capability, but those aggregated metrics can obscure (i) particular sub-areas where the LLMs are weak ("model gaps") and (ii) imbalanced coverage in the benchmarks themselves ("benchmark gaps"). We propose a new method that uses sparse autoencoders (SAEs) to automatically uncover both types of gaps. By extracting SAE concept activations and computing saliency-weighted performance scores across benchmark data, the method grounds evaluation in the model's internal representations and enables comparison across benchmarks. As examples demonstrating our approach, we applied the method to two popular open-source models and ten benchmarks. We found that these models consistently underperformed on concepts that stand in contrast to sycophantic behaviors (e.g., politely refusing a request or asserting boundaries) and concepts connected to safety discussions. These model gaps align with observations previously surfaced in the literature; our automated, unsupervised method was able to recover them without manual supervision. We also observed benchmark gaps: many of the evaluated benchmarks over-represented concepts related to obedience, authority, or instruction-following, while missing core concepts that should fall within their intended scope. In sum, our method offers a representation-grounded approach to evaluation, enabling concept-level decomposition of benchmark scores. Rather than replacing conventional aggregated metrics, CG complements them by providing a concept-level decomposition that can reveal why a model scored as it did and how benchmarks could evolve to better reflect their intended scope. Code is available at https://competency-gaps.github.io.
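One way to picture a saliency-weighted performance score is as a weighted accuracy per concept, where each benchmark example contributes in proportion to how strongly the concept is active on it. The sketch below is only an illustration under that assumption; the function name `concept_scores`, the concept labels, and the activation values are hypothetical and are not taken from the paper or its released code.

```python
from collections import defaultdict


def concept_scores(examples):
    """Return a saliency-weighted accuracy per concept.

    `examples` is a list of (activations, correct) pairs, where
    `activations` maps a concept name to a non-negative weight and
    `correct` is 1.0 if the model answered the example correctly, else 0.0.
    """
    weighted_hits = defaultdict(float)
    total_weight = defaultdict(float)
    for activations, correct in examples:
        for concept, weight in activations.items():
            weighted_hits[concept] += weight * correct
            total_weight[concept] += weight
    # Concepts that never activate are skipped to avoid division by zero.
    return {
        concept: weighted_hits[concept] / total_weight[concept]
        for concept in total_weight
        if total_weight[concept] > 0
    }


# Hypothetical toy data: three benchmark items with made-up activations.
examples = [
    ({"polite_refusal": 0.9, "arithmetic": 0.1}, 0.0),
    ({"arithmetic": 0.8}, 1.0),
    ({"polite_refusal": 0.3, "arithmetic": 0.5}, 1.0),
]
scores = concept_scores(examples)
for concept, score in sorted(scores.items()):
    print(f"{concept}: {score:.2f}")
```

Under this reading, a concept with a low score and high total weight would be a candidate model gap, while a concept with near-zero total weight across a benchmark would be a candidate benchmark gap.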

[2] SA-DiffuSeq: Addressing Computational and Scalability Challenges in Long-Document Generation with Sparse Attention

Alexandros Christoforos,Chadbourne Davis

Main category: cs.CL

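TL;DR: SA-DiffuSeq is a diffusion framework that incorporates sparse attention to generate long text efficiently, substantially reducing computational cost while improving generation quality.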

Details Motivation: Diffusion models incur high computational cost and memory overhead in long-text generation and degrade as sequences grow, so a more scalable approach is needed. Method: SA-DiffuSeq introduces sparse attention into the diffusion process and designs a soft absorbing state that stabilizes diffusion trajectories, improving sampling efficiency and the precision of long-range dependency modeling. Result: Experiments show that SA-DiffuSeq outperforms existing diffusion models in training efficiency and sampling speed, especially on long sequences. Conclusion: Introducing structured sparsity into diffusion models is an effective path toward efficient and expressive long-text generation. Abstract: Diffusion-based approaches to long-form text generation suffer from prohibitive computational cost and memory overhead as sequence length increases. We introduce SA-DiffuSeq, a diffusion framework that integrates sparse attention to fundamentally improve scalability for long-document modeling. By selectively allocating attention within the diffusion process, SA-DiffuSeq significantly reduces computational complexity while maintaining semantic coherence and generation quality. A key component of our method is a soft absorbing state tailored to sparse attention dynamics, which stabilizes diffusion trajectories and accelerates sequence reconstruction. This design improves sampling efficiency and enhances precision in long-range dependency modeling. Extensive experiments demonstrate that SA-DiffuSeq consistently surpasses state-of-the-art diffusion baselines in both training efficiency and sampling speed, with especially strong gains on extended sequences. These properties make SA-DiffuSeq well suited for demanding long-form applications such as scientific writing, large-scale code generation, and multi-turn long-context dialogue. Overall, our results indicate that incorporating structured sparsity into diffusion models is a promising direction for efficient and expressive long-text generation.
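The abstract does not say which sparsity pattern SA-DiffuSeq uses, so the following sketch assumes the simplest common choice, a sliding local window, purely to show why sparse attention reduces cost. The helper names `local_attention_mask` and `count_attended_pairs` are hypothetical.

```python
def local_attention_mask(seq_len, window):
    """Boolean mask where position i may attend to j only if |i - j| <= window."""
    return [
        [abs(i - j) <= window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def count_attended_pairs(mask):
    """Number of (query, key) pairs for which attention scores are computed."""
    return sum(sum(row) for row in mask)


seq_len, window = 8, 1
sparse_pairs = count_attended_pairs(local_attention_mask(seq_len, window))
dense_pairs = seq_len * seq_len
print(f"sparse: {sparse_pairs} pairs, dense: {dense_pairs} pairs")
```

With a fixed window, the number of attended pairs grows linearly with sequence length, whereas dense attention grows quadratically; that gap is what makes long sequences tractable.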

[3] TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior

Gül Sena Altıntaş,Malikeh Ehghaghi,Brian Lester,Fengyuan Liu,Wanru Zhao,Marco Ciccone,Colin Raffel

Main category: cs.CL

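TL;DR: This paper presents TokSuite, a collection of models and a benchmark for studying how tokenizers affect language models. By training fourteen models that use different tokenizers but are otherwise identical, it reveals the strengths and weaknesses of a range of popular tokenizers.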

Details Motivation: Because the effect of tokenization is hard to measure in isolation, its role in language model performance and behavior remains unclear, which calls for a systematic study. Method: Fourteen models were trained with different tokenizers but identical architecture, dataset, training budget, and initialization, and a new benchmark was built to evaluate model performance under real-world perturbations. Result: TokSuite isolates the influence of the tokenizer, exposing performance differences among tokenizers in practical use along with their respective strengths and weaknesses. Conclusion: Tokenizers significantly affect language model behavior and performance, and TokSuite offers tools and a methodological basis for future tokenizer research. Abstract: Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood due to the challenge of measuring the impact of tokenization in isolation. To address this need, we present TokSuite, a collection of models and a benchmark that supports research into tokenization's influence on LMs. Specifically, we train fourteen models that use different tokenizers but are otherwise identical using the same architecture, dataset, training budget, and initialization. Additionally, we curate and release a new benchmark that specifically measures model performance subject to real-world perturbations that are likely to influence tokenization. Together, TokSuite allows robust decoupling of the influence of a model's tokenizer, supporting a series of novel findings that elucidate the respective benefits and shortcomings of a wide range of popular tokenizers.

[4] Adversarial Training for Failure-Sensitive User Simulation in Mental Health Dialogue Optimization

Ziyi Zhu,Olivier Tieleman,Caitlin A. Stamatis,Luka Smyth,Thomas D. Hull,Daniel R. Cahn,Matteo Malgaroli

Main category: cs.CL

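TL;DR: Proposes an adversarially trained user simulator framework for evaluating task-oriented dialogue systems in mental health support chatbots, substantially improving simulation realism and the ability to uncover system flaws.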

Details Motivation: Existing user simulators struggle to replicate human behavior accurately and have limited ability to expose the failure modes of dialogue systems, so more realistic simulation is needed to evaluate system performance effectively. Method: An adversarial training framework pits a generator (the user simulator) against a discriminator, iteratively improving simulator realism; the approach is validated on mental health support chatbots. Result: Fine-tuned and adversarially trained simulators substantially outperform zero-shot base models at surfacing system issues. Simulated and real failure occurrence rates are strongly correlated, the distributional divergence of failure modes is low, and discriminator accuracy drops sharply, indicating improved realism. Conclusion: Adversarial training is an effective way to build highly realistic user simulators for mental health support, enabling fast, reliable, and low-cost system evaluation. Abstract: Realistic user simulation is crucial for training and evaluating task-oriented dialogue (TOD) systems, yet creating simulators that accurately replicate human behavior remains challenging. A key property of effective simulators is their ability to expose failure modes of the systems they evaluate. We present an adversarial training framework that iteratively improves user simulator realism through a competitive dynamic between a generator (user simulator) and a discriminator. Applied to mental health support chatbots, our approach demonstrates that fine-tuned simulators dramatically outperform zero-shot base models at surfacing system issues, and adversarial training further enhances diversity, distributional alignment, and predictive validity. The resulting simulator achieves a strong correlation between simulated and real failure occurrence rates across diverse chatbot configurations while maintaining low distributional divergence of failure modes. Discriminator accuracy decreases drastically after three adversarial iterations, suggesting improved realism. These results provide evidence that adversarial training is a promising approach for creating realistic user simulators in mental health support TOD domains, enabling rapid, reliable, and cost-effective system evaluation before deployment.

[5] Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles

Ramatu Oiza Abdulsalam,Segun Aroyehun

Main category: cs.CL

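TL;DR: By comparing the responses of expert tutors, novice tutors, and large language models in math tutoring, this study finds that although LLMs match experts in overall perceived pedagogical quality, they differ systematically in instructional strategies and linguistic characteristics.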

Details Motivation: To examine how closely the instructional behavior of large language models in math tutoring matches expert human practice. Method: A controlled, turn-level comparison in which expert tutors, novice tutors, and multiple large language models respond to the same math remediation conversation turns, followed by analysis of their instructional strategies and linguistic characteristics. Result: Large language models approach expert levels of perceived pedagogical quality but use restating and revoicing less often and produce longer, more lexically diverse, and more polite responses. Statistical analyses show that restating and revoicing, lexical diversity, and pressing for accuracy are positively associated with perceived pedagogical quality, whereas more agentic and polite language is negatively associated. Conclusion: Although large language models reach expert-like perceived pedagogical quality, they rely on different instructional and linguistic strategies, which underscores the importance of analyzing specific instructional behaviors when evaluating intelligent tutoring systems. Abstract: Recent work has explored the use of large language models for generating tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice. We examine this question using a controlled, turn-level comparison in which expert human tutors, novice human tutors, and multiple large language models respond to the same set of math remediation conversation turns. We examine both instructional strategies and linguistic characteristics of tutoring responses, including restating and revoicing, pressing for accuracy, lexical diversity, readability, politeness, and agency. We find that large language models approach expert levels of perceived pedagogical quality on average but exhibit systematic differences in their instructional and linguistic profiles. In particular, large language models tend to underuse restating and revoicing strategies characteristic of expert human tutors, while producing longer, more lexically diverse, and more polite responses. Statistical analyses show that restating and revoicing, lexical diversity, and pressing for accuracy are positively associated with perceived pedagogical quality, whereas higher levels of agentic and polite language are negatively associated. Overall, recent large language models exhibit levels of perceived pedagogical quality comparable to expert human tutors, while relying on different instructional and linguistic strategies. These findings underscore the value of analyzing instructional strategies and linguistic characteristics when evaluating tutoring responses across human tutors and intelligent tutoring systems.

[6] Investigating Model Editing for Unlearning in Large Language Models

Shariqah Hossain,Lalana Kagal

Main category: cs.CL

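TL;DR: This paper examines whether model editing algorithms (such as ROME, IKE, and WISE) can be applied to machine unlearning. In certain settings these methods outperform conventional unlearning methods, but they still struggle to fully isolate the information to be forgotten while preserving overall model performance.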

Details Motivation: Existing machine unlearning methods are inefficient for large language models or cannot fully remove the target information without affecting other knowledge, so more effective unlearning techniques need to be explored. Method: The authors take the model editing algorithms ROME, IKE, and WISE, design new editing targets suited to unlearning, and evaluate forgetting quality and retention of model performance across settings. Result: Model editing methods outperform baseline unlearning methods in some settings and forget information more effectively, but they still cannot fully control the scope of forgetting and may harm overall model performance. Conclusion: Model editing algorithms show promise for machine unlearning, but further work is needed to control the boundary of what is forgotten and to protect non-target knowledge. Abstract: Machine unlearning aims to remove unwanted information from a model, but many methods are inefficient for LLMs with large numbers of parameters or fail to fully remove the intended information without degrading performance on knowledge that should be retained. Model editing algorithms solve a similar problem of changing information in models, but they focus on redirecting inputs to a new target rather than removing that information altogether. In this work, we explore the editing algorithms ROME, IKE, and WISE and design new editing targets for an unlearning setting. Through this investigation, we show that model editing approaches can exceed baseline unlearning methods in terms of quality of forgetting depending on the setting. Like traditional unlearning techniques, they struggle to encapsulate the scope of what is to be unlearned without damage to the overall model performance.

[7] Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?

Zhengyang Shan,Aaron Mueller

Main category: cs.CL

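TL;DR: This study investigates how independent demographic bias mechanisms are from demographic recognition in language models, and shows that sparse autoencoder feature ablation can debias precisely while preserving recognition performance.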

Details Motivation: To determine whether demographic bias stems from task-specific mechanisms rather than basic demographic recognition, with the aim of safer and fairer model deployment. Method: A multi-task evaluation framework combines attribution-based and correlation-based methods to locate bias features, with sparse autoencoder feature ablation experiments on Gemma-2-9B. Result: Attribution-based ablations mitigate race and gender profession stereotypes without harming name recognition; correlation-based ablations are more effective for education bias; removing attribution features in education tasks causes "prior collapse", which increases overall bias. Conclusion: Demographic bias arises from task-specific mechanisms rather than absolute demographic markers, and mechanism-based inference-time interventions can debias precisely without affecting core capabilities. Abstract: We investigate how independent demographic bias mechanisms are from general demographic recognition in language models. Using a multi-task evaluation setup where demographics are associated with names, professions, and education levels, we measure whether models can be debiased while preserving demographic detection capabilities. We compare attribution-based and correlation-based methods for locating bias features. We find that targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading recognition performance: attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy, whereas correlation-based ablations are more effective for education bias. Qualitative analysis further reveals that removing attribution features in education tasks induces "prior collapse", thus increasing overall bias. This highlights the need for dimension-specific interventions. Overall, our results show that demographic bias arises from task-specific mechanisms rather than absolute demographic markers, and that mechanistic inference-time interventions can enable surgical debiasing without compromising core model capabilities.

[8] Semantic Deception: When Reasoning Models Can't Compute an Addition

Nathaniël de Leeuw,Marceau Nahon,Mathis Reymond,Raja Chatila,Mehdi Khamassi

Main category: cs.CL

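TL;DR: This paper studies the reasoning ability of large language models (LLMs) when faced with novel symbolic representations. It introduces a "semantic deception" experimental framework and shows that LLMs are easily distracted by surface semantics when handling abstract symbols, relying on statistical associations rather than genuine symbolic reasoning.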

Details Motivation: To examine whether LLMs have genuine capacity for symbolic abstraction and manipulation, especially in decision-making tasks where human values are at stake, and to avoid the ethical and societal risks of wrongly attributing "reasoning" to them. Method: Digits and operators are redefined as novel symbols, simple calculation tasks are constructed with semantic deceptions (symbols whose form carries misleading semantic associations), and four LLMs are tested in this setting. Result: Even on extremely simple tasks, semantic cues significantly reduce LLM performance, indicating that the models struggle to detach from semantic associations in their training data and that chain-of-thought may amplify reliance on surface-level statistical patterns. Conclusion: Current LLMs lack robust symbolic manipulation and tend to exploit surface semantics rather than abstract logic. Their behavior should not be readily treated as genuine reasoning, and they should be applied with caution in critical decision-making settings that require strong symbolic reasoning. Abstract: Large language models (LLMs) are increasingly used in situations where human values are at stake, such as decision-making tasks that involve reasoning when performed by humans. We investigate the so-called reasoning capabilities of LLMs over novel symbolic representations by introducing an experimental framework that tests their ability to process and manipulate unfamiliar symbols. We introduce semantic deceptions: situations in which symbols carry misleading semantic associations due to their form, such as being embedded in specific contexts, designed to probe whether LLMs can maintain symbolic abstraction or whether they default to exploiting learned semantic associations. We redefine standard digits and mathematical operators using novel symbols, and task LLMs with solving simple calculations expressed in this altered notation. The objective is: (1) to assess LLMs' capacity for abstraction and manipulation of arbitrary symbol systems; (2) to evaluate their ability to resist misleading semantic cues that conflict with the task's symbolic logic. Through experiments with four LLMs we show that semantic cues can significantly deteriorate reasoning models' performance on very simple tasks. They reveal limitations in current LLMs' ability for symbolic manipulations and highlight a tendency to over-rely on surface-level semantics, suggesting that chain-of-thought may amplify reliance on statistical correlations. Even in situations where LLMs seem to correctly follow instructions, semantic cues still impact basic capabilities. These limitations raise ethical and societal concerns, undermining the widespread and pernicious tendency to attribute reasoning abilities to LLMs and suggesting how LLMs might fail, in particular in decision-making contexts where robust symbolic reasoning is essential and should not be compromised by residual semantic associations inherited from the model's training.
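To make the idea of redefining digits and operators concrete, here is a minimal sketch of how such a task could be generated. It is an illustration only: the helper names, the `@` symbol for addition, and the choice of a shuffled-digit alphabet as the "deceptive" condition are assumptions, not the paper's actual stimuli.

```python
import random

DIGITS = "0123456789"


def make_notation(symbols, seed=0):
    """Map each standard digit to one of ten distinct replacement symbols."""
    if len(symbols) != 10 or len(set(symbols)) != 10:
        raise ValueError("need exactly ten distinct symbols")
    rng = random.Random(seed)
    shuffled = list(symbols)
    rng.shuffle(shuffled)
    return dict(zip(DIGITS, shuffled))


def encode(number, notation):
    """Write a non-negative integer in the altered notation."""
    return "".join(notation[d] for d in str(number))


def decode(text, notation):
    """Read a string in the altered notation back into an integer."""
    inverse = {symbol: digit for digit, symbol in notation.items()}
    return int("".join(inverse[ch] for ch in text))


def make_addition_task(a, b, notation, plus="@"):
    """Return (prompt, expected answer), both written in the altered notation."""
    prompt = f"{encode(a, notation)} {plus} {encode(b, notation)}"
    return prompt, encode(a + b, notation)


# Neutral condition: letters carry no numeric meaning.
neutral = make_notation("abcdefghij", seed=1)
# Hypothetical deceptive condition: digit glyphs reused with shuffled values,
# so the familiar meaning of each glyph conflicts with its assigned value.
deceptive = make_notation(DIGITS, seed=1)

prompt, answer = make_addition_task(407, 58, deceptive)
print(prompt, "->", answer)
```

A model that truly manipulates the symbols abstractly should perform the same in both conditions; a drop in the deceptive condition would point to interference from learned associations.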

[9] EssayCBM: Rubric-Aligned Concept Bottleneck Models for Transparent Essay Grading

Kumar Satvik Chaudhary,Chengshuai Zhao,Fan Zhang,Yung Hin Tse,Garima Agrawal,Yuli Deng,Huan Liu

Main category: cs.CL

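TL;DR: Proposes EssayCBM, an interpretable essay grading framework that evaluates eight writing concepts and computes the grade with a lightweight network, matching black-box performance while providing transparent feedback that instructors can intervene on.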

Details Motivation: To address the difficulty of interpreting automated grading systems that act as black boxes, and to improve teachers' and students' understanding of and trust in grading results. Method: The EssayCBM framework uses an encoder with dedicated prediction heads to evaluate eight writing concepts; a lightweight network turns the concept scores into the final grade, and instructors can adjust concept scores and see the grade change in real time. Result: The framework matches the grading performance of black-box models while providing an interpretable grading process and concept-level feedback. Conclusion: EssayCBM delivers interpretable, interactive automated essay grading that supports accountability and human-in-the-loop evaluation in teaching. Abstract: Understanding how automated grading systems evaluate essays remains a significant challenge for educators and students, especially when large language models function as black boxes. We introduce EssayCBM, a rubric-aligned framework that prioritizes interpretability in essay assessment. Instead of predicting grades directly from text, EssayCBM evaluates eight writing concepts, such as Thesis Clarity and Evidence Use, through dedicated prediction heads on an encoder. These concept scores form a transparent bottleneck, and a lightweight network computes the final grade using only concepts. Instructors can adjust concept predictions and instantly view the updated grade, enabling accountable human-in-the-loop evaluation. EssayCBM matches black-box performance while offering actionable, concept-level feedback through an intuitive web interface.
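The bottleneck-and-override idea can be sketched in a few lines. The version below stands in a single linear layer for the "lightweight network", shows only the two concepts the abstract names, and uses made-up weights and scores; none of these values come from the paper.

```python
def grade_from_concepts(concept_scores, weights, bias=0.0):
    """Compute a grade from concept scores only (the bottleneck).

    A linear layer is an assumption here; the paper says only that a
    lightweight network is used.
    """
    return bias + sum(
        weights[name] * score for name, score in concept_scores.items()
    )


# Hypothetical weights for two of the eight concepts.
weights = {"thesis_clarity": 0.5, "evidence_use": 0.5}

# Concept scores as they might come from the prediction heads.
predicted = {"thesis_clarity": 3.0, "evidence_use": 2.0}
original_grade = grade_from_concepts(predicted, weights)

# Instructor intervention: override one concept and recompute the grade.
adjusted = dict(predicted, evidence_use=4.0)
adjusted_grade = grade_from_concepts(adjusted, weights)

print(original_grade, adjusted_grade)
```

Because the grade depends on the text only through the concept scores, any instructor edit to a concept propagates to the grade in a way that can be inspected directly.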

[10] MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs

Zhan Qu,Michael Färber

Main category: cs.CL

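TL;DR: MediEval is a benchmark that combines real electronic health records with a unified medical knowledge base to systematically evaluate the knowledge accuracy and contextual consistency of large language models in medical settings. The proposed method, CoRFu, uses counterfactual risk-aware fine-tuning to substantially improve model safety and accuracy.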

Details Motivation: Existing evaluations of medical LLMs either test factual medical knowledge in isolation or cannot verify the correctness of reasoning, leaving reliability and safety without systematic assessment. Method: The MediEval benchmark aligns MIMIC-IV electronic health records with knowledge bases such as UMLS to generate factual and counterfactual medical statements; the authors also propose CoRFu, a DPO-based fine-tuning method with an asymmetric penalty that reduces dangerous errors. Result: CoRFu improves macro-F1 by 16.4 points over the base model and eliminates truth inversion errors, while markedly improving factuality and consistency under the four-quadrant framework. Conclusion: An evaluation framework that jointly considers knowledge grounding and contextual consistency exposes critical weaknesses of medical LLMs, and CoRFu offers an effective path to greater safety and reliability. Abstract: Large Language Models (LLMs) are increasingly applied to medicine, yet their adoption is limited by concerns over reliability and safety. Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. We introduce MediEval, a benchmark that links MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies. MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework that jointly considers knowledge grounding and contextual consistency. Using this framework, we identify critical failure modes, including hallucinated support and truth inversion, that current proprietary, open-source, and domain-specific LLMs frequently exhibit. To address these risks, we propose Counterfactual Risk-Aware Fine-tuning (CoRFu), a DPO-based method with an asymmetric penalty targeting unsafe confusions. CoRFu improves by +16.4 macro-F1 points over the base model and eliminates truth inversion errors, demonstrating both higher accuracy and substantially greater safety.

[11] Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

NVIDIA,:,Aaron Blakeman,Aaron Grattafiori,Aarti Basant,Abhibha Gupta,Abhinav Khattar,Adi Renduchintala,Aditya Vavre,Akanksha Shukla,Akhiad Bercovich,Aleksander Ficek,Aleksandr Shaposhnikov,Alex Kondratenko,Alexander Bukharin,Alexandre Milesi,Ali Taghibakhshi,Alisa Liu,Amelia Barton,Ameya Sunil Mahabaleshwarkar,Amir Klein,Amit Zuker,Amnon Geifman,Amy Shen,Anahita Bhiwandiwalla,Andrew Tao,Ann Guan,Anubhav Mandarwal,Arham Mehta,Ashwath Aithal,Ashwin Poojary,Asif Ahamed,Asma Kuriparambil Thekkumpate,Ayush Dattagupta,Banghua Zhu,Bardiya Sadeghi,Barnaby Simkin,Ben Lanir,Benedikt Schifferer,Besmira Nushi,Bilal Kartal,Bita Darvish Rouhani,Boris Ginsburg,Brandon Norick,Brandon Soubasis,Branislav Kisacanin,Brian Yu,Bryan Catanzaro,Carlo del Mundo,Chantal Hwang,Charles Wang,Cheng-Ping Hsieh,Chenghao Zhang,Chenhan Yu,Chetan Mungekar,Chintan Patel,Chris Alexiuk,Christopher Parisien,Collin Neale,Damon Mosk-Aoyama,Dan Su,Dane Corneil,Daniel Afrimi,Daniel Rohrer,Daniel Serebrenik,Daria Gitman,Daria Levy,Darko Stosic,David Mosallanezhad,Deepak Narayanan,Dhruv Nathawani,Dima Rekesh,Dina Yared,Divyanshu Kakwani,Dong Ahn,Duncan Riach,Dusan Stosic,Edgar Minasyan,Edward Lin,Eileen Long,Eileen Peters Long,Elena Lantz,Ellie Evans,Elliott Ning,Eric Chung,Eric Harper,Eric Tramel,Erick Galinkin,Erik Pounds,Evan Briones,Evelina Bakhturina,Faisal Ladhak,Fay Wang,Fei Jia,Felipe Soares,Feng Chen,Ferenc Galko,Frankie Siino,Gal Hubara Agam,Ganesh Ajjanagadde,Gantavya Bhatt,Gargi Prasad,George Armstrong,Gerald Shen,Gorkem Batmaz,Grigor Nalbandyan,Haifeng Qian,Harsh Sharma,Hayley Ross,Helen Ngo,Herman Sahota,Hexin Wang,Himanshu Soni,Hiren Upadhyay,Huizi Mao,Huy C Nguyen,Huy Q Nguyen,Iain Cunningham,Ido Shahaf,Igor Gitman,Ilya Loshchilov,Ivan Moshkov,Izzy Putterman,Jan Kautz,Jane Polak Scowcroft,Jared Casper,Jatin Mitra,Jeffrey Glick,Jenny Chen,Jesse Oliver,Jian Zhang,Jiaqi Zeng,Jie Lou,Jimmy Zhang,Jining Huang,Joey Conway,Joey Guman,John Kamalu,Johnny Greco,Jonathan Cohen,Joseph Jennings,Joyjit 
Daw,Julien Veron Vialard,Junkeun Yi,Jupinder Parmar,Kai Xu,Kan Zhu,Kari Briski,Katherine Cheung,Katherine Luna,Keshav Santhanam,Kevin Shih,Kezhi Kong,Khushi Bhardwaj,Krishna C. Puvvada,Krzysztof Pawelec,Kumar Anik,Lawrence McAfee,Laya Sleiman,Leon Derczynski,Li Ding,Lucas Liebenwein,Luis Vega,Maanu Grover,Maarten Van Segbroeck,Maer Rodrigues de Melo,Makesh Narsimhan Sreedhar,Manoj Kilaru,Maor Ashkenazi,Marc Romeijn,Mark Cai,Markus Kliegl,Maryam Moosaei,Matvei Novikov,Mehrzad Samadi,Melissa Corpuz,Mengru Wang,Meredith Price,Michael Boone,Michael Evans,Miguel Martinez,Mike Chrzanowski,Mohammad Shoeybi,Mostofa Patwary,Nabin Mulepati,Natalie Hereth,Nave Assaf,Negar Habibi,Neta Zmora,Netanel Haber,Nicola Sessions,Nidhi Bhatia,Nikhil Jukar,Nikki Pope,Nikolai Ludwig,Nima Tajbakhsh,Nirmal Juluru,Oleksii Hrinchuk,Oleksii Kuchaiev,Olivier Delalleau,Oluwatobi Olabiyi,Omer Ullman Argov,Ouye Xie,Parth Chadha,Pasha Shamis,Pavlo Molchanov,Pawel Morkisz,Peter Dykas,Peter Jin,Pinky Xu,Piotr Januszewski,Pranav Prashant Thombre,Prasoon Varshney,Pritam Gundecha,Qing Miao,Rabeeh Karimi Mahabadi,Ran El-Yaniv,Ran Zilberstein,Rasoul Shafipour,Rich Harang,Rick Izzo,Rima Shahbazyan,Rishabh Garg,Ritika Borkar,Ritu Gala,Riyad Islam,Roger Waleffe,Rohit Watve,Roi Koren,Ruoxi Zhang,Russell J. 
Hewett,Ryan Prenger,Ryan Timbrook,Sadegh Mahdavi,Sahil Modi,Samuel Kriman,Sanjay Kariyappa,Sanjeev Satheesh,Saori Kaji,Satish Pasumarthi,Sean Narentharen,Sean Narenthiran,Seonmyeong Bak,Sergey Kashirsky,Seth Poulos,Shahar Mor,Shanmugam Ramasamy,Shantanu Acharya,Shaona Ghosh,Sharath Turuvekere Sreenivas,Shelby Thomas,Shiqing Fan,Shreya Gopal,Shrimai Prabhumoye,Shubham Pachori,Shubham Toshniwal,Shuoyang Ding,Siddharth Singh,Simeng Sun,Smita Ithape,Somshubra Majumdar,Soumye Singhal,Stefania Alborghetti,Stephen Ge,Sugam Dipak Devare,Sumeet Kumar Barua,Suseella Panguluri,Suyog Gupta,Sweta Priyadarshi,Syeda Nahida Akter,Tan Bui,Teodor-Dumitru Ene,Terry Kong,Thanh Do,Tijmen Blankevoort,Tom Balough,Tomer Asida,Tomer Bar Natan,Tugrul Konuk,Twinkle Vashishth,Udi Karpas,Ushnish De,Vahid Noorozi,Vahid Noroozi,Venkat Srinivasan,Venmugil Elango,Vijay Korthikanti,Vitaly Kurin,Vitaly Lavrukhin,Wanli Jiang,Wasi Uddin Ahmad,Wei Du,Wei Ping,Wenfei Zhou,Will Jennings,William Zhang,Wojciech Prazuch,Xiaowei Ren,Yashaswi Karnati,Yejin Choi,Yev Meyer,Yi-Fu Wu,Yian Zhang,Ying Lin,Yonatan Geifman,Yonggan Fu,Yoshi Subara,Yoshi Suhara,Yubo Gao,Zach Moshe,Zhen Dong,Zihan Liu,Zijia Chen,Zijie Yan

Main category: cs.CL

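TL;DR: Nemotron 3 Nano 30B-A3B is a Mixture-of-Experts hybrid Mamba-Transformer language model pretrained on 25 trillion tokens. It achieves higher accuracy with fewer active parameters, delivers up to 3.3x higher inference throughput, supports contexts of up to 1M tokens, shows stronger agentic, reasoning, and chat abilities, and has been openly released.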

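Details Motivation: To build a more efficient, higher-performing language model that improves inference speed and task performance while activating fewer parameters, particularly in demanding settings such as long context, reasoning, and chat. Method: A Mixture-of-Experts hybrid Mamba-Transformer architecture is pretrained on 25 trillion text tokens (including more than 3 trillion new tokens), followed by supervised fine-tuning and large-scale reinforcement learning. Result: Compared with Nemotron 2 Nano, the model achieves higher accuracy while activating less than half of the parameters; inference throughput is up to 3.3x higher than that of similarly sized open models such as GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507; it is more accurate on popular benchmarks and supports contexts of up to 1M tokens. Conclusion: Nemotron 3 Nano clearly surpasses its predecessor and comparable models in efficiency, performance, and capability, demonstrating the potential of hybrid architectures in large models; both the base and post-trained checkpoints are publicly released to support the community. Abstract: We present Nemotron 3 Nano 30B-A3B, a Mixture-of-Experts hybrid Mamba-Transformer language model. Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2, followed by supervised fine-tuning and large-scale RL on diverse environments. Nemotron 3 Nano achieves better accuracy than our previous generation Nemotron 2 Nano while activating less than half of the parameters per forward pass. It achieves up to 3.3x higher inference throughput than similarly-sized open models like GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507, while also being more accurate on popular benchmarks. Nemotron 3 Nano demonstrates enhanced agentic, reasoning, and chat abilities and supports context lengths up to 1M tokens. We release both our pretrained Nemotron 3 Nano 30B-A3B Base and post-trained Nemotron 3 Nano 30B-A3B checkpoints on Hugging Face.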

[12] How important is Recall for Measuring Retrieval Quality?

Shelly Schwartz,Oleg Vasilyev,Randy Sawaya

Main category: cs.CL

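TL;DR: Proposes a retrieval quality measure that does not require knowing the total number of relevant documents, and evaluates it against existing strategies by correlating each with LLM-based judgments of response quality.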

Details Motivation: In realistic retrieval settings the knowledge base is large and changing, so the total number of documents relevant to a query is usually unknown and recall cannot be computed. Method: Experiments on multiple datasets compare, across strategies, the correlation between retrieval quality metrics and LLM-based judgments of response quality. Result: Several established strategies show limited effectiveness, while the proposed simple retrieval quality measure performs well. Conclusion: The proposed measure reflects retrieval performance effectively even when the number of relevant documents is unknown. Abstract: In realistic retrieval settings with large and evolving knowledge bases, the total number of documents relevant to a query is typically unknown, and recall cannot be computed. In this paper, we evaluate several established strategies for handling this limitation by measuring the correlation between retrieval quality metrics and LLM-based judgments of response quality, where responses are generated from the retrieved documents. We conduct experiments across multiple datasets with a relatively low number of relevant documents (2-15). We also introduce a simple retrieval quality measure that performs well without requiring knowledge of the total number of relevant documents.
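The difficulty with recall can be seen by comparing it with precision at k, which needs relevance judgments only for the retrieved documents. The sketch below illustrates that difference with standard definitions and hypothetical document IDs; it is not the measure the paper proposes, which the abstract does not describe.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)


def recall_at_k(retrieved, all_relevant, k):
    """Fraction of ALL relevant documents found in the top k.

    Requires the complete relevant set, which is usually unknown for a
    large, evolving knowledge base.
    """
    if not all_relevant:
        raise ValueError("recall is undefined without any relevant documents")
    hits = sum(1 for doc in retrieved[:k] if doc in all_relevant)
    return hits / len(all_relevant)


retrieved = ["d1", "d7", "d3", "d9"]

# Judging only the retrieved documents is enough for precision.
judged_relevant = {"d1", "d3"}
p_at_4 = precision_at_k(retrieved, judged_relevant, k=4)

# Recall changes if an unretrieved relevant document (here "d5") exists.
r_at_4 = recall_at_k(retrieved, {"d1", "d3", "d5"}, k=4)

print(p_at_4, r_at_4)
```

If the knowledge base later gains more relevant documents that were never retrieved, precision at k stays the same while the true recall silently drops, which is why recall is hard to trust in this setting.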

[13] NVIDIA Nemotron 3: Efficient and Open Intelligence

NVIDIA,:,Aaron Blakeman,Aaron Grattafiori,Aarti Basant,Abhibha Gupta,Abhinav Khattar,Adi Renduchintala,Aditya Vavre,Akanksha Shukla,Akhiad Bercovich,Aleksander Ficek,Aleksandr Shaposhnikov,Alex Kondratenko,Alexander Bukharin,Alexandre Milesi,Ali Taghibakhshi,Alisa Liu,Amelia Barton,Ameya Sunil Mahabaleshwarkar,Amir Klein,Amit Zuker,Amnon Geifman,Amy Shen,Anahita Bhiwandiwalla,Andrew Tao,Anjulie Agrusa,Ankur Verma,Ann Guan,Anubhav Mandarwal,Arham Mehta,Ashwath Aithal,Ashwin Poojary,Asif Ahamed,Asit Mishra,Asma Kuriparambil Thekkumpate,Ayush Dattagupta,Banghua Zhu,Bardiya Sadeghi,Barnaby Simkin,Ben Lanir,Benedikt Schifferer,Besmira Nushi,Bilal Kartal,Bita Darvish Rouhani,Boris Ginsburg,Brandon Norick,Brandon Soubasis,Branislav Kisacanin,Brian Yu,Bryan Catanzaro,Carlo del Mundo,Chantal Hwang,Charles Wang,Cheng-Ping Hsieh,Chenghao Zhang,Chenhan Yu,Chetan Mungekar,Chintan Patel,Chris Alexiuk,Christopher Parisien,Collin Neale,Cyril Meurillon,Damon Mosk-Aoyama,Dan Su,Dane Corneil,Daniel Afrimi,Daniel Lo,Daniel Rohrer,Daniel Serebrenik,Daria Gitman,Daria Levy,Darko Stosic,David Mosallanezhad,Deepak Narayanan,Dhruv Nathawani,Dima Rekesh,Dina Yared,Divyanshu Kakwani,Dong Ahn,Duncan Riach,Dusan Stosic,Edgar Minasyan,Edward Lin,Eileen Long,Eileen Peters Long,Elad Segal,Elena Lantz,Ellie Evans,Elliott Ning,Eric Chung,Eric Harper,Eric Tramel,Erick Galinkin,Erik Pounds,Evan Briones,Evelina Bakhturina,Evgeny Tsykunov,Faisal Ladhak,Fay Wang,Fei Jia,Felipe Soares,Feng Chen,Ferenc Galko,Frank Sun,Frankie Siino,Gal Hubara Agam,Ganesh Ajjanagadde,Gantavya Bhatt,Gargi Prasad,George Armstrong,Gerald Shen,Gorkem Batmaz,Grigor Nalbandyan,Haifeng Qian,Harsh Sharma,Hayley Ross,Helen Ngo,Herbert Hum,Herman Sahota,Hexin Wang,Himanshu Soni,Hiren Upadhyay,Huizi Mao,Huy C Nguyen,Huy Q Nguyen,Iain Cunningham,Ido Galil,Ido Shahaf,Igor Gitman,Ilya Loshchilov,Itamar Schen,Itay Levy,Ivan Moshkov,Izik Golan,Izzy Putterman,Jan Kautz,Jane Polak Scowcroft,Jared Casper,Jatin Mitra,Jeffrey Glick,Jenny 
Chen,Jesse Oliver,Jian Zhang,Jiaqi Zeng,Jie Lou,Jimmy Zhang,Jinhang Choi,Jining Huang,Joey Conway,Joey Guman,John Kamalu,Johnny Greco,Jonathan Cohen,Joseph Jennings,Joyjit Daw,Julien Veron Vialard,Junkeun Yi,Jupinder Parmar,Kai Xu,Kan Zhu,Kari Briski,Katherine Cheung,Katherine Luna,Keith Wyss,Keshav Santhanam,Kevin Shih,Kezhi Kong,Khushi Bhardwaj,Kirthi Shankar,Krishna C. Puvvada,Krzysztof Pawelec,Kumar Anik,Lawrence McAfee,Laya Sleiman,Leon Derczynski,Li Ding,Lizzie Wei,Lucas Liebenwein,Luis Vega,Maanu Grover,Maarten Van Segbroeck,Maer Rodrigues de Melo,Mahdi Nazemi,Makesh Narsimhan Sreedhar,Manoj Kilaru,Maor Ashkenazi,Marc Romeijn,Marcin Chochowski,Mark Cai,Markus Kliegl,Maryam Moosaei,Matt Kulka,Matvei Novikov,Mehrzad Samadi,Melissa Corpuz,Mengru Wang,Meredith Price,Michael Andersch,Michael Boone,Michael Evans,Miguel Martinez,Mikail Khona,Mike Chrzanowski,Minseok Lee,Mohammad Dabbah,Mohammad Shoeybi,Mostofa Patwary,Nabin Mulepati,Najeeb Nabwani,Natalie Hereth,Nave Assaf,Negar Habibi,Neta Zmora,Netanel Haber,Nicola Sessions,Nidhi Bhatia,Nikhil Jukar,Nikki Pope,Nikolai Ludwig,Nima Tajbakhsh,Nir Ailon,Nirmal Juluru,Nishant Sharma,Oleksii Hrinchuk,Oleksii Kuchaiev,Olivier Delalleau,Oluwatobi Olabiyi,Omer Ullman Argov,Omri Puny,Oren Tropp,Ouye Xie,Parth Chadha,Pasha Shamis,Paul Gibbons,Pavlo Molchanov,Pawel Morkisz,Peter Dykas,Peter Jin,Pinky Xu,Piotr Januszewski,Pranav Prashant Thombre,Prasoon Varshney,Pritam Gundecha,Przemek Tredak,Qing Miao,Qiyu Wan,Rabeeh Karimi Mahabadi,Rachit Garg,Ran El-Yaniv,Ran Zilberstein,Rasoul Shafipour,Rich Harang,Rick Izzo,Rima Shahbazyan,Rishabh Garg,Ritika Borkar,Ritu Gala,Riyad Islam,Robert Hesse,Roger Waleffe,Rohit Watve,Roi Koren,Ruoxi Zhang,Russell Hewett,Russell J. 
Hewett,Ryan Prenger,Ryan Timbrook,Sadegh Mahdavi,Sahil Modi,Samuel Kriman,Sangkug Lim,Sanjay Kariyappa,Sanjeev Satheesh,Saori Kaji,Satish Pasumarthi,Saurav Muralidharan,Sean Narentharen,Sean Narenthiran,Seonmyeong Bak,Sergey Kashirsky,Seth Poulos,Shahar Mor,Shanmugam Ramasamy,Shantanu Acharya,Shaona Ghosh,Sharath Turuvekere Sreenivas,Shelby Thomas,Shiqing Fan,Shreya Gopal,Shrimai Prabhumoye,Shubham Pachori,Shubham Toshniwal,Shuoyang Ding,Siddharth Singh,Simeng Sun,Smita Ithape,Somshubra Majumdar,Soumye Singhal,Stas Sergienko,Stefania Alborghetti,Stephen Ge,Sugam Dipak Devare,Sumeet Kumar Barua,Suseella Panguluri,Suyog Gupta,Sweta Priyadarshi,Syeda Nahida Akter,Tan Bui,Teodor-Dumitru Ene,Terry Kong,Thanh Do,Tijmen Blankevoort,Tim Moon,Tom Balough,Tomer Asida,Tomer Bar Natan,Tomer Ronen,Tugrul Konuk,Twinkle Vashishth,Udi Karpas,Ushnish De,Vahid Noorozi,Vahid Noroozi,Venkat Srinivasan,Venmugil Elango,Victor Cui,Vijay Korthikanti,Vinay Rao,Vitaly Kurin,Vitaly Lavrukhin,Vladimir Anisimov,Wanli Jiang,Wasi Uddin Ahmad,Wei Du,Wei Ping,Wenfei Zhou,Will Jennings,William Zhang,Wojciech Prazuch,Xiaowei Ren,Yashaswi Karnati,Yejin Choi,Yev Meyer,Yi-Fu Wu,Yian Zhang,Yigong Qin,Ying Lin,Yonatan Geifman,Yonggan Fu,Yoshi Subara,Yoshi Suhara,Yubo Gao,Zach Moshe,Zhen Dong,Zhongbo Zhu,Zihan Liu,Zijia Chen,Zijie Yan

Main category: cs.CL

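TL;DR: The Nemotron 3 family comprises Nano, Super, and Ultra. The models use a hybrid Mamba-Transformer architecture, support contexts of up to 1M tokens, and offer strong reasoning and conversational abilities. Nano is efficient and accurate, Super targets collaborative agents and high-volume workloads, and Ultra delivers state-of-the-art performance. Super and Ultra are trained with NVFP4 and incorporate LatentMoE to improve quality; all models are post-trained with multi-environment reinforcement learning and support multi-step tool use and reasoning budget control. Model weights and related resources will be openly released.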

Details Motivation: To develop high-performance, high-efficiency large language models that support complex agentic tasks, long-context processing, and low-cost inference across diverse application scenarios. Method: The models use a Mixture-of-Experts hybrid Mamba-Transformer architecture. Super and Ultra are trained with NVFP4, incorporate LatentMoE to improve model quality, and include MTP layers for faster text generation. All models are post-trained with multi-environment reinforcement learning and support multi-step tool use and reasoning budget control. Result: The Nemotron 3 family performs strongly in throughput, context length (up to 1M tokens), and reasoning. Nano leads comparable small models in accuracy at low cost, Super suits high-volume agentic workloads, and Ultra reaches state-of-the-art accuracy and reasoning performance. Conclusion: The Nemotron 3 family balances architecture design, training methods, and application optimization, combining performance, efficiency, and scalability; the models and related resources will be openly released to advance agentic AI systems. Abstract: We introduce the Nemotron 3 family of models: Nano, Super, and Ultra. These models deliver strong agentic, reasoning, and conversational capabilities. The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens. Super and Ultra models are trained with NVFP4 and incorporate LatentMoE, a novel approach that improves model quality. The two larger models also include MTP layers for faster text generation. All Nemotron 3 models are post-trained using multi-environment reinforcement learning, enabling reasoning, multi-step tool use, and granular reasoning budget control. Nano, the smallest model, outperforms comparable models in accuracy while remaining extremely cost-efficient for inference. Super is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Ultra, the largest model, provides state-of-the-art accuracy and reasoning performance. Nano is released together with its technical report and this white paper, while Super and Ultra will follow in the coming months. We will openly release the model weights, pre- and post-training software, recipes, and all data for which we hold redistribution rights.

[14] Architectural Trade-offs in Small Language Models Under Compute Constraints

Shivraj Singh Bhatti

Main category: cs.CL

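TL;DR: This study systematically examines how architectural choices and training budget affect the performance of small language models under strict compute constraints. It finds that attention-based models are more efficient than MLPs even at small scale, and that some techniques that succeed in large language models (such as RoPE) do not necessarily transfer to small models.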

Details Motivation: To explore how sound architectural design and training strategy can improve the performance and efficiency of small language models when compute is limited. Method: Starting from a linear next-token predictor, the study progressively introduces nonlinearities, self-attention, and multi-layer Transformer architectures, performs character-level modeling on Tiny Shakespeare and word-level modeling on PTB and WikiText-2, and evaluates models by test negative log-likelihood, parameter count, and training FLOPs. Result: Attention-based models beat MLPs in per-FLOP efficiency; increasing depth or context length without sufficient optimization can degrade performance; techniques such as RoPE that work in large models are less effective in small ones. Conclusion: Small language models need targeted design rather than a direct copy of large-model architectures and techniques. Attention retains its advantage at small scale, but depth and positional encoding require careful tuning. Abstract: We present a systematic empirical study of small language models under strict compute constraints, analyzing how architectural choices and training budget interact to determine performance. Starting from a linear next-token predictor, we progressively introduce nonlinearities, self-attention, and multi-layer transformer architectures, evaluating each on character-level modeling of Tiny Shakespeare and word-level modeling of Penn Treebank (PTB) and WikiText-2. We compare models using test negative log-likelihood (NLL), parameter count, and approximate training FLOPs to characterize accuracy-efficiency trade-offs. Our results show that attention-based models dominate MLPs in per-FLOP efficiency even at small scale, while increasing depth or context without sufficient optimization can degrade performance. We further examine rotary positional embeddings (RoPE), finding that architectural techniques successful in large language models do not necessarily transfer to small-model regimes.

[15] Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation

Kaiyuan Liu,Shaotian Yan,Rui Miao,Bing Wang,Chen Shen,Jun Zhang,Jieping Ye

Main category: cs.CL

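TL;DR: This paper proposes a provenance-tracing framework for reasoning distillation that analyzes where each behavior of a distilled model originates, and improves reasoning distillation with a teacher-guided data selection method.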

Details Motivation: Existing reasoning distillation methods lack a detailed analysis of where the distilled model's capabilities come from. It is unclear whether the student truly inherits the teacher's behavior in novel test-time contexts or regresses to its original output patterns, so a traceable method is needed to understand how distillation generalizes. Method: A cross-model Reasoning Distillation Provenance Tracing framework compares the predictive probabilities that the teacher, the original student, and the distilled student assign under the same context, and classifies each generated action to identify its origin. Building on this, a teacher-guided data selection method filters training data by teacher-student divergence. Result: Experiments show that at test time the distilled model does generate teacher-originated actions, and that these actions correlate with its performance gains. The data selection method is effective across multiple teacher and student models and outperforms heuristic-based approaches. Conclusion: Reasoning distillation enables the student to inherit the teacher's reasoning behavior in novel contexts, and the provenance framework helps explain the distillation mechanism and offers principled guidance for improving distillation methods. Abstract: Reasoning distillation has attracted increasing attention. It typically leverages a large teacher model to generate reasoning paths, which are then used to fine-tune a student model so that it mimics the teacher's behavior in training contexts. However, previous approaches have lacked a detailed analysis of the origins of the distilled model's capabilities. It remains unclear whether the student can maintain consistent behaviors with the teacher in novel test-time contexts, or whether it regresses to its original output patterns, raising concerns about the generalization of distillation models. To analyse this question, we introduce a cross-model Reasoning Distillation Provenance Tracing framework. For each action (e.g., a sentence) produced by the distilled model, we obtain the predictive probabilities assigned by the teacher, the original student, and the distilled model under the same context. By comparing these probabilities, we classify each action into different categories. By systematically disentangling the provenance of each action, we experimentally demonstrate that, in test-time contexts, the distilled model can indeed generate teacher-originated actions, which correlate with and plausibly explain observed performance on the distilled model. Building on this analysis, we further propose a teacher-guided data selection method. Unlike prior approaches that rely on heuristics, our method directly compares teacher-student divergences on the training data, providing a principled selection criterion. We validate the effectiveness of our approach across multiple representative teacher models and diverse student models. The results highlight the utility of our provenance-tracing framework and underscore its promise for reasoning distillation. We hope to share Reasoning Distillation Provenance Tracing and our insights into reasoning distillation with the community.
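
The probability comparison described in the abstract can be illustrated with a small Python sketch. The abstract does not give the category names or the decision rule, so everything below is an assumption made for illustration: the function `classify_action`, the log-probability `margin`, and the four labels are hypothetical and are not the authors' definitions.

```python
def classify_action(lp_teacher, lp_student, lp_distilled, margin=1.0):
    """Label one generated action from three log-probabilities.

    Hypothetical rule (not from the paper): an action is attributed to
    whichever reference model assigns it a clearly higher log-probability,
    where "clearly" means a gap larger than `margin` nats.
    """
    # The distilled model likes the action far more than both references.
    if lp_distilled - max(lp_teacher, lp_student) > margin:
        return "new"
    # The teacher explains the action much better than the original student.
    if lp_teacher - lp_student > margin:
        return "teacher-originated"
    # The original student explains the action much better than the teacher.
    if lp_student - lp_teacher > margin:
        return "student-originated"
    # Both references find the action about equally likely.
    return "shared"


# Log-probabilities of the same sentence under the same context.
label = classify_action(lp_teacher=-1.0, lp_student=-5.0, lp_distilled=-1.2)
```

Counting how often each label occurs over a test set would give the kind of provenance breakdown the abstract refers to, under whatever categories the paper actually defines.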

[16] Foundation Model-based Evaluation of Neuropsychiatric Disorders: A Lifespan-Inclusive, Multi-Modal, and Multi-Lingual Study

Zhongren Dong,Haotian Guo,Weixiang Xu,Huan Zhao,Zixing Zhang

Main category: cs.CL

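TL;DR: FEND is a foundation model-based multi-modal framework that integrates speech and text for detecting Alzheimer's disease, depression, and autism spectrum disorder across the lifespan and across languages. It systematically evaluates multi-modal fusion on 13 multi-lingual datasets and identifies key challenges such as modality imbalance and data heterogeneity.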

Details Motivation: Existing work falls short on multi-lingual generalization and unified evaluation frameworks, and lacks a systematic assessment of multi-modal fusion for neuropsychiatric disorders across the lifespan and across languages. Method: FEND integrates speech and text modalities and uses 13 multi-lingual datasets spanning English, Chinese, Greek, French, and Dutch to systematically evaluate multi-modal fusion for detecting Alzheimer's disease, depression, and autism spectrum disorder, complemented by cross-corpus experiments. Result: Multi-modal fusion excels in Alzheimer's disease and depression detection but underperforms in autism detection because of dataset heterogeneity. Modality imbalance is prevalent, with multi-modal fusion failing to surpass the best mono-modal models. Performance is robust in task- and language-consistent scenarios but degrades noticeably in multi-lingual and task-heterogeneous settings. Conclusion: FEND advances automated, lifespan-inclusive, multi-lingual neuropsychiatric disorder assessment by providing extensive benchmarks and an analysis of performance-influencing factors, and researchers are encouraged to adopt it for fair comparison and reproducible research. Abstract: Neuropsychiatric disorders, such as Alzheimer's disease (AD), depression, and autism spectrum disorder (ASD), are characterized by linguistic and acoustic abnormalities, offering potential biomarkers for early detection. Despite the promise of multi-modal approaches, challenges like multi-lingual generalization and the absence of a unified evaluation framework persist. To address these gaps, we propose FEND (Foundation model-based Evaluation of Neuropsychiatric Disorders), a comprehensive multi-modal framework integrating speech and text modalities for detecting AD, depression, and ASD across the lifespan. Leveraging 13 multi-lingual datasets spanning English, Chinese, Greek, French, and Dutch, we systematically evaluate multi-modal fusion performance. Our results show that multi-modal fusion excels in AD and depression detection but underperforms in ASD due to dataset heterogeneity. We also identify modality imbalance as a prevalent issue, where multi-modal fusion fails to surpass the best mono-modal models. Cross-corpus experiments reveal robust performance in task- and language-consistent scenarios but noticeable degradation in multi-lingual and task-heterogeneous settings. By providing extensive benchmarks and a detailed analysis of performance-influencing factors, FEND advances the field of automated, lifespan-inclusive, and multi-lingual neuropsychiatric disorder assessment. We encourage researchers to adopt the FEND framework for fair comparisons and reproducible research.

[17] Neural Probe-Based Hallucination Detection for Large Language Models

Shize Liang,Hongzhi Wang

Main category: cs.CL

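TL;DR: Proposes a nonlinear framework based on MLP probes for token-level hallucination detection, which freezes the LLM parameters and uses high-level hidden states to achieve efficient and accurate detection.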

Details Motivation: Existing hallucination detection methods based on uncertainty estimation and external knowledge retrieval still let erroneous content through at high confidence levels and depend heavily on retrieval efficiency and knowledge coverage, which falls short of the needs of high-risk domains. Method: The LLM parameters are frozen and lightweight MLP probes model high-level hidden states nonlinearly, trained with a multi-objective joint loss; Bayesian optimization searches for the optimal probe insertion layer. Result: On LongFact, HealthBench, and TriviaQA, MLP probes significantly outperform state-of-the-art methods in accuracy, recall, and detection capability under low false-positive conditions. Conclusion: The method achieves more stable, semantically clearer real-time hallucination detection and offers an effective solution for reliable generation in high-risk settings. Abstract: Large language models (LLMs) excel at text generation and knowledge question-answering tasks, but they are prone to generating hallucinated content, severely limiting their application in high-risk domains. Current hallucination detection methods based on uncertainty estimation and external knowledge retrieval suffer from the limitation that they still produce erroneous content at high confidence levels and rely heavily on retrieval efficiency and knowledge coverage. In contrast, probe methods that leverage the model's hidden-layer states offer real-time and lightweight advantages. However, traditional linear probes struggle to capture nonlinear structures in deep semantic spaces. To overcome these limitations, we propose a neural network-based framework for token-level hallucination detection. By freezing language model parameters, we employ lightweight MLP probes to perform nonlinear modeling of high-level hidden states. A multi-objective joint loss function is designed to enhance detection stability and semantic disambiguity. Additionally, we establish a layer position-probe performance response model, using Bayesian optimization to automatically search for optimal probe insertion layers and achieve superior training results. Experimental results on LongFact, HealthBench, and TriviaQA demonstrate that MLP probes significantly outperform state-of-the-art methods in accuracy, recall, and detection capability under low false-positive conditions.
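
As a rough illustration of what an MLP probe over a frozen hidden state computes, here is a minimal pure-Python forward pass. It is a sketch under stated assumptions: the one-hidden-layer shape, the ReLU nonlinearity, the sigmoid output, and all weight values are hypothetical, and the paper's multi-objective loss and Bayesian layer search are not shown.

```python
import math


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mlp_probe(hidden_state, w1, b1, w2, b2):
    """Return a hallucination probability for one token.

    hidden_state: the frozen LLM's hidden vector for that token.
    w1, b1:       first-layer weights and biases (hypothetical values).
    w2, b2:       output-layer weights and bias (hypothetical values).
    """
    pre_activation = [z + b for z, b in zip(matvec(w1, hidden_state), b1)]
    hidden = [max(0.0, z) for z in pre_activation]  # ReLU nonlinearity
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


# Toy 2-dimensional hidden state and hand-picked probe weights.
probability = mlp_probe(
    hidden_state=[1.0, -1.0],
    w1=[[1.0, 0.0], [0.0, 1.0]],
    b1=[0.0, 0.0],
    w2=[1.0, 1.0],
    b2=0.0,
)
```

Only the probe's weights would be trained; the language model that produces `hidden_state` stays frozen, which is what keeps this kind of detector lightweight.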

[18] MultiMind at SemEval-2025 Task 7: Crosslingual Fact-Checked Claim Retrieval via Multi-Source Alignment

Mohammad Mahdi Abootorabi,Alireza Ghahramani Kure,Mohammadali Mohammadkhani,Sina Elahimanesh,Mohammad Ali Ali Panah

Main category: cs.CL

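TL;DR: This paper proposes TriAligner, a novel dual-encoder model for multilingual and crosslingual fact-checked claim retrieval that combines contrastive learning, multi-modal alignment, and hard negative sampling on top of data augmentation and preprocessing, significantly improving retrieval performance.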

Details Motivation: Because misinformation spreads rapidly, effective multilingual fact-checking has become critical, and existing methods fall short in crosslingual alignment and representation learning. Method: A dual-encoder architecture with contrastive learning aligns native-language text and English translations across modalities, uses large language models for data augmentation, and introduces hard negative sampling to improve representation learning. Result: On monolingual and crosslingual benchmarks, the method significantly outperforms baselines in retrieval accuracy and fact-checking performance. Conclusion: By combining multilingual signals, data augmentation, and hard negative training, TriAligner effectively improves multilingual fact-checked claim retrieval and shows strong robustness and practical potential. Abstract: This paper presents our system for SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval. In an era where misinformation spreads rapidly, effective fact-checking is increasingly critical. We introduce TriAligner, a novel approach that leverages a dual-encoder architecture with contrastive learning and incorporates both native and English translations across different modalities. Our method effectively retrieves claims across multiple languages by learning the relative importance of different sources in alignment. To enhance robustness, we employ efficient data preprocessing and augmentation using large language models while incorporating hard negative sampling to improve representation learning. We evaluate our approach on monolingual and crosslingual benchmarks, demonstrating significant improvements in retrieval accuracy and fact-checking performance over baselines.

[19] Reflection Pretraining Enables Token-Level Self-Correction in Biological Sequence Models

Xiang Zhang,Jiaqi Wei,Yuejin Yang,Zijie Qiu,Yuhan Chen,Zhiqiang Gao,Muhammad Abdul-Mageed,Laks V. S. Lakshmanan,Wanli Ouyang,Chenyu You,Siqi Sun

Main category: cs.CL

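TL;DR: This paper introduces the concept of language expressiveness and argues that protein language models struggle to apply chain-of-thought (CoT) reasoning because of their limited token space. The authors introduce reflection pretraining and, for the first time in a biological sequence model, use auxiliary "thinking tokens" to strengthen the model's reasoning. Theoretical and experimental results show that the approach significantly improves language expressiveness and performance.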

Details Motivation: The token spaces of existing protein and RNA language models have limited expressiveness and cannot support the chain-of-thought reasoning used in natural language processing, so the reasoning capacity of biological sequence models needs to be improved. Method: The paper defines the concept of language expressiveness and introduces reflection pretraining with auxiliary "thinking tokens" to enable intermediate reasoning over biological sequences. Result: Theoretical analysis shows that the augmented token set increases language expressiveness; experiments show that the method teaches models to self-correct and substantially outperforms standard pretraining. Conclusion: By introducing thinking tokens, reflection pretraining effectively improves the reasoning ability and overall performance of biological sequence models and offers a viable path for applying CoT in non-natural-language domains. Abstract: Chain-of-Thought (CoT) prompting has significantly advanced task-solving capabilities in natural language processing with large language models. Unlike standard prompting, CoT encourages the model to generate intermediate reasoning steps, non-answer tokens, that help guide the model toward more accurate final outputs. These intermediate steps enable more complex reasoning processes such as error correction, memory management, future planning, and self-reflection. However, applying CoT to non-natural language domains, such as protein and RNA language models, is not yet possible, primarily due to the limited expressiveness of their token spaces (e.g., amino acid tokens). In this work, we propose and define the concept of language expressiveness: the ability of a given language, using its tokens and grammar, to encode information. We show that the limited expressiveness of protein language severely restricts the applicability of CoT-style reasoning. To overcome this, we introduce reflection pretraining, for the first time in a biological sequence model, which enables the model to engage in intermediate reasoning through the generation of auxiliary "thinking tokens" beyond simple answer tokens. Theoretically, we demonstrate that our augmented token set significantly enhances biological language expressiveness, thereby improving the overall reasoning capacity of the model. Experimentally, our pretraining approach teaches protein models to self-correct and leads to substantial performance gains compared to standard pretraining.

[20] Automatic Replication of LLM Mistakes in Medical Conversations

Oleksii Proniakin,Diego Fajardo,Ruslan Nazarenko,Razvan Marinescu

Main category: cs.CL

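TL;DR: This paper presents MedMistake, a pipeline that automatically builds a benchmark from the mistakes large language models make in medical conversations, including multi-dimensional evaluation and conversion to single-shot QA. It releases MedMistake-All, a dataset of 3,390 questions, and the doctor-validated subset MedMistake-Bench.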

Details Motivation: Replicating clinical-conversation mistakes across different large language models takes manual effort, so an automated way to extract and standardize these mistakes for evaluation is needed. Method: The MedMistake pipeline (1) generates complex conversations between an LLM patient and an LLM doctor, (2) evaluates them along multiple dimensions with a committee of two LLM judges, and (3) converts the identified mistakes into simplified single-shot QA pairs. Result: The resulting MedMistake-All dataset contains 3,390 single-shot QA pairs on which GPT-5 and Gemini 2.5 Pro currently fail. A subset of 211 expert-validated questions forms MedMistake-Bench, which is used to evaluate 12 frontier LLMs; GPT models, Claude, and Grok perform best. Conclusion: MedMistake can automatically extract and convert model mistakes in clinical conversations into a high-quality evaluation benchmark, supporting the development of safer and more reliable clinical LLMs. Abstract: Large language models (LLMs) are increasingly evaluated in clinical settings using multi-dimensional rubrics which quantify reasoning quality, safety, and patient-centeredness. Yet, replicating specific mistakes in other LLM models is not straightforward and often requires manual effort. We introduce MedMistake, an automatic pipeline that extracts mistakes LLMs make in patient-doctor conversations and converts them into a benchmark of single-shot QA pairs. Our pipeline (1) creates complex, conversational data between an LLM patient and LLM doctor, (2) runs an evaluation with a committee of 2 LLM judges across a variety of dimensions and (3) creates simplified single-shot QA scenarios from those mistakes. We release MedMistake-All, a dataset of 3,390 single-shot QA pairs where GPT-5 and Gemini 2.5 Pro are currently failing to answer correctly, as judged by two LLM judges. We used medical experts to validate a subset of 211/3390 questions (MedMistake-Bench), which we used to run a final evaluation of 12 frontier LLMs: Claude Opus 4.5, Claude Sonnet 4.5, DeepSeek-Chat, Gemini 2.5 Pro, Gemini 3 Pro, GPT-4o, GPT-5, GPT-5.1, GPT-5.2, Grok 4, Grok 4.1, Mistral Large. We found that GPT models, Claude and Grok obtained the best performance on MedMistake-Bench. We release both the doctor-validated benchmark (MedMistake-Bench), as well as the full dataset (MedMistake-All) at https://huggingface.co/datasets/TheLumos/MedicalMistakeBenchmark.

[21] Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation

Wei-Rui Chen,Vignesh Kothapalli,Ata Fatahibaarzi,Hejian Sang,Shao Tang,Qingquan Song,Zhipeng Wang,Muhammad Abdul-Mageed

Main category: cs.CL

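TL;DR: This paper studies reasoning distillation from a large language model into a smaller one and shows that supervising only the first 50% of tokens in each training sequence cuts compute cost substantially while retaining about 94% of full-sequence performance.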

Details Motivation: Reasoning distillation usually relies on long sequences, which makes training expensive, so more efficient supervision strategies are needed to reduce the compute burden. Method: The paper analyzes how allocating supervision across the prompt (P), chain-of-thought (CoT), and answer (A) segments affects the student, and establishes a truncation protocol that measures the effect of training on only the first part of each sequence. Result: On math benchmarks, training on only the first 50% of tokens of each training sequence retains, on average, about 94% of full-sequence performance while reducing training time, memory usage, and FLOPs by about 50% each. Conclusion: Reasoning distillation should prioritize early reasoning tokens, and selective sequence truncation offers a simple lever for compute-quality trade-offs. Abstract: Distilling the reasoning capabilities from a large language model (LLM) to a smaller student model often involves training on substantial amounts of reasoning data. However, distillation over lengthy sequences with prompt (P), chain-of-thought (CoT), and answer (A) segments makes the process computationally expensive. In this work, we investigate how the allocation of supervision across different segments (P, CoT, A) affects student performance. Our analysis shows that selective knowledge distillation over only the CoT tokens can be effective when the prompt and answer information is encompassed by it. Building on this insight, we establish a truncation protocol to quantify computation-quality tradeoffs as a function of sequence length. We observe that training on only the first $50\%$ of tokens of every training sequence can retain, on average, $\approx94\%$ of full-sequence performance on math benchmarks while reducing training time, memory usage, and FLOPs by about $50\%$ each. These findings suggest that reasoning distillation benefits from prioritizing early reasoning tokens and provides a simple lever for computation-quality tradeoffs. Codes are available at https://github.com/weiruichen01/distilling-the-essence.
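
The truncation step itself is simple enough to sketch in a few lines of Python. This is an illustration, not the authors' released code: the function name `truncate_sequence` and the choice to round the kept length up are assumptions.

```python
import math


def truncate_sequence(token_ids, keep_fraction=0.5):
    """Keep only the leading fraction of a tokenized training sequence.

    Rounding up (an assumption of this sketch) guarantees that a non-empty
    sequence always keeps at least one token.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    keep = math.ceil(len(token_ids) * keep_fraction)
    return token_ids[:keep]


# A 10-token sequence keeps its first 5 tokens at the 50% setting.
shortened = truncate_sequence(list(range(10)), keep_fraction=0.5)
```

Because the kept prefix is half as long, the number of tokens processed per training sequence is halved, which is consistent with the roughly 50% savings in training time, memory, and FLOPs reported in the abstract.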

[22] Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy

Xiaofeng Shi,Qian Kou,Yuduo Li,Hua Zhou

Main category: cs.CL

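TL;DR: Proposes SFTKey, a two-stage fine-tuning method whose second stage fine-tunes only the key answer portion, improving LLM accuracy on complex reasoning tasks by more than 5% on average.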

Details Motivation: In conventional supervised fine-tuning, the model may focus too much on lengthy chain-of-thought (CoT) sequences and neglect the shorter but critical answer portion, which hurts final task performance. Method: Training has two stages: the first applies conventional SFT to ensure a proper output format, and the second fine-tunes only the key answer portion to improve accuracy. Result: Experiments across multiple benchmarks and model families show that SFTKey improves average accuracy by more than 5% over conventional SFT while preserving correct output formats. Conclusion: By balancing CoT learning with optimization of the key answer, SFTKey effectively improves LLM performance on complex reasoning tasks. Abstract: With the rapid advancement of Large Language Models (LLMs), the Chain-of-Thought (CoT) component has become significant for complex reasoning tasks. However, in conventional Supervised Fine-Tuning (SFT), the model could allocate disproportionately more attention to CoT sequences with excessive length. This reduces focus on the much shorter but essential Key portion, the final answer, whose correctness directly determines task success and evaluation quality. To address this limitation, we propose SFTKey, a two-stage training scheme. In the first stage, conventional SFT is applied to ensure proper output format, while in the second stage, only the Key portion is fine-tuned to improve accuracy. Extensive experiments across multiple benchmarks and model families demonstrate that SFTKey achieves an average accuracy improvement exceeding 5\% over conventional SFT, while preserving the ability to generate correct formats. Overall, this study advances LLM fine-tuning by explicitly balancing CoT learning with additional optimization on answer-relevant tokens.
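
One common way to restrict fine-tuning to a subset of tokens is a loss mask, and the two stages can be sketched that way in Python. This is an assumption about the mechanism, not a description of the authors' implementation; `key_mask` and `masked_nll` are hypothetical helper names.

```python
def key_mask(num_cot_tokens, num_key_tokens):
    """Build a stage-2 mask: 0 for CoT tokens, 1 for the final-answer tokens."""
    return [0] * num_cot_tokens + [1] * num_key_tokens


def masked_nll(token_logprobs, loss_mask):
    """Average negative log-likelihood over the positions where the mask is 1."""
    selected = [-lp for lp, keep in zip(token_logprobs, loss_mask) if keep]
    if not selected:
        raise ValueError("loss mask selects no tokens")
    return sum(selected) / len(selected)


# Log-probabilities of the target tokens: three CoT tokens, then one answer token.
logprobs = [-0.1, -0.2, -0.3, -2.0]

stage1_loss = masked_nll(logprobs, [1, 1, 1, 1])      # conventional SFT: all tokens
stage2_loss = masked_nll(logprobs, key_mask(3, 1))    # Key portion only
```

In this toy case the badly predicted answer token is diluted by three easy CoT tokens in the stage-1 average, while the stage-2 loss reflects it directly, which is the imbalance the abstract describes.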

[23] Semantic Refinement with LLMs for Graph Representations

Safal Thapaliya,Zehong Wang,Jiazheng Li,Ziming Li,Yanfang Ye,Chuxu Zhang

Main category: cs.CL

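TL;DR: Proposes DAS, a data-adaptive semantic refinement framework that couples a fixed graph neural network with a large language model in a closed feedback loop, making node semantics task-adaptive for graph representation learning.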

Details Motivation: The source of predictive signal in graph-structured data varies substantially across domains, so models with a fixed inductive bias struggle to generalize optimally across diverse graphs. Most existing methods address this from the model side and do not adapt dynamically to the characteristics of the data itself. Method: The DAS framework couples a fixed GNN with a large language model in a closed feedback loop: the GNN provides implicit supervisory signals that guide the LLM's semantic refinement, and the refined semantics are fed back to update the graph learner, yielding data-driven semantic optimization. Result: The method is evaluated on both text-rich and text-free graphs, with consistent improvements on structure-dominated graphs and competitive performance on semantics-rich graphs. Conclusion: Semantic adaptation from a data-centric perspective effectively addresses the structure-semantics heterogeneity of graph data and demonstrates the potential of data-driven methods in graph representation learning. Abstract: Graph-structured data exhibit substantial heterogeneity in where their predictive signals originate: in some domains, node-level semantics dominate, while in others, structural patterns play a central role. This structure-semantics heterogeneity implies that no graph learning model with a fixed inductive bias can generalize optimally across diverse graph domains. However, most existing methods address this challenge from the model side by incrementally injecting new inductive biases, which remains fundamentally limited given the open-ended diversity of real-world graphs. In this work, we take a data-centric perspective and treat node semantics as a task-adaptive variable. We propose a Data-Adaptive Semantic Refinement framework DAS for graph representation learning, which couples a fixed graph neural network (GNN) and a large language model (LLM) in a closed feedback loop. The GNN provides implicit supervisory signals to guide the semantic refinement of LLM, and the refined semantics are fed back to update the same graph learner. We evaluate our approach on both text-rich and text-free graphs. Results show consistent improvements on structure-dominated graphs while remaining competitive on semantics-rich graphs, demonstrating the effectiveness of data-centric semantic adaptation under structure-semantics heterogeneity.

[24] Semi-Supervised Learning for Large Language Models Safety and Content Moderation

Eduard Stefan Dinuta,Iustin Sirbu,Traian Rebedea

Main category: cs.CL

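TL;DR: Proposes a semi-supervised approach to LLM safety that uses both labeled and unlabeled data to improve safety classification, and highlights the importance of task-specific augmentation.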

Details Motivation: Existing safety classifiers depend on large quantities of labeled data, which can be hard to acquire, prone to labeling errors, or partly synthetic. Method: Semi-supervised learning techniques combine labeled and unlabeled data to improve safety performance on both prompts and responses, using task-specific data augmentation. Result: Task-specific augmentations significantly outperform general-purpose augmentation techniques, with clear gains on both the prompt and response safety tasks. Conclusion: Semi-supervised learning combined with task-specific augmentation can effectively improve LLM safety and reduce the reliance on large-scale labeled data. Abstract: Safety for Large Language Models (LLMs) has been an ongoing research focus since their emergence and is even more relevant nowadays with the increasing capacity of those models. Currently, there are several guardrails in place for all public LLMs and multiple proposed datasets for training safety classifiers. However, training these safety classifiers relies on large quantities of labeled data, which can be problematic to acquire, prone to labeling errors, or often include synthetic data. To address these issues, we suggest a different approach: utilizing semi-supervised learning techniques, which leverage both labeled and unlabeled data, to improve the performance on the safety task. We analyze the improvements that these techniques can offer for both prompts given to Large Language Models and the responses to those requests. Moreover, since augmentation is the central part of semi-supervised algorithms, we demonstrate the importance of using task-specific augmentations, which significantly increase the performance when compared to general-purpose augmentation techniques.

[25] ClarifyMT-Bench: Benchmarking and Improving Multi-Turn Clarification for Conversational Large Language Models

Sichun Luo,Yi Huang,Mukai Li,Shichang Meng,Fengyuan Liu,Zefa Hu,Junlan Feng,Qi Liu

Main category: cs.CL

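TL;DR: This paper introduces ClarifyMT-Bench, a benchmark for evaluating the clarification behavior of large language models in multi-turn dialogue, and finds that existing LLMs answer prematurely and degrade as dialogues deepen. It then proposes ClarifyAgent, which decomposes the clarification process to improve how models handle ambiguity.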

Details Motivation: Existing clarification benchmarks mostly assume single-turn interactions or cooperative users and fail to reflect the complexity of real open-domain, multi-turn dialogue, so a more realistic evaluation framework is needed to study LLM clarification behavior. Method: ClarifyMT-Bench is built on a five-dimensional ambiguity taxonomy and six simulated user personas, with 6,120 multi-turn dialogues generated through a hybrid LLM-human pipeline. The proposed ClarifyAgent decomposes clarification into four steps: perception, forecasting, tracking, and planning. Result: Evaluation of 10 representative LLMs shows a consistent under-clarification bias and declining performance as the number of dialogue turns grows; ClarifyAgent substantially improves robustness across ambiguity conditions. Conclusion: ClarifyMT-Bench provides a reproducible foundation for studying when LLMs should ask, when they should answer, and how to handle ambiguity in real human interactions, and ClarifyAgent shows that structured agent design is an effective way to improve clarification behavior. Abstract: Large language models (LLMs) are increasingly deployed as conversational assistants in open-domain, multi-turn settings, where users often provide incomplete or ambiguous information. However, existing LLM-focused clarification benchmarks primarily assume single-turn interactions or cooperative users, limiting their ability to evaluate clarification behavior in realistic settings. We introduce ClarifyMT-Bench, a benchmark for multi-turn clarification grounded in a five-dimensional ambiguity taxonomy and a set of six behaviorally diverse simulated user personas. Through a hybrid LLM-human pipeline, we construct 6,120 multi-turn dialogues capturing diverse ambiguity sources and interaction patterns. Evaluating ten representative LLMs uncovers a consistent under-clarification bias: LLMs tend to answer prematurely, and performance degrades as dialogue depth increases. To mitigate this, we propose ClarifyAgent, an agentic approach that decomposes clarification into perception, forecasting, tracking, and planning, substantially improving robustness across ambiguity conditions. ClarifyMT-Bench establishes a reproducible foundation for studying when LLMs should ask, when they should answer, and how to navigate ambiguity in real-world human-LLM interactions.

[26] SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation

Mahi Luthra,Jiayi Shen,Maxime Poli,Angelo Ortiz,Yosuke Higuchi,Youssef Benchekroun,Martin Gleize,Charles-Eric Saint-James,Dongyan Lin,Phillip Rust,Angel Villar,Surya Parimi,Vanessa Stark,Rashel Moritz,Juan Pino,Yann LeCun,Emmanuel Dupoux

Main category: cs.CL

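TL;DR: This paper presents SpidR-Adapt, a meta-learning approach to low-resource speech representation learning. Using bi-level optimization and interleaved supervision, it adapts quickly to a new language with less than 1 hour of target-language data, improving data efficiency by more than 100x and improving phonemic discriminability and spoken language modeling.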

Details Motivation: Human infants acquire the basic units of a new language from only a few hundred hours of speech, whereas current self-supervised speech models need large amounts of data. The paper aims to narrow this efficiency gap and achieve more efficient, human-like low-resource language adaptation. Method: Low-resource speech representation learning is cast as a meta-learning problem. A multi-task adaptive pre-training (MAdaPT) protocol formulates adaptation as a bi-level optimization framework; a first-order bi-level optimization (FOBLO) algorithm reduces computation cost; and interleaved supervision, which alternates self-supervised and supervised objectives, provides a robust initialization that stabilizes meta-training. Result: When trained on less than 1 hour of target-language audio, SpidR-Adapt improves over in-domain language models on ABX, sWUGGY, sBLIMP, and tSC, with over 100x better data efficiency than standard training. Conclusion: SpidR-Adapt offers a practical, architecture-agnostic path toward biologically inspired, data-efficient speech representation learning, with good scalability and application potential. Abstract: Human infants, with only a few hundred hours of speech exposure, acquire basic units of new languages, highlighting a striking efficiency gap compared to the data-hungry self-supervised speech models. To address this gap, this paper introduces SpidR-Adapt for rapid adaptation to new languages using minimal unlabeled data. We cast such low-resource speech representation learning as a meta-learning problem and construct a multi-task adaptive pre-training (MAdaPT) protocol which formulates the adaptation process as a bi-level optimization framework. To enable scalable meta-training under this framework, we propose a novel heuristic solution, first-order bi-level optimization (FOBLO), avoiding heavy computation costs. Finally, we stabilize meta-training by using a robust initialization through interleaved supervision which alternates self-supervised and supervised objectives. Empirically, SpidR-Adapt achieves rapid gains in phonemic discriminability (ABX) and spoken language modeling (sWUGGY, sBLIMP, tSC), improving over in-domain language models after training on less than 1h of target-language audio, over $100\times$ more data-efficient than standard training. These findings highlight a practical, architecture-agnostic path toward biologically inspired, data-efficient representations. We open-source the training code and model checkpoints at https://github.com/facebookresearch/spidr-adapt.

[27] SMART SLM: Structured Memory and Reasoning Transformer, A Small Language Model for Accurate Document Assistance

Divij Dudeja,Mayukha Pal

Main category: cs.CL

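TL;DR: SMART is an efficient model for information extraction and reasoning over Engineering Manuals (EMs) that uses hierarchical, structured processing to significantly improve accuracy and reduce hallucinations.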

Details Motivation: Conventional models struggle with the complex, dense, multi-format information in engineering manuals, which leads to incorrect numeric answers and inefficient memorization of facts. Method: SMART has a three-part architecture: a syntax-aware fact extractor based on a Tree LSTM, a compact indexed memory (MANN) that stores 384-dimensional vectors, and a 6-layer Transformer that fuses retrieved facts into the generated response. It supports two inference modes, a fast path and a dynamic path. Result: SMART has only 45.51M parameters, 64% to 69% fewer than GPT-2 and BERT, achieves 21.3% higher accuracy than GPT-2, responds in under a second, and reduces hallucinations. Conclusion: Through structured memory and reasoning, SMART outperforms existing small Transformer models on engineering-manual documents while remaining efficient and accurate. Abstract: Users of Engineering Manuals (EMs) find them difficult to read because they are long and densely formatted, combining written documents, step-by-step procedures, and standard parameter lists for engineering equipment. Off-the-shelf transformers, especially compact ones, treat this material as a flat stream of tokens. This approach leads to confident but incorrect numeric answers and forces the models to memorize separate facts inefficiently. SMART (Structured Memory and Reasoning Transformer) offers a different and practical solution to the above problem. SMART processes documents hierarchically through three main components: (1) a syntax-aware fact extractor (Grammarian), a Tree LSTM that extracts facts as subject-relation-object triples from EM sentences; (2) a compact indexed memory, a MANN (Memory-Augmented Neural Network), that indexes these subject-relation-object facts as 384-dimensional vectors associated with the source of the information; and (3) a 6-layer Transformer that learns to fuse the previously retrieved facts into its generated response. The entire SMART model utilizes 45.51M parameters, which is 64% less than GPT-2 (124M) and 69% less than BERT (133M), and it achieves a 21.3% higher accuracy than GPT-2, indicating that SMART fits the data better with the least amount of processing requirements. SMART employs two inference modes: an indexed fast path for known documents (sub-second answer times) and an indexed dynamic path assisted by RAG for new uploads (FAISS top-20 results with memory severed at 64 slots). In real-world deployment, this framework yields better-supported results with fewer hallucinations than comparable small transformer models.

[28] Parallel Token Prediction for Language Models

Felix Draxler,Justus Will,Farrin Marouf Sofian,Theofanis Karaletsos,Sameer Singh,Stephan Mandt

Main category: cs.CL

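TL;DR: Proposes Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models that jointly predicts multiple dependent tokens in a single Transformer call, substantially reducing the latency bottleneck of autoregressive decoding.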

Details Motivation: To address the high latency of generating tokens one at a time in autoregressive decoding, and the restrictive independence assumptions common in existing multi-token prediction methods. Method: PTP incorporates the sampling procedure into the model so that multiple dependent tokens are predicted jointly in a single Transformer call. It is trained either by distilling an existing model or through inverse autoregressive training without a teacher. Result: PTP achieves state-of-the-art speculative decoding performance on Vicuna-7B, accepting over four tokens per step on Spec-Bench. Conclusion: The PTP framework is universal and shows that parallel generation of long sequences is feasible without loss of modeling power. Abstract: We propose Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models. PTP jointly predicts multiple dependent tokens in a single transformer call by incorporating the sampling procedure into the model. This reduces the latency bottleneck of autoregressive decoding, and avoids the restrictive independence assumptions common in existing multi-token prediction methods. We prove that PTP can represent arbitrary autoregressive sequence distributions. PTP is trained either by distilling an existing model or through inverse autoregressive training without a teacher. Experimentally, we achieve state-of-the-art speculative decoding performance on Vicuna-7B by accepting over four tokens per step on Spec-Bench. The universality of our framework indicates that parallel generation of long sequences is feasible without loss of modeling power.

[29] Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks

Xinhe Wang,Jin Huang,Xingjian Zhang,Tianhao Wang,Jiaqi W. Ma

Main category: cs.CL

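TL;DR: This paper challenges the prevailing view by arguing that the performance gap on ARC-style reasoning benchmarks stems mainly from limitations in visual perception rather than from deficient machine reasoning, and supports the hypothesis with a two-stage experiment that separates perception from reasoning.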

Details Motivation: Benchmarks such as ARC are widely used to assess the fluid reasoning abilities of AI, yet frontier vision-language models perform poorly on them, which is usually attributed to reasoning deficiencies. This paper questions that explanation and examines the respective roles of perception and reasoning in the performance gap. Method: A two-stage experimental pipeline is proposed: the first stage independently converts each image into a natural-language description (perception), and the second stage induces and applies rules from those descriptions (reasoning), isolating the two processes and preventing leakage of cross-image signals. The pipeline is evaluated on Mini-ARC, ACRE, and Bongard-LOGO, and the reasoning traces of the VLMs are analyzed. Result: The comparison between the two-stage pipeline and standard one-stage evaluation shows that perception is the dominant factor in the performance gap, and about 80% of failures stem from perception errors rather than reasoning errors. Conclusion: ARC-style benchmarks conflate perceptual and reasoning challenges, and the observed performance gap may overstate deficiencies in machine reasoning; future evaluation protocols should disentangle perception from reasoning to measure progress in machine intelligence more accurately. Abstract: Reasoning benchmarks such as the Abstraction and Reasoning Corpus (ARC) and ARC-AGI are widely used to assess progress in artificial intelligence and are often interpreted as probes of core, so-called "fluid" reasoning abilities. Despite their apparent simplicity for humans, these tasks remain challenging for frontier vision-language models (VLMs), a gap commonly attributed to deficiencies in machine reasoning. We challenge this interpretation and hypothesize that the gap arises primarily from limitations in visual perception rather than from shortcomings in inductive reasoning. To verify this hypothesis, we introduce a two-stage experimental pipeline that explicitly separates perception and reasoning. In the perception stage, each image is independently converted into a natural-language description, while in the reasoning stage a model induces and applies rules using these descriptions. This design prevents leakage of cross-image inductive signals and isolates reasoning from perception bottlenecks. Across three ARC-style datasets, Mini-ARC, ACRE, and Bongard-LOGO, we show that the perception capability is the dominant factor underlying the observed performance gap by comparing the two-stage pipeline against standard end-to-end one-stage evaluation. Manual inspection of reasoning traces in the VLM outputs further reveals that approximately 80 percent of model failures stem from perception errors. Together, these results demonstrate that ARC-style benchmarks conflate perceptual and reasoning challenges and that observed performance gaps may overstate deficiencies in machine reasoning. Our findings underscore the need for evaluation protocols that disentangle perception from reasoning when assessing progress in machine intelligence.

[30] C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling

Jin Qin,Zihan Liao,Ziyin Zhang,Hang Yu,Peng Di,Rui Wang

Main category: cs.CL

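TL;DR: C2LLM is a family of code embedding models built on Qwen-2.5-Coder that uses a Pooling by Multihead Attention (PMA) module to produce sequence embeddings, achieving leading results among similarly sized models on the MTEB-Code benchmark.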

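Details Motivation: Conventional EOS-based sequence embeddings suffer from an information bottleneck and make poor use of the causal representations that LLMs acquire during pretraining, which limits the quality of code embeddings. Method: C2LLM models are built on Qwen-2.5-Coder at 0.5B and 7B sizes and add a Pooling by Multihead Attention (PMA) module that generates a sequence embedding from token embeddings, making use of the LLM's causal representations, aggregating information from the whole sequence, and supporting flexible adjustment of the embedding dimension. Result: After training on three million publicly available data, C2LLM sets new records on MTEB-Code among models of similar sizes, with C2LLM-7B ranking first on the overall leaderboard. Conclusion: The PMA module overcomes the information bottleneck of conventional methods and improves code embedding quality, and C2LLM's strong results across evaluation tasks support its potential as an efficient code embedding model. Abstract: We present C2LLM - Contrastive Code Large Language Models, a family of code embedding models in both 0.5B and 7B sizes. Building upon Qwen-2.5-Coder backbones, C2LLM adopts a Pooling by Multihead Attention (PMA) module for generating sequence embedding from token embeddings, effectively 1) utilizing the LLM's causal representations acquired during pretraining, while also 2) being able to aggregate information from all tokens in the sequence, breaking the information bottleneck in EOS-based sequence embeddings, and 3) supporting flexible adaptation of embedding dimension, serving as an alternative to MRL. Trained on three million publicly available data, C2LLM models set new records on MTEB-Code among models of similar sizes, with C2LLM-7B ranking 1st on the overall leaderboard.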
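
The core idea of attention pooling, a learned query that attends over all token embeddings instead of reading off a single EOS vector, can be sketched in pure Python. This is a deliberately simplified illustration: it uses a single head and no key, value, or output projections, whereas a full PMA module is multi-head with learned projections. The function name `attention_pool` and the toy values are assumptions.

```python
import math


def attention_pool(token_embeddings, query):
    """Pool token embeddings into one vector with scaled dot-product attention.

    token_embeddings: list of d-dimensional vectors, one per token.
    query:            a d-dimensional query vector (learned in a real model).
    Returns (pooled_vector, attention_weights).
    """
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, token)) / math.sqrt(dim)
        for token in token_embeddings
    ]
    # Numerically stable softmax over the token positions.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * token[i] for w, token in zip(weights, token_embeddings))
        for i in range(dim)
    ]
    return pooled, weights


# With a zero query every token gets equal weight, so pooling is a plain mean.
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], query=[0.0, 0.0])
```

Every token contributes to the pooled vector in proportion to its attention weight, which is how this kind of pooling avoids routing all sequence information through one final position.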

[31] Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty

Ziyu Chen,Xinbei Jiang,Peng Sun,Tao Lin

Main category: cs.CL

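TL;DR: This paper is the first to formalize how decoding order affects generation quality in masked diffusion models (MDMs). It proposes Denoising Entropy as a measure of predictive uncertainty along a generative path and designs two algorithms based on it that optimize the decoding path and significantly improve generation quality.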

Details Motivation: Masked diffusion models offer flexible non-autoregressive generation, but their output quality is highly sensitive to decoding order, and there has been no mechanism for understanding or controlling how uncertainty evolves during generation. Method: Denoising Entropy is proposed as a computable measure of cumulative predictive uncertainty, and two methods built on it optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Result: On multiple reasoning, planning, and code generation benchmarks, the entropy-guided methods significantly improve generation accuracy. Conclusion: Denoising Entropy is a principled tool for understanding and controlling generation in MDMs, turning model uncertainty from a liability into an advantage for finding high-quality solutions. Abstract: Masked Diffusion Models (MDMs) offer flexible, non-autoregressive generation, but this freedom introduces a challenge: final output quality is highly sensitive to the decoding order. We are the first to formalize this issue, attributing the variability in output quality to the cumulative predictive uncertainty along a generative path. To quantify this uncertainty, we introduce Denoising Entropy, a computable metric that serves as an internal signal for evaluating the generative process. Leveraging this metric, we propose two algorithms designed to optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Experiments demonstrate that our entropy-guided methods significantly improve generation quality, consistently boosting accuracy on challenging reasoning, planning, and code benchmarks. Our work establishes Denoising Entropy as a principled tool for understanding and controlling generation, effectively turning the uncertainty in MDMs from a liability into a key advantage for discovering high-quality solutions.
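
The abstract does not give the formula for Denoising Entropy, so the following is only a sketch of the general idea under a stated assumption: the uncertainty of a decoding path is taken to be the sum of Shannon entropies of the model's predictive distributions at each step, and post-hoc selection keeps the candidate path with the lowest total. The functions `path_entropy` and `select_path` and the toy distributions are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def path_entropy(path):
    """Cumulative uncertainty of a decoding path.

    `path` is a list of steps; each step is a list of predictive
    distributions, one per position predicted at that step.
    Assumption: plain summation, which may differ from the paper's metric.
    """
    return sum(shannon_entropy(dist) for step in path for dist in step)


def select_path(paths):
    """Post-hoc selection: index of the candidate with the lowest entropy."""
    return min(range(len(paths)), key=lambda i: path_entropy(paths[i]))


# Two toy candidate paths over a two-symbol vocabulary.
confident = [[[0.9, 0.1], [0.8, 0.2]], [[0.95, 0.05]]]
uncertain = [[[0.5, 0.5], [0.6, 0.4]], [[0.5, 0.5]]]
best = select_path([uncertain, confident])
```

A real-time guidance variant would presumably use the same per-step quantity to choose which positions to unmask next, but the abstract gives too little detail to sketch that faithfully.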

cs.CV [Back]

[32] VL4Gaze: Unleashing Vision-Language Models for Gaze Following

Shijing Wang,Chaoqun Cui,Yaping Huang,Hyung Jin Chang,Yihua Cheng

Main category: cs.CV

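TL;DR: This paper presents VL4Gaze, the first large-scale benchmark for gaze understanding with vision-language models (VLMs). It contains 489K question-answer pairs across four gaze understanding tasks, and experiments show that task-specific supervision is essential for improving VLMs' gaze understanding.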

Details Motivation: Current vision-language models lack systematic evaluation and training for gaze understanding; although they perform well at scene understanding, it is unclear whether gaze understanding can emerge from general-purpose pre-training. Method: A large-scale dataset, VL4Gaze, is built with 124K images and 489K automatically generated question-answer pairs. It defines four complementary tasks (gaze object description, gaze direction description, gaze point location, and ambiguous question recognition), and a range of VLMs are evaluated under in-context learning and fine-tuning settings. Result: Existing VLMs struggle to infer gaze semantics and spatial location accurately without task-specific supervision, whereas training on VL4Gaze yields substantial and consistent gains on all tasks. Conclusion: Gaze understanding does not readily emerge from general-purpose vision-language pre-training and requires targeted multi-task supervision; VL4Gaze provides an important resource for future research. Abstract: Human gaze provides essential cues for interpreting attention, intention, and social interaction in visual scenes, yet gaze understanding remains largely unexplored in current vision-language models (VLMs). While recent VLMs achieve strong scene-level reasoning across a range of visual tasks, there exists no benchmark that systematically evaluates or trains them for gaze interpretation, leaving open the question of whether gaze understanding can emerge from general-purpose vision-language pre-training. To address this gap, we introduce VL4Gaze, the first large-scale benchmark designed to investigate, evaluate, and unlock the potential of VLMs for gaze understanding. VL4Gaze contains 489K automatically generated question-answer pairs across 124K images and formulates gaze understanding as a unified VQA problem through four complementary tasks: (1) gaze object description, (2) gaze direction description, (3) gaze point location, and (4) ambiguous question recognition. We comprehensively evaluate both commercial and open-source VLMs under in-context learning and fine-tuning settings. The results show that even large-scale VLMs struggle to reliably infer gaze semantics and spatial localization without task-specific supervision. In contrast, training on VL4Gaze brings substantial and consistent improvements across all tasks, highlighting the importance of targeted multi-task supervision for developing gaze understanding capabilities in VLMs. We will release the dataset and code to support further research and development in this direction.

[33] TrashDet: Iterative Neural Architecture Search for Efficient Waste Detection

Tony Tran,Bin Hu

Main category: cs.CV

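TL;DR: This paper proposes a hardware-aware neural architecture search method for trash detection under TinyML constraints and builds the deployable TrashDet family of detectors on the TACO dataset, which clearly outperform existing methods in accuracy, energy, and latency.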

Details Motivation: In resource-constrained settings such as edge and IoT devices, existing trash detection models struggle to balance accuracy against compute cost, and efficient detectors designed specifically for TinyML are lacking. Method: A Once-for-All-style ResDets supernet is combined with an iterative evolutionary search that alternates between backbone and neck/head optimization, with a population passthrough mechanism and an accuracy predictor to lower search cost and improve stability. Result: On a five-class TACO subset, TrashDet-l reaches 19.5 mAP50 with 30.5M parameters, up to 3.6 mAP50 higher than prior methods. On the MAX78002 microcontroller, TrashDet-ResNet and TrashDet-MBNet together reduce energy by up to 88% and latency by up to 78% relative to existing TinyML detectors, and TrashDet-MBNet improves mAP50 by 10.2%. Conclusion: The framework produces efficient trash detection models for different deployment budgets and achieves a strong balance between accuracy and efficiency on TinyML devices, with good prospects for practical use. Abstract: This paper addresses trash detection on the TACO dataset under strict TinyML constraints using an iterative hardware-aware neural architecture search framework targeting edge and IoT devices. The proposed method constructs a Once-for-All-style ResDets supernet and performs iterative evolutionary search that alternates between backbone and neck/head optimization, supported by a population passthrough mechanism and an accuracy predictor to reduce search cost and improve stability. This framework yields a family of deployment-ready detectors, termed TrashDets. On a five-class TACO subset (paper, plastic, bottle, can, cigarette), the strongest variant, TrashDet-l, achieves 19.5 mAP50 with 30.5M parameters, improving accuracy by up to 3.6 mAP50 over prior detectors while using substantially fewer parameters. The TrashDet family spans 1.2M to 30.5M parameters with mAP50 values between 11.4 and 19.5, providing scalable detector options for diverse TinyML deployment budgets on resource-constrained hardware. On the MAX78002 microcontroller with the TrashNet dataset, two specialized variants, TrashDet-ResNet and TrashDet-MBNet, jointly dominate the ai87-fpndetector baseline, with TrashDet-ResNet achieving 7525 μJ energy per inference at 26.7 ms latency and 37.45 FPS, and TrashDet-MBNet improving mAP50 by 10.2%; together they reduce energy consumption by up to 88%, latency by up to 78%, and average power by up to 53% compared to existing TinyML detectors.

[34] OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective

Markus Gross,Sai B. Matha,Aya Fahmy,Rui Song,Daniel Cremers,Henri Meess

Main category: cs.CV

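TL;DR: This paper presents OccuFly, the first real-world, camera-based aerial semantic scene completion (SSC) benchmark, covering 3D perception of urban, industrial, and rural scenes by drones across seasons and altitudes. It also proposes a LiDAR-free data generation framework that substantially reduces 3D annotation effort.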

Details Motivation: Existing SSC research focuses on ground-level scenes and relies on LiDAR, while aerial scenes are under-studied. LiDAR is constrained on drones by regulations, weight, energy consumption, and point-cloud sparsity, so a camera-based SSC solution suited to UAVs is needed. Method: The OccuFly benchmark dataset is captured by camera at different altitudes (30 to 50 m) in all four seasons. Traditional 3D reconstruction is used to lift a subset of 2D semantic masks into the reconstructed point cloud, automating label transfer and producing dense voxelized semantic annotations; the camera modality matches common UAV configurations. Result: The authors release an aerial SSC dataset with 22 semantic classes covering varied environments and seasons, provide a LiDAR-free data generation pipeline that greatly reduces manual 3D annotation cost, benchmark current state-of-the-art methods, and expose challenges specific to elevated viewpoints. Conclusion: OccuFly provides the first practical aerial semantic scene completion benchmark for UAV platforms, advances camera-based aerial 3D scene understanding, and lays groundwork for downstream autonomous flight applications. Abstract: Semantic Scene Completion (SSC) is crucial for 3D perception in mobile robotics, as it enables holistic scene understanding by jointly estimating dense volumetric occupancy and per-voxel semantics. Although SSC has been widely studied in terrestrial domains such as autonomous driving, aerial scenarios like autonomous flying remain largely unexplored, thereby limiting progress on downstream applications. Furthermore, LiDAR sensors represent the primary modality for SSC data generation, which poses challenges for most uncrewed aerial vehicles (UAVs) due to flight regulations, mass and energy constraints, and the sparsity of LiDAR-based point clouds from elevated viewpoints. To address these limitations, we introduce OccuFly, the first real-world, camera-based aerial SSC benchmark, captured at altitudes of 50m, 40m, and 30m during spring, summer, fall, and winter. OccuFly covers urban, industrial, and rural scenarios, provides 22 semantic classes, and the data format adheres to established conventions to facilitate seamless integration with existing research. Crucially, we propose a LiDAR-free data generation framework based on camera modality, which is ubiquitous on modern UAVs. By utilizing traditional 3D reconstruction, our framework automates label transfer by lifting a subset of annotated 2D masks into the reconstructed point cloud, thereby substantially minimizing manual 3D annotation effort.
Finally, we benchmark the state-of-the-art on OccuFly and highlight challenges specific to elevated viewpoints, yielding a comprehensive vision benchmark for holistic aerial 3D scene understanding.

[35] NULLBUS: Multimodal Mixed-Supervision for Breast Ultrasound Segmentation via Nullable Global-Local Prompts

Raja Mallina,Bryar Shareef

Main category: cs.CV

TL;DR: The NullBUS framework uses nullable prompts to unify multimodal breast ultrasound segmentation with and without text prompts, achieving state-of-the-art performance.

Details Motivation: Existing breast ultrasound datasets often lack reliable text metadata, which limits the training and robustness of prompt-based segmentation methods. Method: The NullBUS framework introduces nullable prompts with presence masks and combines image and text information for mixed-supervision learning, handling data with and without prompts in a single model. Result: On a unified test pool of three public BUS datasets, mean IoU reaches 0.8568 and mean Dice reaches 0.9103, outperforming existing methods. Conclusion: NullBUS makes effective use of incomplete multimodal data, improving the applicability and performance of breast ultrasound segmentation in real-world settings. Abstract: Breast ultrasound (BUS) segmentation provides lesion boundaries essential for computer-aided diagnosis and treatment planning. While promptable methods can improve segmentation performance and tumor delineation when text or spatial prompts are available, many public BUS datasets lack reliable metadata or reports, constraining training to small multimodal subsets and reducing robustness. We propose NullBUS, a multimodal mixed-supervision framework that learns from images with and without prompts in a single model. To handle missing text, we introduce nullable prompts, implemented as learnable null embeddings with presence masks, enabling fallback to image-only evidence when metadata are absent and the use of text when present. Evaluated on a unified pool of three public BUS datasets, NullBUS achieves a mean IoU of 0.8568 and a mean Dice of 0.9103, demonstrating state-of-the-art performance under mixed prompt availability.
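
The nullable-prompt mechanism can be sketched as follows, assuming only what the abstract states: a missing text prompt is replaced by a shared null embedding, and a presence mask records which samples carried real text. The helper names and the fixed `NULL_EMBEDDING` values are hypothetical; in a trained model the null embedding would be a learnable parameter, and how NullBUS consumes the mask downstream is not described in the abstract.

```python
def resolve_prompt(text_embedding, null_embedding):
    """Return (embedding, presence) for one sample.

    presence is 1.0 when real text is available and 0.0 when the
    null embedding is substituted for a missing prompt.
    """
    if text_embedding is None:
        return list(null_embedding), 0.0
    return list(text_embedding), 1.0


def batch_prompts(text_embeddings, null_embedding):
    """Resolve a mixed batch so every sample has a prompt vector and a mask bit."""
    resolved = [resolve_prompt(t, null_embedding) for t in text_embeddings]
    embeddings = [emb for emb, _ in resolved]
    presence = [flag for _, flag in resolved]
    return embeddings, presence


# Hypothetical values; a real model would learn this vector during training.
NULL_EMBEDDING = [0.01, -0.02, 0.03]

# A batch where the second image has no report or metadata.
batch = [[0.5, 0.1, -0.3], None, [0.2, 0.2, 0.2]]
embeddings, presence = batch_prompts(batch, NULL_EMBEDDING)
```

Because every sample ends up with a vector of the same shape, images with and without text can share one forward pass, which is what lets a single model train on the full mixed pool.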

[36] Learning to Sense for Driving: Joint Optics-Sensor-Model Co-Design for Semantic Segmentation

Reeshad Khan,John Gauch

Main category: cs.CV

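TL;DR: An end-to-end framework jointly optimizes optics, sensor, and network for autonomous driving perception. Using learnable optical elements and RAW-domain processing, it achieves higher accuracy and robustness on semantic segmentation while keeping the model lightweight enough for edge deployment.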

Details Motivation: Conventional autonomous driving perception systems separate camera design from downstream perception. Fixed optics and handcrafted ISP pipelines prioritize human viewing over machine perception, which discards information and forces models to adapt to sensor artifacts. Method: An end-to-end differentiable framework from RAW images to semantic segmentation is built, integrating realistic cellphone-scale lens models (based on DeepLens), a learnable color filter array (CFA), Poisson-Gaussian noise modeling, and quantization, and jointly optimizing the optics, the sensor, and a lightweight segmentation network. Result: On KITTI-360, mIoU improves consistently over fixed imaging pipelines, especially for thin objects and low-light-sensitive classes. The resulting compact model has about 1M parameters and runs at about 28 FPS, making it suitable for edge deployment. Visual analysis shows that the co-design adapts image acquisition to semantic structure, sharpening boundaries and maintaining accuracy under blur, noise, and low bit depth. Conclusion: Full-stack co-optimization of optics, sensors, and neural networks is a promising path toward efficient, reliable, and deployable autonomous driving perception. Abstract: Traditional autonomous driving pipelines decouple camera design from downstream perception, relying on fixed optics and handcrafted ISPs that prioritize human-viewable imagery rather than machine semantics. This separation discards information during demosaicing, denoising, or quantization, while forcing models to adapt to sensor artifacts. We present a task-driven co-design framework that unifies optics, sensor modeling, and lightweight semantic segmentation networks into a single end-to-end RAW-to-task pipeline. Building on DeepLens [19], our system integrates realistic cellphone-scale lens models, learnable color filter arrays, Poisson-Gaussian noise processes, and quantization, all optimized directly for segmentation objectives. Evaluations on KITTI-360 show consistent mIoU improvements over fixed pipelines, with optics modeling and CFA learning providing the largest gains, especially for thin or low-light-sensitive classes. Importantly, these robustness gains are achieved with a compact ~1M-parameter model running at ~28 FPS, demonstrating edge deployability. Visual and quantitative analyses further highlight how co-designed sensors adapt acquisition to semantic structure, sharpening boundaries and maintaining accuracy under blur, noise, and low bit-depth. Together, these findings establish full-stack co-optimization of optics, sensors, and networks as a principled path toward efficient, reliable, and deployable perception in autonomous systems.

[37] CHAMMI-75: pre-training multi-channel models with heterogeneous microscopy images

Vidit Agrawal,John Peters,Tyler N. Thompson,Mohammad Vali Sanian,Chau Pham,Nikita Moshkov,Arshad Kazi,Aditya Pillai,Jack Freeman,Byunguk Kang,Samouil L. Farhi,Ernest Fraenkel,Ron Stewart,Lassi Paavolainen,Bryan A. Plummer,Juan C. Caicedo

Main category: cs.CV

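TL;DR: This paper introduces CHAMMI-75, an open dataset of multi-channel microscopy images from 75 different biological studies, intended for developing cellular morphology models that can be reused across experiments and imaging types.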

Details Motivation: Existing machine learning models for cellular morphology typically depend on a specific imaging type, generalize poorly, and are hard to transfer across experimental conditions or channel configurations. Method: The CHAMMI-75 dataset is curated from publicly available sources; it contains heterogeneous multi-channel images from multiple microscopy modalities and is used to train channel-adaptive cellular morphology models. Result: Experiments show that models trained with CHAMMI-75 perform better on multi-channel bioimaging tasks, mainly because of the dataset's high diversity. Conclusion: CHAMMI-75 provides a foundation for developing the next generation of general-purpose cellular morphology models for a wide range of biological studies. Abstract: Quantifying cell morphology using images and machine learning has proven to be a powerful tool to study the response of cells to treatments. However, models used to quantify cellular morphology are typically trained with a single microscopy imaging type. This results in specialized models that cannot be reused across biological studies because the technical specifications do not match (e.g., different number of channels), or because the target experimental conditions are out of distribution. Here, we present CHAMMI-75, an open access dataset of heterogeneous, multi-channel microscopy images from 75 diverse biological studies. We curated this resource from publicly available sources to investigate cellular morphology models that are channel-adaptive and can process any microscopy image type. Our experiments show that training with CHAMMI-75 can improve performance in multi-channel bioimaging tasks primarily because of its high diversity in microscopy modalities. This work paves the way to create the next generation of cellular morphology models for biological studies.

[38] Input-Adaptive Visual Preprocessing for Efficient Fast Vision-Language Model Inference

Putu Indah Githa Cahyani,Komang David Dananjaya Suartana,Novanto Yudistira

Main category: cs.CV

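TL;DR: This paper proposes a content-aware adaptive visual preprocessing method that improves the deployment-time inference efficiency of vision-language models such as FastVLM without modifying the model architecture or retraining.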

Details Motivation: Existing vision-language models incur high inference latency and compute cost on high-resolution images, and conventional static preprocessing still performs redundant computation on visually simple content. Method: An adaptive visual preprocessing method combines content-aware analysis, dynamic resolution selection, and content-aware cropping to adjust the resolution and spatial coverage of the input image and reduce visual redundancy, without modifying the FastVLM architecture or retraining. Result: In an inference-only evaluation on a subset of DocVQA, the method reduces per-image inference time by more than 50%, lowers mean full generation time, and consistently reduces the visual token count by more than 55%. Conclusion: Input-aware adaptive preprocessing is an effective, lightweight way to improve the practical deployment efficiency of vision-language models, with good reproducibility and application potential. Abstract: Vision-Language Models (VLMs) have demonstrated strong performance on multimodal reasoning tasks, but their deployment remains challenging due to high inference latency and computational cost, particularly when processing high-resolution visual inputs. While recent architectures such as FastVLM improve efficiency through optimized vision encoders, existing pipelines still rely on static visual preprocessing, leading to redundant computation for visually simple inputs. In this work, we propose an adaptive visual preprocessing method that dynamically adjusts input resolution and spatial coverage based on image content characteristics. The proposed approach combines content-aware image analysis, adaptive resolution selection, and content-aware cropping to reduce visual redundancy prior to vision encoding. Importantly, the method is integrated with FastVLM without modifying its architecture or requiring retraining. We evaluate the proposed method on a subset of the DocVQA dataset in an inference-only setting, focusing on efficiency-oriented metrics. Experimental results show that adaptive preprocessing reduces per-image inference time by over 50%, lowers mean full generation time, and achieves a consistent reduction of more than 55% in visual token count compared to the baseline pipeline. These findings demonstrate that input-aware preprocessing is an effective and lightweight strategy for improving deployment-oriented efficiency of vision-language models.
To facilitate reproducibility, our implementation is provided as a fork of the FastVLM repository, incorporating the files for the proposed method, and is available at https://github.com/kmdavidds/mlfastlm.
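
To make the three preprocessing steps named in the abstract concrete, here is a minimal sketch of one way they could fit together. It is not taken from the linked repository: the detail measure, the thresholds, the resolution tiers, and the white-background assumption are all hypothetical choices made for illustration, and images are plain nested lists of grayscale values in the range 0 to 255.

```python
def detail_score(image):
    """Content analysis: mean absolute neighbour difference, scaled to [0, 1]."""
    height, width = len(image), len(image[0])
    total, count = 0, 0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                total += abs(image[y][x] - image[y][x + 1])
                count += 1
            if y + 1 < height:
                total += abs(image[y][x] - image[y + 1][x])
                count += 1
    return total / (255 * count) if count else 0.0


def choose_resolution(score, tiers=((0.02, 256), (0.08, 512)), fallback=1024):
    """Resolution selection: simple images get a smaller target side length."""
    for threshold, side in tiers:
        if score < threshold:
            return side
    return fallback


def content_box(image, background=255, tolerance=10):
    """Content-aware crop: (left, top, right, bottom) box of non-background pixels."""
    rows = [y for y, row in enumerate(image)
            if any(abs(p - background) > tolerance for p in row)]
    cols = [x for x in range(len(image[0]))
            if any(abs(row[x] - background) > tolerance for row in image)]
    if not rows:
        return None  # blank page: nothing to crop to
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


blank = [[255] * 8 for _ in range(8)]
checker = [[255 if (x + y) % 2 else 0 for x in range(8)] for y in range(8)]
page = [[0 if 2 <= y <= 3 and 3 <= x <= 5 else 255 for x in range(8)]
        for y in range(8)]
```

Since the vision encoder's token count grows with input area, choosing a smaller side length or a tighter crop before encoding is where the reported token savings would come from in a scheme like this.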

[39] ALIVE: An Avatar-Lecture Interactive Video Engine with Content-Aware Retrieval for Real-Time Interaction

Md Zabirul Islam,Md Motaleb Hossen Manik,Ge Wang

Main category: cs.CV

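TL;DR: ALIVE is an interactive video engine that runs entirely on local hardware and turns conventional recorded lectures into a dynamic learning experience through neural avatars, content-aware retrieval, and real-time multimodal interaction.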

Details Motivation: Conventional recorded lectures offer no mechanism for real-time clarification, and existing interactive systems often depend on the cloud, lack awareness of lecture context, or fail to protect privacy. Method: The system combines ASR transcription, LLM refinement, neural avatar generation, content-aware retrieval based on semantic similarity and timestamp alignment, and lightweight embedding models with FAISS retrieval to deliver real-time responses locally. Result: The system is validated on a medical imaging course, showing high retrieval accuracy, low latency, and a good user experience. Conclusion: ALIVE shows that combining local, multimodal AI with content-aware retrieval can significantly increase the teaching value of recorded lectures and offers an extensible path toward next-generation interactive learning environments. Abstract: Traditional lecture videos offer flexibility but lack mechanisms for real-time clarification, forcing learners to search externally when confusion arises. Recent advances in large language models and neural avatars provide new opportunities for interactive learning, yet existing systems typically lack lecture awareness, rely on cloud-based services, or fail to integrate retrieval and avatar-delivered explanations in a unified, privacy-preserving pipeline. We present ALIVE, an Avatar-Lecture Interactive Video Engine that transforms passive lecture viewing into a dynamic, real-time learning experience. ALIVE operates fully on local hardware and integrates (1) Avatar-delivered lecture generated through ASR transcription, LLM refinement, and neural talking-head synthesis; (2) A content-aware retrieval mechanism that combines semantic similarity with timestamp alignment to surface contextually relevant lecture segments; and (3) Real-time multimodal interaction, enabling students to pause the lecture, ask questions through text or voice, and receive grounded explanations either as text or as avatar-delivered responses. To maintain responsiveness, ALIVE employs lightweight embedding models, FAISS-based retrieval, and segmented avatar synthesis with progressive preloading. We demonstrate the system on a complete medical imaging course, evaluate its retrieval accuracy, latency characteristics, and user experience, and show that ALIVE provides accurate, content-aware, and engaging real-time support.
ALIVE illustrates how multimodal AI, when combined with content-aware retrieval and local deployment, can significantly enhance the pedagogical value of recorded lectures, offering an extensible pathway toward next-generation interactive learning environments.
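
The abstract says retrieval combines semantic similarity with timestamp alignment but does not say how. One plausible reading, sketched below as an assumption, is a weighted sum of cosine similarity and a temporal weight that decays with distance from the moment the student paused. The weighting `alpha`, the decay constant `tau`, and all function names are hypothetical, and the sketch scores segments by brute force rather than through FAISS.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def temporal_weight(start, end, pause_time, tau=60.0):
    """1.0 if the pause falls inside the segment, decaying with the gap otherwise."""
    if start <= pause_time <= end:
        gap = 0.0
    else:
        gap = min(abs(pause_time - start), abs(pause_time - end))
    return math.exp(-gap / tau)


def rank_segments(query_vec, segments, pause_time, alpha=0.7):
    """Rank segment ids by alpha * semantic + (1 - alpha) * temporal score."""
    scored = [
        (alpha * cosine(query_vec, seg["embedding"])
         + (1 - alpha) * temporal_weight(seg["start"], seg["end"], pause_time),
         seg["id"])
        for seg in segments
    ]
    return [seg_id for _, seg_id in sorted(scored, reverse=True)]


segments = [
    {"id": "A", "embedding": [1.0, 0.0], "start": 0.0, "end": 60.0},
    {"id": "B", "embedding": [0.9, 0.1], "start": 300.0, "end": 360.0},
    {"id": "C", "embedding": [0.0, 1.0], "start": 290.0, "end": 300.0},
]
ranking = rank_segments([1.0, 0.0], segments, pause_time=310.0)
```

In this toy case segment B outranks the slightly better semantic match A because the student paused inside B, which is the kind of context sensitivity that pure similarity search would miss.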

[40] Lightweight framework for underground pipeline recognition and spatial localization based on multi-view 2D GPR images

Haotian Lv,Chao Li,Jiangbo Dai,Yuhui Zhang,Zepeng Fan,Yiqiu Tan,Dawei Wang,Binglei Xie

Main category: cs.CV

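TL;DR: This paper proposes an intelligent 3D underground pipeline detection method based on three-view joint analysis and an improved YOLO framework, markedly improving small-target detection accuracy and robustness in complex scenes.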

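Details Motivation: Underground pipeline detection with 3D ground-penetrating radar (GPR) suffers from weak correlation between multi-view features, low recognition accuracy for small-scale targets, and insufficient robustness in complex scenes, so a detection framework that fuses multi-view information and strengthens small-target feature extraction is needed. Method: First, a B/C/D-Scan three-view joint analysis strategy is adopted, and a three-view feature evaluation method for 3D pipelines is built by cross-validating FDTD forward simulations against measured data. Second, the DCO-YOLO framework adds the DySample, CGLU, and OutlookAttention cross-dimensional correlation mechanisms to YOLOv11 to improve edge feature extraction for small-scale pipelines. Finally, a 3D-DIoU spatial feature matching algorithm combines three-dimensional geometric constraints with a center-distance penalty term to associate multi-view annotations automatically, and a three-view fusion strategy removes the ambiguities inherent in single-view detection. Result: On real urban underground pipeline data, the method achieves accuracy, recall, and mean average precision (mAP) of 96.2%, 93.3%, and 96.7% in complex multi-pipeline scenes, which are 2.0%, 2.1%, and 0.9% higher than the baseline model. Ablation experiments confirm the synergistic effect of the dynamic feature enhancement modules, and Grad-CAM++ visualizations show that the improved model focuses more on pipeline geometric features. Conclusion: The study combines deep learning optimization strategies with the physical characteristics of 3D GPR to provide an efficient and reliable framework for intelligent recognition and localization of underground pipelines, offering solid technical support for engineering practice. Abstract: To address the issues of weak correlation between multi-view features, low recognition accuracy of small-scale targets, and insufficient robustness in complex scenarios in underground pipeline detection using 3D GPR, this paper proposes a 3D pipeline intelligent detection framework. First, based on a B/C/D-Scan three-view joint analysis strategy, a three-dimensional pipeline three-view feature evaluation method is established by cross-validating forward simulation results obtained using FDTD methods with actual measurement data. Second, the DCO-YOLO framework is proposed, which integrates DySample, CGLU, and OutlookAttention cross-dimensional correlation mechanisms into the original YOLOv11 algorithm, significantly improving the small-scale pipeline edge feature extraction capability. Furthermore, a 3D-DIoU spatial feature matching algorithm is proposed, which integrates three-dimensional geometric constraints and center distance penalty terms to achieve automated association of multi-view annotations. The three-view fusion strategy resolves inherent ambiguities in single-view detection. Experiments based on real urban underground pipeline data show that the proposed method achieves accuracy, recall, and mean average precision of 96.2%, 93.3%, and 96.7%, respectively, in complex multi-pipeline scenarios, which are 2.0%, 2.1%, and 0.9% higher than the baseline model.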
Ablation experiments validated the synergistic optimization effect of the dynamic feature enhancement module, and Grad-CAM++ heatmap visualization demonstrated that the improved model focuses significantly better on pipeline geometric features. This study integrates deep learning optimization strategies with the physical characteristics of 3D GPR, offering an efficient and reliable technical framework for the intelligent recognition and localization of underground pipelines.
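
The abstract describes 3D-DIoU as combining three-dimensional geometric constraints with a center-distance penalty. The sketch below applies the standard Distance-IoU form (IoU minus squared center distance over squared enclosing-box diagonal) to axis-aligned 3D boxes. Treating the paper's measure as this direct 3D extension is an assumption, and the box layout `(xmin, ymin, zmin, xmax, ymax, zmax)` and the function name `diou_3d` are hypothetical.

```python
def diou_3d(a, b):
    """Distance-IoU for two axis-aligned 3D boxes.

    Each box is (xmin, ymin, zmin, xmax, ymax, zmax).
    Returns IoU - d^2 / c^2, where d is the distance between box centers
    and c is the diagonal of the smallest box enclosing both.
    """
    inter = 1.0
    vol_a = 1.0
    vol_b = 1.0
    center_dist_sq = 0.0
    diag_sq = 0.0
    for i in range(3):
        lo_a, hi_a = a[i], a[i + 3]
        lo_b, hi_b = b[i], b[i + 3]
        # Overlap along this axis (zero if the boxes are disjoint here).
        inter *= max(min(hi_a, hi_b) - max(lo_a, lo_b), 0.0)
        vol_a *= hi_a - lo_a
        vol_b *= hi_b - lo_b
        center_dist_sq += ((lo_a + hi_a) / 2 - (lo_b + hi_b) / 2) ** 2
        diag_sq += (max(hi_a, hi_b) - min(lo_a, lo_b)) ** 2
    union = vol_a + vol_b - inter
    iou = inter / union if union > 0 else 0.0
    penalty = center_dist_sq / diag_sq if diag_sq > 0 else 0.0
    return iou - penalty


same = diou_3d((0, 0, 0, 2, 2, 2), (0, 0, 0, 2, 2, 2))
shifted = diou_3d((0, 0, 0, 2, 2, 2), (1, 0, 0, 3, 2, 2))
apart = diou_3d((0, 0, 0, 1, 1, 1), (5, 5, 5, 6, 6, 6))
```

Unlike plain IoU, the score stays informative for non-overlapping boxes (it becomes negative and grows more negative with distance), which is useful when matching detections of the same pipeline across views that do not overlap exactly.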

[41] NeRV360: Neural Representation for 360-Degree Videos with a Viewport Decoder

Daichi Arai,Kyohei Unno,Yasuko Sugito,Yuichi Kusakabe

Main category: cs.CV

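TL;DR: NeRV360 is an end-to-end implicit neural representation framework for high-resolution 360-degree video that decodes only the user's viewport, substantially reducing memory consumption and increasing decoding speed.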

Details Motivation: Applying existing implicit neural representations for video (NeRV) to high-resolution 360-degree video leads to high memory usage and slow decoding, which makes real-time applications difficult. Method: The NeRV360 framework integrates viewport extraction into the decoding process and introduces a spatial-temporal affine transform module for decoding conditioned on viewpoint and time, reconstructing only the viewport the user is watching. Result: Experiments on 6K video show that, compared with HNeRV, NeRV360 reduces memory consumption 7-fold and increases decoding speed 2.5-fold, with better image quality. Conclusion: NeRV360 addresses the efficiency bottleneck of implicit neural representations for high-resolution 360-degree video and offers a viable route to real-time applications. Abstract: Implicit neural representations for videos (NeRV) have shown strong potential for video compression. However, applying NeRV to high-resolution 360-degree videos causes high memory usage and slow decoding, making real-time applications impractical. We propose NeRV360, an end-to-end framework that decodes only the user-selected viewport instead of reconstructing the entire panoramic frame. Unlike conventional pipelines, NeRV360 integrates viewport extraction into decoding and introduces a spatial-temporal affine transform module for conditional decoding based on viewpoint and time. Experiments on 6K-resolution videos show that NeRV360 achieves a 7-fold reduction in memory consumption and a 2.5-fold increase in decoding speed compared to HNeRV, a representative prior work, while delivering better image quality in terms of objective metrics.

[42] Beyond Weight Adaptation: Feature-Space Domain Injection for Cross-Modal Ship Re-Identification

Tingfeng Xian,Wenlve Zhou,Zhiheng Zhou,Zhelin Li

Main category: cs.CV

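TL;DR: This paper proposes a cross-modal ship re-identification method based on vision foundation models (VFMs). Its Domain Representation Injection (DRI) strategy operates in feature space and narrows the modality gap without fine-tuning the original model weights, reaching SOTA performance with few trainable parameters.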

Details Motivation: Existing cross-modal ship re-identification methods rely on large-scale paired data for explicit modality alignment, and mainstream parameter-efficient fine-tuning (PEFT) methods perform poorly on low-capacity models, so a more efficient paradigm that does not need large amounts of paired data is required. Method: Grounded in the Platonic Representation Hypothesis, the entire vision foundation model is frozen. A lightweight, learnable Offset Encoder extracts domain-specific representations rich in modality and identity information from the raw input, a Modulator adaptively transforms them according to intermediate-layer context, and the result is injected into the intermediate layers by additive fusion to reshape the feature distribution dynamically. Result: On the HOSS-ReID dataset, the method reaches 57.9% and 60.5% mAP with only 1.54M and 7.05M trainable parameters, respectively, clearly outperforming existing methods and achieving SOTA performance. Conclusion: DRI provides an efficient framework for cross-modal adaptation, demonstrates the advantage of parameter-efficient fine-tuning in feature space, and offers a new direction for practical use in resource-constrained settings. Abstract: Cross-Modality Ship Re-Identification (CMS Re-ID) is critical for achieving all-day and all-weather maritime target tracking, yet it is fundamentally challenged by significant modality discrepancies. Mainstream solutions typically rely on explicit modality alignment strategies; however, this paradigm heavily depends on constructing large-scale paired datasets for pre-training. To address this, grounded in the Platonic Representation Hypothesis, we explore the potential of Vision Foundation Models (VFMs) in bridging modality gaps. Recognizing the suboptimal performance of existing generic Parameter-Efficient Fine-Tuning (PEFT) methods that operate within the weight space, particularly on limited-capacity models, we shift the optimization perspective to the feature space and propose a novel PEFT strategy termed Domain Representation Injection (DRI). Specifically, while keeping the VFM fully frozen to maximize the preservation of general knowledge, we design a lightweight, learnable Offset Encoder to extract domain-specific representations rich in modality and identity attributes from raw inputs. Guided by the contextual information of intermediate features at different layers, a Modulator adaptively transforms these representations. Subsequently, they are injected into the intermediate layers via additive fusion, dynamically reshaping the feature distribution to adapt to the downstream task without altering the VFM's pre-trained weights.
Extensive experimental results demonstrate the superiority of our method, achieving State-of-the-Art (SOTA) performance with minimal trainable parameters. For instance, on the HOSS-ReID dataset, we attain 57.9% and 60.5% mAP using only 1.54M and 7.05M parameters, respectively. The code is available at https://github.com/TingfengXian/DRI.

[43] DGSAN: Dual-Graph Spatiotemporal Attention Network for Pulmonary Nodule Malignancy Prediction

Xiao Yu,Zhaojie Fang,Guanyu Zhou,Yin Shen,Huoling Luo,Ye Li,Ahmed Elazab,Xiang Wan,Ruiquan Ge,Changmiao Wang

Main category: cs.CV

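TL;DR: A Dual-Graph Spatiotemporal Attention Network (DGSAN) integrates multimodal and multi-temporal information through a global-local feature encoder and a hierarchical cross-modal graph fusion module, significantly improving pulmonary nodule classification accuracy and computational efficiency.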

Details Motivation: Existing fusion methods are limited to inefficient vector concatenation and simple mutual attention, which fail to exploit multimodal and multi-temporal data fully; a more effective fusion strategy is needed to improve pulmonary nodule detection accuracy. Method: A Global-Local Feature Encoder captures the local, global, and fused features of pulmonary nodules; a dual-graph structure organizes inter-modal and intra-modal features; and a Hierarchical Cross-Modal Graph Fusion Module refines feature integration. Result: Experiments on the NLST-cmst and CSTL-derived datasets show that DGSAN significantly outperforms existing state-of-the-art methods, combining high classification accuracy with strong computational efficiency. Conclusion: DGSAN provides an efficient and scalable framework for multimodal medical image analysis and helps advance early lung cancer diagnosis. Abstract: Lung cancer continues to be the leading cause of cancer-related deaths globally. Early detection and diagnosis of pulmonary nodules are essential for improving patient survival rates. Although previous research has integrated multimodal and multi-temporal information, outperforming single-modality and single-time-point approaches, the fusion methods are limited to inefficient vector concatenation and simple mutual attention, highlighting the need for more effective multimodal information fusion. To address these challenges, we introduce a Dual-Graph Spatiotemporal Attention Network, which leverages temporal variations and multimodal data to enhance the accuracy of predictions. Our methodology involves developing a Global-Local Feature Encoder to better capture the local, global, and fused characteristics of pulmonary nodules. Additionally, a Dual-Graph Construction method organizes multimodal features into inter-modal and intra-modal graphs. Furthermore, a Hierarchical Cross-Modal Graph Fusion Module is introduced to refine feature integration. We also compiled a novel multimodal dataset named the NLST-cmst dataset as a comprehensive source of support for related research. Our extensive experiments, conducted on both the NLST-cmst and curated CSTL-derived datasets, demonstrate that our DGSAN significantly outperforms state-of-the-art methods in classifying pulmonary nodules with exceptional computational efficiency.

[44] Benchmarking and Enhancing VLM for Compressed Image Understanding

Zifu Zhang,Tongda Xu,Siqi Li,Shengxi Li,Yue Zhang,Mai Xu,Yan Wang

Main category: cs.CV

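TL;DR: This paper presents the first comprehensive benchmark for evaluating vision-language models (VLMs) on compressed images, analyzes the sources of the performance gap, and proposes a universal VLM adaptor that improves VLM performance on compressed images by 10%-30% across codecs and bitrates.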

Details Motivation: As vision-language models (VLMs) develop and demand for their applications grows, efficient compression of image inputs becomes increasingly important. Existing VLMs mainly handle high-bitrate compressed images, and their ability to understand low-bitrate compressed images has not been adequately explored. Method: A comprehensive benchmark of over one million compressed images is built, covering commonly used image codecs and diverse tasks; the performance gap is analyzed by separating information loss from VLM generalization failure; and a universal VLM adaptor is proposed to improve performance on compressed images. Result: Of the two sources of the performance gap, only the generalization gap is found to be mitigable; the information lost during compression is not. The universal adaptor improves VLM performance by 10%-30% across codecs and bitrates. Conclusion: The study offers useful insight into the gap between VLMs and compressed images, and the proposed benchmark and enhancement method help advance the use of VLMs in practical compression scenarios. Abstract: With the rapid development of Vision-Language Models (VLMs) and the growing demand for their applications, efficient compression of the image inputs has become increasingly important. Existing VLMs predominantly digest and understand high-bitrate compressed images, while their ability to interpret low-bitrate compressed images has yet to be explored. In this paper, we introduce the first comprehensive benchmark to evaluate the ability of VLMs on compressed images, covering widely used image codecs and a diverse set of tasks and encompassing over one million compressed images. Next, we analyse the source of the performance gap by categorising it into a) the information loss during compression and b) generalisation failure of the VLM. We visualize these gaps with concrete examples and identify that for compressed images, only the generalization gap can be mitigated. Finally, we propose a universal VLM adaptor to enhance model performance on images compressed by existing codecs. Consequently, we demonstrate that a single adaptor can improve VLM performance across images with varying codecs and bitrates by 10%-30%. We believe that our benchmark and enhancement method provide valuable insights and contribute toward bridging the gap between VLMs and compressed images.

[45] PanoGrounder: Bridging 2D and 3D with Panoramic Scene Representations for VLM-based 3D Visual Grounding

Seongmin Jung,Seongho Choi,Gunwoo Jeon,Minsu Cho,Jongwoo Lim

Main category: cs.CV

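TL;DR: PanoGrounder is a generalizable 3D visual grounding framework based on panoramic representations that leverages pretrained 2D vision-language models for strong reasoning, reaching SOTA on ScanRefer and Nr3D and generalizing well across datasets and text rephrasings.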

Details Motivation: Traditional 3D visual grounding models depend on scarce 3D vision-language data and generalize poorly, so stronger vision-language models are needed to improve reasoning. Method: PanoGrounder uses panoramic renderings enriched with 3D semantic and geometric features as an intermediate representation between 2D and 3D, applies a pretrained 2D VLM for multi-view grounding, and outputs a 3D bounding box through a three-stage pipeline (viewpoint selection, per-view grounding, and fusion of predictions via lifting). Result: The method achieves state-of-the-art performance on ScanRefer and Nr3D and generalizes better to unseen 3D datasets and different text phrasings. Conclusion: Through a panoramic intermediate representation, PanoGrounder combines the strong reasoning of 2D vision-language models with 3D scene structure, delivering accurate and generalizable 3D visual grounding. Abstract: 3D Visual Grounding (3DVG) is a critical bridge from vision-language perception to robotics, requiring both language understanding and 3D scene reasoning. Traditional supervised models leverage explicit 3D geometry but exhibit limited generalization, owing to the scarcity of 3D vision-language datasets and the limited reasoning capabilities compared to modern vision-language models (VLMs). We propose PanoGrounder, a generalizable 3DVG framework that couples multi-modal panoramic representation with pretrained 2D VLMs for strong vision-language reasoning. Panoramic renderings, augmented with 3D semantic and geometric features, serve as an intermediate representation between 2D and 3D, and offer two major benefits: (i) they can be directly fed to VLMs with minimal adaptation and (ii) they retain long-range object-to-object relations thanks to their 360-degree field of view. We devise a three-stage pipeline that places a compact set of panoramic viewpoints considering the scene layout and geometry, grounds a text query on each panoramic rendering with a VLM, and fuses per-view predictions into a single 3D bounding box via lifting. Our approach achieves state-of-the-art results on ScanRefer and Nr3D, and demonstrates superior generalization to unseen 3D datasets and text rephrasings.

[46] Self-supervised Multiplex Consensus Mamba for General Image Fusion

Yingying Wang,Rongjin Zhuang,Hui Zheng,Xuanhua He,Ke Cao,Xiaotong Tu,Xinghao Ding

Main category: cs.CV

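TL;DR: This paper proposes SMC-Mamba, a self-supervised multiplex consensus Mamba framework for general image fusion. It integrates complementary multimodal information through a modality-agnostic feature enhancement module and a multiplex consensus cross-modal Mamba module, and adds a bi-level self-supervised contrastive learning loss that preserves high-frequency information while improving downstream task performance. Experiments show it outperforms existing state-of-the-art methods across multiple image fusion tasks.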

Details Motivation: General image fusion must handle many tasks and improve performance without adding complexity, whereas existing methods mostly target specific tasks and struggle to be both broadly applicable and efficient. Method: The SMC-Mamba framework comprises a Modality-Agnostic Feature Enhancement (MAFE) module and a Multiplex Consensus Cross-modal Mamba (MCCM) module, together with a Bi-level Self-supervised Contrastive Learning Loss (BSCL) that preserves high-frequency information and promotes cross-modal feature interaction. Result: The method surpasses existing state-of-the-art approaches on infrared-visible, medical, multi-focus, and multi-exposure fusion, and clearly improves downstream vision tasks such as object detection and semantic segmentation. Conclusion: Through its self-supervised multiplex consensus mechanism, SMC-Mamba achieves efficient general image fusion with strong performance and generalization across modalities and tasks. Abstract: Image fusion integrates complementary information from different modalities to generate high-quality fused images, thereby enhancing downstream tasks such as object detection and semantic segmentation. Unlike task-specific techniques that primarily focus on consolidating inter-modal information, general image fusion needs to address a wide range of tasks while improving performance without increasing complexity. To achieve this, we propose SMC-Mamba, a Self-supervised Multiplex Consensus Mamba framework for general image fusion. Specifically, the Modality-Agnostic Feature Enhancement (MAFE) module preserves fine details through adaptive gating and enhances global representations via spatial-channel and frequency-rotational scanning. The Multiplex Consensus Cross-modal Mamba (MCCM) module enables dynamic collaboration among experts, reaching a consensus to efficiently integrate complementary information from multiple modalities. The cross-modal scanning within MCCM further strengthens feature interactions across modalities, facilitating seamless integration of critical information from both sources. Additionally, we introduce a Bi-level Self-supervised Contrastive Learning Loss (BSCL), which preserves high-frequency information without increasing computational overhead while simultaneously boosting performance in downstream tasks. Extensive experiments demonstrate that our approach outperforms state-of-the-art (SOTA) image fusion algorithms in tasks such as infrared-visible, medical, multi-focus, and multi-exposure fusion, as well as downstream visual tasks.

[47] Quantile Rendering: Efficiently Embedding High-dimensional Feature on 3D Gaussian Splatting

Yoonwoo Jeong,Cheng Sun,Frank Wang,Minsu Cho,Jaesung Choe

Main category: cs.CV

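TL;DR: Proposes Q-Render, a new rendering strategy for 3D Gaussians that, combined with the GS-Net network, handles high-dimensional features efficiently and enables high-quality open-vocabulary segmentation with real-time rendering.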

Details Motivation: Existing methods lose information when rendering the high-dimensional features needed for open-vocabulary queries, which degrades segmentation quality; a method that is both high-fidelity and efficient is needed. Method: Proposes Quantile Rendering (Q-Render), which replaces conventional dense sampling with sparse sampling of the 3D Gaussians that have dominant influence along each ray, and builds the Gaussian Splatting Network (GS-Net) to predict Gaussian features in a generalizable manner. Result: Experiments on ScanNet and LeRF show that the method outperforms current state-of-the-art methods and achieves a roughly 43.7x rendering speedup on 512-D feature maps. Conclusion: The joint Q-Render and GS-Net framework substantially improves rendering efficiency while maintaining high accuracy, offering an efficient and scalable solution for 3D open-vocabulary segmentation. Abstract: Recent advancements in computer vision have successfully extended Open-vocabulary segmentation (OVS) to the 3D domain by leveraging 3D Gaussian Splatting (3D-GS). Despite this progress, efficiently rendering the high-dimensional features required for open-vocabulary queries poses a significant challenge. Existing methods employ codebooks or feature compression, causing information loss, thereby degrading segmentation quality. To address this limitation, we introduce Quantile Rendering (Q-Render), a novel rendering strategy for 3D Gaussians that efficiently handles high-dimensional features while maintaining high fidelity. Unlike conventional volume rendering, which densely samples all 3D Gaussians intersecting each ray, Q-Render sparsely samples only those with dominant influence along the ray. By integrating Q-Render into a generalizable 3D neural network, we also propose Gaussian Splatting Network (GS-Net), which predicts Gaussian features in a generalizable manner. Extensive experiments on ScanNet and LeRF demonstrate that our framework outperforms state-of-the-art methods, while enabling real-time rendering with an approximately 43.7x speedup on 512-D feature maps. Code will be made publicly available.

[48] Transductive Visual Programming: Evolving Tool Libraries from Experience for Spatial Reasoning

Shengguang Wu,Xiaohan Wang,Yuhui Zhang,Hao Zhu,Serena Yeung-Levy

Main category: cs.CV

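TL;DR: Proposes Transductive Visual Programming (TVP), a framework that creates tools transductively from experience, enabling self-evolving visual programming for spatial reasoning in 3D scenes and substantially improving performance and tool reuse.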

Details Motivation: Existing visual programming methods rely on fixed toolsets or speculative tool generation, which yields suboptimal programs and low tool utilization and makes complex 3D spatial reasoning hard to handle. Method: TVP first solves problems with basic tools and stores the solutions in an Example Library, then abstracts recurring patterns from them into reusable higher-level tools that are added to a Tool Library, gradually evolving a more powerful toolset for new problems. Result: On Omni3D-Bench, TVP outperforms GPT-4o by 22% and the previous best system by 11%; transductively learned tools are used 5x more often than inductively created ones, and they generalize strongly to unseen tasks such as SpatialScore-Hard. Conclusion: Experience-driven transductive tool creation is an effective paradigm for self-evolving visual programming, markedly improving performance, tool reuse, and generalization on complex spatial reasoning tasks. Abstract: Spatial reasoning in 3D scenes requires precise geometric calculations that challenge vision-language models. Visual programming addresses this by decomposing problems into steps calling specialized tools, yet existing methods rely on either fixed toolsets or speculative tool induction before solving problems, resulting in suboptimal programs and poor utilization of induced tools. We present Transductive Visual Programming (TVP), a novel framework that builds new tools from its own experience rather than speculation. TVP first solves problems using basic tools while accumulating experiential solutions into an Example Library, then abstracts recurring patterns from these programs into reusable higher-level tools for an evolving Tool Library. This allows TVP to tackle new problems with increasingly powerful tools learned from experience. On Omni3D-Bench, TVP achieves state-of-the-art performance, outperforming GPT-4o by 22% and the previous best visual programming system by 11%. Our transductively learned tools are used 5x more frequently as core program dependency than inductively created ones, demonstrating more effective tool discovery and reuse. The evolved tools also show strong generalization to unseen spatial tasks, achieving superior performance on benchmarks from SpatialScore-Hard collection without any testset-specific modification. Our work establishes experience-driven transductive tool creation as a powerful paradigm for building self-evolving visual programming agents that effectively tackle challenging spatial reasoning tasks. We release our code at https://transductive-visualprogram.github.io/.

[49] Reasoning-Driven Amodal Completion: Collaborative Agents and Perceptual Evaluation

Hongxing Fan,Shuyu Zhao,Jiayang Ao,Lu Sheng

Main category: cs.CV

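TL;DR: Proposes a collaborative multi-agent reasoning framework that decouples semantic planning from visual synthesis, enabling semantically and structurally consistent single-pass amodal completion.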

Details Motivation: Existing progressive approaches suffer from inference instability and error accumulation, making it hard to maintain semantic consistency and structural integrity. Method: Designs a multi-agent framework comprising specialized agents for upfront semantic planning, a self-correcting Verification Agent that uses Chain-of-Thought reasoning, and a Diverse Hypothesis Generator, with visible-region correction and occluder identification completed within the semantic planning phase. Result: Significantly outperforms current state-of-the-art methods on multiple datasets; the proposed MAC-Score metric aligns closely with human judgment and effectively assesses the structural completeness and semantic consistency of completion results. Conclusion: Through explicit decoupling and upfront planning, the framework achieves more stable, consistent, and interpretable amodal completion, offering a new paradigm for the task. Abstract: Amodal completion, the task of inferring invisible object parts, faces significant challenges in maintaining semantic consistency and structural integrity. Prior progressive approaches are inherently limited by inference instability and error accumulation. To tackle these limitations, we present a Collaborative Multi-Agent Reasoning Framework that explicitly decouples Semantic Planning from Visual Synthesis. By employing specialized agents for upfront reasoning, our method generates a structured, explicit plan before pixel generation, enabling visually and semantically coherent single-pass synthesis. We integrate this framework with two critical mechanisms: (1) a self-correcting Verification Agent that employs Chain-of-Thought reasoning to rectify visible region segmentation and identify residual occluders strictly within the Semantic Planning phase, and (2) a Diverse Hypothesis Generator that addresses the ambiguity of invisible regions by offering diverse, plausible semantic interpretations, surpassing the limited pixel-level variations of standard random seed sampling. Furthermore, addressing the limitations of traditional metrics in assessing inferred invisible content, we introduce the MAC-Score (MLLM Amodal Completion Score), a novel human-aligned evaluation metric. Validated against human judgment and ground truth, these metrics establish a robust standard for assessing structural completeness and semantic consistency with visible context. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods across multiple datasets. Our project is available at: https://fanhongxing.github.io/remac-page.

[50] Beyond Artifacts: Real-Centric Envelope Modeling for Reliable AI-Generated Image Detection

Ruiqi Liu,Yi Han,Zhengbo Zhang,Liwei Yao,Zhiyuan Yan,Jialiang Shen,ZhiJin Chen,Boyi Sun,Lubin Weng,Jing Dong,Yan Wang,Shu Wu

Main category: cs.CV

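TL;DR: This paper proposes REM, a new paradigm for synthetic image detection that models the distribution of real images rather than generator artifacts to improve robustness and generalization under real-world degradations, and builds RealChain, a benchmark covering multiple generators with simulated real-world degradation.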

Details Motivation: Existing detectors rely too heavily on generator-specific artifacts and are sensitive to real-world image degradations (such as multi-platform sharing and post-processing), so they adapt poorly to evolving generative models and complex chain-degradation scenarios. Method: Proposes Real-centric Envelope Modeling (REM), which uses feature-level perturbations in self-reconstruction to generate near-real samples and employs an envelope estimator with cross-domain consistency to learn a boundary enclosing the real image manifold, thereby modeling the distribution of real images. Result: Across eight benchmarks, REM improves on existing state-of-the-art methods by 7.5% on average and shows exceptional generalization on the severely degraded RealChain benchmark. Conclusion: By focusing on modeling the real image distribution rather than generator artifacts, REM markedly improves detection robustness and generalization and provides a solid foundation for synthetic image detection under real-world conditions. Abstract: The rapid progress of generative models has intensified the need for reliable and robust detection under real-world conditions. However, existing detectors often overfit to generator-specific artifacts and remain highly sensitive to real-world degradations. As generative architectures evolve and images undergo multi-round cross-platform sharing and post-processing (chain degradations), these artifact cues become obsolete and harder to detect. To address this, we propose Real-centric Envelope Modeling (REM), a new paradigm that shifts detection from learning generator artifacts to modeling the robust distribution of real images. REM introduces feature-level perturbations in self-reconstruction to generate near-real samples, and employs an envelope estimator with cross-domain consistency to learn a boundary enclosing the real image manifold. We further build RealChain, a comprehensive benchmark covering both open-source and commercial generators with simulated real-world degradation. Across eight benchmark evaluations, REM achieves an average improvement of 7.5% over state-of-the-art methods, and notably maintains exceptional generalization on the severely degraded RealChain benchmark, establishing a solid foundation for synthetic image detection under real-world conditions. The code and the RealChain benchmark will be made publicly available upon acceptance of the paper.

[51] SPOT!: Map-Guided LLM Agent for Unsupervised Multi-CCTV Dynamic Object Tracking

Yujin Noh,Inho Jake Park,Chigon Hwang

Main category: cs.CV

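TL;DR: This paper proposes SPOT, a map-guided LLM agent that continuously tracks vehicle trajectories through blind spots in multi-CCTV environments without prior training.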

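Details Motivation: Blind spots caused by the intervals between CCTVs and their limited fields of view (FOV) make it hard for existing CCTV-based vehicle tracking systems to keep connecting the trajectory of the same vehicle across multiple cameras; this leads to object ID switching and trajectory loss, which reduces the reliability of real-time path prediction. Method: The study represents road structures (waypoints) and CCTV placement information as documents based on 2D spatial coordinates and organizes them through chunking to support real-time querying and inference. It also uses the relative position of observed objects in CCTV images, together with FOV information, to transform the vehicle's position into the real-world coordinate system. Combining the map's spatial information with the vehicle's moving direction, speed, and driving patterns, it runs a beam search at the intersection level to derive the candidate CCTV locations the vehicle is most likely to enter after a blind spot. Result: Experiments in a virtual city environment built on the CARLA simulator show that the proposed method accurately predicts the next CCTV in which the vehicle appears even when blind spots are present, and maintains continuous vehicle trajectories more effectively than existing techniques. Conclusion: SPOT effectively addresses the vehicle tracking difficulties caused by blind spots in multi-CCTV environments, improving trajectory continuity and the accuracy of real-time path prediction. Abstract: CCTV-based vehicle tracking systems face structural limitations in continuously connecting the trajectories of the same vehicle across multiple camera environments. In particular, blind spots occur due to the intervals between CCTVs and limited Fields of View (FOV), which leads to object ID switching and trajectory loss, thereby reducing the reliability of real-time path prediction. This paper proposes SPOT (Spatial Prediction Over Trajectories), a map-guided LLM agent capable of tracking vehicles even in blind spots of multi-CCTV environments without prior training. The proposed method represents road structures (Waypoints) and CCTV placement information as documents based on 2D spatial coordinates and organizes them through chunking techniques to enable real-time querying and inference. Furthermore, it transforms the vehicle's position into the actual world coordinate system using the relative position and FOV information of objects observed in CCTV images. By combining map spatial information with the vehicle's moving direction, speed, and driving patterns, a beam search is performed at the intersection level to derive candidate CCTV locations where the vehicle is most likely to enter after the blind spot. Experimental results based on the CARLA simulator in a virtual city environment confirmed that the proposed method accurately predicts the next appearing CCTV even in blind spot sections, maintaining continuous vehicle trajectories more effectively than existing techniques.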

[52] XGrid-Mapping: Explicit Implicit Hybrid Grid Submaps for Efficient Incremental Neural LiDAR Mapping

Zeqing Song,Zhongmiao Yan,Junyuan Deng,Songpengcheng Xia,Xiang Mu,Jingyi Xu,Qi Wu,Ling Pei

Main category: cs.CV

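TL;DR: Proposes XGrid-Mapping, a hybrid grid framework that combines explicit and implicit representations for efficient large-scale incremental neural LiDAR mapping. A sparse grid provides geometric priors, an implicit dense grid enriches the scene representation, and a distillation-based alignment strategy plus a dynamic removal module enable high-quality, efficient real-time mapping.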

Details Motivation: Existing neural LiDAR mapping methods either rely on dense implicit representations and neglect geometric structure, or are voxel-based and struggle to reach real-time performance; a large-scale incremental mapping solution that balances efficiency and accuracy is missing. Method: Proposes the XGrid-Mapping framework, which (1) combines a sparse grid (providing geometric priors) with an implicit dense grid (enriching the representation); (2) uses the VDB structure and a submap-based organization to reduce computational load; (3) introduces a distillation-based overlap alignment strategy in which preceding submaps supervise subsequent ones to ensure consistency; and (4) adds a dynamic removal module to improve robustness and sampling efficiency. Result: Experiments show that the method surpasses existing methods in mapping quality while overcoming the efficiency bottleneck of voxel-guided methods, achieving large-scale real-time incremental mapping. Conclusion: By fusing explicit and implicit representations, XGrid-Mapping markedly improves efficiency while maintaining high accuracy, providing an effective solution for large-scale incremental LiDAR mapping. Abstract: Large-scale incremental mapping is fundamental to the development of robust and reliable autonomous systems, as it underpins incremental environmental understanding with sequential inputs for navigation and decision-making. LiDAR is widely used for this purpose due to its accuracy and robustness. Recently, neural LiDAR mapping has shown impressive performance; however, most approaches rely on dense implicit representations and underutilize geometric structure, while existing voxel-guided methods struggle to achieve real-time performance. To address these challenges, we propose XGrid-Mapping, a hybrid grid framework that jointly exploits explicit and implicit representations for efficient neural LiDAR mapping. Specifically, the strategy combines a sparse grid, providing geometric priors and structural guidance, with an implicit dense grid that enriches scene representation. By coupling the VDB structure with a submap-based organization, the framework reduces computational load and enables efficient incremental mapping on a large scale. To mitigate discontinuities across submaps, we introduce a distillation-based overlap alignment strategy, in which preceding submaps supervise subsequent ones to ensure consistency in overlapping regions. To further enhance robustness and sampling efficiency, we incorporate a dynamic removal module. Extensive experiments show that our approach delivers superior mapping quality while overcoming the efficiency limitations of voxel-guided methods, thereby outperforming existing state-of-the-art mapping methods.

[53] X-ray Insights Unleashed: Pioneering the Enhancement of Multi-Label Long-Tail Data

Xinquan Yang,Jinheng Xie,Yawen Huang,Yuexiang Li,Huimin Huang,Hao Zheng,Xian Wu,Yefeng Zheng,Linlin Shen

Main category: cs.CV

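TL;DR: This paper proposes a new data synthesis pipeline that uses a large supply of normal X-rays to strengthen the representation of tail lesions. A pre-trained diffusion model inpaints the head lesions, leaving the tail classes as augmented training data, while a large language model knowledge guidance module and a progressive incremental learning strategy stabilize the inpainting fine-tuning process. The method sets a new performance benchmark on the MIMIC and CheXpert lung datasets.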

Details Motivation: Long-tailed pulmonary anomalies are challenging in chest X-ray diagnosis, and existing diffusion-based methods have limited generative capability because rare lesion samples are scarce, leaving diagnostic precision unsatisfactory. Method: Proposes a data synthesis pipeline that augments tail lesions using normal X-rays: a diffusion model is first trained on normal samples to generate normal X-rays, and is then used to inpaint the head lesions in diseased X-rays so that the tail lesions remain as augmented data; a Large Language Model Knowledge Guidance (LKG) module and a Progressive Incremental Learning (PIL) strategy are introduced to stabilize fine-tuning. Result: Comprehensive evaluations on the public lung datasets MIMIC and CheXpert show that the proposed method sets a new performance benchmark. Conclusion: The method effectively addresses the scarcity of rare pulmonary lesion samples, improves diagnostic precision for long-tailed lesions, and offers a new approach to data imbalance in medical image analysis. Abstract: Long-tailed pulmonary anomalies in chest radiography present formidable diagnostic challenges. Despite the recent strides in diffusion-based methods for enhancing the representation of tailed lesions, the paucity of rare lesion exemplars curtails the generative capabilities of these approaches, thereby leaving the diagnostic precision less than optimal. In this paper, we propose a novel data synthesis pipeline designed to augment tail lesions utilizing a copious supply of conventional normal X-rays. Specifically, a sufficient quantity of normal samples is amassed to train a diffusion model capable of generating normal X-ray images. This pre-trained diffusion model is subsequently utilized to inpaint the head lesions present in the diseased X-rays, thereby preserving the tail classes as augmented training data. Additionally, we propose the integration of a Large Language Model Knowledge Guidance (LKG) module alongside a Progressive Incremental Learning (PIL) strategy to stabilize the inpainting fine-tuning process. Comprehensive evaluations conducted on the public lung datasets MIMIC and CheXpert demonstrate that the proposed method sets a new benchmark in performance.

[54] PUFM++: Point Cloud Upsampling via Enhanced Flow Matching

Zhi-Song Liu,Chenhang He,Roland Maier,Andreas Rupp

Main category: cs.CV

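TL;DR: PUFM++ is an enhanced flow-matching framework for reconstructing dense and accurate point clouds from sparse, noisy, and incomplete observations, with improvements in geometric fidelity, robustness, and consistency with downstream tasks.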

Details Motivation: Existing point cloud upsampling methods suffer from geometric distortion and weak support for downstream tasks when the input is noisy or incomplete, so a more robust, higher-fidelity generative model is needed. Method: Proposes a two-stage flow-matching strategy: the first stage learns a direct, straight-path flow from sparse inputs to dense targets, and the second stage refines it using noise-perturbed samples. A data-driven adaptive time scheduler accelerates inference, on-manifold constraints during sampling keep the point cloud aligned with the underlying surface, and a recurrent interface network (RIN) strengthens hierarchical feature interactions. Result: Achieves state-of-the-art performance on both synthetic benchmarks and real-world scans, markedly improving visual quality and quantitative metrics, with particularly strong results under complex noise and occlusion. Conclusion: Through these improvements, PUFM++ substantially raises the quality and robustness of point cloud upsampling and advances the use of flow-matching generative models in 3D reconstruction. Abstract: Recent advances in generative modeling have demonstrated strong promise for high-quality point cloud upsampling. In this work, we present PUFM++, an enhanced flow-matching framework for reconstructing dense and accurate point clouds from sparse, noisy, and partial observations. PUFM++ improves flow matching along three key axes: (i) geometric fidelity, (ii) robustness to imperfect input, and (iii) consistency with downstream surface-based tasks. We introduce a two-stage flow-matching strategy that first learns a direct, straight-path flow from sparse inputs to dense targets, and then refines it using noise-perturbed samples to approximate the terminal marginal distribution better. To accelerate and stabilize inference, we propose a data-driven adaptive time scheduler that improves sampling efficiency based on interpolation behavior. We further impose on-manifold constraints during sampling to ensure that generated points remain aligned with the underlying surface. Finally, we incorporate a recurrent interface network (RIN) to strengthen hierarchical feature interactions and boost reconstruction quality. Extensive experiments on synthetic benchmarks and real-world scans show that PUFM++ sets a new state of the art in point cloud upsampling, delivering superior visual fidelity and quantitative accuracy across a wide range of tasks. Code and pretrained models are publicly available at https://github.com/Holmes-Alan/Enhanced_PUFM.

[55] MVInverse: Feed-forward Multi-view Inverse Rendering in Seconds

Xiangzuo Wu,Chengwei Ren,Jun Zhou,Xiu Li,Yuan Liu

Main category: cs.CV

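TL;DR: Proposes a feed-forward multi-view inverse rendering framework that recovers geometry, materials, and illumination consistently through alternating cross-view attention, and uses a consistency-based finetuning strategy to improve generalization to real-world scenes.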

Details Motivation: Existing single-view methods ignore cross-view relationships and produce inconsistent results, while multi-view optimization methods depend on slow differentiable rendering and per-scene optimization, making them computationally expensive and hard to scale. Method: Designs a feed-forward network that directly predicts spatially varying albedo, metallic, roughness, diffuse shading, and surface normals from sequences of RGB images; alternating cross-view attention captures intra-view long-range lighting interactions and inter-view material consistency; a consistency-based finetuning strategy uses unlabeled real-world videos for refinement. Result: Experiments on multiple benchmark datasets show that the method achieves state-of-the-art performance in multi-view consistency, material and normal estimation quality, and generalization to real-world imagery. Conclusion: The proposed method achieves efficient, consistent multi-view inverse rendering, and its unsupervised finetuning markedly improves robustness and applicability in real-world scenes. Abstract: Multi-view inverse rendering aims to recover geometry, materials, and illumination consistently across multiple viewpoints. When applied to multi-view images, existing single-view approaches often ignore cross-view relationships, leading to inconsistent results. In contrast, multi-view optimization methods rely on slow differentiable rendering and per-scene refinement, making them computationally expensive and hard to scale. To address these limitations, we introduce a feed-forward multi-view inverse rendering framework that directly predicts spatially varying albedo, metallic, roughness, diffuse shading, and surface normals from sequences of RGB images. By alternating attention across views, our model captures both intra-view long-range lighting interactions and inter-view material consistency, enabling coherent scene-level reasoning within a single forward pass. Due to the scarcity of real-world training data, models trained on existing synthetic datasets often struggle to generalize to real-world scenes. To overcome this limitation, we propose a consistency-based finetuning strategy that leverages unlabeled real-world videos to enhance both multi-view coherence and robustness under in-the-wild conditions. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance in terms of multi-view consistency, material and normal estimation quality, and generalization to real-world imagery.

[56] Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations

Jinghan Li,Yang Jin,Hao Jiang,Yadong Mu,Yang Song,Kun Xu

Main category: cs.CV

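TL;DR: This paper proposes NExT-Vid, a new autoregressive visual generative pretraining framework that jointly models images and videos through masked next-frame prediction, using a context-isolated autoregressive predictor and a conditioned flow-matching decoder to improve representation learning.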

Details Motivation: Existing visual generative pretraining methods mostly rely on masked modeling and disregard the temporal information in video, while existing autoregressive methods suffer from inaccurate semantic localization and poor generation quality. Method: Proposes the NExT-Vid framework, which uses masked next-frame prediction, introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and designs a conditioned flow-matching decoder to improve generation quality and diversity. Result: Experiments on large-scale pretrained models show that, under attentive probing, the method consistently outperforms previous generative pretraining methods on downstream classification. Conclusion: Through context-isolated flow-matching pretraining, NExT-Vid achieves strong visual representations and effectively improves joint modeling of images and videos. Abstract: Recent advances in pretraining general foundation models have significantly improved performance across diverse downstream tasks. While autoregressive (AR) generative models like GPT have revolutionized NLP, most visual generative pretraining methods still rely on BERT-style masked modeling, which often disregards the temporal information essential for video analysis. The few existing autoregressive visual pretraining methods suffer from issues such as inaccurate semantic localization and poor generation quality, leading to poor semantics. In this work, we propose NExT-Vid, a novel autoregressive visual generative pretraining framework that utilizes masked next-frame prediction to jointly model images and videos. NExT-Vid introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and a conditioned flow-matching decoder to enhance generation quality and diversity. Through context-isolated flow-matching pretraining, our approach achieves strong representations. Extensive experiments on large-scale pretrained models demonstrate that our proposed method consistently outperforms previous generative pretraining methods for visual representation learning via attentive probing in downstream classification.

[57] Granular-ball Guided Masking: Structure-aware Data Augmentation

Shuyin Xia,Fan Chen,Dawei Dai,Meng Yang,Junwei Han,Xinbo Gao,Guoyin Wang

Main category: cs.CV

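TL;DR: Proposes GBGM, a structure-aware masking augmentation method based on Granular-ball Computing that uses a coarse-to-fine hierarchical masking strategy to preserve semantically important regions while improving model robustness and generalization.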

Details Motivation: Existing mask-based data augmentation methods lack structural awareness and tend to discard key semantic information, which degrades model performance. Method: Uses Granular-ball Computing for structural analysis and designs an adaptive coarse-to-fine hierarchical masking strategy (GBGM) that preserves important structural regions while suppressing redundant ones. Result: Experiments on multiple benchmarks show that GBGM markedly improves image classification accuracy and masked image reconstruction, and is compatible with both CNNs and Vision Transformers. Conclusion: GBGM is a simple, model-agnostic, structure-aware augmentation method that offers a new paradigm for data augmentation. Abstract: Deep learning models have achieved remarkable success in computer vision, but they still rely heavily on large-scale labeled data and tend to overfit when data are limited or distributions shift. Data augmentation, particularly mask-based information dropping, can enhance robustness by forcing models to explore complementary cues; however, existing approaches often lack structural awareness and may discard essential semantics. We propose Granular-ball Guided Masking (GBGM), a structure-aware augmentation strategy guided by Granular-ball Computing (GBC). GBGM adaptively preserves semantically rich, structurally important regions while suppressing redundant areas through a coarse-to-fine hierarchical masking process, producing augmentations that are both representative and discriminative. Extensive experiments on multiple benchmarks demonstrate consistent improvements in classification accuracy and masked image reconstruction, confirming the effectiveness and broad applicability of the proposed method. Simple and model-agnostic, it integrates seamlessly into CNNs and Vision Transformers and provides a new paradigm for structure-aware data augmentation.

[58] FluencyVE: Marrying Temporal-Aware Mamba with Bypass Attention for Video Editing

Mingshu Cai,Yixuan Li,Osamu Yoshie,Yuya Ieiri

Main category: cs.CV

TL;DR: Proposes FluencyVE, a one-shot video editing method that replaces temporal attention with a Mamba module to achieve efficient, temporally consistent video editing.

Details Motivation: Existing video editing methods built on pretrained text-to-image models suffer from temporal inconsistency and high computational overhead. Method: Integrates Mamba, a linear time-series module, into a Stable Diffusion-based video editing model in place of the temporal attention layer, and uses low-rank approximation matrices and a weighted averaging technique to streamline the attention computation. Result: Performs well on attribute, subject, and location editing in real-world videos, substantially reducing computational cost while preserving generation quality. Conclusion: FluencyVE is a simple yet effective method that achieves efficient video editing while retaining generative capability. Abstract: Large-scale text-to-image diffusion models have achieved unprecedented success in image generation and editing. However, extending this success to video editing remains challenging. Recent video editing efforts have adapted pretrained text-to-image models by adding temporal attention mechanisms to handle video tasks. Unfortunately, these methods continue to suffer from temporal inconsistency issues and high computational overheads. In this study, we propose FluencyVE, which is a simple yet effective one-shot video editing approach. FluencyVE integrates the linear time-series module, Mamba, into a video editing model based on pretrained Stable Diffusion models, replacing the temporal attention layer. This enables global frame-level attention while reducing the computational costs. In addition, we employ low-rank approximation matrices to replace the query and key weight matrices in the causal attention, and use a weighted averaging technique during training to update the attention scores. This approach significantly preserves the generative power of the text-to-image model while effectively reducing the computational burden. Experiments and analyses demonstrate promising results in editing various attributes, subjects, and locations in real-world videos.

[59] Efficient and Robust Video Defense Framework against 3D-field Personalized Talking Face

Rui-qing Sun,Xingshan Yao,Tian Lan,Hui-Yang Zhao,Jia-Ling Shi,Chen-Hao Cui,Zhijing Wu,Chen Yang,Xian-Ling Mao

Main category: cs.CV

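TL;DR: Proposes an efficient defense framework against 3D-field video-referenced talking face generation that protects portrait videos by perturbing the 3D information acquisition process while keeping the video output at high quality.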

Details Motivation: Existing image-based defenses are computationally expensive, degrade video quality, and fail to disrupt 3D information, and no effective defense framework targets 3D-field TFG methods. Method: Proposes a similarity-guided parameter sharing mechanism and a multi-scale dual-domain attention module to improve computational efficiency and jointly optimize spatial-frequency perturbations. Result: Experiments show that the framework has strong defense capability, runs 47x faster than the fastest baseline, and is robust to scaling operations and advanced purification attacks. Conclusion: The proposed method efficiently defends against misuse of 3D-field TFG techniques while maintaining high fidelity, providing an effective privacy protection solution for personal portrait videos. Abstract: State-of-the-art 3D-field video-referenced Talking Face Generation (TFG) methods synthesize high-fidelity personalized talking-face videos in real time by modeling 3D geometry and appearance from reference portrait video. This capability raises significant privacy concerns regarding malicious misuse of personal portraits. However, no efficient defense framework exists to protect such videos against 3D-field TFG methods. While image-based defenses could apply per-frame 2D perturbations, they incur prohibitive computational costs and severe video quality degradation, and fail to disrupt 3D information for video protection. To address this, we propose a novel and efficient video defense framework against 3D-field TFG methods, which protects portrait video by perturbing the 3D information acquisition process while maintaining high-fidelity video quality. Specifically, our method introduces: (1) a similarity-guided parameter sharing mechanism for computational efficiency, and (2) a multi-scale dual-domain attention module to jointly optimize spatial-frequency perturbations. Extensive experiments demonstrate that our proposed framework exhibits strong defense capability and achieves a 47x acceleration over the fastest baseline while maintaining high fidelity. Moreover, it remains robust against scaling operations and state-of-the-art purification attacks, and the effectiveness of our design choices is further validated through ablation studies. Our project is available at https://github.com/Richen7418/VDF.

[60] Multi-Attribute guided Thermal Face Image Translation based on Latent Diffusion Model

Mingshu Cai,Osamu Yoshie,Yuya Ieiri

Main category: cs.CV

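TL;DR: This paper proposes a new latent diffusion-based method for generating high-quality visible-light face images from thermal images; combining a multi-attribute classifier with a Self-attn Mamba module, it achieves state-of-the-art performance in cross-modal face recognition.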

Details Motivation: Because of the significant domain shift between infrared and visible-light images, existing face recognition models degrade severely on infrared inputs, and traditional generative methods tend to cause feature loss and distortion. Method: Proposes a latent diffusion-based generative model that incorporates a multi-attribute classifier to preserve key identity features and a Self-attn Mamba module to strengthen cross-modal feature modeling and speed up inference. Result: Experiments on two benchmark datasets show that the method outperforms existing methods in both image quality and identity preservation, reaching state-of-the-art performance. Conclusion: The proposed method effectively mitigates feature loss and modality discrepancy in infrared-to-visible face image translation and markedly improves heterogeneous face recognition. Abstract: Modern surveillance systems increasingly rely on multi-wavelength sensors and deep neural networks to recognize faces in infrared images captured at night. However, most facial recognition models are trained on visible light datasets, leading to substantial performance degradation on infrared inputs due to significant domain shifts. Early feature-based methods for infrared face recognition proved ineffective, prompting researchers to adopt generative approaches that convert infrared images into visible light images for improved recognition. This paradigm, known as Heterogeneous Face Recognition (HFR), faces challenges such as model and modality discrepancies, leading to distortion and feature loss in generated images. To address these limitations, this paper introduces a novel latent diffusion-based model designed to generate high-quality visible face images from thermal inputs while preserving critical identity features. A multi-attribute classifier is incorporated to extract key facial attributes from visible images, mitigating feature loss during infrared-to-visible image restoration. Additionally, we propose the Self-attn Mamba module, which enhances global modeling of cross-modal features and significantly improves inference speed. Experimental results on two benchmark datasets demonstrate the superiority of our approach, achieving state-of-the-art performance in both image quality and identity preservation.

[61] Next-Scale Prediction: A Self-Supervised Approach for Real-World Image Denoising

Yiwen Shan,Haiyu Zhao,Peng Hu,Xi Peng,Yuanbiao Gou

Main category: cs.CV

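TL;DR: This paper proposes Next-Scale Prediction (NSP), a new self-supervised image denoising paradigm that builds cross-scale training pairs to decouple noise decorrelation from detail preservation, markedly improving real-world denoising performance and naturally supporting super-resolution of noisy images.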

Details Motivation: Existing blind-spot network methods face a conflict between decorrelating noise and preserving high-frequency details, and struggle to remove structured noise while keeping image details intact. Method: Proposes Next-Scale Prediction (NSP), which feeds low-resolution, fully decorrelated sub-images to a blind-spot network and trains it to predict high-resolution images that retain fine details, thereby constructing cross-scale training pairs. Result: NSP achieves state-of-the-art performance on multiple real-world denoising benchmarks, effectively alleviates the long-standing conflict between noise decorrelation and detail preservation, and naturally supports super-resolution of noisy images without retraining. Conclusion: NSP provides an effective new framework for self-supervised image denoising, achieving stronger noise removal while preserving details, with good application potential. Abstract: Self-supervised real-world image denoising remains a fundamental challenge, arising from the antagonistic trade-off between decorrelating spatially structured noise and preserving high-frequency details. Existing blind-spot network (BSN) methods rely on pixel-shuffle downsampling (PD) to decorrelate noise, but aggressive downsampling fragments fine structures, while milder downsampling fails to remove correlated noise. To address this, we introduce Next-Scale Prediction (NSP), a novel self-supervised paradigm that decouples noise decorrelation from detail preservation. NSP constructs cross-scale training pairs, where BSN takes low-resolution, fully decorrelated sub-images as input to predict high-resolution targets that retain fine details. As a by-product, NSP naturally supports super-resolution of noisy images without retraining or modification. Extensive experiments demonstrate that NSP achieves state-of-the-art self-supervised denoising performance on real-world benchmarks, significantly alleviating the long-standing conflict between noise decorrelation and detail preservation.
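
The pixel-shuffle downsampling (PD) step that the abstract refers to can be sketched as follows. This is a minimal NumPy illustration of the generic PD operation, not the authors' code; the function name `pixel_shuffle_downsample` and the single-channel `(H, W)` input are assumptions made for the example.

```python
import numpy as np

def pixel_shuffle_downsample(image, stride):
    """Split an (H, W) image into stride**2 sub-images of shape (H/stride, W/stride).

    Sub-image (i, j) keeps the pixels at rows i, i+stride, ... and columns
    j, j+stride, ..., so neighbours inside a sub-image were `stride` pixels
    apart in the original image.
    """
    h, w = image.shape
    if h % stride or w % stride:
        raise ValueError("image height and width must be divisible by stride")
    return np.stack([image[i::stride, j::stride]
                     for i in range(stride) for j in range(stride)])

# Hypothetical 4x4 input, split with stride 2 into four 2x2 sub-images.
image = np.arange(16).reshape(4, 4)
subs = pixel_shuffle_downsample(image, 2)
```

A larger stride pushes neighbouring samples further apart, which weakens spatial noise correlation but also breaks up fine structures; this is the trade-off the abstract describes.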

[62] A Large-Depth-Range Layer-Based Hologram Dataset for Machine Learning-Based 3D Computer-Generated Holography

Jaehong Lee,You Chan No,YoungWoo Kim,Duksu Kim

Main category: cs.CV

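TL;DR: This paper presents KOREATECH-CGH, a public large-scale hologram dataset containing 6,000 pairs of RGB-D images and complex holograms, and introduces an amplitude projection post-processing technique that improves reconstruction quality at large depth ranges; the dataset's usefulness is validated on machine-learning hologram generation and super-resolution tasks.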

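Details Motivation: The lack of high-quality, large-scale hologram datasets has held back machine learning-based computer-generated holography (ML-CGH), so a public, diverse, high-resolution dataset is needed to advance the field. Method: Builds a dataset of 6,000 pairs of RGB-D images and corresponding complex holograms, with resolutions from 256×256 to 2048×2048 and depth ranges reaching the theoretical limits of the angular spectrum method; proposes amplitude projection, which replaces the amplitude components of the hologram wavefield at each depth layer while preserving phase, to improve reconstruction quality; and validates the dataset experimentally with state-of-the-art ML models on hologram generation and super-resolution. Result: The proposed amplitude projection achieves 27.01 dB PSNR and 0.87 SSIM at large depth ranges, surpassing an existing masking method by 2.03 dB and 0.04 SSIM; models trained on KOREATECH-CGH perform well, showing that the dataset is suitable for developing and evaluating next-generation ML-CGH systems. Conclusion: KOREATECH-CGH is a high-quality, multi-scale public hologram dataset that, combined with amplitude projection, effectively improves reconstruction accuracy and provides an important resource for future machine learning-based holographic display research. Abstract: Machine learning-based computer-generated holography (ML-CGH) has advanced rapidly in recent years, yet progress is constrained by the limited availability of high-quality, large-scale hologram datasets. To address this, we present KOREATECH-CGH, a publicly available dataset comprising 6,000 pairs of RGB-D images and complex holograms across resolutions ranging from 256×256 to 2048×2048, with depth ranges extending to the theoretical limits of the angular spectrum method for wide 3D scene coverage. To improve hologram quality at large depth ranges, we introduce amplitude projection, a post-processing technique that replaces amplitude components of hologram wavefields at each depth layer while preserving phase. This approach enhances reconstruction fidelity, achieving 27.01 dB PSNR and 0.87 SSIM, surpassing a recent optimized silhouette-masking layer-based method by 2.03 dB and 0.04 SSIM, respectively. We further validate the utility of KOREATECH-CGH through experiments on hologram generation and super-resolution using state-of-the-art ML models, confirming its applicability for training and evaluating next-generation ML-CGH systems.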
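
The core operation behind amplitude projection, replacing a wavefield's amplitude while keeping its phase, can be sketched for a single depth layer as below. This is an illustrative NumPy sketch under stated assumptions, not the dataset authors' pipeline: the function name `project_amplitude`, the random test field, and the target amplitude map are all hypothetical, and propagation between layers (for example with the angular spectrum method) is omitted.

```python
import numpy as np

def project_amplitude(field, target_amplitude):
    """Replace the amplitude of a complex field while keeping its phase."""
    phase = np.angle(field)
    return target_amplitude * np.exp(1j * phase)

# Hypothetical complex wavefield at one depth layer and a positive target amplitude.
rng = np.random.default_rng(0)
field = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
target = rng.uniform(0.1, 1.0, size=(4, 4))
projected = project_amplitude(field, target)
```

After the call, `np.abs(projected)` matches `target` and `np.angle(projected)` matches the phase of the input field, which is the "replace amplitude, preserve phase" behaviour the abstract describes.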

[63] Matrix Completion Via Reweighted Logarithmic Norm Minimization

Zhijie Wang,Liangtian He,Qinghua Zhang,Jifei Miao,Liang-Jian Deng,Jun Liu

Main category: cs.CV

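TL;DR: Proposes a new reweighted logarithmic norm as a more effective nonconvex surrogate for low-rank matrix completion, optimized with ADMM, and achieves better performance than existing methods on image inpainting.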

Details Motivation: The nuclear norm, as a convex relaxation of the rank function, shrinks singular values excessively and yields suboptimal solutions, so a more accurate nonconvex surrogate is needed. Method: Proposes a new reweighted logarithmic norm as a nonconvex surrogate for the rank and solves the resulting optimization problem efficiently with the alternating direction method of multipliers (ADMM). Result: In image inpainting experiments, the method outperforms existing low-rank matrix completion methods in both visual quality and quantitative metrics. Conclusion: The proposed reweighted logarithmic norm approximates low-rank structure more accurately and markedly improves matrix completion results. Abstract: Low-rank matrix completion (LRMC) has demonstrated remarkable success in a wide range of applications. To address the NP-hard nature of the rank minimization problem, the nuclear norm is commonly used as a convex and computationally tractable surrogate for the rank function. However, this approach often yields suboptimal solutions due to the excessive shrinkage of singular values. In this letter, we propose a novel reweighted logarithmic norm as a more effective nonconvex surrogate, which provides a closer approximation than many existing alternatives. We efficiently solve the resulting optimization problem by employing the alternating direction method of multipliers (ADMM). Experimental results on image inpainting demonstrate that the proposed method achieves superior performance compared to state-of-the-art LRMC approaches, both in terms of visual quality and quantitative metrics.
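
The "excessive shrinkage" issue mentioned in the abstract can be illustrated with a small NumPy sketch. It contrasts the standard nuclear-norm proximal step (uniform soft-thresholding of singular values) with one reweighted step for a generic logarithmic surrogate, sum of log(s_i + eps). The exact reweighted logarithmic norm of the paper is not specified in the abstract, so `reweighted_log_shrink`, `eps`, `tau`, and the test matrix are assumptions for illustration only.

```python
import numpy as np

def soft_threshold_singular_values(M, tau):
    """Proximal step for the nuclear norm: every singular value shrinks by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def reweighted_log_shrink(M, tau, eps=1e-3):
    """One reweighted step for a generic log surrogate sum(log(s_i + eps)).

    Linearizing log(s + eps) at the current singular values gives the
    per-value weight 1 / (s_i + eps), so large singular values shrink less.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    weights = 1.0 / (s + eps)
    return U @ np.diag(np.maximum(s - tau * weights, 0.0)) @ Vt

# Hypothetical rank-3 matrix with a little noise.
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 20))
noisy = low_rank + 0.01 * rng.standard_normal((20, 20))

s_in = np.linalg.svd(noisy, compute_uv=False)
s_nuc = np.linalg.svd(soft_threshold_singular_values(noisy, 0.5), compute_uv=False)
s_log = np.linalg.svd(reweighted_log_shrink(noisy, 0.5), compute_uv=False)
```

Comparing `s_in`, `s_nuc`, and `s_log` shows the design difference: soft-thresholding subtracts the same `tau` from every singular value, including the dominant ones, whereas the reweighted step barely touches large singular values and still suppresses the small, noise-level ones.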

[64] Optical Flow-Guided 6DoF Object Pose Tracking with an Event Camera

Zibin Liu,Banglei Guan,Yang Shang,Shunkun Liang,Zhenbao Yu,Qifeng Yu

Main category: cs.CV

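TL;DR: Proposes an optical flow-guided 6DoF object pose tracking method for event cameras that optimizes pose through 2D-3D hybrid feature extraction and optical-flow-based association, outperforming existing methods in accuracy and robustness.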

Details Motivation: Traditional cameras struggle to track pose stably under motion blur, noise, occlusion, and changing lighting; event cameras are promising but need effective algorithms. Method: A 2D-3D hybrid feature extraction strategy detects corners and edges in the event stream. The optical flow of corners is found by maximizing the event-associated probability within a spatio-temporal window, the flow then guides the association between corners and edges, and the 6DoF pose is optimized iteratively by minimizing corner-to-edge distances. Result: The method is validated on simulated and real event data, where its accuracy and robustness exceed those of existing event-based methods. Conclusion: The method exploits the advantages of event cameras to achieve accurate, robust, continuous 6DoF object pose tracking. Abstract: Object pose tracking is one of the pivotal technologies in multimedia, attracting ever-growing attention in recent years. Existing methods employing traditional cameras encounter numerous challenges such as motion blur, sensor noise, partial occlusion, and changing lighting conditions. The emerging bio-inspired sensors, particularly event cameras, possess advantages such as high dynamic range and low latency, which hold the potential to address the aforementioned challenges. In this work, we present an optical flow-guided 6DoF object pose tracking method with an event camera. A 2D-3D hybrid feature extraction strategy is first utilized to detect corners and edges from events and object models, which characterizes object motion precisely. Then, we search for the optical flow of corners by maximizing the event-associated probability within a spatio-temporal window, and establish the correlation between corners and edges guided by optical flow. Furthermore, by minimizing the distances between corners and edges, the 6DoF object pose is iteratively optimized to achieve continuous pose tracking. Experimental results on both simulated and real events demonstrate that our method outperforms event-based state-of-the-art methods in terms of both accuracy and robustness.

[65] DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors

Kaustubh Kundu,Hrishav Bakul Barua,Lucy Robertson-Bell,Zhixi Cai,Kalin Stefanov

Main category: cs.CV

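TL;DR: This paper proposes DexAvatar, a new framework that reconstructs biomechanically accurate fine-grained hand articulations and body movements from monocular sign language videos. Guided by learned 3D hand and body priors, it improves pose estimation by 35.11% over the state of the art on an existing dataset.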

Details Motivation: Most existing sign language datasets are video-based and lack accurate 3D pose information, and current state-of-the-art 3D pose estimation methods are prone to self-occlusion, noise, and motion blur on sign language videos, which leads to poor reconstruction quality. A framework that recovers high-quality 3D sign language poses from in-the-wild videos is therefore needed. Method: Proposes DexAvatar, which uses learned 3D hand and body priors to reconstruct fine-grained hand articulations and body movements from in-the-wild monocular sign language videos, combining biomechanical constraints with data-driven priors to improve reconstruction accuracy. Result: On the SGNify motion capture dataset, the method improves body and hand pose estimation by 35.11% over the state of the art. Conclusion: DexAvatar overcomes the challenges of 3D pose reconstruction from existing sign language videos, yields more accurate and detailed motion capture results, and advances sign language generation and understanding. Abstract: The trend in sign language generation is centered around data-driven generative methods that require vast amounts of precise 2D and 3D human pose data to achieve an acceptable generation quality. However, currently, most sign language datasets are video-based and limited to automatically reconstructed 2D human poses (i.e., keypoints) and lack accurate 3D information. Furthermore, the existing state of the art for automatic 3D human pose estimation from sign language videos is prone to self-occlusion, noise, and motion blur effects, resulting in poor reconstruction quality. In response to this, we introduce DexAvatar, a novel framework to reconstruct bio-mechanically accurate fine-grained hand articulations and body movements from in-the-wild monocular sign language videos, guided by learned 3D hand and body priors. DexAvatar achieves strong performance on the SGNify motion capture dataset, the only benchmark available for this task, reaching an improvement of 35.11% in the estimation of body and hand poses compared to the state of the art. The official website of this work is: https://github.com/kaustesseract/DexAvatar.

[66] Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control

Minghao Han,YiChen Liu,Yizhou Liu,Zizhi Chen,Jingqun Tang,Xuecheng Wu,Dingkang Yang,Lihua Zhang

Main category: cs.CV

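TL;DR: UniPath is a semantics-driven pathology image generation framework that achieves fine-grained, controllable generation through multi-stream control. It addresses data scarcity, insufficient semantic control, and terminological heterogeneity, and reaches SOTA performance.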

Details Motivation: Existing pathology image generation models lack fine-grained semantic control and rely on non-semantic cues, and progress is held back by the scarcity of high-quality image-text data and by varied terminology. Method: Proposes UniPath with multi-stream control: a raw-text stream, a high-level semantics stream (which uses a frozen pathology MLLM to extract Diagnostic Semantic Tokens and expand prompts), and a prototype stream (which provides morphology-level control via a prototype bank). The authors also build a large-scale corpus of 2.65M image-text pairs and a finely annotated 68K subset. Result: UniPath achieves SOTA results, with a Patho-FID of 80.9 (51% better than the second-best), and its fine-grained semantic control reaches 98.7% of the real-image level. Conclusion: UniPath achieves controllable image generation grounded in mature diagnostic understanding, markedly improves generation quality and semantic consistency, and promotes the joint development of understanding and generation; all resources will be open-sourced. Abstract: In computational pathology, understanding and generation have evolved along disparate paths: advanced understanding models already exhibit diagnostic-level competence, whereas generative models largely simulate pixels. Progress remains hindered by three coupled factors: the scarcity of large, high-quality image-text corpora; the lack of precise, fine-grained semantic control, which forces reliance on non-semantic cues; and terminological heterogeneity, where diverse phrasings for the same diagnostic concept impede reliable text conditioning. We introduce UniPath, a semantics-driven pathology image generation framework that leverages mature diagnostic understanding to enable controllable generation. UniPath implements Multi-Stream Control: a Raw-Text stream; a High-Level Semantics stream that uses learnable queries to a frozen pathology MLLM to distill paraphrase-robust Diagnostic Semantic Tokens and to expand prompts into diagnosis-aware attribute bundles; and a Prototype stream that affords component-level morphological control via a prototype bank. On the data front, we curate a 2.65M image-text corpus and a finely annotated, high-quality 68K subset to alleviate data scarcity. For a comprehensive assessment, we establish a four-tier evaluation hierarchy tailored to pathology. Extensive experiments demonstrate UniPath's SOTA performance, including a Patho-FID of 80.9 (51% better than the second-best) and fine-grained semantic control achieving 98.7% of the real-image. The meticulously curated datasets, complete source code, and pre-trained model weights developed in this study will be made openly accessible to the public.

[67] Multimodal Skeleton-Based Action Representation Learning via Decomposition and Composition

Hongsong Wang,Heng Fei,Bingxuan Dai,Jie Gui

Main category: cs.CV

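TL;DR: Proposes Decomposition and Composition, a self-supervised multimodal skeleton-based action representation learning framework that effectively balances computational cost and model performance.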

Details Motivation: Existing methods for multimodal human action understanding struggle to balance efficiency and effectiveness: late fusion is computationally expensive, while early fusion underperforms. Method: A Decomposition strategy splits the fused multimodal features into unimodal features and aligns them with the ground-truth unimodal features; a Composition strategy integrates multiple unimodal features and uses them as a self-supervised signal to strengthen multimodal representation learning. Result: Experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II show that the method achieves a good balance between computational cost and performance. Conclusion: The framework exploits multimodal complementarity while remaining efficient, offering a new solution for multimodal action recognition. Abstract: Multimodal human action understanding is a significant problem in computer vision, with the central challenge being the effective utilization of the complementarity among diverse modalities while maintaining model efficiency. However, most existing methods rely on simple late fusion to enhance performance, which results in substantial computational overhead. Although early fusion with a shared backbone for all modalities is efficient, it struggles to achieve excellent performance. To address the dilemma of balancing efficiency and effectiveness, we introduce a self-supervised multimodal skeleton-based action representation learning framework, named Decomposition and Composition. The Decomposition strategy meticulously decomposes the fused multimodal features into distinct unimodal features, subsequently aligning them with their respective ground truth unimodal counterparts. On the other hand, the Composition strategy integrates multiple unimodal features, leveraging them as self-supervised guidance to enhance the learning of multimodal representations. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets demonstrate that the proposed method strikes an excellent balance between computational cost and model performance.

[68] UniPR-3D: Towards Universal Visual Place Recognition with Visual Geometry Grounded Transformer

Tianchen Deng,Xun Chen,Ziming Li,Hongming Shen,Danwei Wang,Javier Civera,Hesheng Wang

Main category: cs.CV

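TL;DR: This paper proposes UniPR-3D, the first visual place recognition (VPR) architecture that effectively integrates multi-view information. Built on a VGGT backbone with dedicated 2D and 3D feature aggregation modules, it significantly improves cross-environment generalization and sets a new SOTA in both single-frame and multi-frame settings.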

Details Motivation: Traditional VPR is mostly based on single-image retrieval. Multi-view input has advantages but is underexplored, and existing methods generalize poorly across environments, so a VPR framework is needed that integrates multi-view information effectively and generalizes well. Method: Proposes UniPR-3D, which encodes multi-view 3D representations with a VGGT backbone, designs dedicated 2D and 3D feature aggregation modules, jointly uses 3D tokens and intermediate 2D tokens, and adds single- and multi-frame aggregation schemes and a variable-length sequence retrieval strategy to improve generalization. Result: Experiments show that UniPR-3D outperforms existing single-view and multi-view baselines on multiple benchmarks, confirming the effectiveness of geometry-grounded tokens for VPR and setting a new state of the art. Conclusion: UniPR-3D is the first architecture to successfully integrate multi-view 3D information for VPR. By fusing geometry-aware 3D features with texture-rich 2D features, it significantly improves recognition performance and cross-environment generalization and opens a new direction for multi-view VPR research. Abstract: Visual Place Recognition (VPR) has been traditionally formulated as a single-image retrieval task. Using multiple views offers clear advantages, yet this setting remains relatively underexplored and existing methods often struggle to generalize across diverse environments. In this work we introduce UniPR-3D, the first VPR architecture that effectively integrates information from multiple views. UniPR-3D builds on a VGGT backbone capable of encoding multi-view 3D representations, which we adapt by designing feature aggregators and fine-tune for the place recognition task. To construct our descriptor, we jointly leverage the 3D tokens and intermediate 2D tokens produced by VGGT. Based on their distinct characteristics, we design dedicated aggregation modules for 2D and 3D features, allowing our descriptor to capture fine-grained texture cues while also reasoning across viewpoints. To further enhance generalization, we incorporate both single- and multi-frame aggregation schemes, along with a variable-length sequence retrieval strategy. Our experiments show that UniPR-3D sets a new state of the art, outperforming both single- and multi-view baselines and highlighting the effectiveness of geometry-grounded tokens for VPR. Our code and models will be made publicly available on GitHub at https://github.com/dtc111111/UniPR-3D.

[69] Hierarchical Modeling Approach to Fast and Accurate Table Recognition

Takaya Kawakatsu

Main category: cs.CV

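TL;DR: This paper proposes a multi-task model that uses non-causal attention and a parallel inference algorithm to recognize table structure and content more efficiently, and shows superior performance on public datasets.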

Details Motivation: Existing table recognition models work well, but their inference is slow and their effectiveness has not been fully explained. Method: Proposes a new multi-task model that uses non-causal attention to capture the overall table structure, together with a parallel inference algorithm that speeds up cell content recognition. Result: On two large public datasets, the new model is shown to be superior both visually and statistically. Conclusion: The method significantly improves inference efficiency while maintaining high accuracy, offering a better solution for recognizing tables in documents. Abstract: The extraction and use of diverse knowledge from numerous documents is a pressing challenge in intelligent information retrieval. Documents contain elements that require different recognition methods. Table recognition typically consists of three subtasks, namely table structure, cell position and cell content recognition. Recent models have achieved excellent recognition with a combination of multi-task learning, local attention, and mutual learning. However, their effectiveness has not been fully explained, and they require a long period of time for inference. This paper presents a novel multi-task model that utilizes non-causal attention to capture the entire table structure, and a parallel inference algorithm for faster cell content inference. The superiority is demonstrated both visually and statistically on two large public datasets.

[70] T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

Zhe Cao,Tao Wang,Jiaming Wang,Yanghai Wang,Yuanxing Zhang,Jialu Chen,Miao Deng,Jiahao Wang,Yubin Guo,Chenxi Liao,Yize Zhang,Zhaoxiang Zhang,Jiaheng Liu

Main category: cs.CV

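TL;DR: This paper presents T2AV-Compass, a unified benchmark for comprehensive evaluation of text-to-audio-video (T2AV) generation systems. It comprises 500 diverse, complex prompts and a dual-level evaluation framework that combines objective metrics with MLLM-based subjective judgment, and it reveals that existing models still fall well short in cross-modal consistency, audio-visual realism, and instruction following.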

Details Motivation: Evaluation of T2AV generation systems is fragmented and relies on unimodal or narrowly scoped criteria that cannot measure cross-modal alignment, instruction following, or perceptual realism under complex semantics, so a more comprehensive and more challenging unified benchmark is needed. Method: Proposes T2AV-Compass, which builds 500 semantically rich and physically plausible complex prompts through a taxonomy-driven pipeline and defines a dual-level evaluation framework that combines objective signal-level metrics (video quality, audio quality, cross-modal alignment) with a subjective MLLM-as-a-Judge assessment (instruction following and realism). Result: An extensive evaluation of 11 representative T2AV systems shows that even the strongest models remain far below human level in audio realism, fine-grained synchronization, and instruction following, exposing major shortcomings in current methods. Conclusion: T2AV-Compass is a challenging and diagnostic testbed that exposes the limitations of existing T2AV systems and gives future research clear directions for improvement. Abstract: Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in audio realism, fine-grained synchronization, instruction following, etc. These results indicate significant room for improvement in future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.

[71] UniRec-0.1B: Unified Text and Formula Recognition with 0.1B Parameters

Yongkun Du,Zhineng Chen,Yazhen Xie,Weikang Bai,Hao Feng,Wei Shi,Yuchen Su,Can Huang,Yu-Gang Jiang

Main category: cs.CV

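TL;DR: This paper proposes UniRec-0.1B, a lightweight unified recognition model with only 0.1B parameters that recognizes text and formulas efficiently and accurately and performs well across multiple levels of document structure.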

Details Motivation: Existing vision-language models can recognize text and formulas in a unified way, but they are large and computationally expensive, which limits their use. A lighter, more efficient unified recognition model is therefore needed. Method: The authors build UniRec40M, a large-scale dataset of 40 million samples; introduce hierarchical supervision training to handle structural variability across hierarchies; and design a semantic-decoupled tokenizer to separate text and formula representations. Result: On multiple cross-domain Chinese and English benchmarks, UniRec-0.1B outperforms general-purpose VLMs and leading document parsing expert models while achieving a 2-9× speedup. Conclusion: UniRec-0.1B achieves efficient, accurate unified recognition of text and formulas with a very small parameter count, making it practical and widely applicable. Abstract: Text and formulas constitute the core informational components of many documents. Accurately and efficiently recognizing both is crucial for developing robust and generalizable document parsing systems. Recently, vision-language models (VLMs) have achieved impressive unified recognition of text and formulas. However, they are large-sized and computationally demanding, restricting their usage in many applications. In this paper, we propose UniRec-0.1B, a unified recognition model with only 0.1B parameters. It is capable of performing text and formula recognition at multiple levels, including characters, words, lines, paragraphs, and documents. To implement this task, we first establish UniRec40M, a large-scale dataset comprising 40 million text, formula, and mixed samples, enabling the training of a powerful yet lightweight model. Secondly, we identify two challenges when building such a lightweight but unified expert model: structural variability across hierarchies and semantic entanglement between textual and formulaic content. To tackle these, we introduce a hierarchical supervision training that explicitly guides structural comprehension, and a semantic-decoupled tokenizer that separates text and formula representations. Finally, we develop a comprehensive evaluation benchmark covering Chinese and English documents from multiple domains and with multiple levels. Experimental results on this and public benchmarks demonstrate that UniRec-0.1B outperforms both general-purpose VLMs and leading document parsing expert models, while achieving a 2-9× speedup, validating its effectiveness and efficiency.
Codebase and Dataset: https://github.com/Topdu/OpenOCR.

[72] FreeInpaint: Tuning-free Prompt Alignment and Visual Rationality Enhancement in Image Inpainting

Chao Gong,Dong Li,Yingwei Pan,Jingjing Chen,Ting Yao,Tao Mei

Main category: cs.CV

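TL;DR: This paper proposes FreeInpaint, a tuning-free, plug-and-play image inpainting method that directly optimizes the diffusion latents during inference to improve both the generated image's alignment with the text prompt and its visual rationality.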

Details Motivation: Existing inpainting methods built on pre-trained text-to-image diffusion models struggle to guarantee prompt alignment and visual rationality at the same time, so a more flexible and efficient way to improve inpainting quality is needed. Method: Proposes FreeInpaint, which uses prior-guided noise optimization to optimize the initial noise during inference and a composite guidance objective tailored to inpainting that optimizes the intermediate latents at each step to strengthen prompt alignment and visual rationality. Result: Extensive experiments across various inpainting diffusion models and evaluation metrics show that FreeInpaint is effective and robust at improving prompt alignment and visual rationality. Conclusion: FreeInpaint is an effective tuning-free inpainting framework that improves diffusion models on text-guided image inpainting without any training. Abstract: Text-guided image inpainting endeavors to generate new content within specified regions of images using textual prompts from users. The primary challenge is to accurately align the inpainted areas with the user-provided prompts while maintaining a high degree of visual fidelity. While existing inpainting methods have produced visually convincing results by leveraging the pre-trained text-to-image diffusion models, they still struggle to uphold both prompt alignment and visual rationality simultaneously. In this work, we introduce FreeInpaint, a plug-and-play tuning-free approach that directly optimizes the diffusion latents on the fly during inference to improve the faithfulness of the generated images. Technically, we introduce a prior-guided noise optimization method that steers model attention towards valid inpainting regions by optimizing the initial noise. Furthermore, we meticulously design a composite guidance objective tailored specifically for the inpainting task. This objective efficiently directs the denoising process, enhancing prompt alignment and visual rationality by optimizing intermediate latents at each step. Through extensive experiments involving various inpainting diffusion models and evaluation metrics, we demonstrate the effectiveness and robustness of our proposed FreeInpaint.

[73] MarineEval: Assessing the Marine Intelligence of Vision-Language Models

Yuk-Kwan Wong,Tuan-An To,Jipeng Zhang,Ziqiang Zheng,Sai-Kit Yeung

Main category: cs.CV

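TL;DR: This paper presents MarineEval, the first large-scale marine-domain vision-language model (VLM) dataset and benchmark, with 2,000 image-based question-answering pairs for evaluating existing VLMs on marine questions that require domain expertise. Experiments show that current VLMs answer domain-specific questions poorly and leave substantial room for improvement.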

Details Motivation: The paper asks whether existing vision-language models (VLMs) can act as domain experts and accurately answer marine questions that require deep expertise and involve domain-specific challenges. Method: The authors build MarineEval, a large-scale marine VLM dataset and benchmark with 2,000 image-based question-answering pairs, designed for diversity and coverage across 7 task dimensions and 20 capacity dimensions. Domain requirements are integrated into the data construction and verified by marine domain experts. On this basis, 17 existing VLMs are benchmarked. Result: Existing VLMs cannot effectively answer domain-specific questions and remain markedly limited on marine research questions, leaving substantial room for improvement. Conclusion: Existing VLMs fall short on marine questions that require specialist knowledge; the MarineEval benchmark and its findings are intended to support future research. Abstract: We have witnessed promising progress led by large language models (LLMs) and further vision language models (VLMs) in handling various queries as a general-purpose assistant. VLMs, as a bridge to connect the visual world and language corpus, receive both visual content and various text-only user instructions to generate corresponding responses. Though great success has been achieved by VLMs in various fields, in this work, we ask whether the existing VLMs can act as domain experts, accurately answering marine questions, which require significant domain expertise and address special domain challenges/requirements. To comprehensively evaluate the effectiveness and explore the boundary of existing VLMs, we construct the first large-scale marine VLM dataset and benchmark called MarineEval, with 2,000 image-based question-answering pairs. During our dataset construction, we ensure the diversity and coverage of the constructed data: 7 task dimensions and 20 capacity dimensions. The domain requirements are specially integrated into the data construction and further verified by the corresponding marine domain experts. We comprehensively benchmark 17 existing VLMs on our MarineEval and also investigate the limitations of existing models in answering marine research questions. The experimental results reveal that existing VLMs cannot effectively answer the domain-specific questions, and there is still substantial room for further performance improvements. We hope our new benchmark and observations will facilitate future research. Project Page: http://marineeval.hkustvgd.com/

[74] TGC-Net: A Structure-Aware and Semantically-Aligned Framework for Text-Guided Medical Image Segmentation

Gaoren Lin,Huangxuan Zhao,Yuan Xiong,Lefei Zhang,Bo Du,Wentao Zhu

Main category: cs.CV

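TL;DR: Proposes TGC-Net, a CLIP-based, parameter-efficient medical image segmentation framework that achieves state-of-the-art performance on chest X-ray and thoracic CT through semantic-structural synergy encoding, a medical-knowledge-augmented text encoder, and a cross-modal calibration module.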

Details Motivation: Existing methods rely on unaligned image and text encoders, which makes multimodal fusion complex. Applying CLIP directly to medical images suffers from insufficient preservation of anatomical structure, weak modeling of clinical descriptions, and domain-specific semantic misalignment. Method: Proposes TGC-Net, which comprises (1) a Semantic-Structural Synergy Encoder (SSE) that adds a CNN branch to the ViT for multi-scale structural refinement; (2) a Domain-Augmented Text Encoder (DATE) that injects medical knowledge generated by a large language model; and (3) a Vision-Language Calibration Module (VLCM) that refines cross-modal alignment in a unified feature space. Result: Experiments on five chest X-ray and thoracic CT datasets show that TGC-Net reaches state-of-the-art performance with fewer trainable parameters, with notable Dice gains on challenging benchmarks. Conclusion: Through task-specific, parameter-efficient adaptation, TGC-Net overcomes CLIP's limitations in medical image segmentation and achieves efficient, accurate text-guided segmentation. Abstract: Text-guided medical segmentation enhances segmentation accuracy by utilizing clinical reports as auxiliary information. However, existing methods typically rely on unaligned image and text encoders, which necessitate complex interaction modules for multimodal fusion. While CLIP provides a pre-aligned multimodal feature space, its direct application to medical imaging is limited by three main issues: insufficient preservation of fine-grained anatomical structures, inadequate modeling of complex clinical descriptions, and domain-specific semantic misalignment. To tackle these challenges, we propose TGC-Net, a CLIP-based framework focusing on parameter-efficient, task-specific adaptations. Specifically, it incorporates a Semantic-Structural Synergy Encoder (SSE) that augments CLIP's ViT with a CNN branch for multi-scale structural refinement, a Domain-Augmented Text Encoder (DATE) that injects large-language-model-derived medical knowledge, and a Vision-Language Calibration Module (VLCM) that refines cross-modal correspondence in a unified feature space. Experiments on five datasets across chest X-ray and thoracic CT modalities demonstrate that TGC-Net achieves state-of-the-art performance with substantially fewer trainable parameters, including notable Dice gains on challenging benchmarks.
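
The Dice score mentioned in the results is the standard overlap measure for segmentation masks, 2|A∩B| / (|A| + |B|). A minimal sketch on flattened binary masks follows; the function name and the toy masks are illustrative, and the convention for two empty masks is an assumption rather than something the abstract specifies.

```python
def dice_coefficient(pred, target):
    """Dice score between two binary masks given as flat sequences of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Toy masks: two overlapping foreground pixels, three foreground pixels each.
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
score = dice_coefficient(pred, target)  # 2 * 2 / (3 + 3)
```

A "Dice gain" in this sense is simply the difference between two such scores, usually averaged over a test set.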

[75] ORCA: Object Recognition and Comprehension for Archiving Marine Species

Yuk-Kwan Wong,Haixin Liang,Zeyu Ma,Yiwei Chen,Ziqiang Zheng,Rinaldi Gotama,Pascal Sebastian,Lauren D. Sparks,Sai-Kit Yeung

Main category: cs.CV

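TL;DR: ORCA is a multi-modal benchmark for marine research comprising 14,647 images, 478 species, 42,217 bounding box annotations, and 22,321 expert-verified instance captions. It supports object detection, instance captioning, and visual grounding, and aims to advance research on marine visual understanding.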

Details Motivation: Marine visual understanding is limited by scarce training data and by the lack of a task formulation that systematically connects marine-domain challenges with computer vision tasks, which restricts effective model application. Method: Proposes the ORCA multi-modal benchmark dataset, containing a large number of marine images with fine-grained visual and textual annotations, and evaluates 18 state-of-the-art models on three tasks: closed-set and open-vocabulary object detection, instance captioning, and visual grounding. Result: Experiments reveal key challenges such as species diversity, morphological overlap, and specialized domain demands; existing models still struggle on marine understanding tasks. Conclusion: ORCA provides a comprehensive benchmark for marine visual understanding and supports both data-driven and methodological progress in the field. Abstract: Marine visual understanding is essential for monitoring and protecting marine ecosystems, enabling automatic and scalable biological surveys. However, progress is hindered by limited training data and the lack of a systematic task formulation that aligns domain-specific marine challenges with well-defined computer vision tasks, thereby limiting effective model application. To address this gap, we present ORCA, a multi-modal benchmark for marine research comprising 14,647 images from 478 species, with 42,217 bounding box annotations and 22,321 expert-verified instance captions. The dataset provides fine-grained visual and textual annotations that capture morphology-oriented attributes across diverse marine species. To catalyze methodological advances, we evaluate 18 state-of-the-art models on three tasks: object detection (closed-set and open-vocabulary), instance captioning, and visual grounding. Results highlight key challenges, including species diversity, morphological overlap, and specialized domain demands, underscoring the difficulty of marine understanding. ORCA thus establishes a comprehensive benchmark to advance research in the marine domain. Project Page: http://orca.hkustvgd.com/.

[76] A Turn Toward Better Alignment: Few-Shot Generative Adaptation with Equivariant Feature Rotation

Chenghao Xu,Qi Liu,Jiexi Yan,Muli Yang,Cheng Deng

Main category: cs.CV

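TL;DR: Proposes Equivariant Feature Rotation (EFR), a new method that addresses the mismatch between source- and target-domain distribution structures in few-shot image generation by aligning the domains at two levels within a self-rotated proxy feature space, significantly improving generative performance.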

Details Motivation: Existing few-shot image generation methods struggle to align distributions accurately because the source and target domains differ substantially and target samples are scarce, which leads to distorted or uninformative generated content. Method: Learns adaptive rotation matrices within a parameterized Lie group that map source and target features into an equivariant proxy feature space, where instance-level and distribution-level alignment are performed to preserve intra-domain structure and transfer knowledge. Result: Experiments on several commonly used datasets show that the method clearly outperforms existing few-shot image generation methods and improves the quality and diversity of generated images. Conclusion: By constructing an equivariant proxy feature space, EFR resolves the mismatch between domain distribution structures and offers a new approach to adapting few-shot generative models. Abstract: Few-shot image generation aims to effectively adapt a source generative model to a target domain using very few training images. Most existing approaches introduce consistency constraints (typically through instance-level or distribution-level loss functions) to directly align the distribution patterns of source and target domains within their respective latent spaces. However, these strategies often fall short: overly strict constraints can amplify the negative effects of the domain gap, leading to distorted or uninformative content, while overly relaxed constraints may fail to leverage the source domain effectively. This limitation primarily stems from the inherent discrepancy in the underlying distribution structures of the source and target domains. The scarcity of target samples further compounds this issue by hindering accurate estimation of the target domain's distribution. To overcome these limitations, we propose Equivariant Feature Rotation (EFR), a novel adaptation strategy that aligns source and target domains at two complementary levels within a self-rotated proxy feature space. Specifically, we perform adaptive rotations within a parameterized Lie Group to transform both source and target features into an equivariant proxy space, where alignment is conducted. These learnable rotation matrices serve to bridge the domain gap by preserving intra-domain structural information without distortion, while the alignment optimization facilitates effective knowledge transfer from the source to the target domain. Comprehensive experiments on a variety of commonly used datasets demonstrate that our method significantly enhances the generative performance within the targeted domain.

[77] Towards Arbitrary Motion Completing via Hierarchical Continuous Representation

Chenghao Xu,Guangtao Lyu,Qi Liu,Jiexi Yan,Muli Yang,Cheng Deng

Main category: cs.CV

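TL;DR: This paper proposes NAME, a hierarchical implicit representation framework based on implicit neural representations (INRs) that models human motion sequences continuously and supports interpolation, in-betweening, and extrapolation at arbitrary frame rates.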

Details Motivation: Physical motion is inherently continuous, and higher frame rates improve temporal coherence, so a method that represents human motion continuously is needed to overcome the limits of discrete frame rates. Method: Proposes NAME, an INR-based hierarchical implicit representation framework driven by parametric activation functions. It combines a multi-scale temporal encoding mechanism with a Fourier-transform-based parametric activation function to strengthen the model's ability to express complex temporal patterns. Result: Experiments on several benchmark datasets show that the method is effective and robust for continuous representation, interpolation, and extrapolation of motion sequences. Conclusion: The NAME framework models human motion continuously, supports output at arbitrary frame rates, and improves on existing methods in accuracy and expressiveness. Abstract: Physical motions are inherently continuous, and higher camera frame rates typically contribute to improved smoothness and temporal coherence. For the first time, we explore continuous representations of human motion sequences, featuring the ability to interpolate, inbetween, and even extrapolate any input motion sequences at arbitrary frame rates. To achieve this, we propose a novel parametric activation-induced hierarchical implicit representation framework, referred to as NAME, based on Implicit Neural Representations (INRs). Our method introduces a hierarchical temporal encoding mechanism that extracts features from motion sequences at multiple temporal scales, enabling effective capture of intricate temporal patterns. Additionally, we integrate a custom parametric activation function, powered by Fourier transformations, into the MLP-based decoder to enhance the expressiveness of the continuous representation. This parametric formulation significantly augments the model's ability to represent complex motion behaviors with high accuracy. Extensive evaluations across several benchmark datasets demonstrate the effectiveness and robustness of our proposed approach.

[78] UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement

Tanghui Jia,Dongyu Yan,Dehao Hao,Yang Li,Kaiyi Zhang,Xianyi He,Lanjiong Li,Jinnan Chen,Lutao Jiang,Qishen Yin,Long Quan,Ying-Cong Chen,Li Yuan

Main category: cs.CV

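TL;DR: UltraShape 1.0 is a scalable two-stage 3D diffusion framework that achieves high-fidelity 3D geometry generation through a new data processing pipeline and a decoupled diffusion mechanism.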

Details Motivation: Existing 3D generation models fall short in geometric quality and detail and depend on high-quality training data. The paper aims to build an efficient, scalable, open-source 3D generation framework that improves the fidelity and structural integrity of generated results. Method: Uses a two-stage pipeline that first generates a coarse global structure and then refines its geometry. The authors propose a new watertight data processing method and decouple spatial localization from geometric detail synthesis in the diffusion process, using voxel-based queries whose positions are encoded with RoPE so that the model can focus on generating local detail. Result: Trained on public 3D datasets, UltraShape 1.0 shows strong geometric quality and matches or exceeds existing open-source methods in both data processing and generation. Conclusion: UltraShape 1.0 demonstrates the importance of high-quality data processing and structured diffusion modeling for 3D generation, and provides an effective technical path and open-source resources for future research. Abstract: In this report, we introduce UltraShape 1.0, a scalable 3D diffusion framework for high-fidelity 3D geometry generation. The proposed approach adopts a two-stage generation pipeline: a coarse global structure is first synthesized and then refined to produce detailed, high-quality geometry. To support reliable 3D generation, we develop a comprehensive data processing pipeline that includes a novel watertight processing method and high-quality data filtering. This pipeline improves the geometric quality of publicly available 3D datasets by removing low-quality samples, filling holes, and thickening thin structures, while preserving fine-grained geometric details. To enable fine-grained geometry refinement, we decouple spatial localization from geometric detail synthesis in the diffusion process. We achieve this by performing voxel-based refinement at fixed spatial locations, where voxel queries derived from coarse geometry provide explicit positional anchors encoded via RoPE, allowing the diffusion model to focus on synthesizing local geometric details within a reduced, structured solution space. Our model is trained exclusively on publicly available 3D datasets, achieving strong geometric quality despite limited training resources. Extensive evaluations demonstrate that UltraShape 1.0 performs competitively with existing open-source methods in both data processing quality and geometry generation. All code and trained models will be released to support future research.
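
For readers unfamiliar with RoPE (rotary position embedding), the sketch below shows the standard one-dimensional form: consecutive feature pairs are rotated by an angle proportional to the position, so that the dot product of two rotated vectors depends only on their relative offset. How UltraShape extends this to 3D voxel coordinates is not stated in the abstract; the function names and the base of 10000 are illustrative assumptions.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Apply a 1D rotary position embedding to an even-length feature vector."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("feature dimension must be even")
    out = []
    for i in range(0, dim, 2):
        # Pair (i, i+1) rotates with frequency base^(-i/dim).
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

# Toy query/key features (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]
same_offset_a = dot(rope_rotate(q, 3), rope_rotate(k, 5))
same_offset_b = dot(rope_rotate(q, 10), rope_rotate(k, 12))
```

The two dot products agree up to floating-point error because both pairs of positions are two steps apart, which is the property that makes RoPE usable as an explicit positional anchor inside attention.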

[79] VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs

Brigitta Malagurski Törtei,Yasser Dahou,Ngoc Dung Huynh,Wamiq Reyaz Para,Phúc H. Lê Khac,Ankit Singh,Sofian Chaybouti,Sanath Narayan

Main category: cs.CV

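TL;DR: VisRes Bench is a new benchmark for studying visual reasoning in naturalistic settings; it reveals the limitations of current vision-language models in perceptual and relational reasoning.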

Details Motivation: The paper examines how far vision-language models rely on linguistic priors rather than genuine visual reasoning, and evaluates their visual reasoning without language supervision. Method: Proposes VisRes Bench, a benchmark with three levels of complexity: Level 1 tests perceptual completion and image matching, Level 2 tests rule-based inference over a single attribute, and Level 3 tests compositional reasoning over multiple attributes. The evaluation uses more than 19,000 controlled images. Result: State-of-the-art vision-language models perform near random under subtle perceptual perturbations, showing limited abstraction and a reliance on pattern recognition rather than deeper visual reasoning. Conclusion: VisRes Bench provides a unified framework that helps advance abstract visual reasoning in multimodal research. Abstract: Vision-Language Models (VLMs) have achieved remarkable progress across tasks such as visual question answering and image captioning. Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To address this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities. VisRes isolates distinct reasoning abilities across its levels. Level 1 probes perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation; Level 2 tests rule-based inference over a single attribute (e.g., color, count, orientation); and Level 3 targets compositional reasoning that requires integrating multiple visual attributes. Across more than 19,000 controlled task images, we find that state-of-the-art VLMs perform near random under subtle perceptual perturbations, revealing limited abstraction beyond pattern recognition. We conclude by discussing how VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research.

[80] Human Motion Estimation with Everyday Wearables

Siqi Zhu,Yixuan Li,Junfu Li,Qi Wu,Zan Wang,Haozhe Ma,Wei Liang

Main category: cs.CV

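TL;DR: Proposes EveryWear, a lightweight human motion capture method based on everyday wearables (a smartphone, smartwatch, earbuds, and smart glasses) that estimates full-body motion without calibration.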

Details Motivation: Existing wearable-based human motion estimation methods suffer from poor wearability, high hardware cost, and cumbersome calibration, which limits their use in daily life. Method: Uses a multimodal teacher-student framework that fuses visual information from egocentric cameras with inertial signals from consumer devices, and trains the model directly on real-world data to avoid the sim-to-real gap. Result: The method is validated on Ego-Elec, a real-world dataset collected by the authors (56 daily activities, 9 hours of data), where it outperforms baseline models. Conclusion: EveryWear offers a low-cost, easy-to-deploy, calibration-free solution for practical full-body motion capture and supports human-computer interaction applications in everyday environments. Abstract: While on-body device-based human motion estimation is crucial for applications such as XR interaction, existing methods often suffer from poor wearability, expensive hardware, and cumbersome calibration, which hinder their adoption in daily life. To address these challenges, we present EveryWear, a lightweight and practical human motion capture approach based entirely on everyday wearables: a smartphone, smartwatch, earbuds, and smart glasses equipped with one forward-facing and two downward-facing cameras, requiring no explicit calibration before use. We introduce Ego-Elec, a 9-hour real-world dataset covering 56 daily activities across 17 diverse indoor and outdoor environments, with ground-truth 3D annotations provided by a motion capture (MoCap) system, to facilitate robust research and benchmarking in this direction. Our approach employs a multimodal teacher-student framework that integrates visual cues from egocentric cameras with inertial signals from consumer devices. By training directly on real-world data rather than synthetic data, our model effectively eliminates the sim-to-real gap that constrains prior work. Experiments demonstrate that our method outperforms baseline models, validating its effectiveness for practical full-body motion estimation.

[81] Latent Implicit Visual Reasoning

Kelvin Li,Chuyi Shang,Leonid Karlinsky,Rogerio Feris,Trevor Darrell,Roei Herzig

Main category: cs.CV

TL;DR: Proposes a task-agnostic mechanism that lets large multimodal models discover and use visual reasoning tokens on their own, without explicit supervision, achieving state-of-the-art performance on a range of vision-centric tasks.

Details Motivation: Existing multimodal models reason mainly through text, and current methods impose restrictive priors on what useful visual abstractions look like, incur high annotation costs, and generalize poorly. Method: Designs a mechanism that needs no explicit supervision and lets the model learn visual reasoning tokens automatically; these tokens attend globally and re-encode the image, enabling task-adaptive information extraction. Result: The method outperforms direct fine-tuning and reaches state-of-the-art results on multiple vision-centric tasks, particularly those where intermediate abstractions are hard to specify, and it also generalizes to multi-task instruction tuning. Conclusion: The proposed mechanism improves the visual reasoning of LMMs and removes the dependence on hand-annotated intermediate steps, with good generality and scalability. Abstract: While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.

[82] Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval

Dao Sy Duy Minh,Huynh Trung Kiet,Nguyen Lam Phu Quy,Phu-Hoa Pham,Tran Chi Nguyen

Main category: cs.CV

TL;DR: Proposes a lightweight two-stage image retrieval method that uses event-centric entity extraction to improve retrieval from natural language descriptions, substantially outperforming baselines on OpenEvents v1.

Details Motivation: Existing image-text retrieval performs poorly on vague or context-dependent queries and lacks efficient, scalable solutions. Method: A two-stage retrieval pipeline: the first stage filters candidates with BM25 over salient entities, and the second stage uses BEiT-3 models for deep multimodal semantic modeling and reranking; event-centric entity extraction captures temporal and contextual information. Result: Reaches a mean average precision (mAP) of 0.559 on the OpenEvents v1 benchmark, substantially outperforming prior methods. Conclusion: Combining event-guided filtering with long-text vision-language modeling improves the accuracy and efficiency of image retrieval in complex real-world scenarios. Abstract: Retrieving images from natural language descriptions is a core task at the intersection of computer vision and natural language processing, with wide-ranging applications in search engines, media archiving, and digital content management. However, real-world image-text retrieval remains challenging due to vague or context-dependent queries, linguistic variability, and the need for scalable solutions. In this work, we propose a lightweight two-stage retrieval pipeline that leverages event-centric entity extraction to incorporate temporal and contextual signals from real-world captions. The first stage performs efficient candidate filtering using BM25 based on salient entities, while the second stage applies BEiT-3 models to capture deep multimodal semantics and rerank the results. Evaluated on the OpenEvents v1 benchmark, our method achieves a mean average precision of 0.559, substantially outperforming prior baselines. These results highlight the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in complex, real-world scenarios. Our code is available at https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval
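
The filter-then-rerank structure described in this entry can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: `bm25_scores` uses the common Lucene-style BM25 formula, entities are assumed to be pre-extracted token lists, and `rerank_fn` is a hypothetical stand-in for the BEiT-3 reranker.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with BM25 (Lucene-style IDF)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


def two_stage_retrieve(query_entities, caption_entities, rerank_fn, top_k=2):
    """Stage 1: keep the top_k BM25 candidates. Stage 2: reorder them with rerank_fn.

    rerank_fn is a hypothetical callable (candidate index -> score) standing in
    for a multimodal reranker such as the BEiT-3 stage described above.
    """
    scores = bm25_scores(query_entities, caption_entities)
    order = sorted(range(len(caption_entities)), key=lambda i: scores[i], reverse=True)
    candidates = order[:top_k]
    return sorted(candidates, key=rerank_fn, reverse=True)
```

The cheap lexical stage bounds how many items the expensive multimodal model has to score, which is where the scalability of such a pipeline would come from.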

[83] SegMo: Segment-aligned Text to 3D Human Motion Generation

Bowen Dang,Lin Wu,Xiaohang Yang,Zheng Yuan,Zhixiang Chen

Main category: cs.CV

TL;DR: Proposes SegMo, a fine-grained text-to-motion generation framework that decomposes text and motion sequences into semantic segments and aligns them at the segment level, improving generation quality and supporting retrieval-style tasks.

Details Motivation: Existing methods align text descriptions with human motion only at the sequence level and ignore the internal semantic structure of each modality; both text and motion decompose naturally into semantically coherent segments that can serve as units for finer alignment. Method: Proposes the SegMo framework with three modules: text segment extraction (decomposing complex descriptions into ordered phrases), motion segment extraction (partitioning motion into corresponding segments), and fine-grained text-motion alignment (segment-level alignment via contrastive learning). Result: Improves on a strong baseline on two widely used datasets, reaching a TOP 1 score of 0.553 on the HumanML3D test set, and supports tasks such as motion grounding and motion-to-text retrieval. Conclusion: Through segment-level alignment, SegMo achieves finer text-motion correspondence and improves both generation quality and multi-task applicability. Abstract: Generating 3D human motions from textual descriptions is an important research problem with broad applications in video games, virtual reality, and augmented reality. Recent methods align the textual description with human motion at the sequence level, neglecting the internal semantic structure of modalities. However, both motion descriptions and motion sequences can be naturally decomposed into smaller and semantically coherent segments, which can serve as atomic alignment units to achieve finer-grained correspondence. Motivated by this, we propose SegMo, a novel Segment-aligned text-conditioned human Motion generation framework to achieve fine-grained text-motion alignment. Our framework consists of three modules: (1) Text Segment Extraction, which decomposes complex textual descriptions into temporally ordered phrases, each representing a simple atomic action; (2) Motion Segment Extraction, which partitions complete motion sequences into corresponding motion segments; and (3) Fine-grained Text-Motion Alignment, which aligns text and motion segments with contrastive learning. Extensive experiments demonstrate that SegMo improves the strong baseline on two widely used datasets, achieving an improved TOP 1 score of 0.553 on the HumanML3D test set. Moreover, thanks to the learned shared embedding space for text and motion segments, SegMo can also be applied to retrieval-style tasks such as motion grounding and motion-to-text retrieval.
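
To make the segment-level contrastive alignment concrete, here is a minimal sketch of a text-to-motion InfoNCE loss over paired segment embeddings. The abstract only says "contrastive learning"; the InfoNCE form, the cosine similarity, and the `temperature` value are assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(text_embs, motion_embs, temperature=0.1):
    """Mean text-to-motion InfoNCE loss.

    Assumption: text_embs[i] and motion_embs[i] describe the same segment
    (the positive pair); every other motion segment acts as a negative.
    """
    total = 0.0
    for i, text in enumerate(text_embs):
        logits = [cosine(text, motion) / temperature for motion in motion_embs]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(text_embs)
```

The loss is low when each text segment is closest to its own motion segment and high when segments are mismatched, which is the property a shared embedding space for retrieval relies on.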

[84] DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

Jiawei Liu,Junqiao Li,Jiangfan Deng,Gen Li,Siyu Zhou,Zetao Fang,Shanshan Lao,Zengde Deng,Jianing Zhu,Tingting Ma,Jiayi Li,Yunqiu Wang,Qian He,Xinglong Wu

Main category: cs.CV

TL;DR: Proposes DreaMontage, a framework that achieves high-quality, long-duration, frame-guided, seamless one-shot video generation through a modified DiT architecture, visual expression fine-tuning, and segment-wise autoregressive inference.

Details Motivation: Traditional one-shot filming is costly and constrained by real-world conditions, and existing video generation methods perform poorly on continuity and temporal consistency, so a method is needed that can generate high-quality, long-duration one-shot videos virtually. Method: 1) Introduces a lightweight intermediate-conditioning mechanism into the DiT architecture, combined with an adaptive tuning strategy, to enable arbitrary-frame control; 2) builds a high-quality dataset for visual expression SFT training and applies a tailored DPO scheme to improve motion plausibility and transition smoothness; 3) designs a Segment-wise Auto-Regressive (SAR) inference strategy to generate long video sequences efficiently. Result: Experiments show strong visual quality, temporal coherence, and computational efficiency; the method turns fragmented visual materials into smooth, expressive one-shot videos and significantly improves generation success rate and usability. Conclusion: DreaMontage provides an effective solution for flexible, high-quality virtual one-shot video generation and advances long-duration controllable video generation. Abstract: The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. 
Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.

[85] AnyAD: Unified Any-Modality Anomaly Detection in Incomplete Multi-Sequence MRI

Changwei Wu,Yifei Chen,Yuxin Du,Mingxuan Liu,Jinying Zong,Beining Wu,Jie Dong,Feiwei Qin,Yunkang Cao,Qiyuan Tian

Main category: cs.CV

TL;DR: Proposes a unified Any-Modality anomaly detection (AD) framework that performs robust anomaly detection and localization under any combination of MRI modalities without retraining, improving clinical applicability.

Details Motivation: Annotated abnormal cases are scarce and key imaging modalities are often missing in clinical practice, so existing anomaly detection models generalize poorly to unseen modality combinations, which limits their real-world use. Method: Uses a dual-pathway DINOv2 encoder with a feature distribution alignment mechanism that aligns incomplete-modality features with full-modality features; introduces an Intrinsic Normal Prototypes (INPs) extractor and an INP-guided decoder that reconstruct only normal structures and amplify abnormal regions; trains with randomized modality masking and indirect feature completion. Result: On BraTS2018, MU-Glioma-Post, and Pretreat-MetsToBrain-Masks, the method outperforms state-of-the-art industrial and medical anomaly detection baselines across 7 modality combinations. Conclusion: The study offers an effective paradigm for scalable multimodal medical anomaly detection under real-world incomplete-modality conditions. Abstract: Reliable anomaly detection in brain MRI remains challenging due to the scarcity of annotated abnormal cases and the frequent absence of key imaging modalities in real clinical workflows. Existing single-class or multi-class anomaly detection (AD) models typically rely on fixed modality configurations, require repetitive training, or fail to generalize to unseen modality combinations, limiting their clinical scalability. In this work, we present a unified Any-Modality AD framework that performs robust anomaly detection and localization under arbitrary MRI modality availability. The framework integrates a dual-pathway DINOv2 encoder with a feature distribution alignment mechanism that statistically aligns incomplete-modality features with full-modality representations, enabling stable inference even with severe modality dropout. To further enhance semantic consistency, we introduce an Intrinsic Normal Prototypes (INPs) extractor and an INP-guided decoder that reconstruct only normal anatomical patterns while naturally amplifying abnormal deviations. Through randomized modality masking and indirect feature completion during training, the model learns to adapt to all modality configurations without re-training. Extensive experiments on BraTS2018, MU-Glioma-Post, and Pretreat-MetsToBrain-Masks demonstrate that our approach consistently surpasses state-of-the-art industrial and medical AD baselines across 7 modality combinations, achieving superior generalization. This study establishes a scalable paradigm for multimodal medical AD under real-world, imperfect modality conditions. 
Our source code is available at https://github.com/wuchangw/AnyAD.

[86] ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision

Weiqi Li,Zehao Zhang,Liang Lin,Guangrun Wang

Main category: cs.CV

TL;DR: Proposes Attention-Conditional Diffusion (ACD), a new framework for direct conditional control of video diffusion models via attention supervision, improving condition alignment, temporal coherence, and visual fidelity.

Details Motivation: Classifier-free guidance gives limited control over conditioning signals in video synthesis, while classifier-based guidance is prone to adversarial artifacts, so a more effective direct control method is needed. Method: Proposes the ACD framework, which supervises the model's attention maps with external control signals (such as a sparse 3D-aware object layout) to control the video diffusion model directly; also designs a dedicated Layout ControlNet and an automated annotation pipeline to support scalable layout integration. Result: Experiments on benchmark video generation datasets show that ACD outperforms existing methods in condition alignment, temporal coherence, and visual quality. Conclusion: ACD offers an effective paradigm for conditional video synthesis, achieving stronger controllability and more accurate condition following through attention supervision. Abstract: Controllability is a fundamental requirement in video synthesis, where accurate alignment with conditioning signals is essential. Existing classifier-free guidance methods typically achieve conditioning indirectly by modeling the joint distribution of data and conditions, which often results in limited controllability over the specified conditions. Classifier-based guidance enforces conditions through an external classifier, but the model may exploit this mechanism to raise the classifier score without genuinely satisfying the intended condition, resulting in adversarial artifacts and limited effective controllability. In this paper, we propose Attention-Conditional Diffusion (ACD), a novel framework for direct conditional control in video diffusion models via attention supervision. By aligning the model's attention maps with external control signals, ACD achieves better controllability. To support this, we introduce a sparse 3D-aware object layout as an efficient conditioning signal, along with a dedicated Layout ControlNet and an automated annotation pipeline for scalable layout integration. Extensive experiments on benchmark video generation datasets demonstrate that ACD delivers superior alignment with conditioning inputs while preserving temporal coherence and visual fidelity, establishing an effective paradigm for conditional video synthesis.

[87] GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation

Snehal Singh Tomar,Alexandros Graikos,Arjun Krishna,Dimitris Samaras,Klaus Mueller

Main category: cs.CV

TL;DR: Proposes a two-stage method that factorizes image sequence generation into low-resolution sequence generation followed by high-resolution frame refinement, built on the DiT architecture for efficient, high-quality, long-sequence generation.

Details Motivation: Existing image sequence generation models operate directly on high-resolution temporal tensors, which is inefficient, makes long sequences hard to model, and uses training data poorly. Method: First generates, at low resolution, grid images composed of subsampled frames, using DiT self-attention to model correlations between frames; then super-resolves each frame independently to restore detail. Result: Outperforms SoTA methods across datasets, with higher generation quality and at least twice-as-fast inference; supports arbitrary-length sequences, trains more efficiently, and generalizes well across domains. Conclusion: Factorizing the generation process extends a 2D generator into a 3D sequence generator without architectural changes and overcomes key limitations of the current SoTA. Abstract: Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (SoTA)? In this work, we address this question in the context of generative models and aim to devise a more effective way of modeling image sequence data. Observing the inefficiencies and bottlenecks of current SoTA image sequence generation methods, we showcase that rather than working with large tensors, we can improve the generation process by factorizing it into first generating the coarse sequence at low resolution and then refining the individual frames at high resolution. We train a generative model solely on grid images comprising subsampled frames. Yet, we learn to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames. In effect, our formulation extends a 2D image generator to operate as a low-resolution 3D image-sequence generator without introducing any architectural modifications. Subsequently, we super-resolve each frame individually to add the sequence-independent high-resolution details. This approach offers several advantages and can overcome key limitations of the SoTA in this domain. Compared to existing image sequence generation models, our method achieves superior synthesis quality and improved coherence across sequences. It also delivers high-fidelity generation of arbitrary-length sequences and increased efficiency in inference time and training data usage. 
Furthermore, our straightforward formulation enables our method to generalize effectively across diverse data domains, which typically require additional priors and supervision to model in a generative context. Our method consistently outperforms SoTA in quality and inference speed (at least twice-as-fast) across datasets.
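
The "grid images comprising subsampled frames" idea can be illustrated with plain nested lists. This sketch only shows the data layout: how a short low-resolution sequence might be tiled into one 2D image and recovered again. The row-major tiling order and the function names are assumptions, not details taken from the paper.

```python
def frames_to_grid(frames, cols):
    """Tile equally sized frames (each a list of pixel rows) into one grid image.

    Assumption: frames are placed in row-major order, `cols` frames per grid row.
    """
    if len(frames) % cols != 0:
        raise ValueError("number of frames must be a multiple of cols")
    frame_h = len(frames[0])
    grid = []
    for start in range(0, len(frames), cols):
        row_of_frames = frames[start:start + cols]
        for y in range(frame_h):
            line = []
            for frame in row_of_frames:
                line.extend(frame[y])
            grid.append(line)
    return grid


def grid_to_frames(grid, frame_h, frame_w):
    """Invert frames_to_grid: cut the grid image back into individual frames."""
    grid_rows = len(grid) // frame_h
    grid_cols = len(grid[0]) // frame_w
    frames = []
    for r in range(grid_rows):
        for c in range(grid_cols):
            frame = [
                grid[r * frame_h + y][c * frame_w:(c + 1) * frame_w]
                for y in range(frame_h)
            ]
            frames.append(frame)
    return frames
```

Because the grid is an ordinary 2D image, an unmodified image generator can be trained on it, and each recovered frame can then be passed to a separate super-resolution step.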

[88] Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential

Shihao Zou,Jingjing Li,Wei Ji,Jincai Huang,Kai Wang,Guo Dan,Weixin Si,Yi Pan

Main category: cs.CV

TL;DR: Proposes SpikeSurgSeg, the first spike-driven video Transformer framework for surgical scene segmentation, with potential for real-time processing on non-GPU platforms.

Details Motivation: Existing deep learning models are computationally heavy and power-hungry, which makes real-time deployment in resource-constrained surgical environments difficult; scarce labeled surgical data further limits the development of efficient surgical intelligence. Method: Proposes a surgical-scene masked autoencoding pretraining strategy for spiking neural networks (SNNs) that learns robust spatiotemporal representations via layer-wise tube masking, combined with a lightweight spike-driven segmentation head that produces temporally consistent predictions. Result: On EndoVis18 and the in-house SurgBleed dataset, SpikeSurgSeg achieves mIoU comparable to state-of-the-art ANN models while reducing inference latency by at least 8x, and runs more than 20x faster than most foundation-model baselines. Conclusion: SpikeSurgSeg keeps high segmentation accuracy while substantially reducing latency and power consumption, showing strong potential for time-critical surgical scene segmentation. Abstract: Modern surgical systems increasingly rely on intelligent scene understanding to provide timely situational awareness for enhanced intra-operative safety. Within this pipeline, surgical scene segmentation plays a central role in accurately perceiving operative events. Although recent deep learning models, particularly large-scale foundation models, achieve remarkable segmentation accuracy, their substantial computational demands and power consumption hinder real-time deployment in resource-constrained surgical environments. To address this limitation, we explore emerging spiking neural networks (SNNs) as a promising paradigm for highly efficient surgical intelligence. However, their performance is still constrained by the scarcity of labeled surgical data and the inherently sparse nature of surgical video representations. To this end, we propose SpikeSurgSeg, the first spike-driven video Transformer framework tailored for surgical scene segmentation with real-time potential on non-GPU platforms. To address the limited availability of surgical annotations, we introduce a surgical-scene masked autoencoding pretraining strategy for SNNs that enables robust spatiotemporal representation learning via layer-wise tube masking. Building on this pretrained backbone, we further adopt a lightweight spike-driven segmentation head that produces temporally consistent predictions while preserving the low-latency characteristics of SNNs. Extensive experiments on EndoVis18 and our in-house SurgBleed dataset demonstrate that SpikeSurgSeg achieves mIoU comparable to SOTA ANN-based models while reducing inference latency by at least $8\times$. 
Notably, it delivers over $20\times$ acceleration relative to most foundation-model baselines, underscoring its potential for time-critical surgical scene segmentation.

[89] Post-Processing Mask-Based Table Segmentation for Structural Coordinate Extraction

Suren Bandara

Main category: cs.CV

TL;DR: Proposes a multi-scale signal-processing method for table edge detection that uses Gaussian convolution and statistical thresholding to suppress noise while preserving structural edges, improving table segmentation accuracy on low-resolution or noisy images.

Details Motivation: Accurately identifying row and column boundaries of tables remains difficult in low-resolution or noisy images, and existing methods adapt poorly to incomplete or degraded table data. Method: Models row and column transitions as one-dimensional signals, processes them at multiple scales with Gaussian convolution of increasing variance, applies statistical thresholding to remove noise, and maps the detected signal peaks back to image coordinates to obtain precise boundaries. Result: On the PubLayNet-1M benchmark, using TableNet with PyTesseract OCR, the CASA metric for column edge detection rises from 67% to 76%, and the method is robust to resolution changes. Conclusion: The method is more robust and accurate under noise and resolution variation, and its structured table outputs are suitable for downstream analysis. Abstract: Structured data extraction from tables plays a crucial role in document image analysis for scanned documents and digital archives. Although many methods have been proposed to detect table structures and extract cell contents, accurately identifying table segment boundaries (rows and columns) remains challenging, particularly in low-resolution or noisy images. In many real-world scenarios, table data are incomplete or degraded, limiting the adaptability of transformer-based methods to noisy inputs. Mask-based edge detection techniques have shown greater robustness under such conditions, as their sensitivity can be adjusted through threshold tuning; however, existing approaches typically apply masks directly to images, leading to noise sensitivity, resolution loss, or high computational cost. This paper proposes a novel multi-scale signal-processing method for detecting table edges from table masks. Row and column transitions are modeled as one-dimensional signals and processed using Gaussian convolution with progressively increasing variances, followed by statistical thresholding to suppress noise while preserving stable structural edges. Detected signal peaks are mapped back to image coordinates to obtain accurate segment boundaries. Experimental results show that applying the proposed approach to column edge detection improves Cell-Aware Segmentation Accuracy (CASA), a layout-aware metric evaluating both textual correctness and correct cell placement, from 67% to 76% on the PubLayNet-1M benchmark when using TableNet with PyTesseract OCR. 
The method is robust to resolution variations through zero-padding and scaling strategies and produces optimized structured tabular outputs suitable for downstream analysis.
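
As a rough illustration of the signal-processing idea (not the paper's exact procedure), the sketch below smooths a 1D transition signal with Gaussian kernels of increasing width, thresholds each smoothed signal at mean plus `k` standard deviations, and keeps only the peaks that persist across scales. The zero-padded convolution, the threshold rule, the scale-agreement tolerance `tol`, and all default values are assumptions.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at three standard deviations."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma):
    """Convolve the signal with a Gaussian kernel, zero-padding at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - radius
            if 0 <= idx < n:
                acc += w * signal[idx]
        out.append(acc)
    return out


def peaks_above_threshold(signal, k=1.0):
    """Indices of local maxima exceeding mean + k * standard deviation."""
    mean = sum(signal) / len(signal)
    std = math.sqrt(sum((v - mean) ** 2 for v in signal) / len(signal))
    threshold = mean + k * std
    return [
        i for i in range(1, len(signal) - 1)
        if signal[i] > threshold and signal[i] >= signal[i - 1] and signal[i] > signal[i + 1]
    ]


def detect_edges(transitions, sigmas=(1.0, 2.0), k=1.0, tol=2):
    """Keep finest-scale peaks that have a matching peak (within tol) at every coarser scale."""
    per_scale = [peaks_above_threshold(smooth(transitions, s), k) for s in sigmas]
    return [
        p for p in per_scale[0]
        if all(any(abs(p - q) <= tol for q in coarser) for coarser in per_scale[1:])
    ]
```

In practice `transitions` would be derived from a table mask, for example a count of mask transitions per pixel column, and the returned indices would be mapped back to image x-coordinates.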

[90] AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents

Yue Cao,Yingyao Wang,Pi Bu,Jingxuan Xing,Wei Jiang,Zekun Zhu,Junpeng Ma,Sashuai Zhou,Tong Lu,Jun Song,Yu Cheng,Yuning Jiang,Bo Zheng

Main category: cs.CV

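TL;DR: AndroidLens is a new evaluation framework for mobile GUI agents comprising 571 long-latency tasks that cover complex real-world scenarios; it introduces static and dynamic evaluation and reveals the difficulties current models have with environmental anomalies, adaptive exploration, and long-term memory.

Details Motivation: Existing evaluation benchmarks for mobile GUI agents are limited to a few applications, simple tasks, and coarse-grained metrics, and so fail to reflect real-world complexity and diversity; a more comprehensive and challenging framework is needed. Method: Proposes the AndroidLens framework with 571 tasks in Chinese and English across 38 domains, each averaging more than 26 steps; static evaluation preserves real-world anomalies and allows multiple valid paths to reduce bias, while dynamic evaluation uses a milestone-based scheme that measures fine-grained progress via Average Task Progress (ATP). Result: Even the best models reach only a 12.7% task success rate and 50.47% ATP, showing that current techniques still fall well short on complex mobile tasks. Conclusion: AndroidLens gives mobile GUI agents an evaluation standard closer to the real world, exposes key challenges in long-horizon, multi-goal, and heavily constrained tasks, and points future research toward environmental robustness, exploration strategies, and long-term memory. Abstract: Graphical user interface (GUI) agents can substantially improve productivity by automating frequently executed long-latency tasks on mobile devices. However, existing evaluation benchmarks are still constrained to limited applications, simple tasks, and coarse-grained metrics. To address this, we introduce AndroidLens, a challenging evaluation framework for mobile GUI agents, comprising 571 long-latency tasks in both Chinese and English environments, each requiring an average of more than 26 steps to complete. The framework features: (1) tasks derived from real-world user scenarios across 38 domains, covering complex types such as multi-constraint, multi-goal, and domain-specific tasks; (2) static evaluation that preserves real-world anomalies and allows multiple valid paths to reduce bias; and (3) dynamic evaluation that employs a milestone-based scheme for fine-grained progress measurement via Average Task Progress (ATP). Our evaluation indicates that even the best models reach only a 12.7% task success rate and 50.47% ATP. We also underscore key challenges in real-world environments, including environmental anomalies, adaptive exploration, and long-term memory retention.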
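
A small sketch may help show why a milestone-based progress score can sit far above the task success rate. The abstract does not give the ATP formula, so the definition used here (fraction of milestones reached per task, averaged over tasks) is an assumption, as are the function names.

```python
def average_task_progress(tasks):
    """Mean per-task progress, where progress = milestones reached / total milestones.

    Assumption: each task is a (reached, total) pair with total > 0.
    """
    return sum(reached / total for reached, total in tasks) / len(tasks)


def success_rate(tasks):
    """Fraction of tasks in which every milestone was reached."""
    return sum(1 for reached, total in tasks if reached == total) / len(tasks)
```

For example, with tasks `[(4, 4), (2, 4), (0, 5)]` only one task in three succeeds, yet the average progress is 0.5, because partial completion of the second task still earns credit.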

[91] TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning

Varun Belagali,Saarthak Kapse,Pierre Marza,Srijan Das,Zilinghan Li,Sofiène Boutaj,Pushpak Pati,Srikar Yellapragada,Tarak Nath Nandi,Ravi K Madduri,Joel Saltz,Prateek Prasanna,Stergios Christodoulidis,Maria Vakalopoulou,Dimitris Samaras

Main category: cs.CV

TL;DR: Proposes TICON, a Transformer-based tile representation contextualizer that unifies and enriches embeddings from any tile-level foundation model, achieving leading performance on a range of local and global tasks.

Details Motivation: Existing tile encoders strip tiles from their context and cannot capture the global information in whole slide images; different tasks also favor different encoders, and no unified contextualization framework exists. Method: Proposes TICON, a single shared Transformer encoder pretrained with a masked modeling objective to unify and contextualize embeddings from multiple tile-level foundation models. Result: TICON reaches state-of-the-art performance on tile-level benchmarks (HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (Patho-Bench); an aggregator pretrained on TICON with only 11K WSIs outperforms existing slide-level foundation models pretrained with up to 350K WSIs. Conclusion: TICON effectively unifies and contextualizes tile embeddings from any source, improves performance across computational pathology tasks, and supports efficient construction of high-performing slide-level foundation models. Abstract: The interpretation of small tiles in large whole slide images (WSI) often needs a larger image context. We introduce TICON, a transformer-based tile representation contextualizer that produces rich, contextualized embeddings for "any" application in computational pathology. Standard tile encoder-based pipelines, which extract embeddings of tiles stripped from their context, fail to model the rich slide-level information essential for both local and global tasks. Furthermore, different tile-encoders excel at different downstream tasks. Therefore, a unified model is needed to contextualize embeddings derived from "any" tile-level foundation model. TICON addresses this need with a single, shared encoder, pretrained using a masked modeling objective to simultaneously unify and contextualize representations from diverse tile-level pathology foundation models. Our experiments demonstrate that TICON-contextualized embeddings significantly improve performance across many different tasks, establishing new state-of-the-art results on tile-level benchmarks (i.e., HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (i.e., Patho-Bench). Finally, we pretrain an aggregator on TICON to form a slide-level foundation model, using only 11K WSIs, outperforming SoTA slide-level foundation models pretrained with up to 350K WSIs.

[92] Fast SAM2 with Text-Driven Token Pruning

Avilasha Mandal,Chaoning Zhang,Fachrina Dewi Puspitasari,Xudong Wang,Jiaquan Zhang,Caiyan Qin,Guoqing Wang,Yang Yang,Heng Tao Shen

Main category: cs.CV

TL;DR: Proposes a text-guided token pruning framework that improves the inference efficiency of the video object segmentation model SAM2 by selectively discarding unimportant visual tokens before temporal propagation, cutting compute and memory cost while preserving segmentation performance.

Details Motivation: Models such as SAM2 process large numbers of visual tokens across time in video segmentation, which incurs high compute and memory cost and limits practical deployment, especially in resource-constrained settings. Method: Inserts a lightweight routing mechanism after visual encoding and before temporal propagation; it ranks tokens by combining local visual context, semantic relevance derived from textual descriptions, and uncertainty cues, and keeps only the most informative tokens for downstream processing. Result: Across multiple video segmentation benchmarks, the method achieves up to 42.50% faster inference and 37.41% lower GPU memory usage than unpruned SAM2 while keeping competitive J and F scores. Conclusion: Early token selection is an effective and practical way to improve the scalability of Transformer-based video segmentation systems for real-time and resource-constrained applications. Abstract: Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation, yet its practical deployment remains limited by the high computational and memory cost of processing dense visual tokens across time. The SAM2 pipelines typically propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object, resulting in reduced scalability due to quadratic memory attention overhead. In this work, we introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation, without modifying the underlying segmentation architecture. Operating after visual encoding and before memory-based propagation, our method ranks tokens using a lightweight routing mechanism that integrates local visual context, semantic relevance derived from object-centric textual descriptions (either user-provided or automatically generated), and uncertainty cues that help preserve ambiguous or boundary-critical regions. By retaining only the most informative tokens for downstream processing, the proposed approach reduces redundant computation while maintaining segmentation fidelity. 
Extensive experiments across multiple challenging video segmentation benchmarks demonstrate that post-encoder token pruning provides a practical and effective pathway to efficient, prompt-aware video segmentation, achieving up to 42.50 percent faster inference and 37.41 percent lower GPU memory usage compared to the unpruned baseline SAM2, while preserving competitive J and F performance. These results highlight the potential of early token selection to improve the scalability of transformer-based video segmentation systems for real-time and resource-constrained applications.
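
The ranking-and-retention step can be pictured as a weighted score followed by a top-k selection. The sketch below is a simplified stand-in for the routing mechanism: the linear combination, the weights, and the `keep_ratio` parameter are all assumptions, and the three score lists are taken as given rather than computed from a real encoder.

```python
import math


def prune_tokens(visual, semantic, uncertainty, keep_ratio=0.5, weights=(1.0, 1.0, 1.0)):
    """Return the sorted indices of the tokens to keep.

    Each argument is a per-token score list of equal length:
      visual      - local visual context score
      semantic    - relevance to the text description
      uncertainty - higher for ambiguous or boundary regions
    Assumption: the three cues are combined linearly; a learned router
    could replace this rule.
    """
    w_vis, w_sem, w_unc = weights
    scores = [
        w_vis * v + w_sem * s + w_unc * u
        for v, s, u in zip(visual, semantic, uncertainty)
    ]
    n_keep = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original token order so positional structure is preserved.
    return sorted(ranked[:n_keep])
```

Keeping the uncertainty term in the score is what lets ambiguous boundary tokens survive pruning even when their semantic relevance is low.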

[93] Streaming Video Instruction Tuning

Jiaer Xia,Peixian Chen,Mengdan Zhang,Xing Sun,Kaiyang Zhou

Main category: cs.CV

TL;DR: Streamo is a real-time streaming video LLM that acts as a general-purpose interactive assistant and supports a range of streaming video tasks, such as real-time narration, action understanding, and event captioning.

Details Motivation: Existing online video models are usually limited to question answering or captioning and lack unified support for diverse streaming tasks, so they struggle to meet real-world demands for real-time, multi-purpose operation. Method: Builds Streamo-Instruct-465K, a large-scale instruction-following dataset covering diverse temporal contexts and multi-task supervision, and trains end-to-end for unified modeling. Result: Streamo shows strong temporal reasoning, responsive interaction, and broad generalization across multiple streaming benchmarks. Conclusion: Streamo bridges the gap between offline video perception models and real-time multimodal assistants, advancing unified, intelligent video understanding in continuous video streams. Abstract: We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.

[94] Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

Li-Zhong Szu-Tu,Ting-Lin Wu,Chia-Jui Chang,He Syu,Yu-Lun Liu

Main category: cs.CV

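TL;DR: Shows that state-of-the-art vision-language models (VLMs) have a significant popularity bias when recognizing buildings, with accuracy up to 34% higher on famous buildings than on ordinary ones, indicating reliance on memorization rather than generalizable understanding. The authors introduce YearGuessr, the largest open benchmark for this task, with 55,546 building images from 157 countries annotated with construction year, GPS location, and page-view counts. Framing construction year prediction as ordinal regression and adding popularity-aware metrics, they find that current VLMs perform poorly on little-known buildings, exposing a flaw in their reasoning.

Details Motivation: Current vision-language models appear biased toward popular or famous objects and may rely on high-frequency memorization from training data rather than real understanding, which limits generalization in real-world settings; this calls for systematic evaluation. Method: Builds the large-scale open YearGuessr benchmark of building images with construction year, geographic location, and popularity (page views); treats construction year prediction as an ordinal regression task; and proposes popularity-aware interval accuracy metrics to quantify performance differences across popularity levels. Result: VLMs achieve up to 34% higher accuracy on famous buildings than on ordinary ones; a benchmark of more than 30 models, including the authors' YearCLIP model, confirms that popularity bias is widespread, and the new metrics expose the models' dependence on memorized content. Conclusion: Current vision-language models depend heavily on an object's popularity in this task, retrieving from memory rather than reasoning; future models need better understanding of and generalization to less common, unseen objects. Abstract: We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: https://sytwu.github.io/BeyondMemo/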
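
One plausible reading of a "popularity-aware interval accuracy" metric is sketched below: a prediction counts as correct if it falls within a tolerance of the true construction year, and accuracy is reported separately for popular and ordinary buildings. The tolerance, the page-view threshold, and the two-bucket split are assumptions for illustration; the paper's actual metric definitions may differ.

```python
def interval_accuracy(pairs, tolerance):
    """Fraction of (predicted_year, true_year) pairs within +/- tolerance years."""
    if not pairs:
        return None
    hits = sum(1 for pred, true in pairs if abs(pred - true) <= tolerance)
    return hits / len(pairs)


def accuracy_by_popularity(records, tolerance, view_threshold):
    """Split records by page views and report interval accuracy per group.

    Each record is (predicted_year, true_year, page_views).
    Assumption: page_views >= view_threshold marks a building as "popular".
    """
    popular = [(p, t) for p, t, views in records if views >= view_threshold]
    ordinary = [(p, t) for p, t, views in records if views < view_threshold]
    return {
        "popular": interval_accuracy(popular, tolerance),
        "ordinary": interval_accuracy(ordinary, tolerance),
    }
```

The gap between the two values is the quantity of interest: a large positive difference suggests the model is recalling famous buildings rather than inferring age from visual evidence.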

[95] HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming

Haonan Qiu,Shikun Liu,Zijian Zhou,Zhaochong An,Weiming Ren,Zhiheng Liu,Jonas Schult,Sen He,Shoufa Chen,Yuren Cong,Tao Xiang,Ziwei Liu,Juan-Manuel Perez-Rua

Main category: cs.CV

TL;DR: Proposes HiStream, an efficient high-resolution video generation framework that speeds up denoising through compression along three axes (spatial, temporal, and timestep) while preserving quality, achieving up to 107.5x acceleration.

Details Motivation: Diffusion models are inefficient at inference for high-resolution video generation because of their quadratic computational complexity, which makes practical use difficult, so a more efficient generation framework is needed. Method: Proposes the HiStream framework with three compression strategies: 1) spatial compression, denoising at low resolution and then refining at high resolution with cached features; 2) temporal compression, chunk-by-chunk processing with a fixed-size anchor cache; 3) timestep compression, using fewer denoising steps for subsequent cache-conditioned chunks. Result: On 1080p benchmarks, HiStream (i+ii) achieves state-of-the-art visual quality with up to 76.2x faster denoising than the Wan2.1 baseline; HiStream+ (i+ii+iii) reaches 107.5x acceleration with a good balance between speed and quality. Conclusion: HiStream makes high-resolution video generation efficient and scalable, offering a viable option for practical use. Abstract: High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible. To address this, we introduce HiStream, an efficient autoregressive framework that systematically reduces redundancy across three axes: i) Spatial Compression: denoising at low resolution before refining at high resolution with cached features; ii) Temporal Compression: a chunk-by-chunk strategy with a fixed-size anchor cache, ensuring stable inference speed; and iii) Timestep Compression: applying fewer denoising steps to subsequent, cache-conditioned chunks. On 1080p benchmarks, our primary HiStream model (i+ii) achieves state-of-the-art visual quality while demonstrating up to 76.2x faster denoising compared to the Wan2.1 baseline and negligible quality loss. Our faster variant, HiStream+, applies all three optimizations (i+ii+iii), achieving a 107.5x acceleration over the baseline, offering a compelling trade-off between speed and quality, thereby making high-resolution video generation both practical and scalable.