2025 04 01

Fragile Mastery: Are Domain-Specific Trade-Offs Undermining On-Device Language Models?

Basab Jha,Firoj Paudel

Task: Study how to balance computational efficiency, memory, power consumption, and linguistic capability on resource-constrained edge devices, and propose a new architecture, GEM.

Motivation: Address the trade-off between domain-specific optimization and cross-domain robustness.

Details

Method: Proposes the GEM architecture, which uses SCAR to allocate compute dynamically, and validates it on 47 benchmarks. Result: GEM reaches a cross-domain F1 score of 0.89 at under 100 ms latency and improves general-task performance by 7%. Conclusion: GEM performs well on both domain-specific and general tasks, and the paper proposes three new measurement tools: DSI, GG, and CDTR. Abstract: Deploying on-device language models (ODLMs) on resource-constrained edge devices is a multi-dimensional problem that requires a fine balance between computational efficiency, memory, power usage, and linguistic capacity across heterogeneous tasks. This study investigates the trade-offs between domain-specific optimization and cross-domain robustness, culminating in the proposal of the Generalized Edge Model (GEM), a new architecture that aims to balance specialization and generalization. Using 47 benchmarks in eight domains--healthcare, law, finance, STEM, commonsense, conversational AI, multilingual, and domain-adaptive tasks--we show that conventional optimization techniques decrease target-task perplexity by 18-25% but cause a precipitous decline in general-task performance, with F1 scores decreasing by 12-29%, as reported by Liu et al. GEM employs a Sparse Cross-Attention Router (SCAR) to allocate computation dynamically across a variable number of computing resources, achieving a cross-domain F1 score of 0.89 at under 100 ms latency on Raspberry Pi 4, Pixel 6, iPhone 13, and custom neural processing units (NPUs). Compared to GPT-4 Lite, GEM improves general-task performance by 7% while maintaining parity in domain-specific performance. We propose three new measurement tools--Domain Specialization Index (DSI), Generalization Gap (GG), and Cross-Domain Transfer Ratio (CDTR)--which show a strong correlation between model compression intensity and brittleness.

TRIDIS: A Comprehensive Medieval and Early Modern Corpus for HTR and NER

Sergio Torres Aguilar

Task: Introduce TRIDIS (Tria Digita Scribunt), an open-source corpus of medieval and early modern manuscripts, and describe its composition, transcription rules, test-split strategy, and preliminary experiments.

Motivation: Consolidate multiple openly licensed legacy collections and give a unified overview, in order to advance Handwritten Text Recognition (HTR) and Named Entity Recognition (NER) research on medieval and early modern textual heritage.

Details

Method: Describes the background of each sub-corpus, the semi-diplomatic transcription rules, a test-split strategy based on outlier detection, and baseline experiments with TrOCR and MiniCPM2.5. Result: Provides a detailed account of the composition of the TRIDIS corpus and preliminary experimental results. Conclusion: TRIDIS aims to advance HTR and NER research on medieval and early modern textual heritage. Abstract: This paper introduces TRIDIS (Tria Digita Scribunt), an open-source corpus of medieval and early modern manuscripts. TRIDIS aggregates multiple legacy collections (all published under open licenses) and incorporates large metadata descriptions. While prior publications referenced some portions of this corpus, here we provide a unified overview with a stronger focus on its constitution. We describe (i) the narrative, chronological, and editorial background of each major sub-corpus, (ii) its semi-diplomatic transcription rules (expansion, normalization, punctuation), (iii) a strategy for challenging out-of-domain test splits driven by outlier detection in a joint embedding space, and (iv) preliminary baseline experiments using TrOCR and MiniCPM2.5 comparing random and outlier-based test partitions. Overall, TRIDIS is designed to stimulate joint robust Handwritten Text Recognition (HTR) and Named Entity Recognition (NER) research across medieval and early modern textual heritage.
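
The outlier-driven test split mentioned in item (iii) can be illustrated with a minimal sketch. The abstract does not say which outlier detector or embedding model is used, so the distance-to-centroid score, the function name `outlier_test_split`, and the toy embeddings below are all assumptions made for illustration only.

```python
import math


def outlier_test_split(embeddings, test_fraction=0.2):
    """Return (train_idx, test_idx); the test set holds the samples farthest from the centroid.

    Distance to the centroid is a stand-in outlier score (an assumption, not
    the detector used by the TRIDIS authors).
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    centroid = [sum(vec[d] for vec in embeddings) / n for d in range(dim)]
    distances = [math.dist(vec, centroid) for vec in embeddings]
    # Rank samples from most to least outlying.
    ranked = sorted(range(n), key=lambda i: distances[i], reverse=True)
    n_test = max(1, round(n * test_fraction))
    test_idx = sorted(ranked[:n_test])
    train_idx = sorted(ranked[n_test:])
    return train_idx, test_idx


# Hypothetical 2-D embeddings: four similar samples and one clear outlier.
embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]]
train_idx, test_idx = outlier_test_split(embeddings, test_fraction=0.2)
```

Compared with a random partition, a split of this kind places the least typical samples in the test set, which is what makes it a harder out-of-domain evaluation.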

A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI

Alejandro Lozano,Min Woo Sun,James Burgess,Jeffrey J. Nirschl,Christopher Polzak,Yuhui Zhang,Liangyu Chen,Jeffrey Gu,Ivan Lopez,Josiah Aklilu,Anita Rau,Austin Wolfgang Katzer,Collin Chiu,Orr Zohar,Xiaohan Wang,Alfred Seunghoon Song,Chiang Chia-Chun,Robert Tibshirani,Serena Yeung-Levy

Task: Introduce and evaluate Biomedica, an open-source dataset built from the PubMed Central Open Access subset to support biomedical AI research.

Motivation: Address the bottleneck in biomedical AI research caused by limited access to high-quality, diverse, and large-scale data.

Details

Method: Extracts data from the PubMed Central Open Access subset to build a dataset of over 6 million scientific articles and 24 million image-text pairs, and provides scalable streaming and search APIs. Result: The embedding models, chat-style models, and retrieval-augmented chat agents built on Biomedica all outperform existing open systems. Conclusion: Biomedica demonstrates the critical role of high-quality, diverse, and large-scale data in biomedical AI research. Abstract: Despite the excitement behind biomedical artificial intelligence (AI), access to high-quality, diverse, and large-scale data - the foundation for modern AI systems - is still a bottleneck to unlocking its full potential. To address this gap, we introduce Biomedica, an open-source dataset derived from the PubMed Central Open Access subset, containing over 6 million scientific articles and 24 million image-text pairs, along with 27 metadata fields (including expert human annotations). To overcome the challenges of accessing our large-scale dataset, we provide scalable streaming and search APIs through a web server, facilitating seamless integration with AI systems. We demonstrate the utility of the Biomedica dataset by building embedding models, chat-style models, and retrieval-augmented chat agents. Notably, all our AI models surpass previous open systems in their respective categories, underscoring the critical role of diverse, high-quality, and large-scale biomedical data.

Susceptibility of Large Language Models to User-Driven Factors in Medical Queries

Kyung Ho Lim,Ujin Kang,Xiang Li,Jin Sung Kim,Young-Chul Jung,Sangjoon Park,Byung-Hoon Kim

Task: Study the reliability of large language models (LLMs) in medical diagnosis, in particular how user-driven factors such as question phrasing and the completeness of clinical information affect their outputs.

Motivation: Examine how reliably LLMs can be applied in medicine and show how user behavior affects their diagnostic accuracy.

Details

Method: Evaluates several LLMs with a perturbation test (introducing misleading external opinions) and an ablation test (removing key clinical information). Result: All models are vulnerable to user-driven misinformation, and proprietary models are especially sensitive to authoritative language; omitting key clinical information significantly reduces performance. Conclusion: Prompts should be well structured and include complete clinical information, and authoritative framing of misinformation should be avoided, to improve the reliability of LLMs in medicine. Abstract: Large language models (LLMs) are increasingly used in healthcare, but their reliability is heavily influenced by user-driven factors such as question phrasing and the completeness of clinical information. In this study, we examined how misinformation framing, source authority, model persona, and omission of key clinical details affect the diagnostic accuracy and reliability of LLM outputs. We conducted two experiments: one introducing misleading external opinions with varying assertiveness (perturbation test), and another removing specific categories of patient information (ablation test). Using public datasets (MedQA and Medbullets), we evaluated proprietary models (GPT-4o, Claude 3.5 Sonnet, Claude 3.5 Haiku, Gemini 1.5 Pro, Gemini 1.5 Flash) and open-source models (LLaMA 3 8B, LLaMA 3 Med42 8B, DeepSeek R1 8B). All models were vulnerable to user-driven misinformation, with proprietary models especially affected by definitive and authoritative language. Assertive tone had the greatest negative impact on accuracy. In the ablation test, omitting physical exam findings and lab results caused the most significant performance drop. Although proprietary models had higher baseline accuracy, their performance declined sharply under misinformation. These results highlight the need for well-structured prompts and complete clinical context. Users should avoid authoritative framing of misinformation and provide full clinical details, especially for complex cases.

Patronus: Bringing Transparency to Diffusion Models with Prototypes

Nina Weng,Aasa Feragen,Siavash Bigdeli

Task: Propose Patronus, an interpretable diffusion model that uses a prototypical network to make the generation process of DDPMs more transparent.

Motivation: Address the opacity of the DDPM generation mechanism and improve the interpretability and controllability of the model.

Details

Method: Integrates a prototypical network into DDPMs, extracts prototypes, and controls generation through the prototype activation vector. Result: Patronus can show the learned prototypes and how they influence generation, supports image manipulation tasks, and can detect shortcut learning in the generation process. Conclusion: Patronus opens new avenues for improving the interpretability and controllability of diffusion models through prototypes. Abstract: Diffusion-based generative models, such as Denoising Diffusion Probabilistic Models (DDPMs), have achieved remarkable success in image generation, but their step-by-step denoising process remains opaque, leaving critical aspects of the generation mechanism unexplained. To address this, we introduce Patronus, an interpretable diffusion model inspired by ProtoPNet. Patronus integrates a prototypical network into DDPMs, enabling the extraction of prototypes and conditioning of the generation process on their prototype activation vector. This design enhances interpretability by showing the learned prototypes and how they influence the generation process. Additionally, the model supports downstream tasks like image manipulation, enabling more transparent and controlled modifications. Moreover, Patronus could reveal shortcut learning in the generation process by detecting unwanted correlations between learned prototypes. Notably, Patronus operates entirely without any annotations or text prompts. This work opens new avenues for understanding and controlling diffusion models through prototype-based interpretability. Our code is available at https://github.com/nina-weng/patronus.

Boosting Large Language Models with Mask Fine-Tuning

Mingyuan Zhang,Yue Bai,Huan Wang,Yizhou Wang,Qihua Dong,Yun Fu

Task: Propose a new paradigm called Mask Fine-Tuning (MFT), which improves performance by breaking the integrity of a large language model (LLM).

Motivation: Question whether keeping the model intact, as mainstream LLM fine-tuning protocols do, is actually necessary, and explore whether breaking that integrity can improve performance.

Details

Method: Implements MFT by learning a set of binary masks supervised by the typical LLM fine-tuning objective. Result: Experiments show that MFT improves performance across several domains and backbones (e.g., average gains of 1.95%/1.88% on coding tasks with LLaMA2-7B/3.1-8B). Conclusion: MFT extends mask learning from its traditional use in model compression to a more general update of the LLM training protocol. Abstract: The model is usually kept integral in the mainstream large language model (LLM) fine-tuning protocols. No works have questioned whether maintaining the integrity of the model is indispensable for performance. In this work, we introduce Mask Fine-Tuning (MFT), a brand-new LLM fine-tuning paradigm to show that properly breaking the integrity of the model can surprisingly lead to improved performance. Specifically, MFT learns a set of binary masks supervised by the typical LLM fine-tuning objective. Extensive experiments show that MFT gains a consistent performance boost across various domains and backbones (e.g., 1.95%/1.88% average gain in coding with LLaMA2-7B/3.1-8B). Detailed procedures are provided to study the proposed MFT from different hyperparameter perspectives for better insight. In particular, MFT naturally updates the current LLM training protocol by deploying it on a complete well-trained model. This study extends the functionality of mask learning from its conventional network pruning context for model compression to a more general scope.
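
As a rough illustration of what applying a learned binary mask to a frozen, well-trained layer looks like, the sketch below thresholds real-valued mask scores and uses the result to zero out individual weights in a linear layer. The abstract does not describe how the masks are parameterized or optimized, so the score-thresholding scheme, the helper names `binarize` and `masked_linear`, and all numbers are hypothetical.

```python
def binarize(scores, threshold=0.0):
    """Turn real-valued mask scores into a 0/1 mask (assumed parameterization)."""
    return [[1 if s > threshold else 0 for s in row] for row in scores]


def masked_linear(weights, mask, x):
    """Compute y = (W * M) x, where W stays frozen and M is a binary mask."""
    return [
        sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        for w_row, m_row in zip(weights, mask)
    ]


# Hypothetical frozen weights of a 2x3 linear layer and hypothetical mask scores.
weights = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5]]
scores = [[0.3, -0.2, 0.9], [-0.1, 0.4, 0.7]]

mask = binarize(scores)
y = masked_linear(weights, mask, [1.0, 2.0, 3.0])
```

In a training setup of this kind, only the mask scores would receive updates from the fine-tuning loss while the weights themselves stay untouched; how the gradient passes through the thresholding step is left out of this sketch.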

DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers

Hanling Zhang,Rundong Su,Zhihang Yuan,Pengtao Chen,Mingzhu Shen,Yibo Fan,Shengen Yan,Guohao Dai,Yu Wang

Task: Propose DiTFastAttnV2, a post-training compression method for accelerating attention in MMDiT.

Motivation: Existing Multimodal Diffusion Transformers (MMDiT) face computational bottlenecks when generating high-quality images, particularly in the attention mechanism, which limits their scalability and efficiency.

Details

Method: Analyzes the attention patterns of MMDiT, proposes head-wise arrow attention and caching mechanisms that adjust attention heads dynamically, and designs an efficient fused kernel for further acceleration. Result: DiTFastAttnV2 reduces attention FLOPs by 68% and achieves a 1.5x end-to-end speedup on 2K image generation while maintaining generation quality. Conclusion: DiTFastAttnV2 effectively addresses the computational bottleneck of MMDiT and significantly improves efficiency and scalability. Abstract: Text-to-image generation models, especially Multimodal Diffusion Transformers (MMDiT), have shown remarkable progress in generating high-quality images. However, these models often face significant computational bottlenecks, particularly in attention mechanisms, which hinder their scalability and efficiency. In this paper, we introduce DiTFastAttnV2, a post-training compression method designed to accelerate attention in MMDiT. Through an in-depth analysis of MMDiT's attention patterns, we identify key differences from prior DiT-based methods and propose head-wise arrow attention and caching mechanisms to dynamically adjust attention heads, effectively bridging this gap. We also design an Efficient Fused Kernel for further acceleration. By leveraging local metric methods and optimization techniques, our approach significantly reduces the search time for optimal compression schemes to just minutes while maintaining generation quality. Furthermore, with the customized kernel, DiTFastAttnV2 achieves a 68% reduction in attention FLOPs and 1.5x end-to-end speedup on 2K image generation without compromising visual fidelity.

Learning to Reason for Long-Form Story Generation

Alexander Gurung,Mirella Lapata

Task: Propose a general story-generation task (Next-Chapter Prediction) and a reward formulation (Verified Rewards via Completion Likelihood Improvement) that use an unlabeled book dataset as a learning signal for reasoning.

Motivation: Because labeled datasets and precise quality measurements are hard to obtain, existing methods rely on hand-designed prompting techniques, while the success of reinforcement learning in math and coding suggests applying it to story generation.

Details

Method: Proposes a reward formulation and task that use unlabeled data to train a model to reason over a story's condensed information and generate a detailed plan for the next chapter. Result: Human evaluation shows that chapters produced with this method are preferred over non-trained and supervised fine-tuning baselines on almost all metrics, with the strongest effect in the sci-fi and fantasy genres. Conclusion: Through reinforcement learning with verified rewards, the method significantly improves the quality and consistency of long-form story generation. Abstract: Generating high-quality stories spanning thousands of tokens requires competency across a variety of skills, from tracking plot and character arcs to keeping a consistent and engaging style. Due to the difficulty of sourcing labeled datasets and precise quality measurements, most work using large language models (LLMs) for long-form story generation uses combinations of hand-designed prompting techniques to elicit author-like behavior. This is a manual process that is highly dependent on the specific story-generation task. Motivated by the recent success of applying RL with Verifiable Rewards to domains like math and coding, we propose a general story-generation task (Next-Chapter Prediction) and a reward formulation (Verified Rewards via Completion Likelihood Improvement) that allows us to use an unlabeled book dataset as a learning signal for reasoning. We learn to reason over a story's condensed information and generate a detailed plan for the next chapter. Our reasoning is evaluated via the chapters it helps a story-generator create, and compared against non-trained and supervised fine-tuning (SFT) baselines. Pairwise human judgments reveal that the chapters our learned reasoning produces are preferred across almost all metrics, and the effect is more pronounced in the sci-fi and fantasy genres.
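
The idea behind a completion-likelihood-improvement reward can be sketched as follows: score a generated plan by how much it improves a story generator's likelihood of the true next chapter. The abstract names the reward but does not give its formula, so the relative-perplexity form, the function names, and the token log-probabilities below are assumptions for illustration, not the authors' definition.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def likelihood_improvement(logprobs_without_plan, logprobs_with_plan):
    """Relative drop in perplexity of the true next chapter once the plan is in context.

    Positive values mean the plan made the real chapter more likely
    (assumed reward shape, for illustration only).
    """
    base = perplexity(logprobs_without_plan)
    with_plan = perplexity(logprobs_with_plan)
    return (base - with_plan) / base


# Hypothetical per-token log-probabilities of the gold next chapter.
baseline = [-2.0, -2.0, -2.0, -2.0]  # generator sees only the story so far
planned = [-1.0, -1.0, -1.0, -1.0]   # generator also sees the generated plan
reward = likelihood_improvement(baseline, planned)
```

A reward of this shape needs only the book text itself, which is why an unlabeled book dataset can serve as the learning signal.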

GmNet: Revisiting Gating Mechanisms From A Frequency View

Yifan Wang,Xu Ma,Yitian Zhang,Zhongruo Wang,Sung-Cheol Kim,Vahid Mirjalili,Vidya Renganathan,Yun Fu

Task: Systematically explore, from a frequency perspective, how gating mechanisms affect the training dynamics of neural networks, and propose a lightweight model, GmNet.

Motivation: Gating mechanisms are effective for long-range dependency problems, but a theoretical analysis of how they work is lacking.

Details

Method: Building on the convolution theorem, analyzes gating mechanisms from a frequency perspective, studies how the element-wise product and activation functions manage different frequency components, and proposes the GmNet model. Result: GmNet performs well on image classification in terms of both effectiveness and efficiency, and reduces the low-frequency bias of existing lightweight models. Conclusion: Analyzing gating mechanisms from a frequency perspective offers new ideas for neural network design, and GmNet demonstrates their practical potential. Abstract: Gating mechanisms have emerged as an effective strategy integrated into model designs beyond recurrent neural networks for addressing long-range dependency problems. In a broad understanding, they provide adaptive control over the information flow while maintaining computational efficiency. However, there is a lack of theoretical analysis on how the gating mechanism works in neural networks. In this paper, inspired by the convolution theorem, we systematically explore the effect of gating mechanisms on the training dynamics of neural networks from a frequency perspective. We investigate the interaction between the element-wise product and activation functions in managing the responses to different frequency components. Leveraging these insights, we propose a Gating Mechanism Network (GmNet), a lightweight model designed to efficiently utilize the information of various frequency components. It minimizes the low-frequency bias present in existing lightweight models. GmNet achieves impressive performance in terms of both effectiveness and efficiency in the image classification task.
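
The link between gating and frequency content rests on the convolution theorem: an element-wise product in the signal domain corresponds to a (circular) convolution in the frequency domain, so multiplying features by a gate mixes their frequency components. The sketch below checks the discrete form of that identity numerically; the signal values and variable names are arbitrary choices for illustration and are not taken from the paper.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a 1-D sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def circular_convolve(a, b):
    """Circular convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[m] * b[(k - m) % n] for m in range(n)) for k in range(n)]


features = [1.0, 2.0, 0.5, -1.0]  # arbitrary feature values
gate = [0.9, 0.1, 0.8, 0.2]       # arbitrary gate values in [0, 1]

# Left side: spectrum of the gated (element-wise multiplied) signal.
gated = [f * g for f, g in zip(features, gate)]
lhs = dft(gated)

# Right side: circular convolution of the two spectra, scaled by 1/N.
rhs = [v / len(features) for v in circular_convolve(dft(features), dft(gate))]

max_error = max(abs(l - r) for l, r in zip(lhs, rhs))
```

Here `max_error` measures the largest discrepancy between the two sides, which should be at the level of floating-point rounding.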

Generating Synthetic Oracle Datasets to Analyze Noise Impact: A Study on Building Function Classification Using Tweets

Shanshan Bai,Anna Kruspe,Xiaoxiang Zhu

Task: Study the role of tweets as semantic context in earth observation tasks, and build a synthetic dataset to analyze how feature noise affects building function classification.

Motivation: Existing methods collect tweets with geographic heuristics and label them via external databases, which introduces both label noise and sentence-level feature noise; the impact of the latter has not been sufficiently studied.

Details

Method: Proposes generating a synthetic dataset with an LLM that contains correctly labeled, semantically relevant tweets, and compares performance using Naive Bayes and mBERT models. Result: The synthetic dataset significantly improves mBERT performance, indicating that feature noise matters more than model complexity. Conclusion: The synthetic dataset provides a new tool for future noise studies and is publicly available on GitHub. Abstract: Tweets provide valuable semantic context for earth observation tasks and serve as a complementary modality to remote sensing imagery. In building function classification (BFC), tweets are often collected using geographic heuristics and labeled via external databases, an inherently weakly supervised process that introduces both label noise and sentence-level feature noise (e.g., irrelevant or uninformative tweets). While label noise has been widely studied, the impact of sentence-level feature noise remains underexplored, largely due to the lack of clean benchmark datasets for controlled analysis. In this work, we propose a method for generating a synthetic oracle dataset using an LLM, designed to contain only tweets that are both correctly labeled and semantically relevant to their associated buildings. This oracle dataset enables systematic investigation of noise impacts that are otherwise difficult to isolate in real-world data. To assess its utility, we compare model performance using Naive Bayes and mBERT classifiers under three configurations: real vs. synthetic training data, and cross-domain generalization. Results show that noise in real tweets significantly degrades the contextual learning capacity of mBERT, reducing its performance to that of a simple keyword-based model. In contrast, the clean synthetic dataset allows mBERT to learn effectively, outperforming Naive Bayes by a large margin. These findings highlight that addressing feature noise is more critical than model complexity in this task. Our synthetic dataset offers a novel experimental environment for future noise injection studies and is publicly available on GitHub.

Zero-shot Domain Generalization of Foundational Models for 3D Medical Image Segmentation: An Experimental Study

Soumitri Chattopadhyay,Basar Demir,Marc Niethammer

Task: Study the domain generalization ability of foundation models in medical image segmentation.

Motivation: Domain shift caused by differences in imaging modalities and acquisition protocols limits model generalization in medical image segmentation, while foundation models hold promise for zero-shot generalization.

Details

Method: Runs experiments with 6 medical segmentation foundation models and 12 public datasets to study performance across multiple modalities and anatomies. Result: Promptable foundation models can bridge the domain gap through smart prompting techniques. Conclusion: Foundation models show potential for domain generalization, and the study identifies possible directions for future research. Abstract: Domain shift, caused by variations in imaging modalities and acquisition protocols, limits model generalization in medical image segmentation. While foundation models (FMs) trained on diverse large-scale data hold promise for zero-shot generalization, their application to volumetric medical data remains underexplored. In this study, we examine their capacity for domain generalization (DG) by conducting a comprehensive experimental study encompassing 6 medical segmentation FMs and 12 public datasets spanning multiple modalities and anatomies. Our findings reveal the potential of promptable FMs in bridging the domain gap via smart prompting techniques. Additionally, by probing into multiple facets of zero-shot DG, we offer valuable insights into the viability of FMs for DG and identify promising avenues for future research.

Understanding Inequality of LLM Fact-Checking over Geographic Regions with Agent and Retrieval models

Bruno Coelho,Shujaat Mirza,Yuyuan Cui,Christina Pöpper,Damon McCoy

Task: Evaluate the fact-checking performance of large language models (LLMs) across different geographic regions.

Motivation: In response to the growing spread of disinformation, explore the potential use of LLMs for fact-checking and expose disparities in their performance.

Details

Method: Uses a dataset of 600 fact-checked statements to compare different LLMs under three experimental setups (statement only, an LLM agent with Wikipedia access, and a RAG system). Result: Regardless of scenario or LLM, performance on statements from the Global North is substantially better than on those from the Global South, and the Wikipedia-based system widens this gap further. Conclusion: Better dataset balancing and retrieval strategies are needed to improve LLM fact-checking in geographically diverse contexts. Abstract: Fact-checking is a potentially useful application of Large Language Models (LLMs) to combat the growing dissemination of disinformation. However, the performance of LLMs varies across geographic regions. In this paper, we evaluate the factual accuracy of open and private models across a diverse set of regions and scenarios. Using a dataset containing 600 fact-checked statements balanced across six global regions, we examine three experimental setups of fact-checking a statement: (1) when just the statement is available, (2) when an LLM-based agent with Wikipedia access is utilized, and (3) as a best-case scenario, when a Retrieval-Augmented Generation (RAG) system provided with the official fact check is employed. Our findings reveal that regardless of the scenario and LLM used, including GPT-4, Claude Sonnet, and LLaMA, statements from the Global North perform substantially better than those from the Global South. Furthermore, this gap is broadened for the more realistic case of a Wikipedia agent-based system, highlighting that overly general knowledge bases have a limited ability to address region-specific nuances. These results underscore the urgent need for better dataset balancing and robust retrieval strategies to enhance LLM fact-checking capabilities, particularly in geographically diverse contexts.

SIGHT: Single-Image Conditioned Generation of Hand Trajectories for Hand-Object Interaction

Alexey Gavryushin,Florian Redhardt,Gaia Di Lorenzo,Luc Van Gool,Marc Pollefeys,Kaichun Mo,Xi Wang

Task: Generate realistic and diverse 3D hand trajectories given a single image of an object.

Motivation: Priors over hand-object interaction trajectories are valuable for applications in robotics, embodied AI, augmented reality, and related fields.

Details

Method: Proposes the SIGHT-Fusion system, which combines visual feature extraction with a diffusion-based conditional motion generation model and is trained on video data without action-label supervision. Result: Experiments show that the method generates more realistic and appropriate trajectories than baselines and generalizes to unseen objects. Conclusion: The generated trajectories are validated in physics simulation, showing their realism and their potential for downstream applications. Abstract: We introduce a novel task of generating realistic and diverse 3D hand trajectories given a single image of an object, which could be involved in a hand-object interaction scene or pictured by itself. When humans grasp an object, appropriate trajectories naturally form in our minds to use it for specific tasks. Hand-object interaction trajectory priors can greatly benefit applications in robotics, embodied AI, augmented reality and related fields. However, synthesizing realistic and appropriate hand trajectories given a single object or hand-object interaction image is a highly ambiguous task, requiring correct identification of the object of interest and possibly even the correct interaction among many possible alternatives. To tackle this challenging problem, we propose the SIGHT-Fusion system, consisting of a curated pipeline for extracting visual features of hand-object interaction details from egocentric videos involving object manipulation, and a diffusion-based conditional motion generation model processing the extracted features. We train our method given video data with corresponding hand trajectory annotations, without supervision in the form of action labels. For the evaluation, we establish benchmarks utilizing the first-person FPHAB and HOI4D datasets, testing our method against various baselines and using multiple metrics. We also introduce task simulators for executing the generated hand trajectories and reporting task success rates as an additional metric. Experiments show that our method generates more appropriate and realistic hand trajectories than baselines and presents promising generalization capability on unseen objects. The accuracy of the generated hand trajectories is confirmed in a physics simulation setting, showcasing the authenticity of the created sequences and their applicability in downstream uses.

Resona: Improving Context Copying in Linear Recurrence Models with Retrieval

Xinyu Wang,Linrui Ma,Jerry Huang,Peng Lu,Prasanna Parthasarathi,Xiao-Wen Chang,Boxing Chen,Yufei Cui

Task: Propose Resona, a framework that augments linear recurrent models with retrieval.

Motivation: Linear recurrent models are computationally efficient, but they still lag behind Transformer models on tasks such as in-context learning.

Details

Method: Resona augments linear recurrent models by retrieving information from the input context, allowing them to adapt to diverse task requirements. Result: Experiments show that Resona-augmented models achieve significant performance gains on both synthetic and real-world natural language tasks. Conclusion: Resona is a general-purpose method that effectively improves the in-context learning and language modeling abilities of linear recurrent LLMs. Abstract: Recent shifts in the space of large language model (LLM) research have shown an increasing focus on novel architectures to compete with prototypical Transformer-based models that have long dominated this space. Linear recurrent models have proven to be a viable competitor due to their computational efficiency. However, such models still demonstrate a sizable gap compared to Transformers in terms of in-context learning among other tasks that require recalling information from a context. In this work, we introduce Resona, a simple and scalable framework for augmenting linear recurrent models with retrieval. Resona augments models with the ability to integrate retrieved information from the provided input context, enabling tailored behavior to diverse task requirements. Experiments on a variety of linear recurrent models demonstrate that Resona-augmented models observe significant performance gains on a variety of synthetic as well as real-world natural language tasks, highlighting its ability to act as a general-purpose method to improve the in-context learning and language modeling abilities of linear recurrent LLMs.

The Marine Debris Forward-Looking Sonar Datasets

Matias Valdenegro-Toro,Deepan Chakravarthi Padmanabhan,Deepak Singh,Bilal Wehbe,Yvan Petillot

Task: Present and publicly release the Marine Debris Forward-Looking Sonar datasets, which support multiple computer vision tasks.

Motivation: Address the lack of public data in sonar modalities and improve the ability to train AI systems for underwater robotics.

Details

Method: Provides datasets from three different settings (watertank, turntable, flooded quarry) to increase diversity, supporting tasks such as classification, detection, and segmentation. Result: The datasets are publicly available, and initial task results and analysis are provided. Conclusion: The datasets are expected to benefit the research community. Abstract: Sonar sensing is fundamental for underwater robotics, but limited by capabilities of AI systems, which need large training datasets. Public data in sonar modalities is lacking. This paper presents the Marine Debris Forward-Looking Sonar datasets, with three different settings (watertank, turntable, flooded quarry) increasing dataset diversity and multiple computer vision tasks: object classification, object detection, semantic segmentation, patch matching, and unsupervised learning. We provide full dataset description, basic analysis and initial results for some tasks. We expect the research community will benefit from this dataset, which is publicly available at https://doi.org/10.5281/zenodo.15101686

SUV: Scalable Large Language Model Copyright Compliance with Regularized Selective Unlearning

Tianyang Xu,Xiaoze Liu,Feijie Wu,Xiaoqian Wang,Jing Gao

Task: Propose a selective unlearning framework (SUV) that prevents large language models from memorizing copyrighted content while preserving their overall performance.

Motivation: As large language models advance rapidly, they face legal risk because they may generate copyrighted content, so a way to reduce that risk is needed.

Details

Method: Builds a dataset of infringement cases, uses Direct Preference Optimization (DPO) to replace copyrighted content, and combines gradient projection with Fisher information regularization to limit performance loss. Result: Experiments show that SUV significantly reduces verbatim memorization with minimal impact on unrelated tasks. Conclusion: SUV offers an effective way to reduce LLM copyright risk in real-world applications. Abstract: Large Language Models (LLMs) have transformed natural language processing by learning from massive datasets, yet this rapid progress has also drawn legal scrutiny, as the ability to unintentionally generate copyrighted content has already prompted several prominent lawsuits. In this work, we introduce SUV (Selective Unlearning for Verbatim data), a selective unlearning framework designed to prevent LLMs from memorizing copyrighted content while preserving their overall utility. In detail, the proposed method constructs a dataset that captures instances of copyright infringement by the targeted LLM. With the dataset, we unlearn the content from the LLM by means of Direct Preference Optimization (DPO), which replaces the verbatim copyrighted content with plausible and coherent alternatives. Since DPO may hinder the LLM's performance in other unrelated tasks, we integrate gradient projection and Fisher information regularization to mitigate the degradation. We validate our approach using a large-scale dataset of 500 famous books (predominantly copyrighted works) and demonstrate that SUV significantly reduces verbatim memorization with negligible impact on the performance on unrelated tasks. Extensive experiments on both our dataset and public benchmarks confirm the scalability and efficacy of our approach, offering a promising solution for mitigating copyright risks in real-world LLM applications.
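
The two stabilizers named in the abstract, gradient projection and Fisher information regularization, can be illustrated with a small sketch. The abstract does not state the exact formulation, so the code below uses two commonly seen forms as stand-ins: projecting out the part of the unlearning gradient that opposes a retain-task gradient, and a diagonal-Fisher quadratic penalty on parameter drift. Function names and all numbers are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out_conflict(unlearn_grad, retain_grad):
    """Remove the component of the unlearning gradient that opposes the retain gradient.

    If the two gradients do not conflict (non-negative dot product), the
    unlearning gradient is returned unchanged. This projection rule is an
    assumed stand-in, not the rule specified by the SUV authors.
    """
    overlap = dot(unlearn_grad, retain_grad)
    if overlap >= 0:
        return list(unlearn_grad)
    scale = overlap / dot(retain_grad, retain_grad)
    return [u - scale * r for u, r in zip(unlearn_grad, retain_grad)]


def fisher_penalty(params, reference_params, fisher_diag, strength=1.0):
    """Quadratic penalty that discourages moving parameters with high Fisher information."""
    return strength * sum(
        f * (p - p0) ** 2 for f, p, p0 in zip(fisher_diag, params, reference_params)
    )


# Hypothetical 2-D gradients: the unlearning step would hurt the retain task.
projected = project_out_conflict([1.0, -1.0], [0.0, 1.0])

# Hypothetical parameters: only the second one has moved from its reference value.
penalty = fisher_penalty([1.0, 2.0], [1.0, 1.0], [10.0, 0.5])
```

Both pieces serve the same goal described in the abstract: letting the model forget the targeted passages while limiting degradation on unrelated tasks.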

Pairwise Matching of Intermediate Representations for Fine-grained Explainability

Lauren Shrack,Timm Haucke,Antoine Salaün,Arjun Subramonian,Sara Beery

Task: Propose a new explainability method (PAIR-X) for generating fine-grained, highly localized pairwise visual explanations.

Motivation: Existing explainability techniques for deep learning models are often too diffuse to provide useful, interpretable analysis of the differences between fine-grained categories.

Details

Method: Combines intermediate model activations with backpropagated relevance scores to generate fine-grained pairwise visual explanations. Result: On 35 public re-identification datasets, PAIR-X gives better visual explanations than a range of baselines, and expert interviews confirm the improvement. Conclusion: By improving interpretability, PAIR-X helps humans better distinguish correct matches from incorrect ones. Abstract: The differences between images belonging to fine-grained categories are often subtle and highly localized, and existing explainability techniques for deep learning models are often too diffuse to provide useful and interpretable explanations. We propose a new explainability method (PAIR-X) that leverages both intermediate model activations and backpropagated relevance scores to generate fine-grained, highly-localized pairwise visual explanations. We use animal and building re-identification (re-ID) as a primary case study of our method, and we demonstrate qualitatively improved results over a diverse set of explainability baselines on 35 public re-ID datasets. In interviews, animal re-ID experts were in unanimous agreement that PAIR-X was an improvement over existing baselines for deep model explainability, and suggested that its visualizations would be directly applicable to their work. We also propose a novel quantitative evaluation metric for our method, and demonstrate that PAIR-X visualizations appear more plausible for correct image matches than incorrect ones even when the model similarity score for the pairs is the same. By improving interpretability, PAIR-X enables humans to better distinguish correct and incorrect matches. Our code is available at: https://github.com/pairx-explains/pairx

Can LLMs Support Medical Knowledge Imputation? An Evaluation-Based Perspective

Xinyu Yao,Aditya Sannabhadti,Holly Wiberg,Karmel S. Shehadeh,Rema Padman

Task: Explore the use of large language models (LLMs) to impute missing treatment relationships in medical knowledge graphs.

Motivation: Medical knowledge graphs are incomplete in their treatment mappings; existing coding systems (such as ICD, Mondo, and ATC) lack full coverage, leaving associations between diseases and treatments missing or inconsistent.

Details

Method: Systematically evaluates LLM-driven treatment mapping and assesses its reliability through benchmark comparisons. Result: The study finds critical limitations of LLMs, including inconsistencies with clinical guidelines and risks to patient safety. Conclusion: The study cautions researchers and practitioners to evaluate LLM applications carefully and recommends hybrid approaches for enhancing treatment mappings in medical knowledge graphs. Abstract: Medical knowledge graphs (KGs) are essential for clinical decision support and biomedical research, yet they often exhibit incompleteness due to knowledge gaps and structural limitations in medical coding systems. This issue is particularly evident in treatment mapping, where coding systems such as ICD, Mondo, and ATC lack comprehensive coverage, resulting in missing or inconsistent associations between diseases and their potential treatments. To address this issue, we have explored the use of Large Language Models (LLMs) for imputing missing treatment relationships. Although LLMs offer promising capabilities in knowledge augmentation, their application in medical knowledge imputation presents significant risks, including factual inaccuracies, hallucinated associations, and instability between and within LLMs. In this study, we systematically evaluate LLM-driven treatment mapping, assessing its reliability through benchmark comparisons. Our findings highlight critical limitations, including inconsistencies with established clinical guidelines and potential risks to patient safety. This study serves as a cautionary guide for researchers and practitioners, underscoring the importance of critical evaluation and hybrid approaches when leveraging LLMs to enhance treatment mappings on medical knowledge graphs.

AutoComPose: Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs

Yi-Ting Shen,Sungmin Eum,Doheon Lee,Rohit Shete,Chiao-Yi Wang,Heesung Kwon,Shuvra S. Bhattacharyya

Task: Propose AutoComPose, a method that uses multimodal large language models (MLLMs) to automatically generate rich, structured descriptions of human pose transitions.

Motivation: Address the scarcity and inconsistency of annotations in existing CPR datasets, reducing annotation cost and increasing diversity.

Details

Method: Structures transitions into fine-grained body-part movements, introduces mirrored/swapped variations, and applies a cyclic consistency constraint to ensure logical coherence. Result: Experiments show that annotations generated by AutoComPose outperform human-annotated and heuristic-based ones, significantly reducing cost and improving retrieval quality. Conclusion: AutoComPose provides a scalable basis for automatic annotation in CPR research and supports future work in the field. Abstract: Composed pose retrieval (CPR) enables users to search for human poses by specifying a reference pose and a transition description, but progress in this field is hindered by the scarcity and inconsistency of annotated pose transitions. Existing CPR datasets rely on costly human annotations or heuristic-based rule generation, both of which limit scalability and diversity. In this work, we introduce AutoComPose, the first framework that leverages multimodal large language models (MLLMs) to automatically generate rich and structured pose transition descriptions. Our method enhances annotation quality by structuring transitions into fine-grained body part movements and introducing mirrored/swapped variations, while a cyclic consistency constraint ensures logical coherence between forward and reverse transitions. To advance CPR research, we construct and release two dedicated benchmarks, AIST-CPR and PoseFixCPR, supplementing prior datasets with enhanced attributes. Extensive experiments demonstrate that training retrieval models with AutoComPose yields superior performance over human-annotated and heuristic-based methods, significantly reducing annotation costs while improving retrieval quality. Our work pioneers the automatic annotation of pose transitions, establishing a scalable foundation for future CPR research.
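
One plausible reading of the mirrored variations is that a transition description is duplicated with its left/right references swapped, to pair with a horizontally flipped pose. The abstract does not describe the actual procedure, so the word-level swap below, the function name `mirror_description`, and the example sentence are purely illustrative assumptions.

```python
import re

# Hypothetical left/right swap table for mirrored descriptions.
_SWAP = {"left": "right", "right": "left"}


def mirror_description(text):
    """Swap 'left' and 'right' in a pose transition description, preserving capitalization."""

    def swap(match):
        word = match.group(0)
        replacement = _SWAP[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"\b(left|right)\b", swap, text, flags=re.IGNORECASE)


original = "Raise the left arm and step forward with the right foot."
mirrored = mirror_description(original)
```

Applying the function twice returns the original sentence, a small-scale analogue of the consistency one would want between a description and its mirrored counterpart.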

XL-Instruct: Synthetic Data for Cross-Lingual Open-Ended Generation

Vivek Iyer,Ricardo Rei,Pinzhen Chen,Alexandra Birch

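Task: Evaluate and improve the performance of large language models (LLMs) on cross-lingual open-ended generation.

Motivation: Cross-lingual open-ended generation is an important but understudied problem that needs new benchmarks and data generation methods.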

Details

Method: Introduces the XL-AlpacaEval benchmark and the XL-Instruct synthetic data generation method, and fine-tunes models on 8K XL-Instruct-generated instructions. Result: Fine-tuning significantly improves performance, raising the win rate against GPT-4o-Mini from 7.4% to 21.5%, improving several fine-grained quality metrics, and showing strong zero-shot transfer. Conclusion: The authors recommend incorporating XL-Instruct into the post-training pipeline of future multilingual LLMs and plan to release the XL-Instruct and XL-AlpacaEval datasets. Abstract: Cross-lingual open-ended generation -- i.e. generating responses in a desired language different from that of the user's query -- is an important yet understudied problem. We introduce XL-AlpacaEval, a new benchmark for evaluating cross-lingual generation capabilities in Large Language Models (LLMs), and propose XL-Instruct, a high-quality synthetic data generation method. Fine-tuning with just 8K XL-Instruct-generated instructions significantly improves model performance, increasing the win rate against GPT-4o-Mini from 7.4% to 21.5%, and improving on several fine-grained quality metrics. Additionally, models fine-tuned on XL-Instruct exhibit strong zero-shot transfer to both English-only and multilingual generation tasks. Given its consistent gains across the board, we strongly recommend incorporating XL-Instruct in the post-training pipeline of future multilingual LLMs. To facilitate further research, we will publicly and freely release the XL-Instruct and XL-AlpacaEval datasets, which constitute two of the few cross-lingual resources currently available in the literature.

MedCL: Learning Consistent Anatomy Distribution for Scribble-supervised Medical Image Segmentation

Ke Zhang,Vishal M. Patel

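Task: Propose MedCL, a scribble-supervised clustering-based framework for weakly supervised medical image segmentation.

Motivation: Annotating medical images is costly and time-consuming, and existing scribble-supervised methods require large amounts of annotation and have only been applied to regular organ segmentation.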

Details

Method: Mixes features through intra- and inter-image operations, clusters the features, and regularizes the anatomy distribution at both local and global levels. Result: MedCL outperforms conventional methods on both regular-organ and irregular-pathology segmentation, even with less scribble supervision. Conclusion: MedCL is an effective weakly supervised medical image segmentation method applicable to a range of anatomical structures. Abstract: Curating large-scale fully annotated datasets is expensive, laborious, and cumbersome, especially for medical images. Several methods have been proposed in the literature that make use of weak annotations in the form of scribbles. However, these approaches require large amounts of scribble annotations, and are only applied to the segmentation of regular organs, which are often unavailable for the disease species that fall in the long-tailed distribution. Motivated by the fact that the medical labels have anatomy distribution priors, we propose a scribble-supervised clustering-based framework, called MedCL, to learn the inherent anatomy distribution of medical labels. Our approach consists of two steps: i) Mix the features with intra- and inter-image mix operations, and ii) Perform feature clustering and regularize the anatomy distribution at both local and global levels. Combined with a small amount of weak supervision, the proposed MedCL is able to segment both regular organs and challenging irregular pathologies. We implement MedCL based on SAM and UNet backbones, and evaluate the performance on three open datasets of regular structure (MSCMRseg), multiple organs (BTCV) and irregular pathology (MyoPS). It is shown that even with less scribble supervision, MedCL substantially outperforms the conventional segmentation methods. Our code is available at https://github.com/BWGZK/MedCL.

FReM: A Flexible Reasoning Mechanism for Balancing Quick and Slow Thinking in Long-Context Question Answering

Zhengyi Zhao,Shubo Zhang,Zezhong Wang,Bin Liang,Binyang Li,Kam-Fai Wong

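Task: Propose FReM, a flexible reasoning mechanism that adjusts reasoning depth to question complexity in order to improve long-context question answering (LCQA).

Motivation: Address the limitations of existing slow and quick reasoning modes: slow reasoning tends to overthink and waste time, while quick reasoning relies on pattern matching without truly understanding the query logic.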

Details

Method: Uses synthetic reference QA examples to provide an explicit chain of thought and adjusts reasoning depth dynamically according to question complexity. Result: Experiments on seven QA datasets show that FReM improves reasoning accuracy and scalability, particularly on complex multihop questions. Conclusion: FReM has the potential to advance LCQA methods by balancing reasoning efficiency and depth. Abstract: Long-context question-answering (LCQA) systems have greatly benefited from the powerful reasoning capabilities of large language models (LLMs), which can be categorized into slow and quick reasoning modes. However, both modes have their limitations. Slow thinking generally tends to explore every possible reasoning path, which leads to heavy overthinking and wastes time. Quick thinking usually relies on pattern matching rather than truly understanding the query logic, which misses proper understanding. To address these issues, we propose FReM: Flexible Reasoning Mechanism, a method that adjusts reasoning depth according to the complexity of each question. Specifically, FReM leverages synthetic reference QA examples to provide an explicit chain of thought, enabling efficient handling of simple queries while allowing deeper reasoning for more complex ones. By doing so, FReM helps quick-thinking models move beyond superficial pattern matching and narrows the reasoning space for slow-thinking models to avoid unnecessary exploration. Experiments on seven QA datasets show that FReM improves reasoning accuracy and scalability, particularly for complex multihop questions, indicating its potential to advance LCQA methodologies.

Heng Yu,Juze Zhang,Changan Chen,Tiange Xiang,Yusu Fang,Juan Carlos Niebles,Ehsan Adeli

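Task: Propose SocialGen, a unified motion-language model for modeling interaction behaviors among an arbitrary number of individuals.

Motivation: Real-world human social interactions are diverse, yet existing methods support only two-person interactions and cannot cover multi-person scenarios.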

Details

Method: Proposes a new social motion representation that tokenizes the motions of an arbitrary number of individuals and aligns them with the language space. Result: Achieves state-of-the-art performance on motion-language tasks and establishes a benchmark for multi-human interaction tasks based on the curated SocialX dataset. Conclusion: SocialGen sets a new standard for multi-human interaction modeling while addressing data scarcity. Abstract: Human interactions in everyday life are inherently social, involving engagements with diverse individuals across various contexts. Modeling these social interactions is fundamental to a wide range of real-world applications. In this paper, we introduce SocialGen, the first unified motion-language model capable of modeling interaction behaviors among varying numbers of individuals, to address this crucial yet challenging problem. Unlike prior methods that are limited to two-person interactions, we propose a novel social motion representation that supports tokenizing the motions of an arbitrary number of individuals and aligning them with the language space. This alignment enables the model to leverage rich, pretrained linguistic knowledge to better understand and reason about human social behaviors. To tackle the challenges of data scarcity, we curate a comprehensive multi-human interaction dataset, SocialX, enriched with textual annotations. Leveraging this dataset, we establish the first comprehensive benchmark for multi-human interaction tasks. Our method achieves state-of-the-art performance across motion-language tasks, setting a new standard for multi-human interaction modeling.

Sparse Mixture of Experts as Unified Competitive Learning

Giang Do,Hung Le,Truyen Tran

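Task: Investigate the generalization ability of Sparse Mixture of Experts (SMoE) beyond generation tasks and propose an improved framework, USMoE.

Motivation: Existing SMoE methods struggle on tasks such as MTEB: Token Choice may focus too heavily on irrelevant experts, while Expert Choice may discard important tokens.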

Details

Method: Proposes Unified Competitive Learning SMoE (USMoE), which improves existing SMoEs based on an analysis of their mechanisms through the lens of competitive learning. Result: USMoE outperforms traditional approaches across various tasks, with up to a 10% performance improvement or a 14% reduction in computational inference cost. Conclusion: USMoE is an efficient framework that improves SMoE performance both with and without training. Abstract: Sparse Mixture of Experts (SMoE) improves the efficiency of large language model training by directing input tokens to a subset of experts. Despite its success in generation tasks, its generalization ability remains an open question. In this paper, we demonstrate that current SMoEs, which fall into two categories, (1) Token Choice and (2) Expert Choice, struggle with tasks such as the Massive Text Embedding Benchmark (MTEB). By analyzing their mechanism through the lens of competitive learning, our study finds that the Token Choice approach may overly focus on irrelevant experts, while the Expert Choice approach risks discarding important tokens, potentially affecting performance. Motivated by this analysis, we propose Unified Competitive Learning SMoE (USMoE), a novel and efficient framework designed to improve the performance of existing SMoEs in both scenarios: with and without training. Extensive experiments across various tasks show that USMoE achieves up to a 10% improvement over traditional approaches or reduces computational inference costs by 14% while maintaining strong performance.
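The two routing families named in the abstract can be contrasted with a minimal sketch. It assumes a plain matrix of router scores (tokens by experts) with made-up values, and it shows only the generic Token Choice and Expert Choice selection rules, not the USMoE method itself.

```python
def token_choice(scores, k):
    """Token Choice: each token keeps its k highest-scoring experts."""
    routes = []
    for row in scores:
        ranked = sorted(range(len(row)), key=lambda e: row[e], reverse=True)
        routes.append(sorted(ranked[:k]))
    return routes


def expert_choice(scores, capacity):
    """Expert Choice: each expert keeps its `capacity` highest-scoring tokens."""
    n_tokens, n_experts = len(scores), len(scores[0])
    routes = [[] for _ in range(n_tokens)]
    for e in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)
        for t in ranked[:capacity]:
            routes[t].append(e)
    return routes


# Hypothetical router scores: 3 tokens x 2 experts.
scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.6, 0.4],
]
tc = token_choice(scores, k=1)          # experts assigned to each token
ec = expert_choice(scores, capacity=1)  # experts assigned to each token
```

With these scores, Token Choice sends all three tokens to expert 0, while Expert Choice leaves the second token with no expert at all, which is the kind of token discarding the abstract warns about.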

Enhancing DeepLabV3+ to Fuse Aerial and Satellite Images for Semantic Segmentation

Anas Berka,Mohamed El Hajji,Raphael Canals,Youssef Es-saady,Adel Hafiane

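Task: Enhance the DeepLabV3+ architecture to fuse satellite and aerial imagery for multimodal land cover segmentation.

Motivation: Aerial and satellite images are complementary for land cover segmentation, but existing methods such as DeepLabV3+ still need better robustness and performance, particularly for multimodal image fusion.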

Details

Method: Adds a new block of transposed convolutional layers to DeepLabV3+ that upsamples the satellite image input and fuses it with high-level features from the aerial images. Result: Achieves a mean Intersection over Union (mIoU) of 84.91% without data augmentation. Conclusion: The enhanced DeepLabV3+ architecture improves land cover segmentation through multimodal image fusion. Abstract: Aerial and satellite imagery are inherently complementary remote sensing sources, offering high-resolution detail alongside expansive spatial coverage. However, the use of these sources for land cover segmentation introduces several challenges, prompting the development of a variety of segmentation methods. Among these approaches, the DeepLabV3+ architecture is considered a promising approach in the field of single-source image segmentation. However, despite its reliable results for segmentation, there is still a need to increase its robustness and improve its performance. This is particularly crucial for multimodal image segmentation, where the fusion of diverse types of information is essential. An interesting approach involves enhancing this architectural framework through the integration of novel components and the modification of certain internal processes. In this paper, we enhance the DeepLabV3+ architecture by introducing a new transposed convolutional layers block for upsampling a second entry to fuse it with high level features. This block is designed to amplify and integrate information from satellite images, thereby enriching the segmentation process through fusion with aerial images. For experiments, we used the LandCover.ai (Land Cover from Aerial Imagery) dataset for aerial images, alongside the corresponding dataset sourced from Sentinel 2 data. Through the fusion of both sources, we achieve a mean Intersection over Union (mIoU) of 84.91% without data augmentation.
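Since the headline result is reported as mIoU, a short sketch of how that metric is commonly computed may help. The label lists below are invented for illustration, and the paper's exact evaluation protocol (for example, how absent classes are handled) is not stated in the abstract, so skipping them is an assumption.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union over classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical flattened label maps with three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.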

S2MoE: Robust Sparse Mixture of Experts via Stochastic Learning

Giang Do,Hung Le,Truyen Tran

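Task: Propose S2MoE, a new method that uses stochastic learning to address representation collapse when training Sparse Mixture of Experts (SMoE) models.

Motivation: Existing SMoE methods suffer from expert embeddings that are much smaller than the model dimension and from routing mechanisms that cause experts to learn overly similar features.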

Details

Method: Proposes S2MoE, a mixture of experts that learns from both deterministic and non-deterministic inputs via learning under uncertainty. Result: S2MoE achieves performance comparable to other routing methods while reducing computational inference cost by 28%. Conclusion: S2MoE is an efficient SMoE training method that mitigates representation collapse and lowers computational cost. Abstract: Sparse Mixture of Experts (SMoE) enables efficient training of large language models by routing input tokens to a select number of experts. However, training SMoE remains challenging due to the issue of representation collapse. Recent studies have focused on improving the router to mitigate this problem, but existing approaches face two key limitations: (1) expert embeddings are significantly smaller than the model's dimension, contributing to representation collapse, and (2) routing each input to the Top-K experts can cause them to learn overly similar features. In this work, we propose a novel approach called Robust Sparse Mixture of Experts via Stochastic Learning (S2MoE), which is a mixture of experts designed to learn from both deterministic and non-deterministic inputs via Learning under Uncertainty. Extensive experiments across various tasks demonstrate that S2MoE achieves performance comparable to other routing methods while reducing computational inference costs by 28%.

DIFFER: Disentangling Identity Features via Semantic Cues for Clothes-Changing Person Re-ID

Xin Liang,Yogesh S Rawat

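Task: Propose DIFFER, a new adversarial learning method that uses textual descriptions to disentangle identity features for clothes-changing person re-identification.

Motivation: Existing methods either rely on additional modalities (such as silhouette and pose) to model body shape, which may overlook other important biometric traits, or supervise with discrete labels that cannot provide comprehensive descriptions.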

Details

Method: DIFFER uses the separable nature of text descriptions as supervision and, through the NBDetach mechanism and gradient reversal layers, separates identity-related features from non-biometric features. Result: State-of-the-art performance on four benchmark datasets (LTCC, PRCC, CelebReID-Light, CCVID), with top-1 accuracy improvements over the baseline of 3.6%, 3.4%, 2.5%, and 1%, respectively. Conclusion: DIFFER disentangles features effectively through text supervision and improves clothes-changing person re-identification. Abstract: Clothes-changing person re-identification (CC-ReID) aims to recognize individuals under different clothing scenarios. Current CC-ReID approaches either concentrate on modeling body shape using additional modalities including silhouette, pose, and body mesh, potentially causing the model to overlook other critical biometric traits such as gender, age, and style, or they incorporate supervision through additional labels that the model tries to disregard or emphasize, such as clothing or personal attributes. However, these annotations are discrete in nature and do not capture comprehensive descriptions. In this work, we propose DIFFER: Disentangle Identity Features From Entangled Representations, a novel adversarial learning method that leverages textual descriptions to disentangle identity features. Recognizing that image features inherently mix inseparable information, DIFFER introduces NBDetach, a mechanism designed for feature disentanglement by leveraging the separable nature of text descriptions as supervision. It partitions the feature space into distinct subspaces and, through gradient reversal layers, effectively separates identity-related features from non-biometric features. We evaluate DIFFER on 4 different benchmark datasets (LTCC, PRCC, CelebReID-Light, and CCVID) to demonstrate its effectiveness and provide state-of-the-art performance across all the benchmarks. DIFFER consistently outperforms the baseline method, with improvements in top-1 accuracy of 3.6% on LTCC, 3.4% on PRCC, 2.5% on CelebReID-Light, and 1% on CCVID. The code is publicly available.
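The gradient reversal layers mentioned above follow a simple rule: identity in the forward pass, negated (and optionally scaled) gradient in the backward pass. The sketch below writes that rule by hand in plain Python so it stays self-contained; the class name and the `lam` scale are illustrative assumptions, and a real implementation would normally hook into an autograd framework rather than expose `backward` directly.

```python
class GradientReversal:
    """Toy gradient reversal layer: identity forward, negated gradient backward."""

    def __init__(self, lam=1.0):
        self.lam = lam  # assumed scaling factor for the reversed gradient

    def forward(self, x):
        # The features pass through unchanged.
        return list(x)

    def backward(self, grad_output):
        # The upstream gradient is multiplied by -lam before flowing back,
        # so the feature extractor is pushed to *hurt* the downstream head.
        return [-self.lam * g for g in grad_output]


layer = GradientReversal(lam=2.0)
features = layer.forward([1.0, -2.0])
grads = layer.backward([0.5, -1.0])
```

In an adversarial setup of this kind, the head after the layer learns to predict a nuisance attribute while the reversed gradient discourages the shared features from encoding it.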

A Retrieval-Augmented Knowledge Mining Method with Deep Thinking LLMs for Biomedical Research and Clinical Support

Yichun Feng,Jiawei Wang,Ruikun He,Lu Zhou,Yixue Li

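Task: Propose a pipeline combining large language models (LLMs) and knowledge graphs (BioStrataKG and BioCDQA), and introduce the IP-RAR method to improve retrieval and reasoning over biomedical knowledge.

Motivation: Knowledge graph construction is currently limited by complex terminology, data heterogeneity, and rapid knowledge evolution, while LLMs show limitations in retrieval and reasoning that make it difficult to uncover cross-document associations and reasoning pathways.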

Details

Method: Uses LLMs to construct a biomedical knowledge graph (BioStrataKG), builds a cross-document question-answering dataset (BioCDQA), and develops IP-RAR, which combines integrated and progressive retrieval-augmented reasoning. Result: IP-RAR improves document retrieval F1 score by 20% and answer generation accuracy by 25% over existing methods. Conclusion: The framework helps doctors integrate treatment evidence for personalized medication plans and helps researchers analyze advancements and research gaps, accelerating scientific discovery and decision-making. Abstract: Knowledge graphs and large language models (LLMs) are key tools for biomedical knowledge integration and reasoning, facilitating structured organization of scientific articles and discovery of complex semantic relationships. However, current methods face challenges: knowledge graph construction is limited by complex terminology, data heterogeneity, and rapid knowledge evolution, while LLMs show limitations in retrieval and reasoning, making it difficult to uncover cross-document associations and reasoning pathways. To address these issues, we propose a pipeline that uses LLMs to construct a biomedical knowledge graph (BioStrataKG) from large-scale articles and builds a cross-document question-answering dataset (BioCDQA) to evaluate latent knowledge retrieval and multi-hop reasoning. We then introduce Integrated and Progressive Retrieval-Augmented Reasoning (IP-RAR) to enhance retrieval accuracy and knowledge reasoning. IP-RAR maximizes information recall through Integrated Reasoning-based Retrieval and refines knowledge via Progressive Reasoning-based Generation, using self-reflection to achieve deep thinking and precise contextual understanding. Experiments show that IP-RAR improves document retrieval F1 score by 20% and answer generation accuracy by 25% over existing methods. This framework helps doctors efficiently integrate treatment evidence for personalized medication plans and enables researchers to analyze advancements and research gaps, accelerating scientific discovery and decision-making.

Unsupervised Feature Disentanglement and Augmentation Network for One-class Face Anti-spoofing

Pei-Kai Huang,Jun-Xiong Chong,Ming-Tsung Hsu,Fang-Yu Hsu,Yi-Ting Lin,Kai-Heng Chien,Hao-Chiang Shao,Chiou-Ting Hsu

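Task: Propose UFDANet, a one-class face anti-spoofing technique that improves generalizability through feature disentanglement and augmentation.

Motivation: Address the limited robustness of one-class face anti-spoofing methods to domain information entangled within the liveness features.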

Details

Method: Uses an unsupervised feature disentangling method to separate liveness and domain features, combined with feature augmentation schemes that synthesize liveness features of unseen spoof classes and unseen domain features. Result: UFDANet outperforms previous one-class methods and performs comparably to state-of-the-art two-class methods. Conclusion: UFDANet improves the generalizability and performance of one-class face anti-spoofing through feature disentanglement and augmentation. Abstract: Face anti-spoofing (FAS) techniques aim to enhance the security of facial identity authentication by distinguishing authentic live faces from deceptive attempts. While two-class FAS methods risk overfitting to training attacks to achieve better performance, one-class FAS approaches handle unseen attacks well but are less robust to domain information entangled within the liveness features. To address this, we propose an Unsupervised Feature Disentanglement and Augmentation Network (UFDANet), a one-class FAS technique that enhances generalizability by augmenting face images via disentangled features. The UFDANet employs a novel unsupervised feature disentangling method to separate the liveness and domain features, facilitating discriminative feature learning. It integrates an out-of-distribution liveness feature augmentation scheme to synthesize new liveness features of unseen spoof classes, which deviate from the live class, thus enhancing the representability and discriminability of liveness features. Additionally, UFDANet incorporates a domain feature augmentation routine to synthesize unseen domain features, thereby achieving better generalizability. Extensive experiments demonstrate that the proposed UFDANet outperforms previous one-class FAS methods and achieves comparable performance to state-of-the-art two-class FAS methods.

Hongjia Liu,Jinlong Li

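Task: Propose a training-free framework with an interaction mechanism that addresses information loss when executing contextually related subtasks.

Motivation: When existing methods decompose complex tasks into subtasks, contextually related subtasks may lose information, leading to redundant operations or execution failures.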

Details

Method: Introduces an interaction mechanism and a subtask trajectory memory that let a subtask query information from, or trigger actions in, completed subtasks; adds a new action that generates a concise description of a subtask's execution process and outcomes. Result: On WebShop and HotpotQA with GPT-3.5 and GPT-4, the framework outperforms state-of-the-art training-free baselines. Conclusion: The interaction mechanism and trajectory memory mitigate information loss between subtasks and improve task execution. Abstract: Large language models (LLMs) have shown remarkable capabilities in solving complex tasks. Recent work has explored decomposing such tasks into subtasks with independent contexts. However, some contextually related subtasks may encounter information loss during execution, leading to redundant operations or execution failures. To address this issue, we propose a training-free framework with an interaction mechanism, which enables a subtask to query specific information or trigger certain actions in completed subtasks by sending requests. To implement interaction, we introduce a subtask trajectory memory to enable resumption of completed subtasks upon receiving interaction requests. Additionally, we propose a new action during execution, which generates a concise and precise description of execution process and outcomes of a subtask, to assist subsequent subtasks in determining interaction targets and requests. We evaluate our framework on interactive decision-making task WebShop and multi-hop question answering HotpotQA, with GPT-3.5 and GPT-4, and comparison results show that our framework outperforms the state-of-the-art training-free baselines.
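One way to picture the subtask trajectory memory is as a store that keeps each completed subtask's steps plus a short summary that later subtasks can inspect. The sketch below is a data-structure illustration only: the class and method names are hypothetical, and the keyword lookup in `query` is a stand-in for what the abstract describes as resuming a completed subtask with an LLM.

```python
class TrajectoryMemory:
    """Hypothetical store of per-subtask steps and summaries."""

    def __init__(self):
        self._trajectories = {}  # subtask id -> list of recorded steps
        self._summaries = {}     # subtask id -> concise description

    def record(self, subtask_id, step):
        self._trajectories.setdefault(subtask_id, []).append(step)

    def finish(self, subtask_id, summary):
        # The summary plays the role of the "concise description" action.
        self._summaries[subtask_id] = summary

    def summaries(self):
        # Later subtasks read these to decide which subtask to ask.
        return dict(self._summaries)

    def query(self, subtask_id, keyword):
        # Stand-in for an interaction request: return the matching steps.
        steps = self._trajectories.get(subtask_id, [])
        return [s for s in steps if keyword.lower() in s.lower()]


memory = TrajectoryMemory()
memory.record("search", "Opened product page: price $12.99")
memory.record("search", "Clicked option 'size M'")
memory.finish("search", "Found a candidate product and selected size M")
hits = memory.query("search", "price")
```

A follow-up subtask that needs the price can look at the summaries, pick the "search" subtask as its interaction target, and recover the detail without repeating the search.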

Bi-Level Multi-View fuzzy Clustering with Exponential Distance

Kristina P. Sinaga

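Task: Extend fuzzy c-means (FCM) clustering to multi-view environments.

Motivation: Propose two multi-view FCM methods (E-MVFCM and EB-MVFCM) that simplify the generation of heat-kernel coefficients and automatically compute feature and weight factors.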

Details

Method: Introduces E-MVFCM, which considers heat-kernel coefficients and weight factors, and EB-MVFCM, which computes feature and weight factors automatically. Result: Both methods simplify the generation of heat-kernel coefficients, and EB-MVFCM additionally computes feature and weight factors automatically. Conclusion: The proposed methods are effective in multi-view environments, with EB-MVFCM offering greater automation. Abstract: In this study, we propose extensions of fuzzy c-means (FCM) clustering to multi-view environments. First, we introduce an exponential multi-view FCM (E-MVFCM). E-MVFCM is a centralized MVC that considers heat-kernel coefficients (H-KC) and weight factors. Second, we propose an exponential bi-level multi-view fuzzy c-means clustering (EB-MVFCM). Unlike E-MVFCM, EB-MVFCM computes feature and weight factors automatically and simultaneously. Like E-MVFCM, EB-MVFCM presents explicit forms of the H-KC to simplify the generation of the heat-kernel $\mathcal{K}(t)$ in powers of the proper time $t$ during the clustering process. All the features used in this study, including tools and functions of proposed algorithms will be made available at https://www.github.com/KristinaP09/EB-MVFCM.
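For readers unfamiliar with the base algorithm, the sketch below shows the membership update of standard single-view fuzzy c-means on one-dimensional data. It is background only: it uses plain absolute distance rather than the exponential distance, heat-kernel coefficients, or view weights of E-MVFCM and EB-MVFCM, and the data values are made up.

```python
def fcm_memberships(points, centers, m=2.0):
    """Standard FCM membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))."""
    power = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        dists = [abs(x - c) for c in centers]
        if 0.0 in dists:
            # A point sitting exactly on a center belongs fully to it
            # (shared equally if several centers coincide).
            zeros = dists.count(0.0)
            row = [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
        else:
            row = [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]
        memberships.append(row)
    return memberships


# Hypothetical 1-D data with two cluster centers.
u = fcm_memberships(points=[1.0, 3.0], centers=[0.0, 4.0], m=2.0)
```

The point at 1.0 is three times closer to the first center than to the second, so with m = 2 it receives memberships 0.9 and 0.1; each row always sums to 1.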

Efficient Inference for Large Reasoning Models: A Survey

Yue Liu,Jiaying Wu,Yufei He,Hongcheng Gao,Hongyu Chen,Baolong Bi,Jiaheng Zhang,Zhiqi Huang,Bryan Hooi

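Task: Survey efficient inference methods for large reasoning models (LRMs) that reduce token usage, memory consumption, and inference time.

Motivation: LRMs perform well on complex tasks, but their deliberative reasoning process is inefficient in token usage, memory consumption, and inference time.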

Details

Method: Introduces a taxonomy that groups methods into explicit compact Chain-of-Thought (CoT) and implicit latent CoT, discusses their strengths and weaknesses, and conducts empirical analyses. Result: Summarizes the performance and efficiency of existing methods and presents open challenges and key insights for the field. Conclusion: The survey offers researchers a guide for overcoming challenges in efficient inference. Abstract: Large Reasoning Models (LRMs) significantly improve the reasoning ability of Large Language Models (LLMs) by learning to reason, exhibiting promising performance in complex task-solving. However, their deliberative reasoning process leads to inefficiencies in token usage, memory consumption, and inference time. Thus, this survey provides a review of efficient inference methods designed specifically for LRMs, focusing on mitigating token inefficiency while preserving the reasoning quality. First, we introduce a taxonomy to group the recent methods into two main categories: (a) explicit compact Chain-of-Thought (CoT), which reduces tokens while keeping the explicit reasoning structure, and (b) implicit latent CoT, which encodes reasoning steps within hidden representations instead of explicit tokens. Meanwhile, we discuss their strengths and weaknesses. Then, we conduct empirical analyses on existing methods from performance and efficiency aspects. Besides, we present open challenges in this field, including human-centric controllable reasoning, trade-off between interpretability and efficiency of reasoning, ensuring safety of efficient reasoning, and broader applications of efficient reasoning. In addition, we highlight key insights for enhancing LRMs' inference efficiency via techniques such as model merging, new architectures, and agent routers. We hope this work serves as a valuable guide, helping researchers overcome challenges in this vibrant field (https://github.com/yueliu1999/Awesome-Efficient-Inference-for-LRMs).

Enhancing Learnable Descriptive Convolutional Vision Transformer for Face Anti-Spoofing

Pei-Kai Huang,Jun-Xiong Chong,Ming-Tsung Hsu,Fang-Yu Hsu,Chiou-Ting Hsu

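Task: Propose three new training strategies to strengthen the feature characterization capability of LDCformer for face anti-spoofing (FAS).

Motivation: Improve LDCformer's ability to identify live/spoof discriminative features through better training strategies, in order to counter face presentation attacks.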

Details

Method: Applies dual-attention supervision, self-challenging supervision, and transitional triplet mining to strengthen feature characterization and domain generalization. Result: Experiments show that LDCformer trained jointly with the three strategies outperforms previous methods. Conclusion: The proposed training strategies improve LDCformer's performance on face anti-spoofing. Abstract: Face anti-spoofing (FAS) heavily relies on identifying live/spoof discriminative features to counter face presentation attacks. Recently, we proposed LDCformer to successfully incorporate the Learnable Descriptive Convolution (LDC) into ViT, to model long-range dependency of locally descriptive features for FAS. In this paper, we propose three novel training strategies to effectively enhance the training of LDCformer to largely boost its feature characterization capability. The first strategy, dual-attention supervision, is developed to learn fine-grained liveness features guided by regional live/spoof attentions. The second strategy, self-challenging supervision, is designed to enhance the discriminability of the features by generating challenging training data. In addition, we propose a third training strategy, transitional triplet mining strategy, through narrowing the cross-domain gap while maintaining the transitional relationship between live and spoof features, to enlarge the domain-generalization capability of LDCformer. Extensive experiments show that LDCformer under joint supervision of the three novel training strategies outperforms previous methods.

EventWeave: A Dynamic Framework for Capturing Core and Supporting Events in Dialogue Systems

Zhengyi Zhao,Shubo Zhang,Yiming Du,Bin Liang,Baojun Wang,Zhongyang Li,Binyang Li,Kam-Fai Wong

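Task: Propose EventWeave, a framework that tracks core and supporting events in a dialogue with a dynamic event graph to improve coherence and intent capture in multi-turn conversations.

Motivation: Existing large language models often overlook event tracking in multi-turn dialogue, which leads to incomplete context tracking and hurts coherence and intent capture.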

Details

Method: EventWeave organizes and updates core and supporting events in a dynamic event graph, avoiding repeated passes over the entire dialogue history. Result: Experiments on two benchmark datasets show that EventWeave improves response quality and event relevance without fine-tuning. Conclusion: EventWeave's event-centric approach addresses context tracking in multi-turn dialogue. Abstract: Existing large language models (LLMs) have shown remarkable progress in dialogue systems. However, many approaches still overlook the fundamental role of events throughout multi-turn interactions, leading to incomplete context tracking. Without tracking these events, dialogue systems often lose coherence and miss subtle shifts in user intent, causing disjointed responses. To bridge this gap, we present EventWeave, an event-centric framework that identifies and updates both core and supporting events as the conversation unfolds. Specifically, we organize these events into a dynamic event graph, which represents the interplay between core events that shape the primary idea and supporting events that provide critical context during the whole dialogue. By leveraging this dynamic graph, EventWeave helps models focus on the most relevant events when generating responses, thus avoiding repeated visits of the entire dialogue history. Experimental results on two benchmark datasets show that EventWeave improves response quality and event relevance without fine-tuning.
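The dynamic event graph can be pictured as core events with supporting events attached to them, so that a response only needs the events linked to the current topic. The sketch below is a hypothetical data structure written for illustration; the class, method names, and example events are assumptions, and how EventWeave actually extracts and updates events is not specified in the abstract.

```python
class EventGraph:
    """Hypothetical event graph: core events with linked supporting events."""

    def __init__(self):
        self.core = {}        # core event id -> description
        self.supporting = {}  # supporting event id -> description
        self.edges = set()    # (supporting id, core id) links

    def add_core(self, event_id, text):
        self.core[event_id] = text

    def add_supporting(self, event_id, text, core_id):
        if core_id not in self.core:
            raise KeyError(core_id)
        self.supporting[event_id] = text
        self.edges.add((event_id, core_id))

    def context_for(self, core_id):
        # Gather the core event and its supporting events, in id order,
        # instead of re-reading the whole dialogue history.
        linked = sorted(s for s, c in self.edges if c == core_id)
        return [self.core[core_id]] + [self.supporting[s] for s in linked]


graph = EventGraph()
graph.add_core("c1", "User wants to book a flight")
graph.add_supporting("s1", "User prefers a window seat", "c1")
graph.add_supporting("s2", "User travels on Friday", "c1")
context = graph.context_for("c1")
```

As the conversation unfolds, new turns would add or revise nodes, and the response generator would be prompted with `context` rather than every prior turn.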

Yuxuan Wang,Yueqian Wang,Bo Chen,Tong Wu,Dongyan Zhao,Zilong Zheng

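Task: Design OmniMMI, a multi-modal interaction benchmark for Omni language models in streaming video contexts.

Motivation: Existing video benchmarks fall short on streaming video understanding and proactive reasoning, so a more comprehensive evaluation tool is needed.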

Details

Method: Proposes the OmniMMI benchmark, with over 1,121 videos and 2,290 questions, and the Multi-modal Multiplexing Modeling (M4) framework. Result: OmniMMI covers six subtasks spanning streaming video understanding and proactive reasoning. Conclusion: OmniMMI and the M4 framework provide tools for evaluating and improving the multi-modal interaction capabilities of Omni language models. Abstract: The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see and listen while generating.

The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction

Yihuai Hong,Dian Zhou,Meng Cao,Lei Yu,Zhijing Jin

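Task: Study the mechanism that balances reasoning and memorization in large language models (LLMs).

Motivation: LLMs perform well on many reasoning tasks but may fail to generalize to unseen questions because they over-rely on memorized training examples, and the precise conditions under which they switch between reasoning and memorization remain unclear.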

Details

Method: Identifies linear features in the model's residual stream that govern the balance between reasoning and memory recall, and manipulates these features to influence performance on reasoning tasks. Result: The features not only distinguish reasoning tasks from memory-intensive ones, but intervening on them also improves model performance on reasoning tasks. Conclusion: The study sheds light on the underlying mechanisms of reasoning and memory in LLMs and suggests directions for more robust and interpretable generative AI systems. Abstract: Large language models (LLMs) excel on a variety of reasoning benchmarks, but previous studies suggest they sometimes struggle to generalize to unseen questions, potentially due to over-reliance on memorized training examples. However, the precise conditions under which LLMs switch between reasoning and memorization during text generation remain unclear. In this work, we provide a mechanistic understanding of LLMs' reasoning-memorization dynamics by identifying a set of linear features in the model's residual stream that govern the balance between genuine reasoning and memory recall. These features not only distinguish reasoning tasks from memory-intensive ones but can also be manipulated to causally influence model performance on reasoning tasks. Additionally, we show that intervening in these reasoning features helps the model more accurately activate the most relevant problem-solving capabilities during answer generation. Our findings offer new insights into the underlying mechanisms of reasoning and memory in LLMs and pave the way for the development of more robust and interpretable generative AI systems.
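A common way to obtain a linear direction that separates two kinds of activations is the difference of their means, after which an activation can be projected onto, or shifted along, that direction. The sketch below illustrates this generic technique with tiny invented vectors; whether the paper extracts its features this way is an assumption, and the function names are hypothetical.

```python
def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def difference_of_means(group_a, group_b):
    """Unit vector pointing from the mean of group_b to the mean of group_a."""
    mu_a, mu_b = mean_vector(group_a), mean_vector(group_b)
    diff = [a - b for a, b in zip(mu_a, mu_b)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def project(h, direction):
    """Scalar score of activation h along the direction."""
    return sum(x * d for x, d in zip(h, direction))


def steer(h, direction, alpha):
    """Shift activation h by alpha along the direction."""
    return [x + alpha * d for x, d in zip(h, direction)]


# Hypothetical 2-D "residual stream" activations for two task types.
reasoning_acts = [[2.0, 0.0], [4.0, 0.0]]
memorization_acts = [[0.0, 0.0], [0.0, 0.0]]
direction = difference_of_means(reasoning_acts, memorization_acts)
steered = steer([1.0, 1.0], direction, alpha=2.0)
```

Projection gives a score for telling the two task types apart, while steering is the intervention side: adding a multiple of the direction to an activation to push the model toward one mode.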

SuperEIO: Self-Supervised Event Feature Learning for Event Inertial Odometry

Peiyu Chen,Fuling Lin,Weipeng Guan,Peng Lu

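Task: Propose SuperEIO, a learning-based event feature detection and matching framework for event-inertial odometry.

Motivation: Event cameras offer low-latency output under high-speed motion and challenging lighting, but their motion-dependent nature makes robust feature detection and matching difficult.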

Details

Method: Uses a convolutional neural network for event feature detection and a graph neural network for event descriptor matching, with TensorRT to accelerate inference. Result: Evaluations on multiple public datasets show superior accuracy and robustness, and the pipeline is open-sourced. Conclusion: SuperEIO achieves accurate and robust event-inertial odometry and is suitable for resource-limited platforms. Abstract: Event cameras asynchronously output low-latency event streams, promising for state estimation in high-speed motion and challenging lighting conditions. As opposed to frame-based cameras, the motion-dependent nature of event cameras presents persistent challenges in achieving robust event feature detection and matching. In recent years, learning-based approaches have demonstrated superior robustness over traditional handcrafted methods in feature detection and matching, particularly under aggressive motion and HDR scenarios. In this paper, we propose SuperEIO, a novel framework that leverages the learning-based event-only detection and IMU measurements to achieve event-inertial odometry. Our event-only feature detection employs a convolutional neural network under continuous event streams. Moreover, our system adopts the graph neural network to achieve event descriptor matching for loop closure. The proposed system utilizes TensorRT to accelerate the inference speed of deep networks, which ensures low-latency processing and robust real-time operation on resource-limited platforms. Besides, we evaluate our method extensively on multiple public datasets, demonstrating its superior accuracy and robustness compared to other state-of-the-art event-based methods. We have also open-sourced our pipeline to facilitate research in the field: https://github.com/arclab-hku/SuperEIO.

UNITYAI-GUARD: Pioneering Toxicity Detection Across Low-Resource Indian Languages

Himanshu Beniwal,Reddybathuni Venkat,Rohit Kumar,Birudugadda Srivibhav,Daksh Jain,Pavan Doddi,Eshwar Dhande,Adithya Ananth,Kuldeep,Heer Kubadia,Pratham Sharda,Mayank Singh

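Task: Develop a binary toxicity classification framework for low-resource Indian languages.

Motivation: Existing systems mainly serve high-resource languages, leaving a gap in toxic content detection for low-resource Indian languages.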

Details

Method: Develops state-of-the-art models for diverse Brahmic/Indic scripts using 888k training instances and 35k manually verified test instances. Result: Achieves an average F1-score of 84.23% across seven languages. Conclusion: By providing public API access, UnityAI-Guard advances multilingual content moderation for linguistically diverse regions. Abstract: This work introduces UnityAI-Guard, a framework for binary toxicity classification targeting low-resource Indian languages. While existing systems predominantly cater to high-resource languages, UnityAI-Guard addresses this critical gap by developing state-of-the-art models for identifying toxic content across diverse Brahmic/Indic scripts. Our approach achieves an impressive average F1-score of 84.23% across seven languages, leveraging a dataset of 888k training instances and 35k manually verified test instances. By advancing multilingual content moderation for linguistically diverse regions, UnityAI-Guard also provides public API access to foster broader adoption and application.
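An average F1-score across languages is typically obtained by computing F1 per language from its confusion counts and then averaging. The sketch below assumes an unweighted mean and uses invented counts and language codes; the abstract does not say how the 84.23% average was aggregated.

```python
def f1_score(tp, fp, fn):
    """Binary F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts_by_language):
    """Unweighted mean of per-language F1 scores (an assumed aggregation)."""
    scores = [f1_score(*counts) for counts in counts_by_language.values()]
    return sum(scores) / len(scores)


# Hypothetical (tp, fp, fn) counts for the toxic class in two languages.
counts = {
    "hi": (8, 2, 2),
    "ta": (6, 2, 6),
}
average = macro_f1(counts)
```

With these counts the per-language scores are 0.8 and 0.6, giving an average of 0.7; a sample-weighted average would differ whenever test set sizes vary by language.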

Pallet Detection And Localisation From Synthetic Data

Henri Mueller,Yechan Kim,Trevor Gee,Mahla Nejati

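Task: Propose a new approach to pallet detection and localisation based on synthetic data and geometric features.

Motivation: The global warehousing industry is growing rapidly and demand for automation is rising, but typical computer vision projects require extensive manual data annotation (about 35 seconds per image).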

Details

Method: Generates synthetic data with a domain randomisation engine in Unity and uses geometric features of the pallets' side faces for detection and localisation. Result: On a real-world dataset, single-pallet detection reaches 0.995 mAP50; within a 5-meter range, average position accuracy is under 4.2 cm and average rotation accuracy is 8.2°. Conclusion: The method achieves high-performance pallet detection and localisation without manual annotation. Abstract: The global warehousing industry is experiencing rapid growth, with the market size projected to grow at an annual rate of 8.1% from 2024 to 2030 [Grand View Research, 2021]. This expansion has led to a surge in demand for efficient pallet detection and localisation systems. While automation can significantly streamline warehouse operations, the development of such systems often requires extensive manual data annotation, with an average of 35 seconds per image, for a typical computer vision project. This paper presents a novel approach to enhance pallet detection and localisation using purely synthetic data and geometric features derived from their side faces. By implementing a domain randomisation engine in Unity, the need for time-consuming manual annotation is eliminated while achieving high-performance results. The proposed method demonstrates a pallet detection performance of 0.995 mAP50 for single pallets on a real-world dataset. Additionally, an average position accuracy of less than 4.2 cm and an average rotation accuracy of 8.2° were achieved for pallets within a 5-meter range, with the pallet positioned head-on.
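The "50" in mAP50 refers to an Intersection over Union (IoU) threshold of 0.5 for counting a predicted box as a correct detection. The sketch below shows that matching rule for axis-aligned boxes with made-up coordinates; it covers only the IoU test, not the full precision-recall averaging behind the reported 0.995.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """A prediction counts as correct when its IoU reaches the threshold."""
    return box_iou(pred, gt) >= threshold


# Hypothetical ground-truth box and two candidate predictions.
gt = (0.0, 0.0, 2.0, 2.0)
shifted = (1.0, 0.0, 3.0, 2.0)    # overlaps half of the ground truth
half_box = (0.0, 0.0, 2.0, 1.0)   # lower half of the ground truth
```

The shifted box has IoU 1/3 and is rejected, while the half-height box has IoU exactly 0.5 and is accepted under the inclusive threshold used here.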

Parsing Through Boundaries in Chinese Word Segmentation

Yige Chen,Zelong Li,Changbing Yang,Cindy Zhang,Amandisa Cady,Ai Ka Lee,Zejiao Zeng,Haihua Pan,Jungyeul Park

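Task: Study the relationship between Chinese word segmentation and syntactic parsing, examining how different segmentation strategies affect dependency structures.

Motivation: Chinese lacks explicit word boundaries, so segmentation is essential for syntactic parsing, yet its inherent ambiguity calls for closer study.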

Details

Method: Analyses multiple segmentation schemes on the Chinese GSD treebank and their effect on syntactic structure, and develops an interactive visualization tool. Result: Shows how different segmentation strategies shape dependency structures in Chinese and provides an intuitive comparison tool. Conclusion: Segmentation strategy has a marked effect on syntactic parsing, and the visualization tool helps in understanding this relationship. Abstract: Chinese word segmentation is a foundational task in natural language processing (NLP), with far-reaching effects on syntactic analysis. Unlike alphabetic languages like English, Chinese lacks explicit word boundaries, making segmentation both necessary and inherently ambiguous. This study highlights the intricate relationship between word segmentation and syntactic parsing, providing a clearer understanding of how different segmentation strategies shape dependency structures in Chinese. Focusing on the Chinese GSD treebank, we analyze multiple word boundary schemes, each reflecting distinct linguistic and computational assumptions, and examine how they influence the resulting syntactic structures. To support detailed comparison, we introduce an interactive web-based visualization tool that displays parsing outcomes across segmentation methods.

From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D

Jiahui Zhang,Yurui Chen,Yanpeng Zhou,Yueming Xu,Ze Huang,Jilin Mei,Junhui Chen,Yu-Jie Yuan,Xinyue Cai,Guowei Huang,Xingyue Quan,Hang Xu,Li Zhang

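Task: Build the SPAR-7M dataset and the SPAR-Bench benchmark through a 2D spatial data generation and annotation pipeline, in order to improve the spatial perception and reasoning of vision-language models.

Motivation: Existing vision-language models are weak at spatial perception, which limits their ability to reason about complex 3D scenes.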

Details

Method: Proposes a 2D spatial data generation and annotation pipeline built on 3D scene data, and constructs the SPAR-7M dataset and the SPAR-Bench benchmark. Result: The models reach state-of-the-art performance on 2D spatial benchmarks and are competitive on 3D task-specific datasets. Conclusion: The proposed dataset and benchmark effectively improve the spatial reasoning of vision-language models. Abstract: Recent advances in LVLMs have improved vision-language understanding, but they still struggle with spatial perception, limiting their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into models to improve spatial understanding, we aim to unlock the potential of VLMs by leveraging spatially relevant image data. To this end, we introduce a novel 2D spatial data generation and annotation pipeline built upon scene data with 3D ground-truth. This pipeline enables the creation of a diverse set of spatial tasks, ranging from basic perception tasks to more complex reasoning tasks. Leveraging this pipeline, we construct SPAR-7M, a large-scale dataset generated from thousands of scenes across multiple public datasets. In addition, we introduce SPAR-Bench, a benchmark designed to offer a more comprehensive evaluation of spatial capabilities compared to existing spatial benchmarks, supporting both single-view and multi-view inputs. Training on both SPAR-7M and large-scale 2D datasets enables our models to achieve state-of-the-art performance on 2D spatial benchmarks. Further fine-tuning on 3D task-specific datasets yields competitive results, underscoring the effectiveness of our dataset in enhancing spatial reasoning.

Memory-Aware and Uncertainty-Guided Retrieval for Multi-Hop Question Answering

Yuelyu Ji,Rui Meng,Zhuochun Li,Daqing He

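Task: Propose the MIND framework to address fixed or overly frequent retrieval steps and the under-use of retrieved knowledge in multi-hop question answering.

Motivation: Existing retrieval-augmented generation methods for multi-hop QA are limited by fixed or overly frequent retrieval steps and by ineffective use of previously retrieved knowledge.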

Details

Method: MIND combines prompt-based entity extraction, dynamic retrieval triggering based on token-level entropy and attention signals, and memory-aware filtering that stores high-confidence facts across reasoning steps. Result: MIND performs multi-hop QA more effectively, improving the efficiency and consistency of retrieval and reasoning. Conclusion: Through dynamic retrieval and memory-aware filtering, MIND markedly improves multi-hop QA performance. Abstract: Multi-hop question answering (QA) requires models to retrieve and reason over multiple pieces of evidence. While Retrieval-Augmented Generation (RAG) has made progress in this area, existing methods often suffer from two key limitations: (1) fixed or overly frequent retrieval steps, and (2) ineffective use of previously retrieved knowledge. We propose MIND (Memory-Informed and INteractive Dynamic RAG), a framework that addresses these challenges through: (i) prompt-based entity extraction to identify reasoning-relevant elements, (ii) dynamic retrieval triggering based on token-level entropy and attention signals, and (iii) memory-aware filtering, which stores high-confidence facts across reasoning steps to enable consistent multi-hop generation.
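
The entropy-based retrieval trigger mentioned in the MIND abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `threshold` value, and the rule "retrieve if any token exceeds the threshold" are assumptions made for illustration, and the attention signals the abstract also mentions are left out.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_retrieve(step_distributions, threshold=1.0):
    """Return True when any generated token is too uncertain.

    `threshold` is a hypothetical tuning knob, not a value from the paper.
    """
    return any(token_entropy(p) > threshold for p in step_distributions)

# Toy distributions over a 4-token vocabulary (illustrative values only).
confident = [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]]
uncertain = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]

print(should_retrieve(confident))  # peaked distributions: low entropy
print(should_retrieve(uncertain))  # uniform distribution: entropy ln(4)
```

A uniform distribution over four tokens has entropy ln 4 ≈ 1.39 nats, so it crosses the assumed threshold of 1.0, while the peaked distributions stay well below it.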

indiSplit: Bringing Severity Cognizance to Image Decomposition in Fluorescence Microscopy

Ashesh Ashesh,Florian Jug

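Task: Propose a new method, indiSplit, to handle unknown mixing ratios in fluorescence microscopy images.

Motivation: Existing image decomposition methods are trained at a fixed intensity ratio and cannot adapt to the range of mixing ratios that occur in fluorescence microscopy.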

Details

Method: Builds on the iterative InDI method, adding a regressor network that predicts the degradation level (mixing asymmetry) of the input image and a degradation-specific normalization module, to enable degradation-aware inference. Result: indiSplit solves two tasks, image splitting and bleedthrough removal, and its applicability is demonstrated on 5 public datasets. Conclusion: indiSplit is an effective degradation-aware method for image decomposition in fluorescence microscopy. Abstract: Fluorescence microscopy, while being a key driver for progress in the life sciences, is also subject to technical limitations. To overcome them, computational multiplexing techniques have recently been proposed, which allow multiple cellular structures to be captured in a single image and later be unmixed. Existing image decomposition methods are trained on a set of superimposed input images and the respective unmixed target images. It is critical to note that the relative strength (mixing ratio) of the superimposed images for a given input is a priori unknown. However, existing methods are trained on a fixed intensity ratio of superimposed inputs, making them not cognizant of the range of relative intensities that can occur in fluorescence microscopy. In this work, we propose a novel method called indiSplit that is cognizant of the severity of the above-mentioned mixing ratio. Our idea is based on InDI, a popular iterative method for image restoration, and an ideal starting point to embrace the unknown mixing ratio in any given input. We introduce (i) a suitably trained regressor network that predicts the degradation level (mixing asymmetry) of a given input image and (ii) a degradation-specific normalization module, enabling degradation-aware inference across all mixing ratios. We show that this method solves two relevant tasks in fluorescence microscopy, namely image splitting and bleedthrough removal, and empirically demonstrate the applicability of indiSplit on 5 public datasets. We will release all sources under a permissive license.

The realization of tones in spontaneous spoken Taiwan Mandarin: a corpus-based survey and theory-driven computational modeling

Yuxin Lu,Yu-Ying Chuang,R. Harald Baayen

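Task: Study the tonal realization of Mandarin disyllabic words in a corpus of spontaneous Taiwan Mandarin speech, and explore how semantics interacts with phonetic detail.

Motivation: Prior work shows that semantics can influence phonetic detail, but the complex interplay between phonetic realization and semantics, particularly in pitch realization, remains understudied.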

Details

Method: Uses Generalized Additive Mixed Models (GAMs) to analyse f0 contours, and contextualized embeddings obtained with GPT-2 to predict pitch contours. Result: Word sense and contextualized embeddings are key predictors of f0 contours, with effects that exceed those of tone pattern; pitch contours can be predicted to a considerable extent from the contextualized embeddings. Conclusion: Meaning in context and phonetic realization are more closely tied than standard linguistic theory predicts. Abstract: A growing body of literature has demonstrated that semantics can co-determine fine phonetic detail. However, the complex interplay between phonetic realization and semantics remains understudied, particularly in pitch realization. The current study investigates the tonal realization of Mandarin disyllabic words with all 20 possible combinations of two tones, as found in a corpus of Taiwan Mandarin spontaneous speech. We made use of Generalized Additive Mixed Models (GAMs) to model f0 contours as a function of a series of predictors, including gender, tonal context, tone pattern, speech rate, word position, bigram probability, speaker and word. In the GAM analysis, word and sense emerged as crucial predictors of f0 contours, with effect sizes that exceed those of tone pattern. For each word token in our dataset, we then obtained a contextualized embedding by applying the GPT-2 large language model to the context of that token in the corpus. We show that the pitch contours of word tokens can be predicted to a considerable extent from these contextualized embeddings, which approximate token-specific meanings in contexts of use. The results of our corpus study show that meaning in context and phonetic realization are far more entangled than standard linguistic theory predicts.

Optimal Transport-Guided Source-Free Adaptation for Face Anti-Spoofing

Zhuowei Li,Tianchen Zhao,Xiang Xu,Zheng Zhang,Zhihua Li,Xuanbai Chen,Qin Zhang,Alessandro Bergamo,Anil K. Jain,Yifan Xing

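Task: Develop a face anti-spoofing model that clients can adapt to a target domain themselves at test time.

Motivation: There is a domain gap between training datasets and diverse end-user test data, and for security and privacy reasons clients do not want to share large amounts of face data with service providers.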

Details

Method: Proposes a prototype-based base model and an optimal transport-guided adaptor that supports lightweight-training or training-free adaptation while keeping the base model's parameters inaccessible, together with geodesic mixup, an optimal transport-based data augmentation method. Result: In cross-domain and cross-attack settings, the method achieves average relative improvements of 19.17% in HTER and 8.58% in AUC over existing methods. Conclusion: The method addresses domain adaptation effectively while protecting the privacy and security of client data. Abstract: Developing a face anti-spoofing model that meets the security requirements of clients worldwide is challenging due to the domain gap between training datasets and diverse end-user test data. Moreover, for security and privacy reasons, it is undesirable for clients to share a large amount of their face data with service providers. In this work, we introduce a novel method in which the face anti-spoofing model can be adapted by the client itself to a target domain at test time using only a small sample of data while keeping model parameters and training data inaccessible to the client. Specifically, we develop a prototype-based base model and an optimal transport-guided adaptor that enables adaptation in either a lightweight training or training-free fashion, without updating base model's parameters. Furthermore, we propose geodesic mixup, an optimal transport-based synthesis method that generates augmented training data along the geodesic path between source prototypes and target data distribution. This allows training a lightweight classifier to effectively adapt to target-specific characteristics while retaining essential knowledge learned from the source domain. In cross-domain and cross-attack settings, compared with recent methods, our method achieves average relative improvements of 19.17% in HTER and 8.58% in AUC, respectively.
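
For readers unfamiliar with the HTER metric reported above, the following minimal sketch computes the Half Total Error Rate as the mean of the false acceptance rate and the false rejection rate. The label convention (1 = live, 0 = spoof), the toy data, and the `relative_improvement` helper are assumptions for illustration; the abstract does not state how its relative improvements were computed.

```python
def hter(labels, predictions):
    """Half Total Error Rate; 1 = bona fide (live) face, 0 = spoof attack."""
    attack_preds = [p for l, p in zip(labels, predictions) if l == 0]
    live_preds = [p for l, p in zip(labels, predictions) if l == 1]
    # FAR: fraction of attacks wrongly accepted as live.
    far = sum(1 for p in attack_preds if p == 1) / len(attack_preds)
    # FRR: fraction of live faces wrongly rejected as attacks.
    frr = sum(1 for p in live_preds if p == 0) / len(live_preds)
    return (far + frr) / 2

def relative_improvement(baseline_error, new_error):
    """Relative reduction of an error metric (lower is better); hypothetical helper."""
    return (baseline_error - new_error) / baseline_error

# Toy data: 4 attacks followed by 4 live samples (illustrative only).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
predictions = [0, 0, 0, 1, 1, 1, 0, 0]
print(hter(labels, predictions))  # FAR = 1/4, FRR = 2/4, so HTER = 0.375
```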

The Challenge of Achieving Attributability in Multilingual Table-to-Text Generation with Question-Answer Blueprints

Aden Haussmann

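Task: Explore whether Question-Answer (QA) blueprints make multilingual Table-to-Text outputs more attributable.

Motivation: Multilingual natural language generation (NLG) for low-resource languages is challenging because training data is scarce, yet it matters to tens of millions of speakers worldwide. Table-to-Text is an effective measure of models' reasoning abilities, but in the multilingual setting its outputs are often not faithful to the source data.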

Details

Method: Extends the multilingual Table-to-Text dataset TaTA (which includes African languages) with QA blueprints, then fine-tunes sequence-to-sequence language models with and without blueprints. Result: QA blueprints improve performance when models are fine-tuned and evaluated only on English, but yield no significant gains in the multilingual setting, because the machine-translated blueprints are inaccurate and models do not rely closely on the blueprints they generate. Conclusion: Applying QA blueprints directly in the multilingual setting is challenging; translation accuracy and the models' reliance on blueprints need further work. Abstract: Multilingual Natural Language Generation (NLG) is challenging due to the lack of training data for low-resource languages. However, some low-resource languages have up to tens of millions of speakers globally, making it important to improve NLG tools for them. Table-to-Text NLG is an excellent measure of models' reasoning abilities but is very challenging in the multilingual setting. System outputs are often not attributable, or faithful, to the data in the source table. Intermediate planning techniques like Question-Answer (QA) blueprints have been shown to improve attributability on summarisation tasks. This work explores whether QA blueprints make multilingual Table-to-Text outputs more attributable to the input tables. This paper extends the challenging multilingual Table-to-Text dataset, TaTA, which includes African languages, with QA blueprints. Sequence-to-sequence language models are then finetuned on this dataset, with and without blueprints. Results show that QA blueprints improve performance for models finetuned and evaluated only on English examples, but do not demonstrate gains in the multilingual setting. This is due to inaccuracies in machine translating the blueprints from English into target languages when generating the training data, and models failing to rely closely on the blueprints they generate. An in-depth analysis is conducted on why this is challenging.

FreeSplat++: Generalizable 3D Gaussian Splatting for Efficient Indoor Scene Reconstruction

Yunsong Wang,Tianxin Huang,Hanlin Chen,Gim Hee Lee

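Task: Propose FreeSplat++, a generalizable 3D Gaussian Splatting method for large-scale indoor whole-scene reconstruction.

Motivation: Existing methods mostly focus on sparse-view reconstruction of small regions and fall short in either quality or efficiency for whole-scene reconstruction.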

Details

Method: Proposes a Low-cost Cross-View Aggregation framework, a pixel-wise triplet fusion method, and a weighted floater removal strategy, combined with depth-regularized per-scene fine-tuning. Result: FreeSplat++ significantly outperforms existing methods in whole-scene reconstruction, improving reconstruction accuracy and reducing training time. Conclusion: FreeSplat++ offers an efficient, high-quality alternative for large-scale whole-scene reconstruction. Abstract: Recently, the integration of the efficient feed-forward scheme into 3D Gaussian Splatting (3DGS) has been actively explored. However, most existing methods focus on sparse view reconstruction of small regions and cannot produce eligible whole-scene reconstruction results in terms of either quality or efficiency. In this paper, we propose FreeSplat++, which focuses on extending the generalizable 3DGS to become an alternative approach to large-scale indoor whole-scene reconstruction, which has the potential of significantly accelerating the reconstruction speed and improving the geometric accuracy. To facilitate whole-scene reconstruction, we initially propose the Low-cost Cross-View Aggregation framework to efficiently process extremely long input sequences. Subsequently, we introduce a carefully designed pixel-wise triplet fusion method to incrementally aggregate the overlapping 3D Gaussian primitives from multiple views, adaptively reducing their redundancy. Furthermore, we propose a weighted floater removal strategy that can effectively reduce floaters, which serves as an explicit depth fusion approach that is crucial in whole-scene reconstruction. After the feed-forward reconstruction of 3DGS primitives, we investigate a depth-regularized per-scene fine-tuning process. Leveraging the dense, multi-view consistent depth maps obtained during the feed-forward prediction phase for an extra constraint, we refine the entire scene's 3DGS primitive to enhance rendering quality while preserving geometric accuracy. Extensive experiments confirm that our FreeSplat++ significantly outperforms existing generalizable 3DGS methods, especially in whole-scene reconstructions. Compared to conventional per-scene optimized 3DGS approaches, our method with depth-regularized per-scene fine-tuning demonstrates substantial improvements in reconstruction accuracy and a notable reduction in training time.

Enhancing Knowledge Graph Completion with Entity Neighborhood and Relation Context

Jianfang Chen,Kai Zhang,Aoran Gan,Shiwei Tong,Shuanghong Shen,Qi Liu

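Task: Improve Knowledge Graph Completion (KGC) by integrating contextual information about both entities and relations.

Motivation: Traditional structure-based methods are computationally expensive and scale poorly, while text-based methods do not make full use of contextual information.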

Details

Method: Proposes the KGC-ERC framework, which combines entity and relation context, and introduces a sampling strategy to optimize context selection. Result: On the Wikidata5M, Wiki27K, and FB15K-237-N datasets, KGC-ERC outperforms or matches the best existing methods. Conclusion: By making effective use of contextual information, KGC-ERC improves the performance and scalability of knowledge graph completion. Abstract: Knowledge Graph Completion (KGC) aims to infer missing information in Knowledge Graphs (KGs) to address their inherent incompleteness. Traditional structure-based KGC methods, while effective, face significant computational demands and scalability challenges due to the need for dense embedding learning and scoring all entities in the KG for each prediction. Recent text-based approaches using language models like T5 and BERT have mitigated these issues by converting KG triples into text for reasoning. However, they often fail to fully utilize contextual information, focusing mainly on the neighborhood of the entity and neglecting the context of the relation. To address this issue, we propose KGC-ERC, a framework that integrates both types of context to enrich the input of generative language models and enhance their reasoning capabilities. Additionally, we introduce a sampling strategy to effectively select relevant context within input token constraints, which optimizes the utilization of contextual information and potentially improves model performance. Experiments on the Wikidata5M, Wiki27K, and FB15K-237-N datasets show that KGC-ERC outperforms or matches state-of-the-art baselines in predictive performance and scalability.

On Geometrical Properties of Text Token Embeddings for Strong Semantic Binding in Text-to-Image Generation

Hoigi Seo,Junseo Bang,Haechang Lee,Joohoon Lee,Byung Hyun Lee,Se Young Chun

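Task: Study the geometrical properties underlying semantic binding in text-to-image (T2I) models and how to optimize them.

Motivation: Address text-image misalignment in T2I models for complex scenes, especially semantic binding across multiple objects and attributes.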

Details

Method: Proposes the TeeMo framework, comprising Causality-Aware Projection-Out (CAPO) and Adaptive Token Mixing (ATM), which optimizes cross-attention maps based on an analysis of geometrical properties. Result: TeeMo outperforms existing methods across a range of baselines and datasets. Conclusion: Geometrical properties play a key role in semantic binding, and TeeMo effectively improves the semantic binding of T2I models. Abstract: Text-to-Image (T2I) models often suffer from text-image misalignment in complex scenes involving multiple objects and attributes. Semantic binding aims to mitigate this issue by accurately associating the generated attributes and objects with their corresponding noun phrases (NPs). Existing methods rely on text or latent optimizations, yet the factors influencing semantic binding remain underexplored. Here we investigate the geometrical properties of text token embeddings and their cross-attention (CA) maps. We empirically and theoretically analyze that the geometrical properties of token embeddings, specifically both angular distances and norms, play a crucial role in CA map differentiation. Then, we propose TeeMo, a training-free text embedding-aware T2I framework with strong semantic binding. TeeMo consists of Causality-Aware Projection-Out (CAPO) for distinct inter-NP CA maps and Adaptive Token Mixing (ATM) with our loss to enhance inter-NP separation while maintaining intra-NP cohesion in CA maps. Extensive experiments confirm TeeMo consistently outperforms prior arts across diverse baselines and datasets.

RECALL-MM: A Multimodal Dataset of Consumer Product Recalls for Risk Analysis using Computational Methods and Large Language Models

Diana Bolanos,Mohammadmehdi Ataei,Daniele Grandi,Kosa Goucher-Lambert

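Task: Use recall data from the United States Consumer Product Safety Commission (CPSC) to build the multimodal dataset RECALL-MM, supporting data-driven risk assessment and safer design decisions.

Motivation: Recall data holds information about potential risks and hazards, but its potential remains underused.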

Details

Method: Augments recall data with generative methods to build a multimodal dataset, and analyses risk using interactive clustering maps and a large language model (LLM). Result: The dataset identifies product risks, and case studies confirm its usefulness for predicting hazards and guiding design. Conclusion: The study links historical recall data to future product safety through a data-driven approach, offering a scalable route to safer engineering design. Abstract: Product recalls provide valuable insights into potential risks and hazards within the engineering design process, yet their full potential remains underutilized. In this study, we curate data from the United States Consumer Product Safety Commission (CPSC) recalls database to develop a multimodal dataset, RECALL-MM, that informs data-driven risk assessment using historical information, and augment it using generative methods. Patterns in the dataset highlight specific areas where improved safety measures could have significant impact. We extend our analysis by demonstrating interactive clustering maps that embed all recalls into a shared latent space based on recall descriptions and product names. Leveraging these data-driven tools, we explore three case studies to demonstrate the dataset's utility in identifying product risks and guiding safer design decisions. The first two case studies illustrate how designers can visualize patterns across recalled products and situate new product ideas within the broader recall landscape to proactively anticipate hazards. In the third case study, we extend our approach by employing a large language model (LLM) to predict potential hazards based solely on product images. This demonstrates the model's ability to leverage visual context to identify risk factors, revealing strong alignment with historical recall data across many hazard categories. However, the analysis also highlights areas where hazard prediction remains challenging, underscoring the importance of risk awareness throughout the design process. Collectively, this work aims to bridge the gap between historical recall data and future product safety, presenting a scalable, data-driven approach to safer engineering design.

Multi-label classification for multi-temporal, multi-spatial coral reef condition monitoring using vision foundation model with adapter learning

Xinlei Shao,Hongruixuan Chen,Fan Zhao,Kirsty Magson,Jundong Chen,Peiran Li,Jiaqi Wang,Jun Sasaki

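Task: Propose an efficient coral reef condition classification method that combines the DINOv2 vision foundation model with LoRA fine-tuning.

Motivation: Coral reef ecosystems are threatened by climate change and human activities. Conventional deep learning methods perform poorly on complex underwater ecological images, and fine-tuning foundation models requires substantial computational resources and produces high carbon emissions.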

Details

Method: Combines the DINOv2 vision foundation model with LoRA fine-tuning, trained on multi-temporal field images. Result: DINOv2-LoRA outperforms conventional models in accuracy (64.77% match ratio) and parameter efficiency (trainable parameters reduced from 1,100M to 5.91M), and generalizes strongly across time and space. Conclusion: The method provides an efficient tool for coral reef condition classification, supporting the monitoring and conservation of coral reef ecosystems. Abstract: Coral reef ecosystems provide essential ecosystem services, but face significant threats from climate change and human activities. Although advances in deep learning have enabled automatic classification of coral reef conditions, conventional deep models struggle to achieve high performance when processing complex underwater ecological images. Vision foundation models, known for their high accuracy and cross-domain generalizability, offer promising solutions. However, fine-tuning these models requires substantial computational resources and results in high carbon emissions. To address these challenges, adapter learning methods such as Low-Rank Adaptation (LoRA) have emerged as a solution. This study introduces an approach integrating the DINOv2 vision foundation model with the LoRA fine-tuning method. The approach leverages multi-temporal field images collected through underwater surveys at 15 dive sites at Koh Tao, Thailand, with all images labeled according to universal standards used in citizen science-based conservation programs. The experimental results demonstrate that the DINOv2-LoRA model achieved superior accuracy, with a match ratio of 64.77%, compared to 60.34% achieved by the best conventional model. Furthermore, incorporating LoRA reduced the trainable parameters from 1,100M to 5.91M. Transfer learning experiments conducted under different temporal and spatial settings highlight the exceptional generalizability of DINOv2-LoRA across different seasons and sites. This study is the first to explore the efficient adaptation of foundation models for multi-label classification of coral reef conditions under multi-temporal and multi-spatial settings. The proposed method advances the classification of coral reef conditions and provides a tool for monitoring, conserving, and managing coral reef ecosystems.
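
The reduction in trainable parameters follows from how LoRA works in general: a frozen weight matrix of shape d_out × d_in is augmented with two trainable low-rank factors of shapes d_out × r and r × d_in. The sketch below shows the arithmetic for a single layer. The layer size and rank are illustrative assumptions, not the configuration used in the paper.

```python
def full_params(d_out, d_in):
    """Trainable weights when a dense d_out x d_in matrix is fine-tuned directly."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable weights for LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

# Hypothetical layer: a 1024 x 1024 projection adapted with rank 8.
d_out, d_in, rank = 1024, 1024, 8
full = full_params(d_out, d_in)        # 1,048,576
lora = lora_params(d_out, d_in, rank)  # 16,384
print(f"full={full:,} lora={lora:,} ratio={full // lora}x")
```

For this assumed layer, the trainable weight count drops by a factor of 64. The overall figure for a model depends on which layers receive adapters and on the chosen rank.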

Beyond speculation: Measuring the growing presence of LLM-generated texts in multilingual disinformation

Dominik Macko,Aashish Anantha Ramakrishnan,Jason Samuel Lucas,Robert Moro,Ivan Srba,Adaku Uchendu,Dongwon Lee

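Task: Study the presence and impact of large language models (LLMs) in real-world disinformation datasets.

Motivation: As the quality of LLM-generated multilingual text improves, its potential misuse for disinformation raises concern, but scholars remain divided on its impact.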

Details

Method: Empirically analyses the latest real-world disinformation datasets, documenting the increase in machine-generated content after ChatGPT's release and revealing key patterns across languages, platforms, and time. Result: The presence of LLM-generated text in disinformation datasets has grown markedly, with distinct distribution patterns across languages, platforms, and time periods. Conclusion: The study fills an empirical gap on the role of LLMs in disinformation and supplies data for the ongoing debate. Abstract: Increased sophistication of large language models (LLMs) and the consequent quality of generated multilingual text raises concerns about potential disinformation misuse. While humans struggle to distinguish LLM-generated content from human-written texts, the scholarly debate about their impact remains divided. Some argue that heightened fears are overblown due to natural ecosystem limitations, while others contend that specific "longtail" contexts face overlooked risks. Our study bridges this debate by providing the first empirical evidence of LLM presence in the latest real-world disinformation datasets, documenting the increase of machine-generated content following ChatGPT's release, and revealing crucial patterns across languages, platforms, and time periods.

The impact of tissue detection on diagnostic artificial intelligence algorithms in digital pathology

Sol Erika Boman,Nita Mulliqi,Anders Blilie,Xiaoyi Ji,Kelvin Szolnoky,Einar Gudlaugsson,Emiel A. M. Janssen,Svein R. Kjosavik,José Asenjo,Marcello Gambacorta,Paolo Libretti,Marcin Braun,Radzislaw Kordek,Roman Łowicki,Kristina Hotakainen,Päivi Väre,Bodil Ginnerup Pedersen,Karina Dalsgaard Sørensen,Benedicte Parm Ulhøi,Lars Egevad,Kimmo Kartasalo

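Task: Study how the tissue detection method affects downstream task performance, and compare classical and AI-based tissue detection.

Motivation: Tissue detection quality may affect downstream performance and even jeopardize patient safety, yet studies on the question are lacking.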

Details

Method: Trains an AI model for Gleason grading of prostate cancer using two tissue detection algorithms (thresholding and UNet++) and analyses the performance differences. Result: The AI model reduced the number of fully undetected tissue samples, but grading performance did not differ significantly on samples detected by both; the detection method led to clinically significant variations in 3.5% of malignant slides. Conclusion: Robust tissue detection is essential to the clinical performance of diagnostic AI, and an AI model is more reliable at avoiding total failures. Abstract: Tissue detection is a crucial first step in most digital pathology applications. Details of the segmentation algorithm are rarely reported, and there is a lack of studies investigating the downstream effects of a poor segmentation algorithm. Disregarding tissue detection quality could create a bottleneck for downstream performance and jeopardize patient safety if diagnostically relevant parts of the specimen are excluded from analysis in clinical applications. This study aims to determine whether performance of downstream tasks is sensitive to the tissue detection method, and to compare performance of classical and AI-based tissue detection. To this end, we trained an AI model for Gleason grading of prostate cancer in whole slide images (WSIs) using two different tissue detection algorithms: thresholding (classical) and UNet++ (AI). A total of 33,823 WSIs scanned on five digital pathology scanners were used to train the tissue detection AI model. The downstream Gleason grading algorithm was trained and tested using 70,524 WSIs from 13 clinical sites scanned on 13 different scanners. There was a decrease from 116 (0.43%) to 22 (0.08%) fully undetected tissue samples when switching from thresholding-based tissue detection to AI-based, suggesting an AI model may be more reliable than a classical model for avoiding total failures on slides with unusual appearance. On the slides where tissue could be detected by both algorithms, no significant difference in overall Gleason grading performance was observed. However, tissue detection dependent clinically significant variations in AI grading were observed in 3.5% of malignant slides, highlighting the importance of robust tissue detection for optimal clinical performance of diagnostic AI.

Evaluating how LLM annotations represent diverse views on contentious topics

Megan A. Brown,Shubham Atreja,Libby Hemphill,Patrick Y. Wu

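Task: Evaluate how well large language models (LLMs) represent diverse viewpoints when annotating contentious tasks.

Motivation: Examine whether LLMs are biased toward the views of particular groups when labeling data, especially on contentious topics.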

Details

Method: Runs four annotation tasks on four datasets and analyses agreement between LLMs and human annotators. Result: LLMs show no substantial demographics-based disagreement with annotators; the model, the prompt, and disagreement among human annotators are more predictive of agreement. Conclusion: When using LLMs to annotate data, under-representing the views of particular groups is not a major concern; the findings offer guidance for researchers and practitioners. Abstract: Researchers have proposed the use of generative large language models (LLMs) to label data for both research and applied settings. This literature emphasizes the improved performance of LLMs relative to other natural language models, noting that LLMs typically outperform other models on standard metrics such as accuracy, precision, recall, and F1 score. However, previous literature has also highlighted the bias embedded in language models, particularly around contentious topics such as potentially toxic content. This bias could result in labels applied by LLMs that disproportionately align with majority groups over a more diverse set of viewpoints. In this paper, we evaluate how LLMs represent diverse viewpoints on these contentious tasks. Across four annotation tasks on four datasets, we show that LLMs do not show substantial disagreement with annotators on the basis of demographics. Instead, the model, prompt, and disagreement between human annotators on the labeling task are far more predictive of LLM agreement. Our findings suggest that when using LLMs to annotate data, under-representing the views of particular groups is not a substantial concern. We conclude with a discussion of the implications for researchers and practitioners.

MeshCraft: Exploring Efficient and Controllable Mesh Generation with Flow-based DiTs

Xianglong He,Junyi Chen,Di Huang,Zexiang Liu,Xiaoshui Huang,Wanli Ouyang,Chun Yuan,Yangguang Li

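Task: Propose MeshCraft, an efficient and controllable 3D mesh generation framework that generates discrete triangle faces through continuous spatial diffusion.

Motivation: Existing methods such as MeshGPT rely on auto-regressive techniques, which generate slowly and cannot control the number of mesh faces, limiting practical use.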

Details

Method: MeshCraft has two components: 1) a transformer-based VAE that encodes and decodes meshes; 2) a flow-based diffusion transformer that generates high-quality meshes with a specified number of faces. Result: MeshCraft generates an 800-face mesh in just 3.2 seconds (35× faster than baselines) and outperforms existing techniques on the ShapeNet and Objaverse datasets. Conclusion: By using a diffusion model, MeshCraft achieves efficient, controllable mesh generation and could reduce artists' manual work in mesh creation. Abstract: In the domain of 3D content creation, achieving optimal mesh topology through AI models has long been a pursuit for 3D artists. Previous methods, such as MeshGPT, have explored the generation of ready-to-use 3D objects via mesh auto-regressive techniques. While these methods produce visually impressive results, their reliance on token-by-token predictions in the auto-regressive process leads to several significant limitations. These include extremely slow generation speeds and an uncontrollable number of mesh faces. In this paper, we introduce MeshCraft, a novel framework for efficient and controllable mesh generation, which leverages continuous spatial diffusion to generate discrete triangle faces. Specifically, MeshCraft consists of two core components: 1) a transformer-based VAE that encodes raw meshes into continuous face-level tokens and decodes them back to the original meshes, and 2) a flow-based diffusion transformer conditioned on the number of faces, enabling the generation of high-quality 3D meshes with a predefined number of faces. By utilizing the diffusion model for the simultaneous generation of the entire mesh topology, MeshCraft achieves high-fidelity mesh generation at significantly faster speeds compared to auto-regressive methods. Specifically, MeshCraft can generate an 800-face mesh in just 3.2 seconds (35× faster than existing baselines). Extensive experiments demonstrate that MeshCraft outperforms state-of-the-art techniques in both qualitative and quantitative evaluations on ShapeNet dataset and demonstrates superior performance on Objaverse dataset. Moreover, it integrates seamlessly with existing conditional guidance strategies, showcasing its potential to relieve artists from the time-consuming manual work involved in mesh creation.

PromptDistill: Query-based Selective Token Retention in Intermediate Layers for Efficient Large Language Model Inference

Weisheng Jin,Maojia Song,Tej Deep Pala,Yew Ken Chia,Amir Zadeh,Chuan Li,Soujanya Poria

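Task: Propose PromptDistill, a training-free method that improves the inference efficiency of large language models (LLMs) on complex tasks and long documents.

Motivation: As LLMs handle increasingly complex tasks and longer documents, computational and memory costs at inference time become a major bottleneck.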

Details

Method: Identify and retain the most informative tokens using attention interactions in early layers, dynamically allocating computation and reducing the computational burden in later layers. Result: Across multiple benchmarks, PromptDistill significantly improves efficiency with minimal impact on output quality and outperforms existing methods. Conclusion: PromptDistill strikes an effective balance between performance and efficiency, outperforms existing methods, and shows the potential of multi-stage selection. Abstract: As large language models (LLMs) tackle increasingly complex tasks and longer documents, their computational and memory costs during inference become a major bottleneck. To address this, we propose PromptDistill, a novel, training-free method that improves inference efficiency while preserving generation quality. PromptDistill identifies and retains the most informative tokens by leveraging attention interactions in early layers, preserving their hidden states while reducing the computational burden in later layers. This allows the model to focus on essential contextual information without fully processing all tokens. Unlike previous methods such as H2O and SnapKV, which perform compression only after processing the entire input, or GemFilter, which selects a fixed portion of the initial prompt without considering contextual dependencies, PromptDistill dynamically allocates computational resources to the most relevant tokens while maintaining a global awareness of the input. Experiments using our method and baseline approaches with base models such as LLaMA 3.1 8B Instruct, Phi 3.5 Mini Instruct, and Qwen2 7B Instruct on benchmarks including LongBench, InfBench, and Needle in a Haystack demonstrate that PromptDistill significantly improves efficiency while having minimal impact on output quality compared to the original models. With a single-stage selection strategy, PromptDistill effectively balances performance and efficiency, outperforming prior methods like GemFilter, H2O, and SnapKV due to its superior ability to retain essential information. Specifically, compared to GemFilter, PromptDistill achieves an overall 1% to 5% performance improvement while also offering better time efficiency. Additionally, we explore multi-stage selection, which further improves efficiency while maintaining strong generation performance.
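
The token-selection idea in the PromptDistill abstract can be illustrated with a minimal sketch. The abstract does not give the exact scoring rule, so this assumes each prompt token is scored by the attention it receives from the final token at an early "selection" layer; the function name `select_tokens` and the example scores are hypothetical.

```python
def select_tokens(attn_from_last, keep):
    """Return indices of the `keep` highest-scoring tokens, in original order.

    attn_from_last: one attention score per prompt token (assumed to come
    from the last token's attention row at an early selection layer).
    """
    if keep <= 0:
        return []
    # Rank token positions by score, highest first.
    ranked = sorted(range(len(attn_from_last)),
                    key=lambda i: attn_from_last[i], reverse=True)
    # Restore positional order so later layers see tokens in sequence.
    return sorted(ranked[:keep])


scores = [0.05, 0.30, 0.02, 0.40, 0.08, 0.15]  # made-up attention scores
kept = select_tokens(scores, keep=3)
print(kept)
```

Only the hidden states at the returned positions would be passed on to later layers; a multi-stage variant would repeat the selection at further layers with a smaller `keep`.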

Empowering Large Language Models with 3D Situation Awareness

Zhihao Yuan,Yibo Peng,Jinke Ren,Yinghong Liao,Yatong Han,Chun-Mei Feng,Hengshuang Zhao,Guanbin Li,Shuguang Cui,Zhen Li

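Task: Propose a new method that automatically generates a situation-aware dataset by leveraging the scanning trajectory recorded during data collection and vision-language models (VLMs), in order to strengthen the situational understanding of large language models (LLMs) in 3D scenes.

Motivation: Current LLM-based methods overlook the egocentric perspective in 3D scenes and use only datasets built from a global viewpoint, which makes descriptions such as "left" or "right" inaccurate.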

Details

Method: Generate high-quality captions and question-answer pairs from scanning trajectories and VLMs, and introduce a situation grounding module that predicts the observer's position and orientation. Result: Evaluation on several benchmarks confirms the method's effectiveness: it markedly improves the 3D situational awareness of LLMs while expanding existing datasets and reducing manual effort. Conclusion: The method addresses the egocentric-perspective problem in 3D scenes and offers a new direction for applying LLMs to 3D scene understanding. Abstract: Driven by the great success of Large Language Models (LLMs) in the 2D image domain, their application in 3D scene understanding has emerged as a new trend. A key difference between 3D and 2D is that the situation of an egocentric observer in 3D scenes can change, resulting in different descriptions (e.g., "left" or "right"). However, current LLM-based methods overlook the egocentric perspective and simply use datasets from a global viewpoint. To address this issue, we propose a novel approach to automatically generate a situation-aware dataset by leveraging the scanning trajectory during data collection and utilizing Vision-Language Models (VLMs) to produce high-quality captions and question-answer pairs. Furthermore, we introduce a situation grounding module to explicitly predict the position and orientation of the observer's viewpoint, thereby enabling LLMs to ground situation descriptions in 3D scenes. We evaluate our approach on several benchmarks, demonstrating that our method effectively enhances the 3D situational awareness of LLMs while significantly expanding existing datasets and reducing manual effort.

Extracting Patient History from Clinical Text: A Comparative Study of Clinical Large Language Models

Hieu Nghiem,Tuan-Dung Le,Suhao Chen,Thanh Thieu,Andrew Gin,Ellie Phuong Nguyen,Dursun Delen,Johnson Thomas,Jivan Lamichhane,Zhuqi Miao

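Task: Evaluate clinical large language models (cLLMs) on recognizing medical history entities (MHEs) related to the chief complaint (CC), history of present illness (HPI), and past, family, and social history (PFSH), and study how note characteristics affect model accuracy.

Motivation: Structuring free-text clinical notes into standardized electronic health records (EHRs) streamlines downstream tasks such as continuity of care, medical coding, and quality metrics, while on-premises deployment protects sensitive data.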

Details

Method: Annotate 1,449 MHEs across 61 outpatient-related clinical notes, fine-tune seven state-of-the-art cLLMs, compare them against GPT-4o in a zero-shot setting, and analyze how textual characteristics (note length, entity length, and segmentation) affect accuracy. Result: cLLMs could cut the time needed to extract MHEs by over 20%, but some MHE types remain hard to detect; fine-tuned GatorTron and GatorTronS perform best, and integrating pre-identified basic medical entities (BMEs) improves recognition for some entities. Conclusion: cLLMs show promise for medical entity extraction but need further work to handle polysemy and non-medical vocabulary; clearly structured text helps model performance. Abstract: Extracting medical history entities (MHEs) related to a patient's chief complaint (CC), history of present illness (HPI), and past, family, and social history (PFSH) helps structure free-text clinical notes into standardized EHRs, streamlining downstream tasks like continuity of care, medical coding, and quality metrics. Fine-tuned clinical large language models (cLLMs) can assist in this process while ensuring the protection of sensitive data via on-premises deployment. This study evaluates the performance of cLLMs in recognizing CC/HPI/PFSH-related MHEs and examines how note characteristics impact model accuracy. We annotated 1,449 MHEs across 61 outpatient-related clinical notes from the MTSamples repository. To recognize these entities, we fine-tuned seven state-of-the-art cLLMs. Additionally, we assessed the models' performance when enhanced by integrating problems, tests, treatments, and other basic medical entities (BMEs). We compared the performance of these models against GPT-4o in a zero-shot setting. To further understand the textual characteristics affecting model accuracy, we conducted an error analysis focused on note length, entity length, and segmentation. The cLLMs showed potential in reducing the time required for extracting MHEs by over 20%. However, detecting many types of MHEs remained challenging due to their polysemous nature and the frequent involvement of non-medical vocabulary. Fine-tuned GatorTron and GatorTronS, two of the most extensively trained cLLMs, demonstrated the highest performance. Integrating pre-identified BME information improved model performance for certain entities. Regarding the impact of textual characteristics on model performance, we found that longer entities were harder to identify, note length did not correlate with a higher error rate, and well-organized segments with headings were beneficial for extraction.

Visual and Semantic Prompt Collaboration for Generalized Zero-Shot Learning

Huajie Jiang,Zhengxian Li,Xiaohan Yu,Yongli Hu,Baocai Yin,Jian Yang,Yuankai Qi

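Task: Propose a novel visual and semantic prompt collaboration framework for visual-semantic alignment and feature adaptation in generalized zero-shot learning.

Motivation: Existing methods fine-tune the visual backbone, which can overfit to seen classes that have only a limited number of training images, so a more efficient feature adaptation method is needed.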

Details

Method: Design a visual prompt and a semantic prompt, combined through a weak prompt fusion mechanism in the shallow layers and a strong prompt fusion mechanism in the deep layers, to achieve visual-semantic alignment and discriminative feature learning. Result: The framework performs better than other state-of-the-art methods on both conventional and generalized zero-shot learning benchmarks. Conclusion: Collaboration between visual and semantic prompts yields discriminative, semantically relevant features and improves generalized zero-shot image recognition. Abstract: Generalized zero-shot learning aims to recognize both seen and unseen classes with the help of semantic information that is shared among different classes. It inevitably requires consistent visual-semantic alignment. Existing approaches fine-tune the visual backbone with seen-class data to obtain semantic-related visual features, which may cause overfitting on seen classes with a limited number of training images. This paper proposes a novel visual and semantic prompt collaboration framework, which utilizes prompt tuning techniques for efficient feature adaptation. Specifically, we design a visual prompt to integrate the visual information for discriminative feature learning and a semantic prompt to integrate the semantic information for visual-semantic alignment. To achieve effective prompt information integration, we further design a weak prompt fusion mechanism for the shallow layers and a strong prompt fusion mechanism for the deep layers in the network. Through the collaboration of visual and semantic prompts, we can obtain discriminative semantic-related features for generalized zero-shot image recognition. Extensive experiments demonstrate that our framework consistently achieves favorable performance in both conventional zero-shot learning and generalized zero-shot learning benchmarks compared to other state-of-the-art methods.

Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference

Wei Tao,Bin Zhang,Xiaoyang Qu,Jiguang Wan,Jianzong Wang

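Task: Propose Cocktail, a chunk-adaptive mixed-precision quantization method for optimizing the key-value (KV) cache in large language models (LLMs).

Motivation: Existing token-granularity mixed-precision quantization methods are time-consuming to search and hardware-inefficient at compute time, so they cannot meet the demands of long contexts.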

Details

Method: Cocktail has two modules: chunk-level quantization search and chunk-level KV cache computation. The former quickly determines the optimal bitwidth configuration from the similarity between context chunks and the query; the latter reorders the KV cache chunks to avoid hardware inefficiency. Result: Experiments show that Cocktail outperforms existing KV cache quantization methods across a range of models and datasets. Conclusion: Through chunk-adaptive mixed-precision quantization, Cocktail effectively optimizes the KV cache while maintaining model accuracy. Abstract: Recently, large language models (LLMs) have been able to handle longer and longer contexts. However, a context that is too long may cause intolerable inference latency and GPU memory usage. Existing methods apply mixed-precision quantization to the key-value (KV) cache in LLMs at token granularity, which is time-consuming in the search process and hardware-inefficient during computation. This paper introduces a novel approach called Cocktail, which employs chunk-adaptive mixed-precision quantization to optimize the KV cache. Cocktail consists of two modules: chunk-level quantization search and chunk-level KV cache computation. Chunk-level quantization search determines the optimal bitwidth configuration of the KV cache chunks quickly based on the similarity scores between the corresponding context chunks and the query, maintaining the model accuracy. Furthermore, chunk-level KV cache computation reorders the KV cache chunks before quantization, avoiding the hardware inefficiency caused by mixed-precision quantization in inference computation. Extensive experiments demonstrate that Cocktail outperforms state-of-the-art KV cache quantization methods on various models and datasets.
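
The two Cocktail modules described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: cosine similarity, the thresholds `hi` and `lo`, and the 16/4/2-bit levels are all hypothetical choices, since the abstract only says that bitwidths follow chunk-query similarity and that chunks are reordered before quantization.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def assign_bitwidths(chunk_embs, query_emb, hi=0.8, lo=0.3):
    """Chunk-level search: more query-relevant chunks get more bits.

    Thresholds and bit levels are assumptions made for this sketch.
    """
    bits = []
    for emb in chunk_embs:
        s = cosine(emb, query_emb)
        bits.append(16 if s >= hi else 4 if s >= lo else 2)
    return bits


def reorder_by_bitwidth(bits):
    """Return chunk indices grouped so equal-precision chunks are contiguous.

    The sort is stable, so chunks keep their relative order within a group.
    """
    return sorted(range(len(bits)), key=lambda i: -bits[i])


query = [1.0, 0.0]
chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.1]]
bits = assign_bitwidths(chunks, query)
order = reorder_by_bitwidth(bits)
print(bits, order)
```

Grouping chunks by precision means each contiguous block can be processed with a single kernel configuration rather than switching bitwidths chunk by chunk.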

FreeInv: Free Lunch for Improving DDIM Inversion

Yuxiang Bao,Huijie Liu,Xun Gao,Huan Fu,Guoliang Kang

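Task: Address the trajectory deviation problem in DDIM inversion.

Motivation: Prior methods reduce the deviation through learning or cumbersome compensation strategies, at high cost; FreeInv randomly transforms the latent representation and keeps the transformation consistent, achieving more efficient trajectory matching from a statistical perspective.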

Details

Method: Randomly transform the latent representation and keep the transformation consistent between the inversion and reconstruction steps, which yields an efficient ensemble of multiple trajectories. Result: On the PIE benchmark and the DAVIS dataset, FreeInv significantly outperforms conventional DDIM inversion with better computational efficiency. Conclusion: FreeInv is an efficient, low-cost method that fits inversion-based image and video editing techniques. Abstract: The naive DDIM inversion process usually suffers from a trajectory deviation issue, i.e., the latent trajectory during reconstruction deviates from the one during inversion. To alleviate this issue, previous methods either learn to mitigate the deviation or design cumbersome compensation strategies to reduce the mismatch error, incurring substantial time and computation cost. In this work, we present a nearly free-lunch method (named FreeInv) to address the issue more effectively and efficiently. In FreeInv, we randomly transform the latent representation and keep the transformation the same between the corresponding inversion and reconstruction time-steps. It is motivated from a statistical perspective that an ensemble of DDIM inversion processes for multiple trajectories yields a smaller trajectory mismatch error in expectation. Moreover, through theoretical analysis and empirical study, we show that FreeInv performs an efficient ensemble of multiple trajectories. FreeInv can be freely integrated into existing inversion-based image and video editing techniques. Especially for inverting video sequences, it brings more significant fidelity and efficiency improvements. Comprehensive quantitative and qualitative evaluation on the PIE benchmark and the DAVIS dataset shows that FreeInv remarkably outperforms conventional DDIM inversion, and is competitive among previous state-of-the-art inversion methods, with superior computation efficiency.

Advancing Sentiment Analysis in Tamil-English Code-Mixed Texts: Challenges and Transformer-Based Solutions

Mikhail Krasitskii,Olga Kolesnikova,Liliana Chanona Hernandez,Grigori Sidorov,Alexander Gelbukh

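Task: Explore sentiment analysis in Tamil-English code-mixed texts.

Motivation: Address the challenges posed by grammatical inconsistencies, orthographic variations, and phonetic ambiguities, and fill gaps in existing datasets and annotations.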

Details

Method: Evaluate Transformer architectures including XLM-RoBERTa, mT5, IndicBERT, and RemBERT in low-resource, code-mixed settings. Result: Analysis of performance metrics shows that specific models perform well at multilingual sentiment classification. Conclusion: Further work on data augmentation, phonetic normalization, and hybrid modeling is needed to improve accuracy; future research directions are proposed. Abstract: The sentiment analysis task in Tamil-English code-mixed texts has been explored using advanced transformer-based models. Challenges from grammatical inconsistencies, orthographic variations, and phonetic ambiguities have been addressed. The limitations of existing datasets and annotation gaps have been examined, emphasizing the need for larger and more diverse corpora. Transformer architectures, including XLM-RoBERTa, mT5, IndicBERT, and RemBERT, have been evaluated in low-resource, code-mixed environments. Performance metrics have been analyzed, highlighting the effectiveness of specific models in handling multilingual sentiment classification. The findings suggest that further advancements in data augmentation, phonetic normalization, and hybrid modeling approaches are required to enhance accuracy. Future research directions for improving sentiment analysis in code-mixed texts have been proposed.

STSA: Spatial-Temporal Semantic Alignment for Visual Dubbing

Zijun Ding,Mingdie Xiong,Congcong Zhu,Jingrun Chen

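Task: Propose a Spatial-Temporal Semantic Alignment (STSA) method to address the instability that semantic ambiguity causes in dynamic face synthesis.

Motivation: Existing audio-driven visual dubbing methods suffer from semantic ambiguity, in particular semantic differences between the spatial and temporal domains, which makes dynamic face synthesis unstable.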

Details

Method: STSA combines a dual-path alignment mechanism with a differentiable semantic representation: the former maximizes multi-scale mutual information through a Consistent Information Learning module, and the latter uses probabilistic heatmaps as ambiguity-tolerant guidance. Result: Experiments show that STSA is superior in image quality and synthesis stability. Conclusion: Through semantic alignment, STSA markedly improves the stability of dynamic face synthesis and has practical value. Abstract: Existing audio-driven visual dubbing methods have achieved great success. Despite this, we observe that the semantic ambiguity between spatial and temporal domains significantly degrades the synthesis stability for dynamic faces. We argue that aligning the semantic features from spatial and temporal domains is a promising approach to stabilizing facial motion. To achieve this, we propose a Spatial-Temporal Semantic Alignment (STSA) method, which introduces a dual-path alignment mechanism and a differentiable semantic representation. The former leverages a Consistent Information Learning (CIL) module to maximize the mutual information at multiple scales, thereby reducing the manifold differences between spatial and temporal domains. The latter utilizes a probabilistic heatmap as ambiguity-tolerant guidance to avoid the abnormal dynamics of the synthesized faces caused by slight semantic jittering. Extensive experimental results demonstrate the superiority of the proposed STSA, especially in terms of image quality and synthesis stability. Pre-trained weights and inference code are available at https://github.com/SCAILab-USTC/STSA.

Using Source-Side Confidence Estimation for Reliable Translation into Unfamiliar Languages

Kenneth J. Sible,David Chiang

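Task: Design an interactive machine translation system that helps users who are not proficient in the target language by making translations more trustworthy and explainable.

Motivation: Confidence estimation in machine translation has traditionally focused on the target side, and source-side confidence is usually obtained by projecting target word probabilities onto the source through word alignments, an approach with known limitations.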

Details

Method: Propose a direct, alignment-free approach that estimates source-side confidence by measuring how sensitive the target word probabilities are to changes in the source embeddings. Result: Experiments show that the method outperforms traditional alignment-based methods at detecting mistranslations. Conclusion: The method offers a more effective way to estimate source-side confidence and makes interactive machine translation systems more practical. Abstract: We present an interactive machine translation (MT) system designed for users who are not proficient in the target language. It aims to improve trustworthiness and explainability by identifying potentially mistranslated words and allowing the user to intervene to correct mistranslations. However, confidence estimation in machine translation has traditionally focused on the target side. Whereas the conventional approach to source-side confidence estimation would have been to project target word probabilities to the source side via word alignments, we propose a direct, alignment-free approach that measures how sensitive the target word probabilities are to changes in the source embeddings. Experimental results show that our method outperforms traditional alignment-based methods at detection of mistranslations.
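
The sensitivity measure in the abstract above can be illustrated with a small numerical sketch. It assumes a finite-difference approximation in place of whatever gradient computation the authors use, and `toy_prob` is a hypothetical stand-in for a translation model's target-word probability; neither comes from the paper.

```python
import math


def sensitivity(prob_fn, src_embs, eps=1e-4):
    """Per-source-word sensitivity of a target probability.

    For each source word, perturb every embedding dimension by `eps`,
    estimate the partial derivative by finite differences, and return the
    L2 norm of those derivatives. (Assumption: the paper may compute this
    differently, e.g. with exact gradients.)
    """
    base = prob_fn(src_embs)
    scores = []
    for i, emb in enumerate(src_embs):
        total = 0.0
        for d in range(len(emb)):
            bumped = [list(e) for e in src_embs]  # copy, then perturb one dim
            bumped[i][d] += eps
            total += ((prob_fn(bumped) - base) / eps) ** 2
        scores.append(math.sqrt(total))
    return scores


def toy_prob(embs):
    """Hypothetical model: the target word depends only on source word 0."""
    z = 2.0 * embs[0][0] + 0.0 * embs[1][0]
    return 1.0 / (1.0 + math.exp(-z))


scores = sensitivity(toy_prob, [[0.1, 0.0], [0.5, 0.0]])
print(scores)
```

In this toy setup only the first source word influences the target probability, so it receives a nonzero score while the second scores zero; how such scores are turned into a confidence value shown to the user is left open here.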

CityGS-X: A Scalable Architecture for Efficient and Geometrically Accurate Large-Scale Scene Reconstruction

Yuanyuan Gao,Hao Li,Jiaqi Chen,Zhengyu Zou,Zhihang Zhong,Dingwen Zhang,Xiao Sun,Junwei Han

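Task: Propose CityGS-X, a scalable architecture that addresses the efficiency, computational cost, and geometric accuracy problems of 3D Gaussian Splatting in large-scale scene reconstruction.

Motivation: In large-scale scene reconstruction, 3D Gaussian Splatting suffers from slow processing, high computational cost, and limited geometric accuracy, mainly because of its unstructured design and lack of efficient parallelization.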

Details

Method: Build a scalable architecture on a novel parallelized hybrid hierarchical 3D representation (PH^2-3D), using dynamic Level-of-Detail voxel allocations for efficient multi-GPU rendering. Result: CityGS-X outperforms existing methods in training speed, rendering capacity, and geometric detail, and can efficiently handle scenes with 5,000+ images. Conclusion: CityGS-X substantially improves the performance and scalability of large-scale scene reconstruction, well beyond the capacity of existing methods. Abstract: Despite its significant achievements in large-scale scene reconstruction, 3D Gaussian Splatting still faces substantial challenges, including slow processing, high computational costs, and limited geometric accuracy. These core issues arise from its inherently unstructured design and the absence of efficient parallelization. To overcome these challenges simultaneously, we introduce CityGS-X, a scalable architecture built on a novel parallelized hybrid hierarchical 3D representation (PH^2-3D). As an early attempt, CityGS-X abandons the cumbersome merge-and-partition process and instead adopts a newly designed batch-level multi-task rendering process. This architecture enables efficient multi-GPU rendering through dynamic Level-of-Detail voxel allocations, significantly improving scalability and performance. In extensive experiments, CityGS-X consistently outperforms existing methods, with faster training times, larger rendering capacities, and more accurate geometric details in large-scale scenes. Notably, CityGS-X can train and render a scene with 5,000+ images in just 5 hours using only 4 * 4090 GPUs, a task on which alternative methods encounter Out-Of-Memory (OOM) issues and fail completely. This implies that CityGS-X is far beyond the capacity of other existing methods.

Focus Directions Make Your Language Models Pay More Attention to Relevant Contexts

Youxiang Zhu,Ruochen Li,Danqing Wang,Daniel Haehn,Xiaohui Liang

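Task: Study why long-context large language models (LLMs) get distracted and how to mitigate it.

Motivation: Long-context LLMs are easily distracted by irrelevant context, but the cause is not well understood and needs closer investigation.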

Details

Method: Identify the contextual heads that control the LLM's overall attention and mitigate distraction by increasing attention to relevant contexts; further introduce the concept of focus directions to improve attention allocation. Result: Focus directions effectively mitigate the poor task alignment of long-context LLMs. Conclusion: The findings may promote further research on long-context LLM alignment. Abstract: Long-context large language models (LLMs) are prone to be distracted by irrelevant contexts. The reason for distraction remains poorly understood. In this paper, we first identify the contextual heads, a special group of attention heads that control the overall attention of the LLM. Then, we demonstrate that distraction arises when contextual heads fail to allocate sufficient attention to relevant contexts and can be mitigated by increasing attention to these contexts. We further identify focus directions, located at the key and query activations of these heads, which enable them to allocate more attention to relevant contexts without explicitly specifying which context is relevant. We comprehensively evaluate the effect of focus directions on various long-context tasks and find that focus directions can help to mitigate the poor task alignment of long-context LLMs. We believe our findings could promote further research on long-context LLM alignment.

Shape and Texture Recognition in Large Vision-Language Models

Sagi Eppel,Mor Bismut,Alona Faktor

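Task: Evaluate how well large vision-language models (LVLMs) recognize shapes, textures, and materials.

Motivation: Shape and texture recognition is fundamental to visual perception, but existing models still fall short of human performance on these tasks, so more comprehensive datasets and evaluation methods are needed.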

Details

Method: Test LVLMs on the Large Shape & Textures dataset (LAS&T), evaluating their performance on shape matching and texture recognition tasks. Result: LVLMs perform significantly below humans on shape recognition and are weak at recognizing abstract shapes; they approach human-level performance on material recognition in 3D scenes but perform poorly on 2D texture recognition. Conclusion: The LAS&T dataset provides a comprehensive evaluation resource for shape and texture recognition and exposes the limitations of LVLMs on visual understanding tasks. Abstract: Shape and texture recognition is fundamental to visual perception. The ability to identify shapes regardless of orientation, texture, or context, and to recognize textures independently of their associated objects, is essential for general visual understanding of the world. We introduce the Large Shape & Textures dataset (LAS&T), a giant collection of diverse shapes and textures automatically extracted from real-world images. This dataset is used to evaluate how effectively leading Large Vision-Language Models (LVLMs) understand shapes, textures, and materials in both 2D and 3D scenes. For shape recognition, we test models' ability to match identical shapes that differ in orientation, texture, color, or environment. Our results show that LVLMs' shape identification capabilities remain significantly below human performance. Single alterations (orientation, texture) cause minor decreases in matching accuracy, while multiple changes precipitate dramatic drops. LVLMs appear to rely predominantly on high-level and semantic features and struggle with abstract shapes lacking clear class associations. For texture and material recognition, we evaluate models' ability to identify identical textures and materials across different objects and environments. Interestingly, leading LVLMs approach human-level performance in recognizing materials in 3D scenes, yet substantially underperform humans when identifying simpler 2D textures. The LAS&T dataset and benchmark, the largest and most diverse resource for shape and texture evaluation, is freely available with generation and testing scripts.

Linguistic Loops and Geometric Invariants as a Way to Pre-Verbal Thought?

Daniele Corradetti,Alessio Marrani

Task: Introduce and define linguistic transformation, linguistic loop, and semantic deficit using Lie group theory and geometry.

Motivation: To explore structural properties of linguistic loops and potentially characterize pre-verbal thought mathematically.

Details

Method: Employ Lie group theoretical and geometric techniques to define invariants capturing linguistic loop structures. Result: A new research direction combining Lie theory and higher-dimensional geometry in language studies, with implications for understanding pre-verbal thought. Conclusion: The study opens avenues for mathematical characterization of meta-linguistic or pre-verbal cognitive structures. Abstract: In this work we introduce the concepts of linguistic transformation, linguistic loop and semantic deficit. By exploiting Lie group theoretical and geometric techniques, we define invariants that capture the structural properties of a whole linguistic loop. This result introduces a new line of research, employing tools from Lie theory and higher-dimensional geometry within language studies. But, even more intriguingly, our study hints at a mathematical characterization of meta-linguistic or pre-verbal thought, namely of those cognitive structures that precede language.

VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models

Yufan Ren,Konstantinos Tertikas,Shalini Maiti,Junlin Han,Tong Zhang,Sabine Süsstrunk,Filippos Kokkinos

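Task: Evaluate and improve the performance of large vision-language models (LVLMs) on puzzle-solving tasks.

Motivation: Existing benchmarks lack a systematic evaluation of reasoning ability and do not focus on puzzle solving, even though puzzle solving reflects structured reasoning, which is essential for real-world problem-solving.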

Details

Method: Introduce VGRP-Bench, a visual grid reasoning puzzle benchmark with 20 diverse puzzles, and run extensive experiments on existing LVLMs such as GPT-4o and Gemini-Thinking. Result: Even state-of-the-art LVLMs perform poorly on these puzzles, revealing fundamental limitations in their puzzle-solving capabilities. The experiments also identify key factors that affect performance (number of clues, grid size, and rule complexity) and explore two supervised fine-tuning strategies (S-SFT and R-SFT). Conclusion: Releasing VGRP-Bench will facilitate further research on LVLMs for complex, real-world problem-solving. Abstract: Large Vision-Language Models (LVLMs) struggle with puzzles, which require precise perception, rule comprehension, and logical reasoning. Assessing and enhancing their performance in this domain is crucial, as it reflects their ability to engage in structured reasoning - an essential skill for real-world problem-solving. However, existing benchmarks primarily evaluate pre-trained models without additional training or fine-tuning, often lack a dedicated focus on reasoning, and fail to establish a systematic evaluation framework. To address these limitations, we introduce VGRP-Bench, a Visual Grid Reasoning Puzzle Benchmark featuring 20 diverse puzzles. VGRP-Bench spans multiple difficulty levels, and includes extensive experiments not only on existing chat LVLMs (e.g., GPT-4o), but also on reasoning LVLMs (e.g., Gemini-Thinking). Our results reveal that even the state-of-the-art LVLMs struggle with these puzzles, highlighting fundamental limitations in their puzzle-solving capabilities. Most importantly, through systematic experiments, we identify and analyze key factors influencing LVLMs' puzzle-solving performance, including the number of clues, grid size, and rule complexity. Furthermore, we explore two Supervised Fine-Tuning (SFT) strategies that can be used in post-training: SFT on solutions (S-SFT) and SFT on synthetic reasoning processes (R-SFT). While both methods significantly improve performance on trained puzzles, they exhibit limited generalization to unseen ones. We will release VGRP-Bench to facilitate further research on LVLMs for complex, real-world problem-solving.

Not All LoRA Parameters Are Essential: Insights on Inference Necessity

Guanhua Chen,Yutong Yao,Ci-Jun Gao,Lidia S. Chao,Feng Wan,Derek F. Wong

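Task: Study whether every LoRA layer is necessary at inference time and how this affects model performance.

Motivation: Existing work mainly focuses on reducing the number of fine-tuned LoRA parameters or optimizing its architecture, and has not adequately explored whether all fine-tuned LoRA layers are needed during inference.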

Details

Method: Determine a "boundary layer" by analyzing validation samples, then drop the LoRA layers beyond that boundary at inference time. Result: Evaluation on four text generation datasets confirms the method's effectiveness, with significant performance improvements. Conclusion: Selectively retaining the critical LoRA layers effectively improves model performance. Abstract: Current research on LoRA primarily focuses on minimizing the number of fine-tuned parameters or optimizing its architecture. However, the necessity of all fine-tuned LoRA layers during inference remains underexplored. In this paper, we investigate the contribution of each LoRA layer to the model's ability to predict the ground truth and hypothesize that lower-layer LoRA modules play a more critical role in model reasoning and understanding. To address this, we propose a simple yet effective method to enhance the performance of large language models (LLMs) fine-tuned with LoRA. Specifically, we identify a "boundary layer" that distinguishes essential LoRA layers by analyzing a small set of validation samples. During inference, we drop all LoRA layers beyond this boundary. We evaluate our approach on three strong baselines across four widely-used text generation datasets. Our results demonstrate consistent and significant improvements, underscoring the effectiveness of selectively retaining critical LoRA layers during inference.
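
Dropping LoRA layers beyond a boundary can be illustrated with a deliberately tiny model. Everything below is a hypothetical sketch: each "layer" is a scalar weight with a scalar LoRA delta, and the boundary is chosen by squared error on a few validation pairs, whereas the abstract does not state the selection criterion the authors actually use.

```python
def forward(x, base_weights, lora_deltas, boundary):
    """Run a toy stack of scalar layers.

    Layers with index < boundary use base + LoRA delta; layers at or
    beyond the boundary use the base weight only (their LoRA is dropped).
    """
    for i, (w, d) in enumerate(zip(base_weights, lora_deltas)):
        scale = w + d if i < boundary else w
        x = scale * x
    return x


def pick_boundary(base_weights, lora_deltas, val_pairs):
    """Try every boundary and keep the one with the lowest validation error.

    Assumption: squared error on (input, target) pairs stands in for the
    paper's unspecified analysis of validation samples.
    """
    best_b, best_err = 0, float("inf")
    for b in range(len(base_weights) + 1):
        err = sum((forward(x, base_weights, lora_deltas, b) - y) ** 2
                  for x, y in val_pairs)
        if err < best_err:
            best_b, best_err = b, err
    return best_b


base = [1.0, 1.0, 1.0]
deltas = [1.0, 1.0, 1.0]
validation = [(1.0, 4.0), (2.0, 8.0)]  # targets match LoRA on two layers only
boundary = pick_boundary(base, deltas, validation)
print(boundary)
```

In this toy case the validation targets are best matched when only the two lowest layers keep their LoRA deltas, which mirrors the abstract's hypothesis that lower-layer modules matter most.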

InkFM: A Foundational Model for Full-Page Online Handwritten Note Understanding

Anastasiia Fadeeva,Vincent Coriou,Diego Antognini,Claudiu Musat,Andrii Maksai

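Task: Develop InkFM, a foundational model for analyzing full pages of handwritten digital notes.

Motivation: Improve the tablet-and-stylus note-taking experience and ensure a smooth, efficient workflow for handwritten digital notes.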

Details

Method: Train a model on a diverse mixture of tasks so that it can recognize text in 28 different scripts, recognize mathematical expressions, and segment pages into distinct elements. Result: The model performs strongly on text line segmentation, text recognition, and sketch classification, surpassing existing baselines and reaching state-of-the-art results on several public datasets. Conclusion: InkFM's adaptability provides a powerful starting point for developing applications with handwritten input. Abstract: Tablets and styluses are increasingly popular for taking notes. To optimize this experience and ensure a smooth and efficient workflow, it's important to develop methods for accurately interpreting and understanding the content of handwritten digital notes. We introduce a foundational model called InkFM for analyzing full pages of handwritten content. Trained on a diverse mixture of tasks, this model offers a unique combination of capabilities: recognizing text in 28 different scripts, recognizing mathematical expressions, and segmenting pages into distinct elements like text and drawings. Our results demonstrate that these tasks can be effectively unified within a single model, achieving SoTA out-of-the-box text line segmentation quality that surpasses public baselines like docTR. Fine- or LoRA-tuning our base model on public datasets further improves the quality of page segmentation and achieves state-of-the-art text recognition (DeepWriting, CASIA, SCUT, and Mathwriting datasets) and sketch classification (QuickDraw). This adaptability of InkFM provides a powerful starting point for developing applications with handwritten input.

Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base

Linxin Song,Xuwei Ding,Jieyu Zhang,Taiwei Shi,Ryotaro Shimizu,Rahul Gupta,Yang Liu,Jian Kang,Jieyu Zhao

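Task: Propose stochastic error ascent (SEA), a scalable and efficient framework for discovering knowledge deficiencies in closed-weight large language models (LLMs) under a strict query budget.

Motivation: LLMs have strong linguistic capabilities but retain factual knowledge poorly, which leads to hallucinated and unreliable outputs. Exhaustively evaluating their knowledge deficiencies is computationally infeasible, especially for closed-weight models.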

Details

Method: Propose the stochastic error ascent (SEA) framework, which iteratively retrieves high-error candidates through a stochastic optimization process and uses semantic similarity and a hierarchical retrieval strategy to improve efficiency. Result: SEA uncovers 40.7x and 26.7% more errors than two existing methods and cuts the cost per error by 599x and 9x, respectively. Conclusion: SEA is efficient and scalable, reveals deficiencies shared across LLMs, and points to directions for improving future models. Abstract: Large language models (LLMs) possess impressive linguistic capabilities but often fail to faithfully retain factual knowledge, leading to hallucinations and unreliable outputs. Understanding LLMs' knowledge deficiencies by exhaustively evaluating against full-scale knowledge bases is computationally prohibitive, especially for closed-weight models. We propose stochastic error ascent (SEA), a scalable and efficient framework for discovering knowledge deficiencies (errors) in closed-weight LLMs under a strict query budget. Rather than naively probing all knowledge candidates, SEA formulates error discovery as a stochastic optimization process: it iteratively retrieves new high-error candidates by leveraging the semantic similarity to previously observed failures. To further enhance search efficiency and coverage, SEA employs hierarchical retrieval across document and paragraph levels, and constructs a relation directed acyclic graph to model error propagation and identify systematic failure modes. Empirically, SEA uncovers 40.7x more knowledge errors than Automated Capability Discovery and 26.7% more than AutoBencher, while reducing the cost-per-error by 599x and 9x, respectively. Human evaluation confirms the high quality of generated questions, while ablation and convergence analyses validate the contribution of each component in SEA. Further analysis on the discovered errors reveals correlated failure patterns across LLM families and recurring deficits, highlighting the need for better data coverage and targeted fine-tuning in future LLM development.
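
The core loop of SEA, retrieving new candidates that resemble previously observed failures while staying within a query budget, might look like the sketch below. It is a simplification under explicit assumptions: similarity is a plain dot product over made-up embeddings, `is_error` is a hypothetical callback standing in for querying the model and grading its answer, and the hierarchical retrieval and relation DAG from the abstract are left out.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def error_ascent(candidates, is_error, budget, batch_size=2):
    """Greedy, similarity-guided search for model errors.

    candidates: dict mapping candidate name -> embedding (list of floats).
    is_error:   callable(name) -> bool; each call counts as one query.
    Returns (names found to be errors, number of queries used).
    """
    remaining = dict(candidates)
    failures = []  # embeddings of candidates the model got wrong
    found = []
    queries = 0
    while remaining and queries < budget:
        def score(name):
            # With no failures yet, all candidates tie and order is kept.
            if not failures:
                return 0.0
            return max(dot(remaining[name], f) for f in failures)

        take = min(batch_size, budget - queries)
        batch = sorted(remaining, key=score, reverse=True)[:take]
        for name in batch:
            emb = remaining.pop(name)
            queries += 1
            if is_error(name):
                failures.append(emb)
                found.append(name)
    return found, queries


pool = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1],
        "d": [0.1, 0.9], "e": [1.0, 0.1]}
wrong = {"a", "c", "e"}  # hypothetical ground truth for the demo
found, used = error_ascent(pool, lambda n: n in wrong, budget=4)
print(found, used)
```

After the first failure is observed, the search concentrates its remaining queries on candidates near that failure, which is how all three errors in the toy pool are found with four queries out of five candidates.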

Efficient Adaptation For Remote Sensing Visual Grounding

Hasan Moughnieh,Mohamad Chalhoub,Hasan Nasrallah,Cristiano Nattero,Paolo Campanella,Ali J. Ghandour

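Task: Use Parameter Efficient Fine Tuning (PEFT) techniques to improve foundation models on the Visual Grounding (VG) task in remote sensing (RS).

Motivation: Applying foundation models directly to remote sensing gives poor results, so they need to be adapted to the domain's specific challenges.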

Details

Method: Fine-tune the Grounding DINO and OFA foundation models with PEFT techniques including LoRA, BitFit, and adapters. Result: Performance matches or surpasses the current state of the art while significantly reducing computational cost. Conclusion: PEFT techniques offer an efficient, precise, and cost-effective solution for multi-modal analysis in remote sensing. Abstract: Foundation models have revolutionized artificial intelligence (AI), offering remarkable capabilities across multi-modal domains. Their ability to precisely locate objects in complex aerial and satellite images, using rich contextual information and detailed object descriptions, is essential for remote sensing (RS). These models can associate textual descriptions with object positions through the Visual Grounding (VG) task, but due to domain-specific challenges, their direct application to RS produces sub-optimal results. To address this, we applied Parameter Efficient Fine Tuning (PEFT) techniques to adapt these models for RS-specific VG tasks. Specifically, we evaluated LoRA placement across different modules in Grounding DINO and used BitFit and adapters to fine-tune the OFA foundation model pre-trained on general-purpose VG datasets. This approach achieved performance comparable to or surpassing current state-of-the-art (SOTA) models while significantly reducing computational costs. This study highlights the potential of PEFT techniques to advance efficient and precise multi-modal analysis in RS, offering a practical and cost-effective alternative to full model training.

Mixture of Routers

Jia-Chen Zhang,Yu-Jie Xiong,Xi-He Qiu,Chun-Ming Xia,Fei Dai

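Task: Propose Mixture of Routers (MoR), an efficient fine-tuning method that addresses the routing problems that arise when LoRA is combined with MoE.

Motivation: LoRA alone gives limited gains on large models, and while MoE can significantly improve fine-tuning performance, its routing mechanism suffers from incorrect and imbalanced expert assignment.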

Details

Method: Bring the MoE concept into the routing mechanism itself: MoR uses multiple sub-routers for joint selection and a learnable main router that determines the sub-routers' weights. Result: MoR outperforms baseline models on most tasks, with an average performance improvement of 1%. Conclusion: MoR is a plug-and-play, parameter-efficient fine-tuning method suitable for a wide range of applications. Abstract: Supervised fine-tuning (SFT) is a milestone in aligning large language models with human instructions and adapting them to downstream tasks. In particular, Low-Rank Adaptation (LoRA) has gained widespread attention due to its parameter efficiency. However, its impact on improving the performance of large models remains limited. Recent studies suggest that combining LoRA with Mixture-of-Experts (MoE) can significantly enhance fine-tuning performance. MoE adapts to the diversity and complexity of datasets by dynamically selecting the most suitable experts, thereby improving task accuracy and efficiency. Despite impressive results, recent studies reveal issues in the MoE routing mechanism, such as incorrect assignments and imbalanced expert allocation. Inspired by the principles of Redundancy and Fault Tolerance Theory, we integrate the concept of Mixture of Experts into the routing mechanism and propose an efficient fine-tuning method called Mixture of Routers (MoR). It employs multiple sub-routers for joint selection and uses a learnable main router to determine the weights of the sub-routers. The results show that MoR outperforms baseline models on most tasks, achieving an average performance improvement of 1%. MoR can serve as a plug-and-play, parameter-efficient fine-tuning method suitable for a wide range of applications. Our code is available here: https://anonymous.4open.science/r/MoR-DFC6.
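
The joint selection described for MoR can be sketched in a few lines. The abstract does not say whether sub-router outputs are combined as logits or as probabilities, so this sketch assumes a weighted sum of per-sub-router softmax distributions, with weights taken from a softmax over the main router's logits; all names and numbers are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def mixture_of_routers(sub_router_logits, main_router_logits, top_k=1):
    """Combine several sub-routers' expert distributions.

    sub_router_logits:  one list of expert logits per sub-router.
    main_router_logits: one logit per sub-router (the learnable main
                        router's output in a real model).
    Returns (combined expert distribution, indices of the top_k experts).
    """
    weights = softmax(main_router_logits)
    num_experts = len(sub_router_logits[0])
    combined = [0.0] * num_experts
    for w, logits in zip(weights, sub_router_logits):
        probs = softmax(logits)
        for e in range(num_experts):
            combined[e] += w * probs[e]
    chosen = sorted(range(num_experts),
                    key=lambda e: combined[e], reverse=True)[:top_k]
    return combined, chosen


# Two sub-routers prefer expert 0, one prefers expert 1; equal main weights.
subs = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 0.0, 0.0]]
combined, chosen = mixture_of_routers(subs, [0.0, 0.0, 0.0])
print(combined, chosen)
```

With equal main-router weights the result behaves like a soft majority vote, so a single sub-router's wrong assignment is outvoted; a trained main router would shift the weights toward the sub-routers that prove more reliable.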

FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video

Andrea Boscolo Camiletto,Jian Wang,Eduardo Alvarado,Rishabh Dabral,Thabo Beeler,Marc Habermann,Christian Theobalt

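Task: Perform egocentric motion capture with a head-mounted, body-facing stereo camera, addressing heavy occlusions and the shortage of annotated data in VR and AR applications.

Motivation: Existing methods rely on synthetic pretraining and struggle to produce smooth, accurate predictions in real-world settings, particularly for the lower limbs.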

Details

Method: Introduce a lightweight VR-based data collection setup that pairs device pose with camera feeds, and design the FRAME architecture to integrate these modalities for efficient pose prediction. Result: The authors collected the largest real-world dataset of its kind and, through geometrically sound multimodal integration, achieve high-quality motion capture at 300 FPS on modern hardware. Conclusion: By exploiting the problem's geometric properties and a novel training strategy, FRAME markedly improves motion capture quality and overcomes the limitations of existing methods. Abstract: Egocentric motion capture with a head-mounted body-facing stereo camera is crucial for VR and AR applications but presents significant challenges such as heavy occlusions and limited annotated real-world data. Existing methods rely on synthetic pretraining and struggle to generate smooth and accurate predictions in real-world settings, particularly for lower limbs. Our work addresses these limitations by introducing a lightweight VR-based data collection setup with on-board, real-time 6D pose tracking. Using this setup, we collected the most extensive real-world dataset for ego-facing ego-mounted cameras to date in size and motion variability. Effectively integrating this multimodal input -- device pose and camera feeds -- is challenging due to the differing characteristics of each data source. To address this, we propose FRAME, a simple yet effective architecture that combines device pose and camera feeds for state-of-the-art body pose prediction through geometrically sound multimodal integration and can run at 300 FPS on modern hardware. Lastly, we showcase a novel training strategy to enhance the model's generalization capabilities. Our approach exploits the problem's geometric properties, yielding high-quality motion capture free from common artifacts in prior works. Qualitative and quantitative evaluations, along with extensive comparisons, demonstrate the effectiveness of our method. Data, code, and CAD designs will be available at https://vcai.mpi-inf.mpg.de/projects/FRAME/

FeRG-LLM : Feature Engineering by Reason Generation Large Language Models

Jeonghyun Ko,Gyeongyun Park,Donghoon Lee,Kyunam Lee

Task: Propose FeRG-LLM, a framework that uses large language models to automate feature engineering for tabular data.

Motivation: Feature engineering is vital for model performance but demands substantial manual effort and domain knowledge, which motivates automating it.

Details

Method: Constructs two-stage conversational dialogues that use Chain-of-Thought capabilities to analyze the task and generate new features, fine-tunes Llama 3.1 8B on them, and integrates DPO feedback. Result: FeRG-LLM performs comparably to or better than Llama 3.1 70B on most datasets while using fewer resources and less inference time, and outperforms other studies on classification tasks. Conclusion: FeRG-LLM is an efficient, locally deployable feature engineering solution that addresses resource and security concerns. Abstract: One of the key tasks in machine learning for tabular data is feature engineering. Although it is vital for improving the performance of models, it demands considerable human expertise and deep domain knowledge, making it a labor-intensive endeavor. To address this issue, we propose a novel framework, FeRG-LLM (Feature engineering by Reason Generation Large Language Models), a large language model designed to automatically perform feature engineering at an 8-billion-parameter scale. We have constructed two-stage conversational dialogues that enable language models to analyze machine learning tasks and discover new features, exhibiting their Chain-of-Thought (CoT) capabilities. We use these dialogues to fine-tune the Llama 3.1 8B model and integrate Direct Preference Optimization (DPO) to receive feedback that improves the quality of new features and the model's performance. Our experiments show that FeRG-LLM performs comparably to or better than Llama 3.1 70B on most datasets, while using fewer resources and achieving reduced inference time. It outperforms other studies in classification tasks and performs well in regression tasks. Moreover, since it does not rely on cloud-hosted LLMs like GPT-4 with extra API costs when generating features, it can be deployed locally, addressing security concerns.

Open-Vocabulary Semantic Segmentation with Uncertainty Alignment for Robotic Scene Understanding in Indoor Building Environments

Yifan Xu,Vineet Kamat,Carol Menassa

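Task: Propose an open-vocabulary scene semantic segmentation and detection method based on Vision Language Models (VLMs) and Large Language Models (LLMs) to improve assistive robot navigation in complex environments.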

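Motivation: The number of people with physical disabilities is rising worldwide, increasing demand for advanced assistive technologies; existing methods fall short in interpreting intuitive human instructions and in handling scene uncertainty.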

Details

Method: Adopts a 'Segment Detect Select' framework that combines VLMs and LLMs for open-vocabulary scene classification. Result: The method adapts to complex environments and provides more intuitive, adaptive navigation support. Conclusion: The proposed method addresses limitations of existing techniques and offers a better solution for assistive robot navigation. Abstract: The global rise in the number of people with physical disabilities, in part due to improvements in post-trauma survivorship and longevity, has amplified the demand for advanced assistive technologies to improve mobility and independence. Autonomous assistive robots, such as smart wheelchairs, require robust capabilities in spatial segmentation and semantic recognition to navigate complex built environments effectively. Place segmentation involves delineating spatial regions like rooms or functional areas, while semantic recognition assigns semantic labels to these regions, enabling accurate localization to user-specific needs. Existing approaches often utilize deep learning; however, these closed-vocabulary detection systems struggle to interpret intuitive and casual human instructions. Additionally, most existing methods ignore the uncertainty of the scene recognition problem, leading to low success rates, particularly in ambiguous and complex environments. To address these challenges, we propose an open-vocabulary scene semantic segmentation and detection pipeline leveraging Vision Language Models (VLMs) and Large Language Models (LLMs). Our approach follows a 'Segment Detect Select' framework for open-vocabulary scene classification, enabling adaptive and intuitive navigation for assistive robots in built environments.

ToRL: Scaling Tool-Integrated RL

Xuefeng Li,Haoyang Zou,Pengfei Liu

Task: Train large language models (LLMs) to use computational tools autonomously via reinforcement learning.

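Motivation: Explore how models can discover optimal tool-use strategies through reinforcement learning rather than relying on supervised fine-tuning.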

Details

Method: Proposes the ToRL framework, which uses reinforcement learning to train LLMs to use tools autonomously. Result: ToRL-7B reaches 43.3% accuracy on AIME 24, 14% higher than reinforcement learning without tool integration and 17% higher than the best existing TIR model. Conclusion: Through reward-driven learning, ToRL exhibits strategic tool invocation, self-regulation of ineffective code, and dynamic adaptation between computational and analytical reasoning. Abstract: We introduce ToRL (Tool-Integrated Reinforcement Learning), a framework for training large language models (LLMs) to autonomously use computational tools via reinforcement learning. Unlike supervised fine-tuning, ToRL allows models to explore and discover optimal strategies for tool use. Experiments with Qwen2.5-Math models show significant improvements: ToRL-7B reaches 43.3% accuracy on AIME 24, surpassing reinforcement learning without tool integration by 14% and the best existing Tool-Integrated Reasoning (TIR) model by 17%. Further analysis reveals emergent behaviors such as strategic tool invocation, self-regulation of ineffective code, and dynamic adaptation between computational and analytical reasoning, all arising purely through reward-driven learning.

A large-scale image-text dataset benchmark for farmland segmentation

Chao Tao,Dandan Zhong,Weiliang Mu,Zhuofei Du,Haiyang Wu

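Task: Propose a language-driven learning paradigm and develop the FarmSeg-VL dataset to address the spatiotemporal heterogeneity of farmland remote sensing imagery.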

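Motivation: The traditional deep learning paradigm relies on labeled data and struggles to model the dynamic temporal evolution and spatial heterogeneity of farmland, whereas language can explicitly express farmland's spatiotemporal characteristics.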

Details

Method: Proposes a semi-automatic annotation method to build the FarmSeg-VL dataset, which covers all four seasons and eight typical agricultural regions and includes rich descriptions of spatiotemporal characteristics. Result: FarmSeg-VL exhibits significant spatiotemporal characteristics, and its potential for farmland segmentation is validated. Conclusion: FarmSeg-VL provides the first fine-grained image-text benchmark dataset for farmland segmentation and supports language-driven research. Abstract: The traditional deep learning paradigm that solely relies on labeled data has limitations in representing the spatial relationships between farmland elements and the surrounding environment. It struggles to effectively model the dynamic temporal evolution and spatial heterogeneity of farmland. Language, as a structured knowledge carrier, can explicitly express the spatiotemporal characteristics of farmland, such as its shape, distribution, and surrounding environmental information. Therefore, a language-driven learning paradigm can effectively alleviate the challenges posed by the spatiotemporal heterogeneity of farmland. However, in the field of remote sensing imagery of farmland, there is currently no comprehensive benchmark dataset to support this research direction. To fill this gap, we introduced language-based descriptions of farmland and developed the FarmSeg-VL dataset, the first fine-grained image-text dataset designed for spatiotemporal farmland segmentation. Firstly, this article proposed a semi-automatic annotation method that can accurately assign a caption to each image, ensuring high data quality and semantic richness while improving the efficiency of dataset construction. Secondly, FarmSeg-VL exhibits significant spatiotemporal characteristics. In terms of the temporal dimension, it covers all four seasons. In terms of the spatial dimension, it covers eight typical agricultural regions across China. In addition, in terms of captions, FarmSeg-VL covers rich spatiotemporal characteristics of farmland, including its inherent properties, phenological characteristics, spatial distribution, topographic and geomorphic features, and the distribution of surrounding environments. Finally, we present a performance analysis of VLMs and the deep learning models that rely solely on labels trained on FarmSeg-VL, demonstrating its potential as a standard benchmark for farmland segmentation.

An Analysis of Decoding Methods for LLM-based Agents for Faithful Multi-Hop Question Answering

Alexander Murphy,Mohd Sanad Zaki Rizvi,Aden Haussmann,Ping Nie,Guifu Liu,Aryo Pradipta Gema,Pasquale Minervini

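Task: Analyze how combining the ReAct framework with decoding strategies (such as DeCoRe, DoLa, and CAD) improves the faithfulness of LLM-generated answers.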

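Motivation: LLMs often produce factually inaccurate outputs (hallucinations), which limits their accuracy on knowledge-intensive NLP tasks; the paper examines retrieval-augmented generation and agentic frameworks such as ReAct, combined with decoding strategies, as a remedy.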

Details

Method: Combines the ReAct framework with training-free decoding strategies (DeCoRe, DoLa, CAD) and systematically analyzes their effect on the faithfulness of LLM-generated answers. Result: On HotpotQA, combining ReAct with DoLa raises F1 from 19.5 to 32.6. Conclusion: Combining an agentic framework for knowledge retrieval with faithfulness-enhancing decoding strategies can significantly improve LLM accuracy on downstream tasks. Abstract: Large Language Models (LLMs) frequently produce factually inaccurate outputs - a phenomenon known as hallucination - which limits their accuracy in knowledge-intensive NLP tasks. Retrieval-augmented generation and agentic frameworks such as Reasoning and Acting (ReAct) can address this issue by giving the model access to external knowledge. However, LLMs often fail to remain faithful to retrieved information. Mitigating this is critical, especially if LLMs are required to reason about the retrieved information. Recent research has explored training-free decoding strategies to improve the faithfulness of model generations. We present a systematic analysis of how the combination of the ReAct framework and decoding strategies (i.e., DeCoRe, DoLa, and CAD) can influence the faithfulness of LLM-generated answers. Our results show that combining an agentic framework for knowledge retrieval with decoding methods that enhance faithfulness can increase accuracy on the downstream Multi-Hop Question Answering tasks. For example, we observe an F1 increase from 19.5 to 32.6 on HotpotQA when using ReAct and DoLa.
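
To make the idea of a training-free, faithfulness-oriented decoding step concrete, the sketch below illustrates one such strategy on toy numbers. It assumes that CAD refers to context-aware decoding, which is commonly described as contrasting next-token logits computed with and without the retrieved context: (1 + alpha) * logit(y | context, x) - alpha * logit(y | x). The vocabulary, logit values, and alpha are invented for illustration and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrast_logits(with_context, without_context, alpha=0.5):
    """Amplify what the retrieved context adds over the model's prior.

    Assumed form: (1 + alpha) * logit(y | context, x) - alpha * logit(y | x).
    """
    return [(1 + alpha) * c - alpha * n
            for c, n in zip(with_context, without_context)]

def pick(logits, vocab):
    """Greedy choice: the token with the largest logit."""
    return vocab[max(range(len(logits)), key=lambda i: logits[i])]

# Toy next-token candidates and hypothetical logits (illustration only).
vocab = ["Paris", "Lyon", "Rome"]
with_context = [2.0, 2.1, 0.0]     # model conditioned on the retrieved passage
without_context = [0.5, 2.5, 0.0]  # model prior, which favors "Lyon"

adjusted = contrast_logits(with_context, without_context, alpha=0.5)
probs = softmax(adjusted)
```

In this toy case, plain greedy decoding on `with_context` still follows the prior, while the contrasted logits favor the token that the context supports. The value of alpha is a tuning choice.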

Uncertainty-Instructed Structure Injection for Generalizable HD Map Construction

Xiaolu Liu,Ruizi Yang,Song Wang,Wentong Li,Junbo Chen,Jianke Zhu

Task: Propose UIGenMap, an uncertainty-instructed structure injection approach for generalizable HD map vectorization.

Motivation: Address the insufficient generalization of existing methods to unfamiliar driving scenes.

Details

Method: Reduces excessive reliance on training data through uncertainty resampling of statistical distributions and explicit instance features; introduces a perspective-view detection branch and an uncertainty-aware decoder; designs hybrid injection and a lightweight Mimic Query Distillation. Result: Achieves a +5.7 mAP improvement on the nuScenes dataset. Conclusion: UIGenMap performs well in both generalization and real-time inference and is suited to HD map construction. Abstract: Reliable high-definition (HD) map construction is crucial for the driving safety of autonomous vehicles. Although recent studies demonstrate improved performance, their generalization capability across unfamiliar driving scenes remains unexplored. To tackle this issue, we propose UIGenMap, an uncertainty-instructed structure injection approach for generalizable HD map vectorization, which concerns the uncertainty resampling in statistical distribution and employs explicit instance features to reduce excessive reliance on training data. Specifically, we introduce the perspective-view (PV) detection branch to obtain explicit structural features, in which the uncertainty-aware decoder is designed to dynamically sample probability distributions considering the difference in scenes. With probabilistic embedding and selection, UI2DPrompt is proposed to construct PV-learnable prompts. These PV prompts are integrated into the map decoder by designed hybrid injection to compensate for neglected instance structures. To ensure real-time inference, a lightweight Mimic Query Distillation is designed to learn from PV prompts, which can serve as an efficient alternative to the flow of PV branches. Extensive experiments on challenging geographically disjoint (geo-based) data splits demonstrate that our UIGenMap achieves superior performance, with +5.7 mAP improvement on the nuScenes dataset. Source code will be available at https://github.com/xiaolul2/UIGenMap.

CoRanking: Collaborative Ranking with Small and Large Ranking Agents

Wenhan Liu,Xinyu Ma,Yutao Zhu,Lixin Su,Shuaiqiang Wang,Dawei Yin,Zhicheng Dou

Task: Propose CoRanking, a collaborative ranking framework that combines small and large ranking models for efficient and effective ranking.

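Motivation: Large Language Models (LLMs) excel at listwise ranking, but their reliance on large parameter counts and a repetitive sliding-window process makes them inefficient.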

Details

Method: CoRanking first uses a small reranker to pre-rank the candidate passages, moving relevant ones to the top of the list, and then applies the LLM listwise reranker only to those top passages, which substantially improves efficiency. A passage order adjuster trained via reinforcement learning additionally optimizes the input order to mitigate the LLM's positional bias. Result: On three IR benchmarks, CoRanking significantly improves efficiency (reducing ranking latency by about 70%) while outperforming the LLM listwise reranker used alone. Conclusion: Through its collaborative framework and order adjuster, CoRanking achieves efficient and effective ranking and addresses the efficiency and bias problems of LLMs. Abstract: Large Language Models (LLMs) have demonstrated superior listwise ranking performance. However, their superior performance often relies on large-scale parameters (e.g., GPT-4) and a repetitive sliding window process, which introduces significant efficiency challenges. In this paper, we propose CoRanking, a novel collaborative ranking framework that combines small and large ranking models for efficient and effective ranking. CoRanking first employs a small-size reranker to pre-rank all the candidate passages, bringing relevant ones to the top part of the list (e.g., top-20). Then, the LLM listwise reranker is applied to only rerank these top-ranked passages instead of the whole list, substantially enhancing overall ranking efficiency. Although this is more efficient, previous studies have revealed that the LLM listwise reranker has significant positional biases on the order of input passages. Directly feeding the top-ranked passages from the small reranker may result in sub-optimal performance of the LLM listwise reranker. To alleviate this problem, we introduce a passage order adjuster trained via reinforcement learning, which reorders the top passages from the small reranker to align with the LLM's preferences of passage order. Extensive experiments on three IR benchmarks demonstrate that CoRanking significantly improves efficiency (reducing ranking latency by about 70%) while achieving even better effectiveness compared to using only the LLM listwise reranker.
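
The three-stage flow described above (small reranker, order adjuster, LLM reranker on the head of the list only) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `overlap_score`, `identity_adjuster`, and `toy_llm_rerank` are hypothetical stand-ins for the small reranker, the RL-trained order adjuster, and the LLM listwise reranker.

```python
def collaborative_rerank(query, passages, small_score, adjust_order,
                         llm_rerank, top_k=20):
    """Pre-rank everything cheaply, then spend the LLM only on the top_k head."""
    pre_ranked = sorted(passages, key=lambda p: small_score(query, p),
                        reverse=True)
    head, tail = pre_ranked[:top_k], pre_ranked[top_k:]
    head = adjust_order(query, head)   # reorder to suit the LLM's order bias
    head = llm_rerank(query, head)     # the expensive step sees top_k items only
    return head + tail

# Hypothetical stand-ins so that the sketch runs end to end.
def overlap_score(query, passage):
    """Small reranker stand-in: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def identity_adjuster(query, head):
    """Order adjuster stand-in: leaves the order unchanged."""
    return list(head)

llm_input_sizes = []

def toy_llm_rerank(query, head):
    """LLM reranker stand-in: records its input size and sorts by length."""
    llm_input_sizes.append(len(head))
    return sorted(head, key=len)

passages = [
    "cats sleep most of the day",
    "ranking passages with large language models",
    "small models can rank passages quickly",
    "weather report for tomorrow",
]
result = collaborative_rerank("rank passages with language models", passages,
                              overlap_score, identity_adjuster,
                              toy_llm_rerank, top_k=2)
```

The efficiency gain in this arrangement comes from the expensive reranker processing `top_k` passages instead of the full candidate list. The passages below the cut keep the order assigned by the small reranker.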

Efficient Explicit Joint-level Interaction Modeling with Mamba for Text-guided HOI Generation

Guohong Huang,Ling-An Zeng,Zexin Zheng,Shengbo Gu,Wei-Shi Zheng

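Task: Propose a novel method for generating text-guided human-object interactions (HOIs) that achieves explicit joint-level interaction modeling in a computationally efficient manner.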

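Motivation: Existing methods represent the entire human body as a single token, which makes fine-grained joint-level interactions hard to capture and leads to unrealistic HOIs, while tokenizing each joint individually significantly increases computational overhead.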

Details

Method: Proposes the Efficient Explicit Joint-level Interaction Model (EJIM), comprising a Dual-branch HOI Mamba (which separately and efficiently models spatiotemporal HOI information) and a Dual-branch Condition Injector (which integrates text semantics and object geometry into human and object motions), together with a Dynamic Interaction Block and a progressive masking mechanism that filter out irrelevant joints. Result: Extensive quantitative and qualitative evaluations on public datasets show that EJIM surpasses prior work by a large margin while using only 5% of the inference time. Conclusion: By efficiently modeling joint-level interactions, EJIM markedly improves the accuracy and efficiency of HOI generation. Abstract: We propose a novel approach for generating text-guided human-object interactions (HOIs) that achieves explicit joint-level interaction modeling in a computationally efficient manner. Previous methods represent the entire human body as a single token, making it difficult to capture fine-grained joint-level interactions and resulting in unrealistic HOIs. However, treating each individual joint as a token would yield over twenty times more tokens, increasing computational overhead. To address these challenges, we introduce an Efficient Explicit Joint-level Interaction Model (EJIM). EJIM features a Dual-branch HOI Mamba that separately and efficiently models spatiotemporal HOI information, as well as a Dual-branch Condition Injector for integrating text semantics and object geometry into human and object motions. Furthermore, we design a Dynamic Interaction Block and a progressive masking mechanism to iteratively filter out irrelevant joints, ensuring accurate and nuanced interaction modeling. Extensive quantitative and qualitative evaluations on public datasets demonstrate that EJIM surpasses previous works by a large margin while using only 5% of the inference time. Code is available at https://github.com/Huanggh531/EJIM.

Speculative End-Turn Detector for Efficient Speech Chatbot Assistant

Hyunjong Ok,Suho Yoo,Jaeho Lee

Task: Propose SpeculativeETD, a collaborative inference framework that improves the accuracy and efficiency of real-time end-turn detection (ETD).

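Motivation: Current LLM-based dialogue systems perform poorly at end-turn detection (ETD), leading to premature or delayed responses that disrupt conversational flow.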

Details

Method: Proposes a collaborative inference framework that combines a lightweight GRU model (real-time detection on the local device) with a high-performance Wav2vec model (server-side classification). Result: Experiments show that SpeculativeETD significantly improves ETD accuracy while keeping computational requirements low. Conclusion: The proposed framework effectively addresses ETD; datasets and code will be released after review. Abstract: Spoken dialogue systems powered by large language models have demonstrated remarkable abilities in understanding human speech and generating appropriate spoken responses. However, these systems struggle with end-turn detection (ETD) -- the ability to distinguish between user turn completion and hesitation. This limitation often leads to premature or delayed responses, disrupting the flow of spoken conversations. In this paper, we introduce the ETD Dataset, the first public dataset for end-turn detection. The ETD dataset consists of both synthetic speech data generated with text-to-speech models and real-world speech data collected from web sources. We also propose SpeculativeETD, a novel collaborative inference framework that balances efficiency and accuracy to improve real-time ETD in resource-constrained environments. Our approach jointly employs a lightweight GRU-based model, which rapidly detects the non-speaking units in real-time on local devices, and a high-performance Wav2vec-based model running on the server to make a more challenging classification of distinguishing turn ends from mere pauses. Experiments demonstrate that the proposed SpeculativeETD significantly improves ETD accuracy while keeping the required computations low. Datasets and code will be available after the review.
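
The division of labor described above, with a cheap local detector that escalates to a heavier server-side classifier only when it sees a run of non-speaking frames, might be sketched as follows. The energy threshold, the frame representation, and both stub classifiers are assumptions made for illustration; the paper's actual components are a GRU-based model and a Wav2vec-based model.

```python
def detect_turn_end(frames, local_is_silent, server_is_turn_end,
                    min_silent_frames=3):
    """Return (index of the frame where a turn end is confirmed, server calls).

    The local check runs on every frame; the server is consulted once per
    silent run, at the moment the run reaches min_silent_frames.
    """
    silent_run = 0
    server_calls = 0
    for i, frame in enumerate(frames):
        if local_is_silent(frame):
            silent_run += 1
            if silent_run == min_silent_frames:
                server_calls += 1
                if server_is_turn_end(frames[: i + 1]):
                    return i, server_calls
        else:
            silent_run = 0
    return None, server_calls

# Hypothetical stubs: frames are energy values between 0 and 1.
def energy_is_silent(frame):
    """Local detector stand-in: low energy counts as non-speaking."""
    return frame < 0.1

def stub_server_is_turn_end(frames_so_far):
    """Server classifier stand-in: a turn ends after at least 4 voiced frames."""
    voiced = sum(1 for f in frames_so_far if not energy_is_silent(f))
    return voiced >= 4

frames = [0.8, 0.7, 0.0, 0.0, 0.0, 0.9, 0.6, 0.0, 0.0, 0.0]
outcome = detect_turn_end(frames, energy_is_silent, stub_server_is_turn_end)
```

In this toy stream the first pause is judged a hesitation and the second a turn end, so the server is consulted twice across ten frames while the local check handles the rest.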

Evaluating Compositional Scene Understanding in Multimodal Generative Models

Shuhao Fu,Andrew Jun Lee,Anna Wang,Ida Momennejad,Trevor Bihl,Hongjing Lu,Taylor W. Webb

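Task: Evaluate the compositional visual processing capabilities of current text-to-image (DALL-E 3) and multimodal vision-language models (GPT-4V, GPT-4o, and others).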

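Motivation: Visual scenes are defined by the composition of objects and their relations, so computer vision systems need compositionality to achieve robust and generalizable scene understanding.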

Details

Method: Evaluates several models on compositional and relational tasks and compares them with human participants. Result: The models show some ability on compositional and relational tasks, but their performance remains well below that of humans, particularly for complex scenes involving many objects and relations. Conclusion: Further progress is needed toward compositional understanding of visual scenes. Abstract: The visual world is fundamentally compositional. Visual scenes are defined by the composition of objects and their relations. Hence, it is essential for computer vision systems to reflect and exploit this compositionality to achieve robust and generalizable scene understanding. While major strides have been made toward the development of general-purpose, multimodal generative models, including both text-to-image models and multimodal vision-language models, it remains unclear whether these systems are capable of accurately generating and interpreting scenes involving the composition of multiple objects and relations. In this work, we present an evaluation of the compositional visual processing capabilities in the current generation of text-to-image (DALL-E 3) and multimodal vision-language models (GPT-4V, GPT-4o, Claude Sonnet 3.5, QWEN2-VL-72B, and InternVL2.5-38B), and compare the performance of these systems to human participants. The results suggest that these systems display some ability to solve compositional and relational tasks, showing notable improvements over the previous generation of multimodal models, but with performance nevertheless well below the level of human participants, particularly for more complex scenes involving many (>5) objects and multiple relations. These results highlight the need for further progress toward compositional understanding of visual scenes.

Order Independence With Finetuning

Katrina Brown,Reid McIlroy

Task: Study how a fine-tuning strategy can integrate Set-Based Prompting (SBP) into large language models (LLMs) to reduce their dependence on input order.

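Motivation: LLMs perform well on NLP tasks but are sensitive to input order, which leads to inconsistent predictions. The existing SBP method mitigates this but introduces an out-of-distribution input format that degrades model performance.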

Details

Method: Proposes a fine-tuning strategy that integrates SBP into the training process so that the model adapts to set-formatted inputs. Result: Experiments show that SBP fine-tuning significantly improves accuracy on multiple-choice tasks and robustness to answer-order permutations while preserving language modeling capabilities. Conclusion: SBP fine-tuning is an effective way to reduce order dependence in LLMs and points toward fairer, more consistent models. Abstract: Large language models (LLMs) demonstrate remarkable performance on many NLP tasks, yet often exhibit order dependence: simply reordering semantically identical tokens (e.g., answer choices in multiple-choice questions) can lead to inconsistent predictions. Recent work proposes Set-Based Prompting (SBP) as a way to remove order information from designated token subsets, thereby mitigating positional biases. However, applying SBP on base models induces an out-of-distribution input format, which can degrade in-distribution performance. We introduce a fine-tuning strategy that integrates SBP into the training process, "pulling" these set-formatted prompts closer to the model's training manifold. We show that SBP can be incorporated into a model via fine-tuning. Our experiments on in-distribution (MMLU) and out-of-distribution (CSQA, ARC Challenge) multiple-choice tasks show that SBP fine-tuning significantly improves accuracy and robustness to answer-order permutations, all while preserving broader language modeling capabilities. We discuss the broader implications of order-invariant modeling and outline future directions for building fairer, more consistent LLMs.
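
One simple way to quantify the order dependence discussed above is to present the same answer choices in every possible order and measure how often the model selects the same choice. The sketch below is an illustrative metric, not the paper's evaluation protocol; `predict` is a hypothetical callable that returns the index of the chosen option, and both toy predictors are invented for the example.

```python
from collections import Counter
from itertools import permutations

def answer_consistency(question, choices, predict):
    """Share of orderings in which the most frequently chosen answer is picked.

    1.0 means the prediction never changes when the choices are reordered.
    """
    picks = Counter()
    for ordering in permutations(choices):
        index = predict(question, list(ordering))
        picks[ordering[index]] += 1
    return picks.most_common(1)[0][1] / sum(picks.values())

# Hypothetical predictors standing in for a language model.
def position_biased(question, ordered_choices):
    """Always picks whatever is listed first, regardless of content."""
    return 0

def order_invariant(question, ordered_choices):
    """Picks by content only (here: the longest choice)."""
    return max(range(len(ordered_choices)),
               key=lambda i: len(ordered_choices[i]))

choices = ["Mars", "Jupiter", "Venus is closest"]
biased = answer_consistency("Which planet?", choices, position_biased)
invariant = answer_consistency("Which planet?", choices, order_invariant)
```

Enumerating all permutations grows factorially with the number of choices, so for more than a handful of options a random sample of orderings is the practical alternative.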

Can DeepSeek-V3 Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery

Boyi Ma,Yanguang Zhao,Jie Wang,Guankun Wang,Kun Yuan,Tong Chen,Long Bai,Hongliang Ren

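Task: Investigate the dialogue capabilities of DeepSeek-V3 in robotic surgery scenarios, covering Single Phrase QA, Visual QA, and Detailed Description tasks.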

Motivation: Evaluate how DeepSeek-V3 performs in specific surgical scenarios to determine whether it is suitable for surgery-related vision-language tasks.

Details

Method: Conducts extensive evaluations using the public datasets EndoVis18 and CholecT50 together with their dialogue data. Result: DeepSeek-V3 performs well on surgical instrument and tissue recognition but shows significant limitations in spatial position analysis and surgical action understanding, and cannot effectively analyze global surgical concepts. Conclusion: Without fine-tuning on surgery-specific datasets, DeepSeek-V3 is not suited to vision-language tasks in surgical contexts. Abstract: DeepSeek-V3, a recently emerging Large Language Model (LLM), demonstrates outstanding performance in general scene understanding, question-answering (QA), and text generation tasks, owing to its efficient training paradigm and strong reasoning capabilities. In this study, we investigate the dialogue capabilities of DeepSeek-V3 in robotic surgery scenarios, focusing on tasks such as Single Phrase QA, Visual QA, and Detailed Description. The Single Phrase QA tasks further include sub-tasks such as surgical instrument recognition, action understanding, and spatial position analysis. We conduct extensive evaluations using publicly available datasets, including EndoVis18 and CholecT50, along with their corresponding dialogue data. Our comprehensive evaluation results indicate that, when provided with specific prompts, DeepSeek-V3 performs well in surgical instrument and tissue recognition tasks. However, DeepSeek-V3 exhibits significant limitations in spatial position analysis and struggles to understand surgical actions accurately. Additionally, our findings reveal that, under general prompts, DeepSeek-V3 lacks the ability to effectively analyze global surgical concepts and fails to provide detailed insights into surgical scenarios. Based on our observations, we argue that DeepSeek-V3 is not ready for vision-language tasks in surgical contexts without fine-tuning on surgery-specific datasets.

Evolutionary Prompt Optimization Discovers Emergent Multimodal Reasoning Strategies in Vision-Language Models

Sid Bharthulwar,John Rho,Katrina Brown

Task: Propose a framework for optimizing prompts in vision-language models to elicit multimodal reasoning without retraining the model.

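Motivation: Use an evolutionary algorithm to guide prompt updates downstream of visual tasks, improving on baseline prompt-updating algorithms that lack "survival of the fittest" iteration.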

Details

Method: Iteratively optimizes prompts with an evolutionary algorithm and explicitly invokes a Python interpreter tool via system-level XML tags to improve performance. Result: Experiments show approximately 50% relative improvement on select visual tasks, along with strong zero-shot generalization. Conclusion: Evolutionary prompt optimization guides language models to discover reasoning techniques on their own, which significantly improves performance on multimodal tasks. Abstract: We present a framework for optimizing prompts in vision-language models to elicit multimodal reasoning without model retraining. Using an evolutionary algorithm to guide prompt updates downstream of visual tasks, our approach improves upon baseline prompt-updating algorithms, which lack evolution-style "survival of the fittest" iteration. Crucially, we find this approach enables the language model to independently discover progressive problem-solving techniques across several evolution generations. For example, the model reasons that to "break down" visually complex spatial tasks, making a tool call to a Python interpreter to perform tasks (such as cropping, image segmentation, or saturation changes) would improve performance significantly. Our experimentation shows that explicitly evoking this tool call via system-level XML tags can effectively flag Python interpreter access for the same language model to generate relevant programs, generating advanced multimodal functionality. This functionality can be crystallized into a system-level prompt that induces improved performance at inference time, and our experimentation suggests up to approximately 50% relative improvement across select visual tasks. Downstream performance is trained and evaluated across subtasks from MathVista, M3CoT, and GeoBench-VLM datasets. Importantly, our approach shows that evolutionary prompt optimization guides language models towards self-reasoning discoveries, which result in improved zero-shot generalization across tasks.
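
The "survival of the fittest" loop described above can be illustrated with a generic evolutionary search over prompts. This is a sketch under stated assumptions: in the paper the mutation step is performed by a language model and fitness comes from downstream task performance, whereas here `toy_mutate`, `toy_fitness`, and the `HINTS` list are hypothetical stand-ins so that the loop runs on its own.

```python
import random

def evolve_prompts(seed_prompts, mutate, fitness, generations=5,
                   population=6, survivors=2, rng=None):
    """Keep the best prompts each generation and refill the pool by mutation."""
    rng = rng or random.Random(0)
    pool = list(seed_prompts)
    for _ in range(generations):
        ranked = sorted(pool, key=fitness, reverse=True)
        parents = ranked[:survivors]              # selection
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(population - survivors)]
        pool = parents + children                 # parents survive (elitism)
    return max(pool, key=fitness)

# Hypothetical stand-ins for LLM-driven mutation and task-based scoring.
HINTS = [
    "think step by step",
    "crop the region of interest",
    "call the Python tool",
]

def toy_mutate(prompt, rng):
    """Append one randomly chosen hint to the prompt."""
    return prompt + " " + rng.choice(HINTS)

def toy_fitness(prompt):
    """Count how many distinct hints the prompt contains."""
    return sum(hint in prompt for hint in HINTS)

seeds = ["Describe the image and answer the question."]
best = evolve_prompts(seeds, toy_mutate, toy_fitness)
```

Because the parents are carried over unchanged into each new pool, the best fitness in the pool never decreases from one generation to the next.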

RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning

Alexander Vogel,Omar Moured,Yufan Chen,Jiaming Zhang,Rainer Stiefelhagen

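Task: Propose the RefChartQA benchmark, which combines chart question answering with visual grounding and supports referring to elements at multiple granularities.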

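Motivation: Existing chart understanding methods do not explicitly identify the visual elements that support their predictions, which limits model interpretability and reliability.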

Details

Method: Instruction-tunes 5 state-of-the-art vision-language models and incorporates spatially aware visual grounding. Result: Experiments show that adding visual grounding improves response accuracy by over 15%, reduces hallucinations, and improves reliability. Conclusion: RefChartQA provides a new benchmark for chart understanding and demonstrates the importance of visual grounding for model performance. Abstract: Recently, Vision Language Models (VLMs) have increasingly emphasized document visual grounding to achieve better human-computer interaction, accessibility, and detailed understanding. However, its application to visualizations such as charts remains under-explored due to the inherent complexity of interleaved visual-numerical relationships in chart images. Existing chart understanding methods primarily focus on answering questions without explicitly identifying the visual elements that support their predictions. To bridge this gap, we introduce RefChartQA, a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding, enabling models to refer to elements at multiple granularities within chart images. Furthermore, we conduct a comprehensive evaluation by instruction-tuning 5 state-of-the-art VLMs across different categories. Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15%, reducing hallucinations, and improving model reliability. Additionally, we identify key factors influencing text-spatial alignment, such as architectural improvements in TinyChart, which leverages a token-merging module for enhanced feature fusion. Our dataset is open-sourced for community development and further advancements. All models and code will be publicly available at https://github.com/moured/RefChartQA.

SCORE: Story Coherence and Retrieval Enhancement for AI Narratives

Qiang Yi,Yangfan He,Jianhui Wang,Xinyuan Song,Shiyao Qian,Miao Zhang,Li Sun,Tianyu Shi

Task: Propose the SCORE framework to improve the long-term coherence and emotional consistency of large language models in complex stories.

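Motivation: Large language models excel at generating creative narratives but fall short on long-term coherence and emotional consistency in complex stories.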

Details

Method: SCORE integrates three components (dynamic state tracking, context-aware summarization, and hybrid retrieval) and uses a temporally aligned Retrieval-Augmented Generation (RAG) pipeline to validate contextual consistency. Result: SCORE achieves 23.6% higher coherence on the NCI-2.0 benchmark, 89.7% emotional consistency on the EASM metric, and 41.8% fewer hallucinations. Conclusion: SCORE's modular design supports incremental knowledge graph construction and multi-LLM backend compatibility, offering an explainable solution for industrial-scale narrative systems. Abstract: Large Language Models (LLMs) excel at generating creative narratives but struggle with long-term coherence and emotional consistency in complex stories. To address this, we propose SCORE (Story Coherence and Retrieval Enhancement), a framework integrating three components: 1) Dynamic State Tracking (monitoring objects/characters via symbolic logic), 2) Context-Aware Summarization (hierarchical episode summaries for temporal progression), and 3) Hybrid Retrieval (combining TF-IDF keyword relevance with cosine similarity-based semantic embeddings). The system employs a temporally-aligned Retrieval-Augmented Generation (RAG) pipeline to validate contextual consistency. Evaluations show SCORE achieves 23.6% higher coherence (NCI-2.0 benchmark), 89.7% emotional consistency (EASM metric), and 41.8% fewer hallucinations versus baseline GPT models. Its modular design supports incremental knowledge graph construction for persistent story memory and multi-LLM backend compatibility, offering an explainable solution for industrial-scale narrative systems requiring long-term consistency.
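
The Hybrid Retrieval component combines a TF-IDF keyword signal with cosine similarity over embeddings. A minimal sketch of such a combination is shown below. The equal weighting, the normalization of the keyword score, the smoothed IDF formula, and the hand-written three-dimensional "embeddings" are all assumptions made for illustration, since the abstract does not specify how the two signals are merged.

```python
import math
from collections import Counter

def tfidf_scores(query, docs):
    """Sum of TF-IDF weights of the query terms in each document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in term_freq:
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
                score += (term_freq[term] / len(tokens)) * idf
        scores.append(score)
    return scores

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def hybrid_rank(query, docs, query_vec, doc_vecs, weight=0.5):
    """Rank document indices by a weighted mix of keyword and semantic scores."""
    keyword = tfidf_scores(query, docs)
    top = max(keyword) or 1.0   # avoid dividing by zero when nothing matches
    combined = [weight * (k / top) + (1 - weight) * cosine(query_vec, vec)
                for k, vec in zip(keyword, doc_vecs)]
    return sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)

# Toy story memory with hypothetical embedding vectors.
docs = [
    "the knight lost his sword",
    "the queen hid the crown",
    "rain fell on the village",
]
doc_vecs = [[0.9, 0.1, 0.0], [0.2, 0.9, 0.0], [0.0, 0.1, 0.9]]
ranking = hybrid_rank("where is the sword", docs, [1.0, 0.0, 0.0], doc_vecs)
```

In a real system the vectors would come from an embedding model, and `weight` would be tuned on held-out queries.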

LSNet: See Large, Focus Small

Ao Wang,Hui Chen,Zijia Lin,Jungong Han,Guiguang Ding

Task: Propose a lightweight vision network design approach based on a "See Large, Focus Small" strategy.

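Motivation: Existing lightweight models rely on self-attention and convolutions, which limits the efficiency and effectiveness of perception and aggregation and makes it hard to balance performance and efficiency under limited computational budgets.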

Details

Method: Introduces LS convolution (combining large-kernel perception with small-kernel aggregation) and, building on it, the LSNet family of lightweight models. Result: Experiments show that LSNet outperforms existing lightweight networks across a range of vision tasks. Conclusion: LS convolution and LSNet offer an efficient, high-performing solution for lightweight vision network design. Abstract: Vision network designs, including Convolutional Neural Networks and Vision Transformers, have significantly advanced the field of computer vision. Yet, their complex computations pose challenges for practical deployments, particularly in real-time applications. To tackle this issue, researchers have explored various lightweight and efficient network designs. However, existing lightweight models predominantly leverage self-attention mechanisms and convolutions for token mixing. This dependence brings limitations in effectiveness and efficiency in the perception and aggregation processes of lightweight networks, hindering the balance between performance and efficiency under limited computational budgets. In this paper, we draw inspiration from the dynamic heteroscale vision ability inherent in the efficient human vision system and propose a "See Large, Focus Small" strategy for lightweight vision network design. We introduce LS (Large-Small) convolution, which combines large-kernel perception and small-kernel aggregation. It can efficiently capture a wide range of perceptual information and achieve precise feature aggregation for dynamic and complex visual representations, thus enabling proficient processing of visual information. Based on LS convolution, we present LSNet, a new family of lightweight models. Extensive experiments demonstrate that LSNet achieves superior performance and efficiency over existing lightweight networks in various vision tasks. Codes and models are available at https://github.com/jameslahm/lsnet.

RARE: Retrieval-Augmented Reasoning Modeling

Zhengren Wang,Jiayang Yu,Dongsheng Ma,Zhe Chen,Yu Wang,Zhiyu Li,Feiyu Xiong,Yanfeng Wang,Weinan E,Linpeng Tang,Wentao Zhang

Task: Propose RARE, a new paradigm that decouples knowledge storage from reasoning optimization to improve domain-specific intelligence.

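Motivation: Large language models suffer from knowledge hallucination and inadequate reasoning in domain-specific settings, especially under constrained parameter budgets.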

Details

Method: RARE externalizes domain knowledge to retrievable sources and internalizes domain-specific reasoning patterns during training; by injecting retrieved knowledge into training prompts, it shifts the learning objective from rote memorization to contextualized reasoning. Result: Lightweight RARE-trained models (e.g., Llama-3.1-8B) can surpass retrieval-augmented GPT-4 and Deepseek-R1 distilled models, achieving state-of-the-art performance. Conclusion: By pairing maintainable external knowledge bases with compact, reasoning-optimized models, RARE advances more scalable domain-specific intelligence. Abstract: Domain-specific intelligence demands specialized knowledge and sophisticated reasoning for problem-solving, posing significant challenges for large language models (LLMs) that struggle with knowledge hallucination and inadequate reasoning capabilities under constrained parameter budgets. Inspired by Bloom's Taxonomy in educational theory, we propose Retrieval-Augmented Reasoning Modeling (RARE), a novel paradigm that decouples knowledge storage from reasoning optimization. RARE externalizes domain knowledge to retrievable sources and internalizes domain-specific reasoning patterns during training. Specifically, by injecting retrieved knowledge into training prompts, RARE transforms learning objectives from rote memorization to contextualized reasoning application. It enables models to bypass parameter-intensive memorization and prioritize the development of higher-order cognitive processes. Our experiments demonstrate that lightweight RARE-trained models (e.g., Llama-3.1-8B) could achieve state-of-the-art performance, surpassing retrieval-augmented GPT-4 and Deepseek-R1 distilled counterparts. RARE establishes a paradigm shift where maintainable external knowledge bases synergize with compact, reasoning-optimized models, collectively driving more scalable domain-specific intelligence. Repo: https://github.com/Open-DataFlow/RARE
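
The central idea, placing retrieved knowledge in the training prompt so that the target the model learns to produce is the reasoning rather than the facts, can be illustrated with a small data-preparation helper. The prompt template, field names, and example content below are hypothetical and are not taken from the RARE repository.

```python
def build_training_example(question, retrieved_passages, reasoning, answer):
    """Assemble one supervised example with knowledge in the prompt.

    The retrieved passages go into the prompt (input side); only the reasoning
    and the final answer form the target the model is trained to generate.
    """
    knowledge = "\n".join(
        f"[{i}] {passage}" for i, passage in enumerate(retrieved_passages, 1)
    )
    prompt = (
        f"Knowledge:\n{knowledge}\n\n"
        f"Question: {question}\n"
        "Reason step by step using the knowledge above."
    )
    target = f"{reasoning}\nAnswer: {answer}"
    return {"prompt": prompt, "target": target}

# Invented example content, for illustration only.
example = build_training_example(
    question="Which of the two listed drugs needs a dose adjustment?",
    retrieved_passages=[
        "Drug A is cleared mainly by the kidneys.",
        "Drug B is cleared mainly by the liver.",
    ],
    reasoning="The patient has reduced kidney function, and passage [1] "
              "says Drug A depends on the kidneys.",
    answer="Drug A",
)
```

With examples of this shape, a training loss computed on `target` alone rewards applying the supplied knowledge instead of recalling it from the model's parameters, which is the shift the abstract describes.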

When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?

Tuo Liang,Zhe Hu,Jing Li,Hao Zhang,Yiren Lu,Yunlai Zhou,Yiran Qiao,Disheng Liu,Jeirui Peng,Jing Ma,Yu Yin

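Task: Systematically evaluate, using the YesBut (V2) benchmark, how well large vision-language models (VLMs) understand complex narratives and contradiction-based reasoning in humorous comics.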

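Motivation: Understanding humor, especially humor built on complex contradictory narratives, is a major challenge for VLMs and limits AI's capacity for human-like reasoning and cultural expression.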

Details

Method: Introduces the YesBut (V2) benchmark, comprising 1,262 comic images from multilingual and multicultural contexts, and evaluates VLMs on four complementary tasks. Result: Experiments show that even the most advanced models lag significantly behind humans in visual perception, key element identification, and contradiction reasoning. Conclusion: The study reveals critical weaknesses in VLMs' understanding of cultural and creative expression and proposes potential paths to improve model performance. Abstract: Understanding humor, particularly when it involves complex, contradictory narratives that require comparative reasoning, remains a significant challenge for large vision-language models (VLMs). This limitation hinders AI's ability to engage in human-like reasoning and cultural expression. In this paper, we investigate this challenge through an in-depth analysis of comics that juxtapose panels to create humor through contradictions. We introduce YesBut (V2), a novel benchmark with 1,262 comic images from diverse multilingual and multicultural contexts, featuring comprehensive annotations that capture various aspects of narrative understanding. Using this benchmark, we systematically evaluate a wide range of VLMs through four complementary tasks spanning from surface content comprehension to deep narrative reasoning, with particular emphasis on comparative reasoning between contradictory elements. Our extensive experiments reveal that even the most advanced models significantly underperform compared to humans, with common failures in visual perception, key element identification, comparative analysis and hallucinations. We further investigate text-based training strategies and social knowledge augmentation methods to enhance model performance. Our findings not only highlight critical weaknesses in VLMs' understanding of cultural and creative expressions but also provide pathways toward developing context-aware models capable of deeper narrative understanding through comparative reasoning.

If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs

Siqi Fan,Xiusheng Huang,Yiqun Yao,Xuezhi Fang,Kang Liu,Peng Han,Shuo Shang,Aixin Sun,Yequan Wang

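Task: Evaluate lifelong learning in large language models (LLMs) and propose a new benchmark, LIFESTATE-BENCH.

Motivation: Existing benchmarks fail to capture the consistent, character-like behavior that LLMs exhibit in multi-turn, multi-agent interactions, so a new evaluation method is needed.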

Details

Method: Proposes the LIFESTATE-BENCH benchmark, which contains two datasets (Hamlet and a synthetic script collection) and evaluates models' self-awareness, episodic memory retrieval, and relationship tracking. Result: Non-parametric methods significantly outperform parametric ones at stateful learning, but all models suffer catastrophic forgetting over long interactions. Conclusion: Lifelong learning in LLMs needs further research to address problems such as catastrophic forgetting. Abstract: Large language models (LLMs) can carry out human-like dialogue, but unlike humans, they are stateless due to the superposition property. However, during multi-turn, multi-agent interactions, LLMs begin to exhibit consistent, character-like behaviors, hinting at a form of emergent lifelong learning. Despite this, existing benchmarks often fail to capture these dynamics, primarily focusing on static, open-ended evaluations. To address this gap, we introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in LLMs. It features two episodic datasets: Hamlet and a synthetic script collection, rich in narrative structure and character interactions. Our fact-checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking, across both parametric and non-parametric approaches. In experiments on models such as Llama3.1-8B, GPT-4-turbo, and DeepSeek R1, we demonstrate that non-parametric methods significantly outperform parametric ones in managing stateful learning. However, all models exhibit challenges with catastrophic forgetting as interactions extend, highlighting the need for further advancements in lifelong learning.

NeuralGS: Bridging Neural Fields and 3D Gaussian Splatting for Compact 3D Representations

Zhenyu Tang,Chaoran Feng,Xinhua Cheng,Wangbo Yu,Junwu Zhang,Yuan Liu,Xiaoxiao Long,Wenping Wang,Li Yuan

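Task: Develop NeuralGS, a simple yet effective method for compressing the original 3D Gaussian Splatting (3DGS) model without voxel structures or complex quantization strategies.

Motivation: 3DGS offers excellent quality and rendering speed but incurs high storage and transmission costs; existing compression methods depend on voxel structures and complex strategies, so they are not simple.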

Details

Method: Adopts a neural field representation (as in NeRF) with MLPs, using a clustering strategy and importance scores to fit Gaussian attributes with tiny MLPs. Result: Achieves a 45-times average model size reduction across multiple datasets without harming visual quality. Conclusion: NeuralGS shows the great potential of directly compressing the original 3DGS, with performance comparable to dedicated compression methods. Abstract: 3D Gaussian Splatting (3DGS) demonstrates superior quality and rendering speed, but requires millions of 3D Gaussians and incurs significant storage and transmission costs. Recent 3DGS compression methods mainly concentrate on compressing Scaffold-GS, achieving impressive performance but with an additional voxel structure and a complex encoding and quantization strategy. In this paper, we aim to develop a simple yet effective method called NeuralGS that explores another way to compress the original 3DGS into a compact representation without the voxel structure and complex quantization strategies. Our observation is that neural fields like NeRF can represent complex 3D scenes with Multi-Layer Perceptron (MLP) neural networks using only a few megabytes. Thus, NeuralGS effectively adopts the neural field representation to encode the attributes of 3D Gaussians with MLPs, only requiring a small storage size even for a large-scale scene. To achieve this, we adopt a clustering strategy and fit the Gaussians with different tiny MLPs for each cluster, based on importance scores of Gaussians as fitting weights. We experiment on multiple datasets, achieving a 45-times average model size reduction without harming the visual quality. The compression performance of our method on original 3DGS is comparable to the dedicated Scaffold-GS-based compression methods, which demonstrates the huge potential of directly compressing original 3DGS with neural fields.

Question-Aware Knowledge Graph Prompting for Enhancing Large Language Models

Haochen Liu,Song Wang,Chen Chen,Jundong Li

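Task: Propose Question-Aware Knowledge Graph Prompting (QAP) to address the limitations of large language models on knowledge-intensive multiple-choice question answering.

Motivation: Large language models perform poorly on tasks that require external knowledge, and existing methods either require costly fine-tuning or retrieve noisy knowledge graph information.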

Details

Method: QAP incorporates question embeddings into GNN aggregation to dynamically assess knowledge graph relevance and uses global attention to capture inter-option relationships. Result: Experiments show that QAP outperforms existing methods on multiple datasets. Conclusion: By dynamically assessing knowledge graph relevance and enriching soft prompts, QAP significantly improves performance on knowledge-intensive multiple-choice question answering. Abstract: Large Language Models (LLMs) often struggle with tasks requiring external knowledge, such as knowledge-intensive Multiple Choice Question Answering (MCQA). Integrating Knowledge Graphs (KGs) can enhance reasoning; however, existing methods typically demand costly fine-tuning or retrieve noisy KG information. Recent approaches leverage Graph Neural Networks (GNNs) to generate KG-based input embedding prefixes as soft prompts for LLMs but fail to account for question relevance, resulting in noisy prompts. Moreover, in MCQA tasks, the absence of relevant KG knowledge for certain answer options remains a significant challenge. To address these issues, we propose Question-Aware Knowledge Graph Prompting (QAP), which incorporates question embeddings into GNN aggregation to dynamically assess KG relevance. QAP employs global attention to capture inter-option relationships, enriching soft prompts with inferred knowledge. Experimental results demonstrate that QAP outperforms state-of-the-art methods across multiple datasets, highlighting its effectiveness.

Intelligent Bear Prevention System Based on Computer Vision: An Approach to Reduce Human-Bear Conflicts in the Tibetan Plateau Area, China

Pengyu Chen,Teng Fei,Yunyan Du,Jiawei Yi,Yi Li,John A. Kupfer

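Task: Propose a strategy that combines computer vision and Internet of Things technologies to mitigate human-bear conflicts on the Tibetan Plateau.

Motivation: Human-bear conflicts threaten local communities and hinder wildlife conservation efforts.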

Details

Method: Uses the K210 development board with the YOLO object detection framework, together with a tailored bear-deterrent mechanism, to achieve low energy use and real-time efficiency. Result: Experimental evaluation shows the model reaches a mean Average Precision (mAP) of 91.4%, indicating high precision and reliability. Conclusion: The system offers a viable, eco-friendly, and scalable solution for remote areas, improving human safety and promoting bear conservation. Abstract: Conflicts between humans and bears on the Tibetan Plateau present substantial threats to local communities and hinder wildlife preservation initiatives. This research introduces a novel strategy that incorporates computer vision alongside Internet of Things (IoT) technologies to alleviate these issues. Tailored specifically for the harsh environment of the Tibetan Plateau, the approach utilizes the K210 development board paired with the YOLO object detection framework along with a tailored bear-deterrent mechanism, offering minimal energy usage and real-time efficiency in bear identification and deterrence. The model's performance was evaluated experimentally, achieving a mean Average Precision (mAP) of 91.4%, demonstrating excellent precision and dependability. By integrating energy-efficient components, the proposed system effectively overcomes the challenges of remote and off-grid environments, ensuring uninterrupted operation in secluded locations. This study provides a viable, eco-friendly, and expandable solution to mitigate human-bear conflicts, thereby improving human safety and promoting bear conservation in isolated areas like Yushu, China.

Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages

Xabier de Zuazo,Eva Navas,Ibon Saratxaga,Inma Hernáez Rioja

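Task: Improve automatic speech recognition for minority languages by combining traditional and novel language models with fine-tuned Whisper models.

Motivation: Although multilingual, multitask models such as Whisper perform well across a wide range of languages, they still fall short on minority languages.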

Details

Method: Fine-tunes Whisper models, combines them with language models, and evaluates rigorously on multiple datasets. Result: Word error rate drops substantially in low-resource scenarios, with improvements of up to 51% (in-distribution datasets) and 34% (out-of-distribution sentences). Conclusion: The study paves the way for more inclusive ASR technology and stresses the importance of optimizing language model parameters and choosing appropriate evaluation parameters. Abstract: Automatic speech recognition systems have undoubtedly advanced with the integration of multilingual and multitask models such as Whisper, which have shown a promising ability to understand and process speech across a wide range of languages. Despite their robustness, these models often fall short in handling the linguistic distinctions of minority languages. This study addresses this gap by integrating traditional and novel language models with fine-tuned Whisper models to raise their performance in less commonly studied languages. Through rigorous fine-tuning and evaluation across multiple datasets, we demonstrate substantial improvements in word error rate, particularly in low-resource scenarios. Our approach not only takes advantage of the extensive data Whisper was pre-trained on, but also complements its linguistic adaptability by incorporating language models. We obtained improvements up to 51% for in-distribution datasets and up to 34% for out-of-distribution sentences using statistical language models, while large language models provided moderate but consistently robust improvement across diverse linguistic contexts. The findings reveal that, while the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters. Finally, we emphasize the importance of selecting appropriate evaluation parameters when reporting the results using transformer-based ASR models. In summary, this research clears the way for more inclusive ASR technologies that perform better across languages by enriching their linguistic knowledge. For further implementation details of this study, the technical documentation and source code are available at http://www.github.com/hitz-zentroa/whisper-lm.
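
The word error rate mentioned in this abstract can be sketched as below. This is a minimal illustration, not the authors' evaluation code: it assumes whitespace tokenization with no text normalization, and it assumes the reported improvements are relative WER reductions, which the abstract does not state. The function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which WER dropped relative to the baseline (assumed metric)."""
    return (baseline_wer - new_wer) / baseline_wer
```

Because casing and punctuation change the word-level distance, the normalization applied before scoring is one of the evaluation choices that can shift reported numbers.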

Enhancing Weakly Supervised Video Grounding via Diverse Inference Strategies for Boundary and Prediction Selection

Sunoh Kim,Daeho Um

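Task: Weakly supervised video grounding aims to localize the temporal boundaries relevant to a given query without explicit ground-truth temporal boundaries.

Motivation: Existing methods mainly generate candidate boundaries from Gaussian-based proposals but overlook the importance of boundary prediction and top-1 prediction selection at inference.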

Details

Method: Proposes new boundary prediction methods that capture diverse boundaries from multiple Gaussians, and introduces top-1 selection methods that account for proposal quality. Result: Experiments on the ActivityNet Captions and Charades-STA datasets validate the approach, improving performance without additional training. Conclusion: Improving boundary prediction and selection strategies significantly boosts weakly supervised video grounding performance. Abstract: Weakly supervised video grounding aims to localize temporal boundaries relevant to a given query without explicit ground-truth temporal boundaries. While existing methods primarily use Gaussian-based proposals, they overlook the importance of (1) boundary prediction and (2) top-1 prediction selection during inference. In their boundary prediction, boundaries are simply set at half a standard deviation away from a Gaussian mean on both sides, which may not accurately capture the optimal boundaries. In the top-1 prediction process, these existing methods rely heavily on intersections with other proposals, without considering the varying quality of each proposal. To address these issues, we explore various inference strategies by introducing (1) novel boundary prediction methods to capture diverse boundaries from multiple Gaussians and (2) new selection methods that take proposal quality into account. Extensive experiments on the ActivityNet Captions and Charades-STA datasets validate the effectiveness of our inference strategies, demonstrating performance improvements without requiring additional training.
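
The baseline boundary rule that this abstract criticizes (boundaries half a standard deviation either side of the Gaussian mean) can be sketched as follows. This shows only that baseline, not the proposed inference strategies. It assumes proposals are parameterized in normalized video time [0, 1]; the function names and the clamping step are illustrative assumptions.

```python
def gaussian_boundaries(mean: float, std: float, k: float = 0.5) -> tuple[float, float]:
    """Turn a Gaussian proposal into (start, end) at mean -/+ k * std.

    k = 0.5 matches the "half a standard deviation" rule in the abstract.
    Clamping to [0, 1] is an assumption for normalized video time.
    """
    start = max(0.0, mean - k * std)
    end = min(1.0, mean + k * std)
    return start, end


def temporal_iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Intersection over union of two temporal segments."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0
```

With a fixed k, the segment width is tied rigidly to the standard deviation, which is the limitation the abstract points to when it says such boundaries may not be optimal.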

NRC VAD Lexicon v2: Norms for Valence, Arousal, and Dominance for over 55k English Terms

Saif M. Mohammad

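Task: Extend the NRC VAD Lexicon to v2, with valence, arousal, and dominance ratings for more than 55,000 English words and phrases.

Motivation: Valence, arousal, and dominance affect social competence, emotion regulation, workplace success, and worldview, so research needs a more comprehensive lexical resource.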

Details

Method: Expands the lexicon through human ratings, adding about 25,000 words and 10,000 multi-word phrases, and verifies reliability. Result: NRC VAD Lexicon v2 is highly reliable and supports research in psychology, NLP, public health, and other fields. Conclusion: NRC VAD Lexicon v2 provides a free and reliable lexical resource for multidisciplinary research. Abstract: Factor analysis studies have shown that the primary dimensions of word meaning are Valence (V), Arousal (A), and Dominance (D) (also referred to in social cognition research as Competence (C)). These dimensions impact various aspects of our lives from social competence and emotion regulation to success in the workplace and how we view the world. We present here the NRC VAD Lexicon v2, which has human ratings of valence, arousal, and dominance for more than 55,000 English words and phrases. Notably, it adds entries for ~25k additional words to v1.0. It also now includes for the first time entries for common multi-word phrases (~10k). We show that the associations are highly reliable. The lexicon enables a wide variety of research in psychology, NLP, public health, digital humanities, and social sciences. The NRC VAD Lexicon v2 is made freely available for research through our project webpage.

Real-time Video Prediction With Fast Video Interpolation Model and Prediction Training

Shota Hirose,Kazuki Kotoyori,Kasidis Arunruangsirilert,Fangzheng Lin,Heming Sun,Jiro Katto

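Task: Propose a real-time video prediction method (IFRVP) to enable zero-latency interaction over networks.

Motivation: Transmission latency significantly degrades the user experience in real-time interaction, and existing video prediction methods are computationally expensive and unsuitable for real-time applications.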

Details

Method: Builds on a convolution-only frame interpolation network based on IFRNet and introduces ELAN-based residual blocks to improve inference speed and accuracy. Result: The models achieve the best trade-off between prediction accuracy and computational speed. Conclusion: IFRVP is an efficient and practical real-time video prediction method. Abstract: Transmission latency significantly affects users' quality of experience in real-time interaction and actuation. As latency is inevitable in principle, video prediction can be utilized to mitigate the latency and ultimately enable zero-latency transmission. However, most of the existing video prediction methods are computationally expensive and impractical for real-time applications. In this work, we therefore propose real-time video prediction towards the zero-latency interaction over networks, called IFRVP (Intermediate Feature Refinement Video Prediction). Firstly, we propose three training methods for video prediction that extend frame interpolation models, where we utilize a simple convolution-only frame interpolation network based on IFRNet. Secondly, we introduce ELAN-based residual blocks into the prediction models to improve both inference speed and accuracy. Our evaluations show that our proposed models perform efficiently and achieve the best trade-off between prediction accuracy and computational speed among the existing video prediction methods. A demonstration movie is also provided at http://bit.ly/IFRVPDemo.

When LLM Therapists Become Salespeople: Evaluating Large Language Models for Ethical Motivational Interviewing

Haein Kong,Seonghyeon Moon

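Task: Investigate the ethical awareness of large language models (LLMs) in motivational interviewing (MI).

Motivation: Although LLMs show promise in mental health, especially MI, their ethical understanding is under-studied and they risk being exploited by malicious actors.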

Details

Method: Evaluates LLMs' ability to distinguish ethical from unethical MI practices across multiple experiments and proposes a Chain-of-Ethic prompting strategy. Result: LLMs have good MI knowledge, but their ethical standards are not aligned with the MI spirit: they generate unethical responses and detect them poorly; the Chain-of-Ethic prompt significantly improves ethical behavior. Conclusion: Safety evaluations and guidelines are needed to ensure that LLM-powered psychotherapy is ethical and safe. Abstract: Large language models (LLMs) have been actively applied in the mental health field. Recent research shows the promise of LLMs in applying psychotherapy, especially motivational interviewing (MI). However, there is a lack of studies investigating how language models understand MI ethics. Given the risk that malicious actors could use language models to apply MI for unethical purposes, it is important to evaluate their capability of differentiating ethical and unethical MI practices. Thus, this study investigates the ethical awareness of LLMs in MI with multiple experiments. Our findings show that LLMs have a moderate to strong level of knowledge in MI. However, their ethical standards are not aligned with the MI spirit, as they generated unethical responses and performed poorly in detecting unethical responses. We proposed a Chain-of-Ethic prompt to mitigate those risks and improve safety. Finally, our proposed strategy effectively improved ethical MI response generation and detection performance. These findings highlight the need for safety evaluations and guidelines for building ethical LLM-powered psychotherapy.

A GAN-Enhanced Deep Learning Framework for Rooftop Detection from Historical Aerial Imagery

Pengyu Chen,Sicheng Wang,Cuizhen Wang,Senrong Wang,Beiao Huang,Lu Huang,Zhe Zang

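Task: Accurately detect rooftops in historical aerial imagery to study long-term urban development and human settlement patterns.

Motivation: Black-and-white analog photographs challenge modern object detection frameworks because of their low spatial resolution, lack of color information, and archival degradation.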

Details

Method: Proposes a two-stage image enhancement pipeline based on Generative Adversarial Networks (GANs): colorization with DeOldify followed by super-resolution with Real-ESRGAN. Result: Combining colorization and super-resolution substantially improves detection, with YOLOv11n exceeding 85% mean Average Precision (mAP), about 40% better than on the original black-and-white images. Conclusion: The method bridges the gap between archival imagery and modern deep learning, enabling more reliable extraction of building footprints from historical aerial photographs. Abstract: Accurate rooftop detection from historical aerial imagery is vital for examining long-term urban development and human settlement patterns. However, black-and-white analog photographs pose significant challenges for modern object detection frameworks due to their limited spatial resolution, lack of color information, and archival degradation. To address these limitations, this study introduces a two-stage image enhancement pipeline based on Generative Adversarial Networks (GANs): image colorization using DeOldify, followed by super-resolution enhancement with Real-ESRGAN. The enhanced images were then used to train and evaluate rooftop detection models, including Faster R-CNN, DETReg, and YOLOv11n. Results show that combining colorization with super-resolution substantially improves detection performance, with YOLOv11n achieving a mean Average Precision (mAP) exceeding 85%. This reflects an improvement of approximately 40% over original black-and-white images and 20% over images enhanced through colorization alone. The proposed method effectively bridges the gap between archival imagery and contemporary deep learning techniques, enabling more reliable extraction of building footprints from historical aerial photographs.

The Impact of Code-switched Synthetic Data Quality is Task Dependent: Insights from MT and ASR

Injy Hamed,Ngoc Thang Vu,Nizar Habash

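Task: Study how code-switched data augmentation techniques affect performance on NLP tasks such as machine translation, automatic speech recognition, and cascaded speech translation.

Motivation: Code-switching is a widespread global phenomenon, but data scarcity limits the development of language technologies for it, which motivates research on data augmentation.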

Details

Method: Applies several augmentation techniques (lexical replacement, linguistic theories, and back-translation) and runs experiments on machine translation, automatic speech recognition, and cascaded speech translation. Result: The experiments measure how different augmentation techniques affect task performance and analyze the relation between data quality and performance. Conclusion: Summarizes the efficacy of the augmentation techniques and highlights the role of data quality in task performance. Abstract: Code-switching, the act of alternating between languages, has emerged as a prevalent global phenomenon that needs to be addressed for building user-friendly language technologies. A main bottleneck in this pursuit is data scarcity, motivating research in the direction of code-switched data augmentation. However, current literature lacks comprehensive studies that enable us to understand the relation between the quality of synthetic data and improvements on NLP tasks. We extend previous research conducted in this direction on machine translation (MT) with results on automatic speech recognition (ASR) and cascaded speech translation (ST) to test generalizability of findings. Our experiments involve a wide range of augmentation techniques, covering lexical replacements, linguistic theories, and back-translation. Based on the results of MT, ASR, and ST, we draw conclusions and insights regarding the efficacy of various augmentation techniques and the impact of quality on performance.

Convolutional Neural Networks Can (Meta-)Learn the Same-Different Relation

Max Gupta,Sunayana Rane,R. Thomas McCoy,Thomas L. Griffiths

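Task: Study how well CNNs generalize on visual relation tasks such as the same-different relation.

Motivation: Although CNNs excel at single-object tasks, they remain far behind humans on relational visual tasks, particularly in generalizing the same-different relation.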

Details

Method: Trains CNNs via meta-learning to encourage abstraction and generalization across tasks. Result: Compared with conventional training, meta-learning markedly improves CNN generalization on same-different tasks. Conclusion: Meta-learning is an effective way to improve CNN generalization on relational tasks. Abstract: While convolutional neural networks (CNNs) have come to match and exceed human performance in many settings, the tasks these models optimize for are largely constrained to the level of individual objects, such as classification and captioning. Humans remain vastly superior to CNNs in visual tasks involving relations, including the ability to identify two objects as 'same' or 'different'. A number of studies have shown that while CNNs can be coaxed into learning the same-different relation in some settings, they tend to generalize poorly to other instances of this relation. In this work we show that the same CNN architectures that fail to generalize the same-different relation with conventional training are able to succeed when trained via meta-learning, which explicitly encourages abstraction and generalization across tasks.

CrossFormer: Cross-Segment Semantic Fusion for Document Segmentation

Tongke Ni,Yang Fan,Junru Zhou,Xiangping Wu,Qingcai Chen

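Task: Propose CrossFormer, a transformer-based model for text semantic segmentation that addresses the loss of semantic information caused by segment-wise processing in traditional methods.

Motivation: Traditional methods preprocess documents into segments to cope with input length limits, which loses critical semantic information across segments.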

Details

Method: CrossFormer uses a cross-segment fusion module to dynamically model latent semantic dependencies between document segments. Result: CrossFormer performs strongly on public text semantic segmentation datasets and yields considerable gains on RAG benchmarks. Conclusion: CrossFormer improves text semantic segmentation accuracy and can replace rule-based chunking in RAG systems, producing more semantically coherent chunks. Abstract: Text semantic segmentation involves partitioning a document into multiple paragraphs with continuous semantics based on the subject matter, contextual information, and document structure. Traditional approaches have typically relied on preprocessing documents into segments to address input length constraints, resulting in the loss of critical semantic information across segments. To address this, we present CrossFormer, a transformer-based model featuring a novel cross-segment fusion module that dynamically models latent semantic dependencies across document segments, substantially elevating segmentation accuracy. Additionally, CrossFormer can replace rule-based chunk methods within the Retrieval-Augmented Generation (RAG) system, producing more semantically coherent chunks that enhance its efficacy. Comprehensive evaluations confirm CrossFormer's state-of-the-art performance on public text semantic segmentation datasets, alongside considerable gains on RAG benchmarks.

Action Recognition in Real-World Ambient Assisted Living Environment

Vincent Gbouna Zakka,Zhuangzhuang Dai,Luis J. Manso

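Task: Propose RE-TCN, a robust and efficient temporal convolution network, to address the challenges of action recognition in AAL technologies.

Motivation: A growing ageing population and its preference for ageing at home drive the development of AAL technologies, but action recognition still faces challenges with noise, occlusion, and real-time performance.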

Details

Method: Combines Adaptive Temporal Weighting (ATW), Depthwise Separable Convolutions (DSC), and data augmentation to improve accuracy, robustness, and computational efficiency. Result: RE-TCN outperforms existing models on four benchmark datasets (NTU RGB+D 60, Northwestern-UCLA, SHREC'17, and DHG-14/28). Conclusion: RE-TCN balances accuracy, robustness, and efficiency in AAL applications, providing more reliable technical support for ageing in place. Abstract: The growing ageing population and their preference to maintain independence by living in their own homes require proactive strategies to ensure safety and support. Ambient Assisted Living (AAL) technologies have emerged to facilitate ageing in place by offering continuous monitoring and assistance within the home. Within AAL technologies, action recognition plays a crucial role in interpreting human activities and detecting incidents like falls, mobility decline, or unusual behaviours that may signal worsening health conditions. However, action recognition in practical AAL applications presents challenges, including occlusions, noisy data, and the need for real-time performance. While advancements have been made in accuracy, robustness to noise, and computational efficiency, achieving a balance among them all remains a challenge. To address this challenge, this paper introduces the Robust and Efficient Temporal Convolution network (RE-TCN), which comprises three main elements: Adaptive Temporal Weighting (ATW), Depthwise Separable Convolutions (DSC), and data augmentation techniques. These elements aim to enhance the model's accuracy, robustness against noise and occlusion, and computational efficiency within real-world AAL contexts. RE-TCN outperforms existing models in terms of accuracy, noise and occlusion robustness, and has been validated on four benchmark datasets: NTU RGB+D 60, Northwestern-UCLA, SHREC'17, and DHG-14/28. The code is publicly available at: https://github.com/Gbouna/RE-TCN
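
To see why depthwise separable convolutions help computational efficiency, the weight counts of a standard and a depthwise separable 1D temporal convolution can be compared. This is a generic sketch of the well-known DSC factorization, not RE-TCN's actual layer configuration. The kernel size and channel counts in the test are hypothetical, and bias terms are ignored.

```python
def standard_conv1d_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a standard 1D convolution (bias ignored)."""
    return kernel_size * in_channels * out_channels


def depthwise_separable_conv1d_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a depthwise conv (one filter per input channel)
    followed by a 1x1 pointwise conv that mixes channels (bias ignored)."""
    depthwise = kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


def reduction_factor(kernel_size: int, in_channels: int, out_channels: int) -> float:
    """How many times fewer weights the separable form uses."""
    return (standard_conv1d_params(kernel_size, in_channels, out_channels)
            / depthwise_separable_conv1d_params(kernel_size, in_channels, out_channels))
```

The saving grows with kernel size and output width, since the factor equals k * c_out / (k + c_out).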

WHERE and WHICH: Iterative Debate for Biomedical Synthetic Data Augmentation

Zhengyi Zhao,Shubo Zhang,Bin Liang,Binyang Li,Kam-Fai Wong

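Task: Propose a synthetic data augmentation method based on biomedical relation similarity to address the scarcity of high-quality data in biomedical NLP tasks.

Motivation: High-quality data is scarce in biomedical NLP, which makes it hard for models to understand relationships between biological entities, and existing augmentation methods generate counterfactual data that breaks the original semantics.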

Details

Method: Uses a biomedical relation similarity measure together with a multi-agent reflection mechanism to keep augmented data strongly correlated with bio-relations. Result: Experiments on 9 datasets from the BLURB and BigBIO benchmarks show consistent performance improvements. Conclusion: The method effectively addresses data scarcity and improves the overall performance of biomedical NLP models. Abstract: In Biomedical Natural Language Processing (BioNLP) tasks, such as Relation Extraction, Named Entity Recognition, and Text Classification, the scarcity of high-quality data remains a significant challenge. This limitation prevents large language models from correctly understanding relationships between biological entities, such as molecules and diseases, or drug interactions, and further results in potential misinterpretation of biomedical documents. To address this issue, current approaches generally adopt the Synthetic Data Augmentation method, which involves similarity computation followed by word replacement, but counterfactual data are usually generated. As a result, these methods disrupt meaningful word sets or produce sentences with meanings that deviate substantially from the original context, rendering them ineffective in improving model performance. To this end, this paper proposes a biomedical-dedicated rationale-based synthetic data augmentation method. Beyond the naive lexicon similarity, specific bio-relation similarity is measured to keep the augmented instance strongly correlated with the bio-relation instead of simply increasing the diversity of augmented data. Moreover, a multi-agent reflection mechanism helps the model iteratively distinguish different usages of similar entities to avoid the mis-replacement trap. We evaluate our method on the BLURB and BigBIO benchmarks, which include 9 common datasets spanning four major BioNLP tasks. Our experimental results demonstrate consistent performance improvements across all tasks, highlighting the effectiveness of our approach in addressing the challenges associated with data scarcity and enhancing the overall performance of biomedical NLP models.

Large Self-Supervised Models Bridge the Gap in Domain Adaptive Object Detection

Marc-Antoine Lavoie,Anas Mahmoud,Steven L. Waslander

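Task: Use the DINOv2 foundation model to improve label generation and feature alignment in domain adaptive object detection (DAOD).

Motivation: Existing methods such as Mean Teacher couple target-domain label generation with learning, which can yield inaccurate labels and limit performance. Large pretrained vision foundation models such as DINOv2 can generate more accurate labels and improve generalization.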

Details

Method: Proposes DINO Teacher, with two components: (1) a DINOv2-based labeller trained on source data only; (2) alignment of the student's source and target image patch features with those of a DINO encoder. Result: Achieves state-of-the-art performance on multiple DAOD datasets. Conclusion: By leveraging a pretrained foundation model, DINO Teacher significantly improves label quality and feature alignment in domain adaptive object detection. Abstract: The current state-of-the-art methods in domain adaptive object detection (DAOD) use Mean Teacher self-labelling, where a teacher model, directly derived as an exponential moving average of the student model, is used to generate labels on the target domain which are then used to improve both models in a positive loop. This couples learning and generating labels on the target domain, and other recent works also leverage the generated labels to add domain alignment losses. We believe this coupling is brittle and excessively constrained: there is no guarantee that a student trained only on source data can generate accurate target domain labels and initiate the positive feedback loop, and much better target domain labels can likely be generated by using a large pretrained network that has been exposed to much more data. Vision foundation models are exactly such models, and they have shown impressive task generalization capabilities even when frozen. We want to leverage these models for DAOD and introduce DINO Teacher, which consists of two components. First, we train a new labeller on source data only using a large frozen DINOv2 backbone and show it generates more accurate labels than Mean Teacher. Next, we align the student's source and target image patch features with those from a DINO encoder, driving source and target representations closer to the generalizable DINO representation. We obtain state-of-the-art performance on multiple DAOD datasets. Code available at https://github.com/TRAILab/DINO_Teacher
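
The Mean Teacher baseline described in this abstract derives the teacher as an exponential moving average (EMA) of the student. A minimal sketch of that update follows; it illustrates the coupling the authors argue against, not DINO Teacher itself. Parameters are plain Python lists keyed by name, and both that layout and the decay value are assumptions for illustration.

```python
def ema_update(teacher: dict[str, list[float]],
               student: dict[str, list[float]],
               decay: float = 0.999) -> None:
    """In-place EMA step: teacher <- decay * teacher + (1 - decay) * student."""
    for name, student_values in student.items():
        teacher_values = teacher[name]
        for i, s in enumerate(student_values):
            teacher_values[i] = decay * teacher_values[i] + (1.0 - decay) * s


# Hypothetical usage: after each student optimizer step, pull the teacher
# slightly toward the student, then use the teacher to label target images.
teacher_params = {"w": [1.0, 2.0]}
student_params = {"w": [3.0, 4.0]}
ema_update(teacher_params, student_params, decay=0.5)
```

Because the teacher is only ever a smoothed copy of the student, its labels cannot be better than what the student's own training trajectory supports, which is the brittleness the abstract describes.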

Large Language Models Pass the Turing Test

Cameron R. Jones,Benjamin K. Bergen

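Task: Evaluate four systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in randomised, controlled, and pre-registered Turing tests.

Motivation: Examine whether large language models (LLMs) can pass a standard three-party Turing test, and consider what kind of intelligence they exhibit and their social and economic impact.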

Details

Method: Participants from two independent populations held 5-minute conversations simultaneously with another human and one of the systems, then judged which partner was human. Result: GPT-4.5 was judged to be the human 73% of the time, significantly more often than the real human; LLaMa-3.1 reached 56%, not significantly different from humans; ELIZA and GPT-4o fell significantly below chance (23% and 21%). Conclusion: GPT-4.5 passes a standard three-party Turing test for the first time, with implications for the kind of intelligence LLMs exhibit and their social and economic impact. Abstract: We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5-minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time -- not significantly more or less often than the humans they were being compared to -- while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. The results have implications for debates about what kind of intelligence is exhibited by Large Language Models (LLMs), and the social and economic impacts these systems are likely to have.
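
One way to check whether a win rate differs from the 50% chance level is an exact binomial tail probability. The sketch below is an illustration only: the abstract does not report the number of trials or the statistical test used, so the trial counts here are hypothetical and the function name is an assumption.

```python
from math import comb


def binomial_upper_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p).

    A small value means a win rate this high is unlikely under chance.
    """
    return sum(
        comb(trials, k) * p**k * (1.0 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical example: 100 trials per system (NOT the study's sample size).
p_73_of_100 = binomial_upper_tail(73, 100)  # a 73% win rate
p_56_of_100 = binomial_upper_tail(56, 100)  # a 56% win rate
```

With these hypothetical counts, the 73% rate would be far out in the tail while the 56% rate would not be distinguishable from chance at conventional thresholds, which mirrors the qualitative pattern the abstract reports.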

Synthetic Art Generation and DeepFake Detection A Study on Jamini Roy Inspired Dataset

Kushal Agrawal,Romi Banerjee

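Task: Study diffusion-based generative models applied to Indian art, specifically the style of Jamini Roy, and explore methods for detecting synthetic artworks.

Motivation: Combining generative AI with art brings opportunities and challenges, especially in identifying synthetic artworks; existing detection techniques struggle with high-quality, culturally specific deepfakes.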

Details

Method: Fine-tunes Stable Diffusion 3 and uses ControlNet and IPAdapter to generate images, builds a dataset of real and AI-generated works, and analyzes differences with methods such as Fourier domain assessments and autocorrelation metrics. Result: Reveals subtle differences between synthetic images and authentic works and shows the limits of existing deepfake detection methods for high-quality, culturally specific cases. Conclusion: The study illustrates the growing complexity of generative models and lays a foundation for future research on synthetic art detection. Abstract: The intersection of generative AI and art is a fascinating area that brings both exciting opportunities and significant challenges, especially when it comes to identifying synthetic artworks. This study takes a unique approach by examining diffusion-based generative models in the context of Indian art, specifically focusing on the distinctive style of Jamini Roy. To explore this, we fine-tuned Stable Diffusion 3 and used techniques like ControlNet and IPAdapter to generate realistic images. This allowed us to create a new dataset that includes both real and AI-generated artworks, which is essential for a detailed analysis of what these models can produce. We employed various qualitative and quantitative methods, such as Fourier domain assessments and autocorrelation metrics, to uncover subtle differences between synthetic images and authentic pieces. A key takeaway from recent research is that existing methods for detecting deepfakes face considerable challenges, especially when the deepfakes are of high quality and tailored to specific cultural contexts. This highlights a critical gap in current detection technologies. This work not only sheds light on the increasing complexity of generative models but also sets a crucial foundation for future research aimed at effective detection of synthetic art.
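
As a rough idea of what an autocorrelation metric measures, the sketch below computes the normalized autocorrelation of a 1D intensity profile (for example, one row of grayscale pixels) at a given lag. This is a generic textbook estimator, not the specific metric used in the study, which the abstract does not define. The function name and the toy signal are hypothetical.

```python
def autocorrelation(signal: list[float], lag: int) -> float:
    """Normalized autocorrelation of a 1D signal at a non-negative lag.

    Values near +1 or -1 indicate strong periodic structure at that lag;
    values near 0 indicate little linear dependence.
    """
    n = len(signal)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(signal)")
    mean = sum(signal) / n
    deviations = [x - mean for x in signal]
    variance_sum = sum(d * d for d in deviations)
    if variance_sum == 0:
        raise ValueError("autocorrelation is undefined for a constant signal")
    lagged_sum = sum(deviations[i] * deviations[i + lag] for i in range(n - lag))
    return lagged_sum / variance_sum


# Toy periodic "pixel row" with period 2 (hypothetical data).
row = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Regular generation artifacts would show up as peaks at specific lags in a profile like this, which is the kind of subtle regularity such metrics are meant to expose.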

MKA: Leveraging Cross-Lingual Consensus for Model Abstention

Sharad Duwal

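Task: Use an LLM's multilingual knowledge to calibrate its decision to answer or abstain.

Motivation: Reliability concerns limit the wider adoption of LLMs, especially regarding factuality and confidence calibration.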

Details

Method: Develops a multilingual pipeline that calibrates the model's confidence and lets it abstain when uncertain. Result: Pipeline performance varies by model and language but generally improves; for example, accuracy rises by 71.2% for Bengali and 15.5% for English. Conclusion: Multilingual confidence calibration shows promise for improving LLM reliability, with room for further improvement. Abstract: Reliability of LLMs is questionable even as they get better at more tasks. A wider adoption of LLMs is contingent on whether they are usably factual and, if they are not, on whether they can properly calibrate their confidence in their responses. This work focuses on utilizing the multilingual knowledge of an LLM to inform its decision to abstain or answer when prompted. We develop a multilingual pipeline to calibrate the model's confidence and let it abstain when uncertain. We run several multilingual models through the pipeline to profile them across different languages. We find that the performance of the pipeline varies by model and language, but that in general they benefit from it. This is evidenced by the accuracy improvement of 71.2% for Bengali over a baseline performance without the pipeline. Even a high-resource language like English sees a 15.5% improvement. These results hint at possible further improvements.
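
The abstract does not spell out how cross-lingual consensus is computed, so the following is only one plausible reading, offered as a sketch: ask the same question in several languages, map the answers into a common form, and abstain unless enough of them agree. The function name, the normalization step, and the 0.6 threshold are all hypothetical.

```python
from collections import Counter
from typing import Optional


def answer_or_abstain(answers: list[str], threshold: float = 0.6) -> Optional[str]:
    """Return the majority answer if its share meets the threshold, else None.

    `answers` is assumed to hold one answer per prompt language, already
    translated into a common language; None signals abstention.
    """
    if not answers:
        return None
    normalized = [a.strip().lower() for a in answers]
    top_answer, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    return top_answer if agreement >= threshold else None
```

The threshold trades coverage against accuracy: raising it makes the model abstain more often but answer more reliably when it does.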

Z-SASLM: Zero-Shot Style-Aligned SLI Blending Latent Manipulation

Alessio Borgi,Luca Maiano,Irene Amerini

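Task: Propose Z-SASLM, a zero-shot style-aligned SLI blending latent manipulation framework, to overcome the limitations of multi-style blending methods.

Motivation: Conventional linear blending assumes a flat latent space, which leads to poor multi-style fusion.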

Details

Method: Uses SLI Blending to interpolate along geodesics on the hypersphere, combining weighted style representations while preserving the intrinsic structure of the latent space. Result: Experiments show that Z-SASLM achieves high-fidelity, coherent multi-style blending without fine-tuning. Conclusion: Z-SASLM demonstrates the theoretical and practical advantages of SLI Blending for style manipulation and performs well in multi-modal content fusion. Abstract: We introduce Z-SASLM, a Zero-Shot Style-Aligned SLI (Spherical Linear Interpolation) Blending Latent Manipulation pipeline that overcomes the limitations of current multi-style blending methods. Conventional approaches rely on linear blending, assuming a flat latent space, which leads to suboptimal results when integrating multiple reference styles. In contrast, our framework leverages the non-linear geometry of the latent space by using SLI Blending to combine weighted style representations. By interpolating along the geodesic on the hypersphere, Z-SASLM preserves the intrinsic structure of the latent space, ensuring high-fidelity and coherent blending of diverse styles, all without the need for fine-tuning. We further propose a new metric, Weighted Multi-Style DINO ViT-B/8, designed to quantitatively evaluate the consistency of the blended styles. While our primary focus is on the theoretical and practical advantages of SLI Blending for style manipulation, we also demonstrate its effectiveness in a multi-modal content fusion setting through comprehensive experimental studies. Experimental results show that Z-SASLM achieves enhanced and robust style alignment. The implementation code can be found at: https://github.com/alessioborgi/Z-SASLM.
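
Spherical linear interpolation between two vectors can be sketched as below. This is the standard two-vector slerp formula on unit vectors, shown for intuition; it is not the Z-SASLM implementation, and how the pipeline extends it to several weighted styles is not covered here. The function names are illustrative.

```python
import math


def _normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def slerp(v0: list[float], v1: list[float], t: float) -> list[float]:
    """Interpolate along the great-circle arc between the directions of v0 and v1.

    Inputs are projected onto the unit hypersphere first, so the result is a
    unit vector. t = 0 gives v0's direction, t = 1 gives v1's direction.
    """
    u0, u1 = _normalize(v0), _normalize(v1)
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u0, u1))))
    omega = math.acos(dot)      # angle between the two directions
    sin_omega = math.sin(omega)
    if abs(sin_omega) < 1e-8:
        # Nearly parallel (or antipodal, where no unique geodesic exists):
        # fall back to plain linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(u0, u1)]
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(u0, u1)]


midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

For these two orthogonal unit vectors, the linear midpoint [0.5, 0.5] has norm of about 0.71 and falls inside the sphere, while the slerp midpoint stays on it. That is the sense in which geodesic interpolation preserves the latent-space structure.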

Mapping Geopolitical Bias in 11 Large Language Models: A Bilingual, Dual-Framing Analysis of U.S.-China Tensions

William Guey,Pierrick Bougault,Vitor D. de Moura,Wei Zhang,Jose O. Gomes

Task: Systematically analyze geopolitical bias across 11 prominent large language models (LLMs).

Motivation: Examine the ideological leanings of LLMs on key issues in U.S.-China relations and show how they correlate with the models' geographic origins.

Details

Method: Using a bilingual (English and Chinese) and dual-framing (affirmative and reverse) methodology, generate 19,712 prompts and quantitatively assess the ideological leanings of model outputs. Result: The ideological leanings of LLMs correlate significantly with their geographic origins: U.S.-based models lean Pro-U.S. and Chinese-origin models lean Pro-China. Language and prompt framing substantially affect model responses. Conclusion: The study offers practical guidance for selecting LLMs that fit geopolitical requirements and reveals how specific prompt structures can influence model outputs. Abstract: This study systematically analyzes geopolitical bias across 11 prominent Large Language Models (LLMs) by examining their responses to seven critical topics in U.S.-China relations. Utilizing a bilingual (English and Chinese) and dual-framing (affirmative and reverse) methodology, we generated 19,712 prompts designed to detect ideological leanings in model outputs. Responses were quantitatively assessed on a normalized scale from -2 (strongly Pro-China) to +2 (strongly Pro-U.S.) and categorized according to stance, neutrality, and refusal rates. The findings demonstrate significant and consistent ideological alignments correlated with the LLMs' geographic origins; U.S.-based models predominantly favored Pro-U.S. stances, while Chinese-origin models exhibited pronounced Pro-China biases. Notably, language and prompt framing substantially influenced model responses, with several LLMs exhibiting stance reversals based on prompt polarity or linguistic context. Additionally, we introduced comprehensive metrics to evaluate response consistency across languages and framing conditions, identifying variability and vulnerabilities in model behaviors. These results offer practical insights that can guide organizations and individuals in selecting LLMs best aligned with their operational priorities and geopolitical considerations, underscoring the importance of careful model evaluation in politically sensitive applications. Furthermore, the research highlights specific prompt structures and linguistic variations that can strategically trigger distinct responses from models, revealing methods for effectively navigating and influencing LLM outputs.

Context in object detection: a systematic literature review

Mahtab Jamali,Paul Davidsson,Reza Khoshkangini,Martin Georg Ljungqvist,Radu-Casian Mihailescu

Task: Explore the impact of context-based approaches on object detection.

Motivation: Contextual information is valuable in computer vision and can improve the precision and effectiveness of object detection.

Details

Method: Survey and compare the most recent context-based object detection approaches, covering more than 265 publications. Result: Provides a thorough understanding of contextual information and summarizes effective methods for integrating various context types. Conclusion: Addresses the research questions and identifies directions for future work, offering a useful reference for researchers. Abstract: Context is an important factor in computer vision as it offers valuable information to clarify and analyze visual data. Utilizing the contextual information inherent in an image or a video can improve the precision and effectiveness of object detectors. For example, where recognizing an isolated object might be challenging, context information can improve comprehension of the scene. This study explores the impact of various context-based approaches to object detection. Initially, we investigate the role of context in object detection and survey it from several perspectives. We then review and discuss the most recent context-based object detection approaches and compare them. Finally, we conclude by addressing research questions and identifying gaps for further studies. More than 265 publications are included in this survey, covering different aspects of context in different categories of object detection, including general object detection, video object detection, small object detection, camouflaged object detection, zero-shot, one-shot, and few-shot object detection. This literature review presents a comprehensive overview of the latest advancements in context-based object detection, providing valuable contributions such as a thorough understanding of contextual information and effective methods for integrating various context types into object detection, thus benefiting researchers.

Building Instruction-Tuning Datasets from Human-Written Instructions with Open-Weight Large Language Models

Youmi Ma,Sakae Mizuki,Kazuki Fujii,Taishi Nakamura,Masanari Ohi,Hinari Shimada,Taihei Shiotani,Koshiro Saito,Koki Maeda,Kakeru Hattori,Takumi Okamoto,Shigeki Ishida,Rio Yokota,Hiroya Takamura,Naoaki Okazaki

Task: Investigate whether human-originated signals are still needed for instruction tuning, and build datasets from human-written instructions paired with LLM-generated responses.

Motivation: Prior work has shown that instruction-tuning data synthesized solely from LLMs is effective, but the value of human-originated signals remains unclear.

Details

Method: Build datasets from human-written instructions paired with LLM-generated responses, and validate the approach in a second language (Japanese). Result: LLMs tuned on the human-instruction datasets outperform those tuned on existing datasets, but show a lack of culture-specific knowledge in the new language. Conclusion: Human instruction signals remain important for instruction tuning, and the approach extends to other languages, although cultural knowledge needs to be supplemented. Abstract: Instruction tuning is crucial for enabling Large Language Models (LLMs) to solve real-world tasks. Prior work has shown the effectiveness of instruction-tuning data synthesized solely from LLMs, raising a fundamental question: Do we still need human-originated signals for instruction tuning? This work answers the question affirmatively: we build state-of-the-art instruction-tuning datasets sourced from human-written instructions, by simply pairing them with LLM-generated responses. LLMs fine-tuned on our datasets consistently outperform those fine-tuned on existing ones. Our data construction approach can be easily adapted to other languages; we build datasets for Japanese and confirm that LLMs tuned with our data reach state-of-the-art performance. Analyses suggest that instruction-tuning in a new language allows LLMs to follow instructions, while the tuned models exhibit a notable lack of culture-specific knowledge in that language. The datasets and fine-tuned models will be publicly available. Our datasets, synthesized with open-weight LLMs, are openly distributed under permissive licenses, allowing for diverse use cases.

FIESTA: Fisher Information-based Efficient Selective Test-time Adaptation

Mohammadmahdi Honarmand,Onur Cezmi Mutlu,Parnian Azizian,Saimourya Surabhi,Dennis P. Wall

Task: Propose a Fisher information-based dynamic parameter selection framework for test-time adaptation in video-based facial expression recognition.

Motivation: Address the performance degradation caused by distribution shift between training and testing in unconstrained environments, while reducing computational overhead.

Details

Method: Dynamically select the most critical parameters using Fisher information and add temporal consistency constraints. Result: On the AffWild2 benchmark, F1 score improves by 7.7% over the base model while only 22,000 parameters are updated. Conclusion: The method improves recognition accuracy and lowers computational cost, making it suitable for real-world affective computing applications. Abstract: Robust facial expression recognition in unconstrained, "in-the-wild" environments remains challenging due to significant domain shifts between training and testing distributions. Test-time adaptation (TTA) offers a promising solution by adapting pre-trained models during inference without requiring labeled test data. However, existing TTA approaches typically rely on manually selecting which parameters to update, potentially leading to suboptimal adaptation and high computational costs. This paper introduces a novel Fisher-driven selective adaptation framework that dynamically identifies and updates only the most critical model parameters based on their importance as quantified by Fisher information. By integrating this principled parameter selection approach with temporal consistency constraints, our method enables efficient and effective adaptation specifically tailored for video-based facial expression recognition. Experiments on the challenging AffWild2 benchmark demonstrate that our approach significantly outperforms existing TTA methods, achieving a 7.7% improvement in F1 score over the base model while adapting only 22,000 parameters, more than 20 times fewer than comparable methods. Our ablation studies further reveal that parameter importance can be effectively estimated from minimal data, with sampling just 1-3 frames sufficient for substantial performance gains. The proposed approach not only enhances recognition accuracy but also dramatically reduces computational overhead, making test-time adaptation more practical for real-world affective computing applications.
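
The idea of ranking parameters by Fisher information and adapting only the top few can be sketched as follows. The abstract does not state which Fisher estimator the authors use, so this sketch assumes the common empirical diagonal approximation (mean of squared per-sample gradients); `diagonal_fisher`, `select_top_k`, and the toy gradients are hypothetical names and values introduced only for illustration.

```python
def diagonal_fisher(per_sample_grads):
    """Empirical diagonal Fisher: mean squared gradient per parameter (assumed estimator)."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]

def select_top_k(fisher, k):
    """Return the indices of the k parameters with the largest Fisher values."""
    ranked = sorted(range(len(fisher)), key=lambda i: fisher[i], reverse=True)
    return sorted(ranked[:k])

# Toy gradients for 4 parameters, estimated from 3 frames.
grads = [
    [0.1, -2.0, 0.0, 0.5],
    [0.2, 1.5, 0.1, -0.4],
    [-0.1, -1.8, 0.0, 0.6],
]
fisher = diagonal_fisher(grads)
selected = select_top_k(fisher, k=2)  # only these parameters would be adapted
```

Using three toy samples mirrors the abstract's observation that importance can be estimated from 1-3 frames, although the real method operates on a full network rather than a four-element list.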

AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization

Yiyang Du,Xiaochen Wang,Chi Chen,Jiabo Ye,Yiru Wang,Peng Li,Ming Yan,Ji Zhang,Fei Huang,Zhifang Sui,Maosong Sun,Yang Liu

Task: Propose AdaMMS, a new model merging method designed for heterogeneous multimodal large language models (MLLMs).

Motivation: Existing model merging methods mainly target homogeneous models and cannot effectively handle the architectural differences and parameter-space asymmetry of heterogeneous MLLMs.

Details

Method: Merge heterogeneous MLLMs in three steps (mapping, merging, and searching): design a mapping function, linearly interpolate the weights, and select hyper-parameters without supervision. Result: AdaMMS outperforms previous methods across various model combinations and vision-language benchmarks. Conclusion: AdaMMS is the first method able to merge heterogeneous MLLMs without labeled data, and it shows superior performance. Abstract: Recently, model merging methods have demonstrated powerful strengths in combining abilities on various tasks from multiple Large Language Models (LLMs). While previous model merging methods mainly focus on merging homogeneous models with identical architecture, they meet challenges when dealing with Multimodal Large Language Models (MLLMs), which are inherently heterogeneous, with differences in model architecture and asymmetry in the parameter space. In this work, we propose AdaMMS, a novel model merging method tailored for heterogeneous MLLMs. Our method tackles the challenges in three steps: mapping, merging and searching. Specifically, we first design a mapping function between models to apply model merging on MLLMs with different architectures. Then we apply linear interpolation on model weights to actively adapt to the asymmetry in the heterogeneous MLLMs. Finally, in the hyper-parameter searching step, we propose an unsupervised hyper-parameter selection method for model merging. AdaMMS is the first model merging method capable of merging heterogeneous MLLMs without labeled data, and extensive experiments on various model combinations demonstrate that it outperforms previous model merging methods on various vision-language benchmarks.
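
The mapping and merging steps can be illustrated with a small sketch. It assumes, purely for illustration, that the mapping is a dictionary from parameter names in one model to names in the other and that unmapped parameters are left unchanged; the parameter names, `merge_weights`, and the coefficient `alpha` are hypothetical, and the unsupervised search over the coefficient is not shown because the abstract does not describe it.

```python
def merge_weights(base, other, mapping, alpha):
    """Linearly interpolate mapped parameters; leave unmapped ones as in `base`.

    base, other: dicts from parameter name to a flat list of weights
    mapping:     dict from a name in `base` to the corresponding name in `other`
    alpha:       interpolation coefficient in [0, 1] (0 keeps `base` unchanged)
    """
    merged = {name: list(values) for name, values in base.items()}
    for base_name, other_name in mapping.items():
        merged[base_name] = [
            (1 - alpha) * w_b + alpha * w_o
            for w_b, w_o in zip(base[base_name], other[other_name])
        ]
    return merged

# Two toy "models" with different parameter names and an extra module in `base`.
base = {"attn.q": [1.0, 2.0], "vision.proj": [5.0]}
other = {"layers.q": [3.0, 4.0]}
mapping = {"attn.q": "layers.q"}

merged = merge_weights(base, other, mapping, alpha=0.25)
```

A search step would then try several values of `alpha` and pick one by some label-free criterion; which criterion AdaMMS uses is left to the paper.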

OwlSight: A Robust Illumination Adaptation Framework for Dark Video Human Action Recognition

Shihao Cheng,Jinlu Zhang,Yue Liu,Zhigang Tu

Task: Propose OwlSight, a bio-inspired framework for human action recognition in low-light environments.

Motivation: Existing methods fail to make full use of brightness information during training, which leads to suboptimal performance.

Details

Method: OwlSight combines a Time-Consistency Module (TCM), a Luminance Adaptation Module (LAM), and a Reflect Augmentation Module (RAM); the authors also build the large-scale Dark-101 dataset. Result: OwlSight achieves state-of-the-art performance on four low-light action recognition benchmarks, outperforming previous best approaches by 5.36% on ARID1.5 and 1.72% on Dark-101. Conclusion: OwlSight performs well in low-light environments, which validates its effectiveness. Abstract: Human action recognition in low-light environments is crucial for various real-world applications. However, the existing approaches overlook the full utilization of brightness information throughout the training phase, leading to suboptimal performance. To address this limitation, we propose OwlSight, a biomimetic-inspired framework with whole-stage illumination enhancement to interact with action classification for accurate dark video human action recognition. Specifically, OwlSight incorporates a Time-Consistency Module (TCM) to capture shallow spatiotemporal features while maintaining temporal coherence, which are then processed by a Luminance Adaptation Module (LAM) to dynamically adjust the brightness based on the input luminance distribution. Furthermore, a Reflect Augmentation Module (RAM) is presented to maximize illumination utilization and simultaneously enhance action recognition via two interactive paths. Additionally, we build Dark-101, a large-scale dataset comprising 18,310 dark videos across 101 action categories, significantly surpassing existing datasets (e.g., ARID1.5 and Dark-48) in scale and diversity. Extensive experiments demonstrate that the proposed OwlSight achieves state-of-the-art performance across four low-light action recognition benchmarks. Notably, it outperforms previous best approaches by 5.36% on ARID1.5 and 1.72% on Dark-101, highlighting its effectiveness in challenging dark environments.

LANID: LLM-assisted New Intent Discovery

Lu Fan,Jiashu Pu,Rongsheng Zhang,Xiao-Ming Wu

Task: Propose LANID, a framework that uses large language models (LLMs) to enhance the semantic representation of lightweight New Intent Discovery (NID) encoders.

Motivation: Task-oriented dialogue systems (TODS) that must handle new intents suffer from inadequate semantic representation or depend on external knowledge, and existing methods are limited in scalability and flexibility.

Details

Method: LANID uses the K-nearest neighbors and DBSCAN algorithms to sample selective utterance pairs from the training set, queries an LLM to determine the relationship within each pair, and trains a small encoder with a contrastive triplet loss. Result: Experiments on three NID datasets show that LANID surpasses strong baselines in both unsupervised and semi-supervised settings. Conclusion: With guidance from LLMs, LANID improves lightweight NID encoders and addresses the limitations of existing methods. Abstract: Task-oriented Dialogue Systems (TODS) often face the challenge of encountering new intents. New Intent Discovery (NID) is a crucial task that aims to identify these novel intents while maintaining the capability to recognize existing ones. Previous efforts to adapt TODS to new intents have struggled with inadequate semantic representation or have depended on external knowledge, which is often not scalable or flexible. Recently, Large Language Models (LLMs) have demonstrated strong zero-shot capabilities; however, their scale can be impractical for real-world applications that involve extensive queries. To address the limitations of existing NID methods by leveraging LLMs, we propose LANID, a framework that enhances the semantic representation of lightweight NID encoders with the guidance of LLMs. Specifically, LANID employs the $K$-nearest neighbors and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithms to sample selective utterance pairs from the training set. It then queries an LLM to ascertain the relationships between these pairs. The data produced from this process is utilized to design a contrastive fine-tuning task, which is then used to train a small encoder with a contrastive triplet loss. Our experimental results demonstrate the efficacy of the proposed method across three distinct NID datasets, surpassing strong baselines in both unsupervised and semi-supervised settings. Our code is available at https://github.com/floatSDSDS/LANID.
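
The contrastive triplet loss mentioned in the abstract can be written out in its standard margin form. This is a generic sketch, not the LANID code: the Euclidean distance, the margin of 1.0, and the two-dimensional toy embeddings are assumptions, and in the described pipeline the positive and negative utterances would come from the LLM's judgment of each sampled pair.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor = [0.0, 0.0]
same_intent = [0.0, 1.0]       # embedding of an utterance judged to share the intent
different_intent = [3.0, 4.0]  # embedding of an utterance judged to differ

loss_good = triplet_loss(anchor, same_intent, different_intent)  # already well separated
loss_bad = triplet_loss(anchor, different_intent, same_intent)   # roles swapped
```

The loss is zero once the positive is closer to the anchor than the negative by at least the margin, so only poorly separated triplets contribute to training.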

Improved Ear Verification with Vision Transformers and Overlapping Patches

Deeksha Arun,Kagan Ozturk,Kevin W. Bowyer,Patrick Flynn

Task: Evaluate different Vision Transformer (ViT) configurations on ear recognition, in particular the effect of an overlapping patch strategy.

Motivation: Because ear appearance is relatively stable during adulthood, ear recognition is a promising biometric modality, but the efficiency of ViTs in ear recognition has been hampered by a lack of attention to overlapping patches.

Details

Method: Run experiments with the ViT-Tiny, ViT-Small, ViT-Base, and ViT-Large configurations on several datasets, using an overlapping patch selection strategy. Result: The overlapping patch strategy performs better in 44 of 48 experiments, with gains of up to 10%; ViT-Tiny outperforms the larger models on the AWE, WPUT, and EarVN1.0 datasets. Conclusion: Transformer architectures with overlapping patch selection are an efficient and high-performing option for ear-based biometric recognition. Abstract: Ear recognition has emerged as a promising biometric modality due to the relative stability in appearance during adulthood. Although Vision Transformers (ViTs) have been widely used in image recognition tasks, their efficiency in ear recognition has been hampered by a lack of attention to overlapping patches, which is crucial for capturing intricate ear features. In this study, we evaluate ViT-Tiny (ViT-T), ViT-Small (ViT-S), ViT-Base (ViT-B) and ViT-Large (ViT-L) configurations on a diverse set of datasets (OPIB, AWE, WPUT, and EarVN1.0), using an overlapping patch selection strategy. Results demonstrate the critical importance of overlapping patches, yielding superior performance in 44 of 48 experiments in a structured study. Moreover, upon comparing the results of the overlapping patches with the non-overlapping configurations, the increase is significant, reaching up to 10% for the EarVN1.0 dataset. In terms of model performance, the ViT-T model consistently outperformed the ViT-S, ViT-B, and ViT-L models on the AWE, WPUT, and EarVN1.0 datasets. The highest scores were achieved in a configuration with a patch size of 28x28 and a stride of 14 pixels. This patch-stride configuration represents 25% of the normalized image area (112x112 pixels) for the patch size and 12.5% of the row or column size for the stride. This study confirms that transformer architectures with overlapping patch selection can serve as an efficient and high-performing option for ear-based biometric recognition tasks in verification scenarios.
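
The effect of the stride on the token count can be checked with a short calculation. The sketch assumes a square image and no padding, which the abstract does not specify; `patches_per_side` is a hypothetical helper name.

```python
def patches_per_side(image_size, patch_size, stride):
    """Number of patch positions along one side, assuming no padding."""
    return (image_size - patch_size) // stride + 1

IMAGE, PATCH = 112, 28  # sizes reported in the abstract

overlapping = patches_per_side(IMAGE, PATCH, stride=14) ** 2       # stride 14: patches overlap by half
non_overlapping = patches_per_side(IMAGE, PATCH, stride=28) ** 2   # stride equal to the patch size

patch_side_ratio = PATCH / IMAGE  # patch side relative to the image side
stride_ratio = 14 / IMAGE         # stride relative to the image side
```

Under these assumptions the overlapping configuration yields 49 patches instead of 16, so the transformer sees roughly three times as many tokens per image; a 28-pixel patch side is 25% of the 112-pixel image side and a 14-pixel stride is 12.5% of it.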

Texture or Semantics? Vision-Language Models Get Lost in Font Recognition

Zhecheng Li,Guoxian Song,Yujun Cai,Zhen Xiong,Junsong Yuan,Yiwei Wang

Task: Evaluate the ability of modern Vision-Language Models (VLMs) on fine-grained font recognition.

Motivation: Although VLMs perform well on many tasks, their effectiveness on fine-grained tasks such as font recognition remains unclear.

Details

Method: Introduce the Font Recognition Benchmark (FRB), with an easy and a hard version, and evaluate a range of VLMs on it. Result: Current VLMs show limited font recognition capability, and few-shot learning and Chain-of-Thought (CoT) prompting bring minimal improvement. Conclusion: VLMs have inherent limitations in capturing semantic features, and further research is needed. Abstract: Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic capabilities, achieving impressive performance in various tasks such as image recognition and object localization. However, their effectiveness in fine-grained tasks remains an open question. In everyday scenarios, individuals encountering design materials, such as magazines, typography tutorials, research papers, or branding content, may wish to identify aesthetically pleasing fonts used in the text. Given their multimodal capabilities and free accessibility, many VLMs are often considered potential tools for font recognition. This raises a fundamental question: Do VLMs truly possess the capability to recognize fonts? To investigate this, we introduce the Font Recognition Benchmark (FRB), a compact and well-structured dataset comprising 15 commonly used fonts. FRB includes two versions: (i) an easy version, where 10 sentences are rendered in different fonts, and (ii) a hard version, where each text sample consists of the names of the 15 fonts themselves, introducing a Stroop effect that challenges model perception. Through extensive evaluation of various VLMs on font recognition tasks, we arrive at the following key findings: (i) Current VLMs exhibit limited font recognition capabilities, with many state-of-the-art models failing to achieve satisfactory performance. (ii) Few-shot learning and Chain-of-Thought (CoT) prompting provide minimal benefits in improving font recognition accuracy across different VLMs. (iii) Attention analysis sheds light on the inherent limitations of VLMs in capturing semantic features.

AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos

Felix Wimbauer,Weirong Chen,Dominik Muhle,Christian Rupprecht,Daniel Cremers

Task: Propose AnyCam, a fast transformer model that directly estimates camera poses and intrinsics from dynamic video sequences.

Motivation: Traditional bundle-adjustment based methods such as SfM and SLAM perform unreliably on arbitrary data, while existing data-driven methods such as Dust3r are not robust to dynamic objects and require labeled data.

Details

Method: Use an uncertainty-based loss formulation together with pre-trained depth and flow networks instead of motion or trajectory supervision, and avoid drift with a lightweight trajectory refinement step. Result: On established datasets, AnyCam estimates camera poses and intrinsics accurately, is significantly faster than existing methods, and can produce high-quality 4D pointclouds. Conclusion: AnyCam is an efficient and robust camera estimation method for dynamic scenes that can be trained on unlabeled data. Abstract: Estimating camera motion and intrinsics from casual videos is a core challenge in computer vision. Traditional bundle-adjustment based methods, such as SfM and SLAM, struggle to perform reliably on arbitrary data. Although specialized SfM approaches have been developed for handling dynamic scenes, they either require intrinsics or computationally expensive test-time optimization and often fall short in performance. Recently, methods like Dust3r have reformulated the SfM problem in a more data-driven way. While such techniques show promising results, they still 1) are not robust towards dynamic objects and 2) require labeled data for supervised training. As an alternative, we propose AnyCam, a fast transformer model that directly estimates camera poses and intrinsics from a dynamic video sequence in feed-forward fashion. Our intuition is that such a network can learn strong priors over realistic camera poses. To scale up our training, we rely on an uncertainty-based loss formulation and pre-trained depth and flow networks instead of motion or trajectory supervision. This allows us to use diverse, unlabelled video datasets obtained mostly from YouTube. Additionally, we ensure that the predicted trajectory does not accumulate drift over time through a lightweight trajectory refinement step. We test AnyCam on established datasets, where it delivers accurate camera poses and intrinsics both qualitatively and quantitatively. Furthermore, even with trajectory refinement, AnyCam is significantly faster than existing works for SfM in dynamic settings. Finally, by combining camera information, uncertainty, and depth, our model can produce high-quality 4D pointclouds.

CONGRAD: Conflicting Gradient Filtering for Multilingual Preference Alignment

Jiangnan Li,Thuy-Trang Vu,Christian Herold,Amirhossein Tebbifakhr,Shahram Khadivi,Gholamreza Haffari

Task: Propose CONGRAD, a method that addresses negative interference in multilingual preference alignment.

Motivation: The impact of negative interference on performance during joint multilingual training has not been sufficiently studied in the preference alignment setting.

Details

Method: Use gradient surgery to filter for high-quality preference samples with minimal gradient conflicts, together with a sublinear gradient compression strategy. Result: CONGRAD outperforms strong baselines on multilingual tasks with minimal alignment tax. Conclusion: CONGRAD is an effective and scalable method for multilingual preference alignment. Abstract: Naive joint training of large language models (LLMs) for multilingual preference alignment can suffer from negative interference. This is a known issue in multilingual training, where conflicting objectives degrade overall performance. However, the impact of this phenomenon in the context of multilingual preference alignment remains largely underexplored. To address this issue, we propose CONGRAD, a scalable and effective filtering method that selects high-quality preference samples with minimal gradient conflicts across languages. Our method leverages gradient surgery to retain samples aligned with an aggregated multilingual update direction. Additionally, we incorporate a sublinear gradient compression strategy that reduces memory overhead during gradient accumulation. We integrate CONGRAD into a self-rewarding framework and evaluate on LLaMA3-8B and Gemma2-2B across 10 languages. Results show that CONGRAD consistently outperforms strong baselines in both seen and unseen languages, with minimal alignment tax.
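
A heavily simplified sketch of conflict-based sample filtering follows. It assumes the aggregated multilingual direction is the plain mean of per-language gradients and that a sample is kept when its gradient has a non-negative dot product with that direction. The actual gradient surgery and the sublinear compression step in CONGRAD are not specified in the abstract and are not reproduced here; all names and toy values are hypothetical.

```python
def dot(u, v):
    """Dot product of two gradient vectors."""
    return sum(a * b for a, b in zip(u, v))

def aggregate(language_grads):
    """Mean of per-language gradients (assumed aggregation rule)."""
    n = len(language_grads)
    dim = len(next(iter(language_grads.values())))
    return [sum(g[i] for g in language_grads.values()) / n for i in range(dim)]

def keep_aligned(sample_grads, direction):
    """Keep samples whose gradient does not conflict with the aggregated direction."""
    return [name for name, g in sample_grads.items() if dot(g, direction) >= 0.0]

language_grads = {"en": [1.0, 0.0], "de": [1.0, 1.0]}
sample_grads = {
    "s1": [1.0, 0.0],
    "s2": [0.5, 0.5],
    "s3": [-1.0, 0.2],  # points against the shared direction
}

direction = aggregate(language_grads)
kept = keep_aligned(sample_grads, direction)
```

In this toy setup the third sample is dropped because its gradient opposes the shared update direction, which is the intuition behind filtering for minimal gradient conflict.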

Language Guided Concept Bottleneck Models for Interpretable Continual Learning

Lu Yu,Haoyu Han,Zhe Tao,Hantao Yao,Changsheng Xu

Task: Propose a framework that integrates language-guided Concept Bottleneck Models (CBMs) to address both catastrophic forgetting and interpretability in continual learning.

Motivation: Continual learning must acquire new knowledge without forgetting old knowledge while keeping the decision process interpretable, but existing methods focus mainly on performance and neglect interpretability.

Details

Method: Use a Concept Bottleneck Layer aligned with CLIP models for semantic consistency, learning human-understandable concepts that generalize across tasks. Result: The method performs well on several datasets, improving final average accuracy on ImageNet-subset by up to 3.06%, and provides concept visualizations that aid interpretability. Conclusion: The method improves knowledge retention and offers a transparent decision process, advancing interpretable continual learning. Abstract: Continual learning (CL) aims to enable learning systems to acquire new knowledge constantly without forgetting previously learned information. CL faces the challenge of mitigating catastrophic forgetting while maintaining interpretability across tasks. Most existing CL methods focus primarily on preserving learned knowledge to improve model performance. However, as new information is introduced, the interpretability of the learning process becomes crucial for understanding the evolving decision-making process, yet it is rarely explored. In this paper, we introduce a novel framework that integrates language-guided Concept Bottleneck Models (CBMs) to address both challenges. Our approach leverages the Concept Bottleneck Layer, aligning semantic consistency with CLIP models to learn human-understandable concepts that can generalize across tasks. By focusing on interpretable concepts, our method not only enhances the model's ability to retain knowledge over time but also provides transparent decision-making insights. We demonstrate the effectiveness of our approach by achieving superior performance on several datasets, outperforming state-of-the-art methods with an improvement of up to 3.06% in final average accuracy on ImageNet-subset. Additionally, we offer concept visualizations for model predictions, further advancing the understanding of interpretable continual learning.

WinoWhat: A Parallel Corpus of Paraphrased WinoGrande Sentences with Common Sense Categorization

Ine Gevers,Victor De Marez,Luna De Bruyne,Walter Daelemans

Task: Evaluate the common sense reasoning of large language models (LLMs) on Winograd schema challenges.

Motivation: Study how Winograd schema challenges can be used to evaluate common sense reasoning in LLMs, and show that existing benchmarks such as WinoGrande may overestimate that ability.

Details

Method: Evaluate generative models of different sizes on the WinoGrande benchmark, introduce the new corpus WinoWhat (a paraphrased version of the WinoGrande validation set), and evaluate performance across five common sense knowledge categories. Result: All models perform significantly worse on WinoWhat, which suggests that LLM reasoning capabilities are overestimated on WinoGrande; benchmark memorization has a minimal effect on model performance. Conclusion: Winograd schema challenges can be used to evaluate common sense reasoning in LLMs, but the choice of benchmark affects the conclusions drawn. Abstract: In this study, we take a closer look at how Winograd schema challenges can be used to evaluate common sense reasoning in LLMs. Specifically, we evaluate generative models of different sizes on the popular WinoGrande benchmark. We release WinoWhat, a new corpus, in which each instance of the WinoGrande validation set is paraphrased. Additionally, we evaluate the performance on the challenge across five common sense knowledge categories, giving more fine-grained insights on what types of knowledge are more challenging for LLMs. Surprisingly, all models perform significantly worse on WinoWhat, implying that LLM reasoning capabilities are overestimated on WinoGrande. To verify whether this is an effect of benchmark memorization, we match benchmark instances to LLM training data and create two test suites. We observe that memorization has a minimal effect on model performance on WinoGrande.

ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning

Zhenyang Liu,Yikai Wang,Sixiao Zheng,Tongying Pan,Longfei Liang,Yanwei Fu,Xiangyang Xue

Task: Propose ReasonGrounder, a framework for open-vocabulary 3D visual grounding and reasoning that works even when objects are occluded.

Motivation: Current methods rely on 3D annotations and mask proposals, which limits their ability to handle diverse semantics and common sense reasoning, so a more flexible approach is needed.

Details

Method: Use hierarchical 3D feature Gaussian fields for adaptive grouping, combined with an LVLM, 3D Gaussian splatting, 2D segmentation masks from SAM, and multi-view CLIP embeddings. Result: ReasonGrounder significantly improves 3D grounding accuracy in real-world scenarios and contributes the new ReasoningGD dataset. Conclusion: By combining these techniques, ReasonGrounder achieves open-vocabulary 3D grounding and reasoning and performs particularly well under occlusion. Abstract: Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions, even when they are occluded. This ability is crucial for tasks such as vision-language navigation and autonomous robotics. However, current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals, which limits their ability to handle diverse semantics and common knowledge required for effective reasoning. In this work, we propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping based on physical scale, enabling open-vocabulary 3D grounding and reasoning. ReasonGrounder interprets implicit instructions using large vision-language models (LVLM) and localizes occluded objects through 3D Gaussian splatting. By incorporating 2D segmentation masks from the SAM and multi-view CLIP embeddings, ReasonGrounder selects Gaussian groups based on object scale, enabling accurate localization through both explicit and implicit language understanding, even in novel, occluded views. We also contribute ReasoningGD, a new dataset containing over 10K scenes and 2 million annotations for evaluating open-vocabulary 3D grounding and amodal perception under occlusion. Experiments show that ReasonGrounder significantly improves 3D grounding accuracy in real-world scenarios.

Adaptive Layer-skipping in Pre-trained LLMs

Xuan Luo,Weizhi Wang,Xifeng Yan

Task: Propose FlexiDepth, a method that dynamically adjusts the number of Transformer layers used to accelerate token generation in large language models (LLMs).

Motivation: Existing layer-skipping methods overlook how computational demands differ across the generation of different tokens; FlexiDepth aims to adapt to those demands dynamically.

Details

Method: Implement adaptive layer-skipping with a plug-in router and adapter, without modifying the original model parameters. Result: On Llama-3-8B, FlexiDepth skips 8 of 32 layers while maintaining 100% of benchmark performance; experiments show that computational demands vary by token type. Conclusion: FlexiDepth's allocation pattern aligns with human intuition, and the method and a dataset of its layer allocation patterns are open-sourced to support future research. Abstract: Various layer-skipping methods have been proposed to accelerate token generation in large language models (LLMs). However, they have overlooked a fundamental question: How do computational demands vary across the generation of different tokens? In this work, we introduce FlexiDepth, a method that dynamically adjusts the number of Transformer layers used in text generation. By incorporating a plug-in router and adapter, FlexiDepth enables adaptive layer-skipping in LLMs without modifying their original parameters. Applying FlexiDepth to the Llama-3-8B model achieves layer skipping of 8 layers out of 32 while maintaining the full 100% benchmark performance. Experimental results with FlexiDepth demonstrate that computational demands in LLMs significantly vary based on token type. Specifically, generating repetitive tokens or fixed phrases requires fewer layers, whereas producing tokens involving computation or high uncertainty requires more layers. Interestingly, this adaptive allocation pattern aligns with human intuition. To advance research in this area, we open-sourced FlexiDepth and a dataset documenting FlexiDepth's layer allocation patterns for future exploration.

Learning Predictive Visuomotor Coordination

Wenqi Jia,Bolin Lai,Miao Liu,Danfei Xu,James M. Rehg

Task: Predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations.

Motivation: Understanding and predicting human visuomotor coordination is crucial for robotics, human-computer interaction, and assistive technologies.

Details

Method: Propose a Visuomotor Coordination Representation (VCR) that learns structured temporal dependencies across multimodal signals, and extend a diffusion-based motion modeling framework. Result: The approach generalizes well on the EgoExo4D dataset, and the results support the importance of multimodal integration. Conclusion: The work contributes to research in visuomotor learning and human behavior modeling. Abstract: Understanding and predicting human visuomotor coordination is crucial for applications in robotics, human-computer interaction, and assistive technologies. This work introduces a forecasting-based task for visuomotor modeling, where the goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. We propose a Visuomotor Coordination Representation (VCR) that learns structured temporal dependencies across these multimodal signals. We extend a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions. Our approach is evaluated on the large-scale EgoExo4D dataset, demonstrating strong generalization across diverse real-world activities. Our results highlight the importance of multimodal integration in understanding visuomotor coordination, contributing to research in visuomotor learning and human behavior modeling.

Did ChatGPT or Copilot use alter the style of internet news headlines? A time series regression analysis

Chris Brogly,Connor McElroy

Task: Study whether the release of large language models (LLMs) such as ChatGPT and Copilot changed the writing style of headlines and links on news websites worldwide.

Motivation: Examine whether the release of advanced LLMs has changed how web text is created and, in turn, the content found on the web.

Details

Method: Extract 175 NLP features for each text in a dataset of 451 million headlines/links, and apply an interrupted time series analysis to each feature to test for statistically significant sustained changes after the release dates of ChatGPT and Copilot. Result: 44 features showed no significant sustained change after the ChatGPT/Copilot releases; 91 other features did show significant change, but were removed from consideration because earlier control LLM release dates were also significant. Conclusion: This initial analysis suggests that these language models have had a limited impact on the style of news headlines/links, visible only in some NLP measures. Abstract: The release of advanced Large Language Models (LLMs) such as ChatGPT and Copilot is changing the way text is created and may influence the content that we find on the web. This study investigated whether the release of these two popular LLMs coincided with a change in writing style in headlines and links on worldwide news websites. 175 NLP features were obtained for each text in a dataset of 451 million headlines/links. An interrupted time series analysis was applied for each of the 175 NLP features to evaluate whether there were any statistically significant sustained changes after the release dates of ChatGPT and/or Copilot. There were a total of 44 features that did not appear to have any significant sustained change after the release of ChatGPT/Copilot. A total of 91 other features did show significant change with ChatGPT and/or Copilot, although significance with earlier control LLM release dates (GPT-1/2/3, Gopher) removed them from consideration. This initial analysis suggests these language models may have had a limited impact on the style of individual news headlines/links, with respect to only some NLP measures.
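
The logic of an interrupted time series analysis can be sketched in a reduced form: fit a trend to the pre-release period, extrapolate it as a counterfactual, and measure how far the post-release observations sit from it. The study's actual regression specification and significance tests are not given in the abstract, so this is only an illustration on a made-up series; `fit_line`, `level_change`, and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def level_change(series, t0):
    """Mean gap between post-interruption values and the extrapolated pre-trend."""
    slope, intercept = fit_line(list(range(t0)), series[:t0])
    gaps = [series[t] - (intercept + slope * t) for t in range(t0, len(series))]
    return sum(gaps) / len(gaps)

# Toy feature series: steady trend, then a jump at time step 4 (the "release date").
series = [1.0, 2.0, 3.0, 4.0, 8.0, 9.0, 10.0, 11.0]
shift = level_change(series, t0=4)
```

A full analysis would also estimate a change in slope and attach a significance test, and, as in the study, repeat the procedure at control dates to rule out changes that predate the release of interest.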

MoCha: Towards Movie-Grade Talking Character Synthesis

Cong Wei,Bo Sun,Haoyu Ma,Ji Hou,Felix Juefei-Xu,Zecheng He,Xiaoliang Dai,Luxin Zhang,Kunpeng Li,Tingbo Hou,Animesh Sinha,Peter Vajda,Wenhu Chen

Task: Generate multi-character talking animations driven directly by speech and text.

Motivation: Existing video generation techniques have advanced in motion realism but overlook character-driven storytelling, which is crucial for automated film and animation generation.

Details

Method: Propose the MoCha framework, which uses a speech-video window attention mechanism to align speech and video tokens and a joint training strategy that leverages both speech-labeled and text-labeled video data. Structured prompt templates with character tags support multi-character conversation. Result: MoCha outperforms existing methods in realism, expressiveness, controllability, and generalization, as shown by human preference studies and benchmark comparisons. Conclusion: MoCha sets a new standard for AI-generated cinematic storytelling, particularly for multi-character dialogue and character action generation. Abstract: Recent advancements in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film and animation generation. We introduce Talking Characters, a more realistic task to generate talking character animations directly from speech and text. Unlike talking-head generation, Talking Characters aims at generating the full portrait of one or more characters beyond the facial region. In this paper, we propose MoCha, the first of its kind to generate talking characters. To ensure precise synchronization between video and speech, we propose a speech-video window attention mechanism that effectively aligns speech and video tokens. To address the scarcity of large-scale speech-labeled video datasets, we introduce a joint training strategy that leverages both speech-labeled and text-labeled video data, significantly improving generalization across diverse character actions. We also design structured prompt templates with character tags, enabling, for the first time, multi-character conversation with turn-based dialogue, allowing AI-generated characters to engage in context-aware conversations with cinematic coherence. Extensive qualitative and quantitative evaluations, including human preference studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, expressiveness, controllability and generalization.

Expanding RL with Verifiable Rewards Across Diverse Domains

Yi Su,Dian Yu,Linfeng Song,Juntao Li,Haitao Mi,Zhaopeng Tu,Min Zhang,Dong Yu

Task: Study how to extend reinforcement learning with verifiable rewards (RLVR) to broader domains such as medicine, chemistry, psychology, and economics.

Motivation: RLVR performs well on mathematical reasoning and coding tasks, but its applicability to other domains remains underexplored.

Details

Method: Incorporates model-based soft scoring to make RLVR more flexible and uses a distilled generative reward model as a cross-domain verifier. Result: A base 7B model fine-tuned against the reward model outperforms state-of-the-art open-source aligned LLMs by a large margin in free-form answer settings. Conclusion: RLVR shows strong robustness and scalability for real-world applications with noisy or weak labels. Abstract: Reinforcement learning (RL) with verifiable rewards (RLVR) has shown promising results in mathematical reasoning and coding tasks where well-structured reference answers are available. However, its applicability to broader domains remains underexplored. In this work, we study the extension of RLVR to more diverse domains such as medicine, chemistry, psychology, and economics. We observe high agreement in binary judgments across different large language models (LLMs) when objective reference answers exist, which challenges the necessity of large-scale annotation for training domain-specific reward models. To address the limitations of binary rewards when handling unstructured reference answers, we further incorporate model-based soft scoring into RLVR to improve its flexibility. Our experiments show that a distilled generative reward model can serve as an effective cross-domain verifier, providing reliable reward signals for RL without requiring domain-specific annotations. By fine-tuning a base 7B model using various RL algorithms against our reward model, we obtain policies that outperform state-of-the-art open-source aligned LLMs such as Qwen2.5-72B-Instruct and DeepSeek-R1-Distill-Qwen-32B by a large margin, across domains in free-form answer settings. This also strengthens RLVR's robustness and scalability, highlighting its potential for real-world applications with noisy or weak labels.

SpINR: Neural Volumetric Reconstruction for FMCW Radars

Harshvardhan Takawale,Nirupam Roy

Task: Propose SpINR, a novel framework for volumetric reconstruction from FMCW radar data.

Motivation: Traditional radar imaging techniques such as backprojection often assume ideal signal models and require dense aperture sampling, which limits resolution and generalization.

Details

Method: SpINR combines a fully differentiable forward model operating in the frequency domain with implicit neural representations (INRs), exploiting the linear relationship between beat frequency and scatterer distance in FMCW radar systems. Result: Experiments show that SpINR significantly outperforms classical backprojection and existing learning-based approaches, achieving higher resolution and more accurate reconstructions of complex scenes. Conclusion: SpINR is the first application of neural volumetric reconstruction in the radar domain and offers a new direction for research on radar-based imaging and perception systems. Abstract: In this paper, we introduce SpINR, a novel framework for volumetric reconstruction using Frequency-Modulated Continuous-Wave (FMCW) radar data. Traditional radar imaging techniques, such as backprojection, often assume ideal signal models and require dense aperture sampling, leading to limitations in resolution and generalization. To address these challenges, SpINR integrates a fully differentiable forward model that operates natively in the frequency domain with implicit neural representations (INRs). This integration leverages the linear relationship between beat frequency and scatterer distance inherent in FMCW radar systems, facilitating more efficient and accurate learning of scene geometry. Additionally, by computing outputs for only the relevant frequency bins, our forward model achieves greater computational efficiency compared to time-domain approaches that process the entire signal before transformation. Through extensive experiments, we demonstrate that SpINR significantly outperforms classical backprojection methods and existing learning-based approaches, achieving higher resolution and more accurate reconstructions of complex scenes. This work represents the first application of neural volumetric reconstruction in the radar domain, offering a promising direction for future research in radar-based imaging and perception systems.
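
The linear relationship between beat frequency and scatterer distance mentioned above is the standard FMCW range equation for a linear chirp, f_b = 2 · (B / T) · R / c. The sketch below encodes that textbook relation only; it is not SpINR's forward model, and the function and parameter names are illustrative.

```python
C = 299_792_458.0  # speed of light in m/s

def beat_frequency(distance_m, bandwidth_hz, chirp_duration_s):
    """Beat frequency (Hz) of a stationary point scatterer for a linear chirp."""
    slope = bandwidth_hz / chirp_duration_s  # chirp slope in Hz per second
    return 2.0 * slope * distance_m / C

def distance_from_beat(beat_hz, bandwidth_hz, chirp_duration_s):
    """Inverse of beat_frequency: scatterer range in metres."""
    return beat_hz * C * chirp_duration_s / (2.0 * bandwidth_hz)
```

Because range maps linearly to frequency, a scene confined to a known range interval occupies only a known band of frequency bins, which is consistent with the abstract's point about computing outputs for only the relevant bins.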

SpeechDialogueFactory: Generating High-Quality Speech Dialogue Data to Accelerate Your Speech-LLM Development

Minghan Wang,Ye Bai,Yuxia Wang,Thuy-Trang Vu,Ehsan Shareghi,Gholamreza Haffari

Task: Develop a framework that efficiently generates natural speech dialogues, addressing the limitations of existing methods for acquiring speech dialogue datasets.

Motivation: Human recordings are costly and raise privacy concerns, while synthetic approaches often lack conversational authenticity.

Details

Method: Proposes the SpeechDialogueFactory framework, a pipeline comprising metadata generation, dialogue scripting, paralinguistic-enriched utterance simulation, and natural speech synthesis with voice cloning. Result: The generated dialogues reach a quality comparable to human recordings while significantly reducing production costs. Conclusion: The framework is released as an open-source toolkit with example datasets in English and Chinese, supporting Speech-LLM research and development. Abstract: High-quality speech dialogue datasets are crucial for Speech-LLM development, yet existing acquisition methods face significant limitations. Human recordings incur high costs and privacy concerns, while synthetic approaches often lack conversational authenticity. To address these challenges, we introduce SpeechDialogueFactory, a production-ready framework for generating natural speech dialogues efficiently. Our solution employs a comprehensive pipeline including metadata generation, dialogue scripting, paralinguistic-enriched utterance simulation, and natural speech synthesis with voice cloning. Additionally, the system provides an interactive UI for detailed sample inspection and a high-throughput batch synthesis mode. Evaluations show that dialogues generated by our system achieve a quality comparable to human recordings while significantly reducing production costs. We release our work as an open-source toolkit, alongside example datasets available in English and Chinese, empowering researchers and developers in Speech-LLM research and development.

EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing

Hongxiang Jiang,Jihao Yin,Qixiong Wang,Jiaqi Feng,Guo Chen

Task: Develop EagleVision, a multimodal large language model for remote sensing that addresses the precise localization and fine-grained attribute description problems caused by high resolution and the small proportion of objects.

Motivation: Existing multimodal LLMs perform poorly in remote sensing: they have not surpassed classical visual perception models and provide only coarse image understanding, which limits their practical value.

Details

Method: Proposes the EagleVision model, equipped with an Attribute Disentangle module that learns disentangled vision tokens to express distinct attributes, and constructs the EVAttrs-95K dataset and the EVBench evaluation benchmark. Result: EagleVision achieves state-of-the-art performance on fine-grained object detection and object attribute understanding, showing that detection and understanding capabilities reinforce each other. Conclusion: EagleVision is an effective multimodal LLM for remote sensing that significantly improves object detection and attribute understanding. Abstract: Recent advances in multimodal large language models (MLLMs) have demonstrated impressive results in various visual tasks. However, in remote sensing (RS), high resolution and the small proportion of objects pose challenges to existing MLLMs, which struggle with object-centric tasks, particularly in precise localization and fine-grained attribute description for each object. These RS MLLMs have not yet surpassed classical visual perception models, as they only provide coarse image understanding, leading to limited gains in real-world scenarios. To address this gap, we establish EagleVision, an MLLM tailored for remote sensing that excels in object detection and attribute comprehension. Equipped with the Attribute Disentangle module, EagleVision learns disentangled vision tokens to express distinct attributes. To support object-level visual-language alignment, we construct EVAttrs-95K, the first large-scale object attribute understanding dataset in RS for instruction tuning, along with a novel evaluation benchmark, EVBench. EagleVision achieves state-of-the-art performance on both fine-grained object detection and object attribute understanding tasks, highlighting the mutual promotion between detection and understanding capabilities in MLLMs. The code, model, data, and demo will be available at https://github.com/XiangTodayEatsWhat/EagleVision.

Better wit than wealth: Dynamic Parametric Retrieval Augmented Generation for Test-time Knowledge Enhancement

Yuqiao Tan,Shizhu He,Huanxuan Liao,Jun Zhao,Kang Liu

Task: Propose the Dynamic Parametric RAG (DyPRAG) framework, which dynamically generates parametric knowledge to reduce inference, training, and storage costs and to mitigate RAG hallucination.

Motivation: Conventional RAG increases inference cost and suffers from hallucination. Parametric RAG (PRAG) reduces inference cost, but its training and storage costs are high and its generalization ability is limited.

Details

Method: DyPRAG uses a lightweight parameter translator model to convert documents into parametric knowledge on the fly, enabling plug-and-play knowledge enhancement. Result: Experiments on multiple datasets show that DyPRAG reduces costs, improves knowledge fusion, and mitigates hallucination. Conclusion: DyPRAG offers an efficient and practical RAG paradigm for knowledge enhancement and hallucination mitigation in real-world applications. Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by retrieving relevant documents from external sources and incorporating them into the context. While it improves reliability by providing factual texts, it significantly increases inference costs as context length grows and introduces the challenging issue of RAG hallucination, primarily caused by the lack of corresponding parametric knowledge in LLMs. An efficient solution is to enhance the knowledge of LLMs at test-time. Parametric RAG (PRAG) addresses this by embedding documents into LLM parameters to perform test-time knowledge enhancement, effectively reducing inference costs through offline training. However, its high training and storage costs, along with limited generalization ability, significantly restrict its practical adoption. To address these challenges, we propose Dynamic Parametric RAG (DyPRAG), a novel framework that leverages a lightweight parameter translator model to efficiently convert documents into parametric knowledge. DyPRAG not only reduces inference, training, and storage costs but also dynamically generates parametric knowledge, seamlessly enhancing the knowledge of LLMs and resolving knowledge conflicts in a plug-and-play manner at test-time. Extensive experiments on multiple datasets demonstrate the effectiveness and generalization capabilities of DyPRAG, offering a powerful and practical RAG paradigm which enables superior knowledge fusion and mitigates RAG hallucination in real-world applications. Our code is available at https://github.com/Trae1ounG/DyPRAG.

HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation

Hongwei Zheng,Han Li,Wenrui Dai,Ziyang Zheng,Chenglin Li,Junni Zou,Hongkai Xiong

Task: Propose HiPART, a two-stage generative densification method that generates hierarchical dense 2D poses from a sparse 2D pose, to address occlusion in 2D-to-3D human pose estimation.

Motivation: Existing methods handle occlusion by enriching temporal or visual information in the lifting stage, but they ignore the fundamental limitation of the sparse skeleton 2D input representation, which restricts 2D-to-3D lifting and worsens the occlusion problem.

Details

Method: Proposes HiPART, comprising a multi-scale skeleton tokenization module, a Skeleton-aware Alignment mechanism, and a Hierarchical AutoRegressive Modeling scheme that generates hierarchical dense 2D poses. Result: The method is strongly robust in occluded scenarios, achieves state-of-the-art performance on single-frame 3D human pose estimation, and outperforms numerous multi-frame methods while reducing parameter count and computational complexity. Conclusion: HiPART not only improves performance but can also complement other methods to further enhance performance and robustness. Abstract: Existing 2D-to-3D human pose estimation (HPE) methods tackle the occlusion issue by enriching information like temporal and visual cues in the lifting stage. In this paper, we argue that these methods ignore the limitation of the sparse skeleton 2D input representation, which fundamentally restricts the 2D-to-3D lifting and worsens the occlusion issue. To address this, we propose a novel two-stage generative densification method, named Hierarchical Pose AutoRegressive Transformer (HiPART), to generate hierarchical 2D dense poses from the original sparse 2D pose. Specifically, we first develop a multi-scale skeleton tokenization module to quantize the highly dense 2D pose into hierarchical tokens and propose a Skeleton-aware Alignment to strengthen token connections. We then develop a Hierarchical AutoRegressive Modeling scheme for hierarchical 2D pose generation. With generated hierarchical poses as inputs for 2D-to-3D lifting, the proposed method shows strong robustness in occluded scenarios and achieves state-of-the-art performance on the single-frame-based 3D HPE. Moreover, it outperforms numerous multi-frame methods while reducing parameter and computational complexity and can also complement them to further enhance performance and robustness.

Rubrik's Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset

Diana Galvan-Sosa,Gabrielle Gaudeau,Pride Kavumba,Yunmeng Li,Hongyi Gu,Zheng Yuan,Keisuke Sakaguchi,Paula Buttery

Task: Study how Rubrik's CUBE can be used to evaluate and improve the reliability of explanations generated by large language models (LLMs).

Motivation: LLMs are widely used for explanation generation, yet their explanations are unreliable and users struggle to tell good explanations from bad ones.

Details

Method: Presents Rubrik's CUBE, an education-inspired rubric, together with a dataset of 26k explanations that were written and then quality-annotated by humans and by six open- and closed-source LLMs. Result: Explanation quality is influenced by both task and perceived difficulty; low quality stems primarily from a lack of conciseness in LLM-generated explanations rather than from cohesion or word choice. Conclusion: Rubrik's CUBE provides a useful tool for evaluating and improving LLM-generated explanations; the dataset and code will be made public. Abstract: The performance and usability of Large Language Models (LLMs) are driving their use in explanation generation tasks. However, despite their widespread adoption, LLM explanations have been found to be unreliable, making it difficult for users to distinguish good from bad explanations. To address this issue, we present Rubrik's CUBE, an education-inspired rubric and a dataset of 26k explanations, written and later quality-annotated using the rubric by both humans and six open- and closed-source LLMs. The CUBE dataset focuses on two reasoning and two language tasks, providing the necessary diversity for us to effectively test our proposed rubric. Using Rubrik, we find that explanations are influenced by both task and perceived difficulty. Low quality stems primarily from a lack of conciseness in LLM-generated explanations, rather than cohesion and word choice. The full dataset, rubric, and code will be made available upon acceptance.

TraceMark-LDM: Authenticatable Watermarking for Latent Diffusion Models via Binary-Guided Rearrangement

Wenhao Luo,Zhangyi Shen,Ye Yao,Feng Ding,Guopu Zhu,Weizhi Meng

Task: Propose TraceMark-LDM, a new algorithm that embeds watermarks in an image generation model to attribute generated images while guaranteeing non-destructive performance.

Motivation: Most current attribution methods for latent diffusion models (LDMs) embed watermarks directly, which compromises the quality and robustness of the generated content; a more effective, non-destructive solution is needed.

Details

Method: TraceMark-LDM uses the watermark as guidance to rearrange random variables sampled from a Gaussian distribution, groups and rearranges the elements with small absolute values to mitigate deviations caused by inversion errors, and fine-tunes the LDM encoder to make the watermark more robust. Result: Images generated with TraceMark-LDM surpass state-of-the-art techniques in quality and attribution accuracy and are highly robust against various common attacks. Conclusion: TraceMark-LDM is an effective, non-destructive image attribution method that improves both the quality of generated images and watermark robustness. Abstract: Image generation algorithms are increasingly integral to diverse aspects of human society, driven by their practical applications. However, insufficient oversight in artificial intelligence generated content (AIGC) can facilitate the spread of malicious content and increase the risk of copyright infringement. Among the diverse range of image generation models, the Latent Diffusion Model (LDM) is currently the most widely used, dominating the majority of the Text-to-Image model market. Currently, most attribution methods for LDMs rely on directly embedding watermarks into the generated images or their intermediate noise, a practice that compromises both the quality and the robustness of the generated content. To address these limitations, we introduce TraceMark-LDM, a novel algorithm that integrates watermarking to attribute generated images while guaranteeing non-destructive performance. Unlike current methods, TraceMark-LDM leverages watermarks as guidance to rearrange random variables sampled from a Gaussian distribution. To mitigate potential deviations caused by inversion errors, the small absolute elements are grouped and rearranged. Additionally, we fine-tune the LDM encoder to enhance the robustness of the watermark. Experimental results show that images synthesized using TraceMark-LDM exhibit superior quality and attribution accuracy compared to state-of-the-art (SOTA) techniques. Notably, TraceMark-LDM demonstrates exceptional robustness against various common attack methods, consistently outperforming SOTA methods.

Entropy-Based Adaptive Weighting for Self-Training

Xiaoxuan Wang,Yihe Deng,Mingyu Derek Ma,Wei Wang

Task: Study how self-generated reasoning paths can improve the mathematical problem-solving ability of large language models.

Motivation: Self-training is effective for reasoning tasks, but how best to use the self-generated data remains an open challenge.

Details

Method: Proposes Entropy-Based Adaptive Weighting for Self-Training (EAST), which prioritizes data on which the model is more uncertain. Result: On MATH, EAST gains around 1% over the backbone model while the vanilla method yields virtually no improvement; on GSM8K, EAST gains a further 1-2% over the vanilla method. Conclusion: The adaptive weighting strategy in EAST improves the model's reasoning ability. Abstract: The mathematical problem-solving capabilities of large language models have become a focal point of research, with growing interest in leveraging self-generated reasoning paths as a promising way to refine and enhance these models. These paths capture step-by-step logical processes while requiring only the correct answer for supervision. The self-training method has been shown to be effective in reasoning tasks while eliminating the need for external models and manual annotations. However, optimizing the use of self-generated data for model training remains an open challenge. In this work, we propose Entropy-Based Adaptive Weighting for Self-Training (EAST), an adaptive weighting strategy designed to prioritize uncertain data during self-training. Specifically, EAST employs a mapping function with a tunable parameter that controls the sharpness of the weighting, assigning higher weights to data where the model exhibits greater uncertainty. This approach guides the model to focus on more informative and challenging examples, thereby enhancing its reasoning ability. We evaluate our approach on GSM8K and MATH benchmarks. Empirical results show that, while the vanilla method yields virtually no improvement (0%) on MATH, EAST achieves around a 1% gain over the backbone model. On GSM8K, EAST attains a further 1-2% performance boost compared to the vanilla method.
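
The abstract describes the weighting only at a high level, so the following is a minimal sketch of the general idea rather than EAST's actual formulation. It assumes that uncertainty is measured as the entropy of the final answers across several sampled responses to one question, and it uses a hypothetical power-law mapping whose exponent `sharpness` plays the role of the tunable parameter.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (nats) of the empirical distribution of sampled answers."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def entropy_weights(entropies, sharpness=1.0, eps=1e-8):
    """Map per-question entropies to training weights with mean 1.

    Hypothetical mapping: weight is proportional to (entropy + eps) ** sharpness.
    sharpness = 0 gives uniform weights; larger values concentrate weight
    on the most uncertain questions.
    """
    raw = [(h + eps) ** sharpness for h in entropies]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]
```

Normalizing to a mean of 1 keeps the overall loss scale comparable to unweighted self-training, so only the relative emphasis between questions changes.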

Enhancing 3D Gaussian Splatting Compression via Spatial Condition-based Prediction

Jingui Ma,Yang Hu,Luyang Tang,Jiayu Yang,Yongqi Zhai,Ronggang Wang

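Task: Propose a prediction-based compression framework that reduces the high storage and transmission cost of 3D Gaussian Splatting (3DGS).

Motivation: 3DGS delivers excellent real-time rendering, but its storage and transmission cost (hundreds of megabytes or even gigabytes for a single scene) limits its application. Inspired by prediction techniques in video compression, the authors introduce prediction into the Gaussian representation to lower the bit rate.

Details

Method: Proposes a spatial condition-based prediction module that uses grid-captured scene information for prediction, with a residual compensation strategy that learns the missing fine-grained information; further proposes an instance-aware hyper prior, yielding a structure-aware and instance-aware entropy model. Result: Experiments confirm the effectiveness of the framework and of each component, with a bit rate saving of 24.42% over the SOTA compression method. Conclusion: The proposed prediction-based compression framework substantially reduces the storage and transmission cost of 3DGS and has practical potential. Abstract: Recently, 3D Gaussian Splatting (3DGS) has gained widespread attention in Novel View Synthesis (NVS) due to its remarkable real-time rendering performance. However, the substantial cost of storage and transmission of vanilla 3DGS hinders its further application (hundreds of megabytes or even gigabytes for a single scene). Motivated by the achievements of prediction in video compression, we introduce the prediction technique into the anchor-based Gaussian representation to effectively reduce the bit rate. Specifically, we propose a spatial condition-based prediction module to utilize the grid-captured scene information for prediction, with a residual compensation strategy designed to learn the missing fine-grained information. Besides, to further compress the residual, we propose an instance-aware hyper prior, developing a structure-aware and instance-aware entropy model. Extensive experiments demonstrate the effectiveness of our prediction-based compression framework and each technical component. Even compared with the SOTA compression method, our framework still achieves a bit rate savings of 24.42 percent. Code is to be released!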

Model Hemorrhage and the Robustness Limits of Large Language Models

Ziyang Ma,Zuchao Li,Lefei Zhang,Gui-Song Xia,Bo Du,Liangpei Zhang,Dacheng Tao

Task: Study the performance degradation of large language models (LLMs) after parameter alterations and architectural changes, a phenomenon termed model hemorrhage.

Motivation: LLMs perform strongly on natural language processing tasks, but their performance drops significantly when they are modified for deployment through quantization, pruning, or decoding strategy adjustments; this calls for systematic analysis and mitigation.

Details

Method: Systematically analyzes various LLM frameworks to identify key vulnerability patterns and proposes three mitigation strategies: gradient-aware pruning, dynamic quantization scaling, and decoding calibration. Result: Reveals inherent robustness thresholds in transformer architectures and establishes foundational metrics and practical guidelines for evaluating model stability. Conclusion: The study advances understanding of the resilience of large-scale language models under architectural transformations and offers practical guidance for efficient deployment. Abstract: Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment through quantization, pruning, or decoding strategy adjustments. We define this phenomenon as model hemorrhage: performance decline caused by parameter alterations and architectural changes. Through systematic analysis of various LLM frameworks, we identify key vulnerability patterns: layer expansion frequently disrupts attention mechanisms, compression techniques induce information loss cascades, and decoding adjustments amplify prediction divergences. Our investigation reveals transformer architectures exhibit inherent robustness thresholds that determine hemorrhage severity across modification types. We propose three mitigation strategies: gradient-aware pruning preserves critical weight pathways, dynamic quantization scaling maintains activation integrity, and decoding calibration aligns generation trajectories with original model distributions. This work establishes foundational metrics for evaluating model stability during adaptation, providing practical guidelines for maintaining performance while enabling efficient LLM deployment. Our findings advance understanding of neural network resilience under architectural transformations, particularly for large-scale language models.

From Panels to Prose: Generating Literary Narratives from Comics

Ragav Sachdeva,Andrew Zisserman

Task: Develop an automated system that converts comics into text-based literary narratives to help visually impaired readers follow comic content.

Motivation: The visual nature of comics is a barrier for visually impaired readers and limits their access to this popular form of storytelling.

Details

Method: Presents a unified model, Magiv3, for comic understanding tasks (such as localising panels, characters, texts, and speech-bubble tails) and combines it with large vision-language models to generate literary narratives. Result: Releases human-annotated captions for over 3300 Japanese comic panels and shows that combining Magiv3 with large vision-language models produces seamless narratives. Conclusion: Combining Magiv3 with vision-language models lets visually impaired audiences engage with the depth and richness of comic storytelling. Abstract: Comics have long been a popular form of storytelling, offering visually engaging narratives that captivate audiences worldwide. However, the visual nature of comics presents a significant barrier for visually impaired readers, limiting their access to these engaging stories. In this work, we provide a pragmatic solution to this accessibility challenge by developing an automated system that generates text-based literary narratives from manga comics. Our approach aims to create an evocative and immersive prose that not only conveys the original narrative but also captures the depth and complexity of characters, their interactions, and the vivid settings in which they reside. To this end we make the following contributions: (1) We present a unified model, Magiv3, that excels at various functional tasks pertaining to comic understanding, such as localising panels, characters, texts, and speech-bubble tails, performing OCR, grounding characters etc. (2) We release human-annotated captions for over 3300 Japanese comic panels, along with character grounding annotations, and benchmark large vision-language models in their ability to understand comic images. (3) Finally, we demonstrate how integrating large vision-language models with Magiv3 can generate seamless literary narratives that allow visually impaired audiences to engage with the depth and richness of comic storytelling.

BeMERC: Behavior-Aware MLLM-based Framework for Multimodal Emotion Recognition in Conversation

Yumeng Fu,Junjie Wu,Zhongjie Wang,Meishan Zhang,Yulin Wu,Bingquan Liu

Task: Identify the emotion label of each utterance in a conversation.

Motivation: Current MLLM-based multimodal emotion recognition studies focus mainly on textual or vocal characteristics and ignore the significance of video-derived behavior information.

Details

Method: Proposes a behavior-aware MLLM-based framework (BeMERC) that incorporates the speaker's facial micro-expressions, body language, and posture, and adopts a two-stage instruction tuning strategy. Result: BeMERC outperforms existing methods on two benchmark datasets, confirming the importance of video-derived behavior information. Conclusion: Video-derived behavior information contributes significantly to multimodal emotion recognition, and BeMERC improves the modeling of emotional dynamics. Abstract: Multimodal emotion recognition in conversation (MERC), the task of identifying the emotion label for each utterance in a conversation, is vital for developing empathetic machines. Current MLLM-based MERC studies focus mainly on capturing the speaker's textual or vocal characteristics, but ignore the significance of video-derived behavior information. Different from text and audio inputs, learning from videos with rich facial expression, body language and posture provides emotion trigger signals to the models for more accurate emotion predictions. In this paper, we propose a novel behavior-aware MLLM-based framework (BeMERC) to incorporate the speaker's behaviors, including subtle facial micro-expression, body language and posture, into a vanilla MLLM-based MERC model, thereby facilitating the modeling of emotional dynamics during a conversation. Furthermore, BeMERC adopts a two-stage instruction tuning strategy to extend the model to the conversation scenario for end-to-end training of a MERC predictor. Experiments demonstrate that BeMERC outperforms the state-of-the-art methods on two benchmark datasets, and the paper also provides a detailed discussion on the significance of video-derived behavior information in MERC.

Object Isolated Attention for Consistent Story Visualization

Xiangyang Luo,Junhao Cheng,Yifan Xie,Xin Zhang,Tao Feng,Zhou Liu,Fei Ma,Fei Yu

Task: Generate coherent image sequences for open-ended stories using an enhanced Transformer module.

Motivation: Existing methods struggle to maintain character consistency while generating natural scenes.

Details

Method: Uses isolated self attention and cross attention mechanisms and leverages prior knowledge from pre-trained diffusion models to ensure logical scene creation. Result: The method outperforms existing methods in both qualitative and quantitative evaluations. Conclusion: The proposed method is training-free and can continuously generate new characters and storylines without re-tuning. Abstract: Open-ended story visualization is a challenging task that involves generating coherent image sequences from a given storyline. One of the main difficulties is maintaining character consistency while creating natural and contextually fitting scenes, an area where many existing methods struggle. In this paper, we propose an enhanced Transformer module that uses separate self attention and cross attention mechanisms, leveraging prior knowledge from pre-trained diffusion models to ensure logical scene creation. The isolated self attention mechanism improves character consistency by refining attention maps to reduce focus on irrelevant areas and highlight key features of the same character. Meanwhile, the isolated cross attention mechanism independently processes each character's features, avoiding feature fusion and further strengthening consistency. Notably, our method is training-free, allowing the continuous generation of new characters and storylines without re-tuning. Both qualitative and quantitative evaluations show that our approach outperforms current methods, demonstrating its effectiveness.

Comparing representations of long clinical texts for the task of patient note-identification

Safa Alsaidi,Marc Vincent,Olivia Boyer,Nicolas Garcelon,Miguel Couceiro,Adrien Coulet

Task: Address patient-note identification, that is, accurately matching an anonymized clinical note to its corresponding patient.

Motivation: The task has broad applications in duplicate record detection and patient similarity analysis, both of which require robust patient-level representations.

Details

Method: Explores several embedding methods (HAN, HTN, LongFormer, and BERT-based models) and pooling strategies (mean, max, and mean_max), and examines the impact of sliding windows on performance. Result: BERT-based embeddings perform best on long clinical notes, mean_max pooling consistently gives the best results, and the findings are reproduced on both the MIMIC and Necker datasets. Conclusion: Embedding methods and aggregation strategies are both important for optimizing patient-note identification and patient-level modeling. Abstract: In this paper, we address the challenge of patient-note identification, which involves accurately matching an anonymized clinical note to its corresponding patient, represented by a set of related notes. This task has broad applications, including duplicate records detection and patient similarity analysis, which require robust patient-level representations. We explore various embedding methods, including Hierarchical Attention Networks (HAN), three-level Hierarchical Transformer Networks (HTN), LongFormer, and advanced BERT-based models, focusing on their ability to process medium-to-long clinical texts effectively. Additionally, we evaluate different pooling strategies (mean, max, and mean_max) for aggregating word-level embeddings into patient-level representations and we examine the impact of sliding windows on model performance. Our results indicate that BERT-based embeddings outperform traditional and hierarchical models, particularly in processing lengthy clinical notes and capturing nuanced patient representations. Among the pooling strategies, mean_max pooling consistently yields the best results, highlighting its ability to capture critical features from clinical notes. Furthermore, the reproduction of our results on both the MIMIC dataset and the Necker hospital data warehouse illustrates the generalizability of these approaches to real-world applications, emphasizing the importance of both embedding methods and aggregation strategies in optimizing patient-note identification and enhancing patient-level modeling.
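
The three pooling strategies and the sliding-window chunking can be sketched as follows. This is a minimal illustration under two assumptions not confirmed by the abstract: that mean_max means concatenating the mean-pooled and max-pooled vectors, and that windows are fixed-size token spans with a fixed stride. The function names are hypothetical.

```python
import numpy as np

def pool(embeddings, strategy="mean_max"):
    """Aggregate a (n_tokens, dim) array of embeddings into one vector."""
    x = np.asarray(embeddings, dtype=float)
    if strategy == "mean":
        return x.mean(axis=0)
    if strategy == "max":
        return x.max(axis=0)
    if strategy == "mean_max":
        # Assumed definition: concatenation, so the output has 2 * dim entries.
        return np.concatenate([x.mean(axis=0), x.max(axis=0)])
    raise ValueError(f"unknown pooling strategy: {strategy}")

def sliding_windows(tokens, size, stride):
    """Split a long token list into overlapping windows that cover every token."""
    if len(tokens) <= size:
        return [tokens]
    starts = range(0, len(tokens) - size + stride, stride)
    return [tokens[s:s + size] for s in starts]
```

A long note would be split with `sliding_windows`, each window embedded by a length-limited encoder, and the resulting vectors aggregated with `pool` into a note-level or patient-level representation.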

DSPFusion: Image Fusion via Degradation and Semantic Dual-Prior Guidance

Linfeng Tang,Chunyu Li,Guoqing Wang,Yixuan Yuan,Jiayi Ma

Task: Propose DSPFusion, a framework for fusing degraded images under the dual guidance of degradation priors and semantic priors.

Motivation: Existing fusion methods are designed for high-quality images and perform poorly on degraded images captured under harsh conditions, which limits the practical potential of image fusion.

Details

Method: Restores high-quality semantic priors with a diffusion model and combines them with degradation priors to guide both information recovery and fusion in a unified model. Result: DSPFusion mitigates typical degradations and integrates complementary information at low computational cost, greatly broadening the application scope of image fusion. Conclusion: With its dual-prior guidance framework, DSPFusion is efficient and broadly applicable to degraded image fusion. Abstract: Existing fusion methods are tailored for high-quality images but struggle with degraded images captured under harsh circumstances, thus limiting the practical potential of image fusion. This work presents a Degradation and Semantic Prior dual-guided framework for degraded image Fusion (DSPFusion), utilizing degradation priors and high-quality scene semantic priors restored via diffusion models to guide both information recovery and fusion in a unified model. Specifically, it first individually extracts modality-specific degradation priors, while jointly capturing comprehensive low-quality semantic priors. Subsequently, a diffusion model is developed to iteratively restore high-quality semantic priors in a compact latent space, enabling our method to be over 20× faster than mainstream diffusion model-based image fusion schemes. Finally, the degradation priors and high-quality semantic priors are employed to guide information enhancement and aggregation via the dual-prior guidance and prior-guided fusion modules. Extensive experiments demonstrate that DSPFusion mitigates most typical degradations while integrating complementary context with minimal computational cost, greatly broadening the application scope of image fusion.

You Cannot Feed Two Birds with One Score: the Accuracy-Naturalness Tradeoff in Translation

Gergely Flamich,David Vilar,Jan-Thorsten Peter,Markus Freitag

Task: Study why a single score cannot fully measure translation performance in machine translation.

Motivation: Machine translation evaluation usually relies on a single score meant to capture semantic accuracy and naturalness at once, but such a score cannot give a complete picture of system performance.

Details

Method: An information-theoretic proof combined with an empirical evaluation of the submissions to the WMT24 shared task. Result: Proves that a tradeoff exists between accuracy and naturalness, so a single score cannot fully reflect translation performance. Conclusion: Recommends comparing translation systems on an accuracy-naturalness plane rather than by a single score. Abstract: The goal of translation, be it by human or by machine, is, given some text in a source language, to produce text in a target language that simultaneously 1) preserves the meaning of the source text and 2) achieves natural expression in the target language. However, researchers in the machine translation community usually assess translations using a single score intended to capture semantic accuracy and the naturalness of the output simultaneously. In this paper, we build on recent advances in information theory to mathematically prove and empirically demonstrate that such single-score summaries do not and cannot give the complete picture of a system's true performance. Concretely, we prove that a tradeoff exists between accuracy and naturalness and demonstrate it by evaluating the submissions to the WMT24 shared task. Our findings help explain well-known empirical phenomena, such as the observation that optimizing translation systems for a specific accuracy metric (like BLEU) initially improves the system's naturalness, while "overfitting" the system to the metric can significantly degrade its naturalness. Thus, we advocate for a change in how translations are evaluated: rather than comparing systems using a single number, they should be compared on an accuracy-naturalness plane.
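
One simple way to compare systems on a two-dimensional plane instead of by a single number is to keep only the systems that no other system beats on both axes. The sketch below shows that Pareto-style comparison as an illustration; the abstract does not say the authors use this procedure, and the function name and the convention that higher is better on both axes are assumptions.

```python
def pareto_front(systems):
    """Return the names of systems not dominated on (accuracy, naturalness).

    `systems` maps a system name to an (accuracy, naturalness) pair, with
    higher being better on both axes. A system is dominated if another one is
    at least as good on both axes and strictly better on at least one.
    """
    front = []
    for name, (acc, nat) in systems.items():
        dominated = any(
            a >= acc and n >= nat and (a > acc or n > nat)
            for other, (a, n) in systems.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)
```

Two systems on the front, one more accurate and one more natural, cannot be ranked by either score alone, which is exactly the situation a single-number summary hides.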

ControlFusion: A Controllable Image Fusion Framework with Language-Vision Degradation Prompts

Linfeng Tang,Yeda Wang,Zhanchuan Cai,Junjun Jiang,Jiayi Ma

Task: Propose ControlFusion, a controllable image fusion framework that adaptively neutralizes composite degradations using language-vision prompts.

Motivation: Existing image fusion methods struggle with the composite degradations found in real-world scenes and lack the flexibility to accommodate user-specific requirements.

Details

Method: Builds a degraded imaging model from the Retinex theory and the atmospheric scattering principle, and designs a prompt-modulated restoration and fusion network that dynamically enhances features to handle varying degradation levels. Result: ControlFusion outperforms existing methods in fusion quality and degradation handling, particularly on real-world and compound degradations. Conclusion: Through language-vision prompts and an adaptive network design, ControlFusion handles composite degradations and accommodates user requirements. Abstract: Current image fusion methods struggle to address the composite degradations encountered in real-world imaging scenarios and lack the flexibility to accommodate user-specific requirements. In response to these challenges, we propose a controllable image fusion framework with language-vision prompts, termed ControlFusion, which adaptively neutralizes composite degradations. On the one hand, we develop a degraded imaging model that integrates physical imaging mechanisms, including the Retinex theory and atmospheric scattering principle, to simulate composite degradations, thereby providing potential for addressing real-world complex degradations from the data level. On the other hand, we devise a prompt-modulated restoration and fusion network that dynamically enhances features with degradation prompts, enabling our method to accommodate composite degradation of varying levels. Specifically, considering individual variations in quality perception of users, we incorporate a text encoder to embed user-specified degradation types and severity levels as degradation prompts. We also design a spatial-frequency collaborative visual adapter that autonomously perceives degradations in source images, thus eliminating the complete dependence on user instructions. Extensive experiments demonstrate that ControlFusion outperforms SOTA fusion methods in fusion quality and degradation handling, particularly in countering real-world and compound degradations with various levels.

Crossing Boundaries: Leveraging Semantic Divergences to Explore Cultural Novelty in Cooking Recipes

Florian Carichon,Romain Rampa,Golnoosh Farnadi

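Task: Propose an interdisciplinary framework for quantifying and understanding cultural novelty in natural language processing.

Motivation: Cultural novelty is a key factor shaping individual perception, but the lack of robust metrics for quantifying it limits the ability to understand and measure cultural differences within computational frameworks.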

Details

Method: Proposes a framework that integrates knowledge from sociology and management, and introduces the GlobalFusion dataset and Jensen-Shannon Divergence metrics to analyze cultural novelty. Result: The cultural novelty metrics correlate significantly with established cultural measures based on linguistic, religious, and geographical distances. Conclusion: The framework shows potential for understanding and measuring cultural diversity in AI. Abstract: Novelty modeling and detection is a core topic in Natural Language Processing (NLP), central to numerous tasks such as recommender systems and automatic summarization. It involves identifying pieces of text that deviate in some way from previously known information. However, novelty is also a crucial determinant of the unique perception of relevance and quality of an experience, as it rests upon each individual's understanding of the world. Social factors, particularly cultural background, profoundly influence perceptions of novelty and innovation. Cultural novelty arises from differences in salience and novelty as shaped by the distance between distinct communities. While cultural diversity has garnered increasing attention in artificial intelligence (AI), the lack of robust metrics for quantifying cultural novelty hinders a deeper understanding of these divergences. This gap limits quantifying and understanding cultural differences within computational frameworks. To address this, we propose an interdisciplinary framework that integrates knowledge from sociology and management. Central to our approach is GlobalFusion, a novel dataset comprising 500 dishes and approximately 100,000 cooking recipes capturing cultural adaptation from over 150 countries. By introducing a set of Jensen-Shannon Divergence metrics for novelty, we leverage this dataset to analyze textual divergences when recipes from one community are modified by another with a different cultural background. The results reveal significant correlations between our cultural novelty metrics and established cultural measures based on linguistic, religious, and geographical distances. Our findings highlight the potential of our framework to advance the understanding and measurement of cultural diversity in AI.
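
For readers unfamiliar with the Jensen-Shannon Divergence mentioned in the abstract, the following is a minimal sketch of the standard definition applied to two token lists. The tokenization, the use of plain unigram frequencies, and the example recipes are assumptions for illustration; the paper's actual metric set may be constructed differently.

```python
import math
from collections import Counter


def distribution(tokens):
    """Turn a token list into a unigram probability distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Assumes q[t] > 0 wherever p[t] > 0, which holds for the mixture below.
    """
    return sum(pv * math.log2(pv / q[t]) for t, pv in p.items() if pv > 0)


def jensen_shannon_divergence(tokens_a, tokens_b):
    """JSD = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2."""
    p = distribution(tokens_a)
    q = distribution(tokens_b)
    vocab = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical ingredient lists for an original and an adapted recipe.
original = ["rice", "soy", "ginger", "rice"]
adapted = ["rice", "butter", "cream", "rice"]
print(jensen_shannon_divergence(original, adapted))
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (no shared tokens), and it is symmetric in its two arguments, which makes it convenient for comparing communities in either direction.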

Linfeng Tang,Yeda Wang,Meiqi Gong,Zizhuo Li,Yuxin Deng,Xunpeng Yi,Chunyu Li,Han Xu,Hao Zhang,Jiayi Ma

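Task: Build a multi-modal video fusion model (VideoFusion) and fill the data gap in the video fusion field.

Motivation: Videos are closer to real-world scenarios than images and carry temporal cues, yet existing research focuses mostly on image fusion and lacks both large-scale multi-sensor video datasets and a unified spatio-temporal modeling framework.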

Details

Method: 1) Constructs the M3SVD dataset; 2) proposes the VideoFusion model, comprising a differential reinforcement module, a modality-guided fusion strategy, and a bi-temporal co-attention mechanism. Result: VideoFusion outperforms existing image fusion methods in sequential scenarios and effectively mitigates temporal inconsistency and interference. Conclusion: VideoFusion and M3SVD fill the data and method gaps in video fusion and improve multi-modal video fusion performance. Abstract: Compared to images, videos better align with real-world acquisition scenarios and possess valuable temporal cues. However, existing multi-sensor fusion research predominantly integrates complementary context from multiple images rather than videos. This primarily stems from two factors: 1) the scarcity of large-scale multi-sensor video datasets, limiting research in video fusion, and 2) the inherent difficulty of jointly modeling spatial and temporal dependencies in a unified framework. This paper addresses both problems. First, we construct M3SVD, a benchmark dataset with 220 temporally synchronized and spatially registered infrared-visible video pairs comprising 153,797 frames, filling the data gap for the video fusion community. Second, we propose VideoFusion, a multi-modal video fusion model that fully exploits cross-modal complementarity and temporal dynamics to generate spatio-temporally coherent videos from (potentially degraded) multi-modal inputs. Specifically, 1) a differential reinforcement module is developed for cross-modal information interaction and enhancement, 2) a complete modality-guided fusion strategy is employed to adaptively integrate multi-modal features, and 3) a bi-temporal co-attention mechanism is devised to dynamically aggregate forward-backward temporal contexts to reinforce cross-frame feature representations. Extensive experiments reveal that VideoFusion outperforms existing image-oriented fusion paradigms in sequential scenarios, effectively mitigating temporal inconsistency and interference.

Artificial Conversations, Real Results: Fostering Language Detection with Synthetic Data

Fatemeh Mohammadi,Tommaso Romano,Samira Maghool,Paolo Ceravolo

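Task: Propose a pipeline for generating synthetic data and study the factors that affect the validity of LLM-generated synthetic data.

Motivation: Acquiring high-quality training data is costly and time-consuming, especially for non-English languages such as Italian, which motivates exploring whether LLM-generated synthetic data is a viable alternative.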

Details

Method: Studies the validity of synthetic data by analyzing factors such as prompt strategy, text length, and target position, using inclusive language detection in Italian job advertisements as the case study. Result: In most cases, models fine-tuned on synthetic data outperform other models on both real and synthetic test sets. Conclusion: Discusses the practical implications and limitations of synthetic data for language detection tasks. Abstract: Collecting high-quality training data is essential for fine-tuning Large Language Models (LLMs). However, acquiring such data is often costly and time-consuming, especially for non-English languages such as Italian. Recently, researchers have begun to explore the use of LLMs to generate synthetic datasets as a viable alternative. This study proposes a pipeline for generating synthetic data and a comprehensive approach for investigating the factors that influence the validity of synthetic data generated by LLMs by examining how model performance is affected by metrics such as prompt strategy, text length and target position in a specific task, i.e., inclusive language detection in Italian job advertisements. Our results show that, in most cases and across different metrics, the fine-tuned models trained on synthetic data consistently outperformed other models on both real and synthetic test datasets. The study discusses the practical implications and limitations of using synthetic data for language detection tasks with LLMs.

OnSiteVRU: A High-Resolution Trajectory Dataset for High-Density Vulnerable Road Users

Zhangcun Yan,Jianqing Li,Peng Hang,Jian Sun

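Task: Develop a high-precision, diverse trajectory dataset to support the optimization of autonomous driving systems.

Motivation: Accelerating urbanization and growing transportation demand have made the safety of vulnerable road users (VRUs) a prominent problem, and existing datasets cannot meet the research needs of complex traffic environments.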

Details

Method: Develops the OnSiteVRU dataset, covering diverse scenarios (intersections, road segments, urban villages), providing trajectory data for motor vehicles, electric bicycles, and human-powered bicycles, and combining aerial-view data with onboard real-time detection data. Result: VRU_Data outperforms traditional datasets in VRU density and scene coverage and reflects VRU behavioral characteristics more comprehensively. Conclusion: The dataset provides key support for traffic flow modeling, trajectory prediction, and virtual testing of autonomous driving, and is publicly available for download. Abstract: With the acceleration of urbanization and the growth of transportation demands, the safety of vulnerable road users (VRUs, such as pedestrians and cyclists) in mixed traffic flows has become increasingly prominent, necessitating high-precision and diverse trajectory data to support the development and optimization of autonomous driving systems. However, existing datasets fall short in capturing the diversity and dynamics of VRU behaviors, making it difficult to meet the research demands of complex traffic environments. To address this gap, this study developed the OnSiteVRU datasets, which cover a variety of scenarios, including intersections, road segments, and urban villages. These datasets provide trajectory data for motor vehicles, electric bicycles, and human-powered bicycles, totaling approximately 17,429 trajectories with a precision of 0.04 seconds. The datasets integrate both aerial-view natural driving data and onboard real-time dynamic detection data, along with environmental information such as traffic signals, obstacles, and real-time maps, enabling a comprehensive reconstruction of interaction events. The results demonstrate that VRU_Data outperforms traditional datasets in terms of VRU density and scene coverage, offering a more comprehensive representation of VRU behavioral characteristics. This provides critical support for traffic flow modeling, trajectory prediction, and autonomous driving virtual testing. The dataset is publicly available for download at: https://www.kaggle.com/datasets/zcyan2/mixed-traffic-trajectory-dataset-in-from-shanghai.

Is LLM the Silver Bullet to Low-Resource Languages Machine Translation?

Yewei Song,Lujun Li,Cedric Lothritz,Saad Ezzini,Lama Sleem,Niccolo Gentile,Radu State,Tegawendé F. Bissyandé,Jacques Klein

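Task: Systematically evaluate the limitations of current large language models on low-resource languages across 200 languages and explore ways to improve them.

Motivation: Low-resource languages face limited resources and underrepresentation in datasets, and performance is especially poor in privacy-sensitive and resource-constrained scenarios.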

Details

Method: Evaluates on benchmarks such as FLORES-200, explores alternative data sources such as news articles and bilingual dictionaries, and improves translation performance through knowledge distillation and fine-tuning strategies. Result: Knowledge distillation and incremental fine-tuning markedly narrow the performance gap for low-resource languages on smaller language models. Conclusion: Diversifying data sources and optimizing models can effectively improve translation performance for low-resource languages. Abstract: Low-Resource Languages (LRLs) present significant challenges in natural language processing due to their limited linguistic resources and underrepresentation in standard datasets. While recent advancements in Large Language Models (LLMs) and Neural Machine Translation (NMT) have substantially improved translation capabilities for high-resource languages, performance disparities persist for LRLs, particularly impacting privacy-sensitive and resource-constrained scenarios. This paper systematically evaluates the limitations of current LLMs across 200 languages using benchmarks such as FLORES-200. We also explore alternative data sources, including news articles and bilingual dictionaries, and demonstrate how knowledge distillation from large pre-trained models can significantly improve smaller LRL translations. Additionally, we investigate various fine-tuning strategies, revealing that incremental enhancements markedly reduce performance gaps on smaller LLMs.

FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning

Hang Guo,Yawei Li,Taolin Zhang,Jiangshan Wang,Tao Dai,Shu-Tao Xia,Luca Benini

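Task: Propose FastVAR, a post-training acceleration method for efficient resolution scaling with visual autoregressive (VAR) models.

Motivation: Existing VAR paradigms process the entire token map at every scale step, so complexity and runtime grow sharply with image resolution.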

Details

Method: Uses a cached token pruning strategy that applies scale-specific modeling only to pivotal tokens and restores the pruned slots with cached tokens from previous scale steps. Result: FastVAR gives a further 2.7× speedup over FlashAttention-accelerated VAR with a negligible performance drop (under 1%) and can generate 2K-resolution images efficiently. Conclusion: FastVAR markedly improves the efficiency of VAR models at high resolution and is suitable for zero-shot generation of high-resolution images. Abstract: Visual Autoregressive (VAR) modeling has gained popularity for its shift towards next-scale prediction. However, existing VAR paradigms process the entire token map at each scale step, causing complexity and runtime to scale dramatically with image resolution. To address this challenge, we propose FastVAR, a post-training acceleration method for efficient resolution scaling with VARs. Our key finding is that the majority of latency arises from the large-scale step where most tokens have already converged. Leveraging this observation, we develop the cached token pruning strategy that only forwards pivotal tokens for scale-specific modeling while using cached tokens from previous scale steps to restore the pruned slots. This significantly reduces the number of forwarded tokens and improves the efficiency at larger resolutions. Experiments show the proposed FastVAR can further speed up FlashAttention-accelerated VAR by 2.7× with a negligible performance drop of less than 1%. We further extend FastVAR to zero-shot generation of higher resolution images. In particular, FastVAR can generate one 2K image with a 15GB memory footprint in 1.5s on a single NVIDIA 3090 GPU. Code is available at https://github.com/csguoh/FastVAR.
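
The prune-then-restore idea described in the abstract can be sketched in a few lines. This is an illustration only, not the FastVAR implementation: the abstract does not say how pivotal tokens are selected, so the per-token `scores`, the `keep_ratio` parameter, the `forward` callable, and the assumption that `cached` has already been resized to the current scale are all hypothetical.

```python
def prune_and_restore(tokens, cached, scores, keep_ratio, forward):
    """Forward only the highest-scoring tokens; fill the rest from a cache.

    tokens     -- current-scale token values (one per slot)
    cached     -- values from the previous scale step, assumed already
                  resized to the same number of slots (sketch assumption)
    scores     -- hypothetical per-token importance scores
    keep_ratio -- fraction of tokens that go through `forward`
    forward    -- the expensive per-token computation
    """
    n = len(tokens)
    k = max(1, int(n * keep_ratio))
    # Indices of the k highest-scoring ("pivotal") tokens.
    keep = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    # Pruned slots default to their cached values.
    out = list(cached)
    for i in keep:
        out[i] = forward(tokens[i])
    return out, sorted(keep)


out, kept = prune_and_restore(
    tokens=[1, 2, 3, 4],
    cached=[10, 20, 30, 40],
    scores=[0.9, 0.1, 0.8, 0.2],
    keep_ratio=0.5,
    forward=lambda x: x * 100,
)
print(out, kept)  # [100, 20, 300, 40] [0, 2]
```

The cost of `forward` is paid for only `k` of the `n` slots, which is where the saving comes from when most tokens at the largest scale have already converged.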

TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection

Zhiming Ma,Peidong Wang,Minhua Huang,Jingpeng Wang,Kai Wu,Xiangzhao Lv,Yachun Pang,Yin Yang,Wenjie Tang,Yuchen Kang

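Task: Build TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset for automated telecom fraud analysis, and provide a standardized evaluation benchmark, TeleAntiFraud-Bench.

Motivation: Telecom fraud detection is held back by the lack of high-quality multimodal training data, particularly data that combines audio signals with reasoning-oriented textual analysis.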

Details

Method: Builds the dataset through three strategies: (1) privacy-preserved text-truth sample generation; (2) semantic enhancement based on large language models (LLMs); (3) multi-agent adversarial synthesis that simulates emerging fraud tactics. Result: Produces 28,511 rigorously processed speech-text pairs, along with an evaluation benchmark and an optimized supervised fine-tuning (SFT) model. Conclusion: The work provides a foundational framework for multimodal anti-fraud research and addresses key challenges in data privacy and scenario diversity. Abstract: The detection of telecom fraud faces significant challenges due to the lack of high-quality multimodal training data that integrates audio signals with reasoning-oriented textual analysis. To address this gap, we present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. Our dataset is constructed through three strategies: (1) Privacy-preserved text-truth sample generation using automatic speech recognition (ASR)-transcribed call recordings (with anonymized original audio), ensuring real-world consistency through text-to-speech (TTS) model regeneration; (2) Semantic enhancement via large language model (LLM)-based self-instruction sampling on authentic ASR outputs to expand scenario coverage; (3) Multi-agent adversarial synthesis that simulates emerging fraud tactics through predefined communication scenarios and fraud typologies. The generated dataset contains 28,511 rigorously processed speech-text pairs, complete with detailed annotations for fraud reasoning. The dataset is divided into three tasks: scenario classification, fraud detection, and fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from the dataset, to facilitate systematic testing of model performance on telecom fraud detection tasks. We also contribute a production-optimized supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, while open-sourcing the data processing framework to enable community-driven dataset expansion. This work establishes a foundational framework for multimodal anti-fraud research while addressing critical challenges in data privacy and scenario diversity.
The project will be released at https://github.com/JimmyMa99/TeleAntiFraud.

Towards Physically Plausible Video Generation via VLM Planning

Xindi Yang,Baolu Li,Yiming Zhang,Zhenfei Yin,Lei Bai,Liqian Ma,Zhiyong Wang,Jianfei Cai,Tien-Tsin Wong,Huchuan Lu,Xu Jia

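Task: Propose a novel two-stage image-to-video generation framework that explicitly incorporates physical knowledge, addressing the tendency of video diffusion models (VDMs) to generate physically implausible videos.

Motivation: Although video diffusion models (VDMs) have made progress in generating realistic videos, they lack an understanding of physics, so the dynamics and event sequences in their videos often violate physical laws.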

Details

Method: Uses a two-stage framework: the first stage employs a Vision Language Model (VLM) as a coarse-grained motion planner, combining chain-of-thought and physics-aware reasoning to predict motion trajectories that approximate real physical dynamics; the second stage uses the predicted trajectories to guide video generation by the VDM, adding noise at inference to leave room for finer detail. Result: Experiments show that the framework generates physically plausible motion and has a notable advantage over existing methods. Conclusion: The proposed two-stage framework effectively addresses physically implausible generation in VDMs and offers a new direction for video generation. Abstract: Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the attention of the community to their potential as world simulators. However, despite their capabilities, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics, resulting in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics. In the first stage, we employ a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. As the predicted motion trajectories/changes are rough, noise is added during inference to provide freedom to the VDM in generating motion with finer details. Extensive experimental results demonstrate that our framework can produce physically plausible motion, and comparative evaluations highlight the notable superiority of our approach over existing methods. More video results are available on our Project Page: https://madaoer.github.io/projects/physically_plausible_video_generation.

Multi-Task Learning for Extracting Menstrual Characteristics from Clinical Notes

Anna Shopova,Cristoph Lippert,Leslee J. Shaw,Eugenia Alleva

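Task: Propose a natural language processing pipeline that extracts key menstrual cycle attributes from clinical notes.

Motivation: Menstrual health is an important but often overlooked aspect of women's health, and detailed data is missing from structured medical records.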

Details

Method: Uses the GatorTron model with multi-task prompt-based learning, preceded by a hybrid retrieval preprocessing step that identifies relevant text segments. Result: Trained on fewer than 100 annotated clinical notes, the approach reaches an average F1-score of 90%, and the retrieval step markedly improves performance. Conclusion: Combining multi-task learning with retrieval improves generalization and performance and supports women's health research. Abstract: Menstrual health is a critical yet often overlooked aspect of women's healthcare. Despite its clinical relevance, detailed data on menstrual characteristics is rarely available in structured medical records. To address this gap, we propose a novel Natural Language Processing pipeline to extract key menstrual cycle attributes -- dysmenorrhea, regularity, flow volume, and intermenstrual bleeding. Our approach utilizes the GatorTron model with Multi-Task Prompt-based Learning, enhanced by a hybrid retrieval preprocessing step to identify relevant text segments. It outperforms baseline methods, achieving an average F1-score of 90% across all menstrual characteristics, despite being trained on fewer than 100 annotated clinical notes. The retrieval step consistently improves performance across all approaches, allowing the model to focus on the most relevant segments of lengthy clinical notes. These results show that combining multi-task learning with retrieval improves generalization and performance across menstrual characteristics, advancing automated extraction from clinical notes and supporting women's health research.

Map Feature Perception Metric for Map Generation Quality Assessment and Loss Optimization

Chenxing Sun,Jing Bai

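Task: Propose a new Map Feature Perception (MFP) metric for evaluating the global features and spatial consistency between generated and target maps.

Motivation: Existing computer vision-based image assessment metrics (such as L1, L2, SSIM, and FID) focus on pixel-level comparison and fail to capture the global features and spatial correlations of maps, which leads to semantic-structural artifacts in generated results.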

Details

Method: Proposes a Map Feature Perception (MFP) metric built on elemental-level deep features that comprehensively encode a map's structural integrity and topological relationships. Result: Experiments show that MFP performs well at evaluating the semantic features of maps, and that classification-enhanced implementations outperform conventional loss functions, with performance gains ranging from 2% to 50%. Conclusion: Explicitly accounting for the global attributes and spatial coherence of maps substantially improves generative model optimization and, in turn, the geographical plausibility of synthesized maps. Abstract: In intelligent cartographic generation tasks empowered by generative models, the authenticity of synthesized maps constitutes a critical determinant. Concurrently, the selection of appropriate evaluation metrics to quantify map authenticity emerges as a pivotal research challenge. Current methodologies predominantly adopt computer vision-based image assessment metrics to compute discrepancies between generated and reference maps. However, conventional visual similarity metrics (including L1, L2, SSIM, and FID) primarily operate at pixel-level comparisons, inadequately capturing cartographic global features and spatial correlations, consequently inducing semantic-structural artifacts in generated outputs. This study introduces a novel Map Feature Perception Metric designed to evaluate global characteristics and spatial congruence between synthesized and target maps. Diverging from pixel-wise metrics, our approach extracts elemental-level deep features that comprehensively encode cartographic structural integrity and topological relationships. Experimental validation demonstrates MFP's superior capability in evaluating cartographic semantic features, with classification-enhanced implementations outperforming conventional loss functions across diverse generative frameworks. When employed as optimization objectives, our metric achieves performance gains ranging from 2% to 50% across multiple benchmarks compared to traditional L1, L2, and SSIM baselines. This investigation concludes that explicit consideration of cartographic global attributes and spatial coherence substantially enhances generative model optimization, thereby significantly improving the geographical plausibility of synthesized maps.

Implicit In-Context Learning: Evidence from Artificial Language Experiments

Xiaomeng Ma,Qihui Xu

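Task: Systematically evaluate implicit learning at the inference level in two OpenAI models (gpt-4o and o3-mini) and compare it with human behavior.

Motivation: Investigate whether large language models exhibit human-like pattern recognition at the inference level.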

Details

Method: A systematic evaluation using three classic artificial language learning experiments (morphology, morphosyntax, and syntax). Result: o3-mini is closer to human behavior in morphology, while both models align with human behavior in syntax. Conclusion: Models align with human behavior in specific linguistic domains, but the degree of alignment varies by domain. Abstract: Humans acquire language through implicit learning, absorbing complex patterns without explicit awareness. While LLMs demonstrate impressive linguistic capabilities, it remains unclear whether they exhibit human-like pattern recognition during in-context learning at the inference level. We adapted three classic artificial language learning experiments spanning morphology, morphosyntax, and syntax to systematically evaluate implicit learning at the inference level in two state-of-the-art OpenAI models: gpt-4o and o3-mini. Our results reveal linguistic domain-specific alignment between models and human behaviors: o3-mini aligns better in morphology, while both models align in syntax.

JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

Kai Liu,Wei Li,Lai Chen,Shengqiong Wu,Yanhao Zheng,Jiayi Ji,Fan Zhou,Rongxin Jiang,Jiebo Luo,Hao Fei,Tat-Seng Chua

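Task: Propose JavisDiT, a novel Joint Audio-Video Diffusion Transformer for generating synchronized audio and video content.

Motivation: Address the shortcomings of existing methods in generating high-quality audio and video in sync, particularly in complex scenes.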

Details

Method: Builds on the Diffusion Transformer (DiT) architecture and introduces a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator for fine-grained spatio-temporal alignment. Result: JavisDiT significantly outperforms existing methods in generation quality and synchronization, and the work contributes a new benchmark dataset, JavisBench, and a new evaluation metric. Conclusion: JavisDiT sets a new standard for synchronized audio-video generation; the code, model, and dataset will be made public. Abstract: This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built upon the powerful Diffusion Transformer (DiT) architecture, JavisDiT is able to generate high-quality audio and video content simultaneously from open-ended user prompts. To ensure optimal synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, guiding the synchronization between the visual and auditory components. Furthermore, we propose a new benchmark, JavisBench, consisting of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios. We also devise a robust metric for evaluating the synchronization between generated audio-video pairs in real-world complex content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods by ensuring both high-quality generation and precise synchronization, setting a new standard for JAVG tasks. Our code, model, and dataset will be made publicly available at https://javisdit.github.io/.

TwT: Thinking without Tokens by Habitual Reasoning Distillation with Multi-Teachers' Guidance

Jingxian Xu,Mengyu Zhou,Weichang Liu,Hanbing Liu,Shi Han,Dongmei Zhang

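Task: Propose TwT (Thinking without Tokens), a method that reduces inference-time compute cost through habitual reasoning distillation with multi-teacher guidance while maintaining high performance.

Motivation: Reasoning in large language models (LLMs) increases the number of output tokens and therefore the compute cost, so an efficient way to reduce that cost is needed.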

Details

Method: Uses Habitual Reasoning Distillation, which internalizes explicit reasoning as the model's habitual behavior through a teacher-guided compression strategy, and proposes Dual-Criteria Rejection Sampling (DCRS) to generate a high-quality distillation dataset. Result: TwT effectively reduces inference cost while maintaining high performance, improving accuracy by up to 13.6% over other distillation methods with fewer output tokens. Conclusion: TwT offers a practical solution for efficient LLM deployment. Abstract: Large Language Models (LLMs) have made significant strides in problem-solving by incorporating reasoning processes. However, this enhanced reasoning capability results in an increased number of output tokens during inference, leading to higher computational costs. To address this challenge, we propose TwT (Thinking without Tokens), a method that reduces inference-time costs through habitual reasoning distillation with multi-teachers' guidance, while maintaining high performance. Our approach introduces a Habitual Reasoning Distillation method, which internalizes explicit reasoning into the model's habitual behavior through a Teacher-Guided compression strategy inspired by human cognition. Additionally, we propose Dual-Criteria Rejection Sampling (DCRS), a technique that generates a high-quality and diverse distillation dataset using multiple teacher models, making our method suitable for unsupervised scenarios. Experimental results demonstrate that TwT effectively reduces inference costs while preserving superior performance, achieving up to a 13.6% improvement in accuracy with fewer output tokens compared to other distillation methods, offering a highly practical solution for efficient LLM deployment.

Haiduo Huang,Yadong Zhang,Pengju Ren

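Task: Propose KernelDNA, a lightweight convolution kernel plug-in that addresses the limitations of dynamic convolution in parameter overhead, inference speed, and optimization.

Motivation: Dynamic convolution increases model capacity by adaptively combining multiple kernels, but it suffers from large parameter overhead, slow inference, or difficulty in jointly optimizing dynamic attention and static kernels. In addition, pre-trained CNNs exhibit inter-layer redundancy.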

Details

Method: Proposes KernelDNA, which decouples kernel adaptation into input-dependent dynamic routing and pre-trained static modulation for parameter-efficient, hardware-friendly inference, while using cross-layer weight sharing and adapter-based modulation. Result: On image classification and dense prediction tasks, KernelDNA achieves a state-of-the-art accuracy-efficiency balance among dynamic convolution variants. Conclusion: With its lightweight design and weight-sharing mechanism, KernelDNA enhances representation power while keeping the computational efficiency of standard convolution. Abstract: Dynamic convolution enhances model capacity by adaptively combining multiple kernels, yet faces critical trade-offs: prior works either (1) incur significant parameter overhead by scaling kernel numbers linearly, (2) compromise inference speed through complex kernel interactions, or (3) struggle to jointly optimize dynamic attention and static kernels. We also observe that pre-trained Convolutional Neural Networks (CNNs) exhibit inter-layer redundancy akin to that in Large Language Models (LLMs). Specifically, dense convolutional layers can be efficiently replaced by derived "child" layers generated from a shared "parent" convolutional kernel through an adapter. To address these limitations and implement the weight-sharing mechanism, we propose a lightweight convolution kernel plug-in, named KernelDNA. It decouples kernel adaptation into input-dependent dynamic routing and pre-trained static modulation, ensuring both parameter efficiency and hardware-friendly inference. Unlike existing dynamic convolutions that expand parameters via multi-kernel ensembles, our method leverages cross-layer weight sharing and adapter-based modulation, enabling dynamic kernel specialization without altering the standard convolution structure. This design preserves the native computational efficiency of standard convolutions while enhancing representation power through input-adaptive kernel adjustments. Experiments on image classification and dense prediction tasks demonstrate that KernelDNA achieves state-of-the-art accuracy-efficiency balance among dynamic convolution variants. Our codes are available at https://github.com/haiduo/KernelDNA.

Synthetic News Generation for Fake News Classification

Abdul Sittar,Luka Golob,Mateja Smiljanic

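Task: Explore how large language models can generate and evaluate synthetic fake news through fact-based manipulation.

Motivation: Study how to simulate fake news by modifying the key facts of real articles and regenerating the content while keeping it coherent, in order to strengthen fake news detection systems.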

Details

Method: Proposes a new method that extracts key facts from real articles, modifies them, and generates synthetic fake news, together with evaluation metrics (coherence, dissimilarity, and correctness) for assessing the quality of the generated content. Result: Experiments show that transformer-based models such as BERT make effective use of synthetic data for fake news detection, and that a small amount of synthetic data is enough to improve performance; fact verification features work best for distinguishing synthetic fake news. Conclusion: Synthetic data has potential for strengthening fake news detection systems, and future work can further improve detection models by refining synthetic data generation. Abstract: This study explores the generation and evaluation of synthetic fake news through fact-based manipulations using large language models (LLMs). We introduce a novel methodology that extracts key facts from real articles, modifies them, and regenerates content to simulate fake news while maintaining coherence. To assess the quality of the generated content, we propose a set of evaluation metrics: coherence, dissimilarity, and correctness. The research also investigates the application of synthetic data in fake news classification, comparing traditional machine learning models with transformer-based models such as BERT. Our experiments demonstrate that transformer models, especially BERT, effectively leverage synthetic data for fake news detection, showing improvements with smaller proportions of synthetic data. Additionally, we find that fact verification features, which focus on identifying factual inconsistencies, provide the most promising results in distinguishing synthetic fake news. The study highlights the potential of synthetic data to enhance fake news detection systems, offering valuable insights for future research and suggesting that targeted improvements in synthetic data generation can further strengthen detection models.

Enhancing Human Motion Prediction via Multi-range Decoupling Decoding with Gating-adjusting Aggregation

Jiexin Wang,Wenwen Qiang,Zhao Yang,Bing Su

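Task: Propose MD2GA, a new method for improving motion representation learning in human motion prediction.

Motivation: Existing deep learning methods for motion prediction overlook the varying correlation between historical information and future moments, which limits motion representation learning and prediction performance.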

Details

Method: Uses a two-stage strategy, multi-range decoupling decoding followed by gating-adjusting aggregation, to adjust feature learning dynamically and integrate diverse insights into motion patterns. Result: Experiments show that MD2GA integrates easily into other motion prediction methods and improves their performance. Conclusion: By exploiting temporal correlations to refine motion representation learning, MD2GA markedly improves motion prediction accuracy. Abstract: Expressive representation of pose sequences is crucial for accurate motion modeling in human motion prediction (HMP). While recent deep learning-based methods have shown promise in learning motion representations, these methods tend to overlook the varying relevance and dependencies between historical information and future moments, with a stronger correlation for short-term predictions and weaker for distant future predictions. This limits the learning of motion representation and then hampers prediction performance. In this paper, we propose a novel approach called multi-range decoupling decoding with gating-adjusting aggregation (MD2GA), which leverages the temporal correlations to refine motion representation learning. This approach employs a two-stage strategy for HMP. In the first stage, a multi-range decoupling decoding adeptly adjusts feature learning by decoding the shared features into distinct future lengths, where different decoders offer diverse insights into motion patterns. In the second stage, a gating-adjusting aggregation dynamically combines the diverse insights guided by input motion data. Extensive experiments demonstrate that the proposed method can be easily integrated into other motion prediction methods and enhance their prediction performance.
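
The second stage, combining several decoders' outputs under input-dependent gates, can be illustrated with a softmax-weighted sum. This is a generic gating sketch under stated assumptions, not the MD2GA architecture: the abstract does not specify the gate's form, so the softmax, the function names, and the idea that `gate_logits` come from some network applied to the input motion are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def gated_aggregate(decoder_outputs, gate_logits):
    """Blend per-decoder predictions with softmax gate weights.

    decoder_outputs -- list of equal-length prediction vectors, one per
                       decoder (e.g. short-, mid-, long-range)
    gate_logits     -- one score per decoder; in a real model these would
                       be computed from the input motion (assumption)
    """
    weights = softmax(gate_logits)
    dim = len(decoder_outputs[0])
    return [
        sum(w * out[d] for w, out in zip(weights, decoder_outputs))
        for d in range(dim)
    ]


# Two hypothetical decoders with equal gate scores give a plain average.
print(gated_aggregate([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0]))  # [2.0, 3.0]
```

Raising one gate logit shifts the blend toward that decoder, which is how such a gate could favor short-range decoders for near-future frames and long-range ones for distant frames.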

BAR-Analytics: A Web-based Platform for Analyzing Information Spreading Barriers in News: Comparative Analysis Across Multiple Barriers and Events

Abdul Sittar,Dunja Mladenic,Alenka Gucek,Marko Grobelnik

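Task: Present and evaluate BAR-Analytics, a platform for analyzing how news dissemination differs across geographical, economic, political, and cultural boundaries.

Motivation: Study how news dissemination differs across conflicts and reveal how political, economic, and regional factors shape media coverage.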

Details

Method: The platform integrates four methods (propagation analysis, trend analysis, sentiment analysis, and temporal topic modeling) and analyzes more than 350,000 articles, focusing on economic disparities and geographical influences. Result: Coverage of the Israeli-Palestinian conflict is more negative and focuses on human rights, while coverage of the Russia-Ukraine conflict is more positive and focuses on election interference. Conclusion: Political, economic, and regional factors significantly shape media narratives across different conflicts. Abstract: This paper presents BAR-Analytics, a web-based, open-source platform designed to analyze news dissemination across geographical, economic, political, and cultural boundaries. Using the Russian-Ukrainian and Israeli-Palestinian conflicts as case studies, the platform integrates four analytical methods: propagation analysis, trend analysis, sentiment analysis, and temporal topic modeling. Over 350,000 articles were collected and analyzed, with a focus on economic disparities and geographical influences using metadata enrichment. We evaluate the case studies using coherence, sentiment polarity, topic frequency, and trend shifts as key metrics. Our results show distinct patterns in news coverage: the Israeli-Palestinian conflict tends to have more negative sentiment with a focus on human rights, while the Russia-Ukraine conflict is more positive, emphasizing election interference. These findings highlight the influence of political, economic, and regional factors in shaping media narratives across different conflicts.

COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation

Fanding Huang,Jingyan Jiang,Qinting Jiang,Hebei Li,Faisal Nadeem Khan,Zhi Wang

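Task: Propose COSMIC, a test-time adaptation framework that improves how well vision-language models adapt to new domains.

Motivation: Existing cache-based methods cache unreliable feature-label pairs and use single-class information indiscriminately when adapting to new domains, which lowers adaptation accuracy.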

Details

Method: Improves adaptability through the Dual Semantics Graph (DSG) and Clique Guided Hyper-class (CGH) mechanisms, combined with multi-granular cross-modal semantic caching and graph-based querying. Result: Performs strongly on multiple benchmarks and clearly outperforms existing methods, with a 15.81% gain on out-of-distribution tasks and 5.33% on cross-domain generation. Conclusion: Through multi-granular semantic caching and structured class relationships, COSMIC markedly improves how well vision-language models adapt to new domains. Abstract: Recent vision-language models (VLMs) face significant challenges in test-time adaptation to novel domains. While cache-based methods show promise by leveraging historical information, they struggle with both caching unreliable feature-label pairs and indiscriminately using single-class information during querying, significantly compromising adaptation accuracy. To address these limitations, we propose COSMIC (Clique-Oriented Semantic Multi-space Integration for CLIP), a robust test-time adaptation framework that enhances adaptability through multi-granular, cross-modal semantic caching and graph-based querying mechanisms. Our framework introduces two key innovations: Dual Semantics Graph (DSG) and Clique Guided Hyper-class (CGH). The Dual Semantics Graph constructs complementary semantic spaces by incorporating textual features, coarse-grained CLIP features, and fine-grained DINOv2 features to capture rich semantic relationships. Building upon these dual graphs, the Clique Guided Hyper-class component leverages structured class relationships to enhance prediction robustness through correlated class selection. Extensive experiments demonstrate COSMIC's superior performance across multiple benchmarks, achieving significant improvements over state-of-the-art methods: 15.81% gain on out-of-distribution tasks and 5.33% on cross-domain generation with CLIP RN-50. Code is available at github.com/hf618/COSMIC.

What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models

Qiyuan Zhang,Fuyuan Lyu,Zexu Sun,Lei Wang,Weixu Zhang,Zhihan Guo,Yufei Wang,Irwin King,Xue Liu,Chen Ma

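Task: Propose a unified, multidimensional framework for systematically understanding test-time scaling (TTS) research.

Motivation: Test-time scaling (TTS) has become a prominent research focus, yet a comprehensive survey is lacking and a systematic understanding is urgently needed.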

Details

Method: Builds a framework along four core dimensions (what to scale, how to scale, where to scale, and how well to scale) and reviews methods, application scenarios, and assessment aspects. Result: Distills the major developmental trajectories of TTS, offers hands-on guidelines, and identifies future research directions. Conclusion: TTS research has great potential; future work should pursue further scaling, clarify the functional essence of techniques, and generalize to more tasks and more attributions. Abstract: As enthusiasm for scaling computation (data and parameters) in the pretraining era gradually diminished, test-time scaling (TTS), also referred to as "test-time computing," has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in specialized reasoning tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systemic understanding. To fill this gap, we propose a unified, multidimensional framework structured along four core dimensions of TTS research: what to scale, how to scale, where to scale, and how well to scale. Building upon this taxonomy, we conduct an extensive review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique functional roles of individual techniques within the broader TTS landscape. From this analysis, we distill the major developmental trajectories of TTS to date and offer hands-on guidelines for practical deployment. Furthermore, we identify several open challenges and offer insights into promising future directions, including further scaling, clarifying the functional essence of techniques, generalizing to more tasks, and more attributions.

A Large Scale Analysis of Gender Biases in Text-to-Image Generative Models

Leander Girrbach,Stephan Alaniz,Genevieve Smith,Zeynep Akata

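Task: Study gender bias in text-to-image (T2I) models, particularly in everyday activities, objects, and contexts.

Motivation: As image generation technology becomes widely used, understanding its social biases, including gender bias, is essential.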

Details

Method: Creates a dataset of 3,217 gender-neutral prompts, generates 200 images per prompt from five leading T2I models, automatically detects the perceived gender in the images, and analyzes gender proportions. Result: T2I models reinforce traditional gender roles, reflect gender stereotypes in household roles, and underrepresent women in finance-related activities. Conclusion: T2I models show significant bias in gender representation and need further improvement to reduce the spread of social stereotypes. Abstract: With the increasing use of image generation technology, understanding its social biases, including gender bias, is essential. This paper presents the first large-scale study on gender bias in text-to-image (T2I) models, focusing on everyday situations. While previous research has examined biases in occupations, we extend this analysis to gender associations in daily activities, objects, and contexts. We create a dataset of 3,217 gender-neutral prompts and generate 200 images per prompt from five leading T2I models. We automatically detect the perceived gender of people in the generated images and filter out images with no person or multiple people of different genders, leaving 2,293,295 images. To enable a broad analysis of gender bias in T2I models, we group prompts into semantically similar concepts and calculate the proportion of male- and female-gendered images for each prompt. Our analysis shows that T2I models reinforce traditional gender roles, reflect common gender stereotypes in household roles, and underrepresent women in finance-related activities. Women are predominantly portrayed in care- and human-centered scenarios, and men in technical or physical labor scenarios.
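The per-prompt proportion described in the abstract, and the share of images that survive filtering, can be sketched as follows. This is an illustrative sketch only: the function name and the `"female"` / `"male"` label strings are assumptions, and the total of 3,217,000 generated images assumes that 200 images were generated per prompt for each of the five models (the only reading consistent with 2,293,295 images remaining after filtering).

```python
from collections import Counter

def gender_proportions(labels):
    """Return the share of female- and male-labelled images for one prompt.

    `labels` holds one perceived-gender label per retained image.
    The label strings "female" and "male" are hypothetical placeholders.
    """
    counts = Counter(labels)
    total = counts["female"] + counts["male"]
    if total == 0:
        return None  # no usable images for this prompt
    return {"female": counts["female"] / total, "male": counts["male"] / total}

# Dataset-level arithmetic taken from the figures in the abstract.
generated = 3217 * 200 * 5      # prompts x images per prompt x models (assumed per model)
retained = 2_293_295            # images left after filtering
retention = retained / generated

example = gender_proportions(["female"] * 50 + ["male"] * 150)
```

Under that assumption, roughly 71% of the generated images pass the filter.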

Enhancing Large Language Models (LLMs) for Telecommunications using Knowledge Graphs and Retrieval-Augmented Generation

Dun Yuan,Hao Zhou,Di Wu,Xue Liu,Hao Chen,Yan Xin,Jianzhong Zhang

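Task: Propose a new framework that combines knowledge graphs (KG) and retrieval-augmented generation (RAG) to improve large language model (LLM) performance in the telecom domain.

Motivation: LLMs perform well on general-purpose natural language processing tasks but face challenges in specialized domains such as telecommunications, which require adapting to evolving standards and specialized expertise.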

Details

Method: Uses a knowledge graph to capture structured telecom-domain information and combines it with retrieval-augmented generation to draw on up-to-date knowledge when generating responses. Result: The KG-RAG framework reaches 88% accuracy on telecom question answering, outperforming the RAG-only (82%) and LLM-only (48%) approaches. Conclusion: The KG-RAG framework effectively combines structured knowledge with generative capability, significantly improving LLM accuracy and adaptability in the telecom domain. Abstract: Large language models (LLMs) have made significant progress in general-purpose natural language processing tasks. However, LLMs are still facing challenges when applied to domain-specific areas like telecommunications, which demands specialized expertise and adaptability to evolving standards. This paper presents a novel framework that combines knowledge graph (KG) and retrieval-augmented generation (RAG) techniques to enhance LLM performance in the telecom domain. The framework leverages a KG to capture structured, domain-specific information about network protocols, standards, and other telecom-related entities, comprehensively representing their relationships. By integrating KG with RAG, LLMs can dynamically access and utilize the most relevant and up-to-date knowledge during response generation. This hybrid approach bridges the gap between structured knowledge representation and the generative capabilities of LLMs, significantly enhancing accuracy, adaptability, and domain-specific comprehension. Our results demonstrate the effectiveness of the KG-RAG framework in addressing complex technical queries with precision. The proposed KG-RAG model attained an accuracy of 88% for question answering tasks on a frequently used telecom-specific dataset, compared to 82% for the RAG-only and 48% for the LLM-only approaches.

Diffusion Meets Few-shot Class Incremental Learning

Junsu Kim,Yunhoe Ku,Dongyoon Han,Seungryul Baek

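Task: Address the challenges of few-shot class-incremental learning (FSCIL) by exploiting the capabilities of diffusion models.

Motivation: FSCIL suffers from limited data and catastrophic forgetting, and needs an efficient way to learn new information while retaining performance on old knowledge.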

Details

Method: Proposes Diffusion-FSCIL, which uses a text-to-image diffusion model as a frozen backbone, extracts multi-scale features, and combines them with feature distillation. Result: Outperforms existing methods on CUB-200, miniImageNet, and CIFAR-100, effectively balancing performance on old and new classes. Conclusion: Diffusion-FSCIL achieves efficient, high-performing few-shot class-incremental learning through a frozen backbone and minimal trainable components. Abstract: Few-shot class-incremental learning (FSCIL) is challenging due to extremely limited training data, all while aiming to reduce catastrophic forgetting and learn new information. We propose Diffusion-FSCIL, a novel approach that employs a text-to-image diffusion model as a frozen backbone. Our conjecture is that FSCIL can be tackled using a large generative model's capabilities benefiting from 1) generation ability via large-scale pre-training; 2) multi-scale representation; 3) representational flexibility through the text encoder. To maximize the representation capability, we propose to extract multiple complementary diffusion features to play roles as latent replay with slight support from feature distillation for preventing generative biases. Our framework realizes efficiency through 1) using a frozen backbone; 2) minimal trainable components; 3) batch processing of multiple feature extractions. Extensive experiments on CUB-200, miniImageNet, and CIFAR-100 show that Diffusion-FSCIL surpasses state-of-the-art methods, preserving performance on previously learned classes and adapting effectively to new ones.

Is analogy enough to draw novel adjective-noun inferences?

Hayley Ross,Kathryn Davidson,Najoung Kim

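Task: Study whether humans and large language models (LLMs) derive the meaning of novel adjective-noun combinations by analogy rather than by a compositional mechanism.

Motivation: Examine whether humans and LLMs can understand novel combinations solely by analogy to known inferences, without relying on composition.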

Details

Method: (1) Builds a model of analogical reasoning based on lexical similarity; (2) asks human participants to reason by analogy. Result: The analogy strategy works well for a large proportion of the dataset, but inferences for some novel combinations, on which humans and LLMs converge, cannot be explained by analogy. Conclusion: The generalization mechanism humans and LLMs use in these cases cannot be fully reduced to analogy and likely involves composition. Abstract: Recent work (Ross et al., 2025, 2024) has argued that the ability of humans and LLMs respectively to generalize to novel adjective-noun combinations shows that they each have access to a compositional mechanism to determine the phrase's meaning and derive inferences. We study whether these inferences can instead be derived by analogy to known inferences, without need for composition. We investigate this by (1) building a model of analogical reasoning using similarity over lexical items, and (2) asking human participants to reason by analogy. While we find that this strategy works well for a large proportion of the dataset of Ross et al. (2025), there are novel combinations for which both humans and LLMs derive convergent inferences but which are not well handled by analogy. We thus conclude that the mechanism humans and LLMs use to generalize in these cases cannot be fully reduced to analogy, and likely involves composition.

GMapLatent: Geometric Mapping in Latent Space

Wei Zeng,Xuebin Chang,Jianghao Su,Xiang Gu,Jian Sun,Zongben Xu

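Task: Propose GMapLatent, a cross-domain alignment and generation model based on geometric mapping, to address mode collapse and mixture in encoder-decoder architectures.

Motivation: Conventional cross-domain alignment methods operate directly on the initial distribution, which can cause mode collapse and mixture problems and harm model generalization.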

Details

Method: Builds a canonical latent space representation through geometric mapping, composing barycenter translation, optimal transport merging, and constrained harmonic mapping to achieve strict cluster alignment. Result: Experiments on gray-scale and color images validate the efficiency and superior performance of GMapLatent. Conclusion: Through precise latent space alignment, GMapLatent significantly improves the performance of cross-domain generative models. Abstract: Cross-domain generative models based on encoder-decoder AI architectures have attracted much attention in generating realistic images, where domain alignment is crucial for generation accuracy. Domain alignment methods usually deal directly with the initial distribution; however, mismatched or mixed clusters can lead to mode collapse and mixture problems in the decoder, compromising model generalization capabilities. In this work, we develop a cross-domain alignment and generation model that introduces a canonical latent space representation based on geometric mapping to align the cross-domain latent spaces in a rigorous and precise manner, thus avoiding mode collapse and mixture in the encoder-decoder generation architectures. We name this model GMapLatent. The core of the method is to seamlessly align latent spaces with strict cluster correspondence constraints using the canonical parameterizations of cluster-decorated latent spaces. We first (1) transform the latent space to a canonical parameter domain by composing barycenter translation, optimal transport merging and constrained harmonic mapping, and then (2) compute geometric registration with cluster constraints over the canonical parameter domains. This process realizes a bijective (one-to-one and onto) mapping between newly transformed latent spaces and generates a precise alignment of cluster pairs. Cross-domain generation is then achieved through the aligned latent spaces embedded in the encoder-decoder pipeline. Experiments on gray-scale and color images validate the efficiency, efficacy and applicability of GMapLatent, and demonstrate that the proposed model has superior performance over existing models.

A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs. Prompt Engineering vs. RAG

Arshia Kermani,Veronica Perez-Rosas,Vangelis Metsis

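Task: Systematically compare three approaches to mental health text analysis based on large language models (LLMs): prompt engineering, retrieval-augmented generation (RAG), and fine-tuning.

Motivation: Provide practical guidance for LLM-based solutions in mental health applications, weighing accuracy, computational requirements, and deployment flexibility.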

Details

Method: Evaluates the three approaches with LLaMA 3 on emotion classification and mental health condition detection tasks. Result: Fine-tuning achieves the highest accuracy on emotion classification and mental health condition detection (91% and 80%, respectively) but demands substantial computational resources; prompt engineering and RAG are more flexible but reach only moderate accuracy (40-68%). Conclusion: The findings offer a practical reference for choosing LLM approaches in mental health applications, where accuracy must be weighed against resource cost according to specific needs. Abstract: This study presents a systematic comparison of three approaches for the analysis of mental health text using large language models (LLMs): prompt engineering, retrieval augmented generation (RAG), and fine-tuning. Using LLaMA 3, we evaluate these approaches on emotion classification and mental health condition detection tasks across two datasets. Fine-tuning achieves the highest accuracy (91% for emotion classification, 80% for mental health conditions) but requires substantial computational resources and large training sets, while prompt engineering and RAG offer more flexible deployment with moderate performance (40-68% accuracy). Our findings provide practical insights for implementing LLM-based solutions in mental health applications, highlighting the trade-offs between accuracy, computational requirements, and deployment flexibility.

Improving underwater semantic segmentation with underwater image quality attention and muti-scale aggregation attention

Xin Zuo,Jiaran Jiang,Jifeng Shen,Wankou Yang

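Task: Propose a transformer-based framework (UWSegFormer) for semantic segmentation of low-quality underwater images.

Motivation: Low illumination underwater degrades imaging quality, which seriously harms semantic segmentation performance, especially the accuracy of object boundary contours.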

Details

Method: Proposes a UIQA module to strengthen the representation of high-quality semantic information, an MAA module that compensates for lost detail through multi-scale feature aggregation, and an ELL loss to improve edge learning. Result: Achieves mIoU of 82.12 and 71.41 on the SUIM and DUT datasets, respectively, with better segmentation completeness and boundary clarity than existing methods. Conclusion: By combining attention mechanisms with multi-scale feature aggregation, UWSegFormer significantly improves semantic segmentation of underwater images. Abstract: Underwater image understanding is crucial for both submarine navigation and seabed exploration. However, the low illumination in underwater environments degrades the imaging quality, which in turn seriously deteriorates the performance of underwater semantic segmentation, particularly for outlining the object region boundaries. To tackle this issue, we present UnderWater SegFormer (UWSegFormer), a transformer-based framework for semantic segmentation of low-quality underwater images. Firstly, we propose the Underwater Image Quality Attention (UIQA) module. This module enhances the representation of high-quality semantic information in underwater image feature channels through a channel self-attention mechanism. In order to address the issue of loss of imaging details due to the underwater environment, the Multi-scale Aggregation Attention (MAA) module is proposed. This module aggregates sets of semantic features at different scales by extracting discriminative information from high-level features, thus compensating for the semantic loss of detail in underwater objects. Finally, during training, we introduce Edge Learning Loss (ELL) in order to enhance the model's learning of underwater object edges and improve the model's prediction accuracy. Experiments conducted on the SUIM and DUT-USEG (DUT) datasets have demonstrated that the proposed method has advantages in terms of segmentation completeness, boundary clarity, and subjective perceptual details when compared to SOTA methods. In addition, the proposed method achieves the highest mIoU of 82.12 and 71.41 on the SUIM and DUT datasets, respectively. Code will be available at https://github.com/SAWRJJ/UWSegFormer.
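For readers unfamiliar with the mIoU metric reported above, the following is a minimal sketch of how mean intersection-over-union is commonly computed from predicted and ground-truth label maps. It is not the authors' evaluation code: the function name is hypothetical, inputs are assumed to be flattened label sequences, and benchmark implementations usually accumulate intersections and unions over the whole dataset rather than per image.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target.

    `pred` and `target` are flat sequences of integer class labels
    (one entry per pixel) of equal length.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")

# A 2x2 image flattened row by row, with two classes.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
score = mean_iou(pred, target, num_classes=2)  # class 0: 1/2, class 1: 2/3
```

Reported values such as 82.12 correspond to this quantity expressed as a percentage.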

BEATS: Bias Evaluation and Assessment Test Suite for Large Language Models

Alok Abhishek,Lisa Erickson,Tushar Bandopadhyay

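Task: Propose BEATS, a framework for evaluating bias, ethics, fairness, and factuality in large language models (LLMs).

Motivation: Quantify the bias and unfairness that may be present in LLM-generated content, in order to promote more responsible and ethically aligned AI models.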

Details

Method: Building on the BEATS framework, designs a bias benchmark with 29 metrics covering demographic, cognitive, and social biases, ethical reasoning, group fairness, and factuality risk. Result: Experimental data show that 37.65% of outputs from industry-leading models contain some form of bias, underscoring the risk of using these models in critical decision-making systems. Conclusion: The BEATS framework offers a scalable and statistically rigorous methodology for benchmarking LLMs, diagnosing bias, and developing mitigation strategies, with the goal of more socially responsible AI. Abstract: In this research, we introduce BEATS, a novel framework for evaluating Bias, Ethics, Fairness, and Factuality in Large Language Models (LLMs). Building upon the BEATS framework, we present a bias benchmark for LLMs that measures performance across 29 distinct metrics. These metrics span a broad range of characteristics, including demographic, cognitive, and social biases, as well as measures of ethical reasoning, group fairness, and factuality related misinformation risk. These metrics enable a quantitative assessment of the extent to which LLM generated responses may perpetuate societal prejudices that reinforce or expand systemic inequities. To achieve a high score on this benchmark an LLM must show very equitable behavior in its responses, making it a rigorous standard for responsible AI evaluation. Empirical results based on data from our experiment show that 37.65% of outputs generated by industry leading models contained some form of bias, highlighting a substantial risk of using these models in critical decision making systems. The BEATS framework and benchmark offer a scalable and statistically rigorous methodology to benchmark LLMs, diagnose factors driving biases, and develop mitigation strategies. With the BEATS framework, our goal is to help the development of more socially responsible and ethically aligned AI models.

CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition

Jongseo Lee,Joohyun Chang,Dongho Lee,Jinwoo Choi

Task: Propose CA^2ST, a transformer-based method for holistic video recognition.

Motivation: Existing models lack a balanced spatio-temporal understanding in video recognition.

Details

Method: Uses the two-stream architectures CAST and CAVA, in which Bottleneck Cross-Attention (B-CA) modules let spatial, temporal, and audio experts exchange information. Result: Shows balanced performance across multiple benchmarks, validating the effectiveness of the B-CA module. Conclusion: By integrating information from multiple experts through cross-attention, CA^2ST achieves balanced and holistic video understanding. Abstract: We propose Cross-Attention in Audio, Space, and Time (CA^2ST), a transformer-based method for holistic video recognition. Recognizing actions in videos requires both spatial and temporal understanding, yet most existing models lack a balanced spatio-temporal understanding of videos. To address this, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), using only RGB input. In each layer of CAST, Bottleneck Cross-Attention (B-CA) enables spatial and temporal experts to exchange information and make synergistic predictions. For holistic video understanding, we extend CAST by integrating an audio expert, forming Cross-Attention in Visual and Audio (CAVA). We validate the CAST on benchmarks with different characteristics, EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400, consistently showing balanced performance. We also validate the CAVA on audio-visual action recognition benchmarks, including UCF-101, VGG-Sound, KineticsSound, and EPIC-SOUNDS. With a favorable performance of CAVA across these datasets, we demonstrate the effective information exchange among multiple experts within the B-CA module. In summary, CA^2ST combines CAST and CAVA by employing spatial, temporal, and audio experts through cross-attention, achieving balanced and holistic video understanding.

Query and Conquer: Execution-Guided SQL Generation

Łukasz Borchmann,Marek Wydmuch

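Task: Propose a new method for generating complex outputs that significantly improves accuracy in text-to-SQL tasks.

Motivation: By using execution results to select the most semantically consistent query from multiple candidates, enable small, cost-effective models to surpass computationally intensive reasoning methods while lowering inference cost.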

Details

Method: Uses execution results to select the most semantically consistent query and integrates seamlessly with existing models. Result: Models surpass methods such as o1, o3-mini, and DeepSeek R1 while reducing inference cost by as much as 30 times. Conclusion: The method offers a practical and scalable path to state-of-the-art text-to-SQL generation. Abstract: We propose a novel approach for generating complex outputs that significantly improves accuracy in text-to-SQL tasks. Our method leverages execution results to select the most semantically consistent query from multiple candidates, enabling smaller, cost-effective models to surpass computationally intensive reasoning methods such as o1, o3-mini, and DeepSeek R1 while reducing inference cost by as much as 30 times. It integrates effortlessly with existing models, offering a practical and scalable pathway to state-of-the-art SQL generation.
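The idea of selecting among candidate queries by their execution results can be illustrated with a small sketch. The abstract does not specify the selection criterion, so the code below assumes the simplest possible one: execute every candidate, discard those that fail, and keep a query whose result set is shared by the most candidates. The table, the candidate queries, and the function names are all hypothetical, and exact result equality stands in for whatever notion of semantic consistency the paper actually uses.

```python
import sqlite3
from collections import Counter

def execute(conn, sql):
    """Run a query; return an order-insensitive, hashable result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))

def select_by_execution(conn, candidates):
    """Return the first candidate whose execution result is the most common one."""
    results = {sql: execute(conn, sql) for sql in candidates}
    valid = {sql: res for sql, res in results.items() if res is not None}
    if not valid:
        return None
    majority_result, _ = Counter(valid.values()).most_common(1)[0]
    for sql in candidates:
        if sql in valid and valid[sql] == majority_result:
            return sql

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'ann', 10.0), (2, 'bob', 25.0), (3, 'ann', 5.0);
""")

# Hypothetical model outputs for "total amount per customer".
candidates = [
    "SELECT customer FROM orders WHERE amount > 20",                # wrong semantics
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer",   # correct
    "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer ORDER BY customer",
    "SELECT customer, SUM(amount) FROM order GROUP BY customer",    # fails to execute
]
chosen = select_by_execution(conn, candidates)
```

Here the two differently written but equivalent queries agree on their output, so one of them is chosen over the outlier and the failing query.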

AU-TTT: Vision Test-Time Training model for Facial Action Unit Detection

Bohao Xing,Kaishen Yuan,Zitong Yu,Xin Liu,Heikki Kälviäinen

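Task: Propose AU-TTT, a new vision backbone for facial action unit (AU) detection.

Motivation: Address overfitting in AU detection caused by costly annotation and limited datasets, as well as the limits that the quadratic complexity of self-attention places on existing Transformer methods.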

Details

Method: Incorporates bidirectional Test-Time Training (TTT) blocks, introduces TTT Linear with optimized image scanning mechanisms, and designs an AU-specific Region of Interest (RoI) scanning mechanism. Result: Experiments show competitive performance in both within-domain and cross-domain scenarios. Conclusion: AU-TTT offers an effective solution for AU detection and improves model generalization. Abstract: Facial Action Units (AUs) detection is a cornerstone of objective facial expression analysis and a critical focus in affective computing. Despite its importance, AU detection faces significant challenges, such as the high cost of AU annotation and the limited availability of datasets. These constraints often lead to overfitting in existing methods, resulting in substantial performance degradation when applied across diverse datasets. Addressing these issues is essential for improving the reliability and generalizability of AU detection methods. Moreover, many current approaches leverage Transformers for their effectiveness in long-context modeling, but they are hindered by the quadratic complexity of self-attention. Recently, Test-Time Training (TTT) layers have emerged as a promising solution for long-sequence modeling. Additionally, TTT applies self-supervised learning for iterative updates during both training and inference, offering a potential pathway to mitigate the generalization challenges inherent in AU detection tasks. In this paper, we propose a novel vision backbone tailored for AU detection, incorporating bidirectional TTT blocks, named AU-TTT. Our approach introduces TTT Linear to the AU detection task and optimizes image scanning mechanisms for enhanced performance. Additionally, we design an AU-specific Region of Interest (RoI) scanning mechanism to capture fine-grained facial features critical for AU detection. Experimental results demonstrate that our method achieves competitive performance in both within-domain and cross-domain scenarios.

Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models

Rui Wang,Hongru Wang,Boyang Xue,Jianhui Pang,Shudong Liu,Yi Chen,Jiahao Qiu,Derek Fai Wong,Heng Ji,Kam-Fai Wong

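Task: Analyze the concept of reasoning economy in large language models (LLMs) and its application in the post-training and inference stages.

Motivation: System 2 reasoning improves task accuracy but is computationally expensive, whereas System 1 reasoning is efficient but yields suboptimal performance; this trade-off calls for optimizing reasoning economy.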

Details

Method: Through a comprehensive survey, analyzes the causes of reasoning inefficiency, the behavior of different reasoning patterns, and potential solutions. Result: Provides actionable insights for improving the reasoning economy of LLMs and summarizes open challenges. Conclusion: The survey offers strategies for optimizing the reasoning economy of LLMs and maintains a public repository to track developments in the field. Abstract: Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to perform complex reasoning tasks, transitioning from fast and intuitive thinking (System 1) to slow and deep reasoning (System 2). While System 2 reasoning improves task accuracy, it often incurs substantial computational costs due to its slow thinking nature and inefficient or unnecessary reasoning behaviors. In contrast, System 1 reasoning is computationally efficient but leads to suboptimal performance. Consequently, it is critical to balance the trade-off between performance (benefits) and computational costs (budgets), giving rise to the concept of reasoning economy. In this survey, we provide a comprehensive analysis of reasoning economy in both the post-training and test-time inference stages of LLMs, encompassing i) the cause of reasoning inefficiency, ii) behavior analysis of different reasoning patterns, and iii) potential solutions to achieve reasoning economy. By offering actionable insights and highlighting open challenges, we aim to shed light on strategies for improving the reasoning economy of LLMs, thereby serving as a valuable resource for advancing research in this evolving area. We also provide a public repository to continually track developments in this fast-evolving field.

Beyond Academic Benchmarks: Critical Analysis and Best Practices for Visual Industrial Anomaly Detection

Aimira Baitieva,Yacine Bouaouni,Alexandre Briot,Dick Ameln,Souhaiel Khalfaoui,Samet Akcay

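Task: Establish benchmarks from real production data and fairly compare existing state-of-the-art methods, in order to advance the practical application of visual anomaly detection research.

Motivation: Current anomaly detection research relies mostly on data from lab environments, which is disconnected from real production conditions, so methods perform poorly in industrial settings.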

Details

Method: Establishes benchmarks using real production data and fairly compares existing methods using practically relevant metrics. Result: Demonstrates the importance of real-world datasets and provides a comprehensive analysis of existing methods. Conclusion: Emphasizes the importance of real-world datasets and practical metrics, and offers new perspectives on bridging the gap between academia and industry. Abstract: Anomaly detection (AD) is essential for automating visual inspection in manufacturing. This field of computer vision is rapidly evolving, with increasing attention towards real-world applications. Meanwhile, popular datasets are typically produced in controlled lab environments with artificially created defects, unable to capture the diversity of real production conditions. New methods often fail in production settings, showing significant performance degradation or requiring impractical computational resources. This disconnect between academic results and industrial viability threatens to misdirect visual anomaly detection research. This paper makes three key contributions: (1) we demonstrate the importance of real-world datasets and establish benchmarks using actual production data, (2) we provide a fair comparison of existing SOTA methods across diverse tasks by utilizing metrics that are valuable for practical applications, and (3) we present a comprehensive analysis of recent advancements in this field by discussing important challenges and new perspectives for bridging the academia-industry gap. The code is publicly available at https://github.com/abc-125/viad-benchmark

Enhancing Aviation Communication Transcription: Fine-Tuning Distil-Whisper with LoRA

Shokoufeh Mirzaei,Jesse Arzate,Yukti Vijay

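Task: Apply parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA) to the distil-Whisper model to improve aviation communication transcription.

Motivation: Aviation communication transcription has important applications in several areas, but fine-tuning existing models such as Whisper is not computationally efficient, so a more efficient fine-tuning approach is needed.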

Details

Method: Fine-tunes distil-Whisper with LoRA on the Air Traffic Control Corpus dataset, tuning hyperparameters through grid search and 5-fold cross-validation. Result: The fine-tuned model achieves an average word error rate of 3.86% across the five folds. Conclusion: The method markedly improves aviation communication transcription accuracy and shows potential for real-world use, such as in the cockpit. Abstract: Transcription of aviation communications has several applications, from assisting air traffic controllers in identifying the accuracy of read-back errors to search and rescue operations. Recent advances in artificial intelligence have provided unprecedented opportunities for improving aviation communication transcription tasks. OpenAI's Whisper is one of the leading automatic speech recognition models. However, fine-tuning Whisper for aviation communication transcription is not computationally efficient. Thus, this paper aims to use a Parameter-Efficient Fine-tuning method called Low-Rank Adaptation to fine-tune a more computationally efficient version of Whisper, distil-Whisper. To perform the fine-tuning, we used the Air Traffic Control Corpus dataset from the Linguistic Data Consortium, which contains approximately 70 hours of controller and pilot transmissions near three major airports in the US. The objective was to reduce the word error rate to enhance accuracy in the transcription of aviation communication. First, starting with an initial set of hyperparameters for LoRA (Alpha = 64 and Rank = 32), we performed a grid search. We applied a 5-fold cross-validation to find the best combination of distil-Whisper hyperparameters. Then, we fine-tuned the model for LoRA hyperparameters, achieving an impressive average word error rate of 3.86% across five folds. This result highlights the model's potential for use in the cockpit.
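Since the headline result is a word error rate (WER), a minimal sketch of the standard definition may help: WER is the word-level edit distance (substitutions, deletions, insertions) between reference and hypothesis, divided by the number of reference words. The function name and the sample transmission below are illustrative assumptions, and the sketch applies no text normalization, which real evaluations typically do before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical transmission: the hypothesis drops the word "and".
reference = "delta one two three climb and maintain flight level three five zero"
hypothesis = "delta one two three climb maintain flight level three five zero"
wer = word_error_rate(reference, hypothesis)  # 1 error over 12 reference words
```

An average WER of 3.86% therefore means fewer than four word-level errors per hundred reference words, averaged over the five validation folds.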

VideoGen-Eval: Agent-based System for Video Generation Evaluation

Yuhang Yang,Ke Fan,Shangkun Sun,Hongxiang Li,Ailing Zeng,FeiLin Han,Wei Zhai,Wei Liu,Yang Cao,Zheng-Jun Zha

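Task: Propose the VideoGen-Eval evaluation system to address the shortcomings of existing video generation evaluation methods.

Motivation: Existing evaluation systems cannot effectively assess advanced video generation models because of simple prompts, fixed evaluation operators, and misalignment with human preferences.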

Details

Method: Integrates LLM-based content structuring, MLLM-based content judgment, and patch tools for temporal-dense dimensions to build a dynamic, flexible, and expandable evaluation system. Result: Experiments validate the system's strong alignment with human preferences, as well as the diversity and richness of the benchmark. Conclusion: The VideoGen-Eval system completes evaluation tasks reliably, and the benchmark data is of high quality. Abstract: The rapid advancement of video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models, primarily due to simple prompts that cannot showcase the model's capabilities, fixed evaluation operators struggling with Out-of-Distribution (OOD) cases, and misalignment between computed metrics and human preferences. To bridge the gap, we propose VideoGen-Eval, an agent evaluation system that integrates LLM-based content structuring, MLLM-based content judgment, and patch tools designed for temporal-dense dimensions, to achieve a dynamic, flexible, and expandable video generation evaluation. Additionally, we introduce a video generation benchmark to evaluate existing cutting-edge models and verify the effectiveness of our evaluation system. It comprises 700 structured, content-rich prompts (both T2V and I2V) and over 12,000 videos generated by 20+ models; among them, 8 cutting-edge models are selected for quantitative evaluation by both the agent and humans. Extensive experiments validate that our proposed agent-based evaluation system demonstrates strong alignment with human preferences and reliably completes the evaluation, as well as the diversity and richness of the benchmark.

Bridging Language Models and Financial Analysis

Alejandro Lopez-Lira,Jihoon Kwon,Sangwoon Yoon,Jy-yong Sohn,Chanyeol Choi

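Task: Provide a survey of recent research progress on applying large language models (LLMs) in the financial domain.

Motivation: Financial data is complex and varied, and traditional methods struggle to handle it effectively; LLMs hold great potential, yet their practical adoption in the finance industry has been slow.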

Details

Method: Surveys novel methodologies from recent LLM research and analyzes their distinctive capabilities and potential applicability to financial data analysis. Result: Summarizes the potential of LLMs in finance and points out future research directions. Conclusion: The paper serves as a valuable resource for researchers and practitioners and promotes further application of LLMs in finance. Abstract: The rapid advancements in Large Language Models (LLMs) have unlocked transformative possibilities in natural language processing, particularly within the financial sector. Financial data is often embedded in intricate relationships across textual content, numerical tables, and visual charts, posing challenges that traditional methods struggle to address effectively. However, the emergence of LLMs offers new pathways for processing and analyzing this multifaceted data with increased efficiency and insight. Despite the fast pace of innovation in LLM research, there remains a significant gap in their practical adoption within the finance industry, where cautious integration and long-term validation are prioritized. This disparity has led to a slower implementation of emerging LLM techniques, despite their immense potential in financial applications. As a result, many of the latest advancements in LLM technology remain underexplored or not fully utilized in this domain. This survey seeks to bridge this gap by providing a comprehensive overview of recent developments in LLM research and examining their applicability to the financial sector. Building on previous survey literature, we highlight several novel LLM methodologies, exploring their distinctive capabilities and their potential relevance to financial data analysis. By synthesizing insights from a broad range of studies, this paper aims to serve as a valuable resource for researchers and practitioners, offering direction on promising research avenues and outlining future opportunities for advancing LLM applications in finance.

Maofu Liu,Jiahui Liu,Xiaokang Zhang

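Task: Generate semantically accurate descriptions that are closely linked to the visual features of remote sensing images.

Motivation: Existing methods overlook the complementary role of textual information in enhancing visual semantics and struggle to precisely locate the objects most relevant to the image context.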

Details

Method: Proposes a semantic-spatial feature fusion with dynamic graph refinement (SFDR) method, consisting of a semantic-spatial feature fusion (SSFF) module and a dynamic graph feature refinement (DGFR) module. Result: Experiments on three benchmark datasets demonstrate the effectiveness of the method. Conclusion: The SFDR method significantly improves the quality of the generated descriptions. Abstract: Remote sensing image captioning aims to generate semantically accurate descriptions that are closely linked to the visual features of remote sensing images. Existing approaches typically emphasize fine-grained extraction of visual features and capturing global information. However, they often overlook the complementary role of textual information in enhancing visual semantics and face challenges in precisely locating objects that are most relevant to the image context. To address these challenges, this paper presents a semantic-spatial feature fusion with dynamic graph refinement (SFDR) method, which integrates the semantic-spatial feature fusion (SSFF) and dynamic graph feature refinement (DGFR) modules. The SSFF module utilizes a multi-level feature representation strategy by leveraging pre-trained CLIP features, grid features, and ROI features to integrate rich semantic and spatial information. In the DGFR module, a graph attention network captures the relationships between feature nodes, while a dynamic weighting mechanism prioritizes objects that are most relevant to the current scene and suppresses less significant ones. Therefore, the proposed SFDR method significantly enhances the quality of the generated descriptions. Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed method. The source code will be available at https://github.com/zxk688.

Enhancing nonnative speech perception and production through an AI-powered application

Georgios P. Georgiou

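Task: Study the effect of an AI-powered mobile application on nonnative speakers' perception and production of English vowels.

Motivation: Existing research on AI-assisted foreign language pronunciation focuses mainly on comprehensibility and intelligibility and neglects improvement in the perception and production of individual speech sounds.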

Details

Method: Trains participants with the Speakometer mobile application, including recording tasks, pronunciation feedback, and practice, and evaluates the effect with a pretest and posttest. Result: After the intervention, participants improved significantly in discriminating and producing the target vowel contrast but did not reach native-like levels. Conclusion: AI applications can effectively support speech acquisition, which supports their potential use for personalized, interactive pronunciation training beyond the classroom. Abstract: While research on using Artificial Intelligence (AI) through various applications to enhance foreign language pronunciation is expanding, it has primarily focused on aspects such as comprehensibility and intelligibility, largely neglecting the improvement of individual speech sounds in both perception and production. This study seeks to address this gap by examining the impact of training with an AI-powered mobile application on nonnative sound perception and production. Participants completed a pretest assessing their ability to discriminate the second language English heed-hid contrast and produce these vowels in sentence contexts. The intervention involved training with the Speakometer mobile application, which incorporated recording tasks featuring the English vowels, along with pronunciation feedback and practice. The posttest mirrored the pretest to measure changes in performance. The results revealed significant improvements in both discrimination accuracy and production of the target contrast following the intervention. However, participants did not achieve native-like competence. These findings highlight the effectiveness of AI-powered applications in facilitating speech acquisition and support their potential use for personalized, interactive pronunciation training beyond the classroom.

Efficient Token Compression for Vision Transformer with Spatial Information Preserved

Junzhu Mao,Yang Shen,Jinyang Guo,Yazhou Yao,Xiansheng Hua

Task: Propose Prune and Merge, an efficient, hardware-compatible token compression method that reduces the computational and memory requirements of transformer models.

Motivation: Deploying transformer models in resource-constrained environments requires reducing their computational and memory requirements, and token compression is key to achieving this.

Details

Method: Integrates token pruning and merging operations, introducing trainable merge and reconstruct matrices and shortcut connections, combined with a gradient-weighted attention scoring mechanism, to achieve layer-wise token compression. Result: Experiments on ImageNet-1k and ADE20K show significant speed-ups (e.g., 1.64× on DeiT-Small) with minimal accuracy loss (only 0.2%). Conclusion: Prune and Merge outperforms existing methods in efficiency and effectiveness and suits the deployment of transformer models in resource-constrained environments. Abstract: Token compression is essential for reducing the computational and memory requirements of transformer models, enabling their deployment in resource-constrained environments. In this work, we propose an efficient and hardware-compatible token compression method called Prune and Merge. Our approach integrates token pruning and merging operations within transformer models to achieve layer-wise token compression. By introducing trainable merge and reconstruct matrices and utilizing shortcut connections, we efficiently merge tokens while preserving important information and enabling the restoration of pruned tokens. Additionally, we introduce a novel gradient-weighted attention scoring mechanism that computes token importance scores during the training phase, eliminating the need for separate computations during inference and enhancing compression efficiency. We also leverage gradient information to capture the global impact of tokens and automatically identify optimal compression structures. Extensive experiments on the ImageNet-1k and ADE20K datasets validate the effectiveness of our approach, achieving significant speed-ups with minimal accuracy degradation compared to state-of-the-art methods. For instance, on DeiT-Small, we achieve a 1.64$\times$ speed-up with only a 0.2% drop in accuracy on ImageNet-1k. Moreover, by compressing segmenter models and comparing with existing methods, we demonstrate the superior performance of our approach in terms of efficiency and effectiveness. Code and models have been made available at https://github.com/NUST-Machine-Intelligence-Laboratory/prune_and_merge.
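The general idea of combining pruning with merging can be illustrated with a small sketch. This is not the authors' implementation: the paper uses trainable merge and reconstruct matrices and gradient-weighted attention scores, whereas the sketch below assumes the importance scores are already given and uses a score-weighted average as a hypothetical merge rule. The function name `prune_and_merge` and its parameters are illustrative only.

```python
from typing import List, Tuple

Token = List[float]


def prune_and_merge(
    tokens: List[Token], scores: List[float], keep: int
) -> Tuple[List[Token], List[int]]:
    """Keep the `keep` highest-scoring tokens and fold the rest into one merged token.

    Returns the compressed token list and the indices of the kept tokens,
    listed in their original order so spatial order is preserved.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")

    # Rank indices by importance (highest first), then restore original order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    pruned = sorted(ranked[keep:])

    compressed = [tokens[i] for i in kept]
    if pruned:
        dim = len(tokens[0])
        total = sum(scores[i] for i in pruned)
        if total > 0:
            # Hypothetical merge rule: score-weighted average of pruned tokens.
            merged = [
                sum(scores[i] * tokens[i][d] for i in pruned) / total
                for d in range(dim)
            ]
        else:
            # Fall back to a plain average when all pruned scores are zero.
            merged = [
                sum(tokens[i][d] for i in pruned) / len(pruned) for d in range(dim)
            ]
        compressed.append(merged)
    return compressed, kept
```

With four tokens and `keep=2`, the output holds the two retained tokens plus one merged token, so the sequence shrinks from four to three entries while some information from the pruned tokens is retained.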

CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation

Peter Jansen,Oyvind Tafjord,Marissa Radensky,Pao Siangliulue,Tom Hope,Bhavana Dalvi Mishra,Bodhisattwa Prasad Majumder,Daniel S. Weld,Peter Clark

Task: Introduce CodeScientist, a new autonomous scientific discovery (ASD) system that explores a broader design space in the software domain (e.g., improved machine learning algorithms) and produces high-quality research artifacts.

Motivation: Current ASD systems have two main limitations: (1) their exploration is confined to variants of existing codebases or similarly constrained design spaces; (2) the research artifacts they produce (such as automatically generated papers and code) are typically evaluated only through conference-style paper review, with limited evaluation of the code.

Details

Method: CodeScientist frames ideation and experiment construction as a genetic search jointly over combinations of research articles and codeblocks (such as prompting a language model). Result: The system ran hundreds of automated experiments in the domain of agents and virtual environments and returned 19 discoveries, 6 of which were judged at least minimally sound and incrementally novel after a multi-faceted evaluation (external review, code review, and replication attempts). Conclusion: The discoveries span new tasks, agents, metrics, and data, suggesting a qualitative shift from benchmark optimization to broader discovery. Abstract: Despite the surge of interest in autonomous scientific discovery (ASD) of software artifacts (e.g., improved ML algorithms), current ASD systems face two key limitations: (1) they largely explore variants of existing codebases or similarly constrained design spaces, and (2) they produce large volumes of research artifacts (such as automatically generated papers and code) that are typically evaluated using conference-style paper review with limited evaluation of code. In this work we introduce CodeScientist, a novel ASD system that frames ideation and experiment construction as a form of genetic search jointly over combinations of research articles and codeblocks defining common actions in a domain (like prompting a language model). We use this paradigm to conduct hundreds of automated experiments on machine-generated ideas broadly in the domain of agents and virtual environments, with the system returning 19 discoveries, 6 of which were judged as being both at least minimally sound and incrementally novel after a multi-faceted evaluation beyond that typically conducted in prior work, including external (conference-style) review, code review, and replication attempts. Moreover, the discoveries span new tasks, agents, metrics, and data, suggesting a qualitative shift from benchmark optimization to broader discoveries.

Maofu Liu,Xin Jiang,Xiaokang Zhang

Task: Referring Remote Sensing Image Segmentation (RRSIS) aims to segment specific target objects in remote sensing images based on a given language expression.

Motivation: Existing methods typically obtain multimodal features through coarse-grained unidirectional alignment and overlook the contextual role of language features during decoding, which leads to weak object-level correspondence between visual and language features and to incomplete or erroneous predictions.

Details

Method: Proposes CADFormer, a fine-grained cross-modal alignment and decoding Transformer comprising a semantic mutual guidance alignment module (SMGAM) and a textual-enhanced cross-modal decoder (TCMD). Result: Experiments on the RRSIS-HR and RRSIS-D datasets demonstrate the effectiveness and superiority of CADFormer. Conclusion: Through fine-grained alignment and text-enhanced decoding, CADFormer significantly improves performance on the RRSIS task. Abstract: Referring Remote Sensing Image Segmentation (RRSIS) is a challenging task, aiming to segment specific target objects in remote sensing (RS) images based on a given language expression. Existing RRSIS methods typically employ coarse-grained unidirectional alignment approaches to obtain multimodal features, and they often overlook the critical role of language features as contextual information during the decoding process. Consequently, these methods exhibit weak object-level correspondence between visual and language features, leading to incomplete or erroneous predicted masks, especially when handling complex expressions and intricate RS image scenes. To address these challenges, we propose a fine-grained cross-modal alignment and decoding Transformer, CADFormer, for RRSIS. Specifically, we design a semantic mutual guidance alignment module (SMGAM) to achieve both vision-to-language and language-to-vision alignment, enabling comprehensive integration of visual and textual features for fine-grained cross-modal alignment. Furthermore, a textual-enhanced cross-modal decoder (TCMD) is introduced to incorporate language features during decoding, using refined textual information as context to enhance the relationship between cross-modal features. To thoroughly evaluate the performance of CADFormer, especially for inconspicuous targets in complex scenes, we constructed a new RRSIS dataset, called RRSIS-HR, which includes larger high-resolution RS image patches and semantically richer language expressions. Extensive experiments on the RRSIS-HR dataset and the popular RRSIS-D dataset demonstrate the effectiveness and superiority of CADFormer. Datasets and source codes will be available at https://github.com/zxk688.

InfoBid: A Simulation Framework for Studying Information Disclosure in Auctions with Large Language Model-based Agents

Yue Yin

Task: Study how information disclosure strategies in online advertising systems affect auction outcomes.

Motivation: Address the trade-off publishers face between efficiency and revenue potential when choosing an information disclosure strategy, using large language models (LLMs) to simulate multi-agent auction environments.

Details

Method: Proposes the InfoBid framework and uses GPT-4o to simulate second-price auctions with diverse information schemas. Result: Reveals how signaling influences strategic behavior and auction outcomes, consistent with economic and social learning theories. Conclusion: InfoBid offers a new tool for market simulation and information design, advances the use of LLMs in empirical studies, and bridges the gap between theory and practice. Abstract: In online advertising systems, publishers often face a trade-off in information disclosure strategies: while disclosing more information can enhance efficiency by enabling optimal allocation of ad impressions, it may lose revenue potential by decreasing uncertainty among competing advertisers. Similar to other challenges in market design, understanding this trade-off is constrained by limited access to real-world data, leading researchers and practitioners to turn to simulation frameworks. The recent emergence of large language models (LLMs) offers a novel approach to simulations, providing human-like reasoning and adaptability without necessarily relying on explicit assumptions about agent behavior modeling. Despite their potential, existing frameworks have yet to integrate LLM-based agents for studying information asymmetry and signaling strategies, particularly in the context of auctions. To address this gap, we introduce InfoBid, a flexible simulation framework that leverages LLM agents to examine the effects of information disclosure strategies in multi-agent auction settings. Using GPT-4o, we implemented simulations of second-price auctions with diverse information schemas. The results reveal key insights into how signaling influences strategic behavior and auction outcomes, which align with both economic and social learning theories. Through InfoBid, we hope to foster the use of LLMs as proxies for human economic and social agents in empirical studies, enhancing our understanding of their capabilities and limitations. This work bridges the gap between theoretical market designs and practical applications, advancing research in market simulations, information design, and agent-based reasoning while offering a valuable tool for exploring the dynamics of digital economies.
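For readers unfamiliar with the auction format mentioned in the abstract, the clearing rule of a sealed-bid second-price auction can be sketched as follows. This is only the textbook mechanism, not InfoBid's code; the `reserve` parameter and the alphabetical tie-break are assumptions added for illustration.

```python
from typing import Dict, Tuple


def second_price_auction(
    bids: Dict[str, float], reserve: float = 0.0
) -> Tuple[str, float]:
    """Return (winner, price) for a sealed-bid second-price auction.

    The highest bidder wins and pays the second-highest eligible bid,
    or the reserve price when no other bid meets the reserve.
    """
    eligible = {name: bid for name, bid in bids.items() if bid >= reserve}
    if not eligible:
        raise ValueError("no bid meets the reserve price")
    # Sort by bid descending; ties are broken by bidder name (an assumption).
    ranked = sorted(eligible.items(), key=lambda item: (-item[1], item[0]))
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else reserve
    return winner, price
```

Because the winner's payment does not depend on their own bid, bidding one's true valuation is a dominant strategy in this format, which is one reason it is a common baseline for studying how disclosed information changes bidder behavior.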

Reinforcement Learning-based Token Pruning in Vision Transformers: A Markov Game Approach

Chenglong Lu,Shen Liang,Xuewei Wang,Wei Wang

Task: Use reinforcement learning (RL) to adaptively learn token pruning policies for Vision Transformers (ViTs).

Motivation: Most existing token pruning policies are handcrafted, lack adaptivity to varying inputs, and ignore the sequential nature of pruning across layers.

Details

Method: Models token pruning as a Markov Game and uses Multi-Agent Proximal Policy Optimization (MAPPO), with each agent making the pruning decision for a single token. Result: On ImageNet-1k, inference speed improves by up to 44% with an accuracy drop of only 0.4%. Conclusion: The proposed RL method effectively balances efficiency and accuracy, providing an adaptive solution for token pruning in ViTs. Abstract: Vision Transformers (ViTs) have computational costs scaling quadratically with the number of tokens, calling for effective token pruning policies. Most existing policies are handcrafted, lacking adaptivity to varying inputs. Moreover, they fail to consider the sequential nature of token pruning across multiple layers. In this work, for the first time (as far as we know), we exploit Reinforcement Learning (RL) to data-adaptively learn a pruning policy. Formulating token pruning as a sequential decision-making problem, we model it as a Markov Game and utilize Multi-Agent Proximal Policy Optimization (MAPPO) where each agent makes an individualized pruning decision for a single token. We also develop reward functions that enable simultaneous collaboration and competition of these agents to balance efficiency and accuracy. On the well-known ImageNet-1k dataset, our method improves the inference speed by up to 44% while incurring only a negligible accuracy drop of 0.4%. The source code is available at https://github.com/daashuai/rl4evit.

Reasoning Beyond Limits: Advances and Open Problems for LLMs

Mohamed Amine Ferrag,Norbert Tihanyi,Merouane Debbah

Task: Provide a comprehensive analysis of 27 top large language models (LLMs) released between 2023 and 2025 and an overview of their training methodologies.

Motivation: Breakthroughs in generative reasoning have advanced LLMs' ability to retrieve information dynamically and reason over multiple steps on complex problems, which calls for a systematic summary of the latest models and methods.

Details

Method: Analyzes 27 LLMs, covering general training approaches, MoE and architectural innovations, RAG, chain-of-thought and self-improvement techniques, test-time compute scaling, distillation, and reinforcement learning. Result: Summarizes the training methodologies of these LLMs and the resulting gains in reasoning capability. Conclusion: Discusses key challenges in advancing LLM capabilities, including multi-step reasoning without human supervision, limitations in chained tasks, balancing structured prompts with flexibility, and long-context retrieval and external tool integration. Abstract: Recent generative reasoning breakthroughs have transformed how large language models (LLMs) tackle complex problems by dynamically retrieving and refining information while generating coherent, multi-step thought processes. Techniques such as inference-time scaling, reinforcement learning, supervised fine-tuning, and distillation have been successfully applied to models like DeepSeek-R1, OpenAI's o1 & o3, GPT-4o, Qwen-32B, and various Llama variants, resulting in enhanced reasoning capabilities. In this paper, we provide a comprehensive analysis of the top 27 LLM models released between 2023 and 2025 (including models such as Mistral AI Small 3 24B, DeepSeek-R1, Search-o1, QwQ-32B, and phi-4). Then, we present an extensive overview of training methodologies that spans general training approaches, mixture-of-experts (MoE) and architectural innovations, retrieval-augmented generation (RAG), chain-of-thought and self-improvement techniques, as well as test-time compute scaling, distillation, and reinforcement learning (RL) methods. Finally, we discuss the key challenges in advancing LLM capabilities, including improving multi-step reasoning without human supervision, overcoming limitations in chained tasks, balancing structured prompts with flexibility, and enhancing long-context retrieval and external tool integration.

TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes

Nikai Du,Zhennan Chen,Zhizhou Chen,Shan Gao,Xi Chen,Zhengkai Jiang,Jian Yang,Ying Tai

Task: Explore the Complex Visual Text Generation (CVTG) task, addressing distorted, blurred, or missing visual text in the output of image generation models.

Motivation: In CVTG, image generation models often render visual text distorted or omit it, which calls for a new method that improves text rendering quality.

Details

Method: Proposes TextCrafter, which uses a progressive strategy to decompose complex visual text and incorporates a token focus enhancement mechanism. Result: TextCrafter effectively addresses text confusion, omissions, and blurriness and surpasses existing methods in experiments. Conclusion: TextCrafter offers an effective solution for CVTG, and the work also introduces a new benchmark dataset, CVTG-2K. Abstract: This paper explores the task of Complex Visual Text Generation (CVTG), which centers on generating intricate textual content distributed across diverse regions within visual images. In CVTG, image generation models often render distorted and blurred visual text or omit some of the visual text. To tackle these challenges, we propose TextCrafter, a novel multi-visual text rendering method. TextCrafter employs a progressive strategy to decompose complex visual text into distinct components while ensuring robust alignment between textual content and its visual carrier. Additionally, it incorporates a token focus enhancement mechanism to amplify the prominence of visual text during the generation process. TextCrafter effectively addresses key challenges in CVTG tasks, such as text confusion, omissions, and blurriness. Moreover, we present a new benchmark dataset, CVTG-2K, tailored to rigorously evaluate the performance of generative models on CVTG tasks. Extensive experiments demonstrate that our method surpasses state-of-the-art approaches.

Training in translation tools and technologies: Findings of the EMT survey 2023

Andrew Rothwell,Joss Moorkens,Tomas Svoboda

Task: Survey the computerized tools and technologies taught in postgraduate translation training programmes.

Motivation: Understand how innovations in translation technology affect programmes, and which lasting changes resulted from the flexibility required during the COVID-19 pandemic.

Details

Method: A questionnaire survey covering postgraduate translation training programmes inside and outside the EMT Network. Result: Programmes respond quickly to innovations in translation technology, with increased compulsory coverage of machine translation, post-editing, and quality evaluation and rapid adaptation to generative tools. The pandemic helped drive a shift in course delivery from conventional labs to students' personal devices. Conclusion: Translation technology teaching continues to evolve in the range of tools taught, the embedding of professional contexts, and delivery modes, reflecting industry needs and technological developments. Abstract: This article reports on the third iteration of a survey of computerized tools and technologies taught as part of postgraduate translation training programmes. While the survey was carried out under the aegis of the EMT Network, more than half of responses are from outside that network. The results show the responsiveness of programmes to innovations in translation technology, with increased compulsory inclusion of machine translation, post-editing, and quality evaluation, and a rapid response to the release of generative tools. The flexibility required during the Covid-19 pandemic has also led to some lasting changes to programmes. While the range of tools being taught has continued to expand, programmes seem to be consolidating their core offering around cloud-based software with cost-free academic access. There has also been an increase in the embedding of professional contexts and workflows associated with translation technology. Generic file management and data security skills have increased in perceived importance, and legal and ethical issues related to translation data have also become more prominent. In terms of course delivery the shift away from conventional labs identified in EMT2017 has accelerated markedly, no doubt partly driven by the pandemic, accompanied by a dramatic expansion in the use of students' personal devices.

OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model

Xingcheng Zhou,Xuyuan Han,Feng Yang,Yunpu Ma,Alois C. Knoll

Task: Develop OpenDriveVLA, a vision-language action model for end-to-end autonomous driving.

Motivation: Build on open-source pre-trained large vision-language models (VLMs) to generate reliable driving actions conditioned on 3D environmental perception, ego vehicle states, and driver commands, improving autonomous driving performance.

Details

Method: Proposes a hierarchical vision-language alignment process that projects 2D and 3D visual tokens into a unified semantic space, and models dynamic relationships through an autoregressive agent-env-ego interaction process. Result: On the nuScenes dataset, OpenDriveVLA achieves state-of-the-art results in open-loop trajectory planning and driving-related question answering and generates trajectories robustly. Conclusion: OpenDriveVLA shows potential for next-generation end-to-end autonomous driving; the code will be released to facilitate further research. Abstract: We present OpenDriveVLA, a Vision-Language Action (VLA) model designed for end-to-end autonomous driving. OpenDriveVLA builds upon open-source pre-trained large Vision-Language Models (VLMs) to generate reliable driving actions, conditioned on 3D environmental perception, ego vehicle states, and driver commands. To bridge the modality gap between driving visual representations and language embeddings, we propose a hierarchical vision-language alignment process, projecting both 2D and 3D structured visual tokens into a unified semantic space. Besides, OpenDriveVLA models the dynamic relationships between the ego vehicle, surrounding agents, and static road elements through an autoregressive agent-env-ego interaction process, ensuring both spatially and behaviorally informed trajectory planning. Extensive experiments on the nuScenes dataset demonstrate that OpenDriveVLA achieves state-of-the-art results across open-loop trajectory planning and driving-related question-answering tasks. Qualitative analyses further illustrate OpenDriveVLA's superior capability to follow high-level driving commands and robustly generate trajectories under challenging scenarios, highlighting its potential for next-generation end-to-end autonomous driving. We will release our code to facilitate further research in this domain.

Adaptive Integrated Layered Attention (AILA)

William Claster,Suhas KM,Dhairya Gundechia

Task: Propose Adaptive Integrated Layered Attention (AILA), a neural network architecture that combines dense skip connections with adaptive feature reuse mechanisms.

Motivation: Improve network performance through adaptive feature reuse while reducing training and inference time.

Details

Method: Designs two architectures, AILA-Architecture 1 (linear-layer connections) and AILA-Architecture 2 (attention-based connections), and tests them in a single-task learning setting. Result: On price forecasting, image recognition, and sentiment analysis, AILA matches strong baselines with significantly lower training and inference time. Conclusion: AILA's adaptive inter-layer connections improve feature reuse, offering an efficient solution for long-range sequence modeling, image recognition, and classification. Abstract: We propose Adaptive Integrated Layered Attention (AILA), a neural network architecture that combines dense skip connections with different mechanisms for adaptive feature reuse across network layers. We evaluate AILA on three challenging tasks: price forecasting for various commodities and indices (S&P 500, Gold, US dollar Futures, Coffee, Wheat), image recognition using the CIFAR-10 dataset, and sentiment analysis on the IMDB movie review dataset. In all cases, AILA matches strong deep learning baselines (LSTMs, Transformers, and ResNets), achieving it at a fraction of the training and inference time. Notably, we implement and test two versions of the model - AILA-Architecture 1, which uses simple linear layers as the connection mechanism between layers, and AILA-Architecture 2, which implements an attention mechanism to selectively focus on outputs from previous layers. Both architectures are applied in a single-task learning setting, with each model trained separately for individual tasks. Results confirm that AILA's adaptive inter-layer connections yield robust gains by flexibly reusing pertinent features at multiple network depths. The AILA approach thus presents an extension to existing architectures, improving long-range sequence modeling, image recognition with optimised computational speed, and SOTA classification performance in practice.

Internal Organ Localization Using Depth Images

Eytan Kats,Kai Geißler,Jochen G. Hirsch,Stefan Heldman,Mattias P. Heinrich

Task: Investigate a deep learning framework based on RGB-D cameras for inferring approximate internal organ positions from depth images of the body surface.

Motivation: Automated patient positioning is a key step in streamlining MRI workflows and increasing patient throughput, and RGB-D camera systems offer a promising solution by leveraging depth information.

Details

Method: Trains a deep learning model on a large-scale dataset of MRI scans to predict organ positions and shapes from depth images alone. Result: The method is effective at localizing multiple internal organs, including bones and soft tissues. Conclusion: Integrating RGB-D camera systems into MRI workflows could enable accurate, automated patient positioning, streamlining scanning procedures and improving patient experience. Abstract: Automated patient positioning is a crucial step in streamlining MRI workflows and enhancing patient throughput. RGB-D camera-based systems offer a promising approach to automate this process by leveraging depth information to estimate internal organ positions. This paper investigates the feasibility of a learning-based framework to infer approximate internal organ positions from the body surface. Our approach utilizes a large-scale dataset of MRI scans to train a deep learning model capable of accurately predicting organ positions and shapes from depth images alone. We demonstrate the effectiveness of our method in localization of multiple internal organs, including bones and soft tissues. Our findings suggest that RGB-D camera-based systems integrated into MRI workflows have the potential to streamline scanning procedures and improve patient experience by enabling accurate and automated patient positioning.

L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution

Simeng Sun,Cheng-Ping Hsieh,Faisal Ladhak,Erik Arakelyan,Santiago Akle Serano,Boris Ginsburg

Task: Systematically evaluate language models' ability to generate step-by-step, error-free execution traces.

Motivation: Complex reasoning tasks depend on applying simple rules step by step, but current benchmarks focus mainly on outcome correctness and lack evaluation of procedural correctness.

Details

Method: Introduces L0-Bench, a benchmark based on synthetic Python functions that evaluates models' ability to generate correct reasoning processes. Result: All models degrade as the number of target trace steps increases, while larger models and reasoning-enhanced models hold up better over multiple steps. Conclusion: L0-Bench shows substantial room to improve "level-0" reasoning and points to directions for building more reliable reasoning systems. Abstract: Complex reasoning tasks often rely on the ability to consistently and accurately apply simple rules across incremental steps, a foundational capability which we term "level-0" reasoning. To systematically evaluate this capability, we introduce L0-Bench, a language model benchmark for testing procedural correctness -- the ability to generate correct reasoning processes, complementing existing benchmarks that primarily focus on outcome correctness. Given synthetic Python functions with simple operations, L0-Bench grades models on their ability to generate step-by-step, error-free execution traces. The synthetic nature of L0-Bench enables systematic and scalable generation of test programs along various axes (e.g., number of trace steps). We evaluate a diverse array of recent closed-source and open-weight models on a baseline test set. All models exhibit degradation as the number of target trace steps increases, while larger models and reasoning-enhanced models better maintain correctness over multiple steps. Additionally, we use L0-Bench to explore test-time scaling along three dimensions: input context length, number of solutions for majority voting, and inference steps. Our results suggest substantial room to improve "level-0" reasoning and potential directions to build more reliable reasoning systems.
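The difference between outcome correctness and procedural correctness can be made concrete with a toy grader. The sketch below is not L0-Bench's format or scoring code: the three-operation mini-language, the `reference_trace` and `grade_trace` helpers, and the all-steps-must-match rule are assumptions chosen to illustrate grading a step-by-step trace against a ground-truth execution.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, int]

# Hypothetical mini-language: each step applies one simple operation.
OPS: Dict[str, Callable[[int, int], int]] = {
    "add": lambda x, a: x + a,
    "sub": lambda x, a: x - a,
    "mul": lambda x, a: x * a,
}


def reference_trace(start: int, program: List[Step]) -> List[int]:
    """Execute the program and record the value after every step."""
    value = start
    trace = []
    for op, arg in program:
        value = OPS[op](value, arg)
        trace.append(value)
    return trace


def grade_trace(candidate: List[int], reference: List[int]) -> Tuple[bool, int]:
    """Return (fully_correct, number_of_leading_steps_that_match)."""
    matched = 0
    for got, want in zip(candidate, reference):
        if got != want:
            break
        matched += 1
    fully_correct = matched == len(reference) and len(candidate) == len(reference)
    return fully_correct, matched
```

Under this rule a candidate trace that reaches the right final value through a wrong intermediate step still fails, which is the distinction the abstract draws between procedural and outcome correctness.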

Efficient Dynamic Attention 3D Convolution for Hyperspectral Image Classification

Guandong Li,Mengxia Ye

Task: Propose a dynamic attention convolution design based on an improved 3D-DenseNet model for hyperspectral image classification.

Motivation: Address the problems deep neural networks face in hyperspectral image classification: insufficient use of joint spatial-spectral information, vanishing gradients, and overfitting.

Details

Method: Replaces a single kernel with multiple parallel convolutional kernels and assigns them dynamic attention weights, achieving adaptive feature response in the spatial dimension and dynamic discrimination of bands in the spectral dimension. Result: On the IN, UP, and KSC datasets, the method outperforms mainstream hyperspectral image classification methods in both inference speed and accuracy. Conclusion: The dynamic attention convolution design aggregates multiple kernels through attention, enhancing representation capability without increasing network depth or width. Abstract: Deep neural networks face several challenges in hyperspectral image classification, including insufficient utilization of joint spatial-spectral information, gradient vanishing with increasing depth, and overfitting. To enhance feature extraction efficiency while skipping redundant information, this paper proposes a dynamic attention convolution design based on an improved 3D-DenseNet model. The design employs multiple parallel convolutional kernels instead of a single kernel and assigns dynamic attention weights to these parallel convolutions. This dynamic attention mechanism achieves adaptive feature response based on spatial characteristics in the spatial dimension of hyperspectral images, focusing more on key spatial structures. In the spectral dimension, it enables dynamic discrimination of different bands, alleviating information redundancy and computational complexity caused by high spectral dimensionality. The dynamic attention convolution (DAC) module enhances model representation capability by attention-based aggregation of multiple convolutional kernels without increasing network depth or width. The proposed method demonstrates superior performance in both inference speed and accuracy, outperforming mainstream hyperspectral image classification methods on the IN, UP, and KSC datasets.
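The core mechanism, attention-weighted aggregation of parallel kernels, can be sketched in one dimension. This is a simplified illustration rather than the paper's 3D module: the attention logits are passed in by the caller (in a real network they would be computed from the input), and the names `softmax` and `dynamic_conv1d` are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_conv1d(
    signal: List[float], kernels: List[List[float]], logits: List[float]
) -> List[float]:
    """Aggregate parallel kernels with attention weights, then convolve once."""
    if len(kernels) != len(logits):
        raise ValueError("one attention logit is required per kernel")
    weights = softmax(logits)
    width = len(kernels[0])
    # Aggregated kernel: attention-weighted sum of the parallel kernels.
    kernel = [
        sum(w * k[j] for w, k in zip(weights, kernels)) for j in range(width)
    ]
    # "Valid" cross-correlation, the convention used by deep-learning conv layers.
    return [
        sum(kernel[j] * signal[i + j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]
```

Because the kernels are combined before the convolution, the cost at inference is close to that of a single convolution, which is consistent with the abstract's claim that capacity grows without adding depth or width.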

Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models

Hung-Yueh Chiang,Chi-Chih Chang,Natalia Frumkin,Kai-Chiang Wu,Mohamed S. Abdelfattah,Diana Marculescu

Task: Propose Quamba2, a quantization method that supports multiple bit-width configurations (e.g., W8A8, W4A8, W4A16) to optimize the deployment of state space models (SSMs) on different platforms.

Motivation: State space models (SSMs) are a compelling alternative to Transformers thanks to their consistent memory usage and high performance, but scaling them on cloud services or resource-constrained devices is limited by storage requirements and computational power. Quantization can shrink the model and exploit hardware acceleration, but SSMs are sensitive to quantization error, and bit-width configurations need to be tailored to different scenarios.

Details

Method: Based on the channel order preservation and activation persistence of SSMs, proposes an offline approach that quantizes the input $x$ to 8 bits through sorting and clustering, applies per-state-group quantization to the input-dependent parameters $B$ and $C$, and rearranges weights offline to ensure compute-invariance in the SSM output. Result: Experiments show that Quamba2-8B outperforms several state-of-the-art SSM quantization methods, delivering 1.3× and 3× speed-ups in the pre-filling and generation stages, respectively, and a 4× memory reduction with only a 1.6% average accuracy drop. Evaluation on MMLU shows the framework's generalizability and robustness. Conclusion: Quamba2 provides a flexible quantization solution for efficient SSM deployment across platforms, significantly improving performance and resource utilization. Abstract: State Space Models (SSMs) are emerging as a compelling alternative to Transformers because of their consistent memory usage and high performance. Despite this, scaling up SSMs on cloud services or limited-resource devices is challenging due to their storage requirements and computational power. To overcome this, quantizing SSMs with low bit-width data formats can reduce model size and benefit from hardware acceleration. As SSMs are prone to quantization-induced errors, recent efforts have focused on optimizing a particular model or bit-width for efficiency without sacrificing performance. However, distinct bit-width configurations are essential for different scenarios, like W4A8 for boosting large-batch decoding speed, and W4A16 for enhancing generation speed in short prompt applications for a single user. To this end, we present Quamba2, compatible with W8A8, W4A8, and W4A16 for both Mamba1 and Mamba2 backbones, addressing the growing demand for SSM deployment on various platforms. Based on the channel order preserving and activation persistence of SSMs, we propose an offline approach to quantize inputs of a linear recurrence in 8-bit by sorting and clustering for input $x$, combined with a per-state-group quantization for input-dependent parameters $B$ and $C$. To ensure compute-invariance in the SSM output, we rearrange weights offline according to the clustering sequence. The experiments show that Quamba2-8B outperforms several state-of-the-art SSM quantization methods and delivers 1.3$\times$ and 3$\times$ speed-ups in the pre-filling and generation stages, respectively, while offering 4$\times$ memory reduction with only a $1.6\%$ average accuracy drop. The evaluation on MMLU shows the generalizability and robustness of our framework. The code and quantized models will be released at: https://github.com/enyac-group/Quamba.
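To give a feel for why grouping channels by magnitude helps 8-bit quantization, here is a minimal sketch under strong simplifying assumptions. It is not the Quamba2 algorithm: it sorts channels by a calibrated maximum magnitude and splits them into equal-size groups (standing in for the paper's clustering), then applies symmetric int8 quantization with one scale per group. All function names are hypothetical.

```python
from typing import List, Tuple


def group_channels(channel_absmax: List[float], n_groups: int) -> List[List[int]]:
    """Sort channels by calibrated max magnitude and split into contiguous groups."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    order = sorted(range(len(channel_absmax)), key=lambda c: channel_absmax[c])
    size = -(-len(order) // n_groups)  # ceiling division
    return [order[i:i + size] for i in range(0, len(order), size)]


def quantize_by_group(
    x: List[float], channel_absmax: List[float], n_groups: int
) -> Tuple[List[int], List[List[int]], List[float]]:
    """Symmetric int8 quantization of x with one scale per channel group.

    Returns (quantized values in original channel order, groups, scales).
    """
    groups = group_channels(channel_absmax, n_groups)
    quantized = [0] * len(x)
    scales = []
    for group in groups:
        absmax = max(channel_absmax[c] for c in group)
        scale = absmax / 127 if absmax > 0 else 1.0
        scales.append(scale)
        for c in group:
            # Round to nearest integer and clamp to the symmetric int8 range.
            quantized[c] = max(-127, min(127, round(x[c] / scale)))
    return quantized, groups, scales
```

With a single shared scale, a few large-magnitude channels would force small channels to round to only a handful of integer levels; giving similar-magnitude channels their own scale keeps the rounding error proportional to each group's range.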

Embedding Shift Dissection on CLIP: Effects of Augmentations on VLM's Representation Learning

Ashim Dahal,Saydul Akbar Murad,Nick Rahimi

Task: Study the representation shift of the CLIP model under different data augmentation techniques.

Motivation: Understanding how the representations of vision-language models such as CLIP shift under different augmentations provides a foundation for mechanistic interpretability and adversarial data defense.

Details

Method: Analyzes the effect of 9 common augmentation techniques on CLIP embeddings, assessing similarity through attention maps, patches, edges, and other measures. Result: Augmentations such as noise, perspective transform, and scaling have a larger impact on embedding shift. Conclusion: Provides a concrete foundation for future research on the robustness and adversarial defense of vision-language models. Abstract: Understanding the representation shift on Vision Language Models like CLIP under different augmentations provides valuable insights on Mechanistic Interpretability. In this study, we show the shift on CLIP's embeddings on 9 common augmentation techniques: noise, blur, color jitter, scale and rotate, flip, elastic and perspective transforms, random brightness and contrast, and coarse dropout of pixel blocks. We scrutinize the embedding shifts under similarity on attention map, patch, edge, detail preservation, cosine similarity, L2 distance, pairwise distance and dendrogram clusters and provide qualitative analysis on sample images. Our findings suggest certain augmentations like noise, perspective transform and shift scaling have higher degree of drastic impact on embedding shift. This study provides a concrete foundation for future work on VLM's robustness for mechanical interpretation and adversarial data defense.
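Two of the measures named in the abstract, cosine similarity and L2 distance, are simple to state in code. The sketch below assumes the original and augmented embeddings are already available as plain lists of floats; it does not call CLIP, and the `embedding_shift` helper is a hypothetical convenience wrapper.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    norm_a = math.sqrt(sum(v * v for v in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def l2_distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def embedding_shift(original: List[float], augmented: List[float]) -> Dict[str, float]:
    """Summarize how far an augmentation moved an embedding."""
    return {
        "cosine_similarity": cosine_similarity(original, augmented),
        "l2_distance": l2_distance(original, augmented),
    }
```

The two measures are complementary: cosine similarity ignores changes in vector length and tracks only direction, while L2 distance is sensitive to both.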

HRET: A Self-Evolving LLM Evaluation Toolkit for Korean

Hanwool Lee,Soo Yong Kim,Dasol Choi,SangWon Baek,Seunghyeok Hong,Ilgyun Jeong,Inseon Hwang,Naeun Lee,Guijin Son

Task: Develop HRET, a standardized evaluation framework for Korean large language models.

Motivation: Evaluation of Korean large language models currently lacks a standardized framework, leading to inconsistent results that are hard to compare.

Details

Method: HRET unifies diverse evaluation methods (such as logit-based scoring, exact-match, language-inconsistency penalization, and LLM-as-a-Judge assessments), adopts a modular, registry-based architecture, and integrates major benchmarks and inference backends. Result: HRET provides a reproducible, fair, and transparent foundation for Korean NLP research. Conclusion: HRET fills the standardization gap in Korean LLM evaluation and gives researchers a reliable tool. Abstract: Recent advancements in Korean large language models (LLMs) have spurred numerous benchmarks and evaluation methodologies, yet the lack of a standardized evaluation framework has led to inconsistent results and limited comparability. To address this, we introduce HRET (Haerae Evaluation Toolkit), an open-source, self-evolving evaluation framework tailored specifically for Korean LLMs. HRET unifies diverse evaluation methods, including logit-based scoring, exact-match, language-inconsistency penalization, and LLM-as-a-Judge assessments. Its modular, registry-based architecture integrates major benchmarks (HAE-RAE Bench, KMMLU, KUDGE, HRM8K) and multiple inference backends (vLLM, HuggingFace, OpenAI-compatible endpoints). With automated pipelines for continuous evolution, HRET provides a robust foundation for reproducible, fair, and transparent Korean NLP research.

Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model

Jannik Endres,Oliver Hahn,Charles Corbière,Simone Schaub-Meyer,Stefan Roth,Alexandre Alahi

Task: Propose DFI-OmniStereo, a novel omnidirectional stereo matching method that improves the accuracy of omnidirectional depth perception.

Motivation: Omnidirectional depth perception is essential for mobile robotics, but existing methods achieve limited accuracy across diverse environments, depth ranges, and lighting conditions because real-world data is scarce.

Details

Method: Uses a large-scale pre-trained foundation model for relative monocular depth estimation within an iterative optimization-based stereo matching architecture, trained with a two-stage strategy. Result: Achieves state-of-the-art results on the Helvipad dataset, reducing disparity MAE by approximately 16%. Conclusion: By combining a pre-trained model with a dedicated training strategy, DFI-OmniStereo significantly improves the accuracy of omnidirectional stereo matching. Abstract: Omnidirectional depth perception is essential for mobile robotics applications that require scene understanding across a full 360° field of view. Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps without relying on expensive active sensing. However, existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments, depth ranges, and lighting conditions, due to the scarcity of real-world data. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation within an iterative optimization-based stereo matching architecture. We introduce a dedicated two-stage training strategy to utilize the relative monocular depth features for our omnidirectional stereo matching before scale-invariant fine-tuning. DFI-OmniStereo achieves state-of-the-art results on the real-world Helvipad dataset, reducing disparity MAE by approximately 16% compared to the previous best omnidirectional stereo method.

FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research

Gabriel Recchia,Chatrik Singh Mangat,Issac Li,Gayatri Krishnakumar

Task: Present and evaluate the FindTheFlaws datasets, built to support scalable oversight research.

Motivation: Few datasets contain long-form solutions, both correct and flawed, with expert verification, which limits evaluation of how well AI supervision approaches such as debate, critique, and prover-verifier games scale.

Details

Method: Builds five diverse datasets (spanning medicine, mathematics, science, coding, and the Lojban language) containing questions and long-form solutions with expert annotations of correctness or errors, and evaluates frontier models' critiquing capabilities. Result: Model performance varies across datasets and can be leveraged for scalable oversight experiments; on some tasks, expert baselines exceed model performance. Conclusion: FindTheFlaws fills an existing gap, supports research on the scalability of AI supervision methods, and offers new insight into the roles of models and experts in oversight. Abstract: As AI models tackle increasingly complex problems, ensuring reliable human oversight becomes more challenging due to the difficulty of verifying solutions. Approaches to scaling AI supervision include debate, in which two agents engage in structured dialogue to help a judge evaluate claims; critique, in which models identify potential flaws in proposed solutions; and prover-verifier games, in which a capable 'prover' model generates solutions that must be verifiable by a less capable 'verifier'. Evaluations of the scalability of these and similar approaches to difficult problems benefit from datasets that include (1) long-form expert-verified correct solutions and (2) long-form flawed solutions with annotations highlighting specific errors, but few are available. To address this gap, we present FindTheFlaws, a group of five diverse datasets spanning medicine, mathematics, science, coding, and the Lojban language. Each dataset contains questions and long-form solutions with expert annotations validating their correctness or identifying specific error(s) in the reasoning. We evaluate frontier models' critiquing capabilities and observe a range of performance that can be leveraged for scalable oversight experiments: models performing more poorly on particular datasets can serve as judges/verifiers for more capable models. Additionally, for some task/dataset combinations, expert baselines exceed even top model performance, making them more beneficial for scalable oversight experiments.

Siladittya Manna,Suresh Das,Sayantari Ghosh,Saumik Bhattacharya

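Task: Explore a federated self-supervised one-shot segmentation task suited to data-scarce settings.

Motivation: In applications such as medical image segmentation, large annotated datasets from a single source are hard to obtain; federated self-supervised learning offers a privacy-preserving alternative.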

Details

Method: Adapt CoWPro, an existing self-supervised few-shot segmentation framework, to the federated learning setting and add a fused Dice loss to improve performance. Result: On held-out data from the local clients, performance is on par with or better than the FedAvg version of CoWPro. Conclusion: The framework is the first attempt at self-supervised few-shot segmentation in federated learning and proves effective on multi-modality data. Abstract: Decentralized federated learning enables learning of data representations from multiple sources without compromising the privacy of the clients. In applications like medical image segmentation, where obtaining a large annotated dataset from a single source is a distressing problem, federated self-supervised learning can provide some solace. In this work, we push the limits further by exploring a federated self-supervised one-shot segmentation task representing a more data-scarce scenario. We adopt a pre-existing self-supervised few-shot segmentation framework, CoWPro, and adapt it to the federated learning scenario. To the best of our knowledge, this work is the first to attempt a self-supervised few-shot segmentation task in the federated learning domain. Moreover, we consider the clients to be constituted of data from different modalities and imaging techniques like MR or CT, which makes the problem even harder. Additionally, we reinforce and improve the baseline CoWPro method using a fused Dice loss, which shows considerable improvement in performance over the baseline CoWPro. Finally, we evaluate this novel framework on a completely unseen held-out part of the local client dataset. We observe that the proposed framework can achieve performance on par with or better than the FedAvg version of the CoWPro framework on the held-out validation dataset.
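The two standard building blocks named in this abstract, a Dice loss and FedAvg aggregation, can be sketched as follows. This is a minimal illustration in plain Python: `dice_loss` is the ordinary soft Dice loss, not the paper's fused variant (whose definition the abstract does not give), and the function names and flat-list representation are assumptions made for the example.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss over flat lists: pred holds probabilities, target holds 0/1 labels."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def fedavg(client_weights, client_sizes):
    """FedAvg aggregation: average client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    # A perfect prediction gives a loss near 0; a fully wrong one gives a loss near 1.
    print(dice_loss([1, 1, 0, 0], [1, 1, 0, 0]))
    print(dice_loss([0, 0, 1, 1], [1, 1, 0, 0]))
    # Two clients holding 1 and 3 samples respectively.
    print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Weighting by dataset size means a client with more local data pulls the global model further toward its own parameters, which matters when clients hold different modalities such as MR and CT.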

Agentic Large Language Models, a survey

Aske Plaat,Max van Duijn,Niki van Stein,Mike Preuss,Peter van der Putten,Kees Joost Batenburg

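Task: Survey the literature on agentic large language models (agentic LLMs) and propose a research agenda.

Motivation: Examine the potential of agentic LLMs to reason, act, and interact, and their impact on society.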

Details

Method: Organize the literature into three categories (reasoning, acting, and interacting) and analyze how they influence one another. Result: Agentic LLMs have important applications in medical diagnosis, logistics, and financial market analysis, and their inference-time behavior can generate new training data. Conclusion: Agentic LLMs have broad applications and are likely to benefit society, but their potential risks warrant attention. Abstract: There is great interest in agentic LLMs, large language models that act as agents. We review the growing body of work in this area and provide a research agenda. Agentic LLMs are LLMs that (1) reason, (2) act, and (3) interact. We organize the literature according to these three categories. The research in the first category focuses on reasoning, reflection, and retrieval, aiming to improve decision making; the second category focuses on action models, robots, and tools, aiming for agents that act as useful assistants; the third category focuses on multi-agent systems, aiming for collaborative task solving and simulating interaction to study emergent social behavior. We find that works mutually benefit from results in other categories: retrieval enables tool use, reflection improves multi-agent collaboration, and reasoning benefits all categories. We discuss applications of agentic LLMs and provide an agenda for further research. Important applications are in medical diagnosis, logistics and financial market analysis. Meanwhile, self-reflective agents playing roles and interacting with one another augment the process of scientific research itself. Further, agentic LLMs may provide a solution for the problem of LLMs running out of training data: inference-time behavior generates new training states, such that LLMs can keep learning without needing ever larger datasets. We note that there is risk associated with LLM assistants taking action in the real world, while agentic LLMs are also likely to benefit society.

Re-Aligning Language to Visual Objects with an Agentic Workflow

Yuming Chen,Jiangyan Feng,Haodong Zhang,Lijun Gong,Feng Zhu,Rui Zhao,Qibin Hou,Ming-Ming Cheng,Yibing Song

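Task: Re-align language expressions to visual objects to improve language-based object detection (LOD) models.

Motivation: Existing methods use vision-language models (VLMs) to generate object descriptions automatically, but hallucinations make the descriptions inaccurate and degrade vision-language alignment quality.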

Details

Method: Propose Real-LOD, an LLM-controlled agentic workflow with planning, tool-use, and reflection steps that adaptively adjusts image and text prompts to refine language descriptions step by step. Result: On standard benchmarks, an LOD model trained on the re-aligned data surpasses existing LOD methods by around 50%. Conclusion: By refining vision-language alignment automatically, Real-LOD preserves data quality while data quantity scales up, which improves LOD performance. Abstract: Language-based object detection (LOD) aims to align visual objects with language expressions. A large amount of paired data is utilized to improve LOD model generalizations. During the training process, recent studies leverage vision-language models (VLMs) to automatically generate human-like expressions for visual objects, facilitating training data scaling up. In this process, we observe that VLM hallucinations bring inaccurate object descriptions (e.g., object name, color, and shape) to deteriorate VL alignment quality. To reduce VLM hallucinations, we propose an agentic workflow controlled by an LLM to re-align language to visual objects via adaptively adjusting image and text prompts. We name this workflow Real-LOD, which includes planning, tool use, and reflection steps. Given an image with detected objects and VLM raw language expressions, Real-LOD reasons its state automatically and arranges action based on our neural symbolic designs (i.e., planning). The action will adaptively adjust the image and text prompts and send them to VLMs for object re-description (i.e., tool use). Then, we use another LLM to analyze these refined expressions for feedback (i.e., reflection). These steps are conducted in a cyclic form to gradually improve language descriptions for re-aligning to visual objects. We construct a dataset that contains a tiny amount of 0.18M images with re-aligned language expression and train a prevalent LOD model to surpass existing LOD methods by around 50% on the standard benchmarks. Our Real-LOD workflow, with automatic VL refinement, reveals a potential to preserve data quality along with scaling up data quantity, which further improves LOD performance from a data-alignment perspective.

Efficient Adaptation For Remote Sensing Visual Grounding

Hasan Moughnieh,Mohamad Chalhoub,Hasan Nasrallah,Cristiano Nattero,Paolo Campanella,Ali J. Ghandour

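Task: Use parameter-efficient fine-tuning (PEFT) to adapt foundation models to visual grounding (VG) tasks in remote sensing (RS).

Motivation: Foundation models perform well across multi-modal domains, but applying them directly to remote sensing gives sub-optimal results because of domain-specific challenges.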

Details

Method: Fine-tune the Grounding DINO and OFA foundation models with LoRA, BitFit, and adapters. Result: Performance matches or exceeds current state-of-the-art models at significantly lower computational cost. Conclusion: PEFT offers a practical, cost-effective route to efficient multi-modal analysis in remote sensing. Abstract: Foundation models have revolutionized artificial intelligence (AI), offering remarkable capabilities across multi-modal domains. Their ability to precisely locate objects in complex aerial and satellite images, using rich contextual information and detailed object descriptions, is essential for remote sensing (RS). These models can associate textual descriptions with object positions through the Visual Grounding (VG) task, but due to domain-specific challenges, their direct application to RS produces sub-optimal results. To address this, we applied Parameter-Efficient Fine-Tuning (PEFT) techniques to adapt these models for RS-specific VG tasks. Specifically, we evaluated LoRA placement across different modules in Grounding DINO and used BitFit and adapters to fine-tune the OFA foundation model pre-trained on general-purpose VG datasets. This approach achieved performance comparable to or surpassing current state-of-the-art (SOTA) models while significantly reducing computational costs. This study highlights the potential of PEFT techniques to advance efficient and precise multi-modal analysis in RS, offering a practical and cost-effective alternative to full model training.
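As a rough illustration of why LoRA is parameter-efficient, the sketch below applies the standard LoRA update W' = W + (alpha / r) * B A to a small weight matrix and counts trainable parameters. It is plain Python on nested lists; the helper names and the example dimensions are assumptions for illustration and are not taken from the paper's Grounding DINO or OFA configurations.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where W is d_out x d_in, B is d_out x r, A is r x d_in."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_trainable_params(d_out, d_in, r):
    """Only A (r x d_in) and B (d_out x r) are trained; W stays frozen."""
    return r * (d_out + d_in)


if __name__ == "__main__":
    w = [[1, 0], [0, 1]]      # frozen weight
    b = [[1], [2]]            # d_out x r, with r = 1
    a = [[3, 4]]              # r x d_in
    print(lora_effective_weight(w, a, b, alpha=1))
    # Hypothetical 768 x 768 layer with rank 8: trainable vs. full parameter count.
    print(lora_trainable_params(768, 768, 8), 768 * 768)
```

BitFit takes a different route to the same goal: it leaves every weight matrix frozen and trains only the bias terms.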

ReferDINO-Plus: 2nd Solution for 4th PVUW MeViS Challenge at CVPR 2025

Tianming Liang,Haichao Jiang,Wei-Shi Zheng,Jian-Fang Hu

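Task: Segment target objects in a video from a text description.

Motivation: The task is attracting growing attention in computer vision because of its potential applications in video editing and human-agent interaction.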

Details

Method: Incorporate SAM2's strengths in mask quality and object consistency, and introduce a conditional mask fusion strategy to balance performance between single-object and multi-object scenarios. Result: Achieves 60.43 $\mathcal{J}\&\mathcal{F}$ on the MeViS test set, placing 2nd in the MeViS PVUW challenge at CVPR 2025. Conclusion: ReferDINO-Plus improves performance markedly by combining the strengths of ReferDINO and SAM2. Abstract: Referring Video Object Segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This task has attracted increasing attention in the field of computer vision due to its promising applications in video editing and human-agent interaction. Recently, ReferDINO has demonstrated promising performance in this task by adapting object-level vision-language knowledge from pretrained foundational image models. In this report, we further enhance its capabilities by incorporating the advantages of SAM2 in mask quality and object consistency. In addition, to effectively balance performance between single-object and multi-object scenarios, we introduce a conditional mask fusion strategy that adaptively fuses the masks from ReferDINO and SAM2. Our solution, termed ReferDINO-Plus, achieves 60.43 $\mathcal{J}\&\mathcal{F}$ on the MeViS test set, securing 2nd place in the MeViS PVUW challenge at CVPR 2025. The code is available at: https://github.com/iSEE-Laboratory/ReferDINO-Plus.

Beyond Standard MoE: Mixture of Latent Experts for Resource-Efficient Language Models

Zehua Liu,Han Wu,Ruifeng She,Xiaojin Fu,Xiongwei Han,Tao Zhong,Mingxuan Yuan

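Task: Propose Mixture of Latent Experts (MoLE), a new parameterization that addresses the excessive memory and communication overhead of conventional MoE architectures.

Motivation: As the number of expert modules grows, conventional MoE architectures incur excessive memory and communication overhead, which limits their efficient use in large language models.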

Details

Method: Decompose each expert operation into a shared projection into a low-dimensional latent space followed by an expert-specific transformation, substantially reducing parameter count and computation. Result: MoLE improves computational efficiency while preserving representational capacity, and experiments confirm performance comparable to standard MoE. Conclusion: MoLE is an efficient, resource-saving refinement of the MoE architecture, suitable for training and inference of large language models. Abstract: Mixture of Experts (MoE) has emerged as a pivotal architectural paradigm for efficient scaling of Large Language Models (LLMs), operating through selective activation of parameter subsets for each input token. Nevertheless, conventional MoE architectures encounter substantial challenges, including excessive memory utilization and communication overhead during training and inference, primarily attributable to the proliferation of expert modules. In this paper, we introduce Mixture of Latent Experts (MoLE), a novel parameterization methodology that facilitates the mapping of specific experts into a shared latent space. Specifically, all expert operations are systematically decomposed into two principal components: a shared projection into a lower-dimensional latent space, followed by expert-specific transformations with significantly reduced parametric complexity. This factorized approach substantially diminishes parameter count and computational requirements. Beyond the pretraining implementation of the MoLE architecture, we also establish a rigorous mathematical framework for transforming pre-trained MoE models into the MoLE architecture, characterizing the sufficient conditions for optimal factorization and developing a systematic two-phase algorithm for this conversion process. Our comprehensive theoretical analysis demonstrates that MoLE significantly enhances computational efficiency across multiple dimensions while preserving model representational capacity. Empirical evaluations corroborate our theoretical findings, confirming that MoLE achieves performance comparable to standard MoE implementations while substantially reducing resource requirements.
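The factorization described above can be made concrete with a parameter count. The sketch below assumes one simple reading of the abstract: each expert is a single linear map, the shared projection maps d_in to d_latent, and each expert then applies its own d_latent-to-d_out matrix. The function names, dimensions, and single-matrix experts are illustrative assumptions; the paper's actual expert structure may differ.

```python
def moe_params(num_experts, d_in, d_out):
    """Standard MoE: every expert owns a full d_in x d_out matrix."""
    return num_experts * d_in * d_out


def mole_params(num_experts, d_in, d_out, d_latent):
    """Factorized form: one shared d_in x d_latent projection plus a small matrix per expert."""
    return d_in * d_latent + num_experts * d_latent * d_out


def matvec(m, v):
    """Multiply a matrix (nested lists) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def latent_expert_forward(x, shared_proj, expert_mats, expert_id):
    """Project x into the shared latent space, then apply the selected expert's transform."""
    z = matvec(shared_proj, x)
    return matvec(expert_mats[expert_id], z)


if __name__ == "__main__":
    # Hypothetical sizes: 8 experts, 1024 -> 4096, latent width 256.
    print(moe_params(8, 1024, 4096), mole_params(8, 1024, 4096, 256))
    shared = [[1, 1]]                       # d_latent = 1, d_in = 2
    experts = [[[2], [0]], [[1], [1]]]      # two experts, each d_out = 2 by d_latent = 1
    print(latent_expert_forward([1, 2], shared, experts, 0))
    print(latent_expert_forward([1, 2], shared, experts, 1))
```

Under these assumed sizes the factorized layer holds roughly a quarter of the parameters of the dense-expert layer, and the saving grows with the number of experts because only the small per-expert matrices are replicated.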

BoundMatch: Boundary detection applied to semi-supervised segmentation for urban-driving scenes

Haruya Ishikawa,Yoshimitsu Aoki

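Task: Propose BoundMatch, a multi-task framework that integrates semantic boundary detection into the consistency regularization pipeline of semi-supervised semantic segmentation.

Motivation: Current teacher-student consistency regularization methods overlook the precise delineation of object boundaries in dense pixel labeling; BoundMatch targets this gap.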

Details

Method: Combine Boundary Consistency Regularized Multi-Task Learning (BCRM) with two lightweight fusion modules (BSF and SGF) to improve the quality of boundary pseudo-labels. Result: Across multiple datasets, BoundMatch significantly improves boundary-specific evaluation metrics over existing methods and performs well on lightweight architectures. Conclusion: By integrating boundary detection with segmentation, BoundMatch improves semi-supervised semantic segmentation, particularly in boundary delineation. Abstract: Semi-supervised semantic segmentation (SS-SS) aims to mitigate the heavy annotation burden of dense pixel labeling by leveraging abundant unlabeled images alongside a small labeled set. While current teacher-student consistency regularization methods achieve strong results, they often overlook a critical challenge: the precise delineation of object boundaries. In this paper, we propose BoundMatch, a novel multi-task SS-SS framework that explicitly integrates semantic boundary detection into the consistency regularization pipeline. Our core mechanism, Boundary Consistency Regularized Multi-Task Learning (BCRM), enforces prediction agreement between teacher and student models on both segmentation masks and detailed semantic boundaries. To further enhance performance and sharpen contours, BoundMatch incorporates two lightweight fusion modules: Boundary-Semantic Fusion (BSF) injects learned boundary cues into the segmentation decoder, while Spatial Gradient Fusion (SGF) refines boundary predictions using mask gradients, leading to higher-quality boundary pseudo-labels. This framework is built upon SAMTH, a strong teacher-student baseline featuring a Harmonious Batch Normalization (HBN) update strategy for improved stability. Extensive experiments on diverse datasets including Cityscapes, BDD100K, SYNTHIA, ADE20K, and Pascal VOC show that BoundMatch achieves competitive performance against state-of-the-art methods while significantly improving boundary-specific evaluation metrics. We also demonstrate its effectiveness in realistic large-scale unlabeled data scenarios and on lightweight architectures designed for mobile deployment.
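The abstract says SGF refines boundary predictions using mask gradients. The sketch below shows only the underlying idea, that the spatial gradient of a segmentation mask peaks at object boundaries, using forward differences on a nested-list mask. It is not the SGF module itself, whose design the abstract does not specify; the function name and the finite-difference scheme are assumptions.

```python
def mask_gradient_magnitude(mask):
    """Forward-difference gradient magnitude of a 2D mask (nested lists of floats)."""
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clamp at the image border so edge pixels compare against themselves.
            right = mask[y][min(x + 1, w - 1)]
            down = mask[min(y + 1, h - 1)][x]
            gx = right - mask[y][x]
            gy = down - mask[y][x]
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out


if __name__ == "__main__":
    # A mask whose right half is foreground: the gradient is non-zero only at the vertical edge.
    mask = [[0, 0, 1, 1] for _ in range(3)]
    for row in mask_gradient_magnitude(mask):
        print(row)
```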

A large-scale image-text dataset benchmark for farmland segmentation

Chao Tao,Dandan Zhong,Weiliang Mu,Zhuofei Du,Haiyang Wu

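Task: Propose a language-driven learning paradigm and develop the FarmSeg-VL dataset to address the spatiotemporal heterogeneity of farmland in remote sensing imagery.

Motivation: The traditional deep learning paradigm relies on labeled data and struggles to model the dynamic temporal evolution and spatial heterogeneity of farmland, whereas language, as a structured knowledge carrier, can express farmland's spatiotemporal characteristics explicitly.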

Details

Method: Propose a semi-automatic annotation method to build FarmSeg-VL, which covers all four seasons and eight typical agricultural regions and includes rich descriptions of spatiotemporal characteristics. Result: FarmSeg-VL exhibits pronounced spatiotemporal characteristics, and experiments demonstrate its potential for farmland segmentation. Conclusion: FarmSeg-VL is the first fine-grained image-text benchmark dataset for farmland segmentation and supports language-driven research. Abstract: The traditional deep learning paradigm that solely relies on labeled data has limitations in representing the spatial relationships between farmland elements and the surrounding environment. It struggles to effectively model the dynamic temporal evolution and spatial heterogeneity of farmland. Language, as a structured knowledge carrier, can explicitly express the spatiotemporal characteristics of farmland, such as its shape, distribution, and surrounding environmental information. Therefore, a language-driven learning paradigm can effectively alleviate the challenges posed by the spatiotemporal heterogeneity of farmland. However, in the field of remote sensing imagery of farmland, there is currently no comprehensive benchmark dataset to support this research direction. To fill this gap, we introduced language-based descriptions of farmland and developed the FarmSeg-VL dataset, the first fine-grained image-text dataset designed for spatiotemporal farmland segmentation. Firstly, this article proposed a semi-automatic annotation method that can accurately assign a caption to each image, ensuring high data quality and semantic richness while improving the efficiency of dataset construction. Secondly, FarmSeg-VL exhibits significant spatiotemporal characteristics. In terms of the temporal dimension, it covers all four seasons. In terms of the spatial dimension, it covers eight typical agricultural regions across China. In addition, in terms of captions, FarmSeg-VL covers rich spatiotemporal characteristics of farmland, including its inherent properties, phenological characteristics, spatial distribution, topographic and geomorphic features, and the distribution of surrounding environments. Finally, we present a performance analysis of VLMs and the deep learning models that rely solely on labels trained on FarmSeg-VL, demonstrating its potential as a standard benchmark for farmland segmentation.

ViLAaD: Enhancing "Attracting and Dispersing" Source-Free Domain Adaptation with Vision-and-Language Model

Shuhei Tarashima,Xinqi Shu,Norio Tagawa

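Task: Propose ViLAaD, a method that uses vision-and-language (ViL) models to improve source-free domain adaptation (SFDA).

Motivation: Conventional SFDA methods are limited to the information in the pre-trained source model and the unlabeled target data, while approaches that draw on auxiliary resources are still at an early stage and leave room for research.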

Details

Method: Build on Attracting and Dispersing (AaD), a widely adopted SFDA technique, and generalize its core principle to integrate ViL models as a strong initialization for target adaptation, yielding ViLAaD and its enhanced version ViLAaD++. Result: ViLAaD outperforms both AaD and zero-shot ViL classification on several SFDA benchmarks, and ViLAaD++ achieves state-of-the-art performance across multiple SFDA scenarios. Conclusion: Incorporating ViL models significantly improves SFDA performance, and the method's flexibility supports further optimization and extension. Abstract: Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained source model to a target dataset from a different domain without access to the source data. Conventional SFDA methods are limited by the information encoded in the pre-trained source model and the unlabeled target data. Recently, approaches leveraging auxiliary resources have emerged, yet remain in their early stages, offering ample opportunities for research. In this work, we propose a novel method that incorporates auxiliary information by extending an existing SFDA framework using Vision-and-Language (ViL) models. Specifically, we build upon Attracting and Dispersing (AaD), a widely adopted SFDA technique, and generalize its core principle to naturally integrate ViL models as a powerful initialization for target adaptation. Our approach, called ViL-enhanced AaD (ViLAaD), preserves the simplicity and flexibility of the AaD framework, while leveraging ViL models to significantly boost adaptation performance. We validate our method through experiments using various ViL models, demonstrating that ViLAaD consistently outperforms both AaD and zero-shot classification by ViL models, especially when both the source model and ViL model provide strong initializations. Moreover, the flexibility of ViLAaD allows it to be seamlessly incorporated into an alternating optimization framework with ViL prompt tuning and extended with additional objectives for target model adaptation. Extensive experiments on four SFDA benchmarks show that this enhanced version, ViLAaD++, achieves state-of-the-art performance across multiple SFDA scenarios, including Closed-set SFDA, Partial-set SFDA, and Open-set SFDA.

Can DeepSeek-V3 Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery

Boyi Ma,Yanguang Zhao,Jie Wang,Guankun Wang,Kun Yuan,Tong Chen,Long Bai,Hongliang Ren

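Task: Study the dialogue capabilities of DeepSeek-V3 in robotic surgery scenarios, covering Single Phrase QA, Visual QA, and Detailed Description tasks.

Motivation: Assess how DeepSeek-V3 performs in surgical scenarios to determine whether it is suitable for surgery-related vision-language tasks.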

Details

Method: Run extensive evaluations on public datasets (including EndoVis18 and CholecT50) and their corresponding dialogue data. Result: DeepSeek-V3 performs well at recognizing surgical instruments and tissue but shows significant limitations in spatial position analysis and in understanding surgical actions. Conclusion: Without fine-tuning on surgery-specific datasets, DeepSeek-V3 is not ready for vision-language tasks in surgical contexts. Abstract: DeepSeek-V3, a recently emerging Large Language Model (LLM), demonstrates outstanding performance in general scene understanding, question-answering (QA), and text generation tasks, owing to its efficient training paradigm and strong reasoning capabilities. In this study, we investigate the dialogue capabilities of DeepSeek-V3 in robotic surgery scenarios, focusing on tasks such as Single Phrase QA, Visual QA, and Detailed Description. The Single Phrase QA tasks further include sub-tasks such as surgical instrument recognition, action understanding, and spatial position analysis. We conduct extensive evaluations using publicly available datasets, including EndoVis18 and CholecT50, along with their corresponding dialogue data. Our comprehensive evaluation results indicate that, when provided with specific prompts, DeepSeek-V3 performs well in surgical instrument and tissue recognition tasks. However, DeepSeek-V3 exhibits significant limitations in spatial position analysis and struggles to understand surgical actions accurately. Additionally, our findings reveal that, under general prompts, DeepSeek-V3 lacks the ability to effectively analyze global surgical concepts and fails to provide detailed insights into surgical scenarios. Based on our observations, we argue that DeepSeek-V3 is not ready for vision-language tasks in surgical contexts without fine-tuning on surgery-specific datasets.

BiPVL-Seg: Bidirectional Progressive Vision-Language Fusion with Global-Local Alignment for Medical Image Segmentation

Rafi Ibn Sultan,Hui Zhu,Chengyin Li,Dongxiao Zhu

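Task: Propose BiPVL-Seg, an end-to-end framework that integrates vision-language fusion and embedding alignment through architectural and training innovations to improve medical image segmentation.

Motivation: Medical image segmentation usually relies on visual data alone and ignores the rich textual information used in clinical diagnosis; existing methods process visual and textual features independently, which leads to weak cross-modal alignment.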

Details

Method: BiPVL-Seg uses a bidirectional progressive fusion architecture that enables stage-wise information exchange between the vision and text encoders, together with a global-local contrastive alignment objective that strengthens the text encoder's comprehension. Result: On diverse medical imaging benchmarks across CT and MR modalities, BiPVL-Seg outperforms state-of-the-art methods in complex multi-class segmentation. Conclusion: The synergy of vision-language fusion and alignment significantly improves medical image segmentation. Abstract: Medical image segmentation typically relies solely on visual data, overlooking the rich textual information clinicians use for diagnosis. Vision-language models attempt to bridge this gap, but existing approaches often process visual and textual features independently, resulting in weak cross-modal alignment. Simple fusion techniques fail due to the inherent differences between spatial visual features and sequential text embeddings. Additionally, medical terminology deviates from general language, limiting the effectiveness of off-the-shelf text encoders and further hindering vision-language alignment. We propose BiPVL-Seg, an end-to-end framework that integrates vision-language fusion and embedding alignment through architectural and training innovations, where both components reinforce each other to enhance medical image segmentation. BiPVL-Seg introduces bidirectional progressive fusion in the architecture, which facilitates stage-wise information exchange between vision and text encoders. Additionally, it incorporates global-local contrastive alignment, a training objective that enhances the text encoder's comprehension by aligning text and vision embeddings at both class and concept levels. Extensive experiments on diverse medical imaging benchmarks across CT and MR modalities demonstrate BiPVL-Seg's superior performance when compared with state-of-the-art methods in complex multi-class segmentation. Source code is available in this GitHub repository.

When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?

Tuo Liang,Zhe Hu,Jing Li,Hao Zhang,Yiren Lu,Yunlai Zhou,Yiran Qiao,Disheng Liu,Jeirui Peng,Jing Ma,Yu Yin

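Task: Study how vision-language models (VLMs) perform at understanding complex humorous narratives, and develop a new benchmark and methods to improve them.

Motivation: Understanding humor, especially humor built on complex, contradictory narratives, remains a challenge for VLMs and limits AI's capacity for human-like reasoning and cultural expression.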

Details

Method: Introduce the YesBut (V2) benchmark of 1,262 comic images, evaluate VLMs systematically on four complementary tasks, and explore text-based training and social knowledge augmentation. Result: Even the most advanced models fall significantly short of human performance, with common failures in visual perception, key element identification, comparative analysis, and hallucination. Conclusion: The study exposes weaknesses in VLMs' understanding of cultural and creative expression and points to comparative reasoning as a path toward context-aware models. Abstract: Understanding humor, particularly when it involves complex, contradictory narratives that require comparative reasoning, remains a significant challenge for large vision-language models (VLMs). This limitation hinders AI's ability to engage in human-like reasoning and cultural expression. In this paper, we investigate this challenge through an in-depth analysis of comics that juxtapose panels to create humor through contradictions. We introduce YesBut (V2), a novel benchmark with 1,262 comic images from diverse multilingual and multicultural contexts, featuring comprehensive annotations that capture various aspects of narrative understanding. Using this benchmark, we systematically evaluate a wide range of VLMs through four complementary tasks spanning from surface content comprehension to deep narrative reasoning, with particular emphasis on comparative reasoning between contradictory elements. Our extensive experiments reveal that even the most advanced models significantly underperform compared to humans, with common failures in visual perception, key element identification, comparative analysis and hallucinations. We further investigate text-based training strategies and social knowledge augmentation methods to enhance model performance. Our findings not only highlight critical weaknesses in VLMs' understanding of cultural and creative expressions but also provide pathways toward developing context-aware models capable of deeper narrative understanding through comparative reasoning.

Enhancing Creative Generation on Stable Diffusion-based Models

Jiyeon Han,Dahee Kwon,Gayoung Lee,Junho Kim,Jaesik Choi

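Task: Propose C3, a training-free method for enhancing the creativity of Stable Diffusion-based models.

Motivation: Existing text-to-image models such as Stable Diffusion are limited in creative capability, and simple prompting does not produce the desired creative outputs.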

Details

Method: C3 selectively amplifies features during the denoising process to encourage more creative outputs, and the paper gives practical guidelines for choosing amplification factors based on two aspects of creativity. Result: C3 is effective across various Stable Diffusion-based models without high computational cost. Conclusion: C3 is the first method to enhance creativity in diffusion models without adding a heavy computational burden. Abstract: Recent text-to-image generative models, particularly Stable Diffusion and its distilled variants, have achieved impressive fidelity and strong text-image alignment. However, their creative capability remains constrained, as including 'creative' in prompts seldom yields the desired results. This paper introduces C3 (Creative Concept Catalyst), a training-free approach designed to enhance creativity in Stable Diffusion-based models. C3 selectively amplifies features during the denoising process to foster more creative outputs. We offer practical guidelines for choosing amplification factors based on two main aspects of creativity. C3 is the first study to enhance creativity in diffusion models without extensive computational costs. We demonstrate its effectiveness across various Stable Diffusion-based models.

CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis

Anjiang Wei,Tarun Suresh,Jiannan Cao,Naveen Kannan,Yuheng Wu,Kai Yan,Thiago S. F. X. Teixeira,Ke Wang,Alex Aiken

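Task: Propose CodeARC, a new framework for evaluating LLM-based inductive program synthesis and inductive reasoning.

Motivation: Existing evaluation protocols rely on static example sets, provide no feedback mechanism, and fail to reflect real-world scenarios such as reverse engineering.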

Details

Method: Design the interactive CodeARC framework, in which agents query a hidden target function, synthesize candidate functions, and iteratively refine their solutions using differential testing. Result: The resulting large-scale benchmark contains 1114 functions; o3-mini performs best with a 52.7% success rate, and fine-tuning LLaMA-3.1-8B-Instruct yields up to a 31% relative performance gain. Conclusion: CodeARC provides a more realistic and challenging testbed for LLM-based program synthesis. Abstract: Inductive program synthesis, or programming by example, requires synthesizing functions from input-output examples that generalize to unseen inputs. While large language model agents have shown promise in programming tasks guided by natural language, their ability to perform inductive program synthesis is underexplored. Existing evaluation protocols rely on static sets of examples and held-out tests, offering no feedback when synthesized functions are incorrect and failing to reflect real-world scenarios such as reverse engineering. We propose CodeARC, the Code Abstraction and Reasoning Challenge, a new evaluation framework where agents interact with a hidden target function by querying it with new inputs, synthesizing candidate functions, and iteratively refining their solutions using a differential testing oracle. This interactive setting encourages agents to perform function calls and self-correction based on feedback. We construct the first large-scale benchmark for general-purpose inductive program synthesis, featuring 1114 functions. Among 18 models evaluated, o3-mini performs best with a success rate of 52.7%, highlighting the difficulty of this task. Fine-tuning LLaMA-3.1-8B-Instruct on curated synthesis traces yields up to a 31% relative performance gain. CodeARC provides a more realistic and challenging testbed for evaluating LLM-based program synthesis and inductive reasoning.
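The query, synthesize, and refine loop described above can be sketched with a toy setup. In the example below, the agent is replaced by a hypothetical `propose` function that picks from a fixed list of hypotheses, and the differential testing oracle simply compares candidate and target on a list of probe inputs. All names, the hypothesis list, and the probe inputs are assumptions for illustration and do not reflect CodeARC's actual interface.

```python
def differential_oracle(target, candidate, probe_inputs):
    """Return the first input on which candidate and target disagree, or None if none is found."""
    for x in probe_inputs:
        try:
            got = candidate(x)
        except Exception:
            return x  # a crash counts as a disagreement
        if got != target(x):
            return x
    return None


def synthesis_loop(target, propose, probe_inputs, max_rounds=5):
    """Alternate between proposing a candidate and collecting counterexamples as feedback."""
    examples = []
    for round_idx in range(max_rounds):
        candidate = propose(examples)
        counterexample = differential_oracle(target, candidate, probe_inputs)
        if counterexample is None:
            return candidate, round_idx + 1
        # Query the hidden function on the counterexample and keep the result as a new example.
        examples.append((counterexample, target(counterexample)))
    return None, max_rounds


# Toy stand-in for an LLM agent: pick the first hypothesis consistent with all examples so far.
HYPOTHESES = [lambda x: x, lambda x: x * 2, lambda x: x * x]


def propose(examples):
    for h in HYPOTHESES:
        if all(h(x) == y for x, y in examples):
            return h
    return HYPOTHESES[-1]


if __name__ == "__main__":
    hidden = lambda x: x * x
    found, rounds = synthesis_loop(hidden, propose, probe_inputs=[0, 1, 2, 3])
    print(rounds, found(5))
```

Each counterexample rules out at least one wrong hypothesis, which is the feedback that a static set of input-output examples cannot provide.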

DASH: Detection and Assessment of Systematic Hallucinations of VLMs

Maximilian Augustin,Yannic Neuhaus,Matthias Hein

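Task: Propose DASH (Detection and Assessment of Systematic Hallucinations), a method for automatically identifying and assessing systematic hallucinations of vision-language models (VLMs) in open-world settings.

Motivation: Existing benchmarks measure hallucinations on small labeled datasets, which neither cover the open-world settings where VLMs are widely used nor detect systematic errors in VLMs.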

Details

Method: DASH is an automatic, large-scale pipeline; its DASH-OPT component for image-based retrieval optimizes over the natural image manifold to generate images that mislead the VLM, which are then used to identify systematic hallucinations. Result: Applied to PaliGemma and LLaVA-NeXT models, DASH finds more than 19k clusters of systematic hallucinations containing 950k images, and fine-tuning PaliGemma on DASH images confirms the approach's usefulness. Conclusion: DASH identifies and helps mitigate systematic hallucinations in VLMs, offering a new way to evaluate models in open-world settings. Abstract: Vision-language models (VLMs) are prone to object hallucinations, where they erroneously indicate the presence of certain objects in an image. Existing benchmarks quantify hallucinations using relatively small, labeled datasets. However, this approach is i) insufficient to assess hallucinations that arise in open-world settings, where VLMs are widely used, and ii) inadequate for detecting systematic errors in VLMs. We propose DASH (Detection and Assessment of Systematic Hallucinations), an automatic, large-scale pipeline designed to identify systematic hallucinations of VLMs on real-world images in an open-world setting. A key component is DASH-OPT for image-based retrieval, where we optimize over the "natural image manifold" to generate images that mislead the VLM. The output of DASH consists of clusters of real and semantically similar images for which the VLM hallucinates an object. We apply DASH to PaliGemma and two LLaVA-NeXT models across 380 object classes and, in total, find more than 19k clusters with 950k images. We study the transfer of the identified systematic hallucinations to other VLMs and show that fine-tuning PaliGemma with the model-specific images obtained with DASH mitigates object hallucinations. Code and data are available at https://YanNeu.github.io/DASH.

TRA: Better Length Generalisation with Threshold Relative Attention

Mattia Opper,Roland Fernandez,Paul Smolensky,Jianfeng Gao

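Task: Study the limits of Transformer length generalisation and how to improve it.

Motivation: Transformers generalise poorly to longer sequences even on basic tasks, which may stem from two key failures of the self-attention mechanism.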

Details

Method: Modify self-attention with selective sparsity and contextualised relative distance. Result: The modified attention mechanism substantially improves the generalisation of decoder-only Transformers. Conclusion: Addressing the two key failures of self-attention effectively improves Transformer length generalisation. Abstract: Transformers struggle with length generalisation, displaying poor performance even on basic tasks. We test whether these limitations can be explained through two key failures of the self-attention mechanism. The first is the inability to fully remove irrelevant information. The second is tied to position: even if the dot product between a key and query is highly negative (i.e. an irrelevant key), learned positional biases may unintentionally up-weight such information, which is dangerous when distances become out of distribution. Put together, these two failure cases lead to compounding generalisation difficulties. We test whether they can be mitigated through the combination of a) selective sparsity - completely removing irrelevant keys from the attention softmax and b) contextualised relative distance - distance is only considered as between the query and the keys that matter. We show how refactoring the attention mechanism with these two mitigations in place can substantially improve generalisation capabilities of decoder-only transformers.
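The two mitigations can be illustrated for a single query. The sketch below is one possible reading of the abstract, not the paper's implementation: keys whose raw score falls below a threshold are dropped from the softmax entirely, and the relative distance of each surviving key is counted only over other surviving keys. The threshold value, the function names, and the convention that the query sits just after the last key are all assumptions.

```python
import math


def thresholded_softmax(scores, threshold=0.0):
    """Softmax over keys with score >= threshold; dropped keys get exactly zero weight."""
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    weights = [0.0] * len(scores)
    if not kept:
        return weights
    top = max(scores[i] for i in kept)
    exps = {i: math.exp(scores[i] - top) for i in kept}
    z = sum(exps.values())
    for i in kept:
        weights[i] = exps[i] / z
    return weights


def contextual_distances(scores, threshold=0.0):
    """Distance to the query counted only over surviving keys; dropped keys get None."""
    dists = [None] * len(scores)
    seen = 0
    for i in range(len(scores) - 1, -1, -1):
        if scores[i] >= threshold:
            dists[i] = seen
            seen += 1
    return dists


if __name__ == "__main__":
    scores = [2.0, -5.0, 1.0, -3.0, 2.0]
    print(thresholded_softmax(scores))
    print(contextual_distances(scores))
```

With the raw positions, the first key would sit at distance 4 from the query; counted over surviving keys only, it sits at distance 2, so inserting more irrelevant tokens does not push relevant keys to unseen distances.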

Multiview Image-Based Localization

Cameron Fiore,Hongyi Fan,Benjamin Kimia

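Task: Propose a hybrid approach that combines image retrieval with a latent 3D reconstruction to improve image-based localization.

Motivation: Image retrieval methods for localization are simple, efficient, and privacy-preserving, but their localization accuracy is comparatively poor.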

Details

Method: Decouple relative translation from relative rotation estimation and compute the optimal pose directly from multiview correspondences, without retaining a full 3D scene reconstruction. Result: Performance improves on the 7-Scenes and Cambridge Landmarks datasets, along with timing and memory footprint. Conclusion: The method keeps the advantages of image retrieval while significantly improving localization accuracy and computational efficiency. Abstract: The image retrieval (IR) approach to image localization has distinct advantages over the 3D and the deep learning (DNN) approaches: it is scene-agnostic, simpler to implement and use, has no privacy issues, and is computationally efficient. The main drawback of this approach is relatively poor localization in both position and orientation of the query camera when compared to the competing approaches. This paper presents a hybrid approach that stores only image features in the database like some IR methods, but relies on a latent 3D reconstruction, like 3D methods but without retaining a 3D scene reconstruction. The approach is based on two ideas: (i) a novel proposal where query camera center estimation relies only on relative translation estimates but not relative rotation estimates through a decoupling of the two, and (ii) a shift from computing optimal pose from estimated relative pose to computing optimal pose from multiview correspondences, thus cutting out the "middle-man". Our approach shows improved performance on the 7-Scenes and Cambridge Landmarks datasets while also improving on timing and memory footprint as compared to state-of-the-art.

Beyond Contrastive Learning: Synthetic Data Enables List-wise Training with Multiple Levels of Relevance

Reza Esfandiarpoor,George Zerveas,Ruochen Zhang,Macton Mgonzo,Carsten Eickhoff,Stephen H. Bach

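Task: Use open-source large language models to generate synthetic documents that answer user queries at multiple levels of relevance, and use them to improve dense retriever training.

Motivation: Conventional contrastive learning ignores differences in relevance among unannotated documents, is susceptible to annotation noise, and misses the nuances that matter for ranking.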

Details

Method: 完全使用合成文档和分级相关性标签，结合Wasserstein距离的列表损失函数训练密集检索器。 Result: 在多个IR数据集上显著优于传统InfoNCE训练方法，零-shot评估在BEIR数据集上表现更优。 Conclusion: 合成数据训练方法不仅匹配真实标注数据的性能，还更具鲁棒性，尤其在分布偏移情况下表现更佳。 Abstract: Recent advancements in large language models (LLMs) have allowed the augmentation of information retrieval (IR) pipelines with synthetic data in various ways. Yet, the main training paradigm remains: contrastive learning with binary relevance labels and the InfoNCE loss, where one positive document is compared against one or more negatives. This objective treats all documents that are not explicitly annotated as relevant on an equally negative footing, regardless of their actual degree of relevance, thus (a) missing subtle nuances that are useful for ranking and (b) being susceptible to annotation noise. To overcome this limitation, in this work we forgo real training documents and annotations altogether and use open-source LLMs to directly generate synthetic documents that answer real user queries according to several different levels of relevance. This fully synthetic ranking context of graduated relevance, together with an appropriate list-wise loss (Wasserstein distance), enables us to train dense retrievers in a way that better captures the ranking task. Experiments on various IR datasets show that our proposed approach outperforms conventional training with InfoNCE by a large margin. Without using any real documents for training, our dense retriever significantly outperforms the same retriever trained through self-supervision. More importantly, it matches the performance of the same retriever trained on real, labeled training documents of the same dataset, while being more robust to distribution shift and clearly outperforming it when evaluated zero-shot on the BEIR dataset collection.
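
A rough sketch of how a Wasserstein-based list-wise loss over graded relevance could look. The abstract only names the distance, so everything else here is an assumption: the softmax over retriever scores, the normalisation of the graded labels into a target distribution, the ordering of documents from most to least relevant, and the helper names `wasserstein_1d` and `listwise_loss`.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def wasserstein_1d(p, q):
    """W1 distance between two distributions on the ordered support 0..n-1.

    With unit spacing between support points, W1 is the sum of absolute
    differences between the two cumulative distributions.
    """
    distance, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        distance += abs(cdf_gap)
    return distance

def listwise_loss(scores, relevance):
    # Assumption: documents are listed from most to least relevant, and the
    # graded labels are normalised into a target distribution over the list.
    label_total = sum(relevance)
    target = [r / label_total for r in relevance]
    return wasserstein_1d(softmax(scores), target)

relevance = [3, 2, 1, 0]  # graded labels instead of one positive vs. negatives
good = listwise_loss([3.0, 2.0, 1.0, 0.0], relevance)  # scores follow the labels
bad = listwise_loss([0.0, 1.0, 2.0, 3.0], relevance)   # scores reverse the labels
```

Unlike a binary InfoNCE target, the partially relevant documents carry their own share of the target mass, so ranking them below the best document but above the irrelevant one is rewarded.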

DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution

Zheng-Peng Duan,Jiawei Zhang,Xin Jin,Ziheng Zhang,Zheng Xiong,Dongqing Zou,Jimmy Ren,Chun-Le Guo,Chongyi Li

Task: Explore and improve the use of diffusion transformer (DiT) models for real-world image super-resolution (Real-ISR).

Motivation: Diffusion models excel at image generation, and the DiT architecture in particular surpasses traditional UNet designs, yet its use for Real-ISR remains under-explored.

Details

Method: Proposes DiT4SR, which adapts DiT to Real-ISR through a bidirectional information flow and a cross-stream convolution layer. Result: Experiments show that DiT4SR performs strongly on Real-ISR. Conclusion: With simple but effective designs, DiT4SR successfully applies DiT to Real-ISR and achieves superior performance. Abstract: Large-scale pre-trained diffusion models are becoming increasingly popular in solving the Real-World Image Super-Resolution (Real-ISR) problem because of their rich generative priors. The recently developed diffusion transformer (DiT) has shown overwhelmingly better performance than the traditional UNet-based architecture in image generation, which raises the question: Can we adopt the advanced DiT-based diffusion model for Real-ISR? To this end, we propose our DiT4SR, one of the pioneering works to tame the large-scale DiT model for Real-ISR. Instead of directly injecting embeddings extracted from low-resolution (LR) images like ControlNet, we integrate the LR embeddings into the original attention mechanism of DiT, allowing for the bidirectional flow of information between the LR latent and the generated latent. The sufficient interaction of these two streams allows the LR stream to evolve with the diffusion process, producing progressively refined guidance that better aligns with the generated latent at each diffusion step. Additionally, the LR guidance is injected into the generated latent via a cross-stream convolution layer, compensating for DiT's limited ability to capture local information. These simple but effective designs endow the DiT model with superior performance in Real-ISR, which is demonstrated by extensive experiments. Project Page: https://adam-duan.github.io/projects/dit4sr/.

SPIO: Ensemble and Selective Strategies via LLM-Based Multi-Agent Planning in Automated Data Science

Wonduk Seo,Juhyeon Lee,Yi Bu

Task: Propose SPIO, a new framework that uses LLM-driven decision-making to orchestrate multi-agent planning for automated data science tasks.

Motivation: Existing approaches rely on single-path workflows, which limit the diversity and exploration of strategies and lead to suboptimal predictions.

Details

Method: SPIO orchestrates multi-agent planning across four key modules (data preprocessing, feature engineering, modeling, and hyperparameter tuning), adds a plan optimization agent, and comes in two variants (SPIO-S and SPIO-E). Result: Experiments on Kaggle and OpenML datasets show that SPIO significantly outperforms existing methods. Conclusion: SPIO provides a robust and scalable solution for automated data science tasks. Abstract: Large Language Models (LLMs) have revolutionized automated data analytics and machine learning by enabling dynamic reasoning and adaptability. While recent approaches have advanced multi-stage pipelines through multi-agent systems, they typically rely on rigid, single-path workflows that limit the exploration and integration of diverse strategies, often resulting in suboptimal predictions. To address these challenges, we propose SPIO (Sequential Plan Integration and Optimization), a novel framework that leverages LLM-driven decision-making to orchestrate multi-agent planning across four key modules: data preprocessing, feature engineering, modeling, and hyperparameter tuning. In each module, dedicated planning agents independently generate candidate strategies that cascade into subsequent stages, fostering comprehensive exploration. A plan optimization agent refines these strategies by suggesting several optimized plans. We further introduce two variants: SPIO-S, which selects a single best solution path as determined by the LLM, and SPIO-E, which selects the top k candidate plans and ensembles them to maximize predictive performance. Extensive experiments on Kaggle and OpenML datasets demonstrate that SPIO significantly outperforms state-of-the-art methods, providing a robust and scalable solution for automated data science tasks.
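
The difference between the two variants can be sketched as follows. In the paper the choice is made by the LLM; here a numeric `score` field stands in for that ranking, and the plan dictionaries, the averaging ensemble, and the function names are hypothetical.

```python
def select_single(plans):
    """SPIO-S style: keep only the plan ranked best."""
    return max(plans, key=lambda p: p["score"])

def ensemble_top_k(plans, k):
    """SPIO-E style: average the predictions of the k best-ranked plans."""
    top = sorted(plans, key=lambda p: p["score"], reverse=True)[:k]
    n_samples = len(top[0]["preds"])
    return [sum(p["preds"][i] for p in top) / len(top) for i in range(n_samples)]

# Hypothetical candidate plans; `score` is a placeholder for the LLM's ranking.
plans = [
    {"name": "plan_a", "score": 0.81, "preds": [0.9, 0.2, 0.6]},
    {"name": "plan_b", "score": 0.78, "preds": [0.7, 0.4, 0.8]},
    {"name": "plan_c", "score": 0.60, "preds": [0.1, 0.9, 0.1]},
]
best = select_single(plans)            # single solution path
blended = ensemble_top_k(plans, k=2)   # averages plan_a and plan_b
```

Averaging is only one way to combine plans; the abstract does not state which ensembling rule SPIO-E uses.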

PhysPose: Refining 6D Object Poses with Physical Constraints

Martin Malenický,Martin Cífka,Médéric Fourmy,Louis Montaut,Justin Carpentier,Josef Sivic,Vladimir Petrik

Task: Propose PhysPose, a new method that refines 6D object pose estimates using physical constraints.

Motivation: Existing methods often produce physically inconsistent pose estimates, which limits their practical use.

Details

Method: Applies a postprocessing optimization that enforces non-penetration and gravitational constraints, using scene geometry to make pose estimates physically plausible. Result: Achieves state-of-the-art performance on the YCB-Video and HOPE-Video datasets and significantly improves success rates in a robotic pick-and-place task. Conclusion: Physical consistency is crucial for real-world applications, and PhysPose performs strongly in pose estimation. Abstract: Accurate 6D object pose estimation from images is a key problem in object-centric scene understanding, enabling applications in robotics, augmented reality, and scene reconstruction. Despite recent advances, existing methods often produce physically inconsistent pose estimates, hindering their deployment in real-world scenarios. We introduce PhysPose, a novel approach that integrates physical reasoning into pose estimation through a postprocessing optimization enforcing non-penetration and gravitational constraints. By leveraging scene geometry, PhysPose refines pose estimates to ensure physical plausibility. Our approach achieves state-of-the-art accuracy on the YCB-Video dataset from the BOP benchmark and improves over the state-of-the-art pose estimation methods on the HOPE-Video dataset. Furthermore, we demonstrate its impact in robotics by significantly improving success rates in a challenging pick-and-place task, highlighting the importance of physical consistency in real-world applications.

Beyond Unimodal Boundaries: Generative Recommendation with Multimodal Semantics

Jing Zhu,Mingxuan Ju,Yozen Liu,Danai Koutra,Neil Shah,Tong Zhao

Task: Explore the problem of modality choice in Multimodal Generative Recommendation (MGR) and its effect on recommender performance.

Motivation: Existing generative recommendation methods usually assume unimodal data (such as text), ignoring the multimodal nature of real-world data, and the models are sensitive to modality choice, so generative recommendation in multimodal settings needs study.

Details

Method: Proposes MGR-LF++, a framework that uses contrastive modality alignment and special tokens denoting different modalities to exploit multimodal data effectively. Result: MGR-LF++ improves performance by more than 20% over single-modality alternatives. Conclusion: Modality choice is critical in multimodal generative recommendation, and MGR-LF++ substantially improves recommendation performance by making effective use of multimodal data. Abstract: Generative recommendation (GR) has become a powerful paradigm in recommendation systems that implicitly links modality and semantics to item representation, in contrast to previous methods that relied on non-semantic item identifiers in autoregressive models. However, previous research has predominantly treated modalities in isolation, typically assuming item content is unimodal (usually text). We argue that this is a significant limitation given the rich, multimodal nature of real-world data and the potential sensitivity of GR models to modality choices and usage. Our work aims to explore the critical problem of Multimodal Generative Recommendation (MGR), highlighting the importance of modality choices in GR frameworks. We reveal that GR models are particularly sensitive to different modalities and examine the challenges in achieving effective GR when multiple modalities are available. By evaluating design strategies for effectively leveraging multiple modalities, we identify key challenges and introduce MGR-LF++, an enhanced late fusion framework that employs contrastive modality alignment and special tokens to denote different modalities, achieving a performance improvement of over 20% compared to single-modality alternatives.

Blurry-Edges: Photon-Limited Depth Estimation from Defocused Boundaries

Wei Xu,Charles James Wagner,Junjie Luo,Qi Guo

Task: Extract depth information from photon-limited, defocused images.

Motivation: Depth from defocus (DfD) relies on accurate estimation of defocus blur, which is highly sensitive to image noise, so extracting depth from photon-limited images is challenging.

Details

Method: Proposes Blurry-Edges, a new image patch representation, together with a deep neural network that predicts this representation from a pair of differently defocused images; depth is then computed with a derived closed-form DfD relation. Result: Experiments on synthetic and real data show higher depth estimation accuracy on photon-limited images than a range of state-of-the-art DfD methods. Conclusion: The Blurry-Edges representation combined with a deep neural network effectively improves depth estimation accuracy on photon-limited images. Abstract: Extracting depth information from photon-limited, defocused images is challenging because depth from defocus (DfD) relies on accurate estimation of defocus blur, which is fundamentally sensitive to image noise. We present a novel approach to robustly measure object depths from photon-limited images along the defocused boundaries. It is based on a new image patch representation, Blurry-Edges, that explicitly stores and visualizes a rich set of low-level patch information, including boundaries, color, and smoothness. We develop a deep neural network architecture that predicts the Blurry-Edges representation from a pair of differently defocused images, from which depth can be calculated using a closed-form DfD relation we derive. The experimental results on synthetic and real data show that our method achieves the highest depth estimation accuracy on photon-limited images compared to a broad range of state-of-the-art DfD methods.

A Scalable Framework for Evaluating Health Language Models

Neil Mallinar,A. Ali Heydari,Xin Liu,Anthony Z. Faranesh,Brent Winslow,Nova Hammerquist,Benjamin Graef,Cathy Speed,Mark Malhotra,Shwetak Patel,Javier L. Prieto,Daniel McDuff,Ahmed A. Metwally

Task: Propose Adaptive Precise Boolean rubrics, an evaluation framework for efficiently assessing the quality of large language models' open-ended text responses in the health domain.

Motivation: Current expert-based human evaluation is costly, slow, and hard to scale, especially in complex domains such as healthcare that require domain expertise.

Details

Method: Designs a set of precise Boolean questions (rubrics) that identify gaps in model responses, supporting both human and automated evaluation. Result: Validated in metabolic health, the approach yields higher inter-rater agreement than traditional Likert scales and halves evaluation time. Conclusion: Adaptive Precise Boolean rubrics offer a more efficient and economical way to evaluate LLMs in health. Abstract: Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information that encompasses lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety. Current evaluation practices for open-ended text responses heavily rely on human experts. This approach introduces human factors, is often cost-prohibitive and labor-intensive, and hinders scalability, especially in complex domains like healthcare where response assessment necessitates domain expertise and considers multifaceted patient data. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying gaps in model responses using a minimal set of targeted rubric questions. Our approach is based on recent work in more general evaluation settings that contrasts a smaller set of complex evaluation targets with a larger set of more precise, granular targets answerable with simple boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield higher inter-rater agreement among expert and non-expert human evaluators, and in automated assessments, compared to traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods. This enhanced efficiency, particularly in automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health.
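
To make "inter-rater agreement" on Boolean rubric answers concrete, here is a small sketch using Cohen's kappa for two raters. The abstract does not say which agreement statistic the authors report, so the choice of kappa, the example answers, and the function name are assumptions.

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    # Chance agreement from each rater's own label frequencies.
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n) for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters gave one identical answer throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical yes/no answers from two raters to six rubric questions.
rater_a = [True, True, False, True, False, False]
rater_b = [True, True, False, False, False, False]
kappa = cohen_kappa(rater_a, rater_b)
```

Here the raters agree on five of six questions (observed agreement 5/6) against a chance level of 1/2, giving a kappa of 2/3.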

Leveraging Vision-Language Foundation Models to Reveal Hidden Image-Attribute Relationships in Medical Imaging

Amar Kumar,Anita Kriz,Barak Pertzov,Tal Arbel

Task: Investigate whether fine-tuned foundation models can help identify critical, and possibly unknown, data properties.

Motivation: Explore the potential of vision-language foundation models (VLMs) to reveal hidden data relationships in medical images.

Details

Method: Evaluates the proposed method on a chest X-ray dataset and compares it with methods based on Structural Causal Models (SCMs). Result: Fine-tuned VLMs generate high-resolution, precisely edited images and reveal hidden data relationships, but show limitations in accurate editing and susceptibility to biases. Conclusion: Fine-tuned VLMs show promise for revealing dataset properties but have limitations that call for further research. Abstract: Vision-language foundation models (VLMs) have shown impressive performance in guiding image generation through text, with emerging applications in medical imaging. In this work, we are the first to investigate the question: 'Can fine-tuned foundation models help identify critical, and possibly unknown, data properties?' By evaluating our proposed method on a chest X-ray dataset, we show that these models can generate high-resolution, precisely edited images compared to methods that rely on Structural Causal Models (SCMs) according to numerous metrics. For the first time, we demonstrate that fine-tuned VLMs can reveal hidden data relationships that were previously obscured due to available metadata granularity and model capacity limitations. Our experiments demonstrate the potential of these models to reveal underlying dataset properties while also exposing the limitations of fine-tuned VLMs for accurate image editing and their susceptibility to biases and spurious correlations.

Large Language Models Are Better Logical Fallacy Reasoners with Counterargument, Explanation, and Goal-Aware Prompt Formulation

Jiwon Jeong,Hyeju Jang,Hogun Park

Task: Propose a novel and effective prompt formulation approach for logical fallacy detection, applicable in both supervised (fine-tuned) and unsupervised (zero-shot) settings.

Motivation: Although large language models (LLMs) have advanced at processing complex language, accurately detecting logical fallacies remains a significant challenge.

Details

Method: Enriches the input text with implicit contextual information (counterarguments, explanations, and goals), queries it for validity within the context of the argument, and ranks the queries by confidence score to inform classification. Result: Evaluated on multiple datasets spanning 5 domains and 29 fallacy types, with F1 gains of up to 0.60 in zero-shot settings and up to 0.45 for fine-tuned models. Conclusion: The approach substantially outperforms prior work, and in-depth analyses explain where its advantages come from. Abstract: The advancement of Large Language Models (LLMs) has greatly improved our ability to process complex language. However, accurately detecting logical fallacies remains a significant challenge. This study presents a novel and effective prompt formulation approach for logical fallacy detection, applicable in both supervised (fine-tuned) and unsupervised (zero-shot) settings. Our method enriches input text by incorporating implicit contextual information -- counterarguments, explanations, and goals -- which we query for validity within the context of the argument. We then rank these queries based on confidence scores to inform classification. We evaluate our approach across multiple datasets from 5 domains, covering 29 distinct fallacy types, using models from the GPT and LLaMA series. The results show substantial improvements over state-of-the-art models, with F1 score increases of up to 0.60 in zero-shot settings and up to 0.45 in fine-tuned models. Extensive analyses further illustrate why and how our method excels.

Language-Guided Trajectory Traversal in Disentangled Stable Diffusion Latent Space for Factorized Medical Image Generation

Zahra TehraniNasab,Amar Kumar,Tal Arbel

Task: Explore the potential of pre-trained vision-language foundation models for latent disentanglement and control in medical images.

Motivation: Text-to-image diffusion models excel at generating high-resolution, language-guided images, but disentangling and controlling latent factors of variation in specialized domains such as medical imaging is under-studied.

Details

Method: Fine-tunes pre-trained vision-language foundation models on medical image datasets and devises a framework to identify, isolate, and manipulate key attributes. Result: Experiments show that the fine-tuned models can disentangle and control key attributes in medical image generation, such as anatomical structures or disease features. Conclusion: The approach offers precise control over medical image synthesis and shows the potential of pre-trained models in specialized domains. Abstract: Text-to-image diffusion models have demonstrated a remarkable ability to generate photorealistic images from natural language prompts. These high-resolution, language-guided synthesized images are essential for explaining disease or exploring causal relationships. However, their potential for disentangling and controlling latent factors of variation in specialized domains like medical imaging remains under-explored. In this work, we present the first investigation of the power of pre-trained vision-language foundation models, once fine-tuned on medical image datasets, to perform latent disentanglement for factorized medical image generation and interpolation. Through extensive experiments on chest X-ray and skin datasets, we illustrate that fine-tuned, language-guided Stable Diffusion inherently learns to factorize key attributes for image generation, such as the patient's anatomical structures or disease diagnostic features. We devise a framework to identify, isolate, and manipulate key attributes through latent space trajectory traversal of generative models, facilitating precise control over medical image synthesis.

What Makes an Evaluation Useful? Common Pitfalls and Best Practices

Gil Gekker,Meirav Segal,Dan Lahav,Omer Nevo

Task: Present a set of best practices for AI safety evaluations.

Motivation: As AI capabilities advance rapidly, the community's concern about safety risks grows, and high-quality evaluations are needed to support decisions on the safe use and development of AI systems.

Details

Method: Draws on prior work in model evaluation, illustrated with cybersecurity examples; discusses the initial thought process from threat modeling to evaluation design and gives the characteristics and parameters that make an evaluation useful. Result: A set of practices and additional considerations for building a comprehensive evaluation suite. Conclusion: The paper offers practical guidance for AI safety evaluation, helping build more comprehensive evaluation suites. Abstract: Following the rapid increase in Artificial Intelligence (AI) capabilities in recent years, the AI community has voiced concerns regarding possible safety risks. To support decision-making on the safe use and development of AI systems, there is a growing need for high-quality evaluations of dangerous model capabilities. While several attempts to provide such evaluations have been made, a clear definition of what constitutes a "good evaluation" has yet to be agreed upon. In this practitioners' perspective paper, we present a set of best practices for safety evaluations, drawing on prior work in model evaluation and illustrated through cybersecurity examples. We first discuss the steps of the initial thought process, which connects threat modeling to evaluation design. Then, we provide the characteristics and parameters that make an evaluation useful. Finally, we address additional considerations as we move from building specific evaluations to building a full and comprehensive evaluation suite.

Introducing the Short-Time Fourier Kolmogorov Arnold Network: A Dynamic Graph CNN Approach for Tree Species Classification in 3D Point Clouds

Said Ohamouddou,Mohamed Ohamouddou,Rafik Lasri,Hanaa El Afia,Raddouane Chiheb,Abdellatif El Afia

Task: Achieve high-accuracy tree species classification from TLS and ALS data.

Motivation: Although deep learning models perform well on 3D point cloud classification, their high complexity hinders the development of efficient, low-computation architectures.

Details

Method: Proposes STFT-KAN, a network that integrates the Short-Time Fourier Transform and can replace the standard linear layer with activation, and embeds it in a lightweight DGCNN (liteDGCNN). Result: STFT-KAN outperforms existing KAN variants with fewer parameters and is competitive with MLP models; the hybrid architecture performs close to MLP while reducing the parameter count by 50%-75%. Conclusion: STFT-KAN performs comparably to state-of-the-art methods with far fewer parameters, balancing efficiency and competitiveness. Abstract: Accurate classification of tree species based on Terrestrial Laser Scanning (TLS) and Airborne Laser Scanning (ALS) is essential for biodiversity conservation. While advanced deep learning models for 3D point cloud classification have demonstrated strong performance in this domain, their high complexity often hinders the development of efficient, low-computation architectures. In this paper, we introduce STFT-KAN, a novel Kolmogorov-Arnold network that integrates the Short-Time Fourier Transform (STFT), which can replace the standard linear layer with activation. We implemented STFT-KAN within a lightweight version of DGCNN, called liteDGCNN, to classify tree species using the TLS data. Our experiments show that STFT-KAN outperforms existing KAN variants by effectively balancing model complexity and performance with parameter count reduction, achieving competitive results compared to MLP-based models. Additionally, we evaluated a hybrid architecture that combines MLP in edge convolution with STFT-KAN in other layers, achieving comparable performance to MLP models while reducing the parameter count by 50% and 75% compared to other KAN-based variants. Furthermore, we compared our model to leading 3D point cloud learning approaches, demonstrating that STFT-KAN delivers competitive results compared to the state-of-the-art method PointMLP lite with an 87% reduction in parameter count.

Semantic-Preserving Transformations as Mutation Operators: A Study on Their Effectiveness in Defect Detection

Max Hort,Linas Vidziunas,Leon Moonen

Task: Investigate whether semantic-preserving transformations can improve the performance of defect detection tools in the testing stage.

Motivation: Prior work has not considered using semantically identical code during tool application (a concept close to metamorphic testing) to improve defect detection tools.

Details

Method: Collects and reuses existing implementations of semantic-preserving transformations and evaluates three ensemble strategies on the Devign dataset with two language models (VulBERTa, PLBART). Result: Semantic-preserving transformations did not improve model accuracy, and reusing shared transformations readily introduces semantic errors. Conclusion: Reusing semantic-preserving transformations is difficult, and their correctness must be verified carefully. Abstract: Recent advances in defect detection use language models. Existing works enhanced the training data to improve the models' robustness when applied to semantically identical code (i.e., predictions should be the same). However, the use of semantically identical code has not been considered for improving the tools during their application - a concept closely related to metamorphic testing. The goal of our study is to determine whether we can use semantic-preserving transformations, analogous to mutation operators, to improve the performance of defect detection tools in the testing stage. We first collect existing publications which implemented semantic-preserving transformations and share their implementation, such that we can reuse them. We empirically study the effectiveness of three different ensemble strategies for enhancing defect detection tools. We apply the collected transformations on the Devign dataset, considering vulnerabilities as a type of defect, and two fine-tuned large language models for defect detection (VulBERTa, PLBART). We found 28 publications with 94 different transformations. We chose to implement 39 transformations from four of the publications, but a manual check revealed that 23 out of the 39 transformations change code semantics. Using the 16 remaining, correct transformations and three ensemble strategies, we were not able to increase the accuracy of the defect detection models. Our results show that reusing shared semantic-preserving transformations is difficult, sometimes even causing wrongful changes to the semantics. Keywords: defect detection, language model, semantic-preserving transformation, ensemble
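
One plausible ensemble strategy is a majority vote over the predictions for the original snippet and its transformed variants. The abstract does not name the three strategies studied, so the voting rule is an assumption, and `toy_classifier`, `add_blank_line`, and `add_comment` are hypothetical stand-ins for a fine-tuned model and real transformations.

```python
from collections import Counter

def majority_vote(classify, code, transformations):
    """Classify the original snippet and each transformed variant, then vote."""
    variants = [code] + [transform(code) for transform in transformations]
    votes = Counter(classify(variant) for variant in variants)
    return votes.most_common(1)[0][0]

# Hypothetical stand-in for a defect detection model such as VulBERTa or PLBART.
def toy_classifier(code):
    return "defective" if "strcpy" in code else "clean"

# Toy transformations that leave C semantics unchanged.
def add_blank_line(code):
    return code + "\n"

def add_comment(code):
    return "/* reviewed */\n" + code

label = majority_vote(toy_classifier, "strcpy(dst, src);", [add_blank_line, add_comment])
```

The vote only helps if every transformation really preserves semantics, which is the assumption the study found violated for 23 of the 39 reused transformations.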

Junjie Zheng,Zihao Chen,Chaofan Ding,Xinhan Di

Task: Propose a multimodal large language model framework to address under-studied challenges in movie dubbing: style adaptation, dialogue handling, and understanding of fine-grained details.

Motivation: Current movie dubbing technology generates speech well, but style adaptation, dialogue handling, and understanding of details such as speaker age and gender remain under-studied.

Details

Method: Applies multimodal Chain-of-Thought (CoT) reasoning to visual inputs to understand dubbing styles and fine-grained attributes, then generates high-quality dubbing with a large speech generation model. Result: Outperforms existing methods on multiple datasets; for example, SPK-SIM and EMO-SIM rise to 89.74% and 78.88%, and LSE-D and MCD-SL fall to 14.63 and 4.74. Conclusion: The proposed framework substantially improves dubbing quality and the handling of detail, offering a new direction for multimodal dubbing. Abstract: Current movie dubbing technology can generate the desired voice from a given speech prompt, ensuring good synchronization between speech and visuals while accurately conveying the intended emotions. However, in movie dubbing, key aspects such as adapting to different dubbing styles, handling dialogue, narration, and monologue effectively, and understanding subtle details like the age and gender of speakers, have not been well studied. To address this challenge, we propose a multimodal large language model framework. First, it utilizes multimodal Chain-of-Thought (CoT) reasoning methods on visual inputs to understand dubbing styles and fine-grained attributes. Second, it generates high-quality dubbing through large speech generation models, guided by multimodal conditions. Additionally, we have developed a movie dubbing dataset with CoT annotations. The evaluation results demonstrate a performance improvement over state-of-the-art methods across multiple datasets. In particular, SPK-SIM and EMO-SIM increase from 82.48% to 89.74% and from 66.24% to 78.88%, respectively, for dubbing setting 2.0 on the V2C Animation dataset; LSE-D and MCD-SL decrease from 14.79 to 14.63 and from 5.24 to 4.74, respectively, for dubbing setting 2.0 on the Grid dataset; and SPK-SIM increases from 64.03 to 83.42 while WER decreases from 52.69% to 23.20% for the initial reasoning setting on the proposed CoT-Movie-Dubbing dataset, in comparison with state-of-the-art models.

Codehacks: A Dataset of Adversarial Tests for Competitive Programming Problems Obtained from Codeforces

Max Hort,Leon Moonen

Task: To support data-driven creation of test suites, particularly for testing software synthesized by large language models, curate a dataset (Codehacks) of programming problems with corresponding error-inducing test cases.

Motivation: Software testing can yield false negatives (software that passes all tests yet contains untested bugs), so error-inducing test cases deserve attention to make testing more reliable.

Details

Method: Collects 288,617 error-inducing test cases ("hacks") for 5,578 programming problems from the Codeforces online judge, along with the source code of 2,196 submitted solutions that these hacks break. Result: The Codehacks dataset, containing natural language problem descriptions, error-inducing test cases, and related source code. Conclusion: Codehacks provides a resource for data-driven test suite creation, especially for testing software generated by large language models. Abstract: Software is used in critical applications in our day-to-day life and it is important to ensure its correctness. One popular approach to assess correctness is to evaluate software on tests. If a test fails, it indicates a fault in the software under test; if all tests pass correctly, one may assume that the software is correct. However, the reliability of these results depends on the test suite considered, and there is a risk of false negatives (i.e. software that passes all available tests but contains bugs because some cases are not tested). Therefore, it is important to consider error-inducing test cases when evaluating software. To support data-driven creation of such a test suite, which is especially of interest for testing software synthesized from large language models, we curate a dataset (Codehacks) of programming problems together with corresponding error-inducing test cases (i.e., "hacks"). This dataset is collected from the wild, in particular, from the Codeforces online judge platform. The dataset comprises 288,617 hacks for 5,578 programming problems, each with a natural language description, as well as the source code for 2,196 submitted solutions to these problems that can be broken with their corresponding hacks. Keywords: competitive programming, language model, dataset
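
The false-negative problem and the role of a hack can be shown with a toy example. The buggy solution, the test inputs, and the helper `find_hacks` are invented for illustration and are not taken from the dataset.

```python
def buggy_max(values):
    best = 0  # bug: wrong whenever every value is negative
    for value in values:
        if value > best:
            best = value
    return best

def find_hacks(solution, reference, candidate_inputs):
    """Return the inputs on which `solution` disagrees with `reference`."""
    return [x for x in candidate_inputs if solution(x) != reference(x)]

ordinary_tests = [[1, 2, 3], [5], [0, 7, 2]]
adversarial_tests = [[-3, -1, -2]]

# The buggy solution passes every ordinary test: a false negative.
passes_ordinary = find_hacks(buggy_max, max, ordinary_tests) == []
# A single adversarial input exposes the bug.
hacks = find_hacks(buggy_max, max, adversarial_tests)
```

A test suite built only from the ordinary inputs would accept `buggy_max`; adding the error-inducing input is what makes the suite reject it.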

LiM-Loc: Visual Localization with Dense and Accurate 3D Reference Maps Directly Corresponding 2D Keypoints to 3D LiDAR Point Clouds

Masahiko Tsuji,Hitoshi Niigaki,Ryuichi Tanida

Task: Propose a method that directly assigns 3D positions to keypoints from 3D LiDAR point clouds to generate dense and accurate 3D reference maps.

Motivation: Conventional image-only methods need a huge number of images and cannot avoid feature matching errors, yielding sparse and inaccurate 3D reference maps, whereas combining images with 3D sensors such as LiDAR can produce more accurate maps.

Details

Method: Directly assigns 3D LiDAR points to keypoints, avoiding feature matching, and uses a wide-area LiDAR point cloud to remove points not visible to the camera, reducing 2D-3D correspondence errors. Result: Validated on indoor and outdoor datasets, where it improves the accuracy of camera pose estimation. Conclusion: Using LiDAR point clouds markedly improves the density and accuracy of 3D reference maps and thereby the accuracy of camera pose estimation. Abstract: Visual localization estimates the 6-DOF camera pose of a query image in a 3D reference map. We extract keypoints from the reference image and generate a 3D reference map with 3D reconstruction of the keypoints in advance. We emphasize that the more keypoints in the 3D reference map and the smaller the error of the 3D positions of the keypoints, the higher the accuracy of the camera pose estimation. However, previous image-only methods require a huge number of images, and it is difficult to 3D-reconstruct keypoints without error due to inevitable mismatches and failures in feature matching. As a result, the 3D reference map is sparse and inaccurate. In contrast, accurate 3D reference maps can be generated by combining images and 3D sensors. Recently, 3D-LiDAR has been widely used around the world. LiDAR, which measures a large space with high density, has become inexpensive. In addition, accurately calibrated cameras are also widely used, so images that record the external parameters of the camera without errors can be easily obtained. In this paper, we propose a method to directly assign 3D LiDAR point clouds to keypoints to generate dense and accurate 3D reference maps. The proposed method avoids feature matching and achieves accurate 3D reconstruction for almost all keypoints. To estimate camera pose over a wide area, we use the wide-area LiDAR point cloud to remove points that are not visible to the camera and reduce 2D-3D correspondence errors. Using indoor and outdoor datasets, we apply the proposed method to several state-of-the-art local features and confirm that it improves the accuracy of camera pose estimation.

Benchmarking Systematic Relational Reasoning with Large Language and Reasoning Models

Irtaza Khalid,Amir Masoud Nourollah,Steven Schockaert

Task: Study the systematic reasoning abilities of Large Language Models (LLMs) and Large Reasoning Models (LRMs) on relational composition tasks.

Motivation: LLMs struggle with systematic reasoning, relying on shortcuts rather than genuine reasoning and failing to generalise to out-of-distribution problems. LRMs excel at mathematics and programming problems, but their potential in other domains is unclear.

Details

Method: Designs relational composition tasks that require systematic reasoning (especially qualitative spatial and temporal reasoning), controlling problem difficulty and precisely measuring generalisation. Result: LLMs and LRMs perform poorly overall, though better than random chance. Conclusion: Current LLMs and LRMs have limited systematic reasoning ability, and further research is needed to improve their generalisation. Abstract: Large Language Models (LLMs) have been found to struggle with systematic reasoning. Even on tasks where they appear to perform well, their performance often depends on shortcuts, rather than on genuine reasoning abilities, leading them to collapse on out-of-distribution examples. Post-training strategies based on reinforcement learning and chain-of-thought prompting have recently been hailed as a step change. However, little is still known about the potential of the resulting "Large Reasoning Models" (LRMs) beyond problem solving in mathematics and programming, where finding genuine out-of-distribution problems can be difficult. In this paper, we focus on tasks that require systematic reasoning about relational compositions, especially for qualitative spatial and temporal reasoning. These tasks allow us to control the difficulty of problem instances, and measure in a precise way to what extent models can generalise. We find that the considered LLMs and LRMs perform poorly overall, albeit better than random chance.
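
Relational composition can be illustrated with the point algebra over the relations `<`, `=`, and `>`, a simple qualitative temporal calculus. The abstract does not say which calculi the benchmark uses, so this choice and the helper `compose_path` are assumptions for illustration only.

```python
# Composition table for the point algebra: given a R1 b and b R2 c,
# the entry lists every relation that can hold between a and c.
COMPOSE = {
    ("<", "<"): {"<"}, ("<", "="): {"<"}, ("<", ">"): {"<", "=", ">"},
    ("=", "<"): {"<"}, ("=", "="): {"="}, ("=", ">"): {">"},
    (">", "<"): {"<", "=", ">"}, (">", "="): {">"}, (">", ">"): {">"},
}

def compose_path(relations):
    """Possible relations between the first and last point of a chain."""
    possible = {relations[0]}
    for nxt in relations[1:]:
        possible = set().union(*(COMPOSE[(rel, nxt)] for rel in possible))
    return possible

determined = compose_path(["<", "=", "<"])  # a < b = c < d, so a < d
ambiguous = compose_path(["<", ">"])        # a < b > c says nothing about a vs. c
```

In a setup like this, the length of the chain is one knob for problem difficulty: longer chains require more composition steps than any example a model is likely to have memorised.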

Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity

Kotaro Inoue

Task: Study the performance of multimodal Large Language Models (LLMs) on a context-independent OCR task with single-character images.

Motivation: Although multimodal LLMs perform well at OCR, their performance under different image conditions and the limitations of their reliance on context have not been sufficiently studied.

Details

Method: Tests a context-independent OCR task using single-character images of varying visual complexity. Result: Multimodal LLMs match conventional OCR methods at about 300 ppi but degrade significantly below 150 ppi; visual complexity correlates only weakly with misrecognitions. Conclusion: Image resolution and visual complexity matter for the reliability of multimodal LLMs on OCR tasks that need precise character-level accuracy. Abstract: Due to their high versatility in tasks such as image captioning, document analysis, and automated content generation, multimodal Large Language Models (LLMs) have attracted significant attention across various industrial fields. In particular, they have been shown to surpass specialized models in Optical Character Recognition (OCR). Nevertheless, their performance under different image conditions remains insufficiently investigated, and individual character recognition is not guaranteed due to their reliance on contextual cues. In this work, we examine a context-independent OCR task using single-character images with diverse visual complexities to determine the conditions for accurate recognition. Our findings reveal that multimodal LLMs can match conventional OCR methods at about 300 ppi, yet their performance deteriorates significantly below 150 ppi. Additionally, we observe a very weak correlation between visual complexity and misrecognitions, whereas a conventional OCR-specific model exhibits no correlation. These results suggest that image resolution and visual complexity may play an important role in the reliable application of multimodal LLMs to OCR tasks that require precise character-level accuracy.

KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language

Yoonshik Kim,Jaeyoon Jung

Task: Present KOFFVQA, a Korean-language free-form visual question answering benchmark for evaluating large vision-language models (VLMs).

Motivation: Existing evaluation methods either sacrifice open-endedness or rely on subjective judge models, and Korean-language benchmarks are lacking.

Details

Method: Builds a Korean visual question answering benchmark of 275 questions, each paired with an image and grading criteria, covering 10 aspects of VLM performance. Result: With predefined grading criteria, even a small open-source model can evaluate VLMs reliably, confirming the reliability of the approach. Conclusion: KOFFVQA provides an objective, reliable evaluation tool for VLMs in Korean, addressing the shortcomings of existing methods. Abstract: The recent emergence of Large Vision-Language Models (VLMs) has resulted in a variety of different benchmarks for evaluating such models. Despite this, we observe that most existing evaluation methods suffer from the fact that they either require the model to choose from pre-determined responses, sacrificing open-endedness, or evaluate responses using a judge model, resulting in subjective and unreliable evaluation. In addition, we observe a lack of benchmarks for VLMs in the Korean language, which are necessary as a separate metric from more common English language benchmarks, as the performance of generative language models can differ significantly based on the language being used. Therefore, we present KOFFVQA, a general-purpose free-form visual question answering benchmark in the Korean language for the evaluation of VLMs. Our benchmark consists of 275 carefully crafted questions each paired with an image and grading criteria covering 10 different aspects of VLM performance. The grading criteria eliminate the problem of unreliability by allowing the judge model to grade each response based on a pre-determined set of rules. By defining the evaluation criteria in an objective manner, even a small open-source model can be used to evaluate models on our benchmark reliably. In addition to evaluating a large number of existing VLMs on our benchmark, we also experimentally verify that our method of using pre-existing grading criteria for evaluation is much more reliable than existing methods. Our evaluation code is available at https://github.com/maum-ai/KOFFVQA

Learning Bijective Surface Parameterization for Inferring Signed Distance Functions from Sparse Point Clouds with Grid Deformation

Takeshi Noda,Chao Chen,Junsheng Zhou,Weiqi Zhang,Yu-Shen Liu,Zhizhong Han

Task: Infer signed distance functions (SDFs) from sparse point clouds for surface reconstruction.

Motivation: Sparse point clouds lack detailed geometric information, which limits the ability to learn a continuous field.

Details

Method: Proposes a dynamic deformation network that combines bijective surface parameterization (BSP) with grid deformation optimization (GDO) to predict SDFs end to end.

Result: Experiments on synthetic and real scanned datasets show that the method significantly outperforms current state-of-the-art methods.

Conclusion: The dynamic deformation network and bijective parameterization effectively address the challenge of surface reconstruction from sparse point clouds.

Abstract: Inferring signed distance functions (SDFs) from sparse point clouds remains a challenge in surface reconstruction. The key lies in the lack of detailed geometric information in sparse point clouds, which is essential for learning a continuous field. To resolve this issue, we present a novel approach that learns a dynamic deformation network to predict SDFs in an end-to-end manner. To parameterize a continuous surface from sparse points, we propose a bijective surface parameterization (BSP) that learns the global shape from local patches. Specifically, we construct a bijective mapping for sparse points from the parametric domain to 3D local patches, integrating patches into the global surface. Meanwhile, we introduce grid deformation optimization (GDO) into the surface approximation to optimize the deformation of grid points and further refine the parametric surfaces. Experimental results on synthetic and real scanned datasets demonstrate that our method significantly outperforms the current state-of-the-art methods. Project page: https://takeshie.github.io/Bijective-SDF

Short-video Propagation Influence Rating: A New Real-world Dataset and A New Large Graph Model

Dizhan Xue,Jing Cui,Shengsheng Qian,Chuanrui Hu,Changsheng Xu

Task: Propose a new Short-video Propagation Influence Rating (SPIR) task and advance SPIR research from both the dataset and method perspectives.

Motivation: Short-video platforms are popular worldwide, and analyzing how short videos propagate matters for commercial value, public opinion, and user behavior.

Details

Method: Proposes the Cross-platform Short-Video (XS-Video) dataset and NetGPT, a Large Graph Model (LGM) based on a three-stage training mechanism, to predict the long-term propagation influence of short videos.

Result: On the XS-Video dataset, NetGPT shows superior SPIR performance under both classification and regression metrics.

Conclusion: The work provides the first large-scale cross-platform dataset and an effective model for short-video propagation analysis, advancing research in this area.

Abstract: Short-video platforms have gained immense popularity, captivating the interest of millions, if not billions, of users globally. Recently, researchers have highlighted the significance of analyzing the propagation of short videos, which typically involves discovering commercial values, public opinions, user behaviors, etc. This paper proposes a new Short-video Propagation Influence Rating (SPIR) task and aims to promote SPIR from both the dataset and method perspectives. First, we propose a new Cross-platform Short-Video (XS-Video) dataset, which aims to provide a large-scale and real-world short-video propagation network across various platforms to facilitate the research on short-video propagation. Our XS-Video dataset includes 117,720 videos, 381,926 samples, and 535 topics across the 5 biggest Chinese platforms, annotated with the propagation influence from level 0 to 9. To the best of our knowledge, this is the first large-scale short-video dataset that contains cross-platform data or provides all of the views, likes, shares, collects, fans, comments, and comment content. Second, we propose a Large Graph Model (LGM) named NetGPT, based on a novel three-stage training mechanism, to bridge heterogeneous graph-structured data with the powerful reasoning ability and knowledge of Large Language Models (LLMs). Our NetGPT can comprehend and analyze the short-video propagation graph, enabling it to predict the long-term propagation influence of short videos. Comprehensive experimental results evaluated by both classification and regression metrics on our XS-Video dataset indicate the superiority of our method for SPIR.

The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning

Mingkai Tian,Guorong Li,Yuankai Qi,Amin Beheshti,Javen Qinfeng Shi,Anton van den Hengel,Qingming Huang

Task: Propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning.

Motivation: Existing methods tend to focus on one key aspect of the scene and ignore the rest of the visual input, producing incomplete or inaccurate captions.

Details

Method: Builds three distinct memory banks (noun phrases, scene graphs of noun phrases, and entire sentences) and introduces a category-aware retrieval mechanism.

Result: Improves the CIDEr metric by 5.7%, 16.2%, and 3.4% on the MSR-VTT, MSVD, and VATEX benchmarks, respectively.

Conclusion: The method substantially improves the accuracy and completeness of zero-shot video captions.

Abstract: Zero-shot video captioning requires that a model generate high-quality captions without human-annotated video-text pairs for training. State-of-the-art approaches to the problem leverage CLIP to extract visual-relevant textual prompts to guide language models in generating captions. These methods tend to focus on one key aspect of the scene and build a caption that ignores the rest of the visual input. To address this issue, and generate more accurate and complete captions, we propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning. Our approach constructs three distinct memory banks, encompassing noun phrases, scene graphs of noun phrases, and entire sentences. Moreover, we introduce a category-aware retrieval mechanism that models the distribution of natural language surrounding the specific topics in question. Extensive experiments demonstrate the effectiveness of our method with 5.7%, 16.2%, and 3.4% improvements in terms of the main metric CIDEr on MSR-VTT, MSVD, and VATEX benchmarks compared to existing state-of-the-art.
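
To illustrate what retrieval from a text memory bank might look like, here is a small sketch. It assumes each bank entry carries a category tag and a precomputed embedding, and it simplifies "category-aware" to a hard category filter followed by cosine-similarity ranking. The paper's mechanism models a distribution over language and is likely more involved; `Entry`, `retrieve`, and the toy embeddings are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Entry = Tuple[str, str, Sequence[float]]  # (text, category, embedding)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: Sequence[float], category: str,
             bank: List[Entry], k: int) -> List[str]:
    """Return the top-k texts of the given category, ranked by similarity."""
    candidates = [entry for entry in bank if entry[1] == category]
    ranked = sorted(candidates, key=lambda e: cosine(query, e[2]), reverse=True)
    return [text for text, _, _ in ranked[:k]]


# Toy 2-D embeddings; a real system would use CLIP text/image features.
bank = [
    ("a dog", "animal", (1.0, 0.0)),
    ("a running dog", "animal", (0.9, 0.1)),
    ("a red car", "vehicle", (0.95, 0.05)),
    ("a cat", "animal", (0.0, 1.0)),
]
prompts = retrieve(query=(1.0, 0.0), category="animal", bank=bank, k=2)
```

In this toy bank, "a red car" is closer to the query than "a running dog", so restricting the search to the predicted category changes which prompts reach the language model.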

Towards a cognitive architecture to enable natural language interaction in co-constructive task learning

Manuel Scheibl,Birte Richter,Alissa Müller,Michael Beetz,Britta Wrede

Task: Examine which characteristics a cognitive architecture needs in order to leverage the benefits of natural language in Co-Constructive Task Learning (CCTL).

Motivation: Provide a theoretical foundation for CCTL by analyzing Interactive Task Learning (ITL), the mechanisms of the human memory system, and the importance of natural language and multi-modality.

Details

Method: Analyzes the capabilities of existing cognitive architectures and integrates insights from multiple research domains to develop a unified framework.

Result: Proposes a concept of CCTL grounded in multiple sources.

Conclusion: Identifies the remaining challenges and requirements for achieving CCTL in Human-Robot Interaction (HRI).

Abstract: This research addresses the question of which characteristics a cognitive architecture must have to leverage the benefits of natural language in Co-Constructive Task Learning (CCTL). To provide context, we first discuss Interactive Task Learning (ITL), the mechanisms of the human memory system, and the significance of natural language and multi-modality. Next, we examine the current state of cognitive architectures, analyzing their capabilities to inform a concept of CCTL grounded in multiple sources. We then integrate insights from various research domains to develop a unified framework. Finally, we conclude by identifying the remaining challenges and requirements necessary to achieve CCTL in Human-Robot Interaction (HRI).

Detail-aware multi-view stereo network for depth estimation

Haitao Tian,Junyang Li,Chenxing Wang,Helong Jiang

Task: Propose a detail-aware multi-view stereo network (DA-MVSNet) with a coarse-to-fine framework to address the weakness of existing methods in recovering depth at object boundaries and in detail regions.

Motivation: Existing multi-view stereo methods based on coarse-to-fine depth learning recover depth poorly at object boundaries and in detail regions.

Details

Method: Uses the geometric depth clues hidden in the coarse stage to maintain the geometric structural relationships between object surfaces and enhance the expressiveness of image features; applies an image synthesis loss to constrain the gradient flow in detail regions; and proposes an adaptive depth interval adjustment strategy.

Result: Experiments on the DTU and Tanks & Temples datasets show competitive results.

Conclusion: Through geometric clues, the image synthesis loss, and adaptive depth interval adjustment, DA-MVSNet improves depth recovery at object boundaries and in detail regions.

Abstract: Multi-view stereo methods have achieved great success for depth estimation based on coarse-to-fine depth learning frameworks. However, existing methods perform poorly in recovering the depth of object boundaries and detail regions. To address these issues, we propose a detail-aware multi-view stereo network (DA-MVSNet) with a coarse-to-fine framework. The geometric depth clues hidden in the coarse stage are utilized to maintain the geometric structural relationships between object surfaces and enhance the expressive capability of image features. In addition, an image synthesis loss is employed to constrain the gradient flow for detailed regions and further strengthen the supervision of object boundaries and texture-rich areas. Finally, we propose an adaptive depth interval adjustment strategy to improve the accuracy of object reconstruction. Extensive experiments on the DTU and Tanks & Temples datasets demonstrate that our method achieves competitive results. The code is available at https://github.com/wsmtht520-/DAMVSNet.

Get the Agents Drunk: Memory Perturbations in Autonomous Agent-based Recommender Systems

Shiyi Yang,Zhibo Hu,Chen Wang,Tong Yu,Xiwei Xu,Liming Zhu,Lina Yao

Task: Study the robustness of LLM-based agents for recommender systems (Agent4RSs) and propose an attack framework, DrunkAgent, to expose their limitations and strengthen their security.

Motivation: Agent4RSs use memory mechanisms to learn autonomously and self-evolve, but their robustness has not been adequately studied, leaving security and privacy risks.

Details

Method: Proposes the DrunkAgent attack framework, consisting of a generation module, a strategy module, and a surrogate module, which mounts black-box attacks by perturbing agents' memories.

Result: Experiments show that DrunkAgent is effective across various real-world datasets, exposing the vulnerability of Agent4RSs.

Conclusion: The vulnerability analysis offers key insights for building safer and more robust Agent4RSs.

Abstract: Large language model-based agents are increasingly used in recommender systems (Agent4RSs) to achieve personalized behavior modeling. Specifically, Agent4RSs introduce memory mechanisms that enable the agents to autonomously learn and self-evolve from real-world interactions. However, to the best of our knowledge, how robust Agent4RSs are remains unexplored. As such, in this paper, we propose the first work to attack Agent4RSs by perturbing agents' memories, not only to uncover their limitations but also to enhance their security and robustness, ensuring the development of safer and more reliable AI agents. Given the security and privacy concerns, it is more practical to launch attacks under a black-box setting, where the accurate knowledge of the victim models cannot be easily obtained. Moreover, the practical attacks are often stealthy to maximize the impact. To this end, we propose a novel practical attack framework named DrunkAgent. DrunkAgent consists of a generation module, a strategy module, and a surrogate module. The generation module aims to produce effective and coherent adversarial textual triggers, which can be used to achieve attack objectives such as promoting the target items. The strategy module is designed to 'get the target agents drunk' so that their memories cannot be effectively updated during the interaction process. As such, the triggers can play the best role. Both of the modules are optimized on the surrogate module to improve the transferability and imperceptibility of the attacks. By identifying and analyzing the vulnerabilities, our work provides critical insights that pave the way for building safer and more resilient Agent4RSs. Extensive experiments across various real-world datasets demonstrate the effectiveness of DrunkAgent.

3D Dental Model Segmentation with Geometrical Boundary Preserving

Shufan Xi,Zexian Liu,Junlin Chang,Hongyu Wu,Xiaogang Wang,Aimin Hao

Task: Propose CrossTooth, a boundary-preserving segmentation method for tooth segmentation on 3D intraoral scan meshes.

Motivation: Existing deep learning methods segment crowns well but are less accurate at the junction between crown and gum, and existing down-sampling methods fail to preserve geometric detail at that junction.

Details

Method: Combines selective downsampling of the 3D mesh, which retains more vertices in the tooth-gingiva area, with cross-modal discriminative boundary features extracted from multi-view rendered images to enhance the geometric representation of the segmentation network.

Result: Experiments on a public intraoral scan dataset show that CrossTooth significantly improves segmentation accuracy.

Conclusion: By combining mesh downsampling with image features, CrossTooth improves tooth segmentation accuracy, particularly at the tooth-gingiva junction.

Abstract: 3D intraoral scan meshes are widely used in digital dentistry diagnosis, and segmenting them is a critical preliminary task. Numerous approaches have been devised for precise tooth segmentation. Currently, deep learning-based methods are capable of high-accuracy crown segmentation. However, the segmentation accuracy at the junction between the crown and the gum is still below average. Existing down-sampling methods are unable to effectively preserve the geometric details at the junction. To address these problems, we propose CrossTooth, a boundary-preserving segmentation method that combines 3D mesh selective downsampling to retain more vertices at the tooth-gingiva area, along with cross-modal discriminative boundary features extracted from multi-view rendered images, enhancing the geometric representation of the segmentation network. Using a point network as a backbone and incorporating image complementary features, CrossTooth significantly improves segmentation accuracy, as demonstrated by experiments on a public intraoral scan dataset.

Grounding Agent Reasoning in Image Schemas: A Neurosymbolic Approach to Embodied Cognition

François Olivier,Zied Bouraoui

Task: Propose a new framework that bridges embodied cognition theory and agent systems, using a formal characterization of image schemas to improve agents' understanding and interaction.

Motivation: Existing agent reasoning systems struggle to capture the conceptual structures humans naturally use, which limits their understanding and interaction abilities.

Details

Method: Customizes LLMs to translate natural language descriptions into formal representations based on sensorimotor patterns, building a neurosymbolic system.

Result: The authors argue that the system grounds the agent's understanding in fundamental conceptual structures, improving efficiency and interpretability.

Conclusion: The approach is argued to enable more intuitive human-agent interaction through shared embodied understanding.

Abstract: Despite advances in embodied AI, agent reasoning systems still struggle to capture the fundamental conceptual structures that humans naturally use to understand and interact with their environment. To address this, we propose a novel framework that bridges embodied cognition theory and agent systems by leveraging a formal characterization of image schemas, which are defined as recurring patterns of sensorimotor experience that structure human cognition. By customizing LLMs to translate natural language descriptions into formal representations based on these sensorimotor patterns, we will be able to create a neurosymbolic system that grounds the agent's understanding in fundamental conceptual structures. We argue that such an approach enhances both efficiency and interpretability while enabling more intuitive human-agent interactions through shared embodied understanding.

Expanding-and-Shrinking Binary Neural Networks

Xulong Shi,Caiyi Sun,Zhi Qi,Liu Hao,Xiaodong Yang

Task: Propose an expanding-and-shrinking operation to strengthen the representation capacity of binary neural networks (BNNs).

Motivation: BNNs offer speed and energy-efficiency advantages but lose substantial accuracy on challenging tasks because the values in their feature maps are strongly constrained.

Details

Method: Enhances binary feature maps with an expanding-and-shrinking operation at a negligible increase in computational complexity.

Result: Performs well on multiple benchmarks across image classification, object detection, and generative diffusion models, outperforming various leading binarization algorithms.

Conclusion: The method markedly improves BNN performance and applies to a range of architectures and applications.

Abstract: While binary neural networks (BNNs) offer significant benefits in terms of speed, memory and energy, they encounter substantial accuracy degradation in challenging tasks compared to their real-valued counterparts. Due to the binarization of weights and activations, the possible values of each entry in the feature maps generated by BNNs are strongly constrained. To tackle this limitation, we propose the expanding-and-shrinking operation, which enhances binary feature maps with a negligible increase in computational complexity, thereby strengthening the representation capacity. Extensive experiments conducted on multiple benchmarks reveal that our approach generalizes well across diverse applications ranging from image classification and object detection to generative diffusion models, while also achieving remarkable improvement over various leading binarization algorithms based on different architectures including both CNNs and Transformers.

MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing

Karim Radouane,Hanane Azzag,Mustapha Lebbah

Task: Propose a unified framework that integrates object detection (OD) and visual grounding (VG) for remote sensing (RS) imagery.

Motivation: Support conventional object detection while establishing an intuitive prior for the visual grounding task.

Details

Method: Fine-tunes an open-set object detector, builds a graph representation of each image, and designs a task-aware architecture comprising a multi-branch network and an object reasoning network.

Result: Achieves strong results on the OPT-RSVG and DIOR-RSVG datasets, significantly outperforming existing methods while retaining classical OD capabilities.

Conclusion: The framework effectively integrates object detection and visual grounding for remote sensing imagery.

Abstract: We propose a unified framework that integrates object detection (OD) and visual grounding (VG) for remote sensing (RS) imagery. To support conventional OD and establish an intuitive prior for the VG task, we fine-tune an open-set object detector using referring expression data, framing it as a partially supervised OD task. In the first stage, we construct a graph representation of each image, comprising object queries, class embeddings, and proposal locations. Then, our task-aware architecture processes this graph to perform the VG task. The model consists of: (i) a multi-branch network that integrates spatial, visual, and categorical features to generate task-aware proposals, and (ii) an object reasoning network that assigns probabilities across proposals, followed by a soft selection mechanism for final referring object localization. Our model demonstrates superior performance on the OPT-RSVG and DIOR-RSVG datasets, achieving significant improvements over state-of-the-art methods while retaining classical OD capabilities. The code will be available in our repository: https://github.com/rd20karim/MB-ORES.
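
One plausible reading of "assigns probabilities across proposals, followed by a soft selection mechanism" is a softmax over proposal scores followed by a probability-weighted average of the proposal boxes. The sketch below shows that reading only; it is an assumption, not the MB-ORES implementation, and `soft_select` and the example boxes are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over proposal scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_select(boxes: Sequence[Sequence[float]],
                scores: Sequence[float]) -> List[float]:
    """Probability-weighted average of boxes given as (x1, y1, x2, y2)."""
    probs = softmax(scores)
    return [sum(p * box[i] for p, box in zip(probs, boxes)) for i in range(4)]


boxes = [(10.0, 10.0, 50.0, 50.0), (200.0, 200.0, 260.0, 260.0)]
probs = softmax([8.0, 0.0])               # the first proposal dominates
selected = soft_select(boxes, [8.0, 0.0])
tied = soft_select(boxes, [1.0, 1.0])     # equal scores give the midpoint box
```

Unlike a hard argmax, this selection is differentiable with respect to the scores, which is a common reason to prefer a soft variant during training.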

ElimPCL: Eliminating Noise Accumulation with Progressive Curriculum Labeling for Source-Free Domain Adaptation

Jie Cheng,Hao Zheng,Meiguang Zheng,Lei Wang,Hao Wu,Jian Zhang

Task: Train a target model without access to source data, using pseudo-label generation for domain adaptation.

Motivation: The source model produces highly uncertain pseudo-labels for hard samples affected by domain shift, so noisy pseudo-labels are introduced and propagated during training, degrading performance.

Details

Method: Proposes Progressive Curriculum Labeling (ElimPCL), which iteratively filters trustworthy pseudo-labeled samples based on prototype consistency, combined with a Dual MixUP technique in the feature space to enhance the separability of hard samples.

Result: Experiments show that ElimPCL improves over existing methods by up to 3.4% on challenging tasks.

Conclusion: Through noise filtering and feature enhancement, ElimPCL mitigates noise accumulation and improves source-free domain adaptation.

Abstract: Source-Free Domain Adaptation (SFDA) aims to train a target model without source data, and the key is to generate pseudo-labels using a pre-trained source model. However, we observe that the source model often produces highly uncertain pseudo-labels for hard samples, particularly those heavily affected by domain shifts, leading to these noisy pseudo-labels being introduced even before adaptation and further reinforced through parameter updates. Additionally, they continuously influence neighbor samples through propagation in the feature space. To eliminate the issue of noise accumulation, we propose a novel Progressive Curriculum Labeling (ElimPCL) method, which iteratively filters trustworthy pseudo-labeled samples based on prototype consistency to exclude high-noise samples from training. Furthermore, a Dual MixUP technique is designed in the feature space to enhance the separability of hard samples, thereby mitigating the interference of noisy samples on their neighbors. Extensive experiments validate the effectiveness of ElimPCL, achieving up to a 3.4% improvement on challenging tasks compared to state-of-the-art methods.
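
A simple way to picture prototype-consistency filtering is shown below. The sketch assumes a sample is trusted when its pseudo-label matches the class of its nearest prototype, where a prototype is the mean feature of all samples carrying that pseudo-label. The exact criterion and the Dual MixUP design in ElimPCL may differ; `trusted_indices` and `feature_mixup` are hypothetical names, and the mixing function is plain MixUp applied to feature vectors.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Vector = Sequence[float]


def class_prototypes(features: List[Vector], labels: List[int]) -> Dict[int, List[float]]:
    """Mean feature vector of the samples assigned to each pseudo-label."""
    groups = defaultdict(list)
    for feat, label in zip(features, labels):
        groups[label].append(feat)
    return {label: [sum(col) / len(feats) for col in zip(*feats)]
            for label, feats in groups.items()}


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def trusted_indices(features: List[Vector], labels: List[int]) -> List[int]:
    """Indices whose pseudo-label agrees with the nearest class prototype."""
    protos = class_prototypes(features, labels)
    keep = []
    for i, (feat, label) in enumerate(zip(features, labels)):
        nearest = min(protos, key=lambda c: sq_dist(feat, protos[c]))
        if nearest == label:
            keep.append(i)
    return keep


def feature_mixup(a: Vector, b: Vector, lam: float) -> List[float]:
    """Convex combination of two feature vectors (plain MixUp in feature space)."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


features = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2),
            (5.0, 5.0), (5.1, 4.9), (4.8, 5.2), (4.9, 5.0)]
pseudo_labels = [0, 0, 0, 1, 1, 1, 0]   # the last label is likely noisy
keep = trusted_indices(features, pseudo_labels)
mixed = feature_mixup((0.0, 0.0), (4.0, 2.0), lam=0.25)
```

Repeating the filter after each round of training, with prototypes recomputed from the updated features, would give the iterative, curriculum-like behaviour the abstract describes.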

PAARS: Persona Aligned Agentic Retail Shoppers

Saab Mansour,Leonardo Perelli,Lorenzo Mainetti,George Davidson,Stefano D'Amato

Task: Propose a framework for synthesizing and validating a population of LLM agents that simulate human shopping behavior.

Motivation: Traditional behavioral data collection is costly and slow, while simulations with LLM agents exhibit biases and must be aligned to real user behavior.

Details

Method: The framework mines personas from historical data, equips agents with retail tools to generate shopping sessions, and introduces a population-level alignment evaluation.

Result: Experiments show that personas improve alignment performance, though a gap to human behavior remains; the framework is also applied to automated A/B testing.

Conclusion: The framework offers an effective approach to LLM agent simulation, but the gap to human behavior and related challenges require further work.

Abstract: In e-commerce, behavioral data is collected for decision making, which can be costly and slow. Simulation with LLM-powered agents is emerging as a promising alternative for representing human population behavior. However, LLMs are known to exhibit certain biases, such as brand bias, review rating bias and limited representation of certain groups in the population, hence they need to be carefully benchmarked and aligned to user behavior. Ultimately, our goal is to synthesise an agent population and verify that it collectively approximates a real sample of humans. To this end, we propose a framework that: (i) creates synthetic shopping agents by automatically mining personas from anonymised historical shopping data, (ii) equips agents with retail-specific tools to synthesise shopping sessions and (iii) introduces a novel alignment suite measuring distributional differences between humans and shopping agents at the group (i.e. population) level rather than the traditional "individual" level. Experimental results demonstrate that using personas improves performance on the alignment suite, though a gap remains to human behaviour. We showcase an initial application of our framework for automated agentic A/B testing and compare the findings to human results. Finally, we discuss applications, limitations and challenges setting the stage for impactful future work.
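
Group-level alignment means comparing distributions over a population rather than matching individual sessions. As a sketch, the snippet below compares the distribution of purchased categories for humans and agents with the Jensen-Shannon divergence. The choice of this metric and of "purchased category" as the observed event are assumptions made for illustration; the abstract does not name the measures used in the PAARS alignment suite.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def distribution(events: Iterable[str]) -> Dict[str, float]:
    """Turn observed events (e.g. purchased categories) into probabilities."""
    counts = Counter(events)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mix(a: Dict[str, float]) -> float:
        return sum(v * math.log2(v / mix[k]) for k, v in a.items() if v > 0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


# Hypothetical session outcomes for a human sample and an agent population.
human_sessions = ["shoes", "shoes", "books", "toys"]
agent_sessions = ["shoes", "books", "books", "books"]
gap = js_divergence(distribution(human_sessions), distribution(agent_sessions))
```

A lower `gap` would indicate that the agent population, taken as a whole, behaves more like the human sample, even if no single agent mirrors a single person.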

HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation

Kun Liu,Qi Liu,Xinchen Liu,Jie Li,Yongdong Zhang,Jiebo Luo,Xiaodong He,Wu Liu

Task: Introduce and build the HOIGen-1M dataset to address the shortcomings of current text-to-video (T2V) models in generating human-object interaction (HOI) videos.

Motivation: Current T2V models cannot precisely generate human-object interactions because large-scale HOI video data with accurate captions is lacking.

Details

Method: Uses multimodal large language models (MLLMs) to automatically curate high-quality HOI videos, followed by human cleaning; generates accurate captions with a Mixture-of-Multimodal-Experts (MoME) strategy; and proposes two new metrics for evaluating generated videos.

Result: Experiments show that current T2V models struggle to generate high-quality HOI videos and that HOIGen-1M is instrumental for improving generation.

Conclusion: HOIGen-1M provides important support for improving HOI video generation, along with effective evaluation methods.

Abstract: Text-to-video (T2V) generation has made tremendous progress in generating complicated scenes based on texts. However, human-object interaction (HOI) often cannot be precisely generated by current T2V models due to the lack of large-scale videos with accurate captions for HOI. To address this issue, we introduce HOIGen-1M, the first large-scale dataset for HOI Generation, consisting of over one million high-quality videos collected from diverse sources. In particular, to guarantee the high quality of videos, we first design an efficient framework to automatically curate HOI videos using the powerful multimodal large language models (MLLMs), and then the videos are further cleaned by human annotators. Moreover, to obtain accurate textual captions for HOI videos, we design a novel video description method based on a Mixture-of-Multimodal-Experts (MoME) strategy that not only generates expressive captions but also eliminates the hallucination of individual MLLMs. Furthermore, due to the lack of an evaluation framework for generated HOI videos, we propose two new metrics to assess the quality of generated videos in a coarse-to-fine manner. Extensive experiments reveal that current T2V models struggle to generate high-quality HOI videos and confirm that our HOIGen-1M dataset is instrumental for improving HOI video generation. Project webpage is available at https://liuqi-creat.github.io/HOIGen.github.io.

MaintainCoder: Maintainable Code Generation Under Dynamic Requirements

Zhengren Wang,Rui Ling,Chufan Wang,Yongan Yu,Zhiyu Li,Feiyu Xiong,Wentao Zhang

Task: Propose MaintainCoder, a solution for improving the maintainability of generated code.

Motivation: Existing code generation systems have advanced in functional correctness and execution efficiency but overlook maintainability, a critical dimension.

Details

Method: Integrates the Waterfall model, design patterns, and multi-agent collaboration to systematically enhance cohesion, reduce coupling, and improve adaptability.

Result: MaintainCoder improves maintainability metrics by 14-30% while achieving even higher correctness (pass@k).

Conclusion: The work lays a foundation for maintainable code generation and highlights the need for more holistic code quality research.

Abstract: Modern code generation has made significant strides in functional correctness and execution efficiency. However, these systems often overlook a critical dimension in real-world software development: maintainability. To handle dynamic requirements with minimal rework, we propose MaintainCoder as a pioneering solution. It integrates the Waterfall model, design patterns, and multi-agent collaboration to systematically enhance cohesion, reduce coupling, and improve adaptability. We also introduce MaintainBench, a benchmark comprising requirement changes and corresponding dynamic metrics on maintenance effort. Experiments demonstrate that existing code generation methods struggle to meet maintainability standards when requirements evolve. In contrast, MaintainCoder improves maintainability metrics by 14-30% with even higher correctness, i.e. pass@k. Our work not only provides the foundation of maintainable code generation, but also highlights the need for more holistic code quality research. Resources: https://github.com/IAAR-Shanghai/MaintainCoder.
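
The abstract reports correctness as pass@k. For readers unfamiliar with the metric, the snippet below implements the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the tests. Whether MaintainCoder's evaluation uses exactly this estimator is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct.

    n: samples generated per problem, c: samples that pass, k: draw size.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every draw of size k contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


single_draw = pass_at_k(n=10, c=3, k=1)   # equals the plain pass rate c / n
five_draws = pass_at_k(n=10, c=3, k=5)
```

For k = 1 the estimator reduces to the fraction of passing samples, and it grows toward 1 as k increases, so scores are only comparable at the same k.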

Effective Cloud Removal for Remote Sensing Images by an Improved Mean-Reverting Denoising Model with Elucidated Design Space

Yi Liu,Wengen Li,Jihong Guan,Shuigeng Zhou,Yichao Zhang

Task: Develop EMRDM, a cloud removal model based on mean-reverting diffusion models (MRDMs), for remote sensing image processing.

Motivation: Existing diffusion models (DMs) are suboptimal for cloud removal because they generate cloudless images from random noise, ignoring the information inherent in cloudy inputs.

Details

Method: Proposes EMRDM, which builds a modular framework on a reformulated forward process and an ODE-based backward process, and improves the denoiser, training process, and sampling process.

Result: Experiments on mono-temporal and multi-temporal datasets show superior performance.

Conclusion: With its modular design and improved diffusion process, EMRDM improves cloud removal performance.

Abstract: Cloud removal (CR) remains a challenging task in remote sensing image processing. Although diffusion models (DM) exhibit strong generative capabilities, their direct applications to CR are suboptimal, as they generate cloudless images from random noise, ignoring inherent information in cloudy inputs. To overcome this drawback, we develop a new CR model EMRDM based on mean-reverting diffusion models (MRDMs) to establish a direct diffusion process between cloudy and cloudless images. Compared to current MRDMs, EMRDM offers a modular framework with updatable modules and an elucidated design space, based on a reformulated forward process and a new ordinary differential equation (ODE)-based backward process. Leveraging our framework, we redesign key MRDM modules to boost CR performance, including restructuring the denoiser via a preconditioning technique, reorganizing the training process, and improving the sampling process by introducing deterministic and stochastic samplers. To achieve multi-temporal CR, we further develop a denoising network for simultaneously denoising sequential images. Experiments on mono-temporal and multi-temporal datasets demonstrate the superior performance of EMRDM. Our code is available at https://github.com/Ly403/EMRDM.

Rec-R1: Bridging Generative Large Language Models and User-Centric Recommendation Systems via Reinforcement Learning

Jiacheng Lin,Tian Wang,Kun Qian

Task: Propose Rec-R1, a reinforcement learning framework that bridges large language models (LLMs) with recommendation systems through closed-loop optimization.

Motivation: Avoid the high cost of relying on synthetic SFT data from proprietary models (such as GPT-4o) while optimizing LLM generation directly.

Details

Method: Optimizes LLM generation directly using feedback from a fixed black-box recommendation model, without synthetic SFT data.

Result: On product search and sequential recommendation, Rec-R1 outperforms prompting- and SFT-based methods while preserving the LLM's general-purpose capabilities.

Conclusion: Rec-R1 is a promising foundation for continual task-specific adaptation without catastrophic forgetting.

Abstract: We propose Rec-R1, a general reinforcement learning framework that bridges large language models (LLMs) with recommendation systems through closed-loop optimization. Unlike prompting and supervised fine-tuning (SFT), Rec-R1 directly optimizes LLM generation using feedback from a fixed black-box recommendation model, without relying on synthetic SFT data from proprietary models such as GPT-4o. This avoids the substantial cost and effort required for data distillation. To verify the effectiveness of Rec-R1, we evaluate it on two representative tasks: product search and sequential recommendation. Experimental results demonstrate that Rec-R1 not only consistently outperforms prompting- and SFT-based methods, but also achieves significant gains over strong discriminative baselines, even when used with simple retrievers such as BM25. Moreover, Rec-R1 preserves the general-purpose capabilities of the LLM, unlike SFT, which often impairs instruction-following and reasoning. These findings suggest Rec-R1 as a promising foundation for continual task-specific adaptation without catastrophic forgetting.
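
In a closed loop of this kind, the LLM's output (for example a rewritten query) is passed to the fixed retriever or recommender, and the quality of the returned ranking becomes the scalar reward. The sketch below computes one candidate reward, binary-relevance NDCG@k. Using NDCG as the reward is an assumption made for illustration, since the abstract does not specify the reward, and `ndcg_at_k` is a hypothetical helper.

```python
import math
from typing import Sequence, Set


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k of a ranked list returned by the black-box system.

    Assumes `ranked_ids` contains no duplicate items.
    """
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked_ids[:k])
              if item in relevant_ids)
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# The reward only needs the ranking, so the retriever can stay a black box.
reward_good = ndcg_at_k(["a", "b", "c"], {"a"}, k=3)   # relevant item ranked first
reward_poor = ndcg_at_k(["a", "b", "c"], {"c"}, k=3)   # relevant item ranked third
```

Because the reward depends only on the returned ranking, no gradients or internals of the recommendation model are required, which matches the black-box setting described in the abstract.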

LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification

Xiang Hu,Yuhao Wang,Pingping Zhang,Huchuan Lu

Task: Aerial-Ground person Re-IDentification (AG-ReID) aims to retrieve specific persons across heterogeneous cameras in different views.

Motivation: Previous methods overlook semantic information in person attributes and rely on costly full fine-tuning of large-scale models.

Details

Method: Proposes the LATex framework, which uses prompt-tuning strategies with CLIP as the backbone and comprises an Attribute-aware Image Encoder (AIE), a Prompted Attribute Classifier Group (PACG), and a Coupled Prompt Template (CPT).

Result: Extensive experiments on three AG-ReID benchmarks demonstrate its effectiveness.

Conclusion: LATex successfully leverages attribute-based text knowledge to improve AG-ReID performance.

Abstract: Aerial-Ground person Re-IDentification (AG-ReID) aims to retrieve specific persons across heterogeneous cameras in different views. Previous methods usually adopt large-scale models, focusing on view-invariant features. However, they overlook the semantic information in person attributes. Additionally, existing training strategies often rely on full fine-tuning of large-scale models, which significantly increases training costs. To address these issues, we propose a novel framework named LATex for AG-ReID, which adopts prompt-tuning strategies to leverage attribute-based text knowledge. More specifically, we first introduce the Contrastive Language-Image Pre-training (CLIP) model as the backbone, and propose an Attribute-aware Image Encoder (AIE) to extract global semantic features and attribute-aware features. Then, with these features, we propose a Prompted Attribute Classifier Group (PACG) to generate person attribute predictions and obtain the encoded representations of predicted attributes. Finally, we design a Coupled Prompt Template (CPT) to transform attribute tokens and view information into structured sentences. These sentences are processed by the text encoder of CLIP to generate more discriminative features. As a result, our framework can fully leverage attribute-based text knowledge to improve AG-ReID. Extensive experiments on three AG-ReID benchmarks demonstrate the effectiveness of our proposed LATex. The source code will be available.

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Jingcheng Hu,Yinmin Zhang,Qi Han,Daxin Jiang,Xiangyu Zhang,Heung-Yeung Shum

Task: Provide an open-source implementation of large-scale reasoning-oriented reinforcement learning training, focusing on scalability, simplicity, and accessibility.

Motivation: Show that a minimalist approach (vanilla PPO with GAE) can scale up response length and benchmark performance without KL regularization.

Details

Method: Uses vanilla PPO with GAE (λ=1, γ=1) and rule-based rewards, without KL regularization.

Result: Achieves superior performance on AIME2024, MATH500, and GPQA Diamond with notable training efficiency, needing only a tenth of the training steps of the DeepSeek-R1-Zero pipeline.

Conclusion: Source code, parameter settings, training data, and model weights are released as open source to support the community.

Abstract: We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training focusing on scalability, simplicity and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($\lambda=1$, $\gamma=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both response length and benchmark performance, similar to the phenomenon observed in DeepSeek-R1-Zero. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance on AIME2024, MATH500, and the GPQA Diamond benchmark while demonstrating remarkable efficiency, requiring only a tenth of the training steps compared to the DeepSeek-R1-Zero pipeline. In the spirit of open source, we release our source code, parameter settings, training data, and model weights across various sizes.
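
The setting λ=1, γ=1 has a simple consequence worth spelling out: Generalized Advantage Estimation then telescopes to the undiscounted return-to-go minus the value baseline, A_t = (r_t + r_{t+1} + ... + r_T) - V(s_t). The sketch below is a generic GAE routine, not the Open-Reasoner-Zero code, and it assumes the value after the final step is zero (the episode terminates).

```python
from typing import List, Sequence


def gae_advantages(rewards: Sequence[float], values: Sequence[float],
                   gamma: float = 1.0, lam: float = 1.0) -> List[float]:
    """Generalized Advantage Estimation for one finished episode.

    values[t] is the critic's estimate V(s_t); the value after the last
    step is taken to be 0 because the episode has terminated.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# A rule-based reward that only arrives at the final token of a response.
rewards = [0.0, 0.0, 1.0]
values = [0.2, 0.5, 0.9]
adv = gae_advantages(rewards, values)   # gamma = lam = 1
```

With a single terminal reward of 1, each advantage is simply 1 - V(s_t), so every token in a correct response receives credit that is not decayed by its distance from the end, which matters for long responses.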

Exploring Temporal Dynamics in Event-based Eye Tracker

Hongwei Ren,Xiaopeng Lin,Hongxiang Huang,Yue Zhou,Bojun Cheng

Task: Propose TDTracker, an effective event-camera-based eye-tracking framework for capturing rapid eye movements.

Motivation: Frame-based image sensors have limited temporal resolution and struggle to capture rapid eye movements such as saccades and blinks, whereas event cameras, with low power consumption and high temporal resolution, are promising for high-speed, high-precision eye tracking.

Details

Method: TDTracker uses 3D convolutional neural networks to capture implicit short-term temporal dynamics and a cascaded structure (Frequency-aware Module, GRU, and Mamba) to extract explicit long-term temporal dynamics, then regresses eye coordinates from a prediction heatmap.

Result: TDTracker achieves SOTA performance on the synthetic SEET dataset and placed third in the CVPR 2025 event-based eye-tracking challenge.

Conclusion: By comprehensively modeling temporal dynamics, TDTracker achieves high-speed, high-precision eye tracking and shows the potential of event cameras in this area.

Abstract: Eye-tracking is a vital technology for human-computer interaction, especially in wearable devices such as AR, VR, and XR. The realization of high-speed and high-precision eye-tracking using frame-based image sensors is constrained by their limited temporal resolution, which impairs the accurate capture of rapid ocular dynamics, such as saccades and blinks. Event cameras, inspired by biological vision systems, are capable of perceiving eye movements with extremely low power consumption and ultra-high temporal resolution. This makes them a promising solution for achieving high-speed, high-precision tracking with rich temporal dynamics. In this paper, we propose TDTracker, an effective eye-tracking framework that captures rapid eye movements by thoroughly modeling temporal dynamics from both implicit and explicit perspectives. TDTracker utilizes 3D convolutional neural networks to capture implicit short-term temporal dynamics and employs a cascaded structure consisting of a Frequency-aware Module, GRU, and Mamba to extract explicit long-term temporal dynamics. Ultimately, a prediction heatmap is used for eye coordinate regression. Experimental results demonstrate that TDTracker achieves state-of-the-art (SOTA) performance on the synthetic SEET dataset and secured third place in the CVPR event-based eye-tracking challenge 2025. Our code is available at https://github.com/rhwxmx/TDTracker.
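
A common way to turn a prediction heatmap into continuous coordinates is a soft-argmax: normalize the heatmap with a softmax and take the expected row and column index. The sketch below shows that decoding step only. Whether TDTracker decodes its heatmap this way is an assumption, and `soft_argmax_2d` is a hypothetical helper.

```python
import math
from typing import List, Sequence, Tuple


def soft_argmax_2d(heatmap: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Expected (x, y) position under a softmax over all heatmap cells.

    x is the column index and y is the row index, both in pixel units.
    """
    peak = max(max(row) for row in heatmap)
    weights: List[List[float]] = [[math.exp(v - peak) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return x, y


# A sharp peak at row 1, column 2 of a 3x3 heatmap of logits.
peaked = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 20.0],
          [0.0, 0.0, 0.0]]
x, y = soft_argmax_2d(peaked)
cx, cy = soft_argmax_2d([[1.0] * 3 for _ in range(3)])   # uniform map gives the centre
```

Compared with a hard argmax, the result is sub-pixel and differentiable, so a coordinate loss can be back-propagated through the heatmap.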

ORAL: Prompting Your Large-Scale LoRAs via Conditional Recurrent Diffusion

Rana Muhammad Shahroz Khan,Dongwen Tang,Pingzhi Li,Kai Wang,Tianlong Chen

Task: Propose ORAL, a conditional recurrent diffusion framework for efficiently generating task-specific LoRA parameters that adapt to continually updated large language models.

Motivation: Existing methods are limited in achieving scalability and controllability simultaneously; ORAL aims to address this.

Details

Method: ORAL uses a conditional recurrent diffusion framework conditioned on model architecture and textual task specifications to generate task-specific LoRA parameters.

Result: Experiments show that LoRA parameters generated by ORAL perform comparably to or better than conventionally trained counterparts.

Conclusion: ORAL offers a scalable and controllable solution for efficiently adapting large language models.

Abstract: Parameter generation has emerged as a novel paradigm for neural network development, offering an alternative to traditional neural network training by synthesizing high-quality model weights directly. In the context of Low-Rank Adaptation (LoRA) for evolving (i.e., constantly updated) large language models (LLMs), this approach promises efficient adaptation without costly retraining. However, existing methods face critical limitations in simultaneously achieving scalability and controllability. In this paper, we introduce ORAL, a novel conditional recurrent diffusion framework that addresses these challenges. ORAL incorporates a novel conditioning mechanism that integrates model architecture and textual task specifications, enabling the generation of task-specific LoRA parameters that can seamlessly transfer across evolving foundation models. Our approach successfully scales to billions-of-parameter LLMs and maintains controllability. Through extensive experiments across seven language tasks, four vision tasks, and three multimodal tasks using five pre-trained LLMs, we demonstrate that ORAL generates high-quality LoRA parameters that achieve comparable or superior performance to vanilla trained counterparts.
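
For context on what "LoRA parameters" are, the sketch below applies a low-rank update to a frozen weight matrix using the standard LoRA formulation W' = W + (alpha / r) * B A. It does not show ORAL's diffusion-based generator: here the factors A and B are written by hand, whereas ORAL would produce them from the conditioning inputs. The helper names and toy matrices are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small dense matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, lora_b: Matrix, lora_a: Matrix,
               alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A without modifying the frozen W."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


weight = [[1.0, 0.0],
          [0.0, 1.0]]          # frozen base weight, shape (d_out, d_in)
lora_b = [[1.0],
          [2.0]]               # shape (d_out, r) with r = 1
lora_a = [[0.5, -0.5]]         # shape (r, d_in)
merged = merge_lora(weight, lora_b, lora_a, alpha=2.0, rank=1)
```

The adapter stores only r * (d_in + d_out) numbers per layer instead of d_in * d_out, and this small footprint is the object a parameter generator has to produce.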

KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language

Yoonshik Kim,Jaeyoon Jung

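Task: Propose KOFFVQA, a Korean-language free-form visual question answering benchmark for evaluating large vision-language models (VLMs).

Motivation: Existing evaluation methods either sacrifice open-endedness (pre-determined answers) or rely on subjective judge models, and benchmarks targeting the Korean language are lacking.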

Details

Method: Develop the KOFFVQA benchmark, comprising 275 carefully crafted questions, each paired with an image and grading criteria, covering 10 aspects of VLM performance. Result: With pre-defined grading criteria, even a small open-source model can serve as a reliable judge, and experiments verify the reliability of the approach. Conclusion: KOFFVQA provides an objective, reliable evaluation tool for Korean-language VLMs and addresses the shortcomings of existing methods. Abstract: The recent emergence of Large Vision-Language Models (VLMs) has resulted in a variety of different benchmarks for evaluating such models. Despite this, we observe that most existing evaluation methods suffer from the fact that they either require the model to choose from pre-determined responses, sacrificing open-endedness, or evaluate responses using a judge model, resulting in subjective and unreliable evaluation. In addition, we observe a lack of benchmarks for VLMs in the Korean language, which are necessary as a separate metric from more common English language benchmarks, as the performance of generative language models can differ significantly based on the language being used. Therefore, we present KOFFVQA, a general-purpose free-form visual question answering benchmark in the Korean language for the evaluation of VLMs. Our benchmark consists of 275 carefully crafted questions, each paired with an image and grading criteria covering 10 different aspects of VLM performance. The grading criteria eliminate the problem of unreliability by allowing the judge model to grade each response based on a pre-determined set of rules. By defining the evaluation criteria in an objective manner, even a small open-source model can be used to evaluate models on our benchmark reliably. In addition to evaluating a large number of existing VLMs on our benchmark, we also experimentally verify that our method of using pre-existing grading criteria for evaluation is much more reliable than existing methods. Our evaluation code is available at https://github.com/maum-ai/KOFFVQA

SQuat: Subspace-orthogonal KV Cache Quantization

Hao Wang,Ligong Han,Kai Xu,Akash Srivastava

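Task: Propose SQuat, a subspace-orthogonal KV cache quantization method that reduces memory overhead during LLM decoding while preserving output quality.

Motivation: Existing KV cache quantization methods reduce memory usage, but quantization errors accumulate as more tokens are generated and degrade output quality.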

Details

Method: Construct a subspace spanned by query tensors and, when quantizing key tensors, keep the quantization error orthogonal to that subspace, minimizing its effect on the attention output. Result: Experiments show that SQuat reduces peak memory by a factor of 2.17 to 2.82, improves throughput by a factor of 2.45 to 3.60, and outperforms existing methods on benchmarks. Conclusion: SQuat is an efficient KV cache quantization method that requires no fine-tuning or additional calibration data, is theoretically grounded, and performs strongly. Abstract: The key-value (KV) cache accelerates LLM decoding by storing KV tensors from previously generated tokens. It reduces redundant computation at the cost of increased memory usage. To mitigate this overhead, existing approaches compress KV tensors into lower-bit representations; however, quantization errors can accumulate as more tokens are generated, potentially resulting in undesired outputs. In this paper, we introduce SQuat (Subspace-orthogonal KV cache quantization). It first constructs a subspace spanned by query tensors to capture the most critical task-related information. During key tensor quantization, it enforces that the difference between the (de)quantized and original keys remains orthogonal to this subspace, minimizing the impact of quantization errors on the attention mechanism's outputs. SQuat requires no model fine-tuning, no additional calibration dataset for offline learning, and is grounded in a theoretical framework we develop. Through numerical experiments, we show that our method reduces peak memory by a factor of 2.17 to 2.82, improves throughput by a factor of 2.45 to 3.60, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.
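
To illustrate why an error orthogonal to the query subspace leaves attention logits unchanged, here is a toy sketch in pure Python. It is not the SQuat algorithm: the one-dimensional subspace, the uniform `quantize` stand-in, and the helper `remove_error_along` are all assumptions made for illustration, and the corrected key is no longer on the quantization grid, which a real method would have to handle.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def quantize(vec, step=0.5):
    """Round each entry to a grid of width `step` (hypothetical stand-in quantizer)."""
    return [round(x / step) * step for x in vec]

def remove_error_along(query, key, key_hat):
    """Shift key_hat so that the error (key_hat - key) is orthogonal to `query`."""
    err = [kh - k for kh, k in zip(key_hat, key)]
    coeff = dot(err, query) / dot(query, query)
    return [kh - coeff * q for kh, q in zip(key_hat, query)]

query = [1.0, 2.0, -1.0]   # spans a 1-D "query subspace" (assumption)
key = [0.3, -0.7, 1.1]
key_hat = quantize(key)
key_adj = remove_error_along(query, key, key_hat)

logit_true = dot(query, key)       # attention logit with the original key
logit_naive = dot(query, key_hat)  # logit after plain quantization
logit_adj = dot(query, key_adj)    # logit after removing the in-subspace error
```

In this toy setting `logit_adj` matches `logit_true` up to floating-point rounding, while `logit_naive` does not, which is the intuition behind constraining the error direction.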

Investigation of intelligent barbell squat coaching system based on computer vision and machine learning

Yinq-Rong Chern,Yuhao Lee,Hsiao-Ching Lin,Guan-Ting Chen,Ying-Hsien Chen,Fu-Sung Lin,Chih-Yao Chuang,Jenn-Jier James Lien,Chih-Hsien Huang

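Task: Develop a real-time diagnosis and feedback system for the barbell squat based on artificial intelligence and computer vision.

Motivation: Strength training lowers the risk of chronic disease and physical deterioration, but people training alone lack a diagnostic system, so a real-time feedback tool is needed.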

Details

Method: Collect 8,151 squats, categorize them as good squats or one of six issues, train diagnosis models with three machine-learning architectures, and apply the SHAP method to refine prediction. Result: F1 scores for the six issues are 86.86%, 69.01%, 77.42%, 90.74%, 95.83%, and 100%; each diagnosis takes less than 0.5 seconds, and the system substantially improves users' squat technique. Conclusion: The study integrates artificial intelligence and computer vision to build a real-time, user-friendly barbell squat feedback system. Abstract: Purpose: Research has revealed that strength training can reduce the incidence of chronic diseases and physical deterioration at any age. Therefore, having a movement diagnostic system is crucial for training alone. Hence, this study developed an artificial intelligence and computer vision-based barbell squat coaching system with a real-time mode that immediately diagnoses the issue and provides feedback after each squat. In addition, a replay mode allows users to examine their previous squats and check their comments. Initially, four primary characteristics of the barbell squat were identified: body joint angles, dorsiflexion, the ratio of knee-to-hip movement, and barbell stability. Methods: We collected 8,151 squats from 77 participants, categorizing them as good squats and six issues. Then, we trained the diagnosis models with three machine-learning architectures. Furthermore, this research applied the SHapley Additive exPlanations (SHAP) method to enhance the accuracy of issue prediction and reduce the computation time by feature selection. Results: The F1 score of the six issues reached 86.86%, 69.01%, 77.42%, 90.74%, 95.83%, and 100%. Each squat diagnosis took less than 0.5 seconds. Finally, this study examined the efficacy of the proposed system with two groups of participants trained with and without the system. Subsequently, participants trained with the system exhibited substantial improvements in their squat technique, as assessed both by the system itself and by a professional weightlifting coach. Conclusion: This is a comprehensive study that integrates artificial intelligence, computer vision and multivariable processing technologies, aimed at building a real-time, user-friendly barbell squat feedback and training system.

Effectively Controlling Reasoning Models through Thinking Intervention

Tong Wu,Chong Xiang,Jiachen T. Wang,Prateek Mittal

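Task: Propose Thinking Intervention, a new paradigm that strengthens behavioral control of LLMs by intervening in their internal reasoning process.

Motivation: Exploit the fact that reasoning LLMs generate intermediate reasoning steps to gain finer-grained control over model behavior.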

Details

Method: Guide the reasoning process of LLMs by strategically inserting or revising specific thinking tokens. Result: The approach significantly outperforms baseline methods on multiple tasks, including instruction following, instruction hierarchy reasoning, and safety alignment. Conclusion: Thinking Intervention opens a new research direction for controlling reasoning LLMs. Abstract: Reasoning-enhanced large language models (LLMs) explicitly generate intermediate reasoning steps prior to generating final answers, helping the model excel in complex problem-solving. In this paper, we demonstrate that this emerging generation framework offers a unique opportunity for more fine-grained control over model behavior. We propose Thinking Intervention, a novel paradigm designed to explicitly guide the internal reasoning processes of LLMs by strategically inserting or revising specific thinking tokens. We conduct comprehensive evaluations across multiple tasks, including instruction following on IFEval, instruction hierarchy on SEP, and safety alignment on XSTest and SORRY-Bench. Our results demonstrate that Thinking Intervention significantly outperforms baseline prompting approaches, achieving up to 6.7% accuracy gains in instruction-following scenarios, 15.4% improvements in reasoning about instruction hierarchies, and a 40.0% increase in refusal rates for unsafe prompts using open-source DeepSeek R1 models. Overall, our work opens a promising new research avenue for controlling reasoning LLMs.

Every Painting Awakened: A Training-free Framework for Painting-to-Animation Generation

Lingyu Liu,Yaxiong Wang,Li Zhu,Zhedong Zheng

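Task: Propose a training-free method that animates static paintings through image-to-video (I2V) synthesis.

Motivation: Existing I2V methods are trained mainly on natural video datasets; they struggle to generate motion from static paintings and cannot maintain both alignment with text guidance and fidelity to the original artwork.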

Details

Method: Leverage the text-image alignment capability of pre-trained image models and combine real paintings with synthetic proxy images through dual-path score distillation and hybrid latent fusion to generate motion. Result: Experiments show that the method significantly improves semantic alignment with text prompts while preserving the distinctive character and integrity of the original paintings. Conclusion: The framework produces dynamic effects without training and integrates seamlessly with existing I2V methods, making it well suited to animating real-world paintings. Abstract: We introduce a training-free framework specifically designed to bring real-world static paintings to life through image-to-video (I2V) synthesis, addressing the persistent challenge of aligning these motions with textual guidance while preserving fidelity to the original artworks. Existing I2V methods, primarily trained on natural video datasets, often struggle to generate dynamic outputs from static paintings. It remains challenging to generate motion while maintaining visual consistency with real-world paintings. This results in two distinct failure modes: either static outputs due to limited text-based motion interpretation or distorted dynamics caused by inadequate alignment with real-world artistic styles. We leverage the advanced text-image alignment capabilities of pre-trained image models to guide the animation process. Our approach introduces synthetic proxy images through two key innovations: (1) Dual-path score distillation: We employ a dual-path architecture to distill motion priors from both real and synthetic data, preserving static details from the original painting while learning dynamic characteristics from synthetic frames. (2) Hybrid latent fusion: We integrate hybrid features extracted from real paintings and synthetic proxy images via spherical linear interpolation in the latent space, ensuring smooth transitions and enhancing temporal consistency. Experimental evaluations confirm that our approach significantly improves semantic alignment with text prompts while faithfully preserving the unique characteristics and integrity of the original paintings. Crucially, by achieving enhanced dynamic effects without requiring any model training or learnable parameters, our framework enables plug-and-play integration with existing I2V methods, making it an ideal solution for animating real-world paintings. More animated examples can be found on our project website.
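
The spherical linear interpolation mentioned for hybrid latent fusion can be sketched generically as follows. This is the textbook slerp formula on plain Python lists, not the authors' code; how the paper applies it to real and proxy latents (shapes, schedules, choice of `t`) is not specified here, so the inputs below are hypothetical.

```python
import math

def slerp(a, b, t):
    """Spherical linear interpolation between vectors a and b for t in [0, 1].

    Antiparallel inputs (angle of pi) are not handled in this sketch.
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding drift
    omega = math.acos(cos_omega)
    if omega < 1e-8:
        # Nearly parallel vectors: fall back to ordinary linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * omega) / math.sin(omega)
    wb = math.sin(t * omega) / math.sin(omega)
    return [wa * x + wb * y for x, y in zip(a, b)]

# Hypothetical 2-D "latents" standing in for painting and proxy features.
real_latent = [1.0, 0.0]
proxy_latent = [0.0, 1.0]
midpoint = slerp(real_latent, proxy_latent, 0.5)
```

Unlike a straight linear blend, the midpoint of two unit vectors stays on the unit circle, which is the usual reason slerp is preferred for interpolating normalized or Gaussian-like latents.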

Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1

Yi Chen,Yuying Ge,Rui Wang,Yixiao Ge,Lu Qiu,Ying Shan,Xihui Liu

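Task: Systematically evaluate post-training methods for multimodal large language models (MLLMs) on video understanding tasks.

Motivation: MLLMs inherit the reasoning potential of Chain of Thought (COT), but remain underexplored on tasks that require both perception and logical reasoning.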

Details

Method: Introduce the SEED-Bench-R1 benchmark, evaluate post-training methods across a three-level hierarchy (in-distribution, cross-environment, and cross-environment-task scenarios), and compare reinforcement learning (RL) with supervised fine-tuning (SFT). Result: RL outperforms SFT in data efficiency and performance, especially in visual perception, but falls short in logical coherence. Conclusion: RL shows promise for video understanding, but reasoning coherence and robustness to noise need improvement. Abstract: Recent advancements in Chain of Thought (COT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks requiring both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions, requiring sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy: in-distribution, cross-environment, and cross-environment-task scenarios, equipped with a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT), demonstrating RL's data efficiency and superior performance on both in-distribution and out-of-distribution tasks, even outperforming SFT on general video understanding benchmarks like LongVideoBench. Our detailed analysis reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness against noisy signals.

Short-video Propagation Influence Rating: A New Real-world Dataset and A New Large Graph Model

Dizhan Xue,Jing Cui,Shengsheng Qian,Chuanrui Hu,Changsheng Xu

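Task: Propose a new Short-video Propagation Influence Rating (SPIR) task and advance SPIR research from both the dataset and method perspectives.

Motivation: Short-video platforms are popular worldwide, and analyzing how short videos propagate matters for commercial value, public opinion, and user behavior.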

Details

Method: Propose the cross-platform short-video dataset XS-Video and develop NetGPT, a Large Graph Model based on a novel three-stage training mechanism, to predict the long-term propagation influence of short videos. Result: On the XS-Video dataset, NetGPT shows superior results on both classification and regression metrics. Conclusion: The proposed dataset and method provide new tools and directions for research on short-video propagation influence. Abstract: Short-video platforms have gained immense popularity, captivating the interest of millions, if not billions, of users globally. Recently, researchers have highlighted the significance of analyzing the propagation of short-videos, which typically involves discovering commercial values, public opinions, user behaviors, etc. This paper proposes a new Short-video Propagation Influence Rating (SPIR) task and aims to promote SPIR from both the dataset and method perspectives. First, we propose a new Cross-platform Short-Video (XS-Video) dataset, which aims to provide a large-scale and real-world short-video propagation network across various platforms to facilitate the research on short-video propagation. Our XS-Video dataset includes 117,720 videos, 381,926 samples, and 535 topics across the 5 biggest Chinese platforms, annotated with the propagation influence from level 0 to 9. To the best of our knowledge, this is the first large-scale short-video dataset that contains cross-platform data or provides all of the views, likes, shares, collects, fans, comments, and comment content. Second, we propose a Large Graph Model (LGM) named NetGPT, based on a novel three-stage training mechanism, to bridge heterogeneous graph-structured data with the powerful reasoning ability and knowledge of Large Language Models (LLMs). Our NetGPT can comprehend and analyze the short-video propagation graph, enabling it to predict the long-term propagation influence of short-videos. Comprehensive experimental results evaluated by both classification and regression metrics on our XS-Video dataset indicate the superiority of our method for SPIR.

RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy

Zhonghan Zhao,Wenwei Zhang,Haian Huang,Kuikun Liu,Jianfei Gao,Gaoang Wang,Kai Chen

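Task: Integrate reasoning and imagination into a single end-to-end generalist policy (RIG).

Motivation: Existing methods either incorporate only reasoning or only imagination, or rely on multiple specialized models, which limits the learning efficiency and generalization of the policy.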

Details

Method: Build a data pipeline that progressively integrates and enriches reasoning and imagination content, and jointly learn reasoning and next-frame generation. Result: Sample efficiency improves by more than 17×, and generalization is markedly stronger. Conclusion: The synergy of reasoning and imagination improves the robustness and generalization of the policy and also enables test-time scaling to raise overall performance. Abstract: Reasoning before action and imagining potential outcomes (i.e., world models) are essential for embodied agents operating in complex open-world environments. Yet, prior work either incorporates only one of these abilities in an end-to-end agent or integrates multiple specialized models into an agent system, limiting the learning efficiency and generalization of the policy. Thus, this paper makes the first attempt to synergize Reasoning and Imagination in an end-to-end Generalist policy, termed RIG. To train RIG in an end-to-end manner, we construct a data pipeline that progressively integrates and enriches the content of imagination and reasoning in the trajectories collected from existing agents. The joint learning of reasoning and next image generation explicitly models the inherent correlation between reasoning, action, and dynamics of environments, and thus exhibits more than 17× sample efficiency improvements and generalization in comparison with previous works. During inference, RIG first reasons about the next action, produces a potential action, and then predicts the action outcomes, which offers the agent a chance to review and self-correct based on the imagination before taking real actions. Experimental results show that the synergy of reasoning and imagination not only improves the robustness, generalization, and interoperability of generalist policy but also enables test-time scaling to enhance overall performance.

Consistency-aware Self-Training for Iterative-based Stereo Matching

Jingyi Zhou,Peng Ye,Haoyu Zhang,Jiakang Yuan,Rao Qiang,Liu YangChenXu,Wu Cailin,Feng Xu,Tao Chen

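Task: Propose a consistency-aware self-training framework for iterative-based stereo matching that exploits unlabeled real-world data.

Motivation: Iterative-based methods perform well in stereo matching but depend on labeled data and struggle with unlabeled real-world data.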

Details

Method: Propose a consistency-aware soft filtering module and a consistency-aware soft-weighted loss that, in a teacher-student setup, assess the reliability of pseudo-labels and adjust their weights. Result: Experiments show that the method improves a range of iterative-based stereo matching approaches and surpasses current SOTA methods on several benchmark datasets. Conclusion: The method effectively mitigates error accumulation and performance degradation caused by incorrect pseudo-labels and markedly improves stereo matching performance. Abstract: Iterative-based methods have become mainstream in stereo matching due to their high performance. However, these methods heavily rely on labeled data and face challenges with unlabeled real-world data. To this end, we propose a consistency-aware self-training framework for iterative-based stereo matching for the first time, leveraging real-world unlabeled data in a teacher-student manner. We first observe that regions with larger errors tend to exhibit more pronounced oscillation characteristics during model prediction. Based on this, we introduce a novel consistency-aware soft filtering module to evaluate the reliability of teacher-predicted pseudo-labels, which consists of a multi-resolution prediction consistency filter and an iterative prediction consistency filter to assess the prediction fluctuations of multiple resolutions and iterative optimization respectively. Further, we introduce a consistency-aware soft-weighted loss to adjust the weight of pseudo-labels accordingly, relieving the error accumulation and performance degradation problem due to incorrect pseudo-labels. Extensive experiments demonstrate that our method can improve the performance of various iterative-based stereo matching approaches in various scenarios. In particular, our method can achieve further enhancements over the current SOTA methods on several benchmark datasets.

Decoupled Distillation to Erase: A General Unlearning Method for Any Class-centric Tasks

Yu Zhou,Dian Zheng,Qijie Mo,Renjie Lu,Kun-Yu Lin,Wei-Shi Zheng

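Task: Propose DELETE, a general unlearning method for class-centric tasks.

Motivation: Existing methods optimize the forgetting term without supervising the retention term, which disturbs the distribution of the pre-trained model and makes it hard to preserve knowledge of the remaining classes.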

Details

Method: Analyze the general form of the unlearning loss through a theoretical framework, decompose it into forgetting and retention terms, refine the retention term using "dark knowledge", and propose a mask distillation unlearning method. Result: The method achieves state-of-the-art performance on multiple benchmarks without access to the remaining data or intervention. Conclusion: DELETE is a general solution that applies to various downstream tasks, such as face recognition, backdoor defense, and semantic segmentation, with strong performance. Abstract: In this work, we present DEcoupLEd Distillation To Erase (DELETE), a general and strong unlearning method for any class-centric tasks. To derive this, we first propose a theoretical framework to analyze the general form of unlearning loss and decompose it into forgetting and retention terms. Through the theoretical framework, we point out that a class of previous methods could be mainly formulated as a loss that implicitly optimizes the forgetting term while lacking supervision for the retention term, disturbing the distribution of the pre-trained model and struggling to adequately preserve knowledge of the remaining classes. To address it, we refine the retention term using "dark knowledge" and propose a mask distillation unlearning method. By applying a mask to separate forgetting logits from retention logits, our approach optimizes both the forgetting and refined retention components simultaneously, retaining knowledge of the remaining classes while ensuring thorough forgetting of the target class. Without access to the remaining data or intervention (i.e., used in some works), we achieve state-of-the-art performance across various benchmarks. What's more, DELETE is a general solution that can be applied to various downstream tasks, including face recognition, backdoor defense, and semantic segmentation with great performance.

WaveFormer: A 3D Transformer with Wavelet-Driven Feature Representation for Efficient Medical Image Segmentation

Md Mahfuz Al Hasan,Mahdi Zaman,Abdul Jawad,Alberto Santamaria-Pang,Ho Hin Lee,Ivan Tarapov,Kyle See,Md Shah Imran,Antika Roy,Yaser Pourmohammadi Fallah,Navid Asadizanjani,Reza Forghani

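Task: Propose WaveFormer, a new 3D-transformer architecture that addresses the large memory overhead and weak local feature capture of transformers in 3D medical image analysis.

Motivation: Transformer architectures are limited in 3D medical image analysis by large memory overhead and insufficient capture of local features, so a more efficient solution is needed.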

Details

Method: WaveFormer exploits frequency-domain properties and the discrete wavelet transform (DWT) to preserve global context and high-frequency details at multiple scales while reducing the number of parameters. Result: On BraTS2023, FLARE2021, and KiTS2023, performance is on par with state-of-the-art methods at substantially lower computational complexity. Conclusion: WaveFormer is an efficient and generic 3D-transformer architecture suited to resource-constrained real-world deployment. Abstract: Transformer-based architectures have advanced medical image analysis by effectively modeling long-range dependencies, yet they often struggle in 3D settings due to substantial memory overhead and insufficient capture of fine-grained local features. We address these limitations with WaveFormer, a novel 3D-transformer that: i) leverages the fundamental frequency-domain properties of features for contextual representation, and ii) is inspired by the top-down mechanism of the human visual recognition system, making it a biologically motivated architecture. By employing discrete wavelet transformations (DWT) at multiple scales, WaveFormer preserves both global context and high-frequency details while replacing heavy upsampling layers with efficient wavelet-based summarization and reconstruction. This significantly reduces the number of parameters, which is critical for real-world deployment where computational resources and training times are constrained. Furthermore, the model is generic and easily adaptable to diverse applications. Evaluations on BraTS2023, FLARE2021, and KiTS2023 demonstrate performance on par with state-of-the-art methods while offering substantially lower computational complexity.
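
As a rough illustration of wavelet-based summarization and reconstruction, the sketch below runs one level of a 1-D Haar DWT and its inverse in pure Python. The choice of the Haar wavelet and the 1-D toy signal are assumptions for illustration only; the abstract does not state which wavelet WaveFormer uses, and the model operates on 3D feature volumes.

```python
import math

def haar_dwt(signal):
    """One level of the orthonormal Haar DWT on an even-length sequence.

    Returns (approx, detail): low-frequency summary and high-frequency residual,
    each half the input length.
    """
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Invert haar_dwt: rebuild the original sequence from both bands."""
    s = 1 / math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * s)
        out.append((a - d) * s)
    return out

signal = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 2.0, 0.0]
approx, detail = haar_dwt(signal)
restored = haar_idwt(approx, detail)
```

The approximation band halves the resolution (a cheap downsampling step), while the detail band keeps exactly the information needed to undo it, so no learned upsampling layer is required to recover the input in this toy case.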

STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

Yun Li,Yiming Zhang,Tao Lin,XiangRui Liu,Wenxiao Cai,Zheng Liu,Bo Zhao

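Task: Evaluate multimodal large language models (MLLMs) on spatial-temporal understanding tasks.

Motivation: MLLMs have been studied extensively on visual semantic understanding, but their ability to perform precise spatial-temporal understanding in real-world applications has not been adequately verified.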

Details

Method: Propose the STI-Bench benchmark, which evaluates the spatial-temporal intelligence of MLLMs through tasks such as estimating and predicting the appearance, pose, displacement, and motion of objects. Result: Experiments show that current state-of-the-art MLLMs perform poorly on real-world spatial-temporal understanding, especially precise distance estimation and motion analysis. Conclusion: MLLMs still need improvement on spatial-temporal understanding, and STI-Bench provides an evaluation tool for future research. Abstract: The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving has become a prevailing trend. While MLLMs have been extensively studied for visual semantic understanding tasks, their ability to perform precise and quantitative spatial-temporal understanding in real-world applications remains largely unexamined, leading to uncertain prospects. To evaluate models' Spatial-Temporal Intelligence, we introduce STI-Bench, a benchmark designed to evaluate MLLMs' spatial-temporal understanding through challenging tasks such as estimating and predicting the appearance, pose, displacement, and motion of objects. Our benchmark encompasses a wide range of robot and vehicle operations across desktop, indoor, and outdoor scenarios. The extensive experiments reveal that the state-of-the-art MLLMs still struggle in real-world spatial-temporal understanding, especially in tasks requiring precise distance estimation and motion analysis.

XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?

Fengxiang Wang,Hongzhen Wang,Mingshuo Chen,Di Wang,Yulin Wang,Zonghao Guo,Qiang Ma,Long Lan,Wenjing Yang,Jing Zhang,Zhiyuan Liu,Maosong Sun

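Task: Propose XLRS-Bench, a comprehensive benchmark for evaluating the perception and reasoning capabilities of multimodal large language models (MLLMs) in ultra-high-resolution remote sensing scenarios.

Motivation: Existing remote sensing benchmarks use images that are too small, have limited annotation quality, and cover too few evaluation dimensions to meet the needs of ultra-high-resolution remote sensing imagery.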

Details

Method: Build XLRS-Bench with the largest average image size to date (8500×8500) and fine-grained manual annotation, and define 16 sub-tasks that evaluate 10 perceptual and 6 reasoning capabilities. Result: The results indicate that current MLLMs need further improvement for real-world remote sensing applications. Conclusion: The open-sourced XLRS-Bench will support the development of more powerful remote sensing MLLMs. Abstract: The astonishing breakthrough of multimodal large language models (MLLMs) has necessitated new benchmarks to quantitatively assess their capabilities, reveal their limitations, and indicate future research directions. However, this is challenging in the context of remote sensing (RS), since the imagery features ultra-high resolution that incorporates extremely complex semantic relationships. Existing benchmarks usually adopt notably smaller image sizes than real-world RS scenarios, suffer from limited annotation quality, and consider insufficient dimensions of evaluation. To address these issues, we present XLRS-Bench: a comprehensive benchmark for evaluating the perception and reasoning capabilities of MLLMs in ultra-high-resolution RS scenarios. XLRS-Bench boasts the largest average image size (8500×8500) observed thus far, with all evaluation samples meticulously annotated manually, assisted by a novel semi-automatic captioner on ultra-high-resolution RS images. On top of the XLRS-Bench, 16 sub-tasks are defined to evaluate MLLMs' 10 kinds of perceptual capabilities and 6 kinds of reasoning capabilities, with a primary emphasis on advanced cognitive processes that facilitate real-world decision-making and the capture of spatiotemporal changes. The results of both general and RS-focused MLLMs on XLRS-Bench indicate that further efforts are needed for real-world RS applications. We have open-sourced XLRS-Bench to support further research in developing more powerful MLLMs for remote sensing.

Evaluation of (Un-)Supervised Machine Learning Methods for GNSS Interference Classification with Real-World Data Discrepancies

Lucas Heublein,Nisha L. Raichur,Tobias Feigl,Tobias Brieger,Fin Heuer,Lennart Asbach,Alexander Rügamer,Felix Ott

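Task: Evaluate the performance of machine learning methods for monitoring global navigation satellite system (GNSS) interference signals in real-world environments.

Motivation: GNSS positioning is critical for vehicle applications but vulnerable to interference, which must be monitored and mitigated effectively. The feasibility of existing machine learning methods in real-world environments has not yet been verified.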

Details

Method: Evaluate supervised and unsupervised learning methods (such as pseudo-labeling) using large-scale field measurement campaigns (German highways, the Seetal Alps in Austria, and indoor environments). Result: The study reports how machine learning methods perform in real-world settings and examines dataset discrepancies, outlier detection, domain adaptation, and data augmentation. Conclusion: The study fills the gap in evaluating machine learning methods for real-world GNSS interference monitoring and presents techniques for adapting to changes in datasets. Abstract: The accuracy and reliability of vehicle localization on roads are crucial for applications such as self-driving cars, toll systems, and digital tachographs. To achieve accurate positioning, vehicles typically use global navigation satellite system (GNSS) receivers to validate their absolute positions. However, GNSS-based positioning can be compromised by interference signals, necessitating the identification, classification, determination of purpose, and localization of such interference to mitigate or eliminate it. Recent approaches based on machine learning (ML) have shown superior performance in monitoring interference. However, their feasibility in real-world applications and environments has yet to be assessed. Effective implementation of ML techniques requires training datasets that incorporate realistic interference signals, including real-world noise and potential multipath effects that may occur between transmitter, receiver, and satellite in the operational area. Additionally, these datasets require reference labels. Creating such datasets is often challenging due to legal restrictions, as causing interference to GNSS sources is strictly prohibited. Consequently, the performance of ML-based methods in practical applications remains unclear. To address this gap, we describe a series of large-scale measurement campaigns conducted in real-world settings at two highway locations in Germany and the Seetal Alps in Austria, and in large-scale controlled indoor environments. We evaluate the latest supervised ML-based methods to report on their performance in real-world settings and present the applicability of pseudo-labeling for unsupervised learning. We demonstrate the challenges of combining datasets due to data discrepancies and evaluate outlier detection, domain adaptation, and data augmentation techniques to present the models' capabilities to adapt to changes in the datasets.

MGD-SAM2: Multi-view Guided Detail-enhanced Segment Anything Model 2 for High-Resolution Class-agnostic Segmentation

Haoran Shen,Peixian Zhuang,Jiahao Kou,Yuxin Zeng,Haoying Xu,Jiangyun Li

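Task: Propose MGD-SAM2, a model for fine-grained detail segmentation in high-resolution class-agnostic segmentation (HRCS).

Motivation: SAMs are limited when handling high-resolution inputs and low-resolution mask predictions, and they depend on accurate manual prompts, which leads to poor segmentation of fine-grained details.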

Details

Method: Combine SAM2 with multi-view feature interaction and introduce four new modules (MPAdapter, MCEM, HMIM, and DRM) to improve extraction of local details and global semantics. Result: Experiments show that MGD-SAM2 performs strongly and generalizes well on multiple high-resolution and normal-resolution datasets. Conclusion: Through multi-view feature interaction and a detail refinement module, MGD-SAM2 markedly improves the precision of high-resolution image segmentation. Abstract: Segment Anything Models (SAMs), as vision foundation models, have demonstrated remarkable performance across various image analysis tasks. Despite their strong generalization capabilities, SAMs encounter challenges in fine-grained detail segmentation for high-resolution class-independent segmentation (HRCS), due to the limitations in the direct processing of high-resolution inputs and low-resolution mask predictions, and the reliance on accurate manual prompts. To address these limitations, we propose MGD-SAM2 which integrates SAM2 with multi-view feature interaction between a global image and local patches to achieve precise segmentation. MGD-SAM2 incorporates the pre-trained SAM2 with four novel modules: the Multi-view Perception Adapter (MPAdapter), the Multi-view Complementary Enhancement Module (MCEM), the Hierarchical Multi-view Interaction Module (HMIM), and the Detail Refinement Module (DRM). Specifically, we first introduce MPAdapter to adapt the SAM2 encoder for enhanced extraction of local details and global semantics in HRCS images. Then, MCEM and HMIM are proposed to further exploit local texture and global context by aggregating multi-view features within and across multi-scales. Finally, DRM is designed to generate gradually restored high-resolution mask predictions, compensating for the loss of fine-grained details resulting from directly upsampling the low-resolution prediction maps. Experimental results demonstrate the superior performance and strong generalization of our model on multiple high-resolution and normal-resolution datasets. Code will be available at https://github.com/sevenshr/MGD-SAM2.

Pan-LUT: Efficient Pan-sharpening via Learnable Look-Up Tables

Zhongnan Cai,Yingying Wang,Yunlong Lin,Hui Zheng,Ge Meng,Zixu Lin,Jiaxin Xie,Junbin Lu,Yue Huang,Xinghao Ding

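Task: Propose Pan-LUT, a new framework based on learnable look-up tables (LUTs) for pan-sharpening high-resolution remote sensing images that balances performance and computational efficiency.

Motivation: Existing deep learning pan-sharpening methods incur heavy computational overhead at inference, which limits their use on high-resolution images and in real-world scenarios.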

Details

Method: Design the PAN-guided look-up table (PGLUT) for channel-wise spectral mapping, plus the spatial details look-up table (SDLUT) and adaptive aggregation look-up table (AALUT) to capture spatial details and local context. Result: Pan-LUT has fewer than 300K parameters, processes an 8K-resolution image in under 1 ms, runs significantly faster than other methods, and surpasses SOTA methods in real-world scenes. Conclusion: Pan-LUT processes large remote sensing images efficiently in a lightweight manner, bridging the gap to real-world applications. Abstract: Recently, deep learning-based pan-sharpening algorithms have achieved notable advancements over traditional methods. However, many deep learning-based approaches incur substantial computational overhead during inference, especially with high-resolution images. This excessive computational demand limits the applicability of these methods in real-world scenarios, particularly in the absence of dedicated computing devices such as GPUs and TPUs. To address these challenges, we propose Pan-LUT, a novel learnable look-up table (LUT) framework for pan-sharpening that strikes a balance between performance and computational efficiency for high-resolution remote sensing images. To finely control the spectral transformation, we devise the PAN-guided look-up table (PGLUT) for channel-wise spectral mapping. To effectively capture fine-grained spatial details and adaptively learn local contexts, we introduce the spatial details look-up table (SDLUT) and adaptive aggregation look-up table (AALUT). Our proposed method contains fewer than 300K parameters and processes an 8K-resolution image in under 1 ms using a single NVIDIA GeForce RTX 2080 Ti GPU, demonstrating significantly faster performance compared to other methods. Experiments reveal that Pan-LUT efficiently processes large remote sensing images in a lightweight manner, bridging the gap to real-world applications. Furthermore, our model surpasses SOTA methods in full-resolution scenes under real-world conditions, highlighting its effectiveness and efficiency.
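
To show why look-up tables are cheap at inference, here is a generic 1-D LUT with linear interpolation in pure Python. It is only a sketch of the general LUT idea, not the PGLUT, SDLUT, or AALUT designs: the table values, its size, and the function name `apply_lut` are hypothetical, and in a learnable LUT the entries would be trained parameters rather than hand-picked numbers.

```python
def apply_lut(lut, x):
    """Map a value x in [0, 1] through a 1-D LUT using linear interpolation.

    `lut` holds output values at evenly spaced input positions 0, 1/(n-1), ..., 1.
    """
    x = min(max(x, 0.0), 1.0)       # clamp to the table's input range
    pos = x * (len(lut) - 1)        # fractional index into the table
    lo = int(pos)
    hi = min(lo + 1, len(lut) - 1)  # stay in bounds at the right edge
    frac = pos - lo
    return (1 - frac) * lut[lo] + frac * lut[hi]

# Hypothetical 5-entry tone curve; a learnable LUT would optimize these entries.
lut = [0.0, 0.1, 0.4, 0.9, 1.0]
on_grid = apply_lut(lut, 0.5)      # falls exactly on the third entry
between = apply_lut(lut, 0.625)    # halfway between the third and fourth entries
```

Each pixel costs one index computation and one blend regardless of how complex the learned mapping is, which is the property that makes LUT-based methods attractive when no GPU or TPU is available.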

On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices

Bosung Kim,Kyuhwan Lee,Isu Jeong,Jungmin Cheon,Yeojin Lee,Seulki Lee

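Task: Propose a model-training-free solution for diffusion-based text-to-video generation on smartphone-grade devices.

Motivation: Address the challenges of running diffusion-based text-to-video generation on mobile devices with limited computation and memory.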

Details

Method: Apply three new techniques: Linear Proportional Leap (LPL), Temporal Dimension Token Merging (TDTM), and Concurrent Inference with Dynamic Loading (CI-DL). Result: High-quality video generation is achieved on the iPhone 15 Pro, with quality comparable to high-end GPUs. Conclusion: On-device Sora offers a feasible way to generate high-quality video efficiently on resource-constrained devices and helps democratize generative technology. Abstract: We present On-device Sora, the first model training-free solution for diffusion-based on-device text-to-video generation that operates efficiently on smartphone-grade devices. To address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices, the proposed On-device Sora applies three novel techniques to pre-trained video generative models. First, Linear Proportional Leap (LPL) reduces the excessive denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) minimizes intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Third, Concurrent Inference with Dynamic Loading (CI-DL) dynamically partitions large models into smaller blocks and loads them into memory for concurrent model inference, effectively addressing the challenges of limited device memory. We implement On-device Sora on the iPhone 15 Pro, and the experimental evaluations show that it is capable of generating high-quality videos on the device, comparable to those produced by high-end GPUs. These results show that On-device Sora enables efficient and high-quality video generation on resource-constrained mobile devices. We envision the proposed On-device Sora as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation on commodity mobile and embedded devices without resource-intensive re-training for model optimization (compression). The code implementation is available at a GitHub repository (https://github.com/eai-lab/On-device-Sora).
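
The idea of merging consecutive tokens along the temporal dimension can be sketched as below. This is a toy under stated assumptions, not the TDTM implementation: merging by simple averaging, pairing frames two at a time, and the helper name `merge_temporal` are all hypothetical choices, and the abstract does not say how merged tokens are later expanded.

```python
def merge_temporal(tokens):
    """Average each pair of consecutive frames.

    `tokens[t]` is the feature list for frame t; the result has roughly half
    as many frames, so attention over the temporal axis sees fewer tokens.
    """
    merged = []
    for t in range(0, len(tokens) - 1, 2):
        merged.append([(a + b) / 2 for a, b in zip(tokens[t], tokens[t + 1])])
    if len(tokens) % 2 == 1:
        merged.append(list(tokens[-1]))  # odd frame count: keep the last frame
    return merged

# Five hypothetical frames with two features each.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
merged = merge_temporal(frames)
```

Because self-attention cost grows roughly quadratically with the number of tokens, halving the temporal length in this way cuts the attention work by about a factor of four for those layers.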

Bridge the Gap Between Visual and Linguistic Comprehension for Generalized Zero-shot Semantic Segmentation

Xiaoqing Guo,Wuyang Li,Yixuan Yuan

Task: Generalized zero-shot semantic segmentation (GZS3) aims to segment classes unseen in the training data by using semantic representations such as word vectors.

Motivation: Existing methods rely on a single semantic representation to associate classes and transfer knowledge, which is at odds with human cognition; humans understand objects through their parts and states.

Details

Method: Proposes a Decoupled Vision-Language Matching (DeVLMatch) framework with spatial-part matching (SPMatch) and channel-state matching (CSMatch) modules, which performs cross-modal matching by decoupling each object into part and state information. Result: DeVLMatch outperforms previous methods on standard benchmarks including PASCAL VOC, COCO-Stuff, and CATARACTS. Conclusion: By decoupling and matching fine-grained object part and state information, DeVLMatch effectively facilitates knowledge transfer from seen to unseen classes. Abstract: Generalized zero-shot semantic segmentation (GZS3) aims to achieve the human-level capability of segmenting not only seen classes but also novel class regions unseen in the training data by introducing a bridge of semantic representations, e.g., word vectors. While effective, utilizing a single semantic representation to associate the corresponding class and to enable knowledge transfer from seen to unseen classes is insufficient as well as incompatible with human cognition. Inspired by the observation that humans often use 'part' and 'state' information to comprehend seen objects and imagine unseen classes, we decouple each class into detailed descriptions, including object parts and states. Based on this decoupling formulation, we propose a Decoupled Vision-Language Matching (DeVLMatch) framework, composed of spatial-part (SPMatch) and channel-state (CSMatch) matching modules, for GZS3. In SPMatch, we comprehend objects with spatial part information from both visual and linguistic perspectives and perform graph matching to bridge the gap. In CSMatch, states of objects from the linguistic perspective are matched to compatible channel information from the visual perspective. By decoupling and matching objects across visual and linguistic comprehension, we can explicitly introspect the relationship between seen and unseen classes at fine-grained object part and state levels, thereby facilitating knowledge transfer from seen to unseen classes in visual space. The proposed DeVLMatch framework surpasses previous GZS3 methods on standard benchmarks, including PASCAL VOC, COCO-Stuff, and CATARACTS, demonstrating its effectiveness.

FlexiMo: A Flexible Remote Sensing Foundation Model

Xuyang Li,Chenyu Li,Pedram Ghamisi,Danfeng Hong

Task: Propose FlexiMo, a flexible remote sensing foundation model that adapts to arbitrary spatial resolutions.

Motivation: Existing models are constrained by fixed spatial resolutions and patch sizes and cannot fully exploit the heterogeneous spatial characteristics of satellite imagery.

Details

Method: FlexiMo uses a spatial resolution-aware module with a parameter-free alignment embedding mechanism to dynamically recalibrate patch embeddings, and adds a lightweight channel adaptation module that leverages prior spectral information from sensors. Result: FlexiMo significantly improves generalization and robustness on multimodal, multi-resolution, and multi-scale datasets and performs strongly on downstream tasks. Conclusion: Through parameter-efficient and physically consistent adaptation, FlexiMo provides a more flexible and effective foundation model for real-world remote sensing applications. Abstract: The rapid expansion of multi-source satellite imagery drives innovation in Earth observation, opening unprecedented opportunities for Remote Sensing Foundation Models to harness diverse data. However, many existing models remain constrained by fixed spatial resolutions and patch sizes, limiting their ability to fully exploit the heterogeneous spatial characteristics inherent in satellite imagery. To address these challenges, we propose FlexiMo, a flexible remote sensing foundation model that endows the pre-trained model with the flexibility to adapt to arbitrary spatial resolutions. Central to FlexiMo is a spatial resolution-aware module that employs a parameter-free alignment embedding mechanism to dynamically recalibrate patch embeddings based on the input image's resolution and dimensions. This design not only preserves critical token characteristics and ensures multi-scale feature fidelity but also enables efficient feature extraction without requiring modifications to the underlying network architecture. In addition, FlexiMo incorporates a lightweight channel adaptation module that leverages prior spectral information from sensors. This mechanism allows the model to process images with varying numbers of channels while maintaining the data's intrinsic physical properties. Extensive experiments on diverse multimodal, multi-resolution, and multi-scale datasets demonstrate that FlexiMo significantly enhances model generalization and robustness. In particular, our method achieves outstanding performance across a range of downstream tasks, including scene classification, land cover classification, urban building segmentation, and cloud detection. By enabling parameter-efficient and physically consistent adaptation, FlexiMo paves the way for more adaptable and effective foundation models in real-world remote sensing applications.

Learned Image Compression and Restoration for Digital Pathology

SeonYeong Lee,EonSeung Seong,DongEon Lee,SiYeoul Lee,Yubin Cho,Chunsu Park,Seonho Kim,MinKyoung Seo,YoungSin Ko,MinWoo Kim

Task: Propose CLERIC, a deep learning image compression framework designed for whole slide images (WSIs), to address the storage, transmission, and real-time visualization challenges caused by their high resolution and file size.

Motivation: Digital pathology images are crucial to medical diagnostics, but their ultra-high resolution and very large file sizes pose significant challenges for storage, transmission, and real-time visualization.

Details

Method: CLERIC combines a learnable lifting scheme with advanced convolutional techniques. A lifting-scheme transform in the analysis stage decomposes images into low- and high-frequency components, parallel encoders containing Deformable Residual Blocks and Recurrent Residual Blocks process these components, and an inverse lifting transform reconstructs the image. Result: Experiments show that CLERIC achieves better rate-distortion (RD) performance than existing learned image compression models, significantly reducing storage requirements while maintaining high diagnostic image quality. Conclusion: Deep learning-based compression shows promise for digital pathology, supporting efficient data management and long-term storage while integrating with clinical workflows and AI-assisted diagnostic systems. Abstract: Digital pathology images play a crucial role in medical diagnostics, but their ultra-high resolution and large file sizes pose significant challenges for storage, transmission, and real-time visualization. To address these issues, we propose CLERIC, a novel deep learning-based image compression framework designed specifically for whole slide images (WSIs). CLERIC integrates a learnable lifting scheme and advanced convolutional techniques to enhance compression efficiency while preserving critical pathological details. Our framework employs a lifting-scheme transform in the analysis stage to decompose images into low- and high-frequency components, enabling more structured latent representations. These components are processed through parallel encoders incorporating Deformable Residual Blocks (DRB) and Recurrent Residual Blocks (R2B) to improve feature extraction and spatial adaptability. The synthesis stage applies an inverse lifting transform for effective image reconstruction, ensuring high-fidelity restoration of fine-grained tissue structures. We evaluate CLERIC on a digital pathology image dataset and compare its performance against state-of-the-art learned image compression (LIC) models. Experimental results demonstrate that CLERIC achieves superior rate-distortion (RD) performance, significantly reducing storage requirements while maintaining high diagnostic image quality. Our study highlights the potential of deep learning-based compression in digital pathology, facilitating efficient data management and long-term storage while ensuring seamless integration into clinical workflows and AI-assisted diagnostic systems. Code and models are available at: https://github.com/pnu-amilab/CLERIC.

ExScene: Free-View 3D Scene Reconstruction with Gaussian Splatting from a Single Image

Tianyi Gong,Boyan Li,Yifei Zhong,Fangxin Wang

Task: Reconstruct immersive 3D scenes from a single-view image.

Motivation: Because of the limitations of single-view input, existing methods struggle to reconstruct immersive 3D scenes with high consistency and a wide field of view.

Details

Method: Proposes ExScene, a two-stage pipeline that first generates a high-fidelity panoramic image, then trains a 3D Gaussian Splatting model using estimated geometric information, and finally refines the model with video diffusion priors. Result: Experiments show that ExScene achieves high-quality immersive scene reconstruction from single-view input alone, significantly surpassing existing methods. Conclusion: Through multimodal diffusion and refinement techniques, ExScene effectively addresses the challenge of reconstructing immersive 3D scenes from single-view input. Abstract: The increasing demand for augmented and virtual reality applications has highlighted the importance of crafting immersive 3D scenes from a simple single-view image. However, due to the partial priors provided by single-view input, existing methods are often limited to reconstructing low-consistency 3D scenes with narrow fields of view. These limitations make them less capable of generalizing to reconstruct immersive scenes. To address this problem, we propose ExScene, a two-stage pipeline to reconstruct an immersive 3D scene from any given single-view image. ExScene designs a novel multimodal diffusion model to generate a high-fidelity and globally consistent panoramic image. We then develop a panoramic depth estimation approach to calculate geometric information from the panorama, and we combine the geometric information with the high-fidelity panoramic image to train an initial 3D Gaussian Splatting (3DGS) model. Following this, we introduce a GS refinement technique with 2D stable video diffusion priors. We add camera trajectory consistency and color-geometric priors into the denoising process of diffusion to improve color and spatial consistency across image sequences. These refined sequences are then used to fine-tune the initial 3DGS model, leading to better reconstruction quality. Experimental results demonstrate that our ExScene achieves consistent and immersive scene reconstruction using only single-view input, significantly surpassing state-of-the-art baselines.

GLane3D : Detecting Lanes with Graph of 3D Keypoints

Halil İbrahim Öztürk,Muhammet Esat Kalfaoğlu,Ozsel Kilinc

Task: Propose a 3D lane detection method based on keypoint detection and connection to improve generalization.

Motivation: Traditional top-down methods generalize poorly to the diverse lane structures found around the world.

Details

Method: Detects lane keypoints and predicts sequential connections between them to construct complete 3D lanes, using an offset mechanism and PointNMS to reduce redundancy. Result: Outperforms existing methods on the Apollo and OpenLane datasets, with higher F1 scores and stronger generalization. Conclusion: The proposed method offers a clear generalization advantage in 3D lane detection and suits scenes with diverse lane structures. Abstract: Accurate and efficient lane detection in 3D space is essential for autonomous driving systems, where robust generalization is the foremost requirement for 3D lane detection algorithms. Considering the extensive variation in lane structures worldwide, achieving high generalization capacity is particularly challenging, as algorithms must accurately identify a wide variety of lane patterns. Traditional top-down approaches rely heavily on learning lane characteristics from training datasets, often struggling with lanes exhibiting previously unseen attributes. To address this generalization limitation, we propose a method that detects keypoints of lanes and subsequently predicts sequential connections between them to construct complete 3D lanes. Each keypoint is essential for maintaining lane continuity, and we predict multiple proposals per keypoint by allowing adjacent grids to predict the same keypoint using an offset mechanism. PointNMS is employed to eliminate overlapping proposal keypoints, reducing redundancy in the estimated BEV graph and minimizing computational overhead from connection estimations. Our model surpasses previous state-of-the-art methods on both the Apollo and OpenLane datasets, demonstrating superior F1 scores and a strong generalization capacity when models trained on OpenLane are evaluated on the Apollo dataset, compared to prior approaches.
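
The role of PointNMS can be sketched as a greedy suppression over keypoint proposals. The abstract does not give the suppression criterion, so the Euclidean distance threshold, the `(x, y, score)` layout, and the name `point_nms` below are assumptions chosen for illustration.

```python
import math
from typing import List, Tuple

Proposal = Tuple[float, float, float]  # (x, y, score) in BEV coordinates


def point_nms(proposals: List[Proposal], radius: float) -> List[Proposal]:
    """Greedy suppression: keep the best-scoring point, drop neighbours within `radius`.

    Assumption: a fixed Euclidean radius decides which proposals overlap.
    """
    kept: List[Proposal] = []
    # Visit proposals from highest to lowest score.
    for x, y, score in sorted(proposals, key=lambda p: p[2], reverse=True):
        # Keep the point only if it is farther than `radius` from every kept point.
        if all(math.hypot(x - kx, y - ky) > radius for kx, ky, _ in kept):
            kept.append((x, y, score))
    return kept


proposals = [
    (10.0, 5.0, 0.90),
    (10.2, 5.1, 0.80),  # near-duplicate of the first proposal
    (20.0, 5.0, 0.70),
]
survivors = point_nms(proposals, radius=0.5)
print(survivors)
```

Removing near-duplicate proposals before connection estimation shrinks the set of keypoint pairs that must be scored, which is the overhead reduction the abstract refers to.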

MuseFace: Text-driven Face Editing via Diffusion-based Mask Generation Approach

Xin Zhang,Siting Huang,Xiangyang Luo,Yifan Xie,Weijiang Yu,Heng Chang,Fei Ma,Fei Yu

Task: Propose MuseFace, a face editing framework driven by text prompts that achieves diversity, controllability, and flexibility.

Motivation: Existing text-driven face editing methods cannot deliver diversity, controllability, and flexibility at the same time.

Details

Method: Combines a Text-to-Mask diffusion model with a semantic-aware face editing model to generate fine-grained semantic masks directly from text and perform face editing. Result: MuseFace generates fine-grained semantic masks that significantly improve the precision, controllability, and flexibility of editing, and experiments demonstrate high-fidelity performance. Conclusion: By combining a diffusion model with a semantic-aware model, MuseFace overcomes the limitations of existing methods and achieves efficient, controllable face editing. Abstract: Face editing modifies the appearance of a face and plays a key role in the customization and enhancement of personal images. Although many works have achieved remarkable success in text-driven face editing, they still face significant challenges, as none of them simultaneously fulfills the characteristics of diversity, controllability, and flexibility. To address this challenge, we propose MuseFace, a text-driven face editing framework, which relies solely on a text prompt to enable face editing. Specifically, MuseFace integrates a Text-to-Mask diffusion model and a semantic-aware face editing model, capable of directly generating fine-grained semantic masks from text and performing face editing. The Text-to-Mask diffusion model provides diversity and flexibility to the framework, while the semantic-aware face editing model ensures controllability of the framework. Our framework can create fine-grained semantic masks, making precise face editing possible and significantly enhancing the controllability and flexibility of face editing models. Extensive experiments demonstrate that MuseFace achieves superior high-fidelity performance.

Training-Free Text-Guided Image Editing with Visual Autoregressive Model

Yufei Wang,Lanqing Guo,Zhihao Li,Jiaxing Huang,Pichao Wang,Bihan Wen,Jian Wang

Task: Propose a VAR-based text-guided image editing framework that achieves precisely controlled modifications without explicit inversion.

Motivation: Existing methods suffer from error propagation caused by inaccurate inversion and from global changes caused by entanglement between text prompts and image features.

Details

Method: Introduces a caching mechanism that stores token indices and probability distributions from the original image, and designs an adaptive fine-grained masking strategy and a token reassembling approach. Result: Achieves high-fidelity editing in a training-free manner, processes a 1K resolution image in as little as 1.2 seconds, and performs comparably to or better than existing diffusion- and rectified flow-based methods. Conclusion: The proposed framework performs well in both quantitative metrics and visual quality; the code will be released. Abstract: Text-guided image editing is an essential task that enables users to modify images through natural language descriptions. Recent advances in diffusion models and rectified flows have significantly improved editing quality, primarily relying on inversion techniques to extract structured noise from input images. However, inaccuracies in inversion can propagate errors, leading to unintended modifications and compromising fidelity. Moreover, even with perfect inversion, the entanglement between textual prompts and image features often results in global changes when only local edits are intended. To address these challenges, we propose a novel text-guided image editing framework based on VAR (Visual AutoRegressive modeling), which eliminates the need for explicit inversion while ensuring precise and controlled modifications. Our method introduces a caching mechanism that stores token indices and probability distributions from the original image, capturing the relationship between the source prompt and the image. Using this cache, we design an adaptive fine-grained masking strategy that dynamically identifies and constrains modifications to relevant regions, preventing unintended changes. A token reassembling approach further refines the editing process, enhancing diversity, fidelity, and control. Our framework operates in a training-free manner and achieves high-fidelity editing with faster inference speeds, processing a 1K resolution image in as fast as 1.2 seconds. Extensive experiments demonstrate that our method achieves performance comparable to, or even surpassing, existing diffusion- and rectified flow-based approaches in both quantitative metrics and visual quality. The code will be released.

Boosting MLLM Reasoning with Text-Debiased Hint-GRPO

Qihan Huang,Long Chan,Jinlong Liu,Wanggui He,Hao Jiang,Mingli Song,Jingyuan Chen,Chang Yao,Jie Song

Task: Improve the performance of multimodal large language models (MLLMs) on complex reasoning tasks.

Motivation: Current GRPO algorithms for MLLMs suffer from low data utilization and text-bias, which limits their performance on complex tasks.

Details

Method: Proposes Hint-GRPO, which improves GRPO through adaptive hints, together with text-bias calibration. Result: Experiments on 11 datasets show that the method substantially improves the reasoning capability of MLLMs. Conclusion: Hint-GRPO effectively addresses the limitations of GRPO and outperforms existing methods. Abstract: MLLM reasoning has drawn widespread research attention for its excellent problem-solving capability. Current reasoning methods fall into two types: PRM, which supervises the intermediate reasoning steps, and ORM, which supervises the final results. Recently, DeepSeek-R1 has challenged the traditional view that PRM outperforms ORM by demonstrating strong generalization performance using an ORM method (i.e., GRPO). However, current GRPO algorithms for MLLMs still struggle to handle challenging and complex multimodal reasoning tasks (e.g., mathematical reasoning). In this work, we reveal two problems that impede the performance of GRPO on MLLMs: low data utilization and text-bias. Low data utilization means that GRPO cannot obtain positive rewards to update the MLLM on difficult samples, and text-bias is a phenomenon in which the MLLM bypasses the image condition and relies solely on the text condition for generation after GRPO training. To tackle these problems, this work proposes Hint-GRPO, which improves data utilization by adaptively providing hints for samples of varying difficulty, and text-bias calibration, which mitigates text-bias by calibrating the token prediction logits with the image condition at test time. Experimental results on three base MLLMs across eleven datasets demonstrate that our proposed methods advance the reasoning capability of the original MLLM by a large margin, exhibiting superior performance to existing MLLM reasoning methods. Our code is available at https://github.com/hqhQAQ/Hint-GRPO.
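
The low-data-utilization problem follows from how GRPO scores responses. In the usual GRPO formulation, each sampled response's advantage is its reward standardised against the other responses drawn for the same prompt. The sketch below shows why a hard sample, where every response earns zero reward, contributes no learning signal. The function name, the epsilon value, and the use of the population standard deviation are illustrative assumptions rather than details from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalise each sampled response's reward against its own group.

    Assumption: population std plus a small eps guards against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: correct responses get positive advantages, wrong ones negative.
easy_sample = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Every response fails: all advantages are zero, so the update carries no signal.
hard_sample = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(easy_sample)
print(hard_sample)
```

Supplying a hint on such samples raises the chance that at least one response in the group is rewarded, which is how the abstract's adaptive hints restore a usable gradient.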

HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment

Zhichao Liao,Xiaokun Liu,Wenyu Qin,Qingyu Li,Qiulin Wang,Pengfei Wan,Di Zhang,Long Zeng,Pingfa Feng

Task: Propose a comprehensive implementation framework for Human Image Aesthetic Assessment (HIAA), covering dataset construction and model design.

Motivation: HIAA is widely used in social media and AI workflows but has received little research attention, leaving a gap to fill.

Details

Method: Builds the HumanBeauty dataset (108k images) and proposes the HumanAesExpert model, which combines an Expert head, a Language Modeling head, and a Regression head, and introduces a MetaVoter to aggregate their scores. Result: HumanAesExpert significantly outperforms existing state-of-the-art models on HIAA. Conclusion: The work fills a gap in HIAA research, and the publicly released datasets and models support the community. Abstract: Image Aesthetic Assessment (IAA) is a long-standing and challenging research task. However, its subset, Human Image Aesthetic Assessment (HIAA), has been scarcely explored, even though HIAA is widely used in social media, AI workflows, and related domains. To bridge this research gap, our work pioneers a holistic implementation framework tailored for HIAA. Specifically, we introduce HumanBeauty, the first dataset purpose-built for HIAA, which comprises 108k high-quality human images with manual annotations. To achieve comprehensive and fine-grained HIAA, 50k human images are manually collected through a rigorous curation process and annotated using our 12-dimensional aesthetic standard, while the remaining 58k with overall aesthetic labels are systematically filtered from public datasets. Based on the HumanBeauty database, we propose HumanAesExpert, a powerful Vision Language Model for aesthetic evaluation of human images. We design an Expert head to incorporate human knowledge of aesthetic sub-dimensions while jointly utilizing the Language Modeling (LM) and Regression heads. This approach empowers our model to achieve superior proficiency in both overall and fine-grained HIAA. Furthermore, we introduce a MetaVoter, which aggregates scores from all three heads, to effectively balance the capabilities of each head, thereby realizing improved assessment precision. Extensive experiments demonstrate that our HumanAesExpert models deliver significantly better performance in HIAA than other state-of-the-art models. Our datasets, models, and codes are publicly released to advance the HIAA community. Project webpage: https://humanaesexpert.github.io/HumanAesExpert/

FineCausal: A Causal-Based Framework for Interpretable Fine-Grained Action Quality Assessment

Ruisheng Han,Kanglei Zhou,Amir Atapour-Abarghouei,Xiaohui Liang,Hubert P. H. Shum

Task: Propose FineCausal, a causal-based framework for action quality assessment (AQA), to address the black-box nature of existing deep learning methods and their vulnerability to spurious correlations.

Motivation: Existing deep learning methods for action quality assessment have low reliability and poor interpretability, which limits their use in competitive sports.

Details

Method: FineCausal uses a Graph Attention Network-based causal intervention module and a temporal causal attention module to separate foreground cues from background confounders and to capture fine-grained temporal dependencies across action stages. Result: Achieves state-of-the-art performance on the FineDiving-HM dataset while providing transparent feature-level explanations. Conclusion: FineCausal performs strongly in both accuracy and interpretability but depends on expert knowledge and high-quality annotations, which remain topics for future research. Abstract: Action quality assessment (AQA) is critical for evaluating athletic performance, informing training strategies, and ensuring safety in competitive sports. However, existing deep learning approaches often operate as black boxes and are vulnerable to spurious correlations, limiting both their reliability and interpretability. In this paper, we introduce FineCausal, a novel causal-based framework that achieves state-of-the-art performance on the FineDiving-HM dataset. Our approach leverages a Graph Attention Network-based causal intervention module to disentangle human-centric foreground cues from background confounders, and incorporates a temporal causal attention module to capture fine-grained temporal dependencies across action stages. This dual-module strategy enables FineCausal to generate detailed spatio-temporal representations that not only achieve state-of-the-art scoring performance but also provide transparent, interpretable feedback on which features drive the assessment. Despite its strong performance, FineCausal requires extensive expert knowledge to define causal structures and depends on high-quality annotations, challenges that we discuss and address as future research directions. Code is available at https://github.com/Harrison21/FineCausal.

CoMatch: Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching

Zizhuo Li,Yifan Lu,Linfeng Tang,Shihua Zhang,Jiayi Ma

Task: Propose CoMatch, a semi-dense image matching method with dynamic covisibility awareness and bilateral subpixel accuracy.

Motivation: Existing methods perform highly redundant computation when modeling context interaction over the coarse feature map, interference from non-covisible areas degrades feature distinctiveness, and at the fine stage only the target view's keypoints are adjusted to subpixel level, leaving the source view's keypoints insufficiently informative.

Details

Method: Introduces a covisibility-guided token condenser that dynamically aggregates tokens, deploys a covisibility-assisted attention mechanism that selectively suppresses irrelevant information, and develops a fine correlation module that refines matching candidates to subpixel level in both source and target views. Result: CoMatch shows strong accuracy, efficiency, and generalizability across multiple public benchmarks. Conclusion: Through dynamic covisibility awareness and bilateral subpixel accuracy, CoMatch significantly improves the performance and efficiency of image matching. Abstract: This prospective study proposes CoMatch, a novel semi-dense image matcher with dynamic covisibility awareness and bilateral subpixel accuracy. Firstly, observing that modeling context interaction over the entire coarse feature map elicits highly redundant computation due to the neighboring representation similarity of tokens, a covisibility-guided token condenser is introduced to adaptively aggregate tokens in light of their dynamically estimated covisibility scores, thereby ensuring computational efficiency while simultaneously improving the representational capacity of aggregated tokens. Secondly, considering that feature interaction with massive non-covisible areas is distracting and may degrade feature distinctiveness, a covisibility-assisted attention mechanism is deployed to selectively suppress irrelevant messages broadcast from non-covisible reduced tokens, resulting in robust and compact attention to relevant tokens rather than all of them. Thirdly, we find that at the fine-level stage, current methods adjust only the target view's keypoints to subpixel level, while those in the source view remain restricted to the coarse level and are thus not informative enough, which is detrimental to keypoint location-sensitive usages. A simple yet potent fine correlation module is developed to refine the matching candidates in both source and target views to subpixel level, attaining attractive performance improvement. Thorough experimentation across an array of public benchmarks affirms CoMatch's promising accuracy, efficiency, and generalizability.

Exploring Reliable PPG Authentication on Smartwatches in Daily Scenarios

Jiankai Tang,Jiacheng Liu,Renling Tong,Kai Zhu,Zhe Li,Xin Yi,Junliang Xing,Yuanchun Shi,Yuntao Wang

Task: Propose MTL-RAPID, an efficient and reliable PPG authentication model that addresses the reliability problems caused by motion artifacts and physiological variability.

Motivation: PPG sensors are widely deployed in smartwatches, but motion artifacts and physiological variability make authentication insufficiently reliable.

Details

Method: Uses a multitask joint training strategy that simultaneously assesses signal quality and verifies user identity. Result: In comprehensive user studies on motion artifacts, time variations, and user preferences, MTL-RAPID achieves a best AUC of 99.2% and an EER of 3.5%, outperforming existing baselines. Conclusion: By jointly optimizing the two tasks, MTL-RAPID achieves stronger performance with fewer parameters; the dataset and model are open-sourced to support future research. Abstract: Photoplethysmography (PPG) sensors, widely deployed in smartwatches, offer a simple and non-invasive authentication approach for daily use. However, PPG authentication faces reliability issues due to motion artifacts from physical activity and physiological variability over time. To address these challenges, we propose MTL-RAPID, an efficient and reliable PPG authentication model that employs a multitask joint training strategy, simultaneously assessing signal quality and verifying user identity. The joint optimization of these two tasks in MTL-RAPID results in a structure that outperforms models trained on individual tasks separately, achieving stronger performance with fewer parameters. In our comprehensive user studies regarding motion artifacts (N = 30), time variations (N = 32), and user preferences (N = 16), MTL-RAPID achieves a best AUC of 99.2% and an EER of 3.5%, outperforming existing baselines. We open-source our PPG authentication dataset along with the MTL-RAPID model on GitHub to facilitate future research.

Spectral-Adaptive Modulation Networks for Visual Perception

Guhnoo Yun,Juhan Yoo,Kijung Kim,Jeongho Lee,Paul Hongsuck Seo,Dong Hwan Kim

Task: Use graph spectral analysis to theoretically simulate and compare the frequency responses of 2D convolution and self-attention within a unified framework.

Motivation: Existing theoretical analyses do not adequately explain why 2D convolution is more effective than self-attention at high-pass filtering, or why larger kernels favor shape bias.

Details

Method: Proposes a unified framework based on graph spectral analysis, introduces a spectral-adaptive modulation (SPAM) mixer, and develops SPANetV2 as a vision backbone. Result: SPANetV2 outperforms state-of-the-art models on multiple vision tasks, including ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation. Conclusion: Node connectivity is a key factor shaping spectral functions, and the SPAM mixer improves spectral processing of visual features through multi-scale convolutional kernels and a spectral re-scaling mechanism. Abstract: Recent studies have shown that 2D convolution and self-attention exhibit distinct spectral behaviors, and optimizing their spectral properties can enhance vision model performance. However, theoretical analyses remain limited in explaining why 2D convolution is more effective in high-pass filtering than self-attention and why larger kernels favor shape bias, akin to self-attention. In this paper, we employ graph spectral analysis to theoretically simulate and compare the frequency responses of 2D convolution and self-attention within a unified framework. Our results corroborate previous empirical findings and reveal that node connectivity, modulated by window size, is a key factor in shaping spectral functions. Leveraging this insight, we introduce a spectral-adaptive modulation (SPAM) mixer, which processes visual features in a spectral-adaptive manner using multi-scale convolutional kernels and a spectral re-scaling mechanism to refine spectral components. Based on SPAM, we develop SPANetV2 as a novel vision backbone. Extensive experiments demonstrate that SPANetV2 outperforms state-of-the-art models across multiple vision tasks, including ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation.

JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation

Fangda Chen,Shanshan Zhao,Chuanfu Xu,Long Lan

Task: Propose JointTuner, an adaptive joint training framework that addresses concept interference and appearance contamination in text-to-video synthesis.

Motivation: Existing text-to-video methods suffer from concept interference caused by feature domain mismatch or from appearance contamination caused by spatial feature leakage.

Details

Method: Uses Adaptive LoRA (with a context-aware gating mechanism) and an Appearance-independent Temporal Loss to jointly optimize appearance and motion and eliminate interference. Result: Experiments on a benchmark of 90 appearance-motion combinations and 10 multi-type automatic metrics show that the method outperforms current advanced approaches. Conclusion: Through adaptive joint training and decoupled motion learning, JointTuner significantly improves customized text-to-video synthesis. Abstract: Recent text-to-video advancements have enabled coherent video synthesis from prompts and expanded to fine-grained control over appearance and motion. However, existing methods either suffer from concept interference due to feature domain mismatch caused by naive decoupled optimizations or exhibit appearance contamination induced by spatial feature leakage resulting from the entanglement of motion and appearance in reference video reconstructions. In this paper, we propose JointTuner, a novel adaptive joint training framework, to alleviate these issues. Specifically, we develop Adaptive LoRA, which incorporates a context-aware gating mechanism, and integrate the gated LoRA components into the spatial and temporal Transformers within the diffusion model. These components enable simultaneous optimization of appearance and motion, eliminating concept interference. In addition, we introduce the Appearance-independent Temporal Loss, which decouples motion patterns from intrinsic appearance in reference video reconstructions through an appearance-agnostic noise prediction task. The key innovation lies in adding frame-wise offset noise to the ground-truth Gaussian noise, perturbing its distribution, thereby disrupting spatial attributes associated with frames while preserving temporal coherence. Furthermore, we construct a benchmark comprising 90 appearance-motion customized combinations and 10 multi-type automatic metrics across four dimensions, facilitating a more comprehensive evaluation for this customization task. Extensive experiments demonstrate the superior performance of our method compared to current advanced approaches.

Kai Huang,Hao Zou,Bochen Wang,Ye Xi,Zhen Xie,Hao Wang

Task: Propose AirCache, a new KV cache compression method that accelerates inference for large visual language models (LVLMs).

Motivation: Processing many visual tokens and generating long-context outputs imposes heavy computational overhead, in particular excessive demand on the KV cache, so this bottleneck needs to be addressed.

Details

Method: Systematically studies the correlations between visual and textual tokens in the attention mechanisms of LVLMs, proposes an elite observation window to assess the importance of visual components, and develops an adaptive layer-wise budget allocation strategy. Result: Across multiple LVLMs and benchmarks, the method matches full-cache performance while retaining only 10% of the visual KV cache, reducing decoding latency by 29% to 66%. Conclusion: AirCache shows growing performance advantages over existing methods as the cache retention rate decreases, significantly improving LVLM inference efficiency. Abstract: Recent advancements in Large Visual Language Models (LVLMs) have gained significant attention due to their remarkable reasoning capabilities and proficiency in generalization. However, processing a large number of visual tokens and generating long-context outputs impose substantial computational overhead, leading to excessive demands for key-value (KV) cache. To address this critical bottleneck, we propose AirCache, a novel KV cache compression method aimed at accelerating LVLM inference. This work systematically investigates the correlations between visual and textual tokens within the attention mechanisms of LVLMs. Our empirical analysis reveals considerable redundancy in cached visual tokens, wherein strategically eliminating these tokens preserves model performance while significantly accelerating context generation. Inspired by these findings, we introduce an elite observation window for assessing the importance of visual components in the KV cache, focusing on stable inter-modal relevancy modeling with enhanced multi-perspective consistency. Additionally, we develop an adaptive layer-wise budget allocation strategy that capitalizes on the strength and skewness of token importance distribution, showcasing superior efficiency compared to uniform allocation. Comprehensive evaluations across multiple LVLMs and benchmarks demonstrate that our method achieves comparable performance to the full cache while retaining only 10% of the visual KV cache, thereby reducing decoding latency by 29% to 66% across various batch sizes and prompt lengths. Notably, as cache retention rates decrease, our method exhibits increasing performance advantages over existing approaches.
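
The retention step described above can be pictured as a top-k selection over per-token importance scores. In the sketch below the scores are simply given as input, whereas the paper derives them from its elite observation window; the function name, the rounding rule, and the single fixed `keep_ratio` are assumptions, and the adaptive layer-wise budget allocation is not modelled.

```python
import math
from typing import List


def retained_visual_indices(importance: List[float], keep_ratio: float) -> List[int]:
    """Return positions of the highest-scoring visual tokens, in original order.

    Assumption: the budget is ceil(n * keep_ratio), with at least one token kept.
    """
    budget = max(1, math.ceil(len(importance) * keep_ratio))
    # Rank token positions by importance, highest first.
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    # Restore positional order so the surviving cache entries stay aligned.
    return sorted(ranked[:budget])


# Hypothetical importance scores for 10 cached visual tokens.
scores = [0.02, 0.40, 0.01, 0.05, 0.30, 0.03, 0.04, 0.06, 0.07, 0.02]
kept = retained_visual_indices(scores, keep_ratio=0.2)
print(kept)
```

Only the key and value entries at the returned positions would be kept in the cache; everything else belonging to the visual tokens is evicted before decoding continues.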

A Multi-Stage Auto-Context Deep Learning Framework for Tissue and Nuclei Segmentation and Classification in H&E-Stained Histological Images of Advanced Melanoma

Nima Torbati,Anastasia Meshcheryakova,Diana Mechtcheriakova,Amirreza Mahbod

Task: Propose a unified multi-stage deep learning framework that combines tissue and nuclei information for automatic segmentation and classification of melanoma histological images.

Motivation: Existing computerized methods usually treat tissue analysis and nuclei analysis as separate tasks, which may be suboptimal, so a unified approach is needed.

Details

Method: A multi-stage deep learning framework based on the auto-context concept, combined with pre-training and post-processing. Result: In the PUMA challenge, the combined average micro Dice tissue score and summed nuclei F1-score reached 73.40% on Track 1 and 63.48% on Track 2, ranking second and first respectively. Conclusion: The unified framework improves the analysis of melanoma histological images. Abstract: Melanoma is the most lethal form of skin cancer, with an increasing incidence rate worldwide. Analyzing histological images of melanoma by localizing and classifying tissues and cell nuclei is considered the gold standard method for diagnosis and treatment options for patients. While many computerized approaches have been proposed for automatic analysis, most perform tissue-based analysis and nuclei (cell)-based analysis as separate tasks, which might be suboptimal. In this work, using the PUMA challenge dataset, we propose a novel multi-stage deep learning approach that combines tissue and nuclei information in a unified framework based on the auto-context concept to perform segmentation and classification in histological images of melanoma. Through pre-training and further post-processing, our approach achieved second and first place rankings in the PUMA challenge, with average micro Dice tissue score and summed nuclei F1-score of 73.40% for Track 1 and 63.48% for Track 2, respectively. Our implementation for training and testing is available at: https://github.com/NimaTorbati/PumaSubmit

Local Information Matters: Inference Acceleration For Grounded Conversation Generation Models Through Adaptive Local-Aware Token Pruning

Bizhe Bai,Jianjian Cao,Yadan Luo,Tao Che

Task: Propose ALTP, a framework that accelerates models for Grounded Conversation Generation (GCG) while preserving local visual features.

Motivation: Existing token pruning methods such as FastV and PyramidDrop fail to preserve the local visual features that are critical in GCG tasks, causing performance drops.

Details

Method: ALTP has two key components, Detail Density Capture (DDC) and Dynamic Density Formation (DDF), which preserve detail in object-centric regions through superpixel segmentation and dynamic token allocation respectively. Result: On the GranDf dataset, ALTP significantly outperforms existing methods; for example, on GLaMM it removes 90% of visual tokens while improving AP50 by 4.9% and Recall by 5.0% compared to PyramidDrop. Conclusion: ALTP is a simple and effective framework that outperforms existing token pruning methods on GCG tasks while reducing computational cost. Abstract: Grounded Conversation Generation (GCG) is an emerging vision-language task that requires models to generate natural language responses seamlessly intertwined with corresponding object segmentation masks. Recent models, such as GLaMM and OMG-LLaVA, achieve pixel-level grounding but incur significant computational costs due to processing a large number of visual tokens. Existing token pruning methods, like FastV and PyramidDrop, fail to preserve the local visual features critical for accurate grounding, leading to substantial performance drops in GCG tasks. To address this, we propose Adaptive Local-Aware Token Pruning (ALTP), a simple yet effective framework that accelerates GCG models by prioritizing local object information. ALTP introduces two key components: (1) Detail Density Capture (DDC), which uses superpixel segmentation to retain tokens in object-centric regions, preserving fine-grained details, and (2) Dynamic Density Formation (DDF), which dynamically allocates tokens based on information density, ensuring higher retention in semantically rich areas. Extensive experiments on the GranDf dataset demonstrate that ALTP significantly outperforms existing token pruning methods, such as FastV and PyramidDrop, on both GLaMM and OMG-LLaVA models. Notably, when applied to GLaMM, ALTP achieves a 90% reduction in visual tokens with a 4.9% improvement in AP50 and a 5.0% improvement in Recall compared to PyramidDrop. Similarly, on OMG-LLaVA, ALTP improves AP by 2.1% and mIOU by 3.0% at a 90% token reduction compared with PyramidDrop.

A Benchmark for Vision-Centric HD Mapping by V2I Systems

Miao Fan,Shanshan Yu,Shengtong Xu,Kun Jiang,Haoyi Xiong,Xiangzeng Liu

Task: Study online HD map construction for Vehicle-Infrastructure Cooperative Autonomous Driving (VICAD).

Motivation: Autonomous driving faces safety challenges because it lacks a global perspective and the semantic information of vectorized HD maps. Roadside camera information delivered over vehicle-to-infrastructure (V2I) communication can extend the map perception range, but no real-world dataset exists to support this research.

Details

Method: Propose an end-to-end neural framework (V2I-HD) that builds vectorized maps with a vision-centric V2I system, and introduce a directionally decoupled self-attention mechanism to reduce computation cost. Result: Experiments show that V2I-HD performs well in real-time inference speed and map construction quality, and suits complex, varied driving scenes. Conclusion: The real-world dataset and source code are released to advance research on online HD mapping for VICAD. Abstract: Autonomous driving faces safety challenges due to a lack of global perspective and the semantic information of vectorized high-definition (HD) maps. Information from roadside cameras can greatly expand the map perception range through vehicle-to-infrastructure (V2I) communications. However, there is still no real-world dataset available for studying onboard map vectorization under vehicle-infrastructure cooperation. To advance research on online HD mapping for Vehicle-Infrastructure Cooperative Autonomous Driving (VICAD), we release a real-world dataset, which contains collaborative camera frames from both vehicles and roadside infrastructures, and provides human annotations of HD map elements. We also present an end-to-end neural framework (i.e., V2I-HD) leveraging vision-centric V2I systems to construct vectorized maps. To reduce computation costs and further deploy V2I-HD on autonomous vehicles, we introduce a directionally decoupled self-attention mechanism to V2I-HD. Extensive experiments show that V2I-HD has superior performance in real-time inference speed, as tested on our real-world dataset. Abundant qualitative results also demonstrate stable and robust map construction quality with low cost in complex and varied driving scenes. As a benchmark, both the source code and the dataset have been released on OneDrive for further study.

Video-based Traffic Light Recognition by Rockchip RV1126 for Autonomous Driving

Miao Fan,Xuxu Kong,Shengtong Xu,Haoyi Xiong,Xiangzeng Liu

Task: Propose ViTLR, a video-based end-to-end neural network for real-time traffic light recognition.

Motivation: Existing single-frame methods perform poorly in complex scenes such as occlusion and adverse lighting, which compromises autonomous driving safety.

Details

Method: Use a transformer-like design with convolutional self-attention modules, optimized for the Rockchip RV1126 embedded platform. Result: State-of-the-art performance on two real-world datasets with real-time processing (>25 FPS), and greater robustness in temporal stability, varying target distances, and harsh environmental conditions. Conclusion: ViTLR has been integrated into an autonomous driving application, and the code and datasets are public to support further research. Abstract: Real-time traffic light recognition is fundamental for autonomous driving safety and navigation in urban environments. While existing approaches rely on single-frame analysis from onboard cameras, they struggle with complex scenarios involving occlusions and adverse lighting conditions. We present ViTLR, a novel video-based end-to-end neural network that processes multiple consecutive frames to achieve robust traffic light detection and state classification. The architecture leverages a transformer-like design with convolutional self-attention modules, which is optimized specifically for deployment on the Rockchip RV1126 embedded platform. Extensive evaluations on two real-world datasets demonstrate that ViTLR achieves state-of-the-art performance while maintaining real-time processing capabilities (>25 FPS) on RV1126's NPU. The system shows superior robustness across temporal stability, varying target distances, and challenging environmental conditions compared to existing single-frame approaches. We have successfully integrated ViTLR into an ego-lane traffic light recognition system using HD maps for autonomous driving applications. The complete implementation, including source code and datasets, is made publicly available to facilitate further research in this domain.

SALT: A Flexible Semi-Automatic Labeling Tool for General LiDAR Point Clouds with Cross-Scene Adaptability and 4D Consistency

Yanbo Wang,Yongtao Chen,Chuan Cao,Tianchen Deng,Wentao Zhao,Jingchuan Wang,Weidong Chen

Task: Propose SALT, a flexible semi-automatic labeling tool for general LiDAR point clouds with cross-scene adaptability and 4D consistency.

Motivation: Remove the reliance of existing methods on camera distillation by operating directly on raw LiDAR data, and improve annotation efficiency.

Details

Method: Use a zero-shot learning paradigm (data alignment) that converts LiDAR data into pseudo-images, and design a 4D-consistent prompting strategy and a 4D non-maximum suppression module. Result: Surpasses the latest zero-shot methods by 18.4% PQ on SemanticKITTI and reaches 40-50% of human annotator performance on newly collected low-resolution LiDAR data. Conclusion: Open-sourcing SALT should drive the expansion of LiDAR datasets and lay the groundwork for future LiDAR foundation models. Abstract: We propose a flexible Semi-Automatic Labeling Tool (SALT) for general LiDAR point clouds with cross-scene adaptability and 4D consistency. Unlike recent approaches that rely on camera distillation, SALT operates directly on raw LiDAR data, automatically generating pre-segmentation results. To achieve this, we propose a novel zero-shot learning paradigm, termed data alignment, which transforms LiDAR data into pseudo-images by aligning with the training distribution of vision foundation models. Additionally, we design a 4D-consistent prompting strategy and 4D non-maximum suppression module to enhance SAM2, ensuring high-quality, temporally consistent pre-segmentation. SALT surpasses the latest zero-shot methods by 18.4% PQ on SemanticKITTI and achieves nearly 40-50% of human annotator performance on our newly collected low-resolution LiDAR data and on combined data from three LiDAR types, significantly boosting annotation efficiency. We anticipate that SALT's open-sourcing will catalyze substantial expansion of current LiDAR datasets and lay the groundwork for the future development of LiDAR foundation models. Code is available at https://github.com/Cavendish518/SALT.

DenseFormer: Learning Dense Depth Map from Sparse Depth and Image via Conditional Diffusion Model

Ming Yuan,Sichao Wang,Chuang Zhang,Lei He,Qing Xu,Jianqiang Wang

Task: Generate dense depth maps from sparse depth maps and RGB images.

Motivation: Depth completion is critical in autonomous driving. Most existing methods refine the depth map iteratively with a spatial propagation network, which leaves room for improvement.

Details

Method: Propose DenseFormer, which uses the denoising mechanism of diffusion models to generate a dense depth map from an initial random depth distribution over multiple iterations, together with a feature extraction module and a depth refinement module. Result: Outperforms classical depth completion methods on the KITTI dataset. Conclusion: By effectively combining a diffusion model with the feature extraction module, DenseFormer improves depth completion performance. Abstract: The depth completion task is a critical problem in autonomous driving, involving the generation of dense depth maps from sparse depth maps and RGB images. Most existing methods employ a spatial propagation network to iteratively refine the depth map after obtaining an initial dense depth. In this paper, we propose DenseFormer, a novel method that integrates the diffusion model into the depth completion task. By incorporating the denoising mechanism of the diffusion model, DenseFormer generates the dense depth map by progressively refining an initial random depth distribution through multiple iterations. We propose a feature extraction module that leverages a feature pyramid structure, along with multi-layer deformable attention, to effectively extract and integrate features from sparse depth maps and RGB images, which serve as the guiding condition for the diffusion process. Additionally, this paper presents a depth refinement module that applies multi-step iterative refinement across various ranges to the dense depth results generated by the diffusion process. The module utilizes image features enriched with multi-scale information and sparse depth input to further enhance the accuracy of the predicted depth map. Extensive experiments on the KITTI outdoor scene dataset demonstrate that DenseFormer outperforms classical depth completion methods.

H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding

Qi Wu,Quanlong Zheng,Yanhao Zhang,Junlin Xie,Jinguo Luo,Kuo Wang,Peng Liu,Qingsong Xie,Ru Zhen,Haonan Lu,Zhenyu Yang

Task: Propose a hierarchical and holistic video understanding (H2VU) benchmark for evaluating comprehension of general video and online streaming video.

Motivation: Existing video understanding benchmarks have significant limitations in coverage, task diversity, and scene adaptability, which hinders accurate assessment of models' comprehensive video understanding.

Details

Method: Design the H2VU benchmark with extended video durations, comprehensive assessment tasks, and enriched video data to test deep understanding. Result: H2VU shows that existing multimodal large language models (MLLMs) still have substantial room for improvement on the newly proposed evaluation tasks. Conclusion: H2VU is expected to advance video understanding research by providing comprehensive, in-depth analysis. Abstract: With the rapid development of multimodal models, the demand for assessing video understanding capabilities has been steadily increasing. However, existing benchmarks for evaluating video understanding exhibit significant limitations in coverage, task diversity, and scene adaptability. These shortcomings hinder the accurate assessment of models' comprehensive video understanding capabilities. To tackle this challenge, we propose a hierarchical and holistic video understanding (H2VU) benchmark designed to evaluate both general video and online streaming video comprehension. This benchmark contributes three key features: (1) Extended video duration: spanning videos from brief 3-second clips to comprehensive 1.5-hour recordings, thereby bridging the temporal gaps found in current benchmarks. (2) Comprehensive assessment tasks: beyond traditional perceptual and reasoning tasks, we have introduced modules for countercommonsense comprehension and trajectory state tracking. These additions test the models' deep understanding capabilities beyond mere prior knowledge. (3) Enriched video data: to keep pace with the rapid evolution of current AI agents, we have expanded first-person streaming video datasets. This expansion allows for the exploration of multimodal models' performance in understanding streaming videos from a first-person perspective. Extensive results from H2VU reveal that existing multimodal large language models (MLLMs) possess substantial potential for improvement in our newly proposed evaluation tasks. We expect that H2VU will facilitate advancements in video understanding research by offering a comprehensive and in-depth analysis of MLLMs.

Optimization of Layer Skipping and Frequency Scaling for Convolutional Neural Networks under Latency Constraint

Minh David Thao Chan,Ruoyu Zhao,Yukuan Jia,Ruiqing Mao,Sheng Zhou

Task: Propose a method combining Proportional Layer Skipping (PLS) and Frequency Scaling (FS) to reduce the energy consumption of convolutional neural networks (CNNs) on resource-constrained devices.

Motivation: Energy consumption is a key concern when deploying deep learning models on resource-constrained devices such as mobile devices and autonomous vehicles.

Details

Method: Optimize computational complexity and energy use by selectively skipping network layers (PLS) and adjusting processor frequency (FS). Result: Experiments on ResNet-152 with CIFAR-10 show significantly lower computational demand and energy consumption with minimal accuracy loss. Conclusion: The method offers a practical solution for real-time processing in resource-constrained settings and sheds light on the balance between computational efficiency and model performance. Abstract: The energy consumption of Convolutional Neural Networks (CNNs) is a critical factor in deploying deep learning models on resource-limited equipment such as mobile devices and autonomous vehicles. We propose an approach involving Proportional Layer Skipping (PLS) and Frequency Scaling (FS). Layer skipping reduces computational complexity by selectively bypassing network layers, whereas frequency scaling adjusts the frequency of the processor to optimize energy use under latency constraints. Experiments with PLS and FS on ResNet-152 with the CIFAR-10 dataset demonstrated significant reductions in computational demands and energy consumption with minimal accuracy loss. This study offers practical solutions for improving real-time processing in resource-limited settings and provides insights into balancing computational efficiency and model performance.
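
The interaction between layer skipping and frequency scaling can be made concrete with a toy model. The sketch below is not the paper's optimization: it assumes latency equals cycles divided by frequency, and that dynamic energy per cycle grows with the square of frequency (a common simplification when supply voltage scales with frequency). The function name `pick_frequency`, the constant `k`, and the numbers are hypothetical.

```python
def pick_frequency(total_cycles, skip_ratio, freqs_hz, deadline_s, k=1.0):
    """Return (frequency, energy) for the slowest frequency meeting the deadline.

    total_cycles: cycles needed to run the full network (assumed known)
    skip_ratio:   fraction of the computation removed by layer skipping
    freqs_hz:     available processor frequencies
    deadline_s:   latency constraint in seconds
    k:            hardware constant of the assumed energy model
    Returns None when no available frequency meets the deadline.
    """
    cycles = total_cycles * (1.0 - skip_ratio)
    for f in sorted(freqs_hz):
        latency = cycles / f            # assumed latency model
        if latency <= deadline_s:
            energy = k * cycles * f ** 2  # assumed dynamic-energy model
            return f, energy
    return None


if __name__ == "__main__":
    freqs = [0.5e9, 1.0e9, 1.5e9]
    full = pick_frequency(1e9, 0.0, freqs, deadline_s=0.8)
    skipped = pick_frequency(1e9, 0.25, freqs, deadline_s=0.8)
    print("no skipping :", full)
    print("25% skipped :", skipped)
```

Under these assumptions, skipping a quarter of the computation lets the processor drop from 1.5 GHz to 1.0 GHz while still meeting the deadline, so energy falls both because fewer cycles run and because each cycle is cheaper.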

Crossmodal Knowledge Distillation with WordNet-Relaxed Text Embeddings for Robust Image Classification

Chenqi Guo,Mengshuo Rong,Qianli Feng,Rongfan Feng,Yinglong Ma

Task: Propose a multi-teacher crossmodal knowledge distillation framework that improves a unimodal student by combining CLIP image embeddings with learnable WordNet-relaxed text embeddings.

Motivation: Labels in existing image classification datasets represent only high-level concepts and miss deeper semantic structure, and using them directly can cause label leakage, which limits knowledge distillation.

Details

Method: A multi-teacher framework combines CLIP image embeddings with WordNet-relaxed text embeddings under a hierarchical loss, avoiding direct use of exact class names. Result: Experiments show significantly improved student performance, with best or second-best results on six public datasets. Conclusion: By avoiding label leakage and introducing richer textual cues, the method advances crossmodal knowledge distillation. Abstract: Crossmodal knowledge distillation (KD) aims to enhance a unimodal student using a multimodal teacher model. In particular, when the teacher's modalities include the student's, additional complementary information can be exploited to improve knowledge transfer. In supervised image classification, image datasets typically include class labels that represent high-level concepts, suggesting a natural avenue to incorporate textual cues for crossmodal KD. However, these labels rarely capture the deeper semantic structures in real-world visuals and can lead to label leakage if used directly as inputs, ultimately limiting KD performance. To address these issues, we propose a multi-teacher crossmodal KD framework that integrates CLIP image embeddings with learnable WordNet-relaxed text embeddings under a hierarchical loss. By avoiding direct use of exact class names and instead using semantically richer WordNet expansions, we mitigate label leakage and introduce more diverse textual cues. Experiments show that this strategy significantly boosts student performance, whereas noisy or overly precise text embeddings hinder distillation efficiency. Interpretability analyses confirm that WordNet-relaxed prompts encourage heavier reliance on visual features over textual shortcuts, while still effectively incorporating the newly introduced textual cues. Our method achieves state-of-the-art or second-best results on six public datasets, demonstrating its effectiveness in advancing crossmodal KD.

HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation

Boyuan Wang,Xiaofeng Wang,Chaojun Ni,Guosheng Zhao,Zhiqin Yang,Zheng Zhu,Muyang Zhang,Yukun Zhou,Xinze Chen,Guan Huang,Lihong Liu,Xingang Wang

Task: Propose HumanDreamer, a decoupled human video generation framework that generates diverse poses from text prompts and then uses those poses to generate human-motion videos.

Motivation: Existing methods rely on poses taken from existing videos and lack flexibility, so a more flexible approach to human-motion video generation is needed.

Details

Method: Propose the HumanDreamer framework, including the MotionVid dataset and the MotionDiT model, and introduce the LAMA loss. Result: FID improves by 62.4%, and R-precision (top1, top2, top3) improves by 41.8%, 26.3%, and 18.3%, respectively. Conclusion: HumanDreamer generates diverse, high-quality human-motion videos and supports other downstream tasks. Abstract: Human-motion video generation has been a challenging task, primarily due to the difficulty inherent in learning human body movements. While some approaches have attempted to drive human-centric video generation explicitly through pose control, these methods typically rely on poses derived from existing videos, thereby lacking flexibility. To address this, we propose HumanDreamer, a decoupled human video generation framework that first generates diverse poses from text prompts and then leverages these poses to generate human-motion videos. Specifically, we propose MotionVid, the largest dataset for human-motion pose generation. Based on the dataset, we present MotionDiT, which is trained to generate structured human-motion poses from text prompts. In addition, we introduce a novel LAMA loss. Together, these contribute to a 62.4% improvement in FID and to R-precision gains of 41.8%, 26.3%, and 18.3% for top1, top2, and top3, respectively, advancing both Text-to-Pose control accuracy and FID. Our experiments across various Pose-to-Video baselines demonstrate that the poses generated by our method can produce diverse and high-quality human-motion videos. Furthermore, our model can facilitate other downstream tasks, such as pose sequence prediction and 2D-3D motion lifting.

BBoxCut: A Targeted Data Augmentation Technique for Enhancing Wheat Head Detection Under Occlusions

Yasashwini Sai Gowri P,Karthik Seemakurthy,Andrews Agyemang Opoku,Sita Devi Bharatula

Task: Propose BBoxCut, a data augmentation technique that simulates occlusion in wheat head detection.

Motivation: Traditional measurement of wheat head traits is time-consuming and inefficient. Digital technologies make automation possible, but field conditions such as occlusion and lighting changes challenge detection accuracy.

Details

Method: Use random localized masking (BBoxCut) to simulate occlusion by leaves and neighboring wheat heads, evaluated on three state-of-the-art object detectors. Result: mAP gains of 2.76, 3.26, and 1.9 on Faster R-CNN, FCOS, and DETR, respectively, with notably better robustness in occluded scenes. Conclusion: BBoxCut improves wheat head detection under complex field conditions and gives breeders a more efficient tool. Abstract: Wheat plays a critical role in global food security, making it one of the most extensively studied crops. Accurate identification and measurement of key characteristics of wheat heads are essential for breeders to select varieties for cross-breeding, with the goal of developing nutrient-dense, resilient, and sustainable cultivars. Traditionally, these measurements are performed manually, which is both time-consuming and inefficient. Advances in digital technologies have paved the way for automating this process. However, field conditions pose significant challenges, such as occlusion by leaves, overlapping wheat heads, varying lighting conditions, and motion blur. In this paper, we propose a novel data augmentation technique, BBoxCut, which uses random localized masking to simulate occlusions caused by leaves and neighboring wheat heads. We evaluated our approach using three state-of-the-art object detectors and observed mean average precision (mAP) gains of 2.76, 3.26, and 1.9 for Faster R-CNN, FCOS, and DETR, respectively. Our augmentation technique led to significant improvements both qualitatively and quantitatively. The improvements were particularly evident in scenarios involving occluded wheat heads, demonstrating the robustness of our method in challenging field conditions.
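
Random localized masking inside annotated boxes can be sketched in a few lines. The abstract does not give the exact sampling rules, so everything below is an assumption for illustration: the function name `bbox_cut`, the `max_frac` limit on mask size, the constant fill value, and the choice to leave the box annotations unchanged.

```python
import random


def bbox_cut(image, boxes, max_frac=0.5, fill=0, rng=None):
    """Mask one random rectangle inside each bounding box (hypothetical sketch).

    image:    2D list of pixel values (rows of columns)
    boxes:    list of (x0, y0, x1, y1) with half-open pixel coordinates
    max_frac: largest mask side as a fraction of the box side (assumed parameter)
    Returns a new image; the input and the boxes are left untouched.
    """
    rng = rng or random.Random()
    out = [row[:] for row in image]
    for x0, y0, x1, y1 in boxes:
        box_w, box_h = x1 - x0, y1 - y0
        # Sample the mask size, at least one pixel per side.
        mask_w = rng.randint(1, max(1, int(box_w * max_frac)))
        mask_h = rng.randint(1, max(1, int(box_h * max_frac)))
        # Sample the top-left corner so the mask stays inside the box.
        mx = rng.randint(x0, x1 - mask_w)
        my = rng.randint(y0, y1 - mask_h)
        for y in range(my, my + mask_h):
            for x in range(mx, mx + mask_w):
                out[y][x] = fill
    return out


if __name__ == "__main__":
    img = [[1] * 10 for _ in range(10)]
    augmented = bbox_cut(img, [(2, 2, 8, 8)], rng=random.Random(0))
    for row in augmented:
        print("".join(str(v) for v in row))
```

Keeping the mask strictly inside the box means the object stays annotated while part of it is hidden, which is the occlusion effect the augmentation is meant to imitate.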

AMMSM: Adaptive Motion Magnification and Sparse Mamba for Micro-Expression Recognition

Xuxiong Liu,Tengteng Dong,Fei Wang,Weijie Feng,Xiao Sun

Task: Propose AMMSM, a multi-task learning framework that improves the capture and recognition of micro-expressions.

Motivation: Micro-expressions are brief and subtle, which poses significant challenges for downstream recognition.

Details

Method: Adaptive Motion Magnification and Sparse Mamba (AMMSM) combines self-supervised subtle motion magnification with a sparse spatial selection Mamba model, and uses evolutionary search to optimize the magnification factor and sparsity ratios. Result: Experiments on two standard datasets show state-of-the-art accuracy and robustness. Conclusion: AMMSM effectively addresses the challenges of micro-expression recognition and delivers significant performance gains. Abstract: Micro-expressions are typically regarded as unconscious manifestations of a person's genuine emotions. However, their short duration and subtle signals pose significant challenges for downstream recognition. We propose a multi-task learning framework named the Adaptive Motion Magnification and Sparse Mamba (AMMSM) to address this. This framework aims to enhance the accurate capture of micro-expressions through self-supervised subtle motion magnification, while the sparse spatial selection Mamba architecture combines sparse activation with the advanced Visual Mamba model to model key motion regions and their valuable representations more effectively. Additionally, we employ evolutionary search to optimize the magnification factor and the sparsity ratios of spatial selection, followed by fine-tuning to improve performance further. Extensive experiments on two standard datasets demonstrate that the proposed AMMSM achieves state-of-the-art (SOTA) accuracy and robustness.

Siqi Zhang,Yanyuan Qiao,Qunbo Wang,Zike Yan,Qi Wu,Zhihua Wei,Jing Liu

Task: Propose COSMO, a new architecture that combines selective memorization to achieve high performance at low computational cost in Vision-and-Language Navigation (VLN).

Motivation: Current VLN methods improve performance but add components that enlarge the model and raise computational cost, so a solution is needed that keeps performance high while lowering cost.

Details

Method: COSMO combines state-space modules with transformer modules and introduces two VLN-customized selective state space modules, RSS and CS3, to strengthen cross-modal interaction. Result: On three mainstream VLN benchmarks (REVERIE, R2R, and R2R-CE), the model shows competitive navigation performance at significantly lower computational cost. Conclusion: Through selective memorization and optimized modal interaction, COSMO balances high performance against low computational cost. Abstract: Vision-and-Language Navigation (VLN) tasks have gained prominence within artificial intelligence research due to their potential application in fields like home assistants. Many contemporary VLN approaches, while based on transformer architectures, have increasingly incorporated additional components such as external knowledge bases or map information to enhance performance. These additions, while boosting performance, also lead to larger models and increased computational costs. In this paper, to achieve both high performance and low computational costs, we propose a novel architecture with the COmbination of Selective MemOrization (COSMO). Specifically, COSMO integrates state-space modules and transformer modules, and incorporates two VLN-customized selective state space modules: the Round Selective Scan (RSS) and the Cross-modal Selective State Space Module (CS3). RSS facilitates comprehensive inter-modal interactions within a single scan, while the CS3 module adapts the selective state space module into a dual-stream architecture, thereby enhancing the acquisition of cross-modal interactions. Experimental validations on three mainstream VLN benchmarks, REVERIE, R2R, and R2R-CE, not only demonstrate competitive navigation performance of our model but also show a significant reduction in computational costs.

From Colors to Classes: Emergence of Concepts in Vision Transformers

Teresa Dorszewski,Lenka Tětková,Robert Jenssen,Lars Kai Hansen,Kristoffer Knutsen Wickstrøm

Task: Study how Vision Transformers (ViTs) encode concepts across layers.

Motivation: ViTs show strong representation capabilities in computer vision tasks, but how they process information layer by layer remains unclear and needs closer study.

Details

Method: Layer-wise analysis of ViTs using neuron labeling. Result: The complexity of encoded concepts increases with depth: early layers encode basic features such as colors and textures, and later layers encode more specific classes such as objects and animals. Different pretraining strategies affect the number and category of encoded concepts. Conclusion: The layer-wise encoding of ViTs resembles that of CNNs but is strongly influenced by the pretraining strategy and by fine-tuning on downstream tasks. Abstract: Vision Transformers (ViTs) are increasingly utilized in various computer vision tasks due to their powerful representation capabilities. However, it remains understudied how ViTs process information layer by layer. Numerous studies have shown that convolutional neural networks (CNNs) extract features of increasing complexity throughout their layers, which is crucial for tasks like domain adaptation and transfer learning. ViTs, lacking the same inductive biases as CNNs, can potentially learn global dependencies from the first layers due to their attention mechanisms. Given the increasing importance of ViTs in computer vision, there is a need to improve the layer-wise understanding of ViTs. In this work, we present a novel, layer-wise analysis of concepts encoded in state-of-the-art ViTs using neuron labeling. Our findings reveal that ViTs encode concepts with increasing complexity throughout the network. Early layers primarily encode basic features such as colors and textures, while later layers represent more specific classes, including objects and animals. As the complexity of encoded concepts increases, the number of concepts represented in each layer also rises, reflecting a more diverse and specific set of features. Additionally, different pretraining strategies influence the quantity and category of encoded concepts, with finetuning to specific downstream tasks generally reducing the number of encoded concepts and shifting the concepts to more relevant categories.

A Plasticity-Aware Method for Continual Self-Supervised Learning in Remote Sensing

Lars Möllenbrok,Behnood Rasti,Begüm Demir

Task: Propose a new continual self-supervised learning method that learns new tasks sequentially while achieving high learning plasticity.

Motivation: In preventing catastrophic forgetting, existing continual self-supervised learning methods reduce the model's ability to adapt to new task data (learning plasticity), which degrades performance.

Details

Method: Use a knowledge distillation strategy with an integrated decoupling mechanism that splits the feature dimensions into task-common and task-specific parts, which are correlated and de-correlated, respectively. Result: Compared with the widely used CaSSLe framework, improvements of up to 1.12% in average accuracy and 2.33% in intransigence in the task-incremental scenario, and 1.24% in average accuracy and 2.01% in intransigence in the class-incremental scenario. Conclusion: The method improves learning plasticity while maintaining memory stability and outperforms existing methods. Abstract: Continual self-supervised learning (CSSL) methods have gained increasing attention in remote sensing (RS) due to their capability to learn new tasks sequentially from continuous streams of unlabeled data. Existing CSSL methods, while learning new tasks, focus on preventing catastrophic forgetting. To this end, most of them use regularization strategies to retain knowledge of previous tasks. This reduces the model's ability to adapt to the data of new tasks (i.e., learning plasticity), which can degrade performance. To address this problem, in this paper, we propose a novel CSSL method that aims to learn tasks sequentially, while achieving high learning plasticity. To this end, the proposed method uses a knowledge distillation strategy with an integrated decoupling mechanism. The decoupling is achieved by first dividing the feature dimensions into task-common and task-specific parts. Then, the task-common features are forced to be correlated to ensure memory stability while the task-specific features are forced to be de-correlated, facilitating the learning of new features. Experimental results show the effectiveness of the proposed method compared to CaSSLe, which is a widely used CSSL framework, with improvements of up to 1.12% in average accuracy and 2.33% in intransigence in a task-incremental scenario, and 1.24% in average accuracy and 2.01% in intransigence in a class-incremental scenario.
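
The decoupling idea (keep task-common dimensions correlated with the previous model, push task-specific dimensions toward de-correlation) can be illustrated with a simplified per-dimension loss. The abstract does not specify the actual objective, so this is only a sketch under assumptions: the function names, the use of Pearson correlation per feature dimension, the split by index `n_common`, and the weight `lam` are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def decoupling_loss(old_feats, new_feats, n_common, lam=1.0):
    """Toy decoupled distillation loss over a batch of feature vectors.

    old_feats, new_feats: lists of feature vectors from the frozen previous
                          model and the current model (same batch, same width)
    n_common:             the first n_common dimensions are treated as task-common
    lam:                  weight of the de-correlation term (assumed parameter)
    """
    dims = len(old_feats[0])
    loss = 0.0
    for d in range(dims):
        c = pearson([row[d] for row in old_feats], [row[d] for row in new_feats])
        if d < n_common:
            loss += (1.0 - c) ** 2   # task-common: pull correlation toward 1
        else:
            loss += lam * c ** 2     # task-specific: pull correlation toward 0
    return loss


if __name__ == "__main__":
    old = [[1, 1], [2, 2], [3, 3], [4, 4]]
    new = [[2, 1], [4, -1], [6, -1], [8, 1]]
    print(decoupling_loss(old, new, n_common=1))
```

In the toy batch, the first dimension of the new features is perfectly correlated with the old one and the second is uncorrelated, so the loss is zero; swapping the two dimensions would be penalized on both terms.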

4D mmWave Radar in Adverse Environments for Autonomous Driving: A Survey

Xiangyuan Peng,Miao Tang,Huawei Sun,Lorenzo Servadei,Robert Wille

Task: Survey the state of research on 4D millimeter-wave radar in adverse environments.

Motivation: Adverse conditions such as rain, snow, and fog significantly degrade LiDAR and camera performance, whereas 4D mmWave radar remains robust and suits autonomous driving systems.

Details

Method: Summarize progress on 4D mmWave radar in adverse environments by analyzing existing datasets, methods, and models. Result: Identifies the challenges in current research and directions for future work. Conclusion: This is the first survey focused specifically on 4D mmWave radar in adverse environments for autonomous driving. Abstract: Autonomous driving systems require accurate and reliable perception. However, adverse environments, such as rain, snow, and fog, can significantly degrade the performance of LiDAR and cameras. In contrast, 4D millimeter-wave (mmWave) radar not only provides 3D sensing and additional velocity measurements but also maintains robustness in challenging conditions, making it increasingly valuable for autonomous driving. Recently, research on 4D mmWave radar under adverse environments has been growing, but a comprehensive survey is still lacking. To bridge this gap, this survey comprehensively reviews the current research on 4D mmWave radar under adverse environments. First, we present an overview of existing 4D mmWave radar datasets encompassing diverse weather and lighting scenarios. Next, we analyze methods and models according to different adverse conditions. Finally, the challenges faced in current studies and potential future directions are discussed for advancing 4D mmWave radar applications in harsh environments. To the best of our knowledge, this is the first survey specifically focusing on 4D mmWave radar in adverse environments for autonomous driving.

DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description

Adrienne Deganutti,Simon Hadfield,Andrew Gilbert

Task: Propose DANTE-AD, an enhanced video description model that addresses coherence in long-term visual storytelling.

Motivation: Existing methods rely only on frame-level embeddings and lack contextual information across scenes, which makes coherent long-term visual storytelling difficult.

Details

Method: A dual-vision Transformer architecture fuses frame-level and scene-level embeddings using a new sequential cross-attention method. Result: In broad testing, DANTE-AD outperforms existing methods on traditional NLP metrics and LLM-based evaluations. Conclusion: By fusing multi-level embeddings with sequential cross-attention, DANTE-AD significantly improves the coherence and quality of long-term visual storytelling. Abstract: Audio Description is a narrated commentary designed to aid vision-impaired audiences in perceiving key visual elements in a video. While short-form video understanding has advanced rapidly, maintaining coherent long-term visual storytelling remains unresolved. Existing methods rely solely on frame-level embeddings, effectively describing object-based content but lacking contextual information across scenes. We introduce DANTE-AD, an enhanced video description model leveraging a dual-vision Transformer-based architecture to address this gap. DANTE-AD sequentially fuses both frame and scene level embeddings to improve long-term contextual understanding. We propose a novel, state-of-the-art method for sequential cross-attention to achieve contextual grounding for fine-grained audio description generation. Evaluated on a broad range of key scenes from well-known movie clips, DANTE-AD outperforms existing methods across traditional NLP metrics and LLM-based evaluations.

PolypSegTrack: Unified Foundation Model for Colonoscopy Video Analysis

Anwesa Choudhuri,Zhongpai Gao,Meng Zheng,Benjamin Planche,Terrence Chen,Ziyan Wu

Task: Jointly address polyp detection, segmentation, classification, and unsupervised tracking in colonoscopy videos.

Motivation: Existing deep learning methods require task-specific fine-tuning, lack tracking capability, or rely on domain-specific pre-training.

Details

Method: Propose PolypSegTrack, which uses a conditional mask loss and an unsupervised tracking module on top of a pre-trained vision foundation model. Result: Significantly outperforms existing methods on multiple polyp benchmarks. Conclusion: PolypSegTrack is an efficient foundation model that needs no task-specific fine-tuning or domain-specific pre-training. Abstract: Early detection, accurate segmentation, classification and tracking of polyps during colonoscopy are critical for preventing colorectal cancer. Many existing deep-learning-based methods for analyzing colonoscopic videos either require task-specific fine-tuning, lack tracking capabilities, or rely on domain-specific pre-training. In this paper, we introduce PolypSegTrack, a novel foundation model that jointly addresses polyp detection, segmentation, classification and unsupervised tracking in colonoscopic videos. Our approach leverages a novel conditional mask loss, enabling flexible training across datasets with either pixel-level segmentation masks or bounding box annotations, allowing us to bypass task-specific fine-tuning. Our unsupervised tracking module reliably associates polyp instances across frames using object queries, without relying on any heuristics. We leverage a robust vision foundation model backbone that is pre-trained without supervision on natural images, thereby removing the need for domain-specific pre-training. Extensive experiments on multiple polyp benchmarks demonstrate that our method significantly outperforms existing state-of-the-art approaches in detection, segmentation, classification, and tracking.

IMPACT: A Generic Semantic Loss for Multimodal Medical Image Registration

Valentin Boussot,Cédric Hémon,Jean-Claude Nunes,Jason Downling,Simon Rouzé,Caroline Lafond,Anaïs Barateau,Jean-Louis Dillenseger

Task: Propose IMPACT, a generic semantic similarity metric for multimodal medical image registration.

Motivation: Medical image registration is essential for diagnosis, treatment planning, and monitoring, but existing methods struggle with cross-modality registration.

Details

Method: Extract deep features with pretrained models, without task-specific training, and support multiple registration frameworks. Result: Across several registration tasks, IMPACT significantly improves anatomical alignment accuracy and shows greater robustness. Conclusion: IMPACT is an efficient, generic tool that improves multimodal medical image registration. Abstract: Image registration is fundamental in medical imaging, enabling precise alignment of anatomical structures for diagnosis, treatment planning, image-guided treatment or longitudinal monitoring. This work introduces IMPACT (Image Metric with Pretrained model-Agnostic Comparison for Transmodality registration), a generic semantic similarity metric designed for seamless integration into diverse image registration frameworks (such as Elastix and Voxelmorph). It compares deep learning-based features extracted from medical images without requiring task-specific training, ensuring broad applicability across various modalities. By leveraging the features of the large-scale pretrained TotalSegmentator models and the ability to integrate Segment Anything Model (SAM) and other large-scale segmentation networks, this approach offers significant advantages. It provides robust, scalable, and efficient solutions for multimodal image registration. The IMPACT loss was evaluated on five challenging registration tasks involving thoracic CT/CBCT and pelvic MR/CT datasets. Quantitative metrics, such as Target Registration Error and Dice Similarity Coefficient, demonstrated significant improvements in anatomical alignment compared to baseline methods. Qualitative analyses further confirmed the increased robustness of the proposed metric in the face of noise, artifacts, and modality variations. IMPACT's versatility and efficiency make it a valuable tool for advancing registration performance in clinical and research applications, addressing critical challenges in multimodal medical imaging.

Dominik Schnaus,Nikita Araslanov,Daniel Cremers

Task: Study whether vision and language embeddings can be matched without supervision.

Motivation: As foundation models mature, vision and language embeddings may become matchable without parallel data; this work explores that possibility.

Details

Method: Formulate unsupervised matching as a quadratic assignment problem and propose a new heuristic; develop a technique to find optimal matching problems. Result: Experiments show that in many cases vision and language representations can indeed be matched without supervision, and an unsupervised classifier is demonstrated. Conclusion: Unsupervised matching of vision and language embeddings is feasible, which opens a new route to embedding semantic knowledge across modalities. Abstract: The platonic representation hypothesis suggests that vision and language embeddings become more homogeneous as model and dataset sizes increase. In particular, pairwise distances within each modality become more similar. This suggests that as foundation models mature, it may become possible to match vision and language embeddings in a fully unsupervised fashion, i.e., without parallel data. We present the first feasibility study, and investigate conformity of existing vision and language foundation models in the context of unsupervised, or "blind", matching. First, we formulate unsupervised matching as a quadratic assignment problem and introduce a novel heuristic that outperforms previous solvers. We also develop a technique to find optimal matching problems, for which a non-trivial match is very likely. Second, we conduct an extensive study deploying a range of vision and language models on four datasets. Our analysis reveals that for many problem instances, vision and language representations can indeed be matched without supervision. This finding opens up the exciting possibility of embedding semantic knowledge into other modalities virtually annotation-free. As a proof of concept, we showcase an unsupervised classifier, which achieves non-trivial classification accuracy without any image-text annotation.
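
Matching two sets of embeddings using only their within-modality pairwise distances can be written as a quadratic assignment problem: find the permutation that makes one distance matrix look most like the other. The sketch below solves a tiny instance by brute force. It is not the heuristic proposed in the paper; the function names and the one-dimensional toy "embeddings" are assumptions, and exhaustive search is only feasible for a handful of items.

```python
from itertools import permutations


def pairwise_distances(points):
    """Distance matrix for a list of 1-D toy embeddings."""
    return [[abs(a - b) for b in points] for a in points]


def qap_cost(d_vision, d_language, perm):
    """Squared mismatch between the two distance matrices under a matching.

    perm[i] is the language item assigned to vision item i.
    """
    n = len(perm)
    return sum(
        (d_vision[i][j] - d_language[perm[i]][perm[j]]) ** 2
        for i in range(n)
        for j in range(n)
    )


def blind_match(d_vision, d_language):
    """Exhaustively search all permutations (only viable for very small n)."""
    n = len(d_vision)
    return min(permutations(range(n)),
               key=lambda p: qap_cost(d_vision, d_language, p))


if __name__ == "__main__":
    vision_pos = [0, 1, 3, 7]
    # The same four concepts, listed in a different (unknown) order.
    language_pos = [1, 7, 0, 3]
    match = blind_match(pairwise_distances(vision_pos),
                        pairwise_distances(language_pos))
    print(match)  # (2, 0, 3, 1)
```

No cross-modal pairs are used anywhere: the matching is recovered purely from the geometry inside each modality, which is the sense in which the matching is "blind".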

PixelCAM: Pixel Class Activation Mapping for Histology Image Classification and ROI Localization

Alexis Guichemerre,Soufiane Belharbi,Mohammadhadi Shateri,Luke McCaffrey,Eric Granger

Task: Propose PixelCAM, a multi-task method for weakly supervised object localization (WSOL) that addresses asynchronous convergence between the classification and localization tasks.

Motivation: Existing WSOL methods have limitations on histology images: single-step methods are prone to under- or over-activation, two-step methods are tied to a frozen classifier, and both perform poorly on out-of-distribution datasets.

Details

Method: Train classification and localization simultaneously in the pixel-feature space of a shared image encoder, and propose PixelCAM as a pixel-wise classifier trained with pseudo-labels from a pretrained model. Result: PixelCAM learns discriminant features and accurately delineates foreground/background regions, supports ROI localization and image classification, and integrates into CNN- and transformer-based architectures. Conclusion: PixelCAM is an efficient multi-task WSOL method that resolves asynchronous convergence and performs well on histology images. Abstract: Weakly supervised object localization (WSOL) methods allow training models to classify images and localize ROIs. WSOL only requires low-cost image-class annotations yet provides a visually interpretable classifier, which is important in histology image analysis. Standard WSOL methods rely on class activation mapping (CAM) methods to produce spatial localization maps according to a single- or two-step strategy. While both strategies have made significant progress, they still face several limitations with histology images. Single-step methods can easily result in under- or over-activation due to the limited visual ROI saliency in histology images and the limited localization cues. They also face the well-known issue of asynchronous convergence between classification and localization tasks. The two-step approach is sub-optimal because it is tied to a frozen classifier, limiting the capacity for localization. Moreover, these methods also struggle when applied to out-of-distribution (OOD) datasets. In this paper, a multi-task approach for WSOL is introduced for simultaneous training of both tasks to address the asynchronous convergence problem. In particular, localization is performed in the pixel-feature space of an image encoder that is shared with classification. This allows learning discriminant features and accurate delineation of foreground/background regions to support ROI localization and image classification. We propose PixelCAM, a cost-effective foreground/background pixel-wise classifier in the pixel-feature space that allows for spatial object localization. PixelCAM is trained using pixel pseudo-labels collected from a pretrained WSOL model. Both image and pixel-wise classifiers are trained simultaneously using standard gradient descent. In addition, our pixel classifier can easily be integrated into CNN- and transformer-based architectures without any modifications.

Foundation Models For Seismic Data Processing: An Extensive Review

Fabian Fuchs,Mario Ruben Fernandez,Norman Ettrich,Janis Keuper

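Task: Study the application of foundation models to seismic data processing tasks such as demultiple, interpolation, and denoising.

Motivation: Traditional seismic processing suffers from noisy data and reliance on manual workflows. Deep learning offers new approaches, but most depend on synthetic data and specialized networks. The success of foundation models on natural images suggests they could be applied to the seismic domain.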

Details

Method: Evaluates how different foundation-model characteristics (such as pre-training technique and neural network architecture) affect performance and efficiency, rather than proposing a single model. Result: Examines a range of natural-image foundation models and suggests promising candidates for future exploration. Conclusion: Foundation models show potential for seismic processing but need further research and optimization. Abstract: Seismic processing plays a crucial role in transforming raw data into high-quality subsurface images, pivotal for various geoscience applications. Despite its importance, traditional seismic processing techniques face challenges such as noisy and damaged data and the reliance on manual, time-consuming workflows. The emergence of deep learning approaches has introduced effective and user-friendly alternatives, yet many of these deep learning approaches rely on synthetic datasets and specialized neural networks. Recently, foundation models have gained traction in the seismic domain, due to their success in natural imaging. This paper investigates the application of foundation models in seismic processing on three tasks: demultiple, interpolation, and denoising. It evaluates the impact of different model characteristics, such as pre-training technique and neural network architecture, on performance and efficiency. Rather than proposing a single seismic foundation model, this paper critically examines various natural image foundation models and suggests some promising candidates for future exploration.

Ziming Cheng,Zhiyuan Huang,Junting Pan,Zhaohui Hou,Mingjie Zhan

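Task: Propose a GUI navigation task that supports interactive information completion, to address incompletely conveyed user tasks.

Motivation: Current GUI automation agents do not support immediate user intervention, so agent performance degrades when users omit key information.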

Details

Method: Develops the Navi-plus dataset and a Dual-Stream Trajectory Evaluation method to assess GUI agents' interactive information-completion ability. Result: Agents able to ask GUI follow-up questions fully recover their performance on ambiguous user tasks. Conclusion: Interactive information completion significantly improves GUI agents' performance when tasks are incompletely conveyed. Abstract: Graphical user interface (GUI) automation agents are emerging as powerful tools, enabling humans to accomplish increasingly complex tasks on smart devices. However, users often inadvertently omit key information when conveying tasks, which hinders agent performance in the current agent paradigm that does not support immediate user intervention. To address this issue, we introduce a Self-Correction GUI Navigation task that incorporates interactive information completion capabilities within GUI agents. We developed the Navi-plus dataset with GUI follow-up question-answer pairs, alongside a Dual-Stream Trajectory Evaluation method to benchmark this new capability. Our results show that agents equipped with the ability to ask GUI follow-up questions can fully recover their performance when faced with ambiguous user tasks.

Yingrui Ji,Xi Xiao,Gaofei Chen,Hao Xu,Chenrui Ma,Lijing Zhu,Aokun Liang,Jiansheng Chen

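Task: Propose the Cross-modal Information Bottleneck (CIB) framework, which interprets CLIP's contrastive learning objective as implicit Information Bottleneck optimization, and introduce the Cross-modal Information Bottleneck Regularization (CIBR) method to strengthen semantic alignment.

Motivation: CLIP performs well on cross-modal tasks, but the theoretical basis for its strong generalization remains unclear.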

Details

Method: Proposes the CIB framework, interpreting CLIP's contrastive objective as Information Bottleneck optimization, and designs CIBR to explicitly optimize cross-modal information sharing. Result: CIBR significantly improves CLIP's performance across multiple vision-language benchmarks. Conclusion: The CIB framework is the first to explain CLIP's generalization from an Information Bottleneck perspective and offers practical directions for future cross-modal representation learning. Abstract: Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success in cross-modal tasks such as zero-shot image classification and text-image retrieval by effectively aligning visual and textual representations. However, the theoretical foundations underlying CLIP's strong generalization remain unclear. In this work, we address this gap by proposing the Cross-modal Information Bottleneck (CIB) framework. CIB offers a principled interpretation of CLIP's contrastive learning objective as an implicit Information Bottleneck optimization. Under this view, the model maximizes shared cross-modal information while discarding modality-specific redundancies, thereby preserving essential semantic alignment across modalities. Building on this insight, we introduce a Cross-modal Information Bottleneck Regularization (CIBR) method that explicitly enforces these IB principles during training. CIBR introduces a penalty term to discourage modality-specific redundancy, thereby enhancing semantic alignment between image and text features. We validate CIBR on extensive vision-language benchmarks, including zero-shot classification across seven diverse image datasets and text-image retrieval on MSCOCO and Flickr30K. The results show consistent performance gains over standard CLIP. These findings provide the first theoretical understanding of CLIP's generalization through the IB lens. They also demonstrate practical improvements, offering guidance for future cross-modal representation learning.
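For orientation, the contrastive objective that the abstract reinterprets is the symmetric InfoNCE loss used by CLIP. The sketch below shows only that baseline loss on a small similarity matrix. The function names and the temperature value of 0.07 are illustrative assumptions, and the CIBR penalty term is deliberately not reproduced because the abstract does not give its form.

```python
import math

def diagonal_cross_entropy(sim, temperature):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)

def clip_loss(sim, temperature=0.07):
    """Symmetric InfoNCE: average of image-to-text and text-to-image losses.

    sim[i][j] is the similarity between image i and text j
    (assumed square, with matching pairs on the diagonal).
    """
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (diagonal_cross_entropy(sim, temperature)
                  + diagonal_cross_entropy(transposed, temperature))
```

A regularizer such as CIBR would be added to this loss as an extra term; how that term is computed is specified in the paper, not here.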

DiET-GS: Diffusion Prior and Event Stream-Assisted Motion Deblurring 3D Gaussian Splatting

Seungjun Lee,Gim Hee Lee

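Task: Reconstruct sharp 3D representations from blurry multi-view images.

Motivation: Existing methods that use event cameras to recover from motion blur often yield inaccurate color or lost detail, resulting in unsatisfactory visual quality.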

Details

Method: Proposes the DiET-GS framework, which combines event streams and a diffusion prior, constrains 3DGS through a two-stage training strategy, and uses the diffusion prior to enhance edge details. Result: On synthetic and real-world data, DiET-GS produces novel views of significantly better quality than existing baselines. Conclusion: By combining event streams with a diffusion prior, DiET-GS effectively improves the visual quality of 3D reconstruction. Abstract: Reconstructing sharp 3D representations from blurry multi-view images is a long-standing problem in computer vision. Recent works attempt to enhance high-quality novel view synthesis from motion blur by leveraging event-based cameras, benefiting from high dynamic range and microsecond temporal resolution. However, they often reach sub-optimal visual quality, either restoring inaccurate color or losing fine-grained details. In this paper, we present DiET-GS, a diffusion prior and event stream-assisted motion deblurring 3DGS. Our framework effectively leverages both blur-free event streams and a diffusion prior in a two-stage training strategy. Specifically, we introduce a novel framework to constrain 3DGS with the event double integral, achieving both accurate color and well-defined details. Additionally, we propose a simple technique to leverage the diffusion prior to further enhance the edge details. Qualitative and quantitative results on both synthetic and real-world data demonstrate that our DiET-GS is capable of producing significantly better quality of novel views compared to the existing baselines. Our project page is https://diet-gs.github.io

MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing

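Karim Radouane,Hanane Azzag,Mustapha Lebbah

Task: Propose a unified framework that integrates object detection (OD) and visual grounding (VG) for remote sensing (RS) imagery.

Motivation: Support conventional object detection and establish an intuitive prior for the visual grounding task.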

Details

Method: Fine-tunes an open-set object detector, builds a graph representation of each image, and designs a task-aware architecture comprising a multi-branch network and an object reasoning network. Result: Performs strongly on the OPT-RSVG and DIOR-RSVG datasets, significantly outperforming existing methods. Conclusion: The framework significantly improves visual grounding performance while retaining conventional object detection capability. Abstract: We propose a unified framework that integrates object detection (OD) and visual grounding (VG) for remote sensing (RS) imagery. To support conventional OD and establish an intuitive prior for the VG task, we fine-tune an open-set object detector using referring expression data, framing it as a partially supervised OD task. In the first stage, we construct a graph representation of each image, comprising object queries, class embeddings, and proposal locations. Then, our task-aware architecture processes this graph to perform the VG task. The model consists of: (i) a multi-branch network that integrates spatial, visual, and categorical features to generate task-aware proposals, and (ii) an object reasoning network that assigns probabilities across proposals, followed by a soft selection mechanism for final referring object localization. Our model demonstrates superior performance on the OPT-RSVG and DIOR-RSVG datasets, achieving significant improvements over state-of-the-art methods while retaining classical OD capabilities. The code will be available in our repository: https://github.com/rd20karim/MB-ORES.
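One way to picture "assigning probabilities across proposals, followed by a soft selection" is a softmax over per-proposal scores followed by a probability-weighted average of the proposal boxes. This is only a sketch under that assumption; the abstract does not specify the actual selection rule, and `soft_select` is a hypothetical name.

```python
import math

def soft_select(boxes, scores):
    """Softmax the proposal scores, then average boxes (x1, y1, x2, y2) by probability.

    Assumed form of soft selection; not taken from the MB-ORES code.
    """
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    box = [sum(p * b[i] for p, b in zip(probs, boxes)) for i in range(4)]
    return probs, box
```

Unlike a hard argmax, this weighted average is differentiable with respect to the scores, which is the usual reason for preferring a soft selection during training.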

Pre-training with 3D Synthetic Data: Learning 3D Point Cloud Instance Segmentation from 3D Synthetic Scenes

Daichi Otsuka,Shinichi Mae,Ryosuke Yamada,Hirokatsu Kataoka

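Task: Improve 3D point cloud instance segmentation using generative models.

Motivation: 3D point cloud data is widely used in real-world applications, but annotation is costly; generative models offer a way to generate data.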

Details

Method: Proposes pre-training on 3D synthetic data (generated with Point-E) to support training of 3D point cloud instance segmentation models. Result: Experiments show improved performance over baseline methods. Conclusion: 3D generative models are effective for 3D point cloud instance segmentation. Abstract: In recent years, the research community has witnessed growing use of 3D point cloud data owing to its high applicability in various real-world applications. The point cloud modality makes it possible to account for actual size and spatial structure. The applied fields include mechanical control of robots, vehicles, or other real-world systems. Along this line, we would like to improve 3D point cloud instance segmentation, which has emerged as a particularly promising approach for these applications. However, the creation of 3D point cloud datasets entails enormous costs compared to 2D image datasets. To train a model of 3D point cloud instance segmentation, it is necessary not only to assign categories but also to provide detailed annotations for each point in the large-scale 3D space. Meanwhile, the increase of recent proposals for generative models in the 3D domain has spurred proposals for using a generative model to create 3D point cloud data. In this work, we propose pre-training with 3D synthetic data to train a 3D point cloud instance segmentation model, based on a generative model for 3D scenes represented by point cloud data. We directly generate 3D point cloud data with Point-E and insert the generated data into a 3D scene. Although more accurate 3D generation models are available as of 2025, even Point-E, an early 3D generative model, can effectively support pre-training with 3D synthetic data. In the experimental section, we compare our pre-training method with baseline methods and observe improved performance, demonstrating the efficacy of 3D generative models for 3D point cloud instance segmentation.

Beyond a Single Mode: GAN Ensembles for Diverse Medical Data Generation

Lorenzo Tronchin,Tommy Löfstedt,Paolo Soda,Valerio Guarrasi

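Task: Explore the use of GAN ensembles to address fidelity, diversity, and efficiency in medical image synthesis.

Motivation: Generative adversarial networks (GANs) face mode collapse and insufficient coverage of the real data distribution in medical imaging applications.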

Details

Method: Proposes a method for selecting an optimal GAN ensemble by solving a multi-objective optimization problem that balances fidelity and diversity. Result: Experiments show that GAN ensembles generate diverse, representative synthetic medical images and improve downstream task efficacy. Conclusion: GAN ensembles show considerable potential for medical image synthesis and can overcome the limitations of a single GAN. Abstract: The advancement of generative AI, particularly in medical imaging, confronts the trilemma of ensuring high fidelity, diversity, and efficiency in synthetic data generation. While Generative Adversarial Networks (GANs) have shown promise across various applications, they still face challenges like mode collapse and insufficient coverage of real data distributions. This work explores the use of GAN ensembles to overcome these limitations, specifically in the context of medical imaging. By solving a multi-objective optimisation problem that balances fidelity and diversity, we propose a method for selecting an optimal ensemble of GANs tailored for medical data. The selected ensemble is capable of generating diverse synthetic medical images that are representative of true data distributions and computationally efficient. Each model in the ensemble brings a unique contribution, ensuring minimal redundancy. We conducted a comprehensive evaluation using three distinct medical datasets, testing 22 different GAN architectures with various loss functions and regularisation techniques. By sampling models at different training epochs, we crafted 110 unique configurations. The results highlight the capability of GAN ensembles to enhance the quality and utility of synthetic medical images, thereby improving the efficacy of downstream tasks such as diagnostic modelling.

FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics

Yixuan Li,Yu Tian,Yipo Huang,Wei Lu,Shiqi Wang,Weisi Lin,Anderson Rocha

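Task: Develop FakeScope, a multimodal model that identifies AI-generated images with high accuracy and provides interpretable forensic analysis.

Motivation: Rapid progress in generative AI has produced highly realistic fake content that threatens societal trust; existing detection models lack interpretability and cannot meet the need for transparency and trustworthiness.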

Details

Method: Proposes the FakeScope model, built on the FakeChain dataset and the FakeInstruct instruction-tuning dataset, with a token-based probability estimation strategy. Result: FakeScope achieves the best performance in closed-ended and open-ended forensic scenarios, identifies synthetic images with high accuracy while providing explanations, discussion, and enhancement strategies, and shows zero-shot quantitative detection capability. Conclusion: Through interpretability and generalization, FakeScope offers an efficient and practical solution for AI-generated image forensics. Abstract: The rapid and unrestrained advancement of generative artificial intelligence (AI) presents a double-edged sword: while enabling unprecedented creativity, it also facilitates the generation of highly convincing deceptive content, undermining societal trust. As image generation techniques become increasingly sophisticated, detecting synthetic images is no longer just a binary task: it necessitates interpretable, context-aware methodologies that enhance trustworthiness and transparency. However, existing detection models primarily focus on classification, offering limited explanatory insights into image authenticity. In this work, we propose FakeScope, an expert large multimodal model (LMM) tailored for AI-generated image forensics, which not only identifies AI-synthetic images with high accuracy but also provides rich, interpretable, and query-driven forensic insights. We first construct the FakeChain dataset, which contains linguistic authenticity reasoning based on visual trace evidence, developed through a novel human-machine collaborative framework. Building upon it, we further present FakeInstruct, the largest multimodal instruction tuning dataset containing 2 million visual instructions tailored to enhance forensic awareness in LMMs. FakeScope achieves state-of-the-art performance in both closed-ended and open-ended forensic scenarios. It can distinguish synthetic images with high accuracy while offering coherent and insightful explanations, free-form discussions on fine-grained forgery attributes, and actionable enhancement strategies. Notably, despite being trained exclusively on qualitative hard labels, FakeScope demonstrates remarkable zero-shot quantitative capability on detection, enabled by our proposed token-based probability estimation strategy. Furthermore, FakeScope exhibits strong generalization and in-the-wild ability, ensuring its applicability in real-world scenarios.

Visual Acoustic Fields

Yuelei Li,Hyunjin Kim,Fangneng Zhan,Ri-Zhao Qiu,Mazeyu Ji,Xiaojun Shan,Xueyan Zou,Paul Liang,Hanspeter Pfister,Xiaolong Wang

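Task: Propose Visual Acoustic Fields, a framework that links visual signals and hitting sounds in 3D space via 3D Gaussian Splatting (3DGS).

Motivation: Humans can intuitively infer an object's hitting sound from its appearance and material; inspired by this, the work studies how to generate and localize sounds from visual signals.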

Details

Method: Uses two key modules: a sound generation module (based on a conditional diffusion model) and a sound localization module (which queries the 3D scene through a feature-augmented 3DGS). Result: Experiments show the framework generates realistic hitting sounds and accurately localizes impact sources. Conclusion: Visual Acoustic Fields is the first framework to connect visual and acoustic signals in a 3D context and provides the first dataset of this kind. Abstract: Objects produce different sounds when hit, and humans can intuitively infer how an object might sound based on its appearance and material properties. Inspired by this intuition, we propose Visual Acoustic Fields, a framework that bridges hitting sounds and visual signals within a 3D space using 3D Gaussian Splatting (3DGS). Our approach features two key modules: sound generation and sound localization. The sound generation module leverages a conditional diffusion model, which takes multiscale features rendered from a feature-augmented 3DGS to generate realistic hitting sounds. Meanwhile, the sound localization module enables querying the 3D scene, represented by the feature-augmented 3DGS, to localize hitting positions based on the sound sources. To support this framework, we introduce a novel pipeline for collecting scene-level visual-sound sample pairs, achieving alignment between captured images, impact locations, and corresponding sounds. To the best of our knowledge, this is the first dataset to connect visual and acoustic signals in a 3D context. Extensive experiments on our dataset demonstrate the effectiveness of Visual Acoustic Fields in generating plausible impact sounds and accurately localizing impact sources. Our project page is at https://yuelei0428.github.io/projects/Visual-Acoustic-Fields/.

Learning Velocity and Acceleration: Self-Supervised Motion Consistency for Pedestrian Trajectory Prediction

Yizhou Huang,Yihua Cheng,Kezhi Wang

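Task: Propose a self-supervised pedestrian trajectory prediction framework that improves accuracy by jointly modeling position, velocity, and acceleration.

Motivation: Conventional supervised methods struggle to capture abnormal behaviors under long-tailed data distributions, motivating a self-supervised approach to trajectory prediction.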

Details

Method: Incorporates velocity and acceleration information into position prediction through feature injection and a self-supervised motion consistency mechanism, and designs a physics-grounded motion consistency evaluation strategy. Result: Achieves state-of-the-art performance on the ETH-UCY and Stanford Drone datasets. Conclusion: The proposed self-supervised framework improves pedestrian trajectory prediction accuracy, especially for abnormal behaviors. Abstract: Understanding human motion is crucial for accurate pedestrian trajectory prediction. Conventional methods typically rely on supervised learning, where predicted trajectories are directly optimized against ground-truth labels. This amplifies the limitations caused by long-tailed data distributions, making it difficult for the model to capture abnormal behaviors. In this work, we propose a self-supervised pedestrian trajectory prediction framework that explicitly models position, velocity, and acceleration. We leverage velocity and acceleration information to enhance position prediction through feature injection and a self-supervised motion consistency mechanism. Our model hierarchically injects velocity features into the position stream. Acceleration features are injected into the velocity stream. This enables the model to predict position, velocity, and acceleration jointly. From the predicted position, we compute corresponding pseudo velocity and acceleration, allowing the model to learn from data-generated pseudo labels and thus achieve self-supervised learning. We further design a motion consistency evaluation strategy grounded in physical principles; it selects the most reasonable predicted motion trend by comparing it with historical dynamics and uses this trend to guide and constrain trajectory generation. We conduct experiments on the ETH-UCY and Stanford Drone datasets, demonstrating that our method achieves state-of-the-art performance on both datasets.
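The pseudo-label step described in the abstract (computing velocity and acceleration from predicted positions) can be sketched with first-order finite differences. This is a minimal illustration, assuming a fixed frame interval `dt`; the default of 0.4 s is the sampling interval commonly used with ETH-UCY and is an assumption here, as are the function names.

```python
def finite_difference(seq, dt):
    """Forward difference between consecutive points, per coordinate."""
    return [tuple((b - a) / dt for a, b in zip(p, q))
            for p, q in zip(seq, seq[1:])]

def pseudo_labels(positions, dt=0.4):
    """Derive pseudo velocity and acceleration from a predicted trajectory.

    positions: list of (x, y) tuples, one per time step.
    Returns (velocities, accelerations); each list is one step shorter
    than the sequence it was derived from.
    """
    velocities = finite_difference(positions, dt)
    accelerations = finite_difference(velocities, dt)
    return velocities, accelerations
```

In a training loop, the model's directly predicted velocity and acceleration would then be compared against these derived values, which is what makes the supervision signal self-generated.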

Style Quantization for Data-Efficient GAN Training

Jian Wang,Xin Lan,Jizhe Zhou,Yuxin Tian,Jiancheng Lv

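Task: Improve consistency regularization (CR) for generative adversarial networks (GANs) in limited-data settings by quantizing the style space.

Motivation: With limited data, GANs struggle to exploit the input latent space effectively, so images generated from adjacent latent variables differ markedly in realism and CR performs poorly.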

Details

Method: Proposes SQ-GAN, which transforms the sparse, continuous input latent space into a compact, structured discrete proxy space, quantizes with a learnable codebook, and optimizes the optimal transport distance to embed external knowledge. Result: Experiments show significant improvements in discriminator robustness and generation quality. Conclusion: By quantizing the style space and embedding external knowledge, SQ-GAN improves CR performance and generation quality for GANs under limited data. Abstract: In limited-data settings, GANs often struggle to navigate and effectively exploit the input latent space. Consequently, images generated from adjacent variables in a sparse input latent space may exhibit significant discrepancies in realism, leading to suboptimal consistency regularization (CR) outcomes. To address this, we propose SQ-GAN, a novel approach that enhances CR by introducing a style space quantization scheme. This method transforms the sparse, continuous input latent space into a compact, structured discrete proxy space, allowing each element to correspond to a specific real data point, thereby improving CR performance. Instead of direct quantization, we first map the input latent variables into a less entangled "style" space and apply quantization using a learnable codebook. This enables each quantized code to control distinct factors of variation. Additionally, we optimize the optimal transport distance to align the codebook codes with features extracted from the training data by a foundation model, embedding external knowledge into the codebook and establishing a semantically rich vocabulary that properly describes the training dataset. Extensive experiments demonstrate significant improvements in both discriminator robustness and generation quality with our method.
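Quantization against a codebook is commonly implemented as a nearest-neighbor lookup, as in VQ-style models. The sketch below assumes that lookup rule; the abstract does not state which rule SQ-GAN uses, and the codebook here is a fixed list rather than a learned parameter.

```python
def quantize(style, codebook):
    """Return (index, code) of the codebook entry closest to `style`.

    Uses squared Euclidean distance; this lookup rule is an assumption.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)),
                key=lambda k: squared_distance(style, codebook[k]))
    return index, codebook[index]


# Illustrative usage with a hypothetical three-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
index, code = quantize([0.9, 1.2], codebook)
```

Every continuous style vector is thereby replaced by one of a small number of discrete codes, which is the "compact, structured discrete proxy space" the abstract refers to.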

Order Matters: On Parameter-Efficient Image-to-Video Probing for Recognizing Nearly Symmetric Actions

Thinesh Thiyakesan Ponbagavathi,Alina Roitberg

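Task: Study parameter-efficient image-to-video probing for recognizing nearly symmetric actions (visually similar actions that unfold in opposite temporal order).

Motivation: Existing probing mechanisms for image-pretrained models (such as DinoV2 and CLIP) rely on attention for temporal modeling, but attention's inherent permutation invariance yields identical predictions regardless of frame order.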

Details

Method: Proposes Self-attentive Temporal Embedding Probing (STEP), which increases temporal sensitivity through learnable frame-wise positional encoding, a global CLS token, and a simplified attention mechanism. Result: STEP outperforms existing methods by 3-15% on four activity recognition benchmarks with only 1/3 of the parameters; on two datasets it surpasses all published methods, including fully fine-tuned models; its advantage is clear on nearly symmetric actions, with gains of 9-19%. Conclusion: STEP is a simple, parameter-efficient image-to-video transfer method that markedly improves temporal sensitivity and recognition performance. Abstract: We study parameter-efficient image-to-video probing for the unaddressed challenge of recognizing nearly symmetric actions - visually similar actions that unfold in opposite temporal order (e.g., opening vs. closing a bottle). Existing probing mechanisms for image-pretrained models, such as DinoV2 and CLIP, rely on attention mechanisms for temporal modeling but are inherently permutation-invariant, leading to identical predictions regardless of frame order. To address this, we introduce Self-attentive Temporal Embedding Probing (STEP), a simple yet effective approach designed to enforce temporal sensitivity in parameter-efficient image-to-video transfer. STEP enhances self-attentive probing with three key modifications: (1) a learnable frame-wise positional encoding, explicitly encoding temporal order; (2) a single global CLS token, for sequence coherence; and (3) a simplified attention mechanism to improve parameter efficiency. STEP outperforms existing image-to-video probing mechanisms by 3-15% across four activity recognition benchmarks with only 1/3 of the learnable parameters. On two datasets, it surpasses all published methods, including fully fine-tuned models. STEP shows a distinct advantage in recognizing nearly symmetric actions, surpassing other probing mechanisms by 9-19% and parameter-heavier PEFT-based transfer methods by 5-15%. Code and models will be made publicly available.
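The permutation-invariance problem can be reproduced in a few lines. The sketch below uses single-query attention pooling over frame features, the simplest form of attentive probing, and shows that adding a per-position vector to each frame makes the output depend on frame order. It is a toy illustration, not the STEP architecture; the fixed position vectors stand in for what would be learned parameters.

```python
import math

def attention_pool(frames, query):
    """Softmax-weighted average of frame features for one query vector."""
    scores = [sum(q * x for q, x in zip(query, f)) for f in frames]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) / z
            for d in range(dim)]

def add_positions(frames, positions):
    """Add the t-th position vector to the t-th frame feature."""
    return [[x + p for x, p in zip(f, pe)] for f, pe in zip(frames, positions)]


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, -0.25]
positions = [[0.0, 0.0], [0.1, 0.2], [0.3, -0.4]]  # hypothetical stand-in for learned encodings

# Without positions, forward and reversed clips pool to the same vector.
plain_forward = attention_pool(frames, query)
plain_reversed = attention_pool(frames[::-1], query)

# With positions, the two orders yield different pooled vectors.
pos_forward = attention_pool(add_positions(frames, positions), query)
pos_reversed = attention_pool(add_positions(frames[::-1], positions), query)
```

Because a classifier on top of `plain_forward` and `plain_reversed` receives the same input (up to floating-point rounding), it cannot separate "opening" from "closing"; the positional term removes that symmetry.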

Point Tracking in Surgery--The 2024 Surgical Tattoos in Infrared (STIR) Challenge

Adam Schmidt,Mert Asim Karaoglu,Soham Sinha,Mingang Jang,Ho-Gun Ha,Kyungmin Jung,Kyeongmo Gu,Ihsan Ullah,Hyunki Lee,Jonáš Šerých,Michal Neoral,Jiří Matas,Rulin Zhou,Wenlong He,An Wang,Hongliang Ren,Bruno Silva,Sandro Queirós,Estêvão Lima,João L. Vilaça,Shunsuke Kikuchi,Atsushi Kouno,Hiroki Matsuzaki,Tongtong Li,Yulu Chen,Ling Li,Xiang Ma,Xiaojian Li,Mona Sheikh Zeinoddin,Xu Wang,Zafer Tandogdu,Greg Shaw,Evangelos Mazomenos,Danail Stoyanov,Yuxin Chen,Zijian Wu,Alexander Ladikos,Simon DiMaio,Septimiu E. Salcudean,Omid Mohareri

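Task: Introduce and organize the STIR Challenge 2024 to advance the accuracy and efficiency of algorithms for understanding tissue motion in surgery.

Motivation: Understanding tissue motion in surgery is essential for downstream tasks such as segmentation and 3D reconstruction, and labeled data are key to training and quantifying algorithms.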

Details

Method: Runs a point tracking challenge in which participants submit algorithms that are quantitatively evaluated on the STIR dataset along two axes: accuracy and efficiency. Result: Eight teams took part, four submitting before challenge day and four after. Conclusion: The STIR Challenge 2024 pushes the field toward more accurate and efficient algorithms for spatial understanding in surgery; the paper summarizes the challenge's design, submissions, and results. Abstract: Understanding tissue motion in surgery is crucial to enable applications in downstream tasks such as segmentation, 3D reconstruction, virtual tissue landmarking, autonomous probe-based scanning, and subtask autonomy. Labeled data are essential to enabling algorithms in these downstream tasks since they allow us to quantify and train algorithms. This paper introduces a point tracking challenge to address this, wherein participants can submit their algorithms for quantification. The submitted algorithms are evaluated using a dataset named surgical tattoos in infrared (STIR), with the challenge aptly named the STIR Challenge 2024. The STIR Challenge 2024 comprises two quantitative components: accuracy and efficiency. The accuracy component tests the accuracy of algorithms on in vivo and ex vivo sequences. The efficiency component tests the latency of algorithm inference. The challenge was conducted as a part of MICCAI EndoVis 2024. In this challenge, we had 8 total teams, with 4 teams submitting before and 4 submitting after challenge day. This paper details the STIR Challenge 2024, which serves to move the field towards more accurate and efficient algorithms for spatial understanding in surgery. In this paper we summarize the design, submissions, and results from the challenge. The challenge dataset is available here: https://zenodo.org/records/14803158, and the code for baseline models and metric calculation is available here: https://github.com/athaddius/STIRMetrics

Can Test-Time Scaling Improve World Foundation Model?

Wenyan Cong,Hanqing Zhu,Peihao Wang,Bangya Liu,Dejia Xu,Kevin Wang,David Z. Pan,Yan Wang,Zhiwen Fan,Zhangyang Wang

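Task: Propose SWIFT, a test-time scaling framework for improving world foundation model (WFM) inference.

Motivation: World foundation models are central to physical intelligence applications, but pretraining and post-training demand substantial compute and are limited by data availability, making test-time scaling a critical and practical alternative.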

Details

Method: SWIFT combines an extensible WFM evaluation toolkit with process-level inference strategies, including fast tokenization, probability-based Top-K pruning, and efficient beam search. Result: Empirical results on the COSMOS model show that test-time scaling holds even in a compute-optimal setting, and that SWIFT offers a scalable, effective way to improve inference without retraining or increasing model size. Conclusion: SWIFT shows that test-time scaling laws hold for WFMs and provides a practical way to improve WFM inference. Abstract: World foundation models, which simulate the physical world by predicting future states from current observations and inputs, have become central to many applications in physical intelligence, including autonomous driving and robotics. However, these models require substantial computational resources for pretraining and are further constrained by available data during post-training. As such, scaling computation at test time emerges as both a critical and practical alternative to traditional model enlargement or re-training. In this work, we introduce SWIFT, a test-time scaling framework tailored for WFMs. SWIFT integrates our extensible WFM evaluation toolkit with process-level inference strategies, including fast tokenization, probability-based Top-K pruning, and efficient beam search. Empirical results on the COSMOS model demonstrate that test-time scaling exists even in a compute-optimal way. Our findings reveal that test-time scaling laws hold for WFMs and that SWIFT provides a scalable and effective pathway for improving WFM inference without retraining or increasing model size. The code is available at https://github.com/Mia-Cong/SWIFT.git.

Self-Supervised Pretraining for Aerial Road Extraction

Rupert Polley,Sai Vignesh Abishek Deenadayalan,J. Marius Zöllner

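Task: Propose a self-supervised pretraining method that reduces reliance on labeled data and improves aerial image segmentation performance.

Motivation: High-quality aerial image datasets are scarce and costly to annotate, limiting the use of deep neural networks for aerial image segmentation.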

Details

Method: Uses inpainting-based pretraining, in which the model learns to reconstruct missing regions of aerial images and capture their inherent structure, and is then fine-tuned for road extraction. Result: Experiments show significantly higher segmentation accuracy, especially with little data, along with better generalization and robustness to domain shifts. Conclusion: The method offers a scalable solution for aerial image analysis and reduces reliance on labeled data. Abstract: Deep neural networks for aerial image segmentation require large amounts of labeled data, but high-quality aerial datasets with precise annotations are scarce and costly to produce. To address this limitation, we propose a self-supervised pretraining method that improves segmentation performance while reducing reliance on labeled data. Our approach uses inpainting-based pretraining, where the model learns to reconstruct missing regions in aerial images, capturing their inherent structure before being fine-tuned for road extraction. This method improves generalization, enhances robustness to domain shifts, and is invariant to model architecture and dataset choice. Experiments show that our pretraining significantly boosts segmentation accuracy, especially in low-data regimes, making it a scalable solution for aerial image analysis.

PathOrchestra: A Comprehensive Foundation Model for Computational Pathology with Over 100 Diverse Clinical-Grade Tasks

Fang Yan,Jianfeng Wu,Jiawen Li,Wei Wang,Jiaxuan Lu,Wen Chen,Zizhao Gao,Jianan Li,Hong Yan,Jiabo Ma,Minda Chen,Yang Lu,Qing Chen,Yizhi Wang,Xitong Ling,Xuenian Wang,Zihan Wang,Qiang Huang,Shengyi Hua,Mianxin Liu,Lei Ma,Tian Shen,Xiaofan Zhang,Yonghong He,Hao Chen,Shaoting Zhang,Zhe Wang

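Task: Develop and validate PathOrchestra, a versatile pathology foundation model for handling the complexity and variability of high-resolution pathological images.

Motivation: The complexity and variability of high-resolution pathological images challenge computational pathology; existing AI models require large-scale datasets and resources, and their clinical applicability and generalizability need rigorous validation.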

Details

Method: Trains PathOrchestra with self-supervised learning on a dataset of 300K pathological slides and validates it on 112 clinical tasks using 61 private and 51 public datasets. Result: PathOrchestra performs strongly across 27,755 WSIs and 9,415,729 ROIs, exceeding 0.950 accuracy on 47 tasks, including pan-cancer classification and complex cancer subtype diagnosis. Conclusion: PathOrchestra demonstrates the feasibility and efficacy of a large-scale self-supervised pathology foundation model, with potential for clinical integration and for improving the efficiency and quality of medical services. Abstract: The complexity and variability inherent in high-resolution pathological images present significant challenges in computational pathology. While pathology foundation models leveraging AI have catalyzed transformative advancements, their development demands large-scale datasets, considerable storage capacity, and substantial computational resources. Furthermore, ensuring their clinical applicability and generalizability requires rigorous validation across a broad spectrum of clinical tasks. Here, we present PathOrchestra, a versatile pathology foundation model trained via self-supervised learning on a dataset comprising 300K pathological slides from 20 tissue and organ types across multiple centers. The model was rigorously evaluated on 112 clinical tasks using a combination of 61 private and 51 public datasets. These tasks encompass digital slide preprocessing, pan-cancer classification, lesion identification, multi-cancer subtype classification, biomarker assessment, gene expression prediction, and the generation of structured reports. PathOrchestra demonstrated exceptional performance across 27,755 WSIs and 9,415,729 ROIs, achieving over 0.950 accuracy in 47 tasks, including pan-cancer classification across various organs, lymphoma subtype diagnosis, and bladder cancer screening. Notably, it is the first model to generate structured reports for high-incidence colorectal cancer and diagnostically complex lymphoma, areas that are infrequently addressed by foundational models but hold immense clinical potential. Overall, PathOrchestra exemplifies the feasibility and efficacy of a large-scale, self-supervised pathology foundation model, validated across a broad range of clinical-grade tasks. Its high accuracy and reduced reliance on extensive data annotation underline its potential for clinical integration, offering a pathway toward more efficient and high-quality medical services.

InstructRestore: Region-Customized Image Restoration with Human Instructions

Shuaizheng Liu,Jianqi Ma,Lingchen Sun,Xiangtao Kong,Lei Zhang

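Task: Propose InstructRestore, a new framework for region-adjustable image restoration driven by user instructions.

Motivation: Existing diffusion prior-based image restoration methods cannot perform region-customized restoration according to user instructions.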

Details

Method: Develops a data generation engine to build a dataset of training triplets and integrates low-quality image features under the ControlNet architecture to achieve region-customized restoration. Result: Experiments show InstructRestore enables effective instruction-driven restoration, such as bokeh effects and local enhancement. Conclusion: The work advances research on interactive image restoration and enhancement. Abstract: Despite the significant progress in diffusion prior-based image restoration, most existing methods apply uniform processing to the entire image, lacking the capability to perform region-customized image restoration according to user instructions. In this work, we propose a new framework, namely InstructRestore, to perform region-adjustable image restoration following human instructions. To achieve this, we first develop a data generation engine to produce training triplets, each consisting of a high-quality image, the target region description, and the corresponding region mask. With this engine and careful data screening, we construct a comprehensive dataset comprising 536,945 triplets to support the training and evaluation of this task. We then examine how to integrate the low-quality image features under the ControlNet architecture to adjust the degree of image detail enhancement. Consequently, we develop a ControlNet-like model to identify the target region and allocate different integration scales to the target and surrounding regions, enabling region-customized image restoration that aligns with user instructions. Experimental results demonstrate that our proposed InstructRestore approach enables effective human-instructed image restoration, such as images with bokeh effects and user-instructed local enhancement. Our work advances the investigation of interactive image restoration and enhancement techniques. Data, code, and models will be found at https://github.com/shuaizhengliu/InstructRestore.git.

StochasticSplats: Stochastic Rasterization for Sorting-Free 3D Gaussian Splatting

Shakiba Kheradmand,Delio Vicini,George Kopanas,Dmitry Lagun,Kwang Moo Yi,Mark Matthews,Andrea Tagliasacchi

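Task: Combine 3D Gaussian splatting with stochastic rasterization to address the limitations of existing methods in rendering efficiency and visual fidelity.

Motivation: Existing 3D Gaussian splatting methods rely on sorted rendering, which can cause rendering artifacts and offers little control over the trade-off between render cost and visual quality.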

Details

Method: Uses an unbiased Monte Carlo estimator to implement stochastic rasterization, removing the need for sorting and supporting accurate blending of overlapping Gaussians. Result: At reasonable visual quality, rendering is more than four times faster than sorted rasterization. Conclusion: Stochastic rasterization provides an efficient and flexible rendering approach for 3D Gaussian splatting. Abstract: 3D Gaussian splatting (3DGS) is a popular radiance field method, with many application-specific extensions. Most variants rely on the same core algorithm: depth-sorting the Gaussian splats, then rasterizing them in primitive order. This ensures correct alpha compositing, but can cause rendering artifacts due to built-in approximations. Moreover, for a fixed representation, sorted rendering offers little control over render cost and visual fidelity. For example, and counter-intuitively, rendering a lower-resolution image is not necessarily faster. In this work, we address the above limitations by combining 3D Gaussian splatting with stochastic rasterization. Concretely, we leverage an unbiased Monte Carlo estimator of the volume rendering equation. This removes the need for sorting, and allows for accurate 3D blending of overlapping Gaussians. The number of Monte Carlo samples further imbues 3DGS with a way to trade off computation time and quality. We implement our method using OpenGL shaders, enabling efficient rendering on modern GPU hardware. At a reasonable visual quality, our method renders more than four times faster than sorted rasterization.
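To see why a Monte Carlo estimator can replace sorting, consider a single pixel covered by a few splats with constant opacity. In the sketch below, each sample keeps a splat with probability equal to its alpha and records the nearest kept splat, as a depth buffer would. The probability that splat i wins is its alpha times the product of (1 - alpha) over nearer splats, which is exactly its weight in sorted alpha compositing. This is a stochastic-transparency-style toy under those simplifying assumptions (scalar colors, constant alpha, black background), not the authors' OpenGL implementation.

```python
import random

def sorted_composite(splats):
    """Reference: front-to-back alpha compositing after a depth sort.

    splats: list of (depth, alpha, value) tuples for one pixel.
    """
    color, transmittance = 0.0, 1.0
    for depth, alpha, value in sorted(splats):
        color += transmittance * alpha * value
        transmittance *= 1.0 - alpha
    return color

def stochastic_composite(splats, num_samples, rng):
    """Order-independent estimate of the same pixel value."""
    total = 0.0
    for _ in range(num_samples):
        nearest_depth, nearest_value = float("inf"), 0.0  # background is 0
        for depth, alpha, value in splats:  # processed in arbitrary order
            # Keep the splat with probability alpha, then apply a depth test.
            if rng.random() < alpha and depth < nearest_depth:
                nearest_depth, nearest_value = depth, value
        total += nearest_value
    return total / num_samples


splats = [(2.0, 0.5, 1.0), (1.0, 0.25, 0.0), (3.0, 0.8, 0.5)]
exact = sorted_composite(splats)
estimate = stochastic_composite(splats, 20000, random.Random(0))
```

Increasing `num_samples` lowers the variance of `estimate` at proportionally higher cost, which mirrors the time/quality trade-off mentioned in the abstract.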

Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation

Xiaoran Zhang,Eric Z. Chen,Lin Zhao,Xiao Chen,Yikang Liu,Boris Maihe,James S. Duncan,Terrence Chen,Shanhui Sun

Task: Propose a real-time ultrasound image segmentation method based on hierarchical vision foundation models.

Motivation: Existing ultrasound segmentation methods adapt poorly to new tasks and depend on costly manual annotation, while real-time methods struggle to reach state-of-the-art performance.

Details

Method: Use the Hiera vision foundation model to extract multi-scale features, combine them with DINOv2 representations to enhance visual expressiveness, and decode them into precise segmentations. Result: Strong performance on six public datasets and one in-house dataset; in the 1% and 10% data settings the method surpasses nnUNet by over 20% on average, with an inference speed of about 77 FPS. Conclusion: The method outperforms existing techniques in both speed and accuracy and is suitable for real-time clinical use. Abstract: We propose a novel approach that adapts hierarchical vision foundation models for real-time ultrasound image segmentation. Existing ultrasound segmentation methods often struggle with adaptability to new tasks, relying on costly manual annotations, while real-time approaches generally fail to match state-of-the-art performance. To overcome these limitations, we introduce an adaptive framework that leverages the vision foundation model Hiera to extract multi-scale features, interleaved with DINOv2 representations to enhance visual expressiveness. These enriched features are then decoded to produce precise and robust segmentation. We conduct extensive evaluations on six public datasets and one in-house dataset, covering both cardiac and thyroid ultrasound segmentation. Experiments show that our approach outperforms state-of-the-art methods across multiple datasets and excels with limited supervision, surpassing nnUNet by over 20% on average in the 1% and 10% data settings. Our method achieves ~77 FPS inference speed with TensorRT on a single GPU, enabling real-time clinical applications.

ERUPT: Efficient Rendering with Unposed Patch Transformer

Maxim V. Shugaev,Vincent Chen,Maxim Karrenbach,Kyle Ashley,Bridget Kennedy,Naresh P. Cuntoor

Task: Address novel view synthesis for diverse scenes from small collections of RGB images.

Motivation: Existing methods require dense imagery and precise camera poses, which limits practical use; ERUPT aims to render scenes efficiently from unposed images.

Details

Method: Propose the ERUPT model, which uses patch-based queries (rather than pixel-based queries) and a learned latent camera pose to reduce compute requirements. Result: ERUPT renders at 600 fps on commercial hardware, needs as few as five unposed input images, outperforms existing methods, reduces labeled data requirements by about 95%, and cuts computational requirements by an order of magnitude. Conclusion: ERUPT provides efficient novel view synthesis for diverse real-world scenes while substantially lowering data and compute requirements. Abstract: This work addresses the problem of novel view synthesis in diverse scenes from small collections of RGB images. We propose ERUPT (Efficient Rendering with Unposed Patch Transformer), a state-of-the-art scene reconstruction model capable of efficient scene rendering using unposed imagery. We introduce patch-based querying, in contrast to existing pixel-based queries, to reduce the compute required to render a target view. This makes our model highly efficient both during training and at inference, capable of rendering at 600 fps on commercial hardware. Notably, our model is designed to use a learned latent camera pose which allows for training using unposed targets in datasets with sparse or inaccurate ground truth camera pose. We show that our approach can generalize on large real-world data and introduce a new benchmark dataset (MSVS-1M) for latent view synthesis using street-view imagery collected from Mapillary. In contrast to NeRF and Gaussian Splatting, which require dense imagery and precise metadata, ERUPT can render novel views of arbitrary scenes with as few as five unposed input images. ERUPT achieves better rendered image quality than current state-of-the-art methods for unposed image synthesis tasks, reduces labeled data requirements by ~95% and decreases computational requirements by an order of magnitude, providing efficient novel view synthesis for diverse real-world scenes.

Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1

Yi Chen,Yuying Ge,Rui Wang,Yixiao Ge,Lu Qiu,Ying Shan,Xihui Liu

Task: Systematically evaluate post-training methods for multimodal large language models (MLLMs) on video understanding tasks.

Motivation: The potential of MLLMs on tasks requiring both perception and logical reasoning remains underexplored, and a dedicated benchmark is needed to evaluate them.

Details

Method: Introduce the SEED-Bench-R1 benchmark, which contains intricate videos and multiple-choice tasks, and compare reinforcement learning (RL) with supervised fine-tuning (SFT). Result: RL outperforms SFT in data efficiency and performance, but its reasoning chains are less logically coherent. Conclusion: RL performs well on video understanding, but reasoning ability and reward modeling need improvement to increase logical coherence. Abstract: Recent advancements in Chain of Thought (CoT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks requiring both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions, requiring sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy: in-distribution, cross-environment, and cross-environment-task scenarios, equipped with a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT), demonstrating RL's data efficiency and superior performance on both in-distribution and out-of-distribution tasks, even outperforming SFT on general video understanding benchmarks like LongVideoBench. Our detailed analysis reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness against noisy signals.

Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation

Shengqiong Wu,Weicai Ye,Jiahao Wang,Quande Liu,Xintao Wang,Pengfei Wan,Di Zhang,Kun Gai,Shuicheng Yan,Hao Fei,Tat-Seng Chua

Task: Propose Any2Caption, a framework for controllable video generation under any condition.

Motivation: Address the bottleneck of accurately interpreting user intent in current video generation.

Details

Method: Decouple condition interpretation from video synthesis, using multimodal large language models (MLLMs) to interpret diverse inputs (text, images, videos, and specialized cues such as region, motion, and camera pose) into dense, structured captions that give video generators better guidance. Result: With instruction tuning on the large-scale Any2CapIns dataset (337K instances and 407K conditions), the system shows significant improvements in controllability and video quality. Conclusion: Any2Caption improves existing video generation models across various aspects. Abstract: To address the bottleneck of accurate user intent interpretation within the current video generation community, we present Any2Caption, a novel framework for controllable video generation under any condition. The key idea is to decouple various condition interpretation steps from the video synthesis step. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs--text, images, videos, and specialized cues such as region, motion, and camera poses--into dense, structured captions that offer backbone video generators better guidance. We also introduce Any2CapIns, a large-scale dataset with 337K instances and 407K conditions for any-condition-to-caption instruction tuning. Comprehensive evaluations demonstrate significant improvements of our system in controllability and video quality across various aspects of existing video generation models. Project Page: https://sqwu.top/Any2Cap/

UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving

Yuping Wang,Xiangyu Huang,Xiaokang Sun,Mingxuan Yan,Shuo Xing,Zhengzhong Tu,Jiachen Li

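Task: Propose UniOcc, a unified benchmark for occupancy forecasting (predicting future occupancy from historical information) and current-frame occupancy prediction from camera images.

Motivation: Unify data from multiple sources (real-world datasets such as nuScenes and Waymo, and high-fidelity driving simulators such as CARLA and OpenCOOD) and provide 2D/3D occupancy labels with per-voxel flow annotations, with support for cooperative autonomous driving.

Details

Method: Integrate multi-source data, introduce new evaluation metrics that do not depend on ground-truth occupancy, and improve performance with large-scale, diverse training data and explicit flow information. Result: Experiments show that large-scale, diverse data and explicit flow information significantly improve occupancy prediction and forecasting performance. Conclusion: UniOcc provides a unified benchmark for occupancy prediction and forecasting, and its new metrics and diverse data make evaluation more robust. Abstract: We introduce UniOcc, a comprehensive, unified benchmark for occupancy forecasting (i.e., predicting future occupancies based on historical information) and current-frame occupancy prediction from camera images. UniOcc unifies data from multiple real-world datasets (i.e., nuScenes, Waymo) and high-fidelity driving simulators (i.e., CARLA, OpenCOOD), providing 2D/3D occupancy labels with per-voxel flow annotations and support for cooperative autonomous driving. In terms of evaluation, unlike existing studies that rely on suboptimal pseudo labels for evaluation, UniOcc incorporates novel metrics that do not depend on ground-truth occupancy, enabling robust assessment of additional aspects of occupancy quality. Through extensive experiments on state-of-the-art models, we demonstrate that large-scale, diverse training data and explicit flow information significantly enhance occupancy prediction and forecasting performance.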

Free360: Layered Gaussian Splatting for Unbounded 360-Degree View Synthesis from Extremely Sparse and Unposed Views

Chong Bao,Xiyu Zhang,Zehao Yu,Jiale Shi,Guofeng Zhang,Songyou Peng,Zhaopeng Cui

Task: Propose a novel neural rendering framework for 3D reconstruction of unbounded 360-degree scenes from unposed, extremely sparse views.

Motivation: Existing neural rendering methods perform well with dense input views and accurate poses, but extremely sparse, unposed views of unbounded 360-degree scenes remain challenging.

Details

Method: Model the scene with a layered Gaussian representation, recover coarse geometry with a dense stereo reconstruction model, and improve reconstruction quality through layer-specific bootstrap optimization and an iterative fusion of reconstruction and generation. Result: Experiments show that the method outperforms existing state-of-the-art methods in rendering quality and surface reconstruction accuracy. Conclusion: The proposed framework effectively handles unbounded scene reconstruction from unposed sparse views, with clear performance gains. Abstract: Neural rendering has demonstrated remarkable success in high-quality 3D neural reconstruction and novel view synthesis with dense input views and accurate poses. However, applying it to extremely sparse, unposed views in unbounded 360° scenes remains a challenging problem. In this paper, we propose a novel neural rendering framework to accomplish the unposed and extremely sparse-view 3D reconstruction in unbounded 360° scenes. To resolve the spatial ambiguity inherent in unbounded scenes with sparse input views, we propose a layered Gaussian-based representation to effectively model the scene with distinct spatial layers. By employing a dense stereo reconstruction model to recover coarse geometry, we introduce a layer-specific bootstrap optimization to refine the noise and fill occluded regions in the reconstruction. Furthermore, we propose an iterative fusion of reconstruction and generation alongside an uncertainty-aware training approach to facilitate mutual conditioning and enhancement between these two processes. Comprehensive experiments show that our approach outperforms existing state-of-the-art methods in terms of rendering quality and surface reconstruction accuracy. Project page: https://zju3dv.github.io/free360/

Consistent Subject Generation via Contrastive Instantiated Concepts

Lee Hsin-Ying,Kelvin C. K. Chan,Ming-Hsuan Yang

Task: Propose Contrastive Concept Instantiation (CoCoIns), a method for synthesizing consistent subjects across multiple independently generated images.

Motivation: Subject variation limits existing methods in long content generation, and they require time-consuming tuning, references for all subjects, or access to other creations.

Details

Method: Combine a generative model with a mapping network, and train the network with contrastive learning to differentiate combinations of prompts and latent codes. Result: On single-subject human face generation, the method performs comparably to existing methods while maintaining higher flexibility, and shows potential to extend to multiple subjects and other object categories. Conclusion: CoCoIns offers a flexible and efficient way to generate consistent subjects, with broad application prospects. Abstract: While text-to-image generative models can synthesize diverse and faithful content, subject variation across multiple creations limits the application in long content generation. Existing approaches require time-consuming tuning, references for all subjects, or access to other creations. We introduce Contrastive Concept Instantiation (CoCoIns) to effectively synthesize consistent subjects across multiple independent creations. The framework consists of a generative model and a mapping network, which transforms input latent codes into pseudo-words associated with certain instances of concepts. Users can generate consistent subjects with the same latent codes. To construct such associations, we propose a contrastive learning approach that trains the network to differentiate the combination of prompts and latent codes. Extensive evaluations of human faces with a single subject show that CoCoIns performs comparably to existing methods while maintaining higher flexibility. We also demonstrate the potential of extending CoCoIns to multiple subjects and other object categories.

SU-YOLO: Spiking Neural Network for Efficient Underwater Object Detection

Chenyang Li,Wenxuan Liu,Guoqiang Gong,Xiaobo Ding,Xian Zhong

Task: Propose SU-YOLO, a lightweight underwater object detection model based on spiking neural networks.

Motivation: The complex underwater optical environment and the limited resources of underwater equipment make accurate, low-power object detection challenging.

Details

Method: Building on the lightweight and energy-efficient properties of spiking neural networks, design an image denoising method based on integer addition, Separated Batch Normalization (SeBN), and redesigned spiking residual blocks. Result: On the URPC2019 dataset, SU-YOLO achieves 78.8% mAP with 6.97M parameters and 2.98 mJ energy consumption, outperforming mainstream SNN models. Conclusion: SU-YOLO demonstrates the potential of SNNs for engineering applications, particularly underwater object detection. Abstract: Underwater object detection is critical for oceanic research and industrial safety inspections. However, the complex optical environment and the limited resources of underwater equipment pose significant challenges to achieving high accuracy and low power consumption. To address these issues, we propose Spiking Underwater YOLO (SU-YOLO), a Spiking Neural Network (SNN) model. Leveraging the lightweight and energy-efficient properties of SNNs, SU-YOLO incorporates a novel spike-based underwater image denoising method based solely on integer addition, which enhances the quality of feature maps with minimal computational overhead. In addition, we introduce Separated Batch Normalization (SeBN), a technique that normalizes feature maps independently across multiple time steps and is optimized for integration with residual structures to capture the temporal dynamics of SNNs more effectively. The redesigned spiking residual blocks integrate the Cross Stage Partial Network (CSPNet) with the YOLO architecture to mitigate spike degradation and enhance the model's feature extraction capabilities. Experimental results on the URPC2019 underwater dataset demonstrate that SU-YOLO achieves an mAP of 78.8% with 6.97M parameters and an energy consumption of 2.98 mJ, surpassing mainstream SNN models in both detection accuracy and computational efficiency. These results underscore the potential of SNNs for engineering applications. The code is available at https://github.com/lwxfight/snn-underwater.

Easi3R: Estimating Disentangled Motion from DUSt3R Without Training

Xingyu Chen,Yue Chen,Yuliang Xiu,Andreas Geiger,Anpei Chen

Task: Propose Easi3R, a training-free 4D reconstruction method that uses attention adaptation to reconstruct dynamic scenes accurately.

Motivation: The limited scale of existing 4D datasets restricts the generalization of 4D models, while conventional methods rely on training or fine-tuning with large-scale dynamic data.

Details

Method: Exploit the information in DUSt3R's attention layers and apply attention adaptation to achieve dynamic region segmentation, camera pose estimation, and 4D point map reconstruction. Result: Experiments on real-world dynamic videos show that the method significantly outperforms existing methods trained on large-scale dynamic datasets. Conclusion: Easi3R is an efficient, training-free 4D reconstruction method that achieves strong performance through attention adaptation. Abstract: Recent advances in DUSt3R have enabled robust estimation of dense point clouds and camera parameters of static scenes, leveraging Transformer network architectures and direct supervision on large-scale 3D datasets. In contrast, the limited scale and diversity of available 4D datasets present a major bottleneck for training a highly generalizable 4D model. This constraint has driven conventional 4D methods to fine-tune 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depths. In this work, we take an opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. We find that the attention layers in DUSt3R inherently encode rich information about camera and object motion. By carefully disentangling these attention maps, we achieve accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction. Extensive experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods that are trained or fine-tuned on extensive dynamic datasets. Our code is publicly available for research purposes at https://easi3r.github.io/

From Eye to Mind: brain2text Decoding Reveals the Neural Mechanisms of Visual Semantic Processing

Feihan Feng,Jingxin Nie

Task: Decode fMRI signals into textual descriptions of natural images to reveal the neural mechanisms of semantic encoding in the brain.

Motivation: Traditional brain decoding focuses mainly on low-level perceptual features and fails to capture the deeper semantic content that guides human cognition.

Details

Method: Propose a novel deep learning model that decodes fMRI signals directly into textual descriptions, without visual input. Result: The model achieves state-of-the-art semantic decoding performance, generates meaningful captions that capture the core content of scenes, and reveals the critical role of higher-level visual regions in semantic transformation. Conclusion: Text-based decoding provides a more direct and interpretable window into the neural basis of complex semantic processing. Abstract: Deciphering the neural mechanisms that transform sensory experiences into meaningful semantic representations is a fundamental challenge in cognitive neuroscience. While neuroimaging has mapped a distributed semantic network, the format and neural code of semantic content remain elusive, particularly for complex, naturalistic stimuli. Traditional brain decoding, focused on visual reconstruction, primarily captures low-level perceptual features, missing the deeper semantic essence guiding human cognition. Here, we introduce a paradigm shift by directly decoding fMRI signals into textual descriptions of viewed natural images. Our novel deep learning model, trained without visual input, achieves state-of-the-art semantic decoding performance, generating meaningful captions that capture the core semantic content of complex scenes. Neuroanatomical analysis reveals the critical role of higher-level visual regions, including MT+, ventral stream visual cortex, and inferior parietal cortex, in this semantic transformation. Category-specific decoding further demonstrates nuanced neural representations for semantic dimensions like animacy and motion. This text-based decoding approach provides a more direct and interpretable window into the brain's semantic encoding than visual reconstruction, offering a powerful new methodology for probing the neural basis of complex semantic processing, refining our understanding of the distributed semantic network, and potentially inspiring brain-inspired language models.

Chirp Localization via Fine-Tuned Transformer Model: A Proof-of-Concept Study

Nooshin Bahador,Milad Lankarany

Task: Develop an automated tool based on a Vision Transformer (ViT) and Low-Rank Adaptation (LoRA) to detect, localize, and extract features of chirp patterns (linear or exponential frequency sweeps) in electroencephalogram (EEG) spectrograms.

Motivation: Automated tools for detecting and localizing chirp patterns in EEG spectrograms are lacking, even though these patterns are key biomarkers of seizure dynamics.

Details

Method: Generate 100,000 synthetic spectrograms as a large-scale benchmark dataset, use a ViT model adapted for regression to predict chirp parameters, and fine-tune the attention layers with LoRA to improve adaptability. Result: The model predicts chirp parameters (onset time, onset frequency, and offset frequency) well, with a Pearson correlation of 0.9841 for chirp start time and stable inference times (137 to 140 s). Conclusion: The approach fills a methodological gap in chirp detection for EEG time-frequency analysis and provides a practical tool for related research. Abstract: Spectrograms are pivotal in time-frequency signal analysis, widely used in audio processing and computational neuroscience. Chirp-like patterns in electroencephalogram (EEG) spectrograms (marked by linear or exponential frequency sweep) are key biomarkers for seizure dynamics, but automated tools for their detection, localization, and feature extraction are lacking. This study bridges this gap by fine-tuning a Vision Transformer (ViT) model on synthetic spectrograms, augmented with Low-Rank Adaptation (LoRA) to boost adaptability. We generated 100,000 synthetic spectrograms with chirp parameters, creating the first large-scale benchmark for chirp localization. These spectrograms mimic neural chirps using linear or exponential frequency sweep, Gaussian noise, and smoothing. A ViT model, adapted for regression, predicted chirp parameters. LoRA fine-tuned the attention layers, enabling efficient updates to the pre-trained backbone. Training used MSE loss and the AdamW optimizer, with a learning rate scheduler and early stopping to curb overfitting. Only three features were targeted: Chirp Start Time (Onset Time), Chirp Start Frequency (Onset Frequency), and Chirp End Frequency (Offset Frequency). Performance was evaluated via Pearson correlation between predicted and actual labels. Results showed strong alignment: 0.9841 correlation for chirp start time, with stable inference times (137 to 140 s) and minimal bias in error distributions. This approach offers a tool for chirp analysis in EEG time-frequency representation, filling a critical methodological void.
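
As a rough illustration of two ingredients named in the abstract, the sketch below computes the instantaneous frequency of a linear or exponential sweep from the three targeted parameters plus a duration, and evaluates predictions with a Pearson correlation. The function names and the `duration` argument are assumptions made for this example; the study's actual spectrogram generator (with Gaussian noise and smoothing) and its ViT regressor are not reproduced here.

```python
import math


def chirp_frequency(t, onset_time, duration, f_start, f_end, kind="linear"):
    """Instantaneous frequency of a chirp at time t, or None outside the chirp.

    Hypothetical helper: onset_time, f_start, and f_end correspond to the
    three regression targets; duration is an extra assumed parameter.
    """
    if not onset_time <= t <= onset_time + duration:
        return None
    progress = (t - onset_time) / duration  # runs from 0 to 1 over the sweep
    if kind == "linear":
        return f_start + (f_end - f_start) * progress
    # Exponential sweep: frequency changes by a constant ratio per unit time.
    return f_start * (f_end / f_start) ** progress


def pearson(xs, ys):
    """Pearson correlation between predicted and actual labels."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)
```

For a sweep from 5 Hz to 20 Hz, the midpoint frequency is 12.5 Hz in the linear case and 10 Hz in the exponential case, since the latter is the geometric mean of the endpoints.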

TRIDIS: A Comprehensive Medieval and Early Modern Corpus for HTR and NER

Sergio Torres Aguilar

Task: Introduce TRIDIS (Tria Digita Scribunt), an open-source corpus of medieval and early modern manuscripts, and give an overview of its constitution, transcription rules, test-split strategy, and baseline experiments.

Motivation: Aggregate multiple openly licensed legacy collections and stimulate joint Handwritten Text Recognition (HTR) and Named Entity Recognition (NER) research on medieval and early modern textual heritage.

Details

Method: Describe the background of each sub-corpus, the semi-diplomatic transcription rules, and a test-split strategy based on outlier detection, and run baseline experiments with TrOCR and MiniCPM2.5. Result: A unified overview of the TRIDIS corpus, with preliminary experimental results on random and outlier-based test splits. Conclusion: TRIDIS is designed to advance HTR and NER research on medieval and early modern textual heritage. Abstract: This paper introduces TRIDIS (Tria Digita Scribunt), an open-source corpus of medieval and early modern manuscripts. TRIDIS aggregates multiple legacy collections (all published under open licenses) and incorporates large metadata descriptions. While prior publications referenced some portions of this corpus, here we provide a unified overview with a stronger focus on its constitution. We describe (i) the narrative, chronological, and editorial background of each major sub-corpus, (ii) its semi-diplomatic transcription rules (expansion, normalization, punctuation), (iii) a strategy for challenging out-of-domain test splits driven by outlier detection in a joint embedding space, and (iv) preliminary baseline experiments using TrOCR and MiniCPM2.5 comparing random and outlier-based test partitions. Overall, TRIDIS is designed to stimulate joint robust Handwritten Text Recognition (HTR) and Named Entity Recognition (NER) research across medieval and early modern textual heritage.

Hierarchical Adaptive Expert for Multimodal Sentiment Analysis

Jiahao Qin,Feng Liu,Lu Zong

Task: Propose HAEMSA, a hierarchical adaptive expert framework for multimodal sentiment analysis that differentiates and integrates modality-shared and modality-specific information.

Motivation: Existing methods struggle to differentiate and integrate modality-shared and modality-specific information, which limits the performance of multimodal learning.

Details

Method: HAEMSA combines evolutionary optimization, cross-modal knowledge transfer, and multi-task learning, and uses a hierarchical structure of adaptive experts to capture global and local modality representations. Result: HAEMSA outperforms the previous best methods in 7-class accuracy and MAE on CMU-MOSEI and CMU-MOSI, and in weighted F1 score on IEMOCAP. Conclusion: HAEMSA effectively captures complex multimodal interactions and generalizes across different emotional contexts. Abstract: Multimodal sentiment analysis has emerged as a critical tool for understanding human emotions across diverse communication channels. While existing methods have made significant strides, they often struggle to effectively differentiate and integrate modality-shared and modality-specific information, limiting the performance of multimodal learning. To address this challenge, we propose the Hierarchical Adaptive Expert for Multimodal Sentiment Analysis (HAEMSA), a novel framework that synergistically combines evolutionary optimization, cross-modal knowledge transfer, and multi-task learning. HAEMSA employs a hierarchical structure of adaptive experts to capture both global and local modality representations, enabling more nuanced sentiment analysis. Our approach leverages evolutionary algorithms to dynamically optimize network architectures and modality combinations, adapting to both partial and full modality scenarios. Extensive experiments demonstrate HAEMSA's superior performance across multiple benchmark datasets. On CMU-MOSEI, HAEMSA achieves a 2.6% increase in 7-class accuracy and a 0.059 decrease in MAE compared to the previous best method. For CMU-MOSI, we observe a 6.3% improvement in 7-class accuracy and a 0.058 reduction in MAE. On IEMOCAP, HAEMSA outperforms the state-of-the-art by 2.84% in weighted-F1 score for emotion recognition. These results underscore HAEMSA's effectiveness in capturing complex multimodal interactions and generalizing across different emotional contexts.

Dual Audio-Centric Modality Coupling for Talking Head Generation

Ao Fu,Ziqi Ni,Yi Zhou

Task: Propose DAMC, a NeRF-based framework for generating high-quality audio-driven talking head videos.

Motivation: Traditional methods struggle to capture the complex interaction between audio and facial dynamics, leading to lip synchronization and visual quality problems.

Details

Method: Use a dual encoder structure (a Content-Aware Encoder and a Dynamic-Sync Encoder) and fuse their features with a Cross-Synchronized Fusion Module (CSFM). Result: The method outperforms existing approaches on key metrics such as lip synchronization accuracy and image quality, and generalizes well. Conclusion: DAMC is a promising solution for high-quality audio-driven talking head generation and a scalable approach. Abstract: The generation of audio-driven talking head videos is a key challenge in computer vision and graphics, with applications in virtual avatars and digital media. Traditional approaches often struggle with capturing the complex interaction between audio and facial dynamics, leading to lip synchronization and visual quality issues. In this paper, we propose a novel NeRF-based framework, Dual Audio-Centric Modality Coupling (DAMC), which effectively integrates content and dynamic features from audio inputs. By leveraging a dual encoder structure, DAMC captures semantic content through the Content-Aware Encoder and ensures precise visual synchronization through the Dynamic-Sync Encoder. These features are fused using a Cross-Synchronized Fusion Module (CSFM), enhancing content representation and lip synchronization. Extensive experiments show that our method outperforms existing state-of-the-art approaches in key metrics such as lip synchronization accuracy and image quality, demonstrating robust generalization across various audio inputs, including synthetic speech from text-to-speech (TTS) systems. Our results provide a promising solution for high-quality, audio-driven talking head generation and present a scalable approach for creating realistic talking heads.

Ancestral Mamba: Enhancing Selective Discriminant Space Model with Online Visual Prototype Learning for Efficient and Robust Discriminant Approach

Jiahao Qin,Feng Liu,Lu Zong

Task: Propose Ancestral Mamba, a new approach for learning continually from non-stationary data streams and adapting to new visual patterns while mitigating catastrophic forgetting.

Motivation: Existing approaches struggle to capture and represent the essential characteristics of evolving visual concepts, which limits their applicability to dynamic graphics tasks.

Details

Method: Integrate online prototype learning into a selective discriminant space model, using Ancestral Prototype Adaptation (APA) and Mamba Feedback (MF) for efficient and robust online continual learning. Result: Strong results on graphics-oriented datasets such as CIFAR-10 and CIFAR-100, with significant improvements in accuracy and forgetting mitigation. Conclusion: Ancestral Mamba performs strongly on dynamic graphics tasks and offers an effective solution for continual learning. Abstract: In the realm of computer graphics, the ability to learn continuously from non-stationary data streams while adapting to new visual patterns and mitigating catastrophic forgetting is of paramount importance. Existing approaches often struggle to capture and represent the essential characteristics of evolving visual concepts, hindering their applicability to dynamic graphics tasks. In this paper, we propose Ancestral Mamba, a novel approach that integrates online prototype learning into a selective discriminant space model for efficient and robust online continual learning. The key components of our approach include Ancestral Prototype Adaptation (APA), which continuously refines and builds upon learned visual prototypes, and Mamba Feedback (MF), which provides targeted feedback to adapt to challenging visual patterns. APA enables the model to continuously adapt its prototypes, building upon ancestral knowledge to tackle new challenges, while MF acts as a targeted feedback mechanism, focusing on challenging classes and refining their representations. Extensive experiments on graphics-oriented datasets, such as CIFAR-10 and CIFAR-100, demonstrate the superior performance of Ancestral Mamba compared to state-of-the-art baselines, achieving significant improvements in accuracy and forgetting mitigation.
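
The general idea of online prototype learning can be sketched as a per-class running mean that is refined one sample at a time, with classification by nearest prototype. This is a generic illustration and an assumption about what "prototype" means here; it is not the paper's APA or MF mechanism, whose details the abstract does not give. The class name `PrototypeMemory` and its methods are hypothetical.

```python
class PrototypeMemory:
    """Generic online prototype store (hypothetical, not the paper's APA)."""

    def __init__(self):
        self.prototypes = {}  # label -> running-mean feature vector
        self.counts = {}      # label -> number of samples seen so far

    def update(self, label, feature):
        """Fold one streamed sample into its class prototype."""
        if label not in self.prototypes:
            self.prototypes[label] = list(feature)
            self.counts[label] = 1
            return
        self.counts[label] += 1
        n = self.counts[label]
        prototype = self.prototypes[label]
        for i, x in enumerate(feature):
            # Incremental mean: no past samples need to be stored.
            prototype[i] += (x - prototype[i]) / n

    def predict(self, feature):
        """Return the label whose prototype is nearest in squared distance."""
        def squared_distance(label):
            prototype = self.prototypes[label]
            return sum((p - x) ** 2 for p, x in zip(prototype, feature))
        return min(self.prototypes, key=squared_distance)
```

Because each update touches only one prototype, classes seen earlier in the stream keep their stored summary unchanged, which is one simple way such schemes limit forgetting.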

Adaptive Integrated Layered Attention (AILA)

William Claster,Suhas KM,Dhairya Gundechia

Task: Propose Adaptive Integrated Layered Attention (AILA), a neural network architecture that combines dense skip connections with mechanisms for adaptive feature reuse.

Motivation: Improve network performance through adaptive feature reuse while reducing training and inference time.

Details

Method: Design two AILA architectures: one connects layers with simple linear layers, and the other uses an attention mechanism to selectively reuse features. Result: On price forecasting, image recognition, and sentiment analysis, AILA matches mainstream deep learning models while substantially reducing training and inference time. Conclusion: AILA's adaptive inter-layer connections flexibly reuse features, improving long-range sequence modeling, image recognition, and classification. Abstract: We propose Adaptive Integrated Layered Attention (AILA), a neural network architecture that combines dense skip connections with different mechanisms for adaptive feature reuse across network layers. We evaluate AILA on three challenging tasks: price forecasting for various commodities and indices (S&P 500, Gold, US dollar Futures, Coffee, Wheat), image recognition using the CIFAR-10 dataset, and sentiment analysis on the IMDB movie review dataset. In all cases, AILA matches strong deep learning baselines (LSTMs, Transformers, and ResNets) at a fraction of the training and inference time. Notably, we implement and test two versions of the model - AILA-Architecture 1, which uses simple linear layers as the connection mechanism between layers, and AILA-Architecture 2, which implements an attention mechanism to selectively focus on outputs from previous layers. Both architectures are applied in a single-task learning setting, with each model trained separately for individual tasks. Results confirm that AILA's adaptive inter-layer connections yield robust gains by flexibly reusing pertinent features at multiple network depths. The AILA approach thus presents an extension to existing architectures, improving long-range sequence modeling, image recognition with optimised computational speed, and SOTA classification performance in practice.
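
One plausible reading of "an attention mechanism to selectively focus on outputs from previous layers" is scaled dot-product attention in which the current layer's state acts as the query and earlier layer outputs act as both keys and values. The sketch below shows that reading only; it is an assumption, not the published AILA-Architecture 2, and the function names are hypothetical. Learned projection matrices are omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_layers(query, layer_outputs):
    """Mix earlier layer outputs using scaled dot-product attention weights.

    Hypothetical sketch: `query` is the current layer's state and
    `layer_outputs` holds one vector per previous layer, all the same size.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * h for q, h in zip(query, output)) / scale
        for output in layer_outputs
    ]
    weights = softmax(scores)
    mixed = [
        sum(w * output[i] for w, output in zip(weights, layer_outputs))
        for i in range(len(query))
    ]
    return mixed, weights
```

A query aligned with one earlier output puts almost all of the weight on that layer, so the mixed feature is close to a copy of it; a query equally similar to every layer yields a plain average.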

Nonhuman Primate Brain Tissue Segmentation Using a Transfer Learning Approach

Zhen Lin,Hongyu Yuan,Richard Barcus,Qing Lyu,Sucheta Chakravarty,Megan E. Lipford,Carol A. Shively,Suzanne Craft,Mohammad Kawas,Jeongchul Kim,Christopher T. Whitlow

Task: Propose an approach that uses STU-Net with transfer learning to transfer knowledge from human brain MRI data and improve segmentation accuracy on non-human primate (NHP) brain MRI.

Motivation: Because of their close evolutionary relationship with humans, non-human primates (NHPs) are important models for studying human brain function and neurological disorders. However, the scarcity of NHP brain MRI data, the small size of the brain, limited imaging resolution, and anatomical differences make accurate brain tissue segmentation challenging.

Details

Method: Combine STU-Net with transfer learning, using knowledge transferred from human brain MRI data to improve NHP brain MRI segmentation accuracy, particularly when training data is limited. Result: The method performs well on small subcortical structures such as the putamen and thalamus, with DSC over 0.88, IoU over 0.8, and HD95 under 7. Conclusion: The study provides a robust method for multi-class brain tissue segmentation in NHPs, which may accelerate research in evolutionary neuroscience and preclinical studies of neurological disorders relevant to human health. Abstract: Non-human primates (NHPs) serve as critical models for understanding human brain function and neurological disorders due to their close evolutionary relationship with humans. Accurate brain tissue segmentation in NHPs is critical for understanding neurological disorders, but challenging due to the scarcity of annotated NHP brain MRI datasets, the small size of the NHP brain, the limited resolution of available imaging data and the anatomical differences between human and NHP brains. To address these challenges, we propose a novel approach utilizing STU-Net with transfer learning to leverage knowledge transferred from human brain MRI data to enhance segmentation accuracy in NHP brain MRI, particularly when training data is limited. The combination of STU-Net and transfer learning effectively delineates complex tissue boundaries and captures fine anatomical details specific to NHP brains. Notably, our method demonstrated improvement in segmenting small subcortical structures such as putamen and thalamus that are challenging to resolve with limited spatial resolution and tissue contrast, and achieved DSC of over 0.88, IoU over 0.8 and HD95 under 7. This study introduces a robust method for multi-class brain tissue segmentation in NHPs, potentially accelerating research in evolutionary neuroscience and preclinical studies of neurological disorders relevant to human health.
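
For readers unfamiliar with the overlap metrics reported above, this short sketch computes the Dice similarity coefficient (DSC) and intersection over union (IoU) for one pair of flattened binary masks. The function name and the flat-list mask format are choices made for this example; HD95 is a boundary-distance metric and is not covered.

```python
def dice_and_iou(pred, target):
    """Return (DSC, IoU) for two equal-length binary masks.

    Masks are flat sequences of 0/1 values (an assumed format for this
    sketch). Two empty masks are treated as a perfect match.
    """
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_size = sum(1 for p in pred if p)
    target_size = sum(1 for t in target if t)
    union = pred_size + target_size - intersection
    if union == 0:
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_size + target_size)
    iou = intersection / union
    return dice, iou


dice, iou = dice_and_iou([1, 1, 1, 0, 0], [0, 1, 1, 1, 0])
```

For a single mask pair the two metrics are tied by IoU = DSC / (2 - DSC); averages reported over several structures or subjects need not satisfy that identity exactly.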

VizFlyt: Perception-centric Pedagogical Framework For Autonomous Aerial Robots

Kushagra Srivastava,Rutwik Kulkarni,Manoj Velmurugan,Nitin J. Sanket

Task: Present and validate VizFlyt, an open-source, perception-centric Hardware-In-The-Loop (HITL) testing framework for aerial robotics courses.

Motivation: An effective aerial robotics course needs a reliable testbed for safely testing autonomy algorithms.

Details

Method: Use pose from an external localization system to generate photorealistic visual sensor data in real time with 3D Gaussian Splatting. Result: A system update rate of over 100 Hz, with the framework validated in real-world HITL experiments. Conclusion: The VizFlyt framework and its accompanying curriculum provide an efficient and safe testing solution for aerial robotics education. Abstract: Autonomous aerial robots are becoming commonplace in our lives. Hands-on aerial robotics courses are pivotal in training the next-generation workforce to meet the growing market demands. Such an efficient and compelling course depends on a reliable testbed. In this paper, we present VizFlyt, an open-source perception-centric Hardware-In-The-Loop (HITL) photorealistic testing framework for aerial robotics courses. We utilize pose from an external localization system to hallucinate real-time and photorealistic visual sensors using 3D Gaussian Splatting. This enables stress-free testing of autonomy algorithms on aerial robots without the risk of crashing into obstacles. We achieve over 100 Hz of system update rate. Lastly, we build upon our past experiences of offering hands-on aerial robotics courses and propose a new open-source and open-hardware curriculum based on VizFlyt for the future. We test our framework on various course projects in real-world HITL experiments and present the results showing the efficacy of such a system and its large potential use cases. Code, datasets, hardware guides and demo videos are available at https://pear.wpi.edu/research/vizflyt.html

Towards Mobile Sensing with Event Cameras on High-mobility Resource-constrained Devices: A Survey

Haoyang Wang,Ruishan Guo,Pengtao Ma,Ciyu Ruan,Xinyu Luo,Wenhua Ding,Tianyang Zhong,Jingao Xu,Yunhao Liu,Xinlei Chen

Task: Survey event-based mobile sensing systems from 2014 to 2024, covering fundamental principles, event abstraction methods, algorithmic advances, and hardware and software acceleration strategies.

Motivation: As mobile device applications grow more complex, the demand for high-accuracy, low-latency mobile sensing increases. Event-based vision is a promising solution because of its advantages, but it faces challenges such as noise, missing semantic information, and large data volume.

Details

Method: Through a literature survey, systematically review event-based mobile sensing systems, including principles, methods, algorithms, and hardware and software acceleration strategies, and discuss key applications and challenges. Result: A comprehensive survey covering application areas of event cameras (such as visual odometry and object tracking) and summarizing the challenges of data processing, sensor fusion, and real-time deployment. Conclusion: The survey outlines future research directions (such as improved hardware and neuromorphic computing) and provides open-source resources to promote broader adoption of event-based vision. Abstract: With the increasing complexity of mobile device applications, these devices are evolving toward high mobility. This shift imposes new demands on mobile sensing, particularly in terms of achieving high accuracy and low latency. Event-based vision has emerged as a disruptive paradigm, offering high temporal resolution, low latency, and energy efficiency, making it well-suited for high-accuracy and low-latency sensing tasks on high-mobility platforms. However, the presence of substantial noisy events, the lack of inherent semantic information, and the large data volume pose significant challenges for event-based data processing on resource-constrained mobile devices. This paper surveys the literature over the period 2014-2024 and provides a comprehensive overview of event-based mobile sensing systems, covering fundamental principles, event abstraction methods, algorithmic advancements, and hardware and software acceleration strategies. We also discuss key applications of event cameras in mobile sensing, including visual odometry, object tracking, optical flow estimation, and 3D reconstruction, while highlighting the challenges associated with event data processing, sensor fusion, and real-time deployment. Furthermore, we outline future research directions, such as improving event camera hardware with advanced optics, leveraging neuromorphic computing for efficient processing, and integrating bio-inspired algorithms to enhance perception. To support ongoing research, we provide an open-source Online Sheet with curated resources and recent developments. We hope this survey serves as a valuable reference, facilitating the adoption of event-based vision across diverse applications.

MIL vs. Aggregation: Evaluating Patient-Level Survival Prediction Strategies Using Graph-Based Learning

M Rita Verdelho,Alexandre Bernardino,Catarina Barata

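Task: Compare strategies at the WSI level and the patient level for predicting cancer patient survival.

Motivation: Because of tumor heterogeneity and the complexity of WSIs, how to use WSI data effectively to predict patient prognosis is a key question.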

Details

Method: Use multiple-instance learning (MIL) to automatically identify the most representative WSI, and evaluate different graph neural network architectures. Result: MIL-based selection improves prediction accuracy, indicating that choosing the most representative WSI benefits survival prediction. Conclusion: Identifying the most representative WSI is important for improving survival prediction for cancer patients. Abstract: Oncologists often rely on a multitude of data, including whole-slide images (WSIs), to guide therapeutic decisions, aiming for the best patient outcome. However, predicting the prognosis of cancer patients can be a challenging task due to tumor heterogeneity, intra-patient variability, and the complexity of analyzing WSIs. These images are extremely large, containing billions of pixels, making direct processing computationally expensive and requiring specialized methods to extract relevant information. Additionally, multiple WSIs from the same patient may capture different tumor regions, some being more informative than others. This raises a fundamental question: Should we use all WSIs to characterize the patient, or should we identify the most representative slide for prognosis? Our work seeks to answer this question by performing a comparison of various strategies for predicting survival at the WSI and patient level. The former treats each WSI as an independent sample, mimicking the strategy adopted in other works, while the latter comprises methods to either aggregate the predictions of the several WSIs or automatically identify the most relevant slide using multiple-instance learning (MIL). Additionally, we evaluate different graph neural network architectures under these strategies. We conduct our experiments using the MMIST-ccRCC dataset, which comprises patients with clear cell renal cell carcinoma (ccRCC). Our results show that MIL-based selection improves accuracy, suggesting that choosing the most representative slide benefits survival prediction.
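
The two patient-level strategies contrasted in this abstract (aggregating per-slide predictions versus selecting one slide) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the per-slide risk scores, and the relevance scores are all hypothetical, and the selection step is a simplified stand-in for a learned MIL model.

```python
from statistics import mean


def aggregate_mean(slide_risks):
    """Patient-level risk as the mean of the per-slide risk predictions."""
    return mean(slide_risks)


def select_most_relevant(slide_risks, relevance):
    """Patient-level risk taken from the slide with the highest relevance score.

    In a real MIL model the relevance scores would be learned; here they are
    simply passed in (an assumption made for illustration).
    """
    best = max(range(len(slide_risks)), key=lambda i: relevance[i])
    return slide_risks[best]


# Hypothetical patient with three WSIs.
risks = [0.2, 0.9, 0.4]
relevance = [0.1, 0.7, 0.2]
patient_risk_mean = aggregate_mean(risks)                   # averages all slides
patient_risk_mil = select_most_relevant(risks, relevance)   # uses slide 1 only
```

The two functions can disagree sharply when one slide dominates the relevance scores, which is the situation the abstract's question is about.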

Prediction of 30-day hospital readmission with clinical notes and EHR information

Tiago Almeida,Plinio Moreno,Catarina Barata

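Task: Predict 30-day hospital readmission.

Motivation: High readmission rates carry significant costs and health risks, so predictive models are needed to support clinical decisions.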

Details

Method: Combine electronic health records (EHR) with clinical notes and use a graph neural network (GNN) to integrate the multimodal information. Result: The model achieves an AUROC of 0.72 and a balanced accuracy of 66.7%. Conclusion: Combining multimodal information is crucial for predicting readmission. Abstract: High hospital readmission rates are associated with significant costs and health risks for patients. Therefore, it is critical to develop predictive models that can support clinicians in determining whether or not a patient will return to the hospital within a relatively short period of time (e.g., 30 days). Nowadays, it is possible to collect both structured (electronic health records, EHR) and unstructured information (clinical notes) about a patient's hospital event, all potentially containing relevant information for a predictive model. However, their integration is challenging. In this work we explore the combination of clinical notes and EHRs to predict 30-day hospital readmissions. We address the representation of the various types of information available in the EHR data and explore LLMs to characterize the clinical notes. We collect both information sources as the nodes of a graph neural network (GNN). Our model achieves an AUROC of 0.72 and a balanced accuracy of 66.7%, highlighting the importance of combining the multimodal information.
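
Balanced accuracy, one of the two metrics reported here, is the mean of sensitivity and specificity, which makes it more informative than plain accuracy when readmissions are the minority class. A minimal sketch follows; the confusion-matrix counts are hypothetical and are not taken from the paper.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def plain_accuracy(tp, fn, tn, fp):
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical counts for an imbalanced cohort: 50 readmitted, 500 not readmitted.
counts = dict(tp=30, fn=20, tn=400, fp=100)
bal_acc = balanced_accuracy(**counts)   # (0.6 + 0.8) / 2
acc = plain_accuracy(**counts)          # 430 / 550
```

With these counts plain accuracy is higher than balanced accuracy because the majority class dominates it.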

OncoReg: Medical Image Registration for Oncological Challenges

Wiebke Heyer,Yannic Elser,Lennart Berkel,Xinrui Song,Xuanang Xu,Pingkun Yan,Xi Jia,Zi Li,Tony C. W. Mok,BoWen LI,Christian Staackmann,Christoph Großbröhmer,Alessa Hering,Malte M. Sieren,Mattias P. Heinrich

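Task: Develop and validate image registration methods under a framework that protects patient privacy, to foster more generalisable AI models.

Motivation: In modern cancer research, large volumes of medical data go underused because of patient privacy concerns; the OncoReg Challenge aims to address this.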

Details

Method: A two-phase framework: phase one uses a public dataset, and phase two trains models on a private dataset within secure hospital networks. Result: Feature extraction plays a key role in the registration task; a new method shows versatility; established approaches perform comparably to newer techniques; combining deep learning and classical approaches is most effective. Conclusion: The OncoReg Challenge provides an effective framework for image registration, with feature extraction and hybrid methods standing out. Abstract: In modern cancer research, the vast volume of medical data generated is often underutilised due to challenges related to patient privacy. The OncoReg Challenge addresses this issue by enabling researchers to develop and validate image registration methods through a two-phase framework that ensures patient privacy while fostering the development of more generalisable AI models. Phase one involves working with a publicly available dataset, while phase two focuses on training models on a private dataset within secure hospital networks. OncoReg builds upon the foundation established by the Learn2Reg Challenge by incorporating the registration of interventional cone-beam computed tomography (CBCT) with standard planning fan-beam CT (FBCT) images in radiotherapy. Accurate image registration is crucial in oncology, particularly for dynamic treatment adjustments in image-guided radiotherapy, where precise alignment is necessary to minimise radiation exposure to healthy tissues while effectively targeting tumours. This work details the methodology and data behind the OncoReg Challenge and provides a comprehensive analysis of the competition entries and results. Findings reveal that feature extraction plays a pivotal role in this registration task. A new method emerging from this challenge demonstrated its versatility, while established approaches continue to perform comparably to newer techniques. Both deep learning and classical approaches still play significant roles in image registration, with the combination of methods - particularly in feature extraction - proving most effective.

Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs

Sanjoy Chowdhury,Hanan Gani,Nishit Anand,Sayan Nag,Ruohan Gao,Mohamed Elhoseiny,Salman Khan,Dinesh Manocha

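Task: Propose the AURELIA framework and the AVReasonBench benchmark to strengthen the multimodal reasoning of audio-visual large language models (AVLLMs).

Motivation: Existing work does not adequately address the complexity of audio-visual scenarios, so further research is needed.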

Details

Method: AURELIA, an actor-critic based framework that performs step-by-step reasoning at test time without additional training or fine-tuning. Result: Evaluating 18 AVLLMs on AVReasonBench reveals significant shortcomings in multimodal reasoning; AURELIA achieves up to a 100% relative improvement. Conclusion: AURELIA demonstrates the effectiveness of reasoning-enhanced data generation for improving the real-world potential of AVLLMs. Abstract: Recent advancements in reasoning optimization have greatly enhanced the performance of large language models (LLMs). However, existing work fails to address the complexities of audio-visual scenarios, underscoring the need for further research. In this paper, we introduce AURELIA, a novel actor-critic based audio-visual (AV) reasoning framework that distills structured, step-by-step reasoning into AVLLMs at test time, improving their ability to process complex multi-modal inputs without additional training or fine-tuning. To further advance AVLLM reasoning skills, we present AVReasonBench, a challenging benchmark comprising 4500 audio-visual questions, each paired with detailed step-by-step reasoning. Our benchmark spans six distinct tasks, including AV-GeoIQ, which evaluates AV reasoning combined with geographical and cultural knowledge. Evaluating 18 AVLLMs on AVReasonBench reveals significant limitations in their multi-modal reasoning capabilities. Using AURELIA, we achieve up to a 100% relative improvement, demonstrating its effectiveness. This performance gain highlights the potential of reasoning-enhanced data generation for advancing AVLLMs in real-world applications. Our code and data will be publicly released at: https://github.com/schowdhury671/aurelia.

Geometry in Style: 3D Stylization via Surface Normal Deformation

Nam Anh Dinh,Itai Lang,Hyunwoo Kim,Oded Stein,Rana Hanocka

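Task: Propose Geometry in Style, a new method for identity-preserving mesh stylization.

Motivation: Existing techniques either preserve the original shape through overly restrictive deformations (such as bump maps) or use deformations that may introduce artifacts or alter the shape's identity.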

Details

Method: Represent a deformation as a target normal vector for each vertex neighborhood, and realize it with a differentiable As-Rigid-As-Possible (dARAP) layer. Result: Deformations that preserve shape identity while enabling detailed stylization. Conclusion: Geometry in Style achieves high-quality stylization while preserving shape identity. Abstract: We present Geometry in Style, a new method for identity-preserving mesh stylization. Existing techniques either adhere to the original shape through overly restrictive deformations such as bump maps or significantly modify the input shape using expressive deformations that may introduce artifacts or alter the identity of the source shape. In contrast, we represent a deformation of a triangle mesh as a target normal vector for each vertex neighborhood. The deformations we recover from target normals are expressive enough to enable detailed stylizations yet restrictive enough to preserve the shape's identity. We achieve such deformations using our novel differentiable As-Rigid-As-Possible (dARAP) layer, a neural-network-ready adaptation of the classical ARAP algorithm which we use to solve for per-vertex rotations and deformed vertices. As a differentiable layer, dARAP is paired with a visual loss from a text-to-image model to drive deformations toward style prompts, altogether giving us Geometry in Style. Our project page is at https://threedle.github.io/geometry-in-style.

A Lightweight Image Super-Resolution Transformer Trained on Low-Resolution Images Only

Björn Möller,Lucas Görnhardt,Tim Fingscheidt

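Task: Address unsupervised single-image super-resolution (SISR) using a lightweight vision transformer with LR-only training.

Motivation: Transformers excel at SISR but need large amounts of training data, while high-quality HR images are scarce in practice; this motivates exploring LR-only training methods.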

Details

Method: Adopt and adapt an LR-only training method from microscopy image super-resolution, propose a multi-scale training method (MSTbic), and validate it on both transformer and CNN models. Result: Outperforms existing CNN-based LR-only SISR methods on classic SR datasets such as Set5 and Set14. Conclusion: MSTbic is effective under LR-only training, applies to both transformer and CNN models, and outperforms existing methods. Abstract: Transformer architectures prominently lead single-image super-resolution (SISR) benchmarks, reconstructing high-resolution (HR) images from their low-resolution (LR) counterparts. Their strong representative power, however, comes with a higher demand for training data compared to convolutional neural networks (CNNs). For many real-world SR applications, high-quality HR training images are not available, sparking interest in LR-only training methods. The LR-only SISR benchmark mimics this condition by allowing only LR images for model training. For a 4x super-resolution, this effectively reduces the amount of available training data to 6.25% of the HR image pixels, which puts the employment of a data-hungry transformer model into question. In this work, we are the first to utilize a lightweight vision transformer model with LR-only training methods addressing the unsupervised SISR LR-only benchmark. We adopt and configure a recent LR-only training method from microscopy image super-resolution to macroscopic real-world data, resulting in our multi-scale training method for bicubic degradation (MSTbic). Furthermore, we compare it with reference methods and prove its effectiveness both for a transformer and a CNN model. We evaluate on the classic SR benchmark datasets Set5, Set14, BSD100, Urban100, and Manga109, and show superior performance over state-of-the-art (so far: CNN-based) LR-only SISR methods. The code is available on GitHub: https://github.com/ifnspaml/SuperResolutionMultiscaleTraining.
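
The 6.25% figure in this abstract follows from downscaling both image axes by the same factor: an image reduced by a factor s per axis keeps 1/s² of its pixels, so 4x super-resolution leaves 1/16 of them. The helper below is an illustrative calculation only; the function name is hypothetical.

```python
def lr_pixel_fraction(scale):
    """Fraction of HR pixels that remain in an LR image downscaled by `scale` per axis."""
    return 1.0 / (scale * scale)


fraction_2x = lr_pixel_fraction(2)  # 1/4 of the HR pixels
fraction_4x = lr_pixel_fraction(4)  # 1/16 of the HR pixels, i.e. 6.25%
```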

SketchVideo: Sketch-based Video Generation and Editing

Feng-Lin Liu,Hongbo Fu,Xintao Wang,Weicai Ye,Pengfei Wan,Di Zhang,Lin Gao

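Task: Achieve sketch-based spatial and motion control for video generation and support fine-grained editing of real or synthetic videos.

Motivation: Existing techniques struggle to control global layout and geometric details through text alone, and to support motion control and local modification through images.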

Details

Method: Built on a DiT video generation model, a memory-efficient control structure in which sketch control blocks predict residual features of skipped DiT blocks, plus an inter-frame attention mechanism that propagates sparse sketch conditions. Result: Experiments show that SketchVideo performs strongly in controllable video generation and editing. Conclusion: The method achieves sketch-based video generation and editing and addresses limitations of existing techniques. Abstract: Video generation and editing conditioned on text prompts or images have undergone significant advancements. However, challenges remain in accurately controlling global layout and geometry details solely through text, and in supporting motion control and local modification through images. In this paper, we aim to achieve sketch-based spatial and motion control for video generation and support fine-grained editing of real or synthetic videos. Based on the DiT video generation model, we propose a memory-efficient control structure with sketch control blocks that predict residual features of skipped DiT blocks. Sketches are drawn on one or two keyframes (at arbitrary time points) for easy interaction. To propagate such temporally sparse sketch conditions across all frames, we propose an inter-frame attention mechanism to analyze the relationship between the keyframes and each video frame. For sketch-based video editing, we design an additional video insertion module that maintains consistency between the newly edited content and the original video's spatial feature and dynamic motion. During inference, we use latent fusion for the accurate preservation of unedited regions. Extensive experiments demonstrate that our SketchVideo achieves superior performance in controllable video generation and editing.

LaViC: Adapting Large Vision-Language Models to Visually-Aware Conversational Recommendation

Hyunsik Jeon,Satoshi Koide,Yu Wang,Zhankui He,Julian McAuley

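Task: Propose LaViC, a framework that integrates compact image representations into dialogue-based recommendation systems to meet the need for visual information in visually driven domains.

Motivation: In visually driven domains (such as fashion or home decor), text alone cannot satisfy users' need for visual details such as color, style, or design.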

Details

Method: LaViC uses a two-stage approach, visual knowledge self-distillation followed by recommendation prompt tuning, combining dialogue context with visual tokens. Result: Experiments show that LaViC significantly outperforms text-only methods and open-source vision-language baselines on visually-aware conversational recommendation, and is competitive with or superior to proprietary baselines (such as the GPT series). Conclusion: LaViC demonstrates the necessity of explicitly using visual data in visually driven domains and the effectiveness of vision-language integration. Abstract: Conversational recommender systems engage users in dialogues to refine their needs and provide more personalized suggestions. Although textual information suffices for many domains, visually driven categories such as fashion or home decor potentially require detailed visual information related to color, style, or design. To address this challenge, we propose LaViC (Large Vision-Language Conversational Recommendation Framework), a novel approach that integrates compact image representations into dialogue-based recommendation systems. LaViC leverages a large vision-language model in a two-stage process: (1) visual knowledge self-distillation, which condenses product images from hundreds of tokens into a small set of visual tokens in a self-distillation manner, significantly reducing computational overhead, and (2) recommendation prompt tuning, which enables the model to incorporate both dialogue context and distilled visual tokens, providing a unified mechanism for capturing textual and visual features. To support rigorous evaluation of visually-aware conversational recommendation, we construct a new dataset by aligning Reddit conversations with Amazon product listings across multiple visually oriented categories (e.g., fashion, beauty, and home). This dataset covers realistic user queries and product appearances in domains where visual details are crucial. Extensive experiments demonstrate that LaViC significantly outperforms text-only conversational recommendation methods and open-source vision-language baselines. Moreover, LaViC achieves competitive or superior accuracy compared to prominent proprietary baselines (e.g., GPT-3.5-turbo, GPT-4o-mini, and GPT-4o), demonstrating the necessity of explicitly using visual data for capturing product attributes and showing the effectiveness of our vision-language integration. Our code and dataset are available at https://github.com/jeon185/LaViC.

Beyond Unimodal Boundaries: Generative Recommendation with Multimodal Semantics

Jing Zhu,Mingxuan Ju,Yozen Liu,Danai Koutra,Neil Shah,Tong Zhao

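Task: Explore the importance of modality choice in multimodal generative recommendation (MGR) and its effect on model performance.

Motivation: Existing generative recommendation methods typically assume unimodal content (such as text), overlooking the multimodal nature of real-world data and the sensitivity of models to modality choice.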

Details

Method: Propose MGR-LF++, a framework that uses contrastive modality alignment and special tokens denoting different modalities to exploit multimodal data effectively. Result: MGR-LF++ improves performance by more than 20% over single-modality alternatives. Conclusion: Modality choice is critical in multimodal generative recommendation, and MGR-LF++ offers an effective way to use multimodal data. Abstract: Generative recommendation (GR) has become a powerful paradigm in recommendation systems that implicitly links modality and semantics to item representation, in contrast to previous methods that relied on non-semantic item identifiers in autoregressive models. However, previous research has predominantly treated modalities in isolation, typically assuming item content is unimodal (usually text). We argue that this is a significant limitation given the rich, multimodal nature of real-world data and the potential sensitivity of GR models to modality choices and usage. Our work aims to explore the critical problem of Multimodal Generative Recommendation (MGR), highlighting the importance of modality choices in GR frameworks. We reveal that GR models are particularly sensitive to different modalities and examine the challenges in achieving effective GR when multiple modalities are available. By evaluating design strategies for effectively leveraging multiple modalities, we identify key challenges and introduce MGR-LF++, an enhanced late fusion framework that employs contrastive modality alignment and special tokens to denote different modalities, achieving a performance improvement of over 20% compared to single-modality alternatives.

Physically Ground Commonsense Knowledge for Articulated Object Manipulation with Analytic Concepts

Jianhua Sun,Jiude Wei,Yuxuan Li,Cewu Lu

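Task: Ground semantic-level commonsense knowledge generated by large language models (LLMs) in the physical world to guide robots in generalized articulated object manipulation.

Motivation: Robots need commonsense knowledge to develop generalized manipulation skills; LLMs can acquire such knowledge, but grounding it physically remains a major challenge.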

Details

Method: Introduce analytic concepts, defined on mathematical symbolism, as a bridge between LLM semantic knowledge and the physical world; derive physics-informed knowledge of object structure and functionality and use it to instruct robot control policies. Result: Extensive experiments in simulation and real-world environments demonstrate the superiority of the approach. Conclusion: Connecting LLM knowledge to the physical world through analytic concepts enables generalized, interpretable, and accurate articulated object manipulation. Abstract: We humans rely on a wide range of commonsense knowledge to interact with an extensive number and variety of objects in the physical world. Likewise, such commonsense knowledge is also crucial for robots to successfully develop generalized object manipulation skills. While recent advancements in Large Language Models (LLMs) have showcased their impressive capabilities in acquiring commonsense knowledge and conducting commonsense reasoning, effectively grounding this semantic-level knowledge produced by LLMs in the physical world to thoroughly guide robots in generalized articulated object manipulation remains a challenge that has not been sufficiently addressed. To this end, we introduce analytic concepts, procedurally defined upon mathematical symbolism that can be directly computed and simulated by machines. By leveraging the analytic concepts as a bridge between the semantic-level knowledge inferred by LLMs and the physical world where real robots operate, we are able to figure out the knowledge of object structure and functionality with physics-informed representations, and then use the physically grounded knowledge to instruct robot control policies for generalized, interpretable and accurate articulated object manipulation. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our approach.

Visual Acuity Consistent Foveated Rendering towards Retinal Resolution

Zhi Zhang,Meng Gai,Sheng Li

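Task: Propose visual acuity-consistent foveated rendering (VaFR) to improve rendering performance at retinal-level resolutions.

Motivation: In existing foveated rendering methods the shading load rises with display resolution, reducing efficiency, especially at retinal-level resolutions.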

Details

Method: A log-polar mapping function derived from a model of human visual acuity, with an associated shading rate, that ensures a consistent output of rendering information. Result: VaFR outperforms existing methods across diverse test scenarios, with speedups of 6.5× to 16.4× while preserving perceptual visual quality. Conclusion: VaFR achieves efficient rendering at retinal resolution and applies to multiple rendering pipelines and binocular rendering strategies. Abstract: Prior foveated rendering methods often suffer from a limitation where the shading load escalates with increasing display resolution, leading to decreased efficiency, particularly when dealing with retinal-level resolutions. To tackle this challenge, we begin with the essence of the human visual system (HVS) perception and present visual acuity-consistent foveated rendering (VaFR), aiming to achieve exceptional rendering performance at retinal-level resolutions. Specifically, we propose a method with a novel log-polar mapping function derived from the human visual acuity model, which accommodates the natural bandwidth of the visual system. This mapping function and its associated shading rate guarantee a consistent output of rendering information, regardless of variations in the display resolution of the VR HMD. Consequently, our VaFR outperforms alternative methods, improving rendering speed while preserving perceptual visual quality, particularly when operating at retinal resolutions. We validate our approach using both the rasterization and ray-casting rendering pipelines. We also validate our approach using different binocular rendering strategies for HMD devices. In diverse testing scenarios, our approach delivers better perceptual visual quality than prior foveated rendering while achieving an impressive speedup of 6.5× to 9.29× for deferred rendering of 3D scenarios and an even more powerful speedup of 10.4× to 16.4× for ray-casting at retinal resolution. Additionally, our approach significantly enhances the rendering performance of binocular 8K path tracing, achieving smooth frame rates.
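
For readers unfamiliar with log-polar foveation, the sketch below shows a generic log-polar mapping around a gaze point: screen pixels are re-parameterized by the logarithm of their distance from the gaze and by their angle, so that the periphery is compressed into fewer buffer samples. This is a textbook-style illustration only. The paper derives its mapping from a visual acuity model, which is not reproduced here, and all function and parameter names are hypothetical.

```python
import math


def to_log_polar(x, y, gaze_x, gaze_y, max_radius):
    """Map a screen position to (u, v) in a log-polar buffer centered on the gaze."""
    dx, dy = x - gaze_x, y - gaze_y
    r = math.hypot(dx, dy)
    # log1p keeps the mapping finite at the gaze point itself (r = 0).
    u = math.log1p(r) / math.log1p(max_radius)
    # Angle expressed as a fraction of a full turn.
    v = (math.atan2(dy, dx) % (2 * math.pi)) / (2 * math.pi)
    return u, v


def from_log_polar(u, v, gaze_x, gaze_y, max_radius):
    """Inverse mapping: recover the screen position from buffer coordinates."""
    r = math.expm1(u * math.log1p(max_radius))
    theta = v * 2 * math.pi
    return gaze_x + r * math.cos(theta), gaze_y + r * math.sin(theta)


# Round trip for a hypothetical pixel near the gaze point.
u, v = to_log_polar(120.0, 80.0, 100.0, 100.0, 500.0)
x_back, y_back = from_log_polar(u, v, 100.0, 100.0, 500.0)
```

Because u grows logarithmically with distance, equal steps in u cover ever larger screen areas away from the gaze, which is what reduces the shading load in the periphery.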

Optimal Invariant Bases for Atomistic Machine Learning

Alice E. A. Allen,Emily Shinkle,Roxana Bujack,Nicholas Lubbers

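Task: Develop a method that optimizes the representation of atomic environments by removing redundant descriptors.

Motivation: Existing atomic environment descriptors are often incomplete or functionally redundant, so they either fail to distinguish different atomic environments or add computational cost.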

Details

Method: Apply pattern recognition techniques to existing atomistic descriptors, removing functionally dependent descriptors to produce a minimal complete set. Result: The optimized descriptor set is more efficient, and a new message-passing network architecture recognizes up to 5-body patterns at low computational cost. Conclusion: The approach improves models and points to classes of invariant bases that are low-cost and highly expressive for other applications. Abstract: The representation of atomic configurations for machine learning models has led to the development of numerous descriptors, often to describe the local environment of atoms. However, many of these representations are incomplete and/or functionally dependent. Incomplete descriptor sets are unable to represent all meaningful changes in the atomic environment. Complete constructions of atomic environment descriptors, on the other hand, often suffer from a high degree of functional dependence, where some descriptors can be written as functions of the others. These redundant descriptors do not provide additional power to discriminate between different atomic environments and increase the computational burden. By applying techniques from the pattern recognition literature to existing atomistic representations, we remove descriptors that are functions of other descriptors to produce the smallest possible set that satisfies completeness. We apply this in two ways: first, we refine an existing description, the Atomistic Cluster Expansion. We show that this yields a more efficient subset of descriptors. Second, we augment an incomplete construction based on a scalar neural network, yielding a new message-passing network architecture that can recognize up to 5-body patterns in each neuron by taking advantage of an optimal set of Cartesian tensor invariants. This architecture shows strong accuracy on state-of-the-art benchmarks while retaining low computational cost. Our results not only yield improved models, but point the way to classes of invariant bases that minimize cost while maximizing expressivity for a host of applications.

GenVP: Generating Visual Puzzles with Contrastive Hierarchical VAEs

Kalliopi Basioti,Pritish Sahu,Qingze Tony Liu,Zihao Xu,Hao Wang,Vladimir Pavlovic

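Task: Propose Generative Visual Puzzles (GenVP), a framework to model the generation process of Raven's Progressive Matrices (RPMs).

Motivation: Humans can create new puzzles from rules, whereas existing algorithms can only solve fixed puzzles and lack generative ability.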

Details

Method: The GenVP framework, whose capability ranges from generating multiple solutions for a specific problem to creating entirely new puzzles from rules. Result: On five datasets, GenVP achieves SOTA puzzle-solving accuracy and OOD generalization, performing strongly across 22 OOD scenarios. Conclusion: GenVP generalizes efficiently to complex setups and can generate diverse, complete RPMs from abstract rules. Abstract: Raven's Progressive Matrices (RPMs) is an established benchmark to examine the ability to perform high-level abstract visual reasoning (AVR). Despite the current success of algorithms that solve this task, humans can generalize beyond a given puzzle and create new puzzles given a set of rules, whereas machines remain locked in solving a fixed puzzle from a curated choice list. We propose Generative Visual Puzzles (GenVP), a framework to model the entire RPM generation process, a substantially more challenging task. Our model's capability spans from generating multiple solutions for one specific problem prompt to creating complete new puzzles out of the desired set of rules. Experiments on five different datasets indicate that GenVP achieves state-of-the-art (SOTA) performance both in puzzle-solving accuracy and out-of-distribution (OOD) generalization in 22 OOD scenarios. Compared to SOTA generative approaches, which struggle to solve RPMs when the feasible solution space increases, GenVP efficiently generalizes to these challenging setups. Moreover, our model demonstrates the ability to produce a wide range of complete RPMs given a set of abstract rules by effectively capturing the relationships between abstract rules and visual object properties.

Uni-Render: A Unified Accelerator for Real-Time Rendering Across Diverse Neural Renderers

Chaojian Li,Sixu Li,Linrui Jiang,Jingqun Zhang,Yingyan Celine Lin

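Task: Develop a unified neural rendering accelerator that supports a range of typical neural rendering pipelines for real-time, on-device rendering.

Motivation: Neural rendering currently lacks a universal algorithmic solution, and existing devices support only specific rendering pipelines, which limits real-time interaction.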

Details

Method: A reconfigurable hardware architecture built on shared operators that dynamically adjusts dataflow to suit different rendering requirements. Result: Experiments show the accelerator is effective on synthetic and real-world scenes and supports both typical and hybrid rendering pipelines. Conclusion: The unified accelerator is the first to achieve real-time neural rendering across multiple pipelines on edge devices, paving the way for next-generation neural graphics applications. Abstract: Recent advancements in neural rendering technologies and their supporting devices have paved the way for immersive 3D experiences, significantly transforming human interaction with intelligent devices across diverse applications. However, achieving the desired real-time rendering speeds for immersive interactions is still hindered by (1) the lack of a universal algorithmic solution for different application scenarios and (2) the dedication of existing devices or accelerators to merely specific rendering pipelines. To overcome this challenge, we have developed a unified neural rendering accelerator that caters to a wide array of typical neural rendering pipelines, enabling real-time and on-device rendering across different applications while maintaining both efficiency and compatibility. Our accelerator design is based on the insight that, although neural rendering pipelines vary and their algorithm designs are continually evolving, they typically share common operators, predominantly executing similar workloads. Building on this insight, we propose a reconfigurable hardware architecture that can dynamically adjust dataflow to align with specific rendering metric requirements for diverse applications, effectively supporting both typical and the latest hybrid rendering pipelines. Benchmarking experiments and ablation studies on both synthetic and real-world scenes demonstrate the effectiveness of the proposed accelerator. The proposed unified accelerator stands out as the first solution capable of achieving real-time neural rendering across varied representative pipelines on edge devices, potentially paving the way for the next generation of neural graphics applications.

AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization

Yiyang Du,Xiaochen Wang,Chi Chen,Jiabo Ye,Yiru Wang,Peng Li,Ming Yan,Ji Zhang,Fei Huang,Zhifang Sui,Maosong Sun,Yang Liu

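Task: Propose AdaMMS, a new model merging method designed for heterogeneous multimodal large language models (MLLMs).

Motivation: Existing model merging methods mainly target homogeneous models and struggle with the architectural differences and parameter-space asymmetry of heterogeneous MLLMs.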

Details

Method: Merge heterogeneous MLLMs in three steps (mapping, merging, and searching): design a mapping function, linearly interpolate weights, and select hyper-parameters without supervision. Result: AdaMMS outperforms existing model merging methods on various vision-language benchmarks. Conclusion: AdaMMS is the first method that merges heterogeneous MLLMs without labeled data, with broad application potential. Abstract: Recently, model merging methods have demonstrated powerful strengths in combining abilities on various tasks from multiple Large Language Models (LLMs). While previous model merging methods mainly focus on merging homogeneous models with identical architecture, they meet challenges when dealing with Multimodal Large Language Models (MLLMs), which are inherently heterogeneous, with differences in model architecture and asymmetry in the parameter space. In this work, we propose AdaMMS, a novel model merging method tailored for heterogeneous MLLMs. Our method tackles the challenges in three steps: mapping, merging and searching. Specifically, we first design a mapping function between models to apply model merging to MLLMs with different architectures. Then we apply linear interpolation on model weights to adapt to the asymmetry in the heterogeneous MLLMs. Finally, in the hyper-parameter searching step, we propose an unsupervised hyper-parameter selection method for model merging. AdaMMS is the first model merging method capable of merging heterogeneous MLLMs without labeled data, and extensive experiments on various model combinations demonstrate that it outperforms previous model merging methods on various vision-language benchmarks.
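
The merging step described in this abstract, linear interpolation of model weights, can be sketched in a few lines. This is a simplified illustration under stated assumptions: both models are assumed to have already been mapped onto the same parameter names and shapes (the mapping step), weights are plain Python lists rather than tensors, and the coefficient `alpha` is passed in by hand rather than found by the unsupervised search the paper proposes. The function name is hypothetical.

```python
def interpolate_weights(weights_a, weights_b, alpha):
    """Return (1 - alpha) * weights_a + alpha * weights_b for every shared parameter.

    weights_a, weights_b: dicts mapping parameter names to equal-length lists of floats.
    alpha: interpolation coefficient in [0, 1]; 0 keeps model A, 1 keeps model B.
    """
    merged = {}
    for name, values_a in weights_a.items():
        values_b = weights_b[name]  # assumes the mapping step aligned the names
        merged[name] = [(1 - alpha) * a + alpha * b for a, b in zip(values_a, values_b)]
    return merged


# Hypothetical two-parameter "models".
model_a = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
model_b = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}
merged = interpolate_weights(model_a, model_b, alpha=0.5)
```

A coefficient search would call this function for several candidate values of `alpha` and keep the one that scores best under some selection criterion; the paper's criterion requires no labeled data and is not shown here.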

StrokeFusion: Vector Sketch Generation via Joint Stroke-UDF Encoding and Latent Sequence Diffusion

Jin Zhou,Yi Zhou,Pengfei Xu,Hui Huang

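Task: Propose StrokeFusion, a two-stage framework for generating high-quality vector sketches.

Motivation: Address non-stroke artifacts, the lack of holistic understanding, and the difficulty of extracting common features from similar elements in existing sketch generation methods.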

Details

Method: A dual-modal sketch feature learning network maps strokes into a high-quality latent space, and a stroke-level latent diffusion model adjusts stroke position, scale, and trajectory. Result: Experiments on the QuickDraw dataset show the method outperforms existing techniques in preserving structural integrity and semantic features. Conclusion: StrokeFusion improves the quality and editability of vector sketch generation. Abstract: In the field of sketch generation, raster-format trained models often produce non-stroke artifacts, while vector-format trained models typically lack a holistic understanding of sketches, leading to compromised recognizability. Moreover, existing methods struggle to extract common features from similar elements (e.g., eyes of animals) appearing at varying positions across sketches. To address these challenges, we propose StrokeFusion, a two-stage framework for vector sketch generation. It contains a dual-modal sketch feature learning network that maps strokes into a high-quality latent space. This network decomposes sketches into normalized strokes and jointly encodes stroke sequences with Unsigned Distance Function (UDF) maps, representing sketches as sets of stroke feature vectors. Building upon this representation, our framework exploits a stroke-level latent diffusion model that simultaneously adjusts stroke position, scale, and trajectory during generation. This enables high-fidelity sketch generation while supporting stroke interpolation editing. Extensive experiments on the QuickDraw dataset demonstrate that our framework outperforms state-of-the-art techniques, validating its effectiveness in preserving structural integrity and semantic features. Code and models will be made publicly available upon publication.
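
An Unsigned Distance Function (UDF) map, as mentioned in this abstract, stores at each grid cell the distance to the nearest point on a stroke. The sketch below computes such a map for a single polyline stroke by brute force. It only illustrates what a UDF map is: the grid layout, the function names, and the polyline representation are assumptions, and the paper's actual encoding pipeline is not reproduced.

```python
import math


def point_segment_distance(px, py, ax, ay, bx, by):
    """Euclidean distance from point P to the line segment AB."""
    abx, aby = bx - ax, by - ay
    length_sq = abx * abx + aby * aby
    if length_sq == 0.0:  # degenerate segment: A and B coincide
        return math.hypot(px - ax, py - ay)
    # Project P onto AB and clamp the projection to the segment.
    t = ((px - ax) * abx + (py - ay) * aby) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * abx), py - (ay + t * aby))


def stroke_udf(stroke, width, height):
    """Return a height x width grid of distances to a polyline stroke.

    stroke: non-empty list of (x, y) points; grid cell (row y, column x) is
    sampled at integer coordinates (x, y).
    """
    segments = list(zip(stroke, stroke[1:])) or [(stroke[0], stroke[0])]
    return [
        [
            min(point_segment_distance(x, y, ax, ay, bx, by)
                for (ax, ay), (bx, by) in segments)
            for x in range(width)
        ]
        for y in range(height)
    ]


# Hypothetical horizontal stroke on a 4 x 3 grid.
udf = stroke_udf([(0, 0), (3, 0)], width=4, height=3)
```

Cells on the stroke hold zero and values grow with distance from it, which gives a dense, raster-like view of a vector stroke.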

Texture or Semantics? Vision-Language Models Get Lost in Font Recognition

Zhecheng Li,Guoxian Song,Yujun Cai,Zhen Xiong,Junsong Yuan,Yiwei Wang

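Task: Evaluate the ability of modern vision-language models (VLMs) on fine-grained font recognition.

Motivation: Although VLMs perform well on many tasks, their effectiveness on fine-grained tasks such as font recognition is unclear, especially given the everyday need to identify appealing fonts.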

Details

Method: Introduce the Font Recognition Benchmark (FRB), covering 15 commonly used fonts in two versions (easy and hard), and evaluate a range of VLMs on font recognition. Result: Current VLMs show limited font recognition ability; few-shot learning and Chain-of-Thought prompting help little; attention analysis reveals inherent limitations of VLMs in capturing semantic features. Conclusion: VLMs fall short on fine-grained font recognition, and further research is needed to improve their capture of semantic features. Abstract: Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic capabilities, achieving impressive performance in various tasks such as image recognition and object localization. However, their effectiveness in fine-grained tasks remains an open question. In everyday scenarios, individuals encountering design materials, such as magazines, typography tutorials, research papers, or branding content, may wish to identify aesthetically pleasing fonts used in the text. Given their multimodal capabilities and free accessibility, many VLMs are often considered potential tools for font recognition. This raises a fundamental question: Do VLMs truly possess the capability to recognize fonts? To investigate this, we introduce the Font Recognition Benchmark (FRB), a compact and well-structured dataset comprising 15 commonly used fonts. FRB includes two versions: (i) an easy version, where 10 sentences are rendered in different fonts, and (ii) a hard version, where each text sample consists of the names of the 15 fonts themselves, introducing a Stroop effect that challenges model perception. Through extensive evaluation of various VLMs on font recognition tasks, we arrive at the following key findings: (i) Current VLMs exhibit limited font recognition capabilities, with many state-of-the-art models failing to achieve satisfactory performance. (ii) Few-shot learning and Chain-of-Thought (CoT) prompting provide minimal benefits in improving font recognition accuracy across different VLMs. (iii) Attention analysis sheds light on the inherent limitations of VLMs in capturing semantic features.

Conformal uncertainty quantification to evaluate predictive fairness of foundation AI model for skin lesion classes across patient demographics

Swarnava Bhattacharyya,Umapada Pal,Tapabrata Chakraborti

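Task: Use conformal analysis to quantify the predictive uncertainty of a vision transformer (ViT) based foundation model on skin lesion classification.

Motivation: Address the difficulty of trusting deep learning AI systems in medical applications, which stems from their lack of transparency and interpretability.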

Details

Method: Applies conformal analysis to quantify predictive uncertainty, combined with dynamic F1-score-based sampling to mitigate class imbalance. Result: Conformal analysis provides a population-level coverage guarantee and a per-individual uncertainty score; dynamic F1-score-based sampling helps stabilize class imbalance. Conclusion: The method can serve as a fairness metric, improving the trustworthiness and fairness of clinical AI. Abstract: Deep learning based diagnostic AI systems that work on medical images are starting to match the performance of human experts. However, these data-hungry complex systems are inherently black boxes and are therefore slow to be adopted for high-risk applications like healthcare. This lack of transparency is exacerbated in the case of recent large foundation models, which are trained in a self-supervised manner on millions of data points to provide robust generalisation across a range of downstream tasks, but whose embeddings are generated through a process that is not interpretable, and hence not easily trustable for clinical applications. To address this timely issue, we deploy conformal analysis to quantify the predictive uncertainty of a vision transformer (ViT) based foundation model across patient demographics with respect to sex, age and ethnicity for the task of skin lesion classification using several public benchmark datasets. The significant advantage of this method is that conformal analysis is method independent: it not only provides a coverage guarantee at population level but also provides an uncertainty score for each individual. We used a model-agnostic dynamic F1-score-based sampling during model training, which helped to stabilize the class imbalance, and we investigate the effects on uncertainty quantification (UQ) with or without this bias mitigation step. Thus we show how this can be used as a fairness metric to evaluate the robustness of the feature embeddings of the foundation model (Google DermFoundation) and thus advance the trustworthiness and fairness of clinical AI.
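
The coverage guarantee and per-individual uncertainty score mentioned in the abstract can be illustrated with split conformal prediction. The sketch below is not the authors' pipeline: the nonconformity score (one minus the probability of the true class), the function names, and the use of prediction-set coverage per demographic group as a fairness readout are all assumptions made for illustration.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Return the score threshold q_hat from a held-out calibration set.

    Assumed nonconformity score: 1 minus the probability of the true class.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: every label is kept.
        return float("inf")
    return scores[rank - 1]


def prediction_set(probs, q_hat):
    """Labels whose score is within the threshold; set size reflects uncertainty."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= q_hat]


def coverage_by_group(test_probs, test_labels, groups, q_hat):
    """Fraction of cases per group whose prediction set contains the true label."""
    hits, counts = {}, {}
    for p, y, g in zip(test_probs, test_labels, groups):
        counts[g] = counts.get(g, 0) + 1
        hits[g] = hits.get(g, 0) + (y in prediction_set(p, q_hat))
    return {g: hits[g] / counts[g] for g in counts}
```

Under this reading, a large gap in coverage or in average set size between groups (for example by sex, age, or ethnicity) would flag uneven predictive reliability.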

ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos

Junyao Shi,Zhuolun Zhao,Tianyou Wang,Ian Pedroza,Amy Luo,Jie Wang,Jason Ma,Dinesh Jayaraman

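Task: Extract deployable robot skill policies from human video data via imitation learning.

Motivation: Use existing large-scale human video datasets (such as EpicKitchens) to train robot skill policies, avoiding reliance on robot-specific or environment-specific demonstrations.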

Details

Method: Designs the ZeroMimic system, which combines semantic and geometric visual understanding, grasp affordance detectors, and imitation policy classes to generate image goal-conditioned skill policies. Result: ZeroMimic performs well in real and simulated kitchen environments, handling a variety of tasks (opening, closing, pouring, pick-and-place, and others) on different robot platforms. Conclusion: ZeroMimic demonstrates the feasibility of extracting general robot skills from human video data and provides reusable policies and software. Abstract: Many recent advances in robotic manipulation have come through imitation learning, yet these rely largely on mimicking a particularly hard-to-acquire form of demonstrations: those collected on the same robot in the same room with the same objects as the trained policy must handle at test time. In contrast, large pre-recorded human video datasets demonstrating manipulation skills in-the-wild already exist, which contain valuable information for robots. Is it possible to distill a repository of useful robotic skill policies out of such data without any additional requirements on robot-specific demonstrations or exploration? We present ZeroMimic, the first such system, which generates immediately deployable image goal-conditioned skill policies for several common categories of manipulation tasks (opening, closing, pouring, pick&place, cutting, and stirring), each capable of acting upon diverse objects and across diverse unseen task setups. ZeroMimic is carefully designed to exploit recent advances in semantic and geometric visual understanding of human videos, together with modern grasp affordance detectors and imitation policy classes. After training ZeroMimic on the popular EpicKitchens dataset of ego-centric human videos, we evaluate its out-of-the-box performance in varied real-world and simulated kitchen settings with two different robot embodiments, demonstrating its impressive abilities to handle these varied tasks. To enable plug-and-play reuse of ZeroMimic policies on other task setups and robots, we release software and policy checkpoints of our skill policies.

DiffScale: Continuous Downscaling and Bias Correction of Subseasonal Wind Speed Forecasts using Diffusion Models

Maximilian Springenberg,Noelia Otero,Yuxin Xue,Jackie Ma

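Task: Improve wind speed prediction for wind energy forecasting using a diffusion model with classifier-free guidance.

Motivation: Renewable energy depends heavily on weather conditions, and subseasonal-to-seasonal (S2S) forecasts can bring significant socioeconomic benefits to the energy sector.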

Details

Method: Proposes DiffScale, a diffusion model that super-resolves spatial information for continuous downscaling factors and lead times, using weather priors as guidance for the generative process. Result: In synthetic experiments, DiffScale significantly improves prediction quality, outperforming baselines up to week 3. Conclusion: DiffScale is a flexible and efficient tool that generalizes across grid resolutions and lead times without retraining the model. Abstract: Renewable resources are strongly dependent on local and large-scale weather situations. Skillful subseasonal to seasonal (S2S) forecasts -- beyond two weeks and up to two months -- can offer significant socioeconomic advantages to the energy sector. This study aims to enhance wind speed predictions using a diffusion model with classifier-free guidance to downscale S2S forecasts of surface wind speed. We propose DiffScale, a diffusion model that super-resolves spatial information for continuous downscaling factors and lead times. Leveraging weather priors as guidance for the generative process of diffusion models, we adopt the perspective of conditional probabilities on sampling super-resolved S2S forecasts. We aim to directly estimate the density associated with the target S2S forecasts at different spatial resolutions and lead times without auto-regression or sequence prediction, resulting in an efficient and flexible model. Synthetic experiments were designed to super-resolve wind speed S2S forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF) from a coarse resolution to a finer resolution of ERA5 reanalysis data, which serves as a high-resolution target. The innovative aspect of DiffScale lies in its flexibility to downscale arbitrary scaling factors, enabling it to generalize across various grid resolutions and lead times, without retraining the model, while correcting model errors, making it a versatile tool for improving S2S wind speed forecasts. We achieve a significant improvement in prediction quality, outperforming baselines up to week 3.
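
Classifier-free guidance, which the abstract names as the conditioning mechanism, is commonly implemented by blending a conditional and an unconditional noise prediction at each denoising step. The sketch below shows only that blending step under one common convention; the function name and the guidance scale are hypothetical, and the abstract does not state which convention or scale DiffScale uses.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions.

    Assumed convention: eps = eps_uncond + s * (eps_cond - eps_uncond),
    so s = 0 ignores the condition and s = 1 recovers the conditional model.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy per-pixel noise predictions for a two-pixel field.
blended = cfg_combine([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0)
```

In a downscaling setting the condition would be the coarse forecast or weather prior, and scales above 1 push samples further toward it.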

An Explainable Neural Radiomic Sequence Model with Spatiotemporal Continuity for Quantifying 4DCT-based Pulmonary Ventilation

Rihui Zhang,Haiming Zhu,Jingtong Zhao,Lei Zhang,Fang-Fang Yin,Chunhao Wang,Zhenyu Yang

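Task: Propose an explainable neural radiomic sequence model to identify regions of compromised pulmonary ventilation from 4DCT.

Motivation: Current nuclear medicine ventilation scintigraphy is time-consuming, costly, and involves additional radiation exposure, so a more efficient, lower-cost method is needed.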

Details

Method: Uses 4DCT data from 45 lung cancer patients, extracts 56-dimensional radiomic features to build temporal sequences, and develops a temporal saliency-enhanced explainable LSTM network. Result: The model achieves an average Dice similarity coefficient of 0.78 on both PET and SPECT data; temporal saliency maps reveal key feature changes during lung exhalation. Conclusion: The method effectively identifies regions of compromised pulmonary ventilation and provides interpretable feature dynamics. Abstract: Accurate evaluation of regional lung ventilation is essential for the management and treatment of lung cancer patients, supporting assessments of pulmonary function, optimization of therapeutic strategies, and monitoring of treatment response. Currently, ventilation scintigraphy using nuclear medicine techniques is widely employed in clinical practice; however, it is often time-consuming, costly, and entails additional radiation exposure. In this study, we propose an explainable neural radiomic sequence model to identify regions of compromised pulmonary ventilation based on four-dimensional computed tomography (4DCT). A cohort of 45 lung cancer patients from the VAMPIRE dataset was analyzed. For each patient, lung volumes were segmented from 4DCT, and voxel-wise radiomic features (56-dimensional) were extracted across the respiratory cycle to capture local intensity and texture dynamics, forming temporal radiomic sequences. Ground truth ventilation defects were delineated voxel-wise using Galligas-PET and DTPA-SPECT. To identify compromised regions, we developed a temporal saliency-enhanced explainable long short-term memory (LSTM) network trained on the radiomic sequences. Temporal saliency maps were generated to highlight key features contributing to the model's predictions. The proposed model demonstrated robust performance, achieving average (range) Dice similarity coefficients of 0.78 (0.74-0.79) for 25 PET cases and 0.78 (0.74-0.82) for 20 SPECT cases. The temporal saliency map explained three key radiomic sequences in ventilation quantification: during lung exhalation, regions of compromised pulmonary function typically exhibit (1) an increasing trend of intensity and (2) a decreasing trend of homogeneity, in contrast to healthy lung tissue.
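
The Dice similarity coefficient reported above measures voxel-wise overlap between a predicted defect mask and the ground-truth mask: twice the intersection divided by the sum of the two mask sizes. A minimal sketch on flattened binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions, not details taken from the paper.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |A and B| / (|A| + |B|) for flattened binary (0/1) masks."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Four voxels, one of which is flagged by both the prediction and the reference.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```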

AMB-FHE: Adaptive Multi-biometric Fusion with Fully Homomorphic Encryption

Florian Bayer,Christian Rathgeb

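Task: Propose an adaptive multi-biometric fusion method based on fully homomorphic encryption (AMB-FHE) that strengthens the privacy protection of biometric templates while adapting dynamically to security requirements.

Motivation: Multi-biometric systems are common in high-security applications, but presenting multiple biometric modalities can reduce usability and is not necessary in every scenario.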

Details

Method: Builds on fully homomorphic encryption to propose AMB-FHE, and benchmarks it on the CASIA iris and MCYT fingerprint datasets using deep neural networks for feature extraction. Result: AMB-FHE is easy to implement, increases the flexibility of biometric authentication, and strengthens privacy protection through joint encryption of templates from multiple modalities. Conclusion: AMB-FHE offers a simple and flexible privacy-protection scheme for biometric systems while supporting dynamic security requirements. Abstract: Biometric systems strive to balance security and usability. The use of multi-biometric systems combining multiple biometric modalities is usually recommended for high-security applications. However, the presentation of multiple biometric modalities can impair the user-friendliness of the overall system and might not be necessary in all cases. In this work, we present a simple but flexible approach to increase the privacy protection of homomorphically encrypted multi-biometric reference templates while enabling adaptation to security requirements at run-time: an adaptive multi-biometric fusion with fully homomorphic encryption (AMB-FHE). AMB-FHE is benchmarked against a bimodal biometric database consisting of the CASIA iris and MCYT fingerprint datasets using deep neural networks for feature extraction. Our contribution is easy to implement and increases the flexibility of biometric authentication while offering increased privacy protection through joint encryption of templates from multiple modalities.

Learning 3D-Gaussian Simulators from RGB Videos

Mikel Zhobro,Andreas René Geist,Georg Martius

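Task: Learn an end-to-end physics simulator of 3D object dynamics from multi-view RGB videos.

Motivation: Address the challenge of maintaining spatiotemporal consistency when learning physics simulation from video data, avoiding reliance on strong inductive biases or ground-truth 3D information in order to improve scalability and generalization.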

Details

Method: Encodes images into a 3D Gaussian particle representation, propagates dynamics via a transformer, and renders frames with 3D Gaussian splatting, jointly training inverse rendering and the dynamics transformer. Result: The model captures diverse physical behaviors, from rigid to elastic and cloth-like interactions, along with realistic lighting effects, and generalizes to unseen multi-body interactions and novel scene edits. Conclusion: By embedding physical properties into point-wise latent vectors without explicit connectivity constraints, 3DGSim achieves efficient and generalizable physics simulation. Abstract: Learning physics simulations from video data requires maintaining spatial and temporal consistency, a challenge often addressed with strong inductive biases or ground-truth 3D information -- limiting scalability and generalization. We introduce 3DGSim, a 3D physics simulator that learns object dynamics end-to-end from multi-view RGB videos. It encodes images into a 3D Gaussian particle representation, propagates dynamics via a transformer, and renders frames using 3D Gaussian splatting. By jointly training inverse rendering with a dynamics transformer using a temporal encoding and merging layer, 3DGSim embeds physical properties into point-wise latent vectors without enforcing explicit connectivity constraints. This enables the model to capture diverse physical behaviors, from rigid to elastic and cloth-like interactions, along with realistic lighting effects that also generalize to unseen multi-body interactions and novel scene edits.

AI-Assisted Colonoscopy: Polyp Detection and Segmentation using Foundation Models

Uxue Delaquintana-Aramendi,Leire Benito-del-Valle,Aitor Alvarez-Gila,Javier Pascau,Luisa F Sánchez-Peralta,Artzai Picón,J Blas Pagador,Cristina L Saratxaga

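Task: Evaluate the performance of foundation models on polyp segmentation in colonoscopy.

Motivation: In colonoscopy, 80% of missed polyps could be detected with the help of deep learning models. The zero-shot and few-shot capabilities of foundation models are promising for medical imaging, especially where annotated data are scarce.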

Details

Method: Uses three colonoscopy datasets to compare five foundation models (DINOv2, YOLO-World, GroundingDINO, SAM, and MedSAM) against two benchmark networks (YOLOv8 and Mask R-CNN). Result: The success of foundation models in polyp segmentation depends heavily on domain specialization: domain-specific models perform best, and generic models need fine-tuning to be effective. Some foundation models even outperform fine-tuned models in zero-shot evaluation. Conclusion: In medical applications, domain-specific foundation models outperform existing detection and segmentation models, particularly in zero-shot and few-shot settings. Abstract: In colonoscopy, 80% of the missed polyps could be detected with the help of Deep Learning models. In the search for algorithms capable of addressing this challenge, foundation models emerge as promising candidates. Their zero-shot or few-shot learning capabilities facilitate generalization to new data or tasks without extensive fine-tuning, a property that is particularly advantageous in the medical imaging domain, where large annotated datasets for traditional training are scarce. In this context, a comprehensive evaluation of foundation models for polyp segmentation was conducted, assessing both detection and delimitation. For the study, three different colonoscopy datasets have been employed to compare the performance of five different foundation models, DINOv2, YOLO-World, GroundingDINO, SAM and MedSAM, against two benchmark networks, YOLOv8 and Mask R-CNN. Results show that the success of foundation models in polyp characterization is highly dependent on domain specialization. For optimal performance in medical applications, domain-specific models are essential, and generic models require fine-tuning to achieve effective results. Through this specialization, foundation models demonstrated superior performance compared to state-of-the-art detection and segmentation models, with some models even excelling in zero-shot evaluation, outperforming fine-tuned models on unseen data.

A Comparative Study of Scanpath Models in Graph-Based Visualization

Angela Lopez-Cardona,Parvin Emami,Sebastian Idesis,Saravanakumar Duraisamy,Luis A. Leiva,Ioannis Arapakis

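Task: Evaluate how accurately computational models (such as DeepGaze, UMSS, and Gazeformer) predict the allocation of visual attention, and study how question complexity and number of nodes affect model performance.

Motivation: Collecting eye-tracking data poses challenges of cost, privacy, and scalability; computational models offer an alternative for predicting gaze patterns, thereby advancing information visualization research.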

Details

Method: Runs an eye-tracking experiment with 40 participants, compares human scanpaths with synthetic scanpaths generated by the models, and analyzes model behavior across question complexity and number of nodes. Result: The study evaluates the accuracy of the models and shows how question complexity and number of nodes influence their performance. Conclusion: The work contributes to predictive modeling in visual analytics and helps improve the design and effectiveness of information visualization systems. Abstract: Information Visualization (InfoVis) systems utilize visual representations to enhance data interpretation. Understanding how visual attention is allocated is essential for optimizing interface design. However, collecting Eye-tracking (ET) data presents challenges related to cost, privacy, and scalability. Computational models provide alternatives for predicting gaze patterns, thereby advancing InfoVis research. In our study, we conducted an ET experiment with 40 participants who analyzed graphs while responding to questions of varying complexity within the context of digital forensics. We compared human scanpaths with synthetic ones generated by models such as DeepGaze, UMSS, and Gazeformer. Our research evaluates the accuracy of these models and examines how question complexity and number of nodes influence performance. This work contributes to the development of predictive modeling in visual analytics, offering insights that can enhance the design and effectiveness of InfoVis systems.

ORAL: Prompting Your Large-Scale LoRAs via Conditional Recurrent Diffusion

Rana Muhammad Shahroz Khan,Dongwen Tang,Pingzhi Li,Kai Wang,Tianlong Chen

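Task: Propose ORAL, a conditional recurrent diffusion framework that generates task-specific LoRA parameters for constantly updated large language models.

Motivation: Existing parameter generation methods struggle to achieve scalability and controllability at the same time; ORAL aims to address this limitation.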

Details

Method: ORAL uses a conditional recurrent diffusion framework that conditions on model architecture and textual task specifications to generate task-specific LoRA parameters. Result: Across seven language tasks, four vision tasks, and three multimodal tasks, ORAL generates LoRA parameters that match or exceed the performance of conventionally trained counterparts. Conclusion: ORAL addresses both scalability and controllability, offering a new approach to efficient adaptation of large language models. Abstract: Parameter generation has emerged as a novel paradigm for neural network development, offering an alternative to traditional neural network training by synthesizing high-quality model weights directly. In the context of Low-Rank Adaptation (LoRA) for evolving (i.e., constantly updated) large language models (LLMs), this approach promises efficient adaptation without costly retraining. However, existing methods face critical limitations in simultaneously achieving scalability and controllability. In this paper, we introduce ORAL, a novel conditional recurrent diffusion framework that addresses these challenges. ORAL incorporates a novel conditioning mechanism that integrates model architecture and textual task specifications, enabling the generation of task-specific LoRA parameters that can seamlessly transfer across evolving foundation models. Our approach successfully scales to billions-of-parameter LLMs and maintains controllability. Through extensive experiments across seven language tasks, four vision tasks, and three multimodal tasks using five pre-trained LLMs, we demonstrate that ORAL generates high-quality LoRA parameters that achieve comparable or superior performance to vanilla trained counterparts.
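
For context on what "LoRA parameters" are: a LoRA adapter stores two small matrices B and A whose product is added to a frozen weight matrix, scaled by alpha divided by the rank. The sketch below shows only how such a pair, however it was obtained, is merged into a base weight and how few values it holds. It does not implement ORAL's diffusion-based generation, and the function names are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, lora_b, lora_a, alpha):
    """Return W + (alpha / r) * B @ A, where B is (out, r) and A is (r, in)."""
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


def lora_param_count(out_features, in_features, rank):
    """Number of values in one adapter pair, versus out * in for a full update."""
    return rank * (out_features + in_features)


# Rank-1 adapter merged into a 2x2 identity weight.
merged = apply_lora([[1.0, 0.0], [0.0, 1.0]], [[1.0], [2.0]], [[3.0, 4.0]], alpha=1.0)
```

For a 4096 by 4096 layer at rank 8, an adapter pair holds 65,536 values instead of roughly 16.8 million, which gives a sense of why generating adapters directly is more tractable than generating full weights.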