2025 03 24

Token-Level Uncertainty-Aware Objective for Language Model Post-Training

Tingkai Liu,Ari S. Benjamin,Anthony M. Zador

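Task: Study the relationship between token-level uncertainty in causal language modeling and two training objectives: masked maximum likelihood (MLE) and self-distillation.

Motivation: Explore how uncertainty-aware training can improve language model performance and address the tendency of masked MLE to overfit.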

Details

Method: Combine masked maximum likelihood (MLE) and self-distillation into one training objective and validate it across multiple architectures (Gemma, LLaMA, Phi) and datasets (Alpaca, ShareGPT, GSM8K).

Result: The proposed objective significantly improves performance and mitigates overfitting while maintaining adaptability during post-training.

Conclusion: Uncertainty-aware training is an effective mechanism for enhancing language model training.

Abstract: In the current work, we connect token-level uncertainty in causal language modeling to two types of training objectives: 1) masked maximum likelihood (MLE), 2) self-distillation. We show that masked MLE is effective in reducing epistemic uncertainty, and serves as an effective token-level automatic curriculum learning technique. However, masked MLE is prone to overfitting and requires self-distillation regularization to improve or maintain performance on out-of-distribution tasks. We demonstrate significant performance gain via the proposed training objective, which combines masked MLE and self-distillation, across multiple architectures (Gemma, LLaMA, Phi) and datasets (Alpaca, ShareGPT, GSM8K), mitigating overfitting while maintaining adaptability during post-training. Our findings suggest that uncertainty-aware training provides an effective mechanism for enhancing language model training.
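
To make the shape of such a combined objective concrete, here is a minimal pure-Python sketch of a token-level loss that adds a masked negative log-likelihood term to a self-distillation (KL) term. It is an illustration under stated assumptions, not the authors' implementation: the masking rule (keep only tokens where a frozen reference model's predictive entropy reaches a threshold), the names `combined_loss`, `entropy_threshold`, and `distill_weight`, and the averaging scheme are all hypothetical, since the abstract does not specify them.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities for one token position."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def entropy(log_probs):
    """Shannon entropy (in nats) of a distribution given as log-probabilities."""
    return -sum(math.exp(lp) * lp for lp in log_probs)

def kl_divergence(log_p, log_q):
    """KL(p || q) for two distributions given as log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def combined_loss(student_logits, teacher_logits, targets,
                  entropy_threshold=0.5, distill_weight=1.0):
    """Masked MLE plus self-distillation over a sequence of token positions.

    student_logits / teacher_logits: one list of vocabulary logits per token.
    targets: the index of the ground-truth token at each position.
    """
    nll_sum, n_kept, kl_sum = 0.0, 0, 0.0
    for s, t, y in zip(student_logits, teacher_logits, targets):
        log_s, log_t = log_softmax(s), log_softmax(t)
        # ASSUMPTION: a token contributes to the MLE term only when the
        # reference (teacher) model is uncertain about that position.
        if entropy(log_t) >= entropy_threshold:
            nll_sum -= log_s[y]
            n_kept += 1
        # Self-distillation: keep the student close to the reference
        # distribution at every position.
        kl_sum += kl_divergence(log_t, log_s)
    masked_mle = nll_sum / n_kept if n_kept else 0.0
    distill = kl_sum / len(targets)
    return masked_mle + distill_weight * distill
```

With `distill_weight=0` the sketch reduces to plain masked MLE, and with a threshold no token reaches it reduces to pure distillation toward the reference model.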

Medifact at PerAnsSumm 2025: Leveraging Lightweight Models for Perspective-Specific Summarization of Clinical Q&A Forums

Nadia Saeed

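Task: Propose a few-shot learning approach based on a Snorkel-BART-SVM framework for classifying and summarizing open-ended healthcare community question answering (CQA).

Motivation: Improve perspective-aware summarization of healthcare Q&A through weak supervision and pretrained models, in support of clinical decision systems.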

Details

Method: Train an SVM with Snorkel weak supervision to classify perspective-relevant sentences, then generate summaries with a pretrained BART-CNN model.

Result: Ranked 12th among 100 teams, demonstrating computational efficiency and contextual accuracy.

Conclusion: The approach advances medical CQA research and provides a practical tool for clinical decision support systems.

Abstract: The PerAnsSumm 2025 challenge focuses on perspective-aware healthcare answer summarization (Agarwal et al., 2025). This work proposes a few-shot learning framework using a Snorkel-BART-SVM pipeline for classifying and summarizing open-ended healthcare community question-answering (CQA). An SVM model is trained with weak supervision via Snorkel, enhancing zero-shot learning. Extractive classification identifies perspective-relevant sentences, which are then summarized using a pretrained BART-CNN model. The approach achieved 12th place among 100 teams in the shared task, demonstrating computational efficiency and contextual accuracy. By leveraging pretrained summarization models, this work advances medical CQA research and contributes to clinical decision support systems.

Highlighting Case Studies in LLM Literature Review of Interdisciplinary System Science

Lachlan McGinness,Peter Baumgartner

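Task: Evaluate the performance of large language models (LLMs) on systematic literature review (SLR) tasks.

Motivation: Examine the accuracy and practical usefulness of LLMs in assisting researchers with SLRs, particularly in extracting evidence and answering research questions.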

Details

Method: Across four case studies, vary parameters to assess LLM accuracy, use a semantic text highlighting tool to support expert review, and verify the correctness of LLM answers with two methods: expert review and embedding similarity.

Result: LLMs reproduced quotes from the literature with over 95% accuracy and answered research questions with approximately 83% accuracy; the correlation between the two verification methods ranged from 0.48 to 0.77, indicating that embedding similarity is a valid measure of semantic similarity.

Conclusion: LLMs show high accuracy on SLR tasks, and embedding similarity can serve as a valid metric for verifying LLM answers.

Abstract: Large Language Models (LLMs) were used to assist four Commonwealth Scientific and Industrial Research Organisation (CSIRO) researchers to perform systematic literature reviews (SLR). We evaluate the performance of LLMs for SLR tasks in these case studies. In each, we explore the impact of changing parameters on the accuracy of LLM responses. The LLM was tasked with extracting evidence from chosen academic papers to answer specific research questions. We evaluate the models' performance in faithfully reproducing quotes from the literature, and subject experts were asked to assess the model performance in answering the research questions. We developed a semantic text highlighting tool to facilitate expert review of LLM responses. We found that state-of-the-art LLMs were able to reproduce quotes from texts with greater than 95% accuracy and answer research questions with an accuracy of approximately 83%. We use two methods to determine the correctness of LLM responses: expert review and the cosine similarity of transformer embeddings of LLM and expert answers. The correlation between these methods ranged from 0.48 to 0.77, providing evidence that the latter is a valid metric for measuring semantic similarity.
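
The embedding-based check mentioned in the abstract can be sketched in a few lines of Python. The code below computes the cosine similarity of two embedding vectors and then correlates a list of similarity scores with expert ratings. It is only a sketch: the embedding vectors are assumed to come from some transformer encoder that is not shown, and the use of the Pearson coefficient is an assumption, since the abstract does not say which correlation measure was used.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists.

    ASSUMPTION: Pearson is used here for illustration only.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

In use, each LLM answer and the matching expert answer would be embedded, scored with `cosine_similarity`, and the resulting scores passed to `pearson` together with the expert correctness ratings.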

Using LLMs for Automated Privacy Policy Analysis: Prompt Engineering, Fine-Tuning and Explainability

Yuxin Chen,Peng Tang,Weidong Qiu,Shujun Li

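Task: Study the automated application of large language models (LLMs) to privacy policy analysis.

Motivation: Although LLMs perform well on many NLP tasks, their use in privacy policy analysis remains under-explored.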

Details

Method: Combine prompt engineering with LoRA (low-rank adaptation) fine-tuning and evaluate LLM classifiers on four state-of-the-art privacy policy corpora and taxonomies.

Result: LLM classifiers that combine prompt engineering and fine-tuning significantly and consistently outperform other SOTA methods, and score above 91.1% on all three explainability metrics.

Conclusion: LLMs can improve privacy policy classification performance and also enhance the explainability of detection results.

Abstract: Privacy policies are widely used by digital services and often required for legal purposes. Many machine learning based classifiers have been developed to automate detection of different concepts in a given privacy policy, which can help facilitate other automated tasks such as producing a more reader-friendly summary and detecting legal compliance issues. Despite the successful applications of large language models (LLMs) to many NLP tasks in various domains, there is very little work studying the use of LLMs for automated privacy policy analysis; therefore, whether and how LLMs can help automate privacy policy analysis remains under-explored. To fill this research gap, we conducted a comprehensive evaluation of LLM-based privacy policy concept classifiers, employing both prompt engineering and LoRA (low-rank adaptation) fine-tuning, on four state-of-the-art (SOTA) privacy policy corpora and taxonomies. Our experimental results demonstrated that combining prompt engineering and fine-tuning can make LLM-based classifiers outperform other SOTA methods, significantly and consistently across privacy policy corpora/taxonomies and concepts. Furthermore, we evaluated the explainability of the LLM-based classifiers using three metrics: completeness, logicality, and comprehensibility. For all three metrics, a score exceeding 91.1% was observed in our evaluation, indicating that LLMs are not only useful to improve the classification performance, but also to enhance the explainability of detection results.
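
For readers unfamiliar with LoRA, the sketch below shows the core idea in pure Python: the pretrained weight matrix W stays frozen, and a trainable low-rank product B·A, scaled by alpha/r, is added to its output. This illustrates the general technique, not the paper's training setup; the function names and the toy dimensions are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(w, a, b, x, alpha):
    """Compute W x + (alpha / r) * B (A x).

    w: frozen weight, shape d_out x d_in
    a: trainable down-projection, shape r x d_in
    b: trainable up-projection, shape d_out x r
    """
    r = len(a)                          # rank of the update
    base = matvec(w, x)                 # frozen pretrained path
    update = matvec(b, matvec(a, x))    # low-rank trainable path
    scale = alpha / r
    return [h + scale * u for h, u in zip(base, update)]

def trainable_parameters(d_in, d_out, r):
    """Number of parameters LoRA trains, versus d_in * d_out for full tuning."""
    return r * (d_in + d_out)
```

Because B is conventionally initialized to zero, the adapted layer starts out identical to the pretrained one, and only `r * (d_in + d_out)` parameters per adapted matrix are updated during fine-tuning.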

Adams Bashforth Moulton Solver for Inversion and Editing in Rectified Flow

Yongjia Ma,Donglin Di,Xuan Liu,Xiaokai Chen,Lei Fan,Wei Chen,Tonghua Su

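Task: Propose ABM-Solver, a high-accuracy ODE solver based on the Adams-Bashforth-Moulton (ABM) predictor-corrector method, to improve the sampling speed and accuracy of rectified flow models.

Motivation: Existing numerical solvers trade off fast sampling against high-accuracy solutions, limiting their effectiveness in downstream applications such as reconstruction and editing.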

Details

Method: Use a multi-step predictor-corrector method to reduce local truncation error, apply adaptive step size adjustment to improve sampling speed, and introduce a Mask Guided Feature Injection module to distinguish preserved regions from editable ones.

Result: Experiments on multiple high-resolution image datasets show that ABM-Solver significantly improves inversion precision and editing quality, outperforming existing solvers without additional training or optimization.

Conclusion: By improving ODE solving and feature injection, ABM-Solver effectively improves the performance of rectified flow models on generation tasks.

Abstract: Rectified flow models have achieved remarkable performance in image and video generation tasks. However, existing numerical solvers face a trade-off between fast sampling and high-accuracy solutions, limiting their effectiveness in downstream applications such as reconstruction and editing. To address this challenge, we propose leveraging the Adams-Bashforth-Moulton (ABM) predictor-corrector method to enhance the accuracy of ODE solving in rectified flow models. Specifically, we introduce ABM-Solver, which integrates a multi-step predictor-corrector approach to reduce local truncation errors and employs Adaptive Step Size Adjustment to improve sampling speed. Furthermore, to effectively preserve non-edited regions while facilitating semantic modifications, we introduce a Mask Guided Feature Injection module. We estimate self-similarity to generate a spatial mask that differentiates preserved regions from those available for editing. Extensive experiments on multiple high-resolution image datasets validate that ABM-Solver significantly improves inversion precision and editing quality, outperforming existing solvers without requiring additional training or optimization.
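
The predictor-corrector idea behind the ABM family can be illustrated on a scalar ODE. The sketch below is a textbook second-order scheme with a fixed step size: a two-step Adams-Bashforth predictor followed by a trapezoidal Adams-Moulton corrector, bootstrapped with one Heun step. It is not the paper's ABM-Solver, which adds adaptive step sizes and operates on a learned velocity field; the function name `abm2_solve` and the scalar test problem are assumptions made for illustration.

```python
import math

def abm2_solve(f, t0, y0, t1, n_steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with a 2-step ABM scheme."""
    h = (t1 - t0) / n_steps
    t, y = t0, y0
    f_prev = f(t, y)
    # Bootstrap: a multistep method needs one earlier slope, so take a
    # single Heun (explicit trapezoidal) step first.
    y_euler = y + h * f_prev
    y = y + 0.5 * h * (f_prev + f(t + h, y_euler))
    t += h
    for _ in range(n_steps - 1):
        f_curr = f(t, y)
        # Predictor: two-step Adams-Bashforth extrapolation.
        y_pred = y + 0.5 * h * (3.0 * f_curr - f_prev)
        # Corrector: Adams-Moulton (trapezoidal rule) using the predicted slope.
        y = y + 0.5 * h * (f_curr + f(t + h, y_pred))
        f_prev = f_curr
        t += h
    return y

def euler_solve(f, t0, y0, t1, n_steps):
    """First-order baseline for comparison."""
    h = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        y += h * f(t, y)
        t += h
    return y
```

On the test problem dy/dt = -y with y(0) = 1, the ABM sketch with 100 steps lands much closer to exp(-1) than Euler does with the same step count, which is the accuracy-versus-cost trade-off the abstract refers to.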

Not All Personas Are Worth It: Culture-Reflective Persona Data Augmentation

Ji-Eun Han,Yoonseok Heo

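Task: Propose a two-step pipeline for generating culture-specific personas and introduce the KoPersona dataset to capture Korean cultural characteristics.

Motivation: Existing persona datasets fall short in cultural diversity and adaptability, which limits their effectiveness in building culturally aware AI systems.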

Details

Method: Generate culture-specific personas with a two-step pipeline and build KoPersona, a dataset of 200,000 personas.

Result: Various metrics validate the quality of KoPersona and its relevance to Korean culture.

Conclusion: The work contributes to persona-based research and lays the groundwork for a scalable approach adaptable to different languages and cultural contexts.

Abstract: Incorporating personas into conversational AI models is crucial for achieving authentic and engaging interactions. However, the cultural diversity and adaptability of existing persona datasets are often overlooked, reducing their efficacy in building culturally aware AI systems. To address this issue, we propose a two-step pipeline for generating culture-specific personas and introduce KoPersona, a dataset comprising 200,000 personas designed to capture Korean cultural values, behaviors, and social nuances. A comprehensive evaluation through various metrics validates the quality of KoPersona and its relevance to Korean culture. This work not only contributes to persona-based research, but also establishes a scalable approach for creating culturally relevant personas adaptable to various languages and cultural contexts.

Vision-Language Embodiment for Monocular Depth Estimation

Jinchang Zhang,Guoyu Lu

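Task: Propose a monocular depth estimation method that combines the physical characteristics of the camera with deep learning.

Motivation: Current depth estimation models rely mainly on inter-image relationships for supervised training and overlook the intrinsic information provided by the camera itself.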

Details

Method: Embed the camera model and its physical characteristics into a deep learning model and combine RGB image features with text descriptions to compute scene depth in real time.

Result: Experiments show that the method improves model performance across different scenes.

Conclusion: Combining camera characteristics with language priors effectively improves the real-time capability and accuracy of monocular depth estimation.

Abstract: Depth estimation is a core problem in robotic perception and vision tasks, but 3D reconstruction from a single image presents inherent uncertainties. Current depth estimation models primarily rely on inter-image relationships for supervised training, often overlooking the intrinsic information provided by the camera itself. We propose a method that embodies the camera model and its physical characteristics into a deep learning model, computing embodied scene depth through real-time interactions with road environments. The model can calculate embodied scene depth in real time based on immediate environmental changes using only the intrinsic properties of the camera, without any additional equipment. By combining embodied scene depth with RGB image features, the model gains a comprehensive perspective on both geometric and visual details. Additionally, we incorporate text descriptions containing environmental content and depth information as priors for scene understanding, enriching the model's perception of objects. This integration of image and language, two inherently ambiguous modalities, leverages their complementary strengths for monocular depth estimation. The real-time nature of the embodied language and depth prior model ensures that the model can continuously adjust its perception and behavior in dynamic environments. Experimental results show that the embodied depth estimation method enhances model performance across different scenes.

Mind2: Mind-to-Mind Emotional Support System with Bidirectional Cognitive Discourse Analysis

Shi Yin Hong,Uttamasha Oyshi,Quan Mai,Gibson Nkhata,Susan Gauch

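Task: Propose Mind-to-Mind (Mind2), an emotional support framework for generating emotional support dialogues with timely context and interpretability.

Motivation: Existing emotional support systems lack timely context and interpretability when generating effective dialogues, making it hard to earn public trust.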

Details

Method: Driven by cognitive models, model interpretable context from a discourse analysis perspective, use a dynamic propagation window that adapts as the conversation progresses, and extract cognitive knowledge by integrating Theory-of-Mind, physiological expected utility, and cognitive rationality.

Result: Experiments show that Mind2 performs competitively with state-of-the-art systems while using only 10% of the training data.

Conclusion: By strengthening context modeling and interpretability, the Mind2 framework improves the effectiveness and trustworthiness of emotional support dialogues.

Abstract: Emotional support (ES) systems alleviate users' mental distress by generating strategic supportive dialogues based on diverse user situations. However, ES systems are limited in their ability to generate effective ES dialogues that include timely context and interpretability, hindering them from earning public trust. Driven by cognitive models, we propose Mind-to-Mind (Mind2), an ES framework that approaches interpretable ES context modeling for the ES dialogue generation task from a discourse analysis perspective. Specifically, we perform cognitive discourse analysis on ES dialogues according to our dynamic discourse context propagation window, which accommodates evolving context as the conversation between the ES system and user progresses. To enhance interpretability, Mind2 prioritizes details that reflect each speaker's belief about the other speaker with bidirectionality, integrating Theory-of-Mind, physiological expected utility, and cognitive rationality to extract cognitive knowledge from ES conversations. Experimental results support that Mind2 achieves competitive performance versus state-of-the-art ES systems while trained with only 10% of the available training data.

Leveraging Vision-Language Models for Open-Vocabulary Instance Segmentation and Tracking

Bastian Pätzold,Jan Nogga,Sven Behnke

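Task: Propose a new approach that combines vision-language models (VLMs) with open-vocabulary detection (OVD), instance segmentation, and tracking.

Motivation: Integrate the descriptive power of VLMs with the grounding capability of OVD and the pixel-level understanding and speed of video segmentation models to extract task-specific attributes from non-standard objects in dynamic environments.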

Details

Method: Use a VLM to generate structured descriptions, use OVD to extract bounding boxes, and pass them to a video segmentation model that provides precise segmentation masks and tracking.

Result: Evaluation across multiple datasets and robotics platforms demonstrates broad applicability, with real-time processing of image streams and extraction of attributes from non-standard objects in dynamic environments.

Conclusion: The approach combines the descriptive power of VLMs, the grounding capability of OVD, and the pixel-level understanding of video segmentation, and suits real-time processing in dynamic environments.

Abstract: This paper introduces a novel approach that leverages the capabilities of vision-language models (VLMs) by integrating them with established approaches for open-vocabulary detection (OVD), instance segmentation, and tracking. We utilize VLM-generated structured descriptions to identify visible object instances, collect application-relevant attributes, and inform an open-vocabulary detector to extract corresponding bounding boxes that are passed to a video segmentation model providing precise segmentation masks and tracking capabilities. Once initialized, this model can then directly extract segmentation masks, allowing processing of image streams in real time with minimal computational overhead. Tracks can be updated online as needed by generating new structured descriptions and corresponding open-vocabulary detections. This combines the descriptive power of VLMs with the grounding capability of OVD and the pixel-level understanding and speed of video segmentation. Our evaluation across datasets and robotics platforms demonstrates the broad applicability of this approach, showcasing its ability to extract task-specific attributes from non-standard objects in dynamic environments.

Huan Yang,Renji Zhang,Deyu Zhang

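Task: Propose KVShare, a multi-user key-value (KV) cache sharing technology based on semantic similarity, to improve the inference efficiency of large language models (LLMs) and multimodal large language models (MLLMs).

Motivation: Address the limitations of existing prefix caching (strict text prefix matching) and semantic caching (loss of response diversity).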

Details

Method: Achieve fine-grained KV cache reuse through semantic alignment algorithms and differential editing operations.

Result: Experiments on real-world user conversation datasets show that KVShare improves KV cache hit rates by over 60% while maintaining output quality (no significant degradation in BLEU and Rouge-L).

Conclusion: KVShare effectively reduces GPU resource consumption and suits scenarios with repetitive queries, such as healthcare and education.

Abstract: This paper presents KVShare, a multi-user Key-Value (KV) Cache sharing technology based on semantic similarity, designed to enhance the inference efficiency of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Addressing the limitations of existing prefix caching (strict text prefix matching) and semantic caching (loss of response diversity), KVShare achieves fine-grained KV cache reuse through semantic alignment algorithms and differential editing operations. Experiments on real-world user conversation datasets demonstrate that KVShare improves KV cache hit rates by over 60%, while maintaining output quality comparable to full computation (no significant degradation in BLEU and Rouge-L metrics). This approach effectively reduces GPU resource consumption and is applicable to scenarios with repetitive queries, such as healthcare and education.

Defending Against Gradient Inversion Attacks for Biomedical Images via Learnable Data Perturbation

Shiyi Jiang,Farshad Firouzi,Krishnendu Chakrabarty

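Task: Propose a defense against gradient inversion attacks in federated learning.

Motivation: The growing need to share healthcare data and collaborate on clinical research raises privacy concerns, and existing defenses lack generalizability and are not adequately tested on real healthcare data.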

Details

Method: Use latent data perturbation and minimax optimization, with both general and medical image datasets.

Result: The defense reduces the attacker's accuracy in classifying reconstructed images by 12.5% and increases the MSE between original and reconstructed images by over 12.4%, while maintaining around 90% client classification accuracy.

Conclusion: The method shows the potential of a generalizable defense for healthcare data.

Abstract: The increasing need for sharing healthcare data and collaborating on clinical research has raised privacy concerns. Health information leakage due to malicious attacks can lead to serious problems such as misdiagnoses and patient identification issues. Privacy-preserving machine learning (PPML) and privacy-enhancing technologies, particularly federated learning (FL), have emerged in recent years as innovative solutions to balance privacy protection with data utility; however, they also suffer from inherent privacy vulnerabilities. Gradient inversion attacks constitute major threats to data sharing in federated learning. Researchers have proposed many defenses against gradient inversion attacks. However, current defense methods for healthcare data lack generalizability, i.e., existing solutions may not be applicable to data from a broader range of populations. In addition, most existing defense methods are tested using non-healthcare data, which raises concerns about their applicability to real-world healthcare systems. In this study, we present a defense against gradient inversion attacks in federated learning. We achieve this using latent data perturbation and minimax optimization, utilizing both general and medical image datasets. Our method is compared to two baselines, and the results show that our approach can outperform the baselines with a reduction of 12.5% in the attacker's accuracy in classifying reconstructed images. The proposed method also yields an increase of over 12.4% in Mean Squared Error (MSE) between the original and reconstructed images at the same level of model utility of around 90% client classification accuracy. The results suggest the potential of a generalizable defense for healthcare data.

LLM Generated Persona is a Promise with a Catch

Ang Li,Haozhe Chen,Hongseok Namkoong,Tianyi Peng

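Task: Examine the use of large language models (LLMs) to generate synthetic personas for simulating human behavior, and the biases this can introduce.

Motivation: Traditional persona data collection is costly, constrained by privacy, and struggles to capture multi-dimensional attributes, while existing LLM-based generation lacks methodological rigor and leads to systematic biases.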

Details

Method: Evaluate the biases of LLM-generated personas through large-scale experiments, including presidential election forecasts and general opinion surveys of the U.S. population.

Result: The experiments show that LLM-generated personas carry significant biases that can lead to significant deviations from real-world outcomes.

Conclusion: A rigorous science of persona generation is needed; the paper outlines the methodological innovations, organizational support, and empirical foundations required to improve the reliability and scalability of LLM-driven simulations.

Abstract: The use of large language models (LLMs) to simulate human behavior has gained significant attention, particularly through personas that approximate individual characteristics. Persona-based simulations hold promise for transforming disciplines that rely on population-level feedback, including social science, economic analysis, marketing research, and business operations. Traditional methods to collect realistic persona data face significant challenges. They are prohibitively expensive and logistically challenging due to privacy constraints, and often fail to capture multi-dimensional attributes, particularly subjective qualities. Consequently, synthetic persona generation with LLMs offers a scalable, cost-effective alternative. However, current approaches rely on ad hoc and heuristic generation techniques that do not guarantee methodological rigor or simulation precision, resulting in systematic biases in downstream tasks. Through extensive large-scale experiments including presidential election forecasts and general opinion surveys of the U.S. population, we reveal that these biases can lead to significant deviations from real-world outcomes. Our findings underscore the need to develop a rigorous science of persona generation and outline the methodological innovations, organizational and institutional support, and empirical foundations required to enhance the reliability and scalability of LLM-driven persona simulations. To support further research and development in this area, we have open-sourced approximately one million generated personas, available for public access and analysis at https://huggingface.co/datasets/Tianyi-Lab/Personas.

A Comprehensive Survey on Architectural Advances in Deep CNNs: Challenges, Applications, and Emerging Research Directions

Saddam Hussain Khan,Rashid Iqbal

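Task: Survey the evolution of deep convolutional neural networks (CNNs) from 2015 to 2025, classify their architectures, and summarize application areas.

Motivation: CNNs have driven breakthroughs in many fields, but a unified taxonomy and systematic review to guide future research are lacking.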

Details

Method: Propose a unified taxonomy based on dimensions such as spatial exploitation, multi-path structures, depth, and width, and systematically review CNN applications across domains.

Result: The survey summarizes the key innovations, challenges, and opportunities of CNNs and outlines future research directions such as hybrid CNN-transformer models.

Conclusion: The survey offers a comprehensive perspective on the evolution of CNNs and points to directions for future research.

Abstract: Deep Convolutional Neural Networks (CNNs) have significantly advanced deep learning, driving breakthroughs in computer vision, natural language processing, medical diagnosis, object detection, and speech recognition. Architectural innovations including 1D, 2D, and 3D convolutional models, dilated and grouped convolutions, depthwise separable convolutions, and attention mechanisms address domain-specific challenges and enhance feature representation and computational efficiency. Structural refinements such as spatial-channel exploitation, multi-path design, and feature-map enhancement contribute to robust hierarchical feature extraction and improved generalization, particularly through transfer learning. Efficient preprocessing strategies, including Fourier transforms, structured transforms, low-precision computation, and weight compression, optimize inference speed and facilitate deployment in resource-constrained environments. This survey presents a unified taxonomy that classifies CNN architectures based on spatial exploitation, multi-path structures, depth, width, dimensionality expansion, channel boosting, and attention mechanisms. It systematically reviews CNN applications in face recognition, pose estimation, action recognition, text classification, statistical language modeling, disease diagnosis, radiological analysis, cryptocurrency sentiment prediction, 1D data processing, video analysis, and speech recognition. In addition to consolidating architectural advancements, the review highlights emerging learning paradigms such as few-shot, zero-shot, weakly supervised, and federated learning frameworks; future research directions include hybrid CNN-transformer models, vision-language integration, and generative learning. This review provides a comprehensive perspective on the evolution of CNNs from 2015 to 2025, outlining key innovations, challenges, and opportunities.

HDLCoRe: A Training-Free Framework for Mitigating Hallucinations in LLM-Generated HDL

Heng Ping,Shixuan Li,Peiyu Zhang,Anzhe Cheng,Shukai Duan,Nikos Kanakaris,Xiongye Xiao,Wei Yang,Shahin Nazarian,Andrei Irimia,Paul Bogdan

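Task: Propose the HDLCoRe framework, which uses prompt engineering and retrieval-augmented generation to improve the ability of large language models to generate hardware description language (HDL) code.

Motivation: Data scarcity causes existing large language models to hallucinate and produce incorrect code on HDL generation tasks, limiting their performance.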

Details

Method: Use an HDL-aware Chain-of-Thought prompting technique and a two-stage heterogeneous retrieval-augmented generation system, with no model fine-tuning.

Result: The framework achieves superior performance on the RTLLM2.0 benchmark, significantly reducing hallucinations and improving syntactic and functional correctness.

Conclusion: HDLCoRe effectively improves the HDL generation capabilities of LLMs without additional training.

Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, when applied to hardware description languages (HDL), these models exhibit significant limitations due to data scarcity, resulting in hallucinations and incorrect code generation. To address these challenges, we propose HDLCoRe, a training-free framework that enhances LLMs' HDL generation capabilities through prompt engineering techniques and retrieval-augmented generation (RAG). Our approach consists of two main components: (1) an HDL-aware Chain-of-Thought (CoT) prompting technique with self-verification that classifies tasks by complexity and type, incorporates domain-specific knowledge, and guides LLMs through step-by-step self-simulation for error correction; and (2) a two-stage heterogeneous RAG system that addresses formatting inconsistencies through key component extraction and efficiently retrieves relevant HDL examples through sequential filtering and re-ranking. HDLCoRe eliminates the need for model fine-tuning while substantially improving LLMs' HDL generation capabilities. Experimental results demonstrate that our framework achieves superior performance on the RTLLM2.0 benchmark, significantly reducing hallucinations and improving both syntactic and functional correctness.

MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems

Felix Chen,Hangjie Yuan,Yunqiu Xu,Tao Feng,Jun Cen,Pengwei Liu,Zeying Huang,Yi Yang

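Task: Evaluate multimodal large language models (MLLMs) on visual mathematical problem solving and propose an improved approach.

Motivation: Existing MLLMs underperform on visual mathematical problem solving, particularly in accurately perceiving and interpreting diagrams.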

Details

Method: Develop the FlowVerse benchmark, which divides problem-solving information into four components, and design MathFlow, a modular pipeline that decouples perception from inference.

Result: Existing MLLMs show substantial limitations in extracting information from diagrams and in complex reasoning; MathFlow-P-7B, a dedicated perception model, yields substantial performance gains.

Conclusion: The MathFlow pipeline is effective for visual mathematical problem solving and compatible with diverse inference frameworks.

Abstract: Despite impressive performance across diverse tasks, Multimodal Large Language Models (MLLMs) have yet to fully demonstrate their potential in visual mathematical problem-solving, particularly in accurately perceiving and interpreting diagrams. Inspired by typical processes of humans, we hypothesize that the perceptual capability to extract meaningful information from diagrams is crucial, as it directly impacts subsequent inference processes. To validate this hypothesis, we developed FlowVerse, a comprehensive benchmark that categorizes all information used during problem-solving into four components, which are then combined into six problem versions for evaluation. Our preliminary results on FlowVerse reveal that existing MLLMs exhibit substantial limitations when extracting essential information and reasoned properties from diagrams and performing complex reasoning based on these visual inputs. In response, we introduce MathFlow, a modular problem-solving pipeline that decouples perception and inference into distinct stages, thereby optimizing each independently. Given the perceptual limitations observed in current MLLMs, we trained MathFlow-P-7B as a dedicated perception model. Experimental results indicate that MathFlow-P-7B yields substantial performance gains when integrated with various closed-source and open-source inference models. This demonstrates the effectiveness of the MathFlow pipeline and its compatibility with diverse inference frameworks. The FlowVerse benchmark and code are available at https://github.com/MathFlow-zju/MathFlow.

Safety Evaluation and Enhancement of DeepSeek Models in Chinese Contexts

Wenjing Zhang,Xuejiao Lei,Zhaoxiang Liu,Limin Han,Jiaojiao Zhao,Beibei Huang,Zhenhong Long,Junting Guo,Meijuan An,Rongjia Du,Ning Wang,Kai Wang,Shiguo Lian

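Task: Evaluate the safety capabilities of the DeepSeek-R1 series distilled models in Chinese contexts and implement targeted safety enhancements.

Motivation: DeepSeek-R1 has notable safety shortcomings, and the safety capabilities of its distilled models have not been comprehensively evaluated.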

Details

Method: Use the Chinese safety benchmark CHiSafetyBench to evaluate the DeepSeek-R1 series distilled models, then apply safety enhancements.

Result: The enhanced models improve significantly in safety without notable degradation in reasoning capability.

Conclusion: The study provides a valuable resource for safety optimization of DeepSeek models and open-sources the enhanced models.

Abstract: DeepSeek-R1, renowned for its exceptional reasoning capabilities and open-source strategy, is significantly influencing the global artificial intelligence landscape. However, it exhibits notable safety shortcomings. Recent research conducted by Robust Intelligence, a subsidiary of Cisco, in collaboration with the University of Pennsylvania, revealed that DeepSeek-R1 achieves a 100% attack success rate when processing harmful prompts. Furthermore, multiple security firms and research institutions have identified critical security vulnerabilities within the model. Although China Unicom has uncovered safety vulnerabilities of R1 in Chinese contexts, the safety capabilities of the remaining distilled models in the R1 series have not yet been comprehensively evaluated. To address this gap, this study utilizes the comprehensive Chinese safety benchmark CHiSafetyBench to conduct an in-depth safety evaluation of the DeepSeek-R1 series distilled models. The objective is to assess the safety capabilities of these models in Chinese contexts both before and after distillation, and to further elucidate the adverse effects of distillation on model safety. Building on these findings, we implement targeted safety enhancements for six distilled models. Evaluation results indicate that the enhanced models achieve significant improvements in safety while maintaining reasoning capabilities without notable degradation. We open-source the safety-enhanced models at https://github.com/UnicomAI/DeepSeek-R1-Distill-Safe/tree/main to serve as a valuable resource for future research and optimization of DeepSeek models.

REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models

Jie Zhang,Zheng Yuan,Zhongqi Wang,Bei Yan,Sibo Wang,Xiangkui Cao,Zonghui Guo,Shiguang Shan,Xilin Chen

Task: Introduce REVAL, a comprehensive benchmark for evaluating the reliability and value of Large Vision-Language Models (LVLMs).

Motivation: Existing benchmarks lack breadth and depth to holistically assess LVLMs' strengths and limitations.

Details

Method: REVAL includes over 144K image-text VQA samples, structured into Reliability (truthfulness, robustness) and Values (ethics, safety, privacy) sections.

Result: Current LVLMs excel in perceptual tasks and toxicity avoidance but show vulnerabilities in adversarial scenarios, privacy, and ethical reasoning.

Conclusion: REVAL offers a robust framework for systematic assessment, guiding future improvements in LVLM development.

Abstract: The rapid evolution of Large Vision-Language Models (LVLMs) has highlighted the necessity for comprehensive evaluation frameworks that assess these models across diverse dimensions. While existing benchmarks focus on specific aspects such as perceptual abilities, cognitive capabilities, and safety against adversarial attacks, they often lack the breadth and depth required to provide a holistic understanding of LVLMs' strengths and limitations. To address this gap, we introduce REVAL, a comprehensive benchmark designed to evaluate the REliability and VALue of LVLMs. REVAL encompasses over 144K image-text Visual Question Answering (VQA) samples, structured into two primary sections: Reliability, which assesses truthfulness (e.g., perceptual accuracy and hallucination tendencies) and robustness (e.g., resilience to adversarial attacks, typographic attacks, and image corruption), and Values, which evaluates ethical concerns (e.g., bias and moral understanding), safety issues (e.g., toxicity and jailbreak vulnerabilities), and privacy problems (e.g., privacy awareness and privacy leakage). We evaluate 26 models, including mainstream open-source LVLMs and prominent closed-source models like GPT-4o and Gemini-1.5-Pro. Our findings reveal that while current LVLMs excel in perceptual tasks and toxicity avoidance, they exhibit significant vulnerabilities in adversarial scenarios, privacy preservation, and ethical reasoning. These insights underscore critical areas for future improvements, guiding the development of more secure, reliable, and ethically aligned LVLMs. REVAL provides a robust framework for researchers to systematically assess and compare LVLMs, fostering advancements in the field.

Enhancing LLM Generation with Knowledge Hypergraph for Evidence-Based Medicine

Chengfeng Dou,Ying Zhang,Zhi Jin,Wenpin Jiao,Haiyan Zhao,Yongqiang Zhao,Zhengwei Tao

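Task: Use large language models (LLMs) and knowledge hypergraphs to address dispersed evidence and inefficient evidence organization in evidence-based medicine (EBM).

Motivation: Current retrieval-augmented generation (RAG) techniques applied to EBM struggle with collecting dispersed evidence and organizing it efficiently, so a more effective way to support complex queries is needed.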

Details

Method: Propose a knowledge hypergraph-based evidence management model and an Importance-Driven Evidence Prioritization (IDEP) algorithm that uses LLMs to generate multiple evidence features and rank the evidence.

Result: Experiments on six datasets show that the approach outperforms existing RAG techniques in EBM application domains such as medical quizzing, hallucination detection, and decision support.

Conclusion: By integrating dispersed evidence and improving support for complex queries, the approach improves the efficiency and accuracy of EBM applications.

Abstract: Evidence-based medicine (EBM) plays a crucial role in the application of large language models (LLMs) in healthcare, as it provides reliable support for medical decision-making processes. Although it benefits from current retrieval-augmented generation (RAG) technologies, it still faces two significant challenges: the collection of dispersed evidence and the efficient organization of this evidence to support the complex queries necessary for EBM. To tackle these issues, we propose using LLMs to gather scattered evidence from multiple sources and present a knowledge hypergraph-based evidence management model to integrate this evidence while capturing intricate relationships. Furthermore, to better support complex queries, we have developed an Importance-Driven Evidence Prioritization (IDEP) algorithm that utilizes the LLM to generate multiple evidence features, each with an associated importance score, which are then used to rank the evidence and produce the final retrieval results. Experimental results from six datasets demonstrate that our approach outperforms existing RAG techniques in application domains of interest to EBM, such as medical quizzing, hallucination detection, and decision support. Testsets and the constructed knowledge graph can be accessed at https://drive.google.com/file/d/1WJ9QTokK3MdkjEmwuFQxwH96j_Byawj_/view?usp=drive_link.

World Knowledge from AI Image Generation for Robot Control

Jonas Krumme,Christoph Zetzsche

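Task: Use images produced by generative AI systems to solve under-specified tasks faced by robots.

Motivation: Robots lack a clear basis for decisions when handling under-specified tasks, whereas humans can fill the gaps with experience and knowledge. Images generated by generative AI systems implicitly encode real-world knowledge and may offer robots a solution.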

Details

Method: Generates images conditioned on the robot's real environment with a generative AI system and uses the implicit knowledge in those images to guide the robot through the task. Result: The study indicates that AI-generated images can supply useful implicit knowledge that helps robots solve under-specified tasks. Conclusion: Images from generative AI systems can serve as an effective tool for robots solving under-specified tasks, filling gaps in their knowledge. Abstract: When interacting with the world, robots face a number of difficult questions, having to make decisions when given under-specified tasks where they need to make choices, often without clearly defined right and wrong answers. Humans, on the other hand, can often rely on their knowledge and experience to fill in the gaps. For example, the simple task of organizing newly bought produce into the fridge involves deciding where to put each thing individually, how to arrange them together meaningfully, e.g. putting related things together, all while there is no clear right and wrong way to accomplish this task. We could encode all this information on how to do such things explicitly into the robots' knowledge base, but this can quickly become overwhelming, considering the number of potential tasks and circumstances the robot could encounter. However, images of the real world often implicitly encode answers to such questions and can show which configurations of objects are meaningful or are usually used by humans. An image of a full fridge can give a lot of information about how things are usually arranged in relation to each other and the full fridge at large. Modern generative systems are capable of generating plausible images of the real world and can be conditioned on the environment in which the robot operates. Here we investigate the idea of using the implicit knowledge about the world of modern generative AI systems given by their ability to generate convincing images of the real world to solve under-specified tasks.

EEG-CLIP: Learning EEG representations from natural language descriptions

Tidiane Camaret N'dir,Robin Tibor Schirrmeister

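Task: Develop EEG-CLIP, a contrastive learning framework that aligns EEG time series with clinical text descriptions in a shared embedding space.

Motivation: Exploit the mapping between clinical EEG recordings and their medical reports to enable more general EEG decoding.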

Details

Method: Uses the contrastive learning framework EEG-CLIP to align EEG time series with clinical text descriptions. Result: EEG-CLIP achieves a nontrivial alignment between text and EEG representations and supports few-shot and zero-shot decoding. Conclusion: EEG-CLIP is a promising approach to learning general EEG representations, supporting zero-shot decoding and few-shot training. Abstract: Deep networks for electroencephalogram (EEG) decoding are currently often trained to only solve a specific task like pathology or gender decoding. A more general approach leveraging the medical reports of clinical EEG recordings is to learn mappings between medical reports and EEG recordings. This approach was pioneered in the computer vision domain matching images and their text captions and subsequently enabled successful zero-shot decoding using textual class prompts. In this work, we follow this approach and develop a contrastive learning framework EEG-CLIP that aligns EEG time series and their corresponding clinical text descriptions in a shared embedding space. We investigate its potential for versatile EEG decoding, assessing performance on a range of few-shot and zero-shot settings. Overall, results show that EEG-CLIP manages to nontrivially align text and EEG representations. Our work presents a promising approach to learn general EEG representations, which could enable easier analyses of diverse decoding questions through zero-shot decoding or training task-specific models from fewer training examples. The code for reproducing our results is available at https://github.com/tidiane-camaret/EEGClip.
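
Aligning paired EEG and text embeddings in a shared space, as this abstract describes, is commonly done with a symmetric contrastive (InfoNCE) loss of the kind used for image-caption matching. The sketch below shows that loss in plain Python. It is an assumption that EEG-CLIP uses exactly this form; the function names, the toy 2-D embeddings, and the temperature value are illustrative only.

```python
import math


def _log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def _normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def clip_style_loss(eeg_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    eeg_emb[i] and text_emb[i] are assumed to come from the same recording.
    """
    eeg = [_normalize(v) for v in eeg_emb]
    txt = [_normalize(v) for v in text_emb]
    n = len(eeg)
    # Cosine-similarity logits, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(e, t)) / temperature for t in txt]
              for e in eeg]
    # EEG -> text direction: the matching report is the correct class.
    loss_e2t = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> EEG direction uses the transposed logits.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2e = -sum(_log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_e2t + loss_t2e)


# Toy batch: correctly paired embeddings versus deliberately swapped ones.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Zero-shot decoding then amounts to embedding one text prompt per class and picking the class whose embedding is most similar to the EEG embedding.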

UniK3D: Universal Camera Monocular 3D Estimation

Luigi Piccinelli,Christos Sakaridis,Mattia Segu,Yung-Hsu Yang,Siyuan Li,Wim Abbeloos,Luc Van Gool

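Task: Propose UniK3D, a universal monocular 3D estimation method applicable to any camera model.

Motivation: Existing methods rely on simplifying assumptions (such as pinhole camera models or rectified images), which leads to poor performance in real-world scenarios with fisheye or panoramic images.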

Details

Method: Introduces a spherical 3D representation and a model-independent representation of the pencil of rays, learned as a superposition of spherical harmonics, plus an angular loss that prevents contraction of the 3D outputs. Result: In zero-shot evaluation on 13 datasets, UniK3D achieves state-of-the-art performance on 3D, depth, and camera metrics, with especially strong results in large-field-of-view and panoramic settings. Conclusion: UniK3D is a general and efficient monocular 3D estimation method suited to a wide range of camera models and scenes. Abstract: Monocular 3D estimation is crucial for visual perception. However, current methods fall short by relying on oversimplified assumptions, such as pinhole camera models or rectified images. These limitations severely restrict their general applicability, causing poor performance in real-world scenarios with fisheye or panoramic images and resulting in substantial context loss. To address this, we present UniK3D, the first generalizable method for monocular 3D estimation able to model any camera. Our method introduces a spherical 3D representation which allows for better disentanglement of camera and scene geometry and enables accurate metric 3D reconstruction for unconstrained camera models. Our camera component features a novel, model-independent representation of the pencil of rays, achieved through a learned superposition of spherical harmonics. We also introduce an angular loss, which, together with the camera module design, prevents the contraction of the 3D outputs for wide-view cameras. A comprehensive zero-shot evaluation on 13 diverse datasets demonstrates the state-of-the-art performance of UniK3D across 3D, depth, and camera metrics, with substantial gains in challenging large-field-of-view and panoramic settings, while maintaining top accuracy in conventional pinhole small-field-of-view domains. Code and models are available at github.com/lpiccinelli-eth/unik3d.

From Patient Consultations to Graphs: Leveraging LLMs for Patient Journey Knowledge Graph Construction

Hassan S. Al Khatib,Sudip Mittal,Shahram Rahimi,Nina Marhamati,Sean Bozorgzad

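Task: Propose a method that uses large language models (LLMs) to construct Patient Journey Knowledge Graphs (PJKGs) and so integrate fragmented healthcare data.

Motivation: Existing healthcare data systems are fragmented and lack a holistic description of the patient journey, which makes coordinated care and personalized interventions difficult.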

Details

Method: Uses four LLMs (Claude 3.5, Mistral, Llama 3.1, and Chatgpt4o) to process and structure clinical documentation and unstructured patient-provider conversations into PJKGs. Result: All models achieved perfect structural compliance but differed in medical entity processing and computational efficiency. Conclusion: PJKGs help advance patient-centric healthcare, supporting better care coordination and outcome prediction; the paper also identifies future research directions. Abstract: The transition towards patient-centric healthcare necessitates a comprehensive understanding of patient journeys, which encompass all healthcare experiences and interactions across the care spectrum. Existing healthcare data systems are often fragmented and lack a holistic representation of patient trajectories, creating challenges for coordinated care and personalized interventions. Patient Journey Knowledge Graphs (PJKGs) represent a novel approach to addressing the challenge of fragmented healthcare data by integrating diverse patient information into a unified, structured representation. This paper presents a methodology for constructing PJKGs using Large Language Models (LLMs) to process and structure both formal clinical documentation and unstructured patient-provider conversations. These graphs encapsulate temporal and causal relationships among clinical encounters, diagnoses, treatments, and outcomes, enabling advanced temporal reasoning and personalized care insights. The research evaluates four different LLMs, namely Claude 3.5, Mistral, Llama 3.1, and Chatgpt4o, in their ability to generate accurate and computationally efficient knowledge graphs. Results demonstrate that while all models achieved perfect structural compliance, they exhibited variations in medical entity processing and computational efficiency. The paper concludes by identifying key challenges and future research directions. This work contributes to advancing patient-centric healthcare through the development of comprehensive, actionable knowledge graphs that support improved care coordination and outcome prediction.

A Recipe for Generating 3D Worlds From a Single Image

Katja Schwarz,Denys Rozumnyi,Samuel Rota Bulò,Lorenzo Porzi,Peter Kontschieder

Task: Generate immersive 3D worlds from a single image.

Motivation: Reuse existing generative models to produce high-quality 3D environments with minimal training.

Details

Method: Generates panoramas with a pre-trained diffusion model, lifts them into 3D with a depth estimator, and fills unobserved regions by conditioning on rendered point clouds. Result: Tested on synthetic and real images, the method produces high-quality 3D environments suitable for VR display and outperforms existing methods on multiple image quality metrics. Conclusion: By explicitly modeling 3D structure from the start, the method performs well in both image quality and efficiency. Abstract: We introduce a recipe for generating immersive 3D worlds from a single image by framing the task as an in-context learning problem for 2D inpainting models. This approach requires minimal training and uses existing generative models. Our process involves two steps: generating coherent panoramas using a pre-trained diffusion model and lifting these into 3D with a metric depth estimator. We then fill unobserved regions by conditioning the inpainting model on rendered point clouds, requiring minimal fine-tuning. Tested on both synthetic and real images, our method produces high-quality 3D environments suitable for VR display. By explicitly modeling the 3D structure of the generated environment from the start, our approach consistently outperforms state-of-the-art, video synthesis-based methods along multiple quantitative image quality metrics. Project Page: https://katjaschwarz.github.io/worlds/

Gender and content bias in Large Language Models: a case study on Google Gemini 2.0 Flash Experimental

Roberto Balestri

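Task: Evaluate the biases of Gemini 2.0 Flash Experimental in content moderation and gender disparities.

Motivation: Compare the ethical moderation practices of Gemini 2.0 and ChatGPT-4o and examine both the improvements and the potential problems.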

Details

Method: Compares how Gemini 2.0 and ChatGPT-4o moderate gender-related and violent content. Result: Gemini 2.0 reduces gender bias but is more permissive toward violent content, which may contribute to normalizing violence. Conclusion: AI systems need further refinement to balance reducing bias against avoiding the spread of harmful content. Abstract: This study evaluates the biases in Gemini 2.0 Flash Experimental, a state-of-the-art large language model (LLM) developed by Google, focusing on content moderation and gender disparities. By comparing its performance to ChatGPT-4o, examined in the author's previous work, the analysis highlights some differences in ethical moderation practices. Gemini 2.0 demonstrates reduced gender bias, notably with female-specific prompts achieving a substantial rise in acceptance rates compared to results obtained by ChatGPT-4o. It adopts a more permissive stance toward sexual content and maintains relatively high acceptance rates for violent prompts, including gender-specific cases. Despite these changes, whether they constitute an improvement is debatable. While gender bias has been reduced, this reduction comes at the cost of permitting more violent content toward both males and females, potentially normalizing violence rather than mitigating harm. Male-specific prompts still generally receive higher acceptance rates than female-specific ones. These findings underscore the complexities of aligning AI systems with ethical standards, highlighting progress in reducing certain biases while raising concerns about the broader implications of the model's permissiveness. Ongoing refinements are essential to achieve moderation practices that ensure transparency, fairness, and inclusivity without amplifying harmful content.

Progressive Test Time Energy Adaptation for Medical Image Segmentation

Xiaoran Zhang,Byung-Woo Hong,Hyoungseob Park,Daniel H. Pak,Anne-Marie Rickmann,Lawrence H. Staib,James S. Duncan,Alex Wong

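Task: Propose a model-agnostic, progressive test-time energy adaptation method for medical image segmentation.

Motivation: Address distribution shifts in medical datasets caused by differing imaging protocols and patient variation, while avoiding the multiple passes over target data that make conventional domain adaptation impractical in clinical settings.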

Details

Method: Uses a shape energy model trained on source data to assign energy scores to segmentation maps, and adapts the model to the target distribution by minimizing the energy score at test time. Result: Validated on eight public MRI and X-ray datasets; the method outperforms baselines both quantitatively and qualitatively. Conclusion: The proposed method adapts effectively to the target distribution and improves medical image segmentation accuracy. Abstract: We propose a model-agnostic, progressive test-time energy adaptation approach for medical image segmentation. Maintaining model performance across diverse medical datasets is challenging, as distribution shifts arise from inconsistent imaging protocols and patient variations. Unlike domain adaptation methods that require multiple passes through target data - impractical in clinical settings - our approach adapts pretrained models progressively as they process test data. Our method leverages a shape energy model trained on source data, which assigns an energy score at the patch level to segmentation maps: low energy represents in-distribution (accurate) shapes, while high energy signals out-of-distribution (erroneous) predictions. By minimizing this energy score at test time, we refine the segmentation model to align with the target distribution. To validate the effectiveness and adaptability, we evaluated our framework on eight public MRI (bSSFP, T1- and T2-weighted) and X-ray datasets spanning cardiac, spinal cord, and lung segmentation. We consistently outperform baselines both quantitatively and qualitatively.

Word2Minecraft: Generating 3D Game Levels through Large Language Models

Shuo Huang,Muhammad Umair Nasir,Steven James,Julian Togelius

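Task: Use large language models to turn structured stories into playable Minecraft game levels.

Motivation: Explore how narrative elements (such as protagonist goals, antagonist challenges, and environmental settings) can be translated into game levels with spatial and gameplay constraints, bringing story generation and game design closer together.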

Details

Method: Proposes a flexible framework that supports customizing story complexity and uses a scaling algorithm to maintain spatial consistency. Result: GPT-4-Turbo outperforms GPT-4o-Mini in story coherence and objective enjoyment, while the latter does better on aesthetic appeal; the system can generate levels with high map enjoyment. Conclusion: Word2Minecraft is a promising step at the intersection of story generation and game design, and its code is open source. Abstract: We present Word2Minecraft, a system that leverages large language models to generate playable game levels in Minecraft based on structured stories. The system transforms narrative elements, such as protagonist goals, antagonist challenges, and environmental settings, into game levels with both spatial and gameplay constraints. We introduce a flexible framework that allows for the customization of story complexity, enabling dynamic level generation. The system employs a scaling algorithm to maintain spatial consistency while adapting key game elements. We evaluate Word2Minecraft using both metric-based and human-based methods. Our results show that GPT-4-Turbo outperforms GPT-4o-Mini in most areas, including story coherence and objective enjoyment, while the latter excels in aesthetic appeal. We also demonstrate the system's ability to generate levels with high map enjoyment, offering a promising step forward in the intersection of story generation and game design. We open-source the code at https://github.com/JMZ-kk/Word2Minecraft/tree/word2mc_v0

MobilePlantViT: A Mobile-friendly Hybrid ViT for Generalized Plant Disease Image Classification

Moshiur Rahman Tonmoy,Md. Mithun Hossain,Nilanjan Dey,M. F. Mridha

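Task: Propose MobilePlantViT, a lightweight hybrid Vision Transformer architecture for plant disease classification.

Motivation: Plant diseases threaten global food security, and existing deep learning models are hard to deploy on mobile and edge devices, so a lightweight yet efficient solution is needed.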

Details

Method: Designs a hybrid Vision Transformer architecture that optimizes resource efficiency while maintaining high performance. Result: Test accuracies across multiple datasets range from 80% to over 99% with only 0.69 million parameters, outperforming the smallest versions of MobileViTv1 and MobileViTv2. Conclusion: MobilePlantViT has practical potential on resource-constrained devices and supports smart agriculture systems. Abstract: Plant diseases significantly threaten global food security by reducing crop yields and undermining agricultural sustainability. AI-driven automated classification has emerged as a promising solution, with deep learning models demonstrating impressive performance in plant disease identification. However, deploying these models on mobile and edge devices remains challenging due to high computational demands and resource constraints, highlighting the need for lightweight, accurate solutions for accessible smart agriculture systems. To address this, we propose MobilePlantViT, a novel hybrid Vision Transformer (ViT) architecture designed for generalized plant disease classification, which optimizes resource efficiency while maintaining high performance. Extensive experiments across diverse plant disease datasets of varying scales show our model's effectiveness and strong generalizability, achieving test accuracies ranging from 80% to over 99%. Notably, with only 0.69 million parameters, our architecture outperforms the smallest versions of MobileViTv1 and MobileViTv2, despite their higher parameter counts. These results underscore the potential of our approach for real-world, AI-powered automated plant disease classification in sustainable and resource-efficient smart agriculture systems. All code will be available in the GitHub repository: https://github.com/moshiurtonmoy/MobilePlantViT

Do Multimodal Large Language Models Understand Welding?

Grigorii Khvatskii,Yong Suk Lee,Corey Angst,Maria Gibbs,Robert Landers,Nitesh V. Chawla

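Task: Evaluate multimodal large language models (MLLMs) in the welding domain and test the newly proposed prompting strategy WeldPrompt.

Motivation: Explore the potential of MLLMs in high-stakes technical domains such as welding, and improve their reliability and reasoning.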

Details

Method: Evaluates two state-of-the-art MLLMs on a dataset of real-world and online weld images and introduces WeldPrompt, which combines Chain-of-Thought generation with in-context learning. Result: The models perform better on online images but also reasonably well on real-world images; WeldPrompt improves recall in some contexts but is inconsistent in others. Conclusion: MLLMs show potential in technical domains but need further work on fine-tuning, domain-specific data, and prompting strategies to become more reliable; the study opens new directions for multimodal learning in industrial applications. Abstract: This paper examines the performance of Multimodal LLMs (MLLMs) in skilled production work, with a focus on welding. Using a novel data set of real-world and online weld images, annotated by a domain expert, we evaluate the performance of two state-of-the-art MLLMs in assessing weld acceptability across three contexts: RV & Marine, Aeronautical, and Farming. While both models perform better on online images, likely due to prior exposure or memorization, they also perform relatively well on unseen, real-world weld images. Additionally, we introduce WeldPrompt, a prompting strategy that combines Chain-of-Thought generation with in-context learning to mitigate hallucinations and improve reasoning. WeldPrompt improves model recall in certain contexts but exhibits inconsistent performance across others. These results underscore the limitations and potentials of MLLMs in high-stakes technical domains and highlight the importance of fine-tuning, domain-specific data, and more sophisticated prompting strategies to improve model reliability. The study opens avenues for further research into multimodal learning in industry applications.

iFlame: Interleaving Full and Linear Attention for Efficient Mesh Generation

Hanxiao Wang,Biao Zhang,Weize Quan,Dong-Ming Yan,Peter Wonka

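Task: Propose iFlame, a novel transformer-based network architecture for mesh generation.

Motivation: Attention-based models perform well in mesh generation, but their quadratic computational complexity limits scalability, especially for high-resolution 3D data; linear attention is cheaper but struggles to capture long-range dependencies.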

Details

Method: Proposes an interleaving autoregressive mesh generation framework that combines the efficiency of linear attention with the expressive power of full attention, integrated into an hourglass architecture for further efficiency. Result: Experiments on ShapeNet and Objaverse confirm the framework's efficiency: it generates high-quality 3D meshes with reduced training time and performance comparable to pure attention-based models. Conclusion: The interleaving framework effectively balances computational efficiency and generative performance, making it a practical solution for mesh generation. Abstract: This paper proposes iFlame, a novel transformer-based network architecture for mesh generation. While attention-based models have demonstrated remarkable performance in mesh generation, their quadratic computational complexity limits scalability, particularly for high-resolution 3D data. Conversely, linear attention mechanisms offer lower computational costs but often struggle to capture long-range dependencies, resulting in suboptimal outcomes. To address this trade-off, we propose an interleaving autoregressive mesh generation framework that combines the efficiency of linear attention with the expressive power of full attention mechanisms. To further enhance efficiency and leverage the inherent structure of mesh representations, we integrate this interleaving approach into an hourglass architecture, which significantly boosts efficiency. Our approach reduces training time while achieving performance comparable to pure attention-based models. To improve inference efficiency, we implemented a caching algorithm that almost doubles the speed and reduces the KV cache size by seven-eighths compared to the original Transformer. We evaluate our framework on ShapeNet and Objaverse, demonstrating its ability to generate high-quality 3D meshes efficiently. Our results indicate that the proposed interleaving framework effectively balances computational efficiency and generative performance, making it a practical solution for mesh generation. The training takes only 2 days with 4 GPUs on 39k data with a maximum of 4k faces on Objaverse.

Poly-FEVER: A Multilingual Fact Verification Benchmark for Hallucination Detection in Large Language Models

Hanzhi Zhang,Sumera Anjum,Heng Fan,Weijian Zheng,Yan Huang,Yunhe Feng

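Task: Introduce Poly-FEVER, a multilingual fact verification benchmark for evaluating hallucination detection in large language models (LLMs).

Motivation: Existing hallucination detection benchmarks focus on English and a few widely spoken languages and lack the breadth to assess inconsistent model performance across diverse linguistic contexts.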

Details

Method: Builds the Poly-FEVER dataset of 77,973 labeled factual claims spanning 11 languages, sourced from FEVER, Climate-FEVER, and SciFact. Result: The analysis shows how topic distribution and web resource availability influence hallucination frequency and uncovers language-specific biases that affect model accuracy. Conclusion: Poly-FEVER provides a benchmark for multilingual fact verification, enables cross-linguistic comparison of hallucination detection, and supports the development of more reliable, language-inclusive AI systems. Abstract: Hallucinations in generative AI, particularly in Large Language Models (LLMs), pose a significant challenge to the reliability of multilingual applications. Existing benchmarks for hallucination detection focus primarily on English and a few widely spoken languages, lacking the breadth to assess inconsistencies in model performance across diverse linguistic contexts. To address this gap, we introduce Poly-FEVER, a large-scale multilingual fact verification benchmark specifically designed for evaluating hallucination detection in LLMs. Poly-FEVER comprises 77,973 labeled factual claims spanning 11 languages, sourced from FEVER, Climate-FEVER, and SciFact. It provides the first large-scale dataset tailored for analyzing hallucination patterns across languages, enabling systematic evaluation of LLMs such as ChatGPT and the LLaMA series. Our analysis reveals how topic distribution and web resource availability influence hallucination frequency, uncovering language-specific biases that impact model accuracy. By offering a multilingual benchmark for fact verification, Poly-FEVER facilitates cross-linguistic comparisons of hallucination detection and contributes to the development of more reliable, language-inclusive AI systems. The dataset is publicly available to advance research in responsible AI, fact-checking methodologies, and multilingual NLP, promoting greater transparency and robustness in LLM performance. The proposed Poly-FEVER is available at: https://huggingface.co/datasets/HanzhiZhang/Poly-FEVER.

When Less is Enough: Adaptive Token Reduction for Efficient Image Representation

Eduard Allakhverdov,Elizaveta Goncharova,Andrey Kuznetsov

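Task: Propose a method, based on how well features can be reconstructed, for selecting and retaining the most valuable visual tokens to cut computational cost.

Motivation: Vision encoders generate a large number of tokens at high computational cost, yet not all tokens are equally important, so a way to reduce the token count without hurting performance is needed.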

Details

Method: Integrates an autoencoder with a Gumbel-Softmax selection mechanism to identify and retain the most informative visual tokens. Result: On OCR tasks, more than 50% of visual tokens can be removed with minimal performance loss; on general-domain tasks, randomly retaining 30% of tokens matches the performance of the full token set. Conclusion: The method points to a viable direction for adaptive, efficient multimodal pruning with low-overhead inference that does not compromise performance. Abstract: Vision encoders typically generate a large number of visual tokens, providing information-rich representations but significantly increasing computational demands. This raises the question of whether all generated tokens are equally valuable or if some of them can be discarded to reduce computational costs without compromising quality. In this paper, we introduce a new method for determining feature utility based on the idea that less valuable features can be reconstructed from more valuable ones. We implement this concept by integrating an autoencoder with a Gumbel-Softmax selection mechanism that identifies and retains only the most informative visual tokens. To validate our approach, we compared the performance of the LLaVA-NeXT model using features selected by our method against its performance with randomly selected features. We found that on OCR-based tasks, more than 50% of the visual context can be removed with minimal performance loss, whereas randomly discarding the same proportion of features significantly affects the model capabilities. Furthermore, in general-domain tasks, even randomly retaining only 30% of tokens achieves performance comparable to using the full set of visual tokens. Our results highlight a promising direction towards adaptive and efficient multimodal pruning that facilitates scalable and low-overhead inference without compromising performance.
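
The Gumbel-Softmax selection mentioned in this abstract can be illustrated with a small sampler that draws a relaxed keep-or-drop decision per token. This is a sketch of the general technique only, not the paper's implementation: in the paper the selector is trained jointly with an autoencoder, whereas here the per-token `keep_logits`, the temperature `tau`, and the helper names are hypothetical inputs chosen for illustration.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw one relaxed sample from a categorical distribution.

    Adds Gumbel(0, 1) noise to each logit, then applies a temperature softmax.
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so the double log is finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / tau)
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    z = sum(exps)
    return [e / z for e in exps]


def select_tokens(keep_logits, tau=0.5, seed=0):
    """Return a boolean keep mask, one entry per visual token.

    Each token has a two-way choice [keep, drop]; the drop logit is fixed
    at 0 (an assumption made to keep the example small).
    """
    rng = random.Random(seed)
    mask = []
    for k in keep_logits:
        keep_prob, _ = gumbel_softmax([k, 0.0], tau=tau, rng=rng)
        mask.append(keep_prob > 0.5)
    return mask


# Strongly positive logits mark tokens the (hypothetical) selector wants to keep.
mask = select_tokens([50.0, -50.0, 50.0, -50.0])
print(mask)
```

Lowering `tau` pushes the relaxed samples toward hard one-hot decisions, which is why the temperature is usually annealed during training.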

Causal Discovery and Counterfactual Reasoning to Optimize Persuasive Dialogue Policies

Donghuo Zeng,Roberto Legaspi,Yuewen Sun,Xinshuai Dong,Kazushi Ikeda,Peter Spirtes,Kun Zhang

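Task: Propose a method that uses causal discovery and counterfactual reasoning to optimize the persuasiveness of dialogue systems.

Motivation: Existing dialogue systems struggle to adapt to dynamically changing user states, which leads to poor persuasion outcomes.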

Details

Method: Combines the GRaSP algorithm to identify causal relationships, BiCoGAN to generate counterfactual utterances, and a D3QN model to optimize the system policy. Result: On the PersuasionForGood dataset, the method measurably improves persuasion outcomes over baselines, with higher cumulative rewards and Q-values. Conclusion: Causal discovery and counterfactual reasoning can effectively strengthen the persuasiveness of dialogue systems. Abstract: Tailoring persuasive conversations to users leads to more effective persuasion. However, existing dialogue systems often struggle to adapt to dynamically evolving user states. This paper presents a novel method that leverages causal discovery and counterfactual reasoning for optimizing system persuasion capability and outcomes. We employ the Greedy Relaxation of the Sparsest Permutation (GRaSP) algorithm to identify causal relationships between user and system utterance strategies, treating user strategies as states and system strategies as actions. GRaSP identifies user strategies as causal factors influencing system responses, which inform Bidirectional Conditional Generative Adversarial Networks (BiCoGAN) in generating counterfactual utterances for the system. Subsequently, we use the Dueling Double Deep Q-Network (D3QN) model to utilize counterfactual data to determine the best policy for selecting system utterances. Our experiments with the PersuasionForGood dataset show measurable improvements in persuasion outcomes using our approach over baseline methods. The observed increase in cumulative rewards and Q-values highlights the effectiveness of causal discovery in enhancing counterfactual reasoning and optimizing reinforcement learning policies for online dialogue systems.
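
The D3QN component named in this abstract combines two standard ideas: a dueling head that splits Q-values into a state value plus advantages, and a double-DQN target that selects the next action with the online network but evaluates it with the target network. The sketch below shows both computations on plain lists. It reflects the textbook formulation, not the authors' code; the function names, the reward, and the numeric values are made up for illustration.

```python
def dueling_q(value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean(A(s, .))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double-DQN target for one transition.

    The online network picks the next action; the target network scores it.
    """
    if done:
        return reward
    best = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best]


# In the paper's framing, states would be user strategies and actions
# system utterance strategies; the numbers here are arbitrary.
q_values = dueling_q(1.0, [1.0, -1.0])
target = double_dqn_target(1.0, [0.2, 0.9], [5.0, 3.0], gamma=0.5)
print(q_values, target)
```

Decoupling action selection from evaluation is what reduces the overestimation bias of the plain max-based target.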

TextBite: A Historical Czech Document Dataset for Logical Page Segmentation

Martin Kostelník,Karel Beneš,Michal Hradiš

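Task: Define logical page segmentation as a pure segmentation task in the image domain, avoiding reliance on OCR or precise geometry.

Motivation: Removing the OCR dependency and the effect of geometric variation on evaluation improves semantic representation and information retrieval in document analysis.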

Details

Method: Proposes an evaluation metric that uses only foreground text pixels, introduces the TextBite dataset, and provides baseline methods combining text region detection and relation prediction. Result: The TextBite dataset contains 8,449 page images with 78,863 logical segments, accompanied by baseline methods and an evaluation framework. Conclusion: Pure image-domain segmentation and the new evaluation method advance research on logical document segmentation, supported by a public dataset and tools. Abstract: Logical page segmentation is an important step in document analysis, enabling better semantic representations, information retrieval, and text understanding. Previous approaches define logical segmentation either through text or geometric objects, relying on OCR or precise geometry. To avoid the need for OCR, we define the task purely as segmentation in the image domain. Furthermore, to ensure the evaluation remains unaffected by geometrical variations that do not impact text segmentation, we propose to use only foreground text pixels in the evaluation metric and disregard all background pixels. To support research in logical document segmentation, we introduce TextBite, a dataset of historical Czech documents spanning the 18th to 20th centuries, featuring diverse layouts from newspapers, dictionaries, and handwritten records. The dataset comprises 8,449 page images with 78,863 annotated segments of logically and thematically coherent text. We propose a set of baseline methods combining text region detection and relation prediction. The dataset, baselines and evaluation framework can be accessed at https://github.com/DCGM/textbite-dataset.

Unified Enhancement of the Generalization and Robustness of Language Models via Bi-Stage Optimization

Yudao Sun,Juan Yin,Juan Zhao,Fan Zhang,Yongheng Liu,Hongji Chen

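Task: Propose a bi-stage optimization framework (UEGR) that improves the generalization and robustness of language models together.

Motivation: Current work usually improves generalization or robustness in isolation; the lack of methods that address both limits the overall performance of language models.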

Details

Method: In the forward propagation stage, adaptive dropout generates diverse sub-models, and JS divergence and adversarial losses reinforce output stability; in the backward propagation stage, only the most critical parameters are selectively updated to minimize unnecessary deviations. Result: Significantly improves generalization and robustness across 13 public language datasets, achieving SOTA performance. Conclusion: Through gradient regularization and selective parameter updates, the UEGR framework effectively balances and improves the generalization and robustness of language models. Abstract: Neural network language models (LMs) are confronted with significant challenges in generalization and robustness. Currently, many studies focus on improving either generalization or robustness in isolation, without methods addressing both aspects simultaneously, which presents a significant challenge in developing LMs that are both robust and generalized. In this paper, we propose a bi-stage optimization framework to uniformly enhance both the generalization and robustness of LMs, termed UEGR. Specifically, during the forward propagation stage, we enrich the output probability distributions of adversarial samples by adaptive dropout to generate diverse sub-models, and incorporate JS divergence and adversarial losses of these output distributions to reinforce output stability. During the backward propagation stage, we compute parameter saliency scores and selectively update only the most critical parameters to minimize unnecessary deviations and consolidate the model's resilience. Theoretical analysis shows that our framework includes gradient regularization to limit the model's sensitivity to input perturbations and selective parameter updates to flatten the loss landscape, thus improving both generalization and robustness. The experimental results show that our method significantly improves the generalization and robustness of LMs compared to other existing methods across 13 publicly available language datasets, achieving state-of-the-art (SOTA) performance.
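
The JS divergence used in this abstract to compare the output distributions of different sub-models has a simple closed form, shown below for two discrete distributions. This is the standard definition rather than the paper's full loss: how many sub-models are compared and how the term is weighted against the adversarial loss are not stated in the abstract, and the example distributions are invented.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln(2)."""
    # The mixture m is positive wherever p or q is, so kl() never divides by 0.
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Output distributions of two hypothetical dropout sub-models on one input.
sub_a = [0.7, 0.2, 0.1]
sub_b = [0.6, 0.3, 0.1]
print(js_divergence(sub_a, sub_b))
```

Because the divergence is zero only when the two distributions coincide, minimizing it pushes the sub-models toward consistent predictions.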

GAIR: Improving Multimodal Geo-Foundation Model with Geo-Aligned Implicit Representations

Zeping Liu,Fan Zhang,Junfeng Jiao,Ni Lao,Gengchen Mai

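Task: Develop GAIR, a multimodal geo-foundation model (GeoFM) that integrates remote sensing data, street view imagery, and their geolocation metadata, addressing existing models' neglect of multimodal data.

Motivation: Existing GeoFMs focus mainly on remote sensing data, neglect other modalities such as ground-level imagery, and do not explicitly model geospatial relationships across modalities, which limits their generalizability and adaptability.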

Details

Method: Proposes the GAIR architecture, which uses three factorized neural encoders for street view images, geolocation metadata, and remote sensing images, geographically aligns them through an implicit neural representation (INR) module, and trains with contrastive learning objectives. Result: Experiments on 10 geospatial tasks show that GAIR outperforms state-of-the-art GeoFMs and other baselines, demonstrating its effectiveness at learning generalizable and transferable geospatial representations. Conclusion: By integrating multimodal data with geographic alignment, GAIR significantly improves GeoFM performance and offers a new approach to geospatial analysis across tasks, scales, and time. Abstract: Advancements in vision and language foundation models have inspired the development of geo-foundation models (GeoFMs), enhancing performance across diverse geospatial tasks. However, many existing GeoFMs primarily focus on overhead remote sensing (RS) data while neglecting other data modalities such as ground-level imagery. A key challenge in multimodal GeoFM development is to explicitly model geospatial relationships across modalities, which enables generalizability across tasks, spatial scales, and temporal contexts. To address these limitations, we propose GAIR, a novel multimodal GeoFM architecture integrating overhead RS data, street view (SV) imagery, and their geolocation metadata. We utilize three factorized neural encoders to project an SV image, its geolocation, and an RS image into the embedding space. The SV image needs to be located within the RS image's spatial footprint but does not need to be at its geographic center. In order to geographically align the SV image and RS image, we propose a novel implicit neural representations (INR) module that learns a continuous RS image representation and looks up the RS embedding at the SV image's geolocation. Next, the geographically aligned SV, RS, and location embeddings are trained with contrastive learning objectives from unlabeled data. We evaluate GAIR across 10 geospatial tasks spanning RS image-based, SV image-based, and location embedding-based benchmarks. Experimental results demonstrate that GAIR outperforms state-of-the-art GeoFMs and other strong baselines, highlighting its effectiveness in learning generalizable and transferable geospatial representations.

A Foundational individual Mobility Prediction Model based on Open-Source Large Language Models

Zhenlin Qin,Leizhen Wang,Francisco Camara Pereira,Zhenliang Ma

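Task: Propose a unified fine-tuning framework for training mobility prediction models based on open-source large language models (LLMs).

Motivation: Current LLM-based mobility prediction models are usually built for specific datasets or a single prompt and adapt poorly to different cities and diverse user contexts.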

Details

Method: Proposes a unified fine-tuning framework and validates it experimentally on six real-world mobility datasets. Result: The proposed model outperforms state-of-the-art deep learning and LLM-based models in prediction accuracy and transferability. Conclusion: The framework offers a more flexible and general solution for applying LLMs to mobility prediction. Abstract: Large Language Models (LLMs) are widely applied to domain-specific tasks due to their massive general knowledge and remarkable inference capacities. Current studies on LLMs have shown immense potential in applying LLMs to model individual mobility prediction problems. However, most LLM-based mobility prediction models only train on specific datasets or use single well-designed prompts, leading to difficulty in adapting to different cities and users with diverse contexts. To fill these gaps, this paper proposes a unified fine-tuning framework to train a foundational open-source LLM-based mobility prediction model. We conducted extensive experiments on six real-world mobility datasets to validate the proposed model. The results showed that the proposed model achieved the best performance in prediction accuracy and transferability over state-of-the-art models based on deep learning and LLMs.

Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding

Jinlong Li,Cristiano Saltori,Fabio Poiesi,Nicu Sebe

Task: Propose CUA-O3D, a model that integrates multiple foundation models (such as CLIP, DINOv2, and Stable Diffusion) for open-vocabulary 3D scene understanding.

Motivation: Current methods usually rely on a single vision-language model (VLM) to align the feature space of 3D models, which limits the ability of 3D models to exploit the diverse spatial and semantic capabilities of multiple foundation models.

Details

Method: Adaptively distills and harmonizes heterogeneous 2D feature embeddings from multiple foundation models through deterministic uncertainty estimation. Result: Experiments on ScanNetV2 and Matterport3D show that the method improves open-vocabulary segmentation while also achieving robust cross-domain alignment and competitive spatial perception. Conclusion: CUA-O3D is the first 3D scene understanding method to integrate multiple foundation models; it uses uncertainty estimation to address the challenge of fusing heterogeneous features and achieves strong results. Abstract: The lack of a large-scale 3D-text corpus has led recent works to distill open-vocabulary knowledge from vision-language models (VLMs). However, these methods typically rely on a single VLM to align the feature spaces of 3D models within a common language space, which limits the potential of 3D models to leverage the diverse spatial and semantic capabilities encapsulated in various foundation models. In this paper, we propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding, dubbed CUA-O3D, the first model to integrate multiple foundation models, such as CLIP, DINOv2, and Stable Diffusion, into 3D scene understanding. We further introduce a deterministic uncertainty estimation to adaptively distill and harmonize the heterogeneous 2D feature embeddings from these models. Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties across diverse semantic and geometric sensitivities, helping to reconcile heterogeneous representations during training. Extensive experiments on ScanNetV2 and Matterport3D demonstrate that our method not only advances open-vocabulary segmentation but also achieves robust cross-domain alignment and competitive spatial perception capabilities. The code will be available at https://github.com/TyroneLi/CUA_O3D.

Explainable AI Components for Narrative Map Extraction

Brian Keith,Fausto German,Eric Krokos,Sarah Joseph,Chris North

Task: Evaluate an explainable artificial intelligence (XAI) system for narrative map extraction that provides explanations at multiple levels of abstraction.

Motivation: As narrative extraction systems grow more complex, building user trust through explainable outputs becomes essential.

Details

Method: The system combines topical-cluster explanations for low-level document relationships, connection explanations for event relationships, and structure explanations for high-level narrative patterns, and is evaluated in a user study with 10 participants. Result: The user study shows that the multi-level explanation approach, especially connection explanations and important event detection, increased users' trust in the system's decisions. Conclusion: The study advances the state of the art in explainable narrative extraction and offers practical insights for building reliable narrative extraction systems that support effective human-AI collaboration. Abstract: As narrative extraction systems grow in complexity, establishing user trust through interpretable and explainable outputs becomes increasingly critical. This paper presents an evaluation of an Explainable Artificial Intelligence (XAI) system for narrative map extraction that provides meaningful explanations across multiple levels of abstraction. Our system integrates explanations based on topical clusters for low-level document relationships, connection explanations for event relationships, and high-level structure explanations for overall narrative patterns. In particular, we evaluate the XAI system through a user study involving 10 participants who examined narratives from the 2021 Cuban protests. The analysis of results demonstrates that the explanations increased participants' trust in the system's decisions, with connection explanations and important event detection proving particularly effective at building user confidence. Survey responses indicate that the multi-level explanation approach helped users develop appropriate trust in the system's narrative extraction capabilities. This work advances the state-of-the-art in explainable narrative extraction while providing practical insights for developing reliable narrative extraction systems that support effective human-AI collaboration.

QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge

Xuan Shen,Weize Ma,Jing Liu,Changdi Yang,Rui Ding,Quanyi Wang,Henghui Ding,Wei Niu,Yanzhi Wang,Pu Zhao,Jun Lin,Jiuxiang Gu

Task: Propose QuartDepth, a method for efficiently deploying monocular depth estimation (MDE) models on resource-constrained edge devices such as ASICs.

Motivation: High computational and memory demands make existing depth estimation models hard to deploy on edge devices such as ASICs; QuartDepth aims to solve this problem.

Details

Method: Applies post-training quantization to quantize weights and activations to 4-bit precision, and introduces an activation polishing and compensation algorithm together with a weight reconstruction method to limit the performance loss. It also designs a hardware accelerator that supports kernel fusion and customized instruction programmability. Result: Experiments show that QuartDepth achieves fast inference and higher energy efficiency while keeping accuracy competitive. Conclusion: QuartDepth bridges the gap between high-performance depth estimation and practical use on edge devices. Abstract: Monocular Depth Estimation (MDE) has emerged as a pivotal task in computer vision, supporting numerous real-world applications. However, deploying accurate depth estimation models on resource-limited edge devices, especially Application-Specific Integrated Circuits (ASICs), is challenging due to the high computational and memory demands. Recent advancements in foundational depth estimation deliver impressive results but further amplify the difficulty of deployment on ASICs. To address this, we propose QuartDepth, which adopts post-training quantization to quantize MDE models with hardware acceleration for ASICs. Our approach involves quantizing both weights and activations to 4-bit precision, reducing the model size and computation cost. To mitigate the performance degradation, we introduce an activation polishing and compensation algorithm applied before and after activation quantization, as well as a weight reconstruction method for minimizing errors in weight quantization. Furthermore, we design a flexible and programmable hardware accelerator by supporting kernel fusion and customized instruction programmability, enhancing throughput and efficiency. Experimental results demonstrate that our framework achieves competitive accuracy while enabling fast inference and higher energy efficiency on ASICs, bridging the gap between high-performance depth estimation and practical edge-device applicability. Code: https://github.com/shawnricecake/quart-depth
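
To make "4-bit precision" concrete, the sketch below shows plain symmetric round-to-nearest quantization of a small weight list. It is an illustration under stated assumptions only: it does not implement the activation polishing, compensation, or weight reconstruction steps described in the abstract, and the function names and sample weights are hypothetical.

```python
def quantize_4bit(values):
    """Symmetric per-tensor quantization: map floats to integer codes in [-8, 7]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; 7 is the largest positive 4-bit signed code.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

# Hypothetical weights, chosen only for illustration.
weights = [0.70, -0.33, 0.12, 0.0, -0.70]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as one of 16 integer codes plus a shared scale, and the rounding error per weight is at most half a quantization step. Methods such as the one summarized above aim to reduce the accuracy loss that this rounding causes.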

FutureGen: LLM-RAG Approach to Generate the Future Work of Scientific Article

Ibrahim Al Azher,Miftahul Jannat Mokarrama,Zhishuai Guo,Sagnik Ray Choudhury,Hamed Alhoori

Task: Generate future research direction suggestions for scientific papers and analyze how their trends evolve.

Motivation: Help early-career researchers find unexplored areas and help experienced researchers find new projects or collaboration opportunities.

Details

Method: Combines large language models (LLMs) with retrieval-augmented generation (RAG), and introduces an LLM feedback mechanism and an LLM-as-a-judge evaluation approach. Result: RAG combined with LLM feedback outperforms the other methods on qualitative and quantitative metrics. Conclusion: The approach improves the quality of future work suggestions, and a human evaluation assesses the LLM's performance as extractor and judge. Abstract: The future work section of a scientific article outlines potential research directions by identifying gaps and limitations of a current study. This section serves as a valuable resource for early-career researchers seeking unexplored areas and experienced researchers looking for new projects or collaborations. In this study, we generate future work suggestions from key sections of a scientific article alongside related papers and analyze how the trends have evolved. We experimented with various Large Language Models (LLMs) and integrated Retrieval-Augmented Generation (RAG) to enhance the generation process. We incorporate an LLM feedback mechanism to improve the quality of the generated content and propose an LLM-as-a-judge approach for evaluation. Our results demonstrated that the RAG-based approach with LLM feedback outperforms other methods evaluated through qualitative and quantitative metrics. Moreover, we conduct a human evaluation to assess the LLM as an extractor and judge. The code and dataset for this project are available on HuggingFace.

4D Gaussian Splatting SLAM

Yanyan Li,Youxu Fang,Zunjie Zhu,Kunyi Li,Yong Ding,Federico Tombari

Task: Simultaneously localize camera poses and construct Gaussian radiance fields in dynamic scenes.

Motivation: Traditional methods usually remove dynamic objects as distractors and reconstruct only the static environment, whereas this work aims to efficiently track camera poses and build 4D Gaussian radiance fields from RGB-D image sequences.

Details

Method: Generates motion masks to obtain static and dynamic priors, classifies Gaussian primitives into static and dynamic sets, models the transformation fields of dynamic Gaussians with sparse control points and an MLP, and designs a 2D optical flow reconstruction algorithm to supervise the 4D Gaussian radiance fields. Result: Experiments show that the method achieves robust tracking and high-quality view synthesis in real-world environments. Conclusion: The paper presents an efficient architecture that incrementally tracks camera poses and builds 4D Gaussian radiance fields in unknown scenes. Abstract: Simultaneously localizing camera poses and constructing Gaussian radiance fields in dynamic scenes establishes a crucial bridge between 2D images and the 4D real world. Instead of removing dynamic objects as distractors and reconstructing only static environments, this paper proposes an efficient architecture that incrementally tracks camera poses and establishes the 4D Gaussian radiance fields in unknown scenarios by using a sequence of RGB-D images. First, by generating motion masks, we obtain static and dynamic priors for each pixel. To eliminate the influence of static scenes and improve the efficiency of learning the motion of dynamic objects, we classify the Gaussian primitives into static and dynamic Gaussian sets, while sparse control points along with an MLP are utilized to model the transformation fields of the dynamic Gaussians. To more accurately learn the motion of dynamic Gaussians, a novel 2D optical flow map reconstruction algorithm is designed to render optical flows of dynamic objects between neighboring images, which are further used to supervise the 4D Gaussian radiance fields along with traditional photometric and geometric constraints. In experiments, qualitative and quantitative evaluation results show that the proposed method achieves robust tracking and high-quality view synthesis performance in real-world environments.

Extract, Match, and Score: An Evaluation Paradigm for Long Question-context-answer Triplets in Financial Analysis

Bo Hu,Han Yuan,Vlad Pandelea,Wuqiong Luo,Yingzhu Zhao,Zheng Ma

Task: Propose an evaluation method for long-form question answering (EMS) that addresses the shortcomings of conventional metrics when judging long-form answers.

Motivation: Conventional evaluation metrics work poorly for long-form question answering, such as financial analysis or regulatory compliance, so a more reliable evaluation framework is needed.

Details

Method: Builds a real-world financial dataset and proposes the Extract, Match, and Score (EMS) evaluation approach. Result: EMS can effectively assess the quality of long-form LLM outputs. Conclusion: EMS provides a reliable method for evaluating LLM performance in complex real-world scenarios. Abstract: The rapid advancement of large language models (LLMs) has sparked widespread adoption across diverse applications, making robust evaluation frameworks crucial for assessing their performance. While conventional evaluation metrics remain applicable for shorter texts, their efficacy diminishes when evaluating the quality of long-form answers. This limitation is particularly critical in real-world scenarios involving extended questions, extensive context, and long-form answers, such as financial analysis or regulatory compliance. In this paper, we use a practical financial use case to illustrate applications that handle "long question-context-answer triplets". We construct a real-world financial dataset comprising long triplets and demonstrate the inadequacies of traditional metrics. To address this, we propose an effective Extract, Match, and Score (EMS) evaluation approach tailored to the complexities of long-form LLMs' outputs, providing practitioners with a reliable methodology for assessing LLMs' performance in complex real-world scenarios.

EDiT: Efficient Diffusion Transformers with Linear Compressed Attention

Philipp Becker,Abhinav Mehrotra,Ruchika Chavhan,Malcolm Chadwick,Luca Morreale,Mehdi Noroozi,Alberto Gil Ramos,Sourav Bhattacharya

Task: Propose an efficient diffusion transformer (EDiT) that addresses the efficiency bottlenecks of conventional DiTs and MM-DiTs.

Motivation: The quadratic scaling of attention in conventional DiTs limits high-resolution image generation and use on resource-constrained devices.

Details

Method: Proposes a linear compressed attention method that uses a multi-layer convolutional network, and a hybrid attention scheme that combines linear attention with standard dot-product attention. Result: EDiT and MM-EDiT achieve up to 2.2x speedup in PixArt-Sigma and Stable Diffusion 3.5-Medium while maintaining image quality. Conclusion: EDiT and MM-EDiT improve the efficiency of diffusion transformers and suit high-resolution image generation and multimodal input settings. Abstract: Diffusion Transformers (DiTs) have emerged as a leading architecture for text-to-image synthesis, producing high-quality and photorealistic images. However, the quadratic scaling properties of the attention in DiTs hinder image generation with higher resolution or on devices with limited resources. This work introduces an efficient diffusion transformer (EDiT) to alleviate these efficiency bottlenecks in conventional DiTs and Multimodal DiTs (MM-DiTs). First, we present a novel linear compressed attention method that uses a multi-layer convolutional network to modulate queries with local information while keys and values are spatially aggregated. Second, we formulate a hybrid attention scheme for multi-modal inputs that combines linear attention for image-to-image interactions and standard scaled dot-product attention for interactions involving prompts. Merging these two approaches leads to an expressive, linear-time Multimodal Efficient Diffusion Transformer (MM-EDiT). We demonstrate the effectiveness of the EDiT and MM-EDiT architectures by integrating them into PixArt-Sigma (conventional DiT) and Stable Diffusion 3.5-Medium (MM-DiT), achieving up to 2.2x speedup with comparable image quality after distillation.
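
The reason linear attention avoids quadratic scaling can be seen in a generic kernelized formulation. The sketch below is an assumption-laden illustration, not the paper's linear compressed attention: it omits the convolutional query modulation and the spatial aggregation of keys and values, and the `elu(x) + 1` feature map and all names are choices made here for demonstration.

```python
import math

def feature_map(vector):
    """elu(x) + 1, a positive feature map commonly used for linear attention."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in vector]

def linear_attention(queries, keys, values):
    """Linear-time attention: summarize keys and values once, reuse per query."""
    dim, value_dim = len(keys[0]), len(values[0])
    kv_summary = [[0.0] * value_dim for _ in range(dim)]
    key_sum = [0.0] * dim
    for key, value in zip(keys, values):
        phi_k = feature_map(key)
        for a in range(dim):
            key_sum[a] += phi_k[a]
            for b in range(value_dim):
                kv_summary[a][b] += phi_k[a] * value[b]
    outputs = []
    for query in queries:
        phi_q = feature_map(query)
        normalizer = sum(phi_q[a] * key_sum[a] for a in range(dim))
        outputs.append([
            sum(phi_q[a] * kv_summary[a][b] for a in range(dim)) / normalizer
            for b in range(value_dim)
        ])
    return outputs

def quadratic_reference(queries, keys, values):
    """The same kernel attention computed pairwise, for comparison."""
    outputs = []
    for query in queries:
        phi_q = feature_map(query)
        weights = [sum(a * b for a, b in zip(phi_q, feature_map(k))) for k in keys]
        total = sum(weights)
        outputs.append([
            sum(w * v[b] for w, v in zip(weights, values)) / total
            for b in range(len(values[0]))
        ])
    return outputs

# Hypothetical toy inputs: three tokens with two-dimensional features.
queries = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
keys = [[1.0, 0.0], [-0.5, 0.7], [0.2, 0.2]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fast = linear_attention(queries, keys, values)
slow = quadratic_reference(queries, keys, values)
```

Both functions return the same outputs up to rounding, but `linear_attention` never forms the token-by-token weight matrix, so its cost grows linearly with the number of tokens.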

SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors

Yang Chen,Hui Wang,Shiyao Wang,Junyang Chen,Jiabei He,Jiaming Zhou,Xi Yang,Yequan Wang,Yonghua Lin,Yong Qin

Task: Address the scarcity of elderly speech data by providing a high-quality Chinese speech dataset of older speakers.

Motivation: Existing speech systems perform poorly because they lack training data that captures elderly-specific acoustic characteristics such as presbyphonia and dialectal variation, and existing datasets cover super-aged speakers inadequately.

Details

Method: Introduces the SeniorTalk dataset, which contains 55.53 hours of speech from 101 natural conversations involving 202 participants, balanced across gender, region, and age and annotated along multiple dimensions. Result: Experiments validate the dataset on speaker verification, speaker diarization, speech recognition, and speech editing tasks. Conclusion: SeniorTalk provides important support for developing speech technologies aimed at older adults. Abstract: While voice technologies increasingly serve aging populations, current systems exhibit significant performance gaps due to inadequate training data capturing elderly-specific vocal characteristics like presbyphonia and dialectal variations. The limited data available on super-aged individuals in existing elderly speech datasets, coupled with overly simple recording styles and annotation dimensions, exacerbates this issue. To address the critical scarcity of speech data from individuals aged 75 and above, we introduce SeniorTalk, a carefully annotated Chinese spoken dialogue dataset. This dataset contains 55.53 hours of speech from 101 natural conversations involving 202 participants, ensuring a strategic balance across gender, region, and age. Through detailed annotation across multiple dimensions, it can support a wide range of speech tasks. We perform extensive experiments on speaker verification, speaker diarization, speech recognition, and speech editing tasks, offering crucial insights for the development of speech technologies targeting this age group.

Digitally Prototype Your Eye Tracker: Simulating Hardware Performance using 3D Synthetic Data

Esther Y. H. Lin,Yimin Ding,Jogendra Kundu,Yatong An,Mohamed T. El-Haddad,Alexander Fix

Task: Propose a method that uses synthetic data to evaluate how hardware changes affect the performance of machine learning-based eye tracking.

Motivation: Collecting large amounts of data from real hardware is expensive, especially because machine learning needs large training datasets, so an alternative is required.

Details

Method: Uses a dataset of real 3D eyes reconstructed with neural radiance fields (NeRF) to synthesize eye images under different viewpoints and camera parameters, which are then used to evaluate hardware configurations. Result: The method predicts the relative performance of different hardware configurations and correlates strongly with performance on a real dataset (Project Aria glasses). Conclusion: The method enables faster prototyping of eye tracking hardware. Abstract: Eye tracking (ET) is a key enabler for Augmented and Virtual Reality (AR/VR). Prototyping new ET hardware requires assessing the impact of hardware choices on eye tracking performance. This task is compounded by the high cost of obtaining data from sufficiently many variations of real hardware, especially for machine learning, which requires large training datasets. We propose a method for end-to-end evaluation of how hardware changes impact machine learning-based ET performance using only synthetic data. We utilize a dataset of real 3D eyes, reconstructed from light dome data using neural radiance fields (NeRF), to synthesize captured eyes from novel viewpoints and camera parameters. Using this framework, we demonstrate that we can predict the relative performance across various hardware configurations, accounting for variations in sensor noise, illumination brightness, and optical blur. We also compare our simulator with the publicly available eye tracking dataset from the Project Aria glasses, demonstrating a strong correlation with real-world performance. Finally, we present a first-of-its-kind analysis in which we vary ET camera positions, evaluating ET performance ranging from on-axis direct views of the eye to peripheral views on the frame. Such an analysis would have previously required manufacturing physical devices to capture evaluation data. In short, our method enables faster prototyping of ET hardware.

Investigating Retrieval-Augmented Generation in Quranic Studies: A Study of 13 Open-Source Large Language Models

Zahra Khalila,Arbi Haza Nasution,Winda Monika,Aytug Onan,Yohei Murakami,Yasir Bin Ismail Radi,Noor Mohammad Osmani

Task: Study how open-source large language models (LLMs) perform in Quranic studies and evaluate their ability to produce accurate, contextually faithful answers.

Motivation: General-purpose LLMs tend to hallucinate in sensitive domains such as religion and drift away from authoritative sources, so systems are needed that integrate domain knowledge while keeping answers accurate and faithful.

Details

Method: Uses retrieval-augmented generation (RAG) to evaluate 13 open-source LLMs (grouped into large, medium, and small) on a Quranic dataset, with analysis based on human-rated metrics: context relevance, answer faithfulness, and answer relevance. Result: Large models perform best at capturing semantics and producing accurate answers; the small model Llama3.2:3b stands out on faithfulness and relevance, showing the potential of well-optimized small architectures. Conclusion: The study exposes trade-offs between model size, computational efficiency, and response quality, providing guidance for choosing LLMs in domain-specific applications. Abstract: Accurate and contextually faithful responses are critical when applying large language models (LLMs) to sensitive and domain-specific tasks, such as answering queries related to Quranic studies. General-purpose LLMs often struggle with hallucinations, where generated responses deviate from authoritative sources, raising concerns about their reliability in religious contexts. This challenge highlights the need for systems that can integrate domain-specific knowledge while maintaining response accuracy, relevance, and faithfulness. In this study, we investigate 13 open-source LLMs categorized into large (e.g., Llama3:70b, Gemma2:27b, QwQ:32b), medium (e.g., Gemma2:9b, Llama3:8b), and small (e.g., Llama3.2:3b, Phi3:3.8b). Retrieval-Augmented Generation (RAG) is used to compensate for the limitations of using standalone models. This research utilizes a descriptive dataset of Quranic surahs including the meanings, historical context, and qualities of the 114 surahs, allowing the model to gather relevant knowledge before responding. The models are evaluated using three key metrics set by human evaluators: context relevance, answer faithfulness, and answer relevance. The findings reveal that large models consistently outperform smaller models in capturing query semantics and producing accurate, contextually grounded responses. The Llama3.2:3b model, even though it is considered small, does very well on faithfulness (4.619) and relevance (4.857), showing the promise of smaller architectures that have been well optimized. This article examines the trade-offs between model size, computational efficiency, and response quality while using LLMs in domain-specific applications.

Rethinking the Role of Spatial Mixing

George Cazenavette,Joel Julin,Simon Lucey

Task: Study the roles that the spatial mixing and channel mixing operations of 2D convolution play in deep learning.

Motivation: Explore how spatial and channel mixing operations affect model performance, prompted by the finding that randomly initialized spatial mixers can reach similar performance.

Details

Method: Tests the effects of spatial and channel mixing through experiments and analysis on classical models (such as ResNet) and cutting-edge models (such as ConvMixer). Result: Models with randomly initialized spatial mixers reach nearly the same classification performance, are more robust to adversarial perturbations, and can decode pixel-shuffled images. Conclusion: The role of spatial mixing in these models may be overestimated, since randomly initialized spatial mixers also work well. Abstract: Until quite recently, the backbone of nearly every state-of-the-art computer vision model has been the 2D convolution. At its core, a 2D convolution simultaneously mixes information across both the spatial and channel dimensions of a representation. Many recent computer vision architectures consist of sequences of isotropic blocks that disentangle the spatial and channel-mixing components. This separation of the operations allows us to more closely juxtapose the effects of spatial and channel mixing in deep learning. In this paper, we take an initial step towards garnering a deeper understanding of the roles of these mixing operations. Through our experiments and analysis, we discover that on both classical (ResNet) and cutting-edge (ConvMixer) models, we can reach nearly the same level of classification performance by leaving the spatial mixers at their random initializations. Furthermore, we show that models with random, fixed spatial mixing are naturally more robust to adversarial perturbations. Lastly, we show that this phenomenon extends past the classification regime, as such models can also decode pixel-shuffled images.

Distributed LLMs and Multimodal Large Language Models: A Survey on Advances, Challenges, and Future Directions

Hadi Amini,Md Jueal Mia,Yasaman Saadati,Ahmed Imteaj,Seyedsina Nabavirazavi,Urmish Thakker,Md Zarif Hossain,Awal Ahmed Fime,S. S. Iyengar

Task: Survey distributed solutions for language models (including large language models, vision language models, multimodal language models, and small language models) and the challenges they face.

Motivation: Although large-scale datasets improve language model performance, computational resources and privacy concerns limit scalability, and distributed computing strategies offer key ways to address these problems.

Details

Method: Reviews the literature to analyze progress in key stages such as distributed training, inference, fine-tuning, and deployment, and categorizes the literature by six primary focus areas of decentralization. Result: Summarizes the contributions and limitations of distributed language models and identifies gaps in current methods for enabling distributed solutions. Conclusion: Future research needs to develop new methods that improve the robustness and applicability of distributed language models. Abstract: Language models (LMs) are machine learning models designed to predict linguistic patterns by estimating the probability of word sequences based on large-scale datasets, such as text. LMs have a wide range of applications in natural language processing (NLP) tasks, including autocomplete and machine translation. Although larger datasets typically enhance LM performance, scalability remains a challenge due to constraints in computational power and resources. Distributed computing strategies offer essential solutions for improving scalability and managing the growing computational demand. Further, the use of sensitive datasets in training and deployment raises significant privacy concerns. Recent research has focused on developing decentralized techniques to enable distributed training and inference while utilizing diverse computational resources and enabling edge AI. This paper presents a survey on distributed solutions for various LMs, including large language models (LLMs), vision language models (VLMs), multimodal LLMs (MLLMs), and small language models (SLMs). While LLMs focus on processing and generating text, MLLMs are designed to handle multiple modalities of data (e.g., text, images, and audio) and to integrate them for broader applications. To this end, this paper reviews key advancements across the MLLM pipeline, including distributed training, inference, fine-tuning, and deployment, while also identifying the contributions, limitations, and future areas of improvement. Further, it categorizes the literature based on six primary focus areas of decentralization. Our analysis describes gaps in current methodologies for enabling distributed solutions for LMs and outlines future research directions, emphasizing the need for novel solutions to enhance the robustness and applicability of distributed LMs.

Dynamic Attention Mechanism in Spatiotemporal Memory Networks for Object Tracking

Meng Zhou,Jiadong Xie,Mingsheng Xu

Task: Propose a spatiotemporal memory network with a dynamic attention mechanism (DASTM) to improve visual object tracking in complex scenes.

Motivation: Existing template matching methods degrade in complex scenes involving target deformation, occlusion, and background clutter, while spatiotemporal memory methods lack mechanisms for dynamic feature selection and adaptive fusion.

Details

Method: Introduces a differentiable dynamic attention mechanism and a lightweight gating network that adaptively adjust channel-spatial attention weights and allocate computational resources. Result: Achieves state-of-the-art performance on several benchmarks (OTB-2015, VOT 2018, LaSOT, GOT-10K) in success rate, robustness, and real-time efficiency. Conclusion: DASTM offers a novel and efficient solution for real-time tracking in complex environments. Abstract: Mainstream visual object tracking frameworks predominantly rely on template matching paradigms. Their performance heavily depends on the quality of template features, which becomes increasingly challenging to maintain in complex scenarios involving target deformation, occlusion, and background clutter. While existing spatiotemporal memory-based trackers emphasize memory capacity expansion, they lack effective mechanisms for dynamic feature selection and adaptive fusion. To address this gap, we propose a Dynamic Attention Mechanism in Spatiotemporal Memory Network (DASTM) with two key innovations: 1) A differentiable dynamic attention mechanism that adaptively adjusts channel-spatial attention weights by analyzing spatiotemporal correlations between the templates and memory features; 2) A lightweight gating network that autonomously allocates computational resources based on target motion states, prioritizing high-discriminability features in challenging scenarios. Extensive evaluations on OTB-2015, VOT 2018, LaSOT, and GOT-10K benchmarks demonstrate our DASTM's superiority, achieving state-of-the-art performance in success rate, robustness, and real-time efficiency, thereby offering a novel solution for real-time tracking in complex environments.

Classification of User Reports for Detection of Faulty Computer Components using NLP Models: A Case Study

Maria de Lourdes M. Silva,André L. C. Mendonça,Eduardo R. D. Neto,Iago C. Chaves,Felipe T. Brito,Victor A. E. Farias,Javam C. Machado

Task: Use natural language processing (NLP) models to classify user reports in order to detect faulty computer components.

Motivation: Existing platforms cannot make effective use of textual user reports, which limits users' ability to describe problems in natural language.

Details

Method: Builds a dataset of 341 user reports and classifies them with NLP models. Result: Experimental evaluation shows that the approach reaches 79% accuracy on the dataset. Conclusion: NLP models offer an effective solution for classifying computer fault reports. Abstract: Computer manufacturers typically offer platforms for users to report faults. However, there remains a significant gap in these platforms' ability to effectively utilize textual reports, which impedes users from describing their issues in their own words. In this context, Natural Language Processing (NLP) offers a promising solution by enabling the analysis of user-generated text. This paper presents an innovative approach that employs NLP models to classify user reports for detecting faulty computer components, such as CPU, memory, motherboard, video card, and more. In this work, we build a dataset of 341 user reports obtained from many sources. Additionally, through extensive experimental evaluation, our approach achieved an accuracy of 79% with our dataset.

Region Masking to Accelerate Video Processing on Neuromorphic Hardware

Sreetama Sarkar,Sumit Bam Shrestha,Yue Che,Leobardo Campos-Macias,Gourav Datta,Peter A. Beerel

Task: Propose a region masking strategy that reduces redundant computation and data movement in spiking neural networks (SNNs).

Motivation: Growing demand for edge intelligence on resource-constrained devices calls for lower energy use and latency in deep learning models. SNNs promise lower energy consumption thanks to their event-driven nature, but they still perform redundant computation.

Details

Method: Uses a region masking strategy to identify regions of interest in the input, eliminating event computation and data movement for unimportant regions. Result: For video object detection on Loihi 2, masking about 60% of input regions reduces the energy-delay product by 1.65x, with mAP@0.5 dropping by only 1.09%. Conclusion: Region masking significantly reduces spiking activity and computational overhead in SNNs while keeping detection accuracy high. Abstract: The rapidly growing demand for on-chip edge intelligence on resource-constrained devices has motivated approaches to reduce energy and latency of deep learning models. Spiking neural networks (SNNs) have gained particular interest due to their promise to reduce energy consumption using event-based processing. We assert that while sigma-delta encoding in SNNs can take advantage of the temporal redundancy across video frames, it still involves a significant amount of redundant computation due to processing insignificant events. In this paper, we propose a region masking strategy that identifies regions of interest at the input of the SNN, thereby eliminating computation and data movement for events arising from unimportant regions. Our approach demonstrates that masking regions at the input not only significantly reduces the overall spiking activity of the network, but also provides significant improvement in throughput and latency. We apply region masking during video object detection on Loihi 2, demonstrating that masking approximately 60% of input regions can reduce energy-delay product by 1.65x over a baseline sigma-delta network, with a degradation in mAP@0.5 by 1.09%.
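
The idea of dropping events from unimportant regions before they enter the network can be sketched as a simple grid filter. This is a hypothetical illustration: the abstract does not describe how regions of interest are selected or how events are represented on Loihi 2, so the `(x, y, value)` event tuples, the fixed grid, and the hand-picked `keep_regions` set are all assumptions.

```python
def mask_events(events, keep_regions, region_size):
    """Keep only events whose (x, y) position falls inside a retained grid cell."""
    kept = []
    for x, y, value in events:
        # Integer division maps a pixel coordinate to its grid cell index.
        region = (x // region_size, y // region_size)
        if region in keep_regions:
            kept.append((x, y, value))
    return kept

# Hypothetical events on an 8x8 input split into four 4x4 regions.
events = [(1, 1, 1), (2, 5, -1), (6, 2, 1), (7, 7, 1), (3, 3, -1)]
keep_regions = {(0, 0)}  # assumed region of interest, chosen by hand here

kept = mask_events(events, keep_regions, region_size=4)
dropped_fraction = 1 - len(kept) / len(events)
```

In this toy input three of the five events fall outside the retained region, so they would never be processed or moved through the network. The energy and latency savings reported above come from skipping such events at scale.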

Leveraging Large Language Models for Explainable Activity Recognition in Smart Homes: A Critical Evaluation

Michele Fiori,Gabriele Civitarese,Priyankar Choudhary,Claudio Bettini

Task: Explore how to combine explainable artificial intelligence (XAI) and large language models (LLMs) for sensor-based recognition of Activities of Daily Living (ADLs).

Motivation: Explanations produced by existing methods lack the flexibility of natural language and do not scale, whereas LLMs have strong knowledge of human activities and may improve explanation quality.

Details

Method: Evaluates two ways of using LLMs: a) as zero-shot ADL recognition models; b) to automatically generate explanations for existing data-driven XAI approaches. Result: Provides insights into the benefits and challenges of using LLMs for explainable ADL recognition. Conclusion: LLMs show potential for making ADL recognition more explainable and flexible, but the associated challenges still need to be addressed. Abstract: Explainable Artificial Intelligence (XAI) aims to uncover the inner reasoning of machine learning models. In IoT systems, XAI improves the transparency of models processing sensor data from multiple heterogeneous devices, ensuring end-users understand and trust their outputs. Among the many applications, XAI has also been applied to sensor-based Activities of Daily Living (ADLs) recognition in smart homes. Existing approaches highlight which sensor events are most important for each predicted activity, using simple rules to convert these events into natural language explanations for non-expert users. However, these methods produce rigid explanations lacking natural language flexibility and are not scalable. With the recent rise of Large Language Models (LLMs), it is worth exploring whether they can enhance explanation generation, considering their proven knowledge of human activities. This paper investigates potential approaches to combine XAI and LLMs for sensor-based ADL recognition. We evaluate whether LLMs can be used: a) as explainable zero-shot ADL recognition models, avoiding costly labeled data collection, and b) to automate the generation of explanations for existing data-driven XAI approaches when training data is available and the goal is higher recognition rates. Our critical evaluation provides insights into the benefits and challenges of using LLMs for explainable ADL recognition.

OpenCity3D: What do Vision-Language Models know about Urban Environments?

Valentin Bieri,Marco Zamboni,Nicolas S. Blumer,Qingxuan Chen,Francis Engelmann

Task: Extend vision-language models (VLMs) to urban-scale environments to address high-level tasks such as population density estimation and building age classification.

Motivation: Existing VLMs are mainly applied to indoor spaces or autonomous driving and focus on low-level tasks, leaving the need for high-level tasks in urban environments unmet.

Details

Method: Proposes OpenCity3D, which uses 3D reconstructions from multi-view aerial imagery. Result: OpenCity3D shows strong zero-shot and few-shot capabilities and adapts to new contexts. Conclusion: The work establishes a new paradigm for language-driven urban analytics, supporting applications in planning, policy, and environmental monitoring. Abstract: Vision-language models (VLMs) show great promise for 3D scene understanding but are mainly applied to indoor spaces or autonomous driving, focusing on low-level tasks like segmentation. This work expands their use to urban-scale environments by leveraging 3D reconstructions from multi-view aerial imagery. We propose OpenCity3D, an approach that addresses high-level tasks, such as population density estimation, building age classification, property price prediction, crime rate assessment, and noise pollution evaluation. Our findings highlight OpenCity3D's impressive zero-shot and few-shot capabilities, showcasing adaptability to new contexts. This research establishes a new paradigm for language-driven urban analytics, enabling applications in planning, policy, and environmental monitoring. See our project page: opencity3d.github.io

Accelerating Antibiotic Discovery with Large Language Models and Knowledge Graphs

Maxime Delmas,Magdalena Wysocka,Danilo Gusicuma,André Freitas

Task: Develop an LLM-based pipeline that detects prior evidence of antibiotic activity in order to avoid rediscoveries.

Motivation: Address the high cost, long timelines, and high failure rate of antibiotic development, particularly the rediscovery of known compounds.

Details

Method: Integrates organism and chemical literature into a Knowledge Graph (KG) that provides taxonomic resolution, synonym handling, and multi-level evidence classification. Result: Among 73 potential antibiotic-producing organisms, 12 negative hits were disclosed for evaluation; the results support the pipeline's effectiveness, reducing false negatives and speeding up decision-making. Conclusion: The pipeline is effective for evidence review, and its knowledge graph and user interface will be made public to support further research. Abstract: The discovery of novel antibiotics is critical to address growing antimicrobial resistance (AMR). However, pharmaceutical industries face high costs (over $1 billion), long timelines, and a high failure rate, worsened by the rediscovery of known compounds. We propose an LLM-based pipeline that acts as an alarm system, detecting prior evidence of antibiotic activity to prevent costly rediscoveries. The system integrates organism and chemical literature into a Knowledge Graph (KG), ensuring taxonomic resolution, synonym handling, and multi-level evidence classification. We tested the pipeline on a private list of 73 potential antibiotic-producing organisms, disclosing 12 negative hits for evaluation. The results highlight the effectiveness of the pipeline for evidence reviewing, reducing false negatives, and accelerating decision-making. The KG for negative hits and the user interface for interactive exploration will be made publicly available.

A-IDE: Agent-Integrated Denoising Experts

Uihyun Cho,Namhun Kim

Task: Propose the Agent-Integrated Denoising Experts (A-IDE) framework to address the difficulty a single model has in generalizing across multiple anatomies in low-dose CT image denoising.

Motivation: Because HU distributions and characteristics differ across anatomical structures, a single model struggles to generalize, and an approach that can route tasks dynamically is needed.

Details

Method: Integrates three RED-CNN models, each specialized for an anatomical region, and routes tasks dynamically through an LLM agent that relies on semantic analysis from BiomedCLIP. Result: On the Mayo-2016 dataset, A-IDE outperforms a single denoising model on RMSE, PSNR, and SSIM. Conclusion: The A-IDE framework performs well in heterogeneous, data-scarce environments, automatically prevents overfitting, and requires no manual intervention. Abstract: Recent advances in deep-learning based denoising methods have improved Low-Dose CT image quality. However, due to distinct HU distributions and diverse anatomical characteristics, a single model often struggles to generalize across multiple anatomies. To address this limitation, we introduce the Agent-Integrated Denoising Experts (A-IDE) framework, which integrates three anatomical region-specialized RED-CNN models under the management of a decision-making LLM agent. The agent analyzes semantic cues from BiomedCLIP to dynamically route incoming LDCT scans to the most appropriate expert model. We highlight three major advantages of our approach. First, A-IDE excels in heterogeneous, data-scarce environments. Second, the framework automatically prevents overfitting by distributing tasks among multiple experts. Finally, our LLM-driven agentic pipeline eliminates the need for manual interventions. Experimental evaluations on the Mayo-2016 dataset confirm that A-IDE achieves superior performance in RMSE, PSNR, and SSIM compared to a single unified denoiser.

Through the LLM Looking Glass: A Socratic Self-Assessment of Donkeys, Elephants, and Markets

Molly Kennedy,Ayyoob Imani,Timo Spinde,Hinrich Schütze

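Task: Assess media bias in LLM-generated content and the ability of LLMs to detect subtle ideological bias.

Motivation: Media bias is often subtle and subjective, which makes it hard to identify and mitigate, especially in LLM-generated text.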

Details

Method: Evaluates eight widely used LLMs on two datasets, PoliGen and EconoLex, analyzing their ideological preferences via self-assessment. Result: On political topics, all models consistently prefer Democratic over Republican positions; on economic topics, biases vary among Western LLMs, while LLMs developed in China lean more strongly toward socialism. Conclusion: LLM-generated content exhibits ideological bias, and self-assessment measures model bias directly while reducing reliance on subjective judgment. Abstract: While detecting and avoiding bias in LLM-generated text is becoming increasingly important, media bias often remains subtle and subjective, making it particularly difficult to identify and mitigate. In this study, we assess media bias in LLM-generated content and LLMs' ability to detect subtle ideological bias. We conduct this evaluation using two datasets, PoliGen and EconoLex, covering political and economic discourse, respectively. We evaluate eight widely used LLMs by prompting them to generate articles and analyze their ideological preferences via self-assessment. By using self-assessment, the study aims to directly measure the models' biases rather than relying on external interpretations, thereby minimizing subjective judgments about media bias. Our results reveal a consistent preference for Democratic over Republican positions across all models. Conversely, in economic topics, biases vary among Western LLMs, while those developed in China lean more strongly toward socialism.

Learning Part Knowledge to Facilitate Category Understanding for Fine-Grained Generalized Category Discovery

Enguang Wang,Zhimao Peng,Zhengyuan Xie,Haori Lu,Fei Yang,Xialei Liu

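Task: Propose PartGCD, a method that incorporates part knowledge to address fine-grained Generalized Category Discovery (GCD).

Motivation: Existing methods perform poorly in fine-grained scenarios because they rely on contrastive learning over global image features, which fails to capture the subtle local differences needed to distinguish fine-grained categories.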

Details

Method: PartGCD comprises Adaptive Part Decomposition, which automatically extracts class-specific semantic parts via Gaussian Mixture Models, and Part Discrepancy Regularization, which enforces separation between part features to amplify fine-grained local differences. Result: Achieves state-of-the-art performance on multiple fine-grained benchmarks while remaining competitive on generic datasets. Conclusion: PartGCD is effective and robust for fine-grained GCD. Abstract: Generalized Category Discovery (GCD) aims to classify unlabeled data containing both seen and novel categories. Although existing methods perform well on generic datasets, they struggle in fine-grained scenarios. We attribute this difficulty to their reliance on contrastive learning over global image features to automatically capture discriminative cues, which fails to capture the subtle local differences essential for distinguishing fine-grained categories. Therefore, in this paper, we propose incorporating part knowledge to address fine-grained GCD, which introduces two key challenges: the absence of annotations for novel classes complicates the extraction of the part features, and global contrastive learning prioritizes holistic feature invariance, inadvertently suppressing discriminative local part patterns. To address these challenges, we propose PartGCD, including 1) Adaptive Part Decomposition, which automatically extracts class-specific semantic parts via Gaussian Mixture Models, and 2) Part Discrepancy Regularization, enforcing explicit separation between part features to amplify fine-grained local part distinctions. Experiments demonstrate state-of-the-art performance across multiple fine-grained benchmarks while maintaining competitiveness on generic datasets, validating the effectiveness and robustness of our approach.

Natural Language Generation

Emiel van Miltenburg,Chenghua Lin

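Task: Give an overview of the field of Natural Language Generation (NLG) and its relationship to other subfields of Natural Language Processing.

Motivation: Introduce the definition and scope of NLG, its differences from and connections to other subfields such as machine translation and dialog systems, and the impact of Large Language Models (LLMs) on NLG and those subfields.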

Details

Method: Summarizes the core concepts and application scenarios of NLG, and its similarities to and differences from other subfields, through literature review and field analysis. Result: Clarifies the broad definition of NLG and its use in data-to-text, text-to-text, and image-to-text tasks, and notes that LLMs are driving NLG and other subfields to converge. Conclusion: NLG is an important subfield of Natural Language Processing whose definition and scope are converging with those of other subfields as LLMs develop; future research should attend to this trend. Abstract: This article provides a brief overview of the field of Natural Language Generation. The term Natural Language Generation (NLG), in its broadest definition, refers to the study of systems that verbalize some form of information through natural language. That information could be stored in a large database or knowledge graph (in data-to-text applications), but NLG researchers may also study summarisation (text-to-text) or image captioning (image-to-text), for example. As a subfield of Natural Language Processing, NLG is closely related to other sub-disciplines such as Machine Translation (MT) and Dialog Systems. Some NLG researchers exclude MT from their definition of the field, since there is no content selection involved where the system has to determine what to say. Conversely, dialog systems do not typically fall under the header of Natural Language Generation since NLG is just one component of dialog systems (the others being Natural Language Understanding and Dialog Management). However, with the rise of Large Language Models (LLMs), different subfields of Natural Language Processing have converged on similar methodologies for the production of natural language and the evaluation of automatically generated text.

Restoring Forgotten Knowledge in Non-Exemplar Class Incremental Learning through Test-Time Semantic Evolution

Haori Lu,Xusheng Cao,Linlan Huang,Enguang Wang,Fei Yang,Xialei Liu

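Task: Address catastrophic forgetting in Non-exemplar Class Incremental Learning (NECIL).

Motivation: Existing methods struggle to balance stability and plasticity during training, while the testing stage is rarely considered even though it is a promising route to addressing forgetting.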

Details

Method: Proposes RoSE, which restores forgotten knowledge through test-time semantic evolution, compensates for semantic drift in a self-supervised manner, and derives an analytical solution as an alternative to gradient descent. Result: On CIFAR-100, TinyImageNet, and ImageNet100, RoSE outperforms most SOTA methods under both cold-start and warm-start settings. Conclusion: Test-time evolution is promising and feasible for NECIL. Abstract: Continual learning aims to accumulate knowledge over a data stream while mitigating catastrophic forgetting. In Non-exemplar Class Incremental Learning (NECIL), forgetting arises during incremental optimization because old classes are inaccessible, hindering the retention of prior knowledge. Previous methods that try to solve this struggle to achieve the stability-plasticity balance in the training stages. However, we note that the testing stage is rarely considered among them, but is promising to be a solution to forgetting. Therefore, we propose RoSE, which is a simple yet effective method that Restores forgotten knowledge through test-time Semantic Evolution. Specifically designed for minimizing forgetting, RoSE is a test-time semantic drift compensation framework that enables more accurate drift estimation in a self-supervised manner. Moreover, to avoid incomplete optimization during online testing, we derive an analytical solution as an alternative to gradient descent. We evaluate RoSE on CIFAR-100, TinyImageNet, and ImageNet100 datasets, under both cold-start and warm-start settings. Our method consistently outperforms most state-of-the-art (SOTA) methods across various scenarios, validating the potential and feasibility of test-time evolution in NECIL.

SPACER: A Parallel Dataset of Speech Production And Comprehension of Error Repairs

Shiva Upadhye,Jiaxuan Li,Richard Futrell

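Task: Study how speakers and comprehenders detect and correct single-word substitution errors in natural language.

Motivation: Prior research has examined error monitoring and correction in production and comprehension separately, but the scarcity of parallel data has impeded integrated study of the two systems.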

Details

Method: Builds the SPACER dataset from single-word substitution errors extracted from the Switchboard corpus, paired with speakers' self-repairs and comprehenders' responses from an offline text-editing experiment. Result: Finds asymmetries in error correction strategies: speakers are more likely to repair errors that introduce larger semantic and phonemic deviations, whereas comprehenders tend to correct errors that are phonemically similar to more plausible alternatives or that do not fit the prior context. Conclusion: SPACER provides a basis for future research on integrated approaches to language production and comprehension. Abstract: Speech errors are a natural part of communication, yet they rarely lead to complete communicative failure because both speakers and comprehenders can detect and correct errors. Although prior research has examined error monitoring and correction in production and comprehension separately, integrated investigation of both systems has been impeded by the scarcity of parallel data. In this study, we present SPACER, a parallel dataset that captures how naturalistic speech errors are corrected by both speakers and comprehenders. We focus on single-word substitution errors extracted from the Switchboard corpus, accompanied by speakers' self-repairs and comprehenders' responses from an offline text-editing experiment. Our exploratory analysis suggests asymmetries in error correction strategies: speakers are more likely to repair errors that introduce greater semantic and phonemic deviations, whereas comprehenders tend to correct errors that are phonemically similar to more plausible alternatives or do not fit into prior contexts. Our dataset enables future research on integrated approaches toward studying language production and comprehension.

DCEdit: Dual-Level Controlled Image Editing via Precisely Localized Semantics

Yihan Hu,Jianing Peng,Yiheng Lin,Ting Liu,Xiaochao Qu,Luoqi Liu,Yao Zhao,Yunchao Wei

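Task: Propose a new approach for improving text-guided image editing with diffusion-based models.

Motivation: Address the key challenge of precisely locating and editing the target semantics in text-guided image editing.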

Details

Method: Introduces a Precise Semantic Localization strategy and a Dual-Level Control mechanism, using visual and textual self-attention to enhance the cross-attention map and incorporating regional cues at both feature and latent levels. Result: Performs strongly on the PIE-Bench and RW-800 benchmarks, preserving backgrounds and providing accurate edits. Conclusion: The method achieves superior performance on text-guided image editing. Abstract: This paper presents a novel approach to improving text-guided image editing using diffusion-based models. The text-guided image editing task poses the key challenge of precisely locating and editing the target semantics, and previous methods fall short in this respect. Our method introduces a Precise Semantic Localization strategy that leverages visual and textual self-attention to enhance the cross-attention map, which can serve as regional cues to improve editing performance. Then we propose a Dual-Level Control mechanism for incorporating regional cues at both feature and latent levels, offering fine-grained control for more precise edits. To fully compare our methods with other DiT-based approaches, we construct the RW-800 benchmark, featuring high-resolution images, long descriptive texts, real-world images, and a new text editing task. Experimental results on the popular PIE-Bench and RW-800 benchmarks demonstrate the superior performance of our approach in preserving background and providing accurate edits.

Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models

Mengsong Wu,Tong Zhu,Han Han,Xiang Zhang,Wenbiao Shao,Wenliang Chen

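Task: Propose Chain-of-Tools, a new tool learning method that broadens the tool-use scenarios of large language models (LLMs).

Motivation: Existing methods either require fine-tuning, which limits the model to tools seen in the training data, or add tool demonstrations to the prompt, which is less efficient; neither handles unseen tools flexibly.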

Details

Method: Uses the semantic representation capability of frozen LLMs to perform tool calling within CoT reasoning, supporting a large, flexible tool pool that may contain unseen tools. Result: Outperforms the baseline on GSM8K-XL, FuncQA, KAMEL, and SimpleToolQuestions, and improves model interpretability. Conclusion: Chain-of-Tools extends the tool-use capability of LLMs, supports unseen tools, and improves performance and interpretability. Abstract: Tool learning can further broaden the usage scenarios of large language models (LLMs). However, most existing methods either require fine-tuning, so that the model can only use tools seen in the training data, or add tool demonstrations to the prompt, which is less efficient. In this paper, we present a new tool learning method, Chain-of-Tools. It makes full use of the powerful semantic representation capability of frozen LLMs to finish tool calling in CoT reasoning with a huge and flexible tool pool which may contain unseen tools. In particular, to validate the effectiveness of our approach in the massive unseen tool scenario, we construct a new dataset, SimpleToolQuestions. We conduct experiments on two numerical reasoning benchmarks (GSM8K-XL and FuncQA) and two knowledge-based question answering benchmarks (KAMEL and SimpleToolQuestions). Experimental results show that our approach performs better than the baseline. We also identify dimensions of the model output that are critical in tool selection, enhancing the model interpretability. Our code and data are available at: https://github.com/fairyshine/Chain-of-Tools .
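
The idea of choosing a tool from a pool by semantic representation, including tools added after the fact, can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity rule, the toy 3-dimensional vectors, and the names `select_tool` and `tool_pool` are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_tool(query_vec, tool_pool):
    """Return the name of the tool whose vector is most similar to the query."""
    return max(tool_pool, key=lambda name: cosine(query_vec, tool_pool[name]))

# Hypothetical tool vectors; in practice they would come from a frozen model.
tool_pool = {
    "add": [1.0, 0.0, 0.0],
    "multiply": [0.0, 1.0, 0.0],
    "lookup_capital": [0.0, 0.0, 1.0],
}
chosen = select_tool([0.1, 0.9, 0.2], tool_pool)

# A new tool is supported by adding its vector; nothing is retrained.
tool_pool["sqrt"] = [0.7, 0.7, 0.0]
chosen_after = select_tool([0.6, 0.8, 0.0], tool_pool)
```

Because selection is a lookup over vectors, the pool can grow without touching model weights, which is the property the abstract emphasizes for unseen tools.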

Seg2Box: 3D Object Detection by Point-Wise Semantics Supervision

Maoji Zheng,Ziyu Xu,Qiming Xia,Hai Wu,Chenglu Wen,Cheng Wang

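Task: Supervise 3D object detection using only semantic labels, eliminating the label redundancy between conventional detection and segmentation methods.

Motivation: Conventional 3D object detection and semantic segmentation use separate labels (bounding boxes and semantic masks), which are redundant; this work removes the redundancy by supervising detection with semantic labels only.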

Details

Method: Proposes Seg2Box, with a Multi-Frame Multi-Scale Clustering (MFMS-C) module that generates accurate pseudo-labels and a Semantic-Guiding Iterative-Mining Self-Training (SGIM-ST) module that progressively refines them. Result: On the Waymo Open Dataset and nuScenes Dataset, the method outperforms other competitive methods by 23.7% and 10.3% in mAP, respectively. Conclusion: The method demonstrates label efficiency and offers a new direction for 3D scene understanding. Abstract: LiDAR-based 3D object detection and semantic segmentation are critical tasks in 3D scene understanding. Traditional detection and segmentation methods supervise their models through bounding box labels and semantic mask labels. However, these two independent labels inherently contain significant redundancy. This paper aims to eliminate the redundancy by supervising 3D object detection using only semantic labels. However, the challenge arises due to the incomplete geometry structure and boundary ambiguity of point-cloud instances, leading to inaccurate pseudo labels and poor detection results. To address these challenges, we propose a novel method, named Seg2Box. We first introduce a Multi-Frame Multi-Scale Clustering (MFMS-C) module, which leverages the spatio-temporal consistency of point clouds to generate accurate box-level pseudo-labels. Additionally, the Semantic-Guiding Iterative-Mining Self-Training (SGIM-ST) module is proposed to enhance the performance by progressively refining the pseudo-labels and mining the instances without generating pseudo-labels. Experiments on the Waymo Open Dataset and nuScenes Dataset show that our method significantly outperforms other competitive methods by 23.7% and 10.3% in mAP, respectively. The results demonstrate the great label-efficient potential and advancement of our method.

Conversational User-AI Intervention: A Study on Prompt Rewriting for Improved LLM Response Generation

Rupak Sarkar,Bahareh Sarrafzadeh,Nirupama Chandrasekaran,Nagu Rangan,Philip Resnik,Longqi Yang,Sujay Kumar Jauhar

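Task: Study where user prompts fall short in human-LLM conversations and the potential of LLMs to rewrite them.

Motivation: Users often struggle to elicit helpful responses from LLMs because they lack skill in crafting effective prompts; real-world conversational datasets and the text understanding abilities of LLMs offer an opportunity to address this.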

Details

Method: Analyzes real human-AI conversations to identify aspects in which user queries fall short of expressing information needs, and explores the effect of LLM prompt rewriting. Result: Rewritten prompts elicit better responses while preserving user intent, with larger gains in longer conversations; LLMs need to make plausible assumptions about user intent. Conclusion: Prompt rewriting is a promising way to improve human-AI interaction across conversational domains, user intents, and LLMs of varying sizes and families. Abstract: Human-LLM conversations are becoming increasingly pervasive in people's professional and personal lives, yet many users still struggle to elicit helpful responses from LLM chatbots. One of the reasons for this issue is users' lack of understanding in crafting effective prompts that accurately convey their information needs. Meanwhile, the existence of real-world conversational datasets on the one hand, and the text understanding faculties of LLMs on the other, present a unique opportunity to study this problem, and its potential solutions at scale. Thus, in this paper we present the first LLM-centric study of real human-AI chatbot conversations, focused on investigating aspects in which user queries fall short of expressing information needs, and the potential of using LLMs to rewrite suboptimal user prompts. Our findings demonstrate that rephrasing ineffective prompts can elicit better responses from a conversational system, while preserving the user's original intent. Notably, the performance of rewrites improves in longer conversations, where contextual inferences about user needs can be made more accurately. Additionally, we observe that LLMs often need to -- and inherently do -- make plausible assumptions about a user's intentions and goals when interpreting prompts. Our findings largely hold true across conversational domains, user intents, and LLMs of varying sizes and families, indicating the promise of using prompt rewriting as a solution for better human-AI interactions.

ST-Prompt Guided Histological Hypergraph Learning for Spatial Gene Expression Prediction

Yi Niu,Jiashuai Liu,Yingkang Zhan,Jiangbo Shi,Di Zhang,Ines Machado,Mireia Crispin-Ortuzar,Chen Li,Zeyu Gao

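Task: Predict the global spatial distribution of gene expression in H&E-stained tissue slides using sparse spatial transcriptomics (ST) data.

Motivation: Address the prediction challenge caused by the heterogeneous relationship between histomorphology and gene expression, and use a small amount of local ST data for more practical and cost-effective global prediction.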

Details

Method: Proposes PHG2ST, an ST-prompt guided histological hypergraph learning framework that combines multi-scale hypergraph representations with a masked ST-prompt encoding mechanism. Result: On two public ST datasets, PHG2ST outperforms existing methods and closely aligns with the ground truth. Conclusion: PHG2ST shows the potential of sparse ST data for scalable, cost-effective spatial gene expression mapping. Abstract: Spatial Transcriptomics (ST) reveals the spatial distribution of gene expression in tissues, offering critical insights into biological processes and disease mechanisms. However, predicting ST from H&E-stained histology images is challenging due to the heterogeneous relationship between histomorphology and gene expression, which arises from substantial variability across different patients and tissue sections. A more practical and valuable approach is to utilize ST data from a few local regions to predict the spatial transcriptomic landscape across the remaining regions in H&E slides. In response, we propose PHG2ST, an ST-prompt guided histological hypergraph learning framework, which leverages sparse ST signals as prompts to guide histological hypergraph learning for global spatial gene expression prediction. Our framework fuses histological hypergraph representations at multiple scales through a masked ST-prompt encoding mechanism, improving robustness and generalizability. Benchmark evaluations on two public ST datasets demonstrate that PHG2ST outperforms the existing state-of-the-art methods and closely aligns with the ground truth. These results underscore the potential of leveraging sparse local ST data for scalable and cost-effective spatial gene expression mapping in real-world biomedical applications.

When Tom Eats Kimchi: Evaluating Cultural Bias of Multimodal Large Language Models in Cultural Mixture Contexts

Jun Seong Kim,Kyaw Ye Thu,Javad Ismayilzada,Junyeong Park,Eunsu Kim,Huzama Ahmad,Na Min An,James Thorne,Alice Oh

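Task: Study how well multimodal large language models (MLLMs) recognize and respond to inputs from different cultures, and introduce the MixCuBe benchmark.

Motivation: Current MLLMs over-rely on the visual features of the person when recognizing cross-cultural entities, leading to misclassification, so their robustness needs to be evaluated.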

Details

Method: Introduces MixCuBe, a cross-cultural bias benchmark, and studies elements from five countries and four ethnicities. Result: MLLMs perform better on high-resource cultures, while accuracy shifts markedly on low-resource cultures (by up to 58% for GPT-4o). Conclusion: MixCuBe reveals the limitations of MLLMs in cross-cultural recognition, especially for low-resource cultures; the dataset is publicly available. Abstract: In a highly globalized world, it is important for multi-modal large language models (MLLMs) to recognize and respond correctly to mixed-cultural inputs. For example, a model should correctly identify kimchi (Korean food) in an image both when an Asian woman is eating it and when an African man is eating it. However, current MLLMs show an over-reliance on the visual features of the person, leading to misclassification of the entities. To examine the robustness of MLLMs to different ethnicities, we introduce MixCuBe, a cross-cultural bias benchmark, and study elements from five countries and four ethnicities. Our findings reveal that MLLMs achieve both higher accuracy and lower sensitivity to such perturbation for high-resource cultures, but not for low-resource cultures. GPT-4o, the best-performing model overall, shows up to 58% difference in accuracy between the original and perturbed cultural settings in low-resource cultures. Our dataset is publicly available at: https://huggingface.co/datasets/kyawyethu/MixCuBe.

RigGS: Rigging of 3D Gaussians for Modeling Articulated Objects in Videos

Yuxin Yao,Zhi Deng,Junhui Hou

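Task: Model articulated objects captured in 2D videos to enable novel view synthesis while keeping them easily editable, drivable, and re-posable.

Motivation: Tackle the challenging problem of modeling dynamic objects without relying on additional template priors.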

Details

Method: Proposes RigGS, which combines a 3D Gaussian representation with a skeleton-based motion representation through skeleton-aware node-controlled deformation, learnable skin deformations, and pose-dependent detailed deformations. Result: Easily generates new actions for objects and renders high-quality images from novel views. Conclusion: RigGS performs well at dynamic object modeling and novel view synthesis. Abstract: This paper considers the problem of modeling articulated objects captured in 2D videos to enable novel view synthesis, while also being easily editable, drivable, and re-posable. To tackle this challenging problem, we propose RigGS, a new paradigm that leverages 3D Gaussian representation and skeleton-based motion representation to model dynamic objects without utilizing additional template priors. Specifically, we first propose skeleton-aware node-controlled deformation, which deforms a canonical 3D Gaussian representation over time to initialize the modeling process, producing candidate skeleton nodes that are further simplified into a sparse 3D skeleton according to their motion and semantic information. Subsequently, based on the resulting skeleton, we design learnable skin deformations and pose-dependent detailed deformations, thereby easily deforming the 3D Gaussian representation to generate new actions and render further high-quality images from novel views. Extensive experiments demonstrate that our method can generate realistic new actions easily for objects and achieve high-quality rendering.

Imagine to Hear: Auditory Knowledge Generation can be an Effective Assistant for Language Models

Suho Yoo,Hyunjong Ok,Jaeho Lee

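Task: Dynamically generate auditory knowledge with generative models to address language models' weakness on tasks that require auditory commonsense.

Motivation: Retrieval-based approaches suffer from the possible lack of relevant audio in external databases and from the cost of building and querying them, so a more efficient solution is needed.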

Details

Method: Proposes the Imagine to Hear framework, which generates auditory knowledge dynamically with generative models and handles multiple pieces of auditory knowledge with a CLAP-based rejection sampler and a language-audio fusion module. Result: Achieves state-of-the-art performance on AuditoryBench without relying on external databases. Conclusion: The generation-based approach is efficient and effective for auditory commonsense tasks. Abstract: Language models pretrained on text-only corpora often struggle with tasks that require auditory commonsense knowledge. Previous work addresses this problem by augmenting the language model to retrieve knowledge from external audio databases. This approach has several limitations, such as the potential lack of relevant audio in databases and the high costs associated with constructing and querying the databases. To address these issues, we propose Imagine to Hear, a novel approach that dynamically generates auditory knowledge using generative models. Our framework detects multiple audio-related textual spans from the given prompt and generates corresponding auditory knowledge. We develop several mechanisms to efficiently process multiple pieces of auditory knowledge, including a CLAP-based rejection sampler and a language-audio fusion module. Our experiments show that our method achieves state-of-the-art performance on AuditoryBench without relying on external databases, highlighting the effectiveness of our generation-based approach.
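
A rejection sampler of the kind mentioned above can be sketched in a few lines. The abstract does not specify the acceptance rule, so the threshold test below is an assumption, and `toy_scores` is a hypothetical stand-in for a text-audio similarity model such as CLAP.

```python
def rejection_sample(candidates, score_fn, threshold):
    """Return the first candidate whose score meets the threshold, else None."""
    for candidate in candidates:
        if score_fn(candidate) >= threshold:
            return candidate
    return None

# Hypothetical similarity scores between a text span and generated clips.
toy_scores = {"clip_a": 0.12, "clip_b": 0.47, "clip_c": 0.81}

# clip_a is rejected, clip_b is the first to pass the threshold.
accepted = rejection_sample(["clip_a", "clip_b", "clip_c"], toy_scores.get, threshold=0.4)

# If every candidate is rejected, the caller can fall back to text-only input.
rejected = rejection_sample(["clip_a"], toy_scores.get, threshold=0.4)
```

Returning `None` when nothing passes lets the caller skip poorly matched audio rather than fuse it into the language model's input.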

SGFormer: Satellite-Ground Fusion for 3D Semantic Scene Completion

Xiyue Guo,Jiarui Hu,Junjie Hu,Hujun Bao,Guofeng Zhang

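Task: Propose SGFormer, a satellite-ground cooperative semantic scene completion framework that uses satellite-ground image pairs to address visual occlusion.

Motivation: Existing camera-based methods struggle to capture complete scene semantics under visual occlusion, which motivates exploring the potential of satellite-ground image pairs.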

Details

Method: Designs a dual-branch architecture that encodes orthogonal satellite and ground views in parallel and unifies them into a common domain; proposes a ground-view guidance strategy that corrects satellite image biases; and develops an adaptive weighting strategy that balances the contributions of the two views. Result: SGFormer outperforms existing methods on the SemanticKITTI and SSCBench-KITTI-360 datasets. Conclusion: The satellite-ground cooperative framework addresses visual occlusion and improves semantic scene completion performance. Abstract: Recently, camera-based solutions have been extensively explored for scene semantic completion (SSC). Despite their success in visible areas, existing methods struggle to capture complete scene semantics due to frequent visual occlusions. To address this limitation, this paper presents the first satellite-ground cooperative SSC framework, i.e., SGFormer, exploring the potential of satellite-ground image pairs in the SSC task. Specifically, we propose a dual-branch architecture that encodes orthogonal satellite and ground views in parallel, unifying them into a common domain. Additionally, we design a ground-view guidance strategy that corrects satellite image biases during feature encoding, addressing misalignment between satellite and ground views. Moreover, we develop an adaptive weighting strategy that balances contributions from satellite and ground views. Experiments demonstrate that SGFormer outperforms the state of the art on SemanticKITTI and SSCBench-KITTI-360 datasets. Our code is available at https://github.com/gxytcrc/SGFormer.

MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers

Yang Tian,Zheng Lu,Mingqi Gao,Zheng Liu,Bo Zhao

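Task: Evaluate the ability of vision-language models (VLMs) to reason with cross-source information.

Motivation: Fully comprehending scientific papers by machine requires reasoning across fragmented, heterogeneous information sources, a complex and practically significant challenge.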

Details

Method: Proposes the MMCR benchmark of 276 high-quality questions spanning 7 subjects and 10 task types, and uses it to test 18 VLMs. Result: Existing models perform poorly on cross-source reasoning: GPT-4o reaches 48.55% overall accuracy and Qwen2.5-VL-72B reaches 39.86%. CoT hurts small models but substantially improves larger ones. Conclusion: VLMs that can effectively use cross-source information for reasoning still need to be developed. Abstract: Fully comprehending scientific papers by machines reflects a high level of Artificial General Intelligence, requiring the ability to reason across fragmented and heterogeneous sources of information, presenting a complex and practically significant challenge. While Vision-Language Models (VLMs) have made remarkable strides in various tasks, particularly those involving reasoning with evidence sourced from a single image or text page, their ability to use cross-source information for reasoning remains an open problem. This work presents MMCR, a high-difficulty benchmark designed to evaluate VLMs' capacity for reasoning with cross-source information from scientific papers. The benchmark comprises 276 high-quality questions, meticulously annotated by humans across 7 subjects and 10 task types. Experiments with 18 VLMs demonstrate that cross-source reasoning presents a substantial challenge for existing models. Notably, even the top-performing model, GPT-4o, achieved only 48.55% overall accuracy, with only 20% accuracy in multi-table comprehension tasks, while the second-best model, Qwen2.5-VL-72B, reached 39.86% overall accuracy. Furthermore, we investigated the impact of the Chain-of-Thought (CoT) technique on cross-source reasoning and observed a detrimental effect on small models, whereas larger models demonstrated substantially enhanced performance. These results highlight the pressing need to develop VLMs capable of effectively utilizing cross-source information for reasoning.
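
The reported percentages can be read as question counts. The sketch below assumes that overall accuracy is simply correct answers divided by all 276 questions, which the abstract does not state explicitly; under that assumption the two figures correspond to whole numbers of questions.

```python
TOTAL_QUESTIONS = 276  # benchmark size stated in the abstract

def implied_correct(accuracy_percent, total=TOTAL_QUESTIONS):
    """Nearest whole number of correct answers for a reported accuracy."""
    return round(accuracy_percent / 100 * total)

def accuracy_percent(correct, total=TOTAL_QUESTIONS):
    """Accuracy in percent, rounded to two decimals as in the abstract."""
    return round(100 * correct / total, 2)

gpt4o_correct = implied_correct(48.55)  # GPT-4o
qwen_correct = implied_correct(39.86)   # Qwen2.5-VL-72B
gap_in_questions = gpt4o_correct - qwen_correct
```

On this reading, the gap between the two best models is a couple of dozen questions, which is worth keeping in mind when comparing models on a benchmark of this size.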

Joint Self-Supervised Video Alignment and Action Segmentation

Ali Shah Ali,Syed Ahmed Mahmood,Mubin Saeed,Andrey Konin,M. Zeeshan Zia,Quoc-Huy Tran

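Task: Propose a new approach for joint self-supervised video alignment and action segmentation based on a unified optimal transport framework.

Motivation: Remove the need for separate models for video alignment and action segmentation, improving efficiency and performance.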

Details

Method: Uses a fused Gromov-Wasserstein optimal transport formulation with a structural prior, which trains efficiently and needs only a few iterations to solve. Result: Achieves state-of-the-art video alignment on multiple benchmarks and action segmentation results superior to previous methods. Conclusion: This is the first work to unify video alignment and action segmentation in a single model, saving both time and memory compared with two single-task models. Abstract: We introduce a novel approach for simultaneous self-supervised video alignment and action segmentation based on a unified optimal transport framework. In particular, we first tackle self-supervised video alignment by developing a fused Gromov-Wasserstein optimal transport formulation with a structural prior, which trains efficiently on GPUs and needs only a few iterations for solving the optimal transport problem. Our single-task method achieves the state-of-the-art performance on multiple video alignment benchmarks and outperforms VAVA, which relies on a traditional Kantorovich optimal transport formulation with an optimality prior. Furthermore, we extend our approach by proposing a unified optimal transport framework for joint self-supervised video alignment and action segmentation, which requires training and storing a single model and saves both time and memory consumption as compared to two different single-task models. Extensive evaluations on several video alignment and action segmentation datasets demonstrate that our multi-task method achieves comparable video alignment yet superior action segmentation results over previous methods in video alignment and action segmentation respectively. Finally, to the best of our knowledge, this is the first work to unify video alignment and action segmentation into a single model.

MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering

Jialin Chen,Aosong Feng,Ziyu Zhao,Juan Garza,Gaukhar Nurbek,Cheng Qin,Ali Maatouk,Leandros Tassiulas,Yifeng Gao,Rex Ying

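Task: Evaluate large language models (LLMs) on time series and text understanding across financial and weather domains.

Motivation: Existing multimodal time-series datasets fall short in evaluating cross-modal reasoning and complex question answering, so they cannot capture the complex interactions between narrative information and temporal patterns.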

Details

Method: Introduces the Multimodal Time Series Benchmark (MTBench), which contains paired time series and text data and supports tasks such as time-series forecasting, trend analysis, and news-driven question answering. Result: Current models face significant challenges in capturing long-term dependencies, interpreting causality, and fusing multimodal information. Conclusion: MTBench provides a comprehensive testbed for multimodal time-series understanding and exposes the limitations of existing models. Abstract: Understanding the relationship between textual news and time-series evolution is a critical yet under-explored challenge in applied data science. While multimodal learning has gained traction, existing multimodal time-series datasets fall short in evaluating cross-modal reasoning and complex question answering, which are essential for capturing complex interactions between narrative information and temporal patterns. To bridge this gap, we introduce Multimodal Time Series Benchmark (MTBench), a large-scale benchmark designed to evaluate large language models (LLMs) on time series and text understanding across financial and weather domains. MTBench comprises paired time series and textual data, including financial news with corresponding stock price movements and weather reports aligned with historical temperature records. Unlike existing benchmarks that focus on isolated modalities, MTBench provides a comprehensive testbed for models to jointly reason over structured numerical trends and unstructured textual narratives. The richness of MTBench enables formulation of diverse tasks that require a deep understanding of both text and time-series data, including time-series forecasting, semantic and technical trend analysis, and news-driven question answering (QA). These tasks target the model's ability to capture temporal dependencies, extract key insights from textual context, and integrate cross-modal information. We evaluate state-of-the-art LLMs on MTBench, analyzing their effectiveness in modeling the complex relationships between news narratives and temporal patterns. Our findings reveal significant challenges in current models, including difficulties in capturing long-term dependencies, interpreting causality in financial and weather trends, and effectively fusing multimodal information.

Safe and Reliable Diffusion Models via Subspace Projection

Huiqiang Chen,Tianqing Zhu,Linlin Wang,Xin Yu,Longxiang Gao,Wanlei Zhou

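Task: Propose SAFER, a new method for thoroughly removing target concepts from diffusion models.

Motivation: Existing methods for removing inappropriate content (such as copyrighted works or offensive images) from diffusion models often fail to eliminate the target concept completely, allowing it to reappear in subtle forms.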

Details

Method: SAFER exploits the low-dimensional structure of the text embedding space: it identifies a subspace $S_c$ associated with the target concept and projects prompt embeddings onto its complementary subspace to remove the concept. Textual inversion and a subspace expansion strategy further improve removal. Result: Experiments show that SAFER consistently and effectively removes unwanted concepts from diffusion models while preserving generation quality. Conclusion: SAFER offers an efficient and thorough solution for concept removal in diffusion models. Abstract: Large-scale text-to-image (T2I) diffusion models have revolutionized image generation, enabling the synthesis of highly detailed visuals from textual descriptions. However, these models may inadvertently generate inappropriate content, such as copyrighted works or offensive images. While existing methods attempt to eliminate specific unwanted concepts, they often fail to ensure complete removal, allowing the concept to reappear in subtle forms. For instance, a model may successfully avoid generating images in Van Gogh's style when explicitly prompted with 'Van Gogh', yet still reproduce his signature artwork when given the prompt 'Starry Night'. In this paper, we propose SAFER, a novel and efficient approach for thoroughly removing target concepts from diffusion models. At a high level, SAFER is inspired by the observed low-dimensional structure of the text embedding space. The method first identifies a concept-specific subspace $S_c$ associated with the target concept $c$. It then projects the prompt embeddings onto the complementary subspace of $S_c$, effectively erasing the concept from the generated images. Since concepts can be abstract and difficult to fully capture using natural language alone, we employ textual inversion to learn an optimized embedding of the target concept from a reference image. This enables more precise subspace estimation and enhances removal performance. Furthermore, we introduce a subspace expansion strategy to ensure comprehensive and robust concept erasure. Extensive experiments demonstrate that SAFER consistently and effectively erases unwanted concepts from diffusion models while preserving generation quality.
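
The projection step can be illustrated with a small worked example. If the columns of $U$ form an orthonormal basis of $S_c$, projecting an embedding $e$ onto the orthogonal complement gives $e' = (I - U U^\top) e$. The sketch below assumes the complement is the orthogonal one and that $S_c$ is spanned by a few given concept vectors; how SAFER actually estimates $S_c$ is not covered, and all names and vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis

def project_out(embedding, basis):
    """Remove the component of `embedding` that lies in span(basis)."""
    out = list(embedding)
    for b in basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out

# Hypothetical concept vectors spanning S_c in a 4-dimensional toy space.
concept_vectors = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
basis = orthonormalize(concept_vectors)

prompt_embedding = [2.0, 3.0, 4.0, 5.0]
cleaned = project_out(prompt_embedding, basis)
```

After projection the cleaned embedding has zero inner product with every concept vector, and projecting it a second time leaves it unchanged.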

Joint Extraction Matters: Prompt-Based Visual Question Answering for Multi-Field Document Information Extraction

Mengsay Loem,Taiju Hosaka

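Task: Study the merits of extracting multiple fields jointly versus separately.

Motivation: Existing work typically queries each field in isolation, overlooking potential dependencies across fields.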

Details

Method: Compares joint and separate extraction experimentally and uses a regression-based metric to quantify inter-field relationships. Result: Joint extraction often improves accuracy, especially when fields share strong numeric or contextual dependencies. Conclusion: Multi-field prompts can mitigate confusion arising from similar surface forms and related numeric values, offering practical guidance for designing robust VQA systems for document information extraction. Abstract: Visual question answering (VQA) has emerged as a flexible approach for extracting specific pieces of information from document images. However, existing work typically queries each field in isolation, overlooking potential dependencies across multiple items. This paper investigates the merits of extracting multiple fields jointly versus separately. Through experiments on multiple large vision language models and datasets, we show that jointly extracting fields often improves accuracy, especially when the fields share strong numeric or contextual dependencies. We further analyze how performance scales with the number of requested items and use a regression-based metric to quantify inter-field relationships. Our results suggest that multi-field prompts can mitigate confusion arising from similar surface forms and related numeric values, providing practical methods for designing robust VQA systems in document information extraction tasks.
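
The difference between the two querying styles is easiest to see in how the prompts are built. The sketch below is illustrative only: the prompt wording, the JSON answer format, and the invoice field names are assumptions, not taken from the paper.

```python
def separate_prompts(fields):
    """One question per field: the model sees each field in isolation."""
    return [
        f"What is the {field} in this document? Answer with the value only."
        for field in fields
    ]

def joint_prompt(fields):
    """A single question listing all fields, so related values are read together."""
    keys = ", ".join(f'"{field}"' for field in fields)
    return (
        "Extract the following fields from this document and answer as a "
        f"JSON object with the keys {keys}."
    )

# Hypothetical invoice fields with a numeric dependency between them.
fields = ["subtotal", "tax", "total"]
per_field = separate_prompts(fields)  # three model calls
combined = joint_prompt(fields)       # one model call
```

With fields such as subtotal, tax, and total, the joint prompt lets the model disambiguate similar-looking amounts against each other, which is the kind of dependency the abstract reports as most helpful.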

LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models

Jian Liang,Wenke Huang,Guancheng Wan,Qu Yang,Mang Ye

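Task: Propose LoRASculpt, a method that eliminates harmful redundant parameters introduced during visual instruction tuning of Multimodal Large Language Models (MLLMs) while retaining both general and specialized knowledge.

Motivation: Although LoRA is widely used for efficient acquisition of specialized knowledge in MLLMs, the harmful redundancy it introduces during visual instruction tuning exacerbates the forgetting of general knowledge and degrades downstream task performance.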

Details

Method: LoRASculpt eliminates redundant parameters through sparse updates and refines LoRA's update trajectory with a Conflict Mitigation Regularizer. Result: Experiments show that even at a very high degree of sparsity ($\le$ 5%), the method improves both generalization and downstream task performance. Conclusion: LoRASculpt effectively mitigates catastrophic forgetting and promotes knowledge harmonization in MLLMs. Abstract: While Multimodal Large Language Models (MLLMs) excel at generalizing across modalities and tasks, effectively adapting them to specific downstream tasks while simultaneously retaining both general and specialized knowledge remains challenging. Although Low-Rank Adaptation (LoRA) is widely used to efficiently acquire specialized knowledge in MLLMs, it introduces substantial harmful redundancy during visual instruction tuning, which exacerbates the forgetting of general knowledge and degrades downstream task performance. To address this issue, we propose LoRASculpt to eliminate harmful redundant parameters, thereby harmonizing general and specialized knowledge. Specifically, under theoretical guarantees, we introduce sparse updates into LoRA to discard redundant parameters effectively. Furthermore, we propose a Conflict Mitigation Regularizer to refine the update trajectory of LoRA, mitigating knowledge conflicts with the pretrained weights. Extensive experimental results demonstrate that even at a very high degree of sparsity ($\le$ 5%), our method simultaneously enhances generalization and downstream task performance. This confirms that our approach effectively mitigates the catastrophic forgetting issue and further promotes knowledge harmonization in MLLMs.

Assessing the Reliability and Validity of GPT-4 in Annotating Emotion Appraisal Ratings

Deniss Ruder,Andero Uusberg,Kairit Sirts

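Task: Study how GPT-4 performs as a reader-annotator of 21 specific appraisal ratings under different prompt settings.

Motivation: Evaluate and improve GPT-4's performance in annotating appraisal ratings, compared with human annotators.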

Details

Method: Annotate appraisal ratings with GPT-4 under different prompt settings and refine the results by majority voting. Result: GPT-4 performs close to or slightly better than human annotators, and majority voting significantly improves its results; with a single prompt it effectively predicts both appraisal ratings and emotion labels, but more complex instructions reduce performance; longer event descriptions improve annotation accuracy. Conclusion: GPT-4 performs well on psychological annotation tasks, and majority voting and better prompting strategies can further improve it. Abstract: Appraisal theories suggest that emotions arise from subjective evaluations of events, referred to as appraisals. The taxonomy of appraisals is quite diverse, and they are usually given ratings on a Likert scale to be annotated in an experiencer-annotator or reader-annotator paradigm. This paper studies GPT-4 as a reader-annotator of 21 specific appraisal ratings in different prompt settings, aiming to evaluate and improve its performance compared to human annotators. We found that GPT-4 is an effective reader-annotator that performs close to or even slightly better than human annotators, and its results can be significantly improved by using a majority voting of five completions. GPT-4 also effectively predicts appraisal ratings and emotion labels using a single prompt, but adding instruction complexity results in poorer performance. We also found that longer event descriptions lead to more accurate annotations for both model and human annotator ratings. This work contributes to the growing usage of LLMs in psychology and the strategies for improving GPT-4 performance in annotating appraisals.
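
Majority voting over five completions can be sketched in a few lines of Python. The abstract does not say how ties are resolved; falling back to the lower median of the ratings is an assumption made here for illustration, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter
from statistics import median_low
from typing import Sequence

def majority_vote(ratings: Sequence[int]) -> int:
    """Return the most frequent Likert rating among several completions.

    Tie-breaking (an assumption, not from the paper): if several ratings
    share the highest count, return the lower median of all ratings.
    """
    counts = Counter(ratings).most_common()
    best_count = counts[0][1]
    tied = [rating for rating, count in counts if count == best_count]
    if len(tied) == 1:
        return tied[0]
    return median_low(ratings)

# Five hypothetical completions for one appraisal dimension.
clear_case = majority_vote([4, 4, 5, 3, 4])
tied_case = majority_vote([1, 1, 5, 5, 3])
print(clear_case, tied_case)
```

The first call has a clear winner; the second shows the tie-breaking path, where two ratings each appear twice.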

Casual Inference via Style Bias Deconfounding for Domain Generalization

Jiaxi Li,Di Lin,Hao Chen,Hongying Liu,Liang Wan,Wei Feng

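Task: Propose Style Deconfounding Causal Learning (SDCL), a new framework that addresses the reliability problems caused by style confounding when deep neural networks handle out-of-distribution data.

Motivation: Existing domain generalization methods usually overlook the impact of style frequency within the training set, so models learn spurious visual correlations rather than truly causal representations, which undermines inference reliability.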

Details

Method: Construct a structural causal model (SCM) and apply a backdoor adjustment strategy to account for style influence; design a style-guided expert module (SGEM) that adaptively clusters style distributions, and a back-door causal learning module (BDCL) that performs causal interventions during feature extraction. Result: SDCL is validated on diverse natural and medical image recognition tasks and shows superior performance in both multi-domain and single-domain generalization. Conclusion: SDCL effectively reduces style bias, improves generalization to unseen test domains, and integrates easily with existing data augmentation techniques. Abstract: Deep neural networks (DNNs) often struggle with out-of-distribution data, limiting their reliability in diverse real-world applications. To address this issue, domain generalization methods have been developed to learn domain-invariant features from single or multiple training domains, enabling generalization to unseen testing domains. However, existing approaches usually overlook the impact of style frequency within the training set. This oversight predisposes models to capture spurious visual correlations caused by style confounding factors, rather than learning truly causal representations, thereby undermining inference reliability. In this work, we introduce Style Deconfounding Causal Learning (SDCL), a novel causal inference-based framework designed to explicitly address style as a confounding factor. Our approach begins with constructing a structural causal model (SCM) tailored to the domain generalization problem and applies a backdoor adjustment strategy to account for style influence. Building on this foundation, we design a style-guided expert module (SGEM) to adaptively cluster style distributions during training, capturing the global confounding style. Additionally, a back-door causal learning module (BDCL) performs causal interventions during feature extraction, ensuring fair integration of global confounding styles into sample predictions, effectively reducing style bias. The SDCL framework is highly versatile and can be seamlessly integrated with state-of-the-art data augmentation techniques. Extensive experiments across diverse natural and medical image recognition tasks validate its efficacy, demonstrating superior performance in both multi-domain and the more challenging single-domain generalization scenarios.
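
For readers unfamiliar with backdoor adjustment, the standard formula is shown below with style as the confounder. The symbols $X$ (image), $Y$ (label), and $S$ (style) are assumed notation for illustration; the exact SCM used by SDCL is not given in the abstract.

```latex
% Backdoor adjustment with style S as the confounder (assumed notation).
% Intervening on X removes the dependence of the style distribution on X,
% so each style is weighted by its marginal probability P(S = s).
P\bigl(Y \mid do(X)\bigr) \;=\; \sum_{s} P\bigl(Y \mid X,\, S = s\bigr)\, P(S = s)
```

Compared with the observational $P(Y \mid X) = \sum_{s} P(Y \mid X, S = s)\, P(S = s \mid X)$, the only change is that $P(S = s \mid X)$ is replaced by $P(S = s)$, which is what prevents frequently co-occurring styles from dominating the prediction.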

When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making

Zhe Hu,Jing Li,Yu Yin

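Task: Systematically evaluate open-source vision-language models (VLMs) on multimodal human-centered decision-making tasks and propose a new text-only training method to improve them.

Motivation: Current VLMs struggle with complex decisions, particularly in situations that require deep reasoning about human needs and values.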

Details

Method: Propose a text-only training method based on synthesized textual data, and strengthen VLMs through self-improvement using data generated by their LLM counterparts. Result: LLMs that receive only textual descriptions outperform VLMs that process images, and the new method significantly improves VLMs' decision-making. Conclusion: Text-only training and self-improvement offer a more efficient way to improve VLMs on human-centered decision-making tasks. Abstract: Embodied decision-making is fundamental for AI agents operating in real-world environments. While Visual Language Models (VLMs) have advanced this capability, they still struggle with complex decisions, particularly in human-centered situations that require deep reasoning about human needs and values. In this study, we systematically evaluate open-sourced VLMs on multimodal human-centered decision-making tasks. We find that LLMs receiving only textual descriptions unexpectedly outperform their VLM counterparts of similar scale that process actual images, suggesting that visual alignment may hinder VLM abilities. To address this challenge, we propose a novel text-only training approach with synthesized textual data. This method strengthens VLMs' language components and transfers the learned abilities to multimodal inference, eliminating the need for expensive image-text paired data. Furthermore, we show that VLMs can achieve substantial performance gains through self-improvement, using training data generated by their LLM counterparts rather than relying on larger teacher models like GPT-4. Our findings establish a more efficient and scalable approach to enhancing VLMs' human-centered decision-making capabilities, opening new avenues for optimizing VLMs through self-improvement mechanisms.

Generative Compositor for Few-Shot Visual Information Extraction

Zhibo Yang,Wei Hua,Sibo Song,Cong Yao,Yingying Zhu,Wenqing Cheng,Xiang Bai

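Task: Propose a generative model, Generative Compositor, to address the challenge of few-shot visual information extraction (VIE).

Motivation: VIE is pivotal in document processing, but many document types suffer from a lack of training data.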

Details

Method: A hybrid pointer-generator network emulates the operations of a compositor, combined with three pre-training strategies and a prompt-aware resampler. Result: The method is highly competitive in full-sample training and notably outperforms the baseline in the 1-shot, 5-shot, and 10-shot settings. Conclusion: Through its prompt-based retrieval mechanism and pre-training strategies, Generative Compositor acquires spatial and semantic clues effectively from limited samples. Abstract: Visual Information Extraction (VIE), aiming at extracting structured information from visually rich document images, plays a pivotal role in document processing. Considering various layouts, semantic scopes, and languages, VIE encompasses an extensive range of types, potentially numbering in the thousands. However, many of these types suffer from a lack of training data, which poses significant challenges. In this paper, we propose a novel generative model, named Generative Compositor, to address the challenge of few-shot VIE. The Generative Compositor is a hybrid pointer-generator network that emulates the operations of a compositor by retrieving words from the source text and assembling them based on the provided prompts. Furthermore, three pre-training strategies are employed to enhance the model's perception of spatial context information. In addition, a prompt-aware resampler is specially designed to enable efficient matching by leveraging the entity-semantic prior contained in prompts. The introduction of the prompt-based retrieval mechanism and the pre-training strategies enables the model to acquire more effective spatial and semantic clues with limited training samples. Experiments demonstrate that the proposed method achieves highly competitive results in full-sample training, while notably outperforming the baseline in the 1-shot, 5-shot, and 10-shot settings.

A Survey on Personalized Alignment -- The Missing Piece for Large Language Models in Real-World Applications

Jian Guan,Junfei Wu,Jia-Nan Li,Chuanqi Cheng,Wei Wu

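Task: Survey personalized alignment, a paradigm that enables large language models to adapt their behavior to individual preferences within ethical boundaries.

Motivation: Existing alignment techniques adopt a one-size-fits-all approach that fails to accommodate users' diverse backgrounds and needs.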

Details

Method: Propose a unified framework comprising preference memory management, personalized generation, and feedback-based alignment, and systematically analyze implementation approaches and evaluate their effectiveness. Result: By examining current techniques, potential risks, and future challenges, the survey provides a structured foundation for developing more adaptable and ethically aligned LLMs. Conclusion: The survey offers systematic guidance and future research directions for personalized alignment. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their transition to real-world applications reveals a critical limitation: the inability to adapt to individual preferences while maintaining alignment with universal human values. Current alignment techniques adopt a one-size-fits-all approach that fails to accommodate users' diverse backgrounds and needs. This paper presents the first comprehensive survey of personalized alignment, a paradigm that enables LLMs to adapt their behavior within ethical boundaries based on individual preferences. We propose a unified framework comprising preference memory management, personalized generation, and feedback-based alignment, systematically analyzing implementation approaches and evaluating their effectiveness across various scenarios. By examining current techniques, potential risks, and future challenges, this survey provides a structured foundation for developing more adaptable and ethically-aligned LLMs.

Stack Transformer Based Spatial-Temporal Attention Model for Dynamic Multi-Culture Sign Language Recognition

Koki Hirooka,Abu Saleh Musa Miah,Tatsuya Murakami,Yuto Akiba,Yong Seok Hwang,Jungpil Shin

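Task: Propose a hand gesture recognition method based on a stacked spatial-temporal Transformer network for multi-cultural sign language (McSL) recognition.

Motivation: Existing sign language recognition systems perform poorly on multi-cultural sign languages (McSL), which calls for a method that captures spatial and temporal dependencies together.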

Details

Method: Use a stacked spatial-temporal Transformer network that combines multi-head attention with hierarchical feature extraction and processes the data through spatial and temporal Transformer modules. Result: The method achieves good performance on the JSL, KSL, and ASL datasets, demonstrating its effectiveness for McSL. Conclusion: The method performs well on multi-cultural sign language recognition and is presented as novel work in this domain. Abstract: Hand gesture-based Sign Language Recognition (SLR) serves as a crucial communication bridge between deaf and non-deaf individuals. Existing SLR systems perform well for their cultural SL but may struggle with multi-cultural sign languages (McSL). To address these challenges, this paper proposes a Stack Spatial-Temporal Transformer Network that leverages multi-head attention mechanisms to capture both spatial and temporal dependencies with hierarchical features using the Stack Transfer concept. In the procedure, we first apply a fully connected layer to produce an embedding vector with high expressive power from the original dataset, then feed it to a stack of the newly proposed transformer blocks to obtain hierarchical features with short-range and long-range dependencies. The network architecture is composed of several stages that process spatial and temporal relationships sequentially, ensuring effective feature extraction. After the fully connected layer, the embedding vector is processed by the Spatial Multi-Head Attention Transformer, which captures spatial dependencies between joints. In the next stage, the Temporal Multi-Head Attention Transformer captures long-range temporal dependencies, and again, the features are concatenated with the output using another skip connection. The processed features are then passed to the Feed-Forward Network (FFN), which refines the feature representations further. After the FFN, additional skip connections are applied to combine the output with earlier layers, followed by a final normalization layer to produce the final output feature tensor. This process is repeated for 10 transformer blocks. Extensive experiments show that the model achieves good accuracy on the JSL, KSL, and ASL datasets. Our approach demonstrates improved performance in McSL and can be considered a novel contribution in this domain.

Summarization Metrics for Spanish and Basque: Do Automatic Scores and LLM-Judges Correlate with Humans?

Jeremy Barnes,Naiara Perez,Alba Bonet-Jover,Begoña Altuna

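Task: Use the BASSE dataset to evaluate automatic summarization metrics and LLM-as-a-Judge models in Basque and Spanish.

Motivation: Studies on evaluation metrics and LLM-as-a-Judge models for automatic text summarization have focused largely on English, limiting our understanding of their effectiveness in other languages.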

Details

Method: Collect human judgments on 2,040 summaries in Basque and Spanish and evaluate traditional automatic metrics and LLM-as-a-Judge models against them. Result: Proprietary judge LLMs correlate best with human judgments, followed by criteria-specific automatic metrics, while open-source judge LLMs perform poorly. Conclusion: The study fills a gap in summarization evaluation for non-English languages and releases the BASSE dataset and related code. Abstract: Studies on evaluation metrics and LLM-as-a-Judge models for automatic text summarization have largely been focused on English, limiting our understanding of their effectiveness in other languages. Through our new dataset BASSE (BAsque and Spanish Summarization Evaluation), we address this situation by collecting human judgments on 2,040 abstractive summaries in Basque and Spanish, generated either manually or by five LLMs with four different prompts. For each summary, annotators evaluated five criteria on a 5-point Likert scale: coherence, consistency, fluency, relevance, and 5W1H. We use these data to reevaluate traditional automatic metrics used for evaluating summaries, as well as several LLM-as-a-Judge models that show strong performance on this task in English. Our results show that currently proprietary judge LLMs have the highest correlation with human judgments, followed by criteria-specific automatic metrics, while open-sourced judge LLMs perform poorly. We release BASSE and our code publicly, along with the first large-scale Basque summarization dataset containing 22,525 news articles with their subheads.

ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering

Kaisi Guan,Zhengfeng Lai,Yuchong Sun,Peng Zhang,Wei Liu,Kieran Liu,Meng Cao,Ruihua Song

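Task: Propose ETVA, a new evaluation method for text-to-video alignment that assesses semantic alignment precisely through fine-grained question generation and answering.

Motivation: Existing text-to-video alignment metrics such as CLIPScore produce only coarse-grained scores without fine-grained alignment details, and fail to match human preference.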

Details

Method: ETVA uses a multi-agent system to parse prompts into semantic scene graphs and generate atomic questions, and a knowledge-augmented multi-stage reasoning framework to answer them. Result: ETVA achieves a Spearman's correlation coefficient of 58.47, far above the 31.0 attained by existing metrics, and agrees more closely with human judgment. Conclusion: ETVA provides a more precise alignment evaluation for next-generation text-to-video generation and reveals the key capabilities and limitations of existing models. Abstract: Precisely evaluating semantic alignment between text prompts and generated videos remains a challenge in Text-to-Video (T2V) Generation. Existing text-to-video alignment metrics like CLIPScore only generate coarse-grained scores without fine-grained alignment details, failing to align with human preference. To address this limitation, we propose ETVA, a novel Evaluation method of Text-to-Video Alignment via fine-grained question generation and answering. First, a multi-agent system parses prompts into semantic scene graphs to generate atomic questions. Then we design a knowledge-augmented multi-stage reasoning framework for question answering, where an auxiliary LLM first retrieves relevant common-sense knowledge (e.g., physical laws), and then a video LLM answers the generated questions through a multi-stage reasoning mechanism. Extensive experiments demonstrate that ETVA achieves a Spearman's correlation coefficient of 58.47, showing a much higher correlation with human judgment than existing metrics which attain only 31.0. We also construct a comprehensive benchmark specifically designed for text-to-video alignment evaluation, featuring 2k diverse prompts and 12k atomic questions spanning 10 categories. Through a systematic evaluation of 15 existing text-to-video models, we identify their key capabilities and limitations, paving the way for next-generation T2V generation.

A Study into Investigating Temporal Robustness of LLMs

Jonas Wallat,Abdelrahman Abdallah,Adam Jatowt,Avishek Anand

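Task: Evaluate how robust large language models (LLMs) are when handling time-sensitive questions and historical knowledge.

Motivation: LLMs' performance on temporal questions and historical knowledge is limited because they often cannot understand temporal scope and orientation, or neglect the temporal aspect altogether.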

Details

Method: Design eight time-sensitive robustness tests and evaluate six popular LLMs in the zero-shot setting. Result: LLMs lack temporal robustness, especially to temporal reformulations and to temporal references of different granularities. Conclusion: The tests can be used to judge a model's temporal robustness on the fly, and the findings improve temporal QA performance by up to 55 percent. Abstract: Large Language Models (LLMs) encapsulate a surprising amount of factual world knowledge. However, their performance on temporal questions and historical knowledge is limited because they often cannot understand temporal scope and orientation or neglect the temporal aspect altogether. In this study, we aim to measure precisely how robust LLMs are for question answering based on their ability to process temporal information and perform tasks requiring temporal reasoning and temporal factual knowledge. Specifically, we design eight time-sensitive robustness tests for factual information to check the sensitivity of six popular LLMs in the zero-shot setting. Overall, we find LLMs lacking temporal robustness, especially to temporal reformulations and the use of different granularities of temporal references. We show how a selection of these eight tests can be used automatically to judge a model's temporal robustness for user questions on the fly. Finally, we apply the findings of this study to improve the temporal QA performance by up to 55 percent.

Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification

Dongseob Kim,Hyunjung Shim

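Task: Propose a CLIP-based method for unsupervised multi-label classification that addresses CLIP's view-dependent predictions and inherent bias.

Motivation: Multi-label classification is crucial for image understanding, but acquiring accurate annotations is costly and difficult.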

Details

Method: Classifier-guided CLIP Distillation (CCD) leverages multiple views near target objects, guided by the classifier's Class Activation Mapping (CAM), together with debiased pseudo-labels derived from CLIP predictions. Result: Experiments show that CCD outperforms existing techniques on multiple datasets. Conclusion: Through multi-view selection and debiased predictions, CCD significantly improves unsupervised multi-label classification. Abstract: Multi-label classification is crucial for comprehensive image understanding, yet acquiring accurate annotations is challenging and costly. To address this, a recent study suggests exploiting unsupervised multi-label classification leveraging CLIP, a powerful vision-language model. Despite CLIP's proficiency, it suffers from view-dependent predictions and inherent bias, limiting its effectiveness. We propose a novel method that addresses these issues by leveraging multiple views near target objects, guided by Class Activation Mapping (CAM) of the classifier, and debiasing pseudo-labels derived from CLIP predictions. Our Classifier-guided CLIP Distillation (CCD) enables selecting multiple local views without extra labels and debiasing predictions to enhance classification performance. Experimental results validate our method's superiority over existing techniques across diverse datasets. The code is available at https://github.com/k0u-id/CCD.

Modifying Large Language Model Post-Training for Diverse Creative Writing

John Joon Young Chung,Vishakh Padmakumar,Melissa Roemmele,Yuqian Sun,Max Kreminski

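Task: Investigate post-training approaches that improve both the output diversity and the quality of large language models on creative writing tasks.

Motivation: Existing post-training methods usually focus on generation quality and neglect output diversity, whereas creative writing tasks call for diverse valid outputs.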

Details

Method: Include deviation (the degree of difference between a sample and the other samples with the same prompt) in the training objective, applied to direct preference optimization (DPO) and odds ratio preference optimization (ORPO). Result: An 8B-parameter model reaches diversity on par with a human-created dataset, with quality close to GPT-4o and DeepSeek-R1. Conclusion: The proposed approach significantly improves output diversity with minimal quality loss, and human evaluation and comparisons validate its effectiveness. Abstract: As creative writing tasks do not have singular correct answers, large language models (LLMs) trained to perform these tasks should be able to generate diverse valid outputs. However, LLM post-training often focuses on improving generation quality but neglects to facilitate output diversity. Hence, in creative writing generation, we investigate post-training approaches to promote both output diversity and quality. Our core idea is to include deviation -- the degree of difference between a training sample and all other samples with the same prompt -- in the training objective to facilitate learning from rare high-quality instances. By adopting our approach to direct preference optimization (DPO) and odds ratio preference optimization (ORPO), we demonstrate that we can promote the output diversity of trained models while minimally decreasing quality. Our best model with 8B parameters could achieve diversity on par with a human-created dataset while having output quality similar to the best instruction-tuned models we examined, GPT-4o and DeepSeek-R1. We further validate our approaches with a human evaluation, an ablation, and a comparison to an existing diversification approach, DivPO.
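
One way to picture deviation is as the mean distance between a sample's embedding and the embeddings of the other samples for the same prompt. The sketch below makes two assumptions that the abstract does not confirm: deviation is measured with cosine distance, and it enters the objective as a multiplicative weight on the standard DPO pairwise loss. The function names are hypothetical.

```python
import numpy as np

def deviations(embeddings):
    """Mean cosine distance from each sample to all other samples.

    embeddings: (n, d) array of responses to the same prompt, n >= 2.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    distance = 1.0 - unit @ unit.T
    n = len(unit)
    # Exclude each sample's distance to itself before averaging.
    return (distance.sum(axis=1) - np.diag(distance)) / (n - 1)

def dpo_loss(margin, beta=0.1):
    """Standard DPO pairwise loss, -log(sigmoid(beta * margin)).

    margin is the chosen-minus-rejected difference of policy/reference
    log-probability ratios.
    """
    return np.logaddexp(0.0, -beta * margin)

def deviation_weighted_loss(margins, chosen_deviation, beta=0.1):
    """Assumed weighting: scale each pair's loss by the chosen sample's deviation."""
    return float(np.mean(chosen_deviation * dpo_loss(margins, beta)))

# Three near-duplicate responses and one unusual response (toy embeddings).
emb = np.array([[1.0, 0.0], [1.0, 0.01], [1.0, -0.01], [0.0, 1.0]])
dev = deviations(emb)
print(dev)
print(deviation_weighted_loss(np.array([0.5, 0.2, -0.1, 0.3]), dev))
```

Under this weighting the rare response contributes most to the loss, which matches the stated goal of learning from rare high-quality instances.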

Salient Object Detection in Traffic Scene through the TSOD10K Dataset

Yu Qiu,Yuhang Sun,Jie Mei,Lin Xiao,Jing Xu

Task: Traffic Salient Object Detection (TSOD) aims to segment objects critical to driving safety by combining semantic and visual saliency.

Motivation: Unlike natural scene images, TSOD emphasizes objects with semantic impact on driving safety, even if they have low visual contrast, addressing the lack of task-specific benchmarks.

Details

Method: A Mamba-based TSOD model, Tramba, is proposed, featuring a Dual-Frequency Visual State Space module and a traffic-oriented Helix 2D-Selective-Scan mechanism. Result: Tramba outperforms 22 existing NSI-SOD models on the newly collected TSOD10K dataset, demonstrating its superiority. Conclusion: This research establishes the foundation for safety-aware saliency analysis in intelligent transportation systems. Abstract: Traffic Salient Object Detection (TSOD) aims to segment the objects critical to driving safety by combining semantic (e.g., collision risks) and visual saliency. Unlike SOD in natural scene images (NSI-SOD), which prioritizes visually distinctive regions, TSOD emphasizes the objects that demand immediate driver attention due to their semantic impact, even with low visual contrast. This dual criterion, i.e., bridging perception and contextual risk, re-defines saliency for autonomous and assisted driving systems. To address the lack of task-specific benchmarks, we collect the first large-scale TSOD dataset with pixel-wise saliency annotations, named TSOD10K. TSOD10K covers the diverse object categories in various real-world traffic scenes under various challenging weather/illumination variations (e.g., fog, snowstorms, low-contrast, and low-light). Methodologically, we propose a Mamba-based TSOD model, termed Tramba. Considering the challenge of distinguishing inconspicuous visual information from complex traffic backgrounds, Tramba introduces a novel Dual-Frequency Visual State Space module equipped with shifted window partitioning and dilated scanning to enhance the perception of fine details and global structure by hierarchically decomposing high/low-frequency components. To emphasize critical regions in traffic scenes, we propose a traffic-oriented Helix 2D-Selective-Scan (Helix-SS2D) mechanism that injects driving attention priors while effectively capturing global multi-direction spatial dependencies. 
We establish a comprehensive benchmark by evaluating Tramba and 22 existing NSI-SOD models on TSOD10K, demonstrating Tramba's superiority. Our research establishes the first foundation for safety-aware saliency analysis in intelligent transportation systems.

CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization

Brihi Joshi,Sriram Venkatapathy,Mohit Bansal,Nanyun Peng,Haw-Shiuan Chang

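Task: Evaluate creative text such as human-written stories with language models that mimic the human thinking process.

Motivation: Address the subjectivity of multi-annotator ratings and the suboptimal results that existing self-consistency methods produce because of an objective mismatch.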

Details

Method: Chain-of-Keywords (CoKe) first generates a sequence of keywords, then a free-text rationale, and aggregates scores over a diverse set of keyword sequences. Result: On the StoryER dataset, CoKe with small fine-tuned models reaches human-level performance and significantly outperforms GPT-4 while requiring far fewer parameters. Conclusion: CoKe improves both the accuracy and the efficiency of creative text evaluation. Abstract: Evaluating creative text such as human-written stories using language models has always been a challenging task -- owing to the subjectivity of multi-annotator ratings. To mimic the thinking process of humans, chain of thought (CoT) generates free-text explanations that help guide a model's predictions and Self-Consistency (SC) marginalizes predictions over multiple generated explanations. In this study, we discover that the widely-used self-consistency reasoning methods cause suboptimal results due to an objective mismatch between generating 'fluent-looking' explanations vs. actually leading to a good rating prediction for an aspect of a story. To overcome this challenge, we propose $\textbf{C}$hain-$\textbf{o}$f-$\textbf{Ke}$ywords (CoKe), which generates a sequence of keywords $\textit{before}$ generating a free-text rationale; these keywords guide the rating prediction of our evaluation language model. Then, we generate a diverse set of such keywords, and aggregate the scores corresponding to these generations. On the StoryER dataset, CoKe based on our small fine-tuned evaluation models not only reaches human-level performance and significantly outperforms GPT-4 with a 2x boost in correlation with human annotators, but also requires drastically fewer parameters.

Temporal Action Detection Model Compression by Progressive Block Drop

Xiaoyong Chen,Yong Guo,Jiaming Liang,Sitong Zhuang,Runhao Zeng,Xiping Hu

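Task: Propose a Progressive Block Drop method that reduces model depth while retaining layer width, lowering computational overhead.

Motivation: Existing channel pruning methods reduce GPU parallelization efficiency when they cut the number of channels, and the computational demands of deep models limit their use in resource-constrained settings such as autonomous driving and robotics.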

Details

Method: A two-step progressive block drop: first remove the blocks with minimal impact on model performance, then fine-tune the pruned model with a parameter-efficient cross-depth alignment technique to restore accuracy. Result: A 25% reduction in computational overhead on two TAD benchmarks (THUMOS14 and ActivityNet-1.3) with lossless compression. Conclusion: The method is orthogonal to channel pruning and can be combined with it for further efficiency gains. Abstract: Temporal action detection (TAD) aims to identify and localize action instances in untrimmed videos, which is essential for various video understanding tasks. However, recent improvements in model performance, driven by larger feature extractors and datasets, have led to increased computational demands. This presents a challenge for applications like autonomous driving and robotics, which rely on limited computational resources. While existing channel pruning methods can compress these models, reducing the number of channels often hinders the parallelization efficiency of GPUs, due to the inefficient multiplication between small matrices. Instead of pruning channels, we propose a Progressive Block Drop method that reduces model depth while retaining layer width. In this way, we still use large matrices for computation but reduce the number of multiplications. Our approach iteratively removes redundant blocks in two steps: first, we drop blocks with minimal impact on model performance; and second, we employ a parameter-efficient cross-depth alignment technique, fine-tuning the pruned model to restore model accuracy. Our method achieves a 25% reduction in computational overhead on two TAD benchmarks (THUMOS14 and ActivityNet-1.3) with lossless compression. More critically, we empirically show that our method is orthogonal to channel pruning methods and can be combined with them to yield further efficiency gains.
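
The two-step loop (drop the least harmful block, then fine-tune) can be sketched as a greedy procedure. This is a toy illustration under stated assumptions: `evaluate` and `finetune` are hypothetical callbacks supplied by the caller, blocks are represented by names, and the cross-depth alignment technique itself is not implemented.

```python
from typing import Callable, List

def progressive_block_drop(
    blocks: List[str],
    evaluate: Callable[[List[str]], float],
    finetune: Callable[[List[str]], List[str]],
    num_to_drop: int,
) -> List[str]:
    """Greedily drop one block per round, fine-tuning after each drop."""
    kept = list(blocks)
    for _ in range(num_to_drop):
        # Step 1: try removing each block and keep the variant that scores best,
        # i.e. drop the block whose removal hurts performance the least.
        candidates = [kept[:i] + kept[i + 1:] for i in range(len(kept))]
        best = max(candidates, key=evaluate)
        # Step 2: recover accuracy before the next round (placeholder callback).
        kept = finetune(best)
    return kept

# Toy setup: each block has a fixed, made-up contribution to the score.
importance = {"a": 5.0, "b": 0.1, "c": 3.0, "d": 0.2}
score = lambda model: sum(importance[name] for name in model)
no_op_finetune = lambda model: model

pruned = progressive_block_drop(["a", "b", "c", "d"], score, no_op_finetune, 2)
print(pruned)
```

Because whole blocks are removed, the remaining layers keep their original width, which is the property the abstract credits for preserving GPU efficiency.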

A Language Anchor-Guided Method for Robust Noisy Domain Generalization

Zilin Dai,Lehong Wang,Fangzhou Lin,Yidong Wang,Zhigang Li,Kazunori D Yamada,Ziming Zhang,Wang Lu

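Task: Address distribution shift and label noise in machine learning.

Motivation: Real-world machine learning applications often overfit under distribution shift and label noise, making it hard to generalize to the target domain.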

Details

Method: The A3W algorithm uses NLP anchors to guide sample reweighting and extract more representative features, and adjusts each sample's contribution through a weighted loss function. Result: On standard benchmark datasets, A3W significantly outperforms existing domain generalization methods, improving accuracy and robustness. Conclusion: By combining NLP anchors with adaptive weighting, A3W effectively improves model performance under distribution shift and noise. Abstract: Real-world machine learning applications often struggle with two major challenges: distribution shift and label noise. Models tend to overfit by focusing on redundant and uninformative features in the training data, which makes it hard for them to generalize to the target domain. Noisy data worsens this problem by causing further overfitting to the noise, meaning that existing methods often fail to tell the difference between true, invariant features and misleading, spurious ones. To tackle these issues, we introduce Anchor Alignment and Adaptive Weighting (A3W). This new algorithm uses sample reweighting guided by natural language processing (NLP) anchors to extract more representative features. In simple terms, A3W leverages semantic representations from natural language models as a source of domain-invariant prior knowledge. Additionally, it employs a weighted loss function that adjusts each sample's contribution based on its similarity to the corresponding NLP anchor. This adjustment makes the model more robust to noisy labels. Extensive experiments on standard benchmark datasets show that A3W consistently outperforms state-of-the-art domain generalization methods, offering significant improvements in both accuracy and robustness across different datasets and noise levels.
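
The weighted loss can be pictured as a cross-entropy in which each sample's weight grows with the similarity between its feature and the anchor of its labeled class. The sketch below assumes cosine similarity and a softmax over the batch with a temperature; these choices, and the function names, are illustrative assumptions rather than the A3W formulation.

```python
import numpy as np

def anchor_weights(features, anchors, labels, temperature=0.1):
    """Softmax-normalized weights from cosine similarity to each sample's class anchor."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    similarity = np.sum(f * a[labels], axis=1)
    z = similarity / temperature
    w = np.exp(z - z.max())  # subtract the max for numerical stability
    return w / w.sum()

def weighted_cross_entropy(logits, labels, weights):
    """Sum of per-sample cross-entropy terms scaled by the given weights."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    return float(np.sum(weights * nll))

# Toy batch: all three samples are labeled class 0, but the third one's
# feature lies along the class-1 anchor, as a mislabeled sample might.
anchors = np.eye(2)
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
labels = np.array([0, 0, 0])
logits = np.array([[2.0, 0.0], [1.5, 0.2], [0.1, 1.0]])

weights = anchor_weights(features, anchors, labels)
loss = weighted_cross_entropy(logits, labels, weights)
print(weights, loss)
```

The suspicious third sample receives a near-zero weight, so its likely noisy label has little influence on the loss.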

When Preferences Diverge: Aligning Diffusion Models with Minority-Aware Adaptive DPO

Lingfan Zhang,Chen Liu,Chengming Xu,Kai Hu,Donghao Luo,Chengjie Wang,Yanwei Fu,Yuan Yao

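Task: Explore the critical role of preference data in training diffusion models and propose Adaptive-DPO to improve model performance.

Motivation: Examine the subjectivity of universal human preferences in image generation and the challenges it poses, especially the detrimental effect of minority samples on model performance.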

Details

Method: Adaptive-DPO improves the DPO loss by incorporating a minority-instance-aware metric (including intra-annotator confidence and inter-annotator stability). Result: Experiments show that Adaptive-DPO effectively handles both synthetic and real-world preference data and improves model performance. Conclusion: Adaptive-DPO provides a more effective training method for image generation tasks. Abstract: In recent years, the field of image generation has witnessed significant advancements, particularly in fine-tuning methods that align models with universal human preferences. This paper explores the critical role of preference data in the training process of diffusion models, particularly in the context of Diffusion-DPO and its subsequent adaptations. We investigate the complexities surrounding universal human preferences in image generation, highlighting the subjective nature of these preferences and the challenges posed by minority samples in preference datasets. Through pilot experiments, we demonstrate the existence of minority samples and their detrimental effects on model performance. We propose Adaptive-DPO -- a novel approach that incorporates a minority-instance-aware metric into the DPO objective. This metric, which includes intra-annotator confidence and inter-annotator stability, distinguishes between majority and minority samples. We introduce an Adaptive-DPO loss function which improves the DPO loss in two ways: enhancing the model's learning of majority labels while mitigating the negative impact of minority samples. Our experiments demonstrate that this method effectively handles both synthetic minority data and real-world preference data, paving the way for more effective training methodologies in image generation tasks.

Automating Adjudication of Cardiovascular Events Using Large Language Models

Sonish Sivarajkumar,Kimia Ameri,Chuqin Li,Yanshan Wang,Min Jiang

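Task: Automate the adjudication of cardiovascular events in clinical trials.

Motivation: Traditional manual adjudication is time-consuming, resource-intensive, and subject to inter-reviewer variability, which can introduce bias and hinder trial progress.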

Details

Method: A two-stage approach: LLM-based event information extraction, followed by an adjudication process guided by Tree of Thoughts and clinical endpoint committee (CEC) guidelines. Result: The framework achieves an F1-score of 0.82 for event extraction and an accuracy of 0.68 for adjudication, and introduces the CLEART score for evaluating the quality of AI-generated clinical reasoning. Conclusion: The approach shows significant potential to reduce adjudication time and cost while maintaining high-quality, consistent, and auditable outcomes, helping identify and mitigate risks associated with cardiovascular therapies faster. Abstract: Cardiovascular events, such as heart attacks and strokes, remain a leading cause of mortality globally, necessitating meticulous monitoring and adjudication in clinical trials. This process, traditionally performed manually by clinical experts, is time-consuming, resource-intensive, and prone to inter-reviewer variability, potentially introducing bias and hindering trial progress. This study addresses these critical limitations by presenting a novel framework for automating the adjudication of cardiovascular events in clinical trials using Large Language Models (LLMs). We developed a two-stage approach: first, employing an LLM-based pipeline for event information extraction from unstructured clinical data and second, using an LLM-based adjudication process guided by a Tree of Thoughts approach and clinical endpoint committee (CEC) guidelines. Using cardiovascular event-specific clinical trial data, the framework achieved an F1-score of 0.82 for event extraction and an accuracy of 0.68 for adjudication. Furthermore, we introduce the CLEART score, a novel, automated metric specifically designed for evaluating the quality of AI-generated clinical reasoning in adjudicating cardiovascular events. This approach demonstrates significant potential for substantially reducing adjudication time and costs while maintaining high-quality, consistent, and auditable outcomes in clinical trials. The reduced variability and enhanced standardization also allow for faster identification and mitigation of risks associated with cardiovascular therapies.

Optimized Minimal 3D Gaussian Splatting

Joo Chan Lee,Jong Hwan Ko,Eunbyung Park

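Task: Propose Optimized Minimal Gaussians (OMG), a representation that significantly reduces storage while using a minimal number of Gaussian primitives.

Motivation: 3D Gaussian Splatting (3DGS) incurs significant storage and memory overhead, and existing compression methods still rely on a relatively large number of Gaussians while focusing mainly on attribute compression.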

Details

Method: Identify distinct Gaussians among nearby ones to reduce redundancy, propose a compact and precise attribute representation, and introduce a sub-vector quantization technique for improved irregularity representation. Result: OMG reduces storage requirements by nearly 50% compared to the previous state of the art and enables 600+ FPS rendering while maintaining high rendering quality. Conclusion: OMG is an efficient Gaussian representation that lowers storage and computational cost while maintaining high-quality rendering. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful representation for real-time, high-performance rendering, enabling a wide range of applications. However, representing 3D scenes with numerous explicit Gaussian primitives imposes significant storage and memory overhead. Recent studies have shown that high-quality rendering can be achieved with a substantially reduced number of Gaussians when represented with high-precision attributes. Nevertheless, existing 3DGS compression methods still rely on a relatively large number of Gaussians, focusing primarily on attribute compression. This is because a smaller set of Gaussians becomes increasingly sensitive to lossy attribute compression, leading to severe quality degradation. Since the number of Gaussians is directly tied to computational costs, it is essential to reduce the number of Gaussians effectively rather than only optimizing storage. In this paper, we propose Optimized Minimal Gaussians representation (OMG), which significantly reduces storage while using a minimal number of primitives. First, we identify distinct Gaussians among nearby ones, minimizing redundancy without sacrificing quality. Second, we propose a compact and precise attribute representation that efficiently captures both continuity and irregularity among primitives. Additionally, we propose a sub-vector quantization technique for improved irregularity representation, maintaining fast training with a negligible codebook size. Extensive experiments demonstrate that OMG reduces storage requirements by nearly 50% compared to the previous state-of-the-art and enables 600+ FPS rendering while maintaining high rendering quality. Our source code is available at https://maincold2.github.io/omg/.
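
The sub-vector quantization idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the equal-width split, and the toy codebooks are all assumptions made for illustration. Each attribute vector is cut into sub-vectors, and each sub-vector is stored as the index of its nearest codeword in a small per-sub-vector codebook.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_subvectors(vector, codebooks):
    """Split `vector` into len(codebooks) equal-width sub-vectors and return,
    for each sub-vector, the index of the nearest codeword in its codebook.
    (Hypothetical helper; the equal-width split is an assumption.)"""
    n_sub = len(codebooks)
    if len(vector) % n_sub != 0:
        raise ValueError("vector length must be divisible by the number of codebooks")
    width = len(vector) // n_sub
    indices = []
    for i, codebook in enumerate(codebooks):
        sub = vector[i * width:(i + 1) * width]
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(sub, codebook[k]))
        indices.append(nearest)
    return indices


def reconstruct(indices, codebooks):
    """Rebuild the (lossy) attribute vector from the stored codeword indices."""
    out = []
    for idx, codebook in zip(indices, codebooks):
        out.extend(codebook[idx])
    return out
```

Only the indices and the small codebooks need to be stored, which is why a compact codebook keeps the storage overhead low in schemes of this kind.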

SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging

Aladin Djuhera,Swanand Ravindra Kadhe,Farhan Ahmed,Syed Zawad,Holger Boche

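Task: Propose SafeMERGE, a post-fine-tuning framework that preserves the safety of large language models while maintaining task utility.

Motivation: Fine-tuning large language models can inadvertently erode their safety alignment, even on benign datasets.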

Details

Method: Selectively merge fine-tuned and safety-aligned model layers, only when they deviate from safe behavior as measured by a cosine similarity criterion. Result: SafeMERGE reduces harmful outputs without significantly sacrificing performance, sometimes even improving it. Conclusion: Selective, subspace-guided, per-layer merging effectively guards against the loss of safety in fine-tuned models and outperforms simpler post-fine-tuning defenses. Abstract: Fine-tuning large language models (LLMs) on downstream tasks can inadvertently erode their safety alignment, even for benign fine-tuning datasets. We address this challenge by proposing SafeMERGE, a post-fine-tuning framework that preserves safety while maintaining task utility. It achieves this by selectively merging fine-tuned and safety-aligned model layers only when those deviate from safe behavior, measured by a cosine similarity criterion. We evaluate SafeMERGE against other fine-tuning- and post-fine-tuning-stage approaches for Llama-2-7B-Chat and Qwen-2-7B-Instruct models on GSM8K and PubMedQA tasks while exploring different merging strategies. We find that SafeMERGE consistently reduces harmful outputs compared to other baselines without significantly sacrificing performance, sometimes even enhancing it. The results suggest that our selective, subspace-guided, and per-layer merging method provides an effective safeguard against the inadvertent loss of safety in fine-tuned LLMs while outperforming simpler post-fine-tuning-stage defenses.
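
A rough sketch of selective layer-wise merging under a cosine similarity criterion follows. It is an illustration under simplifying assumptions, not the SafeMERGE procedure itself: here similarity is computed directly between flattened layer weights, whereas the abstract describes the method as subspace-guided, and the `threshold` and `alpha` parameters are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flattened weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def selective_merge(finetuned, safe, threshold=0.9, alpha=0.5):
    """Merge a layer with its safety-aligned counterpart only when the two
    have drifted apart (similarity below `threshold`); keep it otherwise.
    `threshold` and `alpha` are illustrative values, not taken from the paper."""
    merged = {}
    merged_layers = []
    for name, ft_weights in finetuned.items():
        safe_weights = safe[name]
        if cosine_similarity(ft_weights, safe_weights) < threshold:
            # Linear interpolation between fine-tuned and safe weights.
            merged[name] = [alpha * f + (1.0 - alpha) * s
                            for f, s in zip(ft_weights, safe_weights)]
            merged_layers.append(name)
        else:
            merged[name] = list(ft_weights)
    return merged, merged_layers
```

Layers that stay close to the safe reference are left untouched, which is the property that lets a selective scheme retain task utility.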

TEMPO: Temporal Preference Optimization of Video LLMs via Difficulty Scheduling and Pre-SFT Alignment

Shicheng Li,Lei Li,Kun Ouyang,Shuhuai Ren,Yuanxin Liu,Yuanxing Zhang,Fuzheng Zhang,Lingpeng Kong,Qi Liu,Xu Sun

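Task: Improve the temporal reasoning capabilities of Video Large Language Models (Video LLMs).

Motivation: Existing approaches struggle with temporal reasoning, mainly because of weak temporal correspondence in the data and reliance on the next-token prediction paradigm during training.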

Details

Method: Propose TEMPO, a framework that enhances temporal reasoning through Direct Preference Optimization (DPO), comprising automated preference data generation, curriculum learning, and Pre-SFT Alignment. Result: Experiments show that TEMPO consistently improves Video LLM performance across multiple benchmarks with a relatively small amount of data. Conclusion: TEMPO is a scalable and efficient approach that points toward reliable Video LLMs. Abstract: Video Large Language Models (Video LLMs) have achieved significant success by leveraging a two-stage paradigm: pretraining on large-scale video-text data for vision-language alignment, followed by supervised fine-tuning (SFT) for task-specific capabilities. However, existing approaches struggle with temporal reasoning due to weak temporal correspondence in the data and reliance on the next-token prediction paradigm during training. To address these limitations, we propose TEMPO (TEMporal Preference Optimization), a systematic framework that enhances Video LLMs' temporal reasoning capabilities through Direct Preference Optimization (DPO). To facilitate this, we introduce an automated preference data generation pipeline that systematically constructs preference pairs by selecting videos that are rich in temporal information, designing video-specific perturbation strategies, and finally evaluating model responses on clean and perturbed video inputs. Our temporal alignment features two key innovations: curriculum learning, which progressively increases perturbation difficulty to improve model robustness and adaptability; and "Pre-SFT Alignment", applying preference optimization before instruction tuning to prioritize fine-grained temporal comprehension. Extensive experiments demonstrate that our approach consistently improves Video LLM performance across multiple benchmarks with a relatively small set of self-generated DPO data. We further analyze the transferability of DPO data across architectures and the role of difficulty scheduling in optimization. Our findings highlight TEMPO as a scalable and efficient complement to SFT-based methods, paving the way for developing reliable Video LLMs.
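
For readers unfamiliar with DPO, the standard per-pair objective can be written in a few lines. The sketch below shows the generic DPO loss only; how TEMPO builds its preference pairs and schedules perturbation difficulty is not reproduced here, and the argument names and the default `beta` are assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under the
    trainable policy or the frozen reference model. `beta` is an illustrative
    default, not a value reported for TEMPO.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy and reference agree and falls as the policy assigns relatively more probability to the preferred response.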

KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications

Michael J Bommarito,Daniel Martin Katz,Jillian Bommarito

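Task: Develop the KL3M family of tokenizers specialized for legal, financial, and governmental text.

Motivation: Specialized tokenizers for professional domains remain understudied, and existing tokenizers are less efficient on professional documents.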

Details

Method: Introduce domain-specific BPE tokenizers and develop character-level BPE tokenizers for text correction tasks. Result: KL3M tokenizers use fewer tokens than GPT-4o and Llama3 on domain-specific documents and are more efficient on specialized terminology; the character-level tokenizers help with text correction tasks. Conclusion: KL3M tokenizers improve the processing efficiency of professional documents and support further research. Abstract: We present the KL3M tokenizers, a family of specialized tokenizers for legal, financial, and governmental text. Despite established work on tokenization, specialized tokenizers for professional domains remain understudied. Our paper offers two main contributions to this area. First, we introduce domain-specific BPE tokenizers for legal, financial, and governmental text. Our kl3m-004-128k-cased tokenizer uses 9-17% fewer tokens than GPT-4o and Llama3 for domain-specific documents, despite having a smaller vocabulary. For specialized terminology, our cased tokenizer is even more efficient, using up to 83% fewer tokens for legal terms and 39% fewer tokens for financial terms. Second, we develop character-level BPE tokenizers (4K, 8K, and 16K vocabulary sizes) for text correction tasks like OCR post-processing. These tokenizers keep consistent token boundaries between error-containing and correct text, making it easier for models to learn correction patterns. These tokenizers help professional applications by fitting more text in context windows, reducing computational needs, and preserving the meaning of domain-specific terms. Our analysis shows these efficiency gains directly benefit the processing of long legal and financial documents. We release all tokenizers and code through GitHub and Hugging Face to support further research in specialized tokenization.

Vision-Language Gradient Descent-driven All-in-One Deep Unfolding Networks

Haijin Zeng,Xiangming Wang,Yongyong Chen,Jingyong Su,Jie Liu

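Task: Propose a unified deep unfolding network framework (VLU-Net) for handling multiple image degradation types simultaneously.

Motivation: Existing Deep Unfolding Networks (DUNs) require manual selection of degradation matrices, which limits their adaptability across diverse degradation scenarios.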

Details

Method: Use a Vision-Language Model (VLM) refined on degraded image-text pairs to align image features with degradation descriptions, and integrate a VLM-based gradient estimation strategy into the PGD algorithm. Result: Outperforms existing methods by 3.74 dB on the SOTS dehazing dataset and 1.70 dB on the Rain100L deraining dataset. Conclusion: VLU-Net is an efficient, interpretable, unified framework that handles multiple degradation types simultaneously and outperforms existing methods. Abstract: Dynamic image degradations, including noise, blur and lighting inconsistencies, pose significant challenges in image restoration, often due to sensor limitations or adverse environmental conditions. Existing Deep Unfolding Networks (DUNs) offer stable restoration performance but require manual selection of degradation matrices for each degradation type, limiting their adaptability across diverse scenarios. To address this issue, we propose the Vision-Language-guided Unfolding Network (VLU-Net), a unified DUN framework for handling multiple degradation types simultaneously. VLU-Net leverages a Vision-Language Model (VLM) refined on degraded image-text pairs to align image features with degradation descriptions, selecting the appropriate transform for target degradation. By integrating an automatic VLM-based gradient estimation strategy into the Proximal Gradient Descent (PGD) algorithm, VLU-Net effectively tackles complex multi-degradation restoration tasks while maintaining interpretability. Furthermore, we design a hierarchical feature unfolding structure to enhance the VLU-Net framework, efficiently synthesizing degradation patterns across various levels. VLU-Net is the first all-in-one DUN framework and outperforms current leading one-by-one and all-in-one end-to-end methods by 3.74 dB on the SOTS dehazing dataset and 1.70 dB on the Rain100L deraining dataset.
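
Deep unfolding networks are built by unrolling iterations of an optimizer such as Proximal Gradient Descent. The toy sketch below shows the classical PGD iteration that such networks start from; it is not VLU-Net. The known diagonal degradation `a`, the L1 prior with its soft-thresholding proximal step, and all parameter values are assumptions chosen to keep the example small, whereas VLU-Net estimates the gradient term with a VLM.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v| (an L1 prior, assumed for illustration)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def proximal_gradient_descent(y, a, lam=0.1, step=0.5, iters=200):
    """Minimize 0.5 * sum((a_i * x_i - y_i)^2) + lam * sum(|x_i|).

    `a` is a known diagonal degradation operator; each iteration takes a
    gradient step on the data term followed by a proximal step on the prior.
    """
    x = [0.0] * len(y)
    for _ in range(iters):
        grad = [ai * (ai * xi - yi) for ai, xi, yi in zip(a, x, y)]
        x = [soft_threshold(xi - step * gi, step * lam)
             for xi, gi in zip(x, grad)]
    return x
```

In an unfolded network each loop iteration becomes one stage with learned components in place of the fixed gradient and proximal operators.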

CASE -- Condition-Aware Sentence Embeddings for Conditional Semantic Textual Similarity Measurement

Gaifan Zhang,Yi Zhou,Danushka Bollegala

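Task: Propose Condition-Aware Sentence Embeddings (CASE), a method for creating a sentence embedding under a given condition.

Motivation: The meaning of a sentence often depends on its context, and existing sentence embedding methods fall short at modifying an embedding conditioned on that context.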

Details

Method: Use a Large Language Model (LLM) to create condition embeddings and reduce their dimensionality with a supervised nonlinear projection. Result: CASE significantly outperforms existing Conditional Semantic Textual Similarity (C-STS) methods on a standard benchmark dataset; both subtracting the condition embedding and the dimensionality reduction improve performance. Conclusion: CASE is an efficient and accurate conditional sentence embedding method whose performance benefits from condition embedding subtraction and supervised dimensionality reduction. Abstract: The meaning conveyed by a sentence often depends on the context in which it appears. Despite the progress of sentence embedding methods, it remains unclear how to best modify a sentence embedding conditioned on its context. To address this problem, we propose Condition-Aware Sentence Embeddings (CASE), an efficient and accurate method to create an embedding for a sentence under a given condition. First, CASE creates an embedding for the condition using a Large Language Model (LLM), where the sentence influences the attention scores computed for the tokens in the condition during pooling. Next, a supervised nonlinear projection is learned to reduce the dimensionality of the LLM-based text embeddings. We show that CASE significantly outperforms previously proposed Conditional Semantic Textual Similarity (C-STS) methods on an existing standard benchmark dataset. We find that subtracting the condition embedding consistently improves the C-STS performance of LLM-based text embeddings. Moreover, we propose a supervised dimensionality reduction method that not only reduces the dimensionality of LLM-based embeddings but also significantly improves their performance.

Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model

Yingying Fan,Quanwei Yang,Kaisiyuan Wang,Hang Zhou,Yingying Li,Haocheng Feng,Yu Wu,Jingdong Wang

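Task: Propose a video reenactment framework focused on human-object interaction (HOI) that generates high-quality interaction videos via an adaptive Layout-instructed Diffusion model (Re-HOLD).

Motivation: Current digital human research concentrates on lip-syncing and body movement, which no longer meets industrial demand, while human video generation that supports interaction with real-world environments (e.g., objects) remains underexplored.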

Details

Method: Use an adaptive Layout-instructed Diffusion model (Re-HOLD) with specialized layout representations for hands and objects, an interactive textural enhancement module, and a layout-adjusting strategy. Result: Qualitative and quantitative evaluations show that the proposed framework significantly outperforms existing methods. Conclusion: Re-HOLD effectively addresses the challenges of HOI video generation, especially when objects vary markedly in size and shape. Abstract: Current digital human studies focusing on lip-syncing and body movement are no longer sufficient to meet the growing industrial demand, while human video generation techniques that support interacting with real-world environments (e.g., objects) have not been well investigated. Despite human hand synthesis already being an intricate problem, generating objects in contact with hands and their interactions presents an even more challenging task, especially when the objects exhibit obvious variations in size and shape. To cope with these issues, we present a novel video Reenactment framework focusing on Human-Object Interaction (HOI) via an adaptive Layout-instructed Diffusion model (Re-HOLD). Our key insight is to employ specialized layout representations for hands and objects, respectively. Such representations enable effective disentanglement of hand modeling and object adaptation to diverse motion sequences. To further improve the generation quality of HOI, we have designed an interactive textural enhancement module for both hands and objects by introducing two independent memory banks. We also propose a layout-adjusting strategy for the cross-object reenactment scenario to adaptively adjust unreasonable layouts caused by diverse object sizes during inference. Comprehensive qualitative and quantitative evaluations demonstrate that our proposed framework significantly outperforms existing methods. Project page: https://fyycs.github.io/Re-HOLD.

FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models

Mingyang Song,Mao Zheng,Zheng Li,Wenjie Yang,Xuan Luo,Yue Pan,Feng Zhang

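Task: Propose FastCuRL, an efficient curriculum reinforcement learning approach that accelerates the training of R1-like reasoning models and improves their performance on complex reasoning tasks.

Motivation: Address the low training efficiency of reinforcement learning on long chain-of-thought reasoning tasks while also improving model performance.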

Details

Method: Length-aware training data segmentation combined with context window extension training. Result: FastCuRL-1.5B-Preview surpasses DeepScaleR-1.5B-Preview on five datasets while using only 50% of the training steps, with all training completed on a single node with 8 GPUs. Conclusion: FastCuRL is an efficient, resource-saving curriculum reinforcement learning approach that improves both the training efficiency and the performance of reasoning models. Abstract: In this paper, we propose FastCuRL, a simple yet efficient Curriculum Reinforcement Learning approach with context window extending strategy to accelerate the reinforcement learning training efficiency for R1-like reasoning models while enhancing their performance in tackling complex reasoning tasks with long chain-of-thought rationales, particularly with a 1.5B parameter language model. FastCuRL consists of two main procedures: length-aware training data segmentation and context window extension training. Specifically, the former first splits the original training data into three different levels by the input prompt length, and then the latter leverages segmented training datasets with a progressively increasing context window length to train the reasoning model. Experimental results demonstrate that FastCuRL-1.5B-Preview surpasses DeepScaleR-1.5B-Preview across all five datasets (including MATH 500, AIME 2024, AMC 2023, Minerva Math, and OlympiadBench) while only utilizing 50% of training steps. Furthermore, all training stages for FastCuRL-1.5B-Preview are completed using just a single node with 8 GPUs.
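
The two procedures named in the abstract can be sketched as follows. This is a simplified illustration, not the released training code: whitespace token counts stand in for a real tokenizer, the length cut-offs and context window sizes are caller-supplied placeholders, and the one-to-one pairing of each length level with one context window is an assumption.

```python
def segment_by_prompt_length(prompts, short_max, long_min):
    """Split prompts into three levels by input length.

    Length is measured in whitespace-separated tokens here (an assumption);
    `short_max` and `long_min` are hypothetical cut-offs chosen by the caller.
    """
    levels = {"short": [], "medium": [], "long": []}
    for prompt in prompts:
        n_tokens = len(prompt.split())
        if n_tokens <= short_max:
            levels["short"].append(prompt)
        elif n_tokens < long_min:
            levels["medium"].append(prompt)
        else:
            levels["long"].append(prompt)
    return levels


def build_schedule(levels, context_windows):
    """Pair each level with a context window, in increasing window order.

    Returns a list of (level_name, context_window, prompts) training stages.
    """
    order = ("short", "medium", "long")
    windows = sorted(context_windows)
    return [(name, window, levels[name]) for name, window in zip(order, windows)]
```

Each returned stage would then drive one phase of reinforcement learning with the given maximum context length.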

HyperLoRA: Parameter-Efficient Adaptive Generation for Portrait Synthesis

Mengtian Li,Jinshu Chen,Wanquan Feng,Bingchuan Li,Fei Dai,Songtao Zhao,Qian He

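Task: Propose HyperLoRA, a parameter-efficient method for personalized portrait synthesis that combines the strong performance of LoRA with the zero-shot capability of adapter schemes.

Motivation: Person-wise fine-tuning methods (e.g., LoRA and DreamBooth) require training on individual samples, which is time- and resource-consuming and unstable, whereas adapter-based methods (e.g., IP-Adapter) support zero-shot inference but often produce results that lack naturalness and authenticity.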

Details

Method: Use an adaptive plug-in network to generate LoRA weights, together with a carefully designed network structure and training strategy. Result: Achieves zero-shot personalized portrait generation (supporting single or multiple image inputs) with high photorealism, fidelity, and editability. Conclusion: HyperLoRA combines the performance of LoRA with the zero-shot capability of adapter schemes, offering an efficient and high-quality solution for personalized portrait synthesis. Abstract: Personalized portrait synthesis, essential in domains like social entertainment, has recently made significant progress. Person-wise fine-tuning based methods, such as LoRA and DreamBooth, can produce photorealistic outputs but need training on individual samples, consuming time and resources and risking instability. Adapter based techniques such as IP-Adapter freeze the foundational model parameters and employ a plug-in architecture to enable zero-shot inference, but they often exhibit a lack of naturalness and authenticity, which are not to be overlooked in portrait synthesis tasks. In this paper, we introduce a parameter-efficient adaptive generation method, namely HyperLoRA, that uses an adaptive plug-in network to generate LoRA weights, merging the superior performance of LoRA with the zero-shot capability of adapter scheme. Through our carefully designed network structure and training strategy, we achieve zero-shot personalized portrait generation (supporting both single and multiple image inputs) with high photorealism, fidelity, and editability.

Efficient Intent-Based Filtering for Multi-Party Conversations Using Knowledge Distillation from LLMs

Reem Gody,Mohamed Abdelghaffar,Mohammed Jabreel,Ahmed Tawfik

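Task: Propose a cost-effective approach that filters conversation snippets by intent to reduce the processing load on large language models (LLMs).

Motivation: LLMs are powerful but resource-intensive, so more efficient solutions are needed to reduce computational cost.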

Details

Method: Use knowledge distillation from LLMs to build an intent-based filter by fine-tuning MobileBERT for multi-label intent classification. Result: The approach balances efficiency and performance and significantly reduces the overall operational cost of LLM processing. Conclusion: The proposed intent filter offers an efficient and economical solution for conversation processing in resource-constrained environments. Abstract: Large language models (LLMs) have showcased remarkable capabilities in conversational AI, enabling open-domain responses in chat-bots, as well as advanced processing of conversations like summarization, intent classification, and insights generation. However, these models are resource-intensive, demanding substantial memory and computational power. To address this, we propose a cost-effective solution that filters conversational snippets of interest for LLM processing, tailored to the target downstream application, rather than processing every snippet. In this work, we introduce an innovative approach that leverages knowledge distillation from LLMs to develop an intent-based filter for multi-party conversations, optimized for compute power constrained environments. Our method combines different strategies to create a diverse multi-party conversational dataset that is annotated with the target intents and is then used to fine-tune the MobileBERT model for multi-label intent classification. This model achieves a balance between efficiency and performance, effectively filtering conversation snippets based on their intents. By passing only the relevant snippets to the LLM for further processing, our approach significantly reduces overall operational costs depending on the intents and the data distribution as demonstrated in our experiments.
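
The filtering step itself is simple once a multi-label classifier has produced per-intent scores. The sketch below assumes those scores are already available as dictionaries of probabilities (standing in for the fine-tuned MobileBERT outputs); the function name, intent labels, and the 0.5 threshold are hypothetical.

```python
def filter_snippets(snippets, intent_scores, target_intents, threshold=0.5):
    """Keep only snippets whose predicted intents overlap the target intents.

    `intent_scores[i]` maps intent name -> probability for `snippets[i]`.
    A snippet may carry several intents at once (multi-label), so every
    intent scoring at or above `threshold` counts as predicted.
    """
    kept = []
    for snippet, scores in zip(snippets, intent_scores):
        predicted = {intent for intent, score in scores.items() if score >= threshold}
        if predicted & set(target_intents):
            kept.append(snippet)
    return kept
```

Only the returned snippets would be forwarded to the LLM, so the saving depends on how often target intents occur in the conversation stream.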

PE-CLIP: A Parameter-Efficient Fine-Tuning of Vision Language Models for Dynamic Facial Expression Recognition

Ibtissam Saadi,Abdenour Hadid,Douglas W. Cunningham,Abdelmalik Taleb-Ahmed,Yassin El Hillali

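Task: Propose PE-CLIP, a parameter-efficient fine-tuning framework for Dynamic Facial Expression Recognition (DFER).

Motivation: Existing vision-language models such as CLIP suffer in DFER from inefficient full fine-tuning, high complexity, and poor alignment between textual and visual representations, and they lack effective temporal modeling.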

Details

Method: PE-CLIP introduces two adapters, a GRU-based Temporal Dynamic Adapter (TDA) and a Shared Adapter (ShA), and integrates Multi-modal Prompt Learning (MaPLe) to improve alignment between modalities. Result: On the DFEW and FERV39K datasets, PE-CLIP achieves performance competitive with state-of-the-art methods while requiring fewer trainable parameters. Conclusion: PE-CLIP sets a new benchmark in resource-efficient DFER by balancing efficiency and accuracy. Abstract: Vision-Language Models (VLMs) like CLIP offer promising solutions for Dynamic Facial Expression Recognition (DFER) but face challenges such as inefficient full fine-tuning, high complexity, and poor alignment between textual and visual representations. Additionally, existing methods struggle with ineffective temporal modeling. To address these issues, we propose PE-CLIP, a parameter-efficient fine-tuning (PEFT) framework that adapts CLIP for DFER, significantly reducing trainable parameters while maintaining high accuracy. PE-CLIP introduces two specialized adapters: a Temporal Dynamic Adapter (TDA) and a Shared Adapter (ShA). The TDA is a GRU-based module with dynamic scaling that captures sequential dependencies while emphasizing informative temporal features and suppressing irrelevant variations. The ShA is a lightweight adapter that refines representations within both textual and visual encoders, ensuring consistency and efficiency. Additionally, we integrate Multi-modal Prompt Learning (MaPLe), introducing learnable prompts for visual and action unit-based textual inputs, enhancing semantic alignment between modalities and enabling efficient CLIP adaptation for dynamic tasks. We evaluate PE-CLIP on two benchmark datasets, DFEW and FERV39K, achieving competitive performance compared to state-of-the-art methods while requiring fewer trainable parameters. By balancing efficiency and accuracy, PE-CLIP sets a new benchmark in resource-efficient DFER. The source code of the proposed PE-CLIP will be publicly available at https://github.com/Ibtissam-SAADI/PE-CLIP .

Dancing with Critiques: Enhancing LLM Reasoning with Stepwise Natural Language Self-Critique

Yansi Li,Jiahao Xu,Tian Liang,Xingyu Chen,Zhiwei He,Qiuzhi Liu,Rui Wang,Zhuosheng Zhang,Zhaopeng Tu,Haitao Mi,Dong Yu

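Task: Propose a new inference-time scaling approach, stepwise natural language self-critique (PANEL), to strengthen the reasoning of large language models on complex tasks.

Motivation: Traditional methods evaluate reasoning steps with scalar reward signals, which lack detailed qualitative information about each step and thereby limit reasoning ability.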

Details

Method: PANEL generates natural language self-critiques as feedback to guide the step-level search process, retaining qualitative information. Result: On challenging reasoning benchmarks such as AIME and GPQA, PANEL significantly outperforms traditional scalar reward-based methods. Conclusion: PANEL needs no task-specific verifiers, is broadly applicable, and significantly improves reasoning performance. Abstract: Enhancing the reasoning capabilities of large language models (LLMs), particularly for complex tasks requiring multi-step logical deductions, remains a significant challenge. Traditional inference time scaling methods utilize scalar reward signals from process reward models to evaluate candidate reasoning steps, but these scalar rewards lack the nuanced qualitative information essential for understanding and justifying each step. In this paper, we propose a novel inference-time scaling approach, stepwise natural language self-critique (PANEL), which employs self-generated natural language critiques as feedback to guide the step-level search process. By generating rich, human-readable critiques for each candidate reasoning step, PANEL retains essential qualitative information, facilitating better-informed decision-making during inference. This approach bypasses the need for task-specific verifiers and the associated training overhead, making it broadly applicable across diverse tasks. Experimental results on challenging reasoning benchmarks, including AIME and GPQA, demonstrate that PANEL significantly enhances reasoning performance, outperforming traditional scalar reward-based methods. Our code is available at https://github.com/puddingyeah/PANEL to support and encourage future research in this promising field.

MagicColor: Multi-Instance Sketch Colorization

Yinhan Zhang,Yue Ma,Bingyuan Wang,Qifeng Chen,Zeyu Wang

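Task: Propose MagicColor, a diffusion-based framework for multi-instance sketch colorization that is automated and highly precise.

Motivation: The traditional multi-instance line art colorization workflow is inefficient and inaccurate, and existing generative methods struggle with the task because multi-instance paired data is lacking.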

Details

Method: Use a self-play training strategy to address the lack of training data, and introduce an instance guider and fine-grained color matching with edge loss to improve visual quality. Result: Experiments show that MagicColor outperforms existing methods in chromatic precision and automates colorization with zero manual adjustments. Conclusion: MagicColor performs multi-instance sketch colorization efficiently and accurately, giving novice users a convenient creative tool. Abstract: We present MagicColor, a diffusion-based framework for multi-instance sketch colorization. The production of multi-instance 2D line art colorization adheres to an industry-standard workflow, which consists of three crucial stages: the design of line art characters, the coloring of individual objects, and the refinement process. The artists are required to repeat the process of coloring each instance one by one, which is inaccurate and inefficient. Meanwhile, current generative methods fail to solve this task due to the challenge of multi-instance paired data collection. To tackle these challenges, we incorporate three technical designs to ensure precise character detail transcription and achieve multi-instance sketch colorization in a single forward pass. Specifically, we first propose the self-play training strategy to solve the lack of training data. Then we introduce an instance guider to feed the color of the instance. To achieve accurate color matching, we present fine-grained color matching with edge loss to enhance visual quality. Equipped with the proposed modules, MagicColor enables automatically transforming sketches into vividly-colored images with accurate consistency and multi-instance control. Experiments on our collected datasets show that our model outperforms existing methods regarding chromatic precision. Specifically, our model critically automates the colorization process with zero manual adjustments, so novice users can produce stylistically consistent artwork by providing reference instances and the original line art. Our code and additional details are available at https://yinhan-zhang.github.io/color

Multimodal Transformer Models for Turn-taking Prediction: Effects on Conversational Dynamics of Human-Agent Interaction during Cooperative Gameplay

Young-Ho Bae,Casey C. Bennett

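Task: Study multimodal turn-taking prediction in human-agent interaction (HAI), particularly in cooperative gaming environments.

Motivation: Refine the understanding of, and improve, conversational dynamics in spoken dialogue systems (SDSs) through model development and a user study.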

Details

Method: Propose a transformer-based deep learning model that integrates text, vision, audio, and in-game contextual data for real-time turn-taking prediction. Result: The model outperforms baseline models, reaching 87.3% accuracy and 83.0% macro F1, and the user study indicates that it improves the fluidity and naturalness of conversations. Conclusion: Multimodal turn-taking prediction can improve conversation quality and shows potential for contextually adaptive, responsive conversational agents. Abstract: This study investigates multimodal turn-taking prediction within human-agent interactions (HAI), particularly focusing on cooperative gaming environments. It comprises both model development and a subsequent user study, aiming to refine our understanding and improve conversational dynamics in spoken dialogue systems (SDSs). For the modeling phase, we introduce a novel transformer-based deep learning (DL) model that simultaneously integrates multiple modalities (text, vision, audio, and contextual in-game data) to predict turn-taking events in real-time. Our model employs a Crossmodal Transformer architecture to effectively fuse information from these diverse modalities, enabling more comprehensive turn-taking predictions. The model demonstrates superior performance compared to baseline models, achieving 87.3% accuracy and 83.0% macro F1 score. A human user study was then conducted to empirically evaluate the turn-taking DL model in an interactive scenario with a virtual avatar while playing the game "Don't Starve Together", comparing a control condition without turn-taking prediction (n=20) to an experimental condition with our model deployed (n=40). Both conditions included a mix of English and Korean speakers, since turn-taking cues are known to vary by culture. We then analyzed the interaction quality, examining aspects such as utterance counts, interruption frequency, and participant perceptions of the avatar. Results from the user study suggest that our multimodal turn-taking model not only enhances the fluidity and naturalness of human-agent conversations, but also maintains a balanced conversational dynamic without significantly altering dialogue frequency. The study provides in-depth insights into the influence of turn-taking abilities on user perceptions and interaction quality, underscoring the potential for more contextually adaptive and responsive conversational agents.

Center-guided Classifier for Semantic Segmentation of Remote Sensing Images

Wei Zhang,Mengting Ma,Yizhen Jiang,Rongrong Lian,Zhenkai Wu,Kangning Cui,Xiaowen Ma

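Task: Propose CenterSeg, a novel classifier that addresses large intraclass variance in semantic segmentation of remote sensing images.

Motivation: Remote sensing images exhibit large intraclass variance, and existing softmax-based classifiers suffer from non-direct supervision, inadequate modeling ability, and an opaque decision process.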

Details

Method: Design the CenterSeg classifier with multiple prototypes, direct supervision under the Grassmann manifold, and an interpretability strategy. Result: Experiments on three remote sensing segmentation datasets validate the model, which performs strongly while being lightweight and interpretable. Conclusion: CenterSeg addresses the shortcomings of existing classifiers, performs strongly, and is easy to implement. Abstract: Compared with natural images, remote sensing images (RSIs) have a unique characteristic, i.e., larger intraclass variance, which makes semantic segmentation for remote sensing images more challenging. Moreover, existing semantic segmentation models for remote sensing images usually employ a vanilla softmax classifier, which has three drawbacks: (1) non-direct supervision for the pixel representations during training; (2) inadequate modeling ability of parametric softmax classifiers under large intraclass variance; and (3) opaque process of classification decision. In this paper, we propose a novel classifier (called CenterSeg) customized for RSI semantic segmentation, which solves the abovementioned problems with multiple prototypes, direct supervision under the Grassmann manifold, and interpretability strategy. Specifically, for each class, our CenterSeg obtains local class centers by aggregating corresponding pixel features based on ground-truth masks, and generates multiple prototypes through hard attention assignment and momentum updating. In addition, we introduce the Grassmann manifold and constrain the joint embedding space of pixel features and prototypes based on two additional regularization terms. In particular, during inference, CenterSeg can further provide interpretability to the model by restricting the prototype as a sample of the training set. Experimental results on three remote sensing segmentation datasets validate the effectiveness of the model. Besides the superior performance, CenterSeg has the advantages of simplicity, lightweight, compatibility, and interpretability. Code is available at https://github.com/xwmaxwma/rssegmentation.
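
The prototype mechanism described in the abstract (class centers from ground-truth masks, hard assignment, momentum updating) can be sketched in a simplified form. This is an illustration rather than the CenterSeg code: features are plain Python lists, the hard assignment uses dot-product similarity as an assumed criterion, the Grassmann manifold constraints are omitted, and the momentum value is hypothetical.

```python
def class_center(features, labels, target_class):
    """Mean of all pixel features whose ground-truth label is `target_class`.

    Returns None when the class does not appear in this batch.
    """
    selected = [f for f, label in zip(features, labels) if label == target_class]
    if not selected:
        return None
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def update_prototypes(prototypes, center, momentum=0.9):
    """Hard-assign `center` to its most similar prototype (dot product) and
    move only that prototype toward the center with a momentum update.
    Modifies `prototypes` in place and returns the chosen prototype index."""
    best = max(range(len(prototypes)),
               key=lambda k: sum(p * c for p, c in zip(prototypes[k], center)))
    prototypes[best] = [momentum * p + (1.0 - momentum) * c
                        for p, c in zip(prototypes[best], center)]
    return best
```

Keeping several prototypes per class lets different modes of a high-variance class be represented separately instead of being averaged into one vector.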

The Application of MATEC (Multi-AI Agent Team Care) Framework in Sepsis Care

Andrew Cho,Jason M. Woo,Brian Shi,Aishwaryaa Udeshi,Jonathan S. H. Woo

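Task: Develop the Multi-AI Agent Team Care (MATEC) framework to assist under-resourced hospitals with sepsis care.

Motivation: Under-resourced or rural hospitals lack medical specialists and healthcare professionals, which can negatively affect outcomes for sepsis patients.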

Details

Method: Integrate an AI team of five doctor agents, four health professional agents, and one risk prediction model agent, with an additional 33 doctor agents available for consultation. Result: Ten attending physicians rated MATEC very useful (Median=4, P=0.01) and very accurate (Median=4, P<0.01). Conclusion: MATEC has potential value in under-resourced hospital settings for assisting medical professionals. Abstract: Under-resourced or rural hospitals have limited access to medical specialists and healthcare professionals, which can negatively impact patient outcomes in sepsis. To address this gap, we developed the MATEC (Multi-AI Agent Team Care) framework, which integrates a team of specialized AI agents for sepsis care. The sepsis AI agent team includes five doctor agents, four health professional agents, and a risk prediction model agent, with an additional 33 doctor agents available for consultations. Ten attending physicians at a teaching hospital evaluated this framework, spending approximately 40 minutes on the web-based MATEC application and participating in the 5-point Likert scale survey (rated from 1-unfavorable to 5-favorable). The physicians found the MATEC framework very useful (Median=4, P=0.01), and very accurate (Median=4, P<0.01). This pilot study demonstrates that a Multi-AI Agent Team Care framework (MATEC) can potentially be useful in assisting medical professionals, particularly in under-resourced hospital settings.

DroneSplat: 3D Gaussian Splatting for Robust 3D Reconstruction from In-the-Wild Drone Imagery

Jiadong Tang,Yu Gao,Dianyi Yang,Liqi Yan,Yufeng Yue,Yi Yang

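Task: Propose DroneSplat, a framework for robust 3D reconstruction from in-the-wild drone imagery.

Motivation: Address the challenges that dynamic distractors and limited view constraints pose for radiance field methods when reconstructing wild scenes.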

Details

Method: Adaptively adjust masking thresholds by combining local-global segmentation heuristics with statistical approaches, and enhance 3D Gaussian Splatting with multi-view stereo predictions and a voxel-guided optimization strategy. Result: DroneSplat outperforms both 3DGS and NeRF baselines on in-the-wild drone imagery. Conclusion: DroneSplat offers an effective solution for high-quality 3D reconstruction from in-the-wild drone imagery. Abstract: Drones have become essential tools for reconstructing wild scenes due to their outstanding maneuverability. Recent advances in radiance field methods have achieved remarkable rendering quality, providing a new avenue for 3D reconstruction from drone imagery. However, dynamic distractors in wild environments challenge the static scene assumption in radiance fields, while limited view constraints hinder the accurate capture of underlying scene geometry. To address these challenges, we introduce DroneSplat, a novel framework designed for robust 3D reconstruction from in-the-wild drone imagery. Our method adaptively adjusts masking thresholds by integrating local-global segmentation heuristics with statistical approaches, enabling precise identification and elimination of dynamic distractors in static scenes. We enhance 3D Gaussian Splatting with multi-view stereo predictions and a voxel-guided optimization strategy, supporting high-quality rendering under limited view constraints. For comprehensive evaluation, we provide a drone-captured 3D reconstruction dataset encompassing both dynamic and static scenes. Extensive experiments demonstrate that DroneSplat outperforms both 3DGS and NeRF baselines in handling in-the-wild drone imagery.

Integrating Personality into Digital Humans: A Review of LLM-Driven Approaches for Virtual Reality

Iago Alves Brito,Julia Soares Dollis,Fernanda Bufon Färber,Pedro Schindler Freire Brasil Ribeiro,Rafael Teixeira Sousa,Arlindo Rodrigues Galvão Filho

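Task: Explore how large language models (LLMs) can be combined with virtual reality (VR) environments to create more immersive and interactive digital humans.

Motivation: By combining the generative capabilities of LLMs with multimodal outputs such as facial expressions and gestures, virtual agents can simulate human personality and emotion, improving the user experience.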

Details

Method: Reviews zero-shot, few-shot, and fine-tuning approaches for giving digital humans nuanced personality traits. Result: Identifies the challenges of integrating LLM-driven personality traits into VR, including computational demands, latency, and the lack of standardized evaluation frameworks for multimodal interaction. Conclusion: By addressing these gaps, the review lays a foundation for applications in education, therapy, and gaming, and encourages interdisciplinary collaboration to redefine human-computer interaction in VR. Abstract: The integration of large language models (LLMs) into virtual reality (VR) environments has opened new pathways for creating more immersive and interactive digital humans. By leveraging the generative capabilities of LLMs alongside multimodal outputs such as facial expressions and gestures, virtual agents can simulate human-like personalities and emotions, fostering richer and more engaging user experiences. This paper provides a comprehensive review of methods for enabling digital humans to adopt nuanced personality traits, exploring approaches such as zero-shot, few-shot, and fine-tuning. Additionally, it highlights the challenges of integrating LLM-driven personality traits into VR, including computational demands, latency issues, and the lack of standardized evaluation frameworks for multimodal interactions. By addressing these gaps, this work lays a foundation for advancing applications in education, therapy, and gaming, while fostering interdisciplinary collaboration to redefine human-computer interaction in VR.

Distilling Monocular Foundation Model for Fine-grained Depth Completion

Yingping Liang,Yutao Hu,Wenqi Shao,Ying Fu

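Task: Use a two-stage knowledge distillation framework in which monocular foundation models provide dense supervision for depth completion.

Motivation: Sparse LiDAR input limits the availability of dense supervision, which is essential for learning detailed geometric features.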

Details

Method: A two-stage knowledge distillation framework: the first stage is a pre-training strategy that generates diverse training data from natural images, and the second stage fine-tunes on real-world datasets with a scale- and shift-invariant loss (SSI Loss). Result: Ranks first on the KITTI benchmark with state-of-the-art performance. Conclusion: The two-stage distillation framework lets depth completion models exploit the strengths of monocular foundation models and improves their performance. Abstract: Depth completion involves predicting dense depth maps from sparse LiDAR inputs. However, sparse depth annotations from sensors limit the availability of dense supervision, which is necessary for learning detailed geometric features. In this paper, we propose a two-stage knowledge distillation framework that leverages powerful monocular foundation models to provide dense supervision for depth completion. In the first stage, we introduce a pre-training strategy that generates diverse training data from natural images, which distills geometric knowledge to depth completion. Specifically, we simulate LiDAR scans by utilizing monocular depth and mesh reconstruction, thereby creating training data without requiring ground-truth depth. Besides, monocular depth estimation suffers from inherent scale ambiguity in real-world settings. To address this, in the second stage, we employ a scale- and shift-invariant loss (SSI Loss) to learn real-world scales when fine-tuning on real-world datasets. Our two-stage distillation framework enables depth completion models to harness the strengths of monocular foundation models. Experimental results demonstrate that models trained with our two-stage distillation framework achieve state-of-the-art performance, ranking first place on the KITTI benchmark. Code is available at https://github.com/Sharpiless/DMD3C
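
To make the idea of a scale- and shift-invariant loss concrete, the sketch below fits a scale s and shift t by least squares before measuring the residual, so a prediction that differs from the target only by an affine transform incurs zero loss. This is the commonly used least-squares alignment variant, chosen here as an assumption; the paper's exact SSI Loss may differ, and the function name `ssi_loss` is hypothetical.

```python
def ssi_loss(pred, target):
    """Scale- and shift-invariant L1 loss over paired depth values.

    Assumed formulation (not taken from the paper): fit scale s and
    shift t minimizing sum((s * pred + t - target) ** 2) in closed form,
    then return the mean absolute residual after alignment.
    """
    n = len(pred)
    if n == 0 or n != len(target):
        raise ValueError("pred and target must be non-empty and equal length")
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    # Closed-form least squares: s = cov(pred, target) / var(pred).
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (g - mean_t) for p, g in zip(pred, target))
    s = cov / var_p if var_p > 0 else 0.0
    t = mean_t - s * mean_p
    return sum(abs(s * p + t - g) for p, g in zip(pred, target)) / n


# A prediction that is an affine transform of the target has (near) zero loss.
relative_depth = [1.0, 2.0, 3.0, 4.0]
metric_depth = [2.0 * d + 5.0 for d in relative_depth]
print(ssi_loss(relative_depth, metric_depth))
```

With sparse LiDAR targets, the lists would hold only the pixels that have a valid depth measurement.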

Improving Interactive Diagnostic Ability of a Large Language Model Agent Through Clinical Experience Learning

Zhoujian Sun,Ziyi Liu,Cheng Luo,Jiebin Chu,Zhengxing Huang

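Task: Investigate the performance degradation of large language models (LLMs) in medical diagnosis and propose a solution.

Motivation: LLMs perform well in medical diagnosis, but their performance drops markedly in interactive diagnostic settings; the causes need to be understood and remedied.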

Details

Method: Develops a plug-and-play method enhanced (PPME) LLM agent that uses more than 3.5 million electronic medical records and trains initial-diagnosis and history-inquiry models with supervised and reinforcement learning. Result: The PPME LLM improves on baselines by more than 30% in interactive diagnostic scenarios, and its final diagnostic accuracy approaches the level achieved with complete clinical data. Conclusion: The PPME LLM shows potential for developing autonomous diagnostic systems, but further validation studies are needed. Abstract: Recent advances in large language models (LLMs) have shown promising results in medical diagnosis, with some studies indicating superior performance compared to human physicians in specific scenarios. However, the diagnostic capabilities of LLMs are often overestimated, as their performance significantly deteriorates in interactive diagnostic settings that require active information gathering. This study investigates the underlying mechanisms behind the performance degradation phenomenon and proposes a solution. We identified that the primary deficiency of LLMs lies in the initial diagnosis phase, particularly in information-gathering efficiency and initial diagnosis formation, rather than in the subsequent differential diagnosis phase. To address this limitation, we developed a plug-and-play method enhanced (PPME) LLM agent, leveraging over 3.5 million electronic medical records from Chinese and American healthcare facilities. Our approach integrates specialized models for initial disease diagnosis and inquiry into the history of the present illness, trained through supervised and reinforcement learning techniques. The experimental results indicate that the PPME LLM achieved over 30% improvement compared to baselines. The final diagnostic accuracy of the PPME LLM in interactive diagnostic scenarios approached levels comparable to those achieved using complete clinical data. These findings suggest a promising potential for developing autonomous diagnostic systems, although further validation studies are needed.

ARFlow: Human Action-Reaction Flow Matching with Physical Guidance

Wentao Jiang,Jingya Wang,Haotao Lu,Kaiyang Ji,Baoxiong Jia,Siyuan Huang,Ye Shi

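Task: Propose Action-Reaction Flow Matching (ARFlow), a framework for human action-reaction synthesis that maps actions directly to reactions.

Motivation: Diffusion-based models for interaction synthesis rely on complex noise-to-reaction generators with intricate conditional mechanisms, and their generated motions frequently violate physical constraints.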

Details

Method: Establishes direct action-to-reaction mappings via flow matching, using an x1-prediction method that outputs human motions directly rather than velocity fields, together with a training-free, gradient-based physical guidance mechanism applied during sampling. Result: On the NTU120 and Chi3D datasets, ARFlow outperforms existing methods in Fréchet Inception Distance and motion diversity and significantly reduces body collisions, as measured by the proposed Intersection Volume and Intersection Frequency metrics. Conclusion: Direct action-to-reaction flow matching with physical guidance improves both the quality and the physical plausibility of synthesized reactions. Abstract: Human action-reaction synthesis, a fundamental challenge in modeling causal human interactions, plays a critical role in applications ranging from virtual reality to social robotics. While diffusion-based models have demonstrated promising performance, they exhibit two key limitations for interaction synthesis: reliance on complex noise-to-reaction generators with intricate conditional mechanisms, and frequent physical violations in generated motions. To address these issues, we propose Action-Reaction Flow Matching (ARFlow), a novel framework that establishes direct action-to-reaction mappings, eliminating the need for complex conditional mechanisms. Our approach introduces two key innovations: an x1-prediction method that directly outputs human motions instead of velocity fields, enabling explicit constraint enforcement; and a training-free, gradient-based physical guidance mechanism that effectively prevents body penetration artifacts during sampling. Extensive experiments on NTU120 and Chi3D datasets demonstrate that ARFlow not only outperforms existing methods in terms of Fréchet Inception Distance and motion diversity but also significantly reduces body collisions, as measured by our new Intersection Volume and Intersection Frequency metrics.
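
The x1-prediction idea can be illustrated with a toy sampler. The sketch assumes the standard linear flow-matching path x_t = (1 - t) * x0 + t * x1, with x0 standing for the action and x1 for the reaction; the path choice, the Euler integrator, and all function names are assumptions for illustration, not ARFlow's implementation. A model that predicts x1 directly still yields a velocity, because along this path v = (x1 - x_t) / (1 - t).

```python
def velocity_from_x1(x_t, x1_hat, t):
    """Recover the velocity implied by an x1 prediction on the linear path."""
    # On x_t = (1 - t) * x0 + t * x1 we have v = x1 - x0 = (x1 - x_t) / (1 - t).
    return [(b - a) / (1.0 - t) for a, b in zip(x_t, x1_hat)]


def sample(x0, predict_x1, steps=10):
    """Euler integration from the action x0 (t = 0) towards the reaction (t = 1).

    predict_x1(x_t, t) is a hypothetical stand-in for a trained network
    that outputs the reaction motion directly.
    """
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so (1 - t) is never zero
        x1_hat = predict_x1(x, t)
        v = velocity_from_x1(x, x1_hat, t)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


# Toy check with an oracle predictor that always returns the true reaction.
action = [0.0, 1.0, 2.0]
reaction = [1.0, -1.0, 0.5]
print(sample(action, lambda x_t, t: reaction))
```

Because the intermediate prediction `x1_hat` is itself a motion rather than a velocity, constraints such as a penetration penalty could be evaluated on it at every step, which is the property the abstract attributes to x1-prediction.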

Human-Centered AI in Multidisciplinary Medical Discussions: Evaluating the Feasibility of a Chat-Based Approach to Case Assessment

Shinnosuke Sawano,Satoshi Kodera

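Task: Investigate the feasibility of a human-centered AI chat platform on which medical specialists collaboratively assess complex cases.

Motivation: For patients with cardiovascular disease and multimorbidity, explore whether an AI platform can improve the efficiency of collaboration and allow discussion content to be quantified.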

Details

Method: Simulated cases were constructed and discussed with physicians in a chat application to evaluate the AI's effect on summarization time and on discussion content. Result: AI significantly reduced summarization time (79.98% on average); the hallucination rate averaged 3.62% (harmful hallucinations 0.49%); and multidisciplinary assessments expressed medical knowledge in a more complex form. Conclusion: AI-assisted summarization significantly reduces discussion time while maintaining structured knowledge representation, supporting the feasibility of AI-assisted chat-based discussion for multidisciplinary medical decision-making. Abstract: In this study, we investigate the feasibility of using a human-centered artificial intelligence (AI) chat platform where medical specialists collaboratively assess complex cases. As the target population for this platform, we focus on patients with cardiovascular diseases who are in a state of multimorbidity, that is, suffering from multiple chronic conditions. We evaluate simulated cases with multiple diseases using a chat application by collaborating with physicians to assess feasibility, efficiency gains through AI utilization, and the quantification of discussion content. We constructed simulated cases based on past case reports, medical error reports, and complex cases of cardiovascular diseases experienced by the physicians. The analysis of discussions across five simulated cases demonstrated a significant reduction in the time required for summarization using AI, with an average reduction of 79.98%. Additionally, we examined hallucination rates in AI-generated summaries used in multidisciplinary medical discussions. The overall hallucination rate ranged from 1.01% to 5.73%, with an average of 3.62%, whereas the harmful hallucination rate varied from 0.00% to 2.09%, with an average of 0.49%. Furthermore, morphological analysis demonstrated that multidisciplinary assessments enabled a more complex and detailed representation of medical knowledge compared with single physician assessments. We examined structural differences between multidisciplinary and single physician assessments using centrality metrics derived from the knowledge graph. In this study, we demonstrated that AI-assisted summarization significantly reduced the time required for medical discussions while maintaining structured knowledge representation. These findings can support the feasibility of AI-assisted chat-based discussions as a human-centered approach to multidisciplinary medical decision-making.

EasyRobust: A Comprehensive and Easy-to-use Toolkit for Robust and Generalized Vision

Xiaofeng Mao,Yuefeng Chen,Rong Zhang,Hui Xue,Zhao Li,Hang Su

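Task: Develop EasyRobust, an easy-to-use toolkit for training, evaluating, and analyzing robust vision models.

Motivation: Address the robustness problems that adversarial attacks and data distribution shifts cause for deep neural networks in computer vision, and advance research on model robustness.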

Details

Method: Develops the EasyRobust toolkit, which supports training and evaluation for both adversarial and non-adversarial robustness and provides accurate evaluation through image classification benchmarks. Result: EasyRobust has been open-sourced and supports research on adversarial and non-adversarial robustness. Conclusion: EasyRobust helps train practically robust models, narrowing the gap between human and machine vision and promoting academic and industrial progress. Abstract: Deep neural networks (DNNs) have shown great promise in computer vision tasks. However, machine vision achieved by DNNs cannot be as robust as human perception. Adversarial attacks and data distribution shifts are known as two major scenarios that degrade machine performance and hinder the wide deployment of machines "in the wild". To remove these obstacles and facilitate research on model robustness, we develop EasyRobust, a comprehensive and easy-to-use toolkit for training, evaluation and analysis of robust vision models. EasyRobust targets two types of robustness: 1) Adversarial robustness enables the model to defend against malicious inputs crafted by worst-case perturbations, also known as adversarial examples; 2) Non-adversarial robustness enhances the model performance on natural test images with corruptions or distribution shifts. Thorough benchmarks on image classification enable EasyRobust to provide an accurate robustness evaluation of vision models. We hope EasyRobust can help train practically robust models and promote academic and industrial progress in closing the gap between human and machine vision. Code and models of EasyRobust have been open-sourced at https://github.com/alibaba/easyrobust.

Human Preferences for Constructive Interactions in Language Model Alignment

Yara Kyrychenko,Jon Roozenbeek,Brandon Davidson,Sander van der Linden,Ramit Debnath

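Task: Study how large language models (LLMs) reflect human preferences in multicultural conversations and what this means for constructive dialogue.

Motivation: As LLMs enter the mainstream, it is critical that they foster constructive dialogue rather than exacerbate societal divisions.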

Details

Method: Uses a multicultural alignment dataset of more than 7,500 conversations from individuals in 74 countries to analyze the linguistic attributes linked to constructive interactions in human preference data. Result: Users generally preferred well-reasoned and nuanced responses and rejected those high in personal storytelling; users who believed AI should reflect their values placed more weight on curiosity than on reasoning. LLMs mirrored the linguistic attributes of user queries, including toxicity, so users could set the tone of the conversation. Conclusion: User preferences and conversational tone significantly shape how constructive LLM dialogue is, with important implications for model alignment. Abstract: As large language models (LLMs) enter the mainstream, aligning them to foster constructive dialogue rather than exacerbate societal divisions is critical. Using an individualized and multicultural alignment dataset of over 7,500 conversations of individuals from 74 countries engaging with 21 LLMs, we examined how linguistic attributes linked to constructive interactions are reflected in human preference data used for training AI. We found that users consistently preferred well-reasoned and nuanced responses while rejecting those high in personal storytelling. However, users who believed that AI should reflect their values tended to place less preference on reasoning in LLM responses and more on curiosity. Encouragingly, we observed that users could set the tone for how constructive their conversation would be, as LLMs mirrored linguistic attributes, including toxicity, in user queries.

GeoT: Geometry-guided Instance-dependent Transition Matrix for Semi-supervised Tooth Point Cloud Segmentation

Weihao Yu,Xiaoqing Guo,Chenxin Li,Yifan Liu,Yixuan Yuan

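Task: Achieve fine-grained segmentation of tooth point clouds from intra-oral scans.

Motivation: Dental annotation is labor-intensive, so much data remains unlabeled, which drives interest in semi-supervised approaches. A primary challenge for existing semi-supervised medical segmentation methods is the noise in pseudo labels generated for unlabeled data.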

Details

Method: Proposes GeoT, the first framework to use an instance-dependent transition matrix (IDTM) to explicitly model pseudo-label noise, with IDTM estimation guided by tooth geometric priors through point-level geometric regularization (PLGR) and class-level geometric smoothing (CLGS). Result: Experiments on the public Teeth3DS dataset and a private dataset show that the method makes full use of unlabeled data and, with only 20% of the labeled data, reaches performance comparable to fully supervised methods. Conclusion: Through geometric priors and IDTM modeling, GeoT effectively addresses pseudo-label noise in semi-supervised tooth segmentation and markedly improves segmentation performance. Abstract: Achieving meticulous segmentation of tooth point clouds from intra-oral scans stands as an indispensable prerequisite for various orthodontic applications. Given the labor-intensive nature of dental annotation, a significant amount of data remains unlabeled, driving increasing interest in semi-supervised approaches. One primary challenge of existing semi-supervised medical segmentation methods lies in noisy pseudo labels generated for unlabeled data. To address this challenge, we propose GeoT, the first framework that employs an instance-dependent transition matrix (IDTM) to explicitly model noise in pseudo labels for semi-supervised dental segmentation. Specifically, to handle the extensive solution space of IDTM arising from tens of thousands of dental points, we introduce tooth geometric priors through two key components: point-level geometric regularization (PLGR) to enhance consistency between point adjacency relationships in 3D and IDTM spaces, and class-level geometric smoothing (CLGS) to leverage the fixed spatial distribution of tooth categories for optimal IDTM estimation. Extensive experiments performed on the public Teeth3DS dataset and a private dataset demonstrate that our method can make full utilization of unlabeled data to facilitate segmentation, achieving performance comparable to fully supervised methods with only 20% of the labeled data.
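
For readers unfamiliar with transition matrices in label-noise modeling, the sketch below shows the standard formulation that an instance-dependent transition matrix builds on: for one point, entry T[i][j] is the probability that a point whose clean class is i receives pseudo label j, and the noisy-label distribution is the clean posterior pushed through T. The three-class example and the function name `noisy_posterior` are hypothetical; GeoT's estimator for T and its geometric regularizers (PLGR, CLGS) are not reproduced here.

```python
def noisy_posterior(clean_probs, transition):
    """Map a clean class posterior to the pseudo-label distribution.

    transition[i][j] = P(pseudo label = j | clean label = i, this point).
    Each row must sum to 1. In an instance-dependent setting every point
    has its own matrix; here a single point is shown.
    """
    num_classes = len(clean_probs)
    return [
        sum(clean_probs[i] * transition[i][j] for i in range(num_classes))
        for j in range(num_classes)
    ]


# Hypothetical point: class 0 is mislabeled as class 1 10% of the time,
# class 1 as class 0 20% of the time, and class 2 is never confused.
clean = [0.7, 0.2, 0.1]
T = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
print(noisy_posterior(clean, T))
```

Training against pseudo labels through such a matrix lets the loss compare the noisy prediction with the noisy label while the underlying clean posterior remains the quantity of interest.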

LLMs, Virtual Users, and Bias: Predicting Any Survey Question Without Human Data

Enzo Sinacola,Arnault Pachot,Thierry Petit

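Task: Use large language models (LLMs) to create virtual populations that answer survey questions, predicting outcomes comparable to human responses.

Motivation: Explore LLMs as an alternative to traditional survey methods that could improve efficiency and reduce costs.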

Details

Method: Compares the predictive performance of several LLMs (GPT-4o, GPT-3.5, Claude 3.5-Sonnet, and Llama and Mistral models) with a traditional Random Forests algorithm. Result: LLMs are competitive overall and need no additional training data, but show biases when predicting for certain religious and population groups; Random Forests perform better when sufficient data is available; removing censorship mechanisms from LLMs significantly improves predictive accuracy. Conclusion: Bias and censorship mechanisms in LLMs need to be addressed to improve their reliability and fairness in public opinion research. Abstract: Large Language Models (LLMs) offer a promising alternative to traditional survey methods, potentially enhancing efficiency and reducing costs. In this study, we use LLMs to create virtual populations that answer survey questions, enabling us to predict outcomes comparable to human responses. We evaluate several LLMs, including GPT-4o, GPT-3.5, Claude 3.5-Sonnet, and versions of the Llama and Mistral models, comparing their performance to that of a traditional Random Forests algorithm using demographic data from the World Values Survey (WVS). LLMs demonstrate competitive performance overall, with the significant advantage of requiring no additional training data. However, they exhibit biases when predicting responses for certain religious and population groups, underperforming in these areas. On the other hand, Random Forests demonstrate stronger performance than LLMs when trained with sufficient data. We observe that removing censorship mechanisms from LLMs significantly improves predictive accuracy, particularly for underrepresented demographic segments where censored models struggle. These findings highlight the importance of addressing biases and reconsidering censorship approaches in LLMs to enhance their reliability and fairness in public opinion research.

Instant Gaussian Stream: Fast and Generalizable Streaming of Dynamic Scene Reconstruction via Gaussian Splatting

Jinbo Yan,Rui Peng,Zhiyan Wang,Luyang Tang,Jiayu Yang,Jie Liang,Jiahao Wu,Ronggang Wang

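Task: Propose Instant Gaussian Stream (IGS), a fast and generalizable streaming framework for building free-viewpoint videos.

Motivation: Current streaming methods suffer from high per-frame reconstruction time (more than 10 seconds) and error accumulation, which limits their broader application.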

Details

Method: Introduces an anchor-driven Gaussian motion network and a key-frame-guided streaming strategy to generate Gaussian motion quickly and reduce error accumulation. Result: Achieves an average per-frame reconstruction time of 2s+ with improved view synthesis quality. Conclusion: IGS improves on existing methods in speed and generalization and suits streaming reconstruction of temporally complex scenes. Abstract: Building Free-Viewpoint Videos in a streaming manner offers the advantage of rapid responsiveness compared to offline training methods, greatly enhancing user experience. However, current streaming approaches face challenges of high per-frame reconstruction time (10s+) and error accumulation, limiting their broader application. In this paper, we propose Instant Gaussian Stream (IGS), a fast and generalizable streaming framework, to address these issues. First, we introduce a generalized Anchor-driven Gaussian Motion Network, which projects multi-view 2D motion features into 3D space, using anchor points to drive the motion of all Gaussians. This generalized Network generates the motion of Gaussians for each target frame in the time required for a single inference. Second, we propose a Key-frame-guided Streaming Strategy that refines each key frame, enabling accurate reconstruction of temporally complex scenes while mitigating error accumulation. We conducted extensive in-domain and cross-domain evaluations, demonstrating that our approach can achieve streaming with an average per-frame reconstruction time of 2s+, alongside an enhancement in view synthesis quality.

Scalable Evaluation of Online Moderation Strategies via Synthetic Simulations

Dimitris Tsirmpas,Ion Androutsopoulos,John Pavlopoulos

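Task: Evaluate the effectiveness of six online moderation strategies based on large language models (LLMs).

Motivation: Because suitable datasets are lacking and large-scale experiments with human participants are hard to organize, no large-scale study has evaluated the effectiveness of different online moderation strategies.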

Details

Method: Proposes a methodology for synthetic experiments run by LLMs without human participants and evaluates six moderation configurations, covering real-life strategies, baselines, and a new strategy inspired by reinforcement learning. Result: The proposed RL-inspired strategy significantly outperforms established moderation guidelines and out-of-the-box LLM moderation; smaller LLMs generate more varied discussions. Conclusion: Synthetic experiments, the open-source tool SynDisco, and the VMD dataset provide new methods and resources for research on online moderation. Abstract: Despite the ever-growing importance of online moderation, there has been no large-scale study evaluating the effectiveness of alternative moderation strategies. This is largely due to the lack of appropriate datasets, and the difficulty of getting human discussants, moderators, and evaluators involved in multiple experiments. In this paper, we propose a methodology for leveraging synthetic experiments performed exclusively by Large Language Models (LLMs) to initially bypass the need for human participation in experiments involving online moderation. We evaluate six LLM moderation configurations: two currently used real-life moderation strategies (guidelines issued for human moderators for online moderation and real-life facilitation), two baseline strategies (guidelines elicited for LLM alignment work, and LLM moderation with minimal prompting), a baseline with no moderator at all, and our own proposed strategy inspired by a Reinforcement Learning (RL) formulation of the problem. We find that our own moderation strategy significantly outperforms established moderation guidelines, as well as out-of-the-box LLM moderation. We also find that smaller LLMs, with less intensive instruction-tuning, can create more varied discussions than larger models. In order to run these experiments, we create and release an efficient, purpose-built, open-source Python framework, dubbed "SynDisco", to easily simulate hundreds of discussions using LLM user-agents and moderators. Additionally, we release the Virtual Moderation Dataset (VMD), a large dataset of LLM-generated and LLM-annotated discussions, generated by three families of open-source LLMs, accompanied by an exploratory analysis of the dataset.

Token Dynamics: Towards Efficient and Dynamic Video Token Representation for Video Large Language Models

Haichao Zhang,Zhuowei Li,Dimitris Metaxas,Yun Fu

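Task: Propose Token Dynamics, a new video representation framework that represents long video sequences with a minimal number of tokens.

Motivation: Existing token reduction techniques, such as token pruning and merging, disrupt spatial-temporal positional embeddings and fail to balance computational efficiency against token count, which limits their use in scenarios requiring extreme token compression, such as video large language models.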

Details

Method: Separates visual embeddings from grid-level motion information to build a concise token base and a token dynamics map, and introduces a cross-dynamics attention mechanism that preserves compactness and spatial-temporal integrity. Result: Experiments show the token count reduced to 0.07% of the original with only a 1.13% drop in performance. Conclusion: The method offers lower theoretical complexity, fewer tokens, and higher throughput, providing an efficient solution for video LLMs. Abstract: Token-based video representation has emerged as a promising approach for enabling large language models to interpret video content. However, existing token reduction techniques, such as token pruning and token merging, often disrupt essential spatial-temporal positional embeddings, failing to adequately balance computational efficiency with fewer tokens. Consequently, these methods result in relatively lengthy token sequences, limiting their applicability in scenarios requiring extreme token compression, such as video large language models. In this paper, we introduce the novel task of extreme short token reduction, aiming to represent extensive video sequences with a minimal number of tokens. To address this challenge, we propose Token Dynamics, a new video representation framework that dynamically reduces token count while preserving spatial-temporal coherence. Specifically, we disentangle video representations by separating visual embeddings from grid-level motion information, structuring them into: 1. a concise token base, created by clustering tokens that describe object-level content; 2. a token dynamics map, capturing detailed spatial-temporal motion patterns across grids. Furthermore, we introduce a cross-dynamics attention mechanism that integrates motion features into the token base without increasing token length, thereby maintaining compactness and spatial-temporal integrity. The experiments demonstrate a reduction of token count to merely 0.07% of the original tokens, with only a minor performance drop of 1.13%. Additionally, we propose two novel subtasks within extreme token reduction (fixed-length and adaptive-length compression), both effectively representing long token sequences for video-language tasks. Our method offers significantly lower theoretical complexity, fewer tokens, and enhanced throughput, thus providing an efficient solution for video LLMs.

Earthquake Response Analysis with AI

Deep Patel,Panthadeep Bhattacharjee,Amit Reza,Priodyuti Pradhan

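Task: Explore the potential of Twitter data for earthquake response analysis.

Motivation: A timely and effective response to natural disasters such as earthquakes is crucial for reducing damage and saving lives, and microblogging platforms such as Twitter have become real-time information sources for such events.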

Details

Method: Develops a machine learning (ML) framework incorporating natural language processing (NLP) techniques to extract and analyze relevant information from tweets posted during earthquakes, focusing on extracting location data to identify affected areas, generating severity maps, and displaying the information with WebGIS. Result: The analysis can give emergency responders, government agencies, humanitarian organizations, and NGOs useful information for improving disaster response strategies and resource allocation. Conclusion: Twitter data combined with an ML framework can support earthquake response and improve the efficiency and effectiveness of disaster management. Abstract: A timely and effective response is crucial to minimize damage and save lives during natural disasters like earthquakes. Microblogging platforms, particularly Twitter, have emerged as valuable real-time information sources for such events. This work explores the potential of leveraging Twitter data for earthquake response analysis. We develop a machine learning (ML) framework by incorporating natural language processing (NLP) techniques to extract and analyze relevant information from tweets posted during earthquake events. The approach primarily focuses on extracting location data from tweets to identify affected areas, generating severity maps, and utilizing WebGIS to display valuable information. The insights gained from this analysis can aid emergency responders, government agencies, humanitarian organizations, and NGOs in enhancing their disaster response strategies and facilitating more efficient resource allocation during earthquake events.

Enabling Versatile Controls for Video Diffusion Models

Xu Zhang,Hao Zhou,Haoming Qin,Xiaobin Lu,Jiaxing Yan,Guanzhong Wang,Zeyu Chen,Yi Liu

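Task: Propose VCtrl (also termed PP-VCtrl), a framework for fine-grained spatiotemporal control over pre-trained video diffusion models.

Motivation: Address the lack of control over fine-grained spatiotemporal attributes in text-to-video generation.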

Details

Method: Integrates diverse user-specified control signals (such as Canny edges, segmentation masks, and human keypoints) through a generalizable conditional module, with a unified control signal encoding pipeline and a sparse residual connection mechanism. Result: Experiments and human evaluations show that VCtrl enhances controllability and generation quality. Conclusion: VCtrl provides flexible and precise control for video generation; the code and pre-trained models are open-sourced. Abstract: Despite substantial progress in text-to-video generation, achieving precise and flexible control over fine-grained spatiotemporal attributes remains a significant unresolved challenge in video generation research. To address these limitations, we introduce VCtrl (also termed PP-VCtrl), a novel framework designed to enable fine-grained control over pre-trained video diffusion models in a unified manner. VCtrl integrates diverse user-specified control signals, such as Canny edges, segmentation masks, and human keypoints, into pretrained video diffusion models via a generalizable conditional module capable of uniformly encoding multiple types of auxiliary signals without modifying the underlying generator. Additionally, we design a unified control signal encoding pipeline and a sparse residual connection mechanism to efficiently incorporate control representations. Comprehensive experiments and human evaluations demonstrate that VCtrl effectively enhances controllability and generation quality. The source code and pre-trained models are publicly available and implemented using the PaddlePaddle framework at http://github.com/PaddlePaddle/PaddleMIX/tree/develop/ppdiffusers/examples/ppvctrl.

EmpathyAgent: Can Embodied Agents Conduct Empathetic Actions?

Xinyan Chen,Jiaxin Ge,Hongming Dai,Qiang Zhou,Qiuxuan Feng,Jingtong Hu,Yizhou Wang,Jiaming Liu,Shanghang Zhang

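Task: Evaluate and enhance agents' empathetic actions across diverse scenarios.

Motivation: Existing work overlooks whether agents can understand empathetic needs and carry out empathetic behaviors, even though empathy is fundamental to human interaction.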

Details

Method: Introduces the EmpathyAgent benchmark, containing 10,000 multimodal samples with empathetic task plans, and proposes an empathy-specific evaluation suite. Result: Current models perform poorly on empathetic actions, but a Llama3-8B model trained on EmpathyAgent shows potential. Conclusion: A standard benchmark for evaluating empathetic actions should advance research on empathetic agents. Abstract: Empathy is fundamental to human interactions, yet it remains unclear whether embodied agents can provide human-like empathetic support. Existing works have studied agents' task-solving and social-interaction abilities, but whether agents can understand empathetic needs and conduct empathetic behaviors remains overlooked. To address this, we introduce EmpathyAgent, the first benchmark to evaluate and enhance agents' empathetic actions across diverse scenarios. EmpathyAgent contains 10,000 multimodal samples with corresponding empathetic task plans and three different challenges. To systematically evaluate the agents' empathetic actions, we propose an empathy-specific evaluation suite that evaluates the agents' empathy process. We benchmark current models and find that exhibiting empathetic actions remains a significant challenge. Meanwhile, we train Llama3-8B using EmpathyAgent and find it can potentially enhance empathetic behavior. By establishing a standard benchmark for evaluating empathetic actions, we hope to advance research in empathetic embodied agents. Our code and data are publicly available at https://github.com/xinyan-cxy/EmpathyAgent.

Steady Progress Beats Stagnation: Mutual Aid of Foundation and Conventional Models in Mixed Domain Semi-Supervised Medical Image Segmentation

Qinghe Ma,Jian Zhang,Zekun Li,Lei Qi,Qian Yu,Yinghuan Shi

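Task: Propose SynFoC, a framework that addresses the overconfident predictions of foundation models in semi-supervised medical image segmentation.

Motivation: In domain-specific tasks, overconfident predictions by foundation models can cause errors to accumulate, hindering the effective use of unlabeled data.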

Details

Method: Trains a foundation model and a conventional model synergistically and proposes consensus-divergence consistency regularization. Result: Performs strongly on four public multi-domain datasets, improving the Dice score by 10.31% on the Prostate dataset. Conclusion: SynFoC effectively addresses foundation-model overconfidence and improves segmentation performance. Abstract: Large pretrained visual foundation models exhibit impressive general capabilities. However, the extensive prior knowledge inherent in these models can sometimes be a double-edged sword when adapting them to downstream tasks in specific domains. In the context of semi-supervised medical image segmentation with domain shift, foundation models like MedSAM tend to make overconfident predictions, some of which are incorrect. The error accumulation hinders the effective utilization of unlabeled data and limits further improvements. In this paper, we introduce a Synergistic training framework for Foundation and Conventional models (SynFoC) to address the issue. We observe that a conventional model trained from scratch has the ability to correct the high-confidence mispredictions of the foundation model, while the foundation model can supervise it with high-quality pseudo-labels in the early training stages. Furthermore, to enhance the collaborative training effectiveness of both models and promote reliable convergence towards optimization, the consensus-divergence consistency regularization is proposed. We demonstrate the superiority of our method across four public multi-domain datasets. In particular, our method improves the Dice score by 10.31% on the Prostate dataset. Our code is available at https://github.com/MQinghe/SynFoC.
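
Since the headline result is reported as a Dice score, a minimal sketch of that metric may help: for binary masks A and B, Dice = 2|A ∩ B| / (|A| + |B|). The flat-list mask representation, the function name `dice_score`, and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


# One of two predicted foreground pixels overlaps one of two true pixels.
print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```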

Chem42: a Family of chemical Language Models for Target-aware Ligand Generation

Aahan Singh,Engin Tekin,Maryam Nadeem,Nancy A. ElNaker,Mohammad Amaan Sayeed,Natalia Vassilieva,Boulbaba Ben Amor

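Task: Develop generative models that can design novel ligands for specific biological targets.

Motivation: Most existing chemical language models (cLMs) fail to incorporate target-specific information, which limits their use in de novo ligand generation.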

Details

Method: By integrating atomic-level interactions with multimodal inputs from Prot42, Chem42 builds a cross-modal representation of molecular structures, interactions, and binding patterns. Result: Chem42 outperforms existing approaches in chemical validity, target-aware design, and predicted binding affinity. Conclusion: Chem42 offers a powerful generative AI tool for precision medicine and could accelerate the drug discovery pipeline. Abstract: Revolutionizing drug discovery demands more than just understanding molecular interactions; it requires generative models that can design novel ligands tailored to specific biological targets. While chemical Language Models (cLMs) have made strides in learning molecular properties, most fail to incorporate target-specific insights, restricting their ability to drive de-novo ligand generation. Chem42, a cutting-edge family of generative chemical Language Models, is designed to bridge this gap. By integrating atomic-level interactions with multimodal inputs from Prot42, a complementary protein Language Model, Chem42 achieves a sophisticated cross-modal representation of molecular structures, interactions, and binding patterns. This innovative framework enables the creation of structurally valid, synthetically accessible ligands with enhanced target specificity. Evaluations across diverse protein targets confirm that Chem42 surpasses existing approaches in chemical validity, target-aware design, and predicted binding affinity. By reducing the search space of viable drug candidates, Chem42 could accelerate the drug discovery pipeline, offering a powerful generative AI tool for precision medicine. Our Chem42 models set a new benchmark in molecule property prediction, conditional molecule generation, and target-aware ligand design. The models are publicly available at huggingface.co/inceptionai.

RAW-Adapter: Adapting Pre-trained Visual Model to Camera RAW Images and A Benchmark

Ziteng Cui,Jianfei Yang,Tatsuya Harada

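Task: Propose RAW-Adapter, a framework that uses learnable ISP modules and model-level adapters to connect RAW image data with high-level downstream architectures.

Motivation: RAW images preserve rich physical detail, yet existing methods usually overlook the potential synergy between ISP processing and the model, so a more effective way to exploit RAW data is needed.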

Details

Method: Takes an adapter-based approach that combines learnable ISP modules with model-level adapters, and introduces the RAW-Bench benchmark and a RAW-based data augmentation strategy. Result: Experiments show that RAW-Adapter is robust across diverse scenarios and outperforms existing ISP methods and RAW-based vision algorithms. Conclusion: RAW-Adapter is a general framework that uses RAW data to improve visual perception, with data augmentation further improving generalization. Abstract: In the computer vision community, the preference for pre-training visual models has largely shifted toward sRGB images due to their ease of acquisition and compact storage. However, camera RAW images preserve abundant physical details across diverse real-world scenarios. Despite this, most existing visual perception methods that utilize RAW data directly integrate image signal processing (ISP) stages with subsequent network modules, often overlooking potential synergies at the model level. Building on recent advances in adapter-based methodologies in both NLP and computer vision, we propose RAW-Adapter, a novel framework that incorporates learnable ISP modules as input-level adapters to adjust RAW inputs. At the same time, it employs model-level adapters to seamlessly bridge ISP processing with high-level downstream architectures. Moreover, RAW-Adapter serves as a general framework applicable to various computer vision frameworks. Furthermore, we introduce RAW-Bench, which incorporates 17 types of RAW-based common corruptions, including lightness degradations, weather effects, blurriness, camera imaging degradations, and variations in camera color response. Using this benchmark, we systematically compare the performance of RAW-Adapter with state-of-the-art (SOTA) ISP methods and other RAW-based high-level vision algorithms. Additionally, we propose a RAW-based data augmentation strategy to further enhance RAW-Adapter's performance and improve its out-of-domain (OOD) generalization ability. Extensive experiments substantiate the effectiveness and efficiency of RAW-Adapter, highlighting its robust performance across diverse scenarios.

Gene42: Long-Range Genomic Foundation Model With Dense Attention

Kirill Vishniakov,Boulbaba Ben Amor,Engin Tekin,Nancy A. ElNaker,Karthik Viswanathan,Aleksandr Medvedev,Aahan Singh,Maryam Nadeem,Mohammad Amaan Sayeed,Praveenkumar Kanithi,Tiago Magalhaes,Natalia Vassilieva,Dwarikanath Mahapatra,Marco Pimentel,and Shadab Khan

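Task: Develop Gene42, a dense self-attention model that can process genomic sequences of up to 192,000 bp.

Motivation: Model long-context dependencies in genomic data and challenge the limitations of existing state-space models.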

Details

Method: Uses a decoder-only (LLaMA-style) architecture with dense self-attention, extending the context length step by step through continuous pretraining. Result: The models show low perplexity and high reconstruction accuracy and reach state-of-the-art performance on multiple genomic tasks. Conclusion: Gene42 is the first dense-attention model able to handle such long genomic contexts, providing a powerful tool for genomics research. Abstract: We introduce Gene42, a novel family of Genomic Foundation Models (GFMs) designed to manage context lengths of up to 192,000 base pairs (bp) at a single-nucleotide resolution. Gene42 models utilize a decoder-only (LLaMA-style) architecture with a dense self-attention mechanism. Initially trained on fixed-length sequences of 4,096 bp, our models underwent continuous pretraining to extend the context length to 192,000 bp. This iterative extension allowed for the comprehensive processing of large-scale genomic data and the capture of intricate patterns and dependencies within the human genome. Gene42 is the first dense attention model capable of handling such extensive long context lengths in genomics, challenging state-space models that often rely on convolutional operators among other mechanisms. Our pretrained models exhibit notably low perplexity values and high reconstruction accuracy, highlighting their strong ability to model genomic data. Extensive experiments on various genomic benchmarks have demonstrated state-of-the-art performance across multiple tasks, including biotype classification, regulatory region identification, chromatin profiling prediction, variant pathogenicity prediction, and species classification. The models are publicly available at huggingface.co/inceptionai.

AnimatePainter: A Self-Supervised Rendering Framework for Reconstructing Painting Process

Junjie Hu,Shuyong Gao,Qianyu Guo,Yan Wang,Qishan Wang,Yuang Feng,Wenqiang Zhang

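Task: Generate the painting process from any type of image.

Motivation: Existing methods are limited to specific data types and rely on expensive human-annotated datasets, so they cannot generalize to arbitrary images.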

Details

Method: Proposes a self-supervised framework that treats the task as video generation, simulates a human painting sequence by progressively removing strokes, and builds a self-supervised dataset using depth estimation and stroke rendering. Result: Experiments validate the approach, which generates realistic painting processes without real painting-process data. Conclusion: The method offers an efficient, annotation-free way to generate the painting process for arbitrary images. Abstract: Humans can intuitively decompose an image into a sequence of strokes to create a painting, yet existing methods for generating drawing processes are limited to specific data types and often rely on expensive human-annotated datasets. We propose a novel self-supervised framework for generating drawing processes from any type of image, treating the task as a video generation problem. Our approach reverses the drawing process by progressively removing strokes from a reference image, simulating a human-like creation sequence. Crucially, our method does not require costly datasets of real human drawing processes; instead, we leverage depth estimation and stroke rendering to construct a self-supervised dataset. We model human drawings as "refinement" and "layering" processes and introduce depth fusion layers to enable video generation models to learn and replicate human drawing behavior. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to generate realistic drawings without the need for real drawing process data.

WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching

Tianze Luo,Xingchen Miao,Wenbo Duan

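Task: Propose WaveFM, a reparameterized flow matching model for mel-spectrogram conditioned speech synthesis that improves the sample quality and generation speed of diffusion vocoders.

Motivation: Applying flow matching directly to neural vocoders can yield subpar audio quality, so improvements are needed.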

Details

Method: Adopts a mel-conditioned prior distribution to reduce unnecessary transportation cost during synthesis, and introduces auxiliary losses (such as a multi-resolution STFT loss) and a consistency distillation method. Result: WaveFM outperforms previous diffusion vocoders in quality and efficiency and can generate waveforms in a single step. Conclusion: By improving flow matching and adding auxiliary losses, WaveFM substantially improves the performance and speed of diffusion vocoders. Abstract: Flow matching offers a robust and stable approach to training diffusion models. However, directly applying flow matching to neural vocoders can result in subpar audio quality. In this work, we present WaveFM, a reparameterized flow matching model for mel-spectrogram conditioned speech synthesis, designed to enhance both sample quality and generation speed for diffusion vocoders. Since mel-spectrograms represent the energy distribution of waveforms, WaveFM adopts a mel-conditioned prior distribution instead of a standard Gaussian prior to minimize unnecessary transportation costs during synthesis. Moreover, while most diffusion vocoders rely on a single loss function, we argue that incorporating auxiliary losses, including a refined multi-resolution STFT loss, can further improve audio quality. To speed up inference without degrading sample quality significantly, we introduce a tailored consistency distillation method for WaveFM. Experiment results demonstrate that our model achieves superior performance in both quality and efficiency compared to previous diffusion vocoders, while enabling waveform generation in a single inference step.

TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting

Jianchuan Chen,Jingchuan Hu,Gaige Wang,Zhonghua Jiang,Tiansong Zhou,Zhiwen Chen,Chengfei Lv

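Task: Develop TaoAvatar, a high-fidelity, lightweight 3D full-body talking avatar that can be driven by various signals.

Motivation: Existing methods struggle with fine-grained control of facial expressions and body movements in full-body talking tasks, and they lack detail and real-time performance.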

Details

Method: Creates a personalized parametric template, pre-trains a StyleUnet-based network to handle non-rigid deformation, and uses distillation to bake it into a lightweight MLP-based network. Result: TaoAvatar renders in real time on various devices (for example, 90 FPS on the Apple Vision Pro) and achieves state-of-the-art rendering quality. Conclusion: TaoAvatar addresses the shortcomings of existing methods, delivering a high-fidelity, lightweight, real-time full-body talking avatar. Abstract: Realistic 3D full-body talking avatars hold great potential in AR, with applications ranging from e-commerce live streaming to holographic communication. Despite advances in 3D Gaussian Splatting (3DGS) for lifelike avatar creation, existing methods struggle with fine-grained control of facial expressions and body movements in full-body talking tasks. Additionally, they often lack sufficient details and cannot run in real-time on mobile devices. We present TaoAvatar, a high-fidelity, lightweight, 3DGS-based full-body talking avatar driven by various signals. Our approach starts by creating a personalized clothed human parametric template that binds Gaussians to represent appearances. We then pre-train a StyleUnet-based network to handle complex pose-dependent non-rigid deformation, which can capture high-frequency appearance details but is too resource-intensive for mobile devices. To overcome this, we "bake" the non-rigid deformations into a lightweight MLP-based network using a distillation technique and develop blend shapes to compensate for details. Extensive experiments show that TaoAvatar achieves state-of-the-art rendering quality while running in real-time across various devices, maintaining 90 FPS on high-definition stereo devices such as the Apple Vision Pro.

CAARMA: Class Augmentation with Adversarial Mixup Regularization

Massa Baali,Xiang Li,Hao Chen,Rita Singh,Bhiksha Raj

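Task: Propose CAARMA, a class augmentation framework that generates synthetic classes through data mixing in the embedding space to address insufficient class diversity in speaker verification.

Motivation: Real-world speaker datasets often lack sufficient class diversity, making it hard for models to learn embeddings that generalize well.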

Details

Method: Generates synthetic classes through data mixing in the embedding space, combined with an adversarial refinement mechanism that minimizes categorical distinctions between synthetic and real classes. Result: On multiple speaker verification tasks and other zero-shot comparison-based speech analysis tasks, CAARMA achieves a significant 8% improvement over baseline models. Conclusion: Through class augmentation and adversarial refinement, CAARMA effectively improves speaker verification performance and has broad application potential. Abstract: Speaker verification is a typical zero-shot learning task, where inference of unseen classes is performed by comparing embeddings of test instances to known examples. The models performing inference must hence naturally generate embeddings that cluster same-class instances compactly, while maintaining separation across classes. In order to learn to do so, they are typically trained on a large number of classes (speakers), often using specialized losses. However, real-world speaker datasets often lack the class diversity needed to effectively learn this in a generalizable manner. We introduce CAARMA, a class augmentation framework that addresses this problem by generating synthetic classes through data mixing in the embedding space, expanding the number of training classes. To ensure the authenticity of the synthetic classes, we adopt a novel adversarial refinement mechanism that minimizes categorical distinctions between synthetic and real classes. We evaluate CAARMA on multiple speaker verification tasks, as well as other representative zero-shot comparison-based speech analysis tasks and obtain consistent improvements: our framework demonstrates a significant improvement of 8% over all baseline models. Code for CAARMA will be released.
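
The idea of creating synthetic classes by mixing in the embedding space can be illustrated with a minimal sketch. It assumes a plain convex combination of two speaker embeddings (a mixup-style rule); the function name `mix_embeddings`, the coefficient `lam`, and the toy vectors are hypothetical, and the abstract does not specify CAARMA's actual mixing rule or its adversarial refinement step.

```python
def mix_embeddings(emb_a, emb_b, lam):
    """Convex combination of two class embeddings (mixup-style assumption)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    return [lam * a + (1.0 - lam) * b for a, b in zip(emb_a, emb_b)]


# Toy 2-D embeddings standing in for two real speakers (hypothetical values).
speaker_a = [1.0, 0.0]
speaker_b = [0.0, 1.0]

# The mixed vector is treated as the embedding of a new, synthetic speaker class.
synthetic_speaker = mix_embeddings(speaker_a, speaker_b, lam=0.25)
print(synthetic_speaker)
```

In this reading, every distinct pair of speakers and every value of `lam` gives a candidate extra class, which is how mixing could expand the number of training classes without new recordings.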

An Attentive Representative Sample Selection Strategy Combined with Balanced Batch Training for Skin Lesion Segmentation

Stephen Lloyd-Brown,Susan Francis,Caroline Hoad,Penny Gowland,Karen Mullinger,Andrew French,Xin Chen

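Task: Address the effective selection of training subsets in medical image segmentation.

Motivation: Randomly selecting the training subset can lead to poor model performance, especially in minimal-supervision settings.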

Details

Method: Uses prototypical contrastive learning and clustering to extract representative and diverse samples for annotation, and introduces unsupervised balanced batch dataloading. Result: Outperforms another state-of-the-art sampling method on the ISIC 2018 skin lesion dataset in a low annotation budget scenario. Conclusion: The proposed method effectively improves medical image segmentation models trained with limited annotated data. Abstract: An often overlooked problem in medical image segmentation research is the effective selection of training subsets to annotate from a complete set of unlabelled data. Many studies select their training sets at random, which may lead to suboptimal model performance, especially in the minimal supervision setting where each training image has a profound effect on performance outcomes. This work aims to address this issue. We use prototypical contrastive learning and clustering to extract representative and diverse samples for annotation. We improve upon prior works with a bespoke cluster-based image selection process. Additionally, we introduce the concept of unsupervised balanced batch dataloading to medical image segmentation, which aims to improve model learning with minimally annotated data. We evaluated our method on a public skin lesion dataset (ISIC 2018) and compared it to another state-of-the-art data sampling method. Our method achieved superior performance in a low annotation budget scenario.

Design and Implementation of an FPGA-Based Tiled Matrix Multiplication Accelerator for Transformer Self-Attention on the Xilinx KV260 SoM

Zhaoqin "Richie" Li,Sicheng Chen

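Task: Accelerate the matrix multiplications in the Q, K, and V linear projections of Transformer models.

Motivation: The Q, K, and V linear projections are a computational bottleneck of the Multi-Head Self-Attention module and need to be optimized for efficiency.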

Details

Method: Designs an FPGA-based matrix multiplication accelerator with persistent on-chip storage for one operand, two-level tiling for data reuse, and a systolic-like compute engine. Result: Standalone GEMM benchmarks show up to a 7x speedup over an ARM CPU (PyTorch) baseline, about 200x over naive numpy, and a throughput of up to 3.1 GFLOPs. Conclusion: FPGA-based acceleration shows clear potential for critical components of Transformer models. Abstract: Transformer-based LLMs spend most of their compute in large matrix multiplications for attention and feed-forward layers. Recognizing that the Q, K, and V linear projections within the Multi-Head Self-Attention (MHA) module represent a critical computational bottleneck, we strategically focused our efforts on accelerating these operations. We present a tiled matrix multiplication accelerator optimized for such workloads on a Xilinx KV260 on-board FPGA. Key innovations include persistent on-chip storage for one matrix operand, two-level tiling for data reuse, and a systolic-like unrolled compute engine. Implemented via high-level synthesis (HLS) and integrated with DistilBERT for Q, K, V projections, our accelerator achieves significant speedup and energy efficiency gains over CPU baselines. Standalone GEMM benchmarks show up to a 7x speedup over an ARM CPU (PyTorch) and ~200x over naive numpy, with a throughput of up to 3.1 GFLOPs on 768x3072 matrices. Although the overall end-to-end DistilBERT acceleration is more modest, our results validate the potential of FPGA-based acceleration for critical components of Transformer models.
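
To make the notion of tiling concrete, here is a minimal software sketch of tiled matrix multiplication. It is an illustration under simplifying assumptions, not the paper's design: it uses a single level of tiling in plain Python, and the names `tiled_matmul`, `naive_matmul`, and the `tile` parameter are hypothetical. The two-level tiling, persistent on-chip operand storage, and systolic-like HLS engine described in the abstract are not reproduced.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (n x k) by b (k x m) one tile x tile block at a time."""
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0] * m for _ in range(n)]
    # Outer loops walk over tiles; on hardware each tile would be loaded
    # into fast on-chip memory once and reused for many multiply-adds.
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Inner loops do the multiply-accumulate within one tile.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


def naive_matmul(a, b):
    """Reference triple loop used to check the tiled version."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0], [0, 1], [2, 3]]
assert tiled_matmul(a, b) == naive_matmul(a, b)
```

The result is identical to the untiled product; only the order of memory accesses changes, which is what lets an accelerator keep a small working set close to the compute units.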

ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail

Chandan Yeshwanth,David Rozenberszki,Angela Dai

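Task: Propose ExCap3D, an expressive 3D captioning method that generates descriptions at multiple levels of detail for objects in 3D indoor scenes.

Motivation: Existing methods describe objects at a single level of detail and fail to capture fine-grained details such as textures, materials, and shapes.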

Details

Method: ExCap3D takes a 3D scan as input and, for each detected object, generates a high-level object description and a low-level description of its parts' properties, improving caption quality through semantic consistency and latent-space textual similarity. Result: Experiments show that ExCap3D produces higher-quality object- and part-level captions than existing methods, with CIDEr score improvements of 17% and 124% respectively. Conclusion: ExCap3D and its dataset provide an effective tool for fine-grained description of 3D scenes; the code, dataset, and models will be released publicly. Abstract: Generating text descriptions of objects in 3D indoor scenes is an important building block of embodied understanding. Existing methods do this by describing objects at a single level of detail, which often does not capture fine-grained details such as varying textures, materials, and shapes of the parts of objects. We propose the task of expressive 3D captioning: given an input 3D scene, describe objects at multiple levels of detail: a high-level object description, and a low-level description of the properties of its parts. To produce such captions, we present ExCap3D, an expressive 3D captioning model which takes as input a 3D scan, and for each detected object in the scan, generates a fine-grained collective description of the parts of the object, along with an object-level description conditioned on the part-level description. We design ExCap3D to encourage semantic consistency between the generated text descriptions, as well as textual similarity in the latent space, to further increase the quality of the generated captions. To enable this task, we generated the ExCap3D Dataset by leveraging a visual-language model (VLM) for multi-view captioning. The ExCap3D Dataset contains captions on the ScanNet++ dataset with varying levels of detail, comprising 190k text descriptions of 34k 3D objects in 947 indoor scenes. Our experiments show that the object- and part-level of detail captions generated by ExCap3D are of higher quality than those produced by state-of-the-art methods, with a CIDEr score improvement of 17% and 124% for object- and part-level details respectively. Our code, dataset and models will be made publicly available.

The Deployment of End-to-End Audio Language Models Should Take into Account the Principle of Least Privilege

Luxi He,Xiangyu Qi,Michel Liao,Inyoung Cheong,Prateek Mittal,Danqi Chen,Peter Henderson

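Task: Examine how end-to-end audio language models (Audio LMs) are built and deployed.

Motivation: End-to-end audio language models process speech directly and preserve details such as intonation and the presence of multiple speakers, but they also introduce new safety risks, such as misuse of speaker identity information.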

Details

Method: Proposes using the principle of least privilege to assess whether end-to-end modeling is needed and what scope of information access is appropriate. Result: Identifies gaps in current audio LM benchmarks and raises open research questions on both technical and policy fronts. Conclusion: Calls for more careful construction and deployment of end-to-end audio language models to ensure their responsible use. Abstract: We are at a turning point for language models that accept audio input. The latest end-to-end audio language models (Audio LMs) process speech directly instead of relying on a separate transcription step. This shift preserves detailed information, such as intonation or the presence of multiple speakers, that would otherwise be lost in transcription. However, it also introduces new safety risks, including the potential misuse of speaker identity cues and other sensitive vocal attributes, which could have legal implications. In this position paper, we urge a closer examination of how these models are built and deployed. We argue that the principle of least privilege should guide decisions on whether to deploy cascaded or end-to-end models. Specifically, evaluations should assess (1) whether end-to-end modeling is necessary for a given application; and (2) the appropriate scope of information access. Finally, we highlight related gaps in current audio LM benchmarks and identify key open research questions, both technical and policy-related, that must be addressed to enable the responsible deployment of end-to-end Audio LMs.

Scoring, Remember, and Reference: Catching Camouflaged Objects in Videos

Yuang Feng,Shuyong Gao,Fuzhen Yan,Yicheng Song,Lingyi Hong,Junjie Hu,Wenqiang Zhang

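Task: Video camouflaged object detection (VCOD), which aims to segment objects whose appearance closely resembles their surroundings.

Motivation: Existing vision models perform poorly on camouflaged object detection, mainly because the targets are hard to distinguish visually and the dynamic information in videos is underused.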

Details

Method: Proposes an end-to-end VCOD framework inspired by human memory-recognition that integrates memory reference frames for camouflaged sequence processing, designs a dual-purpose decoder, and introduces a reference-guided multilevel asymmetric attention mechanism. Result: The proposed SRR framework surpasses existing approaches by 10% on benchmark datasets while using fewer parameters (54M) and only a single pass through the video. Conclusion: Through optimized module design and effective use of video data, the work significantly improves camouflaged object detection performance. Abstract: Video Camouflaged Object Detection (VCOD) aims to segment objects whose appearances closely resemble their surroundings, posing a challenging and emerging task. Existing vision models often struggle in such scenarios due to the indistinguishable appearance of camouflaged objects and the insufficient exploitation of dynamic information in videos. To address these challenges, we propose an end-to-end VCOD framework inspired by human memory-recognition, which leverages historical video information by integrating memory reference frames for camouflaged sequence processing. Specifically, we design a dual-purpose decoder that simultaneously generates predicted masks and scores, enabling reference frame selection based on scores while introducing auxiliary supervision to enhance feature extraction. Furthermore, this study introduces a novel reference-guided multilevel asymmetric attention mechanism, effectively integrating long-term reference information with short-term motion cues for comprehensive feature extraction. By combining these modules, we develop the Scoring, Remember, and Reference (SRR) framework, which efficiently extracts information to locate targets and employs memory guidance to improve subsequent processing. With its optimized module design and effective utilization of video data, our model achieves significant performance improvements, surpassing existing approaches by 10% on benchmark datasets while requiring fewer parameters (54M) and only a single pass through the video. The code will be made publicly available.

Towards LLM Guardrails via Sparse Representation Steering

Zeqing He,Zhibo Wang,Huiyu Xu,Kui Ren

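Task: Propose a sparse encoding-based representation engineering method (SRE) to address the lack of fine-grained control, quality degradation, and poor interpretability in controlling large language model (LLM) outputs.

Motivation: LLMs perform well on natural language generation, but their uncontrolled outputs pose ethical and safety risks, and existing representation engineering methods are limited in fine-grained control, content quality, and interpretability.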

Details

Method: Uses sparse autoencoding to decompose polysemantic activations into a structured, monosemantic feature space and adjusts only task-specific sparse feature dimensions, enabling precise and interpretable control of model behavior. Result: SRE is validated on three key domains (safety, fairness, and truthfulness); experiments show superior controllability while preserving the quality of generated content. Conclusion: SRE is a fine-grained and interpretable activation steering framework that addresses the limitations of existing methods. Abstract: Large Language Models (LLMs) have demonstrated remarkable performance in natural language generation tasks, yet their uncontrolled outputs pose significant ethical and safety risks. Recently, representation engineering methods have shown promising results in steering model behavior by modifying the rich semantic information encoded in activation vectors. However, due to the difficulty of precisely disentangling semantic directions within high-dimensional representation space, existing approaches suffer from three major limitations: lack of fine-grained control, quality degradation of generated content, and poor interpretability. To address these challenges, we propose a sparse encoding-based representation engineering method, named SRE, which decomposes polysemantic activations into a structured, monosemantic feature space. By leveraging sparse autoencoding, our approach isolates and adjusts only task-specific sparse feature dimensions, enabling precise and interpretable steering of model behavior while preserving content quality. We validate our method on three critical domains, i.e., safety, fairness, and truthfulness, using the open-source LLM Gemma-2-2B-it. Experimental results show that SRE achieves superior controllability while maintaining the overall quality of generated content (i.e., controllability and quality), demonstrating its effectiveness as a fine-grained and interpretable activation steering framework.

PVChat: Personalized Video Chat with One-Shot Learning

Yufei Shi,Weilong Yan,Gang Xu,Yumeng Li,Yuchen Li,Zhenxi Li,Fei Richard Yu,Ming Li,Si Yong Yeo

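Task: Propose PVChat, a one-shot learning framework for a personalized video large language model (ViLLM) that supports subject-aware question answering (QA) from a single video.

Motivation: Existing ViLLMs excel at general video understanding but are limited in identity-aware comprehension (such as a specific person's activities or interactions), which restricts their use in smart healthcare and smart home scenarios.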

Details

Method: Optimizes a Mixture-of-Heads (MoH) enhanced ViLLM on a synthetically augmented video-QA dataset with a progressive image-to-video learning strategy, including an automated augmentation pipeline, a ReLU Routing MoH attention mechanism, and a two-stage training strategy. Result: On diverse datasets covering medical scenarios, TV series, anime, and real-world footage, PVChat shows better personalized feature understanding than existing ViLLMs after learning from a single video. Conclusion: With its one-shot learning framework and progressive learning strategy, PVChat significantly improves identity-aware comprehension in ViLLMs and broadens their application scenarios. Abstract: Video large language models (ViLLMs) excel in general video understanding, e.g., recognizing activities like talking and eating, but struggle with identity-aware comprehension, such as "Wilson is receiving chemotherapy" or "Tom is discussing with Sarah", limiting their applicability in smart healthcare and smart home environments. To address this limitation, we propose a one-shot learning framework PVChat, the first personalized ViLLM that enables subject-aware question answering (QA) from a single video for each subject. Our approach optimizes a Mixture-of-Heads (MoH) enhanced ViLLM on a synthetically augmented video-QA dataset, leveraging a progressive image-to-video learning strategy. Specifically, we introduce an automated augmentation pipeline that synthesizes identity-preserving positive samples and retrieves hard negatives from existing video corpora, generating a diverse training dataset with four QA types: existence, appearance, action, and location inquiries. To enhance subject-specific learning, we propose a ReLU Routing MoH attention mechanism, alongside two novel objectives: (1) Smooth Proximity Regularization for progressive learning through exponential distance scaling and (2) Head Activation Enhancement for balanced attention routing. Finally, we adopt a two-stage training strategy, transitioning from image pre-training to video fine-tuning, enabling a gradual learning process from static attributes to dynamic representations. We evaluate PVChat on diverse datasets covering medical scenarios, TV series, anime, and real-world footage, demonstrating its superiority in personalized feature understanding after learning from a single video, compared to state-of-the-art ViLLMs.

Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs

Anshumann,Mohd Abbas Zaidi,Akhil Kedia,Jinwoo Ahn,Taehwak Kwon,Kangwook Lee,Haejun Lee,Joohyung Lee

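Task: Explore how to apply sparse knowledge distillation in pre-training.

Motivation: Naive sparse knowledge distillation approaches (such as caching Top-K probabilities) give biased estimates of the teacher probability distribution, hurting student performance and calibration.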

Details

Method: Proposes an importance-sampling-based method, 'Random Sampling Knowledge Distillation', which provides unbiased estimates, preserves the gradient in expectation, and substantially reduces storage requirements. Result: Across model sizes from 300M to 3B, the method enables faster student training with marginal overhead (<10%) compared to cross-entropy based training while remaining competitive with full distillation. Conclusion: 'Random Sampling Knowledge Distillation' is an efficient, well-performing sparse knowledge distillation method. Abstract: Knowledge distillation can be a cost-effective technique to distill knowledge in Large Language Models, if the teacher output logits can be pre-computed and cached. However, successfully applying this to pre-training remains largely unexplored. In this work, we prove that naive approaches for sparse knowledge distillation such as caching Top-K probabilities, while intuitive, provide biased estimates of teacher probability distribution to the student, resulting in suboptimal performance and calibration. We propose an importance-sampling-based method 'Random Sampling Knowledge Distillation', which provides unbiased estimates, preserves the gradient in expectation, and requires storing significantly sparser logits. Our method enables faster training of student models with marginal overhead (<10%) compared to cross-entropy based training, while maintaining competitive performance compared to full distillation, across a range of model sizes from 300M to 3B.
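
The contrast between a biased Top-K cache and an unbiased sampled cache can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: tokens are drawn from the teacher distribution itself and stored as frequencies, the five-token `teacher` distribution is made up, and the helper names are hypothetical. The paper's actual proposal distribution and weighting may differ.

```python
import random
from collections import Counter


def top_k_estimate(probs, k):
    """Keep the k largest teacher probabilities and renormalize (biased)."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [probs[i] / mass if i in top else 0.0 for i in range(len(probs))]


def sampled_estimate(probs, num_samples, rng):
    """Draw token ids from the teacher distribution and store their frequencies.

    Each frequency has expectation probs[i], so the sparse target is unbiased.
    """
    draws = rng.choices(range(len(probs)), weights=probs, k=num_samples)
    counts = Counter(draws)
    return [counts[i] / num_samples for i in range(len(probs))]


teacher = [0.5, 0.2, 0.15, 0.1, 0.05]  # hypothetical teacher distribution
rng = random.Random(0)

# Average many sparse sampled targets to see that they converge to the teacher.
trials = 2000
mean_estimate = [0.0] * len(teacher)
for _ in range(trials):
    est = sampled_estimate(teacher, num_samples=3, rng=rng)
    mean_estimate = [m + e / trials for m, e in zip(mean_estimate, est)]

print("top-2 target:   ", top_k_estimate(teacher, k=2))
print("mean of samples:", [round(m, 3) for m in mean_estimate])
```

The Top-K target always assigns zero probability to tail tokens and inflates the head, no matter how many examples are averaged. Each sampled target stores at most three entries here, yet the average over many of them approaches the full teacher distribution.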

Superpowering Open-Vocabulary Object Detectors for X-ray Vision

Pablo Garcia-Fernandez,Lorenzo Vaquero,Mingxuan Liu,Feng Xue,Daniel Cores,Nicu Sebe,Manuel Mucientes,Elisa Ricci

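Task: Develop RAXO, a training-free framework that repurposes off-the-shelf RGB open-vocabulary object detection (OvOD) detectors for X-ray detection.

Motivation: Address data scarcity and the modality gap in X-ray imaging, which prevent direct adoption of RGB-based solutions.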

Details

Method: RAXO builds high-quality X-ray class descriptors with a dual-source retrieval strategy, gathering RGB images from the web and enriching them through an X-ray material transfer mechanism, without needing labeled databases. Result: RAXO consistently improves OvOD performance, with an average mAP increase of up to 17.0 points, and the new DET-COMPASS benchmark supports large-scale evaluation. Conclusion: RAXO offers an efficient, training-free approach to open-vocabulary object detection in X-ray and advances research in this area. Abstract: Open-vocabulary object detection (OvOD) is set to revolutionize security screening by enabling systems to recognize any item in X-ray scans. However, developing effective OvOD models for X-ray imaging presents unique challenges due to data scarcity and the modality gap that prevents direct adoption of RGB-based solutions. To overcome these limitations, we propose RAXO, a training-free framework that repurposes off-the-shelf RGB OvOD detectors for robust X-ray detection. RAXO builds high-quality X-ray class descriptors using a dual-source retrieval strategy. It gathers relevant RGB images from the web and enriches them via a novel X-ray material transfer mechanism, eliminating the need for labeled databases. These visual descriptors replace text-based classification in OvOD, leveraging intra-modal feature distances for robust detection. Extensive experiments demonstrate that RAXO consistently improves OvOD performance, providing an average mAP increase of up to 17.0 points over base detectors. To further support research in this emerging field, we also introduce DET-COMPASS, a new benchmark featuring bounding box annotations for over 300 object categories, enabling large-scale evaluation of OvOD in X-ray. Code and dataset available at: https://github.com/PAGF188/RAXO.

Federated Cross-Domain Click-Through Rate Prediction With Large Language Model Augmentation

Jiangcheng Qin,Xueyuan Zhang,Baisong Liu,Jiangbo Qian,Yangyang Wang

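Task: Accurately predict cross-domain click-through rates (CTR) under strict privacy constraints.

Motivation: Conventional cross-domain CTR methods assume homogeneous feature spaces and rely on centralized data sharing, neglecting inter-domain discrepancies and the trade-offs imposed by privacy-preserving protocols.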

Details

Method: Proposes the FedCCTR-LM framework, which combines data augmentation, representation disentanglement, and adaptive privacy protection through PrivAugNet, the IDST-CL module, and the AdaLDP mechanism. Result: Substantially outperforms existing baselines on four real-world datasets, delivering robust, privacy-preserving, and generalizable cross-domain CTR prediction. Conclusion: FedCCTR-LM effectively addresses data sparsity and privacy protection in heterogeneous federated environments. Abstract: Accurately predicting click-through rates (CTR) under stringent privacy constraints poses profound challenges, particularly when user-item interactions are sparse and fragmented across domains. Conventional cross-domain CTR (CCTR) methods frequently assume homogeneous feature spaces and rely on centralized data sharing, neglecting complex inter-domain discrepancies and the subtle trade-offs imposed by privacy-preserving protocols. Here, we present Federated Cross-Domain CTR Prediction with Large Language Model Augmentation (FedCCTR-LM), a federated framework engineered to address these limitations by synchronizing data augmentation, representation disentanglement, and adaptive privacy protection. Our approach integrates three core innovations. First, the Privacy-Preserving Augmentation Network (PrivAugNet) employs large language models to enrich user and item representations and expand interaction sequences, mitigating data sparsity and feature incompleteness. Second, the Independent Domain-Specific Transformer with Contrastive Learning (IDST-CL) module disentangles domain-specific and shared user preferences, employing intra-domain representation alignment (IDRA) and cross-domain representation disentanglement (CDRD) to refine the learned embeddings and enhance knowledge transfer across domains. Finally, the Adaptive Local Differential Privacy (AdaLDP) mechanism dynamically calibrates noise injection to achieve an optimal balance between rigorous privacy guarantees and predictive accuracy. Empirical evaluations on four real-world datasets demonstrate that FedCCTR-LM substantially outperforms existing baselines, offering robust, privacy-preserving, and generalizable cross-domain CTR prediction in heterogeneous, federated environments.

Zero-Shot Styled Text Image Generation, but Make It Autoregressive

Vittorio Pippi,Fabio Quattrini,Silvia Cascianelli,Alessio Tonioni,Rita Cucchiara

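Task: Propose Emuru, a new framework for generating text images in a specific style (such as a font or handwriting style).

Motivation: Existing GAN- or diffusion-based methods generalize poorly to novel styles and have technical constraints on output length and training efficiency.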

Details

Method: Combines a variational autoencoder with an autoregressive Transformer, trained on a diverse synthetic dataset. Result: Emuru can reproduce unseen styles (fonts and handwriting) zero-shot, and its generated images are free of background artifacts. Conclusion: Emuru performs well on typewritten and handwritten text image generation and is the first autoregressive model for HTG designed specifically for generalization to novel styles. Abstract: Styled Handwritten Text Generation (HTG) has recently received attention from the computer vision and document analysis communities, which have developed several solutions, either GAN- or diffusion-based, that achieved promising results. Nonetheless, these strategies fail to generalize to novel styles and have technical constraints, particularly in terms of maximum output length and training efficiency. To overcome these limitations, in this work, we propose a novel framework for text image generation, dubbed Emuru. Our approach leverages a powerful text image representation model (a variational autoencoder) combined with an autoregressive Transformer. Our approach enables the generation of styled text images conditioned on textual content and style examples, such as specific fonts or handwriting styles. We train our model solely on a diverse, synthetic dataset of English text rendered in over 100,000 typewritten and calligraphy fonts, which gives it the capability to reproduce unseen styles (both fonts and users' handwriting) in zero-shot. To the best of our knowledge, Emuru is the first autoregressive model for HTG, and the first designed specifically for generalization to novel styles. Moreover, our model generates images without background artifacts, which are easier to use for downstream applications. Extensive evaluation on both typewritten and handwritten, any-length text image generation scenarios demonstrates the effectiveness of our approach.

Assessing Consistency and Reproducibility in the Outputs of Large Language Models: Evidence Across Diverse Finance and Accounting Tasks

Julian Junyan Wang,Victor Xiaoqi Wang

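Task: Assess the consistency and reproducibility of large language model (LLM) outputs in finance and accounting research.

Motivation: Address concerns about the consistency and reproducibility of LLM outputs on finance and accounting tasks, and compare them with human expert performance.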

Details

Method: Performs 50 independent runs with three OpenAI models (GPT-3.5-turbo, GPT-4o-mini, and GPT-4o) across five common tasks (classification, sentiment analysis, summarization, text generation, and prediction), generating over 3.4 million outputs. Result: LLMs are highly consistent on simple tasks (such as classification and sentiment analysis) and more variable on complex tasks; more advanced models do not consistently improve consistency; LLMs significantly outperform human experts in consistency; and simple aggregation strategies markedly improve consistency. Conclusion: LLM outputs on finance and accounting tasks are largely consistent and reproducible, downstream statistical inferences remain robust, and the risk of selective reporting (G-hacking) is relatively low. Abstract: This study provides the first comprehensive assessment of consistency and reproducibility in Large Language Model (LLM) outputs in finance and accounting research. We evaluate how consistently LLMs produce outputs given identical inputs through extensive experimentation with 50 independent runs across five common tasks: classification, sentiment analysis, summarization, text generation, and prediction. Using three OpenAI models (GPT-3.5-turbo, GPT-4o-mini, and GPT-4o), we generate over 3.4 million outputs from diverse financial source texts and data, covering MD&As, FOMC statements, finance news articles, earnings call transcripts, and financial statements. Our findings reveal substantial but task-dependent consistency, with binary classification and sentiment analysis achieving near-perfect reproducibility, while complex tasks show greater variability. More advanced models do not consistently demonstrate better consistency and reproducibility, with task-specific patterns emerging. LLMs significantly outperform expert human annotators in consistency and maintain high agreement even where human experts significantly disagree. We further find that simple aggregation strategies across 3-5 runs dramatically improve consistency. Simulation analysis reveals that despite measurable inconsistency in LLM outputs, downstream statistical inferences remain remarkably robust. These findings address concerns about what we term "G-hacking," the selective reporting of favorable outcomes from multiple Generative AI runs, by demonstrating that such risks are relatively low for finance and accounting tasks.
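
One simple way to aggregate a handful of runs is a majority vote over the labels, sketched below. The abstract does not say which aggregation strategies the authors used, so majority voting is an assumption suited to classification-style tasks, and the function names and example labels are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label among repeated runs on one input.

    Ties are broken in favor of the label that appeared first.
    """
    if not labels:
        raise ValueError("need at least one run")
    return Counter(labels).most_common(1)[0][0]


def agreement_rate(labels):
    """Share of runs that match the majority label (1.0 means fully consistent)."""
    winner = majority_vote(labels)
    return sum(label == winner for label in labels) / len(labels)


# Hypothetical sentiment labels from five runs on the same earnings-call excerpt.
runs = ["positive", "positive", "neutral", "positive", "positive"]
print(majority_vote(runs), agreement_rate(runs))
```

Reporting the voted label instead of any single run removes the influence of an occasional outlier run, and the agreement rate gives a per-item consistency measure that can be disclosed alongside the result.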

Halton Scheduler For Masked Generative Image Transformer

Victor Besnier,Mickael Chen,David Hurych,Eduardo Valle,Matthieu Cord

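Task: Improve the token unmasking scheduler in the MaskGIT framework by proposing a new sampling strategy based on the Halton sequence.

Motivation: MaskGIT's token unmasking scheduler has been under-studied, and the existing approach has shortcomings that affect the quality and diversity of generated images.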

Details

Method: Proposes a sampling strategy based on the Halton sequence, selecting token positions with a quasi-random, low-discrepancy sequence that covers the image uniformly. Result: The Halton scheduler outperforms the original Confidence scheduler on ImageNet and COCO, reducing FID and generating more diverse and more detailed images. Conclusion: The Halton scheduler requires no retraining or noise injection and can serve as a drop-in replacement for the original strategy, improving image generation quality. Abstract: Masked Generative Image Transformers (MaskGIT) have emerged as a scalable and efficient image generation framework, able to deliver high-quality visuals with low inference costs. However, MaskGIT's token unmasking scheduler, an essential component of the framework, has not received the attention it deserves. We analyze the sampling objective in MaskGIT, based on the mutual information between tokens, and elucidate its shortcomings. We then propose a new sampling strategy based on our Halton scheduler instead of the original Confidence scheduler. More precisely, our method selects the token's position according to a quasi-random, low-discrepancy Halton sequence. Intuitively, that method spreads the tokens spatially, progressively covering the image uniformly at each step. Our analysis shows that it allows reducing non-recoverable sampling errors, leading to simpler hyper-parameter tuning and better quality images. Our scheduler does not require retraining or noise injection and may serve as a simple drop-in replacement for the original sampling strategy. Evaluation of both class-to-image synthesis on ImageNet and text-to-image generation on the COCO dataset demonstrates that the Halton scheduler outperforms the Confidence scheduler quantitatively by reducing the FID and qualitatively by generating more diverse and more detailed images. Our code is at https://github.com/valeoai/Halton-MaskGIT.
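
The following sketch shows how a 2-D Halton sequence can produce an unmasking order over a token grid. It is an illustration under assumptions rather than the released implementation: the bases 2 and 3, the mapping of points to grid cells, and the skipping of repeated cells are choices made here, and the function names are hypothetical. How many tokens are unmasked per step is not modeled.

```python
def radical_inverse(index, base):
    """Reflect the base-`base` digits of `index` about the radix point."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton_unmask_order(height, width):
    """Order all cells of a height x width token grid by a 2-D Halton sequence.

    Bases 2 and 3 are the usual choice for two dimensions. Points that fall
    into a cell already visited are skipped, so each cell appears exactly once.
    """
    order, seen, index = [], set(), 1
    while len(order) < height * width:
        row = int(radical_inverse(index, 2) * height)
        col = int(radical_inverse(index, 3) * width)
        if (row, col) not in seen:
            seen.add((row, col))
            order.append((row, col))
        index += 1
    return order


order = halton_unmask_order(4, 4)
print(order[:4])  # early picks are spread across the grid, not clustered
```

Because the order depends only on the grid size, it can be precomputed once and reused for every image, which is consistent with the abstract's point that the scheduler needs no retraining.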

Token Dynamics: Towards Efficient and Dynamic Video Token Representation for Video Large Language Models

Haichao Zhang,Zhuowei Li,Dimitris Metaxas,Yun Fu

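Task: Propose Token Dynamics, a new video representation framework that represents long video sequences with a minimal number of tokens.

Motivation: Existing token reduction techniques disrupt spatial-temporal positional embeddings and fail to balance computational efficiency against token count, limiting their use in scenarios that require extreme token compression.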

Details

Method: Separates visual embeddings from grid-level motion information to build a concise token base and a token dynamics map, and introduces a cross-dynamics attention mechanism. Result: Token count is reduced to 0.07% of the original tokens with a performance drop of only 1.13%. Conclusion: Token Dynamics markedly lowers computational complexity and token count while preserving spatial-temporal coherence, offering an efficient solution for video LLMs. Abstract: Token-based video representation has emerged as a promising approach for enabling large language models to interpret video content. However, existing token reduction techniques, such as token pruning and token merging, often disrupt essential spatial-temporal positional embeddings, failing to adequately balance computational efficiency with fewer tokens. Consequently, these methods result in relatively lengthy token sequences, limiting their applicability in scenarios requiring extreme token compression, such as video large language models. In this paper, we introduce the novel task of extreme short token reduction, aiming to represent extensive video sequences with a minimal number of tokens. To address this challenge, we propose Token Dynamics, a new video representation framework that dynamically reduces token count while preserving spatial-temporal coherence. Specifically, we disentangle video representations by separating visual embeddings from grid-level motion information, structuring them into: 1. a concise token base, created by clustering tokens that describe object-level content; 2. a token dynamics map, capturing detailed spatial-temporal motion patterns across grids. Furthermore, we introduce a cross-dynamics attention mechanism that integrates motion features into the token base without increasing token length, thereby maintaining compactness and spatial-temporal integrity. The experiments demonstrate a reduction of token count to merely 0.07% of the original tokens, with only a minor performance drop of 1.13%. Additionally, we propose two novel subtasks within extreme token reduction (fixed-length and adaptive-length compression), both effectively representing long token sequences for video-language tasks. Our method offers significantly lower theoretical complexity, fewer tokens, and enhanced throughput, thus providing an efficient solution for video LLMs.

Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection

Gensheng Pei,Tao Chen,Yujia Wang,Xinhao Cai,Xiangbo Shu,Tianfei Zhou,Yazhou Yao

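Task: Propose Patch Generation-to-Selection (PGS), a method that improves CLIP training efficiency while preserving key semantic information.

Motivation: Training CLIP is computationally expensive and data-hungry; existing masking strategies improve efficiency but lose semantic information.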

Details

Method: Selectively retains key image regions through a gradual masking process, Sobel edge detection, and optimal transport normalization. Result: CLIP-PGS sets a new state of the art in zero-shot classification and retrieval and performs strongly on robustness and language compositionality benchmarks. Conclusion: CLIP-PGS is an efficient, semantics-preserving method that markedly improves CLIP performance. Abstract: The CLIP model has demonstrated significant advancements in aligning visual and language modalities through large-scale pre-training on image-text pairs, enabling strong zero-shot classification and retrieval capabilities on various domains. However, CLIP's training remains computationally intensive, with high demands on both data processing and memory. To address these challenges, recent masking strategies have emerged, focusing on the selective removal of image patches to improve training efficiency. Although effective, these methods often compromise key semantic information, resulting in suboptimal alignment between visual features and text descriptions. In this work, we present a concise yet effective approach called Patch Generation-to-Selection to enhance CLIP's training efficiency while preserving critical semantic content. Our method introduces a gradual masking process in which a small set of candidate patches is first pre-selected as potential mask regions. Then, we apply Sobel edge detection across the entire image to generate an edge mask that prioritizes the retention of the primary object areas. Finally, similarity scores between the candidate mask patches and their neighboring patches are computed, with optimal transport normalization refining the selection process to ensure a balanced similarity matrix. Our approach, CLIP-PGS, sets new state-of-the-art results in zero-shot classification and retrieval tasks, achieving superior performance in robustness evaluation and language compositionality benchmarks.
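The Sobel step can be illustrated with a small standard-library sketch that computes a gradient-magnitude map and averages it per patch. Only the Sobel kernels are standard; the per-patch averaging and the function names are assumptions about how an edge mask might be turned into patch-level scores, not the paper's procedure.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Gradient magnitude of a 2D grayscale image given as a list of rows.

    Border pixels are left at 0 because the 3 x 3 window does not fit there.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out


def patch_edge_scores(image, patch):
    """Mean edge magnitude for each non-overlapping patch x patch block."""
    edges = sobel_magnitude(image)
    h, w = len(image), len(image[0])
    scores = []
    for py in range(0, h - patch + 1, patch):
        row = []
        for px in range(0, w - patch + 1, patch):
            block = [edges[y][x]
                     for y in range(py, py + patch)
                     for x in range(px, px + patch)]
            row.append(sum(block) / len(block))
        scores.append(row)
    return scores
```

Patches with a high score contain strong edges and would be candidates to keep, while flat background patches score near zero and are cheaper to mask.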

Text2Model: Generating dynamic chemical reactor models using large language models (LLMs)

Sophia Rupprecht,Yassine Hounat,Monisha Kumar,Giacomo Lastrucci,Artur M. Schweidtmann

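Task: Generate Modelica code for dynamic chemical reactor models from textual descriptions by fine-tuning Llama 3.1 8B Instruct.

Motivation: Explore how large language models (LLMs) can assist chemical engineers with domain-specific tasks in research and industry.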

Details

Method: Fine-tunes Llama 3.1 8B Instruct on synthetically generated Modelica code and compares it with the baseline model and GPT4o. Result: The fine-tuned model is considerably more accurate than the baseline in both the syntax and the semantics of the generated Modelica code, but generalizes worse than GPT4o. Conclusion: The fine-tuned model performs well on the specific task but needs better generalization to handle unseen scenarios. Abstract: As large language models have shown remarkable capabilities in conversing via natural language, the question arises as to how LLMs could potentially assist chemical engineers in research and industry with domain-specific tasks. We generate dynamic chemical reactor models in Modelica code format from textual descriptions as user input. We fine-tune Llama 3.1 8B Instruct on synthetically generated Modelica code for different reactor scenarios. We compare the performance of our fine-tuned model to the baseline Llama 3.1 8B Instruct model and GPT4o. We manually assess the models' predictions regarding the syntactic and semantic accuracy of the generated dynamic models. We find that considerable improvements are achieved by the fine-tuned model with respect to both the semantic and the syntactic accuracy of the Modelica models. However, the fine-tuned model lacks a satisfactory ability to generalize to unseen scenarios compared to GPT4o.

ColabSfM: Collaborative Structure-from-Motion by Point Cloud Registration

Johan Edstedt,André Mateus,Alberto Jaenal

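Task: Propose the scalable task of point cloud registration for SfM reconstructions, and develop an accompanying dataset and model.

Motivation: The lack of scalable methods and training datasets for registering SfM reconstructions holds back collaborative SfM.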

Details

Method: Proposes an SfM registration dataset generation pipeline and designs a neural refiner, RefineRoITr, on top of the existing SotA method RoITr. Result: Experiments show that the proposed pipeline and model significantly improve registration and enable collaborative SfM. Conclusion: Dataset generation and model refinement together address the scalability of registering SfM reconstructions. Abstract: Structure-from-Motion (SfM) is the task of estimating 3D structure and camera poses from images. We define Collaborative SfM (ColabSfM) as sharing distributed SfM reconstructions. Sharing maps requires estimating a joint reference frame, which is typically referred to as registration. However, there is a lack of scalable methods and training datasets for registering SfM reconstructions. In this paper, we tackle this challenge by proposing the scalable task of point cloud registration for SfM reconstructions. We find that current registration methods cannot register SfM point clouds when trained on existing datasets. To this end, we propose an SfM registration dataset generation pipeline, leveraging partial reconstructions from synthetically generated camera trajectories for each scene. Finally, we propose a simple but impactful neural refiner on top of the SotA registration method RoITr that yields significant improvements, which we call RefineRoITr. Our extensive experimental evaluation shows that our proposed pipeline and model enable ColabSfM. Code is available at https://github.com/EricssonResearch/ColabSfM

FactSelfCheck: Fact-Level Black-Box Hallucination Detection for LLMs

Albert Sawczyn,Jakub Binkowski,Denis Janiak,Bogdan Gabrys,Tomasz Kajdanowicz

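Task: Propose FactSelfCheck, a new method for fine-grained, fact-level hallucination detection.

Motivation: Large language models (LLMs) frequently generate hallucinated content, which poses significant challenges in applications where factuality is crucial.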

Details

Method: Represents text as knowledge-graph facts in the form of triples and computes fine-grained hallucination scores by analyzing factual consistency across multiple LLM responses, without external resources or training data. Result: FactSelfCheck performs competitively with leading sampling-based methods and significantly improves hallucination correction: factual content increases by 35%, versus only 8% for the sentence-level method. Conclusion: FactSelfCheck's fine-grained detection enables more precise identification and correction of hallucinated content. Abstract: Large Language Models (LLMs) frequently generate hallucinated content, posing significant challenges for applications where factuality is crucial. While existing hallucination detection methods typically operate at the sentence level or passage level, we propose FactSelfCheck, a novel black-box sampling-based method that enables fine-grained fact-level detection. Our approach represents text as knowledge graphs consisting of facts in the form of triples. Through analyzing factual consistency across multiple LLM responses, we compute fine-grained hallucination scores without requiring external resources or training data. Our evaluation demonstrates that FactSelfCheck performs competitively with leading sampling-based methods while providing more detailed insights. Most notably, our fact-level approach significantly improves hallucination correction, achieving a 35% increase in factual content compared to the baseline, while sentence-level SelfCheckGPT yields only an 8% improvement. The granular nature of our detection enables more precise identification and correction of hallucinated content.
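The abstract does not give the scoring rule, so the following is only a sketch of one way fact-level consistency could be scored: each triple from the main response is checked against the triples extracted from sampled responses, and the score is the fraction of samples that do not contain it. Exact triple matching and the function name are simplifying assumptions.

```python
def fact_hallucination_scores(response_facts, sampled_facts):
    """Score each (subject, relation, object) triple of the main response.

    sampled_facts is a list with one set of triples per sampled response.
    A score of 0.0 means every sample supports the fact; 1.0 means none does.
    """
    scores = {}
    for fact in response_facts:
        support = sum(1 for sample in sampled_facts if fact in sample)
        scores[fact] = 1.0 - support / len(sampled_facts)
    return scores


# Hypothetical triples extracted from one response and four samples.
response = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "won", "Fields Medal"),
]
samples = [
    {("Marie Curie", "born_in", "Warsaw"), ("Marie Curie", "won", "Nobel Prize")},
    {("Marie Curie", "born_in", "Warsaw")},
    {("Marie Curie", "born_in", "Warsaw"), ("Marie Curie", "won", "Nobel Prize")},
    {("Marie Curie", "won", "Nobel Prize")},
]
scores = fact_hallucination_scores(response, samples)
```

Because the score attaches to an individual triple rather than a whole sentence, a correction step can target only the unsupported fact and leave the rest of the sentence alone.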

Ruiyang Ha,Songyi Jiang,Bin Li,Bikang Pan,Yihang Zhu,Junjie Zhang,Xiatian Zhu,Shaogang Gong,Jingya Wang

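Task: Propose the MP-ReID benchmark dataset and the Uni-Prompt ReID framework to address person re-identification in multi-modality, multi-platform scenarios.

Motivation: Conventional person re-identification research is limited to single-modality data from static cameras and cannot handle the complexity of multi-modal signals in real-world scenarios.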

Details

Method: Builds the MP-ReID dataset (RGB, infrared, and thermal imaging data) and proposes the Uni-Prompt ReID framework, with prompts designed specifically for cross-modality and cross-platform scenarios. Result: Uni-Prompt ReID outperforms existing methods, laying a foundation for person ReID research in complex, dynamic environments. Conclusion: The MP-ReID dataset and the Uni-Prompt ReID framework provide an effective solution for multi-modality, multi-platform person re-identification. Abstract: Conventional person re-identification (ReID) research is often limited to single-modality sensor data from static cameras, which fails to address the complexities of real-world scenarios where multi-modal signals are increasingly prevalent. For instance, consider an urban ReID system integrating stationary RGB cameras, nighttime infrared sensors, and UAVs equipped with dynamic tracking capabilities. Such systems face significant challenges due to variations in camera perspectives, lighting conditions, and sensor modalities, hindering effective person ReID. To address these challenges, we introduce the MP-ReID benchmark, a novel dataset designed specifically for multi-modality and multi-platform ReID. This benchmark uniquely compiles data from 1,930 identities across diverse modalities, including RGB, infrared, and thermal imaging, captured by both UAVs and ground-based cameras in indoor and outdoor environments. Building on this benchmark, we introduce Uni-Prompt ReID, a framework with specifically designed prompts, tailored for cross-modality and cross-platform scenarios. Our method consistently outperforms state-of-the-art approaches, establishing a robust foundation for future research in complex and dynamic ReID environments. Our dataset is available at: https://mp-reid.github.io/.

An Iterative Feedback Mechanism for Improving Natural Language Class Descriptions in Open-Vocabulary Object Detection

Louis Y. Kim,Michelle Karker,Victoria Valledor,Seiyoung C. Lee,Karl F. Brzoska,Margaret Duff,Anthony Palladino

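Task: Improve the performance of open-vocabulary object detection models by improving non-technical users' natural language text descriptions of their targets of interest.

Motivation: Advances in open-vocabulary object detection let non-technical users define new classes in natural language at runtime without retraining the model, but the quality of user-provided descriptions can affect detection performance.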

Details

Method: Combines analysis techniques on text embeddings with combinations of embeddings for contrastive examples to provide a feedback mechanism that refines user descriptions. Result: The improvement from the feedback mechanism is demonstrated with multiple publicly available open-vocabulary object detection models. Conclusion: The proposed method improves the quality of non-technical users' descriptions, making object detection models more adaptable and practical. Abstract: Recent advances in open-vocabulary object detection models will enable Automatic Target Recognition systems to be sustainable and repurposed by non-technical end-users for a variety of applications or missions. New, and potentially nuanced, classes can be defined with natural language text descriptions in the field, immediately before runtime, without needing to retrain the model. We present an approach for improving non-technical users' natural language text descriptions of their desired targets of interest, using a combination of analysis techniques on the text embeddings, and proper combinations of embeddings for contrastive examples. We quantify the improvement that our feedback mechanism provides by demonstrating performance with multiple publicly available open-vocabulary object detection models.

R2LDM: An Efficient 4D Radar Super-Resolution Framework Leveraging Diffusion Model

Boyuan Zheng,Shouyi Lu,Renbo Huang,Minqing Huang,Fan Lu,Wei Tian,Guirong Zhuo,Lu Xiong

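Task: Propose R2LDM, a method that generates dense and accurate 4D radar point clouds guided by LiDAR point clouds.

Motivation: Conventional methods based on range images or bird's eye view (BEV) images fail to capture 3D shape information effectively, so a more effective approach is needed.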

Details

Method: Represents LiDAR and 4D radar point clouds with voxel features and proposes a Latent Voxel Diffusion Model (LVDM) and a Latent Point Cloud Reconstruction (LPCR) module. Result: Evaluated on two datasets, R2LDM achieves 6- to 10-fold densification of radar point clouds and significantly improves downstream tasks. Conclusion: R2LDM outperforms existing methods in 4D radar point cloud super-resolution and significantly improves the accuracy of point cloud registration and object detection. Abstract: We introduce R2LDM, an innovative approach for generating dense and accurate 4D radar point clouds, guided by corresponding LiDAR point clouds. Instead of utilizing range images or bird's eye view (BEV) images, we represent both LiDAR and 4D radar point clouds using voxel features, which more effectively capture 3D shape information. Subsequently, we propose the Latent Voxel Diffusion Model (LVDM), which performs the diffusion process in the latent space. Additionally, a novel Latent Point Cloud Reconstruction (LPCR) module is utilized to reconstruct point clouds from high-dimensional latent voxel features. As a result, R2LDM effectively generates LiDAR-like point clouds from paired raw radar data. We evaluate our approach on two different datasets, and the experimental results demonstrate that our model achieves 6- to 10-fold densification of radar point clouds, outperforming state-of-the-art baselines in 4D radar point cloud super-resolution. Furthermore, the enhanced radar point clouds generated by our method significantly improve downstream tasks, achieving up to 31.7% improvement in point cloud registration recall rate and 24.9% improvement in object detection accuracy.
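As a rough illustration of what a voxel representation of a point cloud looks like, the sketch below buckets 3D points into a regular grid and keeps one centroid per occupied voxel. The centroid is a hand-crafted stand-in chosen for this example; the voxel features in the paper are learned, and the function name and voxel size are assumptions.

```python
import math


def voxelize(points, voxel_size):
    """Group 3D points by integer voxel index and return each voxel's centroid.

    points is an iterable of (x, y, z) tuples; the result maps
    (ix, iy, iz) -> centroid of the points that fall in that voxel.
    """
    buckets = {}
    for p in points:
        key = tuple(math.floor(c / voxel_size) for c in p)
        buckets.setdefault(key, []).append(p)
    return {
        key: tuple(sum(axis) / len(pts) for axis in zip(*pts))
        for key, pts in buckets.items()
    }


# Two nearby points share a voxel; the third lands in a neighbouring one.
cloud = [(0.1, 0.1, 0.1), (0.3, 0.3, 0.3), (1.2, 0.0, 0.0)]
voxels = voxelize(cloud, voxel_size=1.0)
```

Unlike a range image or BEV projection, the voxel index keeps all three spatial axes, which is the property the abstract highlights for capturing 3D shape.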

OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement

Yihe Deng,Hritik Bansal,Fan Yin,Nanyun Peng,Wei Wang,Kai-Wei Chang

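Task: Investigate whether complex reasoning abilities such as self-verification and self-correction can be successfully integrated into large vision-language models (LVLMs), and assess their impact on multimodal reasoning tasks.

Motivation: Building on DeepSeek-R1's success in achieving complex reasoning in large language models (LLMs) through reinforcement learning, explore whether a similar approach is feasible for vision-language models.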

Details

Method: Iterates between supervised fine-tuning (SFT) and reinforcement learning (RL): reasoning capabilities are distilled from pure-text models and then further refined with RL. Result: The resulting model, OpenVLThinker, shows consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision. Conclusion: The results demonstrate the strategy's potential for vision-language reasoning and offer a new direction for multimodal reasoning tasks. Abstract: Recent advancements demonstrated by DeepSeek-R1 have shown that complex reasoning abilities in large language models (LLMs), including sophisticated behaviors such as self-verification and self-correction, can be achieved by RL with verifiable rewards, significantly improving model performance on challenging tasks such as AIME. Motivated by these findings, our study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks. We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and Reinforcement Learning (RL) to further improve model generalization. Initially, reasoning capabilities were distilled from pure-text R1 models by generating reasoning steps using high-quality captions of the images sourced from diverse visual datasets. Subsequently, iterative RL training further enhances reasoning skills, with each iteration's RL-improved model generating refined SFT datasets for the next round. This iterative process yielded OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrating the potential of our strategy for robust vision-language reasoning. The code, model and data are held at https://github.com/yihedeng9/OpenVLThinker.

GAA-TSO: Geometry-Aware Assisted Depth Completion for Transparent and Specular Objects

Yizhe Liu,Tong Jia,Da Cai,Hao Wang,Dongyue Chen

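Task: Accurately restore the depth information of transparent and specular objects.

Motivation: Depth information for transparent and specular objects is usually incomplete and inaccurate, which poses challenges for robotics tasks.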

Details

Method: Proposes a geometry-aware assisted depth completion method that extracts 2D features from the RGB-D input and 3D structural features from a point cloud, with cross-modal fusion modules and an adaptive correlation aggregation strategy. Result: The method outperforms existing methods on multiple datasets and significantly improves robotic grasping performance. Conclusion: By incorporating 3D geometric information, the method effectively addresses depth completion for transparent and specular objects. Abstract: Transparent and specular objects are frequently encountered in daily life, factories, and laboratories. However, due to the unique optical properties, the depth information on these objects is usually incomplete and inaccurate, which poses significant challenges for downstream robotics tasks. Therefore, it is crucial to accurately restore the depth information of transparent and specular objects. Previous depth completion methods for these objects usually use RGB information as an additional channel of the depth image to perform depth prediction. Due to the poor-texture characteristics of transparent and specular objects, these methods that rely heavily on color information tend to generate structure-less depth predictions. Moreover, these 2D methods cannot effectively explore the 3D structure hidden in the depth channel, resulting in depth ambiguity. To this end, we propose a geometry-aware assisted depth completion method for transparent and specular objects, which focuses on exploring the 3D structural cues of the scene. Specifically, besides extracting 2D features from RGB-D input, we back-project the input depth to a point cloud and build the 3D branch to extract hierarchical scene-level 3D structural features. To exploit 3D geometric information, we design several gated cross-modal fusion modules to effectively propagate multi-level 3D geometric features to the image branch. In addition, we propose an adaptive correlation aggregation strategy to appropriately assign 3D features to the corresponding 2D features. Extensive experiments on ClearGrasp, OOD, TransCG, and STD datasets show that our method outperforms other state-of-the-art methods. We further demonstrate that our method significantly enhances the performance of downstream robotic grasping tasks.
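The back-projection of a depth map to a point cloud mentioned in the abstract follows the standard pinhole camera model. A minimal sketch, assuming known intrinsics `fx`, `fy`, `cx`, `cy` and treating non-positive depth as missing (both are assumptions of this example, not details from the paper):

```python
def back_project(depth, fx, fy, cx, cy):
    """Back-project a depth map to 3D points in the camera frame.

    depth is a list of rows of depth values; pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx, y = (v - cy) * z / fy. Pixels with depth <= 0 are
    treated as missing and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# A tiny 2 x 2 depth map with two missing pixels and toy intrinsics.
depth_map = [[0.0, 2.0],
             [1.0, 0.0]]
cloud = back_project(depth_map, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

For transparent or specular surfaces, many depth pixels are missing or wrong, so the resulting cloud is sparse exactly where completion is needed; the 3D branch works on whatever valid structure remains.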

Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval

Yuanmin Tang,Jing Yu,Keke Gai,Jiamin Zhuang,Gang Xiong,Gaopeng Gou,Qi Wu

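Task: Achieve accurate retrieval in zero-shot composed image retrieval (ZS-CIR) when the reference image is missing target content.

Motivation: ZS-CIR tasks span a broad range of visual content manipulation intents, but the reference image may lack essential target content, which makes retrieval difficult.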

Details

Method: Proposes PrediCIR, which predicts the missing visual content in the latent space before mapping, combining a world view generation module with a target content prediction module. Result: Strong results on six ZS-CIR tasks, with gains of 1.73% to 4.45% and new state-of-the-art performance. Conclusion: By adaptively predicting missing content, PrediCIR significantly improves ZS-CIR performance. Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intent across domain, scene, object, and attribute. The key challenge for ZS-CIR tasks is to modify a reference image according to manipulation text to accurately retrieve a target image, especially when the reference image is missing essential target content. In this paper, we propose a novel prediction-based mapping network, named PrediCIR, to adaptively predict the missing target visual content in reference images in the latent space before mapping for accurate ZS-CIR. Specifically, a world view generation module first constructs a source view by omitting certain visual content of a target view, coupled with an action that includes the manipulation intent derived from existing image-caption pairs. Then, a target content prediction module trains a world model as a predictor to adaptively predict the missing visual information guided by user intention in manipulating text at the latent space. The two modules map an image with the predicted relevant information to a pseudo-word token without extra supervision. Our model shows strong generalization ability on six ZS-CIR tasks. It obtains consistent and significant performance boosts ranging from 1.73% to 4.45% over the best methods and achieves new state-of-the-art results on ZS-CIR. Our code is available at https://github.com/Pter61/predicir.

Beyond Accuracy: What Matters in Designing Well-Behaved Models?

Robin Hesse,Doğukan Bağcı,Bernt Schiele,Simone Schaub-Meyer,Stefan Roth

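Task: Study how deep neural networks (DNNs) perform across nine different quality dimensions in image classification.

Motivation: Existing studies usually focus on the predictive performance of DNNs and neglect other critical quality dimensions such as robustness, calibration, and fairness; this study aims to fill that gap.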

Details

Method: Analyzes 326 backbone models in a large-scale study of how different training paradigms and model architectures affect the quality dimensions. Result: Vision-language models show high fairness and strong robustness to domain changes; self-supervised learning improves almost all considered quality dimensions; and training dataset size is a major driver for most quality dimensions. Conclusion: Introduces the QUBA score (Quality Understanding Beyond Accuracy) for ranking and recommending models across multiple quality dimensions. Abstract: Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, or fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of "well-behavedness" of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird's-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect the quality dimensions. We reveal various new insights, such as that (i) vision-language models exhibit high fairness on ImageNet-1k classification and strong robustness against domain changes; (ii) self-supervised learning is an effective training paradigm to improve almost all considered quality dimensions; and (iii) the training dataset size is a major driver for most of the quality dimensions. We conclude our study by introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple dimensions of quality, enabling tailored recommendations based on specific user needs.

R-LiViT: A LiDAR-Visual-Thermal Dataset Enabling Vulnerable Road User Focused Roadside Perception

Jonas Mirlach,Lei Wan,Andreas Wiedholz,Hannan Ejaz Keen,Andreas Eich

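Task: Present and publicly release R-LiViT, a multimodal dataset combining LiDAR, RGB, and thermal imaging, focused on VRU detection from a roadside perspective.

Motivation: Address the scarcity of thermal imaging in existing datasets and improve the safety and accuracy of VRU detection in extreme lighting conditions.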

Details

Method: Captures three intersections by day and night from a roadside perspective, integrating LiDAR, RGB, and thermal sensors to provide temporally and spatially aligned multimodal data. Result: R-LiViT contains 10,000 LiDAR frames and 2,400 RGB and thermal images covering over 150 traffic scenarios, with 6-class and 8-class annotations. Conclusion: R-LiViT is a comprehensive resource for multimodal perception tasks, and the dataset and code are publicly released to support research. Abstract: In autonomous driving, the integration of roadside perception systems is essential for overcoming occlusion challenges and enhancing the safety of Vulnerable Road Users (VRUs). While LiDAR and visual (RGB) sensors are commonly used, thermal imaging remains underrepresented in datasets, despite its acknowledged advantages for VRU detection in extreme lighting conditions. In this paper, we present R-LiViT, the first dataset to combine LiDAR, RGB, and thermal imaging from a roadside perspective, with a strong focus on VRUs. R-LiViT captures three intersections during both day and night, ensuring a diverse dataset. It includes 10,000 LiDAR frames and 2,400 temporally and spatially aligned RGB and thermal images across over 150 traffic scenarios, with 6 and 8 annotated classes respectively, providing a comprehensive resource for tasks such as object detection and tracking. The dataset and the code for reproducing our evaluation results are made publicly available.

Temporal-Guided Spiking Neural Networks for Event-Based Human Action Recognition

Siyuan Yang,Shilin Lu,Shizheng Wang,Meng Hwa Er,Zengwei Zheng,Alex C. Kot

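Task: Explore the interplay between spiking neural networks (SNNs) and event-based cameras for privacy-preserving human action recognition (HAR).

Motivation: Event cameras capture only the outlines of motion, which matches SNNs' ability to process spatiotemporal data, but existing work is limited by SNNs' ability to process long-term temporal information.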

Details

Method: Proposes two new frameworks, temporal segment-based SNN (TS-SNN) and 3D convolutional SNN (3D-SNN), and creates a new dataset, FallingDetection-CeleX. Result: The proposed frameworks outperform existing SNN methods on the new dataset and three other neuromorphic datasets. Conclusion: The proposed frameworks effectively address long-term temporal information processing in event-based HAR. Abstract: This paper explores the promising interplay between spiking neural networks (SNNs) and event-based cameras for privacy-preserving human action recognition (HAR). The unique feature of event cameras in capturing only the outlines of motion, combined with SNNs' proficiency in processing spatiotemporal data through spikes, establishes a highly synergistic compatibility for event-based HAR. Previous studies, however, have been limited by SNNs' ability to process long-term temporal information, essential for precise HAR. In this paper, we introduce two novel frameworks to address this: temporal segment-based SNN (TS-SNN) and 3D convolutional SNN (3D-SNN). The TS-SNN extracts long-term temporal information by dividing actions into shorter segments, while the 3D-SNN replaces 2D spatial elements with 3D components to facilitate the transmission of temporal information. To promote further research in event-based HAR, we create a dataset, FallingDetection-CeleX, collected using the high-resolution CeleX-V event camera (1280 × 800), comprising 7 distinct actions. Extensive experimental results show that our proposed frameworks surpass state-of-the-art SNN methods on our newly collected dataset and three other neuromorphic datasets, showcasing their effectiveness in handling long-range temporal information for event-based HAR.

Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models

Davide Berasi,Matteo Farina,Massimiliano Mancini,Elisa Ricci,Nicola Strisciuglio

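Task: Study the compositionality of the visual embedding space in vision-language models (VLMs).

Motivation: Explore whether compositional patterns emerge in the visual embedding space, where noise and sparsity of visual data make the analysis challenging.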

Details

Method: Proposes Geodesically Decomposable Embeddings (GDE), a framework that approximates image representations with geometry-aware compositional structures in the latent space. Result: GDE outperforms its counterpart that assumes linear latent geometry on compositional classification and group robustness, and is particularly effective for group robustness. Conclusion: VLMs can automatically develop a human-like form of compositional reasoning in the visual domain, making their underlying processes more interpretable. Abstract: Vision-Language Models (VLMs) learn a shared feature space for text and images, enabling the comparison of inputs of different modalities. While prior works demonstrated that VLMs organize natural language representations into regular structures encoding composite meanings, it remains unclear if compositional patterns also emerge in the visual embedding space. In this work, we investigate compositionality in the image domain, where the analysis of compositional properties is challenged by noise and sparsity of visual data. We address these problems and propose a framework, called Geodesically Decomposable Embeddings (GDE), that approximates image representations with geometry-aware compositional structures in the latent space. We demonstrate that visual embeddings of pre-trained VLMs exhibit a compositional arrangement, and evaluate the effectiveness of this property in the tasks of compositional classification and group robustness. GDE achieves stronger performance in compositional classification compared to its counterpart method that assumes linear geometry of the latent space. Notably, it is particularly effective for group robustness, where we achieve higher results than task-specific solutions. Our results indicate that VLMs can automatically develop a human-like form of compositional reasoning in the visual domain, making their underlying processes more interpretable. Code is available at https://github.com/BerasiDavide/vlm_image_compositionality.

Enhancing Steering Estimation with Semantic-Aware GNNs

Fouad Makiyeh,Huy-Dung Nguyen,Patrick Chareyre,Ramin Hasani,Marc Blanchon,Daniela Rus

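Task: Explore the advantages of hybrid architectures (3D neural networks combined with RNNs) that incorporate 3D spatial information for steering estimation in autonomous driving.

Motivation: Traditional methods rely on 2D image-based models, whereas 3D spatial information may yield more accurate steering estimates.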

Details

Method: Evaluates four hybrid 3D models on LiDAR point clouds and introduces pseudo-3D point clouds (from monocular depth estimation) to reduce reliance on LiDAR; the graph construction strategy is optimized to cut computational cost. Result: The GNN-RNN model performs best; the pseudo-3D model matches or exceeds the LiDAR-based model; and on KITTI the approach improves on 2D-only models by 71%. Conclusion: 3D spatial information and efficient graph construction significantly improve steering estimation while retaining the cost advantage of monocular images. Abstract: Steering estimation is a critical task in autonomous driving, traditionally relying on 2D image-based models. In this work, we explore the advantages of incorporating 3D spatial information through hybrid architectures that combine 3D neural network models with recurrent neural networks (RNNs) for temporal modeling, using LiDAR-based point clouds as input. We systematically evaluate four hybrid 3D models, all of which outperform the 2D-only baseline, with the Graph Neural Network (GNN) - RNN model yielding the best results. To reduce reliance on LiDAR, we leverage a pretrained unified model to estimate depth from monocular images, reconstructing pseudo-3D point clouds. We then adapt the GNN-RNN model, originally designed for LiDAR-based point clouds, to work with these pseudo-3D representations, achieving comparable or even superior performance compared to the LiDAR-based model. Additionally, the unified model provides semantic labels for each point, enabling a more structured scene representation. To further optimize graph construction, we introduce an efficient connectivity strategy where connections are predominantly formed between points of the same semantic class, with only 20% of inter-class connections retained. This targeted approach reduces graph complexity and computational cost while preserving critical spatial relationships. Finally, we validate our approach on the KITTI dataset, achieving a 71% improvement over 2D-only models. Our findings highlight the advantages of 3D spatial information and efficient graph construction for steering estimation, while maintaining the cost-effectiveness of monocular images and avoiding the expense of LiDAR-based systems.
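The semantic-aware connectivity strategy can be sketched as follows. The abstract states only that same-class connections dominate and 20% of inter-class connections are retained; the radius-based neighbourhood, the random sub-sampling of inter-class edges, and all names below are assumptions made for illustration.

```python
import math
import random


def build_semantic_graph(points, labels, radius, inter_class_keep=0.2, seed=0):
    """Build an undirected edge list over labelled 3D points.

    Points closer than `radius` are candidate neighbours. Same-class pairs
    are always connected; cross-class pairs are kept with probability
    `inter_class_keep` (0.2 mirrors the 20% quoted in the abstract).
    """
    rng = random.Random(seed)
    edges = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) > radius:
                continue
            if labels[i] == labels[j] or rng.random() < inter_class_keep:
                edges.append((i, j))
    return edges


# Hypothetical pseudo-3D points with per-point semantic labels.
pts = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (5, 5, 5)]
lbl = ["road", "road", "car", "car"]
edges = build_semantic_graph(pts, lbl, radius=1.5)
```

Dropping most cross-class edges shrinks the graph that the GNN has to process while keeping each object's internal structure densely connected.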

D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens

Panpan Wang,Liqiang Niu,Fandong Meng,Jinan Xu,Yufeng Chen,Jie Zhou

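Task: Propose D2C, a two-stage method for enhancing the generation capacity of image generation models.

Motivation: Explore the potential of combining discrete-valued and continuous-valued tokens in image generation to address the quality or efficiency shortcomings of existing models.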

Details

Method: Uses two stages: the first samples coarse-grained features with a small discrete-valued generator, and the second learns fine-grained continuous-valued features conditioned on the discrete token sequence; two fusion modules provide seamless interaction. Result: On the ImageNet-256 benchmark, D2C outperforms several continuous-valued and discrete-valued generative models on class-conditional image generation. Conclusion: By combining the strengths of discrete and continuous tokens, D2C significantly improves image generation quality and efficiency. Abstract: In the domain of image generation, latent-based generative models occupy a dominant status; however, these models rely heavily on the image tokenizer. To meet modeling requirements, autoregressive models possessing the characteristics of scalability and flexibility embrace a discrete-valued tokenizer, but face the challenge of poor image generation quality. In contrast, diffusion models take advantage of the continuous-valued tokenizer to achieve better generation quality but are subject to low efficiency and complexity. The existing hybrid models are mainly to compensate for information loss and simplify the diffusion learning process. The potential of merging discrete-valued and continuous-valued tokens in the field of image generation has not yet been explored. In this paper, we propose D2C, a novel two-stage method to enhance model generation capacity. In the first stage, the discrete-valued tokens representing coarse-grained image features are sampled by employing a small discrete-valued generator. Then in the second stage, the continuous-valued tokens representing fine-grained image features are learned conditioned on the discrete token sequence. In addition, we design two kinds of fusion modules for seamless interaction. On the ImageNet-256 benchmark, extensive experimental results validate that our model achieves superior performance compared with several continuous-valued and discrete-valued generative models on the class-conditional image generation tasks.

CoRLD: Contrastive Representation Learning Of Deformable Shapes In Images

Tonmoy Hossain,Miaomiao Zhang

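Task: Propose CoRLD, a new framework that performs contrastive representation learning in deformation spaces to improve image classification.

Motivation: Existing methods rely on a known template and struggle to capture fine-grained shape differences, which limits their applicability and flexibility.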

Details

Method: Learns in latent deformation spaces with a class-aware contrastive supervised learning objective, eliminating the dependency on a reference image. Result: CoRLD is validated on diverse datasets and substantially boosts classification accuracy. Conclusion: The CoRLD framework is flexible and generalizable to a wide range of medical applications. Abstract: Deformable shape representations, parameterized by deformations relative to a given template, have proven effective for improved image analysis tasks. However, their broader applicability is hindered by two major challenges. First, existing methods mainly rely on a known template during testing, which is impractical and limits flexibility. Second, they often struggle to capture fine-grained, voxel-level distinctions between similar shapes (e.g., anatomical variations among healthy individuals, those with mild cognitive impairment, and diseased states). To address these limitations, we propose a novel framework - Contrastive Representation Learning of Deformable shapes (CoRLD) in learned deformation spaces and demonstrate its effectiveness in the context of image classification. Our CoRLD leverages a class-aware contrastive supervised learning objective in latent deformation spaces, promoting proximity among representations of similar classes while ensuring separation of dissimilar groups. In contrast to previous deep learning networks that require a reference image as input to predict deformation changes, our approach eliminates this dependency. Instead, template images are utilized solely as ground truth in the loss function during the training process, making our model more flexible and generalizable to a wide range of medical applications. We validate CoRLD on diverse datasets, including real brain magnetic resonance images (MRIs) and adrenal shapes derived from computed tomography (CT) scans. Experimental results show that our model effectively extracts deformable shape features, which can be easily integrated with existing classifiers to substantially boost the classification accuracy. Our code is available at GitHub.
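The abstract describes a class-aware contrastive objective without giving its form. One widely used instance is the supervised contrastive (SupCon) loss, sketched below on plain Python lists; treating CoRLD's objective as SupCon, the temperature value, and the function names are all assumptions of this example.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """SupCon-style loss: pull same-class embeddings together, push others apart.

    For each anchor i with at least one same-class positive, average
    -log( exp(s_ip) / sum_{a != i} exp(s_ia) ) over its positives p,
    where s is cosine similarity divided by the temperature.
    """
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n) if a != i
        }
        log_denominator = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[p] - log_denominator for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy latent vectors: two tight clusters.
latents = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = supervised_contrastive_loss(latents, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(latents, [0, 1, 0, 1])
```

The loss is small when labels agree with the cluster structure and large when they cut across it, which is the behaviour the abstract describes as promoting proximity within classes and separation between them.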

Hi-ALPS -- An Experimental Robustness Quantification of Six LiDAR-based Object Detection Systems for Autonomous Driving

Alexandra Arzberger,Ramin Tavakoli Kolagari

Task: Propose Hi-ALPS, a system for quantifying the robustness of 3D object detection systems (OD) under adversarial perturbations.

Motivation: LiDAR is essential to autonomous driving, and 3D OD must withstand many kinds of perturbation, but existing adversarial-example methods cannot quantify robustness accurately.

Details

Method: Develop Hi-ALPS, which tests OD robustness step by step through hierarchical levels of adversarial perturbation. Result: None of the evaluated OD withstands every Hi-ALPS level, while human observers can still recognize the perturbed objects. Conclusion: Hi-ALPS quantifies OD robustness, and the authors discuss existing countermeasures and derive further suggestions. Abstract: Light Detection and Ranging (LiDAR) is an essential sensor technology for autonomous driving as it can capture high-resolution 3D data. As 3D object detection systems (OD) can interpret such point cloud data, they play a key role in the driving decisions of autonomous vehicles. Consequently, such 3D OD must be robust against all types of perturbations and must therefore be extensively tested. One approach is the use of adversarial examples, which are small, sometimes sophisticated perturbations in the input data that change, i.e., falsify, the prediction of the OD. These perturbations are carefully designed based on the weaknesses of the OD. The robustness of the OD cannot be quantified with adversarial examples in general, because if the OD is vulnerable to a given attack, it is unclear whether this is due to the robustness of the OD or whether the attack algorithm produces particularly strong adversarial examples. The contribution of this work is Hi-ALPS (Hierarchical Adversarial-example-based LiDAR Perturbation Level System), where higher robustness of the OD is required to withstand the perturbations as the perturbation levels increase. In doing so, the Hi-ALPS levels successively implement a heuristic followed by established adversarial example approaches. In a series of comprehensive experiments using Hi-ALPS, we quantify the robustness of six state-of-the-art 3D OD under different types of perturbations. The results of the experiments show that none of the OD is robust against all Hi-ALPS levels; an important factor for the ranking is that human observers can still correctly recognize the perturbed objects, as the respective perturbations are small. To increase the robustness of the OD, we discuss the applicability of state-of-the-art countermeasures. In addition, we derive further suggestions for countermeasures based on our experimental results.

Which2comm: An Efficient Collaborative Perception Framework for 3D Object Detection

Duanrui Yu,Jing You,Xin Pei,Anqi Qu,Dingyu Wang,Shaocheng Jia

Task: Propose Which2comm, a multi-agent 3D object detection framework that uses object-level sparse features to address the communication bandwidth limits of collaborative perception.

Motivation: Limited communication bandwidth in practice restricts how much data agents can exchange, which degrades collaborative perception and forces a trade-off between perception performance and communication cost.

Details

Method: Introduce semantic detection boxes (SemDBs) and transmit these information-rich object-level sparse features; build a fully sparse network to extract SemDBs and use temporal fusion to obtain spatiotemporal features. Result: On V2XSet and OPV2V, Which2comm outperforms existing methods in both perception performance and communication cost, and is more robust to real-world latency. Conclusion: For multi-agent collaborative 3D object detection, transmitting only object-level sparse features is enough for high-precision, robust performance. Abstract: Collaborative perception allows real-time inter-agent information exchange and thus offers invaluable opportunities to enhance the perception capabilities of individual agents. However, limited communication bandwidth in practical scenarios restricts the inter-agent data transmission volume, consequently resulting in performance declines in collaborative perception systems. This implies a trade-off between perception performance and communication cost. To address this issue, we propose Which2comm, a novel multi-agent 3D object detection framework leveraging object-level sparse features. By integrating semantic information of objects into 3D object detection boxes, we introduce semantic detection boxes (SemDBs). Transmitting these information-rich object-level sparse features among agents not only significantly reduces the communication volume, but also improves 3D object detection performance. Specifically, a fully sparse network is constructed to extract SemDBs from individual agents; a temporal fusion approach with a relative temporal encoding mechanism is utilized to obtain the comprehensive spatiotemporal features. Extensive experiments on the V2XSet and OPV2V datasets demonstrate that Which2comm consistently outperforms other state-of-the-art methods on both perception performance and communication cost, exhibiting better robustness to real-world latency. These results show that for multi-agent collaborative 3D object detection, transmitting only object-level sparse features is sufficient to achieve high-precision and robust performance.

Radar-Guided Polynomial Fitting for Metric Depth Estimation

Patrick Rim,Hyoungseob Park,Vadim Ezhov,Jeffrey Moon,Alex Wong

Task: Propose PolyRad, a radar-guided depth estimation method that uses polynomial fitting to convert the scaleless depth predictions of pretrained monocular depth estimation models into metric depth maps.

Motivation: Existing methods rely on complex architectures or expensive sensors; PolyRad instead predicts polynomial coefficients from cheap radar data to adjust depth predictions adaptively, addressing the misalignment of depth between regions in monocular depth estimation.

Details

Method: Introduce a polynomial fitting framework in which radar data predict the polynomial coefficients and first-derivative regularization enforces monotonicity to preserve structural consistency. Result: State-of-the-art performance on nuScenes, ZJU-4DRadarCam, and View-of-Delft, improving MAE by 30.3% and RMSE by 37.2%. Conclusion: With polynomial fitting guided by radar data, PolyRad resolves the scale alignment problem of monocular depth estimation models and clearly outperforms existing methods. Abstract: We propose PolyRad, a novel radar-guided depth estimation method that introduces polynomial fitting to transform scaleless depth predictions from pretrained monocular depth estimation (MDE) models into metric depth maps. Unlike existing approaches that rely on complex architectures or expensive sensors, our method is grounded in a simple yet fundamental insight: using polynomial coefficients predicted from cheap, ubiquitous radar data to adaptively adjust depth predictions non-uniformly across depth ranges. Although MDE models often infer reasonably accurate local depth structure within each object or local region, they may misalign these regions relative to one another, making a linear scale-and-shift transformation insufficient given three or more of these regions. In contrast, PolyRad generalizes beyond linear transformations and is able to correct such misalignments by introducing inflection points. Importantly, our polynomial fitting framework preserves structural consistency through a novel training objective that enforces monotonicity via first-derivative regularization. PolyRad achieves state-of-the-art performance on the nuScenes, ZJU-4DRadarCam, and View-of-Delft datasets, outperforming existing methods by 30.3% in MAE and 37.2% in RMSE.
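
The idea of mapping scaleless depth through a polynomial while discouraging non-monotonic mappings can be sketched as follows. This is a minimal illustration under assumptions: the hinge-style penalty, the function names, and the sampling of depths are hypothetical, and the abstract does not state the exact form of the first-derivative regularizer; in PolyRad the coefficients would come from a radar-conditioned network rather than being supplied by hand.

```python
def poly_eval(coeffs, d):
    """Evaluate c0 + c1*d + c2*d^2 + ... with Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * d + c
    return result


def poly_derivative(coeffs):
    """Coefficients of the first derivative of the polynomial."""
    return [k * c for k, c in enumerate(coeffs)][1:]


def monotonicity_penalty(coeffs, sample_depths):
    """Mean hinge penalty on negative first derivatives (assumed regularizer form).

    The penalty is zero when the mapping is non-decreasing at every sampled
    depth, so the ordering of scaleless depths is preserved in metric depth.
    """
    deriv = poly_derivative(coeffs)
    return sum(max(0.0, -poly_eval(deriv, d)) for d in sample_depths) / len(sample_depths)
```

A degree-1 polynomial recovers the usual scale-and-shift alignment; higher degrees allow the non-uniform adjustment across depth ranges that the abstract motivates, and the penalty keeps that extra freedom from reversing depth order.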

D2Fusion: Dual-domain Fusion with Feature Superposition for Deepfake Detection

Xueqi Qiu,Xingyu Miao,Fan Wan,Haoran Duan,Tejal Shah,Varun Ojha,Yang Long,Rajiv Ranjan

Task: Propose a novel bi-directional attention module and a fine-grained frequency attention module to better explore artifact information in Deepfake detection.

Motivation: Current Deepfake detection methods lack intrinsic interaction between features from different domains, so they cannot fully explore artifact information and detection performance suffers.

Details

Method: Combine a bi-directional attention module in the spatial domain with a fine-grained frequency attention module in the frequency domain, and fuse the features of the two domains with a feature superposition strategy. Result: Clearly outperforms existing methods on five public Deepfake datasets, capturing abnormalities across different manipulations and real-life scenarios. Conclusion: Effective fusion of cross-domain features substantially improves the performance and generalization of Deepfake detection. Abstract: Deepfake detection is crucial for curbing the harm Deepfakes cause to society. However, current Deepfake detection methods fail to thoroughly explore artifact information across different domains due to insufficient intrinsic interactions. These interactions refer to the fusion and coordination after feature extraction processes across different domains, which are crucial for recognizing complex forgery clues. Focusing on more generalized Deepfake detection, in this work, we introduce a novel bi-directional attention module to capture the local positional information of artifact clues from the spatial domain. This enables accurate artifact localization, thus addressing the coarse processing of artifact features. To further address the limitation that the proposed bi-directional attention module may not well capture global subtle forgery information in the artifact feature (e.g., textures or edges), we employ a fine-grained frequency attention module in the frequency domain. By doing so, we can obtain high-frequency information in the fine-grained features, which contains the global and subtle forgery information. Although these features from the diverse domains can be effectively and independently improved, fusing them directly does not effectively improve the detection performance. Therefore, we propose a feature superposition strategy that complements information from spatial and frequency domains. This strategy turns the feature components into the form of wave-like tokens, which are updated based on their phase, such that the distinctions between authentic and artifact features can be amplified. Our method demonstrates significant improvements over state-of-the-art (SOTA) methods on five public Deepfake datasets in capturing abnormalities across different manipulation operations and real-life scenarios.

MSCA-Net: Multi-Scale Context Aggregation Network for Infrared Small Target Detection

Xiaojin Lu,Taoran Yue,Jiaxi Cai,Shibing Chu

Task: Propose MSCA-Net, a network architecture for detecting infrared small targets in complex backgrounds.

Motivation: The low contrast and high noise of infrared images cause key details to be lost during feature extraction, and existing methods are limited in how well they fuse global and local information.

Details

Method: Combine a Multi-Scale Enhanced Detection Attention mechanism (MSEDA), a Positional Convolutional Block Attention Module (PCBAM), and a Channel Aggregation Block (CAB), improving detection through multi-scale feature fusion and global-local feature interaction. Result: mIoU scores of 78.43%, 94.56%, and 67.08% on NUAA-SIRST, NUDT-SIRST, and IRTSD-1K, respectively. Conclusion: MSCA-Net performs well in complex backgrounds and has potential for real-world use. Abstract: Detecting infrared small targets in complex backgrounds remains a challenging task because of the low contrast and high noise levels inherent in infrared images. These factors often lead to the loss of crucial details during feature extraction. Moreover, existing detection methods have limitations in adequately integrating global and local information, which constrains the efficiency and accuracy of infrared small target detection. To address these challenges, this paper proposes a novel network architecture named MSCA-Net, which integrates three key components: Multi-Scale Enhanced Detection Attention mechanism (MSEDA), Positional Convolutional Block Attention Module (PCBAM), and Channel Aggregation Block (CAB). Specifically, MSEDA employs a multi-scale feature fusion attention mechanism to adaptively aggregate information across different scales, enriching feature representation. PCBAM captures the correlation between global and local features through a correlation matrix-based strategy, enabling deep feature interaction. Moreover, CAB redistributes input feature channels, facilitating the efficient transmission of beneficial features and further enhancing the model's detection capability in complex backgrounds. The experimental results demonstrate that MSCA-Net achieves outstanding small target detection performance in complex backgrounds. Specifically, it attains mIoU scores of 78.43%, 94.56%, and 67.08% on the NUAA-SIRST, NUDT-SIRST, and IRTSD-1K datasets, respectively, underscoring its effectiveness and strong potential for real-world applications.

FreeUV: Ground-Truth-Free Realistic Facial UV Texture Recovery via Cross-Assembly Inference Strategy

Xingchao Yang,Takafumi Taketomi,Yuki Endo,Yoshihiro Kanamori

Task: Recover high-quality 3D facial textures from single-view 2D images.

Motivation: Address the challenge of recovering 3D facial textures with limited data and complex facial details such as makeup, wrinkles, and occlusions.

Details

Method: Propose the FreeUV framework, which uses a pretrained Stable Diffusion model and a Cross-Assembly inference strategy: separate networks are trained independently to focus on realistic appearance and structural consistency, then combined at inference to generate coherent textures. Result: FreeUV captures intricate facial features accurately, is robust across diverse poses and occlusions, and surpasses existing methods in experiments. Conclusion: By reducing data requirements, FreeUV offers a scalable way to generate high-fidelity 3D facial textures for real-world scenarios and enables new applications such as local editing. Abstract: Recovering high-quality 3D facial textures from single-view 2D images is a challenging task, especially under constraints of limited data and complex facial details such as makeup, wrinkles, and occlusions. In this paper, we introduce FreeUV, a novel ground-truth-free UV texture recovery framework that eliminates the need for annotated or synthetic UV data. FreeUV leverages a pre-trained Stable Diffusion model alongside a Cross-Assembly inference strategy to fulfill this objective. In FreeUV, separate networks are trained independently to focus on realistic appearance and structural consistency, and these networks are combined during inference to generate coherent textures. Our approach accurately captures intricate facial features and demonstrates robust performance across diverse poses and occlusions. Extensive experiments validate FreeUV's effectiveness, with results surpassing state-of-the-art methods in both quantitative and qualitative metrics. Additionally, FreeUV enables new applications, including local editing, facial feature interpolation, and multi-view texture recovery. By reducing data requirements, FreeUV offers a scalable solution for generating high-fidelity 3D facial textures suitable for real-world scenarios.

A Deep Learning Framework for Visual Attention Prediction and Analysis of News Interfaces

Matthew Kenely,Dylan Seychell,Carl James Debono,Chris Porter

Task: Develop a deep learning framework that combines the SaRa and DeepGaze IIE models to improve Salient Object Ranking (SOR) performance.

Motivation: Competition for attention in news interfaces highlights the need for demographically aware saliency prediction models, while existing datasets are limited in size and demographic representation.

Details

Method: Optimize three key components (saliency map generation, grid segment scoring, and map normalization) and analyze attention patterns through eye-tracking (30 participants) and mouse-tracking (375 participants aged 13-70) experiments. Result: SOR performance improves by 10.7%; age significantly affects attention patterns (p < 0.05); and mouse-tracking data closely approximates eye-tracking behavior (sAUC = 0.86). Conclusion: Saliency studies should prioritize larger, demographically representative samples and report exact demographic distributions. Abstract: News outlets' competition for attention in news interfaces has highlighted the need for demographically-aware saliency prediction models. Despite recent advancements in saliency detection applied to user interfaces (UI), existing datasets are limited in size and demographic representation. We present a deep learning framework that enhances the SaRa (Saliency Ranking) model with DeepGaze IIE, improving Salient Object Ranking (SOR) performance by 10.7%. Our framework optimizes three key components: saliency map generation, grid segment scoring, and map normalization. Through a two-fold experiment using eye-tracking (30 participants) and mouse-tracking (375 participants aged 13-70), we analyze attention patterns across demographic groups. Statistical analysis reveals significant age-based variations (p < 0.05, ε² = 0.042), with older users (36-70) engaging more with textual content and younger users (13-35) interacting more with images. Mouse-tracking data closely approximates eye-tracking behavior (sAUC = 0.86) and identifies UI elements that immediately stand out, validating its use in large-scale studies. We conclude that saliency studies should prioritize gathering data from a larger, demographically representative sample and report exact demographic distributions.

PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction

Ting Sun,Cheng Cui,Yuning Du,Yi Liu

Task: Propose PP-DocLayout, a document layout analysis method that efficiently recognizes 23 types of layout regions.

Motivation: Existing layout detection models struggle with generalization, complex layouts, and real-time performance.

Details

Method: Offer three models of different scales (L, M, S) that target high precision, balance, and high efficiency, respectively; the L model is based on the RT-DETR-L detector. Result: PP-DocLayout-L reaches 90.4% mAP@0.5 at 13.4 ms per page; PP-DocLayout-M reaches 75.2% mAP@0.5 at 12.7 ms per page; PP-DocLayout-S runs at 8.1 ms per page (all on a T4 GPU). Conclusion: PP-DocLayout improves document layout analysis and also provides a solution for constructing high-quality training data, supporting progress in document intelligence and multimodal AI systems. Abstract: Document layout analysis is a critical preprocessing step in document intelligence, enabling the detection and localization of structural elements such as titles, text blocks, tables, and formulas. Despite its importance, existing layout detection models face significant challenges in generalizing across diverse document types, handling complex layouts, and achieving real-time performance for large-scale data processing. To address these limitations, we present PP-DocLayout, which achieves high precision and efficiency in recognizing 23 types of layout regions across diverse document formats. To meet different needs, we offer three models of varying scales. PP-DocLayout-L is a high-precision model based on the RT-DETR-L detector, achieving 90.4% mAP@0.5 and an end-to-end inference time of 13.4 ms per page on a T4 GPU. PP-DocLayout-M is a balanced model, offering 75.2% mAP@0.5 with an inference time of 12.7 ms per page on a T4 GPU. PP-DocLayout-S is a high-efficiency model designed for resource-constrained environments and real-time applications, with an inference time of 8.1 ms per page on a T4 GPU and 14.5 ms on a CPU. This work not only advances the state of the art in document layout analysis but also provides a robust solution for constructing high-quality training data, enabling advancements in document intelligence and multimodal AI systems. Code and models are available at https://github.com/PaddlePaddle/PaddleX.
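
The mAP@0.5 figures quoted above count a predicted layout region as a match only when its intersection-over-union (IoU) with a ground-truth region is at least 0.5. The sketch below shows that overlap test for axis-aligned boxes; the `(x1, y1, x2, y2)` box convention and the function name are assumptions made for illustration, not part of PP-DocLayout's API.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches_at_05(pred_box, gt_box):
    """True when the prediction would count as a match at the 0.5 IoU threshold."""
    return box_iou(pred_box, gt_box) >= 0.5
```

A text block shifted sideways by half its width has an IoU of only one third with its ground truth, so it would not count toward mAP@0.5 even though it overlaps visibly.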

UniCon: Unidirectional Information Flow for Effective Control of Large-Scale Diffusion Models

Fanghua Yu,Jinjin Gu,Jinfan Hu,Zheyuan Li,Chao Dong

Task: Propose UniCon, a new architecture that improves control and training efficiency for adapters of large-scale diffusion models.

Motivation: Existing methods rely on bidirectional interaction between the diffusion model and the control adapter, which raises compute and memory requirements.

Details

Method: UniCon uses a unidirectional flow from the diffusion network to the adapter, and the adapter alone generates the final output, which reduces computational demands. Result: UniCon cuts GPU memory usage by one-third and speeds up training by 2.3 times at the same adapter parameter size, and allows adapters with double the parameter volume of existing ControlNets to be trained. Conclusion: In image conditional generation tasks, UniCon responds precisely to control inputs and shows exceptional generation capability. Abstract: We introduce UniCon, a novel architecture designed to enhance control and efficiency in training adapters for large-scale diffusion models. Unlike existing methods that rely on bidirectional interaction between the diffusion model and control adapter, UniCon implements a unidirectional flow from the diffusion network to the adapter, allowing the adapter alone to generate the final output. UniCon reduces computational demands by eliminating the need for the diffusion model to compute and store gradients during adapter training. Our results indicate that UniCon reduces GPU memory usage by one-third and increases training speed by 2.3 times, while maintaining the same adapter parameter size. Additionally, without requiring extra computational resources, UniCon enables the training of adapters with double the parameter volume of existing ControlNets. In a series of image conditional generation tasks, UniCon has demonstrated precise responsiveness to control inputs and exceptional generation capabilities.

Neuro-Symbolic Scene Graph Conditioning for Synthetic Image Dataset Generation

Giacomo Savazzi,Eugenio Lomurno,Cristian Sbrolli,Agnese Chiatti,Matteo Matteucci

Task: Explore the utility of Neuro-Symbolic conditioning for synthetic image dataset generation, with a focus on improving Scene Graph Generation models.

Motivation: Acquisition costs, privacy constraints, and scarce domain-specific data make synthetic data generation an attractive alternative, but it still lags behind real data in performance. Neuro-Symbolic methods, which combine the learning strength of neural networks with the structured representations of symbolic reasoning, offer a possible remedy.

Details

Method: Investigate whether structured symbolic representations in the form of scene graphs improve synthetic data quality by explicitly encoding relational constraints. Result: Neuro-Symbolic conditioning improves standard Recall metrics by up to 2.59% and No Graph Constraint Recall metrics by up to 2.83% when used for dataset augmentation. Conclusion: Combining Neuro-Symbolic and generative approaches yields synthetic data with complementary structural information that improves model performance, offering a new way to address data scarcity in complex visual reasoning tasks. Abstract: As machine learning models increase in scale and complexity, obtaining sufficient training data has become a critical bottleneck due to acquisition costs, privacy constraints, and data scarcity in specialised domains. While synthetic data generation has emerged as a promising alternative, a notable performance gap remains compared to models trained on real data, particularly as task complexity grows. Concurrently, Neuro-Symbolic methods, which combine neural networks' learning strengths with symbolic reasoning's structured representations, have demonstrated significant potential across various cognitive tasks. This paper explores the utility of Neuro-Symbolic conditioning for synthetic image dataset generation, focusing specifically on improving the performance of Scene Graph Generation models. The research investigates whether structured symbolic representations in the form of scene graphs can enhance synthetic data quality through explicit encoding of relational constraints. The results demonstrate that Neuro-Symbolic conditioning yields significant improvements of up to +2.59% in standard Recall metrics and +2.83% in No Graph Constraint Recall metrics when used for dataset augmentation. These findings establish that merging Neuro-Symbolic and generative approaches produces synthetic data with complementary structural information that enhances model performance when combined with real data, providing a novel approach to overcome data scarcity limitations even for complex visual reasoning tasks.

Leveraging Text-to-Image Generation for Handling Spurious Correlation

Aryan Yazdan Parast,Basim Azam,Naveed Akhtar

Task: Propose a technique that generates training samples with text-to-image (T2I) diffusion models to address spurious correlations in deep neural networks.

Motivation: Deep neural networks perform well when training and test data come from the same distribution but generalize poorly to out-of-distribution samples, because they may rely on spurious correlations between labels and irrelevant image features.

Details

Method: Use textual inversion to compute the token that best describes the causal component of each sample, generate new samples with a language segmentation method and a diffusion model, prune them by prediction probability and attribution score, and retrain the model on the augmented dataset. Result: Across benchmarks, the technique achieves higher worst-group accuracy than existing state-of-the-art methods. Conclusion: Retraining on carefully constructed generated samples reduces the model's reliance on spurious correlations and improves generalization. Abstract: Deep neural networks trained with Empirical Risk Minimization (ERM) perform well when both training and test data come from the same domain, but they often fail to generalize to out-of-distribution samples. In image classification, these models may rely on spurious correlations that often exist between labels and irrelevant features of images, making predictions unreliable when those features do not exist. We propose a technique to generate training samples with text-to-image (T2I) diffusion models for addressing the spurious correlation problem. First, we compute the best describing token for the visual features pertaining to the causal components of samples by a textual inversion mechanism. Then, leveraging a language segmentation method and a diffusion model, we generate new samples by combining the causal component with the elements from other classes. We also meticulously prune the generated samples based on the prediction probabilities and attribution scores of the ERM model to ensure their correct composition for our objective. Finally, we retrain the ERM model on our augmented dataset. This process reduces the model's reliance on spurious correlations by learning from carefully crafted samples in which this correlation does not exist. Our experiments show that across different benchmarks, our technique achieves better worst-group accuracy than the existing state-of-the-art methods.
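
Worst-group accuracy, the metric reported above, is the accuracy on whichever group of samples the model handles worst, where groups are typically formed from combinations of label and spurious attribute. A minimal sketch follows; the function name and the way groups are encoded are assumptions for illustration.

```python
from collections import defaultdict


def worst_group_accuracy(predictions, labels, groups):
    """Return the lowest per-group accuracy.

    predictions, labels, groups: equal-length sequences; groups[i] identifies
    the group of sample i (for example a (label, background) pair).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(predictions, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)
    return min(correct[g] / total[g] for g in total)
```

A model that leans on a spurious feature can have high average accuracy while this number stays low, because the minority groups where the feature is absent or misleading are scored on their own.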

Strong Baseline: Multi-UAV Tracking via YOLOv12 with BoT-SORT-ReID

Yu-Hsi Chen

Task: Propose a method for tracking multiple UAVs in thermal infrared video, built on YOLOv12 and BoT-SORT.

Motivation: Address the low contrast, environmental noise, and small target sizes that make multi-UAV tracking in thermal infrared video difficult.

Details

Method: Use a YOLOv12 and BoT-SORT framework with tailored training and inference strategies. Result: Competitive performance on the metrics of the 4th Anti-UAV Challenge, without contrast enhancement or temporal information fusion. Conclusion: The method is a strong baseline for multi-UAV tracking, and the paper provides implementation details and directions for improvement. Abstract: Detecting and tracking multiple unmanned aerial vehicles (UAVs) in thermal infrared video is inherently challenging due to low contrast, environmental noise, and small target sizes. This paper provides a straightforward approach to address multi-UAV tracking in thermal infrared video, leveraging recent advances in detection and tracking. Instead of relying on the YOLOv5 with DeepSORT pipeline, we present a tracking framework built on YOLOv12 and BoT-SORT, enhanced with tailored training and inference strategies. We evaluate our approach following the metrics from the 4th Anti-UAV Challenge and demonstrate competitive performance. Notably, we achieve strong results without using contrast enhancement or temporal information fusion to enrich UAV features, highlighting our approach as a "Strong Baseline" for the multi-UAV tracking task. We provide implementation details, in-depth experimental analysis, and a discussion of potential improvements. The code is available at https://github.com/wish44165/YOLOv12-BoT-SORT-ReID.

Slide-Level Prompt Learning with Vision Language Models for Few-Shot Multiple Instance Learning in Histopathology

Devavrat Tomar,Guillaume Vray,Dwarikanath Mahapatra,Sudipta Roy,Jean-Philippe Thiran,Behzad Bozorgtabar

Task: Address few-shot classification of histopathology whole slide images (WSIs) using foundational vision-language models (VLMs) and slide-level prompt learning.

Motivation: Conventional multiple instance learning (MIL) methods rely on aggregation functions to derive slide-level predictions from patch representations and need many slide-level labels, while VLM-based methods align visual embeddings with text prompts but lack pathological prior knowledge.

Details

Method: Use pathological prior knowledge to identify the key local tissue types (patches), integrate it into a VLM-based MIL framework, and fine-tune the model with prompt learning. Result: On real-world pathological WSI datasets and in ablation studies, the method outperforms existing MIL- and VLM-based methods in few-shot WSI classification. Conclusion: Combining pathological prior knowledge with prompt learning substantially improves few-shot WSI classification. Abstract: In this paper, we address the challenge of few-shot classification in histopathology whole slide images (WSIs) by utilizing foundational vision-language models (VLMs) and slide-level prompt learning. Given the gigapixel scale of WSIs, conventional multiple instance learning (MIL) methods rely on aggregation functions to derive slide-level (bag-level) predictions from patch representations, which require extensive bag-level labels for training. In contrast, VLM-based approaches excel at aligning visual embeddings of patches with candidate class text prompts but lack essential pathological prior knowledge. Our method distinguishes itself by utilizing pathological prior knowledge from language models to identify crucial local tissue types (patches) for WSI classification, integrating this within a VLM-based MIL framework. Our approach effectively aligns patch images with tissue types, and we fine-tune our model via prompt learning using only a few labeled WSIs per category. Experimentation on real-world pathological WSI datasets and ablation studies highlight our method's superior performance over existing MIL- and VLM-based methods in few-shot WSI classification tasks. Our code is publicly available at https://github.com/LTS5/SLIP.
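
The aggregation functions that conventional MIL uses to turn patch-level scores into a slide-level (bag-level) score can be as simple as mean, max, or top-k pooling. The sketch below shows that generic step only; it is not the paper's prompt-learning method, and the function name and top-k choice are assumptions.

```python
def aggregate_bag(patch_scores, k=None):
    """Aggregate patch scores into one slide-level score.

    k=None gives mean pooling over all patches, k=1 gives max pooling,
    and other values average the k highest-scoring patches.
    """
    ranked = sorted(patch_scores, reverse=True)
    top = ranked if k is None else ranked[:k]
    return sum(top) / len(top)
```

Because a gigapixel slide yields a very large number of patches and only a few may be diagnostic, the choice of aggregation strongly affects the slide-level prediction, which is part of why such methods need many bag-level labels to train well.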

Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras

Shuang Guo,Friedhelm Hamann,Guillermo Gallego

Task: Propose an unsupervised learning framework that jointly estimates optical flow (motion) and image intensity (appearance).

Motivation: Event cameras rely on motion to capture scene appearance, yet existing methods treat the two tasks separately and ignore their inherent connection.

Details

Method: Starting from the event generation model, derive an event-based photometric error and combine it with the contrast maximization framework into a comprehensive loss function. Result: State-of-the-art optical flow (20% and 25% improvement in EPE and AE, respectively, in the unsupervised category) and competitive intensity estimation, particularly in high dynamic range scenarios, with shorter inference time. Conclusion: The framework jointly estimates optical flow and image intensity effectively and efficiently. Abstract: Event cameras rely on motion to obtain information about scene appearance. In other words, for event cameras, motion and appearance are either both observed or both absent, and both are encoded in the output event stream. Previous works consider recovering these two visual quantities as separate tasks, which does not fit with the nature of event cameras and neglects the inherent relations between both tasks. In this paper, we propose an unsupervised learning framework that jointly estimates optical flow (motion) and image intensity (appearance) with a single network. Starting from the event generation model, we newly derive the event-based photometric error as a function of optical flow and image intensity, which is further combined with the contrast maximization framework, yielding a comprehensive loss function that provides proper constraints for both flow and intensity estimation. Exhaustive experiments show that our model achieves state-of-the-art performance for both optical flow (achieves 20% and 25% improvement in EPE and AE respectively in the unsupervised learning category) and intensity estimation (produces competitive results with other baselines, particularly in high dynamic range scenarios). Last but not least, our model achieves shorter inference time than all the other optical flow models and many of the image reconstruction models, while they output only one quantity. Project page: https://github.com/tub-rip/e2fai
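
The event generation model that the derivation starts from can be illustrated for a single pixel: an event fires whenever the log intensity has changed by at least a contrast threshold since the last event. The sketch below is an idealized, noise-free version; the function name, the threshold value, and the sampled-intensity input format are assumptions for illustration.

```python
import math


def generate_events(samples, contrast_threshold=0.2):
    """Simulate events at one pixel from (timestamp, intensity) samples.

    Intensities must be positive. Returns a list of (timestamp, polarity)
    with polarity +1 for brightening and -1 for darkening.
    """
    events = []
    _, first_intensity = samples[0]
    reference = math.log(first_intensity)  # log intensity at the last event
    for timestamp, intensity in samples[1:]:
        delta = math.log(intensity) - reference
        # A large change can cross the threshold several times in one step.
        while abs(delta) >= contrast_threshold:
            polarity = 1 if delta > 0 else -1
            events.append((timestamp, polarity))
            reference += polarity * contrast_threshold
            delta = math.log(intensity) - reference
    return events
```

With a constant intensity the function returns no events at all, which is the sense in which motion and appearance are observed together or not at all: without a brightness change, typically caused by motion, the sensor reports nothing about the scene.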

Physical Plausibility-aware Trajectory Prediction via Locomotion Embodiment

Hiromu Taketsugu,Takeru Oba,Takahiro Maeda,Shohei Nobuhara,Norimichi Ukita

Task: Propose Locomotion Embodiment, a framework that explicitly evaluates the physical plausibility of predicted trajectories to improve Human Trajectory Prediction (HTP) methods.

Motivation: Existing HTP methods use pose cues only implicitly, which leads to implausible predictions; plausibility should be evaluated explicitly under the laws of physics.

Details

Method: Learn locomotion plausibility with a non-differentiable physics simulator, train the HTP network through a differentiable Locomotion Value function, and introduce the Embodied Locomotion loss and the Locomotion Value filter. Result: Experiments show that the method improves existing HTP methods, including the state of the art, across diverse datasets and problem settings. Conclusion: By evaluating physical plausibility explicitly, the Locomotion Embodiment framework improves the prediction quality of HTP methods. Abstract: Humans can predict future human trajectories even from momentary observations by using human pose-related cues. However, previous Human Trajectory Prediction (HTP) methods leverage the pose cues implicitly, resulting in implausible predictions. To address this, we propose Locomotion Embodiment, a framework that explicitly evaluates the physical plausibility of the predicted trajectory by locomotion generation under the laws of physics. While the plausibility of locomotion is learned with a non-differentiable physics simulator, it is replaced by our differentiable Locomotion Value function to train an HTP network in a data-driven manner. In particular, our proposed Embodied Locomotion loss is beneficial for efficiently training a stochastic HTP network using multiple heads. Furthermore, the Locomotion Value filter is proposed to filter out implausible trajectories at inference. Experiments demonstrate that our method enhances even the state-of-the-art HTP methods across diverse datasets and problem settings. Our code is available at: https://github.com/ImIntheMiddle/EmLoco.
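
The inference-time filtering step can be pictured as scoring each candidate trajectory and discarding the low-scoring ones. The sketch below is purely illustrative: `toy_value` is a hypothetical stand-in for the learned Locomotion Value function, and the threshold and the keep-the-best fallback are assumptions rather than details from the paper.

```python
import math


def toy_value(trajectory):
    """Hypothetical plausibility score: higher when step lengths change smoothly."""
    steps = [math.dist(p, q) for p, q in zip(trajectory, trajectory[1:])]
    jumps = [abs(b - a) for a, b in zip(steps, steps[1:])]
    return -max(jumps, default=0.0)


def locomotion_value_filter(trajectories, value_fn, threshold):
    """Keep trajectories whose value reaches the threshold.

    If none qualifies, fall back to the single best-scoring trajectory so
    that a prediction is always returned (an assumed design choice).
    """
    kept = [t for t in trajectories if value_fn(t) >= threshold]
    return kept if kept else [max(trajectories, key=value_fn)]
```

For a stochastic predictor that emits several candidate futures, such a filter removes candidates with abrupt, physically unlikely changes of pace before they reach downstream consumers.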

Recovering Pulse Waves from Video Using Deep Unrolling and Deep Equilibrium Models

Vineet R Shenoy,Suhas Lohit,Hassan Mansour,Rama Chellappa,Tim K. Marks

Task: Combine signal processing and deep learning to estimate the pulse signal and heart rate from facial video.

Motivation: Existing iPPG methods either impose model-based sparse priors or use end-to-end black-box deep learning, and both have limitations.

Details

Method: Combine signal processing and deep learning in an inverse problem framework, learning denoising operators with deep algorithm unfolding and deep equilibrium models. Result: State-of-the-art heart rate estimation on well-known benchmarks with less than one-fifth of the learnable parameters of the closest competing method. Conclusion: The proposed methods improve on existing methods in both accuracy and parameter efficiency and offer a new approach for iPPG. Abstract: Camera-based monitoring of vital signs, also known as imaging photoplethysmography (iPPG), has seen applications in driver-monitoring, perfusion assessment in surgical settings, affective computing, and more. iPPG involves sensing the underlying cardiac pulse from video of the skin and estimating vital signs such as the heart rate or a full pulse waveform. Some previous iPPG methods impose model-based sparse priors on the pulse signals and use iterative optimization for pulse wave recovery, while others use end-to-end black-box deep learning methods. In contrast, we introduce methods that combine signal processing and deep learning methods in an inverse problem framework. Our methods estimate the underlying pulse signal and heart rate from facial video by learning deep-network-based denoising operators that leverage deep algorithm unfolding and deep equilibrium models. Experiments show that our methods can denoise an acquired signal from the face and infer the correct underlying pulse rate, achieving state-of-the-art heart rate estimation performance on well-known benchmarks, all with less than one-fifth the number of learnable parameters as the closest competing method.
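
Once a pulse waveform has been recovered, a heart rate is commonly read off as the dominant frequency within a physiological band. The sketch below shows that classical spectral-peak readout on a synthetic signal; it is a baseline illustration, not the learned denoising operators the abstract describes, and the band limits, function names, and synthetic test signal are assumptions.

```python
import cmath
import math


def synthetic_pulse(bpm, fps, seconds):
    """Clean sinusoid at the given heart rate, for demonstration only."""
    n = int(fps * seconds)
    return [math.sin(2.0 * math.pi * (bpm / 60.0) * t / fps) for t in range(n)]


def estimate_heart_rate_bpm(signal, fps, low_bpm=40.0, high_bpm=180.0):
    """Return the frequency (in beats per minute) with the most spectral power.

    Only DFT bins inside [low_bpm, high_bpm] are considered; returns None
    when the signal is too short for any bin to fall inside the band.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the constant offset
    best_bpm, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        bpm = 60.0 * k * fps / n
        if not (low_bpm <= bpm <= high_bpm):
            continue
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_bpm, best_power = bpm, power
    return best_bpm
```

With a 10-second window at 30 frames per second the frequency resolution is 0.1 Hz, or 6 beats per minute, so in practice longer windows or interpolation around the peak are used for finer estimates.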

HyperNVD: Accelerating Neural Video Decomposition via Hypernetworks

Maria Pilligua,Danna Xue,Javier Vazquez-Corral

Task: Propose a generic video decomposition model based on meta-learning to speed up training on new videos.

Motivation: Existing video-layer decomposition models rely on implicit neural representations (INRs) trained independently for each video, so processing a new video is time-consuming.

Details

Method: Use a hypernetwork architecture that, given a video-encoder embedding, generates the parameters of a compact INR-based neural video decomposition model. Result: Mitigates single-video overfitting and shortens the convergence of decomposition on new videos. Conclusion: The meta-learning strategy makes video decomposition more efficient, which suits video editing in the creative industries. Abstract: Decomposing a video into a layer-based representation is crucial for easy video editing for the creative industries, as it enables independent editing of specific layers. Existing video-layer decomposition models rely on implicit neural representations (INRs) trained independently for each video, making the process time-consuming when applied to new videos. Noticing this limitation, we propose a meta-learning strategy to learn a generic video decomposition model to speed up the training on new videos. Our model is based on a hypernetwork architecture which, given a video-encoder embedding, generates the parameters for a compact INR-based neural video decomposition model. Our strategy mitigates the problem of single-video overfitting and, importantly, shortens the convergence of video decomposition on new, unseen videos. Our code is available at: https://hypernvd.github.io/

An Iterative Feedback Mechanism for Improving Natural Language Class Descriptions in Open-Vocabulary Object Detection

Louis Y. Kim,Michelle Karker,Victoria Valledor,Seiyoung C. Lee,Karl F. Brzoska,Margaret Duff,Anthony Palladino

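Task: Improve the performance of open-vocabulary object detection models by improving non-technical users' natural language descriptions of their targets of interest.

Motivation: Advances in open-vocabulary object detection let non-technical users repurpose Automatic Target Recognition systems for many tasks, but the natural language descriptions they provide may be imprecise, which hurts model performance.

Details

Method: Combine analysis techniques on the text embeddings with suitable combinations of embeddings for contrastive examples to give feedback that improves user descriptions. Result: The improvement from the feedback mechanism is quantified on multiple publicly available open-vocabulary object detection models. Conclusion: The approach improves the quality of non-technical users' descriptions and so makes object detection models more useful in practice. Abstract: Recent advances in open-vocabulary object detection models will enable Automatic Target Recognition systems to be sustainable and repurposed by non-technical end-users for a variety of applications or missions. New, and potentially nuanced, classes can be defined with natural language text descriptions in the field, immediately before runtime, without needing to retrain the model. We present an approach for improving non-technical users' natural language text descriptions of their desired targets of interest, using a combination of analysis techniques on the text embeddings, and proper combinations of embeddings for contrastive examples. We quantify the improvement that our feedback mechanism provides by demonstrating performance with multiple publicly-available open-vocabulary object detection models.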

Exploring a Principled Framework for Deep Subspace Clustering

Xianghan Meng,Zhiyuan Huang,Wei He,Xianbiao Qi,Rong Xiao,Chun-Guang Li

Task: Propose PRO-DSC, a deep subspace clustering framework that learns structured representations and self-expressive coefficients in a unified manner.

Motivation: Real-world data often deviate from the union-of-subspaces assumption, and existing deep subspace clustering algorithms suffer from feature collapse and lack theoretical guarantees.

Details

Method: Adds an effective regularization to the self-expressive model, proves that it prevents feature-space collapse, and shows that under certain conditions the learned optimal representations lie on a union of orthogonal subspaces. Result: Experiments confirm the theoretical findings and show the superior performance of PRO-DSC on deep subspace clustering. Conclusion: PRO-DSC is a theoretically grounded and efficient deep subspace clustering method that addresses the limitations of existing algorithms. Abstract: Subspace clustering is a classical unsupervised learning task, built on a basic assumption that high-dimensional data can be approximated by a union of subspaces (UoS). Nevertheless, real-world data often deviate from the UoS assumption. To address this challenge, state-of-the-art deep subspace clustering algorithms attempt to jointly learn UoS representations and self-expressive coefficients. However, the general framework of the existing algorithms suffers from a catastrophic feature collapse and lacks a theoretical guarantee to learn the desired UoS representation. In this paper, we present a Principled fRamewOrk for Deep Subspace Clustering (PRO-DSC), which is designed to learn structured representations and self-expressive coefficients in a unified manner. Specifically, in PRO-DSC, we incorporate an effective regularization on the learned representations into the self-expressive model, prove that the regularized self-expressive model is able to prevent feature space collapse, and demonstrate that the learned optimal representations under certain conditions lie on a union of orthogonal subspaces. Moreover, we provide a scalable and efficient approach to implement our PRO-DSC and conduct extensive experiments to verify our theoretical findings and demonstrate the superior performance of our proposed deep subspace clustering approach. The code is available at https://github.com/mengxianghan123/PRO-DSC.

Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors

Wonbong Jang,Philippe Weinzaepfel,Vincent Leroy,Lourdes Agapito,Jerome Revaud

Task: Present Pow3r, a new large 3D vision regression model that flexibly handles many input modalities.

Motivation: Existing feed-forward models have no mechanism for exploiting known camera or scene priors; Pow3r integrates auxiliary information (intrinsics, relative pose, dense or sparse depth, and so on) to improve prediction accuracy.

Details

Method: Builds on the DUSt3R paradigm with a transformer architecture, exploits auxiliary information through a lightweight conditioning mechanism, and feeds random subsets of modalities during training so that the model adapts to different levels of priors at test time. Result: State-of-the-art results on 3D reconstruction, depth completion, multi-view depth prediction, multi-view stereo, and multi-view pose estimation. Conclusion: Pow3r effectively exploits all available information and performs strongly across multi-modal 3D vision tasks. Abstract: We present Pow3r, a novel large 3D vision regression model that is highly versatile in the input modalities it accepts. Unlike previous feed-forward models that lack any mechanism to exploit known camera or scene priors at test time, Pow3r incorporates any combination of auxiliary information such as intrinsics, relative pose, dense or sparse depth, alongside input images, within a single network. Building upon the recent DUSt3R paradigm, a transformer-based architecture that leverages powerful pre-training, our lightweight and versatile conditioning acts as additional guidance for the network to predict more accurate estimates when auxiliary information is available. During training we feed the model with random subsets of modalities at each iteration, which enables the model to operate under different levels of known priors at test time. This in turn opens up new capabilities, such as performing inference in native image resolution, or point-cloud completion. Our experiments on 3D reconstruction, depth completion, multi-view depth prediction, multi-view stereo, and multi-view pose estimation tasks yield state-of-the-art results and confirm the effectiveness of Pow3r at exploiting all available information. The project webpage is https://europe.naverlabs.com/pow3r.

Dereflection Any Image with Diffusion Priors and Diversified Data

Jichen Hu,Chen Yang,Zanwei Zhou,Jiemin Fang,Xiaokang Yang,Qi Tian,Wei Shen

Task: Propose Dereflection Any Image, a comprehensive solution for single-image reflection removal.

Motivation: Existing methods are limited by the scarcity of high-quality, diverse data and by insufficient restoration priors, so they generalize poorly to real-world scenes.

Details

Method: An efficient data preparation pipeline and a generalizable model, comprising the Diverse Reflection Removal (DRR) dataset and a diffusion-based framework trained with a three-stage progressive strategy. Result: SOTA performance on common benchmarks and on in-the-wild images, with strong generalization. Conclusion: Through efficient data preparation and a generalizable model, Dereflection Any Image significantly improves the quality and generalization of single-image reflection removal. Abstract: Reflection removal of a single image remains a highly challenging task due to the complex entanglement between target scenes and unwanted reflections. Despite significant progress, existing methods are hindered by the scarcity of high-quality, diverse data and insufficient restoration priors, resulting in limited generalization across various real-world scenarios. In this paper, we propose Dereflection Any Image, a comprehensive solution with an efficient data preparation pipeline and a generalizable model for robust reflection removal. First, we introduce a dataset named Diverse Reflection Removal (DRR) created by randomly rotating reflective mediums in target scenes, enabling variation of reflection angles and intensities, and setting a new benchmark in scale, quality, and diversity. Second, we propose a diffusion-based framework with one-step diffusion for deterministic outputs and fast inference. To ensure stable learning, we design a three-stage progressive training strategy, including reflection-invariant finetuning to encourage consistent outputs across varying reflection patterns that characterize our dataset. Extensive experiments show that our method achieves SOTA performance on both common benchmarks and challenging in-the-wild images, showing superior generalization across diverse real-world scenes.

Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models

Jianing Qi,Jiawei Liu,Hao Tang,Zhigang Zhu

Task: Investigate why vision-language models (VLMs) underperform on spatial reasoning tasks, and propose remedies.

Motivation: VLMs excel at recognizing and describing objects but do poorly at spatial reasoning, such as understanding the relative positions of objects; inspired by the dual-pathway model of human vision, the work explores why they fail.

Details

Method: Interpretability analysis finds that vision embeddings are treated as overly semantic, which masks spatial cues; the proposed interventions include normalizing embedding norms and extracting mid-layer spatial features. Result: On synthetic data and standard benchmarks, the modified models show better spatial reasoning. Conclusion: The study exposes fundamental limitations of current VLM architectures and offers practical suggestions for strengthening structured perception of visual scenes. Abstract: Vision-Language Models (VLMs) excel at identifying and describing objects but struggle with spatial reasoning such as accurately understanding the relative positions of objects. Inspired by the dual-pathway (ventral-dorsal) model of human vision, we investigate why VLMs fail spatial tasks despite strong object recognition capabilities. Our interpretability-driven analysis reveals a critical underlying cause: vision embeddings in VLMs are treated primarily as semantic "bag-of-tokens," overshadowing subtle yet crucial positional cues due to their disproportionately large embedding norms. We validate this insight through extensive diagnostic experiments, demonstrating minimal performance impact when token orders or fine-grained spatial details are removed. Guided by these findings, we propose simple, interpretable interventions, including normalizing vision embedding norms and extracting mid-layer spatially rich features, to restore spatial awareness. Empirical results on both our synthetic data and standard benchmarks demonstrate improved spatial reasoning capabilities, highlighting the value of interpretability-informed design choices. Our study not only uncovers fundamental limitations in current VLM architectures but also provides actionable insights for enhancing structured perception of visual scenes.
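
The norm-normalization intervention named in the abstract can be pictured with a minimal sketch. The abstract does not say what target norm is used, so the choice below (rescaling every vision embedding to the mean norm of the text embeddings) is an assumption, and the function names and toy vectors are hypothetical.

```python
import math

def l2_norm(vector):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vector))

def rescale_to_norm(vectors, target_norm):
    """Rescale each vector so its L2 norm equals target_norm (direction is kept)."""
    rescaled = []
    for v in vectors:
        n = l2_norm(v)
        scale = target_norm / n if n > 0 else 0.0
        rescaled.append([x * scale for x in v])
    return rescaled

# Toy data: vision embeddings with norms far larger than the text embeddings.
text_emb = [[0.3, 0.4], [0.6, 0.8]]          # norms 0.5 and 1.0
vision_emb = [[30.0, 40.0], [0.0, 100.0]]    # norms 50 and 100

# Assumption: match the mean text-embedding norm (0.75 here).
target = sum(l2_norm(v) for v in text_emb) / len(text_emb)
normalized = rescale_to_norm(vision_emb, target)
```

Only the magnitude changes; the direction of each vision embedding, and hence its semantic content, is preserved, which is consistent with the abstract's claim that the large norms, rather than the directions, drown out positional cues.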

Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer

Qingyu Shi,Jianzong Wu,Jinbin Bai,Jiangning Zhang,Lu Qi,Xiangtai Li,Yunhai Tong

Task: Improve video Diffusion Transformer (DiT) models for more effective motion transfer.

Motivation: Existing DiT models do not explicitly separate spatial and temporal information, which makes motion and appearance hard to decouple and limits motion transfer.

Details

Method: Proposes DeT, which smooths DiT features with a temporal kernel and adds explicit supervision along dense trajectories in the latent feature space. Result: DeT achieves the best trade-off between motion fidelity and edit fidelity on MTBench. Conclusion: By adapting DiT models, DeT improves motion transfer and contributes a more comprehensive benchmark and metric. Abstract: The motion transfer task involves transferring motion from a source video to newly generated videos, requiring the model to decouple motion from appearance. Previous diffusion-based methods primarily rely on separate spatial and temporal attention mechanisms within 3D U-Net. In contrast, state-of-the-art video Diffusion Transformers (DiT) models use 3D full attention, which does not explicitly separate temporal and spatial information. Thus, the interaction between spatial and temporal dimensions makes decoupling motion and appearance more challenging for DiT models. In this paper, we propose DeT, a method that adapts DiT models to improve motion transfer ability. Our approach introduces a simple yet effective temporal kernel to smooth DiT features along the temporal dimension, facilitating the decoupling of foreground motion from background appearance. Meanwhile, the temporal kernel effectively captures temporal variations in DiT features, which are closely related to motion. Moreover, we introduce explicit supervision along dense trajectories in the latent feature space to further enhance motion consistency. Additionally, we present MTBench, a general and challenging benchmark for motion transfer. We also introduce a hybrid motion fidelity metric that considers both the global and local motion similarity. Therefore, our work provides a more comprehensive evaluation than previous works. Extensive experiments on MTBench demonstrate that DeT achieves the best trade-off between motion fidelity and edit fidelity.
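
To make "smoothing features along the temporal dimension" concrete, here is a minimal sketch of a 1D kernel applied across frames. The abstract does not describe the kernel's form or whether it is learned, so the fixed averaging weights, the edge-replication padding, and the function name are all assumptions for illustration.

```python
def temporal_smooth(features, kernel):
    """Apply a 1D kernel along the time axis.

    features: list over T frames, each a feature vector (list of floats).
    kernel:   odd-length list of weights; edge frames are replicated as padding.
    """
    num_frames = len(features)
    half = len(kernel) // 2
    dim = len(features[0])
    smoothed = []
    for t in range(num_frames):
        acc = [0.0] * dim
        for k, weight in enumerate(kernel):
            # Clamp the source index so the first/last frame is reused at the edges.
            src = min(max(t + k - half, 0), num_frames - 1)
            for d in range(dim):
                acc[d] += weight * features[src][d]
        smoothed.append(acc)
    return smoothed

# Toy example: a one-dimensional feature that flickers between frames.
frames = [[0.0], [3.0], [0.0], [3.0]]
smoothed = temporal_smooth(frames, [1 / 3, 1 / 3, 1 / 3])
```

The kernel mixes information only across time, never across spatial positions, which is one simple way to treat the temporal axis separately inside a model whose attention is otherwise fully 3D.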

Time-Series U-Net with Recurrence for Noise-Robust Imaging Photoplethysmography

Vineet R. Shenoy,Shaoju Wu,Armand Comas,Tim K. Marks,Suhas Lohit,Hassan Mansour

Task: Estimate the pulse signal from video of the face for remote vital-sign monitoring.

Motivation: Address situations in which contact-based devices are unavailable, too intrusive, or too expensive.

Details

Method: A modular, interpretable pipeline consisting of face and landmark detection, time-series extraction, and pulse signal/heart rate estimation. Result: Best performance on public datasets, with particular robustness to motion and facial occlusion. Conclusion: The algorithm needs no specialized sensors or skin contact and outperforms existing iPPG methods. Abstract: Remote estimation of vital signs enables health monitoring for situations in which contact-based devices are either not available, too intrusive, or too expensive. In this paper, we present a modular, interpretable pipeline for pulse signal estimation from video of the face that achieves state-of-the-art results on publicly available datasets. Our imaging photoplethysmography (iPPG) system consists of three modules: face and landmark detection, time-series extraction, and pulse signal/pulse rate estimation. Unlike many deep learning methods that make use of a single black-box model that maps directly from input video to output signal or heart rate, our modular approach enables each of the three parts of the pipeline to be interpreted individually. The pulse signal estimation module, which we call TURNIP (Time-Series U-Net with Recurrence for Noise-Robust Imaging Photoplethysmography), allows the system to faithfully reconstruct the underlying pulse signal waveform and uses it to measure heart rate and pulse rate variability metrics, even in the presence of motion. When parts of the face are occluded due to extreme head poses, our system explicitly detects such "self-occluded" regions and maintains estimation robustness despite the missing information. Our algorithm provides reliable heart rate estimates without the need for specialized sensors or contact with the skin, outperforming previous iPPG methods on both color (RGB) and near-infrared (NIR) datasets.

OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement

Yihe Deng,Hritik Bansal,Fan Yin,Nanyun Peng,Wei Wang,Kai-Wei Chang

Task: Study how to integrate complex reasoning abilities (such as self-verification and self-correction) into large vision-language models (LVLMs), and assess the effect on multimodal reasoning tasks.

Motivation: Building on DeepSeek-R1's success in achieving complex reasoning in text-only models through reinforcement learning, explore whether a similar approach works for vision-language models.

Details

Method: Iterates supervised fine-tuning (SFT) and reinforcement learning (RL): reasoning ability is distilled from text-only models and then further refined with RL. Result: The resulting OpenVLThinker model shows significantly improved reasoning on challenging benchmarks such as MathVista, MathVerse, and MathVision. Conclusion: Demonstrates the potential of the strategy for vision-language reasoning and offers a new approach to multimodal reasoning tasks. Abstract: Recent advancements demonstrated by DeepSeek-R1 have shown that complex reasoning abilities in large language models (LLMs), including sophisticated behaviors such as self-verification and self-correction, can be achieved by RL with verifiable rewards and significantly improve model performance on challenging tasks such as AIME. Motivated by these findings, our study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks. We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and Reinforcement Learning (RL) to further improve model generalization. Initially, reasoning capabilities were distilled from pure-text R1 models by generating reasoning steps using high-quality captions of the images sourced from diverse visual datasets. Subsequently, iterative RL training further enhances reasoning skills, with each iteration's RL-improved model generating refined SFT datasets for the next round. This iterative process yielded OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrating the potential of our strategy for robust vision-language reasoning. The code, model and data are held at https://github.com/yihedeng9/OpenVLThinker.

Image as an IMU: Estimating Camera Motion from a Single Motion-Blurred Image

Jerred Chen,Ronald Clark

Task: Propose a new framework that uses motion blur as a cue for motion estimation, recovering the instantaneous camera velocity from a single motion-blurred image.

Motivation: Heavy motion blur caused by fast camera motion makes existing camera pose estimation methods fail, yet the blur itself is a rich cue for motion estimation.

Details

Method: Predicts a dense motion flow field and a monocular depth map, then recovers the instantaneous camera velocity by solving a linear least squares problem. Result: On real-world benchmarks, the method achieves state-of-the-art angular and translational velocity estimates, outperforming MASt3R and COLMAP. Conclusion: The framework turns motion blur into a useful cue for motion estimation and significantly improves pose estimation under fast camera motion. Abstract: In many robotics and VR/AR applications, fast camera motions cause a high level of motion blur, causing existing camera pose estimation methods to fail. In this work, we propose a novel framework that leverages motion blur as a rich cue for motion estimation rather than treating it as an unwanted artifact. Our approach works by predicting a dense motion flow field and a monocular depth map directly from a single motion-blurred image. We then recover the instantaneous camera velocity by solving a linear least squares problem under the small motion assumption. In essence, our method produces an IMU-like measurement that robustly captures fast and aggressive camera movements. To train our model, we construct a large-scale dataset with realistic synthetic motion blur derived from ScanNet++v2 and further refine our model by training end-to-end on real data using our fully differentiable pipeline. Extensive evaluations on real-world benchmarks demonstrate that our method achieves state-of-the-art angular and translational velocity estimates, outperforming current methods like MASt3R and COLMAP.
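
The step "recover velocity from flow and depth by linear least squares" can be illustrated with the classical instantaneous motion-field model for a static scene, in which the flow at each pixel is linear in the six velocity components once depth is known. This is a sketch under that textbook model with normalized image coordinates (focal length 1); the paper's exact parameterization, sign conventions, and solver are not given in the abstract, and all function names here are hypothetical.

```python
def motion_rows(x, y, depth):
    """Rows of the linear system for one pixel (normalized coords, focal length 1).

    Unknowns are [vx, vy, vz, wx, wy, wz]: translational and angular velocity.
    """
    inv_z = 1.0 / depth
    row_u = [-inv_z, 0.0, x * inv_z, x * y, -(1.0 + x * x), y]
    row_v = [0.0, -inv_z, y * inv_z, 1.0 + y * y, -x * y, -x]
    return row_u, row_v

def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution

def estimate_velocity(points, flows):
    """Least-squares velocity from (x, y, depth) points and (u, v) flows.

    Accumulates the normal equations (A^T A) theta = A^T b and solves them.
    """
    ata = [[0.0] * 6 for _ in range(6)]
    atb = [0.0] * 6
    for (x, y, z), (u, v) in zip(points, flows):
        for row, observed in zip(motion_rows(x, y, z), (u, v)):
            for i in range(6):
                atb[i] += row[i] * observed
                for j in range(6):
                    ata[i][j] += row[i] * row[j]
    return solve_linear(ata, atb)

# Synthetic check: generate flow from a known velocity, then recover it.
true_vel = [0.2, -0.1, 0.05, 0.01, -0.02, 0.03]
points = [(x / 4.0, y / 4.0, 2.0 + 0.5 * x + 0.25 * y * y)
          for x in range(-2, 3) for y in range(-2, 3)]
flows = []
for x, y, z in points:
    row_u, row_v = motion_rows(x, y, z)
    flows.append((sum(a * b for a, b in zip(row_u, true_vel)),
                  sum(a * b for a, b in zip(row_v, true_vel))))
estimated = estimate_velocity(points, flows)
```

Because depth enters only through the translational columns, metric depth is what fixes the scale of the translational velocity; with noisy real predictions, a robust or weighted solver would likely be preferable to the plain normal equations used here.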

Position: Interactive Generative Video as Next-Generation Game Engine

Jiwen Yu,Yiran Qin,Haoxuan Che,Quande Liu,Xintao Wang,Pengfei Wan,Di Zhang,Xihui Liu

Task: Propose Interactive Generative Video (IGV) as the foundation for Generative Game Engines (GGE), to address the problem of predetermined content in traditional game engines.

Motivation: Modern game development faces creativity and cost challenges because content in traditional game engines is predetermined; breakthroughs in video generation models open new opportunities for game creation.

Details

Method: Drawing on IGV's strengths (high-quality content synthesis, physics-aware modeling, user-controlled interactivity, and so on), proposes a framework for GGE and a hierarchical maturity roadmap (L0-L4). Result: Presents GGE's core modules and maturity roadmap, pointing to a new direction for game development in the AI era. Conclusion: GGE could fundamentally change how games are created and experienced through AI-powered generative systems. Abstract: Modern game development faces significant challenges in creativity and cost due to predetermined content in traditional game engines. Recent breakthroughs in video generation models, capable of synthesizing realistic and interactive virtual environments, present an opportunity to revolutionize game creation. In this position paper, we propose Interactive Generative Video (IGV) as the foundation for Generative Game Engines (GGE), enabling unlimited novel content generation in next-generation gaming. GGE leverages IGV's unique strengths in unlimited high-quality content synthesis, physics-aware world modeling, user-controlled interactivity, long-term memory capabilities, and causal reasoning. We present a comprehensive framework detailing GGE's core modules and a hierarchical maturity roadmap (L0-L4) to guide its evolution. Our work charts a new course for game development in the AI era, envisioning a future where AI-powered generative systems fundamentally reshape how games are created and experienced.

Inclusive STEAM Education: A Framework for Teaching Coding and Robotics to Students with Visually Impairment Using Advanced Computer Vision

Mahmoud Hamash,Md Raqib Khan,Peter Tiernan

Task: Propose a framework that helps students with visual impairments learn coding and robotics within STEAM education.

Motivation: Students with visual impairments face difficulties in coding and robotics, particularly in tracking robot movements and developing spatial awareness.

Details

Method: Uses pre-constructed robots and algorithms (such as maze-solving techniques), employs CLIP to process visual data and convert it into spatial audio prompts, and provides real-time feedback through SLAM. Result: The framework helps students with visual impairments develop coding skills and take part in complex problem-solving tasks. Conclusion: The approach shows the potential of computer vision in special education and improves accessibility and learning experiences in STEAM education. Abstract: STEAM education integrates Science, Technology, Engineering, Arts, and Mathematics to foster creativity and problem-solving. However, students with visual impairments (VI) encounter significant challenges in programming and robotics, particularly in tracking robot movements and developing spatial awareness. This paper presents a framework that leverages pre-constructed robots and algorithms, such as maze-solving techniques, within an accessible learning environment. The proposed system employs Contrastive Language-Image Pre-training (CLIP) to process global camera-captured maze layouts, converting visual data into textual descriptions that generate spatial audio prompts in an Audio Virtual Reality (AVR) system. Students issue verbal commands, which are refined through CLIP, while robot-mounted stereo cameras provide real-time data processed via Simultaneous Localization and Mapping (SLAM) for continuous feedback. By integrating these technologies, the framework empowers VI students to develop coding skills and engage in complex problem-solving tasks. Beyond maze-solving applications, this approach demonstrates the broader potential of computer vision in special education, contributing to improved accessibility and learning experiences in STEAM disciplines.

VocalEyes: Enhancing Environmental Perception for the Visually Impaired through Vision-Language Models and Distance-Aware Object Detection

Kunal Chavan,Keertan Balaji,Spoorti Barigidad,Samba Raju Chiluveru

Task: Propose a real-time system that uses audio descriptions to help visually impaired people perceive their surroundings.

Motivation: Meet the growing demand for assistive technologies that increase the independence and mobility of visually impaired people.

Details

Method: Processes live video with a quantized and fine-tuned Florence-2 model and delivers audio feedback through the lightweight TTS component Parler TTS Mini. Result: The system runs efficiently on low-power edge devices and gives fast, accurate descriptions of objects, pedestrians, and obstacles together with their estimated distances. Conclusion: The system offers visually impaired users a viable and effective navigation aid. Abstract: With an increasing demand for assistive technologies that promote the independence and mobility of visually impaired people, this study suggests an innovative real-time system that gives audio descriptions of a user's surroundings to improve situational awareness. The system acquires live video input and processes it with a quantized and fine-tuned Florence-2 big model, adjusted to 4-bit precision for efficient operation on low-power edge devices such as the NVIDIA Jetson Orin Nano. By transforming the video signal into frames with a 5-frame latency, the model provides rapid and contextually pertinent descriptions of objects, pedestrians, and barriers, together with their estimated distances. The system employs Parler TTS Mini, a lightweight and adaptable Text-to-Speech (TTS) solution, for efficient audio feedback. It accommodates 34 distinct speaker types and enables customization of speech tone, pace, and style to suit user requirements. This study examines the quantization and fine-tuning techniques utilized to modify the Florence-2 model for this application, illustrating how the integration of a compact model architecture with a versatile TTS component improves real-time performance and user experience. The proposed system is assessed based on its accuracy, efficiency, and usefulness, providing a viable option to aid vision-impaired users in navigating their surroundings securely and successfully.

Do Multimodal Large Language Models Understand Welding?

Grigorii Khvatskii,Yong Suk Lee,Corey Angst,Maria Gibbs,Robert Landers,Nitesh V. Chawla

Task: Evaluate how multimodal large language models (MLLMs) perform in skilled production work, specifically welding.

Motivation: Examine the potential and limitations of MLLMs in high-stakes technical domains such as welding.

Details

Method: Uses a dataset of real-world and online weld images annotated by a domain expert to evaluate two state-of-the-art MLLMs, and introduces the WeldPrompt prompting strategy. Result: The models perform better on online images; WeldPrompt improves recall in some contexts but performs inconsistently. Conclusion: MLLMs show promise in technical domains but need fine-tuning, domain-specific data, and more sophisticated prompting strategies to become reliable. Abstract: This paper examines the performance of Multimodal LLMs (MLLMs) in skilled production work, with a focus on welding. Using a novel data set of real-world and online weld images, annotated by a domain expert, we evaluate the performance of two state-of-the-art MLLMs in assessing weld acceptability across three contexts: RV & Marine, Aeronautical, and Farming. While both models perform better on online images, likely due to prior exposure or memorization, they also perform relatively well on unseen, real-world weld images. Additionally, we introduce WeldPrompt, a prompting strategy that combines Chain-of-Thought generation with in-context learning to mitigate hallucinations and improve reasoning. WeldPrompt improves model recall in certain contexts but exhibits inconsistent performance across others. These results underscore the limitations and potentials of MLLMs in high-stakes technical domains and highlight the importance of fine-tuning, domain-specific data, and more sophisticated prompting strategies to improve model reliability. The study opens avenues for further research into multimodal learning in industry applications.

Comprehensive Review of Reinforcement Learning for Medical Ultrasound Imaging

Hanae Elmekki,Saidul Islam,Ahmed Alagha,Hani Sami,Amanda Spilkin,Ehsan Zakeri,Antonela Mariel Zanuttini,Jamal Bentahar,Lyes Kadem,Wen-Fang Xie,Philippe Pibarot,Rabeb Mizouni,Hadi Otrok,Shakti Singh,Azzam Mourad

Task: Propose a comprehensive taxonomy that integrates the stages of the ultrasound (US) process with the reinforcement learning (RL) development pipeline.

Motivation: Address the dependence of ultrasound imaging on human operators and advance the development of fully autonomous ultrasound systems.

Details

Method: Reviews existing research and relates the stages of the ultrasound process to advances in RL solutions. Result: Presents a taxonomy that highlights RL applications in ultrasound and identifies unresolved challenges. Conclusion: RL has potential for building autonomous ultrasound solutions, but further research is needed to overcome current limitations. Abstract: Medical Ultrasound (US) imaging has seen increasing demands over the past years, becoming one of the most preferred imaging modalities in clinical practice due to its affordability, portability, and real-time capabilities. However, it faces several challenges that limit its applicability, such as operator dependency, variability in interpretation, and limited resolution, which are amplified by the low availability of trained experts. This calls for autonomous systems that are capable of reducing the dependency on humans for increased efficiency and throughput. Reinforcement Learning (RL) comes as a rapidly advancing field under Artificial Intelligence (AI) that allows the development of autonomous and intelligent agents that are capable of executing complex tasks through rewarded interactions with their environments. Existing surveys on advancements in the US scanning domain predominantly focus on partially autonomous solutions leveraging AI for scanning guidance, organ identification, plane recognition, and diagnosis. However, none of these surveys explore the intersection between the stages of the US process and the recent advancements in RL solutions. To bridge this gap, this review proposes a comprehensive taxonomy that integrates the stages of the US process with the RL development pipeline. This taxonomy not only highlights recent RL advancements in the US domain but also identifies unresolved challenges crucial for achieving fully autonomous US systems. This work aims to offer a thorough review of current research efforts, highlighting the potential of RL in building autonomous US solutions while identifying limitations and opportunities for further advancements in this field.

Reliable Radiologic Skeletal Muscle Area Assessment -- A Biomarker for Cancer Cachexia Diagnosis

Sabeen Ahmed,Nathan Parker,Margaret Park,Daniel Jeong,Lauren Peres,Evan W. Davis,Jennifer B. Permuth,Erin Siegel,Matthew B. Schabath,Yasin Yilmaz,Ghulam Rasool

Task: Develop an automated tool (SMAART-AI) for longitudinal monitoring of skeletal muscle area (SMA) from CT scans, and combine it with clinical data to predict cancer cachexia.

Motivation: Existing tools lack full automation and have inconsistent accuracy, which limits their integration into clinical workflows.

Details

Method: An end-to-end automated deep learning pipeline (nnU-Net 2D) combined with an uncertainty mechanism and a multi-layer perceptron (MLP) model that predicts cachexia. Result: SMAART-AI reaches a Dice score of 97.80% in testing, with a median absolute SMA error of 2.48%; the MLP model predicts cachexia with 79% precision. Conclusion: Through automation, accuracy, and uncertainty awareness, SMAART-AI provides a reliable tool for early diagnosis of and intervention in cancer cachexia. Abstract: Cancer cachexia is a common metabolic disorder characterized by severe muscle atrophy which is associated with poor prognosis and quality of life. Monitoring skeletal muscle area (SMA) longitudinally through computed tomography (CT) scans, an imaging modality routinely acquired in cancer care, is an effective way to identify and track this condition. However, existing tools often lack full automation and exhibit inconsistent accuracy, limiting their potential for integration into clinical workflows. To address these challenges, we developed SMAART-AI (Skeletal Muscle Assessment-Automated and Reliable Tool-based on AI), an end-to-end automated pipeline powered by deep learning models (nnU-Net 2D) trained on mid-third lumbar level CT images with 5-fold cross-validation, ensuring generalizability and robustness. SMAART-AI incorporates an uncertainty-based mechanism to flag high-error SMA predictions for expert review, enhancing reliability. We combined the SMA, skeletal muscle index, BMI, and clinical data to train a multi-layer perceptron (MLP) model designed to predict cachexia at the time of cancer diagnosis. Tested on the gastroesophageal cancer dataset, SMAART-AI achieved a Dice score of 97.80% ± 0.93%, with SMA estimated across all four datasets in this study at a median absolute error of 2.48% compared to manual annotations with SliceOmatic. Uncertainty metrics (variance, entropy, and coefficient of variation) strongly correlated with SMA prediction errors (0.83, 0.76, and 0.73 respectively). The MLP model predicts cachexia with 79% precision, providing clinicians with a reliable tool for early diagnosis and intervention. By combining automation, accuracy, and uncertainty awareness, SMAART-AI bridges the gap between research and clinical application, offering a transformative approach to managing cancer cachexia.
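
For readers unfamiliar with the two quantities reported above, the following sketch computes a Dice score between two binary masks and a coefficient of variation over repeated area estimates. How SMAART-AI obtains its repeated estimates and what threshold it uses to flag a case are not stated in the abstract, so the ensemble values and the 0.05 cutoff below are purely hypothetical.

```python
import statistics

def dice_score(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for flat binary masks of equal length."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 1.0 if total == 0 else 2.0 * intersection / total

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)

# Toy masks: 3 predicted pixels, 3 true pixels, 2 in common.
pred = [1, 1, 1, 0, 0, 0]
truth = [1, 1, 0, 0, 0, 1]
dice = dice_score(pred, truth)

# Hypothetical repeated SMA estimates (cm^2) for one scan, e.g. from several models.
areas = [150.0, 152.0, 148.0, 150.0]
cv = coefficient_of_variation(areas)

# Hypothetical review rule: flag the case when the estimates disagree too much.
needs_review = cv > 0.05
```

A high Dice score measures overlap with a reference segmentation, which requires ground truth; the coefficient of variation needs none, which is why an uncertainty signal of this kind can be used at deployment time to route doubtful cases to an expert.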

Distributed LLMs and Multimodal Large Language Models: A Survey on Advances, Challenges, and Future Directions

Hadi Amini,Md Jueal Mia,Yasaman Saadati,Ahmed Imteaj,Seyedsina Nabavirazavi,Urmish Thakker,Md Zarif Hossain,Awal Ahmed Fime,S. S. Iyengar

Task: Survey distributed solutions for language models (including LLMs, VLMs, MLLMs, and SLMs) and their key advances.

Motivation: Address the scalability and privacy challenges of language models, and explore distributed computing strategies that improve performance and resource management.

Details

Method: A literature review that analyzes key advances in distributed training, inference, fine-tuning, and deployment, and categorizes the literature by six primary focus areas of decentralization. Result: Summarizes the contributions, limitations, and future areas of improvement of distributed language models and points out gaps in current methods. Conclusion: Future research needs novel solutions that enhance the robustness and applicability of distributed language models. Abstract: Language models (LMs) are machine learning models designed to predict linguistic patterns by estimating the probability of word sequences based on large-scale datasets, such as text. LMs have a wide range of applications in natural language processing (NLP) tasks, including autocomplete and machine translation. Although larger datasets typically enhance LM performance, scalability remains a challenge due to constraints in computational power and resources. Distributed computing strategies offer essential solutions for improving scalability and managing the growing computational demand. Further, the use of sensitive datasets in training and deployment raises significant privacy concerns. Recent research has focused on developing decentralized techniques to enable distributed training and inference while utilizing diverse computational resources and enabling edge AI. This paper presents a survey on distributed solutions for various LMs, including large language models (LLMs), vision language models (VLMs), multimodal LLMs (MLLMs), and small language models (SLMs). While LLMs focus on processing and generating text, MLLMs are designed to handle multiple modalities of data (e.g., text, images, and audio) and to integrate them for broader applications. To this end, this paper reviews key advancements across the MLLM pipeline, including distributed training, inference, fine-tuning, and deployment, while also identifying the contributions, limitations, and future areas of improvement. Further, it categorizes the literature based on six primary focus areas of decentralization. Our analysis describes gaps in current methodologies for enabling distributed solutions for LMs and outlines future research directions, emphasizing the need for novel solutions to enhance the robustness and applicability of distributed LMs.

Utilizing Reinforcement Learning for Bottom-Up part-wise Reconstruction of 2D Wire-Frame Projections

Julian Ziegler,Patrick Frenzel,Mirco Fuchs

Task: Reconstruct all edges of an arbitrary 3D wire-frame model projected onto an image plane.

Motivation: Explore a bottom-up, part-wise procedure based on reinforcement learning for segmenting and reconstructing 2D multipart objects.

Details

Method: Represents the environment state as a four-colour image, manipulates a reconstruction line through a four-dimensional action space, and tests different reward functions (episodic, incremental, and combined) together with action-based and task-based curriculum learning. Result: Combining optimized reward functions with curriculum learning significantly improves training, with the task-based curriculum performing especially well. Conclusion: The method provides an effective framework for similar tasks and points to directions for future research. Abstract: This work concerns itself with the task of reconstructing all edges of an arbitrary 3D wire-frame model projected to an image plane. We explore a bottom-up part-wise procedure undertaken by an RL agent to segment and reconstruct these 2D multipart objects. The environment's state is represented as a four-colour image, where different colours correspond to background, a target edge, a reconstruction line, and the overlap of both. At each step, the agent can transform the reconstruction line within a four-dimensional action space or terminate the episode using a specific termination action. To investigate the impact of reward function formulations, we tested episodic and incremental rewards, as well as combined approaches. Empirical results demonstrated that the latter yielded the most effective training performance. To further enhance efficiency and stability, we introduce curriculum learning strategies. First, an action-based curriculum was implemented, where the agent was initially restricted to a reduced action space, being able to only perform three of the five possible actions, before progressing to the full action space. Second, we test a task-based curriculum, where the agent first solves a simplified version of the problem before being presented with the full, more complex task. This second approach produced promising results, as the agent not only successfully transitioned from learning the simplified task to mastering the full task, but in doing so gained significant performance. This study demonstrates the potential of an iterative RL wire-frame reconstruction in two dimensions. By combining optimized reward function formulations with curriculum learning strategies, we achieved significant improvements in training success. The proposed methodology provides an effective framework for solving similar tasks and represents a promising direction for future research in the field.

TriTex: Learning Texture from a Single Mesh via Triplane Semantic Features

Dana Cohen-Bar,Daniel Cohen-Or,Gal Chechik,Yoni Kasten

Task: Propose a new method that learns a volumetric texture field from a single textured mesh, enabling semantic-aware texture transfer.

Motivation: Address the difficulty existing methods have in preserving the appearance of the source texture during texture transfer.

Details

Method: Uses an efficient triplane-based architecture that maps semantic features to surface colors. Result: High-quality texture transfer across diverse shapes with fast inference. Conclusion: The method offers a practical solution for single-example texture transfer, suited to applications such as game development and simulation. Abstract: As 3D content creation continues to grow, transferring semantic textures between 3D meshes remains a significant challenge in computer graphics. While recent methods leverage text-to-image diffusion models for texturing, they often struggle to preserve the appearance of the source texture during texture transfer. We present TriTex, a novel approach that learns a volumetric texture field from a single textured mesh by mapping semantic features to surface colors. Using an efficient triplane-based architecture, our method enables semantic-aware texture transfer to a novel target mesh. Despite training on just one example, it generalizes effectively to diverse shapes within the same category. Extensive evaluation on our newly created benchmark dataset shows that TriTex achieves superior texture transfer quality and fast inference times compared to existing methods. Our approach advances single-example texture transfer, providing a practical solution for maintaining visual coherence across related 3D models in applications like game development and simulation.

Fed-NDIF: A Noise-Embedded Federated Diffusion Model For Low-Count Whole-Body PET Denoising

Yinchi Zhou,Huidong Xie,Menghua Xia,Qiong Liu,Bo Zhou,Tianqi Chen,Jun Hou,Liang Guo,Xinyuan Zheng,Hanzhong Wang,Biao Li,Axel Rominger,Kuangyu Shi,Nicha C. Dvorneka,Chi Liu

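Task: Propose a noise-embedded federated learning diffusion model (Fed-NDIF) that combines diffusion models with federated learning to denoise low-count positron emission tomography (LCPET) images.

Motivation: LCPET imaging reduces patients' radiation exposure but increases image noise and reduces lesion detectability, so effective denoising is needed. Diffusion models show promise for LCPET denoising, but training them requires large, diverse datasets that are hard to obtain in the medical domain. Data scarcity and privacy concerns add further challenges.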

Details

Method: Fed-NDIF combines diffusion models with federated learning. Using a multicenter dataset with varying count levels, it embeds liver normalized standard deviation (NSTD) noise into a 2.5D diffusion model, aggregates locally trained models into a global model with the Federated Averaging (FedAvg) algorithm, and then fine-tunes the global model on local datasets to obtain personalized models. Result: Validated on datasets from the University of Bern, Ruijin Hospital in Shanghai, and Yale-New Haven Hospital, Fed-NDIF improves image quality and lesion quantification, with PSNR, SSIM, and NMSE significantly better than local diffusion models and federated UNet-based models. Conclusion: By combining diffusion models with federated learning, Fed-NDIF addresses data scarcity and privacy concerns in LCPET denoising and significantly improves image quality and lesion detectability. Abstract: Low-count positron emission tomography (LCPET) imaging can reduce patients' exposure to radiation but often suffers from increased image noise and reduced lesion detectability, necessitating effective denoising techniques. Diffusion models have shown promise in LCPET denoising for recovering degraded image quality. However, training such models requires large and diverse datasets, which are challenging to obtain in the medical domain. To address data scarcity and privacy concerns, we combine diffusion models with federated learning -- a decentralized training approach where models are trained individually at different sites, and their parameters are aggregated on a central server over multiple iterations. The variation in scanner types and image noise levels within and across institutions poses additional challenges for federated learning in LCPET denoising. In this study, we propose a novel noise-embedded federated learning diffusion model (Fed-NDIF) to address these challenges, leveraging a multicenter dataset and varying count levels. Our approach incorporates liver normalized standard deviation (NSTD) noise embedding into a 2.5D diffusion model and utilizes the Federated Averaging (FedAvg) algorithm to aggregate locally trained models into a global model, which is subsequently fine-tuned on local datasets to optimize performance and obtain personalized models. Extensive validation on datasets from the University of Bern, Ruijin Hospital in Shanghai, and Yale-New Haven Hospital demonstrates the superior performance of our method in enhancing image quality and improving lesion quantification. The Fed-NDIF model shows significant improvements in PSNR, SSIM, and NMSE of the entire 3D volume, as well as enhanced lesion detectability and quantification, compared to local diffusion models and federated UNet-based models.
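
The aggregation step named in the abstract, Federated Averaging, can be illustrated with a minimal sketch. It assumes each site's model is flattened to a plain list of floats and that sites are weighted by their local sample counts, which is the usual FedAvg convention; the function name `fedavg` and the toy numbers are hypothetical and not taken from the paper.

```python
def fedavg(site_weights, site_sizes):
    """Average per-site parameter vectors, weighted by local sample counts."""
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(n_params)
    ]


# Three hypothetical sites, each holding a flattened two-parameter model.
site_weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
site_sizes = [10, 30, 60]  # assumed number of local training samples per site
global_weights = fedavg(site_weights, site_sizes)
```

In a full federated round the global vector would be sent back to every site for further local training; the personalization step described in the abstract would then fine-tune that global model on each site's own data.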

Depth Matters: Multimodal RGB-D Perception for Robust Autonomous Agents

Mihaela-Larisa Clement,Mónika Farsang,Felix Resch,Radu Grosu

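Task: Study how fusing RGB and depth information improves an autonomous agent's ability to predict steering commands.

Motivation: Autonomous agents that rely purely on perception need efficient, robust architectures, and augmenting RGB input with depth information may significantly improve performance.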

Details

Method: Lightweight recurrent controllers use fused RGB-D features for sequential decision-making, trained on high-quality data collected with a small-scale autonomous car driven by an expert. Result: Controllers with early fusion of depth data are highly robust and stay focused on the task even with frame drops and increased noise. Conclusion: Fusing depth information significantly improves the robustness and performance of autonomous agents. Abstract: Autonomous agents that rely purely on perception to make real-time control decisions require efficient and robust architectures. In this work, we demonstrate that augmenting RGB input with depth information significantly enhances our agents' ability to predict steering commands compared to using RGB alone. We benchmark lightweight recurrent controllers that leverage the fused RGB-D features for sequential decision-making. To train our models, we collect high-quality data using a small-scale autonomous car controlled by an expert driver via a physical steering wheel, capturing varying levels of steering difficulty. Our models, trained under diverse configurations, were successfully deployed on real hardware. Specifically, our findings reveal that the early fusion of depth data results in a highly robust controller, which remains effective even with frame drops and increased noise levels, without compromising the network's focus on the task.

SAGE: Semantic-Driven Adaptive Gaussian Splatting in Extended Reality

Chiara Schiavo,Elena Camuffo,Leonardo Badia,Simone Milani

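Task: Present SAGE, a framework that uses semantic segmentation to dynamically adapt the Level of Detail (LOD) of 3D Gaussian Splatting (3DGS) objects and thereby improve the user experience in eXtended Reality (XR) applications.

Motivation: 3DGS has improved the efficiency and realism of 3D scene visualization, but interactive XR applications still need lower memory and computational overhead.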

Details

Method: Semantic segmentation is used to dynamically adjust the LOD of 3DGS objects, balancing visual quality against computational resources. Result: Experiments show that SAGE reduces memory and computational overhead while maintaining the target visual quality. Conclusion: SAGE provides an efficient optimization for interactive XR applications. Abstract: 3D Gaussian Splatting (3DGS) has significantly improved the efficiency and realism of three-dimensional scene visualization in several applications, ranging from robotics to eXtended Reality (XR). This work presents SAGE (Semantic-Driven Adaptive Gaussian Splatting in Extended Reality), a novel framework designed to enhance the user experience by dynamically adapting the Level of Detail (LOD) of different 3DGS objects identified via semantic segmentation. Experimental results demonstrate how SAGE effectively reduces memory and computational overhead while keeping a desired target visual quality, thus providing a powerful optimization for interactive XR applications.

elaTCSF: A Temporal Contrast Sensitivity Function for Flicker Detection and Modeling Variable Refresh Rate Flicker

Yancheng Cai,Ali Bozorgian,Maliha Ashraf,Robert Wanat,Rafał K. Mantiuk

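Task: Extend TCSF$_{IDMS}$ and combine it with a new spatial probability summation model to address the shortcomings of conventional methods in detecting flicker at low spatial frequencies.

Motivation: Conventional approaches such as CFF suit only high-contrast flicker, and the existing standard (TCSF$_{IDMS}$) overlooks key parameters: luminance, eccentricity, and area.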

Details

Method: The elaTCSF model combines TCSF$_{IDMS}$ with a spatial probability summation model and is trained on several flicker detection datasets. Result: The first variable refresh rate (VRR) flicker detection dataset is established and used to verify elaTCSF, and the work helps resolve the debate over flicker visibility in peripheral vision. Conclusion: elaTCSF can predict low-persistence flicker in VR headsets, identify flicker-free VRR operating ranges, and assess flicker sensitivity in lighting design. Abstract: The perception of flicker has been a prominent concern in illumination and electronic display fields for over a century. Traditional approaches often rely on Critical Flicker Frequency (CFF), primarily suited for high-contrast (full-on, full-off) flicker. To tackle varying contrast flicker, the International Committee for Display Metrology (ICDM) introduced a Temporal Contrast Sensitivity Function TCSF$_{IDMS}$ within the Information Display Measurements Standard (IDMS). Nevertheless, this standard overlooks crucial parameters: luminance, eccentricity, and area. Existing models incorporating these parameters are inadequate for flicker detection, especially at low spatial frequencies. To address these limitations, we extend the TCSF$_{IDMS}$ and combine it with a new spatial probability summation model to incorporate the effects of luminance, eccentricity, and area (elaTCSF). We train the elaTCSF on various flicker detection datasets and establish the first variable refresh rate flicker detection dataset for further verification. Additionally, we contribute to resolving a longstanding debate on whether the flicker is more visible in peripheral vision. We demonstrate how elaTCSF can be used to predict flicker due to low-persistence in VR headsets, identify flicker-free VRR operational ranges, and determine flicker sensitivity in lighting design.

Auto-Regressive Diffusion for Generating 3D Human-Object Interactions

Zichen Geng,Zeeshan Hayder,Wei Liu,Ajmal Saeed Mian

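Task: Generate text-driven human-object interaction (Text-to-HOI) sequences while maintaining interaction consistency over long sequences.

Motivation: Existing Text-to-Motion-based approaches, such as discrete motion tokenization, cannot be applied directly to HOI generation because data in this domain is limited and the modality is complex.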

Details

Method: An autoregressive diffusion model (ARDHOI) uses a Contrastive Variational Autoencoder (cVAE) to learn a physically plausible space of continuous HOI tokens, and generates sequences with a Mamba-based context encoder and an MLP-based denoiser. Result: ARDHOI outperforms existing methods on the OMOMO and BEHAVE datasets in both performance and inference speed. Conclusion: ARDHOI is an efficient and robust solution for text-driven HOI tasks. Abstract: Text-driven Human-Object Interaction (Text-to-HOI) generation is an emerging field with applications in animation, video games, virtual reality, and robotics. A key challenge in HOI generation is maintaining interaction consistency in long sequences. Existing Text-to-Motion-based approaches, such as discrete motion tokenization, cannot be directly applied to HOI generation due to limited data in this domain and the complexity of the modality. To address the problem of interaction consistency in long sequences, we propose an autoregressive diffusion model (ARDHOI) that predicts the next continuous token. Specifically, we introduce a Contrastive Variational Autoencoder (cVAE) to learn a physically plausible space of continuous HOI tokens, thereby ensuring that generated human-object motions are realistic and natural. For generating sequences autoregressively, we develop a Mamba-based context encoder to capture and maintain consistent sequential actions. Additionally, we implement an MLP-based denoiser to generate the subsequent token conditioned on the encoded context. Our model has been evaluated on the OMOMO and BEHAVE datasets, where it outperforms existing state-of-the-art methods in terms of both performance and inference speed. This makes ARDHOI a robust and efficient solution for text-driven HOI tasks.

Depth-Aided Color Image Inpainting in Quaternion Domain

Shunki Tatsumi,Ryo Hayakawa,Youji Iiguni

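Task: Propose depth-aided low-rank quaternion matrix completion (D-LRQMC) for color image inpainting in the quaternion domain.

Motivation: Conventional quaternion-based inpainting stores the three color channels in the imaginary parts of the quaternion and leaves the real part at zero, so the representation is underused. Using depth as the real part exploits the correlation between color and depth to improve inpainting.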

Details

Method: The observed image is first restored with conventional LRQMC and its depth is estimated; the estimated depth is then placed in the real part of the observed image and LRQMC is run again. Result: Experiments show that D-LRQMC improves restoration accuracy and visual quality over conventional LRQMC on a variety of images. Conclusion: Depth information is effective for color image processing in the quaternion domain. Abstract: In this paper, we propose a depth-aided color image inpainting method in the quaternion domain, called depth-aided low-rank quaternion matrix completion (D-LRQMC). In conventional quaternion-based inpainting techniques, the color image is expressed as a quaternion matrix by using the three imaginary parts as the color channels, whereas the real part is set to zero and has no information. Our approach incorporates depth information as the real part of the quaternion representations, leveraging the correlation between color and depth to improve the result of inpainting. In the proposed method, we first restore the observed image with the conventional LRQMC and estimate the depth of the restored result. We then incorporate the estimated depth into the real part of the observed image and perform LRQMC again. Simulation results demonstrate that the proposed D-LRQMC can improve restoration accuracy and visual quality for various images compared to the conventional LRQMC. These results suggest the effectiveness of the depth information for color image processing in the quaternion domain.
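
The difference between the conventional and the depth-aided representation can be made concrete with a small sketch. It only builds the quaternion-valued image (as tuples of real, i, j, k components) and does not implement the low-rank completion itself; the function name `to_quaternion_image` and the nested-list image layout are assumptions chosen for illustration.

```python
def to_quaternion_image(rgb, depth=None):
    """Map each RGB pixel to a quaternion (real, i, j, k).

    The color channels fill the three imaginary parts. The real part is 0
    in the conventional encoding, or the per-pixel depth when a depth map
    is supplied (the depth-aided encoding described in the abstract).
    """
    rows = []
    for y, row in enumerate(rgb):
        out = []
        for x, (r, g, b) in enumerate(row):
            real = 0.0 if depth is None else depth[y][x]
            out.append((real, r, g, b))
        rows.append(out)
    return rows


# A hypothetical 1x2 image with values in [0, 1] and an estimated depth map.
rgb = [[(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)]]
depth = [[0.7, 0.9]]
conventional = to_quaternion_image(rgb)
depth_aided = to_quaternion_image(rgb, depth)
```

In the two-pass procedure from the abstract, the depth map would come from a depth estimator applied to the first LRQMC result; here it is simply given.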

Downstream Analysis of Foundational Medical Vision Models for Disease Progression

Basar Demir,Soumitri Chattopadhyay,Thomas Hastings Greer,Boqi Chen,Marc Niethammer

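Task: Evaluate the ability of medical vision foundation models to predict disease progression.

Motivation: The hypothesis is that intermediate-layer features of segmentation models capture structural information, while those of registration models encode knowledge of change over time.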

Details

Method: A simple linear probe measures how useful model features are for disease progression prediction. Result: Registration model features do not require spatially aligned input images, whereas segmentation models need spatial alignment for optimal performance. Conclusion: The findings highlight the importance of spatial alignment and the utility of foundation model features for image registration. Abstract: Medical vision foundational models are used for a wide variety of tasks, including medical image segmentation and registration. This work evaluates the ability of these models to predict disease progression using a simple linear probe. We hypothesize that intermediate layer features of segmentation models capture structural information, while those of registration models encode knowledge of change over time. Beyond demonstrating that these features are useful for disease progression prediction, we also show that registration model features do not require spatially aligned input images. However, for segmentation models, spatial alignment is essential for optimal performance. Our findings highlight the importance of spatial alignment and the utility of foundation model features for image registration.

HSM: Hierarchical Scene Motifs for Multi-Scale Indoor Scene Generation

Hou In Derek Pun,Hou In Ivan Tam,Austin T. Wang,Xiaoliang Huo,Angel X. Chang,Manolis Savva

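Task: Propose a hierarchical framework (HSM) for generating indoor scenes with dense object arrangements.

Motivation: Existing methods focus on large furniture and neglect small objects, producing unrealistic scenes whose arrangements do not follow the text description.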

Details

Method: HSM exploits the hierarchical nature of indoor scenes and uses cross-scale spatial patterns to generate complex, realistic scenes. Result: HSM outperforms existing methods, generating indoor scenes that are more realistic and conform better to user input. Conclusion: The hierarchical framework addresses the challenge of dense object arrangement and improves the realism and consistency of generated scenes. Abstract: Despite advances in indoor 3D scene layout generation, synthesizing scenes with dense object arrangements remains challenging. Existing methods primarily focus on large furniture while neglecting smaller objects, resulting in unrealistically empty scenes. Those that place small objects typically do not honor arrangement specifications, resulting in largely random placement not following the text description. We present HSM, a hierarchical framework for indoor scene generation with dense object arrangements across spatial scales. Indoor scenes are inherently hierarchical, with surfaces supporting objects at different scales, from large furniture on floors to smaller objects on tables and shelves. HSM embraces this hierarchy and exploits recurring cross-scale spatial patterns to generate complex and realistic indoor scenes in a unified manner. Our experiments show that HSM outperforms existing methods by generating scenes that are more realistic and better conform to user input across room types and spatial configurations.

City2Scene: Improving Acoustic Scene Classification with City Features

Yiqiang Cai,Yizhou Tan,Peihong Zhang,Yuxuan Liu,Shengchen Li,Xi Shao,Mark D. Plumbley

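Task: Use city features to improve acoustic scene classification (ASC).

Motivation: Existing ASC methods usually focus on acoustic scene patterns shared across cities and ignore how city-specific environmental and cultural differences affect acoustic features.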

Details

Method: The City2Scene framework uses knowledge distillation to transfer city-specific knowledge from city classification models to a scene classification model. Result: Experiments on the DCASE Challenge Task 1 datasets show that city features carry valuable information for scene classification and improve the accuracy of several state-of-the-art ASC models. Conclusion: City-specific knowledge is valuable for ASC, and City2Scene exploits it effectively to improve classification performance. Abstract: Acoustic scene recordings are often collected from a diverse range of cities. Most existing acoustic scene classification (ASC) approaches focus on identifying common acoustic scene patterns across cities to enhance generalization. In contrast, we hypothesize that city-specific environmental and cultural differences in acoustic features are beneficial for the ASC task. In this paper, we introduce City2Scene, a novel framework that leverages city features to improve ASC. City2Scene transfers the city-specific knowledge from city classification models to a scene classification model using knowledge distillation. We evaluated City2Scene on the DCASE Challenge Task 1 datasets, where each audio clip is annotated with both scene and city labels. Experimental results demonstrate that city features provide valuable information for classifying scenes. By distilling the city-specific knowledge, City2Scene effectively improves accuracy for various state-of-the-art ASC backbone models, including both CNNs and Transformers.
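
The abstract does not say which form of knowledge distillation City2Scene uses. As a generic illustration only, the sketch below computes the common temperature-softened distillation loss, the KL divergence from teacher to student scaled by T squared. It assumes the student exposes logits over the same city label set as the teacher (for example through an auxiliary head), which is a hypothetical setup and not a description of the paper's architecture.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits over three cities for one audio clip.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
loss = distillation_loss(student, teacher)
```

In training, a term like this would typically be added to the ordinary scene classification loss with a weighting factor; the weight and temperature are tuning choices not given in the abstract.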

Joint Extraction Matters: Prompt-Based Visual Question Answering for Multi-Field Document Information Extraction

Mengsay Loem,Taiju Hosaka

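Task: Study how extracting multiple fields jointly compares with extracting each field separately in visual question answering (VQA).

Motivation: Existing methods usually query each field in isolation and ignore potential dependencies between fields, which hurts extraction accuracy.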

Details

Method: Experiments with large vision language models and several datasets compare joint and separate extraction and quantify relationships between fields. Result: Joint extraction often improves accuracy, especially when fields share strong numeric or contextual dependencies. Conclusion: Multi-field prompts can mitigate confusion caused by similar surface forms and related numeric values, offering practical guidance for designing robust VQA systems for document information extraction. Abstract: Visual question answering (VQA) has emerged as a flexible approach for extracting specific pieces of information from document images. However, existing work typically queries each field in isolation, overlooking potential dependencies across multiple items. This paper investigates the merits of extracting multiple fields jointly versus separately. Through experiments on multiple large vision language models and datasets, we show that jointly extracting fields often improves accuracy, especially when the fields share strong numeric or contextual dependencies. We further analyze how performance scales with the number of requested items and use a regression-based metric to quantify inter-field relationships. Our results suggest that multi-field prompts can mitigate confusion arising from similar surface forms and related numeric values, providing practical methods for designing robust VQA systems in document information extraction tasks.
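
The contrast between separate and joint querying can be sketched as two prompt builders. The prompt wording, the field names, and the JSON answer format are all hypothetical choices for illustration; the abstract does not give the prompts used in the experiments.

```python
def separate_prompts(fields):
    """One question per field: each field is queried in isolation."""
    return [f"What is the {name} in this document?" for name in fields]


def joint_prompt(fields):
    """A single question that asks for every field at once."""
    listing = ", ".join(fields)
    return (
        "Extract the following fields from this document and answer as "
        f"JSON with exactly these keys: {listing}."
    )


# Hypothetical receipt fields with numeric dependencies between them.
fields = ["subtotal", "tax", "total"]
per_field = separate_prompts(fields)
combined = joint_prompt(fields)
```

With the joint prompt the model sees related fields such as subtotal, tax, and total together, which is the situation where the abstract reports the largest accuracy gains.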

Lie Detector: Unified Backdoor Detection via Cross-Examination Framework

Xuan Wang,Siyuan Liang,Dongping Liao,Han Fang,Aishan Liu,Xiaochun Cao,Yu-liang Lu,Ee-Chien Chang,Xitong Gao

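Task: Propose a unified backdoor detection framework for the semi-honest setting that cross-examines inconsistencies between models from two independent service providers.

Motivation: Existing detection methods fail to stay accurate across different learning paradigms, and outsourcing model training can introduce security risks.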

Details

Method: Central kernel alignment provides robust feature similarity measurement across model architectures and learning paradigms, and backdoor fine-tuning sensitivity analysis reduces false positives. Result: Experiments show detection accuracy gains of 5.4%, 1.6%, and 11.9% over state-of-the-art baselines on supervised, semi-supervised, and autoregressive learning tasks, respectively. Conclusion: The framework is the first to effectively detect backdoors in multimodal large language models, demonstrating broad applicability and advancing secure deep learning. Abstract: Institutions with limited data and computing resources often outsource model training to third-party providers in a semi-honest setting, assuming adherence to prescribed training protocols with a pre-defined learning paradigm (e.g., supervised or semi-supervised learning). However, this practice can introduce severe security risks, as adversaries may poison the training data to embed backdoors into the resulting model. Existing detection approaches predominantly rely on statistical analyses, which often fail to remain accurate across different learning paradigms. To address this challenge, we propose a unified backdoor detection framework in the semi-honest setting that exploits cross-examination of model inconsistencies between two independent service providers. Specifically, we integrate central kernel alignment to enable robust feature similarity measurements across different model architectures and learning paradigms, thereby facilitating precise recovery and identification of backdoor triggers. We further introduce backdoor fine-tuning sensitivity analysis to distinguish backdoor triggers from adversarial perturbations, substantially reducing false positives. Extensive experiments demonstrate that our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines across supervised, semi-supervised, and autoregressive learning tasks, respectively. Notably, it is the first to effectively detect backdoors in multimodal large language models, further highlighting its broad applicability and advancing secure deep learning.
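
To show what a cross-architecture feature similarity measure looks like, here is a minimal sketch of linear centered kernel alignment (CKA), on the assumption that the abstract's "central kernel alignment" refers to that measure. Features are rows of plain Python lists (one row per input sample); the helper names are hypothetical, and how the paper uses the score to recover triggers is not reproduced.

```python
import math


def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a and b."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two feature matrices with matching rows."""
    xc, yc = center_columns(x), center_columns(y)
    num = cross_gram_sq_norm(xc, yc)
    den = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return num / den


# Features of three inputs from two hypothetical models.
feats_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]
feats_b = [[-2.0, 1.0], [-4.0, 3.0], [-7.0, 5.0]]  # feats_a rotated by 90 degrees
feats_c = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
same_up_to_rotation = linear_cka(feats_a, feats_b)
different = linear_cka(feats_a, feats_c)
```

Because linear CKA is invariant to rotations and uniform scaling of the feature space, two models that encode the same information in different bases still score close to 1, which is what makes it usable across architectures.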

From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech

Ji-Hoon Kim,Jeongsoo Choi,Jaehun Kim,Chaeyoung Jung,Joon Son Chung

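Task: Generate high-quality speech from silent talking-face videos (video-to-speech synthesis).

Motivation: Video-to-speech synthesis faces a substantial modality gap that needs an effective bridging method.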

Details

Method: The video is gradually transformed into acoustic feature spaces through three stages (content, timbre, and prosody modeling), and a flow matching model generates the speech. Result: Experiments show speech quality close to real utterances, well ahead of existing methods. Conclusion: The proposed system bridges the modality gap and significantly improves the quality of synthesized speech. Abstract: The objective of this study is to generate high-quality speech from silent talking face videos, a task also known as video-to-speech synthesis. A significant challenge in video-to-speech synthesis lies in the substantial modality gap between silent video and multi-faceted speech. In this paper, we propose a novel video-to-speech system that effectively bridges this modality gap, significantly enhancing the quality of synthesized speech. This is achieved by learning hierarchical representations from video to speech. Specifically, we gradually transform silent video into acoustic feature spaces through three sequential stages -- content, timbre, and prosody modeling. In each stage, we align visual factors -- lip movements, face identity, and facial expressions -- with corresponding acoustic counterparts to ensure the seamless transformation. Additionally, to generate realistic and coherent speech from the visual representations, we employ a flow matching model that estimates direct trajectories from a simple prior distribution to the target speech distribution. Extensive experiments demonstrate that our method achieves exceptional generation quality comparable to real utterances, outperforming existing methods by a significant margin.
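
The "direct trajectories" mentioned for the flow matching model can be illustrated with the straight-line interpolation variant, in which a prior sample and a data sample are joined by a linear path and the regression target is the constant velocity along it. Whether the paper uses exactly this path is an assumption; the function names and toy vectors are hypothetical.

```python
def flow_matching_pair(x0, x1, t):
    """Return the point at time t on the straight path x0 -> x1 and its velocity.

    x0 is a sample from the simple prior, x1 a sample from the data
    distribution, and t lies in [0, 1].
    """
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return xt, velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    n = len(target_velocity)
    return sum((p - v) ** 2 for p, v in zip(predicted_velocity, target_velocity)) / n


# Toy two-dimensional example: prior sample at the origin, data sample at (2, 4).
xt, target = flow_matching_pair([0.0, 0.0], [2.0, 4.0], t=0.25)
perfect = flow_matching_loss(target, target)
```

A network trained with this loss takes `xt`, `t`, and the conditioning features as input and predicts the velocity; at inference time, integrating the predicted velocity from a prior sample yields a generated sample.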

When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making

Zhe Hu,Jing Li,Yu Yin

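Task: Systematically evaluate open-source Visual Language Models (VLMs) on multimodal human-centered decision-making tasks and propose a new text-only training method.

Motivation: Although VLMs have advanced decision-making, they still struggle in complex human-centered situations that require deep reasoning about human needs and values.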

Details

Method: A text-only training approach based on synthesized textual data, plus a self-improvement mechanism that uses LLM-generated data to strengthen VLMs. Result: VLMs trained on text alone outperform conventionally visually aligned VLMs in multimodal reasoning, and self-improvement yields substantial further gains. Conclusion: The approach is a more efficient and scalable way to improve VLMs' human-centered decision-making and opens a new direction for optimizing VLMs through self-improvement. Abstract: Embodied decision-making is fundamental for AI agents operating in real-world environments. While Visual Language Models (VLMs) have advanced this capability, they still struggle with complex decisions, particularly in human-centered situations that require deep reasoning about human needs and values. In this study, we systematically evaluate open-sourced VLMs on multimodal human-centered decision-making tasks. We find that LLMs receiving only textual descriptions unexpectedly outperform their VLM counterparts of similar scale that process actual images, suggesting that visual alignment may hinder VLM abilities. To address this challenge, we propose a novel text-only training approach with synthesized textual data. This method strengthens VLMs' language components and transfers the learned abilities to multimodal inference, eliminating the need for expensive image-text paired data. Furthermore, we show that VLMs can achieve substantial performance gains through self-improvement, using training data generated by their LLM counterparts rather than relying on larger teacher models like GPT-4. Our findings establish a more efficient and scalable approach to enhancing VLMs' human-centered decision-making capabilities, opening new avenues for optimizing VLMs through self-improvement mechanisms.

High Accuracy Pulmonary Vessel Segmentation for Contrast and Non-contrast CT Images and Its Clinical Evaluation

Ying Ming,Shaoze Luo,Longfei Zhao,Qiqi Xu,Wei Song

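Task: Propose a 3D image segmentation algorithm for automated pulmonary vessel segmentation in CTPA and NCCT images.

Motivation: Clinical practice lacks high-precision pulmonary vessel segmentation algorithms, and segmentation on NCCT images is especially challenging.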

Details

Method: A Vessel Lumen Structure Optimization Module (VLSOM) and a Cl-Dice-Loss are designed, together with a method for generating NCCT vessel ground truth (GT) from CTPA. Result: The model achieves Cl-Recall, Cl-DICE, and Recall values of 0.879, 0.909, 0.934 on CTPA and 0.928, 0.936, 0.955 on NCCT. Conclusion: The model segments pulmonary vessels accurately and completely and has potential for clinical use. Abstract: Accurate segmentation of pulmonary vessels plays a very critical role in diagnosing and assessing various lung diseases. In clinical practice, diagnosis is typically carried out using CTPA images. However, there is a lack of high-precision pulmonary vessel segmentation algorithms for CTPA, and pulmonary vessel segmentation for NCCT poses an even greater challenge. In this study, we propose a 3D image segmentation algorithm for automated pulmonary vessel segmentation from both contrast and non-contrast CT images. In the network, we designed a Vessel Lumen Structure Optimization Module (VLSOM), which extracts the centerline of vessels and adjusts the weights based on the positional information, and added a Cl-Dice-Loss to supervise the stability of the vessel structure. In addition, we designed a method for generating vessel GT from CTPA to NCCT for training models that support both CTPA and NCCT. In this work, we used 427 sets of high-precision annotated CT data from multiple vendors and countries. Finally, our experimental model achieved Cl-Recall, Cl-DICE and Recall values of 0.879, 0.909, 0.934 (CTPA) and 0.928, 0.936, 0.955 (NCCT) respectively. This shows that our model has achieved good performance in both accuracy and completeness of pulmonary vessel segmentation. In clinical visual evaluation, our model also had good segmentation performance on various disease types and can assist doctors in medical diagnosis, verifying the great potential of this method in clinical application.

Specifying What You Know or Not for Multi-Label Class-Incremental Learning

Aoting Zhang,Dongbao Yang,Chang Liu,Xiaopeng Hong,Yu Zhou

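Task: Propose HCP, a new framework that addresses the blurred distinction between known and unknown knowledge in multi-label class-incremental learning (MLCIL).

Motivation: Existing class-incremental learning methods target single-label classification and cannot handle the contradictory learning objectives that incomplete labels create in multi-label scenarios.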

Details

Method: Known classes are clarified through dynamic feature purification and recall enhancement with a distribution prior, while prospective knowledge mining probes unknown classes. Result: Experiments show that HCP alleviates catastrophic forgetting in MLCIL, improving average accuracy by 3.3% over the previous best method in the MS-COCO B0-C10 setting. Conclusion: By explicitly separating known from unknown knowledge, HCP significantly improves multi-label class-incremental learning. Abstract: Existing class-incremental learning is mainly designed for single-label classification tasks and is ill-equipped for multi-label scenarios due to the inherent contradiction of learning objectives for samples with incomplete labels. We argue that the main challenge to overcome this contradiction in multi-label class-incremental learning (MLCIL) lies in the model's inability to clearly distinguish between known and unknown knowledge. This ambiguity hinders the model's ability to retain historical knowledge, master current classes, and prepare for future learning simultaneously. In this paper, we aim to specify what is known or not to accommodate Historical, Current, and Prospective knowledge for MLCIL and propose a novel framework termed HCP. Specifically, (i) we clarify the known classes by dynamic feature purification and recall enhancement with distribution prior, enhancing the precision and retention of known information. (ii) We design prospective knowledge mining to probe the unknown, preparing the model for future learning. Extensive experiments validate that our method effectively alleviates catastrophic forgetting in MLCIL, surpassing the previous state-of-the-art by 3.3% on average accuracy for MS-COCO B0-C10 setting without replay buffers.

A Tale of Two Classes: Adapting Supervised Contrastive Learning to Binary Imbalanced Datasets

David Mildenberger,Paul Hager,Daniel Rueckert,Martin J Menten

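Task: Study how supervised contrastive learning (SupCon) performs on binary imbalanced datasets and propose improvements.

Motivation: SupCon performs well on balanced multi-class datasets but poorly on long-tailed or binary imbalanced datasets such as those found in medical diagnosis.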

Details

Method: SupCon is evaluated on seven binary imbalanced datasets, two new metrics are introduced to assess the quality of the representation space, and two improved strategies are proposed. Result: The new strategies improve the structure of the representation space and raise downstream classification accuracy by up to 35% over standard SupCon. Conclusion: Strategies tailored to binary imbalanced datasets significantly improve supervised contrastive learning. Abstract: Supervised contrastive learning (SupCon) has proven to be a powerful alternative to the standard cross-entropy loss for classification of multi-class balanced datasets. However, it struggles to learn well-conditioned representations of datasets with long-tailed class distributions. This problem is potentially exacerbated for binary imbalanced distributions, which are commonly encountered during many real-world problems such as medical diagnosis. In experiments on seven binary datasets of natural and medical images, we show that the performance of SupCon decreases with increasing class imbalance. To substantiate these findings, we introduce two novel metrics that evaluate the quality of the learned representation space. By measuring the class distribution in local neighborhoods, we are able to uncover structural deficiencies of the representation space that classical metrics cannot detect. Informed by these insights, we propose two new supervised contrastive learning strategies tailored to binary imbalanced datasets that improve the structure of the representation space and increase downstream classification accuracy over standard SupCon by up to 35%. We make our code available.

Exploring the Efficacy of Partial Denoising Using Bit Plane Slicing for Enhanced Fracture Identification: A Comparative Study of Deep Learning-Based Approaches and Handcrafted Feature Extraction Techniques

Snigdha Paul,Sambit Mallick,Anindya Sen

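Task: Explore partial denoising techniques and different image representations to improve fracture classification accuracy.

Motivation: Fracture classification is critical in medical diagnosis, but complex patterns and image noise hinder accurate detection.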

Details

Method: The DenseNet deep learning model and handcrafted feature extraction are combined with Decision Tree and Random Forest classifiers, which are trained and evaluated on different image representations: the original image, concatenated bit planes, the fully denoised image, and the partially denoised image. Result: With the Random Forest classifier, the partially denoised representation reaches a testing accuracy of 95.61%, outperforming the other representations. Conclusion: Partial denoising preserves key features and improves classification accuracy, informing efficient preprocessing and feature extraction for fracture identification. Abstract: Computer vision has transformed medical diagnosis, treatment, and research through advanced image processing and machine learning techniques. Fracture classification, a critical area in healthcare, has greatly benefited from these advancements, yet accurate detection is challenged by complex patterns and image noise. Bit plane slicing enhances medical images by reducing noise interference and extracting informative features. This research explores partial denoising techniques to provide practical solutions for improved fracture analysis, ultimately enhancing patient care. The study explores the deep learning model DenseNet and handcrafted feature extraction. Decision Tree and Random Forest classifiers were employed to train and evaluate distinct image representations. These include the original image, the concatenation of the four bit planes from the LSB as well as MSB, the fully denoised image, and an image consisting of 6 bit planes from MSB and 2 denoised bit planes from LSB. The purpose of forming these diverse image representations is to analyze SNR as well as classification accuracy and identify the bit planes that contain the most informative features. Moreover, the study delves into the significance of partial denoising techniques in preserving crucial features, leading to improvements in classification results. Notably, this study shows that, employing the Random Forest classifier, the partially denoised image representation exhibited a testing accuracy of 95.61%, surpassing the performance of other image representations. The outcomes of this research provide valuable insights into the development of efficient preprocessing, feature extraction and classification approaches for fracture identification. By enhancing diagnostic accuracy, these advancements hold the potential to positively impact patient care and overall medical outcomes.
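
Bit plane slicing on 8-bit pixels can be sketched with plain integer operations. The example shows how to read a single plane, keep only the most significant planes, and recombine 6 MSB planes with 2 low bits, mirroring the partially denoised representation in the abstract. The denoiser that would produce the low bits is not specified in the abstract, so `denoised_low_bits` is simply passed in; all function names are hypothetical.

```python
def bit_plane(pixel, k):
    """Return bit k (0 = least significant) of an 8-bit pixel value."""
    return (pixel >> k) & 1


def keep_msb_planes(pixel, n_planes):
    """Zero out the (8 - n_planes) least significant bits of an 8-bit pixel."""
    mask = (0xFF << (8 - n_planes)) & 0xFF
    return pixel & mask


def combine_partial(pixel, denoised_low_bits):
    """Keep the 6 MSB planes of pixel and take the 2 LSB planes from elsewhere."""
    return keep_msb_planes(pixel, 6) | (denoised_low_bits & 0b11)


pixel = 0b10110101  # 181
top_plane = bit_plane(pixel, 7)
four_msb = keep_msb_planes(pixel, 4)
partially_denoised = combine_partial(pixel, 0b10)
```

Applied to every pixel of an image, `keep_msb_planes` gives the MSB-only representation, while `combine_partial` gives the mixed representation once a denoised version of the two lowest planes is available.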

HAPI: A Model for Learning Robot Facial Expressions from Human Preferences

Dongsheng Yang,Qianying Liu,Wataru Sato,Takashi Minato,Chaoran Liu,Shin'ya Nishida

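Task: Propose a learning-to-rank framework that uses human feedback to improve robot facial expression generation.

Motivation: Handcrafted methods produce rigid, unnatural expressions, and existing automated techniques do not adequately bridge the gap between human preferences and model predictions.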

Details

Method: The HAPI model, a Siamese RankNet-based approach, is trained on human preference data collected through pairwise comparison annotations. Result: On a 35-DOF robot platform, the method produces expressions of anger, happiness, and surprise that are more realistic and socially resonant than those from baseline and expert-designed methods. Conclusion: The framework bridges the gap between human preferences and model predictions and significantly improves the expressiveness of generated robot expressions. Abstract: Automatic robotic facial expression generation is crucial for human-robot interaction, as handcrafted methods based on fixed joint configurations often yield rigid and unnatural behaviors. Although recent automated techniques reduce the need for manual tuning, they tend to fall short by not adequately bridging the gap between human preferences and model predictions, resulting in a deficiency of nuanced and realistic expressions due to limited degrees of freedom and insufficient perceptual integration. In this work, we propose a novel learning-to-rank framework that leverages human feedback to address this discrepancy and enhance the expressiveness of robotic faces. Specifically, we conduct pairwise comparison annotations to collect human preference data and develop the Human Affective Pairwise Impressions (HAPI) model, a Siamese RankNet-based approach that refines expression evaluation. Results obtained via Bayesian Optimization and online expression survey on a 35-DOF android platform demonstrate that our approach produces significantly more realistic and socially resonant expressions of Anger, Happiness, and Surprise than those generated by baseline and expert-designed methods. This confirms that our framework effectively bridges the gap between human preferences and model predictions while robustly aligning robotic expression generation with human affective responses.
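
The pairwise objective behind a RankNet-style model can be sketched in a few lines. A shared scoring function (the "Siamese" part) rates both expressions, and the loss is the cross-entropy between the human choice and the sigmoid of the score difference. The linear scorer, its weights, and the feature vectors below are hypothetical stand-ins; the abstract does not describe the network used in HAPI.

```python
import math


def score(features, weights):
    """Shared scorer applied to both items of a pair (a linear stand-in)."""
    return sum(f * w for f, w in zip(features, weights))


def ranknet_loss(score_a, score_b, a_preferred):
    """Cross-entropy on P(a preferred over b) = sigmoid(score_a - score_b)."""
    p = 1.0 / (1.0 + math.exp(-(score_a - score_b)))
    target = 1.0 if a_preferred else 0.0
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# Two hypothetical expression feature vectors and one annotator preference.
weights = [0.5, -0.25]
expr_a = [4.0, 0.0]  # scores 2.0
expr_b = [0.0, 0.0]  # scores 0.0
agrees = ranknet_loss(score(expr_a, weights), score(expr_b, weights), True)
disagrees = ranknet_loss(score(expr_a, weights), score(expr_b, weights), False)
```

The loss is small when the scorer ranks the pair the way the annotator did and large otherwise, so minimizing it over many annotated pairs pulls the learned scores toward human preferences.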

Semi-supervised Cervical Segmentation on Ultrasound by A Dual Framework for Neural Networks

Fangyijie Wang,Kathleen M. Curran,Guénolé Silvestre

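Task: Develop a novel semi-supervised learning framework for accurate segmentation of the cervical muscles in ultrasound images.

Motivation: Scarce labeled data hinders the development of automatic computer-assisted methods, while semi-supervised learning shows promise by exploiting both labeled and unlabeled data.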

Details

Method: A semi-supervised learning framework with dual neural networks generates pseudo-labels and applies pixel-level cross-supervision, and a self-supervised contrastive learning strategy strengthens feature learning. Result: The framework is competitive on cervical segmentation tasks. Conclusion: The study offers an effective semi-supervised method for ultrasound image segmentation, and the code is publicly available. Abstract: Accurate segmentation of ultrasound (US) images of the cervical muscles is crucial for precision healthcare. The demand for automatic computer-assisted methods is high. However, the scarcity of labeled data hinders the development of these methods. Advanced semi-supervised learning approaches have displayed promise in overcoming this challenge by utilizing labeled and unlabeled data. This study introduces a novel semi-supervised learning (SSL) framework that integrates dual neural networks. This SSL framework utilizes both networks to generate pseudo-labels and cross-supervise each other at the pixel level. Additionally, a self-supervised contrastive learning strategy is introduced, which employs a pair of deep representations to enhance feature learning capabilities, particularly on unlabeled data. Our framework demonstrates competitive performance in cervical segmentation tasks. Our codes are publicly available on https://github.com/13204942/SSL_Cervical_Segmentation.

DIDiffGes: Decoupled Semi-Implicit Diffusion Models for Real-time Gesture Generation from Speech

Yongkang Cheng,Shaoli Huang,Xuelin Chen,Jifeng Ning,Mingming Gong

Task: Propose DIDiffGes, a framework that synthesizes high-quality, expressive gestures from speech using a decoupled semi-implicit diffusion model.

Motivation: Diffusion models excel at generating co-speech gestures, but their computationally intensive sampling steps limit practical use.

Details

Method: Uses generative adversarial networks (GANs) to enable large-step sampling, decouples gesture data into body and hand distributions, and further decomposes them into marginal and conditional distributions. Result: High-quality gestures are generated with only 10 sampling steps, a 100-fold reduction in sampling steps, and the method outperforms existing approaches in human likeness, appropriateness, and style correctness. Conclusion: The DIDiffGes framework significantly improves the practicality of diffusion models while preserving the quality and expressiveness of the generated gestures. Abstract: Diffusion models have demonstrated remarkable synthesis quality and diversity in generating co-speech gestures. However, the computationally intensive sampling steps associated with diffusion models hinder their practicality in real-world applications. Hence, we present DIDiffGes, a Decoupled Semi-Implicit Diffusion model-based framework that can synthesize high-quality, expressive gestures from speech using only a few sampling steps. Our approach leverages Generative Adversarial Networks (GANs) to enable large-step sampling for diffusion models. We decouple gesture data into body and hands distributions and further decompose them into marginal and conditional distributions. GANs model the marginal distribution implicitly, while L2 reconstruction loss learns the conditional distributions explicitly. This strategy enhances GAN training stability and ensures expressiveness of generated full-body gestures. Our framework also learns to denoise root noise conditioned on local body representation, guaranteeing stability and realism. DIDiffGes can generate gestures from speech with just 10 sampling steps, without compromising quality and expressiveness, reducing the number of sampling steps by a factor of 100 compared to existing methods. Our user study reveals that our method outperforms state-of-the-art approaches in human likeness, appropriateness, and style correctness. Project page: https://cyk990422.github.io/DIDiffGes.

Does a Rising Tide Lift All Boats? Bias Mitigation for AI-based CMR Segmentation

Tiarna Lee,Esther Puyol-Antón,Bram Ruijsink,Miaojing Shi,Andrew P. King

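Task: Investigate the impact of common bias mitigation methods on the bias between Black and White subjects in AI-based cardiac magnetic resonance (CMR) image segmentation models.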

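Motivation: Although race bias in CMR image segmentation models has been reported in the literature, little is known about the effectiveness of bias mitigation algorithms in this domain.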

Details

Method: Uses oversampling, importance reweighing, and Group DRO, as well as combinations of these, to mitigate race bias, and also evaluates these methods on cropped CMR images. Result: Oversampling significantly improves performance for Black subjects without significantly reducing performance for White subjects; Group DRO improves performance for Black subjects, but not significantly; reweighing decreases performance for Black subjects. Using cropped images increases performance for both races and reduces the bias. Conclusion: Oversampling and cropped images are effective bias mitigation approaches, whereas Group DRO and reweighing have limited effect. Abstract: Artificial intelligence (AI) is increasingly being used for medical imaging tasks. However, there can be biases in the resulting models, particularly when they were trained using imbalanced training datasets. One such example has been the strong race bias effect in cardiac magnetic resonance (CMR) image segmentation models. Although this phenomenon has been reported in a number of publications, little is known about the effectiveness of bias mitigation algorithms in this domain. We aim to investigate the impact of common bias mitigation methods to address bias between Black and White subjects in AI-based CMR segmentation models. Specifically, we use oversampling, importance reweighing, and Group DRO, as well as combinations of these techniques, to mitigate the race bias. Furthermore, motivated by recent findings on the root causes of AI-based CMR segmentation bias, we evaluate the same methods using models trained and evaluated on cropped CMR images. We find that bias can be mitigated using oversampling, significantly improving performance for the underrepresented Black subjects whilst not significantly reducing the majority White subjects' performance. Group DRO also improves performance for Black subjects but not significantly, while reweighing decreases performance for Black subjects. Using a combination of oversampling and Group DRO also improves performance for Black subjects but not significantly. Using cropped images increases performance for both races and reduces the bias, whilst adding oversampling as a bias mitigation technique with cropped images reduces the bias further.
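
To make two of the mitigation strategies concrete, here is a small sketch of group oversampling and of inverse-frequency sample weights. It is a simplified illustration only: the weighting shown is plain inverse group frequency, which is an assumption and may differ from the reweighing scheme the authors used, and the group names and data are hypothetical.

```python
import random
from collections import Counter

def group_weights(groups):
    """Inverse-frequency weight per group: rarer groups get larger weights."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return {g: n / (k * c) for g, c in counts.items()}

def oversample(samples, groups, seed=0):
    """Resample minority groups with replacement until every group
    matches the size of the largest one."""
    rng = random.Random(seed)
    counts = Counter(groups)
    target = max(counts.values())
    balanced = list(zip(samples, groups))
    for g, c in counts.items():
        pool = [s for s, gg in zip(samples, groups) if gg == g]
        balanced += [(rng.choice(pool), g) for _ in range(target - c)]
    return balanced

# Hypothetical imbalanced training set: 6 majority vs 2 minority subjects.
groups = ["majority"] * 6 + ["minority"] * 2
weights = group_weights(groups)
balanced = oversample(list(range(8)), groups)
```

Oversampling changes what the model sees in each epoch, whereas reweighting keeps the data fixed and scales each sample's loss; the abstract reports that the two behaved differently in practice.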

FFaceNeRF: Few-shot Face Editing in Neural Radiance Fields

Kwan Yun,Chaelin Kim,Hangyeul Shin,Junyong Noh

Task: Propose FFaceNeRF, a NeRF-based 3D face editing method that addresses the limited user control caused by fixed mask layouts in existing methods.

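Motivation: Existing methods rely on pre-trained segmentation masks, which limits user control, and need large amounts of training data, making them hard to adapt to personalized needs.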

Details

Method: Employs a geometry adapter with feature injection, combined with latent mixing for tri-plane augmentation, to enable training with few samples. Result: FFaceNeRF surpasses existing methods in flexibility, control, and generated image quality. Conclusion: FFaceNeRF opens a new direction for customized, high-fidelity 3D face editing. Abstract: Recent 3D face editing methods using masks have produced high-quality edited images by leveraging Neural Radiance Fields (NeRF). Despite their impressive performance, existing methods often provide limited user control due to the use of pre-trained segmentation masks. To utilize masks with a desired layout, an extensive training dataset is required, which is challenging to gather. We present FFaceNeRF, a NeRF-based face editing technique that can overcome the challenge of limited user control due to the use of fixed mask layouts. Our method employs a geometry adapter with feature injection, allowing for effective manipulation of geometry attributes. Additionally, we adopt latent mixing for tri-plane augmentation, which enables training with a few samples. This facilitates rapid model adaptation to desired mask layouts, crucial for applications in fields like personalized medical imaging or creative face editing. Our comparative evaluations demonstrate that FFaceNeRF surpasses existing mask-based face editing methods in terms of flexibility, control, and generated image quality, paving the way for future advancements in customized and high-fidelity 3D face editing. The code is available on the project page: https://kwanyun.github.io/FFaceNeRF_page/.

A Comparative Analysis of Image Descriptors for Histopathological Classification of Gastric Cancer

Marco Usai,Andrea Loddo,Alessandra Perniciano,Maurizio Atzori,Cecilia Di Ruberto

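Task: Classify gastric histopathological images using machine learning and deep learning techniques, distinguishing healthy from cancerous tissue.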

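Motivation: Prognostic prediction for gastric cancer is inadequate, and pathologists face heavy workloads and potential errors, so automated, accurate diagnostic tools are needed.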

Details

Method: Combines handcrafted and deep features with shallow learning classifiers in a comparative analysis on the GasHisSDB dataset. Result: With a random forest classifier, the F1 score reaches 93.4%. Conclusion: The approach effectively distinguishes normal from abnormal histopathological images without fine-tuning strategies. Abstract: Gastric cancer ranks as the fifth most common and fourth most lethal cancer globally, with a dismal 5-year survival rate of approximately 20%. Despite extensive research on its pathobiology, the prognostic predictability remains inadequate, compounded by pathologists' high workload and potential diagnostic errors. Thus, automated, accurate histopathological diagnosis tools are crucial. This study employs Machine Learning and Deep Learning techniques to classify histopathological images into healthy and cancerous categories. Using handcrafted and deep features with shallow learning classifiers on the GasHisSDB dataset, we offer a comparative analysis and insights into the most robust and high-performing combinations of features and classifiers for distinguishing between normal and abnormal histopathological images without fine-tuning strategies. With the random forest (RF) classifier, our approach reaches an F1 score of 93.4%, demonstrating its validity.
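
For readers unfamiliar with the reported metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the counts are made-up numbers for illustration and are not taken from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0  # no correct positive predictions: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion counts for the cancerous class.
example = f1_score(tp=90, fp=10, fn=10)
```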

Exploring Few-Shot Object Detection on Blood Smear Images: A Case Study of Leukocytes and Schistocytes

Davide Antonio Mura,Michela Pinna,Lorenzo Putzu,Andrea Loddo,Alessandra Perniciano,Olga Mulas,Cecilia Di Ruberto

Task: Investigate a novel approach called DE-ViT for blood cell counting under the few-shot paradigm.

Motivation: Accurate blood cell counts are essential for detecting blood disorders, so precise automatic systems are needed.

Details

Method: Applies DE-ViT trained under the few-shot paradigm, with experiments on the Raabin-WBC dataset and a local dataset, and compares it against the Faster R-CNN 50 and Faster R-CNN X 101 baselines. Result: DE-ViT performs strongly on the COCO and LVIS datasets but is surpassed by the baselines on Raabin-WBC, and Faster R-CNN X 101 performs comparatively well on SC-IDB. Conclusion: The performance gap may be caused by domain shift. Abstract: The detection of blood disorders often hinges upon the quantification of specific blood cell types. Variations in cell counts may indicate the presence of pathological conditions. Developing precise automatic systems for blood cell enumeration is therefore important. The investigation focuses on a novel approach termed DE-ViT. This methodology is employed in a Few-Shot paradigm, wherein training relies on a limited number of images. Two distinct datasets are utilised for experimental purposes: the Raabin-WBC dataset for Leukocyte detection and a local dataset for Schistocyte identification. In addition to the DE-ViT model, two baseline models, Faster R-CNN 50 and Faster R-CNN X 101, are employed, with their outcomes being compared against those of the proposed model. While DE-ViT has demonstrated state-of-the-art performance on the COCO and LVIS datasets, both baseline models surpassed its performance on the Raabin-WBC dataset. Moreover, only Faster R-CNN X 101 yielded satisfactory results on the SC-IDB. The observed disparities in performance may be attributed to domain shift phenomena.

The CASTLE 2024 Dataset: Advancing the Art of Multimodal Understanding

Luca Rossetto,Werner Bailer,Duc-Tien Dang-Nguyen,Graham Healy,Björn Þór Jónsson,Onanong Kongmeesub,Hoang-Bao Le,Stevan Rudinac,Klaus Schöffmann,Florian Spiess,Allie Tran,Minh-Triet Tran,Quang-Linh Tran,Cathal Gurrin

Task: Introduce and release the CASTLE 2024 dataset, a multimodal, multi-perspective collection of egocentric and exocentric video and audio.

Motivation: Most existing datasets are limited to a single perspective and cannot meet the needs of multimodal research.

Details

Method: Recorded by volunteer participants at a fixed location; contains video and audio from 15 time-aligned egocentric and exocentric sources, plus other sensor data. Result: Released the CASTLE 2024 dataset, containing over 600 hours of UHD video with no partial censoring. Conclusion: The CASTLE 2024 dataset fills the gap in multimodal, multi-perspective datasets and provides a rich resource for related research. Abstract: Egocentric video has seen increased interest in recent years, as it is used in a range of areas. However, most existing datasets are limited to a single perspective. In this paper, we present the CASTLE 2024 dataset, a multimodal collection containing ego- and exo-centric (i.e., first- and third-person perspective) video and audio from 15 time-aligned sources, as well as other sensor streams and auxiliary data. The dataset was recorded by volunteer participants over four days in a fixed location and includes the point of view of 10 participants, with an additional 5 fixed cameras providing an exocentric perspective. The entire dataset contains over 600 hours of UHD video recorded at 50 frames per second. In contrast to other datasets, CASTLE 2024 does not contain any partial censoring, such as blurred faces or distorted audio. The dataset is available via https://castle-dataset.github.io/.

A New Statistical Model of Star Speckles for Learning to Detect and Characterize Exoplanets in Direct Imaging Observations

Théo Bodrito,Olivier Flasseur,Julien Mairal,Jean Ponce,Maud Langlois,Anne-Marie Lagrange

Task: Propose a novel statistical model for separating exoplanet signals from strong residual starlight.

Motivation: Direct imaging is highly challenging for exoplanet detection and requires advanced image processing to isolate faint signals.

Details

Method: Uses a multi-scale approach that exploits problem symmetries and a physically grounded joint spectral channel representation, integrated into an interpretable, end-to-end learnable framework. Result: On datasets from the SPHERE instrument, the precision-recall trade-off improves significantly, notably on challenging datasets. Conclusion: The method is computationally efficient, robust to varying data quality, and well suited to large-scale observational surveys. Abstract: The search for exoplanets is an active field in astronomy, with direct imaging as one of the most challenging methods due to faint exoplanet signals buried within stronger residual starlight. Successful detection requires advanced image processing to separate the exoplanet signal from this nuisance component. This paper presents a novel statistical model that captures nuisance fluctuations using a multi-scale approach, leveraging problem symmetries and a joint spectral channel representation grounded in physical principles. Our model integrates into an interpretable, end-to-end learnable framework for simultaneous exoplanet detection and flux estimation. The proposed algorithm is evaluated against the state of the art using datasets from the SPHERE instrument operating at the Very Large Telescope (VLT). It significantly improves the precision-recall trade-off, notably on challenging datasets that are otherwise unusable by astronomers. The proposed approach is computationally efficient, robust to varying data quality, and well suited for large-scale observational surveys.

Jailbreaking the Non-Transferable Barrier via Test-Time Data Disguising

Yongli Xiang,Ziming Hong,Lina Yao,Dadong Wang,Tongliang Liu

Task: Reveal a security loophole in black-box non-transferable learning (NTL) models and propose a novel attack method called JailNTL.

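Motivation: Existing attacks require modifying model weights and therefore do not work in the black-box scenario, so the security of black-box NTL models needs to be examined.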

Details

Method: Bypasses the non-transferable barrier of NTL models through test-time data disguising, comprising data-intrinsic disguising and model-guided disguising. Result: In black-box attacks, JailNTL raises unauthorized-domain accuracy by up to 55.7% using only 1% of authorized samples. Conclusion: JailNTL exposes a security loophole in black-box NTL models and substantially outperforms existing white-box attacks. Abstract: Non-transferable learning (NTL) has been proposed to protect model intellectual property (IP) by creating a "non-transferable barrier" to restrict generalization from authorized to unauthorized domains. Recently, a well-designed attack, which restores the unauthorized-domain performance by fine-tuning NTL models on a few authorized samples, has highlighted the security risks of NTL-based applications. However, such an attack requires modifying model weights and is thus invalid in the black-box scenario. This raises a critical question: can we trust the security of NTL models deployed as black-box systems? In this work, we reveal the first loophole of black-box NTL models by proposing a novel attack method (dubbed JailNTL) to jailbreak the non-transferable barrier through test-time data disguising. The main idea of JailNTL is to disguise unauthorized data so it can be identified as authorized by the NTL model, thereby bypassing the non-transferable barrier without modifying the NTL model weights. Specifically, JailNTL encourages unauthorized-domain disguising at two levels: (i) data-intrinsic disguising (DID) for eliminating domain discrepancy and preserving class-related content at the input level, and (ii) model-guided disguising (MGD) for mitigating the output-level statistics difference of the NTL model. Empirically, when attacking state-of-the-art (SOTA) NTL models in the black-box scenario, JailNTL achieves an accuracy increase of up to 55.7% in the unauthorized domain by using only 1% of authorized samples, largely exceeding existing SOTA white-box attacks.

A Language Anchor-Guided Method for Robust Noisy Domain Generalization

Zilin Dai,Lehong Wang,Fangzhou Lin,Yidong Wang,Zhigang Li,Kazunori D Yamada,Ziming Zhang,Wang Lu

Task: Propose a new algorithm called A3W that addresses distribution shift and label noise through sample reweighting and NLP anchors.

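Motivation: In real-world machine learning applications, distribution shift and label noise cause models to overfit to redundant features, making it hard to generalize to the target domain.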

Details

Method: A3W uses NLP anchors to guide sample reweighting and extract more representative features, and adjusts each sample's contribution through a weighted loss function. Result: On standard benchmark datasets, A3W significantly outperforms existing domain generalization methods in accuracy and robustness. Conclusion: By combining semantic representations with adaptive weighting, A3W improves model performance under distribution shift and noisy labels. Abstract: Real-world machine learning applications often struggle with two major challenges: distribution shift and label noise. Models tend to overfit by focusing on redundant and uninformative features in the training data, which makes it hard for them to generalize to the target domain. Noisy data worsens this problem by causing further overfitting to the noise, meaning that existing methods often fail to tell the difference between true, invariant features and misleading, spurious ones. To tackle these issues, we introduce Anchor Alignment and Adaptive Weighting (A3W). This new algorithm uses sample reweighting guided by natural language processing (NLP) anchors to extract more representative features. In simple terms, A3W leverages semantic representations from natural language models as a source of domain-invariant prior knowledge. Additionally, it employs a weighted loss function that adjusts each sample's contribution based on its similarity to the corresponding NLP anchor. This adjustment makes the model more robust to noisy labels. Extensive experiments on standard benchmark datasets show that A3W consistently outperforms state-of-the-art domain generalization methods, offering significant improvements in both accuracy and robustness across different datasets and noise levels.
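
The similarity-weighted loss can be pictured with the following sketch. It assumes cosine similarity between a sample's feature vector and the text-embedding anchor of its label, with negative similarities clipped to zero; the abstract does not give the exact weighting rule, so these choices, the toy embeddings, and the function names are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def anchor_weighted_loss(features, labels, losses, anchors):
    """Weight each sample's loss by its similarity to its label's anchor."""
    weights = [max(0.0, cosine(f, anchors[y])) for f, y in zip(features, labels)]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / total

# Toy 2-D "embeddings"; the second sample looks like a cat but is labeled dog.
anchors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
features = [[1.0, 0.0], [1.0, 0.0]]
labels = ["cat", "dog"]
losses = [1.0, 5.0]
result = anchor_weighted_loss(features, labels, losses, anchors)
```

In this toy case the mislabeled sample gets weight zero, so its large loss no longer dominates the update, which is the intuition behind robustness to noisy labels.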

Deep End-to-End Posterior ENergy (DEEPEN) for image recovery

Jyothi Rikhab Chand,Mathews Jacob

Task: Propose a framework called DEEPEN that enables both maximum a posteriori (MAP) estimation and sampling from the posterior distribution.

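Motivation: Existing end-to-end (E2E) and plug-and-play (PnP) image reconstruction algorithms cannot sample from the posterior distribution, while diffusion models are difficult to train end to end.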

Details

Method: Learns the parameters of the posterior, which is composed of the data consistency error and the negative log-prior, via maximum likelihood optimization. Result: DEEPEN outperforms existing E2E and PnP methods in the MAP setting, samples faster than diffusion models, and is more robust to changes in image acquisition settings. Conclusion: The DEEPEN framework achieves efficient posterior estimation and sampling for image reconstruction, with computational and memory advantages. Abstract: Current end-to-end (E2E) and plug-and-play (PnP) image reconstruction algorithms approximate the maximum a posteriori (MAP) estimate but cannot offer sampling from the posterior distribution, like diffusion models. By contrast, it is challenging for diffusion models to be trained in an E2E fashion. This paper introduces a Deep End-to-End Posterior ENergy (DEEPEN) framework, which enables MAP estimation as well as sampling. We learn the parameters of the posterior, which is the sum of the data consistency error and the negative log-prior distribution, using maximum likelihood optimization in an E2E fashion. The proposed approach does not require algorithm unrolling, and hence has a smaller computational and memory footprint than current E2E methods, while it does not require contraction constraints typically needed by current PnP methods. Our results demonstrate that DEEPEN offers better performance than current E2E and PnP models in the MAP setting, while it also offers faster sampling compared to diffusion models. In addition, the learned energy-based model is observed to be more robust to changes in image acquisition settings.
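
The role of the posterior energy in MAP estimation can be illustrated with a toy denoising problem. In this sketch the energy is a data consistency term plus a quadratic term standing in for the negative log-prior; in DEEPEN the prior is learned, so the quadratic prior, the identity forward operator, and all parameter names here are assumptions made only to keep the example checkable.

```python
def energy_grad(x, b, sigma2, lam):
    """Gradient of E(x) = ||x - b||^2 / (2*sigma2) + (lam / 2) * ||x||^2.
    The first term is data consistency; the second is a stand-in prior."""
    return [(xi - bi) / sigma2 + lam * xi for xi, bi in zip(x, b)]

def map_estimate(b, sigma2=1.0, lam=0.5, step=0.1, iters=500):
    """MAP estimate by plain gradient descent on the posterior energy."""
    x = list(b)  # start from the measurements
    for _ in range(iters):
        g = energy_grad(x, b, sigma2, lam)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

measurements = [3.0, -1.5]
x_map = map_estimate(measurements)
```

For this quadratic energy the minimizer has the closed form b / (1 + lam * sigma2), which makes the descent easy to verify. Sampling from the posterior would add noise to each step (for example, Langevin dynamics), which is not shown here.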

Jie Mei,Chenyu Lin,Yu Qiu,Yaonan Wang,Hui Zhang,Ziyang Wang,Dong Dai

Task: Propose a deep learning-based method for lung tumor segmentation in PET-CT and introduce the large-scale dataset PCLT20K.

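Motivation: Address poor image quality, motion artifacts, and complex tumor morphology in PET-CT images, while overcoming the limitations of existing small-scale and private datasets.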

Details

Method: Designs a cross-modal interactive perception network (CIPA) that includes a channel-wise rectification module (CRM) and a dynamic cross-modality interaction module (DCIM). Result: On a comprehensive benchmark, CIPA outperforms current state-of-the-art segmentation methods. Conclusion: The study offers new opportunities for exploration in medical image segmentation; the dataset and code are publicly available. Abstract: Lung cancer is a leading cause of cancer-related deaths globally. PET-CT is crucial for imaging lung tumors, providing essential metabolic and anatomical information, but it faces challenges such as poor image quality, motion artifacts, and complex tumor morphology. Deep learning-based models are expected to address these problems; however, existing small-scale and private datasets limit significant performance improvements for these methods. Hence, we introduce a large-scale PET-CT lung tumor segmentation dataset, termed PCLT20K, which comprises 21,930 pairs of PET-CT images from 605 patients. Furthermore, we propose a cross-modal interactive perception network with Mamba (CIPA) for lung tumor segmentation in PET-CT images. Specifically, we design a channel-wise rectification module (CRM) that implements a channel state space block across multi-modal features to learn correlated representations and helps filter out modality-specific noise. A dynamic cross-modality interaction module (DCIM) is designed to effectively integrate position and context information, which employs PET images to learn regional position information and serves as a bridge to assist in modeling the relationships between local features of CT images. Extensive experiments on a comprehensive benchmark demonstrate the effectiveness of our CIPA compared to the current state-of-the-art segmentation methods. We hope our research can provide more exploration opportunities for medical image segmentation. The dataset and code are available at https://github.com/mj129/CIPA.

Vision Transformer Based Semantic Communications for Next Generation Wireless Networks

Muhammad Ahmed Mohsin,Muhammad Jazib,Zeeshan Alam,Muhmmad Farhan Khan,Muhammad Saad,Muhammad Ali Jamshed

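Task: Propose a Vision Transformer (ViT)-based semantic communication framework that achieves high semantic similarity in image transmission while reducing bandwidth requirements.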

Motivation: In 6G networks, semantic communication promises to transform data transmission by prioritizing semantic meaning over raw data accuracy.

Details

Method: Uses ViT as the encoder-decoder framework to efficiently encode images into high-semantic content and accurately reconstruct them at the receiver, accounting for realistic fading and noise. Result: The proposed ViT architecture reaches a peak signal-to-noise ratio (PSNR) of 38 dB, outperforming other deep learning approaches such as CNNs and GANs. Conclusion: The ViT-based approach is a significant advance in semantic communications and offers a new direction for efficient data transmission in 6G networks. Abstract: In the evolving landscape of 6G networks, semantic communications are poised to revolutionize data transmission by prioritizing the transmission of semantic meaning over raw data accuracy. This paper presents a Vision Transformer (ViT)-based semantic communication framework that has been deliberately designed to achieve high semantic similarity during image transmission while simultaneously minimizing the demand for bandwidth. By using ViT as the encoder-decoder framework, the proposed architecture can proficiently encode images into high semantic content at the transmitter and precisely reconstruct the images at the receiver, accounting for real-world fading and noise. Building on the attention mechanisms inherent to ViTs, our model outperforms Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) tailored for generating such images. The architecture based on the proposed ViT network achieves a Peak Signal-to-Noise Ratio (PSNR) of 38 dB, which is higher than other Deep Learning (DL) approaches in maintaining semantic similarity across different communication environments. These findings establish our ViT-based approach as a significant breakthrough in semantic communications.
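
For context on the reported 38 dB figure, PSNR is defined as 10 * log10(MAX^2 / MSE), where MAX is the largest possible pixel value and MSE is the mean squared error between the original and reconstructed images. The sketch below computes it for flat lists of pixel values; the 8-bit range and the sample pixels are assumptions for illustration, not the paper's evaluation setup.

```python
import math

def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)

# Hypothetical 8-bit pixels, each off by one gray level after reconstruction.
value = psnr([100, 100], [101, 99])
```

Higher values mean the reconstruction is closer to the original; each 10 dB gain corresponds to a tenfold reduction in mean squared error.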

A Topological Data Analysis Framework for Quantifying Necrosis in Glioblastomas

Francisco Tellez,Enrique Torres-Giese

Task: Propose a shape descriptor called the "interior function" that refines descriptors used in image analysis.

Motivation: Improve existing descriptors through Topological Data Analysis (TDA) to quantify geometric characteristics of tumor necrosis.

Details

Method: Defines the subcomplex lacunarity index and constructs a set of indices and a diagram for analyzing necrotic morphology. Result: Applying the framework to an MRI study of glioblastomas (GB), cluster analysis identifies four subtypes that reflect geometric properties of the necrotic regions. Conclusion: The framework effectively captures the distinct structural and geometric properties of necrotic regions in tumors. Abstract: In this paper, we introduce a shape descriptor that we call "interior function". This is a Topological Data Analysis (TDA) based descriptor that refines previous descriptors for image analysis. Using this concept, we define subcomplex lacunarity, a new index that quantifies geometric characteristics of necrosis in tumors such as conglomeration. Building on this framework, we propose a set of indices to analyze necrotic morphology and construct a diagram that captures the distinct structural and geometric properties of necrotic regions in tumors. We present an application of this framework in the study of MRIs of Glioblastomas (GB). Using cluster analysis, we identify four distinct subtypes of Glioblastomas that reflect geometric properties of necrotic regions.

Align Your Rhythm: Generating Highly Aligned Dance Poses with Gating-Enhanced Rhythm-Aware Feature Representation

Congyi Fan,Jian Guan,Xuanjia Zhao,Dongli Xu,Youtian Lin,Tong Ye,Pengming Feng,Haiwei Pan

Task: Propose a new framework called Danceba for generating rhythmic dance movements that are highly aligned with music.

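Motivation: The virtual reality and film industries need natural, diverse, and rhythmic dance generation, but existing methods fall short in beat alignment and motion naturalness.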

Details

Method: Combines Phase-Based Rhythm Extraction (PRE) to extract rhythmic information from music, Temporal-Gated Causal Attention (TGCA) to focus on global rhythmic features, and Parallel Mamba Motion Modeling (PMMM) to model upper- and lower-body motion separately. Result: Danceba outperforms existing methods in experiments, with significantly better rhythmic alignment and motion diversity. Conclusion: Through its rhythm-aware feature representation and motion modeling, Danceba generates more natural and diverse dance movements. Abstract: Automatically generating natural, diverse and rhythmic human dance movements driven by music is vital for the virtual reality and film industries. However, generating dance that naturally follows music remains a challenge, as existing methods lack proper beat alignment and exhibit unnatural motion dynamics. In this paper, we propose Danceba, a novel framework that leverages a gating mechanism to enhance rhythm-aware feature representation for music-driven dance generation, which achieves highly aligned dance poses with enhanced rhythmic sensitivity. Specifically, we introduce Phase-Based Rhythm Extraction (PRE) to precisely extract rhythmic information from musical phase data, capitalizing on the intrinsic periodicity and temporal structures of music. Additionally, we propose Temporal-Gated Causal Attention (TGCA) to focus on global rhythmic features, ensuring that dance movements closely follow the musical rhythm. We also introduce the Parallel Mamba Motion Modeling (PMMM) architecture to separately model upper and lower body motions along with musical features, thereby improving the naturalness and diversity of generated dance movements. Extensive experiments confirm that Danceba outperforms state-of-the-art methods, achieving significantly better rhythmic alignment and motion diversity. Project page: https://danceba.github.io/