2025 04 07

Computer Vision and Deep Learning for 4D Augmented Reality

Karthik Shivashankar

Task: Explore the feasibility of rendering 4D video on the Microsoft mixed reality platform, and develop a compact representation to overcome data-transmission bandwidth limits.

Motivation: 4D video has broad prospects on Extended Reality (XR) platforms, but the bandwidth needed to transmit complex 3D models limits its practical use.

Details

Method: Use deep learning models to learn a compact representation of 4D video sequences and reconstruct the sequences without loss of shape or appearance quality.

Result: 4D video was rendered successfully on the mixed reality platform, and the compact representation addressed the data-transmission bandwidth problem.

Conclusion: The approach offers a feasible technical route for 4D video on XR platforms and removes the data-transmission bottleneck.

Abstract: The prospect of 4D video on Extended Reality (XR) platforms is huge and exciting: it opens a whole new way of human-computer interaction and changes how we perceive reality and consume multimedia. In this thesis, we show the feasibility of rendering 4D video on the Microsoft mixed reality platform. This enables us to port any 3D performance capture from CVSSP into an XR product such as the HoloLens device with relative ease. However, if the 3D model is too complex and is made up of millions of vertices, the data bandwidth required to port the model is a severe limitation with current hardware and communication systems. Therefore, in this project we also develop a compact representation of both the shape and appearance of the 4D video sequence, using deep learning models to learn this representation effectively and reconstruct the sequence without affecting its shape and appearance.

Towards Understanding How Knowledge Evolves in Large Vision-Language Models

Sudong Wang,Yunjian Zhang,Yao Zhu,Jianing Li,Zizhe Wang,Yanwei Liu,Xiangyang Ji

Task: Study how multimodal knowledge evolves in large vision-language models (LVLMs) and how it induces natural language generation.

Motivation: Understanding the internal mechanisms of LVLMs is essential for improving their capabilities, yet these mechanisms remain largely unexplored.

Details

Method: Design a series of new strategies to analyze knowledge evolution inside LVLMs at three levels: single-token probabilities, token probability distributions, and feature encodings.

Result: Identifies two key nodes in knowledge evolution (the critical layers and the mutation layers) and divides the process into three stages: rapid evolution, stabilization, and mutation.

Conclusion: Reveals the trajectory of knowledge evolution in LVLMs for the first time, offering a new perspective on their underlying mechanisms.

Abstract: Large Vision-Language Models (LVLMs) are gradually becoming the foundation for many artificial intelligence applications. However, understanding their internal working mechanisms has continued to puzzle researchers, which in turn limits the further enhancement of their capabilities. In this paper, we investigate how multimodal knowledge evolves and eventually induces natural language in LVLMs. We design a series of novel strategies for analyzing internal knowledge within LVLMs, and examine the evolution of multimodal knowledge at three levels: single token probabilities, token probability distributions, and feature encodings. In this process, we identify two key nodes in knowledge evolution, the critical layers and the mutation layers, which divide the evolution process into three stages: rapid evolution, stabilization, and mutation. Our research is the first to reveal the trajectory of knowledge evolution in LVLMs, providing a fresh perspective for understanding their underlying mechanisms. Our code is available at https://github.com/XIAO4579/Vlm-interpretability.

OpenFACADES: An Open Framework for Architectural Caption and Attribute Data Enrichment via Street View Imagery

Xiucheng Liang,Jinheng Xie,Tianhong Zhao,Rudi Stouffs,Filip Biljecki

Task: Propose OpenFACADES, a framework that uses multimodal crowdsourced data to enrich building attributes.

Motivation: Building attribute data are scarce in many urban areas, and existing methods struggle to integrate diverse open datasets and infer comprehensive attributes.

Details

Method: Integrate street-view image metadata with OpenStreetMap geometries, automate building-facade detection, and use open-source large vision-language models for multi-attribute prediction.

Result: Fine-tuned vision-language models excel at multi-attribute inference, outperforming single-attribute computer vision models and zero-shot ChatGPT-4o.

Conclusion: OpenFACADES helps fill the gap in building attribute data and provides a new tool for urban analytics.

Abstract: Building properties, such as height, usage, and material composition, play a crucial role in spatial data infrastructures, supporting applications such as energy simulation, risk assessment, and environmental modeling. Despite their importance, comprehensive and high-quality building attribute data remain scarce in many urban areas. Recent advances have enabled the extraction and tagging of objective building attributes using remote sensing and street-level imagery. However, establishing a method and pipeline that integrates diverse open datasets, acquires holistic building imagery at scale, and infers comprehensive building attributes remains a significant challenge. As one of the first efforts of its kind, this study bridges these gaps by introducing OpenFACADES, an open framework that leverages multimodal crowdsourced data to enrich building profiles with both objective attributes and semantic descriptors through multimodal large language models. Our methodology proceeds in three major steps. First, we integrate street-level image metadata from Mapillary with OpenStreetMap geometries via isovist analysis, effectively identifying images that provide suitable vantage points for observing target buildings. Second, we automate the detection of building facades in panoramic imagery and tailor a reprojection approach to convert objects into holistic perspective views that approximate real-world observation. Third, we introduce an approach that harnesses and systematically investigates the capabilities of open-source large vision-language models (VLMs) for multi-attribute prediction and open-vocabulary captioning in building-level analytics, leveraging a globally sourced dataset of 30,180 labeled images from seven cities. Evaluation shows that fine-tuned VLMs excel in multi-attribute inference, outperforming single-attribute computer vision models and zero-shot ChatGPT-4o.

Multimodal Reference Visual Grounding

Yangxiao Lu,Ruosen Li,Liqiang Jing,Jikai Wang,Xinya Du,Yunhui Guo,Nicholas Ruozzi,Yu Xiang

Task: Introduce and address the Multimodal Reference Visual Grounding (MRVG) task: detecting a target object in a query image from reference images and a language expression.

Motivation: Existing large vision-language models (LVLMs) struggle when similar objects appear together; reference images can improve performance.

Details

Method: Propose MRVG-Net, which combines few-shot object detection with a large language model (LLM) for object matching.

Result: MRVG-Net outperforms existing LVLMs such as Qwen2.5-VL-7B at visual grounding.

Conclusion: The method bridges the gap between few-shot detection and visual grounding, adding new capabilities for visual understanding.

Abstract: Visual grounding focuses on detecting objects from images based on language expressions. Recent Large Vision-Language Models (LVLMs) have significantly advanced visual grounding performance by training large models with large-scale datasets. However, the problem remains challenging, especially when similar objects appear in the input image. For example, an LVLM may not be able to differentiate Diet Coke and regular Coke in an image. In this case, additional reference images of Diet Coke and regular Coke, if available, can help the visual grounding of similar objects. In this work, we introduce a new task named Multimodal Reference Visual Grounding (MRVG). In this task, a model has access to a set of reference images of objects in a database. Based on these reference images and a language expression, the model is required to detect a target object from a query image. We first introduce a new dataset to study the MRVG problem. Then we introduce a novel method, named MRVG-Net, to solve this visual grounding problem. We show that by efficiently using reference images with few-shot object detection and using Large Language Models (LLMs) for object matching, our method achieves superior visual grounding performance compared to state-of-the-art LVLMs such as Qwen2.5-VL-7B. Our approach bridges the gap between few-shot detection and visual grounding, unlocking new capabilities for visual understanding. Project page with our code and dataset: https://irvlutd.github.io/MultiGrounding

Optimizing Humor Generation in Large Language Models: Temperature Configurations and Architectural Trade-offs

Evgenii Evstafev

Task: Evaluate 13 state-of-the-art large language models on generating technical humor for software developers.

Motivation: Although large language models are increasingly capable at creative text generation, systematic evaluation of their humor generation is still lacking.

Details

Method: A full factorial design tests 715 unique configurations of temperature settings and prompt variations. Model outputs are scored on five weighted criteria (humor quality, domain relevance, concept originality, tone precision, and delivery efficiency) and analyzed with ANOVA, correlation studies, and quadratic regression.

Result: Performance differs significantly across models, with certain architectures outperforming baseline systems by 21.8%; 73% of models perform best at low-stochasticity settings (<= 0.5); model architecture explains 38.7% of performance variance.

Conclusion: The study provides practical guidance for model selection and configuration, confirms that temperature and architecture affect humor generation, and advances understanding of LLM capabilities in creative technical writing.

Abstract: Large language models (LLMs) demonstrate increasing capabilities in creative text generation, yet their humor production has rarely been evaluated systematically. This study presents a comprehensive analysis of 13 state-of-the-art LLMs across five architectural families, evaluating their performance in generating technically relevant humor for software developers. Through a full factorial design testing 715 unique configurations of temperature settings and prompt variations, we assess model outputs using five weighted criteria: humor quality, domain relevance, concept originality, tone precision, and delivery efficiency. Our methodology employs rigorous statistical analysis including ANOVA, correlation studies, and quadratic regression to identify optimal configurations and architectural influences. Results reveal significant performance variations across models, with certain architectures achieving 21.8% superiority over baseline systems. Temperature sensitivity analysis demonstrates that 73% of models achieve peak performance at lower stochasticity settings (<= 0.5), though optimal ranges vary substantially by architecture. We identify distinct model clusters: compact high-performers maintaining efficiency-quality balance versus verbose specialists requiring longer outputs for marginal gains. Statistical validation confirms model architecture explains 38.7% of performance variance, with significant correlations between humor quality and concept originality. The study establishes practical guidelines for model selection and configuration, demonstrating how temperature adjustments and architectural considerations impact humor generation effectiveness. These findings advance understanding of LLM capabilities in creative technical writing and provide empirically validated configuration strategies for developers implementing humor-generation systems.
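
The abstract mentions quadratic regression for locating optimal temperature settings. The sketch below illustrates that general idea under stated assumptions: the helper names `fit_quadratic` and `best_temperature` and the score values are hypothetical and are not taken from the paper. It fits score = a*t^2 + b*t + c by least squares and reads off the temperature at the vertex.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c via the 3x3 normal equations."""
    # Power sums S[k] = sum(x^k) and moment sums T[k] = sum(y * x^k).
    S = [sum(x ** k for x in xs) for k in range(5)]
    T = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix for the unknowns (a, b, c).
    M = [
        [S[4], S[3], S[2], T[2]],
        [S[3], S[2], S[1], T[1]],
        [S[2], S[1], S[0], T[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return tuple(M[i][3] / M[i][i] for i in range(3))


def best_temperature(a, b, lo=0.0, hi=1.0):
    """Temperature in [lo, hi] that maximizes the fitted curve."""
    if a < 0:
        # Concave curve: the vertex is the maximum; clamp it to the range.
        return min(max(-b / (2 * a), lo), hi)
    # Otherwise the maximum sits at an endpoint (c does not affect the choice).
    return lo if a * lo * lo + b * lo >= a * hi * hi + b * hi else hi


# Made-up mean scores for one hypothetical model at five temperatures.
temps = [0.0, 0.25, 0.5, 0.75, 1.0]
scores = [7.68, 7.955, 7.98, 7.755, 7.28]
a, b, c = fit_quadratic(temps, scores)
print("estimated optimal temperature:", round(best_temperature(a, b), 3))
```

A quadratic is the simplest curve that can express "quality rises, peaks, then falls" as temperature grows, which is presumably why it is a natural choice for this kind of sensitivity analysis.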

Exploring the Capabilities of LLMs for IMU-based Fine-grained Human Activity Understanding

Lilin Xu,Kaiyuan Hou,Xiaofan Jiang

Task: Use large language models (LLMs) for fine-grained human activity recognition (HAR), specifically air-written letter recognition.

Motivation: Existing approaches focus on coarse activities such as walking or running, and pretrained LLMs perform very poorly on fine-grained HAR tasks such as air-written letter recognition.

Details

Method: Fine-tune LLMs with a self-collected dataset and few-shot learning for the 2D setting; design an encoder-based pipeline that maps 3D data into 2D equivalents while preserving spatiotemporal information.

Result: Up to a 129x improvement on 2D data; in the 3D mid-air writing scenario, the end-to-end pipeline reaches 78% accuracy on recognizing words of up to 5 letters.

Conclusion: LLMs can serve as viable tools for fine-grained HAR.

Abstract: Human activity recognition (HAR) using inertial measurement units (IMUs) increasingly leverages large language models (LLMs), yet existing approaches focus on coarse activities like walking or running. Our preliminary study indicates that pretrained LLMs fail catastrophically on fine-grained HAR tasks such as air-written letter recognition, achieving only near-random guessing accuracy. In this work, we first bridge this gap for flat-surface writing scenarios: by fine-tuning LLMs with a self-collected dataset and few-shot learning, we achieved up to a 129x improvement on 2D data. To extend this to 3D scenarios, we designed an encoder-based pipeline that maps 3D data into 2D equivalents, preserving the spatiotemporal information for robust letter prediction. Our end-to-end pipeline achieves 78% accuracy on word recognition with up to 5 letters in mid-air writing scenarios, establishing LLMs as viable tools for fine-grained HAR.

Girma Yohannis Bade,Zahra Ahani,Olga Kolesnikova,José Luis Oropeza,Grigori Sidorov

Task: Detect abusive texts targeting women on social media.

Motivation: Misuse of social media is a growing problem, and technical means are needed to moderate content effectively, especially abusive speech targeting women.

Details

Method: Train logistic regression and BERT models on the Tamil and Malayalam datasets from DravidianLangTech@2025.

Result: Macro F1 of 0.729 for BERT and 0.6279 for logistic regression.

Conclusion: BERT outperforms logistic regression at detecting abusive text, providing an effective tool for social media content moderation.

Abstract: The increasing misuse of social media has become a concern; however, technological solutions are being developed to moderate its content effectively. This paper focuses on detecting abusive texts targeting women on social media platforms. Abusive speech refers to communication intended to harm or incite hatred against vulnerable individuals or groups. Specifically, this study aims to identify abusive language directed toward women. To achieve this, we used logistic regression and BERT as base models, trained on datasets sourced from DravidianLangTech@2025 for the Tamil and Malayalam languages. The models were evaluated on test datasets, resulting in a 0.729 macro F1 score for BERT and 0.6279 for logistic regression in Tamil and Malayalam, respectively.
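
The reported metric is macro F1, which averages per-class F1 scores so that the minority class counts as much as the majority class. A minimal sketch of the computation follows; the function name `macro_f1` and the labels are illustrative assumptions, not data from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels: 1 = abusive, 0 = non-abusive.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(round(macro_f1(y_true, y_pred), 3))  # 0.733
```

Here the abusive class scores F1 = 0.667 and the non-abusive class 0.8, so the macro average is about 0.733 even though plain accuracy is 0.75.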

Enhancing Traffic Sign Recognition On The Performance Based On Yolov8

Baba Ibrahim,Zhou Kui

Task: Propose an enhanced traffic sign detection system based on YOLOv8.

Motivation: Traffic sign recognition is essential for autonomous driving and advanced driver-assistance systems, but accurate detection and classification remain difficult because of small sign sizes, changing environmental conditions, occlusion, and class imbalance.

Details

Method: Combine advanced data augmentation, Coordinate Attention (CA), a Bidirectional Feature Pyramid Network (BiFPN), dynamic modules (ODConv and LSKA), and refined loss functions (EIoU and WIoU combined with Focal Loss).

Result: Experiments on the GTSRB, TT100K, and GTSDB datasets show marked gains in detection accuracy, stronger robustness under adverse conditions, and real-time inference on edge devices.

Conclusion: The findings offer practical insights for deploying reliable traffic sign recognition systems in real-world autonomous driving scenarios.

Abstract: Traffic sign recognition plays a crucial role in the development of autonomous vehicles and advanced driver-assistance systems (ADAS). Despite significant advances in deep learning and object detection, accurately detecting and classifying traffic signs remains challenging due to their small sizes, variable environmental conditions, occlusion, and class imbalance. This thesis presents an enhanced YOLOv8-based detection system that integrates advanced data augmentation techniques, novel architectural enhancements including Coordinate Attention (CA), Bidirectional Feature Pyramid Network (BiFPN), and dynamic modules such as ODConv and LSKA, along with refined loss functions (EIoU and WIoU combined with Focal Loss). Extensive experiments conducted on datasets including GTSRB, TT100K, and GTSDB demonstrate marked improvements in detection accuracy, robustness under adverse conditions, and real-time inference on edge devices. The findings contribute actionable insights for deploying reliable traffic sign recognition systems in real-world autonomous driving scenarios.
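
Of the loss functions named above, EIoU is the easiest to show in isolation. The sketch below implements the standard EIoU formulation for one pair of axis-aligned boxes in (x1, y1, x2, y2) form; it omits the WIoU and Focal Loss components the thesis combines with it, and the function name `eiou_loss` and the `eps` guard are assumptions made for illustration.

```python
def eiou_loss(box, gt, eps=1e-9):
    """EIoU loss: 1 - IoU + center, width, and height penalties."""
    # Intersection and union areas.
    inter_w = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    inter_h = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = inter_w * inter_h
    w, h = box[2] - box[0], box[3] - box[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    iou = inter / (w * h + gw * gh - inter + eps)

    # Width and height of the smallest box enclosing both inputs.
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])

    # Squared distance between box centers, scaled by the enclosing diagonal.
    dx = (box[0] + box[2]) / 2 - (gt[0] + gt[2]) / 2
    dy = (box[1] + box[3]) / 2 - (gt[1] + gt[3]) / 2
    center_term = (dx * dx + dy * dy) / (cw * cw + ch * ch + eps)

    # Separate penalties for width and height mismatch.
    width_term = (w - gw) ** 2 / (cw * cw + eps)
    height_term = (h - gh) ** 2 / (ch * ch + eps)
    return 1.0 - iou + center_term + width_term + height_term


# Two same-sized boxes offset diagonally by one unit.
print(round(eiou_loss((0, 0, 2, 2), (1, 1, 3, 3)), 4))  # 0.9683
```

Unlike plain IoU loss, the center and size terms keep the gradient informative when boxes do not overlap, which matters for small objects such as distant traffic signs.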

The Material Contracts Corpus

Peter Adelson,Julian Nyarko

Task: Build and release the Material Contracts Corpus (MCC), a dataset supporting empirical research on contract design and legal language and the development of AI legal tools.

Motivation: Provide a public dataset for studying contract design and legal language, and promote the use of AI in the legal domain.

Details

Method: Classify contracts with machine learning and natural language processing techniques (including a fine-tuned LLaMA-2 model) and provide metadata such as document format and amendment status.

Result: The MCC contains over one million contracts categorized by agreement type, and documents trends in contractual language, length, and complexity.

Conclusion: The MCC is a publicly available resource, offered for bulk download and online access, that supports legal research and AI tool development.

Abstract: This paper introduces the Material Contracts Corpus (MCC), a publicly available dataset comprising over one million contracts filed by public companies with the U.S. Securities and Exchange Commission (SEC) between 2000 and 2023. The MCC facilitates empirical research on contract design and legal language, and supports the development of AI-based legal tools. Contracts in the corpus are categorized by agreement type and linked to specific parties using machine learning and natural language processing techniques, including a fine-tuned LLaMA-2 model for contract classification. The MCC further provides metadata such as filing form, document format, and amendment status. We document trends in contractual language, length, and complexity over time, and highlight the dominance of employment and security agreements in SEC filings. This resource is available for bulk download and online access at https://mcc.law.stanford.edu.

UAC: Uncertainty-Aware Calibration of Neural Networks for Gesture Detection

Farida Al Haddad,Yuxin Wang,Malcolm Mielle

Task: Propose UAC (Uncertainty-Aware Calibration), a two-step method addressing probability calibration and robustness in IMU-based gesture recognition.

Motivation: Strict safety requirements in safety-critical domains (such as construction, manufacturing, and healthcare) have limited the adoption of AI; predicted probabilities must be accurately calibrated and models must be robust to out-of-distribution data.

Details

Method: First, design an uncertainty-aware gesture network architecture that predicts gesture probabilities together with their uncertainties; then calibrate predictions over multiple IMU data windows using an entropy-weighted expectation.

Result: On three public IMU datasets, UAC outperforms existing calibration methods (temperature scaling, entropy maximization, and Laplace approximation), improving accuracy and calibration in both in-distribution and out-of-distribution scenarios.

Conclusion: Uncertainty-aware calibration significantly improves calibration and accuracy in IMU-based gesture recognition and outperforms existing methods.

Abstract: Artificial intelligence has the potential to impact safety and efficiency in safety-critical domains such as construction, manufacturing, and healthcare. For example, using sensor data from wearable devices, such as inertial measurement units (IMUs), human gestures can be detected while maintaining privacy, thereby ensuring that safety protocols are followed. However, strict safety requirements in these domains have limited the adoption of AI, since accurate calibration of predicted probabilities and robustness against out-of-distribution (OOD) data are necessary. This paper proposes UAC (Uncertainty-Aware Calibration), a novel two-step method to address these challenges in IMU-based gesture recognition. First, we present an uncertainty-aware gesture network architecture that predicts both gesture probabilities and their associated uncertainties from IMU data. This uncertainty is then used to calibrate the probabilities of each potential gesture. Second, an entropy-weighted expectation of predictions over multiple IMU data windows is used to improve accuracy while maintaining correct calibration. Our method is evaluated using three publicly available IMU datasets for gesture detection and is compared to three state-of-the-art calibration methods for neural networks: temperature scaling, entropy maximization, and Laplace approximation. UAC outperforms existing methods, achieving improved accuracy and calibration in both OOD and in-distribution scenarios. Moreover, we find that, unlike our method, none of the state-of-the-art methods significantly improve the calibration of IMU-based gesture recognition models. In conclusion, our work highlights the advantages of uncertainty-aware calibration of neural networks, demonstrating improvements in both calibration and accuracy for gesture detection using IMU data.
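
The second step, an entropy-weighted expectation over several IMU windows, can be sketched as follows. The abstract does not give the weighting formula, so the choice below (weight = 1 minus normalized entropy) is an assumption, as are the function names and the example probabilities; it is one plausible reading, not the authors' implementation.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_weighted_expectation(window_probs):
    """Average per-window class probabilities, trusting confident windows more."""
    num_classes = len(window_probs[0])
    max_entropy = math.log(num_classes)
    # Assumed weighting: 1 for a one-hot prediction, 0 for a uniform one.
    weights = [max(0.0, 1.0 - entropy(p) / max_entropy) for p in window_probs]
    total = sum(weights)
    if total == 0:
        # Every window is uniform: fall back to a plain average.
        weights = [1.0] * len(window_probs)
        total = float(len(window_probs))
    return [
        sum(w * p[c] for w, p in zip(weights, window_probs)) / total
        for c in range(num_classes)
    ]


# Made-up predictions for three consecutive windows over three gestures.
windows = [
    [0.90, 0.05, 0.05],   # confident
    [1 / 3, 1 / 3, 1 / 3],  # uninformative, receives (near) zero weight
    [0.60, 0.30, 0.10],   # moderately confident
]
print([round(p, 3) for p in entropy_weighted_expectation(windows)])
```

Because the weights are non-negative and normalized, the result is still a valid probability distribution, and noisy windows cannot drag a confident prediction toward uniform.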

The Illusionist's Prompt: Exposing the Factual Vulnerabilities of Large Language Models with Linguistic Nuances

Yining Wang,Yuquan Wang,Xi Li,Mi Zhang,Geng Hong,Min Yang

Task: Study a new hallucination attack, The Illusionist's Prompt, that challenges the factual accuracy of LLMs under adversarial queries.

Motivation: As LLMs are widely adopted, the truthfulness of the information they provide is critical, yet existing research focuses on formal queries and overlooks maliciously crafted ones.

Details

Method: Embed linguistic nuances into adversarial queries to automatically generate highly transferable illusory prompts that induce internal factual errors in LLMs.

Result: Experiments show the attack is effective against black-box LLMs, including GPT-4o and Gemini-2.0, even when various defensive mechanisms are in place.

Conclusion: The study exposes the vulnerability of LLMs to carefully designed adversarial queries and underlines the need for further research on defenses against such attacks.

Abstract: As Large Language Models (LLMs) continue to advance, they are increasingly relied upon as real-time sources of information by non-expert users. To ensure the factuality of the information they provide, much research has focused on mitigating hallucinations in LLM responses, but only in the context of formal user queries, rather than maliciously crafted ones. In this study, we introduce The Illusionist's Prompt, a novel hallucination attack that incorporates linguistic nuances into adversarial queries, challenging the factual accuracy of LLMs against five types of fact-enhancing strategies. Our attack automatically generates highly transferable illusory prompts to induce internal factual errors, all while preserving user intent and semantics. Extensive experiments confirm the effectiveness of our attack in compromising black-box LLMs, including commercial APIs like GPT-4o and Gemini-2.0, even with various defensive mechanisms.

Comparative Analysis of Deepfake Detection Models: New Approaches and Perspectives

Matheus Martins Batista

Task: Compare and evaluate deepfake detection methods, in particular the performance of the GenConViT model within the DeepfakeBenchmark.

Motivation: Deepfake videos threaten society and the legal system, and effective detection methods are urgently needed to counter the spread of misinformation.

Details

Method: The study draws on digital image processing, machine learning, and artificial neural networks (such as CNNs, GANs, and Transformers), focuses its evaluation on GenConViT, and tests performance on the WildDeepfake and DeepSpeak datasets.

Result: After fine-tuning, GenConViT performs strongly on the DeepSpeak dataset, reaching 93.82% accuracy and surpassing the other architectures.

Conclusion: GenConViT offers a more effective tool for deepfake detection and supports the development of stronger countermeasures against misinformation.

Abstract: The growing threat posed by deepfake videos, capable of manipulating reality and disseminating misinformation, drives the urgent need for effective detection methods. This work investigates and compares different approaches for identifying deepfakes, focusing on the GenConViT model and its performance relative to other architectures present in the DeepfakeBenchmark. To contextualize the research, the social and legal impacts of deepfakes are addressed, as well as the technical fundamentals of their creation and detection, including digital image processing, machine learning, and artificial neural networks, with emphasis on Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), and Transformers. The performance evaluation of the models was conducted using relevant metrics and new datasets established in the literature, such as WildDeepfake and DeepSpeak, aiming to identify the most effective tools in the battle against misinformation and media manipulation. The results indicate that GenConViT, after fine-tuning, exhibited superior accuracy (93.82%) and generalization capacity, surpassing other architectures in the DeepfakeBenchmark on the DeepSpeak dataset. This study advances deepfake detection techniques and supports the development of more robust and effective solutions against the dissemination of false information.

Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications

Hongliu Cao,Ilias Driouich,Robin Singh,Eoin Thomas

Task: Propose a dynamic multi-agent system that designs personalized LLM judges for natural language generation applications.

Motivation: Existing LLM evaluation frameworks adapt poorly to different text styles and correlate weakly with human judgment.

Details

Method: Use a dynamic multi-agent system that iteratively refines evaluation prompts and balances task adaptation against alignment with human perception.

Result: Experiments show that the method improves evaluation accuracy and produces scores that align better with human perception.

Conclusion: The proposed multi-agent LLM Judge framework outperforms existing methods, adapting better to diverse text styles and aligning better with human judgment.

Abstract: Large Language Models (LLMs) have demonstrated impressive performance across diverse domains, yet they still encounter challenges such as insufficient domain-specific knowledge, biases, and hallucinations. This underscores the need for robust evaluation methodologies to accurately assess LLM-based applications. Traditional evaluation methods, which rely on word overlap or text embeddings, are inadequate for capturing the nuanced semantic information necessary to evaluate dynamic, open-ended text generation. Recent research has explored leveraging LLMs to mimic human reasoning and decision-making processes for evaluation purposes, an approach known as the LLM-as-a-judge framework. However, these existing frameworks have two significant limitations. First, they lack the flexibility to adapt to different text styles, including various answer and ground truth styles, thereby reducing their generalization performance. Second, the evaluation scores produced by these frameworks are often skewed and hard to interpret, showing a low correlation with human judgment. To address these challenges, we propose a novel dynamic multi-agent system that automatically designs personalized LLM judges for various natural language generation applications. This system iteratively refines evaluation prompts and balances the trade-off between the adaptive requirements of downstream tasks and the alignment with human perception. Our experimental results show that the proposed multi-agent LLM Judge framework not only enhances evaluation accuracy compared to existing methods but also produces evaluation scores that better align with human perception.
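
Alignment between judge scores and human ratings is commonly quantified with a rank correlation. The abstract does not say which statistic the authors use, so the sketch below assumes Spearman's rho; the function names and the scores are illustrative only.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for six generated answers.
judge_scores = [4.5, 3.0, 2.0, 5.0, 1.0, 3.5]
human_scores = [4, 3, 3, 5, 1, 2]
print(round(spearman(judge_scores, human_scores), 3))
```

A rank-based statistic is a reasonable fit for this setting because judge and human scales need not be linearly related; only the ordering of answers has to agree.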

Haphazard Inputs as Images in Online Learning

Rohit Agarwal,Aryan Dessai,Arif Ahmed Sekh,Krishna Agarwal,Alexander Horsch,Dilip K. Prasad

Task: Convert the varying feature space of an online learning setting into a fixed-dimension image representation.

Motivation: Current solutions for varying feature spaces are model-dependent and cannot use advanced deep learning methods, which require fixed-dimension inputs.

Details

Method: Propose a model-agnostic approach that converts the varying feature space into a fixed-dimension image representation on the fly, making it usable with any vision-based model (such as ResNet and ViT).

Result: The method's effectiveness is validated on four public datasets, demonstrating scalability and robustness.

Conclusion: The method provides a general solution for online learning with varying feature spaces that works with a range of vision models.

Abstract: Varying feature spaces in online learning settings, also known as haphazard inputs, have gained prominence because of their applicability in various fields. However, the current solutions to haphazard inputs are model-dependent and cannot benefit from the existing advanced deep-learning methods, which necessitate inputs of fixed dimensions. Therefore, we propose to transform the varying feature space in an online learning setting to a fixed-dimension image representation on the fly. This simple yet novel approach is model-agnostic, allowing any vision-based models to be applicable for haphazard inputs, as demonstrated using ResNet and ViT. The image representation handles the inconsistent input data seamlessly, making our proposed approach scalable and robust. We show the efficacy of our method on four publicly available datasets. The code is available at https://github.com/Rohit102497/HaphazardInputsAsImages.
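
To make the idea of a fixed-dimension image concrete, the sketch below shows one possible encoding: each known feature owns a column, an observed value is drawn as a bar whose height reflects its normalized magnitude, and a missing feature leaves its column empty. This is an illustrative assumption, not the encoding from the paper's repository; the function name, the `ranges` argument (which in a real online setting would have to be tracked as running minima and maxima), and the sample values are all hypothetical.

```python
def features_to_image(sample, feature_order, ranges, height=8):
    """Render a partial feature dict as a height x len(feature_order) binary grid."""
    image = [[0] * len(feature_order) for _ in range(height)]
    for col, name in enumerate(feature_order):
        if name not in sample:
            continue  # feature absent at this time step: column stays empty
        low, high = ranges[name]
        frac = (sample[name] - low) / (high - low) if high > low else 0.0
        frac = min(max(frac, 0.0), 1.0)
        # Observed features always get at least one pixel, so a minimal
        # value remains distinguishable from a missing feature.
        bar = 1 + round(frac * (height - 1))
        for row in range(height - bar, height):
            image[row][col] = 1
    return image


# Hypothetical stream item in which feature "b" did not arrive.
order = ["a", "b", "c"]
ranges = {"a": (0.0, 10.0), "b": (0.0, 10.0), "c": (0.0, 10.0)}
for row in features_to_image({"a": 0.0, "c": 10.0}, order, ranges, height=4):
    print(row)
```

Whatever subset of features arrives, the output shape stays the same, which is the property that lets a fixed-input vision model consume the stream.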

AI Hiring with LLMs: A Context-Aware and Explainable Multi-Agent Framework for Resume Screening

Frank P. -W. Lo,Jianing Qiu,Zeyu Wang,Haibao Yu,Yeming Chen,Gao Zhang,Benny Lo

Task: Propose a multi-agent framework based on large language models for resume screening.

Motivation: Resume screening is a critical but time-consuming step in hiring that must remain objective, accurate, and fair; the reasoning capabilities and knowledge of large language models open new opportunities to automate recruitment workflows.

Details

Method: Design a multi-agent framework comprising a resume extractor, an evaluator, a summarizer, and a score formatter, and integrate Retrieval-Augmented Generation (RAG) into the evaluator to incorporate external knowledge.

Result: The framework's effectiveness for resume screening is assessed by comparing AI-generated scores with ratings from HR professionals.

Conclusion: Multi-agent RAG-LLM systems show potential for automating resume screening, enabling more efficient and scalable hiring workflows.

Abstract: Resume screening is a critical yet time-intensive process in talent acquisition, requiring recruiters to analyze vast volumes of job applications while remaining objective, accurate, and fair. With the advancements in Large Language Models (LLMs), their reasoning capabilities and extensive knowledge bases open new opportunities to streamline and automate recruitment workflows. In this work, we propose a multi-agent framework for resume screening using LLMs to systematically process and evaluate resumes. The framework consists of four core agents: a resume extractor, an evaluator, a summarizer, and a score formatter. To enhance the contextual relevance of candidate assessments, we integrate Retrieval-Augmented Generation (RAG) within the resume evaluator, allowing incorporation of external knowledge sources, such as industry-specific expertise, professional certifications, university rankings, and company-specific hiring criteria. This dynamic adaptation enables personalized recruitment, bridging the gap between AI automation and talent acquisition. We assess the effectiveness of our approach by comparing AI-generated scores with ratings provided by HR professionals on a dataset of anonymized online resumes. The findings highlight the potential of multi-agent RAG-LLM systems in automating resume screening, enabling more efficient and scalable hiring workflows.

Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments

Chenyu Zhang,Daniil Cherniavskii,Andrii Zadaianchuk,Antonios Tragoudaras,Antonios Vozikis,Thijmen Nijdam,Derck W. E. Prinzhorn,Mark Bodracska,Nicu Sebe,Efstratios Gavves

Task: Evaluate the physical reasoning of video generation models.

Motivation: Examine whether image and video generation models possess world modeling capabilities, that is, whether they generate realistic videos that obey physical conservation laws.

Details

Method: Introduce the Morpheus benchmark, which contains 80 real-world videos and assesses physical plausibility using conservation laws and physics-informed metrics.

Result: Even with advanced prompting and video conditioning, current models struggle to encode physical principles.

Conclusion: Video generation models still fall short on physical plausibility and need further improvement.

Abstract: Recent advances in image and video generation raise hopes that these models possess world modeling capabilities, that is, the ability to generate realistic, physically plausible videos. This could revolutionize applications in robotics, autonomous driving, and scientific simulation. However, before treating these models as world models, we must ask: Do they adhere to physical conservation laws? To answer this, we introduce Morpheus, a benchmark for evaluating video generation models on physical reasoning. It features 80 real-world videos capturing physical phenomena, guided by conservation laws. Since artificial generations lack ground truth, we assess physical plausibility using physics-informed metrics evaluated with respect to infallible conservation laws known per physical setting, leveraging advances in physics-informed neural networks and vision-language foundation models. Our findings reveal that even with advanced prompting and video conditioning, current models struggle to encode physical principles despite generating aesthetically pleasing videos. All data, leaderboard, and code are open-sourced at our project page.

Synthesized Annotation Guidelines are Knowledge-Lite Boosters for Clinical Information Extraction

Enshuo Hsu,Martin Ugbala,Krishna Kumar Kookal,Zouaidi Kawtar,Nicholas L. Rider,Muhammad F. Walji,Kirk Roberts

Task: Propose a method in which a large language model (LLM) synthesizes its own annotation guidelines, reducing human input and improving information extraction performance.

Motivation: Writing annotation guidelines by hand is time-consuming and knowledge-intensive, and the resulting guidelines are highly task-specific and hard to reuse.

Details

Method: Use the knowledge summarization and text generation capabilities of LLMs to synthesize annotation guidelines without human input.

Result: Zero-shot experiments on several clinical named entity recognition benchmarks show clear performance gains, and on some tasks the synthesized guidelines even outperform human-written ones.

Conclusion: The study presents a novel LLM self-improving method that applies to multiple biomedical domains and needs minimal knowledge and human input.

Abstract: Generative information extraction using large language models, particularly through few-shot learning, has become a popular method. Recent studies indicate that providing a detailed, human-readable guideline, similar to the annotation guidelines traditionally used for training human annotators, can significantly improve performance. However, constructing these guidelines is both labor- and knowledge-intensive. Additionally, the definitions are often tailored to meet specific needs, making them highly task-specific and often non-reusable. Handling these subtle differences requires considerable effort and attention to detail. In this study, we propose a self-improving method that harvests the knowledge summarization and text generation capacity of LLMs to synthesize annotation guidelines while requiring virtually no human input. Our zero-shot experiments on the clinical named entity recognition benchmarks, 2012 i2b2 EVENT, 2012 i2b2 TIMEX, 2014 i2b2, and 2018 n2c2, showed 25.86%, 4.36%, 0.20%, and 7.75% improvements in strict F1 scores over the no-guideline baseline. The LLM-synthesized guidelines showed equivalent or better performance compared to human-written guidelines by 1.15% to 4.14% in most tasks. In conclusion, this study proposes a novel LLM self-improving method that requires minimal knowledge and human input and is applicable to multiple biomedical domains.

LiDAR-based Object Detection with Real-time Voice Specifications

Anurag Kulkarni

Task: Develop a LiDAR-based multimodal object detection system with real-time voice output.

Motivation: Integrate 3D point clouds and RGB images, address class imbalance, improve detection accuracy, and enhance accessibility and safety in autonomous driving and assistive technology.

Details

Method: Use a multi-modal PointNet framework with weighted loss and adaptive training techniques, and provide real-time voice output and 3D visualization through a Tkinter prototype.

Result: Reaches 87.0% validation accuracy on a 3000-sample subset, well above the 67.5% of a 200-sample baseline.

Conclusion: The system is scalable for environmental perception and human-computer interaction and is in line with current research trends.

Abstract: This paper presents a LiDAR-based object detection system with real-time voice specifications, integrating KITTI's 3D point clouds and RGB images through a multi-modal PointNet framework. It achieves 87.0% validation accuracy on a 3000-sample subset, surpassing a 200-sample baseline of 67.5% by combining spatial and visual data, addressing class imbalance with weighted loss, and refining training via adaptive techniques. A Tkinter prototype provides natural Indian male voice output using Edge TTS (en-IN-PrabhatNeural), alongside 3D visualizations and real-time feedback, enhancing accessibility and safety in autonomous navigation, assistive technology, and beyond. The study offers a detailed methodology, comprehensive experimental analysis, and a broad review of applications and challenges, establishing this work as a scalable advancement in human-computer interaction and environmental perception, aligned with current research trends.
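
The abstract says class imbalance is handled with a weighted loss but does not specify the weighting. A common choice, assumed here, is inverse-frequency class weights applied to cross-entropy; the function names and the label counts below are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the total weight."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


# Made-up, imbalanced label set: 8 cars, 2 cyclists.
labels = ["Car"] * 8 + ["Cyclist"] * 2
weights = inverse_frequency_weights(labels)
print(weights)  # {'Car': 0.625, 'Cyclist': 2.5}

# Hypothetical model that is confident on cars and unsure on cyclists.
probs = [{"Car": 0.9, "Cyclist": 0.1}] * 8 + [{"Car": 0.6, "Cyclist": 0.4}] * 2
print(round(weighted_cross_entropy(probs, labels, weights), 4))
```

With these weights the two cyclist samples contribute as much total weight as the eight car samples, so poor performance on the rare class is no longer hidden by the majority class.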

Scraping the Shadows: Deep Learning Breakthroughs in Dark Web Intelligence

Ingmar Bakermans,Daniel De Pascale,Gonçalo Marcelino,Giuseppe Cascavilla,Zeno Geradts

Task: Develop a framework to automate data extraction from darknet markets (DNMs) and evaluate three state-of-the-art Named Entity Recognition (NER) models on this task.

Motivation: Manually extracting data from darknet markets is time-consuming and error-prone; automating the process is essential for law enforcement agencies combating crime.

Details

Method: Introduce a newly annotated dataset used to train, fine-tune, and evaluate three NER models (ELMo-BiLSTM, UniversalNER, and GLiNER).

Result: NER models perform well at DNM data extraction, achieving 91% precision, 96% recall, and a 94% F1 score; after fine-tuning, UniversalNER performs best.

Conclusion: State-of-the-art NER models can effectively automate data extraction from darknet markets, and fine-tuning further improves performance.

Abstract: Darknet markets (DNMs) facilitate the trade of illegal goods on a global scale. Gathering data on DNMs is critical to ensuring law enforcement agencies can effectively combat crime. Manually extracting data from DNMs is an error-prone and time-consuming task. Aiming to automate this process, we develop a framework for extracting data from DNMs and evaluate the application of three state-of-the-art Named Entity Recognition (NER) models, ELMo-BiLSTM (Shah et al., 2022), UniversalNER (Zhou et al., 2024), and GLiNER (Zaratiana et al., 2023), to the task of extracting complex entities from DNM product listing pages. We propose a new annotated dataset, which we use to train, fine-tune, and evaluate the models. Our findings show that state-of-the-art NER models perform well in information extraction from DNMs, achieving 91% Precision, 96% Recall, and an F1 score of 94%. In addition, fine-tuning enhances model performance, with UniversalNER achieving the best performance.

VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning

Xianwei Zhuang,Yuxin Xie,Yufan Deng,Dongchao Yang,Liming Liang,Jinghan Ru,Yuguo Yin,Yuexian Zou

Task: Present and improve VARGPT-v1.1, a unified visual autoregressive model for visual understanding and image synthesis.

Motivation: Improve the model's performance on visual understanding and generation tasks by combining iterative visual instruction tuning with reinforcement learning.

Details

Method: Adopts a new training strategy (DPO), an expanded training corpus, an upgraded language model backbone, higher image generation resolution, and image editing capability. Result: Reaches state-of-the-art performance across a range of visual understanding and generation tasks, with significant gains in comprehension and generation metrics. Conclusion: A unified visual autoregressive model can use flexible training strategies to unify visual understanding, generation, and editing, and shows promising scalability. Abstract: In this work, we present VARGPT-v1.1, an advanced unified visual autoregressive model that builds upon our previous framework VARGPT. The model preserves the dual paradigm of next-token prediction for visual understanding and next-scale generation for image synthesis. Specifically, VARGPT-v1.1 integrates: (1) a novel training strategy combining iterative visual instruction tuning with reinforcement learning through Direct Preference Optimization (DPO), (2) an expanded training corpus containing 8.3M visual-generative instruction pairs, (3) an upgraded language model backbone using Qwen2, (4) enhanced image generation resolution, and (5) emergent image editing capabilities without architectural modifications. These advancements enable VARGPT-v1.1 to achieve state-of-the-art performance in multimodal understanding and text-to-image instruction-following tasks, demonstrating significant improvements in both comprehension and generation metrics. Notably, through visual instruction tuning, the model acquires image editing functionality while maintaining architectural consistency with its predecessor, revealing the potential for unified visual understanding, generation, and editing. Our findings suggest that well-designed unified visual autoregressive models can effectively adopt flexible training strategies from large language models (LLMs), exhibiting promising scalability. The codebase and model weights are publicly available at https://github.com/VARGPT-family/VARGPT-v1.1.
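
For readers unfamiliar with Direct Preference Optimization, the standard per-pair DPO objective can be sketched as below. This is the generic formulation only; how VARGPT-v1.1 builds preference pairs or sets `beta` is not stated in the abstract, and the function names and the default `beta=0.1` are illustrative assumptions.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one preference pair.

    Each argument is the summed token log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)
```

When the policy and the reference agree on both responses, the loss equals log 2; it falls below that once the policy raises the preferred response relative to the rejected one.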

Short-PHD: Detecting Short LLM-generated Text with Topological Data Analysis After Off-topic Content Insertion

Dongjun Wei,Minjia Mao,Xiao Fang,Michael Chau

Task: Propose Short-PHD, a zero-shot method for detecting short LLM-generated text.

Motivation: Malicious use of large language models (LLMs) creates a need to detect LLM-generated text, and existing methods perform poorly on short texts.

Details

Method: Off-topic content is inserted to stabilize the persistent homology dimension (PHD) estimate for short texts, and LLM-generated text is identified using a detection threshold. Result: Experiments on public and generated datasets show that Short-PHD outperforms existing zero-shot methods. Conclusion: Short-PHD is an effective method for detecting short LLM-generated text, and its code is available online. Abstract: The malicious usage of large language models (LLMs) has motivated the detection of LLM-generated texts. Previous work in topological data analysis shows that the persistent homology dimension (PHD) of text embeddings can serve as a more robust and promising score than other zero-shot methods. However, effectively detecting short LLM-generated texts remains a challenge. This paper presents Short-PHD, a zero-shot LLM-generated text detection method tailored for short texts. Short-PHD stabilizes the estimation of the previous PHD method for short texts by inserting off-topic content before the given input text and identifies LLM-generated text based on an established detection threshold. Experimental results on both public and generated datasets demonstrate that Short-PHD outperforms existing zero-shot methods in short LLM-generated text detection. Implementation codes are available online.

QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding

Binh M. Le,Shaoyuan Xu,Jinmiao Fu,Zhishen Huang,Moyan Li,Yanhui Guo,Hongdong Li,Sameera Ramasinghe,Bryan Wang

Task: Propose QID, a new method that improves the ability of vision-language models to identify query-specific regions in visual document understanding tasks.

Motivation: Existing methods struggle to adapt to new datasets when data is scarce, and approaches that directly modify the network architecture perform poorly.

Details

Method: A dual-module framework is introduced, consisting of a query-aware module and a query-agnostic module that operate independently of the vision attention blocks, to strengthen the learning of query embeddings and visual semantic identification. Result: Experiments on multiple datasets show significant performance gains, especially when handling text-rich documents in data-scarce settings. Conclusion: QID keeps the architecture unchanged and, by optimizing query embeddings, significantly improves performance on visual document understanding tasks. Abstract: In Visual Document Understanding (VDU) tasks, fine-tuning a pre-trained Vision-Language Model (VLM) with new datasets often falls short in optimizing the vision encoder to identify query-specific regions in text-rich document images. Existing methods that directly inject queries into model layers by modifying the network architecture often struggle to adapt to new datasets with limited annotations. To address this, we introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder, leading to notable performance gains, particularly in data-scarce fine-tuning scenarios. Specifically, our approach introduces a dual-module framework: a query-aware module that generates a unique query vector to precisely guide the model's focus, as well as a query-agnostic module that captures the positional relationships among tokens, ensuring robust spatial understanding. Notably, both modules operate independently of the vision attention blocks, facilitating targeted learning of query embeddings and enhancing visual semantic identification. Experiments with OCR-free VLMs across multiple datasets demonstrate significant performance improvements using our method, especially in handling text-rich documents in data-scarce environments.

TheBlueScrubs-v1, a comprehensive curated medical dataset derived from the internet

Luis Felipe,Carlos Garcia,Issam El Naqa,Monique Shotande,Aakash Tripathi,Vivek Rudrapatna,Ghulam Rasool,Danielle Bitterman,Gilmer Valdes

Task: Build and validate TheBlueScrubs-v1, a large and diverse medical dataset for training clinical large language models (cLLMs).

Motivation: Existing public medical datasets such as PubMed are limited in size and narrow in coverage, and cannot meet the needs of comprehensive medical applications.

Details

Method: A two-stage filtering pipeline is used: a logistic regression model first screens documents (AUC of about 0.95), then a 70B-parameter Llama 3.1 model verifies them, and each text is assigned three LLM-based quality scores. Result: The dataset contains 25 billion medical tokens, clinician review confirms its high quality, and two demonstration tasks show its practical value. Conclusion: TheBlueScrubs-v1 is a potentially useful resource for medical AI research. Abstract: The need for robust and diverse data sets to train clinical large language models (cLLMs) is critical given that currently available public repositories often prove too limited in size or scope for comprehensive medical use. While resources like PubMed provide foundational medical literature, they capture only a narrow range of formal publications and omit the broader medical discourse on the internet. To address these deficits, we introduce TheBlueScrubs-v1, a curated dataset of over 25 billion medical tokens - nearly three times larger than PubMed - drawn from a broad-scale internet corpus. Our two-stage filtering pipeline employs a Logistic Regression model for document screening (achieving an AUC of approximately 0.95 on external validation), followed by verification via a 70B-parameter Llama 3.1 instruct model. Each text is assigned three LLM-based quality scores encompassing medical relevance, precision and factual detail, and safety and ethical standards. Clinician reviews confirm high concordance with these automated evaluations, and a specialized cancer classifier further labels approximately 11 billion oncology tokens. Two demonstration tasks highlight the dataset's practical value: first, we distill the safety evaluations to a smaller BERT-style model that reaches an AUC near 0.96 on unseen data; second, we fine-tune a compact LLM on a filtered subset, showing measurable improvements over standard baselines in medical benchmarks as well as private ones. This Data Descriptor details the dataset's creation and validation, underscoring its potential utility for medical AI research.

DiSRT-In-Bed: Diffusion-Based Sim-to-Real Transfer Framework for In-Bed Human Mesh Recovery

Jing Gao,Ce Zheng,Laszlo A. Jeni,Zackory Erickson

Task: Propose a Sim-to-Real Transfer Framework for recovering in-bed human meshes from overhead depth images.

Motivation: In-bed human mesh recovery has important healthcare applications, but real data is hard to collect and carries high privacy costs, and existing methods generalize poorly.

Details

Method: Large-scale synthetic data is used alongside limited or no real data, with a diffusion model bridging the gap between synthetic and real data. Result: Experiments validate the framework and show significantly improved robustness and adaptability across diverse healthcare scenarios. Conclusion: The proposed Sim-to-Real Transfer Framework offers an efficient solution for in-bed human mesh recovery that generalizes well. Abstract: In-bed human mesh recovery can be crucial and enabling for several healthcare applications, including sleep pattern monitoring, rehabilitation support, and pressure ulcer prevention. However, it is difficult to collect large real-world visual datasets in this domain, in part due to privacy and expense constraints, which in turn presents significant challenges for training and deploying deep learning models. Existing in-bed human mesh estimation methods often rely heavily on real-world data, limiting their ability to generalize across different in-bed scenarios, such as varying coverings and environmental settings. To address this, we propose a Sim-to-Real Transfer Framework for in-bed human mesh recovery from overhead depth images, which leverages large-scale synthetic data alongside limited or no real-world samples. We introduce a diffusion model that bridges the gap between synthetic data and real data to support generalization in real-world in-bed pose and body inference scenarios. Extensive experiments and ablation studies validate the effectiveness of our framework, demonstrating significant improvements in robustness and adaptability across diverse healthcare scenarios.

Revisiting Funnel Transformers for Modern LLM Architectures with Comprehensive Ablations in Training and Inference Configurations

DongHyun Choi,Lucas Spangher,Chris Hidey,Peter Grabowski,Ramy Eskander

Task: Study the impact of applying funnel compression in contemporary Gemma2 Transformer architectures.

Motivation: Transformer models are computationally expensive, and earlier optimization techniques may not carry over to modern models, so the effect of funnel compression on contemporary architectures needs to be examined.

Details

Method: Different funnel configurations and recovery methods are evaluated systematically, covering standard versus funnel-aware pretraining strategies, the impact of funnel-aware fine-tuning, and the type of sequence recovery operation. Result: Funneling introduces information bottlenecks that can cause performance loss, particularly in larger models (e.g., Gemma 7B), but careful selection of the funneling layer and recovery strategy substantially reduces the loss while cutting latency by up to 44%. Conclusion: The study exposes the trade-off between computational efficiency and model accuracy and gives practical guidance for deploying funnel-based approaches in large-scale natural language applications. Abstract: Transformer-based Large Language Models, which suffer from high computational costs, advance so quickly that techniques proposed to streamline earlier iterations are not guaranteed to benefit more modern models. Building upon the Funnel Transformer proposed by Dai and Le (2020), which progressively compresses intermediate representations, we investigate the impact of funneling in contemporary Gemma2 Transformer architectures. We systematically evaluate various funnel configurations and recovery methods, comparing: (1) standard pretraining to funnel-aware pretraining strategies, (2) the impact of funnel-aware fine-tuning, and (3) the type of sequence recovery operation. Our results demonstrate that funneling creates information bottlenecks that propagate through deeper network layers, particularly in larger models (e.g., Gemma 7B), leading to at times unmanageable performance loss. However, carefully selecting the funneling layer and employing effective recovery strategies can substantially mitigate performance losses, achieving up to a 44% reduction in latency. Our findings highlight key trade-offs between computational efficiency and model accuracy, providing practical guidance for deploying funnel-based approaches in large-scale natural language applications.
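
To make "funneling" and "sequence recovery" concrete, here is a minimal sketch of one possible pair of operations: strided mean pooling along the sequence axis, followed by recovery through repetition. The abstract does not say which pooling or recovery operators the paper evaluates, so both choices and the function names are assumptions for illustration.

```python
def funnel_pool(hidden, stride=2):
    """Compress a sequence of vectors by mean-pooling windows of `stride` tokens."""
    pooled = []
    for start in range(0, len(hidden), stride):
        window = hidden[start:start + stride]  # last window may be shorter
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window)
                       for d in range(dim)])
    return pooled

def recover_by_repeat(pooled, stride, target_len):
    """Restore the original length by repeating each pooled vector `stride` times."""
    recovered = []
    for vec in pooled:
        recovered.extend(list(vec) for _ in range(stride))
    return recovered[:target_len]  # trim padding from a short last window

hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = funnel_pool(hidden, stride=2)
recovered = recover_by_repeat(pooled, stride=2, target_len=len(hidden))
```

The recovered sequence has the original length but not the original content: tokens that shared a window now share one vector, which is the information bottleneck the abstract describes.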

Emotion Recognition Using Convolutional Neural Networks

Shaoyuan Xu,Yang Cheng,Qian Lin,Jan P. Allebach

Task: Develop a deep learning system that recognizes seven basic emotions in both still images and real-time video.

Motivation: Emotion plays an important role in daily life, and recognizing it from facial expressions supports more efficient communication and understanding.

Details

Method: An emotion recognition classification and regression system is built from scratch, covering dataset collection, data preprocessing, model training, and testing. Result: Tested on two different datasets, the system exceeds 80% accuracy, and real-time testing shows that convolutional neural networks are feasible and efficient for real-time emotion detection. Conclusion: The proposed system recognizes emotions accurately and efficiently in both still images and real-time video. Abstract: Emotion has an important role in daily life, as it helps people communicate with and understand each other more efficiently. Facial expressions can be classified into 7 categories: angry, disgust, fear, happy, neutral, sad and surprise. How to detect and recognize these seven emotions has become a popular topic in the past decade. In this paper, we develop an emotion recognition system that can apply emotion recognition on both still images and real-time videos by using deep learning. We build our own emotion recognition classification and regression system from scratch, which includes dataset collection, data preprocessing, model training and testing. Given a certain image or a real-time video, our system is able to show the classification and regression results for all of the 7 emotions. The proposed system is tested on 2 different datasets, and achieved an accuracy of over 80%. Moreover, the result obtained from real-time testing proves the feasibility of implementing convolutional neural networks in real time to detect emotions accurately and efficiently.

Better Bill GPT: Comparing Large Language Models against Legal Invoice Reviewers

Nick Whitehouse,Nicole Lincoln,Stephanie Yiu,Lizzie Catterson,Rivindu Perera

Task: Compare large language models (LLMs) with humans on legal invoice review.

Motivation: Traditional legal invoice review is costly, inefficient, and inconsistent, so more efficient alternatives need to be explored.

Details

Method: An empirical study compares LLMs with human reviewers of different experience levels (early-career lawyers, experienced lawyers, and legal operations professionals) on accuracy, speed, and cost-effectiveness. Result: LLMs significantly outperform humans on every metric, with accuracy of up to 92%, faster review (as fast as 3.6 seconds per invoice), and a 99.97% cost reduction. Conclusion: LLM-powered legal spend management is already mature, and the remaining challenge is how to combine automation with human judgment strategically. Abstract: Legal invoice review is a costly, inconsistent, and time-consuming process, traditionally performed by Legal Operations, Lawyers or Billing Specialists who scrutinise billing compliance line by line. This study presents the first empirical comparison of Large Language Models (LLMs) against human invoice reviewers - Early-Career Lawyers, Experienced Lawyers, and Legal Operations Professionals - assessing their accuracy, speed, and cost-effectiveness. Benchmarking state-of-the-art LLMs against a ground truth set by expert legal professionals, our empirically substantiated findings reveal that LLMs decisively outperform humans across every metric. In invoice approval decisions, LLMs achieve up to 92% accuracy, surpassing the 72% ceiling set by experienced lawyers. On a granular level, LLMs dominate line-item classification, with top models reaching F-scores of 81%, compared to just 43% for the best-performing human group. Speed comparisons are even more striking - while lawyers take 194 to 316 seconds per invoice, LLMs are capable of completing reviews in as fast as 3.6 seconds. And cost? AI slashes review expenses by 99.97%, reducing invoice processing costs from an average of $4.27 per invoice for human invoice reviewers to mere cents. These results highlight the evolving role of AI in legal spend management. As law firms and corporate legal departments struggle with inefficiencies, this study signals a seismic shift: The era of LLM-powered legal spend management is not on the horizon, it has arrived. The challenge ahead is not whether AI can perform as well as human reviewers, but how legal teams will strategically incorporate it, balancing automation with human discretion.

Comprehensive Relighting: Generalizable and Consistent Monocular Human Relighting and Harmonization

Junying Wang,Jingyuan Liu,Xin Sun,Krishna Kumar Singh,Zhixin Shu,He Zhang,Jimei Yang,Nanxuan Zhao,Tuanfeng Y. Wang,Simon S. Chen,Ulrich Neumann,Jae Shin Yoon

Task: Propose Comprehensive Relighting, an all-in-one approach for controlling and harmonizing the lighting of humans with arbitrary body parts in images or videos.

Motivation: The lack of datasets restricts existing image-based relighting models to specific scenarios (such as faces or static humans), which makes building a generalizable model extremely challenging.

Details

Method: A pre-trained diffusion model is used as a general image prior to jointly model human relighting and background harmonization in a coarse-to-fine framework, and an unsupervised temporal lighting model is introduced to improve temporal coherence. Result: Experiments show that Comprehensive Relighting outperforms existing methods in generalizability and temporal coherence of lighting. Conclusion: By combining a diffusion model with a temporal lighting module, the method achieves effective and generalizable relighting and background harmonization. Abstract: This paper introduces Comprehensive Relighting, the first all-in-one approach that can both control and harmonize the lighting from an image or video of humans with arbitrary body parts from any scene. Building such a generalizable model is extremely challenging due to the lack of datasets, restricting existing image-based relighting models to a specific scenario (e.g., face or static human). To address this challenge, we repurpose a pre-trained diffusion model as a general image prior and jointly model the human relighting and background harmonization in the coarse-to-fine framework. To further enhance the temporal coherence of the relighting, we introduce an unsupervised temporal lighting model that learns the lighting cycle consistency from many real-world videos without any ground truth. At inference time, our temporal lighting module is combined with the diffusion models through the spatio-temporal feature blending algorithms without extra training; and we apply a new guided refinement as a post-processing to preserve the high-frequency details from the input image. In the experiments, Comprehensive Relighting shows a strong generalizability and lighting temporal coherence, outperforming existing image-based human relighting and harmonization methods.

DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models

Sunghee Jung,Donghun Lee,Shinbok Lee,Gaeun Seo,Daniel Lee,Byeongil Ko,Junrae Cho,Kihyun Kim,Eunggyun Kim,Myeongcheol Shin

Task: Propose DiaTool-DPO, a new method that strengthens the dialogue capabilities of Tool-Augmented Large Language Models (TA-LLMs) through Direct Preference Optimization.

Motivation: Existing TA-LLMs struggle with incomplete queries and out-of-scope requests, and existing approaches rely mainly on supervised fine-tuning with expert trajectories.

Details

Method: TA-LLM interactions are modeled as a Markov Decision Process with 5 dialogue states, user queries are grouped into 3 types by their state transition trajectories, paired trajectory datasets of correct and incorrect dialogue flows are constructed automatically, and a specialized objective loss for dialogue control is introduced. Result: DiaTool-DPO reaches 94.8% in information gathering and 91% in tool call rejection, far above the baseline (44% and 9.6%), while maintaining core functionality. Conclusion: The method opens new possibilities for developing TA-LLMs that handle diverse real-world scenarios without additional expert demonstrations or human labeling. Abstract: Tool-Augmented Large Language Models (TA-LLMs) have shown promise in real-world applications, but face challenges in handling incomplete queries and out-of-scope requests. While existing approaches rely mainly on Supervised Fine-Tuning with expert trajectories, we propose DiaTool-DPO, a novel method that enhances TA-LLM's dialogue capabilities through Direct Preference Optimization. We model TA-LLM interactions as a Markov Decision Process with 5 distinct dialogue states and categorize user queries into 3 types based on their state transition trajectories. We automatically construct paired trajectory datasets of correct and incorrect dialogue flows and introduce a specialized objective loss for dialogue control. Our comprehensive evaluation demonstrates that DiaTool-DPO approaches GPT-4o's performance (94.8% in information gathering, 91% in tool call rejection) with substantial improvements over baseline (44% and 9.6% respectively) while maintaining core functionality. Our approach opens new possibilities for developing TA-LLMs that can handle diverse real-world scenarios without requiring additional expert demonstrations or human labeling.

Page Classification for Print Imaging Pipeline

Shaoyuan Xu,Cheng Lu,Mark Shaw,Peter Bauer,Jan P. Allebach

Task: Develop an SVM-based classification method that sorts images into five classes: text, picture, mixed, receipt, and highlight.

Motivation: Modern copiers and printers are equipped with processing pipelines designed for different kinds of images, but the existing method distinguishes only three classes and cannot serve applications that need more.

Details

Method: A more advanced SVM-based classification method is used, with four new features added. Result: Images of five types are classified. Conclusion: The new method extends the classification capability to cover more applications. Abstract: Digital copiers and printers are widely used nowadays. One of the most important things people care about is copying or printing quality. In order to improve it, we previously came up with an SVM-based classification method to classify images with only text, only pictures or a mixture of both based on the fact that modern copiers and printers are equipped with processing pipelines designed specifically for different kinds of images. However, in some other applications, we need to distinguish more than three classes. In this paper, we develop a more advanced SVM-based classification method using four more new features to classify 5 types of images which are text, picture, mixed, receipt and highlight.

SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models

Anil Ramakrishna,Yixin Wan,Xiaomeng Jin,Kai-Wei Chang,Zhiqi Bu,Bhanukiran Vinzamuri,Volkan Cevher,Mingyi Hong,Rahul Gupta

Task: Introduce SemEval-2025 Task 4, on unlearning sensitive content from Large Language Models (LLMs).

Motivation: Address the unlearning of sensitive content in LLMs across different use cases.

Details

Method: Three subtasks are designed: unlearning long-form synthetic creative documents, unlearning short-form synthetic biographies containing PII, and unlearning real documents. Result: Over 100 submissions were received from more than 30 institutions, and the key techniques and approaches are summarized. Conclusion: The paper summarizes the key techniques and lessons for unlearning sensitive content from LLMs. Abstract: We introduce SemEval-2025 Task 4: unlearning sensitive content from Large Language Models (LLMs). The task features 3 subtasks for LLM unlearning spanning different use cases: (1) unlearn long form synthetic creative documents spanning different genres; (2) unlearn short form synthetic biographies containing personally identifiable information (PII), including fake names, phone number, SSN, email and home addresses, and (3) unlearn real documents sampled from the target model's training dataset. We received over 100 submissions from over 30 institutions and we summarize the key techniques and lessons in this paper.

HALO: Human-Aligned End-to-end Image Retargeting with Layered Transformations

Yiran Xu,Siqi Xie,Zhuofang Li,Harris Shadmany,Yinxiao Li,Luciano Sbaiz,Miaosen Wang,Junjie Ke,Jose Lezama,Hang Qi,Han Zhang,Jesse Berent,Ming-Hsuan Yang,Irfan Essa,Jia-Bin Huang,Feng Yang

Task: Propose HALO, an end-to-end trainable image retargeting method that reduces visual artifacts while preserving image content and structure.

Motivation: Existing image retargeting methods still produce many artifacts or fail to preserve the original content and structure, so a more effective method is needed.

Details

Method: HALO decomposes the input image into salient and non-salient layers and applies a different warping field to each, and it proposes a perceptual structure similarity loss to minimize structure distortion. Result: Quantitative results and a user study on the RetargetMe dataset show that HALO reaches the state of the art, with user preference 18.4% higher than the baselines on average. Conclusion: Through layered processing and the structure similarity loss, HALO significantly improves retargeting quality and user satisfaction. Abstract: Image retargeting aims to change the aspect-ratio of an image while maintaining its content and structure with fewer visual artifacts. Existing methods still generate many artifacts or fail to maintain original content or structure. To address this, we introduce HALO, an end-to-end trainable solution for image retargeting. Since humans are more sensitive to distortions in salient areas than non-salient areas of an image, HALO decomposes the input image into salient/non-salient layers and applies different warping fields to different layers. To further minimize the structure distortion in the output images, we propose perceptual structure similarity loss which measures the structure similarity between input and output images and aligns with human perception. Both quantitative results and a user study on the RetargetMe dataset show that HALO achieves SOTA. Especially, our method achieves an 18.4% higher user preference compared to the baselines on average.

LVMed-R2: Perception and Reflection-driven Complex Reasoning for Medical Report Generation

Hao Wang,Shuchang Ye,Jinghao Lin,Usman Naseem,Jinman Kim

Task: Propose LVMed-R2, a new fine-tuning strategy that introduces complex reasoning and reflection mechanisms to improve large vision-language models (LVMs) on medical report generation (MRG).

Motivation: Existing LVMs produce logical inconsistencies and diagnostic errors in medical report generation and lack a reflection mechanism to detect and correct errors.

Details

Method: The LVMed-R2 strategy includes medical knowledge injection, perception-enhancing modules, and a perception tree to improve reasoning, plus a reflection mechanism for self-verification. Result: Experiments on the IU-Xray and MIMIC-CXR datasets show that LVMed-R2 significantly improves LVM performance on natural language generation and clinical efficacy metrics. Conclusion: Through complex reasoning and reflection, LVMed-R2 effectively improves the accuracy and reliability of medical report generation. Abstract: Large vision-language models (LVMs) hold great promise for automating medical report generation, potentially reducing the burden of manual reporting. State-of-the-art (SOTA) research fine-tunes general LVMs with medical data to align radiology images to corresponding medical reports. However, there are two key factors that limit these LVMs' performance. Firstly, LVMs lack complex reasoning capability that leads to logical inconsistencies and potential diagnostic errors in generated reports. Secondly, LVMs lack reflection mechanism that leads to an inability to discover errors in the thinking process. To address these gaps, we propose LVMed-R2, a new fine-tuning strategy that introduces complex reasoning and reflection mechanisms for LVMs to enhance medical report generation. To the best of our knowledge, this is the first work to introduce complex reasoning to the medical report generation (MRG) task. Our proposed complex reasoning contains medical knowledge injection and perception-enhancing modules which improve the accuracy of LVMs diagnosis, coupled with a perception tree to provide guidance to limit the perception range. Further, the reflection mechanism forces self-verification for outputs to correct for potential errors. We experimented by fine-tuning LVMs with our proposed LVMed-R2 strategy, using IU-Xray and MIMIC-CXR datasets. Our results, measured on natural language generation (NLG) metrics and clinical efficacy (CE) metrics, demonstrate that LVMs fine-tuned with the proposed reflection mechanism possess the ability to correct outputs and complex reasoning effectively and improve LVMs performance for MRG.

VIP: Video Inpainting Pipeline for Real World Human Removal

Huiming Sun,Yikang Li,Kangning Yang,Ruineng Li,Daitao Xing,Yangbo Xie,Lan Fu,Kaiyu Zhang,Ming Chen,Jiaming Ding,Jiang Geng,Jie Cai,Zibo Meng,Chiuman Ho

Task: Propose VIP, a promptless video inpainting framework for removing humans and pedestrians from high-resolution real-world video.

Motivation: Address the challenges of real-world video inpainting: high-quality results, temporal consistency, and complex object interactions involving humans, their belongings, and their shadows.

Details

Method: A text-to-video model is enhanced with a motion module, a Variational Autoencoder (VAE) is used for progressive denoising in the latent space, and efficient human-and-belongings segmentation produces precise masks. Result: VIP shows superior temporal consistency and visual fidelity across diverse real-world scenarios and outperforms existing methods. Conclusion: The VIP framework and its key techniques (such as reference frame integration and Dual-Fusion Latent Segment Refinement) effectively handle the complexities of inpainting long, high-resolution video sequences. Abstract: Inpainting for real-world human and pedestrian removal in high-resolution video clips presents significant challenges, particularly in achieving high-quality outcomes, ensuring temporal consistency, and managing complex object interactions that involve humans, their belongings, and their shadows. In this paper, we introduce VIP (Video Inpainting Pipeline), a novel promptless video inpainting framework for real-world human removal applications. VIP enhances a state-of-the-art text-to-video model with a motion module and employs a Variational Autoencoder (VAE) for progressive denoising in the latent space. Additionally, we implement an efficient human-and-belongings segmentation for precise mask generation. Extensive experimental results demonstrate that VIP achieves superior temporal consistency and visual fidelity across diverse real-world scenarios, surpassing state-of-the-art methods on challenging datasets. Our key contributions include the development of the VIP pipeline, a reference frame integration technique, and the Dual-Fusion Latent Segment Refinement method, all of which address the complexities of inpainting in long, high-resolution video sequences.

Processes Matter: How ML/GAI Approaches Could Support Open Qualitative Coding of Online Discourse Datasets

John Chen,Alexandros Lotsos,Grace Wang,Lexie Zhao,Bruce Sherin,Uri Wilensky,Michael Horn

Task: Compare five ML/GAI approaches and four human coders on open coding.

Motivation: Examine the potential of ML/GAI for open coding and how it complements human coders.

Details

Method: Using a dataset of online chat messages, the open coding results of ML/GAI approaches and human coders are analyzed systematically. Result: ML/GAI approaches are good at content-based codes, humans are good at interpreting conversational dynamics, and the two have complementary potential. Conclusion: AI should be integrated with human coders, for example as parallel co-coders, and should not replace them. Abstract: Open coding, a key inductive step in qualitative research, discovers and constructs concepts from human datasets. However, capturing extensive and nuanced aspects or "coding moments" can be challenging, especially with large discourse datasets. While some studies explore machine learning (ML)/Generative AI (GAI)'s potential for open coding, few evaluation studies exist. We compare open coding results by five recently published ML/GAI approaches and four human coders, using a dataset of online chat messages around a mobile learning software. Our systematic analysis reveals ML/GAI approaches' strengths and weaknesses, uncovering the complementary potential between humans and AI. Line-by-line AI approaches effectively identify content-based codes, while humans excel in interpreting conversational dynamics. We discuss how embedded analytical processes could shape the results of ML/GAI approaches. Instead of replacing humans in open coding, researchers should integrate AI with and according to their analytical processes, e.g., as parallel co-coders.

Sliced Wasserstein Discrepancy in Disentangling Representation and Adaptation Networks for Unsupervised Domain Adaptation

Joel Sol,Shadi Alijani,Homayoun Najjaran

Task: Propose DRANet-SWD, an unsupervised domain adaptation method that disentangles the content and style representations of images.

Motivation: Investigate the potential advantages of SWD over the traditional Gram matrix loss for capturing style variations.

Details

Method: SWD is introduced into DRANet as the style loss, and the method is validated experimentally on digit classification and driving scenario segmentation datasets. Result: Experiments show that DRANet-SWD improves performance and that SWD compares feature distributions more robustly, giving better style adaptation. Conclusion: SWD is effective in refining feature alignment and improving domain adaptation tasks. Abstract: This paper introduces DRANet-SWD, an extension of existing work that disentangles content and style representations of images for unsupervised domain adaptation (UDA). The approach builds upon DRANet by incorporating the sliced Wasserstein discrepancy (SWD) as a style loss instead of the traditional Gram matrix loss. The potential advantages of SWD over the Gram matrix loss for capturing style variations in domain adaptation are investigated. Experiments using digit classification datasets and driving scenario segmentation validate the method, demonstrating that DRANet-SWD enhances performance. Results indicate that SWD provides a more robust statistical comparison of feature distributions, leading to better style adaptation. These findings highlight the effectiveness of SWD in refining feature alignment and improving domain adaptation tasks across these benchmarks. Our code can be found here.
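
As background for the style loss mentioned above, a sliced Wasserstein estimate compares two sets of feature vectors by projecting them onto random directions and matching the sorted projections. The sketch below is a generic Monte Carlo estimator under assumed settings (equal set sizes, squared distance, 64 projections); it is not taken from the DRANet-SWD code, and all names are illustrative.

```python
import math
import random

def sliced_wasserstein(xs, ys, num_projections=64, seed=0):
    """Estimate the squared sliced 2-Wasserstein distance between two
    equally sized sets of feature vectors (lists of lists of floats)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized sets")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(num_projections):
        # Draw a random unit direction.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in direction))
        direction = [c / norm for c in direction]
        # Project both sets to 1D and sort; in 1D, optimal transport
        # pairs the i-th smallest value of one set with the i-th of the other.
        px = sorted(sum(a * b for a, b in zip(x, direction)) for x in xs)
        py = sorted(sum(a * b for a, b in zip(y, direction)) for y in ys)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / num_projections

features_a = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
features_b = [[1.0, 0.0], [2.0, 0.0], [1.0, 1.0]]  # features_a shifted by (1, 0)
```

Unlike a Gram matrix loss, which compares second-order feature correlations, this estimate compares the full projected distributions, which is one plausible reading of the "more robust statistical comparison" claimed in the abstract.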

A Status Quo Investigation of Large Language Models towards Cost-Effective CFD Automation with OpenFOAMGPT: ChatGPT vs. Qwen vs. Deepseek

Wenkang Wang,Ran Xu,Jingsen Feng,Qingfu Zhang,Xu Chu

Task: Evaluate the performance of OpenFOAMGPT combined with multiple large language models on CFD tasks.

Motivation: Examine how different models perform on CFD tasks (such as adjusting boundary conditions, turbulence models, and solver configurations) and how their cost and stability differ.

Details

Method: Multiple models, including locally deployed smaller models and large models, are tested on complex CFD tasks, with particular attention to zero-shot prompting. Result: Large models perform better, but zero-shot prompting fails in complex setups; smaller models struggle to generate valid files for complex solvers, and expert supervision is required. Conclusion: Further development is needed to fully automate CFD simulations. Abstract: We evaluated the performance of OpenFOAMGPT incorporating multiple large-language models. Some of the present models efficiently manage different CFD tasks such as adjusting boundary conditions, turbulence models, and solver configurations, although their token cost and stability vary. Locally deployed smaller models like QwQ-32B struggled with generating valid solver files for complex processes. Zero-shot prompting commonly failed in simulations with intricate settings, even for large models. Challenges with boundary conditions and solver keywords stress the requirement for expert supervision, indicating that further development is needed to fully automate specialized CFD simulations.

Attention-Aware Multi-View Pedestrian Tracking

Reef Alturki,Adrian Hilton,Jean-Yves Guillemaut

Task: Propose a multi-view pedestrian tracking model that incorporates attention mechanisms to address the feature distortion caused by ground-plane projection.

Motivation: In multi-view pedestrian detection, an early-fusion strategy improves performance, but ground-plane projection distorts features and weakens tracking robustness.

Details

Method: An early-fusion strategy is used for detection, and a cross-attention mechanism is introduced to strengthen pedestrian association and feature propagation across frames. Result: The model reaches IDF1 scores of 96.1% on Wildtrack and 85.7% on MultiviewX, outperforming existing methods. Conclusion: The attention mechanisms effectively improve the robustness and performance of multi-view pedestrian tracking. Abstract: In spite of the recent advancements in multi-object tracking, occlusion poses a significant challenge. Multi-camera setups have been used to address this challenge by providing a comprehensive coverage of the scene. Recent multi-view pedestrian detection models have highlighted the potential of an early-fusion strategy, projecting feature maps of all views to a common ground plane or the Bird's Eye View (BEV), and then performing detection. This strategy has been shown to improve both detection and tracking performance. However, the perspective transformation results in significant distortion on the ground plane, affecting the robustness of the appearance features of the pedestrians. To tackle this limitation, we propose a novel model that incorporates attention mechanisms in a multi-view pedestrian tracking scenario. Our model utilizes an early-fusion strategy for detection, and a cross-attention mechanism to establish robust associations between pedestrians in different frames, while efficiently propagating pedestrian features across frames, resulting in a more robust feature representation for each pedestrian. Extensive experiments demonstrate that our model outperforms state-of-the-art models, with an IDF1 score of 96.1% on the Wildtrack dataset, and 85.7% on the MultiviewX dataset.
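
For context on the reported metric, IDF1 is the harmonic mean of identity precision and identity recall, computed from identity-level true positives, false positives, and false negatives. A minimal sketch follows; the counts in the usage line are made up for illustration and are not results from the paper.

```python
def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    denominator = 2 * idtp + idfp + idfn
    # Return 0.0 for the degenerate case with no detections and no ground truth.
    return 2 * idtp / denominator if denominator else 0.0

# Hypothetical counts: 90 correctly identified detections,
# 10 identity false positives, 10 identity false negatives.
score = idf1(idtp=90, idfp=10, idfn=10)
```

Because identity matches are counted over whole trajectories, IDF1 rewards keeping the same identity on a pedestrian over time, which is what the cross-attention association step targets.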

Scaling Test-time Compute for Low-resource Languages: Multilingual Reasoning in LLMs

Khanh-Tung Tran,Barry O'Sullivan,Hoang D. Nguyen

Task: Study how using English as a pivot language (English-Pivoted CoT Training) can improve reasoning in low-resource languages.

Motivation: Existing chain-of-thought (CoT) techniques are applied mainly to high-resource languages such as English, leaving reasoning in low-resource languages underexplored and misaligned.

Details

Method: Models are trained to generate the CoT in English while producing the final answer in the target language, given input in the low-resource language. Result: The approach (English-Pivoted CoT Training) outperforms other baselines, with up to 28.33% improvement. Conclusion: The study sheds light on the relationship between reasoning and multilinguality in LLMs and suggests new directions for developing multilingual large reasoning models. Abstract: Recent advances in test-time compute scaling have enabled Large Language Models (LLMs) to tackle deep reasoning tasks by generating a chain-of-thought (CoT) that includes trial and error, backtracking, and intermediate reasoning steps before producing the final answer. However, these techniques have been applied predominantly to popular languages, such as English, leaving reasoning in low-resource languages underexplored and misaligned. In this work, we investigate the multilingual mechanism by which LLMs internally operate in a latent space biased toward their inherently dominant language. To leverage this phenomenon for low-resource languages, we train models to generate the CoT in English while outputting the final response in the target language, given input in the low-resource language. Our experiments demonstrate that this approach, named English-Pivoted CoT Training, outperforms other baselines, including training to generate both the CoT and the final response solely in the target language, with up to 28.33% improvement. Further analysis provides novel insights into the relationships between reasoning and multilinguality of LLMs, prompting for better approaches in developing multilingual large reasoning models.

Cooperative Inference for Real-Time 3D Human Pose Estimation in Multi-Device Edge Networks

Hyun-Ho Choi,Kangsoo Kim,Ki-Ho Lee,Kisong Lee

Task: Propose a novel cooperative inference method for real-time 3D human pose estimation in mobile edge computing networks.

Motivation: High computational complexity makes real-time 3D pose estimation challenging in resource-constrained, dynamic environments.

Details

Method: Multiple end devices equipped with lightweight inference models use dual confidence thresholds to filter ambiguous images, and only the filtered images are offloaded to an edge server for re-evaluation. Result: Experiments show that optimizing the confidence thresholds and transmission times significantly reduces the mean per-joint position error (MPJPE) while satisfying the end-to-end delay requirement. Conclusion: The proposed cooperative inference method effectively balances MPJPE and delay across various mobile edge computing environments, achieving accurate real-time estimation. Abstract: Accurate and real-time three-dimensional (3D) pose estimation is challenging in resource-constrained and dynamic environments owing to its high computational complexity. To address this issue, this study proposes a novel cooperative inference method for real-time 3D human pose estimation in mobile edge computing (MEC) networks. In the proposed method, multiple end devices equipped with lightweight inference models employ dual confidence thresholds to filter ambiguous images. Only the filtered images are offloaded to an edge server with a more powerful inference model for re-evaluation, thereby improving the estimation accuracy under computational and communication constraints. We numerically analyze the performance of the proposed inference method in terms of the inference accuracy and end-to-end delay and formulate a joint optimization problem to derive the optimal confidence thresholds and transmission time for each device, with the objective of minimizing the mean per-joint position error (MPJPE) while satisfying the required end-to-end delay constraint. To solve this problem, we demonstrate that minimizing the MPJPE is equivalent to maximizing the sum of the inference accuracies for all devices, decompose the problem into manageable subproblems, and present a low-complexity optimization algorithm to obtain a near-optimal solution. The experimental results show that a trade-off exists between the MPJPE and end-to-end delay depending on the confidence thresholds. Furthermore, the results confirm that the proposed cooperative inference method achieves a significant reduction in the MPJPE through the optimal selection of confidence thresholds and transmission times, while consistently satisfying the end-to-end delay requirement in various MEC environments.
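The abstract does not spell out how the two thresholds are applied, so the sketch below is one plausible reading, stated as an assumption: results above the upper threshold are kept on the device, results between the two thresholds count as ambiguous and are offloaded, and results below the lower threshold are dropped. The names `Decision`, `route`, and `mpjpe` are hypothetical; `mpjpe` uses the usual definition of mean per-joint position error (mean Euclidean distance over joints).

```python
import math
from enum import Enum
from typing import Sequence, Tuple

Joint = Tuple[float, float, float]


class Decision(Enum):
    KEEP_LOCAL = "keep_local"   # device result is trusted as-is
    OFFLOAD = "offload"         # ambiguous: send the image to the edge server
    DISCARD = "discard"         # assumed handling of very low confidence


def route(confidence: float, low: float, high: float) -> Decision:
    """Route one image using two confidence thresholds (assumed rule)."""
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= low <= high <= 1")
    if confidence >= high:
        return Decision.KEEP_LOCAL
    if confidence >= low:
        return Decision.OFFLOAD
    return Decision.DISCARD


def mpjpe(pred: Sequence[Joint], truth: Sequence[Joint]) -> float:
    """Mean Euclidean distance between predicted and true joint positions."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)
```

Raising `high` or lowering `low` widens the offloaded band, which is the accuracy-versus-delay trade-off the abstract describes: more images get the stronger server model, at the cost of more transmission time.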

Automated Survey Collection with LLM-based Conversational Agents

Kurmanbek Kaiyrbekov,Nicholas J Dobbins,Sean D Mooney

Task: Propose an end-to-end phone survey collection framework based on large language models (LLMs) to overcome the high cost, labor intensity, and poor scalability of traditional phone surveys.

Motivation: Traditional phone surveys are widely used to collect biomedical and healthcare data, but they are costly, labor intensive, and difficult to scale.

Details

Method: The framework comprises a researcher who designs the survey and recruits participants, an LLM-powered conversational phone agent that administers the survey, GPT-4o to analyze the conversation transcripts, and a database to store the results. In testing, 8 participants were recruited and 40 surveys administered to evaluate transcript accuracy, the accuracy of responses inferred by GPT-4o, and participant experience. Result: GPT-4o extracted survey responses from transcripts with an average accuracy of 98%, despite an average per-line word error rate of 7.7% in the transcripts. Participants reported that the agent effectively conveyed the survey's purpose, demonstrated good comprehension, and maintained an engaging interaction. Conclusion: LLM agents show potential for healthcare phone surveys; by reducing manual workload and offering a scalable solution, they pave the way for real-world end-to-end AI phone survey systems. Abstract: Objective: Traditional phone-based surveys are among the most accessible and widely used methods to collect biomedical and healthcare data; however, they are often costly, labor intensive, and difficult to scale effectively. To overcome these limitations, we propose an end-to-end survey collection framework driven by conversational Large Language Models (LLMs). Materials and Methods: Our framework consists of a researcher responsible for designing the survey and recruiting participants, a conversational phone agent powered by an LLM that calls participants and administers the survey, a second LLM (GPT-4o) that analyzes the conversation transcripts generated during the surveys, and a database for storing and organizing the results. To test our framework, we recruited 8 participants consisting of 5 native and 3 non-native English speakers and administered 40 surveys. We evaluated the correctness of LLM-generated conversation transcripts, accuracy of survey responses inferred by GPT-4o and overall participant experience. Results: Survey responses were successfully extracted by GPT-4o from conversation transcripts with an average accuracy of 98% despite transcripts exhibiting an average per-line word error rate of 7.7%. While participants noted occasional errors made by the conversational LLM agent, they reported that the agent effectively conveyed the purpose of the survey, demonstrated good comprehension, and maintained an engaging interaction. Conclusions: Our study highlights the potential of LLM agents in conducting and analyzing phone surveys for healthcare applications. By reducing the workload on human interviewers and offering a scalable solution, this approach paves the way for real-world, end-to-end AI-powered phone survey collection systems.
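The per-line word error rate reported above can be illustrated with the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, averaged over transcript lines. This is a minimal sketch; the authors' exact tokenization and normalization are not given in the abstract, so whitespace splitting is an assumption, and the function names are hypothetical.

```python
from typing import Iterable, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def mean_per_line_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average WER over (reference_line, transcribed_line) pairs."""
    rates = [word_error_rate(ref, hyp) for ref, hyp in pairs]
    if not rates:
        raise ValueError("at least one line pair is required")
    return sum(rates) / len(rates)
```

Averaging per line weights short and long lines equally, which can differ from a corpus-level WER computed over all words at once.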

Compressing 3D Gaussian Splatting by Noise-Substituted Vector Quantization

Haishan Wang,Mohammad Hassan Vali,Arno Solin

Task: Propose a method to reduce the storage cost of 3D Gaussian Splatting (3DGS).

Motivation: 3DGS performs very well in 3D reconstruction but has a high storage cost, with a single scene requiring about 1 GB of memory.

Details

Method: Build separate attribute codebooks and store only discrete code indices, using noise-substituted vector quantization to jointly train the codebooks and model features. Result: Reduces memory consumption by about 45 times while maintaining reconstruction quality, and shows the trade-off between compression ratio and image quality across codebook sizes. Conclusion: The compressed model is compatible with popular 3DGS viewers, renders faster, and is suited to practical applications. Abstract: 3D Gaussian Splatting (3DGS) has demonstrated remarkable effectiveness in 3D reconstruction, achieving high-quality results with real-time radiance field rendering. However, a key challenge is the substantial storage cost: reconstructing a single scene typically requires millions of Gaussian splats, each represented by 59 floating-point parameters, resulting in approximately 1 GB of memory. To address this challenge, we propose a compression method by building separate attribute codebooks and storing only discrete code indices. Specifically, we employ a noise-substituted vector quantization technique to jointly train the codebooks and model features, ensuring consistency between gradient descent optimization and parameter discretization. Our method reduces the memory consumption efficiently (around $45\times$) while maintaining competitive reconstruction quality on standard 3D benchmark scenes. Experiments on different codebook sizes show the trade-off between compression ratio and image quality. Furthermore, the trained compressed model remains fully compatible with popular 3DGS viewers and enables faster rendering speed, making it well-suited for practical applications.
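The storage argument above can be made concrete with a little arithmetic. The sketch below only compares raw storage with "codebooks plus indices" storage; it does not implement noise-substituted vector quantization. Assumptions that are not from the paper: parameters are stored as 32-bit floats, the attribute grouping passed to `quantized_bytes` is hypothetical, and every group uses the same codebook size.

```python
import math
from typing import Sequence

FLOATS_PER_SPLAT = 59   # stated in the abstract
BYTES_PER_FLOAT = 4     # assumption: float32 storage


def raw_bytes(num_splats: int) -> int:
    """Uncompressed size: every splat stores all 59 floats."""
    return num_splats * FLOATS_PER_SPLAT * BYTES_PER_FLOAT


def quantized_bytes(num_splats: int,
                    group_dims: Sequence[int],
                    codebook_size: int) -> float:
    """Size when each attribute group is replaced by one codebook index.

    group_dims lists the dimensionality of each attribute group
    (hypothetical grouping); each group has its own codebook.
    """
    bits_per_index = math.ceil(math.log2(codebook_size))
    index_bytes = num_splats * len(group_dims) * bits_per_index / 8
    codebook_bytes = sum(codebook_size * d * BYTES_PER_FLOAT
                         for d in group_dims)
    return index_bytes + codebook_bytes


if __name__ == "__main__":
    n = 4_000_000                      # "millions of splats", illustrative
    groups = [3, 3, 4, 1, 48]          # hypothetical split of the 59 floats
    raw = raw_bytes(n)
    packed = quantized_bytes(n, groups, codebook_size=4096)
    print(f"raw: {raw / 1e9:.2f} GB, ratio: {raw / packed:.1f}x")
```

With four million splats the raw size comes to roughly 0.94 GB, consistent with the "approximately 1 GB" figure in the abstract; the ratio printed for this hypothetical configuration is not meant to reproduce the reported 45 times reduction.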

OnRL-RAG: Real-Time Personalized Mental Health Dialogue System

Ahsan Bilal,Beiyu Lin,Mehdi Zaeifi

Task: Propose an online reinforcement learning-based retrieval-augmented generation system (OnRL-RAG) to detect mental health problems and personalize responses to them.

Motivation: Large language models (LLMs) are limited by the recency and accuracy of their pre-training data, and retrieval-augmented generation (RAG), while supplying new information, lacks personalization.

Details

Method: Combines online reinforcement learning (RLHF) with RAG to adapt dynamically to user needs and feedback and personalize responses to mental health problems. Result: On a dataset of 2028 college students, OnRL-RAG outperforms standard RAG and plain LLMs (GPT-4o, Gemini-1.5, and others). Conclusion: The system opens up practical applications of LLMs for personalized services and helps researchers in sociology, psychology, and neuroscience validate their theories. Abstract: Large language models (LLMs) have been widely used for various tasks and applications. However, LLMs and fine-tuning are limited to the pre-trained data. For example, ChatGPT's world knowledge until 2021 can be outdated or inaccurate. To enhance the capabilities of LLMs, Retrieval-Augmented Generation (RAG) has been proposed to augment LLMs with additional, up-to-date details and information. While RAG offers the correct information, it may not best present it, especially to different population groups with personalizations. Reinforcement Learning from Human Feedback (RLHF) adapts to user needs by aligning model responses with human preference through feedback loops. In real-life applications, such as mental health problems, a dynamic and feedback-based model would continuously adapt to new information and offer personalized assistance due to complex factors fluctuating in a daily environment. Thus, we propose an Online Reinforcement Learning-based Retrieval-Augmented Generation (OnRL-RAG) system to detect and personalize the responding systems to mental health problems, such as stress, anxiety, and depression. We use an open-source dataset collected from 2028 college students with 28 survey questions for each student to compare the performance of our proposed system with that of existing systems. Our system achieves superior performance compared to standard RAG and simple LLM via GPT-4o, GPT-4o-mini, Gemini-1.5, and GPT-3.5. This work would open up the possibilities of real-life applications of LLMs for personalized services in the everyday environment. The results will also help researchers in the fields of sociology, psychology, and neuroscience to align their theories more closely with the actual human daily environment.

How I Warped Your Noise: a Temporally-Correlated Noise Prior for Diffusion Models

Pascal Chang,Jingwei Tang,Markus Gross,Vinicius C. Azevedo

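Task: Preserve temporal correlations in the noise samples used by diffusion models for video editing and generation.

Motivation: Rudimentary noise sampling does not preserve the correlations between successive video frames, which produces high-frequency flickering or texture-sticking artifacts.

Details

Method: Introduces $\int$-noise (integral noise), which treats each pixel's noise sample as the integral of an underlying infinite-resolution noise field over the pixel area, together with a tailored transport method that advects noise samples across frames. Result: The proposed $\int$-noise can be used for tasks such as video restoration, surrogate rendering, and conditional video generation. Conclusion: Transporting $\int$-noise maximizes the correlation between frames while preserving the noise properties. Abstract: Video editing and generation methods often rely on pre-trained image-based diffusion models. During the diffusion process, however, the reliance on rudimentary noise sampling techniques that do not preserve correlations present in subsequent frames of a video is detrimental to the quality of the results. This either produces high-frequency flickering, or texture-sticking artifacts that are not amenable to post-processing. With this in mind, we propose a novel method for preserving temporal correlations in a sequence of noise samples. This approach is materialized by a novel noise representation, dubbed $\int$-noise (integral noise), that reinterprets individual noise samples as a continuously integrated noise field: pixel values do not represent discrete values, but are rather the integral of an underlying infinite-resolution noise over the pixel area. Additionally, we propose a carefully tailored transport method that uses $\int$-noise to accurately advect noise samples over a sequence of frames, maximizing the correlation between different frames while also preserving the noise properties. Our results demonstrate that the proposed $\int$-noise can be used for a variety of tasks, such as video restoration, surrogate rendering, and conditional video generation. See https://warpyournoise.github.io/ for video results.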

A Practical Synthesis of Detecting AI-Generated Textual, Visual, and Audio Content

Lele Cao

Task: Explore methods for detecting and mitigating AI-generated textual, visual, and audio content.

Motivation: The rapid development of AI-generated content raises concerns about misinformation, copyright infringement, security threats, and the erosion of public trust.

Details

Method: Discusses observation-based strategies, linguistic and statistical analysis, model-based pipelines, watermarking and fingerprinting, and emerging ensemble approaches. Result: Offers new perspectives on robustness, adaptation to rapidly improving generative architectures, and the critical role of human-in-the-loop verification. Conclusion: Summarizes open challenges such as adversarial transformations, domain generalization, and ethical concerns, providing a holistic guide for researchers, practitioners, and regulators. Abstract: Advances in AI-generated content have led to wide adoption of large language models, diffusion-based visual generators, and synthetic audio tools. However, these developments raise critical concerns about misinformation, copyright infringement, security threats, and the erosion of public trust. In this paper, we explore an extensive range of methods designed to detect and mitigate AI-generated textual, visual, and audio content. We begin by discussing motivations and potential impacts associated with AI-based content generation, including real-world risks and ethical dilemmas. We then outline detection techniques spanning observation-based strategies, linguistic and statistical analysis, model-based pipelines, watermarking and fingerprinting, as well as emergent ensemble approaches. We also present new perspectives on robustness, adaptation to rapidly improving generative architectures, and the critical role of human-in-the-loop verification. By surveying state-of-the-art research and highlighting case studies in academic, journalistic, legal, and industrial contexts, this paper aims to inform robust solutions and policymaking. We conclude by discussing open challenges, including adversarial transformations, domain generalization, and ethical concerns, thereby offering a holistic guide for researchers, practitioners, and regulators to preserve content authenticity in the face of increasingly sophisticated AI-generated media.

SLACK: Attacking LiDAR-based SLAM with Adversarial Point Injections

Prashant Kumar,Dheeraj Vattikonda,Kshitij Madhav Bhat,Kunal Dargan,Prem Kalra

Task: Study a learning-based adversarial attack (SLACK) that performs point injection attacks on LiDAR scans.

Motivation: The widespread use of learning-based methods for LiDAR makes autonomous vehicles vulnerable to adversarial attacks, especially point injection attacks, which pose serious security challenges for navigation and map generation. Little work has studied learning-based attacks on LiDAR-based SLAM.

Details

Method: Proposes SLACK, an end-to-end deep generative adversarial model, and designs a novel autoencoder that combines contrastive learning with segmentation-based attention to carry out point injection attacks on LiDAR scans. Result: SLACK outperforms baselines on the KITTI and CARLA-64 datasets while maintaining LiDAR scan quality. Conclusion: SLACK can mount point injection attacks using a small number of LiDAR points, severely degrading navigation and map quality without affecting LiDAR scan quality. Abstract: The widespread adoption of learning-based methods for LiDAR makes autonomous vehicles vulnerable to adversarial attacks through adversarial point injections (PiJ). It poses serious security challenges for navigation and map generation. Despite its critical nature, no major work exists that studies learning-based attacks on LiDAR-based SLAM. Our work proposes SLACK, an end-to-end deep generative adversarial model to attack LiDAR scans with several point injections without deteriorating LiDAR quality. To facilitate SLACK, we design a novel yet simple autoencoder that augments contrastive learning with segmentation-based attention for precise reconstructions. SLACK demonstrates superior performance on the task of point injections (PiJ) compared to the best baselines on the KITTI and CARLA-64 datasets while maintaining accurate scan quality. We qualitatively and quantitatively demonstrate PiJ attacks using a fraction of LiDAR points. It severely degrades navigation and map quality without deteriorating the LiDAR scan quality.

Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models

Liangjie Huang,Dawei Li,Huan Liu,Lu Cheng

Task: Study how self-improvement in large language models (LLMs) affects confidence estimation.

Motivation: Although self-improvement can raise task performance, it may introduce self-bias, leading to overconfidence and degrading the accuracy of confidence estimates.

Details

Method: Evaluates three self-improvement paradigms (basic prompting, chain-of-thought prompting, and tuning-based methods) and explores three confidence calibration strategies (calibrating after self-improvement, calibrating before self-improvement, and iterative calibration). Result: Iterative self-improvement leads to systematic overconfidence (rising ECE), while iterative calibration effectively reduces ECE and improves calibration. Conclusion: The study is the first to examine LLM self-improvement from a calibration perspective, offering insights into balancing model performance and reliability. Abstract: Large Language Models (LLMs) have demonstrated remarkable self-improvement capabilities, whereby models iteratively revise their outputs through self-generated feedback. While this reflective mechanism has shown promise in enhancing task performance, recent studies suggest that it may also introduce undesirable biases, most notably self-bias, the tendency of LLMs to favor their own prior outputs. In this work, we extend this line of inquiry by investigating the impact on confidence estimation. We evaluate three representative self-improvement paradigms (basic prompting, Chain-of-Thought (CoT) prompting, and tuning-based methods) and find that iterative self-improvement can lead to systematic overconfidence, as evidenced by a steadily increasing Expected Calibration Error (ECE) and lower accuracy with high confidence. We then further explore the integration of confidence calibration techniques with self-improvement. Specifically, we compare three strategies: (1) applying calibration after multiple rounds of self-improvement, (2) calibrating before self-improvement, and (3) applying calibration iteratively at each self-improvement step. Our results show that iterative calibration is most effective in reducing ECE, yielding improved calibration. Our work pioneers the study of self-improving LLMs from a calibration perspective, offering valuable insights into balancing model performance and reliability.
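Expected Calibration Error, the metric tracked above, is commonly computed by bucketing predictions by confidence and taking the sample-weighted gap between accuracy and mean confidence in each bucket. The sketch below uses equal-width bins; the bin count and binning scheme used by the authors are not stated in the abstract, so both are assumptions here, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               num_bins: int = 10) -> float:
    """ECE = sum over bins of (bin_size / N) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / total * abs(accuracy - mean_conf)
    return ece
```

An overconfident model, one that reports 0.9 confidence while being right only half the time, yields a large ECE, which is the failure mode the abstract attributes to repeated self-improvement.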

Scaling Open-Vocabulary Action Detection

Zhen Hao Sia,Yogesh Singh Rawat

Task: Study how to scale open-vocabulary action detection.

Motivation: Existing action detection methods are largely limited to closed-set scenarios and rely on complex, parameter-heavy architectures. Extending them to the open-vocabulary setting faces two challenges: the lack of large-scale datasets with many action classes, and parameter-heavy adaptations that risk overfitting.

Details

Method: Proposes an encoder-only multimodal model to reduce reliance on parameter-heavy additions, a weakly supervised training strategy that exploits an existing closed-set dataset for pretraining, and a new benchmark. Result: Reports novel results on closed-set action detection datasets never used for training, serving as baselines for future work. Conclusion: Simplifying the model and training strategy scales open-vocabulary action detection, and the new benchmark offers a better-posed evaluation. Abstract: In this work, we focus on scaling open-vocabulary action detection. Existing approaches for action detection are predominantly limited to closed-set scenarios and rely on complex, parameter-heavy architectures. Extending these models to the open-vocabulary setting poses two key challenges: (1) the lack of large-scale datasets with many action classes for robust training, and (2) parameter-heavy adaptations to a pretrained vision-language contrastive model to convert it for detection, risking overfitting the additional non-pretrained parameters to base action classes. Firstly, we introduce an encoder-only multimodal model for video action detection, reducing the reliance on parameter-heavy additions for video action detection. Secondly, we introduce a simple weakly supervised training strategy to exploit an existing closed-set action detection dataset for pretraining. Finally, we depart from the ill-posed base-to-novel benchmark used by prior works in open-vocabulary action detection and devise a new benchmark to evaluate on existing closed-set action detection datasets without ever using them for training, showing novel results to serve as baselines for future work.

How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence

Hongzhe Du,Weikai Li,Min Cai,Karim Saraipour,Zimin Zhang,Himabindu Lakkaraju,Yizhou Sun,Shichang Zhang

Task: Compare the internal mechanisms of base and post-trained large language models (LLMs).

Motivation: Study how post-training changes the internal representations of LLMs in order to better understand its effects.

Details

Method: Mechanistically compares base and post-trained models from four perspectives. Result: Post-training does not change where factual knowledge is stored but adapts knowledge representations; truthfulness and refusal can be represented as linear vectors in hidden space; the refusal direction differs between base and post-trained models; confidence differences cannot be attributed to entropy neurons. Conclusion: The study reveals which mechanisms are preserved and altered during post-training, aiding model steering and future interpretability research. Abstract: Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models. While many works have studied post-training algorithms and evaluated post-trained models by their outputs, it remains understudied how post-training reshapes LLMs internally. In this paper, we compare base and post-trained LLMs mechanistically from four perspectives to better understand post-training effects. Our findings across model families and datasets reveal that: (1) Post-training does not change the factual knowledge storage locations, and it adapts knowledge representations from the base model while developing new knowledge representations; (2) Both truthfulness and refusal can be represented by linear vectors in the hidden representation space. The truthfulness direction is highly similar between the base and post-trained model, and it is effectively transferable for interventions; (3) The refusal direction is different between the base and post-trained models, and it shows limited forward transferability; (4) Differences in confidence between the base and post-trained models cannot be attributed to entropy neurons. Our study provides insights into the fundamental mechanisms preserved and altered during post-training, facilitates downstream tasks like model steering, and could potentially benefit future research in interpretability and LLM post-training.

Multi-Granularity Vision Fastformer with Fusion Mechanism for Skin Lesion Segmentation

Xuanyu Liu,Huiyun Yao,Jinggui Gao,Zhongyi Guo,Xue Zhang,Yulin Dong

Task: Optimize the balance between computational cost and long-range dependency modelling in medical image segmentation, and achieve strong generalization across lesions of different severity.

Motivation: CNNs are limited to local contextual information, ViTs' quadratic complexity incurs high computational cost, and models struggle to distinguish lesion boundaries of varying severity.

Details

Method: Proposes VFFM-UNet, a lightweight U-shaped network that combines Fastformer's additive attention mechanism with a multi-granularity fusion mechanism to reduce computational cost and extract multi-granularity features. Result: On the ISIC2017, ISIC2018, and PH2 datasets, VFFM-UNet outperforms existing models in parameter count, computational complexity, and segmentation performance; compared to MISSFormer, it reduces parameter and computation costs by 101x and 15x, respectively. Conclusion: VFFM-UNet achieves an ideal balance among parameter count, computational complexity, and segmentation performance, setting a new benchmark. Abstract: Background: Convolutional Neural Networks (CNN) and Vision Transformers (ViT) are the main techniques used in medical image segmentation. However, CNN is limited to local contextual information, and ViT's quadratic complexity results in significant computational costs. At the same time, equipping the model to distinguish lesion boundaries with varying degrees of severity is also a challenge encountered in skin lesion segmentation. Purpose: This research aims to optimize the balance between computational costs and long-range dependency modelling and achieve excellent generalization across lesions with different degrees of severity. Methods: We propose a lightweight U-shape network that utilizes Vision Fastformer with Fusion Mechanism (VFFM-UNet). We inherit the advantages of Fastformer's additive attention mechanism, combining element-wise product and matrix product for comprehensive feature extraction and channel reduction to save computational costs. In order to accurately identify the lesion boundaries with varying degrees of severity, we designed a Fusion Mechanism including Multi-Granularity Fusion and Channel Fusion, which can process the feature maps at the granularity and channel levels to obtain different contextual information. Results: Comprehensive experiments on the ISIC2017, ISIC2018 and PH2 datasets demonstrate that VFFM-UNet outperforms existing state-of-the-art models regarding parameter numbers, computational complexity and segmentation performance. In short, compared to MISSFormer, our model achieves superior segmentation performance while reducing parameter and computation costs by 101x and 15x, respectively. Conclusions: Both quantitative and qualitative analyses show that VFFM-UNet sets a new benchmark by reaching an ideal balance between parameter numbers, computational complexity, and segmentation performance compared to existing state-of-the-art models.

Enhancing Chart-to-Code Generation in Multimodal Large Language Models via Iterative Dual Preference Learning

Zhihan Zhang,Yixin Cao,Lizi Liao

Task: Convert chart images into executable plotting scripts, i.e. chart-to-code generation.

Motivation: Multimodal large language models (MLLMs) perform poorly on code generation tasks, so a method is needed to improve their chart-to-code generation.

Details

Method: Proposes Chart2Code, an iterative dual preference learning framework that combines structured code variant generation with fine-grained dual reward signals. Result: Iterative preference learning consistently improves out-of-distribution chart-to-code generation quality, and the dual scoring method yields gains even with a smaller preference dataset. Conclusion: Chart2Code offers a new approach to chart understanding and code generation and lays groundwork for future research. Abstract: Chart-to-code generation, the process of converting chart images into executable plotting scripts, provides a lossless representation of chart information, requiring models to accurately capture and summarize all visual and structural elements. However, this remains a significant challenge for multimodal large language models (MLLMs), which are not inherently well-aligned with code generation tasks. To bridge this gap, we introduce Chart2Code, a novel iterative dual preference learning framework designed to enhance MLLMs' chart-to-code generation capabilities through structured code variant generation and fine-grained dual reward signals. We validate Chart2Code across three MLLMs and find that iterative preference learning consistently improves out-of-distribution chart-to-code generation quality. Throughout this process, our dual scoring method, which evaluates both the textual code structure and its visual representation, leads to greater performance improvements, even with a reduced preference dataset size. Further analysis explores the key components of our framework and highlights the interplay between chart-to-code generation and broader chart reasoning, paving the way for future advancements in chart comprehension.

NuWa: Deriving Lightweight Task-Specific Vision Transformers for Edge Devices

Ziteng Wei,Qiang He,Bing Li,Feifei Chen,Yun Yang

Task: Propose NuWa, an approach that derives small ViTs from a base ViT to meet the task-specific requirements of edge devices.

Motivation: Vision Transformers (ViTs) lack flexibility on edge devices; broadly pre-trained models are over-qualified for them, leaving task-specific accuracy suboptimal.

Details

Method: NuWa transfers task-specific knowledge extracted from the base ViT into small ViTs to optimize inference speed and accuracy on edge devices. Result: On three public datasets, NuWa improves accuracy by up to 11.83% and accelerates inference by 1.29 to 2.79 times compared with existing methods. Conclusion: NuWa effectively improves the task-specific accuracy and inference speed of ViTs on edge devices. Abstract: Vision Transformers (ViTs) excel in computer vision tasks but lack flexibility for edge devices' diverse needs. A vital issue is that ViTs pre-trained to cover a broad range of tasks are over-qualified for edge devices that usually demand only part of a ViT's knowledge for specific tasks. Their task-specific accuracy on these edge devices is suboptimal. We discovered that small ViTs that focus on device-specific tasks can improve model accuracy and in the meantime, accelerate model inference. This paper presents NuWa, an approach that derives small ViTs from the base ViT for edge devices with specific task requirements. NuWa can transfer task-specific knowledge extracted from the base ViT into small ViTs that fully leverage constrained resources on edge devices to maximize model accuracy with inference latency assurance. Experiments with three base ViTs on three public datasets demonstrate that compared with state-of-the-art solutions, NuWa improves model accuracy by up to $11.83\%$ and accelerates model inference by $1.29\times$ to $2.79\times$. Code for reproduction is available at https://anonymous.4open.science/r/Task_Specific-3A5E.

Noiser: Bounded Input Perturbations for Attributing Large Language Models

Mohammad Reza Ghasemi Madani,Aryo Pradipta Gema,Gabriele Sarti,Yu Zhao,Pasquale Minervini,Andrea Passerini

Task: Propose Noiser, a perturbation-based feature attribution method for explaining the predictions of large language models.

Motivation: Generating feature attributions that faithfully reflect the model's inner behavior is crucial for understanding its predictions.

Details

Method: Imposes bounded noise on each input embedding and measures the model's robustness to partially noised input to obtain input attributions; also proposes an answerability metric that uses an instructed judge model to assess whether highly scored tokens suffice to recover the predicted output. Result: Across six LLMs and three tasks, Noiser outperforms existing gradient-based, attention-based, and perturbation-based feature attribution methods in both faithfulness and answerability. Conclusion: Noiser is a robust and effective method for explaining language model predictions. Abstract: Feature attribution (FA) methods are common post-hoc approaches that explain how Large Language Models (LLMs) make predictions. Accordingly, generating faithful attributions that reflect the actual inner behavior of the model is crucial. In this paper, we introduce Noiser, a perturbation-based FA method that imposes bounded noise on each input embedding and measures the robustness of the model against partially noised input to obtain the input attributions. Additionally, we propose an answerability metric that employs an instructed judge model to assess the extent to which highly scored tokens suffice to recover the predicted output. Through a comprehensive evaluation across six LLMs and three tasks, we demonstrate that Noiser consistently outperforms existing gradient-based, attention-based, and perturbation-based FA methods in terms of both faithfulness and answerability, making it a robust and effective approach for explaining language model predictions.
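The general idea of perturbation-based attribution with bounded noise can be sketched as follows. This is not Noiser's actual formulation, which the abstract does not give; it is a generic illustration under stated assumptions: noise is drawn uniformly from [-epsilon, epsilon] per embedding dimension, one token is noised at a time, and a token's attribution is the mean absolute change in a scalar model score. The names `perturbation_attribution` and `score_fn` are hypothetical, and `score_fn` stands in for a real model's output probability.

```python
import random
from typing import Callable, List, Sequence

Embedding = List[float]


def perturbation_attribution(
    embeddings: Sequence[Embedding],
    score_fn: Callable[[Sequence[Embedding]], float],
    epsilon: float = 0.1,
    num_samples: int = 32,
    seed: int = 0,
) -> List[float]:
    """Score each token by how much bounded noise on it moves the output."""
    rng = random.Random(seed)
    base = score_fn(embeddings)
    attributions = []
    for i in range(len(embeddings)):
        total_change = 0.0
        for _ in range(num_samples):
            # Copy all embeddings, then noise only token i within the bound.
            noised = [list(e) for e in embeddings]
            noised[i] = [x + rng.uniform(-epsilon, epsilon) for x in noised[i]]
            total_change += abs(base - score_fn(noised))
        attributions.append(total_change / num_samples)
    return attributions


if __name__ == "__main__":
    # Toy "model": the score depends only on the first token's embedding.
    toy_score = lambda embs: sum(embs[0])
    print(perturbation_attribution([[1.0, 2.0], [3.0, 4.0]], toy_score))
```

In the toy example the second token receives zero attribution because the score never reads it, which is the behavior a faithful attribution method should show.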

FontGuard: A Robust Font Watermarking Approach Leveraging Deep Font Knowledge

Kahim Wong,Jicheng Zhou,Kemou Li,Yain-Whar Si,Xiaowei Wu,Jiantao Zhou

Task: Propose FontGuard, a novel font watermarking technique for copyright protection and source tracing of AI-generated text.

Motivation: Existing font watermarking methods neglect font knowledge, yielding low-quality watermarks with limited embedding capacity that are vulnerable to real-world distortions and low-resolution fonts.

Details

Method: Uses font models and language-guided contrastive learning, embeds the watermark by altering hidden font style features, and leverages the font manifold to increase embedding capacity. Result: FontGuard improves decoding accuracy by 5.4%, 7.4%, and 5.8% under synthetic, cross-media, and online social network distortions, respectively, and improves visual quality by 52.7% in terms of LPIPS. Conclusion: FontGuard is an effective, robust font watermarking method that can generate watermarked fonts for unseen fonts without retraining the network. Abstract: The proliferation of AI-generated content brings significant concerns about forensic and security issues such as source tracing and copyright protection, highlighting the need for effective watermarking technologies. Font-based text watermarking has emerged as an effective solution to embed information, which could ensure copyright, traceability, and compliance of the generated text content. Existing font watermarking methods usually neglect essential font knowledge, which leads to watermarked fonts of low quality and limited embedding capacity. These methods are also vulnerable to real-world distortions, low-resolution fonts, and inaccurate character segmentation. In this paper, we introduce FontGuard, a novel font watermarking model that harnesses the capabilities of font models and language-guided contrastive learning. Unlike previous methods that focus solely on the pixel-level alteration, FontGuard modifies fonts by altering hidden style features, resulting in better font quality upon watermark embedding. We also leverage the font manifold to increase the embedding capacity of our proposed method by generating substantial font variants closely resembling the original font. Furthermore, in the decoder, we employ an image-text contrastive learning to reconstruct the embedded bits, which can achieve desirable robustness against various real-world transmission distortions. FontGuard outperforms state-of-the-art methods by +5.4%, +7.4%, and +5.8% in decoding accuracy under synthetic, cross-media, and online social network distortions, respectively, while improving the visual quality by 52.7% in terms of LPIPS. Moreover, FontGuard uniquely allows the generation of watermarked fonts for unseen fonts without re-training the network. The code and dataset are available at https://github.com/KAHIMWONG/FontGuard.

Bias in Large Language Models Across Clinical Applications: A Systematic Review

Thanathip Suenghataiphorn,Narisara Tribuddharat,Pojsakorn Danpanichkul,Narathorn Kulthamrongsri

Task: Investigate the prevalence, sources, manifestations, and clinical implications of bias in large language models (LLMs) in clinical applications.

Motivation: LLMs are increasingly used in healthcare, but their potential for bias could compromise patient care and exacerbate health inequities, so a systematic assessment is needed.

Details

Method: Systematic search of PubMed, OVID, and EMBASE; data extracted on LLM type, bias source, manifestation, affected attributes, and more; risk of bias assessed with a modified ROBINS-I tool. Result: Thirty-eight studies show pervasive bias in LLMs, stemming mainly from data and model training, manifesting as allocative harm, representational harm, and performance disparities, and affecting attributes such as race/ethnicity and gender. Conclusion: Bias in clinical LLMs is systemic and pervasive; rigorous model evaluation and mitigation strategies are needed to ensure safe, equitable, and trustworthy use in healthcare. Abstract: Background: Large language models (LLMs) are rapidly being integrated into healthcare, promising to enhance various clinical tasks. However, concerns exist regarding their potential for bias, which could compromise patient care and exacerbate health inequities. This systematic review investigates the prevalence, sources, manifestations, and clinical implications of bias in LLMs. Methods: We conducted a systematic search of PubMed, OVID, and EMBASE from database inception through 2025, for studies evaluating bias in LLMs applied to clinical tasks. We extracted data on LLM type, bias source, bias manifestation, affected attributes, clinical task, evaluation methods, and outcomes. Risk of bias was assessed using a modified ROBINS-I tool. Results: Thirty-eight studies met inclusion criteria, revealing pervasive bias across various LLMs and clinical applications. Both data-related bias (from biased training data) and model-related bias (from model training) were significant contributors. Biases manifested as: allocative harm (e.g., differential treatment recommendations); representational harm (e.g., stereotypical associations, biased image generation); and performance disparities (e.g., variable output quality). These biases affected multiple attributes, most frequently race/ethnicity and gender, but also age, disability, and language. Conclusions: Bias in clinical LLMs is a pervasive and systemic issue, with a potential to lead to misdiagnosis and inappropriate treatment, particularly for marginalized patient populations. Rigorous evaluation of the model is crucial. Furthermore, the development and implementation of effective mitigation strategies, coupled with continuous monitoring in real-world clinical settings, are essential to ensure the safe, equitable, and trustworthy deployment of LLMs in healthcare.

Joint Retrieval of Cloud properties using Attention-based Deep Learning Models

Zahid Hassan Tushar,Adeleke Ademakinwa,Jianwu Wang,Zhibo Zhang,Sanjay Purushotham

Task: Develop CloudUNet with Attention Module (CAM), a compact UNet model for joint retrieval of cloud optical thickness (COT) and cloud effective radius (CER).

Motivation: The traditional Independent Pixel Approximation (IPA) is computationally efficient but has significant limitations with 3D radiative effects, cloud-edge errors, and heterogeneous cloud fields; existing AI/ML models improve accuracy but are memory-intensive, retrieve only a single cloud property, or struggle with joint retrieval.

Details

Method: Proposes CAM, a compact UNet-based model that uses attention mechanisms to reduce errors in thick, overlapping cloud regions and a specialized loss function for joint retrieval of COT and CER. Result: On a Large Eddy Simulation (LES) dataset, CAM reduces mean absolute error (MAE) by 34% for COT and 42% for CER relative to existing deep learning methods, and by 76% and 86% relative to the IPA method. Conclusion: With attention mechanisms and a specialized loss, CAM markedly improves cloud property retrieval accuracy, particularly for complex cloud fields. Abstract: Accurate cloud property retrieval is vital for understanding cloud behavior and its impact on climate, including applications in weather forecasting, climate modeling, and estimating Earth's radiation balance. The Independent Pixel Approximation (IPA), a widely used physics-based approach, simplifies radiative transfer calculations by assuming each pixel is independent of its neighbors. While computationally efficient, IPA has significant limitations, such as inaccuracies from 3D radiative effects, errors at cloud edges, and ineffectiveness for overlapping or heterogeneous cloud fields. Recent AI/ML-based deep learning models have improved retrieval accuracy by leveraging spatial relationships across pixels. However, these models are often memory-intensive, retrieve only a single cloud property, or struggle with joint property retrievals. To overcome these challenges, we introduce CloudUNet with Attention Module (CAM), a compact UNet-based model that employs attention mechanisms to reduce errors in thick, overlapping cloud regions and a specialized loss function for joint retrieval of Cloud Optical Thickness (COT) and Cloud Effective Radius (CER). Experiments on a Large Eddy Simulation (LES) dataset show that our CAM model outperforms state-of-the-art deep learning methods, reducing mean absolute errors (MAE) by 34% for COT and 42% for CER, and achieving 76% and 86% lower MAE for COT and CER retrievals compared to the IPA method.

HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse

Yuwei An,Yihua Cheng,Seo Jin Park,Junchen Jiang

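Task: Propose HyperRAG, a system that optimizes the trade-off between quality and efficiency in RAG pipelines.

Motivation: Rerankers in conventional RAG pipelines improve generation quality, but they introduce computational costs that hinder high throughput and low latency.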

Details

Method: Enables efficient reranker inference through KV-cache reuse, combined with system-level optimizations. Result: Experiments show a 2-3x throughput improvement for HyperRAG, along with better downstream performance than traditional RAG. Conclusion: Through KV-cache reuse and system optimizations, HyperRAG balances high-quality generation with system efficiency. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the performance of large language models (LLMs) by integrating external knowledge into the generation process. A key component of RAG pipelines is the reranker, which selects the most relevant documents from a pool of retrieved candidates and significantly improves the quality of the generated responses. While rerankers refine the selection of retrieved documents in RAG pipelines, they introduce computational challenges that hinder high throughput and low latency. To address this problem, we propose HyperRAG, a system that optimizes the trade-off between quality and efficiency in RAG pipelines by leveraging KV-cache reuse for efficient reranker inference. By reusing document-side KV-cache, HyperRAG achieves both high-quality generation and system-level efficiency. To fully realize the benefits of KV-cache reuse, HyperRAG incorporates a range of system-level optimizations designed to enhance efficiency and scalability. Experiments show that HyperRAG achieves a 2-3x throughput improvement with decoder-only rerankers while also delivering higher downstream performance compared with a traditional RAG service.
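The idea of reusing document-side state across queries can be illustrated with a small memoization sketch. This is not HyperRAG's implementation: `DocumentKVCache`, `toy_encode`, `toy_score`, and `rerank` are hypothetical names, and the "KV state" here is a stand-in list of numbers instead of real attention keys and values. One point worth noting is that in a causal decoder-only model, a document's keys and values are independent of the query only when the document tokens come before the query tokens, so the sketch assumes that ordering.

```python
class DocumentKVCache:
    """Caches per-document state so it is computed once and reused across queries."""

    def __init__(self, encode_fn):
        self._encode = encode_fn
        self._store = {}
        self.misses = 0  # number of times the expensive encoder actually ran

    def get(self, doc_id, text):
        if doc_id not in self._store:
            self._store[doc_id] = self._encode(text)
            self.misses += 1
        return self._store[doc_id]


def toy_encode(text):
    # Hypothetical stand-in for the expensive document-side forward pass.
    return [float(ord(c)) for c in text]


def toy_score(query, doc_state):
    # Hypothetical relevance score: query characters that occur in the document.
    codes = set(doc_state)
    return sum(1 for c in query if float(ord(c)) in codes)


def rerank(query, docs, cache, score_fn, top_k):
    """Score every candidate against the query using cached document state."""
    scored = []
    for doc_id, text in docs.items():
        state = cache.get(doc_id, text)  # reused after the first query
        scored.append((score_fn(query, state), doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]
```

After the first query has touched every candidate, later queries only pay for the query-side scoring, which is the source of the throughput gain the abstract describes.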

Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion

Junkai Zhang,Bin Li,Shoujun Zhou,Yue Du

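Task: Propose HiCA-VQA, a method that addresses imperfect hierarchical modeling and the over-reliance of cross-modal fusion methods on implicit learning in medical visual question answering (Med-VQA).

Motivation: Med-VQA systems are important for assisting clinical diagnosis and improving diagnostic accuracy, but existing methods fall short in hierarchical modeling and cross-modal fusion.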

Details

Method: Proposes HiCA-VQA, comprising a hierarchical prompting module and a hierarchical decoder module, combined with a cross-attention fusion module. Result: On the Rad-Restruct benchmark, HiCA-VQA outperforms existing methods, particularly on hierarchical fine-grained questions. Conclusion: HiCA-VQA offers an effective pathway for hierarchical visual question answering systems and advances medical image understanding. Abstract: Medical Visual Question Answering (Med-VQA) answers clinical questions using medical images, aiding diagnosis. Designing a Med-VQA system holds profound importance in assisting clinical diagnosis and enhancing diagnostic accuracy. Building upon this foundation, Hierarchical Med-VQA extends Med-VQA by organizing medical questions into a hierarchical structure and making level-specific predictions to handle fine-grained distinctions. Recently, many studies have proposed hierarchical Med-VQA tasks and established datasets. However, several issues remain: (1) imperfect hierarchical modeling leads to poor differentiation between question levels, causing semantic fragmentation across hierarchies; (2) excessive reliance on implicit learning in Transformer-based cross-modal self-attention fusion methods obscures crucial local semantic correlations in medical scenarios. To address these issues, this study proposes the HiCA-VQA method, which includes two modules: Hierarchical Prompting for fine-grained medical questions and Hierarchical Answer Decoders. The hierarchical prompting module pre-aligns hierarchical text prompts with image features to guide the model in focusing on specific image regions according to question types, while the hierarchical decoder performs separate predictions for questions at different levels to improve accuracy across granularities. The framework also incorporates a cross-attention fusion module where images serve as queries and text as key-value pairs. Experiments on the Rad-Restruct benchmark demonstrate that the HiCA-VQA framework outperforms existing state-of-the-art methods in answering hierarchical fine-grained questions. This study provides an effective pathway for hierarchical visual question answering systems, advancing medical image understanding.
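The cross-attention arrangement mentioned in the abstract, with image features as queries and text features as keys and values, can be sketched in a few lines. This is a minimal illustration of scaled dot-product cross-attention only: it omits the learned query, key, and value projections and multi-head structure a real model would have, and the function names are assumptions, not the paper's code.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_feats, text_feats):
    """Scaled dot-product cross-attention without learned projections.

    image_feats: list of query vectors, one per image region.
    text_feats:  list of vectors used as both keys and values.
    Returns one fused vector per image region and the attention weights.
    """
    d = len(text_feats[0])
    fused, weights = [], []
    for q in image_feats:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in text_feats]
        w = softmax(scores)
        out = [sum(w[j] * text_feats[j][c] for j in range(len(text_feats)))
               for c in range(d)]
        fused.append(out)
        weights.append(w)
    return fused, weights
```

Because the queries come from the image, the output has one vector per image region, each a text-weighted summary of the prompt tokens most related to that region.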

Cultural Learning-Based Culture Adaptation of Language Models

Chen Cecilia Liu,Anna Korhonen,Iryna Gurevych

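Task: Propose CLCA, a new framework for improving the alignment of large language models (LLMs) with cultural values.

Motivation: Existing LLMs often reflect the values of specific groups by default, which may harm other groups, so they need to be adapted to diverse cultural values.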

Details

Method: Uses simulated social interactions to generate conversations in which LLMs role-play within culturally adapted social scenarios, capturing implicit cultural norms for model fine-tuning. Result: CLCA improves cultural value alignment across different model architectures, as validated with World Value Survey data. Conclusion: The results suggest that understanding intent and social interactions can enhance cultural value adaptation in LLMs, highlighting the promise of training approaches based on cultural learning. Abstract: Adapting large language models (LLMs) to diverse cultural values is a challenging task, as existing LLMs often reflect the values of specific groups by default, potentially causing harm to others. In this paper, we present CLCA, a novel framework for enhancing LLM alignment with cultural values based on cultural learning. The framework leverages simulated social interactions to generate conversations in which LLMs engage in role-playing within culturally adapted social scenarios, capturing implicit cultural norms for model fine-tuning. CLCA improves cultural value alignment across various model architectures measured using World Value Survey data, demonstrating the effectiveness of our proposed approach. Our results provide early evidence that understanding intent and social interactions can enhance cultural value adaptation in LLMs, highlighting the promise of training approaches based on cultural learning.

Classic Video Denoising in a Machine Learning World: Robust, Fast, and Controllable

Xin Jin,Simon Niklaus,Zhoutong Zhang,Zhihao Xia,Chunle Guo,Yuting Yang,Jiawen Chen,Chongyi Li

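Task: Propose a differentiable denoising pipeline built on traditional methods, with a neural network predicting the optimal denoising parameters.

Motivation: Address the failures of deep learning methods on real-world videos caused by mismatches between training data distributions and real noise patterns, while removing the manual parameter tuning that traditional methods require.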

Details

Method: Combines the reliability of traditional denoising with the automation of deep learning by designing a differentiable pipeline whose parameters are predicted by a neural network. Result: Yields an efficient and robust denoising method that supports user control. Conclusion: The method balances denoising quality, speed, and user control, and outperforms existing methods. Abstract: Denoising is a crucial step in many video processing pipelines such as in interactive editing, where high quality, speed, and user control are essential. While recent approaches achieve significant improvements in denoising quality by leveraging deep learning, they are prone to unexpected failures due to discrepancies between training data distributions and the wide variety of noise patterns found in real-world videos. These methods also tend to be slow and lack user control. In contrast, traditional denoising methods perform reliably on in-the-wild videos and run relatively quickly on modern hardware. However, they require manually tuning parameters for each input video, which is not only tedious but also requires skill. We bridge the gap between these two paradigms by proposing a differentiable denoising pipeline based on traditional methods. A neural network is then trained to predict the optimal denoising parameters for each specific input, resulting in a robust and efficient approach that also supports user control.

Understanding Aha Moments: from External Observations to Internal Mechanisms

Shu Yang,Junchao Wu,Xin Chen,Yunze Xiao,Xinyi Yang,Derek F. Wong,Di Wang

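Task: Study "aha moments" in Large Reasoning Models (LRMs) and how they manifest.

Motivation: Understand how LRMs reorganize their methods to allocate more thinking time to problems, and explore the external manifestations and internal mechanisms of their "aha moments".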

Details

Method: Systematically studies "aha moments" through linguistic patterns, descriptions of uncertainty, "Reasoning Collapse", and latent-space analysis. Result: "Aha moments" appear as more frequent use of anthropomorphic tones for self-reflection and an adaptive adjustment of uncertainty based on problem difficulty; internally, they involve a separation between anthropomorphic characteristics and pure reasoning. Conclusion: "Aha moments" help solve complex problems by altering the model's perception of problem difficulty; as layer depth increases, simpler problems are perceived as more complex, while more difficult problems appear simpler. Abstract: Large Reasoning Models (LRMs), capable of reasoning through complex problems, have become crucial for tasks like programming, mathematics, and commonsense reasoning. However, a key challenge lies in understanding how these models acquire reasoning capabilities and exhibit "aha moments" when they reorganize their methods to allocate more thinking time to problems. In this work, we systematically study "aha moments" in LRMs, from linguistic patterns, descriptions of uncertainty, and "Reasoning Collapse" to analysis in latent space. We demonstrate that the "aha moment" is externally manifested in a more frequent use of anthropomorphic tones for self-reflection and an adaptive adjustment of uncertainty based on problem difficulty. This process helps the model complete reasoning without succumbing to "Reasoning Collapse". Internally, it corresponds to a separation between anthropomorphic characteristics and pure reasoning, with an increased anthropomorphic tone for more difficult problems. Furthermore, we find that the "aha moment" helps models solve complex problems by altering their perception of problem difficulty. As the layer depth of the model increases, simpler problems tend to be perceived as more complex, while more difficult problems appear simpler.

Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models

Xuran Ma,Yexin Liu,Yaofu Liu,Xianfeng Wu,Mingzhe Zheng,Zihao Wang,Ser-Nam Lim,Harry Yang

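Task: Propose ProfilingDiT, an adaptive caching strategy that improves the computational efficiency of diffusion models for video generation.

Motivation: Existing feature caching methods ignore heterogeneity across blocks, leading to inefficient use of computation and degraded output quality.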

Details

Method: Analyzes attention distributions to derive a selective caching strategy that handles foreground-focused and background-focused blocks differently. Result: Substantially reduces computational overhead (e.g., a 2.01x speedup) while preserving visual fidelity. Conclusion: ProfilingDiT provides a viable approach to efficient video generation. Abstract: Recent advances in diffusion models have demonstrated remarkable capabilities in video generation. However, the computational intensity remains a significant challenge for practical applications. While feature caching has been proposed to reduce the computational burden of diffusion models, existing methods typically overlook the heterogeneous significance of individual blocks, resulting in suboptimal reuse and degraded output quality. We address this gap by introducing ProfilingDiT, a novel adaptive caching strategy that explicitly disentangles foreground-focused and background-focused blocks. Through a systematic analysis of attention distributions in diffusion models, we reveal two key observations: 1) Most layers exhibit a consistent preference for either foreground or background regions. 2) Predicted noise shows low inter-step similarity initially, which stabilizes as denoising progresses. These findings inspire us to formulate a selective caching strategy that preserves full computation for dynamic foreground elements while efficiently caching static background features. Our approach substantially reduces computational overhead while preserving visual fidelity. Extensive experiments demonstrate that our framework achieves significant acceleration (e.g., 2.01 times speedup for Wan2.1) while maintaining visual fidelity across comprehensive quality metrics, establishing a viable method for efficient video generation.

CoLa -- Learning to Interactively Collaborate with Large LMs

Abhishek Sharma,Dan Goldwasser

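Task: Explore whether human guidance can be simulated, and train automated guides (CoLa) to improve AI systems' ability to solve complex language problems.

Motivation: Building on LLMs' ability to handle a wide range of language tasks, explore how simulating human guidance can strengthen collaborative problem solving in AI systems.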

Details

Method: Introduces CoLa, a self-guided learning paradigm for training automated guides, and evaluates it on QA datasets, a puzzle-solving task, and a constrained text generation task. Result: CoLa outperforms competitive approaches across all domains; a small trained guide even outperforms GPT-4; automated guides outperform humans by adapting to reasoners' capabilities. Conclusion: Automated guides such as CoLa can effectively simulate human guidance and perform well in collaborative problem solving, offering a new direction for human-AI collaboration. Abstract: LLMs' remarkable ability to tackle a wide range of language tasks has opened new opportunities for collaborative human-AI problem solving. LLMs can amplify human capabilities by applying their intuitions and reasoning strategies at scale. We explore whether human guides can be simulated, by generalizing from human demonstrations of guiding an AI system to solve complex language problems. We introduce CoLa, a novel self-guided learning paradigm for training automated guides, and evaluate it on two QA datasets, a puzzle-solving task, and a constrained text generation task. Our empirical results show that CoLa consistently outperforms competitive approaches across all domains. Moreover, a small-sized trained guide outperforms a strong model like GPT-4 when acting as a guide. We compare the strategies employed by humans and automated guides by conducting a human study on a QA dataset. We show that automated guides outperform humans by adapting their strategies to reasoners' capabilities and conduct qualitative analyses highlighting distinct differences in guiding strategies.

TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference

Junshan Hu,Jialiang Mao,Zhikang Liu,Zhongpu Xia,Peng Jia,Xianpeng Lang

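Task: Propose TokenFLEX, a framework with a variable number of vision tokens, to address the inefficiency caused by fixed vision token counts in conventional vision-language models.

Motivation: Conventional vision-language models use a fixed number of vision tokens regardless of task complexity, which wastes computation on simple tasks or leaves visual understanding insufficient on complex ones.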

Details

Method: TokenFLEX stochastically modulates the token count during training and uses a lightweight vision token projector (with an adaptive pooling layer and SwiGLU) to allow flexible adjustment of the number of vision tokens. Result: Averaged over eight vision-language benchmarks, TokenFLEX improves performance by 1.6%, 1.0%, and 0.4% at 64, 144, and 256 tokens, respectively. Conclusion: TokenFLEX offers notable flexibility and efficiency while maintaining high-performance vision-language understanding. Abstract: Conventional Vision-Language Models (VLMs) typically utilize a fixed number of vision tokens, regardless of task complexity. This one-size-fits-all strategy introduces notable inefficiencies: using excessive tokens leads to unnecessary computational overhead in simpler tasks, whereas insufficient tokens compromise fine-grained visual comprehension in more complex contexts. To overcome these limitations, we present TokenFLEX, an innovative and adaptable vision-language framework that encodes images into a variable number of tokens for efficient integration with a Large Language Model (LLM). Our approach is underpinned by two pivotal innovations. Firstly, we present a novel training paradigm that enhances performance across varying numbers of vision tokens by stochastically modulating token counts during training. Secondly, we design a lightweight vision token projector incorporating an adaptive pooling layer and SwiGLU, allowing for flexible downsampling of vision tokens and adaptive selection of features tailored to specific token counts. Comprehensive experiments reveal that TokenFLEX consistently outperforms its fixed-token counterparts, achieving performance gains of 1.6%, 1.0%, and 0.4% with 64, 144, and 256 tokens, respectively, averaged over eight vision-language benchmarks. These results underscore TokenFLEX's remarkable flexibility while maintaining high-performance vision-language understanding.
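A projector that pools a grid of vision features to a chosen token count and then applies a SwiGLU gate might look roughly like the sketch below. The abstract names the two ingredients but not how they are wired together, so the ordering (pool first, gate second), the square-grid assumption, and all function names here are assumptions for illustration. The token counts 64, 144, and 256 come from the abstract.

```python
import math
import random


def adaptive_avg_pool_2d(grid, out_size):
    """Average-pool an H x W grid of feature vectors to out_size x out_size.

    Window bounds follow the usual adaptive-pooling rule:
    start = floor(i * H / out), end = ceil((i + 1) * H / out).
    """
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for i in range(out_size):
        r0, r1 = (i * h) // out_size, -((-(i + 1) * h) // out_size)
        row = []
        for j in range(out_size):
            c0, c1 = (j * w) // out_size, -((-(j + 1) * w) // out_size)
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append([sum(v[k] for v in cells) / len(cells) for k in range(d)])
        pooled.append(row)
    return pooled


def silu(x):
    return x / (1.0 + math.exp(-x))


def swiglu(x, w_gate, w_up):
    """SwiGLU gating: SiLU(x . w_gate) * (x . w_up), one weight vector per output unit."""
    gate = [silu(sum(xi * wi for xi, wi in zip(x, col))) for col in w_gate]
    up = [sum(xi * wi for xi, wi in zip(x, col)) for col in w_up]
    return [g * u for g, u in zip(gate, up)]


def project(grid, num_tokens, w_gate, w_up):
    """Pool the grid to num_tokens (a perfect square), then gate each token."""
    side = math.isqrt(num_tokens)
    assert side * side == num_tokens, "sketch assumes a square token grid"
    pooled = adaptive_avg_pool_2d(grid, side)
    return [swiglu(vec, w_gate, w_up) for row in pooled for vec in row]


def sample_token_count(rng, choices=(64, 144, 256)):
    """Pick a token count at random, as stochastic modulation during training would."""
    return rng.choice(choices)
```

In a training loop, `sample_token_count` would be called once per step so that the same projector weights are exercised at every supported token count.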

A Bayesian account of pronoun and neopronoun acquisition

Cassandra L. Jacobs,Morgan Grobol

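Task: Model individual differences in pronoun selection in support of respect for gender diversity.

Motivation: Address the dismissal of pronoun misuse in mainstream language use, and promote flexibility and respect for gender expression.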

Details

Method: A probabilistic graphical modeling approach based on the nested Chinese Restaurant Franchise Process (nCRFP) for learning flexible pronominal and name reference. Result: The model can account for variability in how quickly pronouns or names are integrated into symbolic knowledge, supporting respect for gender diversity. Conclusion: The proposed approach gives computational systems flexibility while respecting diverse gender expression. Abstract: A major challenge to equity among members of queer communities is the use of one's chosen forms of reference, such as personal names or pronouns. Speakers often dismiss their misuses of pronouns as "unintentional", and claim that their errors reflect many decades of fossilized mainstream language use, as well as attitudes or expectations about the relationship between one's appearance and acceptable forms of reference. We argue for explicitly modeling individual differences in pronoun selection and present a probabilistic graphical modeling approach based on the nested Chinese Restaurant Franchise Process (nCRFP) (Ahmed et al., 2013) to account for flexible pronominal reference such as chosen names and neopronouns while moving beyond form-to-meaning mappings and without lexical co-occurrence statistics to learn referring expressions, as in contemporary language models. We show that such a model can account for variability in how quickly pronouns or names are integrated into symbolic knowledge and can empower computational systems to be both flexible and respectful of queer people with diverse gender expression.
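The building block behind the nCRFP is the plain Chinese Restaurant Process (CRP), sketched below. This is the basic single-level CRP, not the nested franchise variant the abstract uses, and reading "tables" as forms of reference (with a new table standing for a previously unseen form such as a neopronoun) is an illustrative interpretation, not the authors' specification. The function names and the concentration parameter `alpha` are assumptions of the sketch.

```python
import random


def crp_probabilities(table_counts, alpha):
    """Predictive probabilities for the next customer under a CRP.

    An existing table k is chosen with probability n_k / (n + alpha);
    a new table is opened with probability alpha / (n + alpha).
    The last entry of the returned list is the new-table probability.
    """
    n = sum(table_counts)
    probs = [c / (n + alpha) for c in table_counts]
    probs.append(alpha / (n + alpha))
    return probs


def crp_assign(num_customers, alpha, rng):
    """Seat customers one at a time and return the final table sizes."""
    counts = []
    for _ in range(num_customers):
        probs = crp_probabilities(counts, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(counts):
            counts.append(1)  # open a new table
        else:
            counts[k] += 1
    return counts
```

Frequently used forms attract more probability mass (a rich-get-richer effect), while `alpha` controls how readily a new form is admitted, which is one way to think about speakers who differ in how quickly they adopt a new pronoun.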

NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving

Kexin Tian,Jingrui Mao,Yunlong Zhang,Jiwan Jiang,Yang Zhou,Zhengzhong Tu

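Task: Propose NuScenes-SpatialQA, a benchmark for evaluating the spatial understanding and reasoning capabilities of Vision-Language Models (VLMs) in autonomous driving scenarios.

Motivation: Existing benchmarks do not systematically evaluate VLMs' spatial reasoning in autonomous driving; NuScenes-SpatialQA fills this gap.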

Details

Method: Builds the benchmark on the NuScenes dataset through an automated 3D scene graph generation pipeline and a QA generation pipeline, and evaluates VLMs across multiple dimensions. Result: The spatial-enhanced VLM performs well on qualitative QA but is not competitive on quantitative QA; VLMs still face challenges in spatial understanding and reasoning. Conclusion: VLMs' spatial capabilities in autonomous driving need further improvement, and NuScenes-SpatialQA provides an important benchmark for future research. Abstract: Recent advancements in Vision-Language Models (VLMs) have demonstrated strong potential for autonomous driving tasks. However, their spatial understanding and reasoning, key capabilities for autonomous driving, still exhibit significant limitations. Notably, none of the existing benchmarks systematically evaluate VLMs' spatial reasoning capabilities in driving scenarios. To fill this gap, we propose NuScenes-SpatialQA, the first large-scale ground-truth-based Question-Answer (QA) benchmark specifically designed to evaluate the spatial understanding and reasoning capabilities of VLMs in autonomous driving. Built upon the NuScenes dataset, the benchmark is constructed through an automated 3D scene graph generation pipeline and a QA generation pipeline. The benchmark systematically evaluates VLMs' performance in both spatial understanding and reasoning across multiple dimensions. Using this benchmark, we conduct extensive experiments on diverse VLMs, including both general and spatial-enhanced models, providing the first comprehensive evaluation of their spatial capabilities in autonomous driving. Surprisingly, the experimental results show that the spatial-enhanced VLM excels in qualitative QA but does not demonstrate competitiveness in quantitative QA. In general, VLMs still face considerable challenges in spatial understanding and reasoning.

Hummus: A Dataset of Humorous Multimodal Metaphor Use

Xiaoyu Tong,Zhi Zhang,Martha Lewis,Ekaterina Shutova

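Task: Study the humorous capacity of multimodal metaphors and develop a new annotation scheme for humorous multimodal metaphor use in image-caption pairs.

Motivation: The humorous capacity of multimodal metaphors has received little attention in the research community, even though metaphor is one of the most common humorous mechanisms.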

Details

Method: Drawing on the Incongruity Theory of humor, the Conceptual Metaphor Theory, and the annotation scheme behind the VU Amsterdam Metaphor Corpus, develops a new annotation scheme and creates the Hummus dataset of 1k image-caption pairs. Result: Experiments show that current multimodal large language models still struggle with humorous multimodal metaphors, particularly in integrating visual and textual information. Conclusion: The humorous capacity of multimodal metaphors merits further study, and current models have room to improve on this task. Abstract: Metaphor and humor share a lot of common ground, and metaphor is one of the most common humorous mechanisms. This study focuses on the humorous capacity of multimodal metaphors, which has not received due attention in the community. We take inspiration from the Incongruity Theory of humor, the Conceptual Metaphor Theory, and the annotation scheme behind the VU Amsterdam Metaphor Corpus, and develop a novel annotation scheme for humorous multimodal metaphor use in image-caption pairs. We create the Hummus Dataset of Humorous Multimodal Metaphor Use, providing expert annotation on 1k image-caption pairs sampled from the New Yorker Caption Contest corpus. Using the dataset, we test state-of-the-art multimodal large language models (MLLMs) on their ability to detect and understand humorous multimodal metaphor use. Our experiments show that current MLLMs still struggle with processing humorous multimodal metaphors, particularly with regard to integrating visual and textual information. We release our dataset and code at github.com/xiaoyuisrain/humorous-multimodal-metaphor-use.

Hanbo Bi,Yingchao Feng,Boyuan Tong,Mengyu Wang,Haichen Yu,Yongqiang Mao,Hao Chang,Wenhui Diao,Peijin Wang,Yue Yu,Hanyang Peng,Yehong Zhang,Kun Fu,Xian Sun

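Task: Propose RingMoE, a unified multi-modal remote sensing foundation model, to overcome the limitations of existing models in handling multi-modal remote sensing data.

Motivation: Existing foundation models mainly handle single or limited modalities and overlook the multi-modal nature of remote sensing observations, which leaves ambiguity and uncertainty in the analysis.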

Details

Method: RingMoE uses a hierarchical Mixture-of-Experts (MoE) architecture together with physics-informed self-supervised learning and dynamic expert pruning for efficient modeling and compression of multi-modal data. Result: Across 23 benchmarks, RingMoE outperforms existing foundation models on six key remote sensing tasks and adapts well from single-modal to multi-modal scenarios. Conclusion: Beyond theoretical progress, RingMoE has been deployed and trialed in sectors including emergency response, land management, marine sciences, and urban planning. Abstract: The rapid advancement of foundation models has revolutionized visual representation learning in a self-supervised manner. However, their application in remote sensing (RS) remains constrained by a fundamental gap: existing models predominantly handle single or limited modalities, overlooking the inherently multi-modal nature of RS observations. Optical, synthetic aperture radar (SAR), and multi-spectral data offer complementary insights that significantly reduce the inherent ambiguity and uncertainty in single-source analysis. To bridge this gap, we introduce RingMoE, a unified multi-modal RS foundation model with 14.7 billion parameters, pre-trained on 400 million multi-modal RS images from nine satellites. RingMoE incorporates three key innovations: (1) A hierarchical Mixture-of-Experts (MoE) architecture comprising modal-specialized, collaborative, and shared experts, effectively modeling intra-modal knowledge while capturing cross-modal dependencies to mitigate conflicts between modal representations; (2) Physics-informed self-supervised learning, explicitly embedding sensor-specific radiometric characteristics into the pre-training objectives; (3) Dynamic expert pruning, enabling adaptive model compression from 14.7B to 1B parameters while maintaining performance, facilitating efficient deployment in Earth observation applications. Evaluated across 23 benchmarks spanning six key RS tasks (i.e., classification, detection, segmentation, tracking, change detection, and depth estimation), RingMoE outperforms existing foundation models and sets new SOTAs, demonstrating remarkable adaptability from single-modal to multi-modal scenarios. Beyond theoretical progress, it has been deployed and trialed in multiple sectors, including emergency response, land management, marine sciences, and urban planning.

The Dual-Route Model of Induction

Sheridan Feucht,Eric Todd,Byron Wallace,David Bau

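Task: Study the role of concept-level induction heads in in-context learning and how they differ from token-level induction heads.

Motivation: Explore whether in-context copying includes a mechanism for copying higher-level semantic units (such as lexical units), and analyze its effect on semantic tasks.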

Details

Method: Introduces concept-level induction heads, studies how they work in parallel with token-level induction heads, and verifies their independence through ablation experiments. Result: Concept-level induction heads are responsible for semantic tasks (such as word-level translation), while token-level induction heads serve verbatim copying tasks; ablating token-level induction heads causes models to paraphrase instead of copying verbatim. Conclusion: Concept-level induction heads may be more broadly relevant for in-context learning, while token-level induction heads are vital only for specific tasks. Abstract: Prior work on in-context copying has shown the existence of induction heads, which attend to and promote individual tokens during copying. In this work we introduce a new type of induction head: concept-level induction heads, which copy entire lexical units instead of individual tokens. Concept induction heads learn to attend to the ends of multi-token words throughout training, working in parallel with token-level induction heads to copy meaningful text. We show that these heads are responsible for semantic tasks like word-level translation, whereas token induction heads are vital for tasks that can only be done verbatim, like copying nonsense tokens. These two "routes" operate independently: in fact, we show that ablation of token induction heads causes models to paraphrase where they would otherwise copy verbatim. In light of these findings, we argue that although token induction heads are vital for specific tasks, concept induction heads may be more broadly relevant for in-context learning.

Finding the Reflection Point: Unpadding Images to Remove Data Augmentation Artifacts in Large Open Source Image Datasets for Machine Learning

Lucas Choi,Ross Greer

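Task: Detect and remove noisy mirrored padding artifacts from images.

Motivation: Data augmentation techniques such as padding can introduce artifacts when standardizing image dimensions, which affects dataset reuse across domains and model evaluation.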

Details

Method: Proposes a systematic algorithm that precisely delineates the reflection boundary using a minimum mean squared error approach with thresholding, then removes the mirrored padding. Result: Validated on the SHEL5k dataset, with clear gains in zero-shot object detection (e.g., average precision for hard hat detection rises from 0.47 to 0.61). Conclusion: The method improves dataset integrity and supports more reliable model evaluation for computer vision tasks. Abstract: In this paper, we address a novel image restoration problem relevant to machine learning dataset curation: the detection and removal of noisy mirrored padding artifacts. While data augmentation techniques like padding are necessary for standardizing image dimensions, they can introduce artifacts that degrade model evaluation when datasets are repurposed across domains. We propose a systematic algorithm to precisely delineate the reflection boundary through a minimum mean squared error approach with thresholding and remove reflective padding. Our method effectively identifies the transition between authentic content and its mirrored counterpart, even in the presence of compression or interpolation noise. We demonstrate our algorithm's efficacy on the SHEL5k dataset, showing significant performance improvements in zero-shot object detection tasks using OWLv2, with average precision increasing from 0.47 to 0.61 for hard hat detection and from 0.68 to 0.73 for person detection. By addressing annotation inconsistencies and distorted objects in padded regions, our approach enhances dataset integrity, enabling more reliable model evaluation across computer vision tasks.
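The minimum-MSE-with-threshold idea can be sketched for the simplest case: mirrored padding appended below the real content of a grayscale image. The abstract does not give the algorithm's details, so the search procedure, the `min_overlap` guard, the restriction to bottom padding, and the function names below are all assumptions for illustration.

```python
def reflection_boundary(rows, threshold, min_overlap=2):
    """Return the row index where mirrored padding starts, or None.

    rows: list of image rows, each a list of pixel intensities.
    A candidate boundary b means rows[b + k] should mirror rows[b - 1 - k].
    The candidate with the lowest mean squared error wins, and it is accepted
    only if that error is at or below the threshold (tolerating mild noise).
    """
    h = len(rows)
    best_b, best_mse = None, float("inf")
    for b in range(min_overlap, h - min_overlap + 1):
        overlap = min(b, h - b)
        sq_err, count = 0.0, 0
        for k in range(overlap):
            for x, y in zip(rows[b - 1 - k], rows[b + k]):
                sq_err += (x - y) ** 2
                count += 1
        mse = sq_err / count
        if mse < best_mse:
            best_b, best_mse = b, mse
    if best_b is not None and best_mse <= threshold:
        return best_b
    return None


def unpad(rows, threshold, min_overlap=2):
    """Crop away mirrored bottom padding if a reflection boundary is found."""
    b = reflection_boundary(rows, threshold, min_overlap)
    return rows if b is None else rows[:b]
```

The threshold is what lets the check tolerate compression or interpolation noise in the padded region while still rejecting images that were never padded; handling all four image sides would repeat the same search on flipped or transposed copies.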

IPA-CHILDES & G2P+: Feature-Rich Resources for Cross-Lingual Phonology and Phonemic Language Modeling

Zébulon Goriely,Paula Buttery

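Task: Introduce two resources: G2P+, a tool for converting orthographic datasets to a consistent phonemic representation, and IPA CHILDES, a phonemic dataset of child-centered speech across 31 languages.

Motivation: Existing tools produce inconsistent phonemic conversions, and existing phonemic datasets lack multilingual coverage, spontaneous speech, and a focus on child-directed language.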

Details

Method: Improves the G2P+ tool using the phoneme inventories in the Phoible database and applies it to the CHILDES dataset to produce IPA CHILDES. Result: IPA CHILDES fills gaps in existing phonemic datasets, and its utility is demonstrated by training phoneme language models. Conclusion: G2P+ and IPA CHILDES are valuable resources for phonological research and demonstrate the potential of learning from the distributional properties of phonemes across languages. Abstract: In this paper, we introduce two resources: (i) G2P+, a tool for converting orthographic datasets to a consistent phonemic representation; and (ii) IPA CHILDES, a phonemic dataset of child-centered speech across 31 languages. Prior tools for grapheme-to-phoneme conversion result in phonemic vocabularies that are inconsistent with established phonemic inventories, an issue which G2P+ addresses by leveraging the inventories in the Phoible database. Using this tool, we augment CHILDES with phonemic transcriptions to produce IPA CHILDES. This new resource fills several gaps in existing phonemic datasets, which often lack multilingual coverage, spontaneous speech, and a focus on child-directed language. We demonstrate the utility of this dataset for phonological research by training phoneme language models on 11 languages and probing them for distinctive features, finding that the distributional properties of phonemes are sufficient to learn major class and place features cross-lingually.

REJEPA: A Novel Joint-Embedding Predictive Architecture for Efficient Remote Sensing Image Retrieval

Shabnam Choudhury,Yash Salunkhe,Sarthak Mehrotra,Biplab Banerjee

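Task: Develop REJEPA, a self-supervised framework for unimodal content-based remote sensing image retrieval (RS-CBIR).

Motivation: The rapid expansion of remote sensing image archives calls for strong and efficient content-based retrieval techniques.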

Details

Method: REJEPA uses spatially distributed context token encoding to predict abstract representations of target tokens, with VICReg regularization to prevent encoder collapse. Result: Retrieval accuracy improves by 5.1%-10.1% across several datasets, and computational complexity drops by 40-60%. Conclusion: REJEPA is an efficient, scalable, and precise sensor-agnostic benchmark method for RS-CBIR. Abstract: The rapid expansion of remote sensing image archives demands the development of strong and efficient techniques for content-based image retrieval (RS-CBIR). This paper presents REJEPA (Retrieval with Joint-Embedding Predictive Architecture), an innovative self-supervised framework designed for unimodal RS-CBIR. REJEPA utilises spatially distributed context token encoding to forecast abstract representations of target tokens, effectively capturing high-level semantic features and eliminating unnecessary pixel-level details. In contrast to generative methods that focus on pixel reconstruction or contrastive techniques that depend on negative pairs, REJEPA functions within feature space, achieving a reduction in computational complexity of 40-60% when compared to pixel-reconstruction baselines like Masked Autoencoders (MAE). To guarantee strong and varied representations, REJEPA incorporates Variance-Invariance-Covariance Regularisation (VICReg), which prevents encoder collapse by promoting feature diversity and reducing redundancy. The method demonstrates an estimated enhancement in retrieval accuracy of 5.1% on BEN-14K (S1), 7.4% on BEN-14K (S2), 6.0% on FMoW-RGB, and 10.1% on FMoW-Sentinel compared to prominent SSL techniques, including CSMAE-SESD, Mask-VLM, SatMAE, ScaleMAE, and SatMAE++, on extensive RS benchmarks BEN-14K (multispectral and SAR data), FMoW-RGB and FMoW-Sentinel. Through effective generalisation across sensor modalities, REJEPA establishes itself as a sensor-agnostic benchmark for efficient, scalable, and precise RS-CBIR, addressing challenges like varying resolutions, high object density, and complex backgrounds with computational efficiency.
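The three VICReg terms named in the abstract can be written out directly. The sketch below follows the standard VICReg formulation (invariance as mean squared distance between paired embeddings, variance as a hinge on the per-dimension standard deviation, covariance as the sum of squared off-diagonal covariances divided by the dimension). How REJEPA weights or applies these terms is not stated in the abstract, so the default weights and the function names are placeholders.

```python
import math


def invariance_term(z1, z2):
    """Mean squared Euclidean distance between paired embeddings."""
    return sum(sum((a - b) ** 2 for a, b in zip(r1, r2))
               for r1, r2 in zip(z1, z2)) / len(z1)


def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge loss pushing each dimension's std above gamma (penalizes collapse)."""
    n, d = len(z), len(z[0])
    total = 0.0
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d


def covariance_term(z):
    """Sum of squared off-diagonal covariances, divided by the dimension."""
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    total = 0.0
    for a in range(d):
        for b in range(d):
            if a != b:
                cov = sum((row[a] - means[a]) * (row[b] - means[b])
                          for row in z) / (n - 1)
                total += cov ** 2
    return total / d


def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0):
    """Weighted sum of the three terms; the weights here are placeholders."""
    return (sim_w * invariance_term(z1, z2)
            + var_w * (variance_term(z1) + variance_term(z2))
            + cov_w * (covariance_term(z1) + covariance_term(z2)))
```

A batch whose embeddings are all identical scores close to the maximum on the variance term, which is how the regulariser discourages the encoder collapse the abstract mentions without needing negative pairs.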

Extending CREAMT: Leveraging Large Language Models for Literary Translation Post-Editing

Antonio Castaldo,Sheila Castilho,Joss Moorkens,Johanna Monti

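Task: Evaluate the feasibility of post-editing literary translations generated by large language models (LLMs).

Motivation: Neural machine translation systems struggle to balance efficiency with the preservation of creativity and style in literary texts, whereas LLMs offer improved context-aware and creative translation.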

Details

Method: Using a custom research tool and working with professional literary translators, analyzes editing time, quality, and creativity. Result: Post-editing LLM-generated translations significantly reduces editing time while maintaining a level of creativity similar to human translation. Conclusion: LLMs may effectively support literary translation for high-resource languages, given substantial productivity gains with minimal difference in creativity. Abstract: Post-editing machine translation (MT) for creative texts, such as literature, requires balancing efficiency with the preservation of creativity and style. While neural MT systems struggle with these challenges, large language models (LLMs) offer improved capabilities for context-aware and creative translation. This study evaluates the feasibility of post-editing literary translations generated by LLMs. Using a custom research tool, we collaborated with professional literary translators to analyze editing time, quality, and creativity. Our results indicate that post-editing LLM-generated translations significantly reduces editing time compared to human translation while maintaining a similar level of creativity. The minimal difference in creativity between PE and MT, combined with substantial productivity gains, suggests that LLMs may effectively support literary translators working with high-resource languages.

Real-Time Roadway Obstacle Detection for Electric Scooters Using Deep Learning and Multi-Sensor Fusion

Zeyang Zheng,Arman Hosseini,Dong Chen,Omid Shoghli,Arsalan Heydarian

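Task: Develop a deep learning-based system for real-time detection of ground obstacles for electric scooters.

Motivation: The spread of electric scooters in cities has coincided with more traffic accidents, and obstacle detection for e-scooters remains underexplored.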

Details

Method: Combines an RGB camera, a depth camera, and an IMU, using a YOLO model for obstacle detection and distance estimation. Result: The system achieves a mean average precision (mAP) of 0.827 on a seven-hour naturalistic riding dataset and shows strong real-time performance. Conclusion: The system improves e-scooter safety through computer vision and data fusion. Abstract: The increasing adoption of electric scooters (e-scooters) in urban areas has coincided with a rise in traffic accidents and injuries, largely due to their small wheels, lack of suspension, and sensitivity to uneven surfaces. While deep learning-based object detection has been widely used to improve automobile safety, its application for e-scooter obstacle detection remains unexplored. This study introduces a novel ground obstacle detection system for e-scooters, integrating an RGB camera and a depth camera to enhance real-time road hazard detection. Additionally, the Inertial Measurement Unit (IMU) measures linear vertical acceleration to identify surface vibrations, guiding the selection of six obstacle categories: tree branches, manhole covers, potholes, pine cones, non-directional cracks, and truncated domes. All sensors, including the RGB camera, depth camera, and IMU, are integrated within the Intel RealSense Camera D435i. A deep learning model powered by YOLO detects road hazards and utilizes depth data to estimate obstacle proximity. Evaluated on a seven-hour naturalistic riding dataset, the system achieves a high mean average precision (mAP) of 0.827 and demonstrates excellent real-time performance. This approach provides an effective solution to enhance e-scooter safety through advanced computer vision and data fusion. The dataset is accessible at https://zenodo.org/records/14583718, and the project code is hosted on https://github.com/Zeyang-Zheng/Real-Time-Roadway-Obstacle-Detection-for-Electric-Scooters.
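One simple way to turn a detection box plus an aligned depth map into a proximity estimate is to take the median of the valid depth readings inside the box. The abstract does not say how the authors compute proximity, so this is only a plausible sketch: the function name, the box convention (x0, y0 inclusive, x1, y1 exclusive), and treating a reading of 0 as "no data" are assumptions.

```python
import statistics


def obstacle_distance(depth_map, box, invalid=0.0):
    """Median of the valid depth readings inside a bounding box.

    depth_map: 2D list of depth values in metres, aligned with the RGB frame.
    box:       (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
    invalid:   sentinel marking pixels with no depth reading (assumed 0).
    Returns None when the box contains no valid reading.
    """
    x0, y0, x1, y1 = box
    values = [depth_map[y][x]
              for y in range(y0, y1)
              for x in range(x0, x1)
              if depth_map[y][x] != invalid]
    return statistics.median(values) if values else None
```

The median is used here because it is less sensitive than the mean to background pixels that leak into the box around a small obstacle.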

Task as Context Prompting for Accurate Medical Symptom Coding Using Large Language Models

Chengyang He,Wenlong Zhang,Violet Xinying Chen,Yue Ning,Ping Wang

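Task: Accurately code medical symptoms from unstructured clinical text (such as vaccine safety reports) and link them to standardized vocabularies (such as MedDRA).

Motivation: Traditional approaches treat symptom extraction and linking as independent workflows and struggle with the complexity and variability of clinical narratives, especially for rare cases. Large language models (LLMs) offer new opportunities but still face challenges in achieving consistent performance.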

Details

Method: Proposes the Task as Context (TACO) Prompting framework, which unifies extraction and linking by embedding task-specific context into LLM prompts, and introduces the SYMPCODER dataset and a two-stage evaluation framework. Result: TACO markedly improves the flexibility and accuracy of symptom coding and performs well across multiple LLMs (e.g., Llama2-chat, GPT-4 Turbo). Conclusion: TACO offers an effective solution for specific coding tasks and advances clinical text processing methodologies. Abstract: Accurate medical symptom coding from unstructured clinical text, such as vaccine safety reports, is a critical task with applications in pharmacovigilance and safety monitoring. Symptom coding, as tailored in this study, involves identifying and linking nuanced symptom mentions to standardized vocabularies like MedDRA, differentiating it from broader medical coding tasks. Traditional approaches to this task, which treat symptom extraction and linking as independent workflows, often fail to handle the variability and complexity of clinical narratives, especially for rare cases. Recent advancements in Large Language Models (LLMs) offer new opportunities but face challenges in achieving consistent performance. To address these issues, we propose Task as Context (TACO) Prompting, a novel framework that unifies extraction and linking tasks by embedding task-specific context into LLM prompts. Our study also introduces SYMPCODER, a human-annotated dataset derived from Vaccine Adverse Event Reporting System (VAERS) reports, and a two-stage evaluation framework to comprehensively assess both symptom linking and mention fidelity. Our comprehensive evaluation of multiple LLMs, including Llama2-chat, Jackalope-7b, GPT-3.5 Turbo, GPT-4 Turbo, and GPT-4o, demonstrates TACO's effectiveness in improving flexibility and accuracy for tailored tasks like symptom coding, paving the way for more specific coding tasks and advancing clinical text processing methodologies.

Detection Based Part-level Articulated Object Reconstruction from Single RGBD Image

Yuki Kawana,Tatsuya Harada

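Task: Propose an end-to-end trainable, cross-category method for reconstructing multiple man-made articulated objects from a single RGBD image, focusing on part-level shape reconstruction and pose and kinematics estimation.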

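Motivation: Previous methods rely on learning an instance-level latent space and focus on man-made articulated objects with predefined part counts; this work instead proposes a new approach based on part-level representation.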

Details

Method: Adopts a detect-then-group strategy and addresses false positives, varying part sizes and scales, and growing model size through test-time kinematics-aware part fusion, anisotropic scale normalization, and a balancing strategy between feature space and output space. Result: Evaluation on synthetic and real data shows that the method successfully reconstructs variously structured instances that previous methods cannot handle and outperforms prior work in shape reconstruction and kinematics estimation. Conclusion: Through part-level representation and these optimization strategies, the method significantly improves the reconstruction of multi-instance articulated objects. Abstract: We propose an end-to-end trainable, cross-category method for reconstructing multiple man-made articulated objects from a single RGBD image, focusing on part-level shape reconstruction and pose and kinematics estimation. We depart from previous works that rely on learning an instance-level latent space, focusing on man-made articulated objects with predefined part counts. Instead, we propose a novel alternative approach that employs part-level representation, representing instances as combinations of detected parts. While our detect-then-group approach effectively handles instances with diverse part structures and various part counts, it faces issues of false positives, varying part sizes and scales, and an increasing model size due to end-to-end training. To address these challenges, we propose 1) test-time kinematics-aware part fusion to improve detection performance while suppressing false positives, 2) anisotropic scale normalization for part shape learning to accommodate various part sizes and scales, and 3) a balancing strategy for cross-refinement between feature space and output space to improve part detection while maintaining model size. Evaluation on both synthetic and real data demonstrates that our method successfully reconstructs variously structured multiple instances that previous works cannot handle, and outperforms prior works in shape reconstruction and kinematics estimation.

AD-GPT: Large Language Models in Alzheimer's Disease

Ziyu Liu,Lintao Tang,Zeliang Sun,Zhengliang Liu,Yanjun Lyu,Wei Ruan,Yangshuang Xu,Liang Shan,Jiyoon Shin,Xiaohe Chen,Dajiang Zhu,Tianming Liu,Rongjie Liu,Chao Huang

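Task: Develop AD-GPT, a domain-specific generative pre-trained transformer for Alzheimer's disease (AD), to improve the retrieval and analysis of AD-related genetic and neurobiological information.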

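Motivation: Large language models (LLMs) show strong capabilities in medical information retrieval, but their accuracy and depth remain limited in specialized domains such as Alzheimer's disease.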

Details

Method: Uses a stacked LLM architecture combining Llama3 and BERT, integrates diverse biomedical data sources, and is optimized for four key tasks in AD research. Result: AD-GPT shows better precision and reliability than existing LLMs on tasks such as genetic information retrieval and gene-brain region relationship assessment. Conclusion: AD-GPT is a robust, specialized AI tool with the potential to advance AD research and biomarker discovery. Abstract: Large language models (LLMs) have emerged as powerful tools for medical information retrieval, yet their accuracy and depth remain limited in specialized domains such as Alzheimer's disease (AD), a growing global health challenge. To address this gap, we introduce AD-GPT, a domain-specific generative pre-trained transformer designed to enhance the retrieval and analysis of AD-related genetic and neurobiological information. AD-GPT integrates diverse biomedical data sources, including potential AD-associated genes, molecular genetic information, and key gene variants linked to brain regions. We develop a stacked LLM architecture combining Llama3 and BERT, optimized for four critical tasks in AD research: (1) genetic information retrieval, (2) gene-brain region relationship assessment, (3) gene-AD relationship analysis, and (4) brain region-AD relationship mapping. Comparative evaluations against state-of-the-art LLMs demonstrate AD-GPT's superior precision and reliability across these tasks, underscoring its potential as a robust and specialized AI tool for advancing AD research and biomarker discovery.

MIMRS: A Survey on Masked Image Modeling in Remote Sensing

Shabnam Choudhury,Akhil Vasim,Michael Schmitt,Biplab Banerjee

Task: Survey the applications, methods, and future research directions of masked image modeling (MIM) in remote sensing.

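Motivation: As a self-supervised learning technique, MIM can use unannotated data for pre-training and addresses incomplete data in remote sensing caused by cloud cover, sensor limitations, and similar factors.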

Details

Method: Synthesizes and critically analyzes recent advances to summarize the methodology and applications of MIM in remote sensing. Result: Presents the current state, state-of-the-art methods, and potential applications of MIM in remote sensing, such as cloud removal, multi-modal data fusion, and super-resolution. Conclusion: The survey offers foundational guidance for MIM research in remote sensing and identifies directions for future innovation. Abstract: Masked Image Modeling (MIM) is a self-supervised learning technique that involves masking portions of an image, such as pixels, patches, or latent representations, and training models to predict the missing information using the visible context. This approach has emerged as a cornerstone in self-supervised learning, unlocking new possibilities in visual understanding by leveraging unannotated data for pre-training. In remote sensing, MIM addresses challenges such as incomplete data caused by cloud cover, occlusions, and sensor limitations, enabling applications like cloud removal, multi-modal data fusion, and super-resolution. By synthesizing and critically analyzing recent advancements, this survey (MIMRS) is a pioneering effort to chart the landscape of masked image modeling in remote sensing. We highlight state-of-the-art methodologies, applications, and future research directions, providing a foundational review to guide innovation in this rapidly evolving field.
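The masking step described in the abstract (hide some patches, then train a model to predict them from the visible context) can be illustrated with a minimal NumPy sketch. The function name `random_patch_mask`, the zero fill value, and the uniform random sampling are assumptions for illustration only; the survey covers many masking strategies and this is not any specific one of them.

```python
import numpy as np

def random_patch_mask(image, patch_size, mask_ratio, rng):
    """Zero out a random subset of non-overlapping square patches.

    Returns the masked image and a (rows, cols) boolean grid that is
    True where a patch was hidden. Hypothetical helper, not from the survey.
    """
    height, width = image.shape[:2]
    rows, cols = height // patch_size, width // patch_size
    num_patches = rows * cols
    num_masked = int(round(mask_ratio * num_patches))

    # Pick which patches to hide, without replacement.
    masked_ids = rng.choice(num_patches, size=num_masked, replace=False)
    patch_mask = np.zeros(num_patches, dtype=bool)
    patch_mask[masked_ids] = True
    patch_mask = patch_mask.reshape(rows, cols)

    # Expand the patch-level mask to pixel resolution.
    pixel_mask = np.repeat(np.repeat(patch_mask, patch_size, axis=0),
                           patch_size, axis=1)

    masked = image.copy()
    # Basic slicing returns a view, so this assignment edits `masked`.
    masked[:rows * patch_size, :cols * patch_size][pixel_mask] = 0
    return masked, patch_mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tile = np.ones((8, 8))
    masked_tile, grid = random_patch_mask(tile, patch_size=4, mask_ratio=0.5, rng=rng)
    print(grid)
```

A reconstruction loss would then typically be computed only on the hidden patches, which is why the patch-level mask is returned alongside the masked image.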

Single-Pass Document Scanning for Question Answering

Weili Cao,Jianyou Wang,Youze Zheng,Longtian Bao,Qirui Zheng,Taylor Berg-Kirkpatrick,Ramamohan Paturi,Leon Bergen

Task: Propose a single-pass document scanning method for question answering over very large documents.

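Motivation: When handling extremely large documents, chunk-based embedding methods can lose global context, while full-context transformers are prohibitively expensive.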

Details

Method: Uses a single-pass document scanning approach that processes the full text in linear time, preserving global coherence while selecting the sentences most relevant to the query. Result: On 41 QA benchmarks, the method outperforms chunk-based embedding methods and competes with large language models at a lower computational cost. Conclusion: Single-pass document scanning offers a simple and effective solution for question answering over massive text. Abstract: Handling extremely large documents for question answering is challenging: chunk-based embedding methods often lose track of important global context, while full-context transformers can be prohibitively expensive for hundreds of thousands of tokens. We propose a single-pass document scanning approach that processes the entire text in linear time, preserving global coherence while deciding which sentences are most relevant to the query. On 41 QA benchmarks, our single-pass scanner consistently outperforms chunk-based embedding methods and competes with large language models at a fraction of the computational cost. By conditioning on the entire preceding context without chunk breaks, the method preserves global coherence, which is especially important for long documents. Overall, single-pass document scanning offers a simple solution for question answering over massive text. All code, datasets, and model checkpoints are available at https://github.com/MambaRetriever/MambaRetriever

Three Forensic Cues for JPEG AI Images

Sandra Bergmann,Fabian Brand,Christian Riess

Task: Propose three cues for the forensic analysis of JPEG AI images.

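Motivation: Forensic tools for JPEG AI images need to be redesigned, because forensic tools for traditional JPEG do not transfer directly to JPEG AI and JPEG AI artifacts are easily confused with those of DeepFakes.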

Details

Method: Proposes three cues for forensic algorithms: color channel correlations, distortion differences under repeated compression, and latent-space quantization. Result: The methods can detect the characteristics of JPEG AI images and distinguish real images from synthetic ones. Conclusion: The methods provide a first set of tools for JPEG AI forensics and are intended to inspire further research. Abstract: The JPEG standard was vastly successful. Currently, the first AI-based compression method, "JPEG AI", is being standardized. JPEG AI brings remarkable benefits. JPEG AI images exhibit impressive image quality at bitrates that are an order of magnitude lower than images compressed with traditional JPEG. However, forensic analysis of JPEG AI has to be completely re-thought: forensic tools for traditional JPEG do not transfer to JPEG AI, and artifacts from JPEG AI are easily confused with artifacts from artificially generated images ("DeepFakes"). This creates a need for novel forensic approaches to detection and distinction of JPEG AI images. In this work, we make a first step towards a forensic JPEG AI toolset. We propose three cues for forensic algorithms for JPEG AI. These algorithms address three forensic questions: first, we show that the JPEG AI preprocessing introduces correlations in the color channels that do not occur in uncompressed images. Second, we show that repeated compression of JPEG AI images leads to diminishing distortion differences. This can be used to detect recompression, in a spirit similar to some classic JPEG forensics methods. Third, we show that the quantization of JPEG AI images in the latent space can be used to distinguish real images with JPEG AI compression from synthetically generated images. The proposed methods are interpretable for a forensic analyst, and we hope that they inspire further research in the forensics of AI-compressed images.

Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)

Jing Bi,Susan Liang,Xiaofei Zhou,Pinxin Liu,Junjia Guo,Yunlong Tang,Luchuan Song,Chao Huang,Guangyu Sun,Jinxi He,Jiarui Wu,Shu Yang,Daoan Zhang,Chen Chen,Lianggong Bruce Wen,Zhang Liu,Jiebo Luo,Chenliang Xu

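Task: Provide an overview of reasoning techniques in textual and multimodal large language models and identify the core challenges and opportunities.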

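Motivation: In multimodal reasoning, models must integrate visual and textual inputs and handle conflicting information across modalities, which calls for advanced interpretative strategies and evaluation methods.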

Details

Method: Through a thorough and up-to-date comparison, highlights practical methods for post-training optimization and test-time inference. Result: Provides insights and guidance for future research, bridging theoretical frameworks and practical implementations. Conclusion: The paper sets clear research directions for multimodal reasoning and offers practical methods. Abstract: Reasoning is central to human intelligence, enabling structured problem-solving across diverse tasks. Recent advances in large language models (LLMs) have greatly enhanced their reasoning abilities in arithmetic, commonsense, and symbolic domains. However, effectively extending these capabilities into multimodal contexts, where models must integrate both visual and textual inputs, continues to be a significant challenge. Multimodal reasoning introduces complexities, such as handling conflicting information across modalities, which require models to adopt advanced interpretative strategies. Addressing these challenges involves not only sophisticated algorithms but also robust methodologies for evaluating reasoning accuracy and coherence. This paper offers a concise yet insightful overview of reasoning techniques in both textual and multimodal LLMs. Through a thorough and up-to-date comparison, we clearly formulate core reasoning challenges and opportunities, highlighting practical methods for post-training optimization and test-time inference. Our work provides valuable insights and guidance, bridging theoretical frameworks and practical implementations, and sets clear directions for future research.

Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation

Xin Zhang,Robby T. Tan

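Task: Propose MFuser, a Mamba-based fusion framework that effectively combines the strengths of vision foundation models (VFMs) and vision-language models (VLMs) to address the challenges of domain-generalized semantic segmentation (DGSS).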

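Motivation: Existing DGSS methods typically rely on either VFMs or VLMs alone and overlook their complementarity. VFMs excel at fine-grained features, while VLMs provide text alignment but at a coarser granularity.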

Details

Method: MFuser comprises MVFuser (which jointly fine-tunes the VFM and the VLM) and MTEnhancer (a hybrid attention-Mamba module), achieving efficient fusion with linear scalability in sequence length. Result: On synthetic-to-real and real-to-real benchmarks, MFuser reaches 68.20 and 71.87 mIoU respectively, significantly outperforming existing methods. Conclusion: MFuser successfully combines the strengths of VFMs and VLMs to achieve efficient, high-performing domain-generalized semantic segmentation. Abstract: Vision Foundation Models (VFMs) and Vision-Language Models (VLMs) have gained traction in Domain Generalized Semantic Segmentation (DGSS) due to their strong generalization capabilities. However, existing DGSS methods often rely exclusively on either VFMs or VLMs, overlooking their complementary strengths. VFMs (e.g., DINOv2) excel at capturing fine-grained features, while VLMs (e.g., CLIP) provide robust text alignment but struggle with coarse granularity. Despite their complementary strengths, effectively integrating VFMs and VLMs with attention mechanisms is challenging, as the increased patch tokens complicate long-sequence modeling. To address this, we propose MFuser, a novel Mamba-based fusion framework that efficiently combines the strengths of VFMs and VLMs while maintaining linear scalability in sequence length. MFuser consists of two key components: MVFuser, which acts as a co-adapter to jointly fine-tune the two models by capturing both sequential and spatial dynamics; and MTEnhancer, a hybrid attention-Mamba module that refines text embeddings by incorporating image priors. Our approach achieves precise feature locality and strong text alignment without incurring significant computational overhead. Extensive experiments demonstrate that MFuser significantly outperforms state-of-the-art DGSS methods, achieving 68.20 mIoU on synthetic-to-real and 71.87 mIoU on real-to-real benchmarks. The code is available at https://github.com/devinxzhang/MFuser.
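For readers unfamiliar with the reported metric, the sketch below shows one common way to compute mean intersection-over-union (mIoU) from a confusion matrix. The function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions; the abstract does not state the exact evaluation protocol. Benchmarks usually report the value multiplied by 100.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()

    # confusion[t, p] counts pixels with true class t predicted as class p.
    confusion = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection

    valid = union > 0  # ignore classes that never occur
    return float((intersection[valid] / union[valid]).mean())

if __name__ == "__main__":
    predicted = [0, 0, 1, 1]
    ground_truth = [0, 1, 1, 1]
    print(mean_iou(predicted, ground_truth, num_classes=2))
```

In the toy call, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.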

Beyond the Next Token: Towards Prompt-Robust Zero-Shot Classification via Efficient Multi-Token Prediction

Junlang Qian,Zixiao Zhu,Hanzhang Zhou,Zijian Feng,Zepeng Zhai,Kezhi Mao

Task: Propose a new method, Placeholding Parallel Prediction (P3), to address the brittleness of prompt engineering in zero-shot text classification.

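Motivation: The prompt brittleness of large language models (small changes in the prompt cause large differences in performance) undermines the reliability of zero-shot text classification.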

Details

Method: By predicting token probabilities across multiple positions and simulating comprehensive sampling of generation paths, P3 improves model robustness in a single run. Result: Experiments show that P3 improves accuracy and reduces the standard deviation across prompts by up to 98%, while maintaining comparable performance even without a prompt. Conclusion: P3 significantly improves the robustness of zero-shot text classification and reduces reliance on prompt engineering. Abstract: Zero-shot text classification typically relies on prompt engineering, but the inherent prompt brittleness of large language models undermines its reliability. Minor changes in the prompt can cause significant discrepancies in model performance. We attribute this prompt brittleness largely to the narrow focus on next-token probabilities in existing methods. To address this, we propose Placeholding Parallel Prediction (P3), a novel approach that predicts token probabilities across multiple positions and simulates comprehensive sampling of generation paths in a single run of a language model. Experiments show improved accuracy and up to 98% reduction in the standard deviation across prompts, boosting robustness. Even without a prompt, P3 maintains comparable performance, reducing the need for prompt engineering.

Endo3R: Unified Online Reconstruction from Dynamic Monocular Endoscopic Video

Jiaxin Guo,Wenzhen Dong,Tianyu Huang,Hao Ding,Ziyi Wang,Haomin Kuang,Qi Dou,Yun-Hui Liu

Task: Achieve scale-consistent 3D scene reconstruction from monocular surgical video.

Motivation: Enhance surgeons' perception and address the challenges posed by dynamic deformations and textureless surfaces in endoscopic video.

Details

Method: Proposes Endo3R, which achieves online reconstruction without priors or offline optimization through an uncertainty-aware dual memory mechanism and a self-supervised mechanism. Result: Performs strongly on the SCARED and Hamlyn datasets, supporting zero-shot depth prediction and camera pose estimation. Conclusion: Endo3R offers an efficient, prior-free solution for online 3D reconstruction from surgical video. Abstract: Reconstructing 3D scenes from monocular surgical videos can enhance surgeons' perception and therefore plays a vital role in various computer-assisted surgery tasks. However, achieving scale-consistent reconstruction remains an open challenge due to inherent issues in endoscopic videos, such as dynamic deformations and textureless surfaces. Despite recent advances, current methods either rely on calibration or instrument priors to estimate scale, or employ SfM-like multi-stage pipelines, leading to error accumulation and requiring offline optimization. In this paper, we present Endo3R, a unified 3D foundation model for online scale-consistent reconstruction from monocular surgical video, without any priors or extra optimization. Our model unifies the tasks by predicting globally aligned pointmaps, scale-consistent video depths, and camera parameters without any offline optimization. The core contribution of our method is expanding the capability of the recent pairwise reconstruction model to long-term incremental dynamic reconstruction by an uncertainty-aware dual memory mechanism. The mechanism maintains history tokens of both short-term dynamics and long-term spatial consistency. Notably, to tackle the highly dynamic nature of surgical scenes, we measure the uncertainty of tokens via Sampson distance and filter out tokens with high uncertainty. Regarding the scarcity of endoscopic datasets with ground-truth depth and camera poses, we further devise a self-supervised mechanism with a novel dynamics-aware flow loss. Abundant experiments on SCARED and Hamlyn datasets demonstrate our superior performance in zero-shot surgical video depth prediction and camera pose estimation with online efficiency. Project page: https://wrld.github.io/Endo3R/.

Efficient Dynamic Clustering-Based Document Compression for Retrieval-Augmented-Generation

Weitao Li,Kaiming Liu,Xiangyu Zhang,Xuanyu Lei,Weizhi Ma,Yang Liu

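Task: Propose a dynamic clustering-based document compression framework (EDC²-RAG) to reduce noise, repetition, and redundancy in retrieval-augmented generation (RAG).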

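Motivation: Current RAG methods have limited ability to exploit fine-grained inter-document relationships, which leaves noise, repetition, and redundancy in the retrieved content.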

Details

Method: Proposes the EDC²-RAG framework, which exploits latent inter-document relationships and removes irrelevant and redundant content. Result: Validated on knowledge-QA and hallucination-detection datasets, showing performance gains and strong robustness. Conclusion: The EDC²-RAG framework effectively improves the performance and applicability of RAG. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for knowledge integration during large language model (LLM) inference in recent years. However, current RAG implementations face challenges in effectively addressing noise, repetition and redundancy in retrieved content, primarily due to their limited ability to exploit fine-grained inter-document relationships. To address these limitations, we propose an Efficient Dynamic Clustering-based document Compression framework (EDC²-RAG) that effectively utilizes latent inter-document relationships while simultaneously removing irrelevant information and redundant content. We validate our approach, built upon GPT-3.5, on widely used knowledge-QA and hallucination-detection datasets. The results show that this method achieves consistent performance improvements across various scenarios and experimental settings, demonstrating strong robustness and applicability. Our code and datasets can be found at https://github.com/Tsinghua-dhy/EDC-2-RAG.

From ChatGPT to DeepSeek AI: A Comprehensive Analysis of Evolution, Deviation, and Future Implications in AI-Language Models

Simrandeep Singh,Shreya Bansal,Abdulmotaleb El Saddik,Mukesh Saini

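Task: Analyze the evolution from ChatGPT to DeepSeek AI, comparing their technical differences, practical applications, and implications for AI development.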

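Motivation: Examine the progress of AI models and their potential impact on natural language processing, and provide directions for future research.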

Details

Method: Conducts a case study that evaluates the models on a predefined set of multiple-choice questions across several domains. Result: Identifies the strengths and limitations of ChatGPT and DeepSeek AI and offers insight into the future development of AI. Conclusion: DeepSeek AI introduces significant improvements in architecture, performance, and ethical considerations, pointing the way for future AI language models. Abstract: The rapid advancement of artificial intelligence (AI) has reshaped the field of natural language processing (NLP), driven by models such as OpenAI's ChatGPT and DeepSeek AI. Although ChatGPT established a strong foundation for conversational AI, DeepSeek AI introduces significant improvements in architecture, performance, and ethical considerations. This paper presents a detailed analysis of the evolution from ChatGPT to DeepSeek AI, highlighting their technical differences, practical applications, and broader implications for AI development. To assess their capabilities, we conducted a case study using a predefined set of multiple-choice questions in various domains, evaluating the strengths and limitations of each model. By examining these aspects, we provide valuable insight into the future trajectory of AI, its potential to transform industries, and key research directions for improving AI-driven language models.

Multi-lingual Multi-turn Automated Red Teaming for LLMs

Abhishek Singhania,Christophe Dupuy,Shivam Mangale,Amani Namboori

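Task: Propose a multi-lingual, multi-turn automated red-teaming method (MM-ART) that fully automates the identification of prompts leading to unsafe responses.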

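Motivation: Traditional human red-teaming is costly, time-consuming, and rarely covers recent LLM features (such as multi-lingual and multi-modal capabilities), while existing automated methods cover only a small subset of LLM capabilities (such as English or single-turn conversations).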

Details

Method: Uses the Multi-lingual Multi-turn Automated Red Teaming (MM-ART) method to automate conversational, multi-lingual red-teaming operations. Result: Experiments show that LLMs are on average 71% more vulnerable after a 5-turn English conversation than after the initial turn; in non-English languages, models display up to 195% more safety vulnerabilities than under the standard single-turn English approach. Conclusion: Automated red-teaming methods that match LLM capabilities are needed to address the safety challenges of multi-lingual and multi-turn conversations. Abstract: Large Language Models (LLMs) have improved dramatically in the past few years, increasing their adoption and the scope of their capabilities over time. A significant amount of work is dedicated to "model alignment", i.e., preventing LLMs from generating unsafe responses when deployed into customer-facing applications. One popular method to evaluate safety risks is red-teaming, where agents attempt to bypass alignment by crafting elaborate prompts that trigger unsafe responses from a model. Standard human-driven red-teaming is costly, time-consuming and rarely covers all the recent features (e.g., multi-lingual, multi-modal aspects), while proposed automation methods only cover a small subset of LLM capabilities (i.e., English or single-turn). We present Multi-lingual Multi-turn Automated Red Teaming (MM-ART), a method to fully automate conversational, multi-lingual red-teaming operations and quickly identify prompts leading to unsafe responses. Through extensive experiments on different languages, we show the studied LLMs are on average 71% more vulnerable after a 5-turn conversation in English than after the initial turn. For conversations in non-English languages, models display up to 195% more safety vulnerabilities than the standard single-turn English approach, confirming the need for automated red-teaming methods matching LLM capabilities.

Electromyography-Based Gesture Recognition: Hierarchical Feature Extraction for Enhanced Spatial-Temporal Dynamics

Jungpil Shin,Abu Saleh Musa Miah,Sota Konnai,Shu Hoshitaka,Pankoo Kim

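Task: Propose a lightweight deep learning model based on multi-stream spatial-temporal, time-varying feature extraction to improve the accuracy and efficiency of sEMG-based hand gesture recognition.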

Motivation: Address unstable predictions and inefficient time-varying feature extraction in sEMG-based hand gesture recognition.

Details

Method: Uses a multi-branch model in which a Bi-TCN, a 1D CNN with an SE block, and a TCN with a BiLSTM each extract spatial-temporal features, which are then fused and refined by a channel attention module. Result: Achieves accuracies of 96.41%, 92.40%, and 93.34% on the Ninapro DB2, DB4, and DB5 datasets, respectively. Conclusion: The model handles complex sEMG dynamics effectively and represents an advance for prosthetic control and human-machine interface technologies. Abstract: Hand gesture recognition using multichannel surface electromyography (sEMG) is challenging due to unstable predictions and inefficient time-varying feature enhancement. To address the lack of effective signal-based time-varying features, we propose a lightweight, squeeze-excitation, deep learning-based multi-stream approach for extracting spatial-temporal, time-varying features to build an effective sEMG-based hand gesture recognition system. Each branch of the proposed model was designed to extract hierarchical features, capturing both global and detailed spatial-temporal relationships to ensure feature effectiveness. The first branch, utilizing a Bidirectional-TCN (Bi-TCN), focuses on capturing long-term temporal dependencies by modelling past and future temporal contexts, providing a holistic view of gesture dynamics. The second branch, incorporating a 1D Convolutional layer, separable CNN, and Squeeze-and-Excitation (SE) block, efficiently extracts spatial-temporal features while emphasizing critical feature channels, enhancing feature relevance. The third branch, combining a Temporal Convolutional Network (TCN) and Bidirectional LSTM (BiLSTM), captures bidirectional temporal relationships and time-varying patterns. Outputs from all branches are fused using concatenation to capture subtle variations in the data and then refined with a channel attention module, selectively focusing on the most informative features while improving computational efficiency. The proposed model was tested on the Ninapro DB2, DB4, and DB5 datasets, achieving accuracy rates of 96.41%, 92.40%, and 93.34%, respectively. These results demonstrate the capability of the system to handle complex sEMG dynamics, offering advancements in prosthetic limb control and human-machine interface technologies with significant implications for assistive technologies.
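The Squeeze-and-Excitation (SE) block named in the second branch reweights feature channels with learned gates. Below is a minimal NumPy sketch of the standard SE computation for a 1D feature map; the function name, the omission of bias terms, and the weight shapes are assumptions for illustration, and the paper's actual layer sizes are not given in the abstract.

```python
import numpy as np

def squeeze_excite(features, w_reduce, w_expand):
    """Apply a bias-free SE block to a (channels, time) feature map.

    w_reduce: (channels // r, channels) bottleneck weights (hypothetical).
    w_expand: (channels, channels // r) expansion weights (hypothetical).
    """
    # Squeeze: global average pooling over the time axis, one value per channel.
    squeezed = features.mean(axis=1)

    # Excitation: bottleneck with ReLU, then sigmoid gates in (0, 1).
    hidden = np.maximum(w_reduce @ squeezed, 0.0)
    gates = 1.0 / (1.0 + np.exp(-(w_expand @ hidden)))

    # Scale: multiply every time step of each channel by its gate.
    return features * gates[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 100))          # 8 channels, 100 time steps
    w1 = rng.normal(size=(2, 8)) * 0.1     # reduction ratio r = 4
    w2 = rng.normal(size=(8, 2)) * 0.1
    print(squeeze_excite(x, w1, w2).shape)
```

Because the gates lie strictly between 0 and 1, the block can only attenuate channels relative to one another, which is how it emphasizes the more informative ones.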

Learning Natural Language Constraints for Safe Reinforcement Learning of Language Agents

Jaymari Chua,Chen Wang,Lina Yao

Task: Propose a new framework that achieves safe alignment of large language models by learning natural language constraints.

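Motivation: Current alignment methods (such as RLHF) cannot guarantee constraint satisfaction outside the training distribution, so a more generalizable approach to safe alignment is needed.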

Details

Method: Learns natural language constraints from positive and negative demonstrations, formalized within a Constrained Markov Decision Process (CMDP). Result: Validates the framework in a text-based navigation environment, demonstrating safe adaptation under domain shifts and adversarial inputs. Conclusion: The framework offers a viable path toward safety-critical and more generalizable large language models. Abstract: Generalizable alignment is a core challenge for deploying Large Language Models (LLMs) safely in real-world NLP applications. Current alignment methods, including Reinforcement Learning from Human Feedback (RLHF), often fail to guarantee constraint satisfaction outside their training distribution due to their reliance on implicit, post-hoc preferences. Inspired by a paradigm shift to first curate data before tuning, we introduce a new framework for safe language alignment that learns natural language constraints from positive and negative demonstrations as a primary step. By inferring both a task-specific reward function and latent constraint functions, our approach fosters adaptation to novel safety requirements and robust generalization under domain shifts and adversarial inputs. We formalize the framework within a Constrained Markov Decision Process (CMDP) and validate it via a text-based navigation environment, demonstrating safe adaptation to changing danger zones. Our experiments show fewer violations upon domain shift when following a safe navigation path, and we achieve zero violations by applying learned constraints to a distilled BERT model as a fine-tuning technique. This work offers a promising path toward building safety-critical and more generalizable LLMs for practical NLP settings.
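For reference, the textbook form of a Constrained Markov Decision Process objective is written out below. The symbols (reward $r$, cost functions $c_i$, thresholds $d_i$, discount $\gamma$, multipliers $\lambda_i$) follow the standard CMDP literature and are assumptions here; the abstract does not give the paper's own notation or how the learned constraints enter the costs.

```latex
\max_{\pi}\; J_r(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
J_{c_i}(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t)\Big] \le d_i,
\qquad i = 1, \dots, m

% A common relaxation optimizes the Lagrangian with multipliers \lambda_i \ge 0:
\min_{\lambda \ge 0}\; \max_{\pi}\;
\mathcal{L}(\pi, \lambda) = J_r(\pi) - \sum_{i=1}^{m} \lambda_i \big( J_{c_i}(\pi) - d_i \big)
```

Under this reading, a "violation" in the navigation experiments would correspond to a trajectory whose accumulated cost exceeds its threshold, for example by entering a danger zone.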

Unlocking Neural Transparency: Jacobian Maps for Explainable AI in Alzheimer's Detection

Yasmine Mustafa,Mohamed Elmahallawy,Tie Luo

Task: Propose a multi-modal framework based on Jacobian Maps (JMs) to improve the explainability and trustworthiness of Alzheimer's disease (AD) detection.

Motivation: Deep learning models achieve high accuracy in AD diagnosis, but their lack of interpretability limits clinical trust and adoption.

Details

Method: Uses Jacobian Maps to capture localized brain volume changes, relates model predictions to known neuroanatomical biomarkers, and validates the approach with a 3D CNN and 3D Grad-CAM analysis. Result: Experiments show that a 3D CNN trained on JMs is more accurate than one trained on traditionally preprocessed data, while also improving interpretability and diagnostic reliability. Conclusion: The JM approach improves accuracy in AD detection and strengthens the model's interpretability and clinical trustworthiness. Abstract: Alzheimer's disease (AD) leads to progressive cognitive decline, making early detection crucial for effective intervention. While deep learning models have shown high accuracy in AD diagnosis, their lack of interpretability limits clinical trust and adoption. This paper introduces a novel pre-model approach leveraging Jacobian Maps (JMs) within a multi-modal framework to enhance explainability and trustworthiness in AD detection. By capturing localized brain volume changes, JMs establish meaningful correlations between model predictions and well-known neuroanatomical biomarkers of AD. We validate JMs through experiments comparing a 3D CNN trained on JMs versus on traditional preprocessed data, which demonstrates superior accuracy. We also employ 3D Grad-CAM analysis to provide both visual and quantitative insights, further showcasing improved interpretability and diagnostic reliability.
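In tensor-based morphometry, a Jacobian map is commonly the voxel-wise determinant of the Jacobian of the deformation that registers a scan to a template: values above 1 indicate local expansion and values below 1 indicate local shrinkage. The sketch below computes such a map from a displacement field with finite differences. It assumes this common definition; the abstract does not describe the paper's registration pipeline, and the function name and array layout are hypothetical.

```python
import numpy as np

def jacobian_determinant_map(displacement, spacing=1.0):
    """Voxel-wise det(I + grad u) for a displacement field u.

    displacement: array of shape (3, D, H, W), where displacement[i] is the
    i-th component of u at each voxel. Returns an array of shape (D, H, W).
    """
    # grads[i, j] approximates d u_i / d x_j by finite differences.
    grads = np.stack(
        [np.stack(np.gradient(displacement[i], spacing), axis=0) for i in range(3)],
        axis=0,
    )  # shape (3, 3, D, H, W)

    # Jacobian of the full mapping x -> x + u(x).
    jacobian = grads + np.eye(3)[:, :, None, None, None]

    # Move the 3x3 matrix axes last so np.linalg.det batches over voxels.
    jacobian = np.moveaxis(jacobian, (0, 1), (-2, -1))
    return np.linalg.det(jacobian)

if __name__ == "__main__":
    grid = np.stack(np.meshgrid(*[np.arange(5)] * 3, indexing="ij")).astype(float)
    uniform_growth = 0.1 * grid  # every axis stretched by 10 percent
    print(jacobian_determinant_map(uniform_growth)[0, 0, 0])
```

For the uniform 10 percent stretch in the usage example, every voxel's determinant equals 1.1 cubed, that is, a 33.1 percent local volume increase.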

Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation

Jaewoo Park,Jungyang Park,Dongju Jang,Jiwan Chung,Byungwoo Yoo,Jaewoo Shin,Seonjoon Park,Taehyeong Kim,Youngjae Yu

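Task: Propose and evaluate a new task, visual solution explanation, which requires models not only to solve problems but also to generate explanations that include visual elements.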

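Motivation: Explanations generated by current large language models lack visual support, whereas human teaching routinely uses visual aids (such as diagrams and markings) to improve understanding.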

Details

Method: Introduces MathExplain, a multimodal benchmark of 997 math problems with visual keypoints and explanatory text. Result: Closed-source models perform relatively well on visual solution explanation, while open-source general-purpose models are inconsistent at identifying visual components and producing coherent explanations. Conclusion: The visual solution explanation task and the MathExplain dataset will advance research on and application of multimodal large language models in education. Abstract: With the rapid advancement of mathematical reasoning capabilities in large language models (LLMs), AI systems are increasingly being adopted in educational settings to support students' comprehension of problem-solving processes. However, a critical component remains underexplored in current LLM-generated explanations: visual explanation. In real-world instructional contexts, human tutors routinely employ visual aids (such as diagrams, markings, and highlights) to enhance conceptual clarity. To bridge this gap, we introduce a novel task of visual solution explanation, which requires not only solving problems but also generating explanations that incorporate newly introduced visual elements essential for understanding (e.g., auxiliary lines, annotations, or geometric constructions). To evaluate model performance on this task, we propose MathExplain, a multimodal benchmark consisting of 997 math problems annotated with visual keypoints and corresponding explanatory text that references those elements. Our empirical results show that while some closed-source models demonstrate promising capabilities on visual solution-explaining, current open-source general-purpose models perform inconsistently, particularly in identifying relevant visual components and producing coherent keypoint-based explanations. We expect that visual solution-explaining and the MathExplain dataset will catalyze further research on multimodal LLMs in education and advance their deployment as effective, explanation-oriented AI tutors. Code and data will be released publicly.

Crash Time Matters: HybridMamba for Fine-Grained Temporal Localization in Traffic Surveillance Footage

Ibne Farabi Shihab,Anuj Sharma

Task: Accurately localize the time of traffic crash events in long-form surveillance video.

Motivation: Traffic crash events are brief and rare, yet detecting them accurately is critical for emergency response and infrastructure planning.

Details

Method: Proposes the HybridMamba architecture, which combines visual transformers with state-space temporal modeling and uses multi-level token compression and hierarchical temporal processing to preserve computational efficiency and temporal resolution. Result: On a large-scale dataset from the Iowa Department of Transportation, HybridMamba achieves a mean absolute error of 1.50 seconds, with 65.2% of predictions within one second of the ground truth, outperforming existing methods. Conclusion: HybridMamba offers an efficient and robust solution for fine-grained temporal localization in traffic surveillance. Abstract: Traffic crash detection in long-form surveillance videos is critical for emergency response and infrastructure planning but remains difficult due to the brief and rare nature of crash events. We introduce HybridMamba, a novel architecture that combines visual transformers with state-space temporal modeling to achieve accurate crash time localization. Our method uses multi-level token compression and hierarchical temporal processing to remain computationally efficient without sacrificing temporal resolution. Evaluated on a large-scale dataset from the Iowa Department of Transportation, HybridMamba achieves a mean absolute error of 1.50 seconds, with 65.2 percent of predictions within one second of the ground truth. It outperforms recent video-language models such as TimeChat and VideoLLaMA2 by up to 2.8 seconds, while using significantly fewer parameters. Our results demonstrate strong generalization across videos ranging from 2 to 40 minutes in diverse conditions. HybridMamba offers a robust and efficient solution for fine-grained temporal localization in traffic surveillance. The code will be released upon publication.
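The two headline numbers (mean absolute error in seconds and the share of predictions within one second of the ground truth) are straightforward to compute once predicted and annotated crash times are available. A minimal sketch follows; the function name and the inclusive tolerance are assumptions, and the timestamps in the usage example are made-up illustrative values, not data from the paper.

```python
import numpy as np

def localization_metrics(predicted_s, actual_s, tolerance_s=1.0):
    """Return (mean absolute error in seconds, fraction within tolerance)."""
    errors = np.abs(np.asarray(predicted_s, dtype=float)
                    - np.asarray(actual_s, dtype=float))
    mae = float(errors.mean())
    within = float((errors <= tolerance_s).mean())
    return mae, within

if __name__ == "__main__":
    # Illustrative crash times in seconds from the start of each clip.
    predicted = [10.0, 20.5, 33.0]
    annotated = [10.5, 20.0, 30.0]
    mae, within_1s = localization_metrics(predicted, annotated)
    print(f"MAE = {mae:.2f} s, within 1 s = {within_1s:.1%}")
```

The two metrics complement each other: a few large misses inflate the mean absolute error while leaving the within-tolerance fraction almost unchanged.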

Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward

Yanming Wan,Jiaxing Wu,Marwa Abdulhai,Lior Shani,Natasha Jaques

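Task: Propose a reward mechanism incorporating intrinsic motivation to improve a conversational agent's user modeling and thereby enable more personalized interactions.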

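Motivation: Current reinforcement learning from human feedback falls short in fostering empathetic, adaptive, and personalized interactions, and traditional personalization methods rely on extensive user history, which limits their effectiveness for new or context-limited users.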

Details

Method: Adds an intrinsic motivation reward to multi-turn reinforcement learning from human feedback, encouraging the agent to actively elicit information about user traits in order to improve its user model. Result: In education and fitness settings, the method outperforms the baseline at revealing user preferences and adapting to them. Conclusion: With the intrinsic motivation reward, the conversational agent models user traits more effectively and delivers a more personalized interaction. Abstract: Effective conversational agents must be able to personalize their behavior to suit a user's preferences, personality, and attributes, whether they are assisting with writing tasks or operating in domains like education or healthcare. Current training methods like Reinforcement Learning from Human Feedback (RLHF) prioritize helpfulness and safety but fall short in fostering truly empathetic, adaptive, and personalized interactions. Traditional approaches to personalization often rely on extensive user history, limiting their effectiveness for new or context-limited users. To overcome these limitations, we propose to incorporate an intrinsic motivation to improve the conversational agent's model of the user as an additional reward alongside multi-turn RLHF. This reward mechanism encourages the agent to actively elicit user traits by optimizing conversations to increase the accuracy of its user model. Consequently, the policy agent can deliver more personalized interactions through obtaining more information about the user. We applied our method in both education and fitness settings, where LLMs teach concepts or recommend personalized strategies based on users' hidden learning style or lifestyle attributes. Using LLM-simulated users, our approach outperformed a multi-turn RLHF baseline in revealing information about the users' preferences, and adapting to them.
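One way to read "optimizing conversations to increase the accuracy of its user model" is as a per-turn reward equal to the gain in probability that the user model assigns to the user's true hidden attribute. The sketch below implements that reading only; the function name, the belief representation, and the learning-style labels are hypothetical, and the paper's actual reward definition is not given in the abstract.

```python
def curiosity_rewards(beliefs, true_attribute):
    """Per-turn reward: gain in probability assigned to the true attribute.

    beliefs: list of dicts mapping attribute -> probability. Index 0 is the
    prior before the conversation; each later entry follows one turn.
    """
    probs = [belief.get(true_attribute, 0.0) for belief in beliefs]
    # Reward for a turn is the belief after it minus the belief before it.
    return [after - before for before, after in zip(probs, probs[1:])]

if __name__ == "__main__":
    # Hypothetical user model over two learning styles across two turns.
    belief_trace = [
        {"visual": 0.5, "auditory": 0.5},   # prior
        {"visual": 0.7, "auditory": 0.3},   # after a clarifying question
        {"visual": 0.9, "auditory": 0.1},   # after a follow-up
    ]
    print(curiosity_rewards(belief_trace, "visual"))
```

A convenient property of this difference form is that the rewards telescope: their sum equals the total improvement of the user model over the conversation, so the agent cannot gain by raising and then lowering its own belief.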

Rotation Invariance in Floor Plan Digitization using Zernike Moments

Marius Graumann,Jan Marius Stürmer,Tobias Koch

Task: Propose an end-to-end pipeline that converts old floor plans into a machine-readable form.

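Motivation: Old floor plans mostly exist as printed or scanned raster images; slight rotations or shifts can occur during scanning, which makes them hard to use directly for machine processing.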

Details

Method: Pre-processes the image, uses a novel approach to create a region adjacency graph (RAG) and predict its nodes, and adds normalization steps to the RAG feature extraction to improve rotation invariance. Result: Significantly improves the F1 score and IoU on rotated data and proposes an algorithm for splitting walls. Conclusion: The method effectively improves the machine readability and processing of old floor plans. Abstract: Nowadays, a lot of old floor plans exist in printed form or are stored as scanned raster images. Slight rotations or shifts may occur during scanning. Bringing floor plans of this form into a machine-readable form to enable further use still poses a problem. Therefore, we propose an end-to-end pipeline that pre-processes the image and leverages a novel approach to create a region adjacency graph (RAG) from the pre-processed image and predict its nodes. By incorporating normalization steps into the RAG feature extraction, we significantly improved the rotation invariance of the RAG feature calculation. Moreover, applying our method leads to an improved F1 score and IoU on rotated data. Furthermore, we proposed a wall splitting algorithm for partitioning walls into segments associated with the corresponding rooms.
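A region adjacency graph (RAG) has one node per image region and an edge between regions that touch. The sketch below builds the edge set from a label image using 4-connectivity. It is a generic construction under assumed conventions (integer region labels, 4-neighborhood), not the paper's novel RAG creation method, and it omits the node features and normalization steps the abstract mentions.

```python
import numpy as np

def region_adjacency_graph(labels):
    """Return the set of undirected edges (a, b), a < b, between touching regions."""
    labels = np.asarray(labels)
    edges = set()

    # Compare each pixel with its right neighbor, then with its lower neighbor.
    neighbor_pairs = (
        (labels[:, :-1], labels[:, 1:]),
        (labels[:-1, :], labels[1:, :]),
    )
    for first, second in neighbor_pairs:
        differs = first != second
        for a, b in zip(first[differs].tolist(), second[differs].tolist()):
            edges.add((min(a, b), max(a, b)))
    return edges

if __name__ == "__main__":
    # Toy segmentation: three regions labeled 1, 2, 3.
    plan = np.array([
        [1, 1, 2],
        [1, 3, 2],
        [3, 3, 2],
    ])
    print(sorted(region_adjacency_graph(plan)))
    # The edge set is unchanged by an exact 90-degree rotation.
    print(region_adjacency_graph(np.rot90(plan)) == region_adjacency_graph(plan))
```

The graph topology itself is unaffected by exact 90-degree rotations, as the second print shows; the harder problem the abstract targets is keeping the per-node features stable under the slight, arbitrary rotations that scanning introduces.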

Think When You Need: Self-Adaptive Chain-of-Thought Learning

Junjie Yang,Ke Lin,Xing Yu

Task: Use a reward mechanism to optimize the length and quality of language-model reasoning, addressing "overthinking" on simple problems.

Motivation: Existing methods that directly penalize reasoning length fail to account for differences in problem complexity, which leads to inefficiency.

Details

Method: Constructs rewards from length and quality comparisons, guided by theoretical assumptions, to improve both solution correctness and conciseness. Result: Across multiple reasoning benchmarks, the method maintains accuracy while producing more concise explanations. Conclusion: The method effectively teaches models to "think when needed" and also applies to fuzzy tasks. Abstract: Chain of Thought (CoT) reasoning enhances language models' performance but often leads to inefficient "overthinking" on simple problems. We identify that existing approaches directly penalizing reasoning length fail to account for varying problem complexity. Our approach constructs rewards through length and quality comparisons, guided by theoretical assumptions that jointly enhance solution correctness with conciseness. Moreover, we further demonstrate our method on fuzzy tasks where ground truth is unavailable. Experiments across multiple reasoning benchmarks demonstrate that our method maintains accuracy while generating significantly more concise explanations, effectively teaching models to "think when needed."
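
The abstract does not spell out how the length and quality comparisons are turned into rewards. One way such a comparison-based reward could look is sketched below: responses sampled for the same prompt are compared pairwise, a correct answer beats an incorrect one, and among correct answers the shorter one wins. The win rules, the tie handling, and the function name are assumptions for illustration only.

```python
def pairwise_rewards(samples):
    """Reward each sampled response by its share of pairwise wins.

    samples: list of (is_correct, length) tuples for one prompt
    (hypothetical format). Returns one reward in [0, 1] per sample.
    """
    n = len(samples)
    if n < 2:
        return [0.0] * n
    rewards = []
    for i, (ok_i, len_i) in enumerate(samples):
        wins = 0.0
        for j, (ok_j, len_j) in enumerate(samples):
            if i == j:
                continue
            if ok_i and not ok_j:
                wins += 1.0          # quality comparison: correct beats incorrect
            elif ok_i and ok_j:
                if len_i < len_j:
                    wins += 1.0      # length comparison: shorter correct answer wins
                elif len_i == len_j:
                    wins += 0.5      # equal length counts as a tie
            elif not ok_i and not ok_j:
                wins += 0.5          # incorrect answers tie regardless of length
        rewards.append(wins / (n - 1))
    return rewards


rewards = pairwise_rewards([(True, 10), (True, 50), (False, 5)])
```

Because length is compared only among correct answers, a short wrong answer is never rewarded for brevity. On hard problems where few samples are correct, the length pressure mostly disappears, which matches the motivation of not penalizing length uniformly.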

Robot Localization Using a Learned Keypoint Detector and Descriptor with a Floor Camera and a Feature Rich Industrial Floor

Piet Brömmel,Dominik Brämer,Oliver Urbann,Diana Kleingarn

Task: Present the Keypoint Localization Framework (KOALA), a deep-neural-network approach that extracts features from images of an industrial floor for high-precision localization of mobile robots.

Motivation: Mobile-robot localization depends on good features in the environment, and conventional sensors such as Lidar are costly, so the work explores extracting unique features from images of the ground.

Details

Method: Uses deep neural networks to extract features from images of an industrial floor, without markers or additional information such as filtering, priors, or temporal information. Result: Localization succeeds in 75.7% of images with a mean position error of 2 cm and a rotation error of 2.4%, outperforming comparable approaches. Conclusion: KOALA solves the robot kidnapping problem efficiently and localizes with high precision while the robot is moving. Abstract: The localization of moving robots depends on the availability of good features from the environment. Sensor systems like Lidar are popular, but unique features can also be extracted from images of the ground. This work presents the Keypoint Localization Framework (KOALA), which utilizes deep neural networks that extract sufficient features from an industrial floor for accurate localization without having readable markers. For this purpose, we use a floor covering that can be produced as cheaply as common industrial floors. Although we do not use any filtering, prior, or temporal information, we can estimate our position in 75.7% of all images with a mean position error of 2 cm and a rotation error of 2.4%. Thus, the robot kidnapping problem can be solved with high precision in every frame, even while the robot is moving. Furthermore, we show that our framework with our detector and descriptor combination is able to outperform comparable approaches.

Stance-Driven Multimodal Controlled Statement Generation: New Dataset and Task

Bingqian Wang,Quan Fang,Jiachen Sun,Xiaoxiao Ma

Task: Study multimodal stance-driven content generation, in particular generating stance-controlled responses to political tweets that combine text and images.

Motivation: Current datasets focus mostly on pure text and lack multimodal content and effective context, especially for stance detection, which limits research on and applications of stance-controllable generation.

Details

Method: Proposes a Stance-Driven Multimodal Generation (SDMG) framework that combines weighted fusion of multimodal features with stance guidance to improve semantic consistency and stance control. Result: Creates the first multimodal stance generation dataset (StanceGen2024) and proposes the SDMG framework; experiments show its effectiveness for stance control and semantic consistency. Conclusion: The work provides a new dataset and method for multimodal stance-controllable generation and advances research and applications in this area. Abstract: Formulating statements that support diverse or controversial stances on specific topics is vital for platforms that enable user expression, reshape political discourse, and drive social critique and information dissemination. With the rise of Large Language Models (LLMs), controllable text generation towards specific stances has become a promising research area with applications in shaping public opinion and commercial marketing. However, current datasets often focus solely on pure text, lacking multimodal content and effective context, particularly in the context of stance detection. In this paper, we formally define and study the new problem of stance-driven controllable content generation for tweets with text and images, where given a multimodal post (text and image/video), a model generates a stance-controlled response. To this end, we create the Multimodal Stance Generation Dataset (StanceGen2024), the first resource explicitly designed for multimodal stance-controllable text generation in political discourse. It includes posts and user comments from the 2024 U.S. presidential election, featuring text, images, videos, and stance annotations to explore how multimodal political content shapes stance expression. Furthermore, we propose a Stance-Driven Multimodal Generation (SDMG) framework that integrates weighted fusion of multimodal features and stance guidance to improve semantic consistency and stance control. We release the dataset and code (https://anonymous.4open.science/r/StanceGen-BE9D) for public use and further research.

SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding

Yimin Wei,Aoran Xiao,Yexian Ren,Yuting Zhu,Hongruixuan Chen,Junshi Xia,Naoto Yokoya

Task: Propose and build the SARLANG-1M dataset to improve the performance of Vision-Language Models (VLMs) on SAR image understanding.

Motivation: SAR image interpretation is challenging because of its complex physical imaging mechanisms and its marked differences from human vision, and existing VLMs perform poorly because they lack SAR-specific knowledge.

Details

Method: Builds SARLANG-1M, a large-scale dataset of more than 1 million high-quality SAR image-text pairs covering multiple resolutions, fine-grained semantic descriptions, and diverse tasks. Result: Experiments show that VLMs fine-tuned on SARLANG-1M perform significantly better at SAR image interpretation, approaching the level of human experts. Conclusion: SARLANG-1M is an effective resource for multimodal understanding of SAR images and advances research and applications in the field. Abstract: Synthetic Aperture Radar (SAR) is a crucial remote sensing technology, enabling all-weather, day-and-night observation with strong surface penetration for precise and continuous environmental monitoring and analysis. However, SAR image interpretation remains challenging due to its complex physical imaging mechanisms and significant visual disparities from human perception. Recently, Vision-Language Models (VLMs) have demonstrated remarkable success in RGB image understanding, offering powerful open-vocabulary interpretation and flexible language interaction. However, their application to SAR images is severely constrained by the absence of SAR-specific knowledge in their training distributions, leading to suboptimal performance. To address this limitation, we introduce SARLANG-1M, a large-scale benchmark tailored for multimodal SAR image understanding, with a primary focus on integrating SAR with textual modality. SARLANG-1M comprises more than 1 million high-quality SAR image-text pairs collected from over 59 cities worldwide. It features hierarchical resolutions (ranging from 0.1 to 25 meters), fine-grained semantic descriptions (including both concise and detailed captions), diverse remote sensing categories (1,696 object types and 16 land cover classes), and multi-task question-answering pairs spanning seven applications and 1,012 question types. Extensive experiments on mainstream VLMs demonstrate that fine-tuning with SARLANG-1M significantly enhances their performance in SAR image interpretation, reaching performance comparable to human experts. The dataset and code will be made publicly available at https://github.com/Jimmyxichen/SARLANG-1M.

Noise Augmented Fine Tuning for Mitigating Hallucinations in Large Language Models

Afshin Khadangi,Amir Sartipi,Igor Tchappi,Ramin Bahmani

Task: Reduce hallucinations in large language models (LLMs) with the Noise-Augmented Fine-Tuning (NoiseFiT) framework.

Motivation: Large language models often generate inaccurate or misleading content (hallucinations), so a method is needed to make them more robust.

Details

Method: Proposes the NoiseFiT framework, which injects Gaussian noise dynamically based on the signal-to-noise ratio (SNR) and uses a hybrid loss combining standard cross-entropy, soft cross-entropy, and consistency regularization. Result: Experiments show that NoiseFiT significantly lowers hallucination rates while matching or exceeding baseline performance on key tasks. Conclusion: Noise-driven strategies can effectively improve the robustness and trustworthiness of language models at manageable computational cost. Abstract: Large language models (LLMs) often produce inaccurate or misleading content, known as hallucinations. To address this challenge, we introduce Noise-Augmented Fine-Tuning (NoiseFiT), a novel framework that leverages adaptive noise injection based on the signal-to-noise ratio (SNR) to enhance model robustness. In particular, NoiseFiT selectively perturbs layers identified as either high-SNR (more robust) or low-SNR (potentially under-regularized) using dynamically scaled Gaussian noise. We further propose a hybrid loss that combines standard cross-entropy, soft cross-entropy, and consistency regularization to ensure stable and accurate outputs under noisy training conditions. Our theoretical analysis shows that adaptive noise injection is both unbiased and variance-preserving, providing strong guarantees for convergence in expectation. Empirical results on multiple test and benchmark datasets demonstrate that NoiseFiT significantly reduces hallucination rates, often improving or matching baseline performance in key tasks. These findings highlight the promise of noise-driven strategies for achieving robust, trustworthy language modeling without incurring prohibitive computational overhead. Given the comprehensive and detailed nature of our experiments, we have publicly released the fine-tuning logs, benchmark evaluation artifacts, and source code online at W&B, Hugging Face, and GitHub, respectively, to foster further research, accessibility and reproducibility.
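
The core operation, adding zero-mean Gaussian noise whose size follows the scale of the activations, can be sketched in a few lines. The version below is a simplification under stated assumptions: it scales the noise by the standard deviation of one activation vector and omits the SNR-based layer selection and the hybrid loss. The function name and the `scale` parameter are hypothetical.

```python
import random
import statistics


def inject_noise(hidden, scale=0.1, rng=None):
    """Add zero-mean Gaussian noise with std = scale * std(hidden).

    hidden: list of floats standing in for one layer's activations.
    Zero-mean noise leaves the expected activation unchanged (unbiased).
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    sigma = scale * statistics.pstdev(hidden)
    return [h + rng.gauss(0.0, sigma) for h in hidden]


hidden = [0.0, 1.0] * 5000
noisy = inject_noise(hidden, scale=0.1)
mean_shift = statistics.fmean(noisy) - statistics.fmean(hidden)
```

With 10,000 activations and `sigma = 0.05`, `mean_shift` stays close to zero, which is the empirical counterpart of the unbiasedness claim in the abstract.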

TQD-Track: Temporal Query Denoising for 3D Multi-Object Tracking

Shuxiao Ding,Yutong Yang,Julian Wiederer,Markus Braun,Peizheng Li,Juergen Gall,Bin Yang

Task: Propose TQD-Track, a method that improves Multi-Object Tracking (MOT) performance through Temporal Query Denoising (TQD).

Motivation: Existing query denoising methods operate on a single frame and cannot learn temporal information, and the attention mask restricts information exchange between denoising queries and object queries.

Details

Method: Introduces TQD, which lets denoising queries carry temporal information and instance-specific feature representations, and designs an association mask to keep the interaction between track and detection queries consistent. Result: Experiments on the nuScenes dataset show that TQD-Track significantly improves a range of tracking methods, especially those with an explicit association module. Conclusion: By changing only the training process, TQD-Track effectively improves multi-object tracking, with the largest gains in paradigms that use an explicit association module. Abstract: Query denoising has become a standard training strategy for DETR-based detectors by addressing the slow convergence issue. Besides that, query denoising can be used to increase the diversity of training samples for modeling complex scenarios, which is critical for Multi-Object Tracking (MOT), showing its potential in MOT applications. Existing approaches integrate query denoising within the tracking-by-attention paradigm. However, because the denoising process happens only within a single frame, it cannot help the tracker learn temporal information. In addition, the attention mask in query denoising prevents information exchange between denoising and object queries, limiting its potential for improving association using self-attention. To address these issues, we propose TQD-Track, which introduces Temporal Query Denoising (TQD) tailored for MOT, enabling denoising queries to carry temporal information and instance-specific feature representation. We introduce diverse noise types onto denoising queries that simulate real-world challenges in MOT. We analyze our proposed TQD for different tracking paradigms, and find that paradigms with an explicit learned data association module, e.g. tracking-by-detection or alternating detection and association, benefit from TQD by a larger margin. For these paradigms, we further design an association mask in the association module to ensure that track and detection queries interact in the same way as during inference. Extensive experiments on the nuScenes dataset demonstrate that our approach consistently enhances different tracking methods by only changing the training process, especially for paradigms with an explicit association module.

Evaluating Compact LLMs for Zero-Shot Iberian Language Tasks on End-User Devices

Luís Couto Seller,Íñigo Sanz Torres,Adrián Vogel-Fernández,Carlos González Carballo,Pedro Miguel Sánchez Sánchez,Adrián Carruana Martín,Enrique de Miguel Ambite

Task: Carry out a comprehensive evaluation of compact state-of-the-art large language models for Iberian languages.

Motivation: The high computational demands of large language models limit their accessibility on consumer-grade devices, especially for the less-resourced Iberian languages.

Details

Method: Evaluates several compact state-of-the-art LLMs on a range of NLP tasks in Iberian languages. Result: Some models excel at particular tasks, but significant performance gaps remain for languages such as Basque. Conclusion: Further research is needed on balancing model compactness with multilingual performance. Abstract: Large Language Models have significantly advanced natural language processing, achieving remarkable performance in tasks such as language generation, translation, and reasoning. However, their substantial computational requirements restrict deployment to high-end systems, limiting accessibility on consumer-grade devices. This challenge is especially pronounced for under-resourced languages like those spoken in the Iberian Peninsula, where relatively limited linguistic resources and benchmarks hinder effective evaluation. This work presents a comprehensive evaluation of compact state-of-the-art LLMs across several essential NLP tasks tailored for Iberian languages. The results reveal that while some models consistently excel in certain tasks, significant performance gaps remain, particularly for languages such as Basque. These findings highlight the need for further research on balancing model compactness with robust multilingual performance.

FaR: Enhancing Multi-Concept Text-to-Image Diffusion via Concept Fusion and Localized Refinement

Gia-Nghia Tran,Quang-Huy Che,Trong-Tai Dam Vu,Bich-Nga Pham,Vinh-Tiep Nguyen,Trung-Nghia Le,Minh-Triet Tran

Task: Address overfitting and attribute leakage when generating multiple new concepts in the text-to-image task.

Motivation: Current methods tend to overfit when trained on few samples and struggle with attribute leakage between class-similar subjects (e.g., two specific dogs).

Details

Method: Proposes Fuse-and-Refine (FaR), which consists of a Concept Fusion technique and a Localized Refinement loss function. Result: FaR effectively prevents overfitting and attribute leakage while maintaining photorealism, and outperforms other state-of-the-art methods. Conclusion: Through data augmentation and attention alignment, FaR resolves key challenges in the text-to-image task. Abstract: Generating multiple new concepts remains a challenging problem in the text-to-image task. Current methods often overfit when trained on a small number of samples and struggle with attribute leakage, particularly for class-similar subjects (e.g., two specific dogs). In this paper, we introduce Fuse-and-Refine (FaR), a novel approach that tackles these challenges through two key contributions: a Concept Fusion technique and a Localized Refinement loss function. Concept Fusion systematically augments the training data by separating reference subjects from backgrounds and recombining them into composite images to increase diversity. This augmentation technique tackles the overfitting problem by mitigating the narrow distribution of the limited training samples. In addition, the Localized Refinement loss function is introduced to preserve subject representative attributes by aligning each concept's attention map to its correct region. This approach effectively prevents attribute leakage by ensuring that the diffusion model distinguishes similar subjects without mixing their attention maps during the denoising process. By fine-tuning specific modules at the same time, FaR balances the learning of new concepts with the retention of previously learned knowledge. Empirical results show that FaR not only prevents overfitting and attribute leakage while maintaining photorealism, but also outperforms other state-of-the-art methods.

BabyLM's First Words: Word Segmentation as a Phonological Probing Task

Zébulon Goriely

Task: Study how word segmentation can serve as a phonological probing task, analyzing phoneme-based language models across 31 languages.

Motivation: Phonological analysis with current large language models is difficult: multilingual benchmarks are lacking and the standard input representation is unsuitable for analyzing phonemes.

Details

Method: Uses unsupervised methods to extract word boundaries from trained models, and linear probes to test whether the models implicitly track word boundaries. Result: The models implicitly track word boundaries even without being trained on them explicitly, which supports statistical learning theories and suggests new ways to train subword tokenizers. Conclusion: The cross-lingual study corroborates statistical learning theories and offers new directions for phonological analysis and subword tokenizer training. Abstract: Language models provide a key framework for studying linguistic theories based on prediction, but phonological analysis using large language models (LLMs) is difficult; there are few phonological benchmarks beyond English and the standard input representation used in LLMs (subwords of graphemes) is not suitable for analyzing the representation of phonemes. In this work, we demonstrate how word segmentation can be used as a phonological probing task, allowing us to study the representations learned by phoneme-based language models trained on child-directed speech across 31 languages. Following computational models of word segmentation, we present unsupervised methods for extracting word boundaries from a trained model using the observation that prediction error peaks at the start of words. We also use linear probes to identify that these models implicitly track word boundaries, even when they do not appear in training. This cross-lingual work corroborates statistical learning theories of acquisition and empirically motivates new methods for training subword tokenizers.
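
The observation that prediction error peaks at the start of words suggests a simple segmentation rule: place a boundary before every phoneme whose surprisal is a local peak. The sketch below applies that rule to a hand-written surprisal sequence. The peak criterion, the function names, and the toy values are illustrative assumptions, since the abstract does not specify which cues or thresholds the paper uses.

```python
def boundaries_from_surprisal(surprisal):
    """Indices where a word is assumed to start: local peaks in surprisal."""
    cuts = []
    for i in range(1, len(surprisal)):
        left = surprisal[i - 1]
        right = surprisal[i + 1] if i + 1 < len(surprisal) else float("-inf")
        if surprisal[i] > left and surprisal[i] >= right:
            cuts.append(i)
    return cuts


def segment(phonemes, surprisal):
    """Split a phoneme sequence at the predicted boundaries."""
    words, start = [], 0
    for cut in boundaries_from_surprisal(surprisal):
        words.append(phonemes[start:cut])
        start = cut
    words.append(phonemes[start:])
    return words


# Toy utterance "the dog" in an ad hoc ASCII phoneme notation, with
# made-up per-phoneme surprisal values (high at word onsets).
phonemes = ["dh", "ax", "d", "ao", "g"]
surprisal = [3.0, 0.5, 2.8, 0.7, 0.4]
words = segment(phonemes, surprisal)
```

In practice the surprisal values would come from the trained phoneme-level model, and predicted boundaries would be scored against gold word boundaries that the model never saw during training.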

Multi-Flow: Multi-View-Enriched Normalizing Flows for Industrial Anomaly Detection

Mathis Kruse,Bodo Rosenhahn

Task: Propose Multi-Flow, a multi-view anomaly detection method that addresses the limits of single-view methods on complex industrial products.

Motivation: The properties of complex industrial products in real production settings cannot be fully captured by a single image, and existing normalizing-flow-based methods do not exploit the priors in multi-view data.

Details

Method: A flow-based approach, Multi-Flow strengthens exact likelihood estimation on multi-view data through a novel multi-view architecture and a cross-view message-passing scheme. Result: Validated on the real-world multi-view dataset Real-IAD, Multi-Flow sets a new state of the art in both image-wise and sample-wise anomaly detection. Conclusion: By fusing information across views, Multi-Flow markedly improves anomaly detection and fills a gap in how existing methods use multi-view data. Abstract: With more well-performing anomaly detection methods proposed, many of the single-view tasks have been solved to a relatively good degree. However, real-world production scenarios often involve complex industrial products, whose properties may not be fully captured by one single image. While normalizing flow based approaches already work well in single-camera scenarios, they currently do not make use of the priors in multi-view data. We aim to bridge this gap by using these flow-based models as a strong foundation and propose Multi-Flow, a novel multi-view anomaly detection method. Multi-Flow makes use of a novel multi-view architecture, whose exact likelihood estimation is enhanced by fusing information across different views. For this, we propose a new cross-view message-passing scheme, letting information flow between neighboring views. We empirically validate it on the real-world multi-view data set Real-IAD and reach a new state-of-the-art, surpassing current baselines in both image-wise and sample-wise anomaly detection tasks.

Kaustubh Shivshankar Shejole,Pushpak Bhattacharyya

Task: Study the detection of stereotypes and anti-stereotypes, and propose a four-tuple definition that distinguishes stereotype, anti-stereotype, stereotypical bias, and bias.

Motivation: Current research focuses mainly on detecting stereotypical bias in LLMs and lacks a clear definition of stereotypes themselves, which has held back progress in the area.

Details

Method: Proposes the StereoDetect dataset, built by making optimal use of existing datasets such as StereoSet and WinoQueer together with manual verification and the transfer of semantic information. Result: Language models with fewer than 10B parameters are often confused when detecting anti-stereotypes, and comparisons with other models show the importance of high-quality datasets. Conclusion: With clear definitions and a high-quality dataset, the work lays an important foundation for research on stereotype and anti-stereotype detection. Abstract: Stereotypes are known to be highly pernicious, making their detection critically important. However, current research predominantly focuses on detecting and evaluating stereotypical biases in LLMs, leaving the study of stereotypes in its early stages. Many studies have failed to clearly distinguish between stereotypes and stereotypical biases, which has significantly slowed progress in advancing research in this area. Stereotype and anti-stereotype detection is a problem that requires knowledge of society; hence, it is one of the most difficult areas in Responsible AI. This work investigates this task, where we propose a four-tuple definition and provide precise terminology distinguishing stereotype, anti-stereotype, stereotypical bias, and bias, offering valuable insights into their various aspects. In this paper, we propose StereoDetect, a high-quality benchmarking dataset curated for this task by optimally utilizing current datasets such as StereoSet and WinoQueer, involving a manual verification process and the transfer of semantic information. We demonstrate that language models for reasoning with fewer than 10B parameters often get confused when detecting anti-stereotypes. We also demonstrate the critical importance of well-curated datasets by comparing our model with other current models for stereotype detection. The dataset and code are available at https://github.com/KaustubhShejole/StereoDetect.

Steerable Anatomical Shape Synthesis with Implicit Neural Representations

Bram de Wilde,Max T. Rietberg,Guillaume Lajoinie,Jelmer M. Wolterink

Task: Propose a steerable generative model based on implicit neural representations for generating anatomical structures with targeted control.

Motivation: Generative models are essential to virtual imaging trials, but they need to simulate specific patient populations rather than rely on random sampling.

Details

Method: Builds a steerable generative model on implicit neural representations, which supports topology changes and learns a disentangled latent representation. Result: The model performs well on reconstruction accuracy and anatomical plausibility, achieving high-quality shape generation and targeted anatomical modifications. Conclusion: The model offers a high-quality, controllable way to generate anatomical structures for virtual imaging trials. Abstract: Generative modeling of anatomical structures plays a crucial role in virtual imaging trials, which allow researchers to perform studies without the costs and constraints inherent to in vivo and phantom studies. For clinical relevance, generative models should allow targeted control to simulate specific patient populations rather than relying on purely random sampling. In this work, we propose a steerable generative model based on implicit neural representations. Implicit neural representations naturally support topology changes, making them well-suited for anatomical structures with varying topology, such as the thyroid. Our model learns a disentangled latent representation, enabling fine-grained control over shape variations. Evaluation includes reconstruction accuracy and anatomical plausibility. Our results demonstrate that the proposed model achieves high-quality shape generation while enabling targeted anatomical modifications.

Online Difficulty Filtering for Reasoning Oriented Reinforcement Learning

Sanghwan Bae,Jiwoo Hong,Min Young Lee,Hanbyul Kim,JeongYeon Nam,Donghyun Kwak

Task: Improve RORL training through balanced online difficulty filtering.

Motivation: Because rewards in RORL are sparse, training effectiveness depends heavily on choosing problems of suitable difficulty, and existing methods lack theoretical grounding and a systematic understanding.

Details

Method: Proposes balanced online difficulty filtering, which dynamically selects problems on which the model achieves intermediate accuracy. Result: Across five math reasoning benchmarks, the method gains 10% on AIME and 4% on average, with clear improvements in sample efficiency and training-time efficiency. Conclusion: Balanced online difficulty filtering maximizes the effectiveness of RORL training and markedly improves performance and efficiency. Abstract: Reasoning-Oriented Reinforcement Learning (RORL) enhances the reasoning ability of Large Language Models (LLMs). However, due to the sparsity of rewards in RORL, effective training is highly dependent on the selection of problems of appropriate difficulty. Although curriculum learning attempts to address this by adjusting difficulty, it often relies on static schedules, and even recent online filtering methods lack theoretical grounding and a systematic understanding of their effectiveness. In this work, we theoretically and empirically show that curating the batch on the fly with problems on which the training model achieves intermediate accuracy can maximize the effectiveness of RORL training, an approach we call balanced online difficulty filtering. We first derive that the lower bound of the KL divergence between the initial and the optimal policy can be expressed with the variance of the sampled accuracy. Building on those insights, we show that balanced filtering can maximize the lower bound, leading to better performance. Experimental results across five challenging math reasoning benchmarks show that balanced online filtering yields an additional 10% on AIME and a 4% improvement on average over plain GRPO. Moreover, further analysis shows gains in sample efficiency and training time efficiency, exceeding the maximum reward of plain GRPO within 60% of the training time and of the volume of the training set.
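
The filtering step can be sketched as follows: sample several rollouts per problem, estimate the pass rate, and keep only problems whose pass rate is intermediate. For binary rewards, the variance of a single outcome is p(1 - p), which is largest at p = 0.5; this is why intermediate-accuracy problems carry the most signal. The thresholds `low` and `high`, the data format, and the function name are assumptions for illustration, not values from the paper.

```python
def filter_batch(rollout_results, low=0.25, high=0.75):
    """Keep problems whose sampled pass rate lies in [low, high].

    rollout_results: dict mapping problem id -> list of 0/1 rollout outcomes
    (hypothetical format). Returns (problem_id, pass_rate, variance) tuples.
    """
    kept = []
    for problem_id, outcomes in rollout_results.items():
        p = sum(outcomes) / len(outcomes)
        if low <= p <= high:
            kept.append((problem_id, p, p * (1 - p)))  # Bernoulli variance
    return kept


results = {
    "easy": [1, 1, 1, 1],   # always solved: no learning signal
    "hard": [0, 0, 0, 0],   # never solved: no learning signal
    "mid":  [1, 0, 1, 0],   # intermediate accuracy: kept
}
kept = filter_batch(results)
```

Problems that are always or never solved give every rollout the same reward, so group-relative methods such as GRPO get zero advantage from them; filtering them out spends the batch on problems that still move the policy.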

QIRL: Boosting Visual Question Answering via Optimized Question-Image Relation Learning

Quanxing Xu,Ling Zhou,Xian Zhong,Feifei Zhang,Rubing Huang,Chia-Wen Lin

Task: Propose a new framework, QIRL, that optimizes question-image relation learning in Visual Question Answering (VQA) through a generation-based self-supervised learning strategy.

Motivation: Existing debiasing methods fail to capture the deeper relation between images and text, and they do not assess the relevance between question and image at inference time.

Details

Method: Introduces two modules: a Negative Image Generation (NIG) module that strengthens correlation learning and an Irrelevant Sample Identification (ISI) module that filters irrelevant inputs. Result: Experiments on VQA-CPv2 and VQA-v2 confirm the method's effectiveness, and it achieves the best results among data augmentation strategies. Conclusion: QIRL is model-agnostic and significantly improves VQA performance, particularly for debiasing and data augmentation. Abstract: Existing debiasing approaches in Visual Question Answering (VQA) primarily focus on enhancing visual learning, integrating auxiliary models, or employing data augmentation strategies. However, these methods exhibit two major drawbacks. First, current debiasing techniques fail to capture the superior relation between images and texts because prevalent learning frameworks do not enable models to extract deeper correlations from highly contrasting samples. Second, they do not assess the relevance between the input question and image during inference, as no prior work has examined the degree of input relevance in debiasing studies. Motivated by these limitations, we propose a novel framework, Optimized Question-Image Relation Learning (QIRL), which employs a generation-based self-supervised learning strategy. Specifically, two modules are introduced to address the aforementioned issues. The Negative Image Generation (NIG) module automatically produces highly irrelevant question-image pairs during training to enhance correlation learning, while the Irrelevant Sample Identification (ISI) module improves model robustness by detecting and filtering irrelevant inputs, thereby reducing prediction errors. Furthermore, to validate our concept of reducing output errors through filtering unrelated question-image inputs, we propose a specialized metric to evaluate the performance of the ISI module. Notably, our approach is model-agnostic and can be integrated with various VQA models. Extensive experiments on VQA-CPv2 and VQA-v2 demonstrate the effectiveness and generalization ability of our method. Among data augmentation strategies, our approach achieves state-of-the-art results.

Locations of Characters in Narratives: Andersen and Persuasion Datasets

Batuhan Ozyurt,Roya Arkhmammadova,Deniz Yuret

Task: Test AI's ability to understand the relationship between characters and their locations in narratives.

Motivation: Study machines' spatial understanding in narrative contexts as a way to assess AI reading comprehension.

Details

Method: Introduces two new datasets, Andersen and Persuasion, with manually annotated characters and locations, and uses them to test Large Language Models (LLMs). Result: The best LLM reaches 61.85% accuracy on the Andersen dataset and 56.06% on the Persuasion dataset. Conclusion: Current LLMs have limited ability to track where characters are, leaving room for improvement. Abstract: The ability of machines to grasp spatial understanding within narrative contexts is an intriguing aspect of reading comprehension that continues to be studied. Motivated by the goal to test AI's competence in understanding the relationship between characters and their respective locations in narratives, we introduce two new datasets: Andersen and Persuasion. For the Andersen dataset, we selected fifteen children's stories from "Andersen's Fairy Tales" by Hans Christian Andersen and manually annotated the characters and their respective locations throughout each story. Similarly, for the Persuasion dataset, characters and their locations in the novel "Persuasion" by Jane Austen were also manually annotated. We used these datasets to prompt Large Language Models (LLMs). The prompts are created by extracting excerpts from the stories or the novel and combining them with a question asking the location of a character mentioned in that excerpt. Out of the five LLMs we tested, the best-performing one for the Andersen dataset accurately identified the location in 61.85% of the examples, while for the Persuasion dataset, the best-performing one did so in 56.06% of the cases.
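
The prompting setup described in the abstract, an excerpt combined with a question about a character's location, might look like the sketch below. The template wording, the exact-match scoring, and the example excerpt are hypothetical; the abstract does not give the actual prompt text or the matching rule behind the reported accuracies.

```python
def build_location_prompt(excerpt, character):
    """Combine an excerpt with a location question (hypothetical template)."""
    return (
        "Read the following excerpt and answer the question.\n\n"
        f"Excerpt:\n{excerpt.strip()}\n\n"
        f"Question: Where is {character} in this excerpt?\n"
        "Answer:"
    )


def accuracy(predictions, gold):
    """Share of predictions that match the gold location, ignoring case."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(predictions, gold))
    return hits / len(gold)


# Invented excerpt for illustration; not taken from either dataset.
prompt = build_location_prompt("Anne walked slowly through the garden.", "Anne")
score = accuracy(["the garden", "Bath"], ["The Garden", "Lyme"])
```

Exact string matching is a strict criterion: a model answering "in the garden" would be counted wrong here, so a real evaluation would likely need some normalization of answers.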

EOOD: Entropy-based Out-of-distribution Detection

Guide Yang,Chao Hou,Weilong Peng,Xiang Fang,Yongwei Nie,Peican Zhu,Keke Tang

Task: Propose an Entropy-based Out-Of-distribution Detection (EOOD) framework for identifying and handling out-of-distribution samples in deep neural networks.

Motivation: Deep neural networks (DNNs) tend to be overconfident on out-of-distribution (OOD) samples, which complicates real-world deployment.

Details

Method: EOOD identifies the specific block where the information flow of ID and OOD samples differs most, then computes the conditional entropy on that block as the OOD confidence score. Result: Experiments across various ID and OOD settings show that EOOD is effective for OOD detection and outperforms existing methods. Conclusion: EOOD effectively addresses OOD detection in DNNs and outperforms existing techniques. Abstract: Deep neural networks (DNNs) often exhibit overconfidence when encountering out-of-distribution (OOD) samples, posing significant challenges for deployment. Since DNNs are trained on in-distribution (ID) datasets, the information flow of ID samples through DNNs inevitably differs from that of OOD samples. In this paper, we propose an Entropy-based Out-Of-distribution Detection (EOOD) framework. EOOD first identifies the specific block where the information flow differences between ID and OOD samples are more pronounced, using both ID and pseudo-OOD samples. It then calculates the conditional entropy on the selected block as the OOD confidence score. Comprehensive experiments conducted across various ID and OOD settings demonstrate the effectiveness of EOOD in OOD detection and its superiority over state-of-the-art methods.
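
To make the idea of an entropy-based score concrete, the sketch below computes the Shannon entropy of a softmax-normalized feature vector. This is a simplified stand-in: the paper's score is a conditional entropy on a selected block, whose exact definition the abstract does not give. The function names and the use of softmax are assumptions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_score(features):
    """Shannon entropy (in nats) of the softmax-normalized features."""
    probs = softmax(features)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


flat = entropy_score([0.0, 0.0, 0.0, 0.0])     # uniform: maximal entropy
peaked = entropy_score([10.0, 0.0, 0.0, 0.0])  # concentrated: low entropy
```

A detector would compare the score against a threshold calibrated on held-out data, for example ID and pseudo-OOD samples as in the abstract. Which direction indicates OOD depends on the block and the entropy definition, so the sketch does not assume one.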

SpectR: Dynamically Composing LM Experts with Spectral Routing

William Fleshman,Benjamin Van Durme

Task: Propose SPECTR, a method for dynamically composing expert models so that models can be flexibly selected or merged during inference.

Motivation: Exploit the potential of existing expert models as an answer to the challenges of training large, general-purpose language models.

Details

Method: SPECTR needs no additional training and supports dynamic token-wise and layer-wise model combinations. Result: Experiments show that SPECTR beats other training-free methods in routing accuracy and task performance. Conclusion: SPECTR offers an efficient and flexible way to compose expert models dynamically. Abstract: Training large, general-purpose language models poses significant challenges. The growing availability of specialized expert models, fine-tuned from pretrained models for specific tasks or domains, offers a promising alternative. Leveraging the potential of these existing expert models in real-world applications requires effective methods to select or merge the models best suited for a given task. This paper introduces SPECTR, an approach for dynamically composing expert models at each time step during inference. Notably, our method requires no additional training and enables flexible, token- and layer-wise model combinations. Our experimental results demonstrate that SPECTR improves routing accuracy over alternative training-free methods, increasing task performance across expert domains.

Meta-DAN: towards an efficient prediction strategy for page-level handwritten text recognition

Denis Coquenet

Task: Propose Meta-DAN, a decoding strategy that reduces prediction time and improves context modeling.

Motivation: The existing character-level autoregressive decoding process leads to long prediction times, taking several seconds to process a single page image.

Details

Method: Uses windowed queries and multi-token predictions to widen context modeling and improve efficiency. Result: Evaluated on 10 full-page handwritten datasets, the method reaches state-of-the-art average character error rate. Conclusion: Meta-DAN reduces prediction time while improving context modeling, and achieves strong performance. Abstract: Recent advances in text recognition led to a paradigm shift for page-level recognition, from multi-step segmentation-based approaches to end-to-end attention-based ones. However, the naïve character-level autoregressive decoding process results in long prediction times: it requires several seconds to process a single page image on a modern GPU. We propose the Meta Document Attention Network (Meta-DAN) as a novel decoding strategy to reduce the prediction time while enabling better context modeling. It relies on two main components: windowed queries, to process several transformer queries together, enlarging the context modeling with the near future; and multi-token predictions, whose goal is to predict several tokens per query instead of only the next one. We evaluate the proposed approach on 10 full-page handwritten datasets and demonstrate state-of-the-art results on average in terms of character error rate. Source code and weights of trained models are available at https://github.com/FactoDeepLearning/meta_dan.

Structured Legal Document Generation in India: A Model-Agnostic Wrapper Approach with VidhikDastaavej

Shubham Kumar Nigam,Balaramamahanthi Deepak Patnaik,Ajay Varghese Thomas,Noel Shallum,Kripabandhu Ghosh,Arnab Bhattacharya

Task: Develop a framework and tool for automatically generating private legal documents in the Indian legal domain.

Motivation: Automating legal document drafting can improve efficiency, reduce manual effort, and streamline legal workflows, yet structured generation in the Indian legal domain remains unaddressed.

Details

Method: Introduces the VidhikDastaavej dataset and the NyayaShilp model, and uses a two-step Model-Agnostic Wrapper (MAW) framework combined with retrieval mechanisms and a Human-in-the-Loop (HITL) system. Result: The structured framework significantly improves document coherence, factual accuracy, and quality while reducing hallucinations. Conclusion: The work provides a scalable and adaptable foundation for AI-assisted legal drafting in India and enables efficient structured legal document generation. Abstract: Automating legal document drafting can significantly enhance efficiency, reduce manual effort, and streamline legal workflows. While prior research has explored tasks such as judgment prediction and case summarization, the structured generation of private legal documents in the Indian legal domain remains largely unaddressed. To bridge this gap, we introduce VidhikDastaavej, a novel, anonymized dataset of private legal documents, and develop NyayaShilp, a fine-tuned legal document generation model specifically adapted to Indian legal texts. We propose a Model-Agnostic Wrapper (MAW), a two-step framework that first generates structured section titles and then iteratively produces content while leveraging retrieval-based mechanisms to ensure coherence and factual accuracy. We benchmark multiple open-source LLMs, including instruction-tuned and domain-adapted versions, alongside proprietary models for comparison. Our findings indicate that while direct fine-tuning on small datasets does not always yield improvements, our structured wrapper significantly enhances coherence, factual adherence, and overall document quality while mitigating hallucinations. To ensure real-world applicability, we developed a Human-in-the-Loop (HITL) Document Generation System, an interactive user interface that enables users to specify document types, refine section details, and generate structured legal drafts. This tool allows legal professionals and researchers to generate, validate, and refine AI-generated legal documents efficiently. Extensive evaluations, including expert assessments, confirm that our framework achieves high reliability in structured legal drafting. This research establishes a scalable and adaptable foundation for AI-assisted legal drafting in India, offering an effective approach to structured legal document generation.

FLAIRBrainSeg: Fine-grained brain segmentation using FLAIR MRI only

Edern Le Bot,Rémi Giraud,Boris Mansencal,Thomas Tourdias,Josè V. Manjon,Pierrick Coupé

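Task: Propose FLAIRBrainSeg, a new method for brain segmentation using only FLAIR MRI.

Motivation: Provide a reliable brain segmentation solution for cases where other imaging modalities are limited.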

Details

Method: Uses existing automatic segmentation methods to train a network that approximates the segmentations normally obtained from T1-weighted MRI. Result: Outperforms modality-agnostic, image-synthesis-based approaches on multiple datasets and segments 132 structures. Conclusion: FLAIRBrainSeg offers a valuable alternative for brain segmentation when T1-weighted MRI is unavailable. Abstract: This paper introduces a novel method for brain segmentation using only FLAIR MRIs, specifically targeting cases where access to other imaging modalities is limited. By leveraging existing automatic segmentation methods, we train a network to approximate the segmentations typically obtained from T1-weighted MRIs. Our method, called FLAIRBrainSeg, produces segmentations of 132 structures and is robust to multiple sclerosis lesions. Experiments on both in-domain and out-of-domain datasets demonstrate that our method outperforms modality-agnostic approaches based on image synthesis, the only currently available alternative for performing brain parcellation using FLAIR MRI alone. This technique holds promise for scenarios where T1-weighted MRIs are unavailable and offers a valuable alternative for clinicians and researchers in need of reliable anatomical segmentation.

Neutralizing the Narrative: AI-Powered Debiasing of Online News Articles

Chen Wei Kuo,Kevin Chu,Nouar AlDahoul,Hazem Ibrahim,Talal Rahwan,Yasir Zaki

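Task: Use large language models (LLMs) to systematically detect and mitigate bias in news articles.

Motivation: Traditional bias detection relies on human moderation, which is subjective and hard to scale; a more efficient and objective solution is needed.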

Details

Method: A two-stage approach: (1) several LLMs (e.g., GPT-4o, Gemini Pro) detect paragraph-level bias, validated by human evaluation; (2) GPT-4o Mini iteratively debiases the text, verified by automated and human evaluation. Result: GPT-4o Mini performs best at both bias detection and debiasing; the analysis also reveals links between media bias and temporal, geographical, and socio-political dynamics. Conclusion: The study offers a scalable computational approach to news bias, promoting fairness and accountability in reporting. Abstract: Bias in news reporting significantly impacts public perception, particularly regarding crime, politics, and societal issues. Traditional bias detection methods, predominantly reliant on human moderation, suffer from subjective interpretations and scalability constraints. Here, we introduce an AI-driven framework leveraging advanced large language models (LLMs), specifically GPT-4o, GPT-4o Mini, Gemini Pro, Gemini Flash, Llama 8B, and Llama 3B, to systematically identify and mitigate biases in news articles. To this end, we collect an extensive dataset consisting of over 30,000 crime-related articles from five politically diverse news sources spanning a decade (2013-2023). Our approach employs a two-stage methodology: (1) bias detection, where each LLM scores and justifies biased content at the paragraph level, validated through human evaluation for ground truth establishment, and (2) iterative debiasing using GPT-4o Mini, verified by both automated reassessment and human reviewers. Empirical results indicate GPT-4o Mini's superior accuracy in bias detection and effectiveness in debiasing. Furthermore, our analysis reveals temporal and geographical variations in media bias correlating with socio-political dynamics and real-world events. This study contributes to scalable computational methodologies for bias mitigation, promoting fairness and accountability in news reporting.

ZFusion: An Effective Fuser of Camera and 4D Radar for 3D Object Perception in Autonomous Driving

Sheng Yang,Tong Zhan,Shichen Qiao,Jicheng Gong,Qing Yang,Yanfeng Lu,Jian Wang

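Task: Propose ZFusion, a 3D object detection method that fuses 4D radar and vision to improve perception accuracy.

Motivation: 4D radar can sense in all weather conditions, but its sparse point clouds limit performance, so it needs to be fused with visual information.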

Details

Method: Uses the FP-DDCA (Feature Pyramid-Double Deformable Cross Attention) fuser, whose Transformer blocks interactively fuse multi-modal features at different scales, together with a Depth-Context-Split view transformation module. Result: On the VoD dataset, ZFusion achieves state-of-the-art mAP in the region of interest, with overall performance close to LiDAR and far above camera-only methods. Conclusion: ZFusion is a low-cost, effective alternative with near-LiDAR performance for 3D object detection in autonomous driving. Abstract: Reliable 3D object perception is essential in autonomous driving. Owing to its sensing capabilities in all weather conditions, 4D radar has recently received much attention. However, compared to LiDAR, 4D radar provides a much sparser point cloud. In this paper, we propose a 3D object detection method, termed ZFusion, which fuses 4D radar and vision modality. As the core of ZFusion, our proposed FP-DDCA (Feature Pyramid-Double Deformable Cross Attention) fuser effectively complements the (sparse) radar information with (dense) vision information. Specifically, with a feature-pyramid structure, the FP-DDCA fuser packs Transformer blocks to interactively fuse multi-modal features at different scales, thus enhancing perception accuracy. In addition, we utilize the Depth-Context-Split view transformation module due to the physical properties of 4D radar. Considering that 4D radar has a much lower cost than LiDAR, ZFusion is an attractive alternative to LiDAR-based methods. In typical traffic scenarios like the VoD (View-of-Delft) dataset, experiments show that with reasonable inference speed, ZFusion achieves state-of-the-art mAP (mean average precision) in the region of interest, while having competitive mAP in the entire area compared to the baseline methods, which demonstrates performance close to LiDAR and greatly outperforms camera-only methods.

Diverse In-Context Example Selection After Decomposing Programs and Aligned Utterances Improves Semantic Parsing

Mayank Kothyari,Sunita Sarawagi,Soumen Chakrabarti,Gaurav Arora,Srujana Merugu

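Task: Study how decomposing and selecting in-context examples (ICEs) can improve LLM performance on semantic parsing.

Motivation: Programs are usually represented as abstract syntax trees (ASTs); this structured representation raises new questions about how to design and select ICEs so that they are used effectively.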

Details

Method: Decomposes ICE trees into fragments, uses an LLM with syntax constraints to map fragments to corresponding utterances automatically, and extends a diverse ICE selection method to handle fragmented instances. Result: On several semantic parsing benchmarks, SCUD4ICL shows clear accuracy gains, especially for smaller LLMs and ICE pools with larger labeled trees. Conclusion: Decomposed, diverse ICE selection improves semantic parsing, particularly for programs in lower-resource languages and for smaller models. Abstract: LLMs are increasingly used as seq2seq translators from natural language utterances to structured programs, a process called semantic interpretation. Unlike atomic labels or token sequences, programs are naturally represented as abstract syntax trees (ASTs). Such structured representation raises novel issues related to the design and selection of in-context examples (ICEs) presented to the LLM. We focus on decomposing the pool of available ICE trees into fragments, some of which may be better suited to solving the test instance. Next, we propose how to use (additional invocations of) an LLM with prompted syntax constraints to automatically map the fragments to corresponding utterances. Finally, we adapt and extend a recent method for diverse ICE selection to work with whole and fragmented ICE instances. We evaluate our system, SCUD4ICL, on popular diverse semantic parsing benchmarks, showing visible accuracy gains from our proposed decomposed diverse demonstration method. Benefits are particularly notable for smaller LLMs, ICE pools having larger labeled trees, and programs in lower resource languages.
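The abstract does not say how ICE trees are cut into fragments, so the following is only a minimal sketch of one simple possibility: treat every subtree of a program tree as a candidate fragment. The tuple encoding `(label, child, ...)`, the helper names `subtrees` and `size`, and the size threshold are all illustrative assumptions, not part of SCUD4ICL.

```python
def subtrees(tree):
    """Yield every subtree of a (label, child, child, ...) tuple tree, root first."""
    yield tree
    for child in tree[1:]:
        if isinstance(child, tuple):  # string children are leaves, not subtrees
            yield from subtrees(child)


def size(tree):
    """Count nodes: one for the label plus the sizes of all children."""
    return 1 + sum(size(c) if isinstance(c, tuple) else 1 for c in tree[1:])


# Hypothetical program tree for a query such as "red items sorted by price".
program = ("filter", ("eq", "color", "red"), ("sort", ("key", "price"), "asc"))

# Keep fragments with at least two nodes (the threshold is an assumption).
fragments = [t for t in subtrees(program) if size(t) >= 2]
for fragment in fragments:
    print(size(fragment), fragment)
```

In a pipeline like the one described, each fragment would still need a matching utterance, which the abstract says is produced by additional LLM calls with prompted syntax constraints.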

Know What You do Not Know: Verbalized Uncertainty Estimation Robustness on Corrupted Images in Vision-Language Models

Mirko Borszukovszki,Ivo Pascal de Jong,Matias Valdenegro-Toro

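Task: Study how well vision-language models (VLMs) estimate their uncertainty when image data is corrupted.

Motivation: Exploiting the full potential of large language models (LLMs) requires knowing how uncertain their answers are, but this has been little studied for VLMs.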

Details

Method: Tests three state-of-the-art VLMs on corrupted image data. Result: Corruption severity degrades the models' uncertainty estimates, and the models are overconfident in most experiments. Conclusion: VLMs fall short on uncertainty estimation, especially under image corruption, and need further improvement. Abstract: To leverage the full potential of Large Language Models (LLMs) it is crucial to have some information on their answers' uncertainty. This means that the model has to be able to quantify how certain it is in the correctness of a given response. Bad uncertainty estimates can lead to overconfident wrong answers undermining trust in these models. Quite a lot of research has been done on language models that work with text inputs and provide text outputs. Still, since the visual capabilities have been added to these models recently, there has not been much progress on the uncertainty of Visual Language Models (VLMs). We tested three state-of-the-art VLMs on corrupted image data. We found that the severity of the corruption negatively impacted the models' ability to estimate their uncertainty and the models also showed overconfidence in most of the experiments.

MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation

Khai Le-Duc,Tuyen Tran,Bach Phan Tat,Nguyen Kim Hai Bui,Quan Dang,Hung-Phong Tran,Thanh-Thuy Nguyen,Ly Nguyen,Tuan-Minh Phan,Thi Thu Phuong Tran,Chris Ngo,Nguyen X. Khanh,Thanh Nguyen-Tang

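Task: Systematically study multilingual speech translation (ST) in the medical domain and release the MultiMed-ST dataset and models.

Motivation: Cross-language communication can improve patient care, ease specialized workforce shortages, and improve diagnosis and treatment, particularly during pandemics.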

Details

Method: Releases the MultiMed-ST dataset (all translation directions across five languages, 290,000 samples) and runs several analyses (empirical baselines, bilingual vs. multilingual, end-to-end vs. cascaded, and others). Result: MultiMed-ST is currently the largest medical machine translation dataset and the largest many-to-many multilingual ST dataset in any domain. Conclusion: The work provides the first systematic dataset and analysis framework for medical ST, advancing the field. Abstract: Multilingual speech translation (ST) in the medical domain enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we present the first systematic study on medical ST, to our best knowledge, by releasing MultiMed-ST, a large-scale ST dataset for the medical domain, spanning all translation directions in five languages: Vietnamese, English, German, French, Traditional Chinese and Simplified Chinese, together with the models. With 290,000 samples, our dataset is the largest medical machine translation (MT) dataset and the largest many-to-many multilingual ST among all domains. Secondly, we present the most extensive analysis study in ST research to date, including: empirical baselines, bilingual-multilingual comparative study, end-to-end vs. cascaded comparative study, task-specific vs. multi-task sequence-to-sequence (seq2seq) comparative study, code-switch analysis, and quantitative-qualitative error analysis. All code, data, and models are available online: https://github.com/leduckhai/MultiMed-ST.

Pyramid-based Mamba Multi-class Unsupervised Anomaly Detection

Nasar Iqbal,Niki Martinel

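Task: Propose a state space model (SSM)-based Pyramidal Scanning Strategy (PSS) for multi-class anomaly detection and localization.

Motivation: Address the challenge of localizing small anomalies, overcoming CNNs' limits in capturing long-range dependencies and the high computational overhead of Transformer architectures.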

Details

Method: Combines PSS with a pre-trained encoder for multi-scale feature extraction and uses a feature-level synthetic anomaly generator. Result: +1% AP on multi-class anomaly localization and +1% AU-PRO on the MVTec benchmark. Conclusion: The method achieves precise anomaly localization across diverse industrial scenarios. Abstract: Recent advances in convolutional neural networks (CNNs) and transformer-based methods have improved anomaly detection and localization, but challenges persist in precisely localizing small anomalies. While CNNs face limitations in capturing long-range dependencies, transformer architectures often suffer from substantial computational overheads. We introduce a state space model (SSM)-based Pyramidal Scanning Strategy (PSS) for multi-class anomaly detection and localization--a novel approach designed to address the challenge of small anomaly localization. Our method captures fine-grained details at multiple scales by integrating the PSS with a pre-trained encoder for multi-scale feature extraction and a feature-level synthetic anomaly generator. An improvement of $+1\%$ AP for multi-class anomaly localization and a $+1\%$ increase in AU-PRO on MVTec benchmark demonstrate our method's superiority in precise anomaly localization across diverse industrial scenarios. The code is available at https://github.com/iqbalmlpuniud/Pyramid Mamba.

Agentic Knowledgeable Self-awareness

Shuofei Qiao,Zhisong Qiu,Baochang Ren,Xiaobin Wang,Xiangyuan Ru,Ningyu Zhang,Xiang Chen,Yong Jiang,Pengjun Xie,Fei Huang,Huajun Chen

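Task: Propose KnowSelf, a new paradigm that lets LLM-based agents autonomously regulate their knowledge utilization.

Motivation: Traditional agent planning overlooks the situational self-awareness found in human decision-making, leading to inefficient resource use.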

Details

Method: KnowSelf is a data-centric approach that marks the agent's self-explored trajectories with special tokens using a heuristic situation-judgement criterion, then uses a two-stage training process so the agent can switch between situations. Result: KnowSelf outperforms a range of baselines across tasks and models with minimal use of external knowledge. Conclusion: By autonomously regulating knowledge use, KnowSelf improves the effectiveness and efficiency of agent planning. Abstract: Large Language Models (LLMs) have achieved considerable performance across various agentic planning tasks. However, traditional agent planning approaches adopt a "flood irrigation" methodology that indiscriminately injects gold trajectories, external feedback, and domain knowledge into agent models. This practice overlooks the fundamental human cognitive principle of situational self-awareness during decision-making: the ability to dynamically assess situational demands and strategically employ resources. We propose agentic knowledgeable self-awareness to address this gap, a novel paradigm enabling LLM-based agents to autonomously regulate knowledge utilization. Specifically, we propose KnowSelf, a data-centric approach that equips agents with knowledgeable self-awareness like humans. Concretely, we devise a heuristic situation judgement criterion to mark special tokens on the agent's self-explored trajectories for collecting training data. Through a two-stage training process, the agent model can switch between different situations by generating specific special tokens, achieving optimal planning effects with minimal costs. Our experiments demonstrate that KnowSelf can outperform various strong baselines on different tasks and models with minimal use of external knowledge. Code is available at https://github.com/zjunlp/KnowSelf.

D-Garment: Physics-Conditioned Latent Diffusion for Dynamic Garment Deformations

Antoine Dumoulin,Adnane Boukhayma,Laurence Boissieux,Bharath Bhushan Damodaran,Pierre Hellier,Stefanie Wuhrer

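Task: Propose a learning-based approach for adjusting and deforming 3D garments to body shape, motion, and cloth material.

Motivation: Dynamic garment adjustment is widely needed in virtual and augmented reality, for example in virtual change rooms and gaming, but existing methods struggle with large deformations and dynamic wrinkles.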

Details

Method: Trains a 3D generative model on data from a physics-based simulator, uses a diffusion model to learn fine detail, and models the garment in a 2D parameter space. Result: On simulated and captured data, the method beats baselines in Chamfer distance. Conclusion: The method learns garment deformations effectively, especially large deformations and dynamic wrinkles, and works independently of mesh resolution. Abstract: Adjusting and deforming 3D garments to body shapes, body motion, and cloth material is an important problem in virtual and augmented reality. Applications are numerous, ranging from virtual change rooms to the entertainment and gaming industry. This problem is challenging as garment dynamics influence geometric details such as wrinkling patterns, which depend on physical input including the wearer's body shape and motion, as well as cloth material features. Existing work studies learning-based modeling techniques to generate garment deformations from example data, and physics-inspired simulators to generate realistic garment dynamics. We propose here a learning-based approach trained on data generated with a physics-based simulator. Compared to prior work, our 3D generative model learns garment deformations for loose cloth geometry, especially for large deformations and dynamic wrinkles driven by body motion and cloth material. Furthermore, the model can be efficiently fitted to observations captured using vision sensors. We propose to leverage the capability of diffusion models to learn fine-scale detail: we model the 3D garment in a 2D parameter space, and learn a latent diffusion model using this representation independent from the mesh resolution. This allows conditioning global and local geometric information on body and material information. We quantitatively and qualitatively evaluate our method on both simulated data and data captured with a multi-view acquisition platform. Compared to strong baselines, our method is more accurate in terms of Chamfer distance.
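Chamfer distance, the accuracy metric named in this abstract, compares two point sets by nearest-neighbour distances. The sketch below is a brute-force version of one common variant (mean unsquared distance, summed over both directions); the abstract does not state which variant the authors use, and the function name and sample points are illustrative.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both directions."""

    def mean_nearest(src, dst):
        # For each source point, find the closest destination point, then average.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)


# Two toy point sets standing in for predicted and reference garment vertices.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(predicted, reference))
```

This version costs O(|A|·|B|) distance evaluations; for dense meshes a spatial index such as a k-d tree is the usual replacement for the inner `min`.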

Runnan Fang,Xiaobin Wang,Yuan Liang,Shuofei Qiao,Jialong Wu,Zekun Xi,Ningyu Zhang,Yong Jiang,Pengjun Xie,Fei Huang,Huajun Chen

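Task: Propose SynWorld, a framework that helps LLM-based agents autonomously explore new environments and refine their action knowledge.

Motivation: LLM-based agents struggle in novel environments or unconventional action spaces, so their ability to explore autonomously and understand actions needs to improve.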

Details

Method: Synthesizes possible scenarios with multi-step action invocation and explores them with Monte Carlo Tree Search (MCTS) to refine action knowledge. Result: Experiments show that SynWorld is an effective and general approach to learning action knowledge in new environments. Conclusion: SynWorld offers an effective solution for autonomous exploration and action-knowledge refinement in new environments. Abstract: In the interaction between agents and their environments, agents expand their capabilities by planning and executing actions. However, LLM-based agents face substantial challenges when deployed in novel environments or required to navigate unconventional action spaces. To empower agents to autonomously explore environments, optimize workflows, and enhance their understanding of actions, we propose SynWorld, a framework that allows agents to synthesize possible scenarios with multi-step action invocation within the action space and perform Monte Carlo Tree Search (MCTS) exploration to effectively refine their action knowledge in the current environment. Our experiments demonstrate that SynWorld is an effective and general approach to learning action knowledge in new environments. Code is available at https://github.com/zjunlp/SynWorld.

Dynamic Importance in Diffusion U-Net for Enhanced Image Synthesis

Xi Wang,Ziqi He,Yang Zhou

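Task: Study how dynamically re-weighting the outputs of the Transformer blocks in a U-Net can improve the inference efficiency and generation quality of diffusion models.

Motivation: The role of attention blocks in diffusion models has been studied, but how their importance evolves during inference has been overlooked, which limits further gains in image applications.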

Details

Method: Proves theoretically that re-weighting Transformer block outputs improves the signal-to-noise ratio, designs Importance Probe to quantify dynamic shifts in importance, and develops an adaptive re-weighting schedule. Result: Experiments show significant gains in inference efficiency and in the aesthetic quality of generated samples, while preserving identity consistency. Conclusion: The method integrates seamlessly into any U-Net-based architecture and offers a new direction for optimizing diffusion models. Abstract: Traditional diffusion models typically employ a U-Net architecture. Previous studies have unveiled the roles of attention blocks in the U-Net. However, they overlook the dynamic evolution of their importance during the inference process, which hinders their further exploitation to improve image applications. In this study, we first theoretically prove that re-weighting the outputs of the Transformer blocks within the U-Net is a "free lunch" for improving the signal-to-noise ratio during the sampling process. Next, we propose Importance Probe to uncover and quantify the dynamic shifts in importance of the Transformer blocks throughout the denoising process. Finally, we design an adaptive importance-based re-weighting schedule tailored to specific image generation and editing tasks. Experimental results demonstrate that our approach significantly improves the efficiency of the inference process, and enhances the aesthetic quality of the samples with identity consistency. Our method can be seamlessly integrated into any U-Net-based architecture. Code: https://github.com/Hytidel/UNetReweighting

Extending the SAREF4ENER Ontology with Flexibility Based on FlexOffers

Fabio Lilliu,Amir Laadhar,Christian Thomsen,Diego Reforgiato Recupero,Torben Bach Pedersen

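Task: Extend SAREF4ENER to support the complete FlexOffer model, including advanced use cases.

Motivation: The current industry standard, SAREF4ENER, has only limited support for flexibility and cannot meet the needs of smart energy appliances and market systems.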

Details

Method: Proposes an extended SAREF4ENER module that integrates the complete FlexOffer model while maintaining backward compatibility. Result: The new module accurately describes flexibility for advanced devices such as electric vehicles, batteries, and heat pumps, and captures the uncertainty of flexible load types. Conclusion: The extended SAREF4ENER gives smart energy appliances and market systems more complete flexibility support. Abstract: A key element to support the increased amounts of renewable energy in the energy system is flexibility, i.e., the possibility of changing energy loads in time and amount. Many flexibility models have been designed; however, exact models fail to scale for long time horizons or many devices. Because of this, the FlexOffer (FO) model has been designed to provide device-independent approximations of flexibility with good accuracy and much better scaling for long time horizons and many devices. An important aspect of the real-life implementation of energy flexibility is enabling flexible data exchange with many types of smart energy appliances and market systems, e.g., in smart buildings. For this, ontologies standardizing data formats are required. However, the current industry standard ontology for integrating smart devices for energy purposes, SAREF for Energy Flexibility (SAREF4ENER), only has limited support for flexibility and thus cannot support important use cases. In this paper we propose an extension of SAREF4ENER that integrates full support for the complete FlexOffer model, including advanced use cases, while maintaining backward compatibility. This novel ontology module can accurately describe flexibility for advanced devices such as electric vehicles, batteries, and heat pumps. It can also capture the inherent uncertainty associated with many flexible load types.

Multi-encoder nnU-Net outperforms Transformer models with self-supervised pretraining

Seyedeh Sahar Taheri Otaghsara,Reza Rahmanzadeh

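Task: Propose a novel self-supervised learning Multi-encoder nnU-Net architecture for medical image segmentation, in particular for multi-modal MRI data.

Motivation: Medical image segmentation is crucial in radiology, but traditional models are limited by differences between MRI modalities, image artifacts, and the scarcity of labeled data.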

Details

Method: Designs a Multi-encoder nnU-Net architecture that processes each MRI modality independently, capturing modality-specific features before fusing them. Result: The model reaches a Dice Similarity Coefficient (DSC) of 93.72%, surpassing models such as vanilla nnU-Net, SegResNet, and Swin UNETR. Conclusion: The Multi-encoder nnU-Net improves tumor segmentation accuracy, particularly when annotated data is limited. Abstract: This study addresses the essential task of medical image segmentation, which involves the automatic identification and delineation of anatomical structures and pathological regions in medical images. Accurate segmentation is crucial in radiology, as it aids in the precise localization of abnormalities such as tumors, thereby enabling effective diagnosis, treatment planning, and monitoring of disease progression. Specifically, the size, shape, and location of tumors can significantly influence clinical decision-making and therapeutic strategies, making accurate segmentation a key component of radiological workflows. However, challenges posed by variations in MRI modalities, image artifacts, and the scarcity of labeled data complicate the segmentation task and impact the performance of traditional models. To overcome these limitations, we propose a novel self-supervised learning Multi-encoder nnU-Net architecture designed to process multiple MRI modalities independently through separate encoders. This approach allows the model to capture modality-specific features before fusing them for the final segmentation, thus improving accuracy. Our Multi-encoder nnU-Net demonstrates exceptional performance, achieving a Dice Similarity Coefficient (DSC) of 93.72%, which surpasses that of other models such as vanilla nnU-Net, SegResNet, and Swin UNETR. By leveraging the unique information provided by each modality, the model enhances segmentation tasks, particularly in scenarios with limited annotated data. Evaluations highlight the effectiveness of this architecture in improving tumor segmentation outcomes.
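The Dice Similarity Coefficient reported here measures overlap between a predicted mask and a reference mask as 2|P ∩ T| / (|P| + |T|). The sketch below computes it for flattened binary masks; the function name, the toy masks, and the convention that two empty masks score 1.0 are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flat binary masks of equal length."""
    pred_set = {i for i, v in enumerate(pred) if v}
    target_set = {i for i, v in enumerate(target) if v}
    if not pred_set and not target_set:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return 2 * len(pred_set & target_set) / (len(pred_set) + len(target_set))


# Toy flattened masks: 1 marks a tumor voxel, 0 marks background.
pred_mask = [1, 1, 0, 0, 1]
true_mask = [1, 0, 0, 1, 1]
print(f"DSC = {dice_coefficient(pred_mask, true_mask):.4f}")
```

For multi-class segmentation the score is usually computed per class and then averaged; the abstract does not say how the 93.72% figure was aggregated.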

EnrichIndex: Using LLMs to Enrich Retrieval Indices Offline

Peter Baile Chen,Tomer Wolfson,Michael Cafarella,Dan Roth

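Task: Propose EnrichIndex, which uses a large language model (LLM) offline to build semantically enriched retrieval indices and improve retrieval performance.

Motivation: Existing retrieval systems work well when the query language matches the target documents, but fall short when relevance must be inferred implicitly (e.g., technical texts or tables), and online LLM-based retrieval suffers from high latency and computation cost.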

Details

Method: Uses an LLM offline to make a single pass over all documents in the retrieval corpus, building semantically enriched indices that can be combined with existing online retrieval methods. Result: Across five retrieval tasks, EnrichIndex improves recall@10 by 11.7 points and NDCG@10 by 10.6 points on average, and processes 293.3 times fewer tokens in online LLM calls. Conclusion: By exploiting LLM reasoning offline, EnrichIndex improves retrieval performance while cutting online latency and cost. Abstract: Existing information retrieval systems excel in cases where the language of target documents closely matches that of the user query. However, real-world retrieval systems are often required to implicitly reason whether a document is relevant. For example, when retrieving technical texts or tables, their relevance to the user query may be implied through a particular jargon or structure, rather than explicitly expressed in their content. Large language models (LLMs) hold great potential in identifying such implied relevance by leveraging their reasoning skills. Nevertheless, current LLM-augmented retrieval is hindered by high latency and computation cost, as the LLM typically computes the query-document relevance online, for every query anew. To tackle this issue we introduce EnrichIndex, a retrieval approach which instead uses the LLM offline to build semantically-enriched retrieval indices, by performing a single pass over all documents in the retrieval corpus once during ingestion time. Furthermore, the semantically-enriched indices can complement existing online retrieval approaches, boosting the performance of LLM re-rankers. We evaluated EnrichIndex on five retrieval tasks, involving passages and tables, and found that it outperforms strong online LLM-based retrieval systems, with an average improvement of 11.7 points in recall @ 10 and 10.6 points in NDCG @ 10 compared to strong baselines. In terms of online calls to the LLM, it processes 293.3 times fewer tokens which greatly reduces the online latency and cost. Overall, EnrichIndex is an effective way to build better retrieval indices offline by leveraging the strong reasoning skills of LLMs.
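The two metrics quoted in this abstract, recall@10 and NDCG@10, can be computed from a ranked list of document IDs and a set of relevant IDs. The sketch below assumes binary relevance and a ranking without duplicates; the function names and the sample ranking are illustrative, and the paper may use graded relevance or a different gain function.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k results."""
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: discounted gain of the ranking over the ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank 0 gets discount log2(2) = 1
        for rank, doc in enumerate(ranked_ids[:k])
        if doc in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Hypothetical ranking for one query; d1 and d2 are the relevant documents.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranking, relevant, k=3), ndcg_at_k(ranking, relevant, k=3))
```

Both metrics are reported per query and then averaged over the query set, so an "11.7 point" gain refers to a difference in those averages on a 0 to 100 scale.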

Sheng Lian,Dengfeng Pan,Jianlong Cai,Guang-Yong Chen,Zhun Zhong,Zhiming Luo,Shen Zhao,Shuo Li

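Task: Propose ATM-Net, a framework for fine-grained segmentation of lumbar substructures: vertebrae (VBs), intervertebral discs (IDs), and the spinal canal (SC).

Motivation: Existing methods typically use coarse-grained segmentation that lacks the detail needed for precise diagnosis, and vision-only models struggle to capture anatomical semantics, leading to misclassified categories and poor segmentation detail.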

Details

Method: ATM-Net uses the Anatomy-aware Text Prompt Generator (ATPG) to convert image annotations into anatomy-aware prompts in different views, fuses them with image features through the Holistic Anatomy-aware Semantic Fusion (HASF) module, and uses the Channel-wise Contrastive Anatomy-Aware Enhancement (CCAE) module to sharpen class discrimination and segmentation accuracy. Result: On the MRSpineSeg and SPIDER datasets, ATM-Net significantly outperforms existing methods; on SPIDER it reaches a Dice of 79.39% and an HD95 of 9.91 pixels, beating SpineParseNet by 8.31% and 4.14 pixels respectively. Conclusion: Through multi-modal fusion and contrastive learning, ATM-Net improves the accuracy and detail of lumbar substructure segmentation. Abstract: Accurate lumbar spine segmentation is crucial for diagnosing spinal disorders. Existing methods typically use coarse-grained segmentation strategies that lack the fine detail needed for precise diagnosis. Additionally, their reliance on visual-only models hinders the capture of anatomical semantics, leading to misclassified categories and poor segmentation details. To address these limitations, we present ATM-Net, an innovative framework that employs an anatomy-aware, text-guided, multi-modal fusion mechanism for fine-grained segmentation of lumbar substructures, i.e., vertebrae (VBs), intervertebral discs (IDs), and spinal canal (SC). ATM-Net adopts the Anatomy-aware Text Prompt Generator (ATPG) to adaptively convert image annotations into anatomy-aware prompts in different views. These insights are further integrated with image features via the Holistic Anatomy-aware Semantic Fusion (HASF) module, building a comprehensive anatomical context. The Channel-wise Contrastive Anatomy-Aware Enhancement (CCAE) module further enhances class discrimination and refines segmentation through class-wise channel-level multi-modal contrastive learning. Extensive experiments on the MRSpineSeg and SPIDER datasets demonstrate that ATM-Net significantly outperforms state-of-the-art methods, with consistent improvements regarding class discrimination and segmentation details. For example, ATM-Net achieves Dice of 79.39% and HD95 of 9.91 pixels on SPIDER, outperforming the competitive SpineParseNet by 8.31% and 4.14 pixels, respectively.

APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

Akshara Prabhakar,Zuxin Liu,Weiran Yao,Jianguo Zhang,Ming Zhu,Shiyu Wang,Zhiwei Liu,Tulika Awalgaonkar,Haolin Chen,Thai Hoang,Juan Carlos Niebles,Shelby Heinecke,Huan Wang,Silvio Savarese,Caiming Xiong

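Task: Propose APIGen-MT, a framework for generating high-quality multi-turn interaction data to train AI agents.

Motivation: High-quality multi-turn interaction data is scarce and expensive to collect manually, which limits how well AI agents can be trained.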

Details

Method: A two-phase framework: the first phase produces task blueprints using a committee of LLM reviewers and iterative feedback; the second turns the blueprints into complete interaction trajectories through simulated human-agent interplay. Result: The trained xLAM-2-fc-r models outperform GPT-4o and Claude 3.5 on the τ-bench and BFCL benchmarks, with smaller models doing better in multi-turn settings and showing higher consistency. Conclusion: The verified blueprint-to-details approach yields high-quality training data, supporting more reliable, efficient, and capable AI agents; the data and models are open-sourced. Abstract: Training effective AI agents for multi-turn interactions requires high-quality data that captures realistic human-agent dynamics, yet such data is scarce and expensive to collect manually. We introduce APIGen-MT, a two-phase framework that generates verifiable and diverse multi-turn agent data. In the first phase, our agentic pipeline produces detailed task blueprints with ground-truth actions, leveraging a committee of LLM reviewers and iterative feedback loops. These blueprints are then transformed into complete interaction trajectories through simulated human-agent interplay. We train a family of models -- the xLAM-2-fc-r series with sizes ranging from 1B to 70B parameters. Our models outperform frontier models such as GPT-4o and Claude 3.5 on τ-bench and BFCL benchmarks, with the smaller models surpassing their larger counterparts, particularly in multi-turn settings, while maintaining superior consistency across multiple trials. Comprehensive experiments demonstrate that our verified blueprint-to-details approach yields high-quality training data, enabling the development of more reliable, efficient, and capable agents. We open-source both the synthetic data collected and the trained xLAM-2-fc-r models to advance research in AI agents. Models are available on HuggingFace at https://huggingface.co/collections/Salesforce/xlam-2-67ef5be12949d8dcdae354c4 and project website is https://apigen-mt.github.io

BUFF: Bayesian Uncertainty Guided Diffusion Probabilistic Model for Single Image Super-Resolution

Zihao He,Shengchuan Zhang,Runze Hu,Yunhang Shen,Yan Zhang

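Task: Propose the Bayesian Uncertainty Guided Diffusion Probabilistic Model (BUFF) for image super-resolution.

Motivation: Existing diffusion methods based on Gaussian noise models fall short on the complex and variable textures of natural scenes.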

Details

Method: Introduces a Bayesian network that generates high-resolution uncertainty masks, which guide the diffusion process to adapt the noise intensity. Result: In evaluations conducted on the DIV2K dataset, BUFF improves SSIM on BSD100 by 0.61 and exceeds traditional diffusion approaches by an average additional 0.20 dB in PSNR. Conclusion: Bayesian methods show promise for enhancing diffusion processes in super-resolution, opening a new direction for future work. Abstract: Super-resolution (SR) techniques are critical for enhancing image quality, particularly in scenarios where high-resolution imagery is essential yet limited by hardware constraints. Existing diffusion models for SR have relied predominantly on Gaussian models for noise generation, which often fall short when dealing with the complex and variable texture inherent in natural scenes. To address these deficiencies, we introduce the Bayesian Uncertainty Guided Diffusion Probabilistic Model (BUFF). BUFF distinguishes itself by incorporating a Bayesian network to generate high-resolution uncertainty masks. These masks guide the diffusion process, allowing for the adjustment of noise intensity in a manner that is both context-aware and adaptive. This novel approach not only enhances the fidelity of super-resolved images to their original high-resolution counterparts but also significantly mitigates artifacts and blurring in areas characterized by complex textures and fine details. The model demonstrates exceptional robustness against complex noise patterns and showcases superior adaptability in handling textures and edges within images. Empirical evidence, supported by visual results, illustrates the model's robustness, especially in challenging scenarios, and its effectiveness in addressing common SR issues such as blurring. Experimental evaluations conducted on the DIV2K dataset reveal that BUFF achieves a notable improvement, with a +0.61 increase compared to baseline in SSIM on BSD100, surpassing traditional diffusion approaches by an average additional +0.20dB PSNR gain. These findings underscore the potential of Bayesian methods in enhancing diffusion processes for SR, paving the way for future advancements in the field.
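To put the reported PSNR gain in context, PSNR is defined from the mean squared error as 10·log10(MAX² / MSE). The sketch below computes it for flat lists of pixel intensities; the function name, the 8-bit peak value of 255, and the sample pixels are assumptions for illustration, and SR papers often evaluate on the luminance channel only.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4-pixel "images" with small reconstruction errors.
ground_truth = [0, 255, 128, 64]
super_resolved = [0, 250, 130, 64]
print(f"PSNR = {psnr(ground_truth, super_resolved):.2f} dB")

# A gain of g dB at fixed peak value scales the MSE by 10 ** (-g / 10).
mse_ratio = 10 ** (-0.20 / 10)
print(f"+0.20 dB corresponds to an MSE ratio of about {mse_ratio:.3f}")
```

Because the scale is logarithmic, a 0.20 dB gain corresponds to roughly a 4.5% reduction in mean squared error.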

AIR: A Systematic Analysis of Annotations, Instructions, and Response Pairs in Preference Dataset

Bingxiang He,Wenbin Zhang,Jiaxi Song,Cheng Qian,Zixuan Fu,Bowen Sun,Ning Ding,Haiwen Hong,Longtao Huang,Hui Xue,Ganqu Cui,Wanxiang Che,Zhiyuan Liu,Maosong Sun

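Task: Propose AIR, a component-wise analysis framework that systematically isolates and optimizes the three core components of preference learning (preference annotations, instructions, and response pairs).

Motivation: Current approaches conflate these three components, obscuring their individual effects and hindering systematic optimization.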

Details

Method: Within the AIR framework, separately optimizes preference annotations (point-wise generative scoring), instructions (variance-based filtering), and response pairs (moderate margins plus high absolute scores), and evaluates their synergistic effects. Result: AIR yields an average gain of 5.3 points over the baseline with only 14k high-quality pairs. Conclusion: AIR offers a blueprint for moving preference dataset design from ad hoc scaling to component-aware optimization, enabling efficient, reproducible alignment. Abstract: Preference learning is critical for aligning large language models (LLMs) with human values, yet its success hinges on high-quality datasets comprising three core components: Preference Annotations, Instructions, and Response Pairs. Current approaches conflate these components, obscuring their individual impacts and hindering systematic optimization. In this work, we propose AIR, a component-wise analysis framework that systematically isolates and optimizes each component while evaluating their synergistic effects. Through rigorous experimentation, AIR reveals actionable principles: annotation simplicity (point-wise generative scoring), instruction inference stability (variance-based filtering across LLMs), and response pair quality (moderate margins + high absolute scores). When combined, these principles yield +5.3 average gains over baseline method, even with only 14k high-quality pairs. Our work shifts preference dataset design from ad hoc scaling to component-aware optimization, offering a blueprint for efficient, reproducible alignment.

LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders

Ilan Naiman,Emanuel Ben-Baruch,Oron Anschel,Alon Shoshan,Igor Kviatkovsky,Manoj Aggarwal,Gerard Medioni

Task: Proposes LV-MAE, a self-supervised learning framework for long video representation.

Motivation: Existing methods are typically pre-trained on short-video datasets and cannot effectively handle both the short-span and long-span dependencies in long videos.

Details

Method: Decouples short- and long-span dependencies into two tasks: short-span spatiotemporal primitives are encoded first and then used to capture long-range dependencies. Off-the-shelf multimodal encoders extract short-segment representations, and a masked-embedding autoencoder is pre-trained to capture high-level interactions across segments. Result: Achieves state-of-the-art results on three long-video benchmarks: LVU, COIN, and Breakfast. Conclusion: LV-MAE is an efficient approach to long video representation learning that can process longer videos and supports self-supervised pre-training. Abstract: In this work, we introduce long-video masked-embedding autoencoders (LV-MAE), a self-supervised learning framework for long video representation. Our approach treats short- and long-span dependencies as two separate tasks. Such decoupling allows for a more intuitive video processing where short-span spatiotemporal primitives are first encoded and are then used to capture long-range dependencies across consecutive video segments. To achieve this, we leverage advanced off-the-shelf multimodal encoders to extract representations from short segments within the long video, followed by pre-training a masked-embedding autoencoder capturing high-level interactions across segments. LV-MAE is highly efficient to train and enables the processing of much longer videos by alleviating the constraint on the number of input frames. Furthermore, unlike existing methods that typically pre-train on short-video datasets, our approach offers self-supervised pre-training using long video samples (e.g., 20+ minutes video clips) at scale. Using LV-MAE representations, we achieve state-of-the-art results on three long-video benchmarks -- LVU, COIN, and Breakfast -- employing only a simple classification head for either attentive or linear probing. Finally, to assess LV-MAE pre-training and visualize its reconstruction quality, we leverage the video-language aligned space of short video representations to monitor LV-MAE through video-text retrieval.

Multilingual Retrieval-Augmented Generation for Knowledge-Intensive Task

Leonardo Ranaldi,Barry Haddow,Alexandra Birch

Task: Investigates the effectiveness of multilingual retrieval-augmented generation (RAG) for open-domain question answering.

Motivation: RAG performs well in monolingual settings, especially English, but its use in multilingual tasks remains unexplored.

Details

Method: Evaluates three multilingual RAG strategies: question translation (tRAG), multilingual retrieval (MultiRAG), and cross-lingual RAG (CrossRAG). Result: tRAG has limited coverage; MultiRAG is efficient but introduces inconsistencies; CrossRAG significantly improves performance by translating the retrieved documents. Conclusion: CrossRAG performs best on multilingual knowledge-intensive tasks, benefiting both high-resource and low-resource languages. Abstract: Retrieval-augmented generation (RAG) has become a cornerstone of contemporary NLP, enhancing large language models (LLMs) by allowing them to access richer factual contexts through in-context retrieval. While effective in monolingual settings, especially in English, its use in multilingual tasks remains unexplored. This paper investigates the effectiveness of RAG across multiple languages by proposing novel approaches for multilingual open-domain question-answering. We evaluate the performance of various multilingual RAG strategies, including question-translation (tRAG), which translates questions into English before retrieval, and Multilingual RAG (MultiRAG), where retrieval occurs directly across multiple languages. Our findings reveal that tRAG, while useful, suffers from limited coverage. In contrast, MultiRAG improves efficiency by enabling multilingual retrieval but introduces inconsistencies due to cross-lingual variations in the retrieved content. To address these issues, we propose Crosslingual RAG (CrossRAG), a method that translates retrieved documents into a common language (e.g., English) before generating the response. Our experiments show that CrossRAG significantly enhances performance on knowledge-intensive tasks, benefiting both high-resource and low-resource languages.
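
The CrossRAG idea described in the abstract (retrieve in any language, translate the retrieved documents into one pivot language, then generate) can be sketched as a prompt-building step. This is only an illustration under stated assumptions: `build_crossrag_prompt`, the prompt layout, and the toy retriever and translator are hypothetical stand-ins, not the paper's code or any real library API.

```python
from typing import Callable, List


def build_crossrag_prompt(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    translate: Callable[[str, str], str],
    pivot_lang: str = "en",
    top_k: int = 3,
) -> str:
    """Retrieve documents in whatever language they exist, translate each
    one into a common pivot language, and assemble the generation prompt."""
    documents = retrieve(question, top_k)
    translated = [translate(doc, pivot_lang) for doc in documents]
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(translated))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy stand-ins so the sketch runs without a retriever or an MT system.
TOY_CORPUS = [
    "La tour Eiffel mesure 330 mètres.",
    "Der Eiffelturm steht in Paris.",
]
TOY_TRANSLATIONS = {
    "La tour Eiffel mesure 330 mètres.": "The Eiffel Tower is 330 metres tall.",
    "Der Eiffelturm steht in Paris.": "The Eiffel Tower is located in Paris.",
}


def toy_retrieve(question: str, top_k: int) -> List[str]:
    return TOY_CORPUS[:top_k]


def toy_translate(text: str, target_lang: str) -> str:
    return TOY_TRANSLATIONS.get(text, text)


prompt = build_crossrag_prompt(
    "Quelle est la hauteur de la tour Eiffel ?", toy_retrieve, toy_translate
)
```

Passing the retriever and translator in as callables keeps the sketch independent of any particular search index or translation model.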

FADConv: A Frequency-Aware Dynamic Convolution for Farmland Non-agriculturalization Identification and Segmentation

Tan Shu,Li Shen

Task: Improves the segmentation of cropland and non-cropland areas by proposing Frequency-Aware Dynamic Convolution (FADConv) and a Frequency Attention (FAT) module.

Motivation: Cropland non-agriculturalization threatens food security and agricultural sustainability, and existing dynamic convolution methods lose information through Global Average Pooling (GAP), which limits segmentation accuracy.

Details

Method: Designs FADConv around the 2D Discrete Cosine Transform (2D DCT) and introduces a FAT module that generates high-quality attention weights in place of the traditional GAP step. Result: On the GID and Hi-CNA datasets, FADConv significantly improves segmentation accuracy; for example, ResNet18 with FADConv raises F1-score and IoU on GID by 1.9% and 2.7%, respectively. Conclusion: FADConv outperforms other dynamic convolution methods on cropland segmentation with little computational overhead. Abstract: Cropland non-agriculturalization refers to the conversion of arable land into non-agricultural uses such as forests, residential areas, and construction sites. This phenomenon not only directly leads to the loss of cropland resources but also poses systemic threats to food security and agricultural sustainability. Accurate identification of cropland and non-cropland areas is crucial for detecting and addressing this issue. Traditional CNNs employ static convolution layers, while dynamic convolution studies demonstrate that adaptively weighting multiple convolutional kernels through attention mechanisms can enhance accuracy. However, existing dynamic convolution methods relying on Global Average Pooling (GAP) for attention weight allocation suffer from information loss, limiting segmentation precision. This paper proposes Frequency-Aware Dynamic Convolution (FADConv) and a Frequency Attention (FAT) module to address these limitations. Building upon the foundational structure of dynamic convolution, we designed FADConv by integrating 2D Discrete Cosine Transform (2D DCT) to capture frequency domain features and fuse them. The FAT module generates high-quality attention weights that replace the traditional GAP method, making the combination between dynamic convolution kernels more reasonable. Experiments on the GID and Hi-CNA datasets demonstrate that FADConv significantly improves segmentation accuracy with minimal computational overhead. For instance, ResNet18 with FADConv achieves 1.9% and 2.7% increases in F1-score and IoU for cropland segmentation on GID, with only 58.87M additional MAdds. Compared to other dynamic convolution approaches, FADConv exhibits superior performance in cropland segmentation tasks.
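
A short sketch of why GAP loses information that a 2D DCT keeps: GAP is, up to a constant factor, the single lowest-frequency DCT coefficient, so two feature maps with the same mean but different spatial layouts look identical to it. The code uses an unnormalized DCT-II and toy 2x2 maps; it is an illustration of the general idea, not FADConv or the FAT module themselves, and the helper names are assumptions.

```python
import math


def dct2_coefficient(x, u, v):
    """Unnormalized 2D DCT-II coefficient (u, v) of a 2D list x."""
    h, w = len(x), len(x[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            total += (
                x[i][j]
                * math.cos(math.pi * (i + 0.5) * u / h)
                * math.cos(math.pi * (j + 0.5) * v / w)
            )
    return total


def global_average_pool(x):
    """Mean over all spatial positions of a single-channel feature map."""
    h, w = len(x), len(x[0])
    return sum(sum(row) for row in x) / (h * w)


top_heavy = [[1.0, 1.0], [0.0, 0.0]]     # activation in the top row
bottom_heavy = [[0.0, 0.0], [1.0, 1.0]]  # same values, bottom row

# GAP cannot tell the two maps apart ...
gap_top = global_average_pool(top_heavy)
gap_bottom = global_average_pool(bottom_heavy)

# ... because it only sees the (0, 0) coefficient, rescaled by 1 / (H * W).
dc_top = dct2_coefficient(top_heavy, 0, 0) / 4

# The first vertical-frequency coefficient separates them by sign.
vertical_top = dct2_coefficient(top_heavy, 1, 0)
vertical_bottom = dct2_coefficient(bottom_heavy, 1, 0)
```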

Align to Structure: Aligning Large Language Models with Structural Information

Zae Myung Kim,Anand Ramachandran,Farideh Tavazoee,Joo-Kyung Kim,Oleg Rokhlenko,Dongyeop Kang

Task: Proposes Structural Alignment, a method that improves long-form text generation by aligning language models with human discourse structures.

Motivation: Large language models lack hierarchical planning and structured organization when generating long texts.

Details

Method: Integrates linguistically grounded discourse frameworks into reinforcement learning, using a dense reward scheme within a Proximal Policy Optimization framework and evaluating two complementary reward models. Result: Outperforms both standard and RLHF-enhanced models on tasks such as essay generation and long-document summarization. Conclusion: Structural Alignment markedly improves the coherence and organization of long texts; the training data and code will be shared publicly. Abstract: Generating long, coherent text remains a challenge for large language models (LLMs), as they lack hierarchical planning and structured organization in discourse generation. We introduce Structural Alignment, a novel method that aligns LLMs with human-like discourse structures to enhance long-form text generation. By integrating linguistically grounded discourse frameworks into reinforcement learning, our approach guides models to produce coherent and well-organized outputs. We employ a dense reward scheme within a Proximal Policy Optimization framework, assigning fine-grained, token-level rewards based on the discourse distinctiveness relative to human writing. Two complementary reward models are evaluated: the first improves readability by scoring surface-level textual features to provide explicit structuring, while the second reinforces deeper coherence and rhetorical sophistication by analyzing global discourse patterns through hierarchical discourse motifs, outperforming both standard and RLHF-enhanced models in tasks such as essay generation and long-document summarization. All training data and code will be publicly shared at https://github.com/minnesotanlp/struct_align.

Gianluca Monaci,Rafael S. Rezende,Romain Deffayet,Gabriela Csurka,Guillaume Bono,Hervé Déjean,Stéphane Clinchant,Christian Wolf

Task: Proposes a retrieval-augmented agent for navigation tasks that can exploit information collected during earlier operations.

Motivation: In realistic settings, an agent should be able to use information gathered during earlier operations instead of starting every episode with a clean memory.

Details

Method: Introduces a retrieval-augmented agent, trained with reinforcement learning, that queries a database of previous episodes and integrates the retrieved context; retrieval and context encoding are data-driven and build on vision foundation models. Result: Retrieval enables zero-shot transfer across tasks and environments while significantly improving performance. Conclusion: Retrieval augmentation offers a clear advantage in navigation by making effective use of historical information. Abstract: Methods for navigation based on large-scale learning typically treat each episode as a new problem, where the agent is spawned with a clean memory in an unknown environment. While these generalization capabilities to an unknown environment are extremely important, we claim that, in a realistic setting, an agent should have the capacity of exploiting information collected during earlier robot operations. We address this by introducing a new retrieval-augmented agent, trained with RL, capable of querying a database collected from previous episodes in the same environment and learning how to integrate this additional context information. We introduce a unique agent architecture for the general navigation task, evaluated on ObjectNav, ImageNav and Instance-ImageNav. Our retrieval and context encoding methods are data-driven and heavily employ vision foundation models (FM) for both semantic and geometric understanding. We propose new benchmarks for these settings and we show that retrieval allows zero-shot transfer across tasks and environments while significantly improving performance.

Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models

NVIDIA: Aaron Blakeman,Aarti Basant,Abhinav Khattar,Adithya Renduchintala,Akhiad Bercovich,Aleksander Ficek,Alexis Bjorlin,Ali Taghibakhshi,Amala Sanjay Deshmukh,Ameya Sunil Mahabaleshwarkar,Andrew Tao,Anna Shors,Ashwath Aithal,Ashwin Poojary,Ayush Dattagupta,Balaram Buddharaju,Bobby Chen,Boris Ginsburg,Boxin Wang,Brandon Norick,Brian Butterfield,Bryan Catanzaro,Carlo del Mundo,Chengyu Dong,Christine Harvey,Christopher Parisien,Dan Su,Daniel Korzekwa,Danny Yin,Daria Gitman,David Mosallanezhad,Deepak Narayanan,Denys Fridman,Dima Rekesh,Ding Ma,Dmytro Pykhtar,Dong Ahn,Duncan Riach,Dusan Stosic,Eileen Long,Elad Segal,Ellie Evans,Eric Chung,Erick Galinkin,Evelina Bakhturina,Ewa Dobrowolska,Fei Jia,Fuxiao Liu,Gargi Prasad,Gerald Shen,Guilin Liu,Guo Chen,Haifeng Qian,Helen Ngo,Hongbin Liu,Hui Li,Igor Gitman,Ilia Karmanov,Ivan Moshkov,Izik Golan,Jan Kautz,Jane Polak Scowcroft,Jared Casper,Jarno Seppanen,Jason Lu,Jason Sewall,Jiaqi Zeng,Jiaxuan You,Jimmy Zhang,Jing Zhang,Jining Huang,Jinze Xue,Jocelyn Huang,Joey Conway,John Kamalu,Jon Barker,Jonathan Cohen,Joseph Jennings,Jupinder Parmar,Karan Sapra,Kari Briski,Kateryna Chumachenko,Katherine Luna,Keshav Santhanam,Kezhi Kong,Kirthi Sivamani,Krzysztof Pawelec,Kumar Anik,Kunlun Li,Lawrence McAfee,Leon Derczynski,Lindsey Pavao,Luis Vega,Lukas Voegtle,Maciej Bala,Maer Rodrigues de Melo,Makesh Narsimhan Sreedhar,Marcin Chochowski,Markus Kliegl,Marta Stepniewska-Dziubinska,Matthieu Le,Matvei Novikov,Mehrzad Samadi,Michael Andersch,Michael Evans,Miguel Martinez,Mike Chrzanowski,Mike Ranzinger,Mikolaj Blaz,Misha Smelyanskiy,Mohamed Fawzy,Mohammad Shoeybi,Mostofa Patwary,Nayeon Lee,Nima Tajbakhsh,Ning Xu,Oleg Rybakov,Oleksii Kuchaiev,Olivier Delalleau,Osvald Nitski,Parth Chadha,Pasha Shamis,Paulius Micikevicius,Pavlo Molchanov,Peter Dykas,Philipp Fischer,Pierre-Yves Aquilanti,Piotr Bialecki,Prasoon Varshney,Pritam Gundecha,Przemek Tredak,Rabeeh Karimi,Rahul Kandu,Ran El-Yaniv,Raviraj Joshi,Roger Waleffe,Ruoxi Zhang,Sabrina Kavanaugh,Sahil Jain,Samuel Kriman,Sangkug Lym,Sanjeev Satheesh,Saurav Muralidharan,Sean Narenthiran,Selvaraj Anandaraj,Seonmyeong Bak,Sergey Kashirsky,Seungju Han,Shantanu Acharya,Shaona Ghosh,Sharath Turuvekere Sreenivas,Sharon Clay,Shelby Thomas,Shrimai Prabhumoye,Shubham Pachori,Shubham Toshniwal,Shyamala Prayaga,Siddhartha Jain,Sirshak Das,Slawek Kierat,Somshubra Majumdar,Song Han,Soumye Singhal,Sriharsha Niverty,Stefania Alborghetti,Suseella Panguluri,Swetha Bhendigeri,Syeda Nahida Akter,Szymon Migacz,Tal Shiri,Terry Kong,Timo Roman,Tomer Ronen,Trisha Saar,Tugrul Konuk,Tuomas Rintamaki,Tyler Poon,Ushnish De,Vahid Noroozi,Varun Singh,Vijay Korthikanti,Vitaly Kurin,Wasi Uddin Ahmad,Wei Du,Wei Ping,Wenliang Dai,Wonmin Byeon,Xiaowei Ren,Yao Xu,Yejin Choi,Yian Zhang,Ying Lin,Yoshi Suhara,Zhiding Yu,Zhiqi Li,Zhiyu Li,Zhongbo Zhu,Zhuolin Yang,Zijia Chen

Task: Introduces the Nemotron-H model family, which uses a hybrid Mamba-Transformer architecture to reduce inference cost while maintaining accuracy.

Motivation: As inference-time scaling becomes more important for enhanced reasoning, building models that are efficient at inference becomes critical.

Details

Method: Replaces most self-attention layers in the Transformer with Mamba layers that need constant computation and memory per generated token; introduces the MiniPuzzle compression technique; introduces an FP8-based training recipe. Result: Nemotron-H models match or exceed the accuracy of similarly sized Transformer models while being up to 3× faster at inference; Nemotron-H-47B-Base is 20% faster than the 56B model. Conclusion: Nemotron-H models combine efficient inference with strong accuracy and will be released with support on several platforms. Abstract: As inference-time scaling becomes critical for enhanced reasoning capabilities, it is increasingly becoming important to build models that are efficient to infer. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level. To achieve this goal, we replace the majority of self-attention layers in the common Transformer model architecture with Mamba layers that perform constant computation and require constant memory per generated token. We show that Nemotron-H models offer either better or on-par accuracy compared to other similarly-sized state-of-the-art open-sourced Transformer models (e.g., Qwen-2.5-7B/72B and Llama-3.1-8B/70B), while being up to 3× faster at inference. To further increase inference speed and reduce the memory required at inference time, we created Nemotron-H-47B-Base from the 56B model using a new compression via pruning and distillation technique called MiniPuzzle. Nemotron-H-47B-Base achieves similar accuracy to the 56B model, but is 20% faster to infer. In addition, we introduce an FP8-based training recipe and show that it can achieve on par results with BF16-based training. This recipe is used to train the 56B model. All Nemotron-H models will be released, with support in Hugging Face, NeMo, and Megatron-LM.
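
The memory argument behind replacing attention layers can be made concrete with back-of-the-envelope arithmetic: a self-attention layer caches keys and values for every past token, so its memory grows linearly with sequence length, while a state-space layer keeps a fixed-size state. The sketch below uses the standard KV-cache size formula; all layer counts, head sizes, and the per-layer state size are hypothetical numbers chosen for illustration, not Nemotron-H's actual configuration.

```python
def kv_cache_bytes(attn_layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Memory for cached keys and values (hence the factor 2) across all
    attention layers, for one sequence of seq_len tokens."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value


def fixed_state_bytes(ssm_layers, state_values_per_layer, bytes_per_value=2):
    """Memory for recurrent state that does not depend on sequence length."""
    return ssm_layers * state_values_per_layer * bytes_per_value


# Hypothetical configurations: a pure Transformer with 32 attention layers
# versus a hybrid that keeps only 4 of them and uses 28 state-space layers.
for seq_len in (1_000, 8_000, 64_000):
    transformer = kv_cache_bytes(32, 8, 128, seq_len)
    hybrid = kv_cache_bytes(4, 8, 128, seq_len) + fixed_state_bytes(28, 1_000_000)
    print(f"{seq_len:>6} tokens: transformer {transformer / 1e6:8.1f} MB, "
          f"hybrid {hybrid / 1e6:8.1f} MB")
```

With these made-up numbers the fixed state dominates at short contexts, while the gap widens in the hybrid's favor as the sequence grows, which is the regime that long reasoning traces produce.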

HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration

Boyuan Wang,Runqi Ouyang,Xiaofeng Wang,Zheng Zhu,Guosheng Zhao,Chaojun Ni,Guan Huang,Lihong Liu,Xingang Wang

Task: Proposes HumanDreamer-X, a framework that integrates multi-view human generation and reconstruction into a unified pipeline to improve the geometric consistency and visual fidelity of 3D models.

Motivation: Generating multiple views directly from a single human image produces geometric inconsistencies, such as fragmented or blurred limbs.

Details

Method: Uses 3D Gaussian Splatting (3DGS) as an explicit 3D representation and trains HumanFixer to restore 3DGS renderings; proposes an attention modulation strategy that improves the consistency of geometric details across generated views. Result: Generation and reconstruction PSNR improve by 16.45% and 12.65%, respectively, reaching up to 25.62 dB, and the method generalizes to in-the-wild data and to various reconstruction backbones. Conclusion: HumanDreamer-X markedly improves the quality and consistency of single-image human reconstruction and is broadly applicable. Abstract: Single-image human reconstruction is vital for digital human modeling applications but remains an extremely challenging task. Current approaches rely on generative models to synthesize multi-view images for subsequent 3D reconstruction and animation. However, directly generating multiple views from a single human image suffers from geometric inconsistencies, resulting in issues like fragmented or blurred limbs in the reconstructed models. To tackle these limitations, we introduce HumanDreamer-X, a novel framework that integrates multi-view human generation and reconstruction into a unified pipeline, which significantly enhances the geometric consistency and visual fidelity of the reconstructed 3D models. In this framework, 3D Gaussian Splatting serves as an explicit 3D representation to provide initial geometry and appearance priors. Building upon this foundation, HumanFixer is trained to restore 3DGS renderings, which guarantees photorealistic results. Furthermore, we delve into the inherent challenges associated with attention mechanisms in multi-view human generation, and propose an attention modulation strategy that effectively enhances the consistency of geometric details and identity across views. Experimental results demonstrate that our approach markedly improves generation and reconstruction PSNR quality metrics by 16.45% and 12.65%, respectively, achieving a PSNR of up to 25.62 dB, while also showing generalization capabilities on in-the-wild data and applicability to various human reconstruction backbone models.
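
For readers less familiar with the PSNR figures quoted above, the metric is 10·log10(MAX² / MSE), so higher values mean a smaller mean squared error against the reference image. The sketch below is a generic definition on flat lists of pixel intensities in [0, 1]; it is not the paper's evaluation code, and the pixel values are made up.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized
    sequences of pixel intensities."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


reference = [0.0, 0.0, 0.0, 0.0]
noisy = [0.1, 0.1, 0.1, 0.1]        # every pixel off by 0.1, so MSE = 0.01
value_db = psnr(reference, noisy)   # 10 * log10(1 / 0.01), about 20 dB

# Inverting the formula: the MSE that corresponds to a given PSNR.
mse_at_25_62_db = 10 ** (-25.62 / 10)
```

Because the scale is logarithmic, a percentage improvement in PSNR does not translate into the same percentage reduction in pixel error.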

Bonsai: Interpretable Tree-Adaptive Grounded Reasoning

Kate Sanders,Benjamin Van Durme

Task: Develops a general-purpose collaborative agent system that adapts to new domains and handles uncertainty transparently.

Motivation: Black-box models process data powerfully but lack transparency, domain adaptability, and uncertainty awareness, so they do not meet the needs of reliable AI systems.

Details

Method: Proposes Bonsai, which builds adaptable inference trees by retrieving relevant evidence and computing the likelihoods of sub-claims. Result: Bonsai performs reliably across varied domains (transcripts, photographs, videos, audio, and databases), matching domain-specific black-box methods while producing interpretable, grounded, and uncertainty-aware reasoning traces. Conclusion: Bonsai is a tunable, interpretable, and adaptable reasoning system suited to collaborative agent tasks that require transparency and uncertainty handling. Abstract: To develop general-purpose collaborative agents, humans need reliable AI systems that can (1) adapt to new domains and (2) transparently reason with uncertainty to allow for verification and correction. Black-box models demonstrate powerful data processing abilities but do not satisfy these criteria due to their opaqueness, domain specificity, and lack of uncertainty awareness. We introduce Bonsai, a compositional and probabilistic reasoning system that generates adaptable inference trees by retrieving relevant grounding evidence and using it to compute likelihoods of sub-claims derived from broader natural language inferences. Bonsai's reasoning power is tunable at test-time via evidence scaling and it demonstrates reliable handling of varied domains including transcripts, photographs, videos, audio, and databases. Question-answering and human alignment experiments demonstrate that Bonsai matches the performance of domain-specific black-box methods while generating interpretable, grounded, and uncertainty-aware reasoning traces.

PF3Det: A Prompted Foundation Feature Assisted Visual LiDAR 3D Detector

Kaidong Li,Tianxiao Zhang,Kuan-Chuan Peng,Guanghui Wang

Task: Proposes the Prompted Foundational 3D Detector (PF3Det), which combines LiDAR point clouds and camera images for 3D object detection.

Motivation: Multi-modal methods that combine LiDAR and camera data make detection more robust, but fusing the two modalities efficiently remains challenging, and high-quality labeled data is scarce and expensive.

Details

Method: Builds on the large-scale pre-training of foundation models and on prompt engineering techniques: PF3Det integrates foundation model encoders and soft prompts to enhance LiDAR-camera feature fusion. Result: On the nuScenes dataset, PF3Det achieves state-of-the-art results under limited training data, improving NDS by 1.19% and mAP by 2.42%. Conclusion: PF3Det is efficient for 3D object detection and is particularly advantageous when training data is limited. Abstract: 3D object detection is crucial for autonomous driving, leveraging both LiDAR point clouds for precise depth information and camera images for rich semantic information. Therefore, the multi-modal methods that combine both modalities offer more robust detection results. However, efficiently fusing LiDAR points and images remains challenging due to the domain gaps. In addition, the performance of many models is limited by the amount of high quality labeled data, which is expensive to create. The recent advances in foundation models, which use large-scale pre-training on different modalities, enable better multi-modal fusion. Combining the prompt engineering techniques for efficient training, we propose the Prompted Foundational 3D Detector (PF3Det), which integrates foundation model encoders and soft prompts to enhance LiDAR-camera feature fusion. PF3Det achieves the state-of-the-art results under limited training data, improving NDS by 1.19% and mAP by 2.42% on the nuScenes dataset, demonstrating its efficiency in 3D detection.

Mapping Technological Futures: Anticipatory Discourse Through Text Mining

Maciej Skorski,Alina Landowska,Krzysztof Rajda

Task: Studies anticipatory discourse about emerging technologies (such as artificial intelligence) by analysing social media posts from key opinion leaders (KOLs).

Motivation: The volatility and unpredictability of emerging technologies generate significant uncertainty, and discussion on social media plays an important part in how that uncertainty is framed.

Details

Method: Applies text mining techniques, including BERTopic modelling and sentiment, emotion, and attitude analyses, to 1.5 million posts from 400 KOLs. Result: Identifies 100 topics reflecting tech-driven futures; KOLs play a dual role, framing optimistic visions of the future (for example, AI and IoT) and influencing current societal and geopolitical debates. Conclusion: KOLs play a key role in directing public attention to emerging technologies, especially in periods of heightened uncertainty, and their narratives bridge imagined futures and current realities. Abstract: The volatility and unpredictability of emerging technologies, such as artificial intelligence (AI), generate significant uncertainty, which is widely discussed on social media. This study examines anticipatory discourse surrounding technological futures by analysing 1.5 million posts from 400 key opinion leaders (KOLs) published on the X platform (from 2021 to 2023). Using advanced text mining techniques, including BERTopic modelling, sentiment, emotion, and attitude analyses, the research identifies 100 distinct topics reflecting anticipated tech-driven futures. Our findings emphasize the dual role of KOLs in framing present futures -- optimistic visions of transformative technologies like AI and IoT -- and influencing future presents, where these projections shape contemporary societal and geopolitical debates. Positive emotions such as Hope dominate, outweighing Anxiety, particularly in topics like "Machine Learning, Data Science, and Deep Learning," while discussions around "Climate Change" and "War, Ukraine, and Trump People" elicit Anxiety. By framing technologies as solutions to societal challenges, KOLs act as mediators of societal narratives, bridging imagined futures and current realities. These insights underscore their pivotal role in directing public attention with emerging technologies during periods of heightened uncertainty, advancing our understanding of anticipatory discourse in technology-mediated contexts.

AutoSSVH: Exploring Automated Frame Sampling for Efficient Self-Supervised Video Hashing

Niu Lian,Jun Li,Jinpeng Wang,Ruisheng Luo,Yaowei Wang,Shu-Tao Xia,Bin Chen

Task: Proposes AutoSSVH, a self-supervised video hashing framework that improves hash codes through adversarial frame sampling and hash-based contrastive learning.

Motivation: Existing methods rely on random frame sampling and treat all frames equally, ignoring differences in information density and reconstruction difficulty between frames, which leads to suboptimal hash codes.

Details

Method: Uses an adversarial frame sampling strategy that automatically selects challenging, information-rich frames, combined with a hash component voting strategy and a point-to-set (P2Set) hash-based contrastive objective. Result: Experiments show that AutoSSVH surpasses existing methods in retrieval efficacy and efficiency. Conclusion: By improving frame sampling and contrastive learning, AutoSSVH significantly improves the quality of video hash codes. Abstract: Self-Supervised Video Hashing (SSVH) compresses videos into hash codes for efficient indexing and retrieval using unlabeled training videos. Existing approaches rely on random frame sampling to learn video features and treat all frames equally. This results in suboptimal hash codes, as it ignores frame-specific information density and reconstruction difficulty. To address this limitation, we propose a new framework, termed AutoSSVH, that employs adversarial frame sampling with hash-based contrastive learning. Our adversarial sampling strategy automatically identifies and selects challenging frames with richer information for reconstruction, enhancing encoding capability. Additionally, we introduce a hash component voting strategy and a point-to-set (P2Set) hash-based contrastive objective, which help capture complex inter-video semantic relationships in the Hamming space and improve the discriminability of learned hash codes. Extensive experiments demonstrate that AutoSSVH achieves superior retrieval efficacy and efficiency compared to state-of-the-art approaches. Code is available at https://github.com/EliSpectre/CVPR25-AutoSSVH.
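
To make the retrieval setting concrete, the sketch below shows how binary hash codes are compared once they have been learned: videos are ranked by Hamming distance, which reduces to an XOR and a bit count. The 8-bit codes and clip names are invented for illustration; this shows only the lookup step in the Hamming space, not AutoSSVH's sampling or training.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions in which two hash codes differ."""
    return bin(code_a ^ code_b).count("1")


def retrieve(query_code: int, database: dict, top_k: int = 2) -> list:
    """Return the names of the top_k database videos closest to the query."""
    ranked = sorted(database, key=lambda name: hamming_distance(query_code, database[name]))
    return ranked[:top_k]


# Hypothetical 8-bit hash codes for three indexed clips.
database = {
    "clip_a": 0b10110010,
    "clip_b": 0b10110011,
    "clip_c": 0b01001101,
}

nearest = retrieve(0b10110010, database)
```

Storing codes as integers keeps each comparison to a couple of machine operations, which is the efficiency that makes hashing attractive for large video collections.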

Robustly identifying concepts introduced during chat fine-tuning using crosscoders

Julian Minder,Clement Dumas,Caden Juang,Bilal Chugtai,Neel Nanda

Task: Studies how fine-tuning changes a model's representations and internal algorithms, and proposes improvements that identify the concepts introduced by fine-tuning more accurately.

Motivation: Behaviors introduced during fine-tuning need to be interpreted. Model diffing is a promising tool for this, but existing methods such as crosscoders can misattribute where a concept comes from.

Details

Method: Proposes Latent Scaling to detect the issues caused by the crosscoder's L1 training loss, and trains crosscoders with a BatchTopK loss instead. Result: Experiments show that the improved method identifies chat-specific concepts more accurately, such as "false information" and "personal question". Conclusion: The work improves crosscoder-based model diffing and demonstrates its usefulness for understanding how language model behavior changes. Abstract: Model diffing is the study of how fine-tuning changes a model's representations and internal algorithms. Many behaviours of interest are introduced during fine-tuning, and model diffing offers a promising lens to interpret such behaviors. Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts represented as latent directions in both the base and fine-tuned models, allowing us to track how concepts shift or emerge during fine-tuning. Notably, prior work has observed concepts with no direction in the base model, and it was hypothesized that these model-specific latents were concepts introduced during fine-tuning. However, we identify two issues which stem from the crosscoder's L1 training loss that can misattribute concepts as unique to the fine-tuned model, when they really exist in both models. We develop Latent Scaling to flag these issues by more accurately measuring each latent's presence across models. In experiments comparing Gemma 2 2B base and chat models, we observe that the standard crosscoder suffers heavily from these issues. Building on these insights, we train a crosscoder with BatchTopK loss and show that it substantially mitigates these issues, finding more genuinely chat-specific and highly interpretable concepts. We recommend practitioners adopt similar techniques. Using the BatchTopK crosscoder, we successfully identify a set of genuinely chat-specific latents that are both interpretable and causally effective, representing concepts such as false information and personal question, along with multiple refusal-related latents that show nuanced preferences for different refusal triggers. Overall, our work advances best practices for the crosscoder-based methodology for model diffing and demonstrates that it can provide concrete insights into how chat tuning modifies language model behavior.
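
As background for the BatchTopK loss mentioned in the abstract, the sketch below contrasts a per-sample TopK activation with a BatchTopK activation as the technique is commonly described: instead of keeping exactly k latents per sample, the top n·k activations are kept across a batch of n samples, so sparsity is enforced only on average. This is a generic illustration on plain Python lists under that assumption, not the authors' crosscoder code; the values are toy numbers.

```python
def per_sample_topk(activations, k):
    """Keep the k largest activations in every row, zero the rest."""
    result = []
    for row in activations:
        threshold = sorted(row, reverse=True)[k - 1]
        result.append([v if v >= threshold else 0.0 for v in row])
    return result


def batch_topk(activations, k):
    """Keep the (batch_size * k) largest activations across the whole batch.
    Ties at the threshold are all kept, so the budget can be exceeded slightly."""
    budget = len(activations) * k
    flat = sorted((v for row in activations for v in row), reverse=True)
    if budget <= 0:
        return [[0.0 for _ in row] for row in activations]
    if budget >= len(flat):
        return [list(row) for row in activations]
    threshold = flat[budget - 1]
    return [[v if v >= threshold else 0.0 for v in row] for row in activations]


# Two samples, three latents each (toy values).
latents = [
    [0.9, 0.1, 0.8],
    [0.2, 0.05, 0.3],
]

fixed = per_sample_topk(latents, k=1)
pooled = batch_topk(latents, k=1)
```

In the toy batch, the pooled variant spends both of its two slots on the first sample, which has two strong activations, and none on the second, while the per-sample variant is forced to keep one weak activation in the second sample.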

Robust Human Registration with Body Part Segmentation on Noisy Point Clouds

Kai Lascheit,Daniel Barath,Marc Pollefeys,Leonidas Guibas,Francis Engelmann

Task: Proposes a hybrid approach that incorporates body-part segmentation to register human meshes to 3D point clouds.

Motivation: Noise and background clutter in real-world data often make existing human mesh registration methods imprecise.

Details

Method: First assigns body-part labels to the points of the cloud, then registers the mesh through a two-step SMPL-X fitting: initial pose and orientation estimation, followed by global refinement of the point cloud alignment. Result: Evaluations on InterCap, EgoBody, and BEHAVE show that the method significantly outperforms prior methods in both pose estimation and segmentation accuracy. Conclusion: A hybrid approach with body-part segmentation improves the accuracy of human mesh registration, and the fitted mesh in turn refines the segmentation. Abstract: Registering human meshes to 3D point clouds is essential for applications such as augmented reality and human-robot interaction but often yields imprecise results due to noise and background clutter in real-world data. We introduce a hybrid approach that incorporates body-part segmentation into the mesh fitting process, enhancing both human pose estimation and segmentation accuracy. Our method first assigns body part labels to individual points, which then guide a two-step SMPL-X fitting: initial pose and orientation estimation using body part centroids, followed by global refinement of the point cloud alignment. Additionally, we demonstrate that the fitted human mesh can refine body part labels, leading to improved segmentation. Evaluations on the cluttered and noisy real-world datasets InterCap, EgoBody, and BEHAVE show that our approach significantly outperforms prior methods in both pose estimation and segmentation accuracy. Code and results are available on our project website: https://segfit.github.io

QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding

Binh M. Le,Shaoyuan Xu,Jinmiao Fu,Zhishen Huang,Moyan Li,Yanhui Guo,Hongdong Li,Sameera Ramasinghe,Bryan Wang

Task: Proposes QID, a method that improves how vision-language models identify query-specific regions in visual document understanding tasks.

Motivation: Existing methods struggle to adapt to new datasets when data is scarce, and approaches that directly modify the network architecture have limited effect.

Details

Method: Introduces a dual-module framework: a query-aware module generates a query vector that guides the model's focus, and a query-agnostic module captures the positional relationships among tokens. Result: Experiments on multiple datasets show that QID significantly improves performance, especially on text-rich documents in data-scarce settings. Conclusion: QID is an efficient, architecture-preserving approach that significantly improves visual document understanding. Abstract: In Visual Document Understanding (VDU) tasks, fine-tuning a pre-trained Vision-Language Model (VLM) with new datasets often falls short in optimizing the vision encoder to identify query-specific regions in text-rich document images. Existing methods that directly inject queries into model layers by modifying the network architecture often struggle to adapt to new datasets with limited annotations. To address this, we introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder, leading to notable performance gains, particularly in data-scarce fine-tuning scenarios. Specifically, our approach introduces a dual-module framework: a query-aware module that generates a unique query vector to precisely guide the model's focus, as well as a query-agnostic module that captures the positional relationships among tokens, ensuring robust spatial understanding. Notably, both modules operate independently of the vision attention blocks, facilitating targeted learning of query embeddings and enhancing visual semantic identification. Experiments with OCR-free VLMs across multiple datasets demonstrate significant performance improvements using our method, especially in handling text-rich documents in data-scarce environments.

Multimodal Diffusion Bridge with Attention-Based SAR Fusion for Satellite Image Cloud Removal

Yuyang Hu,Suhas Lohit,Ulugbek S. Kamilov,Tim K. Marks

Task: Proposes DB-CR, a diffusion-bridge method for cloud removal that fuses optical satellite images with synthetic aperture radar (SAR) images.

Motivation: Existing diffusion models start sampling from pure Gaussian noise, which complicates the sampling trajectory and yields suboptimal performance, and current methods fuse multimodal data poorly.

Details

Method: Formulates a diffusion bridge that connects the cloudy and cloud-free image distributions directly, and designs a two-branch backbone with cross-modality fusion blocks to extract and fuse SAR and optical features. Result: Achieves state-of-the-art cloud removal on the SEN12MS-CR dataset while remaining computationally efficient. Conclusion: Through the diffusion bridge and its multimodal architecture, DB-CR significantly improves the quality and efficiency of cloud removal. Abstract: Deep learning has achieved some success in addressing the challenge of cloud removal in optical satellite images, by fusing with synthetic aperture radar (SAR) images. Recently, diffusion models have emerged as powerful tools for cloud removal, delivering higher-quality estimation by sampling from cloud-free distributions, compared to earlier methods. However, diffusion models initiate sampling from pure Gaussian noise, which complicates the sampling trajectory and results in suboptimal performance. Also, current methods fall short in effectively fusing SAR and optical data. To address these limitations, we propose Diffusion Bridges for Cloud Removal, DB-CR, which directly bridges between the cloudy and cloud-free image distributions. In addition, we propose a novel multimodal diffusion bridge architecture with a two-branch backbone for multimodal image restoration, incorporating an efficient backbone and dedicated cross-modality fusion blocks to effectively extract and fuse features from synthetic aperture radar (SAR) and optical images. By formulating cloud removal as a diffusion-bridge problem and leveraging this tailored architecture, DB-CR achieves high-fidelity results while being computationally efficient. We evaluated DB-CR on the SEN12MS-CR cloud-removal dataset, demonstrating that it achieves state-of-the-art results.

Language Models Guidance with Multi-Aspect-Cueing: A Case Study for Competitor Analysis

Amir Hadifar,Christopher Ochs,Arjan Van Ewijk

Task: Improves the performance of large language models (LLMs) in competitive market analysis by incorporating business knowledge.

Motivation: Competitor analysis is essential to strategic planning in modern business, but existing LLMs lack knowledge of the competitive landscape and have inherent limitations in understanding it.

Details

Method: Incorporates business aspects into LLMs and evaluates the effect through quantitative and qualitative experiments. Result: Experiments show that incorporating business knowledge consistently improves model performance and makes competitor analysis more effective. Conclusion: LLMs combined with business knowledge can compensate for their shortcomings in competitive market analysis and deliver better analyses. Abstract: Competitor analysis is essential in modern business due to the influence of industry rivals on strategic planning. It involves assessing multiple aspects and balancing trade-offs to make informed decisions. Recent Large Language Models (LLMs) have demonstrated impressive capabilities to reason about such trade-offs but grapple with inherent limitations such as a lack of knowledge about contemporary or future realities and an incomplete understanding of a market's competitive landscape. In this paper, we address this gap by incorporating business aspects into LLMs to enhance their understanding of a competitive market. Through quantitative and qualitative experiments, we illustrate how integrating such aspects consistently improves model performance, thereby enhancing analytical efficacy in competitor analysis.

Autonomous and Self-Adapting System for Synthetic Media Detection and Attribution

Aref Azizpour,Tai D. Nguyen,Matthew C. Stamm

Task: Proposes an autonomous, self-adaptive synthetic media identification system that detects and attributes synthetic images and automatically identifies new generators.

Motivation: Rapid advances in generative AI produce highly realistic synthetic images that carry risks of disinformation, fraud, and other malicious use, and existing static identification systems cannot cope with new generative models.

Details

Method: Uses an open-set identification strategy with an evolvable embedding space, together with unsupervised clustering that aggregates unknown samples and continuously refines the decision boundaries. Result: Experiments show that the method significantly outperforms existing approaches and adapts as generative models evolve. Conclusion: The method is a key step toward universal, adaptable forensic systems. Abstract: Rapid advances in generative AI have enabled the creation of highly realistic synthetic images, which, while beneficial in many domains, also pose serious risks in terms of disinformation, fraud, and other malicious applications. Current synthetic image identification systems are typically static, relying on feature representations learned from known generators; as new generative models emerge, these systems suffer from severe performance degradation. In this paper, we introduce the concept of an autonomous self-adaptive synthetic media identification system -- one that not only detects synthetic images and attributes them to known sources but also autonomously identifies and incorporates novel generators without human intervention. Our approach leverages an open-set identification strategy with an evolvable embedding space that distinguishes between known and unknown sources. By employing an unsupervised clustering method to aggregate unknown samples into high-confidence clusters and continuously refining its decision boundaries, our system maintains robust detection and attribution performance even as the generative landscape evolves. Extensive experiments demonstrate that our method significantly outperforms existing approaches, marking a crucial step toward universal, adaptable forensic systems in the era of rapidly advancing generative models.
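
The open-set idea in the abstract, attributing a sample to a known source only when it is close enough and otherwise flagging it as unknown, can be illustrated with a nearest-centroid rule plus a rejection threshold. This is a deliberately simple stand-in: the 2D embeddings, source names, Euclidean distance, and threshold are assumptions for the example, and the paper's embedding space and clustering procedure are not reproduced here.

```python
import math


def attribute_source(embedding, centroids, reject_distance):
    """Return the name of the nearest known source, or "unknown" when the
    nearest centroid is farther away than reject_distance."""
    best_name, best_distance = None, math.inf
    for name, centroid in centroids.items():
        distance = math.dist(embedding, centroid)
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name if best_distance <= reject_distance else "unknown"


# Hypothetical 2D embedding centroids for two known sources.
known_sources = {
    "real": (0.0, 0.0),
    "generator_a": (4.0, 0.0),
}

close_to_real = attribute_source((0.2, 0.1), known_sources, reject_distance=1.5)
far_from_all = attribute_source((10.0, 10.0), known_sources, reject_distance=1.5)
```

Samples rejected as unknown are the ones a self-adapting system would buffer and cluster; once a cluster is confident enough, its centroid could be added to `known_sources` as a new generator.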

Ontologies in Design: How Imagining a Tree Reveals Possibilites and Assumptions in Large Language Models

Nava Haghighi,Sunny Yu,James Landay,Daniela Rosner

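Task: Examine the importance of ontologies in generative AI systems and how they can be applied in design.

Motivation: Existing work focuses mostly on values and axiology (e.g., bias) and overlooks the key role of ontologies (what we allow ourselves to think or talk about) in analyzing these systems.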

Details

Method: Proposes four orientations for considering ontologies in design (pluralism, groundedness, liveliness, and enactment) and applies them in two ontological analyses: the responses of LLM-based chatbots and the architecture of an LLM-based agent simulation. Result: Shows the potential of these orientations across the entire LLM development pipeline and the new possibilities they open up. Conclusion: Highlights the importance of ontologies in the design and development of sociotechnical systems, along with the opportunities and limitations of working with them. Abstract: Amid the recent uptake of Generative AI, sociotechnical scholars and critics have traced a multitude of resulting harms, with analyses largely focused on values and axiology (e.g., bias). While value-based analyses are crucial, we argue that ontologies -- concerning what we allow ourselves to think or talk about -- are a vital but under-recognized dimension in analyzing these systems. Proposing a need for a practice-based engagement with ontologies, we offer four orientations for considering ontologies in design: pluralism, groundedness, liveliness, and enactment. We share examples of potentialities that are opened up through these orientations across the entire LLM development pipeline by conducting two ontological analyses: examining the responses of four LLM-based chatbots in a prompting exercise, and analyzing the architecture of an LLM-based agent simulation. We conclude by sharing opportunities and limitations of working with ontologies in the design and development of sociotechnical systems.

VISTA-OCR: Towards generative and interactive end to end OCR models

Laziz Hamdi,Amine Tamasna,Pascal Boisson,Thierry Paquet

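Task: Propose VISTA-OCR, a lightweight architecture that unifies text detection and recognition.

Motivation: Conventional methods handle text detection and recognition with separate modules at high computational cost; VISTA-OCR addresses this with a single unified generative model.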

Details

Method: Built on an encoder-decoder architecture, the model uses a Transformer decoder to generate text transcriptions and their spatial coordinates, and is strengthened through multitask learning and prompt-controllable tasks. Result: VISTA-OCR outperforms existing specialized models on multiple datasets with fewer parameters (150M) and is suited to interactive OCR tasks. Conclusion: VISTA-OCR offers an efficient solution for lightweight, versatile OCR systems with broad application potential. Abstract: We introduce VISTA-OCR (Vision and Spatially-aware Text Analysis OCR), a lightweight architecture that unifies text detection and recognition within a single generative model. Unlike conventional methods that require separate branches with dedicated parameters for text recognition and detection, our approach leverages a Transformer decoder to sequentially generate text transcriptions and their spatial coordinates in a unified branch. Built on an encoder-decoder architecture, VISTA-OCR is progressively trained, starting with the visual feature extraction phase, followed by multitask learning with multimodal token generation. To address the increasing demand for versatile OCR systems capable of advanced tasks, such as content-based text localization, we introduce new prompt-controllable OCR tasks during pre-training. To enhance the model's capabilities, we built a new dataset composed of real-world examples enriched with bounding box annotations and synthetic samples. Although recent Vision Large Language Models (VLLMs) can efficiently perform these tasks, their high computational cost remains a barrier for practical deployment. In contrast, our VISTA$_{\text{omni}}$ variant processes both handwritten and printed documents with only 150M parameters, interactively, by prompting. Extensive experiments on multiple datasets demonstrate that VISTA-OCR achieves better performance compared to state-of-the-art specialized models on standard OCR tasks while showing strong potential for more sophisticated OCR applications, addressing the growing need for interactive OCR systems. All code and annotations for VISTA-OCR will be made publicly available upon acceptance.

LLM Library Learning Fails: A LEGO-Prover Case Study

Ian Berlot-Attwell,Frank Rudzicz,Xujie Si

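Task: Analyze how effectively the LEGO-Prover system learns reusable lemmas for mathematical reasoning.

Motivation: Examine whether LLM library learning (creating, storing, and retrieving reusable functions or lemmas) actually improves task performance and computational efficiency.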

Details

Method: A deep dive into LEGO-Prover that evaluates how its learned lemmas are reused and how this affects task accuracy and computational cost. Result: No evidence of direct reuse of learned lemmas and evidence against soft reuse; once computational cost is accounted for, LEGO-Prover does not improve over a simple prompting baseline. Conclusion: The effectiveness of LLM library learning needs to be re-examined, and stronger evaluation standards are required, including behavioural analysis and equal computational budgets. Abstract: Recent advancements in the coding, reasoning, and tool-using abilities of LLMs have spurred interest in library learning (i.e., online learning through the creation, storage, and retrieval of reusable and composable functions, knowledge, checklists, or lemmas). Such systems often promise improved task performance through the automatic creation of broadly applicable tools, as well as superior computational performance through the caching of reasoning (i.e., the storage of generated tools). However, we find strong reason to be skeptical. We perform a deep dive into one such system, LEGO-Prover, which purports to learn reusable lemmas for mathematical reasoning. We find no evidence of the direct reuse of learned lemmas, and find evidence against the soft reuse of learned lemmas (i.e., reuse by modifying relevant examples). Crucially, we find that LEGO-Prover does not in fact improve over the simple baseline of prompting the model - the improvements in task accuracy vanish once computational cost is accounted for. Our findings suggest that serious misconceptions exist as to the effectiveness of these techniques, that a serious re-examination of the state of LLM-based library learning is required, and that we require much stronger standards for evaluation including behavioural analysis and ensuring that an equal computational budget is used for baselines.

Quantifying the uncertainty of model-based synthetic image quality metrics

Ciaran Bench,Spencer A. Thomas

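Task: Assess how far model-based quality metrics for synthetically generated images can be trusted.

Motivation: Existing metrics such as FID rely on feature embeddings from pretrained auxiliary models, whose effectiveness directly affects the reliability of the result, particularly in domains such as medical imaging.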

Details

Method: Proposes a heuristic based on uncertainty quantification (UQ): Monte Carlo dropout is applied to the feature embedding model (a convolutional autoencoder) and a distribution of FAED values is computed. Result: The predictive variance of the embeddings and the standard deviation of the FAED values correlate with the extent to which inputs are out-of-distribution relative to the model's training data, which provides some validation of the approach for assessing the trustworthiness of the FAED. Conclusion: Uncertainty quantification can be used to assess the reliability of the feature embedding model and so increase trust in synthetic image quality evaluation. Abstract: The quality of synthetically generated images (e.g. those produced by diffusion models) is often evaluated using information about image contents encoded by pretrained auxiliary models. For example, the Fréchet Inception Distance (FID) uses embeddings from an InceptionV3 model pretrained to classify ImageNet. The effectiveness of this feature embedding model has considerable impact on the trustworthiness of the calculated metric (affecting its suitability in several domains, including medical imaging). Here, uncertainty quantification (UQ) is used to provide a heuristic measure of the trustworthiness of the feature embedding model and an FID-like metric called the Fréchet Autoencoder Distance (FAED). We apply Monte Carlo dropout to a feature embedding model (convolutional autoencoder) to model the uncertainty in its embeddings. The distribution of embeddings for each input is then used to compute a distribution of FAED values. We express uncertainty as the predictive variance of the embeddings as well as the standard deviation of the computed FAED values. We find that their magnitude correlates with the extent to which the inputs are out-of-distribution to the model's training data, providing some validation of its ability to assess the trustworthiness of the FAED.
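
The pipeline in this abstract (stochastic embedding passes, one Fréchet-style score per pass, then the spread of those scores) can be sketched in a few lines. The sketch below rests on loudly simplifying assumptions that are not from the paper: covariances are treated as diagonal so that no matrix square root is needed, and `dropout_embed` is a toy stand-in for a dropout-enabled autoencoder, not a real encoder. All function names are hypothetical.

```python
import math
import random
import statistics


def mean_and_var(vectors):
    """Per-dimension mean and population variance of a list of vectors."""
    dims = range(len(vectors[0]))
    mu = [statistics.fmean([v[d] for v in vectors]) for d in dims]
    var = [statistics.pvariance([v[d] for v in vectors]) for d in dims]
    return mu, var


def frechet_distance_diag(real, fake):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    ASSUMPTION: diagonal covariances, so the trace term reduces to
    sum_d (sqrt(var_r[d]) - sqrt(var_f[d]))**2.
    """
    mu_r, var_r = mean_and_var(real)
    mu_f, var_f = mean_and_var(fake)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_f))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_r, var_f))
    return mean_term + cov_term


def dropout_embed(x, rng, p=0.5):
    """Toy stochastic 'encoder': inverted dropout applied to the input."""
    return [xi / (1.0 - p) if rng.random() >= p else 0.0 for xi in x]


def score_distribution(real_embeddings, fake_inputs, n_passes=20, seed=0):
    """Mean and standard deviation of the score over stochastic passes."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_passes):
        fake = [dropout_embed(x, rng) for x in fake_inputs]
        scores.append(frechet_distance_diag(real_embeddings, fake))
    return statistics.fmean(scores), statistics.pstdev(scores)
```

A larger standard deviation returned by `score_distribution` would be read, in the spirit of the abstract, as a hint that the embedding model is less certain about these inputs; the full metric uses full covariance matrices rather than the diagonal shortcut taken here.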

LightPROF: A Lightweight Reasoning Framework for Large Language Model on Knowledge Graph

Tu Ao,Yanhua Yu,Yuling Wang,Yang Deng,Zirui Guo,Liang Pang,Pinghui Wang,Tat-Seng Chua,Xiao Zhang,Zhen Cai

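Task: Propose LightPROF, a lightweight and efficient prompt-learning reasoning framework for knowledge graph question answering (KGQA) that makes full use of the potential of Large Language Models (LLMs).

Motivation: Existing KG-based LLM reasoning methods inject KG knowledge only in textual form, ignoring its structural information, and rely on closed-source models or open-source models with many parameters, which makes them resource-intensive.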

Details

Method: LightPROF follows a "Retrieve-Embed-Reason" process: a retrieval module obtains the reasoning graph from the KG, a Transformer-based Knowledge Adapter extracts and integrates the KG's factual and structural information, and the result is mapped into the LLM's token embedding space to form an LLM-friendly prompt. Result: On two public KGQA benchmarks, LightPROF performs strongly with small-scale LLMs and shows clear advantages in input token count and reasoning time. Conclusion: LightPROF is an efficient, lightweight framework that makes full use of LLMs and is suited to complex reasoning tasks. Abstract: Large Language Models (LLMs) have impressive capabilities in text understanding and zero-shot reasoning. However, delays in knowledge updates may cause them to reason incorrectly or produce harmful results. Knowledge Graphs (KGs) provide rich and reliable contextual information for the reasoning process of LLMs by structurally organizing and connecting a wide range of entities and relations. Existing KG-based LLM reasoning methods only inject KGs' knowledge into prompts in a textual form, ignoring its structural information. Moreover, they mostly rely on closed-source models or open-source models with large parameter counts, which leads to high resource consumption. To address this, we propose a novel Lightweight and efficient Prompt learning-ReasOning Framework for KGQA (LightPROF), which leverages the full potential of LLMs to tackle complex reasoning tasks in a parameter-efficient manner. Specifically, LightPROF follows a "Retrieve-Embed-Reason" process, first accurately and stably retrieving the corresponding reasoning graph from the KG through a retrieval module. Next, through a Transformer-based Knowledge Adapter, it finely extracts and integrates factual and structural information from the KG, then maps this information to the LLM's token embedding space, creating an LLM-friendly prompt to be used by the LLM for the final reasoning. Additionally, LightPROF only requires training the Knowledge Adapter and can be compatible with any open-source LLM. Extensive experiments on two public KGQA benchmarks demonstrate that LightPROF achieves superior performance with small-scale LLMs. Furthermore, LightPROF shows significant advantages in terms of input token count and reasoning time.

An Algebraic Geometry Approach to Viewing Graph Solvability

Federica Arrigoni,Kathlén Kohn,Andrea Fusiello,Tomas Pajdla

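Task: Study a framework based on algebraic geometry for analyzing viewing graph solvability.

Motivation: Viewing graphs matter in structure-from-motion, but the conditions under which they are solvable are not yet well understood.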

Details

Method: Proposes a new framework based on algebraic geometry for analyzing viewing graph solvability problems. Result: Proves a previously proposed conjecture and demonstrates the framework's potential for understanding structure-from-motion graphs. Conclusion: The algebraic geometry framework provides new understanding of, and solutions to, the viewing graph solvability problem. Abstract: The concept of viewing graph solvability has gained significant interest in the context of structure-from-motion. A viewing graph is a mathematical structure where nodes are associated with cameras and edges represent the epipolar geometry connecting overlapping views. Solvability studies under which conditions the cameras are uniquely determined by the graph. In this paper we propose a novel framework for analyzing solvability problems based on Algebraic Geometry, demonstrating its potential in understanding structure-from-motion graphs and proving a conjecture that was previously proposed.

DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments

Yuxiang Zheng,Dayuan Fu,Xiangkun Hu,Xiaojie Cai,Lyumanshan Ye,Pengrui Lu,Pengfei Liu

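Task: Propose DeepResearcher, a framework that trains LLM-based deep research agents with end-to-end reinforcement learning so that they can handle complex interactions in real web environments.

Motivation: Existing approaches (prompt engineering, or reinforcement learning in RAG environments) are either brittle or fail to capture the complexity of real-world interaction, so a more robust solution is needed.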

Details

Method: Uses a multi-agent architecture in which browsing agents, trained with reinforcement learning in real web environments, extract information from webpages and overcome the associated technical challenges. Result: On open-domain research tasks, DeepResearcher improves by up to 28.9 points over prompt-engineering baselines and up to 7.2 points over RAG-based RL agents, and exhibits cognitive behaviors such as planning and cross-validating multiple sources. Conclusion: End-to-end reinforcement learning in real web environments is key to developing robust research capabilities, and DeepResearcher provides an effective framework for it. Abstract: Large Language Models (LLMs) equipped with web search capabilities have demonstrated impressive potential for deep research tasks. However, current approaches predominantly rely on either manually engineered prompts (prompt engineering-based) with brittle performance or reinforcement learning within controlled Retrieval-Augmented Generation (RAG) environments (RAG-based) that fail to capture the complexities of real-world interaction. In this paper, we introduce DeepResearcher, the first comprehensive framework for end-to-end training of LLM-based deep research agents through scaling reinforcement learning (RL) in real-world environments with authentic web search interactions. Unlike RAG-based approaches that assume all necessary information exists within a fixed corpus, our method trains agents to navigate the noisy, unstructured, and dynamic nature of the open web. We implement a specialized multi-agent architecture where browsing agents extract relevant information from various webpage structures and overcome significant technical challenges. Extensive experiments on open-domain research tasks demonstrate that DeepResearcher achieves substantial improvements of up to 28.9 points over prompt engineering-based baselines and up to 7.2 points over RAG-based RL agents. Our qualitative analysis reveals emergent cognitive behaviors from end-to-end RL training, including the ability to formulate plans, cross-validate information from multiple sources, engage in self-reflection to redirect research, and maintain honesty when unable to find definitive answers. Our results highlight that end-to-end training in real-world web environments is not merely an implementation detail but a fundamental requirement for developing robust research capabilities aligned with real-world applications. We release DeepResearcher at https://github.com/GAIR-NLP/DeepResearcher.

Shape My Moves: Text-Driven Shape-Aware Synthesis of Human Motions

Ting-Hsuan Liao,Yi Zhou,Yu Shen,Chun-Hao Paul Huang,Saayan Mitra,Jia-Bin Huang,Uttaran Bhattacharya

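Task: Explore how body shape influences human motion synthesis and generate motions that are consistent with body shape.

Motivation: Existing text-to-motion generation methods often ignore body shape, which distorts the natural correlation between body shape and motion dynamics.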

Details

Method: Uses a finite scalar quantization-based variational autoencoder (FSQ-VAE) to quantize motion into discrete tokens and uses continuous body shape information to de-quantize them back into continuous motion; a pretrained language model predicts the shape parameters and motion tokens. Result: Quantitative and qualitative evaluations and a perceptual study confirm the method's effectiveness in generating shape-aware motions. Conclusion: The method fills the gap left by existing approaches in linking body shape to motion dynamics and produces more natural shape-aware motions. Abstract: We explore how body shapes influence human motion synthesis, an aspect often overlooked in existing text-to-motion generation methods due to the ease of learning a homogenized, canonical body shape. However, this homogenization can distort the natural correlations between different body shapes and their motion dynamics. Our method addresses this gap by generating body-shape-aware human motions from natural language prompts. We utilize a finite scalar quantization-based variational autoencoder (FSQ-VAE) to quantize motion into discrete tokens and then leverage continuous body shape information to de-quantize these tokens back into continuous, detailed motion. Additionally, we harness the capabilities of a pretrained language model to predict both continuous shape parameters and motion tokens, facilitating the synthesis of text-aligned motions and decoding them into shape-aware motions. We evaluate our method quantitatively and qualitatively, and also conduct a comprehensive perceptual study to demonstrate its efficacy in generating shape-aware motions.
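
Finite scalar quantization, which the abstract uses to turn motion into discrete tokens, can be illustrated on a single latent vector: each dimension is squashed into a bounded range and rounded to one of a few integer levels, and the per-dimension codes are combined into one token id. The sketch below is a generic illustration, not the paper's FSQ-VAE: it assumes an odd number of levels per dimension, omits the encoder, decoder, and straight-through gradient, and does not model the shape-conditioned de-quantization. The function names are hypothetical.

```python
import math


def fsq_quantize(latent, levels):
    """Quantize each latent dimension to one of `levels[d]` integer codes.

    ASSUMPTION: every entry of `levels` is odd, so the codes are the
    integers in [-(L - 1) / 2, (L - 1) / 2], symmetric around zero.
    """
    codes = []
    for value, n_levels in zip(latent, levels):
        half = (n_levels - 1) / 2
        bounded = half * math.tanh(value)  # squash into (-half, half)
        codes.append(round(bounded))       # snap to the nearest integer level
    return codes


def codes_to_token(codes, levels):
    """Combine per-dimension codes into a single token id (mixed radix)."""
    token = 0
    for code, n_levels in zip(codes, levels):
        token = token * n_levels + (code + (n_levels - 1) // 2)
    return token


levels = [5, 5, 5]                      # implicit codebook of 5 * 5 * 5 entries
codes = fsq_quantize([0.0, 10.0, -10.0], levels)
print(codes, codes_to_token(codes, levels))
```

The appeal of this design is that the codebook is implicit: no embedding table has to be learned, and the vocabulary size is simply the product of the per-dimension level counts.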

Inherent and emergent liability issues in LLM-based agentic systems: a principal-agent perspective

Garry A. Gabison,R. Patrick Xian

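Task: Analyze the liability issues that can arise from the delegated use of agentic systems based on large language models (LLMs).

Motivation: As LLM-based agentic systems grow more complex and capable, their increasing agency and expanding deployment settings are drawing attention to effective governance policies and to monitoring and control protocols.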

Details

Method: Analyzes the liability issues that LLM agents and their extended systems may raise from a principal-agent perspective, complementing existing risk-based studies on artificial agency. Result: Identifies important aspects of the principal-agent relationship and their potential consequences at deployment, and outlines directions for developing technical governance methods. Conclusion: By laying out the outstanding AI liability issues for LLM-based agentic systems, the work aims to inform system design, auditing, and monitoring approaches that improve transparency and accountability. Abstract: Agentic systems powered by large language models (LLMs) are becoming progressively more complex and capable. Their increasing agency and expanding deployment settings attract growing attention over effective governance policies, monitoring and control protocols. Based on emerging landscapes of the agentic market, we analyze the potential liability issues stemming from delegated use of LLM agents and their extended systems from a principal-agent perspective. Our analysis complements existing risk-based studies on artificial agency and covers the spectrum of important aspects of the principal-agent relationship and their potential consequences at deployment. Furthermore, we motivate method developments for technical governance along the directions of interpretability and behavior evaluations, reward and conflict management, and the mitigation of misalignment and misconduct through principled engineering of detection and fail-safe mechanisms. By illustrating the outstanding issues in AI liability for LLM-based agentic systems, we aim to inform the system design, auditing and monitoring approaches to enhance transparency and accountability.

MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models

Wulin Xie,Yi-Fan Zhang,Chaoyou Fu,Yang Shi,Bingyan Nie,Hongkai Chen,Zhang Zhang,Liang Wang,Tieniu Tan

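Task: Propose a comprehensive evaluation framework for systematically assessing unified multimodal large language models (U-MLLMs).

Motivation: Existing MLLM benchmarks face two main problems when evaluating U-MLLMs: there are no standardized benchmarks for traditional tasks, and there are no benchmarks for mixed-modality generation.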

Details

Method: Designs a framework comprising standardized traditional task evaluation, unified task assessment, and comprehensive model benchmarking. Result: Evaluates 12 leading U-MLLMs and reveals performance gaps of existing models on mixed-modality tasks. Conclusion: More robust models are needed to handle mixed-modality tasks effectively; the code and evaluation data are released for further research. Abstract: Existing MLLM benchmarks face significant challenges in evaluating Unified MLLMs (U-MLLMs) due to: 1) lack of standardized benchmarks for traditional tasks, leading to inconsistent comparisons; 2) absence of benchmarks for mixed-modality generation, which fails to assess multimodal reasoning capabilities. We present a comprehensive evaluation framework designed to systematically assess U-MLLMs. Our benchmark includes: 1. Standardized Traditional Task Evaluation. We sample from 12 datasets, covering 10 tasks with 30 subtasks, ensuring consistent and fair comparisons across studies. 2. Unified Task Assessment. We introduce five novel tasks testing multimodal reasoning, including image editing, commonsense QA with image generation, and geometric reasoning. 3. Comprehensive Model Benchmarking. We evaluate 12 leading U-MLLMs, such as Janus-Pro, EMU3, VILA-U, and Gemini2-flash, alongside specialized understanding (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3). Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models capable of handling mixed-modality tasks effectively. The code and evaluation data can be found at https://mme-unify.github.io/.

RWKVTTS: Yet another TTS based on RWKV-7

Lin yueyu,Liu Xiao

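Task: Introduce and evaluate RWKV-7, a new RNN-based TTS architecture intended to improve the efficiency and quality of speech synthesis.

Motivation: Conventional Transformer-based TTS systems are limited in computational efficiency and scalability; RWKV-7 addresses these issues with an RNN architecture while maintaining high-quality output.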

Details

Method: Proposes RWKV-7, an RNN-based TTS architecture that targets high-performance speech synthesis through better computational efficiency and scalability. Result: RWKV-7 outperforms Transformer models on key metrics such as synthesis speed, speech naturalness, and resource efficiency, and adapts to diverse linguistic contexts and low-resource settings. Conclusion: RWKV-7 is an efficient and innovative TTS solution that could help make speech synthesis technology more widely available. Abstract: Human-AI interaction thrives on intuitive and efficient interfaces, among which voice stands out as a particularly natural and accessible modality. Recent advancements in transformer-based text-to-speech (TTS) systems, such as Fish-Speech, CosyVoice, and MegaTTS 3, have delivered remarkable improvements in quality and realism, driving a significant evolution in the TTS domain. In this paper, we introduce RWKV-7, a cutting-edge RNN-based architecture tailored for TTS applications. Unlike traditional transformer models, RWKV-7 leverages the strengths of recurrent neural networks to achieve greater computational efficiency and scalability, while maintaining high-quality output. Our comprehensive benchmarks demonstrate that RWKV-7 outperforms transformer-based models across multiple key metrics, including synthesis speed, naturalness of speech, and resource efficiency. Furthermore, we explore its adaptability to diverse linguistic contexts and low-resource environments, showcasing its potential to democratize TTS technology. These findings position RWKV-7 as a powerful and innovative alternative, paving the way for more accessible and versatile voice synthesis solutions in real-world applications. Our code and weights are available at https://github.com/yynil/RWKVTTS and https://huggingface.co/spaces/RWKV-Red-Team

Machine Learning Prediction of Cardiovascular Risk in Type 1 Diabetes Mellitus Using Radiomics Features from Multimodal Retinal Images

Ariadna Tohà-Dalmau,Josep Rosinés-Fonoll,Enrique Romero,Ferran Mazzanti,Ruben Martin-Pinardel,Sonia Marias-Perez,Carolina Bernal-Morales,Rafael Castro-Dominguez,Andrea Mendez,Emilio Ortega,Irene Vinagre,Marga Gimenez,Alfredo Vellido,Javier Zarranz-Ventura

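Task: Develop a machine learning algorithm that uses multimodal retinal images to classify patients with type 1 diabetes into moderate, high, and very high cardiovascular risk levels.

Motivation: Combine radiomic features from retinal images with clinical data to improve the accuracy of cardiovascular risk stratification.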

Details

Method: Radiomic features are extracted from fundus retinography, OCT, and OCTA images, and machine learning models are trained on them either alone or combined with clinical data. Result: Models using only radiomic features reach an AUC of 0.79 for separating moderate risk from high and very high risk, and 0.73 for separating high from very high risk; adding clinical data raises these to 0.99 and 0.95. Conclusion: Radiomic features from multimodal retinal images can be used for cardiovascular risk stratification, and they work better when combined with clinical data. Abstract: This study aimed to develop a machine learning (ML) algorithm capable of determining cardiovascular risk from multimodal retinal images of patients with type 1 diabetes mellitus, distinguishing between moderate, high, and very high-risk levels. Radiomic features were extracted from fundus retinography, optical coherence tomography (OCT), and OCT angiography (OCTA) images. ML models were trained using these features either individually or combined with clinical data. A dataset of 597 eyes (359 individuals) was analyzed, and models trained only with radiomic features achieved AUC values of (0.79 ± 0.03) for identifying moderate risk cases from high and very high-risk cases, and (0.73 ± 0.07) for distinguishing between high and very high-risk cases. The addition of clinical variables improved all AUC values, reaching (0.99 ± 0.01) for identifying moderate risk cases and (0.95 ± 0.02) for differentiating between high and very high-risk cases. For very high CV risk, radiomics combined with OCT+OCTA metrics and ocular data achieved an AUC of (0.89 ± 0.02) without systemic data input. These results demonstrate that radiomic features obtained from multimodal retinal images are useful for discriminating and classifying CV risk labels, highlighting the potential of this oculomics approach for CV risk assessment.

Optimal Embedding Guided Negative Sample Generation for Knowledge Graph Link Prediction

Makoto Takamoto,Daniel Oñoro-Rubio,Wiem Ben Rim,Takashi Maruyama,Bhushan Kotnis

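Task: Study the conditions under which high-quality negative samples can be generated for knowledge graph embedding (KGE) models and their effect on model performance.

Motivation: Existing methods struggle to identify high-quality negative samples, even though negative sample quality is critical for training KGE models.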

Details

Method: Proposes the EMU framework, which generates negative samples that satisfy a theoretical condition instead of identifying them within the training data. Result: Experiments show that EMU significantly improves link prediction performance across a range of KGE models and negative sampling methods. Conclusion: By generating high-quality negative samples, EMU significantly improves KGE model performance and integrates easily with existing methods. Abstract: Knowledge graph embedding (KGE) models encode the structural information of knowledge graphs to predict new links. Effective training of these models requires distinguishing between positive and negative samples with high precision. Although prior research has shown that improving the quality of negative samples can significantly enhance model accuracy, identifying high-quality negative samples remains a challenging problem. This paper theoretically investigates the condition under which negative samples lead to optimal KG embedding and identifies a sufficient condition for an effective negative sample distribution. Based on this theoretical foundation, we propose Embedding MUtation (EMU), a novel framework that generates negative samples satisfying this condition, in contrast to conventional methods that focus on identifying challenging negative samples within the training data. Importantly, the simplicity of EMU ensures seamless integration with existing KGE models and negative sampling methods. To evaluate its efficacy, we conducted comprehensive experiments across multiple datasets. The results consistently demonstrate significant improvements in link prediction performance across various KGE models and negative sampling methods. Notably, EMU enables performance improvements comparable to those achieved by models with an embedding dimension five times larger. An implementation of the method and experiments are available at https://github.com/nec-research/EMU-KG.

Global Rice Multi-Class Segmentation Dataset (RiceSEG): A Comprehensive and Diverse High-Resolution RGB-Annotated Images for the Development and Benchmarking of Rice Segmentation Algorithms

Junchi Zhou,Haozhou Wang,Yoichiro Kato,Tejasri Nampally,P. Rajalakshmi,M. Balram,Keisuke Katsura,Hao Lu,Yue Mu,Wanneng Yang,Yangmingrui Gao,Feng Xiao,Hongtao Chen,Yuhao Chen,Wenjuan Li,Jingwen Wang,Fenghua Yu,Jian Zhou,Wensheng Wang,Xiaochun Hu,Yuanzhu Yang,Yanfeng Ding,Wei Guo,Shouyang Liu

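Task: Develop computer vision-based rice phenotyping techniques for precision field management and accelerated breeding.

Motivation: Distinguishing image components is a key prerequisite for characterizing plant growth and development, but the fine structure of rice organs and complex illumination within the canopy make it difficult, and high-quality training datasets are lacking.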

Details

Method: Builds RiceSEG, the first comprehensive multi-class rice semantic segmentation dataset, from nearly 50,000 high-resolution images collected in five major rice-growing countries, of which 3,078 representative samples were selected and annotated with six classes. Result: Existing convolutional and transformer-based semantic segmentation models perform well on background and green vegetation but struggle during the reproductive stage, when canopy structure is complex and multiple classes are involved. Conclusion: The RiceSEG dataset is important for developing specialized segmentation models for rice and other crops. Abstract: Developing computer vision-based rice phenotyping techniques is crucial for precision field management and accelerating breeding, thereby continuously advancing rice production. Among phenotyping tasks, distinguishing image components is a key prerequisite for characterizing plant growth and development at the organ scale, enabling deeper insights into eco-physiological processes. However, due to the fine structure of rice organs and complex illumination within the canopy, this task remains highly challenging, underscoring the need for a high-quality training dataset. Such datasets are scarce, both due to a lack of large, representative collections of rice field images and the time-intensive nature of annotation. To address this gap, we established the first comprehensive multi-class rice semantic segmentation dataset, RiceSEG. We gathered nearly 50,000 high-resolution, ground-based images from five major rice-growing countries (China, Japan, India, the Philippines, and Tanzania), encompassing over 6,000 genotypes across all growth stages. From these original images, 3,078 representative samples were selected and annotated with six classes (background, green vegetation, senescent vegetation, panicle, weeds, and duckweed) to form the RiceSEG dataset. Notably, the sub-dataset from China spans all major genotypes and rice-growing environments from the northeast to the south. Both state-of-the-art convolutional neural networks and transformer-based semantic segmentation models were used as baselines. While these models perform reasonably well in segmenting background and green vegetation, they face difficulties during the reproductive stage, when canopy structures are more complex and multiple classes are involved. These findings highlight the importance of our dataset for developing specialized segmentation models for rice and other crops.

Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency

Erik Johannes Husom,Arda Goknil,Merve Astekin,Lwin Khin Shar,Andre Kåsen,Sagar Sen,Benedikt Andreas Mithassel,Ahmet Soylu

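Task: Analyze the energy efficiency, inference performance, and output accuracy of 28 quantized Large Language Models (LLMs) on an edge device.

Motivation: Deploying LLMs on edge devices is constrained by compute, memory, inference speed, and energy consumption, and quantization is a key technique for addressing these constraints.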

Details

Method: Evaluates 28 quantized LLMs from the Ollama library on a Raspberry Pi 4, measuring energy efficiency, inference performance, and accuracy across quantization levels and task types. Result: Reveals the trade-offs between energy efficiency, inference speed, and accuracy under different quantization settings and identifies configurations that optimize LLM deployment in resource-constrained environments. Conclusion: By combining hardware-level energy profiling with LLM benchmarking, the study offers practical insights for sustainable AI and fills a gap in research on energy-aware LLM deployment. Abstract: Deploying Large Language Models (LLMs) on edge devices presents significant challenges due to computational constraints, memory limitations, inference speed, and energy consumption. Model quantization has emerged as a key technique to enable efficient LLM inference by reducing model size and computational overhead. In this study, we conduct a comprehensive analysis of 28 quantized LLMs from the Ollama library, which applies by default Post-Training Quantization (PTQ) and weight-only quantization techniques, deployed on an edge device (Raspberry Pi 4 with 4GB RAM). We evaluate energy efficiency, inference performance, and output accuracy across multiple quantization levels and task types. Models are benchmarked on five standardized datasets (CommonsenseQA, BIG-Bench Hard, TruthfulQA, GSM8K, and HumanEval), and we employ a high-resolution, hardware-based energy measurement tool to capture real-world power consumption. Our findings reveal the trade-offs between energy efficiency, inference speed, and accuracy in different quantization settings, highlighting configurations that optimize LLM deployment for resource-constrained environments. By integrating hardware-level energy profiling with LLM benchmarking, this study provides actionable insights for sustainable AI, bridging a critical gap in existing research on energy-aware LLM deployment.
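
To make the idea of post-training, weight-only quantization concrete, the following sketch quantizes a list of floating-point weights to signed integers with a single shared scale and then reconstructs them. It is a textbook symmetric per-tensor scheme chosen for brevity, and it is an assumption for illustration only: it is not the block-wise format used by the Ollama models in the study, and the function names are hypothetical.

```python
def quantize_weights(weights, bits=8):
    """Symmetric per-tensor quantization of floats to signed integers.

    Returns the integer codes and the scale needed to reconstruct them.
    """
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_weights(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


weights = [1.0, -0.5, 0.25, 0.0]
for bits in (8, 4):
    codes, scale = quantize_weights(weights, bits)
    restored = dequantize_weights(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(bits, codes, round(worst, 4))
```

Lower bit widths shrink storage and memory traffic but enlarge the rounding error, which is the accuracy-versus-efficiency trade-off the study measures on real hardware.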

Hummus: A Dataset of Humorous Multimodal Metaphor Use

Xiaoyu Tong,Zhi Zhang,Martha Lewis,Ekaterina Shutova

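Task: Study the humorous capacity of multimodal metaphors and develop a new annotation scheme for humorous multimodal metaphor use in image-caption pairs.

Motivation: The humorous capacity of multimodal metaphors has received little attention, even though metaphor is one of the most common mechanisms of humor.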

Details

Method: Drawing on the Incongruity Theory of humor, Conceptual Metaphor Theory, and the annotation scheme of the VU Amsterdam Metaphor Corpus, the authors develop a new annotation scheme and create the Hummus dataset with expert annotation on 1k image-caption pairs. Result: Experiments show that current multimodal large language models still struggle with humorous multimodal metaphors, especially when integrating visual and textual information. Conclusion: The study sheds light on the humorous capacity of multimodal metaphors and the challenges they pose for models, and releases the dataset and code. Abstract: Metaphor and humor share a lot of common ground, and metaphor is one of the most common humorous mechanisms. This study focuses on the humorous capacity of multimodal metaphors, which has not received due attention in the community. We take inspiration from the Incongruity Theory of humor, the Conceptual Metaphor Theory, and the annotation scheme behind the VU Amsterdam Metaphor Corpus, and develop a novel annotation scheme for humorous multimodal metaphor use in image-caption pairs. We create the Hummus Dataset of Humorous Multimodal Metaphor Use, providing expert annotation on 1k image-caption pairs sampled from the New Yorker Caption Contest corpus. Using the dataset, we test state-of-the-art multimodal large language models (MLLMs) on their ability to detect and understand humorous multimodal metaphor use. Our experiments show that current MLLMs still struggle with processing humorous multimodal metaphors, particularly with regard to integrating visual and textual information. We release our dataset and code at github.com/xiaoyuisrain/humorous-multimodal-metaphor-use.

Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning

Xinyi Wang,Shawn Tan,Mingyu Jin,William Yang Wang,Rameswar Panda,Yikang Shen

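Task: Study how scaling Large Language Models (LLMs) affects their multi-hop reasoning ability.

Motivation: LLMs perform well on complex reasoning tasks, but the effect of scaling on their reasoning ability remains unclear and needs further study.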

Details

Method: Designs a synthetic multi-hop reasoning environment that mimics the structure and distribution of real-world knowledge graphs, pretrains language models on it, and evaluates their ability to complete missing edges. Result: Finds that overparameterization can harm reasoning performance through excessive memorization, studies the factors behind the U-shaped loss curve, and proposes estimating the optimal model size from the knowledge graph's search entropy. Conclusion: Clarifies the relationship between scaling and reasoning ability and suggests new ways to optimize LLM performance on reasoning tasks. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks requiring complex reasoning. However, the effects of scaling on their reasoning abilities remain insufficiently understood. In this paper, we introduce a synthetic multi-hop reasoning environment designed to closely replicate the structure and distribution of real-world large-scale knowledge graphs. Our reasoning task involves completing missing edges in the graph, which requires advanced multi-hop reasoning and mimics real-world reasoning scenarios. To evaluate this, we pretrain language models (LMs) from scratch solely on triples from the incomplete graph and assess their ability to infer the missing edges. Interestingly, we observe that overparameterization can impair reasoning performance due to excessive memorization. We investigate different factors that affect this U-shaped loss curve, including graph structure, model size, and training steps. To predict the optimal model size for a specific knowledge graph, we find an empirical scaling that linearly maps the knowledge graph search entropy to the optimal model size. This work provides new insights into the relationship between scaling and reasoning in LLMs, shedding light on possible ways to optimize their performance for reasoning tasks.

Noise-Aware Generalization: Robustness to In-Domain Noise and Out-of-Domain Generalization

Siqi Wang,Aoming Liu,Bryan A. Plummer

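Task: Study model robustness when distribution shift and label noise occur together, a setting called Noise-Aware Generalization (NAG).

Motivation: Existing multi-source Domain Generalization (DG) methods usually ignore the effect of label noise, while the assumptions of Learning with Noisy Labels (LNL) methods do not hold in the NAG setting, so new solutions are needed.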

Details

Method: Proposes DL4ND, which improves noise detection by exploiting cross-domain information to distinguish label noise from domain shift. Result: DL4ND significantly improves performance on four diverse datasets. Conclusion: DL4ND offers a promising direction for tackling Noise-Aware Generalization. Abstract: Multi-source Domain Generalization (DG) aims to improve model robustness to new distributions. However, DG methods often overlook the effect of label noise, which can confuse a model during training, reducing performance. Limited prior work has analyzed DG methods' noise robustness, typically focused on an analysis of existing methods rather than new solutions. In this paper, we investigate this underexplored space, where models are evaluated under both distribution shifts and label noise, which we refer to as Noise-Aware Generalization (NAG). A natural solution to address label noise would be to combine a Learning with Noisy Labels (LNL) method with those from DG. Many LNL methods aim to detect distribution shifts in a class's samples, i.e., they assume that distribution shifts often correspond to label noise. However, in NAG distribution shifts can be due to label noise or domain shifts, breaking the assumptions used by LNL methods. A naive solution is to make an assumption similar to that made by many DG methods, where we presume to have domain labels during training, enabling us to isolate the two types of shifts. However, this ignores valuable cross-domain information. Specifically, our proposed DL4ND approach improves noise detection by taking advantage of the observation that noisy samples that may appear indistinguishable within a single domain often show greater variation when compared across domains. Experiments show that DL4ND significantly improves performance across four diverse datasets, offering a promising direction for tackling NAG.

GraphSeg: Segmented 3D Representations via Graph Edge Addition and Contraction

Haozhan Tang,Tianyi Zhang,Oliver Kroemer,Matthew Johnson-Roberson,Weiming Zhi

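Task: Propose GraphSeg, a framework for generating consistent 3D object segmentations from a sparse set of 2D images.

Motivation: Recent large models such as SAM perform well at 2D image segmentation, but in the physical 3D world they over-segment objects and produce inconsistent masks across views.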

Details

Method: Builds dual correspondence graphs (one from 2D pixel-level similarity, one from inferred 3D structure) and casts segmentation as edge addition followed by graph contraction, yielding unified object-level segmentations. Result: GraphSeg achieves state-of-the-art performance on tabletop scenes while requiring significantly fewer images and delivering greater accuracy. Conclusion: GraphSeg produces consistent 3D object segmentations and improves performance on downstream robotic manipulation tasks. Abstract: Robots operating in unstructured environments often require accurate and consistent object-level representations. This typically requires segmenting individual objects from the robot's surroundings. While recent large models such as Segment Anything (SAM) offer strong performance in 2D image segmentation, these advances do not translate directly to performance in the physical 3D world, where they often over-segment objects and fail to produce consistent mask correspondences across views. In this paper, we present GraphSeg, a framework for generating consistent 3D object segmentations from a sparse set of 2D images of the environment without any depth information. GraphSeg adds edges to graphs and constructs dual correspondence graphs: one from 2D pixel-level similarities and one from inferred 3D structure. We formulate segmentation as a problem of edge addition, then subsequent graph contraction, which merges multiple 2D masks into unified object-level segmentations. We can then leverage 3D foundation models to produce segmented 3D representations. GraphSeg achieves robust segmentation with significantly fewer images and greater accuracy than prior methods. We demonstrate state-of-the-art performance on tabletop scenes and show that GraphSeg enables improved performance on downstream robotic manipulation tasks. Code available at https://github.com/tomtang502/graphseg.git.
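
The contraction step described above, merging 2D masks that are linked by an edge into a single object, can be illustrated with a union-find structure. This is only a minimal sketch under assumptions: the function name `contract_masks`, the integer mask IDs, and the example edges are hypothetical, and how GraphSeg actually decides which edges to add is not reproduced here.

```python
from collections import defaultdict


def contract_masks(num_masks, edges):
    """Merge mask nodes joined by edges and return the resulting groups.

    num_masks: number of 2D masks, identified by integers 0..num_masks-1.
    edges: iterable of (a, b) pairs meaning "masks a and b show the same object".
    """
    parent = list(range(num_masks))

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # contract the edge (a, b)

    groups = defaultdict(list)
    for node in range(num_masks):
        groups[find(node)].append(node)
    return sorted(groups.values())


# Hypothetical input: five masks from several views, three added edges.
objects = contract_masks(5, [(0, 1), (1, 2), (3, 4)])
print(objects)
```

Each returned group stands for one object-level segment; masks with no edges remain singleton groups.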

Comparative Analysis of Unsupervised and Supervised Autoencoders for Nuclei Classification in Clear Cell Renal Cell Carcinoma Images

Fatemeh Javadian,Zahra Aminparast,Johannes Stegmaier,Abin Jose

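Task: Explore supervised and unsupervised autoencoders (AEs) for automatic nuclei classification in clear cell renal cell carcinoma (ccRCC) images.

Motivation: Conventional ccRCC diagnosis relies on subjective visual assessment by pathologists; this study aims to make classification more objective and accurate through automation.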

Details

Method: Evaluates several AE architectures (standard AE, contractive AE, discriminative AE, and classifier-based discriminative AE), tunes hyperparameters with Optuna, and measures class separability in the latent space with the Bhattacharyya distance. Result: The classifier-based discriminative AE (CDAE) performs best in both latent space separation and classification accuracy, significantly improves identification of aggressive ccRCC grades, and outperforms the prior state of the art, CHR-Network. Conclusion: Integrating a classifier branch into AEs, combined with neural architecture search and contrastive learning, enhances automated ccRCC grading and may improve diagnostic accuracy. Abstract: This study explores the application of supervised and unsupervised autoencoders (AEs) to automate nuclei classification in clear cell renal cell carcinoma (ccRCC) images, a diagnostic task traditionally reliant on subjective visual grading by pathologists. We evaluate various AE architectures, including standard AEs, contractive AEs (CAEs), and discriminative AEs (DAEs), as well as a classifier-based discriminative AE (CDAE), optimized using the hyperparameter tuning tool Optuna. Bhattacharyya distance is selected from several metrics to assess class separability in the latent space, revealing challenges in distinguishing adjacent grades using unsupervised models. CDAE, integrating a supervised classifier branch, demonstrated superior performance in both latent space separation and classification accuracy. Given that CDAE-CNN achieved notable improvements in classification metrics, affirming the value of supervised learning for class-specific feature extraction, F1 score was incorporated into the tuning process to optimize classification performance. Results show significant improvements in identifying aggressive ccRCC grades by leveraging the classification capability of AE through latent clustering followed by fine-grained classification. Our model outperforms the current state of the art, CHR-Network, across all evaluated metrics. These findings suggest that integrating a classifier branch in AEs, combined with neural architecture search and contrastive learning, enhances grading automation in ccRCC pathology, particularly in detecting aggressive tumor grades, and may improve diagnostic accuracy.
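
To make the separability metric concrete, the sketch below computes the Bhattacharyya distance between two classes along a single latent dimension, assuming each class is modelled as a univariate Gaussian. That modelling choice, the function names, and the sample values are assumptions for illustration; the abstract does not say how the distance is estimated in the study.

```python
import math
from statistics import fmean, pvariance


def bhattacharyya_1d(mu1, var1, mu2, var2):
    """Bhattacharyya distance between N(mu1, var1) and N(mu2, var2)."""
    mean_term = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    var_term = 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2)))
    return mean_term + var_term


def class_separability(samples_a, samples_b):
    """Fit a Gaussian to each class's latent values and compare the two fits."""
    return bhattacharyya_1d(
        fmean(samples_a), pvariance(samples_a),
        fmean(samples_b), pvariance(samples_b),
    )


# Hypothetical latent values for two tumor grades along one latent dimension.
grade_low = [0.0, 2.0]
grade_high = [10.0, 12.0]
print(class_separability(grade_low, grade_high))
```

A distance of zero means the two fitted distributions coincide; larger values indicate classes that are easier to tell apart in that dimension.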

Malware Detection in Docker Containers: An Image is Worth a Thousand Logs

Akis Nousias,Efklidis Katsaros,Evangelos Syrmos,Panagiotis Radoglou-Grammatikis,Thomas Lagkas,Vasileios Argyriou,Ioannis Moscholios,Evangelos Markakis,Sotirios Goudos,Panagiotis Sarigiannidis

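Task: Propose a method that identifies malware-compromised software containers by applying machine learning to their file systems.

Motivation: Traditional malware detection struggles against obfuscation and polymorphism, and the widespread use of software containers brings new security challenges such as malicious software injection.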

Details

Method: Converts the tarball representation of a software container into a large RGB image and runs convolutional neural networks over it in a streaming, patch-based manner. Result: The method detects more malware than individual VirusTotal engines and their ensembles, with higher F1 and Recall scores. Conclusion: The method is effective and sets a new standard for identifying malware-compromised software containers. Abstract: Malware detection is increasingly challenged by evolving techniques like obfuscation and polymorphism, limiting the effectiveness of traditional methods. Meanwhile, the widespread adoption of software containers has introduced new security challenges, including the growing threat of malicious software injection, where a container, once compromised, can serve as an entry point for further cyberattacks. In this work, we address these security issues by introducing a method to identify compromised containers through machine learning analysis of their file systems. We cast entire software containers into large RGB images via their tarball representations, and propose to use established Convolutional Neural Network architectures in a streaming, patch-based manner. To support our experiments, we release the COSOCO dataset, the first of its kind, containing 3364 large-scale RGB images of benign and compromised software containers at https://huggingface.co/datasets/k3ylabs/cosoco-image-dataset. Our method detects more malware and achieves higher F1 and Recall scores than all individual VirusTotal engines and their ensembles, demonstrating its effectiveness and setting a new standard for identifying malware-compromised software containers.
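
One simple way to picture the tarball-to-image idea is to read the raw bytes three at a time as (R, G, B) pixels and hand the resulting rows to a model in strips. The sketch below does exactly that in plain Python. The byte-to-pixel mapping, the zero padding, the strip-shaped patches, and both function names are assumptions made for illustration; the paper's actual encoding and patch layout may differ.

```python
def bytes_to_rgb_rows(data, width):
    """Yield image rows; each row is a list of `width` (R, G, B) tuples."""
    row_bytes = width * 3
    for start in range(0, len(data), row_bytes):
        chunk = data[start:start + row_bytes]
        # Zero-pad the final row so every row has the same width.
        chunk = chunk + b"\x00" * (row_bytes - len(chunk))
        yield [tuple(chunk[i:i + 3]) for i in range(0, row_bytes, 3)]


def stream_patches(data, width, patch_rows):
    """Yield strips of at most `patch_rows` rows without building the full image."""
    patch = []
    for row in bytes_to_rgb_rows(data, width):
        patch.append(row)
        if len(patch) == patch_rows:
            yield patch
            patch = []
    if patch:
        yield patch  # last, possibly shorter, strip


# Stand-in for the bytes of a container tarball (in practice, read from a file).
fake_tarball = bytes(range(18))
for strip in stream_patches(fake_tarball, width=2, patch_rows=2):
    print(len(strip), "rows:", strip)
```

Because the generator yields one strip at a time, a classifier could score each strip as it arrives instead of holding the full image in memory.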

Point Cloud-based Grasping for Soft Hand Exoskeleton

Chen Hu,Enrica Tricomi,Eojin Rho,Daekyum Kim,Lorenzo Masia,Shan Luo,Letizia Gionfrida

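Task: Propose a vision-based predictive control framework that assists grasping with a soft hand exoskeleton.

Motivation: For people with hand impairments, controlling a soft hand exoskeleton during grasping is difficult because understanding the environment is complex, so a more general solution is needed.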

Details

Method: Uses geometric modelling combined with contextual awareness from depth perception to predict the grasping target and determine the control state. Result: The system achieves a Grasping Ability Score (GAS) of 91% across 15 objects and healthy participants, and maintains reconstruction success on unseen objects. Conclusion: The method generalizes and adapts better than data-driven models, demonstrating its effectiveness across diverse grasping scenarios. Abstract: Grasping is a fundamental skill for interacting with and manipulating objects in the environment. However, this ability can be challenging for individuals with hand impairments. Soft hand exoskeletons designed to assist grasping can enhance or restore essential hand functions, yet controlling these soft exoskeletons to support users effectively remains difficult due to the complexity of understanding the environment. This study presents a vision-based predictive control framework that leverages contextual awareness from depth perception to predict the grasping target and determine the next control state for activation. Unlike data-driven approaches that require extensive labelled datasets and struggle with generalizability, our method is grounded in geometric modelling, enabling robust adaptation across diverse grasping scenarios. The Grasping Ability Score (GAS) was used to evaluate performance, with our system achieving a state-of-the-art GAS of 91% across 15 objects and healthy participants, demonstrating its effectiveness across different object types. The proposed approach maintained reconstruction success for unseen objects, underscoring its enhanced generalizability compared to learning-based models.

NeRFlex: Resource-aware Real-time High-quality Rendering of Complex Scenes on Mobile Devices

Zhe Wang,Yifei Zhu

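Task: Propose NeRFlex, a framework for high-resolution, real-time rendering of complex scenes on mobile devices.

Motivation: The high computational demands and memory overhead of NeRF limit its practical use on mobile devices, and existing methods struggle to deliver both high quality and real-time rendering in complex scenes.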

Details

Method: NeRFlex decomposes a scene with a multi-NeRF representation and optimizes resource allocation with a lightweight profiler and a dynamic programming algorithm. Result: Experiments show that NeRFlex achieves real-time, high-quality rendering on mobile devices. Conclusion: NeRFlex offers an efficient solution for rendering complex scenes on mobile devices. Abstract: Neural Radiance Fields (NeRF) is a cutting-edge neural network-based technique for novel view synthesis in 3D reconstruction. However, its significant computational demands pose challenges for deployment on mobile devices. While mesh-based NeRF solutions have shown potential in achieving real-time rendering on mobile platforms, they often fail to deliver high-quality reconstructions when rendering practical complex scenes. Additionally, the non-negligible memory overhead caused by pre-computed intermediate results complicates their practical application. To overcome these challenges, we present NeRFlex, a resource-aware, high-resolution, real-time rendering framework for complex scenes on mobile devices. NeRFlex integrates mobile NeRF rendering with multi-NeRF representations that decompose a scene into multiple sub-scenes, each represented by an individual NeRF network. Crucially, NeRFlex considers both memory and computation constraints as first-class citizens and redesigns the reconstruction process accordingly. NeRFlex first designs a detail-oriented segmentation module to identify sub-scenes with high-frequency details. For each NeRF network, a lightweight profiler, built on domain knowledge, is used to accurately map configurations to visual quality and memory usage. Based on these insights and the resource constraints on mobile devices, NeRFlex presents a dynamic programming algorithm to efficiently determine configurations for all NeRF representations, despite the NP-hardness of the original decision problem. Extensive experiments on real-world datasets and mobile devices demonstrate that NeRFlex achieves real-time, high-quality rendering on commercial mobile devices.
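
The configuration-selection step can be pictured as a multiple-choice knapsack: pick exactly one configuration per sub-scene so that total quality is maximized while total memory stays within a budget. The sketch below solves that simplified problem with dynamic programming. It is an illustration under assumptions, not NeRFlex's algorithm: integer memory costs, additive quality scores, a single memory constraint, and the name `select_configs` are all hypothetical.

```python
def select_configs(options, budget):
    """Choose one (memory_cost, quality) option per sub-scene within a memory budget.

    options[i] lists the candidate configurations of sub-scene i, with integer costs.
    Returns (best_total_quality, chosen_indices), or (None, None) if nothing fits.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * (budget + 1)  # best[m]: top quality using exactly m memory
    best[0] = 0.0
    back = []  # back[i][m] = (memory before sub-scene i, chosen option index)

    for configs in options:
        new_best = [neg_inf] * (budget + 1)
        pointers = [None] * (budget + 1)
        for used in range(budget + 1):
            if best[used] == neg_inf:
                continue
            for idx, (cost, quality) in enumerate(configs):
                total = used + cost
                if total <= budget and best[used] + quality > new_best[total]:
                    new_best[total] = best[used] + quality
                    pointers[total] = (used, idx)
        best = new_best
        back.append(pointers)

    end = max(range(budget + 1), key=lambda m: best[m])
    if best[end] == neg_inf:
        return None, None

    chosen = []
    memory = end
    for pointers in reversed(back):  # walk the back-pointers to recover choices
        memory, idx = pointers[memory]
        chosen.append(idx)
    chosen.reverse()
    return best[end], chosen


# Hypothetical profiler output: two sub-scenes, two configurations each.
options = [[(1, 1.0), (3, 4.0)], [(2, 2.0), (4, 6.0)]]
print(select_configs(options, budget=5))
```

The table has one entry per memory value, so the running time grows with the number of sub-scenes, the budget, and the number of options per sub-scene, rather than with the number of possible combinations.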

Autonomous state-space segmentation for Deep-RL sparse reward scenarios

Gianluca Maselli,Vieri Giuliano Santucci

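Task: Propose a two-level architecture that alternates between two phases to learn policies in sparse-reward environments.

Motivation: Address sparse rewards in autonomous open-ended learning settings by using intrinsic motivations to improve the exploration of deep reinforcement learning algorithms.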

Details

Method: Uses a two-level architecture that alternates between intrinsically driven exploration with sub-goal generation and goal-directed policy learning under sparse rewards. Result: The method is validated in the Gym SuperMarioBros environment, showing the importance of autonomous environment segmentation for generating an efficient path. Conclusion: The proposed method is effective in sparse-reward environments and improves learning efficiency by autonomously generating sub-goals. Abstract: Dealing with environments with sparse rewards has always been crucial for systems developed to operate in autonomous open-ended learning settings. Intrinsic Motivations could be an effective way to help Deep Reinforcement Learning algorithms learn in such scenarios. In fact, intrinsic reward signals, such as novelty or curiosity, are generally adopted to improve exploration when extrinsic rewards are delayed or absent. Building on previous works, we tackle the problem of learning policies in the presence of sparse rewards by proposing a two-level architecture that alternates between an "intrinsically driven" phase of exploration and autonomous sub-goal generation and a phase of sparse reward, goal-directed policy learning. The idea is to build several small networks, each one specialized on a particular sub-path, and use them as starting points for future exploration without the need to further explore from scratch previously learnt paths. Two versions of the system have been trained and tested in the Gym SuperMarioBros environment without considering any additional extrinsic reward. The results show the validity of our approach and the importance of autonomously segmenting the environment to generate an efficient path towards the final goal.

Early detection of diabetes through transfer learning-based eye (vision) screening and improvement of machine learning model performance and advanced parameter setting algorithms

Mohammad Reza Yousefi,Ali Bakrani,Amin Dehghani

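Task: Investigate how transfer learning can improve machine learning model performance in diabetic retinopathy (DR) detection.

Motivation: Traditional approaches to DR detection suffer from low accuracy and sensitivity, long training times, and limited datasets.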

Details

Method: Applies transfer learning together with dimensionality reduction, learning rate optimization, and advanced parameter tuning algorithms. Result: The model reaches 84% overall accuracy on the testing dataset, with a highest class-specific accuracy of 89%, a sensitivity of 97%, and an F1-score of 92%. Conclusion: Transfer-learning-based DR screening is a promising approach for early diagnosis that supports timely intervention to prevent vision loss. Abstract: Diabetic Retinopathy (DR) is a serious and common complication of diabetes, caused by prolonged high blood sugar levels that damage the small retinal blood vessels. If left untreated, DR can progress to retinal vein occlusion and stimulate abnormal blood vessel growth, significantly increasing the risk of blindness. Traditional diabetes diagnosis methods often utilize convolutional neural networks (CNNs) to extract visual features from retinal images, followed by classification algorithms such as decision trees and k-nearest neighbors (KNN) for disease detection. However, these approaches face several challenges, including low accuracy and sensitivity, lengthy machine learning (ML) model training due to high data complexity and volume, and the use of limited datasets for testing and evaluation. This study investigates the application of transfer learning (TL) to enhance ML model performance in DR detection. Key improvements include dimensionality reduction, optimized learning rate adjustments, and advanced parameter tuning algorithms, aimed at increasing efficiency and diagnostic accuracy. The proposed model achieved an overall accuracy of 84% on the testing dataset, outperforming prior studies. The highest class-specific accuracy reached 89%, with a maximum sensitivity of 97% and an F1-score of 92%, demonstrating strong performance in identifying DR cases. These findings suggest that TL-based DR screening is a promising approach for early diagnosis, enabling timely interventions to prevent vision loss and improve patient outcomes.
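
For readers less familiar with the reported metrics, the sketch below shows how accuracy, sensitivity (recall), precision, and F1-score follow from the counts of a binary confusion matrix. The counts are invented purely for illustration and are not taken from the study, and `binary_metrics` is a hypothetical helper name.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a metric is undefined."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, fn, tn):
    """Compute standard metrics from true/false positive/negative counts."""
    accuracy = safe_div(tp + tn, tp + fp + fn + tn)
    sensitivity = safe_div(tp, tp + fn)  # share of actual positives that are found
    precision = safe_div(tp, tp + fp)    # share of predicted positives that are right
    f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
    }


# Made-up counts for one class treated as "positive" against all others.
print(binary_metrics(tp=8, fp=2, fn=2, tn=8))
```

In a multi-class setting such as DR grading, these values are usually computed per class by treating that class as positive, which is one way per-class figures can differ from the overall accuracy.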

Probabilistic Machine Learning for Noisy Labels in Earth Observation

Spyros Kondylatos,Nikolaos Ioannis Bountos,Ioannis Prapas,Angelos Zavras,Gustau Camps-Valls,Ioannis Papoutsis

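Task: Use probabilistic machine learning to model input-dependent label noise and quantify data uncertainty in Earth Observation tasks.

Motivation: Label noise in Earth Observation (EO) severely degrades the performance and reliability of supervised machine learning models, and robust, trustworthy ML solutions are essential for critical EO applications.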

Details

Method: Trains uncertainty-aware probabilistic models across a range of high-impact EO applications and introduces a dedicated evaluation pipeline to verify their accuracy and reliability. Result: Experiments show that the uncertainty-aware models outperform standard deterministic approaches on most datasets and evaluation metrics, and that their uncertainty estimates are reliable. Conclusion: Modeling label noise and quantifying uncertainty matter for making ML solutions in EO more accurate, reliable, and trustworthy. Abstract: Label noise poses a significant challenge in Earth Observation (EO), often degrading the performance and reliability of supervised Machine Learning (ML) models. Yet, given the critical nature of several EO applications, developing robust and trustworthy ML solutions is essential. In this study, we take a step in this direction by leveraging probabilistic ML to model input-dependent label noise and quantify data uncertainty in EO tasks, accounting for the unique noise sources inherent in the domain. We train uncertainty-aware probabilistic models across a broad range of high-impact EO applications, spanning diverse noise sources, input modalities, and ML configurations, and introduce a dedicated pipeline to assess their accuracy and reliability. Our experimental results show that the uncertainty-aware models consistently outperform the standard deterministic approaches across most datasets and evaluation metrics. Moreover, through rigorous uncertainty evaluation, we validate the reliability of the predicted uncertainty estimates, enhancing the interpretability of model predictions. Our findings emphasize the importance of modeling label noise and incorporating uncertainty quantification in EO, paving the way for more accurate, reliable, and trustworthy ML solutions in the field.

Agentic Knowledgeable Self-awareness

Shuofei Qiao,Zhisong Qiu,Baochang Ren,Xiaobin Wang,Xiangyuan Ru,Ningyu Zhang,Xiang Chen,Yong Jiang,Pengjun Xie,Fei Huang,Huajun Chen

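Task: Propose KnowSelf, a new paradigm that lets LLM-based agents autonomously regulate their use of knowledge to achieve optimal planning.

Motivation: Traditional agent planning approaches overlook the situational self-awareness present in human decision-making, which leads to inefficient use of resources.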

Details

Method: Proposes KnowSelf, which uses a heuristic situation judgement criterion and a two-stage training process so that the agent model can dynamically regulate its use of knowledge. Result: Experiments show that KnowSelf outperforms baselines across tasks and models while using the least external knowledge. Conclusion: By introducing situational self-awareness, KnowSelf significantly improves the planning efficiency and performance of agent models. Abstract: Large Language Models (LLMs) have achieved considerable performance across various agentic planning tasks. However, traditional agent planning approaches adopt a "flood irrigation" methodology that indiscriminately injects gold trajectories, external feedback, and domain knowledge into agent models. This practice overlooks the fundamental human cognitive principle of situational self-awareness during decision-making, that is, the ability to dynamically assess situational demands and strategically employ resources. We propose agentic knowledgeable self-awareness to address this gap, a novel paradigm enabling LLM-based agents to autonomously regulate knowledge utilization. Specifically, we propose KnowSelf, a data-centric approach that equips agents with human-like knowledgeable self-awareness. Concretely, we devise a heuristic situation judgement criterion to mark special tokens on the agent's self-explored trajectories for collecting training data. Through a two-stage training process, the agent model can switch between different situations by generating specific special tokens, achieving optimal planning effects with minimal costs. Our experiments demonstrate that KnowSelf can outperform various strong baselines on different tasks and models with minimal use of external knowledge. Code is available at https://github.com/zjunlp/KnowSelf.

SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement

Runnan Fang,Xiaobin Wang,Yuan Liang,Shuofei Qiao,Jialong Wu,Zekun Xi,Ningyu Zhang,Yong Jiang,Pengjun Xie,Fei Huang,Huajun Chen

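Task: Propose SynWorld, a framework that lets LLM-based agents autonomously explore new environments and refine their action knowledge.

Motivation: LLM-based agents face challenges in novel environments and unconventional action spaces, so their ability to explore autonomously and understand actions needs to improve.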

Details

Method: Synthesizes multi-step action scenarios and explores them with Monte Carlo Tree Search (MCTS) to refine action knowledge. Result: Experiments show that SynWorld effectively learns action knowledge in new environments. Conclusion: SynWorld is a general and effective approach to learning action knowledge. Abstract: In the interaction between agents and their environments, agents expand their capabilities by planning and executing actions. However, LLM-based agents face substantial challenges when deployed in novel environments or required to navigate unconventional action spaces. To empower agents to autonomously explore environments, optimize workflows, and enhance their understanding of actions, we propose SynWorld, a framework that allows agents to synthesize possible scenarios with multi-step action invocation within the action space and perform Monte Carlo Tree Search (MCTS) exploration to effectively refine their action knowledge in the current environment. Our experiments demonstrate that SynWorld is an effective and general approach to learning action knowledge in new environments. Code is available at https://github.com/zjunlp/SynWorld.
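
For context on the MCTS exploration mentioned above, the sketch below shows the widely used UCT selection rule, which trades off a child's average value against how rarely it has been visited. It is generic background rather than SynWorld's implementation: the dictionary layout, the exploration constant, and the function names are assumptions.

```python
import math


def uct_score(total_value, visits, parent_visits, c=math.sqrt(2)):
    """Upper-confidence score of one child node; unvisited children come first."""
    if visits == 0:
        return float("inf")
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, c=math.sqrt(2)):
    """Return the index of the child with the highest UCT score."""
    parent_visits = sum(child["visits"] for child in children)
    return max(
        range(len(children)),
        key=lambda i: uct_score(
            children[i]["value"], children[i]["visits"], parent_visits, c
        ),
    )


# Hypothetical statistics for two candidate action refinements.
children = [{"value": 9.0, "visits": 10}, {"value": 0.5, "visits": 1}]
print(select_child(children))
```

Setting `c` to zero makes the rule purely greedy, while larger values push the search toward less-visited branches.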

AdaViT: Adaptive Vision Transformer for Flexible Pretrain and Finetune with Variable 3D Medical Image Modalities

Badhan Kumar Das,Gengyan Zhao,Han Liu,Thomas J. Re,Dorin Comaniciu,Eli Gibson,Andreas Maier

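Task: Propose an adaptive Vision Transformer (AdaViT) framework that handles a different set of input modalities for each case.

Motivation: In clinical settings, different cases come with different sets of magnetic resonance (MR) contrasts, so existing methods lose performance when the input modalities do not match those of the pretrained model.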

Details

Method: Uses a dynamic tokenizer to encode different input modalities into tokens and relies on the transformer's attention mechanism to process a variable number of tokens. Result: Experiments show strong performance on zero-shot testing, few-shot finetuning, and backward transfer for brain infarct and brain tumor segmentation. Conclusion: AdaViT effectively transfers pretrained models to datasets with different input modality sets and maximizes the use of self-supervised pretraining data. Abstract: Pretraining techniques, whether supervised or self-supervised, are widely used in deep learning to enhance model performance. In real-world clinical scenarios, different sets of magnetic resonance (MR) contrasts are often acquired for different subjects/cases, creating challenges for deep learning models assuming consistent input modalities among all the cases and between pretrain and finetune. Existing methods struggle to maintain performance when there is an input modality/contrast set mismatch with the pretrained model, often resulting in degraded accuracy. We propose an adaptive Vision Transformer (AdaViT) framework capable of handling a variable set of input modalities for each case. We utilize a dynamic tokenizer to encode different input image modalities to tokens and take advantage of the characteristics of the transformer to build an attention mechanism across a variable number of tokens. Through extensive experiments, we demonstrate that this architecture effectively transfers supervised pretrained models to new datasets with different input modality/contrast sets, resulting in superior performance on zero-shot testing, few-shot finetuning, and backward transferring in brain infarct and brain tumor segmentation tasks. Additionally, for self-supervised pretrain, the proposed method is able to maximize the pretrain data and facilitate transferring to diverse downstream tasks with variable sets of input modalities.

MedSAM2: Segment Anything in 3D Medical Images and Videos

Jun Ma,Zongxin Yang,Sumin Kim,Bihui Chen,Mohammed Baharoon,Adibvafa Fallahpour,Reza Asakereh,Hongwei Lyu,Bo Wang

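Task: Develop MedSAM2, a general-purpose segmentation model for 3D medical images and videos.

Motivation: Despite considerable progress in 2D image segmentation, 3D image and video segmentation still lacks general-purpose models and comprehensive user studies.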

Details

Method: Fine-tunes the Segment Anything Model 2 (SAM2) on a large medical dataset and introduces a human-in-the-loop pipeline for building datasets. Result: MedSAM2 outperforms previous models across a wide range of organs, lesions, and imaging modalities and reduces manual annotation costs by more than 85%. Conclusion: MedSAM2 is an efficient, scalable, and practical tool for high-quality segmentation in research and healthcare environments. Abstract: Medical image and video segmentation is a critical task for precision medicine, which has witnessed considerable progress in developing task- or modality-specific and generalist models for 2D images. However, there have been limited studies on building general-purpose models for 3D images and videos with comprehensive user studies. Here, we present MedSAM2, a promptable segmentation foundation model for 3D image and video segmentation. The model is developed by fine-tuning the Segment Anything Model 2 on a large medical dataset with over 455,000 3D image-mask pairs and 76,000 frames, outperforming previous models across a wide range of organs, lesions, and imaging modalities. Furthermore, we implement a human-in-the-loop pipeline to facilitate the creation of large-scale datasets resulting in, to the best of our knowledge, the most extensive user study to date, involving the annotation of 5,000 CT lesions, 3,984 liver MRI lesions, and 251,550 echocardiogram video frames, demonstrating that MedSAM2 can reduce manual costs by more than 85%. MedSAM2 is also integrated into widely used platforms with user-friendly interfaces for local and cloud deployment, making it a practical tool for supporting efficient, scalable, and high-quality segmentation in both research and healthcare environments.

Bonsai: Interpretable Tree-Adaptive Grounded Reasoning

Kate Sanders,Benjamin Van Durme

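Task: Develop a general-purpose collaborative agent system that adapts to new domains and handles uncertainty transparently.

Motivation: Black-box models process data powerfully but lack transparency, domain adaptability, and uncertainty awareness, so they fall short of what reliable AI systems require.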

Details

Method: Proposes Bonsai, a system that generates adaptable inference trees by retrieving relevant evidence and computing the likelihoods of sub-claims. Result: In question-answering and human alignment experiments, Bonsai matches the performance of domain-specific black-box methods while producing interpretable, grounded, and uncertainty-aware reasoning traces. Conclusion: Bonsai is a tunable, adaptable, and transparent reasoning system that applies to a variety of domains. Abstract: To develop general-purpose collaborative agents, humans need reliable AI systems that can (1) adapt to new domains and (2) transparently reason with uncertainty to allow for verification and correction. Black-box models demonstrate powerful data processing abilities but do not satisfy these criteria due to their opaqueness, domain specificity, and lack of uncertainty awareness. We introduce Bonsai, a compositional and probabilistic reasoning system that generates adaptable inference trees by retrieving relevant grounding evidence and using it to compute likelihoods of sub-claims derived from broader natural language inferences. Bonsai's reasoning power is tunable at test-time via evidence scaling and it demonstrates reliable handling of varied domains including transcripts, photographs, videos, audio, and databases. Question-answering and human alignment experiments demonstrate that Bonsai matches the performance of domain-specific black-box methods while generating interpretable, grounded, and uncertainty-aware reasoning traces.