2025 03 21

CAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation

Masud Ahmed,Zahid Hasan,Syed Arefinul Haque,Abu Zaher Md Faridee,Sanjay Purushotham,Suya You,Nirmalya Roy

Task: Propose a semantic segmentation framework based on continuous-valued embeddings as an alternative to conventional quantized embeddings.

Motivation: Analysis shows that autoencoders using quantized embeddings (e.g., VQ-VAE) reach 8% lower accuracy on segmentation masks than those using continuous-valued embeddings (e.g., KL-VAE).

Details

Method: Reformulates semantic mask generation as a continuous image-to-embedding diffusion process and proposes a diffusion-guided autoregressive transformer that learns a continuous semantic embedding space.

Result: Experiments on several datasets (e.g., Cityscapes and domain-shifted variants) show state-of-the-art robustness to distribution shifts (e.g., adverse weather and viewpoint changes) and strong resilience to noise.

Conclusion: The continuous embedding space enables zero-shot domain adaptation, and the framework maintains high performance under a range of noise types and image variations.

Abstract: Traditional transformer-based semantic segmentation relies on quantized embeddings. However, our analysis reveals that autoencoder accuracy on segmentation masks using quantized embeddings (e.g., VQ-VAE) is 8% lower than with continuous-valued embeddings (e.g., KL-VAE). Motivated by this, we propose a continuous-valued embedding framework for semantic segmentation. By reformulating semantic mask generation as a continuous image-to-embedding diffusion process, our approach eliminates the need for discrete latent representations while preserving fine-grained spatial and semantic details. Our key contribution is a diffusion-guided autoregressive transformer that learns a continuous semantic embedding space by modeling long-range dependencies in image features. Our framework contains a unified architecture combining a VAE encoder for continuous feature extraction, a diffusion-guided transformer for conditioned embedding generation, and a VAE decoder for semantic mask reconstruction. Our setting facilitates zero-shot domain adaptation capabilities enabled by the continuity of the embedding space. Experiments across diverse datasets (e.g., Cityscapes and domain-shifted variants) demonstrate state-of-the-art robustness to distribution shifts, including adverse weather (e.g., fog, snow) and viewpoint variations. Our model also exhibits strong noise resilience, achieving robust performance (approximately 95% AP compared to baseline) under Gaussian noise, moderate motion blur, and moderate brightness/contrast variations, while experiencing only a moderate impact (approximately 90% AP compared to baseline) from 50% salt and pepper noise, saturation and hue shifts. Code available: https://github.com/mahmed10/CAMSS.git

LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

Federico Cocchi,Nicholas Moratelli,Davide Caffagni,Sara Sarto,Lorenzo Baraldi,Marcella Cornia,Rita Cucchiara

Task: Explore the trade-offs between visual backbones and language models in Multimodal Large Language Models (MLLMs), and introduce the LLaVA-MORE model family.

Motivation: Prior work has focused on scaling models to billions of parameters, while the trade-offs between model size, architecture, and performance remain underexplored; inconsistent training data and evaluation protocols also hinder direct comparison.

Details

Method: Introduces the LLaVA-MORE family and, under a unified training protocol, systematically analyzes small- and medium-scale LLMs (e.g., Phi-4, LLaMA-3.1, and Gemma-2) together with several visual encoders (e.g., CLIP, DINOv2, SigLIP, and SigLIP2).

Result: Provides insights into designing more effective MLLMs and a reproducible evaluation framework that supports direct comparison and can guide future model development.

Conclusion: The LLaVA-MORE family and its unified training protocol offer a new perspective and methodology for designing and evaluating MLLMs.

Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation protocols have hindered direct comparisons, making it difficult to derive optimal design choices. In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures. Our analysis systematically explores both small- and medium-scale LLMs -- including Phi-4, LLaMA-3.1, and Gemma-2 -- to evaluate multimodal reasoning, generation, and instruction following, while examining the relationship between model size and performance. Beyond evaluating the LLM impact on final results, we conduct a comprehensive study of various visual encoders, ranging from CLIP-based architectures to alternatives such as DINOv2, SigLIP, and SigLIP2. Additional experiments investigate the effects of increased image resolution and variations in pre-training datasets. Overall, our results provide insights into the design of more effective MLLMs, offering a reproducible evaluation framework that facilitates direct comparisons and can guide future model development. Our source code and trained models are publicly available at: https://github.com/aimagelab/LLaVA-MORE.

EarthScape: A Multimodal Dataset for Surficial Geologic Mapping and Earth Surface Analysis

Matthew Massey,Abdullah-Al-Zubaer Imran

Task: Introduce and evaluate the EarthScape dataset for surficial geologic mapping and Earth surface analysis.

Motivation: Traditional geologic mapping is labor-intensive, which limits spatial coverage and introduces potential biases.

Details

Method: Integrates high-resolution aerial RGB and near-infrared (NIR) imagery, digital elevation models (DEM), multi-scale DEM-derived terrain features, and hydrologic and infrastructure vector data, with detailed annotations for seven surficial geologic classes.

Result: Establishes baseline benchmarks using different spatial modalities, demonstrating the utility of EarthScape.

Conclusion: EarthScape bridges computer vision and the Earth sciences and offers a valuable resource for research in multimodal learning, geospatial analysis, and geologic mapping.

Abstract: Surficial geologic mapping is essential for understanding Earth surface processes, addressing modern challenges such as climate change and national security, and supporting common applications in engineering and resource management. However, traditional mapping methods are labor-intensive, limiting spatial coverage and introducing potential biases. To address these limitations, we introduce EarthScape, a novel, AI-ready multimodal dataset specifically designed for surficial geologic mapping and Earth surface analysis. EarthScape integrates high-resolution aerial RGB and near-infrared (NIR) imagery, digital elevation models (DEM), multi-scale DEM-derived terrain features, and hydrologic and infrastructure vector data. The dataset provides detailed annotations for seven distinct surficial geologic classes encompassing various geological processes. We present a comprehensive data processing pipeline using open-sourced raw data and establish baseline benchmarks using different spatial modalities to demonstrate the utility of EarthScape. As a living dataset with a vision for expansion, EarthScape bridges the gap between computer vision and Earth sciences, offering a valuable resource for advancing research in multimodal learning, geospatial analysis, and geological mapping. Our code is available at https://github.com/masseygeo/earthscape.

Vision-Speech Models: Teaching Speech Models to Converse about Images

Amélie Royer,Moritz Böhle,Gabriel de Marmiesse,Laurent Mazaré,Neil Zeghidour,Alexandre Défossez,Patrick Pérez

Task: Combine a pretrained speech model with visual understanding to build a multimodal speech model that can converse freely about images.

Motivation: Address the challenges specific to conversational vision-speech models: scarce paired image-speech data, compute and memory constraints imposed by real-time inference latency, and the need to preserve prosodic features that cannot be inferred from text alone.

Details

Method: Augments the recent dialogue speech LLM Moshi with lightweight adaptation modules, adds a dynamic gating mechanism for switching between visual inputs and unrelated conversation topics, and designs a simple one-stage, parameter-efficient fine-tuning pipeline.

Result: The model is evaluated on downstream visual understanding tasks, and qualitative samples of interactions with MoshiVis are reported.

Conclusion: MoshiVis integrates visual inputs into a speech model through lightweight adaptation and dynamic gating, reducing training cost while preserving prosodic features.

Abstract: The recent successes of Vision-Language models raise the question of how to equivalently imbue a pretrained speech model with vision understanding, an important milestone towards building a multimodal speech model able to freely converse about images. Building such a conversational Vision-Speech model brings its unique challenges: (i) paired image-speech datasets are much scarcer than their image-text counterparts, (ii) ensuring real-time latency at inference is crucial, thus bringing compute and memory constraints, and (iii) the model should preserve prosodic features (e.g., speaker tone) which cannot be inferred from text alone. In this work, we introduce MoshiVis, augmenting a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules. An additional dynamic gating mechanism enables the model to more easily switch between the visual inputs and unrelated conversation topics. To reduce training costs, we design a simple one-stage, parameter-efficient fine-tuning pipeline in which we leverage a mixture of image-text (i.e., "speechless") and image-speech samples. We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis. Our inference code will be made available, as well as the image-speech data used for audio evaluation.

A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition

Ritabrata Chakraborty,Shivakumara Palaiahnakote,Umapada Pal,Cheng-Lin Liu

Task: Propose a training-free, plug-and-play scene text recognition framework that reduces compute and memory consumption.

Motivation: Modern scene text recognition systems typically rely on large end-to-end architectures that require extensive training and are too costly for practical deployment.

Details

Method: Leverages pre-trained text recognizers, introduces an attention-based segmentation stage that refines candidate text regions, and scores candidates through semantic and lexical evaluation.

Result: On public benchmarks, the framework performs on par with state-of-the-art systems while requiring substantially fewer resources.

Conclusion: The framework matches state-of-the-art performance while reducing compute and memory consumption.

Abstract: Modern scene text recognition systems often depend on large end-to-end architectures that require extensive training and are prohibitively expensive for real-time scenarios. In such cases, the deployment of heavy models becomes impractical due to constraints on memory, computational resources, and latency. To address these challenges, we propose a novel, training-free plug-and-play framework that leverages the strengths of pre-trained text recognizers while minimizing redundant computations. Our approach uses context-based understanding and introduces an attention-based segmentation stage, which refines candidate text regions at the pixel level, improving downstream recognition. Instead of performing traditional text detection, the framework follows a block-level comparison between feature map and source image and harnesses contextual information using pretrained captioners, allowing it to generate word predictions directly from scene context. Candidate texts are semantically and lexically evaluated to get a final score. Predictions that meet or exceed a pre-defined confidence threshold bypass the heavier process of end-to-end text STR profiling, ensuring faster inference and cutting down on unnecessary computations. Experiments on public benchmarks demonstrate that our paradigm achieves performance on par with state-of-the-art systems, yet requires substantially fewer resources.

Jumanh Atoum,Garrison L. H. Johnston,Nabil Simaan,Jie Ying Wu

Task: Recognize surgical gestures in real time, as a step toward automated activity recognition, skill assessment, intra-operative assistance, and eventually surgical automation.

Motivation: Current robotic surgical systems provide rich multi-modal data (e.g., video and kinematics), but existing approaches treat kinematics as independent signals and ignore the geometric relations between tool-tip poses.

Details

Method: Combines motion-invariant measures (curvature and torsion) with vision and kinematics data, using a relational graph network to capture the underlying relations between data streams.

Result: On the JIGSAWS suturing dataset, combining invariant signals with tool position reaches 90.3% frame-wise gesture recognition accuracy.

Conclusion: Motion-invariant signals coupled with position represent gesture motion better than traditional position and quaternion representations, underscoring the value of geometry-aware kinematic modeling for gesture recognition.

Abstract: Recognizing surgical gestures in real time is a stepping stone towards automated activity recognition, skill assessment, intra-operative assistance, and eventually surgical automation. The current robotic surgical systems provide us with rich multi-modal data such as video and kinematics. While some recent works in multi-modal neural networks learn the relationships between vision and kinematics data, current approaches treat kinematics information as independent signals, with no underlying relation between tool-tip poses. However, instrument poses are geometrically related, and the underlying geometry can aid neural networks in learning gesture representation. Therefore, we propose combining motion invariant measures (curvature and torsion) with vision and kinematics data using a relational graph network to capture the underlying relations between different data streams. We show that gesture recognition improves when combining invariant signals with tool position, achieving 90.3% frame-wise accuracy on the JIGSAWS suturing dataset. Our results show that motion invariant signals coupled with position are better representations of gesture motion compared to traditional position and quaternion representations. Our results highlight the need for geometric-aware modeling of kinematics for gesture recognition.
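
Note: the curvature and torsion named above as motion-invariant measures can be illustrated with a short sketch. The code below is a minimal finite-difference estimate for a sampled 3D tool-tip path, assuming uniformly spaced samples. The function `curvature_torsion`, its helpers, and the differencing scheme are illustrative assumptions, not the authors' implementation.

```python
import math

def _cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))

def curvature_torsion(points, dt=1.0, eps=1e-12):
    """Return (curvature, torsion) at each interior sample of a 3D path.

    Hypothetical helper using central finite differences:
      kappa = |r' x r''| / |r'|^3
      tau   = (r' x r'') . r''' / |r' x r''|^2
    Two samples at each end are skipped because the third-derivative
    stencil needs two neighbours on both sides.
    """
    out = []
    for i in range(2, len(points) - 2):
        m2, m1, p0, p1, p2 = points[i - 2:i + 3]
        d1 = tuple((p1[k] - m1[k]) / (2 * dt) for k in range(3))
        d2 = tuple((p1[k] - 2 * p0[k] + m1[k]) / dt ** 2 for k in range(3))
        d3 = tuple((p2[k] - 2 * p1[k] + 2 * m1[k] - m2[k]) / (2 * dt ** 3)
                   for k in range(3))
        c = _cross(d1, d2)
        speed = math.sqrt(_dot(d1, d1))
        c_sq = _dot(c, c)
        kappa = math.sqrt(c_sq) / speed ** 3 if speed > eps else 0.0
        tau = _dot(c, d3) / c_sq if c_sq > eps else 0.0
        out.append((kappa, tau))
    return out

# Helix (cos t, sin t, t): analytically, curvature = torsion = 0.5 everywhere.
dt = 0.01
helix = [(math.cos(i * dt), math.sin(i * dt), i * dt) for i in range(200)]
kappa, tau = curvature_torsion(helix, dt)[50]
```

Both quantities depend only on the shape of the path, so rotating or translating the trajectory leaves them unchanged. Raw positions and quaternions do not have this property.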

Miguel Ureña Pliego,Rubén Martínez Marín,Nianfang Shi,Takeru Shibayama,Ulrich Leth,Miguel Marchamalo Sacristán

Task: Explore the integration of machine learning into urban aerial image analysis, focusing on identifying infrastructure surfaces for cars and pedestrians and analyzing historical trends.

Motivation: Emphasizes the shift from convolutional architectures to transformer-based pre-trained models and their potential for global geospatial analysis.

Details

Method: Presents a workflow for automatically generating geospatial datasets, creating semantic segmentation datasets from sources including WMS/WMTS links, vectorial cartography, and OpenStreetMap (OSM) overpass-turbo requests.

Result: Two datasets for car and pedestrian surface detection were generated, and transformer-based models were trained and evaluated with good accuracy. The historical trend analysis identified temporal trends in pedestrian and car infrastructure across different city areas.

Conclusion: The technique lets municipal governments gather valuable data at minimal cost.

Abstract: This study explores the integration of machine learning into urban aerial image analysis, with a focus on identifying infrastructure surfaces for cars and pedestrians and analyzing historical trends. It emphasizes the transition from convolutional architectures to transformer-based pre-trained models, underscoring their potential in global geospatial analysis. A workflow is presented for automatically generating geospatial datasets, enabling the creation of semantic segmentation datasets from various sources, including WMS/WMTS links, vectorial cartography, and OpenStreetMap (OSM) overpass-turbo requests. The developed code allows a fast dataset generation process for training machine learning models using openly available data without manual labelling. Using aerial imagery and vectorial data from the respective geographical offices of Madrid and Vienna, two datasets were generated for car and pedestrian surface detection. A transformer-based model was trained and evaluated for each city, demonstrating good accuracy values. The historical trend analysis involved applying the trained model to earlier images predating the availability of vectorial data by 10 to 20 years, successfully identifying temporal trends in infrastructure for pedestrians and cars across different city areas. This technique is applicable for municipal governments to gather valuable data at a minimal cost.
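
Note: to make the OSM request step more concrete, here is a minimal sketch of building an Overpass QL query for pedestrian ways inside a bounding box. The function name `overpass_query`, the chosen `highway` tag values, and the bounding box are illustrative assumptions. The abstract does not say which tags or areas the authors' workflow queries.

```python
def overpass_query(bbox, highway_values, timeout_s=60):
    """Build an Overpass QL query for ways tagged highway=<value> in bbox.

    bbox is (south, west, north, east) in degrees, which is the order
    Overpass QL expects for bounding-box filters.
    """
    south, west, north, east = bbox
    if not (south < north and west < east):
        raise ValueError("bbox must be (south, west, north, east)")
    area = f"({south},{west},{north},{east})"
    # One statement per tag value; the surrounding parentheses form a union.
    clauses = "\n".join(
        f'  way["highway"="{value}"]{area};' for value in highway_values
    )
    return f"[out:json][timeout:{timeout_s}];\n(\n{clauses}\n);\nout geom;"

# Illustrative box in central Vienna; the tag values are assumptions.
query = overpass_query((48.20, 16.36, 48.22, 16.38), ["footway", "pedestrian"])
print(query)
```

The resulting string can be pasted into overpass-turbo. The returned way geometries would then need rasterizing into label masks aligned with the aerial imagery, a step this sketch omits.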

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Shravan Nayak,Xiangru Jian,Kevin Qinghong Lin,Juan A. Rodriguez,Montek Kalsi,Rabiul Awal,Nicolas Chapados,M. Tamer Özsu,Aishwarya Agrawal,David Vazquez,Christopher Pal,Perouz Taslakian,Spandana Gella,Sai Rajeswar

Task: Introduce UI-Vision, a comprehensive, license-permissive benchmark for evaluating computer use agents in real-world desktop environments.

Motivation: Existing research focuses on online settings; desktop environments, critical for many professional and everyday tasks, remain underexplored because of data collection challenges and licensing issues.

Details

Method: UI-Vision provides dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs), and defines three fine-to-coarse-grained tasks: Element Grounding, Layout Grounding, and Action Prediction.

Result: Evaluation reveals critical limitations of state-of-the-art models such as UI-TARS-72B, including difficulty with professional software, spatial reasoning, and complex actions like drag-and-drop.

Conclusion: Releasing UI-Vision as open source is intended to advance more capable agents for real-world desktop tasks.

Abstract: Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse-grained tasks (Element Grounding, Layout Grounding, and Action Prediction) with well-defined metrics to rigorously evaluate agents' performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer use agents. By releasing UI-Vision as open-source, we aim to advance the development of more capable agents for real-world desktop tasks.

Toward Scalable, Flexible Scene Flow for Point Clouds

Kyle Vedder

Task: Describe 3D motion between temporally successive observations (scene flow estimation).

Motivation: Build scene flow estimators that are scalable and flexible, working across a variety of domains and motion patterns without significant hyperparameter tuning.

Details

Method: Contributions include building and scaling feedforward scene flow estimators via large-scale distillation from pseudolabels produced by strong unsupervised test-time optimization methods, a benchmark that better measures estimate quality, and a new unsupervised scene flow estimator with a full-sequence problem formulation.

Result: Presents a state-of-the-art unsupervised scene flow estimator that also shows great promise in adjacent domains such as 3D point tracking.

Conclusion: Scene flow estimators can be made scalable and flexible, with potential for broader future applications.

Abstract: Scene flow estimation is the task of describing 3D motion between temporally successive observations. This thesis aims to lay the foundation for building scene flow estimators with two important properties: they are scalable, i.e. they improve with access to more data and computation, and they are flexible, i.e. they work out-of-the-box in a variety of domains and on a variety of motion patterns without requiring significant hyperparameter tuning. In this dissertation we present several concrete contributions towards this. In Chapter 1 we contextualize scene flow and its prior methods. In Chapter 2 we present a blueprint to build and scale feedforward scene flow estimators without requiring expensive human annotations via large scale distillation from pseudolabels provided by strong unsupervised test-time optimization methods. In Chapter 3 we introduce a benchmark to better measure estimate quality across diverse object types, better bringing into focus what we care about and expect from scene flow estimators, and use this benchmark to host a public challenge that produced significant progress. In Chapter 4 we present a state-of-the-art unsupervised scene flow estimator that introduces a new, full sequence problem formulation and exhibits great promise in adjacent domains like 3D point tracking. Finally, in Chapter 5 I philosophize about what's next for scene flow and its potential future broader impacts.

DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis

Yuming Gu,Phong Tran,Yujian Zheng,Hongyi Xu,Heyuan Li,Adilbek Karmanov,Hao Li

Task: Generate high-quality 360-degree views of human heads from single-view images.

Motivation: Enable accessible immersive telepresence applications and scalable personalized content creation.

Details

Method: Builds on the DiffPortrait3D framework, adding a custom ControlNet for back-of-head detail generation and a dual appearance module to ensure global front-back consistency.

Result: The method can produce high-quality neural radiance fields (NeRFs) for real-time, free-viewpoint rendering and outperforms existing methods in object synthesis and 360-degree head generation.

Conclusion: The method generates consistent 360-degree head views and handles human, stylized, and anthropomorphic forms, including accessories such as glasses and hats.

Abstract: Generating high-quality 360-degree views of human heads from single-view images is essential for enabling accessible immersive telepresence applications and scalable personalized content creation. While cutting-edge methods for full head generation are limited to modeling realistic human heads, the latest diffusion-based approaches for style-omniscient head synthesis can produce only frontal views and struggle with view consistency, preventing their conversion into true 3D models for rendering from arbitrary angles. We introduce a novel approach that generates fully consistent 360-degree head views, accommodating human, stylized, and anthropomorphic forms, including accessories like glasses and hats. Our method builds on the DiffPortrait3D framework, incorporating a custom ControlNet for back-of-head detail generation and a dual appearance module to ensure global front-back consistency. By training on continuous view sequences and integrating a back reference image, our approach achieves robust, locally continuous view synthesis. Our model can be used to produce high-quality neural radiance fields (NeRFs) for real-time, free-viewpoint rendering, outperforming state-of-the-art methods in object synthesis and 360-degree head generation for very challenging input portraits.

Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings

Austin Xu,Srijan Bansal,Yifei Ming,Semih Yavuz,Shafiq Joty

Task: Propose ContextualJudgeBench, a benchmark for evaluating LLM-based judges of model outputs in contextual settings.

Motivation: Judge models in the LLM-as-judge paradigm are typically evaluated only on non-contextual scenarios, overlooking contextual settings such as retrieval-augmented generation (RAG) and summarization.

Details

Method: Builds ContextualJudgeBench, a benchmark of 2,000 challenging response pairs, using a multi-pronged data construction pipeline that combines existing human annotations with model-based perturbations.

Result: A comprehensive study across 11 judge models and 9 general-purpose models shows that contextual information and its assessment criteria pose a significant challenge even to state-of-the-art models; for example, OpenAI's o1, the best-performing model, reaches only 55% consistent accuracy.

Conclusion: Contextual evaluation is challenging, and current models leave substantial room for improvement in contextual settings.

Abstract: The large language model (LLM)-as-judge paradigm has been used to meet the demand for a cheap, reliable, and fast evaluation of model outputs during AI system development and post-deployment monitoring. While judge models -- LLMs finetuned to specialize in assessing and critiquing model outputs -- have been touted as general purpose evaluators, they are typically evaluated only on non-contextual scenarios, such as instruction following. The omission of contextual settings -- those where external information is used as context to generate an output -- is surprising given the increasing prevalence of retrieval-augmented generation (RAG) and summarization use cases. Contextual assessment is uniquely challenging, as evaluation often depends on practitioner priorities, leading to conditional evaluation criteria (e.g., comparing responses based on factuality and then considering completeness if they are equally factual). To address the gap, we propose ContextualJudgeBench, a judge benchmark with 2,000 challenging response pairs across eight splits inspired by real-world contextual evaluation scenarios. We build our benchmark with a multi-pronged data construction pipeline that leverages both existing human annotations and model-based perturbations. Our comprehensive study across 11 judge models and 9 general purpose models reveals that the contextual information and its assessment criteria present a significant challenge to even state-of-the-art models. For example, OpenAI's o1, the best-performing model, barely reaches 55% consistent accuracy.
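
Note: the abstract reports "consistent accuracy" without defining it. The sketch below shows one plausible reading, stated as an assumption: each response pair is judged twice with the two responses in swapped order, and a pair counts as correct only if the judge picks the gold response both times. The function names and the toy records are hypothetical.

```python
def consistent_accuracy(records):
    """Fraction of pairs judged correctly under BOTH presentation orders.

    Each record is (gold, verdict_original_order, verdict_swapped_order),
    where every field names the preferred response ("A" or "B") by its
    identity, not by its position in the prompt. This definition is an
    assumption; the abstract does not spell the metric out.
    """
    if not records:
        raise ValueError("records must not be empty")
    hits = sum(1 for gold, v1, v2 in records if v1 == gold and v2 == gold)
    return hits / len(records)

def single_run_accuracy(records):
    """Ordinary accuracy using only the original-order verdict."""
    if not records:
        raise ValueError("records must not be empty")
    return sum(1 for gold, v1, _ in records if v1 == gold) / len(records)

# Toy verdicts: the second pair is right only in the original order,
# which suggests a position-dependent judgment.
toy = [
    ("A", "A", "A"),
    ("B", "B", "A"),
    ("A", "B", "B"),
    ("B", "B", "B"),
]
consistent = consistent_accuracy(toy)
single = single_run_accuracy(toy)
```

Under this reading, a judge that answers at random scores about 25% rather than 50%, which gives some context for the 55% figure reported for the best model.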

CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image

Arindam Dutta,Meng Zheng,Zhongpai Gao,Benjamin Planche,Anwesha Choudhuri,Terrence Chen,Amit K. Roy-Chowdhury,Ziyan Wu

Task: Reconstruct clothed humans from a single image.

Motivation: Existing monocular clothed human reconstruction methods work well in occlusion-free settings but produce multiview-inconsistent, fragmented reconstructions on occluded in-the-wild images. Most also rely on geometric priors that are hard to obtain (e.g., SMPL annotations).

Details

Method: CHROME uses a multiview diffusion model to synthesize occlusion-free human images from the occluded input, with off-the-shelf pose control to explicitly enforce cross-view consistency. A 3D reconstruction model is then trained to predict a set of 3D Gaussians conditioned on the occluded input and the synthesized views, yielding a consistent and accurate 3D representation.

Result: CHROME achieves significant improvements in novel view synthesis (up to 3 dB PSNR) and geometric reconstruction under challenging conditions.

Conclusion: CHROME reconstructs occlusion-resilient, multiview-consistent 3D humans from a single occluded image without geometric prior annotations or 3D supervision.

Abstract: Reconstructing clothed humans from a single image is a fundamental task in computer vision with wide-ranging applications. Although existing monocular clothed human reconstruction solutions have shown promising results, they often rely on the assumption that the human subject is in an occlusion-free environment. Thus, when encountering in-the-wild occluded images, these algorithms produce multiview inconsistent and fragmented reconstructions. Additionally, most algorithms for monocular 3D human reconstruction leverage geometric priors such as SMPL annotations for training and inference, which are extremely challenging to acquire in real-world applications. To address these limitations, we propose CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-ConsistEncy from a Single Image, a novel pipeline designed to reconstruct occlusion-resilient 3D humans with multiview consistency from a single occluded image, without requiring either ground-truth geometric prior annotations or 3D supervision. Specifically, CHROME leverages a multiview diffusion model to first synthesize occlusion-free human images from the occluded input, compatible with off-the-shelf pose control to explicitly enforce cross-view consistency during synthesis. A 3D reconstruction model is then trained to predict a set of 3D Gaussians conditioned on both the occluded input and synthesized views, aligning cross-view details to produce a cohesive and accurate 3D representation. CHROME achieves significant improvements in terms of both novel view synthesis (up to 3 dB PSNR) and geometric reconstruction under challenging conditions.
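
Note: to give a sense of scale for the reported PSNR gain, the standard definition of PSNR and the gain between two methods evaluated on the same images can be written as follows. Here MAX is the maximum pixel value and MSE the mean squared error; the subscripts "before" and "after" are labels introduced for this illustration.

```latex
\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right),
\qquad
\Delta\mathrm{PSNR}
  = \mathrm{PSNR}_{\text{after}} - \mathrm{PSNR}_{\text{before}}
  = 10 \log_{10}\!\left(\frac{\mathrm{MSE}_{\text{before}}}{\mathrm{MSE}_{\text{after}}}\right)
```

Setting the gain to 3 dB gives an MSE ratio of 10^{0.3}, about 2, so a 3 dB improvement corresponds to roughly halving the mean squared error of the rendered views.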

Enhancing Pancreatic Cancer Staging with Large Language Models: The Role of Retrieval-Augmented Generation

Hisashi Johno,Yuki Johno,Akitomo Amakawa,Junichi Sato,Ryota Tozuka,Atsushi Komaba,Hiroaki Watanabe,Hiroki Watanabe,Chihiro Goto,Hiroyuki Morisaka,Hiroshi Onishi,Kazunori Nakamoto

Task: Evaluate the utility of retrieval-augmented generation (RAG) for cancer staging across cancer types, here applied to pancreatic cancer staging.

Motivation: To better isolate the effect of RAG and assess its utility across different cancers, NotebookLM is compared with its internal LLM (Gemini 2.0 Flash) in a pancreatic cancer staging experiment.

Details

Method: Japan's pancreatic cancer staging guidelines serve as reliable external knowledge (REK). Three groups are compared in staging 100 fictional pancreatic cancer cases based on CT findings: REK+/RAG+ (NotebookLM with REK), REK+/RAG- (Gemini 2.0 Flash with REK), and REK-/RAG- (Gemini 2.0 Flash without REK).

Result: REK+/RAG+ achieved 70% staging accuracy, outperforming REK+/RAG- (38%) and REK-/RAG- (35%). For TNM classification, REK+/RAG+ reached 80% accuracy versus 55% for REK+/RAG- and 50% for REK-/RAG-. REK+/RAG+ also explicitly presented retrieved REK excerpts, with 92% retrieval accuracy.

Conclusion: NotebookLM, a RAG-LLM, outperformed its internal LLM (Gemini 2.0 Flash) in pancreatic cancer staging, suggesting that RAG may improve LLM staging accuracy. Its ability to retrieve and present REK excerpts gives physicians transparency, highlighting its applicability to clinical diagnosis and classification.

Abstract: Purpose: Retrieval-augmented generation (RAG) is a technology to enhance the functionality and reliability of large language models (LLMs) by retrieving relevant information from reliable external knowledge (REK). RAG has gained interest in radiology, and we previously reported the utility of NotebookLM, an LLM with RAG (RAG-LLM), for lung cancer staging. However, since the comparator LLM differed from NotebookLM's internal model, it remained unclear whether its advantage stemmed from RAG or inherent model differences. To better isolate RAG's impact and assess its utility across different cancers, we compared NotebookLM with its internal LLM, Gemini 2.0 Flash, in a pancreatic cancer staging experiment. Materials and Methods: A summary of Japan's pancreatic cancer staging guidelines was used as REK. We compared three groups - REK+/RAG+ (NotebookLM with REK), REK+/RAG- (Gemini 2.0 Flash with REK), and REK-/RAG- (Gemini 2.0 Flash without REK) - in staging 100 fictional pancreatic cancer cases based on CT findings. Staging criteria included TNM classification, local invasion factors, and resectability classification. In REK+/RAG+, retrieval accuracy was quantified based on the sufficiency of retrieved REK excerpts. Results: REK+/RAG+ achieved a staging accuracy of 70%, outperforming REK+/RAG- (38%) and REK-/RAG- (35%). For TNM classification, REK+/RAG+ attained 80% accuracy, exceeding REK+/RAG- (55%) and REK-/RAG- (50%). Additionally, REK+/RAG+ explicitly presented retrieved REK excerpts, achieving a retrieval accuracy of 92%. Conclusion: NotebookLM, a RAG-LLM, outperformed its internal LLM, Gemini 2.0 Flash, in a pancreatic cancer staging experiment, suggesting that RAG may improve an LLM's staging accuracy. Furthermore, its ability to retrieve and present REK excerpts provides transparency for physicians, highlighting its applicability for clinical diagnosis and classification.

GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving

William Ljungbergh,Adam Lilja,Adam Tonderski,Arvid Laveno Ling,Carl Lindström,Willem Verbeke,Junsheng Fu,Christoffer Petersson,Lars Hammarstrand,Michael Felsberg

Task: Propose GASP, a geometric and semantic self-supervised pre-training method for learning environment representations in autonomous driving.

Motivation: Autonomous driving generates vast amounts of spatiotemporal data, suggesting that scale can be harnessed to learn the geometric and semantic structure of the environment and its evolution over time.

Details

Method: Learns a unified representation by predicting, at future points in spacetime, (1) general occupancy, (2) ego occupancy, and (3) high-level features distilled from a vision foundation model.

Result: Validated on multiple autonomous driving benchmarks, GASP yields significant improvements in semantic occupancy forecasting, online mapping, and ego trajectory prediction.

Conclusion: Continuous 4D geometric and semantic occupancy prediction provides a scalable and effective pre-training paradigm for autonomous driving.

Abstract: Self-supervised pre-training based on next-token prediction has enabled large language models to capture the underlying structure of text, and has led to unprecedented performance on a large array of tasks when applied at scale. Similarly, autonomous driving generates vast amounts of spatiotemporal data, pointing to the possibility of harnessing scale to learn the underlying geometric and semantic structure of the environment and its evolution over time. In this direction, we propose a geometric and semantic self-supervised pre-training method, GASP, that learns a unified representation by predicting, at any queried future point in spacetime, (1) general occupancy, capturing the evolving structure of the 3D scene; (2) ego occupancy, modeling the ego vehicle path through the environment; and (3) distilled high-level features from a vision foundation model. By modeling geometric and semantic 4D occupancy fields instead of raw sensor measurements, the model learns a structured, generalizable representation of the environment and its evolution through time. We validate GASP on multiple autonomous driving benchmarks, demonstrating significant improvements in semantic occupancy forecasting, online mapping, and ego trajectory prediction. Our results demonstrate that continuous 4D geometric and semantic occupancy prediction provides a scalable and effective pre-training paradigm for autonomous driving. For code and additional visualizations, see https://research.zenseact.com/publications/gasp/.

Am I eligible? Natural Language Inference for Clinical Trial Patient Recruitment: the Patient's Point of View

Mathilde Aguiar,Pierre Zweigenbaum,Nona Naderi

Task: Study the case where patients describe their medical profile in their own natural language to determine whether they are eligible for a clinical trial.

Motivation: Promoting clinical trials directly to patients through online recruitment might reach them more efficiently.

Details

Method: Designs a new dataset and task, NLI4PR, by adapting the TREC 2022 Clinical Trial Track dataset: patients' medical profiles are manually rephrased in patient language and paired with the associated clinical trial reports.

Result: F1 scores range from 56.5 to 71.8 with patient language, against 64.7 to 73.1 with medical language.

Conclusion: The performance loss with patient language is small, suggesting that starting from the patient can help recruit patients for clinical trials.

Abstract: Recruiting patients to participate in clinical trials can be challenging and time-consuming. Usually, participation in a clinical trial is initiated by a healthcare professional and proposed to the patient. Promoting clinical trials directly to patients via online recruitment might help to reach them more efficiently. In this study, we address the case where a patient is initiating their own recruitment process and wants to determine whether they are eligible for a given clinical trial, using their own language to describe their medical profile. To study whether this creates difficulties in the patient trial matching process, we design a new dataset and task, Natural Language Inference for Patient Recruitment (NLI4PR), in which patient language profiles must be matched to clinical trials. We create it by adapting the TREC 2022 Clinical Trial Track dataset, which provides patients' medical profiles, and rephrasing them manually using patient language. We also use the associated clinical trial reports where the patients are either eligible or excluded. We prompt several open-source Large Language Models on our task and achieve F1 scores from 56.5 to 71.8 using patient language, against 64.7 to 73.1 for the same task using medical language. When using patient language, we observe only a small loss in performance for the best model, suggesting that having the patient as a starting point could be adopted to help recruit patients for clinical trials. The corpus and code bases are all freely available on our GitHub and HuggingFace repositories.
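
Note: as a reminder of how the reported metric is computed, here is a minimal sketch of the F1 score for a binary eligible/excluded decision. The abstract does not say which F1 variant or averaging scheme was used, so the binary form, the function name `f1_score`, and the toy labels are assumptions.

```python
def f1_score(gold, pred, positive="eligible"):
    """Binary F1 for the `positive` class: 2*TP / (2*TP + FP + FN)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    denom = 2 * tp + fp + fn
    # Convention chosen here: no positives predicted or present gives 0.0.
    return 2 * tp / denom if denom else 0.0

# Toy labels (hypothetical): 2 true positives, 1 false positive, 1 false negative.
gold = ["eligible", "eligible", "excluded", "excluded", "eligible"]
pred = ["eligible", "excluded", "excluded", "eligible", "eligible"]
score = f1_score(gold, pred)
```

The abstract reports scores on a 0 to 100 scale, so a value from this function would be multiplied by 100 for comparison.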

High Temporal Consistency through Semantic Similarity Propagation in Semi-Supervised Video Semantic Segmentation for Autonomous Flight

Cédric Vincent,Taehyoung Kim,Henri Meeß

Task: Propose a lightweight video semantic segmentation method suited to onboard real-time inference that achieves high temporal consistency through semantic similarity propagation.

Motivation: Semantic segmentation from RGB cameras is essential to the perception of autonomous flying vehicles, and the stability of predictions directly affects their reliability and trustworthiness.

Details

Method: Proposes Semantic Similarity Propagation (SSP), which compensates for camera motion with global registration alignment and linearly interpolates between the current estimate and the prior prediction. Also proposes a consistency-aware knowledge distillation (KD) training procedure for sparsely labeled datasets.

Result: KD-SSP improves temporal consistency by 12.5% on UAVid and 6.7% on RuralScapes, with higher accuracy and comparable inference speed.

Conclusion: On aerial datasets, KD-SSP offers a better trade-off between segmentation quality and inference speed than other video methods, with higher temporal consistency.

Abstract: Semantic segmentation from RGB cameras is essential to the perception of autonomous flying vehicles. The stability of predictions through the captured videos is paramount to their reliability and, by extension, to the trustworthiness of the agents. In this paper, we propose a lightweight video semantic segmentation approach, suited to onboard real-time inference, that achieves high temporal consistency on aerial data through Semantic Similarity Propagation (SSP) across frames. SSP temporally propagates the predictions of an efficient image segmentation model with global registration alignment to compensate for camera movements. It combines the current estimation and the prior prediction with linear interpolation using weights computed from the feature similarities of the two frames. Because data availability is a challenge in this domain, we propose a consistency-aware Knowledge Distillation training procedure for sparsely labeled datasets with few annotations. Using a large image segmentation model as a teacher to train the efficient SSP, we leverage the strong correlations between labeled and unlabeled frames in the same training videos to obtain high-quality supervision on all frames. KD-SSP obtains a significant temporal consistency increase over the base image segmentation model of 12.5% and 6.7% TC on UAVid and RuralScapes respectively, with higher accuracy and comparable inference speed. On these aerial datasets, KD-SSP provides a better segmentation quality and inference speed trade-off than other video methods proposed for general applications and shows considerably higher consistency. The code will be made publicly available upon acceptance.
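
Note: the interpolation step described above can be sketched per pixel as follows. This is a minimal illustration under stated assumptions: the prior frame is taken to be already aligned by registration, the weight is the cosine similarity of the two feature vectors clamped to [0, 1], and all names (`propagate`, `_cosine`) are hypothetical. The abstract does not specify how the weights are actually derived from the features.

```python
import math

def _cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is zero)."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def propagate(cur_probs, prev_probs, cur_feats, prev_feats):
    """Blend current and prior class probabilities pixel by pixel.

    fused = w * prior + (1 - w) * current, where w grows with the
    similarity between the current and (pre-aligned) prior features.
    Inputs are flat lists with one entry per pixel.
    """
    fused = []
    for pc, pp, fc, fp in zip(cur_probs, prev_probs, cur_feats, prev_feats):
        w = min(1.0, max(0.0, _cosine(fc, fp)))  # assumed weight mapping
        fused.append([w * b + (1.0 - w) * a for a, b in zip(pc, pp)])
    return fused

# Two pixels, two classes. Pixel 0 has unchanged features, so the prior
# prediction is kept; pixel 1 has orthogonal features, so the current
# estimate is kept.
cur_probs = [[0.9, 0.1], [0.2, 0.8]]
prev_probs = [[0.4, 0.6], [0.7, 0.3]]
cur_feats = [[1.0, 0.0], [1.0, 0.0]]
prev_feats = [[1.0, 0.0], [0.0, 1.0]]
fused = propagate(cur_probs, prev_probs, cur_feats, prev_feats)
```

Because each output is a convex combination of two probability vectors, it remains a valid probability vector. Where the scene has not changed, the prior prediction dominates, which is what damps frame-to-frame flicker.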

KoGNER: A Novel Framework for Knowledge Graph Distillation on Biomedical Named Entity Recognition

Heming Zhang,Wenyu Li,Di Huang,Yinjie Tang,Yixin Chen,Philip Payne,Fuhai Li

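Task: Introduce KoGNER, a new approach that integrates knowledge graph distillation into NER models to improve entity recognition performance.

Motivation: Traditional deep learning NER models struggle with domain generalization and data sparsity, so a new approach is needed to improve entity recognition accuracy and performance.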

Details

Method: KoGNER uses a two-step process: (1) knowledge distillation, in which external knowledge sources are distilled into a lightweight representation for seamless integration with NER models; and (2) entity-aware augmentation, in which contextual embeddings enriched with knowledge graph information are integrated directly into a GNN, improving the model's ability to understand and represent entity relationships. Result: Experiments on benchmark datasets show that KoGNER achieves state-of-the-art performance, significantly outperforming fine-tuned NER models and LLMs. Conclusion: Using knowledge graphs as auxiliary information can significantly improve NER accuracy, and KoGNER offers a promising direction for future research in knowledge-aware NLP. Abstract: Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that plays a crucial role in information extraction, question answering, and knowledge-based systems. Traditional deep learning-based NER models often struggle with domain-specific generalization and suffer from data sparsity issues. In this work, we introduce Knowledge Graph distilled for Named Entity Recognition (KoGNER), a novel approach that integrates Knowledge Graph (KG) distillation into NER models to enhance entity recognition performance. Our framework leverages structured knowledge representations from KGs to enrich contextual embeddings, thereby improving entity classification and reducing ambiguity in entity detection. KoGNER employs a two-step process: (1) Knowledge Distillation, where external knowledge sources are distilled into a lightweight representation for seamless integration with NER models, and (2) Entity-Aware Augmentation, which integrates contextual embeddings that have been enriched with knowledge graph information directly into a GNN, thereby improving the model's ability to understand and represent entity relationships. Experimental results on benchmark datasets demonstrate that KoGNER achieves state-of-the-art performance, outperforming fine-tuned NER models and LLMs by a significant margin. These findings suggest that leveraging knowledge graphs as auxiliary information can significantly improve NER accuracy, making KoGNER a promising direction for future research in knowledge-aware NLP.

The Change You Want To Detect: Semantic Change Detection In Earth Observation With Hybrid Data Generation

Benidir Yanis,Gonthier Nicolas,Mallet Clement

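Task: Propose HySCDG, a method for generating a large-scale hybrid semantic change detection dataset for change detection on bi-temporal high-resolution images.

Motivation: Existing methods either require large volumes of annotated data or are limited to specific datasets, and they lack temporal and spatial adaptability. Synthetic datasets are one solution but struggle to handle complex and diverse scenes.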

Details

Method: Proposes the HySCDG generative pipeline, which combines real high-resolution images with inpainted ones to produce FSC-180k, a hybrid dataset containing land cover semantic maps and change maps. Result: Experiments show that pretraining on the hybrid dataset significantly boosts performance, outperforming the fully synthetic SyntheWorld dataset in every configuration. Conclusion: FSC-180k, the hybrid dataset generated by HySCDG, performs well on change detection tasks and provides more comprehensive training data. Abstract: Bi-temporal change detection at scale based on Very High Resolution (VHR) images is crucial for Earth monitoring. This remains poorly addressed so far: methods either require large volumes of annotated data (semantic case), or are limited to restricted datasets (binary set-ups). Most approaches do not exhibit the versatility required for temporal and spatial adaptation: simplicity in architecture design and pretraining on realistic and comprehensive datasets. Synthetic datasets are the key solution but still fail to handle complex and diverse scenes. In this paper, we present HySCDG, a generative pipeline for creating a large hybrid semantic change detection dataset that contains both real VHR images and inpainted ones, along with land cover semantic maps at both dates and the change map. Being semantically and spatially guided, HySCDG generates realistic images, leading to a comprehensive and hybrid transfer-proof dataset, FSC-180k. We evaluate FSC-180k on five change detection cases (binary and semantic), from zero-shot to mixed and sequential training, and also under low data regime training. Experiments demonstrate that pretraining on our hybrid dataset leads to a significant performance boost, outperforming SyntheWorld, a fully synthetic dataset, in every configuration. All code, models, and data are available here: https://yb23.github.io/projects/cywd/

Can one size fit all?: Measuring Failure in Multi-Document Summarization Domain Transfer

Alexandra DeLucia,Mark Dredze

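Task: Automatically summarize information from multiple documents, from news articles to multi-speaker conversations.

Motivation: Evaluate multi-document summarization models across training approaches, domains, and dimensions (reference similarity, quality, and factuality), and analyze why models trained on one domain fail on another (News, Science, and Conversation) in the zero-shot domain transfer setting.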

Details

Method: Groups the training approaches of current MDS models into four categories: end-to-end with special pre-training ("direct"), chunk-then-summarize, extract-then-summarize, and inference with GPT-style models. Result: Defines domain-transfer "failure" as a decrease in factuality, higher deviation from the target, and a general decrease in summary quality. Conclusion: Beyond exploring domain transfer for MDS models, the work examines potential issues with applying popular summarization metrics out of the box. Abstract: Abstractive multi-document summarization (MDS) is the task of automatically summarizing information in multiple documents, from news articles to conversations with multiple speakers. The training approaches for current MDS models can be grouped into four categories: end-to-end with special pre-training ("direct"), chunk-then-summarize, extract-then-summarize, and inference with GPT-style models. In this work, we evaluate MDS models across training approaches, domains, and dimensions (reference similarity, quality, and factuality), to analyze how and why models trained on one domain can fail to summarize documents from another (News, Science, and Conversation) in the zero-shot domain transfer setting. We define domain-transfer "failure" as a decrease in factuality, higher deviation from the target, and a general decrease in summary quality. In addition to exploring domain transfer for MDS models, we examine potential issues with applying popular summarization metrics out of the box.

Multi-focal Conditioned Latent Diffusion for Person Image Synthesis

Jiaqi Liu,Jichao Zahng,Paolo Rota,Nicu Sebe

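Task: Propose Multi-focal Conditioned Latent Diffusion (MCLD) to address detail degradation in Latent Diffusion Models (LDMs).

Motivation: Latent diffusion models excel at high-resolution image generation, but their compression process degrades details, particularly in sensitive regions such as facial features and clothing textures.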

Details

Method: Uses disentangled, pose-invariant features in a multi-focal condition aggregation module to strengthen the model's ability to generate appearance-realistic and identity-consistent images. Result: Demonstrates consistent identity and appearance generation on the DeepFashion dataset and enables flexible person image editing. Conclusion: The proposed MCLD method effectively addresses detail degradation in LDMs and improves the quality and consistency of generated images. Abstract: The Latent Diffusion Model (LDM) has demonstrated strong capabilities in high-resolution image generation and has been widely employed for Pose-Guided Person Image Synthesis (PGPIS), yielding promising results. However, the compression process of LDM often results in the deterioration of details, particularly in sensitive areas such as facial features and clothing textures. In this paper, we propose a Multi-focal Conditioned Latent Diffusion (MCLD) method to address these limitations by conditioning the model on disentangled, pose-invariant features from these sensitive regions. Our approach utilizes a multi-focal condition aggregation module, which effectively integrates facial identity and texture-specific information, enhancing the model's ability to produce appearance-realistic and identity-consistent images. Our method demonstrates consistent identity and appearance generation on the DeepFashion dataset and enables flexible person image editing due to its generation consistency. The code is available at https://github.com/jqliu09/mcld.

Grammar and Gameplay-aligned RL for Game Description Generation with LLMs

Tsunehiko Tanaka,Edgar Simo-Serra

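Task: Generate game descriptions written in a Game Description Language (GDL).

Motivation: Existing methods still struggle to accurately reproduce the game features of game descriptions.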

Details

Method: Proposes reinforcement learning-based fine-tuning of LLMs for GDG (RLGDG), which introduces grammar rewards and concept rewards to improve grammatical correctness and fidelity to game concepts simultaneously, and adopts a two-stage training strategy in which Reinforcement Learning (RL) follows Supervised Fine-Tuning (SFT). Result: Experiments show that the proposed method significantly outperforms baselines that use SFT alone. Conclusion: RL-based fine-tuning offers a clear advantage for generating game descriptions. Abstract: Game Description Generation (GDG) is the task of generating a game description written in a Game Description Language (GDL) from natural language text. Previous studies have explored generation methods leveraging the contextual understanding capabilities of Large Language Models (LLMs); however, accurately reproducing the game features of the game descriptions remains a challenge. In this paper, we propose reinforcement learning-based fine-tuning of LLMs for GDG (RLGDG). Our training method simultaneously improves grammatical correctness and fidelity to game concepts by introducing both grammar rewards and concept rewards. Furthermore, we adopt a two-stage training strategy where Reinforcement Learning (RL) is applied following Supervised Fine-Tuning (SFT). Experimental results demonstrate that our proposed method significantly outperforms baseline methods using SFT alone.

Technical Report for the 5th CLVision Challenge at CVPR: Addressing the Class-Incremental with Repetition using Unlabeled Data -- 4th Place Solution

Panagiota Moraiti,Efstathios Karypidis

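Task: Address the Class-Incremental with Repetition (CIR) scenario in the 5th CLVision challenge at CVPR.

Motivation: Unlike traditional class-incremental learning, the CIR scenario introduces new challenges and research opportunities, particularly the integration of unlabeled data into training.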

Details

Method: Uses knowledge distillation and pseudo-labeling to retain previously learned knowledge and exploits unlabeled data during training. Result: Achieves an average accuracy of 16.68% in the pre-selection phase and 21.19% in the final evaluation phase, outperforming the baseline accuracy of 9.39%. Conclusion: By exploiting unlabeled data, the method maintains performance on previously encountered classes and reduces the detrimental effects of catastrophic forgetting. Abstract: This paper outlines our approach to the 5th CLVision challenge at CVPR, which addresses the Class-Incremental with Repetition (CIR) scenario. In contrast to traditional class incremental learning, this novel setting introduces unique challenges and research opportunities, particularly through the integration of unlabeled data into the training process. In the CIR scenario, encountered classes may reappear in later learning experiences, and each experience may involve only a subset of the overall class distribution. Additionally, the unlabeled data provided during training may include instances of unseen classes, or irrelevant classes which should be ignored. Our approach focuses on retaining previously learned knowledge by utilizing knowledge distillation and pseudo-labeling techniques. The key characteristic of our method is the exploitation of unlabeled data during training, in order to maintain optimal performance on instances of previously encountered categories and reduce the detrimental effects of catastrophic forgetting. Our method achieves an average accuracy of 16.68% during the pre-selection phase and 21.19% during the final evaluation phase, outperforming the baseline accuracy of 9.39%. We provide the implementation code at https://github.com/panagiotamoraiti/continual-learning-challenge-2024 .
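
As a rough illustration of the two ingredients named in the abstract, the sketch below shows a generic temperature-scaled distillation loss and a confidence-thresholded pseudo-labeling rule. It is not the authors' implementation: the temperature, the threshold, and the function names are assumptions, and the linked repository should be consulted for the actual method.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pseudo_label(old_model_logits, threshold=0.9):
    """Return the predicted class index if the previous model is confident, else None.

    Returning None lets low-confidence unlabeled samples (for example unseen
    or irrelevant classes) be skipped; the 0.9 threshold is an assumption.
    """
    probs = softmax(old_model_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None
```

In a setup like this, the distillation term would be computed against the model from the previous experience, while pseudo-labels turn confident unlabeled samples into extra supervised examples.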

Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation

Shangqing Zhao,Yuhao Zhou,Yupei Ren,Zhe Chen,Chenghao Jia,Fang Zhe,Zhaogaung Long,Shu Liu,Man Lan

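Task: Evaluate the understanding and generation capabilities of large language models on classical Chinese text.

Motivation: Existing benchmarks focus mainly on evaluating comprehension through multiple-choice questions, leaving a significant gap in assessing generation capabilities in classical Chinese.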

Details

Method: Introduces the Fùxì benchmark, which covers 21 diverse tasks, including poetry composition and couplet completion, with specially designed evaluation metrics and a systematic assessment framework. Result: Evaluation shows that models perform well on understanding tasks but poorly on generation tasks, especially those requiring deep cultural knowledge and adherence to classical formats. Conclusion: The study reveals the current limitations of ancient Chinese text processing and provides insights for future model development. Abstract: Ancient Chinese text processing presents unique challenges for large language models (LLMs) due to its distinct linguistic features, complex structural constraints, and rich cultural context. While existing benchmarks have primarily focused on evaluating comprehension through multiple-choice questions, there remains a critical gap in assessing models' generative capabilities in classical Chinese. We introduce Fùxì, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks. Our benchmark distinguishes itself through three key contributions: (1) balanced coverage of both comprehension and generation tasks, including novel tasks like poetry composition and couplet completion, (2) specialized evaluation metrics designed specifically for classical Chinese text generation, combining rule-based verification with fine-tuned LLM evaluators, and (3) a systematic assessment framework that considers both linguistic accuracy and cultural authenticity. Through extensive evaluation of state-of-the-art LLMs, we reveal significant performance gaps between understanding and generation tasks, with models achieving promising results in comprehension but struggling considerably in generation tasks, particularly those requiring deep cultural knowledge and adherence to classical formats. Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development. The benchmark, evaluation toolkit, and baseline results are publicly available to facilitate research in this domain.

Representational Similarity via Interpretable Visual Concepts

Neehar Kondapaneni,Oisin Mac Aodha,Pietro Perona

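Task: Compare how two deep neural networks differ in arriving at a decision.

Motivation: Measuring the similarity of deep networks is a long-standing open problem. Existing methods usually provide only a single number for the similarity of two networks at a given layer and cannot explain why they are similar or different.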

Details

Method: Introduces an interpretable representational similarity method (RSVC) for comparing two networks and uses it to discover visual concepts that are shared between or unique to the two models. Result: Shows that some aspects of model differences can be attributed to unique concepts discovered by one model that are not well represented in the other. Conclusion: Extensive evaluation across different vision model architectures and training protocols demonstrates the effectiveness of RSVC. Abstract: How do two deep neural networks differ in how they arrive at a decision? Measuring the similarity of deep networks has been a long-standing open question. Most existing methods provide a single number to measure the similarity of two networks at a given layer, but give no insight into what makes them similar or dissimilar. We introduce an interpretable representational similarity method (RSVC) to compare two networks. We use RSVC to discover shared and unique visual concepts between two models. We show that some aspects of model differences can be attributed to unique concepts discovered by one model that are not well represented in the other. Finally, we conduct extensive evaluation across different vision model architectures and training protocols to demonstrate the method's effectiveness.

Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey

Xiaoou Liu,Tiejin Chen,Longchao Da,Chacha Chen,Zhen Lin,Hua Wei

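Task: Propose a new taxonomy that categorizes uncertainty quantification methods by computational efficiency and uncertainty dimension (input, reasoning, parameter, and prediction uncertainty).

Motivation: Large Language Models (LLMs) are increasingly used in high-stakes domains, but their reliability is a major concern because they often produce plausible but incorrect responses. Uncertainty quantification (UQ) improves trustworthiness by estimating confidence in outputs, enabling risk mitigation and selective prediction. However, traditional UQ methods are hard to apply to LLMs because of computational constraints and decoding inconsistencies. In addition, LLMs introduce unique sources of uncertainty, such as input ambiguity, reasoning path divergence, and decoding stochasticity, that go beyond classical aleatoric and epistemic uncertainty.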

Details

Method: Introduces a new taxonomy, evaluates existing techniques, assesses their real-world applicability, and identifies open challenges. Result: Presents a new taxonomy, assesses the practical applicability of existing techniques, and emphasizes the need for scalable, interpretable, and robust UQ approaches. Conclusion: Scalable, interpretable, and robust uncertainty quantification methods are needed to improve the reliability of large language models. Abstract: Large Language Models (LLMs) excel in text generation, reasoning, and decision-making, enabling their adoption in high-stakes domains such as healthcare, law, and transportation. However, their reliability is a major concern, as they often produce plausible but incorrect responses. Uncertainty quantification (UQ) enhances trustworthiness by estimating confidence in outputs, enabling risk mitigation and selective prediction. However, traditional UQ methods struggle with LLMs due to computational constraints and decoding inconsistencies. Moreover, LLMs introduce unique uncertainty sources, such as input ambiguity, reasoning path divergence, and decoding stochasticity, that extend beyond classical aleatoric and epistemic uncertainty. To address this, we introduce a new taxonomy that categorizes UQ methods based on computational efficiency and uncertainty dimensions (input, reasoning, parameter, and prediction uncertainty). We evaluate existing techniques, assess their real-world applicability, and identify open challenges, emphasizing the need for scalable, interpretable, and robust UQ approaches to enhance LLM reliability.
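
Since the survey's title concerns confidence calibration, a small worked example may help. The sketch below computes the standard expected calibration error (ECE) with equal-width bins. It is a generic metric chosen for illustration and is not taken from the survey; the function name and the bin count are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    confidences: model confidence in [0, 1] for each answer.
    correct: 1 if the corresponding answer was right, else 0.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are (lo, hi]; a confidence of exactly 0.0 goes into the first bin.
        idx = [i for i, c in enumerate(confidences)
               if (lo < c <= hi) or (b == 0 and c == 0.0)]
        if not idx:
            continue
        accuracy = sum(correct[i] for i in idx) / len(idx)
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(accuracy - avg_conf)
    return ece

# Two answers given with 95% confidence, only one of them correct.
overconfident = expected_calibration_error([0.95, 0.95], [1, 0])
```

A well-calibrated model has an ECE close to zero, whereas the overconfident example above yields a gap of about 0.45 between stated confidence and observed accuracy.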

Sustainable Deep Learning-Based Breast Lesion Segmentation: Impact of Breast Region Segmentation on Performance

Sam Narimani,Solveig Roth Hoff,Kathinka Dahli Kurz,Kjell-Inge Gjesdal,Jurgen Geisler,Endre Grovik

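Task: Study the impact of breast region segmentation (BRS) on deep learning-based breast lesion segmentation (BLS) in breast dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI).

Motivation: Accurate segmentation of breast lesions is a key step in diagnosis, treatment planning, and progress monitoring. This study examines the effect of BRS on BLS.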

Details

Method: Using the Stavanger Dataset of 59 DCE-MRI scans and a UNet++ model, four processing approaches were compared to assess the effect of BRS on BLS. Preprocessing included augmentation and oversampling to improve performance on the small dataset and ensure shape uniformity. The optimal volume size was determined through a precise procedure, and the model was evaluated with a hybrid loss function and 5-fold cross-validation. Result: Using BRS considerably improved model performance and validation. The optimal-volume approach with BRS improved by around 50% over the approach without BRS, and energy consumption also improved markedly, with a reported reduction of 450%. Conclusion: BRS is highly effective for BLS: it improves model performance and reduces energy consumption, offering a more environmentally friendly approach for future work on large datasets. Abstract: Purpose: Segmentation of the breast lesion in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is an essential step to accurately diagnose and plan treatment and monitor progress. This study aims to highlight the impact of breast region segmentation (BRS) on deep learning-based breast lesion segmentation (BLS) in breast DCE-MRI. Methods: Using the Stavanger Dataset containing primarily 59 DCE-MRI scans and UNet++ as the deep learning model, four different processes were conducted to compare the effect of BRS on BLS. These four approaches included the whole volume without BRS and with BRS, BRS with the selected lesion slices, and lastly the optimal volume with BRS. Preprocessing methods like augmentation and oversampling were used to enhance the small dataset, ensure uniform data shape, and improve model performance. The optimal volume size was investigated by a precise process to ensure that all lesions existed in slices. To evaluate the model, a hybrid loss function including dice, focal, and cross entropy along with a 5-fold cross-validation method was used, and lastly a randomly split test dataset was used to evaluate the model performance on unseen data for each of the four approaches. Results: Results demonstrate that using BRS considerably improved model performance and validation. The last approach (optimal volume with BRS) showed a significant improvement of around 50 percent over the approach without BRS, demonstrating how effective BRS has been in BLS. Moreover, a large improvement in energy consumption, reported as a decrease of up to 450 percent, introduces a green solution toward a more environmentally sustainable approach for future work on large datasets.
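
The abstract mentions a hybrid loss combining dice, focal, and cross entropy but does not say how the terms are weighted. A minimal sketch for binary masks, assuming equal weights and a focal exponent of 2 (both assumptions, as is the function name), might look like this:

```python
import math

def hybrid_loss(probs, targets, gamma=2.0, eps=1e-7, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of Dice, focal, and cross-entropy losses for binary masks.

    probs: predicted foreground probability per voxel; targets: 0/1 labels.
    The equal default weights and gamma=2 are illustrative assumptions.
    """
    # Clamp probabilities so the logarithms below stay finite.
    probs = [min(max(p, eps), 1.0 - eps) for p in probs]
    # Soft Dice loss: 1 - 2 * overlap / (predicted mass + target mass).
    overlap = sum(p * t for p, t in zip(probs, targets))
    dice = 1.0 - (2.0 * overlap + eps) / (sum(probs) + sum(targets) + eps)
    # Probability assigned to the true class of each voxel.
    pt = [p if t == 1 else 1.0 - p for p, t in zip(probs, targets)]
    cross_entropy = -sum(math.log(x) for x in pt) / len(pt)
    # Focal loss down-weights voxels that are already classified well.
    focal = -sum((1.0 - x) ** gamma * math.log(x) for x in pt) / len(pt)
    w_dice, w_focal, w_ce = weights
    return w_dice * dice + w_focal * focal + w_ce * cross_entropy

good = hybrid_loss([0.9, 0.1, 0.8], [1, 0, 1])
bad = hybrid_loss([0.1, 0.9, 0.2], [1, 0, 1])
```

The Dice term addresses the class imbalance typical of small lesions, while the focal and cross-entropy terms provide per-voxel gradients.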

Typed-RAG: Type-aware Multi-Aspect Decomposition for Non-Factoid Question Answering

DongGeon Lee,Ahjeong Park,Hyeri Lee,Hyeonseo Nam,Yunho Maeng

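Task: Propose Typed-RAG, a type-aware multi-aspect decomposition framework for non-factoid question answering (NFQA).

Motivation: Because of its open-ended nature, diverse intents, and need for multi-aspect reasoning, non-factoid question answering is poorly served by conventional retrieval-augmented generation (RAG) approaches.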

Details

Method: Typed-RAG classifies non-factoid questions into distinct types (such as debate, experience, and comparison) and applies aspect-based decomposition to refine retrieval and generation strategies. Result: Experiments show that Typed-RAG outperforms baselines on the Wiki-NFQA benchmark dataset, highlighting the importance of type-aware decomposition for NFQA. Conclusion: Through type-aware multi-aspect decomposition, Typed-RAG generates more informative and contextually relevant responses and improves non-factoid question answering. Abstract: Non-factoid question-answering (NFQA) poses a significant challenge due to its open-ended nature, diverse intents, and the need for multi-aspect reasoning, which renders conventional factoid QA approaches, including retrieval-augmented generation (RAG), inadequate. Unlike factoid questions, non-factoid questions (NFQs) lack definitive answers and require synthesizing information from multiple sources across various reasoning dimensions. To address these limitations, we introduce Typed-RAG, a type-aware multi-aspect decomposition framework within the RAG paradigm for NFQA. Typed-RAG classifies NFQs into distinct types (such as debate, experience, and comparison) and applies aspect-based decomposition to refine retrieval and generation strategies. By decomposing multi-aspect NFQs into single-aspect sub-queries and aggregating the results, Typed-RAG generates more informative and contextually relevant responses. To evaluate Typed-RAG, we introduce Wiki-NFQA, a benchmark dataset covering diverse NFQ types. Experimental results demonstrate that Typed-RAG outperforms baselines, thereby highlighting the importance of type-aware decomposition for effective retrieval and generation in NFQA. Our code and dataset are available at https://github.com/TeamNLP/Typed-RAG.

SPNeRF: Open Vocabulary 3D Neural Scene Segmentation with Superpoints

Weiwen Hu,Niccolò Parodi,Marcus Zepp,Ingo Feldmann,Oliver Schreer,Peter Eisert

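Task: Propose SPNeRF, a NeRF-based zero-shot 3D segmentation method that leverages geometric priors for 3D scene segmentation.

Motivation: CLIP's image-based embeddings, which power 2D segmentation, lack the geometric detail needed for 3D segmentation, and existing methods tend to introduce redundancy or lose CLIP's general language capabilities.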

Details

Method: Integrates geometric primitives from the 3D scene into NeRF training to produce primitive-wise CLIP features, and proposes a primitive-based merging mechanism. Result: Notably improves 3D segmentation performance without relying on additional segmentation models. Conclusion: SPNeRF makes effective use of CLIP for 3D segmentation and achieves notable performance gains. Abstract: Open-vocabulary segmentation, powered by large visual-language models like CLIP, has expanded 2D segmentation capabilities beyond fixed classes predefined by the dataset, enabling zero-shot understanding across diverse scenes. Extending these capabilities to 3D segmentation introduces challenges, as CLIP's image-based embeddings often lack the geometric detail necessary for 3D scene segmentation. Recent methods tend to address this by introducing additional segmentation models or replacing CLIP with variations trained on segmentation data, which leads to redundancy or loss of CLIP's general language capabilities. To overcome this limitation, we introduce SPNeRF, a NeRF-based zero-shot 3D segmentation approach that leverages geometric priors. We integrate geometric primitives derived from the 3D scene into NeRF training to produce primitive-wise CLIP features, avoiding the ambiguity of point-wise features. Additionally, we propose a primitive-based merging mechanism enhanced with affinity scores. Without relying on additional segmentation models, our method further explores CLIP's capability for 3D segmentation and achieves notable improvements over the original LERF.

Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models

Baolong Bi,Shenghua Liu,Yiwei Wang,Yilong Xu,Junfeng Fang,Lingrui Mei,Xueqi Cheng

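Task: Propose a method for controlling how much large language models (LLMs) rely on parametric knowledge versus contextual knowledge.

Motivation: Resolve conflicts between parametric knowledge and retrieved context in retrieval-augmented generation (RAG), particularly when the retrieved information is unreliable or the model's internal knowledge is outdated.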

Details

Method: Proposes CK-PLUG, which introduces a new knowledge consistency metric, Confidence Gain, to detect knowledge conflicts, and achieves fine-grained control over knowledge preference by adjusting the probability distribution of tokens with negative confidence gain. Result: Experiments show that CK-PLUG can significantly regulate knowledge reliance in counterfactual RAG scenarios while maintaining generation fluency and knowledge accuracy. For example, on Llama3-8B, the memory recall (MR) of RAG responses can be adjusted within a range of 9.9%-71.9%, compared to a baseline of 42.1%. Conclusion: CK-PLUG supports adaptive control based on the model's confidence in both internal and external knowledge and achieves consistent performance improvements across various general RAG tasks. Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucinations in Large Language Models (LLMs) by integrating external knowledge. However, conflicts between parametric knowledge and retrieved context pose challenges, particularly when retrieved information is unreliable or the model's internal knowledge is outdated. In such cases, LLMs struggle to determine whether to rely more on their own parameters or the conflicted context. To address this, we propose CK-PLUG, a plug-and-play method for controlling LLMs' reliance on parametric and contextual knowledge. We introduce a novel knowledge consistency metric, Confidence Gain, which detects knowledge conflicts by measuring entropy shifts in token probability distributions after context insertion. CK-PLUG then enables fine-grained control over knowledge preference by adjusting the probability distribution of tokens with negative confidence gain through a single tuning parameter. Experiments demonstrate CK-PLUG's ability to significantly regulate knowledge reliance in counterfactual RAG scenarios while maintaining generation fluency and knowledge accuracy. For instance, on Llama3-8B, memory recall (MR) of RAG response can be adjusted within a broad range (9.9%-71.9%), compared to the baseline of 42.1%. Moreover, CK-PLUG supports adaptive control based on the model's confidence in both internal and external knowledge, achieving consistent performance improvements across various general RAG tasks. Our code is available at: https://github.com/byronBBL/CK-PLUG
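
To make the entropy-shift idea concrete, here is a minimal sketch under stated assumptions. It takes Confidence Gain to be the drop in entropy of the next-token distribution after the context is inserted, and it uses a simple linear mixture controlled by `alpha` as the adjustment. Both the sign convention and the mixing rule are assumptions for illustration; the abstract does not give the exact formulas, and the linked repository holds the actual method.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def confidence_gain(p_without_context, p_with_context):
    """Entropy drop after inserting context; negative means the model became less certain."""
    return entropy(p_without_context) - entropy(p_with_context)

def adjust(p_without_context, p_with_context, alpha):
    """Blend the two next-token distributions only when a conflict is detected.

    alpha=1 leans fully on parametric knowledge, alpha=0 fully on the context.
    The linear mixture is a hypothetical stand-in for the paper's adjustment.
    """
    if confidence_gain(p_without_context, p_with_context) >= 0:
        return list(p_with_context)  # no conflict detected: keep the context-aware distribution
    mixed = [alpha * a + (1 - alpha) * b
             for a, b in zip(p_without_context, p_with_context)]
    total = sum(mixed)
    return [m / total for m in mixed]

p_param = [0.90, 0.05, 0.05]  # confident parametric prediction
p_ctx = [0.40, 0.50, 0.10]    # flatter distribution after a conflicting context
gain = confidence_gain(p_param, p_ctx)
```

In this toy case the gain is negative, so `adjust` interpolates between the two distributions, and a single parameter moves the output from context-driven to memory-driven.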

Graph-Weighted Contrastive Learning for Semi-Supervised Hyperspectral Image Classification

Yuqing Zhang,Qi Han,Ligeng Wang,Kai Cheng,Bo Wang,Kun Zhan

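Task: Propose a new graph-weighted contrastive learning approach for hyperspectral image classification.

Motivation: Existing graph-based semi-supervised hyperspectral image classification methods rely on superpixel partitioning, but inaccurate superpixel boundaries cause some pixels to be misclassified, which limits overall classification performance.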

Details

Method: Proposes an approach that avoids superpixel partitioning and uses neural networks directly to learn hyperspectral image representations. It supports mini-batch training, which reduces computational complexity and improves generalization to unseen nodes. Result: Experiments on three widely used datasets show that the proposed approach is more effective than baselines that rely on superpixel partitioning. Conclusion: The proposed graph-weighted contrastive learning approach performs well on hyperspectral image classification, avoids the limitations of superpixel partitioning, and improves classification performance. Abstract: Most existing graph-based semi-supervised hyperspectral image classification methods rely on superpixel partitioning techniques. However, they suffer from misclassification of certain pixels due to inaccuracies in superpixel boundaries, i.e., the initial inaccuracies in superpixel partitioning limit overall classification performance. In this paper, we propose a novel graph-weighted contrastive learning approach that avoids the use of superpixel partitioning and directly employs neural networks to learn hyperspectral image representation. Furthermore, while many approaches require all graph nodes to be available during training, our approach supports mini-batch training by processing only a subset of nodes at a time, reducing computational complexity and improving generalization to unseen nodes. Experimental results on three widely used datasets demonstrate the effectiveness of the proposed approach compared to baselines relying on superpixel partitioning.

From Structured Prompts to Open Narratives: Measuring Gender Bias in LLMs Through Open-Ended Storytelling

Evan Chen,Run-Jun Zhan,Yan-Bai Lin,Hung-Hsuan Chen

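Task: Evaluate gender bias in large language models (LLMs), focusing on how it appears in their occupational narratives.

Motivation: Although large language models have revolutionized natural language processing, concerns persist that they reflect or amplify social biases present in their training data.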

Details

Method: Introduces a new evaluation framework that uses free-form storytelling to reveal biases in the models, rather than relying on structured scenarios or carefully crafted prompts. Result: Systematic analysis shows that female characters are overrepresented across occupations in six widely used LLMs. In addition, LLM-generated occupational gender rankings align more closely with human stereotypes than with actual labor statistics. Conclusion: The findings underscore the need for balanced mitigation strategies that ensure fairness while avoiding the reinforcement of new stereotypes. Abstract: Large Language Models (LLMs) have revolutionized natural language processing, yet concerns persist regarding their tendency to reflect or amplify social biases present in their training data. This study introduces a novel evaluation framework to uncover gender biases in LLMs, focusing on their occupational narratives. Unlike previous methods relying on structured scenarios or carefully crafted prompts, our approach leverages free-form storytelling to reveal biases embedded in the models. Systematic analyses show an overrepresentation of female characters across occupations in six widely used LLMs. Additionally, our findings reveal that LLM-generated occupational gender rankings align more closely with human stereotypes than with actual labor statistics. These insights underscore the need for balanced mitigation strategies to ensure fairness while avoiding the reinforcement of new stereotypes.

Sarosij Bose,Arindam Dutta,Sayak Nag,Junge Zhang,Jiachen Li,Konstantinos Karydis,Amit K. Roy Chowdhury

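Task: Reconstruct a 3D scene from a single image and produce high-quality novel view synthesis results.

Motivation: Existing single-image 3D scene reconstruction methods tend to render incoherent and blurry views from novel viewpoints, especially in regions beyond the input camera's view.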

Details

Method: Uses a pre-trained latent video diffusion model as a generative prior to iteratively refine a coarse scene represented by optimizable Gaussian parameters, guided by on-the-fly Fourier-style transfer and a semantic uncertainty quantification module. Result: Experiments on the RealEstate-10K and KITTI-v2 datasets show that the method produces more realistic and higher-fidelity novel view synthesis results than existing state-of-the-art methods. Conclusion: By introducing a generative prior and an uncertainty quantification module, the method substantially improves the quality of single-image 3D scene reconstruction, particularly for novel views. Abstract: Reconstructing 3D scenes from a single image is a fundamentally ill-posed task due to the severely under-constrained nature of the problem. Consequently, when the scene is rendered from novel camera views, existing single image to 3D reconstruction methods render incoherent and blurry views. This problem is exacerbated when the unseen regions are far away from the input camera. In this work, we address these inherent limitations in existing single image-to-3D scene feedforward networks. To alleviate the poor performance due to insufficient information beyond the input image's view, we leverage a strong generative prior in the form of a pre-trained latent video diffusion model, for iterative refinement of a coarse scene represented by optimizable Gaussian parameters. To ensure that the style and texture of the generated images align with that of the input image, we incorporate on-the-fly Fourier-style transfer between the generated images and the input image. Additionally, we design a semantic uncertainty quantification module that calculates the per-pixel entropy and yields uncertainty maps used to guide the refinement process from the most confident pixels while discarding the remaining highly uncertain ones. We conduct extensive experiments on real-world scene datasets, including in-domain RealEstate-10K and out-of-domain KITTI-v2, showing that our approach can provide more realistic and high-fidelity novel view synthesis results compared to existing state-of-the-art methods.

Towards Automatic Continual Learning: A Self-Adaptive Framework for Continual Instruction Tuning

Peiyi Lin,Fukai Zhang,Kai Niu,Hao Fu

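Task: Propose an automated continual instruction tuning framework that dynamically filters incoming data to reduce redundancy.

Motivation: In domain-specific contexts, maintaining data quality and managing system constraints are key challenges.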

Details

Method: Uses a small proxy model for efficient perplexity-based filtering and updates the proxy so that the filtering criteria stay aligned with the evolving state of the deployed model. Result: Evaluated in real-world medical scenarios, the system reduced computational costs by 66.7% and improved model performance. Conclusion: The framework handles incrementally acquired data and shifting distributions effectively, demonstrating the effectiveness of automatic continual instruction tuning. Abstract: Continual instruction tuning enables large language models (LLMs) to learn incrementally while retaining past knowledge, whereas existing methods primarily focus on how to retain old knowledge rather than on selecting which new knowledge to learn. In domain-specific contexts, maintaining data quality and managing system constraints remain key challenges. To address these issues, we propose an automated continual instruction tuning framework that dynamically filters incoming data, identifying and reducing redundant data across successive updates. Our approach utilizes a small proxy model for efficient perplexity-based filtering, and updates the proxy to ensure that the filtering criteria remain aligned with the evolving state of the deployed model. Compared to existing static data selection methods, our framework can effectively handle incrementally acquired data and shifting distributions. Additionally, it addresses practical deployment challenges by enabling seamless model updates, supporting version rollback, and incorporating automatic checkpoint evaluation. We evaluated the system in real-world medical scenarios. It reduced computational costs by 66.7%, improved model performance, and achieved autonomous updates, thus demonstrating its effectiveness for automatic continual instruction tuning.
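
The perplexity-based filtering step can be sketched as follows. The sketch assumes that the proxy model has already produced per-token log-probabilities for each sample, and that samples with low perplexity are treated as redundant because the model already predicts them well. The threshold value, the direction of the filter, and the function names are assumptions; the abstract does not specify them.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp of the mean negative log-likelihood."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def filter_batch(samples, threshold):
    """Keep samples the proxy model finds surprising enough to be worth training on.

    samples: list of (text, token_logprobs) pairs scored by the proxy model.
    """
    return [text for text, logprobs in samples if perplexity(logprobs) > threshold]

incoming = [
    ("already well-known instruction", [-0.1, -0.1]),  # perplexity about 1.1
    ("novel domain instruction", [-2.0, -3.0]),        # perplexity about 12.2
]
kept = filter_batch(incoming, threshold=5.0)
```

After each tuning round, the proxy would be refreshed so that data learned in one round scores as redundant in the next.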

GraPLUS: Graph-based Placement Using Semantics for Image Composition

Mir Mohammad Khaleghi,Mehran Safayani,Abdolreza Mirzaei

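Task: Propose GraPLUS, a new framework based on graph structure and semantic understanding for plausible object placement in images.

Motivation: Determine contextually appropriate object positions by combining graph-structured scene representation with semantic understanding.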

Details

Method: Uses GPT-2 to transform categorical node and edge labels into rich semantic embeddings that capture definitional characteristics and typical spatial contexts, enabling an understanding of object relationships and placement patterns. Result: On the OPA dataset, GraPLUS achieves a placement accuracy of 92.1% and an FID score of 28.83, outperforming state-of-the-art methods by 8.1%. In human evaluation studies, it was preferred in 52.1% of cases. Conclusion: Through pre-trained scene graph models, edge-aware graph neural networks, a cross-modal attention mechanism, and a multi-objective training strategy, the GraPLUS framework significantly improves object placement accuracy and visual quality. Abstract: We present GraPLUS (Graph-based Placement Using Semantics), a novel framework for plausible object placement in images that leverages scene graphs and large language models. Our approach uniquely combines graph-structured scene representation with semantic understanding to determine contextually appropriate object positions. The framework employs GPT-2 to transform categorical node and edge labels into rich semantic embeddings that capture both definitional characteristics and typical spatial contexts, enabling nuanced understanding of object relationships and placement patterns. GraPLUS achieves placement accuracy of 92.1% and an FID score of 28.83 on the OPA dataset, outperforming state-of-the-art methods by 8.1% while maintaining competitive visual quality. In human evaluation studies involving 964 samples assessed by 19 participants, our method was preferred in 52.1% of cases, significantly outperforming previous approaches. The framework's key innovations include: (i) leveraging pre-trained scene graph models that transfer knowledge from other domains, (ii) edge-aware graph neural networks that process scene semantics through structured relationships, (iii) a cross-modal attention mechanism that aligns categorical embeddings with enhanced scene features, and (iv) a multi-objective training strategy incorporating semantic consistency constraints.

From Chaos to Order: The Atomic Reasoner Framework for Fine-grained Reasoning in Large Language Models

Jinyi Liu,Yan Zheng,Rong Cheng,Qiyu Wu,Wei Guo,Fei Ni,Hebin Liang,Yifu Yuan,Hangyu Mao,Fuzheng Zhang,Jianye Hao

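Task: Propose Atomic Reasoner (AR), a cognitive inference strategy that addresses fragmented thought flows and computational complexity in the logical reasoning of large language models.

Motivation: Current large language models suffer from fragmented thought flows and high computational complexity in logical reasoning, which limits their reasoning ability.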

Details

Method: AR decomposes the reasoning process into atomic cognitive units and uses a cognitive routing mechanism to dynamically construct reasoning representations and orchestrate inference pathways, yielding stepwise, structured cognition. Result: Experiments show that AR significantly improves reasoning ability without added computational burden, performing especially well on linguistic logic puzzles. Conclusion: AR effectively strengthens the capacity of large language models for long-sequence logical reasoning and deliberation. Abstract: Recent advances in large language models (LLMs) have shown remarkable progress, yet their capacity for logical "slow-thinking" reasoning persists as a critical research frontier. Current inference scaling paradigms suffer from two fundamental constraints: fragmented thought flows compromising logical coherence, and intensive computational complexity that escalates with search space dimensions. To overcome these limitations, we present Atomic Reasoner (AR), a cognitive inference strategy that enables fine-grained reasoning through systematic atomic-level operations. AR decomposes the reasoning process into atomic cognitive units, employing a cognitive routing mechanism to dynamically construct reasoning representations and orchestrate inference pathways. This systematic methodology implements stepwise, structured cognition, which ensures logical coherence while significantly reducing cognitive load, effectively simulating the cognitive patterns observed in human deep thinking processes. Extensive experimental results demonstrate AR's superior reasoning capabilities without the computational burden of exhaustive solution searches, particularly excelling in linguistic logic puzzles. These findings substantiate AR's effectiveness in enhancing LLMs' capacity for robust, long-sequence logical reasoning and deliberation.

OffsetOPT: Explicit Surface Reconstruction without Normals

Huan Lei

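Task: Reconstruct explicit surfaces directly from 3D point clouds, eliminating the need for point normals.

Motivation: Existing neural surface reconstruction methods typically require high-quality normals for accurate reconstruction, which limits their applicability.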

Details

Method: Proposes OffsetOPT, which has two stages: first, a neural network is trained to predict surface triangles from local point geometry; then the frozen network is applied to unseen point clouds, optimizing a per-point offset to maximize the accuracy of triangle predictions. Result: OffsetOPT outperforms existing methods at reconstructing overall surfaces and at preserving sharp surface features. Conclusion: OffsetOPT demonstrates its accuracy on popular benchmarks, including small-scale shapes and large-scale open surfaces. Abstract: Neural surface reconstruction has been dominated by implicit representations with marching cubes for explicit surface extraction. However, those methods typically require high-quality normals for accurate reconstruction. We propose OffsetOPT, a method that reconstructs explicit surfaces directly from 3D point clouds and eliminates the need for point normals. The approach comprises two stages: first, we train a neural network to predict surface triangles based on local point geometry, given uniformly distributed training point clouds. Next, we apply the frozen network to reconstruct surfaces from unseen point clouds by optimizing a per-point offset to maximize the accuracy of triangle predictions. Compared to state-of-the-art methods, OffsetOPT not only excels at reconstructing overall surfaces but also significantly preserves sharp surface features. We demonstrate its accuracy on popular benchmarks, including small-scale shapes and large-scale open surfaces.

Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning

Chen Li,Nazhou Liu,Kai Yang

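Task: Propose an improved training method for reasoning large language models, Adaptive Group Policy Optimization (AGPO).

Motivation: The existing Group Relative Policy Optimization (GRPO) is found to have deficiencies in reinforcement learning stability and inference efficiency.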

Details

Method: Proposes two simple but effective modifications: a revised advantage estimation method that mitigates zero-variance situations, and a length-based reward that incentivizes the model to avoid overthinking. Result: Experiments show that the method trains more stably and achieves comparable or better performance while using significantly fewer tokens in reasoning steps. Conclusion: Adaptive Group Policy Optimization (AGPO) delivers higher stability and efficiency when training reasoning large language models. Abstract: Since DeepSeek-R1 popularized it, Group Relative Policy Optimization (GRPO) has become the core part of reasoning LLM training. However, we find some deficiencies that influence RL stability and inference efficiency. Thus, we propose Adaptive Group Policy Optimization (AGPO), which contains two simple but effective modifications: a revised advantage estimation method to mitigate zero-variance situations; a length-based reward, incentivizing the model to avoid overthinking. The experiments demonstrate our methods achieve more stable training and comparable or superior performance with significantly fewer tokens in reasoning steps.
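
The abstract names the two modifications but gives no formulas, so the following is only a minimal sketch under stated assumptions. It uses the standard group-relative advantage (reward minus group mean, divided by group standard deviation) and guards the zero-variance case with an assumed fallback: a fixed advantage of +1 when every sample reaches the maximum reward and -1 when every sample reaches the minimum. The fallback rule, the function names, the `weight` parameter, and the form of the length bonus are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, r_max=1.0, r_min=0.0):
    """Group-relative advantages with a guard for zero-variance groups.

    Assumed fallback (not from the paper): when all rewards in the group
    are identical, return +1 if they all equal r_max, -1 if they all
    equal r_min, and 0 otherwise.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    if sigma == 0.0:
        if mu == r_max:
            fill = 1.0
        elif mu == r_min:
            fill = -1.0
        else:
            fill = 0.0
        return [fill] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def length_shaped_rewards(correct, lengths, weight=0.2):
    """Hypothetical length-based shaping.

    A correct response earns 1.0 plus a bonus of up to `weight` that grows
    as the response gets shorter relative to the longest one in the group.
    Incorrect responses earn 0.0 regardless of length.
    """
    longest = max(lengths)
    shaped = []
    for ok, n_tokens in zip(correct, lengths):
        if ok:
            shaped.append(1.0 + weight * (1.0 - n_tokens / longest))
        else:
            shaped.append(0.0)
    return shaped
```

With plain normalization, a group whose rewards are all equal would yield a division by zero (or an all-zero advantage that contributes no gradient); the guard above is one simple way to keep a learning signal in that case.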

AutoDrive-QA- Automated Generation of Multiple-Choice Questions for Autonomous Driving Datasets Using Large Vision-Language Models

Boshra Khalili,Andrew W. Smyth

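Task: Convert existing driving question-answering datasets into a structured multiple-choice format to provide a standardized and objective evaluation framework.

Motivation: Address the unreliability of open-ended question-answering evaluation in autonomous driving, where free-form responses require either complex metrics or subjective human judgment.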

Details

Method: Introduces AutoDrive-QA, an automated pipeline that uses large language models (LLMs) to generate high-quality, contextually relevant distractors based on domain-specific error patterns common in autonomous driving scenarios. Result: The benchmark is tested on three public datasets, with zero-shot experiments on an unseen dataset. GPT-4V leads the zero-shot evaluation with 69.57% accuracy, reaching 74.94% on perception, 65.33% on prediction, and 68.45% on planning tasks. Conclusion: AutoDrive-QA provides a rigorous, unbiased standard for integrating and evaluating different vision-language models, thereby improving generalization in this field. Abstract: In autonomous driving, open-ended question answering often suffers from unreliable evaluations because freeform responses require either complex metrics or subjective human judgment. To address this challenge, we introduce AutoDrive-QA, an automatic pipeline that converts existing driving QA datasets (including DriveLM, NuScenes-QA, and LingoQA) into a structured multiple-choice question (MCQ) format. This benchmark systematically assesses perception, prediction, and planning tasks, providing a standardized and objective evaluation framework. AutoDrive-QA employs an automated pipeline that leverages large language models (LLMs) to generate high-quality, contextually relevant distractors based on domain-specific error patterns commonly found in autonomous driving scenarios. To evaluate both general capabilities and generalization performance, we test the benchmark on three public datasets and conduct zero-shot experiments on an unseen dataset. The zero-shot evaluations reveal that GPT-4V leads with 69.57% accuracy -- achieving 74.94% in Perception, 65.33% in Prediction, and 68.45% in Planning -- demonstrating that while all models excel in Perception, they struggle in Prediction. Consequently, AutoDrive-QA establishes a rigorous, unbiased standard for integrating and evaluating different vision-language models across various autonomous driving datasets, thereby improving generalization in this field. We release all the codes in the AutoDrive-QA GitHub Repository.
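
To make the open-ended-to-MCQ conversion concrete, here is a minimal sketch of the final assembly and scoring steps only. In the pipeline described above the distractors come from an LLM; here they are passed in as plain strings, and the function names, the record layout, and the fixed shuffle seed are illustrative assumptions rather than the authors' implementation.

```python
import random
import string


def to_multiple_choice(question, answer, distractors, seed=0):
    """Assemble one MCQ record from an open-ended QA pair.

    The distractors are assumed to be supplied already (for example by an
    LLM); this function only shuffles the options and records which
    letter holds the correct answer.
    """
    if answer in distractors:
        raise ValueError("a distractor duplicates the correct answer")
    options = [answer] + list(distractors)
    random.Random(seed).shuffle(options)  # seeded for reproducibility
    letters = string.ascii_uppercase[: len(options)]
    return {
        "question": question,
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(answer)],
    }


def accuracy(records, predictions):
    """Fraction of records whose predicted letter matches the answer key."""
    hits = sum(rec["answer"] == pred for rec, pred in zip(records, predictions))
    return hits / len(records)
```

Because each record stores a single correct letter, scoring reduces to an exact match, which is what removes the need for free-text metrics or human judges.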

Exploratory Study into Relations between Cognitive Distortions and Emotional Appraisals

Navneet Agarwal,Kairit Sirts

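Task: Explore the relationship between cognitive distortions and emotional appraisal dimensions, and analyze the impact of cognitive restructuring on appraisal dimensions.

Motivation: Although emotional reappraisal and cognitive reframing are similar as emotion regulation techniques, the two concepts have mostly been studied in isolation. This study aims to fill that gap by exploring their relationship and its potential implications for future interdisciplinary research.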

Details

Method: Conducts an exploratory computational study that analyzes statistically significant relationships between cognitive distortions and emotional appraisal dimensions and examines the impact of cognitive restructuring on appraisal dimensions. Result: The patterns of statistically significant relationships with appraisal dimensions differ across distortion categories, giving each category a distinct appraisal profile. The study also shows the impact of cognitive restructuring on appraisal dimensions, illustrating its role in emotion regulation. Conclusion: The study reveals a complex relationship between cognitive distortions and emotional appraisal dimensions, highlights the importance of cognitive restructuring in emotion regulation, and offers new directions for future interdisciplinary research. Abstract: In recent years, there has been growing interest in studying cognitive distortions and emotional appraisals from both computational and psychological perspectives. Despite considerable similarities between emotional reappraisal and cognitive reframing as emotion regulation techniques, these concepts have largely been examined in isolation. This research explores the relationship between cognitive distortions and emotional appraisal dimensions, examining their potential connections and relevance for future interdisciplinary studies. Under this pretext, we conduct an exploratory computational study, aimed at investigating the relationship between cognitive distortion and emotional appraisals. We show that the patterns of statistically significant relationships between cognitive distortions and appraisal dimensions vary across different distortion categories, giving rise to distinct appraisal profiles for individual distortion classes. Additionally, we analyze the impact of cognitive restructuring on appraisal dimensions, exemplifying the emotion regulation aspect of cognitive restructuring.

RL4Med-DDPO: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models

Parham Saremi,Amar Kumar,Mohammed Mohammed,Zahra TehraniNasab,Tal Arbel

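Task: Propose a multi-stage architecture in which a pre-trained vision-language foundation model (VLFM) provides a cursory semantic understanding and a reinforcement learning (RL) algorithm iteratively refines the understanding of semantic context, in order to achieve precise correspondence between image regions and textual descriptions in medical imaging.

Motivation: Vision-language foundation models (VLFMs) excel at generating high-resolution, photorealistic natural images but struggle with fine-grained alignment tasks that require precise correspondence between image regions and textual descriptions. This matters particularly in medical imaging, where accurate localization and detection of clinical features are essential for diagnosis and analysis.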

Details

Method: Proposes a multi-stage architecture that combines a pre-trained VLFM with a reinforcement learning algorithm, which iteratively optimizes the understanding of semantic context; the reward signal is designed to align the semantic information of the text with the synthesized images. Result: On a medical imaging skin dataset, the generated images surpass fine-tuned Stable Diffusion in generation quality and alignment with the prompt, and the synthesized samples can improve disease classifier performance for underrepresented subgroups through data augmentation. Conclusion: The proposed method effectively improves image generation quality and alignment with textual descriptions in medical imaging, and improves disease classifier performance through data augmentation. Abstract: Vision-Language Foundation Models (VLFM) have shown a tremendous increase in performance in terms of generating high-resolution, photorealistic natural images. While VLFMs show a rich understanding of semantic content across modalities, they often struggle with fine-grained alignment tasks that require precise correspondence between image regions and textual descriptions, a limitation in medical imaging, where accurate localization and detection of clinical features are essential for diagnosis and analysis. To address this issue, we propose a multi-stage architecture where a pre-trained VLFM provides a cursory semantic understanding, while a reinforcement learning (RL) algorithm refines the alignment through an iterative process that optimizes for understanding semantic context. The reward signal is designed to align the semantic information of the text with synthesized images. We demonstrate the effectiveness of our method on a medical imaging skin dataset where the generated images exhibit improved generation quality and alignment with the prompt over the fine-tuned Stable Diffusion. We also show that the synthesized samples could be used to improve disease classifier performance for underrepresented subgroups through augmentation.

InhibiDistilbert: Knowledge Distillation for a ReLU and Addition-based Transformer

Tony Zhang,Rickard Brännvall

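Task: Optimize transformer-based language models by integrating model compression techniques with inhibitor attention, a novel alternative attention mechanism.

Motivation: Explore an alternative attention mechanism that saves computation and energy while maintaining model effectiveness.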

Details

Method: Replaces the matrix multiplications and softmax activation of conventional scaled dot-product attention with Manhattan distances and ReLU activations, and further adjusts the inhibitor mechanism to improve training efficiency. Result: Knowledge distillation experiments on the DistilBERT architecture show that the modified inhibitor transformer model is competitive on standard NLP benchmarks, including GLUE and sentiment analysis tasks. Conclusion: Inhibitor attention can save computation and energy while preserving model performance, which gives it potential practical value. Abstract: This work explores optimizing transformer-based language models by integrating model compression techniques with inhibitor attention, a novel alternative attention mechanism. Inhibitor attention employs Manhattan distances and ReLU activations instead of the matrix multiplications and softmax activation of the conventional scaled dot-product attention. This shift offers potential computational and energy savings while maintaining model effectiveness. We propose further adjustments to improve the inhibitor mechanism's training efficiency and evaluate its performance on the DistilBERT architecture. Our knowledge distillation experiments indicate that the modified inhibitor transformer model can achieve competitive performance on standard NLP benchmarks, including General Language Understanding Evaluation (GLUE) and sentiment analysis tasks.
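
As a rough illustration of the mechanism described above, the sketch below scores each query-key pair by its Manhattan distance and then lets that score inhibit the values through a ReLU, so no dot products or softmax are involved. The abstract does not state the exact formulation, so the specific form used here (score `z[j]` subtracted from each value component before the ReLU, then summed over keys), the scaling parameter `gamma`, and the function names are assumptions; the paper's further adjustments are not modeled.

```python
def relu(x):
    """Rectified linear unit for a scalar."""
    return x if x > 0.0 else 0.0


def inhibitor_attention(Q, K, V, gamma=1.0):
    """Attention built from Manhattan distances and ReLU (assumed form).

    Q: list of query vectors, K: list of key vectors, V: list of value
    vectors (one per key). For query i and key j the inhibition score is
        z[j] = sum_k |Q[i][k] - K[j][k]| / gamma
    and output component c is
        sum_j relu(V[j][c] - z[j]).
    A distant key produces a large z[j] and therefore suppresses V[j].
    """
    n_keys = len(K)
    d_v = len(V[0])
    out = []
    for q in Q:
        z = [sum(abs(a - b) for a, b in zip(q, K[j])) / gamma
             for j in range(n_keys)]
        out.append([sum(relu(V[j][c] - z[j]) for j in range(n_keys))
                    for c in range(d_v)])
    return out
```

For example, with one query at the origin, a key at the origin, and a second key at Manhattan distance 7, the second value vector is fully suppressed whenever its components are below 7, so the output equals the first value vector.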

Frequency Enhancement for Image Demosaicking

Jingyun Liu,Daiqin Yang,Zhenzhong Chen

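Task: Recover high-frequency textures in image demosaicking through a frequency enhancement approach.

Motivation: Existing spatial learning methods show limited performance in recovering high-frequency textures, which motivates a frequency enhancement approach.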

Details

Method: Proposes the Dual-path Frequency Enhancement Network (DFENet), which reconstructs RGB images in a divide-and-conquer manner through Fourier-domain frequency selection. Result: DFENet outperforms other state-of-the-art algorithms on different datasets and shows significant advantages on hard cases. Conclusion: Through frequency enhancement and a multi-level frequency supervision strategy, DFENet significantly improves image demosaicking performance, and it contributes a new dataset, LineSet37, for evaluating how well algorithms reconstruct high-frequency textures. Abstract: Recovering high-frequency textures in image demosaicking remains a challenging issue. While existing methods introduced elaborate spatial learning methods, they still exhibit limited performance. To address this issue, a frequency enhancement approach is proposed. Based on the frequency analysis of color filter array (CFA)/demosaicked/ground truth images, we propose Dual-path Frequency Enhancement Network (DFENet), which reconstructs RGB images in a divide-and-conquer manner through Fourier-domain frequency selection. In DFENet, two frequency selectors are employed, each selecting a set of frequency components for processing along separate paths. One path focuses on generating missing information through detail refinement in spatial domain, while the other aims at suppressing undesirable frequencies with the guidance of CFA images in frequency domain. Multi-level frequency supervision with a stagewise training strategy is employed to further improve the reconstruction performance. With these designs, the proposed DFENet outperforms other state-of-the-art algorithms on different datasets and demonstrates significant advantages on hard cases. Moreover, to better assess algorithms' ability to reconstruct high-frequency textures, a new dataset, LineSet37, is contributed, which consists of 37 artificially designed and generated images. These images feature complex line patterns and are prone to severe visual artifacts like color moiré after demosaicking. Experiments on LineSet37 offer a more targeted evaluation of performance on challenging cases. The code and dataset are available at https://github.com/VelvetReverie/DFENet-demosaicking.

ECKGBench: Benchmarking Large Language Models in E-commerce Leveraging Knowledge Graph

Langming Liu,Haibin Chen,Yuhao Wang,Yujin Yuan,Shilei Liu,Wenbo Su,Xiangyu Zhao,Bo Zheng

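Task: Evaluate the capabilities of large language models (LLMs) in e-commerce knowledge.

Motivation: Because LLMs are widely used in e-commerce, their factuality (e.g., hallucination) has a major impact on user experience and revenue, so an effective evaluation method is needed.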

Details

Method: Proposes the ECKGBench dataset, which uses a standardized workflow to automatically generate questions from a large-scale knowledge graph and adopts a simple question-answering paradigm to improve evaluation efficiency. Result: A comprehensive evaluation of several advanced LLMs on ECKGBench yields detailed analysis and insights. Conclusion: ECKGBench can effectively evaluate LLM capabilities in e-commerce knowledge and offers a new perspective on leveraging LLMs for e-commerce. Abstract: Large language models (LLMs) have demonstrated their capabilities across various NLP tasks. Their potential in e-commerce is also substantial, evidenced by practical implementations such as platform search, personalized recommendations, and customer service. One primary concern associated with LLMs is their factuality (e.g., hallucination), which is urgent in e-commerce due to its significant impact on user experience and revenue. Despite some methods proposed to evaluate LLMs' factuality, issues such as lack of reliability, high consumption, and lack of domain expertise leave a gap in effective assessment in e-commerce. To bridge the evaluation gap, we propose ECKGBench, a dataset specifically designed to evaluate the capacities of LLMs in e-commerce knowledge. Specifically, we adopt a standardized workflow to automatically generate questions based on a large-scale knowledge graph, guaranteeing sufficient reliability. We employ the simple question-answering paradigm, substantially improving the evaluation efficiency by minimizing input and output tokens. Furthermore, we inject abundant e-commerce expertise in each evaluation stage, including human annotation, prompt design, negative sampling, and verification. Besides, we explore the LLMs' knowledge boundaries in e-commerce from a novel perspective. Through comprehensive evaluations of several advanced LLMs on ECKGBench, we provide meticulous analysis and insights into leveraging LLMs for e-commerce.

A Vision Centric Remote Sensing Benchmark

Abduljaleel Adejumo,Faegheh Yeganli,Clifford Broni-bediako,Aoran Xiao,Naoto Yokoya,Mennatullah Siam

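Task: Investigate the limitations of multimodal large language models (MLLMs) in remote sensing (RS) tasks and propose a remote sensing multimodal visual patterns (RSMMVP) benchmark to evaluate them.

Motivation: Although multimodal large language models have achieved remarkable success in vision-language tasks, their application to remote sensing remains relatively underexplored. Unlike natural images, remote sensing imagery poses unique challenges, particularly in visual grounding and spatial reasoning.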

Details

Method: Introduces the remote sensing multimodal visual patterns (RSMMVP) benchmark to evaluate CLIP-based MLLMs on remote sensing tasks, in particular by identifying CLIP-blind pairs, where CLIP-based models incorrectly assign high similarity scores to visually distinct remote sensing images. Result: A visual question answering (VQA) evaluation reveals significant limitations of current state-of-the-art MLLMs in RS-specific representation learning. Conclusion: The results provide valuable insights into the weaknesses of CLIP-based visual encoding and lay a foundation for developing multimodal large language models better suited to remote sensing applications. Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, but their remote sensing (RS) counterparts are relatively underexplored. Unlike natural images, RS imagery presents unique challenges that current MLLMs struggle to handle, particularly in visual grounding and spatial reasoning. This study investigates the limitations of CLIP-based MLLMs in RS, highlighting their failure to differentiate visually distinct yet semantically similar RS images. To address this, we introduce a remote sensing multimodal visual patterns (RSMMVP) benchmark. It is designed to evaluate MLLMs in RS tasks by identifying the CLIP-blind pairs, where CLIP-based models incorrectly assign high similarity scores to visually distinct RS images. Through a visual question answering (VQA) evaluation, we analyze the performance of state-of-the-art MLLMs, revealing significant limitations in RS-specific representation learning. The results provide valuable insights into the weaknesses of CLIP-based visual encoding and offer a foundation for future research to develop more effective MLLMs tailored for remote sensing applications.
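
The notion of a CLIP-blind pair can be sketched as a simple filter over precomputed embeddings: keep image pairs that CLIP scores as highly similar while some second, vision-only reference encoder scores them as dissimilar. The abstract does not say how "visually distinct" is measured, so the use of a second encoder, the two thresholds, and all names below are assumptions made for illustration; the embeddings are expected to be supplied by the caller.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_blind_pairs(clip_emb, ref_emb, clip_min=0.95, ref_max=0.6):
    """Return index pairs (i, j) that look alike to CLIP but not to the
    reference encoder.

    clip_emb[i] and ref_emb[i] are embeddings of the same image i from
    CLIP and from an assumed vision-only reference encoder. Both
    thresholds are hypothetical defaults.
    """
    pairs = []
    for i, j in combinations(range(len(clip_emb)), 2):
        looks_same_to_clip = cosine(clip_emb[i], clip_emb[j]) > clip_min
        looks_different = cosine(ref_emb[i], ref_emb[j]) < ref_max
        if looks_same_to_clip and looks_different:
            pairs.append((i, j))
    return pairs
```

Each returned pair is a candidate for a VQA item whose answer depends on the visual difference that CLIP fails to encode.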

Corrective In-Context Learning: Evaluating Self-Correction in Large Language Models

Mario Sanz-Guerrero,Katharina von der Wense

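Task: Propose and evaluate corrective in-context learning (CICL), a method intended to improve the performance of large language models on text classification tasks.

Motivation: Although in-context learning (ICL) performs well on NLP tasks, it is prone to errors on challenging examples, so its performance needs improvement.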

Details

Method: Proposes corrective in-context learning (CICL), which places the model's incorrect predictions alongside ground truth corrections in the prompt, aiming to improve classification accuracy through self-correction. Result: Experiments show that CICL consistently underperforms standard ICL on text classification tasks, with performance degrading as the proportion of corrections in the prompt increases. Conclusion: CICL introduces confusion by disrupting the model's task understanding rather than refining its predictions. In addition, presenting harder examples in standard ICL does not improve performance, which suggests that example difficulty alone may not be a reliable selection criterion. Abstract: In-context learning (ICL) has transformed the use of large language models (LLMs) for NLP tasks, enabling few-shot learning by conditioning on labeled examples without finetuning. Despite its effectiveness, ICL is prone to errors, especially for challenging examples. With the goal of improving the performance of ICL, we propose corrective in-context learning (CICL), an approach that incorporates a model's incorrect predictions alongside ground truth corrections into the prompt, aiming to enhance classification accuracy through self-correction. However, contrary to our hypothesis, extensive experiments on text classification tasks demonstrate that CICL consistently underperforms standard ICL, with performance degrading as the proportion of corrections in the prompt increases. Our findings indicate that CICL introduces confusion by disrupting the model's task understanding, rather than refining its predictions. Additionally, we observe that presenting harder examples in standard ICL does not improve performance, suggesting that example difficulty alone may not be a reliable criterion for effective selection. By presenting these negative results, we provide important insights into the limitations of self-corrective mechanisms in LLMs and offer directions for future research.
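
The difference between a standard ICL demonstration and a corrective one can be shown with a small prompt builder. This is a sketch only: the prompt wording, the field names (`text`, `label`, `predicted`), and the function name are hypothetical, and the paper's actual template may differ. A demonstration with a wrong earlier prediction is rendered with both the prediction and its correction; all others are rendered as ordinary labeled examples.

```python
def build_cicl_prompt(examples, query, instruction="Classify the text."):
    """Build a few-shot prompt that mixes standard and corrective examples.

    Each example is a dict with "text" and "label", plus an optional
    "predicted" key holding the model's earlier prediction. When that
    prediction is present and wrong, the example shows the prediction
    followed by the ground truth correction.
    """
    lines = [instruction, ""]
    for ex in examples:
        lines.append(f"Text: {ex['text']}")
        wrong = ex.get("predicted")
        if wrong is not None and wrong != ex["label"]:
            lines.append(f"Predicted label: {wrong}")
            lines.append(f"Correct label: {ex['label']}")
        else:
            lines.append(f"Label: {ex['label']}")
        lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Label:")
    return "\n".join(lines)
```

Varying how many examples carry a wrong `predicted` value gives a direct way to control the proportion of corrections in the prompt, the quantity the abstract reports as driving the performance drop.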