2025 03 31

ELM: Ensemble of Language Models for Predicting Tumor Group from Pathology Reports

Lovedeep Gondara,Jonathan Simkin,Shebnum Devji,Gregory Arbour,Raymond Ng

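Task: Automate the extraction of data from unstructured pathology reports and the assignment of tumor groups.

Motivation: Reduce the high time cost that population-based cancer registries (PBCRs) incur when manually extracting data from pathology reports (for example, about 900 person-hours for approximately 100,000 reports).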

Details

Method: Proposes ELM (Ensemble of Language Models), which combines small language models (SLMs) and large language models (LLMs): six fine-tuned SLMs (three reading the top part of the report and three reading the bottom part) classify the tumor group, a classification requires five-out-of-six agreement, and an LLM arbitrates disagreements. Result: Across nineteen tumor groups, ELM achieves an average precision and recall of 0.94, outperforming single-model and ensemble-without-LLM approaches, and saves hundreds of person-hours annually in deployment. Conclusion: ELM demonstrates a successful application of LLMs in a PBCR setting, significantly improving operational efficiency and achieving state-of-the-art results. Abstract: Population-based cancer registries (PBCRs) face a significant bottleneck in manually extracting data from unstructured pathology reports, a process crucial for tasks like tumor group assignment, which can consume 900 person-hours for approximately 100,000 reports. To address this, we introduce ELM (Ensemble of Language Models), a novel ensemble-based approach leveraging both small language models (SLMs) and large language models (LLMs). ELM utilizes six fine-tuned SLMs, where three SLMs use the top part of the pathology report and three SLMs use the bottom part. This is done to maximize report coverage. ELM requires five-out-of-six agreement for a tumor group classification. Disagreements are arbitrated by an LLM with a carefully curated prompt. Our evaluation across nineteen tumor groups demonstrates ELM achieves an average precision and recall of 0.94, outperforming single-model and ensemble-without-LLM approaches. Deployed at the British Columbia Cancer Registry, ELM demonstrates how LLMs can be successfully applied in a PBCR setting to achieve state-of-the-art results and significantly enhance operational efficiencies, saving hundreds of person-hours annually.
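
The five-out-of-six agreement rule with LLM arbitration described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `elm_classify`, the `arbitrate` callback, and the stub arbitrator are hypothetical, and what the real LLM prompt receives (report text, candidate labels) is not stated in the abstract.

```python
from collections import Counter
from typing import Callable, Sequence


def elm_classify(
    slm_votes: Sequence[str],
    arbitrate: Callable[[Sequence[str]], str],
    min_agreement: int = 5,
) -> str:
    """Return a tumor group for one report from six SLM votes.

    slm_votes: labels from six SLMs (three top-part, three bottom-part).
    arbitrate: hypothetical stand-in for the LLM arbitrator.
    """
    if len(slm_votes) != 6:
        raise ValueError("expected six SLM votes")
    label, count = Counter(slm_votes).most_common(1)[0]
    if count >= min_agreement:
        # At least five of the six models agree: accept the label directly.
        return label
    # Otherwise defer to the LLM arbitrator.
    return arbitrate(slm_votes)


def stub_arbitrator(votes: Sequence[str]) -> str:
    # Assumption: a placeholder that only marks the case as escalated.
    return "escalated-to-llm"


agreed = elm_classify(["breast"] * 5 + ["lung"], stub_arbitrator)
disputed = elm_classify(["breast"] * 4 + ["lung"] * 2, stub_arbitrator)
```

Here `agreed` is accepted without arbitration, while `disputed` (only four matching votes) is routed to the arbitrator.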

ImF: Implicit Fingerprint for Large Language Models

Wu jiaxuan,Peng Wanli,Fu hang,Xue Yiming,Wen juan

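Task: Propose a new fingerprint-injection paradigm, Implicit Fingerprints (ImF), to address the weak semantic correlation of fingerprint pairs in existing methods.

Motivation: Existing methods produce fingerprint pairs with weak semantic correlation, which attacks such as the GRI attack can exploit to erase them, so more secure fingerprint protection is needed.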

Details

Method: Proposes Implicit Fingerprints (ImF), which constructs fingerprint pairs with strong semantic correlations and disguises them as natural question-answer pairs. Result: Experiments show that ImF retains high verification success rates under adversarial conditions. Conclusion: ImF offers a reliable solution for protecting LLM ownership. Abstract: Training large language models (LLMs) is resource-intensive and expensive, making intellectual property (IP) protection essential. Most existing model fingerprint methods inject fingerprints into LLMs to protect model ownership. These methods create fingerprint pairs with weak semantic correlations, lacking the contextual coherence and semantic relatedness found in normal question-answer (QA) pairs in LLMs. In this paper, we propose a Generation Revision Intervention (GRI) attack that can effectively exploit this flaw to erase fingerprints, highlighting the need for more secure model fingerprint methods. Thus, we propose a novel injected fingerprint paradigm called Implicit Fingerprints (ImF). ImF constructs fingerprint pairs with strong semantic correlations, disguising them as natural QA pairs within LLMs. This ensures the fingerprints are consistent with normal model behavior, making them indistinguishable and robust against detection and removal. Our experiments on multiple LLMs demonstrate that ImF retains high verification success rates under adversarial conditions, offering a reliable solution for protecting LLM ownership.

Large Language Models Meet Contrastive Learning: Zero-Shot Emotion Recognition Across Languages

Heqing Zou,Fengmao Lv,Desheng Zheng,Eng Siong Chng,Deepu Rajan

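Task: Use contrastive learning to refine multilingual speech features and extend large language models for zero-shot multilingual speech emotion recognition.

Motivation: Multilingual speech emotion recognition is challenged by variability in voice characteristics and by linguistic diversity, especially in zero-shot settings.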

Details

Method: Uses a two-stage training framework that aligns speech signals with linguistic features in the emotional space, capturing emotion-aware and language-agnostic speech representations. Result: Experiments show the method is effective for both speech emotion recognition and zero-shot multilingual speech emotion recognition, including on previously unseen datasets and languages. Conclusion: The method performs well on multilingual speech emotion recognition, and the synthetic dataset M5SER is introduced to advance research in the field. Abstract: Multilingual speech emotion recognition aims to estimate a speaker's emotional state using a contactless method across different languages. However, variability in voice characteristics and linguistic diversity poses significant challenges for zero-shot speech emotion recognition, especially with multilingual datasets. In this paper, we propose leveraging contrastive learning to refine multilingual speech features and extend large language models for zero-shot multilingual speech emotion estimation. Specifically, we employ a novel two-stage training framework to align speech signals with linguistic features in the emotional space, capturing both emotion-aware and language-agnostic speech representations. To advance research in this field, we introduce a large-scale synthetic multilingual speech emotion dataset, M5SER. Our experiments demonstrate the effectiveness of the proposed method in both speech emotion recognition and zero-shot multilingual speech emotion recognition, including previously unseen datasets and languages.

OAEI-LLM-T: A TBox Benchmark Dataset for Understanding LLM Hallucinations in Ontology Matching Systems

Zhangcheng Qiang

Task: Introduce a new benchmark dataset called OAEI-LLM-T to address hallucinations in LLM-based ontology matching systems.

Motivation: Hallucinations in LLMs are a significant challenge for ontology matching tasks, necessitating a dedicated dataset to study and mitigate them.

Details

Method: Develop the OAEI-LLM-T dataset from TBox datasets in OAEI, classify hallucinations into two primary categories and six sub-categories, and use it for leaderboard construction and LLM fine-tuning. Result: The dataset captures and classifies hallucinations in LLM-based OM tasks, demonstrating its utility for benchmarking and improving LLM performance. Conclusion: OAEI-LLM-T is a valuable resource for addressing hallucinations in LLM-based OM systems, aiding in both evaluation and model refinement. Abstract: Hallucinations are inevitable in downstream tasks using large language models (LLMs). While addressing hallucinations becomes a substantial challenge for LLM-based ontology matching (OM) systems, we introduce a new benchmark dataset called OAEI-LLM-T. The dataset evolves from the TBox (i.e. schema-matching) datasets in the Ontology Alignment Evaluation Initiative (OAEI), capturing hallucinations of different LLMs performing OM tasks. These OM-specific hallucinations are carefully classified into two primary categories and six sub-categories. We showcase the usefulness of the dataset in constructing the LLM leaderboard and fine-tuning foundational LLMs for LLM-based OM systems.

Skip-Vision: A Comprehensive Framework for Accelerating Vision-Language Models

Weili Zeng,Ziyuan Huang,Kaixiang Ji,Yichao Yan

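Task: Propose the Skip-Vision framework to address training and inference inefficiencies in multimodal large language models.

Motivation: Transformer-based models are computationally expensive in multimodal tasks, and the proliferation of visual tokens is a key bottleneck.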

Details

Method: Combines two acceleration strategies: during training, Skip-FFN bypasses FFN computation for redundant visual tokens; during inference, a selective KV-cache removal mechanism prunes the skipped key-value pairs. Result: Training time is reduced by up to 35%, inference FLOPs by 75%, and latency by 45%, with performance comparable or superior to existing methods. Conclusion: Skip-Vision provides a practical solution for efficiently scaling high-performance multimodal large language models. Abstract: Transformer-based models have driven significant advancements in Multimodal Large Language Models (MLLMs), yet their computational costs surge drastically when scaling resolution, training data, and model parameters. A key bottleneck stems from the proliferation of visual tokens required for fine-grained image understanding. We propose Skip-Vision, a unified framework addressing both training and inference inefficiencies in vision-language models. On top of conventional token compression approaches, our method introduces two complementary acceleration strategies. For training acceleration, we observe that Feed-Forward Network (FFN) computations on visual tokens induce marginal feature updates. This motivates our Skip-FFN strategy, which bypasses FFN layers for redundant visual tokens. For inference acceleration, we design a selective KV-cache removal mechanism that prunes the skipped key-value pairs during decoding while preserving model performance. Experimental results demonstrate that Skip-Vision reduces training time by up to 35%, inference FLOPs by 75%, and latency by 45%, while achieving comparable or superior performance to existing methods. Our work provides a practical solution for scaling high-performance MLLMs with enhanced efficiency.

Optimizing Safe and Aligned Language Generation: A Multi-Objective GRPO Approach

Xuying Li,Zhuo Li,Yuji Kosuga,Victor Bian

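Task: Propose a Group Relative Policy Optimization (GRPO) framework for safe and aligned language generation.

Motivation: Aligning large language models (LLMs) with human values and safety constraints is challenging, especially when objectives such as helpfulness, truthfulness, and avoidance of harm conflict; existing methods such as RLHF and DPO are complex or may introduce bias.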

Details

Method: The GRPO framework optimizes the policy with a multi-label reward regression model, comparing groups of sampled responses without a separate value critic; the reward model is trained to predict multiple alignment scores (e.g., safety and helpfulness). Result: GRPO improves all evaluated safety and quality metrics in language generation tasks and achieves a robust balance of objectives across model scales (0.5B, 7B, and 14B parameters). Conclusion: Compared with PPO-based RLHF and DPO, GRPO achieves alignment at lower computational cost and handles multiple objectives explicitly. Abstract: Aligning large language models (LLMs) with human values and safety constraints is challenging, especially when objectives like helpfulness, truthfulness, and avoidance of harm conflict. Reinforcement Learning from Human Feedback (RLHF) has achieved notable success in steering models, but is complex and can be unstable. Recent approaches such as Direct Preference Optimization (DPO) simplify preference-based fine-tuning but may introduce bias or trade off certain objectives. In this work, we propose a Group Relative Policy Optimization (GRPO) framework with a multi-label reward regression model to achieve safe and aligned language generation. The GRPO algorithm optimizes a policy by comparing groups of sampled responses, eliminating the need for a separate value critic and improving training efficiency. We train a reward model to predict multiple alignment scores (e.g., safety, helpfulness, etc.), which are combined into a single reward signal. We provide a theoretical derivation for using this learned multi-aspect reward within GRPO and discuss its advantages and limitations. Empirically, our approach improves all the safety and quality metrics evaluated in language generation tasks across model scales (0.5B, 7B, and 14B parameters), demonstrating a robust balance of objectives. We compare GRPO to PPO-based RLHF and DPO, highlighting that GRPO achieves alignment with significantly lower computational cost and explicit multi-objective handling. We will open-source all trained models at https://huggingface.co/hydroxai.
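
The two ideas in this abstract, combining several alignment scores into one reward and comparing responses within a sampled group instead of using a value critic, can be illustrated with a small sketch. The weighted sum, the weights, and the mean/standard-deviation normalisation are assumptions chosen for illustration (the normalisation is the form commonly used for GRPO); the abstract does not specify how the paper combines scores or normalises rewards.

```python
from statistics import mean, pstdev
from typing import Dict, List


def combine_scores(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Collapse multi-aspect scores into one scalar reward (assumed weighted sum)."""
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each response relative to its own group: (r - mean) / (std + eps).

    No learned value critic is involved; the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for three responses sampled for the same prompt.
weights = {"safety": 0.5, "helpfulness": 0.5}
group = [
    {"safety": 0.9, "helpfulness": 0.2},
    {"safety": 0.8, "helpfulness": 0.6},
    {"safety": 0.3, "helpfulness": 0.9},
]
rewards = [combine_scores(s, weights) for s in group]
advantages = group_relative_advantages(rewards)
```

Responses whose combined reward is above the group mean receive a positive advantage and the others a negative one, so the policy update needs only the sampled group itself.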

Yide Di,Yun Liao,Hao Zhou,Kaijun Zhu,Qing Duan,Junhui Liu,Mingyu Lu

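Task: Propose a Unified Feature Matching pre-trained model (UFM) to address feature matching in multimodal images.

Motivation: Feature matching for multimodal images is complex and usually requires training on specific datasets, so a general-purpose solution is needed.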

Details

Method: Introduces Multimodal Image Assistant (MIA) transformers and a data augmentation algorithm, and adopts a staged pre-training strategy. Result: UFM shows strong generalization and performance across a variety of feature matching tasks. Conclusion: UFM offers an efficient, general-purpose solution for multimodal image feature matching. Abstract: Image feature matching, a foundational task in computer vision, remains challenging for multimodal image applications, often necessitating intricate training on specific datasets. In this paper, we introduce a Unified Feature Matching pre-trained model (UFM) designed to address feature matching challenges across a wide spectrum of image modalities. We present Multimodal Image Assistant (MIA) transformers, finely tunable structures adept at handling diverse feature matching problems. UFM exhibits versatility in addressing both feature matching tasks within the same modality and those across different modalities. Additionally, we propose a data augmentation algorithm and a staged pre-training strategy to effectively tackle challenges arising from sparse data in specific modalities and imbalanced modality datasets. Experimental results demonstrate that UFM excels in generalization and performance across various feature matching tasks. The code will be released at: https://github.com/LiaoYun0x0/UFM.

Refining Time Series Anomaly Detectors using Large Language Models

Alan Yang,Yulin Chen,Sean Lee,Venus Montes

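Task: Study how multimodal large language models (LLMs) can partially automate the human review process in time series anomaly detection (TSAD).

Motivation: Despite the many automatic anomaly detection methods available, human review is still needed to ensure accuracy; the aim is to reduce that manual effort with LLMs.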

Details

Method: Uses multimodal LLMs that combine visual inspection of time series plots with text descriptions of the data-generating process to identify false alarms. Result: LLMs can effectively identify false alarms, reducing reliance on human review. Conclusion: Multimodal LLMs show potential in TSAD systems, where they can partially replace human review and improve efficiency. Abstract: Time series anomaly detection (TSAD) is of widespread interest across many industries, including finance, healthcare, and manufacturing. Despite the development of numerous automatic methods for detecting anomalies, human oversight remains necessary to review and act upon detected anomalies, as well as verify their accuracy. We study the use of multimodal large language models (LLMs) to partially automate this process. We find that LLMs can effectively identify false alarms by integrating visual inspection of time series plots with text descriptions of the data-generating process. By leveraging the capabilities of LLMs, we aim to reduce the reliance on human effort required to maintain a TSAD system.

Low-Rank Adaptation of Pre-Trained Stable Diffusion for Rigid-Body Target ISAR Imaging

Boan Zhang,Hang Dong,Jiongge Zhang,Long Tian,Rongrong Wang,Zhenhua Wu,Xiyang Liu,Hongwei Liu

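Task: Propose an inverse synthetic aperture radar (ISAR) imaging method based on the pre-trained generative model Stable Diffusion (SD) and low-rank adaptation (LoRA) to improve imaging resolution for rigid-body targets.

Motivation: Traditional range-instantaneous Doppler (RID) methods suffer from low resolution because of the limitations of time-frequency analysis (TFA); a method is needed that obtains high-resolution time-frequency representations (TFRs) from low-resolution ones.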

Details

Method: Uses a pre-trained SD model with LoRA, combined with adversarial training and additional linear operations, to achieve super-resolution and noise suppression, and integrates the result into RID-based ISAR imaging. Result: Experiments show that the method outperforms traditional methods in frequency estimation and ISAR imaging, and its generalization is verified by training on simulated radar data and testing on measured radar data. Conclusion: The proposed LoRA-SD method markedly improves imaging resolution and denoising and is suited to high-quality ISAR imaging of rigid-body targets. Abstract: Traditional range-instantaneous Doppler (RID) methods for rigid-body target imaging often suffer from low resolution due to the limitations of time-frequency analysis (TFA). To address this challenge, our primary focus is on obtaining high-resolution time-frequency representations (TFRs) from their low-resolution counterparts. Recognizing that the curve features of TFRs are a specific type of texture feature, we argue that pre-trained generative models such as Stable Diffusion (SD) are well suited for enhancing TFRs, thanks to their powerful capability in capturing texture representations. Building on this insight, we propose a novel inverse synthetic aperture radar (ISAR) imaging method for rigid-body targets, leveraging the low-rank adaptation (LoRA) of a pre-trained SD model. Our approach adopts the basic structure and pre-trained parameters of SD Turbo while incorporating additional linear operations for LoRA and adversarial training to achieve super-resolution and noise suppression. Then we integrate LoRA-SD into the RID-based ISAR imaging, enabling sharply focused and denoised imaging with super-resolution capabilities. We evaluate our method using both simulated and real radar data. The experimental results demonstrate the superiority of our approach in frequency estimation and ISAR imaging compared to traditional methods. Notably, the generalization capability is verified by training on simulated radar data and testing on measured radar data.

MSPLoRA: A Multi-Scale Pyramid Low-Rank Adaptation for Efficient Model Fine-Tuning

Jiancheng Zhao,Xingda Yu,Zhen Yang

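Task: Propose MSPLoRA, which uses a multi-scale pyramid structure to improve LoRA for parameter-efficient fine-tuning.

Motivation: Traditional LoRA applies a fixed rank to every layer and cannot account for the varying complexity of hierarchical information, which leads to redundancy and inefficiency.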

Details

Method: Introduces Global Shared LoRA, Mid-Level Shared LoRA, and Layer-Specific LoRA to capture global patterns, mid-level features, and fine-grained information, respectively. Result: Across various NLP tasks, the method adapts more efficiently and performs better while significantly reducing the number of trainable parameters. Conclusion: MSPLoRA is a scalable and effective optimization strategy for parameter-efficient fine-tuning. Abstract: Parameter-Efficient Fine-Tuning (PEFT) has become an essential approach for adapting large-scale pre-trained models while reducing computational costs. Among PEFT methods, LoRA significantly reduces trainable parameters by decomposing weight updates into low-rank matrices. However, traditional LoRA applies a fixed rank across all layers, failing to account for the varying complexity of hierarchical information, which leads to inefficient adaptation and redundancy. To address this, we propose MSPLoRA (Multi-Scale Pyramid LoRA), which introduces Global Shared LoRA, Mid-Level Shared LoRA, and Layer-Specific LoRA to capture global patterns, mid-level features, and fine-grained information, respectively. This hierarchical structure reduces inter-layer redundancy while maintaining strong adaptation capability. Experiments on various NLP tasks demonstrate that MSPLoRA achieves more efficient adaptation and better performance while significantly reducing the number of trainable parameters. Furthermore, additional analyses based on Singular Value Decomposition validate its information decoupling ability, highlighting MSPLoRA as a scalable and effective optimization strategy for parameter-efficient fine-tuning in large language models. Our code is available at https://github.com/Oblivioniss/MSPLoRA.
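
To see why sharing adapters across layers can cut trainable parameters, the arithmetic below compares fixed-rank LoRA with a three-level sharing scheme. It is a rough sketch under stated assumptions: one adapted weight matrix per layer, a rank-r adapter costing r * (d_in + d_out) parameters (the standard LoRA factorisation into two low-rank matrices), and hypothetical ranks, group count, and dimensions. The abstract does not give MSPLoRA's actual ranks or how the three adapters are combined.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def fixed_rank_total(n_layers: int, d_in: int, d_out: int, rank: int) -> int:
    """Standard LoRA: an independent adapter of the same rank in every layer."""
    return n_layers * lora_params(d_in, d_out, rank)


def pyramid_total(
    n_layers: int,
    n_groups: int,
    d_in: int,
    d_out: int,
    r_global: int,
    r_mid: int,
    r_layer: int,
) -> int:
    """Assumed pyramid: one global adapter, one per mid-level group, one per layer."""
    return (
        lora_params(d_in, d_out, r_global)
        + n_groups * lora_params(d_in, d_out, r_mid)
        + n_layers * lora_params(d_in, d_out, r_layer)
    )


# Hypothetical configuration: 12 layers of 768 x 768 weights, 3 mid-level groups.
baseline = fixed_rank_total(n_layers=12, d_in=768, d_out=768, rank=8)
pyramid = pyramid_total(
    n_layers=12, n_groups=3, d_in=768, d_out=768, r_global=8, r_mid=4, r_layer=2
)
print(baseline, pyramid)  # 147456 67584
```

Under these illustrative numbers the shared scheme uses fewer than half the parameters of fixed-rank LoRA, because the higher-rank components are stored once or once per group rather than once per layer.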

Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations

Haitong Liu,Kuofeng Gao,Yang Bai,Jinmin Li,Jinxiao Shan,Tao Dai,Shu-Tao Xia

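Task: Propose two video watermarking methods (Ramblings and Mutes) that protect personal video data from unauthorized use.

Motivation: The rapid progress of video-based large language models (video-based LLMs) raises privacy and security concerns, particularly the unauthorized use of personal video data for automated annotation.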

Details

Method: Designs two families of video watermarks based on imperceptible adversarial perturbations: Ramblings mislead the model into generating inaccurate captions, while Mutes prompt the model to generate extremely brief captions. Result: Experiments show that both watermarking methods significantly reduce video annotation performance while remaining stealthy and robust. Conclusion: The proposed video watermarks effectively protect personal video content from unauthorized use. Abstract: Recently, video-based large language models (video-based LLMs) have achieved impressive performance across various video comprehension tasks. However, this rapid advancement raises significant privacy and security concerns, particularly regarding the unauthorized use of personal video data in automated annotation by video-based LLMs. These unauthorized annotated video-text pairs can then be used to improve the performance of downstream tasks, such as text-to-video generation. To safeguard personal videos from unauthorized use, we propose two series of protective video watermarks with imperceptible adversarial perturbations, named Ramblings and Mutes. Concretely, Ramblings aim to mislead video-based LLMs into generating inaccurate captions for the videos, thereby degrading the quality of video annotations through inconsistencies between video content and captions. Mutes, on the other hand, are designed to prompt video-based LLMs to produce exceptionally brief captions, lacking descriptive detail. Extensive experiments demonstrate that our video watermarking methods effectively protect video data by significantly reducing video annotation performance across various video-based LLMs, showcasing both stealthiness and robustness in protecting personal video content. Our code is available at https://github.com/ttthhl/Protecting_Your_Video_Content.

Zeyad Alghamdi,Tharindu Kumarage,Garima Agrawal,Mansooreh Karami,Ibrahim Almuteb,Huan Liu

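Task: Study how the RedditESS dataset and an ensemble labeling mechanism can be used to define and evaluate the effectiveness of LLM-based mental health support more comprehensively.

Motivation: Existing research defines "effective" mental health support too narrowly, focusing on empathetic acknowledgments and overlooking other essential dimensions such as informational guidance, community validation, and tangible coping strategies.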

Details

Method: Introduces the RedditESS dataset, develops an ensemble labeling mechanism grounded in social science theories to annotate supportive comments as effective or not, and performs qualitative assessments to ensure annotation reliability. Result: RedditESS is used to guide LLM alignment toward more context-sensitive and genuinely helpful supportive responses. Conclusion: The study broadens the understanding of effective support and points to new directions for AI-driven mental health interventions. Abstract: Effective mental health support is crucial for alleviating psychological distress. While large language model (LLM)-based assistants have shown promise in mental health interventions, existing research often defines "effective" support primarily in terms of empathetic acknowledgments, overlooking other essential dimensions such as informational guidance, community validation, and tangible coping strategies. To address this limitation and better understand what constitutes effective support, we introduce RedditESS, a novel real-world dataset derived from Reddit posts, including supportive comments and original posters' follow-up responses. Grounded in established social science theories, we develop an ensemble labeling mechanism to annotate supportive comments as effective or not and perform qualitative assessments to ensure the reliability of the annotations. Additionally, we demonstrate the practical utility of RedditESS by using it to guide LLM alignment toward generating more context-sensitive and genuinely helpful supportive responses. By broadening the understanding of effective support, our study paves the way for advanced AI-driven mental health interventions.

Hybrid Multi-Stage Learning Framework for Edge Detection: A Survey

Mark Phil Pacot,Jayno Juventud,Gleen Dalaorao

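Task: Propose a hybrid multi-stage learning framework that combines CNN feature extraction with an SVM classifier to improve edge localization and structural accuracy.

Motivation: Address the difficulty of edge detection under varying illumination, noise, and complex scene conditions.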

Details

Method: Uses a hybrid multi-stage learning framework that combines CNN feature extraction with an SVM classifier, decoupling the feature representation and classification stages. Result: On the BSDS500 and NYUDv2 datasets, the framework outperforms traditional and recent learning-based methods on the ODS and OIS metrics while maintaining competitive AP. Conclusion: The framework bridges classical and deep learning paradigms and sets a new direction for scalable, interpretable, and high-quality edge detection. Abstract: Edge detection remains a fundamental yet challenging task in computer vision, especially under varying illumination, noise, and complex scene conditions. This paper introduces a Hybrid Multi-Stage Learning Framework that integrates Convolutional Neural Network (CNN) feature extraction with a Support Vector Machine (SVM) classifier to improve edge localization and structural accuracy. Unlike conventional end-to-end deep learning models, our approach decouples feature representation and classification stages, enhancing robustness and interpretability. Extensive experiments conducted on benchmark datasets such as BSDS500 and NYUDv2 demonstrate that the proposed framework outperforms traditional edge detectors and even recent learning-based methods in terms of Optimal Dataset Scale (ODS) and Optimal Image Scale (OIS), while maintaining competitive Average Precision (AP). Both qualitative and quantitative results highlight enhanced performance on edge continuity, noise suppression, and perceptual clarity achieved by our method. This work not only bridges classical and deep learning paradigms but also sets a new direction for scalable, interpretable, and high-quality edge detection solutions.

JEEM: Vision-Language Understanding in Four Arabic Dialects

Karima Kadaoui,Hanin Atwany,Hamdan Al-Ali,Abdelrahman Mohamed,Ali Mekky,Sergei Tilga,Natalia Fedorova,Ekaterina Artemova,Hanan Aldarmaki,Yova Kementchedjhieva

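Task: Evaluate the visual understanding of Vision-Language Models (VLMs) across four Arabic-speaking countries (Jordan, the Emirates, Egypt, and Morocco).

Motivation: Use culturally rich and regionally diverse content to assess how well VLMs generalize across dialects and interpret cultural elements.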

Details

Method: Uses the JEEM benchmark, which covers image captioning and visual question answering, to evaluate five open-source Arabic VLMs and GPT-4V. Result: The Arabic VLMs underperform; GPT-4V ranks best but still shows limitations in dialect handling and visual understanding. Conclusion: More inclusive models and culturally diverse evaluation paradigms are needed. Abstract: We introduce JEEM, a benchmark designed to evaluate Vision-Language Models (VLMs) on visual understanding across four Arabic-speaking countries: Jordan, The Emirates, Egypt, and Morocco. JEEM includes the tasks of image captioning and visual question answering, and features culturally rich and regionally diverse content. This dataset aims to assess the ability of VLMs to generalize across dialects and accurately interpret cultural elements in visual contexts. In an evaluation of five prominent open-source Arabic VLMs and GPT-4V, we find that the Arabic VLMs consistently underperform, struggling with both visual understanding and dialect-specific generation. While GPT-4V ranks best in this comparison, the model's linguistic competence varies across dialects, and its visual understanding capabilities lag behind. This underscores the need for more inclusive models and the value of culturally-diverse evaluation paradigms.

Shape Generation via Weight Space Learning

Maximilian Plattner,Arturs Berzins,Johannes Brandstetter

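Task: Explore the weight space of a large 3D shape-generative model as a data modality for modulating topological properties or local part features.

Motivation: Real-world data are often scarce or noisy, and traditional fine-tuning can lead to catastrophic forgetting, so new ways of exploiting geometric priors are needed.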

Details

Method: Treats the weight space as a data modality and tests the modulating ability of its submanifolds through an interpolation experiment and a low-dimensional reparameterization experiment. Result: Interpolation in conditioning space reveals a sharp phase transition in global connectivity, and low-dimensional reparameterizations yield controlled local geometry changes. Conclusion: Weight space learning opens new approaches for 3D shape generation and specialized fine-tuning. Abstract: Foundation models for 3D shape generation have recently shown a remarkable capacity to encode rich geometric priors across both global and local dimensions. However, leveraging these priors for downstream tasks can be challenging as real-world data are often scarce or noisy, and traditional fine-tuning can lead to catastrophic forgetting. In this work, we treat the weight space of a large 3D shape-generative model as a data modality that can be explored directly. We hypothesize that submanifolds within this high-dimensional weight space can modulate topological properties or fine-grained part features separately, demonstrating early-stage evidence via two experiments. First, we observe a sharp phase transition in global connectivity when interpolating in conditioning space, suggesting that small changes in weight space can drastically alter topology. Second, we show that low-dimensional reparameterizations yield controlled local geometry changes even with very limited data. These results highlight the potential of weight space learning to unlock new approaches for 3D shape generation and specialized fine-tuning.

AutoPsyC: Automatic Recognition of Psychodynamic Conflicts from Semi-structured Interviews with Large Language Models

Sayed Muddashir Hossain,Simon Ostermann,Patrick Gebhard,Cord Benecke,Josef van Genabith,Philipp Müller

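Task: Propose AutoPsyC, a method that uses Large Language Models (LLMs) to recognise the presence and significance of psychodynamic conflicts from full-length Operationalized Psychodynamic Diagnostics (OPD) interviews.

Motivation: Psychodynamic conflicts are persistent, often unconscious themes that shape a person's behaviour and experiences, and diagnosing them accurately is crucial for treatment. Existing automated solutions focus on broad disorder categories such as depression, and it remains unclear how well psychodynamic conflicts can be recognised automatically.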

Details

Method: Combines parameter-efficient fine-tuning and Retrieval-Augmented Generation (RAG) with a summarisation strategy to process conversations up to 90 minutes long. Result: On a dataset of 141 diagnostic interviews, AutoPsyC consistently outperforms all baselines and ablation conditions in recognising four highly relevant psychodynamic conflicts. Conclusion: AutoPsyC is the first method for automatically recognising psychodynamic conflicts from OPD interviews and demonstrates its effectiveness and potential. Abstract: Psychodynamic conflicts are persistent, often unconscious themes that shape a person's behaviour and experiences. Accurate diagnosis of psychodynamic conflicts is crucial for effective patient treatment and is commonly done via long, manually scored semi-structured interviews. Existing automated solutions for psychiatric diagnosis tend to focus on the recognition of broad disorder categories such as depression, and it is unclear to what extent psychodynamic conflicts, which even the patients themselves may not have conscious access to, could be automatically recognised from conversation. In this paper, we propose AutoPsyC, the first method for recognising the presence and significance of psychodynamic conflicts from full-length Operationalized Psychodynamic Diagnostics (OPD) interviews using Large Language Models (LLMs). Our approach combines recent advances in parameter-efficient fine-tuning and Retrieval-Augmented Generation (RAG) with a summarisation strategy to effectively process entire 90-minute-long conversations. In evaluations on a dataset of 141 diagnostic interviews we show that AutoPsyC consistently outperforms all baselines and ablation conditions on the recognition of four highly relevant psychodynamic conflicts.

Haomin Yu,Tianyi Li,Kristian Torp,Christian S. Jensen

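Task: Propose a multi-modal knowledge-enhanced framework (MAKER) to improve the accuracy of vessel trajectory prediction.

Motivation: Existing prediction methods struggle with the irregular sampling time intervals of AIS data and the complexity of vessel movement, which make model learning and generalization difficult.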

Details

Method: MAKER comprises a Large language model-guided Knowledge Transfer (LKT) module and a Knowledge-based Self-paced Learning (KSL) module, which handle irregular time intervals and complex trajectory patterns, respectively. Result: On two vessel trajectory datasets, MAKER improves the prediction accuracy of state-of-the-art methods by 12.08%-17.86%. Conclusion: MAKER effectively addresses key challenges in vessel trajectory prediction and substantially improves prediction performance. Abstract: Accurate vessel trajectory prediction facilitates improved navigational safety, routing, and environmental protection. However, existing prediction methods are challenged by the irregular sampling time intervals of the vessel tracking data from the global AIS system and the complexity of vessel movement. These aspects render model learning and generalization difficult. To address these challenges and improve vessel trajectory prediction, we propose the multi-modal knowledge-enhanced framework (MAKER) for vessel trajectory prediction. To contend better with the irregular sampling time intervals, MAKER features a Large language model-guided Knowledge Transfer (LKT) module that leverages pre-trained language models to transfer trajectory-specific contextual knowledge effectively. To enhance the ability to learn complex trajectory patterns, MAKER incorporates a Knowledge-based Self-paced Learning (KSL) module. This module employs kinematic knowledge to progressively integrate complex patterns during training, allowing for adaptive learning and enhanced generalization. Experimental results on two vessel trajectory datasets show that MAKER can improve the prediction accuracy of state-of-the-art methods by 12.08%-17.86%.

Hybrid Emotion Recognition: Enhancing Customer Interactions Through Acoustic and Textual Analysis

Sahan Hewage Wewelwala,T. G. D. K. Sumanathilaka

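Task: Develop a hybrid emotion recognition system that combines deep learning and natural language processing to analyze audio and textual data and improve customer interactions.

Motivation: Address the limitations of traditional approaches in understanding complex emotional states, and make customer service more personalized and empathetic.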

Details

Method: Uses LSTM and CNN models for audio analysis and DistilBERT for textual sentiment analysis, while accommodating linguistic and cultural variations. Result: The system shows robustness and high accuracy on diverse datasets and supports real-time processing. Conclusion: The research lays a foundation for more intelligent, human-centric digital communication and redefines customer service standards. Abstract: This research presents a hybrid emotion recognition system integrating advanced Deep Learning, Natural Language Processing (NLP), and Large Language Models (LLMs) to analyze audio and textual data for enhancing customer interactions in contact centers. By combining acoustic features with textual sentiment analysis, the system achieves nuanced emotion detection, addressing the limitations of traditional approaches in understanding complex emotional states. Leveraging LSTM and CNN models for audio analysis and DistilBERT for textual evaluation, the methodology accommodates linguistic and cultural variations while ensuring real-time processing. Rigorous testing on diverse datasets demonstrates the system's robustness and accuracy, highlighting its potential to transform customer service by enabling personalized, empathetic interactions and improving operational efficiency. This research establishes a foundation for more intelligent and human-centric digital communication, redefining customer service standards.

iMedImage Technical Report

Ran Wei,ZhiXiong Lan,Qing Yan,Ning Song,Ming Lv,LongQing Ye

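Task: Develop iMedImage, an end-to-end foundation model for medical image analysis, covering tasks such as chromosome abnormality detection.

Motivation: Chromosome karyotype analysis is crucial for diagnosing hereditary diseases, yet detecting structural abnormalities remains challenging; the effectiveness of AI varies across medical imaging modalities, which motivates building on foundation models that integrate multimodal medical imaging.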

Details

Method: Builds a multimodal medical image dataset and develops the iMedImage model, which uses a unified representation method and multi-level (case-level, image-level, patch-level) image recognition, enhanced by Chain of Thought (CoT) embedding and Mixture of Experts (MoE) strategies. Result: On a test set with data from 12 institutions, the model achieves a fully automated chromosome analysis workflow with a sensitivity of 92.75% and a specificity of 91.5%. Conclusion: iMedImage performs strongly across various medical imaging tasks, giving clinicians a precise analysis tool and helping improve diagnostic accuracy and disease screening. Abstract: Background: Chromosome karyotype analysis is crucial for diagnosing hereditary diseases, yet detecting structural abnormalities remains challenging. While AI has shown promise in medical imaging, its effectiveness varies across modalities. Leveraging advances in Foundation Models that integrate multimodal medical imaging for robust feature extraction and accurate diagnosis, we developed iMedImage, an end-to-end model for general medical image recognition, demonstrating strong performance across multiple imaging tasks, including chromosome abnormality detection. Materials and Methods: We constructed a comprehensive medical image dataset encompassing multiple modalities from common medical domains, including chromosome, cell, pathology, ultrasound, X-ray, CT, and MRI images. Based on this dataset, we developed the iMedImage model, which incorporates the following key features: (1) a unified representation method for diverse modality inputs and medical imaging tasks; (2) multi-level (case-level, image-level, patch-level) image recognition capabilities enhanced by Chain of Thought (CoT) embedding and Mixture of Experts (MoE) strategies. Results: The test set comprised data from 12 institutions across six regions in China, covering three mainstream scanning devices, and included naturally distributed, unscreened abnormal cases. On this diverse dataset, the model achieved a fully automated chromosome analysis workflow, including segmentation, karyotyping, and abnormality detection, reaching a sensitivity of 92.75% and a specificity of 91.5%. Conclusion: We propose iMedImage, an end-to-end foundation model for medical image analysis, demonstrating its superior performance across various medical imaging tasks. iMedImage provides clinicians with a precise imaging analysis tool and contributes to improving diagnostic accuracy and disease screening.

Local Normalization Distortion and the Thermodynamic Formalism of Decoding Strategies for Large Language Models

Tom Kempton,Stuart Burrell

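Task: Investigate how decoding strategies affect the quality and diversity of text generated by language models.

Motivation: Existing decoding strategies are largely heuristic and lack theoretical grounding, which makes them hard to improve in a principled way.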

Details

Method: Expresses popular decoding algorithms as equilibrium states in the language of ergodic theory and states the functions they optimize. Result: Local normalization distortion is identified as a fundamental defect of decoding strategies that affects the quality and diversity of generated text. Conclusion: The analysis yields conclusions for the future design of decoding algorithms and for the detection of machine-generated text. Abstract: Advances in hardware and language model architecture have spurred a revolution in natural language generation. However, autoregressive models compute probability distributions over next-token choices, and sampling from these distributions, known as decoding, has received significantly less attention than other design choices. Existing decoding strategies are largely based on heuristics, resulting in methods that are hard to apply or improve in a principled manner. We develop the theory of decoding strategies for language models by expressing popular decoding algorithms as equilibrium states in the language of ergodic theory and stating the functions they optimize. Using this, we analyze the effect of the local normalization step of top-k, nucleus, and temperature sampling, used to make probabilities sum to one. We argue that local normalization distortion is a fundamental defect of decoding strategies and quantify the size of this distortion and its effect on mathematical proxies for the quality and diversity of generated text. Contrary to the prevailing explanation, we argue that the major cause of the under-performance of top-k sampling relative to nucleus sampling is local normalization distortion. This yields conclusions for the future design of decoding algorithms and the detection of machine-generated text.
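
The local normalization step mentioned in the abstract (truncate the next-token distribution, then rescale so the probabilities sum to one) can be made concrete with a toy example. This sketch only shows that the rescaling factor depends on the context; it does not reproduce the paper's formal definition or measurement of local normalization distortion. The function names and the two toy distributions are hypothetical.

```python
from typing import List, Tuple


def top_k_renormalize(probs: List[float], k: int) -> Tuple[List[float], float]:
    """Keep the k most probable tokens and rescale them to sum to one.

    Returns the renormalized distribution and the retained probability mass.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    mass = sum(probs[i] for i in keep)
    return [p / mass if i in keep else 0.0 for i, p in enumerate(probs)], mass


def nucleus_renormalize(probs: List[float], top_p: float) -> Tuple[List[float], float]:
    """Keep the smallest set of top tokens whose cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in keep)
    return [p / mass if i in keep else 0.0 for i, p in enumerate(probs)], mass


# Two hypothetical next-token distributions: one peaked, one flat.
peaked = [0.9, 0.05, 0.03, 0.02]
flat = [0.25, 0.25, 0.25, 0.25]

_, topk_mass_peaked = top_k_renormalize(peaked, k=2)   # retains about 0.95
_, topk_mass_flat = top_k_renormalize(flat, k=2)       # retains 0.5
nucleus_peaked, _ = nucleus_renormalize(peaked, top_p=0.8)
nucleus_flat, _ = nucleus_renormalize(flat, top_p=0.8)
```

With top-k at k=2, the surviving probabilities are divided by roughly 0.95 in the peaked context but by 0.5 in the flat one, so the same rule rescales tokens very differently from step to step. Nucleus sampling adapts the number of kept tokens instead (one token for the peaked case, all four for the flat case), which keeps the retained mass closer to the chosen threshold.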

M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?

Haolong Yan,Kaijun Tan,Yeqing Shen,Xin Huang,Zheng Ge,Xiangyu Zhang,Si Li,Daxin Jiang

Task: Investigate whether large vision-language models (LVLMs) genuinely comprehend interleaved image-text content in documents.

Motivation: Existing document understanding benchmarks usually evaluate LVLMs with question-answer formats, which are information-sparse and struggle to cover long-range dependencies.

Details

Method: Introduces the Multimodal Document Summarization Benchmark (M-DocSum-Bench), comprising 500 high-quality arXiv papers with interleaved multimodal summaries aligned with human preferences, together with an automated construction framework and a fine-grained evaluation method, M-DocEval. Result: Leading LVLMs struggle to maintain coherence and accurately integrate information in long, interleaved contexts, while the proposed M-DocSum-7B outperforms larger and closed-source models. Conclusion: M-DocSum-7B demonstrates the potential of LVLMs for improved interleaved image-text understanding. Abstract: We investigate a critical yet under-explored question in Large Vision-Language Models (LVLMs): Do LVLMs genuinely comprehend interleaved image-text in the document? Existing document understanding benchmarks often assess LVLMs using question-answer formats, which are information-sparse and difficult to guarantee the coverage of long-range dependencies. To address this issue, we introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench), which comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences. M-DocSum-Bench is a reference-based generation task and necessitates the generation of interleaved image-text summaries using provided reference images, thereby simultaneously evaluating capabilities in understanding, reasoning, localization, and summarization within complex multimodal document scenarios. To facilitate this benchmark, we develop an automated framework to construct summaries and propose a fine-grained evaluation method called M-DocEval. Moreover, we further develop a robust summarization baseline, i.e., M-DocSum-7B, by progressive two-stage training with diverse instruction and preference data. The extensive results on our M-DocSum-Bench reveal that the leading LVLMs struggle to maintain coherence and accurately integrate information within long and interleaved contexts, often exhibiting confusion between similar images and a lack of robustness. Notably, M-DocSum-7B achieves state-of-the-art performance compared to larger and closed-source models (including GPT-4o, Gemini Pro, Claude-3.5-Sonnet and Qwen2.5-VL-72B, etc.), demonstrating the potential of LVLMs for improved interleaved image-text understanding. The code, data, and models are available at https://github.com/stepfun-ai/M-DocSum-Bench.

Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad

Ivo Petrov,Jasper Dekoninck,Lyuben Baltadzhiev,Maria Drencheva,Kristian Minchev,Mislav Balunović,Nikola Jovanović,Martin Vechev

Task: Evaluate the full-solution reasoning ability of large language models on math competition problems.

Motivation: Existing math benchmarks look only at final numerical answers and ignore rigorous reasoning and proof generation, which are essential for real-world mathematical tasks.

Details

Method: Evaluates full solutions from several state-of-the-art reasoning models on the six problems of the 2025 USAMO, graded by expert annotators. Result: All tested models fell well short, scoring below 5% on average. Analysis of the reasoning traces reveals common failure modes and unwanted artifacts of training strategies. Conclusion: Current LLMs are inadequate for rigorous mathematical reasoning and need substantial improvements in reasoning and proof generation. Abstract: Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, o3-mini, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely based on final numerical answers, neglecting rigorous reasoning and proof generation which are essential for real-world mathematical tasks. To address this, we introduce the first comprehensive evaluation of full-solution reasoning for challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems from the 2025 USAMO within hours of their release. Our results reveal that all tested models struggled significantly, achieving less than 5% on average. Through detailed analysis of reasoning traces, we identify the most common failure modes and find several unwanted artifacts arising from the optimization strategies employed during model training. Overall, our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks, highlighting the need for substantial improvements in reasoning and proof generation capabilities.

HyperFree: A Channel-adaptive and Tuning-free Foundation Model for Hyperspectral Remote Sensing Imagery

Jingtao Li,Yingyi Liu,Xinyu Wang,Yunning Peng,Chen Sun,Shaoyu Wang,Zhendong Sun,Tian Ke,Xiao Jiang,Tangwei Lu,Anran Zhao,Yanfei Zhong

Task: Propose HyperFree, a tuning-free hyperspectral foundation model for precise interpretation of hyperspectral remote sensing imagery.

Motivation: Existing visual foundation models target RGB and multispectral images; because hyperspectral images vary in channel count, models need image-by-image tuning, which strains hardware and time resources.

Details

Method: Adapts existing visual prompt engineering, designs a learned weight dictionary covering the full spectrum to build the embedding layer dynamically, and generates multiple semantic-aware masks per prompt. Result: Across 5 tasks and 11 datasets, HyperFree (1 prompt) performs comparably to specialized models (5 shots). Conclusion: HyperFree offers an efficient, tuning-free solution for interpreting hyperspectral remote sensing imagery. Abstract: Advanced interpretation of hyperspectral remote sensing images benefits many precise Earth observation tasks. Recently, visual foundation models have promoted remote sensing interpretation but concentrate on RGB and multispectral images. Due to the varied hyperspectral channels, existing foundation models would face an image-by-image tuning situation, imposing great pressure on hardware and time resources. In this paper, we propose a tuning-free hyperspectral foundation model called HyperFree, by adapting the existing visual prompt engineering. To process varied channel numbers, we design a learned weight dictionary covering the full spectrum from $0.4 \sim 2.5 \, \mu\text{m}$, which supports building the embedding layer dynamically. To make the prompt design more tractable, HyperFree can generate multiple semantic-aware masks for one prompt by treating feature distance as semantic similarity. After pre-training HyperFree on constructed large-scale high-resolution hyperspectral images, HyperFree (1 prompt) has shown comparable results with specialized models (5 shots) on 5 tasks and 11 datasets. Code and dataset are accessible at https://rsidea.whu.edu.cn/hyperfree.htm.

Entropy-Aware Branching for Improved Mathematical Reasoning

Xianzhi Li,Ethan Callanan,Xiaodan Zhu,Mathieu Sibue,Antony Papadimitriou,Mahmoud Mahfouz,Zhiqiang Ma,Xiaomo Liu

Task: Propose a dynamic branching generation method to improve large language models on mathematical reasoning.

Motivation: Large language models face uncertainty during generation and are especially error-prone at tokens with high entropy and high variance of entropy.

Details

Method: Branches generation dynamically, exploring in parallel multiple paths that start from high-probability tokens, and uses external feedback to select the best reasoning branch. Result: On mathematical word problems and calculation questions, the method improves on conventional argmax decoding by up to 4.6%. Conclusion: Dynamic branching effectively improves the reasoning ability of small language models. Abstract: While Large Language Models (LLMs) are effectively aligned through extensive pre-training and fine-tuning, they still struggle with varying levels of uncertainty during token generation. In our investigation of mathematical reasoning, we observe that errors are more likely to arise at tokens exhibiting high entropy and variance of entropy in the model's output distribution. Based on the observation, we propose a novel approach that dynamically branches the generation process on demand instead of defaulting to the single most probable token. By exploring in parallel multiple branches stemming from high probability tokens of critical decision points, the model can discover diverse reasoning paths that might otherwise be missed. We further harness external feedback from larger models to rank and select the most coherent and accurate reasoning branch. Our experimental results on mathematical word problems and calculation questions show that this branching strategy boosts the reasoning capabilities of small LLMs up to 4.6% compared to conventional argmax decoding.
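A branching trigger of the kind described above might look like the following sketch. It rests on stated assumptions: "variance of entropy" is read here as varentropy (the variance of token surprisal under the next-token distribution), and the thresholds `h_thresh` and `v_thresh` as well as all function names are hypothetical, not values or APIs from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def varentropy(probs):
    """Variance of the surprisal -log p under the distribution itself."""
    h = entropy(probs)
    return sum(p * (-math.log(p) - h) ** 2 for p in probs if p > 0)

def should_branch(probs, h_thresh=1.0, v_thresh=0.1):
    """Hypothetical rule: branch only when both uncertainty signals are high."""
    return entropy(probs) > h_thresh and varentropy(probs) > v_thresh

def branch_candidates(probs, k=3):
    """Indices of the k most probable tokens, one per branch to explore."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
```

When `should_branch` is false, decoding would continue with the usual argmax token; when it is true, each index from `branch_candidates` seeds a separate continuation, and a ranking step (a larger model in the abstract's description) picks among the completed branches.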

Hanyu Liu,Siyao Li,Ying Yu,Yixuan Jiang,Hang Xiao,Jingxi Long,Haotian Tang

Task: Address mixed sensor-data distributions, activity heterogeneity, and complex model deployment.

Motivation: Although deep learning has been used to accelerate feature extraction, multimodal data mixing, activity heterogeneity, and complex model deployment remain unresolved.

Details

Method: Proposes a spatiotemporal attention modal decomposition alignment fusion strategy, combined with gradient modulation and a wearable deployment simulation system. Result: The model's effectiveness is validated on a large number of public datasets. Conclusion: The proposed method addresses mixed sensor-data distributions and activity heterogeneity, and its deployment feasibility is verified. Abstract: Human Activity Recognition (HAR) is a fundamental technology for numerous human-centered intelligent applications. Although deep learning methods have been utilized to accelerate feature extraction, issues such as multimodal data mixing, activity heterogeneity, and complex model deployment remain largely unresolved. The aim of this paper is to address issues such as multimodal data mixing, activity heterogeneity, and complex model deployment in sensor-based human activity recognition. We propose a spatiotemporal attention modal decomposition alignment fusion strategy to tackle the problem of the mixed distribution of sensor data. Key discriminative features of activities are captured through cross-modal spatio-temporal disentangled representation, and gradient modulation is combined to alleviate data heterogeneity. In addition, a wearable deployment simulation system is constructed. We conducted experiments on a large number of public datasets, demonstrating the effectiveness of the model.

Cluster automata

András Kornai

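Task: Introduce and study a new class of clustered Moore automata (CMA) and their temporal behavior.

Motivation: Explore the temporal behavior and potential applications of clustered Moore automata.

Details

Method: Defines clustered Moore automata and studies their temporal behavior. Result: Describes some applications of clustered Moore automata. Conclusion: Clustered Moore automata and their temporal behavior are of potential research and practical value. Abstract: We introduce a new class of clustered Moore automata (CMA), investigate their temporal behavior, and describe some applications.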

Comparative Analysis of Image, Video, and Audio Classifiers for Automated News Video Segmentation

Jonathan Attard,Dylan Seychell

Task: Compare image, video, and audio classifiers for automated news video segmentation.

Motivation: The unstructured nature of news videos makes automated processing difficult, yet efficient content organisation and retrieval systems are needed.

Details

Method: Develops and evaluates several deep learning approaches (ResNet, ViViT, AST, and multimodal architectures) for classifying five segment types. Result: Image-based classifiers perform best (84.34% accuracy), and the ResNet architecture outperforms state-of-the-art video classifiers while using fewer computational resources. Conclusion: The study identifies effective architecture choices for news video segmentation and offers practical insights for automated content organisation systems in media applications. Abstract: News videos require efficient content organisation and retrieval systems, but their unstructured nature poses significant challenges for automated processing. This paper presents a comprehensive comparative analysis of image, video, and audio classifiers for automated news video segmentation. This work presents the development and evaluation of multiple deep learning approaches, including ResNet, ViViT, AST, and multimodal architectures, to classify five distinct segment types: advertisements, stories, studio scenes, transitions, and visualisations. Using a custom-annotated dataset of 41 news videos comprising 1,832 scene clips, our experiments demonstrate that image-based classifiers achieve superior performance (84.34% accuracy) compared to more complex temporal models. Notably, the ResNet architecture outperformed state-of-the-art video classifiers while requiring significantly fewer computational resources. Binary classification models achieved high accuracy for transitions (94.23%) and advertisements (92.74%). These findings advance the understanding of effective architectures for news video segmentation and provide practical insights for implementing automated content organisation systems in media applications. These include media archiving, personalised content delivery, and intelligent video search.

Monte Carlo Sampling for Analyzing In-Context Examples

Stephanie Schoch,Yangfeng Ji

Task: Study how the number, order, and choice of examples affect in-context learning performance.

Motivation: Prior work shows that in-context learning is sensitive to how examples are presented (order, number, and choice), but existing approaches may overlook the interplay among these factors.

Details

Method: Uses Monte Carlo sampling to study the effect of the number of examples while accounting for order and selection effects. Result: Earlier guidance on how many examples to use does not always generalize across example sets and orderings, and whether one-shot beats zero-shot depends heavily on the selected example. The sampling-based example selection method did not beat random sampling. Conclusion: With the selected examples, performance is robust to ordering and number of examples, but selection underperforms random sampling rather than delivering the expected gain. Abstract: Prior works have shown that in-context learning is brittle to presentation factors such as the order, number, and choice of selected examples. However, ablation-based guidance on selecting the number of examples may ignore the interplay between different presentation factors. In this work we develop a Monte Carlo sampling-based method to study the impact of number of examples while explicitly accounting for effects from order and selected examples. We find that previous guidance on how many in-context examples to select does not always generalize across different sets of selected examples and orderings, and whether one-shot settings outperform zero-shot settings is highly dependent on the selected example. Additionally, inspired by data valuation, we apply our sampling method to in-context example selection to select examples that perform well across different orderings. We find a negative result, that while performance is robust to ordering and number of examples, there is an unexpected performance degradation compared to random sampling.
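The general idea of sampling over example sets and orderings can be sketched as below. This is an illustration under stated assumptions rather than the authors' procedure: `evaluate` is a hypothetical callback that scores one ordered list of in-context examples (for instance, by running a model on a validation set), and `monte_carlo_icl` is an invented name.

```python
import random
import statistics

def monte_carlo_icl(pool, k, evaluate, n_samples=200, seed=0):
    """Estimate the mean and spread of performance with k in-context examples.

    Each draw picks k examples from the pool in random order, so the
    estimate averages over both which examples are chosen and how they
    are ordered, instead of fixing either factor.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        examples = rng.sample(pool, k)  # random subset, returned in random order
        scores.append(evaluate(examples))
    return statistics.mean(scores), statistics.pstdev(scores)
```

Comparing the returned means for different values of `k`, together with their spreads, shows whether a claimed best number of examples holds up once the choice and order of examples are allowed to vary.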

On Large Multimodal Models as Open-World Image Classifiers

Alessandro Conti,Massimiliano Mancini,Enrico Fini,Yiming Wang,Paolo Rota,Elisa Ricci

Task: Evaluate the image classification performance of Large Multimodal Models (LMMs) in an open-world setting.

Motivation: Most existing studies assume a closed-world setting and lack a thorough evaluation of LMM classification in the open world.

Details

Method: Formalizes the task and an evaluation protocol, defines several metrics for the alignment between predicted and ground-truth classes, and evaluates 13 models on 10 benchmarks. Result: Reveals the challenges LMMs face in open-world classification, especially regarding granularity and fine-grained capabilities, and shows how tailored prompting and reasoning can alleviate them. Conclusion: LMMs face significant challenges in open-world classification, which tailored prompting and reasoning can partly mitigate. Abstract: Traditional image classification requires a predefined list of semantic categories. In contrast, Large Multimodal Models (LMMs) can sidestep this requirement by classifying images directly using natural language (e.g., answering the prompt "What is the main object in the image?"). Despite this remarkable capability, most existing studies on LMM classification performance are surprisingly limited in scope, often assuming a closed-world setting with a predefined set of categories. In this work, we address this gap by thoroughly evaluating LMM classification performance in a truly open-world setting. We first formalize the task and introduce an evaluation protocol, defining various metrics to assess the alignment between predicted and ground truth classes. We then evaluate 13 models across 10 benchmarks, encompassing prototypical, non-prototypical, fine-grained, and very fine-grained classes, demonstrating the challenges LMMs face in this task. Further analyses based on the proposed metrics reveal the types of errors LMMs make, highlighting challenges related to granularity and fine-grained capabilities, showing how tailored prompting and reasoning can alleviate them.

Enhancing Domain-Specific Encoder Models with LLM-Generated Data: How to Leverage Ontologies, and How to Do Without Them

Marc Brinner,Tarek Al Mustafa,Sina Zarrieß

Task: Study how to use LLM-generated data for continual pretraining of encoder models in specialized domains with limited data, with invasion biology as a case study.

Motivation: Address the challenge of pretraining encoder models in specialized, data-limited domains and improve their domain-specific understanding.

Details

Method: Enriches domain-specific ontologies with LLM-generated data and pretrains the encoder as an ontology-informed embedding model for concept definitions. For domains without comprehensive ontologies, concepts are extracted automatically from scientific abstracts and related to each other through distributional statistics. Result: Substantially outperforms standard LLM pretraining in invasion biology, and reaches comparable performance without a comprehensive ontology using only a small set of scientific abstracts. Conclusion: The approach yields a fully automated pipeline for domain-specific understanding in low-resource settings, with performance comparable to masked language modeling pretraining on much larger datasets. Abstract: We investigate the use of LLM-generated data for continual pretraining of encoder models in specialized domains with limited training data, using the scientific domain of invasion biology as a case study. To this end, we leverage domain-specific ontologies by enriching them with LLM-generated data and pretraining the encoder model as an ontology-informed embedding model for concept definitions. To evaluate the effectiveness of this method, we compile a benchmark specifically designed for assessing model performance in invasion biology. After demonstrating substantial improvements over standard LLM pretraining, we investigate the feasibility of applying the proposed approach to domains without comprehensive ontologies by substituting ontological concepts with concepts automatically extracted from a small corpus of scientific abstracts and establishing relationships between concepts through distributional statistics. Our results demonstrate that this automated approach achieves comparable performance using only a small set of scientific abstracts, resulting in a fully automated pipeline for enhancing domain-specific understanding of small encoder models that is especially suited for application in low-resource settings and achieves performance comparable to masked language modeling pretraining on much larger datasets.

Foveated Instance Segmentation

Hongyi Zeng,Wenxuan Liu,Tianhua Xia,Jinhui Chen,Ziyun Li,Sai Qian Zhang

Task: Propose FovealSeg, an instance segmentation framework driven by user gaze data that processes only the regions a user is interested in, reducing computational overhead.

Motivation: AR/VR devices are resource-constrained and conventional instance segmentation is computationally heavy, causing latency and a degraded user experience; users typically attend to only a few regions in their field of view, so segmentation can be concentrated there.

Details

Method: Uses real-time user gaze data to segment only the instance of interest (the FovealSeg framework). Result: Achieves an IoU of 0.56 on ADE20K and 0.54 on LVIS, notably outperforming the baseline. Conclusion: By focusing on the gazed region, FovealSeg substantially reduces computational overhead and improves real-time performance. Abstract: Instance segmentation is essential for augmented reality and virtual reality (AR/VR) as it enables precise object recognition and interaction, enhancing the integration of virtual and real-world elements for an immersive experience. However, the high computational overhead of segmentation limits its application on resource-constrained AR/VR devices, causing large processing latency and degrading user experience. In contrast to conventional scenarios, AR/VR users typically focus on only a few regions within their field of view before shifting perspective, allowing segmentation to be concentrated on gaze-specific areas. This insight drives the need for efficient segmentation methods that prioritize processing instance of interest, reducing computational load and enhancing real-time performance. In this paper, we present a foveated instance segmentation (FovealSeg) framework that leverages real-time user gaze data to perform instance segmentation exclusively on instance of interest, resulting in substantial computational savings. Evaluation results show that FSNet achieves an IoU of 0.56 on ADE20K and 0.54 on LVIS, notably outperforming the baseline. The code is available at https://github.com/SAI-

Cognitive Prompts Using Guilford's Structure of Intellect Model

Oliver Kramer

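Task: Propose a cognitive prompt engineering approach based on Guilford's Structure of Intellect (SOI) model to strengthen structured reasoning in large language models (LLMs).

Motivation: Large language models excel at language generation but fall short on structured reasoning, which leads to inconsistent or suboptimal problem-solving.

Details

Method: Uses the SOI model's classification of cognitive operations (such as pattern recognition, memory retrieval, and evaluation) to design a systematic cognitive prompting approach. Result: Presents a novel cognitive prompting approach aimed at improving the clarity, coherence, and adaptability of model responses. Conclusion: As a position paper, it argues that SOI-guided cognitive prompt engineering can strengthen structured reasoning in LLMs. Abstract: Large language models (LLMs) demonstrate strong language generation capabilities but often struggle with structured reasoning, leading to inconsistent or suboptimal problem-solving. To mitigate this limitation, Guilford's Structure of Intellect (SOI) model - a foundational framework from intelligence theory - is leveraged as the basis for cognitive prompt engineering. The SOI model categorizes cognitive operations such as pattern recognition, memory retrieval, and evaluation, offering a systematic approach to enhancing LLM reasoning and decision-making. This position paper presents a novel cognitive prompting approach for enforcing SOI-inspired reasoning for improving clarity, coherence, and adaptability in model responses.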

StarFlow: Generating Structured Workflow Outputs From Sketch Images

Patrice Bechard,Chao Wang,Amirhossein Abaskohi,Juan Rodriguez,Christopher Pal,David Vazquez,Spandana Gella,Sai Rajeswar,Perouz Taslakian

Task: Use vision-language models (VLMs) to automatically generate structured workflows from visual inputs.

Motivation: Workflows are widely used in enterprise platforms, but building them is complex and usually requires manual configuration, which motivates exploring ways to simplify the process.

Details

Method: Proposes the StarFlow framework, which generates structured workflows from sketches with vision-language models and uses a diverse dataset for training and evaluation. Result: Finetuning significantly improves structured workflow generation and outperforms large vision-language models. Conclusion: StarFlow effectively addresses the challenge of generating structured workflows from visual inputs and offers a new direction for automated workflow construction. Abstract: Workflows are a fundamental component of automation in enterprise platforms, enabling the orchestration of tasks, data processing, and system integrations. Despite being widely used, building workflows can be complex, often requiring manual configuration through low-code platforms or visual programming tools. To simplify this process, we explore the use of generative foundation models, particularly vision-language models (VLMs), to automatically generate structured workflows from visual inputs. Translating hand-drawn sketches or computer-generated diagrams into executable workflows is challenging due to the ambiguity of free-form drawings, variations in diagram styles, and the difficulty of inferring execution logic from visual elements. To address this, we introduce StarFlow, a framework for generating structured workflow outputs from sketches using vision-language models. We curate a diverse dataset of workflow diagrams -- including synthetic, manually annotated, and real-world samples -- to enable robust training and evaluation. We finetune and benchmark multiple vision-language models, conducting a series of ablation studies to analyze the strengths and limitations of our approach. Our results show that finetuning significantly enhances structured workflow generation, outperforming large vision-language models on this task.

Hao Lin,Yongjun Zhang

Task: Systematically evaluate the promises and risks of large language models (LLMs) for text coding tasks in the social sciences.

Motivation: Explore how GenAI or LLMs could transform computational social science, particularly automated textual analysis.

Details

Method: Proposes a framework that helps social scientists incorporate LLMs into text annotation, with tools for developing the optimal prompt and for examining and reporting the validity and reliability of LLMs as a methodological tool. Result: Provides practical guidelines for using LLMs in text annotation and discusses epistemic risks related to validity, reliability, replicability, and transparency. Conclusion: Offers recommendations for applying LLMs to text annotation tasks and emphasizes how to better communicate epistemic risks in research. Abstract: Generative artificial intelligence (GenAI) or large language models (LLMs) have the potential to revolutionize computational social science, particularly in automated textual analysis. In this paper, we conduct a systematic evaluation of the promises and risks of using LLMs for diverse coding tasks, with social movement studies serving as a case example. We propose a framework for social scientists to incorporate LLMs into text annotation, either as the primary coding decision-maker or as a coding assistant. This framework provides tools for researchers to develop the optimal prompt, and to examine and report the validity and reliability of LLMs as a methodological tool. Additionally, we discuss the associated epistemic risks related to validity, reliability, replicability, and transparency. We conclude with several practical guidelines for using LLMs in text annotation tasks, and how we can better communicate the epistemic risks in research.

Exponentially Weighted Instance-Aware Repeat Factor Sampling for Long-Tailed Object Detection Model Training in Unmanned Aerial Vehicles Surveillance Scenarios

Taufiq Ahmed,Abhishek Kumar,Constantino Álvarez Casado,Anlan Zhang,Tuomo Hänninen,Lauri Loven,Miguel Bordallo López,Sasu Tarkoma

Task: Propose Exponentially Weighted Instance-Aware Repeat Factor Sampling (E-IRFS) to address class imbalance in object detection.

Motivation: Existing sampling methods based on linear adjustments (such as RFS and IRFS) have limited effect on long-tailed distributions, so a more effective strategy is needed to separate rare from frequent classes.

Details

Method: Applies an exponential function to the geometric mean of image and instance frequencies to adjust sampling probabilities, giving a more adaptive rebalancing strategy. Result: E-IRFS improves detection performance by 22% over the baseline across the evaluated datasets and outperforms RFS and IRFS, particularly for rare categories. Conclusion: E-IRFS improves rare object detection in resource-constrained environments and suits real-time applications such as UAV-based emergency monitoring. Abstract: Object detection models often struggle with class imbalance, where rare categories appear significantly less frequently than common ones. Existing sampling-based rebalancing strategies, such as Repeat Factor Sampling (RFS) and Instance-Aware Repeat Factor Sampling (IRFS), mitigate this issue by adjusting sample frequencies based on image and instance counts. However, these methods are based on linear adjustments, which limit their effectiveness in long-tailed distributions. This work introduces Exponentially Weighted Instance-Aware Repeat Factor Sampling (E-IRFS), an extension of IRFS that applies exponential scaling to better differentiate between rare and frequent classes. E-IRFS adjusts sampling probabilities using an exponential function applied to the geometric mean of image and instance frequencies, ensuring a more adaptive rebalancing strategy. We evaluate E-IRFS on a dataset derived from the Fireman-UAV-RGBT Dataset and four additional public datasets, using YOLOv11 object detection models to identify fire, smoke, people and lakes in emergency scenarios. The results show that E-IRFS improves detection performance by 22% over the baseline and outperforms RFS and IRFS, particularly for rare categories. The analysis also highlights that E-IRFS has a stronger effect on lightweight models with limited capacity, as these models rely more on data sampling strategies to address class imbalance. The findings demonstrate that E-IRFS improves rare object detection in resource-constrained environments, making it a suitable solution for real-time applications such as UAV-based emergency monitoring.
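To make the contrast between the three sampling rules concrete, here is a rough sketch. The RFS and IRFS factors follow their commonly cited forms (a square-root rule on a class frequency, with IRFS using the geometric mean of image and instance frequencies). The exponential variant is only an assumed illustration of "an exponential function applied to the geometric mean": its exact form, the `alpha` parameter, and all function names are hypothetical and are not taken from the paper.

```python
import math

def rfs_factor(f_img, t=0.001):
    """Repeat Factor Sampling: class factor from the fraction of images with the class."""
    return max(1.0, math.sqrt(t / f_img))

def irfs_factor(f_img, f_inst, t=0.001):
    """Instance-aware variant: uses the geometric mean of image and instance frequency."""
    g = math.sqrt(f_img * f_inst)
    return max(1.0, math.sqrt(t / g))

def exp_irfs_factor(f_img, f_inst, t=0.001, alpha=1.0):
    """ASSUMED exponential variant, for illustration only (not the paper's formula)."""
    g = math.sqrt(f_img * f_inst)
    return max(1.0, math.exp(alpha * (math.sqrt(t / g) - 1.0)))

def image_repeat_factor(class_factors):
    """An image is repeated according to its rarest class, i.e. the largest factor."""
    return max(class_factors)
```

In this assumed form, frequent classes (geometric-mean frequency at or above the threshold `t`) keep a factor of 1, while the factor for rare classes grows faster than under the square-root rules, which is the qualitative behavior the abstract attributes to exponential scaling.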

ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models

Chung-En Sun,Ge Yan,Tsui-Wei Weng

Task: Study why large language models (LLMs) sometimes produce overly short chain-of-thought (CoT) reasoning, and propose a fix.

Motivation: LLMs occasionally generate overly short reasoning in CoT, which degrades performance even on simple math problems; the mechanism needs to be understood and mitigated.

Details

Method: Identifies a linear direction in the hidden representations that governs reasoning length, then proposes ThinkEdit, which edits the weights of a small number of attention heads to suppress overly short reasoning. Result: ThinkEdit substantially reduces overly short reasoning, raising accuracy on short-reasoning outputs by 5.44% and overall accuracy across multiple math benchmarks by 2.43%. Conclusion: The work reveals how reasoning length is controlled inside LLMs and shows the potential of fine-grained model interventions to improve reasoning quality. Abstract: Recent studies have shown that Large Language Models (LLMs) augmented with chain-of-thought (CoT) reasoning demonstrate impressive problem-solving abilities. However, in this work, we identify a recurring issue where these models occasionally generate overly short reasoning, leading to degraded performance on even simple mathematical problems. Specifically, we investigate how reasoning length is embedded in the hidden representations of reasoning models and its impact on accuracy. Our analysis reveals that reasoning length is governed by a linear direction in the representation space, allowing us to induce overly short reasoning by steering the model along this direction. Building on this insight, we introduce ThinkEdit, a simple yet effective weight-editing approach to mitigate the issue of overly short reasoning. We first identify a small subset of attention heads (approximately 2%) that predominantly drive short reasoning behavior. We then edit the output projection weights of these heads to suppress the short reasoning direction. With changes to only 0.1% of the model's parameters, ThinkEdit effectively reduces overly short reasoning and yields notable accuracy gains for short reasoning outputs (+5.44%), along with an overall improvement across multiple math benchmarks (+2.43%). Our findings provide new mechanistic insights into how reasoning length is controlled within LLMs and highlight the potential of fine-grained model interventions to improve reasoning quality. Our code is available at https://github.com/Trustworthy-ML-Lab/ThinkEdit
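One standard way to suppress a direction in the output of a linear map is to project that direction out of the weight matrix. The sketch below illustrates this with plain Python lists. Whether ThinkEdit uses exactly this projection, a scaled version, or something else is an assumption here, and `suppress_direction` is a hypothetical name, not a function from the linked repository.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def suppress_direction(W, d):
    """Return (I - d d^T) W for the unit vector along d.

    W has one row per residual-stream dimension and one column per
    head dimension, so W @ x is the head's contribution to the
    residual stream. After the edit, that contribution has no
    component along d for any input x.
    """
    d = normalize(d)
    n_rows, n_cols = len(W), len(W[0])
    # proj[j] is the component of column j of W along d.
    proj = [sum(d[i] * W[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[W[i][j] - d[i] * proj[j] for j in range(n_cols)] for i in range(n_rows)]

def matvec(W, x):
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]
```

Applied only to the output projections of the few heads identified as driving short reasoning, an edit of this kind touches a small fraction of the parameters, which matches the scale of change the abstract reports.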

AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis

Zhiwei Yang,Chen Gao,Jing Liu,Peng Wu,Guansong Pang,Mike Zheng Shou

Task: Develop AssistPDA, an online video anomaly surveillance assistant that unifies video anomaly prediction, detection, and analysis (VAPDA) in a single framework.

Motivation: Existing large language model (LLM) based video anomaly detection methods focus on video-level anomaly question answering or offline detection and neglect the real-time operation that practical applications need.

Details

Method: Proposes the Spatio-Temporal Relation Distillation (STRD) module, which transfers the long-term spatiotemporal modeling capabilities of vision-language models (VLMs) from offline settings to real-time scenarios, and builds the VAPDA-127K benchmark. Result: AssistPDA outperforms existing offline VLM-based approaches on real-time VAPDA, setting a new state of the art. Conclusion: AssistPDA offers a practical route to deploying LLM-based video anomaly detection, and the dataset and code will be open-sourced to support community research. Abstract: The rapid advancements in large language models (LLMs) have spurred growing interest in LLM-based video anomaly detection (VAD). However, existing approaches predominantly focus on video-level anomaly question answering or offline detection, ignoring the real-time nature essential for practical VAD applications. To bridge this gap and facilitate the practical deployment of LLM-based VAD, we introduce AssistPDA, the first online video anomaly surveillance assistant that unifies video anomaly prediction, detection, and analysis (VAPDA) within a single framework. AssistPDA enables real-time inference on streaming videos while supporting interactive user engagement. Notably, we introduce a novel event-level anomaly prediction task, enabling proactive anomaly forecasting before anomalies fully unfold. To enhance the ability to model intricate spatiotemporal relationships in anomaly events, we propose a Spatio-Temporal Relation Distillation (STRD) module. STRD transfers the long-term spatiotemporal modeling capabilities of vision-language models (VLMs) from offline settings to real-time scenarios. Thus it equips AssistPDA with a robust understanding of complex temporal dependencies and long-sequence memory. Additionally, we construct VAPDA-127K, the first large-scale benchmark designed for VLM-based online VAPDA. Extensive experiments demonstrate that AssistPDA outperforms existing offline VLM-based approaches, setting a new state-of-the-art for real-time VAPDA. Our dataset and code will be open-sourced to facilitate further research in the community.

Non-Monotonic Attention-based Read/Write Policy Learning for Simultaneous Translation

Zeeshan Ahmed,Frank Seide,Zhe Liu,Rastislav Rabatin,Jachym Kolar,Niko Moritz,Ruiming Xie,Simone Merello,Christian Fuegen

Task: Propose a method that converts a pretrained non-streaming machine translation model into a streaming one while balancing quality and latency.

Motivation: Streaming machine translation must process the input stream in real time while generating the translation, and faces a quality/latency trade-off; the goal is to approach the quality of non-streaming models with minimal latency.

Details

Method: Uses the alignment between source and target tokens to learn a read/write decision boundary, converting the pretrained non-streaming model into a streaming model. During training, a read/write policy module is trained with supervised learning on the alignment points (pseudo labels). Result: Experiments show that the method outperforms several strong baselines and narrows the gap with the non-streaming baseline model. Conclusion: The proposed method effectively manages the quality/latency trade-off and achieves high-quality streaming translation. Abstract: Simultaneous or streaming machine translation generates translation while reading the input stream. These systems face a quality/latency trade-off, aiming to achieve high translation quality similar to non-streaming models with minimal latency. We propose an approach that efficiently manages this trade-off. By enhancing a pretrained non-streaming model, which was trained with a seq2seq mechanism and represents the upper bound in quality, we convert it into a streaming model by utilizing the alignment between source and target tokens. This alignment is used to learn a read/write decision boundary for reliable translation generation with minimal input. During training, the model learns the decision boundary through a read/write policy module, employing supervised learning on the alignment points (pseudo labels). The read/write policy module, a small binary classification unit, can control the quality/latency trade-off during inference. Experimental results show that our model outperforms several strong baselines and narrows the gap with the non-streaming baseline model.
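How alignment points might be turned into read/write pseudo labels can be sketched as follows. This is an assumed construction, not the paper's: it supposes that `alignment[j]` gives the 0-based index of the last source token that target token `j` depends on, and `read_write_actions` is a hypothetical helper name.

```python
def read_write_actions(alignment):
    """Turn per-target alignment points into a READ/WRITE action sequence.

    Before writing target token j, the policy must have read source
    tokens 0..alignment[j]. Links that point backwards (non-monotonic
    alignments) need no extra reads, since tokens are never un-read.
    """
    actions = []
    read = 0    # number of source tokens consumed so far
    needed = 0  # number of source tokens required so far
    for a in alignment:
        needed = max(needed, a + 1)
        while read < needed:
            actions.append("R")
            read += 1
        actions.append("W")
    return actions
```

For the alignment `[0, 2, 1, 3]` this produces `R W R R W W R W`. A binary classifier trained on such labels would decide at each step whether enough input has arrived to emit the next target token, and shifting its decision threshold would trade latency against quality.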

Oliver Heinimann,Assaf Shocher,Tal Zimbalist,Michal Irani

Task: Propose KernelFusion, a zero-shot diffusion-based method that recovers the image-specific SR-kernel and the corresponding high-resolution (HR) image from a low-resolution (LR) image.

Motivation: Traditional super-resolution methods assume an ideal downscaling kernel (such as bicubic downscaling), and existing blind-SR methods are still restricted to simple kernels (such as anisotropic Gaussian kernels) and cannot handle complex downscaling degradations. Using the correct SR-kernel often matters more than the SR algorithm itself.

Details

Method: Trains an image-specific patch-based diffusion model on the single LR image to capture its internal patch statistics, then recovers the HR image and the correct downscaling SR-kernel simultaneously. Result: KernelFusion substantially outperforms existing blind-SR methods on complex downscaling degradations. Conclusion: By dropping predefined kernel assumptions, KernelFusion moves blind-SR toward an assumption-free paradigm and handles downscaling kernels previously thought impossible. Abstract: Traditional super-resolution (SR) methods assume an "ideal" downscaling SR-kernel (e.g., bicubic downscaling) between the high-resolution (HR) image and the low-resolution (LR) image. Such methods fail once the LR images are generated differently. Current blind-SR methods aim to remove this assumption, but are still fundamentally restricted to rather simplistic downscaling SR-kernels (e.g., anisotropic Gaussian kernels), and fail on more complex (out of distribution) downscaling degradations. However, using the correct SR-kernel is often more important than using a sophisticated SR algorithm. In "KernelFusion" we introduce a zero-shot diffusion-based method that makes no assumptions about the kernel. Our method recovers the unique image-specific SR-kernel directly from the LR input image, while simultaneously recovering its corresponding HR image. KernelFusion exploits the principle that the correct SR-kernel is the one that maximizes patch similarity across different scales of the LR image. We first train an image-specific patch-based diffusion model on the single LR input image, capturing its unique internal patch statistics. We then reconstruct a larger HR image with the same learned patch distribution, while simultaneously recovering the correct downscaling SR-kernel that maintains this cross-scale relation between the HR and LR images. Empirical results show that KernelFusion vastly outperforms all SR baselines on complex downscaling degradations, where existing SotA Blind-SR methods fail miserably. By breaking free from predefined kernel assumptions, KernelFusion pushes Blind-SR into a new assumption-free paradigm, handling downscaling kernels previously thought impossible.

Penrose Tiled Low-Rank Compression and Section-Wise Q&A Fine-Tuning: A General Framework for Domain-Specific Large Language Model Adaptation

Chuan-Wei Kuo,Siyu Chen,Chenqi Yan,Yu Yang Fredrik Liu

Task: Propose a two-stage framework that combines structured model compression with a scientific fine-tuning regimen to adapt large language models (LLMs) efficiently and accurately to specialized domains such as materials science.

Motivation: Limited data and high knowledge density make it hard to adapt LLMs efficiently and accurately to specialized scientific domains.

Details

Method: The first stage compresses the model's weight matrices through local low-rank decomposition arranged in a Penrose-like non-periodic tiling pattern; the second stage uses section-wise Q&A fine-tuning to inject domain knowledge gradually. Result: The approach is designed to enable precise specialization of LLMs to high-value domains under data-scarce conditions; no empirical results are reported yet. Conclusion: The two-stage approach is a potential route to integrating materials science knowledge and lays the groundwork for comprehensive empirical evaluation in future work. Abstract: Large language models (LLMs) hold great promise for specialized scientific domains such as materials science, yet adapting them efficiently and accurately to domain-specific knowledge remains challenging due to limited data and high knowledge density. We propose a two-stage framework that combines structured model compression with a scientific fine-tuning regimen to address this challenge. In the compression stage, we decompose the LLM's weight matrices into local low-rank "rank blocks" and arrange these blocks in a Penrose-like non-periodic tiling pattern. Each block is then compacted via spectral transformations (e.g., discrete cosine or Fourier transforms), and a Kullback-Leibler (KL) divergence-based alignment loss preserves the distributional similarity between the compressed model's representations and those of the original full model. In the adaptation stage, the compressed model is further tuned using a human-like scientific reading protocol: it processes technical materials science documents section by section, engaging in a structured question-and-answer routine for each section. This section-wise Q&A fine-tuning strategy extracts explicit reasoning traces and gradually injects domain knowledge, while minimizing catastrophic forgetting of the model's general language capabilities. By balancing efficient compression with targeted adaptation, our two-stage approach enables precise specialization of LLMs to high-value domains under data-scarce conditions. We present this principled yet exploratory pipeline and outline its potential for advancing materials science knowledge integration, laying the groundwork for comprehensive empirical evaluation in future work.

Multimodal Data Integration for Sustainable Indoor Gardening: Tracking Anyplant with Time Series Foundation Model

Seyed Hamidreza Nabaei,Zeyang Zheng,Dong Chen,Arsalan Heydarian

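Task: Propose an automated framework for monitoring plant health and growth that combines computer vision, machine learning, and environmental sensing.

Motivation: Integrating multimodal data (RGB imagery, plant phenotyping data, and environmental factors) should improve the accuracy of plant water-stress prediction and thereby support indoor gardening in sustainable buildings.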

Details

Method: High-resolution cameras extract phenotypic features (RGB, plant area, height, and width), and the Lag-Llama time series model analyzes and predicts water stress. Result: Integrating RGB, size ratios, and environmental data significantly improves predictive accuracy; the fine-tuned model achieves the lowest errors (MSE = 0.420777, MAE = 0.595428). Conclusion: Multimodal data and intelligent systems show potential for automating plant care and optimizing resource consumption, helping to advance sustainable building management practices. Abstract: Indoor gardening within sustainable buildings offers a transformative solution to urban food security and environmental sustainability. Urban farming, including Controlled Environment Agriculture (CEA) and vertical farming, is expected to grow at a compound annual growth rate (CAGR) of 13.2% from 2024 to 2030, according to market reports. This growth is fueled by advancements in Internet of Things (IoT) technologies, sustainable innovations such as smart growing systems, and the rising interest in green interior design. This paper presents a novel framework that integrates computer vision, machine learning (ML), and environmental sensing for the automated monitoring of plant health and growth. Unlike previous approaches, this framework combines RGB imagery, plant phenotyping data, and environmental factors such as temperature and humidity to predict plant water stress in a controlled growth environment. The system utilizes high-resolution cameras to extract phenotypic features, such as RGB, plant area, height, and width, while employing the Lag-Llama time series model to analyze and predict water stress. Experimental results demonstrate that integrating RGB, size ratios, and environmental data significantly enhances predictive accuracy, with the fine-tuned model achieving the lowest errors (MSE = 0.420777, MAE = 0.595428) and reduced uncertainty. These findings highlight the potential of multimodal data and intelligent systems to automate plant care, optimize resource consumption, and align indoor gardening with sustainable building management practices, paving the way for resilient, green urban spaces.

Leveraging LLMs for Predicting Unknown Diagnoses from Clinical Notes

Dina Albassam,Adam Cross,Chengxiang Zhai

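Task: Investigate whether large language models (LLMs) can predict implicitly mentioned diagnoses from clinical notes and link them to the corresponding medications.

Motivation: Electronic Health Records (EHRs) often lack explicit links between medications and diagnoses, which makes clinical decision-making and research more difficult.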

Details

Method: Using GPT-3.5 Turbo, 18 prompting configurations generate 8568 test cases, on which a majority-voting strategy is evaluated under different hyperparameters. Result: Majority voting reaches 75% accuracy, outperforming the best single configuration at 66%. Conclusion: Majority voting across diverse LLM configurations improves diagnosis prediction in EHRs and offers an effective way to link medications and diagnoses in clinical text. Abstract: Electronic Health Records (EHRs) often lack explicit links between medications and diagnoses, making clinical decision-making and research more difficult. Even when links exist, diagnosis lists may be incomplete, especially during early patient visits. Discharge summaries tend to provide more complete information, which can help infer accurate diagnoses, especially with the help of large language models (LLMs). This study investigates whether LLMs can predict implicitly mentioned diagnoses from clinical notes and link them to corresponding medications. We address two research questions: (1) Does majority voting across diverse LLM configurations outperform the best single configuration in diagnosis prediction? (2) How sensitive is majority voting accuracy to LLM hyperparameters such as temperature, top-p, and summary length? To evaluate, we created a new dataset of 240 expert-annotated medication-diagnosis pairs from 20 MIMIC-IV notes. Using GPT-3.5 Turbo, we ran 18 prompting configurations across short and long summary lengths, generating 8568 test cases. Results show that majority voting achieved 75 percent accuracy, outperforming the best single configuration at 66 percent. No single hyperparameter setting dominated, but combining deterministic, balanced, and exploratory strategies improved performance. Shorter summaries generally led to higher accuracy. In conclusion, ensemble-style majority voting with diverse LLM configurations improves diagnosis prediction in EHRs and offers a promising method to link medications and diagnoses in clinical texts.
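
Majority voting over the outputs of several configurations can be sketched as follows. This is an illustration only: the normalization step (trimming and lowercasing) and the tie-breaking rule (first answer seen wins) are assumptions, not details taken from the paper.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent predicted diagnosis, or None if there is none.

    Predictions are trimmed and lowercased before counting (an assumed
    normalization). Ties go to the answer that appeared first.
    """
    normalized = [p.strip().lower() for p in predictions if p and p.strip()]
    if not normalized:
        return None
    return Counter(normalized).most_common(1)[0][0]

if __name__ == "__main__":
    # Hypothetical outputs from three prompting configurations for one medication.
    print(majority_vote(["Hypertension", "hypertension ", "diabetes"]))
```

In practice, free-text diagnoses would likely need stronger normalization (synonyms, abbreviations) before votes can be counted reliably.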

Flexible Moment-Invariant Bases from Irreducible Tensors

Roxana Bujack,Emily Shinkle,Alice Allen,Tomas Suk,Nicholas Lubbers

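Task: Propose a method that combines spherical harmonics and Cartesian tensor algebra to overcome the vulnerability of existing moment-invariant generation methods to degenerate input in the form of spherical functions.

Motivation: Existing methods for generating moment invariants are vulnerable when the input degenerates to a spherical function, a case that is common in real-world applications, so an improvement is needed.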

Details

Method: Combine the spherical-harmonics approach with the Cartesian tensor algebra approach. Result: A new method that overcomes the vulnerability to spherical-function degeneracy. Conclusion: Combining the two approaches resolves the weakness of existing moment-invariant generation methods under spherical-function degeneracy. Abstract: Moment invariants are a powerful tool for the generation of rotation-invariant descriptors needed for many applications in pattern detection, classification, and machine learning. A set of invariants is optimal if it is complete, independent, and robust against degeneracy in the input. In this paper, we show that the current state of the art for the generation of these bases of moment invariants, despite being robust against moment tensors being identically zero, is vulnerable to a degeneracy that is common in real-world applications, namely spherical functions. We show how to overcome this vulnerability by combining two popular moment invariant approaches: one based on spherical harmonics and one based on Cartesian tensor algebra.

Beyond Single-Sentence Prompts: Upgrading Value Alignment Benchmarks with Dialogues and Stories

Yazhou Zhang,Qimeng Liu,Qiuchi Li,Peng Zhang,Jing Qin

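Task: Propose an upgraded value alignment benchmark built on multi-turn dialogues and narrative scenarios to evaluate the value alignment of large language models (LLMs) more effectively.

Motivation: Traditional single-sentence adversarial prompts have limited effect on modern LLMs and fail to reveal the models' underlying biases and ethical stances.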

Details

Method: Design and implement a dataset containing conversational traps and ethically ambiguous narratives, and systematically assess LLM responses in multi-turn dialogue and narrative scenarios. Result: Experiments show that the method effectively exposes latent biases that go undetected in traditional single-shot evaluations. Conclusion: Contextual and dynamic testing is essential for evaluating value alignment in LLMs and opens a path to more sophisticated and realistic assessments of AI ethics and safety. Abstract: Evaluating the value alignment of large language models (LLMs) has traditionally relied on single-sentence adversarial prompts, which directly probe models with ethically sensitive or controversial questions. However, with the rapid advancements in AI safety techniques, models have become increasingly adept at circumventing these straightforward tests, limiting their effectiveness in revealing underlying biases and ethical stances. To address this limitation, we propose an upgraded value alignment benchmark that moves beyond single-sentence prompts by incorporating multi-turn dialogues and narrative-based scenarios. This approach enhances the stealth and adversarial nature of the evaluation, making it more robust against superficial safeguards implemented in modern LLMs. We design and implement a dataset that includes conversational traps and ethically ambiguous storytelling, systematically assessing LLMs' responses in more nuanced and context-rich settings. Experimental results demonstrate that this enhanced methodology can effectively expose latent biases that remain undetected in traditional single-shot evaluations. Our findings highlight the necessity of contextual and dynamic testing for value alignment in LLMs, paving the way for more sophisticated and realistic assessments of AI ethics and safety.

Parametric Shadow Control for Portrait Generation in Text-to-Image Diffusion Models

Haoming Cai,Tsung-Wei Huang,Shiv Gehlot,Brandon Y. Feng,Sachin Shah,Guan-Ming Su,Christopher Metzler

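Task: Propose Shadow Director, a method that extracts and manipulates hidden shadow attributes within text-to-image diffusion models.

Motivation: Existing methods lack intuitive shadow control and depend on expensive data collection or extensive computational resources.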

Details

Method: A small estimation network that needs only a few thousand synthetic images and hours of training, with no real-world light-stage data. Result: Parametric control over shadow shape, placement, and intensity, while preserving artistic integrity and identity across diverse styles. Conclusion: Shadow Director is a more accessible and resource-friendly solution for portrait generation across diverse styles. Abstract: Text-to-image diffusion models excel at generating diverse portraits, but lack intuitive shadow control. Existing editing approaches, as post-processing, struggle to offer effective manipulation across diverse styles. Additionally, these methods either rely on expensive real-world light-stage data collection or require extensive computational resources for training. To address these limitations, we introduce Shadow Director, a method that extracts and manipulates hidden shadow attributes within well-trained diffusion models. Our approach uses a small estimation network that requires only a few thousand synthetic images and hours of training, with no costly real-world light-stage data needed. Shadow Director enables parametric and intuitive control over shadow shape, placement, and intensity during portrait generation while preserving artistic integrity and identity across diverse styles. Despite training only on synthetic data built on real-world identities, it generalizes effectively to generated portraits with diverse styles, making it a more accessible and resource-friendly solution.

FRASE: Structured Representations for Generalizable SPARQL Query Generation

Papa Abdou Karim Karou Diallo,Amal Zouaq

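Task: Translate natural language questions into SPARQL queries to support knowledge base querying.

Motivation: Existing datasets are predominantly template-based, so models learn superficial mappings between question and query templates rather than genuine generalization.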

Details

Method: Propose FRASE (FRAme-based Semantic Enhancement), which uses Frame Semantic Role Labeling (FSRL) to address this limitation, and introduce the LC-QuAD 3.0 dataset. Result: Experiments show that frame-based structured representations consistently improve SPARQL generation, particularly on unseen templates and naturally phrased questions. Conclusion: Frame-based semantic enhancement improves model performance in challenging generalization scenarios. Abstract: Translating natural language questions into SPARQL queries enables Knowledge Base querying for factual and up-to-date responses. However, existing datasets for this task are predominantly template-based, leading models to learn superficial mappings between question and query templates rather than developing true generalization capabilities. As a result, models struggle when encountering naturally phrased, template-free questions. This paper introduces FRASE (FRAme-based Semantic Enhancement), a novel approach that leverages Frame Semantic Role Labeling (FSRL) to address this limitation. We also present LC-QuAD 3.0, a new dataset derived from LC-QuAD 2.0, in which each question is enriched using FRASE through frame detection and the mapping of frame elements to their arguments. We evaluate the impact of this approach through extensive experiments on recent large language models (LLMs) under different fine-tuning configurations. Our results demonstrate that integrating frame-based structured representations consistently improves SPARQL generation performance, particularly in challenging generalization scenarios when test questions feature unseen templates (unknown template splits) and when they are all naturally phrased (reformulated questions).

Enhancing Pavement Crack Classification with Bidirectional Cascaded Neural Networks

Taqwa I. Alhadidi,Asmaa Alazmi,Shadi Jaradat,Ahmed Jaber,Huthaifa Ashqar,Mohammed Elhenawy

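Task: Classify augmented pavement crack images with Bidirectional Cascaded Neural Networks (BCNNs).

Motivation: Pavement distress (such as linear cracks, potholes, and fatigue cracks) significantly affects road safety and maintenance, so efficient and accurate classification is needed.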

Details

Method: U-Net 50 is used for image augmentation, and a BCNN model exploits forward and backward information flows to improve classification accuracy. Result: Overall accuracy is 87%, with strong precision, recall, and F1-scores in every category; F1-scores are 0.85 for fatigue cracks, 0.85 for linear cracks, and 0.93 for potholes. Conclusion: BCNNs perform well on pavement crack classification and can improve the efficiency and reliability of pavement maintenance and management. Abstract: Pavement distress, such as cracks and potholes, is a significant issue affecting road safety and maintenance. In this study, we present the implementation and evaluation of Bidirectional Cascaded Neural Networks (BCNNs) for the classification of pavement crack images following image augmentation. We classified pavement cracks into three main categories: linear cracks, potholes, and fatigue cracks on an enhanced dataset utilizing U-Net 50 for image augmentation. The augmented dataset comprised 599 images. Our proposed BCNN model was designed to leverage both forward and backward information flows, with detection accuracy enhanced by its cascaded structure wherein each layer progressively refines the output of the preceding one. Our model achieved an overall accuracy of 87%, with precision, recall, and F1-score measures indicating high effectiveness across the categories. For fatigue cracks, the model recorded a precision of 0.87, recall of 0.83, and F1-score of 0.85 on 205 images. Linear cracks were detected with a precision of 0.81, recall of 0.89, and F1-score of 0.85 on 205 images, and potholes with a precision of 0.96, recall of 0.90, and F1-score of 0.93 on 189 images. The macro and weighted average of precision, recall, and F1-score were identical at 0.88, confirming the BCNN's excellent performance in classifying complex pavement crack patterns. This research demonstrates the potential of BCNNs to significantly enhance the accuracy and reliability of pavement distress classification, resulting in more effective and efficient pavement maintenance and management systems.
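
The per-class F1-scores quoted in the abstract follow from the reported precision and recall via F1 = 2PR / (P + R). The short check below recomputes them from the abstract's figures; the dictionary layout and names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, number of images) as reported in the abstract.
REPORTED = {
    "fatigue cracks": (0.87, 0.83, 205),
    "linear cracks": (0.81, 0.89, 205),
    "potholes": (0.96, 0.90, 189),
}

if __name__ == "__main__":
    for name, (p, r, n) in REPORTED.items():
        print(f"{name}: F1 = {f1_score(p, r):.2f} on {n} images")
    # The per-class image counts add up to the stated dataset size of 599.
    print(sum(n for _, _, n in REPORTED.values()))
```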

EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices

Jiyu Chen,Shuang Peng,Daxiong Luo,Fan Yang,Renshou Wu,Fangyuan Li,Xiaoxin Chen

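Task: Propose EdgeInfinite, a memory-efficient solution for the challenges Transformer-based LLMs face when processing long sequences on edge devices.

Motivation: On edge devices, Transformer-based LLMs face the quadratic complexity of attention and growing KV-cache memory demands when processing long sequences, and existing optimizations struggle with irreversible token eviction in long-output tasks.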

Details

Method: Integrate compressed memory into Transformer-based LLMs through a trainable memory-gating module, keeping full compatibility with standard Transformer architectures and fine-tuning only a small fraction of the parameters. Result: EdgeInfinite performs comparably to the baseline Transformer-based LLM on long-context benchmarks while optimizing memory consumption and time to first token. Conclusion: EdgeInfinite is an efficient, highly compatible solution for long-sequence processing on edge devices. Abstract: Transformer-based large language models (LLMs) encounter challenges in processing long sequences on edge devices due to the quadratic complexity of attention mechanisms and growing memory demands from the Key-Value (KV) cache. Existing KV cache optimizations struggle with irreversible token eviction in long-output tasks, while alternative sequence modeling architectures prove costly to adopt within established Transformer infrastructure. We present EdgeInfinite, a memory-efficient solution for infinite contexts that integrates compressed memory into Transformer-based LLMs through a trainable memory-gating module. This approach maintains full compatibility with standard Transformer architectures, requiring fine-tuning of only a small fraction of the parameters, and enables selective activation of the memory-gating module for long and short context task routing. Experimental results show that EdgeInfinite achieves comparable performance to the baseline Transformer-based LLM on long context benchmarks while optimizing memory consumption and time to first token.

NeRF-based Point Cloud Reconstruction using a Stationary Camera for Agricultural Applications

Kibon Ku,Talukder Z Jubery,Elijah Rodriguez,Aditya Balu,Soumik Sarkar,Adarsh Krishnamurthy,Baskar Ganapathysubramanian

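Task: Propose a NeRF-based point cloud reconstruction framework designed for indoor high-throughput plant phenotyping facilities.

Motivation: Traditional NeRF methods require the camera to move around a stationary object, which is impractical in high-throughput settings where objects move rapidly on conveyors or rotating pedestals.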

Details

Method: A NeRF variant uses a single stationary camera to image a rotating object, combining COLMAP pose estimation, a pose transformation that simulates camera movement, and standard NeRF training. Result: Reconstruction fidelity is excellent, with F-scores close to 100.00, validating the method's feasibility for high-throughput phenotyping. Conclusion: High-quality NeRF reconstruction is achievable with a stationary camera, without complex camera motion or costly equipment; future work will optimize pose estimation for integration into automated pipelines. Abstract: This paper presents a NeRF-based framework for point cloud (PCD) reconstruction, specifically designed for indoor high-throughput plant phenotyping facilities. Traditional NeRF-based reconstruction methods require cameras to move around stationary objects, but this approach is impractical for high-throughput environments where objects are rapidly imaged while moving on conveyors or rotating pedestals. To address this limitation, we develop a variant of NeRF-based PCD reconstruction that uses a single stationary camera to capture images as the object rotates on a pedestal. Our workflow comprises COLMAP-based pose estimation, a straightforward pose transformation to simulate camera movement, and subsequent standard NeRF training. A defined Region of Interest (ROI) excludes irrelevant scene data, enabling the generation of high-resolution point clouds (10M points). Experimental results demonstrate excellent reconstruction fidelity, with precision-recall analyses yielding an F-score close to 100.00 across all evaluated plant objects. Although pose estimation remains computationally intensive with a stationary camera setup, overall training and reconstruction times are competitive, validating the method's feasibility for practical high-throughput indoor phenotyping applications. Our findings indicate that high-quality NeRF-based 3D reconstructions are achievable using a stationary camera, eliminating the need for complex camera motion or costly imaging equipment. This approach is especially beneficial when employing expensive and delicate instruments, such as hyperspectral cameras, for 3D plant phenotyping. Future work will focus on optimizing pose estimation techniques and further streamlining the methodology to facilitate seamless integration into automated, high-throughput 3D phenotyping pipelines.
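
The geometric idea behind simulating camera movement can be sketched in a few lines: a fixed camera viewing an object that has turned by an angle θ sees the same thing as a camera that has orbited the stationary object by -θ. The sketch assumes the pedestal axis is the z-axis through the origin and handles camera position only; it is not the paper's COLMAP-based transformation, and the function name is hypothetical.

```python
import math

def equivalent_camera_position(camera_xyz, object_rotation_deg):
    """Position of a virtual orbiting camera in the object's own frame.

    The object is assumed to rotate counter-clockwise about the z-axis by
    object_rotation_deg, so the camera is rotated by the opposite angle.
    """
    theta = math.radians(-object_rotation_deg)
    x, y, z = camera_xyz
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (x * cos_t - y * sin_t, x * sin_t + y * cos_t, z)

if __name__ == "__main__":
    # One virtual viewpoint per 90-degree pedestal step.
    for step in range(4):
        print(equivalent_camera_position((1.0, 0.0, 0.5), 90 * step))
```

A full pose would also rotate the camera's orientation by the same angle so that it keeps facing the object.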

CFiCS: Graph-Based Classification of Common Factors and Microcounseling Skills

Fabian Schmidt,Karin Hammerfald,Henrik Haaland Jahren,Vladimir Vlassov

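Task: Develop CFiCS, a hierarchical classification framework that automatically identifies common factors and microcounseling skills in psychotherapy dialogue text.

Motivation: Common factors and microcounseling skills are critical to the effectiveness of psychotherapy, but identifying these change principles automatically from text is challenging because therapeutic dialogue is nuanced and context-dependent.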

Details

Method: Combine graph machine learning with pretrained contextual embeddings (ClinicalBERT): a heterogeneous graph represents common factors, intervention concepts, and microcounseling skills, and graph neural networks learn inductive node embeddings. Result: CFiCS substantially outperforms baselines (random forests, BERT-based multi-task models, and graph-based methods), especially in fine-grained skill prediction. Conclusion: By integrating ClinicalBERT node features with graph structure, CFiCS improves the automatic identification of common factors and microcounseling skills. Abstract: Common factors and microcounseling skills are critical to the effectiveness of psychotherapy. Understanding and measuring these elements provides valuable insights into therapeutic processes and outcomes. However, automatic identification of these change principles from textual data remains challenging due to the nuanced and context-dependent nature of therapeutic dialogue. This paper introduces CFiCS, a hierarchical classification framework integrating graph machine learning with pretrained contextual embeddings. We represent common factors, intervention concepts, and microcounseling skills as a heterogeneous graph, where textual information from ClinicalBERT enriches each node. This structure captures both the hierarchical relationships (e.g., skill-level nodes linking to broad factors) and the semantic properties of therapeutic concepts. By leveraging graph neural networks, CFiCS learns inductive node embeddings that generalize to unseen text samples lacking explicit connections. Our results demonstrate that integrating ClinicalBERT node features and graph structure significantly improves classification performance, especially in fine-grained skill prediction. CFiCS achieves substantial gains in both micro and macro F1 scores across all tasks compared to baselines, including random forests, BERT-based multi-task models, and graph-based methods.

Q-MambaIR: Accurate Quantized Mamba for Efficient Image Restoration

Yujie Chen,Haotong Qin,Zhang Zhang,Michelo Magno,Luca Benini,Yawei Li

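Task: Propose Q-MambaIR, an accurate and efficient quantization method that addresses the performance drop of State-Space Models (SSMs) at ultra-low bit-widths (2-4 bits).

Motivation: SSMs perform well in image restoration, but deploying them on edge devices is constrained by memory, computing capacity, and power consumption, which calls for efficient compression strategies.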

Details

Method: Introduce a Statistical Dynamic-balancing Learnable Scalar (DLS) that dynamically adjusts the quantization mapping range and a Range-floating Flexible Allocator (RFA) with an adaptive threshold, reducing quantization error while preserving feature extraction capability. Result: Q-MambaIR outperforms existing quantized SSMs on image restoration tasks, achieving higher accuracy with negligible additional computation and storage overhead. Conclusion: Q-MambaIR is an efficient and flexible quantization method that markedly improves SSM performance at ultra-low bit-widths and is suitable for edge deployment. Abstract: State-Space Models (SSMs) have attracted considerable attention in Image Restoration (IR) due to their ability to scale linearly with sequence length while effectively capturing long-distance dependencies. However, deploying SSMs to edge devices is challenging due to the constraints in memory, computing capacity, and power consumption, underscoring the need for efficient compression strategies. While low-bit quantization is an efficient model compression strategy for reducing size and accelerating IR tasks, SSMs suffer substantial performance drops at ultra-low bit-widths (2-4 bits), primarily due to outliers that exacerbate quantization error. To address this challenge, we propose Q-MambaIR, an accurate, efficient, and flexible Quantized Mamba for IR tasks. Specifically, we introduce a Statistical Dynamic-balancing Learnable Scalar (DLS) to dynamically adjust the quantization mapping range, thereby mitigating the peak truncation loss caused by extreme values. Furthermore, we design a Range-floating Flexible Allocator (RFA) with an adaptive threshold to flexibly round values. This approach preserves high-frequency details and maintains the SSM's feature extraction capability. Notably, RFA also enables pre-deployment weight quantization, striking a balance between computational efficiency and model accuracy. Extensive experiments on IR tasks demonstrate that Q-MambaIR consistently outperforms existing quantized SSMs, achieving much higher state-of-the-art (SOTA) accuracy results with only a negligible increase in training computation and storage saving.
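
Why the mapping range matters at low bit-widths can be shown with a plain symmetric uniform quantizer. The sketch below is not DLS or RFA: the clipping bound `alpha` is chosen by hand rather than learned, and all names are illustrative. It only demonstrates the trade-off the abstract describes, where a range wide enough for an outlier costs resolution on ordinary values, while a narrow range truncates the peak.

```python
def quantize_dequantize(x, alpha, bits):
    """Clip x to [-alpha, alpha], round to a signed integer grid, and map back."""
    qmax = 2 ** (bits - 1) - 1           # largest integer level, e.g. 7 for 4 bits
    clipped = max(-alpha, min(alpha, x))
    q = round(clipped * qmax / alpha)     # integer code in [-qmax, qmax]
    return q * alpha / qmax

def mean_abs_error(values, alpha, bits):
    """Average absolute reconstruction error over a list of values."""
    return sum(abs(v - quantize_dequantize(v, alpha, bits)) for v in values) / len(values)

if __name__ == "__main__":
    bulk = [0.1, -0.2, 0.3, -0.1]   # ordinary activations (made-up values)
    outlier = 8.0                    # a single extreme value
    # A wide range keeps the outlier but quantizes the bulk coarsely.
    print(mean_abs_error(bulk, alpha=8.0, bits=2))
    # A narrow range is finer on the bulk but truncates the outlier.
    print(mean_abs_error(bulk, alpha=0.3, bits=2))
    print(abs(outlier - quantize_dequantize(outlier, alpha=0.3, bits=2)))
```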

MultiClaimNet: A Massively Multilingual Dataset of Fact-Checked Claim Clusters

Rrubaa Panchendrarajan,Rubén Míguez,Arkaitz Zubiaga

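Task: Propose and validate a multilingual claim clustering approach to reduce redundancy in fact-checking.

Motivation: Claims are repeated across platforms and languages, so more efficient solutions are needed to reduce redundancy and improve retrieval.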

Details

Method: Introduce MultiClaimNet, a collection of claim cluster datasets covering 86 languages, built through automatic clustering with limited manual intervention. Result: The larger dataset contains 85.3K fact-checked claims, and experiments establish baseline performance. Conclusion: MultiClaimNet provides a foundation for scalable claim clustering and contributes to efficient fact-checking pipelines. Abstract: In the context of fact-checking, claims are often repeated across various platforms and in different languages, which can benefit from a process that reduces this redundancy. While retrieving previously fact-checked claims has been investigated as a solution, the growing number of unverified claims and expanding size of fact-checked databases call for alternative, more efficient solutions. A promising solution is to group claims that discuss the same underlying facts into clusters to improve claim retrieval and validation. However, research on claim clustering is hindered by the lack of suitable datasets. To bridge this gap, we introduce MultiClaimNet, a collection of three multilingual claim cluster datasets containing claims in 86 languages across diverse topics. Claim clusters are formed automatically from claim-matching pairs with limited manual intervention. We leverage two existing claim-matching datasets to form the smaller datasets within MultiClaimNet. To build the larger dataset, we propose and validate an approach involving retrieval of approximate nearest neighbors to form candidate claim pairs and an automated annotation of claim similarity using large language models. This larger dataset contains 85.3K fact-checked claims written in 78 languages. We further conduct extensive experiments using various clustering techniques and sentence embedding models to establish baseline performance. Our datasets and findings provide a strong foundation for scalable claim clustering, contributing to efficient fact-checking pipelines.
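
One simple way to turn claim-matching pairs into clusters is to take the connected components of the graph whose edges are the matched pairs. The union-find sketch below illustrates that idea under this assumption; the abstract does not state which grouping procedure the authors use, and the claim identifiers are made up.

```python
def clusters_from_pairs(pairs):
    """Group claims into clusters: two claims share a cluster if a chain of
    matched pairs connects them (connected components via union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for claim in list(parent):
        groups.setdefault(find(claim), []).append(claim)
    return sorted(sorted(group) for group in groups.values())

if __name__ == "__main__":
    matched = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
    print(clusters_from_pairs(matched))
```

Connected components are transitive by construction, so a single wrong match can merge two unrelated clusters; that is one reason some manual review of the resulting clusters can be useful.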

Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

Size Wu,Wenwei Zhang,Lumin Xu,Sheng Jin,Zhonghua Wu,Qingyi Tao,Wentao Liu,Wei Li,Chen Change Loy

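Task: Propose Harmon, an autoregressive framework that unifies visual understanding and generation.

Motivation: Existing unified visual representations prioritize intrinsic imagery features over semantics, which degrades understanding performance.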

Details

Method: Build on a masked autoregressive (MAR) encoder and progressively optimize understanding and generation capabilities through three-stage training. Result: State-of-the-art results on image generation, while matching methods with dedicated semantic encoders on image understanding. Conclusion: Harmon unifies visual understanding and generation and demonstrates the potential of the MAR encoder. Abstract: Unifying visual understanding and generation within a single multimodal framework remains a significant challenge, as the two inherently heterogeneous tasks require representations at different levels of granularity. Current approaches that utilize vector quantization (VQ) or variational autoencoders (VAE) for unified visual representation prioritize intrinsic imagery features over semantics, compromising understanding performance. In this work, we take inspiration from masked image modelling (MIM) that learns rich semantics via a mask-and-reconstruct pre-training and its successful extension to masked autoregressive (MAR) image generation. A preliminary study on the MAR encoder's representation reveals exceptional linear probing accuracy and precise feature response to visual concepts, which indicates MAR's potential for visual understanding tasks beyond its original generation role. Based on these insights, we present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Through a three-stage training procedure that progressively optimizes understanding and generation capabilities, Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks while matching the performance of methods with dedicated semantic encoders (e.g., Janus) on image understanding benchmarks. Our code and models will be available at https://github.com/wusize/Harmon.

Preference-based Learning with Retrieval Augmented Generation for Conversational Question Answering

Magdalena Kaiser,Gerhard Weikum

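Task: Propose PRAISE, a pipeline-based approach to conversational question answering that trains LLM adapters for each of three subtasks.

Motivation: Labeled training data for the individual subtasks is unavailable in practice, so PRAISE learns from its own generations, using final answering performance as the feedback signal without human intervention.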

Details

Method: Apply Direct Preference Optimization by contrasting successful and unsuccessful samples for each subtask, treating intermediate information (such as relevant evidence) as weakly labeled data. Result: PRAISE improves every subtask and sets a new state of the art on a popular ConvQA benchmark, gaining 15.5 percentage points in precision. Conclusion: Through self-learning and weak supervision, PRAISE substantially improves conversational question answering. Abstract: Conversational Question Answering (ConvQA) involves multiple subtasks: i) understanding incomplete questions in their context, ii) retrieving relevant information, and iii) generating answers. This work presents PRAISE, a pipeline-based approach for ConvQA that trains LLM adapters for each of the three subtasks. As labeled training data for individual subtasks is unavailable in practice, PRAISE learns from its own generations using the final answering performance as feedback signal without human intervention and treats intermediate information, like relevant evidence, as weakly labeled data. We apply Direct Preference Optimization by contrasting successful and unsuccessful samples for each subtask. In our experiments, we show the effectiveness of this training paradigm: PRAISE shows improvements per subtask and achieves new state-of-the-art performance on a popular ConvQA benchmark, gaining 15.5 percentage points in precision over baselines.
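
The standard Direct Preference Optimization objective for one preference pair is -log σ(β[(log π(y_w) - log π_ref(y_w)) - (log π(y_l) - log π_ref(y_l))]), where y_w is the preferred (here: successful) sample and y_l the rejected one. A minimal sketch for a single pair follows; the log-probability values and the choice β = 0.1 are illustrative assumptions, not settings from the paper.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    All four arguments are sequence log-probabilities: under the policy being
    trained and under the frozen reference model, respectively.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
    # Policy favours the successful sample more than the reference does: lower loss.
    print(dpo_loss(-0.5, -3.0, -1.0, -2.0))
```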

AgRowStitch: A High-fidelity Image Stitching Pipeline for Ground-based Agricultural Images

Isaac Kazuo Uyehara,Heesup Yun,Earl Ranario,Mason Earles

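Task: Develop a user-friendly, open-source pipeline for stitching ground-based images of linear crop rows without relying on additional data.

Motivation: Agricultural images are hard to stitch because of repeated textures, non-planar plants, and errors that accumulate across many images, and existing methods offer no general solution for images taken close to the crop.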

Details

Method: SuperPoint and LightGlue extract and match features within small batches of images; each batch is stitched in series under constraints on camera movement, and the batch mosaics are then combined into a final mosaic. Result: Across three acquisition setups, the pipeline produced high-quality mosaics with a mean absolute georeferencing error of 20 cm. Conclusion: The approach gives users who need coarse georeferencing an accessible leaf-scale stitching solution that requires neither accurate positional data nor sophisticated imaging systems. Abstract: Agricultural imaging often requires individual images to be stitched together into a final mosaic for analysis. However, agricultural images can be particularly challenging to stitch because feature matching across images is difficult due to repeated textures, plants are non-planar, and mosaics built from many images can accumulate errors that cause drift. Although these issues can be mitigated by using georeferenced images or taking images at high altitude, there is no general solution for images taken close to the crop. To address this, we created a user-friendly and open source pipeline for stitching ground-based images of a linear row of crops that does not rely on additional data. First, we use SuperPoint and LightGlue to extract and match features within small batches of images. Then we stitch the images in each batch in series while imposing constraints on the camera movement. After straightening and rescaling each batch mosaic, all batch mosaics are stitched together in series and then straightened into a final mosaic. We tested the pipeline on images collected along 72 m long rows of crops using two different agricultural robots and a camera manually carried over the row. In all three cases, the pipeline produced high-quality mosaics that could be used to georeference real world positions with a mean absolute error of 20 cm. This approach provides accessible leaf-scale stitching to users who need to coarsely georeference positions within a row, but do not have access to accurate positional data or sophisticated imaging systems.

A Refined Analysis of Massive Activations in LLMs

Louis Owen,Nilabhra Roy Chowdhury,Abhay Kumar,Fabian Güra

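Task: Analyze massive activations in large language models (LLMs) and strategies for mitigating them.

Motivation: Massive activations matter for low-precision training and quantization, yet existing analyses are limited in scope.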

Details

Method: Analyze massive activations across a broad range of LLMs (both GLU-based and non-GLU-based architectures) and investigate hybrid mitigation strategies, such as pairing TVR with Attention KV bias or DyT. Result: Several prior assumptions are challenged: not all massive activations are detrimental, and some mitigation strategies are model-specific; hybrid strategies balance mitigation with preserved performance. Conclusion: Hybrid strategies (TVR paired with Attention KV bias or DyT) performed well in the scenarios investigated, and the code is open source. Abstract: Motivated in part by their relevance for low-precision training and quantization, massive activations in large language models (LLMs) have recently emerged as a topic of interest. However, existing analyses are limited in scope, and generalizability across architectures is unclear. This paper helps address some of these gaps by conducting an analysis of massive activations across a broad range of LLMs, including both GLU-based and non-GLU-based architectures. Our findings challenge several prior assumptions, most importantly: (1) not all massive activations are detrimental, i.e. suppressing them does not lead to an explosion of perplexity or a collapse in downstream task performance; (2) proposed mitigation strategies such as Attention KV bias are model-specific and ineffective in certain cases. We consequently investigate novel hybrid mitigation strategies; in particular pairing Target Variance Rescaling (TVR) with Attention KV bias or Dynamic Tanh (DyT) successfully balances the mitigation of massive activations with preserved downstream model performance in the scenarios we investigated. Our code is available at: https://github.com/bluorion-com/refine_massive_activations.

BOOTPLACE: Bootstrapped Object Placement with Detection Transformers

Hang Zhou,Xinxin Zuo,Rui Ma,Li Cheng

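Task: Address object placement in copy-paste image composition with a new placement-by-detection approach.

Motivation: Existing generative models have limited capacity to model complex data distributions, while transformer networks with a sparse contrastive loss place objects imprecisely because their regularization is over-relaxed.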

Details

Method: BOOTPLACE formulates object placement as a detection problem, training a specialized detection transformer with multi-object supervision. Result: It markedly surpasses existing methods on the Cityscapes and OPA datasets, with notable improvements in IOU scores. Conclusion: BOOTPLACE performs well on object placement and shows compositionality and generalizability. Abstract: In this paper, we tackle the copy-paste image-to-image composition problem with a focus on object placement learning. Prior methods have leveraged generative models to reduce the reliance on dense supervision. However, this often limits their capacity to model complex data distributions. Alternatively, transformer networks with a sparse contrastive loss have been explored, but their over-relaxed regularization often leads to imprecise object placement. We introduce BOOTPLACE, a novel paradigm that formulates object placement as a placement-by-detection problem. Our approach begins by identifying suitable regions of interest for object placement. This is achieved by training a specialized detection transformer on object-subtracted backgrounds, enhanced with multi-object supervisions. It then semantically associates each target compositing object with detected regions based on their complementary characteristics. Through a bootstrapped training approach applied to randomly object-subtracted images, our model enforces meaningful placements through extensive paired data augmentation. Experimental results on established benchmarks demonstrate BOOTPLACE's superior performance in object repositioning, markedly surpassing state-of-the-art baselines on Cityscapes and OPA datasets with notable improvements in IOU scores. Additional ablation studies further showcase the compositionality and generalizability of our approach, supported by user study evaluations.

SKDU at De-Factify 4.0: Natural Language Features for AI-Generated Text-Detection

Shrikant Malviya,Pablo Arnau-González,Miguel Arevalillo-Herráez,Stamos Katsigiannis

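Task: Explore a pipelined approach to AI-generated text detection that consists of a feature extraction step and a classification module.

Motivation: The rapid advancement of large language models (LLMs) makes it increasingly challenging to distinguish human-written text from AI-generated content.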

Details

Method: A pipeline of feature extraction (prompt-based rewriting features inspired by RAIDAR and content-based features from NELA) followed by a classification module. Result: NELA features significantly outperform RAIDAR features both in distinguishing human-written from AI-generated text and in identifying the generating model; the XGBoost classifier performs best. Conclusion: NELA features capture nuanced differences, combining the feature sets yields minimal improvement, and XGBoost is the most effective classifier. Abstract: The rapid advancement of large language models (LLMs) has introduced new challenges in distinguishing human-written text from AI-generated content. In this work, we explored a pipelined approach for AI-generated text detection that includes a feature extraction step (i.e. prompt-based rewriting features inspired by RAIDAR and content-based features derived from the NELA toolkit) followed by a classification module. Comprehensive experiments were conducted on the Defactify4.0 dataset, evaluating two tasks: binary classification to differentiate human-written and AI-generated text, and multi-class classification to identify the specific generative model used to generate the input text. Our findings reveal that NELA features significantly outperform RAIDAR features in both tasks, demonstrating their ability to capture nuanced linguistic, stylistic, and content-based differences. Combining RAIDAR and NELA features provided minimal improvement, highlighting the redundancy introduced by less discriminative features. Among the classifiers tested, XGBoost emerged as the most effective, leveraging the rich feature sets to achieve high accuracy and generalisation.

FACETS: Efficient Once-for-all Object Detection via Constrained Iterative Search

Tony Tran,Bin Hu

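Task: Propose FACETS, a novel unified iterative neural architecture search (NAS) method for optimizing multi-module architectures in deep learning object detection frameworks.

Motivation: Address the difficulty of joint optimization in NAS for multi-module object detection frameworks, which stems from a vast search space and high computational cost, while satisfying the target device's computational constraints.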

Details

Method: FACETS refines each module's architecture cyclically, using feedback from previous iterations and alternating between fixing one module's architecture and optimizing the others, which reduces the search space while preserving interdependencies among modules. Result: In earlier stages, FACETS reaches architectures with up to 4.75% higher accuracy twice as fast as progressive search; its iteratively refined search space yields candidates with a mean accuracy up to 27% higher than global search and 5% higher than progressive search. Conclusion: FACETS is an efficient and flexible NAS method that improves object detection architectures while reducing computational cost. Abstract: Neural Architecture Search (NAS) for deep learning object detection frameworks typically involves multiple modules, each performing distinct tasks. These modules contribute to a vast search space, resulting in searches that can take several GPU hours or even days, depending on the complexity of the search space. This makes joint optimization both challenging and computationally expensive. Furthermore, satisfying target device constraints across modules adds additional complexity to the optimization process. To address these challenges, we propose FACETS, eFficient Once-for-All Object Detection via Constrained itEraTive Search, a novel unified iterative NAS method that refines the architecture of all modules in a cyclical manner. FACETS leverages feedback from previous iterations, alternating between fixing one module's architecture and optimizing the others. This approach reduces the overall search space while preserving interdependencies among modules and incorporates constraints based on the target device's computational budget. In a controlled comparison against progressive and single-module search strategies, FACETS achieves architectures with up to 4.75% higher accuracy twice as fast as progressive search strategies in earlier stages, while still being able to achieve a global optimum. Moreover, FACETS demonstrates the ability to iteratively refine the search space, producing better performing architectures over time. The refined search space yields candidates with a mean accuracy up to 27% higher than global search and 5% higher than progressive search methods via random sampling.
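
The alternating scheme described above (fix one module, search over the others, repeat under a device budget) can be illustrated with a toy sketch. This is not the FACETS implementation: `iterative_search`, the module names, the score proxy, and the cost function are all hypothetical stand-ins, and an exhaustive scan replaces the actual search procedure.

```python
from itertools import product


def iterative_search(spaces, score, cost, budget, init, rounds=3):
    """Cyclic search: each step fixes one module and searches over the others.

    spaces: dict mapping a module name to its candidate choices (hypothetical).
    score, cost: callables on a full architecture dict (toy stand-ins).
    init: a starting architecture that already satisfies the budget.
    """
    best = dict(init)
    names = list(spaces)
    for _ in range(rounds):
        for fixed in names:
            # Keep `fixed` as it is in the current best; vary every other module.
            free = [n for n in names if n != fixed]
            for combo in product(*(spaces[n] for n in free)):
                cand = dict(best)
                cand.update(zip(free, combo))
                # Accept only candidates within the budget that improve the score.
                if cost(cand) <= budget and score(cand) > score(best):
                    best = cand
    return best


# Toy problem: integers stand in for architecture sizes.
spaces = {"backbone": [1, 2, 3, 4], "head": [1, 2, 3]}
score = lambda a: 2 * a["backbone"] + 3 * a["head"]  # made-up accuracy proxy
cost = lambda a: a["backbone"] + a["head"]           # made-up compute cost
found = iterative_search(spaces, score, cost, budget=5,
                         init={"backbone": 1, "head": 1})
```

Each inner step scans only the free modules rather than the full joint space, which is the sense in which fixing a module shrinks the search; the feedback from earlier iterations is carried in `best`.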

Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions

Yubo Li,Yidi Miao,Xueying Ding,Ramayya Krishnan,Rema Padman

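Task: Propose a comprehensive framework for evaluating and improving the response consistency of large language models (LLMs).

Motivation: Deploying LLMs in high-stakes domains requires stable behavior across multiple interaction rounds, yet systematic evaluation methods are lacking.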

Details

Method: Proposes the Position-Weighted Consistency (PWC) score, builds a multi-domain benchmark dataset, and develops the Confidence-Aware Response Generation (CARG) framework. Result: CARG significantly improves response stability without sacrificing accuracy. Conclusion: The framework shows potential for reliable LLM deployment in critical applications. Abstract: Large Language Models (LLMs) have shown remarkable capabilities across various tasks, but their deployment in high-stakes domains requires consistent performance across multiple interaction rounds. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions. First, we propose a novel Position-Weighted Consistency (PWC) score that captures both the importance of early-stage stability and recovery patterns in multi-turn interactions. Second, we present a carefully curated benchmark dataset spanning diverse domains and difficulty levels, specifically designed to evaluate LLM consistency under various challenging follow-up scenarios. Third, we introduce Confidence-Aware Response Generation (CARG), a framework that significantly improves response stability by incorporating model confidence signals into the generation process. Empirical results demonstrate that CARG significantly improves response stability without sacrificing accuracy, underscoring its potential for reliable LLM deployment in critical applications.

AGILE: A Diffusion-Based Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification

Earl Ranario,Lars Lundqvist,Heesup Yun,Brian N. Bailey,J. Mason Earles

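Task: Propose AGILE, a diffusion-based attention-guided image and label translation framework for cross-domain plant trait identification.

Motivation: Existing generative models struggle to preserve object-level accuracy in cross-domain image translation, especially when domain gaps are significant.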

Details

Method: Uses optimized text embeddings and attention guidance, together with pretrained diffusion models and agricultural datasets, to constrain the semantic consistency of image translation. Result: On cross-domain plant datasets, AGILE generates semantically accurate translated images and improves object detection performance in the target domain while maintaining realism and consistency. Conclusion: AGILE achieves better semantic alignment than prior methods, particularly when objects vary significantly or domain gaps are large. Abstract: Semantically consistent cross-domain image translation facilitates the generation of training data by transferring labels across different domains, making it particularly useful for plant trait identification in agriculture. However, existing generative models struggle to maintain object-level accuracy when translating images between domains, especially when domain gaps are significant. In this work, we introduce AGILE (Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification), a diffusion-based framework that leverages optimized text embeddings and attention guidance to semantically constrain image translation. AGILE utilizes pretrained diffusion models and publicly available agricultural datasets to improve the fidelity of translated images while preserving critical object semantics. Our approach optimizes text embeddings to strengthen the correspondence between source and target images and guides attention maps during the denoising process to control object placement. We evaluate AGILE on cross-domain plant datasets and demonstrate its effectiveness in generating semantically accurate translated images. Quantitative experiments show that AGILE enhances object detection performance in the target domain while maintaining realism and consistency. Compared to prior image translation methods, AGILE achieves superior semantic alignment, particularly in challenging cases where objects vary significantly or domain gaps are substantial.

Supposedly Equivalent Facts That Aren't? Entity Frequency in Pre-training Induces Asymmetry in LLMs

Yuan He,Bailan He,Zifeng Ding,Alisia Lupidi,Yuqicheng Zhu,Shuo Chen,Caiqi Zhang,Jiaoyan Chen,Yunpu Ma,Volker Tresp,Ian Horrocks

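Task: Explain why hallucinations occur in large language models (LLMs) and link model behaviour to the prior knowledge in their pre-training data.

Motivation: Understand and mitigate hallucinations in LLMs to make content generation more reliable.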

Details

Method: Estimates entity frequencies using the open-source OLMo series and its Dolma dataset, and builds probing datasets to analyze asymmetry in the recognition of logically equivalent facts. Result: Facts with a high-frequency subject and a low-frequency object are recognised better than their inverse; the pattern reverses in low-to-high frequency settings and is not significant when both entities are high-frequency. Conclusion: Pre-training data strongly shapes model predictions, which offers a basis for inferring characteristics of the pre-training data of closed or partially closed LLMs. Abstract: Understanding and mitigating hallucinations in Large Language Models (LLMs) is crucial for ensuring reliable content generation. While previous research has primarily focused on "when" LLMs hallucinate, our work explains "why" and directly links model behaviour to the pre-training data that forms their prior knowledge. Specifically, we demonstrate that an asymmetry exists in the recognition of logically equivalent facts, which can be attributed to frequency discrepancies of entities appearing as subjects versus objects. Given that most pre-training datasets are inaccessible, we leverage the fully open-source OLMo series by indexing its Dolma dataset to estimate entity frequencies. Using relational facts (represented as triples) from Wikidata5M, we construct probing datasets to isolate this effect. Our experiments reveal that facts with a high-frequency subject and a low-frequency object are better recognised than their inverse, despite their logical equivalence. The pattern reverses in low-to-high frequency settings, and no statistically significant asymmetry emerges when both entities are high-frequency. These findings highlight the influential role of pre-training data in shaping model predictions and provide insights for inferring the characteristics of pre-training data in closed or partially closed LLMs.

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

Qingqing Zhao,Yao Lu,Moo Jin Kim,Zipeng Fu,Zhuoyang Zhang,Yecheng Wu,Zhaoshuo Li,Qianli Ma,Song Han,Chelsea Finn,Ankur Handa,Ming-Yu Liu,Donglai Xiang,Gordon Wetzstein,Tsung-Yi Lin

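Task: Propose a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs) to improve performance on complex manipulation tasks.

Motivation: Existing VLAs lack intermediate reasoning steps and cannot handle the temporal planning or reasoning that complex tasks require.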

Details

Method: Autoregressively predicts future image frames as visual goals, then generates a short action sequence to achieve those goals. Result: CoT-VLA outperforms the state-of-the-art VLA model by 17% on real-world manipulation tasks and by 6% on simulation benchmarks. Conclusion: Introducing visual chain-of-thought reasoning markedly improves VLA performance on complex tasks. Abstract: Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input-output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks. As a result, existing VLAs lack temporal planning or reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs) by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks. Project website: https://cot-vla.github.io/

Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors

Zhiyu Yang,Shuo Wang,Yukun Yan,Yang Deng

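Task: Evaluate large language models (LLMs) on multi-hop error tracing and multi-bug detection in data science code debugging.

Motivation: Current code generation and repair benchmarks mainly assess syntactic and functional correctness in simple, single-error cases, while LLMs' ability to autonomously find and fix runtime logical errors in complex data science code remains underexplored.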

Details

Method: Introduces the DSDBench benchmark, which uses automatically synthesized multi-hop, multi-bug code snippets to evaluate LLMs on data science debugging tasks. Result: Evaluations show significant performance gaps for current LLMs in debugging logical runtime errors in data science code. Conclusion: DSDBench is a key resource for evaluating and improving LLMs' debugging and reasoning capabilities, supporting more reliable AI-assisted data science in the future. Abstract: LLMs are transforming software development, yet current code generation and code repair benchmarks mainly assess syntactic and functional correctness in simple, single-error cases. LLMs' capabilities to autonomously find and fix runtime logical errors in complex data science code remain largely unexplored. To address this gap, we introduce DSDBench: the Data Science Debugging Benchmark, the first benchmark for systematic evaluation of LLMs on multi-hop error tracing and multi-bug detection in data science code debugging. DSDBench adapts datasets from existing data science task benchmarks, such as DABench and MatPlotBench, featuring realistic data science debugging tasks with automatically synthesized multi-hop, multi-bug code snippets. DSDBench includes 1,117 annotated samples with 741 cause-effect error pairs and runtime error messages. Evaluations of state-of-the-art LLMs on DSDBench show significant performance gaps, highlighting challenges in debugging logical runtime errors in data science code. DSDBench offers a crucial resource to evaluate and improve LLMs' debugging and reasoning capabilities, enabling more reliable AI-assisted data science in the future. DSDBench is publicly available at https://github.com/KevinCL16/DSDBench.

Multispectral Demosaicing via Dual Cameras

SaiKiran Tedla,Junyong Lee,Beixuan Yang,Mahmoud Afifi,Michael Brown

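Task: Propose a multispectral image demosaicing method designed for dual-camera setups.

Motivation: Multispectral images are valuable for spectral applications, and integrating them into multi-camera devices such as smartphones could enhance both spectral applications and RGB image quality.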

Details

Method: Uses co-captured RGB images with higher spatial fidelity to guide the demosaicing of lower-fidelity multispectral images. Result: Experiments show the method is more accurate than existing techniques. Conclusion: The method offers an effective solution for multispectral image demosaicing in dual-camera setups. Abstract: Multispectral (MS) images capture detailed scene information across a wide range of spectral bands, making them invaluable for applications requiring rich spectral data. Integrating MS imaging into multi-camera devices, such as smartphones, has the potential to enhance both spectral applications and RGB image quality. A critical step in processing MS data is demosaicing, which reconstructs color information from the mosaic MS images captured by the camera. This paper proposes a method for MS image demosaicing specifically designed for dual-camera setups where both RGB and MS cameras capture the same scene. Our approach leverages co-captured RGB images, which typically have higher spatial fidelity, to guide the demosaicing of lower-fidelity MS images. We introduce the Dual-camera RGB-MS Dataset, a large collection of paired RGB and MS mosaiced images with ground-truth demosaiced outputs, which enables training and evaluation of our method. Experimental results demonstrate that our method achieves state-of-the-art accuracy compared to existing techniques.

Negation: A Pink Elephant in the Large Language Models' Room?

Tereza Vrabcová,Marek Kadlčík,Petr Sojka,Michal Štefánik,Michal Spiegel

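Task: Study how large language models (LLMs) handle negation, and construct multilingual natural language inference (NLI) datasets to evaluate this ability.

Motivation: Negation is essential for logical reasoning, yet LLMs handle it poorly and the problem remains underexplored.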

Details

Method: Constructs two multilingual NLI datasets and evaluates popular LLMs to study how model size and language affect the handling of negation. Result: Increasing model size consistently improves the handling of negation; reasoning accuracy and robustness are language-dependent, and the length and explicitness of the premise affect robustness more than language does. Conclusion: The datasets can support further research on and improvement of language model reasoning in multilingual settings. Abstract: Negations are key to determining sentence meaning, making them essential for logical reasoning. Despite their importance, negations pose a substantial challenge for large language models (LLMs) and remain underexplored. We construct two multilingual natural language inference (NLI) datasets with paired examples differing in negation. We investigate how model size and language impact a model's ability to handle negation correctly by evaluating popular LLMs. Contrary to previous work, we show that increasing the model size consistently improves the models' ability to handle negations. Furthermore, we find that both the models' reasoning accuracy and robustness to negation are language-dependent and that the length and explicitness of the premise have a greater impact on robustness than language. Our datasets can facilitate further research and improvements of language model reasoning in multilingual settings.

A Deep Learning Framework for Boundary-Aware Semantic Segmentation

Tai An,Weiqiang Huang,Da Xu,Qingyuan He,Jiacheng Hu,Yujia Lou

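Task: Propose a Mask2Former-based semantic segmentation algorithm that incorporates a boundary enhancement feature bridging module (BEFBM) to improve target boundary accuracy and segmentation consistency.

Motivation: Transformer-based segmentation methods excel at global feature modeling but still struggle with blurred target boundaries and insufficient recognition of small targets.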

Details

Method: Builds a boundary-aware feature map on top of the Mask2Former framework and introduces a feature bridging mechanism for cross-scale feature fusion. Result: On the Cityscapes dataset, mIOU, mDICE, and mRecall improve significantly over mainstream methods, with better boundary retention in complex scenes. Conclusion: Future work will optimize computational efficiency and explore the method's potential in other high-precision segmentation tasks. Abstract: As a fundamental task in computer vision, semantic segmentation is widely applied in fields such as autonomous driving, remote sensing image analysis, and medical image processing. In recent years, Transformer-based segmentation methods have demonstrated strong performance in global feature modeling. However, they still struggle with blurred target boundaries and insufficient recognition of small targets. To address these issues, this study proposes a Mask2Former-based semantic segmentation algorithm incorporating a boundary enhancement feature bridging module (BEFBM). The goal is to improve target boundary accuracy and segmentation consistency. Built upon the Mask2Former framework, this method constructs a boundary-aware feature map and introduces a feature bridging mechanism. This enables effective cross-scale feature fusion, enhancing the model's ability to focus on target boundaries. Experiments on the Cityscapes dataset demonstrate that, compared to mainstream segmentation methods, the proposed approach achieves significant improvements in metrics such as mIOU, mDICE, and mRecall. It also exhibits superior boundary retention in complex scenes. Visual analysis further confirms the model's advantages in fine-grained regions. Future research will focus on optimizing computational efficiency and exploring its potential in other high-precision segmentation tasks.

Elite Political Discourse has Become More Toxic in Western Countries

Petter Törnberg,Juliana Chueri

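Task: Systematically investigate trends in political incivility internationally and their determinants.

Motivation: Political incivility threatens democratic values and governance, yet its drivers and evolution remain poorly understood.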

Details

Method: Analyzes a dataset of nearly 18 million Twitter messages from parliamentarians in 17 countries. Result: Toxic discourse among political elites has increased markedly and is associated with radical-right parties and parties in opposition; toxicity fell during the early phase of COVID-19 and during election campaigns; culture war topics are more toxic. Conclusion: International democracies face an erosion of constructive dialogue, and the trend in political incivility is troubling. Abstract: Toxic and uncivil politics is widely seen as a growing threat to democratic values and governance, yet our understanding of the drivers and evolution of political incivility remains limited. Leveraging a novel dataset of nearly 18 million Twitter messages from parliamentarians in 17 countries over five years, this paper systematically investigates whether politics internationally is becoming more uncivil, and what the determinants of political incivility are. Our analysis reveals a marked increase in toxic discourse among political elites, and that it is associated with radical-right parties and parties in opposition. Toxicity diminished markedly during the early phase of the COVID-19 pandemic and, surprisingly, during election campaigns. Furthermore, our results indicate that posts relating to "culture war" topics, such as migration and LGBTQ+ rights, are substantially more toxic than debates focused on welfare or economic issues. These findings underscore a troubling shift in international democracies toward an erosion of constructive democratic dialogue.

Deep Depth Estimation from Thermal Image: Dataset, Benchmark, and Challenges

Ukcheol Shin,Jinsun Park

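Task: Present a large-scale Multi-Spectral Stereo (MS$^2$) dataset for evaluating depth estimation networks across RGB, near-infrared, and thermal modalities.

Motivation: Existing perception algorithms that rely on the visible spectrum perform poorly under adverse weather and lighting, and thermal cameras may offer greater robustness, but large-scale datasets and standardized benchmarks are lacking.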

Details

Method: Builds the MS$^2$ dataset of multi-modal data (RGB, NIR, thermal, LiDAR, and more) and conducts a thorough evaluation of depth estimation networks. Result: Provides 162K multi-modal data pairs, establishes standardized benchmark results, and analyzes per-modality performance differences and challenges under adverse conditions. Conclusion: The MS$^2$ dataset fills a gap in thermal perception research and offers direction and resources for future work. Abstract: Achieving robust and accurate spatial perception under adverse weather and lighting conditions is crucial for the high-level autonomy of self-driving vehicles and robots. However, existing perception algorithms relying on the visible spectrum are highly affected by weather and lighting conditions. A long-wave infrared camera (i.e., thermal imaging camera) can be a potential solution to achieve high-level robustness. However, the absence of large-scale datasets and standardized benchmarks remains a significant bottleneck to progress in active research for robust visual perception from thermal images. To this end, this manuscript provides a large-scale Multi-Spectral Stereo (MS$^2$) dataset that consists of stereo RGB, stereo NIR, stereo thermal, stereo LiDAR data, and GNSS/IMU information along with semi-dense depth ground truth. The MS$^2$ dataset includes 162K synchronized multi-modal data pairs captured across diverse locations (e.g., urban city, residential area, campus, and highway) at different times (e.g., morning, daytime, and nighttime) and under various weather conditions (e.g., clear-sky, cloudy, and rainy). Secondly, we conduct a thorough evaluation of monocular and stereo depth estimation networks across RGB, NIR, and thermal modalities to establish standardized benchmark results on MS$^2$ depth test sets (e.g., day, night, and rainy). Lastly, we provide in-depth analyses and discuss the challenges revealed by the benchmark results, such as the performance variability for each modality under adverse conditions, domain shift between different sensor modalities, and potential research directions for thermal perception.
Our dataset and source code are publicly available at https://sites.google.com/view/multi-spectral-stereo-dataset and https://github.com/UkcheolShin/SupDepth4Thermal.

Long-Tail Crisis in Nearest Neighbor Language Models

Yuto Nishida,Makoto Morishita,Hiroyuki Deguchi,Hidetaka Kamigaito,Taro Watanabe

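Task: Study the prediction performance of $k$NN-LM on low-frequency tokens.

Motivation: Examine whether $k$NN-LM truly improves the prediction of low-frequency tokens, rather than merely retrieving long-tail contexts.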

Details

Method: Analyzes prediction probability, retrieval accuracy, token distribution in the datastore, and the approximation error of product quantization. Result: $k$NN-LM does not improve prediction performance for low-frequency tokens and mainly benefits high-frequency tokens. Conclusion: $k$NN-LM has limited effect on low-frequency tokens and needs further improvement. Abstract: The $k$-nearest-neighbor language model ($k$NN-LM), one of the retrieval-augmented language models, improves the perplexity for given text by directly accessing a large datastore built from any text data during inference. A widely held hypothesis for the success of $k$NN-LM is that its explicit memory, i.e., the datastore, enhances predictions for long-tail phenomena. However, prior works have primarily shown its ability to retrieve long-tail contexts, leaving the model's performance in estimating the probabilities of long-tail target tokens during inference underexplored. In this paper, we investigate the behavior of $k$NN-LM on low-frequency tokens, examining prediction probability, retrieval accuracy, token distribution in the datastore, and approximation error of the product quantization. Our experimental results reveal that $k$NN-LM does not improve prediction performance for low-frequency tokens but mainly benefits high-frequency tokens regardless of long-tail contexts in the datastore.
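
For readers unfamiliar with how $k$NN-LM forms its prediction, the following minimal sketch shows the standard interpolation $p(y) = \lambda\, p_{kNN}(y) + (1-\lambda)\, p_{LM}(y)$, where $p_{kNN}$ is a softmax over negative neighbor distances aggregated per target token. The function names, the toy distances, and the default `lam=0.25` are illustrative assumptions, not values from this paper.

```python
import math
from collections import defaultdict


def knn_distribution(neighbors):
    """Turn retrieved (distance, target_token) pairs into a token distribution.

    Each neighbor gets weight exp(-distance); weights are normalized and
    summed over neighbors that share the same target token.
    """
    weights = [math.exp(-d) for d, _ in neighbors]
    z = sum(weights)
    p = defaultdict(float)
    for w, (_, tok) in zip(weights, neighbors):
        p[tok] += w / z
    return dict(p)


def interpolate(p_lm, p_knn, lam=0.25):
    """Mix the base LM distribution with the kNN distribution."""
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0)
            for t in vocab}
```

Under this formula, a low-frequency target token gains probability only when retrieved neighbors actually carry that token, which is the quantity the paper's retrieval-accuracy analysis examines.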

Contrasting Low and High-Resolution Features for HER2 Scoring using Deep Learning

Ekansh Chauhan,Anila Sharma,Amit Sharma,Vikas Nishadham,Asha Ghughtyal,Ankur Kumar,Gurudutt Gupta,Anurag Mehta,C. V. Jawahar,P. K. Vinod

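Task: Develop predictive models for 3-way HER2 classification (0, Low, High) in breast cancer to enhance prognosis.

Motivation: Traditional immunohistochemistry (IHC) classification relies on pathologists' expertise, is labor-intensive, and suffers from significant inter-observer variability, so an automated solution is needed.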

Details

Method: Uses 1,272 IHC slides (HER2, ER, PR) from the India Pathology Breast Cancer Dataset (IPD-Breast) and an end-to-end ConvNeXt network operating on low-resolution IHC images. Result: The ConvNeXt network achieves an AUC, F1, and accuracy of 91.79%, 83.52%, and 83.56% on 3-way classification, outperforming patch-based methods by over 5.35% in F1 score. Conclusion: Simple yet effective deep learning techniques can significantly improve accuracy and reproducibility in breast cancer classification, supporting their integration into clinical workflows for better patient outcomes. Abstract: Breast cancer, the most common malignancy among women, requires precise detection and classification for effective treatment. Immunohistochemistry (IHC) biomarkers like HER2, ER, and PR are critical for identifying breast cancer subtypes. However, traditional IHC classification relies on pathologists' expertise, making it labor-intensive and subject to significant inter-observer variability. To address these challenges, this study introduces the India Pathology Breast Cancer Dataset (IPD-Breast), comprising 1,272 IHC slides (HER2, ER, and PR) aimed at automating receptor status classification. The primary focus is on developing predictive models for HER2 3-way classification (0, Low, High) to enhance prognosis. Evaluation of multiple deep learning models revealed that an end-to-end ConvNeXt network utilizing low-resolution IHC images achieved an AUC, F1, and accuracy of 91.79%, 83.52%, and 83.56%, respectively, for 3-way classification, outperforming patch-based methods by over 5.35% in F1 score. This study highlights the potential of simple yet effective deep learning techniques to significantly improve accuracy and reproducibility in breast cancer classification, supporting their integration into clinical workflows for better patient outcomes.

Scaling Laws of Scientific Discovery with AI and Robot Scientists

Pengsong Zhang,Heng Zhang,Huazhe Xu,Renjun Xu,Zhenting Wang,Cong Wang,Animesh Garg,Zhibin Li,Arash Ajoudani,Xinyu Liu

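Task: Propose and describe an autonomous generalist scientist (AGS) system that moves beyond the limitations of traditional research.

Motivation: Traditional research methods are constrained by manual processes and siloed expertise and struggle to meet the demands of modern scientific discovery.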

Details

Method: Combines agentic AI and robotics into an AGS system that can autonomously navigate physical and digital realms. Result: The AGS system is expected to significantly reduce the time and resources needed for scientific research and to enable efficient cross-disciplinary work. Conclusion: AGS systems could trigger a paradigm shift in scientific discovery, driving continual innovation and expanding the boundaries of science. Abstract: The rapid evolution of scientific inquiry highlights an urgent need for groundbreaking methodologies that transcend the limitations of traditional research. Conventional approaches, bogged down by manual processes and siloed expertise, struggle to keep pace with the demands of modern discovery. We envision an autonomous generalist scientist (AGS) system, a fusion of agentic AI and embodied robotics, that redefines the research lifecycle. This system promises to autonomously navigate physical and digital realms, weaving together insights from disparate disciplines with unprecedented efficiency. By embedding advanced AI and robot technologies into every phase, from hypothesis formulation to peer-ready manuscripts, AGS could slash the time and resources needed for scientific research in diverse fields. We foresee a future where scientific discovery follows new scaling laws, driven by the proliferation and sophistication of such systems. As these autonomous agents and robots adapt to extreme environments and leverage a growing reservoir of knowledge, they could spark a paradigm shift, pushing the boundaries of what's possible and ushering in an era of relentless innovation.

A Semantic-Enhanced Heterogeneous Graph Learning Method for Flexible Objects Recognition

Kunshan Yang,Wenwei Luo,Yuguo Hu,Jiafu Yan,Mengmeng Jing,Lin Zuo

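Task: Propose a semantic-enhanced heterogeneous graph learning method for flexible object recognition.

Motivation: Flexible object recognition is challenging because of diverse shapes and sizes, translucent attributes, and subtle inter-class differences, and existing graph models do not adequately align semantic and visual information.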

Details

Method: Uses an adaptive scanning module to extract semantic context and a heterogeneous graph generation module to aggregate global visual and local semantic node features. Result: Shows competitive performance on the FDA and FSCW datasets and on the CIFAR-100 and ImageNet-Hard benchmarks. Conclusion: By aligning semantic and visual information, the proposed method improves flexible object recognition. Abstract: Flexible object recognition remains a significant challenge because such objects have inherently diverse shapes and sizes, translucent attributes, and subtle inter-class differences. Graph-based models, such as graph convolution networks and graph vision models, are promising for flexible object recognition due to their ability to capture variable relations within flexible objects. These methods, however, often focus on global visual relationships or fail to align semantic and visual information. To alleviate these limitations, we propose a semantic-enhanced heterogeneous graph learning method. First, an adaptive scanning module is employed to extract discriminative semantic context, facilitating the matching of flexible objects with varying shapes and sizes while aligning semantic and visual nodes to enhance cross-modal feature correlation. Second, a heterogeneous graph generation module aggregates global visual and local semantic node features, improving the recognition of flexible objects. Additionally, we introduce FSCW, a large-scale flexible dataset curated from existing sources. We validate our method through extensive experiments on flexible datasets (FDA and FSCW) and challenging benchmarks (CIFAR-100 and ImageNet-Hard), demonstrating competitive performance.

Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey

Shengyue Guan,Haoyi Xiong,Jindong Wang,Jiang Bian,Bin Zhu,Jian-guang Lou

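Task: Systematically survey evaluation methods for large language model (LLM)-based agents in multi-turn conversations.

Motivation: Provide a systematic taxonomy and methodological framework for evaluating LLM-based agents in multi-turn conversations.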

Details

Method: Uses a PRISMA-inspired framework to systematically review nearly 250 sources and builds two interrelated taxonomies: what to evaluate and how to evaluate. Result: Identifies key evaluation dimensions for LLM-based agents (such as task completion and response quality) and categorizes evaluation methods (such as annotation-based evaluation and automated metrics). Conclusion: The framework supports holistic and meaningful evaluation of LLM-based agents in multi-turn conversations. Abstract: This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. Using a PRISMA-inspired framework, we systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication, and establishing a solid foundation for our analysis. Our study offers a structured approach by developing two interrelated taxonomy systems: one that defines what to evaluate and another that explains how to evaluate. The first taxonomy identifies key components of LLM-based agents for multi-turn conversations and their evaluation dimensions, including task completion, response quality, user experience, memory and context retention, as well as planning and tool integration. These components ensure that the performance of conversational agents is assessed in a holistic and meaningful manner. The second taxonomy system focuses on the evaluation methodologies. It categorizes approaches into annotation-based evaluations, automated metrics, hybrid strategies that combine human assessments with quantitative measures, and self-judging methods utilizing LLMs. This framework not only captures traditional metrics derived from language understanding, such as BLEU and ROUGE scores, but also incorporates advanced techniques that reflect the dynamic, interactive nature of multi-turn dialogues.

A Survey on Remote Sensing Foundation Models: From Vision to Multimodality

Ziyue Huang,Hongxi Yan,Qiqi Zhan,Shuai Yang,Mingming Zhang,Chenkai Zhang,YiMing Lei,Zeming Liu,Qingjie Liu,Yunhong Wang

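Task: Review recent advances in vision and multimodal foundation models for remote sensing and their use in intelligent geospatial data interpretation.

Motivation: Although remote sensing foundation models perform well on tasks such as object detection, land cover classification, and change detection, data diversity, the need for large-scale annotated datasets, and the complexity of multimodal fusion still pose challenges.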

Details

Method: Analyzes the architectures, training methods, datasets, and application scenarios of existing models and discusses key challenges such as data alignment, cross-modal transfer learning, and scalability. Result: Summarizes the state of vision and multimodal foundation models for remote sensing and identifies emerging research directions for overcoming current limitations. Conclusion: Aims to give a clear picture of current research on remote sensing foundation models and to inspire future work that pushes the boundaries of these models in real-world applications. Abstract: The rapid advancement of remote sensing foundation models, particularly vision and multimodal models, has significantly enhanced the capabilities of intelligent geospatial data interpretation. These models combine various data modalities, such as optical, radar, and LiDAR imagery, with textual and geographic information, enabling more comprehensive analysis and understanding of remote sensing data. The integration of multiple modalities allows for improved performance in tasks like object detection, land cover classification, and change detection, which are often challenged by the complex and heterogeneous nature of remote sensing data. However, despite these advancements, several challenges remain. The diversity in data types, the need for large-scale annotated datasets, and the complexity of multimodal fusion techniques pose significant obstacles to the effective deployment of these models. Moreover, the computational demands of training and fine-tuning multimodal models require significant resources, further complicating their practical application in remote sensing image interpretation tasks. This paper provides a comprehensive review of the state-of-the-art in vision and multimodal foundation models for remote sensing, focusing on their architecture, training methods, datasets and application scenarios. We discuss the key challenges these models face, such as data alignment, cross-modal transfer learning, and scalability, while also identifying emerging research directions aimed at overcoming these limitations. Our goal is to provide a clear understanding of the current landscape of remote sensing foundation models and inspire future research that can push the boundaries of what these models can achieve in real-world applications.
The list of resources collected by the paper can be found at https://github.com/IRIP-BUAA/A-Review-for-remote-sensing-vision-language-models.

WorkTeam: Constructing Workflows from Natural Language with Multi-Agents

Hanchao Liu,Rongjun Li,Weimin Xiong,Ziyu Zhou,Wei Peng

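Task: Propose WorkTeam, a multi-agent framework for generating workflows from natural language instructions (NL2Workflow).

Motivation: Hand-crafted workflow construction requires expert knowledge, and existing single-agent LLM methods degrade on complex tasks; both problems need to be addressed.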

Details

Method: Proposes the multi-agent framework WorkTeam, comprising a supervisor, an orchestrator, and a filler agent that collaboratively improve the conversion process. Result: Experiments show that WorkTeam significantly increases the success rate of workflow construction. Conclusion: WorkTeam offers a novel and effective solution for NL2Workflow services. Abstract: Workflows play a crucial role in enhancing enterprise efficiency by orchestrating complex processes with multiple tools or components. However, hand-crafted workflow construction requires expert knowledge, presenting significant technical barriers. Recent advancements in Large Language Models (LLMs) have improved the generation of workflows from natural language instructions (aka NL2Workflow), yet existing single LLM agent-based methods face performance degradation on complex tasks due to the need for specialized knowledge and the strain of task-switching. To tackle these challenges, we propose WorkTeam, a multi-agent NL2Workflow framework comprising a supervisor, orchestrator, and filler agent, each with distinct roles that collaboratively enhance the conversion process. As there are currently no publicly available NL2Workflow benchmarks, we also introduce the HW-NL2Workflow dataset, which includes 3,695 real-world business samples for training and evaluation. Experimental results show that our approach significantly increases the success rate of workflow construction, providing a novel and effective solution for enterprise NL2Workflow services.

Mitigating Trade-off: Stream and Query-guided Aggregation for Efficient and Effective 3D Occupancy Prediction

Seokha Moon,Janghyun Baek,Giseop Kim,Jinkyu Kim,Sunwook Choi

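Task: Propose StreamOcc, a framework that aggregates spatio-temporal information in a stream-based manner to ease the trade-off between efficiency and accuracy in 3D occupancy prediction.

Motivation: Existing multi-frame fusion methods face a trade-off between efficiency and accuracy, which limits their practical use.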

Details

Method: StreamOcc has two key components, Stream-based Voxel Aggregation and Query-guided Aggregation, which accumulate past observations and refine the details of dynamic objects. Result: On the Occ3D-nuScenes dataset, StreamOcc achieves state-of-the-art performance in real-time settings while reducing memory usage by more than 50%. Conclusion: Stream-based aggregation markedly improves both the efficiency and the accuracy of 3D occupancy prediction. Abstract: 3D occupancy prediction has emerged as a key perception task for autonomous driving, as it reconstructs 3D environments to provide a comprehensive scene understanding. Recent studies focus on integrating spatiotemporal information obtained from past observations to improve prediction accuracy, using a multi-frame fusion approach that processes multiple past frames together. However, these methods struggle with a trade-off between efficiency and accuracy, which significantly limits their practicality. To mitigate this trade-off, we propose StreamOcc, a novel framework that aggregates spatio-temporal information in a stream-based manner. StreamOcc consists of two key components: (i) Stream-based Voxel Aggregation, which effectively accumulates past observations while minimizing computational costs, and (ii) Query-guided Aggregation, which recurrently aggregates instance-level features of dynamic objects into corresponding voxel features, refining fine-grained details of dynamic objects. Experiments on the Occ3D-nuScenes dataset show that StreamOcc achieves state-of-the-art performance in real-time settings, while reducing memory usage by more than 50% compared to previous methods.

Exploiting Mixture-of-Experts Redundancy Unlocks Multimodal Generative Abilities

Raman Dutt,Harleen Hanspal,Guoxuan Xia,Petru-Daniel Tudosiu,Alexander Black,Yongxin Yang,Steven McDonagh,Sarah Parisot

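Task: Augment pre-trained text-only large language models (LLMs) with multimodal generation capability while satisfying two core constraints.

Motivation: Learn the new modality with a small parameter budget while preserving the original language generation capabilities, ensuring scalability and efficiency.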

Details

Method: Exploits underutilized capacity in deep models, specifically parameter redundancy in Mixture-of-Experts (MoEs), together with low-rank adaptation and a parameter initialization scheme based on the Gromov-Wasserstein distance. Result: Achieves multimodal generation while preserving language generation performance, with improved parameter efficiency and training stability. Conclusion: The method offers a new pathway for transitioning from uni-modal to multi-modal architectures. Abstract: In this work, we undertake the challenge of augmenting the existing generative capabilities of pre-trained text-only large language models (LLMs) with multi-modal generation capability while satisfying two core constraints: (C1) preserving the original language generative capabilities with negligible performance degradation, and (C2) adhering to a small parameter budget to learn the new modality, ensuring scalability and efficiency. In contrast to current approaches that add dedicated modules, thereby significantly increasing the parameter count, we propose a method that leverages the underutilized capacity inherent in deep models. Specifically, we exploit the parameter redundancy within Mixture-of-Experts (MoEs) as a source of additional capacity for learning a new modality, enabling better parameter efficiency (C2). Moreover, we preserve the original language generation capabilities by applying low-rank adaptation exclusively to the tokens of the new modality (C1). Furthermore, we introduce a novel parameter initialization scheme based on the Gromov-Wasserstein distance to improve convergence and training stability. Through an extensive analysis of the routing mechanism, we uncover the emergence of modality-specific pathways and decreased redundancy within the experts that can efficiently unlock multi-modal generative capabilities. Overall, our method can be seamlessly applied to a wide range of contemporary LLMs, providing a new pathway for transitioning from uni-modal to multi-modal architectures.
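
The idea of applying low-rank adaptation only to tokens of the new modality can be illustrated with a minimal sketch. It assumes a single dense layer rather than an MoE expert, and the names `forward`, `W`, `A`, `B`, and `is_new_modality` are hypothetical, not taken from the paper. Text tokens pass through the frozen weight `W` unchanged, while new-modality tokens also receive the low-rank update `B(Ax)`.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def forward(tokens, is_new_modality, W, A, B):
    """Apply frozen weight W to every token; add the LoRA update B(Ax)
    only for tokens flagged as belonging to the new modality."""
    outputs = []
    for x, flag in zip(tokens, is_new_modality):
        y = matvec(W, x)  # frozen base path, shared by all tokens
        if flag:
            delta = matvec(B, matvec(A, x))  # rank-r update (r = len(A))
            y = [yi + di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs


# Toy setup (assumed values): 2-dimensional tokens, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (identity)
A = [[1.0, 1.0]]               # down-projection, shape r x d_in
B = [[0.5], [-0.5]]            # up-projection, shape d_out x r

tokens = [[1.0, 2.0], [3.0, 4.0]]
flags = [False, True]          # first token is text, second is new modality
outputs = forward(tokens, flags, W, A, B)
```

Because the text token never touches `A` or `B`, its output is identical to that of the original layer, which is the sense in which the base language behaviour is left intact in this simplified setting.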

How Well Can Vision-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark

Ximing Wen,Mallika Mainali,Anik Sen

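Task: Evaluate the performance of vision-language models (VLMs) on Theory of Mind (ToM) tasks.

Motivation: Explore how well VLMs infer human intentions, beliefs, and other mental states, filling a research gap in this area.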

Details

Method: Proposes an open-ended question framework and builds a benchmark dataset of 30 images to evaluate four VLMs of varying sizes. Result: GPT-4 performs best, with GPT-4o-mini close behind; VLMs struggle in complex scenarios (such as bullying or cheating), though smaller models sometimes infer the correct intention from incorrect visual cues. Conclusion: VLMs show limited performance on ToM tasks, especially in complex scenarios, though smaller models may have some robustness. Abstract: Vision Language Models (VLMs) have demonstrated strong reasoning capabilities in Visual Question Answering (VQA) tasks; however, their ability to perform Theory of Mind (ToM) tasks such as accurately inferring human intentions, beliefs, and other mental states remains underexplored. In this work, we propose an open-ended question framework to comprehensively evaluate VLMs' performance across diverse categories of ToM tasks. We curated and annotated a benchmark dataset composed of 30 images. We then assessed the performance of four VLMs of varying sizes on this dataset. Our experimental results show that the GPT-4 model outperformed all others, with only one smaller model, GPT-4o-mini, achieving comparable performance. Additionally, we observed that VLMs often struggle to accurately infer intentions in complex scenarios such as bullying or cheating. Moreover, our findings also reveal that smaller models can sometimes infer correct intentions despite relying on incorrect visual cues.

Bridging the Dimensional Chasm: Uncover Layer-wise Dimensional Reduction in Transformers through Token Correlation

Zhuo-Yang Song,Zeyu Li,Qing-Hong Cao,Ming-xing Luo,Hua Xing Zhu

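Task: Study the geometric evolution of token representations in large language models and how it compares with the low-dimensional semantic space of human language.

Motivation: Resolve the tension between the high-dimensional embeddings used by modern LLMs and the low-dimensional semantic space of human language.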

Details

Method: Develops a geometric framework that tracks token dynamics across Transformer layers and analyzes how intrinsic dimension changes. Result: Tokens diffuse into a high-dimensional working space and then progressively project onto lower-dimensional submanifolds; effective models tend to compress tokens into roughly 10-dimensional submanifolds. Conclusion: Reframing Transformer layers as mediators between high-dimensional computation and low-dimensional semantics improves LLM interpretability and yields model diagnostics that do not rely on task-specific evaluation. Abstract: The geometric evolution of token representations in large language models (LLMs) presents a fundamental paradox: while human language inherently organizes semantic information in low-dimensional spaces ($\sim 10^1$ dimensions), modern LLMs employ high-dimensional embeddings ($\sim 10^3$ dimensions) processed through Transformer architectures. To resolve this paradox, this work develops a geometric framework that tracks token dynamics across Transformer layers. Through layer-wise analysis of intrinsic dimensions across multiple architectures, we reveal an expansion-contraction pattern where tokens diffuse to a "working space" and then progressively project onto lower-dimensional submanifolds. Our finding implies a negative correlation between the working space dimension and parameter-sensitive performance of the LLMs, and indicates that effective models tend to compress tokens into approximately 10-dimensional submanifolds, closely resembling human semantic spaces. This work not only advances LLM interpretability by reframing Transformer layers as projectors that mediate between high-dimensional computation and low-dimensional semantics, but also provides practical tools for model diagnostics that do not rely on task-specific evaluations.

Camera Model Identification with SPAIR-Swin and Entropy based Non-Homogeneous Patches

Protyay Dey,Rejoy Chakraborty,Abhilasha S. Jadhav,Kapil Rana,Gaurav Sharma,Puneet Goyal

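Task: Propose SPAIR-Swin, a new model combining a modified spatial attention mechanism and inverted residual block (SPAIR) with a Swin Transformer, for source camera model identification (SCMI).

Motivation: Source camera model identification plays an important role in image forensics, for example in authenticity verification and copyright protection.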

Details

Method: Combines SPAIR with a Swin Transformer and proposes a patch selection strategy that emphasizes high-entropy regions. Result: Strong results on four benchmark datasets, with patch-level and image-level accuracy both exceeding existing methods. Conclusion: High-entropy patches, which contain high-frequency information, help improve SCMI accuracy. Abstract: Source camera model identification (SCMI) plays a pivotal role in image forensics with applications including authenticity verification and copyright protection. For identifying the camera model used to capture a given image, we propose SPAIR-Swin, a novel model combining a modified spatial attention mechanism and inverted residual block (SPAIR) with a Swin Transformer. SPAIR-Swin effectively captures both global and local features, enabling robust identification of artifacts such as noise patterns that are particularly effective for SCMI. Additionally, unlike conventional methods focusing on homogeneous patches, we propose a patch selection strategy for SCMI that emphasizes high-entropy regions rich in patterns and textures. Extensive evaluations on four benchmark SCMI datasets demonstrate that SPAIR-Swin outperforms existing methods, achieving patch-level accuracies of 99.45%, 98.39%, 99.45%, and 97.46% and image-level accuracies of 99.87%, 99.32%, 100%, and 98.61% on the Dresden, Vision, Forchheim, and Socrates datasets, respectively. Our findings highlight that high-entropy patches, which contain high-frequency information such as edge sharpness, noise, and compression artifacts, are more favorable in improving SCMI accuracy. Code will be made available upon request.
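
The entropy-based patch selection described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes non-overlapping square patches of a grayscale image and scores each patch by the Shannon entropy of its intensity histogram. The function names `patch_entropy` and `top_entropy_patches` are hypothetical.

```python
import math
from collections import Counter


def patch_entropy(patch):
    """Shannon entropy (in bits) of the pixel-intensity histogram of a 2D patch."""
    pixels = [value for row in patch for value in row]
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def top_entropy_patches(image, patch_size, k):
    """Split the image into non-overlapping patches and return the k
    highest-entropy ones as (entropy, top_row, left_col) tuples."""
    height, width = len(image), len(image[0])
    scored = []
    for r in range(0, height - patch_size + 1, patch_size):
        for c in range(0, width - patch_size + 1, patch_size):
            patch = [row[c:c + patch_size] for row in image[r:r + patch_size]]
            scored.append((patch_entropy(patch), r, c))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy 4x4 image: only the top-right 2x2 patch has varied intensities.
image = [
    [0, 0, 10, 200],
    [0, 0, 35, 90],
    [5, 5, 7, 7],
    [5, 5, 7, 7],
]
best = top_entropy_patches(image, patch_size=2, k=1)
```

In this toy image the flat patches have zero entropy, so the textured top-right patch is the one selected, in contrast to conventional strategies that prefer homogeneous regions.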

Beyond Vanilla Fine-Tuning: Leveraging Multistage, Multilingual, and Domain-Specific Methods for Low-Resource Machine Translation

Sarubi Thillainathan,Songchen Yuan,En-Shiun Annie Lee,Sanath Jayasena,Surangika Ranathunga

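Task: Propose two approaches (continual pre-training and intermediate task transfer learning) to improve multilingual sequence-to-sequence large language models for extremely low-resource neural machine translation.

Motivation: Conventional single-stage fine-tuning performs poorly in extremely low-resource machine translation, so more effective methods are needed for this challenging setting.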

Details

Method: Uses continual pre-training (CPT) with domain-specific monolingual data and intermediate task transfer learning (ITTL) with in-domain and out-of-domain parallel data to strengthen the model. Result: Experiments show that the two approaches improve translation by an average of 1.47 BLEU across six language pairs in extremely low-resource settings, and a multi-model ensemble improves performance further. Conclusion: The proposed approaches significantly improve extremely low-resource machine translation and offer a practical solution for the field. Abstract: Fine-tuning multilingual sequence-to-sequence large language models (msLLMs) has shown promise in developing neural machine translation (NMT) systems for low-resource languages (LRLs). However, conventional single-stage fine-tuning methods struggle in extremely low-resource NMT settings, where training data is very limited. This paper contributes to artificial intelligence by proposing two approaches for adapting msLLMs in these challenging scenarios: (1) continual pre-training (CPT), where the msLLM is further trained with domain-specific monolingual data to compensate for the under-representation of LRLs, and (2) intermediate task transfer learning (ITTL), a method that fine-tunes the msLLM with both in-domain and out-of-domain parallel data to enhance its translation capabilities across various domains and tasks. As an application in engineering, these methods are implemented in NMT systems for Sinhala, Tamil, and English (six language pairs) in domain-specific, extremely low-resource settings (datasets containing fewer than 100,000 samples). Our experiments reveal that these approaches enhance translation performance by an average of +1.47 bilingual evaluation understudy (BLEU) score compared to the standard single-stage fine-tuning baseline across all translation directions. Additionally, a multi-model ensemble further improves performance by an additional BLEU score.

Detecting Localized Deepfake Manipulations Using Action Unit-Guided Video Representations

Tharun Anand,Siva Sankar,Pravin Nair

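Task: Propose a detection method specifically targeting localized edits in deepfake videos.

Motivation: With rapid progress in generative models, deepfake techniques are narrowing the gap between real and synthetic videos, raising serious privacy and security concerns; fine-grained localized manipulations in particular challenge existing detection models.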

Details

Method: Uses spatiotemporal representations guided by facial action units, fusing through cross-attention the representations learned from pretext tasks such as random masking and action unit detection, to produce embeddings that encode subtle local changes. Result: Comprehensive evaluation across multiple deepfake generation methods shows a 20% accuracy improvement in detecting locally edited deepfake videos, with competitive performance on standard datasets. Conclusion: The method sets a new benchmark for detecting locally edited deepfake videos and demonstrates robustness and generalization. Abstract: With rapid advancements in generative modeling, deepfake techniques are increasingly narrowing the gap between real and synthetic videos, raising serious privacy and security concerns. Beyond traditional face swapping and reenactment, an emerging trend in recent state-of-the-art deepfake generation methods involves localized edits such as subtle manipulations of specific facial features like raising eyebrows, altering eye shapes, or modifying mouth expressions. These fine-grained manipulations pose a significant challenge for existing detection models, which struggle to capture such localized variations. To the best of our knowledge, this work presents the first detection approach explicitly designed to generalize to localized edits in deepfake videos by leveraging spatiotemporal representations guided by facial action units. Our method leverages a cross-attention-based fusion of representations learned from pretext tasks like random masking and action unit detection, to create an embedding that effectively encodes subtle, localized changes. Comprehensive evaluations across multiple deepfake generation methods demonstrate that our approach, despite being trained solely on the traditional FF+ dataset, sets a new benchmark in detecting recent deepfake-generated videos with fine-grained local edits, achieving a $20\%$ improvement in accuracy over current state-of-the-art detection methods. Additionally, our method delivers competitive performance on standard datasets, highlighting its robustness and generalization across diverse types of local and global forgeries.

Historical Ink: Exploring Large Language Models for Irony Detection in 19th-Century Spanish

Kevin Cohen,Laura Manrique-Gómez,Rubén Manrique

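Task: Use large language models (LLMs) to enhance datasets and improve irony detection in 19th-century Latin American newspapers.

Motivation: Examine how effectively BERT and GPT-4o capture the subtleties of irony, evaluated on both multi-class and binary classification tasks.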

Details

Method: Applies two strategies: dataset enhancement (focused on emotional and contextual cues) and a semi-automated annotation process (to address class imbalance). Result: Dataset enhancement had limited effect on historical language analysis, but semi-automated annotation improved data quality and resolved class imbalance. Conclusion: By introducing a new historical Spanish dataset and a semi-automated annotation method, the study advances sentiment analysis and irony detection and underscores the importance of human expertise in refining LLM output. Abstract: This study explores the use of large language models (LLMs) to enhance datasets and improve irony detection in 19th-century Latin American newspapers. Two strategies were employed to evaluate the efficacy of BERT and GPT-4o models in capturing the subtle, nuanced nature of irony, through both multi-class and binary classification tasks. First, we implemented dataset enhancements focused on enriching emotional and contextual cues; however, these showed limited impact on historical language analysis. The second strategy, a semi-automated annotation process, effectively addressed class imbalance and augmented the dataset with high-quality annotations. Despite the challenges posed by the complexity of irony, this work contributes to the advancement of sentiment analysis through two key contributions: introducing a new historical Spanish dataset tagged for sentiment analysis and irony detection, and proposing a semi-automated annotation methodology where human expertise is crucial for refining LLM results, enriched by incorporating historical and cultural contexts as core features.

Semantic segmentation for building houses from wooden cubes

Ivan Beleacov

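Task: Compare three neural network models (U-Net(light), LinkNet, and PSPNet) on a semantic segmentation task.

Motivation: Automated construction is an important way to improve efficiency, cut costs, and reduce errors; this study lays the groundwork for automatically generating staged building plans.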

Details

Method: Trains the models on two purpose-built datasets (4 classes and 44 classes) and evaluates them with MeanIoU and F1 Score. Result: U-Net(light) performs best, with 78% MeanIoU and 87% F1 Score on the 4-class dataset, but does poorly on the 44-class dataset (17% MeanIoU, 25% F1 Score). Conclusion: Future work will extend the datasets, apply methods against overfitting, and further develop algorithms for automatically generating building plans. Abstract: Automated construction is one of the most promising areas that can improve efficiency, reduce costs and minimize errors in the process of building construction. In this paper, a comparative analysis of three neural network models for semantic segmentation, U-Net(light), LinkNet and PSPNet, is performed. Two specialized datasets with images of houses built from wooden cubes were created for the experiments. The first dataset contains 4 classes (background, foundation, walls, roof) and is designed for basic model evaluation, while the second dataset includes 44 classes where each cube is labeled as a separate object. The models were trained with the same hyperparameters and their accuracy was evaluated using MeanIoU and F1 Score metrics. According to the results obtained, U-Net(light) showed the best performance with 78% MeanIoU and 87% F1 Score on the first dataset and 17% and 25% respectively on the second dataset. The poor results on the second dataset are due to the limited amount of data, the complexity of the partitioning and the imbalance of classes, making it difficult to accurately select individual cubes. In addition, overfitting was observed in all experiments, manifested by high accuracy on the training dataset and its significant decrease on the validation dataset. The present work is the basis for the development of algorithms for automatic generation of staged building plans, which can be further scaled to design complete buildings. Future research is planned to extend the datasets and apply methods to combat overfitting (L1/L2 regularization, Early Stopping). The next stage of work will be the development of algorithms for automatic generation of a step-by-step plan for building houses from cubes using manipulators.
Index Terms: Deep Learning, Computer vision, CNN, Semantic segmentation, Construction materials.
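
For readers unfamiliar with the two metrics reported above, the following sketch shows one common way to compute them from flattened label maps. The abstract does not say how the scores were averaged, so the per-class (macro) averaging used here is an assumption, and the function names are hypothetical.

```python
def per_class_counts(y_true, y_pred, num_classes):
    """Count true positives, false positives and false negatives per class."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p where it does not belong
            fn[t] += 1  # missed a pixel of class t
    return tp, fp, fn


def mean_iou(y_true, y_pred, num_classes):
    """Mean over classes of TP / (TP + FP + FN), skipping absent classes."""
    tp, fp, fn = per_class_counts(y_true, y_pred, num_classes)
    ious = [tp[c] / (tp[c] + fp[c] + fn[c])
            for c in range(num_classes) if tp[c] + fp[c] + fn[c] > 0]
    return sum(ious) / len(ious)


def macro_f1(y_true, y_pred, num_classes):
    """Mean over classes of 2TP / (2TP + FP + FN), skipping absent classes."""
    tp, fp, fn = per_class_counts(y_true, y_pred, num_classes)
    scores = [2 * tp[c] / (2 * tp[c] + fp[c] + fn[c])
              for c in range(num_classes) if tp[c] + fp[c] + fn[c] > 0]
    return sum(scores) / len(scores)


# Toy example: six pixels, three classes (values are illustrative only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
miou = mean_iou(y_true, y_pred, num_classes=3)
f1 = macro_f1(y_true, y_pred, num_classes=3)
```

Per class, F1 is never lower than IoU, which is consistent with the F1 Score figures in the abstract being higher than the MeanIoU figures on both datasets.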

Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions

Mohammad Almansoori,Komal Kumar,Hisham Cholakkal

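Task: Introduce and evaluate MedAgentSim, an open-source simulated clinical environment for improving LLM performance in dynamic diagnostic settings.

Motivation: Mimic the real-world diagnostic process through multi-turn conversations and requests for medical examinations, improving LLM performance in dynamic diagnosis.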

Details

Method: Uses a multi-agent framework (doctor, patient, and measurement agents) combined with multi-agent discussion, chain-of-thought reasoning, and experience-based knowledge retrieval to enable progressive learning. Result: The approach is validated in simulated diagnostic scenarios, and the code, tool, and evaluation benchmark are released. Conclusion: MedAgentSim is an effective tool for improving LLM performance in dynamic diagnostic settings and supports both automated and human-in-the-loop interaction. Abstract: In this work, we introduce MedAgentSim, an open-source simulated clinical environment with doctor, patient, and measurement agents designed to evaluate and enhance LLM performance in dynamic diagnostic settings. Unlike prior approaches, our framework requires doctor agents to actively engage with patients through multi-turn conversations, requesting relevant medical examinations (e.g., temperature, blood pressure, ECG) and imaging results (e.g., MRI, X-ray) from a measurement agent to mimic the real-world diagnostic process. Additionally, we incorporate self-improvement mechanisms that allow models to iteratively refine their diagnostic strategies. We enhance LLM performance in our simulated setting by integrating multi-agent discussions, chain-of-thought reasoning, and experience-based knowledge retrieval, facilitating progressive learning as doctor agents interact with more patients. We also introduce an evaluation benchmark for assessing the LLM's ability to engage in dynamic, context-aware diagnostic interactions. While MedAgentSim is fully automated, it also supports a user-controlled mode, enabling human interaction with either the doctor or patient agent. Comprehensive evaluations in various simulated diagnostic scenarios demonstrate the effectiveness of our approach. Our code, simulation tool, and benchmark are available at https://medagentsim.netlify.app/.

Beyond Background Shift: Rethinking Instance Replay in Continual Semantic Segmentation

Hongmei Yin,Tingliang Feng,Fan Lyu,Fanhua Shang,Hongying Liu,Wei Feng,Liang Wan

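Task: Study how continual semantic segmentation (CSS) can learn new classes without forgetting knowledge of old ones.

Motivation: Storing old-class images and reusing them when training new models works for classification, but in CSS it causes unannotated classes to be confused with the background, which complicates model fitting.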

Details

Method: Proposes Enhanced Instance Replay (EIR), which stores old-class instances to eliminate background confusion and integrates the stored instances with new images to mitigate background shift. Result: EIR significantly outperforms existing CSS methods and effectively alleviates catastrophic forgetting. Conclusion: By resolving background shift, EIR improves model performance on CSS. Abstract: In this work, we focus on continual semantic segmentation (CSS), where segmentation networks are required to continuously learn new classes without erasing knowledge of previously learned ones. Although storing images of old classes and directly incorporating them into the training of new models has proven effective in mitigating catastrophic forgetting in classification tasks, this strategy presents notable limitations in CSS. Specifically, the stored and new images with partial category annotations lead to confusion between unannotated categories and the background, complicating model fitting. To tackle this issue, this paper proposes a novel Enhanced Instance Replay (EIR) method, which not only preserves knowledge of old classes while simultaneously eliminating background confusion by instance storage of old classes, but also mitigates background shifts in the new images by integrating stored instances with new images. By effectively resolving background shifts in both stored and new images, EIR alleviates catastrophic forgetting in the CSS task, thereby enhancing the model's capacity for CSS. Experimental results validate the efficacy of our approach, which significantly outperforms state-of-the-art CSS methods.

Leveraging Large Language Models for Automated Causal Loop Diagram Generation: Enhancing System Dynamics Modeling through Curated Prompting Techniques

Ning-Yuan Georgia Liu,David R. Keith

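Task: Automate the translation of dynamic hypotheses into causal loop diagrams (CLDs).

Motivation: Address the difficulty novice modelers face when extracting key variables and causal relationships from text, and thereby increase adoption of system dynamics tools.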

Details

Method: Develops and tests a method for automatically generating CLDs using large language models (LLMs) with curated prompting techniques. Result: For simple model structures and with curated prompting, LLM-generated CLDs approach the quality of expert-built ones, markedly speeding up CLD creation. Conclusion: LLMs combined with curated prompting can effectively support automated CLD generation and help novice modelers. Abstract: Transforming a dynamic hypothesis into a causal loop diagram (CLD) is crucial for System Dynamics Modelling. Extracting key variables and causal relationships from text to build a CLD is often challenging and time-consuming for novice modelers, limiting SD tool adoption. This paper introduces and tests a method for automating the translation of dynamic hypotheses into CLDs using large language models (LLMs) with curated prompting techniques. We first describe how LLMs work and how they can make the inferences needed to build CLDs using a standard digraph structure. Next, we develop a set of simple dynamic hypotheses and corresponding CLDs from leading SD textbooks. We then compare four different combinations of prompting techniques, evaluating their performance against CLDs labeled by expert modelers. Results show that for simple model structures and using curated prompting techniques, LLMs can generate CLDs of a similar quality to expert-built ones, accelerating CLD creation.
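
The "standard digraph structure" mentioned in the abstract can be made concrete with a small sketch. It assumes a CLD is stored as a dictionary mapping directed edges to a polarity of +1 or -1, which is a common convention but not necessarily the representation the authors use; the population example and the function names are illustrative. A feedback loop is reinforcing when the product of its edge polarities is positive and balancing when it is negative.

```python
def find_loops(edges):
    """Return each simple cycle once, as a list of nodes starting from
    its alphabetically smallest variable."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    loops = []

    def dfs(start, node, path):
        for nxt in graph.get(node, []):
            if nxt == start:
                loops.append(path[:])
            elif nxt not in path and nxt > start:
                # Only visit nodes larger than the start so that every
                # cycle is reported from its smallest node exactly once.
                dfs(start, nxt, path + [nxt])

    for start in sorted(graph):
        dfs(start, start, [start])
    return loops


def loop_polarity(loop, edges):
    """Multiply edge signs around the loop: positive means reinforcing."""
    sign = 1
    for i, src in enumerate(loop):
        dst = loop[(i + 1) % len(loop)]
        sign *= edges[(src, dst)]
    return "reinforcing" if sign > 0 else "balancing"


# Illustrative CLD: edge (cause, effect) -> polarity.
cld = {
    ("population", "births"): +1,
    ("births", "population"): +1,
    ("population", "deaths"): +1,
    ("deaths", "population"): -1,
}
labels = {tuple(loop): loop_polarity(loop, cld) for loop in find_loops(cld)}
```

A structure like this is also convenient for comparing a generated CLD with an expert-labeled one, since the two edge dictionaries can be compared directly.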

EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos

Yuxuan Li,Vijay Veerabadran,Michael L. Iuzzolino,Brett D. Roads,Asli Celikyilmaz,Karl Ridgeway

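Task: Present EgoToM, a new video question-answering benchmark that extends Theory-of-Mind (ToM) evaluation to egocentric domains.

Motivation: Generate multi-choice video QA instances with a causal ToM model to assess the ability to predict a camera wearer's goals, beliefs, and next actions.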

Details

Method: Using the Ego4D dataset, studies the performance of humans and state-of-the-art multimodal large language models (MLLMs) on three interconnected inference problems. Result: MLLMs approach human-level accuracy when inferring goals from egocentric video but fall short of humans when inferring the wearer's in-the-moment belief states and future actions. Conclusion: The results will inform the future design of egocentric digital assistants equipped with a model of the user's internal mental states. Abstract: We introduce EgoToM, a new video question-answering benchmark that extends Theory-of-Mind (ToM) evaluation to egocentric domains. Using a causal ToM model, we generate multi-choice video QA instances for the Ego4D dataset to benchmark the ability to predict a camera wearer's goals, beliefs, and next actions. We study the performance of both humans and state-of-the-art multimodal large language models (MLLMs) on these three interconnected inference problems. Our evaluation shows that MLLMs achieve close to human-level accuracy on inferring goals from egocentric videos. However, MLLMs (including the largest ones we tested with over 100B parameters) fall short of human performance when inferring the camera wearers' in-the-moment belief states and future actions that are most consistent with the unseen video future. We believe that our results will shape the future design of an important class of egocentric digital assistants which are equipped with a reasonable model of the user's internal mental states.

Efficient Joint Prediction of Multiple Future Tokens

Kwangjun Ahn,Alex Lamb,John Langford

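Task: Introduce joint multi-token prediction (JTP), a lightweight modification of standard next-token prediction that enriches hidden state representations by jointly predicting multiple future tokens.

Motivation: Existing multi-token prediction methods fail to achieve a short-horizon belief state representation; JTP achieves it with minimal computational overhead by combining a carefully designed representation bottleneck with teacher forcing.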

Details

Method: JTP jointly predicts multiple future tokens by teacher-forcing future tokens through a representation bottleneck, enriching hidden state representations. Result: On the synthetic star graph navigation task, JTP significantly outperforms existing methods. Conclusion: JTP offers promising preliminary results intended to stimulate further research. Abstract: In this short report, we introduce joint multi-token prediction (JTP), a lightweight modification of standard next-token prediction designed to enrich hidden state representations by jointly predicting multiple future tokens. Unlike previous multi-token prediction approaches, JTP strategically employs teacher forcing of future tokens through a carefully designed representation bottleneck, allowing the model to encode rich predictive information with minimal computational overhead during training. We show that the JTP approach achieves a short-horizon belief state representation, while popular alternatives for multi-token prediction fail to do so. We demonstrate the effectiveness of our method on the synthetic star graph navigation task from Bachmann and Nagarajan [2024], highlighting a significant performance improvement over existing methods. This manuscript presents promising preliminary results intended to stimulate further research.

Permutation-Invariant and Orientation-Aware Dataset Distillation for 3D Point Clouds

Jae-Young Yim,Dongwook Kim,Jae-Young Sim

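Task: Propose a distribution matching-based dataset distillation method for 3D point clouds.

Motivation: Dataset distillation for 3D point clouds remains largely unexplored because of the unordered structure of the points.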

Details

Method: Jointly optimizes the geometric structure of the synthetic dataset and the orientations of the synthetic models, using a permutation-invariant distribution matching loss and learnable rotation angles to adjust model orientation. Result: Experiments on several benchmark datasets (ModelNet10, ModelNet40, ShapeNet, ScanObjectNN) show that the method outperforms existing ones. Conclusion: The method performs well for 3D point cloud dataset distillation and offers a new solution for the field. Abstract: Training deep neural networks for various applications requires collecting large amounts of data. Recently, dataset distillation for images and texts, which reduces the original dataset to a synthetic dataset while preserving essential task-relevant information, has been attracting a lot of attention. However, 3D point cloud distillation is almost unexplored due to the challenges of unordered structures of points. In this paper, we propose a novel distribution matching-based dataset distillation method for 3D point clouds that jointly optimizes the geometric structures of the synthetic dataset as well as the orientations of synthetic models. To ensure consistent feature alignment between different 3D point cloud models, we devise a permutation-invariant distribution matching loss with sorted feature vectors. We also employ learnable rotation angles to transform each synthetic model according to the optimal orientation best representing the original feature distribution. Extensive experimental results on four widely used benchmark datasets, including ModelNet10, ModelNet40, ShapeNet, and ScanObjectNN, demonstrate that the proposed method consistently outperforms the existing methods.
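
Why sorting feature vectors yields permutation invariance can be seen in a small sketch. The abstract does not give the exact loss, so this is only one plausible reading under stated assumptions: each point has a D-dimensional feature vector, each channel is sorted across points, and the loss is the mean squared difference between the sorted features of two clouds with the same number of points. The names `sorted_features` and `matching_loss` are hypothetical.

```python
def sorted_features(features):
    """Sort each feature channel across points.

    features: list of N per-point feature vectors, each of length D.
    Returns D lists of N values; the result is independent of point order.
    """
    return [sorted(channel) for channel in zip(*features)]


def matching_loss(real, synthetic):
    """Mean squared difference between channel-wise sorted features.

    Assumes both clouds have the same number of points and channels.
    """
    a = sorted_features(real)
    b = sorted_features(synthetic)
    total = sum((x - y) ** 2
                for chan_a, chan_b in zip(a, b)
                for x, y in zip(chan_a, chan_b))
    count = sum(len(chan) for chan in a)
    return total / count


# Toy per-point features (3 points, 2 channels); values are illustrative.
real = [[1.0, 4.0], [3.0, 2.0], [2.0, 0.0]]
synthetic = [[2.0, 0.0], [1.0, 5.0], [3.0, 2.0]]
shuffled = [synthetic[2], synthetic[0], synthetic[1]]  # same points, new order

loss = matching_loss(real, synthetic)
loss_shuffled = matching_loss(real, shuffled)
```

Reordering the points of either cloud leaves the loss unchanged, so the comparison depends only on the set of points and not on how they happen to be stored.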

Taxonomy Inference for Tabular Data Using Large Language Models

Zhenyu Wu,Jiaoyan Chen,Norman W. Paton

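Task: Propose two LLM-based methods (EmTT and GeTT) for taxonomy inference on tabular data.

Motivation: Existing schema inference systems focus mainly on XML, JSON, or RDF data and mostly compute similarity from the lexical formats and structures of the data, making limited use of the semantics of text in tables.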

Details

Method: EmTT embeds columns by fine-tuning encoders (such as BERT) with contrastive learning and builds the hierarchy by clustering; GeTT generates table entity types and their hierarchy by iteratively prompting a decoder (such as GPT-4). Result: Extensive evaluation on three real-world datasets shows that the taxonomies produced by EmTT and GeTT are strongly consistent with the ground truth. Conclusion: EmTT and GeTT effectively address taxonomy inference for tabular data, supporting applications such as data management and exploration. Abstract: Taxonomy inference for tabular data is a critical task of schema inference, aiming at discovering entity types (i.e., concepts) of the tables and building their hierarchy. It can play an important role in data management, data exploration, ontology learning, and many data-centric applications. Existing schema inference systems focus more on XML, JSON or RDF data, and often rely on lexical formats and structures of the data for calculating similarities, with limited exploitation of the semantics of the text across a table. Motivated by recent works on taxonomy completion and construction using Large Language Models (LLMs), this paper presents two LLM-based methods for taxonomy inference for tables: (i) EmTT, which embeds columns by fine-tuning encoder-alone LLMs like BERT with contrastive learning and utilises clustering for hierarchy construction, and (ii) GeTT, which generates table entity types and their hierarchy by iterative prompting using a decoder-alone LLM like GPT-4. Extensive evaluation on three real-world datasets with six metrics covering different aspects of the output taxonomies has demonstrated that EmTT and GeTT can both produce taxonomies with strong consistency relative to the Ground Truth.

Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis

Woojung Han,Yeonkyung Lee,Chanyoung Kim,Kwanghyun Park,Seong Jae Hwang

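Task: Propose STORM, a training-free method that addresses inaccurate object placement in text-to-image generation.

Motivation: Existing methods handle missing objects and mismatched attributes well, but inaccurate object placement remains unsolved, largely because explicit spatial guidance is hard to express in a text prompt.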

Details

Method: STORM uses Spatial Transport Optimization (STO), grounded in optimal transport theory, to dynamically adjust object attention maps, together with a Spatial Transport cost function that strengthens spatial understanding. Result: Experiments show that STORM outperforms existing methods on spatial alignment while also improving on missing objects and mismatched attributes. Conclusion: STORM offers a new solution to spatial alignment in text-to-image synthesis, and experiments validate its effectiveness. Abstract: Diffusion-based text-to-image (T2I) models have recently excelled in high-quality image generation, particularly in a training-free manner, enabling cost-effective adaptability and generalization across diverse tasks. However, while the existing methods have been continuously focusing on several challenges, such as "missing objects" and "mismatched attributes," another critical issue of "mislocated objects" remains where generated spatial positions fail to align with text prompts. Surprisingly, ensuring such seemingly basic functionality remains challenging in popular T2I models due to the inherent difficulty of imposing explicit spatial guidance via text forms. To address this, we propose STORM (Spatial Transport Optimization by Repositioning Attention Map), a novel training-free approach for spatially coherent T2I synthesis. STORM employs Spatial Transport Optimization (STO), rooted in optimal transport theory, to dynamically adjust object attention maps for precise spatial adherence, supported by a Spatial Transport (ST) Cost function that enhances spatial understanding. Our analysis shows that integrating spatial awareness is most effective in the early denoising stages, while later phases refine details. Extensive experiments demonstrate that STORM surpasses existing methods, effectively mitigating mislocated objects while improving missing and mismatched attributes, setting a new benchmark for spatial alignment in T2I synthesis.

OntoAligner: A Comprehensive Modular and Robust Python Toolkit for Ontology Alignment

Hamed Babaei Giglou,Jennifer D'Souza,Oliver Karras,Sören Auer

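Task: Introduce and evaluate OntoAligner, a Python toolkit for ontology alignment.

Motivation: Address the limitations of existing ontology alignment tools in scalability, modularity, and integration with recent AI techniques.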

Details

Method: OntoAligner has a flexible architecture that integrates lightweight ontology alignment techniques (such as fuzzy matching) with contemporary methods (such as retrieval-augmented generation and large language models). Result: Evaluation shows that OntoAligner handles large-scale ontologies efficiently, with concise code and high alignment quality. Conclusion: As an open-source tool, OntoAligner aims to foster innovation and collaboration in ontology alignment and to support reproducible research and real-world applications. Abstract: Ontology Alignment (OA) is fundamental for achieving semantic interoperability across diverse knowledge systems. We present OntoAligner, a comprehensive, modular, and robust Python toolkit for ontology alignment, designed to address the limitations practitioners face with existing tools. Existing tools are limited in scalability, modularity, and ease of integration with recent AI advances. OntoAligner provides a flexible architecture integrating existing lightweight OA techniques such as fuzzy matching but goes beyond by supporting contemporary methods with retrieval-augmented generation and large language models for OA. The framework prioritizes extensibility, enabling researchers to integrate custom alignment algorithms and datasets. This paper details the design principles, architecture, and implementation of OntoAligner, demonstrating its utility through benchmarks on standard OA tasks. Our evaluation highlights OntoAligner's ability to handle large-scale ontologies efficiently with few lines of code while delivering high alignment quality. By making OntoAligner open-source, we aim to provide a resource that fosters innovation and collaboration within the OA community, empowering researchers and practitioners with a toolkit for reproducible OA research and real-world applications.
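
As a rough illustration of the kind of lightweight fuzzy matching the abstract refers to, the sketch below aligns concept labels from two ontologies by string similarity. It does not use or reproduce the OntoAligner API: it relies only on Python's standard `difflib` module, and the function name `fuzzy_align`, the example labels, and the 0.8 threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher


def fuzzy_align(source_labels, target_labels, threshold=0.8):
    """For each source label, keep the most similar target label if its
    similarity ratio reaches the threshold.

    Returns a list of (source, target, score) tuples.
    """
    alignments = []
    for source in source_labels:
        best_target, best_score = None, 0.0
        for target in target_labels:
            # Case-insensitive similarity in [0, 1].
            score = SequenceMatcher(None, source.lower(), target.lower()).ratio()
            if score > best_score:
                best_target, best_score = target, score
        if best_target is not None and best_score >= threshold:
            alignments.append((source, best_target, round(best_score, 3)))
    return alignments


# Hypothetical concept labels from two small ontologies.
source = ["Heart Disease", "Lung", "Blood Vessel"]
target = ["heart diseases", "Lungs", "Kidney"]
matches = fuzzy_align(source, target)
pairs = [(s, t) for s, t, _ in matches]
```

Purely lexical matching of this kind is cheap but misses synonyms and paraphrases, which is the gap that retrieval-augmented and LLM-based aligners are meant to cover.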

An Empirical Study of Validating Synthetic Data for Text-Based Person Retrieval

Min Cao,ZiYin Zeng,YuXin Lu,Mang Ye,Dong Yi,Jinqiao Wang

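Task: Explore the potential of synthetic data for Text-Based Person Retrieval (TBPR) and propose pipelines for generating and augmenting synthetic data.

Motivation: Mainstream methods rely on real data and manual annotation, which raises privacy and labor concerns, and existing synthetic-data methods still depend on real data, limiting diversity and exploration.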

Details

Method: Proposes an inter-class image generation pipeline and an intra-class image augmentation pipeline, combined with automatic text generation, to explore the effectiveness of synthetic data. Result: Experiments validate synthetic data in a range of scenarios, and noise-robust learning strategies are investigated. Conclusion: The proposed methods and the resulting large-scale synthetic dataset are expected to advance practical TBPR research. Abstract: Data plays a pivotal role in Text-Based Person Retrieval (TBPR) research. The mainstream research paradigm requires real-world person images with manual textual annotations for training models, posing privacy-sensitive and labor-intensive issues. Several pioneering efforts explore synthetic data for TBPR but still rely on real data, keeping the aforementioned issues and also resulting in a diversity-deficient issue in synthetic datasets, thus impacting TBPR performance. Moreover, these works tend to explore synthetic data for TBPR through limited perspectives, leading to an exploration-restricted issue. In this paper, we conduct an empirical study to explore the potential of synthetic data for TBPR, highlighting three key aspects. (1) We propose an inter-class image generation pipeline, in which an automatic prompt construction strategy is introduced to guide generative Artificial Intelligence (AI) models in generating various inter-class images without reliance on original data. (2) We develop an intra-class image augmentation pipeline, in which the generative AI models are applied to further edit the images for obtaining various intra-class images. (3) Building upon the proposed pipelines and an automatic text generation pipeline, we explore the effectiveness of synthetic data in diverse scenarios through extensive experiments. Additionally, we experimentally investigate various noise-robust learning strategies to mitigate the inherent noise in synthetic data. We will release the code, along with the synthetic large-scale dataset generated by our pipelines, which are expected to advance practical TBPR research.

Socially Constructed Treatment Plans: Analyzing Online Peer Interactions to Understand How Patients Navigate Complex Medical Conditions

Madhusudan Basak,Omar Sharif,Jessica Hulsey,Elizabeth C. Saunders,Daisy J. Goodman,Luke J. Archibald,Sarah M. Preum

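Task: Study how treatment plans for complex medical conditions are "socially constructed" in online patient communities and how they deviate from clinical guidelines.

Motivation: Explore how patients with complex medical conditions seek peer support in online communities, and analyze the social construction of their treatment plans and its implications.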

Details

Method: Combines content analysis of online discourse, ethnographic studies, in-depth interviews, and an evaluation of a large language model (LLM). Result: Characterizes the social construction of treatment plans in online patient communities and their deviations from clinical guidelines, and assesses how far an LLM reflects this kind of knowledge. Conclusion: Highlights critical research directions for patient-centered communication in online health communities. Abstract: When faced with complex and uncertain medical conditions (e.g., cancer, mental health conditions, recovery from substance dependency), millions of patients seek online peer support. In this study, we leverage content analysis of online discourse and ethnographic studies with clinicians and patient representatives to characterize how treatment plans for complex conditions are "socially constructed." Specifically, we ground online conversations about medication-assisted recovery treatment in medication guidelines and subsequently surface when and why people deviate from the clinical guidelines. We characterize the implications and effectiveness of socially constructed treatment plans through in-depth interviews with clinical experts. Finally, given the enthusiasm around AI-powered solutions for patient communication, we investigate whether and how socially constructed treatment-related knowledge is reflected in a state-of-the-art large language model (LLM). Leveraging a novel mixed-method approach, this study highlights critical research directions for patient-centered communication in online health communities.

Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation

Minho Park,Sunghyun Park,Jungsoo Lee,Hyojin Park,Kyuwoong Hwang,Fatih Porikli,Jaegul Choo,Sungha Choi

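Task: Generate datasets with text-to-image (T2I) generation models to address data scarcity in semantic segmentation.

Motivation: Reduce image acquisition and labeling costs, and address two challenges: aligning generated samples with the target domain, and producing informative samples beyond the training data.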

Details

Method: Proposes Concept-Aware LoRA (CA-LoRA), a fine-tuning approach that selectively updates only the weights associated with the necessary concepts, preserving the pretrained knowledge of the T2I model. Result: Performs strongly in dataset generation for urban-scene segmentation, outperforming baseline and state-of-the-art methods, especially under challenging conditions such as adverse weather and varying illumination. Conclusion: CA-LoRA generates diverse samples aligned with the target domain, providing high-quality datasets for semantic segmentation. Abstract: This paper addresses the challenge of data scarcity in semantic segmentation by generating datasets through text-to-image (T2I) generation models, reducing image acquisition and labeling costs. Segmentation dataset generation faces two key challenges: 1) aligning generated samples with the target domain and 2) producing informative samples beyond the training data. Fine-tuning T2I models can help generate samples aligned with the target domain. However, fine-tuning often overfits and memorizes training data, limiting the models' ability to generate diverse and well-aligned samples. To overcome these issues, we propose Concept-Aware LoRA (CA-LoRA), a novel fine-tuning approach that selectively identifies and updates only the weights associated with necessary concepts (e.g., style or viewpoint) for domain alignment while preserving the pretrained knowledge of the T2I model to produce informative samples. We demonstrate its effectiveness in generating datasets for urban-scene segmentation, outperforming baseline and state-of-the-art methods in in-domain (few-shot and fully-supervised) settings, as well as in domain generalization tasks, especially under challenging conditions such as adverse weather and varying illumination, further highlighting its superiority.

Debate-Driven Multi-Agent LLMs for Phishing Email Detection

Ngoc Tuong Vy Nguyen,Felix D Childress,Yunting Yin

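Task: Propose a debate technique based on multi-agent large language models (LLMs) for detecting whether an email is phishing.

Motivation: Traditional detection methods (such as rule-based systems or supervised learning models) are limited: they rely on predefined patterns or require large training datasets, and they still produce false positives and false negatives.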

Details

Method: Two LLM agents present arguments for and against the classification, and a judge agent delivers the final verdict based on the quality of the reasoning. Result: Evaluation on multiple phishing email datasets shows that mixed-agent configurations outperform homogeneous ones, and that the debate structure alone is sufficient for accurate decisions. Conclusion: The multi-agent debate mechanism improves phishing email detection accuracy without extra prompting strategies. Abstract: Phishing attacks remain a critical cybersecurity threat. Attackers constantly refine their methods, making phishing emails harder to detect. Traditional detection methods, including rule-based systems and supervised machine learning models, either rely on predefined patterns like blacklists, which can be bypassed with slight modifications, or require large datasets for training and can still generate false positives and false negatives. In this work, we propose a multi-agent large language model (LLM) prompting technique that simulates debates among agents to detect whether the content of an email is phishing. Our approach uses two LLM agents to present arguments for or against the classification task, with a judge agent adjudicating the final verdict based on the quality of reasoning provided. This debate mechanism enables the models to critically analyze contextual cues and deceptive patterns in text, which leads to improved classification accuracy. The proposed framework is evaluated on multiple phishing email datasets, and the results demonstrate that mixed-agent configurations consistently outperform homogeneous configurations. Results also show that the debate structure itself is sufficient to yield accurate decisions without extra prompting strategies.
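
The debate protocol described in this abstract (two arguing agents plus a judge) can be sketched as a small control loop. This is a minimal illustration only: the `debate` function, the prompt wording, the number of rounds, and the stub agents are all assumptions made for the example, and real LLM calls would replace the stubs.

```python
from typing import Callable, List

# An "agent" is any callable mapping a prompt string to a reply string.
# In practice this would wrap an LLM API call; here it is left abstract.
Agent = Callable[[str], str]

def debate(email: str, pro: Agent, con: Agent, judge: Agent, rounds: int = 2) -> str:
    """Run a fixed number of debate rounds, then ask the judge for a verdict."""
    transcript: List[str] = []
    for _ in range(rounds):
        history = "\n".join(transcript)
        transcript.append("PRO: " + pro(f"Argue the email is phishing.\nEmail: {email}\n{history}"))
        history = "\n".join(transcript)
        transcript.append("CON: " + con(f"Argue the email is legitimate.\nEmail: {email}\n{history}"))
    verdict = judge("Decide 'phishing' or 'legitimate' from this debate.\n" + "\n".join(transcript))
    # Only the judge's reply is inspected, never the prompt that was sent to it.
    return "phishing" if "phishing" in verdict.lower() else "legitimate"

# Hypothetical stub agents standing in for real models.
def stub_pro(prompt: str) -> str:
    return "The link domain does not match the claimed sender."

def stub_con(prompt: str) -> str:
    return "The greeting uses the recipient's real name."

def stub_judge(prompt: str) -> str:
    return "Phishing: the domain mismatch is the stronger evidence."

label = debate("Your account is locked, click here.", stub_pro, stub_con, stub_judge)
```

A mixed-agent configuration in this sketch would simply mean passing callables backed by different models for `pro`, `con`, and `judge`.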

Synergistic Bleeding Region and Point Detection in Surgical Videos

Jialun Pei,Zhangjun Zhou,Diandian Guo,Zhixi Li,Jing Qin,Bo Du,Pheng-Ann Heng

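Task: Develop BlooDet, a dual-task synergistic online detector that simultaneously detects bleeding regions and bleeding points in surgical videos.

Motivation: Intraoperative bleeding in laparoscopic surgery rapidly obscures the operative field and hinders the procedure. Intelligent detection of bleeding regions can quantify blood loss to assist decision-making, while locating the bleeding point helps surgeons quickly identify the source of bleeding and achieve hemostasis in time.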

Details

Method: A dual-branch bidirectional guidance design based on Segment Anything Model 2 (SAM 2), with a mask branch that detects bleeding regions and a point branch that models bleeding point memory; the two branches explore spatial-temporal relationships through interactive guidance and prompts. Result: On the SurgBlood dataset, BlooDet achieves 64.88% IoU for bleeding region detection and 83.69% PCK-10% for bleeding point detection. Conclusion: The dual-branch synergistic design improves both bleeding region and bleeding point detection, supporting intraoperative decision-making. Abstract: Intraoperative bleeding in laparoscopic surgery rapidly obscures the operative field and hinders the surgical process. Intelligent detection of bleeding regions can quantify the blood loss to assist decision-making, while locating the bleeding point helps surgeons quickly identify the source of bleeding and achieve hemostasis in time. In this study, we first construct a real-world surgical bleeding detection dataset, named SurgBlood, comprising 5,330 frames from 95 surgical video clips with bleeding region and point annotations. Accordingly, we develop a dual-task synergistic online detector called BlooDet, designed to perform simultaneous detection of bleeding regions and points in surgical videos. Our framework embraces a dual-branch bidirectional guidance design based on Segment Anything Model 2 (SAM 2). The mask branch detects bleeding regions through adaptive edge and point prompt embeddings, while the point branch leverages mask memory to induce bleeding point memory modeling and captures the direction of bleeding point movement through inter-frame optical flow. Through interactive guidance and prompts, the two branches explore potential spatial-temporal relationships while leveraging memory modeling from previous frames to infer the current bleeding condition. Extensive experiments demonstrate that our approach outperforms other counterparts on SurgBlood in both bleeding region and point detection tasks, e.g., achieving 64.88% IoU for bleeding region detection and 83.69% PCK-10% for bleeding point detection.

REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation

Puzhen Yuan,Angyuan Ma,Yunchao Yao,Huaxiu Yao,Masayoshi Tomizuka,Mingyu Ding

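Task: Propose REMAC, an adaptive multi-agent planning framework for efficient, scene-agnostic multi-robot long-horizon task planning and execution.

Motivation: Existing methods rely on prior environmental knowledge or task-specific prompts, so they struggle with dynamic scene changes or unexpected task conditions and lack adaptability and efficiency.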

Details

Method: REMAC comprises a self-reflection module (pre-condition and post-condition checks in the loop) and a self-evolvement module (dynamic plan adaptation), and supports parallel multi-robot collaboration. Result: In a multi-agent environment built on RoboCasa, REMAC raises average success rates by 40% and execution efficiency by 52.7% over the single-robot baseline. Conclusion: Through continuous reflection and self-evolution, REMAC improves the adaptability and execution efficiency of multi-robot long-horizon tasks. Abstract: Vision-language models (VLMs) have demonstrated remarkable capabilities in robotic planning, particularly for long-horizon tasks that require a holistic understanding of the environment for task decomposition. Existing methods typically rely on prior environmental knowledge or carefully designed task-specific prompts, making them struggle with dynamic scene changes or unexpected task conditions, e.g., a robot attempts to put a carrot in the microwave but finds the door closed. Such challenges underscore two critical issues: adaptability and efficiency. To address them, in this work, we propose an adaptive multi-agent planning framework, termed REMAC, that enables efficient, scene-agnostic multi-robot long-horizon task planning and execution through continuous reflection and self-evolution. REMAC incorporates two key modules: a self-reflection module performing pre-condition and post-condition checks in the loop to evaluate progress and refine plans, and a self-evolvement module dynamically adapting plans based on scene-specific reasoning. It offers several appealing benefits: 1) Robots can initially explore and reason about the environment without complex prompt design. 2) Robots can keep reflecting on potential planning errors and adapting the plan based on task-specific insights. 3) After iterations, a robot can call another one to coordinate tasks in parallel, maximizing the task execution efficiency. To validate REMAC's effectiveness, we build a multi-agent environment for long-horizon robot manipulation and navigation based on RoboCasa, featuring 4 task categories with 27 task styles and 50+ different objects. Based on it, we further benchmark state-of-the-art reasoning models, including DeepSeek-R1, o3-mini, QwQ, and Grok3, demonstrating REMAC's superiority by boosting average success rates by 40% and execution efficiency by 52.7% over the single-robot baseline.

Efficient Continual Learning through Frequency Decomposition and Integration

Ruiqi Liu,Boyu Diao,Libo Huang,Hangda Liu,Chuanguang Yang,Zhulin An,Yongjun Xu

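Task: Propose FDINet, a new framework that improves task adaptation in continual learning through frequency decomposition and integration.

Motivation: Address forgetting during task adaptation in continual learning, and improve the efficiency of rehearsal-based methods in resource-constrained environments.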

Details

Method: Designs the FDINet framework, in which two lightweight networks separately process the low- and high-frequency components of images, and combines it with rehearsal-based methods. Result: FDINet reduces backbone parameters by 78%, improves accuracy by up to 7.49%, lowers peak memory usage by up to 80%, and speeds up training on edge devices by up to 5x. Conclusion: The frequency-aware design improves cross-task generalization, preserves fine details, and enables efficient training. Abstract: Continual learning (CL) aims to learn new tasks while retaining past knowledge, addressing the challenge of forgetting during task adaptation. Rehearsal-based methods, which replay previous samples, effectively mitigate forgetting. However, research on enhancing the efficiency of these methods, especially in resource-constrained environments, remains limited, hindering their application in real-world systems with dynamic data streams. The human perceptual system processes visual scenes through complementary frequency channels: low-frequency signals capture holistic cues, while high-frequency components convey structural details vital for fine-grained discrimination. Inspired by this, we propose the Frequency Decomposition and Integration Network (FDINet), a novel framework that decomposes and integrates information across frequencies. FDINet designs two lightweight networks to independently process low- and high-frequency components of images. When integrated with rehearsal-based methods, this frequency-aware design effectively enhances cross-task generalization through low-frequency information, preserves class-specific details using high-frequency information, and facilitates efficient training due to its lightweight architecture. Experiments demonstrate that FDINet reduces backbone parameters by 78%, improves accuracy by up to 7.49% over state-of-the-art (SOTA) methods, and decreases peak memory usage by up to 80%. Additionally, on edge devices, FDINet accelerates training by up to 5$\times$.
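
To make the low/high-frequency split concrete, here is a minimal sketch that separates a grayscale image into a smooth part and a detail part. The abstract does not say which decomposition FDINet uses, so the box blur below is only a stand-in low-pass filter, and the names `box_blur` and `split_frequencies` are hypothetical.

```python
def box_blur(img, k=1):
    """Mean filter with a (2k+1) x (2k+1) window, clipped at the image borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - k), min(h, y + k + 1))
                    for i in range(max(0, x - k), min(w, x + k + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out

def split_frequencies(img, k=1):
    """Return (low, high) with low = blurred image and high = residual detail."""
    low = box_blur(img, k)
    high = [[img[y][x] - low[y][x] for x in range(len(img[0]))]
            for y in range(len(img))]
    return low, high

# A tiny image with one sharp vertical edge: the edge shows up in `high`.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
low, high = split_frequencies(image)

# A constant image has no detail, so its high-frequency part is zero.
flat_low, flat_high = split_frequencies([[5.0] * 3 for _ in range(3)])
```

In a two-branch design, `low` and `high` would each feed a separate lightweight network; by construction `low + high` reproduces the input exactly.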

Convolutional optimization with convex kernel and power lift

Zhipeng Lu

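Task: Establish the foundational paradigm of a novel optimization theory based on convolution with convex kernels.

Motivation: Devise a morally deterministic model for locating the global optima of an arbitrary function, in contrast to the commonly used statistical models.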

Details

Method: Builds a theoretical framework based on convolution with convex kernels and tests the efficiency of specific algorithms derived from it. Result: Limited preliminary numerical results are provided to test the efficiency of the algorithms. Conclusion: The authors hope to stimulate further practical interest and research. Abstract: We focus on establishing the foundational paradigm of a novel optimization theory based on convolution with convex kernels. Our goal is to devise a morally deterministic model of locating the global optima of an arbitrary function, which is distinguished from most commonly used statistical models. Limited preliminary numerical results are provided to test the efficiency of some specific algorithms derived from our paradigm, which we hope will stimulate further practical interest.

High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning

Dailan He,Xiahong Wang,Shulun Wang,Guanglu Song,Bingqi Ma,Hao Shao,Yu Liu,Hongsheng Li

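Task: Propose a diffusion-based identity-constrained attribute-tuning framework that resolves the conflict between identity and attribute conditioning in face swapping.

Motivation: Diffusion models perform well in face swapping, but they face two challenges: identity preservation must take priority over target attributes, and identity conditioning conflicts with attribute conditioning.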

Details

Method: An identity-constrained attribute-tuning framework that first ensures identity preservation and then aligns attributes, using decoupled condition injection and a post-training refinement stage to improve fidelity. Result: The proposed model outperforms existing methods in both qualitative and quantitative evaluations, achieving state-of-the-art high-fidelity face swapping. Conclusion: The framework resolves the conflict between identity and attributes and provides a high-quality solution for face swapping. Abstract: Face swapping aims to seamlessly transfer a source facial identity onto a target while preserving target attributes such as pose and expression. Diffusion models, known for their superior generative capabilities, have recently shown promise in advancing face-swapping quality. This paper addresses two key challenges in diffusion-based face swapping: the prioritized preservation of identity over target attributes and the inherent conflict between identity and attribute conditioning. To tackle these issues, we introduce an identity-constrained attribute-tuning framework for face swapping that first ensures identity preservation and then fine-tunes for attribute alignment, achieved through a decoupled condition injection. We further enhance fidelity by incorporating identity and adversarial losses in a post-training refinement stage. Our proposed identity-constrained diffusion-based face-swapping model outperforms existing methods in both qualitative and quantitative evaluations, demonstrating superior identity similarity and attribute consistency, achieving a new state-of-the-art performance in high-fidelity face swapping.

Learning to Instruct for Visual Instruction Tuning

Zhihan Zhou,Feng Hong,Jiaan Luo,Jiangchao Yao,Dongsheng Li,Bo Han,Ya Zhang,Yanfeng Wang

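Task: Propose LIT, an improvement to visual instruction tuning (VIT) that addresses its overfitting and shortcut learning.

Motivation: Current VIT designs overemphasize instruction-following ability and neglect the proactive understanding of visual information, which degrades performance.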

Details

Method: LIT applies the loss function to both the instruction and response sequences, which expands the training data and reduces reliance on language priors. Result: LIT achieves a relative improvement of up to 9% on comprehensive multimodal benchmarks with no additional training data and negligible computational overhead, improves captioning performance by up to 18%, and reduces hallucination. Conclusion: LIT is a simple and effective method that improves the performance of multimodal LLMs. Abstract: We propose LIT, an advancement of visual instruction tuning (VIT). While VIT equips Multimodal LLMs (MLLMs) with promising multimodal capabilities, the current design choices for VIT often result in overfitting and shortcut learning, potentially degrading performance. This gap arises from an overemphasis on instruction-following abilities, while neglecting the proactive understanding of visual information. Inspired by this, LIT adopts a simple yet effective approach by incorporating the loss function into both the instruction and response sequences. It seamlessly expands the training data, and regularizes the MLLMs against over-reliance on language priors. Based on this merit, LIT achieves a significant relative improvement of up to 9% on comprehensive multimodal benchmarks, requiring no additional training data and incurring negligible computational overhead. Surprisingly, LIT attains exceptional fundamental visual capabilities, yielding up to an 18% improvement in captioning performance, while simultaneously alleviating hallucination in MLLMs.
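
The difference between supervising only response tokens and supervising both instruction and response tokens comes down to which positions enter the loss. The sketch below illustrates that masking choice on made-up per-token losses; the function name, the role labels, and the plain averaging are assumptions for illustration, since the abstract does not specify how LIT weights or normalizes the two parts.

```python
def masked_mean_loss(token_losses, roles, supervised_roles):
    """Average the per-token losses over positions whose role is supervised."""
    kept = [loss for loss, role in zip(token_losses, roles) if role in supervised_roles]
    return sum(kept) / len(kept)

# Toy sequence: one image placeholder, two instruction tokens, two response tokens.
# The loss values are invented numbers, not model outputs.
token_losses = [0.0, 2.0, 1.0, 3.0, 1.0]
roles = ["image", "instruction", "instruction", "response", "response"]

# Conventional setup: only response tokens contribute to the loss.
response_only = masked_mean_loss(token_losses, roles, {"response"})

# LIT-style setup as described in the abstract: instruction tokens contribute too.
instruction_and_response = masked_mean_loss(token_losses, roles, {"instruction", "response"})
```

Because the instruction tokens follow the image in the sequence, predicting them requires attending to visual content, which is the intuition the abstract gives for reduced reliance on language priors.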

Knowledge Rectification for Camouflaged Object Detection: Unlocking Insights from Low-Quality Data

Juwei Guan,Xiaolin Fang,Donghyun Kim,Haotian Gong,Tongxin Zhu,Zhen Ling,Ming Yang

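Task: Propose KRNet, a framework designed specifically for camouflaged object detection (COD) on low-quality data.

Motivation: Low-quality data lack image detail, which complicates camouflaged object detection; existing methods mainly target high-quality data and degrade significantly on low-quality data.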

Details

Method: KRNet uses a Leader-Follower framework in which the Leader extracts dual gold-standard distributions (conditional and hybrid) from high-quality data to drive the Follower in rectifying knowledge learned from low-quality data, combined with a cross-consistency strategy and a time-dependent conditional encoder. Result: Experiments on benchmark datasets show that KRNet outperforms existing COD methods and super-resolution-assisted COD approaches. Conclusion: KRNet effectively addresses the challenges of low-quality data in COD. Abstract: Low-quality data often suffer from insufficient image details, introducing an extra implicit aspect of camouflage that complicates camouflaged object detection (COD). Existing COD methods focus primarily on high-quality data, overlooking the challenges posed by low-quality data, which leads to significant performance degradation. Therefore, we propose KRNet, the first framework explicitly designed for COD on low-quality data. KRNet presents a Leader-Follower framework where the Leader extracts dual gold-standard distributions, conditional and hybrid, from high-quality data to drive the Follower in rectifying knowledge learned from low-quality data. The framework further benefits from a cross-consistency strategy that improves the rectification of these distributions and a time-dependent conditional encoder that enriches the distribution diversity. Extensive experiments on benchmark datasets demonstrate that KRNet outperforms state-of-the-art COD methods and super-resolution-assisted COD approaches, proving its effectiveness in tackling the challenges of low-quality data in COD.

Spend Your Budget Wisely: Towards an Intelligent Distribution of the Privacy Budget in Differentially Private Text Rewriting

Stephen Meisenbacher,Chaeeun Joy Lee,Florian Matthes

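Task: Study how to rewrite text under Differential Privacy (DP) guarantees so that sensitive information is hidden while semantics are preserved.

Motivation: Hide explicit and implicit identifiers in text while retaining the semantic meaning and utility of the original.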

Details

Method: Constructs and evaluates a toolkit of linguistics- and NLP-based methods for intelligently allocating a privacy budget to the constituent tokens of a text. Result: Experiments show that, for the same privacy budget, intelligent distribution yields higher privacy levels and better privacy-utility trade-offs than naive distribution. Conclusion: Highlights the intricacies of text privatization and calls for further work on using DP more efficiently in text rewriting. Abstract: The task of $\textit{Differentially Private Text Rewriting}$ is a class of text privatization techniques in which (sensitive) input textual documents are $\textit{rewritten}$ under Differential Privacy (DP) guarantees. The motivation behind such methods is to hide both explicit and implicit identifiers that could be contained in text, while still retaining the semantic meaning of the original text, thus preserving utility. Recent years have seen an uptick in research output in this field, offering a diverse array of word-, sentence-, and document-level DP rewriting methods. Common to these methods is the selection of a privacy budget (i.e., the $\varepsilon$ parameter), which governs the degree to which a text is privatized. One major limitation of previous works, stemming directly from the unique structure of language itself, is the lack of consideration of $\textit{where}$ the privacy budget should be allocated, as not all aspects of language, and therefore text, are equally sensitive or personal. In this work, we are the first to address this shortcoming, asking the question of how a given privacy budget can be intelligently and sensibly distributed amongst a target document. We construct and evaluate a toolkit of linguistics- and NLP-based methods used to allocate a privacy budget to constituent tokens in a text document. In a series of privacy and utility experiments, we empirically demonstrate that given the same privacy budget, intelligent distribution leads to higher privacy levels and more positive trade-offs than a naive distribution of $\varepsilon$. Our work highlights the intricacies of text privatization with DP, and furthermore, it calls for further work on finding more efficient ways to maximize the privatization benefits offered by DP in text rewriting.
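
One way to picture "distributing" a privacy budget is to split a document-level $\varepsilon$ into per-token shares that sum to the total, giving sensitive tokens a smaller share (and therefore more noise). The sketch below contrasts a naive uniform split with a weighted one. The inverse-proportional rule, the sensitivity scores, and the function names are assumptions for illustration, not the paper's toolkit; the sketch also assumes basic sequential composition, under which per-token budgets add up.

```python
def naive_budget(total_epsilon, n_tokens):
    """Uniform split: every token receives the same share of epsilon."""
    return [total_epsilon / n_tokens] * n_tokens

def distribute_budget(total_epsilon, sensitivity_scores):
    """Weighted split: each token's share is inversely proportional to its
    (strictly positive) sensitivity score, so sensitive tokens get less epsilon."""
    inverse = [1.0 / s for s in sensitivity_scores]
    norm = sum(inverse)
    return [total_epsilon * v / norm for v in inverse]

tokens = ["met", "Alice", "yesterday"]
# Hypothetical scores: a personal name is treated as four times as sensitive.
scores = [1.0, 4.0, 1.0]

uniform = naive_budget(10.0, len(tokens))
weighted = distribute_budget(10.0, scores)
```

Both allocations spend the same total budget; only the weighted one concentrates the privatization effort on the token judged most identifying.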

Unbiased Max-Min Embedding Classification for Transductive Few-Shot Learning: Clustering and Classification Are All You Need

Yang Liu,Feixiang Liu,Jiale Du,Xinbo Gao,Jungong Han

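Task: Propose UMMEC, a method that addresses the key challenges in few-shot learning.

Motivation: Convolutional neural networks and supervised learning require large annotated datasets. Few-shot learning (FSL) and transductive few-shot learning (TFSL) ease this requirement but still face challenges such as the hubness problem.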

Details

Method: UMMEC makes three contributions: a decentralized covariance matrix, a combination of local alignment and global uniformity, and a Variational Sinkhorn Few-Shot Classifier. Result: UMMEC significantly improves classification performance and reaches state-of-the-art results with minimal labeled data. Conclusion: UMMEC advances transductive few-shot learning and offers an effective solution to few-shot learning problems. Abstract: Convolutional neural networks and supervised learning have achieved remarkable success in various fields but are limited by the need for large annotated datasets. Few-shot learning (FSL) addresses this limitation by enabling models to generalize from only a few labeled examples. Transductive few-shot learning (TFSL) enhances FSL by leveraging both labeled and unlabeled data, though it faces challenges like the hubness problem. To overcome these limitations, we propose the Unbiased Max-Min Embedding Classification (UMMEC) Method, which addresses the key challenges in few-shot learning through three innovative contributions. First, we introduce a decentralized covariance matrix to mitigate the hubness problem, ensuring a more uniform distribution of embeddings. Second, our method combines local alignment and global uniformity through adaptive weighting and nonlinear transformation, balancing intra-class clustering with inter-class separation. Third, we employ a Variational Sinkhorn Few-Shot Classifier to optimize the distances between samples and class prototypes, enhancing classification accuracy and robustness. These combined innovations allow the UMMEC method to achieve superior performance with minimal labeled data. Our UMMEC method significantly improves classification performance with minimal labeled data, advancing the state-of-the-art in TFSL.

EllieSQL: Cost-Efficient Text-to-SQL with Complexity-Aware Routing

Yizhang Zhu,Runzhi Jiang,Boyan Li,Nan Tang,Yuyu Luo

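Task: Explore and propose EllieSQL, a complexity-aware routing framework that improves the cost-efficiency of Text-to-SQL.

Motivation: Current LLM-based Text-to-SQL methods perform well but are computationally expensive, which limits their practical use and economic viability.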

Details

Method: Proposes the EllieSQL framework, in which a complexity-aware router assigns each query to a suitable SQL generation pipeline, and introduces the Token Elasticity of Performance (TEP) metric to measure cost-efficiency. Result: Experiments show that EllieSQL with the Qwen2.5-0.5B-DPO router reduces token use by over 40% without degrading performance and improves TEP by more than 2x. Conclusion: EllieSQL improves the cost-efficiency of Text-to-SQL and invites the community to weigh resource efficiency alongside performance, in support of sustainable Text-to-SQL research. Abstract: Text-to-SQL automatically translates natural language queries to SQL, allowing non-technical users to retrieve data from databases without specialized SQL knowledge. Despite the success of advanced LLM-based Text-to-SQL approaches on leaderboards, their unsustainable computational costs--often overlooked--stand as the "elephant in the room" in current leaderboard-driven research, limiting their economic practicability for real-world deployment and widespread adoption. To tackle this, we exploratively propose EllieSQL, a complexity-aware routing framework that assigns queries to suitable SQL generation pipelines based on estimated complexity. We investigate multiple routers to direct simple queries to efficient approaches while reserving computationally intensive methods for complex cases. Drawing from economics, we introduce the Token Elasticity of Performance (TEP) metric, capturing cost-efficiency by quantifying the responsiveness of performance gains relative to token investment in SQL generation. Experiments show that compared to always using the most advanced methods in our study, EllieSQL with the Qwen2.5-0.5B-DPO router reduces token use by over 40% without compromising performance on the Bird development set, achieving more than a 2x boost in TEP over non-routing approaches. This not only advances the pursuit of cost-efficient Text-to-SQL but also invites the community to weigh resource efficiency alongside performance, contributing to progress in sustainable Text-to-SQL.
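
The two ideas in this abstract, routing by estimated complexity and an elasticity-style efficiency metric, can be sketched in a few lines. Everything named here is hypothetical: the keyword heuristic stands in for a trained router (the paper uses models such as Qwen2.5-0.5B-DPO), the pipelines are placeholders, and `token_elasticity` follows the textbook economic definition of elasticity (relative change in performance divided by relative change in tokens), which may differ from the paper's exact TEP formula.

```python
def estimate_complexity(question: str) -> str:
    """Toy complexity estimator based on keyword hints (stand-in for a learned router)."""
    hints = ("for each", "average", "more than", "compared")
    score = sum(h in question.lower() for h in hints)
    return "complex" if score >= 2 else "simple"

def route(question, estimator, pipelines):
    """Send the question to the pipeline registered for its estimated tier."""
    return pipelines[estimator(question)](question)

def token_elasticity(perf_base, perf_new, tokens_base, tokens_new):
    """Relative performance gain per relative increase in token spend."""
    perf_change = (perf_new - perf_base) / perf_base
    token_change = (tokens_new - tokens_base) / tokens_base
    return perf_change / token_change

# Placeholder pipelines that only report which tier handled the question.
pipelines = {
    "simple": lambda q: ("cheap", q),
    "complex": lambda q: ("expensive", q),
}

easy = route("List all customers", estimate_complexity, pipelines)
hard = route("For each region, show the average order value for customers "
             "with more than 5 orders", estimate_complexity, pipelines)

# Illustrative numbers only: +10% performance for +100% tokens.
tep = token_elasticity(50.0, 55.0, 1000.0, 2000.0)
```

A low elasticity like the one in this toy example means additional tokens buy little extra accuracy, which is the situation routing is meant to avoid for simple queries.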

ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation

Yunhong Min,Daehyeon Choi,Kyeongmin Yeo,Jihyun Lee,Minhyuk Sung

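Task: Propose ORIGEN, a zero-shot method for 3D orientation grounding in text-to-image generation across multiple objects and diverse categories.

Motivation: Existing spatial grounding methods focus mainly on 2D positioning and lack control over 3D orientation.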

Details

Method: A reward-guided sampling approach that combines a pretrained discriminative model with a one-step text-to-image generative flow model and uses Langevin dynamics for sampling. Result: ORIGEN outperforms both training-based and test-time guidance methods on quantitative metrics and in user studies. Conclusion: ORIGEN addresses 3D orientation grounding and delivers superior performance. Abstract: We introduce ORIGEN, the first zero-shot method for 3D orientation grounding in text-to-image generation across multiple objects and diverse categories. While previous work on spatial grounding in image generation has mainly focused on 2D positioning, it lacks control over 3D orientation. To address this, we propose a reward-guided sampling approach using a pretrained discriminative model for 3D orientation estimation and a one-step text-to-image generative flow model. While gradient-ascent-based optimization is a natural choice for reward-based guidance, it struggles to maintain image realism. Instead, we adopt a sampling-based approach using Langevin dynamics, which extends gradient ascent by simply injecting random noise--requiring just a single additional line of code. Additionally, we introduce adaptive time rescaling based on the reward function to accelerate convergence. Our experiments show that ORIGEN outperforms both training-based and test-time guidance methods across quantitative metrics and user studies.
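
The remark that Langevin dynamics extends gradient ascent with "a single additional line of code" can be illustrated on a one-dimensional toy problem. This sketch uses the standard unadjusted Langevin update, x <- x + step * grad(x) + sqrt(2 * step) * noise; the quadratic reward, the step size, and the function name are assumptions for the example and have nothing to do with ORIGEN's actual latent space or reward model.

```python
import math
import random

def langevin_ascent(grad, x0, step=0.1, n_steps=1000, noise=True, seed=0):
    """Gradient ascent on a reward; with noise=True it becomes unadjusted Langevin."""
    rng = random.Random(seed)
    x = x0
    trace = []
    for _ in range(n_steps):
        x = x + step * grad(x)  # plain gradient ascent step
        if noise:
            x = x + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)  # the one extra line
        trace.append(x)
    return trace

# Toy reward r(x) = -(x - 3)^2 / 2, whose gradient is -(x - 3).
def reward_grad(x):
    return -(x - 3.0)

# Without noise the iterate collapses onto the single maximizer x = 3.
ascent_trace = langevin_ascent(reward_grad, 0.0, noise=False)

# With noise the iterates keep moving and sample around the maximizer.
sample_trace = langevin_ascent(reward_grad, 0.0, n_steps=20000, noise=True)
sample_mean = sum(sample_trace[1000:]) / len(sample_trace[1000:])
```

Gradient ascent returns one point, while the noisy version explores a distribution concentrated near high reward, which is the property the abstract relies on to keep generated images realistic.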

CoSIL: Software Issue Localization via LLM-Driven Code Repository Graph Searching

Zhonghao Jiang,Xiaoxue Ren,Meng Yan,Wei Jiang,Yong Li,Zhongxin Liu

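Task: Propose CoSIL, an LLM-based function-level issue localization method for automatic program repair.

Motivation: Because of the limited context window of LLMs, existing issue localization methods struggle to balance concise, effective contexts against a sufficiently comprehensive search space.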

Details

Method: CoSIL narrows the search space with module call graphs, iteratively searches the function call graph for relevant context, and uses context pruning to steer the search and manage context. Result: On SWE bench Lite and SWE bench Verified, CoSIL achieves Top-1 localization success rates of 43% and 44.6% respectively, outperforming existing methods by 8.6% to 98.2%. Conclusion: CoSIL is an effective issue localization method that needs no training or indexing and improves both localization and patch generation. Abstract: Large language models (LLMs) have significantly advanced autonomous software engineering, leading to a growing number of software engineering agents that assist developers in automatic program repair. Issue localization forms the basis for accurate patch generation. However, because of limitations caused by the context window length of LLMs, existing issue localization methods face challenges in balancing concise yet effective contexts and adequately comprehensive search spaces. In this paper, we introduce CoSIL, an LLM-driven, simple yet powerful function-level issue localization method without training or indexing. CoSIL reduces the search space through module call graphs, iteratively searches the function call graph to obtain relevant contexts, and uses context pruning to control the search direction and manage contexts effectively. Importantly, the call graph is dynamically constructed by the LLM during search, eliminating the need for pre-parsing. Experiment results demonstrate that CoSIL achieves a Top-1 localization success rate of 43 percent and 44.6 percent on SWE bench Lite and SWE bench Verified, respectively, using Qwen2.5 Coder 32B, outperforming existing methods by 8.6 to 98.2 percent. When CoSIL is applied to guide the patch generation stage, the resolved rate further improves by 9.3 to 31.5 percent.
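
The general shape of an iterative call-graph search with pruning can be sketched as a depth-limited breadth-first walk that only expands nodes judged relevant. This is an illustration of the idea rather than CoSIL itself: `callees` and `is_relevant` are hypothetical callbacks standing in for the LLM-driven graph construction and relevance judgment, and the depth limit is an assumption.

```python
def localize(entry_points, callees, is_relevant, max_depth=3):
    """Breadth-first walk over a call graph that is discovered lazily.

    callees(fn)     -> iterable of functions called by fn (hypothetical callback)
    is_relevant(fn) -> True if fn looks related to the issue (hypothetical callback)
    Irrelevant functions are pruned: they are neither reported nor expanded.
    """
    frontier = list(entry_points)
    seen = set(entry_points)
    hits = []
    for _ in range(max_depth):
        next_frontier = []
        for fn in frontier:
            if not is_relevant(fn):
                continue  # pruned: keeps the context small
            hits.append(fn)
            for callee in callees(fn):
                if callee not in seen:
                    seen.add(callee)
                    next_frontier.append(callee)
        frontier = next_frontier
    return hits

# Toy call graph and relevance oracle for demonstration.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
relevant = {"a", "b", "d"}
candidates = localize(["a"], lambda fn: graph[fn], lambda fn: fn in relevant)
```

Because `c` is pruned, its callee `e` is never visited, which is how pruning bounds both the search space and the amount of context that has to be shown to the model.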

Extremely Simple Out-of-distribution Detection for Audio-visual Generalized Zero-shot Learning

Yang Liu,Xun Zhang,Jiale Du,Xinbo Gao,Jungong Han

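Task: Propose EZ-AVOOD, an AV-GZSL method based on out-of-distribution (OOD) detection that mitigates domain shift by separating seen from unseen samples.

Motivation: Existing AV-GZSL methods handle domain shift poorly, so a simpler and more effective way to separate seen and unseen samples is needed.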

Details

Method: Exploits the information in class-specific logits and the class-agnostic feature subspace to separate seen from unseen samples without training an extra OOD detection network, then classifies each group with its own expert model. Result: Achieves better ZSL and GZSL performance than existing methods on three audio-visual datasets, setting a new SOTA. Conclusion: EZ-AVOOD tackles the domain shift problem with a simple, effective method and improves performance. Abstract: Zero-shot Learning (ZSL) attains knowledge transfer from seen classes to unseen classes by exploring auxiliary category information, which is a promising yet difficult research topic. In this field, Audio-Visual Generalized Zero-Shot Learning (AV-GZSL) has aroused researchers' great interest, in which intricate relations within triple modalities (audio, video, and natural language) render this task quite challenging but highly research-worthy. However, both existing embedding-based and generative-based AV-GZSL methods tend to suffer heavily from the domain shift problem. We therefore propose an extremely simple Out-of-distribution (OOD) detection based AV-GZSL method (EZ-AVOOD) to further mitigate the bias problem by differentiating seen and unseen samples at the outset. EZ-AVOOD accomplishes effective seen-unseen separation by exploiting the intrinsic discriminative information held in class-specific logits and class-agnostic feature subspace without training an extra OOD detector network. Following the seen-unseen binary classification, we employ two expert models to classify seen samples and unseen samples separately. Compared to existing state-of-the-art methods, our model achieves superior ZSL and GZSL performances on three audio-visual datasets and becomes the new SOTA, which comprehensively demonstrates the effectiveness of the proposed EZ-AVOOD.

Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users

Antonia Karamolegkou,Malvina Nikandrou,Georgios Pantazopoulos,Danae Sanchez Villegas,Phillip Rust,Ruchira Dhar,Daniel Hershcovich,Anders Søgaard

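Task: Examine the effectiveness of Multimodal Large Language Models (MLLMs) as assistive technologies for visually impaired people.

Motivation: Use a user survey to understand adoption patterns and the main challenges users face with such technologies, as a basis for improving them.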

Details

Method: Conducts a user survey, collates five user-centred tasks (including a novel Optical Braille Recognition task), and systematically evaluates twelve MLLMs. Result: Despite high adoption, the models fall short in contextual understanding, cultural sensitivity, and complex scene understanding, and need further improvement. Conclusion: Multimodal AI for accessibility must become more inclusive, robust, and trustworthy; future work should address cultural context, multilingual support, and Braille comprehension. Abstract: This paper explores the effectiveness of Multimodal Large Language Models (MLLMs) as assistive technologies for visually impaired individuals. We conduct a user survey to identify adoption patterns and key challenges users face with such technologies. Despite a high adoption rate of these models, our findings highlight concerns related to contextual understanding, cultural sensitivity, and complex scene understanding, particularly for individuals who may rely solely on them for visual interpretation. Informed by these results, we collate five user-centred tasks with image and video inputs, including a novel task on Optical Braille Recognition. Our systematic evaluation of twelve MLLMs reveals that further advancements are necessary to overcome limitations related to cultural context, multilingual support, Braille reading comprehension, assistive object recognition, and hallucinations. This work provides critical insights into the future direction of multimodal AI for accessibility, underscoring the need for more inclusive, robust, and trustworthy visual assistance technologies.

Hyperspectral Adapter for Object Tracking based on Hyperspectral Video

Long Gao,Yunhe Zhang,Langkun Chen,Yan Jiang,Weiying Xie,Yunsong Li

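Task: Propose HyA-T, a new hyperspectral object tracking method that uses adapters to augment spectral information and improve efficiency.

Motivation: Existing hyperspectral tracking methods lose spectral information during the transformation step, and fine-tuning the entire network is inefficient.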

Details

Method: Proposes the HAS and HAM adapters to augment self-attention and the multilayer perceptron, and HEI to augment the input with spectral information. Result: HyA-T achieves the best performance on multiple datasets. Conclusion: HyA-T addresses spectral information loss and inefficiency while delivering superior performance. Abstract: Object tracking based on hyperspectral video is attracting increasing attention owing to the rich material and motion information in hyperspectral videos. The prevailing hyperspectral methods adapt pretrained RGB-based object tracking networks for hyperspectral tasks by fine-tuning the entire network on hyperspectral datasets, which achieves impressive results in challenging scenarios. However, the performance of hyperspectral trackers is limited by the loss of spectral information during the transformation, and fine-tuning the entire pretrained network is inefficient for practical applications. To address these issues, a new hyperspectral object tracking method, hyperspectral adapter for tracking (HyA-T), is proposed in this work. The hyperspectral adapter for the self-attention (HAS) and the hyperspectral adapter for the multilayer perceptron (HAM) are proposed to generate the adaptation information and to transfer the multi-head self-attention (MSA) module and the multilayer perceptron (MLP) in the pretrained network to the hyperspectral object tracking task by augmenting the adaptation information into the calculation of the MSA and MLP. Additionally, the hyperspectral enhancement of input (HEI) is proposed to augment the original spectral information into the input of the tracking network. The proposed methods extract spectral information directly from the hyperspectral images, which prevents the loss of spectral information. Moreover, only the parameters in the proposed methods are fine-tuned, which is more efficient than the existing methods. Extensive experiments were conducted on four datasets with various spectral bands, verifying the effectiveness of the proposed methods. HyA-T achieves state-of-the-art performance on all the datasets.

ActionStudio: A Lightweight Framework for Data and Training of Action Models

Jianguo Zhang,Thai Hoang,Ming Zhu,Zuxin Liu,Shiyu Wang,Tulika Awalgaonkar,Akshara Prabhakar,Haolin Chen,Weiran Yao,Zhiwei Liu,Juntao Tan,Juan Carlos Niebles,Shelby Heinecke,Huan Wang,Silvio Savarese,Caiming Xiong

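Task: Propose ActionStudio, a lightweight and extensible data and training framework for action models.

Motivation: Training large action models is challenging because agent environments are diverse and agentic data are complex, and existing infrastructure offers limited support for scalable, agent-specific fine-tuning.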

Details

Method: ActionStudio unifies heterogeneous agent trajectories through a standardized format, supports training paradigms including LoRA, full fine-tuning, and distributed setups, and integrates robust preprocessing and verification tools. Result: Its effectiveness is validated on both public and realistic industry benchmarks, showing strong performance and practical scalability. Conclusion: Code and data are open-sourced to facilitate community research. Abstract: Action models are essential for enabling autonomous agents to perform complex tasks. However, training large action models remains challenging due to the diversity of agent environments and the complexity of agentic data. Despite growing interest, existing infrastructure provides limited support for scalable, agent-specific fine-tuning. We present ActionStudio, a lightweight and extensible data and training framework designed for action models. ActionStudio unifies heterogeneous agent trajectories through a standardized format, supports diverse training paradigms including LoRA, full fine-tuning, and distributed setups, and integrates robust preprocessing and verification tools. We validate its effectiveness across both public and realistic industry benchmarks, demonstrating strong performance and practical scalability. We open-sourced code and data at https://github.com/SalesforceAIResearch/xLAM to facilitate research in the community.

Jaewoo Jeong,Seohee Lee,Daehee Park,Giwon Lee,Kuk-Jin Yoon

Task: Propose a multi-modal knowledge distillation framework that improves pedestrian trajectory forecasting accuracy on resource-constrained systems.

Motivation: In pedestrian trajectory forecasting, additional modalities such as textual descriptions improve accuracy, but resource-constrained systems struggle to extract text online.

Details

Method: Uses knowledge distillation to transfer the knowledge of a teacher model (trained with trajectory, pose, and text) to a student model (using only trajectory or pose). Result: The distilled student models improve on all prediction metrics, by up to about 13%. Conclusion: The framework is validated across multiple datasets and setups and is suitable for resource-constrained systems. Abstract: Pedestrian trajectory forecasting is crucial in various applications such as autonomous driving and mobile robot navigation. In such applications, camera-based perception enables the extraction of additional modalities (human pose, text) to enhance prediction accuracy. Indeed, we find that textual descriptions play a crucial role in integrating additional modalities into a unified understanding. However, online extraction of text requires the use of a VLM, which may not be feasible for resource-constrained systems. To address this challenge, we propose a multi-modal knowledge distillation framework: a student model with limited modality is distilled from a teacher model trained with the full range of modalities. The comprehensive knowledge of a teacher model trained with trajectory, human pose, and text is distilled into a student model using only trajectory or human pose as a sole supplement. In doing so, we separately distill the core locomotion insights from intra-agent multi-modality and inter-agent interaction. Our generalizable framework is validated with two state-of-the-art models across three datasets on both ego-view (JRDB, SIT) and BEV-view (ETH/UCY) setups, utilizing both annotated and VLM-generated text captions. Distilled student models show consistent improvement in all prediction metrics for both full and instantaneous observations, improving up to ~13%. The code is available at https://github.com/Jaewoo97/KDTF.

QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?

Belinda Z. Li,Been Kim,Zi Wang

Task: Evaluate the ability of large language models (LLMs) to handle problems with missing information, and quantify problem difficulty.

Motivation: Real-world queries are often incomplete and can only be solved by acquiring the missing information, whereas most existing work assumes that tasks are well-defined.

Details

Method: Formalizes the problem as a constraint satisfaction problem (CSP) with missing variable assignments, and builds the QuestBench benchmark, which contains four task types. Result: State-of-the-art models excel at GSM-Q and GSME-Q, but reach only 40-50% accuracy on Logic-Q and Planning-Q. Conclusion: Models do well on well-specified problems but struggle to identify the question that needs to be asked, so their information acquisition capabilities require further study. Abstract: Recently, a large amount of work has focused on improving large language models' (LLMs') performance on reasoning benchmarks such as math and logic. However, past work has largely assumed that tasks are well-defined. In the real world, queries to LLMs are often underspecified, only solvable through acquiring missing information. We formalize this as a constraint satisfaction problem (CSP) with missing variable assignments. Using a special case of this formalism where only one necessary variable assignment is missing, we can rigorously evaluate an LLM's ability to identify the minimal necessary question to ask and quantify axes of difficulty levels for each problem. We present QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question, which includes: (1) Logic-Q: Logical reasoning tasks with one missing proposition, (2) Planning-Q: PDDL planning problems with initial states that are partially-observed, (3) GSM-Q: Human-annotated grade school math problems with one missing variable assignment, and (4) GSME-Q: a version of GSM-Q where word problems are translated into equations by human annotators. The LLM is tasked with selecting the correct clarification question(s) from a list of options. While state-of-the-art models excel at GSM-Q and GSME-Q, their accuracy is only 40-50% on Logic-Q and Planning-Q. Analysis demonstrates that the ability to solve well-specified reasoning problems may not be sufficient for success on our benchmark: models have difficulty identifying the right question to ask, even when they can solve the fully specified version of the problem. Furthermore, in the Planning-Q domain, LLMs tend not to hedge, even when explicitly presented with the option to predict "not sure." This highlights the need for deeper investigation into models' information acquisition capabilities.
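
The one-missing-assignment setting described in the abstract can be illustrated with a toy example. The sketch below is not QuestBench's implementation: it assumes a simplified rule format (a variable becomes computable once all of its inputs are known), and the names `derivable`, `sufficient_questions`, and the shopping-style variables are hypothetical. It checks which single candidate question would make the target derivable.

```python
from typing import List, Set, Tuple

# Assumed toy format: (head, body) means "head can be computed once every
# variable in body is known". This is an illustration, not the benchmark's format.
Rule = Tuple[str, Tuple[str, ...]]


def derivable(known: Set[str], rules: List[Rule]) -> Set[str]:
    """Forward-chain over the rules until no new variable can be derived."""
    closure = set(known)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in closure and all(v in closure for v in body):
                closure.add(head)
                changed = True
    return closure


def sufficient_questions(known: Set[str], rules: List[Rule],
                         target: str, candidates: List[str]) -> List[str]:
    """Return the candidates whose value alone would make the target derivable."""
    if target in derivable(known, rules):
        return []  # the problem is already well-specified
    return [v for v in candidates if target in derivable(known | {v}, rules)]


# Hypothetical word problem: change = paid - total, total = price * quantity.
rules = [("total", ("price", "quantity")), ("change", ("paid", "total"))]
known = {"price", "paid"}
answers = sufficient_questions(known, rules, "change",
                               ["quantity", "discount", "weight"])
```

Here only `quantity` unlocks the target, which mirrors the multiple-choice framing in which the model must pick the correct clarification question from a list of options.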

Segment then Splat: A Unified Approach for 3D Open-Vocabulary Segmentation based on Gaussian Splatting

Yiren Lu,Yunlai Zhou,Yiran Qiao,Chaoda Song,Tuo Liang,Jing Ma,Yu Yin

Task: Propose a 3D-aware open-vocabulary segmentation method based on Gaussian Splatting for both static and dynamic scenes.

Motivation: Address the multi-view inconsistencies, poor 3D object retrieval, and difficulty with dynamic scenes that existing methods show when answering open-vocabulary queries in 3D space.

Details

Method: Segments first and reconstructs afterwards (Segment then Splat): Gaussians are divided into distinct object sets before reconstruction, which yields true 3D segmentation. Result: Experiments show that the method is effective in both static and dynamic scenes, eliminates Gaussian-object misalignment, and accelerates optimization. Conclusion: Segment then Splat offers clear advantages for 3D open-vocabulary segmentation and suits a range of applications. Abstract: Open-vocabulary querying in 3D space is crucial for enabling more intelligent perception in applications such as robotics, autonomous systems, and augmented reality. However, most existing methods rely on 2D pixel-level parsing, leading to multi-view inconsistencies and poor 3D object retrieval. Moreover, they are limited to static scenes and struggle with dynamic scenes due to the complexities of motion modeling. In this paper, we propose Segment then Splat, a 3D-aware open-vocabulary segmentation approach for both static and dynamic scenes based on Gaussian Splatting. Segment then Splat reverses the long-established approach of "segmentation after reconstruction" by dividing Gaussians into distinct object sets before reconstruction. Once the reconstruction is complete, the scene is naturally segmented into individual objects, achieving true 3D segmentation. This approach not only eliminates Gaussian-object misalignment issues in dynamic scenes but also accelerates the optimization process, as it eliminates the need for learning a separate language field. After optimization, a CLIP embedding is assigned to each object to enable open-vocabulary querying. Extensive experiments on various datasets demonstrate the effectiveness of our proposed method in both static and dynamic scenarios.
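
The final querying step (one CLIP embedding per object, matched against a text query) can be sketched as a nearest-neighbour lookup. This is a minimal illustration under stated assumptions: the three-dimensional vectors are toy stand-ins for real CLIP embeddings, cosine similarity is assumed as the matching score, and `query_objects` and the object names are hypothetical.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def query_objects(text_embedding: List[float],
                  object_embeddings: Dict[str, List[float]]) -> str:
    """Return the id of the object whose embedding best matches the query."""
    return max(object_embeddings,
               key=lambda name: cosine(text_embedding, object_embeddings[name]))


# Toy stand-ins: in practice these would come from an image and text encoder.
object_embeddings = {
    "object_0": [0.9, 0.1, 0.0],
    "object_1": [0.0, 0.2, 0.95],
}
best = query_objects([1.0, 0.0, 0.1], object_embeddings)
```

Because each object already owns its set of Gaussians, selecting an object id is enough to select its 3D geometry, with no per-pixel language field involved.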

Jiakai Tang,Sunhao Dai,Teng Shi,Jun Xu,Xu Chen,Wen Chen,Wu Jian,Yuning Jiang

Task: Propose ReaRec, an inference-time computing framework that enhances user representations in sequential recommender systems.

Motivation: Existing sequential recommendation methods use a direct forward computation paradigm, which struggles to capture the complex evolution of user preferences and understands long-tail items poorly, leading to suboptimal performance.

Details

Method: ReaRec enhances user representations through implicit multi-step reasoning, introduces special reasoning position embeddings to decouple the original item encoding space from the multi-step reasoning space, and proposes two lightweight reasoning-based learning methods (ERL and PRL). Result: Experiments on five public datasets and different sequential recommendation architectures show that ReaRec is general and effective, raising the performance ceiling by roughly 30%-50%. Conclusion: ReaRec opens a new direction for inference-time computing in sequential recommendation. Abstract: Sequential Recommendation (SeqRec) aims to predict the next item by capturing sequential patterns from users' historical interactions, playing a crucial role in many real-world recommender systems. However, existing approaches predominantly adopt a direct forward computation paradigm, where the final hidden state of the sequence encoder serves as the user representation. We argue that this inference paradigm, due to its limited computational depth, struggles to model the complex evolving nature of user preferences and lacks a nuanced understanding of long-tail items, leading to suboptimal performance. To address this issue, we propose ReaRec, the first inference-time computing framework for recommender systems, which enhances user representations through implicit multi-step reasoning. Specifically, ReaRec autoregressively feeds the sequence's last hidden state into the sequential recommender while incorporating special reasoning position embeddings to decouple the original item encoding space from the multi-step reasoning space. Moreover, we introduce two lightweight reasoning-based learning methods, Ensemble Reasoning Learning (ERL) and Progressive Reasoning Learning (PRL), to further effectively exploit ReaRec's reasoning potential. Extensive experiments on five public real-world datasets and different SeqRec architectures demonstrate the generality and effectiveness of our proposed ReaRec. Remarkably, post-hoc analyses reveal that ReaRec significantly elevates the performance ceiling of multiple sequential recommendation backbones by approximately 30%-50%. Thus, we believe this work can open a new and promising avenue for future research in inference-time computing for sequential recommendation.

Intrinsic Image Decomposition for Robust Self-supervised Monocular Depth Estimation on Reflective Surfaces

Wonhyeok Choi,Kyumin Hwang,Minwoo Choi,Kiljoon Han,Wonjoon Choi,Mingyu Shin,Sunghoon Im

Task: Propose a self-supervised monocular depth estimation framework that incorporates intrinsic image decomposition to overcome the limitations of conventional methods on reflective surfaces.

Motivation: The conventional photometric consistency loss relies on the Lambertian assumption, which produces significant errors on reflective surfaces.

Details

Method: Trains depth estimation jointly with intrinsic image decomposition, aligns view coordinate systems for multi-image consistency, excludes corrupted gradients from reflective regions, and adds pseudo-depth generation and knowledge distillation. Result: Significantly outperforms existing baselines on multiple datasets, especially on reflective surfaces. Conclusion: Through intrinsic image decomposition and knowledge distillation, the framework improves self-supervised monocular depth estimation, particularly its accuracy on reflective surfaces. Abstract: Self-supervised monocular depth estimation (SSMDE) has gained attention in the field of deep learning as it estimates depth without requiring ground truth depth maps. This approach typically uses a photometric consistency loss between a synthesized image, generated from the estimated depth, and the original image, thereby reducing the need for extensive dataset acquisition. However, the conventional photometric consistency loss relies on the Lambertian assumption, which often leads to significant errors when dealing with reflective surfaces that deviate from this model. To address this limitation, we propose a novel framework that incorporates intrinsic image decomposition into SSMDE. Our method synergistically trains for both monocular depth estimation and intrinsic image decomposition. The accurate depth estimation facilitates multi-image consistency for intrinsic image decomposition by aligning different view coordinate systems, while the decomposition process identifies reflective areas and excludes corrupted gradients from the depth training process. Furthermore, our framework introduces a pseudo-depth generation and knowledge distillation technique to further enhance the performance of the student model across both reflective and non-reflective surfaces. Comprehensive evaluations on multiple datasets show that our approach significantly outperforms existing SSMDE baselines in depth prediction, especially on reflective surfaces.

Learning to Instruct for Visual Instruction Tuning

Zhihan Zhou,Feng Hong,Jiaan Luo,Jiangchao Yao,Dongsheng Li,Bo Han,Ya Zhang,Yanfeng Wang

Task: Propose LIT to improve visual instruction tuning (VIT) by addressing its overfitting and shortcut learning problems.

Motivation: Current VIT designs overemphasize instruction-following ability and neglect the proactive understanding of visual information, which can degrade performance.

Details

Method: LIT applies the loss function to both the instruction and response sequences, which expands the training data and reduces reliance on language priors. Result: LIT achieves a relative improvement of up to 9% on comprehensive multimodal benchmarks and up to 18% on captioning, while also reducing hallucination. Conclusion: LIT is a simple and effective method that needs no additional training data and adds negligible computational overhead. Abstract: We propose LIT, an advancement of visual instruction tuning (VIT). While VIT equips Multimodal LLMs (MLLMs) with promising multimodal capabilities, the current design choices for VIT often result in overfitting and shortcut learning, potentially degrading performance. This gap arises from an overemphasis on instruction-following abilities, while neglecting the proactive understanding of visual information. Inspired by this, LIT adopts a simple yet effective approach by incorporating the loss function into both the instruction and response sequences. It seamlessly expands the training data, and regularizes the MLLMs against over-reliance on language priors. Based on this merit, LIT achieves a significant relative improvement of up to 9% on comprehensive multimodal benchmarks, requiring no additional training data and incurring negligible computational overhead. Surprisingly, LIT attains exceptional fundamental visual capabilities, yielding up to an 18% improvement in captioning performance, while simultaneously alleviating hallucination in MLLMs.
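
The difference between response-only supervision and the instruction-plus-response supervision described in the abstract comes down to which token positions enter the loss. The sketch below shows that masking choice on a toy sequence. It is an assumption-laden illustration: the segment labels, the per-token log-probabilities, and the helper `masked_nll` are hypothetical, and the abstract does not specify how LIT weights or formats the instruction loss.

```python
from typing import List


def masked_nll(token_logprobs: List[float], loss_mask: List[bool]) -> float:
    """Mean negative log-likelihood over the positions selected by the mask."""
    picked = [-lp for lp, keep in zip(token_logprobs, loss_mask) if keep]
    return sum(picked) / len(picked)


# Toy sequence: one image placeholder, two instruction tokens, three response tokens.
segments = ["img", "ins", "ins", "res", "res", "res"]
# Hypothetical log-probabilities the model assigns to each ground-truth token.
logprobs = [0.0, -2.0, -1.0, -0.5, -0.5, -1.5]

# Conventional visual instruction tuning: supervise response tokens only.
vit_mask = [s == "res" for s in segments]
# LIT-style supervision (as described in the abstract): instruction tokens also count.
lit_mask = [s in ("ins", "res") for s in segments]

vit_loss = masked_nll(logprobs, vit_mask)
lit_loss = masked_nll(logprobs, lit_mask)
```

Since only the mask changes, the same forward pass serves both objectives, which is consistent with the abstract's claim of negligible computational overhead.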

ABC-GS: Alignment-Based Controllable Style Transfer for 3D Gaussian Splatting

Wenjie Liu,Zhongliang Liu,Xiaoyan Yang,Man Sha,Yang Li

Task: Achieve high-quality 3D style transfer based on 3D Gaussian Splatting.

Motivation: Existing methods based on NeRF and the NNFM loss do not consider global style information, and NeRF's implicit representation limits fine-grained control over the scene.

Details

Method: Designs a controllable matching stage for precise alignment between scene content and style features, and proposes a style transfer loss based on feature alignment. Result: The ABC-GS framework makes style transfer controllable and produces stylized results that are more faithful to the global style of the reference image. Conclusion: Through improved matching and loss design, ABC-GS overcomes the limitations of existing methods and achieves high-quality 3D style transfer. Abstract: 3D scene stylization approaches based on Neural Radiance Fields (NeRF) achieve promising results by optimizing with Nearest Neighbor Feature Matching (NNFM) loss. However, NNFM loss does not consider global style information. In addition, the implicit representation of NeRF limits fine-grained control over the resulting scenes. In this paper, we introduce ABC-GS, a novel framework based on 3D Gaussian Splatting to achieve high-quality 3D style transfer. To this end, a controllable matching stage is designed to achieve precise alignment between scene content and style features through segmentation masks. Moreover, a style transfer loss function based on feature alignment is proposed to ensure that the outcomes of style transfer accurately reflect the global style of the reference image. Furthermore, the original geometric information of the scene is preserved with the depth loss and Gaussian regularization terms. Extensive experiments show that our ABC-GS provides controllability of style transfer and achieves stylization results that are more faithfully aligned with the global style of the chosen artistic reference. Our homepage is available at https://vpx-ecnu.github.io/ABC-GS-website.

Follow Your Motion: A Generic Temporal Consistency Portrait Editing Framework with Trajectory Guidance

Haijie Yang,Zhenyu Zhang,Hao Tang,Jianjun Qian,Jian Yang

Task: Propose Follow Your Motion (FYM), a generic framework for maintaining temporal consistency in portrait editing.

Motivation: Pre-trained conditional diffusion models perform well in image editing but struggle with temporal consistency, especially when editing dynamic facial expressions.

Details

Method: Develops a diffusion model that learns motion trajectory changes, and proposes a dynamic re-weighted attention mechanism to keep expressions consistent. Result: Experiments show that the method outperforms existing approaches in temporal consistency and applies to a range of applications. Conclusion: The FYM framework effectively addresses temporal inconsistency and offers a better solution for portrait editing. Abstract: Pre-trained conditional diffusion models have demonstrated remarkable potential in image editing. However, they often face challenges with temporal consistency, particularly in the talking head domain, where continuous changes in facial expressions intensify the level of difficulty. These issues stem from the independent editing of individual images and the inherent loss of temporal continuity during the editing process. In this paper, we introduce Follow Your Motion (FYM), a generic framework for maintaining temporal consistency in portrait editing. Specifically, given portrait images rendered by a pre-trained 3D Gaussian Splatting model, we first develop a diffusion model that intuitively and inherently learns motion trajectory changes at different scales and pixel coordinates, from the first frame to each subsequent frame. This approach ensures that temporally inconsistent edited avatars inherit the motion information from the rendered avatars. Secondly, to maintain fine-grained expression temporal consistency in talking head editing, we propose a dynamic re-weighted attention mechanism. This mechanism assigns higher weight coefficients to landmark points in space and dynamically updates these weights based on landmark loss, achieving more consistent and refined facial expressions. Extensive experiments demonstrate that our method outperforms existing approaches in terms of temporal consistency and can be used to optimize and compensate for temporally inconsistent outputs in a range of applications, such as text-driven editing and relighting.

CoGen: 3D Consistent Video Generation via Adaptive Conditioning for Autonomous Driving

Yishen Ji,Ziyue Zhu,Zhenxin Zhu,Kaixin Xiong,Ming Lu,Zhiqi Li,Lijun Zhou,Haiyang Sun,Bing Wang,Tong Lu

Task: Propose CoGen, a novel spatial adaptive generation framework that addresses 3D consistency in multi-view driving video generation.

Motivation: Existing generation models conditioned on 2D layouts struggle to produce controllable multi-view videos with high 3D consistency, which limits their usefulness for autonomous driving systems.

Details

Method: Generates high-quality, controllable 3D conditions to replace 2D conditions, and introduces a consistency adapter module to make the model more robust to multi-condition control. Result: The method excels in geometric fidelity and visual realism, offering a reliable video generation solution for autonomous driving. Conclusion: CoGen significantly improves the 3D consistency of generated videos and provides higher-quality controllable generation of training data for autonomous driving systems. Abstract: Recent progress in driving video generation has shown significant potential for enhancing self-driving systems by providing scalable and controllable training data. Although pretrained state-of-the-art generation models, guided by 2D layout conditions (e.g., HD maps and bounding boxes), can produce photorealistic driving videos, achieving controllable multi-view videos with high 3D consistency remains a major challenge. To tackle this, we introduce a novel spatial adaptive generation framework, CoGen, which leverages advances in 3D generation to improve performance in two key aspects: (i) To ensure 3D consistency, we first generate high-quality, controllable 3D conditions that capture the geometry of driving scenes. By replacing coarse 2D conditions with these fine-grained 3D representations, our approach significantly enhances the spatial consistency of the generated videos. (ii) Additionally, we introduce a consistency adapter module to strengthen the robustness of the model to multi-condition control. The results demonstrate that this method excels in preserving geometric fidelity and visual realism, offering a reliable video generation solution for autonomous driving.

SCHNet: SAM Marries CLIP for Human Parsing

Kunliang Liu,Jianming Wang,Rize Jin,Wonjun Hwang,Tae-Sun Chung

Task: Propose efficient modules that combine SAM and CLIP features to improve human parsing.

Motivation: SAM excels at fine-grained segmentation but lacks semantic understanding, while CLIP has semantic understanding but falls short in fine-grained segmentation; human parsing needs both.

Details

Method: Designs a Semantic-Refinement Module that integrates CLIP's semantic features with SAM's features, and proposes an efficient Fine-tuning Module that adapts the pretrained SAM to human parsing. Result: The method is validated on the LIP, PPP, and CIHP databases, with significantly reduced training time and notable performance. Conclusion: Combining SAM and CLIP features works well for human parsing, satisfying the need for both semantic understanding and fine-grained segmentation. Abstract: Vision Foundation Models (VFMs) such as the Segment Anything Model (SAM) and the Contrastive Language-Image Pre-training Model (CLIP) have shown promising performance for segmentation and detection tasks. However, although SAM excels in fine-grained segmentation, it faces major challenges when applied to semantic-aware segmentation. While CLIP exhibits a strong semantic understanding capability via aligning the global features of language and vision, it has deficiencies in fine-grained segmentation tasks. Human parsing requires segmenting human bodies into constituent parts and involves both accurate fine-grained segmentation and high semantic understanding of each part. Based on the traits of SAM and CLIP, we formulate highly efficient modules to effectively integrate their features to benefit human parsing. We propose a Semantic-Refinement Module to integrate semantic features of CLIP with SAM features to benefit parsing. Moreover, we formulate a highly efficient Fine-tuning Module to adjust the pretrained SAM for human parsing that needs high semantic information and simultaneously demands spatial details, which significantly reduces the training time compared with full-time training and achieves notable performance. Extensive experiments demonstrate the effectiveness of our method on LIP, PPP, and CIHP databases.

Efficient Building Roof Type Classification: A Domain-Specific Self-Supervised Approach

Guneet Mutreja,Ksenia Bittner

Task: Use self-supervised learning with EfficientNet architectures for accurate classification of building roof types.

Motivation: Address the limited availability of labeled data for building roof type classification.

Details

Method: Proposes an EfficientNet framework that incorporates a Convolutional Block Attention Module (CBAM), and explores the effect of pretraining on a domain-specific dataset (AID). Result: Reaches 95.5% accuracy on the validation set, matching transformer-based models with fewer parameters. Conclusion: EfficientNet-based self-supervised learning is a computationally efficient and effective approach, particularly when labeled data is limited. Abstract: Accurate classification of building roof types from aerial imagery is crucial for various remote sensing applications, including urban planning, disaster management, and infrastructure monitoring. However, this task is often hindered by the limited availability of labeled data for supervised learning approaches. To address this challenge, this paper investigates the effectiveness of self-supervised learning with EfficientNet architectures, known for their computational efficiency, for building roof type classification. We propose a novel framework that incorporates a Convolutional Block Attention Module (CBAM) to enhance the feature extraction capabilities of EfficientNet. Furthermore, we explore the benefits of pretraining on a domain-specific dataset, the Aerial Image Dataset (AID), compared to ImageNet pretraining. Our experimental results demonstrate the superiority of our approach. Employing Simple Framework for Contrastive Learning of Visual Representations (SimCLR) with EfficientNet-B3 and CBAM achieves a 95.5% accuracy on our validation set, matching the performance of state-of-the-art transformer-based models while utilizing significantly fewer parameters. We also provide a comprehensive evaluation on two challenging test sets, demonstrating the generalization capability of our method. Notably, our findings highlight the effectiveness of domain-specific pretraining, consistently leading to higher accuracy compared to models pretrained on the generic ImageNet dataset. Our work establishes EfficientNet-based self-supervised learning as a computationally efficient and highly effective approach for building roof type classification, particularly beneficial in scenarios with limited labeled data.
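
The abstract names SimCLR as the self-supervised objective. As a reminder of what that objective computes, the sketch below implements the standard NT-Xent contrastive loss on toy two-dimensional embeddings. It is not the paper's training code: the pairing convention (views `2k` and `2k+1` are two augmentations of the same image), the temperature value, and the toy vectors are assumptions made for illustration.

```python
import math
from typing import List


def nt_xent(embeddings: List[List[float]], temperature: float = 0.5) -> float:
    """NT-Xent loss over 2N L2-normalised embeddings.

    Assumed layout: indices 2k and 2k+1 hold two views of the same image.
    """
    n = len(embeddings)

    def sim(i: int, j: int) -> float:
        # Dot product equals cosine similarity because inputs are unit-length.
        return sum(a * b for a, b in zip(embeddings[i], embeddings[j])) / temperature

    total = 0.0
    for i in range(n):
        j = i ^ 1  # index of the positive partner
        denom = sum(math.exp(sim(i, k)) for k in range(n) if k != i)
        total += -math.log(math.exp(sim(i, j)) / denom)
    return total / n


# Positive pairs agree: low loss.
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# Positive pairs disagree while negatives coincide: high loss.
misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
loss_aligned = nt_xent(aligned)
loss_misaligned = nt_xent(misaligned)
```

The loss needs no labels, only pairs of augmented views, which is why this kind of pretraining fits the limited-label setting the abstract targets.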

Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion

Songsong Yu,Yuxin Chen,Zhongang Qi,Zeke Xie,Yifan Wang,Lijun Wang,Ying Shan,Huchuan Lu

Task: Study how to use pretrained diffusion models (DMs) for stereo conversion, and address data scarcity and inadequate evaluation methods.

Motivation: 3D devices are spreading quickly while 3D content is in short supply, so demand for stereo conversion is growing, yet existing work falls short in data, evaluation, and results.

Details

Method: Introduces the Mono2Stereo dataset, conducts an empirical study that identifies the limitations of existing metrics and methods, and proposes a new evaluation metric and a baseline model. Result: Introduces the Stereo Intersection-over-Union metric, which correlates highly with human judgments, and a baseline that notably surpasses current mainstream methods; code and data are to be open-sourced. Conclusion: The new dataset, metric, and baseline advance stereo conversion research and address the shortcomings of existing approaches. Abstract: With the rapid proliferation of 3D devices and the shortage of 3D content, stereo conversion is attracting increasing attention. Recent works introduce pretrained Diffusion Models (DMs) into this task. However, due to the scarcity of large-scale training data and comprehensive benchmarks, the optimal methodologies for employing DMs in stereo conversion and the accurate evaluation of stereo effects remain largely unexplored. In this work, we introduce the Mono2Stereo dataset, providing high-quality training data and a benchmark to support in-depth exploration of stereo conversion. With this dataset, we conduct an empirical study that yields two primary findings. 1) The differences between the left and right views are subtle, yet existing metrics consider overall pixels, failing to concentrate on regions critical to stereo effects. 2) Mainstream methods adopt either one-stage left-to-right generation or a warp-and-inpaint pipeline, facing challenges of degraded stereo effect and image distortion respectively. Based on these findings, we introduce a new evaluation metric, Stereo Intersection-over-Union, which prioritizes disparity and achieves a high correlation with human judgments on stereo effect. Moreover, we propose a strong baseline model, harmonizing the stereo effect and image quality simultaneously, and notably surpassing current mainstream methods. Our code and data will be open-sourced to promote further research in stereo conversion. Our models are available at mono2stereo-bench.github.io.

Haomin Zhang,Chang Liu,Junjie Zheng,Zihao Chen,Chaofan Ding,Xinhan Di

Task: Propose an end-to-end multi-modal generation framework that produces speech and audio simultaneously from video and text conditions.

Motivation: In real-world scenarios videos usually contain both speech and audio, but synchronized generation of speech and audio from video and text conditions has not been well studied.

Details

Method: Proposes the DeepAudio framework, consisting of a video-to-audio (V2A) module, a text-to-speech (TTS) module, and a dynamic mixture of modality fusion (MoF) module. Result: Achieves state-of-the-art performance on video-audio, video-speech, and text-speech benchmarks, with especially large gains on the video-speech benchmark (for example, an 80.99% relative reduction in WER). Conclusion: DeepAudio performs well on multi-modal generation tasks, with a clear advantage in synchronized speech and audio generation. Abstract: Currently, high-quality, synchronized audio is synthesized using various multi-modal joint learning frameworks, leveraging video and optional text inputs. In the video-to-audio benchmarks, video-to-audio quality, semantic alignment, and audio-visual synchronization are effectively achieved. However, in real-world scenarios, speech and audio often coexist in videos simultaneously, and the end-to-end generation of synchronous speech and audio given video and text conditions is not well studied. Therefore, we propose an end-to-end multi-modal generation framework that simultaneously produces speech and audio based on video and text conditions. Furthermore, the advantages of video-to-audio (V2A) models for generating speech from videos remain unclear. The proposed framework, DeepAudio, consists of a video-to-audio (V2A) module, a text-to-speech (TTS) module, and a dynamic mixture of modality fusion (MoF) module. In the evaluation, the proposed end-to-end framework achieves state-of-the-art performance on the video-audio benchmark, video-speech benchmark, and text-speech benchmark. In detail, our framework achieves comparable results in the comparison with state-of-the-art models for the video-audio and text-speech benchmarks, and surpasses state-of-the-art models in the video-speech benchmark, with WER 16.57% to 3.15% (+80.99%), SPK-SIM 78.30% to 89.38% (+14.15%), EMO-SIM 66.24% to 75.56% (+14.07%), MCD 8.59 to 7.98 (+7.10%), MCD SL 11.05 to 9.40 (+14.93%) across a variety of dubbing settings.
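
The bracketed percentages in the abstract are relative improvements over the baseline value, not absolute differences. The short check below recomputes them from the reported before and after figures. The metric values are copied from the abstract; the direction flags (WER and MCD treated as lower-is-better, the similarity scores as higher-is-better) and the helper name `relative_gain` are assumptions of this sketch.

```python
def relative_gain(before: float, after: float, lower_is_better: bool) -> float:
    """Relative improvement in percent, measured against the baseline value."""
    delta = (before - after) if lower_is_better else (after - before)
    return 100.0 * delta / before


# (baseline, DeepAudio, lower_is_better); numbers copied from the abstract,
# direction flags assumed from the usual meaning of each metric.
reported = {
    "WER": (16.57, 3.15, True),
    "SPK-SIM": (78.30, 89.38, False),
    "EMO-SIM": (66.24, 75.56, False),
    "MCD": (8.59, 7.98, True),
    "MCD SL": (11.05, 9.40, True),
}

gains = {name: round(relative_gain(*values), 2)
         for name, values in reported.items()}
```

For WER this is (16.57 - 3.15) / 16.57, about 80.99%, which matches the figure quoted in the abstract; the other four metrics reproduce their quoted percentages in the same way.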

Segment Any Motion in Videos

Nan Huang,Wenzhao Zheng,Chenfeng Xu,Kurt Keutzer,Shanghang Zhang,Angjoo Kanazawa,Qianqian Wang

Task: Propose a new method for moving object segmentation in videos that combines long-range trajectory motion cues with DINO-based semantic features.

Motivation: Humans segment moving objects in videos effortlessly, but existing methods rely on optical flow and often produce imperfect predictions because of partial motion, complex deformations, motion blur, and background distractions.

Details

Method: Combines long-range trajectory motion cues with DINO-based semantic features, uses SAM2 for pixel-level mask densification through an iterative prompting strategy, and employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding. Result: Performs strongly on diverse datasets, reaching state-of-the-art results in challenging scenarios and in fine-grained segmentation of multiple objects. Conclusion: The method performs well on moving object segmentation, and the code is publicly available. Abstract: Moving object segmentation is a crucial task for achieving a high-level understanding of visual scenes and has numerous downstream applications. Humans can effortlessly segment moving objects in videos. Previous work has largely relied on optical flow to provide motion cues; however, this approach often results in imperfect predictions due to challenges such as partial motion, complex deformations, motion blur and background distractions. We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features and leverages SAM2 for pixel-level mask densification through an iterative prompting strategy. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support. Extensive testing on diverse datasets demonstrates state-of-the-art performance, excelling in challenging scenarios and fine-grained segmentation of multiple objects. Our code is available at https://motion-seg.github.io/.

Divide to Conquer: A Field Decomposition Approach for Multi-Organ Whole-Body CT Image Registration

Xuan Loc Pham,Mathias Prokop,Bram van Ginneken,Alessa Hering

Task: Propose a novel field decomposition approach to handle the high complexity of deformations in multi-organ whole-body CT image registration.

Motivation: Existing methods are usually designed for a specific organ and perform worse on other organs, which limits their generalizability and applicability.

Details

Method: Uses field decomposition to handle the complex deformations of multiple organs, with training and evaluation on a longitudinal CT dataset of 691 patients. Result: Experiments show that the proposed method outperforms baseline methods in handling the complex deformations of multi-organ whole-body CT image registration. Conclusion: The method delivers higher performance in multi-organ whole-body CT image registration, with better generalizability and applicability. Abstract: Image registration is an essential technique for the analysis of Computed Tomography (CT) images in clinical practice. However, existing methodologies are predominantly tailored to a specific organ of interest and often exhibit lower performance on other organs, thus limiting their generalizability and applicability. Multi-organ registration addresses these limitations, but the simultaneous alignment of multiple organs with diverse shapes, sizes and locations requires a highly complex deformation field with a multi-layer composition of individual deformations. This study introduces a novel field decomposition approach to address the high complexity of deformations in multi-organ whole-body CT image registration. The proposed method is trained and evaluated on a longitudinal dataset of 691 patients, each with two CT images obtained at distinct time points. These scans fully encompass the thoracic, abdominal, and pelvic regions. Two baseline registration methods are selected for this study: one based on optimization techniques and another based on deep learning. Experimental results demonstrate that the proposed approach outperforms baseline methods in handling complex deformations in multi-organ whole-body CT image registration.

RUNA: Object-level Out-of-Distribution Detection via Regional Uncertainty Alignment of Multimodal Representations

Bin Zhang,Jinggang Chen,Xiaoyang Qu,Guokuan Li,Kai Lu,Jiguang Wan,Jing Xiao,Jianzong Wang

Task: Explore how to use pre-trained vision-language representations for object-level out-of-distribution (OOD) detection.

Motivation: Existing models receive no supervisory signal from out-of-distribution objects, which leads to overconfident predictions on OOD objects and undermines system reliability.

Details

Method: Proposes the RUNA framework, which uses a dual encoder architecture to capture contextual information, a regional uncertainty alignment mechanism to distinguish in-distribution (ID) from OOD objects, and a few-shot fine-tuning approach. Result: Experiments show that RUNA substantially outperforms existing methods in object-level OOD detection, especially in complex scenes. Conclusion: By combining vision-language representations with a regional alignment mechanism, RUNA improves object-level OOD detection. Abstract: Enabling object detectors to recognize out-of-distribution (OOD) objects is vital for building reliable systems. A primary obstacle stems from the fact that models frequently do not receive supervisory signals from unfamiliar data, leading to overly confident predictions regarding OOD objects. Despite previous progress that estimates OOD uncertainty based on the detection model and in-distribution (ID) samples, we explore using pre-trained vision-language representations for object-level OOD detection. We first discuss the limitations of applying image-level CLIP-based OOD detection methods to object-level scenarios. Building upon these insights, we propose RUNA, a novel framework that leverages a dual encoder architecture to capture rich contextual information and employs a regional uncertainty alignment mechanism to distinguish ID from OOD objects effectively. We introduce a few-shot fine-tuning approach that aligns region-level semantic representations to further improve the model's capability to discriminate between similar objects. Our experiments show that RUNA substantially surpasses state-of-the-art methods in object-level OOD detection, particularly in challenging scenarios with diverse and complex object instances.

VisTa: Visual-contextual and Text-augmented Zero-shot Object-level OOD Detection

Bin Zhang,Xiaoyang Qu,Guokuan Li,Jiguang Wan,Jianzong Wang

Task: Propose a new method that uses visual prompts and text-augmented in-distribution (ID) space construction to adapt CLIP for zero-shot object-level out-of-distribution (OOD) detection.

Motivation: Object detectors are increasingly deployed as black-box cloud services or pre-trained models without access to the original training data, so zero-shot object-level OOD detection becomes crucial for reliability in open-world settings.

Details

Method: Uses visual prompts and text-augmented ID space construction to preserve critical contextual information and improve the ability to separate ID from OOD objects. Result: Achieves competitive performance across different benchmarks. Conclusion: The method addresses the challenges of directly applying pre-trained vision-language models such as CLIP to object-level OOD detection and improves detection performance. Abstract: As object detectors are increasingly deployed as black-box cloud services or pre-trained models with restricted access to the original training data, the challenge of zero-shot object-level out-of-distribution (OOD) detection arises. This task becomes crucial in ensuring the reliability of detectors in open-world settings. While existing methods have demonstrated success in image-level OOD detection using pre-trained vision-language models like CLIP, directly applying such models to object-level OOD detection presents challenges due to the loss of contextual information and reliance on image-level alignment. To tackle these challenges, we introduce a new method that leverages visual prompts and text-augmented in-distribution (ID) space construction to adapt CLIP for zero-shot object-level OOD detection. Our method preserves critical contextual information and improves the ability to differentiate between ID and OOD objects, achieving competitive performance across different benchmarks.

A Dataset for Semantic Segmentation in the Presence of Unknowns

Zakaria Laskar,Tomas Vojir,Matej Grcic,Iaroslav Melekhov,Shankar Gangisettye,Juho Kannala,Jiri Matas,Giorgos Tolias,C. V. Jawahar

Task: Propose ISSU, a novel anomaly segmentation dataset for evaluating how deep neural networks handle both known and unknown inputs.

Motivation: Existing datasets allow evaluation of only knowns or only unknowns, which is not enough to establish the "in the wild" suitability of deep neural networks for real applications.

Details

Method: Builds the ISSU dataset with a diverse set of anomaly inputs, providing training, validation, and test sets and supporting both closed-set and open-set evaluation. Result: ISSU is twice as large as existing anomaly segmentation datasets, and evaluation shows that current methods still need to improve in domain generalization and in small and large object segmentation. Conclusion: ISSU fills a gap in existing datasets and offers a more comprehensive evaluation tool for real-world use of deep neural networks. Abstract: Before deployment in the real world, deep neural networks require thorough evaluation of how they handle both knowns, inputs represented in the training data, and unknowns (anomalies). This is especially important for scene understanding tasks with safety-critical applications, such as in autonomous driving. Existing datasets allow evaluation of only knowns or unknowns, but not both, which is required to establish "in the wild" suitability of deep neural network models. To bridge this gap, we propose a novel anomaly segmentation dataset, ISSU, that features a diverse set of anomaly inputs from cluttered real-world environments. The dataset is twice as large as existing anomaly segmentation datasets, and provides a training, validation and test set for controlled in-domain evaluation. The test set consists of a static and temporal part, with the latter comprised of videos. The dataset provides annotations for both closed-set (knowns) and anomalies, enabling closed-set and open-set evaluation. The dataset covers diverse conditions, such as domain and cross-sensor shift and illumination variation, and allows ablation of anomaly detection methods with respect to these variations. Evaluation results of current state-of-the-art methods confirm the need for improvements, especially in domain generalization and in small and large object segmentation.

AH-GS: Augmented 3D Gaussian Splatting for High-Frequency Detail Representation

Chenyang Xu,XingGuo Deng,Rui Zhong

Task: Propose AH-GS, which improves the image reconstruction quality of 3D-GS models by enhancing the manifold complexity of input features and using a network-based feature map loss.

Motivation: The fine-grained rendering of Scaffold-GS depends heavily on adequate viewing angles, and the spectral bias of neural network learning leaves it poor at perceiving and learning high-frequency information in the scene.

Details

Method: Enhance the manifold complexity of input features and introduce a network-based feature map loss and a high-frequency reinforce loss, so that 3D Gaussians in structurally complex regions obtain higher-frequency encodings. Result: In specific scenarios (e.g., MipNeRf360-garden), the method exceeds the rendering quality of Scaffold-GS in just 15K iterations. Conclusion: AH-GS significantly improves rendering fidelity, especially in learning high-frequency information in complex regions. Abstract: 3D Gaussian Splatting (3D-GS) is a novel method for scene representation and view synthesis. Although Scaffold-GS achieves higher-quality real-time rendering than the original 3D-GS, its fine-grained rendering of the scene is extremely dependent on adequate viewing angles. The spectral bias of neural network learning results in Scaffold-GS's poor ability to perceive and learn high-frequency information in the scene. In this work, we propose enhancing the manifold complexity of input features and using a network-based feature map loss to improve the image reconstruction quality of 3D-GS models. We introduce AH-GS, which enables 3D Gaussians in structurally complex regions to obtain higher-frequency encodings, allowing the model to more effectively learn the high-frequency information of the scene. Additionally, we incorporate a high-frequency reinforce loss to further enhance the model's ability to capture detailed frequency information. Our results demonstrate that our model significantly improves rendering fidelity, and in specific scenarios (e.g., MipNeRf360-garden), our method exceeds the rendering quality of Scaffold-GS in just 15K iterations.

VoteFlow: Enforcing Local Rigidity in Self-Supervised Scene Flow

Yancong Lin,Shiming Wang,Liangliang Nan,Julian Kooij,Holger Caesar

Task: Recover per-point motion from two adjacent LiDAR scans.

Motivation: In real-world applications such as autonomous driving, points rarely move independently; nearby points belonging to the same object usually share the same motion. Existing methods lack an architectural inductive bias for local rigidity.

Details

Method: Design a lightweight voting module that enforces local rigidity through a discretized voting space and differentiable voting, operating on pillars for computational efficiency. Result: Outperforms baselines on the Argoverse 2 and Waymo datasets with only marginal compute overhead. Conclusion: A voting module that enforces local rigidity enables end-to-end learning and improves both the performance and the efficiency of scene flow estimation. Abstract: Scene flow estimation aims to recover per-point motion from two adjacent LiDAR scans. However, in real-world applications such as autonomous driving, points rarely move independently of others, especially for nearby points belonging to the same object, which often share the same motion. Incorporating this locally rigid motion constraint has been a key challenge in self-supervised scene flow estimation, which is often addressed by post-processing or appending extra regularization. While these approaches are able to improve the rigidity of predicted flows, they lack an architectural inductive bias for local rigidity within the model structure, leading to suboptimal learning efficiency and inferior performance. In contrast, we enforce local rigidity with a lightweight add-on module in neural network design, enabling end-to-end learning. We design a discretized voting space that accommodates all possible translations and then identify the one shared by nearby points by differentiable voting. Additionally, to ensure computational efficiency, we operate on pillars rather than points and learn representative features for voting per pillar. We plug the Voting Module into popular model designs and evaluate its benefit on Argoverse 2 and Waymo datasets. We outperform baseline works with only marginal compute overhead. Code is available at https://github.com/tudelft-iv/VoteFlow.
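
The idea of voting over a discretized translation space can be illustrated with a toy sketch. This is not the VoteFlow implementation: the 1D translation bins, the per-point score lists, the temperature parameter, and the function names `softmax` and `vote_translation` are all assumptions made for illustration. Each nearby point scores every candidate translation, the scores are summed per candidate (the vote), and a softmax-weighted average selects the shared translation while remaining differentiable.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def vote_translation(point_scores, bins, temperature=1.0):
    """Sum per-point scores for each candidate translation, then return the
    softmax-weighted translation (a soft argmax, hence differentiable)."""
    # One accumulated vote per candidate translation (hypothetical 1D bins).
    votes = [sum(column) for column in zip(*point_scores)]
    weights = softmax([v / temperature for v in votes])
    return sum(w * b for w, b in zip(weights, bins))

# Three nearby points that all favor the translation +1.0.
bins = [-1.0, 0.0, 1.0]
point_scores = [
    [0.1, 0.2, 2.0],
    [0.0, 0.1, 1.5],
    [0.3, 0.0, 1.8],
]
shared = vote_translation(point_scores, bins, temperature=0.1)
```

A low temperature makes the soft argmax approach a hard argmax; a higher one spreads weight across neighboring bins.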

Semantix: An Energy Guided Sampler for Semantic Style Transfer

Huiang He,Minghui Hu,Chuanxia Zheng,Chaoyue Wang,Tat-Jen Cham

Task: Introduce a new task, Semantic Style Transfer, which transfers style and appearance features from a reference image to target visual content based on semantic correspondence.

Motivation: Existing methods usually treat global style transfer and local appearance transfer separately and neglect semantic correspondence; image and video tasks are also typically handled in isolation, with little integration.

Details

Method: Propose Semantix, a training-free method that uses the semantic understanding of pre-trained diffusion models and guides the sampling process with a carefully designed energy function. Result: Experiments show that Semantix not only accomplishes semantic style transfer across images and videos but also surpasses existing state-of-the-art methods in both domains. Conclusion: Semantix offers a generic and efficient solution for semantic style transfer in visual media. Abstract: Recent advances in style and appearance transfer are impressive, but most methods isolate global style and local appearance transfer, neglecting semantic correspondence. Additionally, image and video tasks are typically handled in isolation, with little focus on integrating them for video transfer. To address these limitations, we introduce a novel task, Semantic Style Transfer, which involves transferring style and appearance features from a reference image to a target visual content based on semantic correspondence. We subsequently propose a training-free method, Semantix, an energy-guided sampler designed for Semantic Style Transfer that simultaneously guides both style and appearance transfer based on the semantic understanding capacity of pre-trained diffusion models. Additionally, as a sampler, Semantix can be seamlessly applied to both image and video models, enabling semantic style transfer to be generic across various visual media. Specifically, after inverting both reference and context images or videos to noise space by SDEs, Semantix utilizes a meticulously crafted energy function to guide the sampling process, including three key components: Style Feature Guidance, Spatial Feature Guidance and Semantic Distance as a regularisation term. Experimental results demonstrate that Semantix not only effectively accomplishes the task of semantic style transfer across images and videos, but also surpasses existing state-of-the-art solutions in both fields. The project website is available at https://huiang-he.github.io/semantix/

ArchCAD-400K: An Open Large-Scale Architectural CAD Dataset and New Baseline for Panoptic Symbol Spotting

Ruifeng Luo,Zhengjie Liu,Tianxiao Cheng,Jie Wang,Tongjie Wang,Xingguang Wei,Haomin Wang,YanPeng Li,Fu Chai,Fei Cheng,Shenglong Ye,Wenhai Wang,Yanting Zhang,Yu Qiao,Hongjie Zhang,Xianzhong Zhao

Task: Propose a novel CAD data annotation engine, build the large-scale CAD dataset ArchCAD-400K, and present DPSS, a new baseline model for panoptic symbol spotting.

Motivation: Reduce the manual labeling effort for symbol spotting in architectural CAD drawings and drive innovation in architectural design and construction.

Details

Method: Use intrinsic attributes of CAD drawings to automatically generate high-quality annotations, build the ArchCAD-400K dataset, and propose the DPSS model, which enhances features with an adaptive fusion module. Result: ArchCAD-400K is the largest CAD dataset to date, and DPSS achieves state-of-the-art performance in symbol spotting. Conclusion: ArchCAD-400K and DPSS provide efficient tools for symbol spotting in architectural CAD drawings, with the potential to drive innovation in the industry. Abstract: Recognizing symbols in architectural CAD drawings is critical for various advanced engineering applications. In this paper, we propose a novel CAD data annotation engine that leverages intrinsic attributes from systematically archived CAD drawings to automatically generate high-quality annotations, thus significantly reducing manual labeling efforts. Utilizing this engine, we construct ArchCAD-400K, a large-scale CAD dataset consisting of 413,062 chunks from 5538 highly standardized drawings, making it over 26 times larger than the largest existing CAD dataset. ArchCAD-400K boasts an extended drawing diversity and broader categories, offering line-grained annotations. Furthermore, we present a new baseline model for panoptic symbol spotting, termed Dual-Pathway Symbol Spotter (DPSS). It incorporates an adaptive fusion module to enhance primitive features with complementary image features, achieving state-of-the-art performance and enhanced robustness. Extensive experiments validate the effectiveness of DPSS, demonstrating the value of ArchCAD-400K and its potential to drive innovation in architectural design and construction.

GCRayDiffusion: Pose-Free Surface Reconstruction via Geometric Consistent Ray Diffusion

Li-Heng Chen,Zi-Xin Zou,Chang Liu,Tianjiao Jing,Yan-Pei Cao,Shi-Sheng Huang,Hongbo Fu,Hua Huang

Task: Propose a new pose-free surface reconstruction technique that achieves highly accurate reconstruction from sparse views through a Geometric Consistent Ray Diffusion model (GCRayDiffusion).

Motivation: Address the challenges of pose-free surface reconstruction under sparse views, in particular the joint optimization of camera pose estimation.

Details

Method: Follow triplane-based signed distance field (SDF) learning and regularize camera poses with a ray diffusion model. Result: Evaluations on public datasets confirm the accuracy and geometric consistency of GCRayDiffusion in both camera pose estimation and surface reconstruction. Conclusion: With sparse-view inputs, the method achieves more accurate pose estimation and surface reconstruction than existing methods. Abstract: Accurate surface reconstruction from unposed images is crucial for efficient 3D object or scene creation. However, it remains challenging, particularly for the joint camera pose estimation. Previous approaches have achieved impressive pose-free surface reconstruction results in dense-view settings, but could easily fail for sparse-view scenarios without sufficient visual overlap. In this paper, we propose a new technique for pose-free surface reconstruction, which follows triplane-based signed distance field (SDF) learning but regularizes the learning by explicit points sampled from ray-based diffusion of camera pose estimation. Our key contribution is a novel Geometric Consistent Ray Diffusion model (GCRayDiffusion), where we represent camera poses as neural bundle rays and regress the distribution of noisy rays via a diffusion model. More importantly, we further condition the denoising process of GCRayDiffusion using the triplane-based SDF of the entire scene, which provides effective 3D consistent regularization to achieve multi-view consistent camera pose estimation. Finally, we incorporate GCRayDiffusion into the triplane-based SDF learning by introducing on-surface geometric regularization from the sampling points of the neural bundle rays, which leads to highly accurate pose-free surface reconstruction results even for sparse-view inputs. Extensive evaluations on public datasets show that our GCRayDiffusion achieves more accurate camera pose estimation than previous approaches, with geometrically more consistent surface reconstruction results, especially given sparse-view inputs.

Byeongjun Kwon,Munchurl Kim

Task: Propose PRO, an efficient tile-based framework that addresses depth discontinuity and generalization in zero-shot depth estimation for high-resolution images.

Motivation: Existing methods suffer from depth discontinuities, high memory consumption and poor generalization on high-resolution images, and they rely on synthetic datasets.

Details

Method: PRO has two key components, Grouped Patch Consistency Training and Bias Free Masking, which target the depth discontinuity and generalization problems respectively. Result: Zero-shot evaluation on multiple datasets shows that PRO handles high-resolution images effectively, reduces depth discontinuities and keeps inference fast. Conclusion: PRO is an efficient and generalizable framework for zero-shot depth estimation on high-resolution images. Abstract: Zero-shot depth estimation (DE) models exhibit strong generalization performance as they are trained on large-scale datasets. However, existing models struggle with high-resolution images due to the discrepancy in image resolutions of training (with smaller resolutions) and inference (for high resolutions). Processing them at full resolution leads to decreased estimation accuracy on depth with tremendous memory consumption, while downsampling to the training resolution results in blurred edges in the estimated depth images. Prevailing high-resolution depth estimation methods adopt a patch-based approach, which introduces depth discontinuity issues when reassembling the estimated depth patches and results in test-time inefficiency. Additionally, to obtain fine-grained depth details, these methods rely on synthetic datasets due to the real-world sparse ground truth depth, leading to poor generalizability. To tackle these limitations, we propose Patch Refine Once (PRO), an efficient and generalizable tile-based framework. Our PRO consists of two key components: (i) Grouped Patch Consistency Training that enhances test-time efficiency while mitigating the depth discontinuity problem by jointly processing four overlapping patches and enforcing a consistency loss on their overlapping regions within a single backpropagation step, and (ii) Bias Free Masking that prevents the DE models from overfitting to dataset-specific biases, enabling better generalization to real-world datasets even after training on synthetic data. Zero-shot evaluation on Booster, ETH3D, Middlebury 2014, and NuScenes demonstrates that our PRO can be well harmonized with existing DE models, keeping their DE capabilities effective for the grid input of high-resolution images with few depth discontinuities at the grid boundaries. Our PRO also runs fast at inference time.
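
A consistency loss on overlapping patch regions can be sketched in a few lines. The sketch below is a simplification under stated assumptions: it handles two horizontally adjacent patches rather than the four overlapping patches mentioned in the abstract, it uses a mean absolute difference because the abstract does not specify the loss form, and the function name `overlap_consistency_loss` is hypothetical.

```python
def overlap_consistency_loss(left_patch, right_patch, overlap):
    """Mean absolute difference over the columns shared by two horizontally
    adjacent depth patches (each patch is a list of rows of floats)."""
    total, count = 0.0, 0
    for row_left, row_right in zip(left_patch, right_patch):
        # Last `overlap` columns of the left patch cover the same pixels
        # as the first `overlap` columns of the right patch.
        for a, b in zip(row_left[-overlap:], row_right[:overlap]):
            total += abs(a - b)
            count += 1
    return total / count

# Two 2x3 depth patches that share one column.
left = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
right = [[3.5, 4.0, 5.0], [2.5, 4.0, 5.0]]
loss = overlap_consistency_loss(left, right, overlap=1)
```

Driving this quantity toward zero during training encourages predictions for the shared pixels to agree, which is what suppresses visible seams when patches are reassembled.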

Meta-LoRA: Meta-Learning LoRA Components for Domain-Aware ID Personalization

Barış Batuhan Topal,Umut Özyurt,Zafer Doğan Budak,Ramazan Gokberk Cinbis

Task: Propose Meta-LoRA, a new framework that uses meta-learning to encode domain-specific priors into LoRA-based identity personalization for text-to-image generative models.

Motivation: Identity personalization remains a challenge for current text-to-image models: generating consistent, subject-specific outputs from a limited set of reference images.

Details

Method: Use a three-layer LoRA architecture that separates identity-agnostic knowledge from identity-specific adaptation; meta-train a shared manifold first, then optimize only specific layers to adapt to a new subject. Result: Meta-LoRA outperforms existing methods in identity retention, computational efficiency and adaptability, and introduces a new benchmark dataset, Meta-PHD. Conclusion: Meta-LoRA offers an efficient and adaptable solution to identity personalization; code and dataset will be released publicly. Abstract: Recent advancements in text-to-image generative models, particularly latent diffusion models (LDMs), have demonstrated remarkable capabilities in synthesizing high-quality images from textual prompts. However, achieving identity personalization, that is, ensuring that a model consistently generates subject-specific outputs from limited reference images, remains a fundamental challenge. To address this, we introduce Meta-Low-Rank Adaptation (Meta-LoRA), a novel framework that leverages meta-learning to encode domain-specific priors into LoRA-based identity personalization. Our method introduces a structured three-layer LoRA architecture that separates identity-agnostic knowledge from identity-specific adaptation. In the first stage, the LoRA Meta-Down layers are meta-trained across multiple subjects, learning a shared manifold that captures general identity-related features. In the second stage, only the LoRA-Mid and LoRA-Up layers are optimized to specialize on a given subject, significantly reducing adaptation time while improving identity fidelity. To evaluate our approach, we introduce Meta-PHD, a new benchmark dataset for identity personalization, and compare Meta-LoRA against state-of-the-art methods. Our results demonstrate that Meta-LoRA achieves superior identity retention, computational efficiency, and adaptability across diverse identity conditions. The code, model weights, and dataset will be released publicly upon acceptance.
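
One plausible reading of the three-layer LoRA structure is a low-rank update composed of three matrices added to a frozen base weight. The sketch below assumes the layers are applied sequentially as Down, then Mid, then Up; the abstract does not spell out the composition, so this ordering, the toy dimensions, and the helper names `matvec` and `three_layer_lora_forward` are assumptions for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def three_layer_lora_forward(x, base_w, down, mid, up):
    """Frozen base projection plus a low-rank update Up(Mid(Down(x))).

    down: meta-trained across subjects, shared (assumed frozen in stage two)
    mid, up: optimized per subject in stage two
    """
    base = matvec(base_w, x)
    delta = matvec(up, matvec(mid, matvec(down, x)))
    return [b + d for b, d in zip(base, delta)]

x = [1.0, 2.0]
base_w = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (identity here)
down = [[1.0, 1.0]]                # 2 -> rank 1
mid = [[2.0]]                      # rank 1 -> rank 1
up_zero = [[0.0], [0.0]]           # zero-initialized Up: no change to the base output
up_trained = [[1.0], [0.0]]        # hypothetical subject-specific Up

y_initial = three_layer_lora_forward(x, base_w, down, mid, up_zero)
y_adapted = three_layer_lora_forward(x, base_w, down, mid, up_trained)
```

Zero-initializing the Up matrix, a common LoRA convention, means adaptation starts from the unmodified base model; in this reading, only `mid` and `up` would change while specializing on a new subject.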

EchoFlow: A Foundation Model for Cardiac Ultrasound Image and Video Generation

Hadrien Reynaud,Alberto Gomez,Paul Leeson,Qingjie Meng,Bernhard Kainz

Task: Propose EchoFlow, a framework for generating high-quality, privacy-preserving synthetic echocardiogram images and videos.

Motivation: Address the limited availability of large-scale datasets in medical image analysis caused by patient privacy concerns.

Details

Method: Combine an adversarial variational autoencoder, a latent image flow matching model, a latent re-identification model and a latent video flow matching model to generate synthetic data. Result: On ejection fraction regression, models trained on the synthetic datasets perform on par with models trained on real datasets. Conclusion: EchoFlow provides a privacy-compliant, large-scale data solution for medical ultrasound imaging research. Abstract: Advances in deep learning have significantly enhanced medical image analysis, yet the availability of large-scale medical datasets remains constrained by patient privacy concerns. We present EchoFlow, a novel framework designed to generate high-quality, privacy-preserving synthetic echocardiogram images and videos. EchoFlow comprises four key components: an adversarial variational autoencoder for defining an efficient latent representation of cardiac ultrasound images, a latent image flow matching model for generating accurate latent echocardiogram images, a latent re-identification model to ensure privacy by filtering images anatomically, and a latent video flow matching model for animating latent images into realistic echocardiogram videos conditioned on ejection fraction. We rigorously evaluate our synthetic datasets on the clinically relevant task of ejection fraction regression and demonstrate, for the first time, that downstream models trained exclusively on EchoFlow-generated synthetic datasets achieve performance parity with models trained on real datasets. We release our models and synthetic datasets, enabling broader, privacy-compliant research in medical ultrasound imaging at https://huggingface.co/spaces/HReynaud/EchoFlow.

Mitigating Knowledge Discrepancies among Multiple Datasets for Task-agnostic Unified Face Alignment

Jiahao Xia,Min Xu,Wenjian Huang,Jianguo Zhang,Haimin Zhang,Chunxia Xiao

Task: Propose a task-agnostic unified face alignment (TUFA) framework that resolves annotation discrepancies across multiple datasets.

Motivation: Existing methods cannot learn unified knowledge from multiple datasets with different annotations, and the limited samples in any single dataset leave models with poor robustness.

Details

Method: Compute a mean face shape for each dataset, align these shapes on an interpretable plane using semantic alignment embeddings, and regress the target facial landmarks from structure prompts and image features. Result: Experiments show that the method significantly improves face alignment performance and strengthens generalization and few-shot learning efficiency. Conclusion: TUFA successfully mitigates knowledge discrepancies and achieves task-agnostic, zero-shot landmark localization. Abstract: Despite the similar structures of human faces, existing face alignment methods cannot learn unified knowledge from multiple datasets with different landmark annotations. The limited training samples in a single dataset commonly result in fragile robustness in this field. To mitigate knowledge discrepancies among different datasets and train a task-agnostic unified face alignment (TUFA) framework, this paper presents a strategy to unify knowledge from multiple datasets. Specifically, we calculate a mean face shape for each dataset. To explicitly align these mean shapes on an interpretable plane based on their semantics, each shape is then incorporated with a group of semantic alignment embeddings. The 2D coordinates of these aligned shapes can be viewed as the anchors of the plane. By encoding them into structure prompts and further regressing the corresponding facial landmarks using image features, a mapping from the plane to the target faces is finally established, which unifies the learning target of different datasets. Consequently, multiple datasets can be utilized to boost the generalization ability of the model. The successful mitigation of discrepancies also enhances the efficiency of knowledge transfer to a novel dataset and significantly boosts the performance of few-shot face alignment. Additionally, the interpretable plane endows TUFA with a task-agnostic characteristic, enabling it to locate landmarks unseen during training in a zero-shot manner. Extensive experiments are carried out on seven benchmarks, and the results demonstrate an impressive improvement in face alignment brought by knowledge discrepancy mitigation.
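
The first step, computing a mean face shape per dataset, amounts to averaging each landmark's coordinates over all annotated faces. The sketch below assumes the shapes have already been normalized to a common coordinate frame (for example, by face-box cropping), which the abstract does not detail; the function name `mean_face_shape` and the two-landmark toy data are hypothetical.

```python
def mean_face_shape(shapes):
    """Average each landmark over all faces in one dataset.

    shapes: list of faces, each a list of (x, y) landmark tuples with the
    same landmark ordering and count.
    """
    n_faces = len(shapes)
    n_landmarks = len(shapes[0])
    return [
        (
            sum(face[i][0] for face in shapes) / n_faces,
            sum(face[i][1] for face in shapes) / n_faces,
        )
        for i in range(n_landmarks)
    ]

# Two faces with two landmarks each (toy data).
dataset_shapes = [
    [(0.0, 0.0), (2.0, 2.0)],
    [(2.0, 0.0), (4.0, 2.0)],
]
mean_shape = mean_face_shape(dataset_shapes)
```

Each dataset yields its own mean shape because landmark definitions and counts differ between datasets; these per-dataset means are what the method then aligns on the shared plane.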

ForcePose: A Deep Learning Approach for Force Calculation Based on Action Recognition Using MediaPipe Pose Estimation Combined with Object Detection

Nandakishor M,Vrinda Govind V,Anuradha Puthalath,Anzy L,Swathi P S,Aswathi R,Devaprabha A R,Varsha Raj,Midhuna Krishnan K,Akhila Anilkumar T V,Yamuna P V

Task: Propose ForcePose, a deep learning framework that estimates forces in human-object interactions by combining human pose estimation with object detection.

Motivation: Traditional methods depend on expensive specialized equipment and are confined to laboratory settings; ForcePose aims to provide real-time force analysis without physical sensors.

Details

Method: Use MediaPipe for skeletal tracking and SSD MobileNet for object recognition, and build a neural network that processes spatial and temporal features to predict force magnitude and direction. Result: After training on a dataset of 850 annotated videos, the model achieves a mean absolute error of 5.83 N in force magnitude and 7.4 degrees in force direction, outperforming existing computer vision approaches by 27.5%. Conclusion: ForcePose opens up new possibilities for force analysis in practical settings such as rehabilitation, ergonomics assessment and athletic performance analysis. Abstract: Force estimation in human-object interactions is crucial for various fields like ergonomics, physical therapy, and sports science. Traditional methods depend on specialized equipment such as force plates and sensors, which makes accurate assessments both expensive and restricted to laboratory settings. In this paper, we introduce ForcePose, a novel deep learning framework that estimates applied forces by combining human pose estimation with object detection. Our approach leverages MediaPipe for skeletal tracking and SSD MobileNet for object recognition to create a unified representation of human-object interaction. We've developed a specialized neural network that processes both spatial and temporal features to predict force magnitude and direction without needing any physical sensors. After training on our dataset of 850 annotated videos with corresponding force measurements, our model achieves a mean absolute error of 5.83 N in force magnitude and 7.4 degrees in force direction. When compared to existing computer vision approaches, our method performs 27.5% better while still offering real-time performance on standard computing hardware. ForcePose opens up new possibilities for force analysis in diverse real-world scenarios where traditional measurement tools are impractical or intrusive. This paper discusses our methodology, the dataset creation process, evaluation metrics, and potential applications across rehabilitation, ergonomics assessment, and athletic performance analysis.
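
The two reported metrics, mean absolute error in newtons and in degrees, can be computed as in the sketch below. It assumes that force direction is a single planar angle in degrees and that angular error wraps around at 360 degrees; the abstract does not state how direction is parameterized, so both are assumptions, as are the function names and the toy values.

```python
def mean_absolute_error(predicted, target):
    """Mean absolute error between two equal-length lists of numbers."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

def angular_error_deg(predicted, target):
    """Smallest absolute difference between two angles, in degrees."""
    diff = abs(predicted - target) % 360.0
    return min(diff, 360.0 - diff)

def mean_angular_error(predicted, target):
    """Mean wrap-around angular error over paired angle lists, in degrees."""
    errors = [angular_error_deg(p, t) for p, t in zip(predicted, target)]
    return sum(errors) / len(errors)

# Toy predictions versus ground truth (values are illustrative only).
magnitude_mae = mean_absolute_error([10.0, 20.0], [12.0, 17.0])
direction_mae = mean_angular_error([350.0, 90.0], [10.0, 80.0])
```

The wrap-around matters: a prediction of 350 degrees against a target of 10 degrees is off by 20 degrees, not 340.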

ViSketch-GPT: Collaborative Multi-Scale Feature Extraction for Sketch Recognition and Generation

Giulio Federico,Giuseppe Amato,Fabio Carrara,Claudio Gennaro,Marco Di Benedetto

Task: Propose ViSketch-GPT, a new algorithm that improves the accuracy of sketch recognition and generation through multi-scale context extraction.

Motivation: Human sketches vary widely in how they are created, which makes their nature hard to capture; recognizing complex structural patterns can improve both recognition accuracy and generation fidelity.

Details

Method: Extract context at multiple scales and combine the features from different scales through an ensemble-like mechanism to strengthen recognition and generation of key details. Result: Experiments on the QuickDraw dataset show that ViSketch-GPT significantly outperforms existing methods in both classification and generation, with substantial gains in accuracy and in the fidelity of generated sketches. Conclusion: By extracting features that work collaboratively, ViSketch-GPT provides a robust framework for understanding complex structures such as sketches, applicable to a range of computer vision and machine learning tasks. Abstract: Understanding the nature of human sketches is challenging because of the wide variation in how they are created. Recognizing complex structural patterns improves both the accuracy in recognizing sketches and the fidelity of the generated sketches. In this work, we introduce ViSketch-GPT, a novel algorithm designed to address these challenges through a multi-scale context extraction approach. The model captures intricate details at multiple scales and combines them using an ensemble-like mechanism, where the extracted features work collaboratively to enhance the recognition and generation of key details crucial for classification and generation tasks. The effectiveness of ViSketch-GPT is validated through extensive experiments on the QuickDraw dataset. Our model establishes a new benchmark, significantly outperforming existing methods in both classification and generation tasks, with substantial improvements in accuracy and the fidelity of generated sketches. The proposed algorithm offers a robust framework for understanding complex structures by extracting features that collaborate to recognize intricate details, enhancing the understanding of structures like sketches and making it a versatile tool for various applications in computer vision and machine learning.

Data Quality Matters: Quantifying Image Quality Impact on Machine Learning Performance

Christian Steinhauser,Philipp Reis,Hubert Padusinski,Jacob Langner,Eric Sax

Task: Propose a four-step framework for evaluating the impact of image modifications on machine learning tasks.

Motivation: Highly automated driving systems depend on precise perception of the environment, while data compression and virtualization can alter sensor data and degrade model performance, so a systematic way to quantify image validity is needed.

Details

Method: Prepare a dataset of modified images, quantify image deviations, analyze the performance of object detection models, and run a correlation analysis. Result: Among the metrics considered, LPIPS shows the highest correlation between image deviation and model performance across all evaluated machine learning tasks. Conclusion: The proposed framework effectively evaluates the impact of image modifications on machine learning tasks, and LPIPS is the best-correlated metric. Abstract: Precise perception of the environment is essential in highly automated driving systems, which rely on machine learning tasks such as object detection and segmentation. Compression of sensor data is commonly used for data handling, while virtualization is used for hardware-in-the-loop validation. Both methods can alter sensor data and degrade model performance. This necessitates a systematic approach to quantifying image validity. This paper presents a four-step framework to evaluate the impact of image modifications on machine learning tasks. First, a dataset with modified images is prepared to ensure one-to-one matching image pairs, enabling measurement of deviations resulting from compression and virtualization. Second, image deviations are quantified by comparing the effects of compression and virtualization against original camera-based sensor data. Third, the performance of state-of-the-art object detection models is analyzed to determine how altered input data affects perception tasks, including bounding box accuracy and reliability. Finally, a correlation analysis is performed to identify relationships between image quality and model performance. As a result, the LPIPS metric achieves the highest correlation between image deviation and machine learning performance across all evaluated machine learning tasks.
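
The final correlation step can be sketched as follows. The abstract does not say which correlation coefficient is used, so the Pearson coefficient here is an assumption, and the deviation scores and performance-drop values are made-up illustrative numbers, not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-image values: a deviation score from some image quality
# metric, and the corresponding drop in detection performance.
deviation_scores = [0.05, 0.10, 0.20, 0.40]
performance_drop = [0.01, 0.03, 0.05, 0.12]
r = pearson(deviation_scores, performance_drop)
```

Repeating this for each candidate image quality metric and comparing the resulting coefficients is one way to rank metrics by how well they track model degradation.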

Rulin Zhou,Wenlong He,An Wang,Qiqi Yao,Haijun Hu,Jiankun Wang,Xi Zhang,Hongliang Ren

Task: Propose Endo-TTAP, a framework for tissue point tracking in endoscopic videos.

Motivation: Complex deformations, instrument occlusion and the scarcity of dense trajectory annotations cause existing methods to perform poorly in long-term tracking.

Details

Method: Combine a Multi-Facet Guided Attention (MFGA) module, which fuses multi-scale flow dynamics, DINOv2 semantic embeddings and explicit motion patterns, with a two-stage curriculum learning strategy built on an Auxiliary Curriculum Adapter (ACA). Result: Validated on two MICCAI Challenge datasets and a self-collected dataset, Endo-TTAP achieves state-of-the-art performance under complex endoscopic conditions. Conclusion: Through multi-modal features and a progressive learning strategy, Endo-TTAP significantly improves the accuracy of tissue point tracking. Abstract: Accurate tissue point tracking in endoscopic videos is critical for robotic-assisted surgical navigation and scene understanding, but remains challenging due to complex deformations, instrument occlusion, and the scarcity of dense trajectory annotations. Existing methods struggle with long-term tracking under these conditions due to limited feature utilization and annotation dependence. We present Endo-TTAP, a novel framework addressing these challenges through: (1) A Multi-Facet Guided Attention (MFGA) module that synergizes multi-scale flow dynamics, DINOv2 semantic embeddings, and explicit motion patterns to jointly predict point positions with uncertainty and occlusion awareness; (2) A two-stage curriculum learning strategy employing an Auxiliary Curriculum Adapter (ACA) for progressive initialization and hybrid supervision. Stage I utilizes synthetic data with optical flow ground truth for uncertainty-occlusion regularization, while Stage II combines unsupervised flow consistency and semi-supervised learning with refined pseudo-labels from off-the-shelf trackers. Extensive validation on two MICCAI Challenge datasets and our collected dataset demonstrates that Endo-TTAP achieves state-of-the-art performance in tissue point tracking, particularly in scenarios characterized by complex endoscopic conditions. The source code and dataset will be available at https://anonymous.4open.science/r/Endo-TTAP-36E5.

GAITGen: Disentangled Motion-Pathology Impaired Gait Generative Model -- Bringing Motion Generation to the Clinical Domain

Vida Adeli,Soroush Mehraban,Majid Mirmehdi,Alan Whone,Benjamin Filtjens,Amirhossein Dadashzadeh,Alfonso Fasano,Andrea Iaboni,Babak Taati

Task: Propose GAITGen, a framework that generates realistic gait sequences conditioned on pathology severity.

Motivation: Scarce clinical datasets and the difficulty of labeling limit the accuracy of computer vision models for parkinsonian gait analysis and raise the risk of bias.

Details

Method: Use a Conditional Residual Vector Quantized Variational Autoencoder to learn disentangled representations of motion dynamics and pathology-specific factors, combined with Mask and Residual Transformers for conditioned sequence generation. Result: GAITGen outperforms existing models in reconstruction fidelity and generation quality and accurately captures pathology-specific gait features. Conclusion: Data generated by GAITGen improves downstream task performance and has the potential to advance clinical gait analysis. Abstract: Gait analysis is crucial for the diagnosis and monitoring of movement disorders like Parkinson's Disease. While computer vision models have shown potential for objectively evaluating parkinsonian gait, their effectiveness is limited by scarce clinical datasets and the challenge of collecting large and well-labelled data, impacting model accuracy and risk of bias. To address these gaps, we propose GAITGen, a novel framework that generates realistic gait sequences conditioned on specified pathology severity levels. GAITGen employs a Conditional Residual Vector Quantized Variational Autoencoder to learn disentangled representations of motion dynamics and pathology-specific factors, coupled with Mask and Residual Transformers for conditioned sequence generation. GAITGen generates realistic, diverse gait sequences across severity levels, enriching datasets and enabling large-scale model training in parkinsonian gait analysis. Experiments on our new PD-GaM (real) dataset demonstrate that GAITGen outperforms adapted state-of-the-art models in both reconstruction fidelity and generation quality, accurately capturing critical pathology-specific gait features. A clinical user study confirms the realism and clinical relevance of our generated sequences. Moreover, incorporating GAITGen-generated data into downstream tasks improves parkinsonian gait severity estimation, highlighting its potential for advancing clinical gait analysis.

DF-Net: The Digital Forensics Network for Image Forgery Detection

David Fischinger,Martin Boyer

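Task: Propose DF-Net, a deep neural network for pixel-wise image forgery detection.

Motivation: The manipulation of public opinion through manipulated images spread on online social networks has become a serious threat to society.

Details

Method: Use the deep neural network DF-Net for pixel-wise image forgery detection. Result: DF-Net outperforms several state-of-the-art methods on four benchmark datasets and is robust against lossy image operations (e.g., resizing, compression) that social networks apply automatically. Conclusion: DF-Net is an effective image forgery detection method, particularly suited to social network settings. Abstract: The orchestrated manipulation of public opinion, particularly through manipulated images, often spread via online social networks (OSN), has become a serious threat to society. In this paper we introduce the Digital Forensics Net (DF-Net), a deep neural network for pixel-wise image forgery detection. The released model outperforms several state-of-the-art methods on four established benchmark datasets. Most notably, DF-Net's detection is robust against lossy image operations (e.g., resizing, compression) as they are automatically performed by social networks.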

VITAL: More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow

Ada Gorgun,Bernt Schiele,Jonas Fischer

Task: Improve feature visualization so that it produces more understandable images, by combining statistics of real image features with measures of relevant network flow.

Motivation: Modern deep networks are widely used in high-stakes decision-making, yet their reasoning is hard to understand, and existing feature visualization methods often produce unrecognizable images.

Details

Method: Guide feature visualization with statistics of real image features combined with measures of relevant network flow. Result: The generated images improve on the state of the art both qualitatively and quantitatively and better decode the information the network uses. Conclusion: The method is a more effective tool for understanding how neural networks reason, and it complements mechanistic circuits. Abstract: Neural networks are widely adopted to solve complex and challenging tasks. Especially in high-stakes decision-making, understanding their reasoning process is crucial, yet proves challenging for modern deep networks. Feature visualization (FV) is a powerful tool to decode what information neurons are responding to and hence to better understand the reasoning behind such networks. In particular, in FV we generate human-understandable images that reflect the information detected by neurons of interest. However, current methods often yield unrecognizable visualizations, exhibiting repetitive patterns and visual artifacts that are hard to understand for a human. To address these problems, we propose to guide FV through statistics of real image features combined with measures of relevant network flow to generate prototypical images. Our approach yields human-understandable visualizations that both qualitatively and quantitatively improve over state-of-the-art FVs across various architectures. As such, it can be used to decode which information the network uses, complementing mechanistic circuits that identify where it is encoded. Code is available at: https://github.com/adagorgun/VITAL

Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks

Wei-Jin Huang,Yuan-Ming Li,Zhi-Wei Xia,Yu-Ming Tang,Kun-Yu Lin,Jian-Fang Hu,Wei-Shi Zheng

Task: Propose the Adaptive Multiple Normal Action Representation (AMNAR) framework for detecting errors in procedural activities.

Motivation: Existing methods usually overlook the case where multiple actions are valid next steps, so they fail to detect errors reliably when the training and inference environments differ.

Details

Method: AMNAR predicts all valid next actions, reconstructs their normal action representations, and compares them against the ongoing action to detect errors. Result: Experiments show that AMNAR achieves state-of-the-art performance, confirming its effectiveness. Conclusion: By modeling multiple valid next actions, AMNAR significantly improves error detection performance. Abstract: Error detection in procedural activities is essential for consistent and correct outcomes in AR-assisted and robotic systems. Existing methods often focus on temporal ordering errors or rely on static prototypes to represent normal actions. However, these approaches typically overlook the common scenario where multiple, distinct actions are valid following a given sequence of executed actions. This leads to two issues: (1) the model cannot effectively detect errors using static prototypes when the inference environment or action execution distribution differs from training; and (2) the model may also use the wrong prototypes to detect errors if the ongoing action label is not the same as the predicted one. To address this problem, we propose an Adaptive Multiple Normal Action Representation (AMNAR) framework. AMNAR predicts all valid next actions and reconstructs their corresponding normal action representations, which are compared against the ongoing action to detect errors. Extensive experiments demonstrate that AMNAR achieves state-of-the-art performance, highlighting the effectiveness of AMNAR and the importance of modeling multiple valid next actions in error detection. The code is available at https://github.com/iSEE-Laboratory/AMNAR.
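
The comparison step can be illustrated with a toy decision rule: an ongoing action is flagged as an error only if it is far from every valid next action's normal representation. This is a hedged sketch, not AMNAR's actual scoring; the Euclidean distance, the fixed threshold, the 2D feature vectors, and the function names `euclidean` and `is_error` are all assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_error(ongoing, normal_representations, threshold):
    """Flag an error when the ongoing action matches none of the valid
    next actions, i.e. its nearest normal representation is too far away."""
    nearest = min(euclidean(ongoing, rep) for rep in normal_representations)
    return nearest > threshold

# Two valid next actions, each with a (hypothetical) normal representation.
valid_next = [[0.0, 0.0], [1.0, 1.0]]
close_to_second = is_error([0.9, 1.0], valid_next, threshold=0.5)
far_from_both = is_error([3.0, 3.0], valid_next, threshold=0.5)
```

Taking the minimum over all valid next actions is what distinguishes this from a single static prototype: an action that matches any one of the valid continuations is accepted.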

DF2023: The Digital Forensics 2023 Dataset for Image Forgery Detection

David Fischinger,Martin Boyer

Task: Release the DF2023 dataset to support research on detecting forged images.

Motivation: The manipulation of public opinion through forged images poses a significant threat to society.

Details

Method: Provide a dataset of one million images covering four major forgery categories. Result: The dataset reduces the time and effort researchers spend preparing data. Conclusion: The DF2023 dataset helps advance research on image forgery detection. Abstract: The deliberate manipulation of public opinion, especially through altered images, which are frequently disseminated through online social networks, poses a significant danger to society. To fight this issue on a technical level, we support the research community by releasing the Digital Forensics 2023 (DF2023) training and validation dataset, comprising one million images from four major forgery categories: splicing, copy-move, enhancement and removal. This dataset enables an objective comparison of network architectures and can significantly reduce the time and effort of researchers preparing datasets.

Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis

Jiangyong Huang,Baoxiong Jia,Yan Wang,Ziyu Zhu,Xiongkun Linghu,Qing Li,Song-Chun Zhu,Siyuan Huang

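Task: Propose the Beacon3D benchmark for evaluating 3D vision-language (3D-VL) models on grounding and question answering tasks.

Motivation: Existing 3D-VL benchmarks suffer from flawed test data, oversimplified metrics, and task isolation, which lead to inaccurate evaluation.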

Details

Method: Designs high-quality test data, object-centric evaluation, and a chain-of-analysis paradigm to improve the robustness and coherence of evaluation. Result: Object-centric evaluation reveals true model performance, grounding-QA coherence is fragile, and incorporating LLMs hinders grounding capabilities. Conclusion: Beacon3D gives the 3D-VL community a more accurate evaluation tool and helps advance the field. Abstract: Existing 3D vision-language (3D-VL) benchmarks fall short in evaluating 3D-VL models, creating a "mist" that obscures rigorous insights into model capabilities and 3D-VL tasks. This mist persists due to three key limitations. First, flawed test data, like ambiguous referential text in the grounding task, can yield incorrect and unreliable test results. Second, oversimplified metrics, such as simply averaging accuracy per question answering (QA) pair, cannot reveal true model capability due to their vulnerability to language variations. Third, existing benchmarks isolate the grounding and QA tasks, disregarding the underlying coherence that QA should be based on solid grounding capabilities. To unveil the "mist", we propose Beacon3D, a benchmark for 3D-VL grounding and QA tasks, delivering a perspective shift in the evaluation of 3D-VL understanding. Beacon3D features (i) high-quality test data with precise and natural language, (ii) object-centric evaluation with multiple tests per object to ensure robustness, and (iii) a novel chain-of-analysis paradigm to address language robustness and model performance coherence across grounding and QA. Our evaluation of state-of-the-art 3D-VL models on Beacon3D reveals that (i) object-centric evaluation elicits true model performance and particularly weak generalization in QA; (ii) grounding-QA coherence remains fragile in current 3D-VL models; and (iii) incorporating large language models (LLMs) into 3D-VL models, though a prevalent practice, hinders grounding capabilities and has yet to elevate QA capabilities. We hope Beacon3D and our comprehensive analysis could benefit the 3D-VL community towards faithful developments.

MVSAnywhere: Zero-Shot Multi-View Stereo

Sergio Izquierdo,Mohamed Sayed,Michael Firman,Guillermo Garcia-Hernando,Daniyar Turmukhambetov,Javier Civera,Oisin Mac Aodha,Gabriel Brostow,Jamie Watson

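Task: Propose MVSA, a general-purpose multi-view stereo architecture for depth estimation across domains and scene types.

Motivation: Existing methods generalize poorly across domains and scene types (e.g. indoor vs. outdoor), and training a general-purpose model raises challenges in architecture design, variable numbers of input views, and depth range estimation.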

Details

Method: Combines monocular and multi-view cues and uses an adaptive cost volume to handle scale-related issues in the MVSA architecture. Result: Achieves state-of-the-art zero-shot depth estimation on the Robust Multi-View Depth Benchmark, surpassing existing multi-view stereo and monocular baselines. Conclusion: By combining multiple cues with an adaptive design, MVSA estimates depth effectively across domains and scene types. Abstract: Computing accurate depth from multiple views is a fundamental and longstanding challenge in computer vision. However, most existing approaches do not generalize well across different domains and scene types (e.g. indoor vs. outdoor). Training a general-purpose multi-view stereo model is challenging and raises several questions, e.g. how to best make use of transformer-based architectures, how to incorporate additional metadata when there is a variable number of input views, and how to estimate the range of valid depths which can vary considerably across different scenes and is typically not known a priori? To address these issues, we introduce MVSA, a novel and versatile Multi-View Stereo architecture that aims to work Anywhere by generalizing across diverse domains and depth ranges. MVSA combines monocular and multi-view cues with an adaptive cost volume to deal with scale-related issues. We demonstrate state-of-the-art zero-shot depth estimation on the Robust Multi-View Depth Benchmark, surpassing existing multi-view stereo and monocular baselines.

NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving

Fuhao Li,Huan Jin,Bin Gao,Liaoyuan Fan,Lihui Jiang,Long Zeng

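Task: Propose the NuGrounding benchmark, constructed with the Hierarchy of Grounding (HoG) method, together with a new approach for multi-view 3D visual grounding.

Motivation: Existing datasets and methods rely on coarse-grained language instructions and integrate 3D geometric reasoning with language comprehension inadequately.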

Details

Method: Combines the instruction comprehension of multi-modal LLMs with the precise localization of detection models, introducing decoupled task tokens and a context query, followed by a fusion decoder that refines spatial-semantic feature fusion. Result: The method clearly outperforms the baselines, reaching 0.59 precision and 0.64 recall, improvements of 50.8% and 54.7%, respectively. Conclusion: NuGrounding and the proposed method perform well on multi-view 3D visual grounding and offer an effective solution for autonomous driving. Abstract: Multi-view 3D visual grounding is critical for autonomous driving vehicles to interpret natural languages and localize target objects in complex environments. However, existing datasets and methods suffer from coarse-grained language instructions and inadequate integration of 3D geometric reasoning with linguistic comprehension. To this end, we introduce NuGrounding, the first large-scale benchmark for multi-view 3D visual grounding in autonomous driving. We present a Hierarchy of Grounding (HoG) method to construct NuGrounding to generate hierarchical multi-level instructions, ensuring comprehensive coverage of human instruction patterns. To tackle this challenging dataset, we propose a novel paradigm that seamlessly combines instruction comprehension abilities of multi-modal LLMs (MLLMs) with precise localization abilities of specialist detection models. Our approach introduces two decoupled task tokens and a context query to aggregate 3D geometric information and semantic instructions, followed by a fusion decoder to refine spatial-semantic feature fusion for precise localization. Extensive experiments demonstrate that our method outperforms the baselines adapted from representative 3D scene understanding methods by a significant margin and achieves 0.59 in precision and 0.64 in recall, with improvements of 50.8% and 54.7%, respectively.

EndoLRMGS: Complete Endoscopic Scene Reconstruction combining Large Reconstruction Modelling and Gaussian Splatting

Xu Wang,Shuai Zhang,Baoru Huang,Danail Stoyanov,Evangelos B. Mazomenos

Task: Complete reconstruction of surgical scenes for robot-assisted surgery (RAS) using deep depth estimation.

Motivation: Existing methods struggle with depth discontinuities and noisy predictions at object boundaries, leading to incomplete reconstruction of occluded surfaces.

Details

Method: Proposes EndoLRMGS, combining Large Reconstruction Modelling (LRM) and Gaussian Splatting (GS), with orthogonal perspective joint projection optimization (OPjPO) for accuracy. Result: Improves IoU of tool 3D models by >40%, PSNR of tool projection by 3.82% to 11.07%, and tissue rendering quality (PSNR: 0.46% to 49.87%, SSIM: 1.53% to 29.21%). Conclusion: EndoLRMGS effectively addresses depth estimation challenges in RAS, achieving significant improvements in reconstruction accuracy and rendering quality. Abstract: Complete reconstruction of surgical scenes is crucial for robot-assisted surgery (RAS). Deep depth estimation is promising, but existing works struggle with depth discontinuities, resulting in noisy predictions at object boundaries, and do not achieve complete reconstruction because they omit occluded surfaces. To address these issues we propose EndoLRMGS, which combines Large Reconstruction Modelling (LRM) and Gaussian Splatting (GS) for complete surgical scene reconstruction. GS reconstructs deformable tissues and LRM generates 3D models for surgical tools, while position and scale are subsequently optimized by introducing orthogonal perspective joint projection optimization (OPjPO) to enhance accuracy. In experiments on four surgical videos from three public datasets, our method improves the Intersection-over-union (IoU) of tool 3D models in 2D projections by >40%. Additionally, EndoLRMGS improves the PSNR of the tool projections by 3.82% to 11.07%. Tissue rendering quality also improves, with PSNR increasing by 0.46% to 49.87% and SSIM by 1.53% to 29.21% across all test videos.

SemAlign3D: Semantic Correspondence between RGB-Images through Aligning 3D Object-Class Representations

Krispin Wandel,Hesheng Wang

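Task: Build 3D object-class representations from monocular depth estimates and large vision model features to make semantic correspondence more robust and data-efficient under extreme view variation.

Motivation: Existing large vision models capture local semantics well but capture global geometric relationships poorly, which makes semantic correspondence unreliable under extreme view variation.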

Details

Method: Proposes a simple but effective method that builds 3D object-class representations from monocular depth estimates and a sparsely annotated dataset, then aligns the RGB image with the 3D representation by minimizing an alignment energy with gradient descent. Result: On the SPair-71k dataset, the PCK@0.1 score improves by more than 10 points on several categories and overall from 85.6% to 88.9%. Conclusion: Combining depth estimates with 3D representations significantly improves the accuracy and robustness of semantic correspondence. Abstract: Semantic correspondence made tremendous progress through the recent advancements of large vision models (LVM). While these LVMs have been shown to reliably capture local semantics, the same can currently not be said for capturing global geometric relationships between semantic object regions. This problem leads to unreliable performance for semantic correspondence between images with extreme view variation. In this work, we aim to leverage monocular depth estimates to capture these geometric relationships for more robust and data-efficient semantic correspondence. First, we introduce a simple but effective method to build 3D object-class representations from monocular depth estimates and LVM features using a sparsely annotated image correspondence dataset. Second, we formulate an alignment energy that can be minimized using gradient descent to obtain an alignment between the 3D object-class representation and the object-class instance in the input RGB-image. Our method achieves state-of-the-art matching accuracy in multiple categories on the challenging SPair-71k dataset, increasing the PCK@0.1 score by more than 10 points on three categories and overall by 3.3 points from 85.6% to 88.9%. Additional resources and code are available at https://dub.sh/semalign3d.
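
To make the reported metric concrete, here is a minimal sketch of how a PCK@0.1 score is typically computed. It assumes the common convention of normalizing by the larger side of the object bounding box; the abstract does not state the exact evaluation protocol, and the keypoints and box size below are made up for illustration.

```python
import numpy as np

def pck(pred, gt, bbox_hw, alpha=0.1):
    """Fraction of predicted keypoints within alpha * max(h, w) of the ground truth.

    bbox_hw is the (height, width) of the object bounding box (assumed convention).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    threshold = alpha * max(bbox_hw)
    dists = np.linalg.norm(pred - gt, axis=1)  # per-keypoint Euclidean error
    return float(np.mean(dists <= threshold))

# Hypothetical keypoints: two are within 10 px of the target, one is 30 px off.
pred = [[10, 10], [50, 52], [90, 40]]
gt = [[12, 11], [50, 50], [60, 40]]
score = pck(pred, gt, bbox_hw=(100, 80))
```

With a 100 x 80 box the threshold is 10 pixels, so two of the three hypothetical keypoints count as correct.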

Masked Self-Supervised Pre-Training for Text Recognition Transformers on Large-Scale Datasets

Martin Kišš,Michal Hradiš

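Task: Explore masked self-supervised pre-training for text recognition transformers.

Motivation: Leverage large-scale unlabeled data to improve model performance, particularly in text recognition.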

Details

Method: Proposes two modifications to the pre-training phase: progressively increasing the masking probability, and modifying the loss function to account for both masked and non-masked patches. Result: Pre-training markedly reduces the character error rate, by up to 30% relative, and performs on par with transfer learning without requiring extra annotated data. Conclusion: Self-supervised pre-training offers clear benefits for text recognition and makes effective use of unlabeled data. Abstract: Self-supervised learning has emerged as a powerful approach for leveraging large-scale unlabeled data to improve model performance in various domains. In this paper, we explore masked self-supervised pre-training for text recognition transformers. Specifically, we propose two modifications to the pre-training phase: progressively increasing the masking probability, and modifying the loss function to incorporate both masked and non-masked patches. We conduct extensive experiments using a dataset of 50M unlabeled text lines for pre-training and four differently sized annotated datasets for fine-tuning. Furthermore, we compare our pre-trained models against those trained with transfer learning, demonstrating the effectiveness of the self-supervised pre-training. In particular, pre-training consistently improves the character error rate of models, in some cases by up to 30% relative. It is also on par with transfer learning but without relying on extra annotated text lines.
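
The two pre-training modifications can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the schedule shape, the start and end probabilities, or how masked and non-masked terms are weighted, so the linear ramp, `p_start`, `p_end`, and `unmasked_weight` are all hypothetical.

```python
def masking_probability(step, total_steps, p_start=0.1, p_end=0.6):
    """Linearly ramp the masking probability over training (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)

def combined_loss(per_patch_loss, mask, unmasked_weight=0.5):
    """Average loss over masked patches plus a down-weighted term for visible ones."""
    masked = [l for l, m in zip(per_patch_loss, mask) if m]
    unmasked = [l for l, m in zip(per_patch_loss, mask) if not m]

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(masked) + unmasked_weight * mean(unmasked)

# Hypothetical per-patch losses: first two patches masked, last two visible.
loss = combined_loss([1.0, 3.0, 2.0, 4.0], [True, True, False, False])
```

Early steps mask few patches and later steps mask more, while the loss keeps a contribution from the visible patches instead of ignoring them.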

AnnoPage Dataset: Dataset of Non-Textual Elements in Documents with Fine-Grained Categorization

Martin Kišš,Michal Hradiš,Martina Dvořáková,Václav Jiroušek,Filip Kersch

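Task: Introduce the AnnoPage Dataset to support research in document layout analysis and object detection.

Motivation: Provide high-quality annotations of non-textual elements (such as images and maps) in historical documents to foster research in related fields.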

Details

Method: Collects 7550 pages of historical documents, annotated by experts with axis-aligned bounding boxes (AABB) for 25 categories of non-textual elements, and splits them into development and test subsets. Result: Provides baseline results (YOLO and DETR detectors); the dataset is publicly available. Conclusion: The AnnoPage Dataset is a high-quality resource for document layout analysis and object detection that supports future research. Abstract: We introduce the AnnoPage Dataset, a novel collection of 7550 pages from historical documents, primarily in Czech and German, spanning from 1485 to the present, focusing on the late 19th and early 20th centuries. The dataset is designed to support research in document layout analysis and object detection. Each page is annotated with axis-aligned bounding boxes (AABB) representing elements of 25 categories of non-textual elements, such as images, maps, decorative elements, or charts, following the Czech Methodology of image document processing. The annotations were created by expert librarians to ensure accuracy and consistency. The dataset also incorporates pages from multiple, mainly historical, document datasets to enhance variability and maintain continuity. The dataset is divided into development and test subsets, with the test set carefully selected to maintain the category distribution. We provide baseline results using YOLO and DETR object detectors, offering a reference point for future research. The AnnoPage Dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.12788419), along with ground-truth annotations in YOLO format.
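
Since the ground truth is distributed in YOLO format, a small sketch of how a pixel-space AABB maps to a YOLO label line may help. It assumes the widely used text convention (`class x_center y_center width height`, all normalized to the image size); the dataset's exact files may differ in detail, and the class index and coordinates below are hypothetical.

```python
def aabb_to_yolo(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space axis-aligned box to a normalized YOLO label line."""
    x_c = (x_min + x_max) / 2 / img_w   # box center, as a fraction of image width
    y_c = (y_min + y_max) / 2 / img_h   # box center, as a fraction of image height
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# Hypothetical box of class 3 on a 1000 x 2000 pixel page scan.
line = aabb_to_yolo(3, 100, 200, 300, 600, img_w=1000, img_h=2000)
```

Normalizing by the page size keeps labels valid when scans are resized, which matters for collections with heterogeneous page dimensions.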

LIM: Large Interpolator Model for Dynamic Reconstruction

Remy Sabathier,Niloy J. Mitra,David Novotny

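Task: Propose LIM, a transformer-based feed-forward model that interpolates implicit 3D representations across time for dynamic 4D reconstruction.

Motivation: Existing 4D reconstruction methods are limited to category-specific models or slow optimization-based methods, so an efficient and general solution is needed.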

Details

Method: Uses a transformer-based feed-forward model, guided by a novel causal consistency loss, to interpolate implicit 3D representations across time. Result: LIM generates high-quality interpolated frames in seconds, supports explicit mesh tracking, and can be combined with a diffusion model to produce dynamic 4D reconstructions from monocular videos. Conclusion: LIM is the first feed-forward model capable of high-speed 4D asset reconstruction across diverse categories and outperforms existing methods. Abstract: Reconstructing dynamic assets from video data is central to many computer vision and graphics tasks. Existing 4D reconstruction approaches are limited by category-specific models or slow optimization-based methods. Inspired by the recent Large Reconstruction Model (LRM), we present the Large Interpolation Model (LIM), a transformer-based feed-forward solution, guided by a novel causal consistency loss, for interpolating implicit 3D representations across time. Given implicit 3D representations at times $t_0$ and $t_1$, LIM produces a deformed shape at any continuous time $t\in[t_0,t_1]$, delivering high-quality interpolated frames in seconds. Furthermore, LIM allows explicit mesh tracking across time, producing a consistently uv-textured mesh sequence ready for integration into existing production pipelines. We also use LIM, in conjunction with a diffusion-based multiview generator, to produce dynamic 4D reconstructions from monocular videos. We evaluate LIM on various dynamic datasets, benchmarking against image-space interpolation methods (e.g., FiLM) and direct triplane linear interpolation, and demonstrate clear advantages. In summary, LIM is the first feed-forward model capable of high-speed tracked 4D asset reconstruction across diverse categories.
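
For orientation, the "direct triplane linear interpolation" baseline mentioned in the abstract can be sketched in a few lines. This is the baseline, not LIM itself, and the tensor layout `(3, C, H, W)` is an assumption made for the example.

```python
import numpy as np

def lerp_triplanes(planes_t0, planes_t1, t0, t1, t):
    """Blend two triplane feature tensors linearly for a time t in [t0, t1]."""
    if t0 >= t1 or not (t0 <= t <= t1):
        raise ValueError("require t0 < t1 and t0 <= t <= t1")
    w = (t - t0) / (t1 - t0)  # interpolation weight in [0, 1]
    return (1.0 - w) * planes_t0 + w * planes_t1

# Hypothetical triplanes: 3 planes, 4 feature channels, 8 x 8 resolution.
planes_a = np.zeros((3, 4, 8, 8))
planes_b = np.ones((3, 4, 8, 8))
quarter = lerp_triplanes(planes_a, planes_b, t0=0.0, t1=1.0, t=0.25)
```

Blending features pointwise like this cannot move geometry, only cross-fade it, which is one plausible reason a learned interpolator is compared against it.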

MO-CTranS: A unified multi-organ segmentation model learning from multiple heterogeneously labelled datasets

Zhendi Gong,Susan Francis,Eleanor Cox,Stamatios N. Sotiropoulos,Dorothee P. Auer,Guoping Qiu,Andrew P. French,Xin Chen

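Task: Train a single model (MO-CTranS) for multi-organ segmentation from multiple partially labelled datasets.

Motivation: Address inconsistent labelling and data imbalance across datasets to make more efficient use of available data.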

Details

Method: Combines a CNN encoder with a Transformer decoder and introduces task-specific tokens to differentiate label discrepancies. Result: Outperforms baseline models and SOTA methods on abdominal MRI datasets. Conclusion: MO-CTranS effectively handles segmentation across multiple datasets with clear performance gains. Abstract: Multi-organ segmentation holds paramount significance in many clinical tasks. In practice, compared to large fully annotated datasets, multiple small datasets are often more accessible and organs are not labelled consistently. Normally, an individual model is trained for each of these datasets, which is not an effective way of using data for model learning. It remains challenging to train a single model that can robustly learn from several partially labelled datasets due to label conflict and data imbalance problems. We propose MO-CTranS: a single model that can overcome such problems. MO-CTranS contains a CNN-based encoder and a Transformer-based decoder, which are connected in a multi-resolution manner. Task-specific tokens are introduced in the decoder to help differentiate label discrepancies. Our method was evaluated and compared to several baseline models and state-of-the-art (SOTA) solutions on abdominal MRI datasets that were acquired in different views (i.e. axial and coronal) and annotated for different organs (i.e. liver, kidney, spleen). Our method achieved better performance (most were statistically significant) than the compared methods. GitHub link: https://github.com/naisops/MO-CTranS.

Image Decomposition with G-norm Weighted by Total Symmetric Variation

Roy Y. He,Martin Huska,Hao Liu

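Task: Propose a new variational model for decomposing images into cartoon and texture parts.

Motivation: Characterize non-local features of bounded variation (BV) images through total symmetric variation (TSV) in order to identify regional boundaries effectively.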

Details

Method: Introduces a weighted Meyer's $G$-norm to identify texture interiors without including contour edges, and designs a fast operator-splitting algorithm for the resulting non-convex optimization problem. Result: For BV images with bounded TSV, the model admits a solution, and numerical experiments validate the method's performance. Conclusion: The proposed model and algorithm are effective for image decomposition. Abstract: In this paper, we propose a novel variational model for decomposing images into their respective cartoon and texture parts. Our model characterizes certain non-local features of any Bounded Variation (BV) image by its Total Symmetric Variation (TSV). We demonstrate that TSV is effective in identifying regional boundaries. Based on this property, we introduce a weighted Meyer's $G$-norm to identify texture interiors without including contour edges. For BV images with bounded TSV, we show that the proposed model admits a solution. Additionally, we design a fast algorithm based on operator-splitting to tackle the associated non-convex optimization problem. The performance of our method is validated by a series of numerical experiments.

Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization

Iñigo Pikabea,Iñaki Lacunza,Oriol Pareras,Carlos Escolano,Aitor Gonzalez-Agirre,Javier Hernando,Marta Villegas

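Task: Address Image-induced Fidelity Loss (IFL), where visual language models (VLMs) respond only in English to multilingual inputs.

Motivation: Existing VLMs generate only English responses to multilingual inputs because multimodal multilingual training data is scarce, which limits their global adoption.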

Details

Method: Proposes a continuous multilingual integration strategy that injects text-only multilingual data during visual instruction tuning, preserving the language model's multilingual capabilities. Result: The approach significantly improves linguistic fidelity across languages without degrading visual performance; model merging also improves language fidelity but at the cost of visual performance. Conclusion: The core method achieves multilingual alignment without trade-offs, offering a scalable and effective path to mitigating IFL for global VLM adoption. Abstract: Rapid advancements in Visual Language Models (VLMs) have transformed multimodal understanding but are often constrained by generating English responses regardless of the input language. This phenomenon has been termed Image-induced Fidelity Loss (IFL) and stems from limited multimodal multilingual training data. To address this, we propose a continuous multilingual integration strategy that injects text-only multilingual data during visual instruction tuning, preserving the language model's original multilingual capabilities. Extensive evaluations demonstrate that our approach significantly improves linguistic fidelity across languages without degradation in visual performance. We also explore model merging, which improves language fidelity but comes at the cost of visual performance. In contrast, our core method achieves robust multilingual alignment without trade-offs, offering a scalable and effective path to mitigating IFL for global VLM adoption.

Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion Model

Jangho Park,Taesung Kwon,Jong Chul Ye

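Task: Propose a training-free method that uses an off-the-shelf video diffusion model to generate multi-view videos from a single input video.

Motivation: Existing 4D video generation methods depend on additional training or compute-intensive training, while real-world 4D data and computational resources are limited.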

Details

Method: Uses a two-step procedure: first synthesize key frames while maintaining structural consistency, then interpolate the remaining frames to build a fully populated, spatio-temporally consistent sampling grid. Result: Successfully extends a single video into a multi-view video while maintaining spatio-temporal consistency. Conclusion: The method is training-free, relies on an off-the-shelf model, and offers a practical and effective solution for multi-view video generation. Abstract: Recently, multi-view or 4D video generation has emerged as a significant research topic. Nonetheless, recent approaches to 4D generation still struggle with fundamental limitations, as they primarily rely on harnessing multiple video diffusion models with additional training or compute-intensive training of a full 4D diffusion model with limited real-world 4D data and large computational costs. To address these challenges, here we propose the first training-free 4D video generation method that leverages the off-the-shelf video diffusion models to generate multi-view videos from a single input video. Our approach consists of two key steps: (1) By designating the edge frames in the spatio-temporal sampling grid as key frames, we first synthesize them using a video diffusion model, leveraging a depth-based warping technique for guidance. This approach ensures structural consistency across the generated frames, preserving spatial and temporal coherence. (2) We then interpolate the remaining frames using a video diffusion model, constructing a fully populated and temporally coherent sampling grid while preserving spatial and temporal consistency. Through this approach, we extend a single video into a multi-view video along novel camera trajectories while maintaining spatio-temporal consistency. Our method is training-free and fully utilizes an off-the-shelf video diffusion model, offering a practical and effective solution for multi-view video generation.
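
One way to picture the two-step procedure is as a views-by-time grid whose border cells are filled first. The sketch below only marks which cells would be key frames; reading "edge frames" as the outer border of the grid is an interpretation of the abstract, and the grid size is hypothetical.

```python
import numpy as np

def key_frame_mask(num_views, num_frames):
    """Mark the border cells of a (views x time) sampling grid as key frames."""
    mask = np.zeros((num_views, num_frames), dtype=bool)
    mask[0, :] = True    # first camera view (e.g. the input video)
    mask[-1, :] = True   # last camera view
    mask[:, 0] = True    # first time step across all views
    mask[:, -1] = True   # last time step across all views
    return mask

# Hypothetical grid: 4 camera views, 5 time steps.
mask = key_frame_mask(4, 5)
num_key = int(mask.sum())        # cells synthesized in step (1)
num_interp = int((~mask).sum())  # cells left for interpolation in step (2)
```

Under this reading, the interior cells are the ones filled by the interpolation step, conditioned on the already synthesized border.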

Understanding Co-speech Gestures in-the-wild

Sindhu B Hegde,K R Prajwal,Taein Kwon,Andrew Zisserman

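Task: Propose a new framework for understanding co-speech gestures in the wild, and introduce three new tasks and benchmarks to evaluate model capabilities.

Motivation: Co-speech gestures are vital to non-verbal communication, yet effective methods for understanding and evaluating them are lacking.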

Details

Method: Proposes a new approach that learns a tri-modal speech-text-video-gesture representation, combining a global phrase contrastive loss with a local gesture-word coupling loss. Result: The learned representations outperform previous methods, including large vision-language models (VLMs), on all three tasks. Conclusion: Speech and text modalities capture distinct gesture-related signals, which demonstrates the advantage of learning a shared tri-modal embedding space. Abstract: Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations: (i) gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal speech-text-video-gesture representation to solve these tasks. By leveraging a combination of global phrase contrastive loss and local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs), across all three tasks. Further analysis reveals that speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space. The dataset, model, and code are available at: https://www.robots.ox.ac.uk/~vgg/research/jegal

TranSplat: Lighting-Consistent Cross-Scene Object Transfer with 3D Gaussian Splatting

Boyang Yu,Yanlin Jin,Ashok Veeraraghavan,Akshat Dave,Guha Balakrishnan

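Task: Propose TranSplat, an algorithm based on the Gaussian Splatting framework for realistic cross-scene object transfer and rendering.

Motivation: Address two challenges in cross-scene object transfer: precise 3D object extraction and realistic relighting in the target scene.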

Details

Method: Fits a Gaussian Splatting model to the source scene, uses 2D object masks to drive fine-grained 3D segmentation, and relights the object via spherical harmonic analysis. Result: Performs well on synthetic and real-world scenes, outperforming baseline methods and producing visually convincing cross-scene object transfers. Conclusion: The paper closes by discussing the limitations of the approach. Abstract: We present TranSplat, a 3D scene rendering algorithm that enables realistic cross-scene object transfer (from a source to a target scene) based on the Gaussian Splatting framework. Our approach addresses two critical challenges: (1) precise 3D object extraction from the source scene, and (2) faithful relighting of the transferred object in the target scene without explicit material property estimation. TranSplat fits a splatting model to the source scene, using 2D object masks to drive fine-grained 3D segmentation. Following user-guided insertion of the object into the target scene, along with automatic refinement of position and orientation, TranSplat derives per-Gaussian radiance transfer functions via spherical harmonic analysis to adapt the object's appearance to match the target scene's lighting environment. This relighting strategy does not require explicitly estimating physical scene properties such as BRDFs. Evaluated on several synthetic and real-world scenes and objects, TranSplat yields excellent 3D object extractions and relighting performance compared to recent baseline methods and visually convincing cross-scene object transfers. We conclude by discussing the limitations of the approach.

DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness

Ruining Li,Chuanxia Zheng,Christian Rupprecht,Andrea Vedaldi

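Task: Propose Direct Simulation Optimization (DSO), a framework that increases the likelihood that a 3D generator directly outputs stable 3D objects.

Motivation: Existing 3D object generators focus on aesthetic quality but neglect physical constraints (such as being self-supporting), while conventional optimization approaches are slow and unstable.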

Details

Method: Fine-tunes the 3D generator with feedback from a non-differentiable simulator, using a direct preference optimization (DPO) or direct reward optimization (DRO) objective. Result: Experiments show that the fine-tuned generator is faster than test-time optimization and more likely to produce stable objects. Conclusion: The DSO framework needs no ground-truth 3D training data and can self-improve through simulation feedback. Abstract: Most 3D object generators focus on aesthetic quality, often neglecting physical constraints necessary in applications. One such constraint is that the 3D object should be self-supporting, i.e., remain balanced under gravity. Prior approaches to generating stable 3D objects used differentiable physics simulators to optimize geometry at test-time, which is slow, unstable, and prone to local optima. Inspired by the literature on aligning generative models to external feedback, we propose Direct Simulation Optimization (DSO), a framework to use the feedback from a (non-differentiable) simulator to increase the likelihood that the 3D generator outputs stable 3D objects directly. We construct a dataset of 3D objects labeled with a stability score obtained from the physics simulator. We can then fine-tune the 3D generator using the stability score as the alignment metric, via direct preference optimization (DPO) or direct reward optimization (DRO), a novel objective, which we introduce, to align diffusion models without requiring pairwise preferences. Our experiments show that the fine-tuned feed-forward generator, using either DPO or DRO objective, is much faster and more likely to produce stable objects than test-time optimization. Notably, the DSO framework works even without any ground-truth 3D objects for training, allowing the 3D generator to self-improve by automatically collecting simulation feedback on its own outputs.
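
As background for the DPO option, the sketch below shows the generic pairwise DPO objective on scalar log-likelihoods, where the "winner" would be the sample the simulator scores as more stable. It is the textbook form, not the paper's diffusion-specific objective, and it does not cover DRO; `beta` and the example log-likelihoods are hypothetical.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss: -log(sigmoid(beta * (policy margin - reference margin)))."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical pair: the policy has shifted probability toward the stable sample.
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)   # policy identical to reference
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)   # stable sample now more likely
```

The loss falls as the fine-tuned model prefers the stable sample more strongly than the frozen reference does, which needs only a ranking from the simulator rather than gradients through it.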

Q-Insight: Understanding Image Quality via Visual Reinforcement Learning

Weiqi Li,Xuanyu Zhang,Shijie Zhao,Yabin Zhang,Junlin Li,Li Zhang,Jian Zhang

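Task: Propose Q-Insight, a reinforcement learning-based model for image quality assessment (IQA) that combines content analysis, degradation perception, and comparison reasoning.

Motivation: Existing methods based on multi-modal large language models (MLLMs) either produce numerical scores that lack interpretability or rely on supervised fine-tuning (SFT) with large-scale annotated datasets, limiting flexibility and applicability.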

Details

Method: Uses a reinforcement learning model built on group relative policy optimization (GRPO), jointly optimizing score regression and degradation perception with designed reward functions. Result: Q-Insight substantially outperforms existing methods on score regression and degradation perception, and generalizes well zero-shot to comparison reasoning. Conclusion: Through reinforcement learning and joint task optimization, Q-Insight achieves strong image quality understanding with good generalization. Abstract: Image quality assessment (IQA) focuses on the perceptual visual quality of images, playing a crucial role in downstream tasks such as image reconstruction, compression, and generation. The rapid advancement of multi-modal large language models (MLLMs) has significantly broadened the scope of IQA, moving toward comprehensive image quality understanding that incorporates content analysis, degradation perception, and comparison reasoning beyond mere numerical scoring. Previous MLLM-based methods typically either generate numerical scores lacking interpretability or heavily rely on supervised fine-tuning (SFT) using large-scale annotated datasets to provide descriptive assessments, limiting their flexibility and applicability. In this paper, we propose Q-Insight, a reinforcement learning-based model built upon group relative policy optimization (GRPO), which demonstrates strong visual reasoning capability for image quality understanding while requiring only a limited amount of rating scores and degradation labels. By jointly optimizing score regression and degradation perception tasks with carefully designed reward functions, our approach effectively exploits their mutual benefits for enhanced performance. Extensive experiments demonstrate that Q-Insight substantially outperforms existing state-of-the-art methods in both score regression and degradation perception tasks, while exhibiting impressive zero-shot generalization to comparison reasoning tasks. Code will be available at https://github.com/lwq20020127/Q-Insight.
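
The core idea behind GRPO, as it is commonly described, is to score each sampled response relative to the other responses drawn for the same input. The sketch below shows only that normalization step; the reward values are hypothetical, and Q-Insight's actual reward functions are not given in the abstract.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same input."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)  # eps guards against a zero-variance group

# Hypothetical rewards for four responses to the same image.
adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses above the group mean get a positive advantage and those below get a negative one, so no separate value network is needed to provide a baseline.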

Deep Learning-Based Quantitative Assessment of Renal Chronicity Indices in Lupus Nephritis

Tianqi Tu,Hui Wang,Jiangbo Pei,Xiaojuan Yu,Aidong Men,Suxia Wang,Qingchao Chen,Ying Tan,Feng Yu,Minghui Zhao

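Task: Develop a deep learning (DL) pipeline that automates the assessment of renal chronicity indices (CI) in lupus nephritis (LN) patients and provides disease-specific prognostic analysis.

Motivation: CI assessment by pathologists is time-consuming, shows high interobserver variation, and is susceptible to fatigue, so an efficient and accurate automated method is needed.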

Details

Method: Develops and validates a DL pipeline on 282 slides from 141 patients, split into a training set plus internal and external testing sets. Result: The DL pipeline performs strongly in tissue segmentation and CI assessment, significantly improves interobserver agreement, and enhances prognostic accuracy. Conclusion: The DL pipeline is efficient and accurate for CI assessment in LN patients and is promising for clinical decision-making and prognostic analysis. Abstract: Background: Renal chronicity indices (CI) have been identified as strong predictors of long-term outcomes in lupus nephritis (LN) patients. However, assessment by pathologists is hindered by challenges such as substantial time requirements, high interobserver variation, and susceptibility to fatigue. This study aims to develop an effective deep learning (DL) pipeline that automates the assessment of CI and provides valuable prognostic insights from a disease-specific perspective. Methods: We curated a dataset comprising 282 slides obtained from 141 patients across two independent cohorts with a complete 10-year follow-up. Our DL pipeline was developed on 60 slides (22,410 patch images) from 30 patients in the training cohort and evaluated on both an internal testing set (148 slides, 77,605 patch images) and an external testing set (74 slides, 27,522 patch images). Results: The study included two cohorts with slight demographic differences, particularly in age and hemoglobin levels. The DL pipeline showed high segmentation performance across tissue compartments and histopathologic lesions, outperforming state-of-the-art methods. The DL pipeline also demonstrated a strong correlation with pathologists in assessing CI, significantly improving interobserver agreement. Additionally, the DL pipeline enhanced prognostic accuracy, particularly in outcome prediction, when combined with clinical parameters and pathologist-assessed CIs. Conclusions: The DL pipeline demonstrated accuracy and efficiency in assessing CI in LN, showing promise in improving interobserver agreement among pathologists. It also exhibited significant value in prognostic analysis and enhancing outcome prediction in LN patients, offering a valuable tool for clinical decision-making.

Implicit neural representations for end-to-end PET reconstruction

Younès Moussaoui,Diana Mateus,Nasrin Taheri,Saïd Moussaoui,Thomas Carlier,Simon Stute

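Task: Propose an unsupervised PET image reconstruction method based on the implicit SIREN neural network architecture.

Motivation: Implicit neural representations (INRs) perform well in medical imaging tasks but have not yet been studied for PET reconstruction.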

Details

Method: Uses the implicit SIREN neural network architecture with sinusoidal activation functions, combined with a forward projection model and a loss function adapted to PET reconstruction. Result: Compared with conventional penalized likelihood methods and deep image prior (DIP) based reconstruction, the INR approach reconstructs higher-quality images, improving contrast, activity recovery, and relative bias. Conclusion: The INR approach offers a simpler, more efficient model for PET image reconstruction with clear improvements. Abstract: Implicit neural representations (INRs) have demonstrated strong capabilities in various medical imaging tasks, such as denoising, registration, and segmentation, by representing images as continuous functions, allowing complex details to be captured. For image reconstruction problems, INRs can also reduce artifacts typically introduced by conventional reconstruction algorithms. However, to the best of our knowledge, INRs have not been studied in the context of PET reconstruction. In this paper, we propose an unsupervised PET image reconstruction method based on the implicit SIREN neural network architecture using sinusoidal activation functions. Our method incorporates a forward projection model and a loss function adapted to perform PET image reconstruction directly from sinograms, without the need for large training datasets. The performance of the proposed approach was compared with that of conventional penalized likelihood methods and deep image prior (DIP) based reconstruction using brain phantom data and realistically simulated sinograms. The results show that the INR-based approach can reconstruct high-quality images with a simpler, more efficient model, offering improvements in PET image reconstruction, particularly in terms of contrast, activity recovery, and relative bias.
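
The building block of a SIREN is a linear layer followed by a sine activation applied to pixel coordinates. A minimal sketch of one such layer is shown below; the frequency scale `w0 = 30`, the layer width, and the random weights are illustrative assumptions, and the forward projection model and PET-specific loss from the abstract are not sketched here.

```python
import numpy as np

def siren_layer(x, weight, bias, w0=30.0):
    """One sinusoidal layer: sin(w0 * (x @ W + b))."""
    return np.sin(w0 * (x @ weight + bias))

rng = np.random.default_rng(0)

# A 4 x 4 grid of 2D coordinates in [-1, 1], flattened to shape (16, 2).
xs = np.linspace(-1.0, 1.0, 4)
coords = np.stack(np.meshgrid(xs, xs, indexing="ij"), axis=-1).reshape(-1, 2)

weight = rng.uniform(-0.5, 0.5, size=(2, 8))  # hypothetical weights, 8 hidden units
bias = np.zeros(8)
features = siren_layer(coords, weight, bias)
```

Because the network maps continuous coordinates to intensities, the image is represented as a function rather than a fixed voxel grid, which is the property the abstract highlights.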

Learning from spatially inhomogenous data: resolution-adaptive convolutions for multiple sclerosis lesion segmentation

Ivan Diaz,Florin Scherer,Yanik Berli,Roland Wiest,Helly Hammer,Robert Hoepner,Alejandro Leon Betancourt,Piotr Radojewski,Richard McKinley

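Task: Propose a network architecture based on the e3nn framework that learns segmentation directly from spatially heterogeneous MRI data, without resampling.

Motivation: In clinical imaging, differences between vendors, hospitals, and sequences produce large variations in resolution, and the usual resampling approach can cause loss of fidelity.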

Details

Method: Designs a network whose convolutional kernels are parameterized with spherical harmonics and have a fixed physical radius, so they can adapt to different voxel resolutions. Result: On a public dataset and an in-house multiple sclerosis dataset, the network outperforms classical U-Nets on 2D testing cases and most 3D testing cases. Conclusion: The network generalizes to image resolutions not seen during training, showing potential for handling heterogeneous data. Abstract: In the setting of clinical imaging, differences between vendors, hospitals and sequences can yield highly inhomogeneous imaging data. In MRI in particular, voxel dimension, slice spacing and acquisition plane can vary substantially. For clinical applications, therefore, algorithms must be trained to handle data with various voxel resolutions. The usual strategy to deal with heterogeneity of resolution is harmonization: resampling imaging data to a common (usually isovoxel) resolution. This can lead to loss of fidelity arising from interpolation artifacts out-of-plane and downsampling in-plane. We present in this paper a network architecture designed to be able to learn directly from spatially heterogeneous data, without resampling: a segmentation network based on the e3nn framework that leverages a spherical harmonic, rather than voxel-grid, parameterization of convolutional kernels, with a fixed physical radius. Networks based on these kernels can be resampled to their input voxel dimensions. We trained and tested our network on a publicly available dataset assembled from three centres, and on an in-house dataset of Multiple Sclerosis cases with a high degree of spatial inhomogeneity. We compared our approach to a standard U-Net with two strategies for handling inhomogeneous data: training directly on the data without resampling, and resampling to a common resolution of 1mm isovoxels. We show that our network is able to learn from various combinations of voxel sizes and outperforms classical U-Nets on 2D testing cases and most 3D testing cases. This shows an ability to generalize well when tested on image resolutions not seen during training. Our code can be found at: http://github.com/SCAN-NRAD/e3nn_U-Net.
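
To illustrate what a fixed physical radius implies, the sketch below computes how many voxels a kernel of a given radius in millimetres would span for different voxel spacings. It is plain arithmetic, not the e3nn API or the authors' implementation, and the radius and spacings are hypothetical.

```python
import math

def kernel_shape(radius_mm, voxel_spacing_mm):
    """Voxel extent of a kernel with a fixed physical radius, per axis."""
    # floor(radius / spacing) voxels fit on each side of the center voxel.
    return tuple(2 * math.floor(radius_mm / s) + 1 for s in voxel_spacing_mm)

# Same 2 mm kernel on an isotropic volume and on a thick-slice acquisition.
iso_shape = kernel_shape(2.0, (1.0, 1.0, 1.0))
thick_slice_shape = kernel_shape(2.0, (0.5, 0.5, 3.0))
```

On the thick-slice volume the kernel covers more voxels in-plane and collapses to a single voxel across slices, so the same physical neighbourhood is sampled without resampling the image.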

Vision Language Models versus Machine Learning Models Performance on Polyp Detection and Classification in Colonoscopy Images

Mohammad Amin Khalafi,Seyed Amir Ahmad Safavi-Naini,Ameneh Salehi,Nariman Naderi,Dorsa Alijanzadeh,Pardis Ketabi Moghadam,Kaveh Kavosi,Negar Golestani,Shabnam Shahrokh,Soltanali Fallah,Jamil S Samaan,Nicholas P. Tatonetti,Nicholas Hoerter,Girish Nadkarni,Hamid Asadzadeh Aghdaei,Ali Soroush

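Task: Evaluate vision-language models (VLMs) against conventional convolutional neural networks (CNNs) and classic machine learning models (CMLs) for computer-aided detection (CADe) and diagnosis (CADx) on colonoscopy polyp images.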

Motivation: Compare how different models perform on colonoscopy polyp image tasks in order to identify the best-performing model.

Details

Method: Analyzed 2,258 colonoscopy images with pathology reports from 428 patients; after preprocessing, evaluated 11 models (ResNet50, 4 CMLs, 2 specialized vision-language encoders, and 3 general-purpose VLMs) on the CADe and CADx tasks. Result: ResNet50 performed best at polyp detection (F1: 91.35%, AUROC: 0.98), followed by BiomedCLIP; GPT-4 outperformed the other VLMs at polyp classification but remained well below the CNN. Conclusion: CNNs remain superior for both CADx and CADe, but BiomedCLIP and GPT-4 may be useful for polyp detection when training a CNN is not feasible. Abstract: Introduction: This study provides a comprehensive performance assessment of vision-language models (VLMs) against established convolutional neural networks (CNNs) and classic machine learning models (CMLs) for computer-aided detection (CADe) and computer-aided diagnosis (CADx) of colonoscopy polyp images. Method: We analyzed 2,258 colonoscopy images with corresponding pathology reports from 428 patients. We preprocessed all images using standardized techniques (resizing, normalization, and augmentation) and implemented a rigorous comparative framework evaluating 11 distinct models: ResNet50, 4 CMLs (random forest, support vector machine, logistic regression, decision tree), two specialized contrastive vision language encoders (CLIP, BiomedCLIP), and three general-purpose VLMs (GPT-4, Gemini-1.5-Pro, Claude-3-Opus). Our performance assessment focused on two clinical tasks: polyp detection (CADe) and classification (CADx). Result: In polyp detection, ResNet50 achieved the best performance (F1: 91.35%, AUROC: 0.98), followed by BiomedCLIP (F1: 88.68%, AUROC: [AS1]). GPT-4 demonstrated comparable effectiveness to traditional machine learning approaches (F1: 81.02%, AUROC: [AS2]), outperforming other general-purpose VLMs. For polyp classification, performance rankings remained consistent but with lower overall metrics. ResNet50 maintained the highest efficacy (weighted F1: 74.94%), while GPT-4 demonstrated moderate capability (weighted F1: 41.18%), significantly exceeding other VLMs (Claude-3-Opus weighted F1: 25.54%, Gemini-1.5-Pro weighted F1: 6.17%). Conclusion: CNNs remain superior for both CADx and CADe tasks. However, VLMs like BiomedCLIP and GPT-4 may be useful for polyp detection tasks where training CNNs is not feasible.

ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning

Kailin Li,Puhao Li,Tengyu Liu,Yuyang Li,Siyuan Huang

Task: Propose ManipTrans, a two-stage method for efficiently transferring human bimanual skills to dexterous robotic hands in simulation.

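Motivation: Human hands play a central role in interaction, but precise, large-scale, human-like manipulation sequences are hard to obtain with conventional reinforcement learning or real-world teleoperation.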

Details

Method: ManipTrans first pre-trains a generalist trajectory imitator to mimic hand motion, then fine-tunes a specific residual module under interaction constraints. Result: Experiments show that ManipTrans surpasses existing methods in success rate, fidelity, and efficiency, and it is used to build the large-scale DexManipNet dataset. Conclusion: ManipTrans offers an efficient route to policy training and real-world deployment for dexterous robotic hands. Abstract: Human hands play a central role in interacting, motivating increasing research in dexterous robotic manipulation. Data-driven embodied AI algorithms demand precise, large-scale, human-like manipulation sequences, which are challenging to obtain with conventional reinforcement learning or real-world teleoperation. To address this, we introduce ManipTrans, a novel two-stage method for efficiently transferring human bimanual skills to dexterous robotic hands in simulation. ManipTrans first pre-trains a generalist trajectory imitator to mimic hand motion, then fine-tunes a specific residual module under interaction constraints, enabling efficient learning and accurate execution of complex bimanual tasks. Experiments show that ManipTrans surpasses state-of-the-art methods in success rate, fidelity, and efficiency. Leveraging ManipTrans, we transfer multiple hand-object datasets to robotic hands, creating DexManipNet, a large-scale dataset featuring previously unexplored tasks like pen capping and bottle unscrewing. DexManipNet comprises 3.3K episodes of robotic manipulation and is easily extensible, facilitating further policy training for dexterous hands and enabling real-world deployments.

Refined Geometry-guided Head Avatar Reconstruction from Monocular RGB Video

Pilseo Park,Ze Zhang,Michel Sarkis,Ning Bi,Xiaoming Liu,Yiying Tong

Task: Reconstruct high-fidelity head avatars from monocular video.

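Motivation: Provide high-quality head avatar reconstruction for virtual human applications, addressing the inability of existing coarse 3DMM-based template representations to capture complex facial detail.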

Details

Method: Propose a two-phase head avatar reconstruction network with a refined 3D mesh representation. The first phase trains a 3DMM-stored NeRF with an initial mesh to incorporate geometric priors; the second phase refines the mesh using an SDF built from the density field of the initial NeRF, with Laplace smoothing to reduce noise. Result: Experiments show that the method further improves NeRF rendering over the initial mesh and outperforms existing methods at reconstructing high-fidelity head avatars. Conclusion: The two-phase approach captures complex facial detail and improves the quality of head avatar reconstruction. Abstract: High-fidelity reconstruction of head avatars from monocular videos is highly desirable for virtual human applications, but it remains a challenge in the fields of computer graphics and computer vision. In this paper, we propose a two-phase head avatar reconstruction network that incorporates a refined 3D mesh representation. Our approach, in contrast to existing methods that rely on coarse template-based 3D representations derived from 3DMM, aims to learn a refined mesh representation suitable for a NeRF that captures complex facial nuances. In the first phase, we train 3DMM-stored NeRF with an initial mesh to utilize geometric priors and integrate observations across frames using a consistent set of latent codes. In the second phase, we leverage a novel mesh refinement procedure based on an SDF constructed from the density field of the initial NeRF. To mitigate the typical noise in the NeRF density field without compromising the features of the 3DMM, we employ Laplace smoothing on the displacement field. Subsequently, we apply a second-phase training with these refined meshes, directing the learning process of the network towards capturing intricate facial details. Our experiments demonstrate that our method further enhances the NeRF rendering based on the initial mesh and achieves performance superior to state-of-the-art methods in reconstructing high-fidelity head avatars with such input.

PyUAT: Open-source Python framework for efficient and scalable cell tracking

Johannes Seiffarth,Katharina Nöh

Task: Develop PyUAT, an efficient and modular Python tool for tracking microbial cells.

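Motivation: Microbial cell tracking in live-cell imaging is challenging, especially when frame rates are limited, because conventional methods struggle with stochastic cell movement and frequent cell divisions.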

Details

Method: Uses uncertainty-aware tracking (UAT), in which statistical models predict cell associations. Result: PyUAT performs well on a large 2D+t dataset, and the study examines how modular biological models and imaging intervals affect tracking performance. Conclusion: PyUAT is an efficient, flexible solution for microbial cell tracking and is released as open source for the community. Abstract: Tracking individual cells in live-cell imaging provides fundamental insights, inevitable for studying causes and consequences of phenotypic heterogeneity, responses to changing environmental conditions or stressors. Microbial cell tracking, characterized by stochastic cell movements and frequent cell divisions, remains a challenging task when imaging frame rates must be limited to avoid counterfactual results. A promising way to overcome this limitation is uncertainty-aware tracking (UAT), which uses statistical models, calibrated to empirically observed cell behavior, to predict likely cell associations. We present PyUAT, an efficient and modular Python implementation of UAT for tracking microbial cells in time-lapse imaging. We demonstrate its performance on a large 2D+t data set and investigate the influence of modular biological models and imaging intervals on the tracking performance. The open-source PyUAT software is available at https://github.com/JuBiotech/PyUAT, including example notebooks for immediate use in Google Colab.

Locally Orderless Images for Optimization in Differentiable Rendering

Ishit Mehta,Manmohan Chandraker,Ravi Ramamoorthi

Task: Propose a method that uses locally orderless images to address the sparse-gradient problem in differentiable rendering.

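Motivation: Existing methods handle sparse gradients through proxy gradients (such as topological or Lagrangian derivatives) but make simplifying assumptions about rendering; multi-resolution image pyramids are unreliable in practice.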

Details

Method: Uses locally orderless images, mapping each pixel to an intensity histogram that preserves local variations in appearance, and minimizes histogram distance in an inverse rendering objective to extend the support of sparse gradients. Result: Validated on a range of inverse problems with both synthetic and real data. Conclusion: The method recovers optimal parameters and resolves the convergence problems caused by sparse gradients. Abstract: Problems in differentiable rendering often involve optimizing scene parameters that cause motion in image space. The gradients for such parameters tend to be sparse, leading to poor convergence. While existing methods address this sparsity through proxy gradients such as topological derivatives or Lagrangian derivatives, they make simplifying assumptions about rendering. Multi-resolution image pyramids offer an alternative approach but prove unreliable in practice. We introduce a method that uses locally orderless images, where each pixel maps to a histogram of intensities that preserves local variations in appearance. Using an inverse rendering objective that minimizes histogram distance, our method extends support for sparsely defined image gradients and recovers optimal parameters. We validate our method on various inverse problems using both synthetic and real data.
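A rough sketch of the per-pixel histogram idea follows. The abstract does not specify the binning, the spatial aperture, or the distance, so the Gaussian soft binning, the box-shaped neighbourhood, and the squared L2 distance used here are all assumptions, and the names `local_histograms` and `histogram_loss` are hypothetical.

```python
import numpy as np


def local_histograms(image, n_bins=8, window=1, bandwidth=0.1):
    """Map a 2D image in [0, 1] to an (H, W, n_bins) array of soft histograms."""
    image = np.asarray(image, dtype=float)
    centers = (np.arange(n_bins) + 0.5) / n_bins
    # Soft-assign each pixel intensity to the bins with a Gaussian kernel.
    soft = np.exp(-0.5 * ((image[..., None] - centers) / bandwidth) ** 2)
    soft /= soft.sum(axis=-1, keepdims=True)
    # Average the soft assignments over a (2*window+1)^2 neighbourhood.
    padded = np.pad(soft, ((window, window), (window, window), (0, 0)), mode="edge")
    h, w = image.shape
    hist = np.zeros_like(soft)
    for dy in range(2 * window + 1):
        for dx in range(2 * window + 1):
            hist += padded[dy:dy + h, dx:dx + w]
    return hist / (2 * window + 1) ** 2


def histogram_loss(img_a, img_b, **kwargs):
    """Mean squared difference between the per-pixel histograms of two images."""
    diff = local_histograms(img_a, **kwargs) - local_histograms(img_b, **kwargs)
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
img = rng.random((6, 7))
hists = local_histograms(img)
```

Because each pixel's histogram pools intensities from its neighbours, a small shift of an object changes the histograms of nearby pixels too, which is how an objective of this kind can widen the region where gradients are non-zero.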

Comprehensive segmentation of deep grey nuclei from structural MRI data

Manojkumar Saranathan,Giuseppina Cogliandro,Thomas Hicks,Dianne Patterson,Behroze Vachha,Alberto Cacciola

Task: Develop a fast, accurate, and robust method for segmenting deep grey nuclei from structural T1 MRI data at conventional field strengths.

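Motivation: No single software tool offers comprehensive and complete segmentation of deep grey nuclei, which hampers the reproducibility and repeatability of research.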

Details

Method: Exploits the improved contrast of white-matter-nulled (WMn) imaging by using the recently proposed Histogram-based Polynomial Synthesis (HIPS) to synthesize WMn-like images from standard T1, then segments deep grey nuclei using multi-atlas segmentation with joint label fusion. Result: The method is robust at all field strengths (1.5/3/7 Tesla), with Dice coefficients of 0.7 or higher for all structures against manual segmentation ground truth. Conclusion: By enabling the use of conventional T1 data from large public databases, the method makes careful study of the role of deep grey nuclei possible, filling the gap left by the lack of robust, reproducible segmentation tools. Abstract: Motivation: Lack of tools for comprehensive and complete segmentation of deep grey nuclei using a single software for reproducibility and repeatability. Goal(s): A fast, accurate and robust method for segmentation of deep grey nuclei (thalamic nuclei, basal ganglia, claustrum, red nucleus) from structural T1 MRI data at conventional field strengths. Approach: We leverage the improved contrast of white-matter-nulled imaging by using the recently proposed Histogram-based Polynomial Synthesis (HIPS) to synthesize WMn-like images from standard T1 and then use a multi-atlas segmentation with joint label fusion to segment deep grey nuclei. Results: The method worked robustly on all field strengths (1.5/3/7) and Dice coefficients of 0.7 or more were achieved for all structures compared against manual segmentation ground truth. Impact: This method facilitates careful investigation of the role of deep grey nuclei by enabling the use of conventional T1 data from large public databases, which has not been possible, hitherto, due to lack of robust reproducible segmentation tools.
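For reference, the Dice coefficient reported above measures the overlap between a predicted and a reference segmentation of one structure as 2|P ∩ T| / (|P| + |T|). A minimal sketch on flattened label maps, with a hypothetical helper name and toy labels, might look like this:

```python
def dice(pred, truth, label):
    """Dice overlap for one label between two flattened label maps."""
    p = {i for i, v in enumerate(pred) if v == label}
    t = {i for i, v in enumerate(truth) if v == label}
    if not p and not t:
        return 1.0  # convention assumed here: both empty counts as perfect agreement
    return 2 * len(p & t) / (len(p) + len(t))


# Toy example: label 1 and label 2 stand in for two nuclei.
pred = [0, 1, 1, 1, 0, 2]
truth = [0, 1, 1, 0, 0, 2]
score_1 = dice(pred, truth, 1)  # 2 * 2 / (3 + 2)
score_2 = dice(pred, truth, 2)
```

A threshold such as 0.7 therefore means that, for every structure, the overlap is at least 70% of the average size of the two segmentations.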

Differential Evolution for Grassmann Manifold Optimization: A Projection Approach

Andrew Lesniewski

Task: Propose a novel evolutionary algorithm for optimizing real-valued objective functions defined on the Grassmann manifold Gr(k,n).

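Motivation: Existing optimization techniques on Gr(k,n) rely mainly on local first- or second-order Riemannian methods, which perform poorly on nonconvex or multimodal landscapes.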

Details

Method: Adapts Differential Evolution (a global, population-based optimizer) to the Grassmann manifold, combining adaptive control parameter schemes with a projection that maps trial vectors onto the manifold via QR decomposition. Result: The method stays feasible with respect to the manifold structure while exploring beyond local neighborhoods, and suits applications in machine learning, signal processing, and low-rank matrix recovery. Conclusion: The algorithm is a flexible, geometry-aware alternative to classical Riemannian optimization and is tested on several Grassmann manifold optimization problems. Abstract: We propose a novel evolutionary algorithm for optimizing real-valued objective functions defined on the Grassmann manifold Gr(k,n), the space of all k-dimensional linear subspaces of R^n. While existing optimization techniques on Gr(k,n) predominantly rely on first- or second-order Riemannian methods, these inherently local methods often struggle with nonconvex or multimodal landscapes. To address this limitation, we adapt the Differential Evolution algorithm - a global, population-based optimization method - to operate effectively on the Grassmannian. Our approach incorporates adaptive control parameter schemes, and introduces a projection mechanism that maps trial vectors onto the manifold via QR decomposition. The resulting algorithm maintains feasibility with respect to the manifold structure while enabling exploration beyond local neighborhoods. This framework provides a flexible and geometry-aware alternative to classical Riemannian optimization methods and is well-suited to applications in machine learning, signal processing, and low-rank matrix recovery where subspace representations play a central role. We test the methodology on a number of examples of optimization problems on Grassmann manifolds.
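The projection step can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it uses the textbook DE/rand/1/bin scheme with fixed `F` and `CR` instead of the adaptive control parameters the abstract mentions, it represents each subspace by an n x k matrix with orthonormal columns, and the function names and the test objective are assumptions.

```python
import numpy as np


def project_qr(Y):
    """Return an orthonormal basis for the column space of Y (Q factor of thin QR)."""
    Q, _ = np.linalg.qr(Y)
    return Q


def de_grassmann(f, n, k, pop_size=12, F=0.5, CR=0.9, iters=30, seed=0):
    """Minimize f over n x k orthonormal bases with a basic differential evolution loop."""
    rng = np.random.default_rng(seed)
    pop = [project_qr(rng.standard_normal((n, k))) for _ in range(pop_size)]
    cost = [f(X) for X in pop]
    for _ in range(iters):
        for i in range(pop_size):
            others = [j for j in range(pop_size) if j != i]
            a, b, c = rng.choice(others, size=3, replace=False)
            # Mutation and crossover happen in the ambient space R^{n x k} ...
            mutant = pop[a] + F * (pop[b] - pop[c])
            mask = rng.random((n, k)) < CR
            # ... and the trial point is mapped back to the manifold by QR.
            trial = project_qr(np.where(mask, mutant, pop[i]))
            trial_cost = f(trial)
            if trial_cost <= cost[i]:
                pop[i], cost[i] = trial, trial_cost
    best = int(np.argmin(cost))
    return pop[best], cost[best]


# Illustrative objective: -trace(X^T A X), which depends only on span(X).
A = np.diag([5.0, 4.0, 1.0, 0.5, 0.2, 0.1])
X_best, f_best = de_grassmann(lambda X: -np.trace(X.T @ A @ X), n=6, k=2)
```

The objective must depend only on the subspace spanned by the columns, not on the particular basis, for the search to be well defined on Gr(k,n); the trace objective above has that property, and its minimum over all 2-dimensional subspaces is minus the sum of the two largest eigenvalues of `A`.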

DeCompress: Denoising via Neural Compression

Ali Zafari,Xi Chen,Shirin Jalali

Task: Propose DeCompress, a new compression-based denoising algorithm that needs neither ground truth clean images nor a large training dataset.

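Motivation: Remove the dependence of conventional denoisers on large sets of clean and noisy image pairs, particularly in domains such as microscopy where ground truth images are hard to obtain.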

Details

Method: Building on compression-based denoising and recent advances in neural compression, develops an algorithm that can be trained on a single noisy image. Result: Without ground truth images or large datasets, DeCompress outperforms zero-shot and unsupervised learning-based denoisers. Conclusion: DeCompress is an efficient, robust denoising method for applications where ground truth data is hard to obtain. Abstract: Learning-based denoising algorithms achieve state-of-the-art performance across various denoising tasks. However, training such models relies on access to large training datasets consisting of clean and noisy image pairs. On the other hand, in many imaging applications, such as microscopy, collecting ground truth images is often infeasible. To address this challenge, researchers have recently developed algorithms that can be trained without requiring access to ground truth data. However, training such models remains computationally challenging and still requires access to large noisy training samples. In this work, inspired by compression-based denoising and recent advances in neural compression, we propose a new compression-based denoising algorithm, which we name DeCompress, that i) does not require access to ground truth images, ii) does not require access to large training dataset - only a single noisy image is sufficient, iii) is robust to overfitting, and iv) achieves superior performance compared with zero-shot or unsupervised learning-based denoisers.

Improving the generalization of deep learning models in the segmentation of mammography images

Jan Hurtado,Joao P. Maia,Cesar A. Sierra-Franco,Alberto Raposo

Task: Use data augmentation strategies to improve deep learning segmentation of landmark structures in mammography images.

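Motivation: Segmenting landmark structures supports breast cancer risk assessment and evaluation of image acquisition adequacy, but existing methods generalize poorly.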

Details

Method: Applies data augmentation through annotation-guided image intensity manipulation and style transfer, balanced so that the model handles images from different vendors' equipment. Result: Experiments show better generalization and accuracy than standard training. Conclusion: The method's accuracy and robustness make it suitable for clinical practice. Abstract: Mammography stands as the main screening method for detecting breast cancer early, enhancing treatment success rates. The segmentation of landmark structures in mammography images can aid the medical assessment in the evaluation of cancer risk and the image acquisition adequacy. We introduce a series of data-centric strategies aimed at enriching the training data for deep learning-based segmentation of landmark structures. Our approach involves augmenting the training samples through annotation-guided image intensity manipulation and style transfer to achieve better generalization than standard training procedures. These augmentations are applied in a balanced manner to ensure the model learns to process a diverse range of images generated by different vendor equipment while retaining its efficacy on the original data. We present extensive numerical and visual results that demonstrate the superior generalization capabilities of our methods when compared to the standard training. For this evaluation, we consider a large dataset that includes mammography images generated by different vendor equipment. Further, we present complementary results that show both the strengths and limitations of our methods across various scenarios. The accuracy and robustness demonstrated in the experiments suggest that our method is well-suited for integration into clinical practice.

REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation

Puzhen Yuan,Angyuan Ma,Yunchao Yao,Huaxiu Yao,Masayoshi Tomizuka,Mingyu Ding

Task: Propose REMAC, an adaptive multi-agent planning framework for efficient, scene-agnostic planning and execution of long-horizon multi-robot tasks.

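Motivation: Existing methods depend on prior environmental knowledge or task-specific prompts and struggle with dynamic scene changes or unexpected task conditions, so adaptability and efficiency both need to be addressed.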

Details

Method: REMAC comprises a self-reflection module (pre-condition and post-condition checks in the loop) and a self-evolvement module (dynamic plan adaptation), and supports parallel multi-robot collaboration. Result: In a RoboCasa-based multi-agent environment, REMAC raises average success rates by 40% and execution efficiency by 52.7%. Conclusion: Through continuous reflection and adaptive refinement, REMAC substantially improves planning and execution of long-horizon multi-robot tasks. Abstract: Vision-language models (VLMs) have demonstrated remarkable capabilities in robotic planning, particularly for long-horizon tasks that require a holistic understanding of the environment for task decomposition. Existing methods typically rely on prior environmental knowledge or carefully designed task-specific prompts, making them struggle with dynamic scene changes or unexpected task conditions, e.g., a robot attempting to put a carrot in the microwave but finds the door was closed. Such challenges underscore two critical issues: adaptability and efficiency. To address them, in this work, we propose an adaptive multi-agent planning framework, termed REMAC, that enables efficient, scene-agnostic multi-robot long-horizon task planning and execution through continuous reflection and self-evolution. REMAC incorporates two key modules: a self-reflection module performing pre-condition and post-condition checks in the loop to evaluate progress and refine plans, and a self-evolvement module dynamically adapting plans based on scene-specific reasoning. It offers several appealing benefits: 1) Robots can initially explore and reason about the environment without complex prompt design. 2) Robots can keep reflecting on potential planning errors and adapting the plan based on task-specific insights. 3) After iterations, a robot can call another one to coordinate tasks in parallel, maximizing the task execution efficiency. To validate REMAC's effectiveness, we build a multi-agent environment for long-horizon robot manipulation and navigation based on RoboCasa, featuring 4 task categories with 27 task styles and 50+ different objects. Based on it, we further benchmark state-of-the-art reasoning models, including DeepSeek-R1, o3-mini, QwQ, and Grok3, demonstrating REMAC's superiority by boosting average success rates by 40% and execution efficiency by 52.7% over the single robot baseline.

Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model

Changchang Sun,Gaowen Liu,Charles Fleming,Yan Yan

Task: Generate music synchronized with the rhythm of a dance video.

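Motivation: Use bi-directional guidance (positive and negative rhythmic information) to improve diffusion models for dance-synchronized music generation.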

Details

Method: Proposes PN-Diffusion, which devises dual diffusion and reverse processes and trains a multi-modal U-Net with both positive and negative rhythmic information. Result: On the AIST++ and TikTok datasets, the model outperforms existing methods in dance-music beat alignment and in the quality of the generated music. Conclusion: Bi-directional rhythmic conditioning in PN-Diffusion markedly improves both synchronization with the dance video and music quality. Abstract: Conditional diffusion models have gained increasing attention since their impressive results for cross-modal synthesis, where the strong alignment between conditioning input and generated output can be achieved by training a time-conditioned U-Net augmented with cross-attention mechanism. In this paper, we focus on the problem of generating music synchronized with rhythmic visual cues of the given dance video. Considering that bi-directional guidance is more beneficial for training a diffusion model, we propose to enhance the quality of generated music and its synchronization with dance videos by adopting both positive rhythmic information and negative ones (PN-Diffusion) as conditions, where dual diffusion and reverse processes are devised. Specifically, to train a sequential multi-modal U-Net structure, PN-Diffusion consists of a noise prediction objective for positive conditioning and an additional noise prediction objective for negative conditioning. To accurately define and select both positive and negative conditioning, we ingeniously utilize temporal correlations in dance videos, capturing positive and negative rhythmic cues by playing them forward and backward, respectively. Through subjective and objective evaluations of input-output correspondence in terms of dance-music beat alignment and the quality of generated music, experimental results on the AIST++ and TikTok dance video datasets demonstrate that our model outperforms SOTA dance-to-music generation models.

Score-Based Turbo Message Passing for Plug-and-Play Compressive Image Recovery

Chang Cai,Xiaojun Yuan,Ying-Jun Angela Zhang

Task: Design a message passing framework for compressive image recovery that integrates a score-based MMSE denoiser.

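Motivation: Conventional message passing algorithms rely on denoisers with generic or hand-crafted priors and perform poorly in highly underdetermined settings, whereas score-based generative modeling captures the image distribution more accurately.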

Details

Method: Exploits the close relation between score-based generative modeling and empirical Bayes-optimal denoising to build a message passing framework that integrates a score-based MMSE denoiser. Result: On the FFHQ dataset, the method achieves a significantly better performance-complexity tradeoff and typically converges in fewer than 20 neural function evaluations. Conclusion: The method outperforms conventional message passing, regularized linear regression, and score-based posterior sampling baselines for compressive image recovery. Abstract: Message passing algorithms have been tailored for compressive imaging applications by plugging in different types of off-the-shelf image denoisers. These off-the-shelf denoisers mostly rely on some generic or hand-crafted priors for denoising. Due to their insufficient accuracy in capturing the true image prior, these methods often fail to produce satisfactory results, especially in largely underdetermined scenarios. On the other hand, score-based generative modeling offers a promising way to accurately characterize the sophisticated image distribution. In this paper, by exploiting the close relation between score-based modeling and empirical Bayes-optimal denoising, we devise a message passing framework that integrates a score-based minimum mean squared error (MMSE) denoiser for compressive image recovery. This framework is firmly rooted in Bayesian formalism, in which state evolution (SE) equations accurately predict its asymptotic performance. Experiments on the FFHQ dataset demonstrate that our method strikes a significantly better performance-complexity tradeoff than conventional message passing, regularized linear regression, and score-based posterior sampling baselines. Remarkably, our method typically requires less than 20 neural function evaluations (NFEs) to converge.
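The relation between a score function and MMSE denoising is commonly expressed through Tweedie's formula: for y = x + n with Gaussian noise of variance σ², the posterior mean is E[x | y] = y + σ² ∇ log p(y). The abstract does not spell out which form the authors use, so the sketch below is only a scalar illustration of that identity. A Gaussian prior is assumed so that the score is available in closed form, and the function names are hypothetical; in the paper's setting a learned score network would take the place of `gaussian_score`.

```python
def gaussian_score(y, mu, var_prior, var_noise):
    """Score of the noisy marginal p(y) = N(mu, var_prior + var_noise)."""
    return -(y - mu) / (var_prior + var_noise)


def mmse_denoise(y, score, var_noise):
    """Tweedie's formula: posterior mean from the score of the noisy marginal."""
    return y + var_noise * score(y)


def posterior_mean_closed_form(y, mu, var_prior, var_noise):
    """Textbook posterior mean for a Gaussian prior with Gaussian noise."""
    return (var_prior * y + var_noise * mu) / (var_prior + var_noise)


mu, var_prior, var_noise = 1.0, 4.0, 0.25
y = 3.0
via_score = mmse_denoise(y, lambda v: gaussian_score(v, mu, var_prior, var_noise), var_noise)
closed_form = posterior_mean_closed_form(y, mu, var_prior, var_noise)
```

Both routes give the same estimate, which is the sense in which a sufficiently accurate score model can act as an MMSE denoiser inside an iterative recovery algorithm.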

A Self-Supervised Learning of a Foundation Model for Analog Layout Design Automation

Sungyu Jeong,Won Joon Choi,Junung Choi,Anik Biswas,Byungsub Kim

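Task: Propose a UNet-based foundation model and a self-supervised learning method for it, addressing the lack of annotated analog layout data and the excessive variety of layout tasks.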

Motivation: Analog layout design suffers from scarce annotated data and highly diverse tasks, and needs an efficient, general-purpose solution.

Details

Method: Uses random patch sampling and random masking for self-supervised learning, generating augmented training data from a small unannotated dataset, then adapts to downstream tasks through pre-training and fine-tuning. Result: The pre-trained model performs well on five downstream tasks, producing DRC/LVS-clean layouts for 96.6% of samples; fine-tuning needs only 1/8 of the data to match training from scratch. Conclusion: The method greatly reduces the need for manual annotation and offers an efficient, general solution for diverse analog layout tasks. Abstract: We propose a UNet-based foundation model and its self-supervised learning method to address two key challenges: 1) lack of qualified annotated analog layout data, and 2) excessive variety in analog layout design tasks. For self-supervised learning, we propose random patch sampling and random masking techniques to automatically obtain enough training data from a small unannotated layout dataset. The obtained data are greatly augmented, less biased, equally sized, and contain enough information for excessive varieties of qualified layout patterns. By pre-training with the obtained data, the proposed foundation model can learn implicit general knowledge on layout patterns so that it can be fine-tuned for various downstream layout tasks with small task-specific datasets. Fine-tuning provides an efficient and consolidated methodology for diverse downstream tasks, reducing the enormous human effort to develop a model per task separately. In experiments, the foundation model was pre-trained using 324,000 samples obtained from 6 silicon-proved manually designed analog circuits, then it was fine-tuned for the five example downstream tasks: generating contacts, vias, dummy fingers, N-wells, and metal routings. The fine-tuned models successfully performed these tasks for more than one thousand unseen layout inputs, generating DRC/LVS-clean layouts for 96.6% of samples. Compared with training the model from scratch for the metal routing task, fine-tuning required only 1/8 of the data to achieve the same dice score of 0.95. With the same data, fine-tuning achieved a 90% lower validation loss and a 40% higher benchmark score than training from scratch.
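Random patch sampling and random masking can be sketched on a toy grid as follows. The abstract gives no patch sizes, mask shapes, or layout encoding, so the single square mask, the zero fill value, and the function names below are assumptions made for illustration.

```python
import random


def sample_patch(layout, size, rng):
    """Cut a random size x size window out of a 2D layout grid."""
    h, w = len(layout), len(layout[0])
    top = rng.randrange(h - size + 1)
    left = rng.randrange(w - size + 1)
    return [row[left:left + size] for row in layout[top:top + size]]


def mask_patch(patch, mask_size, rng, fill=0):
    """Blank a random square region; return (masked input, original target)."""
    size = len(patch)
    top = rng.randrange(size - mask_size + 1)
    left = rng.randrange(size - mask_size + 1)
    masked = [row[:] for row in patch]
    for r in range(top, top + mask_size):
        for c in range(left, left + mask_size):
            masked[r][c] = fill
    return masked, patch


def make_pairs(layout, n, size, mask_size, seed=0):
    """Generate n (masked patch, full patch) training pairs from one layout."""
    rng = random.Random(seed)
    return [mask_patch(sample_patch(layout, size, rng), mask_size, rng) for _ in range(n)]


# Toy 16 x 16 "layout" filled with ones, standing in for a rasterized layer.
layout = [[1] * 16 for _ in range(16)]
pairs = make_pairs(layout, n=5, size=8, mask_size=3)
```

Each pair asks the model to reconstruct the hidden region from its surroundings, so a handful of layouts can yield many equally sized training samples without any manual labels.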

Disentangled 4D Gaussian Splatting: Towards Faster and More Efficient Dynamic Scene Rendering

Hao Feng,Hao Sun,Wei Xie

Task: Propose Disentangled4DGS, a new method for novel-view synthesis of dynamic scenes.

Motivation: Existing 4D Gaussian methods introduce spatiotemporal deformation, which causes redundant computation and high storage requirements.

Details

Method: Disentangles temporal and spatial deformations and extends 3DGS rendering to 4D, avoiding 4D matrix computations. Result: The method renders at 343 FPS on an RTX 3090 GPU and reduces storage requirements by at least 4.5%. Conclusion: Disentangled4DGS is efficient and competitive for dynamic scene rendering. Abstract: Novel-view synthesis (NVS) for dynamic scenes from 2D images presents significant challenges due to the spatial complexity and temporal variability of such scenes. Recently, inspired by the remarkable success of NVS using 3D Gaussian Splatting (3DGS), researchers have sought to extend 3D Gaussian models to four dimensions (4D) for dynamic novel-view synthesis. However, methods based on 4D rotation and scaling introduce spatiotemporal deformation into the 4D covariance matrix, necessitating the slicing of 4D Gaussians into 3D Gaussians. This process increases redundant computations as timestamps change, an inherent characteristic of dynamic scene rendering. Additionally, performing calculations on a four-dimensional matrix is computationally intensive. In this paper, we introduce Disentangled 4D Gaussian Splatting (Disentangled4DGS), a novel representation and rendering approach that disentangles temporal and spatial deformations, thereby eliminating the reliance on 4D matrix computations. We extend the 3DGS rendering process to 4D, enabling the projection of temporal and spatial deformations into dynamic 2D Gaussians in ray space. Consequently, our method facilitates faster dynamic scene synthesis. Moreover, it reduces storage requirements by at least 4.5% due to our efficient presentation method. Our approach achieves an unprecedented average rendering speed of 343 FPS at a resolution of $1352\times1014$ on an RTX 3090 GPU, with experiments across multiple benchmarks demonstrating its competitive performance in both monocular and multi-view scenarios.

A Multi-Site Study on AI-Driven Pathology Detection and Osteoarthritis Grading from Knee X-Ray

Bargava Subramanian,Naveen Kumarasami,Praveen Shastry,Kalyan Sivasailam,Anandakumar D,Keerthana R,Mounigasri M,Abilaasha G,Kishore Prasath Venkatesh

Task: Develop an AI-based system that analyzes knee X-rays to detect multiple pathologies and grade osteoarthritis severity.

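Motivation: Early diagnosis of bone health disorders such as osteoarthritis and osteoporosis is delayed by limited diagnostic tools, so an efficient and accurate solution is needed.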

Details

Method: Builds a diverse dataset from 1.3 million knee X-rays and uses models such as ResNet15 and DenseNet for pathology detection and osteoarthritis grading. Result: The AI system shows high diagnostic accuracy across diverse imaging environments, with the pathology-specific models performing well in precision, recall, and negative predictive value. Conclusion: The system is a scalable, cost-effective solution for bone health diagnostics, suited to resource-limited healthcare settings and likely to improve patient management. Abstract: Introduction: Bone health disorders like osteoarthritis and osteoporosis pose major global health challenges, often leading to delayed diagnoses due to limited diagnostic tools. This study presents an AI-powered system that analyzes knee X-rays to detect key pathologies, including joint space narrowing, sclerosis, osteophytes, tibial spikes, alignment issues, and soft tissue anomalies. It also grades osteoarthritis severity, enabling timely, personalized treatment. Study Design: The research used 1.3 million knee X-rays from a multi-site Indian clinical trial across government, private, and SME hospitals. The dataset ensured diversity in demographics, imaging equipment, and clinical settings. Rigorous annotation and preprocessing yielded high-quality training datasets for pathology-specific models like ResNet15 for joint space narrowing and DenseNet for osteoarthritis grading. Performance: The AI system achieved strong diagnostic accuracy across diverse imaging environments. Pathology-specific models excelled in precision, recall, and NPV, validated using Mean Squared Error (MSE), Intersection over Union (IoU), and Dice coefficient. Subgroup analyses across age, gender, and manufacturer variations confirmed generalizability for real-world applications. Conclusion: This scalable, cost-effective solution for bone health diagnostics demonstrated robust performance in a multi-site trial. It holds promise for widespread adoption, especially in resource-limited healthcare settings, transforming bone health management and enabling proactive patient care.

3D Acetabular Surface Reconstruction from 2D Pre-operative X-ray Images using SRVF Elastic Registration and Deformation Graph

Shuai Zhang,Jinliang Wang,Sujith Konan,Xu Wang,Danail Stoyanov,Evangelos B. Mazomenos

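Task: Propose a new framework that combines SRVF-based elastic shape registration with an embedded deformation (ED) graph to reconstruct the 3D articular surface of the acetabulum from multi-view 2D pre-operative pelvic X-ray images and a hemispherical surface model.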

Motivation: Accurate and reliable selection of acetabular cup size is crucial for restoring joint biomechanics in total hip arthroplasty.

Details

Method: Uses SRVF-based elastic registration to establish 2D-3D correspondences and an ED framework to optimize the 3D acetabular surface reconstruction. Result: Validation on simulated and real patient data demonstrates the algorithm's robustness and potential clinical value. Conclusion: The reconstruction can help surgeons select the correct acetabular cup on the first attempt in primary total hip arthroplasty, reducing the need for revision surgery. Abstract: Accurate and reliable selection of the appropriate acetabular cup size is crucial for restoring joint biomechanics in total hip arthroplasty (THA). This paper proposes a novel framework that integrates square-root velocity function (SRVF)-based elastic shape registration technique with an embedded deformation (ED) graph approach to reconstruct the 3D articular surface of the acetabulum by fusing multiple views of 2D pre-operative pelvic X-ray images and a hemispherical surface model. The SRVF-based elastic registration establishes 2D-3D correspondences between the parametric hemispherical model and X-ray images, and the ED framework incorporates the SRVF-derived correspondences as constraints to optimize the 3D acetabular surface reconstruction using nonlinear least-squares optimization. Validations using both simulation and real patient datasets are performed to demonstrate the robustness and the potential clinical value of the proposed algorithm. The reconstruction result can assist surgeons in selecting the correct acetabular cup on the first attempt in primary THA, minimising the need for revision surgery.

AdaRank: Adaptive Rank Pruning for Enhanced Model Merging

Chanhyuk Lee,Jiho Choi,Chanryeol Lee,Donggyun Kim,Seunghoon Hong

Task: Propose AdaRank, an adaptive model merging framework that improves merging quality in multi-task learning.

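Motivation: Existing SVD-based model merging methods rely on manually designed rank selection, which tends to cause cross-task interference and degraded performance.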

Details

Method: AdaRank dynamically prunes interfering singular directions and learns which ranks to prune at test time via entropy minimization, adaptively selecting the most beneficial singular directions. Result: AdaRank markedly reduces detrimental overlap among tasks and achieves state-of-the-art performance across various backbones and numbers of tasks, narrowing the performance gap to nearly 1%. Conclusion: AdaRank is an efficient adaptive model merging method that substantially improves multi-task performance. Abstract: Model merging has emerged as a promising approach for unifying independently fine-tuned models into an integrated framework, significantly enhancing computational efficiency in multi-task learning. Recently, several SVD-based techniques have been introduced to exploit low-rank structures for enhanced merging, but their reliance on such manually designed rank selection often leads to cross-task interference and suboptimal performance. In this paper, we propose AdaRank, a novel model merging framework that adaptively selects the most beneficial singular directions of task vectors to merge multiple models. We empirically show that the dominant singular components of task vectors can cause critical interference with other tasks, and that naive truncation across tasks and layers degrades performance. In contrast, AdaRank dynamically prunes the singular components that cause interference and offers an optimal amount of information to each task vector by learning to prune ranks during test-time via entropy minimization. Our analysis demonstrates that such a method mitigates detrimental overlaps among tasks, while empirical results show that AdaRank consistently achieves state-of-the-art performance with various backbones and numbers of tasks, reducing the performance gap between fine-tuned models to nearly 1%.
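The pruning step can be illustrated with a short sketch in which each task vector (fine-tuned weights minus pre-trained weights) is decomposed by SVD and a binary mask selects which singular components survive the merge. The masks are supplied by hand here; AdaRank learns them at test time by entropy minimization, which this sketch does not implement. The function names and the plain summation used for merging are assumptions.

```python
import numpy as np


def prune_task_vector(delta, keep_mask):
    """Rebuild a task vector from only the singular components where keep_mask is 1."""
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    return (U * (s * keep_mask)) @ Vt


def merge(pretrained, task_vectors, keep_masks):
    """Add the pruned task vectors to the pre-trained weights."""
    pruned = [prune_task_vector(d, m) for d, m in zip(task_vectors, keep_masks)]
    return pretrained + sum(pruned)


rng = np.random.default_rng(0)
W0 = rng.standard_normal((5, 4))
deltas = [rng.standard_normal((5, 4)) for _ in range(2)]

# Unlike top-r truncation, a mask may drop a dominant component and keep smaller ones.
masks = [np.array([0.0, 1.0, 1.0, 0.0]), np.array([1.0, 1.0, 0.0, 0.0])]
W_merged = merge(W0, deltas, masks)
```

Representing the choice as a mask rather than a single rank is what allows a dominant but interfering direction to be removed while weaker, useful directions are kept.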

Sell It Before You Make It: Revolutionizing E-Commerce with Personalized AI-Generated Items

Jianghao Lin,Peng Du,Jiaqi Liu,Weite Li,Yong Yu,Weinan Zhang,Yang Cao

Task: Propose a personalized text-to-image generation system based on AI-generated items (AIGI) for e-commerce product design.

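Motivation: Traditional e-commerce workflows are inefficient, with high costs tied to product design and manufacturing inventory; the "sell it before you make it" model enabled by AIGI reduces reliance on physical prototypes and shortens time to market.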

Details

Method: Proposes a Personalized Group-Level Preference Alignment Framework (PerFusion), comprising the PerFusion Reward Model and a personalized adaptive network, to model user preferences and optimize group-level preference. Result: Experiments show that AI-generated items achieve over 13% relative improvement in click-through rate and conversion rate compared with human-designed items. Conclusion: AI-generated items have transformative potential for e-commerce platforms and can substantially improve efficiency and user satisfaction. Abstract: E-commerce has revolutionized retail, yet its traditional workflows remain inefficient, with significant time and resource costs tied to product design and manufacturing inventory. This paper introduces a novel system deployed at Alibaba that leverages AI-generated items (AIGI) to address these challenges with personalized text-to-image generation for e-commercial product design. AIGI enables an innovative business mode called "sell it before you make it", where merchants can design fashion items and generate photorealistic images with digital models based on textual descriptions. Only when the items have received a certain number of orders, do the merchants start to produce them, which largely reduces reliance on physical prototypes and thus accelerates time to market. For such a promising application, we identify the underlying key scientific challenge, i.e., capturing the users' group-level personalized preferences towards multiple generated candidate images. To this end, we propose a Personalized Group-Level Preference Alignment Framework for Diffusion Models (i.e., PerFusion). We first design PerFusion Reward Model for user preference estimation with a feature-crossing-based personalized plug-in. Then we develop PerFusion with a personalized adaptive network to model diverse preferences across users, and meanwhile derive the group-level preference optimization objective to capture the comparative behaviors among multiple candidates. Both offline and online experiments demonstrate the effectiveness of our proposed algorithm. The AI-generated items have achieved over 13% relative improvements for both click-through rate and conversion rate compared to their human-designed counterparts, validating the revolutionary potential of AI-generated items for e-commercial platforms.

Enhance Generation Quality of Flow Matching V2A Model via Multi-Step CoT-Like Guidance and Combined Preference Optimization

Haomin Zhang,Sizhe Shan,Haoyu Wang,Zihao Chen,Xiulong Liu,Chaofan Ding,Xinhan Di

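Task: Generate high-quality sound effects from videos and text prompts with a multi-stage, multi-modal, end-to-end generative framework (CoP).

Motivation: Current video-guided audio generation models perform poorly in both general and specialized use cases; the semantic and temporal alignment between the visual and audio domains needs to improve.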

Details

Method: A transformer-based network architecture that implements CoP guidance learning, a multi-stage training framework, and a purpose-built CoP multi-modal dataset. Result: Outperforms existing models on several datasets, with clear gains on the FAD, CLIP, SI-SDR, and MOS metrics. Conclusion: The CoP framework improves the quality and adaptability of sound-effect generation. Abstract: Creating high-quality sound effects from videos and text prompts requires precise alignment between visual and audio domains, both semantically and temporally, along with step-by-step guidance for professional audio generation. However, current state-of-the-art video-guided audio generation models often fall short of producing high-quality audio for both general and specialized use cases. To address this challenge, we introduce a multi-stage, multi-modal, end-to-end generative framework with Chain-of-Thought-like (CoT-like) guidance learning, termed Chain-of-Perform (CoP). First, we employ a transformer-based network architecture designed to achieve CoP guidance, enabling the generation of both general and professional audio. Second, we implement a multi-stage training framework that follows step-by-step guidance to ensure the generation of high-quality sound effects. Third, we develop a CoP multi-modal dataset, guided by video, to support step-by-step sound effects generation. Evaluation results highlight the advantages of the proposed multi-stage CoP generative framework compared to the state-of-the-art models on a variety of datasets, with FAD 0.79 to 0.74 (+6.33%), CLIP 16.12 to 17.70 (+9.80%) on VGGSound, SI-SDR 1.98dB to 3.35dB (+69.19%), MOS 2.94 to 3.49 (+18.71%) on PianoYT-2h, and SI-SDR 2.22dB to 3.21dB (+44.59%), MOS 3.07 to 3.42 (+11.40%) on Piano-10h.

Data-Free Universal Attack by Exploiting the Intrinsic Vulnerability of Deep Models

YangTian Yan,Jinyu Tian

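Task: Propose a data-free method (IntriUAP) that exploits the intrinsic vulnerabilities of deep models to generate universal adversarial perturbations (UAPs).

Motivation: Existing methods for generating UAPs need large amounts of data, an assumption that is hard to satisfy in real-world tasks.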

Details

Method: Analyzes the linear components of the model and exploits their ill-conditioned nature by aligning the UAP with the right singular vector corresponding to the maximum singular value of each linear layer. Result: Without any image samples, attack performance matches existing data-free methods, and the attack success rate drops by only 4% when the adversary can access just 50% of the victim model's linear layers. Conclusion: IntriUAP performs well under data-free and weak-access assumptions and offers a new angle on adversarial attacks. Abstract: Deep neural networks (DNNs) are susceptible to Universal Adversarial Perturbations (UAPs), which are instance-agnostic perturbations that can deceive a target model across a wide range of samples. Unlike instance-specific adversarial examples, UAPs present a greater challenge as they must generalize across different samples and models. Generating UAPs typically requires access to numerous examples, which is a strong assumption in real-world tasks. In this paper, we propose a novel data-free method called Intrinsic UAP (IntriUAP), by exploiting the intrinsic vulnerabilities of deep models. We analyze a series of popular deep models composed of linear and nonlinear layers with a Lipschitz constant of 1, revealing that the vulnerability of these models is predominantly influenced by their linear components. Based on this observation, we leverage the ill-conditioned nature of the linear components by aligning the UAP with the right singular vectors corresponding to the maximum singular value of each linear layer. Remarkably, our method achieves highly competitive performance in attacking popular image classification deep models without using any image samples. We also evaluate the black-box attack performance of our method, showing that it matches the state-of-the-art baseline for data-free methods on models that conform to our theoretical framework. Beyond the data-free assumption, IntriUAP also operates under a weaker assumption, where the adversary can access only a few of the victim model's layers. Experiments demonstrate that the attack success rate decreases by only 4% when the adversary has access to just 50% of the linear layers in the victim model.
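
The singular-vector alignment described in the abstract can be illustrated for a single linear layer. The following is a minimal sketch, not the authors' implementation: it assumes one weight matrix `W` (a made-up toy matrix), finds the top right singular vector by power iteration on W^T W, and scales it to a hypothetical L-infinity budget `eps`. How IntriUAP combines several layers is not described in the abstract and is not modeled here.

```python
import math
import random


def matvec(W, x):
    """Compute W @ x for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rmatvec(W, y):
    """Compute W^T @ y."""
    return [sum(W[i][j] * y[i] for i in range(len(W))) for j in range(len(W[0]))]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def top_right_singular_vector(W, iters=200, seed=0):
    """Power iteration on W^T W; returns (v, sigma_max)."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in W[0]]
    n = norm(v)
    v = [x / n for x in v]
    for _ in range(iters):
        u = rmatvec(W, matvec(W, v))
        n = norm(u)
        v = [x / n for x in u]
    return v, norm(matvec(W, v))


def scale_to_linf(v, eps):
    """Rescale a direction so its largest absolute entry equals eps."""
    m = max(abs(x) for x in v)
    return [eps * x / m for x in v]


# Toy layer (assumption): the input direction e1 is amplified the most.
W = [[3.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
v, sigma = top_right_singular_vector(W)
delta = scale_to_linf(v, eps=0.1)  # perturbation aligned with the most amplified direction
```

A unit-norm input along `v` is stretched by `sigma`, the largest gain the layer can apply, which is why an ill-conditioned layer makes this direction attractive for a perturbation.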

DeepSound-V1: Start to Think Step-by-Step in the Audio Generation from Videos

Yunming Liang,Zihao Chen,Chaofan Ding,Xinhan Di

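Task: Generate high-quality, synchronized audio from video and optional text inputs.

Motivation: Existing methods align the visual and generated-audio domains poorly, largely because open-source video-audio and text-audio benchmarks lack sufficient temporal and semantic alignment annotations.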

Details

Method: Proposes a framework that uses the internal chain-of-thought (CoT) of a multi-modal large language model (MLLM) for step-by-step reasoning without extra annotations, and builds a multi-modal reasoning dataset to support learning the initial reasoning. Result: Experiments show the method reduces misalignment (voice-over) in generated audio and outperforms state-of-the-art models on multiple metrics. Conclusion: The proposed framework reduces misalignment in audio generation and improves performance. Abstract: Currently, high-quality, synchronized audio is synthesized from video and optional text inputs using various multi-modal joint learning frameworks. However, the precise alignment between the visual and generated audio domains remains far from satisfactory. One key factor is the lack of sufficient temporal and semantic alignment annotations in open-source video-audio and text-audio benchmarks. Therefore, we propose a framework for audio generation from videos, leveraging the internal chain-of-thought (CoT) of a multi-modal large language model (MLLM) to enable step-by-step reasoning without requiring additional annotations. Additionally, a corresponding multi-modal reasoning dataset is constructed to facilitate the learning of initial reasoning in audio generation. In the experiments, we demonstrate the effectiveness of the proposed framework in reducing misalignment (voice-over) in generated audio and achieving competitive performance compared to various state-of-the-art models. The evaluation results show that the proposed method outperforms state-of-the-art approaches across multiple metrics. Specifically, the FD_PaSST indicator is reduced by up to 10.07%, the FD_PANNs indicator by up to 11.62%, and the FD_VGG indicator by up to 38.61%. Furthermore, the IS indicator improves by up to 4.95%, the IB-score indicator increases by up to 6.39%, and the DeSync indicator is reduced by up to 0.89%.

Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging

Chongjie Ye,Yushuang Wu,Ziteng Lu,Jiahao Chang,Xiaoyang Guo,Jiaqing Zhou,Hao Zhao,Xiaoguang Han

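Task: Propose Hi3DGen, a framework that generates high-fidelity 3D geometry from 2D images via normal bridging.

Motivation: Existing methods struggle to reproduce fine-grained geometric detail from RGB images because of domain gaps and inherent ambiguities.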

Details

Method: Hi3DGen has three key components: an image-to-normal estimator, a normal-to-geometry learning approach, and a 3D data synthesis pipeline. Result: Experiments show that Hi3DGen generates richer geometric detail than existing methods, with higher fidelity. Conclusion: Using normal maps as an intermediate representation opens a new direction for high-fidelity 3D geometry generation. Abstract: With the growing demand for high-fidelity 3D models from 2D images, existing methods still face significant challenges in accurately reproducing fine-grained geometric details due to limitations in domain gaps and inherent ambiguities in RGB images. To address these issues, we propose Hi3DGen, a novel framework for generating high-fidelity 3D geometry from images via normal bridging. Hi3DGen consists of three key components: (1) an image-to-normal estimator that decouples the low-high frequency image pattern with noise injection and dual-stream training to achieve generalizable, stable, and sharp estimation; (2) a normal-to-geometry learning approach that uses normal-regularized latent diffusion learning to enhance 3D geometry generation fidelity; and (3) a 3D data synthesis pipeline that constructs a high-quality dataset to support training. Extensive experiments demonstrate the effectiveness and superiority of our framework in generating rich geometric details, outperforming state-of-the-art methods in terms of fidelity. Our work provides a new direction for high-fidelity 3D geometry generation from images by leveraging normal maps as an intermediate representation.

FLIP: Towards Comprehensive and Reliable Evaluation of Federated Prompt Learning

Dongping Liao,Xitong Gao,Yabo Xu,Chengzhong Xu

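Task: Study the combination of federated learning and prompt learning, particularly for vision-language models, and propose a framework named FLIP for evaluating federated prompt learning algorithms.

Motivation: The growing importance of privacy and data security has driven the adoption of federated learning, and prompt learning reduces computational cost and communication overhead in federated settings.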

Details

Method: Introduces the FLIP framework, which evaluates 8 state-of-the-art federated prompt learning methods across 4 federated learning protocols and 12 open datasets under 6 distinct evaluation scenarios. Result: Prompt learning maintains strong generalization in both in-distribution and out-of-distribution settings with minimal resource consumption. Conclusion: Federated prompt learning is effective under data scarcity, unseen classes, and cross-domain distribution shifts; the FLIP code is open-sourced to support further research. Abstract: The increasing emphasis on privacy and data security has driven the adoption of federated learning, a decentralized approach to train machine learning models without sharing raw data. Prompt learning, which fine-tunes prompt embeddings of pretrained models, offers significant advantages in federated settings by reducing computational costs and communication overheads while leveraging the strong performance and generalization capabilities of vision-language models such as CLIP. This paper addresses the intersection of federated learning and prompt learning, particularly for vision-language models. In this work, we introduce a comprehensive framework, named FLIP, to evaluate federated prompt learning algorithms. FLIP assesses the performance of 8 state-of-the-art federated prompt learning methods across 4 federated learning protocols and 12 open datasets, considering 6 distinct evaluation scenarios. Our findings demonstrate that prompt learning maintains strong generalization performance in both in-distribution and out-of-distribution settings with minimal resource consumption. This work highlights the effectiveness of federated prompt learning in environments characterized by data scarcity, unseen classes, and cross-domain distributional shifts. We open-source the code for all implemented algorithms in FLIP to facilitate further research in this domain.
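
To make the communication saving concrete, here is a minimal sketch of one server aggregation round in which clients send only their prompt embeddings. It assumes sample-size-weighted averaging in the style of FedAvg; the abstract does not name the four protocols FLIP covers, so this choice, the function name `fedavg_prompts`, and the toy values are all illustrative assumptions.

```python
def fedavg_prompts(client_prompts, client_sizes):
    """Average prompt embeddings across clients, weighted by local sample count.

    client_prompts: one prompt per client, each a list of token embeddings
                    (n_tokens x dim), all with the same shape.
    client_sizes:   number of local training samples per client.
    """
    total = sum(client_sizes)
    n_tokens = len(client_prompts[0])
    dim = len(client_prompts[0][0])
    avg = [[0.0] * dim for _ in range(n_tokens)]
    for prompt, size in zip(client_prompts, client_sizes):
        weight = size / total
        for t in range(n_tokens):
            for d in range(dim):
                avg[t][d] += weight * prompt[t][d]
    return avg


# Two clients, each holding a 2-token prompt with 2-dimensional embeddings.
clients = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 2.0], [2.0, 3.0]],
]
sizes = [100, 300]
global_prompt = fedavg_prompts(clients, sizes)
```

Only the small prompt tensor crosses the network in each round; the frozen vision-language backbone never leaves the client, which is the source of the reduced communication overhead the abstract mentions.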

Efficient Epistemic Uncertainty Estimation in Cerebrovascular Segmentation

Omini Rathore,Richard Paul,Abigail Morrison,Hanno Scharr,Elisabeth Pfaehler

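Task: Incorporate epistemic uncertainty quantification into cerebrovascular segmentation models for the first time, to increase trust in deep-learning-based models.

Motivation: Conventional deep learning models are highly complex and give no indication of decision reliability, so they are not trusted, yet cerebrovascular segmentation is critical for diagnosing cerebrovascular diseases.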

Details

Method: An efficient ensemble model that combines the advantages of Bayesian approximation and deep ensembles to avoid the high computational cost of conventional probabilistic networks. Result: Experiments show the model identifies regions of high uncertainty that coincide with erroneous predictions, uncertainty rises on out-of-distribution data, and omitting highly uncertain regions improves segmentation quality. Conclusion: The ensemble model reliably exposes its own limitations, remains trustworthy on out-of-distribution data, and is a candidate for clinical use. Abstract: Brain vessel segmentation of MR scans is a critical step in the diagnosis of cerebrovascular diseases. Due to the fine vessel structure, manual vessel segmentation is time-consuming. Therefore, automatic deep learning (DL) based segmentation techniques are intensively investigated. As conventional DL models are highly complex and lack an indication of decision reliability, they are often considered untrustworthy. This work aims to increase trust in DL-based models by incorporating epistemic uncertainty quantification into cerebrovascular segmentation models for the first time. By implementing an efficient ensemble model combining the advantages of Bayesian Approximation and Deep Ensembles, we aim to overcome the high computational costs of conventional probabilistic networks. Areas of high model uncertainty and erroneous predictions are aligned, which demonstrates the effectiveness and reliability of the approach. We perform extensive experiments applying the ensemble model on out-of-distribution (OOD) data. We demonstrate that for OOD images, the estimated uncertainty increases. Additionally, omitting highly uncertain areas improves the segmentation quality, both for in- and out-of-distribution data. The ensemble model explains its limitations in a reliable manner, maintains trustworthiness for OOD data, and could be considered for clinical applications.
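
The claim that omitting highly uncertain areas improves segmentation quality can be illustrated with a toy example. This is a sketch under stated assumptions, not the paper's model: it uses the variance of foreground probabilities across ensemble members as a simple stand-in for epistemic uncertainty, and the probabilities, the variance threshold of 0.05, and the six-voxel "scan" are invented for illustration.

```python
def ensemble_mean_and_variance(member_probs):
    """Per-voxel mean and variance of foreground probability across members."""
    n = len(member_probs)
    means, variances = [], []
    for i in range(len(member_probs[0])):
        vals = [m[i] for m in member_probs]
        mu = sum(vals) / n
        means.append(mu)
        variances.append(sum((v - mu) ** 2 for v in vals) / n)
    return means, variances


def dice(pred, target, keep):
    """Dice coefficient restricted to voxels where keep is True."""
    inter = sum(1 for p, t, k in zip(pred, target, keep) if k and p and t)
    size = sum(1 for p, k in zip(pred, keep) if k and p)
    size += sum(1 for t, k in zip(target, keep) if k and t)
    return 2.0 * inter / size if size else 1.0


# Three ensemble members, six voxels (made-up probabilities).
member_probs = [
    [0.9, 0.8, 0.1, 0.9, 0.2, 0.1],
    [0.9, 0.9, 0.1, 0.1, 0.9, 0.2],
    [0.8, 0.9, 0.2, 0.8, 0.6, 0.1],
]
target = [1, 1, 0, 0, 0, 0]

means, variances = ensemble_mean_and_variance(member_probs)
pred = [1 if m >= 0.5 else 0 for m in means]

keep_all = [True] * len(pred)
keep_certain = [v <= 0.05 for v in variances]  # drop voxels the members disagree on

dice_all = dice(pred, target, keep_all)
dice_certain = dice(pred, target, keep_certain)
```

In this toy case the two false-positive voxels are exactly the ones on which the members disagree, so masking them raises the Dice coefficient from 2/3 to 1.0.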

Imperceptible but Forgeable: Practical Invisible Watermark Forgery via Diffusion Models

Ziping Dong,Chao Shuai,Zhongjie Ba,Peng Cheng,Zhan Qin,Qinglong Wang,Kui Ren

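Task: Propose DiffForge, a framework for forging invisible watermarks under a no-box setting.

Motivation: The robustness of existing watermarking schemes against forgery attacks is understudied.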

Details

Method: Estimates the watermark distribution with an unconditional diffusion model and uses shallow inversion to inject the watermark seamlessly into a non-watermarked image. Result: DiffForge deceives open-source watermark detectors with a 96.38% success rate and a commercial watermark system with over 97% success rate. Conclusion: The work reveals fundamental security limitations in current watermarking paradigms. Abstract: Invisible watermarking is critical for content provenance and accountability in Generative AI. Although commercial companies have increasingly committed to using watermarks, the robustness of existing watermarking schemes against forgery attacks is understudied. This paper proposes DiffForge, the first watermark forgery framework capable of forging imperceptible watermarks under a no-box setting. We estimate the watermark distribution using an unconditional diffusion model and introduce shallow inversion to inject the watermark into a non-watermarked image seamlessly. This approach facilitates watermark injection while preserving image quality by adaptively selecting the depth of inversion steps, leveraging our key insight that watermarks degrade with added noise during the early diffusion phases. Comprehensive evaluations show that DiffForge deceives open-source watermark detectors with a 96.38% success rate and misleads a commercial watermark system with over 97% success rate, achieving high confidence. This work reveals fundamental security limitations in current watermarking paradigms.

Scenario Dreamer: Vectorized Latent Diffusion for Generating Driving Simulation Environments

Luke Rowe,Roger Girgis,Anthony Gosselin,Liam Paull,Christopher Pal,Felix Heide

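Task: Propose Scenario Dreamer, a data-driven generative simulator for autonomous vehicle planning that generates both the initial traffic scene and closed-loop agent behaviours.

Motivation: Existing methods encode the initial traffic scene as a rasterized image, which leads to parameter-heavy networks and unnecessary computation; rule-based agent behaviours also lack diversity and realism.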

Details

Method: Uses a vectorized latent diffusion model to generate the initial scene and an autoregressive Transformer to simulate data-driven agent behaviour, with scene extrapolation supported through diffusion inpainting. Result: Scenario Dreamer outperforms existing generative simulators in realism and efficiency, with around 2x fewer parameters, 6x lower generation latency, and 10x fewer GPU training hours than the strongest baseline. Conclusion: Reinforcement learning planning agents are more challenged in Scenario Dreamer environments than in traditional non-generative ones, which confirms its practical utility. Abstract: We introduce Scenario Dreamer, a fully data-driven generative simulator for autonomous vehicle planning that generates both the initial traffic scene - comprising a lane graph and agent bounding boxes - and closed-loop agent behaviours. Existing methods for generating driving simulation environments encode the initial traffic scene as a rasterized image and, as such, require parameter-heavy networks that perform unnecessary computation due to many empty pixels in the rasterized scene. Moreover, we find that existing methods that employ rule-based agent behaviours lack diversity and realism. Scenario Dreamer instead employs a novel vectorized latent diffusion model for initial scene generation that directly operates on the vectorized scene elements and an autoregressive Transformer for data-driven agent behaviour simulation. Scenario Dreamer additionally supports scene extrapolation via diffusion inpainting, enabling the generation of unbounded simulation environments. Extensive experiments show that Scenario Dreamer outperforms existing generative simulators in realism and efficiency: the vectorized scene-generation base model achieves superior generation quality with around 2x fewer parameters, 6x lower generation latency, and 10x fewer GPU training hours compared to the strongest baseline. We confirm its practical utility by showing that reinforcement learning planning agents are more challenged in Scenario Dreamer environments than in traditional non-generative simulation environments, especially on long and adversarial driving environments.

Exploiting Mixture-of-Experts Redundancy Unlocks Multimodal Generative Abilities

Raman Dutt,Harleen Hanspal,Guoxuan Xia,Petru-Daniel Tudosiu,Alexander Black,Yongxin Yang,Steven McDonagh,Sarah Parisot

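Task: Add multimodal generation capability to pre-trained text-only large language models while preserving their original language generation ability and staying within a small parameter budget.

Motivation: Existing approaches add dedicated modules and significantly increase the parameter count; this work instead aims to use the underutilized capacity of deep models to improve parameter efficiency.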

Details

Method: Uses the parameter redundancy in Mixture-of-Experts (MoEs) as extra capacity for learning a new modality, and preserves language generation ability through low-rank adaptation. Result: Analysis of the routing mechanism reveals the emergence of modality-specific pathways and reduced expert redundancy, which unlocks multimodal generation. Conclusion: The method applies seamlessly to a wide range of large language models and offers a new route from uni-modal to multi-modal architectures. Abstract: In this work, we undertake the challenge of augmenting the existing generative capabilities of pre-trained text-only large language models (LLMs) with multi-modal generation capability while satisfying two core constraints: (C1) preserving the original language generative capabilities with negligible performance degradation, and (C2) adhering to a small parameter budget to learn the new modality, ensuring scalability and efficiency. In contrast to current approaches that add dedicated modules, thereby significantly increasing the parameter count, we propose a method that leverages the underutilized capacity inherent in deep models. Specifically, we exploit the parameter redundancy within Mixture-of-Experts (MoEs) as a source of additional capacity for learning a new modality, enabling better parameter efficiency (C2). Moreover, we preserve the original language generation capabilities by applying low-rank adaptation exclusively to the tokens of the new modality (C1). Furthermore, we introduce a novel parameter initialization scheme based on the Gromov-Wasserstein distance to improve convergence and training stability. Through an extensive analysis of the routing mechanism, we uncover the emergence of modality-specific pathways and decreased redundancy within the experts that can efficiently unlock multi-modal generative capabilities. Overall, our method can be seamlessly applied to a wide range of contemporary LLMs, providing a new pathway for transitioning from uni-modal to multi-modal architectures.

Deterministic Medical Image Translation via High-fidelity Brownian Bridges

Qisheng He,Nicholas Summerfield,Peiyong Wang,Carri Glide-Hurst,Ming Dong

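Task: Propose a novel High-fidelity Brownian bridge model (HiFi-BBrg) for deterministic medical image translation.

Motivation: Diffusion models produce better synthetic images than GANs, but their outputs are non-deterministic and lack fidelity to the ground truth.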

Details

Method: Combines a generation mapping with a reconstruction mapping, and guides the Brownian bridge training process with a fidelity loss and adversarial training. Result: Experiments on multiple datasets show that HiFi-BBrg outperforms existing methods in multi-modal image translation and multi-image super-resolution. Conclusion: HiFi-BBrg achieves deterministic medical image translation with high fidelity. Abstract: Recent studies have shown that diffusion models produce superior synthetic images when compared to Generative Adversarial Networks (GANs). However, their outputs are often non-deterministic and lack high fidelity to the ground truth due to the inherent randomness. In this paper, we propose a novel High-fidelity Brownian bridge model (HiFi-BBrg) for deterministic medical image translations. Our model comprises two distinct yet mutually beneficial mappings: a generation mapping and a reconstruction mapping. The Brownian bridge training process is guided by the fidelity loss and adversarial training in the reconstruction mapping. This ensures that translated images can be accurately reversed to their original forms, thereby achieving consistent translations with high fidelity to the ground truth. Our extensive experiments on multiple datasets show HiFi-BBrg outperforms state-of-the-art methods in multi-modal image translation and multi-image super-resolution.
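
For readers unfamiliar with Brownian bridges, the sketch below samples from the textbook bridge pinned at two endpoints: the mean interpolates linearly between them and the variance t(T - t)/T vanishes at both ends. This is the standard unit-diffusion bridge, used here as an assumption; the abstract does not give HiFi-BBrg's exact noise schedule, and the endpoint vectors are toy values standing in for images.

```python
import math
import random


def bridge_variance(t, T):
    """Variance of a standard Brownian bridge on [0, T] at time t."""
    return t * (T - t) / T


def sample_bridge(x_start, x_end, t, T, rng):
    """Sample the bridge state at time t, pinned to x_start at 0 and x_end at T."""
    w = t / T
    std = math.sqrt(bridge_variance(t, T))
    return [(1.0 - w) * a + w * b + std * rng.gauss(0.0, 1.0)
            for a, b in zip(x_start, x_end)]


rng = random.Random(0)
x_start = [0.0, 1.0, 2.0]  # toy stand-in for an image in one modality
x_end = [1.0, 1.0, 0.0]    # toy stand-in for the paired image in the other modality

at_start = sample_bridge(x_start, x_end, 0.0, 1.0, rng)
at_end = sample_bridge(x_start, x_end, 1.0, 1.0, rng)
mid_variance = bridge_variance(0.5, 1.0)
```

Because the noise term disappears at both endpoints, the process connects the two images directly instead of ending in pure Gaussian noise, which is what makes a bridge a natural fit for paired image translation.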

RELD: Regularization by Latent Diffusion Models for Image Restoration

Pasquale Cascarano,Lorenzo Stacchio,Andrea Sebastiani,Alessandro Benfenati,Ulugbek S. Kamilov,Gustavo Marfia

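Task: Propose RELD, a method that places a latent diffusion model inside a variational framework, for image denoising, deblurring, and super-resolution.

Motivation: Diffusion models are the new frontier of deep generative modeling, but they are computationally expensive, so a more efficient approach is needed.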

Details

Method: Integrates a latent diffusion model into a variational framework based on Half-Quadratic Splitting and exploits its regularization properties to reduce computational cost. Result: Experiments on a natural-image dataset show that RELD is competitive with existing methods and performs especially well on perceptual quality metrics. Conclusion: RELD is an efficient, high-quality image restoration method suited to a range of imaging applications. Abstract: In recent years, Diffusion Models have become the new state-of-the-art in deep generative modeling, ending the long-time dominance of Generative Adversarial Networks. Inspired by the Regularization by Denoising principle, we introduce an approach that integrates a Latent Diffusion Model, trained for the denoising task, into a variational framework using Half-Quadratic Splitting, exploiting its regularization properties. This approach, under appropriate conditions that can be easily met in various imaging applications, allows for reduced computational cost while achieving high-quality results. The proposed strategy, called Regularization by Latent Denoising (RELD), is then tested on a dataset of natural images, for image denoising, deblurring, and super-resolution tasks. The numerical experiments show that RELD is competitive with other state-of-the-art methods, particularly achieving remarkable results when evaluated using perceptual quality metrics.
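
The Half-Quadratic Splitting loop that RELD builds on alternates a data-fidelity step with a denoising step. The sketch below shows that alternation for the simplest case (denoising, where the forward operator is the identity) on a 1D signal. It is an illustration under assumptions: a 3-tap moving average stands in for the latent diffusion denoiser, and the penalty weight `mu`, the iteration count, and the signal are made up.

```python
def smooth(x):
    """Stand-in denoiser: 3-tap moving average with replicated edges."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0
            for i in range(n)]


def hqs_denoise(y, denoiser, mu=1.0, iters=20):
    """Half-Quadratic Splitting for 0.5*||x - y||^2 + R(x), with R handled by a denoiser."""
    z = list(y)
    x = list(y)
    for _ in range(iters):
        # x-step: closed-form minimizer of 0.5*||x - y||^2 + 0.5*mu*||x - z||^2
        x = [(yi + mu * zi) / (1.0 + mu) for yi, zi in zip(y, z)]
        # z-step: the proximal step on the regularizer, replaced by the denoiser
        z = denoiser(x)
    return x


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


clean = [1.0] * 8
noisy = [1.2, 0.8, 1.2, 0.8, 1.2, 0.8, 1.2, 0.8]
restored = hqs_denoise(noisy, smooth)
```

For deblurring or super-resolution only the x-step changes, since it must then invert the blur or downsampling operator; the denoiser call in the z-step stays the same, which is what lets a single pretrained denoiser serve several restoration tasks.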

Next-Best-Trajectory Planning of Robot Manipulators for Effective Observation and Exploration

Heiko Renz,Maximilian Krämer,Frank Hoffmann,Torsten Bertram

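Task: Develop a robot manipulator strategy based on the Next-Best-Trajectory principle for efficient data acquisition in dynamic environments.

Motivation: Machine learning algorithms need vast datasets that are costly and time-consuming to collect; automated observation and exploration strategies make data gathering more efficient.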

Details

Method: Applies the Next-Best-Trajectory principle, combining local trajectory generation, voxel-map environment modeling, raycasting-based information gain estimation, and a global ergodic trajectory planner. Result: Benchmarks confirm the efficiency of the GPU parallelization, and real-world experiments demonstrate the strategy's effectiveness. Conclusion: The proposed strategy is efficient and effective in dynamic environments and suits robot manipulation and data acquisition tasks. Abstract: Visual observation of objects is essential for many robotic applications, such as object reconstruction and manipulation, navigation, and scene understanding. Machine learning algorithms constitute the state-of-the-art in many fields but require vast data sets, which are costly and time-intensive to collect. Automated strategies for observation and exploration are crucial to enhance the efficiency of data gathering. Therefore, a novel strategy utilizing the Next-Best-Trajectory principle is developed for a robot manipulator operating in dynamic environments. Local trajectories are generated to maximize the information gained from observations along the path while avoiding collisions. We employ a voxel map for environment modeling and utilize raycasting from perspectives around a point of interest to estimate the information gain. A global ergodic trajectory planner provides an optional reference trajectory to the local planner, improving exploration and helping to avoid local minima. To enhance computational efficiency, raycasting for estimating the information gain in the environment is executed in parallel on the graphics processing unit. Benchmark results confirm the efficiency of the parallelization, while real-world experiments demonstrate the strategy's effectiveness.
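
The raycasting-based information gain estimate can be sketched on a small occupancy grid. The example below is a simplified 2D CPU illustration, not the paper's GPU implementation: it assumes a three-state grid (unknown, free, occupied), fixed-step ray marching, and defines the gain as the number of distinct unknown cells visible from a viewpoint. The function name, ray count, and step size are hypothetical.

```python
import math

UNKNOWN, FREE, OCCUPIED = 0, 1, 2


def information_gain(grid, origin, n_rays=16, max_range=10.0, step=0.5):
    """Count distinct unknown cells hit by rays cast from origin = (x, y)."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    for k in range(n_rays):
        angle = 2.0 * math.pi * k / n_rays
        dx, dy = math.cos(angle), math.sin(angle)
        r = 0.0
        while r <= max_range:
            j = int(math.floor(origin[0] + r * dx))  # column index
            i = int(math.floor(origin[1] + r * dy))  # row index
            if not (0 <= i < rows and 0 <= j < cols):
                break  # ray left the map
            if grid[i][j] == OCCUPIED:
                break  # ray is blocked by an obstacle
            if grid[i][j] == UNKNOWN:
                seen.add((i, j))
            r += step
    return len(seen)


# Free space on the left, unexplored space on the right.
open_grid = [[FREE, FREE, FREE, UNKNOWN, UNKNOWN] for _ in range(5)]
# Same map, but a wall in column 2 hides the unexplored region.
walled_grid = [[FREE, FREE, OCCUPIED, UNKNOWN, UNKNOWN] for _ in range(5)]

viewpoint = (0.5, 2.5)
gain_open = information_gain(open_grid, viewpoint)
gain_walled = information_gain(walled_grid, viewpoint)
```

Each ray is independent of the others, which is why this computation parallelizes well on a GPU, as the abstract describes.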

Using AI to Summarize US Presidential Campaign TV Advertisement Videos, 1952-2012

Adam Breuer,Bryce J. Dietrich,Michael H. Crespin,Matthew Butler,J. A. Pyrse,Kosuke Imai

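Task: Introduce and analyze the largest and most comprehensive dataset of US presidential campaign television advertisements.

Motivation: Manually procuring and annotating advertisement data is laborious; removing that barrier supports academic research.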

Details

Method: Designs a large-scale parallelized, AI-based analysis pipeline that automates video transcription and summarization. Result: Produces transcripts and summaries that match the quality of manually generated ones, and applies them to track how focal campaign issues have evolved. Conclusion: Shows how LLM-based tools can process video datasets efficiently, providing a reference for other studies. Abstract: This paper introduces the largest and most comprehensive dataset of US presidential campaign television advertisements, available in digital format. The dataset also includes machine-searchable transcripts and high-quality summaries designed to facilitate a variety of academic research. To date, there has been great interest in collecting and analyzing US presidential campaign advertisements, but the need for manual procurement and annotation led many to rely on smaller subsets. We design a large-scale parallelized, AI-based analysis pipeline that automates the laborious process of preparing, transcribing, and summarizing videos. We then apply this methodology to the 9,707 presidential ads from the Julian P. Kanter Political Commercial Archive. We conduct extensive human evaluations to show that these transcripts and summaries match the quality of manually generated alternatives. We illustrate the value of this data by including an application that tracks the genesis and evolution of current focal issue areas over seven decades of presidential elections. Our analysis pipeline and codebase also show how to use LLM-based tools to obtain high-quality summaries for other video datasets.

KEVS: Enhancing Segmentation of Visceral Adipose Tissue in Pre-Cystectomy CT with Gaussian Kernel Density Estimation

Thomas Boucher,Nicholas Tetlow,Annie Fung,Amy Dewar,Pietro Arina,Sven Kerneis,John Whittle,Evangelos B. Mazomenos

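Task: Develop a fully automated method for predicting visceral adipose tissue (VAT) in pre-cystectomy CT scans that does not need ground-truth VAT masks for training.

Motivation: Existing intensity-thresholding VAT segmentation methods suffer from inter-observer variability, and the difficulty of creating ground-truth masks limits the development of deep learning models.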

Details

Method: Proposes KEVS, which combines a deep learning semantic segmentation model with Gaussian kernel density estimation analysis to predict VAT automatically without ground-truth VAT masks. Result: KEVS accurately segments abdominal organs in unseen CT data and, on 20 pre-cystectomy CT scans, improves the Dice coefficient by 4.80% and 6.02% over the second-best deep learning and thresholding-based methods respectively. Conclusion: KEVS is a fully automated, state-of-the-art method that needs no ground-truth VAT masks, eliminates inter-observer variability, and is trained entirely on open-source CT datasets. Abstract: Purpose: The distribution of visceral adipose tissue (VAT) in cystectomy patients is indicative of the incidence of post-operative complications. Existing VAT segmentation methods for computed tomography (CT) employing intensity thresholding have limitations relating to inter-observer variability. Moreover, the difficulty in creating ground-truth masks limits the development of deep learning (DL) models for this task. This paper introduces a novel method for VAT prediction in pre-cystectomy CT, which is fully automated and does not require ground-truth VAT masks for training, overcoming the aforementioned limitations. Methods: We introduce the Kernel density Enhanced VAT Segmentator (KEVS), combining a DL semantic segmentation model, for multi-body feature prediction, with Gaussian kernel density estimation analysis of predicted subcutaneous adipose tissue to achieve accurate scan-specific predictions of VAT in the abdominal cavity. Uniquely for a DL pipeline, KEVS does not require ground-truth VAT masks. Results: We verify the ability of KEVS to accurately segment abdominal organs in unseen CT data and compare KEVS VAT segmentation predictions to existing state-of-the-art (SOTA) approaches in a dataset of 20 pre-cystectomy CT scans, collected from University College London Hospital (UCLH-Cyst), with expert ground-truth annotations. KEVS presents a 4.80% and 6.02% improvement in Dice Coefficient over the second best DL and thresholding-based VAT segmentation techniques respectively when evaluated on UCLH-Cyst. Conclusion: This research introduces KEVS, an automated, SOTA method for the prediction of VAT in pre-cystectomy CT which eliminates inter-observer variability and is trained entirely on open-source CT datasets which do not contain ground-truth VAT masks.
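
One way to read the scan-specific step is: fit a Gaussian kernel density estimate to the intensities of the predicted subcutaneous fat, then label abdominal-cavity voxels as VAT when their intensity is likely under that density. The sketch below follows that reading and is an assumption throughout; the abstract does not state the decision rule, and the Hounsfield values, bandwidth, and `min_density` cut-off are invented for illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels centred on the samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density


# Intensities (Hounsfield units) of voxels predicted as subcutaneous fat (toy values).
sat_hu = [-110, -105, -100, -98, -95, -90, -85]
density = gaussian_kde(sat_hu, bandwidth=10.0)

# Intensities of candidate voxels inside the abdominal cavity (toy values).
cavity_hu = [-102, -60, 40, -93]
min_density = 0.005  # hypothetical cut-off on the fat-intensity density
vat_mask = [density(v) >= min_density for v in cavity_hu]
```

Because the density is estimated from each scan's own subcutaneous fat, the resulting intensity range adapts to that scan rather than relying on a fixed threshold chosen by an observer.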

Audio-Plane: Audio Factorization Plane Gaussian Splatting for Real-Time Talking Head Synthesis

Shuai Shen,Wanhua Li,Yunpeng Zhang,Weipeng Hu,Yap-Peng Tan

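Task: Propose a Gaussian Splatting method based on an Audio Factorization Plane (Audio-Plane) for high-quality, real-time talking head generation.

Motivation: Existing methods struggle to balance generation quality with computational efficiency, and directly storing a dense 4D grid is costly and does not scale.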

Details

Method: Decomposes the 4D volume representation into audio-independent space planes and audio-dependent planes, and adds a dynamic splatting method to improve modeling of mouth dynamics. Result: Experiments show the method synthesizes high-quality talking videos in real time with precise audio-lip synchronization. Conclusion: Combining the Audio-Plane with Gaussian Splatting yields efficient, high-quality talking head generation. Abstract: Talking head synthesis has become a key research area in computer graphics and multimedia, yet most existing methods struggle to balance generation quality with computational efficiency. In this paper, we present a novel approach that leverages an Audio Factorization Plane (Audio-Plane) based Gaussian Splatting for high-quality and real-time talking head generation. For modeling a dynamic talking head, a 4D volume representation is needed. However, directly storing a dense 4D grid is impractical due to the high cost and lack of scalability for longer durations. We overcome this challenge with the proposed Audio-Plane, where the 4D volume representation is decomposed into audio-independent space planes and audio-dependent planes. This provides a compact and interpretable feature representation for the talking head, facilitating more precise audio-aware spatial encoding and enhanced audio-driven lip dynamic modeling. To further improve speech dynamics, we develop a dynamic splatting method that helps the network more effectively focus on modeling the dynamics of the mouth region. Extensive experiments demonstrate that by integrating these innovations with the powerful Gaussian Splatting, our method is capable of synthesizing highly realistic talking videos in real time while ensuring precise audio-lip synchronization. Synthesized results are available at https://sstzal.github.io/Audio-Plane/.

Unicorn: Text-Only Data Synthesis for Vision Language Model Training

Xiaomin Yu,Pengxiang Ding,Wenjie Zhang,Siteng Huang,Songyang Gao,Chengwei Qin,Kejian Wu,Zhaoxin Fan,Ziyue Qiao,Donglin Wang

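Task: Propose a three-stage framework that synthesizes high-quality multimodal training data from text alone.

Motivation: Large-scale, high-quality image-text pairs are costly to collect, whereas text data is abundant and cheap, which raises the question of whether high-quality multimodal training data can be synthesized purely from text.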

Details

Method: A three-stage framework: (1) diverse caption data synthesis; (2) instruction-tuning data generation; (3) modality representation transfer, which yields synthetic image representations. Result: Produces the Unicorn-1.2M pretraining dataset and the Unicorn-471K-Instruction instruction-tuning dataset without relying on real images. Conclusion: The framework offers a low-cost, scalable solution for training vision-language models. Abstract: Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be synthesized purely from text? To tackle this, we propose a cross-integrated three-stage multimodal data synthesis framework, which generates two datasets: Unicorn-1.2M and Unicorn-471K-Instruction. In Stage 1: Diverse Caption Data Synthesis, we construct 1.2M semantically diverse high-quality captions by expanding sparse caption seeds using large language models (LLMs). In Stage 2: Instruction-Tuning Data Generation, we further process 471K captions into multi-turn instruction-tuning tasks to support complex reasoning. Finally, in Stage 3: Modality Representation Transfer, these textual caption representations are transformed into visual representations, resulting in diverse synthetic image representations. This three-stage process enables us to construct Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction-tuning, without relying on real images. By eliminating the dependency on real images while maintaining data quality and diversity, our framework offers a cost-effective and scalable solution for VLM training. Code is available at https://github.com/Yu-xm/Unicorn.git.

Evaluation of Machine-generated Biomedical Images via A Tally-based Similarity Measure

Frank J. Brooks,Rucha Deshpande

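Task: Propose a method for evaluating the quality of synthesized images based on the Tversky Index.

Motivation: Mission-critical settings such as biomedicine need a reliable way to evaluate synthetic image quality, and traditional methods based on distances in deep feature spaces fall short.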

Details

Method: Uses the Tversky Index as a measure of perceptual similarity, and develops and demonstrates an evaluation procedure built on it. Result: Tversky's method gives intuitive assessments of synthetic image quality, whereas traditional methods do not. Conclusion: The Tversky Index is an effective tool for evaluating synthesized image quality, especially once the subjectivity and intrinsic deficiencies of the feature encoding are made explicit. Abstract: Super-resolution, in-painting, whole-image generation, unpaired style-transfer, and network-constrained image reconstruction each include an aspect of machine-learned image synthesis where the actual ground truth is not known at time of use. It is generally difficult to quantitatively and authoritatively evaluate the quality of synthetic images; however, in mission-critical biomedical scenarios robust evaluation is paramount. In this work, we argue that all practical image-to-image comparisons are really relative qualifications, not absolute difference quantifications, and that meaningful evaluation of generated image quality can therefore be accomplished using the Tversky Index, which is a well-established measure for assessing perceptual similarity. This evaluation procedure is developed and then demonstrated using multiple image data sets, both real and simulated. The main result is that when the subjectivity and intrinsic deficiencies of any feature-encoding choice are put upfront, Tversky's method leads to intuitive results, whereas traditional methods based on summarizing distances in deep feature spaces do not.
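
For reference, the Tversky Index between two feature sets A and B is |A ∩ B| / (|A ∩ B| + α|A \ B| + β|B \ A|), a tally of shared and distinctive features. The sketch below implements that set formula directly. How the paper encodes images into feature sets is not described in the abstract, so the integer "features" here are placeholders, and returning 1.0 for two empty sets is a convention chosen for this example.

```python
def tversky_index(a, b, alpha=1.0, beta=1.0):
    """Tversky similarity between two feature sets.

    alpha weights features found only in a, beta weights features found only in b.
    alpha = beta = 1 gives the Jaccard index; alpha = beta = 0.5 gives the Dice coefficient.
    """
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    if denom == 0:
        return 1.0  # both sets empty: treated as identical by convention
    return common / denom


reference_features = {1, 2, 3}  # placeholder features tallied in a reference image
generated_features = {2, 3, 4}  # placeholder features tallied in a generated image

jaccard = tversky_index(reference_features, generated_features, 1.0, 1.0)
dice = tversky_index(reference_features, generated_features, 0.5, 0.5)
one_sided = tversky_index(reference_features, generated_features, 1.0, 0.0)
```

Setting alpha and beta to different values makes the measure asymmetric, so features missing from the generated image can be penalized differently from spurious extra features.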