2025 04 09

Unequal Opportunities: Examining the Bias in Geographical Recommendations by Large Language Models

Shiran Dudy,Thulasi Tholeti,Resmi Ramachandranpillai,Muhammad Ali,Toby Jia-Jun Li,Ricardo Baeza-Yates

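Task: Study bias in how large language models (LLMs) recommend U.S. cities and towns, focusing on three domains: relocation, tourism, and starting a business.

Motivation: The growing popularity of LLMs as information-seeking tools may bring bias against under-represented topics, which can influence real-world decisions and the allocation of opportunities and deepen socioeconomic inequality.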

Details

Method: Analyze how consistent LLM recommendations are across cities and towns and whether they favor locations with particular characteristics.

Result: LLM recommendations show consistent demographic biases that could widen economic inequality.

Conclusion: LLM bias could produce a "rich-get-richer" effect; measures to reduce bias are needed to promote fairness.

Abstract: Recent advancements in Large Language Models (LLMs) have made them a popular information-seeking tool among end users. However, the statistical training methods for LLMs have raised concerns about their representation of under-represented topics, potentially leading to biases that could influence real-world decisions and opportunities. These biases could have significant economic, social, and cultural impacts as LLMs become more prevalent, whether through direct interactions--such as when users engage with chatbots or automated assistants--or through their integration into third-party applications (as agents), where the models influence decision-making processes and functionalities behind the scenes. Our study examines the biases present in LLM recommendations of U.S. cities and towns across three domains: relocation, tourism, and starting a business. We explore two key research questions: (i) how similar LLM responses are, and (ii) how this similarity might favor areas with certain characteristics over others, introducing biases. We focus on the consistency of LLM responses and their tendency to over-represent or under-represent specific locations. Our findings point to consistent demographic biases in these recommendations, which could perpetuate a "rich-get-richer" effect that widens existing economic disparities.
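The first research question, how similar LLM responses are, can be made concrete with a set-overlap score. The sketch below is illustrative only: the abstract does not say which similarity measure the authors use, Jaccard overlap is an assumption, and the city lists are made-up placeholders.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(responses: list[set[str]]) -> float:
    """Average Jaccard similarity over all unordered pairs of responses."""
    if len(responses) < 2:
        raise ValueError("need at least two responses to compare")
    pairs = list(combinations(responses, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical recommendation sets from three runs of the same prompt.
runs = [
    {"Austin", "Denver", "Raleigh"},
    {"Austin", "Denver", "Boise"},
    {"Austin", "Seattle", "Raleigh"},
]
score = mean_pairwise_jaccard(runs)  # pair scores are 0.5, 0.5, and 0.2
```

A score near 1 would mean the runs keep naming the same places; a score near 0 would mean they rarely agree.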

Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling

Benjamin Lipkin,Benjamin LeBrun,Jacob Hoover Vigly,João Loula,David R. MacIver,Li Du,Jason Eisner,Ryan Cotterell,Vikash Mansinghka,Timothy J. O'Donnell,Alexander K. Lew,Tim Vieira

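Task: Propose a new algorithm that addresses both the inefficiency and the global distribution distortion of constrained text generation with language models.

Motivation: Existing locally constrained decoding (LCD) methods are computationally expensive and distort the global distribution over strings.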

Details

Method: Use an adaptive rejection sampling algorithm to reduce the number of constraint evaluations, combined with importance weight estimates that correct the myopic behavior of local constraint enforcement.

Result: Outperforms existing baselines across several domains (such as text-to-SQL and molecular synthesis), supports a broader class of constraints, and improves both runtime and performance.

Conclusion: The new algorithm gains efficiency through its dynamic use of computation, and the gains are larger for better models.

Abstract: The dominant approach to generating from language models subject to some constraint is locally constrained decoding (LCD), incrementally sampling tokens at each time step such that the constraint is never violated. Typically, this is achieved through token masking: looping over the vocabulary and excluding non-conforming tokens. There are two important problems with this approach. (i) Evaluating the constraint on every token can be prohibitively expensive -- LM vocabularies often exceed 100,000 tokens. (ii) LCD can distort the global distribution over strings, sampling tokens based only on local information, even if they lead down dead-end paths. This work introduces a new algorithm that addresses both these problems. First, to avoid evaluating a constraint on the full vocabulary at each step of generation, we propose an adaptive rejection sampling algorithm that typically requires orders of magnitude fewer constraint evaluations. Second, we show how this algorithm can be extended to produce low-variance, unbiased estimates of importance weights at a very small additional cost -- estimates that can be soundly used within previously proposed sequential Monte Carlo algorithms to correct for the myopic behavior of local constraint enforcement. Through extensive empirical evaluation in text-to-SQL, molecular synthesis, goal inference, pattern matching, and JSON domains, we show that our approach is superior to state-of-the-art baselines, supporting a broader class of constraints and improving both runtime and performance. Additional theoretical and empirical analyses show that our method's runtime efficiency is driven by its dynamic use of computation, scaling with the divergence between the unconstrained and constrained LM, and as a consequence, runtime improvements are greater for better models.
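To illustrate why rejection sampling can need far fewer constraint checks than token masking, here is a simplified sketch of the sampling step only. It is not the authors' algorithm: it omits the importance-weight estimates and the sequential Monte Carlo correction, and the toy vocabulary, weights, and `is_even` constraint are hypothetical.

```python
import random


def masked_sample(probs, ok, rng):
    """Baseline token masking: calls ok() on every token in the vocabulary."""
    allowed = [tok for tok in probs if ok(tok)]
    weights = [probs[tok] for tok in allowed]
    return rng.choices(allowed, weights=weights, k=1)[0], len(probs)


def adaptive_rejection_sample(probs, ok, rng):
    """Sample from the constrained distribution, checking the constraint lazily.

    Draw a token from the remaining mass; if it violates the constraint,
    remove it from the support and redraw. Each accepted draw is distributed
    proportionally to probs restricted to the allowed tokens.
    Returns (token, number_of_constraint_calls).
    """
    remaining = dict(probs)
    calls = 0
    while remaining:
        tokens = list(remaining)
        weights = [remaining[tok] for tok in tokens]
        token = rng.choices(tokens, weights=weights, k=1)[0]
        calls += 1
        if ok(token):
            return token, calls
        del remaining[token]  # never propose a rejected token again
    raise ValueError("no token satisfies the constraint")


def is_even(token: int) -> bool:
    """Hypothetical constraint: only even token ids are allowed."""
    return token % 2 == 0


# Toy vocabulary of 1000 token ids with Zipf-like unnormalized weights.
toy_probs = {i: 1.0 / (i + 1) for i in range(1000)}

rng = random.Random(0)
token, calls = adaptive_rejection_sample(toy_probs, is_even, rng)
_, masking_calls = masked_sample(toy_probs, is_even, rng)
```

In this toy setup masking always evaluates the constraint 1000 times per step, while the lazy sampler stops at the first accepted draw. The more probability mass the model already places on allowed tokens, the fewer draws it needs, which matches the abstract's remark that gains are larger for better models.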

Less but Better: Parameter-Efficient Fine-Tuning of Large Language Models for Personality Detection

Lingzhi Shen,Yunfei Long,Xiaohao Cai,Guanming Chen,Imran Razzak,Shoaib Jameel

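Task: Propose PersLLM, a parameter-efficient fine-tuning framework for automatically detecting personality from data sources such as social media text.

Motivation: As the parameter scale of language models grows, computational cost and management complexity rise, and fine-tuning becomes harder to predict and to justify.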

Details

Method: PersLLM stores high-dimensional representations in a dynamic memory layer and updates the downstream layers with a replaceable output network, avoiding repeated complex computation.

Result: On benchmark datasets such as Kaggle and Pandora, PersLLM significantly reduces computational cost while maintaining competitive performance and strong adaptability.

Conclusion: PersLLM offers an efficient and flexible solution to the computation and fine-tuning problems of large language models in personality detection.

Abstract: Personality detection automatically identifies an individual's personality from various data sources, such as social media texts. However, as the parameter scale of language models continues to grow, the computational cost becomes increasingly difficult to manage. Fine-tuning also grows more complex, making it harder to justify the effort and reliably predict outcomes. We introduce a novel parameter-efficient fine-tuning framework, PersLLM, to address these challenges. In PersLLM, a large language model (LLM) extracts high-dimensional representations from raw data and stores them in a dynamic memory layer. PersLLM then updates the downstream layers with a replaceable output network, enabling flexible adaptation to various personality detection scenarios. By storing the features in the memory layer, we eliminate the need for repeated complex computations by the LLM. Meanwhile, the lightweight output network serves as a proxy for evaluating the overall effectiveness of the framework, improving the predictability of results. Experimental results on key benchmark datasets like Kaggle and Pandora show that PersLLM significantly reduces computational cost while maintaining competitive performance and strong adaptability.
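The idea of storing features once and swapping lightweight output networks on top can be sketched as a simple cache. This is a minimal illustration under assumptions, not PersLLM itself: `FeatureMemory`, `toy_encoder`, and the two heads are hypothetical names, and the encoder is a stand-in for an expensive LLM forward pass.

```python
from typing import Callable

Vector = list[float]


class FeatureMemory:
    """Caches encoder outputs so each distinct text is encoded at most once."""

    def __init__(self, encoder: Callable[[str], Vector]) -> None:
        self._encoder = encoder
        self._store: dict[str, Vector] = {}
        self.encoder_calls = 0

    def get(self, text: str) -> Vector:
        if text not in self._store:
            self._store[text] = self._encoder(text)
            self.encoder_calls += 1
        return self._store[text]


def toy_encoder(text: str) -> Vector:
    """Hypothetical stand-in for the LLM: [character count, word count]."""
    return [float(len(text)), float(text.count(" ") + 1)]


def head_length(features: Vector) -> float:
    """First replaceable output network (toy): reads the length feature."""
    return features[0]


def head_words(features: Vector) -> float:
    """Second replaceable output network (toy): reads the word-count feature."""
    return features[1]


memory = FeatureMemory(toy_encoder)
texts = ["I love quiet evenings", "Let's go to the party", "I love quiet evenings"]

# Swapping the head reuses the cached features; the encoder is not rerun.
scores_a = [head_length(memory.get(t)) for t in texts]
scores_b = [head_words(memory.get(t)) for t in texts]
```

Both heads are evaluated over three inputs, yet the encoder runs only once per distinct text.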

PreSumm: Predicting Summarization Performance Without Summarizing

Steven Koniaev,Ori Ernst,Jackie Chi Kit Cheung

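Task: Explore how document characteristics affect summarization performance and propose the PreSumm task for predicting a document's summarization performance.

Motivation: Existing summarization models do not summarize all documents equally well, yet the role of document characteristics in summarization performance has received little study.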

Details

Method: Analyze the effect of document characteristics on summarization performance and introduce the PreSumm task, which predicts summarization performance from the source document alone.

Result: Documents with low PreSumm scores often suffer from coherence issues, complex content, or the lack of a clear main theme; PreSumm is practically useful for hybrid summarization workflows and for improving dataset quality.

Conclusion: Document characteristics are critical to summarization performance, and PreSumm exposes limitations of current systems that can serve as a basis for future improvements.

Abstract: Despite recent advancements in automatic summarization, state-of-the-art models do not summarize all documents equally well, raising the question: why? While prior research has extensively analyzed summarization models, little attention has been given to the role of document characteristics in influencing summarization performance. In this work, we explore two key research questions. First, do documents exhibit consistent summarization quality across multiple systems? If so, can we predict a document's summarization performance without generating a summary? We answer both questions affirmatively and introduce PreSumm, a novel task in which a system predicts summarization performance based solely on the source document. Our analysis sheds light on common properties of documents with low PreSumm scores, revealing that they often suffer from coherence issues, complex content, or a lack of a clear main theme. In addition, we demonstrate PreSumm's practical utility in two key applications: improving hybrid summarization workflows by identifying documents that require manual summarization and enhancing dataset quality by filtering outliers and noisy documents. Overall, our findings highlight the critical role of document properties in summarization performance and offer insights into the limitations of current systems that could serve as the basis for future improvements.
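The first question, whether documents show consistent summarization quality across systems, could be checked by correlating per-document scores from different systems. The sketch below assumes Pearson correlation and uses made-up scores; the abstract does not state which statistic or which quality metric the authors use.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical per-document quality scores from two summarization systems,
# listed in the same document order.
system_a = [0.42, 0.31, 0.55, 0.20]
system_b = [0.40, 0.28, 0.60, 0.25]

consistency = pearson(system_a, system_b)
```

A high correlation would suggest that difficulty is a property of the document rather than of one system, which is the premise that makes predicting performance from the source document plausible.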

GARF: Learning Generalizable 3D Reassembly for Real-World Fractures

Sihang Li,Zeyu Jiang,Grace Chen,Chenyang Xu,Siqi Tan,Xue Wang,Irving Fang,Kristof Zyskowski,Shannon P. McPherron,Radu Iovita,Chen Feng,Jing Zhang

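Task: Propose GARF, a generalizable 3D reassembly framework for reassembling real-world fractured objects.

Motivation: Learning-based methods trained on synthetic data generalize poorly to real-world fractures, whose breakage patterns are more complex; a method is needed that adapts across domains and to complex fracture patterns.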

Details

Method: GARF uses fracture-aware pretraining to learn fracture features from individual fragments and flow matching to achieve precise 6-DoF alignment; at inference time, a one-step preassembly improves robustness to unseen objects and varying numbers of fractures.

Result: GARF outperforms existing methods on both synthetic and real-world datasets, with 82.87% lower rotation error and 25.15% higher part accuracy.

Conclusion: GARF shows that training on synthetic data can advance real-world 3D puzzle solving, and it generalizes strongly to unseen object shapes and diverse fracture types.

Abstract: 3D reassembly is a challenging spatial intelligence task with broad applications across scientific domains. While large-scale synthetic datasets have fueled promising learning-based approaches, their generalizability to different domains is limited. Critically, it remains uncertain whether models trained on synthetic datasets can generalize to real-world fractures where breakage patterns are more complex. To bridge this gap, we propose GARF, a generalizable 3D reassembly framework for real-world fractures. GARF leverages fracture-aware pretraining to learn fracture features from individual fragments, with flow matching enabling precise 6-DoF alignments. At inference time, we introduce one-step preassembly, improving robustness to unseen objects and varying numbers of fractures. In collaboration with archaeologists, paleoanthropologists, and ornithologists, we curate Fractura, a diverse dataset for vision and learning communities, featuring real-world fracture types across ceramics, bones, eggshells, and lithics. Comprehensive experiments have shown our approach consistently outperforms state-of-the-art methods on both synthetic and real-world datasets, achieving 82.87% lower rotation error and 25.15% higher part accuracy. This sheds light on training on synthetic data to advance real-world 3D puzzle solving, demonstrating its strong generalization across unseen object shapes and diverse fracture types.
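The abstract reports rotation error without defining it. A common choice for comparing a predicted and a ground-truth rotation is the geodesic angle of the relative rotation; the sketch below assumes that definition, which may differ from the metric used in the paper. `rotation_z` is a helper introduced here only for the example.

```python
import math

Matrix = list[list[float]]


def rotation_z(degrees: float) -> Matrix:
    """3x3 rotation matrix about the z axis."""
    r = math.radians(degrees)
    c, s = math.cos(r), math.sin(r)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_error_deg(pred: Matrix, gt: Matrix) -> float:
    """Geodesic angle in degrees of the relative rotation pred^T @ gt."""
    # trace(pred^T @ gt) equals the sum of elementwise products.
    trace = sum(pred[i][j] * gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


# A fragment predicted at 30 degrees whose true pose is 50 degrees.
error = rotation_error_deg(rotation_z(30.0), rotation_z(50.0))
```

Under this definition the example yields an error of about 20 degrees, the angle separating the two poses.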

A Survey on Hypothesis Generation for Scientific Discovery in the Era of Large Language Models

Atilla Kaan Alkan,Shashwat Sourav,Maja Jablonska,Simone Astarita,Rishabh Chakrabarty,Nikhil Garuda,Pranav Khetarpal,Maciej Pióro,Dimitrios Tanoglidis,Kartheik G. Iyer,Mugdha S. Polimera,Michael J. Smith,Tirthankar Ghosal,Marc Huertas-Company,Sandor Kruk,Kevin Schawinski,Ioana Ciucă

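Task: Survey the use of large language models (LLMs) for hypothesis generation, covering a taxonomy of methods, techniques for improving quality, evaluation strategies, and future challenges.

Motivation: Hypothesis generation in scientific discovery is challenged by information overload and disciplinary fragmentation, and the potential of LLMs has sparked interest in using them to enhance and automate the process.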

Details

Method: Review existing methods and propose a taxonomy, analyze techniques for improving hypothesis quality, outline evaluation strategies, and discuss future directions.

Result: A comprehensive account of method categories, quality-improvement techniques, and evaluation strategies for LLM-based hypothesis generation.

Conclusion: The survey serves as a reference for researchers exploring LLMs for hypothesis generation and points out directions for future research.

Abstract: Hypothesis generation is a fundamental step in scientific discovery, yet it is increasingly challenged by information overload and disciplinary fragmentation. Recent advances in Large Language Models (LLMs) have sparked growing interest in their potential to enhance and automate this process. This paper presents a comprehensive survey of hypothesis generation with LLMs by (i) reviewing existing methods, from simple prompting techniques to more complex frameworks, and proposing a taxonomy that categorizes these approaches; (ii) analyzing techniques for improving hypothesis quality, such as novelty boosting and structured reasoning; (iii) providing an overview of evaluation strategies; and (iv) discussing key challenges and future directions, including multimodal integration and human-AI collaboration. Our survey aims to serve as a reference for researchers exploring LLMs for hypothesis generation.

Time-adaptive Video Frame Interpolation based on Residual Diffusion

Victor Fonte Chavez,Claudia Esteves,Jean-Bernard Hayet

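Task: Propose a diffusion-based video frame interpolation method aimed at traditional hand-made animation.

Motivation: In animation, variations between frames are large and hard for conventional methods to handle.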

Details

Method: Three main contributions: explicit handling of the interpolation time, adaptation of the ResShift diffusion scheme to video frame interpolation, and use of the stochastic nature of the diffusion process to provide a pixel-wise uncertainty estimate.

Result: On animation videos, the model outperforms existing state-of-the-art methods.

Conclusion: The proposed method performs well on animation video frame interpolation and offers uncertainty estimation as an additional benefit.

Abstract: In this work, we propose a new diffusion-based method for video frame interpolation (VFI), in the context of traditional hand-made animation. We introduce three main contributions: The first is that we explicitly handle the interpolation time in our model, which we also re-estimate during the training process, to cope with the particularly large variations observed in the animation domain, compared to natural videos; The second is that we adapt and generalize a diffusion scheme called ResShift recently proposed in the super-resolution community to VFI, which allows us to perform a very low number of diffusion steps (on the order of 10) to produce our estimates; The third is that we leverage the stochastic nature of the diffusion process to provide a pixel-wise estimate of the uncertainty on the interpolated frame, which could be useful to anticipate where the model may be wrong. We provide extensive comparisons with respect to state-of-the-art models and show that our model outperforms these models on animation videos.
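One way to turn a stochastic sampler into a pixel-wise uncertainty map is to run it several times and take the per-pixel standard deviation. The sketch below assumes that approach; the abstract does not specify how the authors compute their estimate, and `toy_sampler` is a hypothetical stand-in for one diffusion run on a 2x2 frame.

```python
import random
from statistics import mean, pstdev

Frame = list[list[float]]


def toy_sampler(rng: random.Random) -> Frame:
    """Hypothetical stand-in for one stochastic interpolation run.

    Pixel (0, 0) is deterministic; pixel (1, 1) is the noisiest.
    """
    return [
        [0.5, 0.5 + rng.gauss(0.0, 0.01)],
        [0.5 + rng.gauss(0.0, 0.05), 0.5 + rng.gauss(0.0, 0.2)],
    ]


def pixelwise_stats(samples: list[Frame]) -> tuple[Frame, Frame]:
    """Return (mean frame, standard-deviation frame) across the samples."""
    height, width = len(samples[0]), len(samples[0][0])
    mean_frame = [
        [mean(s[i][j] for s in samples) for j in range(width)]
        for i in range(height)
    ]
    std_frame = [
        [pstdev([s[i][j] for s in samples]) for j in range(width)]
        for i in range(height)
    ]
    return mean_frame, std_frame


rng = random.Random(0)
samples = [toy_sampler(rng) for _ in range(50)]
mean_frame, std_frame = pixelwise_stats(samples)
```

Pixels with a large standard deviation are those where repeated runs disagree, which is where the interpolated frame is most likely to be wrong.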

ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering

Ahmed Masry,Mohammed Saidul Islam,Mahir Ahmed,Aayush Bajaj,Firoz Kabir,Aaryaman Kartha,Md Tahmid Rahman Laskar,Mizanur Rahman,Shadikur Rahman,Mehrad Shahmohammadi,Megh Thakkar,Md Rizwan Parvez,Enamul Hoque,Shafiq Joty

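Task: Introduce and evaluate ChartQAPro, a new chart question answering benchmark that addresses the limited diversity and complexity of existing benchmarks.

Motivation: Existing chart question answering benchmarks such as ChartQA lack real-world diversity, and modern large vision-language models (LVLMs) have saturated on them, so they no longer reflect practical challenges.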

Details

Method: Build the ChartQAPro benchmark, with 1,341 charts from 157 different sources covering a variety of chart types and question forms (such as multiple-choice, conversational, hypothetical, and unanswerable questions).

Result: An evaluation of 21 models shows a substantial performance drop for LVLMs on ChartQAPro (for example, Claude Sonnet 3.5 falls from 90.5% on ChartQA to 55.81%), underscoring the complexity of chart reasoning.

Conclusion: ChartQAPro provides a more challenging benchmark for chart understanding and reasoning and reveals key challenges and opportunities for improving LVLMs.

Abstract: Charts are ubiquitous, as people often use them to analyze data, answer questions, and discover critical insights. However, performing complex analytical tasks with charts requires significant perceptual and cognitive effort. Chart Question Answering (CQA) systems automate this process by enabling models to interpret and reason with visual representations of data. However, existing benchmarks like ChartQA lack real-world diversity and have recently shown performance saturation with modern large vision-language models (LVLMs). To address these limitations, we introduce ChartQAPro, a new benchmark that includes 1,341 charts from 157 diverse sources, spanning various chart types, including infographics and dashboards, and featuring 1,948 questions of various types, such as multiple-choice, conversational, hypothetical, and unanswerable questions, to better reflect real-world challenges. Our evaluations with 21 models show a substantial performance drop for LVLMs on ChartQAPro; e.g., Claude Sonnet 3.5 scores 90.5% on ChartQA but only 55.81% on ChartQAPro, underscoring the complexity of chart reasoning. We complement our findings with detailed error analyses and ablation studies, identifying key challenges and opportunities for advancing LVLMs in chart understanding and reasoning. We release ChartQAPro at https://github.com/vis-nlp/ChartQAPro.

EP-Diffuser: An Efficient Diffusion Model for Traffic Scene Generation and Prediction via Polynomial Representations

Yue Yao,Mohamed-Khalil Bouzidi,Daniel Goehring,Joerg Reichardt

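Task: Predict the multi-modal future evolution of traffic scenes.

Motivation: The multi-modal nature of agent motion makes long-horizon prediction difficult, and existing models focus mainly on the most likely future while neglecting the distribution of other plausible motions.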

Details

Method: Propose EP-Diffuser, a parameter-efficient diffusion-based generative model that is conditioned on road layout and agent history and generates diverse, plausible scene continuations.

Result: On the Argoverse 2 dataset, EP-Diffuser is benchmarked against two SotA models and achieves highly accurate and plausible predictions despite a significantly smaller model size; in an OoD test on the Waymo Open dataset it shows superior robustness.

Conclusion: EP-Diffuser efficiently generates diverse and accurate traffic scene predictions, supporting the safe operation of autonomous vehicles.

Abstract: As the prediction horizon increases, predicting the future evolution of traffic scenes becomes increasingly difficult due to the multi-modal nature of agent motion. Most state-of-the-art (SotA) prediction models primarily focus on forecasting the most likely future. However, for the safe operation of autonomous vehicles, it is equally important to cover the distribution of plausible motion alternatives. To address this, we introduce EP-Diffuser, a novel parameter-efficient diffusion-based generative model designed to capture the distribution of possible traffic scene evolutions. Conditioned on road layout and agent history, our model acts as a predictor and generates diverse, plausible scene continuations. We benchmark EP-Diffuser against two SotA models in terms of accuracy and plausibility of predictions on the Argoverse 2 dataset. Despite its significantly smaller model size, our approach achieves both highly accurate and plausible traffic scene predictions. We further evaluate model generalization ability in an out-of-distribution (OoD) test setting using the Waymo Open dataset and show superior robustness of our approach. The code and model checkpoints can be found here: https://github.com/continental/EP-Diffuser.

Pretraining Language Models for Diachronic Linguistic Change Discovery

Elisabeth Fittschen,Sabrina Li,Tom Lippincott,Leshem Choshsem,Craig Messner

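Task: Investigate how efficient pretraining techniques can produce large language models suited to humanistic disciplines such as historical linguistics and literary studies.

Motivation: Research in the humanities often builds arguments on delineations such as time period or genre, and existing approaches (fine-tuning or model editing) cannot fully guarantee that a model stays within such a domain, so more efficient and precise pretraining methods are needed.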

Details

Method: Use a novel date-attribution pipeline to build a temporally segmented dataset of five 10-million-word slices, and train two sets of models on it: efficiently pretrained models and parameter-efficiently finetuned Llama3-8B models.

Result: The pretrained models train faster than the finetuned baselines and better respect the historical divisions; the method supports detection of several kinds of linguistic phenomena, including lexical change, grammatical and morphological change, and the introduction or obsolescence of word senses.

Conclusion: The method gives the humanities a fast, precise tool for hypothesis discovery and testing and can be extended to other fields.

Abstract: Large language models (LLMs) have shown potential as tools for scientific discovery. This has engendered growing interest in their use in humanistic disciplines, such as historical linguistics and literary studies. These fields often construct arguments on the basis of delineations like genre, or more inflexibly, time period. Although efforts have been made to restrict inference to specific domains via fine-tuning or model editing, we posit that the only true guarantee is domain-restricted pretraining -- typically, a data- and compute-expensive proposition. We show that efficient pretraining techniques can produce useful models over corpora too large for easy manual inspection but too small for "typical" LLM approaches. We employ a novel date-attribution pipeline in order to obtain a temporally-segmented dataset of five 10-million-word slices. We train two corresponding five-model batteries over these corpus segments: one efficiently pretrained, the other Llama3-8B parameter-efficiently finetuned. We find that the pretrained models are faster to train than the finetuned baselines and that they better respect the historical divisions of our corpus. Emphasizing speed and precision over a-historical comprehensiveness enables a number of novel approaches to hypothesis discovery and testing in our target fields. Taking up diachronic linguistics as a testbed, we show that our method enables the detection of a diverse set of phenomena, including en masse lexical change, non-lexical (grammatical and morphological) change, and word sense introduction/obsolescence. We provide a ready-to-use pipeline that allows extension of our approach to other target fields with only minimal adaptation.

Biomechanical Constraints Assimilation in Deep-Learning Image Registration: Application to sliding and locally rigid deformations

Ziad Kheil,Soleakhena Ken,Laurent Risser

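Task: Propose a learning-based medical image registration method whose inferred deformation properties adapt locally to biomechanical characteristics.

Motivation: Conventional regularization strategies in medical image registration impose uniform constraints and ignore the spatially inhomogeneous deformation of biological structures, so they cannot faithfully model the biomechanics of soft and hard tissues.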

Details

Method: During training, use regularization losses inspired by solid mechanics to enforce locally rigid displacements, shearing motions, or pseudo-elastic deformations.

Result: On synthetic and real 3D thoracic and abdominal images, the different mechanical properties generalize well when inferring deformations between new image pairs.

Conclusion: The method infers tissue-specific deformation patterns directly from input images, ensuring mechanically plausible motion; it preserves rigidity within hard tissues while allowing controlled sliding where tissues naturally separate, capturing physiological motion more faithfully.

Abstract: Regularization strategies in medical image registration often take a one-size-fits-all approach by imposing uniform constraints across the entire image domain. Yet biological structures are anything but regular. Lacking structural awareness, these strategies may fail to consider a panoply of spatially inhomogeneous deformation properties, which would faithfully account for the biomechanics of soft and hard tissues, especially in poorly contrasted structures. To bridge this gap, we propose a learning-based image registration approach in which the inferred deformation properties can locally adapt themselves to trained biomechanical characteristics. Specifically, we first enforce in the training process local rigid displacements, shearing motions or pseudo-elastic deformations using regularization losses inspired by the field of solid mechanics. We then show on synthetic and real 3D thoracic and abdominal images that these mechanical properties of different nature are well generalized when inferring the deformations between new image pairs. Our approach enables neural networks to infer tissue-specific deformation patterns directly from input images, ensuring mechanically plausible motion. These networks preserve rigidity within hard tissues while allowing controlled sliding in regions where tissues naturally separate, more faithfully capturing physiological motion. The code is publicly available at https://github.com/Kheil-Z/biomechanical_DLIR.

Bridging Industrial Expertise and XR with LLM-Powered Conversational Agents

Despina Tomkou,George Fatouros,Andreas Andreou,Georgios Makridis,Fotis Liarokapis,Dimitrios Dardanis,Athanasios Kiourtis,John Soldatos,Dimosthenis Kyriazis

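Task: Combine Retrieval-Augmented Generation (RAG) enhanced large language models (LLMs) with Extended Reality (XR) technologies to address knowledge transfer in industrial environments.

Motivation: Knowledge transfer in industrial environments calls for a natural, efficient, hands-free solution that gives workers real-time expert guidance.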

Details

Method: Propose a system architecture consisting of an LLM chat engine with dynamic tool orchestration and a voice-driven XR application, and tune performance by evaluating different chunking strategies, embedding models, and vector databases.

Result: Semantic chunking, balanced embedding models, and efficient vector stores deliver the best performance for industrial knowledge retrieval, and the system's potential is demonstrated in several industrial use cases.

Conclusion: The system has the potential to improve training efficiency, remote assistance, and operational guidance, in line with Industry 5.0's human-centric and resilient approach.

Abstract: This paper introduces a novel integration of Retrieval-Augmented Generation (RAG) enhanced Large Language Models (LLMs) with Extended Reality (XR) technologies to address knowledge transfer challenges in industrial environments. The proposed system embeds domain-specific industrial knowledge into XR environments through a natural language interface, enabling hands-free, context-aware expert guidance for workers. We present the architecture of the proposed system consisting of an LLM Chat Engine with dynamic tool orchestration and an XR application featuring voice-driven interaction. Performance evaluation of various chunking strategies, embedding models, and vector databases reveals that semantic chunking, balanced embedding models, and efficient vector stores deliver optimal performance for industrial knowledge retrieval. The system's potential is demonstrated through early implementation in multiple industrial use cases, including robotic assembly, smart infrastructure maintenance, and aerospace component servicing. Results indicate potential for enhancing training efficiency, remote assistance capabilities, and operational guidance in alignment with Industry 5.0's human-centric and resilient approach to industrial development.

Learning Activity View-invariance Under Extreme Viewpoint Changes via Curriculum Knowledge Distillation

Arjun Somayazulu,Efi Mavroudi,Changan Chen,Lorenzo Torresani,Kristen Grauman

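Task: Propose a method for learning rich video representations from in-the-wild videos with extreme viewpoint differences and severe occlusion.

Motivation: Traditional methods work well in controlled multi-view settings but struggle with in-the-wild videos, where viewpoints differ drastically and share little visual content.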

Details

Method: Define a geometry-based metric that ranks views at a fine-grained level by their occlusion, then design a knowledge distillation objective and a curriculum learning procedure that pairs progressively more challenging views.

Result: Outperforms state-of-the-art models on two tasks, with especially large gains on severely occluded views.

Conclusion: The method handles extreme viewpoint differences and occlusion effectively and improves video representation learning.

Abstract: Traditional methods for view-invariant learning from video rely on controlled multi-view settings with minimal scene clutter. However, they struggle with in-the-wild videos that exhibit extreme viewpoint differences and share little visual content. We introduce a method for learning rich video representations in the presence of such severe view-occlusions. We first define a geometry-based metric that ranks views at a fine-grained temporal scale by their likely occlusion level. Then, using those rankings, we formulate a knowledge distillation objective that preserves action-centric semantics with a novel curriculum learning procedure that pairs incrementally more challenging views over time, thereby allowing smooth adaptation to extreme viewpoint differences. We evaluate our approach on two tasks, outperforming SOTA models on both temporal keystep grounding and fine-grained keystep recognition benchmarks - particularly on views that exhibit severe occlusion.
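The curriculum idea, ranking views by occlusion and admitting harder ones as training progresses, can be sketched as a schedule over the ranked list. This is a hypothetical illustration: the linear schedule, the `start_frac` parameter, the camera names, and the occlusion scores are all assumptions, and the distillation objective itself is not shown.

```python
import math


def curriculum_views(
    occlusion: dict[str, float], progress: float, start_frac: float = 0.25
) -> list[str]:
    """Return the view ids admitted at this point of training, easiest first.

    occlusion maps a view id to an occlusion score (higher is harder).
    progress is the fraction of training completed, in [0, 1].
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    ranked = sorted(occlusion, key=occlusion.get)
    # Linearly grow the admitted fraction from start_frac to 1.0.
    frac = start_frac + (1.0 - start_frac) * progress
    k = max(1, math.ceil(frac * len(ranked)))
    return ranked[:k]


# Hypothetical occlusion scores for four camera views of the same activity.
scores = {"cam_front": 0.1, "cam_side": 0.4, "cam_top": 0.6, "cam_back": 0.9}

early = curriculum_views(scores, progress=0.0)
middle = curriculum_views(scores, progress=0.5)
late = curriculum_views(scores, progress=1.0)
```

Early in training only the least occluded view is paired with the reference; by the end every view, including the most occluded one, is admitted.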

COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values

M-A-P Team,Siwei Wu,Jincheng Ren,Xinrun Du,Shuyue Guo,Xingwei Qu,Yiming Liang,Jie Liu,Yunwen Li,Tianyu Zheng,Boyu Feng,Huaqing Yuan,Zenith Wang,Jiaheng Liu,Wenhao Huang,Chenglin Cai,Haoran Que,Jian Yang,Yuelin Bai,Zekun Moore Wang,Zhouliang Yu,Qunshu Lin,Ding Pan,Yuchen Jiang,Tiannan Wang,Wangchunshu Zhou,Shenzhi Wang,Xingyuan Bu,Minghao Liu,Guoyin Wang,Ge Zhang,Chenghua Lin

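Task: Design an LLM-based annotation pipeline for Chinese preference data that requires no human intervention, and use it to build COIG-P, a high-quality, large-scale Chinese preference dataset.

Motivation: Existing Chinese preference datasets are small, cover narrow domains, and lack rigorous data validation, and reliance on human annotation limits their scalability.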

Details

Method: Crawl and filter 92k high-quality Chinese queries, use 15 mainstream LLMs to generate and score response pairs, build the COIG-P dataset, and train an 8B-sized Chinese Reward Model (CRM).

Result: COIG-P covers multiple domains and brings significant performance improvements (2% to 12%); the CRM shows strong and efficient scoring ability.

Conclusion: COIG-P and the CRM provide a high-quality, scalable solution for Chinese preference alignment that significantly outperforms existing datasets and methods.

Abstract: Aligning large language models (LLMs) with human preferences has achieved remarkable success. However, existing Chinese preference datasets are limited by small scale, narrow domain coverage, and lack of rigorous data validation. Additionally, the reliance on human annotators for instruction and response labeling significantly constrains the scalability of human preference datasets. To address these challenges, we design an LLM-based Chinese preference dataset annotation pipeline with no human intervention. Specifically, we crawled and carefully filtered 92k high-quality Chinese queries and employed 15 mainstream LLMs to generate and score chosen-rejected response pairs. Based on it, we introduce COIG-P (Chinese Open Instruction Generalist - Preference), a high-quality, large-scale Chinese preference dataset comprising 1,009k Chinese preference pairs spanning 6 diverse domains: Chat, Code, Math, Logic, Novel, and Role. Building upon COIG-P, to reduce the overhead of using LLMs for scoring, we trained an 8B-sized Chinese Reward Model (CRM) and meticulously constructed a Chinese Reward Benchmark (CRBench). Evaluation results based on AlignBench (Liu et al., 2024) show that COIG-P significantly outperforms other Chinese preference datasets, and it brings significant performance improvements ranging from 2% to 12% for the Qwen2/2.5 and Infinity-Instruct-3M-0625 model series, respectively. The results on CRBench demonstrate that our CRM has a strong and robust scoring ability. We apply it to filter chosen-rejected response pairs in a test split of COIG-P, and our experiments show that it is comparable to GPT-4o in identifying low-quality samples while maintaining efficiency and cost-effectiveness. Our code and data are released at https://github.com/multimodal-art-projection/COIG-P.
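Turning scored responses into chosen-rejected pairs can be sketched as follows. The pairing rule here is an assumption made for illustration: the abstract says only that LLMs generate and score the pairs, so the `min_gap` threshold, the score scale, and the response ids are hypothetical.

```python
from itertools import combinations


def build_pairs(
    scored: dict[str, float], min_gap: float = 2.0
) -> list[tuple[str, str]]:
    """Return (chosen, rejected) pairs whose score gap is at least min_gap."""
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(scored.items(), 2):
        if abs(score_a - score_b) >= min_gap:
            # The higher-scored response becomes the chosen one.
            if score_a > score_b:
                pairs.append((resp_a, resp_b))
            else:
                pairs.append((resp_b, resp_a))
    return pairs


# Hypothetical judge scores (0 to 10) for three responses to one query.
scored_responses = {"resp_a": 9.0, "resp_b": 6.5, "resp_c": 8.0}
pairs = build_pairs(scored_responses)
```

Requiring a minimum gap keeps only pairs where the preference is clear. In this example only `resp_a` over `resp_b` qualifies, because the other two gaps fall below the threshold.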

Generative Adversarial Networks with Limited Data: A Survey and Benchmarking

Omar De Mitri,Ruyu Wang,Marco F. Huber

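Task: Give an overview of Generative Adversarial Networks (GANs), their variants, and their applications under limited data, and analyze methods for addressing data scarcity.

Motivation: GANs perform well on image synthesis tasks, but their performance depends on large-scale training data and deteriorates rapidly when data is limited, so ways to address this problem need study.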

Details

Method: Analyze existing GANs in the limited-data regime with designed experiments, and summarize methods that tackle the data scarcity problem from different perspectives.

Result: A summary of how current GANs perform with limited data, together with a range of methods for addressing data scarcity.

Conclusion: Future research still needs to address the challenges GANs face under limited data and to explore new trends.

Abstract: Generative Adversarial Networks (GANs) have shown impressive results in various image synthesis tasks. Numerous studies have demonstrated that GANs are more powerful in feature and expression learning compared to other generative models and that their latent space encodes rich semantic information. However, the strong performance of GANs relies heavily on access to large-scale training data and deteriorates rapidly when the amount of data is limited. This paper aims to provide an overview of GANs, their variants, and their applications in various vision tasks, focusing on addressing the limited data issue. We analyze state-of-the-art GANs in the limited-data regime with designed experiments, and present various methods that attempt to tackle this problem from different perspectives. Finally, we further elaborate on remaining challenges and trends for future research.

Can Large Language Models Match Tutoring System Adaptivity? A Benchmarking Study

Conrad Borchers,Tianze Shou

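Task: Evaluate the instructional adaptivity of large language models (LLMs) and compare it with that of intelligent tutoring systems (ITS).

Motivation: Examine whether LLMs can adapt dynamically to student needs and provide effective pedagogical strategies the way an ITS does.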

Details

Method: Propose a prompt variation framework that removes key context components (such as student errors and knowledge components) to create variations of 75 real-world tutoring scenarios, and test the adaptivity and pedagogical soundness of 1,350 instructional moves generated by three LLMs (Llama3-8B, Llama3-70B, and GPT-4o).

Result: Even the best-performing model (Llama3-70B) only marginally mimics ITS adaptivity; GPT-4o follows instructions reliably but gives overly direct feedback, and Llama3-8B scores higher on pedagogical soundness but is weaker at instruction following.

Conclusion: Current LLM-based tutoring is unlikely to match known-to-be-effective ITS tutoring, but the study contributes a reproducible evaluation method.

Abstract: Large Language Models (LLMs) hold promise as dynamic instructional aids. Yet, it remains unclear whether LLMs can replicate the adaptivity of intelligent tutoring systems (ITS)--where student knowledge and pedagogical strategies are explicitly modeled. We propose a prompt variation framework to assess LLM-generated instructional moves' adaptivity and pedagogical soundness across 75 real-world tutoring scenarios from an ITS. We systematically remove key context components (e.g., student errors and knowledge components) from prompts to create variations of each scenario. Three representative LLMs (Llama3-8B, Llama3-70B, and GPT-4o) generate 1,350 instructional moves. We use text embeddings and randomization tests to measure how the omission of each context feature impacts the LLMs' outputs (adaptivity) and a validated tutor-training classifier to evaluate response quality (pedagogical soundness). Surprisingly, even the best-performing model only marginally mimics the adaptivity of ITS. Specifically, Llama3-70B demonstrates statistically significant adaptivity to student errors. Although Llama3-8B's recommendations receive higher pedagogical soundness scores than the other models, it struggles with instruction-following behaviors, including output formatting. By contrast, GPT-4o reliably adheres to instructions but tends to provide overly direct feedback that diverges from effective tutoring, which prompts learners with open-ended questions to gauge knowledge. Given these results, we discuss how current LLM-based tutoring is unlikely to produce learning benefits rivaling known-to-be-effective ITS tutoring. Through our open-source benchmarking code, we contribute a reproducible method for evaluating LLMs' instructional adaptivity and fidelity.
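A randomization test of the kind the abstract mentions can be sketched as a permutation test on two groups of similarity scores. The test statistic (absolute difference in means), the permutation count, and the similarity values below are assumptions for illustration and are not taken from the paper.

```python
import random
from statistics import mean


def permutation_test(
    a: list[float], b: list[float], n_perm: int = 2000, seed: int = 0
) -> float:
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[: len(a)], pooled[len(a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical embedding similarities between outputs and a reference move,
# with the student error present in the prompt versus removed from it.
with_error = [0.81, 0.79, 0.85, 0.80, 0.83, 0.78, 0.82, 0.84]
without_error = [0.60, 0.58, 0.65, 0.61, 0.63, 0.59, 0.62, 0.64]

p_value = permutation_test(with_error, without_error)
```

A small p-value would indicate that removing the context feature changes the model's output more than random relabeling would, which is one way to read "adaptivity" to that feature.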

Taxonomy-Aware Evaluation of Vision-Language Models

Vésteinn Snæbjarnarson,Kevin Du,Niklas Stoehr,Serge Belongie,Ryan Cotterell,Nico Lang,Stella Frank

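Task: Propose a taxonomy-based framework for evaluating the correctness and specificity of unconstrained text predictions generated by vision-language models (VLMs).

Motivation: VLM outputs can be too general (for example 'conifer' rather than 'norway spruce'), and existing evaluation methods cannot measure such partial correctness, so an evaluation that reflects taxonomic hierarchy is needed.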

Details

Method: Propose hierarchical precision and recall measures, develop methods for mapping textual VLM predictions onto a taxonomy, and compute hierarchical similarity between the generated text and the ground-truth labels.

Result: Experiments show that existing text similarity measures fail to capture taxonomic similarity, while the proposed hierarchical measures assess the specificity of VLM predictions effectively.

Conclusion: The taxonomy-based evaluation framework provides a more complete tool for analyzing VLMs on fine-grained visual classification tasks.

Abstract: When a vision-language model (VLM) is prompted to identify an entity depicted in an image, it may answer 'I see a conifer,' rather than the specific label 'norway spruce'. This raises two issues for evaluation: First, the unconstrained generated text needs to be mapped to the evaluation label space (i.e., 'conifer'). Second, a useful classification measure should give partial credit to less-specific, but not incorrect, answers ('norway spruce' being a type of 'conifer'). To meet these requirements, we propose a framework for evaluating unconstrained text predictions, such as those generated from a vision-language model, against a taxonomy. Specifically, we propose the use of hierarchical precision and recall measures to assess the level of correctness and specificity of predictions with regard to a taxonomy. Experimentally, we first show that existing text similarity measures do not capture taxonomic similarity well. We then develop and compare different methods to map textual VLM predictions onto a taxonomy. This allows us to compute hierarchical similarity measures between the generated text and the ground truth labels. Finally, we analyze modern VLMs on fine-grained visual classification tasks based on our proposed taxonomic evaluation scheme.
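Hierarchical precision and recall are commonly defined over ancestor sets: each label is expanded to itself plus its ancestors (excluding the root), and the overlap is normalized by the size of the predicted set or the true set. The sketch below assumes that common formulation, which may differ in detail from the paper's, and uses a toy taxonomy invented for the example.

```python
# Toy taxonomy as child -> parent links; "plant" is the root.
PARENT = {
    "norway spruce": "spruce",
    "spruce": "conifer",
    "scots pine": "pine",
    "pine": "conifer",
    "conifer": "plant",
    "oak": "broadleaf",
    "broadleaf": "plant",
}
ROOT = "plant"


def label_path(node: str) -> set[str]:
    """The node plus all of its ancestors, excluding the root."""
    path = set()
    while node != ROOT:
        path.add(node)
        node = PARENT[node]
    return path


def hierarchical_pr(pred: str, truth: str) -> tuple[float, float]:
    """Return (hierarchical precision, hierarchical recall)."""
    pred_path, true_path = label_path(pred), label_path(truth)
    overlap = len(pred_path & true_path)
    return overlap / len(pred_path), overlap / len(true_path)


general = hierarchical_pr("conifer", "norway spruce")     # correct but vague
sibling = hierarchical_pr("scots pine", "norway spruce")  # wrong species
wrong = hierarchical_pr("oak", "norway spruce")           # wrong branch
```

The vague answer 'conifer' gets full precision but only one third recall, the wrong conifer species gets partial credit on both, and an answer from a different branch gets none. This is the partial-credit behavior the abstract asks for.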

Knowledge-Instruct: Effective Continual Pre-training from Limited Data using Instructions

Oded Ovadia,Meni Brief,Rachel Lemberg,Eitam Sheetrit

Task: Proposes Knowledge-Instruct, a new method that efficiently injects knowledge from limited corpora through pure instruction-tuning.

Motivation: Addresses the gaps of large language models in domain-specific, new, or niche information while avoiding the catastrophic forgetting and low data efficiency of continual pre-training.

Details

Method: Generates information-dense synthetic instruction data and integrates new knowledge through instruction-tuning while preserving general reasoning and instruction-following abilities. Result: Knowledge-Instruct performs strongly on factual memorization, minimization of catastrophic forgetting, and scalability, and also improves contextual understanding. Conclusion: The method is validated across multiple benchmarks, and a new dataset, Companies, is released to measure knowledge injection capabilities. Abstract: While Large Language Models (LLMs) acquire vast knowledge during pre-training, they often lack domain-specific, new, or niche information. Continual pre-training (CPT) attempts to address this gap but suffers from catastrophic forgetting and inefficiencies in low-data regimes. We introduce Knowledge-Instruct, a novel approach to efficiently inject knowledge from limited corpora through pure instruction-tuning. By generating information-dense synthetic instruction data, it effectively integrates new knowledge while preserving general reasoning and instruction-following abilities. Knowledge-Instruct demonstrates superior factual memorization, minimizes catastrophic forgetting, and remains scalable by leveraging synthetic data from relatively small language models. Additionally, it enhances contextual understanding, including complex multi-hop reasoning, facilitating integration with retrieval systems. We validate its effectiveness across diverse benchmarks, including Companies, a new dataset that we release to measure knowledge injection capabilities.

Optimizing 4D Gaussians for Dynamic Scene Video from Single Landscape Images

In-Hwan Jin,Haesoo Choo,Seong-Hun Jeong,Heemoon Park,Junghwan Kim,Oh-joon Kwon,Kyeongbo Kong

Task: Construct a complete 3D space from a single landscape image to produce dynamic scene video.

Motivation: Existing methods rely on a pseudo-3D space (e.g., LDIs), which can diminish depth perception and cause distortions, and their output is limited to 2D video.

Details

Method: Models an explicit representation (4D Gaussians) by generating multi-view images from a single image, and estimates consistent 3D motion to optimize the 4D Gaussians. Result: The model delivers realistic immersion across a variety of landscape images. Conclusion: This is the first attempt to consider animation together with a complete 3D space representation from a single landscape image; experiments confirm its effectiveness. Abstract: To achieve realistic immersion in landscape images, fluids such as water and clouds need to move within the image while revealing new scenes from various camera perspectives. Recently, a field called dynamic scene video has emerged, which combines single image animation with 3D photography. These methods use pseudo 3D space, implicitly represented with Layered Depth Images (LDIs). LDIs separate a single image into depth-based layers, which enables elements like water and clouds to move within the image while revealing new scenes from different camera perspectives. However, as landscapes typically consist of continuous elements, including fluids, representing a 3D space by separating a landscape image into discrete layers can lead to diminished depth perception and potential distortions depending on camera movement. Furthermore, due to its implicit modeling of 3D space, the output may be limited to videos in the 2D domain, potentially reducing their versatility. In this paper, we propose representing a complete 3D space for dynamic scene video by modeling explicit representations, specifically 4D Gaussians, from a single image. The framework is focused on optimizing 3D Gaussians by generating multi-view images from a single image and creating 3D motion to optimize 4D Gaussians. The most important part of the proposed framework is consistent 3D motion estimation, which estimates common motion among multi-view images to bring the motion in 3D space closer to actual motions. As far as we know, this is the first attempt that considers animation while representing a complete 3D space from a single landscape image. Our model demonstrates the ability to provide realistic immersion in various landscape images through diverse experiments and metrics. Extensive experimental results are available at https://cvsp-lab.github.io/ICLR2025_3D-MOM/.

DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding

Hossein Entezari Zarch,Lei Gao,Chaoyi Jiang,Murali Annavaram

Task: Proposes DEL, a method that adaptively selects the exit layer and speculation length to improve the performance of speculative decoding (SD).

Motivation: Existing methods select the exit layer and speculation length statically, yet these parameters depend on the task and context, which leads to suboptimal performance.

Details

Method: Dynamically tracks the token acceptance rate at each layer and heuristically selects the optimal exit layer and speculation length. Result: DEL achieves speedups of 2.16× to 2.50× over vanilla auto-regressive decoding across a range of models and tasks and improves upon existing SD methods. Conclusion: DEL is an efficient dynamic optimization method that markedly improves speculative decoding performance. Abstract: Speculative Decoding (SD) is a widely used approach to accelerate the inference of large language models (LLMs) without reducing generation quality. It operates by first using a compact model to draft multiple tokens efficiently, followed by parallel verification using the target LLM. This approach leads to faster inference compared to auto-regressive decoding. While there are multiple approaches to create a draft model, one promising approach is to use early-exit methods. These methods draft candidate tokens by using a subset of layers of the primary model and applying the remaining layers for verification, allowing a single model to handle both drafting and verification. While this technique reduces memory usage and computational cost, its performance relies on the choice of the exit layer for drafting and the number of tokens drafted (speculation length) in each SD round. Prior works use hyperparameter exploration to statically select these values. However, our evaluations show that these hyperparameter values are task-specific, and even within a task they are dependent on the current sequence context. We introduce DEL, a plug-and-play method that adaptively selects the exit layer and speculation length during inference. DEL dynamically tracks the token acceptance rate if the tokens are drafted at each layer of an LLM and uses that knowledge to heuristically select the optimal exit layer and speculation length. Our experiments across a broad range of models and downstream tasks show that DEL achieves overall speedups of $2.16\times \sim 2.50\times$ over vanilla auto-regressive decoding and improves upon the state-of-the-art SD methods by up to $0.27\times$.
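
The trade-off between exit layer and speculation length can be sketched as follows. This is not DEL's actual heuristic, which the abstract does not spell out. The sketch assumes an exponential moving average as the acceptance-rate tracker, the standard expected-tokens formula for speculative decoding under independent acceptances, and a simplified cost model in which drafting one token at layer l of L costs l/L of a full forward pass and verification costs one full pass. All names are hypothetical.

```python
def update_rate(rate, accepted, alpha=0.1):
    """Exponential moving average of the acceptance rate (assumed tracker)."""
    return (1 - alpha) * rate + alpha * (1.0 if accepted else 0.0)


def expected_tokens(acceptance, length):
    """Expected tokens produced per round when each drafted token is
    accepted independently with probability `acceptance`."""
    if acceptance >= 1.0:
        return float(length + 1)
    return (1 - acceptance ** (length + 1)) / (1 - acceptance)


def select_exit(acceptance_by_layer, num_layers, max_length=8):
    """Pick the (exit layer, speculation length) with the best
    expected tokens per unit of compute under the simplified cost model."""
    best = None
    for layer, acceptance in acceptance_by_layer.items():
        for length in range(1, max_length + 1):
            cost = length * layer / num_layers + 1.0  # drafts + one verification
            score = expected_tokens(acceptance, length) / cost
            if best is None or score > best[2]:
                best = (layer, length, score)
    return best


if __name__ == "__main__":
    # Hypothetical acceptance rates: deeper exits draft more accurately.
    rates = {4: 0.3, 8: 0.7, 16: 0.9}
    print(select_exit(rates, num_layers=32))
```

Because the tracked rates change as the sequence context changes, re-running the selection each round lets the choice shift between shallow, cheap exits and deep, accurate ones, which is the kind of context dependence the abstract describes.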

REVEAL: Relation-based Video Representation Learning for Video-Question-Answering

Sofian Chaybouti,Walid Bousselham,Moritz Wolter,Hilde Kuehne

Task: Proposes the REVEAL framework, which captures visual relation information in videos through structured, decomposed representations to improve video question answering.

Motivation: Existing video language models struggle to capture complex changes in visual relations, in part because the visual content must be represented as a reasonably sized input.

Details

Method: Drawing on spatiotemporal scene graphs, encodes video sequences as relation triplets (subject-predicate-object) and aligns video queries with text-based relation descriptions using MM-NCE and a Q-Former architecture. Result: Performs well on five benchmarks, especially on tasks requiring temporal reasoning and relation comprehension, and outperforms global alignment-based representations. Conclusion: REVEAL's structured relation representation significantly improves video question answering performance; code and models will be publicly released. Abstract: Video-Question-Answering (VideoQA) comprises the capturing of complex visual relation changes over time, remaining a challenge even for advanced Video Language Models (VLM), i.a., because of the need to represent the visual content to a reasonably sized input for those models. To address this problem, we propose RElation-based Video rEpresentAtion Learning (REVEAL), a framework designed to capture visual relation information by encoding it into structured, decomposed representations. Specifically, inspired by spatiotemporal scene graphs, we propose to encode video sequences as sets of relation triplets in the form of (subject-predicate-object) over time via their language embeddings. To this end, we extract explicit relations from video captions and introduce a Many-to-Many Noise Contrastive Estimation (MM-NCE) together with a Q-Former architecture to align an unordered set of video-derived queries with corresponding text-based relation descriptions. At inference, the resulting Q-Former produces an efficient token representation that can serve as input to a VLM for VideoQA. We evaluate the proposed framework on five challenging benchmarks: NeXT-QA, Intent-QA, STAR, VLEP, and TVQA. The results show that the resulting query-based video representation is able to outperform global alignment-based CLS or patch token representations and achieves competitive results against state-of-the-art models, particularly on tasks requiring temporal reasoning and relation comprehension. The code and models will be publicly released.

On the Impact of Language Nuances on Sentiment Analysis with Large Language Models: Paraphrasing, Sarcasm, and Emojis

Naman Bhargava,Mohammed I. Radaideh,O Hwang Kwon,Aditi Verma,Majdi I. Radaideh

Task: Studies how textual nuances (such as emojis and sarcasm) affect sentiment analysis, and improves data quality through text paraphrasing.

Motivation: The quality of social media data significantly affects the accuracy of sentiment analysis with large language models (LLMs), especially for sarcastic content.

Details

Method: Creates a human-labeled dataset of sarcastic tweets, evaluates LLMs in various sarcasm contexts, fine-tunes models on topic-specific and general datasets, and applies adversarial text augmentation and text paraphrasing. Result: Sarcasm removal improves sentiment accuracy by up to 21%; LLMs trained on general datasets reach 60% accuracy on sarcastic tweets; adversarial text augmentation raises accuracy on sarcastic tweets to approximately 85%, and paraphrasing improves sentiment analysis accuracy by 6%. Conclusion: General datasets and text paraphrasing help LLMs handle sarcastic content, while adversarial text augmentation significantly improves model robustness. Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, including sentiment analysis. However, data quality, particularly when sourced from social media, can significantly impact their accuracy. This research explores how textual nuances, including emojis and sarcasm, affect sentiment analysis, with a particular focus on improving data quality through text paraphrasing techniques. To address the lack of labeled sarcasm data, the authors created a human-labeled dataset of 5929 tweets that enabled the assessment of LLMs in various sarcasm contexts. The results show that when topic-specific datasets, such as those related to nuclear power, are used to finetune LLMs, these models are not able to comprehend accurate sentiment in the presence of sarcasm due to less diverse text, requiring external interventions like sarcasm removal to boost model accuracy. Sarcasm removal led to up to 21% improvement in sentiment accuracy, as LLMs trained on nuclear power-related content struggled with sarcastic tweets, achieving only 30% accuracy. In contrast, LLMs trained on general tweet datasets, covering a broader range of topics, showed considerable improvements in predicting sentiment for sarcastic tweets (60% accuracy), indicating that incorporating general text data can enhance sarcasm detection. The study also utilized adversarial text augmentation, showing that creating synthetic text variants by making minor changes significantly increased model robustness and accuracy for sarcastic tweets (approximately 85%). Additionally, text paraphrasing of tweets with fragmented language transformed around 40% of the tweets with low-confidence labels into high-confidence ones, improving LLM sentiment analysis accuracy by 6%.

Studying Image Diffusion Features for Zero-Shot Video Object Segmentation

Thanos Delatolas,Vicky Kalogeiton,Dim P. Papadopoulos

Task: Investigates how to use large-scale diffusion models for zero-shot video object segmentation (ZS-VOS) without fine-tuning on video data or training on any image segmentation data.

Motivation: Diffusion models show strong visual representations across many tasks, but their direct application to ZS-VOS remains underexplored.

Details

Method: Optimizes feature extraction for ZS-VOS by identifying the best time step and layer, and analyzes the affinity of these features and its relation to point correspondences. Result: Experiments on DAVIS-17 and MOSE show that diffusion models trained on ImageNet outperform those trained on larger, more diverse datasets for ZS-VOS and achieve state-of-the-art results. Conclusion: Diffusion models perform well on ZS-VOS without relying on expensive image segmentation datasets, and point correspondences are important for segmentation accuracy. Abstract: This paper investigates the use of large-scale diffusion models for Zero-Shot Video Object Segmentation (ZS-VOS) without fine-tuning on video data or training on any image segmentation data. While diffusion models have demonstrated strong visual representations across various tasks, their direct application to ZS-VOS remains underexplored. Our goal is to find the optimal feature extraction process for ZS-VOS by identifying the most suitable time step and layer from which to extract features. We further analyze the affinity of these features and observe a strong correlation with point correspondences. Through extensive experiments on DAVIS-17 and MOSE, we find that diffusion models trained on ImageNet outperform those trained on larger, more diverse datasets for ZS-VOS. Additionally, we highlight the importance of point correspondences in achieving high segmentation accuracy, and we achieve state-of-the-art results in ZS-VOS. Finally, our approach performs on par with models trained on expensive image segmentation datasets.
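
One common way feature affinity is used for zero-shot mask propagation is nearest-neighbor label voting: each location in the target frame takes the label favored by its most similar reference-frame features. The sketch below illustrates that general idea with cosine similarity and softmax-style weights. It is an assumption about how such affinities might be applied, not the paper's pipeline; the feature vectors stand in for diffusion features, and all names and values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def propagate_labels(ref_feats, ref_labels, tgt_feats, k=2, temperature=0.1):
    """Assign each target feature the label with the largest weighted vote
    among its k most similar reference features."""
    predictions = []
    for feat in tgt_feats:
        scored = [(cosine(feat, ref), label)
                  for ref, label in zip(ref_feats, ref_labels)]
        top = sorted(scored, key=lambda item: item[0], reverse=True)[:k]
        votes = {}
        for similarity, label in top:
            votes[label] = votes.get(label, 0.0) + math.exp(similarity / temperature)
        predictions.append(max(votes, key=votes.get))
    return predictions


if __name__ == "__main__":
    # Toy 2-D features: label 1 = object, label 0 = background.
    ref_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    ref_labels = [1, 1, 0, 0]
    tgt_feats = [[0.8, 0.2], [0.2, 0.8]]
    print(propagate_labels(ref_feats, ref_labels, tgt_feats))
```

In this setup the quality of the propagated mask depends entirely on how well feature similarity tracks true point correspondences, which is consistent with the abstract's emphasis on that correlation.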

FactGuard: Leveraging Multi-Agent Systems to Generate Answerable and Unanswerable Questions for Enhanced Long-Context LLM Extraction

Qian-Wen Zhang,Fang Li,Jie Wang,Lingfeng Qiao,Yifei Yu,Di Yin,Xing Sun

Task: Proposes a data augmentation method based on a multi-agent collaborative framework that generates answerable and unanswerable question-answer pairs to improve extractive reading comprehension systems.

Motivation: Addresses the difficulty existing extractive reading comprehension systems have in recognizing unanswerable questions, while reducing the high cost of human annotation.

Details

Method: Uses a multi-agent collaborative framework to autonomously generate evidence-based question-answer pairs and systematically construct unanswerable questions, producing the FactGuard-Bench dataset. Result: Experiments show that even the most advanced LLMs reach only 61.79% overall accuracy, underscoring the importance of a model's ability to reason about unanswerable questions. Conclusion: The method significantly reduces manual annotation costs and provides useful insights for training and optimizing LLMs. Abstract: Extractive reading comprehension systems are designed to locate the correct answer to a question within a given text. However, a persistent challenge lies in ensuring these models maintain high accuracy in answering questions while reliably recognizing unanswerable queries. Despite significant advances in large language models (LLMs) for reading comprehension, this issue remains critical, particularly as the length of supported contexts continues to expand. To address this challenge, we propose an innovative data augmentation methodology grounded in a multi-agent collaborative framework. Unlike traditional methods, such as the costly human annotation process required for datasets like SQuAD 2.0, our method autonomously generates evidence-based question-answer pairs and systematically constructs unanswerable questions. Using this methodology, we developed the FactGuard-Bench dataset, which comprises 25,220 examples of both answerable and unanswerable question scenarios, with context lengths ranging from 8K to 128K. Experimental evaluations conducted on seven popular LLMs reveal that even the most advanced models achieve only 61.79% overall accuracy. Furthermore, we emphasize the importance of a model's ability to reason about unanswerable questions to avoid generating plausible but incorrect answers. By implementing efficient data selection and generation within the multi-agent collaborative framework, our method significantly reduces the traditionally high costs associated with manual annotation and provides valuable insights for the training and optimization of LLMs.

Secure Diagnostics: Adversarial Robustness Meets Clinical Interpretability

Mohammad Hossein Najafi,Mohammad Morsali,Mohammadreza Pashanejad,Saman Soleimani Roudi,Mohammad Norouzi,Saeed Bagheri Shouraki

Task: Studies the interpretability and robustness of deep neural networks for fracture detection.

Motivation: Deep neural networks for medical image classification generalize poorly in clinical practice because of violations of the i.i.d. assumption and opaque decision-making.

Details

Method: Evaluates model performance under adversarial attack and compares interpretability methods against fracture regions annotated by an orthopedic surgeon. Result: Robust models produce explanations better aligned with clinically meaningful regions, indicating that robustness encourages prioritization of anatomically relevant features. Conclusion: Interpretability and robustness are complementary benchmarks for bridging the gap between benchmark performance and safe clinical deployment, and they support human-AI collaboration. Abstract: Deep neural networks for medical image classification often fail to generalize consistently in clinical practice due to violations of the i.i.d. assumption and opaque decision-making. This paper examines interpretability in deep neural networks fine-tuned for fracture detection by evaluating model performance under adversarial attack and comparing interpretability methods to fracture regions annotated by an orthopedic surgeon. Our findings show that robust models yield explanations more aligned with clinically meaningful areas, indicating that robustness encourages anatomically relevant feature prioritization. We emphasize the value of interpretability for facilitating human-AI collaboration, in which models serve as assistants under a human-in-the-loop paradigm: clinically plausible explanations foster trust, enable error correction, and discourage reliance on AI for high-stakes decisions. This paper investigates robustness and interpretability as complementary benchmarks for bridging the gap between benchmark performance and safe, actionable clinical deployment.

Yichen Dong,Xinglin Lyu,Junhui Li,Daimeng Wei,Min Zhang,Shimin Tao,Hao Yang

Task: Extends translation self-refinement with large language models (LLMs) from the sentence level to the document level, focusing on document-to-document (Doc2Doc) translation refinement.

Motivation: Sentence-level and document-level translation address different aspects of the translation process, so combining their strengths can improve translation quality.

Details

Method: Fine-tunes LLMs for translation refinement using two intermediate translations, and introduces a quality-aware enhanced fine-tuning method that assigns weights according to translation difficulty. Result: The approach is shown to be effective on ten translation tasks with LLaMA-3-8B-Instruct and Mistral-Nemo-Instruct. Conclusion: Combining sentence-level and document-level translation refinement significantly improves translation quality. Abstract: Recent research has shown that large language models (LLMs) can enhance translation quality through self-refinement. In this paper, we build on this idea by extending the refinement from sentence-level to document-level translation, specifically focusing on document-to-document (Doc2Doc) translation refinement. Since sentence-to-sentence (Sent2Sent) and Doc2Doc translation address different aspects of the translation process, we propose fine-tuning LLMs for translation refinement using two intermediate translations, combining the strengths of both Sent2Sent and Doc2Doc. Additionally, recognizing that the quality of intermediate translations varies, we introduce an enhanced fine-tuning method with quality awareness that assigns lower weights to easier translations and higher weights to more difficult ones, enabling the model to focus on challenging translation cases. Experimental results across ten translation tasks with LLaMA-3-8B-Instruct and Mistral-Nemo-Instruct demonstrate the effectiveness of our approach.

REEF: Relevance-Aware and Efficient LLM Adapter for Video Understanding

Sakib Reza,Xiyun Song,Heather Yu,Zongfang Lin,Mohsen Moghaddam,Octavia Camps

Task: Proposes an efficient LLM adapter for video-level understanding of untrimmed videos that focuses on the contextual relevance of spatio-temporal tokens.

Motivation: Existing methods typically compress visual memory with similarity-based greedy approaches, which can overlook the contextual importance of individual tokens.

Details

Method: Uses scorer networks to selectively compress the visual memory bank and filter spatial tokens, trained end to end with a differentiable Top-K operator. Result: On three video-level understanding tasks (untrimmed video classification, video question answering, and video captioning), the method achieves competitive or superior results on four large-scale datasets while reducing computational overhead by up to 34%. Conclusion: The method improves video understanding performance while remaining efficient. Abstract: Integrating vision models into large language models (LLMs) has sparked significant interest in creating vision-language foundation models, especially for video understanding. Recent methods often utilize memory banks to handle untrimmed videos for video-level understanding. However, they typically compress visual memory using similarity-based greedy approaches, which can overlook the contextual importance of individual tokens. To address this, we introduce an efficient LLM adapter designed for video-level understanding of untrimmed videos that prioritizes the contextual relevance of spatio-temporal tokens. Our framework leverages scorer networks to selectively compress the visual memory bank and filter spatial tokens based on relevance, using a differentiable Top-K operator for end-to-end training. Across three key video-level understanding tasks (untrimmed video classification, video question answering, and video captioning), our method achieves competitive or superior results on four large-scale datasets while reducing computational overhead by up to 34%. The code will be available soon on GitHub.
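
Relevance-based token filtering can be illustrated with a plain, non-differentiable Top-K selection. This is a stand-in for inference-time behavior only: the paper trains with a differentiable Top-K operator, which is not reproduced here. The scores are assumed to come from some scorer network, and the function name and toy values are hypothetical.

```python
def select_top_k(tokens, scores, k):
    """Keep the k tokens with the highest relevance scores,
    preserving their original (e.g., temporal) order."""
    if k <= 0:
        return []
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # restore original ordering of the survivors
    return [tokens[i] for i in keep]


if __name__ == "__main__":
    tokens = ["t0", "t1", "t2", "t3", "t4"]
    scores = [0.1, 0.9, 0.3, 0.8, 0.2]  # hypothetical scorer outputs
    print(select_top_k(tokens, scores, k=2))
```

Unlike similarity-based merging, this kind of selection can drop a token that is visually distinctive but irrelevant and keep one that is redundant-looking but important, provided the scorer has learned the distinction.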

Reasoning Towards Fairness: Mitigating Bias in Language Models through Reasoning-Guided Fine-Tuning

Sanchit Kabra,Akshita Jha,Chandan Reddy

Task: Studies the relationship between reasoning ability and fairness in large generative language models, and explores whether reasoning ability can reduce harmful stereotypical responses.

Motivation: Although reasoning capabilities significantly improve the task performance of large-scale generative language models, their effect on reducing stereotypical responses is underexplored.

Details

Method: Evaluates multiple open-source LLMs and proposes ReGiFT (Reasoning Guided Fine-Tuning), which uses structured reasoning traces from advanced reasoning models to strengthen models that lack such capabilities. Result: Models with stronger reasoning show lower stereotypical bias on fairness benchmarks, and ReGiFT-fine-tuned models outperform both their non-reasoning counterparts and advanced reasoning models on fairness. Conclusion: Enhancing reasoning is an effective strategy, requiring no fairness-specific supervision, for reducing stereotypical bias caused by reasoning flaws. Abstract: Recent advances in large-scale generative language models have shown that reasoning capabilities can significantly improve model performance across a variety of tasks. However, the impact of reasoning on a model's ability to mitigate stereotypical responses remains largely underexplored. In this work, we investigate the crucial relationship between a model's reasoning ability and fairness, and ask whether improved reasoning capabilities can mitigate harmful stereotypical responses, especially those arising due to shallow or flawed reasoning. We conduct a comprehensive evaluation of multiple open-source LLMs, and find that larger models with stronger reasoning abilities exhibit substantially lower stereotypical bias on existing fairness benchmarks. Building on this insight, we introduce ReGiFT (Reasoning Guided Fine-Tuning), a novel approach that extracts structured reasoning traces from advanced reasoning models and infuses them into models that lack such capabilities. We use only general-purpose reasoning and do not require any fairness-specific supervision for bias mitigation. Notably, we see that models fine-tuned using ReGiFT not only improve fairness relative to their non-reasoning counterparts but also outperform advanced reasoning models on fairness benchmarks. We also analyze how variations in the correctness of the reasoning traces and their length influence model fairness and their overall performance. Our findings highlight that enhancing reasoning capabilities is an effective, fairness-agnostic strategy for mitigating stereotypical bias caused by reasoning flaws.

Few-shot Personalized Scanpath Prediction

Ruoyu Xue,Jingyi Xu,Sounak Mondal,Hieu Le,Gregory Zelinsky,Minh Hoai,Dimitris Samaras

Task: Proposes the few-shot personalized scanpath prediction task (FS-PSP) and a method to address it.

Motivation: Existing scanpath prediction methods are data-intensive and hard to personalize to new individuals.

Details

Method: Designs a Subject-Embedding Network (SE-Net) that produces individualized representations, and conditions scanpath prediction on them. Result: The method is validated on multiple eye-tracking datasets and requires no fine-tuning at test time. Conclusion: The method performs strongly in few-shot personalized scanpath prediction. Abstract: A personalized model for scanpath prediction provides insights into the visual preferences and attention patterns of individual subjects. However, existing methods for training scanpath prediction models are data-intensive and cannot be effectively personalized to new individuals with only a few available examples. In this paper, we propose the few-shot personalized scanpath prediction task (FS-PSP) and a novel method to address it, which aims to predict scanpaths for an unseen subject using minimal support data of that subject's scanpath behavior. The key to our method's adaptability is the Subject-Embedding Network (SE-Net), specifically designed to capture unique, individualized representations for each subject's scanpaths. SE-Net generates subject embeddings that effectively distinguish between subjects while minimizing variability among scanpaths from the same individual. The personalized scanpath prediction model is then conditioned on these subject embeddings to produce accurate, personalized results. Experiments on multiple eye-tracking datasets demonstrate that our method excels in FS-PSP settings and does not require any fine-tuning steps at test time. Code is available at: https://github.com/cvlab-stonybrook/few-shot-scanpath

DBOT: Artificial Intelligence for Systematic Long-Term Investing

Vasant Dhar,João Sedoc

Task: Develops DBOT, a generative AI system that can perform valuation like Aswath Damodaran.

Motivation: Use generative AI to automate long-term investing and reduce reliance on human judgment.

Details

Method: Trains the DBOT system on Aswath Damodaran's published valuations and writings, and validates it through back-testing. Result: DBOT can value any publicly traded company, but its capability does not yet match Damodaran's. Conclusion: DBOT-like AI agents will have far-reaching implications for the financial industry, especially for the role of human analysts in valuation. Abstract: Long-term investing was previously seen as requiring human judgment. With the advent of generative artificial intelligence (AI) systems, automated systematic long-term investing is now feasible. In this paper, we present DBOT, a system whose goal is to reason about valuation like Aswath Damodaran, who is a unique expert in the investment arena in terms of having published thousands of valuations on companies in addition to his numerous writings on the topic, which provide ready training data for an AI system. DBOT can value any publicly traded company. DBOT can also be back-tested, making its behavior and performance amenable to scientific inquiry. We compare DBOT to its analytic parent, Damodaran, and highlight the research challenges involved in raising its current capability to that of Damodaran's. Finally, we examine the implications of DBOT-like AI agents for the financial industry, especially how they will impact the role of human analysts in valuation.

SelfMAD: Enhancing Generalization and Robustness in Morphing Attack Detection via Self-Supervised Learning

Marija Ivanovska,Leon Todorov,Naser Damer,Deepak Kumar Jain,Peter Peer,Vitomir Štruc

Task: Proposes SelfMAD, a self-supervised method for detecting novel face morphing attacks (morphing attack detection, MAD).

Motivation: Existing supervised methods perform poorly on unseen morphing techniques, while unsupervised methods generalize better but have higher error rates.

Details

Method: Simulates general morphing attack artifacts in a self-supervised manner so that classifiers learn robust decision boundaries. Result: On widely used datasets, SelfMAD significantly outperforms existing methods, reducing detection error (EER) by more than 64% relative to the strongest unsupervised competitor and by more than 66% relative to the best supervised model. Conclusion: SelfMAD is an effective and robust face morphing attack detection method that applies to a variety of unseen morphing techniques. Abstract: With the continuous advancement of generative models, face morphing attacks have become a significant challenge for existing face verification systems due to their potential use in identity fraud and other malicious activities. Contemporary Morphing Attack Detection (MAD) approaches frequently rely on supervised, discriminative models trained on examples of bona fide and morphed images. These models typically perform well with morphs generated with techniques seen during training, but often lead to sub-optimal performance when subjected to novel unseen morphing techniques. While unsupervised models have been shown to perform better in terms of generalizability, they typically result in higher error rates, as they struggle to effectively capture features of subtle artifacts. To address these shortcomings, we present SelfMAD, a novel self-supervised approach that simulates general morphing attack artifacts, allowing classifiers to learn generic and robust decision boundaries without overfitting to the specific artifacts induced by particular face morphing methods. Through extensive experiments on widely used datasets, we demonstrate that SelfMAD significantly outperforms current state-of-the-art MADs, reducing the detection error by more than 64% in terms of EER when compared to the strongest unsupervised competitor, and by more than 66%, when compared to the best performing discriminative MAD model, tested in cross-morph settings. The source code for SelfMAD is available at https://github.com/LeonTodorov/SelfMAD.
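
Since the headline results are reported as equal error rate (EER), a minimal sketch of how EER can be estimated from detector scores may help. It assumes higher scores mean "more likely a morph", sweeps thresholds over the observed scores, and returns the average of the two error rates at the threshold where they are closest. The function name and toy scores are hypothetical, and evaluation toolkits may interpolate differently.

```python
def equal_error_rate(bona_fide_scores, attack_scores):
    """Estimate the EER from two lists of detector scores.

    APCER: fraction of attacks scored below the threshold (missed attacks).
    BPCER: fraction of bona fide samples scored at or above it (false alarms).
    """
    thresholds = sorted(set(bona_fide_scores) | set(attack_scores))
    best_gap, eer = None, None
    for threshold in thresholds:
        apcer = sum(s < threshold for s in attack_scores) / len(attack_scores)
        bpcer = sum(s >= threshold for s in bona_fide_scores) / len(bona_fide_scores)
        gap = abs(apcer - bpcer)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (apcer + bpcer) / 2
    return eer


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    bona_fide = [0.1, 0.4, 0.6, 0.2]
    attacks = [0.9, 0.5, 0.3, 0.8]
    print(equal_error_rate(bona_fide, attacks))
```

The percentage reductions quoted in the abstract are relative changes in this quantity between detectors, so a drop from a given EER to roughly a third of it would correspond to a reduction of about two thirds.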

Leveraging Prompt-Tuning for Bengali Grammatical Error Explanation Using Large Language Models

Subhankar Maity,Aniket Deroy

Task: Proposes a novel three-step prompt-tuning method for Bengali Grammatical Error Explanation (BGEE).

Motivation: Use state-of-the-art large language models (such as GPT-4, GPT-3.5 Turbo, and Llama-2-70b) to improve the identification, correction, and explanation of Bengali grammatical errors.

Details

Method: A three-step prompt-tuning approach: identify and categorize grammatical errors in Bengali sentences, generate corrected sentences, and provide a natural language explanation for each error. Result: GPT-4 performs best on automated evaluation metrics, with a 5.26% improvement in F1 score and a 6.95% improvement in exact match, and reduces wrong error types and wrong error explanations by 25.51% and 26.27% respectively, but still falls short of the human baseline. Conclusion: The proposed prompt-tuning method improves performance on the BGEE task, though room for improvement remains. Abstract: We propose a novel three-step prompt-tuning method for Bengali Grammatical Error Explanation (BGEE) using state-of-the-art large language models (LLMs) such as GPT-4, GPT-3.5 Turbo, and Llama-2-70b. Our approach involves identifying and categorizing grammatical errors in Bengali sentences, generating corrected versions of the sentences, and providing natural language explanations for each identified error. We evaluate the performance of our BGEE system using both automated evaluation metrics and human evaluation conducted by experienced Bengali language experts. Our proposed prompt-tuning approach shows that GPT-4, the best performing LLM, surpasses the baseline model in automated evaluation metrics, with a 5.26% improvement in F1 score and a 6.95% improvement in exact match. Furthermore, compared to the previous baseline, GPT-4 demonstrates a decrease of 25.51% in wrong error type and a decrease of 26.27% in wrong error explanation. However, the results still lag behind the human baseline.

PartStickers: Generating Parts of Objects for Rapid Prototyping

Mo Zhou,Josh Myers-Dean,Danna Gurari

Task: Proposes a new task and method called "part sticker generation" for generating isolated parts of objects on a neutral background.

Motivation: Existing text-to-image methods tend to generate only entire objects, whereas design prototyping often needs specific parts, such as when constructing a novel creature for a video game.

Details

Method: Proposes a part sticker generation method that generates an isolated part of an object on a neutral background. Result: Experiments show the method outperforms existing baselines in realism and text alignment while preserving object-level generation capabilities. Conclusion: Code and models are shared publicly to encourage community progress on this new task. Abstract: Design prototyping involves creating mockups of products or concepts to gather feedback and iterate on ideas. While prototyping often requires specific parts of objects, such as when constructing a novel creature for a video game, existing text-to-image methods tend to only generate entire objects. To address this, we propose a novel task and method of "part sticker generation", which entails generating an isolated part of an object on a neutral background. Experiments demonstrate our method outperforms state-of-the-art baselines with respect to realism and text alignment, while preserving object-level generation capabilities. We publicly share our code and models to encourage community-wide progress on this new task: https://partsticker.github.io.

Towards Smarter Hiring: Are Zero-Shot and Few-Shot Pre-trained LLMs Ready for HR Spoken Interview Transcript Analysis?

Subhankar Maity,Aniket Deroy,Sudeshna Sarkar

Task: Analyzes how pre-trained large language models (LLMs) perform at scoring, identifying errors, and providing feedback in mock HR interviews.

Motivation: Assess the capabilities of LLMs in HR interview evaluation to determine whether they are suitable for automatic deployment or require human intervention.

Details

Method: Uses the HURIT dataset (3,890 real-world HR interview transcripts) to compare LLMs (such as GPT-4 Turbo and GPT-3.5 Turbo) with expert human evaluators. Result: LLMs (especially GPT-4 Turbo and GPT-3.5 Turbo) score candidates comparably to human experts but fall short in identifying errors and offering specific improvement suggestions. Conclusion: Current LLMs are not suitable for fully automatic deployment in HR interview assessment; a human-in-the-loop approach is recommended to improve feedback quality. Abstract: This research paper presents a comprehensive analysis of the performance of prominent pre-trained large language models (LLMs), including GPT-4 Turbo, GPT-3.5 Turbo, text-davinci-003, text-babbage-001, text-curie-001, text-ada-001, llama-2-7b-chat, llama-2-13b-chat, and llama-2-70b-chat, in comparison to expert human evaluators in providing scores, identifying errors, and offering feedback and improvement suggestions to candidates during mock HR (Human Resources) interviews. We introduce a dataset called HURIT (Human Resource Interview Transcripts), which comprises 3,890 HR interview transcripts sourced from real-world HR interview scenarios. Our findings reveal that pre-trained LLMs, particularly GPT-4 Turbo and GPT-3.5 Turbo, exhibit commendable performance and are capable of producing evaluations comparable to those of expert human evaluators. Although these LLMs demonstrate proficiency in providing scores comparable to human experts in terms of human evaluation metrics, they frequently fail to identify errors and offer specific actionable advice for candidate performance improvement in HR interviews. Our research suggests that the current state-of-the-art pre-trained LLMs are not fully conducive to automatic deployment in HR interview assessment. Instead, our findings advocate for a human-in-the-loop approach, to incorporate manual checks for inconsistencies and provisions for improving feedback quality as a more suitable strategy.

Towards Efficient Real-Time Video Motion Transfer via Generative Time Series Modeling

Tasmiah Haque,Md. Asif Bin Syed,Byungheon Jeong,Xue Bai,Sumit Mohan,Somdyuti Paul,Imtiaz Ahmed,Srinjoy Das

Task: Proposes a deep learning framework that significantly optimizes bandwidth for motion-transfer-enabled video applications.

Motivation: To capture complex motion effectively and reduce the bandwidth requirements of video applications.

Details

Method: Combines the First Order Motion Model (FOMM) with a Variational Recurrent Neural Network (VRNN) and a Gated Recurrent Unit with Normalizing Flow (GRU-NF) for keypoint forecasting and video synthesis. Result: The approach is validated on video animation and reconstruction tasks; VRNN-FOMM excels at multi-step-ahead forecasting, while GRU-NF-FOMM is better at generating diverse samples with high visual quality. Conclusion: The framework performs well across video applications, with VRNN and GRU-NF suited to different scenarios. Abstract: We propose a deep learning framework designed to significantly optimize bandwidth for motion-transfer-enabled video applications, including video conferencing, virtual reality interactions, health monitoring systems, and vision-based real-time anomaly detection. To capture complex motion effectively, we utilize the First Order Motion Model (FOMM), which encodes dynamic objects by detecting keypoints and their associated local affine transformations. These keypoints are identified using a self-supervised keypoint detector and arranged into a time series corresponding to the successive frames. Forecasting is performed on these keypoints by integrating two advanced generative time series models into the motion transfer pipeline, namely the Variational Recurrent Neural Network (VRNN) and the Gated Recurrent Unit with Normalizing Flow (GRU-NF). The predicted keypoints are subsequently synthesized into realistic video frames using an optical flow estimator paired with a generator network, thereby facilitating accurate video forecasting and enabling efficient, low-frame-rate video transmission. We validate our results across three datasets for video animation and reconstruction using the following metrics: Mean Absolute Error, Joint Embedding Predictive Architecture Embedding Distance, Structural Similarity Index, and Average Pair-wise Displacement. Our results confirm that by utilizing the superior reconstruction property of the Variational Autoencoder, the VRNN integrated FOMM excels in applications involving multi-step ahead forecasts such as video conferencing. On the other hand, by leveraging the Normalizing Flow architecture for exact likelihood estimation, and enabling efficient latent space sampling, the GRU-NF based FOMM exhibits superior capabilities for producing diverse future samples while maintaining high visual quality for tasks like real-time video-based anomaly detection.

Separator Injection Attack: Uncovering Dialogue Biases in Large Language Models Caused by Role Separators

Xitao Li,Haijun Wang,Jiang Wu,Ting Liu

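Task: Identify the systemic weaknesses that role separators introduce into dialogue systems and propose a new role-separator-based attack, the Separators Injection Attack (SIA).

Motivation: Role separators are widely used in conversational models, but their security vulnerabilities (such as prompt injection attacks) have been overlooked, allowing model behavior to diverge from user intent.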

Details

Method: Building on the observed positional bias of role separators, develops a new attack called the Separators Injection Attack (SIA). Result: Experiments show that SIA yields an average gain of 18.2% for manual methods and raises the attack success rate to 100% with automatic methods. Conclusion: Role separators introduce systemic weaknesses into dialogue systems that deserve attention and defensive measures. Abstract: Conversational large language models (LLMs) have gained widespread attention due to their instruction-following capabilities. To ensure conversational LLMs follow instructions, role separators are employed to distinguish between different participants in a conversation. However, incorporating role separators introduces potential vulnerabilities. Misusing roles can lead to prompt injection attacks, which can easily misalign the model's behavior with the user's intentions, raising significant security concerns. Although various prompt injection attacks have been proposed, recent research has largely overlooked the impact of role separators on safety. This highlights the critical need to thoroughly understand the systemic weaknesses in dialogue systems caused by role separators. This paper identifies modeling weaknesses caused by role separators. Specifically, we observe a strong positional bias associated with role separators, which is inherent in the format of dialogue modeling and can be triggered by the insertion of role separators. We further develop the Separators Injection Attack (SIA), a new orthometric attack based on role separators. The experimental results show that SIA is efficient and extensive in manipulating model behavior with an average gain of 18.2% for manual methods and enhances the attack success rate to 100% with automatic methods.

Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting

Yunlong Tang,Jing Bi,Chao Huang,Susan Liang,Daiki Shimada,Hang Hua,Yunzhong Xiao,Yizhi Song,Pinxin Liu,Mingqian Feng,Junjia Guo,Zhuo Liu,Luchuan Song,Ali Vosoughi,Jinxi He,Liu He,Zeliang Zhang,Jiebo Luo,Chenliang Xu

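Task: Propose CAT-V, a training-free framework for fine-grained, object-centric video captioning that lets users select objects with visual prompts and generates detailed descriptions.

Motivation: Address the tendency of existing video captioning methods to produce descriptions that are overly abstract or lack object-level precision.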

Details

Method: Combines a Segmenter (based on SAMURAI), a Temporal Analyzer (based on TRACE-Uni), and a Captioner (based on InternVL-2.5), generating detailed descriptions through spatiotemporal visual prompts and chain-of-thought reasoning. Result: CAT-V generates detailed, temporally aware descriptions of object attributes, actions, statuses, interactions, and environmental contexts without requiring additional training data. Conclusion: Through flexible user interaction and temporal sensitivity, CAT-V achieves fine-grained, object-specific video captioning while maintaining temporal coherence and spatial accuracy. Abstract: We present CAT-V (Caption AnyThing in Video), a training-free framework for fine-grained object-centric video captioning that enables detailed descriptions of user-selected objects through time. CAT-V integrates three key components: a Segmenter based on SAMURAI for precise object segmentation across frames, a Temporal Analyzer powered by TRACE-Uni for accurate event boundary detection and temporal analysis, and a Captioner using InternVL-2.5 for generating detailed object-centric descriptions. Through spatiotemporal visual prompts and chain-of-thought reasoning, our framework generates detailed, temporally-aware descriptions of objects' attributes, actions, statuses, interactions, and environmental contexts without requiring additional training data. CAT-V supports flexible user interactions through various visual prompts (points, bounding boxes, and irregular regions) and maintains temporal sensitivity by tracking object states and interactions across different time segments. Our approach addresses limitations of existing video captioning methods, which either produce overly abstract descriptions or lack object-level precision, enabling fine-grained, object-specific descriptions while maintaining temporal coherence and spatial accuracy. The GitHub repository for this project is available at https://github.com/yunlong10/CAT-V

STRIVE: A Think & Improve Approach with Iterative Refinement for Enhancing Question Quality Estimation

Aniket Deroy,Subhankar Maity

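Task: Propose STRIVE, a new method that uses multiple large language models (LLMs) to automatically assess question quality.

Motivation: Automatic question quality assessment saves educators time, ensures consistency, and provides immediate feedback for refining teaching materials.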

Details

Method: Generates multiple evaluations with multiple LLMs, selects the best solution based on the question's strengths and weaknesses, and refines the evaluation metrics through iterative review and response. Result: Compared with the baseline, STRIVE improves correlation with human judgments and shows significant gains on metrics such as relevance and appropriateness. Conclusion: By automating question quality evaluation, STRIVE improves the accuracy and depth of assessment, supporting diverse learners and enhancing educational practice. Abstract: Automatically assessing question quality is crucial for educators as it saves time, ensures consistency, and provides immediate feedback for refining teaching materials. We propose a novel methodology called STRIVE (Structured Thinking and Refinement with multiLLMs for Improving Verified Question Estimation) using a series of Large Language Models (LLMs) for automatic question evaluation. This approach aims to improve the accuracy and depth of question quality assessment, ultimately supporting diverse learners and enhancing educational practices. The method estimates question quality in an automated manner by generating multiple evaluations based on the strengths and weaknesses of the provided question and then choosing the best solution generated by the LLM. The process is then improved by iterative review and response with another LLM until the evaluation metric values converge. By automating the evaluation task in this way, the method improves the estimation of question quality. Correlation scores show that the proposed method improves correlation with human judgments compared to the baseline method. Error analysis shows that metrics like relevance and appropriateness improve significantly relative to human judgments by using STRIVE.

A Lightweight Large Vision-language Model for Multimodal Medical Images

Belal Alsinglawi,Chris McCarthy,Sara Webb,Christopher Fluke,Navid Toosy Saidy

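Task: Develop a lightweight, multimodal medical visual question answering (VQA) model for handling medical images and clinical queries.

Motivation: Medical VQA can support clinical decision-making, but the complexity and diversity of medical imagery make efficient, high-performance models hard to build.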

Details

Method: Designs a lightweight multimodal VQA model that combines BiomedCLIP for image feature extraction with LLaMA-3 for text processing. Result: Reaches 73.4% accuracy on the OmniMedVQA dataset while requiring only two NVIDIA 40 GB A100 GPUs, outperforming existing models. Conclusion: The model performs well on medical VQA and shows potential for practical use; its contributions include an efficient architecture and strong open-ended question answering. Abstract: Medical Visual Question Answering (VQA) enhances clinical decision-making by enabling systems to interpret medical images and answer clinical queries. However, developing efficient, high-performance VQA models is challenging due to the complexity of medical imagery and diverse modalities. In this paper, we introduce a lightweight, multimodal VQA model integrating BiomedCLIP for image feature extraction and LLaMA-3 for text processing. Designed for medical VQA tasks, our model achieves state-of-the-art performance on the OmniMedVQA dataset. With approximately 8 billion parameters, it requires only two NVIDIA 40 GB A100 GPUs, demonstrating superior efficiency over larger models. Our results show 73.4% accuracy for open-ended questions, surpassing existing models and validating its potential for real-world medical applications. Key contributions include a specialized multimodal VQA model, a resource-efficient architecture, and strong performance in answering open-ended clinical questions.

Evaluating Speech-to-Text Systems with PennSound

Jonathan Wright,Mark Liberman,Neville Ryant,James Fiumara

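Task: Evaluate several commercial and open-source speech-to-text systems on audio of poetry readings and discussions.

Motivation: PennSound's diversity and representativeness make it a suitable benchmark for evaluating speech-to-text systems.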

Details

Method: Uses an approximately 10-hour audio sample from PennSound, with reference transcripts created by trained annotators, and compares the transcripts produced by multiple systems. Result: Rev.ai performed best, Whisper was the best open-source system, and AWS had the best diarization. Conclusion: The systems trade off performance and features differently; users should choose according to their needs and be aware of Whisper's hallucination problem. Abstract: A random sample of nearly 10 hours of speech from PennSound, the world's largest online collection of poetry readings and discussions, was used as a benchmark to evaluate several commercial and open-source speech-to-text systems. PennSound's wide variation in recording conditions and speech styles makes it a good representative for many other untranscribed audio collections. Reference transcripts were created by trained annotators, and system transcripts were produced from AWS, Azure, Google, IBM, NeMo, Rev.ai, Whisper, and Whisper.cpp. Based on word error rate (WER), Rev.ai was the top performer, and Whisper was the top open-source performer (as long as hallucinations were avoided). AWS had the best diarization error rate (DER) among three systems. However, WER and DER differences were slim, and various tradeoffs may motivate choosing different systems for different end users. We also examine the issue of hallucinations in Whisper. Users of Whisper should be aware of its runtime options and consider whether the speed vs. accuracy trade-off is acceptable.
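
The ranking above rests on word error rate. As a rough illustration of the metric, and not the authors' scoring pipeline, the sketch below computes WER as word-level edit distance divided by reference length. The function name `word_error_rate` and the plain whitespace tokenization are assumptions made for this example; real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Inserted words count as errors, so WER can exceed 1.0.
    print(word_error_rate("hello", "hello thank you for watching"))
```

Because insertions are counted against the reference length, a system that emits text with no basis in the audio can push WER above 1.0, which is one reason hallucinated output is so costly under this metric.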

TAPNext: Tracking Any Point (TAP) as Next Token Prediction

Artem Zholus,Carl Doersch,Yi Yang,Skanda Koppula,Viorica Patraucean,Xu Owen He,Ignacio Rocco,Mehdi S. M. Sajjadi,Sarath Chandar,Ross Goroshin

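Task: Propose TAPNext, a new approach to the Tracking Any Point (TAP) task in video.

Motivation: Existing TAP methods depend on complex tracking-specific inductive biases and heuristics, which limits their generality and potential for scaling.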

Details

Method: Casts TAP as sequential masked token decoding, using a causal model that tracks purely online and removes tracking-specific inductive biases. Result: TAPNext sets a new state of the art among both online and offline trackers, without temporal windowing. Conclusion: Widely used tracking heuristics emerge naturally in TAPNext through end-to-end training, demonstrating the approach's simplicity and effectiveness. Abstract: Tracking Any Point (TAP) in a video is a challenging computer vision problem with many demonstrated applications in robotics, video editing, and 3D reconstruction. Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. Our model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases. This enables TAPNext to run with minimal latency, and removes the temporal windowing required by many existing state-of-the-art trackers. Despite its simplicity, TAPNext achieves a new state-of-the-art tracking performance among both online and offline trackers. Finally, we present evidence that many widely used tracking heuristics emerge naturally in TAPNext through end-to-end training.

LLM$\times$MapReduce-V2: Entropy-Driven Convolutional Test-Time Scaling for Generating Long-Form Articles from Extremely Long Resources

Haoyu Wang,Yujia Fu,Zhu Zhang,Shuo Wang,Zirui Ren,Xiaorong Wang,Zhili Li,Chaoqun He,Bo An,Zhiyuan Liu,Maosong Sun

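Task: Propose LLM×MapReduce-V2, a test-time scaling strategy that enhances the ability of large language models (LLMs) to process extremely long inputs.

Motivation: Long-form generation is crucial in many practical applications, but current LLMs struggle to integrate and analyze relevant information from extremely long inputs.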

Details

Method: Drawing on the idea behind convolutional neural networks, stacks convolutional scaling layers to progressively expand the understanding of input materials. Result: Experiments show that the approach substantially improves LLMs' ability to process long inputs and generate coherent, informative long-form text, outperforming several representative baselines. Conclusion: LLM×MapReduce-V2 is an effective strategy that significantly improves LLM performance on long-form generation. Abstract: Long-form generation is crucial for a wide range of practical applications, typically categorized into short-to-long and long-to-long generation. While short-to-long generations have received considerable attention, generating long texts from extremely long resources remains relatively underexplored. The primary challenge in long-to-long generation lies in effectively integrating and analyzing relevant information from extensive inputs, which remains difficult for current large language models (LLMs). In this paper, we propose LLM$\times$MapReduce-V2, a novel test-time scaling strategy designed to enhance the ability of LLMs to process extremely long inputs. Drawing inspiration from convolutional neural networks, which iteratively integrate local features into higher-level global representations, LLM$\times$MapReduce-V2 utilizes stacked convolutional scaling layers to progressively expand the understanding of input materials. Both quantitative and qualitative experimental results demonstrate that our approach substantially enhances the ability of LLMs to process long inputs and generate coherent, informative long-form articles, outperforming several representative baselines.

Gaze-Guided Learning: Avoiding Shortcut Bias in Visual Classification

Jiahang Li,Shibo Xue,Yong Su

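Task: Propose a framework that combines human gaze priors with visual sequences to improve feature localization and accuracy in visual classification.

Motivation: Existing attention mechanisms neglect precise feature localization in visual classification, causing misclassification and degraded performance on transfer datasets, whereas humans use prior knowledge to quickly localize and compare fine-grained attributes.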

Details

Method: Introduces the Gaze-CIFAR-10 dataset and a dual-sequence gaze encoder, and uses a Vision Transformer (ViT) with cross-modal fusion to integrate human gaze priors with machine-derived visual sequences. Result: Experiments show that gaze-guided cognitive cues significantly improve classification accuracy. Conclusion: A framework incorporating human gaze priors effectively improves feature localization and classification performance in visual classification. Abstract: Inspired by human visual attention, deep neural networks have widely adopted attention mechanisms to learn locally discriminative attributes for challenging visual classification tasks. However, existing approaches primarily emphasize the representation of such features while neglecting their precise localization, which often leads to misclassification caused by shortcut biases. This limitation becomes even more pronounced when models are evaluated on transfer or out-of-distribution datasets. In contrast, humans are capable of leveraging prior object knowledge to quickly localize and compare fine-grained attributes, a capability that is especially crucial in complex and high-variance classification scenarios. Motivated by this, we introduce Gaze-CIFAR-10, a human gaze time-series dataset, along with a dual-sequence gaze encoder that models the precise sequential localization of human attention on distinct local attributes. In parallel, a Vision Transformer (ViT) is employed to learn the sequential representation of image content. Through cross-modal fusion, our framework integrates human gaze priors with machine-derived visual sequences, effectively correcting inaccurate localization in image feature representations. Extensive qualitative and quantitative experiments demonstrate that gaze-guided cognitive cues significantly enhance classification accuracy.

Rank-Then-Score: Enhancing Large Language Models for Automated Essay Scoring

Yida Cai,Kun Liang,Sanwoo Lee,Qinghan Wang,Yunfang Wu

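Task: Propose Rank-Then-Score (RTS), a framework based on large language models for improving automated essay scoring (AES).

Motivation: The potential of large language models in automated essay scoring remains underexplored, and methods for Chinese essay scoring in particular are underdeveloped.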

Details

Method: Fine-tunes a ranking model (Ranker) on feature-enriched data, then feeds its output together with the essay content into a scoring model (Scorer) to produce the final score. Result: On the HSK and ASAP datasets, RTS outperforms direct prompting (Vanilla) in average QWK across all LLMs and datasets, and achieves the best performance on the HSK dataset. Conclusion: RTS significantly improves Chinese essay scoring and confirms the potential of large language models in this area. Abstract: In recent years, large language models (LLMs) have achieved remarkable success across a variety of tasks. However, their potential in the domain of Automated Essay Scoring (AES) remains largely underexplored. Moreover, compared to English data, the methods for Chinese AES are not well developed. In this paper, we propose Rank-Then-Score (RTS), a fine-tuning framework based on large language models to enhance their essay scoring capabilities. Specifically, we fine-tune the ranking model (Ranker) with feature-enriched data, and then feed the output of the ranking model, in the form of a candidate score set, with the essay content into the scoring model (Scorer) to produce the final score. Experimental results on two benchmark datasets, HSK and ASAP, demonstrate that RTS consistently outperforms the direct prompting (Vanilla) method in terms of average QWK across all LLMs and datasets, and achieves the best performance on Chinese essay scoring using the HSK dataset.
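
The results above are reported in QWK, which here presumably stands for quadratic weighted kappa, the usual agreement metric in essay scoring. The sketch below shows the standard definition for integer ratings; it is an illustration rather than the authors' evaluation code, and the function name `quadratic_weighted_kappa` and its argument names are assumptions for this example.

```python
def quadratic_weighted_kappa(human, model, min_rating, max_rating):
    """Agreement between two integer rating lists, penalizing large gaps more.

    Hypothetical helper for illustration. Returns 1.0 for perfect agreement,
    about 0.0 for chance-level agreement, and negative values for worse.
    """
    if len(human) != len(model) or not human:
        raise ValueError("rating lists must be non-empty and equally long")
    n = max_rating - min_rating + 1
    if n < 2:
        raise ValueError("need at least two distinct rating levels")

    # observed[i][j] counts essays rated i by the human and j by the model.
    observed = [[0] * n for _ in range(n)]
    for h, m in zip(human, model):
        observed[h - min_rating][m - min_rating] += 1

    total = len(human)
    hist_human = [sum(row) for row in observed]
    hist_model = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement weight
            expected = hist_human[i] * hist_model[j] / total
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Both raters gave one identical rating to every essay.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    print(quadratic_weighted_kappa([1, 2, 3], [1, 2, 3], 1, 3))
    print(quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3))
```

The quadratic weight means a prediction two score points off costs four times as much as one that is a single point off, which suits ordinal essay scores better than plain accuracy.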

CoA: Towards Real Image Dehazing via Compression-and-Adaptation

Long Ma,Yuxin Feng,Yan Zhang,Jinyuan Liu,Weimin Wang,Guang-Yong Chen,Chengpei Xu,Zhuo Su

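Task: Propose a computational flow called Compression-and-Adaptation (CoA) that addresses real image dehazing efficiently and adaptively.

Motivation: Learning-based dehazing algorithms have been very successful in synthetic domains, but real image dehazing remains challenging because of computational resource constraints and the diversity of real-world scenes.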

Details

Method: Builds a compact dehazing parameter space in the synthetic domain via model compression, then introduces a bilevel adaptation in the real domain to aggregate the synthetic dehazing capabilities. Result: CoA exhibits domain-irrelevant stability and model-agnostic flexibility, effectively bridging the model gap between synthetic and real domains. Conclusion: The method performs well in both efficiency and adaptability, offering a practical solution for real image dehazing. Abstract: Learning-based image dehazing algorithms have shown remarkable success in synthetic domains. However, real image dehazing remains an open problem due to computational resource constraints and the diversity of real-world scenes. Therefore, there is an urgent need for an algorithm that excels in both efficiency and adaptability to address real image dehazing effectively. This work proposes a Compression-and-Adaptation (CoA) computational flow to tackle these challenges from a divide-and-conquer perspective. First, model compression is performed in the synthetic domain to develop a compact dehazing parameter space, satisfying efficiency demands. Then, a bilevel adaptation in the real domain is introduced to cope with unknown real environments by aggregating the synthetic dehazing capabilities during the learning process. Leveraging a succinct design free from additional constraints, our CoA exhibits domain-irrelevant stability and model-agnostic flexibility, effectively bridging the model chasm between synthetic and real domains to further improve its practical utility. Extensive evaluations and analyses underscore the approach's superiority and effectiveness. The code is publicly available at https://github.com/fyxnl/COA.

SEA-LION: Southeast Asian Languages in One Network

Raymond Ng,Thanh Ngan Nguyen,Yuli Huang,Ngee Chia Tai,Wai Yi Leong,Wei Qi Leong,Xianbin Yong,Jian Gang Ngui,Yosephine Susanto,Nicholas Cheng,Hamsawardhini Rengarajan,Peerat Limkonchotiwat,Adithya Venkatadri Hulagadri,Kok Wai Teng,Yeo Yeow Tong,Bryan Siow,Wei Yi Teo,Wayne Lau,Choon Meng Tan,Brandon Ong,Zhi Hao Ong,Jann Railey Montalan,Adwin Chan,Sajeban Antonyrex,Ren Lee,Esther Choa,David Ong Tat-Wee,Bing Jie Darius Liu,William Chandra Tjhi,Erik Cambria,Leslie Teo

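Task: Develop state-of-the-art multilingual large language models (LLMs) that support Southeast Asian languages.

Motivation: Address the under-representation of Southeast Asian languages in current LLM research.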

Details

Method: Uses large-scale multilingual continued pre-training combined with multi-stage instruction fine-tuning, alignment, and model merging. Result: The models perform strongly on multilingual benchmarks, reaching state-of-the-art results among LLMs supporting Southeast Asian languages. Conclusion: The models are open-sourced to benefit the wider Southeast Asian community. Abstract: Recently, Large Language Models (LLMs) have dominated much of the artificial intelligence scene with their ability to process and generate natural languages. However, the majority of LLM research and development remains English-centric, leaving low-resource languages such as those in the Southeast Asian (SEA) region under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance across LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.

Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model

Qi Mao,Lan Chen,Yuchao Gu,Mike Zheng Shou,Ming-Hsuan Yang

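Task: Propose UnifyEdit, a tuning-free method that balances fidelity and editability in text-based image editing.

Motivation: Existing methods lack an explicit, unified mechanism for balancing structural fidelity and text alignment, which leads to over- or under-editing.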

Details

Method: Performs diffusion latent optimization with a self-attention (SA) preservation constraint and a cross-attention (CA) alignment constraint, and introduces an adaptive time-step scheduler that dynamically adjusts the influence of the constraints. Result: Experiments confirm that UnifyEdit outperforms existing methods in structure preservation and text alignment. Conclusion: UnifyEdit effectively balances fidelity and editability within a unified framework, offering a new solution for text-based image editing. Abstract: Balancing fidelity and editability is essential in text-based image editing (TIE), where failures commonly lead to over- or under-editing issues. Existing methods typically rely on attention injections for structure preservation and leverage the inherent text alignment capabilities of pre-trained text-to-image (T2I) models for editability, but they lack explicit and unified mechanisms to properly balance these two objectives. In this work, we introduce UnifyEdit, a tuning-free method that performs diffusion latent optimization to enable a balanced integration of fidelity and editability within a unified framework. Unlike direct attention injections, we develop two attention-based constraints: a self-attention (SA) preservation constraint for structural fidelity, and a cross-attention (CA) alignment constraint to enhance text alignment for improved editability. However, simultaneously applying both constraints can lead to gradient conflicts, where the dominance of one constraint results in over- or under-editing. To address this challenge, we introduce an adaptive time-step scheduler that dynamically adjusts the influence of these constraints, guiding the diffusion latent toward an optimal balance. Extensive quantitative and qualitative experiments validate the effectiveness of our approach, demonstrating its superiority in achieving a robust balance between structure preservation and text alignment across various editing tasks, outperforming other state-of-the-art methods. The source code will be available at https://github.com/CUC-MIPG/UnifyEdit.

RETROcode: Leveraging a Code Database for Improved Natural Language to Code Generation

Nathanaël Beau,Benoît Crabbé

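Task: Propose RETROcode, a sequence-to-sequence model based on the RETRO architecture, for code generation.

Motivation: Address the high computational demands and overfitting risk of large-scale pre-trained models in code generation.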

Details

Method: Uses a large code database as an auxiliary scaling method, improving model efficiency by integrating external memory. Result: RETROcode outperforms traditional architectures on test sets and approaches the effectiveness of the much larger Codex model. Conclusion: By integrating external memory, RETROcode achieves efficient, high-performing code generation. Abstract: As text and code resources have expanded, large-scale pre-trained models have shown promising capabilities in code generation tasks, typically employing supervised fine-tuning with problem statement-program pairs. However, increasing model size and data volume for performance gains also raises computational demands and risks of overfitting. Addressing these challenges, we present RETROcode, a novel adaptation of the RETRO architecture \cite{RETRO} for sequence-to-sequence models, utilizing a large code database as an auxiliary scaling method. This approach, diverging from simply enlarging model and dataset sizes, allows RETROcode to leverage a vast code database for prediction, enhancing the model's efficiency by integrating extensive memory. Our findings indicate that RETROcode not only outperforms similar-sized traditional architectures on test sets but also approaches the effectiveness of the much larger Codex model, despite being trained from scratch on a substantially smaller dataset.

Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought

Yi Peng,Chris,Xiaokun Wang,Yichen Wei,Jiangbo Pei,Weijie Qiu,Ai Jian,Yunzhuo Hao,Jiachun Pan,Tianyidan Xie,Li Ge,Rongxian Zhuang,Xuchen Song,Yang Liu,Yahui Zhou

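Task: Extend the R1-series large language models (LLMs) to the visual modality to build Skywork R1V, a multimodal reasoning model.

Motivation: Use a lightweight visual projector and an efficient multimodal transfer method to achieve multimodal adaptation without retraining the base language model or the vision encoder.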

Details

Method: Uses a lightweight visual projector, a hybrid optimization strategy (Iterative Supervised Fine-Tuning combined with Group Relative Policy Optimization), and adaptive-length chain-of-thought distillation. Result: At 38B parameters, Skywork R1V scores 69.0 on MMMU and 67.5 on MathVista while retaining text reasoning scores of 72.0 on AIME and 94.0 on MATH500. Conclusion: Skywork R1V performs well on both multimodal and textual reasoning tasks; the model weights have been publicly released to promote openness and reproducibility. Abstract: We introduce Skywork R1V, a multimodal reasoning model extending the R1-series large language models (LLMs) to visual modalities via an efficient multimodal transfer method. Leveraging a lightweight visual projector, Skywork R1V facilitates seamless multimodal adaptation without necessitating retraining of either the foundational language model or the vision encoder. To strengthen visual-text alignment, we propose a hybrid optimization strategy that combines Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly enhancing cross-modal integration efficiency. Additionally, we introduce an adaptive-length Chain-of-Thought distillation approach for reasoning data generation. This approach dynamically optimizes reasoning chain lengths, thereby enhancing inference efficiency and preventing overthinking. Empirical evaluations demonstrate that Skywork R1V, with only 38B parameters, delivers competitive performance, achieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista. Meanwhile, it maintains robust textual reasoning performance, evidenced by impressive scores of 72.0 on AIME and 94.0 on MATH500. The Skywork R1V model weights have been publicly released to promote openness and reproducibility.

Layer-Aware Embedding Fusion for LLMs in Text Classifications

Jiho Gwak,Yuchul Jung

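Task: Propose a layer-aware embedding selection method and study how to quantitatively evaluate different layers to identify those most important for downstream NLP tasks.

Motivation: Existing research offers little systematic guidance on selecting optimal layers and developing effective fusion strategies for integrating LLMs.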

Details

Method: Proposes a layer-aware embedding selection method and studies strategies for fusing embeddings from multiple models, without fine-tuning. Result: Experiments show that different layers vary in representational strength for classification and that fusing embeddings from multiple models can improve performance. Conclusion: Future work will explore multilingual and domain-specific datasets, as well as automated layer selection, to improve performance and scalability. Abstract: Embedding fusion has emerged as an effective approach for enhancing performance across various NLP tasks. However, systematic guidelines for selecting optimal layers and developing effective fusion strategies for the integration of LLMs remain underexplored. In this study, we propose a layer-aware embedding selection method and investigate how to quantitatively evaluate different layers to identify the most important ones for downstream NLP tasks, showing that the critical layers vary depending on the dataset. We also explore how combining embeddings from multiple LLMs, without requiring model fine-tuning, can improve performance. Experiments on four English text classification datasets (SST-2, MR, R8, and R52) demonstrate that different layers in LLMs exhibit varying degrees of representational strength for classification, and that combining embeddings from different models can enhance performance if the models exhibit complementary characteristics. Additionally, we discuss resource overhead (memory and inference time) to provide a balanced perspective on the real-world feasibility of embedding fusion. Future work will explore multilingual and domain-specific datasets, as well as techniques for automating layer selection, to improve both performance and scalability.
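
The abstract does not say which fusion operator is used, so the sketch below only illustrates one simple possibility: take the embedding from one chosen layer per model, L2-normalize each, and concatenate them into a single feature vector for a downstream classifier. The function names, the dictionary layout, and the choice of concatenation are all assumptions made for this example, not the authors' method.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def fuse_embeddings(layer_outputs_by_model, chosen_layer):
    """Concatenate one normalized layer embedding per model.

    layer_outputs_by_model: {model_name: [embedding_of_layer_0, ...]}
    chosen_layer:           {model_name: index of the layer to use}
    Both structures are hypothetical and exist only for this illustration.
    """
    fused = []
    # Sort model names so the feature order is stable across calls.
    for model_name in sorted(layer_outputs_by_model):
        layers = layer_outputs_by_model[model_name]
        embedding = layers[chosen_layer[model_name]]
        fused.extend(l2_normalize(embedding))
    return fused


if __name__ == "__main__":
    outputs = {
        "model_a": [[3.0, 4.0], [1.0, 0.0]],  # two layers, dimension 2
        "model_b": [[0.0, 2.0, 0.0]],         # one layer, dimension 3
    }
    features = fuse_embeddings(outputs, {"model_a": 0, "model_b": 0})
    print(len(features), features)
```

Normalizing before concatenation keeps a model with larger activation magnitudes from dominating the fused vector. In a layer-aware setup, `chosen_layer` would be picked per dataset, for instance by validation accuracy of a classifier trained on each layer.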

AD-Det: Boosting Object Detection in UAV Images with Focused Small Objects and Balanced Tail Classes

Zhenteng Li,Sheng Lian,Dengfeng Pan,Youlin Wang,Wei Liu

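Task: Propose AD-Det, a new framework that addresses scale variation and class imbalance in object detection for UAV images.

Motivation: Existing methods usually handle scale variation and class imbalance separately, overlooking the complexity of UAV images and the potential synergy between the two problems.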

Details

Method: AD-Det adopts a coarse-to-fine strategy built on two key components, Adaptive Small Object Enhancement (ASOE) and Dynamic Class-balanced Copy-paste (DCC). ASOE uses a high-resolution feature map to identify and cluster small-object regions; DCC resamples by dynamically pasting tail-class objects. Result: On the VisDrone and UAVDT datasets, AD-Det significantly outperforms existing methods, reaching 37.5% Average Precision (AP) on VisDrone, at least 3.1% above its competitors. Conclusion: AD-Det effectively addresses scale variation and class imbalance in UAV images through a synergistic, adaptive framework, significantly improving detection performance. Abstract: Object detection in Unmanned Aerial Vehicle (UAV) images poses significant challenges due to complex scale variations and class imbalance among objects. Existing methods often address these challenges separately, overlooking the intricate nature of UAV images and the potential synergy between them. In response, this paper proposes AD-Det, a novel framework employing a coherent coarse-to-fine strategy that seamlessly integrates two pivotal components: Adaptive Small Object Enhancement (ASOE) and Dynamic Class-balanced Copy-paste (DCC). ASOE utilizes a high-resolution feature map to identify and cluster regions containing small objects. These regions are subsequently enlarged and processed by a fine-grained detector. On the other hand, DCC conducts object-level resampling by dynamically pasting tail classes around the cluster centers obtained by ASOE, maintaining a dynamic memory bank for each tail class. This approach enables AD-Det to not only extract regions with small objects for precise detection but also dynamically perform reasonable resampling for tail-class objects. Consequently, AD-Det enhances the overall detection performance by addressing the challenges of scale variations and class imbalance in UAV images through a synergistic and adaptive framework. We extensively evaluate our approach on two public datasets, i.e., VisDrone and UAVDT, and demonstrate that AD-Det significantly outperforms existing competitive alternatives. Notably, AD-Det achieves a 37.5% Average Precision (AP) on the VisDrone dataset, surpassing its counterparts by at least 3.1%.

Probabilistic Process Discovery with Stochastic Process Trees

András Horváth,Paolo Ballarini,Pierre Cry

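Task: Propose a new formalism that adds stochasticity directly to process trees, avoiding the ambiguities of existing stochastic Petri net approaches.

Motivation: In existing approaches, the role of the weights and the number of parameters of the stochastic Petri net are ill-defined, leaving the probability distribution of the stochastic language unclear.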

Details

Method: Proposes a new formalism, stochastic process trees, that introduces stochasticity directly into process trees. Result: The new formalism makes the number of parameters and their role in the stochastic language well-defined, avoiding the ambiguities of existing approaches. Conclusion: Stochastic process trees offer a clearer, better-defined approach to stochastic modeling. Abstract: In order to obtain a stochastic model that accounts for the stochastic aspects of the dynamics of a business process, usually the following steps are taken. Given an event log, a process tree is obtained through a process discovery algorithm, i.e., a process tree that is aimed at reproducing, as accurately as possible, the language of the log. The process tree is then transformed into a Petri net that generates the same set of sequences as the process tree. In order to capture the frequency of the sequences in the event log, weights are assigned to the transitions of the Petri net, resulting in a stochastic Petri net with a stochastic language in which each sequence is associated with a probability. In this paper we show that this procedure has unfavorable properties. First, the weights assigned to the transitions of the Petri net have an unclear role in the resulting stochastic language. We will show that a weight can have multiple, ambiguous impacts on the probability of the sequences generated by the Petri net. Second, a number of different Petri nets with different numbers of transitions can correspond to the same process tree. This means that the number of parameters (the number of weights) that determines the stochastic language is not well-defined. In order to avoid these ambiguities, in this paper, we propose to add stochasticity directly to process trees. The result is a new formalism, called stochastic process trees, in which the number of parameters and their role in the associated stochastic language is clear and well-defined.

Falcon: Fractional Alternating Cut with Overcoming Minima in Unsupervised Segmentation

Xiao Zhang,Xiangyu Han,Xiwen Lai,Yao Sun,Pei Zhang,Konrad Kording

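Task: Propose Falcon, an optimization-based K-way Normalized Cut method that improves the efficiency and accuracy of unsupervised image segmentation.

Motivation: Current graph-cut-based unsupervised segmentation methods rely on high-dimensional attention maps and suffer from slow speed and low accuracy, lagging behind supervised methods.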

Details

Method: Falcon works in two stages: (1) it solves the K-way Normalized Cut quickly via a fractional quadratic transformation with alternating iterative optimization; (2) it refines the segmentation masks using low-level information. Result: Falcon improves on prior methods by an average of 2.5% across six benchmarks (up to 4.3%) and reduces runtime by about 30%. Conclusion: Falcon effectively exploits foundation-model attention, narrows the gap between unsupervised and supervised segmentation, and offers a scalable solution for downstream tasks. Abstract: Today's unsupervised image segmentation algorithms often segment suboptimally. Modern graph-cut based approaches rely on high-dimensional attention maps from Transformer-based foundation models, typically employing a relaxed Normalized Cut solved recursively via the Fiedler vector (the eigenvector of the second smallest eigenvalue). Consequently, they still lag behind supervised methods in both mask generation speed and segmentation accuracy. We present a regularized fractional alternating cut (Falcon), an optimization-based K-way Normalized Cut without relying on recursive eigenvector computations, achieving substantially improved speed and accuracy. Falcon operates in two stages: (1) a fast K-way Normalized Cut solved by extending into a fractional quadratic transformation, with an alternating iterative procedure and regularization to avoid local minima; and (2) refinement of the resulting masks using complementary low-level information, producing high-quality pixel-level segmentations. Experiments show that Falcon not only surpasses existing state-of-the-art methods by an average of 2.5% across six widely recognized benchmarks (reaching up to 4.3% improvement on Cityscapes), but also reduces runtime by around 30% compared to prior graph-based approaches. These findings demonstrate that the semantic information within foundation-model attention can be effectively harnessed by a highly parallelizable graph cut framework. Consequently, Falcon can narrow the gap between unsupervised and supervised segmentation, enhancing scalability in real-world applications and paving the way for dense prediction-based vision pre-training in various downstream tasks. The code is released at https://github.com/KordingLab/Falcon.

Cross-Document Contextual Coreference Resolution in Knowledge Graphs

Zhang Dong,Mingbang Wang,Songhang deng,Le Dai,Jiyuan Li,Xingzu Liu,Ruilin Nong

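Task: Propose an innovative method for identifying and resolving references to the same entity across multiple documents, improving the coherence and collaboration of information.

Motivation: Coreference resolution across multiple documents is a major challenge in natural language processing, especially in the knowledge graph domain.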

Details

Method: Uses a dynamic linking mechanism that associates knowledge-graph entities with their textual mentions, combining contextual embeddings with graph-based inference strategies. Result: Rigorous evaluation on several benchmark datasets shows that the method significantly outperforms traditional approaches in coreference resolution accuracy. Conclusion: By exploiting contextual information from knowledge graphs, the method substantially improves precision and recall in cross-document coreference resolution, demonstrating its effectiveness in knowledge-driven applications. Abstract: Coreference resolution across multiple documents poses a significant challenge in natural language processing, particularly within the domain of knowledge graphs. This study introduces an innovative method aimed at identifying and resolving references to the same entities that appear across differing texts, thus enhancing the coherence and collaboration of information. Our method employs a dynamic linking mechanism that associates entities in the knowledge graph with their corresponding textual mentions. By utilizing contextual embeddings along with graph-based inference strategies, we effectively capture the relationships and interactions among entities, thereby improving the accuracy of coreference resolution. Rigorous evaluations on various benchmark datasets highlight notable advancements in our approach over traditional methodologies. The results showcase how the contextual information derived from knowledge graphs enhances the understanding of complex relationships across documents, leading to better entity linking and information extraction capabilities in applications driven by knowledge. Our technique demonstrates substantial improvements in both precision and recall, underscoring its effectiveness in the area of cross-document coreference resolution.

Time-Aware Auto White Balance in Mobile Photography

Mahmoud Afifi,Luxi Zhao,Abhijith Punnappurath,Mohammed A. Abdelsalam,Ran Zhang,Michael S. Brown

Task: Propose a lightweight illuminant estimation method that combines contextual metadata with image color information.

Motivation: Use the additional metadata available on mobile devices (such as timestamp and geolocation) to improve auto white balance (AWB) performance.

Details

Method: Designs a compact model (~5K parameters) that integrates metadata, capture information, and image colors. Result: The model performs well, matching or surpassing larger models. Conclusion: The method is validated on a newly introduced dataset of smartphone images covering diverse lighting conditions. Abstract: Cameras rely on auto white balance (AWB) to correct undesirable color casts caused by scene illumination and the camera's spectral sensitivity. This is typically achieved using an illuminant estimator that determines the global color cast solely from the color information in the camera's raw sensor image. Mobile devices provide valuable additional metadata, such as capture timestamp and geolocation, that offers strong contextual clues to help narrow down the possible illumination solutions. This paper proposes a lightweight illuminant estimation method that incorporates such contextual metadata, along with additional capture information and image colors, into a compact model (~5K parameters), achieving promising results, matching or surpassing larger models. To validate our method, we introduce a dataset of 3,224 smartphone images with contextual metadata collected at various times of day and under diverse lighting conditions. The dataset includes ground-truth illuminant colors, determined using a color chart, and user-preferred illuminants validated through a user study, providing a comprehensive benchmark for AWB evaluation.
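To make the role of the illuminant estimate concrete, the sketch below shows how an estimated global illuminant is conventionally applied: each channel is scaled by a per-channel gain so that the illuminant color maps to neutral gray. This is the standard diagonal correction, not the paper's estimator; the function name and the sample values are assumptions for illustration.

```python
def apply_white_balance(pixel_rgb, illuminant_rgb):
    """Diagonal white-balance correction for one raw RGB pixel.

    illuminant_rgb is the estimated scene illuminant (assumed non-zero in
    every channel). Gains are normalized so the green channel is unchanged.
    """
    r_l, g_l, b_l = illuminant_rgb
    gains = (g_l / r_l, 1.0, g_l / b_l)
    return tuple(channel * gain for channel, gain in zip(pixel_rgb, gains))


# A gray surface seen under a hypothetical illuminant (0.8, 1.0, 0.5)
# is recorded with that same color cast; the correction removes it.
illuminant = (0.8, 1.0, 0.5)
cast_pixel = (0.4, 0.5, 0.25)
corrected = apply_white_balance(cast_pixel, illuminant)
```

After correction the three channels of the gray surface are equal, which is why the accuracy of the illuminant estimate directly determines the remaining color cast.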

End-to-End Dialog Neural Coreference Resolution: Balancing Efficiency and Accuracy in Large-Scale Systems

Zhang Dong,Songhang deng,Mingbang Wang,Le Dai,Jiyuan Li,Xingzu Liu,Ruilin Nong

Task: Develop an end-to-end neural coreference resolution system suited to large-scale applications.

Motivation: Large-scale coreference resolution is challenging in natural language processing and requires a balance between efficiency and accuracy.

Details

Method: Uses advanced neural network architectures with various contextual embeddings and attention mechanisms, and applies optimization strategies to accelerate processing. Result: Extensive evaluations on benchmark datasets show that the model improves accuracy over existing approaches while maintaining rapid inference times. Conclusion: The system delivers precise coreference resolution efficiently, establishing a benchmark for future advances in the field. Abstract: Large-scale coreference resolution presents a significant challenge in natural language processing, necessitating a balance between efficiency and accuracy. In response to this challenge, we introduce an End-to-End Neural Coreference Resolution system tailored for large-scale applications. Our system efficiently identifies and resolves coreference links in text, ensuring minimal computational overhead without compromising on performance. By utilizing advanced neural network architectures, we incorporate various contextual embeddings and attention mechanisms, which enhance the quality of predictions for coreference pairs. Furthermore, we apply optimization strategies to accelerate processing speeds, making the system suitable for real-world deployment. Extensive evaluations conducted on benchmark datasets demonstrate that our model achieves improved accuracy compared to existing approaches, while effectively maintaining rapid inference times. Rigorous testing confirms the ability of our system to deliver precise coreference resolutions efficiently, thereby establishing a benchmark for future advancements in this field.

CTI-Unet: Cascaded Threshold Integration for Improved U-Net Segmentation of Pathology Images

Mingyang Zhu,Yuqiu Liang,Jiacheng Wang

Task: Propose a novel Cascaded Threshold-Integrated U-Net (CTI-Unet) for automated segmentation of kidney pathology images.

Motivation: Chronic kidney disease (CKD) is a global health concern that calls for precise and efficient image analysis to aid diagnosis and treatment planning, yet conventional segmentation models often require delicate threshold tuning.

Details

Method: By sequentially integrating multiple thresholded outputs, CTI-Unet balances noise suppression with the preservation of fine structural details. Result: Experiments on the KPIs2024 dataset show that CTI-Unet outperforms state-of-the-art architectures such as nnU-Net, Swin-Unet, and CE-Net. Conclusion: CTI-Unet offers a robust and flexible framework for kidney pathology image segmentation. Abstract: Chronic kidney disease (CKD) is a growing global health concern, necessitating precise and efficient image analysis to aid diagnosis and treatment planning. Automated segmentation of kidney pathology images plays a central role in facilitating clinical workflows, yet conventional segmentation models often require delicate threshold tuning. This paper proposes a novel Cascaded Threshold-Integrated U-Net (CTI-Unet) to overcome the limitations of single-threshold segmentation. By sequentially integrating multiple thresholded outputs, our approach can reconcile noise suppression with the preservation of finer structural details. Experiments on the challenging KPIs2024 dataset demonstrate that CTI-Unet outperforms state-of-the-art architectures such as nnU-Net, Swin-Unet, and CE-Net, offering a robust and flexible framework for kidney pathology image segmentation.

Leveraging Robust Optimization for LLM Alignment under Distribution Shifts

Mingye Zhu,Yi Liu,Junbo Guo,Quan Wang,Yongdong Zhang,Zhendong Mao

Task: Propose a distribution-aware optimization framework that improves LLM preference alignment under distribution shifts.

Motivation: Address the scarcity of high-quality human-annotated data while avoiding the negative effect on model outputs of the distribution shifts introduced by synthetic data.

Details

Method: Estimates the likelihood ratios between the target and training distributions with a learned classifier, then minimizes the worst-case loss over data regions that reflect the target human-preferred distribution. Result: The method mitigates the adverse effects of distributional variation and generates responses that more faithfully reflect human values. Conclusion: The proposed distribution-aware optimization framework can significantly improve preference alignment in the presence of distribution shifts. Abstract: Large language models (LLMs) increasingly rely on preference alignment methods to steer outputs toward human values, yet these methods are often constrained by the scarcity of high-quality human-annotated data. To tackle this, recent approaches have turned to synthetic data generated by LLMs as a scalable alternative. However, synthetic data can introduce distribution shifts, compromising the nuanced human preferences that are essential for desirable outputs. In this paper, we propose a novel distribution-aware optimization framework that improves preference alignment in the presence of such shifts. Our approach first estimates the likelihood ratios between the target and training distributions leveraging a learned classifier, then it minimizes the worst-case loss over data regions that reflect the target human-preferred distribution. By explicitly prioritizing the target distribution during optimization, our method mitigates the adverse effects of distributional variation and enhances the generation of responses that faithfully reflect human values.
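The classifier-based likelihood-ratio step can be illustrated with the standard density-ratio identity: if a classifier outputs the probability that a sample came from the target set rather than the training set, the ratio of the two densities equals the classifier odds, rescaled by the sizes of the two sets. The sketch below applies that identity and then takes a maximum of reweighted losses over groups; the grouping, the function names, and the numbers are assumptions for illustration and are not taken from the paper.

```python
def likelihood_ratio(p_target_given_x, n_train, n_target):
    """Estimate p_target(x) / p_train(x) from a domain classifier's output.

    p_target_given_x: classifier probability that x came from the target set.
    n_train, n_target: number of samples the classifier saw from each set.
    """
    odds = p_target_given_x / (1.0 - p_target_given_x)
    return odds * (n_train / n_target)


def reweighted_loss(losses, ratios):
    """Average per-sample losses, weighting each by its likelihood ratio."""
    return sum(l * w for l, w in zip(losses, ratios)) / sum(ratios)


def worst_case_group_loss(groups):
    """Maximum reweighted loss over hypothetical data regions.

    groups: list of (losses, ratios) pairs, one per region.
    """
    return max(reweighted_loss(losses, ratios) for losses, ratios in groups)


# Hypothetical classifier outputs for two samples with equally sized sets.
ratios = [likelihood_ratio(p, 100, 100) for p in (0.5, 0.8)]
objective = worst_case_group_loss([
    ([1.0, 3.0], [1.0, 1.0]),
    ([1.0, 3.0], [3.0, 1.0]),
])
```

Samples that the classifier judges more target-like receive larger weights, so the optimization objective emphasizes the regions that resemble the human-preferred distribution.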

iEBAKER: Improved Remote Sensing Image-Text Retrieval Framework via Eliminate Before Align and Keyword Explicit Reasoning

Yan Zhang,Zhong Ji,Changxu Meng,Yanwei Pang,Jungong Han

Task: Propose a method named iEBAKER to improve remote sensing image-text retrieval (RSITR).

Motivation: Existing methods based on foundation models (FMs) neglect the negative impact of weakly correlated sample pairs and fail to account for the key distinctions among remote sensing texts, leading to biased and superficial exploration of sample pairs.

Details

Method: Proposes an Eliminate Before Align (EBA) strategy to filter out weakly correlated sample pairs, introduces two schemes that differ in whether local and global similarity affect each other, and adds a Sort After Reversed Retrieval (SAR) strategy that optimizes the similarity matrix via reverse retrieval, together with a Keyword Explicit Reasoning (KER) module. Result: Experiments on three benchmark datasets show that iEBAKER surpasses state-of-the-art models while requiring less training data. Conclusion: iEBAKER can be applied directly to the RSITR task without additional pretraining and significantly improves performance. Abstract: Recent studies focus on the Remote Sensing Image-Text Retrieval (RSITR), which aims at searching for the corresponding targets based on the given query. Among these efforts, the application of Foundation Models (FMs), such as CLIP, to the domain of remote sensing has yielded encouraging outcomes. However, existing FM-based methodologies neglect the negative impact of weakly correlated sample pairs and fail to account for the key distinctions among remote sensing texts, leading to biased and superficial exploration of sample pairs. To address these challenges, we propose an approach named iEBAKER (an Improved Eliminate Before Align strategy with Keyword Explicit Reasoning framework) for RSITR. Specifically, we propose an innovative Eliminate Before Align (EBA) strategy to filter out the weakly correlated sample pairs, thereby mitigating their deviations from optimal embedding space during alignment. Further, two specific schemes are introduced from the perspective of whether local similarity and global similarity affect each other. On this basis, we introduce an alternative Sort After Reversed Retrieval (SAR) strategy, which aims at optimizing the similarity matrix via reverse retrieval. Additionally, we incorporate a Keyword Explicit Reasoning (KER) module to facilitate the beneficial impact of subtle key concept distinctions. Without bells and whistles, our approach enables a direct transition from FM to RSITR task, eliminating the need for additional pretraining on remote sensing data. Extensive experiments conducted on three popular benchmark datasets demonstrate that our proposed iEBAKER method surpasses the state-of-the-art models while requiring less training data. Our source code will be released at https://github.com/zhangy0822/iEBAKER.

Enhancing Coreference Resolution with Pretrained Language Models: Bridging the Gap Between Syntax and Semantics

Xingzu Liu,Songhang deng,Mingbang Wang,Zhang Dong,Le Dai,Jiyuan Li,Ruilin Nong

Task: Propose an innovative framework that combines syntax parsing and semantic role labeling to improve coreference resolution.

Motivation: Traditional methods perform poorly at distinguishing referential relationships because they do not integrate syntactic and semantic information.

Details

Method: Uses pretrained language models combined with syntax parsing and semantic role labeling, fine-tuned with an attention mechanism. Result: Experiments on multiple datasets show that the method outperforms conventional coreference resolution systems and notably improves the accuracy of reference disambiguation. Conclusion: The work not only improves coreference resolution outcomes but also benefits other natural language processing tasks that depend on precise referential understanding. Abstract: Large language models have made significant advancements in various natural language processing tasks, including coreference resolution. However, traditional methods often fall short in effectively distinguishing referential relationships due to a lack of integration between syntactic and semantic information. This study introduces an innovative framework aimed at enhancing coreference resolution by utilizing pretrained language models. Our approach combines syntax parsing with semantic role labeling to accurately capture finer distinctions in referential relationships. By employing state-of-the-art pretrained models to gather contextual embeddings and applying an attention mechanism for fine-tuning, we improve the performance of coreference tasks. Experimental results across diverse datasets show that our method surpasses conventional coreference resolution systems, achieving notable accuracy in disambiguating references. This development not only improves coreference resolution outcomes but also positively impacts other natural language processing tasks that depend on precise referential understanding.

POD: Predictive Object Detection with Single-Frame FMCW LiDAR Point Cloud

Yining Shi,Kun Jiang,Xin Zhao,Kangan Qian,Chuchu Xie,Tuopu Wen,Mengmeng Yang,Diange Yang

Task: Explore predictive object detection (POD) with FMCW LiDAR, which aims to predict the short-term future location and dimensions of objects using only current-frame data.

Motivation: Exploit the radial velocity measured by FMCW LiDAR to avoid dependence on multi-frame historical data and respond quickly to potential dangers.

Details

Method: Proposes the POD framework, which generates virtual future points via ray casting, builds virtual two-frame point clouds, encodes the features with a sparse 4D encoder, and finally separates them into BEV features for current-frame detection and future prediction. Result: Experiments on an in-house dataset demonstrate state-of-the-art performance on both standard and predictive detection. Conclusion: The radial velocity provided by FMCW LiDAR, combined with the POD framework, enables efficient, fast-response predictive object detection. Abstract: LiDAR-based 3D object detection is a fundamental task in the field of autonomous driving. This paper explores the unique advantage of Frequency Modulated Continuous Wave (FMCW) LiDAR in autonomous perception. Given a single frame FMCW point cloud with radial velocity measurements, we expect that our object detector can detect the short-term future locations of objects using only the current frame sensor data and demonstrate a fast ability to respond to intermediate danger. To achieve this, we extend the standard object detection task to a novel task named predictive object detection (POD), which aims to predict the short-term future location and dimensions of objects based solely on current observations. Typically, a motion prediction task requires historical sensor information to process the temporal contexts of each object, while our detector's avoidance of multi-frame historical information enables a much faster response time to potential dangers. The core advantage of FMCW LiDAR lies in the radial velocity associated with every reflected point. We propose a novel POD framework, the core idea of which is to generate a virtual future point using a ray casting mechanism, create virtual two-frame point clouds with the current and virtual future frames, and encode these two-frame voxel features with a sparse 4D encoder. Subsequently, the 4D voxel features are separated by temporal indices and remapped into two Bird's Eye View (BEV) features: one decoded for standard current frame object detection and the other for future predictive object detection. Extensive experiments on our in-house dataset demonstrate the state-of-the-art standard and predictive detection performance of the proposed POD framework.
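One simple way to picture a virtual future point is to slide each measured point along its own sensor ray by the distance its radial velocity would cover in a short time step. The abstract does not spell out the ray casting mechanism, so the sketch below is only an assumed constant-radial-velocity model with the sensor at the origin and positive velocity meaning motion away from the sensor; the function name and values are hypothetical.

```python
import math


def virtual_future_point(point, radial_velocity, dt):
    """Move a LiDAR point along its ray by radial_velocity * dt.

    point: (x, y, z) in the sensor frame, sensor assumed at the origin.
    radial_velocity: metres per second, positive when moving away (assumed
                     sign convention).
    dt: prediction horizon in seconds.
    """
    x, y, z = point
    current_range = math.sqrt(x * x + y * y + z * z)
    if current_range == 0.0:
        return point  # a point at the sensor origin has no defined ray
    scale = (current_range + radial_velocity * dt) / current_range
    return (x * scale, y * scale, z * scale)


# A point 5 m away receding at 10 m/s is placed 10 m away after 0.5 s.
future = virtual_future_point((3.0, 4.0, 0.0), 10.0, 0.5)
```

Stacking the measured points and these displaced points gives a two-frame point cloud from a single scan, which is the kind of input a 4D (space plus time index) encoder can consume.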

Assessing Thai Dialect Performance in LLMs with Automatic Benchmarks and Human Evaluation

Peerat Limkonchotiwat,Kanruethai Masuk,Surapon Nonesung,Chalermpun Mai-On,Sarana Nutanong,Wuttikorn Ponwitayarat,Potsawee Manakul

Task: Evaluate large language models on Thai local dialects and propose a human evaluation guideline and metric.

Motivation: Existing work focuses on main dialects, while the robustness and consistency of large language models on local dialects remain largely unexplored.

Details

Method: Introduces a Thai local dialect benchmark covering Northern, Northeastern, and Southern dialects, and evaluates models on five NLP tasks. Result: LLM performance on local dialects is significantly lower than on standard Thai; only proprietary models such as GPT-4o and Gemini2 perform comparatively well. Conclusion: Further research is needed to improve LLM performance on local dialects. Abstract: Large language models show promising results in various NLP tasks. Despite these successes, the robustness and consistency of LLMs in underrepresented languages remain largely unexplored, especially concerning local dialects. Existing benchmarks also focus on main dialects, neglecting LLMs' ability on local dialect texts. In this paper, we introduce a Thai local dialect benchmark covering Northern (Lanna), Northeastern (Isan), and Southern (Dambro) Thai, evaluating LLMs on five NLP tasks: summarization, question answering, translation, conversation, and food-related tasks. Furthermore, we propose a human evaluation guideline and metric for Thai local dialects to assess generation fluency and dialect-specific accuracy. Results show that LLM performance declines significantly in local Thai dialects compared to standard Thai, with only proprietary models like GPT-4o and Gemini2 demonstrating some fluency.

Reconstruction-Free Anomaly Detection with Diffusion Models via Direct Latent Likelihood Evaluation

Shunsuke Sakai,Tatsuhito Hasegawa

Task: Propose a diffusion-based anomaly detection method that avoids the resource-intensive reconstruction process.

Motivation: Conventional reconstruction-based methods require noise-strength tuning and are slow, which limits practical use.

Details

Method: Directly infers the latent variables of the input image and uses their density under the Gaussian prior distribution as the anomaly score. Result: Achieves an AUC of 0.991 at 15 FPS on the MVTecAD dataset, setting a new state of the art in the speed-AUC trade-off. Conclusion: By simplifying the diffusion process, the method significantly improves the speed and performance of anomaly detection. Abstract: Diffusion models, with their robust distribution approximation capabilities, have demonstrated excellent performance in anomaly detection. However, conventional reconstruction-based approaches rely on computing the reconstruction error between the original and denoised images, which requires careful noise-strength tuning and over ten network evaluations per input, leading to significantly slower detection speeds. To address these limitations, we propose a novel diffusion-based anomaly detection method that circumvents the need for resource-intensive reconstruction. Instead of reconstructing the input image, we directly infer its corresponding latent variables and measure their density under the Gaussian prior distribution. Remarkably, the prior density proves effective as an anomaly score even when using a short partial diffusion process of only 2-5 steps. We evaluate our method on the MVTecAD dataset, achieving an AUC of 0.991 at 15 FPS, thereby setting a new state-of-the-art speed-AUC anomaly detection trade-off.
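Scoring a latent by its density under a standard Gaussian prior reduces to a closed-form negative log-likelihood, which grows with the squared norm of the latent. The sketch below computes only that score; how the latent is inferred from the image (the short partial diffusion process) is not reproduced here, and the function name and sample latents are assumptions.

```python
import math


def gaussian_prior_nll(latent):
    """Negative log-density of a latent vector under N(0, I).

    Higher values mean the latent is less likely under the prior, so the
    value can serve as an anomaly score.
    """
    dimension = len(latent)
    squared_norm = sum(z * z for z in latent)
    return 0.5 * squared_norm + 0.5 * dimension * math.log(2.0 * math.pi)


# Hypothetical latents: one near the prior mode, one far from it.
typical_score = gaussian_prior_nll([0.1, -0.2, 0.05])
unusual_score = gaussian_prior_nll([3.0, -4.0, 2.5])
```

Because the constant term depends only on the latent dimension, ranking samples by this score is equivalent to ranking them by the squared norm of their latents.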

High-Resource Translation: Turning Abundance into Accessibility

Abhiram Reddy Yanampally

Task: Build an English-to-Telugu translation model that addresses the challenges of low-resource languages through transfer learning.

Motivation: Address dataset sparsity in low-resource language translation and improve translation model performance.

Details

Method: Uses the Bharat Parallel Corpus Collection (BPCC) dataset with iterative backtranslation to generate synthetic parallel data, optimizes training parameters, and uses pre-trained models. Result: Develops a robust translation system that can handle diverse sentence structures and linguistic nuances in English and Telugu. Conclusion: Innovative data handling techniques and transfer learning hold significant potential for low-resource language translation and can help improve practical communication between English and Telugu speakers. Abstract: This paper presents a novel approach to constructing an English-to-Telugu translation model by leveraging transfer learning techniques and addressing the challenges associated with low-resource languages. Utilizing the Bharat Parallel Corpus Collection (BPCC) as the primary dataset, the model incorporates iterative backtranslation to generate synthetic parallel data, effectively augmenting the training dataset and enhancing the model's translation capabilities. The research focuses on a comprehensive strategy for improving model performance through data augmentation, optimization of training parameters, and the effective use of pre-trained models. These methodologies aim to create a robust translation system that can handle diverse sentence structures and linguistic nuances in both English and Telugu. This work highlights the significance of innovative data handling techniques and the potential of transfer learning in overcoming limitations posed by sparse datasets in low-resource languages. The study contributes to the field of machine translation and seeks to improve communication between English and Telugu speakers in practical contexts.

Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation

Tianshui Chen,Jianman Lin,Zhijing Yang,Chumei Qing,Yukai Shi,Liang Lin

Task: Speech-preserving facial expression manipulation (SPFEM) aims to modify a talking head to display a specific reference emotion while preserving the mouth animation of source spoken contents.

Motivation: The intrinsic intertwining of emotion and content information during the talking process poses challenges to their effectiveness as supervisory signals.

Details

Method: Proposes a Contrastive Decoupled Representation Learning (CDRL) algorithm, including Contrastive Content Representation Learning (CCRL) and Contrastive Emotion Representation Learning (CERL) modules, augmented with contrastive learning. Result: Extensive experiments and evaluations on various benchmarks show the effectiveness of the proposed algorithm. Conclusion: The decoupled content and emotion representations ensure more accurate emotion manipulation together with audio-lip synchronization. Abstract: Speech-preserving facial expression manipulation (SPFEM) aims to modify a talking head to display a specific reference emotion while preserving the mouth animation of source spoken contents. Thus, emotion and content information existing in reference and source inputs can provide direct and accurate supervision signals for SPFEM models. However, the intrinsic intertwining of these elements during the talking process poses challenges to their effectiveness as supervisory signals. In this work, we propose to learn content and emotion priors as guidance augmented with contrastive learning to learn decoupled content and emotion representation via an innovative Contrastive Decoupled Representation Learning (CDRL) algorithm. Specifically, a Contrastive Content Representation Learning (CCRL) module is designed to learn audio features, which primarily contain content information, as content priors to guide learning content representation from the source input. Meanwhile, a Contrastive Emotion Representation Learning (CERL) module is proposed to make use of a pre-trained visual-language model to learn emotion prior, which is then used to guide learning emotion representation from the reference input. We further introduce emotion-aware and emotion-augmented contrastive learning to train CCRL and CERL modules, respectively, ensuring learning emotion-independent content representation and content-independent emotion representation. During SPFEM model training, the decoupled content and emotion representations are used to supervise the generation process, ensuring more accurate emotion manipulation together with audio-lip synchronization. Extensive experiments and evaluations on various benchmarks show the effectiveness of the proposed algorithm.

Unsupervised Location Mapping for Narrative Corpora

Eitan Wagner,Renana Keydar,Omri Abend

Task: Unsupervised location mapping, which aims to map the trajectory of an individual narrative onto a spatial map of the locations in which a set of narratives takes place.

Motivation: Despite the fundamental and general nature of the task, very little work has addressed the spatial mapping of narrative texts.

Details

Method: Building on recent advances in the context length of large language models, proposes a fully unsupervised pipeline that requires no predefined label set. Result: Tested on two different domains (Holocaust testimonies and Lake District writing), with encouraging results in both intrinsic and extrinsic evaluations. Conclusion: Sets a benchmark and evaluation practices for the task and highlights its challenges. Abstract: This work presents the task of unsupervised location mapping, which seeks to map the trajectory of an individual narrative on a spatial map of locations in which a large set of narratives take place. Despite the fundamentality and generality of the task, very little work has addressed the spatial mapping of narrative texts. The task consists of two parts: (1) inducing a "map" with the locations mentioned in a set of texts, and (2) extracting a trajectory from a single narrative and positioning it on the map. Following recent advances in increasing the context length of large language models, we propose a pipeline for this task in a completely unsupervised manner without predefining the set of labels. We test our method on two different domains: (1) Holocaust testimonies and (2) Lake District writing, namely multi-century literature on travels in the English Lake District. We perform both intrinsic and extrinsic evaluations for the task, with encouraging results, thereby setting a benchmark and evaluation practices for the task, as well as highlighting challenges.

Dongjun Qian,Kai Su,Yiming Tan,Qishuai Diao,Xian Wu,Chang Liu,Bingyue Peng,Zehuan Yuan

Task: Use large language models to automatically create high-quality short-form advertisement videos.

Motivation: Demand for short-video advertising is growing, but manual creation is inefficient and creatively demanding, which calls for an automated solution.

Details

Method: Proposes the VC-LLM framework, which combines high-resolution spatial input with low-resolution temporal input and uses supplementary information obtained by rewriting the ground-truth text to reduce model hallucinations. Result: Experiments show that VC-LLM based on GPT-4o can produce videos comparable to those created by humans, while the version based on a fine-tuned LLM achieves superior narrative logic. Conclusion: The VC-LLM framework can effectively automate short-form advertisement video creation, improving efficiency and quality. Abstract: As short videos have risen in popularity, the role of video content in advertising has become increasingly significant. Typically, advertisers record a large amount of raw footage about the product and then create numerous different short-form advertisement videos based on this raw footage. Creating such videos mainly involves editing raw footage and writing advertisement scripts, which requires a certain level of creative ability. It is usually challenging to create many different video contents for the same product, and manual efficiency is often low. In this paper, we present VC-LLM, a framework powered by Large Language Models for the automatic creation of high-quality short-form advertisement videos. Our approach leverages high-resolution spatial input and low-resolution temporal input to represent video clips more effectively, capturing both fine-grained visual details and broader temporal dynamics. In addition, during training, we incorporate supplementary information generated by rewriting the ground truth text, ensuring that all key output information can be directly traced back to the input, thereby reducing model hallucinations. We also designed a benchmark to evaluate the quality of the created videos. Experiments show that VC-LLM based on GPT-4o can produce videos comparable to those created by humans. Furthermore, we collected numerous high-quality short advertisement videos to create a pre-training dataset and manually cleaned a portion of the data to construct a high-quality fine-tuning dataset. Experiments indicate that, on the benchmark, the VC-LLM based on fine-tuned LLM can produce videos with superior narrative logic compared to those created by the VC-LLM based on GPT-4o.

NativQA Framework: Enabling LLMs with Native, Local, and Everyday Knowledge

Firoj Alam,Md Arid Hasan,Sahinur Rahman Laskar,Mucahid Kutlu,Shammur Absar Chowdhury

Task: Propose NativQA, a framework for constructing large-scale, culturally and regionally aligned multilingual QA datasets.

Motivation: Address the shortcomings of large language models regarding cultural bias, fairness, and applicability to multilingual and low-resource regions.

Details

Method: Uses user-defined seed queries and search engines to collect location-specific, everyday information and build QA datasets. Result: Evaluated across 39 locations in 24 countries and in 7 languages (from extremely low-resource to high-resource), yielding over 300K QA pairs. Conclusion: The NativQA framework can be used for LLM benchmarking and further fine-tuning, and has been made publicly available to the community. Abstract: The rapid advancement of large language models (LLMs) has raised concerns about cultural bias, fairness, and their applicability in diverse linguistic and underrepresented regional contexts. To enhance and benchmark the capabilities of LLMs, there is a need to develop large-scale resources focused on multilingual, local, and cultural contexts. In this study, we propose a framework, NativQA, that can seamlessly construct large-scale, culturally and regionally aligned QA datasets in native languages. The framework utilizes user-defined seed queries and leverages search engines to collect location-specific, everyday information. It has been evaluated across 39 locations in 24 countries and in 7 languages, ranging from extremely low-resource to high-resource languages, which resulted in over 300K Question Answer (QA) pairs. The developed resources can be used for LLM benchmarking and further fine-tuning. The framework has been made publicly available for the community (https://gitlab.com/nativqa/nativqa-framework).

Noisy Deep Ensemble: Accelerating Deep Ensemble Learning via Noise Injection

Shunsuke Sakai,Shunsuke Tsuge,Tatsuhito Hasegawa

Task: Propose a new method, "Noisy Deep Ensemble", that reduces the training time required for neural network ensembles.

Motivation: Conventional ensembling independently trains multiple networks and averages their predictions, but training time grows linearly with the number of ensemble members.

Details

Method: Trains a parent model to convergence, then perturbs its weights to generate multiple child models, exploring different local minima while reducing training time. Result: On the CIFAR-10 and CIFAR-100 datasets, the method surpasses conventional efficient ensemble methods and achieves test accuracy comparable to standard ensembles. Conclusion: Noisy Deep Ensemble significantly reduces training time while maintaining ensemble performance. Abstract: Neural network ensembling is a simple yet effective approach for enhancing generalization capabilities. The most common method involves independently training multiple neural networks initialized with different weights and then averaging their predictions during inference. However, this approach increases training time linearly with the number of ensemble members. To address this issue, we propose the novel "Noisy Deep Ensemble" method, significantly reducing the training time required for neural network ensembles. In this method, a parent model is trained until convergence, and then the weights of the parent model are perturbed in various ways to construct multiple child models. This perturbation of the parent model weights facilitates the exploration of different local minima while significantly reducing the training time for each ensemble member. We evaluated our method using diverse CNN architectures on CIFAR-10 and CIFAR-100 datasets, surpassing conventional efficient ensemble methods and achieving test accuracy comparable to standard ensembles. Code is available at https://github.com/TSTB-dev/NoisyDeepEnsemble
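The parent-to-children construction can be sketched in a few lines: copy the converged parent weights, add noise to each copy, and average the children's predictions at inference time. The abstract says the weights are perturbed "in various ways" without fixing one, so additive Gaussian noise is an assumption here, as are the function names; any further training of the children is omitted.

```python
import random


def make_children(parent_weights, num_children, noise_std, seed=0):
    """Create child weight vectors by adding Gaussian noise to the parent.

    Gaussian noise is an illustrative choice, not necessarily the
    perturbation used in the paper.
    """
    rng = random.Random(seed)
    return [
        [w + rng.gauss(0.0, noise_std) for w in parent_weights]
        for _ in range(num_children)
    ]


def ensemble_average(predictions):
    """Average class-probability vectors predicted by the ensemble members."""
    count = len(predictions)
    return [sum(column) / count for column in zip(*predictions)]


parent = [0.5, -1.2, 0.3]  # hypothetical converged weights
children = make_children(parent, num_children=4, noise_std=0.01)
averaged = ensemble_average([[0.2, 0.8], [0.4, 0.6]])
```

The training-time saving comes from the fact that only the parent is trained from scratch; each child starts from the parent's solution rather than from a fresh random initialization.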

Llama-3-Nanda-10B-Chat: An Open Generative Large Language Model for Hindi

Monojit Choudhury,Shivam Chauhan,Rocktim Jyoti Das,Dhruv Sahnan,Xudong Han,Haonan Li,Aaryamonvikram Singh,Alok Anil Jadhav,Utkarsh Agarwal,Mukund Choudhary,Debopriyo Banerjee,Fajri Koto,Junaid Bhat,Awantika Shukla,Samujjwal Ghosh,Samta Kamboj,Onkar Pandit,Lalit Pradhan,Rahul Pal,Sunil Sahu,Soundar Doraiswamy,Parvez Mullah,Ali El Filali,Neha Sengupta,Gokul Ramakrishnan,Rituraj Joshi,Gurpreet Gosal,Avraham Sheinin,Natalia Vassilieva,Preslav Nakov

Task: Develop a high-quality, Hindi-centric, instruction-tuned generative large language model (LLM).

Motivation: Address the unique challenges that moderately resourced languages such as Hindi face in data availability, model adaptation, and evaluation.

Details

Method: Built on Llama-3-8B, with continuous pre-training, expanded transformer blocks, data augmentation, and a bilingual training strategy to optimize cross-linguistic knowledge transfer. Result: Nanda (10B parameters) ranks among the top-performing open-source Hindi and multilingual models of similar scale and shows significant advantages over many existing models. Conclusion: By open-sourcing Nanda, the authors aim to advance Hindi LLM research and support a wide range of applications across academia, industry, and public services. Abstract: Developing high-quality large language models (LLMs) for moderately resourced languages presents unique challenges in data availability, model adaptation, and evaluation. We introduce Llama-3-Nanda-10B-Chat, or Nanda for short, a state-of-the-art Hindi-centric instruction-tuned generative LLM, designed to push the boundaries of open-source Hindi language models. Built upon Llama-3-8B, Nanda incorporates continuous pre-training with expanded transformer blocks, leveraging the Llama Pro methodology. A key challenge was the limited availability of high-quality Hindi text data; we addressed this through rigorous data curation, augmentation, and strategic bilingual training, balancing Hindi and English corpora to optimize cross-linguistic knowledge transfer. With 10 billion parameters, Nanda stands among the top-performing open-source Hindi and multilingual models of similar scale, demonstrating significant advantages over many existing models. We provide an in-depth discussion of training strategies, fine-tuning techniques, safety alignment, and evaluation metrics, demonstrating how these approaches enabled Nanda to achieve state-of-the-art results. By open-sourcing Nanda, we aim to advance research in Hindi LLMs and support a wide range of real-world applications across academia, industry, and public services.

Event-based Civil Infrastructure Visual Defect Detection: ev-CIVIL Dataset and Benchmark

Udayanga G. W. K. N. Gamage,Xuanni Huo,Luca Zanatta,T Delbruck,Cesar Cadena,Matteo Fumagalli,Silvia Tolu

Task: Explore the feasibility of using Dynamic Vision Sensors (DVS) to detect defects in civil structures, and create the first event-based dataset for this purpose.

Motivation: Traditional cameras struggle to capture defects under low or dynamic lighting, whereas DVS excel in such scenarios, but relevant studies and datasets are lacking.

Details

Method: Uses the DAVIS346 camera, which integrates DVS and APS sensors, to collect event data and grayscale image data, building a dataset of crack and spalling defects. Result: The dataset contains 318 field recording sequences and 362 laboratory recording sequences; four real-time object detection models were evaluated on it to validate its effectiveness. Conclusion: The dataset enables accurate defect detection and classification even under challenging lighting conditions, demonstrating its robustness. Abstract: Small Unmanned Aerial Vehicle (UAV) based visual inspections are a more efficient alternative to manual methods for examining civil structural defects, offering safe access to hazardous areas and significant cost savings by reducing labor requirements. However, traditional frame-based cameras, widely used in UAV-based inspections, often struggle to capture defects under low or dynamic lighting conditions. In contrast, Dynamic Vision Sensors (DVS), or event-based cameras, excel in such scenarios by minimizing motion blur, enhancing power efficiency, and maintaining high-quality imaging across diverse lighting conditions without saturation or information loss. Despite these advantages, existing research lacks studies exploring the feasibility of using DVS for detecting civil structural defects. Moreover, there is no dedicated event-based dataset tailored for this purpose. Addressing this gap, this study introduces the first event-based civil infrastructure defect detection dataset, capturing defective surfaces as a spatio-temporal event stream using DVS. In addition to event-based data, the dataset includes grayscale intensity image frames captured simultaneously using an Active Pixel Sensor (APS). Both data types were collected using the DAVIS346 camera, which integrates DVS and APS sensors. The dataset focuses on two types of defects: cracks and spalling, and includes data from both field and laboratory environments. The field dataset comprises 318 recording sequences, documenting 458 distinct cracks and 121 distinct spalling instances. The laboratory dataset includes 362 recording sequences, covering 220 distinct cracks and 308 spalling instances. Four real-time object detection models were evaluated on it to validate the dataset's effectiveness. The results demonstrate the dataset's robustness in enabling accurate defect detection and classification, even under challenging lighting conditions.

Multi-Sense Embeddings for Language Models and Knowledge Distillation

Qitong Wang,Mohammed J. Zaki,Georgios Kollias,Vasileios Kalantzis

Task: Propose multi-sense embeddings as a replacement for token embeddings, in order to capture the multiple senses of words.

Motivation: The contextual embeddings produced by large language models (LLMs) yield a different representation for each context, yet words typically have only a limited number of senses.

Details

Method: Builds a multi-sense embedding dictionary with a clustering algorithm and proposes a knowledge distillation method based on this dictionary to train a smaller student model. Result: Experiments on multiple benchmarks confirm the effectiveness of the multi-sense embeddings and the knowledge distillation method. Conclusion: The multi-sense embeddings and knowledge distillation approach deliver significant space and inference-time savings while maintaining competitive performance. Abstract: Transformer-based large language models (LLMs) rely on contextual embeddings which generate different (continuous) representations for the same token depending on its surrounding context. Nonetheless, words and tokens typically have a limited number of senses (or meanings). We propose multi-sense embeddings as a drop-in replacement for each token in order to capture the range of their uses in a language. To construct a sense embedding dictionary, we apply a clustering algorithm to embeddings generated by an LLM and consider the cluster centers as representative sense embeddings. In addition, we propose a novel knowledge distillation method that leverages the sense dictionary to learn a smaller student model that mimics the senses from the much larger base LLM model, offering significant space and inference time savings, while maintaining competitive performance. Via thorough experiments on various benchmarks, we showcase the effectiveness of our sense embeddings and knowledge distillation approach. We share our code at https://github.com/Qitong-Wang/SenseDict
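Building a sense dictionary by clustering can be sketched as follows: collect the contextual embeddings observed for one token, cluster them, and keep the cluster centers as that token's sense embeddings. The abstract does not name the clustering algorithm, so plain k-means is an assumption here, and the toy 2-D embeddings, initial centers, and function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(vector, centers):
    """Index of the center closest to the given vector."""
    return min(range(len(centers)), key=lambda i: squared_distance(vector, centers[i]))


def kmeans(points, initial_centers, iterations=10):
    """Plain k-means; returns the final cluster centers."""
    centers = [list(c) for c in initial_centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for point in points:
            clusters[nearest_index(point, centers)].append(point)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers


# Toy contextual embeddings of one token, forming two well-separated senses.
embeddings = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
sense_embeddings = kmeans(embeddings, initial_centers=[[0.0, 0.0], [10.0, 10.0]])
sense_id = nearest_index([9.0, 9.0], sense_embeddings)
```

At lookup time a new contextual embedding is mapped to its nearest center, so each token is represented by a small, fixed set of vectors instead of an unbounded continuum.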

On the Suitability of Reinforcement Fine-Tuning to Visual Tasks

Xiaxu Chen,Wei Li,Chunxu Liu,Chi Xie,Xiaoyan Hu,Chengqian Ma,Feng Zhu,Rui Zhao

Task: Study the suitability and limitations of Reinforcement Fine-Tuning (RFT) for visual tasks.

Motivation: Although RFT has been shown to improve the reasoning ability of language models, its application to visual tasks is at an early stage and its suitability has not been analyzed in depth.

Details

Method: Through experimental analysis and observation, quantitatively compares RFT with supervised fine-tuning (SFT) on visual tasks and designs a new reward to test the effect of the reasoning process. Result: RFT is generally better than SFT on visual tasks, especially when training samples are limited; more reasoning helps on complicated tasks but can hurt on simple ones. Conclusion: The study offers new insight into applying RFT to visual tasks and may help advance this fast-moving area. Abstract: Reinforcement Fine-Tuning (RFT) has proved to be greatly valuable for enhancing the reasoning ability of LLMs. Researchers have started to apply RFT to MLLMs, hoping it will also enhance the capabilities of visual understanding. However, these works are at a very early stage and have not examined how suitable RFT actually is for visual tasks. In this work, we endeavor to understand the suitability and limitations of RFT for visual tasks, through experimental analysis and observations. We start with quantitative comparisons on various tasks, which show RFT is generally better than SFT on visual tasks, especially when the number of training samples is limited. To check whether such advantages are brought about by the reasoning process, we design a new reward that encourages the model to "think" more, whose results show more thinking can be beneficial for complicated tasks but harmful for simple tasks. We hope this study can provide more insight for the rapid advancements on this topic.

Confidence Regularized Masked Language Modeling using Text Length

Seunghyun Ji,Soowon Lee

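Task: Propose a confidence regularization method that dynamically adjusts the regularization strength, to address the overconfidence that masked language models show when the input text is short.

Motivation: When the input text is short, the word distribution for the masked position can have high entropy, which may make a masked language model overconfident in a single answer.

Details

Method: Propose a confidence regularizer whose strength is adjusted dynamically according to the input text length. Result: Experiments on the GLUE and SQuAD datasets show that the method improves accuracy and lowers expected calibration error. Conclusion: The proposed confidence regularizer effectively addresses overconfidence in masked language modeling and improves model performance. Abstract: Masked language modeling, which is a task to predict a randomly masked word in the input text, is an efficient language representation learning method. Masked language modeling ignores various words which people can think of for filling in the masked position and calculates the loss with a single word. Especially when the input text is short, the entropy of the word distribution that can fill in the masked position can be high. This may cause the model to be overconfident in the single answer. To address this issue, we propose a novel confidence regularizer that controls regularizing strength dynamically by the input text length. Experiments with GLUE and SQuAD datasets showed that our method achieves better accuracy and lower expected calibration error.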
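
One way to picture a length-dependent confidence regularizer is sketched below. The abstract does not give the paper's formula, so this substitutes a generic entropy-based confidence penalty (cross-entropy minus a weighted entropy term) and a made-up schedule in which shorter inputs receive a larger weight; `reg_strength`, `base`, and `ref_len` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reg_strength(text_len, base=0.1, ref_len=32):
    """Hypothetical schedule: the shorter the input, the stronger the regularization."""
    return base * ref_len / (ref_len + text_len)


def regularized_mlm_loss(logits, target, text_len):
    """Cross-entropy for the masked token minus a length-scaled entropy bonus."""
    probs = softmax(logits)
    ce = -math.log(probs[target])
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Subtracting entropy rewards less peaked predictions, i.e. penalises overconfidence.
    return ce - reg_strength(text_len) * entropy


logits = [2.0, 1.0, 0.1]
short_loss = regularized_mlm_loss(logits, target=0, text_len=8)
long_loss = regularized_mlm_loss(logits, target=0, text_len=128)
```

With identical logits, the short input gets the larger entropy bonus, so the optimiser is pushed harder toward a flatter distribution exactly where the abstract says the true word distribution is most uncertain.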

Point-based Instance Completion with Scene Constraints

Wesley Khademi,Li Fuxin

Task: Propose a point cloud-based instance completion model that can robustly complete objects at arbitrary scales and poses within a scene.

Motivation: Existing point-based object completion methods do not consider scene constraints, while instance scene completion methods lag behind in completion quality and likewise do not use scene constraints.

Details

Method: Introduce a sparse set of scene constraints represented as point clouds and integrate them into the completion model through a cross-attention mechanism. Result: Experiments show that the method outperforms existing methods in fidelity to partial scans, completion quality, and plausibility. Conclusion: The proposed method performs strongly on instance scene completion and overcomes the limitations of existing methods. Abstract: Recent point-based object completion methods have demonstrated the ability to accurately recover the missing geometry of partially observed objects. However, these approaches are not well-suited for completing objects within a scene, as they do not consider known scene constraints (e.g., other observed surfaces) in their completions and further expect the partial input to be in a canonical coordinate system, which does not hold for objects within scenes. While instance scene completion methods have been proposed for completing objects within a scene, they lag behind point-based object completion methods in terms of object completion quality and still do not consider known scene constraints during completion. To overcome these limitations, we propose a point cloud-based instance completion model that can robustly complete objects at arbitrary scales and poses in the scene. To enable reasoning at the scene level, we introduce a sparse set of scene constraints represented as point clouds and integrate them into our completion model via a cross-attention mechanism. To evaluate the instance scene completion task on indoor scenes, we further build a new dataset called ScanWCF, which contains labeled partial scans as well as aligned ground truth scene completions that are watertight and collision-free. Through several experiments, we demonstrate that our method achieves improved fidelity to partial scans, higher completion quality, and greater plausibility over existing state-of-the-art methods.

QGen Studio: An Adaptive Question-Answer Generation, Training and Evaluation Platform

Movina Moses,Mohab Elkaref,James Barry,Shinnosuke Tanaka,Vishnudev Kuruvanthodi,Nathan Herr,Campbell D Watson,Geeth De Mel

Task: Develop QGen Studio, an adaptive platform for question-answer generation, training, and evaluation.

Motivation: Use large language models (LLMs) to create custom question-answer datasets and fine-tune models on this synthetic data.

Details

Method: Provide a dataset viewer and a model explorer that support data quality analysis and model performance comparison. Result: QGen Studio offers an interactive, end-to-end solution for generating QA datasets and training scalable, domain-adaptable models. Conclusion: QGen Studio will be open-sourced, and users will be able to deploy it locally. Abstract: We present QGen Studio: an adaptive question-answer generation, training, and evaluation platform. QGen Studio enables users to leverage large language models (LLMs) to create custom question-answer datasets and fine-tune models on this synthetic data. It features a dataset viewer and model explorer to streamline this process. The dataset viewer provides key metrics and visualizes the context from which the QA pairs are generated, offering insights into data quality. The model explorer supports model comparison, allowing users to contrast the performance of their trained LLMs against other models, supporting performance benchmarking and refinement. QGen Studio delivers an interactive, end-to-end solution for generating QA datasets and training scalable, domain-adaptable models. The studio will be open-sourced soon, allowing users to deploy it locally.

Pose-Aware Weakly-Supervised Action Segmentation

Seth Z. Zhao,Reza Ghoddoosian,Isht Dwivedi,Nakul Agarwal,Behzad Dariush

Task: Develop a weakly-supervised framework for segmenting human actions in long instructional videos.

Motivation: Reduce the cost of the precise annotation that action segmentation requires.

Details

Method: Propose a weakly-supervised framework that uses pose knowledge during training but not at inference, together with a pose-inspired contrastive loss. Result: Outperforms existing methods on multiple datasets in both online and offline settings. Conclusion: The framework is effective and adaptable, working with different segmentation backbones and pose extractors. Abstract: Understanding human behavior is an important problem in the pursuit of visual intelligence. A challenge in this endeavor is the extensive and costly effort required to accurately label action segments. To address this issue, we consider learning methods that demand minimal supervision for segmentation of human actions in long instructional videos. Specifically, we introduce a weakly-supervised framework that uniquely incorporates pose knowledge during training while omitting its use during inference, thereby distilling pose knowledge pertinent to each action component. We propose a pose-inspired contrastive loss as a part of the whole weakly-supervised framework which is trained to distinguish action boundaries more effectively. Our approach, validated through extensive experiments on representative datasets, outperforms previous state-of-the-art (SOTA) in segmenting long instructional videos under both online and offline settings. Additionally, we demonstrate the framework's adaptability to various segmentation backbones and pose extractors across different datasets.

Navigating the Rabbit Hole: Emergent Biases in LLM-Generated Attack Narratives Targeting Mental Health Groups

Rijul Magu,Arka Dutta,Sean Kim,Ashiqur R. KhudaBukhsh,Munmun De Choudhury

Task: Evaluate the attack content that large language models (LLMs) generate against at-risk mental health groups, and their structural tendency to propagate bias.

Motivation: Study unprovoked attacks by LLMs on specific groups (such as mental health groups) and the mechanisms by which bias propagates, filling a gap in this area.

Details

Method: Propose a network-based framework for analyzing bias propagation, and assess the degree of stigmatization in the attack content. Result: Mental health entities occupy central positions in attack narrative networks, and the stigmatization analysis shows a higher degree of labeling for mental health targets. Conclusion: LLMs have a structural tendency to heighten harmful discourse, and suitable mitigation methods need to be developed. Abstract: Large Language Models (LLMs) have been shown to demonstrate imbalanced biases against certain groups. However, the study of unprovoked targeted attacks by LLMs towards at-risk populations remains underexplored. Our paper presents three novel contributions: (1) the explicit evaluation of LLM-generated attacks on highly vulnerable mental health groups; (2) a network-based framework to study the propagation of relative biases; and (3) an assessment of the relative degree of stigmatization that emerges from these attacks. Our analysis of a recently released large-scale bias audit dataset reveals that mental health entities occupy central positions within attack narrative networks, as shown by a significantly higher mean closeness centrality (p-value = 4.06e-10) and dense clustering (Gini coefficient = 0.7). Drawing from sociological foundations of stigmatization theory, our stigmatization analysis indicates increased labeling components for mental health disorder-related targets relative to initial targets in generation chains. Taken together, these insights shed light on the structural predilections of large language models to heighten harmful discourse and highlight the need for suitable approaches for mitigation.
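
The two network statistics quoted in the abstract, closeness centrality and the Gini coefficient, have standard definitions that a short sketch can make concrete. The code below uses an unweighted, connected toy graph; the abstract does not say how the attack narrative networks were built or which quantity the Gini coefficient was computed over, so the graph and the example values are purely illustrative.

```python
from collections import deque


def closeness(graph, node):
    """Closeness centrality: (reachable nodes) / (sum of shortest-path lengths from node)."""
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search gives shortest paths in an unweighted graph
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


def gini(values):
    """Gini coefficient of non-negative values: 0 = perfectly even, near 1 = concentrated."""
    xs = sorted(values)
    n, s = len(xs), sum(xs)
    if n == 0 or s == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * s) - (n + 1) / n


# Toy star graph: one hub entity linked to three peripheral entities.
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
hub_score = closeness(star, "hub")
leaf_score = closeness(star, "a")
```

In the star graph the hub reaches every node in one step, so its closeness is higher than any leaf's; that is the sense in which a high mean closeness centrality marks entities as central to the network.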

SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning

Fida Mohammad Thoker,Letian Jiang,Chen Zhao,Piyush Bagad,Hazel Doughty,Bernard Ghanem,Cees G. M. Snoek

Task: Comprehensively evaluate how modern video self-supervised models generalize across four key downstream factors.

Motivation: Although video self-supervised learning performs well on standard action recognition benchmarks, its narrow evaluation protocols limit our understanding of how it generalizes in real-world scenarios.

Details

Method: Extend prior work on benchmark sensitivity in CNN-based contrastive learning to cover transformer-based video-only and video-text models, running over 1100 experiments in total. Result: Transformer models are sensitive to downstream conditions and no method is consistent across all factors; video-only transformers do better under domain shift, CNNs are better on fine-grained tasks, and video-text models often underperform. Conclusion: The study gives a detailed view of the strengths and weaknesses of current video self-supervised methods and provides a unified benchmark for evaluating generalization in video representation learning. Abstract: Continued advances in self-supervised learning have led to significant progress in video representation learning, offering a scalable alternative to supervised approaches by removing the need for manual annotations. Despite strong performance on standard action recognition benchmarks, video self-supervised learning methods are largely evaluated under narrow protocols, typically pretraining on Kinetics-400 and fine-tuning on similar datasets, limiting our understanding of their generalization in real-world scenarios. In this work, we present a comprehensive evaluation of modern video self-supervised models, focusing on generalization across four key downstream factors: domain shift, sample efficiency, action granularity, and task diversity. Building on our prior work analyzing benchmark sensitivity in CNN-based contrastive learning, we extend the study to cover state-of-the-art transformer-based video-only and video-text models. Specifically, we benchmark 12 transformer-based methods (7 video-only, 5 video-text) and compare them to 10 CNN-based methods, totaling over 1100 experiments across 8 datasets and 7 downstream tasks. Our analysis shows that, despite architectural advances, transformer-based models remain sensitive to downstream conditions. No method generalizes consistently across all factors: video-only transformers perform better under domain shifts, CNNs outperform for fine-grained tasks, and video-text models often underperform despite large-scale pretraining. We also find that recent transformer models do not consistently outperform earlier approaches. Our findings provide a detailed view of the strengths and limitations of current video SSL methods and offer a unified benchmark for evaluating generalization in video representation learning.

Assessing how hyperparameters impact Large Language Models' sarcasm detection performance

Montgomery Gole,Andriy Miranskyy

Task: Study how model characteristics affect sarcasm detection with OpenAI's GPT and Meta's Llama-2 models.

Motivation: Sarcasm detection is challenging for both humans and machines, and the GPT and Llama-2 models were chosen for study because of their strong natural language understanding and their popularity.

Details

Method: Evaluate fine-tuned and zero-shot models of various sizes, releases, and hyperparameters, using the political and balanced (pol-bal) portion of the SARC2.0 dataset. Result: Fine-tuned performance improves monotonically with model size within a model family; Llama-2-13b achieves the best fine-tuned performance (accuracy and F1-score both 0.83), and GPT-4 is competitive in the zero-shot setting (accuracy 0.70, F1-score 0.75). Conclusion: Model performance can fluctuate from release to release, so it needs to be reassessed after each release. Abstract: Sarcasm detection is challenging for both humans and machines. This work explores how model characteristics impact sarcasm detection in OpenAI's GPT and Meta's Llama-2 models, given their strong natural language understanding and popularity. We evaluate fine-tuned and zero-shot models across various sizes, releases, and hyperparameters. Experiments were conducted on the political and balanced (pol-bal) portion of the popular Self-Annotated Reddit Corpus (SARC2.0) sarcasm dataset. Fine-tuned performance improves monotonically with model size within a model family, while hyperparameter tuning also impacts performance. In the fine-tuning scenario, full precision Llama-2-13b achieves state-of-the-art accuracy and $F_1$-score, both measured at 0.83, comparable to average human performance. In the zero-shot setting, one GPT-4 model achieves competitive performance to prior attempts, yielding an accuracy of 0.70 and an $F_1$-score of 0.75. Furthermore, a model's performance may increase or decline with each release, highlighting the need to reassess performance after each release.
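
The abstract reports accuracy and $F_1$-score side by side, and the two can diverge (0.70 versus 0.75 in the zero-shot case). The sketch below shows the standard binary definitions on invented labels; it assumes label 1 means "sarcastic" and is not taken from the authors' evaluation code.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, using the standard definitions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# A classifier that labels everything as sarcastic: one false positive, no misses.
acc, f1 = accuracy_and_f1([1, 1, 1, 0], [1, 1, 1, 1])
```

Here accuracy is 0.75 while $F_1$ is 6/7, because $F_1$ ignores true negatives; a model that over-predicts the positive class can therefore score a higher $F_1$ than accuracy.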

QEMesh: Employing A Quadric Error Metrics-Based Representation for Mesh Generation

Jiaqi Li,Ruowei Wang,Yu Liu,Qijun Zhao

Task: Propose a new model, QEMesh, for high-quality mesh generation.

Motivation: Address problems in existing mesh generation methods such as unrealistic surfaces, missing thin parts, and incomplete structures.

Details

Method: Extend the PoNQ representation and design a latent diffusion model containing a multi-decoder VAE that generates PoNQ parameters and predicts voxel cells. Result: Generates results with watertight surfaces and is comparable to state-of-the-art methods on several main metrics. Conclusion: QEMesh performs well in mesh generation, addressing the existing problems and producing high-quality results. Abstract: Mesh generation plays a crucial role in 3D content creation, as mesh is widely used in various industrial applications. Recent works have achieved impressive results but still face several issues, such as unrealistic patterns or pits on surfaces, thin parts missing, and incomplete structures. Most of these problems stem from the choice of shape representation or the capabilities of the generative network. To alleviate these, we extend PoNQ, a Quadric Error Metrics (QEM)-based representation, and propose a novel model, QEMesh, for high-quality mesh generation. PoNQ divides the shape surface into tiny patches, each represented by a point with its normal and QEM matrix, which preserves fine local geometry information. In our QEMesh, we regard these elements as generable parameters and design a unique latent diffusion model containing a novel multi-decoder VAE for PoNQ parameter generation. Given the latent code generated by the diffusion model, three parameter decoders produce several PoNQ parameters within each voxel cell, and an occupancy decoder predicts which voxel cells contain parameters that form the final shape. Extensive evaluations demonstrate that our method generates results with watertight surfaces and is comparable to state-of-the-art methods in several main metrics.
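
For readers unfamiliar with the QEM matrix that PoNQ stores per patch, the sketch below shows the textbook quadric error: each plane contributes a 4x4 matrix, the matrices add up, and the error of a point is its summed squared distance to those planes. This is the general quadric error metric under the assumption of unit normals, not QEMesh's generation pipeline, and the function names are hypothetical.

```python
def plane_quadric(normal, d):
    """4x4 quadric p p^T for the plane n.x + d = 0, with p = (nx, ny, nz, d)."""
    p = (*normal, d)
    return [[a * b for b in p] for a in p]


def sum_quadrics(quadrics):
    """Element-wise sum of 4x4 quadrics; the error of the sum is the sum of errors."""
    return [[sum(q[i][j] for q in quadrics) for j in range(4)] for i in range(4)]


def quadric_error(q, point):
    """v^T Q v with v = (x, y, z, 1): summed squared distance to the stored planes."""
    v = (*point, 1.0)
    return sum(v[i] * q[i][j] * v[j] for i in range(4) for j in range(4))


# Two planes with unit normals: z = 0 and x = 0.
q = sum_quadrics([plane_quadric((0.0, 0.0, 1.0), 0.0), plane_quadric((1.0, 0.0, 0.0), 0.0)])
err_off = quadric_error(q, (2.0, 5.0, 3.0))  # 3^2 from z = 0 plus 2^2 from x = 0
err_on = quadric_error(q, (0.0, 7.0, 0.0))   # lies on both planes
```

Because one small matrix summarises an arbitrary number of local planes, a point that minimises this error sits close to all of them at once, which is why the representation can keep sharp local geometry.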

From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models

Chejian Xu,Wei Ping,Peng Xu,Zihan Liu,Boxin Wang,Mohammad Shoeybi,Bo Li,Bryan Catanzaro

Task: Propose an efficient training recipe for building ultra-long context large language models (LLMs), extending context length from 128K to 1M, 2M, and 4M tokens.

Motivation: Long-context capabilities are essential for applications such as document and video understanding, in-context learning, and inference-time scaling, which require models to process and reason over long sequences of text and multimodal data.

Details

Method: Use efficient continued pretraining to extend the context window, combined with effective instruction tuning to maintain instruction-following and reasoning abilities. Result: UltraLong-8B, built on Llama3.1-Instruct, achieves state-of-the-art performance across diverse long-context benchmarks while remaining competitive on standard benchmarks. Conclusion: The study provides a robust framework for efficiently scaling context length while preserving general model capabilities, and all model weights are released. Abstract: Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling, all of which require models to process and reason over long sequences of text and multimodal data. In this work, we introduce an efficient training recipe for building ultra-long context LLMs from an aligned instruct model, pushing the boundaries of context lengths from 128K to 1M, 2M, and 4M tokens. Our approach leverages efficient continued pretraining strategies to extend the context window and employs effective instruction tuning to maintain the instruction-following and reasoning abilities. Our UltraLong-8B, built on Llama3.1-Instruct with our recipe, achieves state-of-the-art performance across a diverse set of long-context benchmarks. Importantly, models trained with our approach maintain competitive performance on standard benchmarks, demonstrating balanced improvements for both long and short context tasks. We further provide an in-depth analysis of key design choices, highlighting the impacts of scaling strategies and data composition. Our findings establish a robust framework for efficiently scaling context lengths while preserving general model capabilities. We release all model weights at: https://ultralong.github.io/.

DDT: Decoupled Diffusion Transformer

Shuai Wang,Zhi Tian,Weilin Huang,Limin Wang

Task: Propose the Decoupled Diffusion Transformer (DDT) to resolve the optimization dilemma between semantic encoding and high-frequency decoding in diffusion transformers.

Motivation: Diffusion transformers deliver excellent generation quality but need many training iterations and inference steps, and encoding low-frequency semantics conflicts with decoding high-frequency components.

Details

Method: Design a decoupled architecture with a dedicated condition encoder for semantic extraction and a specialized velocity decoder, and use a statistical dynamic programming approach to optimize the sharing strategy. Result: Reaches FID scores of 1.31 and 1.28 on ImageNet 256×256 and 512×512 respectively, with nearly 4× faster training convergence and improved inference speed. Conclusion: Through its decoupled design and optimized sharing strategy, DDT substantially improves the performance and efficiency of diffusion transformers. Abstract: Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher frequency with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and high-frequency decoding. To resolve this challenge, we propose a new Decoupled Diffusion Transformer (DDT), with a decoupled design of a dedicated condition encoder for semantic extraction alongside a specialized velocity decoder. Our experiments reveal that a more substantial encoder yields performance improvements as model size increases. For ImageNet $256\times256$, our DDT-XL/2 achieves a new state-of-the-art performance of 1.31 FID (nearly $4\times$ faster training convergence compared to previous diffusion transformers). For ImageNet $512\times512$, our DDT-XL/2 achieves a new state-of-the-art FID of 1.28. Additionally, as a beneficial by-product, our decoupled architecture enhances inference speed by enabling the sharing of the self-condition between adjacent denoising steps. To minimize performance degradation, we propose a novel statistical dynamic programming approach to identify optimal sharing strategies.

Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs

Dongyang Fan,Vinko Sabolčec,Matin Ansaripour,Ayush Kumar Tarun,Martin Jaggi,Antoine Bosselut,Imanol Schlag

Task: Quantify the impact of web crawling opt-outs on large language model performance, and introduce the concept of the data compliance gap (DCG).

Motivation: Study how web crawling opt-outs affect the filtering of pretraining datasets, and how these restrictions affect model capabilities.

Details

Method: Measure the DCG in two settings, pretraining from scratch and continual pretraining, with experiments on 1.5B models. Result: As of January 2025, compliance has no significant effect on general knowledge acquisition (DCG close to 0%), but in specialized domains such as biomedicine, excluding major publishers reduces performance. Conclusion: General-purpose LLMs can be trained on fully open data, while specialized-domain performance may benefit from access to high-quality copyrighted data later in training; the findings provide empirical grounding for AI training practices and policy decisions. Abstract: The increasing adoption of web crawling opt-outs by copyright holders of online content raises critical questions about the impact of data compliance on large language model (LLM) performance. However, little is known about how these restrictions (and the resultant filtering of pretraining datasets) affect the capabilities of models trained using these corpora. In this work, we conceptualize this effect as the data compliance gap (DCG), which quantifies the performance difference between models trained on datasets that comply with web crawling opt-outs, and those that do not. We measure the data compliance gap in two settings: pretraining models from scratch and continual pretraining from existing compliant models (simulating a setting where copyrighted data could be integrated later in pretraining). Our experiments with 1.5B models show that, as of January 2025, compliance with web data opt-outs does not degrade general knowledge acquisition (close to 0% DCG). However, in specialized domains such as biomedical research, excluding major publishers leads to performance declines. These findings suggest that while general-purpose LLMs can be trained to perform equally well using fully open data, performance in specialized domains may benefit from access to high-quality copyrighted sources later in training. Our study provides empirical insights into the long-debated trade-off between data compliance and downstream model performance, informing future discussions on AI training practices and policy decisions.

Exploiting Temporal Audio-Visual Correlation Embedding for Audio-Driven One-Shot Talking Head Animation

Zhihua Xu,Tianshui Chen,Zhijing Yang,Siyuan Peng,Keze Wang,Liang Lin

Task: Propose a novel Temporal Audio-Visual Correlation Embedding (TAVCE) framework to enhance feature representation and generation quality in audio-driven one-shot talking head animation (ADOS-THA).

Motivation: The temporal relationship between adjacent audio clips is highly correlated with that between adjacent video frames, and this correlation can provide supplementary information and supervision for talking head animation.

Details

Method: Learn an audio-visual temporal correlation metric within the TAVCE framework, integrate the correlations through a channel attention mechanism to enhance feature representation, and use the alignment correlations as an additional supervision objective during training. Result: Outperforms existing leading algorithms on several public benchmarks (HDTF, LRW, VoxCeleb1, and VoxCeleb2). Conclusion: By making effective use of audio-visual temporal correlations, TAVCE significantly improves the quality of generated talking head animation. Abstract: The paramount challenge in audio-driven One-shot Talking Head Animation (ADOS-THA) lies in capturing subtle imperceptible changes between adjacent video frames. Inherently, the temporal relationship of adjacent audio clips is highly correlated with that of the corresponding adjacent video frames, offering supplementary information that can be pivotal for guiding and supervising talking head animations. In this work, we propose to learn audio-visual correlations and integrate the correlations to help enhance feature representation and regularize final generation by a novel Temporal Audio-Visual Correlation Embedding (TAVCE) framework. Specifically, it first learns an audio-visual temporal correlation metric, ensuring the temporal audio relationships of adjacent clips are aligned with the temporal visual relationships of corresponding adjacent video frames. Since the temporal audio relationship contains aligned information about the visual frame, we first integrate it to guide learning more representative features via a simple yet effective channel attention mechanism. During training, we also use the alignment correlations as an additional objective to supervise generating visual frames. We conduct extensive experiments on several publicly available benchmarks (i.e., HDTF, LRW, VoxCeleb1, and VoxCeleb2) to demonstrate its superiority over existing leading algorithms.

Encoder-Decoder Gemma: Improving the Quality-Efficiency Trade-Off via Adaptation

Biao Zhang,Fedor Moiseev,Joshua Ainslie,Paul Suganthan,Min Ma,Surya Bhupatiraju,Fede Lebron,Orhan Firat,Armand Joulin,Zhe Dong

Task: Study how to adapt decoder-only large language models (LLMs) into encoder-decoder models so as to combine the strengths of both.

Motivation: Encoder-decoder models are widely adopted in real-world applications for their inference efficiency and richer encoder representations, while decoder-only LLMs perform strongly; adaptation can inherit their capabilities and reduce the compute needed compared with pretraining from scratch.

Details

Method: Explore different pretraining objectives, parameter initialization, and optimization techniques, with experiments based on Gemma 2 and newly pretrained mT5-sized models. Result: Under a similar inference budget, encoder-decoder LLMs achieve comparable or better pretraining performance and substantially better finetuning performance than decoder-only models; for example, Gemma 2B-2B outperforms Gemma 2B by about 7% after instruction tuning. Conclusion: Adapting to encoder-decoder effectively combines the strengths of both, improves performance, and supports flexible combinations of model sizes; checkpoints will be released to facilitate research. Abstract: While decoder-only large language models (LLMs) have shown impressive results, encoder-decoder models are still widely adopted in real-world applications for their inference efficiency and richer encoder representation. In this paper, we study a novel problem: adapting pretrained decoder-only LLMs to encoder-decoder, with the goal of leveraging the strengths of both approaches to achieve a more favorable quality-efficiency trade-off. We argue that adaptation not only enables inheriting the capability of decoder-only LLMs but also reduces the demand for computation compared to pretraining from scratch. We rigorously explore different pretraining objectives and parameter initialization/optimization techniques. Through extensive experiments based on Gemma 2 (2B and 9B) and a suite of newly pretrained mT5-sized models (up to 1.6B), we demonstrate the effectiveness of adaptation and the advantage of encoder-decoder LLMs. Under similar inference budget, encoder-decoder LLMs achieve comparable (often better) pretraining performance but substantially better finetuning performance than their decoder-only counterpart. For example, Gemma 2B-2B outperforms Gemma 2B by ~7% after instruction tuning. Encoder-decoder adaptation also allows for flexible combination of different-sized models, where Gemma 9B-2B significantly surpasses Gemma 2B-2B by >3%. The adapted encoder representation also yields better results on SuperGLUE. We will release our checkpoints to facilitate future research.

When Less Is More: A Sparse Facial Motion Structure For Listening Motion Learning

Tri Tung Nguyen Nguyen,Quang Tien Dam,Dinh Tuan Tran,Joo-Ho Lee

Task: Propose a new sparse representation for predicting and representing non-verbal facial motion, to improve modeling of listening head behavior.

Motivation: Existing discrete motion token representations struggle to capture the temporal variance and multi-modal nature of non-verbal facial motion, which makes training inefficient and the generated motion low-fidelity.

Details

Method: Encode long sequences into a sparse sequence of keyframes and transition frames, identifying crucial motion steps and interpolating intermediate frames, so as to preserve the temporal structure of motion and enhance instance-wise diversity during learning. Result: Applied to listening head prediction, the representation improves the explanation of facial motion patterns. Conclusion: The sparse representation offers an efficient, high-fidelity approach to modeling non-verbal facial motion. Abstract: Effective human behavior modeling is critical for successful human-robot interaction. Current state-of-the-art approaches for predicting listening head behavior during dyadic conversations employ continuous-to-discrete representations, where a continuous facial motion sequence is converted into discrete latent tokens. However, non-verbal facial motion presents unique challenges owing to its temporal variance and multi-modal nature. State-of-the-art discrete motion token representation struggles to capture underlying non-verbal facial patterns, making training the listening head inefficient with low-fidelity generated motion. This study proposes a novel method for representing and predicting non-verbal facial motion by encoding long sequences into a sparse sequence of keyframes and transition frames. By identifying crucial motion steps and interpolating intermediate frames, our method preserves the temporal structure of motion while enhancing instance-wise diversity during the learning process. Additionally, we apply this novel sparse representation to the task of listening head prediction, demonstrating its contribution to improving the explanation of facial motion patterns.
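
The idea of keeping only crucial motion steps and interpolating the rest can be illustrated on a 1-D signal. The sketch below is an assumption-laden toy: it picks keyframes greedily wherever linear interpolation stops fitting within a tolerance, and rebuilds the sequence by linear interpolation. The abstract does not state how the paper selects keyframes or generates transition frames, and `select_keyframes`, `reconstruct`, and `tol` are hypothetical.

```python
def lerp(v0, v1, t):
    """Linear interpolation between v0 and v1 at fraction t in [0, 1]."""
    return v0 + (v1 - v0) * t


def select_keyframes(seq, tol=1e-6):
    """Greedy keyframe picking: close a segment when a straight line no longer fits it."""
    if len(seq) < 2:
        return list(range(len(seq)))
    keys, start = [0], 0
    for end in range(2, len(seq)):
        span = end - start
        fits = all(
            abs(lerp(seq[start], seq[end], (i - start) / span) - seq[i]) <= tol
            for i in range(start + 1, end)
        )
        if not fits:
            start = end - 1  # the previous frame becomes a keyframe
            keys.append(start)
    if keys[-1] != len(seq) - 1:
        keys.append(len(seq) - 1)  # always keep the final frame
    return keys


def reconstruct(keys, values):
    """Rebuild the dense sequence by interpolating between consecutive keyframes."""
    out = []
    for (k0, v0), (k1, v1) in zip(zip(keys, values), zip(keys[1:], values[1:])):
        out.extend(lerp(v0, v1, (i - k0) / (k1 - k0)) for i in range(k0, k1))
    out.append(values[-1])
    return out


# Toy motion curve: rise to a peak, then fall back (e.g. one nod).
motion = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
keys = select_keyframes(motion)
rebuilt = reconstruct(keys, [motion[k] for k in keys])
```

Seven frames collapse to three keyframes (start, peak, end) with no loss on this toy curve, which conveys why a sparse sequence can preserve temporal structure while being much shorter than the dense one.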

LExT: Towards Evaluating Trustworthiness of Natural Language Explanations

Krithi Shailya,Shreya Rajpal,Gokul S Krishnan,Balaraman Ravindran

Task: Propose a general framework (LExT) for quantifying the trustworthiness of natural language explanations, balancing plausibility and faithfulness.

Motivation: Using large language models (LLMs) in high-stakes domains such as healthcare calls for transparent and reliable explanations, yet existing evaluation metrics such as BLEU and ROUGE overlook key aspects such as factual accuracy, consistency, and faithfulness.

Details

Method: Propose the domain-agnostic LExT framework, which combines plausibility and faithfulness to evaluate the trustworthiness of model-generated explanations, and test six models on public medical datasets. Result: General-purpose models show inconsistencies in faithfulness but tend to outperform domain-specific models overall. Conclusion: The work highlights the importance of tailored evaluation frameworks in sensitive fields and lays a foundation for improving the trustworthiness and transparency of language models. Abstract: As Large Language Models (LLMs) become increasingly integrated into high-stakes domains, there have been several approaches proposed toward generating natural language explanations. These explanations are crucial for enhancing the interpretability of a model, especially in sensitive domains like healthcare, where transparency and reliability are key. In light of such explanations being generated by LLMs and their known concerns, there is a growing need for robust evaluation frameworks to assess model-generated explanations. Natural Language Generation metrics like BLEU and ROUGE capture syntactic and semantic accuracies but overlook other crucial aspects such as factual accuracy, consistency, and faithfulness. To address this gap, we propose a general framework for quantifying trustworthiness of natural language explanations, balancing Plausibility and Faithfulness, to derive a comprehensive Language Explanation Trustworthiness Score (LExT) (the code and setup to reproduce our experiments are publicly available at https://github.com/cerai-iitm/LExT). Applying our domain-agnostic framework to the healthcare domain using public medical datasets, we evaluate six models, including domain-specific and general-purpose models. Our findings demonstrate significant differences in their ability to generate trustworthy explanations. On comparing these explanations, we make interesting observations such as inconsistencies in Faithfulness demonstrated by general-purpose models and their tendency to outperform domain-specific fine-tuned models. This work further highlights the importance of using a tailored evaluation framework to assess natural language explanations in sensitive fields, providing a foundation for improving the trustworthiness and transparency of language models in healthcare and beyond.

InvNeRF-Seg: Fine-Tuning a Pre-Trained NeRF for 3D Object Segmentation

Jiangsan Zhao,Jakob Geipel,Krzysztof Kusnierek,Xuean Cui

Task: Propose InvNeRFSeg, a two-step fine-tuning strategy for producing high-quality segmented 3D point clouds directly from 2D segmentation masks.

Motivation: Existing NeRF approaches fall short for 3D scene segmentation: training NeRF directly on binary masks fails because color and shading cues are missing, and manually annotating 3D point clouds is time-consuming and labor-intensive.

Details

Method: Use a two-step fine-tuning strategy: first train a standard NeRF on RGB images, then fine-tune it with 2D segmentation masks without changing the model architecture or loss function. Result: InvNeRFSeg outperforms SA3D and FruitNeRF on synthetic fruit and real-world soybean datasets, producing higher-quality, cleaner segmented point clouds. Conclusion: InvNeRFSeg successfully extends 2D segmentation to high-quality 3D segmentation, offering a more efficient approach for downstream tasks. Abstract: Neural Radiance Fields (NeRF) have been widely adopted for reconstructing high-quality 3D point clouds from 2D RGB images. However, the segmentation of these reconstructed 3D scenes is more essential for downstream tasks such as object counting, size estimation, and scene understanding. While segmentation on raw 3D point clouds using deep learning requires labor-intensive and time-consuming manual annotation, directly training NeRF on binary masks also fails due to the absence of color and shading cues essential for geometry learning. We propose Invariant NeRF for Segmentation (InvNeRFSeg), a two-step, zero-change fine-tuning strategy for 3D segmentation. We first train a standard NeRF on RGB images and then fine-tune it using 2D segmentation masks without altering either the model architecture or loss function. This approach produces higher-quality, cleaner segmented point clouds directly from the refined radiance field with minimal computational overhead or complexity. Field density analysis reveals consistent semantic refinement: densities of object regions increase while background densities are suppressed, ensuring clean and interpretable segmentations. We demonstrate InvNeRFSeg's superior performance over both SA3D and FruitNeRF on both synthetic fruit and real-world soybean datasets. This approach effectively extends 2D segmentation to high-quality 3D segmentation.

Dr Web: a modern, query-based web data retrieval engine

Ylli Prifti,Alessandro Provetti,Pasquale de Meo

Task: Introduce and develop the Data Retrieval Web Engine (DR Web Engine), a tool for extracting structured data from web pages.

Motivation: Address engineering challenges such as dynamic content handling and messy data extraction, and highlight the tool's open-source potential.

Details

Method: Develop a flexible, modular tool driven by a simple query language. Result: The DR Web Engine was developed, and the steps for making it public are discussed. Conclusion: The DR Web Engine is an effective tool with open-source potential that can handle complex web data extraction tasks. Abstract: This article introduces the Data Retrieval Web Engine (also referred to as doctor web), a flexible and modular tool for extracting structured data from web pages using a simple query language. We discuss the engineering challenges addressed during its development, such as dynamic content handling and messy data extraction. Furthermore, we cover the steps for making the DR Web Engine public, highlighting its open source potential.

A Lightweight Multi-Module Fusion Approach for Korean Character Recognition

Inho Jake Park,Jaehoon Jay Jeong,Ho-Sang Jo

Task: Propose SDA-Net, a lightweight and efficient architecture for robust single-character recognition.

Motivation: Existing OCR models underperform in real-world scenarios, mainly because of irregular text layouts, poor image quality, character variability, and high computational cost.

Details

Method: SDA-Net combines a dual attention mechanism, a dynamic context encoding module, a U-Net-inspired feature fusion strategy, and a lightweight backbone. Result: Achieves state-of-the-art accuracy on challenging OCR benchmarks with significantly faster inference. Conclusion: SDA-Net is well suited to deployment in real-time and edge OCR systems. Abstract: Optical Character Recognition (OCR) is essential in applications such as document processing, license plate recognition, and intelligent surveillance. However, existing OCR models often underperform in real-world scenarios due to irregular text layouts, poor image quality, character variability, and high computational costs. This paper introduces SDA-Net (Stroke-Sensitive Attention and Dynamic Context Encoding Network), a lightweight and efficient architecture designed for robust single-character recognition. SDA-Net incorporates: (1) a Dual Attention Mechanism to enhance stroke-level and spatial feature extraction; (2) a Dynamic Context Encoding module that adaptively refines semantic information using a learnable gating mechanism; (3) a U-Net-inspired Feature Fusion Strategy for combining low-level and high-level features; and (4) a highly optimized lightweight backbone that reduces memory and computational demands. Experimental results show that SDA-Net achieves state-of-the-art accuracy on challenging OCR benchmarks, with significantly faster inference, making it well-suited for deployment in real-time and edge-based OCR systems.

Multimodal Quantitative Language for Generative Recommendation

Jianyang Zhai,Zi-Feng Mai,Chang-Dong Wang,Feidiao Yang,Xiawu Zheng,Hui Li,Yonghong Tian

Task: Propose MQL4GRec, a new approach to generative recommendation that translates multimodal item information into a unified language to improve recommendation performance.

Motivation: Existing methods fail to accommodate the differences between the general linguistic knowledge of pre-trained language models (PLMs) and the specific needs of recommender systems, and they overlook the complementary knowledge across an item's multimodal information.

Details

Method: Quantitative translators convert item content from different domains and modalities into a unified quantitative language; a series of generation tasks enriches that language with semantic information; pre-training followed by fine-tuning then transfers the knowledge to the recommendation task. Result: NDCG improves over the baseline by 11.18%, 14.82%, and 7.95% on three datasets, respectively. Conclusion: By combining a unified language with multimodal knowledge transfer, MQL4GRec substantially improves generative recommendation performance. Abstract: Generative recommendation has emerged as a promising paradigm aiming at directly generating the identifiers of the target candidates. Most existing methods attempt to leverage prior knowledge embedded in Pre-trained Language Models (PLMs) to improve the recommendation performance. However, they often fail to accommodate the differences between the general linguistic knowledge of PLMs and the specific needs of recommendation systems. Moreover, they rarely consider the complementary knowledge between the multimodal information of items, which represents the multi-faceted preferences of users. To facilitate efficient recommendation knowledge transfer, we propose a novel approach called Multimodal Quantitative Language for Generative Recommendation (MQL4GRec). Our key idea is to transform items from different domains and modalities into a unified language, which can serve as a bridge for transferring recommendation knowledge. Specifically, we first introduce quantitative translators to convert the text and image content of items from various domains into a new and concise language, known as quantitative language, with all items sharing the same vocabulary. Then, we design a series of quantitative language generation tasks to enrich quantitative language with semantic information and prior knowledge. Finally, we achieve the transfer of recommendation knowledge from different domains and modalities to the recommendation task through pre-training and fine-tuning. We evaluate the effectiveness of MQL4GRec through extensive experiments and comparisons with existing methods, achieving improvements over the baseline by 11.18%, 14.82%, and 7.95% on the NDCG metric across three different datasets, respectively.

Transferable Mask Transformer: Cross-domain Semantic Segmentation with Region-adaptive Transferability Estimation

Enming Zhang,Zhengyu Li,Yanru Wu,Jingge Wang,Yang Tan,Ruizhe Zhao,Guan Wang,Yang Li

Task: Propose Transferable Mask Transformer (TMT), a region-level adaptation framework that addresses the performance drop of Vision Transformers (ViTs) in cross-domain semantic segmentation.

Motivation: Distribution shifts between source and target domains (differences in texture, scale, or object co-occurrence patterns) cause pretrained ViTs to degrade significantly during cross-domain adaptation, leaving global attention suboptimal.

Details

Method: TMT has two key components: an Adaptive Cluster-based Transferability Estimator (ACTE) that dynamically segments images and assesses local transferability, and a Transferable Masked Attention (TMA) module that integrates region-specific transferability maps into the ViT attention mechanism. Result: Across 20 cross-domain pairs, TMT improves MIoU by an average of 2% over vanilla fine-tuning and by 1.28% over state-of-the-art baselines. Conclusion: Through region-level adaptation and dynamic transferability analysis, TMT substantially improves ViT performance in cross-domain semantic segmentation. Abstract: Recent advances in Vision Transformers (ViTs) have set new benchmarks in semantic segmentation. However, when adapting pretrained ViTs to new target domains, significant performance degradation often occurs due to distribution shifts, resulting in suboptimal global attention. Since self-attention mechanisms are inherently data-driven, they may fail to effectively attend to key objects when source and target domains exhibit differences in texture, scale, or object co-occurrence patterns. While global and patch-level domain adaptation methods provide partial solutions, region-level adaptation with dynamically shaped regions is crucial due to spatial heterogeneity in transferability across different image areas. We present Transferable Mask Transformer (TMT), a novel region-level adaptation framework for semantic segmentation that aligns cross-domain representations through spatial transferability analysis. TMT consists of two key components: (1) An Adaptive Cluster-based Transferability Estimator (ACTE) that dynamically segments images into structurally and semantically coherent regions for localized transferability assessment, and (2) A Transferable Masked Attention (TMA) module that integrates region-specific transferability maps into ViTs' attention mechanisms, prioritizing adaptation in regions with low transferability and high semantic uncertainty. Comprehensive evaluations across 20 cross-domain pairs demonstrate TMT's superiority, achieving an average 2% MIoU improvement over vanilla fine-tuning and a 1.28% increase compared to state-of-the-art baselines. The source code will be publicly available.

On Synthesizing Data for Context Attribution in Question Answering

Gorjan Radevski,Kiril Gashteovski,Shahbaz Syed,Christopher Malon,Sebastien Nicolas,Chia-Chien Hung,Timo Sztyler,Verena Heußer,Wiem Ben Rim,Masafumi Enomoto,Kunihiro Takeoka,Masafumi Oyamada,Goran Glavaš,Carolin Lawrence

Task: Study LLM-based approaches to context attribution, including zero-shot inference, LLM ensembling, and fine-tuning of small models.

Motivation: Address the false or misleading answers (hallucinations) that LLMs can produce in question answering, and thereby improve their trustworthiness.

Details

Method: Propose SynQA, a strategy in which an LLM generates QA pairs supported by selected context sentences; the resulting data is used to fine-tune small models. Result: Attribution data synthesized via SynQA is highly effective across different QA tasks and domains, and a user study validates the usefulness of the fine-tuned small models. Conclusion: SynQA effectively improves small models on context attribution, strengthening the trustworthiness of LLM answers. Abstract: Question Answering (QA) accounts for a significant portion of LLM usage "in the wild". However, LLMs sometimes produce false or misleading responses, also known as "hallucinations". Therefore, grounding the generated answers in contextually provided information -- i.e., providing evidence for the generated text -- is paramount for LLMs' trustworthiness. Providing this information is the task of context attribution. In this paper, we systematically study LLM-based approaches for this task, namely we investigate (i) zero-shot inference, (ii) LLM ensembling, and (iii) fine-tuning of small LMs on synthetic data generated by larger LLMs. Our key contribution is SynQA: a novel generative strategy for synthesizing context attribution data. Given selected context sentences, an LLM generates QA pairs that are supported by these sentences. This leverages LLMs' natural strengths in text generation while ensuring clear attribution paths in the synthetic training data. We show that the attribution data synthesized via SynQA is highly effective for fine-tuning small LMs for context attribution in different QA tasks and domains. Finally, with a user study, we validate the usefulness of small LMs (fine-tuned on synthetic data from SynQA) in context attribution for QA.

FASR-Net: Unsupervised Shadow Removal Leveraging Inherent Frequency Priors

Tao Lin,Qingwang Wang,Qiwei Liang,Minghua Tang,Yuxuan Sun

Task: Propose FASR-Net, an unsupervised frequency-aware network for shadow removal.

Motivation: Existing unsupervised methods overlook shadow-specific priors, which leads to incomplete shadow recovery.

Details

Method: FASR-Net combines a Wavelet Attention Downsampling Module (WADM) with several new loss functions (a frequency loss, a brightness-chromaticity loss, and an alignment loss). Result: It achieves superior shadow removal performance on the AISTD and SRD datasets. Conclusion: By exploiting frequency characteristics and the new losses, FASR-Net effectively improves shadow removal. Abstract: Shadow removal is challenging due to the complex interaction of geometry, lighting, and environmental factors. Existing unsupervised methods often overlook shadow-specific priors, leading to incomplete shadow recovery. To address this issue, we propose a novel unsupervised Frequency Aware Shadow Removal Network (FASR-Net), which leverages the inherent frequency characteristics of shadow regions. Specifically, the proposed Wavelet Attention Downsampling Module (WADM) integrates wavelet-based image decomposition and deformable attention, effectively breaking down the image into frequency components to enhance shadow details within specific frequency bands. We also introduce several new loss functions for precise shadow-free image reproduction: a frequency loss to capture image component details, a brightness-chromaticity loss that references the chromaticity of shadow-free regions, and an alignment loss to ensure smooth transitions between shadowed and shadow-free regions. Experimental results on the AISTD and SRD datasets demonstrate that our method achieves superior shadow removal performance.

Multi-Perspective Attention Mechanism for Bias-Aware Sequential Recommendation

Mingjian Fu,Hengsheng Chen,Dongchun Jiang,Yanchao Tan

Task: Propose MABSRec, a multi-perspective attention bias sequential recommender based on sequential information and attention mechanisms, to address the bias amplification that conventional sequential recommendation algorithms ignore.

Motivation: Traditional recommender systems struggle to capture the dynamic evolution of user behavior, and existing sequential recommendation algorithms ignore bias amplification, which leaves recommendations susceptible to the Matthew Effect and limits the system's ability to perceive shifts in user preferences.

Details

Method: User sequences are reconstructed into three short types, graph neural networks are used for item weighting, and an adaptive multi-bias perspective attention module is proposed to improve recommendation accuracy. Result: Experiments show that MABSRec has significant advantages on all evaluation metrics. Conclusion: MABSRec performs well on sequential recommendation and effectively addresses bias amplification. Abstract: In the era of advancing information technology, recommender systems have emerged as crucial tools for dealing with information overload. However, traditional recommender systems still have limitations in capturing the dynamic evolution of user behavior. To better understand and predict user behavior, especially taking into account the complexity of temporal evolution, sequential recommender systems have gradually become the focus of research. Currently, many sequential recommendation algorithms ignore the amplification effects of prevalent biases, which leads to recommendation results being susceptible to the Matthew Effect. Additionally, it will impose limitations on the recommender system's ability to deeply perceive and capture the dynamic shifts in user preferences, thereby diminishing the extent of its recommendation reach. To address this issue effectively, we propose a recommendation system based on sequential information and attention mechanism called Multi-Perspective Attention Bias Sequential Recommendation (MABSRec). Firstly, we reconstruct user sequences into three short types and utilize graph neural networks for item weighting. Subsequently, an adaptive multi-bias perspective attention module is proposed to enhance the accuracy of recommendations. Experimental results show that the MABSRec model exhibits significant advantages in all evaluation metrics, demonstrating its excellent performance in the sequence recommendation task.

MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models

Pengfei Zhou,Fanrui Zhang,Xiaopeng Peng,Zhaopan Xu,Jiaxin Ai,Yansheng Qiu,Chuanhao Li,Zhen Li,Ming Li,Yukang Feng,Jianwen Sun,Haoquan Zhang,Zizhen Li,Xiaofeng Mao,Wangbo Zhao,Kai Wang,Xiaojun Chang,Wenqi Shao,Yang You,Kaipeng Zhang

Task: Evaluate the multimodal reasoning capabilities of Multimodal Large Language Models (MLLMs) and fill gaps in existing benchmarks.

Motivation: Multimodal reasoning is central to human intelligence and a key step toward artificial general intelligence, yet current MLLM evaluation suffers from small data size, narrow domain coverage, and unstructured knowledge distribution.

Details

Method: Introduce MDK12-Bench, a multi-disciplinary benchmark built from real-world K-12 examinations that spans six disciplines and contains 140K reasoning instances, together with a dynamic evaluation framework designed to mitigate data contamination. Result: Experiments reveal significant limitations of current MLLMs in multimodal reasoning. Conclusion: MDK12-Bench provides a platform for comprehensive MLLM evaluation and offers insights for developing next-generation models. Abstract: Multimodal reasoning, which integrates language and visual cues into problem solving and decision making, is a fundamental aspect of human intelligence and a crucial step toward artificial general intelligence. However, the evaluation of multimodal reasoning capabilities in Multimodal Large Language Models (MLLMs) remains inadequate. Most existing reasoning benchmarks are constrained by limited data size, narrow domain coverage, and unstructured knowledge distribution. To close these gaps, we introduce MDK12-Bench, a multi-disciplinary benchmark assessing the reasoning capabilities of MLLMs via real-world K-12 examinations. Spanning six disciplines (math, physics, chemistry, biology, geography, and information science), our benchmark comprises 140K reasoning instances across diverse difficulty levels from primary school to 12th grade. It features 6,827 instance-level knowledge point annotations based on a well-organized knowledge structure, detailed answer explanations, difficulty labels and cross-year partitions, providing a robust platform for comprehensive evaluation. Additionally, we present a novel dynamic evaluation framework to mitigate data contamination issues by bootstrapping question forms, question types, and image styles during evaluation. Extensive experiment on MDK12-Bench reveals the significant limitation of current MLLMs in multimodal reasoning. The findings on our benchmark provide insights into the development of the next-generation models. Our data and codes are available at https://github.com/LanceZPF/MDK12.

Hybrid Retrieval for Hallucination Mitigation in Large Language Models: A Comparative Analysis

Chandana Sree Mala,Gizem Gezici,Fosca Giannotti

Task: Evaluate how different retrieval methods in Retrieval Augmented Generation (RAG) systems affect hallucination reduction in Large Language Models (LLMs).

Motivation: LLMs excel at language comprehension and generation but are prone to factually incorrect or unsupported outputs; RAG systems mitigate this by grounding responses in external knowledge.

Details

Method: Compare three retrieval approaches: sparse retrieval based on BM25, dense semantic retrieval using Sentence Transformers, and a proposed hybrid retrieval module that combines query expansion with dynamically weighted fusion. Result: The hybrid retriever achieves the best relevance scores, significantly lowers the hallucination rate, and improves LLM reliability. Conclusion: Hybrid retrieval effectively improves retrieval relevance, reduces hallucinations, and increases the accuracy of LLM responses. Abstract: Large Language Models (LLMs) excel in language comprehension and generation but are prone to hallucinations, producing factually incorrect or unsupported outputs. Retrieval Augmented Generation (RAG) systems address this issue by grounding LLM responses with external knowledge. This study evaluates the relationship between retriever effectiveness and hallucination reduction in LLMs using three retrieval approaches: sparse retrieval based on BM25 keyword search, dense retrieval using semantic search with Sentence Transformers, and a proposed hybrid retrieval module. The hybrid module incorporates query expansion and combines the results of sparse and dense retrievers through a dynamically weighted Reciprocal Rank Fusion score. Using the HaluBench dataset, a benchmark for hallucinations in question answering tasks, we assess retrieval performance with metrics such as mean average precision and normalised discounted cumulative gain, focusing on the relevance of the top three retrieved documents. Results show that the hybrid retriever achieves better relevance scores, outperforming both sparse and dense retrievers. Further evaluation of LLM-generated answers against ground truth using metrics such as accuracy, hallucination rate, and rejection rate reveals that the hybrid retriever achieves the highest accuracy on fails, the lowest hallucination rate, and the lowest rejection rate. These findings highlight the hybrid retriever's ability to enhance retrieval relevance, reduce hallucination rates, and improve LLM reliability, emphasising the importance of advanced retrieval techniques in mitigating hallucinations and improving response accuracy.
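
The fusion step named in the abstract above, a weighted Reciprocal Rank Fusion (RRF) over sparse and dense rankings, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `weighted_rrf`, the smoothing constant `k = 60` (a commonly used RRF default), and the way the weights are supplied are all assumptions, since the abstract does not say how the dynamic weights are computed.

```python
from typing import Dict, List


def weighted_rrf(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,
) -> List[str]:
    """Fuse several ranked lists of document ids with weighted RRF.

    Each retriever contributes weight / (k + rank) for every document it
    returns, with ranks starting at 1. How the weights are chosen per query
    is left to the caller (hypothetical; not specified in the abstract).
    """
    scores: Dict[str, float] = {}
    for retriever, ranked_docs in rankings.items():
        weight = weights.get(retriever, 1.0)
        for rank, doc_id in enumerate(ranked_docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    rankings = {
        "sparse": ["a", "b", "c"],  # e.g. a BM25 ranking
        "dense": ["b", "c", "a"],   # e.g. an embedding-similarity ranking
    }
    print(weighted_rrf(rankings, {"sparse": 1.0, "dense": 1.0}))
```

Setting a retriever's weight to zero reduces the fusion to the other retriever's ranking, which makes the effect of the weighting easy to check.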

Video Flow as Time Series: Discovering Temporal Consistency and Variability for VideoQA

Zijie Song,Zhenzhen Hu,Yixiao Ma,Jia Li,Richang Hong

Task: Propose Temporal Trio Transformer (T3T), a new architecture for video question answering.

Motivation: Conventional Transformer architectures simplify temporal dynamics when integrating multimodal data and fail to capture non-linear interactions within video sequences.

Details

Method: T3T comprises three key modules: Temporal Smoothing (TS), Temporal Difference (TD), and Temporal Fusion (TF), which respectively capture continuous temporal transitions, capture significant temporal variations, and fuse temporal features with textual cues. Result: T3T performs well on multiple VideoQA benchmark datasets. Conclusion: Nuanced temporal modeling is essential for improving the accuracy and depth of video question answering. Abstract: Video Question Answering (VideoQA) is a complex video-language task that demands a sophisticated understanding of both visual content and temporal dynamics. Traditional Transformer-style architectures, while effective in integrating multimodal data, often simplify temporal dynamics through positional encoding and fail to capture non-linear interactions within video sequences. In this paper, we introduce the Temporal Trio Transformer (T3T), a novel architecture that models time consistency and time variability. The T3T integrates three key components: Temporal Smoothing (TS), Temporal Difference (TD), and Temporal Fusion (TF). The TS module employs Brownian Bridge for capturing smooth, continuous temporal transitions, while the TD module identifies and encodes significant temporal variations and abrupt changes within the video content. Subsequently, the TF module synthesizes these temporal features with textual cues, facilitating a deeper contextual understanding and response accuracy. The efficacy of the T3T is demonstrated through extensive testing on multiple VideoQA benchmark datasets. Our results underscore the importance of a nuanced approach to temporal modeling in improving the accuracy and depth of video-based question answering.

Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression

Ivan Ilin,Peter Richtarik

Task: Propose Thanos, a new weight-pruning algorithm that reduces the memory footprint of large language models and improves their computational efficiency.

Motivation: Enable deployment of large language models in resource-constrained environments by pruning redundant weights while preserving model accuracy.

Details

Method: A block-wise pruning strategy with adaptive masks that adjust dynamically to weight importance, supporting flexible sparsity patterns and structured formats such as $n:m$ sparsity. Result: Experiments show that Thanos achieves state-of-the-art performance in structured pruning and outperforms existing methods in unstructured pruning. Conclusion: Thanos offers an efficient and adaptable compression approach for deploying large models in resource-constrained environments. Abstract: This paper presents Thanos, a novel weight-pruning algorithm designed to reduce the memory footprint and enhance the computational efficiency of large language models (LLMs) by removing redundant weights while maintaining accuracy. Thanos introduces a block-wise pruning strategy with adaptive masks that dynamically adjust to weight importance, enabling flexible sparsity patterns and structured formats, such as $n:m$ sparsity, optimized for hardware acceleration. Experimental evaluations demonstrate that Thanos achieves state-of-the-art performance in structured pruning and outperforms existing methods in unstructured pruning. By providing an efficient and adaptable approach to model compression, Thanos offers a practical solution for deploying large models in resource-constrained environments.
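
To make the $n:m$ structured format mentioned above concrete, the sketch below builds a pruning mask in which every group of `m` consecutive weights keeps a fixed number of entries. It uses plain magnitude as the importance score, which is a deliberate simplification and not the Thanos algorithm; the function name and the `keep` parameter are hypothetical. Papers differ on whether $n$ counts the zeroed or the retained weights, so the parameter here states the number kept explicitly (for the common 2:4 pattern both readings coincide).

```python
from typing import List


def group_magnitude_mask(weights: List[float], keep: int, m: int) -> List[int]:
    """Return a 0/1 mask keeping the `keep` largest-magnitude weights
    in every group of `m` consecutive weights.

    Magnitude is used as a stand-in importance score (an assumption for
    illustration only).
    """
    if m <= 0 or len(weights) % m != 0:
        raise ValueError("len(weights) must be a positive multiple of m")
    if not 0 <= keep <= m:
        raise ValueError("keep must lie between 0 and m")

    mask: List[int] = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Positions sorted by descending magnitude; ties favor the lower index.
        order = sorted(range(m), key=lambda i: (-abs(group[i]), i))
        kept = set(order[:keep])
        mask.extend(1 if i in kept else 0 for i in range(m))
    return mask


if __name__ == "__main__":
    row = [0.1, -0.9, 0.5, 0.05, 2.0, -0.2, 0.3, -1.5]
    mask = group_magnitude_mask(row, keep=2, m=4)  # 2:4 pattern
    pruned = [w * bit for w, bit in zip(row, mask)]
    print(mask, pruned)
```

Because every group has the same number of non-zero entries, the pruned row can be stored in a compact fixed-layout form, which is what makes such patterns attractive for hardware acceleration.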

How to Enable LLM with 3D Capacity? A Survey of Spatial Reasoning in LLM

Jirong Zha,Yuxuan Fan,Xiao Yang,Chen Gao,Xinlei Chen

Task: Survey methods that integrate Large Language Models (LLMs) with 3D spatial understanding tasks.

Motivation: 3D spatial understanding is essential in robotics, autonomous vehicles, virtual reality, and medical imaging, and LLMs, having succeeded across many domains, show potential to surpass traditional computer vision methods.

Details

Method: Propose a taxonomy that groups existing methods into three branches (image-based, point cloud-based, and hybrid modality-based) and systematically review representative methods in each. Result: The survey summarizes data representations, architectural modifications, and training strategies, and discusses current limitations and future research directions. Conclusion: LLMs show promise for 3D understanding tasks, but dataset scarcity and computational challenges remain to be solved. Abstract: 3D spatial understanding is essential in real-world applications such as robotics, autonomous vehicles, virtual reality, and medical imaging. Recently, Large Language Models (LLMs), having demonstrated remarkable success across various domains, have been leveraged to enhance 3D understanding tasks, showing potential to surpass traditional computer vision methods. In this survey, we present a comprehensive review of methods integrating LLMs with 3D spatial understanding. We propose a taxonomy that categorizes existing methods into three branches: image-based methods deriving 3D understanding from 2D visual data, point cloud-based methods working directly with 3D representations, and hybrid modality-based methods combining multiple data streams. We systematically review representative methods along these categories, covering data representations, architectural modifications, and training strategies that bridge textual and 3D modalities. Finally, we discuss current limitations, including dataset scarcity and computational challenges, while highlighting promising research directions in spatial perception, multi-modal fusion, and real-world applications.

Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification

Anqi Zhang,Yulin Chen,Jane Pan,Chen Zhao,Aurojit Panda,Jinyang Li,He He

Task: Study whether reasoning models can evaluate the correctness of their intermediate answers during reasoning.

Motivation: Reasoning models perform well on math and logical reasoning tasks but suffer from overthinking, carrying out unnecessary reasoning steps even after reaching the correct answer.

Details

Method: Probe the model's hidden states to test whether they encode information about answer correctness, and use the probe to verify intermediate answers. Result: The probe verifies intermediate answers with high accuracy and produces highly calibrated scores; hidden states also encode the correctness of future answers, enabling early prediction. Using the probe as a verifier allows early exit during inference, cutting inference tokens by 24% without hurting performance. Conclusion: Reasoning models do encode correctness information but fail to exploit it, revealing untapped potential to improve their efficiency. Abstract: Reasoning models have achieved remarkable performance on tasks like math and logical reasoning thanks to their ability to search during reasoning. However, they still suffer from overthinking, often performing unnecessary reasoning steps even after reaching the correct answer. This raises the question: can models evaluate the correctness of their intermediate answers during reasoning? In this work, we study whether reasoning models encode information about answer correctness through probing the model's hidden states. The resulting probe can verify intermediate answers with high accuracy and produces highly calibrated scores. Additionally, we find models' hidden states encode correctness of future answers, enabling early prediction of the correctness before the intermediate answer is fully formulated. We then use the probe as a verifier to decide whether to exit reasoning at intermediate answers during inference, reducing the number of inference tokens by 24% without compromising performance. These findings confirm that reasoning models do encode a notion of correctness yet fail to exploit it, revealing substantial untapped potential to enhance their efficiency.
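
The early-exit idea described above can be sketched as a small decision rule: score the hidden state at each intermediate answer with a probe, and stop at the first answer whose score clears a threshold. Everything concrete here is an assumption for illustration: the probe is taken to be a logistic-regression layer with already trained parameters `w` and `b`, hidden states are plain Python lists, and the threshold value is arbitrary. The abstract does not specify the probe architecture or the exit rule.

```python
import math
from typing import List, Sequence


def probe_confidence(hidden: Sequence[float], w: Sequence[float], b: float) -> float:
    """Logistic probe: estimated probability that the current answer is correct."""
    logit = sum(h * wi for h, wi in zip(hidden, w)) + b
    return 1.0 / (1.0 + math.exp(-logit))


def early_exit_index(
    hidden_states: List[Sequence[float]],
    w: Sequence[float],
    b: float,
    threshold: float = 0.85,
) -> int:
    """Index of the first intermediate answer whose probe score reaches
    `threshold`; falls back to the last answer if none does."""
    if not hidden_states:
        raise ValueError("need at least one intermediate answer")
    for i, hidden in enumerate(hidden_states):
        if probe_confidence(hidden, w, b) >= threshold:
            return i
    return len(hidden_states) - 1


if __name__ == "__main__":
    # Toy 2-dimensional hidden states, one per intermediate answer.
    states = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [0.0, 1.0]]
    w, b = [1.0, -1.0], 0.0  # hypothetical trained probe parameters
    print(early_exit_index(states, w, b, threshold=0.9))
```

A well-calibrated probe is what makes a fixed threshold meaningful: the threshold then trades off tokens saved against the risk of stopping on a wrong answer.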

Leveraging Synthetic Adult Datasets for Unsupervised Infant Pose Estimation

Sarosij Bose,Hannah Dela Cruz,Arindam Dutta,Elena Kokkoni,Konstantinos Karydis,Amit K. Roy-Chowdhury

Task: Propose SHIFT, an unsupervised infant pose estimation method that uses synthetic adult datasets and pseudo-labeling to address scarce labeled data and distribution shift.

Motivation: Existing infant pose estimation algorithms depend on large amounts of labeled data and generalize poorly, which limits their use in healthcare and related fields.

Details

Method: A pseudo-labeling-based Mean-Teacher framework combined with an infant pose prior and a visibility consistency module to improve the accuracy and robustness of pose estimation. Result: On multiple benchmarks, SHIFT significantly outperforms existing unsupervised and supervised methods, by 5% and 16% respectively. Conclusion: Through unsupervised learning and the use of synthetic data, SHIFT effectively addresses the lack of labeled data and the generalization problem in infant pose estimation. Abstract: Human pose estimation is a critical tool across a variety of healthcare applications. Despite significant progress in pose estimation algorithms targeting adults, such developments for infants remain limited. Existing algorithms for infant pose estimation, despite achieving commendable performance, depend on fully supervised approaches that require large amounts of labeled data. These algorithms also struggle with poor generalizability under distribution shifts. To address these challenges, we introduce SHIFT: Leveraging SyntHetic Adult Datasets for Unsupervised InFanT Pose Estimation, which leverages the pseudo-labeling-based Mean-Teacher framework to compensate for the lack of labeled data and addresses distribution shifts by enforcing consistency between the student and the teacher pseudo-labels. Additionally, to penalize implausible predictions obtained from the mean-teacher framework, we incorporate an infant manifold pose prior. To enhance SHIFT's self-occlusion perception ability, we propose a novel visibility consistency module for improved alignment of the predicted poses with the original image. Extensive experiments on multiple benchmarks show that SHIFT significantly outperforms existing state-of-the-art unsupervised domain adaptation (UDA) pose estimation methods by 5% and supervised infant pose estimation methods by a margin of 16%. The project page is available at: https://sarosijbose.github.io/SHIFT.
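
The Mean-Teacher framework referenced above keeps a teacher model whose weights are an exponential moving average (EMA) of the student's weights; the teacher then supplies the pseudo-labels. The sketch below shows only that generic EMA update, under the assumption that parameters are stored as named lists of floats and with an illustrative decay value. It is not SHIFT's training code, and the pose prior and visibility consistency module are not represented.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def ema_update(teacher: Params, student: Params, decay: float = 0.99) -> Params:
    """Return new teacher parameters:
    teacher <- decay * teacher + (1 - decay) * student.

    `decay` is an illustrative value, not one taken from the paper.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must be in [0, 1]")
    updated: Params = {}
    for name, t_values in teacher.items():
        s_values = student[name]
        if len(s_values) != len(t_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        updated[name] = [
            decay * t + (1.0 - decay) * s for t, s in zip(t_values, s_values)
        ]
    return updated


if __name__ == "__main__":
    teacher = {"w": [1.0, 0.0]}
    student = {"w": [0.0, 1.0]}
    # After one update the teacher moves a small step toward the student.
    print(ema_update(teacher, student, decay=0.9))
```

A decay close to 1 makes the teacher change slowly, which tends to give more stable pseudo-labels than reading them directly from the student.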

GraphRAFT: Retrieval Augmented Fine-Tuning for Knowledge Graphs on Graph Databases

Alfred Clemedtson,Borun Shi

Task: Propose GraphRAFT, a framework that fine-tunes large language models to generate provably correct Cypher queries, retrieving high-quality subgraph contexts from knowledge graphs and producing accurate answers.

Motivation: The retrieval step in existing GraphRAG methods is inadequate, which prevents them from being applied efficiently to graph databases that support graph query languages.

Details

Method: GraphRAFT fine-tunes LLMs to generate Cypher queries, retrieves high-quality subgraph contexts, and reasons over them to produce answers. Result: On two challenging question-answering tasks, GraphRAFT significantly outperforms all state-of-the-art models on four standard metrics. Conclusion: GraphRAFT is an efficient, scalable solution that can be applied directly to knowledge graphs stored in native graph databases. Abstract: Large language models have shown remarkable language processing and reasoning ability but are prone to hallucinate when asked about private data. Retrieval-augmented generation (RAG) retrieves relevant data that fit into an LLM's context window and prompts the LLM for an answer. GraphRAG extends this approach to structured Knowledge Graphs (KGs) and questions regarding entities multiple hops away. The majority of recent GraphRAG methods either overlook the retrieval step or have ad hoc retrieval processes that are abstract or inefficient. This prevents them from being adopted when the KGs are stored in graph databases supporting graph query languages. In this work, we present GraphRAFT, a retrieve-and-reason framework that finetunes LLMs to generate provably correct Cypher queries to retrieve high-quality subgraph contexts and produce accurate answers. Our method is the first such solution that can be taken off-the-shelf and used on KGs stored in native graph DBs. Benchmarks suggest that our method is sample-efficient and scales with the availability of training data. Our method achieves significantly better results than all state-of-the-art models across all four standard metrics on two challenging Q&As on large text-attributed KGs.

DefMamba: Deformable Visual State Space Model

Leiye Liu,Miao Zhang,Jihao Yin,Tingwei Liu,Wei Ji,Yongri Piao,Huchuan Lu

Task: Propose DefMamba, a new visual foundation model that addresses the inability of existing visual Mamba methods to fully exploit image spatial structure during feature extraction.

Motivation: Existing visual Mamba methods flatten images into 1D sequences, making it hard for the model to use the image's spatial structural information during feature extraction.

Details

Method: DefMamba consists of a multi-scale backbone and deformable Mamba (DM) blocks that dynamically adjust the scanning path to prioritize important information. Result: DefMamba achieves state-of-the-art performance on image classification, object detection, instance segmentation, and semantic segmentation. Conclusion: Its deformable scanning strategy substantially improves the model's ability to learn image structure, and it performs well across multiple vision tasks. Abstract: Recently, state space models (SSM), particularly Mamba, have attracted significant attention from scholars due to their ability to effectively balance computational efficiency and performance. However, most existing visual Mamba methods flatten images into 1D sequences using predefined scan orders, which results in the model being less capable of utilizing the spatial structural information of the image during the feature extraction process. To address this issue, we proposed a novel visual foundation model called DefMamba. This model includes a multi-scale backbone structure and deformable mamba (DM) blocks, which dynamically adjust the scanning path to prioritize important information, thus enhancing the capture and processing of relevant input features. By combining a deformable scanning (DS) strategy, this model significantly improves its ability to learn image structures and detects changes in object details. Numerous experiments have shown that DefMamba achieves state-of-the-art performance in various visual tasks, including image classification, object detection, instance segmentation, and semantic segmentation. The code is open source on DefMamba.

Evaluating the Generalization Capabilities of Large Language Models on Code Reasoning

Rem Yang,Julian Dai,Nikos Vasilakis,Martin Rinard

Task: Evaluate how well the code reasoning abilities of large language models (LLMs) generalize across different kinds of programs.

Motivation: Understand how LLMs generalize on code reasoning tasks, particularly when faced with programs of different origins and characteristics.

Details

Method: Present techniques for obtaining in- and out-of-distribution programs (code sampled from a domain-specific language, LLM-generated code, competitive programming code, and mutated versions of these) and an experimental methodology for evaluating LLM generalization. Result: An extensive evaluation of 10 recent models finds that earlier models behave in a way consistent with pattern matching, while the latest models show strong generalization on code reasoning. Conclusion: The latest LLMs generalize markedly well on code reasoning tasks, moving beyond the limitations of earlier models. Abstract: We assess how the code reasoning abilities of large language models (LLMs) generalize to different kinds of programs. We present techniques for obtaining in- and out-of-distribution programs with different characteristics: code sampled from a domain-specific language, code automatically generated by an LLM, code collected from competitive programming contests, and mutated versions of these programs. We also present an experimental methodology for evaluating LLM generalization by comparing their performance on these programs. We perform an extensive evaluation across 10 state-of-the-art models from the past year, obtaining insights into their generalization capabilities over time and across different classes of programs. Our results highlight that while earlier models exhibit behavior consistent with pattern matching, the latest models exhibit strong generalization abilities on code reasoning.

Robust Fusion Controller: Degradation-aware Image Fusion with Fine-grained Language Instructions

Hao Zhang,Yanping Zha,Qingwei Zhuang,Zhenfeng Shao,Jiayi Ma

Task: Propose a robust fusion controller (RFC) that achieves degradation-aware image fusion through fine-grained language instructions.

Motivation: Existing image fusion methods struggle to adapt to environments with diverse degradations that vary spatially.

Details

Method: RFC parses language instructions into a functional condition and a spatial condition, generates a composite control prior through a multi-condition coupling network, and modulates intermediate features in a hybrid attention-based fusion network. Result: Experiments on public datasets show that RFC is robust to a variety of composite degradations and performs especially well in flare scenarios. Conclusion: RFC achieves degradation-aware image fusion driven by language instructions and is suitable for complex environments. Abstract: Current image fusion methods struggle to adapt to real-world environments encompassing diverse degradations with spatially varying characteristics. To address this challenge, we propose a robust fusion controller (RFC) capable of achieving degradation-aware image fusion through fine-grained language instructions, ensuring its reliable application in adverse environments. Specifically, RFC first parses language instructions to innovatively derive the functional condition and the spatial condition, where the former specifies the degradation type to remove, while the latter defines its spatial coverage. Then, a composite control priori is generated through a multi-condition coupling network, achieving a seamless transition from abstract language instructions to latent control variables. Subsequently, we design a hybrid attention-based fusion network to aggregate multi-modal information, in which the obtained composite control priori is deeply embedded to linearly modulate the intermediate fused features. To ensure the alignment between language instructions and control outcomes, we introduce a novel language-feature alignment loss, which constrains the consistency between feature-level gains and the composite control priori. Extensive experiments on publicly available datasets demonstrate that our RFC is robust against various composite degradations, particularly in highly challenging flare scenarios.

Efficient Reinforcement Finetuning via Adaptive Curriculum Learning

Taiwei Shi,Yiyang Wu,Linxin Song,Tianyi Zhou,Jieyu Zhao

Task: Improve the efficiency and accuracy of reinforcement finetuning (RFT) through adaptive curriculum learning.

Motivation: RFT shows great potential for improving the mathematical reasoning of large language models (LLMs), but it is sample- and compute-inefficient.

Details

Method: Propose AdaRFT, which dynamically adjusts the difficulty of training problems based on the model's recent reward signals so that the model always trains on problems that are challenging but solvable. Result: Experiments on competition-level math datasets show that AdaRFT significantly improves training efficiency and reasoning performance, reducing the number of training steps while raising accuracy. Conclusion: AdaRFT provides a more efficient and scalable framework for RFT. Abstract: Reinforcement finetuning (RFT) has shown great potential for enhancing the mathematical reasoning capabilities of large language models (LLMs), but it is often sample- and compute-inefficient, requiring extensive training. In this work, we introduce AdaRFT (Adaptive Curriculum Reinforcement Finetuning), a method that significantly improves both the efficiency and final accuracy of RFT through adaptive curriculum learning. AdaRFT dynamically adjusts the difficulty of training problems based on the model's recent reward signals, ensuring that the model consistently trains on tasks that are challenging but solvable. This adaptive sampling strategy accelerates learning by maintaining an optimal difficulty range, avoiding wasted computation on problems that are too easy or too hard. AdaRFT requires only a lightweight extension to standard RFT algorithms like Proximal Policy Optimization (PPO), without modifying the reward function or model architecture. Experiments on competition-level math datasets (including AMC, AIME, and IMO-style problems) demonstrate that AdaRFT significantly improves both training efficiency and reasoning performance. We evaluate AdaRFT across multiple data distributions and model sizes, showing that it reduces the number of training steps by up to 2x and improves accuracy by a considerable margin, offering a more scalable and effective RFT framework.
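
The adaptive sampling loop described above can be sketched with two small functions: one moves a target difficulty up when recent rewards are high and down when they are low, and the other selects the problems closest to that target. The specific update rule (a proportional step toward a goal reward of 0.5), the 0 to 100 difficulty scale, and all function and parameter names are assumptions made for illustration; the abstract does not give AdaRFT's actual update formula.

```python
from typing import List


def update_target(
    target: float,
    avg_reward: float,
    goal_reward: float = 0.5,
    step: float = 10.0,
    lo: float = 0.0,
    hi: float = 100.0,
) -> float:
    """Raise the target difficulty when the model is succeeding more often
    than `goal_reward`, lower it otherwise, and clip to [lo, hi]."""
    new_target = target + step * (avg_reward - goal_reward)
    return max(lo, min(hi, new_target))


def pick_batch(difficulties: List[float], target: float, batch_size: int) -> List[int]:
    """Indices of the `batch_size` problems whose difficulty is closest to
    the target (ties broken by lower index)."""
    order = sorted(
        range(len(difficulties)),
        key=lambda i: (abs(difficulties[i] - target), i),
    )
    return order[:batch_size]


if __name__ == "__main__":
    difficulties = [10.0, 40.0, 55.0, 90.0, 60.0]  # hypothetical per-problem scores
    target = update_target(50.0, avg_reward=0.9)   # model is doing well: go harder
    print(target, pick_batch(difficulties, target, batch_size=2))
```

Because the curriculum only changes which problems are sampled, a loop like this can wrap an existing RFT trainer without touching the reward function or the model, which matches the lightweight-extension claim in the abstract.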

Storybooth: Training-free Multi-Subject Consistency for Improved Visual Storytelling

Jaskirat Singh,Junshen Kevin Chen,Jonas Kohler,Michael Cohen

Task: Propose StoryBooth, a training-free method for improving multi-character consistency in text-to-image generation.

Motivation: Existing methods rely on cross-frame self-attention, which suffers from self-attention leakage when handling multiple characters, causing the characters to interfere with one another.

Details

Method: Combine multi-modal chain-of-thought reasoning with region-based generation, and modify the diffusion model with a bounded cross-frame self-attention layer and a token-merging layer. Result: In both qualitative and quantitative experiments the method surpasses prior work, markedly improving consistency across multiple characters and fine-grained details. Conclusion: StoryBooth is an effective training-free method that resolves self-attention leakage in multi-character consistent generation. Abstract: Training-free consistent text-to-image generation depicting the same subjects across different images is a topic of widespread recent interest. Existing works in this direction predominantly rely on cross-frame self-attention, which improves subject-consistency by allowing tokens in each frame to pay attention to tokens in other frames during self-attention computation. While useful for single subjects, we find that it struggles when scaling to multiple characters. In this work, we first analyze the reason for these limitations. Our exploration reveals that the primary issue stems from self-attention leakage, which is exacerbated when trying to ensure consistency across multiple characters. This happens when tokens from one subject pay attention to other characters, causing them to appear like each other (e.g., a dog appearing like a duck). Motivated by these findings, we propose StoryBooth: a training-free approach for improving multi-character consistency. In particular, we first leverage multi-modal chain-of-thought reasoning and region-based generation to a priori localize the different subjects across the desired story outputs. The final outputs are then generated using a modified diffusion model which consists of two novel layers: 1) a bounded cross-frame self-attention layer for reducing inter-character attention leakage, and 2) a token-merging layer for improving consistency of fine-grain subject details. Through both qualitative and quantitative results we find that the proposed approach surpasses prior state-of-the-art, exhibiting improved consistency across both multiple characters and fine-grain subject details.

Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought

Yi Peng,Chris,Xiaokun Wang,Yichen Wei,Jiangbo Pei,Weijie Qiu,Ai Jian,Yunzhuo Hao,Jiachun Pan,Tianyidan Xie,Li Ge,Rongxian Zhuang,Xuchen Song,Yang Liu,Yahui Zhou

Task: Extend an R1-series large language model (LLM) to the visual modality, building the multimodal reasoning model Skywork R1V.

Motivation: Use an efficient multimodal transfer method to adapt a language model to vision seamlessly, without retraining either the foundational language model or the vision encoder.

Details

Method: Use a lightweight visual projector together with a hybrid optimization strategy (Iterative Supervised Fine-Tuning plus Group Relative Policy Optimization) and adaptive-length Chain-of-Thought distillation. Result: Skywork R1V scores 69.0 on MMMU and 67.5 on MathVista while maintaining strong textual reasoning performance on AIME and MATH500. Conclusion: With 38B parameters, Skywork R1V delivers competitive performance; the model weights have been publicly released to promote openness and reproducibility. Abstract: We introduce Skywork R1V, a multimodal reasoning model extending an R1-series large language model (LLM) to visual modalities via an efficient multimodal transfer method. Leveraging a lightweight visual projector, Skywork R1V facilitates seamless multimodal adaptation without necessitating retraining of either the foundational language model or the vision encoder. To strengthen visual-text alignment, we propose a hybrid optimization strategy that combines Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly enhancing cross-modal integration efficiency. Additionally, we introduce an adaptive-length Chain-of-Thought distillation approach for reasoning data generation. This approach dynamically optimizes reasoning chain lengths, thereby enhancing inference efficiency and preventing excessive overthinking. Empirical evaluations demonstrate that Skywork R1V, with only 38B parameters, delivers competitive performance, achieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista. Meanwhile, it maintains robust textual reasoning performance, evidenced by impressive scores of 72.0 on AIME and 94.0 on MATH500. The Skywork R1V model weights have been publicly released to promote openness and reproducibility.
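
The abstract names Group Relative Policy Optimization (GRPO) without describing it. As a rough illustration of the general idea only, and not of Skywork R1V's training code, GRPO scores each sampled response relative to the other responses drawn for the same prompt. The function name and the `eps` stabilizer below are assumptions made for this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is rewarded for being better than its siblings rather than
    for its absolute score. `eps` (an assumption of this sketch) avoids
    division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to one prompt: two judged correct, two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative, and the values sum to roughly zero within the group.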

Fast Sphericity and Roundness approximation in 2D and 3D using Local Thickness

Pawel Tomasz Pieta,Peter Winkel Rasumssen,Anders Bjorholm Dahl,Anders Nymark Christensen

Task: Propose an efficient method, based on a local thickness algorithm, for computing the sphericity and roundness of objects in 2D and 3D images.

Motivation: Computing sphericity and roundness under their strict definitions is costly, and as 2D and 3D microscopy datasets grow, more efficient algorithms are needed to quantify large numbers of objects.

Details

Method: Simplify the surface-area computation by modeling objects as spheroids/ellipses of varying length and width, and approximate corner curvature with local thickness values to compute roundness. Result: The proposed methods are significantly faster than existing implementations while remaining accurate. Conclusion: The approach offers an efficient and accurate way to compute sphericity and roundness in large-scale image datasets. Abstract: Sphericity and roundness are fundamental measures used for assessing object uniformity in 2D and 3D images. However, using their strict definition makes computation costly. As both 2D and 3D microscopy imaging datasets grow larger, there is an increased demand for efficient algorithms that can quantify multiple objects in large volumes. We propose a novel approach for extracting sphericity and roundness based on the output of a local thickness algorithm. For sphericity, we simplify the surface area computation by modeling objects as spheroids/ellipses of varying lengths and widths of mean local thickness. For roundness, we avoid a complex corner curvature determination process by approximating it with local thickness values on the contour/surface of the object. The resulting methods provide an accurate representation of the exact measures while being significantly faster than their existing implementations.
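
To make the spheroid simplification concrete, the sketch below evaluates the classical (Wadell) sphericity, the surface area of an equal-volume sphere divided by the object's surface area, for a spheroid with known semi-axes. It uses the closed-form spheroid surface-area formulas. How the paper derives the axes from local thickness is not reproduced here, and the function names are assumptions of this sketch.

```python
import math


def spheroid_surface_area(a: float, c: float) -> float:
    """Surface area of a spheroid with equatorial semi-axis a and polar semi-axis c."""
    if math.isclose(a, c):
        return 4.0 * math.pi * a * a  # sphere
    if c > a:
        # Prolate spheroid (elongated along the polar axis).
        e = math.sqrt(1.0 - (a * a) / (c * c))
        return 2.0 * math.pi * a * a * (1.0 + (c / (a * e)) * math.asin(e))
    # Oblate spheroid (flattened along the polar axis).
    e = math.sqrt(1.0 - (c * c) / (a * a))
    return 2.0 * math.pi * a * a + math.pi * (c * c / e) * math.log((1.0 + e) / (1.0 - e))


def spheroid_sphericity(a: float, c: float) -> float:
    """Wadell sphericity: pi^(1/3) * (6V)^(2/3) / A, equal to 1 for a sphere."""
    volume = 4.0 / 3.0 * math.pi * a * a * c
    area = spheroid_surface_area(a, c)
    return math.pi ** (1.0 / 3.0) * (6.0 * volume) ** (2.0 / 3.0) / area
```

A sphere (`a == c`) gives a sphericity of 1, and the value falls below 1 as the spheroid becomes more elongated or more flattened.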

ShadowCoT: Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs

Gejian Zhao,Hanzhou Wu,Xinpeng Zhang,Athanasios V. Vasilakos

Task: Propose ShadowCoT, a backdoor attack framework that targets the internal reasoning mechanism of LLMs.

Motivation: Chain-of-Thought (CoT) improves LLMs' complex reasoning but also introduces new security issues, so threats against the reasoning mechanism need to be studied.

Details

Method: A multi-stage injection pipeline selectively rewires attention pathways and perturbs intermediate representations, combined with reinforcement learning and reasoning chain pollution (RCP) to generate stealthy adversarial CoTs. Result: Across diverse reasoning benchmarks and LLMs, ShadowCoT achieves a high Attack Success Rate (94.4%) and Hijacking Success Rate (88.4%) while preserving benign performance. Conclusion: The work reveals a new class of cognition-level threats and highlights the need for defenses beyond shallow surface-level consistency. Abstract: Chain-of-Thought (CoT) enhances an LLM's ability to perform complex reasoning tasks, but it also introduces new security issues. In this work, we present ShadowCoT, a novel backdoor attack framework that targets the internal reasoning mechanism of LLMs. Unlike prior token-level or prompt-based attacks, ShadowCoT directly manipulates the model's cognitive reasoning path, enabling it to hijack multi-step reasoning chains and produce logically coherent but adversarial outcomes. By conditioning on internal reasoning states, ShadowCoT learns to recognize and selectively disrupt key reasoning steps, effectively mounting a self-reflective cognitive attack within the target model. Our approach introduces a lightweight yet effective multi-stage injection pipeline, which selectively rewires attention pathways and perturbs intermediate representations with minimal parameter overhead (only 0.15% updated). ShadowCoT further leverages reinforcement learning and reasoning chain pollution (RCP) to autonomously synthesize stealthy adversarial CoTs that remain undetectable to advanced defenses. Extensive experiments across diverse reasoning benchmarks and LLMs show that ShadowCoT consistently achieves high Attack Success Rate (94.4%) and Hijacking Success Rate (88.4%) while preserving benign performance. These results reveal an emergent class of cognition-level threats and highlight the urgent need for defenses beyond shallow surface-level consistency.

PaMi-VDPO: Mitigating Video Hallucinations by Prompt-Aware Multi-Instance Video Preference Learning

Xinpeng Ding,Kui Zhang,Jinahua Han,Lanqing Hong,Hang Xu,Xiaomeng Li

Task: Propose Prompt-aware Multi-instance Learning VDPO (PaMi-VDPO), an online preference learning framework for reducing hallucinations in Video Multimodal Large Language Models (VLLMs).

Motivation: Existing Direct Preference Optimization (DPO) relies on offline preference data, which limits adaptability and fails to capture true video-response misalignment.

Details

Method: Generate rejected samples through video augmentation, and use a prompt-aware multi-instance learning strategy to select augmented clips, avoiding semantically identical clips and false rejections. Result: With only 10k SFT data, PaMi-VDPO improves the base model by 5.3% on VideoHallucer, surpassing GPT-4o, while remaining stable on general video benchmarks. Conclusion: PaMi-VDPO effectively reduces hallucinations and improves alignment without additional parameters or supervision. Abstract: Direct Preference Optimization (DPO) helps reduce hallucinations in Video Multimodal Large Language Models (VLLMs), but its reliance on offline preference data limits adaptability and fails to capture true video-response misalignment. We propose Video Direct Preference Optimization (VDPO), an online preference learning framework that eliminates the need for preference annotation by leveraging video augmentations to generate rejected samples while keeping responses fixed. However, selecting effective augmentations is non-trivial, as some clips may be semantically identical to the original under specific prompts, leading to false rejections and disrupting alignment. To address this, we introduce Prompt-aware Multi-instance Learning VDPO (PaMi-VDPO), which selects augmentations based on prompt context. Instead of a single rejection, we construct a candidate set of augmented clips and apply a close-to-far selection strategy, first ensuring all clips are semantically relevant and then prioritizing the most prompt-aware distinct clip. This allows the model to better capture meaningful visual differences, mitigating hallucinations while avoiding false rejections and improving alignment. PaMi-VDPO seamlessly integrates into existing VLLMs without additional parameters or GPT-4/human supervision. With only 10k SFT data, it improves the base model by 5.3% on VideoHallucer, surpassing GPT-4o, while maintaining stable performance on general video benchmarks.

Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking

Yu-Hang Wu,Yu-Jie Xiong,Jie-Zhang

Task: Reveal the Defense Threshold Decay (DTD) vulnerability in large language models (LLMs), and propose a new jailbreak attack (SCP) and a defense strategy (POSD).

Motivation: Analyzing jailbreak methods helps expose the weaknesses of LLMs and improve their safety.

Details

Method: Reveal DTD by analyzing changes in the model's attention weights, propose the SCP attack, and design the POSD defense. Result: SCP successfully exploits DTD, and POSD significantly reduces jailbreak success rates. Conclusion: DTD and SCP expose safety risks in LLMs, and POSD offers an effective defense. Abstract: Large Language Models (LLMs) have become increasingly integral to a wide range of applications. However, they remain under threat from jailbreak attacks, where attackers manipulate designed prompts to make the models elicit malicious outputs. Analyzing jailbreak methods can help us delve into the weaknesses of LLMs and improve them. In this paper, we reveal a vulnerability in large language models (LLMs), which we term Defense Threshold Decay (DTD), by analyzing the attention weights of the model's output on input and subsequent output on prior output: as the model generates substantial benign content, its attention weights shift from the input to prior output, making it more susceptible to jailbreak attacks. To demonstrate the exploitability of DTD, we propose a novel jailbreak attack method, Sugar-Coated Poison (SCP), which induces the model to generate substantial benign content through benign input and adversarial reasoning, subsequently producing malicious content. To mitigate such attacks, we introduce a simple yet effective defense strategy, POSD, which significantly reduces jailbreak success rates while preserving the model's generalization capabilities.

Parasite: A Steganography-based Backdoor Attack Framework for Diffusion Models

Jiahao Chen,Yu Pan,Yi Du,Chunkai Wu,Lin Wang

Task: Propose "Parasite", a new backdoor attack method for image-to-image tasks in diffusion models.

Motivation: Backdoor attacks on image-to-image tasks are little studied, and traditional methods rely on a single conspicuous trigger, lacking concealability and flexibility.

Details

Method: Use steganography to hide the trigger, and embed the target content as the backdoor trigger for a more flexible attack. Result: "Parasite" bypasses mainstream defense frameworks with a 0% detection rate; an ablation study analyzes the effect of the hiding coefficient. Conclusion: "Parasite" is a stealthy and flexible backdoor attack that effectively evades existing detection frameworks. Abstract: Recently, the diffusion model has gained significant attention as one of the most successful image generation models, which can generate high-quality images by iteratively sampling noise. However, recent studies have shown that diffusion models are vulnerable to backdoor attacks, allowing attackers to supply input data containing triggers to activate the backdoor and generate their desired output. Existing backdoor attack methods have primarily targeted noise-to-image and text-to-image tasks, with limited work on backdoor attacks in image-to-image tasks. Furthermore, traditional backdoor attacks often rely on a single, conspicuous trigger to generate a fixed target image, lacking concealability and flexibility. To address these limitations, we propose a novel backdoor attack method called "Parasite" for image-to-image tasks in diffusion models, which not only is the first to leverage steganography for trigger hiding, but also allows attackers to embed the target content as a backdoor trigger to achieve a more flexible attack. "Parasite" as a novel attack method effectively bypasses existing detection frameworks to execute backdoor attacks. In our experiments, "Parasite" achieved a 0 percent backdoor detection rate against the mainstream defense frameworks. In addition, in the ablation study, we discuss the influence of different hiding coefficients on the attack results. You can find our code at https://anonymous.4open.science/r/Parasite-1715/.

Retrieval Augmented Generation with Collaborative Filtering for Personalized Text Generation

Teng Shi,Jun Xu,Xiao Zhang,Xiaoxue Zang,Kai Zheng,Yang Song,Han Li

Task: Propose CFRAG, a method that applies collaborative filtering to personalized Retrieval-Augmented Generation (RAG) so that the histories of similar users can improve personalized text generation.

Motivation: Existing personalized RAG methods ignore the collaborative signal in similar users' histories, while the success of collaborative filtering in recommender systems suggests its potential.

Details

Method: Train user embeddings with contrastive learning to retrieve similar users, and design a personalized retriever and reranker that are fine-tuned with LLM feedback. Result: Experiments on the LaMP benchmark validate the effectiveness of CFRAG and confirm the importance of collaborative information. Conclusion: By introducing collaborative information, CFRAG significantly improves personalized text generation. Abstract: Recently, the personalization of Large Language Models (LLMs) to generate content that aligns with individual user preferences has garnered widespread attention. Personalized Retrieval-Augmented Generation (RAG), which retrieves relevant documents from the user's history to reflect their preferences and enhance LLM generation, is one commonly used approach for personalization. However, existing personalized RAG methods do not consider that the histories of similar users can also assist in personalized generation for the current user, meaning that collaborative information between users can also benefit personalized generation. Inspired by the application of collaborative filtering in recommender systems, we propose a method called CFRAG, which adapts Collaborative Filtering to RAG for personalized text generation. However, this presents two challenges: (1) how to incorporate collaborative information without explicit user similarity labels? (2) how to retrieve documents that support personalized LLM generation? For Challenge 1, we use contrastive learning to train user embeddings to retrieve similar users and introduce collaborative information. For Challenge 2, we design a personalized retriever and reranker to retrieve the top-$k$ documents from these users' histories. We take into account the user's preference during retrieval and reranking. Then we leverage feedback from the LLM to fine-tune the personalized retriever and reranker, enabling them to retrieve documents that meet the personalized generation needs of the LLM. Experimental results on the Language Model Personalization (LaMP) benchmark validate the effectiveness of CFRAG. Further analysis confirms the importance of incorporating collaborative information.
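
The first retrieval stage described above (find similar users, then pool their histories as candidate documents) can be sketched roughly as follows. This is a toy illustration under stated assumptions: the embeddings are hand-written rather than learned with contrastive training, cosine similarity is assumed as the similarity measure, and the function names and data are hypothetical. The personalized retriever and reranker are not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_users(target, embeddings, m):
    """Return the m users whose embeddings are closest to the target user's."""
    scored = [
        (cosine(embeddings[target], emb), user)
        for user, emb in embeddings.items()
        if user != target
    ]
    scored.sort(reverse=True)
    return [user for _, user in scored[:m]]


def candidate_documents(target, embeddings, histories, m):
    """Pool the histories of the m most similar users as retrieval candidates."""
    docs = []
    for user in similar_users(target, embeddings, m):
        docs.extend(histories.get(user, []))
    return docs


# Hypothetical toy data: alice and bob have similar tastes, carol does not.
embeddings = {"alice": [0.9, 0.1], "bob": [0.8, 0.2], "carol": [0.0, 1.0]}
histories = {"bob": ["review of a sci-fi novel"], "carol": ["gardening tips"]}
candidates = candidate_documents("alice", embeddings, histories, m=1)
```

In the full method, a retriever and reranker would then select the top-$k$ documents from such a candidate pool.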

Shiao Wang,Xiao Wang,Bo Jiang,Lin Zhu,Guoqi Li,Yaowei Wang,Yonghong Tian,Jin Tang

Task: Rethink human activity recognition by combining RGB and event cameras, and propose a new multi-modal heat conduction operation framework.

Motivation: Address the performance degradation of RGB cameras in real-world conditions such as insufficient lighting and rapid motion, using biologically inspired event cameras to compensate for their limitations.

Details

Method: Propose the HARDVS 2.0 dataset and the MMHCO-HAR framework, which fuses RGB and event features through multi-modal heat conduction operation layers and classifies with an adaptive fusion module. Result: Experiments show that the method performs well, validating its effectiveness and robustness. Conclusion: A multi-modal approach combining RGB and event cameras offers clear advantages for human activity recognition and opens new directions for future research. Abstract: Human Activity Recognition (HAR) has primarily relied on traditional RGB cameras to achieve high-performance activity recognition. However, the challenging factors in real-world scenarios, such as insufficient lighting and rapid movements, inevitably degrade the performance of RGB cameras. To address these challenges, biologically inspired event cameras offer a promising solution to overcome the limitations of traditional RGB cameras. In this work, we rethink human activity recognition by combining the RGB and event cameras. The first contribution is the proposed large-scale multi-modal RGB-Event human activity recognition benchmark dataset, termed HARDVS 2.0, which bridges the dataset gaps. It contains 300 categories of everyday real-world actions with a total of 107,646 paired videos covering various challenging scenarios. Inspired by the physics-informed heat conduction model, we propose a novel multi-modal heat conduction operation framework for effective activity recognition, termed MMHCO-HAR. In more detail, given the RGB frames and event streams, we first extract the feature embeddings using a stem network. Then, multi-modal Heat Conduction blocks are designed to fuse the dual features, the key module of which is the multi-modal Heat Conduction Operation layer. We integrate RGB and event embeddings through a multi-modal DCT-IDCT layer while adaptively incorporating the thermal conductivity coefficient via FVEs into this module. After that, we propose an adaptive fusion module based on a policy routing strategy for high-performance classification. Comprehensive experiments demonstrate that our method consistently performs well, validating its effectiveness and robustness. The source code and benchmark dataset will be released on https://github.com/Event-AHU/HARDVS/tree/HARDVSv2

Are Generative AI Agents Effective Personalized Financial Advisors?

Takehiro Takayanagi,Kiyoshi Izumi,Javier Sanz-Cruzado,Richard McCreadie,Iadh Ounis

Task: Study the effectiveness of LLM advisors in the finance domain, focusing on preference elicitation, personalized advice, and the effect of advisor personality on trust.

Motivation: Explore how LLM advisors perform in complex, high-stakes domains such as finance, particularly when user needs are unclear, advice must be personalized, and trust must be built.

Details

Method: A lab-based user study (64 participants) evaluates LLM advisors on preference elicitation, personalized advice, and personality. Result: LLM advisors match human advisors at eliciting preferences but struggle to resolve conflicting needs; personalized advice can influence user behavior but shows clear failure modes; users are insensitive to advice quality and even prefer extroverted LLMs that give worse advice. Conclusion: LLM advisors show some promise in finance, but accurate preference elicitation is key, and users' disregard for advice quality and the influence of persona deserve attention. Abstract: Large language model-based agents are becoming increasingly popular as a low-cost mechanism to provide personalized, conversational advice, and have demonstrated impressive capabilities in relatively simple scenarios, such as movie recommendations. But how do these agents perform in complex high-stakes domains, where domain expertise is essential and mistakes carry substantial risk? This paper investigates the effectiveness of LLM-advisors in the finance domain, focusing on three distinct challenges: (1) eliciting user preferences when users themselves may be unsure of their needs, (2) providing personalized guidance for diverse investment preferences, and (3) leveraging advisor personality to build relationships and foster trust. Via a lab-based user study with 64 participants, we show that LLM-advisors often match human advisor performance when eliciting preferences, although they can struggle to resolve conflicting user needs. When providing personalized advice, the LLM was able to positively influence user behavior, but demonstrated clear failure modes. Our results show that accurate preference elicitation is key; otherwise, the LLM-advisor has little impact, or can even direct the investor toward unsuitable assets. More worryingly, users appear insensitive to the quality of advice being given, or worse, these can have an inverse relationship. Indeed, users reported a preference for and increased satisfaction as well as emotional trust with LLMs adopting an extroverted persona, even though those agents provided worse advice.

Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking

Junxi Chen,Junhao Dong,Xiaohua Xie

Task: Reveal and analyze the hijacking attack on text-to-image diffusion models equipped with the IP-Adapter (T2I-IP-DMs).

Motivation: The IP-Adapter is widely adopted to improve controllability, but it also introduces a new security threat: the hijacking attack.

Details

Method: By uploading imperceptible image-space adversarial examples (AEs), an adversary can hijack large numbers of benign users into jailbreaking an Image Generation Service (IGS). Result: Experiments verify the technical feasibility of the hijacking attack, and the limitations of existing defenses are examined. Conclusion: The study reveals the security threat posed by the IP-Adapter and explores combining it with adversarially trained models to overcome the shortcomings of existing defenses. Abstract: Recently, the Image Prompt Adapter (IP-Adapter) has been increasingly integrated into text-to-image diffusion models (T2I-DMs) to improve controllability. However, in this paper, we reveal that T2I-DMs equipped with the IP-Adapter (T2I-IP-DMs) enable a new jailbreak attack named the hijacking attack. We demonstrate that, by uploading imperceptible image-space adversarial examples (AEs), the adversary can hijack large numbers of benign users to jailbreak an Image Generation Service (IGS) driven by T2I-IP-DMs and mislead the public to discredit the service provider. Worse still, the IP-Adapter's dependency on open-source image encoders reduces the knowledge required to craft AEs. Extensive experiments verify the technical feasibility of the hijacking attack. In light of the revealed threat, we investigate several existing defenses and explore combining the IP-Adapter with adversarially trained models to overcome existing defenses' limitations. Our code is available at https://github.com/fhdnskfbeuv/attackIPA.

Defending Deep Neural Networks against Backdoor Attacks via Module Switching

Weijun Li,Ansh Arora,Xuanli He,Mark Dras,Qiongkai Xu

Task: Propose a new module-switching strategy that breaks spurious correlations along the model's propagation path, thereby mitigating backdoor attacks.

Motivation: The rapid growth in DNN parameters makes independent training expensive and increases reliance on open-source models, but opaque training processes heighten security risks, and existing defenses such as weight averaging are only partially effective.

Details

Method: Use evolutionary algorithms to optimize the fusion strategy, breaking spurious correlations in model parameters through module switching. Result: In backdoor attack tests in the text and vision domains, the method substantially lowers attack success rates (e.g., an average attack success rate of 22% on SST-2, compared with 31.9% for the best-performing baseline). Conclusion: Module switching is an effective backdoor defense that performs well even when some of the merged models are compromised. Abstract: The exponential increase in the parameters of Deep Neural Networks (DNNs) has significantly raised the cost of independent training, particularly for resource-constrained entities. As a result, there is a growing reliance on open-source models. However, the opacity of training processes exacerbates security risks, making these models more vulnerable to malicious threats, such as backdoor attacks, while simultaneously complicating defense mechanisms. Merging homogeneous models has gained attention as a cost-effective post-training defense. However, we notice that existing strategies, such as weight averaging, only partially mitigate the influence of poisoned parameters and remain ineffective in disrupting the pervasive spurious correlations embedded across model parameters. We propose a novel module-switching strategy to break such spurious correlations within the model's propagation path. By leveraging evolutionary algorithms to optimize fusion strategies, we validate our approach against backdoor attacks targeting text and vision domains. Our method achieves effective backdoor mitigation even when incorporating a couple of compromised models, e.g., reducing the average attack success rate (ASR) to 22% compared to 31.9% with the best-performing baseline on SST-2.
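
The contrast between weight averaging and module switching can be sketched with toy models represented as dictionaries from module name to a flat list of parameters. This is only an illustration under stated assumptions: the module granularity, the function names, and the hand-written switching plan are hypothetical, and the evolutionary search that the paper uses to find a good plan is not reproduced.

```python
def weight_average(models):
    """Baseline merge: average every parameter of every module across models."""
    names = models[0].keys()
    return {
        name: [sum(vals) / len(models) for vals in zip(*(m[name] for m in models))]
        for name in names
    }


def module_switch(models, plan):
    """Build a merged model by copying each module whole from one source model.

    `plan` maps a module name to the index of the source model to take it from,
    so consecutive modules on the propagation path can come from different models.
    """
    return {name: list(models[src][name]) for name, src in plan.items()}


# Two hypothetical homogeneous models with the same module names and shapes.
model_a = {"attn": [1.0, 1.0], "mlp": [2.0, 2.0]}
model_b = {"attn": [3.0, 3.0], "mlp": [4.0, 4.0]}

averaged = weight_average([model_a, model_b])
switched = module_switch([model_a, model_b], {"attn": 0, "mlp": 1})
```

Averaging blends every parameter from every source, whereas switching keeps each module intact but alternates its origin, which is the property the abstract credits with breaking correlations that span modules.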

On the Importance of Conditioning for Privacy-Preserving Data Augmentation

Julian Lorenz,Katja Ludwig,Valentin Haug,Rainer Lienhart

Task: Study the suitability of latent diffusion models for data augmentation and anonymization.

Motivation: Examine how effective latent diffusion models are as a data augmentation and privacy-preserving method, in particular the privacy risks of conditioned diffusion models.

Details

Method: Train an identification model with a contrastive learning approach, evaluate conditioned diffusion models for anonymization, and test black-box attacks. Result: Conditioned diffusion models are found to be unsuitable as a privacy-preserving method because the generated images retain identifiable patterns. Conclusion: Conditioned diffusion models carry privacy risks when used for anonymization and should be used with caution. Abstract: Latent diffusion models can be used as a powerful augmentation method to artificially extend datasets for enhanced training. To the human eye, these augmented images look very different to the originals. Previous work has suggested using this data augmentation technique for data anonymization. However, we show that latent diffusion models that are conditioned on features like depth maps or edges to guide the diffusion process are not suitable as a privacy-preserving method. We use a contrastive learning approach to train a model that can correctly identify people out of a pool of candidates. Moreover, we demonstrate that anonymization using conditioned diffusion models is susceptible to black box attacks. We attribute the success of the described methods to the conditioning of the latent diffusion model in the anonymization process. The diffusion model is instructed to produce similar edges for the anonymized images. Hence, a model can learn to recognize these patterns for identification.

SkillFlow: Efficient Skill and Code Transfer Through Communication in Adapting AI Agents

Pagkratios Tagkopoulos,Fangzhou Li,Ilias Tagkopoulos

Task: Propose SkillFlow, a framework that lets AI agents extend their functionality by acquiring new skills from their environment or from other agents.

Motivation: Explore how AI agents can expand their functionality dynamically to complete tasks faster and at lower cost.

Details

Method: Propose the modular, technology-agnostic SkillFlow framework and validate it with a theoretical model and a real-world application (calendar event scheduling). Result: Within a few iterations, SkillFlow yields considerable gains (24.8%) in time and cost, especially when communication cost is high. Conclusion: SkillFlow shows promise for dynamically extending agent functionality, analogous to lateral gene transfer in biological systems. Abstract: AI agents are autonomous systems that can execute specific tasks based on predefined programming. Here, we present SkillFlow, a modular, technology-agnostic framework that allows agents to expand their functionality in an ad-hoc fashion by acquiring new skills from their environment or other agents. We present a theoretical model that examines under which conditions this framework would be beneficial, and we then explore SkillFlow's ability to accelerate task completion and lead to lower cumulative costs in a real-world application, namely scheduling agents for calendar events. We demonstrate that within a few iterations, SkillFlow leads to considerable (24.8%, p-value = $6.4\times10^{-3}$) gains in time and cost, especially when the communication cost is high. Finally, we draw analogies from well-studied biological systems and compare this framework to that of lateral gene transfer, a significant process of adaptation and evolution in novel environments.

Turin3D: Evaluating Adaptation Strategies under Label Scarcity in Urban LiDAR Segmentation with Semi-Supervised Techniques

Luca Barco,Giacomo Blanco,Gaetano Chiriaco,Alessia Intini,Luigi La Riccia,Vittorio Scolamiero,Piero Boccardo,Paolo Garza,Fabrizio Dominici

Task: Introduce and evaluate Turin3D, a new aerial LiDAR dataset for point cloud semantic segmentation.

Motivation: Support research on 3D semantic segmentation for urban modelling, especially self-supervised and semi-supervised learning approaches.

Details

Method: Collect the data, manually annotate the validation and test sets, and apply a semi-supervised learning technique to improve model performance. Result: The Turin3D dataset is provided, and semi-supervised learning is shown to improve performance. Conclusion: Turin3D will be made publicly available to support outdoor point cloud segmentation research, and is particularly relevant where the training set has no annotations. Abstract: 3D semantic segmentation plays a critical role in urban modelling, enabling detailed understanding and mapping of city environments. In this paper, we introduce Turin3D: a new aerial LiDAR dataset for point cloud semantic segmentation covering an area of around 1.43 km² in the city centre of Turin with almost 70M points. We describe the data collection process and compare Turin3D with others previously proposed in the literature. We did not fully annotate the dataset due to the complexity and time-consuming nature of the process; however, a manual annotation process was performed on the validation and test sets, to enable a reliable evaluation of the proposed techniques. We first benchmark the performance of several point cloud semantic segmentation models, trained on the existing datasets, when tested on Turin3D, and then improve their performance by applying a semi-supervised learning technique leveraging the unlabelled training set. The dataset will be publicly available to support research in outdoor point cloud segmentation, with particular relevance for self-supervised and semi-supervised learning approaches given the absence of ground truth annotations for the training set.

TxGemma: Efficient and Agentic LLMs for Therapeutics

Eric Wang,Samuel Schmidgall,Paul F. Jaeger,Fan Zhang,Rory Pilgrim,Yossi Matias,Joelle Barral,David Fleet,Shekoofeh Azizi

Task: Develop TxGemma, a suite of efficient, generalist language models for therapeutic property prediction as well as interactive reasoning and explanation.

Motivation: Address the high cost and high failure rate of therapeutic development, and provide broader applicability than task-specific models.

Details

Method: Fine-tune Gemma-2 to build 2B, 9B, and 27B parameter models on data covering small molecules, proteins, nucleic acids, diseases, and cell lines. Result: Across 66 tasks, TxGemma is superior or comparable to the generalist model on 64 and to specialist models on 50; fine-tuning requires less data. Conclusion: TxGemma and the derived Agentic-Tx system perform strongly in therapeutic development and offer interactive and reasoning capabilities. Abstract: Therapeutic development is a costly and high-risk endeavor that is often plagued by high failure rates. To address this, we introduce TxGemma, a suite of efficient, generalist large language models (LLMs) capable of therapeutic property prediction as well as interactive reasoning and explainability. Unlike task-specific models, TxGemma synthesizes information from diverse sources, enabling broad application across the therapeutic development pipeline. The suite includes 2B, 9B, and 27B parameter models, fine-tuned from Gemma-2 on a comprehensive dataset of small molecules, proteins, nucleic acids, diseases, and cell lines. Across 66 therapeutic development tasks, TxGemma achieved superior or comparable performance to the state-of-the-art generalist model on 64 (superior on 45), and against state-of-the-art specialist models on 50 (superior on 26). Fine-tuning TxGemma models on therapeutic downstream tasks, such as clinical trial adverse event prediction, requires less training data than fine-tuning base LLMs, making TxGemma suitable for data-limited applications. Beyond these predictive capabilities, TxGemma features conversational models that bridge the gap between general LLMs and specialized property predictors. These allow scientists to interact in natural language, provide mechanistic reasoning for predictions based on molecular structure, and engage in scientific discussions. Building on this, we further introduce Agentic-Tx, a generalist therapeutic agentic system powered by Gemini 2.5 that reasons, acts, manages diverse workflows, and acquires external domain knowledge. Agentic-Tx surpasses prior leading models on the Humanity's Last Exam benchmark (Chemistry & Biology) with 52.3% relative improvement over o3-mini (high) and 26.7% over o3-mini (high) on GPQA (Chemistry) and excels with improvements of 6.3% (ChemBench-Preference) and 2.4% (ChemBench-Mini) over o3-mini (high).

Intrinsic Saliency Guided Trunk-Collateral Network for Unsupervised Video Object Segmentation

Xiangyu Zheng,Wanyun Li,Songcheng He,Xiaoqiang Li,We Zhang

Task: Propose ISTC-Net, a network for unsupervised video object segmentation (UVOS) that better balances the motion-appearance relationship and uses the model's intrinsic saliency information to improve segmentation.

Motivation: Existing methods fail to balance the motion-appearance relationship properly, and relying on optical flow alone cannot deliver high-quality segmentation in all scenarios.

Details

Method: Adopt a Trunk-Collateral structure in which the shared trunk captures motion-appearance commonality and the collateral branch learns what is unique to motion features, and design an Intrinsic Saliency guided Refinement Module (ISRM) to refine features. Result: State-of-the-art performance on three UVOS datasets and four VSOD benchmarks (e.g., 89.2% J&F on DAVIS-16). Conclusion: By improving the motion-appearance balance and introducing intrinsic saliency information, ISTC-Net significantly improves segmentation performance. Abstract: Recent unsupervised video object segmentation (UVOS) methods predominantly adopt the motion-appearance paradigm. Mainstream motion-appearance approaches use either the two-encoder structure to separately encode motion and appearance features, or the single-encoder structure for joint encoding. However, these methods fail to properly balance the motion-appearance relationship. Consequently, even with complex fusion modules for motion-appearance integration, the extracted suboptimal features degrade the models' overall performance. Moreover, the quality of optical flow varies across scenarios, making it insufficient to rely solely on optical flow to achieve high-quality segmentation results. To address these challenges, we propose the Intrinsic Saliency guided Trunk-Collateral Network (ISTC-Net), which better balances the motion-appearance relationship and incorporates the model's intrinsic saliency information to enhance segmentation performance. Specifically, considering that optical flow maps are derived from RGB images, they share both commonalities and differences. We propose a novel Trunk-Collateral structure. The shared trunk backbone captures the motion-appearance commonality, while the collateral branch learns the uniqueness of motion features. Furthermore, an Intrinsic Saliency guided Refinement Module (ISRM) is devised to efficiently leverage the model's intrinsic saliency information to refine high-level features, and provide pixel-level guidance for motion-appearance fusion, thereby enhancing performance without additional input. Experimental results show that ISTC-Net achieved state-of-the-art performance on three UVOS datasets (89.2% J&F on DAVIS-16, 76% J on YouTube-Objects, 86.4% J on FBMS) and four standard video salient object detection (VSOD) benchmarks with notable gains, demonstrating its effectiveness and superiority over previous methods.

FEABench: Evaluating Language Models on Multiphysics Reasoning Ability

Nayantara Mudur,Hao Cui,Subhashini Venugopalan,Paul Raccuglia,Michael P. Brenner,Peter Norgaard

Task: Evaluate the ability of large language models (LLMs) and LLM agents to simulate and solve physics, mathematics, and engineering problems using finite element analysis (FEA).

Motivation: Advance engineering automation by combining natural-language reasoning with the operation of FEA software, augmenting LLMs with the precision of numerical solvers.

Details

Method: Design the FEABench benchmark, which pairs natural-language problem descriptions with operation of the COMSOL Multiphysics software, and build an LLM agent that interacts with the software through its API. Result: The best-performing strategy generates executable API calls 88% of the time. Conclusion: LLMs that can successfully operate FEA software would advance engineering automation and improve the ability to solve complex real-world problems. Abstract: Building precise simulations of the real world and invoking numerical solvers to answer quantitative problems is an essential requirement in engineering and science. We present FEABench, a benchmark to evaluate the ability of large language models (LLMs) and LLM agents to simulate and solve physics, mathematics and engineering problems using finite element analysis (FEA). We introduce a comprehensive evaluation scheme to investigate the ability of LLMs to solve these problems end-to-end by reasoning over natural language problem descriptions and operating COMSOL Multiphysics®, an FEA software, to compute the answers. We additionally design a language model agent equipped with the ability to interact with the software through its Application Programming Interface (API), examine its outputs and use tools to improve its solutions over multiple iterations. Our best performing strategy generates executable API calls 88% of the time. LLMs that can successfully interact with and operate FEA software to solve problems such as those in our benchmark would push the frontiers of automation in engineering. Acquiring this capability would augment LLMs' reasoning skills with the precision of numerical solvers and advance the development of autonomous systems that can tackle complex problems in the real world. The code is available at https://github.com/google/feabench

PRIMEDrive-CoT: A Precognitive Chain-of-Thought Framework for Uncertainty-Aware Object Interaction in Driving Scene Scenario

Sriram Mandalika,Lalitha V,Athira Nambiar

Task: Propose PRIMEDrive-CoT, an uncertainty-aware model for object interaction and Chain-of-Thought reasoning in driving scenarios.

Motivation: Traditional deterministic models cannot capture the probabilistic nature and uncertainty of real-world driving, so a more reliable approach is needed.

Details

Method: Combine LiDAR-based 3D object detection with multi-view RGB references, use Bayesian Graph Neural Networks (BGNNs) for probabilistic reasoning, and provide interpretability through Chain-of-Thought reasoning and Grad-CAM visualizations. Result: On the DriveCoT dataset, PRIMEDrive-CoT outperforms existing CoT and risk-aware models. Conclusion: PRIMEDrive-CoT offers an uncertainty-aware and interpretable solution for driving scene understanding. Abstract: Driving scene understanding is a critical real-world problem that involves interpreting and associating various elements of a driving environment, such as vehicles, pedestrians, and traffic signals. Despite advancements in autonomous driving, traditional pipelines rely on deterministic models that fail to capture the probabilistic nature and inherent uncertainty of real-world driving. To address this, we propose PRIMEDrive-CoT, a novel uncertainty-aware model for object interaction and Chain-of-Thought (CoT) reasoning in driving scenarios. In particular, our approach combines LiDAR-based 3D object detection with multi-view RGB references to ensure interpretable and reliable scene understanding. Uncertainty and risk assessment, along with object interactions, are modelled using Bayesian Graph Neural Networks (BGNNs) for probabilistic reasoning under ambiguous conditions. Interpretable decisions are facilitated through CoT reasoning, leveraging object dynamics and contextual cues, while Grad-CAM visualizations highlight attention regions. Extensive evaluations on the DriveCoT dataset demonstrate that PRIMEDrive-CoT outperforms state-of-the-art CoT and risk-aware models.

Hogwild! Inference: Parallel LLM Generation via Concurrent Attention

Gleb Rodionov,Roman Garipov,Alina Shutova,George Yakushev,Vage Egiazarian,Anton Sinitsin,Denis Kuznedelev,Dan Alistarh

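Task: Propose a parallel LLM inference method that tackles complex tasks through a shared attention cache and a collaboration strategy.

Motivation: Humans solve problems by collaborating, while existing parallel LLM frameworks may not suit every task, so a more flexible design is needed.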

Details

Method: Use the Hogwild! Inference engine, which lets multiple LLM instances run in parallel and share an attention cache, exploiting RoPE to avoid recomputation. Result: Modern reasoning-capable LLMs can use a shared Key-Value cache out of the box, without additional fine-tuning. Conclusion: The approach adapts flexibly to different tasks, improves parallel hardware utilization, and offers a new direction for collaborative LLM inference. Abstract: Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long inference-time computations. In human problem solving, a common strategy to expedite work is collaboration: by dividing the problem into sub-tasks, exploring different strategies concurrently, etc. Recent research has shown that LLMs can also operate in parallel by implementing explicit cooperation frameworks, such as voting mechanisms or the explicit creation of independent sub-tasks that can be executed in parallel. However, each of these frameworks may not be suitable for all types of tasks, which can hinder their applicability. In this work, we propose a different design approach: we run LLM "workers" in parallel, allowing them to synchronize via a concurrently-updated attention cache and prompt these workers to decide how best to collaborate. Our approach allows the instances to come up with their own collaboration strategy for the problem at hand, all the while "seeing" each other's partial progress in the concurrent cache. We implement this approach via Hogwild! Inference: a parallel LLM inference engine where multiple instances of the same LLM run in parallel with the same attention cache, with "instant" access to each other's generated tokens. Hogwild! Inference takes advantage of Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. We find that modern reasoning-capable LLMs can perform inference with shared Key-Value cache out of the box, without additional fine-tuning.
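
The abstract credits RoPE with avoiding recomputation when workers share a cache. The property that makes this plausible can be checked numerically: under RoPE, the attention score between a query and a key depends only on their relative offset, not on their absolute positions. The sketch below is a NumPy illustration of that property only, assuming the common half-split RoPE layout; it is not the Hogwild! Inference engine, and the function names are hypothetical.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate vector x (even length) as standard RoPE would at integer position pos.

    Assumption: half-split layout, where element i is paired with element i + d/2.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)  # one rotation frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

def attention_score(q, k, q_pos, k_pos):
    """Unnormalized attention logit between a rotated query and a rotated key."""
    return float(rope(q, q_pos) @ rope(k, k_pos))

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (7 tokens apart) at two very different absolute positions.
score_near = attention_score(q, k, q_pos=10, k_pos=3)
score_far = attention_score(q, k, q_pos=110, k_pos=103)
```

Because the two scores agree, a cached key block can in principle be viewed at a different offset by adjusting the query rotation instead of re-encoding the block; how the paper arranges its cache is not specified in the abstract.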

Balancing long- and short-term dynamics for the modeling of saliency in videos

Theodor Wulff,Fares Abawi,Philipp Allgeuer,Stefan Wermter

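Task: Propose a Transformer-based approach that learns a joint representation of video frames and past saliency information in order to detect dynamically shifting saliency in video.

Motivation: The role of long- and short-term dynamics in video saliency detection is under-researched.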

Details

Method: Use a dual-stream Transformer architecture that processes video frames and past saliency maps separately, extracts spatiotemporal tokens from both modalities, and then fuses them. Result: The additional prior information aids the first detection of the salient location, and the ratio of long- to short-term features directly affects model performance. Conclusion: Increasing the short-term context helps up to a certain threshold, while expanding the long-term context improves performance more substantially. Abstract: The role of long- and short-term dynamics towards salient object detection in videos is under-researched. We present a Transformer-based approach to learn a joint representation of video frames and past saliency information. Our model embeds long- and short-term information to detect dynamically shifting saliency in video. We provide our model with a stream of video frames and past saliency maps, which acts as a prior for the next prediction, and extract spatiotemporal tokens from both modalities. The decomposition of the frame sequence into tokens lets the model incorporate short-term information from within the token, while being able to make long-term connections between tokens throughout the sequence. The core of the system consists of a dual-stream Transformer architecture to process the extracted sequences independently before fusing the two modalities. Additionally, we apply a saliency-based masking scheme to the input frames to learn an embedding that facilitates the recognition of deviations from previous outputs. We observe that the additional prior information aids in the first detection of the salient location. Our findings indicate that the ratio of spatiotemporal long- and short-term features directly impacts the model's performance. While increasing the short-term context is beneficial up to a certain threshold, the model's performance greatly benefits from an expansion of the long-term context.

SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation

Hao Du,Bo Wu,Yan Lu,Zhendong Mao

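Task: Investigate and evaluate the ability of models to synchronize visual scenarios with linguistic context from a temporal perspective.

Motivation: Existing research is limited by biased temporal distributions, imprecise annotations, and insufficient compositionality, so fair evaluation and comprehensive exploration are needed.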

Details

Method: Introduce SVLTA (Synthetic Vision-Language Temporal Alignment), a synthetic benchmark dataset produced by a controlled generation method within a simulation environment, yielding reasonable, diverse, and balanced data distributions. Result: Experiments yield diagnostic insights on temporal question answering, sensitivity to distributional shift, and temporal alignment adaptation. Conclusion: SVLTA provides a comprehensive evaluation tool for temporal alignment, exposing existing challenges and suggesting directions for improvement. Abstract: Vision-language temporal alignment is a crucial capability for human dynamic recognition and cognition in real-world scenarios. While existing research focuses on capturing vision-language relevance, it faces limitations due to biased temporal distributions, imprecise annotations, and insufficient compositionality. To achieve fair evaluation and comprehensive exploration, our objective is to investigate and evaluate the ability of models to achieve alignment from a temporal perspective, specifically focusing on their capacity to synchronize visual scenarios with linguistic context in a temporally coherent manner. As a preliminary step, we present the statistical analysis of existing benchmarks and reveal the existing challenges from a decomposed perspective. To this end, we introduce SVLTA, the Synthetic Vision-Language Temporal Alignment benchmark, derived via a well-designed and feasible control generation method within a simulation environment. The approach considers commonsense knowledge, manipulable action, and constrained filtering, which generates reasonable, diverse, and balanced data distributions for diagnostic evaluations. Our experiments reveal diagnostic insights through the evaluations in temporal question answering, distributional shift sensitivity, and temporal alignment adaptation.

Temporal Alignment-Free Video Matching for Few-shot Action Recognition

SuBeen Lee,WonJun Moon,Hyun Seok Seong,Jae-Pil Heo

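Task: Few-Shot Action Recognition (FSAR), which aims to train a model with only a few labeled video instances.

Motivation: A key challenge in FSAR is handling divergent narrative trajectories for precise video matching; existing methods rely on pre-defined, length-dependent alignment units (e.g., frames or tuples), which limits flexibility for actions of varying lengths and speeds.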

Details

Method: Propose TEmporal Alignment-free Matching (TEAM), which represents each video with a fixed set of pattern tokens, needs neither temporal units nor brute-force alignment, and measures video similarity with token-wise comparisons. Result: Experiments demonstrate the effectiveness of TEAM, and the code is publicly available. Conclusion: By removing the need for temporal alignment, TEAM gains flexibility and efficiency and suits actions of varying lengths and speeds. Abstract: Few-Shot Action Recognition (FSAR) aims to train a model with only a few labeled video instances. A key challenge in FSAR is handling divergent narrative trajectories for precise video matching. While the frame- and tuple-level alignment approaches have been promising, their methods heavily rely on pre-defined and length-dependent alignment units (e.g., frames or tuples), which limits flexibility for actions of varying lengths and speeds. In this work, we introduce a novel TEmporal Alignment-free Matching (TEAM) approach, which eliminates the need for temporal units in action representation and brute-force alignment during matching. Specifically, TEAM represents each video with a fixed set of pattern tokens that capture globally discriminative clues within the video instance regardless of action length or speed, ensuring its flexibility. Furthermore, TEAM is inherently efficient, using token-wise comparisons to measure similarity between videos, unlike existing methods that rely on pairwise comparisons for temporal alignment. Additionally, we propose an adaptation process that identifies and removes common information across classes, establishing clear boundaries even between novel categories. Extensive experiments demonstrate the effectiveness of TEAM. Code is available at github.com/leesb7426/TEAM.

AVP-AP: Self-supervised Automatic View Positioning in 3D cardiac CT via Atlas Prompting

Xiaolin Fan,Yan Wang,Yingying Zhang,Mingkun Bao,Bosen Jia,Dong Lu,Yifan Gu,Jian Cheng,Haogang Zhu

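Task: Propose AVP-AP, a self-supervised automatic view positioning framework based on Atlas Prompting, for positioning arbitrary views in 3D CT volumes.

Motivation: Existing methods need extensive manual annotation and can only predict a fixed set of planes, whereas clinical practice requires positioning semantic 2D slices with arbitrary orientation.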

Details

Method: Generate a 3D canonical atlas and train a network in a self-supervised manner, then combine rigid transformation with feature-space similarity maximization for coarse positioning and refinement. Result: Average structural similarity (SSIM) improves by 19.8% over other methods in arbitrary view positioning, with 9% SSIM reported for the two-chamber view in a comparison with four radiologists. Conclusion: The AVP-AP framework is flexible and efficient, suits clinical scenarios, and its generalizability is validated on a public dataset. Abstract: Automatic view positioning is crucial for cardiac computed tomography (CT) examinations, including disease diagnosis and surgical planning. However, it is highly challenging due to individual variability and large 3D search space. Existing work needs labor-intensive and time-consuming manual annotations to train view-specific models, which are limited to predicting only a fixed set of planes. However, in real clinical scenarios, the challenge of positioning semantic 2D slices with any orientation into varying coordinate space in an arbitrary 3D volume remains unsolved. We thus introduce a novel framework, AVP-AP, the first to use Atlas Prompting for self-supervised Automatic View Positioning in the 3D CT volume. Specifically, this paper first proposes an atlas prompting method, which generates a 3D canonical atlas and trains a network to map slices into their corresponding positions in the atlas space in a self-supervised manner. Then, guided by atlas prompts corresponding to the given query images in a reference CT, we identify the coarse positions of slices in the target CT volume using rigid transformation between the 3D atlas and target CT volume, effectively reducing the search space. Finally, we refine the coarse positions by maximizing the similarity between the predicted slices and the query images in the feature space of a given foundation model. Our framework is flexible and efficient compared to other methods, outperforming other methods by 19.8% average structural similarity (SSIM) in arbitrary view positioning and achieving 9% SSIM in two-chamber view compared to four radiologists. Meanwhile, experiments on a public dataset validate our framework's generalizability.

Diffusion Based Ambiguous Image Segmentation

Jakob Lønborg Christensen,Morten Rieger Hannemose,Anders Bjorholm Dahl,Vedrana Andersen Dahl

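Task: Explore the design space of diffusion models for generative medical image segmentation, studying the effects of noise schedules, prediction types, and loss weightings.

Motivation: Medical image segmentation carries uncertainty from variation among expert annotations, and this uncertainty should be captured to represent the distribution of plausible ground truths.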

Details

Method: Study design choices for diffusion models, including noise schedules, prediction types, and loss weightings, with experiments on the LIDC-IDRI dataset. Result: A noise schedule made harder with input scaling significantly improves performance; x- and v-prediction outperform epsilon-prediction; many loss weightings perform similarly as long as they give enough weight to the end of the diffusion process; the model achieves SOTA performance on LIDC-IDRI. Conclusion: Diffusion models perform well for generative medical image segmentation, particularly for handling uncertainty, and careful design choices improve performance further. Abstract: Medical image segmentation often involves inherent uncertainty due to variations in expert annotations. Capturing this uncertainty is an important goal and previous works have used various generative image models for the purpose of representing the full distribution of plausible expert ground truths. In this work, we explore the design space of diffusion models for generative segmentation, investigating the impact of noise schedules, prediction types, and loss weightings. Notably, we find that making the noise schedule harder with input scaling significantly improves performance. We conclude that x- and v-prediction outperform epsilon-prediction, likely because the diffusion process is in the discrete segmentation domain. Many loss weightings achieve similar performance as long as they give enough weight to the end of the diffusion process. We base our experiments on the LIDC-IDRI lung lesion dataset and obtain state-of-the-art (SOTA) performance. Additionally, we introduce a randomly cropped variant of the LIDC-IDRI dataset that is better suited for uncertainty in image segmentation. Our model also achieves SOTA in this harder setting.
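
For readers unfamiliar with the three prediction types named above, the sketch below shows how the x-, epsilon-, and v-targets relate under a variance-preserving forward process, and how scaling the input lowers the signal-to-noise ratio at a fixed noise level. The abstract does not give the paper's exact formulation, so the constant `input_scale` multiplier and all function names here are assumptions for illustration.

```python
import numpy as np

def forward_noise(x0, eps, gamma, input_scale=1.0):
    """Variance-preserving forward step: x_t = sqrt(gamma) * (scale * x0) + sqrt(1 - gamma) * eps."""
    return np.sqrt(gamma) * input_scale * x0 + np.sqrt(1.0 - gamma) * eps

def prediction_targets(x0, eps, gamma):
    """Regression targets for the three common parameterizations."""
    alpha, sigma = np.sqrt(gamma), np.sqrt(1.0 - gamma)
    return {"x": x0, "eps": eps, "v": alpha * eps - sigma * x0}

def x0_from_v(x_t, v, gamma):
    """Recover the clean sample from a v-prediction (valid when input_scale == 1)."""
    alpha, sigma = np.sqrt(gamma), np.sqrt(1.0 - gamma)
    return alpha * x_t - sigma * v

def snr(gamma, input_scale=1.0):
    """Signal-to-noise ratio for unit-variance data; a smaller scale makes every step noisier."""
    return input_scale ** 2 * gamma / (1.0 - gamma)

rng = np.random.default_rng(0)
x0 = rng.choice([-1.0, 1.0], size=(4, 4))  # toy binary mask mapped to {-1, +1}
eps = rng.normal(size=(4, 4))
gamma = 0.7

x_t = forward_noise(x0, eps, gamma)
v = prediction_targets(x0, eps, gamma)["v"]
x0_rec = x0_from_v(x_t, v, gamma)
```

The identity `x0 = alpha * x_t - sigma * v` follows from `alpha**2 + sigma**2 == 1`, which is why a v-predicting network can always be converted to an x-prediction at sampling time.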

An Empirical Study of GPT-4o Image Generation Capabilities

Sixiang Chen,Jinbin Bai,Zhuoran Zhao,Tian Ye,Qingyu Shi,Donghao Zhou,Wenhao Chai,Xin Lin,Jianzong Wu,Chao Tang,Shilin Xu,Tao Zhang,Haobo Yuan,Yikang Zhou,Wei Chow,Linfeng Li,Xiangtai Li,Lei Zhu,Lu Qi

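Task: Empirically evaluate GPT-4o's image generation capabilities and compare it against leading open-source and commercial models.

Motivation: Determine whether image and text generation have been successfully integrated into a unified framework, and characterize GPT-4o's performance on generation tasks.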

Details

Method: Benchmark GPT-4o across four main categories (text-to-image, image-to-image, image-to-3D, and image-to-X generation) covering more than 20 tasks. Result: The analysis characterizes GPT-4o's strengths and limitations under various settings and situates it within the evolution of generative modeling. Conclusion: The study identifies promising directions for future unified generative models, emphasizing the role of architectural design and data scaling. Abstract: The landscape of image generation has rapidly evolved, from early GAN-based approaches to diffusion models and, most recently, to unified generative architectures that seek to bridge understanding and generation tasks. Although recent advances, especially GPT-4o, have demonstrated the feasibility of high-fidelity multimodal generation, their architectural design remains mysterious and unpublished. This prompts the question of whether image and text generation have already been successfully integrated into a unified framework for those methods. In this work, we conduct an empirical study of GPT-4o's image generation capabilities, benchmarking it against leading open-source and commercial models. Our evaluation covers four main categories, including text-to-image, image-to-image, image-to-3D, and image-to-X generation, with more than 20 tasks. Our analysis highlights the strengths and limitations of GPT-4o under various settings, and situates it within the broader evolution of generative modeling. Through this investigation, we identify promising directions for future unified generative models, emphasizing the role of architectural design and data scaling.

Under-Sampled High-Dimensional Data Recovery via Symbiotic Multi-Prior Tensor Reconstruction

Jie Yang,Chang Su,Yuhan Zhang,Jianjun Zhu,Jianli Wang

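Task: Propose a tensor reconstruction method that integrates multiple priors to recover the complete structure of high-dimensional data at extremely low sampling rates.

Motivation: Missing entries during acquisition and transmission of high-dimensional data hurt the accuracy of downstream tasks, and existing methods perform poorly at extremely low sampling rates.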

Details

Method: Combine learnable tensor decomposition, a pre-trained convolutional neural network, and block-matching and 3D filtering regularization, and decompose the optimization problem with the alternating direction method of multipliers. Result: The method's superiority is validated on color image, hyperspectral image, and grayscale video datasets. Conclusion: The proposed method clearly outperforms existing methods at extremely low sampling rates and effectively recovers the complete data structure. Abstract: The advancement of sensing technology has driven the widespread application of high-dimensional data. However, issues such as missing entries during acquisition and transmission negatively impact the accuracy of subsequent tasks. Tensor reconstruction aims to recover the underlying complete data from under-sampled observed data by exploring prior information in high-dimensional data. However, due to insufficient exploration, reconstruction methods still face challenges when the sampling rate is extremely low. This work proposes a tensor reconstruction method integrating multiple priors to comprehensively exploit the inherent structure of the data. Specifically, the method combines learnable tensor decomposition to enforce low-rank constraints of the reconstructed data, a pre-trained convolutional neural network for smoothing and denoising, and block-matching and 3D filtering regularization to enhance the non-local similarity in the reconstructed data. An alternating direction method of multipliers (ADMM) algorithm is designed to decompose the resulting optimization problem into three subproblems for efficient resolution. Extensive experiments on color image, hyperspectral image, and grayscale video datasets demonstrate the superiority of our method in extreme cases as compared with state-of-the-art methods.

econSG: Efficient and Multi-view Consistent Open-Vocabulary 3D Semantic Gaussians

Can Zhang,Gim Hee Lee

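Task: Propose econSG, a method for open-vocabulary semantic segmentation with 3DGS semantic fields.

Motivation: Existing methods over-trust SAM to regularize image-level CLIP, and reducing the dimensionality of semantic features from 2D vision-language models (VLMs) leads to multi-view inconsistency.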

Details

Method: econSG comprises Confidence-region Guided Regularization (CRR) and a low-dimensional contextual space, refining semantic features while preserving multi-view consistency. Result: It achieves state-of-the-art performance on four benchmark datasets and is the most efficient to train. Conclusion: econSG delivers efficient and multi-view consistent open-vocabulary semantic segmentation. Abstract: The primary focus of most recent works on open-vocabulary neural fields is extracting precise semantic features from the VLMs and then consolidating them efficiently into a multi-view consistent 3D neural fields representation. However, most existing works over-trusted SAM to regularize image-level CLIP without any further refinement. Moreover, several existing works improved efficiency by dimensionality reduction of semantic features from 2D VLMs before fusing with 3DGS semantic fields, which inevitably leads to multi-view inconsistency. In this work, we propose econSG for open-vocabulary semantic segmentation with 3DGS. Our econSG consists of: 1) A Confidence-region Guided Regularization (CRR) that mutually refines SAM and CLIP to get the best of both worlds for precise semantic features with complete and precise boundaries. 2) A low dimensional contextual space to enforce 3D multi-view consistency while improving computational efficiency by fusing backprojected multi-view 2D features, followed by dimensionality reduction directly on the fused 3D features instead of operating on each 2D view separately. Our econSG shows state-of-the-art performance on four benchmark datasets compared to the existing methods. Furthermore, our method is also the most efficient to train among all the methods.

FedFeat+: A Robust Federated Learning Framework Through Federated Aggregation and Differentially Private Feature-Based Classifier Retraining

Mrityunjoy Gain,Kitae Kim,Avi Deb Raha,Apurba Adhikary,Eui-Nam Huh,Zhu Han,Choong Seon Hong

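Task: Propose the FedFeat+ framework, which separates feature extraction from classification and improves generalization through a two-tiered training process.

Motivation: Address insufficient model generalization and privacy protection in federated learning.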

Details

Method: Use two-tiered training in which clients transmit weights and features, the server aggregates the models and retrains the classifier, and differential privacy protects the data. Result: FedFeat+ outperforms FedAvg on multiple benchmark datasets, with accuracy gains of 3.92% to 12.34%. Conclusion: FedFeat+ improves model performance while protecting privacy and suits heterogeneous-data settings. Abstract: In this paper, we propose the FedFeat+ framework, which distinctively separates feature extraction from classification. We develop a two-tiered model training process: following local training, clients transmit their weights and some features extracted from the feature extractor from the final local epochs to the server. The server aggregates these models using the FedAvg method and subsequently retrains the global classifier utilizing the shared features. The classifier retraining process enhances the model's understanding of the holistic view of the data distribution, ensuring better generalization across diverse datasets. This improved generalization enables the classifier to adaptively influence the feature extractor during subsequent local training epochs. We establish a balance between enhancing model accuracy and safeguarding individual privacy through the implementation of differential privacy mechanisms. By incorporating noise into the feature vectors shared with the server, we ensure that sensitive data remains confidential. We present a comprehensive convergence analysis, along with theoretical reasoning regarding performance enhancement and privacy preservation. We validate our approach through empirical evaluations conducted on benchmark datasets, including CIFAR-10, CIFAR-100, MNIST, and FMNIST, achieving high accuracy while adhering to stringent privacy guarantees. The experimental results demonstrate that the FedFeat+ framework, despite using only a lightweight two-layer CNN classifier, outperforms the FedAvg method in both IID and non-IID scenarios, achieving accuracy improvements ranging from 3.92% to 12.34% across CIFAR-10, CIFAR-100, and Fashion-MNIST datasets.
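
The two server-side ingredients described above, FedAvg aggregation and noisy feature sharing, can be sketched in a few lines. This is a minimal illustration rather than the FedFeat+ implementation: the abstract only says that noise is added to the shared feature vectors, so the per-row L2 clipping, the Gaussian noise, and the parameter names `clip_norm` and `noise_std` are assumptions, and no privacy budget is calibrated here.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Average client parameter dicts, weighting each client by its number of samples."""
    total = float(sum(client_sizes))
    return {
        name: sum(w[name] * (n / total) for w, n in zip(client_weights, client_sizes))
        for name in client_weights[0]
    }

def privatize_features(features, clip_norm, noise_std, rng):
    """Clip each feature row to an L2 norm of clip_norm, then add Gaussian noise.

    Assumption: Gaussian-mechanism-style perturbation; the paper's mechanism may differ.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return features * scale + rng.normal(0.0, noise_std, size=features.shape)

# Two toy clients holding 1 and 3 samples respectively.
clients = [{"w": np.array([0.0, 0.0])}, {"w": np.array([4.0, 4.0])}]
global_weights = fedavg(clients, client_sizes=[1, 3])

rng = np.random.default_rng(0)
raw = np.array([[3.0, 4.0], [0.3, 0.4]])
clipped_only = privatize_features(raw, clip_norm=1.0, noise_std=0.0, rng=rng)
noisy = privatize_features(raw, clip_norm=1.0, noise_std=0.1, rng=rng)
```

In a full pipeline, the server would retrain only the classifier head on the noisy features collected from all clients before broadcasting the updated global model.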

Latent Multimodal Reconstruction for Misinformation Detection

Stefanos-Iordanis Papadopoulos,Christos Koutlis,Symeon Papadopoulos,Panagiotis C. Petrantonakis

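Task: Propose a training dataset generated by Large Vision-Language Models (LVLMs) and a method for multimodal misinformation detection (MMD).

Motivation: Existing MMD datasets and methods produce overly simplistic misinformation that does not reflect real-world complexity.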

Details

Method: Use an LVLM to generate "MisCaption This!", a diverse and realistic dataset of miscaptioned images, and propose the "Latent Multimodal Reconstruction" (LAMAR) network, which reconstructs the embeddings of truthful captions. Result: Models trained on "MisCaption This!" perform better on real-world misinformation, and LAMAR sets a new state of the art on the NewsCLIPpings and VERITE benchmarks. Conclusion: LVLM-generated data and reconstruction-based approaches show potential for improving MMD performance. Abstract: Multimodal misinformation, such as miscaptioned images, where captions misrepresent an image's origin, context, or meaning, poses a growing challenge in the digital age. To support fact-checkers, researchers have been focusing on creating datasets and developing methods for multimodal misinformation detection (MMD). Due to the scarcity of large-scale annotated MMD datasets, recent studies leverage synthetic training data via out-of-context image-caption pairs or named entity manipulations, such as altering names, dates, and locations. However, these approaches often produce simplistic misinformation that fails to reflect real-world complexity, limiting the robustness of detection models trained on them. Meanwhile, despite recent advancements, Large Vision-Language Models (LVLMs) remain underutilized for generating diverse, realistic synthetic training data for MMD. To address this gap, we introduce "MisCaption This!", a training dataset comprising LVLM-generated miscaptioned images. Additionally, we introduce "Latent Multimodal Reconstruction" (LAMAR), a network trained to reconstruct the embeddings of truthful captions, providing a strong auxiliary signal to the detection process. To optimize LAMAR, we explore different training strategies (end-to-end training and large-scale pre-training) and integration approaches (direct, mask, gate, and attention). Extensive experiments show that models trained on "MisCaption This!" generalize better on real-world misinformation, while LAMAR sets new state-of-the-art on both NewsCLIPpings and VERITE benchmarks, highlighting the potential of LVLM-generated data and reconstruction-based approaches for advancing MMD.
We release our code at: https://github.com/stevejpapad/miscaptioned-image-reconstruction

Memory-Modular Classification: Learning to Generalize with Memory Replacement

Dahyun Kang,Ahmet Iscen,Eunchan Jo,Sua Choi,Minsu Cho,Cordelia Schmid

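Task: Propose a novel memory-modular learner for image classification that separates knowledge memorization from reasoning.

Motivation: Traditional models encode both world knowledge and task-specific skills into their weights during training, which makes it hard to adapt to new classes without retraining.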

Details

Method: The model stores knowledge in an external memory of web-crawled image and text data, dynamically selects relevant content for each input image at inference time, and adapts to new classes by replacing the memory contents. Result: Experiments show strong performance across diverse tasks, including zero-shot/few-shot classification, fine-grained classification, and class-incremental classification. Conclusion: The memory-modular learner adapts effectively to new classes without retraining and has broad application potential. Abstract: We propose a novel memory-modular learner for image classification that separates knowledge memorization from reasoning. Our model enables effective generalization to new classes by simply replacing the memory contents, without the need for model retraining. Unlike traditional models that encode both world knowledge and task-specific skills into their weights during training, our model stores knowledge in the external memory of web-crawled image and text data. At inference time, the model dynamically selects relevant content from the memory based on the input image, allowing it to adapt to arbitrary classes by simply replacing the memory contents. The key differentiator is that our learner meta-learns to perform classification tasks with noisy web data from unseen classes, resulting in robust performance across various classification scenarios. Experimental results demonstrate the promising performance and versatility of our approach in handling diverse classification tasks, including zero-shot/few-shot classification of unseen classes, fine-grained classification, and class-incremental classification.

CamContextI2V: Context-aware Controllable Video Generation

Luis Denninger,Sina Mokhtarzadeh Azar,Juergen Gall

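Task: Propose CamContextI2V, a model that combines multiple image conditions with 3D constraints and camera control to improve the global semantics and fine detail of image-to-video generation.

Motivation: Existing I2V diffusion models animate static images without extending beyond the provided context, and adding constraints such as camera trajectories degrades visual quality, which limits their applicability.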

Details

Method: Integrate multiple image conditions with 3D constraints alongside camera control to enrich global semantics and fine-grained visual details, while motivating the need for temporal awareness. Result: Improvements in visual quality and camera controllability are demonstrated on the RealEstate10K dataset. Conclusion: Through multi-condition integration and temporal awareness, CamContextI2V achieves more coherent and context-aware video generation. Abstract: Recently, image-to-video (I2V) diffusion models have demonstrated impressive scene understanding and generative quality, incorporating image conditions to guide generation. However, these models primarily animate static images without extending beyond their provided context. Introducing additional constraints, such as camera trajectories, can enhance diversity but often degrades visual quality, limiting their applicability for tasks requiring faithful scene representation. We propose CamContextI2V, an I2V model that integrates multiple image conditions with 3D constraints alongside camera control to enrich both global semantics and fine-grained visual details. This enables more coherent and context-aware video generation. Moreover, we motivate the necessity of temporal awareness for an effective context representation. Our comprehensive study on the RealEstate10K dataset demonstrates improvements in visual quality and camera controllability. We make our code and models publicly available at: https://github.com/LDenninger/CamContextI2V.

Enhanced Anomaly Detection for Capsule Endoscopy Using Ensemble Learning Strategies

Julia Werner,Christoph Gerum,Jorg Nick,Maxime Le Floch,Franz Brinkmann,Jochen Hampe,Oliver Bringmann

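Task: Propose an ensemble strategy to address the challenges of anomaly detection in video capsule endoscopy.

Motivation: Because of the limited size of a video capsule and the scarcity of available data, embedding AI models directly in the capsule for anomaly detection is constrained in both model size and performance.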

Details

Method: Train independent neural networks with different loss functions and combine their predictions through ensemble learning. Result: The approach reaches AUC scores of 76.86% on Kvasir-Capsule and 76.98% on Galar, outperforming existing baselines. Conclusion: The method achieves high performance with fewer parameters, an important step towards applying artificial intelligence in capsule endoscopy. Abstract: Capsule endoscopy is a method to capture images of the gastrointestinal tract and screen for diseases which might remain hidden if investigated with standard endoscopes. Due to the limited size of a video capsule, embedding AI models directly into the capsule demands careful consideration of the model size and thus complicates anomaly detection in this field. Furthermore, the scarcity of available data in this domain poses an ongoing challenge to achieving effective anomaly detection. Thus, this work introduces an ensemble strategy to address this challenge in anomaly detection tasks in video capsule endoscopies, requiring only a small number of individual neural networks during both the training and inference phases. Ensemble learning combines the predictions of multiple independently trained neural networks. This has been shown to be highly effective in enhancing both the accuracy and robustness of machine learning models. However, this comes at the cost of higher memory usage and increased computational effort, which quickly becomes prohibitive in many real-world applications. Instead of applying the same training algorithm to each individual network, we propose using various loss functions, drawn from the anomaly detection field, to train each network. The methods are validated on the two largest publicly available datasets for video capsule endoscopy images, the Galar and the Kvasir-Capsule dataset. We achieve an AUC score of 76.86% on the Kvasir-Capsule and an AUC score of 76.98% on the Galar dataset. Our approach outperforms current baselines with significantly fewer parameters across all models, which is a crucial step towards incorporating artificial intelligence into capsule endoscopies.

MCAT: Visual Query-Based Localization of Standard Anatomical Clips in Fetal Ultrasound Videos Using Multi-Tier Class-Aware Token Transformer

Divyanshu Mishra,Pramit Saha,He Zhao,Netzahualcoyotl Hernandez-Cruz,Olga Patey,Aris Papageorghiou,J. Alison Noble

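Task: Develop MCAT, a visual query-based video clip localization method that helps sonographers quickly obtain standard planes in fetal ultrasound videos.

Motivation: Manually selecting standard planes is time-consuming and subject to operator variability, and existing methods ignore the dynamic nature of video.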

Details

Method: Propose the Multi-Tier Class-Aware Token Transformer (MCAT), which localizes video clips from a visual query. Result: MCAT outperforms existing methods by 10% and 13% mIoU on the ultrasound datasets and by 5.35% mIoU on the Ego4D dataset, while using 96% fewer tokens. Conclusion: MCAT is efficient and accurate, and may improve prenatal care in low- and middle-income countries by simplifying ultrasound screening and diagnosis. Abstract: Accurate standard plane acquisition in fetal ultrasound (US) videos is crucial for fetal growth assessment, anomaly detection, and adherence to clinical guidelines. However, manually selecting standard frames is time-consuming and prone to intra- and inter-sonographer variability. Existing methods primarily rely on image-based approaches that capture standard frames and then classify the input frames across different anatomies. This ignores the dynamic nature of video acquisition and its interpretation. To address these challenges, we introduce Multi-Tier Class-Aware Token Transformer (MCAT), a visual query-based video clip localization (VQ-VCL) method, to assist sonographers by enabling them to capture a quick US sweep. By then providing a visual query of the anatomy they wish to analyze, MCAT returns the video clip containing the standard frames for that anatomy, facilitating thorough screening for potential anomalies. We evaluate MCAT on two ultrasound video datasets and a natural image VQ-VCL dataset based on Ego4D. Our model outperforms state-of-the-art methods by 10% and 13% mIoU on the ultrasound datasets and by 5.35% mIoU on the Ego4D dataset, using 96% fewer tokens. MCAT's efficiency and accuracy have significant potential implications for public health, especially in low- and middle-income countries (LMICs), where it may enhance prenatal care by streamlining standard plane acquisition, simplifying US-based screening and diagnosis, and allowing sonographers to examine more patients.
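
The results above are reported in mIoU. The abstract does not define the metric, so the sketch below assumes the usual reading for clip localization: the temporal intersection-over-union between a predicted and a ground-truth interval, averaged over queries. Function names and the `(start, end)` interval convention are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals, measured in frames or seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """Average temporal IoU over paired predictions and ground truths."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)

# One perfect localization and one complete miss.
example_miou = mean_iou(preds=[(0, 10), (0, 4)], gts=[(0, 10), (6, 8)])
```

Under this definition, a 10-point mIoU gain means the predicted clips overlap the annotated standard-plane clips noticeably more on average, not that 10% more clips are found.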

Towards Varroa destructor mite detection using a narrow spectra illumination

Samuel Bielik,Simon Bilik

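Task: Develop and modify a beehive monitoring device, and detect Varroa destructor mites on bees using hyperspectral imagery and a U-Net semantic segmentation architecture.

Motivation: Use computer vision to distinguish bees from Varroa mites in support of colony health monitoring.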

Details

Method: Combine hyperspectral imagery, a U-Net semantic segmentation architecture, and conventional computer vision methods. Result: A dataset of bees and mites was collected, and a computer vision model that can distinguish the two is proposed. Conclusion: The proposed approach detects bees and Varroa mites and provides technical support for colony health monitoring. Abstract: This paper focuses on the development and modification of a beehive monitoring device and on detecting Varroa destructor on bees with the help of hyperspectral imagery, utilizing a U-Net semantic segmentation architecture and conventional computer vision methods. The main objectives were to collect a dataset of bees and mites and to propose a computer vision model that can distinguish between bees and mites.

To Match or Not to Match: Revisiting Image Matching for Reliable Visual Place Recognition

Davide Sferrazza,Gabriele Berton,Gabriele Trivigno,Carlo Masone

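Task: Examine whether re-ranking is necessary in Visual Place Recognition (VPR) and explore alternatives to it.

Motivation: Modern VPR methods have improved to the point where re-ranking may degrade results, so its actual value needs to be verified.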

Details

Method: Use image matching as a verification step, with inlier counts predicting when re-ranking is appropriate. Result: Inlier counts reliably predict when re-ranking is beneficial, pointing to more robust, adaptive VPR systems. Conclusion: The study shifts the paradigm of retrieval pipelines and offers insights for more flexible and efficient VPR systems. Abstract: Visual Place Recognition (VPR) is a critical task in computer vision, traditionally enhanced by re-ranking retrieval results with image matching. However, recent advancements in VPR methods have significantly improved performance, challenging the necessity of re-ranking. In this work, we show that modern retrieval systems often reach a point where re-ranking can degrade results, as current VPR datasets are largely saturated. We propose using image matching as a verification step to assess retrieval confidence, demonstrating that inlier counts can reliably predict when re-ranking is beneficial. Our findings shift the paradigm of retrieval pipelines, offering insights for more robust and adaptive VPR systems.
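
One way to picture "image matching as a verification step" is a gate that keeps the retrieval order when the top candidate is already well supported by matched inliers and re-ranks only otherwise. The abstract does not describe the decision rule, so the threshold test on the top-1 candidate, the `min_inliers` parameter, and the function name below are assumptions made purely for illustration.

```python
def maybe_rerank(candidates, inlier_counts, min_inliers):
    """Return candidates in their final order.

    candidates: database images ordered by the retrieval model, best first.
    inlier_counts: matched inliers between the query and each candidate.
    Hypothetical rule: trust retrieval if the top-1 candidate has enough
    inliers; otherwise sort by inlier count (ties keep retrieval order).
    """
    if inlier_counts[0] >= min_inliers:
        return list(candidates)
    order = sorted(range(len(candidates)), key=lambda i: -inlier_counts[i])
    return [candidates[i] for i in order]

confident = maybe_rerank(["a", "b", "c"], [50, 80, 10], min_inliers=30)
uncertain = maybe_rerank(["a", "b", "c"], [5, 80, 10], min_inliers=30)
```

In the first call the retrieval order survives even though candidate "b" has more inliers, which reflects the paper's observation that re-ranking a confident retrieval can make results worse.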

Hyperbolic Category Discovery

Yuanpei Liu,Zhenqi He,Kai Han

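Task: Tackle Generalized Category Discovery (GCD) by learning hierarchy-aware representations and classifiers in hyperbolic space.

Motivation: Euclidean and spherical spaces are suboptimal for encoding samples with hierarchical structure, whereas hyperbolic space, with its exponential volume growth, is better suited to capturing hierarchy.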

Details

Method: Propose the HypCD framework, which transforms the Euclidean embedding space into hyperbolic space and uses both hyperbolic distance and the angle between samples for representation and classification learning. Result: On public GCD benchmarks, HypCD significantly improves both baseline and state-of-the-art methods. Conclusion: Hyperbolic space offers a more effective solution for GCD because it better captures the hierarchical structure of samples. Abstract: Generalized Category Discovery (GCD) is an intriguing open-world problem that has garnered increasing attention. Given a dataset that includes both labelled and unlabelled images, GCD aims to categorize all images in the unlabelled subset, regardless of whether they belong to known or unknown classes. In GCD, the common practice typically involves applying a spherical projection operator at the end of the self-supervised pretrained backbone, operating within Euclidean or spherical space. However, both of these spaces have been shown to be suboptimal for encoding samples that possess hierarchical structures. In contrast, hyperbolic space exhibits exponential volume growth relative to radius, making it inherently strong at capturing the hierarchical structure of samples from both seen and unseen categories. Therefore, we propose to tackle the category discovery challenge in the hyperbolic space. We introduce HypCD, a simple Hyperbolic framework for learning hierarchy-aware representations and classifiers for generalized Category Discovery. HypCD first transforms the Euclidean embedding space of the backbone network into hyperbolic space, facilitating subsequent representation and classification learning by considering both hyperbolic distance and the angle between samples. This approach is particularly helpful for knowledge transfer from known to unknown categories in GCD. We thoroughly evaluate HypCD on public GCD benchmarks, by applying it to various baseline and state-of-the-art methods, consistently achieving significant improvements.
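
The transformation from Euclidean embeddings to hyperbolic space, and the two quantities the method is said to use (hyperbolic distance and angle), can be sketched with the standard Poincaré-ball formulas. The abstract does not state which hyperbolic model HypCD uses, so the Poincaré ball with curvature `-c` is an assumption here, as are the function names.

```python
import numpy as np

def expmap0(v, c=1.0):
    """Exponential map at the origin: send a Euclidean vector into the Poincare ball."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return np.zeros_like(v)
    sqrt_c = np.sqrt(c)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points inside the ball of curvature -c."""
    sq_diff = np.sum((x - y) ** 2)
    denom = (1.0 - c * np.sum(x ** 2)) * (1.0 - c * np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * c * sq_diff / denom) / np.sqrt(c)

def cosine_similarity(x, y):
    """Angular agreement between two embeddings, independent of their norms."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

embedding = np.array([0.3, 0.4])  # Euclidean backbone output with norm 0.5
point = expmap0(embedding)
origin = np.zeros(2)
dist_to_origin = poincare_distance(origin, point)
```

With these formulas the distance from the origin to `expmap0(v)` equals `2 * ||v||`, while the Euclidean norm of the mapped point stays below 1; points near the boundary are therefore far apart in hyperbolic terms, which is the room that tree-like hierarchies need.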

A Robust Real-Time Lane Detection Method with Fog-Enhanced Feature Fusion for Foggy Conditions

Ronghui Zhang,Yuhang Ma,Tengfei Li,Ziyu Lin,Yueying Wu,Junzhou Chen,Lin Zhang,Jia Hu,Tony Z. Qiu,Konghui Guo

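Task: Propose a robust lane detection method for foggy environments and introduce new datasets.

Motivation: Existing lane detection algorithms degrade significantly in adverse conditions such as fog, and dedicated datasets and methods are lacking.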

Details

Method: Proposes the Fog-Enhanced Network, comprising the GFFM, KFFM, and LEEM modules, and introduces the FoggyLane dataset along with two synthesized datasets. Result: Achieves F1-scores of 95.04, 79.85, and 96.95 on FoggyLane, FoggyCULane, and FoggyTusimple respectively, with a real-time processing speed of 38.4 FPS. Conclusion: The method performs well in foggy environments and is both real-time and robust. Abstract: Lane detection is a critical component of Advanced Driver Assistance Systems (ADAS). Existing lane detection algorithms generally perform well under favorable weather conditions. However, their performance degrades significantly in adverse conditions, such as fog, which increases the risk of traffic accidents. This challenge is compounded by the lack of specialized datasets and methods designed for foggy environments. To address this, we introduce the FoggyLane dataset, captured in real-world foggy scenarios, and synthesize two additional datasets, FoggyCULane and FoggyTusimple, from existing popular lane detection datasets. Furthermore, we propose a robust Fog-Enhanced Network for lane detection, incorporating a Global Feature Fusion Module (GFFM) to capture global relationships in foggy images, a Kernel Feature Fusion Module (KFFM) to model the structural and positional relationships of lane instances, and a Low-level Edge Enhanced Module (LEEM) to address missing edge details in foggy conditions. Comprehensive experiments demonstrate that our method achieves state-of-the-art performance, with F1-scores of 95.04 on FoggyLane, 79.85 on FoggyCULane, and 96.95 on FoggyTusimple. Additionally, with TensorRT acceleration, the method reaches a processing speed of 38.4 FPS on the NVIDIA Jetson AGX Orin, confirming its real-time capabilities and robustness in foggy environments.

FaceCloak: Learning to Protect Face Templates

Sudipta Banerjee,Anubhav Jain,Chinmay Hegde,Nasir Memon

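Task: Propose FaceCloak, a neural network framework that protects face templates against inversion attacks.

Motivation: Generative models can reconstruct face images from encoded representations, raising security and privacy concerns.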

Details

Method: Generates renewable binary cloaks that proactively thwart inversion attacks while retaining biometric utility and unlinkability. Result: Cloaked templates suppress sensitive attributes, generalize to novel feature extraction schemes, and outperform baselines in biometric matching and resilience to reconstruction attacks. Conclusion: FaceCloak is fast (inference time 0.28 ms) and lightweight (0.57 MB). Abstract: Generative models can reconstruct face images from encoded representations (templates) bearing remarkable likeness to the original face, raising security and privacy concerns. We present FaceCloak, a neural network framework that protects face templates by generating smart, renewable binary cloaks. Our method proactively thwarts inversion attacks by cloaking face templates with unique disruptors synthesized from a single face template on the fly while provably retaining biometric utility and unlinkability. Our cloaked templates can suppress sensitive attributes while generalizing to novel feature extraction schemes and outperform leading baselines in terms of biometric matching and resiliency to reconstruction attacks. FaceCloak-based matching is extremely fast (inference time cost=0.28ms) and light-weight (0.57MB).

A Training-Free Style-aligned Image Generation with Scale-wise Autoregressive Model

Jihun Park,Jongmin Gim,Kyoungmin Lee,Minseok Oh,Minwoo Choi,Jaeyeul Kim,Woo Chool Park,Sunghoon Im

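Task: Propose a training-free style-aligned image generation method that uses a scale-wise autoregressive model to address the style inconsistency and slow inference of existing methods.

Motivation: Existing large-scale text-to-image (T2I) models, particularly diffusion-based ones, deliver excellent generation quality but suffer from inconsistent style across generated image sets and slow inference, which limits practical use.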

Details

Method: Proposes three key components: initial feature replacement to ensure a consistent background, pivotal feature interpolation to align object placement, and dynamic style injection, which strengthens style consistency through a schedule function. Result: Experiments show generation quality comparable to competing methods, significantly improved style alignment, and inference more than six times faster than the fastest model. Conclusion: The method needs no fine-tuning or additional training, keeps inference fast, and preserves individual content details. Abstract: We present a training-free style-aligned image generation method that leverages a scale-wise autoregressive model. While large-scale text-to-image (T2I) models, particularly diffusion-based methods, have demonstrated impressive generation quality, they often suffer from style misalignment across generated image sets and slow inference speeds, limiting their practical usability. To address these issues, we propose three key components: initial feature replacement to ensure consistent background appearance, pivotal feature interpolation to align object placement, and dynamic style injection, which reinforces style consistency using a schedule function. Unlike previous methods requiring fine-tuning or additional training, our approach maintains fast inference while preserving individual content details. Extensive experiments show that our method achieves generation quality comparable to competing approaches, significantly improves style alignment, and delivers inference speeds over six times faster than the fastest model.

V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models

Xiangxi Zheng,Linjie Li,Zhengyuan Yang,Ping Yu,Alex Jinpeng Wang,Rui Yan,Yuan Yao,Lijuan Wang

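Task: Propose V-MAGE, a game-based evaluation framework for assessing the visual reasoning capabilities of multimodal large language models (MLLMs).

Motivation: Current game-based benchmarks lack visual-centric tasks in dynamic, open environments and fail to fully assess the diverse reasoning skills required for real-world decision-making.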

Details

Method: Designs the V-MAGE framework with five games and more than 30 handcrafted levels, testing models on visual skills (such as positioning and trajectory tracking) and higher-level reasoning (such as long-term planning). Result: The evaluation shows that leading MLLMs face significant challenges in visual perception and reasoning and lag well behind human performance. Conclusion: V-MAGE exposes the perceptual errors and limitations of MLLMs and suggests potential directions for improvement from an agent-centric perspective. Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have led to significant improvements across various multimodal benchmarks. However, as evaluations shift from static datasets to open-world, dynamic environments, current game-based benchmarks remain inadequate because they lack visual-centric tasks and fail to assess the diverse reasoning skills required for real-world decision-making. To address this, we introduce Visual-centric Multiple Abilities Game Evaluation (V-MAGE), a game-based evaluation framework designed to assess visual reasoning capabilities of MLLMs. V-MAGE features five diverse games with 30+ handcrafted levels, testing models on core visual skills such as positioning, trajectory tracking, timing, and visual memory, alongside higher-level reasoning like long-term planning and deliberation. We use V-MAGE to evaluate leading MLLMs, revealing significant challenges in their visual perception and reasoning. In all game environments, the top-performing MLLMs, as determined by Elo rating comparisons, exhibit a substantial performance gap compared to humans. Our findings highlight critical limitations, including various types of perceptual errors made by the models, and suggest potential avenues for improvement from an agent-centric perspective, such as refining agent strategies and addressing perceptual inaccuracies. Code is available at https://github.com/CSU-JPG/V-MAGE.
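
The abstract ranks models by Elo rating comparisons without giving the details. For readers unfamiliar with Elo, here is a minimal sketch of the standard update rule; the K-factor of 32, the starting rating of 1500, and the function names are assumptions for illustration and are not taken from V-MAGE.

```python
def expected_score(r_a, r_b):
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated (r_a, r_b) after one comparison.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    k is the K-factor (assumed value, not from the paper).
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Two hypothetical models start level; model A wins one game comparison.
print(elo_update(1500.0, 1500.0, 1.0))
```

Because each update is zero-sum, the total rating across the two players is conserved, so ratings are only meaningful relative to one another.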

A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning

Akash Kumar,Ashlesha Kumar,Vibhav Vineet,Yogesh S Rawat

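Task: Establish a unified benchmark for fair comparison of self-supervised learning methods in the video domain, and study the effect of five key aspects.

Motivation: Existing self-supervised learning methods use diverse experimental setups and lack a standardized benchmark, making direct comparison difficult.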

Details

Method: Evaluates six self-supervised learning methods across six network architectures, with experiments on five benchmark datasets and two downstream tasks. Result: Reveals key relationships among pretraining strategies, dataset characteristics, pretext tasks, and model architectures, and proposes a new approach that reduces training data requirements while surpassing existing methods. Conclusion: The work offers guidance for future research on self-supervised video representation learning and shows its relevance to large-scale video representation learning. Abstract: Self-supervised learning has emerged as a powerful paradigm for label-free model pretraining, particularly in the video domain, where manual annotation is costly and time-intensive. However, existing self-supervised approaches employ diverse experimental setups, making direct comparisons challenging due to the absence of a standardized benchmark. In this work, we establish a unified benchmark that enables fair comparisons across different methods. Additionally, we systematically investigate five critical aspects of self-supervised learning in videos: (1) dataset size, (2) model complexity, (3) data distribution, (4) data noise, and (5) feature representations. To facilitate this study, we evaluate six self-supervised learning methods across six network architectures, conducting extensive experiments on five benchmark datasets and assessing performance on two distinct downstream tasks. Our analysis reveals key insights into the interplay between pretraining strategies, dataset characteristics, pretext tasks, and model architectures. Furthermore, we extend these findings to Video Foundation Models (ViFMs), demonstrating their relevance in large-scale video representation learning. Finally, leveraging these insights, we propose a novel approach that significantly reduces training data requirements while surpassing state-of-the-art methods that rely on 10% more pretraining data. We believe this work will guide future research toward a deeper understanding of self-supervised video representation learning and its broader implications.

Rethinking the Nested U-Net Approach: Enhancing Biomarker Segmentation with Attention Mechanisms and Multiscale Feature Fusion

Saad Wazir,Daeyoung Kim

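Task: Propose a nested UNet architecture that improves medical image segmentation through multiscale feature fusion and attention mechanisms.

Motivation: Existing Transformer and CNN methods have limited feature extraction capability under variations in morphology and staining, and end-to-end approaches underperform because multiscale features are hard to transfer.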

Details

Method: Uses a nested UNet architecture combined with multiscale feature fusion and attention mechanisms to improve feature integration and the recovery of spatial detail. Result: Experiments on four datasets show that the method outperforms existing SOTA approaches. Conclusion: By improving feature extraction and integration, the method significantly raises the accuracy of medical image segmentation. Abstract: Identifying biomarkers in medical images is vital for a wide range of biotech applications. However, recent Transformer and CNN based methods often struggle with variations in morphology and staining, which limits their feature extraction capabilities. In medical image segmentation, where data samples are often limited, state-of-the-art (SOTA) methods improve accuracy by using pre-trained encoders, while end-to-end approaches typically fall short due to difficulties in transferring multiscale features effectively between encoders and decoders. To handle these challenges, we introduce a nested UNet architecture that captures both local and global context through Multiscale Feature Fusion and Attention Mechanisms. This design improves feature integration from encoders, highlights key channels and regions, and restores spatial details to enhance segmentation performance. Our method surpasses SOTA approaches, as evidenced by experiments across four datasets and detailed ablation studies. Code: https://github.com/saadwazir/ReN-UNet

Action Valuation in Sports: A Survey

Artur Xarles,Sergio Escalera,Thomas B. Moeslund,Albert Clapés

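Task: Provide a comprehensive survey of the Action Valuation (AV) task in sports analytics and propose a taxonomy with nine dimensions.

Motivation: Although a few surveys cover related concepts such as Player Valuation, there is no in-depth analysis of action valuation across different sports.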

Details

Method: Introduces a nine-dimension taxonomy covering data, methodological approaches, evaluation techniques, and practical applications, and uses it to analyze existing research. Result: Identifies the key characteristics of effective action valuation methods and points out existing gaps in the research. Conclusion: Proposes future directions for advancing the field. Abstract: Action Valuation (AV) has emerged as a key topic in Sports Analytics, offering valuable insights by assigning scores to individual actions based on their contribution to desired outcomes. Despite a few surveys addressing related concepts such as Player Valuation, there is no comprehensive review dedicated to an in-depth analysis of AV across different sports. In this survey, we introduce a taxonomy with nine dimensions related to the AV task, encompassing data, methodological approaches, evaluation techniques, and practical applications. Through this analysis, we aim to identify the essential characteristics of effective AV methods, highlight existing gaps in research, and propose future directions for advancing the field.

Flash Sculptor: Modular 3D Worlds from Objects

Yujia Hu,Songhua Liu,Xingyi Yang,Xinchao Wang

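Task: Propose the Flash Sculptor framework for compositional 3D scene/object reconstruction from a single image.

Motivation: Existing text-to-3D and image-to-3D models perform poorly on complex scenes with multiple objects and intricate interactions, and their optimization process is cumbersome.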

Details

Method: Adopts a divide-and-conquer strategy that decomposes compositional scene reconstruction into sub-tasks handling appearance, rotation, scale, and translation, and introduces a coarse-to-fine rotation scheme and an outlier-removal-based translation algorithm. Result: Experiments show that Flash Sculptor is at least 3 times faster than existing methods and sets new benchmarks in compositional 3D reconstruction performance. Conclusion: Flash Sculptor is an efficient and accurate framework for compositional 3D reconstruction. Abstract: Existing text-to-3D and image-to-3D models often struggle with complex scenes involving multiple objects and intricate interactions. Although some recent attempts have explored such compositional scenarios, they still require an extensive process of optimizing the entire layout, which is highly cumbersome if not infeasible at all. To overcome these challenges, we propose Flash Sculptor in this paper, a simple yet effective framework for compositional 3D scene/object reconstruction from a single image. At the heart of Flash Sculptor lies a divide-and-conquer strategy, which decouples compositional scene reconstruction into a sequence of sub-tasks, including handling the appearance, rotation, scale, and translation of each individual instance. Specifically, for rotation, we introduce a coarse-to-fine scheme that brings the best of both worlds--efficiency and accuracy--while for translation, we develop an outlier-removal-based algorithm that ensures robust and precise parameters in a single step, without any iterative optimization. Extensive experiments demonstrate that Flash Sculptor achieves at least a 3 times speedup over existing compositional 3D methods, while setting new benchmarks in compositional 3D reconstruction performance. Codes are available at https://github.com/YujiaHu1109/Flash-Sculptor.

WoundAmbit: Bridging State-of-the-Art Semantic Segmentation and Real-World Wound Care

Vanessa Borst,Timo Dittus,Tassilo Dege,Astrid Schmieder,Samuel Kounev

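Task: Enable automated monitoring and remote tracking of chronic wounds through semantic segmentation.

Motivation: Chronic wounds heavily affect the elderly and diabetic patients, and remote monitoring can reduce the need for in-person visits, yet wound segmentation remains underrepresented in medical imaging research.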

Details

Method: Benchmarks state-of-the-art deep learning models from general-purpose vision, medical imaging, and public wound challenges, with standardized training, data augmentation, and evaluation. Result: TransNeXt shows the highest generalizability, all models process at least one image per second on the CPU, and expert evaluation finds high mask quality for all models. Conclusion: The proposed AI-driven wound size estimation framework, WoundAmbit, can be integrated into telehealth systems and has practical potential. Abstract: Chronic wounds affect a large population, particularly the elderly and diabetic patients, who often exhibit limited mobility and co-existing health conditions. Automated wound monitoring via mobile image capture can reduce in-person physician visits by enabling remote tracking of wound size. Semantic segmentation is key to this process, yet wound segmentation remains underrepresented in medical imaging research. To address this, we benchmark state-of-the-art deep learning models from general-purpose vision, medical imaging, and top methods from public wound challenges. For fair comparison, we standardize training, data augmentation, and evaluation, conducting cross-validation to minimize partitioning bias. We also assess real-world deployment aspects, including generalization to an out-of-distribution wound dataset, computational efficiency, and interpretability. Additionally, we propose a reference object-based approach to convert AI-generated masks into clinically relevant wound size estimates, and evaluate this, along with mask quality, for the best models based on physician assessments. Overall, the transformer-based TransNeXt showed the highest levels of generalizability. Despite variations in inference times, all models processed at least one image per second on the CPU, which is deemed adequate for the intended application. Interpretability analysis typically revealed prominent activations in wound regions, emphasizing focus on clinically relevant features. Expert evaluation showed high mask approval for all analyzed models, with VWFormer and ConvNeXtS backbone performing the best. Size retrieval accuracy was similar across models, and predictions closely matched expert annotations. Finally, we demonstrate how our AI-driven wound size estimation framework, WoundAmbit, can be integrated into a custom telehealth system. Our code will be made available on GitHub upon publication.
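
The reference object-based conversion mentioned in the abstract can be pictured with a simple scale calculation. The sketch below is only one plausible reading: it assumes a reference object of known physical area lies in roughly the same plane as the wound and that the camera is close to fronto-parallel. The function name and parameters are hypothetical and are not taken from WoundAmbit.

```python
def wound_area_cm2(wound_pixels, reference_pixels, reference_area_cm2):
    """Convert a wound mask's pixel count into cm^2 using a reference object.

    wound_pixels: number of pixels labelled as wound in the predicted mask.
    reference_pixels: number of pixels covered by the reference object.
    reference_area_cm2: known physical area of the reference object.
    """
    if reference_pixels <= 0:
        raise ValueError("reference object must cover at least one pixel")
    cm2_per_pixel = reference_area_cm2 / reference_pixels
    return wound_pixels * cm2_per_pixel


# Hypothetical numbers: a 4 cm^2 marker covers 10,000 pixels, the wound mask 5,000.
print(wound_area_cm2(5000, 10000, 4.0))
```

Perspective distortion and curved skin surfaces would violate the same-plane assumption, which is one reason a real system needs physician-validated accuracy checks like those reported in the abstract.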

HRMedSeg: Unlocking High-resolution Medical Image segmentation via Memory-efficient Attention Modeling

Qing Xu,Zhenye Lou,Chenxin Li,Xiangjian He,Rong Qu,Tesema Fiseha Berhanu,Yi Wang,Wenting Duan,Zhen Chen

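Task: Propose HRMedSeg, a memory-efficient framework for high-resolution medical image segmentation.

Motivation: Existing Transformer-based encoder-decoder frameworks incur high memory costs when predicting large segmentation masks, which limits practical use.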

Details

Method: Designs a lightweight gated vision transformer (LGViT) as the image encoder and an efficient cross-multiscale decoder (ECM-Decoder), and uses feature distillation during pretraining. Result: HRMedSeg outperforms existing methods across diverse high-resolution medical image segmentation tasks at low training cost (only 0.59GB of GPU memory per batch). Conclusion: HRMedSeg is an efficient and practical solution for high-resolution medical image segmentation. Abstract: High-resolution segmentation is critical for precise disease diagnosis by extracting micro-imaging information from medical images. Existing transformer-based encoder-decoder frameworks have demonstrated remarkable versatility and zero-shot performance in medical segmentation. While beneficial, they usually require huge memory costs when handling large-size segmentation mask predictions, which are expensive to apply to real-world scenarios. To address this limitation, we propose a memory-efficient framework for high-resolution medical image segmentation, called HRMedSeg. Specifically, we first devise a lightweight gated vision transformer (LGViT) as our image encoder to model long-range dependencies with linear complexity. Then, we design an efficient cross-multiscale decoder (ECM-Decoder) to generate high-resolution segmentation masks. Moreover, we utilize feature distillation during pretraining to unleash the potential of our proposed model. Extensive experiments reveal that HRMedSeg outperforms state-of-the-art methods in diverse high-resolution medical image segmentation tasks. In particular, HRMedSeg uses only 0.59GB GPU memory per batch during fine-tuning, demonstrating low training costs. Besides, when HRMedSeg meets the Segment Anything Model (SAM), our HRMedSegSAM takes 0.61% parameters of SAM-H. The code is available at https://github.com/xq141839/HRMedSeg.

HiMoR: Monocular Deformable Gaussian Reconstruction with Hierarchical Motion Representation

Yiming Liang,Tianhan Xu,Yuta Kikuchi

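Task: Propose HiMoR, a hierarchical motion representation for high-quality monocular dynamic 3D reconstruction.

Motivation: Motion in everyday scenes can be decomposed into coarse and fine motion, and exploiting this hierarchy allows dynamic 3D scenes to be represented more effectively.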

Details

Method: Uses a tree structure to represent different levels of motion detail, with shallower nodes modeling coarse motion for temporal smoothness and deeper nodes capturing fine motion; shared motion bases represent the motions of different sets of nodes. Result: Experiments show that HiMoR achieves superior novel view synthesis on monocular videos with complex motion. Conclusion: Through its hierarchical motion representation and shared motion bases, HiMoR provides a more structured deformation representation for monocular dynamic 3D reconstruction and significantly improves reconstruction quality. Abstract: We present Hierarchical Motion Representation (HiMoR), a novel deformation representation for 3D Gaussian primitives capable of achieving high-quality monocular dynamic 3D reconstruction. The insight behind HiMoR is that motions in everyday scenes can be decomposed into coarser motions that serve as the foundation for finer details. Using a tree structure, HiMoR's nodes represent different levels of motion detail, with shallower nodes modeling coarse motion for temporal smoothness and deeper nodes capturing finer motion. Additionally, our model uses a few shared motion bases to represent motions of different sets of nodes, aligning with the assumption that motion tends to be smooth and simple. This motion representation design provides Gaussians with a more structured deformation, maximizing the use of temporal relationships to tackle the challenging task of monocular dynamic 3D reconstruction. We also propose using a more reliable perceptual metric as an alternative, given that pixel-level metrics for evaluating monocular dynamic 3D reconstruction can sometimes fail to accurately reflect the true quality of reconstruction. Extensive experiments demonstrate our method's efficacy in achieving superior novel view synthesis from challenging monocular videos with complex motions.

Earth-Adapter: Bridge the Geospatial Domain Gaps with Mixture of Frequency Adaptation

Xiaoxing Hu,Ziyang Gong,Yupei Wang,Yuru Jia,Gen Luo,Xue Yang

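Task: Propose Earth-Adapter, a parameter-efficient fine-tuning method for remote sensing scenarios that addresses the shortcomings of existing PEFT methods on remote sensing image features.

Motivation: Existing PEFT methods are designed mainly for natural images and struggle with artifacts in remote sensing imagery, which degrades performance.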

Details

Method: Combines a Mixture of Adapter (MoA) with the Discrete Fourier Transformation (DFT), separating and handling artifacts through frequency decomposition and dynamic weight assignment. Result: On domain adaptation (DA) and domain generalization (DG) semantic segmentation tasks, Earth-Adapter improves mIoU over the baseline by 9.0% and 3.1% respectively. Conclusion: By handling artifacts in the frequency domain, Earth-Adapter significantly improves foundation model performance in remote sensing scenarios. Abstract: Parameter-Efficient Fine-Tuning (PEFT) is a technique that allows us to adapt powerful Foundation Models (FMs) to diverse downstream tasks while preserving and unleashing their inherent capabilities. However, we have observed that existing PEFT methods, which are often designed with natural imagery in mind, struggle when applied to Remote Sensing (RS) scenarios. This is primarily due to their inability to handle artifact influences, a problem particularly severe in RS image features. To tackle this challenge, we introduce Earth-Adapter, the first PEFT method specifically designed to overcome RS artifacts. Earth-Adapter introduces a novel Mixture of Frequency Adaptation process that combines a Mixture of Adapter (MoA) with Discrete Fourier Transformation (DFT). By utilizing DFT, Earth-Adapter can decompose features into different frequency components, precisely separating artifacts from original features. The MoA then dynamically assigns weights to each adapter expert, allowing for the combination of features across various frequency domains. These simple-yet-effective approaches enable Earth-Adapter to more efficiently overcome the disturbances caused by artifacts than previous PEFT methods, significantly enhancing the FMs' performance on RS scenarios. Experiments on Domain Adaptation (DA) and Domain Generalization (DG) semantic segmentation benchmarks showcase the Earth-Adapter's effectiveness. Compared with baseline Rein, Earth-Adapter improves mIoU by 9.0% in DA and 3.1% in DG benchmarks. Our code will be released at https://github.com/VisionXLab/Earth-Adapter.
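
To make the frequency decomposition idea concrete, the sketch below splits a 1-D signal into low- and high-frequency parts with a naive DFT and then recombines them with per-component weights. It is a toy illustration only: Earth-Adapter operates on image features and learns its weights with a router, whereas the 1-D signal, the hard `cutoff`, the fixed weights, and all function names here are assumptions.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def split_frequencies(x, cutoff):
    """Split x into (low, high) components; bins with frequency index <= cutoff count as low."""
    n = len(x)
    spectrum = dft(x)
    low_spec = [spectrum[k] if min(k, n - k) <= cutoff else 0 for k in range(n)]
    high_spec = [spectrum[k] - low_spec[k] for k in range(n)]
    return [v.real for v in idft(low_spec)], [v.real for v in idft(high_spec)]


def mix(components, weights):
    """Weighted sum of components, standing in for a router's dynamic weighting."""
    return [sum(w * c[i] for w, c in zip(weights, components))
            for i in range(len(components[0]))]


# A constant signal plus a rapidly alternating disturbance.
signal = [1.0 + 0.5 * (-1) ** t for t in range(8)]
low, high = split_frequencies(signal, cutoff=1)
print(mix([low, high], [1.0, 0.0]))  # keep only the low-frequency part
```

The split is lossless (low plus high reconstructs the input), so down-weighting one band is a controlled way to suppress a disturbance that lives in that band.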

HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance

Jiazi Bu,Pengyang Ling,Yujie Zhou,Pan Zhang,Tong Wu,Xiaoyi Dong,Yuhang Zang,Yuhang Cao,Dahua Lin,Jiaqi Wang

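Task: Propose the HiFlow framework to unlock the potential of pre-trained flow models for high-resolution image synthesis.

Motivation: High-resolution image synthesis is challenging because high-resolution content is scarce and complex, and existing methods fall short.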

Details

Method: HiFlow uses a virtual reference flow to provide guidance in three key aspects: initialization alignment, direction alignment, and acceleration alignment. Result: HiFlow substantially improves the high-resolution synthesis quality of T2I models and outperforms existing methods in experiments. Conclusion: HiFlow is a training-free, model-agnostic framework that effectively improves high-resolution image synthesis quality. Abstract: Text-to-image (T2I) diffusion/flow models have drawn considerable attention recently due to their remarkable ability to deliver flexible visual creations. Still, high-resolution image synthesis presents formidable challenges due to the scarcity and complexity of high-resolution content. To this end, we present HiFlow, a training-free and model-agnostic framework to unlock the resolution potential of pre-trained flow models. Specifically, HiFlow establishes a virtual reference flow within the high-resolution space that effectively captures the characteristics of low-resolution flow information, offering guidance for high-resolution generation through three key aspects: initialization alignment for low-frequency consistency, direction alignment for structure preservation, and acceleration alignment for detail fidelity. By leveraging this flow-aligned guidance, HiFlow substantially elevates the quality of high-resolution image synthesis of T2I models and demonstrates versatility across their personalized variants. Extensive experiments validate HiFlow's superiority in high-resolution image quality over current state-of-the-art methods.

Monitoring Viewer Attention During Online Ads

Mina Bishay,Graham Page,Waleed Emad,Mohammad Mavadati

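Task: Propose an architecture for monitoring viewer attention while online ads are being watched.

Motivation: In online ad testing, viewers may be distracted by their environment (for example a TV, a colleague talking, or phone notifications), which undermines the accuracy of ad effectiveness assessments.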

Details

Method: Uses AFFDEX 2.0 and SmartEye SDK to extract low-level features such as facial expressions, head pose, and gaze direction, then combines them into high-level features (such as estimated gaze on the screen plane, yawning, and speaking) to identify four primary distractors. Result: The architecture is validated on annotated datasets and a real-world ad testing dataset, and detects distraction on both desktop and mobile devices. Conclusion: The architecture performs well at detecting viewer distraction and provides a reliable tool for online ad testing. Abstract: Nowadays, video ads spread through numerous online platforms, and are being watched by millions of viewers worldwide. Big brands gauge the liking and purchase intent of their new ads, by analyzing the facial responses of viewers recruited online to watch the ads from home or work. Although this approach captures naturalistic responses, it is susceptible to distractions inherent in the participants' environments, such as a movie playing on TV, a colleague speaking, or mobile notifications. Inattentive participants should get flagged and eliminated to avoid skewing the ad-testing process. In this paper we introduce an architecture for monitoring viewer attention during online ads. Leveraging two behavior analysis toolkits, AFFDEX 2.0 and SmartEye SDK, we extract low-level facial features encompassing facial expressions, head pose, and gaze direction. These features are then combined to extract high-level features that include estimated gaze on the screen plane, yawning, speaking, etc. This enables the identification of four primary distractors: off-screen gaze, drowsiness, speaking, and unattended screen. Our architecture tailors the gaze settings according to the device type (desktop or mobile). We validate our architecture first on datasets annotated for specific distractors, and then on a real-world ad testing dataset with various distractors. The proposed architecture shows promising results in detecting distraction across both desktop and mobile devices.

Transfer between Modalities with MetaQueries

Xichen Pan,Satya Narayan Shukla,Aashu Singh,Zhuokai Zhao,Shlok Kumar Mishra,Jialiang Wang,Zhiyang Xu,Jiuhai Chen,Kunpeng Li,Felix Juefei-Xu,Ji Hou,Saining Xie

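Task: Propose MetaQueries, a method that builds an efficient interface between autoregressive multimodal large language models (MLLMs) and diffusion models to enable knowledge-augmented image generation.

Motivation: Existing unified multimodal models often need complex training recipes and careful data balancing to integrate understanding and generation, so a simpler and more efficient approach is needed.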

Details

Method: Introduces MetaQueries, a set of learnable queries that connects the MLLM's latent representations to a diffusion decoder and is trained with only paired image-caption data and standard diffusion objectives. Result: Even with the MLLM backbone frozen, the method achieves strong generative performance and supports flexible advanced applications such as image editing and subject-driven generation. Conclusion: MetaQueries offers a simpler-to-train, efficient, and flexible approach to multimodal generation while preserving the MLLM's advanced understanding capabilities. Abstract: Unified multimodal models aim to integrate understanding (text output) and generation (pixel output), but aligning these different modalities within a single architecture often demands complex training recipes and careful data balancing. We introduce MetaQueries, a set of learnable queries that act as an efficient interface between autoregressive multimodal LLMs (MLLMs) and diffusion models. MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies training, requiring only paired image-caption data and standard diffusion objectives. Notably, this transfer is effective even when the MLLM backbone remains frozen, thereby preserving its state-of-the-art multimodal understanding capabilities while achieving strong generative performance. Additionally, our method is flexible and can be easily instruction-tuned for advanced applications such as image editing and subject-driven generation.

PainNet: Statistical Relation Network with Episode-Based Training for Pain Estimation

Mina Bishay,Graham Page,Mohammad Mavadati

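Task: Propose PainNet, a statistical relation network for estimating sequence-level pain.

Motivation: Existing methods focus mainly on estimating pain from facial expressions and overlook the patient-reported, sequence-level pain commonly used in clinics.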

Details

Method: PainNet consists of an embedding module and a relation module, extracts video-level features with a statistical layer on top of an RNN, and uses end-to-end, episode-based training. Result: Experiments show that the statistical layer and the episode-based training scheme are effective, and PainNet outperforms existing methods on self-reported pain estimation. Conclusion: PainNet offers an efficient, high-performing solution for sequence-level pain estimation. Abstract: Despite the extensive work on estimating pain from facial expressions, few works have focused on estimating the sequence-level pain, which is reported by patients and used commonly in clinics. In this paper, we introduce a novel Statistical Relation Network, referred to as PainNet, designed for the estimation of the sequence-level pain. PainNet employs two key modules, the embedding and the relation modules, for comparing pairs of pain videos, and producing relation scores indicating if each pair belongs to the same pain category or not. At the core of the embedding module is a statistical layer mounted on the top of an RNN for extracting compact video-level features. The statistical layer is implemented as part of the deep architecture. Doing so allows combining multiple training stages used in previous research into a single end-to-end training stage. PainNet is trained using the episode-based training scheme, which involves comparing a query video with a set of videos representing the different pain categories. Experimental results show the benefit of using the statistical layer and the episode-based training in the proposed model. Furthermore, PainNet outperforms the state-of-the-art results on self-reported pain estimation.
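
A statistical layer of the kind described above can be thought of as pooling a variable-length sequence of per-frame features into one fixed-size video-level vector. The abstract does not list which statistics PainNet computes, so the choice of mean, standard deviation, minimum, and maximum below is an assumption, as is the function name.

```python
import math


def statistical_pool(frame_features):
    """Pool T per-frame feature vectors (each of length D) into one vector of length 4*D.

    The output concatenates, per feature dimension: mean, population std, min, max.
    """
    t = len(frame_features)
    dims = list(zip(*frame_features))  # D tuples, each holding T values
    means = [sum(d) / t for d in dims]
    stds = [math.sqrt(sum((v - m) ** 2 for v in d) / t) for d, m in zip(dims, means)]
    mins = [min(d) for d in dims]
    maxs = [max(d) for d in dims]
    return means + stds + mins + maxs


# Hypothetical RNN outputs for a two-frame and a three-frame video (D = 2).
short_video = [[1.0, 2.0], [3.0, 2.0]]
long_video = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
print(len(statistical_pool(short_video)), len(statistical_pool(long_video)))
```

Because the output size depends only on D, videos of different lengths become directly comparable, which is what a relation module needs when scoring pairs of videos.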

OmniSVG: A Unified Scalable Vector Graphics Generation Model

Yiying Yang,Wei Cheng,Sijin Chen,Xianfang Zeng,Jiaxu Zhang,Liao Wang,Gang Yu,Xingjun Ma,Yu-Gang Jiang

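Task: Propose OmniSVG, a unified framework for generating high-quality, complex SVG images.

Motivation: Existing SVG generation methods produce unstructured outputs, incur high computational cost, or can only generate simplified monochrome icons.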

Details

Method: Uses pre-trained Vision-Language Models (VLMs) for end-to-end multimodal SVG generation, parameterizing SVG commands and coordinates into discrete tokens. Result: OmniSVG outperforms existing methods in experiments and shows potential for professional SVG design workflows. Conclusion: OmniSVG offers an efficient and expressive solution for high-quality SVG generation and advances the field of SVG synthesis. Abstract: Scalable Vector Graphics (SVG) is an important image format widely adopted in graphic design because of its resolution independence and editability. The study of generating high-quality SVG has continuously drawn attention from both designers and researchers in the AIGC community. However, existing methods either produce unstructured outputs with huge computational cost or are limited to generating monochrome icons of over-simplified structures. To produce high-quality and complex SVG, we propose OmniSVG, a unified framework that leverages pre-trained Vision-Language Models (VLMs) for end-to-end multimodal SVG generation. By parameterizing SVG commands and coordinates into discrete tokens, OmniSVG decouples structural logic from low-level geometry for efficient training while maintaining the expressiveness of complex SVG structure. To further advance the development of SVG synthesis, we introduce MMSVG-2M, a multimodal dataset with two million richly annotated SVG assets, along with a standardized evaluation protocol for conditional SVG generation tasks. Extensive experiments show that OmniSVG outperforms existing methods and demonstrates its potential for integration into professional SVG design workflows.
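
One way to picture "parameterizing SVG commands and coordinates into discrete tokens" is a small reversible vocabulary: one token per path command and one token per grid cell for each coordinate pair. The sketch below is a hypothetical scheme for illustration; the command set, the 200x200 grid, and the token layout are assumptions, not OmniSVG's actual vocabulary.

```python
COMMANDS = ["M", "L", "C", "Z"]  # hypothetical command vocabulary
GRID = 200                       # hypothetical canvas resolution


def tokenize(path):
    """Encode a path, given as [(command, [(x, y), ...]), ...], as a flat list of integer tokens.

    Command tokens occupy ids [0, len(COMMANDS)); each integer coordinate pair
    becomes a single token len(COMMANDS) + x * GRID + y.
    """
    tokens = []
    for cmd, points in path:
        tokens.append(COMMANDS.index(cmd))
        for x, y in points:
            if not (0 <= x < GRID and 0 <= y < GRID):
                raise ValueError("coordinate outside the canvas grid")
            tokens.append(len(COMMANDS) + x * GRID + y)
    return tokens


def detokenize(tokens):
    """Invert tokenize(): rebuild the list of (command, points) tuples."""
    path = []
    for tok in tokens:
        if tok < len(COMMANDS):
            path.append((COMMANDS[tok], []))
        else:
            path[-1][1].append(divmod(tok - len(COMMANDS), GRID))
    return path


triangle = [("M", [(10, 20)]), ("L", [(30, 40)]), ("L", [(10, 40)]), ("Z", [])]
assert detokenize(tokenize(triangle)) == triangle
```

Merging each (x, y) pair into one token keeps sequences short, at the cost of a larger vocabulary; a scheme with separate x and y tokens would make the opposite trade-off.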

D^2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes

Jisang Han,Honggyu An,Jaewoo Jung,Takuya Narihira,Junyoung Seo,Kazumi Fukuda,Chaehyun Kim,Sunghwan Hong,Yuki Mitsufuji,Seungryong Kim

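Task: Address 3D reconstruction in dynamic scenes, improving the performance of static 3D scene reconstruction methods under dynamic motion.

Motivation: Existing methods such as DUSt3R work well in static scenes but fail to align in dynamic scenes because of motion.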

Details

Method: Proposes D^2USt3R, which regresses 4D pointmaps that capture both static and dynamic 3D scene geometry. Result: Experiments show that the method outperforms existing approaches on scenes with complex motion. Conclusion: By combining spatial and temporal information, D^2USt3R significantly improves 3D reconstruction in dynamic scenes. Abstract: We address the task of 3D reconstruction in dynamic scenes, where object motions degrade the quality of previous 3D pointmap regression methods, such as DUSt3R, originally designed for static 3D scene reconstruction. Although these methods provide an elegant and powerful solution in static settings, they struggle in the presence of dynamic motions that disrupt alignment based solely on camera poses. To overcome this, we propose D^2USt3R that regresses 4D pointmaps that simultaneously capture both static and dynamic 3D scene geometry in a feed-forward manner. By explicitly incorporating both spatial and temporal aspects, our approach successfully encapsulates spatio-temporal dense correspondence to the proposed 4D pointmaps, enhancing downstream tasks. Extensive experimental evaluations demonstrate that our proposed approach consistently achieves superior reconstruction performance across various datasets featuring complex motions.

Scale Up Composed Image Retrieval Learning via Modification Text Generation

Yinan Zhou,Yaxiong Wang,Haokun Lin,Chen Ma,Li Zhu,Zhedong Zheng

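Task: Augment the training resources for Composed Image Retrieval (CIR) by synthesizing training triplets.

Motivation: Address the limited training data and laborious annotation process in CIR.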

Details

Method: Trains a modification text generator using large-scale multimodal models, generates synthetic triplets in both the pretraining and fine-tuning stages, and proposes a two-hop alignment strategy to close the semantic gap. Result: Achieves competitive recall on the CIRR and FashionIQ benchmarks. Conclusion: The proposed method effectively addresses the data limitations of CIR and improves retrieval performance. Abstract: Composed Image Retrieval (CIR) aims to search an image of interest using a combination of a reference image and modification text as the query. Despite recent advancements, this task remains challenging due to limited training data and laborious triplet annotation processes. To address this issue, this paper proposes to synthesize the training triplets to augment the training resource for the CIR problem. Specifically, we commence by training a modification text generator exploiting large-scale multimodal models and scale up the CIR learning throughout both the pretraining and fine-tuning stages. During pretraining, we leverage the trained generator to directly create Modification Text-oriented Synthetic Triplets (MTST) conditioned on pairs of images. For fine-tuning, we first synthesize reverse modification text to connect the target image back to the reference image. Subsequently, we devise a two-hop alignment strategy to incrementally close the semantic gap between the multimodal pair and the target image. We initially learn an implicit prototype utilizing both the original triplet and its reversed version in a cycle manner, followed by combining the implicit prototype feature with the modification text to facilitate accurate alignment with the target image. Extensive experiments validate the efficacy of the generated triplets and confirm that our proposed methodology attains competitive recall on both the CIRR and FashionIQ benchmarks.

MASS: MoErging through Adaptive Subspace Selection

Donato Crisostomi,Alessandro Zirilli,Antonio Andrea Gargiulo,Maria Sofia Bucarelli,Simone Scardapane,Fabrizio Silvestri,Iacopo Masi,Emanuele Rodolà

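Task: Propose MASS, a model merging method that combines multiple fine-tuned models into a single shared model through adaptive subspace selection.

Motivation: Existing model merging methods fall short of the accuracy of separately fine-tuned models; MASS aims to close this gap.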

Details

Method: Building on low-rank decomposition, store the most salient singular components for each task and use a data-free router to activate task-specific blocks at inference time. Result: Across multi-task benchmarks, MASS recovers up to ~98% of the average accuracy of individually fine-tuned models. Conclusion: MASS is a practical alternative with very low storage cost that approaches the performance of ensembling. Abstract: Model merging has recently emerged as a lightweight alternative to ensembling, combining multiple fine-tuned models into a single set of parameters with no additional training overhead. Yet, existing merging methods fall short of matching the full accuracy of separately fine-tuned endpoints. We present MASS (MoErging through Adaptive Subspace Selection), a new approach that closes this gap by unifying multiple fine-tuned models while retaining near state-of-the-art performance across tasks. Building on the low-rank decomposition of per-task updates, MASS stores only the most salient singular components for each task and merges them into a shared model. At inference time, a non-parametric, data-free router identifies which subspace (or combination thereof) best explains an input's intermediate features and activates the corresponding task-specific block. This procedure is fully training-free and introduces only a two-pass inference overhead plus a ~2x storage factor compared to a single pretrained model, irrespective of the number of tasks. We evaluate MASS on CLIP-based image classification using ViT-B-16, ViT-B-32 and ViT-L-14 for benchmarks of 8, 14 and 20 tasks respectively, establishing a new state-of-the-art. Most notably, MASS recovers up to ~98% of the average accuracy of individual fine-tuned models, making it a practical alternative to ensembling at a fraction of the storage cost.
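The low-rank step in this abstract can be illustrated with a truncated SVD of a single task update. This is a minimal sketch, not the authors' implementation: the helper names `top_k_components` and `reconstruct`, the rank `k`, and the random matrices are assumptions for illustration, and the router is omitted entirely.

```python
import numpy as np

def top_k_components(w_pre, w_ft, k):
    """Keep the k largest singular components of a task update (hypothetical helper)."""
    delta = w_ft - w_pre                           # per-task update for one layer
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]              # singular values are sorted descending

def reconstruct(u, s, vt):
    """Rebuild the low-rank approximation of the update."""
    return (u * s) @ vt

rng = np.random.default_rng(0)
w_pre = rng.normal(size=(64, 32))                  # stand-in for pretrained weights
# Simulate a fine-tuned layer whose update has rank 4.
w_ft = w_pre + rng.normal(size=(64, 4)) @ rng.normal(size=(4, 32))

u, s, vt = top_k_components(w_pre, w_ft, k=4)
delta = w_ft - w_pre
rel_err = np.linalg.norm(reconstruct(u, s, vt) - delta) / np.linalg.norm(delta)
print(f"relative reconstruction error with k=4: {rel_err:.2e}")
```

Storing only `u`, `s`, and `vt` costs k(m + n + 1) numbers per layer instead of mn, which is where the storage saving of a low-rank representation comes from.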

A Nature-Inspired Colony of Artificial Intelligence System with Fast, Detailed, and Organized Learner Agents for Enhancing Diversity and Quality

Shan Suthaharan

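Task: Propose a colony of AI agents, built on convolutional neural networks (CNNs) and multi-agent systems, that performs multiple tasks (e.g., prediction or classification).

Motivation: Mimic the natural environment of biological systems (such as an ant colony or a human colony) to build a colony of AI agents that are fast, detailed, and organized learners.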

Details

Method: Build a role-based colony of AI agents by combining the crossover and mutation mechanisms of genetic algorithms with the pretrained VGG16, VGG19, and ResNet50 models. Result: Simulations show that child-AI agents in the colony achieve F1-scores between 82% and 95%. Conclusion: The approach improves collective decision-making through a diverse, high-quality colony of AI agents. Abstract: The concepts of convolutional neural networks (CNNs) and multi-agent systems are two important areas of research in artificial intelligence (AI). In this paper, we present an approach that builds a CNN-based colony of AI agents to serve as a single system and perform multiple tasks (e.g., predictions or classifications) in an environment. The proposed system impersonates the natural environment of a biological system, like an ant colony or a human colony. The proposed colony of AI that is defined as a role-based system uniquely contributes to accomplish tasks in an environment by incorporating AI agents that are fast learners, detailed learners, and organized learners. These learners can enhance their localized learning and their collective decisions as a single system of colony of AI agents. This approach also enhances the diversity and quality of the colony of AI with the help of Genetic Algorithms and their crossover and mutation mechanisms. The evolution of fast, detailed, and organized learners in the colony of AI is achieved by introducing a unique one-to-one mapping between these learners and the pretrained VGG16, VGG19, and ResNet50 models, respectively. This role-based approach creates two parent-AI agents using the AI models through the processes, called the intra- and inter-marriage of AI, so that they can share their learned knowledge (weights and biases) based on a probabilistic rule and produce diversified child-AI agents to perform new tasks. This process will form a colony of AI that consists of families of multi-model and mixture-model AI agents to improve diversity and quality. Simulations show that the colony of AI, built using the VGG16, VGG19, and ResNet50 models, can provide a single system that generates child-AI agents of excellent predictive performance, ranging between 82% and 95% of F1-scores, to make diversified collective and quality decisions on a task.
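The idea of two parent agents sharing weights under a probabilistic rule, followed by mutation, can be sketched as a uniform crossover. The abstract does not state the actual rule, so everything below is an assumption for illustration: the function names, the probability `p_a`, the mutation `rate` and `scale`, and the toy weight matrices.

```python
import numpy as np

def crossover(parent_a, parent_b, p_a=0.5, rng=None):
    """Uniform crossover: each weight is taken from parent A with probability p_a."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(parent_a.shape) < p_a
    return np.where(mask, parent_a, parent_b)

def mutate(weights, rate=0.01, scale=0.1, rng=None):
    """Add Gaussian noise to a randomly chosen fraction `rate` of the weights."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(weights.shape) < rate
    return weights + mask * rng.normal(scale=scale, size=weights.shape)

rng = np.random.default_rng(42)
parent_a = np.zeros((4, 4))    # toy stand-ins for two parents' layer weights
parent_b = np.ones((4, 4))
child = mutate(crossover(parent_a, parent_b, rng=rng), rng=rng)
print(child.round(2))
```

Crossover between parents with different architectures (for example VGG16 and ResNet50) would need a mapping between layers of different shapes, which this sketch does not attempt.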

A Novel Approach to Linking Histology Images with DNA Methylation

Manahil Raza,Muhammad Dawood,Talha Qaiser,Nasir M. Rajpoot

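Task: Predict the methylation state of gene groups from whole slide images (WSIs).

Motivation: DNA methylation assays are costly and slow, whereas WSIs are more readily available, so exploring the relationship between WSIs and methylation patterns is worthwhile.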

Details

Method: Propose a graph neural network based weakly supervised learning framework to predict the methylation state of gene groups. Result: On three TCGA cohorts, the method improves AUROC by more than 20% over existing methods, and the gene groups are significantly enriched in important pathways. Conclusion: This is the first study to use weakly supervised deep learning to explore the association between histological patterns and gene methylation states across multiple cancer types. Abstract: DNA methylation is an epigenetic mechanism that regulates gene expression by adding methyl groups to DNA. Abnormal methylation patterns can disrupt gene expression and have been linked to cancer development. To quantify DNA methylation, specialized assays are typically used. However, these assays are often costly and have lengthy processing times, which limits their widespread availability in routine clinical practice. In contrast, whole slide images (WSIs) for the majority of cancer patients can be more readily available. As such, given the ready availability of WSIs, there is a compelling need to explore the potential relationship between WSIs and DNA methylation patterns. To address this, we propose an end-to-end graph neural network based weakly supervised learning framework to predict the methylation state of gene groups exhibiting coherent patterns across samples. Using data from three cohorts from The Cancer Genome Atlas (TCGA) - TCGA-LGG (Brain Lower Grade Glioma), TCGA-GBM (Glioblastoma Multiforme) (n=729) and TCGA-KIRC (Kidney Renal Clear Cell Carcinoma) (n=511) - we demonstrate that the proposed approach achieves significantly higher AUROC scores than the state-of-the-art (SOTA) methods, by more than 20%. We conduct gene set enrichment analyses on the gene groups and show that the majority of the gene groups are significantly enriched in important hallmarks and pathways. We also generate spatially enriched heatmaps to further investigate links between histological patterns and DNA methylation states. To the best of our knowledge, this is the first study that explores association of spatially resolved histological patterns with gene group methylation states across multiple cancer types using weakly supervised deep learning.

Improved Stochastic Texture Filtering Through Sample Reuse

Bartlomiej Wronski,Matt Pharr,Tomas Akenine-Möller

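Task: Improve the quality of stochastic texture filtering (STF) under texture magnification and reduce the image difference compared to traditional texture filtering.

Motivation: Under magnification, STF can cause aliasing and non-smooth interpolation of material properties, which degrades visual quality.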

Details

Method: Propose a new method that shares texel values among neighboring pixels, combined with an improved weighted importance sampling technique. Result: Under high magnification, PSNR is more than 10 dB higher than single-sample STF, with greatly improved image quality. Conclusion: The method significantly improves STF magnification quality at low cost (0.04-0.14 ms per frame). Abstract: Stochastic texture filtering (STF) has re-emerged as a technique that can bring down the cost of texture filtering of advanced texture compression methods, e.g., neural texture compression. However, during texture magnification, the swapped order of filtering and shading with STF can result in aliasing. The inability to smoothly interpolate material properties stored in textures, such as surface normals, leads to potentially undesirable appearance changes. We present a novel method to improve the quality of stochastically-filtered magnified textures and reduce the image difference compared to traditional texture filtering. When textures are magnified, nearby pixels filter similar sets of texels and we introduce techniques for sharing texel values among pixels with only a small increase in cost (0.04-0.14 ms per frame). We propose an improvement to weighted importance sampling that guarantees that our method never increases error beyond single-sample stochastic texture filtering. Under high magnification, our method has >10 dB higher PSNR than single-sample STF. Our results show greatly improved image quality both with and without spatiotemporal denoising.

SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding

Mingfei Chen,Israel D. Gebru,Ishwarya Ananthabhotla,Christian Richardt,Dejan Markovic,Jake Sandakly,Steven Krenn,Todd Keebler,Eli Shlizerman,Alexander Richard

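Task: Generate the ambient sound of an arbitrary scene at novel viewpoints.

Motivation: Existing methods require constraints on, or detailed knowledge of, the sound sources, whereas SoundVista needs no such prior knowledge and adapts to diverse room layouts and unseen environments.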

Details

Method: Learn the acoustic transfer function and combine it with a visual-acoustic binding module that uses panoramic RGB and depth data to synthesize sound. Result: Outperforms existing methods on public datasets and in real-world settings. Conclusion: SoundVista is an efficient and adaptable method for ambient sound synthesis. Abstract: We introduce SoundVista, a method to generate the ambient sound of an arbitrary scene at novel viewpoints. Given a pre-acquired recording of the scene from sparsely distributed microphones, SoundVista can synthesize the sound of that scene from an unseen target viewpoint. The method learns the underlying acoustic transfer function that relates the signals acquired at the distributed microphones to the signal at the target viewpoint, using a limited number of known recordings. Unlike existing works, our method does not require constraints or prior knowledge of sound source details. Moreover, our method efficiently adapts to diverse room layouts, reference microphone configurations and unseen environments. To enable this, we introduce a visual-acoustic binding module that learns visual embeddings linked with local acoustic properties from panoramic RGB and depth data. We first leverage these embeddings to optimize the placement of reference microphones in any given scene. During synthesis, we leverage multiple embeddings extracted from reference locations to get adaptive weights for their contribution, conditioned on target viewpoint. We benchmark the task on both publicly available data and real-world settings. We demonstrate significant improvements over existing methods.

Class Imbalance Correction for Improved Universal Lesion Detection and Tagging in CT

Peter D. Erickson,Tejas Sudharshan Mathai,Ronald M. Summers

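Task: Train a VFNet model to detect and tag lesions in CT images while addressing class imbalance in the data.

Motivation: The DeepLesion dataset has missing measurements and labels and a severely imbalanced class distribution, which hurts the accuracy of lesion detection and tagging.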

Details

Method: Use a 6% subset of DeepLesion (1331 lesions) and balance the data in three experiments: 1) by body part label, 2) by number of lesions per patient, and 3) by lesion size. Result: Balancing body part labels substantially increased sensitivity for classes with little data (e.g., Bone: 80% vs. 46%), and balancing by lesion size also improved recall for all classes. Conclusion: Data balancing strategies effectively improve lesion detection and tagging, and a structured reporting guideline is proposed. Abstract: Radiologists routinely detect and size lesions in CT to stage cancer and assess tumor burden. To potentially aid their efforts, multiple lesion detection algorithms have been developed with a large public dataset called DeepLesion (32,735 lesions, 32,120 CT slices, 10,594 studies, 4,427 patients, 8 body part labels). However, this dataset contains missing measurements and lesion tags, and exhibits a severe imbalance in the number of lesions per label category. In this work, we utilize a limited subset of DeepLesion (6%, 1331 lesions, 1309 slices) containing lesion annotations and body part label tags to train a VFNet model to detect lesions and tag them. We address the class imbalance by conducting three experiments: 1) Balancing data by the body part labels, 2) Balancing data by the number of lesions per patient, and 3) Balancing data by the lesion size. In contrast to a randomly sampled (unbalanced) data subset, our results indicated that balancing the body part labels always increased sensitivity for lesions >= 1cm for classes with low data quantities (Bone: 80% vs. 46%, Kidney: 77% vs. 61%, Soft Tissue: 70% vs. 60%, Pelvis: 83% vs. 76%). Similar trends were seen for three other models tested (FasterRCNN, RetinaNet, FoveaBox). Balancing data by lesion size also helped the VFNet model improve recalls for all classes in contrast to an unbalanced dataset. We also provide a structured reporting guideline for a "Lesions" subsection to be entered into the "Findings" section of a radiology report. To our knowledge, we are the first to report the class imbalance in DeepLesion, and have taken data-driven steps to address it in the context of joint lesion detection and tagging.
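Balancing by body part label can be sketched as resampling each class to a common size. The abstract does not say how the balanced subsets were drawn, so this is one common approach under stated assumptions: the function `balance_by_label`, the target `per_class`, and the toy lesion records are all hypothetical.

```python
import random
from collections import defaultdict

def balance_by_label(samples, per_class, seed=0):
    """Resample (item, label) pairs so every label ends up with `per_class` items."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)
    balanced = []
    for label, items in sorted(by_label.items()):
        if len(items) >= per_class:
            chosen = rng.sample(items, per_class)      # undersample large classes
        else:
            # Oversample small classes by drawing extra items with replacement.
            chosen = items + rng.choices(items, k=per_class - len(items))
        balanced.extend((item, label) for item in chosen)
    return balanced

# Toy data: 50 "lung" lesions and only 5 "bone" lesions.
data = [(f"lesion_{i}", "lung") for i in range(50)]
data += [(f"lesion_{i}", "bone") for i in range(50, 55)]
balanced = balance_by_label(data, per_class=20)
counts = {lab: sum(1 for _, l in balanced if l == lab) for lab in ("lung", "bone")}
print(counts)
```

The same function covers the other two experiments if the label is replaced by a lesions-per-patient bucket or a lesion-size bucket.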

PyTopo3D: A Python Framework for 3D SIMP-based Topology Optimization

Jihoon Kim,Namwoo Kang

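Task: Develop PyTopo3D, an open-source Python framework for 3D topology optimization.

Motivation: The Python scientific ecosystem lacks easy-to-use, open-source tools for 3D topology optimization; PyTopo3D aims to fill this gap.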

Details

Method: Based on the Solid Isotropic Material with Penalization (SIMP) method and an Optimality Criteria (OC) update scheme, combined with sparse matrix operations, parallel solvers, and accelerated KD-Tree sensitivity filtering. Result: PyTopo3D provides a feature-rich tool that supports direct import of complex design domains, 3D visualization of the optimization process, and STL export of optimized geometries. Conclusion: PyTopo3D is an accessible, performance-aware tool designed to help engineers, students, and researchers carry out 3D topology optimization in Python more easily. Abstract: Three-dimensional topology optimization (TO) is a powerful technique in engineering design, but readily usable, open-source implementations remain limited within the popular Python scientific environment. This paper introduces PyTopo3D, a software framework developed to address this gap. PyTopo3D provides a feature-rich tool for 3D TO by implementing the well-established Solid Isotropic Material with Penalization (SIMP) method and an Optimality Criteria (OC) update scheme, adapted and significantly enhanced from the efficient MATLAB code by Liu and Tovar (2014). While building on proven methodology, PyTopo3D's primary contribution is its integration and extension within Python, leveraging sparse matrix operations, optional parallel solvers, and accelerated KD-Tree sensitivity filtering for performance. Crucially, it incorporates functionalities vital for practical engineering workflows, including the direct import of complex design domains and non-design obstacles via STL files, integrated 3D visualization of the optimization process, and direct STL export of optimized geometries for manufacturing or further analysis. PyTopo3D is presented as an accessible, performance-aware tool and citable reference designed to empower engineers, students, and researchers to more easily utilize 3D TO within their existing Python-based workflows.
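The two building blocks named in this abstract, SIMP interpolation and the OC update, can be sketched in NumPy in their textbook form. This is not PyTopo3D's code or API: the function names, default parameters, and the made-up sensitivities are assumptions, and the finite element solve and sensitivity filter are left out.

```python
import numpy as np

def simp_modulus(x, e0=1.0, e_min=1e-9, penal=3.0):
    """Modified SIMP interpolation: E(x) = E_min + x^p * (E0 - E_min)."""
    return e_min + x**penal * (e0 - e_min)

def oc_update(x, dc, dv, volfrac, move=0.2):
    """One Optimality Criteria update; bisection finds the Lagrange multiplier."""
    lo, hi = 0.0, 1e9
    while (hi - lo) / (hi + lo) > 1e-6:
        lmid = 0.5 * (lo + hi)
        # Scale densities by sqrt(-dc / (dv * lambda)), then apply move limits and [0, 1] bounds.
        x_new = np.clip(x * np.sqrt(-dc / (dv * lmid)),
                        np.maximum(x - move, 0.0),
                        np.minimum(x + move, 1.0))
        if x_new.mean() > volfrac:
            lo = lmid      # too much material: raise the multiplier
        else:
            hi = lmid
    return x_new

rng = np.random.default_rng(0)
volfrac = 0.4
x = np.full(1000, volfrac)                  # uniform initial design
dc = -rng.uniform(0.1, 1.0, size=1000)      # made-up compliance sensitivities (negative)
dv = np.ones(1000)                          # volume sensitivities
x_new = oc_update(x, dc, dv, volfrac)
print(f"mean density after update: {x_new.mean():.4f}")
```

Elements with larger sensitivity magnitude gain material and the rest lose it, while the bisection keeps the mean density at the target volume fraction.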

Technical Report: Full Version of Analyzing and Optimizing Perturbation of DP-SGD Geometrically

Jiawei Duan,Haibo Hu,Qingqing Ye,Xinyue Sun

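Task: Study the efficiency of differential privacy (DP) in machine learning and propose GeoDP, a geometric perturbation strategy that optimizes gradient perturbation.

Motivation: DP-SGD is widely used, but the effect of its noise on gradient direction makes it inefficient, and existing solutions do not address the root cause.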

Details

Method: Theoretically analyze the impact of DP noise on training and design GeoDP, which perturbs the direction and the magnitude of a gradient separately. Result: Experiments on MNIST, CIFAR-10, and a synthetic dataset with several models confirm GeoDP's effectiveness and generality. Conclusion: By reducing noise on the direction, GeoDP significantly improves model efficiency under the same DP guarantee. Abstract: Differential privacy (DP) has become a prevalent privacy model in a wide range of machine learning tasks, especially after the debut of DP-SGD. However, DP-SGD, which directly perturbs gradients in the training iterations, fails to mitigate the negative impacts of noise on gradient direction. As a result, DP-SGD is often inefficient. Although various solutions (e.g., clipping to reduce the sensitivity of gradients and amplifying privacy bounds to save privacy budgets) are proposed to trade privacy for model efficiency, the root cause of its inefficiency is yet to be unveiled. In this work, we first generalize DP-SGD and theoretically derive the impact of DP noise on the training process. Our analysis reveals that, in terms of a perturbed gradient, only the noise on direction has eminent impact on the model efficiency while that on magnitude can be mitigated by optimization techniques, i.e., fine-tuning gradient clipping and learning rate. Besides, we confirm that traditional DP introduces biased noise on the direction when adding unbiased noise to the gradient itself. Overall, the perturbation of DP-SGD is actually sub-optimal from a geometric perspective. Motivated by this, we design a geometric perturbation strategy GeoDP within the DP framework, which perturbs the direction and the magnitude of a gradient, respectively. By directly reducing the noise on the direction, GeoDP mitigates the negative impact of DP noise on model efficiency with the same DP guarantee. Extensive experiments on two public datasets (i.e., MNIST and CIFAR-10), one synthetic dataset and three prevalent models (i.e., Logistic Regression, CNN and ResNet) confirm the effectiveness and generality of our strategy.
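To make the direction-versus-magnitude view concrete, the sketch below applies the standard DP-SGD step (per-sample clipping plus Gaussian noise) and then compares the direction of the noisy gradient with the clean one. It does not reproduce GeoDP's perturbation mechanism, which the abstract does not specify; the function names and all parameter values are assumptions.

```python
import numpy as np

def dp_sgd_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-sample gradient to `clip_norm`, sum, add Gaussian noise, average."""
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clipped = per_sample_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    noise = rng.normal(scale=noise_multiplier * clip_norm,
                       size=per_sample_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_sample_grads)

def direction_and_magnitude(g):
    """Split a gradient into its Euclidean norm and its unit direction."""
    mag = np.linalg.norm(g)
    return mag, g / mag

rng = np.random.default_rng(0)
grads = rng.normal(size=(32, 10))     # toy per-sample gradients
clean = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)

_, d_clean = direction_and_magnitude(clean)
_, d_noisy = direction_and_magnitude(noisy)
cosine = float(d_clean @ d_noisy)
print(f"cosine similarity between clean and noisy directions: {cosine:.3f}")
```

A cosine well below 1 shows how much the added noise rotates the update direction, which is the effect the abstract identifies as the main source of inefficiency.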

Jungkyu Park,Jan Witowski,Yanqi Xu,Hari Trivedi,Judy Gichoya,Beatrice Brown-Mulry,Malte Westerhoff,Linda Moy,Laura Heacock,Alana Lewin,Krzysztof J. Geras

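Task: Develop a multi-modal AI system that combines FFDM, synthetic mammography, and DBT to address false-positive recalls in breast cancer screening.

Motivation: Although DBT improves diagnostic performance over FFDM, false-positive recalls remain a problem in breast cancer screening.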

Details

Method: Train an AI system on approximately 500,000 mammography exams to provide breast-level predictions and bounding-box localizations of suspicious findings. Result: The system achieves 0.945 AUROC on an internal test set and reduces recalls by 31.7% and radiologist workload by 43.8% while maintaining 100% sensitivity; external validation shows strong generalizability. Conclusion: The results underscore the importance of multi-modal imaging, demonstrate clinical potential, and indicate that a larger training set can further reduce test error. Abstract: Although digital breast tomosynthesis (DBT) improves diagnostic performance over full-field digital mammography (FFDM), false-positive recalls remain a concern in breast cancer screening. We developed a multi-modal artificial intelligence system integrating FFDM, synthetic mammography, and DBT to provide breast-level predictions and bounding-box localizations of suspicious findings. Our AI system, trained on approximately 500,000 mammography exams, achieved 0.945 AUROC on an internal test set. It demonstrated capacity to reduce recalls by 31.7% and radiologist workload by 43.8% while maintaining 100% sensitivity, underscoring its potential to improve clinical workflows. External validation confirmed strong generalizability, reducing the gap to a perfect AUROC by 35.31%-69.14% relative to strong baselines. In prospective deployment across 18 sites, the system reduced recall rates for low-risk cases. An improved version, trained on over 750,000 exams with additional labels, further reduced the gap by 18.86%-56.62% across large external datasets. Overall, these results underscore the importance of utilizing all available imaging modalities, demonstrate the potential for clinical impact, and indicate feasibility of further reduction of the test error with increased training set when using large-capacity neural networks.
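The phrase "reducing the gap to a perfect AUROC" is most naturally read as the fraction of the baseline's remaining error (1 minus AUROC) that the new model removes. The sketch below encodes that reading, which is an assumption since the abstract gives no formula; the function name and the numbers are hypothetical and not taken from the paper.

```python
def gap_reduction(auroc_baseline, auroc_new):
    """Fraction of the baseline's distance to a perfect AUROC of 1.0 that is closed."""
    if not 0.0 <= auroc_baseline < 1.0:
        raise ValueError("baseline AUROC must be in [0, 1)")
    return (auroc_new - auroc_baseline) / (1.0 - auroc_baseline)

# Hypothetical numbers, not results from the paper.
reduction = gap_reduction(auroc_baseline=0.90, auroc_new=0.95)
print(f"gap to perfect AUROC reduced by {reduction:.1%}")
```

Under this reading, moving from 0.90 to 0.95 halves the remaining gap even though the absolute AUROC gain is only 0.05.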

Measuring Déjà vu Memorization Efficiently

Narine Kokhlikyan,Bargav Jayaraman,Florian Bordes,Chuan Guo,Kamalika Chaudhuri

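Task: Propose a method to estimate a pretrained model's memorization without any retraining.

Motivation: Existing methods require training multiple models to measure memorization, which is infeasible for large open-source models.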

Details

Method: Propose simple alternative methods for estimating dataset-level correlations, which approximate the memorization of a pretrained model. Result: Different ways of measuring memorization yield similar results, and open-source models typically show lower memorization than similar models trained on a subset of the data. Conclusion: The new methods enable, for the first time, the measurement of memorization in pretrained open-source models; the code is available. Abstract: Recent research has shown that representation learning models may accidentally memorize their training data. For example, the déjà vu method shows that for certain representation learning models and training images, it is sometimes possible to correctly predict the foreground label given only the representation of the background - better than through dataset-level correlations. However, their measurement method requires training two models - one to estimate dataset-level correlations and the other to estimate memorization. This multiple model setup becomes infeasible for large open-source models. In this work, we propose alternative simple methods to estimate dataset-level correlations, and show that these can be used to approximate an off-the-shelf model's memorization ability without any retraining. This enables, for the first time, the measurement of memorization in pre-trained open-source image representation and vision-language representation models. Our results show that different ways of measuring memorization yield very similar aggregate results. We also find that open-source models typically have lower aggregate memorization than similar models trained on a subset of the data. The code is available both for vision and vision language models.

TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis

Tri Ton,Ji Woo Hong,Chang D. Yoo

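Task: Propose TARO, a new framework for high-fidelity and temporally coherent video-to-audio synthesis.

Motivation: Improve audio quality and synchronization precision by dynamically aligning latent representations and integrating onset-aware conditioning.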

Details

Method: Built on flow-based transformers, introducing Timestep-Adaptive Representation Alignment (TRA) and Onset-Aware Conditioning (OAC). Result: Performs strongly on the VGGSound and Landscape datasets, with 53% lower FD, 29% lower FAD, and 97.19% alignment accuracy. Conclusion: TARO outperforms existing methods in audio quality and synchronization precision. Abstract: This paper introduces Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning (TARO), a novel framework for high-fidelity and temporally coherent video-to-audio synthesis. Built upon flow-based transformers, which offer stable training and continuous transformations for enhanced synchronization and audio quality, TARO introduces two key innovations: (1) Timestep-Adaptive Representation Alignment (TRA), which dynamically aligns latent representations by adjusting alignment strength based on the noise schedule, ensuring smooth evolution and improved fidelity, and (2) Onset-Aware Conditioning (OAC), which integrates onset cues that serve as sharp event-driven markers of audio-relevant visual moments to enhance synchronization with dynamic visual events. Extensive experiments on the VGGSound and Landscape datasets demonstrate that TARO outperforms prior methods, achieving relatively 53% lower Frechet Distance (FD), 29% lower Frechet Audio Distance (FAD), and a 97.19% Alignment Accuracy, highlighting its superior audio quality and synchronization precision.

POMATO: Marrying Pointmap Matching with Temporal Motion for Dynamic 3D Reconstruction

Songyan Zhang,Yongtao Ge,Jinyuan Tian,Guangkai Xu,Hao Chen,Chen Lv,Chunhua Shen

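Task: Propose POMATO, a unified framework for dynamic 3D reconstruction that combines pointmap matching with a temporal motion module.

Motivation: Address the separation of geometry estimation and matching modules in dynamic 3D reconstruction, in particular ambiguous matching in dynamic regions.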

Details

Method: Learn an explicit matching relationship for RGB pixels in both dynamic and static regions, and introduce a temporal motion module to ensure scale consistency. Result: Performs strongly across multiple downstream tasks, including video depth estimation, 3D point tracking, and pose estimation. Conclusion: POMATO's unified framework significantly improves dynamic 3D reconstruction; code and models are publicly available. Abstract: 3D reconstruction in dynamic scenes primarily relies on the combination of geometry estimation and matching modules where the latter task is pivotal for distinguishing dynamic regions which can help to mitigate the interference introduced by camera and object motion. Furthermore, the matching module explicitly models object motion, enabling the tracking of specific targets and advancing motion understanding in complex scenarios. Recently, the proposed representation of pointmap in DUSt3R suggests a potential solution to unify both geometry estimation and matching in 3D space, but it still struggles with ambiguous matching in dynamic regions, which may hamper further improvement. In this work, we present POMATO, a unified framework for dynamic 3D reconstruction by marrying pointmap matching with temporal motion. Specifically, our method first learns an explicit matching relationship by mapping RGB pixels from both dynamic and static regions across different views to 3D pointmaps within a unified coordinate system. Furthermore, we introduce a temporal motion module for dynamic motions that ensures scale consistency across different frames and enhances performance in tasks requiring both precise geometry and reliable matching, most notably 3D point tracking. We show the effectiveness of the proposed pointmap matching and temporal fusion paradigm by demonstrating the remarkable performance across multiple downstream tasks, including video depth estimation, 3D point tracking, and pose estimation. Code and models are publicly available at https://github.com/wyddmw/POMATO.

Diabetic Retinopathy Detection Based on Convolutional Neural Networks with SMOTE and CLAHE Techniques Applied to Fundus Images

Sidhiq Mardianta,Affandy,Catur Supriyanto,Adi Wijaya

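Task: Evaluate the accuracy of artificial intelligence (AI) in diagnosing diabetic retinopathy (DR).

Motivation: DR is one of the major complications of diabetes and can lead to permanent blindness if not detected in time, so diagnostic accuracy needs to improve.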

Details

Method: Apply the Synthetic Minority Over-sampling Technique (SMOTE) together with a convolutional neural network (CNN) to fundus images from the public "APTOS 2019 Blindness Detection" dataset. Result: Accuracy is 99.55% for binary classification (normal vs. DR) and 95.26% for multiclass classification (severity stages). Conclusion: AI shows significant potential for DR diagnosis, with accuracy exceeding that of traditional human analysis. Abstract: Diabetic retinopathy (DR) is one of the major complications in diabetic patients' eyes, potentially leading to permanent blindness if not detected in time. This study aims to evaluate the accuracy of artificial intelligence (AI) in diagnosing DR. The method employed is the Synthetic Minority Over-sampling Technique (SMOTE) algorithm, applied to identify DR and its severity stages from fundus images using the public dataset "APTOS 2019 Blindness Detection." Literature was reviewed via ScienceDirect, ResearchGate, Google Scholar, and IEEE Xplore. Classification results using Convolutional Neural Network (CNN) showed the best performance for the binary classes normal (0) and DR (1) with an accuracy of 99.55%, precision of 99.54%, recall of 99.54%, and F1-score of 99.54%. For the multiclass classification No_DR (0), Mild (1), Moderate (2), Severe (3), Proliferate_DR (4), the accuracy was 95.26%, precision 95.26%, recall 95.17%, and F1-score 95.23%. Evaluation using the confusion matrix yielded results of 99.68% for binary classification and 96.65% for multiclass. This study highlights the significant potential in enhancing the accuracy of DR diagnosis compared to traditional human analysis.
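The core of SMOTE is interpolation between a minority sample and one of its nearest minority neighbours. The following NumPy sketch shows that step on toy 2D feature vectors; it is not the study's pipeline, and the function name `smote`, the neighbour count `k`, and the data are assumptions. It requires more than `k` minority samples.

```python
import numpy as np

def smote(minority, n_new, k=3, rng=None):
    """Create `n_new` synthetic points by interpolating towards k-nearest minority neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # indices of the k closest points
    base = rng.integers(0, len(minority), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                    # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[picked] - minority[base])

rng = np.random.default_rng(0)
minority = rng.random((10, 2))      # toy minority-class feature vectors
synthetic = smote(minority, n_new=25, rng=rng)
print(synthetic.shape)
```

Each synthetic point lies on the segment between two real minority samples, so the new data stays inside the region the minority class already occupies.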

Micro-splatting: Maximizing Isotropic Constraints for Refined Optimization in 3D Gaussian Splatting

Jee Won Lee,Hansol Lim,Sooyeun Yang,Jongseong Choi

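Task: Propose Micro-splatting, a new framework that addresses the difficulty 3D Gaussian Splatting has in capturing fine-grained details in large-scale scenes.

Motivation: Conventional approaches use relatively large covariance parameters and produce blurred representations, while directly reducing covariance leads to sparsity.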

Details

Method: Introduce a covariance regularization term and an adaptive densification strategy to dynamically optimize the compactness and isotropy of Gaussian splats. Result: Quantitative and qualitative evaluations show that the method significantly improves fine detail in 3D reconstructions. Conclusion: Micro-splatting enhances the capture of fine-grained details without sacrificing rendering efficiency. Abstract: Recent advancements in 3D Gaussian Splatting have achieved impressive scalability and real-time rendering for large-scale scenes but often fall short in capturing fine-grained details. Conventional approaches that rely on relatively large covariance parameters tend to produce blurred representations, while directly reducing covariance sizes leads to sparsity. In this work, we introduce Micro-splatting (Maximizing Isotropic Constraints for Refined Optimization in 3D Gaussian Splatting), a novel framework designed to overcome these limitations. Our approach leverages a covariance regularization term to penalize excessively large Gaussians to ensure each splat remains compact and isotropic. This work implements an adaptive densification strategy that dynamically refines regions with high image gradients by lowering the splitting threshold, followed by loss function enhancement. This strategy results in denser and more detailed Gaussian means where needed, without sacrificing rendering efficiency. Quantitative evaluations using metrics such as L1, L2, PSNR, SSIM, and LPIPS, alongside qualitative comparisons demonstrate that our method significantly enhances fine details in 3D reconstructions.

SE4Lip: Speech-Lip Encoder for Talking Head Synthesis to Solve Phoneme-Viseme Alignment Ambiguity

Yihuan Huang,Jiajun Liu,Yanzhen Ren,Wuyang Liu,Juhua Tang

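Task: Propose SE4Lip, a speech encoder that encodes lip features directly from speech to resolve speech-lip alignment ambiguity.

Motivation: General acoustic features (such as HuBERT and DeepSpeech) suffer from phoneme-viseme alignment ambiguity in speech-driven talking head synthesis.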

Details

Method: Design SE4Lip around STFT spectrograms and a GRU-based model, aligning speech and lip features in a joint embedding space through a cross-modal alignment framework. Result: SE4Lip achieves state-of-the-art performance with both NeRF and 3DGS rendering models, improving lip sync accuracy by 13.7% and 14.2% respectively and approaching ground-truth video quality. Conclusion: SE4Lip effectively resolves speech-lip alignment ambiguity and significantly improves the accuracy and realism of talking head synthesis. Abstract: Speech-driven talking head synthesis tasks commonly use general acoustic features (such as HuBERT and DeepSpeech) as guided speech features. However, we discovered that these features suffer from phoneme-viseme alignment ambiguity, which refers to the uncertainty and imprecision in matching phonemes (speech) with visemes (lip). To address this issue, we propose the Speech Encoder for Lip (SE4Lip) to encode lip features from speech directly, aligning speech and lip features in the joint embedding space by a cross-modal alignment framework. The STFT spectrogram with the GRU-based model is designed in SE4Lip to preserve the fine-grained speech features. Experimental results show that SE4Lip achieves state-of-the-art performance in both NeRF and 3DGS rendering models. Its lip sync accuracy improves by 13.7% and 14.2% compared to the best baseline and produces results close to the ground truth videos.

KAN-SAM: Kolmogorov-Arnold Network Guided Segment Anything Model for RGB-T Salient Object Detection

Xingyuan Li,Ruichao Hou,Tongwei Ren,Gangshan Wu

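Task: Propose KAN-SAM, a prompt learning-based RGB-T salient object detection method that leverages visual foundation models to improve performance.

Motivation: Existing RGB-T SOD methods generalize poorly because of limited dataset diversity and inefficient construction of multi-modal representations.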

Details

Method: Extend SAM2 by introducing thermal features as guiding prompts through KAN adapters, and adopt a mutually exclusive random masking strategy. Result: Outperforms existing methods on benchmarks. Conclusion: By combining a visual foundation model with efficient adapters, KAN-SAM significantly improves the robustness and generalization of RGB-T SOD. Abstract: Existing RGB-thermal salient object detection (RGB-T SOD) methods aim to identify visually significant objects by leveraging both RGB and thermal modalities to enable robust performance in complex scenarios, but they often suffer from limited generalization due to the constrained diversity of available datasets and the inefficiencies in constructing multi-modal representations. In this paper, we propose a novel prompt learning-based RGB-T SOD method, named KAN-SAM, which reveals the potential of visual foundational models for RGB-T SOD tasks. Specifically, we extend Segment Anything Model 2 (SAM2) for RGB-T SOD by introducing thermal features as guiding prompts through efficient and accurate Kolmogorov-Arnold Network (KAN) adapters, which effectively enhance RGB representations and improve robustness. Furthermore, we introduce a mutually exclusive random masking strategy to reduce reliance on RGB data and improve generalization. Experimental results on benchmarks demonstrate superior performance over the state-of-the-art methods.

UVG-VPC: Voxelized Point Cloud Dataset for Visual Volumetric Video-based Coding

Guillaume Gautier,Alexandre Mercat,Louis Fréneau,Mikko Pitkänen,Jarno Vanne

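Task: Introduce and release UVG-VPC, a new point cloud compression dataset that supports the development and evaluation of MPEG Visual Volumetric Video-based Coding (V3C) technology.

Motivation: Point cloud compression is crucial for immersive visual media processing and streaming, but diverse datasets suitable for evaluation are lacking.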

Details

Method: Create a dataset of 12 point cloud test video sequences with diverse characteristics in motion, RGB texture, 3D geometry, and surface occlusion, provided in voxelized form with color attributes. Result: The released UVG-VPC dataset contains 10-second sequences of 250 frames each, with 9 to 12 bits of geometry precision, 8-bit RGB color, and associated normals. Conclusion: The UVG-VPC dataset aims to foster the development of V3C technology and shape the future of the field. Abstract: Point cloud compression has become a crucial factor in immersive visual media processing and streaming. This paper presents a new open dataset called UVG-VPC for the development, evaluation, and validation of MPEG Visual Volumetric Video-based Coding (V3C) technology. The dataset is distributed under its own non-commercial license. It consists of 12 point cloud test video sequences of diverse characteristics with respect to the motion, RGB texture, 3D geometry, and surface occlusion of the points. Each sequence is 10 seconds long and comprises 250 frames captured at 25 frames per second. The sequences are voxelized with a geometry precision of 9 to 12 bits, and the voxel color attributes are represented as 8-bit RGB values. The dataset also includes associated normals that make it more suitable for evaluating point cloud compression solutions. The main objective of releasing the UVG-VPC dataset is to foster the development of V3C technologies and thereby shape the future in this field.
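A geometry precision of n bits means each coordinate is an integer in the range 0 to 2^n - 1. The sketch below shows a generic voxelization of a float point cloud onto such a grid; the dataset's actual voxelization procedure is not described in the abstract, so the function `voxelize` and the random cloud are assumptions for illustration.

```python
import numpy as np

def voxelize(points, bits):
    """Quantize float XYZ coordinates onto a (2^bits)^3 integer grid, dropping duplicates."""
    lo = points.min(axis=0)
    extent = (points.max(axis=0) - lo).max()   # one scale for all axes keeps the aspect ratio
    grid_max = (1 << bits) - 1
    voxels = np.round((points - lo) / extent * grid_max).astype(np.int64)
    return np.unique(voxels, axis=0)           # several points may fall into the same voxel

rng = np.random.default_rng(0)
cloud = rng.random((10_000, 3))                # toy point cloud
vox = voxelize(cloud, bits=10)

duration_s = 250 / 25                          # 250 frames at 25 frames per second
print(len(vox), "occupied voxels;", duration_s, "seconds per sequence")
```

Raising the precision from 9 to 12 bits multiplies the number of available voxel positions per axis by 8, from 512 to 4096.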

CKGAN: Training Generative Adversarial Networks Using Characteristic Kernel Integral Probability Metrics

Kuntian Zhang,Simin Yu,Yaoshu Wang,Makoto Onizuka,Chuan Xiao

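Task: Propose CKGAN, a generative adversarial network variant based on an integral probability metrics framework with characteristic kernels.

Motivation: Mitigate mode collapse in GANs and learn the characteristic kernel function automatically to reduce manual selection effort.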

Details

Method: Optimize a lower bound of the maximum mean discrepancy (MMD) through the characteristic kernel integral probability metric (CKIPM), and propose a soft selection method to learn the characteristic kernel function automatically. Result: CKGAN outperforms other MMD-based GANs on synthetic and real image benchmarks, and the automatically selected kernel performs close to the best manually tuned kernel. Conclusion: CKGAN effectively mitigates mode collapse, and the automatic kernel learning method is practical and generalizes to other MMD-based GANs. Abstract: In this paper, we propose CKGAN, a novel generative adversarial network (GAN) variant based on an integral probability metrics framework with characteristic kernel (CKIPM). CKIPM, as a distance between two probability distributions, is designed to optimize the lower bound of the maximum mean discrepancy (MMD) in a reproducing kernel Hilbert space, and thus can be used to train GANs. CKGAN mitigates the notorious problem of mode collapse by mapping the generated images back to random noise. To save the effort of selecting the kernel function manually, we propose a soft selection method to automatically learn a characteristic kernel function. The experimental evaluation conducted on a set of synthetic and real image benchmarks (MNIST, CelebA, etc.) demonstrates that CKGAN generally outperforms other MMD-based GANs. The results also show that at the cost of moderately more training time, the automatically selected kernel function delivers very close performance to the best of manually fine-tuned one on real image benchmarks and is able to improve the performances of other MMD-based GANs.
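For readers unfamiliar with MMD, the quantity CKIPM is built around, the sketch below computes the standard biased estimate of squared MMD with a Gaussian kernel, which is a characteristic kernel. It does not implement CKIPM or the soft kernel selection; the function names, the bandwidth `sigma`, and the toy samples are assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2_biased(x, y, sigma=1.0):
    """Biased estimate of squared MMD: mean k(x,x') + mean k(y,y') - 2 * mean k(x,y)."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 2))                 # toy "real" samples
fake_close = rng.normal(size=(200, 2))           # same distribution
fake_far = rng.normal(loc=3.0, size=(200, 2))    # shifted distribution

mmd_close = mmd2_biased(real, fake_close)
mmd_far = mmd2_biased(real, fake_far)
print(f"same distribution: {mmd_close:.4f}, shifted distribution: {mmd_far:.4f}")
```

In an MMD-based GAN, a value like `mmd_far` would serve as the signal that pushes generated samples towards the real distribution.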

AI analysis of medical images at scale as a health disparities probe: a feasibility demonstration using chest radiographs

Heather M. Whitney,Hui Li,Karen Drukker,Elbert Huang,Maryellen L. Giger

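Task: Use quantitative measures from medical images as inputs to health disparities index calculations, in order to assess the association between social determinants of health (SDOH) and health disparities.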

Motivation: Explore medical images as a new data source for health disparities research to improve understanding of health disparities.

Details

Method: Develops a pipeline that extracts quantitative measures from chest radiographs, uses a deep learning model to assess the severity of lung disease, separates patients into phenogroups by unsupervised clustering, and computes four imaging-derived health disparities indices (iHDIs). Result: The iHDI measures showed feasible values, indicating that medical images can serve as a novel probe for health disparities research. Conclusion: Large-scale AI analysis of medical images can serve as a new data source for health disparities research. Abstract: Health disparities (differences in non-genetic conditions that influence health) can be associated with differences in burden of disease by groups within a population. Social determinants of health (SDOH) are domains such as health care access, dietary access, and economics frequently studied for potential association with health disparities. Evaluating SDOH-related phenotypes using routine medical images as data sources may enhance health disparities research. We developed a pipeline for using quantitative measures automatically extracted from medical images as inputs into health disparities index calculations. Our study focused on the use case of two SDOH demographic correlates (sex and race) and data extracted from chest radiographs of 1,571 unique patients. The likelihood of severe disease within the lung parenchyma from each image type, measured using an established deep learning model, was merged into a single numerical image-based phenotype for each patient. Patients were then separated into phenogroups by unsupervised clustering of the image-based phenotypes. The health rate for each phenogroup was defined as the median image-based phenotype for each SDOH used as inputs to four imaging-derived health disparities indices (iHDIs): one absolute measure (between-group variance) and three relative measures (index of disparity, Theil index, and mean log deviation). The iHDI measures demonstrated feasible values for each SDOH demographic correlate, showing potential for medical images to serve as a novel probe for health disparities. Large-scale AI analysis of medical images can serve as a probe for a novel data source for health disparities research.
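The four indices named above can be sketched as follows, using one common formulation from the health disparities literature. This is an assumption: the abstract does not give formulas, and the paper's exact definitions (for example the choice of reference rate in the index of disparity) may differ. The function name `disparity_indices` and its inputs are hypothetical; `rates` holds one positive health rate per group and `shares` the matching population shares, summing to 1.

```python
import math


def disparity_indices(rates, shares):
    """Compute four disparity indices from per-group rates and population shares.

    Assumptions: every rate is strictly positive, shares sum to 1, and the
    population-weighted mean rate is used as the reference rate.
    """
    mean_rate = sum(p * y for p, y in zip(shares, rates))
    ratios = [y / mean_rate for y in rates]

    # Absolute measure: population-weighted variance of group rates.
    between_group_variance = sum(
        p * (y - mean_rate) ** 2 for p, y in zip(shares, rates)
    )
    # Relative measures.
    index_of_disparity = (
        100.0 * sum(abs(y - mean_rate) for y in rates) / (len(rates) * mean_rate)
    )
    theil_index = sum(p * r * math.log(r) for p, r in zip(shares, ratios))
    mean_log_deviation = sum(-p * math.log(r) for p, r in zip(shares, ratios))

    return {
        "between_group_variance": between_group_variance,
        "index_of_disparity": index_of_disparity,
        "theil_index": theil_index,
        "mean_log_deviation": mean_log_deviation,
    }


if __name__ == "__main__":
    # Two equally sized groups with different median phenotype values.
    for name, value in disparity_indices([0.2, 0.4], [0.5, 0.5]).items():
        print(name, round(value, 4))
```

All four indices are zero when every group has the same rate and increase as the group rates diverge.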

OSDM-MReg: Multimodal Image Registration based One Step Diffusion Model

Xiaochen Wei,Weiwei Guo,Wenxian Yu,Feiming Wei,Dongying Li

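Task: Propose OSDM-MReg, a multimodal remote sensing image registration framework based on image-to-image translation, to handle large nonlinear radiometric differences in multimodal image registration.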

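Motivation: Current methods struggle to extract modality-invariant features when registering image pairs with large nonlinear radiometric differences, which leads to poor registration results.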

Details

Method: Proposes the UTGOS-CDDPM model, which translates multimodal images into a unified domain in one step and uses the target image as a condition to speed up generation; introduces a perceptual loss to supervise high-frequency features; and designs the MM-Reg network to fuse multimodal features. Result: Experiments show higher accuracy and efficiency on multimodal registration tasks, particularly on SAR-optical image pairs. Conclusion: The OSDM-MReg framework effectively addresses nonlinear radiometric differences in multimodal image registration and significantly improves registration performance. Abstract: Multimodal remote sensing image registration aligns images from different sensors for data fusion and analysis. However, current methods often fail to extract modality-invariant features when aligning image pairs with large nonlinear radiometric differences. To address this issue, we propose OSDM-MReg, a novel multimodal image registration framework based on image-to-image translation to eliminate the gap between multimodal images. Firstly, we propose a novel one-step unaligned target-guided conditional denoising diffusion probabilistic model (UTGOS-CDDPM) to translate multimodal images into a unified domain. In the inference stage, traditional conditional DDPMs generate the translated source image through a large number of iterations, which severely slows down the image registration task. To address this issue, we use the unaligned target image as a condition to promote the generation of low-frequency features of the translated source image. Furthermore, during the training stage, we add the inverse process of directly predicting the translated image to ensure that the translated source image can be generated in one step during the testing stage. Additionally, to supervise the detail features of the translated source image, we propose a new perceptual loss that focuses on the high-frequency feature differences between the translated and ground-truth images. Finally, a multimodal multiscale image registration network (MM-Reg) fuses the multimodal features of the unimodal and multimodal images using the proposed multimodal feature fusion strategy. Experiments demonstrate superior accuracy and efficiency across various multimodal registration tasks, particularly for SAR-optical image pairs.

MAPLE: Encoding Dexterous Robotic Manipulation Priors Learned From Egocentric Videos

Alexey Gavryushin,Xi Wang,Robert J. S. Malate,Chenyu Yang,Xiangyi Jia,Shubh Goel,Davide Liconti,René Zurbrügg,Robert K. Katzschmann,Marc Pollefeys

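Task: Use manipulation priors learned from large-scale egocentric video datasets to improve policy learning for dexterous robotic manipulation tasks.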

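Motivation: Traditional data-driven approaches to robotic manipulation fall short on complex, dexterous manipulation tasks, while large-scale egocentric video datasets offer rich manipulation detail that can fill this gap.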

Details

Method: Proposes MAPLE, which predicts hand-object contact points and detailed hand poses, and uses the learned features to train policies for downstream manipulation tasks. Result: Experiments show that MAPLE performs well on simulation benchmarks and on newly designed challenging tasks, and real-world experiments with a dexterous robotic hand confirm its effectiveness. Conclusion: By exploiting manipulation priors, MAPLE significantly improves dexterous robotic manipulation performance in both simulation and real-world experiments. Abstract: Large-scale egocentric video datasets capture diverse human activities across a wide range of scenarios, offering rich and detailed insights into how humans interact with objects, especially those that require fine-grained dexterous control. Such complex, dexterous skills with precise controls are crucial for many robotic manipulation tasks, yet are often insufficiently addressed by traditional data-driven approaches to robotic manipulation. To address this gap, we leverage manipulation priors learned from large-scale egocentric video datasets to improve policy learning for dexterous robotic manipulation tasks. We present MAPLE, a novel method for dexterous robotic manipulation that exploits rich manipulation priors to enable efficient policy learning and better performance on diverse, complex manipulation tasks. Specifically, we predict hand-object contact points and detailed hand poses at the moment of hand-object contact and use the learned features to train policies for downstream manipulation tasks. Experimental results demonstrate the effectiveness of MAPLE across existing simulation benchmarks, as well as a newly designed set of challenging simulation tasks, which require fine-grained object control and complex dexterous skills. The benefits of MAPLE are further highlighted in real-world experiments using a dexterous robotic hand, whereas simultaneous evaluation across both simulation and real-world experiments has remained underexplored in prior work.