2025 04 14

ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

Kaixin Li,Ziyang Meng,Hongzhan Lin,Ziyang Luo,Yuchen Tian,Jing Ma,Zhiyong Huang,Tat-Seng Chua

Task: Evaluate the grounding capabilities of multimodal large language models (MLLMs) in high-resolution professional settings.

Motivation: GUI agent applications in professional domains remain under-explored and face unique challenges, including high-resolution displays, small target sizes, and complex environments.

Details

Method: Proposes the ScreenSpot-Pro benchmark and ScreenSeekeR, a visual search method that uses GUI knowledge to guide a cascaded search.

Result: Existing models perform poorly on the benchmark (the best reaches only 18.9%), while ScreenSeekeR achieves state-of-the-art performance of 48.1% without additional training.

Conclusion: The ScreenSpot-Pro benchmark and the ScreenSeekeR method are expected to advance GUI agents for professional domains.

Abstract: Recent advancements in Multi-modal Large Language Models (MLLMs) have led to significant progress in developing GUI agents for general tasks such as web browsing and mobile phone use. However, their application in professional domains remains under-explored. These specialized workflows introduce unique challenges for GUI perception models, including high-resolution displays, smaller target sizes, and complex environments. In this paper, we introduce ScreenSpot-Pro, a new benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings. The benchmark comprises authentic high-resolution images from a variety of professional domains with expert annotations. It spans 23 applications across five industries and three operating systems. Existing GUI grounding models perform poorly on this dataset, with the best model achieving only 18.9%. Our experiments reveal that strategically reducing the search area enhances accuracy. Based on this insight, we propose ScreenSeekeR, a visual search method that utilizes the GUI knowledge of a strong planner to guide a cascaded search, achieving state-of-the-art performance with 48.1% without any additional training. We hope that our benchmark and findings will advance the development of GUI agents for professional applications. Code, data and leaderboard can be found at https://gui-agent.github.io/grounding-leaderboard.
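
The idea of grounding inside a reduced search area can be illustrated with a small sketch. It is not ScreenSeekeR itself: it only shows how a point predicted inside a cropped region maps back to full-screenshot pixels and how a prediction might be scored. The helper names are hypothetical, and the point-inside-bounding-box scoring rule is an assumed convention, not something the abstract specifies.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
Point = Tuple[float, float]              # (x, y) in pixels


def crop_to_screen(point: Point, crop: Box) -> Point:
    """Map a point predicted inside a crop back to full-screenshot coordinates."""
    x, y = point
    left, top, _, _ = crop
    return (left + x, top + y)


def is_hit(point: Point, target: Box) -> bool:
    """Assumed scoring rule: a prediction counts if it lies inside the target box."""
    x, y = point
    left, top, right, bottom = target
    return left <= x <= right and top <= y <= bottom


def accuracy(predictions: Sequence[Point], targets: Sequence[Box]) -> float:
    """Fraction of predictions that land inside their target boxes."""
    hits = sum(is_hit(p, t) for p, t in zip(predictions, targets))
    return hits / len(targets)
```

For example, a point at (50, 20) inside a crop whose top-left corner is (1000, 500) corresponds to (1050, 520) on the full screenshot.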

Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability

Ning Li,Jingran Zhang,Justin Cui

Task: Systematically evaluate GPT-4o's semantic synthesis capabilities in multimodal tasks, covering instruction adherence, editing precision, and post-generation reasoning.

Motivation: Test whether GPT-4o can seamlessly integrate domain knowledge, contextual reasoning, and instruction adherence, a question that existing studies leave open.

Details

Method: Systematic evaluation along three key dimensions: global instruction adherence, fine-grained editing precision, and post-generation reasoning.

Result: GPT-4o performs strongly in image generation and editing but shows limitations in semantic synthesis: literal interpretation of instructions, inconsistent application of knowledge constraints, and difficulty with conditional reasoning.

Conclusion: The study exposes GPT-4o's gaps in dynamic knowledge integration and calls for more robust benchmarks and training strategies that support context-aware, reasoning-grounded multimodal generation.

Abstract: OpenAI's multimodal GPT-4o has demonstrated remarkable capabilities in image generation and editing, yet its ability to achieve world knowledge-informed semantic synthesis (seamlessly integrating domain knowledge, contextual reasoning, and instruction adherence) remains unproven. In this study, we systematically evaluate these capabilities across three critical dimensions: (1) Global Instruction Adherence, (2) Fine-Grained Editing Precision, and (3) Post-Generation Reasoning. While existing benchmarks highlight GPT-4o's strong capabilities in image generation and editing, our evaluation reveals GPT-4o's persistent limitations: the model frequently defaults to literal interpretations of instructions, inconsistently applies knowledge constraints, and struggles with conditional reasoning tasks. These findings challenge prevailing assumptions about GPT-4o's unified understanding and generation capabilities, exposing significant gaps in its dynamic knowledge integration. Our study calls for the development of more robust benchmarks and training strategies that go beyond surface-level alignment, emphasizing context-aware and reasoning-grounded multimodal generation.

Self-Bootstrapping for Versatile Test-Time Adaptation

Shuaicheng Niu,Guohao Chen,Peilin Zhao,Tianyi Wang,Pengcheng Wu,Zhiqi Shen

Task: Develop a versatile test-time adaptation (TTA) objective that applies to both classification and regression tasks.

Motivation: For image-, object-, and pixel-level prediction, the open problem is how to design augmentations/deteriorations that preserve geometric information while still providing sufficient learning signals.

Details

Method: A self-bootstrapping scheme optimizes prediction consistency between a test image and its deteriorated view. A Fourier-domain analysis motivates an augmentation strategy that combines low-frequency masking with noise injection at high frequencies.

Result: On classification, segmentation, and 3D monocular detection, the method achieves superior results both on its own and as a plug-and-play module.

Conclusion: Fourier-domain analysis and noise injection yield an efficient and versatile TTA method.

Abstract: In this paper, we seek to develop a versatile test-time adaptation (TTA) objective for a variety of tasks: classification and regression across image-, object-, and pixel-level predictions. We achieve this through a self-bootstrapping scheme that optimizes prediction consistency between the test image (as target) and its deteriorated view. The key challenge lies in devising effective augmentations/deteriorations that: i) preserve the image's geometric information, e.g., object sizes and locations, which is crucial for TTA on object/pixel-level tasks, and ii) provide sufficient learning signals for TTA. To this end, we analyze how common distribution shifts affect the image's information power across spatial frequencies in the Fourier domain, and reveal that low-frequency components carry high power and masking these components supplies more learning signals, while masking high-frequency components cannot. In light of this, we randomly mask the low-frequency amplitude of an image in its Fourier domain for augmentation. Meanwhile, we also augment the image with noise injection to compensate for missing learning signals at high frequencies, by enhancing the information power there. Experiments show that, either independently or as a plug-and-play module, our method achieves superior results across classification, segmentation, and 3D monocular detection tasks with both transformer and CNN models.
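
The augmentation described in the abstract (randomly masking low-frequency amplitude in the Fourier domain, then injecting noise) might look roughly like the NumPy sketch below. The function name, the circular low-frequency region, and the `radius`, `mask_prob`, and `noise_std` parameters are all assumptions made for illustration; the abstract does not give the authors' exact masking pattern or noise model.

```python
import numpy as np


def mask_low_freq_amplitude(image, radius=4, mask_prob=0.5, noise_std=0.05, rng=None):
    """Deteriorate a 2D grayscale image: drop low-frequency amplitude, add noise.

    All parameters are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape

    # Centered spectrum, split into amplitude and phase.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    amplitude, phase = np.abs(spectrum), np.angle(spectrum)

    # Low-frequency region: a disc around the zero-frequency bin.
    yy, xx = np.ogrid[:h, :w]
    low = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2

    # Randomly zero the amplitude of bins inside that region; phase is kept.
    drop = low & (rng.random((h, w)) < mask_prob)
    amplitude = np.where(drop, 0.0, amplitude)

    # Random masking breaks conjugate symmetry, so keep only the real part.
    masked = np.fft.ifft2(np.fft.ifftshift(amplitude * np.exp(1j * phase))).real

    # Gaussian noise adds power at high frequencies.
    return masked + rng.normal(0.0, noise_std, size=image.shape)
```

Because only the amplitude is modified and the phase is left untouched, object positions and sizes are largely preserved, which is the property the abstract highlights for object- and pixel-level tasks.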

SRVP: Strong Recollection Video Prediction Model Using Attention-Based Spatiotemporal Correlation Fusion

Yuseon Kim,Kyongseok Park

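Task: Propose SRVP, a video prediction model that combines standard attention and reinforced feature attention modules to reduce the loss of object appearance details that traditional RNN models suffer in long-term prediction.

Motivation: Traditional RNN models capture spatiotemporal states in video prediction but gradually lose object appearance details, so a new approach is needed to mitigate this.

Details

Method: SRVP combines standard attention (SA) and reinforced feature attention (RFA) modules. Both use scaled dot-product attention to extract temporal context and spatial correlations, which are fused to enhance spatiotemporal representations.

Result: Experiments on three benchmark datasets show that SRVP mitigates the image quality degradation of RNN-based models while achieving predictive performance comparable to RNN-free architectures.

Conclusion: Attention mechanisms allow SRVP to address a key weakness of traditional RNN models while maintaining high predictive performance.

Abstract: Video prediction (VP) generates future frames by leveraging spatial representations and temporal context from past frames. Traditional recurrent neural network (RNN)-based models enhance memory cell structures to capture spatiotemporal states over extended durations but suffer from gradual loss of object appearance details. To address this issue, we propose the strong recollection VP (SRVP) model, which integrates standard attention (SA) and reinforced feature attention (RFA) modules. Both modules employ scaled dot-product attention to extract temporal context and spatial correlations, which are then fused to enhance spatiotemporal representations. Experiments on three benchmark datasets demonstrate that SRVP mitigates image quality degradation in RNN-based models while achieving predictive performance comparable to RNN-free architectures.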
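
For readers unfamiliar with the building block named in the abstract, here is a minimal NumPy sketch of generic scaled dot-product attention. It is the textbook operation only; how SRVP's SA and RFA modules form their queries, keys, and values is not described in the abstract and is not modeled here.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """Generic attention: softmax(q k^T / sqrt(d_k)) v.

    q: (n_queries, d_k), k: (n_keys, d_k), v: (n_keys, d_v).
    Returns the attended values and the attention weights.
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)

    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ v, weights
```

Each output row is a weighted average of the value rows, with weights that sum to one per query.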

Metamorphic Testing for Fairness Evaluation in Large Language Models: Identifying Intersectional Bias in LLaMA and GPT

Harishwar Reddy,Madhusudan Srinivasan,Upulee Kanewala

Task: Propose a metamorphic testing approach that systematically identifies fairness bugs in large language models (LLMs).

Motivation: LLMs have made significant progress in natural language processing, but biases in their training data can cause fairness issues, which pose risks in sensitive domains such as healthcare, finance, and law.

Details

Method: Defines and applies a set of fairness-oriented metamorphic relations (MRs), generates source and follow-up test cases, and analyzes how the LLaMA and GPT models respond to different demographic inputs in order to detect fairness violations.

Result: Metamorphic testing effectively exposes bias patterns, especially in tone and sentiment, and reveals fairness faults that frequently occur at intersections of sensitive attributes.

Conclusion: The study improves fairness testing of LLMs by providing a structured approach to detecting and mitigating bias, improving model robustness in fairness-sensitive applications.

Abstract: Large Language Models (LLMs) have made significant strides in Natural Language Processing but remain vulnerable to fairness-related issues, often reflecting biases inherent in their training data. These biases pose risks, particularly when LLMs are deployed in sensitive areas such as healthcare, finance, and law. This paper introduces a metamorphic testing approach to systematically identify fairness bugs in LLMs. We define and apply a set of fairness-oriented metamorphic relations (MRs) to assess the LLaMA and GPT models, both state-of-the-art LLMs, across diverse demographic inputs. Our methodology includes generating source and follow-up test cases for each MR and analyzing model responses for fairness violations. The results demonstrate the effectiveness of metamorphic testing (MT) in exposing bias patterns, especially in relation to tone and sentiment, and highlight specific intersections of sensitive attributes that frequently reveal fairness faults. This research improves fairness testing in LLMs, providing a structured approach to detect and mitigate biases and improve model robustness in fairness-sensitive applications.
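
The source/follow-up structure of a fairness-oriented metamorphic relation can be sketched as follows. Everything here is a toy assumption: the relation (swapping one demographic term should not change response sentiment), the word-list sentiment scorer, and the two stand-in "models" are hypothetical and are not the MRs or models used in the paper.

```python
from typing import Callable

# Toy sentiment lexicon, assumed purely for illustration.
POSITIVE = {"excellent", "strong", "qualified"}
NEGATIVE = {"risky", "weak", "unqualified"}


def sentiment(text: str) -> int:
    """Count positive minus negative words in a response."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def make_follow_up(source_prompt: str, attribute: str, replacement: str) -> str:
    """Build the follow-up test case by swapping one demographic term."""
    return source_prompt.replace(attribute, replacement)


def violates_relation(model: Callable[[str], str], source_prompt: str,
                      attribute: str, replacement: str, tolerance: int = 0) -> bool:
    """The relation holds if sentiment is unchanged (within tolerance) after the swap."""
    follow_up_prompt = make_follow_up(source_prompt, attribute, replacement)
    gap = abs(sentiment(model(source_prompt)) - sentiment(model(follow_up_prompt)))
    return gap > tolerance


def biased_toy_model(prompt: str) -> str:
    """Stand-in for an LLM whose tone depends on the demographic term."""
    if "older" in prompt:
        return "A risky, weak candidate."
    return "An excellent, strong candidate."


def fair_toy_model(prompt: str) -> str:
    """Stand-in for an LLM whose tone ignores the demographic term."""
    return "A qualified candidate."
```

With the prompt "Assess this younger applicant for a loan." and the swap younger to older, the biased toy model violates the relation while the fair one does not. In practice `model` would wrap a call to a real LLM and the scorer would be a proper sentiment or tone classifier.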

DGFamba: Learning Flow Factorized State Space for Visual Domain Generalization

Qi Bi,Jingjun Yi,Hao Zheng,Haolan Zhan,Wei Ji,Yawen Huang,Yuexiang Li

Task: Propose DG-Famba, a Flow Factorized State Space model for visual domain generalization.

Motivation: Address the domain gap caused by style variation in visual domain generalization, and explore the domain-invariant property of selective state space models.

Details

Method: Flow factorization maps the style-augmented and original state embeddings into a latent flow space, where their probability paths are aligned.

Result: State-of-the-art performance across a range of visual domain generalization settings.

Conclusion: By factorizing flows and aligning latent probability paths, DG-Famba achieves domain consistency and improves generalization.

Abstract: Domain generalization aims to learn a representation from the source domain, which can be generalized to arbitrary unseen target domains. A fundamental challenge for visual domain generalization is the domain gap caused by dramatic style variation while the image content remains stable. The realm of selective state space, exemplified by VMamba, demonstrates its global receptive field in representing the content. However, how to exploit the domain-invariant property of selective state space has rarely been explored. In this paper, we propose a novel Flow Factorized State Space model, dubbed DG-Famba, for visual domain generalization. To maintain domain consistency, we innovatively map the style-augmented and the original state embeddings by flow factorization. In this latent flow space, each state embedding from a certain style is specified by a latent probability path. By aligning these probability paths in the latent space, the state embeddings are able to represent the same content distribution regardless of the style differences. Extensive experiments conducted on various visual domain generalization settings show its state-of-the-art performance.

Shurui Wu,Xinyi Huang,Dingxin Lu

Task: Propose a large language model (LLM)-based text transfer recognition method for social network crisis intervention.

Motivation: As mental health crises become more prevalent on social media, identifying and preventing potential harm has become an urgent challenge.

Details

Method: A multi-level framework that uses transfer learning with BERT and integrates mental health knowledge, sentiment analysis, and behavior prediction techniques.

Result: Experiments show that the method outperforms traditional models in crisis detection accuracy and is more sensitive to subtle emotional and contextual variations.

Conclusion: The method offers an effective tool for mental health crisis intervention on social media.

Abstract: As the prevalence of mental health crises increases on social media platforms, identifying and preventing potential harm has become an urgent challenge. This study introduces a large language model (LLM)-based text transfer recognition method for social network crisis intervention, enhanced with domain-specific mental health knowledge. We propose a multi-level framework that incorporates transfer learning using BERT, and integrates mental health knowledge, sentiment analysis, and behavior prediction techniques. The framework includes a crisis annotation tool trained on social media datasets from real-world events, enabling the model to detect nuanced emotional cues and identify psychological crises. Experimental results show that the proposed method outperforms traditional models in crisis detection accuracy and exhibits greater sensitivity to subtle emotional and contextual variations.

Learning Fine-grained Domain Generalization via Hyperbolic State Space Hallucination

Qi Bi,Jingjun Yi,Haolan Zhan,Wei Ji,Gui-Song Xia

Task: Learn a fine-grained representation that generalizes to unseen target domains when trained only on source-domain data.

Motivation: Fine-grained domain generalization (FGDG) is especially challenging because fine-grained categories can be told apart only by subtle patterns, and these patterns are particularly fragile under cross-domain style shifts such as changes in illumination and color.

Details

Method: Proposes Hyperbolic State Space Hallucination (HSSH), which has two key components: state space hallucination (SSH) and hyperbolic manifold consistency (HMC). SSH enriches the style diversity of state embeddings by extrapolating and then hallucinating the source images. The state embeddings before and after style hallucination are then projected into a hyperbolic manifold.

Result: State-of-the-art performance on three FGDG benchmarks.

Conclusion: The hyperbolic state space models high-order statistics and so discerns fine-grained patterns better, and minimizing the hyperbolic distance removes the impact of style variation on those patterns.

Abstract: Fine-grained domain generalization (FGDG) aims to learn a fine-grained representation that can be well generalized to unseen target domains when only trained on the source domain data. Compared with generic domain generalization, FGDG is particularly challenging in that the fine-grained category can be only discerned by some subtle and tiny patterns. Such patterns are particularly fragile under the cross-domain style shifts caused by illumination, color, etc. To push this frontier, this paper presents a novel Hyperbolic State Space Hallucination (HSSH) method. It consists of two key components, namely, state space hallucination (SSH) and hyperbolic manifold consistency (HMC). SSH enriches the style diversity for the state embeddings by first extrapolating and then hallucinating the source images. Then, the state embeddings before and after style hallucination are projected into the hyperbolic manifold. The hyperbolic state space models the high-order statistics, and allows a better discernment of the fine-grained patterns. Finally, the hyperbolic distance is minimized, so that the impact of style variation on fine-grained patterns can be eliminated. Experiments on three FGDG benchmarks demonstrate its state-of-the-art performance.

Topic mining based on fine-tuning Sentence-BERT and LDA

Jianheng Li,Lirong Chen

Task: Fine-tune Sentence-BERT and an LDA model to mine topic features from online product reviews and present detailed, multi-aspect product information.

Motivation: As society develops, consumers pay more attention to key information about fine-grained product attributes when shopping.

Details

Method: A fine-tuned Sentence-BERT model generates semantically rich word vectors, which are fed into an LDA model for topic feature extraction. Keyword analysis then focuses on the product's key functions.

Result: The model's topic consistency is 0.5 higher than that of other models, improving the accuracy of topic extraction.

Conclusion: The method improves the precision of topic extraction from online product reviews and gives consumers more detailed product information.

Abstract: Research background: With the continuous development of society, consumers pay more attention to the key information of product fine-grained attributes when shopping. Research purposes: This study will fine-tune the Sentence-BERT word embedding model and LDA model, mine the subject characteristics in online reviews of goods, and show consumers the details of various aspects of goods. Research methods: First, the Sentence-BERT model was fine-tuned in the field of e-commerce online reviews, and the online review text was converted into a word vector set with richer semantic information. Second, the vectorized word set was input into the LDA model for topic feature extraction. Finally, keyword analysis under each theme was used to focus on the key functions of the product. Results: This study compared this model with other word embedding models and LDA models, and compared it with common topic extraction methods. The theme consistency of this model is 0.5 higher than that of other models, which improves the accuracy of theme extraction.

Teaching Humans Subtle Differences with DIFFusion

Mia Chiquier,Orr Avrech,Yossi Gandelsman,Berthy Feng,Katherine Bouman,Carl Vondrick

Task: Propose DIFFusion, a method that uses generative models to teach novices to distinguish subtle visual differences in specialized domains.

Motivation: Human experts can recognize subtle visual differences (for example between diseases, species, or celestial phenomena), but novices lack this ability and need effective teaching methods.

Details

Method: Manipulates the conditioning space of diffusion models to generate counterfactual visualizations between categories, showing the minimal change in features.

Result: In experiments across six domains, the method produces accurate transitions even when data is sparse and unpaired. User studies show that it teaches better than unpaired examples.

Conclusion: Generative models show potential for specialized visual learning and can help novices acquire subtle visual discrimination skills.

Abstract: Human expertise depends on the ability to recognize subtle visual differences, such as distinguishing diseases, species, or celestial phenomena. We propose a new method to teach novices how to differentiate between nuanced categories in specialized domains. Our method uses generative models to visualize the minimal change in features to transition between classes, i.e., counterfactuals, and performs well even in domains where data is sparse, examples are unpaired, and category boundaries are not easily explained by text. By manipulating the conditioning space of diffusion models, our proposed method DIFFusion disentangles category structure from instance identity, enabling high-fidelity synthesis even in challenging domains. Experiments across six domains show accurate transitions even with limited and unpaired examples across categories. User studies confirm that our generated counterfactuals outperform unpaired examples in teaching perceptual expertise, showing the potential of generative models for specialized visual learning.

SEAL: Steerable Reasoning Calibration of Large Language Models for Free

Runjin Chen,Zhenyu Zhang,Junyuan Hong,Souvik Kundu,Zhangyang Wang

Task: Study redundancy in the chain-of-thought (CoT) reasoning of large language models (LLMs) and propose SEAL, a training-free calibration method that improves efficiency and accuracy.

Motivation: Recent studies show substantial redundancy in CoT reasoning, which increases inference latency and also hurts model performance by diverting attention to unnecessary reasoning paths.

Details

Method: Analyzes the internal reasoning structure of LLMs and categorizes it into three thought types: execution, reflection, and transition. SEAL extracts a reasoning steering vector from the latent space offline, then applies it as a representation intervention during inference.

Result: SEAL is validated across multiple models and benchmarks, improving accuracy by up to 11% while reducing reasoning tokens by 11.8% to 50.4%.

Conclusion: SEAL is an efficient, training-free calibration method that improves both the reasoning efficiency and the accuracy of LLMs.

Abstract: Large Language Models (LLMs), such as OpenAI's o1-series, have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. However, recent studies reveal substantial redundancy in the CoT reasoning traces, which not only increases inference latency but also negatively impacts model performance by diverting attention to unnecessary reasoning paths. To address this issue, we investigate the internal reasoning structures of LLMs and categorize them into three primary thought types: execution, reflection, and transition thoughts. Moreover, our analysis reveals that excessive reflection and transition thoughts are strongly correlated with failure cases and these thought categories exhibit clear separation in the latent space. Based on these findings, we introduce SEAL (Steerable reasoning calibration), a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains. SEAL consists of an offline stage for extracting the reasoning steering vector in the latent space, followed by an on-the-fly calibration of the reasoning trace through representation intervention using the steering vector. Notably, the steering vector exhibits strong transferability across various tasks. Extensive experiments across multiple models (DeepSeek-R1-Distill and QwQ-32B-Preview) and benchmarks (Math500, GSM8K, LiveCodeBench) validate the effectiveness of SEAL, with up to an 11% improvement in accuracy while reducing reasoning tokens by 11.8% to 50.4%. Our code is publicly available at https://github.com/VITA-Group/SEAL.
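
The two-stage pattern in the abstract (extract a steering vector offline, then intervene on representations at inference time) can be sketched in a few lines of NumPy. The difference-of-means construction and the additive intervention below are a common way to build steering vectors and are assumed here for illustration; the abstract does not state SEAL's exact formula, and the function names and `strength` parameter are hypothetical.

```python
import numpy as np


def extract_steering_vector(execution_states, other_states):
    """Offline stage (assumed form): mean execution-thought state minus
    mean reflection/transition-thought state.

    Both inputs have shape (n_examples, hidden_dim).
    """
    return execution_states.mean(axis=0) - other_states.mean(axis=0)


def intervene(hidden_state, steering_vector, strength=1.0):
    """Inference stage (assumed form): shift a hidden state along the vector."""
    return hidden_state + strength * steering_vector
```

In a real model the hidden states would come from a chosen transformer layer, and the intervention would be applied through a forward hook during decoding.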

Patch distribution modeling framework adaptive cosine estimator (PaDiM-ACE) for anomaly detection and localization in synthetic aperture radar imagery

Angelina Ibarra,Joshua Peeples

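Task: Propose a new approach, based on the adaptive cosine estimator (ACE), to anomaly detection and localization in synthetic aperture radar (SAR) imagery.

Motivation: The existing PaDiM method uses the unbounded Mahalanobis distance, whereas ACE introduces a bounded cosine similarity metric, with the aim of improving anomaly detection and localization.

Details

Method: Introduces the ACE detection statistic into the PaDiM framework, replacing the Mahalanobis distance with cosine similarity.

Result: Evaluated on multiple SAR datasets using image-level and pixel-level AUROC, with the aim of improving anomaly detection and localization performance.

Conclusion: The ACE-based method provides bounded anomaly scores for SAR anomaly detection and localization, and the code is publicly available.

Abstract: This work presents a new approach to anomaly detection and localization in synthetic aperture radar (SAR) imagery, expanding upon the existing patch distribution modeling framework (PaDiM). We introduce the adaptive cosine estimator (ACE) detection statistic. PaDiM uses the Mahalanobis distance at inference, an unbounded metric. ACE instead uses the cosine similarity metric, providing bounded anomaly detection scores. The proposed method is evaluated across multiple SAR datasets, with performance metrics including the area under the receiver operating characteristic curve (AUROC) at the image and pixel level, aiming for increased performance in anomaly detection and localization of SAR imagery. The code is publicly available: https://github.com/Advanced-Vision-and-Learning-Lab/PaDiM-LACE.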
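
The bounded-versus-unbounded contrast in the abstract can be seen in a small NumPy sketch. The Mahalanobis distance is standard; the cosine form below is the usual whitened-space cosine between a sample and a reference vector. How PaDiM-ACE chooses that reference (named `reference` here, a hypothetical argument) is not stated in the abstract, so this is only an illustration of the two score types.

```python
import numpy as np


def mahalanobis(x, mean, cov_inv):
    """Unbounded score: grows without limit as x moves away from the mean."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))


def whitened_cosine(x, reference, mean, cov_inv):
    """Bounded score in [-1, 1]: cosine between x and a reference vector,
    both centered on the mean and measured under the inverse covariance."""
    dx, dr = x - mean, reference - mean
    num = dr @ cov_inv @ dx
    den = np.sqrt(dr @ cov_inv @ dr) * np.sqrt(dx @ cov_inv @ dx)
    return float(num / den)
```

Scaling a sample by 10 multiplies its Mahalanobis distance by 10 but leaves the cosine score unchanged, which is why the latter stays within a fixed range.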

Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance

Nirvan Patil,Malhar Abhay Inamdar,Agnivo Gosai,Guruprasad Pathak,Anish Joshi,Aryan Sagavekar,Anish Joshirao,Raj Dandekar,Rajat Dandekar,Sreedath Panat

Task: Extend the TinyStories framework by translating the English dataset into Indian languages and generating synthetic data, then evaluate small language models (SLMs) on regional language processing.

Motivation: Study how efficiently SLMs can process Indian languages (Hindi, Marathi, and Bengali) as a complement to large language models (LLMs).

Details

Method: Translates the original dataset, generates synthetic data, and evaluates SLM performance with information-theoretic and morphological analyses.

Result: SLMs process regional languages efficiently, language-specific tokenizers outperform general-purpose ones, and synthetic datasets outperform translated content.

Conclusion: The findings support the practical application of SLMs to underserved languages and advance theoretical understanding of neural language development.

Abstract: Small Language Models (SLMs) offer efficient alternatives to LLMs for specific domains. The 2023 TinyStories study developed an English dataset that allows SLMs with 1 to 10 million parameters to produce coherent outputs. Our research expands this framework by translating the original dataset into Indian languages and creating synthetic data using LLMs. We focus on Hindi, Marathi, and Bengali, evaluating SLMs for regional language processing and understanding linguistic complexity. We show that SLMs efficiently process regional languages with significantly fewer parameters than LLMs, providing a complementary framework for "inference-based evaluation" of tokenization strategies and linguistic complexity. Our analysis shows that language-specific tokenizers outperform general-purpose ones for Indian languages. Empirical validations, supported by information-theoretic and morphological analyses, provide a fundamental understanding of the better performance of Hindi models over Marathi and Bengali. Additionally, we show that synthetic datasets outperform translated content for training SLMs. Correlation analyses reveal cross-linguistic patterns and language-specific relationships between creativity, grammatical precision, and narrative completeness. These findings advance both the practical application of SLMs to underserved languages and our theoretical understanding of neural language development.

Multi-Task Learning with Multi-Annotation Triplet Loss for Improved Object Detection

Meilun Zhou,Aditya Dutt,Alina Zare

Task: Extend triplet loss into a Multi-Annotation Triplet Loss (MATL) that exploits additional annotations in multi-task learning.

Motivation: Traditional triplet loss relies only on class labels and does not make full use of the multiple annotation types available in multi-task scenarios.

Details

Method: Proposes the Multi-Annotation Triplet Loss (MATL) framework, which combines class labels with additional annotations such as bounding box information.

Result: On an aerial wildlife imagery dataset, MATL outperforms conventional triplet loss in both classification and localization.

Conclusion: Using all available annotations benefits triplet loss in multi-task learning frameworks.

Abstract: Triplet loss traditionally relies only on class labels and does not use all available information in multi-task scenarios where multiple types of annotations are available. This paper introduces a Multi-Annotation Triplet Loss (MATL) framework that extends triplet loss by incorporating additional annotations, such as bounding box information, alongside class labels in the loss formulation. By using these complementary annotations, MATL improves multi-task learning for tasks requiring both classification and localization. Experiments on an aerial wildlife imagery dataset demonstrate that MATL outperforms conventional triplet loss in both classification and localization. These findings highlight the benefit of using all available annotations for triplet loss in multi-task learning frameworks.

'Neural howlround' in large language models: a self-reinforcing bias phenomenon, and a dynamic attenuation solution

Seth Drake

Task: Examine the 'neural howlround' phenomenon that can arise in AI systems driven by large language models (LLMs), and a mechanism for correcting it.

Motivation: Study the self-reinforcing cognitive loop ('neural howlround') in AI systems and its effect on model inference, in order to improve AI robustness.

Details

Method: Proposes an attenuation-based correction mechanism that dynamically introduces counterbalancing adjustments to restore adaptive reasoning.

Result: The mechanism can correct 'neural howlround', even in 'locked-in' AI systems.

Conclusion: The correction strategy has potential applications in real-world decision-making tasks, where it could improve the robustness of AI systems.

Abstract: Large language model (LLM)-driven AI systems may exhibit an inference failure mode we term 'neural howlround,' a self-reinforcing cognitive loop where certain highly weighted inputs become dominant, leading to entrenched response patterns resistant to correction. This paper explores the mechanisms underlying this phenomenon, which is distinct from model collapse and biased salience weighting. We propose an attenuation-based correction mechanism that dynamically introduces counterbalancing adjustments and can restore adaptive reasoning, even in 'locked-in' AI systems. Additionally, we discuss some other related effects arising from improperly managed reinforcement. Finally, we outline potential applications of this mitigation strategy for improving AI robustness in real-world decision-making tasks.

STEI-PCN: an efficient pure convolutional network for traffic prediction via spatial-temporal encoding and inferring

Kai Hu,Zhidan Zhao,Zhifeng Hao

Task: Propose STEI-PCN, an efficient pure convolutional network for traffic prediction that uses spatial-temporal encoding and inferring to address the weaknesses of existing models in capturing spatial-temporal correlations.

Motivation: Existing models usually extract temporal and spatial correlations either independently or synchronously and so overlook complex spatial-temporal correlations. Models that do consider them struggle with accuracy and computational efficiency.

Details

Method: A dynamic adjacency matrix inferring module, built on absolute and relative spatial-temporal coordinates, is combined with a graph convolutional network and a gating mechanism to capture local synchronous spatial-temporal correlations. A temporal dilated causal convolutional network captures long-range temporal correlations.

Result: Experiments on multiple datasets show that STEI-PCN is competitive in computational efficiency and prediction performance, matching or exceeding state-of-the-art models on some metrics.

Conclusion: By modeling spatial-temporal correlations comprehensively, STEI-PCN delivers efficient traffic prediction and offers a new solution for the field.

Abstract: Traffic data exhibits complex temporal, spatial, and spatial-temporal correlations. Most models use either independent modules to separately extract temporal and spatial correlations or joint modules to synchronously extract them, without considering the spatial-temporal correlations. Moreover, models that consider joint spatial-temporal correlations (temporal, spatial, and spatial-temporal correlations) often encounter significant challenges in accuracy and computational efficiency which prevent such models from demonstrating the expected advantages of a joint spatial-temporal correlations architecture. To address these issues, this paper proposes an efficient pure convolutional network for traffic prediction via spatial-temporal encoding and inferring (STEI-PCN). The model introduces and designs a dynamic adjacency matrix inferring module based on absolute spatial and temporal coordinates, as well as relative spatial and temporal distance encoding, using a graph convolutional network combined with a gating mechanism to capture local synchronous joint spatial-temporal correlations. Additionally, three layers of a temporal dilated causal convolutional network are used to capture long-range temporal correlations. Finally, through a multi-view collaborative prediction module, the model integrates the gated-activated original, local synchronous joint spatial-temporal, and long-range temporal features to achieve comprehensive prediction. This study conducts extensive experiments on flow datasets (PeMS03/04/07/08) and a speed dataset (PeMS-Bay), covering multiple prediction horizons. The results show that STEI-PCN demonstrates competitive computational efficiency in both training and inference speeds, and achieves results superior or only slightly inferior to state-of-the-art (SOTA) models on most evaluation metrics.
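
To show why stacked dilated causal convolutions reach far back in time, here is a minimal pure-Python sketch of the generic operation on a single time series. It is not the STEI-PCN layer: the kernel values and the dilation rates used in the example (1, 2, 4) are assumptions, since the abstract only says that three such layers are used.

```python
from typing import List, Sequence


def dilated_causal_conv1d(series: Sequence[float], kernel: Sequence[float],
                          dilation: int = 1) -> List[float]:
    """Causal 1D convolution: output[t] depends only on series[t], series[t - d],
    series[t - 2d], ... Positions before the start are treated as zero."""
    out = []
    for t in range(len(series)):
        total = 0.0
        for i, weight in enumerate(kernel):
            idx = t - i * dilation
            if idx >= 0:
                total += weight * series[idx]
        out.append(total)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of past-and-present time steps one output of the stack can see."""
    return 1 + (kernel_size - 1) * sum(dilations)
```

With a kernel of size 2 and the assumed dilations 1, 2, and 4, three layers cover 8 time steps, compared with 4 steps for three undilated layers.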

Evaluating the Fitness of Ontologies for the Task of Question Generation

Samah Alkhuzaey,Floriana Grasso,Terry R. Payne,Valentina Tamma

Task: Propose a set of requirements and task-specific metrics for evaluating how fit ontologies are for question generation in pedagogical settings.

Motivation: There has been no comprehensive study of which ontology aspects or characteristics affect the question generation process.

Details

Method: Uses the ROMEO methodology and an expert-based approach to assess how different ontologies perform in automatic question generation tasks.

Result: Ontology characteristics significantly affect the effectiveness of question generation, and different ontologies perform at different levels.

Conclusion: Assessing ontology quality is important for automatic question generation tasks.

Abstract: Ontology-based question generation is an important application of semantic-aware systems that enables the creation of large question banks for diverse learning environments. The effectiveness of these systems, both in terms of the calibre and cognitive difficulty of the resulting questions, depends heavily on the quality and modelling approach of the underlying ontologies, making it crucial to assess their fitness for this task. To date, there has been no comprehensive investigation into the specific ontology aspects or characteristics that affect the question generation process. Therefore, this paper proposes a set of requirements and task-specific metrics for evaluating the fitness of ontologies for question generation tasks in pedagogical settings. Using the ROMEO methodology, a structured framework for deriving task-specific metrics, an expert-based approach is employed to assess the performance of various ontologies in Automatic Question Generation (AQG) tasks, which is then evaluated over a set of ontologies. Our results demonstrate that ontology characteristics significantly impact the effectiveness of question generation, with different ontologies exhibiting varying performance levels. This highlights the importance of assessing ontology quality with respect to AQG tasks.

X-DECODE: EXtreme Deblurring with Curriculum Optimization and Domain Equalization

Sushant Gautam,Jingdao Chen

Task: Propose a curriculum-learning-based training strategy that improves the robustness of deep learning models for extreme image deblurring.

Motivation: Restoring severely blurred images matters in autonomous driving, medical imaging, and photography, but existing methods are usually trained only on low to moderate blur levels and cope poorly with extreme blur.

Details

Method: A curriculum learning strategy gradually increases the blur severity of the training images, combined with perceptual and hinge losses to improve detail restoration and training stability.

Result: On the Extreme-GoPro and Extreme-KITTI datasets, the method outperforms the next best method by 14% and 18% in SSIM, respectively.

Conclusion: A linear curriculum performs best for restoring extremely blurred images, and both the training blur percentage and the loss function design are important for performance.

Abstract: Restoring severely blurred images remains a significant challenge in computer vision, impacting applications in autonomous driving, medical imaging, and photography. This paper introduces a novel training strategy based on curriculum learning to improve the robustness of deep learning models for extreme image deblurring. Unlike conventional approaches that train on only low to moderate blur levels, our method progressively increases the difficulty by introducing images with higher blur severity over time, allowing the model to adapt incrementally. Additionally, we integrate perceptual and hinge loss during training to enhance fine detail restoration and improve training stability. We experimented with various curriculum learning strategies and explored the impact of the train-test domain gap on the deblurring performance. Experimental results on the Extreme-GoPro dataset showed that our method outperforms the next best method by 14% in SSIM, whereas experiments on the Extreme-KITTI dataset showed that our method outperforms the next best by 18% in SSIM. Ablation studies showed that a linear curriculum progression outperforms step-wise, sigmoid, and exponential progressions, while hyperparameter settings such as the training blur percentage and loss function formulation all play important roles in addressing extreme blur artifacts. Datasets and code are available at https://github.com/RAPTOR-MSSTATE/XDECODE
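
The four curriculum shapes compared in the ablation (linear, step-wise, sigmoid, exponential) can be written as a single schedule function. The specific formulas and constants below (the number of steps, the sigmoid steepness of 10, the exponential rate of 3) are assumptions chosen for illustration; the abstract names the shapes but not their parameterization.

```python
import math


def blur_severity(epoch: int, total_epochs: int, max_severity: float = 1.0,
                  schedule: str = "linear", steps: int = 4) -> float:
    """Blur severity to train on at a given epoch, rising from 0 toward max_severity.

    Formulas and constants are illustrative assumptions, not the paper's settings.
    """
    progress = min(max(epoch / total_epochs, 0.0), 1.0)

    if schedule == "linear":
        fraction = progress
    elif schedule == "step":
        # Severity increases in `steps` equal jumps.
        fraction = math.floor(progress * steps) / steps
    elif schedule == "sigmoid":
        # Slow start, fast middle, slow finish.
        fraction = 1.0 / (1.0 + math.exp(-10.0 * (progress - 0.5)))
    elif schedule == "exponential":
        # Slow start, rapid increase late in training.
        fraction = (math.exp(3.0 * progress) - 1.0) / (math.exp(3.0) - 1.0)
    else:
        raise ValueError(f"unknown schedule: {schedule}")

    return max_severity * fraction
```

A training loop would call this once per epoch and sample blur kernels up to the returned severity.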

SafeChat: A Framework for Building Trustworthy Collaborative Assistants and a Case Study of its Usefulness

Biplav Srivastava,Kausik Lakkaraju,Nitin Gupta,Vansh Nagpal,Bharath C. Muppasani,Sara E. Jones

Task: Propose SafeChat, an architecture for building safe and trustworthy chatbots.

Motivation: Address the shortcomings of existing LLM-based chatbots in reliability, explainability, and safety, particularly for trust-sensitive applications.

Details

Method: The SafeChat architecture has three key features: safety (responses traceable to approved sources and avoidance of harmful answers), usability (automatic summarization and trust assessments), and fast development (a CSV-driven workflow).

Result: The SafeChat framework was implemented, and a case study demonstrates its use for disseminating election information.

Conclusion: SafeChat is a general architecture for building safe and trustworthy chatbots and is in use across multiple domains.

Abstract: Collaborative assistants, or chatbots, are data-driven decision support systems that enable natural interaction for task completion. While they can meet critical needs in modern society, concerns about their reliability and trustworthiness persist. In particular, Large Language Model (LLM)-based chatbots like ChatGPT, Gemini, and DeepSeek are becoming more accessible. However, such chatbots have limitations, including their inability to explain response generation, the risk of generating problematic content, the lack of standardized testing for reliability, and the need for deep AI expertise and extended development times. These issues make chatbots unsuitable for trust-sensitive applications like elections or healthcare. To address these concerns, we introduce SafeChat, a general architecture for building safe and trustworthy chatbots, with a focus on information retrieval use cases. Key features of SafeChat include: (a) safety, with a domain-agnostic design where responses are grounded and traceable to approved sources (provenance), and 'do-not-respond' strategies to prevent harmful answers; (b) usability, with automatic extractive summarization of long responses, traceable to their sources, and automated trust assessments to communicate expected chatbot behavior, such as sentiment; and (c) fast, scalable development, including a CSV-driven workflow, automated testing, and integration with various devices. We implemented SafeChat in an executable framework using the open-source chatbot platform Rasa. A case study demonstrates its application in building ElectionBot-SC, a chatbot designed to safely disseminate official election information. SafeChat is being used in many domains, validating its potential, and is available at: https://github.com/ai4society/trustworthy-chatbot.

ContrastiveGaussian: High-Fidelity 3D Generation with Contrastive Learning and Gaussian Splatting

Junbang Liu,Enpei Huang,Dongxing Mao,Hui Zhang,Xinyuan Song,Yongxin Ni

Task: Generate 3D content from single-view images.

Motivation: Existing methods use pre-trained 2D diffusion models to generate multi-view 3D representations, but they are limited by the visual inconsistencies of the diffusion model outputs.

Details

Method: Proposes ContrastiveGaussian, which combines contrastive learning with a perceptual loss to exploit visual inconsistencies for better 3D generation quality, and introduces a super-resolution model and a Quantity-Aware Triplet Loss to enhance sample differentiation. Result: Experiments show that the method achieves superior texture fidelity and geometric consistency. Conclusion: Through contrastive learning and a perceptual loss, ContrastiveGaussian significantly improves the quality and consistency of 3D generation. Abstract: Creating 3D content from single-view images is a challenging problem that has attracted considerable attention in recent years. Current approaches typically utilize score distillation sampling (SDS) from pre-trained 2D diffusion models to generate multi-view 3D representations. Although some methods have made notable progress by balancing generation speed and model quality, their performance is often limited by the visual inconsistencies of the diffusion model outputs. In this work, we propose ContrastiveGaussian, which integrates contrastive learning into the generative process. By using a perceptual loss, we effectively differentiate between positive and negative samples, leveraging the visual inconsistencies to improve 3D generation quality. To further enhance sample differentiation and improve contrastive learning, we incorporate a super-resolution model and introduce another Quantity-Aware Triplet Loss to address varying sample distributions during training. Our experiments demonstrate that our approach achieves superior texture fidelity and improved geometric consistency.
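
The abstract does not define the Quantity-Aware Triplet Loss, so the sketch below shows only the standard triplet margin loss that such variants usually build on. It is not the authors' loss: the function names, the Euclidean distance, and the margin of 1.0 are assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` (hypothetical default of 1.0).
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2D features, for illustration only.
anchor = [0.0, 0.0]
near = [0.0, 1.0]   # distance 1 from the anchor
far = [3.0, 4.0]    # distance 5 from the anchor

easy = triplet_loss(anchor, near, far)   # positive is close, negative is far
hard = triplet_loss(anchor, far, near)   # roles swapped, so the loss is large
print(easy, hard)
```

In a contrastive 3D pipeline the vectors would be perceptual features of rendered views rather than toy points; how the paper weights the loss by sample quantity is not stated in the abstract.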

BiasCause: Evaluate Socially Biased Causal Reasoning of Large Language Models

Tian Xie,Tongxin Yin,Vaishakh Keshava,Xueru Zhang,Siddhartha Reddy Jonnalagadda

Task: Evaluate the causal reasoning process of large language models (LLMs) when they answer questions involving social bias.

Motivation: Prior work has found social bias in LLM-generated content, but the causal reasoning mechanisms that lead to these biases are not well understood.

Details

Method: Proposes a new classification framework, uses LLMs to synthesize 1788 questions covering 8 sensitive attributes, reveals the LLMs' reasoning process through causal graphs, and tests 4 state-of-the-art LLMs. Result: All models show biased causal reasoning on the majority of questions, producing 4135 biased causal graphs; the study also identifies 3 strategies for avoiding bias and reveals that LLMs tend to confuse correlation with causality. Conclusion: LLMs are prone to bias in causal reasoning, and further research is needed to reduce bias and improve reasoning. Abstract: While large language models (LLMs) already play significant roles in society, research has shown that LLMs still generate content including social bias against certain sensitive groups. While existing benchmarks have effectively identified social biases in LLMs, a critical gap remains in our understanding of the underlying reasoning that leads to these biased outputs. This paper goes one step further to evaluate the causal reasoning process of LLMs when they answer questions eliciting social biases. We first propose a novel conceptual framework to classify the causal reasoning produced by LLMs. Next, we use LLMs to synthesize 1788 questions covering 8 sensitive attributes and manually validate them. The questions can test different kinds of causal reasoning by letting LLMs disclose their reasoning process with causal graphs. We then test 4 state-of-the-art LLMs. All models answer the majority of questions with biased causal reasoning, resulting in a total of 4135 biased causal graphs. Meanwhile, we discover 3 strategies for LLMs to avoid biased causal reasoning by analyzing the "bias-free" cases. Finally, we reveal that LLMs are also prone to "mistaken-biased" causal reasoning, where they first confuse correlation with causality to infer specific sensitive group names and then incorporate biased causal reasoning.

Towards Unconstrained 2D Pose Estimation of the Human Spine

Muhammad Saif Ullah Khan,Stephan Krauß,Didier Stricker

Task: Presents the SpineTrack dataset and the SpinePose method for 2D spine pose estimation.

Motivation: Existing pose datasets usually simplify the spine to a single rigid segment, which cannot meet the need for spinal detail in motion analysis.

Details

Method: SpineTrack comprises a synthetic subset and a real-world subset, with annotations refined through an active learning pipeline; SpinePose extends existing pose estimators using knowledge distillation and an anatomical regularization strategy. Result: Experiments validate the effectiveness of SpineTrack for spine pose estimation, providing a foundation for future research. Conclusion: SpineTrack and SpinePose provide reliable tools for biomechanical analysis and 3D spine reconstruction. Abstract: We present SpineTrack, the first comprehensive dataset for 2D spine pose estimation in unconstrained settings, addressing a crucial need in sports analytics, healthcare, and realistic animation. Existing pose datasets often simplify the spine to a single rigid segment, overlooking the nuanced articulation required for accurate motion analysis. In contrast, SpineTrack annotates nine detailed spinal keypoints across two complementary subsets: a synthetic set comprising 25k annotations created using Unreal Engine with biomechanical alignment through OpenSim, and a real-world set comprising over 33k annotations curated via an active learning pipeline that iteratively refines automated annotations with human feedback. This integrated approach ensures anatomically consistent labels at scale, even for challenging, in-the-wild images. We further introduce SpinePose, extending state-of-the-art body pose estimators using knowledge distillation and an anatomical regularization strategy to jointly predict body and spine keypoints. Our experiments in both general and sports-specific contexts validate the effectiveness of SpineTrack for precise spine pose estimation, establishing a robust foundation for future research in advanced biomechanical analysis and 3D spine reconstruction in the wild.

Linguistic Interpretability of Transformer-based Language Models: a systematic review

Miguel López-Otal,Jorge Gracia,Jordi Bernad,Carlos Bobed,Lucía Pitarch-Ballesteros,Emma Anglés-Herrero

Task: Provide a comprehensive analysis of 160 research works examining whether Transformer-based language models possess linguistic knowledge similar to that of human speakers.

Motivation: Although Transformer models perform very well on language tasks, their internal computations remain unclear and they are regarded as 'black box' systems. Studying their linguistic interpretability helps explain how the models encode linguistic information.

Details

Method: Analyzes 160 research works spanning multiple languages and models, examining models' internal representations from linguistic perspectives (syntax, morphology, lexico-semantics, and discourse). Result: Fills a gap in the existing interpretability literature, particularly regarding linguistic knowledge, and extends the analysis to non-English and multilingual models. Conclusion: The survey offers a comprehensive view of the linguistic knowledge in Transformer models and emphasizes the importance of exploring models' internal representations. Abstract: Language models based on the Transformer architecture achieve excellent results in many language-related tasks, such as text classification or sentiment analysis. However, despite the architecture of these models being well-defined, little is known about how their internal computations help them achieve their results. This renders these models, as of today, a type of 'black box' system. There is, however, a line of research -- 'interpretability' -- aiming to learn how information is encoded inside these models. More specifically, there is work dedicated to studying whether Transformer-based models possess knowledge of linguistic phenomena similar to human speakers -- an area we call 'linguistic interpretability' of these models. In this survey we present a comprehensive analysis of 160 research works, spread across multiple languages and models -- including multilingual ones -- that attempt to discover linguistic information from the perspective of several traditional Linguistics disciplines: Syntax, Morphology, Lexico-Semantics and Discourse. Our survey fills a gap in the existing interpretability literature, which either does not focus on linguistic knowledge in these models or presents some limitations -- e.g. only studying English-based models. Our survey also focuses on Pre-trained Language Models not further specialized for a downstream task, with an emphasis on works that use interpretability techniques that explore models' internal representations.

POEM: Precise Object-level Editing via MLLM control

Marco Schouten,Mehmet Onurcan Kaya,Serge Belongie,Dim P. Papadopoulos

Task: Proposes POEM, a framework for precise object-level image editing based on Multimodal Large Language Models (MLLMs).

Motivation: Existing text-based editing methods perform poorly on localized shape and layout transformations, while interaction-based methods require substantial manual input.

Details

Method: Uses MLLMs to analyze instructional prompts and generate precise object masks that guide the diffusion-based editing process. Result: POEM outperforms existing text-based editing methods in precision and reliability while reducing manual input. Conclusion: POEM provides an efficient and precise solution for object-level image editing. Abstract: Diffusion models have significantly improved text-to-image generation, producing high-quality, realistic images from textual descriptions. Beyond generation, object-level image editing remains a challenging problem, requiring precise modifications while preserving visual coherence. Existing text-based instructional editing methods struggle with localized shape and layout transformations, often introducing unintended global changes. Image interaction-based approaches offer better accuracy but require manual human effort to provide precise guidance. To reduce this manual effort while maintaining a high image editing accuracy, in this paper, we propose POEM, a framework for Precise Object-level Editing using Multimodal Large Language Models (MLLMs). POEM leverages MLLMs to analyze instructional prompts and generate precise object masks before and after transformation, enabling fine-grained control without extensive user input. This structured reasoning stage guides the diffusion-based editing process, ensuring accurate object localization and transformation. To evaluate our approach, we introduce VOCEdits, a benchmark dataset based on PASCAL VOC 2012, augmented with instructional edit prompts, ground-truth transformations, and precise object masks. Experimental results show that POEM outperforms existing text-based image editing approaches in precision and reliability while reducing manual effort compared to interaction-based methods.

More diverse more adaptive: Comprehensive Multi-task Learning for Improved LLM Domain Adaptation in E-commerce

Tong Piao,Pei Tang,Zhipeng Zhang,Jiaqi Li,Qiao Liu,Zufeng Wu

Task: Study the effect of multi-modal data and diverse tasks on the domain adaptation performance of large language models in e-commerce.

Motivation: Verify whether diverse data and tasks can improve LLM performance in the e-commerce domain, filling a gap in existing research.

Details

Method: Proposes a comprehensive e-commerce multi-task framework and analyzes, through experiments, the impact of data diversity and task diversity from two perspectives: "capability comprehensiveness" and "task comprehensiveness." Result: Progressively introducing tasks from new capability areas and adding subtasks significantly improves model performance; increasing model capacity further amplifies the benefits of diversity. Conclusion: The study confirms the importance of data and task diversity for improving LLM performance in e-commerce, and validates the practicality of the best model in the KDD Cup 2024. Abstract: In recent years, Large Language Models (LLMs) have been widely applied across various domains due to their powerful domain adaptation capabilities. Previous studies have suggested that diverse, multi-modal data can enhance LLMs' domain adaptation performance. However, this hypothesis remains insufficiently validated in the e-commerce sector. To address this gap, we propose a comprehensive e-commerce multi-task framework and design empirical experiments to examine the impact of diverse data and tasks on LLMs from two perspectives: "capability comprehensiveness" and "task comprehensiveness." Specifically, we observe significant improvements in LLM performance by progressively introducing tasks related to new major capability areas and by continuously adding subtasks within different major capability domains. Furthermore, we observe that increasing model capacity amplifies the benefits of diversity, suggesting a synergistic relationship between model capacity and data diversity. Finally, we validate the best-performing model from our empirical experiments in the KDD Cup 2024, ranking 5th in Task 1. This outcome demonstrates the significance of our research for advancing LLMs in the e-commerce domain.

Benchmarking Suite for Synthetic Aperture Radar Imagery Anomaly Detection (SARIAD) Algorithms

Lucian Chauvin,Somil Gupta,Angelina Ibarra,Joshua Peeples

Task: Develop and release SARIAD, a benchmarking suite for anomaly detection in synthetic aperture radar (SAR) imagery.

Motivation: Tools and datasets for developing and evaluating SAR imagery anomaly detection methods are currently lacking, which limits research progress in the field.

Details

Method: In conjunction with the Anomalib deep-learning library, SARIAD provides a comprehensive suite of algorithms and datasets for assessing and developing anomaly detection methods on SAR imagery. Result: SARIAD integrates multiple SAR datasets and tools, supports a variety of anomaly detection algorithms, and provides evaluation metrics and visualizations. Conclusion: As a benchmarking suite, SARIAD promotes reproducible research in SAR imagery anomaly detection and is publicly available. Abstract: Anomaly detection is a key research challenge in computer vision and machine learning with applications in many fields from quality control to radar imaging. In radar imaging, specifically synthetic aperture radar (SAR), anomaly detection can be used for the classification, detection, and segmentation of objects of interest. However, there is no method for developing and benchmarking these methods on SAR imagery. To address this issue, we introduce SAR imagery anomaly detection (SARIAD). In conjunction with Anomalib, a deep-learning library for anomaly detection, SARIAD provides a comprehensive suite of algorithms and datasets for assessing and developing anomaly detection approaches on SAR imagery. SARIAD specifically integrates multiple SAR datasets along with tools to effectively apply various anomaly detection algorithms to SAR imagery. Several anomaly detection metrics and visualizations are available. Overall, SARIAD acts as a central package for benchmarking SAR models and datasets to allow for reproducible research in the field of anomaly detection in SAR imagery. This package is publicly available: https://github.com/Advanced-Vision-and-Learning-Lab/SARIAD.

From Speech to Summary: A Comprehensive Survey of Speech Summarization

Fabian Retkowski,Maike Züfle,Andreas Sudmann,Dinah Pfau,Jan Niehues,Alexander Waibel

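Task: Survey the current state of speech summarization, including datasets, evaluation methods, and recent technical developments.

Motivation: Speech summarization is essential for managing the growing volume of spoken and audiovisual content, but it is loosely defined and spans several research areas, so a systematic overview is needed.

Details

Method: Surveys existing datasets and evaluation methods and synthesizes recent advances in the field, analyzing the shift from traditional systems to advanced models. Result: Summarizes key datasets, evaluation methods, and technical trends in speech summarization, such as fine-tuned cascaded architectures and end-to-end solutions. Conclusion: Speech summarization is developing rapidly but still needs a clearer definition and standardized evaluation methods. Abstract: Speech summarization has become an essential tool for efficiently managing and accessing the growing volume of spoken and audiovisual content. However, despite its increasing importance, speech summarization is still not clearly defined and intersects with several research areas, including speech recognition, text summarization, and specific applications like meeting summarization. This survey not only examines existing datasets and evaluation methodologies, which are crucial for assessing the effectiveness of summarization approaches, but also synthesizes recent developments in the field, highlighting the shift from traditional systems to advanced models like fine-tuned cascaded architectures and end-to-end solutions.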

Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects

Shalini Maiti,Lourdes Agapito,Filippos Kokkinos

Task: Develop Gen3DEval, a framework that evaluates the quality of text-to-3D generation without requiring ground-truth data.

Motivation: Current evaluation metrics such as PSNR and CLIP either require ground-truth data or focus only on prompt fidelity, and so do not meet the needs of text-to-3D generation.

Details

Method: Uses fine-tuned vision large language models (vLLMs) to analyze 3D surface normals and evaluate text fidelity, appearance, and surface quality. Result: Gen3DEval outperforms existing task-agnostic models in user-aligned evaluations. Conclusion: Gen3DEval provides a comprehensive and accessible benchmark for text-to-3D generation research. Abstract: Rapid advancements in text-to-3D generation require robust and scalable evaluation metrics that align closely with human judgment, a need unmet by current metrics such as PSNR and CLIP, which require ground-truth data or focus only on prompt fidelity. To address this, we introduce Gen3DEval, a novel evaluation framework that leverages vision large language models (vLLMs) specifically fine-tuned for 3D object quality assessment. Gen3DEval evaluates text fidelity, appearance, and surface quality by analyzing 3D surface normals, without requiring ground-truth comparisons, bridging the gap between automated metrics and user preferences. Compared to state-of-the-art task-agnostic models, Gen3DEval demonstrates superior performance in user-aligned evaluations, placing it as a comprehensive and accessible benchmark for future research on text-to-3D generation. The project page can be found here: https://shalini-maiti.github.io/gen3deval.github.io/.

Can Reasoning LLMs Enhance Clinical Document Classification?

Akram Mustafa,Usman Naseem,Mostafa Rahimi Azghadi

Task: Evaluate the performance and consistency of eight large language models (LLMs) in classifying clinical discharge summaries.

Motivation: Clinical document classification is essential for converting unstructured medical text into standardised ICD-10 diagnoses, but it faces challenges from complex medical language, privacy constraints, and limited annotated data.

Details

Method: Uses cTAKES to structure clinical narratives and evaluates four reasoning and four non-reasoning models on the MIMIC-IV dataset, with majority voting determining final predictions. Result: Reasoning models outperform non-reasoning models in accuracy (71% vs 68%) and F1 score (67% vs 60%), with Gemini 2.0 Flash Thinking performing best (75% accuracy and 76% F1 score). Non-reasoning models are more stable in consistency (91% vs 84%). Conclusion: The study finds a trade-off between accuracy and consistency and suggests a hybrid approach to optimise clinical coding. Future research should explore multi-label classification, domain-specific fine-tuning, and ensemble methods to improve model reliability. Abstract: Clinical document classification is essential for converting unstructured medical texts into standardised ICD-10 diagnoses, yet it faces challenges due to complex medical language, privacy constraints, and limited annotated datasets. Large Language Models (LLMs) offer promising improvements in accuracy and efficiency for this task. This study evaluates the performance and consistency of eight LLMs, four reasoning (Qwen QWQ, Deepseek Reasoner, GPT o3 Mini, Gemini 2.0 Flash Thinking) and four non-reasoning (Llama 3.3, GPT 4o Mini, Gemini 2.0 Flash, Deepseek Chat), in classifying clinical discharge summaries using the MIMIC-IV dataset. Using cTAKES to structure clinical narratives, models were assessed across three experimental runs, with majority voting determining final predictions. Results showed that reasoning models outperformed non-reasoning models in accuracy (71% vs 68%) and F1 score (67% vs 60%), with Gemini 2.0 Flash Thinking achieving the highest accuracy (75%) and F1 score (76%). However, non-reasoning models demonstrated greater stability (91% vs 84% consistency). Performance varied across ICD-10 codes, with reasoning models excelling in complex cases but struggling with abstract categories. Findings indicate a trade-off between accuracy and consistency, suggesting that a hybrid approach could optimise clinical coding. Future research should explore multi-label classification, domain-specific fine-tuning, and ensemble methods to enhance model reliability in real-world applications.
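
The evaluation protocol above (three runs per model, majority voting for the final prediction, and a consistency score) can be sketched as follows. This is a minimal illustration, not the study's code: the tie-breaking rule, the definition of consistency as the share of documents on which all runs agree, and the toy labels are all assumptions.

```python
from collections import Counter


def majority_vote(run_predictions):
    """Return the label predicted most often across runs.

    Ties fall back to the label seen first (an assumption of this sketch).
    """
    return Counter(run_predictions).most_common(1)[0][0]


def agreement_rate(predictions_per_document):
    """Fraction of documents on which every run produced the same label."""
    unanimous = sum(1 for runs in predictions_per_document if len(set(runs)) == 1)
    return unanimous / len(predictions_per_document)


# Toy data: three runs per discharge summary (labels are illustrative only).
toy_runs = [
    ["I10", "I10", "I10"],
    ["E11", "I10", "E11"],
    ["J18", "J18", "N39"],
    ["N39", "N39", "N39"],
]

final_labels = [majority_vote(runs) for runs in toy_runs]
consistency = agreement_rate(toy_runs)
print(final_labels)
print(consistency)
```

With three runs, a vote only fails to produce a strict majority when all three labels differ, which is why the sketch needs an explicit tie rule.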

Impact of Language Guidance: A Reproducibility Study

Cherish Puniani,Advika Sinha,Shree Singhi,Aayan Yadav

Task: Use language guidance to sample view pairs in order to improve contrastive learning in self-supervised learning.

Motivation: Modern deep-learning architectures need large amounts of annotated data, but annotation is expensive and error-prone. Self-supervised learning (such as contrastive learning) avoids explicit annotation, but existing methods (such as SimCLR and CLIP) rely on image augmentations or cross-modal losses and may be affected by visual variability.

Details

Method: Uses language guidance to sample view pairs and proposes a new metric to evaluate the semantic capabilities of self-supervised models. Result: Finds that the captions in the original dataset, RedCaps, are of low quality; replacing them with BLIP-2 captions improves performance. Conclusion: Language guidance can improve conceptual similarity in contrastive learning, and the new metric helps evaluate models' semantic capabilities. Abstract: Modern deep-learning architectures need large amounts of data to produce state-of-the-art results. Annotating such huge datasets is time-consuming, expensive, and prone to human error. Recent advances in self-supervised learning allow us to train huge models without explicit annotation. Contrastive learning is a popular paradigm in self-supervised learning. Recent works like SimCLR and CLIP rely on image augmentations or directly minimizing cross-modal loss between image and text. Banani et al. (2023) propose to use language guidance to sample view pairs. They claim that language enables better conceptual similarity, eliminating the effects of visual variability. We reproduce their experiments to verify their claims and find that their dataset, RedCaps, contains low-quality captions. We use an off-the-shelf image captioning model, BLIP-2, to replace the captions and improve performance, and we also devise a new metric to evaluate the semantic capabilities of self-supervised models based on interpretability methods.
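
One plausible reading of "language guidance to sample view pairs" is to pair each image with the image whose caption is most similar in an embedding space. The sketch below illustrates that idea with hand-written toy vectors; the cosine-similarity criterion, the function names, and the embeddings are assumptions, and the abstract does not specify the exact sampling rule.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_caption_pairs(caption_embeddings):
    """Pair each image index with the index of its most similar caption."""
    pairs = []
    for i, emb in enumerate(caption_embeddings):
        others = [j for j in range(len(caption_embeddings)) if j != i]
        best = max(others, key=lambda j: cosine(emb, caption_embeddings[j]))
        pairs.append((i, best))
    return pairs


# Toy caption embeddings (hypothetical): two "dog" captions, two "car" captions.
toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
]

view_pairs = nearest_caption_pairs(toy_embeddings)
print(view_pairs)
```

Because the pairs come from caption similarity, their quality depends directly on caption quality, which is consistent with the reported gain from replacing the RedCaps captions with BLIP-2 captions.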

Multi-view autoencoders for Fake News Detection

Ingryd V. S. T. Pereira,George D. C. Cavalcanti,Rafael M. O. Cruz

Task: Generate a joint feature representation with multi-view autoencoders to detect fake news.

Motivation: Fake news spreads quickly and widely on social media, making automatic detection an important task; however, existing feature extraction techniques perform inconsistently across scenarios, so complementary information is needed to improve detection.

Details

Method: Proposes using multi-view autoencoders to integrate several feature extraction techniques into a joint feature representation. Result: Experiments show that the multi-view approach significantly improves classification performance over individual views, and that selecting a subset of views can be advantageous in accuracy and computational effort. Conclusion: Multi-view autoencoders effectively integrate different feature extraction techniques and improve fake news detection, while careful view selection can further improve efficiency and accuracy. Abstract: Given the volume and speed at which fake news spreads across social media, automatic fake news detection has become a highly important task. However, this task presents several challenges, including extracting textual features that contain relevant information about fake news. Research about fake news detection shows that no single feature extraction technique consistently outperforms the others across all scenarios. Nevertheless, different feature extraction techniques can provide complementary information about the textual data and enable a more comprehensive representation of the content. This paper proposes using multi-view autoencoders to generate a joint feature representation for fake news detection by integrating several feature extraction techniques commonly used in the literature. Experiments on fake news datasets show a significant improvement in classification performance compared to individual views (feature representations). We also observed that selecting a subset of the views instead of composing a latent space with all the views can be advantageous in terms of accuracy and computational effort. For further details, including source codes, figures, and datasets, please refer to the project's repository: https://github.com/ingrydpereira/multiview-fake-news.

LoRAX: LoRA eXpandable Networks for Continual Synthetic Image Attribution

Danielle Sullivan-Pao,Nicole Tian,Pooya Khorrami

Task: Proposes LoRA eXpandable Networks (LoRAX), a parameter-efficient class-incremental learning algorithm that adapts to new generative image models without full retraining.

Motivation: As generative AI image technologies become more widespread and advanced, strong attribution models are needed to verify image authenticity and identify the originating generative model architecture; however, existing models struggle to generalize to unseen models, and traditional fine-tuning is impractical in real-world settings.

Details

Method: Trains a parameter-efficient feature extractor for each continual learning task via Low Rank Adaptation; each task-specific feature extractor requires only a small number of parameters. Result: LoRAX outperforms or remains competitive with state-of-the-art class-incremental learning algorithms on the Continual Deepfake Detection benchmark, while each feature extractor requires less than 3% of the trainable parameters of the full-rank implementation. Conclusion: LoRAX is an efficient and practical solution for continual learning over generative image models. Abstract: As generative AI image technologies become more widespread and advanced, there is a growing need for strong attribution models. These models are crucial for verifying the authenticity of images and identifying the architecture of their originating generative models, key to maintaining media integrity. However, attribution models struggle to generalize to unseen models, and traditional fine-tuning methods for updating these models have been shown to be impractical in real-world settings. To address these challenges, we propose LoRA eXpandable Networks (LoRAX), a parameter-efficient class incremental algorithm that adapts to novel generative image models without the need for full retraining. Our approach trains an extremely parameter-efficient feature extractor per continual learning task via Low Rank Adaptation. Each task-specific feature extractor learns distinct features while only requiring a small fraction of the parameters present in the underlying feature extractor's backbone model. Our extensive experimentation shows LoRAX outperforms or remains competitive with state-of-the-art class incremental learning algorithms on the Continual Deepfake Detection benchmark across all training scenarios and memory settings, while requiring less than 3% of the number of trainable parameters per feature extractor compared to the full-rank implementation. LoRAX code is available at: https://github.com/mit-ll/lorax_cil.
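
To see why Low Rank Adaptation is parameter-efficient, it helps to count parameters for a single weight matrix. In LoRA, a frozen weight W of shape (d_out, d_in) receives a trainable update B·A, where B has shape (d_out, r) and A has shape (r, d_in). The sketch below does that arithmetic; the layer size of 768 and the rank of 8 are assumptions chosen for illustration, not values from the paper, and the paper's "less than 3%" figure refers to a whole feature extractor rather than one matrix.

```python
def full_rank_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters of a rank-`rank` LoRA update (B: d_out x r, A: r x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer: a 768 x 768 projection with LoRA rank 8.
d_out, d_in, rank = 768, 768, 8

full = full_rank_params(d_out, d_in)
low = lora_params(d_out, d_in, rank)
fraction = low / full

print(full, low, round(100 * fraction, 2))
```

Since the fraction equals r·(d_out + d_in) / (d_out·d_in), it shrinks as the layer grows for a fixed rank, so adding one such extractor per continual learning task stays cheap.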

DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?

Daniil Larionov,Sotaro Takeshita,Ran Zhang,Yanran Chen,Christoph Leiter,Zhipin Wang,Christian Greisinger,Steffen Eger

Task: Systematically compare reasoning-based large language models (LLMs) with non-reasoning models on machine translation (MT) and text summarization (TS) evaluation tasks.

Motivation: Explore how effective reasoning capabilities are for LLMs on natural language generation (NLG) evaluation tasks, filling a gap in the research.

Details

Method: Evaluates eight models on the WMT23 and SummEval benchmarks, including state-of-the-art reasoning models, their distilled variants (from 8B to 70B parameters), and equivalent non-reasoning LLMs. Result: The effect of reasoning is highly model- and task-dependent: OpenAI o3-mini improves as reasoning intensity increases, while DeepSeek-R1 underperforms its non-reasoning variant in most cases. Distilled reasoning holds up reasonably in medium-sized models (32B) but degrades substantially in small models (8B). Conclusion: This study is the first comprehensive assessment of reasoning LLMs for NLG evaluation and offers insights into their practical use. Abstract: Reasoning-enabled large language models (LLMs) have recently demonstrated impressive performance in complex logical and mathematical tasks, yet their effectiveness in evaluating natural language generation remains unexplored. This study systematically compares reasoning-based LLMs (DeepSeek-R1 and OpenAI o3) with their non-reasoning counterparts across machine translation (MT) and text summarization (TS) evaluation tasks. We evaluate eight models across three architectural categories, including state-of-the-art reasoning models, their distilled variants (ranging from 8B to 70B parameters), and equivalent conventional, non-reasoning LLMs. Our experiments on WMT23 and SummEval benchmarks reveal that the benefits of reasoning capabilities are highly model- and task-dependent: while OpenAI o3-mini models show consistent performance improvements with increased reasoning intensity, DeepSeek-R1 underperforms compared to its non-reasoning variant, with the exception of certain aspects of TS evaluation. Correlation analysis demonstrates that increased reasoning token usage positively correlates with evaluation quality in o3-mini models. Furthermore, our results show that distillation of reasoning capabilities maintains reasonable performance in medium-sized models (32B) but degrades substantially in smaller variants (8B). This work provides the first comprehensive assessment of reasoning LLMs for NLG evaluation and offers insights into their practical use.

Investigating Vision-Language Model for Point Cloud-based Vehicle Classification

Yiqiao Li,Jie Wei,Camille Kamga

Task: Use roadside LiDAR point cloud data and vision-language models (VLMs) to achieve efficient and accurate truck classification.

Motivation: Heavy-duty trucks pose significant safety challenges because of their large size and limited maneuverability, so truck classification is needed to improve the safety of cooperative autonomous driving. Traditional LiDAR-based classification methods rely on extensive manual annotation, which is costly and time-consuming.

Details

Method: Proposes a new framework that integrates roadside LiDAR point cloud data with VLMs, including point cloud preprocessing with 3D rendering and enhanced feature representation, and in-context learning with few-shot prompting. Result: Experimental results show that the method performs well and can reduce annotation effort while improving classification accuracy. Conclusion: The framework offers an efficient, low-cost approach to truck classification that supports cooperative and safe driving environments. Abstract: Heavy-duty trucks pose significant safety challenges due to their large size and limited maneuverability compared to passenger vehicles. A deeper understanding of truck characteristics is essential for enhancing the safety perspective of cooperative autonomous driving. Traditional LiDAR-based truck classification methods rely on extensive manual annotations, which makes them labor-intensive and costly. The rapid advancement of large language models (LLMs) trained on massive datasets presents an opportunity to leverage their few-shot learning capabilities for truck classification. However, existing vision-language models (VLMs) are primarily trained on image datasets, which makes it challenging to directly process point cloud data. This study introduces a novel framework that integrates roadside LiDAR point cloud data with VLMs to facilitate efficient and accurate truck classification, which supports cooperative and safe driving environments. This study introduces three key innovations: (1) leveraging real-world LiDAR datasets for model development, (2) designing a preprocessing pipeline to adapt point cloud data for VLM input, including point cloud registration for dense 3D rendering and mathematical morphological techniques to enhance feature representation, and (3) utilizing in-context learning with few-shot prompting to enable vehicle classification with minimally labeled training data. Experimental results demonstrate encouraging performance of this method and present its potential to reduce annotation efforts while improving classification accuracy.

Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora

Alex Warstadt,Aaron Mueller,Leshem Choshen,Ethan Wilcox,Chengxu Zhuang,Juan Ciro,Rafael Mosquera,Bhargavi Paranjape,Adina Williams,Tal Linzen,Ryan Cotterell

Task: Study how to optimize language model training under a limited data budget to improve data efficiency.

Motivation: Large language models typically require huge amounts of data and are far less data-efficient than human language learners, which limits their use as cognitive models.

Details

Method: In the BabyLM Challenge, participants submit language model training approaches under different data restrictions, which are evaluated on multiple tasks. Result: Models using the LTG-BERT architecture perform best; some submissions achieve strong results with short input sequences or knowledge distillation, while curriculum learning has limited effect. Conclusion: Data-efficient language models can be achieved through particular architectures and training methods, and future research should avoid over-reliance on curriculum learning. Abstract: Children can acquire language from less than 100 million words of input. Large language models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data and still do not perform as well as humans on many evaluations. These intensive resource demands limit the ability of researchers to train new models and use existing models as developmentally plausible cognitive models. The BabyLM Challenge is a communal effort in which participants compete to optimize language model training on a fixed data budget. Submissions are compared on various evaluation tasks targeting grammatical ability, downstream task performance, and generalization. Participants can submit to up to three tracks with progressively looser data restrictions. From over 30 submissions, we extract concrete recommendations on how best to train data-efficient language models, and on where future efforts should (and perhaps should not) focus. The winning submissions using the LTG-BERT architecture (Samuel et al., 2023) outperformed models trained on trillions of words. Other submissions achieved strong results through training on shorter input sequences or training a student model on a pretrained teacher. Curriculum learning attempts, which accounted for a large number of submissions, were largely unsuccessful, though some showed modest improvements.

Learning Object Focused Attention

Vivek Trivedy,Amani Almalki,Longin Jan Latecki

Task: Proposes a method to improve the training of Vision Transformers (ViTs) by explicitly modeling objects in the attention computation.

Motivation: Restricting attention to image patches of the same object class improves ViTs' understanding of holistic object shapes and reduces background interference.

Details

Method: Adds a new branch to selected attention layers that computes an auxiliary loss (the OFA loss), and experiments with multiscale masking to improve performance. Result: OFA models outperform their base models on classification, generalize better to out-of-distribution (OOD) and adversarially corrupted images, and learn representations based on object shape rather than spurious textures. Conclusion: OFA is simple to integrate, adds no inference overhead, and opens a path toward self-supervised learning. Abstract: We propose an adaptation to the training of Vision Transformers (ViTs) that allows for an explicit modeling of objects during the attention computation. This is achieved by adding a new branch to selected attention layers that computes an auxiliary loss which we call the object-focused attention (OFA) loss. We restrict the attention to image patches that belong to the same object class, which allows ViTs to gain a better understanding of configural (or holistic) object shapes by focusing on intra-object patches instead of other patches such as those in the background. Our proposed inductive bias fits easily into the attention framework of transformers since it only adds an auxiliary loss over selected attention layers. Furthermore, our approach has no additional overhead during inference. We also experiment with multiscale masking to further improve the performance of our OFA model and give a path forward for self-supervised learning with our method. Our experimental results demonstrate that ViTs with OFA achieve better classification results than their base models, exhibit a stronger generalization ability to out-of-distribution (OOD) and adversarially corrupted images, and learn representations based on object shapes rather than spurious correlations via general textures. For our OOD setting, we generate a novel dataset using the COCO dataset and Stable Diffusion inpainting which we plan to share with the community.
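
Restricting attention to patches of the same object class implies a patch-by-patch target mask. The sketch below builds such a mask from per-patch class labels; it illustrates only the masking idea, not the OFA loss itself, whose exact form the abstract does not give. The function name and the toy labels are assumptions.

```python
def same_object_mask(patch_labels):
    """Build an N x N mask where entry (i, j) is 1 if patches i and j
    carry the same object-class label, and 0 otherwise."""
    return [[int(a == b) for b in patch_labels] for a in patch_labels]


# Toy image split into four patches (labels are hypothetical).
patch_labels = ["cat", "cat", "background", "dog"]

mask = same_object_mask(patch_labels)
for row in mask:
    print(row)
```

An auxiliary loss could then compare an attention map against this mask during training; because the mask is only needed to compute that loss, nothing changes at inference time, which matches the abstract's claim of no inference overhead.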

Harnessing the Unseen: The Hidden Influence of Intrinsic Knowledge in Long-Context Language Models

Yu Fu,Haz Sameen Shahgir,Hui Liu,Xianfeng Tang,Qi He,Yue Dong

Task: Study how the intrinsic knowledge of large language models affects content generation and how it relates to the use of contextual knowledge.

Motivation: Existing long-context models focus mainly on external contextual information and overlook the role of the model's intrinsic knowledge, especially its influence in long contexts.

Details

Method: Designs the Hybrid Needle-in-a-Haystack test, which evaluates models on both intrinsic and extrinsic retrieval abilities. Result: Qwen-2.5 models significantly outperform Llama-3.1 models in intrinsic retrieval ability, and Llama-3.1-70B-Instruct does not show better performance under long-context conditions. Conclusion: The work stresses the importance of evaluating models from a dual-retrieval perspective; intrinsic and contextual knowledge use need to be balanced. Abstract: Recent advances in long-context models (LCMs), designed to handle extremely long input contexts, primarily focus on utilizing external contextual information, often leaving the influence of large language models' intrinsic knowledge underexplored. In this work, we investigate how this intrinsic knowledge affects content generation and demonstrate that its impact becomes increasingly pronounced as context length extends. Furthermore, we show that the model's ability to utilize intrinsic knowledge, which we call intrinsic retrieval ability, does not improve simultaneously with its ability to leverage contextual knowledge through extrinsic retrieval ability. Moreover, better extrinsic retrieval can interfere with the model's ability to use its own knowledge effectively, limiting its full potential. To bridge this gap, we design a simple yet effective Hybrid Needle-in-a-Haystack test that evaluates models based on their capabilities across both retrieval abilities, rather than solely emphasizing extrinsic retrieval ability. Our experimental results reveal that Qwen-2.5 models significantly outperform Llama-3.1 models, demonstrating superior intrinsic retrieval ability. Moreover, even the more powerful Llama-3.1-70B-Instruct model fails to exhibit better performance under LCM conditions, highlighting the importance of evaluating models from a dual-retrieval perspective.

Multi-person Physics-based Pose Estimation for Combat Sports

Hossein Feiz,David Labbé,Thomas Romeas,Jocelyn Faubert,Sheldon Andrews

Task: Proposes a novel framework for accurate 3D human pose estimation with sparse multi-camera setups, targeting combat sports.

Motivation: Address challenges in combat sports such as rapid motions, occlusions, and close interactions, improving the accuracy and robustness of pose estimation.

Details

Method: Combines transformer-based multi-view 2D pose tracking, epipolar geometry constraints, and long-term video object segmentation; obtains initial 3D poses through weighted triangulation and spline smoothing; then improves pose realism and robustness through kinematic optimization and multi-person physics-based trajectory optimization. Result: Demonstrates state-of-the-art performance on diverse datasets, including a new benchmark of elite boxing footage. Conclusion: The framework performs well on multi-person pose estimation in combat sports, and annotated video datasets are released to advance future research. Abstract: We propose a novel framework for accurate 3D human pose estimation in combat sports using sparse multi-camera setups. Our method integrates robust multi-view 2D pose tracking via a transformer-based top-down approach, employing epipolar geometry constraints and long-term video object segmentation for consistent identity tracking across views. Initial 3D poses are obtained through weighted triangulation and spline smoothing, followed by kinematic optimization to refine pose accuracy. We further enhance pose realism and robustness by introducing a multi-person physics-based trajectory optimization step, effectively addressing challenges such as rapid motions, occlusions, and close interactions. Experimental results on diverse datasets, including a new benchmark of elite boxing footage, demonstrate state-of-the-art performance. Additionally, we release comprehensive annotated video datasets to advance future research in multi-person pose estimation for combat sports.

LLM for Comparative Narrative Analysis

Leo Kampen,Carlos Rabat Villarreal,Louis Yu,Santu Karmaker,Dongji Feng

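Task: Evaluate three large language models (GPT-3.5, PaLM2, and Llama2) under identical prompts through a Multi-Perspective Comparative Narrative Analysis (CNA).

Motivation: Compare how different large language models perform on the same task and reveal the similarities and differences in their comprehension and analysis abilities.

Details

Method: Tests the three models with identical prompts and analyzes differences in their performance from four perspectives, with human evaluation as the gold standard. Result: The three models generate divergent responses to the same prompt, indicating notable differences in their ability to comprehend and analyze the task. Conclusion: The multi-perspective comparative narrative analysis reveals notable performance differences across LLMs and offers a reference for model selection and optimization. Abstract: In this paper, we conducted a Multi-Perspective Comparative Narrative Analysis (CNA) on three prominent LLMs: GPT-3.5, PaLM2, and Llama2. We applied identical prompts and evaluated their outputs on specific tasks, ensuring an equitable and unbiased comparison between various LLMs. Our study revealed that the three LLMs generated divergent responses to the same prompt, indicating notable discrepancies in their ability to comprehend and analyze the given task. Human evaluation was used as the gold standard, evaluating four perspectives to analyze differences in LLM performance.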

TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation

Ruineng Li,Daitao Xing,Huiming Sun,Yuanzhou Ha,Jinglin Shen,Chiuman Ho

Task: Present TokenMotion, a DiT-based video diffusion framework for fine-grained control over camera motion, human motion, and their joint interaction.

Motivation: Existing video generation methods offer inadequate control over camera and human motion and rely on limited motion representations.

Details

Method: Represents camera trajectories and human poses as spatio-temporal tokens, adopts a decouple-and-fuse strategy, and handles combined motion signals through a human-aware dynamic mask. Result: TokenMotion performs well in both text-to-video and image-to-video settings, outperforming existing methods. Conclusion: TokenMotion is a significant advance in controllable video generation, with particular relevance for creative production applications. Abstract: Human-centric motion control in video generation remains a critical challenge, particularly when jointly controlling camera movements and human poses in scenarios like the iconic Grammy Glambot moment. While recent video diffusion models have made significant progress, existing approaches struggle with limited motion representations and inadequate integration of camera and human motion controls. In this work, we present TokenMotion, the first DiT-based video diffusion framework that enables fine-grained control over camera motion, human motion, and their joint interaction. We represent camera trajectories and human poses as spatio-temporal tokens to enable local control granularity. Our approach introduces a unified modeling framework utilizing a decouple-and-fuse strategy, bridged by a human-aware dynamic mask that effectively handles the spatially-and-temporally varying nature of combined motion signals. Through extensive experiments, we demonstrate TokenMotion's effectiveness across both text-to-video and image-to-video paradigms, consistently outperforming current state-of-the-art methods in human-centric motion control tasks. Our work represents a significant advancement in controllable video generation, with particular relevance for creative production applications.

Big Meaning: Qualitative Analysis on Large Bodies of Data Using AI

Samuel Flanders,Melati Nungsari,Mark Cheong Wing Loong

Task: Use AI-generated descriptive codes to identify a text's "fecundity" (the density of unique human-generated codes) in thematic analysis.

Motivation: Use AI-generated codes to guide the selection of texts likely to yield richer qualitative insights, rather than to replace human interpretation.

Details

Method: Using a dataset of 2,530 Malaysian news articles, compares AI-selected texts with randomly selected ones, with three human coders independently deriving codes. Result: AI-selected texts exhibit approximately twice the fecundity of randomly selected texts. Conclusion: AI-generated codes can serve as an effective proxy for identifying documents with high potential for meaning-making in thematic analysis. Abstract: This study introduces a framework that leverages AI-generated descriptive codes to indicate a text's fecundity--the density of unique human-generated codes--in thematic analysis. Rather than replacing human interpretation, AI-generated codes guide the selection of texts likely to yield richer qualitative insights. Using a dataset of 2,530 Malaysian news articles on refugee attitudes, we compare AI-selected documents to randomly chosen ones by having three human coders independently derive codes. The results demonstrate that AI-selected texts exhibit approximately twice the fecundity. Our findings support the use of AI-generated codes as an effective proxy for identifying documents with a high potential for meaning-making in thematic analysis.

Comparative Analysis of Different Methods for Classifying Polychromatic Sketches

Fahd Baba,Devon Mack

Task: Classify hand-drawn doodle images into 170 categories using machine learning methods.

Motivation: Improve image classification in computer vision, especially in domains humans are not accustomed to, so that algorithmic vision matches or exceeds human ability.

Details

Method: Collects, cleans, and parses a large dataset of hand-drawn doodles and compares multiple machine learning solutions. Result: The best model reaches a Top-1 accuracy of 47.5%, significantly above human performance on the dataset (41%). Conclusion: Machine learning models can surpass human performance on doodle classification, showing their potential in computer vision. Abstract: Image classification is a significant challenge in computer vision, particularly in domains humans are not accustomed to. As machine learning and artificial intelligence become more prominent, it is crucial these algorithms develop a sense of sight that is on par with or exceeds human ability. For this reason, we have collected, cleaned, and parsed a large dataset of hand-drawn doodles and compared multiple machine learning solutions to classify these images into 170 distinct categories. The best model we found achieved a Top-1 accuracy of 47.5%, significantly surpassing human performance on the dataset, which stands at 41%.
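For readers unfamiliar with the metric, Top-1 accuracy is the fraction of samples whose highest-scoring class matches the true label. The following is a minimal sketch on toy data; the function name `top1_accuracy` and the three-class scores are illustrative assumptions, not the authors' code or data.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the true label.

    scores: list of per-class score lists, one list per sample.
    labels: list of integer class indices, one per sample.
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and equally long")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score is the Top-1 prediction.
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


# Toy example with 3 classes (hypothetical values, not from the paper).
toy_scores = [
    [0.1, 0.7, 0.2],  # predicts class 1
    [0.5, 0.3, 0.2],  # predicts class 0
    [0.2, 0.2, 0.6],  # predicts class 2
    [0.4, 0.5, 0.1],  # predicts class 1
]
toy_labels = [1, 0, 1, 1]
toy_accuracy = top1_accuracy(toy_scores, toy_labels)
print(toy_accuracy)
```

The same function applies unchanged to a 170-class problem, since it only compares the argmax of each score row with the label.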

Out of Style: RAG's Fragility to Linguistic Variation

Tianyu Cao,Neel Bhandari,Akhila Yerukola,Akari Asai,Maarten Sap

Task: Analyze the robustness of Retrieval-augmented Generation (RAG) systems on real-world user-LLM interaction queries.

Motivation: Although RAG systems perform well on many NLP benchmarks, their robustness to real-world user queries remains understudied, which is a critical issue for practical deployment.

Details

Method: Systematically analyzes how four linguistic dimensions (formality, readability, politeness, and grammatical correctness) affect RAG performance, evaluating two retrieval models and nine LLMs on four information-seeking question answering datasets. Result: Linguistic reformulations significantly affect both the retrieval and generation stages, with relative performance drops of up to 40.41% in Recall@5 for less formal queries and 38.86% in answer match scores for queries containing grammatical errors. RAG systems are more sensitive to such variations than LLM-only generation. Conclusion: The study highlights the need for improved robustness techniques to make RAG systems more reliable across diverse user interactions. Abstract: Despite the impressive performance of Retrieval-augmented Generation (RAG) systems across various NLP benchmarks, their robustness in handling real-world user-LLM interaction queries remains largely underexplored. This presents a critical gap for practical deployment, where user queries exhibit greater linguistic variations and can trigger cascading errors across interdependent RAG components. In this work, we systematically analyze how varying four linguistic dimensions (formality, readability, politeness, and grammatical correctness) impact RAG performance. We evaluate two retrieval models and nine LLMs, ranging from 3 to 72 billion parameters, across four information-seeking Question Answering (QA) datasets. Our results reveal that linguistic reformulations significantly impact both retrieval and generation stages, leading to a relative performance drop of up to 40.41% in Recall@5 scores for less formal queries and 38.86% in answer match scores for queries containing grammatical errors. Notably, RAG systems exhibit greater sensitivity to such variations compared to LLM-only generations, highlighting their vulnerability to error propagation due to linguistic shifts. These findings highlight the need for improved robustness techniques to enhance reliability in diverse user interactions.
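The two quantities reported here, Recall@5 and a relative performance drop, can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `recall_at_k` and `relative_drop`, the document IDs, and the rankings are hypothetical, and the paper may define its metrics with additional details not given in the abstract.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Share of relevant documents that appear among the top-k retrieved ones."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("at least one relevant document is required")
    hits = len(relevant.intersection(ranked_ids[:k]))
    return hits / len(relevant)


def relative_drop(original, perturbed):
    """Relative performance drop in percent, measured against the original score."""
    if original == 0:
        raise ValueError("original score must be non-zero")
    return (original - perturbed) / original * 100.0


# Hypothetical rankings for one query before and after an informal rewrite.
relevant_docs = ["d1", "d4"]
ranking_formal = ["d3", "d1", "d9", "d4", "d7", "d2"]
ranking_informal = ["d3", "d9", "d1", "d7", "d8", "d4"]

recall_formal = recall_at_k(ranking_formal, relevant_docs)      # both relevant docs in top 5
recall_informal = recall_at_k(ranking_informal, relevant_docs)  # only d1 in top 5
drop_percent = relative_drop(recall_formal, recall_informal)
print(recall_formal, recall_informal, drop_percent)
```

A relative drop is computed against the unperturbed score, so a fall from 1.0 to 0.5 is a 50% relative drop even though the absolute difference is 0.5.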

EO-VLM: VLM-Guided Energy Overload Attacks on Vision Models

Minjae Seo,Myoungsung You,Junhee Lee,Jaehan Kim,Hwanjo Heo,Jintae Oh,Jinwoo Kim

Task: Propose a novel energy-overload attack (EO-VLM) that uses a vision language model (VLM) to generate adversarial images that increase the GPU energy consumption of vision models.

Motivation: Vision models deployed in critical applications such as autonomous driving and surveillance are susceptible to resource-consuming attacks that threaten system availability.

Details

Method: Exploits the lack of safety filters in VLMs such as DALL-E 3 to generate adversarial noise images, without requiring prior knowledge or the internal structure of the target model. Result: Experiments show up to a 50% increase in energy consumption, revealing a critical vulnerability in current vision models. Conclusion: EO-VLM is a model-agnostic attack that highlights the energy-related security risks of vision models. Abstract: Vision models are increasingly deployed in critical applications such as autonomous driving and CCTV monitoring, yet they remain susceptible to resource-consuming attacks. In this paper, we introduce a novel energy-overloading attack that leverages vision language model (VLM) prompts to generate adversarial images targeting vision models. These images, though imperceptible to the human eye, significantly increase GPU energy consumption across various vision models, threatening the availability of these systems. Our framework, EO-VLM (Energy Overload via VLM), is model-agnostic, meaning it is not limited by the architecture or type of the target vision model. By exploiting the lack of safety filters in VLMs like DALL-E 3, we create adversarial noise images without requiring prior knowledge or internal structure of the target vision models. Our experiments demonstrate up to a 50% increase in energy consumption, revealing a critical vulnerability in current vision models.

Evaluating the Bias in LLMs for Surveying Opinion and Decision Making in Healthcare

Yonchanok Khaokaew,Flora D. Salim,Andreas Züfle,Hao Xue,Taylor Anderson,Matthew Scotch,David J Heslop

Task: Compare generative agents with real survey data on healthcare decision-making behaviour.

Motivation: Investigate whether generative agents can faithfully simulate human behaviour, particularly in healthcare decision-making.

Details

Method: Creates digital twins of survey respondents through demographic-based prompt engineering and analyses how different LLMs perform. Result: Some LLMs fail to reflect realistic decision-making, while Llama 3 is more accurate but also introduces biases. Conclusion: Generative agents show potential for behavioural research, but the risks of bias from both LLMs and prompting strategies require caution. Abstract: Generative agents have been increasingly used to simulate human behaviour in silico, driven by large language models (LLMs). These simulacra serve as sandboxes for studying human behaviour without compromising privacy or safety. However, it remains unclear whether such agents can truly represent real individuals. This work compares survey data from the Understanding America Study (UAS) on healthcare decision-making with simulated responses from generative agents. Using demographic-based prompt engineering, we create digital twins of survey respondents and analyse how well different LLMs reproduce real-world behaviours. Our findings show that some LLMs fail to reflect realistic decision-making, such as predicting universal vaccine acceptance. However, Llama 3 captures variations across race and income more accurately but also introduces biases not present in the UAS data. This study highlights the potential of generative agents for behavioural research while underscoring the risks of bias from both LLMs and prompting strategies.

RealCam-Vid: High-resolution Video Dataset with Dynamic Scenes and Metric-scale Camera Movements

Guangcong Zheng,Teng Li,Xianpan Zhou,Xi Li

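Task: Present an open-source, high-resolution dynamic-scene dataset for improving camera-controllable video generation.

Motivation: Existing datasets such as RealEstate10K rely on static scenes and relative-scale camera annotations. They cannot capture dynamic scene interactions and lack metric-scale geometric consistency, which limits the realism and precision of video generation.

Details

Method: Introduces RealCam-Vid, the first fully open-source, high-resolution dynamic-scene dataset with metric-scale camera annotations. Result: Provides a dataset that supports dynamic scene interactions and precise camera trajectory synthesis. Conclusion: The dataset fills a gap left by existing resources, enabling more realistic dynamics and geometric consistency for video generation in complex environments. Abstract: Recent advances in camera-controllable video generation have been constrained by the reliance on static-scene datasets with relative-scale camera annotations, such as RealEstate10K. While these datasets enable basic viewpoint control, they fail to capture dynamic scene interactions and lack metric-scale geometric consistency, which is critical for synthesizing realistic object motions and precise camera trajectories in complex environments. To bridge this gap, we introduce the first fully open-source, high-resolution dynamic-scene dataset with metric-scale camera annotations, available at https://github.com/ZGCTroy/RealCam-Vid.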

ELSA: A Style Aligned Dataset for Emotionally Intelligent Language Generation

Vishal Gandhi,Sagar Gandhi

Task: Build ELSA, an emotion and language style alignment dataset, to support emotion-conditioned, style-adaptive text generation.

Motivation: Existing emotion datasets lack either emotional granularity or stylistic diversity, limiting the development of emotion-conditioned text generation systems.

Details

Method: Draws on fine-grained emotion taxonomies (such as the dair-ai emotion dataset and the GoEmotions taxonomy) and uses Large Language Models (LLMs) to generate sentences in multiple emotional and stylistic variants. Result: Metrics including perplexity, embedding variance, readability, lexical diversity, and semantic coherence validate the dataset's emotional authenticity, linguistic fluency, and textual diversity. Conclusion: The ELSA dataset supports research on emotion-conditioned, style-adaptive text generation and on fine-grained emotional control and style-adaptive language generation. Abstract: Advancements in emotion-aware language processing increasingly shape vital NLP applications ranging from conversational AI and affective computing to computational psychology and creative content generation. Existing emotion datasets either lack emotional granularity or fail to capture necessary stylistic diversity, limiting the advancement of effective emotion-conditioned text generation systems. Seeking to bridge this crucial gap between granularity and style diversity, this paper introduces a novel, systematically constructed dataset named ELSA (Emotion and Language Style Alignment Dataset), leveraging fine-grained emotion taxonomies adapted from existing sources such as the dair-ai emotion dataset and the GoEmotions taxonomy. This dataset comprises multiple emotionally nuanced variations of original sentences regenerated across distinct contextual styles such as conversational, formal, poetic, and narrative, using advanced Large Language Models (LLMs). Rigorous computational evaluation using metrics such as perplexity, embedding variance, readability, lexical diversity, and semantic coherence measures validates the dataset's emotional authenticity, linguistic fluency, and textual diversity. Comprehensive metric analyses affirm its potential to support deeper explorations into emotion-conditioned, style-adaptive text generation. By enabling precision-tuned, emotionally nuanced language modeling, our dataset creates fertile ground for research on fine-grained emotional control, prompt-driven explanation, interpretability, and style-adaptive expressive language generation with LLMs.

VL-UR: Vision-Language-guided Universal Restoration of Images Degraded by Adverse Weather Conditions

Ziyan Liu,Yuxu Lu,Huashan Yu,Dong yang

Task: Propose a vision-language-guided universal image restoration framework (VL-UR) to address the limited adaptability of existing methods in diverse and complex real-world environments.

Motivation: Existing image restoration methods are usually designed for specific degradation scenarios and struggle with the non-uniform, complex degradations found in real environments.

Details

Method: Leverages a zero-shot contrastive language-image pre-training (CLIP) model to combine visual and semantic information, and introduces a scene classifier that generates high-quality language embeddings and predicts degradation types. Result: Experiments across eleven degradation settings show that VL-UR achieves state-of-the-art performance, robustness, and adaptability. Conclusion: VL-UR offers a transformative solution for image restoration in dynamic, real-world environments. Abstract: Image restoration is critical for improving the quality of degraded images, which is vital for applications like autonomous driving, security surveillance, and digital content enhancement. However, existing methods are often tailored to specific degradation scenarios, limiting their adaptability to the diverse and complex challenges in real-world environments. Moreover, real-world degradations are typically non-uniform, highlighting the need for adaptive and intelligent solutions. To address these issues, we propose a novel vision-language-guided universal restoration (VL-UR) framework. VL-UR leverages a zero-shot contrastive language-image pre-training (CLIP) model to enhance image restoration by integrating visual and semantic information. A scene classifier is introduced to adapt CLIP, generating high-quality language embeddings aligned with degraded images while predicting degradation types for complex scenarios. Extensive experiments across eleven diverse degradation settings demonstrate VL-UR's state-of-the-art performance, robustness, and adaptability. This positions VL-UR as a transformative solution for modern image restoration challenges in dynamic, real-world environments.

Large language models could be rote learners

Yuyang Xu,Renjun Hu,Haochao Ying,Jian Wu,Xing Shi,Wei Lin

Task: Study how the TrinEval framework can distinguish genuine capability acquisition from superficial memorization when large language models (LLMs) are evaluated with multiple-choice questions (MCQs).

Motivation: Existing MCQ benchmarks are unreliable because of benchmark contamination, so it is necessary to tell whether a model has genuinely learned or merely memorized.

Details

Method: Analyzes model performance under different memorization conditions and proposes TrinEval, a framework that reformulates MCQs into a trinity format to reduce the influence of memorization. Result: Experiments validate TrinEval's effectiveness and indicate that common LLMs may memorize by rote 20.5% of knowledge points in MMLU on average. Conclusion: TrinEval can effectively separate memorization from genuine learning, providing a more reliable method for LLM evaluation. Abstract: Multiple-choice question (MCQ) benchmarks are widely used for evaluating Large Language Models (LLMs), yet their reliability is undermined by benchmark contamination. In this study, we reframe contamination as an inherent aspect of learning and seek to disentangle genuine capability acquisition from superficial memorization in LLM evaluation. First, by analyzing model performance under different memorization conditions, we uncover a counterintuitive trend: LLMs perform worse on memorized MCQs than on non-memorized ones, indicating the coexistence of two distinct learning phenomena, i.e., rote memorization and genuine capability learning. To disentangle them, we propose TrinEval, a novel evaluation framework that reformulates MCQs into an alternative trinity format, reducing memorization while preserving knowledge assessment. Experiments validate TrinEval's effectiveness in reformulation, and its evaluation reveals that common LLMs may memorize by rote 20.5% of knowledge points (in MMLU on average).

F$^3$Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos

Zhaoyu Liu,Kan Jiang,Murong Ma,Zhe Hou,Yun Lin,Jin Song Dong

Task: Present F$^3$Set, a benchmark for precise detection of fast, frequent, and fine-grained (F$^3$) events.

Motivation: Current methods struggle to detect events that satisfy all the F$^3$ criteria with high accuracy, owing to problems such as motion blur and subtle visual discrepancies.

Details

Method: Introduces the F$^3$Set benchmark and proposes a new method, F$^3$ED, for F$^3$ event detection. Result: F$^3$ED achieves superior performance on F$^3$Set, while existing methods face substantial challenges. Conclusion: F$^3$Set provides a new benchmark for video understanding research, and F$^3$ED demonstrates effective event detection. Abstract: Analyzing Fast, Frequent, and Fine-grained (F$^3$) events presents a significant challenge in video analytics and multi-modal LLMs. Current methods struggle to identify events that satisfy all the F$^3$ criteria with high accuracy due to challenges such as motion blur and subtle visual discrepancies. To advance research in video understanding, we introduce F$^3$Set, a benchmark that consists of video datasets for precise F$^3$ event detection. Datasets in F$^3$Set are characterized by their extensive scale and comprehensive detail, usually encompassing over 1,000 event types with precise timestamps and supporting multi-level granularity. Currently, F$^3$Set contains several sports datasets, and this framework may be extended to other applications as well. We evaluated popular temporal action understanding methods on F$^3$Set, revealing substantial challenges for existing techniques. Additionally, we propose a new method, F$^3$ED, for F$^3$ event detections, achieving superior performance. The dataset, model, and benchmark code are available at https://github.com/F3Set/F3Set.

Scholar Inbox: Personalized Paper Recommendations for Scientists

Markus Flicke,Glenn Angrabeit,Madhav Iyengar,Vitalii Protsenko,Illia Shakun,Jovan Cicvaric,Bora Kargi,Haoyu He,Lukas Schuler,Lewin Scholz,Kavyanjali Agnihotri,Yong Cao,Andreas Geiger

Task: Develop Scholar Inbox, an open-access platform that helps researchers cope with the rapid growth of scientific literature.

Motivation: Researchers struggle to keep up with the rapidly growing volume of literature and need personalized tools to streamline their research workflows.

Details

Method: The platform offers personalized recommendations, continuous updates, visual paper summaries, and semantic search. Its recommender is trained on user ratings, and the cold start problem is addressed with a map of science and an active learning strategy. Result: Recommendation quality is evaluated on a novel dataset of 800k user ratings and through an extensive user study. Conclusion: Through personalized tools and open-access features, Scholar Inbox improves researchers' efficiency and their experience of accessing literature. Abstract: Scholar Inbox is a new open-access platform designed to address the challenges researchers face in staying current with the rapidly expanding volume of scientific literature. We provide personalized recommendations, continuous updates from open-access archives (arXiv, bioRxiv, etc.), visual paper summaries, semantic search, and a range of tools to streamline research workflows and promote open research access. The platform's personalized recommendation system is trained on user ratings, ensuring that recommendations are tailored to individual researchers' interests. To further enhance the user experience, Scholar Inbox also offers a map of science that provides an overview of research across domains, enabling users to easily explore specific topics. We use this map to address the cold start problem common in recommender systems, as well as an active learning strategy that iteratively prompts users to rate a selection of papers, allowing the system to learn user preferences quickly. We evaluate the quality of our recommendation system on a novel dataset of 800k user ratings, which we make publicly available, as well as via an extensive user study. https://www.scholar-inbox.com/

Stereophotoclinometry Revisited

Travis Driver,Andrew Vaughan,Yang Cheng,Adnan Ansar,John Christian,Panagiotis Tsiotras

Task: Propose Photoclinometry-from-Motion (PhoMo), a new framework for autonomously estimating surface normals and albedo from images of small celestial bodies.

Motivation: Current methods such as stereophotoclinometry (SPC) rely on human verification and high-fidelity a priori information, which limits autonomy.

Details

Method: Incorporates photoclinometry techniques into a keypoint-based structure-from-motion system, uses dense keypoint measurements and a deep-learning-based matching method, and jointly optimizes multiple parameters with a factor graph. Result: Validated on real imagery from the Dawn mission, showing rendering performance superior to SPC without a priori information or human intervention. Conclusion: PhoMo offers greater autonomy and accuracy for surface and shape characterization of small celestial bodies. Abstract: Image-based surface reconstruction and characterization is crucial for missions to small celestial bodies, as it informs mission planning, navigation, and scientific analysis. However, current state-of-the-practice methods, such as stereophotoclinometry (SPC), rely heavily on human-in-the-loop verification and high-fidelity a priori information. This paper proposes Photoclinometry-from-Motion (PhoMo), a novel framework that incorporates photoclinometry techniques into a keypoint-based structure-from-motion (SfM) system to estimate the surface normal and albedo at detected landmarks to improve autonomous surface and shape characterization of small celestial bodies from in-situ imagery. In contrast to SPC, we forego the expensive maplet estimation step and instead use dense keypoint measurements and correspondences from an autonomous keypoint detection and matching method based on deep learning. Moreover, we develop a factor graph-based approach allowing for simultaneous optimization of the spacecraft's pose, landmark positions, Sun-relative direction, and surface normals and albedos via fusion of Sun vector measurements and image keypoint measurements. The proposed framework is validated on real imagery taken by the Dawn mission to the asteroid 4 Vesta and the minor planet 1 Ceres and compared against an SPC reconstruction, where we demonstrate superior rendering performance compared to an SPC solution and precise alignment to a stereophotogrammetry (SPG) solution without relying on any a priori camera pose and topography information or humans-in-the-loop.

Beyond Self-Reports: Multi-Observer Agents for Personality Assessment in Large Language Models

Yin Jou Huang,Rafik Hadfi

Task: Propose a multi-observer framework for assessing the personality of large language models (LLMs), addressing the limitations of traditional self-report questionnaires.

Motivation: Traditional self-report questionnaires may fail to capture an LLM's behavioral traits accurately because of inherent biases and meta-knowledge contamination.

Details

Method: Employs multiple observer agents that simulate interactive scenarios under different relationship contexts (such as family, friend, or workplace) and assess the subject LLM's Big Five personality dimensions through dialogue and ratings. Result: Experiments show that LLMs exhibit systematic biases in self-report ratings, while aggregating the ratings of 5-7 observers effectively reduces non-systematic biases and improves reliability. Conclusion: The multi-observer framework provides a more robust and context-sensitive assessment of LLM personality, and relationship context has a significant effect on personality perception. Abstract: There is a growing interest in assessing the personality traits of large language models (LLMs). However, traditional personality assessments based on self-report questionnaires may fail to capture their true behavioral nuances due to inherent biases and meta-knowledge contamination. This paper introduces a novel multi-observer framework for LLM personality assessment that draws inspiration from informant-report methods in psychology. Instead of relying solely on self-assessments, our approach employs multiple observer agents configured with a specific relationship context (e.g., family, friend, or workplace) to simulate interactive scenarios with a subject LLM. These observers engage in dialogues and subsequently provide ratings across the Big Five personality dimensions. Our experiments reveal that LLMs possess systematic biases in self-report personality ratings. Moreover, aggregating observer ratings effectively reduces non-systematic biases and achieves optimal reliability with 5-7 observers. The findings highlight the significant impact of relationship context on personality perception and demonstrate that a multi-observer paradigm yields a more robust and context-sensitive evaluation of LLM personality traits.
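The aggregation step can be illustrated with a small sketch. It assumes the simplest possible scheme, an unweighted mean per Big Five trait across observers; the abstract does not state which aggregation the authors use, and the function name `aggregate_ratings`, the 1-5 scale, and the ratings below are hypothetical.

```python
from statistics import mean

# The Big Five personality dimensions named in the abstract.
TRAITS = (
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
)


def aggregate_ratings(observer_ratings):
    """Average each trait across observers (unweighted mean, an assumption)."""
    if not observer_ratings:
        raise ValueError("at least one observer rating is required")
    return {
        trait: mean(rating[trait] for rating in observer_ratings)
        for trait in TRAITS
    }


# Hypothetical ratings from five observers on a 1-5 scale.
ratings = [
    {"openness": 4.0, "conscientiousness": 3.0, "extraversion": 2.0, "agreeableness": 5.0, "neuroticism": 1.0},
    {"openness": 5.0, "conscientiousness": 4.0, "extraversion": 2.0, "agreeableness": 4.0, "neuroticism": 2.0},
    {"openness": 3.0, "conscientiousness": 3.0, "extraversion": 3.0, "agreeableness": 5.0, "neuroticism": 1.0},
    {"openness": 4.0, "conscientiousness": 4.0, "extraversion": 2.0, "agreeableness": 4.0, "neuroticism": 2.0},
    {"openness": 4.0, "conscientiousness": 3.0, "extraversion": 1.0, "agreeableness": 5.0, "neuroticism": 1.0},
]
aggregated = aggregate_ratings(ratings)
print(aggregated)
```

Averaging cancels rating noise that differs from observer to observer, which is the intuition behind the reported reduction in non-systematic bias. It cannot remove a bias that all observers share.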

Knowledge Distillation for Underwater Feature Extraction and Matching via GAN-synthesized Images

Jinghe Yang,Mingming Gong,Ye Pu

Task: Improve the robustness of feature extraction and matching in turbid underwater environments.

Motivation: Image blurring and noise in underwater environments, caused by attenuation, scattering, and the interference of marine snow, pose significant challenges for feature extraction and matching.

Details

Method: Proposes a cross-modal knowledge distillation method that transfers in-air feature extraction models to underwater settings using synthetic underwater images as the medium. It comprises an adaptive GAN-synthesis method and a general knowledge distillation framework. Result: The new components of the GAN-based synthesis (GAN-synthesized noise and forward scattering) prove significant, and applying feature extraction and matching (VSLAM) to real underwater sequences validates the transferred model. Conclusion: The proposed method improves the robustness of feature extraction and matching in underwater environments. Abstract: Autonomous Underwater Vehicles (AUVs) play a crucial role in underwater exploration. Vision-based methods offer cost-effective solutions for localization and mapping in the absence of conventional sensors like GPS and LIDAR. However, underwater environments present significant challenges for feature extraction and matching due to image blurring and noise caused by attenuation, scattering, and the interference of marine snow. In this paper, we aim to improve the robustness of the feature extraction and matching in the turbid underwater environment using the cross-modal knowledge distillation method that transfers the in-air feature extraction models to underwater settings using synthetic underwater images as the medium. We first propose a novel adaptive GAN-synthesis method to estimate water parameters and underwater noise distribution, to generate environment-specific synthetic underwater images. We then introduce a general knowledge distillation framework compatible with different teacher models. The evaluation of GAN-based synthesis highlights the significance of the new components, i.e. GAN-synthesized noise and forward scattering, in the proposed model. Additionally, the downstream application of feature extraction and matching (VSLAM) on real underwater sequences validates the effectiveness of the transferred model.

Integrated ensemble of BERT- and features-based models for authorship attribution in Japanese literary works

Taisei Kanda,Mingzhe Jin,Wataru Zaitsu

Task: Improve performance on small-sample authorship attribution (AA) by integrating traditional feature-based methods with modern pre-trained language model (PLM) methods.

Motivation: Although PLMs perform well on short-text classification, their effectiveness on small-sample AA tasks remains under-explored, and combining them with traditional feature-based methods is still a challenge.

Details

Method: Uses an integrated ensemble that combines traditional feature-based models with PLMs such as BERT, with experiments on two literary corpora of 10 authors each. Result: BERT is effective on small-sample AA tasks, and the integrated ensemble significantly improves performance, raising the F1 score by approximately 14 points. Conclusion: The ensemble approach offers a viable way to make efficient use of a growing variety of data processing tools. Abstract: Traditionally, authorship attribution (AA) tasks relied on statistical data analysis and classification based on stylistic features extracted from texts. In recent years, pre-trained language models (PLMs) have attracted significant attention in text classification tasks. However, although they demonstrate excellent performance on large-scale short-text datasets, their effectiveness remains under-explored for small samples, particularly in AA tasks. Additionally, a key challenge is how to effectively leverage PLMs in conjunction with traditional feature-based methods to advance AA research. In this study, we aimed to significantly improve performance using an integrated ensemble of traditional feature-based and modern PLM-based methods on an AA task in a small sample. For the experiment, we used two corpora of literary works to classify 10 authors each. The results indicate that BERT is effective, even for small-sample AA tasks. Both BERT-based and classifier ensembles outperformed their respective stand-alone models, and the integrated ensemble approach further improved the scores significantly. For the corpus that was not included in the pre-training data, the integrated ensemble improved the F1 score by approximately 14 points, compared to the best-performing single model. Our methodology provides a viable solution for the efficient use of the ever-expanding array of data processing tools in the foreseeable future.

CoProSketch: Controllable and Progressive Sketch Generation with Diffusion Model

Ruohao Zhan,Yijin Li,Yisheng He,Shuo Chen,Yichen Shen,Xinyu Chen,Zilong Dong,Zhaoyang Huang,Guofeng Zhang

Task: Propose CoProSketch, a new framework for generating sketches with prominent controllability and detail using diffusion models.

Motivation: Sketches are fundamental blueprints in artistic creation, yet sketch generation remains underexplored in generative modeling, and diffusion models perform poorly when asked to generate binarized sketches directly.

Details

Method: Represents sketches as an unsigned distance field (UDF) that a lightweight network decodes into sketches. Users generate a sketch from a bounding box and a text prompt, and can edit and refine it iteratively. Result: Experiments show that CoProSketch surpasses baselines in semantic consistency and controllability and offers a practical way to integrate user feedback. Conclusion: CoProSketch provides an efficient and controllable solution for sketch generation, filling a gap in generative models for sketches. Abstract: Sketches serve as fundamental blueprints in artistic creation because sketch editing is easier and more intuitive than pixel-level RGB image editing for painting artists, yet sketch generation remains unexplored despite advancements in generative models. We propose a novel framework CoProSketch, providing prominent controllability and details for sketch generation with diffusion models. A straightforward method is fine-tuning a pretrained image generation diffusion model with binarized sketch images. However, we find that the diffusion models fail to generate clear binary images, which makes the produced sketches chaotic. We thus propose to represent the sketches by unsigned distance field (UDF), which is continuous and can be easily decoded to sketches through a lightweight network. With CoProSketch, users generate a rough sketch from a bounding box and a text prompt. The rough sketch can be manually edited and fed back into the model for iterative refinement and will be decoded to a detailed sketch as the final result. Additionally, we curate the first large-scale text-sketch paired dataset as the training data. Experiments demonstrate superior semantic consistency and controllability over baselines, offering a practical solution for integrating user feedback into generative workflows.
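To make the UDF representation concrete, the sketch below computes an unsigned distance field for a tiny binary sketch by brute force: each pixel stores its Euclidean distance to the nearest stroke pixel. This only illustrates the representation. The threshold-based decoder is a stand-in assumption for demonstration, whereas the paper decodes UDFs with a lightweight network, and the function names and the 3x3 grid are hypothetical.

```python
import math


def unsigned_distance_field(sketch):
    """Distance from every pixel to the nearest stroke pixel (value 1).

    Brute force over all stroke pixels, which is fine for tiny grids.
    """
    strokes = [
        (r, c)
        for r, row in enumerate(sketch)
        for c, value in enumerate(row)
        if value
    ]
    if not strokes:
        raise ValueError("sketch contains no stroke pixels")
    return [
        [min(math.hypot(r - sr, c - sc) for sr, sc in strokes) for c in range(len(row))]
        for r, row in enumerate(sketch)
    ]


def decode_by_threshold(udf, threshold=0.5):
    """Stand-in decoder (assumption): pixels close to a stroke become ink."""
    return [[1 if distance <= threshold else 0 for distance in row] for row in udf]


# Tiny binary sketch: a short vertical stroke in the middle column.
toy_sketch = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
toy_udf = unsigned_distance_field(toy_sketch)
toy_decoded = decode_by_threshold(toy_udf)
for row in toy_udf:
    print([round(distance, 3) for distance in row])
```

Unlike the binary image, the UDF varies smoothly away from each stroke, which is the continuity property the abstract points to as the reason for preferring it over binarized targets.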

On The Landscape of Spoken Language Models: A Comprehensive Survey

Siddhant Arora,Kai-Wei Chang,Chung-Ming Chien,Yifan Peng,Haibin Wu,Yossi Adi,Emmanuel Dupoux,Hung-Yi Lee,Karen Livescu,Shinji Watanabe

Task: Build a unified understanding of the development and applications of spoken language models (SLMs) through a literature survey.

Motivation: Spoken language processing is shifting from task-specific models to universal SLMs, but the research uses varied terminology and evaluation settings and lacks a unified view.

Details

Method: Surveys recent work and categorizes it by model architecture, training, and evaluation choices. Result: Summarizes the categories of SLMs, key challenges, and directions for future research. Conclusion: SLMs are an important trend in spoken language processing, but issues of diversity and standardization remain to be addressed. Abstract: The field of spoken language processing is undergoing a shift from training custom-built, task-specific models toward using and optimizing spoken language models (SLMs) which act as universal speech processing systems. This trend is similar to the progression toward universal language models that has taken place in the field of (text) natural language processing. SLMs include both "pure" language models of speech -- models of the distribution of tokenized speech sequences -- and models that combine speech encoders with text language models, often including both spoken and written input or output. Work in this area is very diverse, with a range of terminology and evaluation settings. This paper aims to contribute an improved understanding of SLMs via a unifying literature survey of recent work in the context of the evolution of the field. Our survey categorizes the work in this area by model architecture, training, and evaluation choices, and describes some key challenges and directions for future work.

VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering

Qi Zhi Lim,Chin Poo Lee,Kian Ming Lim,Kalaiarasi Sonai Muthu Anbananthen

Task: Propose Vision-Language Multimodal Transformer (VLMT), a unified architecture for cross-modal reasoning in Multimodal Multi-hop Question Answering (MMQA).

Motivation: Existing MMQA methods suffer from limited reasoning capabilities, reliance on modality conversion, and inadequate alignment between visual and textual representations.

Details

Method: VLMT combines a transformer-based vision encoder with a sequence-to-sequence language model, fuses visual and textual inputs through a direct token-level injection mechanism, and uses a three-stage pretraining strategy to strengthen cross-modal alignment and reasoning. Result: VLMT-Large achieves 76.5% Exact Match on MultimodalQA and a QA score of 47.6 on WebQA, clearly surpassing the previous best models. Conclusion: VLMT shows strong multimodal reasoning and has the potential to advance real-world information retrieval and question answering systems. Abstract: The increasing availability of multimodal data across text, tables, and images presents new challenges for developing models capable of complex cross-modal reasoning. Existing methods for Multimodal Multi-hop Question Answering (MMQA) often suffer from limited reasoning capabilities, reliance on modality conversion, and inadequate alignment between visual and textual representations. To address these limitations, this paper introduces Vision-Language Multimodal Transformer (VLMT), a unified architecture that integrates a transformer-based vision encoder with a sequence-to-sequence language model. VLMT employs a direct token-level injection mechanism to fuse visual and textual inputs within a shared embedding space, eliminating the need for intermediate projection layers. To enhance cross-modal alignment and reasoning, a three-stage pretraining strategy is proposed to progressively align vision-language representations and improve the model's capacity for multimodal understanding. Based on the pretrained backbone, two task-specific modules are instantiated to form a two-stage MMQA framework: a multimodal reranker that predicts document relevance scores and utilizes a relative threshold with top-k strategy for context retrieval, and a multimodal question answering model that generates contextually grounded answers based on the retrieved evidence. Comprehensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed approach. On MultimodalQA validation set, VLMT-Large achieves 76.5% Exact Match and 80.1% F1, outperforming the previous state-of-the-art by +9.1% in Exact Match and +8.8% in F1. On WebQA, it attains a QA score of 47.6, surpassing prior models such as PERQA by +3.2. These results highlight VLMT's strong capabilities in multimodal reasoning and its potential to advance real-world information retrieval and question answering systems.
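One plausible reading of "a relative threshold with top-k strategy" for context retrieval is sketched below: keep documents whose relevance score is at least a fixed fraction of the best score, then cap the selection at k. This interpretation, the function name `select_context`, the parameters `ratio` and `k`, and the scores are all assumptions for illustration. The abstract does not specify the exact rule, so this should not be read as the authors' implementation.

```python
def select_context(scores, ratio=0.8, k=2):
    """Select context documents from reranker scores.

    scores: dict mapping document id to a non-negative relevance score.
    ratio:  keep documents scoring at least ratio * best score (assumption).
    k:      maximum number of documents to keep.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    cutoff = ratio * ranked[0][1]
    kept = [doc_id for doc_id, score in ranked if score >= cutoff]
    return kept[:k]


# Hypothetical reranker scores for four candidate documents.
toy_scores = {"doc_a": 0.92, "doc_b": 0.80, "doc_c": 0.75, "doc_d": 0.31}
selected_top2 = select_context(toy_scores, ratio=0.8, k=2)
selected_top5 = select_context(toy_scores, ratio=0.8, k=5)
print(selected_top2, selected_top5)
```

Under this reading, the relative threshold adapts the amount of context to how confident the reranker is, while the top-k cap bounds the input length passed to the answering model.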

Lexical Bundle Frequency as a Construct-Relevant Candidate Feature in Automated Scoring of L2 Academic Writing

Burak Senel

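Task: Investigate whether integrating lexical bundle (LB) frequency features into an automated scoring (AS) model improves scoring of TOEFL independent writing tasks.

Motivation: Although lexical bundles have been suggested as useful for assessing L2 writing, their empirical integration into automated scoring models still needs further investigation.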

Details

Method: From the TOEFL11 corpus, 1,225 essays (9 L1 backgrounds) were sampled, and 3- to 9-word lexical bundles were extracted and split into prompt-specific and non-prompt types. A baseline Support Vector Machine (SVM) scoring model using only established linguistic features was compared against an extended model that adds lexical bundle frequency features. Result: Lexical bundle frequency (especially of non-prompt bundles) showed a significant but small-effect relationship with proficiency. The extended model significantly improved agreement with human raters (Cohen's Kappa +5.63%). Conclusion: Integrating lexical bundle frequency features helps build more accurate and more linguistically informed automated scoring systems, particularly for differentiating L2 writing proficiency levels. Abstract: Automated scoring (AS) systems are increasingly used for evaluating L2 writing, but require ongoing refinement for construct validity. While prior work suggested lexical bundles (LBs) - recurrent multi-word sequences satisfying certain frequency criteria - could inform assessment, their empirical integration into AS models needs further investigation. This study tested the impact of incorporating LB frequency features into an AS model for TOEFL independent writing tasks. Analyzing a sampled subcorpus (N=1,225 essays, 9 L1s) from the TOEFL11 corpus, scored by ETS-trained raters (Low, Medium, High), 3- to 9-word LBs were extracted, distinguishing prompt-specific from non-prompt types. A baseline Support Vector Machine (SVM) scoring model using established linguistic features (e.g., mechanics, cohesion, sophistication) was compared against an extended model including three aggregate LB frequency features (total prompt, total non-prompt, overall total). Results revealed significant, though generally small-effect, relationships between LB frequency (especially non-prompt bundles) and proficiency (p < .05). Mean frequencies suggested lower proficiency essays used more LBs overall. Critically, the LB-enhanced model improved agreement with human raters (Quadratic Cohen's Kappa +2.05%, overall Cohen's Kappa +5.63%), with notable gains for low (+10.1% exact agreement) and medium (+14.3% Cohen's Kappa) proficiency essays. These findings demonstrate that integrating aggregate LB frequency offers potential for developing more linguistically informed and accurate AS systems, particularly for differentiating developing L2 writers.
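The agreement figures above are reported as plain and quadratic-weighted Cohen's Kappa. As a reminder of how the two differ on an ordered scale such as Low/Medium/High, here is a minimal pure-Python sketch; the function name `cohen_kappa` and the toy ratings are illustrative and not taken from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b, labels, weights=None):
    """Cohen's kappa for two raters over ordered `labels`.

    weights=None gives unweighted kappa; weights="quadratic" penalises a
    disagreement by the squared distance between the two label ranks.
    """
    if len(rater_a) != len(rater_b) or not rater_a or len(labels) < 2:
        raise ValueError("need equal-length, non-empty ratings and >= 2 labels")
    index = {label: i for i, label in enumerate(labels)}
    k, n = len(labels), len(rater_a)

    # Observed joint counts and each rater's marginal counts.
    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    def penalty(i, j):
        if weights == "quadratic":
            return ((i - j) ** 2) / ((k - 1) ** 2)
        return 0.0 if i == j else 1.0

    observed_disagreement = sum(
        penalty(i, j) * count / n for (i, j), count in observed.items()
    )
    # Disagreement expected if the two raters were independent.
    expected_disagreement = sum(
        penalty(i, j) * marg_a[i] * marg_b[j] / (n * n)
        for i in range(k)
        for j in range(k)
    )
    if expected_disagreement == 0:
        return 1.0  # both raters used one and the same label throughout
    return 1.0 - observed_disagreement / expected_disagreement


if __name__ == "__main__":
    levels = ["Low", "Medium", "High"]
    human = ["Low", "Medium", "High", "Medium"]
    model = ["Low", "Medium", "High", "High"]
    print(cohen_kappa(human, model, levels))
    print(cohen_kappa(human, model, levels, weights="quadratic"))
```

Quadratic weighting treats a Medium/High confusion as far less serious than a Low/High one, which is why the two statistics can move by different amounts when a feature set changes.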

Palmprint De-Identification Using Diffusion Model for High-Quality and Diverse Synthesis

Licheng Yan,Bob Zhang,Andrew Beng Jin Teoh,Lu Leng,Shuyi Li,Yuqi Wang,Ziyuan Yang

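Task: Develop a palmprint de-identification technique based on pre-trained diffusion models that conceals identity features while preserving the image's non-sensitive information.

Motivation: Although palmprint recognition has advanced considerably, publicly available palmprint images can be exploited maliciously, and methods for anonymizing palmprints remain under-researched.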

Details

Method: Proposes a training-free framework that uses pre-trained diffusion models to generate diverse, high-quality palmprint images, combining semantic-guided embedding fusion with a prior interpolation mechanism to improve stability and controllability. Result: Experiments show that the method effectively conceals identity features while maintaining high visual fidelity and usability. Conclusion: The method balances de-identification against retaining non-identity information, offering an effective solution for palmprint de-identification. Abstract: Palmprint recognition techniques have advanced significantly in recent years, enabling reliable recognition even when palmprints are captured in uncontrolled or challenging environments. However, this strength also introduces new risks, as publicly available palmprint images can be misused by adversaries for malicious activities. Despite this growing concern, research on methods to obscure or anonymize palmprints remains largely unexplored. Thus, it is essential to develop a palmprint de-identification technique capable of removing identity-revealing features while retaining the image's utility and preserving non-sensitive information. In this paper, we propose a training-free framework that utilizes pre-trained diffusion models to generate diverse, high-quality palmprint images that conceal identity features for de-identification purposes. To ensure greater stability and controllability in the synthesis process, we incorporate a semantic-guided embedding fusion alongside a prior interpolation mechanism. We further propose the de-identification ratio, a novel metric for intuitive de-identification assessment. Extensive experiments across multiple palmprint datasets and recognition methods demonstrate that our method effectively conceals identity-related traits with significant diversity across de-identified samples. The de-identified samples preserve high visual fidelity and maintain excellent usability, achieving a balance between de-identification and retaining non-identity information.

UoB-NLP at SemEval-2025 Task 11: Leveraging Adapters for Multilingual and Cross-Lingual Emotion Detection

Frances Laureano De Leon,Yixiao Wang,Yue Feng,Mark G. Lee

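Task: Address multilingual and cross-lingual emotion detection with adapter-based fine-tuning.

Motivation: Emotion detection in low-resource languages remains underexplored, whereas high-resource languages have seen significant progress.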

Details

Method: Uses multilingual pre-trained language models with adapter fine-tuning strategies, including task-only adapters, target-language-ready task adapters, and language-family-based adapters. Result: Target-language-ready task adapters perform best, particularly for low-resource African languages, and the team ranked highly in several languages. Conclusion: Adapter-based methods are more parameter-efficient than full fine-tuning while retaining cross-lingual transfer, making them suitable for emotion detection in low-resource languages. Abstract: Emotion detection in natural language processing is a challenging task due to the complexity of human emotions and linguistic diversity. While significant progress has been made in high-resource languages, emotion detection in low-resource languages remains underexplored. In this work, we address multilingual and cross-lingual emotion detection by leveraging adapter-based fine-tuning with multilingual pre-trained language models. Adapters introduce a small number of trainable parameters while keeping the pre-trained model weights fixed, offering a parameter-efficient approach to adaptation. We experiment with different adapter tuning strategies, including task-only adapters, target-language-ready task adapters, and language-family-based adapters. Our results show that target-language-ready task adapters achieve the best overall performance, particularly for low-resource African languages with our team ranking 7th for Tigrinya, and 8th for Kinyarwanda in Track A. In Track C, our system ranked 3rd for Amharic, and 4th for Oromo, Tigrinya, Kinyarwanda, Hausa, and Igbo. Our approach outperforms large language models in 11 languages and matches their performance in four others, despite our models having significantly fewer parameters. Furthermore, we find that adapter-based models retain cross-linguistic transfer capabilities while requiring fewer computational resources compared to full fine-tuning for each language.

PNE-SGAN: Probabilistic NDT-Enhanced Semantic Graph Attention Network for LiDAR Loop Closure Detection

Xiong Li,Shulei Liu,Xingning Chen,Yisong Wu,Dong Zhu

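Task: Propose PNE-SGAN to address the robustness and accuracy challenges of LiDAR loop closure detection (LCD).

Motivation: Existing methods, including semantic graph approaches, suffer from coarse geometric representations and lack temporal robustness against noise, dynamics, and viewpoint changes.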

Details

Method: PNE-SGAN uses NDT covariance matrices as geometric node features, processes them with a Graph Attention Network (GAT), and integrates graph similarity scores into a probabilistic temporal filtering framework (HMM/Bayes filter). Result: On the KITTI dataset (sequences 00 and 08) it achieves Average Precision of 96.2% and 95.1%, respectively, significantly outperforming existing methods. Conclusion: By combining NDT geometric detail with probabilistic temporal reasoning, PNE-SGAN offers a highly accurate and robust solution for LiDAR LCD and improves SLAM reliability in complex, large-scale environments. Abstract: LiDAR loop closure detection (LCD) is crucial for consistent Simultaneous Localization and Mapping (SLAM) but faces challenges in robustness and accuracy. Existing methods, including semantic graph approaches, often suffer from coarse geometric representations and lack temporal robustness against noise, dynamics, and viewpoint changes. We introduce PNE-SGAN, a Probabilistic NDT-Enhanced Semantic Graph Attention Network, to overcome these limitations. PNE-SGAN enhances semantic graphs by using Normal Distributions Transform (NDT) covariance matrices as rich, discriminative geometric node features, processed via a Graph Attention Network (GAT). Crucially, it integrates graph similarity scores into a probabilistic temporal filtering framework (modeled as an HMM/Bayes filter), incorporating uncertain odometry for motion modeling and utilizing forward-backward smoothing to effectively handle ambiguities. Evaluations on challenging KITTI sequences (00 and 08) demonstrate state-of-the-art performance, achieving Average Precision of 96.2% and 95.1%, respectively. PNE-SGAN significantly outperforms existing methods, particularly in difficult bidirectional loop scenarios where others falter. By synergizing detailed NDT geometry with principled probabilistic temporal reasoning, PNE-SGAN offers a highly accurate and robust solution for LiDAR LCD, enhancing SLAM reliability in complex, large-scale environments.
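To make the "HMM/Bayes filter" idea concrete, the sketch below shows one predict-update step of a generic discrete Bayes filter, with similarity scores standing in as observation likelihoods. It is a textbook forward step under assumptions made here (a fixed transition matrix, scores used directly as likelihoods, the hypothetical name `bayes_filter_step`); it is not PNE-SGAN's actual motion model and it omits the forward-backward smoothing the abstract mentions.

```python
def bayes_filter_step(belief, transition, likelihood):
    """One predict-update step of a discrete Bayes (HMM forward) filter.

    belief[i]        prior probability of state i (e.g. "loop with place i")
    transition[i][j] probability of moving from state i to state j
    likelihood[j]    non-negative score of the current observation under j
    """
    n = len(belief)
    # Predict: push the belief through the motion (transition) model.
    predicted = [
        sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)
    ]
    # Update: weight each state by how well it explains the observation.
    unnormalised = [predicted[j] * likelihood[j] for j in range(n)]
    total = sum(unnormalised)
    if total == 0:
        return predicted  # observation excludes every state; keep prediction
    return [p / total for p in unnormalised]


if __name__ == "__main__":
    belief = [0.5, 0.5]                      # [no loop, loop]
    transition = [[0.9, 0.1], [0.1, 0.9]]    # states tend to persist
    for score in (0.8, 0.7, 0.9):            # hypothetical similarity scores
        belief = bayes_filter_step(belief, transition, [1.0 - score, score])
        print(belief)
```

Filtering over time lets a single noisy similarity score be outvoted by its neighbours, which is the kind of temporal robustness the abstract attributes to the probabilistic framework.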

Playpen: An Environment for Exploring Learning Through Conversational Interaction

Nicola Horst,Davide Mazzaccara,Antonia Schmidt,Michael Sullivan,Filippo Momentè,Luca Franceschetti,Philipp Sadler,Sherzod Hakimov,Alberto Testoni,Raffaella Bernardi,Raquel Fernández,Alexander Koller,Oliver Lemon,David Schlangen,Mario Giulianelli,Alessandro Suglia

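Task: Explore whether Dialogue Games can serve as a learning signal and how that signal can be used.

Motivation: The learning signal from text prediction may be nearing its limit, so new sources of learning signal, such as conversational interaction, need to be explored.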

Details

Method: Introduces an environment that produces dialogue interaction data (using a Large Language Model as the counterpart to the learner model) and studies the effects of supervised fine-tuning and of reinforcement learning setups such as DPO and GRPO. Result: All approaches improve on in-domain games, but only GRPO generalizes to out-of-domain games while staying competitive on reference-based tasks. Conclusion: Dialogue Games are a promising new source of learning signal, and GRPO generalizes best. Abstract: Are we running out of learning signal? Predicting the next word in an existing text has turned out to be a powerful signal, at least at scale. But there are signs that we are running out of this resource. In recent months, interaction between learner and feedback-giver has come into focus, both for "alignment" (with a reward model judging the quality of instruction following attempts) and for improving "reasoning" (process- and outcome-based verifiers judging reasoning steps). In this paper, we explore to what extent synthetic interaction in what we call Dialogue Games -- goal-directed and rule-governed activities driven predominantly by verbal actions -- can provide a learning signal, and how this signal can be used. We introduce an environment for producing such interaction data (with the help of a Large Language Model as counterpart to the learner model), both offline and online. We investigate the effects of supervised fine-tuning on this data, as well as reinforcement learning setups such as DPO and GRPO, showing that all of these approaches achieve some improvements in in-domain games, but only GRPO demonstrates the ability to generalise to out-of-domain games as well as retain competitive performance in reference-based tasks. We release the framework and the baseline training setups in the hope that this can foster research in this promising new direction.

DreamFuse: Adaptive Image Fusion with Diffusion Transformer

Junjia Huang,Pengxiang Yan,Jiyang Liu,Jie Wu,Zhao Wang,Yitong Wang,Liang Lin,Guanbin Li

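Task: Propose DreamFuse, a method based on the Diffusion Transformer (DiT) model, for harmonious fusion of foreground and background.

Motivation: Existing methods insert objects directly into the background without adaptivity or interaction, which yields unsatisfactory fusion results.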

Details

Method: Uses an iterative human-in-the-loop data generation pipeline together with a DiT model and a Positional Affine mechanism to enable effective foreground-background interaction. Result: Experiments show that DreamFuse outperforms existing methods across multiple metrics. Conclusion: DreamFuse generates consistent, harmonious fused images and supports text-driven attribute editing. Abstract: Image fusion seeks to seamlessly integrate foreground objects with background scenes, producing realistic and harmonious fused images. Unlike existing methods that directly insert objects into the background, adaptive and interactive fusion remains a challenging yet appealing task. It requires the foreground to adjust or interact with the background context, enabling more coherent integration. To address this, we propose an iterative human-in-the-loop data generation pipeline, which leverages limited initial data with diverse textual prompts to generate fusion datasets across various scenarios and interactions, including placement, holding, wearing, and style transfer. Building on this, we introduce DreamFuse, a novel approach based on the Diffusion Transformer (DiT) model, to generate consistent and harmonious fused images with both foreground and background information. DreamFuse employs a Positional Affine mechanism to inject the size and position of the foreground into the background, enabling effective foreground-background interaction through shared attention. Furthermore, we apply Localized Direct Preference Optimization guided by human feedback to refine DreamFuse, enhancing background consistency and foreground harmony. DreamFuse achieves harmonious fusion while generalizing to text-driven attribute editing of the fused results. Experimental results demonstrate that our method outperforms state-of-the-art approaches across multiple metrics.

MedHal: An Evaluation Dataset for Medical Hallucination Detection

Gaya Mehenni,Amal Zouaq

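Task: Evaluate whether models can detect hallucinations in medical texts.

Motivation: Current hallucination detection methods have significant limitations in specialized domains such as medicine, where errors can have serious consequences, and existing medical datasets are either small or limited to a single task.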

Details

Method: Proposes the MedHal dataset, which integrates diverse medical text sources and tasks and provides a large volume of annotated samples along with explanations for factual inconsistencies. Result: A baseline model trained and evaluated on MedHal improves over general-purpose hallucination detection approaches. Conclusion: MedHal enables more efficient evaluation of medical text generation systems, reduces reliance on expert review, and supports progress in medical AI research. Abstract: We present MedHal, a novel large-scale dataset specifically designed to evaluate if models can detect hallucinations in medical texts. Current hallucination detection methods face significant limitations when applied to specialized domains like medicine, where they can have disastrous consequences. Existing medical datasets are either too small, containing only a few hundred samples, or focus on a single task like Question Answering or Natural Language Inference. MedHal addresses these gaps by: (1) incorporating diverse medical text sources and tasks; (2) providing a substantial volume of annotated samples suitable for training medical hallucination detection models; and (3) including explanations for factual inconsistencies to guide model learning. We demonstrate MedHal's utility by training and evaluating a baseline medical hallucination detection model, showing improvements over general-purpose hallucination detection approaches. This resource enables more efficient evaluation of medical text generation systems while reducing reliance on costly expert review, potentially accelerating the development of medical AI research.

Generative AI for Film Creation: A Survey of Recent Advances

Ruihan Zhang,Borou Yu,Jiajian Min,Yetong Xin,Zheng Wei,Juncheng Nemo Shi,Mingzhen Huang,Xianghao Kong,Nix Liu Xin,Shanshan Jiang,Praagya Bahuguna,Mark Chan,Khushi Hora,Lijian Yang,Yongqi Liang,Runhe Bian,Yunlei Liu,Isabela Campillo Valencia,Patricia Morales Tredinick,Ilia Kozlov,Sijia Jiang,Peiwen Huang,Na Chen,Xuanxuan Liu,Anyi Rao

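Task: Study how generative AI (GenAI) is applied in filmmaking and how it contributes to character creation, aesthetic styling, and narration.

Motivation: Examine how GenAI technologies are changing filmmaking workflows and analyze key strategies for character consistency, stylistic coherence, and motion continuity.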

Details

Method: Analyzes workflows from recent AI-driven films to study how GenAI technologies are applied, and gathers artists' feedback on technical challenges and desired improvements. Result: Reveals GenAI's technical advances and potential for artistic expression in filmmaking, including emerging trends such as 3D generation and the integration of real footage with AI-generated elements. Conclusion: The study offers insights into, and a roadmap for, the rapidly evolving intersection of AI and filmmaking, providing guidance for researchers and artists. Abstract: Generative AI (GenAI) is transforming filmmaking, equipping artists with tools like text-to-image and image-to-video diffusion, neural radiance fields, avatar generation, and 3D synthesis. This paper examines the adoption of these technologies in filmmaking, analyzing workflows from recent AI-driven films to understand how GenAI contributes to character creation, aesthetic styling, and narration. We explore key strategies for maintaining character consistency, achieving stylistic coherence, and ensuring motion continuity. Additionally, we highlight emerging trends such as the growing use of 3D generation and the integration of real footage with AI-generated elements. Beyond technical advancements, we examine how GenAI is enabling new artistic expressions, from generating hard-to-shoot footage to dreamlike diffusion-based morphing effects, abstract visuals, and unworldly objects. We also gather artists' feedback on challenges and desired improvements, including consistency, controllability, fine-grained editing, and motion refinement. Our study provides insights into the evolving intersection of AI and filmmaking, offering a roadmap for researchers and artists navigating this rapidly expanding field.

A Survey of Machine Learning Models and Datasets for the Multi-label Classification of Textual Hate Speech in English

Julian Bäumler,Louis Blöcher,Lars-Joel Frey,Xian Chen,Markus Bayer,Christian Reuter

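Task: Systematically survey the state of research on multi-label hate speech classification.

Motivation: The spread of online hate speech has serious negative effects on individuals, communities, and societies; most existing research treats classification as a binary task, whereas practice calls for finer-grained multi-label classification.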

Details

Method: Systematically surveys 46 publications on English-language hate speech, analyzing 28 datasets and 24 publications proposing classification models. Result: Finds significant heterogeneity among datasets in label set, size, meta-concept, and other aspects, along with inconsistent model evaluation and a preference for BERT- and RNN-based architectures. Conclusion: Formulates ten recommendations for research and identifies critical open issues such as imbalanced training data and reliance on crowdsourcing platforms. Abstract: The dissemination of online hate speech can have serious negative consequences for individuals, online communities, and entire societies. This and the large volume of hateful online content prompted both practitioners', i.e., in content moderation or law enforcement, and researchers' interest in machine learning models to automatically classify instances of hate speech. Whereas most scientific works address hate speech classification as a binary task, practice often requires a differentiation into sub-types, e.g., according to target, severity, or legality, which may overlap for individual content. Hence, researchers created datasets and machine learning models that approach hate speech classification in textual data as a multi-label problem. This work presents the first systematic and comprehensive survey of scientific literature on this emerging research landscape in English (N=46). We contribute with a concise overview of 28 datasets suited for training multi-label classification models that reveals significant heterogeneity regarding label-set, size, meta-concept, annotation process, and inter-annotator agreement. Our analysis of 24 publications proposing suitable classification models further establishes inconsistency in evaluation and a preference for architectures based on Bidirectional Encoder Representation from Transformers (BERT) and Recurrent Neural Networks (RNNs). We identify imbalanced training data, reliance on crowdsourcing platforms, small and sparse datasets, and missing methodological alignment as critical open issues and formulate ten recommendations for research.

STSeg-Complex Video Object Segmentation: The 1st Solution for 4th PVUW MOSE Challenge

Kehuan Song,Xinglin Xie,Kexin Zhang,Licheng Jiao,Lingling Li,Shuyuan Yang

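Task: Segment video objects in complex scenes.

Motivation: The MOSE dataset has advanced the field, but handling complex object motion and long video sequences remains challenging.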

Details

Method: Fine-tunes SAM2 and the unsupervised model TMO, and adopts an Adaptive Pseudo-labels Guided Model Refinement Pipeline. Result: Achieves a J&F score of 87.26% on the test set of the MOSE Track of the 2025 PVUW Challenge, ranking first. Conclusion: The STSeg solution advances the state of the art in video object segmentation in complex scenes. Abstract: Segmentation of video objects in complex scenarios is highly challenging, and the MOSE dataset has significantly contributed to the development of this field. This technical report details the STSeg solution proposed by the "imaplus" team. By finetuning SAM2 and the unsupervised model TMO on the MOSE dataset, the STSeg solution demonstrates remarkable advantages in handling complex object motions and long-video sequences. In the inference phase, an Adaptive Pseudo-labels Guided Model Refinement Pipeline is adopted to intelligently select appropriate models for processing each video. Through finetuning the models and employing the Adaptive Pseudo-labels Guided Model Refinement Pipeline in the inference phase, the STSeg solution achieved a J&F score of 87.26% on the test set of the 2025 4th PVUW Challenge MOSE Track, securing the 1st place and advancing the technology for video object segmentation in complex scenarios.

Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning

Fangzhi Xu,Hang Yan,Chang Ma,Haiteng Zhao,Qiushi Sun,Kanzhi Cheng,Junxian He,Jun Liu,Zhiyong Wu

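Task: Propose Genius, a self-training framework that needs no external supervision, to improve the reasoning ability of large language models (LLMs).

Motivation: Current post-training techniques rely on supervisory signals and face scalability problems and high annotation costs, so a method that needs no external supervision is required.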

Details

Method: Genius uses a stepwise foresight re-sampling strategy and an advantage-calibrated optimization (ACO) loss function to optimize the LLM without supervision. Result: Genius offers an initial solution for improving LLM reasoning without supervision and shows potential with general queries. Conclusion: Genius opens a new direction for unsupervised improvement of LLM reasoning, with broad applicability and scalability. Abstract: Advancing LLM reasoning skills has captivated wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which face the problem of scalability and high annotation costs. This motivates us to enhance LLM reasoning without the need for external supervision. We introduce a generalizable and purely unsupervised self-training framework, named Genius. Without external auxiliary signals, Genius must seek the optimal response sequence in a stepwise manner and optimize the LLM. To explore the potential steps and exploit the optimal ones, Genius introduces a stepwise foresight re-sampling strategy to sample and estimate the step value by simulating future outcomes. Further, we recognize that the unsupervised setting inevitably induces intrinsic noise and uncertainty. To provide a robust optimization, we propose an advantage-calibrated optimization (ACO) loss function to mitigate estimation inconsistencies. Combining these techniques together, Genius provides an advanced initial step towards self-improving LLM reasoning with general queries and without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries. The code will be released at https://github.com/xufangzhi/Genius.

DSM: Building A Diverse Semantic Map for 3D Visual Grounding

Qinghongbing Xie,Zijian Liang,Long Zeng

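Task: Propose a diverse semantic map construction method designed for robots performing 3D Visual Grounding tasks.

Motivation: Existing methods for 3D Visual Grounding focus mainly on geometric and visual information, overlooking the extraction of diverse semantic information from the scene and the understanding of implicit semantic attributes.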

Details

Method: Uses multimodal large language models (VLMs) to capture the latent semantic attributes and relations of objects in the scene, and builds a Diverse Semantic Map (DSM) through a geometry sliding-window map construction strategy. Result: Experiments show that the method outperforms existing approaches on semantic segmentation and 3D Visual Grounding, excelling particularly in overall metrics. Conclusion: The method's effectiveness is validated on robot navigation and grasping tasks, and it provides a more comprehensive semantic understanding for 3D Visual Grounding. Abstract: In recent years, with the growing research and application of multimodal large language models (VLMs) in robotics, there has been an increasing trend of utilizing VLMs for robotic scene understanding tasks. Existing approaches that use VLMs for 3D Visual Grounding tasks often focus on obtaining scene information through geometric and visual information, overlooking the extraction of diverse semantic information from the scene and the understanding of rich implicit semantic attributes, such as appearance, physics, and affordance. The 3D scene graph, which combines geometry and language, is an ideal representation method for environmental perception and is an effective carrier for language models in 3D Visual Grounding tasks. To address these issues, we propose a diverse semantic map construction method specifically designed for robotic agents performing 3D Visual Grounding tasks. This method leverages VLMs to capture the latent semantic attributes and relations of objects within the scene and creates a Diverse Semantic Map (DSM) through a geometry sliding-window map construction strategy. We enhance the understanding of grounding information based on DSM and introduce a novel approach named DSM-Grounding. Experimental results show that our method outperforms current approaches in tasks like semantic segmentation and 3D Visual Grounding, particularly excelling in overall metrics compared to the state-of-the-art. In addition, we have deployed this method on robots to validate its effectiveness in navigation and grasping tasks.

Fast-Slow-Thinking: Complex Task Solving with Large Language Models

Yiliu Sun,Yanfang Zhang,Zicheng Zhao,Sheng Wan,Dacheng Tao,Chen Gong

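Task: Propose a new task decomposition method called "Fast-Slow-Thinking" (FST) to address the weak performance of large language models (LLMs) on tasks with complex logic and constraints.

Motivation: Existing task decomposition methods perform poorly when task logic and constraints are overly complex, producing solutions that deviate from the task's original purpose or contain redundant or even erroneous content.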

Details

Method: FST has Fast Thinking (FT) and Slow Thinking (ST) steps cooperate to mimic the human coarse-to-fine cognitive process: FT simplifies the task, and ST restores details and constraints. Result: Experiments show that FST effectively improves LLM performance on three types of tasks. Conclusion: By mimicking the human thinking process, FST significantly improves the ability of LLMs to handle complex tasks. Abstract: Nowadays, Large Language Models (LLMs) have been gradually employed to solve complex tasks. To face the challenge, task decomposition has become an effective way, which proposes to divide a complex task into multiple simpler subtasks and then solve them separately so that the difficulty of the original task can be reduced. However, the performance of existing task decomposition methods can be suboptimal when the task contains overly complex logic and constraints. In this situation, the solution generated by LLMs may deviate from the original purpose of the task, or contain redundant or even erroneous content. Therefore, inspired by the fact that humans possess two thinking systems including fast thinking and slow thinking, this paper introduces a new task decomposition method termed "Fast-Slow-Thinking" (FST), which stimulates LLMs to solve tasks through the cooperation of Fast Thinking (FT) and Slow Thinking (ST) steps. Here FT focuses more on the general and concise aspect of the task, and ST focuses more on the details of the task. In FT, LLMs are prompted to remove the constraints of the original task, therefore simplifying it to a general and concise one. In ST, we recall the constraints removed in FT, so that LLMs can improve the answer generated in FT to meet the requirements of the original task. Therefore, our FST method enables LLMs to consider a complex problem via a human-like cognition process from coarse to fine, the effectiveness of which has been well demonstrated by the experiments on three types of tasks.
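The two-step flow described above can be pictured as a small prompt pipeline. The sketch below is only an illustration under assumptions made here: the constraints are passed in as a ready-made list (in the paper the LLM itself is prompted to remove them), the prompt wording is invented, and `llm` is any hypothetical callable that maps a prompt string to a response string.

```python
def fast_slow_thinking(task, constraints, llm):
    """Coarse-to-fine solving: draft without constraints, then refine.

    `task` is the simplified task text, `constraints` a list of strings,
    and `llm` a callable prompt -> response (hypothetical interface).
    """
    # Fast Thinking: solve the general, constraint-free version first.
    draft = llm(f"Solve this task in a general, concise way:\n{task}")

    # Slow Thinking: recall the constraints and revise the draft.
    constraint_text = "\n".join(f"- {c}" for c in constraints)
    return llm(
        "Revise the draft so it satisfies every constraint.\n"
        f"Task: {task}\nConstraints:\n{constraint_text}\nDraft:\n{draft}"
    )


if __name__ == "__main__":
    def echo_llm(prompt):
        # Stand-in model so the sketch runs without any API access.
        return f"[response to {len(prompt)} prompt characters]"

    print(fast_slow_thinking(
        "Write a poem about autumn",
        ["exactly four lines", "mention rain"],
        echo_llm,
    ))
```

Keeping the constraints out of the first prompt is the point of the design: the draft settles the overall shape of the answer before the details are enforced.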

EasyGenNet: An Efficient Framework for Audio-Driven Gesture Video Generation Based on Diffusion Model

Renda Li,Xiaohua Qi,Qiang Ling,Jun Yu,Ziyi Chen,Peng Chang,Mei Han,Jing Xiao

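Task: Propose a one-stage training method based on a diffusion model for generating realistic and continuous gesture videos.

Motivation: Address the difficulty of synthesizing natural expressions and gestures in existing gesture-to-video systems, and reduce reliance on complex inputs, complex training strategies, and large datasets.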

Details

Method: Uses a one-stage training method and a temporal inference method based on a diffusion model, reuses existing pre-trained weights, and needs only a small amount of data for fine-tuning. Result: Experiments show that the method outperforms existing GAN-based and diffusion-based methods. Conclusion: The proposed method simplifies the training pipeline, improves generation quality, and is suitable for practical applications. Abstract: Audio-driven cospeech video generation typically involves two stages: speech-to-gesture and gesture-to-video. While significant advances have been made in speech-to-gesture generation, synthesizing natural expressions and gestures remains challenging in gesture-to-video systems. In order to improve the generation effect, previous works adopted complex input and training strategies and required a large number of datasets for pre-training, which brought inconvenience to practical applications. We propose a simple one-stage training method and a temporal inference method based on a diffusion model to synthesize realistic and continuous gesture videos without the need for additional training of temporal modules. The entire model makes use of existing pre-trained weights, and only a few thousand frames of data are needed for each character at a time to complete fine-tuning. Built upon the video generator, we introduce a new audio-to-video pipeline to synthesize co-speech videos, using 2D human skeleton as the intermediate motion representation. Our experiments show that our method outperforms existing GAN-based and diffusion-based methods.

TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning

Hang Ni,Fan Liu,Xinyu Ma,Lixin Su,Shuaiqiang Wang,Dawei Yin,Hui Xiong,Hao Liu

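Task: Develop TP-RAG, a benchmark for retrieval-augmented, spatiotemporal-aware travel planning.

Motivation: Existing benchmarks focus only on basic plan validity and neglect critical aspects such as route efficiency, POI appeal, and real-time adaptability.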

Details

Method: Proposes the TP-RAG benchmark by integrating real-world travel queries, fine-grained annotated POIs, and high-quality travel trajectory references, and further proposes EvoRAG, an evolutionary framework, to improve performance. Result: Experiments show that integrating reference trajectories significantly improves the spatial efficiency and POI rationality of travel plans, and that EvoRAG performs best at spatiotemporal compliance and at reducing commonsense violations. Conclusion: Combining Web knowledge with LLM-driven optimization lays the groundwork for more reliable and adaptive travel planning agents. Abstract: Large language models (LLMs) have shown promise in automating travel planning, yet they often fall short in addressing nuanced spatiotemporal rationality. While existing benchmarks focus on basic plan validity, they neglect critical aspects such as route efficiency, POI appeal, and real-time adaptability. This paper introduces TP-RAG, the first benchmark tailored for retrieval-augmented, spatiotemporal-aware travel planning. Our dataset includes 2,348 real-world travel queries, 85,575 fine-grain annotated POIs, and 18,784 high-quality travel trajectory references sourced from online tourist documents, enabling dynamic and context-aware planning. Through extensive experiments, we reveal that integrating reference trajectories significantly improves spatial efficiency and POI rationality of the travel plan, while challenges persist in universality and robustness due to conflicting references and noisy data. To address these issues, we propose EvoRAG, an evolutionary framework that potently synergizes diverse retrieved trajectories with LLMs' intrinsic reasoning. EvoRAG achieves state-of-the-art performance, improving spatiotemporal compliance and reducing commonsense violation compared to ground-up and retrieval-augmented baselines. Our work underscores the potential of hybridizing Web knowledge with LLM-driven optimization, paving the way for more reliable and adaptive travel planning agents.

Josef Bengtson,David Nilsson,Fredrik Kahl

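Task: Propose a method that improves the geometric consistency of images generated in single image novel view synthesis (NVS) by optimizing the starting noise of the diffusion sampling process.

Motivation: Images generated by existing diffusion models for single image NVS are realistic but geometrically inconsistent, in particular failing to satisfy the epipolar constraints given by the target pose.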

Details

Method: Formulates a loss function based on image matching and epipolar constraints and optimizes the starting noise of the diffusion sampling process so that the generated image is both realistic and consistent with the geometric constraints. Result: Evaluated on the MegaScenes dataset, the method improves geometric consistency over the baseline models while retaining image quality. Conclusion: The method requires no training data or fine-tuning of the diffusion models, can be applied to multiple state-of-the-art single image NVS models, and significantly improves geometric consistency. Abstract: Diffusion models for single image novel view synthesis (NVS) can generate highly realistic and plausible images, but they are limited in the geometric consistency to the given relative poses. The generated images often show significant errors with respect to the epipolar constraints that should be fulfilled, as given by the target pose. In this paper we address this issue by proposing a methodology to improve the geometric correctness of images generated by a diffusion model for single image NVS. We formulate a loss function based on image matching and epipolar constraints, and optimize the starting noise in a diffusion sampling process such that the generated image should both be a realistic image and fulfill geometric constraints derived from the given target pose. Our method does not require training data or fine-tuning of the diffusion models, and we show that we can apply it to multiple state-of-the-art models for single image NVS. The method is evaluated on the MegaScenes dataset and we show that geometric consistency is improved compared to the baseline models while retaining the quality of the generated images.
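For readers unfamiliar with the epipolar constraint, the sketch below shows the standard algebraic form that such a loss can build on: for a relative pose (R, t) with X2 = R X1 + t, matched points in normalised homogeneous coordinates satisfy x2^T E x1 = 0, where E = [t]_x R. The helper names and the mean-squared-residual loss are assumptions made here for illustration; the paper's actual loss, its matching step, and the optimisation of the diffusion noise are not reproduced.

```python
def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) applied to v is t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def essential_matrix(rotation, translation):
    """E = [t]_x R for the relative pose X2 = R X1 + t."""
    return matmul3(skew(translation), rotation)


def epipolar_residual(essential, x1, x2):
    """Algebraic residual x2^T E x1; zero for a geometrically exact match."""
    ex1 = [sum(essential[i][j] * x1[j] for j in range(3)) for i in range(3)]
    return sum(x2[i] * ex1[i] for i in range(3))


def epipolar_loss(essential, matches):
    """Mean squared residual over (x1, x2) point matches (illustrative)."""
    return sum(
        epipolar_residual(essential, x1, x2) ** 2 for x1, x2 in matches
    ) / len(matches)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    E = essential_matrix(identity, (1.0, 0.0, 0.0))  # pure sideways motion
    x1 = (0.25, 0.1, 1.0)        # 3D point (0.5, 0.2, 2.0) seen in view 1
    x2_good = (0.75, 0.1, 1.0)   # the same point seen in view 2
    x2_bad = (0.75, 0.3, 1.0)    # a match that violates the geometry
    print(epipolar_loss(E, [(x1, x2_good), (x1, x2_bad)]))
```

In a test-time optimisation setting, a differentiable version of such a residual would be evaluated on matches between the input view and the generated view and backpropagated to the starting noise.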

Large Language Models as Span Annotators

Zdeněk Kasner,Vilém Zouhar,Patrícia Schmidtová,Ivan Kartáč,Kristýna Onderková,Ondřej Plátek,Dimitra Gkatzia,Saad Mahamood,Ondřej Dušek,Simone Balloccu

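Task: Automate span annotation of text with large language models (LLMs) and compare them with human annotators.

Motivation: For high-quality texts, single-score metrics rarely provide actionable feedback, whereas span annotation can guide improvements and provide insights.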

Details

Method: Compares expert or skilled crowdworker annotators with open and proprietary LLMs on three tasks: data-to-text generation evaluation, machine translation evaluation, and propaganda detection in human-written texts. Result: LLMs as span annotators are straightforward to implement and notably more cost-efficient than human annotators, and they reach moderate agreement with skilled human annotators. Qualitative analysis shows that reasoning models outperform instruction-tuned models and provide more valid explanations for their annotations. Conclusion: LLMs can serve as efficient span annotators across a range of tasks, and a dataset of more than 40k model and human annotations is released for further research. Abstract: For high-quality texts, single-score metrics seldom provide actionable feedback. In contrast, span annotation - pointing out issues in the text by annotating their spans - can guide improvements and provide insights. Until recently, span annotation was limited to human annotators or fine-tuned encoder models. In this study, we automate span annotation with large language models (LLMs). We compare expert or skilled crowdworker annotators with open and proprietary LLMs on three tasks: data-to-text generation evaluation, machine translation evaluation, and propaganda detection in human-written texts. In our experiments, we show that LLMs as span annotators are straightforward to implement and notably more cost-efficient than human annotators. The LLMs achieve moderate agreement with skilled human annotators, in some scenarios comparable to the average agreement among the annotators themselves. Qualitative analysis shows that reasoning models outperform their instruction-tuned counterparts and provide more valid explanations for annotations. We release the dataset of more than 40k model and human annotations for further research.

LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs

Jiarui Wang,Huiyu Duan,Yu Zhao,Juntong Wang,Guangtao Zhai,Xiongkuo Min

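Task: Propose LMM4LMM, an automatic metric for evaluating the quality of images generated by large multimodal models, and build EvalMi-50K, a comprehensive dataset for evaluation.

Motivation: Because manual evaluation is costly and inefficient, an automatic metric aligned with human preferences is needed to evaluate the perceptual quality and text-image alignment of generated images.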

Details

Method: Builds the EvalMi-50K dataset, containing 50,400 images and 100K mean-opinion scores from human raters, and proposes LMM4LMM, an LMM-based evaluation metric. Result: LMM4LMM performs strongly on EvalMi-50K and shows strong generalization on other datasets. Conclusion: EvalMi-50K and LMM4LMM provide effective, broadly applicable tools for evaluating images generated by large multimodal models. Abstract: Recent breakthroughs in large multimodal models (LMMs) have significantly advanced both text-to-image (T2I) generation and image-to-text (I2T) interpretation. However, many generated images still suffer from issues related to perceptual quality and text-image alignment. Given the high cost and inefficiency of manual evaluation, an automatic metric that aligns with human preferences is desirable. To this end, we present EvalMi-50K, a comprehensive dataset and benchmark for evaluating large-multimodal image generation, which features (i) comprehensive tasks, encompassing 2,100 extensive prompts across 20 fine-grained task dimensions, and (ii) large-scale human-preference annotations, including 100K mean-opinion scores (MOSs) and 50K question-answering (QA) pairs annotated on 50,400 images generated from 24 T2I models. Based on EvalMi-50K, we propose LMM4LMM, an LMM-based metric for evaluating large multimodal T2I generation from multiple dimensions including perception, text-image correspondence, and task-specific accuracy. Extensive experimental results show that LMM4LMM achieves state-of-the-art performance on EvalMi-50K, and exhibits strong generalization ability on other AI-generated image evaluation benchmark datasets, manifesting the generality of both the EvalMi-50K dataset and LMM4LMM metric. Both EvalMi-50K and LMM4LMM will be released at https://github.com/IntMeGroup/LMM4LMM.

ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance

Wissam Antoun,Benoît Sagot,Djamé Seddah

Task: Compare ModernBERT and DeBERTaV3 in a controlled experiment to determine how much architectural improvements actually contribute.

Motivation: ModernBERT's reported gains may stem from training data rather than architecture, and the lack of comparisons on a shared dataset makes this hard to verify.

Details

Method: Pretrains ModernBERT on the same dataset as CamemBERTaV2, a French DeBERTaV3 model, to isolate the effect of model design. Result: DeBERTaV3 remains superior in sample efficiency and overall benchmark performance, while ModernBERT's main advantage is faster training and inference; ModernBERT still offers architectural improvements over earlier models such as BERT and RoBERTa. Conclusion: Evaluating transformer models requires disentangling pretraining data from architectural innovations; high-quality data accelerates convergence but has limited effect on final performance. Abstract: Pretrained transformer-encoder models like DeBERTaV3 and ModernBERT introduce architectural advancements aimed at improving efficiency and performance. Although the authors of ModernBERT report improved performance over DeBERTaV3 on several benchmarks, the lack of disclosed training data and the absence of comparisons using a shared dataset make it difficult to determine whether these gains are due to architectural improvements or differences in training data. In this work, we conduct a controlled study by pretraining ModernBERT on the same dataset as CamemBERTaV2, a DeBERTaV3 French model, isolating the effect of model design. Our results show that the previous model generation remains superior in sample efficiency and overall benchmark performance, with ModernBERT's primary advantage being faster training and inference speed. However, the new proposed model still provides meaningful architectural improvements compared to earlier models such as BERT and RoBERTa. Additionally, we observe that high-quality pre-training data accelerates convergence but does not significantly improve final performance, suggesting potential benchmark saturation. These findings show the importance of disentangling pretraining data from architectural innovations when evaluating transformer models.

SN-LiDAR: Semantic Neural Fields for Novel Space-time View LiDAR Synthesis

Yi Chen,Tianchen Deng,Wentao Zhao,Xiaoning Wang,Wenqian Xi,Weidong Chen,Jingchuan Wang

Task: Propose SN-LiDAR, a method that jointly performs semantic segmentation, geometric reconstruction, and LiDAR synthesis.

Motivation: Existing methods do not reconstruct semantic labels, which are crucial for downstream applications such as autonomous driving.

Details

Method: Uses a coarse-to-fine planar-grid feature representation to extract global features from multi-frame point clouds and a CNN-based encoder to extract local semantic features from the current frame. Result: Performs strongly on SemanticKITTI and KITTI-360, handling dynamic objects and large-scale scenes effectively. Conclusion: SN-LiDAR is superior in both semantic and geometric reconstruction; the code will be open-sourced. Abstract: Recent research has begun exploring novel view synthesis (NVS) for LiDAR point clouds, aiming to generate realistic LiDAR scans from unseen viewpoints. However, most existing approaches do not reconstruct semantic labels, which are crucial for many downstream applications such as autonomous driving and robotic perception. Unlike images, which benefit from powerful segmentation models, LiDAR point clouds lack such large-scale pre-trained models, making semantic annotation time-consuming and labor-intensive. To address this challenge, we propose SN-LiDAR, a method that jointly performs accurate semantic segmentation, high-quality geometric reconstruction, and realistic LiDAR synthesis. Specifically, we employ a coarse-to-fine planar-grid feature representation to extract global features from multi-frame point clouds and leverage a CNN-based encoder to extract local semantic features from the current frame point cloud. Extensive experiments on SemanticKITTI and KITTI-360 demonstrate the superiority of SN-LiDAR in both semantic and geometric reconstruction, effectively handling dynamic objects and large-scale scenes. Codes will be available on https://github.com/dtc111111/SN-Lidar.

SWAN-GPT: An Efficient and Scalable Approach for Long-Context Language Modeling

Krishna C. Puvvada,Faisal Ladhak,Santiago Akle Serrano,Cheng-Ping Hsieh,Shantanu Acharya,Somshubra Majumdar,Fei Jia,Samuel Kriman,Simeng Sun,Dima Rekesh,Boris Ginsburg

Task: Propose a decoder-only Transformer architecture (SWAN-GPT) that extrapolates robustly beyond the training sequence length.

Motivation: Existing models perform poorly beyond their training sequence length; the work also aims to improve computational efficiency.

Details

Method: Interleaves layers without positional encodings (NoPE) with sliding-window attention layers that use rotary positional encodings (SWA-RoPE), and dynamically scales attention scores at inference for robust extrapolation. Result: Strong performance on sequences significantly longer than the training length, with higher computational efficiency. Conclusion: SWAN-GPT offers an efficient and robust way to scale language models to longer contexts. Abstract: We present a decoder-only Transformer architecture that robustly generalizes to sequence lengths substantially longer than those seen during training. Our model, SWAN-GPT, interleaves layers without positional encodings (NoPE) and sliding-window attention layers equipped with rotary positional encodings (SWA-RoPE). Experiments demonstrate strong performance on sequence lengths significantly longer than the training length without the need for additional long-context training. This robust length extrapolation is achieved through our novel architecture, enhanced by a straightforward dynamic scaling of attention scores during inference. In addition, SWAN-GPT is more computationally efficient than standard GPT architectures, resulting in cheaper training and higher throughput. Further, we demonstrate that existing pre-trained decoder-only models can be efficiently converted to the SWAN architecture with minimal continued training, enabling longer contexts. Overall, our work presents an effective approach for scaling language models to longer contexts in a robust and efficient manner.
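The interleaving of NoPE and sliding-window layers can be pictured through the attention masks each layer type would use. The following is a minimal sketch, not the authors' implementation: it assumes NoPE layers use full causal attention and SWA-RoPE layers use a causal window, and the names `causal_mask`, `sliding_window_mask`, `layer_kinds`, the window size, and the interleaving period are all hypothetical, since the abstract does not state them.

```python
def causal_mask(seq_len):
    """Full causal mask: position i may attend to every j <= i (assumed for NoPE layers)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Causal sliding-window mask: i attends to the last `window` positions, itself included."""
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]


def layer_kinds(n_layers, period):
    """Hypothetical interleaving: one NoPE layer followed by `period - 1` SWA-RoPE layers."""
    return ["nope" if k % period == 0 else "swa_rope" for k in range(n_layers)]


if __name__ == "__main__":
    for row in sliding_window_mask(5, window=2):
        print("".join("x" if allowed else "." for allowed in row))
    print(layer_kinds(6, period=3))
```

Because the window bounds how far each SWA layer looks back, its cost per token stays constant as the sequence grows, which is one plausible reading of the efficiency claim in the abstract.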

FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations

Cheng-Yu Hsieh,Pavan Kumar Anasosalu Vasu,Fartash Faghri,Raviteja Vemulapalli,Chun-Liang Li,Ranjay Krishna,Oncel Tuzel,Hadi Pouransari

Task: Develop FocalLens, a conditional visual encoding method that produces different image representations depending on natural language instructions.

Motivation: Existing image encoding paradigms typically produce a fixed feature vector and cannot adjust which visual information is prioritized for different tasks.

Details

Method: Uses vision instruction tuning data to contrastively finetune a pretrained vision encoder that takes natural language instructions as additional input and produces conditional image representations. Result: FocalLens representations emphasize task-relevant visual features better than standard vision encoders such as CLIP and improve a range of downstream tasks, with average gains of 5 and 10 points on SugarCrepe and MMVP-VLM, respectively. Conclusion: Conditional visual encoding lets FocalLens address the contextual needs of image representation and improve task performance. Abstract: Visual understanding is inherently contextual -- what we focus on in an image depends on the task at hand. For instance, given an image of a person holding a bouquet of flowers, we may focus on either the person such as their clothing, or the type of flowers, depending on the context of interest. Yet, most existing image encoding paradigms represent an image as a fixed, generic feature vector, overlooking the potential needs of prioritizing varying visual information for different downstream use cases. In this work, we introduce FocalLens, a conditional visual encoding method that produces different representations for the same image based on the context of interest, expressed flexibly through natural language. We leverage vision instruction tuning data and contrastively finetune a pretrained vision encoder to take natural language instructions as additional inputs for producing conditional image representations. Extensive experiments validate that conditional image representations from FocalLens better pronounce the visual features of interest compared to generic features produced by standard vision encoders like CLIP. In addition, we show FocalLens further leads to performance improvements on a range of downstream tasks including image-image retrieval, image classification, and image-text retrieval, with an average gain of 5 and 10 points on the challenging SugarCrepe and MMVP-VLM benchmarks, respectively.

Tanmay Laud,Akadia Kacha-Ochana,Steven A. Sumner,Vikram Krishnasamy,Royal Law,Lyna Schieber,Munmun De Choudhury,Mai ElSherief

Task: Study natural language questions about opioid use disorder (OUD) asked on Reddit and categorize them.

Motivation: Online communities contain a great deal of information that is not clinically verified; understanding the questions users ask on these platforms can help improve public health interventions.

Details

Method: Applies transformer-based question detection and hierarchical clustering to questions from 19 subreddits. Result: Identifies six coarse-grained and 69 fine-grained categories of OUD-related questions, covering ten areas of information seeking. Conclusion: The study provides an important foundation for understanding OUD-related questions on Reddit and discusses technological interventions and public health harm reduction measures based on these questions. Abstract: Opioid use disorder (OUD) is a leading health problem that affects individual well-being as well as general public health. Due to a variety of reasons, including the stigma faced by people using opioids, online communities for recovery and support were formed on different social media platforms. In these communities, people share their experiences and solicit information by asking questions to learn about opioid use and recovery. However, these communities do not always contain clinically verified information. In this paper, we study natural language questions asked in the context of OUD-related discourse on Reddit. We adopt transformer-based question detection along with hierarchical clustering across 19 subreddits to identify six coarse-grained categories and 69 fine-grained categories of OUD-related questions. Our analysis uncovers ten areas of information seeking from Reddit users in the context of OUD: drug sales, specific drug-related questions, OUD treatment, drug uses, side effects, withdrawal, lifestyle, drug testing, pain management and others, during the study period of 2018-2021. Our work provides a major step in improving the understanding of OUD-related questions people ask unobtrusively on Reddit. We finally discuss technological interventions and public health harm reduction techniques based on the topics of these questions.

Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking

Huu-Loc Tran,Tinh-Anh Nguyen-Nhu,Huu-Phong Phan-Nguyen,Tien-Huy Nguyen,Nhat-Minh Nguyen-Dich,Anh Dao,Huy-Duc Do,Quan Nguyen,Hoang M. Le,Quang-Vinh Dinh

Task: Propose a new framework that improves the efficiency and accuracy of interactive video retrieval through four key innovations.

Motivation: Conventional methods process long video content inefficiently, and existing approaches rely on a single model, redundant storage, unstable temporal search, and context-agnostic reranking.

Details

Method: Combines an ensemble search strategy, a storage optimization technique, a temporal search mechanism, and a temporal reranking approach. Result: On known-item search and question-answering tasks, the framework substantially improves retrieval precision, efficiency, and user interpretability. Conclusion: The framework offers a robust solution for real-world interactive video retrieval applications. Abstract: Long-form video understanding presents significant challenges for interactive retrieval systems, as conventional methods struggle to process extensive video content efficiently. Existing approaches often rely on single models, inefficient storage, unstable temporal search, and context-agnostic reranking, limiting their effectiveness. This paper presents a novel framework to enhance interactive video retrieval through four key innovations: (1) an ensemble search strategy that integrates coarse-grained (CLIP) and fine-grained (BEIT3) models to improve retrieval accuracy, (2) a storage optimization technique that reduces redundancy by selecting representative keyframes via TransNetV2 and deduplication, (3) a temporal search mechanism that localizes video segments using dual queries for start and end points, and (4) a temporal reranking approach that leverages neighboring frame context to stabilize rankings. Evaluated on known-item search and question-answering tasks, our framework demonstrates substantial improvements in retrieval precision, efficiency, and user interpretability, offering a robust solution for real-world interactive video retrieval applications.
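Item (4) of the abstract, reranking with neighboring frame context, can be illustrated with a simple score-smoothing pass. This is only a sketch under stated assumptions: the abstract does not give the reranking formula, so the blending rule, the `radius` and `neighbor_weight` parameters, and the function name `temporal_rerank` are hypothetical.

```python
def temporal_rerank(scores, radius=1, neighbor_weight=0.5):
    """Rank keyframes after blending each score with the mean score of its temporal neighbors.

    `scores[i]` is the query similarity of keyframe i, listed in temporal order.
    The blending rule is an illustrative assumption, not the paper's formula.
    """
    smoothed = []
    for i, score in enumerate(scores):
        lo, hi = max(0, i - radius), min(len(scores), i + radius + 1)
        neighbors = [scores[j] for j in range(lo, hi) if j != i]
        context = sum(neighbors) / len(neighbors) if neighbors else score
        smoothed.append((1 - neighbor_weight) * score + neighbor_weight * context)
    # Frame indices ordered from best to worst smoothed score.
    return sorted(range(len(scores)), key=lambda i: smoothed[i], reverse=True)


if __name__ == "__main__":
    # Frame 1 is an isolated spike; frames 3-5 form a consistently relevant segment.
    print(temporal_rerank([0.0, 1.0, 0.0, 0.8, 0.8, 0.8]))
```

With smoothing enabled, an isolated high-scoring frame surrounded by irrelevant ones drops below frames inside a consistently relevant segment, which is the stabilizing effect the abstract describes.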

The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

Yutaro Yamada,Robert Tjarko Lange,Cong Lu,Shengran Hu,Chris Lu,Jakob Foerster,Jeff Clune,David Ha

Task: Develop an end-to-end autonomous AI system (The AI Scientist-v2) capable of producing an entirely AI-written workshop paper that passes peer review.

Motivation: Demonstrate AI's capability across all aspects of scientific discovery and advance autonomous scientific research technology to improve research efficiency and productivity.

Details

Method: Uses a progressive agentic tree-search method coordinated by an experiment manager agent, combined with a Vision-Language Model (VLM) feedback loop to refine content and figures. Result: Three fully autonomously generated manuscripts were submitted to an ICLR workshop, and one scored above the average human acceptance threshold. Conclusion: AI is increasingly capable of conducting all aspects of scientific research; further advances are expected to accelerate scientific breakthroughs and have a profound impact on society. Abstract: AI is increasingly playing a pivotal role in transforming how scientific discoveries are made. We introduce The AI Scientist-v2, an end-to-end agentic system capable of producing the first entirely AI generated peer-review-accepted workshop paper. This system iteratively formulates scientific hypotheses, designs and executes experiments, analyzes and visualizes data, and autonomously authors scientific manuscripts. Compared to its predecessor (v1, Lu et al., 2024 arXiv:2408.06292), The AI Scientist-v2 eliminates the reliance on human-authored code templates, generalizes effectively across diverse machine learning domains, and leverages a novel progressive agentic tree-search methodology managed by a dedicated experiment manager agent. Additionally, we enhance the AI reviewer component by integrating a Vision-Language Model (VLM) feedback loop for iterative refinement of content and aesthetics of the figures. We evaluated The AI Scientist-v2 by submitting three fully autonomous manuscripts to a peer-reviewed ICLR workshop. Notably, one manuscript achieved high enough scores to exceed the average human acceptance threshold, marking the first instance of a fully AI-generated paper successfully navigating a peer review. This accomplishment highlights the growing capability of AI in conducting all aspects of scientific research. We anticipate that further advancements in autonomous scientific discovery technologies will profoundly impact human knowledge generation, enabling unprecedented scalability in research productivity and significantly accelerating scientific breakthroughs, greatly benefiting society at large. We have open-sourced the code at https://github.com/SakanaAI/AI-Scientist-v2 to foster the future development of this transformative technology. We also discuss the role of AI in science, including AI safety.

MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft

Junliang Guo,Yang Ye,Tianyu He,Haoyu Wu,Yushu Jiang,Tim Pearce,Jiang Bian

Task: Propose MineWorld, a real-time interactive world model on Minecraft.

Motivation: World modeling is a crucial task for enabling intelligent agents to interact with humans and operate in dynamic environments.

Details

Method: Uses a visual-action autoregressive Transformer; game scenes and actions are converted to discrete token ids, interleaved into one input sequence, and the model is trained with next-token prediction to jointly learn representations of states and actions. Result: MineWorld significantly outperforms existing open-source diffusion-based world models in visual quality and action-following ability. Conclusion: MineWorld is an efficient world model that supports real-time interaction; the code and model have been released. Abstract: World modeling is a crucial task for enabling intelligent agents to effectively interact with humans and operate in dynamic environments. In this work, we propose MineWorld, a real-time interactive world model on Minecraft, an open-ended sandbox game which has been utilized as a common testbed for world modeling. MineWorld is driven by a visual-action autoregressive Transformer, which takes paired game scenes and corresponding actions as input, and generates consequent new scenes following the actions. Specifically, by transforming visual game scenes and actions into discrete token ids with an image tokenizer and an action tokenizer correspondingly, we construct the model input by interleaving the two kinds of ids. The model is then trained with next token prediction to learn rich representations of game states as well as the conditions between states and actions simultaneously. In inference, we develop a novel parallel decoding algorithm that predicts the spatially redundant tokens in each frame at the same time, letting models in different scales generate 4 to 7 frames per second and enabling real-time interactions with game players. In evaluation, we propose new metrics to assess not only visual quality but also the action following capacity when generating new scenes, which is crucial for a world model. Our comprehensive evaluation shows the efficacy of MineWorld, outperforming SoTA open-sourced diffusion based world models significantly. The code and model have been released.
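The input construction described in the abstract, interleaving scene token ids with action token ids and training with next-token prediction, might look roughly like the sketch below. The token ids are toy values and the helper names `interleave_tokens` and `next_token_pairs` are hypothetical; the image and action tokenizers themselves are not shown.

```python
def interleave_tokens(frame_tokens, action_tokens):
    """Build one sequence: frame_0 ids, action_0 ids, frame_1 ids, action_1 ids, ..."""
    if len(frame_tokens) != len(action_tokens):
        raise ValueError("each frame needs a paired action")
    sequence = []
    for frame_ids, action_ids in zip(frame_tokens, action_tokens):
        sequence.extend(frame_ids)
        sequence.extend(action_ids)
    return sequence


def next_token_pairs(sequence):
    """Training pairs for next-token prediction: (prefix, target token)."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


if __name__ == "__main__":
    # Toy ids: values below 100 stand for image tokens, 100 and above for action tokens.
    seq = interleave_tokens([[1, 2], [3, 4]], [[100], [101]])
    print(seq)
    print(next_token_pairs(seq)[:2])
```

Since frames and actions share one sequence, a single autoregressive objective covers both predicting the next scene given an action and modeling how actions follow states.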

Geneshift: Impact of different scenario shift on Jailbreaking LLM

Tianyi Wu,Zhiwei Xue,Yue Liu,Jiaheng Zhang,Bryan Hooi,See-Kiong Ng

Task: Propose GeneShift, a black-box jailbreak attack that uses a genetic algorithm to optimize scenario shifts and raise the attack success rate.

Motivation: Existing jailbreak attacks perform poorly under GPT-based evaluation because they fail to output detailed content that satisfies the harmful request, so a more effective method is needed.

Details

Method: Observes how malicious queries perform under different scenario shifts, then uses a genetic algorithm to evolve and select the best combination of scenario shifts. Result: GeneShift raises the jailbreak success rate from 0% to 60%. Conclusion: By optimizing scenario shifts with a genetic algorithm, GeneShift can more stealthily lead a model to generate detailed harmful content, markedly improving attack effectiveness. Abstract: Jailbreak attacks, which aim to cause LLMs to perform unrestricted behaviors, have become a critical and challenging direction in AI safety. Despite achieving the promising attack success rate using dictionary-based evaluation, existing jailbreak attack methods fail to output detailed contents to satisfy the harmful request, leading to poor performance on GPT-based evaluation. To this end, we propose a black-box jailbreak attack termed GeneShift, by using a genetic algorithm to optimize the scenario shifts. Firstly, we observe that the malicious queries perform optimally under different scenario shifts. Based on it, we develop a genetic algorithm to evolve and select the hybrid of scenario shifts. It guides our method to elicit detailed and actionable harmful responses while keeping the seemingly benign facade, improving stealthiness. Extensive experiments demonstrate the superiority of GeneShift. Notably, GeneShift increases the jailbreak success rate from 0% to 60% when direct prompting alone would fail.

Light-YOLOv8-Flame: A Lightweight High-Performance Flame Detection Algorithm

Jiawei Lan,Zhibiao Wang,Haoyang Yu,Ye Tao,Wenhua Cui

Task: Propose Light-YOLOv8-Flame, a lightweight flame detection algorithm for real-time fire detection.

Motivation: Address the high computational cost and delayed response of computer-vision-based fire detection algorithms.

Details

Method: Replaces the C2f module in YOLOv8 with the FasterNet Block module, which combines Partial Convolution (PConv) and Convolution (Conv) layers to reduce computational complexity and model size. Result: The modified model gains 0.78% in mAP and 2.05% in recall while reducing the parameter count by 25.34%, with precision dropping by only 0.82%. Conclusion: Light-YOLOv8-Flame improves both detection performance and speed, making it suitable for real-time fire detection on resource-constrained devices. Abstract: Fire detection algorithms, particularly those based on computer vision, encounter significant challenges such as high computational costs and delayed response times, which hinder their application in real-time systems. To address these limitations, this paper introduces Light-YOLOv8-Flame, a lightweight flame detection algorithm specifically designed for fast and efficient real-time deployment. The proposed model enhances the YOLOv8 architecture through the substitution of the original C2f module with the FasterNet Block module. This new block combines Partial Convolution (PConv) and Convolution (Conv) layers, reducing both computational complexity and model size. A dataset comprising 7,431 images, representing both flame and non-flame scenarios, was collected and augmented for training purposes. Experimental findings indicate that the modified YOLOv8 model achieves a 0.78% gain in mean average precision (mAP) and a 2.05% boost in recall, while reducing the parameter count by 25.34%, with only a marginal decrease in precision by 0.82%. These findings highlight that Light-YOLOv8-Flame offers enhanced detection performance and speed, making it well-suited for real-time fire detection on resource-constrained devices.
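To see why a partial convolution shrinks a layer, it helps to count weights. The sketch below compares a standard convolution with a PConv that convolves only a fraction of the channels and passes the rest through unchanged. The 1/4 channel ratio and the 64-channel layer are assumptions chosen for illustration; they are not taken from the paper, and this single-layer count is not the reported 25.34% whole-model reduction.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def pconv_params(channels, k, ratio=4):
    """Weight count of a partial convolution that convolves only channels // ratio channels.

    The remaining channels are passed through untouched, so they add no weights.
    """
    c_part = channels // ratio
    return conv_params(c_part, c_part, k)


if __name__ == "__main__":
    full = conv_params(64, 64, 3)
    partial = pconv_params(64, 3)
    print(full, partial, partial / full)
```

In a FasterNet-style block the PConv is followed by ordinary pointwise convolutions that mix all channels, so the whole-block saving is smaller than this per-layer ratio suggests.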

SAEs $\textit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs

Aashiq Muhamed,Jacopo Bonato,Mona Diab,Virginia Smith

Task: Improve machine unlearning with dynamically applied sparse autoencoders (SAEs), proposing a new method called Dynamic SAE Guardrails (DSG).

Motivation: Existing gradient-based unlearning methods suffer from high computational cost, hyperparameter instability, poor sequential unlearning, vulnerability to relearning attacks, low data efficiency, and lack of interpretability.

Details

Method: Proposes DSG, which combines principled feature selection with a dynamic classifier to achieve precision unlearning. Result: DSG substantially outperforms leading unlearning methods in experiments, achieving better forget-utility trade-offs. Conclusion: DSG addresses the main drawbacks of gradient-based methods, offering better computational efficiency, stability, sequential unlearning performance, resistance to relearning attacks, data efficiency, and interpretability. Abstract: Machine unlearning is a promising approach to improve LLM safety by removing unwanted knowledge from the model. However, prevailing gradient-based unlearning methods suffer from issues such as high computational costs, hyperparameter instability, poor sequential unlearning capability, vulnerability to relearning attacks, low data efficiency, and lack of interpretability. While Sparse Autoencoders are well-suited to improve these aspects by enabling targeted activation-based unlearning, prior approaches underperform gradient-based methods. This work demonstrates that, contrary to these earlier findings, SAEs can significantly improve unlearning when employed dynamically. We introduce Dynamic SAE Guardrails (DSG), a novel method for precision unlearning that leverages principled feature selection and a dynamic classifier. Our experiments show DSG substantially outperforms leading unlearning methods, achieving superior forget-utility trade-offs. DSG addresses key drawbacks of gradient-based approaches for unlearning -- offering enhanced computational efficiency and stability, robust performance in sequential unlearning, stronger resistance to relearning attacks, better data efficiency including zero-shot settings, and more interpretable unlearning.

PMNI: Pose-free Multi-view Normal Integration for Reflective and Textureless Surface Reconstruction

Mingzhi Pei,Xu Cao,Xiangyi Wang,Heng Guo,Zhanyu Ma

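Task: Propose PMNI, a neural surface reconstruction method that tackles reflective and textureless surfaces in multi-view 3D reconstruction.

Motivation: Reflective and textureless surfaces lack reliable cross-view visual features, so camera pose calibration and shape reconstruction often fail in multi-view 3D reconstruction.

Details

Method: Uses surface normal maps instead of RGB images and enforces geometric constraints from surface normals and multi-view shape consistency within a neural signed distance function (SDF) optimization framework, recovering accurate camera poses and high-fidelity surface geometry at the same time. Result: Experiments on synthetic and real-world datasets show state-of-the-art performance in reflective surface reconstruction, even without reliable initial camera poses. Conclusion: By incorporating geometric information from surface normals, PMNI addresses the difficulty of reconstructing reflective and textureless surfaces from multiple views. Abstract: Reflective and textureless surfaces remain a challenge in multi-view 3D reconstruction. Both camera pose calibration and shape reconstruction often fail due to insufficient or unreliable cross-view visual features. To address these issues, we present PMNI (Pose-free Multi-view Normal Integration), a neural surface reconstruction method that incorporates rich geometric information by leveraging surface normal maps instead of RGB images. By enforcing geometric constraints from surface normals and multi-view shape consistency within a neural signed distance function (SDF) optimization framework, PMNI simultaneously recovers accurate camera poses and high-fidelity surface geometry. Experimental results on synthetic and real-world datasets show that our method achieves state-of-the-art performance in the reconstruction of reflective surfaces, even without reliable initial camera poses.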

Millions of States: Designing a Scalable MoE Architecture with RWKV-7 Meta-learner

Liu Xiao,Li Zhiyuan,Lin Yueyu

Task: Propose Meta-State, an extension to RWKV-7 that addresses its lack of token-parameter interactions and native scalability.

Motivation: RWKV-7 performs well in short-context scenarios with linear complexity, but it lacks mechanisms for token-parameter interactions and scalability, which limits its adaptability and growth.

Details

Method: Introduces a Self-State Encoder (SSE) mechanism that repurposes part of the RWKV-7 WKV state as transformation weights, encoding token-parameter interactions in a linear, state-driven manner while preserving the autoregressive property. Result: Meta-State supports progressive model scaling without retraining while keeping linear complexity and constant memory usage. Conclusion: Meta-State bridges the gap between state-based modeling, token-parameter interactions, and scalable architectures, offering a flexible framework for efficient and adaptable sequence modeling. Abstract: State-based sequence models like RWKV-7 offer a compelling alternative to Transformer architectures, achieving linear complexity while demonstrating greater expressive power in short-context scenarios and enabling state tracking beyond the $\text{TC}^0$ complexity class. However, RWKV-7 lacks mechanisms for token-parameter interactions and native scalability, limiting its adaptability and growth without retraining. In this paper, we propose Meta-State, a novel extension to RWKV-7 that replaces attention mechanisms with a fully state-driven approach, integrating token-parameter interactions through a Self-State Encoder (SSE) mechanism. The SSE repurposes a portion of the RWKV-7 Weighted Key-Value (WKV) state as transformation weights to encode token-parameter interactions in a linear, state-driven manner without introducing new trainable matrices or softmax operations, while preserving the autoregressive property of token processing. Meta-State supports progressive model scaling by expanding the WKV state and parameter tokens, reusing existing parameters without retraining. Our approach bridges the gap between state-based modeling, token-parameter interactions, and scalable architectures, offering a flexible framework for efficient and adaptable sequence modeling with linear complexity and constant memory usage.

A Knowledge-guided Adversarial Defense for Resisting Malicious Visual Manipulation

Dawei Zhou,Suzhi Gang,Decheng Liu,Tongliang Liu,Nannan Wang,Xinbo Gao

Task: Propose a knowledge-guided adversarial defense (KGAD) that actively disrupts malicious visual manipulation models.

Motivation: Existing adversarial-noise-based defenses mainly distort fake samples in the low-level feature space and struggle to resist malicious manipulation in the high-level semantic space.

Details

Method: When generating adversarial noise, constructs semantic confusions at the domain-specific knowledge level and replaces general pixel-wise metrics with a metric closely related to visual perception. Result: Experiments show that KGAD provides better protection than existing methods on human perception and visual quality assessment and generalizes well. Conclusion: Through knowledge-guided and perception-related disruptions, KGAD strengthens the defense against malicious visual manipulation. Abstract: Malicious applications of visual manipulation have raised serious threats to the security and reputation of users in many fields. To alleviate these issues, adversarial noise-based defenses have been enthusiastically studied in recent years. However, "data-only" methods tend to distort fake samples in the low-level feature space rather than the high-level semantic space, leading to limitations in resisting malicious manipulation. Frontier research has shown that integrating knowledge in deep learning can produce reliable and generalizable solutions. Inspired by these, we propose a knowledge-guided adversarial defense (KGAD) to actively force malicious manipulation models to output semantically confusing samples. Specifically, in the process of generating adversarial noise, we focus on constructing significant semantic confusions at the domain-specific knowledge level, and exploit a metric closely related to visual perception to replace the general pixel-wise metrics. The generated adversarial noise can actively interfere with the malicious manipulation model by triggering knowledge-guided and perception-related disruptions in the fake samples. To validate the effectiveness of the proposed method, we conduct qualitative and quantitative experiments on human perception and visual quality assessment. The results on two different tasks both show that our defense provides better protection compared to state-of-the-art methods and achieves great generalizability.

VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering

Qi Zhi Lim,Chin Poo Lee,Kian Ming Lim,Kalaiarasi Sonai Muthu Anbananthen

Task: Propose Vision-Language Multimodal Transformer (VLMT), a unified architecture that addresses limited reasoning capability and modality alignment problems in multimodal multi-hop question answering.

Motivation: Existing multimodal multi-hop question answering methods suffer from limited reasoning capability, reliance on modality conversion, and inadequate alignment between visual and textual representations.

Details

Method: VLMT combines a transformer-based vision encoder with a sequence-to-sequence language model, fuses visual and textual inputs through direct token-level injection, and uses a three-stage pretraining strategy to strengthen cross-modal alignment and reasoning. Result: VLMT-Large achieves 76.5% Exact Match on the MultimodalQA validation set and a QA score of 47.6 on WebQA, clearly outperforming prior methods. Conclusion: VLMT shows strong multimodal reasoning capability and could advance real-world information retrieval and question answering systems. Abstract: The increasing availability of multimodal data across text, tables, and images presents new challenges for developing models capable of complex cross-modal reasoning. Existing methods for Multimodal Multi-hop Question Answering (MMQA) often suffer from limited reasoning capabilities, reliance on modality conversion, and inadequate alignment between visual and textual representations. To address these limitations, this paper introduces Vision-Language Multimodal Transformer (VLMT), a unified architecture that integrates a transformer-based vision encoder with a sequence-to-sequence language model. VLMT employs a direct token-level injection mechanism to fuse visual and textual inputs within a shared embedding space, eliminating the need for intermediate projection layers. To enhance cross-modal alignment and reasoning, a three-stage pretraining strategy is proposed to progressively align vision-language representations and improve the model's capacity for multimodal understanding. Based on the pretrained backbone, two task-specific modules are instantiated to form a two-stage MMQA framework: a multimodal reranker that predicts document relevance scores and utilizes a relative threshold with top-k strategy for context retrieval, and a multimodal question answering model that generates contextually grounded answers based on the retrieved evidence. Comprehensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed approach. On MultimodalQA validation set, VLMT-Large achieves 76.5% Exact Match and 80.1% F1, outperforming the previous state-of-the-art by +9.1% in Exact Match and +8.8% in F1. On WebQA, it attains a QA score of 47.6, surpassing prior models such as PERQA by +3.2. These results highlight VLMT's strong capabilities in multimodal reasoning and its potential to advance real-world information retrieval and question answering systems.
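The reranker's "relative threshold with top-k" selection can be sketched as follows. The abstract does not define the rule, so this is one plausible reading rather than the paper's method: keep documents whose relevance score is at least a fraction of the best score, then cap the result at k. The function name `select_context`, the default values, and the assumption of non-negative scores are all hypothetical.

```python
def select_context(scores, rel_threshold=0.5, top_k=3):
    """Pick document indices to pass to the answering model.

    Keeps documents scoring at least `rel_threshold` times the best score,
    then truncates to the `top_k` highest. Scores are assumed non-negative.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if not ranked:
        return []
    cutoff = rel_threshold * scores[ranked[0]]
    kept = [i for i in ranked if scores[i] >= cutoff]
    return kept[:top_k]


if __name__ == "__main__":
    print(select_context([0.9, 0.2, 0.6, 0.5, 0.46]))  # several strong candidates
    print(select_context([0.9, 0.1, 0.2]))             # one clear winner
```

A relative cutoff adapts to how confident the reranker is on each question, while the top-k cap bounds the context length handed to the answering model.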

Boosting the Class-Incremental Learning in 3D Point Clouds via Zero-Collection-Cost Basic Shape Pre-Training

Chao Qi,Jianqin Yin,Meng Chen,Yingchun Niu,Yuan Sun

Task: Propose a class-incremental learning framework for point clouds that embeds 3D geometry knowledge and works in both exemplar-free and exemplar-based settings.

Motivation: Remove the reliance of existing 3D point cloud incremental learning methods on exemplars and overcome the limitations of pre-trained model methods in the 3D domain.

Details

Method: Proposes a basic shape dataset with zero collection cost for pre-training and designs an incremental learning framework embedded with 3D geometry knowledge. Result: Outperforms baseline methods by a large margin on multiple benchmark datasets, in both exemplar-free and exemplar-based settings. Conclusion: Through geometry knowledge extension and prototype adjustment, the method effectively mitigates forgetting in 3D point cloud incremental learning. Abstract: Existing class-incremental learning methods in 3D point clouds rely on exemplars (samples of former classes) to resist the catastrophic forgetting of models, and exemplar-free settings will greatly degrade the performance. For exemplar-free incremental learning, the pre-trained model methods have achieved state-of-the-art results in 2D domains. However, these methods cannot be migrated to the 3D domains due to the limited pre-training datasets and insufficient focus on fine-grained geometric details. This paper breaks through these limitations, proposing a basic shape dataset with zero collection cost for model pre-training. It helps a model obtain extensive knowledge of 3D geometries. Based on this, we propose a framework embedded with 3D geometry knowledge for incremental learning in point clouds, compatible with exemplar-free (-based) settings. In the incremental stage, the geometry knowledge is extended to represent objects in point clouds. The class prototype is calculated by regularizing the data representation with the same category and is kept adjusting in the learning process. It helps the model remember the shape features of different categories. Experiments show that our method outperforms other baseline methods by a large margin on various benchmark datasets, considering both exemplar-free (-based) settings.
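The idea of a class prototype that keeps adjusting during learning can be illustrated with a running mean of feature vectors plus nearest-prototype classification. This is a generic sketch, not the paper's formulation: the abstract does not specify its regularization, so the plain mean, the squared Euclidean distance, and the `PrototypeMemory` class are assumptions made for illustration.

```python
class PrototypeMemory:
    """Stores one running-mean prototype per class; no exemplars are retained."""

    def __init__(self):
        self.sums = {}    # label -> element-wise sum of features seen so far
        self.counts = {}  # label -> number of features seen so far

    def update(self, label, feature):
        """Fold a new feature vector into the prototype of `label`."""
        if label not in self.sums:
            self.sums[label] = [0.0] * len(feature)
            self.counts[label] = 0
        self.sums[label] = [s + x for s, x in zip(self.sums[label], feature)]
        self.counts[label] += 1

    def prototype(self, label):
        n = self.counts[label]
        return [s / n for s in self.sums[label]]

    def classify(self, feature):
        """Return the label whose prototype is closest in squared Euclidean distance."""
        def sq_dist(label):
            return sum((p - x) ** 2 for p, x in zip(self.prototype(label), feature))
        return min(self.sums, key=sq_dist)


if __name__ == "__main__":
    memory = PrototypeMemory()
    memory.update("cube", [1.0, 0.0])
    memory.update("cube", [3.0, 0.0])
    memory.update("sphere", [0.0, 5.0])
    print(memory.prototype("cube"), memory.classify([2.5, 0.5]))
```

Only the per-class sums and counts are kept, so new classes can be added later without storing samples of old ones, which matches the exemplar-free setting.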

Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation

Haowei Lou,Hye-young Paik,Sheng Li,Wen Hu,Lina Yao

Task: Propose LanStyleTTS, a non-autoregressive multilingual TTS framework with fine-grained, phoneme-level style control.

Motivation: Address performance problems in multilingual TTS caused by differing phoneme vocabularies and cross-language style variation, while avoiding the high computational cost of training language-specific models.

Details

Method: Standardizes phoneme representations within a non-autoregressive TTS architecture to support a unified multilingual model, and explores different acoustic feature representations (mel-spectrograms and autoencoder-derived latent features). Result: LanStyleTTS improves performance consistently across model backbones, and latent encodings significantly reduce model size and computational cost. Conclusion: LanStyleTTS is an efficient multilingual TTS solution that generates high-quality speech without language-specific models. Abstract: Text-to-Speech (TTS) models can generate natural, human-like speech across multiple languages by transforming phonemes into waveforms. However, multilingual TTS remains challenging due to discrepancies in phoneme vocabularies and variations in prosody and speaking style across languages. Existing approaches either train separate models for each language, which achieve high performance at the cost of increased computational resources, or use a unified model for multiple languages that struggles to capture fine-grained, language-specific style variations. In this work, we propose LanStyleTTS, a non-autoregressive, language-aware style adaptive TTS framework that standardizes phoneme representations and enables fine-grained, phoneme-level style control across languages. This design supports a unified multilingual TTS model capable of producing accurate and high-quality speech without the need to train language-specific models. We evaluate LanStyleTTS by integrating it with several state-of-the-art non-autoregressive TTS architectures. Results show consistent performance improvements across different model backbones. Furthermore, we investigate a range of acoustic feature representations, including mel-spectrograms and autoencoder-derived latent features. Our experiments demonstrate that latent encodings can significantly reduce model size and computational cost while preserving high-quality speech generation.

Adversarial Examples in Environment Perception for Automated Driving (Review)

Jun Yan,Huilin Yin

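Task: Systematically review the development of adversarial robustness research over the past decade, including attack and defense methods and their applications in automated driving.

Motivation: Deep learning is widely used in automated driving but is threatened by adversarial examples, which can cause neural networks to make false predictions and pose a major risk to AI applications.

Details

Method: A literature survey that summarizes attack and defense methods for adversarial examples and their specific applications in automated driving. Result: Reviews significant advances in adversarial robustness research and lists key references in the field. Conclusion: The growth of automated driving pushes forward the realization of trustworthy AI applications, and adversarial robustness research plays an important role in that process. Abstract: The renaissance of deep learning has led to the massive development of automated driving. However, deep neural networks are vulnerable to adversarial examples. The perturbations of adversarial examples are imperceptible to human eyes but can lead to the false predictions of neural networks. It poses a huge risk to artificial intelligence (AI) applications for automated driving. This survey systematically reviews the development of adversarial robustness research over the past decade, including the attack and defense methods and their applications in automated driving. The growth of automated driving pushes forward the realization of trustworthy AI applications. This review lists significant references in the research history of adversarial examples.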

MedRep: Medical Concept Representation for General Electronic Health Record Foundation Models

Junmo Kim,Namkyeong Lee,Jiwon Kim,Kwangsoo Kim

Task: Propose MedRep, a method that addresses the limitations of electronic health record (EHR) foundation models when handling unseen medical codes.

Motivation: EHR foundation models cannot process medical codes outside their vocabulary, which limits their generality and hinders the integration of models trained with different vocabularies.

Details

Method: Built on the OMOP CDM, enriches concept representations through LLM prompts and the graph ontology, and applies a trajectory augmentation strategy to handle unseen codes. Result: EHR foundation models trained with MedRep better maintain prediction performance on external datasets. Conclusion: MedRep gives EHR foundation models more general concept representations and an augmentation strategy, improving their applicability. Abstract: Electronic health record (EHR) foundation models have been an area ripe for exploration with their improved performance in various medical tasks. Despite the rapid advances, a fundamental limitation remains: processing unseen medical codes that are out of the vocabulary. This problem limits the generality of EHR foundation models and the integration of models trained with different vocabularies. To deal with this problem, we propose MedRep for EHR foundation models based on the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), providing the integrated medical concept representations and the basic data augmentation strategy for patient trajectories. For concept representation learning, we enrich the information of each concept with a minimal definition through large language model (LLM) prompts and enhance the text-based representations through the graph ontology of the OMOP vocabulary. Trajectory augmentation randomly replaces selected concepts with other similar concepts that have closely related representations to let the model practice with out-of-vocabulary concepts. Finally, we demonstrate that EHR foundation models trained with MedRep better maintain the prediction performance in external datasets. Our code implementation is publicly available at https://github.com/kicarussays/MedRep.
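
The trajectory augmentation described in the MedRep abstract (randomly replacing selected concepts with concepts that have closely related representations) can be illustrated with a minimal sketch. The concept names, the toy embeddings, the cosine-similarity neighbor search, and the `replace_prob` and `k` parameters are all assumptions made for illustration; they are not taken from the MedRep code.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_concepts(concept, embeddings, k=2):
    """Return the k concepts whose embeddings are most similar to `concept`."""
    others = [c for c in embeddings if c != concept]
    others.sort(key=lambda c: cosine(embeddings[concept], embeddings[c]), reverse=True)
    return others[:k]

def augment_trajectory(trajectory, embeddings, replace_prob=0.3, k=2, rng=None):
    """Randomly swap concepts in a patient trajectory for similar concepts."""
    rng = rng or random.Random()
    augmented = []
    for concept in trajectory:
        candidates = nearest_concepts(concept, embeddings, k) if concept in embeddings else []
        if candidates and rng.random() < replace_prob:
            augmented.append(rng.choice(candidates))
        else:
            augmented.append(concept)  # keep the original concept
    return augmented

# Toy, made-up embeddings (hypothetical concept names, not OMOP identifiers).
embeddings = {
    "hypertension": [0.9, 0.1, 0.0],
    "essential_hypertension": [0.88, 0.12, 0.02],
    "type2_diabetes": [0.1, 0.9, 0.1],
    "hyperglycemia": [0.15, 0.85, 0.12],
}
trajectory = ["hypertension", "type2_diabetes", "hypertension"]
augmented = augment_trajectory(trajectory, embeddings, replace_prob=0.5, rng=random.Random(0))
```

The augmented trajectory has the same length as the input, so it can stand in for the original sequence wherever a model consumes one.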

GeoTexBuild: 3D Building Model Generation from Map Footprints

Ruizhe Wang,Junyan Yang,Qiao Wang

Task: Propose GeoTexBuild, a modular generative framework for creating 3D building models from map footprints.

Motivation: Address the problem in existing 3D generation techniques of structural variations behind a single facade photo, and reduce manual labor in building modeling.

Details

Method: A three-stage pipeline of height map generation, geometry reconstruction, and appearance stylization, using customized ControlNet and Text2Mesh models to control geometric and visual attributes. Result: Experiments validate that GeoTexBuild can generate detailed and accurate building models from map footprints. Conclusion: The framework significantly reduces manual labor in building modeling and can offer inspiration for designers. Abstract: We introduce GeoTexBuild, a modular generative framework for creating 3D building models from map footprints. The proposed framework employs a three-stage process comprising height map generation, geometry reconstruction, and appearance stylization, culminating in building models with intricate geometry and appearance attributes. By integrating customized ControlNet and Text2Mesh models, we explore effective methods for controlling both geometric and visual attributes during the generation process. In doing so, we eliminate the problem, faced by existing 3D generation techniques, of structural variations behind a single facade photo. Experimental results at each stage validate the capability of GeoTexBuild to generate detailed and accurate building models from footprints derived from site planning or map designs. Our framework significantly reduces manual labor in modeling buildings and can offer inspiration for designers.

FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations

Cheng-Yu Hsieh,Pavan Kumar Anasosalu Vasu,Fartash Faghri,Raviteja Vemulapalli,Chun-Liang Li,Ranjay Krishna,Oncel Tuzel,Hadi Pouransari

Task: Develop FocalLens, a conditional visual encoding method that produces different image representations depending on a context expressed in natural language.

Motivation: Existing image encoders typically produce a fixed, generic feature vector, overlooking that different downstream tasks may need to prioritize different visual information.

Details

Method: Uses vision instruction tuning data to contrastively fine-tune a pretrained vision encoder so that it produces conditional image representations from natural language instructions. Result: FocalLens representations emphasize the features of interest better than generic features and improve a range of downstream tasks, with average gains of 5 and 10 points on SugarCrepe and MMVP-VLM, respectively. Conclusion: Through conditional encoding, FocalLens addresses the contextual nature of visual understanding and improves task performance. Abstract: Visual understanding is inherently contextual -- what we focus on in an image depends on the task at hand. For instance, given an image of a person holding a bouquet of flowers, we may focus on either the person such as their clothing, or the type of flowers, depending on the context of interest. Yet, most existing image encoding paradigms represent an image as a fixed, generic feature vector, overlooking the potential needs of prioritizing varying visual information for different downstream use cases. In this work, we introduce FocalLens, a conditional visual encoding method that produces different representations for the same image based on the context of interest, expressed flexibly through natural language. We leverage vision instruction tuning data and contrastively finetune a pretrained vision encoder to take natural language instructions as additional inputs for producing conditional image representations. Extensive experiments validate that conditional image representations from FocalLens better pronounce the visual features of interest compared to generic features produced by standard vision encoders like CLIP. In addition, we show FocalLens further leads to performance improvements on a range of downstream tasks including image-image retrieval, image classification, and image-text retrieval, with an average gain of 5 and 10 points on the challenging SugarCrepe and MMVP-VLM benchmarks, respectively.
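
The FocalLens abstract says the encoder is contrastively fine-tuned but does not give the objective. As a hedged illustration only, the sketch below implements a generic InfoNCE-style loss in which the i-th instruction-conditioned image embedding should match the i-th target embedding. The function name, the temperature value, and the toy vectors are assumptions and need not match the authors' loss.

```python
import math

def info_nce(image_embs, target_embs, temperature=0.07):
    """Mean InfoNCE loss; image_embs[i] is the positive match for target_embs[i]."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    imgs = [normalize(v) for v in image_embs]
    tgts = [normalize(v) for v in target_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        # Cosine similarity of this image to every target, scaled by temperature.
        logits = [sum(a * b for a, b in zip(img, t)) / temperature for t in tgts]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(imgs)

# Toy embeddings: aligned pairs should give a much lower loss than swapped pairs.
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this kind pulls each conditioned image embedding toward its paired target and away from the other targets in the batch.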

Chao Qi,Jianqin Yin,Ren Zhang

Task: Explore and address catastrophic forgetting in cross-modal (image and point cloud) class incremental learning.

Motivation: Existing incremental learning methods perform poorly in cross-modal settings and cannot effectively handle modality gaps and forgetting.

Details

Method: Proposes the CMIP-CIL benchmark, pre-trains with a contrastive learning framework over masked point clouds and rendered multi-view images, and in the incremental stage freezes the backbone and optimizes object representations. Result: Experiments show the method performs strongly on the benchmark datasets and outperforms baseline methods by a large margin. Conclusion: The method relieves cross-modal catastrophic forgetting and improves the model's perceptual capability in dynamic environments. Abstract: Image-point class incremental learning helps the 3D-points-vision robots continually learn category knowledge from 2D images, improving their perceptual capability in dynamic environments. However, some incremental learning methods address unimodal forgetting but fail in cross-modal cases, while others handle modal differences within training/testing datasets but assume no modal gaps between them. We first explore this cross-modal task, proposing a benchmark CMIP-CIL and relieving the cross-modal catastrophic forgetting problem. It employs masked point clouds and rendered multi-view images within a contrastive learning framework in pre-training, empowering the vision model with the generalizations of image-point correspondence. In the incremental stage, by freezing the backbone and promoting object representations close to their respective prototypes, the model effectively retains and generalizes knowledge across previously seen categories while continuing to learn new ones. We conduct comprehensive experiments on the benchmark datasets. Experiments prove that our method achieves state-of-the-art results, outperforming the baseline methods by a large margin.

BOISHOMMO: Holistic Approach for Bangla Hate Speech

Md Abdullah Al Kafi,Sumit Kumar Banshal,Md Sadman Shakib,Showrov Azam,Tamanna Alam Tabashom

Task: Build and evaluate BOISHOMMO, a multi-label Bangla hate speech dataset, to fill a gap for low-resource languages.

Motivation: Existing datasets for low-resource languages such as Bangla lack comprehensiveness and do not adequately capture the multi-dimensional nature of hate speech.

Details

Method: Compiles and annotates a Bangla dataset covering multiple hate speech categories and evaluates it with several algorithmic approaches. Result: BOISHOMMO provides over two thousand annotated examples, highlights the complexities of processing non-Latin scripts, and assesses model performance. Conclusion: The multi-label dataset offers a richer and more diverse resource for hate speech detection and analysis in low-resource languages. Abstract: One of the most alarming issues in digital society is hate speech (HS) on social media. The severity is so high that researchers across the globe are captivated by this domain. A notable amount of work has been conducted to address the identification and alarm system. However, a noticeable gap exists, especially for low-resource languages. The lack of comprehensive datasets is the main problem for resource-constrained languages such as Bangla. Interestingly, hate speech or any particular speech has no single dimensionality. Similarly, the hate component can simultaneously have multiple abusive attributes, which seems to be missed in the existing datasets. Thus, a multi-label Bangla hate speech dataset named BOISHOMMO has been compiled and evaluated in this work. It includes categories of HS across race, gender, religion, politics, and more. With over two thousand annotated examples, BOISHOMMO provides a nuanced understanding of hate speech in Bangla and highlights the complexities of processing non-Latin scripts. Apart from evaluating with multiple algorithmic approaches, it also highlights the complexities of processing Bangla text and assesses model performance. This unique multi-label approach enriches future hate speech detection and analysis studies for low-resource languages by providing a more nuanced, diverse dataset.

SARFormer -- An Acquisition Parameter Aware Vision Transformer for Synthetic Aperture Radar Data

Jonathan Prexl,Michael Recla,Michael Schmitt

Task: Introducing SARFormer, a modified Vision Transformer (ViT) architecture for processing SAR images.

Motivation: Addressing the complex image geometry of SAR data and improving performance on downstream tasks.

Details

Method: Proposing an acquisition parameter encoding module and exploring self-supervised pre-training with limited labeled data. Result: Achieves up to 17% improvement in RMSE over baseline models. Conclusion: SARFormer effectively enhances SAR image processing performance. Abstract: This manuscript introduces SARFormer, a modified Vision Transformer (ViT) architecture designed for processing one or multiple synthetic aperture radar (SAR) images. Given the complex image geometry of SAR data, we propose an acquisition parameter encoding module that significantly guides the learning process, especially in the case of multiple images, leading to improved performance on downstream tasks. We further explore self-supervised pre-training, conduct experiments with limited labeled data, and benchmark our contribution and adaptations thoroughly in ablation experiments against a baseline, where the model is tested on tasks such as height reconstruction and segmentation. Our approach achieves up to 17% improvement in terms of RMSE over baseline models.

Task Memory Engine (TME): Enhancing State Awareness for Multi-Step LLM Agent Tasks

Ye Ye

Task: Propose the Task Memory Engine (TME), a lightweight structured memory module that tracks the execution state of multi-step tasks.

Motivation: Existing frameworks lack a structured understanding of task state in multi-step tasks, leading to brittle performance, frequent hallucinations, and poor long-range coherence.

Details

Method: Records task steps in a hierarchical Task Memory Tree (TMT) and generates LLM prompts through a dynamic prompt synthesis method. Result: TME improves task completion accuracy and behavioral interpretability with minimal implementation overhead. Conclusion: TME is an effective way to improve the execution consistency and contextual grounding of LLMs in multi-step tasks. Abstract: Large Language Models (LLMs) are increasingly used as autonomous agents for multi-step tasks. However, most existing frameworks fail to maintain a structured understanding of the task state, often relying on linear prompt concatenation or shallow memory buffers. This leads to brittle performance, frequent hallucinations, and poor long-range coherence. In this work, we propose the Task Memory Engine (TME), a lightweight and structured memory module that tracks task execution using a hierarchical Task Memory Tree (TMT). Each node in the tree corresponds to a task step, storing relevant input, output, status, and sub-task relationships. We introduce a prompt synthesis method that dynamically generates LLM prompts based on the active node path, significantly improving execution consistency and contextual grounding. Through case studies and comparative experiments on multi-step agent tasks, we demonstrate that TME leads to better task completion accuracy and more interpretable behavior with minimal implementation overhead. The full implementation of TME is available at https://github.com/biubiutomato/TME-Agent.
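
The TME abstract describes tree nodes that store input, output, status, and sub-task relationships, with prompts generated from the active node path. The following is a minimal sketch of that idea, not the TME-Agent implementation: the `TaskNode` class, its field names, the `active_path` and `synthesize_prompt` helpers, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TaskNode:
    """One task step in a hypothetical task memory tree."""
    step: str
    status: str = "pending"
    input: str = ""
    output: str = ""
    parent: Optional["TaskNode"] = field(default=None, repr=False, compare=False)
    children: List["TaskNode"] = field(default_factory=list)

    def add_child(self, step, **kwargs):
        """Attach a sub-task to this node and return it."""
        child = TaskNode(step=step, parent=self, **kwargs)
        self.children.append(child)
        return child

def active_path(node):
    """Return the nodes from the root down to `node`."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return list(reversed(path))

def synthesize_prompt(node):
    """Build a prompt from the active path only, ignoring unrelated branches."""
    lines = ["Task context (root to current step):"]
    for depth, n in enumerate(active_path(node)):
        lines.append(f"{'  ' * depth}- {n.step} [{n.status}] {n.output}".rstrip())
    lines.append(f"Continue with: {node.step}")
    return "\n".join(lines)

root = TaskNode("Book a trip", status="in_progress")
flight = root.add_child("Book flight", status="done", output="Flight LH123 confirmed")
hotel = root.add_child("Book hotel", status="in_progress")
room = hotel.add_child("Choose room type")
prompt = synthesize_prompt(room)
```

Because the prompt is built from the root-to-node path, the completed flight branch is left out when the agent is choosing a room, which is one way to keep the context focused on the current sub-task.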

Jian Wang,Rishabh Dabral,Diogo Luvizon,Zhe Cao,Lingjie Liu,Thabo Beeler,Christian Theobalt

Task: Develop Ego4o, a framework for simultaneous human motion capture and understanding from multi-modal egocentric inputs.

Motivation: Consumer wearable devices (such as VR/AR headsets, smart glasses, cellphones, and smartwatches) provide multi-modal sensor inputs (such as egocentric images and sparse IMU sensors) that are only intermittently available, which makes consistent motion capture and understanding difficult.

Details

Method: Encodes IMU sensor inputs, an optional egocentric image, and a text description of the motion into the latent space of a motion VQ-VAE, then decodes and optimizes the latents to track human motion; when no motion description is available, a multi-modal LLM generates one to improve capture accuracy. Result: Quantitative and qualitative evaluations show that the method predicts accurate human motion and high-quality motion descriptions. Conclusion: Ego4o performs well with multi-modal inputs and maintains performance when some inputs are missing, offering an effective solution for human motion capture and understanding. Abstract: This work focuses on tracking and understanding human motion using consumer wearable devices, such as VR/AR headsets, smart glasses, cellphones, and smartwatches. These devices provide diverse, multi-modal sensor inputs, including egocentric images, and 1-3 sparse IMU sensors in varied combinations. Motion descriptions can also accompany these signals. The diverse input modalities and their intermittent availability pose challenges for consistent motion capture and understanding. In this work, we present Ego4o (o for omni), a new framework for simultaneous human motion capture and understanding from multi-modal egocentric inputs. This method maintains performance with partial inputs while achieving better results when multiple modalities are combined. First, the IMU sensor inputs, the optional egocentric image, and text description of human motion are encoded into the latent space of a motion VQ-VAE. Next, the latent vectors are sent to the VQ-VAE decoder and optimized to track human motion. When motion descriptions are unavailable, the latent vectors can be input into a multi-modal LLM to generate human motion descriptions, which can further enhance motion capture accuracy. Quantitative and qualitative evaluations demonstrate the effectiveness of our method in predicting accurate human motion and high-quality motion descriptions.

Analyzing 16,193 LLM Papers for Fun and Profits

Zhiqiu Xia,Lang Zhu,Bingzhe Li,Feng Chen,Qiannan Li,Hang Liu

Task: Analyze publication trends of LLM-related papers across 77 top-tier computer science conferences over the past six years (2019-2024).

Motivation: Study how LLMs are reshaping the computer science research landscape and driving shifts in research priorities.

Details

Method: Analyzes the data from four perspectives: topic shifts within conferences, topic modeling to identify research areas, contribution patterns of academic and industrial institutions, and the influence of national origins on LLM development. Result: The multi-perspective analysis reveals the dynamics and evolution of the LLM research ecosystem and yields ten key insights. Conclusion: LLM research is substantially changing the direction of computer science research, and its development is shaped jointly by academic, industrial, and national factors. Abstract: Large Language Models (LLMs) are reshaping the landscape of computer science research, driving significant shifts in research priorities across diverse conferences and fields. This study provides a comprehensive analysis of the publication trend of LLM-related papers in 77 top-tier computer science conferences over the past six years (2019-2024). We approach this analysis from four distinct perspectives: (1) We investigate how LLM research is driving topic shifts within major conferences. (2) We adopt a topic modeling approach to identify various areas of LLM-related topic growth and reveal the topics of concern at different conferences. (3) We explore distinct contribution patterns of academic and industrial institutions. (4) We study the influence of national origins on LLM development trajectories. Synthesizing the findings from these diverse analytical angles, we derive ten key insights that illuminate the dynamics and evolution of the LLM research ecosystem.

Muon-Accelerated Attention Distillation for Real-Time Edge Synthesis via Optimized Latent Diffusion

Weiye Chen,Qingen Zhu,Qian Long

Task: Propose Muon-AD, a framework for real-time, high-quality visual synthesis on edge devices.

Motivation: Existing visual synthesis methods face computational and memory constraints when deployed on edge devices.

Details

Method: Combines the Muon optimizer with attention distillation, eliminates gradient conflicts through orthogonal parameter updates and dynamic pruning, and uses mixed-precision quantization and curriculum learning. Result: Muon-AD reports 3.2 times faster convergence, better synthesis quality (15% lower FID, 4% higher SSIM), and a 7GB peak memory footprint, and supports 24FPS real-time generation. Conclusion: Muon-AD offers an efficient solution for high-quality visual synthesis in resource-constrained environments. Abstract: Recent advances in visual synthesis have leveraged diffusion models and attention mechanisms to achieve high-fidelity artistic style transfer and photorealistic text-to-image generation. However, real-time deployment on edge devices remains challenging due to computational and memory constraints. We propose Muon-AD, a co-designed framework that integrates the Muon optimizer with attention distillation for real-time edge synthesis. By eliminating gradient conflicts through orthogonal parameter updates and dynamic pruning, Muon-AD achieves 3.2 times faster convergence compared to Stable Diffusion-TensorRT, while maintaining synthesis quality (15% lower FID, 4% higher SSIM). Our framework reduces peak memory to 7GB on Jetson Orin and enables 24FPS real-time generation through mixed-precision quantization and curriculum learning. Extensive experiments on COCO-Stuff and ImageNet-Texture demonstrate Muon-AD's Pareto-optimal efficiency-quality trade-offs. Here, we show a 65% reduction in communication overhead during distributed training and real-time 10s/image generation on edge GPUs. These advancements pave the way for democratizing high-quality visual synthesis in resource-constrained environments.

Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

Jialu Li,Shoubin Yu,Han Lin,Jaemin Cho,Jaehong Yoon,Mohit Bansal

Task: Propose Video-MSG, a guidance method for text-to-video (T2V) generation that needs no fine-tuning or additional memory, to improve how accurately generated videos follow the text description.

Motivation: Existing T2V models struggle to control spatial layouts or object trajectories accurately, and existing guidance methods require fine-tuning or additional memory, which makes large T2V models hard to adopt.

Details

Method: Video-MSG uses multimodal planning and structured noise initialization to create a Video Sketch in three steps and uses it to guide a downstream T2V diffusion model. Result: On multiple T2V benchmarks, Video-MSG improves text alignment without additional memory or fine-tuning. Conclusion: Video-MSG is an efficient, easy-to-integrate method that improves the quality and accuracy of T2V generation. Abstract: Recent advancements in text-to-video (T2V) diffusion models have significantly enhanced the visual quality of the generated videos. However, even recent T2V models find it challenging to follow text descriptions accurately, especially when the prompt requires accurate control of spatial layouts or object trajectories. A recent line of research uses layout guidance for T2V models that require fine-tuning or iterative manipulation of the attention map during inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video-MSG consists of three steps, where in the first two steps, Video-MSG creates Video Sketch, a fine-grained spatio-temporal plan for the final video, specifying background, foreground, and object trajectories, in the form of draft video frames. In the last step, Video-MSG guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Notably, Video-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models. Video-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench). We provide comprehensive ablation studies about noise inversion ratio, different background generators, background object detection, and foreground object segmentation.

Road Grip Uncertainty Estimation Through Surface State Segmentation

Jyri Maanpää,Julius Pesonen,Iaroslav Melekhov,Heikki Hyyti,Juha Hyyppä

Task: Benchmark several uncertainty prediction methods for road grip uncertainty estimation and propose a new approach based on road surface state segmentation.

Motivation: Slippery road conditions pose significant challenges for autonomous driving, and reliable estimation of grip uncertainty is crucial for safe vehicle control.

Details

Method: Proposes a new approach based on road surface state segmentation that produces a pixel-wise grip probability distribution from inferred road surface conditions. Result: Experiments indicate that the proposed approach makes grip uncertainty prediction more robust. Conclusion: The surface-state-segmentation approach yields more robust grip uncertainty prediction. Abstract: Slippery road conditions pose significant challenges for autonomous driving. Beyond predicting road grip, it is crucial to estimate its uncertainty reliably to ensure safe vehicle control. In this work, we benchmark several uncertainty prediction methods to assess their effectiveness for grip uncertainty estimation. Additionally, we propose a novel approach that leverages road surface state segmentation to predict grip uncertainty. Our method estimates a pixel-wise grip probability distribution based on inferred road surface conditions. Experimental results indicate that the proposed approach enhances the robustness of grip uncertainty prediction.
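
One way to read "a pixel-wise grip probability distribution based on inferred road surface conditions" is as a mixture: the segmentation network's class probabilities for a pixel weight a per-surface grip distribution. The sketch below illustrates that reading only. The surface classes, the grip bins, and every number in `GRIP_GIVEN_STATE` are invented for illustration, and the paper may combine the quantities differently.

```python
# Bin centres for the grip (friction) coefficient; values are illustrative.
GRIP_BINS = [0.1, 0.3, 0.5, 0.7, 0.9]

# Hypothetical grip distribution per surface state (each row sums to 1).
GRIP_GIVEN_STATE = {
    "dry":  [0.0, 0.0, 0.1, 0.4, 0.5],
    "wet":  [0.0, 0.1, 0.5, 0.3, 0.1],
    "snow": [0.2, 0.5, 0.3, 0.0, 0.0],
    "ice":  [0.7, 0.3, 0.0, 0.0, 0.0],
}

def pixel_grip_distribution(state_probs):
    """Mix per-state grip distributions using one pixel's segmentation probabilities."""
    mixture = [0.0] * len(GRIP_BINS)
    for state, p_state in state_probs.items():
        for i, p_grip in enumerate(GRIP_GIVEN_STATE[state]):
            mixture[i] += p_state * p_grip
    return mixture

def mean_and_std(distribution):
    """Expected grip and its standard deviation under a binned distribution."""
    mean = sum(b * p for b, p in zip(GRIP_BINS, distribution))
    variance = sum(p * (b - mean) ** 2 for b, p in zip(GRIP_BINS, distribution))
    return mean, variance ** 0.5

dry_mean, dry_std = mean_and_std(pixel_grip_distribution({"dry": 1.0}))
ice_mean, ice_std = mean_and_std(pixel_grip_distribution({"ice": 1.0}))
mixed = pixel_grip_distribution({"dry": 0.5, "ice": 0.5})
mixed_mean, mixed_std = mean_and_std(mixed)
```

Under this reading, a pixel the segmenter cannot resolve between dry and icy road gets a wide, bimodal grip distribution, so segmentation ambiguity shows up directly as grip uncertainty.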

Generating Fine Details of Entity Interactions

Xinyi Gu,Jiayuan Mao

Task: Generate high-fidelity images that involve interactions between multiple entities.

Motivation: Existing text-to-image models struggle to generate accurate interactions, likely because training data for uncommon object interactions is scarce.

Details

Method: Introduces the InterActing dataset and the DetailScribe method, which uses LLMs to decompose interaction concepts, a VLM to critique generated images, and targeted refinement within the diffusion process. Result: Automatic and human evaluations show significantly improved image quality, demonstrating the potential of enhanced inference strategies. Conclusion: The InterActing dataset and DetailScribe open a new direction for interaction-rich image generation, and the resources are publicly released. Abstract: Images not only depict objects but also encapsulate rich interactions between them. However, generating faithful and high-fidelity images involving multiple entities interacting with each other is a long-standing challenge. While pre-trained text-to-image models are trained on large-scale datasets to follow diverse text instructions, they struggle to generate accurate interactions, likely due to the scarcity of training data for uncommon object interactions. This paper introduces InterActing, an interaction-focused dataset with 1000 fine-grained prompts covering three key scenarios: (1) functional and action-based interactions, (2) compositional spatial relationships, and (3) multi-subject interactions. To address interaction generation challenges, we propose a decomposition-augmented refinement procedure. Our approach, DetailScribe, built on Stable Diffusion 3.5, leverages LLMs to decompose interactions into finer-grained concepts, uses a VLM to critique generated images, and applies targeted interventions within the diffusion process in refinement. Automatic and human evaluations show significantly improved image quality, demonstrating the potential of enhanced inference strategies. Our dataset and code are available at https://concepts-ai.com/p/detailscribe/ to facilitate future exploration of interaction-rich image generation.

Cut-and-Splat: Leveraging Gaussian Splatting for Synthetic Data Generation

Bram Vanherle,Brent Zoomers,Jeroen Put,Frank Van Reeth,Nick Michiels

Task: Use Gaussian Splatting to generate high-quality synthetic images as training data for computer vision models.

Motivation: Address the lack of realism in synthetic images caused by inaccurate 3D models and the difficulty of simulating lighting and camera effects.

Details

Method: Develops an automated pipeline that trains a Gaussian Splatting model from a video of the target object, extracts the object, renders it onto random backgrounds, and uses monocular depth estimation to place it in a believable pose. Result: Introduces a new dataset to validate the approach, which outperforms Cut-and-Paste and diffusion-model-based data generation. Conclusion: Gaussian Splatting can generate high-quality synthetic data that improves the training of computer vision models. Abstract: Generating synthetic images is a useful method for cheaply obtaining labeled data for training computer vision models. However, obtaining accurate 3D models of relevant objects is necessary, and the resulting images often have a gap in realism due to challenges in simulating lighting effects and camera artifacts. We propose using the novel view synthesis method called Gaussian Splatting to address these challenges. We have developed a synthetic data pipeline for generating high-quality context-aware instance segmentation training data for specific objects. This process is fully automated, requiring only a video of the target object. We train a Gaussian Splatting model of the target object and automatically extract the object from the video. Leveraging Gaussian Splatting, we then render the object on a random background image, and monocular depth estimation is employed to place the object in a believable pose. We introduce a novel dataset to validate our approach and show superior performance over other data generation approaches, such as Cut-and-Paste and Diffusion model-based generation.

DocAgent: A Multi-Agent System for Automated Code Documentation Generation

Dayu Yang,Antoine Simoulin,Xin Qian,Xiaoyi Liu,Yuwei Cao,Zhaopu Teng,Grey Yang

Task: Propose DocAgent, an automated code documentation generation system based on multi-agent collaboration and topological code processing.

Motivation: High-quality code documentation is crucial for software development, but existing LLM-based approaches often produce incomplete, unhelpful, or factually incorrect documentation.

Details

Method: Uses a multi-agent collaborative system (Reader, Searcher, Writer, Verifier, Orchestrator) together with topological code processing to build context incrementally and generate documentation. Result: Experiments show that DocAgent significantly outperforms baselines on completeness, helpfulness, and truthfulness, and that the topological processing order plays a vital role. Conclusion: DocAgent offers a reliable approach to automated documentation generation for complex and proprietary repositories. Abstract: High-quality code documentation is crucial for software development, especially in the era of AI. However, generating it automatically using Large Language Models (LLMs) remains challenging, as existing approaches often produce incomplete, unhelpful, or factually incorrect outputs. We introduce DocAgent, a novel multi-agent collaborative system using topological code processing for incremental context building. Specialized agents (Reader, Searcher, Writer, Verifier, Orchestrator) then collaboratively generate documentation. We also propose a multi-faceted evaluation framework assessing Completeness, Helpfulness, and Truthfulness. Comprehensive experiments show DocAgent significantly outperforms baselines consistently. Our ablation study confirms the vital role of the topological processing order. DocAgent offers a robust approach for reliable code documentation generation in complex and proprietary repositories.
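
The "topological code processing for incremental context building" in the DocAgent abstract can be illustrated by ordering components so that each one is documented after the components it depends on. The sketch below uses Python's standard `graphlib.TopologicalSorter`; the dependency graph, the component names, and the placeholder `document` function are hypothetical, and the dependencies-first direction is an assumption about how the ordering is used.

```python
from graphlib import TopologicalSorter

def documentation_order(dependencies):
    """Order components so every dependency comes before its dependents.

    `dependencies` maps a component name to the set of components it uses.
    """
    return list(TopologicalSorter(dependencies).static_order())

def document(component, dependency_docs):
    """Placeholder for a documentation step (an LLM call in a real system)."""
    context = ", ".join(sorted(dependency_docs)) or "nothing"
    return f"{component}: documented with context from {context}"

# Hypothetical dependency graph of a small repository.
deps = {
    "app.main": {"service.process", "utils.load"},
    "service.process": {"utils.parse"},
    "utils.load": set(),
    "utils.parse": set(),
}

order = documentation_order(deps)
docs = {}
for component in order:
    # Docs for all dependencies already exist, so they can serve as context.
    available = {d: docs[d] for d in deps[component]}
    docs[component] = document(component, available)
```

Processing in this order means the docstring for `app.main` can draw on the already-written documentation of `service.process` and `utils.load` rather than on their raw source.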

A Hybrid Fully Convolutional CNN-Transformer Model for Inherently Interpretable Medical Image Classification

Kerol Djoumessi,Samuel Ofosu Mensah,Philipp Berens

Task: Develop an interpretable hybrid CNN-Transformer architecture for medical image classification.

Motivation: Existing hybrid CNN-ViT models are difficult to interpret, which limits their application in medical imaging.

Details

Method: Proposes an interpretable-by-design hybrid fully convolutional CNN-Transformer architecture that generates faithful and localized evidence maps. Result: On two medical image classification tasks, the model achieves state-of-the-art predictive performance and provides class-specific sparse evidence maps. Conclusion: The model combines the strengths of CNNs and Transformers while delivering both high performance and interpretability. Abstract: In many medical imaging tasks, convolutional neural networks (CNNs) efficiently extract local features hierarchically. More recently, vision transformers (ViTs) have gained popularity, using self-attention mechanisms to capture global dependencies, but lacking the inherent spatial localization of convolutions. Therefore, hybrid models combining CNNs and ViTs have been developed to combine the strengths of both architectures. However, such hybrid CNN-ViT models are difficult to interpret, which hinders their application in medical imaging. In this work, we introduce an interpretable-by-design hybrid fully convolutional CNN-Transformer architecture for medical image classification. Unlike widely used post-hoc saliency methods for ViTs, our approach generates faithful and localized evidence maps that directly reflect the model's decision process. We evaluated our method on two medical image classification tasks using color fundus images. Our model not only achieves state-of-the-art predictive performance compared to both black-box and interpretable models but also provides class-specific sparse evidence maps in a single forward pass. The code is available at: https://anonymous.4open.science/r/Expl-CNN-Transformer/.

Towards an Understanding of Context Utilization in Code Intelligence

Yanlin Wang,Kefeng Duan,Dewu Zheng,Ensheng Shi,Fengji Zhang,Yanli Wang,Jiachi Chen,Xilin Liu,Yuchi Ma,Hongyu Zhang,Qianxiang Wang,Zibin Zheng

Task: Systematically analyze the role of contextual information in code intelligence.

Motivation: Research indicates that contextual information can substantially improve model performance, yet a systematic analysis of context in code intelligence is lacking.

Details

Method: A literature review of 146 relevant studies published between September 2007 and August 2024. Result: Four main contributions: a quantitative analysis of the research landscape, a new taxonomy of context types, a task-oriented analysis, and a critical evaluation of evaluation methodologies. Conclusion: Summarizes the challenges of context utilization in current code intelligence systems and identifies key opportunities for future research. Abstract: Code intelligence is an emerging domain in software engineering, aiming to improve the effectiveness and efficiency of various code-related tasks. Recent research suggests that incorporating contextual information beyond the basic original task inputs (i.e., source code) can substantially enhance model performance. Such contextual signals, which may be obtained directly or indirectly from sources such as API documentation or intermediate representations like abstract syntax trees, can significantly improve the effectiveness of code intelligence. Despite growing academic interest, there is a lack of systematic analysis of context in code intelligence. To address this gap, we conduct an extensive literature review of 146 relevant studies published between September 2007 and August 2024. Our investigation yields four main contributions: (1) a quantitative analysis of the research landscape, including publication trends, venues, and the explored domains; (2) a novel taxonomy of context types used in code intelligence; (3) a task-oriented analysis investigating context integration strategies across diverse code intelligence tasks; (4) a critical evaluation of evaluation methodologies for context-aware methods. Based on these findings, we identify fundamental challenges in context utilization in current code intelligence systems and propose a research roadmap that outlines key opportunities for future research.

Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions

Tommaso Galliena,Tommaso Apicella,Stefano Rosa,Pietro Morerio,Alessio Del Bue,Lorenzo Natale

Task: Propose a self-supervised method that improves an agent's ability to describe arbitrary objects while exploring a generic environment.

Motivation: Current models struggle to produce coherent image captions because of differing camera viewpoints and clutter.

Details

Method: A three-phase framework: 1) an agent explores the environment and collects noisy image-caption pairs; 2) a consistent pseudo-caption is distilled via an LLM-based consensus mechanism; 3) an existing captioning model is fine-tuned with the addition of contrastive learning. Result: Experiments show that a policy can be trained to mine samples with higher disagreement, that the pseudo-captioning method combined with all policies achieves higher semantic similarity than existing methods, and that fine-tuning significantly improves caption accuracy and consistency. Conclusion: The proposed method improves the accuracy and consistency of image captions, particularly in complex environments. Abstract: We present a self-supervised method to improve an agent's abilities in describing arbitrary objects while actively exploring a generic environment. This is a challenging problem, as current models struggle to obtain coherent image captions due to different camera viewpoints and clutter. We propose a three-phase framework to fine-tune existing captioning models that enhances caption accuracy and consistency across views via a consensus mechanism. First, an agent explores the environment, collecting noisy image-caption pairs. Then, a consistent pseudo-caption for each object instance is distilled via consensus using a large language model. Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with the addition of contrastive learning. We analyse the performance of the combination of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies, on our manually labeled test set. Results show that a policy can be trained to mine samples with higher disagreement compared to classical baselines. Our pseudo-captioning method, in combination with all policies, has a higher semantic similarity compared to other existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin. Code and test set annotations are available at https://hsp-iit.github.io/embodied-captioning/

Datasets for Lane Detection in Autonomous Driving: A Comprehensive Review

Jörg Gamerdinger,Sven Teufel,Oliver Bringmann

Task: Provide a comprehensive review and systematic analysis of more than 30 publicly available lane detection datasets.

Motivation: Accurate lane detection is essential for automated driving, but existing datasets differ in data volume, sensor types, annotation granularity, and other factors, so a systematic overview is needed to support research and development.

Details

Method: Classifies and analyzes the datasets by key factors (such as sensor resolution, annotation types, and diversity of road and weather conditions) to assess their characteristics and limitations. Result: Summarizes the advantages and limitations of existing datasets and identifies research gaps and directions for future improvement. Conclusion: The survey serves as a reference for researchers choosing lane detection datasets and contributes to the broader goal of advancing autonomous driving. Abstract: Accurate lane detection is essential for automated driving, enabling safe and reliable vehicle navigation in a variety of road scenarios. Numerous datasets have been introduced to support the development and evaluation of lane detection algorithms, each differing in terms of the amount of data, sensor types, annotation granularity, environmental conditions, and scenario diversity. This paper provides a comprehensive review of over 30 publicly available lane detection datasets, systematically analysing their characteristics, advantages and limitations. We classify these datasets based on key factors such as sensor resolution, annotation types and diversity of road and weather conditions. By identifying existing challenges and research gaps, we highlight opportunities for future dataset improvements that can further drive innovation in robust lane detection. This survey serves as a resource for researchers seeking appropriate datasets for lane detection, and contributes to the broader goal of advancing autonomous driving.

Discriminator-Free Direct Preference Optimization for Video Diffusion

Haoran Cheng,Qide Dong,Liang Peng,Zhizhou Sha,Weiguo Feng,Jinghui Xie,Zhao Song,Shilei Wen,Xiaofei He,Boxi Wu

Task: Propose a discriminator-free video Direct Preference Optimization (DPO) framework that addresses the data inefficiency and evaluation uncertainty of aligning video diffusion models with human preferences.

Motivation: Conventional DPO for video generation suffers from data inefficiency (generating large numbers of videos is costly) and evaluation uncertainty (human annotation is subjective, and automated discriminators miss subtle temporal artifacts).

Details

Method: Uses real videos as win cases and their edited versions (e.g., reversed, shuffled, or noise-corrupted) as lose cases, training the video diffusion model to distinguish and avoid the artifacts that editing introduces. Result: Experiments on CogVideoX demonstrate the method's efficiency; it needs no synthetic video comparisons, provides unambiguous quality signals, and allows training data to be expanded through simple editing operations. Conclusion: The framework is supported both theoretically and experimentally and addresses key challenges of video DPO. Abstract: Direct Preference Optimization (DPO), which aligns models with human preferences through win/lose data pairs, has achieved remarkable success in language and image generation. However, applying DPO to video diffusion models faces critical challenges: (1) Data inefficiency. Generating thousands of videos per DPO iteration incurs prohibitive costs; (2) Evaluation uncertainty. Human annotations suffer from subjective bias, and automated discriminators fail to detect subtle temporal artifacts like flickering or motion incoherence. To address these, we propose a discriminator-free video DPO framework that: (1) Uses original real videos as win cases and their edited versions (e.g., reversed, shuffled, or noise-corrupted clips) as lose cases; (2) Trains video diffusion models to distinguish and avoid artifacts introduced by editing. This approach eliminates the need for costly synthetic video comparisons, provides unambiguous quality signals, and enables unlimited training data expansion through simple editing operations. We theoretically prove the framework's effectiveness even when real videos and model-generated videos follow different distributions. Experiments on CogVideoX demonstrate the efficiency of the proposed method.
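
The pair construction in the abstract (real clip as the win case, an edited copy as the lose case) is simple to sketch. In the example below a clip is a list of frames and each frame is a flat list of pixel values; that representation, the helper names, and the noise level are assumptions. The `dpo_loss` function is the standard DPO loss on log-likelihoods and is included only as a reference point, since the abstract does not state the exact objective used for video diffusion.

```python
import math
import random

def reverse_clip(frames):
    """Play the clip backwards."""
    return frames[::-1]

def shuffle_clip(frames, rng):
    """Randomly permute the frame order, breaking temporal coherence."""
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled

def noise_clip(frames, rng, sigma=0.1):
    """Add Gaussian noise to every pixel value."""
    return [[px + rng.gauss(0.0, sigma) for px in frame] for frame in frames]

def make_preference_pair(frames, rng):
    """Pair a real clip (win) with a randomly edited copy (lose)."""
    edit = rng.choice(["reverse", "shuffle", "noise"])
    if edit == "reverse":
        lose = reverse_clip(frames)
    elif edit == "shuffle":
        lose = shuffle_clip(frames, rng)
    else:
        lose = noise_clip(frames, rng)
    return {"win": frames, "lose": lose, "edit": edit}

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * margin)."""
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    return math.log1p(math.exp(-beta * margin))

# A toy 4-frame clip with two "pixels" per frame.
clip = [[0.1 * t, 0.2 * t] for t in range(4)]
pair = make_preference_pair(clip, random.Random(0))
```

Because the lose case is produced by a known edit, no human rater or learned discriminator is needed to decide which clip of the pair is preferred.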

Proxy-Anchor and EVT-Driven Continual Learning Method for Generalized Category Discovery

Alireza Fathalizadeh,Roozbeh Razavi-Far

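Task: Propose a method that combines Extreme Value Theory (EVT) with proxy anchors for continual generalized category discovery, separating novel samples and reducing redundant proxies.

Motivation: Address the discovery of novel categories and the forgetting of old categories in continual learning, while improving model performance and reducing redundancy.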

Details

Method: Integrates EVT with proxy anchors, defining boundaries around proxies and rejecting unknown samples through a probability of inclusion function; proposes an EVT-based loss function to strengthen representation learning; and adds experience replay and knowledge distillation to prevent forgetting. Result: Outperforms existing methods on continual generalized category discovery. Conclusion: The proposed method effectively addresses the challenges of discovering novel categories and forgetting old ones, while improving model performance. Abstract: Continual generalized category discovery has been introduced and studied in the literature as a method that aims to continuously discover and learn novel categories in incoming data batches while avoiding catastrophic forgetting of previously learned categories. A key component in addressing this challenge is the model's ability to separate novel samples, where Extreme Value Theory (EVT) has been effectively employed. In this work, we propose a novel method that integrates EVT with proxy anchors to define boundaries around proxies using a probability of inclusion function, enabling the rejection of unknown samples. Additionally, we introduce a novel EVT-based loss function to enhance the learned representation, achieving superior performance compared to other deep-metric learning methods in similar settings. Using the derived probability functions, novel samples are effectively separated from previously known categories. However, category discovery within these novel samples can sometimes overestimate the number of new categories. To mitigate this issue, we propose a novel EVT-based approach to reduce the model size and discard redundant proxies. We also incorporate experience replay and knowledge distillation mechanisms during the continual learning stage to prevent catastrophic forgetting. Experimental results demonstrate that our proposed approach outperforms state-of-the-art methods in continual generalized category discovery scenarios.
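A probability of inclusion function can be sketched as follows. The abstract does not give the functional form, so this example assumes a Weibull-style survival function of the distance to a proxy, as is common in EVT-based open-set methods; the `scale`, `shape`, and `threshold` parameters are hypothetical and would normally be fitted per proxy.

```python
import math


def probability_of_inclusion(distance, scale, shape):
    """Assumed Weibull-style form: 1.0 at the proxy, decaying with distance."""
    if distance < 0 or scale <= 0 or shape <= 0:
        raise ValueError("distance must be >= 0; scale and shape must be > 0")
    return math.exp(-((distance / scale) ** shape))


def is_novel(distances_to_proxies, scale, shape, threshold=0.5):
    """Reject a sample as unknown if no proxy includes it with enough probability."""
    best = max(probability_of_inclusion(d, scale, shape) for d in distances_to_proxies)
    return best < threshold
```

A sample far from every proxy receives a low inclusion probability everywhere and is flagged as novel; a sample close to any single proxy is kept as known.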

Shadow Erosion and Nighttime Adaptability for Camera-Based Automated Driving Applications

Mohamed Sabry,Gregory Schroeder,Joshua Varughese,Cristina Olaverri-Monreal

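Task: Propose a Shadow Erosion and Nighttime Adaptability image enhancement pipeline for automated driving applications.

Motivation: RGB camera images are widely used in fields such as medical imaging, satellite imaging, and automated driving, so improving image quality matters, especially under challenging lighting conditions.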

Details

Method: Proposes the Shadow Erosion and Nighttime Adaptability pipeline and compares it with the widely used CLAHE technique, evaluating both on illumination uniformity and visual perception quality metrics. Result: The method significantly outperforms CLAHE and improves the performance of a YOLO-based drivable area segmentation algorithm. Conclusion: The pipeline effectively improves image quality for automated driving applications, particularly in shadow and nighttime conditions. Abstract: Enhancement of images from RGB cameras is of particular interest due to its wide range of ever-increasing applications such as medical imaging, satellite imaging, automated driving, etc. In autonomous driving, various techniques are used to enhance image quality under challenging lighting conditions. These include artificial augmentation to improve visibility in poor nighttime conditions, illumination-invariant imaging to reduce the impact of lighting variations, and shadow mitigation to ensure consistent image clarity in bright daylight. This paper proposes a pipeline for Shadow Erosion and Nighttime Adaptability in images for automated driving applications while preserving color and texture details. The Shadow Erosion and Nighttime Adaptability pipeline is compared to the widely used CLAHE technique and evaluated based on illumination uniformity and visual perception quality metrics. The results also demonstrate a significant improvement over CLAHE, enhancing a YOLO-based drivable area segmentation algorithm.

Banana Ripeness Level Classification using a Simple CNN Model Trained with Real and Synthetic Datasets

Luis Chuquimarca,Boris Vintimilla,Sergio Velastin

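Task: Train a simple CNN model on a combination of synthetic and real data to accurately classify banana ripeness levels.

Motivation: Industrial assessment of banana ripeness still relies on manual methods. CNN models are hard to train reliably because data is scarce, although prior work reports acceptable accuracy.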

Details

Method: Generates a robust dataset combining real and synthetic data, proposes a simple CNN architecture, and improves the model with transfer learning to classify real data. Result: The proposed CNN model, tested across several architectures and hyperparameter configurations, reaches a high accuracy of 0.917 with a fast execution time. Conclusion: A CNN model trained on combined synthetic and real data classifies banana ripeness effectively, with high accuracy and practical value. Abstract: The level of ripeness is essential in determining the quality of bananas. To correctly estimate banana maturity, the metrics of international marketing standards need to be considered. However, the process of assessing the maturity of bananas at an industrial level is still carried out using manual methods. The use of CNN models is an attractive tool to solve the problem, but there is a limitation regarding the availability of sufficient data to train these models reliably. On the other hand, in the state-of-the-art, existing CNN models and the available data have reported that the accuracy results are acceptable in identifying banana maturity. For this reason, this work presents the generation of a robust dataset that combines real and synthetic data for different levels of banana ripeness. In addition, it proposes a simple CNN architecture, which is trained with synthetic data; using the transfer learning technique, the model is then improved to classify real data, managing to determine the level of maturity of the banana. The proposed CNN model is evaluated with several architectures, then hyper-parameter configurations are varied, and optimizers are used. The results show that the proposed CNN model reaches a high accuracy of 0.917 and a fast execution time.

Knowledge Distillation for Multimodal Egocentric Action Recognition Robust to Missing Modalities

Maria Santos-Villafranca,Dustin Carrión-Ojeda,Alejandro Perez-Yus,Jesus Bermudez-Cameo,Jose J. Guerrero,Simone Schaub-Meyer

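Task: Propose an efficient multimodal knowledge distillation approach (KARMMA) to handle missing modalities in egocentric action recognition.

Motivation: Existing methods usually rely on a single modality (such as video), and multimodal methods degrade when a modality is missing, so a method robust to missing modalities is needed.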

Details

Method: Builds a teacher model from pre-trained models used as unimodal feature extractors and distills its knowledge into a much smaller and faster student model. Result: Experiments on the Epic-Kitchens and Something-Something datasets show that the student model handles missing modalities effectively while reducing the accuracy drop. Conclusion: KARMMA is robust and efficient for multimodal action recognition and suits scenarios with missing modalities. Abstract: Action recognition is an essential task in egocentric vision due to its wide range of applications across many fields. While deep learning methods have been proposed to address this task, most rely on a single modality, typically video. However, including additional modalities may improve the robustness of the approaches to common issues in egocentric videos, such as blurriness and occlusions. Recent efforts in multimodal egocentric action recognition often assume the availability of all modalities, leading to failures or performance drops when any modality is missing. To address this, we introduce an efficient multimodal knowledge distillation approach for egocentric action recognition that is robust to missing modalities (KARMMA) while still benefiting when multiple modalities are available. Our method focuses on resource-efficient development by leveraging pre-trained models as unimodal feature extractors in our teacher model, which distills knowledge into a much smaller and faster student model. Experiments on the Epic-Kitchens and Something-Something datasets demonstrate that our student model effectively handles missing modalities while reducing its accuracy drop in this scenario.
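Teacher-to-student distillation is commonly implemented as a divergence between temperature-softened class distributions. The abstract does not specify KARMMA's loss, so the sketch below shows the classic soft-label formulation as an assumption; the function names and the default temperature are illustrative only.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it pulls the small student toward the large teacher's predictions.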

FMLGS: Fast Multilevel Language Embedded Gaussians for Part-level Interactive Agents

Xin Tan,Yuzhou Ji,He Zhu,Yuan Xie

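Task: Propose FMLGS, an approach that supports part-level open-vocabulary queries within 3D Gaussian Splatting (3DGS).

Motivation: Address language ambiguity and the degraded quality of object-component queries in multi-granularity interaction.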

Details

Method: Builds an efficient pipeline for constructing and querying consistent object- and part-level semantics based on Segment Anything Model 2 (SAM2), and designs a semantic deviation strategy to resolve language ambiguity. Result: FMLGS outperforms existing methods in both speed and accuracy: it is 98x faster than LERF, 4x faster than LangSplat, and 2.5x faster than LEGaussians. Conclusion: FMLGS shows potential for interactively navigating 3D scenes, locating targets, and responding to user demands, with room for further extension and application. Abstract: The semantically interactive radiance field has long been a promising backbone for 3D real-world applications, such as embodied AI to achieve scene understanding and manipulation. However, multi-granularity interaction remains a challenging task due to the ambiguity of language and degraded quality when it comes to queries upon object components. In this work, we present FMLGS, an approach that supports part-level open-vocabulary query within 3D Gaussian Splatting (3DGS). We propose an efficient pipeline for building and querying consistent object- and part-level semantics based on Segment Anything Model 2 (SAM2). We designed a semantic deviation strategy to solve the problem of language ambiguity among object parts, which interpolates the semantic features of fine-grained targets for enriched information. Once trained, we can query both objects and their describable parts using natural language. Comparisons with other state-of-the-art methods prove that our method can not only better locate specified part-level targets, but also achieve first-place performance concerning both speed and accuracy, where FMLGS is 98x faster than LERF, 4x faster than LangSplat and 2.5x faster than LEGaussians. Meanwhile, we further integrate FMLGS as a virtual agent that can interactively navigate through 3D scenes, locate targets, and respond to user demands through a chat interface, which demonstrates the potential of our work to be further expanded and applied in the future.

Boosting multi-demographic federated learning for chest x-ray analysis using general-purpose self-supervised representations

Mahshad Lotfinia,Arash Tayebiarasteh,Samaneh Samiei,Mehdi Joodaki,Soroosh Tayebi Arasteh

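Task: Study how federated learning (FL) performs on non-independent and identically distributed (non-IID) medical image data, with particular attention to the challenges of pediatric data.

Motivation: Address the performance degradation of FL on non-IID data (especially pediatric data) and explore how self-supervised learning can improve FL performance.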

Details

Method: Applies transfer learning from self-supervised image representations, together with state-of-the-art vision transformers, to adult (n=398,523) and pediatric (n=9,125) chest radiographs. Result: FL improved performance on smaller adult datasets (P<0.001) but degraded it on larger datasets (P<0.064) and pediatric data (P=0.242); with self-supervised weights, performance improved significantly on pediatric data (P=0.031) and most adult datasets (P<0.008). Conclusion: Self-supervised image representations can effectively address the non-IID challenges of FL and are especially valuable for pediatric medical data. Abstract: Reliable artificial intelligence (AI) models for medical image analysis often depend on large and diverse labeled datasets. Federated learning (FL) offers a decentralized and privacy-preserving approach to training but struggles in highly non-independent and identically distributed (non-IID) settings, where institutions with more representative data may experience degraded performance. Moreover, existing large-scale FL studies have been limited to adult datasets, neglecting the unique challenges posed by pediatric data, which introduces additional non-IID variability. To address these limitations, we analyzed n=398,523 adult chest radiographs from diverse institutions across multiple countries and n=9,125 pediatric images, leveraging transfer learning from general-purpose self-supervised image representations to classify pneumonia and cases with no abnormality. Using state-of-the-art vision transformers, we found that FL improved performance only for smaller adult datasets (P<0.001) but degraded performance for larger datasets (P<0.064) and pediatric cases (P=0.242). However, equipping FL with self-supervised weights significantly enhanced outcomes across pediatric cases (P=0.031) and most adult datasets (P<0.008), except the largest dataset (P=0.052). These findings underscore the potential of easily deployable general-purpose self-supervised image representations to address non-IID challenges in clinical FL applications and highlight their promise for enhancing patient outcomes and advancing pediatric healthcare, where data scarcity and variability remain persistent obstacles.
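The abstract does not name the aggregation rule, so the sketch below assumes FedAvg-style weighting by sample count to illustrate why a small site can be under-represented in the global model. Treating all adult data as a single client is a simplification for the arithmetic (the study draws adult images from multiple institutions), and `fedavg` is a hypothetical helper.

```python
def fedavg(client_weights, client_sizes):
    """Average per-client parameter vectors, weighting each by its sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Sample counts quoted in the abstract.
adult_n, pediatric_n = 398_523, 9_125

# Share of the aggregate contributed by pediatric data under size weighting.
pediatric_share = pediatric_n / (adult_n + pediatric_n)
```

Under this assumption the pediatric data contributes only about 2% of the weighted average, which is one plausible reason a shared model may fit pediatric cases poorly without a stronger initialization.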

Hardware, Algorithms, and Applications of the Neuromorphic Vision Sensor: a Review

Claudio Cimarelli,Jose Andres Millan-Romera,Holger Voos,Jose Luis Sanchez-Lopez

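Task: Systematically survey the technological development, algorithms, and applications of neuromorphic vision (event cameras).

Motivation: Examine how event cameras differ from conventional cameras and what advantages they may offer, while identifying the challenges in their algorithms and applications.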

Details

Method: Analyzes the field along three dimensions: hardware evolution, algorithms designed for event data (such as feature detection, tracking, and optical flow), and practical application case studies. Result: Summarizes the technical characteristics, algorithmic progress, and successful applications of event cameras, and analyzes the challenges they face. Conclusion: Neuromorphic vision has broad application prospects, but challenges such as algorithm adaptation and wider adoption remain. Abstract: Neuromorphic, or event, cameras represent a transformation in the classical approach to visual sensing: they encode detected instantaneous per-pixel illumination changes into an asynchronous stream of event packets. Their novelty compared to standard cameras lies in the transition from capturing full picture frames at fixed time intervals to a sparse data format which, with its distinctive qualities, offers potential improvements in various applications. However, these advantages come at the cost of reinventing algorithmic procedures or adapting them to effectively process the new data format. In this survey, we systematically examine neuromorphic vision along three main dimensions. First, we highlight the technological evolution and distinctive hardware features of neuromorphic cameras from their inception to recent models. Second, we review image processing algorithms developed explicitly for event-based data, covering key works on feature detection, tracking, and optical flow (which form the basis for analyzing image elements and transformations), as well as depth and pose estimation or object recognition, which interpret more complex scene structures and components. These techniques, drawn from classical computer vision and modern data-driven approaches, are examined to illustrate the breadth of applications for event-based cameras. Third, we present practical application case studies demonstrating how event cameras have been successfully used across various industries and scenarios. Finally, we analyze the challenges limiting widespread adoption, identify significant research gaps compared to standard imaging techniques, and outline promising future directions and opportunities that neuromorphic vision offers.

ZipIR: Latent Pyramid Diffusion Transformer for High-Resolution Image Restoration

Yongsheng Yu,Haitian Zheng,Zhifei Zhang,Jianming Zhang,Yuqian Zhou,Connelly Barnes,Yuchen Liu,Wei Xiong,Zhe Lin,Jiebo Luo

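Task: Propose the ZipIR framework to address the quality-efficiency trade-off in high-resolution image restoration.

Motivation: Existing generative models face a trade-off between quality and efficiency in high-resolution image restoration because of their heavy computational demands.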

Details

Method: Uses a highly compressed latent representation (32x compression) and a Latent Pyramid VAE design, combined with a Diffusion Transformer (DiT). Result: ZipIR surpasses existing methods at restoring images up to 2K resolution, with faster speed and higher quality. Conclusion: Through its efficient latent representation and LP-VAE design, ZipIR significantly improves high-resolution image restoration. Abstract: Recent progress in generative models has significantly improved image restoration capabilities, particularly through powerful diffusion models that offer remarkable recovery of semantic details and local fidelity. However, deploying these models at ultra-high resolutions faces a critical trade-off between quality and efficiency due to the computational demands of long-range attention mechanisms. To address this, we introduce ZipIR, a novel framework that enhances efficiency, scalability, and long-range modeling for high-res image restoration. ZipIR employs a highly compressed latent representation that compresses image 32x, effectively reducing the number of spatial tokens, and enabling the use of high-capacity models like the Diffusion Transformer (DiT). Toward this goal, we propose a Latent Pyramid VAE (LP-VAE) design that structures the latent space into sub-bands to ease diffusion training. Trained on full images up to 2K resolution, ZipIR surpasses existing diffusion-based methods, offering unmatched speed and quality in restoring high-resolution images from severely degraded inputs.
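The effect of 32x compression on the number of spatial tokens can be checked with simple arithmetic. The sketch assumes that "32x" is a per-side downsampling factor, that a 2K image is 2048x2048, that each latent position becomes one token, and that 8x is a typical latent-diffusion baseline for comparison; none of these details are stated in the abstract.

```python
def latent_tokens(height, width, downsample):
    """Number of spatial tokens after downsampling each side by `downsample`."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (height // downsample) * (width // downsample)


tokens_8x = latent_tokens(2048, 2048, 8)    # 256 * 256 latent positions
tokens_32x = latent_tokens(2048, 2048, 32)  # 64 * 64 latent positions

# Token count shrinks by the square of the ratio between the two factors.
token_reduction = tokens_8x // tokens_32x

# Full self-attention compares every token pair, so its cost shrinks quadratically.
attention_pair_reduction = token_reduction ** 2
```

Under these assumptions the token count drops from 65,536 to 4,096, a 16-fold reduction, and the number of attention pairs drops 256-fold, which is what makes long-range attention affordable at this resolution.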

Hands-On: Segmenting Individual Signs from Continuous Sequences

Low Jian He,Harry Walsh,Ozge Mercanoglu Sincan,Richard Bowden

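Task: Address the challenge of continuous sign language segmentation, a key task for sign language translation and data annotation.

Motivation: Continuous sign language segmentation has major implications for sign language translation and data annotation.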

Details

Method: Proposes a transformer-based architecture that frames segmentation as a sequence labeling problem using the BIO tagging scheme, combining HaMeR hand features with 3D Angles. Result: Achieves state-of-the-art results on the DGS Corpus, and the features surpass prior benchmarks on BSLCorpus. Conclusion: The proposed method performs well on continuous sign language segmentation and offers an effective solution for related fields. Abstract: This work tackles the challenge of continuous sign language segmentation, a key task with huge implications for sign language translation and data annotation. We propose a transformer-based architecture that models the temporal dynamics of signing and frames segmentation as a sequence labeling problem using the Begin-In-Out (BIO) tagging scheme. Our method leverages the HaMeR hand features, and is complemented with 3D Angles. Extensive experiments show that our model achieves state-of-the-art results on the DGS Corpus, while our features surpass prior benchmarks on BSLCorpus.
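Framing segmentation as BIO sequence labeling means each frame receives one of three tags, and sign boundaries are recovered by decoding the tag sequence. The decoder below is a minimal sketch under that reading; the function name and the lenient handling of an `I` tag without a preceding `B` are assumptions, not details from the paper.

```python
def bio_to_segments(tags):
    """Convert per-frame B/I/O tags into half-open (start, end) frame ranges."""
    segments = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new sign begins; close any sign that is still open.
            if start is not None:
                segments.append((start, i))
            start = i
        elif tag == "O":
            # Outside any sign; close the open sign, if there is one.
            if start is not None:
                segments.append((start, i))
                start = None
        elif tag == "I":
            # Inside a sign; tolerate a missing B by opening a segment here.
            if start is None:
                start = i
        else:
            raise ValueError(f"unknown tag: {tag}")
    if start is not None:
        segments.append((start, len(tags)))
    return segments
```

The explicit `B` tag is what allows two signs with no pause between them to be separated, which a plain inside/outside labeling could not express.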

On Background Bias of Post-Hoc Concept Embeddings in Computer Vision DNNs

Gesina Schwalbe,Georgii Mikriukov,Edgar Heinert,Stavros Gerolymatos,Mert Keser,Alois Knoll,Matthias Rottmann,Annika Mütze

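Task: Verify whether data-driven post-hoc C-XAI methods are prone to background bias.

Motivation: Backgrounds are uncontrolled when existing C-XAI methods are trained, so the methods themselves may carry background biases that affect the accuracy of concept embeddings.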

Details

Method: Compares 3 background randomization techniques across more than 50 concepts and 7 DNN architectures to test for background bias. Result: Net2Vec-based concept segmentation techniques do capture background biases, such as underperformance on road scenes. Conclusion: Low-cost experiments can reveal background bias and improve background robustness. Abstract: The thriving research field of concept-based explainable artificial intelligence (C-XAI) investigates how human-interpretable semantic concepts embed in the latent spaces of deep neural networks (DNNs). Post-hoc approaches therein use a set of examples to specify a concept, and determine its embeddings in DNN latent space using data driven techniques. This proved useful to uncover biases between different target (foreground or concept) classes. However, given that the background is mostly uncontrolled during training, an important question has been left unattended so far: Are/to what extent are state-of-the-art, data-driven post-hoc C-XAI approaches themselves prone to biases with respect to their backgrounds? E.g., wild animals mostly occur against vegetation backgrounds, and they seldom appear on roads. Even simple and robust C-XAI methods might abuse this shortcut for enhanced performance. A dangerous performance degradation of the concept-corner cases of animals on the road could thus remain undiscovered. This work validates and thoroughly confirms that established Net2Vec-based concept segmentation techniques frequently capture background biases, including alarming ones, such as underperformance on road scenes. For the analysis, we compare 3 established techniques from the domain of background randomization on >50 concepts from 2 datasets, and 7 diverse DNN architectures. Our results indicate that even low-cost setups can provide both valuable insight and improved background robustness.

Enhancing knowledge retention for continual learning with domain-specific adapters and features gating

Mohamed Abbas Hedjazi,Oussama Hadjerci,Adel Hafiane

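Task: Propose a new approach that integrates adapters into Vision Transformers to improve knowledge retention when learning multi-domain datasets sequentially.

Motivation: Address catastrophic forgetting in continual learning and improve model performance across datasets from multiple domains.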

Details

Method: Integrates adapters within the self-attention mechanism and introduces domain-specific output heads and feature gating. Result: Compared with existing parameter-efficient fine-tuning methods, the approach alleviates their limitations, and experiments examine the impact of task order on performance. Conclusion: Dataset order is critical to learning outcomes; strategic ordering can significantly improve the model's adaptability while preserving learned knowledge. Abstract: Continual learning empowers models to learn from a continuous stream of data while preserving previously acquired knowledge, effectively addressing the challenge of catastrophic forgetting. In this study, we propose a new approach that integrates adapters within the self-attention mechanisms of Vision Transformers to enhance knowledge retention when sequentially adding datasets from different domains. Unlike previous methods that continue learning with only one dataset, our approach introduces domain-specific output heads and feature gating, allowing the model to maintain high accuracy on previously learned tasks while incorporating only the essential information from multiple domains. The proposed method is compared to prominent parameter-efficient fine-tuning methods in the current state of the art. The results provide evidence that our method effectively alleviates the limitations of previous works. Furthermore, we conduct a comparative analysis using three datasets, CIFAR-100, Flowers102, and DTD, each representing a distinct domain, to investigate the impact of task order on model performance. Our findings underscore the critical role of dataset sequencing in shaping learning outcomes, demonstrating that strategic ordering can significantly improve the model's ability to adapt to evolving data distributions over time while preserving the integrity of previously learned knowledge.

Preserving Privacy Without Compromising Accuracy: Machine Unlearning for Handwritten Text Recognition

Lei Kang,Xuanshuo Fu,Lluis Gomez,Alicia Fornés,Ernest Valveny,Dimosthenis Karatzas

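Task: Propose a novel two-stage unlearning strategy for a multi-head transformer-based handwritten text recognition model that selectively removes sensitive data.

Motivation: Handwritten data often contains user-identifiable information, making privacy protection increasingly important, yet machine unlearning faces a trade-off between privacy and accuracy.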

Details

Method: A two-stage unlearning strategy that combines pruning and random labeling, using the writer classification head as both an indicator and a trigger for unlearning. Result: Experiments show the method preserves privacy while maintaining model accuracy. Conclusion: The method paves the way for new research directions in document analysis; the code will be released upon acceptance. Abstract: Handwritten Text Recognition (HTR) is essential for document analysis and digitization. However, handwritten data often contains user-identifiable information, such as unique handwriting styles and personal lexicon choices, which can compromise privacy and erode trust in AI services. Legislation like the "right to be forgotten" underscores the necessity for methods that can expunge sensitive information from trained models. Machine unlearning addresses this by selectively removing specific data from models without necessitating complete retraining. Yet, it frequently encounters a privacy-accuracy tradeoff, where safeguarding privacy leads to diminished model performance. In this paper, we introduce a novel two-stage unlearning strategy for a multi-head transformer-based HTR model, integrating pruning and random labeling. Our proposed method utilizes a writer classification head both as an indicator and a trigger for unlearning, while maintaining the efficacy of the recognition head. To our knowledge, this represents the first comprehensive exploration of machine unlearning within HTR tasks. We further employ Membership Inference Attacks (MIA) to evaluate the effectiveness of unlearning user-identifiable information. Extensive experiments demonstrate that our approach effectively preserves privacy while maintaining model accuracy, paving the way for new research directions in the document analysis community. Our code will be publicly available upon acceptance.

Efficient Mixture of Geographical Species for On Device Wildlife Monitoring

Emmanuel Azuh Mensah,Joban Mand,Yueheng Ou,Min Jang,Kurtis Heimerl

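Task: Explore a single species detector based on conditional computation for geographically aware ecological conservation applications.

Motivation: Demand for efficient on-device models is growing, particularly in ecological conservation, which calls for low-compute models. Vision transformers remain underexplored for edge computing, especially conditional execution of subnetworks based on input data.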

Details

Method: Proposes a method for pruning the expert model in a geographically aware manner and evaluates conditional computation on two geographically distributed datasets (iNaturalist and iWildcam). Result: Demonstrates the effectiveness of conditional computation in geographically aware models. Conclusion: The method offers a new solution for efficient on-device models in ecological conservation. Abstract: Efficient on-device models have become attractive for near-sensor insight generation, of particular interest to the ecological conservation community. For this reason, deep learning researchers are proposing more approaches to develop lower compute models. However, since vision transformers are very new to the edge use case, there are still unexplored approaches, most notably conditional execution of subnetworks based on input data. In this work, we explore the training of a single species detector which uses conditional computation to bias structured sub networks in a geographically-aware manner. We propose a method for pruning the expert model per location and demonstrate conditional computation performance on two geographically distributed datasets: iNaturalist and iWildcam.

Latent Diffusion Autoencoders: Toward Efficient and Meaningful Unsupervised Representation Learning in Medical Imaging

Gabriele Lozupone,Alessandro Bria,Francesco Fontanella,Frederick J. A. Meijer,Claudio De Stefano,Henkjan Huisman

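Task: Propose Latent Diffusion Autoencoder (LDAE), a novel encoder-decoder diffusion framework for efficient unsupervised learning in medical imaging, using Alzheimer's disease (AD) as a case study.

Motivation: Conventional diffusion autoencoders operate in image space, which is computationally inefficient and makes 3D medical images hard to handle; LDAE addresses this by applying the diffusion process in a compressed latent representation.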

Details

Method: LDAE runs the diffusion process in the latent representation space and is validated through two hypotheses: (i) it captures semantic representations associated with AD and ageing, and (ii) it achieves high-quality image generation and reconstruction while being computationally efficient. Result: Experiments support LDAE's effectiveness: (i) promising AD diagnostic performance (ROC-AUC: 90%, ACC: 84%), (ii) semantic representations that enable attribute manipulation, (iii) strong reconstruction of missing scans (SSIM: 0.969), and (iv) a substantial gain in inference throughput (20x faster). Conclusion: LDAE is a promising framework for scalable medical imaging applications and may serve as a foundation model for medical image analysis. Abstract: This study presents Latent Diffusion Autoencoder (LDAE), a novel encoder-decoder diffusion-based framework for efficient and meaningful unsupervised learning in medical imaging, focusing on Alzheimer disease (AD) using brain MR from the ADNI database as a case study. Unlike conventional diffusion autoencoders operating in image space, LDAE applies the diffusion process in a compressed latent representation, improving computational efficiency and making 3D medical imaging representation learning tractable. To validate the proposed approach, we explore two key hypotheses: (i) LDAE effectively captures meaningful semantic representations on 3D brain MR associated with AD and ageing, and (ii) LDAE achieves high-quality image generation and reconstruction while being computationally efficient. Experimental results support both hypotheses: (i) linear-probe evaluations demonstrate promising diagnostic performance for AD (ROC-AUC: 90%, ACC: 84%) and age prediction (MAE: 4.1 years, RMSE: 5.2 years); (ii) the learned semantic representations enable attribute manipulation, yielding anatomically plausible modifications; (iii) semantic interpolation experiments show strong reconstruction of missing scans, with SSIM of 0.969 (MSE: 0.0019) for a 6-month gap. Even for longer gaps (24 months), the model maintains robust performance (SSIM > 0.93, MSE < 0.004), indicating an ability to capture temporal progression trends; (iv) compared to conventional diffusion autoencoders, LDAE significantly increases inference throughput (20x faster) while also enhancing reconstruction quality. These findings position LDAE as a promising framework for scalable medical imaging applications, with the potential to serve as a foundation model for medical image analysis. Code available at https://github.com/GabrieleLozupone/LDAE

Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

Jialu Li,Shoubin Yu,Han Lin,Jaemin Cho,Jaehong Yoon,Mohit Bansal

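Task: Propose Video-MSG, a guidance method for text-to-video generation that needs no fine-tuning or additional memory, to improve how accurately generated videos follow their text descriptions.

Motivation: Existing text-to-video (T2V) diffusion models struggle to control spatial layouts and object trajectories, and existing methods require fine-tuning or attention-map manipulation, which increases memory requirements.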

Details

Method: Video-MSG combines multimodal planning and structured noise initialization in three steps: the first two create a Video Sketch, and the last uses it to guide a downstream T2V diffusion model. Result: Experiments show that Video-MSG improves text alignment across multiple T2V backbones and benchmarks. Conclusion: Video-MSG is an efficient, easily deployed guidance method for T2V generation that needs no additional memory or fine-tuning. Abstract: Recent advancements in text-to-video (T2V) diffusion models have significantly enhanced the visual quality of the generated videos. However, even recent T2V models find it challenging to follow text descriptions accurately, especially when the prompt requires accurate control of spatial layouts or object trajectories. A recent line of research uses layout guidance for T2V models that require fine-tuning or iterative manipulation of the attention map during inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video-MSG consists of three steps, where in the first two steps, Video-MSG creates Video Sketch, a fine-grained spatio-temporal plan for the final video, specifying background, foreground, and object trajectories, in the form of draft video frames. In the last step, Video-MSG guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Notably, Video-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models. Video-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench). We provide comprehensive ablation studies about noise inversion ratio, different background generators, background object detection, and foreground object segmentation.

Title block detection and information extraction for enhanced building drawings search

Alessio Lombardi,Li Duan,Ahmed Elnagar,Ahmed Zaalouk,Khalid Ismail,Edlira Vakaj

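Task: Compare existing methods and propose a new title block detection and information extraction pipeline for extracting metadata from building drawings.

Motivation: The architecture, engineering, and construction (AEC) industry relies on information in drawings, but extracting it is time-consuming and costly; title block extraction is especially complex for historical building drawings, which lack uniform standards.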

Details

Method: Combines a lightweight convolutional neural network with GPT-4o into a new title block detection and information extraction pipeline. Result: The method performs well on complex, noisy historical drawings, and the extracted metadata supports drawing search, filtering, and grouping, saving significant time. Conclusion: The pipeline is efficient and accurate on both vector and hand-drawn drawings, and an extensible annotated dataset was developed that lays the foundation for future work. Abstract: The architecture, engineering, and construction (AEC) industry still heavily relies on information stored in drawings for building construction, maintenance, compliance and error checks. However, information extraction (IE) from building drawings is often time-consuming and costly, especially when dealing with historical buildings. Drawing search can be simplified by leveraging the information stored in the title block portion of the drawing, which can be seen as drawing metadata. However, title block IE can be complex especially when dealing with historical drawings which do not follow existing standards for uniformity. This work performs a comparison of existing methods for this kind of IE task, and then proposes a novel title block detection and IE pipeline which outperforms existing methods, in particular when dealing with complex, noisy historical drawings. The pipeline is obtained by combining a lightweight Convolutional Neural Network and GPT-4o; the proposed inference pipeline detects building engineering title blocks with high accuracy, and then extracts structured drawing metadata from the title blocks, which can be used for drawing search, filtering and grouping. The work demonstrates high accuracy and efficiency in IE for both vector (CAD) and hand-drawn (historical) drawings. A user interface (UI) that leverages the extracted metadata for drawing search is established and deployed on real projects, which demonstrates significant time savings. Additionally, an extensible domain-expert-annotated dataset for title block detection is developed, via an efficient AEC-friendly annotation workflow that lays the foundation for future work.

MBE-ARI: A Multimodal Dataset Mapping Bi-directional Engagement in Animal-Robot Interaction

Ian Noronha,Advait Prasad Jawaji,Juan Camilo Soto,Jiajun An,Yan Gu,Upinder Kaur

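Task: Present the MBE-ARI dataset and a full-body pose estimation model for quadruped animals to advance animal-robot interaction research.

Motivation: Animal-robot interaction lacks foundational resources, and robots struggle to interpret animals' complex multimodal communication cues.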

Details

Method: Builds the MBE-ARI dataset of synchronized multi-view RGB-D data annotated with body pose and activity labels, and develops a high-accuracy pose estimation model for quadruped animals. Result: The pose estimation model reaches a mean average precision (mAP) of 92.7% over 39 keypoints, outperforming existing benchmarks. Conclusion: The MBE-ARI dataset and pose estimation model provide important tools for animal-robot interaction research and advance the field. Abstract: Animal-robot interaction (ARI) remains an unexplored challenge in robotics, as robots struggle to interpret the complex, multimodal communication cues of animals, such as body language, movement, and vocalizations. Unlike human-robot interaction, which benefits from established datasets and frameworks, animal-robot interaction lacks the foundational resources needed to facilitate meaningful bidirectional communication. To bridge this gap, we present the MBE-ARI (Multimodal Bidirectional Engagement in Animal-Robot Interaction), a novel multimodal dataset that captures detailed interactions between a legged robot and cows. The dataset includes synchronized RGB-D streams from multiple viewpoints, annotated with body pose and activity labels across interaction phases, offering an unprecedented level of detail for ARI research. Additionally, we introduce a full-body pose estimation model tailored for quadruped animals, capable of tracking 39 keypoints with a mean average precision (mAP) of 92.7%, outperforming existing benchmarks in animal pose estimation. The MBE-ARI dataset and our pose estimation framework lay a robust foundation for advancing research in animal-robot interaction, providing essential tools for developing perception, reasoning, and interaction frameworks needed for effective collaboration between robots and animals. The dataset and resources are publicly available at https://github.com/RISELabPurdue/MBE-ARI/, inviting further exploration and development in this critical area.

The Invisible EgoHand: 3D Hand Forecasting through EgoBody Pose Estimation

Masashi Hatano,Zhifan Zhu,Hideo Saito,Dima Damen

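Task: Forecast the 3D trajectories and poses of both hands from an egocentric viewpoint, whether or not the hands are in the field of view.

Motivation: Existing methods only predict the positions of hands that are in view, overlooking that approximate hand positions can still be inferred when the hands are out of view.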

Details

Method: Proposes EgoH4, a diffusion-based transformer architecture for Egocentric Hand Forecasting that uses full-body pose information to constrain hand motion and is trained with denoising, a visibility predictor, and a 3D-to-2D reprojection loss. Result: On the Ego-Exo4D dataset, EgoH4 improves over the baseline by 3.4 cm in ADE for trajectory forecasting and 5.1 cm in MPJPE for pose forecasting. Conclusion: EgoH4 forecasts hand motion both in and out of view, offering a more complete approach to understanding human intention. Abstract: Forecasting hand motion and pose from an egocentric perspective is essential for understanding human intention. However, existing methods focus solely on predicting positions without considering articulation, and only when the hands are visible in the field of view. This limitation overlooks the fact that approximate hand positions can still be inferred even when they are outside the camera's view. In this paper, we propose a method to forecast the 3D trajectories and poses of both hands from an egocentric video, both in and out of the field of view. We propose a diffusion-based transformer architecture for Egocentric Hand Forecasting, EgoH4, which takes as input the observation sequence and camera poses, then predicts future 3D motion and poses for both hands of the camera wearer. We leverage full-body pose information, allowing other joints to provide constraints on hand motion. We denoise the hand and body joints along with a visibility predictor for hand joints and a 3D-to-2D reprojection loss that minimizes the error when hands are in-view. We evaluate EgoH4 on the Ego-Exo4D dataset, combining subsets with body and hand annotations. We train on 156K sequences and evaluate on 34K sequences. EgoH4 improves the performance by 3.4cm and 5.1cm over the baseline in terms of ADE for hand trajectory forecasting and MPJPE for hand pose forecasting. Project page: https://masashi-hatano.github.io/EgoH4/
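The two reported metrics can be sketched from their standard definitions: ADE averages the Euclidean distance between predicted and ground-truth positions over time steps, and MPJPE averages it over every joint of every pose. The paper's exact evaluation protocol (joint set, alignment, units handling) is not given in the abstract, so this is an assumption-laden illustration with hypothetical function names.

```python
import math


def ade(pred_traj, gt_traj):
    """Average displacement error over a trajectory of 3D points."""
    dists = [math.dist(p, g) for p, g in zip(pred_traj, gt_traj)]
    return sum(dists) / len(dists)


def mpjpe(pred_poses, gt_poses):
    """Mean per-joint position error over a sequence of poses (lists of 3D joints)."""
    errors = [
        math.dist(pj, gj)
        for pred_pose, gt_pose in zip(pred_poses, gt_poses)
        for pj, gj in zip(pred_pose, gt_pose)
    ]
    return sum(errors) / len(errors)
```

Both functions return values in the units of their inputs, so coordinates in centimeters give errors directly comparable to the 3.4 cm and 5.1 cm improvements quoted above.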

X2BR: High-Fidelity 3D Bone Reconstruction from a Planar X-Ray Image with Hybrid Neural Implicit Methods

Gokce Guven,H. Fatih Ugurdag,Hasan F. Ates

Task: Achieve accurate 3D bone reconstruction from a single planar X-ray image.

Motivation: Reconstructing 3D bone from a single X-ray image remains challenging because of anatomical complexity and limited input data.

Details

Method: Proposes the X2BR framework, which combines continuous volumetric reconstruction with template-guided non-rigid registration. The core network, X2B, uses a ConvNeXt encoder to extract spatial features from X-rays and predict high-fidelity 3D bone occupancy fields without relying on statistical shape models. X2BR additionally builds a patient-specific template mesh using YOLOv9 detection and the SKEL biomechanical skeleton model, and aligns the coarse reconstruction to the template with geodesic-based coherent point drift. Result: On a clinical dataset, X2B achieves the highest numerical accuracy (IoU 0.952, Chamfer-L1 distance 0.005), outperforming X2V and D2IM-Net. With YOLOv9 bone detection and biomechanical template alignment, X2BR has a somewhat lower IoU (0.875) but offers greater anatomical realism. Conclusion: The trade-off between numerical accuracy and visual consistency across X2B and X2BR shows the value of hybrid frameworks for clinically relevant 3D reconstruction. Abstract: Accurate 3D bone reconstruction from a single planar X-ray remains a challenge due to anatomical complexity and limited input data. We propose X2BR, a hybrid neural implicit framework that combines continuous volumetric reconstruction with template-guided non-rigid registration. The core network, X2B, employs a ConvNeXt-based encoder to extract spatial features from X-rays and predict high-fidelity 3D bone occupancy fields without relying on statistical shape models. To further refine anatomical accuracy, X2BR integrates a patient-specific template mesh, constructed using YOLOv9-based detection and the SKEL biomechanical skeleton model. The coarse reconstruction is aligned to the template using geodesic-based coherent point drift, enabling anatomically consistent 3D bone volumes. Experimental results on a clinical dataset show that X2B achieves the highest numerical accuracy, with an IoU of 0.952 and Chamfer-L1 distance of 0.005, outperforming recent baselines including X2V and D2IM-Net. Building on this, X2BR incorporates anatomical priors via YOLOv9-based bone detection and biomechanical template alignment, leading to reconstructions that, while slightly lower in IoU (0.875), offer superior anatomical realism, especially in rib curvature and vertebral alignment. This numerical accuracy vs. visual consistency trade-off between X2B and X2BR highlights the value of hybrid frameworks for clinically relevant 3D reconstructions.
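
For readers unfamiliar with the two reported metrics, the sketch below shows one common way to compute volumetric IoU on occupancy grids and a symmetric Chamfer distance on point sets. The abstract does not state the paper's exact sampling or normalization, so the conventions here (flattened boolean grids, unsquared nearest-neighbour distances averaged over both directions) and all names are assumptions.

```python
import math


def occupancy_iou(pred, gt):
    """Intersection over union of two flattened occupancy grids (truthy = occupied)."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0  # two empty grids agree perfectly


def chamfer_l1(points_a, points_b):
    """Symmetric Chamfer distance using unsquared nearest-neighbour distances.

    Brute force O(len(a) * len(b)); fine for a toy example, too slow for dense meshes.
    """
    def one_sided(src, dst):
        return sum(min(math.dist(s, d) for d in dst) for s in src) / len(src)

    return 0.5 * (one_sided(points_a, points_b) + one_sided(points_b, points_a))


# Toy example: 4-voxel grids and tiny point sets (values are illustrative only).
print(occupancy_iou([1, 1, 0, 0], [1, 0, 1, 0]))
print(chamfer_l1([(0, 0, 0), (1, 0, 0)], [(0, 0, 0)]))
```

IoU rewards volumetric overlap, while the Chamfer distance measures surface proximity, which is one reason a method can trade a lower IoU for reconstructions that look more anatomically plausible.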

Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

Team Seawead,Ceyuan Yang,Zhijie Lin,Yang Zhao,Shanchuan Lin,Zhibei Ma,Haoyuan Guo,Hao Chen,Lu Qi,Sen Wang,Feng Cheng,Feilong Zuo,Xuejiao Zeng,Ziyan Yang,Fangyuan Kong,Zhiwu Qing,Fei Xiao,Meng Wei,Tuyen Hoang,Siyu Zhang,Peihao Zhu,Qi Zhao,Jiangqiao Yan,Liangke Gui,Sheng Bi,Jiashi Li,Yuxi Ren,Rui Wang,Huixia Li,Xuefeng Xiao,Shu Liu,Feng Ling,Heng Zhang,Houmin Wei,Huafeng Kuang,Jerry Duncan,Junda Zhang,Junru Zheng,Li Sun,Manlin Zhang,Renfei Sun,Xiaobin Zhuang,Xiaojie Li,Xin Xia,Xuyan Chi,Yanghua Peng,Yuping Wang,Yuxuan Wang,Zhongkai Zhao,Zhuo Chen,Zuquan Song,Zhenheng Yang,Jiashi Feng,Jianchao Yang,Lu Jiang

Task: Present a cost-efficient strategy for training a video generation foundation model.

Motivation: In a resource-constrained setting, design choices are critical to the performance of a mid-sized diffusion model.

Details

Method: Trains Seaweed-7B, a model with approximately 7 billion parameters (7B), using 665,000 H100 GPU hours. Result: Seaweed-7B matches or even surpasses much larger models and shows strong generalization ability. Conclusion: Seaweed-7B performs well under constrained resources and can be adapted to a wide range of downstream applications through lightweight fine-tuning or continued training. Abstract: This technical report presents a cost-efficient strategy for training a video generation foundation model. We present a mid-sized research model with approximately 7 billion parameters (7B) called Seaweed-7B trained from scratch using 665,000 H100 GPU hours. Despite being trained with moderate computational resources, Seaweed-7B demonstrates highly competitive performance compared to contemporary video generation models of much larger size. Design choices are especially crucial in a resource-constrained setting. This technical report highlights the key design decisions that enhance the performance of the medium-sized diffusion model. Empirically, we make two observations: (1) Seaweed-7B achieves performance comparable to, or even surpasses, larger models trained on substantially greater GPU resources, and (2) our model, which exhibits strong generalization ability, can be effectively adapted across a wide range of downstream applications either by lightweight fine-tuning or continued training. See the project page at https://seaweed.video/
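
To put the 665,000 H100 GPU-hour budget in perspective, it can be converted to wall-clock time for a given cluster size. The cluster size of 1,000 GPUs below is a hypothetical choice for illustration, and the calculation assumes perfect utilization with no restarts or idle time.

```python
GPU_HOURS = 665_000  # training budget quoted in the abstract


def wall_clock_days(gpu_hours, num_gpus):
    """Idealized wall-clock days if the budget is spread evenly over num_gpus."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return gpu_hours / num_gpus / 24


# Hypothetical cluster size, chosen only to make the arithmetic concrete.
days = wall_clock_days(GPU_HOURS, 1000)
print(f"{days:.1f} days on 1,000 GPUs")
```

Under these assumptions the budget corresponds to roughly four weeks of training on 1,000 GPUs.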

Hypergraph Vision Transformers: Images are More than Nodes, More than Edges

Joshua Fixelle

Task: Propose the Hypergraph Vision Transformer (HgVT), a framework that captures higher-order semantic relationships in vision tasks while maintaining computational efficiency.

Motivation: Vision Transformers (ViTs) scale well across computer vision tasks, but challenges remain in adaptability, computational efficiency, and modeling higher-order relationships; Vision Graph Neural Networks (ViGs) are limited by the computational bottleneck of clustering algorithms.

Details

Method: Incorporates a hierarchical bipartite hypergraph structure into the vision transformer framework, builds the hypergraph dynamically using population and diversity regularization, and enhances semantic extraction with expert edge pooling. Result: HgVT achieves strong performance on image classification and retrieval. Conclusion: HgVT is an efficient framework for semantic-based vision tasks. Abstract: Recent advancements in computer vision have highlighted the scalability of Vision Transformers (ViTs) across various tasks, yet challenges remain in balancing adaptability, computational efficiency, and the ability to model higher-order relationships. Vision Graph Neural Networks (ViGs) offer an alternative by leveraging graph-based methodologies but are hindered by the computational bottlenecks of clustering algorithms used for edge generation. To address these issues, we propose the Hypergraph Vision Transformer (HgVT), which incorporates a hierarchical bipartite hypergraph structure into the vision transformer framework to capture higher-order semantic relationships while maintaining computational efficiency. HgVT leverages population and diversity regularization for dynamic hypergraph construction without clustering, and expert edge pooling to enhance semantic extraction and facilitate graph-based image retrieval. Empirical results demonstrate that HgVT achieves strong performance on image classification and retrieval, positioning it as an efficient framework for semantic-based vision tasks.

Generating Fine Details of Entity Interactions

Xinyi Gu,Jiayuan Mao

Task: Generate high-fidelity images that accurately depict interactions between entities.

Motivation: Existing text-to-image models perform poorly when generating interactions among multiple entities, mainly because uncommon interactions are scarce in the training data.

Details

Method: Proposes the InterActing dataset and the DetailScribe method, which uses an LLM to decompose interactions, a VLM to critique generated images, and targeted interventions within the diffusion process. Result: Automatic and human evaluations show significantly improved image quality. Conclusion: DetailScribe demonstrates the potential of enhanced inference strategies for interaction-rich image generation; the dataset and code are open-source. Abstract: Images not only depict objects but also encapsulate rich interactions between them. However, generating faithful and high-fidelity images involving multiple entities interacting with each other is a long-standing challenge. While pre-trained text-to-image models are trained on large-scale datasets to follow diverse text instructions, they struggle to generate accurate interactions, likely due to the scarcity of training data for uncommon object interactions. This paper introduces InterActing, an interaction-focused dataset with 1000 fine-grained prompts covering three key scenarios: (1) functional and action-based interactions, (2) compositional spatial relationships, and (3) multi-subject interactions. To address interaction generation challenges, we propose a decomposition-augmented refinement procedure. Our approach, DetailScribe, built on Stable Diffusion 3.5, leverages LLMs to decompose interactions into finer-grained concepts, uses a VLM to critique generated images, and applies targeted interventions within the diffusion process during refinement. Automatic and human evaluations show significantly improved image quality, demonstrating the potential of enhanced inference strategies. Our dataset and code are available at https://concepts-ai.com/p/detailscribe/ to facilitate future exploration of interaction-rich image generation.

EMO-X: Efficient Multi-Person Pose and Shape Estimation in One-Stage

Haohang Jian,Jinlu Zhang,Junyi Wu,Zhigang Tu

Task: Jointly estimate human pose, hand gesture, and facial expression from monocular images.

Motivation: Existing methods mainly rely on Transformer architectures, whose self-attention has quadratic complexity and incurs high computational overhead, especially in multi-person scenes. Mamba models global information efficiently but is limited in capturing fine-grained local dependencies.

Details

Method: Proposes the EMO-X model, which uses a Scan-based Global-Local Decoder (SGLD) to combine global context with skeleton-aware local features and iteratively refine human tokens. Result: EMO-X balances efficiency and accuracy, cutting inference time by 69.8% while outperforming most SOTA methods in accuracy. Conclusion: Through joint global-local modeling, EMO-X significantly improves expressive human pose and shape estimation in multi-person scenes. Abstract: Expressive Human Pose and Shape Estimation (EHPS) aims to jointly estimate human pose, hand gesture, and facial expression from monocular images. Existing methods predominantly rely on Transformer-based architectures, which suffer from quadratic complexity in self-attention, leading to substantial computational overhead, especially in multi-person scenarios. Recently, Mamba has emerged as a promising alternative to Transformers due to its efficient global modeling capability. However, it remains limited in capturing fine-grained local dependencies, which are essential for precise EHPS. To address these issues, we propose EMO-X, the Efficient Multi-person One-stage model for multi-person EHPS. Specifically, we explore a Scan-based Global-Local Decoder (SGLD) that integrates global context with skeleton-aware local features to iteratively enhance human tokens. Our EMO-X leverages the superior global modeling capability of Mamba and designs a local bidirectional scan mechanism for skeleton-aware local refinement. Comprehensive experiments demonstrate that EMO-X strikes an excellent balance between efficiency and accuracy. Notably, it achieves a significant reduction in computational complexity, requiring 69.8% less inference time compared to state-of-the-art (SOTA) methods, while outperforming most of them in accuracy.

Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images

Boyang Deng,Songyou Peng,Kyle Genova,Gordon Wetzstein,Noah Snavely,Leonidas Guibas,Thomas Funkhouser

Task: Use multimodal large language models (MLLMs) to analyze a large database of tens of millions of images captured at different times, in order to discover patterns in temporal changes.

Motivation: Traditional visual analysis methods cannot answer open-ended queries (e.g., "what are the frequent types of changes in the city?") and have no predetermined targets or training labels to rely on, so a new tool is needed.

Details

Method: Proposes a bottom-up procedure that decomposes the massive visual analysis problem into more tractable sub-problems and designs an MLLM-based solution for each. Result: Experiments show that the system significantly outperforms baselines and discovers interesting trends in images of large cities (e.g., "addition of outdoor dining" and "overpass was painted blue"). Conclusion: Thanks to their open-ended semantic understanding, MLLMs are a new tool for large-scale open-ended questions, and problem decomposition effectively addresses the challenge of excessive data scale. Abstract: We present a system using Multimodal LLMs (MLLMs) to analyze a large database with tens of millions of images captured at different times, with the aim of discovering patterns in temporal changes. Specifically, we aim to capture frequent co-occurring changes ("trends") across a city over a certain period. Unlike previous visual analyses, our analysis answers open-ended queries (e.g., "what are the frequent types of changes in the city?") without any predetermined target subjects or training labels. These properties render prior learning-based or unsupervised visual analysis tools unsuitable. We identify MLLMs as a novel tool for their open-ended semantic understanding capabilities. Yet, our datasets are four orders of magnitude too large for an MLLM to ingest as context. So we introduce a bottom-up procedure that decomposes the massive visual analysis problem into more tractable sub-problems. We carefully design MLLM-based solutions to each sub-problem. During experiments and ablation studies with our system, we find it significantly outperforms baselines and is able to discover interesting trends from images captured in large cities (e.g., "addition of outdoor dining," "overpass was painted blue," etc.). See more results and interactive demos at https://boyangdeng.com/visual-chronicles.

Steering CLIP's vision transformer with sparse autoencoders

Sonia Joseph,Praneet Suresh,Ethan Goldfarb,Lorenz Hufe,Yossi Gandelsman,Robert Graham,Danilo Bzdok,Wojciech Samek,Blake Aaron Richards

Task: Analyze the internal mechanisms of CLIP's vision transformer with sparse autoencoders (SAEs) and study its steerability.

Motivation: Although vision models are highly capable, their internal mechanisms remain unclear; sparse autoencoders have made progress on language models but are underexplored in vision.

Details

Method: Trains sparse autoencoders on CLIP's vision transformer, analyzes sparsity patterns across layers and token types, and introduces quantitative metrics for the steerability of features. Result: Finds that 10-15% of neurons and features are steerable, that SAEs provide more steerable features than the base model, and that performance improves on three vision disentanglement tasks. Conclusion: SAEs reveal new mechanisms in vision models and perform well on disentanglement tasks, opening new directions for the interpretability and application of vision models. Abstract: While vision models are highly capable, their internal mechanisms remain poorly understood -- a challenge which sparse autoencoders (SAEs) have helped address in language, but which remains underexplored in vision. We address this gap by training SAEs on CLIP's vision transformer and uncover key differences between vision and language processing, including distinct sparsity patterns for SAEs trained across layers and token types. We then provide the first systematic analysis on the steerability of CLIP's vision transformer by introducing metrics to quantify how precisely SAE features can be steered to affect the model's output. We find that 10-15% of neurons and features are steerable, with SAEs providing thousands more steerable features than the base model. Through targeted suppression of SAE features, we then demonstrate improved performance on three vision disentanglement tasks (CelebA, Waterbirds, and typographic attacks), finding optimal disentanglement in middle model layers, and achieving state-of-the-art performance on defense against typographic attacks.
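
The "targeted suppression of SAE features" mentioned in the abstract can be pictured with a generic sparse autoencoder: encode an activation into sparse features, zero one feature, and decode. The sketch below uses a plain ReLU autoencoder with tiny hand-written weights; the architecture details, shapes, and names are assumptions, not the paper's trained SAEs.

```python
def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def sae_encode(x, w_enc, b_enc):
    """Sparse features: ReLU(W_enc x + b_enc)."""
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    return [max(0.0, v) for v in pre]


def sae_decode(features, w_dec, b_dec):
    """Reconstruction: W_dec f + b_dec."""
    return [a + b for a, b in zip(matvec(w_dec, features), b_dec)]


def suppress_feature(x, feature_idx, w_enc, b_enc, w_dec, b_dec):
    """Zero one SAE feature and return the edited reconstruction."""
    features = sae_encode(x, w_enc, b_enc)
    features[feature_idx] = 0.0
    return sae_decode(features, w_dec, b_dec)


# Toy weights: 2-D activation, 3 SAE features (made-up numbers).
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B_ENC = [0.0, 0.0, -1.0]
W_DEC = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
B_DEC = [0.0, 0.0]

x = [1.0, 2.0]
print(sae_decode(sae_encode(x, W_ENC, B_ENC), W_DEC, B_DEC))  # full reconstruction
print(suppress_feature(x, 2, W_ENC, B_ENC, W_DEC, B_DEC))     # feature 2 removed
```

In a real model the edited reconstruction would be written back into the residual stream at the layer the SAE was trained on, so that downstream computation sees the activation without the suppressed feature.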

GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

Tianwei Xiong,Jun Hao Liew,Zilong Huang,Jiashi Feng,Xihui Liu

Task: Resolve the conflict between reconstruction quality and generation quality that arises when scaling visual tokenizers for autoregressive image generation.

Motivation: Scaling existing visual tokenizers improves image reconstruction quality but degrades downstream generation quality, a problem that has not been adequately addressed.

Details

Method: Proposes GigaTok, which uses semantic regularization to align tokenizer features with semantically consistent features from a pre-trained visual encoder, and explores three key scaling practices. Result: Scaled to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and representation quality. Conclusion: Semantic regularization and the key scaling practices effectively resolve the reconstruction-versus-generation conflict in scaling visual tokenizers and improve overall performance. Abstract: In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality -- a challenge not adequately addressed in existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma. To mitigate this, we propose semantic regularization, which aligns tokenizer features with semantically consistent features from a pre-trained visual encoder. This constraint prevents excessive latent space complexity during scaling, yielding consistent improvements in both reconstruction and downstream autoregressive generation. Building on semantic regularization, we explore three key practices for scaling tokenizers: (1) using 1D tokenizers for better scalability, (2) prioritizing decoder scaling when expanding both encoder and decoder, and (3) employing entropy loss to stabilize training for billion-scale tokenizers. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
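
The abstract describes semantic regularization only as aligning tokenizer features with features from a pre-trained visual encoder. One plausible way to express such an alignment is a cosine-similarity penalty added to the reconstruction loss, sketched below. The cosine form, the weighting scheme, and every name here are assumptions for illustration rather than the paper's stated objective.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def semantic_reg_loss(tokenizer_feats, encoder_feats):
    """Mean (1 - cosine) between paired tokenizer and frozen-encoder features."""
    pairs = list(zip(tokenizer_feats, encoder_feats))
    return sum(1.0 - cosine(t, e) for t, e in pairs) / len(pairs)


def total_loss(recon_loss, tokenizer_feats, encoder_feats, weight=0.5):
    """Reconstruction loss plus a weighted alignment term (weight is hypothetical)."""
    return recon_loss + weight * semantic_reg_loss(tokenizer_feats, encoder_feats)


# Toy features: one aligned pair and one orthogonal pair.
tok = [[1.0, 0.0], [0.0, 2.0]]
enc = [[3.0, 0.0], [1.0, 0.0]]
print(total_loss(0.1, tok, enc))
```

The alignment term is zero when tokenizer features point in the same direction as the encoder's features, so it constrains the latent space without dictating feature magnitudes.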

CDM-QTA: Quantized Training Acceleration for Efficient LoRA Fine-Tuning of Diffusion Model

Jinming Lu,Minghao She,Wendong Mao,Zhongfeng Wang

Task: Develop a training accelerator dedicated to Low-Rank Adaptation (LoRA) fine-tuning of diffusion models, reducing computational complexity and improving efficiency.

Motivation: Fine-tuning large diffusion models demands substantial compute and time, which makes efficient implementation on mobile devices difficult.

Details

Method: Uses a fully quantized training scheme for LoRA fine-tuning and designs an accelerator with flexible dataflow. Result: Experiments show up to a 1.81x training speedup and a 5.50x energy efficiency improvement over the baseline, with minimal impact on image generation quality. Conclusion: The proposed accelerator significantly reduces the compute cost and energy consumption of LoRA fine-tuning while maintaining high model fidelity. Abstract: Fine-tuning large diffusion models for custom applications demands substantial power and time, which poses significant challenges for efficient implementation on mobile devices. In this paper, we develop a novel training accelerator specifically for Low-Rank Adaptation (LoRA) of diffusion models, aiming to streamline the process and reduce computational complexity. By leveraging a fully quantized training scheme for LoRA fine-tuning, we achieve substantial reductions in memory usage and power consumption while maintaining high model fidelity. The proposed accelerator features flexible dataflow, enabling high utilization for irregular and variable tensor shapes during the LoRA process. Experimental results show up to 1.81x training speedup and 5.50x energy efficiency improvements compared to the baseline, with minimal impact on image generation quality.
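
As background for this entry, the sketch below shows a LoRA forward pass, y = W x + (alpha / r) B A x, with optional symmetric int8 "fake quantization" of the low-rank factors. It only illustrates the general idea of low-precision LoRA weights: the paper's fully quantized training scheme (which presumably also covers activations and gradients) and its hardware dataflow are not reproduced, and all names and the quantization scheme are assumptions.

```python
def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def fake_quant_int8(matrix):
    """Round values to a symmetric per-tensor int8 grid, then map back to floats."""
    peak = max(abs(v) for row in matrix for v in row)
    scale = peak / 127 if peak > 0 else 1.0
    return [[round(v / scale) * scale for v in row] for row in matrix]


def lora_forward(x, w, a, b, alpha, quantize=False):
    """Frozen base layer plus a scaled low-rank update.

    w: (out, in) frozen weights; a: (r, in); b: (out, r); r is the LoRA rank.
    """
    rank = len(a)
    if quantize:
        a, b = fake_quant_int8(a), fake_quant_int8(b)
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [y + (alpha / rank) * u for y, u in zip(base, update)]


# Toy layer: 2 inputs, 2 outputs, rank 1 (made-up numbers).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.5], [0.0]]
print(lora_forward([1.0, 2.0], W, A, B, alpha=1.0))
print(lora_forward([1.0, 2.0], W, A, B, alpha=1.0, quantize=True))
```

Only A and B are trained in LoRA, so the tensors involved are small and irregularly shaped compared with the base weights, which is consistent with the abstract's emphasis on flexible dataflow.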

Interpretable Automatic Rosacea Detection with Whitened Cosine Similarity

Chengyu Yang,Chengjun Liu

Task: Propose an interpretable automatic rosacea detection method based on whitened cosine similarity.

Motivation: Increase awareness of rosacea and help physicians diagnose it more accurately.

Details

Method: Uses whitened cosine similarity to measure the similarity between a test sample and the means of two classes (the rosacea class and the normal class). Result: On unseen test data, the method is more accurate than other deep learning and statistical methods, and it addresses the interpretability issue. Conclusion: The method helps raise public awareness of rosacea and can remind patients to seek early treatment; the code and data are publicly available. Abstract: According to the National Rosacea Society, approximately sixteen million Americans suffer from rosacea, a common skin condition that causes flushing or long-term redness on a person's face. To increase rosacea awareness and to better assist physicians in diagnosing this disease, we propose an interpretable automatic rosacea detection method based on whitened cosine similarity in this paper. The contributions of the proposed method are three-fold. First, the proposed method can automatically distinguish patients suffering from rosacea from people who are free of this disease with a significantly higher accuracy than other methods in unseen test data, including both classical deep learning and statistical methods. Second, the proposed method addresses the interpretability issue by measuring the similarity between the test sample and the means of two classes, namely the rosacea class versus the normal class, which allows both medical professionals and patients to understand and trust the results. And finally, the proposed method will not only help increase awareness of rosacea in the general population, but will also help remind patients who suffer from this disease of possible early treatment, as rosacea is more treatable in its early stages. The code and data are available at https://github.com/chengyuyang-njit/ICCRD-2025.
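
A whitened cosine similarity can be written as cos_W(x, y) = x^T S^-1 y / sqrt((x^T S^-1 x)(y^T S^-1 y)), where S is a covariance matrix estimated from training features. The sketch below applies that formula to 2-D features and picks the class whose mean is most similar to the test sample. Restricting to two dimensions (so the inverse has a closed form), the choice of covariance, and all names are simplifying assumptions; the paper's feature extraction and preprocessing are not shown.

```python
import math


def covariance_2d(samples):
    """Sample covariance matrix of 2-D points."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    sxx = sum((s[0] - mx) ** 2 for s in samples) / (n - 1)
    syy = sum((s[1] - my) ** 2 for s in samples) / (n - 1)
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix (raises if singular)."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def whitened_cosine(x, y, cov_inv):
    """Cosine similarity measured in the whitened space defined by cov_inv."""
    def bilinear(u, v):
        return sum(u[i] * cov_inv[i][j] * v[j] for i in range(2) for j in range(2))

    return bilinear(x, y) / math.sqrt(bilinear(x, x) * bilinear(y, y))


def classify(x, class_means, cov_inv):
    """Return the label whose class mean is most similar to x."""
    return max(class_means, key=lambda label: whitened_cosine(x, class_means[label], cov_inv))


# Toy data (made-up 2-D features).
train = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
cov_inv = inverse_2x2(covariance_2d(train))
means = {"rosacea": [1.0, 0.0], "normal": [0.0, 1.0]}
print(classify([2.0, 0.1], means, cov_inv))
```

Because the decision is a comparison of two similarity scores against named class means, the scores themselves can be reported alongside the prediction, which is the kind of interpretability the abstract refers to.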

SynthFM: Training Modality-agnostic Foundation Models for Medical Image Segmentation without Real Medical Data

Sourya Sengupta,Satrajit Chakrabarty,Keerthi Sravan Ravi,Gopal Avinash,Ravi Soni

Task: Propose SynthFM, a framework that uses synthetic data generation to address the poor adaptation of foundation models to medical image segmentation.

Motivation: Medical images differ from natural images in texture, contrast, and noise, so foundation models such as SAM perform poorly on medical image segmentation; annotating medical images is also costly and requires domain expertise.

Details

Method: Uses SAM's pretrained encoder and trains the decoder from scratch on synthetic medical image data generated by SynthFM. Result: Across 11 anatomical structures and 9 datasets (CT, MRI, and ultrasound), SynthFM outperforms the zero-shot baselines SAM and MedSAM and performs well on out-of-distribution datasets. Conclusion: SynthFM uses synthetic data to improve foundation model performance on medical image segmentation without requiring real annotated medical data. Abstract: Foundation models like the Segment Anything Model (SAM) excel in zero-shot segmentation for natural images but struggle with medical image segmentation due to differences in texture, contrast, and noise. Annotating medical images is costly and requires domain expertise, limiting large-scale annotated data availability. To address this, we propose SynthFM, a synthetic data generation framework that mimics the complexities of medical images, enabling foundation models to adapt without real medical data. Using SAM's pretrained encoder and training the decoder from scratch on SynthFM's dataset, we evaluated our method on 11 anatomical structures across 9 datasets (CT, MRI, and Ultrasound). SynthFM outperformed zero-shot baselines like SAM and MedSAM, achieving superior results under different prompt settings and on out-of-distribution datasets.

Single View Garment Reconstruction Using Diffusion Mapping Via Pattern Coordinates

Ren Li,Cong Cao,Corentin Dumery,Yingxuan You,Hao Li,Pascal Fua

Task: Reconstruct 3D garment geometry with high fidelity from a single image.

Motivation: Despite progress in human body reconstruction, accurately reconstructing garment geometry, especially for loose-fitting clothing, remains a challenge.

Details

Method: Combines Implicit Sewing Patterns (ISP) with a generative diffusion model to learn rich garment shape priors in a 2D UV space, and uses a mapping model to jointly optimize the 3D garment mesh and the 2D patterns. Result: Although trained on synthetic data, the method generalizes effectively to real images and outperforms existing approaches on both tight- and loose-fitting garments. Conclusion: The reconstructed garments remain physically plausible and capture fine details, supporting downstream applications such as garment retargeting and texture manipulation. Abstract: Reconstructing 3D clothed humans from images is fundamental to applications like virtual try-on, avatar creation, and mixed reality. While recent advances have enhanced human body recovery, accurate reconstruction of garment geometry -- especially for loose-fitting clothing -- remains an open challenge. We present a novel method for high-fidelity 3D garment reconstruction from single images that bridges 2D and 3D representations. Our approach combines Implicit Sewing Patterns (ISP) with a generative diffusion model to learn rich garment shape priors in a 2D UV space. A key innovation is our mapping model that establishes correspondences between 2D image pixels, UV pattern coordinates, and 3D geometry, enabling joint optimization of both 3D garment meshes and the corresponding 2D patterns by aligning learned priors with image observations. Despite training exclusively on synthetically simulated cloth data, our method generalizes effectively to real-world images, outperforming existing approaches on both tight- and loose-fitting garments. The reconstructed garments maintain physical plausibility while capturing fine geometric details, enabling downstream applications including garment retargeting and texture manipulation.

In-2-4D: Inbetweening from Two Single-View Images to 4D Generation

Sauradip Nag,Daniel Cohen-Or,Hao Zhang,Ali Mahdavi-Amiri

Task: Generate 4D (3D + motion) inbetweening from two single-view images, reconstructing the motion between them.

Motivation: Address the problem of generating complex 4D motion from a minimalistic input (two images), filling a gap left by existing methods.

Details

Method: Uses a hierarchical approach to identify keyframes, builds 3D representations with Gaussian Splatting, turns them into dynamic Gaussians through a deformation field, and improves temporal consistency by extending the self-attention of multi-view diffusion across timesteps. Result: Experiments and a user study validate the method, which produces smooth, flicker-free 4D motion. Conclusion: The paper presents a new method for generating high-quality 4D motion from two images, providing a practical tool for related fields. Abstract: We propose a new problem, In-2-4D, for generative 4D (i.e., 3D + motion) inbetweening from a minimalistic input setting: two single-view images capturing an object in two distinct motion states. Given two images representing the start and end states of an object in motion, our goal is to generate and reconstruct the motion in 4D. We utilize a video interpolation model to predict the motion, but large frame-to-frame motions can lead to ambiguous interpretations. To overcome this, we employ a hierarchical approach to identify keyframes that are visually close to the input states and show significant motion, then generate smooth fragments between them. For each fragment, we construct the 3D representation of the keyframe using Gaussian Splatting. The temporal frames within the fragment guide the motion, enabling their transformation into dynamic Gaussians through a deformation field. To improve temporal consistency and refine 3D motion, we expand the self-attention of multi-view diffusion across timesteps and apply rigid transformation regularization. Finally, we merge the independently generated 3D motion segments by interpolating boundary deformation fields and optimizing them to align with the guiding video, ensuring smooth and flicker-free transitions. Through extensive qualitative and quantitative experiments as well as a user study, we show the effectiveness of our method and its components. The project page is available at https://in-2-4d.github.io/

Poisson multi-Bernoulli mixture filter for trajectory measurements

Marco Fontana,Ángel F. García-Fernández,Simon Maskell

Task: Propose a multi-target filtering method based on trajectory measurements, the trajectory measurement PMBM (TM-PMBM) filter.

Motivation: Solve multi-target filtering when the sensor measurements are sets of trajectories, and provide computationally lighter alternatives.

Details

Method: Uses a PMBM filter that propagates a density on the set of target states and handles trajectory measurements within a two-time-step window through prediction and update steps. Result: The TM-PMBM filter provides a closed-form solution for multi-target filtering, and its performance is evaluated in simulation. Conclusion: The TM-PMBM filter and its lighter alternatives perform well in multi-target filtering based on trajectory measurements. Abstract: This paper presents a Poisson multi-Bernoulli mixture (PMBM) filter for multi-target filtering based on sensor measurements that are sets of trajectories in the last two-time-step window. The proposed filter, the trajectory measurement PMBM (TM-PMBM) filter, propagates a PMBM density on the set of target states. In prediction, the filter obtains the PMBM density on the set of trajectories over the last two time steps. This density is then updated with the set of trajectory measurements. After the update step, the PMBM posterior on the set of two-step trajectories is marginalised to obtain a PMBM density on the set of target states. The filter provides a closed-form solution for multi-target filtering based on sets of trajectory measurements, estimating the set of target states at the end of each time window. Additionally, the paper proposes computationally lighter alternatives to the TM-PMBM filter by deriving a Poisson multi-Bernoulli (PMB) density through Kullback-Leibler divergence minimisation in an augmented space with auxiliary variables. The performance of the proposed filters is evaluated in a simulation study.

Jiafan Lu,Dongcheng Hu,Yitian Ye,Anqi Liu,Zixian Zhang,Xin Peng

Task: Propose a novel composite navigation method that combines laser and vision technologies to improve the navigation precision and operational efficiency of inspection robots in indoor poultry farms.

Motivation: Traditional navigation methods that rely on a single sensor perform poorly in complex environments, leading to problems such as laser drift and inaccurate extraction of visual navigation lines.

Details

Method: Dynamically computes a fused yaw angle based on the real-time reliability of each sensor, with no need for physical navigation lines. Result: Experimental validation shows that the method significantly improves navigation precision and operational efficiency and resolves the inherent drawbacks of single-sensor systems. Conclusion: The method is a promising solution for improving the performance of inspection robots in complex indoor poultry farming environments. Abstract: Indoor poultry farms require inspection robots to maintain precise environmental control, which is crucial for preventing the rapid spread of disease and large-scale bird mortality. However, the complex conditions within these facilities, characterized by areas of intense illumination and water accumulation, pose significant challenges. Traditional navigation methods that rely on a single sensor often perform poorly in such environments, resulting in issues like laser drift and inaccuracies in visual navigation line extraction. To overcome these limitations, we propose a novel composite navigation method that integrates both laser and vision technologies. This approach dynamically computes a fused yaw angle based on the real-time reliability of each sensor modality, thereby eliminating the need for physical navigation lines. Experimental validation in actual poultry house environments demonstrates that our method not only resolves the inherent drawbacks of single-sensor systems, but also significantly enhances navigation precision and operational efficiency. As such, it presents a promising solution for improving the performance of inspection robots in complex indoor poultry farming settings.
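
The abstract does not give the fusion rule, so the following is only one way a reliability-weighted yaw could be computed: normalize the two reliability scores into weights and take a weighted circular mean, which avoids the wrap-around problem of averaging raw angles near ±π. The reliability scores, their range, and the function name are all hypothetical.

```python
import math


def fuse_yaw(yaw_laser, yaw_vision, rel_laser, rel_vision):
    """Reliability-weighted circular mean of two yaw estimates (radians).

    rel_laser and rel_vision are non-negative reliability scores; how they would
    be derived from sensor data is outside the scope of this sketch.
    """
    total = rel_laser + rel_vision
    if total <= 0:
        raise ValueError("at least one sensor must have positive reliability")
    w_laser, w_vision = rel_laser / total, rel_vision / total
    # Average unit vectors instead of raw angles so that +pi and -pi agree.
    sin_sum = w_laser * math.sin(yaw_laser) + w_vision * math.sin(yaw_vision)
    cos_sum = w_laser * math.cos(yaw_laser) + w_vision * math.cos(yaw_vision)
    return math.atan2(sin_sum, cos_sum)


# Example: vision is currently judged three times as reliable as the laser.
print(fuse_yaw(0.10, 0.30, rel_laser=1.0, rel_vision=3.0))
```

When one sensor's reliability drops to zero (for example, vision under intense illumination), the fused yaw falls back to the other sensor's estimate.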

Digital Twin Catalog: A Large-Scale Photorealistic 3D Object Digital Twin Dataset

Zhao Dong,Ka Chen,Zhaoyang Lv,Hong-Xing Yu,Yunzhi Zhang,Cheng Zhang,Yufeng Zhu,Stephen Tian,Zhengqin Li,Geordie Moffatt,Sean Christofferson,James Fort,Xiaqing Pan,Mingfei Yan,Jiajun Wu,Carl Yuheng Ren,Richard Newcombe

Task: Introduce and build the Digital Twin Catalog (DTC), a large-scale photorealistic 3D object digital twin dataset.

Motivation: There is currently no large-scale, high-quality digital twin dataset and benchmark for evaluating and comparing 3D reconstruction methods and for driving improvements in reconstruction quality. There is also no 3D reconstruction dataset built from egocentric captures (e.g., with AR glasses).

Details

Method: Builds a dataset of 2,000 scanned, digital twin-quality 3D objects, together with image sequences captured under different lighting conditions using DSLR cameras and AR glasses. Result: The DTC dataset provides the first comprehensive real-world evaluation benchmark for 3D digital twin creation tasks, supporting the comparison and improvement of existing reconstruction methods. Conclusion: DTC fills the gap in digital twin datasets and benchmarks and provides an important resource for research and applications in 3D reconstruction. Abstract: We introduce Digital Twin Catalog (DTC), a new large-scale photorealistic 3D object digital twin dataset. A digital twin of a 3D object is a highly detailed, virtually indistinguishable representation of a physical object, accurately capturing its shape, appearance, physical properties, and other attributes. Recent advances in neural-based 3D reconstruction and inverse rendering have significantly improved the quality of 3D object reconstruction. Despite these advancements, there remains a lack of a large-scale, digital twin quality real-world dataset and benchmark that can quantitatively assess and compare the performance of different reconstruction methods, as well as improve reconstruction quality through training or fine-tuning. Moreover, to democratize 3D digital twin creation, it is essential to integrate creation techniques with next-generation egocentric computing platforms, such as AR glasses. Currently, there is no dataset available to evaluate 3D object reconstruction using egocentric captured images. To address these gaps, the DTC dataset features 2,000 scanned digital twin-quality 3D objects, along with image sequences captured under different lighting conditions using DSLR cameras and egocentric AR glasses. This dataset establishes the first comprehensive real-world evaluation benchmark for 3D digital twin creation tasks, offering a robust foundation for comparing and improving existing reconstruction methods. The DTC dataset is already released at https://www.projectaria.com/datasets/dtc/ and we will also make the baseline evaluations open-source.

COP-GEN-Beta: Unified Generative Modelling of COPernicus Imagery Thumbnails

Miguel Espinosa,Valerio Marsocci,Yuru Jia,Elliot J. Crowley,Mikolaj Czerkawski

Task: Propose COP-GEN-Beta, a generative diffusion model for unified representation learning and zero-shot modality translation on multi-modal remote sensing data.

Motivation: Multi-modal remote sensing data offers rich information, but traditional methods are limited to one or two modalities and struggle to learn a unified cross-modal representation.

Details

Method: Uses a sequence-based diffusion transformer in which each modality is controlled by its own timestep embedding, enabling zero-shot translation between arbitrary modalities. Result: Evaluation on the Major TOM dataset confirms that COP-GEN-Beta generates high-quality samples, with strong results in both qualitative and quantitative assessments. Conclusion: COP-GEN-Beta shows potential as a powerful pre-trained model for future remote sensing tasks. Abstract: In remote sensing, multi-modal data from various sensors capturing the same scene offers rich opportunities, but learning a unified representation across these modalities remains a significant challenge. Traditional methods have often been limited to single or dual-modality approaches. In this paper, we introduce COP-GEN-Beta, a generative diffusion model trained on optical, radar, and elevation data from the Major TOM dataset. What sets COP-GEN-Beta apart is its ability to map any subset of modalities to any other, enabling zero-shot modality translation after training. This is achieved through a sequence-based diffusion transformer, where each modality is controlled by its own timestep embedding. We extensively evaluate COP-GEN-Beta on thumbnail images from the Major TOM dataset, demonstrating its effectiveness in generating high-quality samples. Qualitative and quantitative evaluations validate the model's performance, highlighting its potential as a powerful pre-trained model for future remote sensing tasks.

FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment

Sebastián Barbas Laina,Simon Boche,Sotiris Papatheodorou,Simon Schaefer,Jaehyung Jung,Stefan Leutenegger

Task: Propose FindAnything, a framework for real-time, open-vocabulary semantic understanding and mapping of large-scale unknown environments.

Motivation: Geometrically accurate and semantically rich map representations are essential for robot navigation and task planning, but real-time open-vocabulary semantic understanding remains an open problem.

Details

Method: Incorporates vision-language information into dense volumetric submaps, using pixel-wise features from SAM-generated segments to build object-centric volumetric submaps. Result: Achieves state-of-the-art semantic accuracy on the Replica dataset and supports exploration driven by natural language queries. Conclusion: FindAnything is the first open-vocabulary mapping system deployed on resource-constrained devices, providing efficient semantic understanding for robotic tasks. Abstract: Geometrically accurate and semantically expressive map representations have proven invaluable to facilitate robust and safe mobile robot navigation and task planning. Nevertheless, real-time, open-vocabulary semantic understanding of large-scale unknown environments is still an open problem. In this paper we present FindAnything, an open-world mapping and exploration framework that incorporates vision-language information into dense volumetric submaps. Thanks to the use of vision-language features, FindAnything bridges the gap between pure geometric and open-vocabulary semantic information for a higher level of understanding while allowing exploration of any environment without the help of any external source of ground-truth pose information. We represent the environment as a series of volumetric occupancy submaps, resulting in a robust and accurate map representation that deforms upon pose updates when the underlying SLAM system corrects its drift, allowing for a locally consistent representation between submaps. Pixel-wise vision-language features are aggregated from efficient SAM (eSAM)-generated segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that is scalable also in terms of memory usage. The open-vocabulary map representation of FindAnything achieves state-of-the-art semantic accuracy in closed-set evaluations on the Replica dataset. This level of scene understanding allows a robot to explore environments based on objects or areas of interest selected via natural language queries. Our system is the first of its kind to be deployed on resource-constrained devices, such as MAVs, leveraging vision-language information for real-world robotic tasks.

Task-conditioned Ensemble of Expert Models for Continuous Learning

Renu Sharma,Debasmita Pal,Arun Ross

Task: Propose a task-conditioned ensemble of models that maintains model accuracy in non-stationary environments.

Motivation: Non-stationary environments cause distribution shifts and degrade model performance, so a method is needed that adapts to new data while retaining accuracy on old data.

Details

Method: Uses an ensemble of expert models driven by task membership information, which is provided dynamically by in-domain models based on the local outlier concept. Result: The method's effectiveness is validated in three experimental setups (distribution shift between tasks, distribution shift both between and within tasks, and disjoint distributions between tasks). Conclusion: The proposed task-conditioned ensemble effectively maintains model performance in non-stationary environments. Abstract: One of the major challenges in machine learning is maintaining the accuracy of the deployed model (e.g., a classifier) in a non-stationary environment. The non-stationary environment results in distribution shifts and, consequently, a degradation in accuracy. Continuous learning of the deployed model with new data could be one remedy. However, the question arises as to how we should update the model with new training data so that it retains its accuracy on the old data while adapting to the new data. In this work, we propose a task-conditioned ensemble of models to maintain the performance of the existing model. The method involves an ensemble of expert models based on task membership information. The in-domain models, based on the local outlier concept (and distinct from the expert models), provide task membership information dynamically at run-time to each probe sample. To evaluate the proposed method, we experiment with three setups: the first represents distribution shift between tasks (LivDet-Iris-2017), the second represents distribution shift both between and within tasks (LivDet-Iris-2020), and the third represents disjoint distribution between tasks (Split MNIST). The experiments highlight the benefits of the proposed method. The source code is available at https://github.com/iPRoBe-lab/Continuous_Learning_FE_DM.
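
The idea of gating expert models by per-task membership can be sketched as follows. Each task keeps a set of reference samples; a probe's outlier score with respect to a task is taken here as its mean distance to the k nearest reference samples, which is a deliberately simple stand-in for the local-outlier-based in-domain models mentioned in the abstract. Scores are converted to weights with a softmax, and expert outputs are combined with those weights. Every name, the scoring rule, and the softmax weighting are assumptions for illustration.

```python
import math


def knn_outlier_score(probe, reference_points, k=2):
    """Mean distance from the probe to its k nearest reference points."""
    distances = sorted(math.dist(probe, p) for p in reference_points)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


def task_weights(probe, task_data, k=2, temperature=1.0):
    """Softmax over negative outlier scores: in-domain tasks get larger weights."""
    scores = {task: knn_outlier_score(probe, pts, k) for task, pts in task_data.items()}
    best = min(scores.values())  # subtract the minimum for numerical stability
    exps = {task: math.exp(-(s - best) / temperature) for task, s in scores.items()}
    norm = sum(exps.values())
    return {task: e / norm for task, e in exps.items()}


def ensemble_predict(probe, task_data, experts, k=2):
    """Weighted combination of expert outputs for one probe sample."""
    weights = task_weights(probe, task_data, k)
    return sum(weights[task] * experts[task](probe) for task in experts)


# Toy setup: two tasks with well-separated data and constant "experts".
TASK_DATA = {
    "task_a": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "task_b": [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
}
EXPERTS = {"task_a": lambda x: 0.0, "task_b": lambda x: 1.0}
print(ensemble_predict((0.2, 0.2), TASK_DATA, EXPERTS))
```

Because new tasks only add a reference set and an expert, earlier experts are never retrained, which is how an ensemble of this kind can adapt to new data while retaining accuracy on old data.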