2025 04 10

Reducing Formal Context Extraction: A Newly Proposed Framework from Big Corpora

Bryar A. Hassan,Shko M. Qader,Alla A. Hassan,Joan Lu,Aram M. Ahmed,Jafar Majidpour,Tarik A. Rashid

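Task: Propose a framework that reduces the size of the formal context when extracting concept hierarchies from free text.

Motivation: Building concept hierarchies by hand is typically time-consuming and resource-intensive; automated extraction addresses this.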

Details

Method: Combine a WordNet-based method with a frequency-based technique to reduce the size of the formal context.

Result: The proposed hybrid method outperforms the other baseline techniques in concept lattice performance while retaining up to 98% of the quality of the generated concept hierarchies.

Conclusion: The framework effectively reduces the ambiguity of the formal context and speeds up concept hierarchy extraction.

Abstract: Automating the extraction of concept hierarchies from free text is advantageous because manual generation is frequently labor- and resource-intensive. As a result, the whole procedure for concept hierarchy learning from free text entails several phases, including sentence-level text processing, sentence splitting, and tokenization. Lemmatization is followed by formal context analysis (FCA) to derive the pairings. Nevertheless, the formal context may contain a few uninteresting and incorrect pairings. Generating the formal context can be slow; thus, reducing the size of the formal context is necessary to weed out irrelevant and incorrect pairings and to extract the concept lattice and hierarchies more quickly. This study proposes a framework for reducing the formal context when extracting concept hierarchies from free text, in order to reduce the ambiguity of the formal context. We achieve this by reducing the size of the formal context using a hybrid of a WordNet-based method and a frequency-based technique. Using 385 samples from the Wikipedia corpus and the suggested framework, tests are carried out to examine the reduced formal context, which leads to the concept lattice and concept hierarchy. With the help of concept lattice invariants, the lattice generated from the reduced formal context is compared to the standard one. Relative to the original lattices, the homomorphism between the resulting lattices retains up to 98% of the quality of the generated concept hierarchies, and the reduced concept lattice inherits the structural connections of the standard one. Additionally, the new framework is compared to five baseline techniques by measuring running time on random datasets with various densities. The findings demonstrate that, across various fill ratios, the hybrid approach of the proposed method outperforms the competing strategies in concept lattice performance.
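
The frequency-based half of the hybrid reduction can be illustrated with a minimal sketch. It assumes the formal context is a collection of (object, attribute) pairs, and that pairs observed fewer than `min_count` times are treated as noise. The example pairs, the threshold, and the function names `reduce_context` and `density` are hypothetical, and the WordNet-based step is not shown.

```python
from collections import Counter


def reduce_context(pairs, min_count=2):
    """Keep only (object, attribute) pairs observed at least min_count times."""
    counts = Counter(pairs)
    return {pair for pair, n in counts.items() if n >= min_count}


def density(context):
    """Fill ratio: occupied cells divided by the size of the object x attribute grid."""
    objects = {o for o, _ in context}
    attributes = {a for _, a in context}
    if not objects or not attributes:
        return 0.0
    return len(context) / (len(objects) * len(attributes))


# Hypothetical noun-verb pairs, as they might come out of parsed sentences.
observed = [
    ("hotel", "book"), ("hotel", "book"), ("hotel", "rent"),
    ("car", "rent"), ("car", "rent"), ("car", "drive"), ("car", "drive"),
    ("bike", "drive"),
]
reduced = reduce_context(observed, min_count=2)
print(sorted(reduced), density(reduced))
```

A smaller context gives the lattice construction fewer pairs to process, which is the motivation the abstract gives for reducing it before building the concept lattice.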

Query Understanding in LLM-based Conversational Information Seeking

Yifei Yuan,Zahra Abbasiantaeb,Yang Deng,Mohammad Aliannejadi

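Task: Explore how large language models (LLMs) can improve query understanding in conversational information seeking (CIS).

Motivation: Accurately interpreting user intent through context-aware interaction, which involves resolving ambiguity, refining queries, and adapting to evolving information needs, improves the real-time relevance and precision of search results.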

Details

Method: Cover LLM-driven techniques for developing robust evaluation metrics, building interactive systems, and applications such as proactive query management and query reformulation.

Result: The tutorial examines in depth how LLMs are applied to query understanding in conversational search systems and outlines future research directions.

Conclusion: The tutorial aims to deepen understanding of LLM-based conversational query understanding and to drive continued progress in the field.

Abstract: Query understanding in Conversational Information Seeking (CIS) involves accurately interpreting user intent through context-aware interactions. This includes resolving ambiguities, refining queries, and adapting to evolving information needs. Large Language Models (LLMs) enhance this process by interpreting nuanced language and adapting dynamically, improving the relevance and precision of search results in real time. In this tutorial, we explore advanced techniques to enhance query understanding in LLM-based CIS systems. We delve into LLM-driven methods for developing robust evaluation metrics to assess query understanding quality in multi-turn interactions, strategies for building more interactive systems, and applications like proactive query management and query reformulation. We also discuss key challenges in integrating LLMs for query understanding in conversational search systems and outline future research directions. Our goal is to deepen the audience's understanding of LLM-based conversational query understanding and inspire discussions to drive ongoing advancements in this field.

The Zero Body Problem: Probing LLM Use of Sensory Language

Rebecca M. M. Hicke,Sil Hamilton,David Mimno

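Task: Explore whether language models can approximate human use of embodied language.

Motivation: Embodied language such as sensory language matters to many fields (robotics, narratology, linguistics, cognitive science), yet language models have no embodied experience.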

Details

Method: Extend an existing parallel corpus of human- and model-written stories with 18,000 new stories generated by 18 popular models, and analyze the differences in sensory language use.

Result: All models generate stories whose sensory language use differs significantly from human usage, but the direction of the difference varies by model family: Gemini models use more sensory language, while most models from the other five families use less. Linear probes suggest that models can identify sensory language, but instruction tuning may discourage its use.

Conclusion: Language models differ significantly from humans in their use of sensory language; future work can examine the effect of instruction tuning further, supported by the expanded dataset.

Abstract: Sensory language expresses embodied experiences ranging from taste and sound to excitement and stomachache. This language is of interest to scholars from a wide range of domains including robotics, narratology, linguistics, and cognitive science. In this work, we explore whether language models, which are not embodied, can approximate human use of embodied language. We extend an existing corpus of parallel human and model responses to short story prompts with an additional 18,000 stories generated by 18 popular models. We find that all models generate stories that differ significantly from human usage of sensory language, but the direction of these differences varies considerably between model families. Namely, Gemini models use significantly more sensory language than humans along most axes whereas most models from the remaining five families use significantly less. Linear probes run on five models suggest that they are capable of identifying sensory language. However, we find preliminary evidence suggesting that instruction tuning may discourage usage of sensory language. Finally, to support further work, we release our expanded story dataset.

S'MoRE: Structural Mixture of Residual Experts for LLM Fine-tuning

Hanqing Zeng,Yinglong Xia,Zhuokai Zhao,Gilbert Jiang,Qiang Zhang,Jiayi Liu,Lizhu Zhang,Xiangjun Fan,Benyu Zhang

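Task: Propose S'MoRE, a new framework for fine-tuning pre-trained large language models that balances parameter efficiency and model capacity.

Motivation: Existing methods such as LoRA are efficient but inflexible, while MoE architectures increase model capacity at the cost of redundant, under-utilized parameters.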

Details

Method: S'MoRE applies hierarchical low-rank decomposition to expert weights, builds a multi-layer residual structure, and propagates residuals with a graph neural network (GNN).

Result: Under a similar parameter budget, S'MoRE improves the structural flexibility of traditional MoE or Mixture-of-LoRA by an exponential order and achieves better fine-tuning performance.

Conclusion: S'MoRE offers a novel approach to efficient LLM fine-tuning.

Abstract: Fine-tuning pre-trained large language models (LLMs) presents a dual challenge of balancing parameter efficiency and model capacity. Existing methods like low-rank adaptations (LoRA) are efficient but lack flexibility, while Mixture-of-Experts (MoE) architectures enhance model capacity at the cost of more, under-utilized parameters. To address these limitations, we propose Structural Mixture of Residual Experts (S'MoRE), a novel framework that seamlessly integrates the efficiency of LoRA with the flexibility of MoE. Specifically, S'MoRE employs hierarchical low-rank decomposition of expert weights, yielding residuals of varying orders interconnected in a multi-layer structure. By routing input tokens through sub-trees of residuals, S'MoRE emulates the capacity of many experts by instantiating and assembling just a few low-rank matrices. We craft the inter-layer propagation of S'MoRE's residuals as a special type of Graph Neural Network (GNN), and prove that under a similar parameter budget, S'MoRE improves the "structural flexibility" of traditional MoE (or Mixture-of-LoRA) by an exponential order. Comprehensive theoretical analysis and empirical results demonstrate that S'MoRE achieves superior fine-tuning performance, offering a transformative approach for efficient LLM adaptation.
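
To make the parameter-budget comparison concrete, the sketch below counts trainable parameters for a full weight update, a single rank-r LoRA update (two factors of shape r x d_in and d_out x r), and a flat mixture of several LoRA experts. It illustrates only the baselines the abstract names. It does not model S'MoRE's hierarchical residual structure, router parameters are ignored, and the layer sizes are assumptions chosen for illustration.

```python
def full_params(d_in, d_out):
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters of one low-rank update: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def mixture_of_lora_params(d_in, d_out, rank, num_experts):
    """A flat mixture of independent LoRA experts (router parameters not counted)."""
    return num_experts * lora_params(d_in, d_out, rank)


# Assumed layer shape and ranks, for illustration only.
d = 4096
print(full_params(d, d))                   # full fine-tuning of one layer
print(lora_params(d, d, 8))                # a single rank-8 adapter
print(mixture_of_lora_params(d, d, 8, 8))  # eight rank-8 experts
```

In a flat mixture the budget grows linearly with the number of experts, which is the cost that the abstract says S'MoRE avoids by assembling many effective experts from a few shared low-rank matrices.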

Temporal-contextual Event Learning for Pedestrian Crossing Intent Prediction

Hongbin Liang,Hezhe Qiao,Wei Huang,Qizhou Wang,Mingsheng Shang,Lin Chen

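Task: Ensure the safety of vulnerable road users by accurately predicting pedestrian crossing intention (PCI).

Motivation: Existing methods that analyze video frames struggle to capture the key temporal events related to pedestrian behaviour, which leads to sub-optimal PCI prediction.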

Details

Method: Propose Temporal-contextual Event Learning (TCL), which consists of a Temporal Merging Module (TMM) and a Contextual Attention Block (CAB).

Result: Experiments on the PIE, JAAD-beh, and JAAD-all datasets show that TCL substantially outperforms existing methods.

Conclusion: By combining temporal feature extraction with contextual attention, TCL learns more expressive representations for PCI prediction.

Abstract: Ensuring the safety of vulnerable road users through accurate prediction of pedestrian crossing intention (PCI) plays a crucial role in the context of autonomous and assisted driving. Most PCI prediction methods analyze the set of observed ego-view video frames to forecast crossing intent. However, they struggle to capture the critical events related to pedestrian behaviour along the temporal dimension due to the high redundancy of the video frames, which results in sub-optimal PCI prediction performance. Our research addresses this challenge by introducing a novel approach called Temporal-contextual Event Learning (TCL). TCL comprises the Temporal Merging Module (TMM), which manages the redundancy by clustering the observed video frames into multiple key temporal events, and the Contextual Attention Block (CAB), which adaptively aggregates multiple event features along with visual and non-visual data. By synthesizing temporal feature extraction and contextual attention on the key information across the critical events, TCL can learn an expressive representation for PCI prediction. Extensive experiments are carried out on three widely adopted datasets: PIE, JAAD-beh, and JAAD-all. The results show that TCL substantially surpasses the state-of-the-art methods. Our code can be accessed at https://github.com/dadaguailhb/TCL.

Language-Dependent Political Bias in AI: A Study of ChatGPT and Gemini

Dogus Yuksel,Mehmet Cem Catalbas,Bora Oc

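Task: Study the political leanings of large language models such as ChatGPT and Gemini, and how those leanings vary with the query language.

Motivation: Test whether these models are as politically neutral and unbiased as they claim to be.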

Details

Method: Administer a political axis test to ChatGPT and Gemini in 14 different languages.

Result: Both models show liberal and left-leaning biases, more pronounced in Gemini; the bias varies with the query language.

Conclusion: AI tools should not claim to be free of political leanings; they should instead strive for political neutrality and take these leanings into account when handling user queries.

Abstract: As leading examples of large language models, ChatGPT and Gemini claim to provide accurate and unbiased information, emphasizing their commitment to political neutrality and avoidance of personal bias. This research investigates the political tendency of large language models and the existence of differentiation according to the query language. For this purpose, ChatGPT and Gemini were subjected to a political axis test using 14 different languages. The findings of the study suggest that these large language models do exhibit political tendencies, with both models demonstrating liberal and leftist biases. A comparative analysis revealed that Gemini exhibited a more pronounced liberal and left-wing tendency compared to ChatGPT. The study also found that these political biases varied depending on the language used for inquiry. The study delves into the factors that constitute political tendencies and linguistic differentiation, exploring differences in the sources and scope of educational data, structural and grammatical features of languages, cultural and political contexts, and the model's response to linguistic features. From this standpoint, and from an ethical perspective, it is proposed that artificial intelligence tools should refrain from asserting a lack of political tendencies and neutrality, instead striving for political neutrality and executing user queries by incorporating these tendencies.

Ternarization of Vision Language Models for use on edge devices

Ben Crulis,Cyril De Runz,Barthelemy Serres,Gilles Venturini

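Task: Propose a process for compressing a pre-trained vision language model into a ternary version of itself.

Motivation: Avoid the cost of training a ternary model from scratch, and shorten ternarization with an initialization scheme based on the pre-trained weights.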

Details

Method: A new initialization scheme based on the k-means algorithm, plus custom operators for running the ternary model on the TensorFlow Lite engine.

Result: The ternary model offers a good compromise between memory usage and perplexity while having the fastest token generation speed.

Conclusion: With the custom ternary matrix multiplication operator, the ternary model provides a good trade-off between performance and efficiency.

Abstract: We propose a process to compress a pre-trained Vision Language Model into a ternary version of itself instead of training a ternary model from scratch. A new initialization scheme from pre-trained weights based on the k-means algorithm is proposed to reduce the ternarization time. We implement different custom operators for executing the ternary model on the TensorFlow Lite Engine. We compare the original model with its ternary and binary versions in terms of memory consumption, inference speed and perplexity. We find that the ternary model using our custom ternary matrix multiplication operator provides a good compromise in terms of memory usage and perplexity, while having the fastest token generation speed.
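
As a rough illustration of how k-means can turn pre-trained weights into ternary codes, the sketch below runs one-dimensional k-means with three clusters and maps the low, middle, and high clusters to -1, 0, and +1. This is not the authors' initialization scheme, whose details the abstract does not give. The starting centroids, the choice of scale (mean magnitude of the non-zero weights), and the example weights are all assumptions.

```python
def kmeans_1d(values, centroids, iterations=20):
    """Plain 1-D k-means: assign each value to its nearest centroid, then re-average."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def ternarize(weights):
    """Return ternary codes in {-1, 0, +1} and a single scale for the non-zero codes."""
    low, mid, high = kmeans_1d(weights, [min(weights), 0.0, max(weights)])
    t_neg = (low + mid) / 2   # boundary between the low and middle clusters
    t_pos = (mid + high) / 2  # boundary between the middle and high clusters
    codes = [-1 if w < t_neg else (1 if w > t_pos else 0) for w in weights]
    magnitudes = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(magnitudes) / len(magnitudes) if magnitudes else 0.0
    return codes, scale


# Hypothetical weights from one row of a pre-trained layer.
weights = [-0.9, -0.8, -0.05, 0.0, 0.1, 0.7, 1.0]
codes, scale = ternarize(weights)
print(codes, scale)
```

Each weight is then approximated as `scale * code`, so a layer can be stored as two-bit codes plus one floating-point scale.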

Don't Let It Hallucinate: Premise Verification via Retrieval-Augmented Logical Reasoning

Yuehan Qin,Shawn Li,Yi Nian,Xinyan Velocity Yu,Yue Zhao,Xuezhe Ma

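Task: Propose a retrieval-based framework that identifies and addresses false premises in user queries before generation, reducing hallucinated output from large language models (LLMs).

Motivation: Existing approaches (pretraining, fine-tuning, and inference-time techniques) tend to be computationally expensive, depend on extensive training data, or lack proactive prevention, which makes them hard to apply in real time.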

Details

Method: Transform the user query into a logical representation, verify each premise with retrieval-augmented generation (RAG), and incorporate the verification results into the prompt to keep the output factually consistent.

Result: Experiments show that the method reduces hallucinations and improves factual accuracy without requiring access to model logits or large-scale fine-tuning.

Conclusion: The proposed retrieval framework offers an efficient, real-time way to reduce LLM hallucinations without complex training.

Abstract: Large language models (LLMs) have shown substantial capacity for generating fluent, contextually appropriate responses. However, they can produce hallucinated outputs, especially when a user query includes one or more false premises, that is, claims that contradict established facts. Such premises can mislead LLMs into offering fabricated or misleading details. Existing approaches include pretraining, fine-tuning, and inference-time techniques that often rely on access to logits or address hallucinations after they occur. These methods tend to be computationally expensive, require extensive training data, or lack proactive mechanisms to prevent hallucination before generation, limiting their efficiency in real-time applications. We propose a retrieval-based framework that identifies and addresses false premises before generation. Our method first transforms a user's query into a logical representation, then applies retrieval-augmented generation (RAG) to assess the validity of each premise using factual sources. Finally, we incorporate the verification results into the LLM's prompt to maintain factual consistency in the final output. Experiments show that this approach effectively reduces hallucinations, improves factual accuracy, and does not require access to model logits or large-scale fine-tuning.
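
The three stages in the abstract (logical representation, premise verification, prompt construction) can be sketched as a toy pipeline. Everything here is a stand-in: premises are assumed to arrive already parsed as (subject, relation, object) triples, a small dictionary plays the role of the retrieved factual sources, and the function names and prompt wording are hypothetical rather than taken from the paper.

```python
def verify_premises(premises, knowledge_base):
    """Label each (subject, relation, object) premise as supported, contradicted, or unknown."""
    results = []
    for subject, relation, obj in premises:
        known = knowledge_base.get((subject, relation))
        if known is None:
            verdict = "unknown"
        elif known == obj:
            verdict = "supported"
        else:
            verdict = "contradicted"
        results.append(((subject, relation, obj), verdict))
    return results


def build_prompt(query, verification):
    """Prepend the verification results so the model sees them before it answers."""
    notes = [
        f"- The premise '{s} {r} {o}' is {verdict}."
        for (s, r, o), verdict in verification
    ]
    return "Premise check:\n" + "\n".join(notes) + "\n\nQuestion: " + query


# Hypothetical fact store standing in for a retriever over factual sources.
knowledge_base = {("Eiffel Tower", "is located in"): "Paris"}
query = "Why was the Eiffel Tower built in Rome?"
premises = [("Eiffel Tower", "is located in", "Rome")]

verification = verify_premises(premises, knowledge_base)
prompt = build_prompt(query, verification)
print(prompt)
```

Because the check happens before generation and only changes the prompt, a pipeline of this shape needs neither model logits nor fine-tuning, which matches the constraints the abstract describes.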

Analyzing How Text-to-Image Models Represent Nationalities in Everyday Tasks

Abdulkareem Alsudais

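Task: Study how a popular text-to-image (T2I) model represents people of 208 different nationalities when generating images of everyday tasks.

Motivation: Examine whether the T2I model overemphasizes features such as traditional attire, and whether this pattern is associated with region or income group.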

Details

Method: Develop two scenarios, generate images from input prompts, and use CLIP to measure alignment scores between the generated images and the prompts.

Result: The model tends to emphasize traditional attire, and this pattern is significantly associated with particular regions (such as the Middle East & North Africa and Sub-Saharan Africa) and with income groups.

Conclusion: The study reveals potential problems in how T2I models represent individuals of different nationalities and points to areas for improvement in future models.

Abstract: The primary objective of this paper is to investigate how a popular Text-to-Image (T2I) model represents people from 208 different nationalities when prompted to generate images of individuals performing typical everyday tasks. Two scenarios were developed, and images were generated based on input prompts that specified nationalities. The results show that in one scenario, the majority of images, and in the other, a substantial portion, depict individuals wearing traditional attire. This suggests that the model emphasizes such characteristics even when they are impractical for the given task. A statistically significant relationship was observed between this representation pattern and the regions associated with the specified countries. This indicates that the issue disproportionately affects certain areas, particularly the Middle East & North Africa and Sub-Saharan Africa. A notable association with income groups was also found. CLIP was used to measure alignment scores between generated images and various prompts and captions. The findings indicate statistically significant higher scores for images featuring individuals in traditional attire in one scenario. The study also examined revised prompts (additional contextual information automatically added to the original input prompts) to assess their potential influence on how individuals are represented in the generated images, finding that the word "traditional" was commonly added to revised prompts. These findings provide valuable insights into how T2I models represent individuals from various countries and highlight potential areas for improvement in future models.
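
A CLIP-style alignment score is commonly computed as the cosine similarity between an image embedding and a text embedding. The sketch below shows only that final step on made-up vectors. The embeddings are hypothetical placeholders rather than real CLIP outputs, and the abstract does not say exactly how the study computed its scores.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


# Hypothetical 2-D embeddings; real CLIP embeddings have hundreds of dimensions.
image_embedding = [3.0, 4.0]
caption_embeddings = {
    "a person cooking dinner": [4.0, 3.0],
    "a person in traditional attire cooking dinner": [3.0, 4.5],
}
scores = {
    caption: cosine_similarity(image_embedding, emb)
    for caption, emb in caption_embeddings.items()
}
print(scores)
```

Comparing the scores of one image against several captions, as in `scores`, is one way to ask which description the image aligns with more closely.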

Can LLMs Simulate Personas with Reversed Performance? A Benchmark for Counterfactual Instruction Following

Sai Adith Senthil Kumar,Hao Yan,Saipavan Perepa,Murong Yue,Ziyu Yao

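Task: Evaluate how well large language models (LLMs) simulate personas with reversed performance, such as low-proficiency students.

Motivation: Current LLMs cannot simulate reversed-performance personas, which limits the diversity and practical usefulness of simulated environments.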

Details

Method: Propose the first benchmark dataset, using mathematical reasoning as a representative scenario, to evaluate LLMs on "counterfactual instruction following".

Result: LLMs, including OpenAI o1, all struggle to simulate reversed-performance personas, and the effect worsens when both the race and the performance level of a persona are simulated.

Conclusion: Counterfactual instruction following is challenging and calls for further research.

Abstract: Large Language Models (LLMs) are increasingly used to simulate personas in virtual environments, leveraging their instruction-following capability. However, we discovered that even state-of-the-art LLMs cannot simulate personas with reversed performance (e.g., student personas with low proficiency in educational settings), which impairs the simulation diversity and limits the practical applications of the simulated environments. In this work, using mathematical reasoning as a representative scenario, we propose the first benchmark dataset for evaluating LLMs on simulating personas with reversed performance, a capability that we dub "counterfactual instruction following". We evaluate both open-weight and closed-source LLMs on this task and find that LLMs, including the OpenAI o1 reasoning model, all struggle to follow counterfactual instructions for simulating personas with reversed performance. Intersectionally simulating both the performance level and the race population of a persona worsens the effect even further. These results highlight the challenges of counterfactual instruction following and the need for further research.

Analyzing the Impact of Low-Rank Adaptation for Cross-Domain Few-Shot Object Detection in Aerial Images

Hicham Talaoubrid,Anissa Mokraoui,Ismail Ben Ayed,Axel Prouvost,Sonimith Hang,Monit Korn,Rémi Harvey

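Task: Study the application of Low-Rank Adaptation (LoRA) to small models for cross-domain few-shot object detection.

Motivation: LoRA was originally designed for large-scale models and helps mitigate overfitting, which makes it promising for resource-constrained settings.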

Details

Method: Integrate LoRA into DiffusionDet and evaluate its performance on the DOTA and DIOR datasets.

Result: In low-shot settings (e.g., 1-shot and 5-shot), LoRA applied after an initial fine-tuning slightly improves performance, while full fine-tuning remains more effective in higher-shot configurations.

Conclusion: LoRA shows potential for efficient adaptation in aerial object detection, which encourages further research into parameter-efficient fine-tuning strategies.

Abstract: This paper investigates the application of Low-Rank Adaptation (LoRA) to small models for cross-domain few-shot object detection in aerial images. Originally designed for large-scale models, LoRA helps mitigate overfitting, making it a promising approach for resource-constrained settings. We integrate LoRA into DiffusionDet, and evaluate its performance on the DOTA and DIOR datasets. Our results show that LoRA applied after an initial fine-tuning slightly improves performance in low-shot settings (e.g., 1-shot and 5-shot), while full fine-tuning remains more effective in higher-shot configurations. These findings highlight LoRA's potential for efficient adaptation in aerial object detection, encouraging further research into parameter-efficient fine-tuning strategies for few-shot learning. Our code is available here: https://github.com/HichTala/LoRA-DiffusionDet.

Analyzing Examinee Comments using DistilBERT and Machine Learning to Ensure Quality Control in Exam Content

Ye,Ma

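Task: Use Natural Language Processing (NLP) to analyze examinee comments in order to identify problematic test items.

Motivation: Automatically identifying negative feedback complements traditional statistical methods, improving test validity and reducing the manual review burden.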

Details

Method: Develop and validate machine learning models, incorporate psychometric features to improve model performance, and compare NLP-flagged items with traditionally flagged items.

Result: Examinee feedback provides valuable information that complements statistical methods, potentially improving test validity and reducing the manual review burden.

Conclusion: The study gives testing organizations an efficient mechanism for incorporating examinees' direct experience into quality assurance processes.

Abstract: This study explores using Natural Language Processing (NLP) to analyze candidate comments for identifying problematic test items. We developed and validated machine learning models that automatically identify relevant negative feedback, evaluated whether incorporating psychometric features enhances model performance, and compared NLP-flagged items with traditionally flagged items. Results demonstrate that candidate feedback provides valuable complementary information to statistical methods, potentially improving test validity while reducing manual review burden. This research offers testing organizations an efficient mechanism to incorporate direct candidate experience into quality assurance processes.

From Broadcast to Minimap: Achieving State-of-the-Art SoccerNet Game State Reconstruction

Vladimir Golovkin,Nikolay Nemtsev,Vasyl Shandyba,Oleg Udin,Nikita Kasatkin,Pavel Kononov,Anton Afanasiev,Sergey Ulasen,Andrei Boiarov

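Task: Achieve precise tracking and localization of every individual in a football match from a single-camera setup (Game State Reconstruction, GSR).

Motivation: Give coaches and analysts actionable insights into player movements, team formations, and game dynamics, so they can optimize training strategies and gain a competitive advantage.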

Details

Method: Combine a fine-tuned YOLOv5m for object detection, a SegFormer-based camera parameter estimator, and a DeepSORT-based tracking framework enhanced with re-identification, orientation prediction, and jersey number recognition.

Result: First place in the SoccerNet Game State Reconstruction Challenge 2024, significantly outperforming competing methods.

Conclusion: The proposed end-to-end pipeline achieves state-of-the-art game state reconstruction in a single-camera setup, with spatial accuracy and temporal consistency.

Abstract: Game State Reconstruction (GSR), a critical task in Sports Video Understanding, involves precise tracking and localization of all individuals on the football field (players, goalkeepers, referees, and others) in real-world coordinates. This capability enables coaches and analysts to derive actionable insights into player movements, team formations, and game dynamics, ultimately optimizing training strategies and enhancing competitive advantage. Achieving accurate GSR using a single-camera setup is highly challenging due to frequent camera movements, occlusions, and dynamic scene content. In this work, we present a robust end-to-end pipeline for tracking players across an entire match using a single-camera setup. Our solution integrates a fine-tuned YOLOv5m for object detection, a SegFormer-based camera parameter estimator, and a DeepSORT-based tracking framework enhanced with re-identification, orientation prediction, and jersey number recognition. By ensuring both spatial accuracy and temporal consistency, our method delivers state-of-the-art game state reconstruction, securing first place in the SoccerNet Game State Reconstruction Challenge 2024 and significantly outperforming competing methods.
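
Placing a detected player on a minimap amounts to mapping an image point to pitch coordinates. A common way to do this for a flat pitch is a 3x3 homography, sketched below. Whether the authors' camera parameter estimator produces a homography is an assumption, and the matrices and the choice of the bounding-box bottom centre as the player's ground point are illustrative.

```python
def project_to_pitch(H, x, y):
    """Apply a 3x3 homography H to image point (x, y) and return pitch coordinates."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return u / w, v / w


def foot_point(box):
    """Bottom-centre of an (x_min, y_min, x_max, y_max) box, taken as the ground contact."""
    x_min, _, x_max, y_max = box
    return (x_min + x_max) / 2, y_max


# Hypothetical homography (scale and shift only) and one detection box in pixels.
H = [
    [0.5, 0.0, -10.0],
    [0.0, 0.5, -5.0],
    [0.0, 0.0, 1.0],
]
box = (90.0, 10.0, 110.0, 50.0)
pitch_xy = project_to_pitch(H, *foot_point(box))
print(pitch_xy)
```

With moving broadcast cameras the mapping changes from frame to frame, which is why the pipeline in the abstract estimates camera parameters rather than relying on a fixed calibration.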

CDER: Collaborative Evidence Retrieval for Document-level Relation Extraction

Khai Phan Tran,Xue Li

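Task: Propose CDER, a new evidence retrieval framework for document-level relation extraction (DocRE).

Motivation: Existing evidence retrieval systems overlook the collaborative nature of semantically similar entity pairs in the same document, which hurts evidence retrieval.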

Details

Method: Use an attentional graph-based architecture to capture collaborative patterns, and add a dynamic sub-structure for more robust evidence retrieval.

Result: Experiments on a benchmark DocRE dataset show that CDER excels at evidence retrieval and improves the overall performance of existing DocRE systems.

Conclusion: By capturing collaborative patterns and introducing a dynamic sub-structure, CDER substantially improves both evidence retrieval and DocRE system performance.

Abstract: Document-level Relation Extraction (DocRE) involves identifying relations between entities across multiple sentences in a document. Evidence sentences, crucial for precisely identifying relations between entity pairs, sharpen the focus on essential text segments and improve DocRE performance. However, existing evidence retrieval systems often overlook the collaborative nature among semantically similar entity pairs in the same document, hindering the effectiveness of the evidence retrieval task. To address this, we propose a novel evidence retrieval framework, namely CDER. CDER employs an attentional graph-based architecture to capture collaborative patterns and incorporates a dynamic sub-structure for additional robustness in evidence retrieval. Experimental results on the benchmark DocRE dataset show that CDER not only excels in the evidence retrieval task but also enhances the overall performance of existing DocRE systems.

Towards Calibration Enhanced Network by Inverse Adversarial Attack

Yupeng Cheng,Zi Pong Lim,Sarthak Ketanbhai Modi,Yon Shin Teo,Yushi Cao,Shang-Wei Lin

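Task: Use adversarial training techniques to enhance OCR models in HMI testing scenarios.

Motivation: As HMI software grows more complex, test automation becomes increasingly important, and OCR models struggle with noise handling.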

Details

Method: Design a new adversarial attack objective, optimize the OCR model's decision boundaries through adversarial training, and build an HMI screen dataset with multiple types of perturbation.

Result: Experiments show that adversarial training makes OCR models more robust to various kinds of noise while maintaining high accuracy, and that the trained models also show some robustness to perturbations from other patterns.

Conclusion: Adversarial training effectively improves the robustness and accuracy of OCR models in HMI testing scenarios.

Abstract: Test automation has become increasingly important as the complexity of both design and content in Human Machine Interface (HMI) software continues to grow. Current standard practice uses Optical Character Recognition (OCR) techniques to automatically extract textual information from HMI screens for validation. At present, one of the key challenges in automating HMI screen validation is noise handling in the OCR models. In this paper, we propose to utilize adversarial training techniques to enhance OCR models in HMI testing scenarios. More specifically, we design a new adversarial attack objective for OCR models to discover the decision boundaries in the context of HMI testing. We then adopt adversarial training to optimize the decision boundaries towards a more robust and accurate OCR model. In addition, we built an HMI screen dataset based on real-world requirements and applied multiple types of perturbation to the clean HMI dataset to provide more complete coverage of the potential scenarios. We conduct experiments to demonstrate how using adversarial training techniques yields OCR models that are more robust against various kinds of noise, while still maintaining high OCR model accuracy. Further experiments even demonstrate that the adversarially trained models exhibit a certain degree of robustness against perturbations from other patterns.

Lugha-Llama: Adapting Large Language Models for African Languages

Happy Buzaaba,Alexander Wettig,David Ifeoluwa Adelani,Christiane Fellbaum

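Task: Study how to adapt large language models (LLMs) to low-resource African languages.

Motivation: African languages are under-represented in large training corpora, so LLMs perform poorly on them.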

Details

Method: Train on curated African-language data combined with high-quality English educational text.

Result: The models achieve the best performance among similarly sized baselines on IrokoBench, particularly on knowledge-intensive multiple-choice questions (AfriMMLU); on the cross-lingual question answering benchmark AfriQA, they outperform the base model by over 10%.

Conclusion: The content of the English data is primarily responsible for the gains; the models and data are released to encourage further research on African languages.

Abstract: Large language models (LLMs) have achieved impressive results in a wide range of natural language applications. However, they often struggle to recognize low-resource languages, in particular African languages, which are not well represented in large training corpora. In this paper, we consider how to adapt LLMs to low-resource African languages. We find that combining curated data from African languages with high-quality English educational texts results in a training mix that substantially improves the model's performance on these languages. On the challenging IrokoBench dataset, our models consistently achieve the best performance amongst similarly sized baselines, particularly on knowledge-intensive multiple-choice questions (AfriMMLU). Additionally, on the cross-lingual question answering benchmark AfriQA, our models outperform the base model by over 10%. To better understand the role of English data during training, we translate a subset of 200M tokens into Swahili and perform an analysis which reveals that the content of these data is primarily responsible for the strong performance. We release our models and data to encourage future research on African languages.

SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation

Hritam Basak,Zhaozheng Yin

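Task: Propose a language-guided semi-supervised domain adaptation (SSDA) method for semantic segmentation.

Motivation: Address the shortcomings of naively combining DA and SSL in semantic segmentation, namely class confusion and an imbalanced training data distribution.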

Details

Method: Exploit the semantic generalization ability of vision-language models (VLMs) and design class-balanced segmentation losses.

Result: Substantially better performance than existing methods across diverse domain adaptation scenarios.

Conclusion: The language-guided SSDA framework effectively improves the robustness and performance of semantic segmentation.

Abstract: Domain Adaptation (DA) and Semi-supervised Learning (SSL) converge in Semi-supervised Domain Adaptation (SSDA), where the objective is to transfer knowledge from a source domain to a target domain using a combination of limited labeled target samples and abundant unlabeled target data. Although intuitive, a simple amalgamation of DA and SSL is suboptimal in semantic segmentation for two major reasons: (1) previous methods, while able to learn good segmentation boundaries, are prone to confusing classes with similar visual appearance due to limited supervision; and (2) a skewed and imbalanced training data distribution favors source representation learning while impeding exploration of the limited information about tail classes. Language guidance can serve as a pivotal semantic bridge, facilitating robust class discrimination and mitigating visual ambiguities by leveraging the rich semantic relationships encoded in pre-trained language models to enhance feature representations across domains. Therefore, we propose the first language-guided SSDA setting for semantic segmentation in this work. Specifically, we harness the semantic generalization capabilities inherent in vision-language models (VLMs) to establish a synergistic framework within the SSDA paradigm. To address the inherent class-imbalance challenges in long-tailed distributions, we introduce class-balanced segmentation loss formulations that effectively regularize the learning process. Through extensive experimentation across diverse domain adaptation scenarios, our approach demonstrates substantial performance improvements over contemporary state-of-the-art (SoTA) methodologies. Code is available at https://github.com/hritam-98/SemiDAViL.
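
One simple way to counter class imbalance in a segmentation loss is to weight each class by the inverse of its pixel frequency. The sketch below shows that generic idea with a weighted cross-entropy. It is not the class-balanced formulation proposed in the paper, which the abstract does not specify, and the pixel counts and class names are hypothetical.

```python
import math


def inverse_frequency_weights(pixel_counts):
    """Weight each class by total / (num_classes * count), so rare classes weigh more."""
    total = sum(pixel_counts.values())
    num_classes = len(pixel_counts)
    return {c: total / (num_classes * n) for c, n in pixel_counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class) over pixels.

    probs is a list of {class: probability} dicts, one per pixel.
    """
    numerator = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    denominator = sum(weights[y] for y in labels)
    return numerator / denominator


# Hypothetical pixel counts for a head class and a tail class.
weights = inverse_frequency_weights({"road": 900, "sign": 100})

# Two pixels: the model is confident on the road pixel and unsure on the sign pixel.
probs = [{"road": 0.9, "sign": 0.1}, {"road": 0.6, "sign": 0.4}]
labels = ["road", "sign"]
loss = weighted_cross_entropy(probs, labels, weights)
uniform = weighted_cross_entropy(probs, labels, {"road": 1.0, "sign": 1.0})
print(weights, loss, uniform)
```

With these numbers the weighted loss is larger than the uniform one, because the error on the rare `sign` pixel counts for more.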

NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables

Lanrui Wang,Mingyu Zheng,Hongyin Tang,Zheng Lin,Yanan Cao,Jingang Wang,Xunliang Cai,Weiping Wang

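Task: Propose and evaluate the NeedleInATable (NIAT) task to test how well large language models (LLMs) understand long structured tables.

Motivation: Existing long-context benchmarks focus mainly on unstructured text and neglect the challenges of long, complex structured tables.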

Details

Method: Introduce the NIAT task, which treats each table cell as a "needle" and requires the model to extract the target cell under different queries, and propose a data synthesis method to strengthen models' long-table comprehension.

Result: Mainstream LLMs perform poorly on NIAT, while the proposed data synthesis method significantly improves model performance.

Conclusion: The work advances the evaluation of LLMs' comprehension of long structured tables and paves the way for progress in long-context and table understanding applications.

Abstract: Processing structured tabular data, particularly lengthy tables, constitutes a fundamental yet challenging task for large language models (LLMs). However, existing long-context benchmarks primarily focus on unstructured text, neglecting the challenges of long and complex structured tables. To address this gap, we introduce NeedleInATable (NIAT), a novel task that treats each table cell as a "needle" and requires the model to extract the target cell under different queries. Evaluation results of mainstream LLMs on this benchmark show they lack robust long-table comprehension, often relying on superficial correlations or shortcuts for complex table understanding tasks, revealing significant limitations in processing intricate tabular data. To this end, we propose a data synthesis method to enhance models' long-table comprehension capabilities. Experimental results show that our synthesized training data significantly enhances LLMs' performance on the NIAT task, outperforming both long-context LLMs and long-table agent methods. This work advances the evaluation of LLMs' genuine long-structured table comprehension capabilities and paves the way for progress in long-context and table understanding applications.
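
The "every cell is a needle" idea can be sketched as a small generator that turns each non-key cell of a table into a lookup question with a known answer, plus an exact-match scorer. The question template, the use of one key column to identify rows, and the function names are assumptions for illustration; the benchmark's actual query types are not described in the abstract.

```python
def build_needle_queries(header, rows, key_column):
    """Create one (question, answer) pair per non-key cell of the table."""
    key_idx = header.index(key_column)
    queries = []
    for row in rows:
        for col_idx, col_name in enumerate(header):
            if col_idx == key_idx:
                continue
            question = (
                f"What is the {col_name} of the row where "
                f"{key_column} is {row[key_idx]}?"
            )
            queries.append((question, row[col_idx]))
    return queries


def exact_match_accuracy(predictions, queries):
    """Fraction of predictions that equal the gold cell after whitespace stripping."""
    if not queries:
        return 0.0
    correct = sum(
        str(pred).strip() == str(answer).strip()
        for pred, (_, answer) in zip(predictions, queries)
    )
    return correct / len(queries)


# Hypothetical table; a real benchmark table would be far longer.
header = ["city", "country", "population"]
rows = [
    ["Lyon", "France", "522000"],
    ["Porto", "Portugal", "232000"],
]
queries = build_needle_queries(header, rows, key_column="city")
predictions = ["France", "522000", "Spain", "232000"]  # stand-in model outputs
print(queries[0], exact_match_accuracy(predictions, queries))
```

Because every cell yields a query, accuracy can be broken down by row and column position to see where in a long table a model starts to miss.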

PromptHMR: Promptable Human Mesh Recovery

Yufu Wang,Yu Sun,Priyanka Patel,Kostas Daniilidis,Michael J. Black,Muhammed Kocabas

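Task: Propose PromptHMR, a transformer-based promptable method that improves the accuracy of human pose and shape (HPS) estimation across diverse scenarios.

Motivation: Existing methods lack a mechanism for using auxiliary information such as spatial and semantic prompts, and they perform poorly in complex scenes such as crowds and person-person interactions.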

Details

Method: Reformulate HPS estimation through spatial prompts (such as bounding boxes and masks) and semantic prompts (such as language descriptions or interaction labels), while processing full images to preserve scene context.

Result: PromptHMR is robust in challenging scenarios such as crowded scenes and person-person interactions, and achieves state-of-the-art performance on benchmarks.

Conclusion: Through its flexible prompting mechanism, PromptHMR substantially improves the accuracy and adaptability of HPS estimation.

Abstract: Human pose and shape (HPS) estimation presents challenges in diverse scenarios such as crowded scenes, person-person interactions, and single-view reconstruction. Existing approaches lack mechanisms to incorporate auxiliary "side information" that could enhance reconstruction accuracy in such challenging scenarios. Furthermore, the most accurate methods rely on cropped person detections and cannot exploit scene context while methods that process the whole image often fail to detect people and are less accurate than methods that use crops. While recent language-based methods explore HPS reasoning through large language or vision-language models, their metric accuracy is well below the state of the art. In contrast, we present PromptHMR, a transformer-based promptable method that reformulates HPS estimation through spatial and semantic prompts. Our method processes full images to maintain scene context and accepts multiple input modalities: spatial prompts like bounding boxes and masks, and semantic prompts like language descriptions or interaction labels. PromptHMR demonstrates robust performance across challenging scenarios: estimating people from bounding boxes as small as faces in crowded scenes, improving body shape estimation through language descriptions, modeling person-person interactions, and producing temporally coherent motions in videos. Experiments on benchmarks show that PromptHMR achieves state-of-the-art performance while offering flexible prompt-based control over the HPS estimation process.

FuseRL: Dense Preference Optimization for Heterogeneous Model Fusion

Longguang Zhong,Fanqi Wan,Ziyi Yang,Guosheng Liang,Tianyuan Shi,Xiaojun Quan

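Task: Improve the performance of large language models (LLMs) through heterogeneous model fusion.

Motivation: Existing methods usually select only the best output per prompt from the source models, which underutilizes their potential and yields sparse optimization signals.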

Details

Method: Proposes the FuseRL framework with two stages, FuseSFT and FusePO: FuseSFT integrates the strengths of heterogeneous source models through weighted supervised fine-tuning; FusePO optimizes weighted preferences based on the outputs of multiple source models. Result: With Llama-3.1-8B-Instruct as the target model, the approach achieves the best performance among 8B LLMs on the AlpacaEval-2 and Arena-Hard benchmarks. Conclusion: FuseSFT reduces overfitting and FusePO provides dense, diverse optimization signals; together they improve model performance. Abstract: Heterogeneous model fusion enhances the performance of LLMs by integrating the knowledge and capabilities of multiple structurally diverse models. However, existing approaches often rely solely on selecting the best output for each prompt from source models, which underutilizes their full potential due to limited source knowledge and results in sparse optimization signals. To address this limitation, we propose FuseRL, a novel two-stage framework comprising FuseSFT and FusePO to maximize the utilization of source LLMs. FuseSFT establishes a robust initialization by integrating the strengths of heterogeneous source models through weighted supervised fine-tuning (SFT) on diverse outputs for each prompt. FusePO optimizes weighted preferences based on the outputs of multiple source models to enable superior alignment performance. Extensive experiments demonstrate the effectiveness of our framework across various preference alignment methods, including RLOO, DPO, and SimPO. Using Llama-3.1-8B-Instruct as the target model, our approach achieves state-of-the-art performance among 8B LLMs on the AlpacaEval-2 and Arena-Hard benchmarks. Further analysis suggests that FuseSFT regularizes the training process to reduce overfitting, while FusePO introduces dense and diverse signals for preference optimization.

D-Feat Occlusions: Diffusion Features for Robustness to Partial Visual Occlusions in Object Recognition

Rupayan Mallick,Sibo Dong,Nataniel Ruiz,Sarah Adel Bargal

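Task: Improve the robustness of classification models to occlusion in object recognition by leveraging a frozen diffusion model.

Motivation: Diffusion models excel at image generation and image completion and capture image context, so their features can be used to address occlusion.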

Details

Method: Proposes a pipeline combining input-based and feature-based augmentation: fine-tuning on images whose occluder pixels are inpainted, and combining diffusion features with classification features. Result: Experiments show the method is effective for both Transformers and ConvNets on ImageNet with simulated occlusions, and is also more robust on a real-world occlusion dataset. Conclusion: Features from diffusion models can significantly improve robustness to partial object occlusion. Abstract: Applications of diffusion models for visual tasks have been quite noteworthy. This paper targets making classification models more robust to occlusions for the task of object recognition by proposing a pipeline that utilizes a frozen diffusion model. Diffusion features have demonstrated success in image generation and image completion while understanding image context. Occlusion can be posed as an image completion problem by deeming the pixels of the occluder to be "missing." We hypothesize that such features can help hallucinate object visual features behind occluding objects, and hence we propose using them to enable models to become more occlusion robust. We design experiments to include input-based augmentations as well as feature-based augmentations. Input-based augmentations involve finetuning on images where the occluder pixels are inpainted, and feature-based augmentations involve augmenting classification features with intermediate diffusion features. We demonstrate that our proposed use of diffusion-based features results in models that are more robust to partial object occlusions for both Transformers and ConvNets on ImageNet with simulated occlusions. We also propose a dataset that encompasses real-world occlusions and demonstrate that our method is more robust to partial object occlusions.

Do Reasoning Models Show Better Verbalized Calibration?

Qingcheng Zeng,Weihao Xuan,Leyang Cui,Rob Voigt

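Task: Study the calibration of large reasoning models (LRMs) on complex reasoning tasks, in particular their confidence calibration compared with instruction-tuned models.

Motivation: Although LRMs perform well at complex reasoning, whether they are better calibrated than instruction-tuned models, especially in their stated confidence, remains an open question.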

Details

Method: Compares LRMs trained via supervised fine-tuning distillation (SFT reasoning models) and outcome-based reinforcement learning for reasoning (RL reasoning models) across multiple domains. Result: LRMs significantly outperform instruction-tuned models on complex reasoning tasks in both accuracy and confidence calibration, but results on factuality tasks are mixed, with some models performing worse. Conclusion: Reasoning-oriented RL training may play a key role in improving LLMs' ability to generate trustworthy, self-aware outputs. Abstract: Large reasoning models (LRMs) have recently shown impressive capabilities in complex reasoning by leveraging increased test-time computation and exhibiting behaviors akin to human-like deliberation. Despite these advances, it remains an open question whether LRMs are better calibrated, particularly in their verbalized confidence, than instruction-tuned counterparts. In this paper, we investigate the calibration properties of LRMs trained via supervised fine-tuning distillation on long reasoning traces (henceforth SFT reasoning models) and outcome-based reinforcement learning for reasoning (henceforth RL reasoning models) across diverse domains. Our findings reveal that LRMs significantly outperform instruction-tuned models on complex reasoning tasks in both accuracy and confidence calibration. In contrast, we find surprising trends in the domain of factuality in particular. On factuality tasks, while DeepSeek-R1 shows strong calibration behavior, the smaller QwQ-32B shows no improvement over instruct models; moreover, SFT reasoning models display worse calibration (greater overconfidence) compared to instruct models. Our results provide evidence for a potentially critical role of reasoning-oriented RL training in improving LLMs' capacity for generating trustworthy, self-aware outputs.
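Calibration of verbalized confidence is often summarized with the expected calibration error (ECE). The abstract does not say which metric the authors use, so the sketch below is an assumption: it bins stated confidences into equal-width bins and averages the gap between mean confidence and accuracy, weighted by bin size.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins on [0, 1]."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece


# Toy data: stated confidences and whether each answer was right.
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [True, True, True, False, True, False]
ece = expected_calibration_error(confidences, correct)
```

In the toy data the model states 0.9 where it is right 75% of the time and 0.6 where it is right 50% of the time, so the ECE is positive and reflects overconfidence.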

Implementation of a Zed 2i Stereo Camera for High-Frequency Shoreline Change and Coastal Elevation Monitoring

José A. Pilartes-Congo,Matthew Kastl,Michael J. Starek,Marina Vicens-Miquel,Philippe Tissot

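Task: Monitor shoreline change and coastal elevation using a low-cost ZED 2i stereo camera system and close-range photogrammetry.

Motivation: Growing populations and financial interests in coastal areas have increased the need to monitor coastal elevation and shoreline change at high temporal resolution.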

Details

Method: Uses a ZED 2i stereo camera system and close-range photogrammetry to generate 3D point clouds, digital surface models (DSMs), and georectified imagery. Result: The system achieved a mean reprojection error of 0.20 px, a point cloud registration of 27 cm, a vertical error of 37.56 cm, and georectification root mean square errors of 2.67 cm and 2.81 cm in x and y. Conclusion: Despite its limitations, the ZED 2i system can provide the desired mapping products at a localized scale and high temporal resolution. Abstract: The growing population, and with it financial interests, in coastal areas has increased the need to monitor coastal elevation and shoreline change. Though several resources exist to obtain this information, they often lack the required temporal resolution for short-term monitoring (e.g., every hour). To address this issue, this study implements a low-cost ZED 2i stereo camera system and close-range photogrammetry to collect images for generating 3D point clouds, digital surface models (DSMs) of beach elevation, and georectified imagery at a localized scale and high temporal resolution. The main contributions of this study are (i) intrinsic camera calibration, (ii) georectification and registration of acquired imagery and point cloud, (iii) generation of the DSM of the beach elevation, and (iv) a comparison of derived products against those from uncrewed aircraft system structure-from-motion photogrammetry. Preliminary results show that despite its limitations, the ZED 2i can provide the desired mapping products at localized and high temporal scales. The system achieved a mean reprojection error of 0.20 px, a point cloud registration of 27 cm, a vertical error of 37.56 cm relative to ground truth, and georectification root mean square errors of 2.67 cm and 2.81 cm for x and y.
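The georectification errors above are root mean square errors reported per axis. As a minimal sketch of that arithmetic, with made-up check-point coordinates (the values below are illustrative and not taken from the study):

```python
import math


def rmse(measured, reference):
    """Root mean square error between paired coordinates."""
    if len(measured) != len(reference) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(m - r) ** 2 for m, r in zip(measured, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical x coordinates (metres) of three check points:
# as located in the georectified image vs. as surveyed on the ground.
x_image = [10.03, 20.00, 29.96]
x_survey = [10.00, 20.00, 30.00]
x_rmse = rmse(x_image, x_survey)
```

Running the same function on the y coordinates gives the second reported value.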

Bypassing Safety Guardrails in LLMs Using Humor

Pedro Cisneros-Velarde

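Task: Bypass the safety guardrails of large language models using humorous prompts.

Motivation: Explore how a fixed-template humorous prompt can bypass LLM safety guardrails without modifying the unsafe request.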

Details

Method: Uses a fixed-template humorous prompt, with no additional LLMs needed to craft it. Result: Experiments show the method is effective across different LLMs, and that both too much and too little humor reduce its effectiveness. Conclusion: LLM jailbreaking requires a proper balance between focus on the unsafe request and the presence of humor. Abstract: In this paper, we show it is possible to bypass the safety guardrails of large language models (LLMs) through a humorous prompt including the unsafe request. In particular, our method does not edit the unsafe request and follows a fixed template; it is simple to implement and does not need additional LLMs to craft prompts. Extensive experiments show the effectiveness of our method across different LLMs. We also show that both removing and adding more humor to our method can reduce its effectiveness; excessive humor possibly distracts the LLM from fulfilling its unsafe request. Thus, we argue that LLM jailbreaking occurs when there is a proper balance between focus on the unsafe request and presence of humor.

Mind the Gap: Evaluating Vision Systems in Small Data Applications

Samuel Stevens,S M Rayeed,Jenna Kline

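Task: Compare the performance of multi-modal large language models (MLLMs) and vision-only methods in small-data settings.

Motivation: Computer vision research has neglected the small-data regime, even though practical applications such as ecological monitoring, medical diagnostics, and industrial quality control depend on it.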

Details

Method: Uses the Natural World Tasks (NeWT) benchmark to compare MLLMs and vision-only methods across varying training set sizes. Result: MLLMs plateau early in the small-data regime while vision-only methods keep improving, with the gap widening beyond 10 training examples. Conclusion: Calls for explicit small-data evaluations in AI research to better connect theoretical advances with practical deployments. Abstract: The practical application of AI tools for specific computer vision tasks relies on the "small-data regime" of hundreds to thousands of labeled samples. This small-data regime is vital for applications requiring expensive expert annotations, such as ecological monitoring, medical diagnostics, or industrial quality control. We find, however, that computer vision research has ignored the small-data regime as evaluations increasingly focus on zero- and few-shot learning. We use the Natural World Tasks (NeWT) benchmark to compare multi-modal large language models (MLLMs) and vision-only methods across varying training set sizes. MLLMs exhibit early performance plateaus, while vision-only methods improve throughout the small-data regime, with performance gaps widening beyond 10 training examples. We provide the first comprehensive comparison between these approaches in small-data contexts and advocate for explicit small-data evaluations in AI research to better bridge theoretical advances with practical deployments.

Automated Business Process Analysis: An LLM-Based Approach to Value Assessment

William De Michele,Abel Armas Cervantes,Lea Frermann

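Task: Automate value-added analysis of business processes using large language models (LLMs).

Motivation: Traditional manual value-added analysis is time-consuming and subjective, which makes it hard to optimize business processes efficiently.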

Details

Method: Works in two phases: first decomposing high-level activities into detailed steps, then classifying each step by the value it adds according to Lean principles. Result: Evaluated on 50 business process models, structured prompting outperforms zero-shot baselines and overall performance is promising. Conclusion: LLMs can assist human experts in qualitative analysis, reducing time and subjectivity. Abstract: Business processes are fundamental to organizational operations, yet their optimization remains challenging due to the time-consuming nature of manual process analysis. Our paper harnesses Large Language Models (LLMs) to automate value-added analysis, a qualitative process analysis technique that aims to identify steps in the process that do not deliver value. To date, this technique is predominantly manual, time-consuming, and subjective. Our method offers a more principled approach which operates in two phases: first, decomposing high-level activities into detailed steps to enable granular analysis, and second, performing a value-added analysis to classify each step according to Lean principles. This approach enables systematic identification of waste while maintaining the semantic understanding necessary for qualitative analysis. We develop our approach using 50 business process models, for which we collect and publish manual ground-truth labels. Our evaluation, comparing zero-shot baselines with more structured prompts, reveals (a) a consistent benefit of structured prompting and (b) promising performance for both tasks. We discuss the potential for LLMs to augment human expertise in qualitative process analysis while reducing the time and subjectivity inherent in manual approaches.

STaR: Seamless Spatial-Temporal Aware Motion Retargeting with Penetration and Consistency Constraints

Xiaohang Yang,Qing Wang,Jiahao Yang,Gregory Slabaugh,Shanxin Yuan

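Task: Propose STaR, a novel sequence-to-sequence model for spatial-temporal aware motion retargeting.

Motivation: Existing methods tend to focus on either geometric plausibility or temporal consistency, which can lead to interpenetration or jitter in retargeted motion.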

Details

Method: STaR consists of two modules: a spatial module (using dense shape representation and a limb penetration constraint) and a temporal module (using a temporal transformer and a temporal consistency constraint). Result: Experiments on the Mixamo and ScanRet datasets show that the method produces plausible and coherent motions while significantly reducing interpenetration rates. Conclusion: STaR strikes a good balance between semantic, geometric, and temporal objectives and outperforms existing methods. Abstract: Motion retargeting seeks to faithfully replicate the spatio-temporal motion characteristics of a source character onto a target character with a different body shape. Apart from motion semantics preservation, ensuring geometric plausibility and maintaining temporal consistency are also crucial for effective motion retargeting. However, many existing methods prioritize either geometric plausibility or temporal consistency. Neglecting geometric plausibility results in interpenetration, while neglecting temporal consistency leads to motion jitter. In this paper, we propose a novel sequence-to-sequence model for seamless Spatial-Temporal aware motion Retargeting (STaR), with penetration and consistency constraints. STaR consists of two modules: (1) a spatial module that incorporates dense shape representation and a novel limb penetration constraint to ensure geometric plausibility while preserving motion semantics, and (2) a temporal module that utilizes a temporal transformer and a novel temporal consistency constraint to predict the entire motion sequence at once while enforcing multi-level trajectory smoothness. The seamless combination of the two modules helps us achieve a good balance between the semantic, geometric, and temporal targets. Extensive experiments on the Mixamo and ScanRet datasets demonstrate that our method produces plausible and coherent motions while significantly reducing interpenetration rates compared with other approaches.

ThoughtProbe: Classifier-Guided Thought Space Exploration Leveraging LLM Intrinsic Reasoning

Zijian Wang,Chang Xu

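Task: Investigate the neural representation mechanisms underlying intrinsic reasoning in pre-trained large language models (LLMs) and how to use them optimally.

Motivation: Although LLMs exhibit naturally emerging intrinsic reasoning capabilities, the underlying neural representation mechanisms and how best to exploit them remain poorly understood.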

Details

Method: Proposes a linear-classifier-based framework that guides exploration of a tree-structured response space by detecting reasoning in specific representation types and network layers of the activation space, combined with a branch-aggregation selection method. Result: Experiments show the framework effectively covers and identifies valid reasoning chains, yielding significant improvements on multiple arithmetic reasoning benchmarks. Conclusion: The work offers a new perspective and method for understanding and exploiting LLMs' intrinsic reasoning capabilities. Abstract: Pre-trained large language models (LLMs) have been demonstrated to possess intrinsic reasoning capabilities that can emerge naturally when expanding the response space. However, the neural representation mechanisms underlying these intrinsic capabilities and approaches for their optimal utilization remain inadequately understood. In this work, we make the key discovery that a simple linear classifier can effectively detect intrinsic reasoning capabilities in LLMs' activation space, particularly within specific representation types and network layers. Based on this finding, we propose a classifier-guided search framework that strategically explores a tree-structured response space. In each node expansion, the classifier serves as a scoring and ranking mechanism that efficiently allocates computational resources by identifying and prioritizing more thoughtful reasoning directions for continuation. After completing the tree expansion, we collect answers from all branches to form a candidate answer pool. We propose a branch-aggregation selection method that marginalizes over all supporting branches by aggregating their thoughtfulness scores, thereby identifying the optimal answer from the pool. Experimental results show that our framework's comprehensive exploration not only covers valid reasoning chains but also effectively identifies them, achieving significant improvements across multiple arithmetic reasoning benchmarks.
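The branch-aggregation selection step can be sketched as a score-weighted vote. The sketch assumes that "aggregating" means summing classifier scores over all branches ending in the same answer; the abstract does not state the exact aggregation rule, and `select_answer` is a hypothetical name.

```python
from collections import defaultdict


def select_answer(branches):
    """Pick the answer with the largest total score across branches.

    branches: iterable of (final_answer, thoughtfulness_score) pairs,
    one pair per completed branch of the search tree.
    """
    totals = defaultdict(float)
    for answer, score in branches:
        totals[answer] += score
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Toy pool: "41" has the single best branch, but "42" has more support.
branches = [("42", 0.9), ("41", 0.95), ("42", 0.7), ("40", 0.2)]
best, totals = select_answer(branches)
```

Summing rather than taking the maximum lets several moderately confident branches outweigh one highly scored outlier, which is the point of marginalizing over supporting branches.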

DUKAE: DUal-level Knowledge Accumulation and Ensemble for Pre-Trained Model-Based Continual Learning

Songze Li,Tonghua Su,Xu-Yao Zhang,Qixing Xu,Zhongjie Wang

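Task: Propose DUKAE, a method that uses feature-level and decision-level knowledge accumulation to address classification head misalignment and restricted feature-level knowledge accumulation in pre-trained model-based continual learning.

Motivation: Existing PTMCL methods suffer from misaligned classification heads and restricted feature-level knowledge accumulation, leading to inconsistent decision boundaries and increased forgetting.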

Details

Method: DUKAE aligns classification heads into a unified feature space through Gaussian distribution sampling and introduces an adaptive expertise ensemble to fuse knowledge across feature subspaces. Result: Shows superior performance on the CIFAR-100, ImageNet-R, CUB-200, and Cars-196 datasets. Conclusion: DUKAE effectively addresses classification head misalignment and restricted feature-level knowledge accumulation, improving continual learning performance. Abstract: Pre-trained model-based continual learning (PTMCL) has garnered growing attention, as it enables more rapid acquisition of new knowledge by leveraging the extensive foundational understanding inherent in pre-trained models (PTMs). Most existing PTMCL methods use Parameter-Efficient Fine-Tuning (PEFT) to learn new knowledge while consolidating existing memory. However, they often face some challenges. A major challenge lies in the misalignment of classification heads, as the classification head of each task is trained within a distinct feature space, leading to inconsistent decision boundaries across tasks and, consequently, increased forgetting. Another critical limitation stems from the restricted feature-level knowledge accumulation, with feature learning typically restricted to the initial task only, which constrains the model's representation capabilities. To address these issues, we propose a method named DUal-level Knowledge Accumulation and Ensemble (DUKAE) that leverages both feature-level and decision-level knowledge accumulation by aligning classification heads into a unified feature space through Gaussian distribution sampling and introducing an adaptive expertise ensemble to fuse knowledge across feature subspaces. Extensive experiments on CIFAR-100, ImageNet-R, CUB-200, and Cars-196 datasets demonstrate the superior performance of our approach.

SEE: Continual Fine-tuning with Sequential Ensemble of Experts

Zhilin Wang,Yafu Li,Xiaoye Qu,Yu Cheng

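Task: Address catastrophic forgetting in continual fine-tuning of large language models (LLMs).

Motivation: Existing approaches such as rehearsal-based methods and separate experts suffer from performance loss or routing challenges and need improvement.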

Details

Method: Proposes the Sequential Ensemble of Experts (SEE) framework, which uses distributed routing and independent expert decisions to remove the need for an additional router. Result: SEE outperforms prior approaches, including multi-task learning, in continual fine-tuning and shows strong generalization. Conclusion: SEE points to a new direction for distributed model ensembling; integrating routing and response mechanisms within each expert is promising. Abstract: Continual fine-tuning of large language models (LLMs) suffers from catastrophic forgetting. Rehearsal-based methods mitigate this problem by retaining a small set of old data. Nevertheless, they still suffer inevitable performance loss. Although training separate experts for each task can help prevent forgetting, effectively assembling them remains a challenge. Some approaches use routers to assign tasks to experts, but in continual learning, they often require retraining for optimal performance. To address these challenges, we introduce the Sequential Ensemble of Experts (SEE) framework. SEE removes the need for an additional router, allowing each expert to independently decide whether a query should be handled. The framework employs distributed routing, and during continual fine-tuning, SEE only requires the training of new experts for incoming tasks rather than retraining the entire system. Experiments reveal that SEE outperforms prior approaches, including multi-task learning, in continual fine-tuning. It also demonstrates remarkable generalization ability, as the expert can effectively identify out-of-distribution queries, which can then be directed to a more generalized model for resolution. This work highlights the promising potential of integrating routing and response mechanisms within each expert, paving the way for the future of distributed model ensembling.
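Router-free dispatch of this kind can be sketched as a chain in which each expert is asked, in order, whether it wants the query. Everything below is a toy under stated assumptions: the `Expert` class, the keyword-style `accepts` checks, and the fallback function are hypothetical stand-ins for fine-tuned models that make the accept/decline decision themselves.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Expert:
    name: str
    accepts: Callable[[str], bool]   # the expert's own "is this mine?" decision
    respond: Callable[[str], str]


def route(query: str, experts: List[Expert],
          fallback: Callable[[str], str]) -> Tuple[str, str]:
    """Ask experts in sequence; the first one that accepts answers."""
    for expert in experts:
        if expert.accepts(query):
            return expert.name, expert.respond(query)
    # No expert claimed the query: treat it as out-of-distribution.
    return "fallback", fallback(query)


experts = [
    Expert("math", lambda q: any(ch.isdigit() for ch in q),
           lambda q: "math answer"),
    Expert("translate", lambda q: q.lower().startswith("translate"),
           lambda q: "translation"),
]


def general_model(query: str) -> str:
    return "general answer"
```

Adding a task means appending a new `Expert`; the existing ones are left untouched, which mirrors the claim that only new experts need training.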

TSP-OCS: A Time-Series Prediction for Optimal Camera Selection in Multi-Viewpoint Surgical Video Analysis

Xinyu Liu,Xiaoguang Lin,Xiang Liu,Yong Yang,Hongqian Wang,Qilong Sun

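Task: Record surgical procedures from six different angles with a multi-viewpoint camera system and select the best viewpoint sequence using time-series prediction.

Motivation: Traditional single-camera methods suffer from occlusions and fixed viewpoints, which reduce the comprehensibility of the video content.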

Details

Method: Uses a multi-viewpoint camera system with a fully supervised time-series prediction method that extracts and fuses visual and semantic features and selects the best viewpoint with a temporal prediction network. Result: Experiments show the method is competitive over longer prediction horizons and outperforms existing time-series prediction techniques. Conclusion: The proposed framework advances surgical video analysis, with significant implications for surgical education and patient safety. Abstract: Recording the open surgery process is essential for educational and medical evaluation purposes; however, traditional single-camera methods often face challenges such as occlusions caused by the surgeon's head and body, as well as limitations due to fixed camera angles, which reduce comprehensibility of the video content. This study addresses these limitations by employing a multi-viewpoint camera recording system, capturing the surgical procedure from six different angles to mitigate occlusions. We propose a fully supervised learning-based time series prediction method to choose the best shot sequences from multiple simultaneously recorded video streams, ensuring optimal viewpoints at each moment. Our time series prediction model forecasts future camera selections by extracting and fusing visual and semantic features from surgical videos using pre-trained models. These features are processed by a temporal prediction network with TimeBlocks to capture sequential dependencies. A linear embedding layer reduces dimensionality, and a Softmax classifier selects the optimal camera view based on the highest probability. In our experiments, we created five groups of open thyroidectomy videos, each with simultaneous recordings from six different angles. The results demonstrate that our method achieves competitive accuracy compared to traditional supervised methods, even when predicting over longer time horizons. Furthermore, our approach outperforms state-of-the-art time series prediction techniques on our dataset. This manuscript makes a unique contribution by presenting an innovative framework that advances surgical video analysis techniques, with significant implications for improving surgical education and patient safety.
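The final selection step, a Softmax classifier that picks the camera with the highest probability, can be sketched in a few lines. The six logits below are made-up scores for one time step; in the described system they would come from the temporal prediction network.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_camera(logits):
    """Return (index of the most probable camera, all probabilities)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical scores for the six simultaneously recorded views.
camera_logits = [0.2, 1.5, -0.3, 2.1, 0.0, 0.9]
best, probs = select_camera(camera_logits)
```

Repeating the selection for every predicted time step yields the sequence of shots.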

NLP Security and Ethics, in the Wild

Heather Lent,Erick Galinkin,Yiyi Chen,Jens Myrup Pedersen,Leon Derczynski,Johannes Bjerva

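Task: Examine research ethics in NLP Security (NLPSec) and propose recommendations for improvement.

Motivation: As NLP models gain users, their security (for example against malicious attacks) grows in importance, but existing research has clear ethical shortcomings that may lead to harms such as privacy breaches.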

Details

Method: Analyzes contemporary NLPSec literature, examines its engagement with cybersecurity's ethical norms, and identifies ethical gaps in the research. Result: Finds significant gaps in NLPSec on topics such as harm minimization and responsible disclosure. Conclusion: Offers concrete recommendations for making NLP security research more ethical and advocates a "white hat NLP" culture. Abstract: As NLP models are used by a growing number of end-users, an area of increasing importance is NLP Security (NLPSec): assessing the vulnerability of models to malicious attacks and developing comprehensive countermeasures against them. While work at the intersection of NLP and cybersecurity has the potential to create safer NLP for all, accidental oversights can result in tangible harm (e.g., breaches of privacy or proliferation of malicious models). In this emerging field, however, the research ethics of NLP have not yet faced many of the long-standing conundrums pertinent to cybersecurity, until now. We thus examine contemporary works across NLPSec, and explore their engagement with cybersecurity's ethical norms. We identify trends across the literature, ultimately finding alarming gaps on topics like harm minimization and responsible disclosure. To alleviate these concerns, we provide concrete recommendations to help NLP researchers navigate this space more ethically, bridging the gap between traditional cybersecurity and NLP ethics, which we frame as "white hat NLP". The goal of this work is to help cultivate an intentional culture of ethical research for those working in NLP Security.

LCGC: Learning from Consistency Gradient Conflicting for Class-Imbalanced Semi-Supervised Debiasing

Weiwei Xing,Yue Cheng,Hongzhu Yi,Xiaohui Gao,Xiang Wei,Xiaoyu Guo,Yuming Zhang,Xinyu Pang

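Task: Propose LCGC, a debiasing method that refines pseudo-labels in semi-supervised learning by exploiting consistency gradient conflicts.

Motivation: Address classifier bias caused by class-imbalanced datasets in semi-supervised learning (SSL) and fill the gap left by earlier methods that lack a theoretical basis.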

Details

Method: Theoretically analyzes how a baseline image refines pseudo-labels and proposes LCGC, which updates pseudo-labels by encouraging biased class predictions during training and subtracts the baseline image logits at test time. Result: LCGC significantly improves the prediction accuracy of existing CISSL models on public benchmarks. Conclusion: LCGC is an effective debiasing method, supported by theoretical analysis and experiments, that improves classification performance in semi-supervised learning. Abstract: Classifiers often learn to be biased toward the class-imbalanced dataset they are trained on, especially in the semi-supervised learning (SSL) setting. Previous work tries to re-balance the classifiers by subtracting a class-irrelevant image's logit, but lacks a firm theoretical basis. We theoretically analyze why exploiting a baseline image can refine pseudo-labels and prove that the black image is the best choice. We also show that as training progresses, the pseudo-labels before and after refinement become closer. Based on this observation, we propose a debiasing scheme dubbed LCGC (Learning from Consistency Gradient Conflicting), which encourages biased class predictions during training. We intentionally update the pseudo-labels whose gradient conflicts with the debiased logits, representing the optimization direction offered by the over-imbalanced classifier predictions. Then, we debias the predictions by subtracting the baseline image logits during testing. Extensive experiments demonstrate that LCGC can significantly improve the prediction accuracy of existing CISSL models on public benchmarks.
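The test-time step, subtracting the baseline image logits before predicting, can be sketched as follows. The numbers are invented for illustration: `baseline_logits` stands for the classifier's output on a black image, which shows how strongly the model favours each class with no evidence at all.

```python
def debias_logits(logits, baseline_logits):
    """Subtract the class prior revealed by a content-free baseline image."""
    return [z - b for z, b in zip(logits, baseline_logits)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical 3-class example where class 0 is the majority class.
logits = [2.0, 1.8, 0.5]            # output for a real test image
baseline_logits = [1.5, 0.2, 0.1]   # output for the black baseline image

raw_prediction = argmax(logits)
debiased_prediction = argmax(debias_logits(logits, baseline_logits))
```

In this toy case the raw prediction follows the majority-class bias, while the debiased prediction moves to class 1, whose score owes little to the prior.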

Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations

Zican Dong,Han Peng,Peiyu Liu,Wayne Xin Zhao,Dong Wu,Feng Xiao,Zhifeng Wang

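Task: Study domain specialization and expert redundancy in large-scale Mixture-of-Experts (MoE) models and propose a pruning framework, EASY-EP.

Motivation: MoE models strike a good balance between performance and inference efficiency, but the memory overhead of storing all experts remains a major limitation, especially for large-scale MoE models.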

Details

Method: Proposes the EASY-EP framework, which uses a few domain-specific demonstrations to identify and retain the most relevant experts via output-aware expert importance assessment and expert-level token contribution estimation. Result: Experiments show that, under the same memory budget, EASY-EP achieves performance comparable to the full DeepSeek-R1 with only half the experts, along with 2.99× throughput. Conclusion: EASY-EP is a simple and effective pruning method that significantly reduces memory overhead while preserving model performance. Abstract: Mixture-of-Experts (MoE) models achieve a favorable trade-off between performance and inference efficiency by activating only a subset of experts. However, the memory overhead of storing all experts remains a major limitation, especially in large-scale MoE models such as DeepSeek-R1 (671B). In this study, we investigate domain specialization and expert redundancy in large-scale MoE models and uncover a consistent behavior we term few-shot expert localization: with only a few demonstrations, the model consistently activates a sparse and stable subset of experts. Building on this observation, we propose a simple yet effective pruning framework, EASY-EP, that leverages a few domain-specific demonstrations to identify and retain only the most relevant experts. EASY-EP comprises two key components: output-aware expert importance assessment and expert-level token contribution estimation. The former evaluates the importance of each expert for the current token by considering the gating scores and magnitudes of the outputs of activated experts, while the latter assesses the contribution of tokens based on representation similarities before and after routed experts. Experiments show that, with only half the experts, our method achieves performance comparable to the full DeepSeek-R1 and $2.99\times$ throughput under the same memory budget. Our code is available at https://github.com/RUCAIBox/EASYEP.
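The two components can be combined into a per-expert score as in the sketch below. The abstract names the ingredients (gating scores, output magnitudes, representation similarity before and after the routed experts) but not the exact formula, so the product `token_weight * gate * ||output||` and the weight `1 - cosine` are assumptions made for illustration; the actual method is in the linked repository.

```python
import math


def norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def cosine(a, b):
    denom = norm(a) * norm(b)
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0


def expert_scores(tokens, num_experts):
    """Accumulate an importance score per expert over demonstration tokens."""
    scores = [0.0] * num_experts
    for tok in tokens:
        # Assumed token contribution: tokens whose hidden state is changed
        # more by the routed experts count more.
        token_weight = 1.0 - cosine(tok["before"], tok["after"])
        for expert_id, gate, output in tok["activations"]:
            # Assumed output-aware importance: gate score times output norm.
            scores[expert_id] += token_weight * gate * norm(output)
    return scores


def keep_top_experts(scores, k):
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:k])


tokens = [
    {"before": [1.0, 0.0], "after": [0.0, 1.0],
     "activations": [(0, 0.7, [3.0, 4.0]), (2, 0.3, [0.0, 1.0])]},
    {"before": [1.0, 0.0], "after": [1.0, 0.0],
     "activations": [(1, 0.9, [10.0, 0.0])]},
]
scores = expert_scores(tokens, num_experts=3)
kept = keep_top_experts(scores, k=2)
```

Expert 1 has a large gate and output on the second token, but that token's representation is unchanged by the routed experts, so under these assumptions it contributes nothing and expert 1 is pruned.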

Domain Generalization via Discrete Codebook Learning

Shaocong Long,Qianyu Zhou,Xikun Jiang,Chenhao Ying,Lizhuang Ma,Yuan Luo

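Task: Propose Discrete Domain Generalization (DDG), a new learning paradigm that reduces the domain gaps of continuous representation learning through discretization.

Motivation: Current domain generalization methods may fail to reduce distribution gaps when working with continuous features and are susceptible to spurious correlations or noise in pixel details.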

Details

Method: Uses a codebook to quantize feature maps into discrete codewords, aligning semantically equivalent information in a shared discrete representation space and prioritizing semantic-level information over pixel-level detail. Result: DDG outperforms existing methods on several widely used domain generalization benchmarks. Conclusion: By learning discrete representations, DDG reduces distribution gaps and improves model generalizability. Abstract: Domain generalization (DG) strives to address distribution shifts across diverse environments to enhance a model's generalizability. Current DG approaches are confined to acquiring robust representations with continuous features, specifically training at the pixel level. However, this DG paradigm may struggle to mitigate distribution gaps in dealing with a large space of continuous features, rendering it susceptible to pixel details that exhibit spurious correlations or noise. In this paper, we first theoretically demonstrate that the domain gaps in continuous representation learning can be reduced by the discretization process. Based on this inspiring finding, we introduce a novel learning paradigm for DG, termed Discrete Domain Generalization (DDG). DDG proposes to use a codebook to quantize the feature map into discrete codewords, aligning semantic-equivalent information in a shared discrete representation space that prioritizes semantic-level information over pixel-level intricacies. By learning at the semantic level, DDG diminishes the number of latent features, optimizing the utilization of the representation space and alleviating the risks associated with the wide-ranging space of continuous features. Extensive experiments across widely employed benchmarks in DG demonstrate DDG's superior performance compared to state-of-the-art approaches, underscoring its potential to reduce the distribution gaps and enhance the model's generalizability.
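Quantizing features with a codebook can be sketched as a nearest-codeword lookup. This is the generic vector-quantization step, shown under the assumption of squared Euclidean distance; how DDG learns the codebook and trains through the discrete step is not covered by the abstract and is not shown here.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codeword."""
    indices = []
    for feat in features:
        dists = [sum((a - b) ** 2 for a, b in zip(feat, code))
                 for code in codebook]
        indices.append(min(range(len(codebook)), key=dists.__getitem__))
    return indices


# Toy 2-D codebook and features; the first and last features differ
# slightly (as two domains might) yet land on the same codeword.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
features = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8], [0.05, 0.1]]
codes = quantize(features, codebook)
```

Small continuous differences between inputs disappear after the lookup, which is the intuition behind discretization narrowing the gap between domains.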

A Graph Diffusion Algorithm for Lexical Similarity Evaluation

Karol Mikula,Mariana Sarkociová Remešíková

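Task: Propose an algorithm for evaluating the lexical similarity between a given language and several reference language clusters.

Motivation: Analyze relationships between mutually influencing languages in multilingual territories.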

Details

Method: Builds a weighted directed graph and solves a graph diffusion equation with a Dirichlet boundary condition to compute a lexical similarity distribution across languages. Result: The algorithm produces a probability distribution of a language belonging to each reference cluster, and a case study demonstrates its application to European languages. Conclusion: The algorithm can be used to analyze lexical similarity and mutual influence between languages in multilingual territories. Abstract: In this paper, we present an algorithm for evaluating lexical similarity between a given language and several reference language clusters. As an input, we have a list of concepts and the corresponding translations in all considered languages. Moreover, each reference language is assigned to one of $c$ language clusters. For each of the concepts, the algorithm computes the distance between each pair of translations. Based on these distances, it constructs a weighted directed graph, where every vertex represents a language. Afterwards, it solves a graph diffusion equation with a Dirichlet boundary condition, where the unknown is a map from the vertex set to $\mathbb{R}^c$. The resulting coordinates are values from the interval $[0,1]$ and they can be interpreted as probabilities of belonging to each of the clusters or as a lexical similarity distribution with respect to the reference clusters. The distances between translations are calculated using phonetic transcriptions and a modification of the Damerau-Levenshtein distance. The algorithm can be useful in analyzing relationships between languages spoken in multilingual territories with a lot of mutual influences. We demonstrate this by presenting a case study regarding various European languages.
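A minimal sketch of the diffusion step, assuming the quantity of interest is the stationary state: reference languages keep a fixed one-hot cluster vector (the Dirichlet condition), and every other language repeatedly takes the weighted average of the vertices it points to. The edge weights here are invented; in the paper they are derived from distances between translations, and the authors' exact discretization may differ.

```python
def diffuse(weights, fixed, num_clusters, sweeps=200):
    """Gauss-Seidel sweeps toward the stationary state of graph diffusion.

    weights[v][u] is the weight of the directed edge v -> u.
    fixed maps reference vertices to their prescribed cluster vectors.
    """
    uniform = [1.0 / num_clusters] * num_clusters
    values = {v: list(fixed.get(v, uniform)) for v in weights}
    for _ in range(sweeps):
        for v, neighbors in weights.items():
            if v in fixed:
                continue  # Dirichlet condition: reference values never change
            total = sum(neighbors.values())
            values[v] = [
                sum(w * values[u][k] for u, w in neighbors.items()) / total
                for k in range(num_clusters)
            ]
    return values


# "A" and "B" are reference languages of clusters 0 and 1;
# "X" and "Y" are the languages being evaluated.
weights = {
    "A": {},
    "B": {},
    "X": {"A": 3.0, "Y": 1.0},
    "Y": {"X": 1.0, "B": 1.0},
}
fixed = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
values = diffuse(weights, fixed, num_clusters=2)
```

For this toy graph the fixed point can be checked by hand: X converges to (6/7, 1/7) and Y to (3/7, 4/7), each a probability vector over the two clusters.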

Attributes-aware Visual Emotion Representation Learning

Rahul Singh Maharjan,Marta Romeo,Angelo Cangelosi

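Task: Propose A4Net, a deep representation network that bridges the affective gap by leveraging four key attributes: brightness, colorfulness, scene context, and facial expressions.

Motivation: Visual emotion analysis has drawn attention because of interest in how images convey rich semantics and evoke human emotions, but existing methods overlook the importance of specific emotional attributes.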

Details

Method: A4Net extracts generalized features by fusing and jointly training recognition of the four key attributes with visual emotion analysis. Result: Experiments show that A4Net performs competitively across multiple visual emotion datasets, and activation map visualizations illustrate its ability to generalize. Conclusion: Joint multi-attribute training lets A4Net improve visual emotion analysis performance. Abstract: Visual emotion analysis or recognition has gained considerable attention due to the growing interest in understanding how images can convey rich semantics and evoke emotions in human perception. However, visual emotion analysis poses distinctive challenges compared to traditional vision tasks, especially due to the intricate relationship between general visual features and the different affective states they evoke, known as the affective gap. Researchers have used deep representation learning methods to address this challenge of extracting generalized features from entire images. However, most existing methods overlook the importance of specific emotional attributes such as brightness, colorfulness, scene understanding, and facial expressions. Through this paper, we introduce A4Net, a deep representation network to bridge the affective gap by leveraging four key attributes: brightness (Attribute 1), colorfulness (Attribute 2), scene context (Attribute 3), and facial expressions (Attribute 4). By fusing and jointly training all aspects of attribute recognition and visual emotion analysis, A4Net aims to provide a better insight into emotional content in images. Experimental results show the effectiveness of A4Net, showcasing competitive performance compared to state-of-the-art methods across diverse visual emotion datasets. Furthermore, visualizations of activation maps generated by A4Net offer insights into its ability to generalize across different visual emotion datasets.

Inducing Programmatic Skills for Agentic Tasks

Zora Zhiruo Wang,Apurva Gandhi,Graham Neubig,Daniel Fried

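Task: Study how programmatic skill representations can improve agent performance on digital tasks such as web navigation.

Motivation: Agents need to learn task-specific skills to complete digital tasks efficiently, and programmatic skill representations may be more effective than static or text-based skills.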

Details

Method: Proposes agent skill induction (ASI), which improves agents by inducing, verifying, and using program-based skills on the fly. Result: On the WebArena benchmark, ASI significantly outperforms baselines in both success rate and efficiency, and its skills transfer across websites. Conclusion: Programmatic skill representation is an effective way to improve agents on digital tasks, and ASI demonstrates its efficiency and adaptability. Abstract: To succeed in common digital tasks such as web navigation, agents must carry out a variety of specialized tasks such as searching for products or planning a travel route. To tackle these tasks, agents can bootstrap themselves by learning task-specific skills online through interaction with the web environment. In this work, we demonstrate that programs are an effective representation for skills. We propose agent skill induction (ASI), which allows agents to adapt themselves by inducing, verifying, and utilizing program-based skills on the fly. We start with an evaluation on the WebArena agent benchmark and show that ASI outperforms the static baseline agent and its text-skill counterpart by 23.5% and 11.3% in success rate, mainly thanks to the programmatic verification guarantee during the induction phase. ASI also improves efficiency by reducing 10.7-15.3% of the steps over baselines, by composing primitive actions (e.g., click) into higher-level skills (e.g., search product). We then highlight the efficacy of ASI in remaining efficient and accurate under scaled-up web activities. Finally, we examine the generalizability of induced skills when transferring between websites, and find that ASI can effectively reuse common skills, while also updating incompatible skills to accommodate varied website changes.
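Composing primitive actions into a higher-level skill can be sketched as wrapping a few primitive calls in one function and admitting it to a skill library only after it executes successfully. The `RecordingPage` class, the element ids, and the toy `verify_skill` check are all hypothetical; they stand in for a real browser environment and for the programmatic verification the abstract mentions without detailing.

```python
class RecordingPage:
    """Hypothetical stand-in for a web page; it only records actions."""

    def __init__(self):
        self.actions = []

    def click(self, element_id):
        self.actions.append(("click", element_id))

    def type_text(self, element_id, text):
        self.actions.append(("type", element_id, text))


def search_product(page, query):
    """Induced skill: three primitive actions behind one callable."""
    page.click("search-box")
    page.type_text("search-box", query)
    page.click("search-button")


def verify_skill(skill, *args):
    """Toy verification: the skill must run cleanly and act on the page."""
    page = RecordingPage()
    try:
        skill(page, *args)
    except Exception:
        return False
    return len(page.actions) > 0


skill_library = {}
if verify_skill(search_product, "test query"):
    skill_library["search_product"] = search_product

page = RecordingPage()
skill_library["search_product"](page, "running shoes")
```

From the agent's point of view, one skill call now replaces three primitive steps, which is where the reported reduction in step count would come from.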

Exploring Ordinal Bias in Action Recognition for Instructional Videos

Joochan Kim,Minjoon Jung,Byoung-Tak Zhang

Task: Address ordinal bias, where action recognition models rely on fixed action sequences rather than genuine video understanding when processing instructional videos.

Motivation: Existing models tend to rely on dataset-specific action sequences instead of the actual video content, which leads to ordinal bias.

Details

Method: Proposes two video manipulation methods: Action Masking (masking the frames of frequently co-occurring actions) and Sequence Shuffling (randomizing the order of action segments). Result: Experiments show that current models suffer significant performance drops on nonstandard action sequences, highlighting their vulnerability to ordinal bias. Conclusion: Emphasizes the importance of rethinking evaluation strategies and developing models that generalize beyond fixed action patterns across diverse instructional videos. Abstract: Action recognition models have achieved promising results in understanding instructional videos. However, they often rely on dominant, dataset-specific action sequences rather than true video comprehension, a problem that we define as ordinal bias. To address this issue, we propose two effective video manipulation methods: Action Masking, which masks frames of frequently co-occurring actions, and Sequence Shuffling, which randomizes the order of action segments. Through comprehensive experiments, we demonstrate that current models exhibit significant performance drops when confronted with nonstandard action sequences, underscoring their vulnerability to ordinal bias. Our findings emphasize the importance of rethinking evaluation strategies and developing models capable of generalizing beyond fixed action patterns in diverse instructional videos.
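
A minimal sketch of the two manipulations, under stated assumptions: a video is a list of `(label, frames)` segments, and "frequently co-occurring" is approximated by adjacent label pairs seen at least `min_count` times. The abstract does not give the exact criterion, so these function names and the threshold are illustrative only.

```python
import random
from collections import Counter


def sequence_shuffle(segments, seed=None):
    """Sequence Shuffling: return the action segments in a random order."""
    rng = random.Random(seed)
    shuffled = list(segments)
    rng.shuffle(shuffled)
    return shuffled


def frequent_pairs(videos, min_count=2):
    """Collect adjacent (previous, next) label pairs seen at least min_count times."""
    counts = Counter()
    for segments in videos:
        labels = [label for label, _ in segments]
        counts.update(zip(labels, labels[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def action_mask(segments, pairs, mask_value=None):
    """Action Masking (assumed form): blank out the frames of a segment
    whenever it follows the predecessor it frequently co-occurs with."""
    masked = []
    prev_label = None
    for label, frames in segments:
        if (prev_label, label) in pairs:
            frames = [mask_value] * len(frames)
        masked.append((label, frames))
        prev_label = label
    return masked
```

Evaluating a model on the shuffled or masked videos and comparing against its score on the originals gives a rough measure of how much it leans on sequence order.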

Open Problems and a Hypothetical Path Forward in LLM Knowledge Paradigms

Xiaotian Ye,Mengqi Zhang,Shu Wu

Task: Examine the limitations of the knowledge paradigms of large language models (LLMs) and directions for improving them.

Motivation: Existing knowledge paradigms constrain the potential of LLMs; problems such as knowledge updating, reverse knowledge generalization (the reversal curse), and internal knowledge conflicts need to be solved.

Details

Method: Proposes a hypothetical paradigm based on Contextual Knowledge Scaling and discusses implementation pathways. Result: The paradigm is expected to address shortcomings of current LLM knowledge systems and to inspire next-generation model architectures. Conclusion: Gives researchers an overview of progress in LLM knowledge systems and inspiration for future model architectures. Abstract: Knowledge is fundamental to the overall capabilities of Large Language Models (LLMs). The knowledge paradigm of a model, which dictates how it encodes and utilizes knowledge, significantly affects its performance. Despite the continuous development of LLMs under existing knowledge paradigms, issues within these frameworks continue to constrain model potential. This blog post highlights three critical open problems limiting model capabilities: (1) challenges in knowledge updating for LLMs, (2) the failure of reverse knowledge generalization (the reversal curse), and (3) conflicts in internal knowledge. We review recent progress made in addressing these issues and discuss potential general solutions. Based on observations in these areas, we propose a hypothetical paradigm based on Contextual Knowledge Scaling, and further outline implementation pathways that remain feasible within contemporary techniques. Evidence suggests this approach holds potential to address current shortcomings, serving as our vision for future model paradigms. This blog post aims to provide researchers with a brief overview of progress in LLM knowledge systems, while providing inspiration for the development of next-generation model architectures.

Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program

Minghe Gao,Xuqi Liu,Zhongqi Yue,Yang Wu,Shuang Chen,Juncheng Li,Siliang Tang,Fei Wu,Tat-Seng Chua,Yueting Zhuang

Task: Propose SVIP, a method for training a multi-dimensional, step-level Chain-of-Thought (CoT) reward model, to address reward-signal problems in the multimodal domain.

Motivation: Applying reward signals in the multimodal domain faces challenges including high annotation cost, over-reliance on one-step rewards, and inadequate evaluation.

Details

Method: Automatically generates code for visual tasks, converts the analysis of code blocks into evaluations of CoT steps, and trains the SVIP-Reward model with the TriAtt-CoT multi-head attention mechanism. Result: SVIP-Reward performs well in both the training and inference stages of multimodal large language models (MLLMs), reducing hallucinations and improving reasoning ability. Conclusion: SVIP markedly improves model performance on multimodal tasks and provides a benchmark for training and testing CoT reward models. Abstract: Recent advancements in reward signal usage for Large Language Models (LLMs) are remarkable. However, significant challenges exist when transitioning reward signals to the multimodal domain, including labor-intensive annotations, over-reliance on one-step rewards, and inadequate evaluation. To address these issues, we propose SVIP, a novel approach to train a step-level multi-dimensional Chain-of-Thought (CoT) reward model automatically. It generates code for solving visual tasks and transforms the analysis of code blocks into the evaluation of CoT steps as training samples. Then, we train the SVIP-Reward model using a multi-head attention mechanism called TriAtt-CoT. The advantages of SVIP-Reward are evident throughout the entire process of MLLM. We also introduce a benchmark for CoT reward model training and testing. Experimental results demonstrate that SVIP-Reward improves MLLM performance across training and inference-time scaling, yielding better results on benchmarks while reducing hallucinations and enhancing reasoning ability.

Integrating Cognitive Processing Signals into Language Models: A Review of Advances, Applications and Future Directions

Angela Lopez-Cardona,Sebastian Idesis,Ioannis Arapakis

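Task: Review the application and potential of cognitive neuroscience signals, especially eye-tracking signals, in natural language processing and multimodal large language models.

Motivation: Address data scarcity and the environmental cost of training large-scale models while improving model efficiency and human alignment.

Details

Method: Enhances language models and multimodal large language models by integrating user-centric cognitive signals such as eye-tracking data. Result: Cognitive signals enable efficient data augmentation, faster convergence, and better human alignment, and are especially promising for visual question answering and for reducing hallucinations in multimodal large language models. Conclusion: Discusses emerging challenges and research trends, and emphasizes the potential of eye-tracking data for future research. Abstract: Recently, the integration of cognitive neuroscience in Natural Language Processing (NLP) has gained significant attention. This article provides a critical and timely overview of recent advancements in leveraging cognitive signals, particularly Eye-tracking (ET) signals, to enhance Language Models (LMs) and Multimodal Large Language Models (MLLMs). By incorporating user-centric cognitive signals, these approaches address key challenges, including data scarcity and the environmental costs of training large-scale models. Cognitive signals enable efficient data augmentation, faster convergence, and improved human alignment. The review emphasises the potential of ET data in tasks like Visual Question Answering (VQA) and mitigating hallucinations in MLLMs, and concludes by discussing emerging challenges and research trends.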

Visually Similar Pair Alignment for Robust Cross-Domain Object Detection

Onkar Krishna,Hiroki Ohashi

Task: Align source- and target-domain features by visual similarity to improve domain adaptation for object detection models.

Motivation: Existing methods do not adequately account for visual differences (such as color or orientation) when aligning features across domains, which results in less effective domain adaptation.

Details

Method: Proposes a memory-based system that stores foreground and background features from the source domain, updates them during training, and retrieves visually similar feature pairs for alignment. Result: Reaches 53.1 mAP on Foggy Cityscapes and 62.3 mAP on Sim10k, outperforming existing methods. Conclusion: Visual-similarity alignment significantly improves domain adaptation, particularly when handling visual differences together with domain-specific shifts. Abstract: Domain gaps between training data (source) and real-world environments (target) often degrade the performance of object detection models. Most existing methods aim to bridge this gap by aligning features across source and target domains but often fail to account for visual differences, such as color or orientation, in alignment pairs. This limitation leads to less effective domain adaptation, as the model struggles to manage both domain-specific shifts (e.g., fog) and visual variations simultaneously. In this work, we demonstrate for the first time, using a custom-built dataset, that aligning visually similar pairs significantly improves domain adaptation. Based on this insight, we propose a novel memory-based system to enhance domain alignment. This system stores precomputed features of foreground objects and background areas from the source domain, which are periodically updated during training. By retrieving visually similar source features for alignment with target foreground and background features, the model effectively addresses domain-specific differences while reducing the impact of visual variations. Extensive experiments across diverse domain shift scenarios validate our method's effectiveness, achieving 53.1 mAP on Foggy Cityscapes and 62.3 on Sim10k, surpassing prior state-of-the-art methods by 1.2 and 4.1 mAP, respectively.

Persona Dynamics: Unveiling the Impact of Personality Traits on Agents in Text-Based Games

Seungwon Lim,Seungbeen Lee,Dongjun Min,Youngjae Yu

Task: Study how human personality traits affect the behavior and performance of agents in text-based interactive environments.

Motivation: Although artificial agents are increasingly important in complex interaction and decision-making tasks, aligning their behavior with human values remains an open challenge.

Details

Method: Proposes PANDA, which guides agent behavior by projecting human personality traits onto agents; this involves training a personality classifier and integrating personality profiles into the policy-learning pipeline. Result: Experiments show that agent behavior can be steered toward specific personality profiles, and that some personality types (such as those with higher Openness) perform better. Conclusion: Personality-adapted agents show promise for more aligned, effective, and human-centric decision-making. Abstract: Artificial agents are increasingly central to complex interactions and decision-making tasks, yet aligning their behaviors with desired human values remains an open challenge. In this work, we investigate how human-like personality traits influence agent behavior and performance within text-based interactive environments. We introduce PANDA: Personality-Adapted Neural Decision Agents, a novel method for projecting human personality traits onto agents to guide their behavior. To induce personality in a text-based game agent, (i) we train a personality classifier to identify what personality type the agent's actions exhibit, and (ii) we integrate the personality profiles directly into the agent's policy-learning pipeline. By deploying agents embodying 16 distinct personality types across 25 text-based games and analyzing their trajectories, we demonstrate that an agent's action decisions can be guided toward specific personality profiles. Moreover, certain personality types, such as those characterized by higher levels of Openness, display marked advantages in performance. These findings underscore the promise of personality-adapted agents for fostering more aligned, effective, and human-centric decision-making in interactive environments.

A Cross-Domain Few-Shot Learning Method Based on Domain Knowledge Mapping

Jiajun Chen,Hongpeng Yin,Yifu Yang

Task: Propose a cross-domain few-shot learning method based on domain knowledge mapping to address task adaptation under non-i.i.d. assumptions.

Motivation: In real-world scenarios, the distribution encountered in few-shot learning can differ significantly from that of existing data; effectively using existing data knowledge to adapt quickly to non-i.i.d. tasks is a key challenge.

Details

Method: Applies domain knowledge mapping consistently across the pre-training, training, and testing phases, combines self-supervised and supervised losses, and introduces a domain classifier to learn domain mapping and to assess domain adaptation difficulty. Result: The method's effectiveness is validated on six datasets from different domains. Conclusion: Through domain knowledge mapping and meta-training tasks, the method substantially improves the model's adaptability in cross-domain few-shot learning. Abstract: In task-based few-shot learning paradigms, it is commonly assumed that different tasks are independently and identically distributed (i.i.d.). However, in real-world scenarios, the distribution encountered in few-shot learning can significantly differ from the distribution of existing data. Thus, how to effectively leverage existing data knowledge to enable models to quickly adapt to class variations under non-i.i.d. assumptions has emerged as a key research challenge. To address this challenge, this paper proposes a new cross-domain few-shot learning approach based on domain knowledge mapping, applied consistently throughout the pre-training, training, and testing phases. In the pre-training phase, our method integrates self-supervised and supervised losses by maximizing mutual information, thereby mitigating mode collapse. During the training phase, the domain knowledge mapping layer collaborates with a domain classifier to learn both domain mapping capabilities and the ability to assess domain adaptation difficulty. Finally, this approach is applied during the testing phase, rapidly adapting to domain variations through meta-training tasks on support sets, consequently enhancing the model's capability to transfer domain knowledge effectively. Experimental validation conducted across six datasets from diverse domains demonstrates the effectiveness of the proposed method.

Identifying Aspects in Peer Reviews

Sheng Lu,Ilia Kuznetsov,Iryna Gurevych

Task: Develop a data-driven method for extracting fine-grained review aspects from peer reviews to support a standardized reviewing process.

Motivation: Peer review is central to academic publishing, but growing submission volumes are straining it, creating a need for computational support. Review aspects (such as novelty) reflect the values of the research community, and standardizing the reviewing process can improve quality control and enable computational assistance.

Details

Method: Takes a bottom-up approach: defines an operational notion of review aspect and extracts fine-grained aspects from a corpus of reviews. Builds a dataset of reviews annotated with aspects and shows its use for community-level review analysis. Result: Shows how the choice of aspects affects downstream applications such as LLM-generated review detection, laying a foundation for data-driven study of review aspects. Conclusion: Paves the way for new applications of NLP in support of peer review and offers a principled, data-driven approach to studying review aspects. Abstract: Peer review is central to academic publishing, but the growing volume of submissions is straining the process. This motivates the development of computational approaches to support peer review. While each review is tailored to a specific paper, reviewers often make assessments according to certain aspects such as Novelty, which reflect the values of the research community. This alignment creates opportunities for standardizing the reviewing process, improving quality control, and enabling computational support. While prior work has demonstrated the potential of aspect analysis for peer review assistance, the notion of aspect remains poorly formalized. Existing approaches often derive aspect sets from review forms and guidelines of major NLP venues, yet data-driven methods for aspect identification are largely underexplored. To address this gap, our work takes a bottom-up approach: we propose an operational definition of aspect and develop a data-driven schema for deriving fine-grained aspects from a corpus of peer reviews. We introduce a dataset of peer reviews augmented with aspects and show how it can be used for community-level review analysis. We further show how the choice of aspects can impact downstream applications, such as LLM-generated review detection. Our results lay a foundation for a principled and data-driven investigation of review aspects, and pave the path for new applications of NLP to support peer review.

Human-like compositional learning of visually-grounded concepts using synthetic environments

Zijun Lin,M Ganesh Kumar,Cheston Tan

Task: Study how multimodal reinforcement learning lets agents in synthetic environments learn and understand compositions of complex language instructions and visual concepts.

Motivation: Explore how humans learn to compose concept classes and ground visual cues through trial and error, and how this ability can be transferred to artificial intelligence systems.

Details

Method: Designs a 3D synthetic environment in which an agent learns through reinforcement to navigate to a target specified by a natural language instruction containing nouns, attributes, determiners, and prepositions. Result: Reinforcement learning agents can ground determiner concepts but do worse on prepositional concepts; curriculum learning markedly improves learning efficiency and reduces the number of training episodes. Conclusion: Multimodal reinforcement learning agents can achieve compositional understanding of complex concept classes, and human-like learning strategies can substantially improve the learning efficiency of artificial systems. Abstract: The compositional structure of language enables humans to decompose complex phrases and map them to novel visual concepts, showcasing flexible intelligence. While several algorithms exhibit compositionality, they fail to elucidate how humans learn to compose concept classes and ground visual cues through trial and error. To investigate this multi-modal learning challenge, we designed a 3D synthetic environment in which an agent learns, via reinforcement, to navigate to a target specified by a natural language instruction. These instructions comprise nouns, attributes, and critically, determiners, prepositions, or both. The vast array of word combinations heightens the compositional complexity of the visual grounding task, as navigating to a blue cube above red spheres is not rewarded when the instruction specifies navigating to "some blue cubes below the red sphere". We first demonstrate that reinforcement learning agents can ground determiner concepts to visual targets but struggle with more complex prepositional concepts. Second, we show that curriculum learning, a strategy humans employ, enhances concept learning efficiency, reducing the required training episodes by 15% in determiner environments and enabling agents to easily learn prepositional concepts. Finally, we establish that agents trained on determiner or prepositional concepts can decompose held-out test instructions and rapidly adapt their navigation policies to unseen visual object combinations. Leveraging synthetic environments, our findings demonstrate that multi-modal reinforcement learning agents can achieve compositional understanding of complex concept classes and highlight the efficacy of human-like learning strategies in improving artificial systems' learning efficiency.

Data Augmentation for Fake Reviews Detection in Multiple Languages and Multiple Domains

Ming Liu,Massimo Poesio

Task: Use large language models to generate datasets for training fake review detectors.

Motivation: With the growth of the Internet, fake reviews have become an increasingly prominent problem, yet low-resource languages and domains lack sufficient training data.

Details

Method: Uses large language models to generate fake review datasets across domains (books, restaurants, hotels) and languages (English, Chinese). Result: Data augmentation improves fake review detection, with accuracy gains on every test set (for example, 0.3 percentage points on DeRev TEST and 10.9 percentage points on Amazon TEST). Conclusion: Generating datasets is an effective way to improve fake review detection models, especially for low-resource languages and domains. Abstract: With the growth of the Internet, buying habits have changed, and customers have become more dependent on the online opinions of other customers to guide their purchases. Identifying fake reviews thus became an important area for Natural Language Processing (NLP) research. However, developing high-performance NLP models depends on the availability of large amounts of training data, which are often not available for low-resource languages or domains. In this research, we used large language models to generate datasets to train fake review detectors. Our approach was used to generate fake reviews in different domains (book reviews, restaurant reviews, and hotel reviews) and different languages (English and Chinese). Our results demonstrate that our data augmentation techniques result in improved performance at fake review detection for all domains and languages. The accuracy of our fake review detection model can be improved by 0.3 percentage points on DeRev TEST, 10.9 percentage points on Amazon TEST, 8.3 percentage points on Yelp TEST and 7.2 percentage points on DianPing TEST using the augmented datasets.

InstantSticker: Realistic Decal Blending via Disentangled Object Reconstruction

Yi Zhang,Xiaoyang Huang,Yishun Dou,Yue Shi,Rui Shi,Ye Chen,Bingbing Ni,Wenjun Zhang

Task: Present InstantSticker, a disentangled reconstruction pipeline based on Image-Based Lighting (IBL) that focuses on highly realistic decal blending, simulates stickers attached to a surface, and supports instant editing and real-time rendering.

Motivation: Address the shading, warping, and blurriness problems of decal blending in earlier methods, and improve editing and rendering efficiency.

Details

Method: Introduces a shadow factor optimized within IBL, pre-unfolds the mesh with ARAP parameterization, combines local UV mapping with a neural texture map, and uses the Disney BRDF model for instant editing. Result: Experiments show that InstantSticker outperforms existing methods in editing quality, editing speed, and rendering speed, reaching the state of the art. Conclusion: InstantSticker resolves key problems in decal blending and enables efficient, high-quality decal editing and rendering. Abstract: We present InstantSticker, a disentangled reconstruction pipeline based on Image-Based Lighting (IBL), which focuses on highly realistic decal blending, simulates stickers attached to the reconstructed surface, and allows for instant editing and real-time rendering. To achieve a stereoscopic impression of the decal, we introduce a shadow factor into IBL, which can be adaptively optimized during training. This allows the shadow brightness of surfaces to be accurately decomposed rather than baked into the diffuse color, ensuring that the edited texture exhibits authentic shading. To address the issues of warping and blurriness in previous methods, we apply As-Rigid-As-Possible (ARAP) parameterization to pre-unfold a specified area of the mesh and use the local UV mapping combined with a neural texture map to enhance the ability to express high-frequency details in that area. For instant editing, we utilize the Disney BRDF model, explicitly defining material colors with 3-channel diffuse albedo. This enables instant replacement of albedo RGB values during the editing process, avoiding the prolonged optimization required in previous approaches. In our experiment, we introduce the Ratio Variance Warping (RVW) metric to evaluate the local geometric warping of the decal area. Extensive experimental results demonstrate that our method surpasses previous decal blending methods in terms of editing quality, editing speed and rendering speed, achieving the state-of-the-art.

RuOpinionNE-2024: Extraction of Opinion Tuples from Russian News Texts

Natalia Loukachevitch,Natalia Tkachenko,Anna Lapanitsyna,Mikhail Tikhomirov,Nicolay Rusnachenko

Task: The Dialogue Evaluation shared task on extracting structured opinions from Russian news texts.

Motivation: Study how to extract opinion tuples, consisting of a sentiment holder, a target, an expression, and a sentiment, from Russian news texts.

Details

Method: Participants mainly experimented with large language models in zero-shot, few-shot, and fine-tuning settings; 30 prompts and 11 open-source language models were also compared. Result: Fine-tuning a large language model gave the best result on the test set, and the best models and prompts were identified. Conclusion: Fine-tuned large language models perform best on opinion extraction from Russian news texts. Abstract: In this paper, we introduce the Dialogue Evaluation shared task on extraction of structured opinions from Russian news texts. The task of the contest is to extract opinion tuples for a given sentence; the tuples are composed of a sentiment holder, its target, an expression and sentiment from the holder to the target. In total, the task received more than 100 submissions. The participants experimented mainly with large language models in zero-shot, few-shot and fine-tuning formats. The best result on the test set was obtained with fine-tuning of a large language model. We also compared 30 prompts and 11 open source language models with 3-32 billion parameters in the 1-shot and 10-shot settings and found the best models and prompts.

FACT: Multinomial Misalignment Classification for Point Cloud Registration

Ludvig Dillén,Per-Erik Forssén,Johan Edstedt

Task: Propose FACT, a method for predicting the alignment quality (i.e., registration error) of registered lidar point cloud pairs.

Motivation: Provide quality assurance for large, automatically registered 3D models.

Details

Method: FACT extracts local features from a registered point cloud pair and processes them with a point transformer-based network to predict a misalignment class. Result: FACT's multinomial misalignment classification outperforms direct regression and prior binary classification, and it substantially outperforms CorAl on a synthetically perturbed point cloud task. Conclusion: FACT successfully classifies point cloud pairs produced by different registration methods and can assist experts in correcting misaligned point cloud maps. Abstract: We present FACT, a method for predicting alignment quality (i.e., registration error) of registered lidar point cloud pairs. This is useful e.g. for quality assurance of large, automatically registered 3D models. FACT extracts local features from a registered pair and processes them with a point transformer-based network to predict a misalignment class. We generalize prior work that studies binary alignment classification of registration errors, by recasting it as multinomial misalignment classification. To achieve this, we introduce a custom regression-by-classification loss function that combines the cross-entropy and Wasserstein losses, and demonstrate that it outperforms both direct regression and prior binary classification. FACT successfully classifies point-cloud pairs registered with both the classical ICP and GeoTransformer, while other choices, such as standard point-cloud-quality metrics and registration residuals are shown to be poor choices for predicting misalignment. On a synthetically perturbed point-cloud task introduced by the CorAl method, we show that FACT achieves substantially better performance than CorAl. Finally, we demonstrate how FACT can assist experts in correcting misaligned point-cloud maps. Our code is available at https://github.com/LudvigDillen/FACT_for_PCMC.
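
One way a loss could combine cross-entropy with a Wasserstein term over ordered misalignment classes is sketched below. The abstract does not give the formula, so the unit spacing between classes, the one-hot target, and the `weight` parameter are all assumptions; for one-dimensional ordered bins, the Wasserstein-1 distance reduces to the summed absolute difference between the two cumulative distributions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[target])


def wasserstein_1d(probs, target):
    """W1 distance between the predicted distribution and a one-hot target,
    assuming ordered classes with unit spacing (an illustrative assumption)."""
    dist, cdf_p = 0.0, 0.0
    for k, p in enumerate(probs):
        cdf_p += p
        cdf_t = 1.0 if k >= target else 0.0
        dist += abs(cdf_p - cdf_t)
    return dist


def combined_loss(logits, target, weight=1.0):
    """Cross-entropy plus a weighted Wasserstein term (weight is hypothetical)."""
    probs = softmax(logits)
    return cross_entropy(probs, target) + weight * wasserstein_1d(probs, target)
```

Cross-entropy alone treats every wrong class the same; the Wasserstein term adds a penalty that grows with how far the predicted mass sits from the true class, which is the sense in which classification over ordered bins can act like regression.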

Towards LLMs Robustness to Changes in Prompt Format Styles

Lilian Ngweta,Kiran Kate,Jason Tsay,Yara Rizk

Task: Propose Mixture of Formats (MOF), a simple and efficient technique for reducing prompt brittleness in large language models (LLMs).

Motivation: LLMs are sensitive to non-semantic changes in prompt format, which cause performance fluctuations, and prior work has not offered a simple solution.

Details

Method: Diversifies the styles of the few-shot examples in the prompt, drawing on computer vision techniques that use datasets with diverse styles. Result: Experiments show that MOF reduces style-induced prompt brittleness and improves overall performance across prompt variations and datasets. Conclusion: MOF is an effective technique for mitigating prompt brittleness and improving the robustness of LLMs. Abstract: Large language models (LLMs) have gained popularity in recent years for their utility in various applications. However, they are sensitive to non-semantic changes in prompt formats, where small changes in the prompt format can lead to significant performance fluctuations. In the literature, this problem is commonly referred to as prompt brittleness. Previous research on prompt engineering has focused mainly on developing techniques for identifying the optimal prompt for specific tasks. Some studies have also explored the issue of prompt brittleness and proposed methods to quantify performance variations; however, no simple solution has been found to address this challenge. We propose Mixture of Formats (MOF), a simple and efficient technique for addressing prompt brittleness in LLMs by diversifying the styles used in the prompt few-shot examples. MOF was inspired by computer vision techniques that utilize diverse style datasets to prevent models from associating specific styles with the target variable. Empirical results show that our proposed technique reduces style-induced prompt brittleness in various LLMs while also enhancing overall performance across prompt variations and different datasets.
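
The core idea, rendering each few-shot example in a different format style, can be sketched as a small prompt builder. The four styles in `STYLES` and the function `build_mof_prompt` are made up for illustration; the abstract does not list the styles the authors use or how they are assigned.

```python
import random

# Hypothetical format styles; the styles used in the paper are not given in the abstract.
STYLES = [
    lambda q, a: f"Question: {q}\nAnswer: {a}",
    lambda q, a: f"Q: {q}\nA: {a}",
    lambda q, a: f"INPUT :: {q}\nOUTPUT :: {a}",
    lambda q, a: f"### Query\n{q}\n### Response\n{a}",
]


def build_mof_prompt(examples, query, seed=None):
    """Render each few-shot example in its own style, then append the query."""
    rng = random.Random(seed)
    # Use distinct styles when there are enough; otherwise allow repeats.
    if len(examples) <= len(STYLES):
        styles = rng.sample(STYLES, len(examples))
    else:
        styles = [rng.choice(STYLES) for _ in examples]
    blocks = [style(q, a) for style, (q, a) in zip(styles, examples)]
    # The final query is rendered in one more style with the answer left blank.
    blocks.append(rng.choice(STYLES)(query, "").rstrip())
    return "\n\n".join(blocks)
```

Because no single style appears in every demonstration, the model has less opportunity to tie the answer to a particular surface format, which is the intuition the abstract borrows from style-diverse datasets in computer vision.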

Rethinking LayerNorm in Image Restoration Transformers

MinKyu Lee,Sangeek Hyun,Woojin Jun,Hyunjun Kim,Jiwoo Chung,Jae-Pil Heo

Task: Investigate abnormal feature behaviors in image restoration (IR) Transformers.

Motivation: The per-token normalization of conventional LayerNorm is found to disrupt spatial correlations and internal feature statistics, leading to excessively small feature entropy and severely diverging feature magnitudes.

Details

Method: Proposes a normalization strategy for IR Transformers that normalizes across the entire spatio-channel dimension, together with an input-adaptive rescaling method. Result: Experiments confirm that the approach resolves feature divergence and significantly improves the stability and performance of IR Transformers. Conclusion: The proposed normalization strategy and adaptive rescaling improve feature processing in IR Transformers and apply to a variety of IR tasks. Abstract: This work investigates abnormal feature behaviors observed in image restoration (IR) Transformers. Specifically, we identify two critical issues: feature entropy becoming excessively small and feature magnitudes diverging up to a million-fold scale. We pinpoint the root cause to the per-token normalization aspect of conventional LayerNorm, which disrupts essential spatial correlations and internal feature statistics. To address this, we propose a simple normalization strategy tailored for IR Transformers. Our approach applies normalization across the entire spatio-channel dimension, effectively preserving spatial correlations. Additionally, we introduce an input-adaptive rescaling method that aligns feature statistics to the unique statistical requirements of each input. Experimental results verify that this combined strategy effectively resolves feature divergence, significantly enhancing both the stability and performance of IR Transformers across various IR tasks.
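
The difference between per-token normalization and normalization over the whole spatio-channel extent can be seen in a small pure-Python sketch. It assumes features are a list of tokens, each a list of channel values, and it leaves out the learnable affine parameters and the input-adaptive rescaling, whose form the abstract does not specify.

```python
import math


def _normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of a flat list of values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def per_token_layernorm(tokens):
    """Conventional LayerNorm (no affine): each token normalized over its own channels."""
    return [_normalize(tok) for tok in tokens]


def spatio_channel_norm(tokens):
    """One mean and variance computed over all tokens and channels together."""
    n_ch = len(tokens[0])
    flat = _normalize([v for tok in tokens for v in tok])
    return [flat[i:i + n_ch] for i in range(0, len(flat), n_ch)]
```

With `tokens = [[1.0, 2.0], [10.0, 20.0]]`, per-token normalization maps both tokens to nearly the same values, erasing the fact that the second token is ten times larger; the spatio-channel variant keeps that relative magnitude between spatial positions.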

Evaluating Retrieval Augmented Generative Models for Document Queries in Transportation Safety

Chad Melton,Alex Sorokine,Steve Peterson

Task: Evaluate three generative models on retrieving compliance information for hazardous materials transportation.

Motivation: Applying generative models in high-stakes domains raises accuracy and reliability concerns, so their performance in hazardous materials transportation needs to be verified.

Details

Method: Using about 40 publicly available federal and state regulatory documents, generates 100 relevant queries and evaluates ChatGPT, Vertex AI, and RAG-augmented LLaMA models both qualitatively and quantitatively. Result: The RAG-augmented LLaMA models significantly outperform the others in accuracy and detail, despite occasional inconsistencies. Conclusion: Domain-specific fine-tuning and rigorous evaluation methods are essential for ensuring model reliability in high-stakes environments. Abstract: Applications of generative Large Language Models (LLMs) are rapidly expanding across various domains, promising significant improvements in workflow efficiency and information retrieval. However, their implementation in specialized, high-stakes domains such as hazardous materials transportation is challenging due to accuracy and reliability concerns. This study evaluates the performance of three fine-tuned generative models, ChatGPT, Google's Vertex AI, and ORNL Retrieval Augmented Generation augmented LLaMA 2 and LLaMA in retrieving regulatory information essential for hazardous material transportation compliance in the United States. Utilizing approximately 40 publicly available federal and state regulatory documents, we developed 100 realistic queries relevant to route planning and permitting requirements. Responses were qualitatively rated based on accuracy, detail, and relevance, complemented by quantitative assessments of semantic similarity between model outputs. Results demonstrated that the RAG-augmented LLaMA models significantly outperformed Vertex AI and ChatGPT, providing more detailed and generally accurate information, despite occasional inconsistencies. This research introduces the first known application of RAG in transportation safety, emphasizing the need for domain-specific fine-tuning and rigorous evaluation methodologies to ensure reliability and minimize the risk of inaccuracies in high-stakes environments.

PosterMaker: Towards High-Quality Product Poster Generation with Accurate Text Rendering

Yifan Gao,Zihang Lin,Chuanbin Liu,Min Zhou,Tiezheng Ge,Bo Zheng,Hongtao Xie

Task: Propose PosterMaker, an end-to-end product poster generation framework that addresses the challenges of text rendering and product fidelity.

Motivation: Product posters are important promotional tools, but existing image generation methods fall short on rendering complex text (such as Chinese) and on preserving the fidelity of user-specified products.

Details

Method: Proposes TextRenderNet (controlled by a character-wise representation) and SceneGenNet (an inpainting-based model), and optimizes PosterMaker with a two-stage training strategy. Result: PosterMaker achieves text rendering accuracy above 90% and significantly outperforms existing baselines in experiments. Conclusion: Through character-level control and subject fidelity feedback learning, PosterMaker effectively addresses the key challenges of poster generation. Abstract: Product posters, which integrate subject, scene, and text, are crucial promotional tools for attracting customers. Creating such posters using modern image generation methods is valuable, while the main challenge lies in accurately rendering text, especially for complex writing systems like Chinese, which contains over 10,000 individual characters. In this work, we identify the key to precise text rendering as constructing a character-discriminative visual feature as a control signal. Based on this insight, we propose a robust character-wise representation as control and we develop TextRenderNet, which achieves a high text rendering accuracy of over 90%. Another challenge in poster generation is maintaining the fidelity of user-specific products. We address this by introducing SceneGenNet, an inpainting-based model, and propose subject fidelity feedback learning to further enhance fidelity. Based on TextRenderNet and SceneGenNet, we present PosterMaker, an end-to-end generation framework. To optimize PosterMaker efficiently, we implement a two-stage training strategy that decouples text rendering and background generation learning. Experimental results show that PosterMaker outperforms existing baselines by a remarkable margin, which demonstrates its effectiveness.

Data Augmentation and Hyperparameter Tuning for Low-Resource MFA

Alessio Tosolini,Claire Bowern

Task: Improve forced alignment accuracy for low-resource languages through data augmentation and hyperparameter tuning.

Motivation: Address the low accuracy of computational tools on low-resource and endangered languages caused by small amounts of data.

Details

Method: Compares the effects of data augmentation and hyperparameter tuning on forced alignment. Result: Audio data augmentation yields limited gains, whereas hyperparameter tuning substantially improves performance with feasible training time. Conclusion: For languages with small to medium amounts of data, hyperparameter tuning is a workable alternative to adapting models from high-resource languages. Abstract: A continued issue for those working with computational tools and endangered and under-resourced languages is the lower accuracy of results for languages with smaller amounts of data. We attempt to ameliorate this issue by using data augmentation methods to increase corpus size, comparing augmentation to hyperparameter tuning for multilingual forced alignment. Unlike text augmentation methods, audio augmentation does not lead to substantially increased performance. Hyperparameter tuning, on the other hand, results in substantial improvement without (for this amount of data) infeasible additional training time. For languages with small to medium amounts of training data, this is a workable alternative to adapting models from high-resource languages.

Crafting Query-Aware Selective Attention for Single Image Super-Resolution

Junyoung Kim,Youngrok Kim,Siyeol Jung,Donghyun Min

Task: Propose SSCAN, an attention mechanism that dynamically selects key-value windows, to improve single image super-resolution (SISR).

Motivation: Existing ViT-based SISR methods either incur high computational cost or use attention mechanisms that do not explicitly focus on query-relevant regions, and the design of selective attention for SISR has not been explored effectively.

Details

Method: Proposes SSCAN, which dynamically selects key-value windows based on query similarity for efficient, focused feature extraction, and uses fixed-size windows to reduce memory usage and computational complexity. Result: SSCAN outperforms existing methods on SISR, with PSNR gains of up to 0.14 dB while remaining computationally efficient. Conclusion: SSCAN's query-aware window selection strategy improves both the performance and the efficiency of SISR. Abstract: Single Image Super-Resolution (SISR) reconstructs high-resolution images from low-resolution inputs, enhancing image details. While Vision Transformer (ViT)-based models improve SISR by capturing long-range dependencies, they suffer from quadratic computational costs or employ selective attention mechanisms that do not explicitly focus on query-relevant regions. Despite these advancements, prior work has overlooked how selective attention mechanisms should be effectively designed for SISR. We propose SSCAN, which dynamically selects the most relevant key-value windows based on query similarity, ensuring focused feature extraction while maintaining efficiency. In contrast to prior approaches that apply attention globally or heuristically, our method introduces a query-aware window selection strategy that better aligns attention computation with important image regions. By incorporating fixed-sized windows, SSCAN reduces memory usage and enforces linear token-to-token complexity, making it scalable for large images. Our experiments demonstrate that SSCAN outperforms existing attention-based SISR methods, achieving up to 0.14 dB PSNR improvement on urban datasets, guaranteeing both computational efficiency and reconstruction quality in SISR.
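
A rough sketch of query-aware window selection is given below. It assumes each window is a list of token vectors, summarizes a window by mean pooling, and ranks key-value windows by cosine similarity to the query window; the pooling, the similarity measure, and all function names are illustrative assumptions, since the abstract does not describe SSCAN's actual scoring.

```python
import math


def mean_vector(window):
    """Mean-pool the token vectors of one window into a single descriptor."""
    dim = len(window[0])
    return [sum(tok[d] for tok in window) / len(window) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_windows(query_window, kv_windows, top_k):
    """Return indices of the top_k key-value windows most similar to the query window."""
    q = mean_vector(query_window)
    scores = [cosine(q, mean_vector(w)) for w in kv_windows]
    ranked = sorted(range(len(kv_windows)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]
```

Attention would then be computed only between the query window's tokens and the tokens of the selected windows, so with a fixed window size and a fixed `top_k` the cost per query window stays constant as the image grows.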

TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

Liang-Hsuan Tseng,Yi-Chang Chen,Kuan-Yi Lee,Da-Shan Shiu,Hung-yi Lee

Task: Propose TASTE, a method that addresses the sequence-length mismatch between the speech and text modalities to enable speech-text joint modeling.

Motivation: Improve spoken language models (SLMs) so that they come closer to text large language models (LLMs) and support more natural human-computer interaction.

Details

Method: Uses Text-Aligned Speech Tokenization and Embedding (TASTE), with a special aggregation mechanism and a speech reconstruction training objective, to shorten the speech token sequence while preserving key paralinguistic information. Result: TASTE dramatically reduces speech token sequence length while preserving essential information, and text LLMs can be adapted into effective SLMs with parameter-efficient fine-tuning techniques such as LoRA. Conclusion: According to the authors, TASTE is the first end-to-end approach that uses a reconstruction objective to automatically learn text-aligned speech tokenization and embedding; experiments show performance comparable to full fine-tuning methods. Abstract: Large Language Models (LLMs) excel in text-based natural language processing tasks but remain constrained by their reliance on textual inputs and outputs. To enable more natural human-LLM interaction, recent work has focused on deriving a spoken language model (SLM) that can not only listen but also generate speech. To achieve this, a promising direction is to conduct speech-text joint modeling. However, recent SLMs still lag behind text LLMs due to the modality mismatch. One significant mismatch can be the sequence lengths between speech and text tokens. To address this, we introduce Text-Aligned Speech Tokenization and Embedding (TASTE), a method that directly addresses the modality gap by aligning speech tokens with the corresponding text transcription during the tokenization stage. We propose a method that can achieve this through a special aggregation mechanism and with speech reconstruction as the training objective. We conduct extensive experiments and show that TASTE can preserve essential paralinguistic information while dramatically reducing the token sequence length. Furthermore, by leveraging TASTE, we can adapt text-based LLMs into effective SLMs with parameter-efficient fine-tuning techniques such as Low-Rank Adaptation (LoRA). Experimental results on benchmark tasks, including SALMON and StoryCloze, demonstrate that TASTE-based SLMs perform similarly to previous full-finetuning methods. To our knowledge, TASTE is the first end-to-end approach that utilizes a reconstruction objective to automatically learn a text-aligned speech tokenization and embedding suitable for spoken language modeling. Our demo, code, and models are publicly available at https://github.com/mtkresearch/TASTE-SpokenLM.

HGMamba: Enhancing 3D Human Pose Estimation with a HyperGCN-Mamba Network

Hu Cui,Tessai Hayama

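Task: Lift 2D human poses to 3D using both estimated and ground-truth 2D pose data.

Motivation: Existing methods perform poorly when applied to ground-truth 2D pose data; accurate lifting requires precise modeling of local pose structure together with extraction of global spatio-temporal features.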

Details

Method: Proposes the Hyper-GCN and Shuffle Mamba (HGMamba) block, which processes local and global features in parallel and fuses them adaptively. Result: Achieves state-of-the-art results on Human3.6M and MPI-INF-3DHP, with P1 errors of 38.65 mm and 14.33 mm, respectively. Conclusion: HGMamba performs strongly at both global feature modeling and local structure modeling, and is offered in several configurations to suit different speed-accuracy trade-offs. Abstract: 3D human pose lifting is a promising research area that leverages estimated and ground-truth 2D human pose data for training. While existing approaches primarily aim to enhance the performance of estimated 2D poses, they often struggle when applied to ground-truth 2D pose data. We observe that achieving accurate 3D pose reconstruction from ground-truth 2D poses requires precise modeling of local pose structures, alongside the ability to extract robust global spatio-temporal features. To address these challenges, we propose a novel Hyper-GCN and Shuffle Mamba (HGMamba) block, which processes input data through two parallel streams: Hyper-GCN and Shuffle Mamba. The Hyper-GCN stream models the human body structure as hypergraphs with varying levels of granularity to effectively capture local joint dependencies. Meanwhile, the Shuffle Mamba stream leverages a state space model to perform spatio-temporal scanning across all joints, enabling the establishment of global dependencies. By adaptively fusing these two representations, HGMamba achieves strong global feature modeling while excelling at local structure modeling. We stack multiple HGMamba blocks to create three variants of our model, allowing users to select the most suitable configuration based on the desired speed-accuracy trade-off. Extensive evaluations on the Human3.6M and MPI-INF-3DHP benchmark datasets demonstrate the effectiveness of our approach. HGMamba-B achieves state-of-the-art results, with P1 errors of 38.65 mm and 14.33 mm on the respective datasets. Code and models are available: https://github.com/HuCui2022/HGMamba
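
The abstract does not define "P1 error". Assuming it refers to the Protocol 1 metric commonly reported on Human3.6M, the mean per-joint position error (MPJPE), a minimal sketch of that metric looks as follows. The function name `mpjpe` and the toy arrays are illustrative, not taken from the HGMamba code.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: arrays of shape (frames, joints, 3) in the same unit (e.g. mm).
    Returns the Euclidean distance per joint, averaged over joints and frames.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy data: 2 frames, 17 joints, every predicted joint offset by (3, 4, 0) mm.
gt = np.zeros((2, 17, 3))
pred = gt.copy()
pred[..., 0] += 3.0
pred[..., 1] += 4.0

print(mpjpe(pred, gt))  # 5.0, since each joint is sqrt(3^2 + 4^2) = 5 mm off
```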

HalluciNot: Hallucination Detection Through Context and Common Knowledge Verification

Bibek Paudel,Alexander Lyzhov,Preetam Joshi,Puneet Anand

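Task: Propose a comprehensive system for detecting hallucinations in large language model (LLM) outputs in enterprise settings.

Motivation: Address the specific challenges that hallucinated LLM output poses for enterprise deployment, including computational efficiency, domain specialization, and fine-grained error identification.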

Details

Method: Introduces a new taxonomy of LLM responses and develops the hallucination detection model HDM-2, which validates responses against both context and common knowledge. Result: HDM-2 outperforms existing approaches on the RagTruth, TruthfulQA, and HDMBench datasets. Conclusion: The work provides an efficient hallucination detection solution for enterprise deployment, and the associated resources are publicly available. Abstract: This paper introduces a comprehensive system for detecting hallucinations in large language model (LLM) outputs in enterprise settings. We present a novel taxonomy of LLM responses specific to hallucination in enterprise applications, categorizing them into context-based, common knowledge, enterprise-specific, and innocuous statements. Our hallucination detection model HDM-2 validates LLM responses with respect to both context and generally known facts (common knowledge). It provides both hallucination scores and word-level annotations, enabling precise identification of problematic content. To evaluate it on context-based and common-knowledge hallucinations, we introduce a new dataset HDMBench. Experimental results demonstrate that HDM-2 outperforms existing approaches across RagTruth, TruthfulQA, and HDMBench datasets. This work addresses the specific challenges of enterprise deployment, including computational efficiency, domain specialization, and fine-grained error identification. Our evaluation dataset, model weights, and inference code are publicly available.

Uni-PrevPredMap: Extending PrevPredMap to a Unified Framework of Prior-Informed Modeling for Online Vectorized HD Map Construction

Nan Peng,Xun Zhou,Mingming Wang,Guisong Chen,Songming Chen

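Task: Propose Uni-PrevPredMap, a unified prior-informed framework for online vectorized HD map construction.

Motivation: Autonomous driving systems need to make maximal use of external prior information to ensure safety.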

Details

Method: Combines temporal perception buffers and cost-efficient maps as complementary prior sources, and introduces a global map processor and a tri-mode operational optimization paradigm. Result: Performs strongly in map-free scenarios and shows robust prior fusion when given simulated outdated maps. Conclusion: Confirms the synergistic complementarity between previous predictions and simulated outdated maps, providing an efficient map construction solution for autonomous driving systems. Abstract: Safety constitutes a foundational imperative for autonomous driving systems, necessitating the maximal incorporation of accessible external prior information. This study establishes that temporal perception buffers and cost-efficient maps inherently form complementary prior sources for online vectorized high-definition (HD) map construction. We present Uni-PrevPredMap, a unified prior-informed framework that systematically integrates two synergistic information sources: previous predictions and simulated outdated HD maps. The framework introduces two core innovations: a tile-indexed 3D vectorized global map processor enabling efficient refreshment, storage, and retrieval of 3D vectorized priors; and a tri-mode operational optimization paradigm ensuring consistency across prior-free, map-absent, and map-prior scenarios while mitigating reliance on idealized map fidelity assumptions. Uni-PrevPredMap achieves state-of-the-art performance in map-free scenarios across established online vectorized HD map construction benchmarks. When provided with simulated outdated HD maps, the framework exhibits robust capabilities in error-resilient prior fusion, empirically confirming the synergistic complementarity between previous predictions and simulated outdated HD maps. Code will be available at https://github.com/pnnnnnnn/Uni-PrevPredMap.

A Survey on Personalized and Pluralistic Preference Alignment in Large Language Models

Zhouhang Xie,Junda Wu,Yiran Shen,Yu Xia,Xintong Li,Aaron Chang,Ryan Rossi,Sachin Kumar,Bodhisattwa Prasad Majumder,Jingbo Shang,Prithviraj Ammanabrolu,Julian McAuley

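Task: Summarize and analyze research on personalized preference alignment for large language models (LLMs).

Motivation: Personalized preference alignment is an emerging research direction at the intersection of NLP and personalization that aims to make LLMs better suited to individual users' needs.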

Details

Method: Introduces a taxonomy of preference alignment techniques, covering training-time, inference-time, and user-modeling-based methods, and analyzes the strengths and limitations of each group. Result: Covers evaluation of existing techniques, benchmarks, and open problems in the field. Conclusion: The survey offers a systematic analysis of personalized preference alignment for LLMs and directions for future research. Abstract: Personalized preference alignment for large language models (LLMs), the process of tailoring LLMs to individual users' preferences, is an emerging research direction spanning the areas of NLP and personalization. In this survey, we present an analysis of works on personalized alignment and modeling for LLMs. We introduce a taxonomy of preference alignment techniques, including training time, inference time, and additionally, user-modeling based methods. We provide analysis and discussion on the strengths and limitations of each group of techniques and then cover evaluation, benchmarks, as well as open problems in the field.

Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception

Ruotian Peng,Haiying He,Yake Wei,Yandong Wen,Di Hu

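Task: Propose a divide-then-aggregate strategy that extracts fine-grained details from semantic and spatial patches to generate high-quality long-form image captions.

Motivation: Captions produced by current multimodal large language models lack detail or suffer from hallucinations and need improvement.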

Details

Method: Divides the image into semantic and spatial patches to extract details, aggregates them hierarchically into a global description, and applies semantic-level filtering to reduce hallucinations. Result: Experiments show that the method generates more detailed and reliable captions without retraining the model. Conclusion: The method improves the quality of multimodal description generation and applies to both open-source and closed-source models. Abstract: High-quality image captions play a crucial role in improving the performance of cross-modal applications such as text-to-image generation, text-to-video generation, and text-image retrieval. To generate long-form, high-quality captions, many recent studies have employed multimodal large language models (MLLMs). However, current MLLMs often produce captions that lack fine-grained details or suffer from hallucinations, a challenge that persists in both open-source and closed-source models. Inspired by Feature-Integration theory, which suggests that attention must focus on specific regions to integrate visual information effectively, we propose a divide-then-aggregate strategy. Our method first divides the image into semantic and spatial patches to extract fine-grained details, enhancing the model's local perception of the image. These local details are then hierarchically aggregated to generate a comprehensive global description. To address hallucinations and inconsistencies in the generated captions, we apply a semantic-level filtering process during hierarchical aggregation. This training-free pipeline can be applied to both open-source models (LLaVA-1.5, LLaVA-1.6, Mini-Gemini) and closed-source models (Claude-3.5-Sonnet, GPT-4o, GLM-4V-Plus). Extensive experiments demonstrate that our method generates more detailed, reliable captions, advancing multimodal description generation without requiring model retraining. The source code is available at https://github.com/GeWu-Lab/Patch-Matters

Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation

Israfel Salazar,Manuel Fernández Burda,Shayekh Bin Islam,Arshia Soltani Moakhar,Shivalika Singh,Fabian Farestam,Angelika Romanou,Danylo Boiko,Dipika Khullar,Mike Zhang,Dominik Krzemiński,Jekaterina Novikova,Luísa Shimabucoro,Joseph Marvin Imperial,Rishabh Maheshwary,Sharad Duwal,Alfonso Amayuelas,Swati Rajwal,Jebish Purbey,Ahmed Ruby,Nicholas Popovič,Marek Suppa,Azmine Toushik Wasi,Ram Mohan Rao Kadiyala,Olga Tsymboi,Maksim Kostritsya,Bardia Soltani Moakhar,Gabriel da Costa Merlin,Otávio Ferracioli Coletti,Maral Jabbari Shiviari,MohammadAmin farahani fard,Silvia Fernandez,María Grandury,Dmitry Abulkhanov,Drishti Sharma,Andre Guarnier De Mitri,Leticia Bossatto Marchezi,Johan Obando-Ceron,Nazar Kohut,Beyza Ermis,Desmond Elliott,Enzo Ferrante,Sara Hooker,Marzieh Fadaee

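Task: Propose Kaleidoscope, a multilingual evaluation benchmark for vision-language models that fills the gaps in linguistic and cultural diversity left by existing benchmarks.

Motivation: Existing evaluation benchmarks rely mainly on English, and multilingual benchmarks are often translations of English datasets that fail to capture cultural differences.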

Details

Method: Builds Kaleidoscope through an open science collaboration; it contains 20,911 multiple-choice questions across 18 languages and 14 subjects, ensuring linguistic and cultural authenticity. Result: The evaluation finds that current multilingual vision-language models perform poorly on low-resource languages and in complex multimodal scenarios. Conclusion: Highlights the need for more culturally inclusive multimodal evaluation frameworks. Abstract: The evaluation of vision-language models (VLMs) has mainly relied on English-language benchmarks, leaving significant gaps in both multilingual and multicultural coverage. While multilingual benchmarks have expanded, both in size and languages, many rely on translations of English datasets, failing to capture cultural nuances. In this work, we propose Kaleidoscope, the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope is a large-scale, in-language multimodal benchmark designed to evaluate VLMs across diverse languages and visual inputs. Kaleidoscope covers 18 languages and 14 different subjects, amounting to a total of 20,911 multiple-choice questions. Built through an open science collaboration with a diverse group of researchers worldwide, Kaleidoscope ensures linguistic and cultural authenticity. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages and in complex multimodal scenarios. Our results highlight the need for progress on culturally inclusive multimodal evaluation frameworks.

RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism

Elia Peruzzo,Dejia Xu,Xingqian Xu,Humphrey Shi,Nicu Sebe

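Task: Propose a framework that uses a retrieval mechanism to improve the realism of motion in generated videos.

Motivation: Existing video generation methods fall short in motion complexity and physical plausibility; generated videos often look static or move unrealistically.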

Details

Method: Adds a retrieval mechanism to the generation phase and uses the retrieved videos as grounding signals that guide the model toward more realistic motion. Result: Experiments demonstrate the superiority of the approach through established metrics, recently proposed benchmarks, and qualitative results. Conclusion: The framework improves motion realism in video generation and supports additional applications. Abstract: Video generation is experiencing rapid growth, driven by advances in diffusion models and the development of better and larger datasets. However, producing high-quality videos remains challenging due to the high-dimensional data and the complexity of the task. Recent efforts have primarily focused on enhancing visual quality and addressing temporal inconsistencies, such as flickering. Despite progress in these areas, the generated videos often fall short in terms of motion complexity and physical plausibility, with many outputs either appearing static or exhibiting unrealistic motion. In this work, we propose a framework to improve the realism of motion in generated videos, exploring a complementary direction to much of the existing literature. Specifically, we advocate for the incorporation of a retrieval mechanism during the generation phase. The retrieved videos act as grounding signals, providing the model with demonstrations of how the objects move. Our pipeline is designed to apply to any text-to-video diffusion model, conditioning a pretrained model on the retrieved samples with minimal fine-tuning. We demonstrate the superiority of our approach through established metrics, recently proposed benchmarks, and qualitative results, and we highlight additional applications of the framework.

DeduCE: Deductive Consistency as a Framework to Evaluate LLM Reasoning

Atharva Pandey,Kshitij Dubey,Rahul Sharma,Amit Sharma

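Task: Propose a deductive consistency metric for analyzing the chain-of-thought output of language models.

Motivation: Frontier large language models perform well on Olympiad-level reasoning problems yet still struggle with novel high school math problems, which calls for analysis that goes beyond final accuracy.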

Details

Method: Develops an evaluation pipeline that tests language models on understanding input premises and on inferring conclusions over multiple reasoning hops, using novel, perturbed variants of benchmark problems. Result: Language models are fairly robust to an increasing number of input premises, but accuracy decays significantly as the number of reasoning hops grows; errors stem mainly from multi-hop inference rather than premise understanding. Conclusion: Viewing reasoning as computation over a window of input premises and reasoning hops provides a unified, cross-domain evaluation framework for language model reasoning. Abstract: Despite great performance on Olympiad-level reasoning problems, frontier large language models can still struggle on high school math when presented with novel problems outside standard benchmarks. Going beyond final accuracy, we propose a deductive consistency metric to analyze chain-of-thought output from language models (LMs). Formally, deductive reasoning involves two subtasks: understanding a set of input premises and inferring the conclusions that follow from them. The proposed metric studies LMs' performance on these subtasks, with the goal of explaining LMs' reasoning errors on novel problems: how well do LMs understand input premises with increasing context lengths, and how well can they infer conclusions over multiple reasoning hops? Since existing benchmarks may be memorized, we develop a pipeline to evaluate LMs' deductive consistency on novel, perturbed versions of benchmark problems. On novel grade school math problems (GSM-8k), we find that LMs are fairly robust to an increasing number of input premises, but suffer significant accuracy decay as the number of reasoning hops is increased. Interestingly, these errors are masked in the original benchmark as all models achieve near 100% accuracy. As we increase the number of solution steps using a synthetic dataset, prediction over multiple hops still remains the major source of error compared to understanding input premises. Other factors, such as shifts in language style or natural propagation of early errors, do not explain the trends. Our analysis provides a new view to characterize LM reasoning -- as computations over a window of input premises and reasoning hops -- that can provide unified evaluation across problem domains.

Probability Density Geodesics in Image Diffusion Latent Space

Qingtao Yu,Jaskirat Singh,Zhaoyuan Yang,Peter Henry Tu,Jing Zhang,Hongdong Li,Richard Hartley,Dylan Campbell

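Task: Compute geodesics in diffusion latent space and analyze how closely video clips approximate them in that space.

Motivation: Diffusion models indirectly estimate the probability density over a data space, which can be used to study the structure of that space.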

Details

Method: Presents algorithms for solving the associated initial and boundary value problems and for computing the probability density along a path and the geodesic distance between two points. Result: Shows how these techniques can be applied to training-free image sequence interpolation and extrapolation. Conclusion: Geodesic computation in diffusion latent space provides a new tool for analyzing image sequences. Abstract: Diffusion models indirectly estimate the probability density over a data space, which can be used to study its structure. In this work, we show that geodesics can be computed in diffusion latent space, where the norm induced by the spatially-varying inner product is inversely proportional to the probability density. In this formulation, a path that traverses a high density (that is, probable) region of image latent space is shorter than the equivalent path through a low density region. We present algorithms for solving the associated initial and boundary value problems and show how to compute the probability density along the path and the geodesic distance between two points. Using these techniques, we analyze how closely video clips approximate geodesics in a pre-trained image diffusion space. Finally, we demonstrate how these techniques can be applied to training-free image sequence interpolation and extrapolation, given a pre-trained image diffusion model.
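
To make the density-weighted notion of length concrete, the sketch below discretises a path and sums each segment's Euclidean length divided by the density at its midpoint. Several things here are assumptions for illustration: the density is a toy unnormalised 2D Gaussian rather than a diffusion model's estimate, the proportionality constant is set to 1, and the paper's own initial and boundary value solvers are not reproduced.

```python
import numpy as np

def density(x):
    """Toy unnormalised 2D Gaussian density centred at the origin (a stand-in)."""
    return np.exp(-0.5 * np.sum(x**2, axis=-1))

def path_length(points):
    """Discretised length under a norm inversely proportional to density:
    sum over segments of |segment| / p(midpoint)."""
    seg = np.diff(points, axis=0)
    mid = 0.5 * (points[1:] + points[:-1])
    return float(np.sum(np.linalg.norm(seg, axis=-1) / density(mid)))

t = np.linspace(0.0, 1.0, 101)[:, None]
a, b = np.array([-1.0, 0.0]), np.array([1.0, 0.0])
through_mode = a + t * (b - a)                  # straight path through the high-density region
shifted = through_mode + np.array([0.0, 2.0])   # same-shape path moved into a low-density region

len_high = path_length(through_mode)
len_low = path_length(shifted)
```

Both paths have the same Euclidean length, but `len_low` exceeds `len_high` by a factor of about `e**2` because the toy density is lower by that factor everywhere along the shifted path.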

Self-Steering Language Models

Gabriel Grand,Joshua B. Tenenbaum,Vikash K. Mansinghka,Alexander K. Lew,Jacob Andreas

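Task: Propose DisCIPL, a method in which a Planner model generates a task-specific inference program that Follower models execute, enabling self-steering inference for language models.

Motivation: Language models are slow, costly, and error-prone when searching or planning in natural language, but they are good at describing the abstract structure of a problem.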

Details

Method: In DisCIPL, a Planner model writes recursive search programs that guide the inference of Follower models, enabling verifiable and efficient reasoning. Result: With a small Follower (e.g., Llama-3.2-1B), DisCIPL matches and sometimes outperforms much larger models such as GPT-4o and o1. Conclusion: By decoupling planning from execution, DisCIPL opens up a design space of highly parallelized Monte Carlo inference strategies that require no finetuning and can be implemented automatically by existing language models. Abstract: While test-time reasoning enables language models to tackle complex tasks, searching or planning in natural language can be slow, costly, and error-prone. But even when LMs struggle to emulate the precise reasoning steps needed to solve a problem, they often excel at describing its abstract structure--both how to verify solutions and how to search for them. This paper introduces DisCIPL, a method for "self-steering" LMs where a Planner model generates a task-specific inference program that is executed by a population of Follower models. Our approach equips LMs with the ability to write recursive search procedures that guide LM inference, enabling new forms of verifiable and efficient reasoning. When instantiated with a small Follower (e.g., Llama-3.2-1B), DisCIPL matches (and sometimes outperforms) much larger models, including GPT-4o and o1, on challenging constrained generation tasks. In decoupling planning from execution, our work opens up a design space of highly-parallelized Monte Carlo inference strategies that outperform standard best-of-N sampling, require no finetuning, and can be implemented automatically by existing LMs.

Deep Learning for Cardiovascular Risk Assessment: Proxy Features from Carotid Sonography as Predictors of Arterial Damage

Christoph Balada,Aida Romano-Martinez,Vincent ten Cate,Katharina Geschke,Jonas Tesarz,Paul Claßen,Alexander K. Schuster,Dativa Tibyampansha,Karl-Patrik Kresoja,Philipp S. Wild,Sheraz Ahmed,Andreas Dengel

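Task: Use hypertension as an indicator of individual vascular damage and identify that damage with machine learning techniques.

Motivation: Provide an early risk marker for cardiovascular events and offer valuable insight into the arterial condition of individual patients.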

Details

Method: Fine-tunes the VideoMAE deep learning model for ultrasound imaging, training and testing it on a dataset of over 31,000 carotid sonography videos from the Gutenberg Health Study. Result: The model reaches 75.7% validation accuracy in classifying hypertensive versus non-hypertensive individuals, serving as a proxy for detecting visual arterial damage. Conclusion: The machine learning model captures visual features that provide valuable insight into an individual's cardiovascular health. Abstract: In this study, hypertension is utilized as an indicator of individual vascular damage. This damage can be identified through machine learning techniques, providing an early risk marker for potential major cardiovascular events and offering valuable insights into the overall arterial condition of individual patients. To this end, the VideoMAE deep learning model, originally developed for video classification, was adapted by finetuning for application in the domain of ultrasound imaging. The model was trained and tested using a dataset comprising over 31,000 carotid sonography videos sourced from the Gutenberg Health Study (15,010 participants), one of the largest prospective population health studies. This adaptation facilitates the classification of individuals as hypertensive or non-hypertensive (75.7% validation accuracy), functioning as a proxy for detecting visual arterial damage. We demonstrate that our machine learning model effectively captures visual features that provide valuable insights into an individual's overall cardiovascular health.

KG-LLM-Bench: A Scalable Benchmark for Evaluating LLM Reasoning on Textualized Knowledge Graphs

Elan Markowitz,Krupa Galiya,Greg Ver Steeg,Aram Galstyan

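Task: Study how the knowledge graph textualization process affects the performance of large language models.

Motivation: Knowledge graphs are a popular way to inject up-to-date factual knowledge into large language models, but the effect of the textualization process on model performance remains under-explored.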

Details

Method: Introduces the KG-LLM-Bench benchmark, which spans five knowledge graph understanding tasks, and evaluates how different encoding strategies affect performance across several base models. Result: Extensive experiments with seven language models and five textualization strategies yield insights for optimizing LLM performance on knowledge graph reasoning tasks. Conclusion: The study offers guidance on choosing a knowledge graph textualization strategy, helping to improve LLM performance on knowledge reasoning tasks. Abstract: Knowledge graphs have emerged as a popular method for injecting up-to-date, factual knowledge into large language models (LLMs). This is typically achieved by converting the knowledge graph into text that the LLM can process in context. While multiple methods of encoding knowledge graphs have been proposed, the impact of this textualization process on LLM performance remains under-explored. We introduce KG-LLM-Bench, a comprehensive and extensible benchmark spanning five knowledge graph understanding tasks, and evaluate how different encoding strategies affect performance across various base models. Our extensive experiments with seven language models and five textualization strategies provide insights for optimizing LLM performance on KG reasoning tasks.

GSta: Efficient Training Scheme with Siestaed Gaussians for Monocular 3D Scene Reconstruction

Anil Armagan,Albert Saà-Garriga,Bruno Manganelli,Kyuwon Kim,M. Kerim Yucel

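Task: Propose GSta, a method that dynamically identifies Gaussians that have converged well during training and freezes their updates to improve the training speed and efficiency of 3D reconstruction.

Motivation: Gaussian Splatting (GS) performs well in 3D reconstruction but has large storage and memory requirements and slow training, which makes it hard to deploy, especially in robotics scenarios.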

Details

Method: Dynamically identifies converged Gaussians from their positional and color gradient norms and freezes their updates, combined with a learning rate scheduler and an early stopping mechanism. Result: GSta markedly improves training speed, memory, and storage requirements while preserving quality, and works even better when combined with other methods. Conclusion: GSta addresses the efficiency and deployment problems of GS and offers a more practical solution for 3D reconstruction. Abstract: Gaussian Splatting (GS) is a popular approach for 3D reconstruction, mostly due to its ability to converge reasonably fast, faithfully represent the scene and render (novel) views in a fast fashion. However, it suffers from large storage and memory requirements, and its training speed still lags behind the hash-grid based radiance field approaches (e.g. Instant-NGP), which makes it especially difficult to deploy in robotics scenarios, where 3D reconstruction is crucial for accurate operation. In this paper, we propose GSta that dynamically identifies Gaussians that have converged well during training, based on their positional and color gradient norms. By forcing such Gaussians into a siesta and stopping their updates (freezing) during training, we improve training speed with competitive accuracy compared to state of the art. We also propose an early stopping mechanism based on the PSNR values computed on a subset of training images. Combined with other improvements, such as integrating a learning rate scheduler, GSta achieves an improved Pareto front in convergence speed, memory and storage requirements, while preserving quality. We also show that GSta can improve other methods and complement orthogonal approaches in efficiency improvement; once combined with Trick-GS, GSta achieves up to 5x faster training and 16x smaller disk size compared to vanilla GS, while having comparable accuracy and consuming only half the peak memory. More visualisations are available at https://anilarmagan.github.io/SRUK-GSta.
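
The freezing idea can be sketched in a few lines. This is a loose illustration, not the GSta implementation: the thresholds `pos_tol` and `color_tol`, the rule that both gradient norms must be small, the plain gradient step, and the fact that frozen Gaussians never wake up are all assumptions made for the example.

```python
import numpy as np

def update_siesta_mask(pos_grad, color_grad, frozen, pos_tol=1e-3, color_tol=1e-3):
    """Freeze Gaussians whose positional AND colour gradient norms are below
    the (hypothetical) tolerances. Already frozen Gaussians stay frozen."""
    pos_norm = np.linalg.norm(pos_grad, axis=1)
    color_norm = np.linalg.norm(color_grad, axis=1)
    return frozen | ((pos_norm < pos_tol) & (color_norm < color_tol))

def masked_step(params, grads, frozen, lr=0.1):
    """Plain gradient step applied only to Gaussians that are still active."""
    active = ~frozen
    out = params.copy()
    out[active] -= lr * grads[active]
    return out

# Three toy Gaussians: the first has converged, the other two have not.
pos_grad = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1e-4]])
color_grad = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
frozen = np.zeros(3, dtype=bool)

frozen = update_siesta_mask(pos_grad, color_grad, frozen)
positions = np.ones((3, 3))
new_positions = masked_step(positions, pos_grad, frozen)
```

Only the first Gaussian is frozen here: the second still has a large positional gradient, and the third has a small positional gradient but a large colour gradient.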

OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

Jiacheng Liu,Taylor Blanton,Yanai Elazar,Sewon Min,YenSung Chen,Arnavi Chheda-Kothary,Huy Tran,Byron Bischoff,Eric Marsh,Michael Schmitz,Cassidy Trier,Aaron Sarnat,Jenna James,Jon Borchardt,Bailey Kuehl,Evie Cheng,Karen Farley,Sruthi Sreeram,Taira Anderson,David Albright,Carissa Schoenick,Luca Soldaini,Dirk Groeneveld,Rock Yuren Pang,Pang Wei Koh,Noah A. Smith,Sophie Lebrecht,Yejin Choi,Hannaneh Hajishirzi,Ali Farhadi,Jesse Dodge

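Task: Develop OLMoTrace, a system that traces language model outputs back to their training data in real time.

Motivation: Help users understand language model behavior through the lens of the training data, including fact checking, hallucination, and creativity.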

Details

Method: Builds on an extended version of infini-gram to find verbatim matches between language model output and the training data in real time. Result: The system returns tracing results within a few seconds and is publicly available and fully open-source. Conclusion: OLMoTrace is a new, practically useful tool for understanding language model behavior. Abstract: We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.
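
To show what "verbatim matches between segments of output and documents" means in practice, here is a deliberately naive sketch. It is not infini-gram and would not scale beyond a toy corpus: it splits on whitespace and scans every document for each candidate span. The function names, the `min_len` cutoff, and the example texts are all hypothetical.

```python
def contains(doc_tokens, span):
    """True if `span` occurs as a contiguous run of tokens in `doc_tokens`."""
    n = len(span)
    return any(doc_tokens[k:k + n] == span for k in range(len(doc_tokens) - n + 1))

def verbatim_spans(output, corpus, min_len=3):
    """Greedily find the longest span (>= min_len tokens) at each position of
    `output` that appears verbatim in at least one corpus document.
    Returns a list of (span_text, matching_document_indices)."""
    out = output.split()
    docs = [d.split() for d in corpus]
    spans, i = [], 0
    while i < len(out):
        best_end, best_docs = i, []
        for j in range(i + min_len, len(out) + 1):
            hits = [idx for idx, d in enumerate(docs) if contains(d, out[i:j])]
            if not hits:
                break  # a longer span cannot match if this prefix does not
            best_end, best_docs = j, hits
        if best_docs:
            spans.append((" ".join(out[i:best_end]), best_docs))
            i = best_end
        else:
            i += 1
    return spans

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "a stitch in time saves nine",
]
output = "he said the quick brown fox ran and a stitch in time works"
matches = verbatim_spans(output, corpus)
# matches == [("the quick brown fox", [0]), ("a stitch in time", [1])]
```

A production system needs an index that supports these lookups over trillions of tokens in seconds, which is the role the abstract assigns to the extended infini-gram.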

Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding

Pedro Hermosilla,Christian Stippel,Leon Sick

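Task: Propose a protocol for evaluating the quality of self-supervised features in 3D scene understanding, and develop a self-supervised model based on a Masked Scene Modeling objective.

Motivation: In current 3D scene understanding, self-supervised methods are used only as initialization for task-specific fine-tuning, which limits their potential for general-purpose feature extraction.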

Details

Method: Proposes an evaluation protocol based on multi-resolution feature sampling to assess feature quality, and designs a self-supervised model based on Masked Scene Modeling. Result: In a linear probing setup using only off-the-shelf features, the model performs close to supervised models and surpasses existing self-supervised methods by a large margin. Conclusion: The proposed protocol and model bring more general-purpose feature extraction to self-supervised learning for 3D scene understanding. Abstract: Self-supervised learning has transformed 2D computer vision by enabling models trained on large, unannotated datasets to provide versatile off-the-shelf features that perform similarly to models trained with labels. However, in 3D scene understanding, self-supervised methods are typically only used as a weight initialization step for task-specific fine-tuning, limiting their utility for general-purpose feature extraction. This paper addresses this shortcoming by proposing a robust evaluation protocol specifically designed to assess the quality of self-supervised features for 3D scene understanding. Our protocol uses multi-resolution feature sampling of hierarchical models to create rich point-level representations that capture the semantic capabilities of the model and, hence, are suitable for evaluation with linear probing and nearest-neighbor methods. Furthermore, we introduce the first self-supervised model that performs similarly to supervised models when only off-the-shelf features are used in a linear probing setup. In particular, our model is trained natively in 3D with a novel self-supervised approach based on a Masked Scene Modeling objective, which reconstructs deep features of masked patches in a bottom-up manner and is specifically tailored to hierarchical 3D models. Our experiments not only demonstrate that our method achieves competitive performance to supervised models, but also that it surpasses existing self-supervised approaches by a large margin. The model and training code can be found at our GitHub repository (https://github.com/phermosilla/msm).

MultiDelete for Multimodal Machine Unlearning

Jiali Cheng,Hadi Amiri

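Task: Propose MultiDelete, a multimodal machine unlearning method for removing knowledge about specific training samples from an already trained model.

Motivation: Address the unique challenges of machine unlearning in multimodal settings, such as the complex dependencies between data modalities and the high cost of training on large multimodal datasets and architectures.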

Details

Method: MultiDelete decouples the associations between unimodal data points marked for deletion through three key properties: modality decoupling, multimodal knowledge retention, and unimodal knowledge retention. Result: Experiments on two architectures and four datasets show that MultiDelete improves unlearning of multimodal samples by 17.6 points on average while maintaining the original model's multimodal and unimodal knowledge. Conclusion: MultiDelete is an efficient multimodal unlearning method that is not restricted to strongly convex losses and better protects unlearned data against adversarial attacks. Abstract: Machine Unlearning removes specific knowledge about training data samples from an already trained model. It has significant practical benefits, such as purging private, inaccurate, or outdated information from trained models without the need for complete re-training. Unlearning within a multimodal setting presents unique challenges due to the complex dependencies between different data modalities and the expensive cost of training on large multimodal datasets and architectures. This paper presents the first machine unlearning approach for multimodal data and models, titled MultiDelete, which is designed to decouple associations between unimodal data points during unlearning without losing the overall representation strength of the trained model. MultiDelete advocates for three key properties for effective multimodal unlearning: (a) modality decoupling, which effectively decouples the association between individual unimodal data points marked for deletion, rendering them as unrelated data points, (b) multimodal knowledge retention, which retains the multimodal representation post-unlearning, and (c) unimodal knowledge retention, which retains the unimodal representation post-unlearning. MultiDelete is efficient to train and is not constrained by using a strongly convex loss -- a common restriction among existing baselines. Experiments on two architectures and four datasets, including image-text and graph-text datasets, show that MultiDelete gains an average improvement of 17.6 points over the best-performing baseline in unlearning multimodal samples, can maintain the multimodal and unimodal knowledge of the original model post-unlearning, and can provide better protection to unlearned data against adversarial attacks.

EDIT: Enhancing Vision Transformers by Mitigating Attention Sink through an Encoder-Decoder Architecture

Wenfeng Feng,Guoying Sun

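Task: Propose EDIT, a new architecture that mitigates the attention sink phenomenon in Vision Transformer models.

Motivation: Attention sink causes the model to allocate excessive attention to the [CLS] token, which impairs effective processing of image patches.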

Details

Method: Uses a layer-aligned encoder-decoder architecture in which the encoder processes image patches with self-attention and the decoder attends to the [CLS] token through cross-attention, extracting information progressively starting from low-level features. Result: On ImageNet-1k, ImageNet-21k, and transfer learning tasks, EDIT outperforms DeiT3 models. Conclusion: EDIT's design effectively addresses attention sink and improves visual feature extraction. Abstract: In this paper, we propose EDIT (Encoder-Decoder Image Transformer), a novel architecture designed to mitigate the attention sink phenomenon observed in Vision Transformer models. Attention sink occurs when an excessive amount of attention is allocated to the [CLS] token, distorting the model's ability to effectively process image patches. To address this, we introduce a layer-aligned encoder-decoder architecture, where the encoder utilizes self-attention to process image patches, while the decoder uses cross-attention to focus on the [CLS] token. Unlike the traditional encoder-decoder framework, where the decoder depends solely on high-level encoder representations, EDIT allows the decoder to extract information starting from low-level features, progressively refining the representation layer by layer. EDIT is naturally interpretable, as demonstrated through sequential attention maps that illustrate the refined, layer-by-layer focus on key image features. Experiments on ImageNet-1k and ImageNet-21k, along with transfer learning tasks, show that EDIT achieves consistent performance improvements over DeiT3 models. These results highlight the effectiveness of EDIT's design in addressing attention sink and improving visual feature extraction.

StealthRank: LLM Ranking Manipulation via Stealthy Prompt Optimization

Yiming Tang,Yi Fan,Chenxiao Yu,Tiankai Yang,Yue Zhao,Xiyang Hu

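Task: Propose StealthRank, a new adversarial ranking attack that manipulates product recommendation systems driven by large language models (LLMs).

Motivation: Integrating large language models into information retrieval systems introduces new attack surfaces, in particular adversarial ranking manipulation.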

Details

Method: Uses an energy-based optimization framework combined with Langevin dynamics to generate stealthy adversarial text sequences (SRPs) that are embedded in product descriptions to influence the LLM's ranking mechanism. Result: Evaluated across multiple LLMs, StealthRank covertly boosts the ranking of target products while avoiding explicit, easily detected manipulation traces. Conclusion: StealthRank outperforms existing adversarial ranking baselines in both effectiveness and stealth, exposing critical vulnerabilities in LLM-driven recommendation systems. Abstract: The integration of large language models (LLMs) into information retrieval systems introduces new attack surfaces, particularly for adversarial ranking manipulations. We present StealthRank, a novel adversarial ranking attack that manipulates LLM-driven product recommendation systems while maintaining textual fluency and stealth. Unlike existing methods that often introduce detectable anomalies, StealthRank employs an energy-based optimization framework combined with Langevin dynamics to generate StealthRank Prompts (SRPs): adversarial text sequences embedded within product descriptions that subtly yet effectively influence LLM ranking mechanisms. We evaluate StealthRank across multiple LLMs, demonstrating its ability to covertly boost the ranking of target products while avoiding explicit manipulation traces that can be easily detected. Our results show that StealthRank consistently outperforms state-of-the-art adversarial ranking baselines in both effectiveness and stealth, highlighting critical vulnerabilities in LLM-driven recommendation systems.

MultiADS: Defect-aware Supervision for Multi-type Anomaly Detection and Segmentation in Zero-Shot Learning

Ylli Sadikaj,Hongkuan Zhou,Lavdim Halilaj,Stefan Schmid,Steffen Staab,Claudia Plant

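Task: Propose MultiADS, a zero-shot learning approach for multi-type anomaly detection and segmentation.

Motivation: Precise optical inspection is essential in industrial applications for reducing scrap rates and costs, yet existing methods only detect whether a product is defective and cannot identify the specific defect type.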

Details

Method: MultiADS combines CLIP with extra linear layers to align visual and textual representations in a joint feature space. Result: MultiADS generates a specific anomaly mask for each defect type, distinguishes defect types, and identifies multiple defect types at once, outperforming existing zero-/few-shot learning methods on several datasets. Conclusion: MultiADS is the first approach to perform multi-type anomaly segmentation in zero-shot learning, and it outperforms existing techniques. Abstract: Precise optical inspection in industrial applications is crucial for minimizing scrap rates and reducing the associated costs. Besides merely detecting if a product is anomalous or not, it is crucial to know the distinct type of defect, such as a bend, cut, or scratch. The ability to recognize the "exact" defect type enables automated treatments of the anomalies in modern production lines. Current methods are limited to solely detecting whether a product is defective or not without providing any insights on the defect type, let alone detecting and identifying multiple defects. We propose MultiADS, a zero-shot learning approach, able to perform Multi-type Anomaly Detection and Segmentation. The architecture of MultiADS comprises CLIP and extra linear layers to align the visual and textual representations in a joint feature space. To the best of our knowledge, our proposal is the first approach to perform a multi-type anomaly segmentation task in zero-shot learning. Contrary to the other baselines, our approach i) generates specific anomaly masks for each distinct defect type, ii) learns to distinguish defect types, and iii) simultaneously identifies multiple defect types present in an anomalous product. Additionally, our approach outperforms zero/few-shot learning SoTA methods on image-level and pixel-level anomaly detection and segmentation tasks on five commonly used datasets: MVTec-AD, Visa, MPDD, MAD and Real-IAD.

Information-Theoretic Reward Decomposition for Generalizable RLHF

Liyuan Mao,Haoran Xu,Amy Zhang,Weinan Zhang,Chenjia Bai

Task: Propose a new reward learning algorithm that decomposes the reward value into prompt-free and prompt-related components to improve the generalization of reward models.

Motivation: Existing reward models perform poorly on unseen prompt-response pairs because they overlook the effect of the prompt on the reward.

Details

Method: The reward value is decomposed into a prompt-free reward and a prompt-related reward, both extracted from an information-theoretic perspective; an algorithm then prioritizes data samples based on their prompt-free reward values. Result: Experiments show that the method effectively characterizes the two parts of the reward model and improves both alignment performance and generalization. Conclusion: The proposed method addresses the limited generalization of reward models, and experiments confirm its effectiveness. Abstract: A generalizable reward model is crucial in Reinforcement Learning from Human Feedback (RLHF) as it enables correctly evaluating unseen prompt-response pairs. However, existing reward models lack this ability, as they are typically trained by increasing the reward gap between chosen and rejected responses, while overlooking the prompts that the responses are conditioned on. Consequently, when the trained reward model is evaluated on prompt-response pairs that lie outside the data distribution, neglecting the effect of prompts may result in poor generalization of the reward model. To address this issue, we decompose the reward value into two independent components: prompt-free reward and prompt-related reward. Prompt-free reward represents the evaluation that is determined only by responses, while the prompt-related reward reflects the reward that derives from both the prompt and the response. We extract these two components from an information-theoretic perspective, which requires no extra models. Subsequently, we propose a new reward learning algorithm by prioritizing data samples based on their prompt-free reward values. Through toy examples, we demonstrate that the extracted prompt-free and prompt-related rewards effectively characterize two parts of the reward model. Further, standard evaluations show that our method improves both the alignment performance and the generalization capability of the reward model.

Large Scale Supervised Pretraining For Traumatic Brain Injury Segmentation

Constantin Ulrich,Tassilo Wald,Fabian Isensee,Klaus H. Maier-Hein

Task: Develop an innovative segmentation algorithm for T1-weighted MRI data to segment lesions in moderate to severe traumatic brain injury (msTBI).

Motivation: Because msTBI lesions vary in size, shape, and distribution, traditional image processing techniques produce critical errors, so more advanced solutions are needed.

Details

Method: A large-scale multi-dataset supervised pretraining approach (inspired by MultiTalent) trains a ResEnc L network on datasets covering various anatomical and pathological structures, followed by fine-tuning on msTBI-specific data. Result: On T1-weighted MRI scans the model outperforms the baseline without pretraining by up to 2 Dice points. Conclusion: The pretraining and fine-tuning strategy significantly improves the accuracy and robustness of msTBI lesion segmentation. Abstract: The segmentation of lesions in Moderate to Severe Traumatic Brain Injury (msTBI) presents a significant challenge in neuroimaging due to the diverse characteristics of these lesions, which vary in size, shape, and distribution across brain regions and tissue types. This heterogeneity complicates traditional image processing techniques, resulting in critical errors in tasks such as image registration and brain parcellation. To address these challenges, the AIMS-TBI Segmentation Challenge 2024 aims to advance innovative segmentation algorithms specifically designed for T1-weighted MRI data, the most widely utilized imaging modality in clinical practice. Our proposed solution leverages a large-scale multi-dataset supervised pretraining approach inspired by the MultiTalent method. We train a ResEnc L network on a comprehensive collection of datasets covering various anatomical and pathological structures, which equips the model with a robust understanding of brain anatomy and pathology. Following this, the model is fine-tuned on msTBI-specific data to optimize its performance for the unique characteristics of T1-weighted MRI scans; it outperforms the baseline without pretraining by up to 2 Dice points.
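The result above is reported in Dice points. As a reminder of what that metric measures, here is a minimal sketch of the Dice coefficient for binary masks, 2|A ∩ B| / (|A| + |B|). The flat-list mask representation and the function name are assumptions for illustration; a "Dice point" is presumably one percentage point on the 0 to 100 scale.

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks of equal length.

    pred, target: sequences of 0/1 (or False/True) voxel labels.
    Returns 1.0 when both masks are empty, a common convention.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    # One overlapping voxel, three foreground voxels in total.
    print(f"Dice = {100 * dice_score(predicted, reference):.1f} points")
```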

Yin Wu,Zhengxuan Zhang,Fuling Wang,Yuyu Luo,Hui Xiong,Nan Tang

Task: Detect and explain out-of-context (OOC) misinformation.

Motivation: Existing OOC misinformation detection methods rely on coarse-grained image-text similarity metrics, which struggle to capture subtle inconsistencies or to provide explainability.

Details

Method: The EXCLAIM framework uses multi-granularity indexing and a multi-agent reasoning architecture, drawing on external knowledge to analyze the consistency of multi-modal content. Result: EXCLAIM detects OOC misinformation with 4.3% higher accuracy than existing methods and offers explainable insights. Conclusion: EXCLAIM achieves higher accuracy and explainability in OOC misinformation detection and offers a new direction for multi-modal content analysis. Abstract: Misinformation continues to pose a significant challenge in today's information ecosystem, profoundly shaping public perception and behavior. Among its various manifestations, Out-of-Context (OOC) misinformation is particularly obscure, as it distorts meaning by pairing authentic images with misleading textual narratives. Existing methods for detecting OOC misinformation predominantly rely on coarse-grained similarity metrics between image-text pairs, which often fail to capture subtle inconsistencies or provide meaningful explainability. While multi-modal large language models (MLLMs) demonstrate remarkable capabilities in visual reasoning and explanation generation, they have not yet demonstrated the capacity to address the complex, fine-grained, and cross-modal distinctions necessary for robust OOC detection. To overcome these limitations, we introduce EXCLAIM, a retrieval-based framework designed to leverage external knowledge through a multi-granularity index of multi-modal events and entities. Our approach integrates multi-granularity contextual analysis with a multi-agent reasoning architecture to systematically evaluate the consistency and integrity of multi-modal news content. Comprehensive experiments validate the effectiveness and resilience of EXCLAIM, demonstrating its ability to detect OOC misinformation with 4.3% higher accuracy compared to state-of-the-art approaches, while offering explainable and actionable insights.

nnLandmark: A Self-Configuring Method for 3D Medical Landmark Detection

Alexandra Ertl,Shuhan Xiao,Stefan Denner,Robin Peretzke,David Zimmerer,Peter Neher,Fabian Isensee,Klaus Maier-Hein

Task: Develop nnLandmark, a self-configuring deep learning framework for landmark detection in 3D medical images.

Motivation: Landmark detection in medical imaging is crucial for diagnosis, treatment planning, and related tasks, but manual annotation is time-consuming and requires expert knowledge. Existing deep learning methods are held back by limited datasets and inconsistent benchmarks.

Details

Method: The framework adapts nnU-Net's self-configuring pipeline to heatmap-based regression and needs no manual parameter tuning. Result: It reaches state-of-the-art accuracy on two public datasets (MRE of 1.5 mm on MML and 1.2 mm on AFIDs). Conclusion: nnLandmark offers strong generalization, reproducibility, and ease of deployment, providing a reliable baseline for 3D landmark detection. Abstract: Landmark detection plays a crucial role in medical imaging tasks that rely on precise spatial localization, including specific applications in diagnosis, treatment planning, image registration, and surgical navigation. However, manual annotation is labor-intensive and requires expert knowledge. While deep learning shows promise in automating this task, progress is hindered by limited public datasets, inconsistent benchmarks, and non-standardized baselines, restricting reproducibility, fair comparisons, and model generalizability. This work introduces nnLandmark, a self-configuring deep learning framework for 3D medical landmark detection, adapting nnU-Net to perform heatmap-based regression. By leveraging nnU-Net's automated configuration, nnLandmark eliminates the need for manual parameter tuning, offering out-of-the-box usability. It achieves state-of-the-art accuracy across two public datasets, with a mean radial error (MRE) of 1.5 mm on the Mandibular Molar Landmark (MML) dental CT dataset and 1.2 mm for anatomical fiducials on a brain MRI dataset (AFIDs), where nnLandmark aligns with the inter-rater variability of 1.5 mm. With its strong generalization, reproducibility, and ease of deployment, nnLandmark establishes a reliable baseline for 3D landmark detection, supporting research in anatomical localization and clinical workflows that depend on precise landmark identification. The code will be available soon.
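The accuracy figures above are mean radial errors. A minimal sketch of that metric follows, assuming landmarks are given as voxel coordinates and converted to millimetres with a per-axis voxel spacing. The function name and the `spacing` parameter are illustrative assumptions, not part of nnLandmark.

```python
import math


def mean_radial_error(predicted, reference, spacing=(1.0, 1.0, 1.0)):
    """Mean Euclidean distance (in mm) between paired 3D landmarks.

    predicted, reference: lists of (x, y, z) voxel coordinates, same order.
    spacing: assumed voxel size in mm along each axis.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty landmark lists of equal length")
    distances = []
    for p, r in zip(predicted, reference):
        squared = sum(((pi - ri) * si) ** 2
                      for pi, ri, si in zip(p, r, spacing))
        distances.append(math.sqrt(squared))
    return sum(distances) / len(distances)


if __name__ == "__main__":
    pred = [(0, 0, 0), (1, 1, 1)]
    ref = [(3, 4, 0), (1, 1, 1)]
    print(mean_radial_error(pred, ref))                   # isotropic 1 mm voxels
    print(mean_radial_error(pred, ref, (0.5, 0.5, 0.5)))  # 0.5 mm voxels
```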

ER-RAG: Enhance RAG with ER-Based Unified Modeling of Heterogeneous Data Sources

Yikuan Xia,Jiazun Chen,Yirui Zhan,Suifeng Zhao,Weipeng Jiang,Chaorui Zhang,Wei Han,Bo Bai,Jun Gao

Task: Propose ER-RAG, a framework that unifies evidence integration across heterogeneous data sources to improve the efficiency of retrieval-augmented generation (RAG).

Motivation: Current RAG methods rely on source-specific strategies, which is challenging in low-resource or black-box environments and complicates operations when evidence is fragmented across sources.

Details

Method: The Entity-Relationship (ER) model standardizes entity retrieval and relationship querying, and a two-stage generation process first selects the preferred data sources and then constructs API chains. Result: ER-RAG won the 2024 KDDCup CRAG Challenge, performed on par with commercial RAG pipelines, improved the LLM score by 3.1%, and accelerated retrieval by 5.5x. Conclusion: By unifying evidence integration across heterogeneous data sources, ER-RAG significantly improves the efficiency and performance of RAG. Abstract: Large language models (LLMs) excel in question-answering (QA) tasks, and retrieval-augmented generation (RAG) enhances their precision by incorporating external evidence from diverse sources like web pages, databases, and knowledge graphs. However, current RAG methods rely on agent-specific strategies for individual data sources, posing challenges in low-resource or black-box environments and complicating operations when evidence is fragmented across sources. To address these limitations, we propose ER-RAG, a framework that unifies evidence integration across heterogeneous data sources using the Entity-Relationship (ER) model. ER-RAG standardizes entity retrieval and relationship querying through ER-based APIs with GET and JOIN operations. It employs a two-stage generation process: first, a preference optimization module selects optimal sources; second, another module constructs API chains based on source schemas. This unified approach allows efficient fine-tuning and seamless integration across diverse data sources. ER-RAG demonstrated its effectiveness by winning all three tracks of the 2024 KDDCup CRAG Challenge, achieving performance on par with commercial RAG pipelines using an 8B LLM backbone. It outperformed hybrid competitors by 3.1% in LLM score and accelerated retrieval by 5.5X.
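The abstract above describes ER-based APIs with GET and JOIN operations that are chained to answer a question. The sketch below shows what such a chain could look like over in-memory tables. The `get` and `join` signatures, the table layout, and the movie data are all hypothetical; the actual ER-RAG API is not specified in the abstract.

```python
def get(rows, **conditions):
    """GET: return the entities whose attributes match every condition."""
    return [row for row in rows
            if all(row.get(key) == value for key, value in conditions.items())]


def join(left_rows, right_rows, left_key, right_key):
    """JOIN: follow a relationship by matching left_key to right_key."""
    index = {}
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)
    return [{**left, **right}
            for left in left_rows
            for right in index.get(left[left_key], [])]


# Two hypothetical sources exposed through the same entity-relationship view.
MOVIES = [
    {"title": "Example Film", "year": 2020, "director_id": 7},
    {"title": "Other Film", "year": 2021, "director_id": 9},
]
PEOPLE = [
    {"person_id": 7, "name": "Jane Doe"},
    {"person_id": 9, "name": "John Roe"},
]

if __name__ == "__main__":
    # API chain for "Who directed Example Film?": GET the movie, JOIN to people.
    movie = get(MOVIES, title="Example Film")
    answer = join(movie, PEOPLE, "director_id", "person_id")
    print(answer[0]["name"])
```

Because every source is reduced to the same two operations, a model only has to learn to emit chains of `get` and `join` calls rather than one query dialect per source, which is the unification the abstract argues for.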

Visualisation of a multidimensional point cloud as a 3D swarm of avatars

Leszek Luchowski,Dariusz Pojda

Task: Propose an innovative approach to visualising multidimensional data using icons inspired by Chernoff faces.

Motivation: Exploit the human brain's natural ability to interpret facial expressions by mapping data dimensions to facial features, improving the visualisation of multidimensional data.

Details

Method: Classical projection techniques are combined with the assignment of data dimensions to facial features, implemented as a plugin to the dpVision open-source image handling platform. Result: The approach is validated on synthetic test data and a 15-dimensional database on Portuguese wines. Conclusion: The approach provides a useful visualisation tool for analysing complex data structures. Abstract: The article presents an innovative approach to the visualisation of multidimensional data, using icons inspired by Chernoff faces. The approach merges classical projection techniques with the assignment of particular data dimensions to facial features, capitalizing on the natural ability of the human brain to interpret facial expressions. The technique is implemented as a plugin to the dpVision open-source image handling platform. The plugin allows the data to be interactively explored in the form of a swarm of "totems" whose position in hyperspace as well as facial features represent various aspects of the data. Sample visualisations, based on synthetic test data as well as the vinhoverde 15-dimensional database on Portuguese wines, confirm the usefulness of our approach to the analysis of complex data structures.

A Diverse and Effective Retrieval-Based Debt Collection System with Expert Knowledge

Jiaming Luo,Weiyi Luo,Guoqing Sun,Mengchen Zhu,Haifeng Tang,Kunyao Lan,Mengyue Wu,Kenny Q. Zhu

Task: Design a debt collection system based on real debtor-collector data to improve script diversity and contextual relevance.

Motivation: Designing debt collection systems is crucial for improving operational efficiency and reducing costs in the financial industry, but maintaining script diversity, contextual relevance, and coherence is challenging.

Details

Method: A script library is built from real debt collection conversations, and a two-stage retrieval-based response system ensures contextual relevance. Result: Experiments show that the system improves script diversity, enhances response relevance, and achieves practical deployment efficiency through knowledge distillation. Conclusion: The work offers a scalable and automated solution and provides valuable insights for debt collection practice in real-world applications. Abstract: Designing effective debt collection systems is crucial for improving operational efficiency and reducing costs in the financial industry. However, the challenges of maintaining script diversity, contextual relevance, and coherence make this task particularly difficult. This paper presents a debt collection system based on real debtor-collector data from a major commercial bank. We construct a script library from real-world debt collection conversations, and propose a two-stage retrieval-based response system for contextual relevance. Experimental results show that our system improves script diversity, enhances response relevance, and achieves practical deployment efficiency through knowledge distillation. This work offers a scalable and automated solution, providing valuable insights for advancing debt collection practices in real-world applications.

Compass Control: Multi Object Orientation Control for Text-to-Image Generation

Rishubh Parihar,Vaibhav Agrawal,Sachidanand VS,R. Venkatesh Babu

Task: Address multi-object orientation control in text-to-image diffusion models.

Motivation: Existing methods cannot precisely control the 3D orientation of objects, which limits the ability to generate diverse scenes.

Details

Method: Orientation-aware "compass" tokens and a lightweight encoder network, combined with cross-attention constraints, give precise control over object orientation. Result: The model achieves precise orientation control for unseen complex objects and for multi-object scenes, showing strong generalization. Conclusion: The method achieves state-of-the-art orientation control and text alignment, validated through extensive evaluations and a user study. Abstract: Existing approaches for controlling text-to-image diffusion models, while powerful, do not allow for explicit 3D object-centric control, such as precise control of object orientation. In this work, we address the problem of multi-object orientation control in text-to-image diffusion models. This enables the generation of diverse multi-object scenes with precise orientation control for each object. The key idea is to condition the diffusion model with a set of orientation-aware compass tokens, one for each object, along with text tokens. A lightweight encoder network predicts these compass tokens, taking object orientation as the input. The model is trained on a synthetic dataset of procedurally generated scenes, each containing one or two 3D assets on a plain background. However, directly training this framework results in poor orientation control and leads to entanglement among objects. To mitigate this, we intervene in the generation process and constrain the cross-attention maps of each compass token to its corresponding object regions. The trained model is able to achieve precise orientation control for a) complex objects not seen during training and b) multi-object scenes with more than two objects, indicating strong generalization capabilities. Further, when combined with personalization methods, our method precisely controls the orientation of the new object in diverse contexts. Our method achieves state-of-the-art orientation control and text alignment, quantified with extensive evaluations and a user study.

On the Effectiveness and Generalization of Race Representations for Debiasing High-Stakes Decisions

Dang Nguyen,Chenhao Tan

Task: Study the racial bias of large language models (LLMs) in admissions and hiring decisions and how to mitigate it.

Motivation: Understanding and mitigating bias is critical for using LLMs in high-stakes decision-making.

Details

Method: Hypothetical applicant profiles serve as test beds, and distributed alignment search identifies "race subspaces" in model activations on which to intervene. Result: Gemma and LLaMA exhibit strong biases, but intervening on the race subspaces reduces Gemma's bias by 37-57%. Conclusion: Mechanistic approaches may improve the fairness of LLMs, but a universal race representation has not yet been achieved. Abstract: Understanding and mitigating biases is critical for the adoption of large language models (LLMs) in high-stakes decision-making. We introduce Admissions and Hiring, decision tasks with hypothetical applicant profiles where a person's race can be inferred from their name, as simplified test beds for racial bias. We show that Gemma 2B Instruct and LLaMA 3.2 3B Instruct exhibit strong biases. Gemma grants admission to 26% more White than Black applicants, and LLaMA hires 60% more Asian than White applicants. We demonstrate that these biases are resistant to prompt engineering: multiple prompting strategies all fail to promote fairness. In contrast, using distributed alignment search, we can identify "race subspaces" within model activations and intervene on them to debias model decisions. Averaging the representation across all races within the subspaces reduces Gemma's bias by 37-57%. Finally, we examine the generalizability of Gemma's race subspaces, and find limited evidence for generalization, where changing the prompt format can affect the race representation. Our work suggests mechanistic approaches may provide a promising avenue for improving the fairness of LLMs, but a universal race representation remains elusive.
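The debiasing step described above averages the representation across all races within the identified subspace. One way to read that is as a projection edit: keep everything outside the subspace and overwrite the coordinates inside it with their cross-group mean. The sketch below makes that reading concrete under the assumption of an orthonormal basis; the function names, the basis, and the numbers are illustrative, and in the paper the subspace would come from distributed alignment search rather than being given by hand.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def subspace_coords(vector, basis):
    """Coordinates of a vector along each (orthonormal) basis direction."""
    return [dot(vector, direction) for direction in basis]


def average_in_subspace(hidden, basis, group_means):
    """Replace the subspace component of `hidden` with the cross-group mean.

    hidden: activation vector to edit.
    basis: orthonormal directions spanning the (assumed) race subspace.
    group_means: one mean activation vector per group.
    """
    per_group = [subspace_coords(mean, basis) for mean in group_means]
    mean_coords = [sum(column) / len(per_group) for column in zip(*per_group)]
    current = subspace_coords(hidden, basis)
    edited = list(hidden)
    for direction, old, new in zip(basis, current, mean_coords):
        # Shift only along this direction; orthogonal components are untouched.
        edited = [e + (new - old) * d for e, d in zip(edited, direction)]
    return edited


if __name__ == "__main__":
    basis = [[1.0, 0.0, 0.0]]                      # toy one-dimensional subspace
    means = [[2.0, 0.0, 0.0], [4.0, 0.0, 0.0]]     # two hypothetical group means
    print(average_in_subspace([10.0, 5.0, 7.0], basis, means))
```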

FANeRV: Frequency Separation and Augmentation based Neural Representation for Video

Li Yu,Zhihui Li,Jimin Xiao,Moncef Gabbouj

Task: Propose FANeRV, a neural representation based on frequency separation and augmentation, to improve video reconstruction.

Motivation: Existing NeRV methods struggle to capture spatial details, leading to blurry reconstructions.

Details

Method: A discrete wavelet transform separates high- and low-frequency components, specialized modules enhance them, and a gated network fuses them; convolutional residual enhancement blocks improve the restoration of high-frequency details. Result: FANeRV significantly improves reconstruction performance and outperforms existing NeRV methods on video compression, inpainting, and interpolation. Conclusion: Through frequency separation and enhancement, FANeRV overcomes the limitations of existing methods and achieves better video reconstruction. Abstract: Neural representations for video (NeRV) have gained considerable attention for their strong performance across various video tasks. However, existing NeRV methods often struggle to capture fine spatial details, resulting in vague reconstructions. In this paper, we present a Frequency Separation and Augmentation based Neural Representation for video (FANeRV), which addresses these limitations with its core Wavelet Frequency Upgrade Block. This block explicitly separates input frames into high- and low-frequency components using a discrete wavelet transform, followed by targeted enhancement using specialized modules. Finally, a specially designed gated network effectively fuses these frequency components for optimal reconstruction. Additionally, convolutional residual enhancement blocks are integrated into the later stages of the network to balance parameter distribution and improve the restoration of high-frequency details. Experimental results demonstrate that FANeRV significantly improves reconstruction performance and excels in multiple tasks, including video compression, inpainting, and interpolation, outperforming existing NeRV methods.
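The frequency separation step above relies on a discrete wavelet transform. As a minimal illustration, here is a one-level 2D Haar transform that splits a frame into one low-frequency band and three detail bands, plus its inverse. The Haar wavelet, the list-of-lists image format, and the band names are assumptions chosen for simplicity; the abstract does not say which wavelet FANeRV uses.

```python
def haar_dwt2(image):
    """One-level 2D Haar transform of an even-sized list-of-lists image.

    Returns (low, (detail_cols, detail_rows, detail_diag)), each half-size.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    low, d_cols, d_rows, d_diag = [], [], [], []
    for r in range(0, rows, 2):
        bands = ([], [], [], [])
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]
            e, f = image[r + 1][c], image[r + 1][c + 1]
            bands[0].append((a + b + e + f) / 2.0)  # local average (low band)
            bands[1].append((a - b + e - f) / 2.0)  # change across columns
            bands[2].append((a + b - e - f) / 2.0)  # change across rows
            bands[3].append((a - b - e + f) / 2.0)  # diagonal change
        low.append(bands[0])
        d_cols.append(bands[1])
        d_rows.append(bands[2])
        d_diag.append(bands[3])
    return low, (d_cols, d_rows, d_diag)


def haar_idwt2(low, details):
    """Invert haar_dwt2 and recover the original image exactly."""
    d_cols, d_rows, d_diag = details
    image = []
    for i in range(len(low)):
        top, bottom = [], []
        for j in range(len(low[0])):
            s, h, v, d = low[i][j], d_cols[i][j], d_rows[i][j], d_diag[i][j]
            top += [(s + h + v + d) / 2.0, (s - h + v - d) / 2.0]
            bottom += [(s + h - v - d) / 2.0, (s - h - v + d) / 2.0]
        image += [top, bottom]
    return image


if __name__ == "__main__":
    frame = [[1.0, 2.0], [3.0, 4.0]]
    low_band, detail_bands = haar_dwt2(frame)
    print(low_band, detail_bands)
    print(haar_idwt2(low_band, detail_bands))
```

A network could then process the low band and the detail bands with separate modules before recombining them, which is the separation-then-fusion pattern the abstract outlines.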

Understanding Machine Unlearning Through the Lens of Mode Connectivity

Jiali Cheng,Hadi Amiri

Task: Study mode connectivity in machine unlearning and how it behaves under different conditions.

Motivation: Explore mode connectivity in machine unlearning, filling the gap left by prior work that overlooked loss landscapes and optimization dynamics.

Details

Method: Mode connectivity is analyzed between different unlearning methods, between models trained with and without curriculum learning, and between models optimized with first-order and second-order techniques. Result: Different evaluation metrics show distinct patterns of fluctuation along the connecting curve, and unlearning methods show mechanistic similarities and differences. Conclusion: This is the first study of mode connectivity in the context of machine unlearning, revealing its distinctive behaviour and mechanisms. Abstract: Machine Unlearning aims to remove undesired information from trained models without requiring full retraining from scratch. Despite recent advancements, the underlying loss landscapes and optimization dynamics of unlearning methods have received less attention. In this paper, we investigate and analyze machine unlearning through the lens of mode connectivity - the phenomenon where independently trained models can be connected by smooth low-loss paths in the parameter space. We define and study mode connectivity in unlearning across a range of overlooked conditions, including connections between different unlearning methods, models trained with and without curriculum learning, and models optimized with first-order and second-order techniques. Our findings show distinct patterns of fluctuation of different evaluation metrics along the curve, as well as the mechanistic (dis)similarity between unlearning methods. To the best of our knowledge, this is the first study on mode connectivity in the context of machine unlearning.
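Mode connectivity asks whether a low-loss path joins two trained parameter vectors. The simplest probe, sketched below, evaluates the loss along the straight line between them and reports the barrier, that is, how far the path rises above the interpolated endpoint losses. This is a linear-path simplification with a toy loss; the paper speaks of curves, which may be non-linear, and the function names and the double-well loss are assumptions for illustration.

```python
def interpolate(theta_a, theta_b, t):
    """Point on the straight line between two parameter vectors."""
    return [(1.0 - t) * a + t * b for a, b in zip(theta_a, theta_b)]


def loss_along_path(loss_fn, theta_a, theta_b, n_points=11):
    """Evaluate the loss at evenly spaced points from theta_a to theta_b."""
    path = []
    for i in range(n_points):
        t = i / (n_points - 1)
        path.append((t, loss_fn(interpolate(theta_a, theta_b, t))))
    return path


def loss_barrier(path):
    """Largest excess of the path loss over the interpolated endpoint losses."""
    start, end = path[0][1], path[-1][1]
    return max(loss - ((1.0 - t) * start + t * end) for t, loss in path)


def double_well(theta):
    """Toy loss with two minima at theta[0] = -1 and theta[0] = +1."""
    return (theta[0] ** 2 - 1.0) ** 2


if __name__ == "__main__":
    path = loss_along_path(double_well, [-1.0], [1.0])
    for t, loss in path:
        print(f"t={t:.1f} loss={loss:.3f}")
    print("barrier:", loss_barrier(path))
```

A barrier near zero suggests the two solutions sit in one connected low-loss region, while a large barrier, as in this toy double well, indicates they are separated along the straight path.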

End2end-ALARA: Approaching the ALARA Law in CT Imaging with End-to-end Learning

Xi Tao,Liyan Lin

Task: Propose End2end-ALARA, an end-to-end learning framework that jointly optimizes dose modulation and image reconstruction to meet the ALARA goal in CT imaging.

Motivation: CT examinations expose patients to harmful radiation, and the ALARA principle requires the radiation dose to be as low as reasonably achievable.

Details

Method: A dose modulation module and an image reconstruction module are connected by a differentiable simulation function and optimized with a constrained hinge loss. Result: End2end-ALARA presets personalized dose levels, stabilizes image quality, and uses less dose than fixed-dose and conventional dose modulation strategies. Conclusion: The study offers a feasible way to realize the ALARA principle in CT imaging. Abstract: Computed tomography (CT) examination exposes patients to a risk of radiation injury. A consensus when performing CT imaging is to make the radiation dose as low as reasonably achievable, i.e. the ALARA law. In this paper, we propose an end-to-end learning framework, named End2end-ALARA, that jointly optimizes dose modulation and image reconstruction to meet the goal of ALARA in CT imaging. End2end-ALARA works by building a dose modulation module and an image reconstruction module, connecting these modules with a differentiable simulation function, and optimizing them with a constrained hinge loss function. The objective is to minimize radiation dose subject to a prescribed image quality (IQ) index. The results show that End2end-ALARA is able to preset personalized dose levels to gain a stable IQ level across patients, which may facilitate image-based diagnosis and downstream model training. Moreover, compared to fixed-dose and conventional dose modulation strategies, End2end-ALARA consumes lower dose to reach the same IQ level. Our study sheds light on a way of realizing the ALARA law in CT imaging.
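The objective above, minimizing dose subject to a prescribed image quality index, can be relaxed into a single loss with a hinge penalty that is active only when quality falls below the target. The sketch below shows one plausible form, dose + penalty * max(0, IQ_target - IQ). The exact loss in End2end-ALARA is not given in the abstract, so the formula, the `penalty` weight, and the toy dose-to-quality curve are all assumptions.

```python
import math


def constrained_hinge_loss(dose, iq, iq_target, penalty=100.0):
    """Dose plus a hinge penalty that is zero once iq reaches iq_target."""
    return dose + penalty * max(0.0, iq_target - iq)


def toy_image_quality(dose):
    """Hypothetical saturating quality curve: more dose, better image."""
    return 1.0 - math.exp(-dose)


def best_dose(candidates, iq_model, iq_target, penalty=100.0):
    """Pick the candidate dose with the lowest constrained hinge loss."""
    return min(candidates,
               key=lambda d: constrained_hinge_loss(d, iq_model(d),
                                                    iq_target, penalty))


if __name__ == "__main__":
    doses = [0.5, 1.0, 1.5, 2.0, 2.5]
    print(best_dose(doses, toy_image_quality, iq_target=0.8))
```

With a large penalty the minimizer is the smallest candidate dose that still meets the quality target; in the learned setting described above, the grid search would be replaced by gradient descent through the differentiable simulation.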

Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?

Chenrui Fan,Ming Li,Lichao Sun,Tianyi Zhou

Task: Study the response length and efficiency of reasoning large language models (LLMs) on questions with missing premises (MiP).

Motivation: Reasoning LLMs are found to produce redundant and ineffective thinking on MiP questions (MiP-Overthinking), which contradicts the "test-time scaling law" and reveals a lack of critical thinking.

Details

Method: LLM behaviour in the MiP scenario is observed on multiple datasets, with analyses of reasoning length, overthinking patterns, and the location of critical thinking, plus distillation experiments. Result: Reasoning LLMs perform poorly on MiP questions while non-reasoning LLMs perform better; overthinking can spread through distillation. Conclusion: The current training recipe for reasoning LLMs is flawed and should be improved to encourage efficient thinking and avoid the abuse of thinking patterns. Abstract: We find that the response length of reasoning LLMs, whether trained by reinforcement learning or supervised learning, drastically increases for ill-posed questions with missing premises (MiP), ending up with redundant and ineffective thinking. This newly introduced scenario exacerbates the general overthinking issue to a large extent, which we name MiP-Overthinking. Such failures are against the "test-time scaling law" but have been widely observed on multiple datasets we curated with MiP, indicating the harm of cheap overthinking and a lack of critical thinking. Surprisingly, LLMs not specifically trained for reasoning exhibit much better performance on the MiP scenario, producing much shorter responses that quickly identify ill-posed queries. This implies a critical flaw of the current training recipe for reasoning LLMs, which does not encourage efficient thinking adequately, leading to the abuse of thinking patterns. To further investigate the reasons behind such failures, we conduct fine-grained analyses of the reasoning length, overthinking patterns, and location of critical thinking on different types of LLMs. Moreover, our extended ablation study reveals that the overthinking is contagious through the distillation of reasoning models' responses. These results improve the understanding of overthinking and offer new insights into mitigating the problem.

Domain Generalization through Attenuation of Domain-Specific Information

Reiji Saito,Kazuhiro Hotta

Task: Propose a new evaluation metric, DI, and a method, ADSI, for domain-generalized semantic segmentation in automotive images.

Motivation: Suppress domain-specific information in domain-generalized semantic segmentation so that more domain-independent features can be extracted.

Details

Method: DI measures the presence of domain-specific information, and ADSI uses a Butterworth filter to attenuate low-frequency domain-specific information. Result: Experiments on GTA5 and Cityscapes show that the method outperforms conventional approaches and is robust under nighttime conditions. Conclusion: DI and ADSI effectively suppress domain-specific information and improve performance on domain generalization tasks. Abstract: In this paper, we propose a new evaluation metric called Domain Independence (DI) and a method called Attenuation of Domain-Specific Information (ADSI), both specifically designed for domain-generalized semantic segmentation in automotive images. DI measures the presence of domain-specific information: a lower DI value indicates strong domain dependence, while a higher DI value suggests greater domain independence. This makes it possible to roughly identify where domain-specific information exists and up to which frequency range it is present. As a result, it becomes possible to effectively suppress only the regions in the image that contain domain-specific information, enabling feature extraction independent of the domain. ADSI uses a Butterworth filter to remove the low-frequency components of images that contain inherent domain-specific information such as sensor characteristics and lighting conditions. However, since low-frequency components also contain important information such as color, we should not remove them completely. Thus, the low-frequency components are multiplied by a scalar value (ranging from 0 to 1) to retain essential information. This helps the model learn more domain-independent features. In experiments, GTA5 (a synthetic dataset) was used for training and a real-world dataset was used for evaluation, and the proposed method outperformed conventional approaches. Similarly, in experiments where Cityscapes (a real-world dataset) was used for training and datasets covering various environments such as rain and nighttime were used for evaluation, the proposed method demonstrated its robustness under nighttime conditions.
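The attenuation step above scales low-frequency components by a scalar between 0 and 1 instead of removing them. A minimal one-dimensional sketch follows, using a Butterworth low-pass response H(f) = 1 / (1 + (f / cutoff)^(2n)) and the gain 1 - (1 - alpha) * H(f), so that the lowest frequencies are multiplied by roughly alpha and high frequencies pass unchanged. The 1D signal, the naive DFT, and the parameter names `cutoff`, `alpha`, and `order` are assumptions for illustration; the paper works on images and its exact formulation may differ.

```python
import cmath
import math


def butterworth_lowpass(freq, cutoff, order=2):
    """Butterworth low-pass magnitude response at a given frequency bin."""
    return 1.0 / (1.0 + (freq / cutoff) ** (2 * order))


def attenuation_gain(freq, cutoff, alpha, order=2):
    """Gain that equals alpha at DC and approaches 1 at high frequencies."""
    return 1.0 - (1.0 - alpha) * butterworth_lowpass(freq, cutoff, order)


def attenuate_low_frequencies(signal, cutoff, alpha, order=2):
    """Scale the low-frequency content of a real 1D signal (naive O(n^2) DFT)."""
    n = len(signal)
    spectrum = [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)) for k in range(n)]
    filtered = []
    for k, coeff in enumerate(spectrum):
        freq = min(k, n - k)  # distance from the DC bin, in bins
        filtered.append(coeff * attenuation_gain(freq, cutoff, alpha, order))
    return [(sum(filtered[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]


if __name__ == "__main__":
    # A constant offset (pure low frequency) plus a fast alternating pattern.
    signal = [4.0 + (-1.0) ** t for t in range(8)]
    print(attenuate_low_frequencies(signal, cutoff=1.0, alpha=0.5))
```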

Defending LLM Watermarking Against Spoofing Attacks with Contrastive Representation Learning

Li An,Yujian Liu,Yepeng Liu,Yang Zhang,Yuheng Bu,Shiyu Chang

Task: Propose a semantic-aware watermarking algorithm for detecting and protecting LLM-generated text.

Motivation: Current watermarking research focuses on text quality, detectability, and robustness against removal attacks, while security against spoofing attacks remains understudied.

Details

Method: A post-hoc watermark embedding algorithm uses a semantic mapping model to generate the green-red token list; the model is contrastively trained to be sensitive to semantic-distorting changes and insensitive to semantic-preserving ones. Result: On two standard benchmarks the method shows strong robustness against removal attacks and security against spoofing attacks while maintaining high watermark detectability. Conclusion: The method offers more secure, semantic-aware watermarking for LLMs and significantly strengthens the defence against spoofing attacks. Abstract: Watermarking has emerged as a promising technique for detecting texts generated by LLMs. Current research has primarily focused on three design criteria: high quality of the watermarked text, high detectability, and robustness against removal attacks. However, the security against spoofing attacks remains relatively understudied. For example, a piggyback attack can maliciously alter the meaning of watermarked text (transforming it into hate speech) while preserving the original watermark, thereby damaging the reputation of the LLM provider. We identify two core challenges that make defending against spoofing difficult: (1) the need for watermarks to be both sensitive to semantic-distorting changes and insensitive to semantic-preserving edits, and (2) the contradiction between the need to detect global semantic shifts and the local, auto-regressive nature of most watermarking schemes. To address these challenges, we propose a semantic-aware watermarking algorithm that post-hoc embeds watermarks into a given target text while preserving its original meaning. Our method introduces a semantic mapping model, which guides the generation of a green-red token list, contrastively trained to be sensitive to semantic-distorting changes and insensitive to semantic-preserving changes. Experiments on two standard benchmarks demonstrate strong robustness against removal attacks and security against spoofing attacks, including sentiment reversal and toxic content insertion, while maintaining high watermark detectability. Our approach offers a significant step toward more secure and semantically aware watermarking for LLMs. Our code is available at https://github.com/UCSB-NLP-Chang/contrastive-watermark.
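Green-red list watermarks are usually detected by counting green tokens and testing whether the count is unusually high. The sketch below shows that detection statistic for a simplified scheme in which the green list is derived from a keyed hash of the previous token. This hashing rule is a stand-in chosen for illustration: in the method above the list is guided by a learned semantic mapping model instead, and the key, the `gamma` fraction, and the function names are assumptions.

```python
import hashlib
import math


def is_green(prev_token, token, gamma=0.5, key="demo-key"):
    """Pseudo-randomly assign `token` to the green list given the previous token."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64 < gamma


def green_z_score(tokens, gamma=0.5, key="demo-key"):
    """z-score of the green-token count against the no-watermark expectation."""
    n = len(tokens) - 1  # number of scored (previous, current) pairs
    if n < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(a, b, gamma, key) for a, b in zip(tokens, tokens[1:]))
    return (green - gamma * n) / math.sqrt(n * gamma * (1.0 - gamma))


def embed_greedily(start, vocabulary, length, gamma=0.5, key="demo-key"):
    """Toy embedder: always emit the first vocabulary word that is green."""
    tokens = [start]
    for _ in range(length):
        tokens.append(next(w for w in vocabulary
                           if is_green(tokens[-1], w, gamma, key)))
    return tokens


if __name__ == "__main__":
    vocab = [f"w{i}" for i in range(50)]
    marked = embed_greedily("start", vocab, length=20)
    print("watermarked z:", green_z_score(marked))
    print("unmarked z:", green_z_score(vocab[:21]))
```

A text whose every token is green scores z = sqrt(n), while unmarked text is expected to score near zero; a detector thresholds this value.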

Zero-Shot Image-Based Large Language Model Approach to Road Pavement Monitoring

Shuoshuo Xu,Kai Zhao,James Loney,Zili Li,Andrea Visentin

Task: Propose a zero-shot learning approach based on large language models (LLMs) for automated assessment of pavement condition.

Motivation: Conventional manual inspection is subjective, and existing machine learning methods depend on large, high-quality labeled datasets, which are resource-intensive and limit adaptability; advances in LLMs offer a way to address these problems.

Details

Method: Multiple LLM-based assessment models are developed using prompt engineering strategies aligned with the PSCI standards, and an optimized model is selected. Result: The optimized model achieves high accuracy and consistency, even surpassing expert evaluations, and is applied successfully to Google Street View images. Conclusion: LLMs have transformative potential for automated road damage evaluation, and detailed prompt engineering is key to reliable assessments. Abstract: Effective and rapid evaluation of pavement surface condition is critical for prioritizing maintenance, ensuring transportation safety, and minimizing vehicle wear and tear. While conventional manual inspections suffer from subjectivity, existing machine learning-based methods are constrained by their reliance on large and high-quality labeled datasets, which require significant resources and limit adaptability across varied road conditions. The revolutionary advancements in Large Language Models (LLMs) present significant potential for overcoming these challenges. In this study, we propose an innovative automated zero-shot learning approach that leverages the image recognition and natural language understanding capabilities of LLMs to assess road conditions effectively. Multiple LLM-based assessment models were developed, employing prompt engineering strategies aligned with the Pavement Surface Condition Index (PSCI) standards. These models' accuracy and reliability were evaluated against official PSCI results, with an optimized model ultimately selected. Extensive tests benchmarked the optimized model against evaluations from experts of various levels using Google Street View road images. The results reveal that the LLM-based approach can effectively assess road conditions, with the optimized model, which employs comprehensive and structured prompt engineering strategies, outperforming simpler configurations by achieving high accuracy and consistency, even surpassing expert evaluations. Moreover, successfully applying the optimized model to Google Street View images demonstrates its potential for future city-scale deployments. These findings highlight the transformative potential of LLMs in automating road damage evaluations and underscore the pivotal role of detailed prompt engineering in achieving reliable assessments.

Wanting to be Understood

Chrisantha Fernando,Dylan Banarse,Simon Osindero

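Task: Explore an intrinsic motivation for mutual awareness, hypothesizing that humans have a fundamental drive to understand and to be understood even without extrinsic rewards.

Motivation: Study the intrinsic human need to understand and be understood in the absence of extrinsic rewards, and its effect on social interaction and cooperation.

Details

Method: Simulations of the perceptual crossing paradigm examine how different internal reward functions affect reinforcement learning agents, including an active-inference-type artificial curiosity reward and intrinsic rewards for imitation and influence/impressionability. Result: Artificial curiosity alone does not produce a preference for social interaction, but rewards emphasizing reciprocal understanding drive agents to prioritize interaction and facilitate cooperation in tasks where only one agent receives extrinsic reward. Conclusion: The intrinsic motivation to understand and be understood effectively promotes social interaction and cooperation, especially when extrinsic rewards are absent. Abstract: This paper explores an intrinsic motivation for mutual awareness, hypothesizing that humans possess a fundamental drive to understand and to be understood even in the absence of extrinsic rewards. Through simulations of the perceptual crossing paradigm, we explore the effect of various internal reward functions in reinforcement learning agents. The drive to understand is implemented as an active inference type artificial curiosity reward, whereas the drive to be understood is implemented through intrinsic rewards for imitation, influence/impressionability, and sub-reaction time anticipation of the other. Results indicate that while artificial curiosity alone does not lead to a preference for social interaction, rewards emphasizing reciprocal understanding successfully drive agents to prioritize interaction. We demonstrate that this intrinsic motivation can facilitate cooperation in tasks where only one agent receives extrinsic reward for the behaviour of the other.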

A Meaningful Perturbation Metric for Evaluating Explainability Methods

Danielle Cohen,Hila Chefer,Lior Wolf

Task: Propose a new approach that uses image generation models for targeted perturbation to evaluate attribution methods for deep neural networks (DNNs).

Motivation: Existing attribution methods produce widely differing saliency maps, and conventional perturbation often yields out-of-distribution modifications, making results unreliable.

Details

Method: Inpainting perturbs only the high-relevance pixels, changing the model's prediction while preserving image fidelity. Result: Experiments show that the approach produces meaningful rankings of attribution methods and correlates significantly better with human preferences than existing approaches. Conclusion: The approach has potential for enhancing the interpretability of DNNs. Abstract: Deep neural networks (DNNs) have demonstrated remarkable success, yet their wide adoption is often hindered by their opaque decision-making. To address this, attribution methods have been proposed to assign relevance values to each part of the input. However, different methods often produce entirely different relevance maps, necessitating the development of standardized metrics to evaluate them. Typically, such evaluation is performed through perturbation, wherein high- or low-relevance regions of the input image are manipulated to examine the change in prediction. In this work, we introduce a novel approach, which harnesses image generation models to perform targeted perturbation. Specifically, we focus on inpainting only the high-relevance pixels of an input image to modify the model's predictions while preserving image fidelity. This is in contrast to existing approaches, which often produce out-of-distribution modifications, leading to unreliable results. Through extensive experiments, we demonstrate the effectiveness of our approach in generating meaningful rankings across a wide range of models and attribution methods. Crucially, we establish that the ranking produced by our metric exhibits significantly higher correlation with human preferences compared to existing approaches, underscoring its potential for enhancing interpretability in DNNs.

A Neuro-inspired Interpretation of Unlearning in Large Language Models through Sample-level Unlearning Difficulty

Xiaohua Feng,Yuyuan Li,Chengye Wang,Junlin Liu,Li Zhang,Chaochao Chen

Task: Study the relationship between unlearning in large language models (LLMs) and sample characteristics, and propose a weighted sampling method based on Memory Removal Difficulty (MRD).

Motivation: Current research neglects interpretability, particularly sample-level unlearning difficulty, which may lead to algorithm performance being misattributed to sample selection rather than algorithm design.

Details

Method: The MRD metric quantifies sample-level unlearning difficulty, and an MRD-based weighted sampling method optimizes existing unlearning algorithms. Result: Public benchmarks and datasets confirm the effectiveness of the metric and the method. Conclusion: The MRD metric and the weighted sampling method improve the efficiency and effectiveness of unlearning. Abstract: Driven by privacy protection laws and regulations, unlearning in Large Language Models (LLMs) is gaining increasing attention. However, current research often neglects the interpretability of the unlearning process, particularly concerning sample-level unlearning difficulty. Existing studies typically assume a uniform unlearning difficulty across samples. This simplification risks attributing the performance of unlearning algorithms to sample selection rather than the algorithm's design, potentially steering the development of LLM unlearning in the wrong direction. Thus, we investigate the relationship between LLM unlearning and sample characteristics, with a focus on unlearning difficulty. Drawing inspiration from neuroscience, we propose a Memory Removal Difficulty (MRD) metric to quantify sample-level unlearning difficulty. Using MRD, we analyze the characteristics of hard-to-unlearn versus easy-to-unlearn samples. Furthermore, we propose an MRD-based weighted sampling method to optimize existing unlearning algorithms, which prioritizes easily forgettable samples, thereby improving unlearning efficiency and effectiveness. We validate the proposed metric and method using public benchmarks and datasets, with results confirming its effectiveness.
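The weighted sampling described above prioritizes samples that are easy to forget. One simple way to realize that, sketched below, is to turn difficulty scores into sampling probabilities with a softmax over the negated scores, so a lower MRD means a higher chance of being drawn. The softmax form, the `temperature` parameter, and the example scores are assumptions; the abstract does not give the weighting rule or how MRD itself is computed.

```python
import math
import random


def sampling_weights(mrd_scores, temperature=1.0):
    """Softmax over negated difficulty: easier samples get larger weights."""
    lowest = min(mrd_scores)
    # Subtracting the minimum keeps the exponentials numerically stable.
    raw = [math.exp(-(score - lowest) / temperature) for score in mrd_scores]
    total = sum(raw)
    return [value / total for value in raw]


def sample_batch(samples, mrd_scores, batch_size, temperature=1.0, seed=0):
    """Draw a batch (with replacement) biased toward low-difficulty samples."""
    rng = random.Random(seed)
    weights = sampling_weights(mrd_scores, temperature)
    return rng.choices(samples, weights=weights, k=batch_size)


if __name__ == "__main__":
    forget_set = ["sample_a", "sample_b", "sample_c"]
    difficulty = [0.1, 0.5, 2.0]  # hypothetical MRD values
    print(sampling_weights(difficulty))
    print(sample_batch(forget_set, difficulty, batch_size=5))
```

Raising `temperature` flattens the distribution toward uniform sampling, which recovers the uniform-difficulty assumption the abstract criticizes.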

MonoPlace3D: Learning 3D-Aware Object Placement for 3D Monocular Detection

Rishubh Parihar,Srinjay Sarkar,Sarthak Vora,Jogendra Kundu,R. Venkatesh Babu

Task: Propose MonoPlace3D, a system that generates realistic, scene-aware synthetic data to augment the training of monocular 3D detectors.

Motivation: Existing monocular 3D detectors are limited by the diversity and scale of real-world datasets, and conventional data augmentation struggles to produce realistic scene-aware data for outdoor settings.

Details

Method: MonoPlace3D learns a distribution over plausible 3D bounding boxes from the scene content, then samples locations from the learned distribution to place synthetic objects. Result: Experiments on KITTI and NuScenes show that MonoPlace3D significantly improves the accuracy of multiple monocular 3D detectors while being highly data efficient. Conclusion: By generating realistic scene-aware synthetic data, MonoPlace3D addresses the data scarcity problem in training monocular 3D detectors. Abstract: Current monocular 3D detectors are held back by the limited diversity and scale of real-world datasets. While data augmentation certainly helps, it's particularly difficult to generate realistic scene-aware augmented data for outdoor settings. Most current approaches to synthetic data generation focus on realistic object appearance through improved rendering techniques. However, we show that where and how objects are positioned is just as crucial for training effective 3D monocular detectors. The key obstacle lies in automatically determining realistic object placement parameters (position, dimensions, and directional alignment) when introducing synthetic objects into actual scenes. To address this, we introduce MonoPlace3D, a novel system that considers the 3D scene content to create realistic augmentations. Specifically, given a background scene, MonoPlace3D learns a distribution over plausible 3D bounding boxes. Subsequently, we render realistic objects and place them according to the locations sampled from the learned distribution. Our comprehensive evaluation on two standard datasets, KITTI and NuScenes, demonstrates that MonoPlace3D significantly improves the accuracy of multiple existing monocular 3D detectors while being highly data efficient.

Bridging the Gap Between Preference Alignment and Machine Unlearning

Xiaohua Feng,Yuyuan Li,Huwei Ji,Jiaming Zhang,Li Zhang,Tianyu Du,Chaochao Chen

Task: Explore the relationship between preference alignment (PA) and unlearning in large language models (LLMs), and propose a bi-level optimization-based method (U2A) that optimizes negative example selection to improve PA performance.

Motivation: Mainstream methods such as RLHF require high-quality positive preference datasets, which are costly and computationally intensive, while LLM unlearning can directly remove the influence of negative examples but lacks systematic quantitative analysis.

Details

Method: Propose a bi-level optimization framework that quantifies how unlearning specific negative examples affects PA performance, and design U2A to optimize the selection and weighting of negative examples. Result: Experiments show that U2A effectively selects and unlearns negative examples, significantly improving PA performance. Conclusion: U2A offers an efficient solution for preference alignment in low-resource scenarios. Abstract: Despite advances in Preference Alignment (PA) for Large Language Models (LLMs), mainstream methods like Reinforcement Learning with Human Feedback (RLHF) face notable challenges. These approaches require high-quality datasets of positive preference examples, which are costly to obtain and computationally intensive due to training instability, limiting their use in low-resource scenarios. LLM unlearning presents a promising alternative by directly removing the influence of negative examples. However, current research has primarily focused on empirical validation, lacking systematic quantitative analysis. To bridge this gap, we propose a framework to explore the relationship between PA and LLM unlearning. Specifically, we introduce a bi-level optimization-based method to quantify the impact of unlearning specific negative examples on PA performance. Our analysis reveals that not all negative examples contribute equally to alignment improvement when unlearned, and the effect varies significantly across examples. Building on this insight, we pose a crucial question: how can we optimally select and weight negative examples for unlearning to maximize PA performance? To answer this, we propose a framework called Unlearning to Align (U2A), which leverages bi-level optimization to efficiently select and unlearn examples for optimal PA performance. We validate the proposed method through extensive experiments, with results confirming its effectiveness.

DyDiT++: Dynamic Diffusion Transformers for Efficient Visual Generation

Wangbo Zhao,Yizeng Han,Jiasheng Tang,Kai Wang,Hao Luo,Yibing Song,Gao Huang,Fan Wang,Yang You

Task: Propose a Dynamic Diffusion Transformer (DyDiT) that adjusts its computation dynamically, reducing the redundant computation of the Diffusion Transformer (DiT) in visual generation.

Motivation: DiT performs well in visual generation but has a high computational cost, which stems mainly from the redundant computation that its static inference paradigm introduces across diffusion timesteps and spatial regions.

Details

Method: Propose DyDiT, comprising Timestep-wise Dynamic Width (TDW) and Spatial-wise Dynamic Token (SDT) strategies, integrated with flow matching-based generation and a parameter-efficient training method (TD-LoRA). Result: DyDiT significantly accelerates generation and performs well on complex tasks such as video generation and text-to-image generation. Conclusion: By dynamically adjusting computation, DyDiT reduces redundant computation, broadens the applicability of DiT, and supports parameter-efficient training. Abstract: Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To overcome this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. TDW and SDT can be seamlessly integrated into DiT and significantly accelerate the generation process. Building on these designs, we further enhance DyDiT in three key aspects. First, DyDiT is integrated seamlessly with flow matching-based generation, enhancing its versatility. Furthermore, we enhance DyDiT to tackle more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and democratize technology access, we investigate the feasibility of training DyDiT in a parameter-efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT.

CAT: Circular-Convolutional Attention for Sub-Quadratic Transformers

Yoshihiro Yamada

Task: Propose Circular-convolutional ATtention (CAT), a Fourier-based attention mechanism, to reduce the complexity of Transformer attention.

Motivation: The O(N^2) complexity of standard attention limits scalability to long sequences.

Details

Method: Apply Fourier-based circular convolutions to reduce complexity to O(N log N) while using fewer learnable parameters. Result: On large-scale benchmarks such as ImageNet-1k and WikiText-103, CAT achieves about a 10% speedup and consistent accuracy improvements. Conclusion: CAT offers practical efficiency and ease of implementation, and provides guidance for designing next-generation high-performance Transformer architectures. Abstract: Transformers have driven remarkable breakthroughs in natural language processing and computer vision, yet their standard attention mechanism still imposes O(N^2) complexity, hindering scalability to longer sequences. We introduce Circular-convolutional ATtention (CAT), a Fourier-based approach that efficiently applies circular convolutions to reduce complexity without sacrificing representational power. CAT achieves O(N log N) computations, requires fewer learnable parameters by streamlining fully-connected layers, and introduces no heavier operations, resulting in consistent accuracy improvements and about a 10% speedup in naive PyTorch implementations on large-scale benchmarks such as ImageNet-1k and WikiText-103. Grounded in an engineering-isomorphism framework, CAT's design not only offers practical efficiency and ease of implementation but also provides insights to guide the development of next-generation, high-performance Transformer architectures. Finally, our ablation studies highlight the key conditions underlying CAT's success, shedding light on broader principles for scalable attention mechanisms.
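
The O(N log N) figure rests on a standard identity: a circular convolution equals an element-wise product in the Fourier domain. The sketch below illustrates only that identity with a small pure-Python radix-2 FFT. It is not CAT's attention layer, and the function names and test vectors are made up for illustration.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circular_conv_fft(a, b):
    """Circular convolution in O(N log N): FFT, multiply, inverse FFT."""
    n = len(a)
    product = [p * q for p, q in zip(fft(a), fft(b))]
    return [v.real / n for v in fft(product, inverse=True)]


def circular_conv_direct(a, b):
    """Reference O(N^2) circular convolution for comparison."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.0, 0.0, 0.5]
fast = circular_conv_fft(signal, kernel)
slow = circular_conv_direct(signal, kernel)
```

Both routines return the same values up to floating-point error; only the FFT route scales as N log N with sequence length.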

Hybrid CNN with Chebyshev Polynomial Expansion for Medical Image Analysis

Abhinav Roy,Bhavesh Gyanchandani,Aditya Oza

Task: Propose a hybrid deep learning architecture incorporating Chebyshev polynomial expansions to improve the accuracy of pulmonary nodule classification.

Motivation: Early and accurate lung cancer diagnosis is critical to improving patient outcomes, and conventional CNNs are limited in capturing complex spatial-spectral variations.

Details

Method: Propose a Chebyshev-CNN architecture that uses the orthogonality and recursive properties of Chebyshev polynomials to strengthen feature extraction. Result: The model outperforms traditional CNNs on LUNA16 and LIDC-IDRI, with significant improvements in accuracy, sensitivity, and specificity. Conclusion: The approach provides a more robust framework for automated medical diagnostics and has potential for broad use in clinical decision support systems. Abstract: Lung cancer remains one of the leading causes of cancer-related mortality worldwide, with early and accurate diagnosis playing a pivotal role in improving patient outcomes. Automated detection of pulmonary nodules in computed tomography (CT) scans is a challenging task due to variability in nodule size, shape, texture, and location. Traditional Convolutional Neural Networks (CNNs) have shown considerable promise in medical image analysis; however, their limited ability to capture fine-grained spatial-spectral variations restricts their performance in complex diagnostic scenarios. In this study, we propose a novel hybrid deep learning architecture that incorporates Chebyshev polynomial expansions into CNN layers to enhance expressive power and improve the representation of underlying anatomical structures. The proposed Chebyshev-CNN leverages the orthogonality and recursive properties of Chebyshev polynomials to extract high-frequency features and approximate complex nonlinear functions with greater fidelity. The model is trained and evaluated on benchmark lung cancer imaging datasets, including LUNA16 and LIDC-IDRI, achieving superior performance in classifying pulmonary nodules as benign or malignant. Quantitative results demonstrate significant improvements in accuracy, sensitivity, and specificity compared to traditional CNN-based approaches. This integration of polynomial-based spectral approximation within deep learning provides a robust framework for enhancing automated medical diagnostics and holds potential for broader applications in clinical decision support systems.
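
The recursive property mentioned above is the three-term recurrence T_0(x) = 1, T_1(x) = x, T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x). A minimal sketch of evaluating a Chebyshev expansion with it follows. How the paper wires such expansions into CNN layers is not specified here, so the function names and coefficients are purely illustrative.

```python
def chebyshev_basis(x, degree):
    """Return [T_0(x), ..., T_degree(x)] via T_{k+1} = 2x*T_k - T_{k-1}."""
    values = [1.0]
    if degree >= 1:
        values.append(float(x))
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values


def chebyshev_expand(x, coeffs):
    """Evaluate sum_k coeffs[k] * T_k(x) for an input scaled to [-1, 1]."""
    basis = chebyshev_basis(x, len(coeffs) - 1)
    return sum(c * t for c, t in zip(coeffs, basis))


# Hypothetical coefficients; in a network these would be learned weights.
basis_at_half = chebyshev_basis(0.5, 3)
value = chebyshev_expand(0.5, [1.0, 2.0, 0.0, -1.0])
```

Inputs are assumed to be rescaled to [-1, 1], the interval on which Chebyshev polynomials are orthogonal and bounded.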

FamilyTool: A Multi-hop Personalized Tool Use Benchmark

Yuxin Wang,Yiran Guo,Yining Zheng,Zhangyue Yin,Shuo Chen,Jie Yang,Jiajun Chen,Xuanjing Huang,Xipeng Qiu

Task: Evaluate the tool-learning ability of large language models (LLMs) in personalized, multi-hop, and dynamic settings.

Motivation: Existing tool-learning benchmarks inadequately address real-world personalized scenarios, particularly those requiring multi-hop reasoning and inductive knowledge adaptation in dynamic environments.

Details

Method: Introduce the FamilyTool benchmark, which simulates personalized multi-hop tool use scenarios based on a family knowledge graph (KG), and propose the KGETool evaluation pipeline. Result: Experiments show that current LLMs degrade sharply as hop complexity increases and in inductive scenarios, exposing generalization deficits. Conclusion: FamilyTool is a key resource for evaluating and advancing LLMs' reasoning, adaptability, and scalability in complex, dynamic environments. Abstract: The integration of tool learning with Large Language Models (LLMs) has expanded their capabilities in handling complex tasks by leveraging external tools. However, existing benchmarks for tool learning inadequately address critical real-world personalized scenarios, particularly those requiring multi-hop reasoning and inductive knowledge adaptation in dynamic environments. To bridge this gap, we introduce FamilyTool, a novel benchmark grounded in a family-based knowledge graph (KG) that simulates personalized, multi-hop tool use scenarios. FamilyTool challenges LLMs with queries spanning 1 to 3 relational hops (e.g., inferring familial connections and preferences) and incorporates an inductive KG setting where models must adapt to unseen user preferences and relationships without re-training, a common limitation in prior approaches that compromises generalization. We further propose KGETool: a simple KG-augmented evaluation pipeline to systematically assess LLMs' tool use ability in these settings. Experiments reveal significant performance gaps in state-of-the-art LLMs, with accuracy dropping sharply as hop complexity increases and inductive scenarios exposing severe generalization deficits. These findings underscore the limitations of current LLMs in handling personalized, evolving real-world contexts and highlight the urgent need for advancements in tool-learning frameworks. FamilyTool serves as a critical resource for evaluating and advancing LLM agents' reasoning, adaptability, and scalability in complex, dynamic environments. Code and dataset are available on GitHub.
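
To make "queries spanning 1 to 3 relational hops" concrete, here is a minimal sketch of resolving a hop chain over a toy knowledge graph. The graph contents, relation names, and `resolve_hops` helper are hypothetical and are not taken from the FamilyTool dataset or the KGETool pipeline.

```python
def resolve_hops(kg, start, relations):
    """Follow a chain of relations through a (subject, relation) -> object map.

    Returns the final entity, or None if any hop is missing from the graph.
    """
    entity = start
    for relation in relations:
        key = (entity, relation)
        if key not in kg:
            return None
        entity = kg[key]
    return entity


# Hypothetical family graph; names and relations are made up for illustration.
family_kg = {
    ("alice", "mother"): "carol",
    ("carol", "spouse"): "dave",
    ("dave", "favorite_cuisine"): "thai",
}

# "What cuisine does Alice's mother's spouse like?" takes three hops.
three_hop = resolve_hops(family_kg, "alice", ["mother", "spouse", "favorite_cuisine"])
one_hop = resolve_hops(family_kg, "alice", ["mother"])
missing = resolve_hops(family_kg, "alice", ["father"])
```

A tool call that needs the cuisine as an argument can only be filled in correctly once every intermediate hop resolves, which is why accuracy is sensitive to hop count.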

SVG-IR: Spatially-Varying Gaussian Splatting for Inverse Rendering

Hanxiao Sun,YuPeng Gao,Jin Xie,Jian Yang,Beibei Wang

Task: Propose SVG-IR, a new framework that improves novel view synthesis and relighting quality in inverse rendering based on 3D Gaussian Splatting.

Motivation: Existing methods yield poor relighting quality and artifacts because of the limitations of Gaussian splatting (constant material parameters and normal per Gaussian) and the lack of physical constraints on indirect lighting.

Details

Method: Propose a Spatially-varying Gaussian (SVG) representation and an SVG splatting scheme, combined with a physically-based indirect lighting model. Result: SVG-IR outperforms existing NeRF-based methods by 2.5 dB in PSNR and Gaussian-based methods by 3.5 dB in relighting tasks, while maintaining real-time rendering speed. Conclusion: Through its enhanced representation and physical constraints, SVG-IR significantly improves the quality and efficiency of inverse rendering. Abstract: Reconstructing 3D assets from images, known as inverse rendering (IR), remains a challenging task due to its ill-posed nature. 3D Gaussian Splatting (3DGS) has demonstrated impressive capabilities for novel view synthesis (NVS) tasks. Existing methods apply it to relighting by separating radiance into BRDF parameters and lighting, yet produce inferior relighting quality with artifacts and unnatural indirect illumination due to the limited capability of each Gaussian, which has constant material parameters and normal, alongside the absence of physical constraints for indirect lighting. In this paper, we present a novel framework called Spatially-varying Gaussian Inverse Rendering (SVG-IR), aimed at enhancing both NVS and relighting quality. To this end, we propose a new representation, Spatially-varying Gaussian (SVG), that allows per-Gaussian spatially varying parameters. This enhanced representation is complemented by an SVG splatting scheme akin to vertex/fragment shading in traditional graphics pipelines. Furthermore, we integrate a physically-based indirect lighting model, enabling more realistic relighting. The proposed SVG-IR framework significantly improves rendering quality, outperforming state-of-the-art NeRF-based methods by 2.5 dB in peak signal-to-noise ratio (PSNR) and surpassing existing Gaussian-based techniques by 3.5 dB in relighting tasks, all while maintaining a real-time rendering speed.

Adaptive Computation Pruning for the Forgetting Transformer

Zhixuan Lin,Johan Obando-Ceron,Xu Owen He,Aaron Courville

Task: Propose Adaptive Computation Pruning (ACP), which dynamically prunes computations in the Forgetting Transformer (FoX) that involve input-output dependencies decayed by the forget gate.

Motivation: Many attention heads in FoX forget quickly, so their output depends mainly on local context, which motivates dynamic pruning to reduce computation.

Details

Method: Use a dynamically set pruning threshold to prune attention weights that are strongly decayed by the forget gate, ensuring the pruned weights are negligible. Result: In language model pretraining, ACP reduces softmax attention FLOPs by around 70% and improves training throughput by 10% to 35% without degrading performance. Conclusion: ACP reduces computation and improves efficiency, especially in long-context settings, without affecting model performance. Abstract: The recently proposed Forgetting Transformer (FoX) incorporates a forget gate into softmax attention and has shown consistently better or on-par performance compared to the standard RoPE-based Transformer. Notably, many attention heads in FoX tend to forget quickly, causing their output at each timestep to rely primarily on the local context. Based on this observation, we propose Adaptive Computation Pruning (ACP) for FoX, a method that dynamically prunes computations involving input-output dependencies that are strongly decayed by the forget gate. This is achieved using a dynamically set pruning threshold that ensures that the pruned attention weights remain negligible. We apply ACP to language model pretraining with FoX and show it consistently reduces the number of FLOPs in softmax attention by around 70% across different model sizes and context lengths, resulting in a roughly 10% to 35% improvement in training throughput. Furthermore, longer context lengths yield greater computational savings. All these speed improvements are achieved without any performance degradation. We also perform several analyses to provide deeper insights into our method, such as examining the pruning patterns and analyzing the distribution of FLOP savings across different attention heads. Our code is available at https://github.com/zhixuan-lin/arctic-fox.
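
The pruning idea can be sketched under two loudly stated assumptions: that the decay applied to a query-key pair is the sum of the log forget-gate values between the two positions, and that a fixed `threshold` stands in for the dynamically set one the abstract describes. Function names and gate values are hypothetical; this is not the released implementation.

```python
import math


def decay_bias(forget_gates):
    """D[i][j] = sum of log f_l for j < l <= i (causal, so only j <= i).

    Assumed form of the forget-gate decay between key j and query i.
    """
    n = len(forget_gates)
    prefix = [0.0]
    for f in forget_gates:
        prefix.append(prefix[-1] + math.log(f))
    return [[prefix[i + 1] - prefix[j + 1] for j in range(i + 1)] for i in range(n)]


def keep_mask(forget_gates, threshold):
    """True where the (query i, key j) pair is computed, False where pruned."""
    bias = decay_bias(forget_gates)
    return [[d >= threshold for d in row] for row in bias]


def pruned_fraction(mask):
    """Share of causal query-key pairs that are skipped."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return 1.0 - kept / total


# Hypothetical head that halves its memory every step; the fixed threshold
# log(0.2) is illustrative, not the paper's dynamically set value.
mask = keep_mask([0.5] * 6, threshold=math.log(0.2))
fraction = pruned_fraction(mask)
```

With these toy gates only keys within two positions of the query survive, which mirrors the observation that fast-forgetting heads rely mostly on local context; longer sequences would leave a larger share of pairs pruned.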

IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments

Can Zhang,Gim Hee Lee

Task: Propose IAAO, a new framework that builds an explicit 3D model through interaction so that intelligent agents can understand articulated objects in their environment.

Motivation: Existing methods rely on task-specific networks and assumptions about movable parts, which limits generalization.

Details

Method: Use large foundation models to estimate interactive affordances and part articulations in three stages: 1) build hierarchical feature and label fields with 3D Gaussian Splatting; 2) query the 3D Gaussian primitives to identify static and articulated elements; 3) merge and refine scenes from different states. Result: Experiments demonstrate the effectiveness of the method. Conclusion: IAAO enables robust affordance-based interaction and object manipulation. Abstract: This work presents IAAO, a novel framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction. Unlike prior methods that rely on task-specific networks and assumptions about movable parts, our IAAO leverages large foundation models to estimate interactive affordances and part articulations in three stages. We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images. We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances. Finally, scenes from different states are merged and refined based on the estimated transformations, enabling robust affordance-based interaction and manipulation of objects. Experimental results demonstrate the effectiveness of our method.

RNN-Transducer-based Losses for Speech Recognition on Noisy Targets

Vladimir Bataev

Task: Propose new loss functions to reduce the impact of transcription errors in RNN-Transducer models.

Motivation: In industrial speech recognition systems, datasets are enormous and accurate transcription of every instance is hard to ensure, so training on noisy transcripts is a significant challenge.

Details

Method: Introduce three loss functions: Star-Transducer (for deletion errors), Bypass-Transducer (for insertion errors), and Target-Robust Transducer (for arbitrary errors). Result: Star-Transducer restores over 90% of performance, Bypass-Transducer recovers more than 60% of quality, and Target-Robust Transducer restores over 70% of quality. Conclusion: Target-Robust Transducer significantly improves RNN-T performance on noisy data. Abstract: Training speech recognition systems on noisy transcripts is a significant challenge in industrial pipelines, where datasets are enormous and ensuring accurate transcription for every instance is difficult. In this work, we introduce novel loss functions to mitigate the impact of transcription errors in RNN-Transducer models. Our Star-Transducer loss addresses deletion errors by incorporating "skip frame" transitions in the loss lattice, restoring over 90% of the system's performance compared to models trained with accurate transcripts. The Bypass-Transducer loss uses "skip token" transitions to tackle insertion errors, recovering more than 60% of the quality. Finally, the Target-Robust Transducer loss merges these approaches, offering robust performance against arbitrary errors. Experimental results demonstrate that the Target-Robust Transducer loss significantly improves RNN-T performance on noisy data by restoring over 70% of the quality compared to well-transcribed data.

LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding

Ziyi Wang,Haoran Wu,Yiming Rong,Deyang Jiang,Yixin Zhang,Yunlong Zhao,Shuang Xu,Bo XU

Task: Propose Lightweight Video Compression (LVC) to address the sparse sampling problem of vision-language models (VLMs) in long video understanding.

Motivation: Existing VLMs lose information because of sparse sampling strategies, while Video Large Language Models (Video-LLMs) are limited by the scarcity of high-quality video-text datasets.

Details

Method: Propose LVC, which uses a Query-Attention Video Compression mechanism and requires training only the alignment layer on a small amount of data (10k short video-text pairs). Result: LVC significantly improves the temporal reasoning of VLMs, reaching 68.2 and 65.9 on the MLVU and Video-MME benchmarks, relative improvements of 14.6% and 7.7%. Conclusion: LVC is an efficient, low-cost way to substantially strengthen long video understanding in VLMs; the models and code will be released. Abstract: Long video understanding is a complex task that requires both spatial detail and temporal awareness. While Vision-Language Models (VLMs) obtain frame-level understanding capabilities through multi-frame input, they suffer from information loss due to the sparse sampling strategy. In contrast, Video Large Language Models (Video-LLMs) capture temporal relationships within visual features but are limited by the scarcity of high-quality video-text datasets. To transfer long video understanding capabilities to VLMs with minimal data and computational cost, we propose Lightweight Video Compression (LVC), a novel method featuring the Query-Attention Video Compression mechanism, which effectively tackles the sparse sampling problem in VLMs. By training only the alignment layer with 10k short video-text pairs, LVC significantly enhances the temporal reasoning abilities of VLMs. Extensive experiments show that LVC provides consistent performance improvements across various models, including the InternVL2 series and Phi-3.5-Vision. Notably, the InternVL2-40B-LVC achieves scores of 68.2 and 65.9 on the long video understanding benchmarks MLVU and Video-MME, respectively, with relative improvements of 14.6% and 7.7%. The enhanced models and code will be publicly available soon.

A Unified Agentic Framework for Evaluating Conditional Image Generation

Jifang Wang,Xue Yang,Longyue Wang,Zhenran Xu,Yiyu Wang,Yaowei Wang,Weihua Luo,Kaifu Zhang,Baotian Hu,Min Zhang

Task: Propose CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks.

Motivation: Address the lack of task-agnostic, reliable, and explainable evaluation metrics in conditional image generation.

Details

Method: Use large multimodal models (LMMs) as the core, integrate a multi-functional toolbox, establish a fine-grained evaluation framework, and synthesize evaluation trajectories to fine-tune smaller LMMs. Result: Across seven tasks, CIGEval (GPT-4o version) reaches a correlation of 0.4625 with human assessments, close to the inter-annotator correlation of 0.47; with 7B open-source LMMs and only 2.3K training trajectories, it surpasses the previous GPT-4o-based state-of-the-art method. Conclusion: CIGEval excels at identifying subtle issues in image generation tasks, showing potential for human-level reliability in automated evaluation. Abstract: Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks. CIGEval utilizes large multimodal models (LMMs) as its core, integrating a multi-functional toolbox and establishing a fine-grained evaluation framework. Additionally, we synthesize evaluation trajectories for fine-tuning, empowering smaller LMMs to autonomously select appropriate tools and conduct nuanced analyses based on tool outputs. Experiments across seven prominent conditional image generation tasks demonstrate that CIGEval (GPT-4o version) achieves a high correlation of 0.4625 with human assessments, closely matching the inter-annotator correlation of 0.47. Moreover, when implemented with 7B open-source LMMs using only 2.3K training trajectories, CIGEval surpasses the previous GPT-4o-based state-of-the-art method. Case studies on GPT-4o image generation highlight CIGEval's capability in identifying subtle issues related to subject consistency and adherence to control guidance, indicating its great potential for automating evaluation of image generation tasks with human-level reliability.

Jakub Maciej Wiśniewski,Anders Nymark Christensen,Mary Le Ngo,Martin Grønnebæk Tolsgaard,Chun Kit Wong

Task: Develop an automated pipeline for predicting fetal orientation from ultrasound videos acquired with a simple blind sweep protocol.

Motivation: The cognitive demands of fetal ultrasound examinations pose unique challenges for clinicians, calling for assistive tools to ease the burden.

Details

Method: Using a pre-trained head detection and segmentation model, first determine the fetal presentation (cephalic or breech) by template matching, then predict the fetal lie (facing left or right) by analyzing the spatial distribution of segmented brain anatomies. Result: Evaluation on a dataset of third-trimester ultrasound scans shows promising accuracy. Conclusion: By introducing automated fetal lie prediction and an assistive paradigm, the work augments sonographer expertise rather than replacing it. Future research will focus on acquisition efficiency and real-time clinical integration. Abstract: Cognitive demands of fetal ultrasound examinations pose unique challenges for clinicians. With the goal of providing an assistive tool, we developed an automated pipeline for predicting fetal orientation from ultrasound videos acquired following a simple blind sweep protocol. Leveraging a pre-trained head detection and segmentation model, this is achieved by first determining the fetal presentation (cephalic or breech) with a template matching approach, followed by the fetal lie (facing left or right) by analyzing the spatial distribution of segmented brain anatomies. Evaluation on a dataset of third-trimester ultrasound scans demonstrated the promising accuracy of our pipeline. This work distinguishes itself by introducing automated fetal lie prediction and by proposing an assistive paradigm that augments sonographer expertise rather than replacing it. Future research will focus on enhancing acquisition efficiency and exploring real-time clinical integration to improve workflow and support for obstetric clinicians.

SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

Boyuan Zheng,Michael Y. Fatemi,Xiaolong Jin,Zora Zhiruo Wang,Apurva Gandhi,Yueqi Song,Yu Gu,Jayanth Srinivasa,Gaowen Liu,Graham Neubig,Yu Su

Task: Propose SkillWeaver, a framework that lets autonomous web agents self-improve by synthesizing reusable skills as APIs.

Motivation: Current autonomous web agents lack key self-improvement capabilities such as procedural knowledge abstraction, skill refinement, and skill composition.

Details

Method: SkillWeaver autonomously discovers skills, practices them, and distills the experience into lightweight APIs, continually expanding a skill library. Result: On WebArena and real-world websites, SkillWeaver achieves relative success rate improvements of 31.8% and 39.8%, respectively, and sharing APIs improves weaker agents by up to 54.3%. Conclusion: By distilling diverse website interactions into shareable APIs, SkillWeaver significantly improves the capability and collaboration efficiency of web agents. Abstract: To survive and thrive in complex environments, humans have evolved sophisticated self-improvement mechanisms through environment exploration, hierarchical abstraction of experiences into reusable skills, and collaborative construction of an ever-growing skill repertoire. Despite recent advancements, autonomous web agents still lack crucial self-improvement capabilities, struggling with procedural knowledge abstraction, skill refinement, and skill composition. In this work, we introduce SkillWeaver, a skill-centric framework enabling agents to self-improve by autonomously synthesizing reusable skills as APIs. Given a new website, the agent autonomously discovers skills, executes them for practice, and distills practice experiences into robust APIs. Iterative exploration continually expands a library of lightweight, plug-and-play APIs, significantly enhancing the agent's capabilities. Experiments on WebArena and real-world websites demonstrate the efficacy of SkillWeaver, achieving relative success rate improvements of 31.8% and 39.8%, respectively. Additionally, APIs synthesized by strong agents substantially enhance weaker agents through transferable skills, yielding improvements of up to 54.3% on WebArena. These results demonstrate the effectiveness of honing diverse website interactions into APIs, which can be seamlessly shared among various web agents.

ZIP: An Efficient Zeroth-order Prompt Tuning for Black-box Vision-Language Models

Seonghwan Park,Jaehyeon Jeong,Yongjun Kim,Jaeho Lee,Namhoon Lee

Task: Propose ZIP, an efficient black-box prompt-tuning method that reduces the number of queries and improves performance.

Motivation: Existing black-box prompt-tuning methods require too many queries, which limits practical use.

Details

Method: Re-parameterize prompts in low-rank representations and design intrinsic-dimensional clipping of the estimated gradients. Result: On 13+ vision-language tasks, ZIP improves average few-shot accuracy by about 6% and query efficiency by 48%. Conclusion: ZIP is an efficient and robust black-box prompt-tuning method whose clipping threshold needs no manual selection. Abstract: Recent studies have introduced various approaches for prompt-tuning black-box vision-language models, referred to as black-box prompt-tuning (BBPT). While BBPT has demonstrated considerable potential, it is often found that many existing methods require an excessive number of queries (i.e., function evaluations), which poses a significant challenge in real-world scenarios where the number of allowed queries is limited. To tackle this issue, we propose Zeroth-order Intrinsic-dimensional Prompt-tuning (ZIP), a novel approach that enables efficient and robust prompt optimization in a purely black-box setting. The key idea of ZIP is to reduce the problem dimensionality and the variance of zeroth-order gradient estimates, such that the training is done fast with far fewer queries. We achieve this by re-parameterizing prompts in low-rank representations and designing intrinsic-dimensional clipping of estimated gradients. We evaluate ZIP on 13+ vision-language tasks in standard benchmarks and show that it achieves an average improvement of approximately 6% in few-shot accuracy and 48% in query efficiency compared to the best-performing alternative BBPT methods, establishing a new state of the art. Our ablation analysis further shows that the proposed clipping mechanism is robust and nearly optimal, without the need to manually select the clipping threshold, matching the result of expensive hyperparameter search.
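
A generic zeroth-order update with norm clipping over a low-dimensional vector can be sketched as below. This is not ZIP itself: the two-point estimator, the clipping bound `sqrt(dim)`, the step size, and the quadratic `toy_loss` standing in for a black-box model query are all assumptions made for illustration, and the mapping from the low-dimensional vector to a full prompt is omitted.

```python
import math
import random


def zo_gradient(loss, z, rng, mu=1e-3):
    """Two-point zeroth-order gradient estimate using only loss queries."""
    u = [rng.gauss(0.0, 1.0) for _ in z]
    plus = loss([zi + mu * ui for zi, ui in zip(z, u)])
    minus = loss([zi - mu * ui for zi, ui in zip(z, u)])
    scale = (plus - minus) / (2.0 * mu)
    return [scale * ui for ui in u]


def clip_norm(g, max_norm):
    """Rescale g so its Euclidean norm does not exceed max_norm."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm <= max_norm or norm == 0.0:
        return g
    return [v * max_norm / norm for v in g]


def tune(loss, dim, steps=300, lr=0.05, seed=0):
    """Gradient-free descent on a dim-sized vector (two queries per step)."""
    rng = random.Random(seed)
    z = [0.0] * dim
    for _ in range(steps):
        # Clipping bound tied to the dimension: an illustrative assumption.
        g = clip_norm(zo_gradient(loss, z, rng), math.sqrt(dim))
        z = [zi - lr * gi for zi, gi in zip(z, g)]
    return z


def toy_loss(z):
    """Hypothetical stand-in for a black-box model query."""
    target = [1.0, -2.0, 0.5, 0.0]
    return sum((a - b) ** 2 for a, b in zip(z, target))


initial = toy_loss([0.0] * 4)
final = toy_loss(tune(toy_loss, dim=4))
```

Keeping the optimized vector small matters because the variance of such estimates grows with dimension, and each step costs two queries regardless of size.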

A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility

Andreas Hochlehnert,Hardik Bhatnagar,Vishaal Udandarao,Samuel Albanie,Ameya Prabhu,Matthias Bethge

Task: Propose a standardized evaluation framework to address transparency and robustness problems in current mathematical reasoning benchmarks.

Motivation: Current evaluations of language models on reasoning tasks lack methodological rigor; many benchmarks are highly sensitive to implementation details, so reported performance gains may be unreliable.

Details

Method: Through a comprehensive empirical study, analyze the sensitivity of benchmarks to implementation choices, and propose a standardized evaluation framework with best practices. Result: Reinforcement learning (RL) approaches yield only modest improvements and are prone to overfitting, while supervised finetuning (SFT) methods generalize more strongly. Conclusion: Releasing code, prompts, and model outputs establishes a more rigorous foundation for future reasoning research. Abstract: Reasoning has emerged as the next major frontier for language models (LMs), with rapid advances from both academic and industrial labs. However, this progress often outpaces methodological rigor, with many evaluations relying on benchmarking practices that lack transparency, robustness, or statistical grounding. In this work, we conduct a comprehensive empirical study and find that current mathematical reasoning benchmarks are highly sensitive to subtle implementation choices - including decoding parameters, random seeds, prompt formatting, and even hardware and software-framework configurations. Performance gains reported in recent studies frequently hinge on unclear comparisons or unreported sources of variance. To address these issues, we propose a standardized evaluation framework with clearly defined best practices and reporting standards. Using this framework, we reassess recent methods and find that reinforcement learning (RL) approaches yield only modest improvements - far below prior claims - and are prone to overfitting, especially on small-scale benchmarks like AIME24. In contrast, supervised finetuning (SFT) methods show consistently stronger generalization. To foster reproducibility, we release all code, prompts, and model outputs for reasoning benchmarks, establishing more rigorous foundations for future work.
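
One way to account for seed sensitivity is to report a mean and standard deviation over several seeded runs and compare any claimed gain against that spread. The sketch below uses made-up accuracies and a deliberately crude comparison rule; neither is taken from the paper's framework.

```python
import statistics


def summarize_runs(accuracies):
    """Mean and sample standard deviation over runs with different seeds."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def gain_exceeds_noise(baseline_runs, method_runs):
    """Crude check: is the mean gain larger than the combined seed spread?

    Illustrative heuristic only, not a substitute for a proper significance test.
    """
    base_mean, base_std = summarize_runs(baseline_runs)
    method_mean, method_std = summarize_runs(method_runs)
    return (method_mean - base_mean) > (base_std + method_std)


# Hypothetical accuracies (%) from five seeds on a small benchmark.
baseline = [40.0, 46.7, 43.3, 36.7, 43.3]
method = [43.3, 46.7, 40.0, 46.7, 43.3]
base_mean, base_std = summarize_runs(baseline)
is_real_gain = gain_exceeds_noise(baseline, method)
```

On a benchmark with only a few dozen questions, a single extra correct answer shifts accuracy by several points, so single-seed comparisons can easily mistake noise for progress.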

Classifying the Unknown: In-Context Learning for Open-Vocabulary Text and Symbol Recognition

Tom Simon,William Mocaer,Pierrick Tranouez,Clement Chatelain,Thierry Paquet

Task: Use Multimodal In-Context Learning (MICL) to classify sequences of novel script patterns in documents without explicit retraining.

Motivation: Classify novel script patterns efficiently from minimal examples, broadening the model's range of application.

Details

Method: Design a dataset generation process that strengthens in-context learning, and use a Context-Aware Tokenizer (CAT) for open-vocabulary classification. Result: Experiments show that Rosetta can classify visual patterns beyond its training scope as well as diverse alphabets and scripts. Conclusion: Rosetta shows potential for classifying multilingual and novel scripts, extending the model's practical applications. Abstract: We introduce Rosetta, a multimodal model that leverages Multimodal In-Context Learning (MICL) to classify sequences of novel script patterns in documents by leveraging minimal examples, thus eliminating the need for explicit retraining. To enhance contextual learning, we designed a dataset generation process that ensures varying degrees of contextual informativeness, improving the model's adaptability in leveraging context across different scenarios. A key strength of our method is the use of a Context-Aware Tokenizer (CAT), which enables open-vocabulary classification. This allows the model to classify text and symbol patterns across an unlimited range of classes, extending its classification capabilities beyond the scope of its training alphabet of patterns. As a result, it unlocks applications such as the recognition of new alphabets and languages. Experiments on synthetic datasets demonstrate the potential of Rosetta to successfully classify Out-Of-Distribution visual patterns and diverse sets of alphabets and scripts, including but not limited to Chinese, Greek, Russian, French, Spanish, and Japanese.

OmniCaptioner: One Captioner to Rule Them All

Yiting Lu,Jiakang Yuan,Zhen Li,Shitian Zhao,Qi Qin,Xinyue Li,Le Zhuo,Licheng Wen,Dongyang Liu,Yuewen Cao,Xiangchao Yan,Xin Li,Botian Shi,Tao Chen,Zhibo Chen,Lei Bai,Bo Zhang,Peng Gao

Task: Propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains.

Motivation: Overcome the restriction of existing methods to specific image types (e.g., natural images or geometric visuals) by providing a unified solution.

Details

Method: Convert low-level pixel information into semantically rich textual representations, bridging the gap between visual and textual modalities. Result: Three key advantages are shown: enhanced visual reasoning, improved image generation, and efficient supervised fine-tuning. Conclusion: The versatility and adaptability of OmniCaptioner offer a new perspective on bridging language and visual modalities. Abstract: We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.

CasTex: Cascaded Text-to-Texture Synthesis via Explicit Texture Maps and Physically-Based Shading

Mishan Aliev,Dmitry Baranchuk,Kirill Struminsky

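Task: Investigate diffusion-based text-to-texture synthesis for generating physically-based texture maps.

Motivation: Achieve realistic model appearance under varying lighting conditions.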

Details

Method: An approach using cascaded diffusion models (CasTex) is proposed, replacing the conventional implicit texture parameterization with an explicit parameterization to improve results.

Result: Experiments show that the method significantly outperforms existing optimization-based solutions on public texture synthesis benchmarks.

Conclusion: CasTex directly produces high-quality textures without additional implicit texture parameterization and outperforms existing methods.

Abstract: This work investigates text-to-texture synthesis using diffusion models to generate physically-based texture maps. We aim to achieve realistic model appearances under varying lighting conditions. A prominent solution for the task is score distillation sampling. It allows recovering a complex texture using gradient guidance given a differentiable rasterization and shading pipeline. However, in practice, the aforementioned solution in conjunction with the widespread latent diffusion models produces severe visual artifacts and requires additional regularization such as implicit texture parameterization. As a more direct alternative, we propose an approach using cascaded diffusion models for texture synthesis (CasTex). In our setup, score distillation sampling yields high-quality textures out of the box. In particular, we were able to omit implicit texture parameterization in favor of an explicit parameterization to improve the procedure. In the experiments, we show that our approach significantly outperforms state-of-the-art optimization-based solutions on public texture synthesis benchmarks.

Sculpting Subspaces: Constrained Full Fine-Tuning in LLMs for Continual Learning

Nikhil Shivakumar Nayak,Krishnateja Killamsetty,Ligong Han,Abhishek Bhandwaldar,Prateek Chanda,Kai Xu,Hao Wang,Aldo Pareja,Oleg Silkin,Mustafa Eyceoz,Akash Srivastava

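Task: Propose a continual full fine-tuning method based on adaptive singular value decomposition (SVD) to address catastrophic forgetting in continual learning for large language models (LLMs).

Motivation: Existing methods typically rely on low-rank, parameter-efficient updates, which limit the model's expressivity and introduce additional per-task parameters, leading to scalability issues.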

Details

Method: Task-specific low-rank parameter subspaces are identified dynamically, and updates are constrained to be orthogonal to critical directions associated with prior tasks, minimizing interference.

Result: On standard continual learning benchmarks, the method achieves state-of-the-art performance, with up to 7% higher average accuracy than baselines such as O-LoRA, and substantially reduces forgetting.

Conclusion: The adaptive SVD framework balances model plasticity and knowledge retention, providing a practical and computationally scalable solution for continual learning in LLMs.

Abstract: Continual learning in large language models (LLMs) is prone to catastrophic forgetting, where adapting to new tasks significantly degrades performance on previously learned ones. Existing methods typically rely on low-rank, parameter-efficient updates that limit the model's expressivity and introduce additional parameters per task, leading to scalability issues. To address these limitations, we propose a novel continual full fine-tuning approach leveraging adaptive singular value decomposition (SVD). Our method dynamically identifies task-specific low-rank parameter subspaces and constrains updates to be orthogonal to critical directions associated with prior tasks, thus effectively minimizing interference without additional parameter overhead or storing previous task gradients. We evaluate our approach extensively on standard continual learning benchmarks using both encoder-decoder (T5-Large) and decoder-only (LLaMA-2 7B) models, spanning diverse tasks including classification, generation, and reasoning. Empirically, our method achieves state-of-the-art results, up to 7% higher average accuracy than recent baselines like O-LoRA, and notably maintains the model's general linguistic capabilities, instruction-following accuracy, and safety throughout the continual learning process by reducing forgetting to near-negligible levels. Our adaptive SVD framework effectively balances model plasticity and knowledge retention, providing a practical, theoretically grounded, and computationally scalable solution for continual learning scenarios in large language models.
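
The orthogonality constraint described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the critical directions of prior tasks are already available as orthonormal vectors (for example, leading singular vectors), and it works on a flat parameter vector rather than on weight matrices. The names `project_out`, `critical`, and `raw_update` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(update, directions):
    """Remove from `update` its component along each direction.

    Assumes `directions` are orthonormal, so the components can be
    subtracted independently of one another.
    """
    result = list(update)
    for d in directions:
        coeff = dot(update, d)
        result = [r - coeff * di for r, di in zip(result, d)]
    return result


# Hypothetical critical directions kept from earlier tasks (orthonormal).
critical = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Hypothetical raw update proposed while training on a new task.
raw_update = [0.5, -2.0, 3.0]

safe_update = project_out(raw_update, critical)
print(safe_update)
```

After projection, the update has no component along the protected directions, which is the sense in which it cannot interfere with them to first order.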

EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation

Diljeet Jagpal,Xi Chen,Vinay P. Namboodiri

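Task: Propose a zero-shot, training-free, image-based text-to-video generation method that generates videos using existing image diffusion models.

Motivation: Current methods require architecture-specific changes to image generation models, limiting their adaptability and scalability.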

Details

Method: A model-agnostic approach uses intersections in diffusion trajectories together with a grid-based scheme; in-context trained LLMs generate coherent frame-wise prompts and identify differences between frames, and a CLIP-based attention mask controls when prompts are switched.

Result: The approach achieves state-of-the-art performance with greater flexibility and better temporal consistency, visual fidelity, and user satisfaction.

Conclusion: The work offers a new approach to training-free, image-based text-to-video generation that balances coherence and diversity.

Abstract: Zero-shot, training-free, image-based text-to-video generation is an emerging area that aims to generate videos using existing image-based diffusion models. Current methods in this space require specific architectural changes to image generation models, which limit their adaptability and scalability. In contrast to such methods, we provide a model-agnostic approach. We use intersections in diffusion trajectories, working only with the latent values. We could not obtain localized frame-wise coherence and diversity using only the intersection of trajectories. Thus, we instead use a grid-based approach. An in-context trained LLM is used to generate coherent frame-wise prompts; another is used to identify differences between frames. Based on these, we obtain a CLIP-based attention mask that controls the timing of switching the prompts for each grid cell. Earlier switching results in higher variance, while later switching results in more coherence. Therefore, our approach can ensure appropriate control between coherence and variance for the frames. Our approach results in state-of-the-art performance while being more flexible when working with diverse image-generation models. The empirical analysis using quantitative metrics and user studies confirms our model's superior temporal consistency, visual fidelity and user satisfaction, thus providing a novel way to obtain training-free, image-based text-to-video generation.

MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking

Chang Nie,Yiqing Xu,Guangming Wang,Zhe Liu,Yanzi Miao,Hesheng Wang

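Task: Propose MovSAM, a framework for single-image moving object segmentation.

Motivation: Existing methods struggle to segment moving objects from a single image because temporal cues are absent, yet single-image segmentation matters for applications such as motion intention prediction and handling camera frame drops.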

Details

Method: A Multimodal Large Language Model (MLLM) with Chain-of-Thought (CoT) prompting generates text prompts, which are cross-fused with visual features from the Segment Anything Model (SAM) and a Vision-Language Model (VLM) for logic-driven moving object segmentation; a deep thinking loop iteratively refines the segmentation results.

Result: MovSAM reaches 92.5% J&F on public MOS benchmarks, outperforming multi-frame methods.

Conclusion: Through scene understanding and logical reasoning, MovSAM achieves single-image moving object segmentation, and its effectiveness is validated in real-world applications such as autonomous driving.

Abstract: Moving object segmentation plays a vital role in understanding dynamic visual environments. While existing methods rely on multi-frame image sequences to identify moving objects, single-image MOS is critical for applications like motion intention prediction and handling camera frame drops. However, segmenting moving objects from a single image remains challenging for existing methods due to the absence of temporal cues. To address this gap, we propose MovSAM, the first framework for single-image moving object segmentation. MovSAM leverages a Multimodal Large Language Model (MLLM) enhanced with Chain-of-Thought (CoT) prompting to search for the moving object and generate text prompts based on deep thinking for segmentation. These prompts are cross-fused with visual features from the Segment Anything Model (SAM) and a Vision-Language Model (VLM), enabling logic-driven moving object segmentation. The segmentation results then undergo a deep thinking refinement loop, allowing MovSAM to iteratively improve its understanding of the scene context and inter-object relationships with logical reasoning. This innovative approach enables MovSAM to segment moving objects in single images by considering scene understanding. We implement MovSAM in the real world to validate its practical application and effectiveness for autonomous driving scenarios where the multi-frame methods fail. Furthermore, despite the inherent advantage of multi-frame methods in utilizing temporal information, MovSAM achieves state-of-the-art performance across public MOS benchmarks, reaching 92.5% on J&F. Our implementation will be available at https://github.com/IRMVLab/MovSAM.

Compound and Parallel Modes of Tropical Convolutional Neural Networks

Mingbo Li,Liying Liu,Ye Luo

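Task: Propose two new tropical convolutional neural network variants (cTCNN and pTCNN) that replace traditional convolution kernels, reducing multiplications while balancing efficiency and performance.

Motivation: Conventional convolutional neural networks (CNNs) are computationally expensive, while tropical CNNs (TCNNs) reduce multiplications but underperform, so improvements are needed.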

Details

Method: cTCNN and pTCNN are designed by combining tropical min-plus and max-plus kernels and are evaluated on various datasets.

Result: cTCNN and pTCNN match or exceed the performance of other CNN methods, and combining them with conventional CNNs in deeper architectures further improves performance.

Conclusion: cTCNN and pTCNN are efficient and effective models; future work will explore simplified architectures that reduce parameters and multiplications.

Abstract: Convolutional neural networks have become increasingly deep and complex, leading to higher computational costs. While tropical convolutional neural networks (TCNNs) reduce multiplications, they underperform compared to standard CNNs. To address this, we propose two new variants, compound TCNN (cTCNN) and parallel TCNN (pTCNN), that use combinations of tropical min-plus and max-plus kernels to replace traditional convolution kernels. This reduces multiplications and balances efficiency with performance. Experiments on various datasets show that cTCNN and pTCNN match or exceed the performance of other CNN methods. Combining these with conventional CNNs in deeper architectures also improves performance. We are further exploring simplified TCNN architectures that reduce parameters and multiplications with minimal accuracy loss, aiming for efficient and effective models.
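
To make the min-plus and max-plus kernels concrete, the sketch below shows the two tropical operations on a 1D signal: the usual multiply-and-sum of a convolution window is replaced by add-and-min or add-and-max, so no multiplications are needed. This only illustrates the primitive operations; how cTCNN and pTCNN combine them is not described in the abstract, and the function name `tropical_conv1d` is hypothetical.

```python
def tropical_conv1d(signal, kernel, reduce_fn):
    """Slide `kernel` over `signal`; each window is reduced with
    `reduce_fn` (min or max) over the elementwise sums."""
    k = len(kernel)
    return [
        reduce_fn(signal[i + j] + kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


x = [1, 4, 2, 7]
w = [0, 1]

min_plus = tropical_conv1d(x, w, min)  # min-plus kernel
max_plus = tropical_conv1d(x, w, max)  # max-plus kernel
print(min_plus)  # [1, 3, 2]
print(max_plus)  # [5, 4, 8]
```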

ColorizeDiffusion v2: Enhancing Reference-based Sketch Colorization Through Separating Utilities

Dingkun Yan,Xinrui Wang,Yusuke Iwasawa,Yutaka Matsuo,Suguru Saito,Jiaxian Guo

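Task: Propose a reference-based sketch colorization method that addresses overfitting caused by the mismatch between training and inference data distributions.

Motivation: Existing methods are trained on semantically and spatially aligned image triplets, whereas real-world references and sketches are often substantially misaligned, which degrades colorization quality.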

Details

Method: Based on an analysis of the carrier (the latent representation that transfers information from reference to sketch), a workflow that dynamically adapts the carrier is proposed, comprising a split cross-attention mechanism, dedicated encoders, and preprocessing steps.

Result: Experiments show that the method outperforms existing approaches in qualitative and quantitative evaluations, and a user study further confirms its advantage.

Conclusion: By optimizing the carrier and introducing new mechanisms, the method substantially improves the quality and applicability of sketch colorization.

Abstract: Reference-based sketch colorization methods have garnered significant attention due to their potential applications in the animation production industry. However, most existing methods are trained with image triplets of sketch, reference, and ground truth that are semantically and spatially well-aligned, while real-world references and sketches often exhibit substantial misalignment. This mismatch in data distribution between training and inference leads to overfitting, consequently resulting in spatial artifacts and significant degradation in overall colorization quality, limiting potential applications of current methods for general purposes. To address this limitation, we conduct an in-depth analysis of the carrier, defined as the latent representation facilitating information transfer from reference to sketch. Based on this analysis, we propose a novel workflow that dynamically adapts the carrier to optimize distinct aspects of colorization. Specifically, for spatially misaligned artifacts, we introduce a split cross-attention mechanism with spatial masks, enabling region-specific reference injection within the diffusion process. To mitigate semantic neglect of sketches, we employ dedicated background and style encoders to transfer detailed reference information in the latent feature space, achieving enhanced spatial control and richer detail synthesis. Furthermore, we propose character-mask merging and background bleaching as preprocessing steps to improve foreground-background integration and background generation. Extensive qualitative and quantitative evaluations, including a user study, demonstrate the superior performance of our proposed method compared to existing approaches. An ablation study further validates the efficacy of each proposed component.

MedSegFactory: Text-Guided Generation of Medical Image-Mask Pairs

Jiawei Mao,Yuhan Wang,Yucheng Tang,Daguang Xu,Kang Wang,Yang Yang,Zongwei Zhou,Yuyin Zhou

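Task: Propose MedSegFactory, a versatile medical synthesis framework that generates high-quality paired medical images and segmentation masks.

Motivation: Address data scarcity and regulatory constraints in medical image segmentation while improving the performance of existing segmentation tools.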

Details

Method: A dual-stream diffusion model with a Joint Cross-Attention (JCA) mechanism generates images and masks collaboratively.

Result: Experiments show that MedSegFactory generates data of superior quality and usability and performs strongly on 2D and 3D segmentation tasks.

Conclusion: MedSegFactory offers a scalable, high-quality solution for medical image synthesis, improving efficiency and accuracy.

Abstract: This paper presents MedSegFactory, a versatile medical synthesis framework that generates high-quality paired medical images and segmentation masks across modalities and tasks. It aims to serve as an unlimited data repository, supplying image-mask pairs to enhance existing segmentation tools. The core of MedSegFactory is a dual-stream diffusion model, where one stream synthesizes medical images and the other generates corresponding segmentation masks. To ensure precise alignment between image-mask pairs, we introduce Joint Cross-Attention (JCA), enabling a collaborative denoising paradigm by dynamic cross-conditioning between streams. This bidirectional interaction allows both representations to guide each other's generation, enhancing consistency between generated pairs. MedSegFactory unlocks on-demand generation of paired medical images and segmentation masks through user-defined prompts that specify the target labels, imaging modalities, anatomical regions, and pathological conditions, facilitating scalable and high-quality data generation. This new paradigm of medical image synthesis enables seamless integration into diverse medical imaging workflows, enhancing both efficiency and accuracy. Extensive experiments show that MedSegFactory generates data of superior quality and usability, achieving competitive or state-of-the-art performance in 2D and 3D segmentation tasks while addressing data scarcity and regulatory constraints.

UKBOB: One Billion MRI Labeled Masks for Generalizable 3D Medical Image Segmentation

Emmanuelle Bourigault,Amir Jamaludin,Abdullah Hamdi

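Task: Build and validate the UKBOB dataset, propose Entropy Test-time Adaptation (ETTA) to mitigate the effect of noisy labels, and train Swin-BOB, a foundation model for 3D medical image segmentation based on the Swin-UNetr architecture.

Motivation: Large-scale labeled data in medical imaging is hard to obtain because of privacy concerns, logistics, and high labeling costs.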

Details

Method: Automatic labeling is combined with a label-cleaning pipeline (organ-specific filters, plus manual annotation of a subset for validation); ETTA refines the segmentation output; and Swin-BOB is trained on the Swin-UNetr architecture.

Result: UKBOB is the largest labeled dataset of body organs, and Swin-BOB achieves state-of-the-art results on several 3D medical imaging benchmarks, including BRATS and BTCV.

Conclusion: The UKBOB dataset and the Swin-BOB model provide an efficient and reliable solution for medical image segmentation, and the pre-trained models and code are released to support research.

Abstract: In medical imaging, the primary challenge is collecting large-scale labeled data due to privacy concerns, logistics, and high labeling costs. In this work, we present the UK Biobank Organs and Bones (UKBOB), the largest labeled dataset of body organs, comprising 51,761 MRI 3D samples (equivalent to 17.9 million 2D images) and more than 1.37 billion 2D segmentation masks of 72 organs, all based on the UK Biobank MRI dataset. We utilize automatic labeling, introduce an automated label cleaning pipeline with organ-specific filters, and manually annotate a subset of 300 MRIs with 11 abdominal classes to validate the quality (referred to as UKBOB-manual). This approach allows for scaling up the dataset collection while maintaining confidence in the labels. We further confirm the validity of the labels by demonstrating zero-shot generalization of trained models on the filtered UKBOB to other small labeled datasets from similar domains (e.g., abdominal MRI). To further mitigate the effect of noisy labels, we propose a novel method called Entropy Test-time Adaptation (ETTA) to refine the segmentation output. We use UKBOB to train a foundation model, Swin-BOB, for 3D medical image segmentation based on the Swin-UNetr architecture, achieving state-of-the-art results in several benchmarks in 3D medical imaging, including the BRATS brain MRI tumor challenge (with a 0.4% improvement) and the BTCV abdominal CT scan benchmark (with a 1.3% improvement). The pre-trained models and the code are available at https://emmanuelleb985.github.io/ukbob , and the filtered labels will be made available with the UK Biobank.

S-EO: A Large-Scale Dataset for Geometry-Aware Shadow Detection in Remote Sensing Applications

Masquil Elías,Marí Roger,Ehret Thibaud,Meinhardt-Llopis Enric,Musé Pablo,Facciolo Gabriele

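Task: Introduce the S-EO dataset for geometry-aware shadow detection and demonstrate its application to 3D reconstruction.

Motivation: Provide a new public resource for shadow detection in remote sensing imagery and its applications to 3D reconstruction.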

Details

Method: Multi-source data (e.g., WorldView-3 imagery and LiDAR DSMs) are collected, shadow and vegetation masks are generated, and the dataset is validated by training a shadow detector.

Result: The S-EO dataset contains approximately 20,000 images and supports shadow detection and 3D reconstruction tasks.

Conclusion: The S-EO dataset is an effective resource for shadow detection and 3D reconstruction, and experiments demonstrate its application potential.

Abstract: We introduce the S-EO dataset: a large-scale, high-resolution dataset, designed to advance geometry-aware shadow detection. Collected from diverse public-domain sources, including challenge datasets and government providers such as USGS, our dataset comprises 702 georeferenced tiles across the USA, each covering 500x500 m. Each tile includes multi-date, multi-angle WorldView-3 pansharpened RGB images, panchromatic images, and a ground-truth DSM of the area obtained from LiDAR scans. For each image, we provide a shadow mask derived from geometry and sun position, a vegetation mask based on the NDVI index, and a bundle-adjusted RPC model. With approximately 20,000 images, the S-EO dataset establishes a new public resource for shadow detection in remote sensing imagery and its applications to 3D reconstruction. To demonstrate the dataset's impact, we train and evaluate a shadow detector, showcasing its ability to generalize, even to aerial images. Finally, we extend EO-NeRF, a state-of-the-art NeRF approach for satellite imagery, to leverage our shadow predictions for improved 3D reconstructions.
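
The NDVI-based vegetation mask mentioned in this abstract rests on a standard formula, NDVI = (NIR - Red) / (NIR + Red). The sketch below applies it per pixel and thresholds the result. The threshold of 0.3 and the toy band values are assumptions made for illustration; the abstract does not state which bands or threshold the dataset authors used.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    total = nir + red
    return 0.0 if total == 0 else (nir - red) / total


def vegetation_mask(nir_band, red_band, threshold=0.3):
    """Mark pixels whose NDVI exceeds `threshold` (assumed value)."""
    return [
        [ndvi(n, r) > threshold for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Toy 2x2 reflectance bands (hypothetical values).
nir = [[0.8, 0.3], [0.5, 0.0]]
red = [[0.2, 0.3], [0.4, 0.0]]

mask = vegetation_mask(nir, red)
print(mask)  # [[True, False], [False, False]]
```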

Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition

Sergio Romero-Tapiador,Ruben Tolosana,Blanca Lacruz-Pleguezuelos,Laura Judith Marcos Zambrano,Guadalupe X. Bazán,Isabel Espinosa-Salinas,Julian Fierrez,Javier Ortega-Garcia,Enrique Carrillo de Santa Pau,Aythami Morales

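Task: Evaluate the food recognition capabilities of six state-of-the-art Vision-Language Models (VLMs).

Motivation: Automatic dietary assessment from food images remains challenging, requiring precise food detection, segmentation, and classification; VLMs offer new possibilities by integrating visual and textual reasoning.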

Details

Method: The FoodNExTDB database is introduced, containing 9,263 expert-labeled images across 10 categories, 62 subcategories, and 9 cooking styles, together with a new evaluation metric, Expert-Weighted Recall (EWR).

Result: Closed-source models outperform open-source ones, exceeding 90% EWR on single-product images, but current VLMs still struggle with fine-grained food recognition, such as distinguishing cooking styles and visually similar foods.

Conclusion: Although VLMs show potential for food recognition, their limitations in fine-grained recognition restrict their reliability for automatic dietary assessment. The FoodNExTDB database is publicly available.

Abstract: Automatic dietary assessment based on food images remains a challenge, requiring precise food detection, segmentation, and classification. Vision-Language Models (VLMs) offer new possibilities by integrating visual and textual reasoning. In this study, we evaluate six state-of-the-art VLMs (ChatGPT, Gemini, Claude, Moondream, DeepSeek, and LLaVA), analyzing their capabilities in food recognition at different levels. For the experimental framework, we introduce the FoodNExTDB, a unique food image database that contains 9,263 expert-labeled images across 10 categories (e.g., "protein source"), 62 subcategories (e.g., "poultry"), and 9 cooking styles (e.g., "grilled"). In total, FoodNExTDB includes 50k nutritional labels generated by seven experts who manually annotated all images in the database. Also, we propose a novel evaluation metric, Expert-Weighted Recall (EWR), that accounts for the inter-annotator variability. Results show that closed-source models outperform open-source ones, achieving over 90% EWR in recognizing food products in images containing a single product. Despite their potential, current VLMs face challenges in fine-grained food recognition, particularly in distinguishing subtle differences in cooking styles and visually similar food items, which limits their reliability for automatic dietary assessment. The FoodNExTDB database is publicly available at https://github.com/AI4Food/FoodNExtDB.

PathSegDiff: Pathology Segmentation using Diffusion model representations

Sachin Kumar Danisetty,Alexandros Graikos,Srikar Yellapragada,Dimitris Samaras

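Task: Propose PathSegDiff, a histopathology image segmentation method based on Latent Diffusion Models (LDMs).

Motivation: Segmentation models typically rely on a pre-trained feature extractor, and recent work has focused on finding tasks to pre-train it; this paper instead uses a pathology-specific LDM to improve segmentation performance.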

Details

Method: A pathology-specific LDM, guided by a self-supervised encoder, extracts features, and a fully convolutional network generates segmentation masks.

Result: The method significantly outperforms traditional methods on the BCSS and GlaS datasets, validating the effectiveness of domain-specific diffusion pre-training.

Conclusion: By extracting rich semantic information with an LDM, PathSegDiff substantially improves the accuracy of histopathology image segmentation.

Abstract: Image segmentation is crucial in many computational pathology pipelines, including accurate disease diagnosis, subtyping, outcome, and survivability prediction. The common approach for training a segmentation model relies on a pre-trained feature extractor and a dataset of paired image and mask annotations. These are used to train a lightweight prediction model that translates features into per-pixel classes. The choice of the feature extractor is central to the performance of the final segmentation model, and recent literature has focused on finding tasks to pre-train the feature extractor. In this paper, we propose PathSegDiff, a novel approach for histopathology image segmentation that leverages Latent Diffusion Models (LDMs) as pre-trained feature extractors. Our method utilizes a pathology-specific LDM, guided by a self-supervised encoder, to extract rich semantic information from H&E-stained histopathology images. We employ a simple, fully convolutional network to process the features extracted from the LDM and generate segmentation masks. Our experiments demonstrate significant improvements over traditional methods on the BCSS and GlaS datasets, highlighting the effectiveness of domain-specific diffusion pre-training in capturing intricate tissue structures and enhancing segmentation accuracy in histopathology images.

A Comparison of Deep Learning Methods for Cell Detection in Digital Cytology

Marco Acerbis,Nataša Sladoje,Joakim Lindblad

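Task: Evaluate several deep learning methods for cell detection in Papanicolaou-stained cytological Whole Slide Images (WSIs).

Motivation: Cell detection is crucial in biomedical image analysis and calls for methods that are both accurate and efficient.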

Details

Method: Off-the-shelf algorithms and custom-designed detectors are compared, including StarDist, Cellpose, SAM2, and centroid-based FCRN approaches, and a distance-based evaluation metric is introduced.

Result: Centroid-based methods, notably IFCRN, outperform segmentation-based methods in both detection accuracy and computational efficiency.

Conclusion: Centroid-based detectors are promising in resource-limited environments, offering faster processing and lower GPU memory usage without compromising accuracy.

Abstract: Accurate and efficient cell detection is crucial in many biomedical image analysis tasks. We evaluate the performance of several Deep Learning (DL) methods for cell detection in Papanicolaou-stained cytological Whole Slide Images (WSIs), focusing on accuracy of predictions and computational efficiency. We examine recent off-the-shelf algorithms as well as custom-designed detectors, applying them to two datasets: the CNSeg Dataset and the Oral Cancer (OC) Dataset. Our comparison includes well-established segmentation methods such as StarDist, Cellpose, and the Segment Anything Model 2 (SAM2), alongside centroid-based Fully Convolutional Regression Network (FCRN) approaches. We introduce a suitable evaluation metric to assess the accuracy of predictions based on the distance from ground truth positions. We also explore the impact of dataset size and data augmentation techniques on model performance. Results show that centroid-based methods, particularly the Improved Fully Convolutional Regression Network (IFCRN) method, outperform segmentation-based methods in terms of both detection accuracy and computational efficiency. This study highlights the potential of centroid-based detectors as a preferred option for cell detection in resource-limited environments, offering faster processing times and lower GPU memory usage without compromising accuracy.
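
The abstract does not define its distance-based metric, so the following is only one plausible form, offered as an assumption: predicted centroids are greedily matched one-to-one with ground-truth positions within a maximum distance, and precision, recall, and F1 are computed from the matches. The function name `match_detections` and the `max_dist` value are hypothetical.

```python
import math


def match_detections(predicted, ground_truth, max_dist):
    """Greedy one-to-one matching of points, closest pairs first.

    Returns (precision, recall, f1). This is an illustrative metric,
    not necessarily the one used in the paper.
    """
    candidates = sorted(
        (math.dist(p, g), i, j)
        for i, p in enumerate(predicted)
        for j, g in enumerate(ground_truth)
    )
    used_pred, used_gt = set(), set()
    for dist, i, j in candidates:
        if dist > max_dist:
            break  # remaining pairs are even farther apart
        if i not in used_pred and j not in used_gt:
            used_pred.add(i)
            used_gt.add(j)
    tp = len(used_pred)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gt = [(10, 10), (50, 50), (90, 20)]
pred = [(12, 11), (48, 53), (200, 200), (91, 19)]

precision, recall, f1 = match_detections(pred, gt, max_dist=5.0)
print(precision, recall, round(f1, 3))
```

Here three of the four predictions fall within 5 pixels of a distinct ground-truth cell, so the one far-away prediction counts as a false positive.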

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Xinhao Li,Ziang Yan,Desen Meng,Lu Dong,Xiangyu Zeng,Yinan He,Yali Wang,Yu Qiao,Yi Wang,Limin Wang

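Task: Explore Reinforcement Fine-Tuning (RFT) with GRPO for video multimodal large language models (MLLMs) to enhance spatio-temporal perception.

Motivation: Approaches such as GRPO and rule-based reward mechanisms work well in text and image domains, but their application to video understanding remains limited.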

Details

Method: Video MLLMs are fine-tuned with multi-task RFT using GRPO, targeting spatio-temporal perception objectives.

Result: The resulting VideoChat-R1 performs strongly on spatio-temporal perception tasks (e.g., +31.8 on temporal grounding and +31.2 on object tracking) and also improves on general QA benchmarks.

Conclusion: RFT shows potential for specialized task enhancement of video MLLMs, offering useful insights for future reinforcement learning research in this area.

Abstract: Recent advancements in reinforcement learning have significantly improved the reasoning capabilities of multimodal large language models (MLLMs). While approaches such as Group Relative Policy Optimization (GRPO) and rule-based reward mechanisms demonstrate promise in text and image domains, their application to video understanding remains limited. This paper presents a systematic exploration of Reinforcement Fine-Tuning (RFT) with GRPO for video MLLMs, aiming to enhance spatio-temporal perception while maintaining general capabilities. Our experiments reveal that RFT is highly data-efficient for task-specific improvements. Through multi-task RFT on spatio-temporal perception objectives with limited samples, we develop VideoChat-R1, a powerful video MLLM that achieves state-of-the-art performance on spatio-temporal perception tasks without sacrificing chat ability, while exhibiting emerging spatio-temporal reasoning abilities. Compared to Qwen2.5-VL-7B, VideoChat-R1 boosts performance several-fold in tasks like temporal grounding (+31.8) and object tracking (+31.2). Additionally, it significantly improves on general QA benchmarks such as VideoMME (+0.9), MVBench (+1.0), and Perception Test (+0.9). Our findings underscore the potential of RFT for specialized task enhancement of Video MLLMs. We hope our work offers valuable insights for future RL research in video MLLMs.
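
As a rough illustration of how GRPO turns rule-based rewards into a training signal, the sketch below scores a group of sampled answers and normalizes each reward by the group mean and standard deviation. The temporal IoU reward is an assumption about what a rule-based reward for temporal grounding might look like; the abstract does not specify the reward functions, and the names below are hypothetical.

```python
import statistics


def temporal_iou(pred, target):
    """Intersection over union of two (start, end) time intervals."""
    inter = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: reward standardized within its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, four sampled answers (hypothetical predicted intervals).
target = (10.0, 20.0)
samples = [(10.0, 20.0), (15.0, 25.0), (30.0, 40.0), (12.0, 18.0)]

rewards = [temporal_iou(s, target) for s in samples]
advantages = group_relative_advantages(rewards)
print([round(r, 3) for r in rewards])
print([round(a, 3) for a in advantages])
```

Answers that beat the group average receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.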

Efficient Self-Supervised Learning for Earth Observation via Dynamic Dataset Curation

Thomas Kerdreux,Alexandre Tuel,Quentin Febvre,Alexis Mouche,Bertrand Chapron

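Task: Propose a dynamic dataset pruning strategy to improve self-supervised learning (SSL) pre-training for remote sensing tasks.

Motivation: Redundancy and heavy-tailed distributions, both common in remote sensing data, can lead to biased representations and inefficient training, yet dataset balancing and diversification remain underexplored.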

Details

Method: A dynamic dataset pruning method that requires no pre-existing feature extractor iteratively refines the training set to improve diversity and balance.

Result: Validated on the Sentinel-1 WV SAR archive, dynamic pruning improves both computational efficiency and representation quality, strengthening transfer to downstream tasks.

Conclusion: Dynamic pruning is promising for remote sensing; the authors also release Nereus-SAR-1, the first model in the Nereus family of foundation models for ocean observation with SAR imagery.

Abstract: Self-supervised learning (SSL) has enabled the development of vision foundation models for Earth Observation (EO), demonstrating strong transferability across diverse remote sensing tasks. While prior work has focused on network architectures and training strategies, the role of dataset curation, especially in balancing and diversifying pre-training datasets, remains underexplored. In EO, this challenge is amplified by the redundancy and heavy-tailed distributions common in satellite imagery, which can lead to biased representations and inefficient training. In this work, we propose a dynamic dataset pruning strategy designed to improve SSL pre-training by maximizing dataset diversity and balance. Our method iteratively refines the training set without requiring a pre-existing feature extractor, making it well-suited for domains where curated datasets are limited or unavailable. We demonstrate our approach on the Sentinel-1 Wave Mode (WV) Synthetic Aperture Radar (SAR) archive, a challenging dataset dominated by ocean observations. We train models from scratch on the entire Sentinel-1 WV archive spanning 10 years. Across three downstream tasks, our results show that dynamic pruning improves both computational efficiency and representation quality, leading to stronger transferability. We also release the weights of Nereus-SAR-1, the first model in the Nereus family, a series of foundation models for ocean observation and analysis using SAR imagery, at github.com/galeio-research/nereus-sar-models/.

A Deep Single Image Rectification Approach for Pan-Tilt-Zoom Cameras

Teng Xiao,Qi Hu,Qingsong Yan,Wei Liu,Zhiwei Ye,Fei Deng

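Task: Propose FDBW-Net, a novel framework for wide-angle image rectification.

Motivation: Current deep learning methods struggle to preserve fine-grained geometric details in wide-angle image rectification, leading to inaccurate results.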

Details

Method: A forward distortion model synthesizes barrel-distorted images, a pyramid context encoder with attention mechanisms generates backward warping flows, and a multi-scale decoder restores the image.

Result: FDBW-Net is validated on multiple datasets and achieves SOTA performance.

Conclusion: FDBW-Net improves the adaptability of PTZ cameras in practical visual applications.

Abstract: Pan-Tilt-Zoom (PTZ) cameras with wide-angle lenses are widely used in surveillance but often require image rectification due to their inherent nonlinear distortions. Current deep learning approaches typically struggle to maintain fine-grained geometric details, resulting in inaccurate rectification. This paper presents a Forward Distortion and Backward Warping Network (FDBW-Net), a novel framework for wide-angle image rectification. It begins by using a forward distortion model to synthesize barrel-distorted images, reducing pixel redundancy and preventing blur. The network employs a pyramid context encoder with attention mechanisms to generate backward warping flows containing geometric details. Then, a multi-scale decoder is used to restore distorted features and output rectified images. FDBW-Net's performance is validated on diverse datasets: public benchmarks, AirSim-rendered PTZ camera imagery, and real-scene PTZ camera datasets. The results demonstrate that FDBW-Net achieves SOTA performance in distortion rectification, boosting the adaptability of PTZ cameras for practical visual applications.

Wheat3DGS: In-field 3D Reconstruction, Instance Segmentation and Phenotyping of Wheat Heads with Gaussian Splatting

Daiwei Zhang,Joaquin Gajardo,Tomislav Medic,Isinsu Katircioglu,Mike Boss,Norbert Kirchgessner,Achim Walter,Lukas Roth

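Task: Use 3D Gaussian Splatting (3DGS) and the Segment Anything Model (SAM) for precise 3D instance segmentation and morphological measurement of wheat heads.

Motivation: High-throughput field phenotyping (HTFP) requires automated extraction of plant morphological traits, but existing methods such as NeRF have limitations when handling complex structures like wheat heads.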

Details

Method: Wheat3DGS combines 3DGS and SAM to measure the morphological traits of wheat heads automatically.

Result: Compared with high-resolution laser scan data, the mean absolute percentage errors for wheat head length, width, and volume are 15.1%, 18.3%, and 40.2%, respectively, outperforming NeRF-based and traditional MVS methods.

Conclusion: Wheat3DGS enables rapid, non-destructive measurement of key yield-related traits at scale, with significant implications for accelerating crop breeding and research on wheat development.

Abstract: Automated extraction of plant morphological traits is crucial for supporting crop breeding and agricultural management through high-throughput field phenotyping (HTFP). Solutions based on multi-view RGB images are attractive due to their scalability and affordability, enabling volumetric measurements that 2D approaches cannot directly capture. While advanced methods like Neural Radiance Fields (NeRFs) have shown promise, their application has been limited to counting or extracting traits from only a few plants or organs. Furthermore, accurately measuring complex structures like individual wheat heads, which is essential for studying crop yields, remains particularly challenging due to occlusions and the dense arrangement of crop canopies in field conditions. The recent development of 3D Gaussian Splatting (3DGS) offers a promising alternative for HTFP due to its high-quality reconstructions and explicit point-based representation. In this paper, we present Wheat3DGS, a novel approach that leverages 3DGS and the Segment Anything Model (SAM) for precise 3D instance segmentation and morphological measurement of hundreds of wheat heads automatically, representing the first application of 3DGS to HTFP. We validate the accuracy of wheat head extraction against high-resolution laser scan data, obtaining per-instance mean absolute percentage errors of 15.1%, 18.3%, and 40.2% for length, width, and volume. We provide additional comparisons to NeRF-based approaches and traditional Multi-View Stereo (MVS), demonstrating superior results. Our approach enables rapid, non-destructive measurements of key yield-related traits at scale, with significant implications for accelerating crop breeding and improving our understanding of wheat development.
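
For readers unfamiliar with the error measure reported here, mean absolute percentage error (MAPE) averages the absolute error of each estimate relative to its reference value. The sketch below uses made-up wheat head lengths purely to show the computation; none of the numbers come from the paper.

```python
def mape(estimates, references):
    """Mean absolute percentage error, in percent."""
    ratios = [abs(e - r) / abs(r) for e, r in zip(estimates, references)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical per-instance lengths in millimetres.
laser_scan_lengths = [100.0, 80.0, 120.0]    # reference measurements
reconstructed_lengths = [110.0, 72.0, 120.0]  # estimates to evaluate

error = mape(reconstructed_lengths, laser_scan_lengths)
print(round(error, 2))  # 6.67
```

Because each error is divided by its own reference value, small structures weigh as much as large ones, which may partly explain why a derived quantity such as volume shows a larger percentage error than length or width.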

SIGMAN: Scaling 3D Human Gaussian Generation with Millions of Assets

Yuhang Yang,Fengqi Liu,Yixing Lu,Qin Zhao,Pingyu Wu,Wei Zhai,Ran Yi,Yang Cao,Lizhuang Ma,Zheng-Jun Zha,Junting Dong

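Task: Propose a 3D human digitization method based on a latent space generation paradigm, addressing the limitations of existing methods in speed, quality, and data scale.

Motivation: Existing 3D human digitization methods are constrained by their paradigms, by data scarcity, and by ambiguity in the low-to-high-dimensional mapping, which results in slow speed and low quality.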

Details

Method: A UV-structured VAE compresses multi-view images into Gaussians; combined with DiT-based conditional generation, this turns the low-to-high-dimensional mapping problem into a learnable distribution shift and supports end-to-end inference. The HGS-1M dataset is also constructed to support large-scale training.

Result: Experiments show that the method generates high-quality, richly detailed 3D human Gaussians, including intricate textures, facial details, and loose clothing deformation.

Conclusion: The proposed latent space generation paradigm, combined with a large-scale dataset, substantially improves the quality and efficiency of 3D human digitization.

Abstract: 3D human digitization has long been a highly pursued yet challenging task. Existing methods aim to generate high-quality 3D digital humans from single or multiple views, but remain primarily constrained by current paradigms and the scarcity of 3D human assets. Specifically, recent approaches fall into several paradigms: optimization-based and feed-forward (both single-view regression and multi-view generation with reconstruction). However, they are limited by slow speed, low quality, cascade reasoning, and ambiguity in mapping low-dimensional planes to high-dimensional space due to occlusion and invisibility, respectively. Furthermore, existing 3D human assets remain small-scale, insufficient for large-scale training. To address these challenges, we propose a latent space generation paradigm for 3D human digitization, which involves compressing multi-view images into Gaussians via a UV-structured VAE. Together with DiT-based conditional generation, this transforms the ill-posed low-to-high-dimensional mapping problem into a learnable distribution shift, which also supports end-to-end inference. In addition, we employ the multi-view optimization approach combined with synthetic data to construct the HGS-1M dataset, which contains 1 million 3D Gaussian assets to support the large-scale training. Experimental results demonstrate that our paradigm, powered by large-scale training, produces high-quality 3D human Gaussians with intricate textures, facial details, and loose clothing deformation.

Latent Diffusion U-Net Representations Contain Positional Embeddings and Anomalies

Jonas Loos,Lorenz Linhardt

Task: Analyze the representational properties of Stable Diffusion models and their robustness.

Motivation: Diffusion models excel at generating realistic images, but their representations remain understudied; their robustness needs to be understood before they are used for downstream tasks.

Details

Method: Popular Stable Diffusion models are analyzed using representational similarity and norms. Result: Three phenomena are found: (1) a learned positional embedding in intermediate representations, (2) high-similarity corner artifacts, and (3) anomalous high-norm artifacts. Conclusion: The properties of diffusion model representations need further study before they are used for downstream tasks that require robust features. Abstract: Diffusion models have demonstrated remarkable capabilities in synthesizing realistic images, spurring interest in using their representations for various downstream tasks. To better understand the robustness of these representations, we analyze popular Stable Diffusion models using representational similarity and norms. Our findings reveal three phenomena: (1) the presence of a learned positional embedding in intermediate representations, (2) high-similarity corner artifacts, and (3) anomalous high-norm artifacts. These findings underscore the need to further investigate the properties of diffusion model representations before considering them for downstream tasks that require robust features. Project page: https://jonasloos.github.io/sd-representation-anomalies
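The two measurements named in the abstract above, similarity between spatial tokens and per-token norms, can be sketched on a toy feature map. This is a minimal illustration, not the authors' analysis code: the dictionary layout, the median-based rule, and the `factor` threshold are all assumptions made for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def token_norms(feature_map):
    """feature_map: dict mapping (row, col) -> feature vector (hypothetical layout)."""
    return {pos: math.sqrt(sum(x * x for x in vec)) for pos, vec in feature_map.items()}

def high_norm_positions(feature_map, factor=3.0):
    """Positions whose L2 norm exceeds `factor` times the median norm (assumed rule)."""
    norms = token_norms(feature_map)
    ordered = sorted(norms.values())
    median = ordered[len(ordered) // 2]
    return [pos for pos, norm in norms.items() if norm > factor * median]
```

On real U-Net activations the same two functions would be applied per layer; a high cosine similarity between tokens at the same position across different images would hint at a positional component, and the norm filter would surface outlier tokens.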

Glossy Object Reconstruction with Cost-effective Polarized Acquisition

Bojian Wu,Yifan Peng,Ruizhen Hu,Xiaowei Zhou

Task: Propose a scalable polarization-aided approach that separates diffuse and specular components from multi-view polarization images captured with low-cost equipment, enabling 3D reconstruction of glossy objects.

Motivation: Existing methods rely on expensive and complex equipment; this work aims to simplify data acquisition and lower system construction cost by using low-cost tools and polarization.

Details

Method: Multi-view polarization images are captured with the aid of a linear polarizer; the polarimetric BRDF, Stokes vectors, and surface polarization states are represented as neural implicit fields and recovered by optimizing a rendering loss. Result: Experiments show that the method outperforms existing techniques on reconstruction and novel view synthesis, on both public datasets and real captured images. Conclusion: By combining the physics of polarization with neural implicit representations, the method offers an efficient, low-cost solution for 3D reconstruction of glossy objects. Abstract: The challenge of image-based 3D reconstruction for glossy objects lies in separating diffuse and specular components on glossy surfaces from captured images, a task complicated by the ambiguity in discerning lighting conditions and material properties using RGB data alone. While state-of-the-art methods rely on tailored and/or high-end equipment for data acquisition, which can be cumbersome and time-consuming, this work introduces a scalable polarization-aided approach that employs cost-effective acquisition tools. By attaching a linear polarizer to readily available RGB cameras, multi-view polarization images can be captured without the need for advance calibration or precise measurements of the polarizer angle, substantially reducing system construction costs. The proposed approach represents polarimetric BRDF, Stokes vectors, and polarization states of object surfaces as neural implicit fields. These fields, combined with the polarizer angle, are retrieved by optimizing the rendering loss of input polarized images. By leveraging fundamental physical principles for the implicit representation of polarization rendering, our method demonstrates superiority over existing techniques through experiments on public datasets and real captured images on both reconstruction and novel view synthesis.
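For readers unfamiliar with Stokes vectors, the standard relation between a linear polarizer angle and the observed intensity can be sketched as follows. Note the assumption: this textbook version takes the polarizer angles as known (0, 45, 90, and 135 degrees), whereas the abstract above says the proposed method recovers the angle during optimization. Function names are illustrative only.

```python
import math

def intensity_through_polarizer(s0, s1, s2, phi):
    """Intensity seen through an ideal linear polarizer at angle phi (radians)."""
    return 0.5 * (s0 + s1 * math.cos(2 * phi) + s2 * math.sin(2 * phi))

def linear_stokes(i0, i45, i90, i135):
    """Linear Stokes components from four captures at known polarizer angles."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)
    s1 = i0 - i90
    s2 = i45 - i135
    return s0, s1, s2

def dolp_aolp(s0, s1, s2):
    """Degree and angle (radians) of linear polarization."""
    dolp = math.hypot(s1, s2) / s0
    aolp = 0.5 * math.atan2(s2, s1)
    return dolp, aolp
```

Specular reflection tends to be more strongly polarized than diffuse reflection, which is why quantities like the degree of linear polarization help separate the two components.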

Distilling Textual Priors from LLM to Efficient Image Fusion

Ran Zhang,Xuanhua He,Ke Cao,Liu Liu,Li Zhang,Man Zhou,Jie Zhang

Task: Propose a novel framework that distills large-model priors to achieve efficient, high-quality synthesis in multi-modality image fusion.

Motivation: Traditional approaches (such as CNNs and GANs) struggle with low-quality or complex inputs, while text-guided methods are effective but computationally expensive.

Details

Method: A teacher-student architecture transfers large-model priors to a small student network through a tailored distillation process, and a spatial-channel cross-fusion module is introduced to better exploit textual priors. Result: The distilled network needs only 10% of the teacher network's parameters and inference time, retains 90% of its performance, and outperforms existing SOTA methods. Conclusion: The method strikes a good balance between computational efficiency and fusion quality; experiments demonstrate its effectiveness, and the implementation will be open-sourced. Abstract: Multi-modality image fusion aims to synthesize a single, comprehensive image from multiple source inputs. Traditional approaches, such as CNNs and GANs, offer efficiency but struggle to handle low-quality or complex inputs. Recent advances in text-guided methods leverage large model priors to overcome these limitations, but at the cost of significant computational overhead, both in memory and inference time. To address this challenge, we propose a novel framework for distilling large model priors, eliminating the need for text guidance during inference while dramatically reducing model size. Our framework utilizes a teacher-student architecture, where the teacher network incorporates large model priors and transfers this knowledge to a smaller student network via a tailored distillation process. Additionally, we introduce a spatial-channel cross-fusion module to enhance the model's ability to leverage textual priors across both spatial and channel dimensions. Our method achieves a favorable trade-off between computational efficiency and fusion quality. The distilled network, requiring only 10% of the parameters and inference time of the teacher network, retains 90% of its performance and outperforms existing SOTA methods. Extensive experiments demonstrate the effectiveness of our approach. The implementation will be made publicly available as an open-source resource.

A Unified Agentic Framework for Evaluating Conditional Image Generation

Jifang Wang,Xue Yang,Longyue Wang,Zhenran Xu,Yiyu Wang,Yaowei Wang,Weihua Luo,Kaifu Zhang,Baotian Hu,Min Zhang

Task: Propose CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks.

Motivation: The field of conditional image generation lacks task-agnostic, reliable, and explainable evaluation metrics.

Details

Method: Large multimodal models (LMMs) serve as the core and are integrated with a multi-functional toolbox to build a fine-grained evaluation framework; smaller LMMs are fine-tuned on synthesized evaluation trajectories. Result: CIGEval (GPT-4o version) reaches a correlation of 0.4625 with human assessments across seven tasks, close to the inter-annotator correlation of 0.47; with 7B open-source LMMs and 2.3K training trajectories it surpasses the previous state-of-the-art method. Conclusion: CIGEval is good at identifying subtle issues in image generation tasks and shows potential for automated evaluation with near-human reliability. Abstract: Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks. CIGEval utilizes large multimodal models (LMMs) as its core, integrating a multi-functional toolbox and establishing a fine-grained evaluation framework. Additionally, we synthesize evaluation trajectories for fine-tuning, empowering smaller LMMs to autonomously select appropriate tools and conduct nuanced analyses based on tool outputs. Experiments across seven prominent conditional image generation tasks demonstrate that CIGEval (GPT-4o version) achieves a high correlation of 0.4625 with human assessments, closely matching the inter-annotator correlation of 0.47. Moreover, when implemented with 7B open-source LMMs using only 2.3K training trajectories, CIGEval surpasses the previous GPT-4o-based state-of-the-art method. Case studies on GPT-4o image generation highlight CIGEval's capability in identifying subtle issues related to subject consistency and adherence to control guidance, indicating its great potential for automating evaluation of image generation tasks with human-level reliability.

Generalized Semantic Contrastive Learning via Embedding Side Information for Few-Shot Object Detection

Ruoyu Chen,Hua Zhang,Jingzhi Li,Li Liu,Zhen Huang,Xiaochun Cao

Task: Address the unclear classification boundaries and model overfitting caused by insufficient samples in few-shot object detection (FSOD).

Motivation: With few samples, the features of novel categories are easily implicitly represented by base-category features, and the limited data makes the model prone to overfitting.

Details

Method: Side information is used to build a knowledge matrix that quantifies the semantic relationship between base and novel categories; contextual semantic supervised contrastive learning is developed; and a side-information-guided region-aware masked module is proposed to increase sample diversity. Result: On multiple benchmarks the model outperforms existing methods and significantly improves few-shot object detection. Conclusion: Side information and contrastive learning effectively address key problems in few-shot object detection and improve model performance. Abstract: The objective of few-shot object detection (FSOD) is to detect novel objects with few training samples. The core challenge of this task is how to construct a generalized feature space for novel categories with limited data on the basis of the base category space, which could adapt the learned detection model to unknown scenarios. However, limited by insufficient samples for novel categories, two issues still exist: (1) the features of the novel category are easily implicitly represented by the features of the base category, leading to inseparable classifier boundaries; (2) novel categories with fewer data are not enough to fully represent the distribution, so model fine-tuning is prone to overfitting. To address these issues, we introduce side information to alleviate the negative influences derived from the feature space and sample viewpoints and formulate a novel generalized feature representation learning method for FSOD. Specifically, we first utilize embedding side information to construct a knowledge matrix to quantify the semantic relationship between the base and novel categories. Then, to strengthen the discrimination between semantically similar categories, we further develop contextual semantic supervised contrastive learning which embeds side information. Furthermore, to prevent overfitting problems caused by sparse samples, a side-information guided region-aware masked module is introduced to augment the diversity of samples, which finds and abandons biased information that discriminates between similar categories via counterfactual explanation, and refines the discriminative representation space further. Extensive experiments using ResNet and ViT backbones on PASCAL VOC, MS COCO, LVIS V1, FSOD-1K, and FSVOD-500 benchmarks demonstrate that our model outperforms the previous state-of-the-art methods, significantly improving the ability of FSOD in most shots/splits.

Teaching pathology foundation models to accurately predict gene expression with parameter efficient knowledge transfer

Shi Pan,Jianan Chen,Maria Secrier

Task: Propose PEKA, a parameter-efficient knowledge transfer framework for predicting gene expression from histopathology images.

Motivation: Current image foundation models perform poorly on gene expression prediction, and fine-tuning and alignment for cross-modal knowledge transfer are expensive.

Details

Method: Block-Affine Adaptation is combined with knowledge distillation and structure alignment losses to achieve cross-modal knowledge transfer. Result: On multiple spatial transcriptomics datasets PEKA improves performance by at least 5% over baseline models and outperforms other parameter-efficient fine-tuning strategies. Conclusion: PEKA is an efficient, high-performing cross-modal knowledge transfer method; the code and datasets will be released to support further research. Abstract: Gene expression profiling provides critical insights into cellular heterogeneity, biological processes and disease mechanisms. There has been an increasing interest in computational approaches that can predict gene expression directly from digitized histopathology images. While image foundation models have shown promise in a variety of downstream pathology analyses, their performance on gene-expression prediction is still limited. Explicitly incorporating information from transcriptomic models can help image models address domain shift, yet the fine-tuning and alignment of foundation models can be expensive. In this work, we propose Parameter Efficient Knowledge trAnsfer (PEKA), a novel framework that leverages Block-Affine Adaptation and integrates knowledge distillation and structure alignment losses for cross-modal knowledge transfer. We evaluated PEKA for gene expression prediction using multiple spatial transcriptomics datasets (comprising 206,123 image tiles with matched gene expression profiles) that encompassed various types of tissue. PEKA achieved at least 5% performance improvement over baseline foundation models while also outperforming alternative parameter-efficient fine-tuning strategies. We will release the code, datasets and aligned models after peer-review to facilitate broader adoption and further development for parameter efficient model alignment.

Detecting AI-generated Artwork

Meien Li,Mark Stamp

Task: Study how machine learning and deep learning models can distinguish AI-generated from human-generated artwork.

Motivation: As AI-generated art becomes more efficient to produce and higher in quality, telling AI and human work apart is getting harder, which raises new challenges and concerns.

Details

Method: Logistic Regression (LR), Support Vector Machine (SVM), Multilayer Perceptron (MLP), and Convolutional Neural Network (CNN) models are tested on three artistic styles: baroque, cubism, and expressionism. Result: Accuracy is 0.8208 on the multiclass problem and 0.9758 on the binary problem of distinguishing AI from human work. Conclusion: Machine learning models show significant potential for distinguishing AI from human artwork, with the CNN performing particularly well on the binary task. Abstract: The high efficiency and quality of artwork generated by Artificial Intelligence (AI) have created new concerns and challenges for human artists. In particular, recent improvements in generative AI have made it difficult for people to distinguish between human-generated and AI-generated art. In this research, we consider the potential utility of various types of Machine Learning (ML) and Deep Learning (DL) models in distinguishing AI-generated artwork from human-generated artwork. We focus on three challenging artistic styles, namely, baroque, cubism, and expressionism. The learning models we test are Logistic Regression (LR), Support Vector Machine (SVM), Multilayer Perceptron (MLP), and Convolutional Neural Network (CNN). Our best experimental results yield a multiclass accuracy of 0.8208 over six classes, and an impressive accuracy of 0.9758 for the binary classification problem of distinguishing AI-generated from human-generated art.

GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography

Mengchen Zhang,Tong Wu,Jing Tan,Ziwei Liu,Gordon Wetzstein,Dahua Lin

Task: Propose GenDoP, an auto-regressive model driven by text guidance and RGBD inputs, for generating artistic and expressive camera trajectories.

Motivation: Existing camera trajectory generation is limited: traditional approaches rely on geometric optimization or handcrafted systems, while learning-based methods inherit structural biases or lack textual alignment, which constrains creative synthesis.

Details

Method: A multi-modal dataset, DataDoP, containing 29K real-world shots is built first; an auto-regressive, decoder-only Transformer model, GenDoP, is then trained. Result: GenDoP outperforms existing methods in controllability, fine-grained trajectory adjustment, and motion stability. Conclusion: The approach sets a new standard for learning-based cinematography and advances future work on camera control and filmmaking. Abstract: Camera trajectory design plays a crucial role in video production, serving as a fundamental tool for conveying directorial intent and enhancing visual storytelling. In cinematography, Directors of Photography meticulously craft camera movements to achieve expressive and intentional framing. However, existing methods for camera trajectory generation remain limited: traditional approaches rely on geometric optimization or handcrafted procedural systems, while recent learning-based methods often inherit structural biases or lack textual alignment, constraining creative synthesis. In this work, we introduce an auto-regressive model inspired by the expertise of Directors of Photography to generate artistic and expressive camera trajectories. We first introduce DataDoP, a large-scale multi-modal dataset containing 29K real-world shots with free-moving camera trajectories, depth maps, and detailed captions in specific movements, interaction with the scene, and directorial intent. Thanks to the comprehensive and diverse database, we further train an auto-regressive, decoder-only Transformer for high-quality, context-aware camera movement generation based on text guidance and RGBD inputs, named GenDoP. Extensive experiments demonstrate that compared to existing methods, GenDoP offers better controllability, finer-grained trajectory adjustments, and higher motion stability. We believe our approach establishes a new standard for learning-based cinematography, paving the way for future advancements in camera control and filmmaking. Our project website: https://kszpxxzmc.github.io/GenDoP/.

OmniCaptioner: One Captioner to Rule Them All

Yiting Lu,Jiakang Yuan,Zhen Li,Shitian Zhao,Qi Qin,Xinyue Li,Le Zhuo,Licheng Wen,Dongyang Liu,Yuewen Cao,Xiangchao Yan,Xin Li,Botian Shi,Tao Chen,Zhibo Chen,Lei Bai,Bo Zhang,Peng Gao

Task: Propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across many visual domains.

Motivation: Prior methods are limited to specific image types (e.g., natural images or geometric visuals); a unified solution is needed.

Details

Method: Low-level pixel information is converted into semantically rich textual representations, bridging the gap between visual and textual modalities. Result: Three key advantages are shown: enhanced visual reasoning, improved image generation, and efficient supervised fine-tuning. Conclusion: OmniCaptioner's versatility and adaptability offer a new perspective on bridging the gap between language and visual modalities. Abstract: We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.

Are We Done with Object-Centric Learning?

Alexander Rubinstein,Ameya Prabhu,Matthias Bethge,Seong Joon Oh

Task: Study how object-centric learning (OCL) achieves object separation and how that affects out-of-distribution (OOD) generalization.

Motivation: Explore how the ability to separate objects contributes to OCL goals such as OOD generalization, and address the challenge caused by spurious background cues.

Details

Method: A training-free probe, OCCAM, is proposed; it encodes individual objects via segmentation and is compared with slot-based OCL methods. Result: Segmentation-based encoding significantly outperforms slot-based OCL methods, though challenges remain in real-world applications. Conclusion: A toolbox of scalable object-centric representations is provided for the OCL community, with a focus on practical applications and fundamental questions such as object perception in human cognition. Abstract: Object-centric learning (OCL) seeks to learn representations that only encode an object, isolated from other objects or background cues in a scene. This approach underpins various aims, including out-of-distribution (OOD) generalization, sample-efficient composition, and modeling of structured environments. Most research has focused on developing unsupervised mechanisms that separate objects into discrete slots in the representation space, evaluated using unsupervised object discovery. However, with recent sample-efficient segmentation models, we can separate objects in the pixel space and encode them independently. This achieves remarkable zero-shot performance on OOD object discovery benchmarks, is scalable to foundation models, and can handle a variable number of slots out-of-the-box. Hence, the goal of OCL methods to obtain object-centric representations has been largely achieved. Despite this progress, a key question remains: How does the ability to separate objects within a scene contribute to broader OCL objectives, such as OOD generalization? We address this by investigating the OOD generalization challenge caused by spurious background cues through the lens of OCL. We propose a novel, training-free probe called Object-Centric Classification with Applied Masks (OCCAM), demonstrating that segmentation-based encoding of individual objects significantly outperforms slot-based OCL methods. However, challenges in real-world applications remain. We provide the toolbox for the OCL community to use scalable object-centric representations, and focus on practical applications and fundamental questions, such as understanding object perception in human cognition. Our code is available at https://github.com/AlexanderRubinstein/OCCAM.
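The idea of separating an object in pixel space before encoding it can be illustrated with a toy example. This is only a sketch of mask-then-encode under simplifying assumptions, not the OCCAM implementation: `mean_intensity_encoder` is a hypothetical stand-in for a real image encoder, and the mask is given rather than predicted by a segmentation model.

```python
def apply_mask(image, mask, background=0.0):
    """Keep pixels where mask is truthy; replace the rest with `background`."""
    return [
        [pixel if keep else background for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]

def mean_intensity_encoder(image):
    """Hypothetical stand-in for an image encoder: a single scalar feature."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)

# Same object (center pixel) on two different backgrounds.
object_mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
on_dark = [[0.0, 0.0, 0.0], [0.0, 0.9, 0.0], [0.0, 0.0, 0.0]]
on_bright = [[1.0, 1.0, 1.0], [1.0, 0.9, 1.0], [1.0, 1.0, 1.0]]
```

Without masking, the toy feature differs between the two images purely because of the background; after masking, both images yield the same feature, which is the sense in which masking removes spurious background cues.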

FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution

Gene Chou,Wenqi Xian,Guandao Yang,Mohamed Abdelfattah,Bharath Hariharan,Noah Snavely,Ning Yu,Paul Debevec

Task: Propose FlashDepth, a depth estimation model that is accurate, high-resolution, and supports real-time streaming video.

Motivation: Existing depth estimation models struggle to deliver accuracy, high resolution, and real-time performance at the same time on streaming video.

Details

Method: Pretrained single-image depth models are modified, achieving real-time high-resolution depth estimation with relatively little data and training. Result: On multiple unseen datasets FlashDepth significantly outperforms existing models in boundary sharpness and speed while maintaining competitive accuracy. Conclusion: FlashDepth offers an effective solution for applications that need high-resolution depth, such as video editing and robotics. Abstract: A versatile video depth estimation model should (1) be accurate and consistent across frames, (2) produce high-resolution depth maps, and (3) support real-time streaming. We propose FlashDepth, a method that satisfies all three requirements, performing depth estimation on a 2044x1148 streaming video at 24 FPS. We show that, with careful modifications to pretrained single-image depth models, these capabilities are enabled with relatively little data and training. We evaluate our approach across multiple unseen datasets against state-of-the-art depth models, and find that ours outperforms them in terms of boundary sharpness and speed by a significant margin, while maintaining competitive accuracy. We hope our model will enable various applications that require high-resolution depth, such as video editing, and online decision-making, such as robotics.

MultiDelete for Multimodal Machine Unlearning

Jiali Cheng,Hadi Amiri

Task: Propose MultiDelete, a multimodal machine unlearning method that removes knowledge of specific training samples from an already trained model.

Motivation: Unlearning in multimodal settings faces unique challenges, such as complex dependencies between data modalities and the high cost of training on large multimodal datasets.

Details

Method: MultiDelete decouples the associations between unimodal data points marked for deletion through three key properties: modality decoupling, multimodal knowledge retention, and unimodal knowledge retention. Result: Experiments on two architectures and four datasets show that MultiDelete improves on the best baseline by an average of 17.6 points in unlearning multimodal samples while retaining the original model's multimodal and unimodal knowledge. Conclusion: MultiDelete is an efficient multimodal unlearning method that protects unlearned data against adversarial attacks while preserving the model's knowledge representations. Abstract: Machine Unlearning removes specific knowledge about training data samples from an already trained model. It has significant practical benefits, such as purging private, inaccurate, or outdated information from trained models without the need for complete re-training. Unlearning within a multimodal setting presents unique challenges due to the complex dependencies between different data modalities and the expensive cost of training on large multimodal datasets and architectures. This paper presents the first machine unlearning approach for multimodal data and models, titled MultiDelete, which is designed to decouple associations between unimodal data points during unlearning without losing the overall representation strength of the trained model. MultiDelete advocates for three key properties for effective multimodal unlearning: (a) modality decoupling, which effectively decouples the association between individual unimodal data points marked for deletion, rendering them as unrelated data points, (b) multimodal knowledge retention, which retains the multimodal representation post-unlearning, and (c) unimodal knowledge retention, which retains the unimodal representation post-unlearning. MultiDelete is efficient to train and is not constrained by using a strongly convex loss -- a common restriction among existing baselines. Experiments on two architectures and four datasets, including image-text and graph-text datasets, show that MultiDelete gains an average improvement of 17.6 points over the best-performing baseline in unlearning multimodal samples, can maintain the multimodal and unimodal knowledge of the original model post-unlearning, and can provide better protection to unlearned data against adversarial attacks.

Jonas Brändli,Maurice Schneeberger,Lisa Herzog,Loran Avci,Nordin Dari,Martin Häansel,Hakim Baazaoui,Pascal Bühler,Susanne Wegener,Beate Sick

Task: Enhance the interpretability and explainability of multi-modal prediction models that integrate imaging and tabular patient data.

Motivation: Combining statistical and deep learning approaches yields high prediction performance together with interpretable parameter estimates, and xAI methods can further improve model explainability.

Details

Method: Grad-CAM and Occlusion are adapted to deep transformation models (dTMs), which are trained on imaging and tabular data from 407 stroke patients to predict functional outcome three months after stroke. Result: The dTMs reach AUC values close to 0.8; the key tabular predictors are functional independence before stroke and NIHSS on admission; explanation maps highlight critical brain regions associated with unfavorable outcomes, such as the frontal lobe. Conclusion: Adapting explanation-map methods to dTMs improves the explainability of multi-modal models and supports error analysis and hypothesis generation. Abstract: Aim: This study aims to enhance interpretability and explainability of multi-modal prediction models integrating imaging and tabular patient data. Methods: We adapt the xAI methods Grad-CAM and Occlusion to multi-modal, partly interpretable deep transformation models (dTMs). dTMs combine statistical and deep learning approaches to simultaneously achieve state-of-the-art prediction performance and interpretable parameter estimates, such as odds ratios for tabular features. Based on brain imaging and tabular data from 407 stroke patients, we trained dTMs to predict functional outcome three months after stroke. We evaluated the models using different discriminatory metrics. The adapted xAI methods were used to generate explanation maps for identification of relevant image features and error analysis. Results: The dTMs achieve state-of-the-art prediction performance, with area under the curve (AUC) values close to 0.8. The most important tabular predictors of functional outcome are functional independence before stroke and NIHSS on admission, a neurological score indicating stroke severity. Explanation maps calculated from brain imaging dTMs for functional outcome highlighted critical brain regions such as the frontal lobe, which is known to be linked to age, which in turn increases the risk for unfavorable outcomes. Similarity plots of the explanation maps revealed distinct patterns which give insight into stroke pathophysiology, support developing novel predictors of stroke outcome, and enable the identification of false predictions. Conclusion: By adapting methods for explanation maps to dTMs, we enhanced the explainability of multi-modal and partly interpretable prediction models. The resulting explanation maps facilitate error analysis and support hypothesis generation regarding the significance of specific image regions in outcome prediction.
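Occlusion, one of the two xAI methods named above, can be sketched in a few lines: hide one patch at a time and record how much the model's score drops. This is a generic 2D illustration, not the adaptation to dTMs described in the abstract; `toy_predict`, the patch size, and the fill value are assumptions chosen for the example.

```python
def occlusion_map(image, predict, patch=2, fill=0.0):
    """Score drop per pixel when the patch containing it is replaced by `fill`.

    image: list of rows of floats; predict: callable mapping an image to a score.
    """
    height, width = len(image), len(image[0])
    baseline = predict(image)
    heat = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = range(top, min(top + patch, height))
            cols = range(left, min(left + patch, width))
            occluded = [row[:] for row in image]  # copy so the input is untouched
            for r in rows:
                for c in cols:
                    occluded[r][c] = fill
            drop = baseline - predict(occluded)
            for r in rows:
                for c in cols:
                    heat[r][c] = drop
    return heat

def toy_predict(image):
    """Hypothetical model whose score depends only on the top-left 2x2 block."""
    return sum(image[r][c] for r in range(2) for c in range(2))
```

Regions with a large drop are the ones the model relies on; in the toy case only the top-left block lights up, because that is the only region `toy_predict` reads.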

Subjective Visual Quality Assessment for High-Fidelity Learning-Based Image Compression

Mohsen Jenadeleh,Jon Sneyers,Panqi Jia,Shima Mohammadi,Joao Ascenso,Dietmar Saupe

Task: Conduct a subjective visual quality assessment of JPEG AI-compressed images and quantify perceptual differences with the JPEG AIC-3 methodology.

Motivation: Learning-based image compression outperforms traditional codecs in rate-distortion performance and perceptual quality; JPEG AI is the latest standardized framework in this area, so the quality of its compressed images needs rigorous evaluation.

Details

Method: A dataset of 50 compressed images is generated, a large-scale crowdsourced experiment collects 96,200 triplet responses, and JND-based quality scales are reconstructed with a unified model. Result: The CVVDP metric performs best, but most metrics are overly optimistic in predicting the quality of JPEG AI-compressed images. Conclusion: Rigorous subjective evaluation of modern image codecs in the high-fidelity range is necessary; the Meng-Rosenthal-Rubin statistical test is introduced to Quality of Experience research. Abstract: Learning-based image compression methods have recently emerged as promising alternatives to traditional codecs, offering improved rate-distortion performance and perceptual quality. JPEG AI represents the latest standardized framework in this domain, leveraging deep neural networks for high-fidelity image reconstruction. In this study, we present a comprehensive subjective visual quality assessment of JPEG AI-compressed images using the JPEG AIC-3 methodology, which quantifies perceptual differences in terms of Just Noticeable Difference (JND) units. We generated a dataset of 50 compressed images with fine-grained distortion levels from five diverse sources. A large-scale crowdsourced experiment collected 96,200 triplet responses from 459 participants. We reconstructed JND-based quality scales using a unified model based on boosted and plain triplet comparisons. Additionally, we evaluated the alignment of multiple objective image quality metrics with human perception in the high-fidelity range. The CVVDP metric achieved the overall highest performance; however, most metrics including CVVDP were overly optimistic in predicting the quality of JPEG AI-compressed images. These findings emphasize the necessity for rigorous subjective evaluations in the development and benchmarking of modern image codecs, particularly in the high-fidelity range. Another technical contribution is the introduction of the well-known Meng-Rosenthal-Rubin statistical test to the field of Quality of Experience research. This test can reliably assess the significance of differences in the performance of quality metrics in terms of correlation between metrics and ground truth. The complete dataset, including all subjective scores, is publicly available at https://github.com/jpeg-aic/dataset-JPEG-AI-SDR25.
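The Meng-Rosenthal-Rubin test mentioned above compares two correlations that share one variable, here two quality metrics each correlated with the same ground truth. The sketch below follows the commonly cited form of the statistic; it is not taken from the paper's code, the parameter names are illustrative, and the formula should be checked against the original reference before being relied on.

```python
import math

def meng_rosenthal_rubin_z(r1, r2, r12, n):
    """z statistic for H0: corr(metric1, truth) == corr(metric2, truth).

    r1, r2: correlation of each metric with the ground truth
    r12:    correlation between the two metrics
    n:      number of items scored by both metrics
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher z-transform
    mean_sq = (r1 * r1 + r2 * r2) / 2.0
    f = min((1.0 - r12) / (2.0 * (1.0 - mean_sq)), 1.0)  # f is capped at 1
    h = (1.0 - f * mean_sq) / (1.0 - mean_sq)
    return (z1 - z2) * math.sqrt((n - 3) / (2.0 * (1.0 - r12) * h))

def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```

A call such as `two_sided_p(meng_rosenthal_rubin_z(0.90, 0.85, 0.80, 50))` would then indicate whether the gap between the two metrics' correlations is larger than expected by chance, given how strongly the metrics agree with each other.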

Leveraging State Space Models in Long Range Genomics

Matvei Popov,Aymen Kallala,Anirudha Ramesh,Narimane Hennouni,Shivesh Khaitan,Rick Gentry,Alain-Sam Cohen

Task: Explore how state space models (SSMs) perform on long-range genomics modeling tasks and compare them with conventional Transformer-based models.

Motivation: Conventional methods struggle with long-range dependencies, and Transformer models are limited by their computational complexity and their inability to extrapolate to longer sequences.

Details

Method: Two SSM architectures, Caduceus and Hawk, are benchmarked on long-range genomics modeling tasks under conditions parallel to a 50M-parameter Transformer baseline. Result: SSMs match Transformer performance and show strong zero-shot extrapolation, handling sequences 10 to 100 times longer than those seen in training. Conclusion: SSMs are efficient and scalable for long-range genomic analysis and well suited to the complex human genome. Abstract: Long-range dependencies are critical for understanding genomic structure and function, yet most conventional methods struggle with them. Widely adopted transformer-based models, while excelling at short-context tasks, are limited by the attention module's quadratic computational complexity and inability to extrapolate to sequences longer than those seen in training. In this work, we explore State Space Models (SSMs) as a promising alternative by benchmarking two SSM-inspired architectures, Caduceus and Hawk, on long-range genomics modeling tasks under conditions parallel to a 50M parameter transformer baseline. We discover that SSMs match transformer performance and exhibit impressive zero-shot extrapolation across multiple tasks, handling contexts 10 to 100 times longer than those seen during training, indicating more generalizable representations better suited for modeling the long and complex human genome. Moreover, we demonstrate that these models can efficiently process sequences of 1M tokens on a single GPU, allowing for modeling entire genomic regions at once, even in labs with limited compute. Our findings establish SSMs as efficient and scalable for long-context genomic analysis.
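Why a recurrent state space layer scales linearly with sequence length, and can be run on sequences longer than any seen in training, is easiest to see from the basic recurrence. The sketch below is the generic diagonal linear recurrence only; it is not the Caduceus or Hawk architecture, and the scalar input and parameter names `a`, `b`, `c` are assumptions for illustration.

```python
def ssm_scan(xs, a, b, c):
    """Diagonal linear state space recurrence with scalar input and output.

    h_t = a * h_{t-1} + b * x_t   (elementwise over the state)
    y_t = sum(c * h_t)
    """
    state = [0.0] * len(a)
    ys = []
    for x in xs:  # one constant-cost update per token, so O(len(xs)) overall
        state = [ai * hi + bi * x for ai, hi, bi in zip(a, state, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, state)))
    return ys
```

The state has a fixed size regardless of how many tokens have been consumed, so the same parameters apply unchanged to a sequence of 1K or 1M tokens, in contrast to attention, whose cost grows quadratically with length.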

Fast Globally Optimal and Geometrically Consistent 3D Shape Matching

Paul Roetzer,Florian Bernard

Task: Propose a new method for computing globally optimal and geometrically consistent 3D shape matchings.

Motivation: Geometric consistency (the preservation of neighbourhoods) is an important prior in 3D shape matching, but in practice it is often overlooked or achieved only under restrictive assumptions.

Details

Method: The surface of the source shape is represented as a collection of cyclic paths, which are matched to the target shape in a hyper product graph, casting the problem as a minimum-cost circulation flow problem. Result: The method is efficiently solvable in practice and produces high-quality matchings. Conclusion: The method provides a global, geometrically consistent, and scalable solution for 3D shape matching. Abstract: Geometric consistency, i.e. the preservation of neighbourhoods, is a natural and strong prior in 3D shape matching. Geometrically consistent matchings are crucial for many downstream applications, such as texture transfer or statistical shape modelling. Yet, in practice, geometric consistency is often overlooked, or only achieved under severely limiting assumptions (e.g. a good initialisation). In this work, we propose a novel formalism for computing globally optimal and geometrically consistent matchings between 3D shapes which is scalable in practice. Our key idea is to represent the surface of the source shape as a collection of cyclic paths, which are then consistently matched to the target shape. Mathematically, we construct a hyper product graph (between source and target shape), and then cast 3D shape matching as a minimum-cost circulation flow problem in this hyper graph, which yields global geometrically consistent matchings between both shapes. We empirically show that our formalism is efficiently solvable and that it leads to high-quality results.

Understanding Machine Unlearning Through the Lens of Mode Connectivity

Jiali Cheng,Hadi Amiri

Task: Study mode connectivity in machine unlearning and how it behaves under different conditions.

Motivation: Explore understudied aspects of the loss landscape and optimization dynamics in machine unlearning, in particular the phenomenon of mode connectivity.

Details

Method: Mode connectivity is analyzed across different unlearning methods, models trained with and without curriculum learning, and first-order and second-order optimization techniques. Result: Distinct fluctuation patterns of different evaluation metrics along the curve are found, along with mechanistic (dis)similarity between unlearning methods. Conclusion: This is the first study of mode connectivity in the context of machine unlearning, revealing its distinctive behavior and underlying mechanisms. Abstract: Machine Unlearning aims to remove undesired information from trained models without requiring full retraining from scratch. Despite recent advancements, the underlying loss landscapes and optimization dynamics have received less attention. In this paper, we investigate and analyze machine unlearning through the lens of mode connectivity - the phenomenon where independently trained models can be connected by smooth low-loss paths in the parameter space. We define and study mode connectivity in unlearning across a range of overlooked conditions, including connections between different unlearning methods, models trained with and without curriculum learning, and models optimized with first-order and second-order techniques. Our findings show distinct patterns of fluctuation of different evaluation metrics along the curve, as well as the mechanistic (dis)similarity between unlearning methods. To the best of our knowledge, this is the first study on mode connectivity in the context of machine unlearning.
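The simplest way to probe mode connectivity is to evaluate a metric along the straight line between two parameter vectors and look for a barrier. The sketch below shows only this linear case under toy assumptions; the study above refers to curves in parameter space, which may be non-linear, and `loss_fn` here is a hypothetical scalar loss rather than a real unlearning metric.

```python
def interpolate(theta_a, theta_b, t):
    """Point at fraction t on the straight line from theta_a to theta_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(theta_a, theta_b)]

def loss_along_path(theta_a, theta_b, loss_fn, steps=11):
    """(t, loss) pairs at evenly spaced points between the two models."""
    ts = [i / (steps - 1) for i in range(steps)]
    return [(t, loss_fn(interpolate(theta_a, theta_b, t))) for t in ts]

def barrier(path_losses):
    """Highest loss on the path minus the higher of the two endpoint losses."""
    losses = [loss for _, loss in path_losses]
    return max(losses) - max(losses[0], losses[-1])

def toy_loss(theta):
    """Hypothetical double-well loss with minima at theta[0] = -1 and +1."""
    return (theta[0] ** 2 - 1.0) ** 2
```

A barrier near zero suggests the two solutions sit in a connected low-loss region; a large barrier, as in the double-well toy, suggests they occupy distinct modes along that path.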

PEEL the Layers and Find Yourself: Revisiting Inference-time Data Leakage for Residual Neural Networks

Huzaifa Arif,Keerthiram Murugesan,Payel Das,Alex Gittens,Pin-Yu Chen

Task: Explore inference-time data leakage risks of deep neural networks (NNs) and propose PEEL, a new backward feature inversion method that recovers the input data of residual networks.

Motivation: Study the risk that a model service provider retrieves users' private data from model inference results alone, especially for residual networks, given their wide use in computer vision and the possibility that skip connections cause data leakage.

Details

Method: Inference-time data leakage is formulated as a constrained optimization problem, and PEEL recovers the input features of residual networks through layer-by-layer feature inversion. Result: PEEL performs strongly on facial image datasets and pre-trained classifiers, with recovery quality an order of magnitude better than existing methods in terms of MSE. Conclusion: The intermediate outputs of residual networks retain enough information for input recovery, which makes PEEL valuable for studying data leakage risks. Abstract: This paper explores inference-time data leakage risks of deep neural networks (NNs), where a curious and honest model service provider is interested in retrieving users' private data inputs solely based on the model inference results. Particularly, we revisit residual NNs due to their popularity in computer vision and our hypothesis that residual blocks are a primary cause of data leakage owing to the use of skip connections. By formulating inference-time data leakage as a constrained optimization problem, we propose a novel backward feature inversion method, PEEL, which can effectively recover block-wise input features from the intermediate output of residual NNs. The surprising results in high-quality input data recovery can be explained by the intuition that the output from these residual blocks can be considered as a noisy version of the input and thus the output retains sufficient information for input recovery. We demonstrate the effectiveness of our layer-by-layer feature inversion method on facial image datasets and pre-trained classifiers. Our results show that PEEL outperforms the state-of-the-art recovery methods by an order of magnitude when evaluated by mean squared error (MSE). The code is available at https://github.com/Huzaifa-Arif/PEEL
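The intuition that a residual block's output is "a noisy version of the input" can be made concrete with a toy block y = x + f(x). When f is a contraction, the input can be recovered from the output by fixed-point iteration. This is explicitly not PEEL, which the abstract describes as a constrained optimization; the block, the weight `w`, and the iteration count are assumptions chosen so the toy converges.

```python
import math

def residual_block(x, w=0.5):
    """Hypothetical residual block: y = x + w * tanh(x), elementwise."""
    return [xi + w * math.tanh(xi) for xi in x]

def invert_block(y, w=0.5, iters=100):
    """Recover x from y = x + w * tanh(x) by iterating x <- y - w * tanh(x).

    Converges because |w| < 1 makes the residual branch a contraction.
    """
    x = list(y)  # start from the output, which is already close to the input
    for _ in range(iters):
        x = [yi - w * math.tanh(xi) for yi, xi in zip(y, x)]
    return x
```

Stacking such inversions block by block, from the last block back to the first, mirrors the layer-by-layer structure of the attack, although real residual branches are not guaranteed to be contractive, which is one reason an optimization-based formulation is needed in practice.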

Retuve: Automated Multi-Modality Analysis of Hip Dysplasia with Open Source AI

Adam McArthur,Stephanie Wichuk,Stephen Burnside,Andrew Kirby,Alexander Scammon,Damian Sol,Abhilash Hareendranathan,Jacob L. Jaremko

Task: Develop Retuve, an open-source framework for multi-modality (ultrasound and X-ray) analysis of developmental dysplasia of the hip (DDH).

Motivation: Current DDH diagnostic methods lack standardization, and AI studies suffer from reproducibility issues due to limited data and code availability.

Details

Method: Retuve provides a complete and reproducible workflow, including open datasets, pre-trained models, training code and weights, and a user-friendly Python API. Result: The framework integrates segmentation and landmark detection models to automate the measurement of key diagnostic parameters such as the alpha angle and acetabular index. Conclusion: By following open-source principles, Retuve promotes transparency, collaboration, and accessibility in DDH research, with the potential to broaden DDH screening, facilitate early diagnosis, and improve patient outcomes. Abstract: Developmental dysplasia of the hip (DDH) poses significant diagnostic challenges, hindering timely intervention. Current screening methodologies lack standardization, and AI-driven studies suffer from reproducibility issues due to limited data and code availability. To address these limitations, we introduce Retuve, an open-source framework for multi-modality DDH analysis, encompassing both ultrasound (US) and X-ray imaging. Retuve provides a complete and reproducible workflow, offering open datasets comprising expert-annotated US and X-ray images, pre-trained models with training code and weights, and a user-friendly Python Application Programming Interface (API). The framework integrates segmentation and landmark detection models, enabling automated measurement of key diagnostic parameters such as the alpha angle and acetabular index. By adhering to open-source principles, Retuve promotes transparency, collaboration, and accessibility in DDH research. This initiative has the potential to democratize DDH screening, facilitate early diagnosis, and ultimately improve patient outcomes by enabling widespread screening and early intervention. The GitHub repository/code can be found here: https://github.com/radoss-org/retuve

AstroClearNet: Deep image prior for multi-frame astronomical image restoration

Yashil Sukurdeep,Fausto Navarro,Tamás Budavári

Task: Propose a self-supervised multi-frame method, based on deep image priors, for denoising, deblurring, and coadding ground-based astronomical exposures.

Motivation: Traditional methods fall short in recovering high-fidelity images from blurred observations, and coadding multiple frames in ground-based astronomy is complicated by point-spread function variations caused by atmospheric turbulence.

Details

Method: Designs a convolutional neural network that integrates information across multiple observations and enforces physical constraints. Result: Processing Hyper Suprime-Cam exposures yields sharper restored images. Conclusion: The method shows potential for astronomical image processing. Abstract: Recovering high-fidelity images of the night sky from blurred observations is a fundamental problem in astronomy, where traditional methods typically fall short. In ground-based astronomy, combining multiple exposures to enhance signal-to-noise ratios is further complicated by variations in the point-spread function caused by atmospheric turbulence. In this work, we present a self-supervised multi-frame method, based on deep image priors, for denoising, deblurring, and coadding ground-based exposures. Central to our approach is a carefully designed convolutional neural network that integrates information across multiple observations and enforces physically motivated constraints. We demonstrate the method's potential by processing Hyper Suprime-Cam exposures, yielding promising preliminary results with sharper restored images.

Holistic Fusion: Task- and Setup-Agnostic Robot Localization and State Estimation with Factor Graphs

Julian Nubert,Turcan Tuna,Jonas Frey,Cesar Cadena,Katherine J. Kuchenbecker,Shehryar Khattak,Marco Hutter

Task: Propose a flexible open-source multimodal sensor fusion solution for task- and setup-agnostic robot state estimation.

Motivation: Existing sensor fusion methods are usually designed for specific scenarios and lack the generality and flexibility needed for diverse real-world applications.

Details

Method: Uses a factor graph to formulate sensor fusion as the joint estimation of the local and global robot state and dynamic context variables, supporting an arbitrary number of absolute, local, and landmark measurements. Result: The HF framework achieves low-latency, smooth online state estimation on typical robot hardware while providing low-drift global localization at the IMU measurement rate. Conclusion: The HF framework is general and flexible, suits a variety of real-world applications, and its effectiveness is validated experimentally. Abstract: Seamless operation of mobile robots in challenging environments requires low-latency local motion estimation (e.g., dynamic maneuvers) and accurate global localization (e.g., wayfinding). While most existing sensor-fusion approaches are designed for specific scenarios, this work introduces a flexible open-source solution for task- and setup-agnostic multimodal sensor fusion that is distinguished by its generality and usability. Holistic Fusion (HF) formulates sensor fusion as a combined estimation problem of i) the local and global robot state and ii) a (theoretically unlimited) number of dynamic context variables, including automatic alignment of reference frames; this formulation fits countless real-world applications without any conceptual modifications. The proposed factor-graph solution enables the direct fusion of an arbitrary number of absolute, local, and landmark measurements expressed with respect to different reference frames by explicitly including them as states in the optimization and modeling their evolution as random walks. Moreover, local smoothness and consistency receive particular attention to prevent jumps in the robot state belief. HF enables low-latency and smooth online state estimation on typical robot hardware while simultaneously providing low-drift global localization at the IMU measurement rate. The efficacy of this released framework is demonstrated in five real-world scenarios on three robotic platforms, each with distinct task requirements.

ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis

Yun Chang,Leonor Fermoselle,Duy Ta,Bernadette Bucher,Luca Carlone,Jiuguang Wang

Task: Propose ASHiTA, a framework that breaks high-level tasks down into subtasks grounded in a 3D scene graph.

Motivation: Current methods struggle to ground abstract, high-level instructions to a 3D scene, and task decomposition is environment-dependent.

Details

Method: ASHiTA combines LLM-assisted task decomposition with task-driven 3D scene graph construction. Result: ASHiTA significantly outperforms LLM baselines at task breakdown and achieves grounding performance comparable to state-of-the-art methods. Conclusion: ASHiTA offers an effective solution for grounding high-level tasks to 3D scenes. Abstract: While recent work in scene reconstruction and understanding has made strides in grounding natural language to physical 3D environments, it is still challenging to ground abstract, high-level instructions to a 3D scene. High-level instructions might not explicitly invoke semantic elements in the scene, and even the process of breaking a high-level task into a set of more concrete subtasks, a process called hierarchical task analysis, is environment-dependent. In this work, we propose ASHiTA, the first framework that generates a task hierarchy grounded to a 3D scene graph by breaking down high-level tasks into grounded subtasks. ASHiTA alternates LLM-assisted hierarchical task analysis, to generate the task breakdown, with task-driven 3D scene graph construction to generate a suitable representation of the environment. Our experiments show that ASHiTA performs significantly better than LLM baselines in breaking down high-level tasks into environment-dependent subtasks and is additionally able to achieve grounding performance comparable to state-of-the-art methods.

Image registration of 2D optical thin sections in a 3D porous medium: Application to a Berea sandstone digital rock image

Jaehong Chung,Wei Cai,Tapan Mukerji

Task: Propose a systematic image registration approach for aligning 2D optical thin-section images within a 3D digital rock volume.

Motivation: Multimodal image registration integrates complementary imaging modalities to improve the rock properties computed in digital rock physics.

Details

Method: Uses template image matching with differential evolution optimization to identify the most similar 2D plane in 3D. Result: Achieves exact registration on a synthetic porous medium and a structural similarity index (SSIM) of 0.990 on Berea sandstone. The thin-section image reveals 50% more porosity and submicron pores than the registered CT plane, and yields lower elastic moduli. Conclusion: Multimodal image registration shows potential in digital rock physics, integrating complementary imaging modalities to provide more complete geological information. Abstract: This study proposes a systematic image registration approach to align 2D optical thin-section images within a 3D digital rock volume. Using template image matching with differential evolution optimization, we identify the most similar 2D plane in 3D. The method is validated on a synthetic porous medium, achieving exact registration, and applied to Berea sandstone, where it achieves a structural similarity index (SSIM) of 0.990. With the registered images, we explore upscaling properties based on paired multimodal images, focusing on pore characteristics and effective elastic moduli. The thin-section image reveals 50% more porosity and submicron pores than the registered CT plane. In addition, bulk and shear moduli from thin sections are 25% and 30% lower, respectively, than those derived from CT images. Beyond numerical comparisons, thin sections provide additional geological insights, including cementation, mineral phases, and weathering effects, which are not clear in CT images. This study demonstrates the potential of multimodal image registration to improve computed rock properties in digital rock physics by integrating complementary imaging modalities.
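
To make "find the most similar 2D plane in 3D" concrete, the sketch below scores candidate planes against a template with a similarity index and keeps the best one. Two simplifications are assumed here and are not from the paper: the search is an exhaustive scan over axis-aligned z-slices rather than differential evolution over arbitrary planes, and SSIM is computed over the whole image as a single window rather than averaged over local windows. The tiny binary volume is hypothetical.

```python
def global_ssim(img_a, img_b, data_range=1.0):
    """SSIM over the whole image as one window (standard SSIM averages local windows)."""
    a = [v for row in img_a for v in row]
    b = [v for row in img_b for v in row]
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((v - mu_a) ** 2 for v in a) / n
    var_b = sum((v - mu_b) ** 2 for v in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )


def best_slice(volume, template):
    """Exhaustively score every z-slice; a stand-in for the differential evolution search."""
    scores = [global_ssim(plane, template) for plane in volume]
    z = max(range(len(scores)), key=scores.__getitem__)
    return z, scores[z]


# Hypothetical 4-slice binary "rock" volume (1 = pore, 0 = grain).
volume = [
    [[0, 0, 1], [0, 1, 0], [1, 0, 0]],
    [[1, 1, 0], [0, 0, 0], [0, 0, 1]],
    [[0, 1, 0], [1, 1, 1], [0, 1, 0]],
    [[1, 0, 1], [0, 0, 0], [1, 0, 1]],
]
template = volume[2]  # pretend this is the thin-section image
print(best_slice(volume, template))  # slice index 2 with a score of about 1.0
```

An exhaustive scan only works because the candidate set here is four slices. Once plane position and orientation are continuous, a global optimizer such as differential evolution becomes the natural choice.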

Disentangle and Regularize: Sign Language Production with Articulator-Based Disentanglement and Channel-Aware Regularization

Sumeyye Meryem Tasyurek,Tugce Kiziltepe,Hacer Yalim Keles

Task: Propose a gloss-free, transformer-based sign language production framework that maps spoken-language text directly to sign pose sequences.

Motivation: Remove the reliance of conventional sign language production methods on gloss annotations, and make the generated poses more structured and interpretable.

Details

Method: A pose autoencoder encodes sign poses into a compact latent space, a non-autoregressive transformer decoder predicts the latent representations from text embeddings, and channel-aware regularization guides training. Result: Achieves state-of-the-art performance on the PHOENIX14T dataset using only a modest training set. Conclusion: Without gloss supervision or pretrained models, the method offers an efficient and interpretable solution for sign language production. Abstract: In this work, we propose a simple gloss-free, transformer-based sign language production (SLP) framework that directly maps spoken-language text to sign pose sequences. We first train a pose autoencoder that encodes sign poses into a compact latent space using an articulator-based disentanglement strategy, where features corresponding to the face, right hand, left hand, and body are modeled separately to promote structured and interpretable representation learning. Next, a non-autoregressive transformer decoder is trained to predict these latent representations from sentence-level text embeddings. To guide this process, we apply channel-aware regularization by aligning predicted latent distributions with priors extracted from the ground-truth encodings using a KL-divergence loss. The contribution of each channel to the loss is weighted according to its associated articulator region, enabling the model to account for the relative importance of different articulators during training. Our approach does not rely on gloss supervision or pretrained models, and achieves state-of-the-art results on the PHOENIX14T dataset using only a modest training set.
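
One way to picture the channel-aware regularization is as a sum of per-channel KL terms, each scaled by a weight tied to its articulator region. The sketch below assumes each latent channel is summarized by a univariate Gaussian (mean, standard deviation), which the abstract does not state. The direction of the KL, the region weights, and the channel-to-region assignment are likewise hypothetical.

```python
import math


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL( N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2) ) for one latent channel."""
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
        - 0.5
    )


def channel_weighted_kl(pred, prior, channel_region, region_weight):
    """Sum per-channel KL terms, each scaled by the weight of its articulator region."""
    total = 0.0
    for (mu_p, sig_p), (mu_q, sig_q), region in zip(pred, prior, channel_region):
        total += region_weight[region] * gaussian_kl(mu_p, sig_p, mu_q, sig_q)
    return total


# Hypothetical setup: four latent channels, one per articulator region.
channel_region = ["face", "right_hand", "left_hand", "body"]
region_weight = {"face": 0.5, "right_hand": 2.0, "left_hand": 2.0, "body": 1.0}
prior = [(0.0, 1.0)] * 4  # (mean, std) taken from ground-truth encodings
pred = [(0.0, 1.0), (1.0, 1.0), (0.0, 1.0), (0.0, 1.0)]  # only the right hand is off

print(channel_weighted_kl(pred, prior, channel_region, region_weight))  # 1.0
```

With these made-up weights, a unit error on a hand channel costs four times as much as the same error on a face channel, which is the kind of relative emphasis the abstract describes.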

Setup-Invariant Augmented Reality for Teaching by Demonstration with Surgical Robots

Alexandre Banks,Richard Cook,Septimiu E. Salcudean

Task: Develop dV-STEAR, an open-source system that provides expert-informed augmented reality (AR) training for robotic surgery education.

Motivation: Existing AR systems require expert supervision and do not account for differences between mentor and mentee robot configurations, which limits autonomous training for novices.

Details

Method: dV-STEAR plays back task-aligned expert demonstrations without requiring identical setup joint positions between expert and novice, and the accuracy of its pose estimation is quantified. Result: dV-STEAR significantly improved novice performance on Fundamentals of Laparoscopic Surgery tasks, including completion speed, collision time, and success rate, while yielding more balanced hand use and lower frustration. Conclusion: dV-STEAR is an effective educational tool that lays the foundation for further AR integration into robot-assisted surgery. Abstract: Augmented reality (AR) is an effective tool in robotic surgery education as it combines exploratory learning with three-dimensional guidance. However, existing AR systems require expert supervision and do not account for differences in the mentor and mentee robot configurations. To enable novices to train outside the operating room while receiving expert-informed guidance, we present dV-STEAR: an open-source system that plays back task-aligned expert demonstrations without assuming identical setup joint positions between expert and novice. Pose estimation was rigorously quantified, showing a registration error of 3.86 (SD = 2.01) mm. In a user study (N=24), dV-STEAR significantly improved novice performance on tasks from the Fundamentals of Laparoscopic Surgery. In a single-handed ring-over-wire task, dV-STEAR increased completion speed (p=0.03) and reduced collision time (p=0.01) compared to dry-lab training alone. During a pick-and-place task, it improved success rates (p=0.004). Across both tasks, participants using dV-STEAR exhibited significantly more balanced hand use and reported lower frustration levels. This work presents a novel educational tool implemented on the da Vinci Research Kit, demonstrates its effectiveness in teaching novices, and builds the foundation for further AR integration into robot-assisted surgery.

CAT: Circular-Convolutional Attention for Sub-Quadratic Transformers

Yoshihiro Yamada

Task: Propose Circular-convolutional ATtention (CAT), a Fourier-based mechanism that reduces the complexity of Transformer attention.

Motivation: The O(N^2) complexity of standard attention limits its scalability to long sequences.

Details

Method: Applies circular convolutions via the Fourier transform, reducing complexity to O(N log N) with fewer learnable parameters. Result: On large-scale benchmarks such as ImageNet-1k and WikiText-103, CAT achieves about a 10% speedup and consistent accuracy improvements. Conclusion: CAT offers practical efficiency and ease of implementation, and provides guidance for designing next-generation, high-performance Transformer architectures. Abstract: Transformers have driven remarkable breakthroughs in natural language processing and computer vision, yet their standard attention mechanism still imposes O(N^2) complexity, hindering scalability to longer sequences. We introduce Circular-convolutional ATtention (CAT), a Fourier-based approach that efficiently applies circular convolutions to reduce complexity without sacrificing representational power. CAT achieves O(N log N) computations, requires fewer learnable parameters by streamlining fully-connected layers, and introduces no heavier operations, resulting in consistent accuracy improvements and about a 10% speedup in naive PyTorch implementations on large-scale benchmarks such as ImageNet-1k and WikiText-103. Grounded in an engineering-isomorphism framework, CAT's design not only offers practical efficiency and ease of implementation but also provides insights to guide the development of next-generation, high-performance Transformer architectures. Finally, our ablation studies highlight the key conditions underlying CAT's success, shedding light on broader principles for scalable attention mechanisms.
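
The O(N log N) figure follows from the convolution theorem: a circular convolution of two length-N sequences equals the inverse FFT of the product of their FFTs. The sketch below checks this in pure Python for a softmax-normalized weight vector mixed with one value channel. How CAT actually produces its weights is not specified in the abstract, so the scores here are hypothetical, and the small recursive `fft` helper assumes N is a power of two.

```python
import cmath
import math


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. The inverse is unnormalized."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circular_conv_direct(a, v):
    """O(N^2) reference: out[i] = sum_j a[j] * v[(i - j) mod N]."""
    n = len(a)
    return [sum(a[j] * v[(i - j) % n] for j in range(n)) for i in range(n)]


def circular_conv_fft(a, v):
    """O(N log N): multiply the spectra, then invert and normalize by N."""
    n = len(a)
    spectrum = [x * y for x, y in zip(fft(a), fft(v))]
    return [c.real / n for c in fft(spectrum, inverse=True)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


weights = softmax([0.2, -1.0, 0.5, 0.0, 1.5, -0.3, 0.7, 0.1])  # N hypothetical scores
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # one value channel
direct = circular_conv_direct(weights, values)
fast = circular_conv_fft(weights, values)
print(max(abs(p - q) for p, q in zip(direct, fast)))  # tiny: equal up to rounding
```

Because the mixing matrix is circulant, it is fully described by N weights instead of N^2 attention scores, which is where the savings in both computation and parameters come from.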

DIMA: DIffusing Motion Artifacts for unsupervised correction in brain MRI images

Paolo Angella,Luca Balbi,Fabrizio Ferrando,Paolo Traverso,Rosario Varriale,Vito Paolo Pastore,Matteo Santacesaria

Task: Propose DIMA, an unsupervised motion artifact correction framework for brain MRI images.

Motivation: Existing deep learning methods need paired motion-affected and motion-free images for training, which are hard to obtain in clinical settings.

Details

Method: A two-phase approach: a diffusion model is first trained on unpaired motion-affected images to learn the artifact distribution; it then generates realistic artifacts on clean images, and the resulting pairs are used to train a correction network. Result: Across multiple datasets and anatomical planes, DIMA performs comparably to state-of-the-art supervised approaches while generalizing better. Conclusion: DIMA makes motion artifact correction more accessible for routine clinical use, potentially reducing repeat scans and improving diagnostic accuracy. Abstract: Motion artifacts remain a significant challenge in Magnetic Resonance Imaging (MRI), compromising diagnostic quality and potentially leading to misdiagnosis or repeated scans. Existing deep learning approaches for motion artifact correction typically require paired motion-free and motion-affected images for training, which are rarely available in clinical settings. To overcome this requirement, we present DIMA (DIffusing Motion Artifacts), a novel framework that leverages diffusion models to enable unsupervised motion artifact correction in brain MRI. Our two-phase approach first trains a diffusion model on unpaired motion-affected images to learn the distribution of motion artifacts. This model then generates realistic motion artifacts on clean images, creating paired datasets suitable for supervised training of correction networks. Unlike existing methods, DIMA operates without requiring k-space manipulation or detailed knowledge of MRI sequence parameters, making it adaptable across different scanning protocols and hardware. Comprehensive evaluations across multiple datasets and anatomical planes demonstrate that our method achieves comparable performance to state-of-the-art supervised approaches while offering superior generalizability to real clinical data. DIMA represents a significant advancement in making motion artifact correction more accessible for routine clinical use, potentially reducing the need for repeat scans and improving diagnostic accuracy.

GraspClutter6D: A Large-scale Real-world Dataset for Robust Perception and Grasping in Cluttered Scenes

Seunghyeok Back,Joosoon Lee,Kangmin Kim,Heeseon Rho,Geonhyup Lee,Raeyoung Kang,Sangbeom Lee,Sangjun Noh,Youngjin Lee,Taeyeop Lee,Kyoobin Lee

Task: Present and evaluate the GraspClutter6D dataset to address the challenges of robotic grasping in cluttered environments.

Motivation: Existing datasets are too simplistic, with limited diversity and occlusion, which restricts their real-world applicability.

Details

Method: Builds a large-scale dataset with 1,000 highly cluttered scenes, 200 objects, and 75 environment configurations, annotated with 736K 6D object poses and 9.3B feasible grasps. Result: In simulation and real-world experiments, grasping networks trained on GraspClutter6D significantly outperform those trained on existing datasets. Conclusion: GraspClutter6D provides an effective training resource for robotic grasping in cluttered environments and advances the field. Abstract: Robust grasping in cluttered environments remains an open challenge in robotics. While benchmark datasets have significantly advanced deep learning methods, they mainly focus on simplistic scenes with light occlusion and insufficient diversity, limiting their applicability to practical scenarios. We present GraspClutter6D, a large-scale real-world grasping dataset featuring: (1) 1,000 highly cluttered scenes with dense arrangements (14.1 objects/scene, 62.6% occlusion), (2) comprehensive coverage across 200 objects in 75 environment configurations (bins, shelves, and tables) captured using four RGB-D cameras from multiple viewpoints, and (3) rich annotations including 736K 6D object poses and 9.3B feasible robotic grasps for 52K RGB-D images. We benchmark state-of-the-art segmentation, object pose estimation, and grasping detection methods to provide key insights into challenges in cluttered environments. Additionally, we validate the dataset's effectiveness as a training resource, demonstrating that grasping networks trained on GraspClutter6D significantly outperform those trained on existing datasets in both simulation and real-world experiments. The dataset, toolkit, and annotation tools are publicly available on our project website: https://sites.google.com/view/graspclutter6d.

Audio-visual Event Localization on Portrait Mode Short Videos

Wuyang Liu,Yi Chai,Yongpeng Yan,Yanzhen Ren

Task: Study audio-visual event localization (AVEL) on portrait mode short videos.

Motivation: Existing AVEL datasets mainly target landscape-oriented long videos, whereas short videos have become the dominant content format, and their portrait framing and complex audio backgrounds bring new challenges.

Details

Method: Introduces the AVE-PM dataset of 25,335 portrait mode short video clips and investigates tailored preprocessing and model design. Result: Existing methods suffer an average 18.66% performance drop in cross-mode evaluation, but tailored preprocessing and model design improve performance. Conclusion: The work provides a benchmark and practical insights for AVEL research in the mobile video era; the dataset and code will be released. Abstract: Audio-visual event localization (AVEL) plays a critical role in multimodal scene understanding. While existing datasets for AVEL predominantly comprise landscape-oriented long videos with clean and simple audio context, short videos have become the primary format of online video content due to the proliferation of smartphones. Short videos are characterized by portrait-oriented framing and layered audio compositions (e.g., overlapping sound effects, voiceovers, and music), which bring unique challenges unaddressed by conventional methods. To this end, we introduce AVE-PM, the first AVEL dataset specifically designed for portrait mode short videos, comprising 25,335 clips that span 86 fine-grained categories with frame-level annotations. Beyond dataset creation, our empirical analysis shows that state-of-the-art AVEL methods suffer an average 18.66% performance drop during cross-mode evaluation. Further analysis reveals two key challenges of different video formats: 1) spatial bias from portrait-oriented framing introduces distinct domain priors, and 2) noisy audio composition compromises the reliability of the audio modality. To address these issues, we investigate optimal preprocessing recipes and the impact of background music for AVEL on portrait mode videos. Experiments show that these methods can still benefit from tailored preprocessing and specialized model design, thus achieving improved performance. This work provides both a foundational benchmark and actionable insights for advancing AVEL research in the era of mobile-centric video content. Dataset and code will be released.

An Analysis of Temporal Dropout in Earth Observation Time Series for Regression Tasks

Miro Miranda,Francisco Mena,Andreas Dengel

Task: Address the impact of missing instances in time series data on deep learning models in regression tasks.

Motivation: Missing time-steps caused by satellite failure or cloud occlusion introduce uncertainty in the predicted output and degrade performance, and existing methods commonly overlook input-level uncertainty.

Details

Method: Proposes Monte Carlo Temporal Dropout (MC-TD) and Monte Carlo Concrete Temporal Dropout (MC-ConcTD): the former randomly drops time-steps to simulate missing data, and the latter learns the optimal dropout distribution. Result: On three EO time-series datasets, MC-ConcTD improves predictive performance and uncertainty calibration. Conclusion: Adaptive dropout tuning outperforms manual selection, making uncertainty quantification more robust and accessible for EO applications. Abstract: Missing instances in time series data pose a significant challenge to deep learning models, particularly in regression tasks. In the Earth Observation field, satellite failure or cloud occlusion frequently results in missing time-steps, introducing uncertainties in the predicted output and causing a decline in predictive performance. While many studies address missing time-steps through data augmentation to improve model robustness, the uncertainty arising at the input level is commonly overlooked. To address this gap, we introduce Monte Carlo Temporal Dropout (MC-TD), a method that explicitly accounts for input-level uncertainty by randomly dropping time-steps during inference using a predefined dropout ratio, thereby simulating the effect of missing data. To bypass the need for costly searches for the optimal dropout ratio, we extend this approach with Monte Carlo Concrete Temporal Dropout (MC-ConcTD), a method that learns the optimal dropout distribution directly. Both MC-TD and MC-ConcTD are applied during inference, leveraging Monte Carlo sampling for uncertainty quantification. Experiments on three EO time-series datasets demonstrate that MC-ConcTD improves predictive performance and uncertainty calibration compared to existing approaches. Additionally, we highlight the advantages of adaptive dropout tuning over manual selection, making uncertainty quantification more robust and accessible for EO applications.
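
The MC-TD idea of dropping time-steps at inference and reading uncertainty off the spread of predictions can be sketched in a few lines. Everything specific below is an assumption for illustration: `toy_model` stands in for a trained network, the dropout ratio of 0.3 is arbitrary, and the input series is made up. The learned dropout distribution of MC-ConcTD is not covered.

```python
import random
import statistics


def toy_model(series, mask):
    """Hypothetical regressor: mean of the observed time-steps (stand-in for a trained network)."""
    observed = [x for x, keep in zip(series, mask) if keep]
    return sum(observed) / len(observed) if observed else 0.0


def mc_temporal_dropout(model, series, drop_ratio=0.3, n_samples=200, seed=0):
    """Run the model n_samples times, each time dropping time-steps at random."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_samples):
        # Each time-step is kept with probability 1 - drop_ratio.
        mask = [rng.random() >= drop_ratio for _ in series]
        preds.append(model(series, mask))
    # Predictive mean and spread across the Monte Carlo samples.
    return statistics.mean(preds), statistics.pstdev(preds)


series = [0.21, 0.25, 0.40, 0.62, 0.71, 0.66, 0.48, 0.30]  # e.g. a vegetation-index curve
mean, std = mc_temporal_dropout(toy_model, series)
print(mean, std)
```

The standard deviation across samples serves as the uncertainty estimate. A prediction that changes a lot when a few time-steps go missing is one the model should be less confident about.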

Leveraging Anatomical Priors for Automated Pancreas Segmentation on Abdominal CT

Anisa V. Prasad,Tejas Sudharshan Mathai,Pritam Mukherjee,Jianfei Liu,Ronald M. Summers

Task: Use anatomical priors to improve pancreas segmentation on CT.

Motivation: Pancreas segmentation on CT is crucial for identifying pathologies and extracting imaging biomarkers, but prior work has focused on model architecture or pre- and post-processing techniques, leaving the utility of anatomical priors underexplored.

Details

Method: Trains two 3D full-resolution nnU-Net models, one with 8 refined labels from the PANORAMA dataset, and another that combines them with labels from the TotalSegmentator tool. Result: Anatomical priors increased the Dice score by 6% (p < .001), decreased the Hausdorff distance by 36.5 mm (p < .001), and avoided 8 failed detections. Conclusion: Anatomical priors improve pancreas segmentation performance and show promise for deriving imaging biomarkers. Abstract: An accurate segmentation of the pancreas on CT is crucial to identify pancreatic pathologies and extract imaging-based biomarkers. However, prior research on pancreas segmentation has primarily focused on modifying the segmentation model architecture or utilizing pre- and post-processing techniques. In this article, we investigate the utility of anatomical priors to enhance the segmentation performance of the pancreas. Two 3D full-resolution nnU-Net models were trained, one with 8 refined labels from the public PANORAMA dataset, and another that combined them with labels derived from the public TotalSegmentator (TS) tool. The addition of anatomical priors resulted in a 6% increase in Dice score (p < .001) and a 36.5 mm decrease in Hausdorff distance for pancreas segmentation (p < .001). Moreover, the pancreas was always detected when anatomy priors were used, whereas there were 8 instances of failed detections without their use. The use of anatomy priors shows promise for pancreas segmentation and subsequent derivation of imaging biomarkers.
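
For readers unfamiliar with the two metrics reported above, the sketch below computes a Dice score and a symmetric Hausdorff distance on a pair of tiny masks. The masks are hypothetical, distances are in pixel units rather than millimetres (a real evaluation would scale by voxel spacing), and the brute-force Hausdorff loop is only practical for small point sets.

```python
import math


def dice_score(pred, truth):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two sets of pixel coordinates."""
    if not pred and not truth:
        return 1.0
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


def hausdorff_distance(pred, truth):
    """Symmetric Hausdorff distance between two non-empty point sets (brute force)."""
    def directed(src, dst):
        # Largest distance from any point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(pred, truth), directed(truth, pred))


# Hypothetical 2D masks given as sets of (row, col) pixels.
truth = {(0, 0), (0, 1), (1, 0), (1, 1)}
pred = {(0, 1), (1, 1), (0, 2), (1, 2)}

print(dice_score(pred, truth))          # 0.5
print(hausdorff_distance(pred, truth))  # 1.0
```

Dice rewards overlap, while the Hausdorff distance penalizes the single worst boundary error, which is why a segmentation can score well on one and poorly on the other.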

Longitudinal Assessment of Lung Lesion Burden in CT

Tejas Sudharshan Mathai,Benjamin Hou,Ronald M. Summers

Task: Use 3D nnUNet models to automatically segment lung lesions and quantify each patient's total lesion burden.

Motivation: Lung cancer is the second major cause of death in the U.S., and early detection is crucial for treatment planning and improving outcomes, yet few studies have looked at longitudinal changes in lung tumor burden.

Details

Method: Trains two 3D nnUNet models, with and without anatomical priors, to automatically segment lung lesions and quantify total lesion burden. Result: The model without priors significantly outperformed the model with priors (p < .001). For detecting lesions > 1 cm, precision was 71.3%, sensitivity 68.4%, and F1-score 69.8%. For segmentation, the Dice score was 77.1 ± 20.3 and the Hausdorff distance error 11.7 ± 24.1 mm. The median lesion burden was 6.4 cc, and the median volume difference between manual and automated measurements was 0.02 cc. Conclusion: The approach produces a personalized evaluation of a patient's total tumor burden and facilitates tracking changes over time. Abstract: In the U.S., lung cancer is the second major cause of death. Early detection of suspicious lung nodules is crucial for patient treatment planning, management, and improving outcomes. Many approaches for lung nodule segmentation and volumetric analysis have been proposed, but few have looked at longitudinal changes in total lung tumor burden. In this work, we trained two 3D models (nnUNet) with and without anatomical priors to automatically segment lung lesions and quantified total lesion burden for each patient. The 3D model without priors significantly outperformed (p < .001) the model trained with anatomy priors. For detecting clinically significant lesions > 1 cm, a precision of 71.3%, sensitivity of 68.4%, and F1-score of 69.8% were achieved. For segmentation, a Dice score of 77.1 ± 20.3 and Hausdorff distance error of 11.7 ± 24.1 mm were obtained. The median lesion burden was 6.4 cc (IQR: 2.1, 18.1) and the median volume difference between manual and automated measurements was 0.02 cc (IQR: -2.8, 1.2). Agreements were also evaluated with linear regression and Bland-Altman plots. The proposed approach can produce a personalized evaluation of the total tumor burden for a patient and facilitate interval change tracking over time.
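
The three detection figures are tied together by the definition of F1 as the harmonic mean of precision and sensitivity, so the reported F1 can be checked from the other two. The sketch below does that arithmetic. The helper names are illustrative, and `detection_metrics` shows how the quantities would follow from lesion-level counts, which the abstract does not report.

```python
def f1_score(precision, sensitivity):
    """F1 is the harmonic mean of precision and sensitivity (recall)."""
    return 2.0 * precision * sensitivity / (precision + sensitivity)


def detection_metrics(tp, fp, fn):
    """Precision, sensitivity, and F1 from lesion-level detection counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return precision, sensitivity, f1_score(precision, sensitivity)


# Plugging in the abstract's reported precision and sensitivity (in percent).
print(round(f1_score(71.3, 68.4), 1))  # 69.8
```

The result matches the stated F1-score of 69.8%, so the three reported numbers are mutually consistent.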

Two by Two: Learning Multi-Task Pairwise Objects Assembly for Generalizable Robot Manipulation

Yu Qi,Yuanchen Ju,Tianming Wei,Chi Chu,Lawson L. S. Wong,Huazhe Xu

Task: Propose and validate a two-step SE(3) pose estimation method for the 3D assembly of everyday pairwise objects.

Motivation: Existing datasets and benchmarks focus mainly on assembling geometric fragments or factory parts, and do not adequately address the complexity of everyday object interaction and assembly.

Details

Method: Leveraging the 2BY2 dataset, proposes a two-step SE(3) pose estimation method with equivariant features. Result: Achieves state-of-the-art performance across all 18 tasks, and robot experiments validate the method's reliability and generalization ability. Conclusion: The 2BY2 dataset and method provide an effective solution for complex 3D assembly tasks. Abstract: 3D assembly tasks, such as furniture assembly and component fitting, play a crucial role in daily life and represent essential capabilities for future home robots. Existing benchmarks and datasets predominantly focus on assembling geometric fragments or factory parts, which fall short in addressing the complexities of everyday object interactions and assemblies. To bridge this gap, we present 2BY2, a large-scale annotated dataset for daily pairwise objects assembly, covering 18 fine-grained tasks that reflect real-life scenarios, such as plugging into sockets, arranging flowers in vases, and inserting bread into toasters. The 2BY2 dataset includes 1,034 instances and 517 pairwise objects with pose and symmetry annotations, requiring approaches that align geometric shapes while accounting for functional and spatial relationships between objects. Leveraging the 2BY2 dataset, we propose a two-step SE(3) pose estimation method with equivariant features for assembly constraints. Compared to previous shape assembly methods, our approach achieves state-of-the-art performance across all 18 tasks in the 2BY2 dataset. Additionally, robot experiments further validate the reliability and generalization ability of our method for complex 3D assembly tasks.

RayFronts: Open-Set Semantic Ray Frontiers for Online Scene Understanding and Exploration

Omar Alama,Avigyan Bhattacharya,Haoyang He,Seungchan Kim,Yuheng Qiu,Wenshan Wang,Cherie Ho,Nikhil Keetha,Sebastian Scherer

Task: Propose RayFronts, a unified representation that enables both dense and beyond-range efficient semantic mapping.

Motivation: Existing mapping methods are limited by depth range or map beyond-range entities only in constrained settings, fail to combine within-range and beyond-range observations, and trade off fine-grained semantics against efficiency.

Details

Method: Introduces RayFronts, a unified representation that encodes task-agnostic open-set semantics into both in-range voxels and beyond-range rays, significantly reducing search volumes and supporting efficient decision-making. Result: RayFronts delivers 1.34x the zero-shot 3D semantic segmentation performance with 16.5x higher throughput, and reduces search volume 2.2x more efficiently than the closest online baselines. Conclusion: RayFronts is an efficient and practical open-set semantic mapping method suited to robot decision-making and exploration. Abstract: Open-set semantic mapping is crucial for open-world robots. Current mapping approaches either are limited by the depth range or only map beyond-range entities in constrained settings, where overall they fail to combine within-range and beyond-range observations. Furthermore, these methods make a trade-off between fine-grained semantics and efficiency. We introduce RayFronts, a unified representation that enables both dense and beyond-range efficient semantic mapping. RayFronts encodes task-agnostic open-set semantics to both in-range voxels and beyond-range rays encoded at map boundaries, empowering the robot to reduce search volumes significantly and make informed decisions both within and beyond sensory range, while running at 8.84 Hz on an Orin AGX. Benchmarking the within-range semantics shows that RayFronts's fine-grained image encoding provides 1.34x zero-shot 3D semantic segmentation performance while improving throughput by 16.5x. Traditionally, online mapping performance is entangled with other system components, complicating evaluation. We propose a planner-agnostic evaluation framework that captures the utility for online beyond-range search and exploration, and show RayFronts reduces search volume 2.2x more efficiently than the closest online baselines.

Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation

Israfel Salazar,Manuel Fernández Burda,Shayekh Bin Islam,Arshia Soltani Moakhar,Shivalika Singh,Fabian Farestam,Angelika Romanou,Danylo Boiko,Dipika Khullar,Mike Zhang,Dominik Krzemiński,Jekaterina Novikova,Luísa Shimabucoro,Joseph Marvin Imperial,Rishabh Maheshwary,Sharad Duwal,Alfonso Amayuelas,Swati Rajwal,Jebish Purbey,Ahmed Ruby,Nicholas Popovič,Marek Suppa,Azmine Toushik Wasi,Ram Mohan Rao Kadiyala,Olga Tsymboi,Maksim Kostritsya,Bardia Soltani Moakhar,Gabriel da Costa Merlin,Otávio Ferracioli Coletti,Maral Jabbari Shiviari,MohammadAmin farahani fard,Silvia Fernandez,María Grandury,Dmitry Abulkhanov,Drishti Sharma,Andre Guarnier De Mitri,Leticia Bossatto Marchezi,Johan Obando-Ceron,Nazar Kohut,Beyza Ermis,Desmond Elliott,Enzo Ferrante,Sara Hooker,Marzieh Fadaee

Task: Propose Kaleidoscope, a large-scale multimodal benchmark for the multilingual evaluation of vision-language models.

Motivation: Existing evaluation relies mainly on English benchmarks and lacks multilingual and multicultural coverage, and many multilingual benchmarks are translations of English datasets that fail to capture cultural nuances.

Details

Method: Through an open science collaboration with researchers worldwide, builds Kaleidoscope, a multimodal benchmark of 20,911 multiple-choice questions covering 18 languages and 14 subjects. Result: The evaluation finds that current top-performing multilingual vision-language models perform poorly on low-resource languages and in complex multimodal scenarios. Conclusion: The results highlight the need for more culturally inclusive multimodal evaluation frameworks. Abstract: The evaluation of vision-language models (VLMs) has mainly relied on English-language benchmarks, leaving significant gaps in both multilingual and multicultural coverage. While multilingual benchmarks have expanded, both in size and languages, many rely on translations of English datasets, failing to capture cultural nuances. In this work, we propose Kaleidoscope, the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope is a large-scale, in-language multimodal benchmark designed to evaluate VLMs across diverse languages and visual inputs. Kaleidoscope covers 18 languages and 14 different subjects, amounting to a total of 20,911 multiple-choice questions. Built through an open science collaboration with a diverse group of researchers worldwide, Kaleidoscope ensures linguistic and cultural authenticity. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages and in complex multimodal scenarios. Our results highlight the need for progress on culturally inclusive multimodal evaluation frameworks.

SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

Boyuan Zheng,Michael Y. Fatemi,Xiaolong Jin,Zora Zhiruo Wang,Apurva Gandhi,Yueqi Song,Yu Gu,Jayanth Srinivasa,Gaowen Liu,Graham Neubig,Yu Su

Task: Propose SkillWeaver, a framework that lets autonomous web agents self-improve by synthesizing reusable skills as APIs.

Motivation: Existing autonomous web agents lack crucial self-improvement capabilities such as procedural knowledge abstraction, skill refinement, and skill composition.

Details

Method: SkillWeaver autonomously discovers skills, executes them for practice, and distills the experience into lightweight APIs, continually expanding its skill library. Result: Experiments on WebArena and real-world websites show relative success-rate improvements of 31.8% and 39.8%, respectively, and APIs synthesized by strong agents improve weaker agents by up to 54.3%. Conclusion: By distilling diverse website interactions into shareable APIs, SkillWeaver significantly enhances the capabilities of web agents and lets them share skills. Abstract: To survive and thrive in complex environments, humans have evolved sophisticated self-improvement mechanisms through environment exploration, hierarchical abstraction of experiences into reusable skills, and collaborative construction of an ever-growing skill repertoire. Despite recent advancements, autonomous web agents still lack crucial self-improvement capabilities, struggling with procedural knowledge abstraction, refining skills, and skill composition. In this work, we introduce SkillWeaver, a skill-centric framework enabling agents to self-improve by autonomously synthesizing reusable skills as APIs. Given a new website, the agent autonomously discovers skills, executes them for practice, and distills practice experiences into robust APIs. Iterative exploration continually expands a library of lightweight, plug-and-play APIs, significantly enhancing the agent's capabilities. Experiments on WebArena and real-world websites demonstrate the efficacy of SkillWeaver, achieving relative success rate improvements of 31.8% and 39.8%, respectively. Additionally, APIs synthesized by strong agents substantially enhance weaker agents through transferable skills, yielding improvements of up to 54.3% on WebArena. These results demonstrate the effectiveness of honing diverse website interactions into APIs, which can be seamlessly shared among various web agents.