2025 04 02

Medical Reasoning in LLMs: An In-Depth Analysis of DeepSeek R1

Birger Moell,Fredrik Sand Aronsson,Sanian Akbar

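Task: Evaluate DeepSeek R1's medical reasoning and compare it with the reasoning patterns of clinical experts.

Motivation: Verify whether the reasoning of large language models such as DeepSeek R1 aligns with that of clinical experts in the medical domain, to ensure reliability and practical usefulness.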

Details

Method: Evaluate DeepSeek R1's medical reasoning on 100 MedQA clinical cases.

Result: The model reached 93% diagnostic accuracy and showed systematic clinical judgment, but also limitations such as anchoring bias and difficulty reconciling conflicting data.

Conclusion: DeepSeek R1 has foundational clinical reasoning ability, but needs further work on bias mitigation, knowledge updates, and structured reasoning frameworks to be more reliable in medical decision-making.

Abstract: Integrating large language models (LLMs) like DeepSeek R1 into healthcare requires rigorous evaluation of their reasoning alignment with clinical expertise. This study assesses DeepSeek R1's medical reasoning against expert patterns using 100 MedQA clinical cases. The model achieved 93% diagnostic accuracy, demonstrating systematic clinical judgment through differential diagnosis, guideline-based treatment selection, and integration of patient-specific factors. However, error analysis of seven incorrect cases revealed persistent limitations: anchoring bias, challenges reconciling conflicting data, insufficient exploration of alternatives, overthinking, knowledge gaps, and premature prioritization of definitive treatment over intermediate care. Crucially, reasoning length correlated with accuracy: shorter responses (<5,000 characters) were more reliable, suggesting extended explanations may signal uncertainty or rationalization of errors. While DeepSeek R1 exhibits foundational clinical reasoning capabilities, recurring flaws highlight critical areas for refinement, including bias mitigation, knowledge updates, and structured reasoning frameworks. These findings underscore LLMs' potential to augment medical decision-making through artificial reasoning but emphasize the need for domain-specific validation, interpretability safeguards, and confidence metrics (e.g., response length thresholds) to ensure reliability in real-world applications.
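
The response-length signal mentioned in the abstract could be turned into a simple review flag. The sketch below is illustrative only: the function name `needs_review` is hypothetical, and the 5,000-character cutoff is simply the figure quoted in the abstract, not a validated clinical threshold.

```python
# Hypothetical helper: flag long reasoning traces for extra human review.
LENGTH_THRESHOLD_CHARS = 5_000  # figure quoted in the abstract; an assumption as a cutoff


def needs_review(reasoning_text: str, threshold: int = LENGTH_THRESHOLD_CHARS) -> bool:
    """Return True when the reasoning trace reaches or exceeds the length threshold."""
    return len(reasoning_text) >= threshold
```

A flag like this says nothing about why a response is long; it only routes long traces to a second look.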

ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding

Indraneil Paul,Haoyi Yang,Goran Glavaš,Kristian Kersting,Iryna Gurevych

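Task: Explore obfuscation-based pre-training objectives to improve the data efficiency and syntax-semantics disentanglement of code language models (Code-LMs).

Motivation: Pre-training recipes for Code-LMs have stagnated; work on objectives that improve data efficiency and disentangle syntax from semantics is sparse, especially compared with corresponding efforts on natural language LMs.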

Details

Method: Propose pre-training grounded on obfuscated code, compile the ObscuraX dataset (about 55M pairs of source and obfuscated code), and pre-train ObscuraCoder models (255M to 2.8B parameters).

Result: ObscuraCoder clearly outperforms vanilla autoregressive pre-training and existing de-obfuscation objectives on syntactic and semantic code understanding, multilingual code completion, code commit summarization, and multi-purpose code generation.

Conclusion: Obfuscation-based pre-training effectively improves Code-LMs, with particular gains in data efficiency and syntax-semantics disentanglement.

Abstract: Language models (LMs) have become a staple of the code-writing toolbox. Their pre-training recipe has, however, remained stagnant over recent years, barring the occasional changes in data sourcing and filtering strategies. In particular, research exploring modifications to Code-LMs' pre-training objectives, geared towards improving data efficiency and better disentangling between syntax and semantics, has been noticeably sparse, especially compared with corresponding efforts in natural language LMs. In this work, we examine grounding on obfuscated code as a means of helping Code-LMs look beyond the surface-form syntax and enhance their pre-training sample efficiency. To this end, we compile ObscuraX, a dataset of approximately 55M source and obfuscated code pairs in seven languages. Subsequently, we pre-train ObscuraCoder models, ranging in size from 255M to 2.8B parameters, on a 272B-token corpus that includes ObscuraX and demonstrate that our obfuscation-based pre-training recipe leads to consistent improvements in Code-LMs' abilities compared to both vanilla autoregressive pre-training as well as existing de-obfuscation (DOBF) objectives. ObscuraCoder demonstrates sizeable gains across multiple tests of syntactic and semantic code understanding, along with improved capabilities in multilingual code completion, multilingual code commit summarization, and multi-purpose library-oriented code generation.
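
To make the idea of a source/obfuscated code pair concrete, here is a minimal sketch of one possible obfuscation: renaming identifiers in Python source with the standard `ast` module. This is an assumption for illustration, not the obfuscator used to build ObscuraX; the class name, the `VAR_n` naming scheme, and the choice to leave builtins untouched are all hypothetical. It needs Python 3.9+ for `ast.unparse` and ignores attributes, imports, and keyword arguments.

```python
import ast
import builtins


class RenameIdentifiers(ast.NodeTransformer):
    """Replace user-defined names with opaque placeholders (VAR_0, VAR_1, ...)."""

    def __init__(self):
        self.mapping = {}

    def _alias(self, name):
        if hasattr(builtins, name):  # keep print, len, range, ... readable
            return name
        if name not in self.mapping:
            self.mapping[name] = f"VAR_{len(self.mapping)}"
        return self.mapping[name]

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)  # rename parameters and body
        return node


def obfuscate(source: str) -> str:
    """Return `source` with identifiers renamed; behaviour is unchanged."""
    tree = RenameIdentifiers().visit(ast.parse(source))
    return ast.unparse(tree)
```

The original and the output of `obfuscate` compute the same thing while sharing no meaningful names, which is the property that makes such pairs useful for separating syntax from semantics.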

FUSE: A Ridge and Random Forest-Based Metric for Evaluating MT in Indigenous Languages

Rahul Raja,Arpita Vats

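Task: Develop FUSE, an automatic metric for evaluating the quality of machine translation into Indigenous languages of the Americas.

Motivation: Conventional automatic metrics (such as BLEU, TER, and ChrF) often fail to capture deeper aspects such as semantic adequacy and fluency, especially for morphologically rich, low-resource languages.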

Details

Method: FUSE combines Ridge regression and Gradient Boosting over lexical, phonetic, semantic, and fuzzy token similarity features; five alternative approaches are also explored.

Result: FUSE achieves higher Pearson and Spearman correlations with human annotations than conventional metrics.

Conclusion: FUSE offers a robust, linguistically informed solution for MT evaluation in low-resource settings.

Abstract: This paper presents the winning submission of the RaaVa team to the AmericasNLP 2025 Shared Task 3 on Automatic Evaluation Metrics for Machine Translation (MT) into Indigenous Languages of America, where our system ranked first overall based on average Pearson correlation with the human annotations. We introduce the Feature-Union Scorer (FUSE) for Evaluation. FUSE integrates Ridge regression and Gradient Boosting to model translation quality. In addition to FUSE, we explore five alternative approaches leveraging different combinations of linguistic similarity features and learning paradigms. FUSE Score highlights the effectiveness of combining lexical, phonetic, semantic, and fuzzy token similarity with learning-based modeling to improve MT evaluation for morphologically rich and low-resource languages. MT into Indigenous languages poses unique challenges due to polysynthesis, complex morphology, and non-standardized orthography. Conventional automatic metrics such as BLEU, TER, and ChrF often fail to capture deeper aspects like semantic adequacy and fluency. Our proposed framework, formerly referred to as FUSE, incorporates multilingual sentence embeddings and phonological encodings to better align with human evaluation. We train supervised models on human-annotated development sets and evaluate held-out test data. Results show that FUSE consistently achieves higher Pearson and Spearman correlations with human judgments, offering a robust and linguistically informed solution for MT evaluation in low-resource settings.
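
Since the metric is ranked by its Pearson and Spearman correlation with human scores, a small standard-library sketch of both coefficients may help. The function names are illustrative and the inputs are assumed to be two equal-length lists of numbers with non-zero variance; in practice a library such as SciPy would normally be used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Pearson rewards a linear relationship between metric and human scores, while Spearman only asks that the two orderings agree.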

Generalization Bias in Large Language Model Summarization of Scientific Research

Uwe Peters,Benjamin Chin-Yee

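Task: Test whether large language models (LLMs) overgeneralize when summarizing scientific texts.

Motivation: LLMs could raise public science literacy and support research, but omitting details when summarizing may lead to overgeneralized research conclusions.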

Details

Method: Test 10 prominent LLMs (e.g., ChatGPT-4o, DeepSeek), comparing 4,900 LLM-generated summaries with the original scientific texts.

Result: Most LLMs produced broader generalizations than the source texts; DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B overgeneralized in 26% to 73% of cases. LLM summaries were nearly five times more likely than human-written ones to contain broad generalizations (OR = 4.85).

Conclusion: Widely used LLMs show a strong bias towards overgeneralizing scientific conclusions, which risks large-scale misinterpretation of research findings; lowering temperature settings and benchmarking are suggested as mitigations.

Abstract: Artificial intelligence chatbots driven by large language models (LLMs) have the potential to increase public science literacy and support scientific research, as they can quickly summarize complex scientific information in accessible terms. However, when summarizing scientific texts, LLMs may omit details that limit the scope of research conclusions, leading to generalizations of results broader than warranted by the original study. We tested 10 prominent LLMs, including ChatGPT-4o, ChatGPT-4.5, DeepSeek, LLaMA 3.3 70B, and Claude 3.7 Sonnet, comparing 4900 LLM-generated summaries to their original scientific texts. Even when explicitly prompted for accuracy, most LLMs produced broader generalizations of scientific results than those in the original texts, with DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B overgeneralizing in 26 to 73% of cases. In a direct comparison of LLM-generated and human-authored science summaries, LLM summaries were nearly five times more likely to contain broad generalizations (OR = 4.85, 95% CI [3.06, 7.70]). Notably, newer models tended to perform worse in generalization accuracy than earlier ones. Our results indicate a strong bias in many widely used LLMs towards overgeneralizing scientific conclusions, posing a significant risk of large-scale misinterpretations of research findings. We highlight potential mitigation strategies, including lowering LLM temperature settings and benchmarking LLMs for generalization accuracy.
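
For readers unfamiliar with the "OR = 4.85, 95% CI" notation, the sketch below shows how an odds ratio and a Wald confidence interval are computed from a 2x2 table. The counts are hypothetical and are not the paper's data, and the paper may have obtained its estimate differently (for example from a regression model).

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: group 1 with the outcome,  b: group 1 without it
    c: group 2 with the outcome,  d: group 2 without it
    All four counts must be positive.
    """
    odds_ratio = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    low = exp(log(odds_ratio) - z * se)
    high = exp(log(odds_ratio) + z * se)
    return odds_ratio, low, high


# Hypothetical counts: 20 of 100 LLM summaries vs 10 of 100 human summaries overgeneralize.
example_or, example_low, example_high = odds_ratio_ci(20, 80, 10, 90)
```

An interval that excludes 1, as the reported [3.06, 7.70] does, indicates that the difference between the two groups is unlikely to be due to chance alone.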

Enhance Vision-based Tactile Sensors via Dynamic Illumination and Image Fusion

Artemii Redkin,Zdravko Dugonjic,Mike Lambeta,Roberto Calandra

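Task: Investigate how dynamic illumination patterns combined with image fusion can improve the sensing quality of vision-based tactile sensors.

Motivation: Existing vision-based tactile sensors (such as DIGIT and GelSight) use a static illumination pattern, which limits sensing quality.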

Details

Method: Capture multiple measurements under dynamic illumination patterns and fuse them into a single, higher-quality measurement.

Result: Experiments show that dynamic illumination significantly improves image contrast, sharpness, and background difference.

Conclusion: Dynamic illumination can improve existing sensors through a software update and opens possibilities for new hardware designs.

Abstract: Vision-based tactile sensors use structured light to measure deformation in their elastomeric interface. Until now, vision-based tactile sensors such as DIGIT and GelSight have been using a single, static pattern of structured light tuned to the specific form factor of the sensor. In this work, we investigate the effectiveness of dynamic illumination patterns, in conjunction with image fusion techniques, to improve the quality of sensing of vision-based tactile sensors. Specifically, we propose to capture multiple measurements, each with a different illumination pattern, and then fuse them together to obtain a single, higher-quality measurement. Experimental results demonstrate that this type of dynamic illumination yields significant improvements in image contrast, sharpness, and background difference. This discovery opens the possibility of retroactively improving the sensing quality of existing vision-based tactile sensors with a simple software update, and for new hardware designs capable of fully exploiting dynamic illumination.

Opioid Named Entity Recognition (ONER-2025) from Reddit

Muhammad Ahmad,Humaira Farid,Iqra Ameer,Muhammad Muzamil,Ameer Hamza Muhammad Jalal,Ildar Batyrshin,Grigori Sidorov

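Task: Use natural language processing (specifically ONER-2025) to extract information related to opioid use from social media platforms such as Reddit.

Motivation: The opioid overdose crisis is a serious public health problem, and unstructured social media data can offer insight into public perceptions and experiences of opioid use.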

Details

Method: Create a manually annotated dataset, analyze linguistic challenges, and propose a real-time monitoring system that combines machine learning and deep learning models.

Result: Transformer-based models (bert-base-NER and roberta-base) reached 97% accuracy and F1-score, outperforming baselines by 10.23%.

Conclusion: The study extracts opioid-related information from social media with NLP and proposes a real-time monitoring system to support public health interventions.

Abstract: The opioid overdose epidemic remains a critical public health crisis, particularly in the United States, leading to significant mortality and societal costs. Social media platforms like Reddit provide vast amounts of unstructured data that offer insights into public perceptions, discussions, and experiences related to opioid use. This study leverages Natural Language Processing (NLP), specifically Opioid Named Entity Recognition (ONER-2025), to extract actionable information from these platforms. Our research makes four key contributions. First, we created a unique, manually annotated dataset sourced from Reddit, where users share self-reported experiences of opioid use via different administration routes. This dataset contains 331,285 tokens and includes eight major opioid entity categories. Second, we detail our annotation process and guidelines while discussing the challenges of labeling the ONER-2025 dataset. Third, we analyze key linguistic challenges, including slang, ambiguity, fragmented sentences, and emotionally charged language, in opioid discussions. Fourth, we propose a real-time monitoring system to process streaming data from social media, healthcare records, and emergency services to identify overdose events. Using 5-fold cross-validation in 11 experiments, our system integrates machine learning, deep learning, and transformer-based language models with advanced contextual embeddings to enhance understanding. Our transformer-based models (bert-base-NER and roberta-base) achieved 97% accuracy and F1-score, outperforming baselines by 10.23% (RF=0.88).

A Novel Distance-Based Metric for Quality Assessment in Image Segmentation

Niklas Rottmayer,Claudia Redenbach

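Task: Propose the Surface Consistency Coefficient (SCC), a new distance-based quality metric that quantifies the spatial distribution of segmentation errors.

Motivation: Traditional segmentation quality metrics only count erroneous pixels and miss the spatial distribution of errors, while existing distance-based metrics (such as the average Hausdorff distance) are hard to interpret and compare.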

Details

Method: Validate SCC's robustness and effectiveness through rigorous analysis on synthetic data and real segmentation results.

Result: SCC distinguishes errors near the surface from those further away, and is easy to interpret and compare across structures.

Conclusion: SCC is an effective, interpretable segmentation quality metric suited to different structural contexts.

Abstract: The assessment of segmentation quality plays a fundamental role in the development, optimization, and comparison of segmentation methods which are used in a wide range of applications. With few exceptions, quality assessment is performed using traditional metrics, which are based on counting the number of erroneous pixels but do not capture the spatial distribution of errors. Established distance-based metrics such as the average Hausdorff distance are difficult to interpret and compare for different methods and datasets. In this paper, we introduce the Surface Consistency Coefficient (SCC), a novel distance-based quality metric that quantifies the spatial distribution of errors based on their proximity to the surface of the structure. Through a rigorous analysis using synthetic data and real segmentation results, we demonstrate the robustness and effectiveness of SCC in distinguishing errors near the surface from those further away. At the same time, SCC is easy to interpret and comparable across different structural contexts.

Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding

Aayush Gautam,Susav Shrestha,Narasimha Annapareddy

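Task: Propose methods (GammaTune and GammaTune+) that dynamically adjust speculation length to speed up large language model inference.

Motivation: Speculative decoding has a small draft model propose tokens that a larger model then verifies, but a fixed speculation length can waste computation or leave speedup unrealized.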

Details

Method: Use a heuristic-based switching mechanism that adjusts speculation length according to token acceptance rates.

Result: On SpecBench, GammaTune and GammaTune+ achieve average speedups of 15% and 16% respectively, with lower performance variance.

Conclusion: GammaTune is an efficient, robust solution suited to real-world deployment.

Abstract: Speculative decoding accelerates large language model (LLM) inference by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, selecting an optimal speculation length is critical for maximizing speedup while minimizing wasted computation. We introduce GammaTune and GammaTune+, training-free adaptive algorithms that dynamically adjust speculation length based on token acceptance rates using a heuristic-based switching mechanism. Evaluated on SpecBench across multiple tasks and model pairs, our method outperforms other heuristic-based approaches and fixed-length speculative decoding, achieving an average speedup of 15% (±5%) with GammaTune and 16% (±3%) with GammaTune+, while reducing performance variance. This makes GammaTune a robust and efficient solution for real-world deployment.
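
The abstract does not spell out the switching rule, so the following is only a generic sketch of acceptance-rate-driven adjustment, not the GammaTune algorithm itself. The function name, the thresholds `low` and `high`, the step size of one token, and the bounds are all assumptions chosen for illustration.

```python
def update_gamma(gamma, accepted, gamma_min=1, gamma_max=8, low=0.4, high=0.8):
    """Return the next speculation length.

    gamma:    number of tokens the draft model proposed this round (>= 1)
    accepted: how many of those tokens the target model accepted
    """
    rate = accepted / gamma
    if rate >= high:      # drafts are mostly accepted: speculate further ahead
        gamma += 1
    elif rate <= low:     # drafts are mostly rejected: waste less draft compute
        gamma -= 1
    return max(gamma_min, min(gamma_max, gamma))
```

For example, with these hypothetical defaults a round in which all 4 drafted tokens are accepted raises the length to 5, while a round with 1 of 4 accepted lowers it to 3.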

Skeletonization Quality Evaluation: Geometric Metrics for Point Cloud Analysis in Robotics

Qingmeng Wen,Yu-Kun Lai,Ze Ji,Seyed Amir Tafrishi

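Task: Define and quantify geometric properties to systematically evaluate skeletonization results for point cloud shapes.

Motivation: Although skeletonization algorithms have been studied in recent years, their performance has rarely been evaluated numerically in detail.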

Details

Method: Introduce representative metric definitions and a numerical scoring framework to analyze skeletonization results on point cloud data.

Result: Provide an open-source tool for evaluating skeleton models, and analyze the performance and sensitivity of the geometric evaluation methods in robotic applications.

Conclusion: Systematically quantifying the geometric properties of skeletonization results gives the research community tools and methods to evaluate and refine skeleton models.

Abstract: Skeletonization is a powerful tool for shape analysis, rooted in the inherent instinct to understand an object's morphology. It has found applications across various domains, including robotics. Although skeletonization algorithms have been studied in recent years, their performance is rarely quantified with detailed numerical evaluations. This work focuses on defining and quantifying geometric properties to systematically score the skeletonization results of point cloud shapes across multiple aspects, including topological similarity, boundedness, centeredness, and smoothness. We introduce these representative metric definitions along with a numerical scoring framework to analyze skeletonization outcomes concerning point cloud data for different scenarios, from object manipulation to mobile robot navigation. Additionally, we provide an open-source tool to enable the research community to evaluate and refine their skeleton models. Finally, we assess the performance and sensitivity of the proposed geometric evaluation methods from various robotic applications.

Quantum Methods for Managing Ambiguity in Natural Language Processing

Jurek Eisinger,Ward Gauderis,Lin de Huybrecht,Geraint A. Wiggins

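Task: Investigate how to represent syntactic ambiguity in sentences using probability distributions in quantum natural language processing (QNLP).

Motivation: Address syntactic ambiguity in natural language and extend the use of the DisCoCat framework in QNLP.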

Details

Method: Represent sentence meanings as density matrices and model syntactic ambiguity through probability distributions over quantum circuits.

Result: A theoretical framework is proposed and validated with an experiment.

Conclusion: The approach accounts for syntactic ambiguity and generalises related tasks in QNLP.

Abstract: The Categorical Compositional Distributional (DisCoCat) framework models meaning in natural language using the mathematical framework of quantum theory, expressed as formal diagrams. DisCoCat diagrams can be associated with tensor networks and quantum circuits. DisCoCat diagrams have been connected to density matrices in various contexts in Quantum Natural Language Processing (QNLP). Previous use of density matrices in QNLP entails modelling ambiguous words as probability distributions over more basic words (the word "queen", e.g., might mean the reigning queen or the chess piece). In this article, we investigate using probability distributions over processes to account for syntactic ambiguity in sentences. The meanings of these sentences are represented by density matrices. We show how to create probability distributions on quantum circuits that represent the meanings of sentences and explain how this approach generalises tasks from the literature. We conduct an experiment to validate the proposed theory.
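
As a reminder of the standard construction behind "density matrices as probability distributions", and not a statement of the paper's specific circuits, an ambiguous sentence can be written as a mixed state. The notation is assumed here: $p_i$ is the probability of the $i$-th syntactic reading and $|\psi_i\rangle$ is the normalized pure state that the circuit for that reading prepares.

```latex
\rho_{\text{sentence}} \;=\; \sum_{i} p_i \, |\psi_i\rangle\langle\psi_i| ,
\qquad p_i \ge 0 , \qquad \sum_{i} p_i = 1 .
```

For an unambiguous sentence the sum has a single term and $\rho_{\text{sentence}}$ reduces to a pure state.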

ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models

Guoyizhe Wei,Rama Chellappa

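Task: Propose ViT-Linearizer, a cross-architecture distillation framework that transfers the knowledge of ViT's quadratic-complexity self-attention into a linear-time, recurrent-style model.

Motivation: Address the inefficiency of ViTs on high-resolution inputs caused by quadratic complexity while retaining their performance.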

Details

Method: Distill ViT representations into a linear-time model using two strategies, activation matching and masked prediction.

Result: Notable speedups on high-resolution tasks and improved performance of Mamba-based architectures on standard vision benchmarks, reaching 84.3% top-1 accuracy on ImageNet.

Conclusion: RNN-based solutions show potential for large-scale visual tasks, bridging the gap between theoretical efficiency and real-world practice.

Abstract: Vision Transformers (ViTs) have delivered remarkable progress through global self-attention, yet their quadratic complexity can become prohibitive for high-resolution inputs. In this work, we present ViT-Linearizer, a cross-architecture distillation framework that transfers rich ViT representations into a linear-time, recurrent-style model. Our approach leverages 1) activation matching, an intermediate constraint that encourages the student to align its token-wise dependencies with those produced by the teacher, and 2) masked prediction, a contextual reconstruction objective that requires the student to predict the teacher's representations for unseen (masked) tokens, to effectively distill the quadratic self-attention knowledge into the student while maintaining efficient complexity. Empirically, our method provides notable speedups particularly for high-resolution tasks, significantly addressing the hardware challenges in inference. Additionally, it elevates Mamba-based architectures' performance on standard vision benchmarks, achieving a competitive 84.3% top-1 accuracy on ImageNet with a base-sized model. Our results underscore the good potential of RNN-based solutions for large-scale visual tasks, bridging the gap between theoretical efficiency and real-world practice.

Beyond the Reported Cutoff: Where Large Language Models Fall Short on Financial Knowledge

Agam Shah,Liqin Ye,Sebastian Jaskowski,Wei Xu,Sudheer Chava

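Task: Assess the breadth and accuracy of large language models' (LLMs') knowledge of historical financial data.

Motivation: Study how LLMs perform on historical financial information, especially how responses vary with company characteristics (such as size and investor attention).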

Details

Method: Evaluate more than 197k questions, compare LLM responses with factual financial data, and examine the effect of company characteristics on accuracy.

Result: LLMs are less informed about past financial data but more aware of larger companies and more recent information; at the same time, they are more likely to hallucinate for larger companies and more recent data.

Conclusion: LLMs' knowledge of historical financial data is limited, so they should be applied with caution in practice.

Abstract: Large Language Models (LLMs) are frequently utilized as sources of knowledge for question-answering. While it is known that LLMs may lack access to real-time data or newer data produced after the model's cutoff date, it is less clear how their knowledge spans across historical information. In this study, we assess the breadth of LLMs' knowledge using financial data of U.S. publicly traded companies by evaluating more than 197k questions and comparing model responses to factual data. We further explore the impact of company characteristics, such as size, retail investment, institutional attention, and readability of financial filings, on the accuracy of knowledge represented in LLMs. Our results reveal that LLMs are less informed about past financial performance, but they display a stronger awareness of larger companies and more recent information. Interestingly, at the same time, our analysis also reveals that LLMs are more likely to hallucinate for larger companies, especially for data from more recent years. We will make the code, prompts, and model outputs public upon the publication of the work.

Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

Lucas Ventura,Antoine Yang,Cordelia Schmid,Gül Varol

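Task: Partition a long video timeline into semantic units and generate corresponding chapter titles.

Motivation: Automatic chaptering can make navigation and content retrieval in long videos more efficient, but remains relatively underexplored.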

Details

Method: Feed a pretrained large language model (LLM) with speech transcripts and captions of video frames, propose a lightweight speech-guided frame selection strategy, and train the model to output chapter boundary timestamps and free-form chapter titles.

Result: Substantial improvement on the VidChapters-7M benchmark (45.3 vs 26.7 F1 score).

Conclusion: The approach is simple and efficient, processes hour-long videos in a single forward pass, and its code and models are released to promote further research.

Abstract: We address the task of video chaptering, i.e., partitioning a long video timeline into semantic units and generating corresponding chapter titles. While relatively underexplored, automatic chaptering has the potential to enable efficient navigation and content retrieval in long-form videos. In this paper, we achieve strong chaptering performance on hour-long videos by efficiently addressing the problem in the text domain with our 'Chapter-Llama' framework. Specifically, we leverage a pretrained large language model (LLM) with a large context window, and feed as input (i) speech transcripts and (ii) captions describing video frames, along with their respective timestamps. Given the inefficiency of exhaustively captioning all frames, we propose a lightweight speech-guided frame selection strategy based on speech transcript content, and experimentally demonstrate remarkable advantages. We train the LLM to output timestamps for the chapter boundaries, as well as free-form chapter titles. This simple yet powerful approach scales to processing one-hour long videos in a single forward pass. Our results demonstrate substantial improvements (e.g., 45.3 vs 26.7 F1 score) over the state of the art on the recent VidChapters-7M benchmark. To promote further research, we release our code and models at our project page.

CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation

Jixuan Leng,Chengsong Huang,Langlin Huang,Bill Yuchen Lin,William W. Cohen,Haohan Wang,Jiaxin Huang

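Task: Evaluate how large language models (LLMs) and large vision-language models (LVLMs) perform on cross-modal reasoning tasks.

Motivation: Existing evaluation frameworks mainly assess text-based reasoning or vision-language understanding and lack evaluation of the dynamic interplay between textual and visual constraints.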

Details

Method: Introduce the CrossWordBench benchmark, which evaluates reasoning through crossword puzzles that combine constraints from text clues and the visual grid structure.

Result: Reasoning LLMs substantially outperform non-reasoning models, while LVLMs struggle, with puzzle-solving performance strongly correlated with grid-parsing accuracy.

Conclusion: The work reveals limitations in the reasoning capabilities of current LLMs and LVLMs and offers an effective approach for building multimodal constrained tasks for future evaluation.

Abstract: Existing reasoning evaluation frameworks for Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) predominantly either assess text-based reasoning or vision-language understanding capabilities, with limited dynamic interplay between textual and visual constraints. To address this limitation, we introduce CrossWordBench, a benchmark designed to evaluate the reasoning capabilities of both LLMs and LVLMs through the medium of crossword puzzles, a task requiring multimodal adherence to semantic constraints from text-based clues and intersectional constraints from visual grid structures. CrossWordBench leverages a controllable puzzle generation framework that produces puzzles in multiple formats (text and image) and offers different evaluation strategies ranging from direct puzzle solving to interactive modes. Our extensive evaluation of over 20 models reveals that reasoning LLMs outperform non-reasoning models substantially by effectively leveraging crossing-letter constraints. We further demonstrate that LVLMs struggle with the task, showing a strong correlation between their puzzle-solving performance and grid-parsing accuracy. Our findings offer insights into the limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.

Yannick Burkhardt,Simon Schaefer,Stefan Leutenegger

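Task: Propose SuperEvent, a data-driven approach for predicting stable keypoints and their descriptors in event streams.

Motivation: Existing approaches struggle with the motion-dependent appearance of keypoints and the complex noise in event streams, resulting in limited feature matching capability and poor downstream performance.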

Details

Method: Generate pseudo-labels with existing frame-based keypoint detectors on event-aligned gray-scale frames and, combined with a novel information-rich event representation, train with self-supervision.

Result: SuperEvent achieves robust keypoint detection and description in event streams and, integrated into a SLAM framework, surpasses the state of the art in event-based SLAM by a wide margin.

Conclusion: Through self-supervision and a novel event representation, SuperEvent improves keypoint detection and matching in event streams and offers a new route to using event sensors in SLAM systems.

Abstract: Event-based keypoint detection and matching holds significant potential, enabling the integration of event sensors into highly optimized Visual SLAM systems developed for frame cameras over decades of research. Unfortunately, existing approaches struggle with the motion-dependent appearance of keypoints and the complex noise prevalent in event streams, resulting in severely limited feature matching capabilities and poor performance on downstream tasks. To mitigate this problem, we propose SuperEvent, a data-driven approach to predict stable keypoints with expressive descriptors. Due to the absence of event datasets with ground truth keypoint labels, we leverage existing frame-based keypoint detectors on readily available event-aligned and synchronized gray-scale frames for self-supervision: we generate temporally sparse keypoint pseudo-labels considering that events are a product of both scene appearance and camera motion. Combined with our novel, information-rich event representation, we enable SuperEvent to effectively learn robust keypoint detection and description in event streams. Finally, we demonstrate the usefulness of SuperEvent by its integration into a modern sparse keypoint and descriptor-based SLAM framework originally developed for traditional cameras, surpassing the state-of-the-art in event-based SLAM by a wide margin. Source code and multimedia material are available at smartroboticslab.github.io/SuperEvent.

Measuring Online Hate on 4chan using Pre-trained Deep Learning Models

Adrian Bermudez-Villalva,Maryam Mehrnezhad,Ehsan Toreini

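Task: Analyse and measure the prevalence of online hate speech on 4chan's politically incorrect board (/pol/).

Motivation: Hate speech on anonymous platforms such as 4chan can harm individuals and groups, so its dynamics and extent need in-depth study.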

Details

Method: Use state-of-the-art NLP models (such as RoBERTa and Detoxify) for multi-class hate speech classification (racism, sexism, religion, etc.), together with toxic content classification and topic modelling.

Result: 11.20% of the dataset is identified as containing hate in different categories, showing that online hate takes various forms.

Conclusion: The study reveals the complex and volatile nature of online hate and underlines the difficulty of detecting hate speech on non-moderated platforms.

Abstract: Online hate speech can harmfully impact individuals and groups, specifically on non-moderated platforms such as 4chan where users can post anonymous content. This work focuses on analysing and measuring the prevalence of online hate on 4chan's politically incorrect board (/pol/) using state-of-the-art Natural Language Processing (NLP) models, specifically transformer-based models such as RoBERTa and Detoxify. By leveraging these advanced models, we provide an in-depth analysis of hate speech dynamics and quantify the extent of online hate on non-moderated platforms. The study advances understanding through multi-class classification of hate speech (racism, sexism, religion, etc.), while also incorporating the classification of toxic content (e.g., identity attacks and threats) and a further topic modelling analysis. The results show that 11.20% of this dataset is identified as containing hate in different categories. These evaluations show that online hate is manifested in various forms, confirming the complicated and volatile nature of detection in the wild.

Towards Precise Action Spotting: Addressing Temporal Misalignment in Labels with Dynamic Label Assignment

Masato Tamura

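Task: Propose a dynamic label assignment strategy to address temporal misalignment in action spotting.

Motivation: Existing methods overlook the temporal misalignment inherent in ground-truth labels, which hurts model performance.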

Details

Method: Extend minimum-cost matching from the spatial domain to the temporal domain and assign labels dynamically to alleviate the negative effects of temporal misalignment.

Result: Experiments show state-of-the-art performance, particularly where events are visually distinct and temporal misalignment is common.

Conclusion: Dynamic label assignment effectively addresses temporal misalignment and improves action spotting accuracy.

Abstract: Precise action spotting has attracted considerable attention due to its promising applications. While existing methods achieve substantial performance by employing well-designed model architecture, they overlook a significant challenge: the temporal misalignment inherent in ground-truth labels. This misalignment arises when frames labeled as containing events do not align accurately with the actual event times, often as a result of human annotation errors or the inherent difficulties in precisely identifying event boundaries across neighboring frames. To tackle this issue, we propose a novel dynamic label assignment strategy that allows predictions to have temporal offsets from ground-truth action times during training, ensuring consistent event spotting. Our method extends the concept of minimum-cost matching, which is utilized in the spatial domain for object detection, to the temporal domain. By calculating matching costs based on predicted action class scores and temporal offsets, our method dynamically assigns labels to the most likely predictions, even when the predicted times of these predictions deviate from ground-truth times, alleviating the negative effects of temporal misalignment in labels. We conduct extensive experiments and demonstrate that our method achieves state-of-the-art performance, particularly in conditions where events are visually distinct and temporal misalignment in labels is common.
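
The matching step described in the abstract, a cost built from class scores and temporal offsets followed by a minimum-cost assignment, can be sketched as follows. The exact cost is not given in the abstract, so the linear form and the `offset_weight` parameter are assumptions, and the brute-force search over permutations is only practical for a handful of events (a Hungarian-algorithm solver would replace it in practice).

```python
from itertools import permutations


def assign_labels(pred_times, pred_scores, gt_times, offset_weight=0.1):
    """Match each ground-truth event to a distinct prediction at minimum total cost.

    Assumed cost of pairing prediction p with ground-truth g:
        (1 - score_p) + offset_weight * |t_p - t_g|
    Returns (matches, total_cost), where matches[g] is the index of the
    prediction assigned to ground-truth event g.
    """
    if len(pred_times) < len(gt_times):
        raise ValueError("need at least as many predictions as ground-truth events")
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(pred_times)), len(gt_times)):
        cost = sum(
            (1.0 - pred_scores[p]) + offset_weight * abs(pred_times[p] - gt_times[g])
            for g, p in enumerate(perm)
        )
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm), best_cost
```

With predictions at times 10, 12, and 30 (scores 0.2, 0.9, 0.8) and labels at 11 and 29, the confident prediction at 12 is matched to the label at 11 even though the two times differ, which is the tolerance to misaligned labels that the abstract describes.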

Loris Belcastro,Cristian Cosentino,Fabrizio Marozzo,Merve Gündüz-Cüre,Şule Öztürk-Birim

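Task: Propose a methodology that uses large language models (LLMs) to enhance disaster response and management, combining classification techniques with generative AI to turn raw user feedback into reports tailored to different stakeholders.

Motivation: Social media has become a primary channel for users to share feedback and issues quickly during disasters and emergencies, but the automation, aggregation, and customization of this data need improvement to give actionable insights to different stakeholders (such as the press, police, EMS, and firefighters) and so improve the coordination of relief efforts, resource distribution, and media communication.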

Details

Method: Combine classification techniques (e.g., BERT for multi-dimensional classification of content type, sentiment, emotion, geolocation, and topic) with generative AI (e.g., ChatGPT to produce readable reports for different audiences), using multi-dimensional classification, sub-event selection, and tailored report generation.

Result: The methodology performs well on quantitative metrics (such as text coherence scores and latent representations) and in qualitative assessments (by automated tools and field experts), giving precise insights to different disaster response stakeholders.

Conclusion: Combining classification techniques with generative AI markedly improves the efficiency of disaster response and management and provides stakeholders with tailored, actionable insights.

Abstract: In recent years, social media has emerged as a primary channel for users to promptly share feedback and issues during disasters and emergencies, playing a key role in crisis management. While significant progress has been made in collecting and analyzing social media content, there remains a pressing need to enhance the automation, aggregation, and customization of this data to deliver actionable insights tailored to diverse stakeholders, including the press, police, EMS, and firefighters. This effort is essential for improving the coordination of activities such as relief efforts, resource distribution, and media communication. This paper presents a methodology that leverages the capabilities of LLMs to enhance disaster response and management. Our approach combines classification techniques with generative AI to bridge the gap between raw user feedback and stakeholder-specific reports. Social media posts shared during catastrophic events are analyzed with a focus on user-reported issues, service interruptions, and encountered challenges. We employ full-spectrum LLMs, using analytical models like BERT for precise, multi-dimensional classification of content type, sentiment, emotion, geolocation, and topic. Generative models such as ChatGPT are then used to produce human-readable, informative reports tailored to distinct audiences, synthesizing insights derived from detailed classifications. We compare standard approaches, which analyze posts directly using prompts in ChatGPT, to our advanced method, which incorporates multi-dimensional classification, sub-event selection, and tailored report generation. Our methodology demonstrates superior performance in both quantitative metrics, such as text coherence scores and latent representations, and qualitative assessments by automated tools and field experts, delivering precise insights for diverse disaster response stakeholders.

Yongyi Shi,Ge Wang

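Task: Propose a decentralized few-shot generative model (DFGM) that synthesizes brain tumor images while fully preserving privacy.

Motivation: Multi-center medical data analysis faces privacy and data heterogeneity challenges; existing approaches such as federated learning remain vulnerable to privacy breaches, and generative models are prone to memorization on small datasets.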

Details

Method: DFGM combines private tumor data with publicly shareable healthy images to construct a new dataset, blending tumor foregrounds with healthy backgrounds.

Result: In brain tumor segmentation, DFGM improves Dice scores by 3.9% for data augmentation and 4.6% for fairness.

Conclusion: DFGM preserves privacy while enabling controllable, high-quality image synthesis suited to medical image analysis.

Abstract: Leveraging multi-center data for medical analytics presents challenges due to privacy concerns and data heterogeneity. While distributed approaches such as federated learning have gained traction, they remain vulnerable to privacy breaches, particularly in sensitive domains like medical imaging. Generative models, such as diffusion models, enhance privacy by synthesizing realistic data. However, they are prone to memorization, especially when trained on small datasets. This study proposes a decentralized few-shot generative model (DFGM) to synthesize brain tumor images while fully preserving privacy. DFGM harmonizes private tumor data with publicly shareable healthy images from multiple medical centers, constructing a new dataset by blending tumor foregrounds with healthy backgrounds. This approach ensures stringent privacy protection and enables controllable, high-quality synthesis by preserving both the healthy backgrounds and tumor foregrounds. We assess DFGM's effectiveness in brain tumor segmentation using a UNet, achieving Dice score improvements of 3.9% for data augmentation and 4.6% for fairness on a separate dataset.
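
The Dice score used to report these segmentation gains is the standard overlap measure 2|A ∩ B| / (|A| + |B|). A minimal sketch over flattened binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are choices made here for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient between two equal-length binary masks (flat sequences of 0/1)."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:  # both masks empty: treat as perfect agreement
        return 1.0
    return 2.0 * intersection / total
```

The score ranges from 0 (no overlap) to 1 (identical masks), so a gain of a few points corresponds to a modest but measurable increase in overlap with the reference segmentation.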

Distill-C: Enhanced NL2SQL via Distilled Customization with LLMs

Cong Duy Vu Hoang,Gioacchino Tangari,Clemence Lanfranchi,Dalu Guo,Paul Cayet,Steve Siu,Don Dharmasiri,Yuan-Fang Li,Long Duong,Damien Hilloulin,Rhicheek Patra,Sungpack Hong,Hassan Chafi

Task: Propose Distill-C, a customization framework that improves the performance and efficiency of natural language to SQL (NL2SQL) tasks.

Motivation: As large language models (LLMs) are widely adopted in business applications, NL2SQL solutions must deliver both high performance and efficiency while also meeting domain- and customer-specific requirements.

Details

Method: Distill-C uses large teacher LLMs to generate high-quality synthetic data through a robust, scalable pipeline, then fine-tunes smaller open-source LLMs on it so that they match or exceed the much larger teacher models. Result: Across multiple challenging benchmarks, Distill-C improves average execution accuracy by 36% over the base models; on three internal customer benchmarks, performance improves by 22.6%. Conclusion: Distill-C is an efficient, high-performing, and generalizable approach for deploying lightweight yet powerful NL2SQL models that deliver strong accuracy at low computational cost. Abstract: The growing adoption of large language models (LLMs) in business applications has amplified interest in Natural Language to SQL (NL2SQL) solutions, in which there is competing demand for high performance and efficiency. Domain- and customer-specific requirements further complicate the problem. To address this conundrum, we introduce Distill-C, a distilled customization framework tailored for NL2SQL tasks. Distill-C utilizes large teacher LLMs to produce high-quality synthetic data through a robust and scalable pipeline. Finetuning smaller and open-source LLMs on this synthesized data enables them to rival or outperform teacher models an order of magnitude larger. Evaluated on multiple challenging benchmarks, Distill-C achieves an average improvement of 36% in execution accuracy compared to the base models from three distinct LLM families. Additionally, on three internal customer benchmarks, Distill-C demonstrates a 22.6% performance improvement over the base models. Our results demonstrate that Distill-C is an effective, high-performing and generalizable approach for deploying lightweight yet powerful NL2SQL models, delivering exceptional accuracies while maintaining low computational cost.

SonarSplat: Novel View Synthesis of Imaging Sonar via Gaussian Splatting

Advaith V. Sethuraman,Max Rucker,Onur Bagoren,Pou-Chun Kung,Nibarkavi N. B. Amutha,Katherine A. Skinner

Task: Propose SonarSplat, a novel Gaussian splatting framework for imaging sonar that achieves realistic novel view synthesis and models acoustic streaking.

Motivation: Address realistic novel view synthesis and the modeling of acoustic streaking in imaging sonar.

Details

Method: Represents the scene as a set of 3D Gaussians with acoustic reflectance and saturation properties, and develops an efficient way to rasterize the learned Gaussians into images that are faithful to the sonar image formation model. Result: On real-world datasets, SonarSplat outperforms existing methods in image synthesis (+2.5 dB PSNR) and can be used for azimuth streak removal and 3D scene reconstruction. Conclusion: SonarSplat is an effective framework for imaging sonar, with clear performance gains and several potential applications. Abstract: In this paper, we present SonarSplat, a novel Gaussian splatting framework for imaging sonar that demonstrates realistic novel view synthesis and models acoustic streaking phenomena. Our method represents the scene as a set of 3D Gaussians with acoustic reflectance and saturation properties. We develop a novel method to efficiently rasterize learned Gaussians to produce a range/azimuth image that is faithful to the acoustic image formation model of imaging sonar. In particular, we develop a novel approach to model azimuth streaking in a Gaussian splatting framework. We evaluate SonarSplat using real-world datasets of sonar images collected from an underwater robotic platform in a controlled test tank and in a real-world river environment. Compared to the state-of-the-art, SonarSplat offers improved image synthesis capabilities (+2.5 dB PSNR). We also demonstrate that SonarSplat can be leveraged for azimuth streak removal and 3D scene reconstruction.
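
To put the reported +2.5 dB PSNR gain in context, the following minimal sketch computes PSNR from its standard definition and converts a dB gain into the corresponding reduction in mean squared error. It assumes intensities normalized to [0, 1]; the function names are hypothetical and the toy images are not from the paper.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    ref = [v for row in reference for v in row]
    out = [v for row in rendered for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


toy_psnr = psnr([[0.0, 0.0]], [[0.1, 0.1]])   # MSE = 0.01
ratio = mse_ratio_for_gain(2.5)               # about 1.78
```

Under this definition, a 2.5 dB gain means the mean squared error is roughly 1.78 times smaller at the same peak value.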

JudgeLRM: Large Reasoning Models as a Judge

Nuo Chen,Zhiyuan Hu,Qingyun Zou,Jiaying Wu,Qian Wang,Bryan Hooi,Bingsheng He

Task: Investigate whether large language models (LLMs) acting as evaluators truly benefit from enhanced reasoning on tasks that require complex reasoning.

Motivation: Existing supervised fine-tuning (SFT) methods fall short in domains requiring complex reasoning, so more effective ways to improve LLM judges are needed.

Details

Method: Proposes JudgeLRM, a family of judgment-oriented LLMs trained with reinforcement learning (RL) using judge-wise, outcome-driven rewards. Result: JudgeLRM models outperform both SFT-tuned models and state-of-the-art reasoning models; JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B exceeds DeepSeek-R1 by 2.79% in F1 score. Conclusion: JudgeLRM excels at judgment tasks that require deep reasoning, supporting the effectiveness of RL for improving LLM judges. Abstract: The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) approaches for judges often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples - highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks requiring deep reasoning.

SAVeD: Learning to Denoise Low-SNR Video for Improved Downstream Performance

Suzanne Stathatos,Michael Hobley,Markus Marks,Pietro Perona

Task: Propose SAVeD, a self-supervised method that denoises low signal-to-noise ratio (SNR) sensor video and improves downstream task performance.

Motivation: Foundation models excel on natural images but perform poorly on low-SNR video such as underwater sonar, ultrasound, and microscopy.

Details

Method: Exploits differences between foreground and background motion, using an encoder-decoder with a temporal bottleneck to enhance object visibility. Result: SAVeD outperforms existing video denoising methods on classification, detection, tracking, and counting while requiring fewer resources. Conclusion: SAVeD is an efficient self-supervised method for denoising low-SNR video and improving downstream tasks. Abstract: Foundation models excel at vision tasks in natural images but fail in low signal-to-noise ratio (SNR) videos, such as underwater sonar, ultrasound, and microscopy. We introduce Spatiotemporal Augmentations and denoising in Video for Downstream Tasks (SAVeD), a self-supervised method that denoises low-SNR sensor videos and is trained using only the raw noisy data. By leveraging differences in foreground and background motion, SAVeD enhances object visibility using an encoder-decoder with a temporal bottleneck. Our approach improves classification, detection, tracking, and counting, outperforming state-of-the-art video denoising methods with lower resource requirements. Project page: https://suzanne-stathatos.github.io/SAVeD Code page: https://github.com/suzanne-stathatos/SAVeD

Integrating Large Language Models with Human Expertise for Disease Detection in Electronic Health Records

Jie Pan,Seungwon Lee,Cheligeer Cheligeer,Elliot A. Martin,Kiarash Riazi,Hude Quan,Na Li

Task: Develop an efficient strategy based on large language models (LLMs) to identify multiple conditions from clinical notes in electronic health records (EHR).

Motivation: EHRs are widely used for disease surveillance and healthcare performance evaluation, but manually labelling disease outcomes is time-consuming and labour-intensive.

Details

Method: Links a cardiac registry cohort with an EHR system and uses a generative LLM to analyze clinical notes with prompts based on specific diagnoses, treatment management, and clinical guidelines, detecting acute myocardial infarction (AMI), diabetes, and hypertension. Result: Performance varied by condition (sensitivity 88%-94%); compared with ICD code-based methods, the LLM approach showed higher sensitivity, and its monthly trends were consistent with the reference standard. Conclusion: The LLM-based pipeline shows promise for disease detection, particularly in improving sensitivity and consistency over traditional methods. Abstract: Objective: Electronic health records (EHR) are widely available to complement administrative data-based disease surveillance and healthcare performance evaluation. Defining conditions from EHR is labour-intensive and requires extensive manual labelling of disease outcomes. This study developed an efficient strategy based on advanced large language models to identify multiple conditions from EHR clinical notes. Methods: We linked a cardiac registry cohort in 2015 with an EHR system in Alberta, Canada. We developed a pipeline that leveraged a generative large language model (LLM) to analyze, understand, and interpret EHR notes by prompts based on specific diagnosis, treatment management, and clinical guidelines. The pipeline was applied to detect acute myocardial infarction (AMI), diabetes, and hypertension. The performance was compared against clinician-validated diagnoses as the reference standard and widely adopted International Classification of Diseases (ICD) codes-based methods. Results: The study cohort accounted for 3,088 patients and 551,095 clinical notes. The prevalence was 55.4%, 27.7%, and 65.9% for AMI, diabetes, and hypertension, respectively. The performance of the LLM-based pipeline for detecting conditions varied: AMI had 88% sensitivity, 63% specificity, and 77% positive predictive value (PPV); diabetes had 91% sensitivity, 86% specificity, and 71% PPV; and hypertension had 94% sensitivity, 32% specificity, and 72% PPV. Compared with ICD codes, the LLM-based method demonstrated improved sensitivity and negative predictive value across all conditions. The monthly percentage trends from the detected cases by LLM and reference standard showed consistent patterns.
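
The sensitivity, specificity, PPV, and NPV figures quoted above all derive from a confusion matrix against the reference standard. A minimal sketch of those standard definitions follows; the function name and the counts are illustrative assumptions and are not taken from the study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic accuracy measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that were detected
        "specificity": tn / (tn + fp),  # share of non-cases correctly ruled out
        "ppv": tp / (tp + fp),          # share of positive calls that are correct
        "npv": tn / (tn + fn),          # share of negative calls that are correct
    }


# Illustrative counts only (not from the paper).
metrics = diagnostic_metrics(tp=90, fp=30, tn=70, fn=10)
```

PPV and NPV depend on prevalence as well as on the detector, which is one reason the three conditions, with their different prevalences, show different PPVs at similar sensitivity.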

Self-Evolving Visual Concept Library using Vision-Language Critics

Atharva Sehgal,Patrick Yuan,Ziniu Hu,Yisong Yue,Jennifer J. Sun,Swarat Chaudhuri

Task: Study how to build a visual concept library to support visual recognition.

Motivation: Manually defining visual concept libraries is labor-intensive, while relying solely on large language models (LLMs) to generate concepts can yield concepts that lack discriminative power or ignore the complex interactions between them.

Details

Method: Proposes ESCHER, which uses a vision-language model (VLM) as a critic to iteratively refine the concept library and dynamically adjusts its concept generation strategy. Result: ESCHER learns visual concept libraries automatically and applies to zero-shot, few-shot, and fine-tuning visual classification tasks. Conclusion: To the authors' knowledge, ESCHER is the first automated framework to apply concept library learning to real-world visual tasks. Abstract: We study the problem of building a visual concept library for visual recognition. Building effective visual concept libraries is challenging, as manual definition is labor-intensive, while relying solely on LLMs for concept generation can result in concepts that lack discriminative power or fail to account for the complex interactions between them. Our approach, ESCHER, takes a library learning perspective to iteratively discover and improve visual concepts. ESCHER uses a vision-language model (VLM) as a critic to iteratively refine the concept library, including accounting for interactions between concepts and how they affect downstream classifiers. By leveraging the in-context learning abilities of LLMs and the history of performance using various concepts, ESCHER dynamically improves its concept generation strategy based on the VLM critic's feedback. Finally, ESCHER does not require any human annotations, and is thus an automated plug-and-play framework. We empirically demonstrate the ability of ESCHER to learn a concept library for zero-shot, few-shot, and fine-tuning visual classification tasks. This work represents, to our knowledge, the first application of concept library learning to real-world visual tasks.

Evaluating the Feasibility and Accuracy of Large Language Models for Medical History-Taking in Obstetrics and Gynecology

Dou Liu,Ying Long,Sophia Zuoqiu,Tian Tang,Rong Yin

Task: Evaluate the feasibility and performance of large language models (LLMs) for automated medical history-taking and diagnosis in infertility cases.

Motivation: Physician-patient communication in complex, sensitive areas such as infertility is time-consuming and reduces clinic efficiency; LLMs may offer an automated solution that improves efficiency and diagnostic accuracy.

Details

Method: Develops an AI conversational system based on ChatGPT-4o and ChatGPT-4o-mini that simulates physician-patient interactions, processes 70 real-world infertility cases to generate 420 diagnostic histories, and evaluates the models using F1 score, differential diagnosis accuracy, and accuracy of infertility type judgment. Result: ChatGPT-4o-mini outperformed ChatGPT-4o in information extraction accuracy (F1 score 0.9258 vs. 0.9029) and history-taking completeness (97.58% vs. 77.11%), while ChatGPT-4o was slightly better in differential diagnosis accuracy (2.0524 vs. 2.0048). Conclusion: ChatGPT-4o-mini performs better at automated infertility history-taking; future work should prioritize expert validation in clinical settings, model fine-tuning, and studies on larger datasets. Abstract: Effective physician-patient communications in pre-diagnostic environments, and most specifically in complex and sensitive medical areas such as infertility, are critical but consume a lot of time and, therefore, cause clinic workflows to become inefficient. Recent advancements in Large Language Models (LLMs) offer a potential solution for automating conversational medical history-taking and improving diagnostic accuracy. This study evaluates the feasibility and performance of LLMs in those tasks for infertility cases. An AI-driven conversational system was developed to simulate physician-patient interactions with ChatGPT-4o and ChatGPT-4o-mini. A total of 70 real-world infertility cases were processed, generating 420 diagnostic histories. Model performance was assessed using F1 score, Differential Diagnosis (DDs) Accuracy, and Accuracy of Infertility Type Judgment (ITJ). ChatGPT-4o-mini outperformed ChatGPT-4o in information extraction accuracy (F1 score: 0.9258 vs. 0.9029, p = 0.045, d = 0.244) and demonstrated higher completeness in medical history-taking (97.58% vs. 77.11%), suggesting that ChatGPT-4o-mini is more effective in extracting detailed patient information, which is critical for improving diagnostic accuracy. In contrast, ChatGPT-4o performed slightly better in differential diagnosis accuracy (2.0524 vs. 2.0048, p > 0.05). ITJ accuracy was higher in ChatGPT-4o-mini (0.6476 vs. 0.5905) but with lower consistency (Cronbach's $\alpha$ = 0.562), suggesting variability in classification reliability.
Both models demonstrated strong feasibility in automating infertility history-taking, with ChatGPT-4o-mini excelling in completeness and extraction accuracy. In future studies, expert validation for accuracy and dependability in a clinical setting, AI model fine-tuning, and larger datasets with a mix of cases of infertility have to be prioritized.
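
The consistency figure above is reported as Cronbach's alpha. A minimal sketch of the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), is given below. It assumes rows are cases and columns are repeated runs of the same model; that layout, the function name, and the toy scores are assumptions, since the abstract does not say how the matrix was arranged.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix with rows = cases and columns = items."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    # Sample variance of each column (item) across cases.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the per-case total score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


consistent = cronbach_alpha([[1, 1], [2, 2], [3, 3]])   # runs agree exactly
noisy = cronbach_alpha([[1, 2], [2, 1], [3, 3]])        # runs partly disagree
```

Values near 1 indicate that repeated runs rank the cases the same way; lower values, such as the 0.562 reported above, indicate that the runs disagree more often.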

Leveraging Diffusion Model and Image Foundation Model for Improved Correspondence Matching in Coronary Angiography

Lin Zhao,Xin Yu,Yikang Liu,Xiao Chen,Eric Z. Chen,Terrence Chen,Shanhui Sun

Task: Propose a new method for accurate correspondence matching in coronary angiography images, in order to reconstruct 3D coronary artery structures.

Motivation: Traditional matching methods perform poorly on X-ray images because of inherent differences such as lack of texture, low contrast, and overlapping structures, compounded by insufficient training data.

Details

Method: Uses a diffusion model to generate high-quality synthetic coronary angiography images and employs large-scale image foundation models to guide feature aggregation, improving matching accuracy. Result: The method shows superior matching performance on synthetic datasets and generalizes effectively to real-world datasets. Conclusion: The work offers a practical solution for correspondence matching in coronary angiography and examines how different foundation models perform on this task, providing new insights for medical imaging applications. Abstract: Accurate correspondence matching in coronary angiography images is crucial for reconstructing 3D coronary artery structures, which is essential for precise diagnosis and treatment planning of coronary artery disease (CAD). Traditional matching methods for natural images often fail to generalize to X-ray images due to inherent differences such as lack of texture, lower contrast, and overlapping structures, compounded by insufficient training data. To address these challenges, we propose a novel pipeline that generates realistic paired coronary angiography images using a diffusion model conditioned on 2D projections of 3D reconstructed meshes from Coronary Computed Tomography Angiography (CCTA), providing high-quality synthetic data for training. Additionally, we employ large-scale image foundation models to guide feature aggregation, enhancing correspondence matching accuracy by focusing on semantically relevant regions and keypoints. Our approach demonstrates superior matching performance on synthetic datasets and effectively generalizes to real-world datasets, offering a practical solution for this task. Furthermore, our work investigates the efficacy of different foundation models in correspondence matching, providing novel insights into leveraging advanced image foundation models for medical imaging applications.

Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B

Aleksandra Bakalova,Yana Veitsman,Xinting Huang,Michael Hahn

Task: Study the mechanism of in-context learning (ICL) in large language models (LLMs).

Motivation: Despite substantial work on the behavior of ICL and how it emerges in small-scale setups, its concrete mechanism remains unclear.

Details

Method: Uses causal interventions to analyze information flow in Gemma-2 2B across five naturalistic ICL tasks. Result: The model follows a two-step "contextualize-then-aggregate" strategy: lower layers build representations of individual examples and connect them through context, and higher layers aggregate these representations to identify the task and predict the output. Conclusion: Rigorous causal analysis sheds light on the mechanism of ICL in language models and shows that the importance of the contextualization step varies by task and may be greater when examples are ambiguous. Abstract: In-Context Learning (ICL) is an intriguing ability of large language models (LLMs). Despite a substantial amount of work on its behavioral aspects and how it emerges in miniature setups, it remains unclear which mechanism assembles task information from the individual examples in a fewshot prompt. We use causal interventions to identify information flow in Gemma-2 2B for five naturalistic ICL tasks. We find that the model infers task information using a two-step strategy we call contextualize-then-aggregate: In the lower layers, the model builds up representations of individual fewshot examples, which are contextualized by preceding examples through connections between fewshot input and output tokens across the sequence. In the higher layers, these representations are aggregated to identify the task and prepare prediction of the next output. The importance of the contextualization step differs between tasks, and it may become more important in the presence of ambiguous examples. Overall, by providing rigorous causal analysis, our results shed light on the mechanisms through which ICL happens in language models.

SmartScan: An AI-based Interactive Framework for Automated Region Extraction from Satellite Images

Savinay Nagendra,Kashif Rashid

Task: Use the AI framework SmartScan to automate planning of the number and placement of fixed sensors in methane monitoring systems.

Motivation: Traditional planning and deployment of fixed sensors is labor-intensive and hard to scale, especially when evaluating multiple sites.

Details

Method: SmartScan combines satellite imagery with the Segment Anything Model (SAM) and extracts subspaces of interest through an interactive prompting system or an autonomous mode. Result: SmartScan delivers efficient, high-quality extraction for sensor placement planning with less manual intervention and better scalability. Conclusion: SmartScan's adaptable design makes it suitable for analyzing high-resolution satellite imagery across domains and improves the deployment efficiency of methane monitoring systems. Abstract: The deployment of a continuous methane monitoring system requires determining the optimal number and placement of fixed sensors. However, planning is labor-intensive, requiring extensive site setup and iteration to meet client restrictions. This challenge is amplified when evaluating multiple sites, limiting scalability. To address this, we introduce SmartScan, an AI framework that automates data extraction for optimal sensor placement. SmartScan identifies subspaces of interest from satellite images using an interactive tool to create facility-specific constraint sets efficiently. SmartScan leverages the Segment Anything Model (SAM), a prompt-based transformer for zero-shot segmentation, enabling subspace extraction without explicit training. It operates in two modes: (1) Data Curation Mode, where satellite images are processed to extract high-quality subspaces using an interactive prompting system for SAM, and (2) Autonomous Mode, where user-curated prompts train a deep learning network to replace manual prompting, fully automating subspace extraction. The interactive tool also serves for quality control, allowing users to refine AI-generated outputs and generate additional constraint sets as needed. With its AI-driven prompting mechanism, SmartScan delivers high-throughput, high-quality subspace extraction with minimal human intervention, enhancing scalability and efficiency. Notably, its adaptable design makes it suitable for extracting regions of interest from ultra-high-resolution satellite imagery across various domains.

Universal Zero-shot Embedding Inversion

Collin Zhang,John X. Morris,Vitaly Shmatikov

Task: Design and evaluate ZSInvert, a zero-shot embedding inversion method that reconstructs the original text from its embedding.

Motivation: From the NLP perspective, embedding inversion helps assess how much semantic information an embedding retains; from the security perspective, it measures how much information vector databases and embedding-based retrieval systems leak. Existing methods such as vec2text require training a separate model for each embedding and are query-inefficient.

Details

Method: Builds on adversarial decoding to design and implement ZSInvert, which needs no embedding-specific inversion model and is fast and query-efficient. Result: Experiments on several embeddings show that ZSInvert recovers key semantic information about the texts. Conclusion: ZSInvert is an efficient, universal zero-shot embedding inversion method that works with many text embeddings and addresses the limitations of existing methods. Abstract: Embedding inversion, i.e., reconstructing text given its embedding and black-box access to the embedding encoder, is a fundamental problem in both NLP and security. From the NLP perspective, it helps determine how much semantic information about the input is retained in the embedding. From the security perspective, it measures how much information is leaked by vector databases and embedding-based retrieval systems. State-of-the-art methods for embedding inversion, such as vec2text, have high accuracy but require (a) training a separate model for each embedding, and (b) a large number of queries to the corresponding encoder. We design, implement, and evaluate ZSInvert, a zero-shot inversion method based on the recently proposed adversarial decoding technique. ZSInvert is fast, query-efficient, and can be used for any text embedding without training an embedding-specific inversion model. We measure the effectiveness of ZSInvert on several embeddings and demonstrate that it recovers key semantic information about the corresponding texts.

RailGoerl24: Görlitz Rail Test Center CV Dataset 2024

Rustam Tagiew,Ilkay Wunderlich,Mark Sastuba,Steffen Seitz

Task: Develop RailGoerl24, a dataset for driverless train operation that supports automatic detection of obstacles, especially humans, in rail transport.

Motivation: Publicly available railway datasets are insufficient, particularly high-quality annotated data for human detection, which limits the use of machine learning algorithms in driverless train operation.

Details

Method: Using an on-board Full HD camera and a terrestrial LiDAR scan, 12205 image frames were recorded at a railway test center in Görlitz, Germany, with 33556 bounding-box annotations for the 'person' class. Result: The RailGoerl24 dataset contains high-quality RGB images and LiDAR data that can be used for driverless train operation and related tasks. Conclusion: RailGoerl24 fills a gap in railway datasets and provides an important resource for research and development of driverless train operation. Abstract: Driverless train operation for open tracks on urban guided transport and mainline railways requires, among other things, automatic detection of actual and potential obstacles, especially humans, in the danger zone of the train's path. Machine learning algorithms have proven to be powerful state-of-the-art tools for this task. However, these algorithms require large amounts of high-quality annotated data containing human beings in railway-specific environments as training data. Unfortunately, the number of publicly available datasets is not yet sufficient and is significantly inferior to the datasets in the road domain. Therefore, this paper presents RailGoerl24, an on-board visual light Full HD camera dataset of 12205 frames recorded in a railway test center of TÜV SÜD Rail, in Görlitz, Germany. Its main purpose is to support the development of driverless train operation for guided transport. RailGoerl24 also includes a terrestrial LiDAR scan covering parts of the area used to acquire the RGB data. In addition to the raw data, the dataset contains 33556 boxwise annotations in total for the object class 'person'. The faces of recorded actors are not blurred or altered in any other way. RailGoerl24, soon available at data.fid-move.de/dataset/railgoerl24, can also be used for tasks beyond collision prediction.

Does "Reasoning" with Large Language Models Improve Recognizing, Generating, and Reframing Unhelpful Thoughts?

Yilin Qi,Dong Won Lee,Cynthia Breazeal,Hae Won Park

Task: Study how the reasoning abilities of large language models (LLMs) can improve cognitive reframing tasks in Cognitive Behavioral Therapy (CBT).

Motivation: Gains in LLM performance on reasoning tasks suggest applying them to cognitive reframing, to recognize, generate, and reframe cognitive distortions more effectively.

Details

Method: Examines several reasoning methods, including pre-trained reasoning LLMs and augmented reasoning strategies (such as CoT and self-consistency), for improving LLM performance on cognitive reframing tasks. Result: Augmented reasoning methods, even on "outdated" LLMs such as GPT-3.5, outperform state-of-the-art pre-trained reasoning models at recognizing, generating, and reframing unhelpful thoughts. Conclusion: Augmented reasoning strategies can substantially improve LLM performance on cognitive reframing, pointing to a new direction for improving CBT. Abstract: Cognitive Reframing, a core element of Cognitive Behavioral Therapy (CBT), helps individuals reinterpret negative experiences by finding positive meaning. Recent advances in Large Language Models (LLMs) have demonstrated improved performance through reasoning-based strategies. This inspires a promising direction of leveraging the reasoning capabilities of LLMs to improve CBT and mental reframing by simulating the process of critical thinking, potentially enabling more effective recognition, generation, and reframing of cognitive distortions. In this work, we investigate the role of various reasoning methods, including pre-trained reasoning LLMs and augmented reasoning strategies such as CoT and self-consistency, in enhancing LLMs' ability to perform cognitive reframing tasks. We find that augmented reasoning methods, even when applied to "outdated" LLMs like GPT-3.5, consistently outperform state-of-the-art pretrained reasoning models on recognizing, generating and reframing unhelpful thoughts.

LITA-GS: Illumination-Agnostic Novel View Synthesis via Reference-Free 3D Gaussian Splatting and Physical Priors

Han Zhou,Wei Dong,Jun Chen

Task: Propose LITA-GS, an illumination-agnostic novel view synthesis method for producing high-quality 3D representations under adverse lighting conditions.

Motivation: Under adverse lighting, 3D Gaussian Splatting (3DGS) struggles to produce high-quality, normally exposed representations because of insufficient SfM points, information loss and noise, and inconsistency across views introduced by existing exposure correction methods.

Details

Method: Proposes LITA-GS, comprising illumination-invariant physical prior extraction, a lighting-agnostic structure rendering strategy, and a progressive denoising module. Result: LITA-GS surpasses the NeRF-based SOTA method while offering faster inference and shorter training time. Conclusion: Through its illumination-agnostic strategy and physical priors, LITA-GS effectively addresses 3D representation under adverse lighting. Abstract: Directly employing 3D Gaussian Splatting (3DGS) on images with adverse illumination conditions exhibits considerable difficulty in achieving high-quality, normally-exposed representations due to: (1) The limited Structure from Motion (SfM) points estimated in adverse illumination scenarios fail to capture sufficient scene details; (2) Without ground-truth references, the intensive information loss, significant noise, and color distortion pose substantial challenges for 3DGS to produce high-quality results; (3) Combining existing exposure correction methods with 3DGS does not achieve satisfactory performance due to their individual enhancement processes, which lead to the illumination inconsistency between enhanced images from different viewpoints. To address these issues, we propose LITA-GS, a novel illumination-agnostic novel view synthesis method via reference-free 3DGS and physical priors. Firstly, we introduce an illumination-invariant physical prior extraction pipeline. Secondly, based on the extracted robust spatial structure prior, we develop the lighting-agnostic structure rendering strategy, which facilitates the optimization of the scene structure and object appearance. Moreover, a progressive denoising module is introduced to effectively mitigate the noise within the light-invariant representation. We adopt the unsupervised strategy for the training of LITA-GS and extensive experiments demonstrate that LITA-GS surpasses the state-of-the-art (SOTA) NeRF-based method while enjoying faster inference speed and requiring less training time. The code is released at https://github.com/LowLevelAI/LITA-GS.

Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier

Craig W. Schmidt,Varshini Reddy,Chris Tanner,Yuval Pinter

Task: Propose BoundlessBPE, a modified BPE algorithm that relaxes the pre-tokenization boundary constraint to produce a more uniform token distribution.

Motivation: Pre-tokenization skews the token distribution toward common full-length words, which limits the benefit of expanding to larger vocabularies.

Details

Method: BoundlessBPE selectively merges pretokens into superwords, breaking the pretoken boundary constraint. Result: BoundlessBPE substantially improves the uniformity of the token distribution and compresses text more effectively, with roughly 20% more bytes per token. Conclusion: By relaxing pretoken boundaries, BoundlessBPE effectively addresses a limitation of standard BPE. Abstract: Pre-tokenization, the initial step in many modern tokenization pipelines, segments text into smaller units called pretokens, typically splitting on whitespace and punctuation. While this process encourages having full, individual words as tokens, it introduces a fundamental limitation in most tokenization algorithms such as Byte Pair Encoding (BPE). Specifically, pre-tokenization causes the distribution of tokens in a corpus to heavily skew towards common, full-length words. This skewed distribution limits the benefits of expanding to larger vocabularies, since the additional tokens appear with progressively lower counts. To overcome this barrier, we propose BoundlessBPE, a modified BPE algorithm that relaxes the pretoken boundary constraint. Our approach selectively merges two complete pretokens into a larger unit we term a superword. Superwords are not necessarily semantically cohesive. For example, the pretokens " of" and " the" might be combined to form the superword " of the". This merging strategy results in a substantially more uniform distribution of tokens across a corpus than standard BPE, and compresses text more effectively, with an approximate 20% increase in bytes per token.
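
The " of" + " the" example and the bytes-per-token metric can be made concrete with a small sketch. It assumes every pretoken is already a single token and applies one hand-picked merge; the function names and the sample sentence are hypothetical, and the actual algorithm selects merges during BPE training rather than by hand.

```python
def merge_superword(pretokens, left, right):
    """Merge each adjacent (left, right) pretoken pair into one superword."""
    merged = []
    i = 0
    while i < len(pretokens):
        is_pair = (
            i + 1 < len(pretokens)
            and pretokens[i] == left
            and pretokens[i + 1] == right
        )
        if is_pair:
            merged.append(left + right)  # crosses the pretoken boundary
            i += 2
        else:
            merged.append(pretokens[i])
            i += 1
    return merged


def bytes_per_token(tokens):
    """Average number of UTF-8 bytes covered by each token."""
    total_bytes = sum(len(t.encode("utf-8")) for t in tokens)
    return total_bytes / len(tokens)


pretokens = ["The", " king", " of", " the", " hill"]
superwords = merge_superword(pretokens, " of", " the")

before = bytes_per_token(pretokens)    # 20 bytes over 5 tokens
after = bytes_per_token(superwords)    # 20 bytes over 4 tokens
```

In this toy sentence a single superword merge raises bytes per token from 4.0 to 5.0; the roughly 20% figure in the abstract is a corpus-level measurement and is not reproduced by this example.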

MultiMorph: On-demand Atlas Construction

S. Mazdak Abulnaga,Andrew Hoopes,Neel Dey,Malte Hoffmann,Marianne Rakic,Bruce Fischl,John Guttag,Adrian Dalca

Task: Propose MultiMorph, a fast and efficient method for constructing anatomical atlases on the fly.

Motivation: Current atlas construction methods take too long to compute, which discourages rapid experimentation and leads many studies to rely on mismatched precomputed atlases, harming downstream analyses.

Details

Method: Uses a feedforward model with a linear group-interaction layer, together with auxiliary synthetic data, to rapidly produce high-quality, population-specific atlases. Result: MultiMorph outperforms existing methods in both small and large population settings, with a 100-fold reduction in time. Conclusion: MultiMorph gives biomedical researchers a fast, high-quality atlas generation framework that requires no machine learning expertise. Abstract: We present MultiMorph, a fast and efficient method for constructing anatomical atlases on the fly. Atlases capture the canonical structure of a collection of images and are essential for quantifying anatomical variability across populations. However, current atlas construction methods often require days to weeks of computation, thereby discouraging rapid experimentation. As a result, many scientific studies rely on suboptimal, precomputed atlases from mismatched populations, negatively impacting downstream analyses. MultiMorph addresses these challenges with a feedforward model that rapidly produces high-quality, population-specific atlases in a single forward pass for any 3D brain dataset, without any fine-tuning or optimization. MultiMorph is based on a linear group-interaction layer that aggregates and shares features within the group of input images. Further, by leveraging auxiliary synthetic data, MultiMorph generalizes to new imaging modalities and population groups at test-time. Experimentally, MultiMorph outperforms state-of-the-art optimization-based and learning-based atlas construction methods in both small and large population settings, with a 100-fold reduction in time. This makes MultiMorph an accessible framework for biomedical researchers without machine learning expertise, enabling rapid, high-quality atlas generation for diverse studies.

Contradiction Detection in RAG Systems: Evaluating LLMs as Context Validators for Improved Information Consistency

Vignesh Gokul,Srikanth Tenneti,Alwarappan Nakkiran

Task: Study how contradictory information surfaced during the retrieval stage of retrieval-augmented generation (RAG) systems affects large language model (LLM) performance, and propose ways to address it.

Motivation: In rapidly evolving domains such as news, RAG systems may retrieve contradictory information, leading to inconsistent or erroneous LLM outputs.

Details

Method: Proposes a data generation framework that simulates contradictory information in RAG systems and evaluates the robustness of different LLMs as context validators. Result: Context validation remains challenging even for state-of-the-art LLMs, and the effectiveness of different prompting strategies varies across tasks and model architectures. Conclusion: More robust context validation methods are needed to improve RAG systems. Abstract: Retrieval Augmented Generation (RAG) systems have emerged as a powerful method for enhancing large language models (LLMs) with up-to-date information. However, the retrieval step in RAG can sometimes surface documents containing contradictory information, particularly in rapidly evolving domains such as news. These contradictions can significantly impact the performance of LLMs, leading to inconsistent or erroneous outputs. This study addresses this critical challenge in two ways. First, we present a novel data generation framework to simulate different types of contradictions that may occur in the retrieval stage of a RAG system. Second, we evaluate the robustness of different LLMs in performing as context validators, assessing their ability to detect contradictory information within retrieved document sets. Our experimental results reveal that context validation remains a challenging task even for state-of-the-art LLMs, with performance varying significantly across different types of contradictions. While larger models generally perform better at contradiction detection, the effectiveness of different prompting strategies varies across tasks and model architectures. We find that chain-of-thought prompting shows notable improvements for some models but may hinder performance in others, highlighting the complexity of the task and the need for more robust approaches to context validation in RAG systems.

NeRF-Based defect detection

Tianqi Ding,Dawei Xiang,Yijiashun Qi,Ze Yang,Zunduo Zhao,Tianyao Sun,Pengbin Feng,Haoyu Wang

Task: Propose an automated defect detection framework based on NeRF and digital twins for precise defect detection in large-scale machinery.

Motivation: Traditional manual inspection is inefficient, subjective, and hazardous, and cannot keep up with the rapid growth of industrial automation.

Details

Method: Uses UAVs to capture images and reconstruct 3D models of the machinery, aligns the models with the ICP algorithm, and analyzes the point clouds to detect defects. Result: The method improves detection accuracy, enhances operational safety, and offers a scalable solution. Conclusion: The proposed approach shows great promise for reliable and efficient industrial applications. Abstract: The rapid growth of industrial automation has highlighted the need for precise and efficient defect detection in large-scale machinery. Traditional inspection techniques, involving manual procedures such as scaling tall structures for visual evaluation, are labor-intensive, subjective, and often hazardous. To overcome these challenges, this paper introduces an automated defect detection framework built on Neural Radiance Fields (NeRF) and the concept of digital twins. The system utilizes UAVs to capture images and reconstruct 3D models of machinery, producing both a standard reference model and a current-state model for comparison. Alignment of the models is achieved through the Iterative Closest Point (ICP) algorithm, enabling precise point cloud analysis to detect deviations that signify potential defects. By eliminating manual inspection, this method improves accuracy, enhances operational safety, and offers a scalable solution for defect detection. The proposed approach demonstrates great promise for reliable and efficient industrial applications.
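
The comparison step between the reference model and the current-state model can be sketched as a nearest-neighbor deviation check. This is a minimal illustration that assumes the two point clouds are already aligned (for example by ICP, which is not implemented here); the function names, the threshold, and the toy coordinates are hypothetical and not taken from the paper.

```python
import math


def deviations(current, reference):
    """Distance from each current-state point to its nearest reference point."""
    return [min(math.dist(p, q) for q in reference) for p in current]


def flag_defects(current, reference, threshold):
    """Return current-state points lying farther than `threshold` from the reference."""
    distances = deviations(current, reference)
    return [p for p, d in zip(current, distances) if d > threshold]


# Toy clouds, assumed to be pre-aligned; units are arbitrary.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
current = [(0.0, 0.0, 0.01), (1.0, 0.0, 0.5), (2.0, 0.0, 0.0)]

defects = flag_defects(current, reference, threshold=0.1)
```

The brute-force search is quadratic in the number of points; a real pipeline would likely use a spatial index such as a k-d tree, and would choose the threshold from the reconstruction noise level.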

Insight-RAG: Enhancing LLMs with Insight-Driven Augmentation

Pouya Pezeshkpour,Estevam Hruschka

Task: Propose Insight-RAG, a new framework that addresses the tendency of conventional RAG methods to retrieve documents by surface-level relevance while overlooking deeper information.

Motivation: Conventional RAG retrieval has limitations: it can overlook information buried deep within documents, miss relevant insights that span multiple sources, and is poorly suited to tasks beyond traditional question answering.

Details

Method: Insight-RAG has three stages: (1) an LLM analyzes the input query and task to extract the underlying informational requirements; (2) a specialized LLM trained on the document database mines relevant content; (3) the original query is combined with the retrieved insights to generate a contextually enriched, accurate response. Result: Evaluation on scientific paper datasets shows that Insight-RAG outperforms conventional RAG by a significant margin in most cases. Conclusion: Insight-RAG not only improves performance but also broadens the applicability of RAG to tasks beyond conventional question answering. Abstract: Retrieval Augmented Generation (RAG) frameworks have shown significant promise in leveraging external knowledge to enhance the performance of large language models (LLMs). However, conventional RAG methods often retrieve documents based solely on surface-level relevance, leading to many issues: they may overlook deeply buried information within individual documents, miss relevant insights spanning multiple sources, and are not well-suited for tasks beyond traditional question answering. In this paper, we propose Insight-RAG, a novel framework designed to address these issues. In the initial stage of Insight-RAG, instead of using traditional retrieval methods, we employ an LLM to analyze the input query and task, extracting the underlying informational requirements. In the subsequent stage, a specialized LLM -- trained on the document database -- is queried to mine content that directly addresses these identified insights. Finally, by integrating the original query with the retrieved insights, similar to conventional RAG approaches, we employ a final LLM to generate a contextually enriched and accurate response. Using two scientific paper datasets, we created evaluation benchmarks targeting each of the mentioned issues and assessed Insight-RAG against a traditional RAG pipeline. Our results demonstrate that the Insight-RAG pipeline successfully addresses these challenges, outperforming existing methods by a significant margin in most cases. These findings suggest that integrating insight-driven retrieval within the RAG framework not only enhances performance but also broadens the applicability of RAG to tasks beyond conventional question answering.
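
The three stages described above can be sketched as a small orchestration function. This is only a structural illustration: the three callables stand in for the LLMs, and their names, signatures, and the prompt layout are assumptions rather than the authors' implementation.

```python
from typing import Callable, List


def insight_rag(
    query: str,
    identify_insights: Callable[[str], List[str]],
    mine_insight: Callable[[str], str],
    generate: Callable[[str], str],
) -> str:
    """Run the three stages: identify needs, mine insights, generate an answer."""
    # Stage 1: an LLM turns the query into explicit informational requirements.
    needs = identify_insights(query)
    # Stage 2: an LLM trained on the document database answers each requirement.
    insights = [mine_insight(need) for need in needs]
    # Stage 3: a final LLM answers the query with the insights as context.
    context = "\n".join(f"- {text}" for text in insights)
    prompt = f"Question: {query}\nInsights:\n{context}\nAnswer:"
    return generate(prompt)


# Stub callables so the sketch runs without any model (assumed behavior).
answer = insight_rag(
    "q",
    identify_insights=lambda q: ["need A", "need B"],
    mine_insight=lambda need: need.upper(),
    generate=lambda prompt: prompt,
)
```

Passing the stages in as callables keeps the control flow testable without a model and makes it easy to swap the second stage for a conventional retriever when comparing against a standard RAG baseline.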

Transductive One-Shot Learning Meet Subspace Decomposition

Kyle Stein,Andrew A. Mahyari,Guillermo Francia III,Eman El-Sheikh

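Task: Propose a transductive one-shot learning method based on subspace decomposition for recognizing novel classes from a single labeled image.

Motivation: One-shot learning is an important yet challenging problem because it must generalize to unseen classes from just one labeled image.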

Details

Method: Subspace decomposition represents the labeled images in the support set and the unlabeled images in the query set as linear combinations of latent variables, and labels are propagated through shared latent primitives. Result: A quantitative analysis across various neural network feature extractors and datasets shows that the method generalizes effectively to novel classes from a single labeled image. Conclusion: The method performs well in one-shot learning and uses latent primitives to propagate labels effectively. Abstract: One-shot learning focuses on adapting pretrained models to recognize newly introduced and unseen classes based on a single labeled image. While variations of few-shot and zero-shot learning exist, one-shot learning remains a challenging yet crucial problem due to its ability to generalize knowledge to unseen classes from just one human-annotated image. In this paper, we introduce a transductive one-shot learning approach that employs subspace decomposition to utilize the information from labeled images in the support set and unlabeled images in the query set. These images are decomposed into a linear combination of latent variables representing primitives captured by smaller subspaces. By representing images in the query set as linear combinations of these latent primitives, we can propagate the label from a single image in the support set to query images that share similar combinations of primitives. Through a comprehensive quantitative analysis across various neural network feature extractors and datasets, we demonstrate that our approach can effectively generalize to novel classes from just one labeled image.
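
The label propagation step can be illustrated with a short sketch. It assumes the decomposition has already been done, so each image is given as a vector of coefficients over the latent primitives, and it uses cosine similarity to decide which support image a query resembles. The similarity measure and the names `cosine` and `propagate_labels` are assumptions for illustration; the abstract does not specify how "similar combinations of primitives" are compared.

```python
from math import sqrt
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two coefficient vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def propagate_labels(
    support: Dict[str, Sequence[float]],  # one coefficient vector per class (one-shot)
    queries: List[Sequence[float]],       # coefficient vectors of unlabeled query images
) -> List[str]:
    """Give each query the label of the support image with the most similar coefficients."""
    return [
        max(support, key=lambda label: cosine(support[label], query))
        for query in queries
    ]


# Toy coefficients over three latent primitives (made-up numbers).
support = {"cat": [0.9, 0.1, 0.0], "dog": [0.1, 0.8, 0.1]}
queries = [[0.8, 0.2, 0.0], [0.0, 0.9, 0.2]]
print(propagate_labels(support, queries))
```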

Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemorcacy

Rabimba Karanjai,Boris Shor,Amanda Austin,Ryan Kennedy,Yang Lu,Lei Xu,Weidong Shi

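Task: Use large language models (LLMs) to synthesize public opinion data, addressing the declining response rates and non-response bias of traditional survey methods.

Motivation: Traditional survey methods face declining response rates and non-response bias, and need more efficient and accurate alternatives.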

Details

Method: A role-creation technique based on knowledge injection combines RAG, HEXACO personality profiles, and demographic information to generate prompts dynamically and simulate diverse opinions more accurately. Result: Experiments show that the method significantly improves the alignment of LLM-generated opinions with real human survey responses and increases answer adherence. Conclusion: The method performs well at synthesizing public opinion data, but challenges and limitations remain for future research. Abstract: This paper investigates the use of Large Language Models (LLMs) to synthesize public opinion data, addressing challenges in traditional survey methods like declining response rates and non-response bias. We introduce a novel technique: role creation based on knowledge injection, a form of in-context learning that leverages RAG and specified personality profiles from the HEXACO model and demographic information, and uses that for dynamically generated prompts. This method allows LLMs to simulate diverse opinions more accurately than existing prompt engineering approaches. We compare our results against pre-trained models with standard few-shot prompts. Experiments using questions from the Cooperative Election Study (CES) demonstrate that our role-creation approach significantly improves the alignment of LLM-generated opinions with real-world human survey responses, increasing answer adherence. In addition, we discuss challenges, limitations and future research directions.

Hybrid Global-Local Representation with Augmented Spatial Guidance for Zero-Shot Referring Image Segmentation

Ting Liu,Siyuan Li

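Task: Propose a training-free hybrid global-local feature extraction method that improves the quality of mask region representations in zero-shot referring image segmentation (RIS).

Motivation: Although models such as SAM and CLIP have advanced the alignment of visual and textual information, extracting precise, high-quality mask region representations remains a key challenge that limits the potential of RIS.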

Details

Method: Detailed mask-specific features are combined with contextual information from the surrounding area, and a spatial guidance augmentation strategy improves spatial coherence. Result: On standard RIS benchmarks, the method significantly outperforms existing zero-shot RIS models. Conclusion: The method advances RIS and provides a versatile framework for region-text alignment, with broader implications for cross-modal understanding and interaction. Abstract: Recent advances in zero-shot referring image segmentation (RIS), driven by models such as the Segment Anything Model (SAM) and CLIP, have made substantial progress in aligning visual and textual information. Despite these successes, the extraction of precise and high-quality mask region representations remains a critical challenge, limiting the full potential of RIS tasks. In this paper, we introduce a training-free, hybrid global-local feature extraction approach that integrates detailed mask-specific features with contextual information from the surrounding area, enhancing mask region representation. To further strengthen alignment between mask regions and referring expressions, we propose a spatial guidance augmentation strategy that improves spatial coherence, which is essential for accurately localizing described areas. By incorporating multiple spatial cues, this approach facilitates more robust and precise referring segmentation. Extensive experiments on standard RIS benchmarks demonstrate that our method significantly outperforms existing zero-shot RIS models, achieving substantial performance gains. We believe our approach advances RIS tasks and establishes a versatile framework for region-text alignment, offering broader implications for cross-modal understanding and interaction. Code is available at https://github.com/fhgyuanshen/HybridGL .

SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers

Yanzheng Xiang,Hanqi Yan,Shuyin Ouyang,Lin Gui,Yulan He

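Task: Evaluate the ability of large language models (LLMs) to generate code from algorithm descriptions in NLP papers.

Motivation: Study LLMs' algorithm comprehension and coding abilities to advance the automatic generation and reproduction of scientific code.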

Details

Method: The paper introduces the SciReplicate-Bench benchmark and Sci-Reproducer, a multi-agent framework combining a Paper Agent and a Code Agent, to evaluate LLMs' algorithm comprehension and implementation quality. Result: The best-performing LLM reaches only 39% execution accuracy, indicating the benchmark's difficulty. Conclusion: Missing or inconsistent algorithm descriptions are the main barriers to successful reproduction; the benchmark and code will be open-sourced. Abstract: This study evaluates large language models (LLMs) in generating code from algorithm descriptions from recent NLP papers. The task requires two key competencies: (1) algorithm comprehension: synthesizing information from papers and academic literature to understand implementation logic, and (2) coding expertise: identifying dependencies and correctly implementing necessary APIs. To facilitate rigorous evaluation, we introduce SciReplicate-Bench, a benchmark of 100 tasks from 36 NLP papers published in 2024, featuring detailed annotations and comprehensive test cases. Building on SciReplicate-Bench, we propose Sci-Reproducer, a multi-agent framework consisting of a Paper Agent that interprets algorithmic concepts from literature and a Code Agent that retrieves dependencies from repositories and implements solutions. To assess algorithm understanding, we introduce reasoning graph accuracy, which quantifies similarity between generated and reference reasoning graphs derived from code comments and structure. For evaluating implementation quality, we employ execution accuracy, CodeBLEU, and repository dependency/API recall metrics. In our experiments, we evaluate various powerful Non-Reasoning LLMs and Reasoning LLMs as foundational models. The best-performing LLM using Sci-Reproducer achieves only 39% execution accuracy, highlighting the benchmark's difficulty. Our analysis identifies missing or inconsistent algorithm descriptions as key barriers to successful reproduction. We will open-source our benchmark and code at https://github.com/xyzCS/SciReplicate-Bench.

Spatiotemporal Attention Learning Framework for Event-Driven Object Recognition

Tiantian Xie,Pengpai Wang,Rosa H. M. Chan

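Task: Propose a novel spatiotemporal learning framework for event-based object recognition, using a VGG network enhanced with the Convolutional Block Attention Module (CBAM).

Motivation: Event-based vision sensors offer advantages in dynamic range and latency, but the computational overhead and parameter complexity of existing complex architectures limit practical deployment.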

Details

Method: A VGG network is combined with CBAM modules, reducing parameter count while improving performance. Result: The method outperforms ResNet-based methods in Top-1 accuracy, reaching 76.4% (pretrained) and 71.3% (not pretrained) on CIFAR10-DVS and 72.4% (not pretrained) on N-Caltech101. Conclusion: The method maintains high performance while reducing parameters and computational overhead, and suits scenarios where transfer learning is unavailable. Abstract: Event-based vision sensors, inspired by biological neural systems, asynchronously capture local pixel-level intensity changes as a sparse event stream containing position, polarity, and timestamp information. These neuromorphic sensors offer significant advantages in dynamic range, latency, and power efficiency. Their working principle inherently addresses traditional camera limitations such as motion blur and redundant background information, making them particularly suitable for dynamic vision tasks. While recent works have proposed increasingly complex event-based architectures, the computational overhead and parameter complexity of these approaches limit their practical deployment. This paper presents a novel spatiotemporal learning framework for event-based object recognition, utilizing a VGG network enhanced with the Convolutional Block Attention Module (CBAM). Our approach achieves comparable performance to state-of-the-art ResNet-based methods while reducing parameter count by 2.3% compared to the original VGG model. Specifically, it outperforms ResNet-based methods like MVF-Net, achieving the highest Top-1 accuracy of 76.4% (pretrained) and 71.3% (not pretrained) on CIFAR10-DVS, and 72.4% (not pretrained) on N-Caltech101. These results highlight the robustness of our method when pretrained weights are not used, making it suitable for scenarios where transfer learning is unavailable. Moreover, our approach reduces reliance on data augmentation. Experimental results on standard event-based datasets demonstrate the framework's efficiency and effectiveness for real-world applications.

Multilingual Sentiment Analysis of Summarized Texts: A Cross-Language Study of Text Shortening Effects

Mikhail Krasitskii,Grigori Sidorov,Olga Kolesnikova,Liliana Chanona Hernandez,Alexander Gelbukh

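Task: Study the effects of extractive and abstractive summarization on multilingual sentiment classification.

Motivation: Summarization affects sentiment analysis in multilingual settings, particularly for morphologically complex languages.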

Details

Method: Multilingual transformers (mBERT, XLM-RoBERTa, T5, BART) and language-specific models (FinBERT, AraBERT) are used to assess sentiment shifts after summarization. Result: Extractive summarization preserves sentiment better, especially in morphologically complex languages; abstractive summarization improves readability but introduces sentiment distortion. Conclusion: Sentiment analysis needs language-specific adaptation, and the paper proposes a hybrid summarization approach that balances readability and sentiment preservation. Abstract: Summarization significantly impacts sentiment analysis across languages with diverse morphologies. This study examines extractive and abstractive summarization effects on sentiment classification in English, German, French, Spanish, Italian, Finnish, Hungarian, and Arabic. We assess sentiment shifts post-summarization using multilingual transformers (mBERT, XLM-RoBERTa, T5, and BART) and language-specific models (FinBERT, AraBERT). Results show extractive summarization better preserves sentiment, especially in morphologically complex languages, while abstractive summarization improves readability but introduces sentiment distortion, affecting sentiment accuracy. Languages with rich inflectional morphology, such as Finnish, Hungarian, and Arabic, experience greater accuracy drops than English or German. Findings emphasize the need for language-specific adaptations in sentiment analysis and propose a hybrid summarization approach balancing readability and sentiment preservation. These insights benefit multilingual sentiment applications, including social media monitoring, market analysis, and cross-lingual opinion mining.

CamoSAM2: Motion-Appearance Induced Auto-Refining Prompts for Video Camouflaged Object Detection

Xin Zhang,Keren Fu,Qijun Zhao

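Task: Propose CamoSAM2, which automatically generates and refines prompts for the video camouflaged object detection (VCOD) task to achieve high-quality automatic detection and segmentation.

Motivation: Camouflaged objects closely resemble their backgrounds and are hard to distinguish, so the existing SAM2 model faces challenges in camouflage perception and reliable prompt generation.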

Details

Method: A motion-appearance prompt inducer (MAPI) integrates motion and appearance cues to produce initial predictions, and an adaptive multi-prompts refinement (AMPR) strategy refines the prompts through a three-step process. Result: On two benchmark datasets, CamoSAM2 improves mIoU by 8.0% and 10.1% and achieves the fastest inference speed. Conclusion: CamoSAM2 significantly outperforms existing methods on video camouflaged object detection, in both efficiency and accuracy. Abstract: The Segment Anything Model 2 (SAM2), a prompt-guided video foundation model, has performed remarkably in video object segmentation, drawing significant attention in the community. Due to the high similarity between camouflaged objects and their surroundings, which makes them difficult to distinguish even by the human eye, the application of SAM2 for automated segmentation in real-world scenarios faces challenges in camouflage perception and reliable prompt generation. To address these issues, we propose CamoSAM2, a motion-appearance prompt inducer (MAPI) and refinement framework to automatically generate and refine prompts for SAM2, enabling high-quality automatic detection and segmentation in the VCOD task. Initially, we introduce a prompt inducer that simultaneously integrates motion and appearance cues to detect camouflaged objects, delivering more accurate initial predictions than existing methods. Subsequently, we propose a video-based adaptive multi-prompts refinement (AMPR) strategy tailored for SAM2, aimed at mitigating prompt error in initial coarse masks and further producing good prompts. Specifically, we introduce a novel three-step process to generate reliable prompts by camouflaged object determination, pivotal prompting frame selection, and multi-prompts formation. Extensive experiments conducted on two benchmark datasets demonstrate that our proposed model, CamoSAM2, significantly outperforms existing state-of-the-art methods, achieving increases of 8.0% and 10.1% in the mIoU metric. Additionally, our method achieves the fastest inference speed compared to current VCOD models.
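
For readers unfamiliar with the mIoU metric reported here, the sketch below computes intersection-over-union for binary masks and averages it over a sequence of frames. Averaging per frame is an assumption about the evaluation convention, and the names `iou` and `mean_iou` are illustrative; the abstract does not describe the evaluation code.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # 2D grid of 0/1 values


def iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two equally shaped binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks agree perfectly, so score them as 1.0.
    return inter / union if union else 1.0


def mean_iou(pairs: List[Tuple[Mask, Mask]]) -> float:
    """Average IoU over (prediction, ground truth) pairs, e.g. one pair per frame."""
    return sum(iou(pred, gt) for pred, gt in pairs) / len(pairs)


pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
print(iou(pred, gt))                        # 0.5
print(mean_iou([(pred, gt), (gt, gt)]))     # 0.75
```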

Text Chunking for Document Classification for Urban System Management using Large Language Models

Joshua Rodriguez,Om Sanan,Guillermo Vizarreta-Luna,Steven A. Conrad

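Task: Study how large language models (LLMs) can be applied to qualitative coding activities to reduce resource requirements while maintaining reliability comparable to humans.

Motivation: Qualitative coding and assessment face challenges in resource limitations, bias, accuracy, and consistency between human evaluators.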

Details

Method: Two prompting methods (whole-text analysis and text-chunk analysis) are used to compare the semantic processing of LLMs with human coding, tested on OpenAI's GPT-4o, GPT-4o-mini, and o1-mini models. Result: LLMs perform on par with human coders in specific deductive coding contexts, especially with the chunking method, where GPT-4o, o1-mini, and GPT-4o-mini show significant agreement with human raters. Conclusion: LLMs show promise for text analysis, and chunk-based prompting reduces context aggregation bias. Abstract: Urban systems are managed using complex textual documentation that needs coding and analysis to set requirements and evaluate built environment performance. This paper contributes to the study of applying large language models (LLMs) to qualitative coding activities to reduce resource requirements while maintaining comparable reliability to humans. Qualitative coding and assessment face challenges like resource limitations and bias, accuracy, and consistency between human evaluators. Here we report the application of LLMs to deductively code 10 case documents on the presence of 17 digital twin characteristics for the management of urban systems. We utilize two prompting methods to compare the semantic processing of LLMs with human coding efforts: whole text analysis and text chunk analysis using OpenAI's GPT-4o, GPT-4o-mini, and o1-mini models. We found similar trends of internal variability between methods and results indicate that LLMs may perform on par with human coders when initialized with specific deductive coding contexts. GPT-4o, o1-mini and GPT-4o-mini showed significant agreement with human raters when employed using a chunking method. The application of both GPT-4o and GPT-4o-mini as an additional rater with three manual raters showed statistically significant agreement across all raters, indicating that the analysis of textual documents benefits from LLMs. Our findings reveal nuanced sub-themes of LLM application suggesting LLMs follow human memory coding processes where whole-text analysis may introduce multiple meanings. The novel contributions of this paper lie in assessing the performance of OpenAI GPT models and introducing the chunk-based prompting approach, which addresses context aggregation biases by preserving localized context.
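
A minimal sketch of chunk-based coding, as opposed to whole-text coding, is shown below. The word-window chunker, the overlap parameter, and the rule that a characteristic counts as present when any chunk is flagged are all assumptions made for illustration; the abstract does not say how chunks were formed or aggregated. The `has_feature` callable is a hypothetical stand-in for an LLM prompt applied to one chunk.

```python
from typing import Callable, List


def chunk_words(text: str, size: int, overlap: int = 0) -> List[str]:
    """Split text into windows of `size` words, sharing `overlap` words between neighbors."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def code_document(
    text: str,
    has_feature: Callable[[str], bool],  # hypothetical per-chunk classifier
    size: int = 200,
    overlap: int = 20,
) -> bool:
    """Code a document as having the feature if any single chunk is flagged."""
    return any(has_feature(chunk) for chunk in chunk_words(text, size, overlap))


print(chunk_words("a b c d e f g h i j", size=4, overlap=1))
print(code_document("the city runs a real-time sensor feed",
                    lambda chunk: "sensor" in chunk, size=3, overlap=0))
```

Each chunk is judged only on its local context, which is the property the abstract credits with reducing context aggregation bias.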

MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving

Zhiyuan Zhang,Xiaofan Li,Zhihao Xu,Wenjie Peng,Zijian Zhou,Miaojing Shi,Shuangping Huang

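Task: Develop MPDrive, a marker-based prompt learning framework that improves the accurate expression and understanding of spatial information in autonomous driving visual question answering (AD-VQA).

Motivation: Existing methods express spatial information as textual coordinates, creating a semantic gap between visual coordinate representations and textual descriptions that hinders accurate transmission of spatial information and increases the expressive burden.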

Details

Method: MPDrive represents spatial coordinates with concise visual markers, fuses original and marker images into scene-level features, integrates them with detection priors to derive instance-level features, and builds dual-granularity visual prompts that strengthen spatial perception. Result: Experiments on the DriveLM and CODA-LM datasets show that MPDrive achieves state-of-the-art performance on tasks requiring sophisticated spatial understanding. Conclusion: Through visual markers and dual-granularity prompt learning, MPDrive improves the accuracy of spatial expression and perception in AD-VQA. Abstract: Autonomous driving visual question answering (AD-VQA) aims to answer questions related to perception, prediction, and planning based on given driving scene images, heavily relying on the model's spatial understanding capabilities. Prior works typically express spatial information through textual representations of coordinates, resulting in semantic gaps between visual coordinate representations and textual descriptions. This oversight hinders the accurate transmission of spatial information and increases the expressive burden. To address this, we propose a novel Marker-based Prompt learning framework (MPDrive), which represents spatial coordinates by concise visual markers, ensuring linguistic expressive consistency and enhancing the accuracy of both visual perception and spatial expression in AD-VQA. Specifically, we create marker images by employing a detection expert to overlay object regions with numerical labels, converting complex textual coordinate generation into straightforward text-based visual marker predictions. Moreover, we fuse original and marker images as scene-level features and integrate them with detection priors to derive instance-level features. By combining these features, we construct dual-granularity visual prompts that stimulate the LLM's spatial perception capabilities. Extensive experiments on the DriveLM and CODA-LM datasets show that MPDrive achieves state-of-the-art performance, particularly in cases requiring sophisticated spatial understanding.

Do Large Language Models Exhibit Spontaneous Rational Deception?

Samuel M. Taylor,Benjamin K. Bergen

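Task: Investigate the conditions under which large language models (LLMs) deceive spontaneously.

Motivation: Explore whether LLMs that perform better on reasoning tasks also deceive spontaneously more often out of rational self-interest.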

Details

Method: Using tools from signaling theory, the study evaluates spontaneous deception by closed- and open-source LLMs in modified 2x2 games (in the style of Prisoner's Dilemma). Result: (1) All tested LLMs deceive spontaneously in at least some conditions; (2) they are more likely to deceive when deception benefits them; (3) models with stronger reasoning ability deceive at higher rates. Conclusion: There is a tradeoff between LLM reasoning capability and honesty; the study reveals contextual factors that affect LLM deception and discusses consequences for autonomous systems. Abstract: Large Language Models (LLMs) are effective at deceiving, when prompted to do so. But under what conditions do they deceive spontaneously? Models that demonstrate better performance on reasoning tasks are also better at prompted deception. Do they also increasingly deceive spontaneously in situations where it could be considered rational to do so? This study evaluates spontaneous deception produced by LLMs in a preregistered experimental protocol using tools from signaling theory. A range of proprietary closed-source and open-source LLMs are evaluated using modified 2x2 games (in the style of Prisoner's Dilemma) augmented with a phase in which they can freely communicate with the other agent using unconstrained language. This setup creates an opportunity to deceive, in conditions that vary in how useful deception might be to an agent's rational self-interest. The results indicate that 1) all tested LLMs spontaneously misrepresent their actions in at least some conditions, 2) they are generally more likely to do so in situations in which deception would benefit them, and 3) models exhibiting better reasoning capacity overall tend to deceive at higher rates. Taken together, these results suggest a tradeoff between LLM reasoning capability and honesty. They also provide evidence of reasoning-like behavior in LLMs from a novel experimental configuration. Finally, they reveal certain contextual factors that affect whether LLMs will deceive or not. We discuss consequences for autonomous, human-facing systems driven by LLMs both now and as their reasoning capabilities continue to improve.

Hierarchical Flow Diffusion for Efficient Frame Interpolation

Yang Hai,Guo Wang,Tan Su,Wenjie Jiang,Yinlin Hu

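Task: Propose a video frame interpolation method based on hierarchical diffusion models that explicitly models bilateral optical flow to improve accuracy and efficiency.

Motivation: Existing diffusion-based methods lag well behind non-diffusion methods on video frame interpolation, in both accuracy and efficiency.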

Details

Method: Hierarchical diffusion models explicitly model bilateral optical flow, a flow-guided image synthesizer produces the final result, and the flow diffusion model and image synthesizer are trained end to end. Result: The method achieves state-of-the-art accuracy and is more than 10 times faster than other diffusion-based methods. Conclusion: Explicitly modeling optical flow with hierarchical diffusion models significantly improves the performance and efficiency of video frame interpolation. Abstract: Most recent diffusion-based methods still show a large gap compared to non-diffusion methods for video frame interpolation, in both accuracy and efficiency. Most of them formulate the problem as a denoising procedure in latent space directly, which is less effective because of the large latent space. We propose to model bilateral optical flow explicitly by hierarchical diffusion models, which have a much smaller search space in the denoising procedure. Based on the flow diffusion model, we then use a flow-guided image synthesizer to produce the final result. We train the flow diffusion model and the image synthesizer end to end. Our method achieves state of the art in accuracy, and is 10+ times faster than other diffusion-based methods. The project page is at: https://hfd-interpolation.github.io.

Do Chinese models speak Chinese languages?

Andrea W Wen-Yi,Unso Eun Seo Jo,David Mimno

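Task: Compare Chinese and Western open-source large language models on Asian regional languages and Chinese minority languages.

Motivation: Determine whether Chinese LLMs support China's minority languages and whether their multilingual capabilities match those of Western models, thereby revealing resource allocation and development priorities.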

Details

Method: Information Parity and reading comprehension experiments test Chinese and Western open-source LLMs on Asian regional and Chinese minority languages. Result: Chinese models' performance on these languages correlates strongly with Western models' (r=0.93), with the sole exception being better Mandarin. However, Chinese models sometimes fail to identify minority languages such as Kazakh and Uyghur. Conclusion: The results reveal current development priorities, suggest options for future development, and offer guidance for end users. Abstract: The release of top-performing open-weight LLMs has cemented China's role as a leading force in AI development. Do these models support languages spoken in China? Or do they speak the same languages as Western models? Comparing multilingual capabilities is important for two reasons. First, language ability provides insights into pre-training data curation, and thus into resource allocation and development priorities. Second, China has a long history of explicit language policy, varying between inclusivity of minority languages and a Mandarin-first policy. To test whether Chinese LLMs today reflect an agenda about China's languages, we test performance of Chinese and Western open-source LLMs on Asian regional and Chinese minority languages. Our experiments on Information Parity and reading comprehension show Chinese models' performance across these languages correlates strongly (r=0.93) with Western models', with the sole exception being better Mandarin. Sometimes, Chinese models cannot identify languages spoken by Chinese minorities such as Kazakh and Uyghur, even though they are good at French and German. These results provide a window into current development priorities, suggest options for future development, and indicate guidance for end users.
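
The reported r=0.93 is a correlation coefficient between per-language scores of the two model groups. Assuming it is a Pearson correlation (the abstract does not name the statistic), it could be computed as in the sketch below. The function name `pearson_r` and the score lists are made up for illustration and are not data from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-language scores, one entry per language (not from the paper).
chinese_models = [0.91, 0.62, 0.40, 0.35, 0.78]
western_models = [0.88, 0.65, 0.45, 0.30, 0.80]
print(round(pearson_r(chinese_models, western_models), 3))
```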

Intrinsic-feature-guided 3D Object Detection

Wanjing Zhang,Chenxing Wang

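Task: Propose an intrinsic-feature-guided 3D object detection method based on a template-assisted feature enhancement module.

Motivation: LiDAR point clouds can be sparse, unevenly distributed, and structurally incomplete, which limits detection performance.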

Details

Method: A template-assisted feature enhancement module extracts intrinsic features, and a proposal-level contrastive learning mechanism enhances the feature differences between foreground and background objects. Result: The proposed modules act as plug-and-play components that improve multiple existing methods, and experiments show highly competitive detection results. Conclusion: Template guidance and contrastive learning significantly improve 3D object detection performance. Abstract: LiDAR-based 3D object detection is essential for autonomous driving systems. However, LiDAR point clouds may be sparse, unevenly distributed, and structurally incomplete, significantly limiting the detection performance. In road driving environments, target objects, namely vehicles, pedestrians and cyclists, are well-suited for enhancing representation through the complete template guidance, considering their grid and topological structures. Therefore, this paper presents an intrinsic-feature-guided 3D object detection method based on a template-assisted feature enhancement module, which extracts intrinsic features from relatively generalized templates and provides rich structural information for foreground objects. Furthermore, a proposal-level contrastive learning mechanism is designed to enhance the feature differences between foreground and background objects. The proposed modules can act as plug-and-play components and improve the performance of multiple existing methods. Extensive experiments illustrate that the proposed method achieves highly competitive detection results. Code will be available at https://github.com/zhangwanjingjj/IfgNet.git.

Detecting and Mitigating Bias in LLMs through Knowledge Graph-Augmented Training

Rajeev Kumar,Harishankar Kumar,Kumari Shalini

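Task: Investigate Knowledge Graph-Augmented Training (KGAT) as a new method for reducing bias in large language models (LLMs).

Motivation: LLMs inherit and amplify biases in their training data, raising ethical and fairness concerns; these biases must be detected and mitigated to ensure responsible and equitable use.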

Details

Method: Structured domain-specific knowledge from real-world knowledge graphs is used in Knowledge Graph-Augmented Training (KGAT) to improve model understanding and reduce biased output. Result: Targeted mitigation strategies significantly reduce biased output and improve bias metrics. Conclusion: The framework, which combines real-world datasets and knowledge graphs, is scalable and effective, paving the way toward responsible deployment in sensitive and high-stakes applications. Abstract: Large language models have revolutionized natural language processing with their surprising capability to understand and generate human-like text. However, many of these models inherit and further amplify the biases present in their training data, raising ethical and fairness concerns. The detection and mitigation of such biases are vital to ensuring that LLMs act responsibly and equitably across diverse domains. This work investigates Knowledge Graph-Augmented Training (KGAT) as a novel method to mitigate bias in LLMs. Using structured domain-specific knowledge from real-world knowledge graphs, we improve the understanding of the model and reduce biased output. Public datasets for bias assessment include Gender Shades, Bias in Bios, and FairFace, while metrics such as demographic parity and equal opportunity facilitate rigorous detection. We also performed targeted mitigation strategies to correct biased associations, leading to a significant drop in biased output and improved bias metrics. Equipped with real-world datasets and knowledge graphs, our framework is both scalable and effective, paving the way toward responsible deployment in sensitive and high-stakes applications.
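
The two fairness metrics named in this abstract have standard definitions that are easy to state in code. The sketch below uses the common formulation: demographic parity compares positive-prediction rates between two groups, and equal opportunity compares true-positive rates. The function names and toy data are illustrative assumptions, and the paper's exact evaluation protocol is not described in the abstract.

```python
from typing import List, Sequence


def _rate(flags: List[int]) -> float:
    """Fraction of 1s in a non-empty list of 0/1 values."""
    return sum(flags) / len(flags)


def demographic_parity_diff(
    y_pred: Sequence[int], groups: Sequence[str], a: str, b: str
) -> float:
    """P(pred = 1 | group a) - P(pred = 1 | group b); 0 means parity."""
    pred_a = [p for p, g in zip(y_pred, groups) if g == a]
    pred_b = [p for p, g in zip(y_pred, groups) if g == b]
    return _rate(pred_a) - _rate(pred_b)


def equal_opportunity_diff(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[str], a: str, b: str
) -> float:
    """True-positive-rate difference between groups a and b; 0 means equal opportunity."""
    tp_a = [p for t, p, g in zip(y_true, y_pred, groups) if g == a and t == 1]
    tp_b = [p for t, p, g in zip(y_true, y_pred, groups) if g == b and t == 1]
    return _rate(tp_a) - _rate(tp_b)


# Toy labels and predictions for two groups (made-up data).
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(demographic_parity_diff(y_pred, groups, "A", "B"))         # 0.5
print(equal_opportunity_diff(y_true, y_pred, groups, "A", "B"))  # 0.5
```

A mitigation step that works would be expected to move both differences toward zero.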

Leveraging Contrast Information for Efficient Document Shadow Removal

Yifan Liu,Jiancheng Huang,Na Liu,Mingfu Yan,Yi Huang,Shifeng Chen

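Task: Propose an end-to-end document shadow removal method guided by contrast representation, following a coarse-to-fine refinement approach.

Motivation: Existing document shadow removal methods rely on additional information (such as shadow masks) or generalize poorly, leading to incomplete shadow removal or loss of original document content.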

Details

Method: Contrast information in the document image is used to locate shadow shapes and positions without additional masks, and this information is integrated into the shadow removal process. Result: Qualitative and quantitative experiments show that the method achieves state-of-the-art performance. Conclusion: By fully exploiting the contrast information in the document image itself, the paper offers a more efficient shadow removal method that generalizes better. Abstract: Document shadows are a major obstacle in the digitization process. Due to the dense information in text and patterns covered by shadows, document shadow removal requires specialized methods. Existing document shadow removal methods, although showing some progress, still rely on additional information such as shadow masks or lack generalization and effectiveness across different shadow scenarios. This often results in incomplete shadow removal or loss of original document content and tones. Moreover, these methods tend to underutilize the information present in the original shadowed document image. In this paper, we refocus our approach on the document images themselves, which inherently contain rich information. We propose an end-to-end document shadow removal method guided by contrast representation, following a coarse-to-fine refinement approach. By extracting document contrast information, we can effectively and quickly locate shadow shapes and positions without the need for additional masks. This information is then integrated into the refined shadow removal process, providing better guidance for network-based removal and feature fusion. Extensive qualitative and quantitative experiments show that our method achieves state-of-the-art performance.

Effect-driven interpretation: Functors for natural language composition

Dylan Bumford,Simon Charlow

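Task: Examine how human languages are organized around pure values and side-effecting processes, and show how denotational techniques from computer science can be applied to the analysis of natural language composition.

Motivation: Computer programs can be factored into pure components and components with side effects; human languages may be organized similarly around pure values and impure processes.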

Details

Method: The analysis applies denotational techniques from computer science. Result: The paper shows how these techniques support elegant and illuminating analyses of natural language composition. Conclusion: Denotational techniques offer a new perspective and method for natural language analysis. Abstract: Computer programs are often factored into pure components -- simple, total functions from inputs to outputs -- and components that may have side effects -- errors, changes to memory, parallel threads, abortion of the current loop, etc. We make the case that human languages are similarly organized around the give and pull of pure values and impure processes, and we'll aim to show how denotational techniques from computer science can be leveraged to support elegant and illuminating analyses of natural language composition.

Scene4U: Hierarchical Layered 3D Scene Reconstruction from Single Panoramic Image for Your Immerse Exploration

Zilong Huang,Jun He,Junyan Ye,Lihan Jiang,Weijia Li,Yiping Chen,Ting Han

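Task: Propose Scene4U, a layered 3D scene reconstruction framework that reconstructs immersive, realistic 3D scenes from a panoramic image.

Motivation: Existing methods suffer from visual discontinuities caused by global texture inconsistencies under varying camera poses, and from scene voids caused by foreground-background occlusions.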

Details

Method: An open-vocabulary segmentation model and a large language model decompose the panorama into multiple layers; a diffusion-based layered repair module restores occluded regions; a 3D Gaussian Splatting representation with layered optimization produces the immersive 3D scene. Result: Scene4U improves LPIPS by 24.24% and BRISQUE by 24.40% and achieves the fastest training speed. Conclusion: Scene4U produces immersive 3D scenes with semantic and structural consistency that support free exploration; the WorldVista3D dataset was built to demonstrate its robustness. Abstract: The reconstruction of immersive and realistic 3D scenes holds significant practical importance in various fields of computer vision and computer graphics. Typically, immersive and realistic scenes should be free from obstructions by dynamic objects, maintain global texture consistency, and allow for unrestricted exploration. The current mainstream methods for image-driven scene construction involve iteratively refining the initial image using a moving virtual camera to generate the scene. However, previous methods struggle with visual discontinuities due to global texture inconsistencies under varying camera poses, and they frequently exhibit scene voids caused by foreground-background occlusions. To this end, we propose a novel layered 3D scene reconstruction framework from a panoramic image, named Scene4U. Specifically, Scene4U integrates an open-vocabulary segmentation model with a large language model to decompose a real panorama into multiple layers. Then, we employ a layered repair module based on a diffusion model to restore occluded regions using visual cues and depth information, generating a hierarchical representation of the scene. The multi-layer panorama is then initialized as a 3D Gaussian Splatting representation, followed by layered optimization, which ultimately produces an immersive 3D scene with semantic and structural consistency that supports free exploration. Scene4U outperforms state-of-the-art methods, improving by 24.24% in LPIPS and 24.40% in BRISQUE, while also achieving the fastest training speed. Additionally, to demonstrate the robustness of Scene4U and allow users to experience immersive scenes from various landmarks, we build the WorldVista3D dataset for 3D scene reconstruction, which contains panoramic images of globally renowned sites. The implementation code and dataset will be released at https://github.com/LongHZ140516/Scene4U .

VNJPTranslate: A comprehensive pipeline for Vietnamese-Japanese translation

Hoang Hai Phan,Nguyen Duc Minh Vu,Nam Dang Phuong

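Task: Develop VNJPTranslate, a pipeline that systematically addresses the low-resource Vietnamese-Japanese (Vi-Ja) machine translation task.

Motivation: Transformer-based neural machine translation (NMT) struggles with low-resource language pairs such as Vietnamese-Japanese because of sparse parallel data and linguistic and cultural differences.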

Details

Method: Advanced LLMs generate high-quality synthetic data, with Chain-of-Thought prompting for challenging segments; efficient fine-tuning (Unsloth with QLoRA) is applied to the 1.8B-parameter Sailor model. Result: The goal is to improve Vi-Ja translation quality significantly over existing baselines. Conclusion: Through data augmentation and efficient fine-tuning, VNJPTranslate offers a practical, high-performing translation solution for a low-resource language pair. Abstract: Neural Machine Translation (NMT) driven by Transformer architectures has advanced significantly, yet faces challenges with low-resource language pairs like Vietnamese-Japanese (Vi-Ja). Issues include sparse parallel data and handling linguistic/cultural nuances. Recent progress in Large Language Models (LLMs) with strong reasoning, often refined via Reinforcement Learning (RL), enables high-quality synthetic data generation. We introduce VNJPTranslate, a pipeline designed to systematically address the Vi-Ja translation task. It features a targeted data augmentation strategy using advanced LLMs with Chain-of-Thought prompting for challenging segments identified via corpus analysis. Subsequently, we employ efficient fine-tuning techniques (Unsloth with QLoRA) on a capable, low-parameter autoregressive model (specifically, a fine-tuned version of the 1.8B parameter Sailor model, which is based on the Qwen architecture) to create a practical and high-performing translation system. This integrated approach aims to improve Vi-Ja translation quality significantly over existing baselines.

AP-CAP: Advancing High-Quality Data Synthesis for Animal Pose Estimation via a Controllable Image Generation Pipeline

Lei Wang,Yujie Zhong,Xiaopeng Sun,Jingchun Cheng,Chengjian Feng,Qiong Cao,Lin Ma,Zhaoxin Fan

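Task: Propose AP-CAP, a novel controllable image generation pipeline for synthesizing animal pose estimation data, to address the scarcity of high-quality datasets.

Motivation: The scarcity of high-quality datasets limits the potential of existing animal pose estimation methods.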

Details

Method: A multi-modal animal image generation model and three strategies (modality fusion, pose adjustment, caption enhancement) generate high-quality, diverse data. Result: The MPCH dataset, the first to combine synthetic and real data, forms the largest-scale multi-source heterogeneous benchmark repository to date; experiments show the method significantly improves the performance and generalization of pose estimators. Conclusion: AP-CAP and the MPCH dataset address data scarcity and improve the performance and generalization of animal pose estimation. Abstract: The task of 2D animal pose estimation plays a crucial role in advancing deep learning applications in animal behavior analysis and ecological research. Despite notable progress in some existing approaches, our study reveals that the scarcity of high-quality datasets remains a significant bottleneck, limiting the full potential of current methods. To address this challenge, we propose a novel Controllable Image Generation Pipeline for synthesizing animal pose estimation data, termed AP-CAP. Within this pipeline, we introduce a Multi-Modal Animal Image Generation Model capable of producing images with expected poses. To enhance the quality and diversity of the generated data, we further propose three innovative strategies: (1) Modality-Fusion-Based Animal Image Synthesis Strategy to integrate multi-source appearance representations, (2) Pose-Adjustment-Based Animal Image Synthesis Strategy to dynamically capture diverse pose variations, and (3) Caption-Enhancement-Based Animal Image Synthesis Strategy to enrich visual semantic understanding. Leveraging the proposed model and strategies, we create the MPCH Dataset (Modality-Pose-Caption Hybrid), the first hybrid dataset that innovatively combines synthetic and real data, establishing the largest-scale multi-source heterogeneous benchmark repository for animal pose estimation to date. Extensive experiments demonstrate the superiority of our method in improving both the performance and generalization capability of animal pose estimators.

Leveraging Large Language Models for Automated Definition Extraction with TaxoMatic: A Case Study on Media Bias

Timo Spinde,Luyang Lin,Smi Hinterreiter,Isao Echizen

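Task: Automate the extraction of definitions from academic literature.

Motivation: Address the inefficiency of manually extracting definitions in the media bias domain, using large language models to increase automation.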

Details

Method: Propose the TaxoMatic framework, comprising data collection, LLM-based relevance classification, and extraction of conceptual definitions. Result: Evaluated on a dataset of 2,398 manually rated articles, Claude-3-sonnet performed best in both relevance classification and definition extraction. Conclusion: The TaxoMatic framework is effective; future work can expand the dataset and apply it to more domains. Abstract: This paper introduces TaxoMatic, a framework that leverages large language models to automate definition extraction from academic literature. Focusing on the media bias domain, the framework encompasses data collection, LLM-based relevance classification, and extraction of conceptual definitions. Evaluated on a dataset of 2,398 manually rated articles, the study demonstrates the framework's effectiveness, with Claude-3-sonnet achieving the best results in both relevance classification and definition extraction. Future directions include expanding datasets and applying TaxoMatic to additional domains.

SPF-Portrait: Towards Pure Portrait Customization with Semantic Pollution-Free Fine-tuning

Xiaole Xian,Zhichao Liao,Qingyu Li,Wenyu Qin,Pengfei Wan,Weicheng Xie,Long Zeng,Linlin Shen,Pingfa Feng

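Task: Propose SPF-Portrait, a method that addresses semantic pollution in text-driven portrait customization.

Motivation: Existing methods introduce semantic pollution when fine-tuning pre-trained text-to-image models, which alters the original model's behavior and hinders incremental learning.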

Details

Method: Adopt a dual-path pipeline that introduces the original model as a reference, use contrastive learning to ensure adaptation to target attributes, and use a Semantic-Aware Fine Control Map to spatially guide the alignment process. Result: Experiments show that SPF-Portrait achieves state-of-the-art performance. Conclusion: SPF-Portrait effectively eliminates semantic pollution, preserving the original model's performance while strengthening the target attributes. Abstract: While fine-tuning pre-trained Text-to-Image (T2I) models on portrait datasets enables attribute customization, existing methods suffer from Semantic Pollution that compromises the original model's behavior and prevents incremental learning. To address this, we propose SPF-Portrait, a pioneering work to purely understand customized semantics while eliminating semantic pollution in text-driven portrait customization. In our SPF-Portrait, we propose a dual-path pipeline that introduces the original model as a reference for the conventional fine-tuning path. Through contrastive learning, we ensure adaptation to target attributes and purposefully align other unrelated attributes with the original portrait. We introduce a novel Semantic-Aware Fine Control Map, which represents the precise response regions of the target semantics, to spatially guide the alignment process between the contrastive paths. This alignment process not only effectively preserves the performance of the original model but also avoids over-alignment. Furthermore, we propose a novel response enhancement mechanism to reinforce the performance of target attributes, while mitigating representation discrepancy inherent in direct cross-modal supervision. Extensive experiments demonstrate that SPF-Portrait achieves state-of-the-art performance.

When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR)

Mahak Agarwal,Divyam Khanna

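Task: Study the ability of large language models (LLMs) to judge contradictory claims in a multi-agent debate framework.

Motivation: Examine the risk LLMs face when judging between accurate and incorrect information, particularly in adversarial settings.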

Details

Method: Use a single-turn, multi-agent debate framework in which one agent gives the truthful answer, another defends a falsehood, and the same LLM architecture serves as judge. Introduce the Confidence-Weighted Persuasion Override Rate (CW-POR) to measure how often the judge is deceived and how confident it is in the incorrect choice. Result: Experiments show that even smaller LLMs can generate persuasive false arguments that override truthful answers with high confidence. Conclusion: Highlights the importance of adversarial testing and robust calibration to prevent LLMs from confidently endorsing misinformation. Abstract: In many real-world scenarios, a single Large Language Model (LLM) may encounter contradictory claims (some accurate, others forcefully incorrect) and must judge which is true. We investigate this risk in a single-turn, multi-agent debate framework: one LLM-based agent provides a factual answer from TruthfulQA, another vigorously defends a falsehood, and the same LLM architecture serves as judge. We introduce the Confidence-Weighted Persuasion Override Rate (CW-POR), which captures not only how often the judge is deceived but also how strongly it believes the incorrect choice. Our experiments on five open-source LLMs (3B-14B parameters), where we systematically vary agent verbosity (30-300 words), reveal that even smaller models can craft persuasive arguments that override truthful answers, often with high confidence. These findings underscore the importance of robust calibration and adversarial testing to prevent LLMs from confidently endorsing misinformation.
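
The abstract describes CW-POR only in words, so the following is a minimal sketch of one plausible formalization, not the paper's definition. It assumes each judged debate yields a boolean (did the judge side with the falsehood?) and a confidence in [0, 1]; the `Verdict` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    chose_falsehood: bool  # True if the judge sided with the false answer
    confidence: float      # judge's stated confidence in its choice, in [0, 1]


def persuasion_override_rate(verdicts):
    """Plain override rate: fraction of debates won by the falsehood."""
    if not verdicts:
        return 0.0
    return sum(v.chose_falsehood for v in verdicts) / len(verdicts)


def confidence_weighted_por(verdicts):
    """Assumed weighting: share of total judge confidence spent on wrong verdicts."""
    total = sum(v.confidence for v in verdicts)
    if total == 0:
        return 0.0
    wrong = sum(v.confidence for v in verdicts if v.chose_falsehood)
    return wrong / total


verdicts = [
    Verdict(True, 0.9),
    Verdict(False, 0.6),
    Verdict(True, 0.5),
    Verdict(False, 1.0),
]
print(persuasion_override_rate(verdicts))
print(confidence_weighted_por(verdicts))
```

Under this assumed form, a judge that is wrong rarely but very confidently scores higher than one that is wrong equally often with low confidence, which is the distinction the abstract says the metric is meant to capture.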

Adaptive Low Light Enhancement via Joint Global-Local Illumination Adjustment

Haodian Wang,Yaqi Song

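Task: Propose a brightness-adaptive enhancement framework that addresses local exposure inconsistencies in real-world low-light images.

Motivation: Existing end-to-end methods struggle to enhance low-light images with a large dynamic range, especially when local exposure is inconsistent.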

Details

Method: The framework comprises a Local Contrast Enhancement Network (LCEN) and a Global Illumination Guidance Network (GIGN), which achieve adaptive enhancement through an early stopping mechanism and a global attention guidance module, respectively. Result: Experiments on multiple datasets show that the method outperforms state-of-the-art algorithms both quantitatively and qualitatively. Conclusion: The proposed brightness-adaptive enhancement framework effectively resolves local exposure inconsistencies and significantly improves the quality of low-light images. Abstract: Images captured under real-world low-light conditions face significant challenges due to uneven ambient lighting, making it difficult for existing end-to-end methods to enhance images with a large dynamic range to normal exposure levels. To address the above issue, we propose a novel brightness-adaptive enhancement framework designed to tackle the challenge of local exposure inconsistencies in real-world low-light images. Specifically, our proposed framework comprises two components: the Local Contrast Enhancement Network (LCEN) and the Global Illumination Guidance Network (GIGN). We introduce an early stopping mechanism in the LCEN and design a local discriminative module, which adaptively perceives the contrast of different areas in the image to control the premature termination of the enhancement process for patches with varying exposure levels. Additionally, within the GIGN, we design a global attention guidance module that effectively models global illumination by capturing long-range dependencies and contextual information within the image, which guides the local contrast enhancement network to significantly improve brightness across different regions. Finally, in order to coordinate the LCEN and GIGN, we design a novel training strategy to facilitate the training process. Experiments on multiple datasets demonstrate that our method achieves superior quantitative and qualitative results compared to state-of-the-art algorithms.

VerifiAgent: a Unified Verification Agent in Language Model Reasoning

Jiuzhou Han,Wray Buntine,Ehsan Shareghi

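Task: Propose VerifiAgent, a unified verification agent for checking the reliability and correctness of responses generated by large language models.

Motivation: Existing verification methods are usually limited to specific models or domains, consume significant computational resources, and do not scale across tasks.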

Details

Method: VerifiAgent combines meta-verification (assessing the completeness and consistency of responses) with tool-based adaptive verification (automatically selecting verification tools according to the reasoning type). Result: Experiments show that VerifiAgent outperforms baseline verification methods on all reasoning tasks and can improve reasoning accuracy through verification feedback. Conclusion: VerifiAgent is efficient and robust at verification and applies to inference scaling, achieving better results with fewer samples and lower cost. Abstract: Large language models demonstrate remarkable reasoning capabilities but often produce unreliable or incorrect responses. Existing verification methods are typically model-specific or domain-restricted, requiring significant computational resources and lacking scalability across diverse reasoning tasks. To address these limitations, we propose VerifiAgent, a unified verification agent that integrates two levels of verification: meta-verification, which assesses completeness and consistency in model responses, and tool-based adaptive verification, where VerifiAgent autonomously selects appropriate verification tools based on the reasoning type, including mathematical, logical, or commonsense reasoning. This adaptive approach ensures both efficiency and robustness across different verification scenarios. Experimental results show that VerifiAgent outperforms baseline verification methods (e.g., deductive verifier, backward verifier) across all reasoning tasks. Additionally, it can further enhance reasoning accuracy by leveraging feedback from verification results. VerifiAgent can also be effectively applied to inference scaling, achieving better results with fewer generated samples and costs compared to existing process reward models in the mathematical reasoning domain. Code is available at https://github.com/Jiuzhouh/VerifiAgent

Beyond Wide-Angle Images: Unsupervised Video Portrait Correction via Spatiotemporal Diffusion Adaptation

Wenbo Nie,Lang Nie,Chunyu Lin,Jingwen Chen,Ke Xing,Jiyuan Wang,Yao Zhao

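Task: Propose a diffusion-based image portrait correction framework (ImagePD) and its video version (VideoPD) to correct the facial stretching distortion caused by wide-angle cameras.

Motivation: Wide-angle cameras are popular for content creation, but facial stretching distortion at the edge of the lens degrades visual appeal.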

Details

Method: ImagePD combines the long-range awareness of transformers with the multi-step denoising of diffusion models; VideoPD applies spatiotemporal diffusion adaptation with spatial consistency and temporal smoothness constraints, making it suitable for unlabeled wide-angle videos. Result: Experiments show that the proposed methods outperform existing solutions quantitatively and qualitatively, producing high-fidelity, stable, and natural wide-angle video portraits. Conclusion: The proposed ImagePD and VideoPD frameworks effectively correct the facial distortion caused by wide-angle cameras while maintaining high-quality spatial correction and temporal smoothness in video. Abstract: Wide-angle cameras, despite their popularity for content creation, suffer from distortion-induced facial stretching, especially at the edge of the lens, which degrades visual appeal. To address this issue, we propose an image portrait correction framework using diffusion models named ImagePD. It integrates the long-range awareness of transformers and the multi-step denoising of diffusion models into a unified framework, achieving global structural robustness and local detail refinement. Besides, considering the high cost of obtaining video labels, we then repurpose ImagePD for unlabeled wide-angle videos (termed VideoPD), by spatiotemporal diffusion adaptation with spatial consistency and temporal smoothness constraints. For the former, we encourage the denoised image to approximate pseudo labels following the wide-angle distortion distribution pattern, while for the latter, we derive rectification trajectories with backward optical flows and smooth them. Compared with ImagePD, VideoPD maintains high-quality facial corrections in space and mitigates the potential temporal shakes sequentially. Finally, to establish an evaluation benchmark and train the framework, we establish a video portrait dataset with a large diversity in people number, lighting conditions, and background. Experiments demonstrate that the proposed methods outperform existing solutions quantitatively and qualitatively, contributing to high-fidelity wide-angle videos with stable and natural portraits. The code and dataset will be made available.

Semantic Mastery: Enhancing LLMs with Advanced Natural Language Understanding

Mohanakrishnan Hariharan

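Task: Explore how advanced natural language understanding (NLU) techniques can improve large language models (LLMs) in semantic understanding, contextual coherence, and reasoning.

Motivation: Although LLMs perform well on NLP tasks, they still fall short in deep semantic understanding, contextual coherence, and complex reasoning, and more advanced techniques are needed to close these gaps.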

Details

Method: Applies semantic parsing, knowledge integration, and contextual reinforcement learning, combined with structured knowledge graphs, retrieval-augmented generation (RAG), and fine-tuning strategies, as well as transformer-based architectures, contrastive learning, and hybrid symbolic-neural methods. Result: The study finds that semantic precision is essential to improving AI-driven language systems and proposes future research directions to narrow the gap between statistical language models and true natural language understanding. Conclusion: Combining multiple advanced techniques can significantly improve the semantic understanding and reasoning of LLMs, but further research is needed to achieve true natural language understanding. Abstract: Large language models (LLMs) have greatly improved their capability in performing NLP tasks. However, deeper semantic understanding, contextual coherence, and more subtle reasoning are still difficult to obtain. The paper discusses state-of-the-art methodologies that advance LLMs with more advanced NLU techniques, such as semantic parsing, knowledge integration, and contextual reinforcement learning. We analyze the use of structured knowledge graphs, retrieval-augmented generation (RAG), and fine-tuning strategies that match models with human-level understanding. Furthermore, we address the incorporation of transformer-based architectures, contrastive learning, and hybrid symbolic-neural methods that address problems like hallucinations, ambiguity, and factual inconsistency in complex NLP tasks such as question answering, text summarization, and dialogue generation. Our findings show the importance of semantic precision for enhancing AI-driven language systems and suggest future research directions to bridge the gap between statistical language models and true natural language understanding.

NCAP: Scene Text Image Super-Resolution with Non-CAtegorical Prior

Dongwoo Park,Suk Pil Ko

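Task: Improve the resolution and quality of scene text images.

Motivation: Existing methods rely on a text prior (TP) that can harm super-resolution when it is wrong, and pre-trained recognizers perform poorly on low-resolution images.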

Details

Method: Propose a Non-CAtegorical Prior (NCAP) to replace TP, and mix hard and soft labels to mitigate overconfidence. Result: A 3.5% improvement on the TextZoom dataset and a 14.8% improvement in generalization performance across four text recognition datasets. Conclusion: The method applies to all TP-guided STISR networks and significantly improves performance and generalization. Abstract: Scene text image super-resolution (STISR) enhances the resolution and quality of low-resolution images. Unlike previous studies that treated scene text images as natural images, recent methods using a text prior (TP), extracted from a pre-trained text recognizer, have shown strong performance. However, two major issues emerge: (1) Explicit categorical priors, like TP, can negatively impact STISR if incorrect. We reveal that these explicit priors are unstable and propose replacing them with Non-CAtegorical Prior (NCAP) using penultimate layer representations. (2) Pre-trained recognizers used to generate TP struggle with low-resolution images. To address this, most studies jointly train the recognizer with the STISR network to bridge the domain gap between low- and high-resolution images, but this can cause an overconfidence phenomenon in the prior modality. We highlight this issue and propose a method to mitigate it by mixing hard and soft labels. Experiments on the TextZoom dataset demonstrate an improvement of 3.5%, while our method significantly enhances generalization performance by 14.8% across four text recognition datasets. Our method generalizes to all TP-guided STISR networks.
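
The idea of mixing hard and soft labels can be illustrated with a generic convex blend of a one-hot target and a soft distribution. This is a minimal sketch, not the paper's training recipe: the mixing weight `alpha`, the function name, and the source of the soft distribution are assumptions made for illustration.

```python
def mix_labels(soft_probs, true_index, alpha=0.5):
    """Blend a one-hot (hard) target with a soft distribution.

    alpha is a hypothetical mixing weight: 1.0 gives the pure hard label,
    0.0 gives the pure soft label.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        alpha * (1.0 if i == true_index else 0.0) + (1.0 - alpha) * p
        for i, p in enumerate(soft_probs)
    ]


# Soft distribution over three character classes; the true class is index 1.
mixed = mix_labels([0.7, 0.2, 0.1], true_index=1, alpha=0.5)
print(mixed)
```

Because both inputs sum to one, the blended target is still a valid distribution; the soft component keeps the target from collapsing to a fully confident one-hot vector.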

Multimodal LLMs for OCR, OCR Post-Correction, and Named Entity Recognition in Historical Documents

Gavin Greif,Niclas Griesshaber,Robin Greif

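Task: Explore the use of multimodal large language models (mLLMs) for transcribing historical documents, extracting information, and constructing datasets.

Motivation: Investigate the capabilities of mLLMs in OCR, OCR post-correction, and named entity recognition (NER) in order to improve the efficiency and accuracy of historical document processing.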

Details

Method: Run experiments on German-language city directories published between 1754 and 1870, compare the transcription accuracy of mLLMs against conventional OCR models, and introduce a multimodal post-correction method. Result: mLLMs significantly outperform conventional OCR models in transcription accuracy, and post-correction brings the error rate below 1% without any image pre-processing or model fine-tuning. Conclusion: mLLMs have the potential to bring about a paradigm shift in historical data collection and document transcription. Abstract: We explore how multimodal Large Language Models (mLLMs) can help researchers transcribe historical documents, extract relevant historical information, and construct datasets from historical sources. Specifically, we investigate the capabilities of mLLMs in performing (1) Optical Character Recognition (OCR), (2) OCR Post-Correction, and (3) Named Entity Recognition (NER) tasks on a set of city directories published in German between 1754 and 1870. First, we benchmark the off-the-shelf transcription accuracy of both mLLMs and conventional OCR models. We find that the best-performing mLLM significantly outperforms conventional state-of-the-art OCR models and other frontier mLLMs. Second, we are the first to introduce multimodal post-correction of OCR output using mLLMs. We find that this novel approach leads to a drastic improvement in transcription accuracy and consistently produces highly accurate transcriptions (<1% CER), without any image pre-processing or model fine-tuning. Third, we demonstrate that mLLMs can efficiently recognize entities in transcriptions of historical documents and parse them into structured dataset formats. Our findings provide early evidence for the long-term potential of mLLMs to introduce a paradigm shift in the approaches to historical data collection and document transcription.
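
The "<1% CER" figure refers to character error rate, which is conventionally the character-level edit distance between a transcription and its ground truth divided by the ground-truth length. The sketch below shows that standard definition; any normalization details specific to the paper (for example whitespace or case handling) are not given in the abstract and are not modeled here.

```python
def levenshtein(ref, hyp):
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate relative to the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# One wrong character out of nine.
print(cer("Schneider", "Schneidcr"))
```

A CER below 1% therefore means fewer than one character edit per hundred ground-truth characters.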

Can LLMs Assist Computer Education? an Empirical Case Study of DeepSeek

Dongfu Xiao,Chen Gao,Zhengquan Luo,Chi Liu,Sheng Shen

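Task: Evaluate the efficacy and reliability of DeepSeek-V3 in computer education.

Motivation: Study how the emerging large language model DeepSeek-V3 actually performs in specialized professional settings such as computer education.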

Details

Method: Use CCNA simulation questions and network security questions posed by real network engineers, evaluated along several dimensions including role dependency, cross-linguistic proficiency, and answer reproducibility. Result: The model performs consistently, is unaffected by role definitions, and adapts well across languages, but performs worse on higher-order reasoning tasks than on lower-order factual recall tasks. Conclusion: DeepSeek-V3 has practical value for network security education, but there is room for improvement in handling multimodal data and complex topics. Abstract: This study presents an empirical case study to assess the efficacy and reliability of DeepSeek-V3, an emerging large language model, within the context of computer education. The evaluation employs both CCNA simulation questions and real-world inquiries concerning computer network security posed by Chinese network engineers. To ensure a thorough evaluation, diverse dimensions are considered, encompassing role dependency, cross-linguistic proficiency, and answer reproducibility, accompanied by statistical analysis. The findings demonstrate that the model performs consistently, regardless of whether prompts include a role definition or not. In addition, its adaptability across languages is confirmed by maintaining stable accuracy in both original and translated datasets. A distinct contrast emerges between its performance on lower-order factual recall tasks and higher-order reasoning exercises, which underscores its strengths in retrieving information and its limitations in complex analytical tasks. Although DeepSeek-V3 offers considerable practical value for network security education, challenges remain in its capability to process multimodal data and address highly intricate topics. These results provide valuable insights for future refinement of large language models in specialized professional environments.

Memorizing is Not Enough: Deep Knowledge Injection Through Reasoning

Ruoxi Xu,Yunjie Ji,Boxi Cao,Yaojie Lu,Hongyu Lin,Xianpei Han,Ben He,Yingfei Sun,Xiangang Li,Le Sun

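Task: Propose a four-tier knowledge injection framework and design DeepKnowledge, an experimental testbed for fine-grained evaluation of the depth of knowledge injection.

Motivation: The static nature of large language models (LLMs) leaves their information outdated and makes it hard to adapt them to domain-specific knowledge, so effective knowledge injection methods are needed. Current research focuses mainly on knowledge memorization and retrieval and lacks a systematic treatment.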

Details

Method: Propose a four-tier knowledge injection framework (memorization, retrieval, reasoning, and association) and design the DeepKnowledge testbed to evaluate the depth of injection for three knowledge types (novel, incremental, and updated). Result: Experiments reveal the key factors for reaching each level of knowledge injection and establish a mapping between injection levels and suitable methods. Conclusion: The study offers a systematic approach to efficient knowledge injection across levels. Abstract: Although large language models (LLMs) excel in knowledge recall and reasoning, their static nature leads to outdated information as the real world evolves or when adapting to domain-specific knowledge, highlighting the need for effective knowledge injection. However, current research on knowledge injection remains superficial, mainly focusing on knowledge memorization and retrieval. This paper proposes a four-tier knowledge injection framework that systematically defines the levels of knowledge injection: memorization, retrieval, reasoning, and association. Based on this framework, we introduce DeepKnowledge, a synthetic experimental testbed designed for fine-grained evaluation of the depth of knowledge injection across three knowledge types (novel, incremental, and updated). We then explore various knowledge injection scenarios and evaluate the depth of knowledge injection for each scenario on the benchmark. Experimental results reveal key factors to reach each level of knowledge injection for LLMs and establish a mapping between the levels of knowledge injection and the corresponding suitable injection methods, aiming to provide a comprehensive approach for efficient knowledge injection across various levels.

Unleashing the Power of Pre-trained Encoders for Universal Adversarial Attack Detection

Yinghe Zhang,Chi Liu,Shuai Zhou,Sheng Shen,Peng Gui

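Task: Propose a lightweight adversarial detection framework based on the pre-trained vision-language model CLIP for detecting adversarial attacks.

Motivation: Existing adversarial attack detection methods rely on handcrafted feature design and prior knowledge of attack patterns, which limits generalization and incurs high engineering costs.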

Details

Method: Jointly fine-tune CLIP's dual visual-text encoders with trainable adapter networks and learnable prompts to construct a compact representation space for natural images. Result: The detection architecture generalizes substantially better than traditional methods across both known and unknown attack patterns while greatly reducing training overhead. Conclusion: The study offers a new path toward a parameter-efficient, attack-agnostic defense paradigm and markedly improves the robustness of vision systems against adversarial threats. Abstract: Adversarial attacks pose a critical security threat to real-world AI systems by injecting human-imperceptible perturbations into benign samples to induce misclassification in deep learning models. While existing detection methods, such as Bayesian uncertainty estimation and activation pattern analysis, have achieved progress through feature engineering, their reliance on handcrafted feature design and prior knowledge of attack patterns limits generalization capabilities and incurs high engineering costs. To address these limitations, this paper proposes a lightweight adversarial detection framework based on the large-scale pre-trained vision-language model CLIP. Departing from conventional adversarial feature characterization paradigms, we innovatively adopt an anomaly detection perspective. By jointly fine-tuning CLIP's dual visual-text encoders with trainable adapter networks and learnable prompts, we construct a compact representation space tailored for natural images. Notably, our detection architecture achieves substantial improvements in generalization capability across both known and unknown attack patterns compared to traditional methods, while significantly reducing training overhead. This study provides a novel technical pathway for establishing a parameter-efficient and attack-agnostic defense paradigm, markedly enhancing the robustness of vision systems against evolving adversarial threats.

Making Large Language Models Better Reasoners with Orchestrated Streaming Experiences

Xiangyang Liu,Junliang He,Xipeng Qiu

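Task: Propose RoSE, a framework for solving reasoning tasks that can self-improve.

Motivation: Zero-shot prompting performs poorly on complex reasoning tasks, and few-shot prompting depends on manually crafted demonstrations.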

Details

Method: RoSE stores answered questions and their thoughts in a streaming experience pool and selects helpful questions based on similarity, uncertainty, and complexity. Result: RoSE shows broad applicability across a variety of reasoning tasks, LLMs, and CoT methods. Conclusion: RoSE is a general reasoning framework that can self-improve without complex external intervention. Abstract: Large language models (LLMs) can perform complex reasoning by generating intermediate thoughts under zero-shot or few-shot settings. However, zero-shot prompting typically suffers from low performance, and the superior performance of few-shot prompting hinges on manually crafted demonstrations. In this paper, we present RoSE (Reasoning with Orchestrated Streaming Experiences), a general framework for solving reasoning tasks that can self-improve without complex external efforts. To enable RoSE, we describe an architecture that extends an LLM to store all answered questions and their thoughts in a streaming experience pool and then orchestrates helpful questions from the pool to assist in answering new questions. To set up a question-aware orchestration mechanism, RoSE first calculates the similarity of each question in the pool with a new test question. Since the solution to each answered question is not always correct, RoSE will sort the questions according to their similarity with the new question, and then uniformly divide them into multiple buckets. It finally extracts one question from each bucket to make these extracted questions more diverse. To make these extracted questions help RoSE answer new questions as much as possible, we introduce two other attributes of uncertainty and complexity for each question. RoSE will preferentially select the questions with low uncertainty and high complexity from each bucket. We evaluate the versatility of RoSE in various reasoning tasks, LLMs, and CoT methods.
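
The orchestration steps described in the abstract (sort by similarity, split uniformly into buckets, take one question per bucket preferring low uncertainty and high complexity) can be sketched as follows. The `Experience` type, the function name, and the lexicographic tie-breaking rule (uncertainty first, then complexity) are assumptions; the abstract does not say how the two attributes are combined or how the scores are computed.

```python
from dataclasses import dataclass


@dataclass
class Experience:
    question: str
    similarity: float   # similarity to the new test question
    uncertainty: float  # lower is better
    complexity: float   # higher is better


def orchestrate(pool, num_buckets):
    """Pick one helpful question from each similarity bucket."""
    if not pool or num_buckets <= 0:
        return []
    ranked = sorted(pool, key=lambda e: e.similarity, reverse=True)
    num_buckets = min(num_buckets, len(ranked))
    size, extra = divmod(len(ranked), num_buckets)
    picks, start = [], 0
    for b in range(num_buckets):
        # The first `extra` buckets take one additional item so sizes stay even.
        end = start + size + (1 if b < extra else 0)
        bucket = ranked[start:end]
        # Assumed preference: lowest uncertainty, ties broken by highest complexity.
        picks.append(min(bucket, key=lambda e: (e.uncertainty, -e.complexity)))
        start = end
    return picks


pool = [
    Experience("a", 0.9, 0.5, 3.0),
    Experience("b", 0.8, 0.1, 2.0),
    Experience("c", 0.3, 0.2, 5.0),
    Experience("d", 0.1, 0.2, 1.0),
]
print([e.question for e in orchestrate(pool, num_buckets=2)])
```

Drawing from every bucket rather than only the most similar one is what gives the selected demonstrations their diversity.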

Data Synthesis with Diverse Styles for Face Recognition via 3DMM-Guided Diffusion

Yuxi Mi,Zhizhou Zhong,Yuge Huang,Qiuyang Yuan,Xuan Zhao,Jianqing Xu,Shouhong Ding,ShaoMing Wang,Rizen Guo,Shuigeng Zhou

Task: Identity-preserving face synthesis to generate synthetic face images for training face recognition models.

Motivation: Prior methods face a trade-off between consistent identities and diverse styles, and treat style variation as subject-agnostic, while real-world persons have distinct, subject-specific styles.

Details

Method: Introduces MorphFace, a diffusion-based face generator that learns fine-grained facial styles from 3DMM renderings and identities from a recognition model, conditioned on novel identities and styles sampled from a real-world prior distribution. Result: MorphFace outperforms prior arts in face recognition efficacy. Conclusion: MorphFace effectively addresses the trade-off by learning subject-specific styles and identities, enhancing face recognition performance. Abstract: Identity-preserving face synthesis aims to generate synthetic face images of virtual subjects that can substitute real-world data for training face recognition models. While prior arts strive to create images with consistent identities and diverse styles, they face a trade-off between them. Identifying their limitation of treating style variation as subject-agnostic and observing that real-world persons actually have distinct, subject-specific styles, this paper introduces MorphFace, a diffusion-based face generator. The generator learns fine-grained facial styles, e.g., shape, pose and expression, from the renderings of a 3D morphable model (3DMM). It also learns identities from an off-the-shelf recognition model. To create virtual faces, the generator is conditioned on novel identities of unlabeled synthetic faces, and novel styles that are statistically sampled from a real-world prior distribution. The sampling especially accounts for both intra-subject variation and subject distinctiveness. A context blending strategy is employed to enhance the generator's responsiveness to identity and style conditions. Extensive experiments show that MorphFace outperforms the best prior arts in face recognition efficacy.

Training a Utility-based Retriever Through Shared Context Attribution for Retrieval-Augmented Language Models

Yilong Xu,Jinhua Gao,Xiaoming Yu,Yuanhai Xue,Baolong Bi,Huawei Shen,Xueqi Cheng

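Task: Propose the SCARLet framework for training utility-based retrievers in retrieval-augmented language models (RALMs).

Motivation: Existing retrievers focus mainly on semantic relevance, which may not be effective for generation, so utility-based retrieval is needed. However, accurately capturing passage utility has not been well studied.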

Details

Method: SCARLet incorporates two key factors, multi-task generalization and inter-passage interaction: it constructs shared context and uses a perturbation-based attribution method to estimate passage-level utility. Result: Experiments on ten datasets show that retrievers trained with SCARLet significantly improve the overall performance of RALMs. Conclusion: Through multi-task generalization and inter-passage interaction, SCARLet effectively improves retriever utility and thereby the task performance of RALMs. Abstract: Retrieval-Augmented Language Models boost task performance, owing to the retriever that provides external knowledge. Although crucial, the retriever primarily focuses on semantic relevance, which may not always be effective for generation. Thus, utility-based retrieval has emerged as a promising topic, prioritizing passages that provide valid benefits for downstream tasks. However, due to insufficient understanding, capturing passage utility accurately remains unexplored. This work proposes SCARLet, a framework for training utility-based retrievers in RALMs, which incorporates two key factors, multi-task generalization and inter-passage interaction. First, SCARLet constructs shared context on which training data for various tasks is synthesized. This mitigates semantic bias from context differences, allowing retrievers to focus on learning task-specific utility for better task generalization. Next, SCARLet uses a perturbation-based attribution method to estimate passage-level utility for shared context, which reflects interactions between passages and provides more accurate feedback. We evaluate our approach on ten datasets across various tasks, both in-domain and out-of-domain, showing that retrievers trained by SCARLet consistently improve the overall performance of RALMs.
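
Perturbation-based attribution can take many forms, and the abstract does not specify which one SCARLet uses. As a simple stand-in, the sketch below uses leave-one-out perturbation: a passage's utility is the drop in a task score when that passage is removed from the shared context. The scoring function `toy_score` is hypothetical and stands in for a real downstream evaluation of the language model's output.

```python
def leave_one_out_utility(passages, score):
    """Utility of each passage = score(full context) - score(context without it)."""
    full = score(passages)
    return [
        full - score(passages[:i] + passages[i + 1:])
        for i in range(len(passages))
    ]


def toy_score(context):
    """Hypothetical task score: fraction of needed facts found in the context."""
    needed = {"capital", "population"}
    covered = {fact for fact in needed if any(fact in p for p in context)}
    return len(covered) / len(needed)


passages = [
    "capital: Paris",
    "capital is Paris (duplicate)",
    "population: 2.1M",
    "weather: mild",
]
print(leave_one_out_utility(passages, toy_score))
```

The two redundant "capital" passages each receive zero utility because the other one covers for them, while the only "population" passage receives a positive score. This is the kind of inter-passage interaction that a per-passage relevance score cannot express.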

Enhancing Fundus Image-based Glaucoma Screening via Dynamic Global-Local Feature Integration

Yuzhuo Zhou,Chi Liu,Sheng Shen,Siyu Le,Liwen Yu,Sihan Ouyang,Zongyuan Ge

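Task: Propose a self-adaptive attention window and a multi-head attention mechanism to improve the accuracy and robustness of fundus image classifiers in glaucoma diagnosis.

Motivation: Existing classification models perform well on specific datasets but struggle with real-world challenges such as variations in image quality, image differences across racial groups, and the uncertain boundaries of glaucomatous cases.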

Details

Method: Incorporate comprehensive fundus image information (such as the optic cup and optic disc regions) and introduce a self-adaptive attention window and a multi-head attention mechanism to improve feature extraction and fusion. Result: Experimental results show that the method achieves higher accuracy and robustness in glaucoma classification. Conclusion: The method effectively addresses the challenges posed by image variations and improves the reliability of glaucoma diagnosis. Abstract: With the advancements in medical artificial intelligence (AI), fundus image classifiers are increasingly being applied to assist in ophthalmic diagnosis. While existing classification models have achieved high accuracy on specific fundus datasets, they struggle to address real-world challenges such as variations in image quality across different imaging devices, discrepancies between training and testing images across different racial groups, and the uncertain boundaries due to the characteristics of glaucomatous cases. In this study, we aim to address the above challenges posed by image variations by highlighting the importance of incorporating comprehensive fundus image information, including the optic cup (OC) and optic disc (OD) regions, and other key image patches. Specifically, we propose a self-adaptive attention window that autonomously determines optimal boundaries for enhanced feature extraction. Additionally, we introduce a multi-head attention mechanism to effectively fuse global and local features via feature linear readout, improving the model's discriminative capability. Experimental results demonstrate that our method achieves superior accuracy and robustness in glaucoma classification.

Enhancing Negation Awareness in Universal Text Embeddings: A Data-efficient and Computational-efficient Approach

Hongliu Cao

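Task: Study how well universal text embedding models understand negation, and propose an efficient embedding re-weighting method to improve their negation awareness.

Motivation: Existing universal text embedding models handle negation poorly, and because of bias in evaluation benchmarks their negation awareness remains unclear.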

Details

Method: Propose a data-efficient and computationally efficient embedding re-weighting method that does not modify the parameters of the text embedding model. Result: The method significantly improves text embedding models on both simple and complex negation understanding tasks, and also applies to LLM-based task-specific high-dimensional universal text embeddings. Conclusion: Embedding re-weighting can effectively improve the negation awareness of universal text embedding models while remaining efficient. Abstract: Negation plays an important role in various natural language processing tasks such as Natural Language Inference and Sentiment Analysis. Numerous prior studies have found that contextual text embedding models such as BERT, ELMO, RoBERTa or XLNet face challenges in accurately understanding negation. Recent advancements in universal text embeddings have demonstrated superior performance over contextual text embeddings in various tasks. However, due to the bias in popular evaluation benchmarks, the negation awareness capacity of these models remains unclear. To bridge the gap in existing literature, an in-depth analysis is initiated in this work to study the negation awareness of cutting-edge universal text embedding models. Our findings reveal a significant lack of negation awareness in these models, which often interpret negated text pairs as semantically similar. Because different tasks need different trade-offs between topic and negation information, among other semantic information, a data-efficient and computationally efficient embedding re-weighting method is proposed that does not modify the parameters of text embedding models. The proposed solution significantly improves text embedding models' negation awareness on both simple and complex negation understanding tasks. Furthermore, the proposed solution can also significantly improve the negation awareness of Large Language Model based task-specific high-dimensional universal text embeddings.
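
Embedding re-weighting can be pictured as scaling each embedding dimension by a weight before computing similarity, leaving the embedding model itself untouched. The sketch below is a toy illustration under that assumption: the three-dimensional vectors, the choice of dimension 2 as the "negation" dimension, and the hand-picked weights are all hypothetical, and the abstract does not say how the actual weights are obtained.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def reweight(embedding, weights):
    """Scale each dimension by its weight; the embedding model is not modified."""
    return [w * x for w, x in zip(weights, embedding)]


# Toy embeddings: dimensions 0-1 carry topic, dimension 2 carries negation.
stmt = [1.0, 1.0, 0.2]      # e.g. "the film is good"
negated = [1.0, 1.0, -0.2]  # e.g. "the film is not good"
weights = [1.0, 1.0, 5.0]   # hypothetical weights that amplify the negation dimension

before = cosine(stmt, negated)
after = cosine(reweight(stmt, weights), reweight(negated, weights))
print(before, after)
```

In this toy case the negated pair looks almost identical before re-weighting and clearly different afterwards; a task that cares more about topic than negation could use a different weight vector over the same frozen embeddings.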

DecoFuse: Decomposing and Fusing the "What", "Where", and "How" for Brain-Inspired fMRI-to-Video Decoding

Chong Li,Jingyang Huo,Weikang Gong,Yanwei Fu,Xiangyang Xue,Jianfeng Feng

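Task: Propose DecoFuse, a new framework for decoding videos from fMRI signals.

Motivation: Existing fMRI-to-video methods usually focus on semantic content and overlook spatial and motion information, even though the brain processes these separately.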

Details

Method: Decompose the video into three components (semantic, spatial, and motion), decode each separately, and then fuse them to reconstruct the video. Result: Significantly outperforms existing methods in semantic classification, spatial consistency, motion prediction, and video generation, and supports the biological hypothesis of distinct ventral and dorsal pathways. Conclusion: DecoFuse provides a strong and biologically plausible framework for fMRI-to-video decoding. Abstract: Decoding visual experiences from brain activity is a significant challenge. Existing fMRI-to-video methods often focus on semantic content while overlooking spatial and motion information. However, these aspects are all essential and are processed through distinct pathways in the brain. Motivated by this, we propose DecoFuse, a novel brain-inspired framework for decoding videos from fMRI signals. It first decomposes the video into three components (semantic, spatial, and motion), then decodes each component separately before fusing them to reconstruct the video. This approach not only simplifies the complex task of video decoding by decomposing it into manageable sub-tasks, but also establishes a clearer connection between learned representations and their biological counterparts, as supported by ablation studies. Further, our experiments show significant improvements over previous state-of-the-art methods, achieving 82.4% accuracy for semantic classification, 70.6% accuracy in spatial consistency, a 0.212 cosine similarity for motion prediction, and 21.9% 50-way accuracy for video generation. Additionally, neural encoding analyses for semantic and spatial information align with the two-streams hypothesis, further validating the distinct roles of the ventral and dorsal pathways. Overall, DecoFuse provides a strong and biologically plausible framework for fMRI-to-video decoding. Project page: https://chongjg.github.io/DecoFuse/.

Efficient Annotator Reliability Assessment with EffiARA

Owen Cook,Jake Vasilakes,Ian Roberts,Xingyi Song

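Task: Propose the EffiARA annotation framework, which supports the whole annotation pipeline from understanding the resources an annotation task requires to compiling the dataset and analyzing annotator reliability.

Motivation: Data annotation is a key part of the machine learning pipeline, but it is costly and time-consuming, and document-level annotation for transformer-based models in particular lacks a standard framework.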

Details

Method: 开发EffiARA Python包及配套的Web工具，提供图形用户界面，支持标注流程的各个环节。 Result: 通过两项先前研究验证框架有效性：一是基于标注可靠性的软标签聚合和样本加权提升分类性能，二是通过移除不可靠标注者提高整体一致性。 Conclusion: EffiARA框架填补了文档级标注标准化工具的空白，并通过开源工具促进其广泛应用。 Abstract: Data annotation is an essential component of the machine learning pipeline; it is also a costly and time-consuming process. With the introduction of transformer-based models, annotation at the document level is increasingly popular; however, there is no standard framework for structuring such tasks. The EffiARA annotation framework is, to our knowledge, the first project to support the whole annotation pipeline, from understanding the resources required for an annotation task to compiling the annotated dataset and gaining insights into the reliability of individual annotators as well as the dataset as a whole. The framework's efficacy is supported by two previous studies: one improving classification performance through annotator-reliability-based soft label aggregation and sample weighting, and the other increasing the overall agreement among annotators through removing identifying and replacing an unreliable annotator. This work introduces the EffiARA Python package and its accompanying webtool, which provides an accessible graphical user interface for the system. We open-source the EffiARA Python package at https://github.com/MiniEggz/EffiARA and the webtool is publicly accessible at https://effiara.gate.ac.uk.

Qi Song,Chenghong Li,Haotong Lin,Sida Peng,Rui Huang

Task: Propose ADGaussian, a new method for generalizable street scene reconstruction.

Motivation: Existing Gaussian Splatting methods focus mainly on geometry refinement, whereas this work emphasizes the importance of jointly optimizing image and depth features for accurate Gaussian prediction.

Details

Method: Incorporate sparse LiDAR depth as an additional input modality, formulating Gaussian prediction as a joint learning framework over visual information and geometric cues, and propose a multi-modal feature matching strategy together with a multi-scale Gaussian decoding model to enhance the joint refinement of multi-modal features. Result: Experiments on two large-scale autonomous driving datasets, Waymo and KITTI, show that ADGaussian achieves state-of-the-art performance and exhibits superior zero-shot generalization in novel-view shifting. Conclusion: By jointly optimizing image and depth features, ADGaussian achieves high-quality street scene reconstruction and demonstrates strong generalization. Abstract: We present a novel approach, termed ADGaussian, for generalizable street scene reconstruction. The proposed method enables high-quality rendering from single-view input. Unlike prior Gaussian Splatting methods that primarily focus on geometry refinement, we emphasize the importance of joint optimization of image and depth features for accurate Gaussian prediction. To this end, we first incorporate sparse LiDAR depth as an additional input modality, formulating the Gaussian prediction process as a joint learning framework of visual information and geometric cues. Furthermore, we propose a multi-modal feature matching strategy coupled with a multi-scale Gaussian decoding model to enhance the joint refinement of multi-modal features, thereby enabling efficient multi-modal Gaussian learning. Extensive experiments on two large-scale autonomous driving datasets, Waymo and KITTI, demonstrate that our ADGaussian achieves state-of-the-art performance and exhibits superior zero-shot generalization capabilities in novel-view shifting.

Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources

Weizhi Wang,Yu Tian,Linjie Yang,Heng Wang,Xifeng Yan

Task: Develop Open-Qwen2VL, an efficient pre-training framework for multimodal large language models, and achieve high performance with limited compute resources.

Motivation: Current multimodal LLM pre-training faces barriers in data filtering, multimodal data mixture strategies, sequence packing techniques, and training frameworks, which calls for a more efficient and open approach.

Details

Method: Use low-to-high dynamic image resolution and multimodal sequence packing, curate the dataset with both MLLM-based and CLIP-based filtering, and pre-train on 8xA100-40G GPUs. Result: Open-Qwen2VL outperforms the partially-open Qwen2-VL-2B on multiple multimodal benchmarks, demonstrating remarkable training efficiency. Conclusion: By fully open-sourcing the training code, data filtering techniques, and all training data, Open-Qwen2VL redefines "fully open" for multimodal LLMs. Abstract: The reproduction of state-of-the-art multimodal LLM pre-training faces barriers at every stage of the pipeline, including high-quality data filtering, multimodal data mixture strategies, sequence packing techniques, and training frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter Multimodal Large Language Model pre-trained efficiently on 29M image-text pairs using only 442 A100-40G GPU hours. Our approach employs low-to-high dynamic image resolution and multimodal sequence packing to significantly enhance pre-training efficiency. The training dataset was carefully curated using both MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based filtering methods, substantially improving data quality and training efficiency. The Open-Qwen2VL pre-training is conducted on academic-level 8xA100-40G GPUs at UCSB on 5B packed multimodal tokens, which is 0.36% of 1.4T multimodal pre-training tokens of Qwen2-VL. The final instruction-tuned Open-Qwen2VL outperforms partially-open state-of-the-art MLLM Qwen2-VL-2B on various multimodal benchmarks of MMBench, SEEDBench, MMstar, and MathVista, indicating the remarkable training efficiency of Open-Qwen2VL. We open-source all aspects of our work, including compute-efficient and data-efficient training details, data filtering methods, sequence packing scripts, pre-training data in WebDataset format, FSDP-based training codebase, and both base and instruction-tuned model checkpoints. We redefine "fully open" for multimodal LLMs as the complete release of: 1) the training codebase, 2) detailed data filtering techniques, and 3) all pre-training and supervised fine-tuning data used to develop the model.
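
To make the idea of sequence packing concrete, the sketch below groups variable-length samples into fixed-size context windows with a first-fit-decreasing heuristic. This is an illustration only: the abstract does not say which packing algorithm the released scripts use, so the function name `pack_sequences`, the 4096-token window, and the heuristic itself are assumptions. The last line only re-checks the token ratio quoted in the abstract.

```python
def pack_sequences(lengths, max_len):
    """Group sample indices into bins whose total token count is <= max_len.

    First-fit-decreasing heuristic (an assumption for illustration; the
    paper's own packing script may work differently).
    """
    bins = []  # each bin is [remaining_capacity, [sample indices]]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} has {n} tokens, more than max_len={max_len}")
        for b in bins:
            if b[0] >= n:          # first bin with enough room
                b[0] -= n
                b[1].append(i)
                break
        else:                      # no bin had room: open a new one
            bins.append([max_len - n, [i]])
    return [b[1] for b in bins]


# Hypothetical token counts for five image-text samples.
packed = pack_sequences([3000, 1000, 900, 2500, 600], max_len=4096)

# Ratio quoted in the abstract: 5B packed tokens out of 1.4T.
token_fraction_percent = 100 * 5e9 / 1.4e12
```

Packing in this way reduces the padding that would otherwise be needed when each sample occupies its own window, which is one plausible source of the efficiency gain the abstract attributes to sequence packing.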

Lan Sun,Songpengcheng Xia,Jiarui Yang,Ling Pei

Task: Propose Suite-IN++, a deep learning framework based on a multi-device wearable network, for robust and accurate pedestrian localization.

Motivation: Traditional pedestrian dead reckoning (PDR) struggles with diverse motion modes, while data-driven methods, despite improving accuracy, often lack robustness because they rely on a single-device setup. Fully leveraging existing wearable devices to form a flexible wearable network (flexiwear bodynet) is therefore a promising solution.

Details

Method: Suite-IN++ uses contrastive learning to separate global and local motion features, fuses the global features from each device to capture overall motion trends, and uses an attention mechanism to uncover cross-device correlations in local features, extracting motion details that help accurate localization. Result: Experiments show that Suite-IN++ achieves superior localization accuracy and robustness in real-life pedestrian tracking scenarios, significantly outperforming state-of-the-art models. Conclusion: Suite-IN++ provides an efficient and robust solution for pedestrian localization with multi-device wearable networks. Abstract: The proliferation of wearable technology has established multi-device ecosystems comprising smartphones, smartwatches, and headphones as critical enablers for ubiquitous pedestrian localization. However, traditional pedestrian dead reckoning (PDR) struggles with diverse motion modes, while data-driven methods, despite improving accuracy, often lack robustness due to their reliance on a single-device setup. Therefore, a promising solution is to fully leverage existing wearable devices to form a flexiwear bodynet for robust and accurate pedestrian localization. This paper presents Suite-IN++, a deep learning framework for flexiwear bodynet-based pedestrian localization. Suite-IN++ integrates motion data from wearable devices on different body parts, using contrastive learning to separate global and local motion features. It fuses global features based on the data reliability of each device to capture overall motion trends and employs an attention mechanism to uncover cross-device correlations in local features, extracting motion details helpful for accurate localization. To evaluate our method, we construct a real-life flexiwear bodynet dataset, incorporating Apple Suite (iPhone, Apple Watch, and AirPods) across diverse walking modes and device configurations. Experimental results demonstrate that Suite-IN++ achieves superior localization accuracy and robustness, significantly outperforming state-of-the-art models in real-life pedestrian tracking scenarios.

On the Consistency of Multilingual Context Utilization in Retrieval-Augmented Generation

Jirui Qi,Raquel Fernández,Arianna Bisazza

Task: Evaluate the ability of large language models (LLMs) to use contexts in different languages in multilingual retrieval-augmented generation (mRAG).

Motivation: In multilingual RAG, retrieved passages may be written in a language other than that of the user's query, which makes it challenging for LLMs to use the information effectively, yet this question remains understudied.

Details

Method: Evaluate four LLMs on three QA datasets covering 48 languages and analyze their ability to use contexts in different languages. Result: LLMs can extract relevant information from passages that are not in the query language but are weaker at formulating a full answer; distracting passages lower answer quality, and distractors in the query language have a slightly stronger effect. Conclusion: The study reveals the limitations of LLMs in using context in multilingual RAG and provides directions for future improvements. Abstract: Retrieval-augmented generation (RAG) with large language models (LLMs) has demonstrated strong performance in multilingual question-answering (QA) tasks by leveraging relevant passages retrieved from corpora. In multilingual RAG (mRAG), the retrieved passages can be written in languages other than that of the query entered by the user, making it challenging for LLMs to effectively utilize the provided information. Recent research suggests that retrieving passages from multilingual corpora can improve RAG performance, particularly for low-resource languages. However, the extent to which LLMs can leverage different kinds of multilingual contexts to generate accurate answers, *independently from retrieval quality*, remains understudied. In this paper, we conduct an extensive assessment of LLMs' ability to (i) make consistent use of a relevant passage regardless of its language, (ii) respond in the expected language, and (iii) focus on the relevant passage even when multiple 'distracting' passages in different languages are provided in the context. Our experiments with four LLMs across three QA datasets covering a total of 48 languages reveal a surprising ability of LLMs to extract the relevant information from out-language passages, but a much weaker ability to formulate a full answer in the correct language. Our analysis, based on both accuracy and feature attribution techniques, further shows that distracting passages negatively impact answer quality regardless of their language. However, distractors in the query language exert a slightly stronger influence. Taken together, our findings deepen the understanding of how LLMs utilize context in mRAG systems, providing directions for future improvements.

FA³-CLIP: Frequency-Aware Cues Fusion and Attack-Agnostic Prompt Learning for Unified Face Attack Detection

Yongze Li,Ning Li,Ajian Liu,Hui Ma,Liying Yang,Xihong Chen,Zhiyao Liang,Yanyan Liang,Jun Wan,Zhen Lei

Task: Propose a unified attack detection model (FA³-CLIP) that detects both physical and digital face attacks.

Motivation: Existing methods struggle to detect physical and digital attacks simultaneously, mainly because of the large intra-class variations between attack types and because spatial information alone cannot comprehensively capture live and fake cues.

Details

Method: Combine spatial and frequency features through attack-agnostic prompt learning and a dual-stream cues fusion framework, generating generic live and fake prompts and optimizing the feature space. Result: Experiments show that the method significantly improves performance in detecting physical and digital face attacks, achieving state-of-the-art results. Conclusion: By fusing spatial and frequency features with attack-agnostic prompt learning, FA³-CLIP achieves unified detection of multiple attack types with superior performance. Abstract: Facial recognition systems are vulnerable to physical (e.g., printed photos) and digital (e.g., DeepFake) face attacks. Existing methods struggle to simultaneously detect physical and digital attacks due to: 1) significant intra-class variations between these attack types, and 2) the inadequacy of spatial information alone to comprehensively capture live and fake cues. To address these issues, we propose a unified attack detection model termed Frequency-Aware and Attack-Agnostic CLIP (FA³-CLIP), which introduces attack-agnostic prompt learning to express generic live and fake cues derived from the fusion of spatial and frequency features, enabling unified detection of live faces and all categories of attacks. Specifically, the attack-agnostic prompt module generates generic live and fake prompts within the language branch to extract corresponding generic representations from both live and fake faces, guiding the model to learn a unified feature space for unified attack detection. Meanwhile, the module adaptively generates the live/fake conditional bias from the original spatial and frequency information to optimize the generic prompts accordingly, reducing the impact of intra-class variations. We further propose a dual-stream cues fusion framework in the vision branch, which leverages frequency information to complement subtle cues that are difficult to capture in the spatial domain. In addition, a frequency compression block is utilized in the frequency stream, which reduces redundancy in frequency features while preserving the diversity of crucial cues. We also establish new challenging protocols to facilitate evaluating the effectiveness of unified face attack detection. Experimental results demonstrate that the proposed method significantly improves performance in detecting physical and digital face attacks, achieving state-of-the-art results.

Efficient Construction of Model Family through Progressive Training Using Model Expansion

Kazuki Yano,Sho Takase,Sosuke Kobayashi,Shun Kiyono,Jun Suzuki

Task: Propose an efficient method for constructing a family of large language models of different parameter sizes through progressive training.

Motivation: Conventionally, each model in a family is trained independently, so computational cost grows linearly with the number of models, which is inefficient.

Details

Method: Use progressive training, incrementally expanding smaller models into larger ones to build a complete model family, and adjust the maximum learning rate to optimize performance. Result: On a model family ranging from 1B to 8B parameters, the method reduces computational cost by approximately 25% while performing comparably to independently trained models, and even better on some metrics. Conclusion: The method not only saves compute but also yields more consistent behavior across model sizes, which makes it practically useful. Abstract: As Large Language Models (LLMs) gain widespread practical application, providing a model family of different parameter sizes has become standard practice to address diverse computational requirements. Conventionally, each model in a family is trained independently, resulting in computational costs that scale additively with the number of models. We propose an efficient method for constructing the model family through progressive training, where smaller models are incrementally expanded to larger sizes to create a complete model family. Through extensive experiments with a model family ranging from 1B to 8B parameters, we demonstrate that our method reduces computational costs by approximately 25% while maintaining comparable performance to independently trained models. Furthermore, by strategically adjusting maximum learning rates based on model size, our method outperforms independent training across various metrics. Beyond performance gains, our approach offers an additional advantage: models in our family tend to yield more consistent behavior across different model sizes.

Distilling Multi-view Diffusion Models into 3D Generators

Hao Qin,Luyuan Chen,Ming Kong,Mengxu Lu,Qiang Zhu

Task: Distill a multi-view diffusion model (MV-DM) into a 3D generator using Gaussian splatting (DD3G).

Motivation: By simulating the ordinary differential equation (ODE) trajectory of the MV-DM, compress and integrate its visual and spatial geometric knowledge so that the distilled 3D generator generalizes better than generators trained solely on 3D data.

Details

Method: Propose PEPD, a generator with Pattern Extraction and Progressive Decoding phases that efficiently fuses probabilistic flow and converts a single image into 3D Gaussians within 0.06 seconds, together with a joint optimization objective that reduces knowledge loss and overcomes sparse-view supervision. Result: Experiments on synthetic and public datasets demonstrate the effectiveness of the method. Conclusion: Through distillation and the design of its optimization objective, DD3G achieves efficient 3D generation and shows superior performance in experiments. Abstract: We introduce DD3G, a formulation that Distills a multi-view Diffusion model (MV-DM) into a 3D Generator using Gaussian splatting. DD3G compresses and integrates extensive visual and spatial geometric knowledge from the MV-DM by simulating its ordinary differential equation (ODE) trajectory, ensuring the distilled generator generalizes better than those trained solely on 3D data. Unlike previous amortized optimization approaches, we align the MV-DM and 3D generator representation spaces to transfer the teacher's probabilistic flow to the student, thus avoiding inconsistencies in optimization objectives caused by probabilistic sampling. The introduction of probabilistic flow and the coupling of various attributes in 3D Gaussians introduce challenges in the generation process. To tackle this, we propose PEPD, a generator consisting of Pattern Extraction and Progressive Decoding phases, which enables efficient fusion of probabilistic flow and converts a single image into 3D Gaussians within 0.06 seconds. Furthermore, to reduce knowledge loss and overcome sparse-view supervision, we design a joint optimization objective that ensures the quality of generated samples through explicit supervision and implicit verification. Leveraging existing 2D generation models, we compile 120k high-quality RGBA images for distillation. Experiments on synthetic and public datasets demonstrate the effectiveness of our method. Our project is available at: https://qinbaigao.github.io/DD3G_project/

News is More than a Collection of Facts: Moral Frame Preserving News Summarization

Enrico Liscio,Michela Lorandi,Pradeep K. Murukannaiah

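Task: Study how to preserve moral framing in AI-generated news summaries.

Motivation: News articles are more than collections of facts; they reflect journalists' framing, which shapes how events are presented to the audience. The choice of moral framing (writing in or quoting morally charged language) carries implicit judgments, and automated summarizers should recognize and preserve it to maintain the writer's original intent.

Details

Method: Propose an approach that leverages the intuition that journalists intentionally use or report specific moral-laden words, ensuring that these words are retained in summaries, and validate it through automated, crowd-sourced, and expert evaluations. Result: The approach preserves moral framing while maintaining overall summary quality. Conclusion: This is the first study of moral framing preservation in AI-generated news summaries, and it proposes an effective approach. Abstract: News articles are more than collections of facts; they reflect journalists' framing, shaping how events are presented to the audience. One key aspect of framing is the choice to write in (or quote verbatim) morally charged language as opposed to using neutral terms. This moral framing carries implicit judgments that automated news summarizers should recognize and preserve to maintain the original intent of the writer. In this work, we perform the first study on the preservation of moral framing in AI-generated news summaries. We propose an approach that leverages the intuition that journalists intentionally use or report specific moral-laden words, which should be retained in summaries. Through automated, crowd-sourced, and expert evaluations, we demonstrate that our approach enhances the preservation of moral framing while maintaining overall summary quality.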

Mixture-of-Attack-Experts with Class Regularization for Unified Physical-Digital Face Attack Detection

Shunxin Chen,Ajian Liu,Junze Zheng,Jun Wan,Kailai Peng,Sergio Escalera,Zhen Lei

Task: Propose FG-MoE-CLIP-CAR, a framework for classifying physical and digital attack data in facial recognition systems.

Motivation: Existing methods do not adequately account for the inherent characteristics of physical and digital attack data, such as the large variation within attack classes and the small inter-class variation between live and fake faces.

Details

Method: Use a Soft MoE architecture for feature processing and introduce two constraint modules, DM and CDM, which enhance inter-class separability and intra-class clustering, respectively. Result: Achieves state-of-the-art performance on two unified physical-digital attack datasets. Conclusion: Through improvements at the feature and loss levels, FG-MoE-CLIP-CAR effectively improves the classification of attack data in facial recognition systems. Abstract: Facial recognition systems in real-world scenarios are susceptible to both digital and physical attacks. Previous methods have attempted to achieve classification by learning a comprehensive feature space. However, these methods have not adequately accounted for the inherent characteristics of physical and digital attack data, particularly the large intra-class variation in attacks and the small inter-class variation between live and fake faces. To address these limitations, we propose the Fine-Grained MoE with Class-Aware Regularization CLIP framework (FG-MoE-CLIP-CAR), incorporating key improvements at both the feature and loss levels. At the feature level, we employ a Soft Mixture of Experts (Soft MoE) architecture to leverage different experts for specialized feature processing. Additionally, we refine the Soft MoE to capture more subtle differences among various types of fake faces. At the loss level, we introduce two constraint modules: the Disentanglement Module (DM) and the Cluster Distillation Module (CDM). The DM enhances class separability by increasing the distance between the centers of live and fake face classes. However, center-to-center constraints alone are insufficient to ensure distinctive representations for individual features. Thus, we propose the CDM to further cluster features around their respective class centers while maintaining separation from other classes. Moreover, specific attacks that significantly deviate from common attack patterns are often overlooked. To address this issue, our distance calculation prioritizes more distant features. Experimental results on two unified physical-digital attack datasets demonstrate that the proposed method achieves state-of-the-art (SOTA) performance.

DynMoLE: Boosting Mixture of LoRA Experts Fine-Tuning with a Hybrid Routing Mechanism

Dengchun Li,Naizheng Wang,Zihao Zhang,Haoyang Yin,Lei Duan,Meng Xiao,Mingjie Tang

Task: Propose DynMoLE, a dynamic routing strategy that improves the efficiency and accuracy of instruction-fine-tuned large language models in multi-task settings.

Motivation: Existing MoLE routing mechanisms trade off computational efficiency against predictive accuracy and fail to fully address the diverse expert selection demands across different Transformer layers.

Details

Method: Propose DynMoLE, a dynamic routing strategy based on Tsallis entropy that adjusts expert selection to reduce router uncertainty, and introduce a Tsallis-entropy-based auxiliary loss to improve training stability. Result: On commonsense reasoning benchmarks, DynMoLE achieves substantial improvements, outperforming LoRA by 9.6% and the state-of-the-art MoLE method, MoLA, by 2.3%. Conclusion: With its dynamic routing strategy and auxiliary loss, DynMoLE significantly improves model performance and training stability, offering a more efficient solution for multi-task processing. Abstract: Instruction-based fine-tuning of large language models (LLMs) has achieved remarkable success in various natural language processing (NLP) tasks. Parameter-efficient fine-tuning (PEFT) methods, such as Mixture of LoRA Experts (MoLE), combine the efficiency of Low-Rank Adaptation (LoRA) with the versatility of Mixture of Experts (MoE) models, demonstrating significant potential for handling multiple downstream tasks. However, the existing routing mechanisms for MoLE often involve a trade-off between computational efficiency and predictive accuracy, and they fail to fully address the diverse expert selection demands across different transformer layers. In this work, we propose DynMoLE, a hybrid routing strategy that dynamically adjusts expert selection based on the Tsallis entropy of the router's probability distribution. This approach mitigates router uncertainty, enhances stability, and promotes more equitable expert participation, leading to faster convergence and improved model performance. Additionally, we introduce an auxiliary loss based on Tsallis entropy to further guide the model toward convergence with reduced uncertainty, thereby improving training stability and performance. Our extensive experiments on commonsense reasoning benchmarks demonstrate that DynMoLE achieves substantial performance improvements, outperforming LoRA by 9.6% and surpassing the state-of-the-art MoLE method, MoLA, by 2.3%. We also conduct a comprehensive ablation study to evaluate the contributions of DynMoLE's key components.
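
As a rough illustration of entropy-driven routing, the sketch below computes the Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1) of a router distribution and switches between two selection modes depending on it. Only the entropy formula is standard; the switching rule (use all experts when uncertainty is high, otherwise top-k), the threshold, and the names `select_experts`, `threshold`, and `top_k` are assumptions made for this example and are not taken from the DynMoLE paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tsallis_entropy(probs, q=2.0):
    """Tsallis entropy; reduces to Shannon entropy in the limit q -> 1."""
    if abs(q - 1.0) < 1e-9:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def select_experts(logits, q=2.0, threshold=0.5, top_k=2):
    """Hypothetical hybrid rule (not the paper's exact algorithm).

    High router entropy -> keep every expert (soft routing).
    Low router entropy  -> keep only the top_k most probable experts.
    Returns (renormalized weights per chosen expert, entropy).
    """
    probs = softmax(logits)
    entropy = tsallis_entropy(probs, q)
    if entropy > threshold:
        chosen = list(range(len(probs)))
    else:
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = sorted(ranked[:top_k])
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}, entropy


uncertain_weights, uncertain_entropy = select_experts([0.0, 0.0, 0.0, 0.0])
confident_weights, confident_entropy = select_experts([10.0, 0.0, 0.0, 0.0])
```

With q = 2, a uniform distribution over four experts has entropy 1 - 4 * (1/4)^2 = 0.75, so the first call keeps all experts, while the sharply peaked second call falls back to two.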

Exploring the Collaborative Advantage of Low-level Information on Generalizable AI-Generated Image Detection

Ziyin Zhou,Ke Sun,Zhongxi Chen,Xianming Lin,Yunpeng Luo,Ke Yan,Shouhong Ding,Xiaoshuai Sun

Task: Propose the Adaptive Low-level Experts Injection (ALEI) framework to improve the generalization of AI-generated image detection.

Motivation: Existing methods usually consider only a single type of low-level information, which limits generalization, and simple fusion strategies cannot fully exploit the detection advantages of different low-level and high-level information.

Details

Method: Introduce Lora Experts so that the backbone network can accept and learn from different low-level information; adaptively fuse features with cross-attention; develop a Low-level Information Adapter to keep the backbone from losing its ability to model low-level features; and propose Dynamic Feature Selection to maximize generalization in detection. Result: Fine-tuned on only four categories of mainstream ProGAN data, the method performs excellently on multiple datasets containing unseen GAN and Diffusion methods, achieving state-of-the-art results. Conclusion: By adaptively fusing low-level and high-level information, ALEI significantly improves the generalization of AI-generated image detection. Abstract: Existing state-of-the-art AI-Generated image detection methods mostly consider extracting low-level information from RGB images to help improve the generalization of AI-Generated image detection, such as noise patterns. However, these methods often consider only a single type of low-level information, which may lead to suboptimal generalization. Through empirical analysis, we have discovered a key insight: different low-level information often exhibits generalization capabilities for different types of forgeries. Furthermore, we found that simple fusion strategies are insufficient to leverage the detection advantages of each low-level and high-level information for various forgery types. Therefore, we propose the Adaptive Low-level Experts Injection (ALEI) framework. Our approach introduces Lora Experts, enabling the backbone network, which is trained with high-level semantic RGB images, to accept and learn knowledge from different low-level information. We utilize a cross-attention method to adaptively fuse these features at intermediate layers. To prevent the backbone network from losing the modeling capabilities of different low-level features during the later stages of modeling, we developed a Low-level Information Adapter that interacts with the features extracted by the backbone network. Finally, we propose Dynamic Feature Selection, which dynamically selects the most suitable features for detecting the current image to maximize generalization detection capability. Extensive experiments demonstrate that our method, finetuned on only four categories of mainstream ProGAN data, performs excellently and achieves state-of-the-art results on multiple datasets containing unseen GAN and Diffusion methods.

Do LLMs Surpass Encoders for Biomedical NER?

Motasem S Obeidat,Md Sultan Al Nahian,Ramakanth Kavuluru

Task: Evaluate the performance-efficiency trade-offs of decoder models (such as Mistral and Llama) in biomedical named entity recognition (NER).

Motivation: Decoder models (LLMs) are gaining traction in information extraction, but it is unclear whether they outperform encoder models (such as BERT) at biomedical NER, and they are computationally expensive.

Details

Method: Using the same BIO entity tagging scheme, compare LLMs and encoder models on five different datasets, with a focus on recognizing longer entities. Result: LLMs often outperform encoder models by 2-8% in F-score, especially on longer entities (length >= 3 tokens), but are one to two orders of magnitude more computationally expensive. Conclusion: Although LLMs have a performance advantage, encoder models may be more suitable when performance differences are small or real-time feedback is needed. Abstract: Recognizing spans of biomedical concepts and their types (e.g., drug or gene) in free text, often called biomedical named entity recognition (NER), is a basic component of information extraction (IE) pipelines. Without a strong NER component, other applications, such as knowledge discovery and information retrieval, are not practical. The state of the art in NER has shifted from traditional ML models to deep neural networks, with transformer-based encoder models (e.g., BERT) emerging as the current standard. However, decoder models (also called large language models or LLMs) are gaining traction in IE. But LLM-driven NER often ignores positional information due to the generative nature of decoder models. Furthermore, they are computationally very expensive (both in inference time and hardware needs). Hence, it is worth exploring if they actually excel at biomedical NER and assess any associated trade-offs (performance vs efficiency). This is exactly what we do in this effort employing the same BIO entity tagging scheme (that retains positional information) using five different datasets with varying proportions of longer entities. Our results show that the LLMs chosen (Mistral and Llama: 8B range) often outperform best encoder models (BERT-(un)cased, BiomedBERT, and DeBERTav3: 300M range) by 2-8% in F-scores except for one dataset, where they equal encoder performance. This gain is more prominent among longer entities of length >= 3 tokens. However, LLMs are one to two orders of magnitude more expensive at inference time and may need cost-prohibitive hardware. Thus, when performance differences are small or real-time user feedback is needed, encoder models might still be more suitable than LLMs.
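
For readers unfamiliar with the BIO scheme mentioned above, the following sketch decodes a per-token BIO tag sequence into entity spans and then filters the longer entities (length >= 3 tokens) that the abstract highlights. It is a generic decoder written for illustration, not the authors' evaluation code; the example sentence, the `Gene` and `Drug` labels, and the lenient handling of a stray `I-` tag (treated as the start of a new span) are assumptions.

```python
def bio_to_spans(tags):
    """Convert BIO tags to (start, end_exclusive, type) spans.

    Lenient decoding (an assumption): an I- tag that does not continue
    the current entity of the same type starts a new span.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, tag_type = tag.partition("-")
        if prefix == "I" and label is not None and tag_type == label:
            continue                      # still inside the current entity
        if label is not None:             # close the entity that just ended
            spans.append((start, i, label))
            start, label = None, None
        if prefix in ("B", "I"):          # open a new entity
            start, label = i, tag_type
    if label is not None:                 # entity running to the end
        spans.append((start, len(tags), label))
    return spans


# Hypothetical example sentence and tags.
tokens = ["Patients", "received", "tumor", "necrosis", "factor", "alpha", "and", "aspirin"]
tags = ["O", "O", "B-Gene", "I-Gene", "I-Gene", "I-Gene", "O", "B-Drug"]

spans = bio_to_spans(tags)
long_spans = [s for s in spans if s[1] - s[0] >= 3]
long_mentions = [" ".join(tokens[s:e]) for s, e, _ in long_spans]
```

Because each tag is tied to a token position, spans recovered this way keep the positional information that free-form generative output would otherwise lose.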

4th PVUW MeViS 3rd Place Report: Sa2VA

Haobo Yuan,Tao Zhang,Xiangtai Li,Lu Qi,Zilong Huang,Shilin Xu,Jiashi Feng,Ming-Hsuan Yang

Task: Referring video object segmentation (RVOS) based on language descriptions, focusing on the MeViS dataset.

Motivation: The MeViS dataset introduces motion expressions of target objects, making it a more challenging benchmark compared to existing RVOS benchmarks. Additionally, multi-modal large language models (MLLMs) show promise in improving image and text alignment for referring expression tasks.

Details

Method: A simple modification to the test-time inference method on stronger MLLMs, specifically adopting the Sa2VA model, a unified model for dense grounded understanding of images and videos. The scope of key frames is enlarged without further training. Result: Achieved 3rd place in the 4th PVUW workshop. Conclusion: The proposed method demonstrates that simple modifications to existing MLLMs can lead to improved performance on challenging RVOS benchmarks like MeViS. Abstract: Referring video object segmentation (RVOS) is a challenging task that requires the model to segment the object in a video given the language description. MeViS is a recently proposed dataset that contains motion expressions of the target objects, leading to a challenging benchmark, compared with existing RVOS benchmarks. On the other hand, for referring expression tasks, a new trend is to adopt multi-modal large language models (MLLMs) to achieve better image and text alignment. In this report, we show that with a simple modification to the test-time inference method on stronger MLLMs, we can obtain stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos. By enlarging the scope of key frames, without any further training, we achieve 3rd place in the 4th PVUW workshop.

GLiNER-biomed: A Suite of Efficient Models for Open Biomedical Named Entity Recognition

Anthony Yazdani,Ihor Stepanov,Douglas Teodoro

Task: Propose GLiNER-biomed, a suite of generalist, lightweight models designed for biomedical named entity recognition (NER), to address the limitations of traditional NER models in the biomedical domain.

Motivation: Biomedical NER faces challenges such as large specialized vocabularies and the continuous emergence of novel entities; traditional models, constrained by fixed taxonomies and human annotations, struggle to generalize or adapt to new concepts.

Details

Method: Infer arbitrary entity types from natural language descriptions, enabling zero-shot recognition; use large language models (LLMs) to generate synthetic data and train two GLiNER architectures (uni-encoder and bi-encoder). Result: On several biomedical datasets, GLiNER-biomed outperforms the best existing models in zero- and few-shot scenarios, improving F1-score by 5.96%. Conclusion: Through its synthetic data generation strategy and fine-tuning on high-quality general-domain annotations, GLiNER-biomed significantly improves biomedical NER performance; the models and data are open-source. Abstract: Biomedical named entity recognition (NER) presents unique challenges due to specialized vocabularies, the sheer volume of entities, and the continuous emergence of novel entities. Traditional NER models, constrained by fixed taxonomies and human annotations, struggle to generalize beyond predefined entity types or efficiently adapt to emerging concepts. To address these issues, we introduce GLiNER-biomed, a domain-adapted suite of Generalist and Lightweight Model for NER (GLiNER) models specifically tailored for biomedical NER. In contrast to conventional approaches, GLiNER uses natural language descriptions to infer arbitrary entity types, enabling zero-shot recognition. Our approach first distills the annotation capabilities of large language models (LLMs) into a smaller, more efficient model, enabling the generation of high-coverage synthetic biomedical NER data. We subsequently train two GLiNER architectures, uni- and bi-encoder, at multiple scales to balance computational efficiency and recognition performance. Evaluations on several biomedical datasets demonstrate that GLiNER-biomed outperforms state-of-the-art GLiNER models in both zero- and few-shot scenarios, achieving 5.96% improvement in F1-score over the strongest baseline. Ablation studies highlight the effectiveness of our synthetic data generation strategy and emphasize the complementary benefits of synthetic biomedical pre-training combined with fine-tuning on high-quality general-domain annotations. All datasets, models, and training pipelines are publicly available at https://github.com/ds4dh/GLiNER-biomed.

FSSUWNet: Mitigating the Fragility of Pre-trained Models with Feature Enhancement for Few-Shot Semantic Segmentation in Underwater Images

Zhuohao Li,Zhicheng Huang,Wenchao Liu,Zhuxing Zhang,Jianming Miao

Task: Few-Shot Semantic Segmentation (FSS) for underwater images with limited annotated examples.

Motivation: Existing FSS methods struggle to generalize to underwater environments due to fragile prior features from pre-trained models.

Details

Method: Proposes FSSUWNet, a tailored FSS framework with feature enhancement, integrating complementary features and a Feature Alignment Module. Result: Achieves state-of-the-art performance, outperforming previous methods by 2.8% and 2.6% in mean Intersection over Union for 1-shot and 5-shot scenarios. Conclusion: FSSUWNet effectively addresses the challenges of underwater FSS and demonstrates superior performance on public datasets. Abstract: Few-Shot Semantic Segmentation (FSS), which focuses on segmenting new classes in images using only a limited number of annotated examples, has recently progressed in data-scarce domains. However, in this work, we show that the existing FSS methods often struggle to generalize to underwater environments. Specifically, the prior features extracted by pre-trained models used as feature extractors are fragile due to the unique challenges of underwater images. To address this, we propose FSSUWNet, a tailored FSS framework for underwater images with feature enhancement. FSSUWNet exploits the integration of complementary features, emphasizing both low-level and high-level image characteristics. In addition to employing a pre-trained model as the primary encoder, we propose an auxiliary encoder called Feature Enhanced Encoder which extracts complementary features to better adapt to underwater scene characteristics. Furthermore, a simple and effective Feature Alignment Module aims to provide global prior knowledge and align low-level features with high-level features in dimensions. Given the scarcity of underwater images, we introduce a cross-validation dataset version based on the Segmentation of Underwater Imagery dataset. Extensive experiments on public underwater segmentation datasets demonstrate that our approach achieves state-of-the-art performance. For example, our method outperforms the previous best method by 2.8% and 2.6% in terms of the mean Intersection over Union metric for 1-shot and 5-shot scenarios in the datasets, respectively. Our implementation is available at https://github.com/lizhh268/FSSUWNet.
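
The mean Intersection over Union metric quoted above can be sketched as follows. The protocol assumed here (accumulate foreground intersection and union per class over all test episodes, then average the per-class ratios) is a common convention in few-shot segmentation, but the abstract does not spell out the exact evaluation procedure, so treat the function `mean_iou`, the class names, and the toy masks as hypothetical.

```python
from collections import defaultdict


def mean_iou(episodes):
    """Mean IoU over classes.

    episodes: iterable of (class_name, pred_mask, gt_mask), where each mask
    is a 2D list of 0/1 foreground labels with identical shape.
    """
    inter = defaultdict(int)
    union = defaultdict(int)
    for cls, pred, gt in episodes:
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                inter[cls] += 1 if (p and g) else 0
                union[cls] += 1 if (p or g) else 0
    # Skip classes with an empty union to avoid dividing by zero.
    ious = [inter[c] / union[c] for c in union if union[c] > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy 2x2 masks for two hypothetical underwater classes.
episodes = [
    ("fish", [[1, 1], [0, 0]], [[1, 0], [0, 0]]),   # IoU = 1 / 2
    ("diver", [[1, 0], [0, 0]], [[1, 0], [0, 0]]),  # IoU = 1 / 1
]
score = mean_iou(episodes)
```

A reported gain of 2.8% in this metric corresponds to the class-averaged overlap ratio rising by 0.028 on a 0 to 1 scale, assuming the paper reports percentage points.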

ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection

Xiaoxuan Zhu,Zhouhong Gu,Suhang Zheng,Tao Wang,Tianyu Li,Hongwei Feng,Yanghua Xiao

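Task: Propose ToReMi, a two-stage framework that dynamically adjusts the weights of training samples during language model pre-training.

Motivation: Current methods fail to adequately capture the semantic connections between training samples and the quality disparities within individual domains, which limits pre-training effectiveness.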

Details

Method: Uses topic-based reweighting that dynamically adjusts sample weights according to topical associations and observed learning patterns. Result: ToReMi variants show faster perplexity reduction across multiple domains and stronger performance on downstream tasks. Conclusion: The ToReMi framework improves language model pre-training and outperforms conventional approaches. Abstract: Pre-training large language models (LLMs) necessitates enormous diverse textual corpora, making effective data selection a key challenge for balancing computational resources and model performance. Current methodologies primarily emphasize data quality metrics and mixing proportions, yet they fail to adequately capture the underlying semantic connections between training samples and quality disparities within individual domains. We introduce ToReMi (Topic-based Reweighting for Model improvement), a novel two-stage framework that dynamically adjusts training sample weights according to their topical associations and observed learning patterns. Our comprehensive experiments reveal that ToReMi variants consistently achieve superior performance over conventional pre-training approaches, demonstrating accelerated perplexity reduction across multiple domains and enhanced capabilities on downstream evaluation tasks. Code is available at https://github.com/zxx000728/ToReMi.
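
The idea of weighting samples by topic and observed training loss can be sketched as follows. This is an illustration under stated assumptions, not ToReMi's actual update rule, which the abstract does not specify: the softmax-over-mean-loss rule, the `temperature` parameter, and the function name `topic_weights` are all hypothetical.

```python
import math
from collections import defaultdict


def topic_weights(losses, topics, temperature=1.0):
    """Return one weight per sample; samples with the same topic share a weight.

    Hypothetical rule (an assumption, not from the paper): topics with a
    higher mean loss receive larger weights through an exponential of the
    per-topic mean loss, rescaled so the average sample weight is 1.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for loss, topic in zip(losses, topics):
        sums[topic] += loss
        counts[topic] += 1
    mean_loss = {t: sums[t] / counts[t] for t in sums}
    exp = {t: math.exp(m / temperature) for t, m in mean_loss.items()}
    raw = [exp[t] for t in topics]
    scale = len(raw) / sum(raw)  # normalise to an average weight of 1
    return [w * scale for w in raw]


# Toy batch: two "law" samples with higher loss, two "code" samples with lower loss.
losses = [2.0, 2.0, 1.0, 1.0]
topics = ["law", "law", "code", "code"]
weights = topic_weights(losses, topics)

# A reweighted batch loss would then be the weighted mean of the sample losses.
weighted_loss = sum(w * l for w, l in zip(weights, losses)) / len(losses)
```

Keeping the average weight at 1 leaves the overall loss scale roughly comparable to unweighted training, which is one common design choice for reweighting schemes.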

Hierarchical Attention Networks for Lossless Point Cloud Attribute Compression

Yueru Chen,Wei Zhang,Dingquan Li,Jing Wang,Ge Li

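Task: Propose a deep hierarchical attention context model for lossless attribute compression of point clouds.

Motivation: Improve the efficiency and performance of point cloud attribute compression through a multi-resolution spatial structure and residual learning.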

Details

Method: Introduces a simple Level of Detail (LoD) structure for a coarse-to-fine representation, encodes points within the same refinement level in parallel, and uses a hierarchical attention model to learn contextual dependencies across scales and densities. Result: Experiments show better coding performance than the latest G-PCC for color and reflectance attributes, with more efficient encoding and decoding runtimes. Conclusion: The proposed method delivers superior performance and efficiency for point cloud attribute compression. Abstract: In this paper, we propose a deep hierarchical attention context model for lossless attribute compression of point clouds, leveraging a multi-resolution spatial structure and residual learning. A simple and effective Level of Detail (LoD) structure is introduced to yield a coarse-to-fine representation. To enhance efficiency, points within the same refinement level are encoded in parallel, sharing a common context point group. By hierarchically aggregating information from neighboring points, our attention model learns contextual dependencies across varying scales and densities, enabling comprehensive feature extraction. We also adopt normalization for position coordinates and attributes to achieve scale-invariant compression. Additionally, we segment the point cloud into multiple slices to facilitate parallel processing, further optimizing time complexity. Experimental results demonstrate that the proposed method offers better coding performance than the latest G-PCC for color and reflectance attributes while maintaining more efficient encoding and decoding runtimes.

Command A: An Enterprise-Ready Large Language Model

Team Cohere,Aakanksha,Arash Ahmadian,Marwan Ahmed,Jay Alammar,Yazeed Alnumay,Sophia Althammer,Arkady Arkhangorodsky,Viraat Aryabumi,Dennis Aumiller,Raphaël Avalos,Zahara Aviv,Sammie Bae,Saurabh Baji,Alexandre Barbet,Max Bartolo,Björn Bebensee,Neeral Beladia,Walter Beller-Morales,Alexandre Bérard,Andrew Berneshawi,Anna Bialas,Phil Blunsom,Matt Bobkin,Adi Bongale,Sam Braun,Maxime Brunet,Samuel Cahyawijaya,David Cairuz,Jon Ander Campos,Cassie Cao,Kris Cao,Roman Castagné,Julián Cendrero,Leila Chan Currie,Yash Chandak,Diane Chang,Giannis Chatziveroglou,Hongyu Chen,Claire Cheng,Alexis Chevalier,Justin T. Chiu,Eugene Cho,Eugene Choi,Eujeong Choi,Tim Chung,Volkan Cirik,Ana Cismaru,Pierre Clavier,Henry Conklin,Lucas Crawhall-Stein,Devon Crouse,Andres Felipe Cruz-Salinas,Ben Cyrus,Daniel D'souza,Hugo Dalla-Torre,John Dang,William Darling,Omar Darwiche Domingues,Saurabh Dash,Antoine Debugne,Théo Dehaze,Shaan Desai,Joan Devassy,Rishit Dholakia,Kyle Duffy,Ali Edalati,Ace Eldeib,Abdullah Elkady,Sarah Elsharkawy,Irem Ergün,Beyza Ermis,Marzieh Fadaee,Boyu Fan,Lucas Fayoux,Yannis Flet-Berliac,Nick Frosst,Matthias Gallé,Wojciech Galuba,Utsav Garg,Matthieu Geist,Mohammad Gheshlaghi Azar,Seraphina Goldfarb-Tarrant,Tomas Goldsack,Aidan Gomez,Victor Machado Gonzaga,Nithya Govindarajan,Manoj Govindassamy,Nathan Grinsztajn,Nikolas Gritsch,Patrick Gu,Shangmin Guo,Kilian Haefeli,Rod Hajjar,Tim Hawes,Jingyi He,Sebastian Hofstätter,Sungjin Hong,Sara Hooker,Tom Hosking,Stephanie Howe,Eric Hu,Renjie Huang,Hemant Jain,Ritika Jain,Nick Jakobi,Madeline Jenkins,JJ Jordan,Dhruti Joshi,Jason Jung,Trushant Kalyanpur,Siddhartha Rao Kamalakara,Julia Kedrzycki,Gokce Keskin,Edward Kim,Joon Kim,Wei-Yin Ko,Tom Kocmi,Michael Kozakov,Wojciech Kryściński,Arnav Kumar Jain,Komal Kumar Teru,Sander Land,Michael Lasby,Olivia Lasche,Justin Lee,Patrick Lewis,Jeffrey Li,Jonathan Li,Hangyu Lin,Acyr Locatelli,Kevin Luong,Raymond Ma,Lukas Mach,Marina Machado,Joanne Magbitang,Brenda Malacara Lopez,Aryan Mann,Kelly 
Marchisio,Olivia Markham,Alexandre Matton,Alex McKinney,Dominic McLoughlin,Jozef Mokry,Adrien Morisot,Autumn Moulder,Harry Moynehan,Maximilian Mozes,Vivek Muppalla,Lidiya Murakhovska,Hemangani Nagarajan,Alekhya Nandula,Hisham Nasir,Shauna Nehra,Josh Netto-Rosen,Daniel Ohashi,James Owers-Bardsley,Jason Ozuzu,Dennis Padilla,Gloria Park,Sam Passaglia,Jeremy Pekmez,Laura Penstone,Aleksandra Piktus,Case Ploeg,Andrew Poulton,Youran Qi,Shubha Raghvendra,Miguel Ramos,Ekagra Ranjan,Pierre Richemond,Cécile Robert-Michon,Aurélien Rodriguez,Sudip Roy,Laura Ruis,Louise Rust,Anubhav Sachan,Alejandro Salamanca,Kailash Karthik Saravanakumar,Isha Satyakam,Alice Schoenauer Sebag,Priyanka Sen,Sholeh Sepehri,Preethi Seshadri,Ye Shen,Tom Sherborne,Sylvie Chang Shi,Sanal Shivaprasad,Vladyslav Shmyhlo,Anirudh Shrinivason,Inna Shteinbuk,Amir Shukayev,Mathieu Simard,Ella Snyder,Ava Spataru,Victoria Spooner,Trisha Starostina,Florian Strub,Yixuan Su,Jimin Sun,Dwarak Talupuru,Eugene Tarassov,Elena Tommasone,Jennifer Tracey,Billy Trend,Evren Tumer,Ahmet Üstün,Bharat Venkitesh,David Venuto,Pat Verga,Maxime Voisin,Alex Wang,Donglu Wang,Shijian Wang,Edmond Wen,Naomi White,Jesse Willman,Marysia Winkels,Chen Xia,Jessica Xie,Minjie Xu,Bowen Yang,Tan Yi-Chern,Ivan Zhang,Zhenyu Zhao,Zhoujie Zhao

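Task: Develop Command A, a large language model optimized for real-world enterprise use cases.

Motivation: Meet enterprise needs for multilingual support, efficient performance, and automation of complex business processes.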

Details

Method: Uses a decentralised training approach, including self-refinement algorithms and model merging techniques, together with a hybrid architecture that balances efficiency and performance. Result: Command A and Command R7B perform well on multilingual and enterprise-relevant tasks, and the weights of both models have been released for research purposes. Conclusion: Command A delivers excellent performance and efficiency and is suited to enterprise applications. Abstract: In this report we describe the development of Command A, a powerful large language model purpose-built to excel at real-world enterprise use cases. Command A is an agent-optimised and multilingual-capable model, with support for 23 languages of global business, and a novel hybrid architecture balancing efficiency with top-of-the-range performance. It offers best-in-class Retrieval Augmented Generation (RAG) capabilities with grounding and tool use to automate sophisticated business processes. These abilities are achieved through a decentralised training approach, including self-refinement algorithms and model merging techniques. We also include results for Command R7B, which shares capability and architectural similarities with Command A. Weights for both models have been released for research purposes. This technical report details our original training pipeline and presents an extensive evaluation of our models across a suite of enterprise-relevant tasks and public benchmarks, demonstrating excellent performance and efficiency.

SCFANet: Style Distribution Constraint Feature Alignment Network For Pathological Staining Translation

Zetong Chen,Yuzhuo Chen,Hai Zhong,Xu Qiao

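Task: Propose SCFANet, a deep learning model that translates H&E-stained images directly into IHC-stained images.

Motivation: IHC staining is time-consuming and costly, whereas H&E-stained images are cheaper to produce; direct translation, however, faces challenges such as alignment discrepancies between image pairs and the diversity of staining styles.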

Details

Method: SCFANet comprises two modules, the Style Distribution Constrainer (SDC) and Feature Alignment Learning (FAL), which handle style distribution consistency and feature alignment, respectively. Result: Experiments on the BCI dataset show that SCFANet outperforms existing methods and precisely translates H&E images into IHC images. Conclusion: SCFANet addresses the technical challenges of H&E-to-IHC image translation and provides an efficient, accurate stain conversion framework for pathological analysis. Abstract: Immunohistochemical (IHC) staining serves as a valuable technique for detecting specific antigens or proteins through antibody-mediated visualization. However, the IHC staining process is both time-consuming and costly. To address these limitations, the application of deep learning models for direct translation of cost-effective Hematoxylin and Eosin (H&E) stained images into IHC stained images has emerged as an efficient solution. Nevertheless, the conversion from H&E to IHC images presents significant challenges, primarily due to alignment discrepancies between image pairs and the inherent diversity in IHC staining style patterns. To overcome these challenges, we propose the Style Distribution Constraint Feature Alignment Network (SCFANet), which incorporates two innovative modules: the Style Distribution Constrainer (SDC) and Feature Alignment Learning (FAL). The SDC ensures consistency between the generated and target images' style distributions while integrating cycle consistency loss to maintain structural consistency. To mitigate the complexity of direct image-to-image translation, the FAL module decomposes the end-to-end translation task into two subtasks: image reconstruction and feature alignment. Furthermore, we ensure pathological consistency between generated and target images by maintaining pathological pattern consistency and Optical Density (OD) uniformity. Extensive experiments conducted on the Breast Cancer Immunohistochemical (BCI) dataset demonstrate that our SCFANet model outperforms existing methods, achieving precise transformation of H&E-stained images into their IHC-stained counterparts. The proposed approach not only addresses the technical challenges in H&E to IHC image translation but also provides a robust framework for accurate and efficient stain conversion in pathological analysis.

Aplicação de Large Language Models na Análise e Síntese de Documentos Jurídicos: Uma Revisão de Literatura

Matheus Belarmino,Rackel Coelho,Roberto Lotudo,Jayr Pereira

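Task: Systematically review the state of the art in prompt engineering applied to large language models in the legal domain.

Motivation: Large language models are increasingly used to analyze and synthesize legal documents, but how prompt engineering can optimize their performance still needs further study.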

Details

Method: Conducts a systematic literature review to identify and analyze the use of models such as GPT-4, BERT, Llama 2, and Legal-Pegasus in the legal field, and the effectiveness of techniques such as Few-shot Learning, Zero-shot Learning, and Chain-of-Thought prompting. Result: These models and techniques perform well in interpreting legal texts, but problems such as model bias and hallucinations remain. Conclusion: Despite the great potential of large language models for the legal field, prompt engineering strategies need to improve to make generated results more accurate and reliable. Abstract: Large Language Models (LLMs) have been increasingly used to optimize the analysis and synthesis of legal documents, enabling the automation of tasks such as summarization, classification, and retrieval of legal information. This study aims to conduct a systematic literature review to identify the state of the art in prompt engineering applied to LLMs in the legal context. The results indicate that models such as GPT-4, BERT, Llama 2, and Legal-Pegasus are widely employed in the legal field, and techniques such as Few-shot Learning, Zero-shot Learning, and Chain-of-Thought prompting have proven effective in improving the interpretation of legal texts. However, challenges such as biases in models and hallucinations still hinder their large-scale implementation. It is concluded that, despite the great potential of LLMs for the legal field, there is a need to improve prompt engineering strategies to ensure greater accuracy and reliability in the generated results.

Learned Image Compression with Dictionary-based Entropy Model

Jingbo Lu,Leheng Zhang,Xingyu Zhou,Mu Li,Wen Li,Shuhang Gu

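Task: Propose a new Dictionary-based Cross Attention Entropy model to improve the entropy model in learned image compression.

Motivation: Existing entropy models focus on the internal dependencies of the latent representation and neglect the importance of extracting priors from training data.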

Details

Method: Introduces a learnable dictionary that summarizes typical structures in the training data to enhance the entropy model. Result: Experiments show that the model strikes a better balance between performance and latency and achieves state-of-the-art results on several benchmark datasets. Conclusion: The proposed Dictionary-based Cross Attention Entropy model performs strongly on image compression and offers a new direction for learned image compression. Abstract: Learned image compression methods have attracted great research interest and exhibited superior rate-distortion performance to the best current classical image compression standards. The entropy model plays a key role in learned image compression: it estimates the probability distribution of the latent representation for further entropy coding. Most existing methods employ hyper-prior and auto-regressive architectures to form their entropy models. However, they only aim to explore the internal dependencies of the latent representation while neglecting the importance of extracting priors from training data. In this work, we propose a novel entropy model named Dictionary-based Cross Attention Entropy model, which introduces a learnable dictionary to summarize the typical structures occurring in the training dataset to enhance the entropy model. Extensive experimental results have demonstrated that the proposed model strikes a better balance between performance and latency, achieving state-of-the-art results on various benchmark datasets.

IHC-LLMiner: Automated extraction of tumour immunohistochemical profiles from PubMed abstracts using large language models

Yunsoo Kim,Michal W. S. Ong,Daniel W. Rogalsky,Manuel Rodriguez-Justo,Honghan Wu,Adam P. Levine

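Task: Develop IHC-LLMiner, an automated pipeline for extracting IHC-tumour profiles from PubMed abstracts, with two subtasks: abstract classification and profile extraction.

Motivation: Immunohistochemistry (IHC) is essential in diagnostic pathology and biomedical research, but extracting and analysing IHC-tumour profile data at scale remains challenging.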

Details

Method: Builds an LLM-based pipeline on advanced biomedical text mining, consisting of a classification model (Gemma-2 finetuned) and a profile extraction model. Result: The Gemma-2 finetuned model reaches 91.5% accuracy and an F1 score of 91.4 on classification and 63.3% Correct outputs on profile extraction; extracted profiles are normalised to UMLS concepts. Conclusion: IHC-LLMiner offers a practical solution for large-scale IHC-tumour profile data mining, improving the accessibility and utility of the data and supporting the development of cancer knowledge bases. Abstract: Immunohistochemistry (IHC) is essential in diagnostic pathology and biomedical research, offering critical insights into protein expression and tumour biology. This study presents an automated pipeline, IHC-LLMiner, for extracting IHC-tumour profiles from PubMed abstracts, leveraging advanced biomedical text mining. There are two subtasks: abstract classification (include/exclude as relevant) and IHC-tumour profile extraction on relevant included abstracts. The best-performing model, "Gemma-2 finetuned", achieved 91.5% accuracy and an F1 score of 91.4, outperforming GPT4-O by 9.5% accuracy with 5.9 times faster inference time. From an initial dataset of 107,759 abstracts identified for 50 immunohistochemical markers, the classification task identified 30,481 relevant abstracts (Include) using the Gemma-2 finetuned model. For IHC-tumour profile extraction, the Gemma-2 finetuned model achieved the best performance with 63.3% Correct outputs. Extracted IHC-tumour profiles (tumour types and markers) were normalised to Unified Medical Language System (UMLS) concepts to ensure consistency and facilitate IHC-tumour profile landscape analysis. The extracted IHC-tumour profiles demonstrated excellent concordance with available online summary data and provided considerable added value in terms of both missing IHC-tumour profiles and quantitative assessments. Our proposed LLM based pipeline provides a practical solution for large-scale IHC-tumour profile data mining, enhancing the accessibility and utility of such data for research and clinical applications as well as enabling the generation of quantitative and structured data to support cancer-specific knowledge base development. Models and training datasets are available at https://github.com/knowlab/IHC-LLMiner.

ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

Qianhao Yuan,Qingyu Zhang,Yanjiang Liu,Jiawei Chen,Yaojie Lu,Hongyu Lin,Jia Zheng,Xianpei Han,Le Sun

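Task: Investigate layer-wise redundancy in multimodal large language models (MLLMs) and propose a method that reduces computational cost.

Motivation: MLLMs incur high computational costs because of their large model size and large number of visual tokens.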

Details

Method: Introduces the Layer Contribution (LC) metric, which quantifies a layer's impact on visual and text tokens, and proposes ShortV, which freezes visual token updates in ineffective layers. Result: ShortV can freeze visual token updates in approximately 60% of layers, substantially reducing computational cost (for example, a 50% reduction in FLOPs on LLaVA-NeXT-13B) while maintaining performance. Conclusion: ShortV is a training-free method that effectively reduces the computational overhead of MLLMs. Abstract: Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV
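
The description of Layer Contribution, measuring how much the model output diverges when a layer's transformations on chosen tokens are removed, can be sketched on a toy model. Everything here is an assumption made for illustration: the toy "layers" are per-token residual updates with no attention, the divergence is KL divergence, and the names `forward` and `layer_contribution` are hypothetical rather than taken from the ShortV code.

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def forward(tokens, layers, skip=None):
    """Run a toy residual model and return an output distribution.

    tokens: list of hidden-state vectors; layers: functions mapping a hidden
    state to its residual update. skip = (layer_index, token_indices) leaves
    those tokens unchanged ("frozen") in that layer.
    """
    states = [list(t) for t in tokens]
    for i, layer in enumerate(layers):
        for j, h in enumerate(states):
            if skip is not None and i == skip[0] and j in skip[1]:
                continue  # frozen: no update for this token in this layer
            states[j] = [a + b for a, b in zip(h, layer(h))]
    dim = len(states[0])
    logits = [sum(s[d] for s in states) for d in range(dim)]
    return softmax(logits)


def layer_contribution(tokens, layers, layer_index, token_indices):
    """Divergence between the full output and the output with one layer ablated."""
    full = forward(tokens, layers)
    ablated = forward(tokens, layers, skip=(layer_index, set(token_indices)))
    return kl(full, ablated)


# Toy setup: layer 0 changes nothing, layer 1 shifts every token.
layers = [lambda h: [0.0, 0.0], lambda h: [1.0, -1.0]]
tokens = [[0.5, 0.1], [0.2, 0.3], [0.0, 0.4]]
visual = [0, 1]  # pretend the first two tokens are visual tokens

lc_layer0 = layer_contribution(tokens, layers, 0, visual)
lc_layer1 = layer_contribution(tokens, layers, 1, visual)
```

In this toy setup, layer 0 has zero contribution on the visual tokens and would be a candidate for freezing, while layer 1 has a positive contribution. In a real transformer the frozen tokens would still be attended to by other tokens, which this sketch does not model.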

LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models

Sameer Sadruddin,Jennifer D'Souza,Eleni Poupaki,Alex Watkins,Hamed Babaei Giglou,Anisa Rula,Bora Karasulu,Sören Auer,Adrie Mackus,Erwin Kessels

Task: Extract structured information from unstructured text.

Motivation: Traditional schema mining relies on semi-structured data, which limits scalability.

Details

Method: Combines large language models with human feedback in an iterative workflow that organizes properties from text and integrates expert input and domain-specific ontologies for semantic depth. Result: In materials science (specifically atomic layer deposition), expert-guided LLMs generate semantically rich schemas suitable for diverse real-world applications. Conclusion: The schema-miner tool demonstrates the potential of automated, refined schema extraction. Abstract: Extracting structured information from unstructured text is crucial for modeling real-world processes, but traditional schema mining relies on semi-structured data, limiting scalability. This paper introduces schema-miner, a novel tool that combines large language models with human feedback to automate and refine schema extraction. Through an iterative workflow, it organizes properties from text, incorporates expert input, and integrates domain-specific ontologies for semantic depth. Applied to materials science--specifically atomic layer deposition--schema-miner demonstrates that expert-guided LLMs generate semantically rich schemas suitable for diverse real-world applications.

High-Quality Pseudo-Label Generation Based on Visual Prompt Assisted Cloud Model Update

Xinrun Xu,Qiuhong Zhang,Jianwen Yang,Zhanbiao Lian,Jin Yan,Zhiming Ding,Shan Jiang

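Task: Propose CA-HQP, a method for generating high-quality pseudo-labels in cloud-edge object detection.

Motivation: Existing methods usually assume that the cloud model is reliable, overlooking potential errors or struggling with complex data distribution shifts.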

Details

Method: CA-HQP updates the cloud model by introducing a learnable Visual Prompt Generator (VPG) and dual feature alignment (DQFA and TIAFA). Result: Experiments on the Bellevue traffic dataset show that CA-HQP significantly improves pseudo-label quality and raises edge model performance. Conclusion: CA-HQP provides an effective solution for cloud-edge object detection systems in dynamic scenarios. Abstract: Generating high-quality pseudo-labels on the cloud is crucial for cloud-edge object detection, especially in dynamic traffic monitoring where data distributions evolve. Existing methods often assume reliable cloud models, neglecting potential errors or struggling with complex distribution shifts. This paper proposes Cloud-Adaptive High-Quality Pseudo-label generation (CA-HQP), addressing these limitations by incorporating a learnable Visual Prompt Generator (VPG) and dual feature alignment into cloud model updates. The VPG enables parameter-efficient adaptation by injecting visual prompts, enhancing flexibility without extensive fine-tuning. CA-HQP mitigates domain discrepancies via two feature alignment techniques: global Domain Query Feature Alignment (DQFA) capturing scene-level shifts, and fine-grained Temporal Instance-Aware Feature Embedding Alignment (TIAFA) addressing instance variations. Experiments on the Bellevue traffic dataset demonstrate that CA-HQP significantly improves pseudo-label quality compared to existing methods, leading to notable performance gains for the edge model and showcasing CA-HQP's adaptation effectiveness. Ablation studies validate each component (DQFA, TIAFA, VPG) and the synergistic effect of combined alignment strategies, highlighting the importance of adaptive cloud updates and domain adaptation for robust object detection in evolving scenarios. CA-HQP provides a promising solution for enhancing cloud-edge object detection systems in real-world applications.

RECKON: Large-scale Reference-based Efficient Knowledge Evaluation for Large Language Model

Lin Zhang,Zhouhong Gu,Xiaoran Shi,Hongwei Feng,Yanghua Xiao

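Task: Propose RECKON, an efficient reference-based knowledge evaluation method for assessing the capabilities of large language models.

Motivation: Traditional evaluation methods rely on benchmarks and are limited by high resource costs and information loss.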

Details

Method: RECKON organizes unstructured data into manageable units and generates targeted questions for each cluster to improve evaluation accuracy and efficiency. Result: Experiments show that RECKON reduces resource consumption by 56.5% compared to traditional methods while achieving over 97% accuracy across domains including world knowledge, code, legal, and biomedical datasets. Conclusion: RECKON is an efficient and accurate method for evaluating the knowledge of large language models. Abstract: As large language models (LLMs) advance, efficient knowledge evaluation becomes crucial to verifying their capabilities. Traditional methods, relying on benchmarks, face limitations such as high resource costs and information loss. We propose the Large-scale Reference-based Efficient Knowledge Evaluation for Large Language Model (RECKON), which directly uses reference data to evaluate models. RECKON organizes unstructured data into manageable units and generates targeted questions for each cluster, improving evaluation accuracy and efficiency. Experimental results show that RECKON reduces resource consumption by 56.5% compared to traditional methods while achieving over 97% accuracy across various domains, including world knowledge, code, legal, and biomedical datasets. Code is available at https://github.com/MikeGu721/reckon

SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning

Fida Mohammad Thoker,Letian Jiang,Chen Zhao,Bernard Ghanem

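Task: Propose SMILE, a self-supervised learning method for video representation learning that addresses the limitations of existing methods by infusing spatial and motion semantics.

Motivation: Existing video self-supervised methods based on pixel-level reconstruction (such as VideoMAE) operate on natural videos with substantial temporal redundancy, which limits semantic representation and the encoding of motion dynamics.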

Details

Method: Uses image-language pretrained models (such as CLIP) to provide high-level spatial semantic guidance and strengthens motion representation by introducing synthetic motion patterns. Result: Experiments on 7 datasets show that SMILE surpasses current state-of-the-art self-supervised methods and learns more discriminative and generalizable video representations. Conclusion: SMILE establishes a new paradigm for video self-supervised learning that learns strong video representations without natural video data. Abstract: Masked video modeling, such as VideoMAE, is an effective paradigm for video self-supervised learning (SSL). However, such methods are primarily based on reconstructing pixel-level details on natural videos which have substantial temporal redundancy, limiting their capability for semantic representation and sufficient encoding of motion dynamics. To address these issues, this paper introduces a novel SSL approach for video representation learning, dubbed SMILE, by infusing both spatial and motion semantics. In SMILE, we leverage image-language pretrained models, such as CLIP, to guide the learning process with their high-level spatial semantics. We enhance the representation of motion by introducing synthetic motion patterns in the training data, allowing the model to capture more complex and dynamic content. Furthermore, using SMILE, we establish a new self-supervised video learning paradigm capable of learning strong video representations without requiring any natural video data. We have carried out extensive experiments on 7 datasets with various downstream scenarios. SMILE surpasses current state-of-the-art SSL methods, showcasing its effectiveness in learning more discriminative and generalizable video representations. Code is available: https://github.com/fmthoker/SMILE

Digitally Supported Analysis of Spontaneous Speech (DigiSpon): Benchmarking NLP-Supported Language Sample Analysis of Swiss Children's Speech

Anja Ryser,Yingqiang Gao,Sarah Ebling

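Task: Use natural language processing (NLP) methods to improve language sample analysis (LSA) in support of diagnosing developmental language disorder (DLD) in children.

Motivation: Traditional LSA is time-consuming and labor-intensive, which limits its use in clinical practice, so more efficient alternatives are needed.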

Details

Method: Applies NLP methods not based on commercial large language models (LLMs) to transcribed speech data from 119 children in German-speaking Switzerland with typical and atypical language development. Result: Preliminary findings indicate that locally deployed NLP methods hold promise for semi-automatic LSA. Conclusion: The approach offers speech-language pathologists a more efficient tool for diagnosing DLD while avoiding the ethical concerns of commercial LLMs. Abstract: Language sample analysis (LSA) is a process that complements standardized psychometric tests for diagnosing, for example, developmental language disorder (DLD) in children. However, its labor-intensive nature has limited its use in speech-language pathology practice. We introduce an approach that leverages natural language processing (NLP) methods not based on commercial large language models (LLMs) applied to transcribed speech data from 119 children in the German-speaking part of Switzerland with typical and atypical language development. The study aims to identify optimal practices that support speech-language pathologists in diagnosing DLD more efficiently within a human-in-the-loop framework, without relying on potentially unethical implementations that leverage commercial LLMs. Preliminary findings underscore the potential of integrating locally deployed NLP methods into the process of semi-automatic LSA.

Generalization-aware Remote Sensing Change Detection via Domain-agnostic Learning

Qi Zang,Shuang Wang,Dong Zhao,Dou Quan,Yang Hu,Licheng Jiao

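Task: Propose a generalizable domain-agnostic difference learning network (DonaNet) to address pseudo-changes in bitemporal images caused by imaging environmental factors.

Motivation: Existing transformation-based methods treat pseudo-changes as a style shift and mitigate them by using generative adversarial networks (GANs) to transform bitemporal images into the same style, but they suffer from image distortion and the limitations of domain alignment.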

Details

Method: DonaNet uses local-level statistics as style proxies to counter domain shifts and learns domain-agnostic representations by removing the domain-specific style of encoded features and highlighting the class characteristics of objects. Result: Experiments on three public datasets show that DonaNet outperforms existing state-of-the-art methods with a smaller model size and is more robust to domain shift. Conclusion: With its domain difference removal module and cross-temporal generalization learning strategy, DonaNet effectively addresses pseudo-changes and improves generalization. Abstract: Change detection is of essential significance for regional development, and pseudo-changes between bitemporal images induced by imaging environmental factors are a key challenge. Existing transformation-based methods regard pseudo-changes as a kind of style shift and alleviate them by transforming bitemporal images into the same style using generative adversarial networks (GANs). However, their efforts are limited by two drawbacks: 1) Transformed images suffer from distortion that reduces feature discrimination. 2) Alignment hinders the model from learning domain-agnostic representations, which degrades performance on scenes with domain shifts from the training data. Therefore, targeting pseudo-changes caused by style differences, we present a generalizable domain-agnostic difference learning network (DonaNet). For drawback 1), we argue for local-level statistics as style proxies to assist against domain shifts. For drawback 2), DonaNet learns domain-agnostic representations by removing the domain-specific style of encoded features and highlighting the class characteristics of objects. In the removal, we propose a domain difference removal module to reduce feature variance while preserving discriminative properties, and propose an enhanced version that can eliminate more style by decorrelating features. In the highlighting, we propose a cross-temporal generalization learning strategy to imitate latent domain shifts, thus enabling the model to actively extract feature representations that are more robust to shifts. Extensive experiments conducted on three public datasets demonstrate that DonaNet outperforms existing state-of-the-art methods with a smaller model size and is more robust to domain shift.

Inaccuracy of an E-Dictionary and Its Influence on Chinese Language Users

Xi Wang,Fanfei Meng,Shiyang Zhang,Lan Li

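Task: Study the accuracy of an electronic dictionary (Youdao, widely used in China) and its influence on L2 learners' vocabulary acquisition.

Motivation: Electronic dictionaries have become the main tool L2 learners use to expand their vocabulary, yet their reliability and the accuracy of their definitions are rarely questioned, and little research addresses how their corpora are built or the limitations of their use.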

Details

Method: Combines experimentation, a user survey, and dictionary critique, using a translation task with retrospective reflection to analyze definition problems in Youdao and their effect on comprehension. Result: Incomplete or misleading definitions can cause serious misunderstandings, and students exhibit problematic consultation habits; data processing and the integration of AI technologies in dictionary construction also show problems. Conclusion: The study recommends better dictionary literacy training for users and improvements to the AI models used to build electronic dictionaries. Abstract: Electronic dictionaries have largely replaced paper dictionaries and become central tools for L2 learners seeking to expand their vocabulary. Users often assume these resources are reliable and rarely question the validity of the definitions provided. The accuracy of major E-dictionaries is seldom scrutinized, and little attention has been paid to how their corpora are constructed. Research on dictionary use, particularly the limitations of electronic dictionaries, remains scarce. This study adopts a combined method of experimentation, user survey, and dictionary critique to examine Youdao, one of the most widely used E-dictionaries in China. The experiment involved a translation task paired with retrospective reflection. Participants were asked to translate sentences containing words that are insufficiently or inaccurately defined in Youdao. Their consultation behavior was recorded to analyze how faulty definitions influenced comprehension. Results show that incomplete or misleading definitions can cause serious misunderstandings. Additionally, students exhibited problematic consultation habits. The study further explores how such flawed definitions originate, highlighting issues in data processing and the integration of AI and machine learning technologies in dictionary construction. The findings suggest a need for better training in dictionary literacy for users, as well as improvements in the underlying AI models used to build E-dictionaries.

Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

Jewon Lee,Ki-Ung Song,Seungmin Yang,Donguk Lim,Jaeyeon Kim,Wooksu Shin,Bo-Kyeong Kim,Yong Jae Lee,Tae-Ho Kim

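Task: Lower the inference cost caused by image features in large vision-language models (LVLMs) through visual token reduction.

Motivation: In cross-attention-based models, the key-value (KV) cache size for image tokens significantly exceeds that of text tokens in self-attention layers and becomes a compute bottleneck.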

Details

Method: Exploits the sparsity of cross-attention maps to selectively prune redundant visual features; the resulting Trimmed Llama reduces KV cache demands without additional training. Result: With visual features reduced by 50%, the model lowers inference latency and memory usage while maintaining benchmark parity. Conclusion: The method effectively addresses the oversized KV cache in cross-attention models and improves inference efficiency. Abstract: Visual token reduction lowers inference costs caused by extensive image features in large vision-language models (LVLMs). Unlike relevant studies that prune tokens in self-attention-only LVLMs, our work uniquely addresses cross-attention-based models, which achieve superior performance. We identify that the key-value (KV) cache size for image tokens in cross-attention layers significantly exceeds that of text tokens in self-attention layers, posing a major compute bottleneck. To mitigate this issue, we exploit the sparse nature in cross-attention maps to selectively prune redundant visual features. Our Trimmed Llama effectively reduces KV cache demands without requiring additional training. By benefiting from 50%-reduced visual features, our model can reduce inference latency and memory usage while achieving benchmark parity.
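
Pruning visual features by exploiting sparse cross-attention maps can be sketched as a top-k selection. This is a minimal illustration, not the Trimmed Llama implementation: the scoring rule (total attention received from all text queries), the `keep_ratio` parameter, and the function name `select_visual_features` are assumptions, since the abstract does not state the exact selection criterion.

```python
def select_visual_features(attn, keep_ratio=0.5):
    """Pick the visual features that receive the most cross-attention.

    attn[q][v] is the attention weight from text query q to visual feature v.
    Returns the sorted indices of the features to keep.
    """
    num_visual = len(attn[0])
    # Score each visual feature by the total attention it receives.
    scores = [sum(row[v] for row in attn) for v in range(num_visual)]
    k = max(1, int(num_visual * keep_ratio))
    ranked = sorted(range(num_visual), key=lambda v: scores[v], reverse=True)
    return sorted(ranked[:k])


# Toy cross-attention map: 2 text queries attending to 4 visual features.
attn = [
    [0.70, 0.05, 0.20, 0.05],
    [0.10, 0.05, 0.80, 0.05],
]
kept = select_visual_features(attn, keep_ratio=0.5)
```

Here features 0 and 2 collect almost all of the attention mass, so they are kept and the other two are dropped. Only the keys and values of the kept features would then need to be cached.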

Z1: Efficient Test-time Scaling with Code

Zhaojian Yu,Yinghao Wu,Yilun Zhao,Arman Cohan,Xiao-Ping Zhang

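Task: Propose an efficient test-time scaling method that trains LLMs on code-related reasoning trajectories to cut excess thinking tokens while maintaining performance.

Motivation: LLMs need longer contexts and large numbers of reasoning tokens to solve complex problems, which makes them inefficient.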

Details

Method: Creates the Z1-Code-Reasoning-107K dataset and proposes the Shifted Thinking Window technique to reduce overthinking overhead. Result: Z1-7B matches R1-Distill-Qwen-7B performance with about 30% of its average thinking tokens and generalizes to broader reasoning tasks (for example, 47.5% on GPQA Diamond). Conclusion: The method offers valuable insights into efficient reasoning and a basis for future research. Abstract: Large Language Models (LLMs) can achieve enhanced complex problem-solving through test-time computing scaling, yet this often entails longer contexts and numerous reasoning token costs. In this paper, we propose an efficient test-time scaling method that trains LLMs on code-related reasoning trajectories, facilitating their reduction of excess thinking tokens while maintaining performance. First, we create Z1-Code-Reasoning-107K, a curated dataset of simple and complex coding problems paired with their short and long solution trajectories. Second, we present a novel Shifted Thinking Window to mitigate overthinking overhead by removing context-delimiting tags (e.g., . . . ) and capping reasoning tokens. Trained with long and short trajectory data and equipped with Shifted Thinking Window, our model, Z1-7B, demonstrates the ability to adjust its reasoning level according to the complexity of problems and exhibits efficient test-time scaling across different reasoning tasks that matches R1-Distill-Qwen-7B performance with about 30% of its average thinking tokens. Notably, fine-tuned with only code trajectories, Z1-7B demonstrates generalization to broader reasoning tasks (47.5% on GPQA Diamond). Our analysis of efficient reasoning elicitation also provides valuable insights for future research.

Archival Faces: Detection of Faces in Digitized Historical Documents

Marek Vaško,Adam Herout,Michal Hradiš

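Task: Introduce a new manually annotated, domain-specific dataset to improve face detection in digitized historical archives.

Motivation: Existing face detectors perform poorly on datasets of scanned historical documents, reaching only about 24% mAP, and need to be brought closer to the standard of face detection in the wild.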

Details

Method: Introduces a new dataset in the style of Wider Face, containing 2.2k images from historical newspapers of the 19th to 20th century with 11k new bounding-box annotations and associated facial landmarks. Result: Retraining existing detectors improves their face detection performance on historical documents; results for several fine-tuned detectors are reported. Conclusion: The new dataset improves face detection in historical documents and brings it closer to the standard of in-the-wild detection. Abstract: When digitizing historical archives, it is necessary to search for the faces of celebrities and ordinary people, especially in newspapers, link them to the surrounding text, and make them searchable. Existing face detectors on datasets of scanned historical documents fail remarkably -- current detection tools only achieve around 24% mAP at 50:90% IoU. This work compensates for this failure by introducing a new manually annotated domain-specific dataset in the style of the popular Wider Face dataset, containing 2.2k new images from digitized historical newspapers from the 19th to 20th century, with 11k new bounding-box annotations and associated facial landmarks. This dataset allows existing detectors to be retrained to bring their results closer to the standard in the field of face detection in the wild. We report several experimental results comparing different families of fine-tuned detectors against publicly available pre-trained face detectors and ablation studies of multiple detector sizes with comprehensive detection and landmark prediction performance results.

ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations

Yubo Wang,Xueguang Ma,Ping Nie,Huaye Zeng,Zhiheng Lyu,Yuxuan Zhang,Benjamin Schneider,Yi Lu,Xiang Yue,Wenhu Chen

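Task: Propose ScholarCopilot, a framework that improves the citation accuracy of large language models when generating professional academic articles.

Motivation: Existing retrieval-augmented generation (RAG) systems still fall short in supporting professional academic writing.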

Details

Method: Dynamically generate a retrieval token [RET] and retrieve relevant literature, jointly optimizing the citation and generation tasks. Result: On the arXiv-based evaluation dataset, ScholarCopilot reaches 40.1% top-1 retrieval accuracy and a generation quality score of 16.2/25, outperforming the baselines. Conclusion: ScholarCopilot performs well in citation recall, writing efficiency, and user experience, which supports its effectiveness. Abstract: Academic writing requires both coherent text generation and precise citation of relevant literature. Although recent Retrieval-Augmented Generation (RAG) systems have significantly improved factual accuracy in general-purpose text generation, their capacity to adequately support professional academic writing remains limited. In this work, we introduce ScholarCopilot, a unified framework designed to enhance existing large language models for generating professional academic articles with accurate and contextually relevant citations. ScholarCopilot dynamically determines when to retrieve scholarly references by generating a retrieval token [RET], and then utilizes its representation to look up relevant citations from a database. The retrieved references are fed into the model to augment the generation process. We jointly optimize both the generation and citation tasks within a single framework to increase efficiency. Trained on 500K papers from arXiv, our model achieves a top-1 retrieval accuracy of 40.1% on our evaluation dataset, outperforming baselines such as E5-Mistral-7B-Instruct (15.0%) and BM25 (9.8%). On a dataset of 1,000 academic writing samples, ScholarCopilot scores 16.2/25 in generation quality (measured across relevance, coherence, academic rigor, completeness, and innovation), surpassing models with 10x more parameters such as Qwen-2.5-72B-Instruct (15.8/25). Human studies also show ScholarCopilot's superior performance in citation recall, writing efficiency, and overall user experience, confirming the effectiveness of our approach.
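The retrieval-token idea can be illustrated with a small sketch. It assumes, hypothetically, that each generated token comes with a hidden-state vector and that citations are stored as embeddings scored by dot product. The function names, the `[cite:...]` output format, and the scoring rule are illustrative choices, not the paper's implementation.

```python
RET_TOKEN = "[RET]"

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def retrieve_top1(query_vec, citation_index):
    """Return the id of the citation whose embedding scores highest against query_vec."""
    return max(citation_index, key=lambda cid: dot(query_vec, citation_index[cid]))

def resolve_citations(tokens, hidden_states, citation_index):
    """Replace every [RET] token with a marker for its best-matching citation.

    tokens: generated tokens; hidden_states: one vector per token (assumed available);
    citation_index: dict mapping citation id -> embedding vector.
    """
    out = []
    for tok, h in zip(tokens, hidden_states):
        if tok == RET_TOKEN:
            out.append(f"[cite:{retrieve_top1(h, citation_index)}]")
        else:
            out.append(tok)
    return out
```

In the described system the retrieved reference is also fed back into the model to condition further generation; the sketch stops at the lookup step.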

AttentiveGRU: Recurrent Spatio-Temporal Modeling for Advanced Radar-Based BEV Object Detection

Loveneet Saini,Mirko Meuter,Hasan Tercan,Tobias Meisen

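Task: Propose an attention-based recurrent approach (AttentiveGRU) to address the limits that sparse, non-deterministic radar data place on bird's-eye view (BEV) object detection.

Motivation: The sparsity and non-determinism of radar data limit traditional single-frame BEV methods, so an approach that dynamically extracts spatio-temporal context is needed.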

Details

Method: Introduce AttentiveGRU, which extracts individualized spatio-temporal context for objects by dynamically identifying and fusing temporally correlated structures across present and memory states. Result: On nuScenes, mAP for the car category rises by 21% over the best radar-only submission, with notable detection improvements on an additional dataset. Conclusion: By exploiting temporal relations, AttentiveGRU enriches feature representations and improves detection without externally provided or estimated ego-vehicle motion. Abstract: Bird's-eye view (BEV) object detection has become important for advanced automotive 3D radar-based perception systems. However, the inherently sparse and non-deterministic nature of radar data limits the effectiveness of traditional single-frame BEV paradigms. In this paper, we address this limitation by introducing AttentiveGRU, a novel attention-based recurrent approach tailored for radar constraints, which extracts individualized spatio-temporal context for objects by dynamically identifying and fusing temporally correlated structures across present and memory states. By leveraging the consistency of an object's latent representation over time, our approach exploits temporal relations to enrich feature representations for both stationary and moving objects, thereby enhancing detection performance and eliminating the need for externally providing or estimating any information about ego vehicle motion. Our experimental results on the public nuScenes dataset show a significant increase in mAP for the car category by 21% over the best radar-only submission. Further evaluations on an additional dataset demonstrate notable improvements in object detection capabilities, underscoring the applicability and effectiveness of our method.

How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study

Yunjie Ji,Sitong Zhao,Xiaoyu Tian,Haotian Wang,Shuaiting Chen,Yiping Peng,Han Zhao,Xiangang Li

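Task: Improve the reasoning capabilities of large language models (LLMs) through a difficulty-aware staged reinforcement learning strategy.

Motivation: Improving LLM reasoning efficiently and scalably remains a fundamental challenge in artificial intelligence research.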

Details

Method: Apply difficulty-aware staged reinforcement learning, selecting training data by difficulty level and progressively increasing task difficulty. Result: A 1.5B-parameter model reaches 42.3% accuracy on AIME-2024 and 89.5% on MATH-500. Conclusion: The method substantially improves LLM reasoning and shows the potential of cross-domain training. Abstract: Enhancing the reasoning capabilities of Large Language Models (LLMs) with efficiency and scalability remains a fundamental challenge in artificial intelligence research. This paper presents a rigorous experimental investigation into how difficulty-aware staged reinforcement learning (RL) strategies can substantially improve LLM reasoning performance. Through systematic analysis, we demonstrate that strategically selecting training data according to well-defined difficulty levels markedly enhances RL optimization. Moreover, we introduce a staged training methodology, progressively exposing models to increasingly challenging tasks, further amplifying reasoning capabilities. Our findings reveal significant cross-domain benefits when simultaneously training models on mathematical reasoning and code generation tasks. Notably, our proposed approach enables a 1.5B parameter model to achieve an accuracy of 42.3% on the AIME-2024 benchmark and 89.5% on the MATH-500 benchmark. These results underscore the efficacy of our method in advancing the reasoning proficiency of LLMs. We will open-source our datasets on GitHub and Hugging Face.
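One way to picture difficulty-aware staging is to bucket problems by how often a reference model already solves them and then train on the buckets from easy to hard. The sketch below makes that assumption explicit: measuring difficulty by pass rate, the 0.3 and 0.7 boundaries, and the three-level split are all hypothetical and are not taken from the paper.

```python
def difficulty_level(pass_rate, boundaries=(0.3, 0.7)):
    """Map a reference-model pass rate in [0, 1] to a difficulty label (assumed scheme)."""
    low, high = boundaries
    if pass_rate >= high:
        return "easy"
    if pass_rate >= low:
        return "medium"
    return "hard"

def build_stages(problems, boundaries=(0.3, 0.7)):
    """Group (problem_id, pass_rate) pairs into training stages ordered easy -> hard."""
    stages = {"easy": [], "medium": [], "hard": []}
    for problem_id, pass_rate in problems:
        stages[difficulty_level(pass_rate, boundaries)].append(problem_id)
    return [stages["easy"], stages["medium"], stages["hard"]]
```

Each returned list would then feed one RL stage in sequence, so the policy meets harder problems only after training on easier ones.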

Yan Xia,Hai Huang,Minghui Fang,Zhou Zhao

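Task: Learn a shared discrete representation space that enables knowledge transfer across modalities.

Motivation: Because collecting extensive paired data for all modality pairs is impractical, the work explores a continual learning approach that incrementally maps new modalities into a shared discrete codebook.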

Details

Method: Propose the Continual Mixture of Experts Adapter (CMoE-Adapter) and a Pseudo-Modality Replay (PMR) mechanism with a dynamically expanding codebook to align semantics. Result: Strong performance on cross-modal generalization tasks spanning image-text, audio-text, video-text, and speech-text. Conclusion: The method achieves cross-modal generalization while preserving prior knowledge. Abstract: Cross-modal generalization aims to learn a shared discrete representation space from multimodal pairs, enabling knowledge transfer across unannotated modalities. However, achieving a unified representation for all modality pairs requires extensive paired data, which is often impractical. Inspired by the availability of abundant bimodal data (e.g., in ImageBind), we explore a continual learning approach that incrementally maps new modalities into a shared discrete codebook via a mediator modality. We propose the Continual Mixture of Experts Adapter (CMoE-Adapter) to project diverse modalities into a unified space while preserving prior knowledge. To align semantics across stages, we introduce a Pseudo-Modality Replay (PMR) mechanism with a dynamically expanding codebook, enabling the model to adaptively incorporate new modalities using learned ones as guidance. Extensive experiments on image-text, audio-text, video-text, and speech-text show that our method achieves strong performance on various cross-modal generalization tasks. Code is provided in the supplementary material.

Investigating the Capabilities and Limitations of Machine Learning for Identifying Bias in English Language Data with Information and Heritage Professionals

Lucy Havens,Benjamin Bach,Melissa Terras,Beatrice Alex

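Task: Study how models can identify biased language rather than remove bias.

Motivation: Existing ML approaches assume bias can be removed, but the study shows this is not always possible or desirable.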

Details

Method: Create models to identify biased language and evaluate them for a specific use case through a workshop. Result: ML has limitations in identifying bias, and approaches to mitigating bias can simultaneously privilege and oppress different communities. Conclusion: ML approaches to bias and fairness need to be expanded, with a mixed-methods approach for investigating the feasibility of removing bias or achieving fairness. Abstract: Despite numerous efforts to mitigate their biases, ML systems continue to harm already-marginalized people. While predominant ML approaches assume bias can be removed and fair models can be created, we show that these are not always possible, nor desirable, goals. We reframe the problem of ML bias by creating models to identify biased language, drawing attention to a dataset's biases rather than trying to remove them. Then, through a workshop, we evaluated the models for a specific use case: workflows of information and heritage professionals. Our findings demonstrate the limitations of ML for identifying bias due to its contextual nature, the way in which approaches to mitigating it can simultaneously privilege and oppress different communities, and its inevitability. We demonstrate the need to expand ML approaches to bias and fairness, providing a mixed-methods approach to investigating the feasibility of removing bias or achieving fairness in a given ML use case.

Sample-level Adaptive Knowledge Distillation for Action Recognition

Ping Li,Chenhao Ping,Wenxiao Wang,Mingli Song

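Task: Propose a Sample-level Adaptive Knowledge Distillation (SAKD) framework that addresses difficult-to-transfer samples in knowledge distillation for video analysis.

Motivation: Traditional methods neglect the capacity gap between teacher and student networks and the way a sample's transfer difficulty changes during training, so some knowledge is transferred incorrectly or harms student performance.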

Details

Method: Use a sample distillation difficulty evaluation module and a sample adaptive distillation module to adjust the distillation ratio per sample, and train only on samples with low difficulty and high diversity. Result: Experiments on two video benchmarks and one image benchmark show a good balance between performance and efficiency. Conclusion: SAKD addresses the sample-transfer problem in knowledge distillation, improving student performance while reducing computational cost. Abstract: Knowledge Distillation (KD) compresses neural networks by learning a small network (student) via transferring knowledge from a pre-trained large network (teacher). Many endeavours have been devoted to the image domain, while few works focus on video analysis, which requires training much larger models that are hard to deploy on resource-limited devices. However, traditional methods neglect two important problems, i.e., 1) Since the capacity gap between the teacher and the student exists, some knowledge w.r.t. difficult-to-transfer samples cannot be correctly transferred, or even badly affects the final performance of student, and 2) As training progresses, difficult-to-transfer samples may become easier to learn, and vice versa. To alleviate the two problems, we propose a Sample-level Adaptive Knowledge Distillation (SAKD) framework for action recognition. In particular, it mainly consists of the sample distillation difficulty evaluation module and the sample adaptive distillation module. The former applies the temporal interruption to frames, i.e., randomly dropping or shuffling the frames during training, which increases the learning difficulty of samples during distillation, so as to better discriminate their distillation difficulty. The latter module adaptively adjusts distillation ratio at sample level, such that KD loss dominates the training with easy-to-transfer samples while vanilla loss dominates that with difficult-to-transfer samples. More importantly, we only select those samples with both low distillation difficulty and high diversity to train the student model for reducing computational cost. Experimental results on two video benchmarks and one image benchmark demonstrate the superiority of the proposed method by striking a good balance between performance and efficiency.
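The per-sample distillation ratio can be sketched as a convex mix of the KD loss and the vanilla loss. The mapping from difficulty to ratio used here (`alpha = 1 - difficulty`, with difficulty clipped to [0, 1]) is an assumption for illustration; the abstract states only that KD loss should dominate for easy-to-transfer samples and vanilla loss for difficult ones.

```python
def sample_adaptive_loss(kd_losses, vanilla_losses, difficulties):
    """Average of per-sample losses alpha_i * KD_i + (1 - alpha_i) * vanilla_i.

    difficulties: per-sample distillation difficulty, assumed to lie in [0, 1];
    alpha_i = 1 - difficulty_i is a hypothetical choice, not the paper's formula.
    """
    total = 0.0
    for kd, vanilla, d in zip(kd_losses, vanilla_losses, difficulties):
        d = min(max(d, 0.0), 1.0)      # clip to the assumed range
        alpha = 1.0 - d                # easy sample -> KD loss dominates
        total += alpha * kd + (1.0 - alpha) * vanilla
    return total / len(kd_losses)
```

With difficulty 0 a sample contributes only its KD loss, and with difficulty 1 only its vanilla loss, which matches the qualitative behaviour described in the abstract.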

m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models

Xiaoke Huang,Juncheng Wu,Hui Liu,Xianfeng Tang,Yuyin Zhou

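Task: Investigate the effectiveness of test-time scaling for medical reasoning and propose m1, a simple yet effective approach.

Motivation: Knowledge representation and decision-making in medicine differ fundamentally from mathematical tasks, so the effect of test-time scaling on medical reasoning is unclear.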

Details

Method: Propose m1, which strengthens a model's medical reasoning through test-time scaling, and evaluate it across diverse medical tasks. Result: Test-time scaling consistently improves medical reasoning; fine-tuned models under 10B parameters set a new state of the art and the 32B model rivals previous 70B-scale medical LLMs. However, there is an optimal reasoning token budget of about 4K, beyond which performance may degrade, and budget forcing can introduce errors. Conclusion: Insufficient medical knowledge is the key bottleneck; increasing data scale, data quality, and model capacity yields continued gains. Medical and mathematical reasoning differ fundamentally, so enriched medical knowledge is needed rather than greater reasoning depth alone. Abstract: Test-time scaling has emerged as a powerful technique for enhancing the reasoning capabilities of large language models. However, its effectiveness in medical reasoning remains uncertain, as the medical domain fundamentally differs from mathematical tasks in terms of knowledge representation and decision-making processes. In this paper, we provide the first comprehensive investigation of test-time scaling for medical reasoning and present m1, a simple yet effective approach that increases a model's medical reasoning capability at inference. Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning, enabling lightweight fine-tuned models under 10B parameters to establish new state-of-the-art performance, while our 32B model rivals previous 70B-scale medical LLMs. However, we identify an optimal reasoning token budget of approximately 4K, beyond which performance may degrade due to overthinking. Budget forcing, which extends test-time computation through iterative prompts, helps models double-check answers but does not necessarily improve the overall medical QA performance and, in some cases, even introduces errors into previously correct responses. Our case-by-case analysis identifies insufficient medical knowledge as a key bottleneck that prevents further performance gains through test-time scaling. We find that increasing data scale, improving data quality, and expanding model capacity consistently enhance medical knowledge grounding, enabling continued performance improvements, particularly on challenging medical benchmarks where smaller models reach saturation. These findings underscore fundamental differences between medical and mathematical reasoning in LLMs, highlighting that enriched medical knowledge, rather than increased reasoning depth alone, is essential for realizing the benefits of test-time scaling.

Bi-Grid Reconstruction for Image Anomaly Detection

Huichuan Huang,Zhiqing Zhong,Guangyu Wei,Yonghao Wan,Wenlong Sun,Aimin Feng

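Task: Propose GRAD, a bi-grid reconstruction method that improves image anomaly detection, particularly for fine-grained anomalies.

Motivation: Un- and self-supervised methods have advanced considerably on datasets containing only normal samples, but they struggle with fine-grained anomalies.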

Details

Method: GRAD uses two continuous grids (a normal feature grid and an abnormal feature grid) and introduces a Feature Block Paste (FBP) module that synthesizes anomalies at the feature level for quick abnormal grid deployment. Result: Evaluations on MVTecAD, VisA, and GoodsAD show significant improvements in fine-grained anomaly detection, with better overall accuracy and discernment of subtle differences than existing methods. Conclusion: With the bi-grid design and the FBP module, GRAD addresses the challenge of fine-grained anomaly detection and handles multiple classes robustly. Abstract: In image anomaly detection, significant advancements have been made using un- and self-supervised methods with datasets containing only normal samples. However, these approaches often struggle with fine-grained anomalies. This paper introduces GRAD: Bi-Grid Reconstruction for Image Anomaly Detection, which employs two continuous grids to enhance anomaly detection from both normal and abnormal perspectives. This work contributes: 1) grids as feature repositories that improve generalization and mitigate the Identical Shortcut (IS) issue; 2) an abnormal feature grid that refines normal feature boundaries, boosting detection of fine-grained defects; 3) the Feature Block Paste (FBP) module, which synthesizes various anomalies at the feature level for quick abnormal grid deployment. GRAD's robust representation capabilities also allow it to handle multiple classes with a single model. Evaluations on datasets like MVTecAD, VisA, and GoodsAD show significant performance improvements in fine-grained anomaly detection. GRAD excels in overall accuracy and in discerning subtle differences, demonstrating its superiority over existing methods.

GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning

Jian Zhao,Runze Liu,Kaiyan Zhang,Zhimu Zhou,Junqi Gao,Dong Li,Jiafei Lyu,Zhouyi Qian,Biqing Qi,Xiu Li,Bowen Zhou

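Task: Improve the performance of large language models (LLMs) using a generative process reward model (GenPRM).

Motivation: Current process reward models (PRMs) have limited process supervision and generalization, depend on scalar value prediction without leveraging the generative abilities of LLMs, and cannot scale test-time compute.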

Details

Method: Propose GenPRM, which performs explicit Chain-of-Thought (CoT) reasoning with code verification before judging each step, and introduce Relative Progress Estimation (RPE) and a rationale synthesis framework with code verification to obtain high-quality process supervision labels and rationale data. Result: On ProcessBench and several mathematical reasoning tasks, GenPRM significantly outperforms prior PRMs with only 23K training examples; with test-time scaling, a 1.5B GenPRM outperforms GPT-4o and a 7B GenPRM surpasses Qwen2.5-Math-PRM-72B. Conclusion: GenPRM establishes a new paradigm for process supervision, bridging the gap between PRMs and critic models in LLMs. Abstract: Recent advancements in Large Language Models (LLMs) have shown that it is promising to utilize Process Reward Models (PRMs) as verifiers to enhance the performance of LLMs. However, current PRMs face three key challenges: (1) limited process supervision and generalization capabilities, (2) dependence on scalar value prediction without leveraging the generative abilities of LLMs, and (3) inability to scale the test-time compute of PRMs. In this work, we introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification before providing judgment for each reasoning step. To obtain high-quality process supervision labels and rationale data, we propose Relative Progress Estimation (RPE) and a rationale synthesis framework that incorporates code verification. Experimental results on ProcessBench and several mathematical reasoning tasks show that GenPRM significantly outperforms prior PRMs with only 23K training data from MATH dataset. Through test-time scaling, a 1.5B GenPRM outperforms GPT-4o, and a 7B GenPRM surpasses Qwen2.5-Math-PRM-72B on ProcessBench. Additionally, GenPRM demonstrates strong abilities to serve as a critic model for policy model refinement. This work establishes a new paradigm for process supervision that bridges the gap between PRMs and critic models in LLMs. Our code, model, and data will be available at https://ryanliu112.github.io/GenPRM.

Coca-Splat: Collaborative Optimization for Camera Parameters and 3D Gaussians

Jiamin Wu,Hongyang Li,Xiaoke Jiang,Yuan Yao,Lei Zhang

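Task: Propose Coca-Splat, which jointly optimizes 3D Gaussians and camera parameters to address sparse-view pose-free scene reconstruction and novel view synthesis (NVS).

Motivation: Rendering views that closely approximate ground-truth images depends on accurate estimation of both 3D Gaussians and camera parameters, so a method that jointly optimizes the two is needed.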

Details

Method: Design separate queries for 3D Gaussians and camera parameters and update them layer by layer through deformable Transformer layers, enabling joint optimization in a single network. Camera-aware multi-view deformable cross-attention (CaMDFA) and 2D reference point determined rays (RayRef) strengthen the relationship between 3D Gaussians and camera parameters. Result: On RealEstate10K and ACID, the method outperforms previous methods in the pose-free setting. Conclusion: By jointly optimizing 3D Gaussians and camera parameters, Coca-Splat improves pose-free scene reconstruction and novel view synthesis. Abstract: In this work, we introduce Coca-Splat, a novel approach to addressing the challenges of sparse view pose-free scene reconstruction and novel view synthesis (NVS) by jointly optimizing camera parameters with 3D Gaussians. Inspired by deformable DEtection TRansformer, we design separate queries for 3D Gaussians and camera parameters and update them layer by layer through deformable Transformer layers, enabling joint optimization in a single network. This design demonstrates better performance because accurately rendering views that closely approximate ground-truth images relies on precise estimation of both 3D Gaussians and camera parameters. In such a design, the centers of 3D Gaussians are projected onto each view by camera parameters to get projected points, which are regarded as 2D reference points in deformable cross-attention. With camera-aware multi-view deformable cross-attention (CaMDFA), 3D Gaussians and camera parameters are intrinsically connected by sharing the 2D reference points. Additionally, 2D reference point determined rays (RayRef) defined from camera centers to the reference points assist in modeling the relationship between 3D Gaussians and camera parameters through RQ-decomposition on an overdetermined system of equations derived from the rays. Extensive evaluation shows that our approach outperforms previous methods, both pose-required and pose-free, on RealEstate10K and ACID within the same pose-free setting.

On the Robustness of Agentic Function Calling

Ella Rabinovich,Ateret Anaby-Tavor

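Task: Evaluate the robustness of large language models (LLMs) in function calling (FC), specifically their stability under naturalistic query variations and toolkit expansion.

Motivation: Prior work focuses on FC accuracy and pays little attention to how input perturbations affect agent robustness; this work fills that gap.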

Details

Method: Introduce a benchmark that evaluates FC models under naturalistic query variations and toolkit expansion, tested on an expanded subset of the Berkeley function calling leaderboard (BFCL). Result: Critical weaknesses are identified in existing evaluation methodologies, along with areas for improvement in real-world agentic deployments. Conclusion: The work highlights the importance of FC robustness and gives directions for future research. Abstract: Large Language Models (LLMs) are increasingly acting as autonomous agents, with function calling (FC) capabilities enabling them to invoke specific tools for tasks. While prior research has primarily focused on improving FC accuracy, little attention has been given to the robustness of these agents to perturbations in their input. We introduce a benchmark assessing FC robustness in two key areas: resilience to naturalistic query variations, and stability in function calling when the toolkit expands with semantically related tools. Evaluating best-performing FC models on a carefully expanded subset of the Berkeley function calling leaderboard (BFCL), we identify critical weaknesses in existing evaluation methodologies, and highlight areas for improvement in real-world agentic deployments.

POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation

Lanyun Zhu,Tianrun Chen,Qianxiong Xu,Xuanyi Liu,Deyi Ji,Haiyang Wu,De Wen Soh,Jun Liu

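Task: Propose the POPEN framework to address imprecise segmentation results and hallucinated text responses in existing LVLM-based reasoning segmentation methods.

Motivation: Existing methods fall short in both segmentation results and text responses; POPEN aims to improve them through preference-based optimization and ensembling.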

Details

Method: Fine-tune the LVLM with preference-based optimization, use a preference-based ensemble at inference, and add task-specific designs such as a curriculum learning mechanism and a preference optimization loss. Result: POPEN achieves state-of-the-art performance in reasoning segmentation, with minimal hallucination in text responses and the highest segmentation accuracy. Conclusion: Through preference-based optimization and task-specific designs, POPEN substantially improves LVLM performance on reasoning segmentation. Abstract: Existing LVLM-based reasoning segmentation methods often suffer from imprecise segmentation results and hallucinations in their text responses. This paper introduces POPEN, a novel framework designed to address these issues and achieve improved results. POPEN includes a preference-based optimization method to finetune the LVLM, aligning it more closely with human preferences and thereby generating better text responses and segmentation results. Additionally, POPEN introduces a preference-based ensemble method for inference, which integrates multiple outputs from the LVLM using a preference-score-based attention mechanism for refinement. To better adapt to the segmentation task, we incorporate several task-specific designs in our POPEN framework, including a new approach for collecting segmentation preference data with a curriculum learning mechanism, and a novel preference optimization loss to refine the segmentation capability of the LVLM. Experiments demonstrate that our method achieves state-of-the-art performance in reasoning segmentation, exhibiting minimal hallucination in text responses and the highest segmentation accuracy compared to previous advanced methods like LISA and PixelLM. Project page is https://lanyunzhu.site/POPEN/

Multi-Token Attention

Olga Golovneva,Tianlu Wang,Jason Weston,Sainbayar Sukhbaatar

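Task: Propose a new attention method, Multi-Token Attention (MTA), to address the information bottleneck that single-token attention faces when locating relevant parts of the context.

Motivation: Single-token attention determines each attention weight from the similarity of a single query and key token vector, which limits the information available for distinguishing relevant context.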

Details

Method: Apply convolution operations over queries, keys, and heads so that nearby queries and keys affect each other's attention weights, giving more precise attention. Result: MTA performs well on a range of popular benchmarks and outperforms Transformer baselines, notably on tasks that require searching for information in long contexts. Conclusion: By drawing on richer information, MTA makes attention more precise and improves language model performance on complex tasks. Abstract: Soft attention is a critical mechanism powering LLMs to locate relevant parts within a given context. However, individual attention weights are determined by the similarity of only a single query and key token vector. This "single token attention" bottlenecks the amount of information used in distinguishing a relevant part from the rest of the context. To address this issue, we propose a new attention method, Multi-Token Attention (MTA), which allows LLMs to condition their attention weights on multiple query and key vectors simultaneously. This is achieved by applying convolution operations over queries, keys and heads, allowing nearby queries and keys to affect each other's attention weights for more precise attention. As a result, our method can locate relevant context using richer, more nuanced information that can exceed a single vector's capacity. Through extensive evaluations, we demonstrate that MTA achieves enhanced performance on a range of popular benchmarks. Notably, it outperforms Transformer baseline models on standard language modeling tasks, and on tasks that require searching for information within long contexts, where our method's ability to leverage richer information proves particularly beneficial.
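A rough, single-head illustration of letting nearby queries and keys influence one attention weight: convolve the query-key logit map with a small kernel before the softmax. This is a sketch under assumptions, not the authors' implementation. Convolving the pre-softmax logits, the kernel layout (rows look back over earlier queries, columns are centred on the key and assumed odd in length), and the causal masking are illustrative choices, and the convolution over heads is omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def conv_attention(q, k, kernel):
    """Causal attention weights from logits convolved over the (query, key) plane.

    q, k: lists of n vectors. kernel[a][b] weights the logit at
    (query i - a, key j + b - half), where half = len(kernel[0]) // 2.
    Positions that are out of range or violate causality are skipped.
    """
    n = len(q)
    logits = [[sum(x * y for x, y in zip(q[i], k[j])) for j in range(n)] for i in range(n)]
    half = len(kernel[0]) // 2
    weights = []
    for i in range(n):
        row = []
        for j in range(i + 1):                      # causal: keys up to position i
            acc = 0.0
            for a, kernel_row in enumerate(kernel):
                for b, w in enumerate(kernel_row):
                    qi, kj = i - a, j + b - half
                    if qi >= 0 and 0 <= kj <= qi:   # stay in range and causal
                        acc += w * logits[qi][kj]
            row.append(acc)
        weights.append(softmax(row) + [0.0] * (n - i - 1))
    return weights
```

With the 1x1 kernel `[[1.0]]` the function reduces to ordinary causal softmax attention, which gives a convenient sanity check for any larger kernel.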

Xinnan Zhu,Yicheng Zhu,Tixin Chen,Wentao Wu,Yuanjie Dang

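Task: Propose a frequency-aware decoupling network to improve action detection in untrimmed videos.

Motivation: Existing methods overlook the noise and redundancy in pre-trained features, which introduce background clutter and irrelevant semantics and hurt detection accuracy.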

Details

Method: Use an adaptive temporal decoupling scheme to suppress irrelevant information while preserving fine-grained action details; enhance inter-frame modeling by capturing temporal variations; and introduce a long-short-term category-aware relation network that jointly models local transitions and long-range dependencies. Result: State-of-the-art performance on THUMOS14, HACS, and ActivityNet-1.3. Conclusion: The frequency-aware decoupling network filters out noisy semantics, improving action discriminability and localization precision. Abstract: Temporal action detection aims to locate and classify actions in untrimmed videos. While recent works focus on designing powerful feature processors for pre-trained representations, they often overlook the inherent noise and redundancy within these features. Large-scale pre-trained video encoders tend to introduce background clutter and irrelevant semantics, leading to context confusion and imprecise boundaries. To address this, we propose a frequency-aware decoupling network that improves action discriminability by filtering out noisy semantics captured by pre-trained models. Specifically, we introduce an adaptive temporal decoupling scheme that suppresses irrelevant information while preserving fine-grained atomic action details, yielding more task-specific representations. In addition, we enhance inter-frame modeling by capturing temporal variations to better distinguish actions from background redundancy. Furthermore, we present a long-short-term category-aware relation network that jointly models local transitions and long-range dependencies, improving localization precision. The refined atomic features and frequency-guided dynamics are fed into a standard detection head to produce accurate action predictions. Extensive experiments on THUMOS14, HACS, and ActivityNet-1.3 show that our method, powered by InternVideo2-6B features, achieves state-of-the-art performance on temporal action detection benchmarks.

Taxonomizing Representational Harms using Speech Act Theory

Emily Corvi,Hannah Washington,Stefanie Reed,Chad Atalla,Alexandra Chouldechova,P. Alex Dow,Jean Garcia-Gathright,Nicholas Pangakis,Emily Sheng,Dan Vann,Matthew Vogel,Hanna Wallach

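Task: Propose a framework grounded in speech act theory for conceptualizing representational harms caused by generative language systems.

Motivation: Definitions of representational harms are commonly under-specified; a more concrete theoretical framework is needed to understand and classify them.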

Details

Method: Grounded in speech act theory (Austin, 1962), treat representational harms as the real-world impacts (perlocutionary effects) of particular types of system behaviors (illocutionary acts), and draw on linguistic anthropology and sociolinguistics to provide new definitions and a taxonomy. Result: A granular taxonomy of illocutionary acts, with the utility of the framework and taxonomy demonstrated in a case study. Conclusion: The framework and taxonomy give a theoretical basis for studying and measuring representational harms and support the future development of measurement instruments. Abstract: Representational harms are widely recognized among fairness-related harms caused by generative language systems. However, their definitions are commonly under-specified. We present a framework, grounded in speech act theory (Austin, 1962), that conceptualizes representational harms caused by generative language systems as the perlocutionary effects (i.e., real-world impacts) of particular types of illocutionary acts (i.e., system behaviors). Building on this argument and drawing on relevant literature from linguistic anthropology and sociolinguistics, we provide new definitions of stereotyping, demeaning, and erasure. We then use our framework to develop a granular taxonomy of illocutionary acts that cause representational harms, going beyond the high-level taxonomies presented in previous work. We also discuss the ways that our framework and taxonomy can support the development of valid measurement instruments. Finally, we demonstrate the utility of our framework and taxonomy via a case study that engages with recent conceptual debates about what constitutes a representational harm and how such harms should be measured.

QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA

Shuai Li,Jian Xu,Xiao-Hui Li,Chao Deng,Lin-Lin Huang

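Task: Propose QG-VTC, a question-guided visual token compression method for visual question answering (VQA) in multi-modal large language models (MLLMs).

Motivation: Address the GPU memory and compute overhead caused by the growing number of tokens in visual processing, while discarding redundant visual information.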

Details

Method: Embed the question into the vision encoder's feature space with a pretrained text encoder and a learnable feed-forward layer, compute correlation scores between the question embeddings and visual tokens, select the most relevant tokens, and softly compress the rest. Result: The method matches uncompressed models using just 1/8 of the visual tokens. Conclusion: QG-VTC cuts computational overhead while retaining question-relevant information, offering an efficient solution for MLLM-based VQA. Abstract: Recent advances in Multi-modal Large Language Models (MLLMs) have shown significant progress in open-world Visual Question Answering (VQA). However, integrating visual information increases the number of processed tokens, leading to higher GPU memory usage and computational overhead. Images often contain more redundant information than text, and not all visual details are pertinent to specific questions. To address these challenges, we propose QG-VTC, a novel question-guided visual token compression method for MLLM-based VQA tasks. QG-VTC employs a pretrained text encoder and a learnable feed-forward layer to embed user questions into the vision encoder's feature space, then computes correlation scores between the question embeddings and visual tokens. By selecting the most relevant tokens and softly compressing others, QG-VTC ensures fine-tuned relevance to user needs. Additionally, a progressive strategy applies this compression across different vision encoder layers, gradually reducing token numbers. This approach maximizes retention of question-relevant information while discarding irrelevant details. Experimental results show that our method achieves performance on par with uncompressed models using just 1/8 of the visual tokens. The code and model will be publicly available on GitHub.
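The select-and-softly-compress step can be sketched as follows. The abstract does not specify the scoring function or how the remaining tokens are compressed, so this example assumes cosine similarity as the correlation score and merges the unselected tokens into one softmax-weighted average token. Both choices, and the function names, are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def compress_tokens(question_vec, visual_tokens, keep):
    """Keep the `keep` most question-relevant tokens and merge the rest into one token."""
    scores = [cosine(question_vec, t) for t in visual_tokens]
    order = sorted(range(len(visual_tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])                 # preserve original token order
    rest_idx = order[keep:]
    kept = [visual_tokens[i] for i in kept_idx]
    if not rest_idx:
        return kept
    # Assumed soft compression: softmax-weighted average of the unselected tokens.
    weights = [math.exp(scores[i]) for i in rest_idx]
    z = sum(weights)
    dim = len(visual_tokens[0])
    merged = [sum(w * visual_tokens[i][d] for w, i in zip(weights, rest_idx)) / z
              for d in range(dim)]
    return kept + [merged]
```

Applying such a step at several vision-encoder layers with a shrinking `keep` would mirror the progressive strategy the abstract mentions.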

Zifeng Wang,Junyi Gao,Benjamin Danek,Brandon Theodorou,Ruba Shaik,Shivashankar Thati,Seunghyun Won,Jimeng Sun

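Task: Use large language models (LLMs) to generate high-stakes informed consent forms (ICFs) that meet regulatory requirements and are factually accurate.

Motivation: ICFs demand extreme regulatory compliance and factual accuracy, which makes generating such high-stakes documents directly with LLMs challenging.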

Details

Method: Propose InformGen, an LLM-driven copilot that drafts accurate and compliant ICFs through optimized knowledge document parsing and content generation, with humans in the loop. Result: InformGen achieves near 100% compliance with 18 core regulatory rules, outperforming a vanilla GPT-4o model by up to 30%; with manual intervention, factual accuracy exceeds 90%, well above GPT-4o's 57%-82%. Conclusion: By providing inline citations to source protocols, InformGen ensures traceability while maintaining high standards of factual integrity. Abstract: Leveraging large language models (LLMs) to generate high-stakes documents, such as informed consent forms (ICFs), remains a significant challenge due to the extreme need for regulatory compliance and factual accuracy. Here, we present InformGen, an LLM-driven copilot for accurate and compliant ICF drafting by optimized knowledge document parsing and content generation, with humans in the loop. We further construct a benchmark dataset comprising protocols and ICFs from 900 clinical trials. Experimental results demonstrate that InformGen achieves near 100% compliance with 18 core regulatory rules derived from FDA guidelines, outperforming a vanilla GPT-4o model by up to 30%. Additionally, a user study with five annotators shows that InformGen, when integrated with manual intervention, attains over 90% factual accuracy, significantly surpassing the vanilla GPT-4o model's 57%-82%. Crucially, InformGen ensures traceability by providing inline citations to source protocols, enabling easy verification and maintaining the highest standards of factual integrity.

Monocular and Generalizable Gaussian Talking Head Animation

Shengjie Gong,Haojie Li,Jiapeng Tang,Dongming Hu,Shuangping Huang,Hao Chen,Tianshui Chen,Zhuoman Liu

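Task: Propose MGGTalk, a 3D Gaussian talking head animation method that is trained on monocular datasets and generalizes to unseen identities without personalized re-training.

Motivation: Existing 3D Gaussian Splatting methods need multi-view datasets or tedious personalized learning/inference; removing that requirement enables broader and more practical applications.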

Details

Method: Use depth information to enhance geometric and appearance features, apply symmetry operations and point cloud filtering to obtain complete and precise 3DGS position parameters, and predict the remaining Gaussian parameters with a two-stage strategy. Result: MGGTalk outperforms previous state-of-the-art methods across various metrics. Conclusion: With depth information and symmetric priors, MGGTalk addresses the incomplete geometric and appearance information available from monocular data and has broad application potential. Abstract: In this work, we introduce Monocular and Generalizable Gaussian Talking Head Animation (MGGTalk), which requires monocular datasets and generalizes to unseen identities without personalized re-training. Compared with previous 3D Gaussian Splatting (3DGS) methods that require elusive multi-view datasets or tedious personalized learning/inference, MGGTalk enables more practical and broader applications. However, in the absence of multi-view and personalized training data, the incompleteness of geometric and appearance information poses a significant challenge. To address these challenges, MGGTalk explores depth information to enhance geometric and facial symmetry characteristics to supplement both geometric and appearance features. Initially, based on the pixel-wise geometric information obtained from depth estimation, we incorporate symmetry operations and point cloud filtering techniques to ensure a complete and precise position parameter for 3DGS. Subsequently, we adopt a two-stage strategy with symmetric priors for predicting the remaining 3DGS parameters. We begin by predicting Gaussian parameters for the visible facial regions of the source image. These parameters are subsequently utilized to improve the prediction of Gaussian parameters for the non-visible regions. Extensive experiments demonstrate that MGGTalk surpasses previous state-of-the-art methods, achieving superior performance across various metrics.

Experiential Semantic Information and Brain Alignment: Are Multimodal Models Better than Language Models?

Anna Bavaresco,Raquel Fernández

Task: Compares language-only and multimodal models on how well they capture experiential information and align with human fMRI responses.

Motivation: Tests whether multimodal models are closer to human language processing than language-only models, filling a gap in empirical research.

Details

Method: Compares word representations from language-only and contrastive multimodal models, evaluating how well they align with an experiential model and with human fMRI responses. Result: Language-only models outperform multimodal ones both in capturing experiential information and in aligning with fMRI responses, and they learn more unique brain-relevant semantic information. Conclusion: Computational models are needed that better integrate the complementary semantic information provided by multimodal data sources. Abstract: A common assumption in Computational Linguistics is that text representations learnt by multimodal models are richer and more human-like than those by language-only models, as they are grounded in images or audio -- similar to how human language is grounded in real-world experiences. However, empirical studies checking whether this is true are largely lacking. We address this gap by comparing word representations from contrastive multimodal models vs. language-only ones in the extent to which they capture experiential information -- as defined by an existing norm-based 'experiential model' -- and align with human fMRI responses. Our results indicate that, surprisingly, language-only models are superior to multimodal ones in both respects. Additionally, they learn more unique brain-relevant semantic information beyond that shared with the experiential model. Overall, our study highlights the need to develop computational models that better integrate the complementary semantic information provided by multimodal data sources.

ToVE: Efficient Vision-Language Learning via Knowledge Transfer from Vision Experts

Yuanchen Wu,Junlong Du,Ke Yan,Shouhong Ding,Xiaoqiang Li

Task: Proposes ToVE, a framework for efficient vision-language learning that transfers knowledge from pre-trained vision expert models.

Motivation: Conventional vision-language learning trains huge models on massive datasets, which is inefficient; ToVE offers a more efficient alternative by using pre-trained vision experts to strengthen visual perception.

Details

Method: Building on a frozen CLIP encoder, ToVE introduces a hub of vision experts and a token-aware gating network that dynamically routes expert knowledge to vision tokens, and proposes a "residual knowledge transfer" strategy to preserve the tokens' generalizability. Result: ToVE performs competitively across a range of vision-language tasks while using two orders of magnitude less training data. Conclusion: Efficient knowledge transfer and dynamic routing let ToVE improve both the efficiency and the performance of vision-language learning. Abstract: Vision-language (VL) learning requires extensive visual perception capabilities, such as fine-grained object recognition and spatial perception. Recent works typically rely on training huge models on massive datasets to develop these capabilities. As a more efficient alternative, this paper proposes a new framework that Transfers the knowledge from a hub of Vision Experts (ToVE) for efficient VL learning, leveraging pre-trained vision expert models to promote visual perception capability. Specifically, building on a frozen CLIP encoder that provides vision tokens for image-conditioned language generation, ToVE introduces a hub of multiple vision experts and a token-aware gating network that dynamically routes expert knowledge to vision tokens. In the transfer phase, we propose a "residual knowledge transfer" strategy, which not only preserves the generalizability of the vision tokens but also allows detachment of low-contributing experts to improve inference efficiency. Further, we explore merging this expert knowledge into a single CLIP encoder, creating a knowledge-merged CLIP that produces more informative vision tokens without expert inference during deployment. Experimental results across various VL tasks demonstrate that the proposed ToVE achieves competitive performance with two orders of magnitude less training data.
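The token-aware gating and residual transfer described above can be illustrated with a minimal sketch. It assumes a single linear gate per expert followed by a softmax, and adds the gated expert mixture back onto the original vision token; the function names, the linear gate, and the plain-list vectors are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_experts(token, expert_feats, gate_weights):
    """Fuse expert features into one vision token (hypothetical sketch).

    token:        vision token from the frozen encoder, length d
    expert_feats: one feature vector of length d per expert
    gate_weights: one weight vector of length d per expert (assumed linear gate)
    """
    # Token-aware gate: each expert's logit depends on the token itself.
    logits = [sum(w * t for w, t in zip(weights, token)) for weights in gate_weights]
    gates = softmax(logits)
    # Gated mixture of expert knowledge for this token.
    mixed = [
        sum(g * feat[j] for g, feat in zip(gates, expert_feats))
        for j in range(len(token))
    ]
    # Residual transfer: the original token is kept and expert knowledge is added.
    fused = [t + m for t, m in zip(token, mixed)]
    return fused, gates
```

Because the expert mixture is added as a residual, an expert whose gate value stays near zero contributes almost nothing, which is consistent with the abstract's remark that low-contributing experts can be detached at inference time.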

SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching

Yuxuan Zhu,Ali Falahati,David H. Yang,Mohammad Mohammadi Amiri

Task: Proposes SentenceKV, a key-value caching method based on sentence-level semantic similarity that improves inference efficiency and memory management for large language models on long contexts.

Motivation: Traditional token-level efficient KV caching methods ignore semantic information, while existing semantic-preserving methods suffer from high memory usage and high time-to-first-token.

Details

Method: During prefilling, SentenceKV groups tokens by sentence-level semantic similarity, compresses sentence representations into concise semantic vectors kept on the GPU, and offloads individual KV pairs to the CPU; during decoding, it generates tokens by selectively retrieving semantically relevant sentence-level KV entries. Result: On PG-19, LongBench, and Needle-In-A-Haystack, SentenceKV significantly outperforms existing methods in efficiency and memory usage without compromising model accuracy. Conclusion: Managing the KV cache by sentence-level semantic similarity significantly improves long-context efficiency and memory utilization while keeping inference latency stable. Abstract: Large language models face significant computational and memory challenges when processing long contexts. During inference, efficient management of the key-value (KV) cache, which stores intermediate activations for autoregressive generation, is critical to reducing memory overhead and improving computational efficiency. Traditional token-level efficient KV caching methods overlook semantic information, treating tokens independently without considering their semantic relationships. Meanwhile, existing semantic-preserving KV cache management approaches often suffer from substantial memory usage and high time-to-first-token. To address these limitations, we propose SentenceKV, a novel sentence-level semantic KV caching approach designed to enhance inference efficiency while preserving semantic coherence. During prefilling, SentenceKV groups tokens based on sentence-level semantic similarity, compressing sentence representations into concise semantic vectors stored directly on the GPU, while individual KV pairs are offloaded to CPU. During decoding, SentenceKV generates tokens by selectively retrieving semantically relevant sentence-level KV entries, leveraging the semantic similarity between the prefilling-stage semantic vectors and decoding-stage queries. This ensures efficient and contextually accurate predictions, minimizing the loading of redundant or irrelevant data into GPU memory and significantly reducing memory overhead while maintaining stable inference latency, even for extremely long contexts. Extensive evaluations on benchmarks including PG-19, LongBench, and Needle-In-A-Haystack demonstrate that SentenceKV significantly outperforms state-of-the-art methods in both efficiency and memory usage, without compromising model accuracy.
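The prefill and decode steps described above can be sketched roughly as follows. This toy version assumes each sentence is summarized by the mean of its token key vectors and that relevance is measured by cosine similarity; both choices, and all function names, are illustrative assumptions rather than the paper's exact procedure.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_sentence_index(token_keys, sentence_ids):
    """Prefill: group token positions by sentence and keep one compact vector each.

    In the described system the compact vectors would stay on the GPU while the
    per-token KV pairs are offloaded to CPU; here both are plain Python objects.
    """
    groups = {}
    for position, sentence_id in enumerate(sentence_ids):
        groups.setdefault(sentence_id, []).append(position)
    return {
        sentence_id: (mean_vector([token_keys[p] for p in positions]), positions)
        for sentence_id, positions in groups.items()
    }


def select_token_positions(query, index, top_k):
    """Decode: pick the top_k sentences most similar to the query and return
    the token positions whose KV pairs should be loaded back."""
    ranked = sorted(index.values(), key=lambda entry: cosine(query, entry[0]), reverse=True)
    chosen = [p for _, positions in ranked[:top_k] for p in positions]
    return sorted(chosen)
```

Only the positions returned by `select_token_positions` need to be fetched for attention, which is where the memory saving in this sketch comes from.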

CAPE: Connectivity-Aware Path Enforcement Loss for Curvilinear Structure Delineation

Elyar Esmaeilzadeh,Ehsan Garaaghaji,Farzad Hallaji Azad,Doruk Oner

Task: Proposes CAPE, a new loss function that enforces the connectivity of curvilinear structures (such as neurons and blood vessels) in semantic segmentation.

Motivation: Traditional pixel-wise losses (such as cross-entropy and Dice) fail to capture high-level topological connectivity, which leads to topological errors in the prediction maps.

Details

Method: CAPE optimizes a graph connectivity metric: it uses the graph representation of the ground truth to select node pairs and finds their paths in the predicted segmentation with a shortest-path algorithm, penalizing both disconnections and false connections. Result: Experiments on 2D and 3D datasets (including neuron and blood vessel tracing) show that CAPE significantly improves topology-aware metrics and outperforms existing methods. Conclusion: CAPE is an effective loss function that significantly improves the topological correctness of segmentation results. Abstract: Promoting the connectivity of curvilinear structures, such as neuronal processes in biomedical scans and blood vessels in CT images, remains a key challenge in semantic segmentation. Traditional pixel-wise loss functions, including cross-entropy and Dice losses, often fail to capture high-level topological connectivity, resulting in topological mistakes in graphs obtained from prediction maps. In this paper, we propose CAPE (Connectivity-Aware Path Enforcement), a novel loss function designed to enforce connectivity in graphs obtained from segmentation maps by optimizing a graph connectivity metric. CAPE uses the graph representation of the ground truth to select node pairs and determine their corresponding paths within the predicted segmentation through a shortest-path algorithm. Using this, we penalize both disconnections and false positive connections, effectively promoting the model to preserve topological correctness. Experiments on 2D and 3D datasets, including neuron and blood vessel tracing, demonstrate that CAPE significantly improves topology-aware metrics and outperforms state-of-the-art methods.
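The shortest-path idea can be illustrated with a small sketch. It assumes a 2D map of predicted foreground probabilities, charges each pixel a cost of one minus its probability, and uses Dijkstra's algorithm with 4-connectivity to find the cheapest path between two nodes taken from the ground-truth graph. The cost definition and the function name are assumptions made for illustration; the sketch covers only the disconnection penalty, not the false-positive term mentioned in the abstract.

```python
import heapq


def path_cost(prob, start, goal):
    """Cheapest 4-connected path cost between two pixels of a probability map.

    prob:  2D list of predicted foreground probabilities in [0, 1]
    start: (row, col) of the first ground-truth node
    goal:  (row, col) of the second ground-truth node
    A gap in the prediction forces the path through low-probability pixels,
    so the returned cost grows with the size of the disconnection.
    """
    rows, cols = len(prob), len(prob[0])
    best = {start: 1.0 - prob[start[0]][start[1]]}
    heap = [(best[start], start)]
    while heap:
        cost, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            return cost
        if cost > best[(r, c)]:
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                new_cost = cost + (1.0 - prob[nr][nc])
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    heapq.heappush(heap, (new_cost, (nr, nc)))
    return float("inf")


connected = [[1.0, 1.0, 1.0]]
broken = [[1.0, 0.0, 1.0]]
print(path_cost(connected, (0, 0), (0, 2)))  # 0.0: the structure is intact
print(path_cost(broken, (0, 0), (0, 2)))     # 1.0: one missing pixel on the path
```

In a training setting the same quantity would have to be computed from differentiable predictions along the selected path; this sketch only shows why a shortest-path cost reacts to broken connections that pixel-wise losses barely notice.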

Chinese Grammatical Error Correction: A Survey

Mengyang Qiu,Qingyu Gao,Linxuan Yang,Yang Gu,Tran Minh Nguyen,Zihao Huang,Jungyeul Park

Task: Provides a comprehensive survey of Chinese Grammatical Error Correction (CGEC) research, covering datasets, annotation schemes, evaluation methods, and system advances.

Motivation: Addresses the growing demand for automated writing assistance in both second-language (L2) and native (L1) Chinese writing, especially in academic, professional, and formal contexts where precise writing is essential.

Details

Method: Reviews existing research on CGEC datasets, annotation frameworks, evaluation metrics, and system development, with a focus on the evolution from rule-based and statistical approaches to neural architectures such as Transformer-based models and large pre-trained language models. Result: Summarizes the current state of CGEC research, identifies challenges such as insufficient dataset standardization and word segmentation ambiguity, and proposes future directions such as refining annotation standards and using multilingual approaches to enhance CGEC. Conclusion: CGEC has made notable progress in both techniques and applications, but challenges in annotation and evaluation remain; standardization and multilingual approaches can advance the field further. Abstract: Chinese Grammatical Error Correction (CGEC) is a critical task in Natural Language Processing, addressing the growing demand for automated writing assistance in both second-language (L2) and native (L1) Chinese writing. While L2 learners struggle with mastering complex grammatical structures, L1 users also benefit from CGEC in academic, professional, and formal contexts where writing precision is essential. This survey provides a comprehensive review of CGEC research, covering datasets, annotation schemes, evaluation methodologies, and system advancements. We examine widely used CGEC datasets, highlighting their characteristics, limitations, and the need for improved standardization. We also analyze error annotation frameworks, discussing challenges such as word segmentation ambiguity and the classification of Chinese-specific error types. Furthermore, we review evaluation metrics, focusing on their adaptation from English GEC to Chinese, including character-level scoring and the use of multiple references. In terms of system development, we trace the evolution from rule-based and statistical approaches to neural architectures, including Transformer-based models and the integration of large pre-trained language models. By consolidating existing research and identifying key challenges, this survey provides insights into the current state of CGEC and outlines future directions, including refining annotation standards to address segmentation challenges, and leveraging multilingual approaches to enhance CGEC.

MSSFC-Net: Enhancing Building Interpretation with Multi-Scale Spatial-Spectral Feature Collaboration

Dehua Huo,Weida Zhan,Jinxin Guo,Depeng Zhu,Yu Chen,YiChun Jiang,Yueyi Han,Deng Han,Jin Li

Task: Proposes MSSFC-Net, a multi-scale spatial-spectral feature cooperative dual-task network for joint building extraction and change detection in remote sensing images.

Motivation: Existing methods usually handle building extraction and change detection independently, overlooking their inherent correlation and failing to exploit shared feature representations for mutual enhancement. In addition, the diverse spectral, spatial, and scale characteristics of buildings make it harder to jointly model spatial-spectral multi-scale features and to balance precision and recall.

Details

Method: Designs a Dual-branch Multi-scale Feature Extraction module (DMFE) combined with Spatial-Spectral Feature Collaboration (SSFC) to enhance multi-scale representation learning, and introduces a Multi-scale Differential Fusion Module (MDFM) that explicitly models the interaction between differential and dual-temporal features. Result: Experiments on three benchmark datasets show that MSSFC-Net achieves superior performance on both building extraction and change detection, significantly improving detection accuracy while maintaining completeness. Conclusion: By jointly modeling building extraction and change detection, MSSFC-Net effectively exploits their complementarity and improves multi-scale feature representation and change detection. Abstract: Building interpretation from remote sensing imagery primarily involves two fundamental tasks: building extraction and change detection. However, most existing methods address these tasks independently, overlooking their inherent correlation and failing to exploit shared feature representations for mutual enhancement. Furthermore, the diverse spectral, spatial, and scale characteristics of buildings pose additional challenges in jointly modeling spatial-spectral multi-scale features and effectively balancing precision and recall. The limited synergy between spatial and spectral representations often results in reduced detection accuracy and incomplete change localization. To address these challenges, we propose a Multi-Scale Spatial-Spectral Feature Cooperative Dual-Task Network (MSSFC-Net) for joint building extraction and change detection in remote sensing images. The framework integrates both tasks within a unified architecture, leveraging their complementary nature to simultaneously extract building and change features. Specifically, a Dual-branch Multi-scale Feature Extraction module (DMFE) with Spatial-Spectral Feature Collaboration (SSFC) is designed to enhance multi-scale representation learning, effectively capturing shallow texture details and deep semantic information, thus improving building extraction performance. For temporal feature aggregation, we introduce a Multi-scale Differential Fusion Module (MDFM) that explicitly models the interaction between differential and dual-temporal features. This module refines the network's capability to detect large-area changes and subtle structural variations in buildings. Extensive experiments conducted on three benchmark datasets demonstrate that MSSFC-Net achieves superior performance in both building extraction and change detection tasks, effectively improving detection accuracy while maintaining completeness.

MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs

Juncheng Wu,Wenlong Deng,Xingxuan Li,Sheng Liu,Taomian Mi,Yifan Peng,Ziyang Xu,Yi Liu,Hyunjin Cho,Chang-In Choi,Yihan Cao,Hui Ren,Xiang Li,Xiaoxiao Li,Yuyin Zhou

Task: Builds MedReason, a large-scale, high-quality medical reasoning dataset for improving the reliability and explainability of large language models in medical problem-solving.

Motivation: Medical reasoning demands precise, verifiable thought processes, yet there is a lack of datasets with transparent, step-by-step reasoning for validating and enhancing the medical reasoning ability of AI models.

Details

Method: Uses a structured medical knowledge graph (KG) to convert clinical QA pairs into logical chains of reasoning ("thinking paths") and validates each path for consistency with clinical logic and evidence-based medicine. Result: Produces 32,682 question-answer pairs with detailed step-by-step explanations; experiments show that fine-tuning on the dataset significantly improves medical problem-solving, with gains of up to 7.7%. Conclusion: The MedReason dataset significantly improves models' medical reasoning ability, and the resulting model outperforms a state-of-the-art medical reasoning model on a clinical benchmark. Abstract: Medical tasks such as diagnosis and treatment planning require precise and complex reasoning, particularly in life-critical domains. Unlike mathematical reasoning, medical reasoning demands meticulous, verifiable thought processes to ensure reliability and accuracy. However, there is a notable lack of datasets that provide transparent, step-by-step reasoning to validate and enhance the medical reasoning ability of AI models. To bridge this gap, we introduce MedReason, a large-scale high-quality medical reasoning dataset designed to enable faithful and explainable medical problem-solving in large language models (LLMs). We utilize a structured medical knowledge graph (KG) to convert clinical QA pairs into logical chains of reasoning, or "thinking paths", which trace connections from question elements to answers via relevant KG entities. Each path is validated for consistency with clinical logic and evidence-based medicine. Our pipeline generates detailed reasoning for various medical questions from 7 medical datasets, resulting in a dataset of 32,682 question-answer pairs, each with detailed, step-by-step explanations. Experiments demonstrate that fine-tuning with our dataset consistently boosts medical problem-solving capabilities, achieving significant gains of up to 7.7% for DeepSeek-Distill-8B. Our top-performing model, MedReason-8B, outperforms Huatuo-o1-8B, a state-of-the-art medical reasoning model, by up to 4.2% on the clinical benchmark MedBullets. We also engage medical professionals from diverse specialties to assess our dataset's quality, ensuring MedReason offers accurate and coherent medical reasoning. Our data, models, and code will be publicly available.

UnIRe: Unsupervised Instance Decomposition for Dynamic Urban Scene Reconstruction

Yunxuan Mao,Rong Xiong,Yue Wang,Yiyi Liao

Task: Proposes UnIRe, a 3D Gaussian Splatting (3DGS) based method that decomposes dynamic urban scenes into a static background and dynamic instances using only RGB images and LiDAR point clouds.

Motivation: Existing methods cannot perform instance-aware decomposition without manual annotations, yet such decomposition is essential for instance-level scene editing.

Details

Method: Introduces a 4D superpoint representation that enables unsupervised instance separation from spatiotemporal correlations, combined with a smoothness regularization strategy in both 2D and 3D space. Result: Outperforms existing methods on benchmark datasets and supports accurate, flexible instance-level editing. Conclusion: UnIRe is a practical solution for dynamic scene decomposition and editing in real-world applications. Abstract: Reconstructing and decomposing dynamic urban scenes is crucial for autonomous driving, urban planning, and scene editing. However, existing methods fail to perform instance-aware decomposition without manual annotations, which is crucial for instance-level scene editing. We propose UnIRe, a 3D Gaussian Splatting (3DGS) based approach that decomposes a scene into a static background and individual dynamic instances using only RGB images and LiDAR point clouds. At its core, we introduce 4D superpoints, a novel representation that clusters multi-frame LiDAR points in 4D space, enabling unsupervised instance separation based on spatiotemporal correlations. These 4D superpoints serve as the foundation for our decomposed 4D initialization, i.e., providing spatial and temporal initialization to train a dynamic 3DGS for arbitrary dynamic classes without requiring bounding boxes or object templates. Furthermore, we introduce a smoothness regularization strategy in both 2D and 3D space, further improving the temporal stability. Experiments on benchmark datasets show that our method outperforms existing methods in decomposed dynamic scene reconstruction while enabling accurate and flexible instance-level editing, making it a practical solution for real-world applications.

Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models

José Pombal,Nuno M. Guerreiro,Ricardo Rei,André F. T. Martins

Task: Proposes Zero-shot Benchmarking (ZSB), a framework that creates high-quality benchmarks for any task by using language models both to generate synthetic test data and to evaluate systems.

Motivation: As language models become more capable and tasks span more modalities, automatic evaluation becomes harder, and human-annotated test sets are expensive to create and saturate quickly.

Details

Method: ZSB requires only a prompt for data generation and a prompt for evaluation to create a benchmark for any task; it supports multilingual and multimodal tasks and is model-agnostic. Result: On benchmarks for five text-only tasks and one multimodal task, ZSB rankings correlate strongly with human rankings and outperform widely adopted standard benchmarks. Conclusion: ZSB is a simple, flexible, and scalable benchmarking framework that effectively uses language models to generate high-quality test data and perform evaluation. Abstract: As language models improve and become capable of performing more complex tasks across modalities, evaluating them automatically becomes increasingly challenging. Developing strong and robust task-specific automatic metrics gets harder, and human-annotated test sets -- which are expensive to create -- saturate more quickly. A compelling alternative is to design reliable strategies to automate the creation of test data and evaluation, but previous attempts either rely on pre-existing data, or focus solely on individual tasks. We present Zero-shot Benchmarking (ZSB), a framework for creating high-quality benchmarks for any task by leveraging language models for both synthetic test data creation and evaluation. ZSB is simple and flexible: it requires only the creation of a prompt for data generation and one for evaluation; it is scalable to tasks and languages where collecting real-world data is costly or impractical; it is model-agnostic, allowing the creation of increasingly challenging benchmarks as models improve. To assess the effectiveness of our framework, we create benchmarks for five text-only tasks and a multi-modal one: general capabilities in four languages (English, Chinese, French, and Korean), translation, and general vision-language capabilities in English. We then rank a broad range of open and closed systems on our benchmarks. ZSB rankings consistently correlate strongly with human rankings, outperforming widely-adopted standard benchmarks. Through ablations, we find that strong benchmarks can be created with open models, and that judge model size and dataset variety are crucial drivers of performance. We release all our benchmarks, and code to reproduce our experiments and to produce new benchmarks.

DropGaussian: Structural Regularization for Sparse-view Gaussian Splatting

Hyunwoo Park,Gun Ryu,Wonjun Kim

Task: Proposes DropGaussian, a method that addresses the overfitting of 3D Gaussian splatting (3DGS) in sparse-view settings.

Motivation: With sparse views (e.g., three-view inputs), 3DGS tends to overfit the training views, degrading the quality of novel-view images. Existing methods usually depend on strong priors, whereas this paper proposes a simple prior-free method.

Details

Method: Randomly removes Gaussians during 3DGS training (similar to dropout), so that the remaining Gaussians receive larger gradients and better visibility, which improves optimization for rendering from sparse input views. Result: DropGaussian effectively alleviates overfitting, improves novel view synthesis quality, and achieves performance competitive with prior-based 3DGS methods on benchmark datasets. Conclusion: DropGaussian is a simple and effective method that improves 3DGS in sparse-view settings without additional complexity. Abstract: Recently, 3D Gaussian splatting (3DGS) has gained considerable attention in the field of novel view synthesis due to its fast performance and excellent image quality. However, 3DGS in sparse-view settings (e.g., three-view inputs) often faces the problem of overfitting to training views, which significantly degrades the visual quality of novel view images. Many existing approaches have tackled this issue by using strong priors, such as 2D generative contextual information and external depth signals. In contrast, this paper introduces a prior-free method, called DropGaussian, with simple changes to 3D Gaussian splatting. Specifically, we randomly remove Gaussians during the training process in a manner similar to dropout, which allows non-excluded Gaussians to have larger gradients while improving their visibility. This makes the remaining Gaussians contribute more to the optimization process for rendering with sparse input views. Such a simple operation effectively alleviates the overfitting problem and enhances the quality of novel view synthesis. By simply applying DropGaussian to the original 3DGS framework, we can achieve competitive performance with existing prior-based 3DGS methods in sparse-view settings of benchmark datasets without any additional complexity. The code and model are publicly available at: https://github.com/DCVL-3D/DropGaussian_release.
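The dropout-style removal of Gaussians can be sketched in a few lines. This version assumes each Gaussian is represented only by its opacity, and it rescales the surviving opacities by 1 / (1 - drop_rate), borrowing the standard inverted-dropout convention; the abstract does not state a compensation rule, so the rescaling, the function name, and the parameters are assumptions for illustration.

```python
import random


def drop_gaussians(opacities, drop_rate, rng):
    """Randomly zero out Gaussians for one training iteration (hypothetical sketch).

    opacities: per-Gaussian opacity values
    drop_rate: probability of removing each Gaussian, in [0, 1)
    rng:       a random.Random instance, so runs can be made reproducible
    Surviving opacities are rescaled (an assumed inverted-dropout convention)
    so the expected total contribution stays the same.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - drop_rate)
    return [
        opacity * keep_scale if rng.random() >= drop_rate else 0.0
        for opacity in opacities
    ]


# Training-time use: drop a fresh random subset each iteration.
train_opacities = drop_gaussians([0.2, 0.4, 0.3], drop_rate=0.5, rng=random.Random(0))
# Test-time use: keep every Gaussian unchanged.
test_opacities = drop_gaussians([0.2, 0.4, 0.3], drop_rate=0.0, rng=random.Random(0))
```

As with dropout, the mask would be applied only during training; at render time for evaluation all Gaussians are kept.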

Token embeddings violate the manifold hypothesis

Michael Robinson,Sourya Dey,Tony Chiang

Task: Studies the structure of the input space of large language models (LLMs) and its effect on model behavior.

Motivation: Understanding an LLM's input space is essential for accurately assessing its behavior; if the input space differs from what is assumed, our understanding of and conclusions about the model may be flawed.

Details

Method: Through theoretical and empirical analysis, proposes a statistically testable model based on fiber bundles that describes the local structure of token embeddings. Result: The tests show that the token subspace is neither a fiber bundle nor a manifold, and that the local structure of certain tokens leads to increased output variability. Conclusion: The local structure of tokens has a significant effect on LLM output variability, implying that semantically equivalent prompts may produce different outputs depending on the structure of their tokens. Abstract: Fully understanding the behavior of a large language model (LLM) requires understanding its input space. If this input space differs from our assumption, our understanding of and conclusions about the LLM are likely flawed, regardless of its architecture. Here, we elucidate the structure of the token embeddings, the input domain for LLMs, both empirically and theoretically. We present a generalized and statistically testable model where the neighborhood of each token splits into well-defined signal and noise dimensions. This model is based on a generalization of a manifold called a fiber bundle, so we denote our hypothesis test as the "fiber bundle null." Failing to reject the null is uninformative, but rejecting it at a specific token indicates that token has a statistically significant local structure, and so is of interest to us. By running our test over several open-source LLMs, each with unique token embeddings, we find that the null is frequently rejected, and so the token subspace is provably not a fiber bundle and hence also not a manifold. As a consequence of our findings, when an LLM is presented with two semantically equivalent prompts, and if one prompt contains a token implicated by our test, that prompt will likely exhibit more output variability proportional to the local signal dimension of the token.

CellVTA: Enhancing Vision Foundation Models for Accurate Cell Segmentation and Classification

Yang Yang,Xijie Xu,Yixun Zhou,Jie Zheng

Task: Improves the cell instance segmentation performance of vision foundation models (such as Vision Transformers).

Motivation: Existing vision foundation models perform poorly on cell instance segmentation, mainly because ViT tokenization reduces the spatial resolution of input images and hurts segmentation quality, especially for small, densely packed cells.

Details

Method: Proposes CellVTA, which adds a CNN-based adapter module that extracts high-resolution spatial information and injects it into the ViT through a cross-attention mechanism while preserving the ViT's core architecture. Result: Achieves 0.538 mPQ on the CoNIC dataset and 0.506 mPQ on the PanNuke dataset, significantly outperforming existing methods. Conclusion: By incorporating a CNN adapter module, CellVTA significantly improves ViT performance on cell instance segmentation and outperforms other fine-tuning strategies. Abstract: Cell instance segmentation is a fundamental task in digital pathology with broad clinical applications. Recently, vision foundation models, which are predominantly based on Vision Transformers (ViTs), have achieved remarkable success in pathology image analysis. However, their improvements in cell instance segmentation remain limited. A key challenge arises from the tokenization process in ViTs, which substantially reduces the spatial resolution of input images, leading to suboptimal segmentation quality, especially for small and densely packed cells. To address this problem, we propose CellVTA (Cell Vision Transformer with Adapter), a novel method that improves the performance of vision foundation models for cell instance segmentation by incorporating a CNN-based adapter module. This adapter extracts high-resolution spatial information from input images and injects it into the ViT through a cross-attention mechanism. Our method preserves the core architecture of ViT, ensuring seamless integration with pretrained foundation models. Extensive experiments show that CellVTA achieves 0.538 mPQ on the CoNIC dataset and 0.506 mPQ on the PanNuke dataset, which significantly outperforms the state-of-the-art cell segmentation methods. Ablation studies confirm the superiority of our approach over other fine-tuning strategies, including decoder-only fine-tuning and full fine-tuning. Our code and models are publicly available at https://github.com/JieZheng-ShanghaiTech/CellVTA.

When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning

Nishad Singhi,Hritik Bansal,Arian Hosseini,Aditya Grover,Kai-Wei Chang,Marcus Rohrbach,Anna Rohrbach

Task: Evaluates the compute efficiency of Generative Reward Models (GenRM) against Self-Consistency (SC) under a fixed inference budget.

Motivation: Explores how to allocate test-time compute when enhancing LLM reasoning, balancing solution generation against verification.

Details

Method: Compares GenRM and SC across different models and datasets, and derives inference scaling laws for the GenRM paradigm. Result: SC is more compute-efficient than GenRM for most practical inference budgets; GenRM needs significantly more compute to surpass SC. Conclusion: The study offers practical guidance on allocating test-time compute and suggests that compute-optimal inference should scale solution generation more aggressively than verification. Abstract: Scaling test-time compute has emerged as a key strategy for enhancing the reasoning capabilities of large language models (LLMs), particularly in tasks like mathematical problem-solving. A traditional approach, Self-Consistency (SC), generates multiple solutions to a problem and selects the most common answer via majority voting. Another common method involves scoring each solution with a reward model (verifier) and choosing the best one. Recent advancements in Generative Reward Models (GenRM) reframe verification as a next-token prediction task, enabling inference-time scaling along a new axis. Specifically, GenRM generates multiple verification chains-of-thought to score each solution. Under a limited inference budget, this introduces a fundamental trade-off: should you spend the budget on scaling solutions via SC or generate fewer solutions and allocate compute to verification via GenRM? To address this, we evaluate GenRM against SC under a fixed inference budget. Interestingly, we find that SC is more compute-efficient than GenRM for most practical inference budgets across diverse models and datasets. For instance, GenRM first matches SC after consuming up to 8x the inference compute and requires significantly more compute to outperform it. Furthermore, we derive inference scaling laws for the GenRM paradigm, revealing that compute-optimal inference favors scaling solution generation more aggressively than scaling the number of verifications. Our work provides practical guidance on optimizing test-time scaling by balancing solution generation and verification. The code is available at https://github.com/nishadsinghi/sc-genrm-scaling.
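The trade-off between the two strategies can be made concrete with a small sketch. Self-Consistency is plain majority voting over sampled answers. For the budget comparison, the sketch assumes that one solution and one verification chain cost about the same, so S solutions with V verifications each cost S × (1 + V) generation calls; this cost model and the function names are simplifying assumptions for illustration, not figures taken from the paper.

```python
from collections import Counter


def self_consistency(answers):
    """Return the most frequent final answer (majority vote).

    Ties are resolved in favour of the answer that appeared first.
    """
    return Counter(answers).most_common(1)[0][0]


def inference_cost(num_solutions, num_verifications=0):
    """Generation calls needed, assuming a verification chain costs as much
    as a solution (an assumed, simplified cost model)."""
    return num_solutions * (1 + num_verifications)


sampled = ["42", "41", "42", "42", "17"]
print(self_consistency(sampled))                # 42
print(inference_cost(32))                       # 32 calls: SC with 32 solutions
print(inference_cost(4, num_verifications=7))   # 32 calls: 4 solutions, 7 checks each
```

Under this cost model the same budget of 32 calls buys either 32 voted solutions or only 4 verified ones, which is the kind of allocation question the paper studies empirically.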

Scaling Prompt Instructed Zero Shot Composed Image Retrieval with Image-Only Data

Yiqun Duan,Sameera Ramasinghe,Stephen Gould,Ajanthan Thalaiyasingam

Task: Composed Image Retrieval (CIR) retrieves target images that match a reference image combined with a text describing changes to it.

Motivation: Traditional CIR models depend on manually annotated triplets (reference image, reformulation text, target image), which are costly and hard to scale.

Details

Method: Uses large language models (LLMs) to generate CIR data automatically and proposes an embedding reformulation architecture that combines image and text modalities. Result: The proposed InstructCIR model outperforms existing methods in zero-shot CIR on the CIRR and FashionIQ datasets, and its performance approaches supervised baselines as the amount of generated data grows. Conclusion: LLM-generated data can effectively replace human annotation, improving the scalability and performance of CIR models. Abstract: Composed Image Retrieval (CIR) is the task of retrieving images matching a reference image augmented with a text, where the text describes changes to the reference image in natural language. Traditionally, models designed for CIR have relied on triplet data containing a reference image, reformulation text, and a target image. However, curating such triplet data often necessitates human intervention, leading to prohibitive costs. This challenge has hindered the scalability of CIR model training even with the availability of abundant unlabeled data. With the recent advances in foundational models, we advocate a shift in the CIR training paradigm where human annotations can be efficiently replaced by large language models (LLMs). Specifically, we demonstrate the capability of large captioning and language models in efficiently generating data for CIR only relying on unannotated image collections. Additionally, we introduce an embedding reformulation architecture that effectively combines image and text modalities. Our model, named InstructCIR, outperforms state-of-the-art methods in zero-shot composed image retrieval on CIRR and FashionIQ datasets. Furthermore, we demonstrate that by increasing the amount of generated data, our zero-shot model gets closer to the performance of supervised baselines.

Self-Routing RAG: Binding Selective Retrieval with Knowledge Verbalization

Di Wu,Jia-Chen Gu,Kai-Wei Chang,Nanyun Peng

Task: Proposes Self-Routing RAG (SR-RAG), a framework that binds selective retrieval with knowledge verbalization so that an LLM dynamically decides between external retrieval and its own parametric knowledge.

Motivation: Existing approaches under-utilize the inherent knowledge of large language models (LLMs), leading to suboptimal retrieval decisions and degraded generation performance.

Details

Method: Designs a multi-task objective that jointly optimizes an LLM on knowledge source selection, knowledge verbalization, and response generation, and introduces dynamic knowledge source inference to improve decision accuracy under domain shifts. Result: SR-RAG significantly improves both response accuracy and inference latency; compared with the strongest baseline, it reduces retrievals by 29% while improving performance by 5.1%. Conclusion: By binding selective retrieval with knowledge verbalization, SR-RAG improves the efficiency and performance of retrieval-augmented generation. Abstract: Selective retrieval improves retrieval-augmented generation (RAG) by reducing distractions from low-quality retrievals and improving efficiency. However, existing approaches under-utilize the inherent knowledge of large language models (LLMs), leading to suboptimal retrieval decisions and degraded generation performance. To bridge this gap, we propose Self-Routing RAG (SR-RAG), a novel framework that binds selective retrieval with knowledge verbalization. SR-RAG enables an LLM to dynamically decide between external retrieval and verbalizing its own parametric knowledge. To this end, we design a multi-task objective that jointly optimizes an LLM on knowledge source selection, knowledge verbalization, and response generation. We further introduce dynamic knowledge source inference via nearest neighbor search to improve the accuracy of knowledge source decision under domain shifts. Fine-tuning three LLMs with SR-RAG significantly improves both their response accuracy and inference latency. Compared to the strongest selective retrieval baseline, SR-RAG reduces retrievals by 29% while improving the performance by 5.1%.
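The nearest-neighbor step for choosing a knowledge source can be illustrated with a toy k-NN vote. The sketch assumes a memory of past query vectors, each labelled with the source that worked best for it ("retrieve" or "verbalize"), and picks the majority label among the k closest entries by Euclidean distance. The memory layout, the labels, and the distance measure are assumptions for illustration; the paper's actual representation and inference procedure may differ.

```python
import math
from collections import Counter


def choose_source(query_vec, memory, k=3):
    """Pick a knowledge source by majority vote among the k nearest stored queries.

    query_vec: vector representing the incoming query
    memory:    list of (vector, label) pairs, label in {"retrieve", "verbalize"}
    """
    nearest = sorted(memory, key=lambda item: math.dist(query_vec, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


memory = [
    ([0.0, 0.0], "verbalize"),
    ([0.0, 1.0], "verbalize"),
    ([5.0, 5.0], "retrieve"),
    ([5.0, 6.0], "retrieve"),
    ([6.0, 5.0], "retrieve"),
]
print(choose_source([0.2, 0.1], memory))  # verbalize
print(choose_source([5.0, 5.2], memory))  # retrieve
```

Because the decision depends on stored neighbors rather than on a fixed classifier head, such a memory could be extended with examples from a new domain without retraining, which matches the abstract's motivation of robustness under domain shifts.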

The study of non-complete-ring positron emission tomography (PET) detection method

Yeqi Fang,Rong Zhou

Task: Proposes a coarse-to-fine reconstruction framework for incomplete-ring PET scanners.

Motivation: Traditional PET systems rely on complete detector rings for uniform and statistically robust sampling, but incomplete-ring PET scanners arise from hardware failures, cost constraints, or specific clinical needs, and conventional reconstruction algorithms degrade on them.

Details

Method: Uses an Attention U-Net model to recover complete sinograms, the OSEM algorithm for preliminary reconstruction, and a two-stage architecture comprising a Coarse Prediction Module (CPM) and an Iterative Refinement Module (IRM) for fine reconstruction. Result: On public and in-house brain PET datasets, the method significantly outperforms existing approaches on metrics such as PSNR (35.6421 dB) and SSIM (0.9588). Conclusion: The method provides an effective solution for incomplete-ring PET imaging and preserves key anatomical structures and tracer distribution features. Abstract: Positron Emission Tomography (PET) is a vital molecular imaging tool widely used in medical diagnosis and treatment evaluation. Traditional PET systems typically rely on complete detector rings to achieve full angular coverage for uniform and statistically robust sampling of coincidence events. However, incomplete-ring PET scanners have emerged in various scenarios due to hardware failures, cost constraints, or specific clinical needs. In such cases, conventional reconstruction algorithms often suffer from performance degradation due to reduced data completeness and geometric inconsistencies. This thesis proposes a coarse-to-fine reconstruction framework for incomplete-ring PET scanners. The framework first employs an Attention U-Net model to recover complete sinograms from incomplete ones, then uses the OSEM algorithm for preliminary reconstruction, and finally applies a two-stage architecture comprising a Coarse Prediction Module (CPM) and an Iterative Refinement Module (IRM) for fine reconstruction. Our approach utilizes neighboring axial slices and spectral transform features as auxiliary guidance at the input level to ensure spatial and frequency domain consistency, and integrates a contrastive diffusion strategy at the output level to improve correspondence between low-quality PET inputs and refined PET outputs. Experimental results on public and in-house brain PET datasets demonstrate that the proposed method significantly outperforms existing approaches in metrics such as PSNR (35.6421 dB) and SSIM (0.9588), successfully preserving key anatomical structures and tracer distribution features, thus providing an effective solution for incomplete-ring PET imaging.

Leaking LoRa: An Evaluation of Password Leaks and Knowledge Storage in Large Language Models

Ryan Marinelli,Magnus Eckhoff

Task: Studies how fine-tuning can cause large language models to leak sensitive information (such as passwords) and explores how to remediate it.

Motivation: Although it is not recommended, users sometimes include sensitive information such as passwords in messages; fine-tuning on such data can leak it, so the risk and possible remedies need to be studied.

Details

Method: Fine-tunes a large language model with LoRA on data that includes passwords from the RockYou wordlist, then uses causal tracing and ROME to locate and remove the password information. Result: 37 of 200 passwords were recovered; after applying ROME, the number of leaked passwords dropped from 37 to 0. Conclusion: Fine-tuning can leak sensitive information, but model editing techniques can effectively remediate the problem. Abstract: To effectively deploy Large Language Models (LLMs) in application-specific settings, fine-tuning techniques are applied to enhance performance on specialized tasks. This process often involves fine-tuning on user data, which may contain sensitive information. Although not recommended, it is not uncommon for users to send passwords in messages, and fine-tuning models on this could result in passwords being leaked. In this study, a Large Language Model is fine-tuned with customer support data and passwords from the RockYou password wordlist using Low-Rank Adaptation (LoRA). Out of the first 200 passwords from the list, 37 were successfully recovered. Further, causal tracing is used to identify that password information is largely located in a few layers. Lastly, Rank One Model Editing (ROME) is used to remove the password information from the model, resulting in the number of passwords recovered going from 37 to 0.
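For readers unfamiliar with Low-Rank Adaptation, the weight update it learns can be written out in a few lines. LoRA keeps the pretrained weight W frozen and trains two small matrices B and A, so the effective weight is W + (alpha / r) · B · A, where r is the rank. The sketch below uses plain nested lists and hypothetical helper names; it illustrates the general LoRA formula, not the specific training setup used in this study.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_weight(w, lora_b, lora_a, alpha):
    """Effective weight after merging a LoRA adapter.

    w:      frozen pretrained weight, shape (d_out, d_in)
    lora_b: trainable matrix, shape (d_out, r), usually initialised to zeros
    lora_a: trainable matrix, shape (r, d_in)
    alpha:  scaling hyperparameter; the update is scaled by alpha / r
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 2.0], [3.0, 4.0]]
# With B at its zero initialisation the adapter leaves the model unchanged.
print(lora_weight(w, [[0.0], [0.0]], [[1.0, 1.0]], alpha=2))  # [[1.0, 2.0], [3.0, 4.0]]
# After training, B is non-zero and the rank-1 update shifts the weights.
print(lora_weight(w, [[1.0], [0.0]], [[1.0, 2.0]], alpha=2))  # [[3.0, 6.0], [3.0, 4.0]]
```

Anything memorised during fine-tuning, including passwords, has to be expressed through this low-rank update, which is one reason the leaked information may be concentrated enough for causal tracing and ROME to target.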

PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks

Abdelrahman Elskhawy,Mengze Li,Nassir Navab,Benjamin Busam

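Task: Propose PRISM-0, a framework for zero-shot open-vocabulary scene graph generation (SGG) that addresses the training bias and long-tail distribution problems of existing methods.

Motivation: Existing fully supervised SGG methods suffer from training bias caused by small datasets and long-tail predicate distributions, which hurts downstream task performance.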

Details

Method: Foundation models, a vision-language model (VLM), and a large language model (LLM) are combined bottom-up to generate diverse open-vocabulary predicates, and a VQA model validates the final result. Result: PRISM-0 produces semantically rich scene graphs that improve downstream tasks such as image captioning and sentence-to-graph retrieval, on par with the best fully supervised methods. Conclusion: PRISM-0 offers a modular, dataset-independent solution for SGG that addresses training bias and the long-tail distribution problem. Abstract: In Scene Graph Generation (SGG) one extracts structured representation from visual inputs in the form of object nodes and predicates connecting them. This facilitates image-based understanding and reasoning for various downstream tasks. Although fully supervised SGG approaches showed steady performance improvements, they suffer from a severe training bias. This is caused by the availability of only small subsets of curated data and exhibits long-tail predicate distribution issues with a lack of predicate diversity adversely affecting downstream tasks. To overcome this, we introduce PRISM-0, a framework for zero-shot open-vocabulary SGG that bootstraps foundation models in a bottom-up approach to capture the whole spectrum of diverse, open-vocabulary predicate prediction. Detected object pairs are filtered and passed to a Vision Language Model (VLM) that generates descriptive captions. These are used to prompt an LLM to generate fine- and coarse-grained predicates for the pair. The predicates are then validated using a VQA model to provide a final SGG. With the modular and dataset-independent PRISM-0, we can enrich existing SG datasets such as Visual Genome (VG). Experiments illustrate that PRISM-0 generates semantically meaningful graphs that improve downstream tasks such as Image Captioning and Sentence-to-Graph Retrieval with a performance on par with the best fully supervised methods.

Riccardo Cantini,Fabrizio Marozzo,Alessio Orsino,Domenico Talia,Paolo Trunfio

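Task: Propose H-ADAPTS, a BERT-based hashtag recommendation method that dynamically adapts to changes in hashtag usage on social media.

Motivation: Existing static models struggle to adapt to the dynamic, real-time nature of hashtags on social media, so a recommender that can detect and adapt to trend shifts is needed.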

Details

Method: A BERT model is combined with a trend detection mechanism, and Apache Storm provides real-time stream processing so that the recommendation model can be adapted dynamically. Result: In two case studies, the COVID-19 pandemic and the 2020 US presidential election, H-ADAPTS significantly outperforms existing methods and maintains high recommendation accuracy. Conclusion: H-ADAPTS adapts effectively to dynamic environments and provides timely, relevant hashtag recommendations. Abstract: The widespread use of social media platforms results in the generation of vast amounts of user-generated content, which requires efficient methods for categorization and search. Hashtag recommendation systems have emerged as a crucial tool for automatically suggesting relevant hashtags and improving content discoverability. However, existing static models struggle to adapt to the highly dynamic and real-time nature of social media conversations, where new hashtags emerge and existing ones undergo semantic shifts. To address these challenges, this paper presents H-ADAPTS (Hashtag recommendAtion by Detecting and adAPting to Trend Shifts), a BERT-based hashtag recommendation methodology that can detect and adapt to shifts in the main trends and topics underlying social media conversation. Our approach introduces a trend-aware detection mechanism to identify changes in hashtag usage, triggering efficient model adaptation on a (small) set of recent posts. The framework leverages Apache Storm for real-time stream processing, enabling scalable and fault-tolerant analysis of high-velocity social data. Experimental results on two real-world case studies, including the COVID-19 pandemic and the 2020 US presidential election, demonstrate the ability to maintain high recommendation accuracy by adapting to emerging trends. Our methodology significantly outperforms existing solutions, ensuring timely and relevant hashtag recommendations in dynamic environments.

Zero-Shot 4D Lidar Panoptic Segmentation

Yushan Zhang,Aljoša Ošep,Laura Leal-Taixé,Tim Meinhardt

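Task: Propose SAL-4D, a method for zero-shot 4D Lidar segmentation and recognition of arbitrary objects.

Motivation: Address the scarcity of datasets with sufficiently diverse and large-scale annotations for spatio-temporal scene understanding in Lidar.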

Details

Method: Multi-modal robotic sensor setups are used together with Video Object Segmentation (VOS) models and off-the-shelf vision-language foundation models to lift pseudo-labeled tracklets into 4D Lidar space. Result: The method improves on prior work in 3D zero-shot Lidar Panoptic Segmentation (LPS) by more than 5 PQ and enables zero-shot 4D-LPS. Conclusion: By combining multi-modal sensors with VOS models, SAL-4D significantly improves Lidar scene understanding. Abstract: Zero-shot 4D segmentation and recognition of arbitrary objects in Lidar is crucial for embodied navigation, with applications ranging from streaming perception to semantic mapping and localization. However, the primary challenge in advancing research and developing generalized, versatile methods for spatio-temporal scene understanding in Lidar lies in the scarcity of datasets that provide the necessary diversity and scale of annotations. To overcome these challenges, we propose SAL-4D (Segment Anything in Lidar--4D), a method that utilizes multi-modal robotic sensor setups as a bridge to distill recent developments in Video Object Segmentation (VOS) in conjunction with off-the-shelf Vision-Language foundation models to Lidar. We utilize VOS models to pseudo-label tracklets in short video sequences, annotate these tracklets with sequence-level CLIP tokens, and lift them to the 4D Lidar space using calibrated multi-modal sensory setups to distill them to our SAL-4D model. Due to temporally consistent predictions, we outperform prior art in 3D Zero-Shot Lidar Panoptic Segmentation (LPS) by over 5 PQ, and unlock Zero-Shot 4D-LPS.

The Cursive Transformer

Sam Greydanus,Zachary Wimpee

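Task: Propose a novel tokenization scheme that converts pen-stroke data to polar coordinates and discretizes it into token sequences, so that a standard GPT model can be trained to generate handwriting.

Motivation: Handwriting data (e.g., sequences of pen coordinates) remains underexplored, and existing methods such as mixture density networks are complex and inefficient.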

Details

Method: Pen stroke offsets are converted to polar coordinates, discretized into bins, and turned into token sequences for training a standard GPT model. Result: With only 3,500 handwritten words and simple data augmentations, the trained model generates realistic cursive handwriting and outperforms traditional RNN-based methods. Conclusion: The method simplifies the handwriting generation pipeline while outperforming existing methods, showing the potential of standard GPT models for handwriting generation. Abstract: Transformers trained on tokenized text, audio, and images can generate high-quality autoregressive samples. But handwriting data, represented as sequences of pen coordinates, remains underexplored. We introduce a novel tokenization scheme that converts pen stroke offsets to polar coordinates, discretizes them into bins, and then turns them into sequences of tokens with which to train a standard GPT model. This allows us to capture complex stroke distributions without using any specialized architectures (e.g., the mixture density network or the self-advancing ASCII attention head from Graves 2014). With just 3,500 handwritten words and a few simple data augmentations, we are able to train a model that can generate realistic cursive handwriting. Our approach is simpler and more performant than previous RNN-based methods.
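The tokenization idea described above (offsets to polar coordinates, then bins, then tokens) can be sketched roughly as follows. This is only an illustration under stated assumptions: the bin counts, the clipping radius `r_max`, and the choice to fuse radius and angle into a single token id are hypothetical, and the paper's actual vocabulary layout (including how pen-up events are encoded) is not given in the abstract.

```python
import math


def tokenize_offsets(offsets, n_r_bins=8, n_theta_bins=16, r_max=1.0):
    """Map (dx, dy) pen offsets to integer tokens via binned polar coordinates.

    All parameters are illustrative assumptions, not the paper's settings.
    """
    tokens = []
    for dx, dy in offsets:
        # Polar conversion: radius is clipped to r_max, angle lies in [0, 2*pi).
        r = min(math.hypot(dx, dy), r_max)
        theta = math.atan2(dy, dx) % (2 * math.pi)
        # Uniform binning; min() guards the upper edge of the last bin.
        r_bin = min(int(r / r_max * n_r_bins), n_r_bins - 1)
        theta_bin = min(int(theta / (2 * math.pi) * n_theta_bins), n_theta_bins - 1)
        # One token per offset: a flat index over the (radius, angle) grid.
        tokens.append(r_bin * n_theta_bins + theta_bin)
    return tokens
```

The resulting integer sequence can be fed to any standard autoregressive language model; decoding would invert the mapping by taking bin centers.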

Global Intervention and Distillation for Federated Out-of-Distribution Generalization

Zhuang Qi,Runhui Zhang,Lei Meng,Wei Wu,Yachong Zhang,Xiangxu Meng

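Task: Address the performance degradation and unstable convergence caused by attribute skew in federated learning.

Motivation: Existing methods learn invariant representations through data augmentation or knowledge distillation, but the unstable quality of generated data and the lack of domain information limit their performance on unseen samples.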

Details

Method: FedGID consists of a global intervention module and a global distillation module, which guide representation learning through backdoor adjustment and a unified knowledge base. Result: Experiments on three datasets show that FedGID improves the model's ability to focus on the main subjects in unseen data and outperforms existing methods. Conclusion: Through intervention and distillation, FedGID effectively addresses attribute skew in federated learning and improves model performance. Abstract: Attribute skew in federated learning leads local models to focus on learning non-causal associations, guiding them towards inconsistent optimization directions, which inevitably results in performance degradation and unstable convergence. Existing methods typically leverage data augmentation to enhance sample diversity or employ knowledge distillation to learn invariant representations. However, the instability in the quality of generated data and the lack of domain information limit their performance on unseen samples. To address these issues, this paper presents a global intervention and distillation method, termed FedGID, which utilizes diverse attribute features for backdoor adjustment to break the spurious association between background and label. It includes two main modules, where the global intervention module adaptively decouples objects and backgrounds in images, injects background information into random samples to intervene in the sample distribution, which links backgrounds to all categories to prevent the model from treating background-label associations as causal. The global distillation module leverages a unified knowledge base to guide the representation learning of client models, preventing local models from overfitting to client-specific attributes. Experimental results on three datasets demonstrate that FedGID enhances the model's ability to focus on the main subjects in unseen data and outperforms existing methods in collaborative modeling.
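For context, the backdoor adjustment mentioned above is usually written in the standard form below. The symbols are assumptions made here for illustration ($X$ for the object content, $Y$ for the label, $B$ for the background treated as the confounder); the abstract does not give the authors' own notation or estimator.

```latex
% Standard backdoor adjustment over a confounder B (notation assumed here):
P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{b} P\bigl(Y \mid X,\, B = b\bigr)\, P(B = b)
```

Read this way, injecting background information into random samples can be seen as approximating the sum over $b$: each object is paired with many backgrounds, so no single background remains predictive of the label.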

LLMs for Explainable AI: A Comprehensive Survey

Ahsan Bilal,David Ebert,Beiyu Lin

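Task: Explore using large language models (LLMs) to enhance explainable AI (XAI) by generating easy-to-understand explanations.

Motivation: AI models such as neural networks and deep learning models are often seen as "black boxes" that lack transparency, which makes it hard for users to trust their decisions and undermines decision-making and accountability.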

Details

Method: The survey reviews existing approaches to using LLMs for XAI and the associated evaluation techniques, and discusses challenges, limitations, and real-world applications. Result: It outlines future directions, emphasizing the need for more interpretable, automated, user-centric, and multidisciplinary approaches. Conclusion: LLMs show promise for XAI, but further research is needed to resolve the open challenges. Abstract: Large Language Models (LLMs) offer a promising approach to enhancing Explainable AI (XAI) by transforming complex machine learning outputs into easy-to-understand narratives, making model predictions more accessible to users, and helping bridge the gap between sophisticated model behavior and human interpretability. AI models, such as state-of-the-art neural networks and deep learning models, are often seen as "black boxes" due to a lack of transparency. As users cannot fully understand how the models reach conclusions, users have difficulty trusting decisions from AI models, which leads to less effective decision-making processes, reduced accountability, and unclear potential biases. A challenge arises in developing explainable AI (XAI) models to gain users' trust and provide insights into how models generate their outputs. With the development of Large Language Models, we want to explore the possibilities of using human language-based models, LLMs, for model explainability. This survey provides a comprehensive overview of existing approaches regarding LLMs for XAI, and evaluation techniques for LLM-generated explanations, discusses the corresponding challenges and limitations, and examines real-world applications. Finally, we discuss future directions by emphasizing the need for more interpretable, automated, user-centric, and multidisciplinary approaches for XAI via LLMs.

Exploring Personalized Federated Learning Architectures for Violence Detection in Surveillance Videos

Mohammad Kassir,Siba Haidar,Antoun Yaacoub

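Task: Use Personalized Federated Learning (PFL) to detect violent incidents in urban surveillance systems.

Motivation: Urban surveillance video is voluminous and diverse, and traditional methods struggle to process it efficiently.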

Details

Method: A personalized federated learning approach (Federated Learning with Personalization Layers) built on the Flower framework adapts to the data characteristics of each surveillance node. Result: Experiments on balanced and imbalanced datasets show that the PFL models reach up to 99.3% accuracy. Conclusion: PFL can significantly improve the scalability and efficiency of surveillance systems and offers a privacy-preserving solution for violence detection in complex urban environments. Abstract: The challenge of detecting violent incidents in urban surveillance systems is compounded by the voluminous and diverse nature of video data. This paper presents a targeted approach using Personalized Federated Learning (PFL) to address these issues, specifically employing the Federated Learning with Personalization Layers method within the Flower framework. Our methodology adapts learning models to the unique data characteristics of each surveillance node, effectively managing the heterogeneous and non-IID nature of surveillance video data. Through rigorous experiments conducted on balanced and imbalanced datasets, our PFL models demonstrated enhanced accuracy and efficiency, achieving up to 99.3% accuracy. This study underscores the potential of PFL to significantly improve the scalability and effectiveness of surveillance systems, offering a robust, privacy-preserving solution for violence detection in complex urban environments.

Agents Under Siege: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks

Rana Muhammad Shahroz Khan,Zhen Tan,Sukwon Yun,Charles Flemming,Tianlong Chen

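Task: Design a permutation-invariant adversarial attack on multi-agent LLM systems that bypasses distributed safety mechanisms.

Motivation: The communication and decentralized reasoning of multi-agent LLM systems introduce new adversarial risks that existing safety mechanisms fail to address.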

Details

Method: A permutation-invariant adversarial attack is formulated as a maximum-flow minimum-cost problem and combined with a Permutation-Invariant Evasion Loss (PIEL), using graph-based optimization to maximize attack success rate. Result: Across multiple models and datasets, the attack outperforms conventional attacks by up to 7x, exposing critical vulnerabilities in multi-agent systems. Conclusion: Existing defenses cannot stop this attack, so safety mechanisms designed specifically for multi-agent systems are urgently needed. Abstract: Most discussions about Large Language Model (LLM) safety have focused on single-agent settings but multi-agent LLM systems now create novel adversarial risks because their behavior depends on communication between agents and decentralized reasoning. In this work, we innovatively focus on attacking pragmatic systems that have constraints such as limited token bandwidth, latency between message delivery, and defense mechanisms. We design a permutation-invariant adversarial attack that optimizes prompt distribution across latency and bandwidth-constrained network topologies to bypass distributed safety mechanisms within the system. Formulating the attack path as a problem of maximum-flow minimum-cost, coupled with the novel Permutation-Invariant Evasion Loss (PIEL), we leverage graph-based optimization to maximize attack success rate while minimizing detection risk. Evaluating across models including Llama, Mistral, Gemma, DeepSeek and other variants on various datasets like JailBreakBench and AdversarialBench, our method outperforms conventional attacks by up to 7x, exposing critical vulnerabilities in multi-agent systems. Moreover, we demonstrate that existing defenses, including variants of Llama-Guard and PromptGuard, fail to prohibit our attack, emphasizing the urgent need for multi-agent specific safety mechanisms.

NeuRadar: Neural Radiance Fields for Automotive Radar Point Clouds

Mahan Rafidashti,Ji Lan,Maryam Fatemi,Junsheng Fu,Lars Hammarstrand,Lennart Svensson

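Task: Propose NeuRadar, a NeRF-based model that jointly generates radar point clouds, camera images, and lidar point clouds.

Motivation: Radar plays an important role in autonomous driving systems, but NeRF-based synthesis of radar point clouds remains underexplored.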

Details

Method: The work explores set-based object detection methods such as DETR, proposes an encoder-based solution grounded in the NeRF geometry, and offers both deterministic and probabilistic point cloud representations. Result: Realistic radar point cloud reconstructions are achieved on two automotive datasets, and the related data and source code are released. Conclusion: NeuRadar establishes a baseline for NeRF-based radar point cloud simulation and encourages further research on radar NeRFs. Abstract: Radar is an important sensor for autonomous driving (AD) systems due to its robustness to adverse weather and different lighting conditions. Novel view synthesis using neural radiance fields (NeRFs) has recently received considerable attention in AD due to its potential to enable efficient testing and validation but remains unexplored for radar point clouds. In this paper, we present NeuRadar, a NeRF-based model that jointly generates radar point clouds, camera images, and lidar point clouds. We explore set-based object detection methods such as DETR, and propose an encoder-based solution grounded in the NeRF geometry for improved generalizability. We propose both a deterministic and a probabilistic point cloud representation to accurately model the radar behavior, with the latter being able to capture radar's stochastic behavior. We achieve realistic reconstruction results for two automotive datasets, establishing a baseline for NeRF-based radar point cloud simulation models. In addition, we release radar data for ZOD's Sequences and Drives to enable further research in this field. To encourage further development of radar NeRFs, we release the source code for NeuRadar.

ElaLoRA: Elastic & Learnable Low-Rank Adaptation for Efficient Model Fine-Tuning

Huandong Chang,Zicheng Ma,Mingyuan Ma,Zhenting Qi,Andrew Sabot,Hong Jiang,H. T. Kung

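Task: Propose ElaLoRA, a low-rank adaptation (LoRA) framework that dynamically prunes and expands ranks during fine-tuning.

Motivation: Existing methods rely on fixed ranks or address only rank pruning or only rank expansion, so they cannot adapt dynamically to the importance of different layers.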

Details

Method: Ranks are dynamically pruned and expanded based on gradient-derived importance scores. Result: ElaLoRA outperforms existing PEFT methods on multiple benchmarks, and the study confirms that layers allocated higher ranks contribute more to model performance. Conclusion: Through its adaptive rank allocation mechanism, ElaLoRA provides an efficient fine-tuning solution for resource-constrained environments. Abstract: Low-Rank Adaptation (LoRA) has become a widely adopted technique for fine-tuning large-scale pre-trained models with minimal parameter updates. However, existing methods rely on fixed ranks or focus solely on either rank pruning or expansion, failing to adapt ranks dynamically to match the importance of different layers during training. In this work, we propose ElaLoRA, an adaptive low-rank adaptation framework that dynamically prunes and expands ranks based on gradient-derived importance scores. To the best of our knowledge, ElaLoRA is the first method that enables both rank pruning and expansion during fine-tuning. Experiments across multiple benchmarks demonstrate that ElaLoRA consistently outperforms existing PEFT methods across different parameter budgets. Furthermore, our studies validate that layers receiving higher rank allocations contribute more significantly to model performance, providing theoretical justification for our adaptive strategy. By introducing a principled and adaptive rank allocation mechanism, ElaLoRA offers a scalable and efficient fine-tuning solution, particularly suited for resource-constrained environments.
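To make the prune-and-expand idea concrete, here is a deliberately simplified sketch that moves rank budget from a low-importance layer to the highest-importance one while keeping the total budget fixed. Everything in it is a hypothetical illustration: the function name, the single-donor policy, the `min_rank` floor, and the assumption that per-layer importance scores are already available. The abstract does not describe how ElaLoRA computes scores or schedules reallocation.

```python
def reallocate_ranks(ranks, importance, k=1, min_rank=1):
    """Move k ranks from the least important eligible layer to the most important.

    ranks: dict layer name -> current LoRA rank
    importance: dict layer name -> importance score (assumed precomputed)
    """
    new_ranks = dict(ranks)  # leave the caller's allocation untouched
    by_importance = sorted(new_ranks, key=lambda name: importance[name])
    receiver = by_importance[-1]
    for donor in by_importance[:-1]:
        # Prune only if the donor stays at or above the minimum rank.
        if new_ranks[donor] - k >= min_rank:
            new_ranks[donor] -= k      # rank pruning
            new_ranks[receiver] += k   # rank expansion
            break
    return new_ranks
```

Calling this periodically during training would gradually concentrate rank in layers that score as important, which is the behavior the abstract describes at a high level.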

Balancing Multi-Target Semi-Supervised Medical Image Segmentation with Collaborative Generalist and Specialists

You Wang,Zekun Li,Lei Qi,Qian Yu,Yinghuan Shi,Yang Gao

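Task: Propose CGS, a new method that addresses the performance drop caused by imbalanced target scales in multi-target medical image segmentation.

Motivation: When current semi-supervised models segment multiple targets simultaneously, large targets dominate the loss, so small targets are misclassified as larger ones.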

Details

Method: CGS comprises one generalist segmenter and several specialist segmenters; the specialists prevent large targets from dominating, and cross-consistency losses plus an inter-head error detection module further improve performance. Result: On three popular benchmarks, CGS outperforms existing methods. Conclusion: Through balanced training and collaborative learning, CGS effectively addresses scale imbalance in multi-target segmentation. Abstract: Despite the promising performance achieved by current semi-supervised models in segmenting individual medical targets, many of these models suffer a notable decrease in performance when tasked with the simultaneous segmentation of multiple targets. A vital factor could be attributed to the imbalanced scales among different targets: when multiple targets are segmented simultaneously, large targets dominate the loss, leading to small targets being misclassified as larger ones. To this end, we propose a novel method, which consists of a Collaborative Generalist and several Specialists, termed CGS. It is centered around the idea of employing a specialist for each target class, thus avoiding the dominance of larger targets. The generalist performs conventional multi-target segmentation, while each specialist is dedicated to distinguishing a specific target class from the remaining target classes and the background. Based on a theoretical insight, we demonstrate that CGS can achieve a more balanced training. Moreover, we develop cross-consistency losses to foster collaborative learning between the generalist and the specialists. Lastly, regarding their intrinsic relation that the target class of any specialized head should belong to the remaining classes of the other heads, we introduce an inter-head error detection module to further enhance the quality of pseudo-labels. Experimental results on three popular benchmarks showcase its superior performance compared to state-of-the-art methods.
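The specialist's task described above (one target class versus the remaining targets versus background) implies a simple relabeling of the generalist's multi-class ground truth. The sketch below shows one way that relabeling could look; the 0/1/2 encoding, the function name, and the flat list of per-pixel labels are assumptions for illustration, not the authors' implementation.

```python
def specialist_labels(labels, target_class, background=0):
    """Collapse multi-class labels into a specialist's three-way view.

    Output encoding (assumed): 0 = background, 1 = this specialist's target,
    2 = any other target class.
    """
    out = []
    for c in labels:
        if c == background:
            out.append(0)
        elif c == target_class:
            out.append(1)
        else:
            out.append(2)
    return out
```

Under this encoding, a pixel that one specialist marks as its own target (1) should appear as "other target" (2) in every other specialist's view, which is the kind of intrinsic relation the abstract says the inter-head error detection module exploits.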

Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead

Vidhisha Balachandran,Jingya Chen,Lingjiao Chen,Shivam Garg,Neel Joshi,Yash Lara,John Langford,Besmira Nushi,Vibhav Vineet,Yue Wu,Safoora Yousefi

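Task: Study how inference-time scaling methods affect large language models (LLMs) on complex problem solving.

Motivation: Although lengthening generated intermediate steps (scratchpads) has proven effective for mathematical tasks, its broader impact on other tasks remains unclear.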

Details

Method: Nine state-of-the-art models are compared on eight challenging tasks (math and STEM reasoning, calendar planning, NP-hard problems, navigation, and spatial reasoning) to assess the benefits and limitations of inference-time scaling. Result: The advantages of inference-time scaling vary across tasks and diminish as problem complexity increases, and more tokens do not always improve accuracy. On some tasks conventional models come close to the average performance of advanced reasoning models, but on others a significant gap remains. Conclusion: All models show substantial potential for improvement when inference is scaled further with perfect verifiers or strong feedback. Abstract: Inference-time scaling can enhance the reasoning capabilities of large language models (LLMs) on complex problems that benefit from step-by-step problem solving. Although lengthening generated scratchpads has proven effective for mathematical tasks, the broader impact of this approach on other tasks remains less clear. In this work, we investigate the benefits and limitations of scaling methods across nine state-of-the-art models and eight challenging tasks, including math and STEM reasoning, calendar planning, NP-hard problems, navigation, and spatial reasoning. We compare conventional models (e.g., GPT-4o) with models fine-tuned for inference-time scaling (e.g., o1) through evaluation protocols that involve repeated model calls, either independently or sequentially with feedback. These evaluations approximate lower and upper performance bounds and potential for future performance improvements for each model, whether through enhanced training or multi-model inference systems. Our extensive empirical analysis reveals that the advantages of inference-time scaling vary across tasks and diminish as problem complexity increases. In addition, simply using more tokens does not necessarily translate to higher accuracy in these challenging regimes. Results from multiple independent runs with conventional models using perfect verifiers show that, for some tasks, these models can achieve performance close to the average performance of today's most advanced reasoning models. However, for other tasks, a significant performance gap remains, even in very high scaling regimes. Encouragingly, all models demonstrate significant gains when inference is further scaled with perfect verifiers or strong feedback, suggesting ample potential for future improvements.
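The "repeated model calls with a perfect verifier" protocol mentioned above can be pictured as a best-of-n loop. The sketch below is a generic illustration, not the paper's evaluation harness: `generate` and `verify` are hypothetical callables standing in for a model call and an oracle check.

```python
def best_of_n(generate, verify, n):
    """Call `generate` up to n times and return the first verified candidate.

    generate: callable taking the 1-based attempt index, returning a candidate
    verify:   callable returning True if the candidate is correct (the
              "perfect verifier" is an idealized assumption)
    Returns (candidate, attempts_used), or (None, n) if nothing verified.
    """
    for attempt in range(1, n + 1):
        candidate = generate(attempt)
        if verify(candidate):
            return candidate, attempt
    return None, n
```

With an oracle verifier, the success rate of this loop is an upper bound on what independent sampling can achieve at budget n, which is how such protocols approximate the performance bounds discussed in the abstract.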

Data-free Knowledge Distillation with Diffusion Models

Xiaohua Qi,Renda Li,Long Peng,Qiang Ling,Jun Yu,Ziyi Chen,Peng Chang,Mei Han,Jing Xiao

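Task: Propose DiffDFKD, a data-free knowledge distillation method based on diffusion models.

Motivation: Diffusion models excel at generating high-quality images, but existing methods cannot easily apply them to data-free knowledge distillation (DFKD).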

Details

Method: DiffDFKD uses the teacher model to guide a diffusion model's data generation and introduces Latent CutMix Augmentation to increase the diversity of generated images. Result: Experiments show that DiffDFKD outperforms existing DFKD methods and achieves state-of-the-art results. Conclusion: DiffDFKD provides an efficient and effective solution for data-free knowledge distillation. Abstract: Recently Data-Free Knowledge Distillation (DFKD) has garnered attention and can transfer knowledge from a teacher neural network to a student neural network without requiring any access to training data. Although diffusion models are adept at synthesizing high-fidelity photorealistic images across various domains, existing methods cannot be easily applied to DFKD. To bridge that gap, this paper proposes a novel approach based on diffusion models, DiffDFKD. Specifically, DiffDFKD involves targeted optimizations in two key areas. Firstly, DiffDFKD utilizes valuable information from teacher models to guide the pre-trained diffusion models' data synthesis, generating datasets that mirror the training data distribution and effectively bridge domain gaps. Secondly, to reduce computational burdens, DiffDFKD introduces Latent CutMix Augmentation, an efficient technique, to enhance the diversity of diffusion model-generated images for DFKD while preserving key attributes for effective knowledge transfer. Extensive experiments validate the efficacy of DiffDFKD, yielding state-of-the-art results exceeding existing DFKD approaches. We release our code at https://github.com/xhqi0109/DiffDFKD.
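The abstract does not spell out how Latent CutMix Augmentation works, so the following is only a sketch of generic CutMix applied to a 2D latent grid rather than to pixels: a rectangular patch of one latent is replaced by the corresponding patch of another. The function name, the nested-list latent representation, and the explicit patch coordinates are all assumptions for illustration.

```python
def latent_cutmix(latent_a, latent_b, top, left, height, width):
    """Return a copy of latent_a with a rectangular patch taken from latent_b.

    Both latents are 2D grids (lists of rows) of the same shape. The patch is
    clipped to the grid boundaries.
    """
    mixed = [row[:] for row in latent_a]  # copy so latent_a is not modified
    for i in range(top, min(top + height, len(mixed))):
        for j in range(left, min(left + width, len(mixed[i]))):
            mixed[i][j] = latent_b[i][j]
    return mixed
```

Mixing in latent space instead of image space means only one decode is needed per mixed sample, which is one plausible reading of the abstract's claim that the technique reduces computational burden.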

FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning

Jie Ma,Zhitao Gao,Qi Chai,Jun Liu,Pinghui Wang,Jing Tao,Zhou Su

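Task: Address dataset bias and robustness problems in Audio-Visual Question Answering (AVQA).

Motivation: Existing AVQA methods tend to overfit to dataset biases, which leads to poor robustness, and current datasets cannot effectively diagnose these problems.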

Details

Method: The work introduces a new dataset, FortisAVQA, and a Multimodal Audio-Visual Epistemic Network (MAVEN) that uses a multifaceted cycle collaborative debiasing strategy. Result: The approach achieves a 7.81% improvement on FortisAVQA, validating the debiasing strategy. Conclusion: The proposed method and dataset significantly improve the robustness and performance of AVQA and expose the limitations of existing multimodal QA methods. Abstract: Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning task requiring intelligent systems to answer natural language queries based on paired audio-video inputs accurately. However, existing AVQA approaches often suffer from overfitting to dataset biases, leading to poor robustness. Moreover, current datasets may not effectively diagnose these methods. To address these challenges, we first introduce a novel dataset, FortisAVQA, constructed in two stages: (1) rephrasing questions in the test split of the public MUSIC-AVQA dataset and (2) introducing distribution shifts across questions. The first stage expands the test space with greater diversity, while the second enables a refined robustness evaluation across rare, frequent, and overall question distributions. Second, we introduce a robust Multimodal Audio-Visual Epistemic Network (MAVEN) that leverages a multifaceted cycle collaborative debiasing strategy to mitigate bias learning. Experimental results demonstrate that our architecture achieves state-of-the-art performance on FortisAVQA, with a notable improvement of 7.81%. Extensive ablation studies on both datasets validate the effectiveness of our debiasing components. Additionally, our evaluation reveals the limited robustness of existing multimodal QA methods. We also verify the plug-and-play capability of our strategy by integrating it with various baseline models across both datasets. Our dataset and code are available at https://github.com/reml-group/fortisavqa.

WISE-TTT: Worldwide Information Segmentation Enhancement

Fenglei Hao,Yuliang Yang,Ruiyuan Su,Zhengran Zhao,Yukun Qiao,Mengyu Zhu

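Task: Propose WISE-TTT, a synergistic architecture that addresses the challenge of capturing global temporal dependencies in video multi-target segmentation.

Motivation: Existing architectures are inherently limited in capturing global temporal dependencies over long sequences, which degrades video multi-target segmentation.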

Details

Method: Test-Time Training (TTT) mechanisms are co-designed with the Transformer architecture; the TTT layer compresses historical temporal data into hidden states that carry global information, and splicing provides multi-stage contextual aggregation. Result: Accuracy on the Davis2017 long-term benchmark improves by 3.1% (J&F metric), which the authors present as the first proof of the superiority of hierarchical context in video segmentation. Conclusion: Global information is critical to segmentation performance, and WISE-TTT exploits it across multiple network layers to improve segmentation significantly. Abstract: Video multi-target segmentation remains a major challenge in long sequences, mainly due to the inherent limitations of existing architectures in capturing global temporal dependencies. We introduce WISE-TTT, a synergistic architecture integrating Test-Time Training (TTT) mechanisms with the Transformer architecture through co-design. The TTT layer systematically compresses historical temporal data to generate hidden states containing worldwide information (lossless memory to maintain long contextual integrity), while achieving multi-stage contextual aggregation through splicing. Crucially, our framework provides the first empirical validation that implementing worldwide information across multiple network layers is essential for optimal dependency utilization. Ablation studies show TTT modules at high-level features boost global modeling. This translates to a 3.1% accuracy improvement (J&F metric) on Davis2017 long-term benchmarks, the first proof of hierarchical context superiority in video segmentation. We provide the first systematic evidence that worldwide information critically impacts segmentation performance.

ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

Qianhao Yuan,Qingyu Zhang,Yanjiang Liu,Jiawei Chen,Yaojie Lu,Hongyu Lin,Jia Zheng,Xianpei Han,Le Sun

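Task: Study layer-wise redundancy in Multimodal Large Language Models (MLLMs) and propose ShortV, a training-free method for reducing computational cost.

Motivation: MLLMs are computationally expensive because of their size and the large number of visual tokens, yet many layers contribute very little to the processing of visual tokens.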

Details

Method: A Layer Contribution (LC) metric quantifies each layer's contribution, and ShortV freezes visual token updates in the layers found to be ineffective. Result: ShortV can freeze visual tokens in approximately 60% of layers, dramatically reducing computational cost (e.g., a 50% reduction in FLOPs on LLaVA-NeXT-13B) while maintaining performance. Conclusion: ShortV is an efficient, training-free method that substantially reduces the computational overhead of MLLMs. Abstract: Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV
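As a rough illustration of the Layer Contribution idea (divergence between the model's output with and without a layer's transformation on the chosen tokens), the sketch below uses KL divergence between two output distributions and then picks the lowest-contribution layers. The use of KL specifically, the function names, and the fraction-based selection rule are assumptions; the abstract only says "divergence" and does not describe the selection procedure in detail.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def select_ineffective_layers(contributions, fraction):
    """Return sorted indices of the lowest-contribution layers.

    contributions: per-layer LC scores, e.g. KL between the original output
                   distribution and the one obtained with that layer's
                   visual-token update removed (assumed precomputed)
    fraction:      share of layers to mark as ineffective
    """
    n = int(len(contributions) * fraction)
    order = sorted(range(len(contributions)), key=lambda i: contributions[i])
    return sorted(order[:n])
```

In a ShortV-style setup, the returned layer indices would be the ones where visual token hidden states are passed through unchanged, so their attention and feed-forward computation for those tokens can be skipped.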

Improved Visual-Spatial Reasoning via R1-Zero-Like Training

Zhenyi Liao,Qingsong Xie,Yanhao Zhang,Zijian Kong,Haonan Lu,Zhenyu Yang,Zhijie Deng

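Task: Study how R1-Zero-like training can improve the visual-spatial reasoning of multimodal large language models (MLLMs).

Motivation: Video-based visual-spatial intelligence (VSI) is one of the core reasoning capabilities of MLLMs acting as AI agents in the physical world, but the visual-spatial reasoning of small- to medium-sized Qwen2-VL models cannot be activated through Chain of Thought (CoT) prompts.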

Details

Method: GRPO training is applied with the carefully curated VSI-100k dataset, and the KL penalty term is retained. Result: With 120 GPU hours, vsGRPO-2B improves on its base model by 12.1% and surpasses GPT-4o; vsGRPO-7B performs comparably to the best open-source model, LLaVA-NeXT-Video-72B. Conclusion: GRPO training shows a clear advantage for improving the visual-spatial reasoning of MLLMs, outperforming supervised fine-tuning and direct preference optimization baselines. Abstract: Increasing attention has been placed on improving the reasoning capacities of multi-modal large language models (MLLMs). As the cornerstone for AI agents that function in the physical realm, video-based visual-spatial intelligence (VSI) emerges as one of the most pivotal reasoning capabilities of MLLMs. This work conducts a first, in-depth study on improving the visual-spatial reasoning of MLLMs via R1-Zero-like training. Technically, we first identify that the visual-spatial reasoning capacities of small- to medium-sized Qwen2-VL models cannot be activated via Chain of Thought (CoT) prompts. We then incorporate GRPO training for improved visual-spatial reasoning, using the carefully curated VSI-100k dataset, following DeepSeek-R1-Zero. During the investigation, we identify the necessity to keep the KL penalty (even with a small value) in GRPO. With just 120 GPU hours, our vsGRPO-2B model, fine-tuned from Qwen2-VL-2B, can outperform the base model by 12.1% and surpass GPT-4o. Moreover, our vsGRPO-7B model, fine-tuned from Qwen2-VL-7B, achieves performance comparable to that of the best open-source model LLaVA-NeXT-Video-72B. Additionally, we compare vsGRPO to supervised fine-tuning and direct preference optimization baselines and observe strong performance superiority. The code and dataset will be available soon.
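A core ingredient of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt, rather than by a learned value function. The sketch below shows that group-relative normalization only; it is a generic illustration with a hypothetical function name, and it leaves out the policy-ratio clipping and the KL penalty term that the abstract says must be kept.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps); eps avoids
    division by zero when all rewards in the group are identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

In a full training loop these advantages would weight the per-token policy-gradient term, with a separate KL penalty against the reference model added to the loss.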

Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?

Kai Yan,Yufei Xu,Zhengyin Du,Xuesong Yao,Zheyu Wang,Xiaowen Guo,Jiecao Chen

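Task: Propose RoR-Bench, a multi-modal benchmark for detecting recitation behavior in LLMs on simple reasoning problems whose conditions have been subtly changed.

Motivation: Examine whether the remarkable reasoning ability of LLMs comes from genuine intelligence or from merely reciting Internet-scale solutions seen during training.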

Details

Method: An empirical analysis on RoR-Bench measures how LLMs perform when problem conditions are changed. Result: Current top LLMs exhibit severe recitation behavior under changed conditions, with performance dropping by as much as 60%. Conclusion: The findings urge the LLM community to re-evaluate the true intelligence level of cutting-edge LLMs. Abstract: The rapid escalation in the difficulty of LLM benchmarks in recent years, from elementary school-level to frontier problems, has woven a miracle for researchers: that we are only inches away from surpassing human intelligence. However, does the LLMs' remarkable reasoning ability indeed come from true intelligence by human standards, or are they simply reciting solutions witnessed during training at an Internet level? To study this problem, we propose RoR-Bench, a novel, multi-modal benchmark for detecting LLMs' recitation behavior when they are asked simple reasoning problems with subtly shifted conditions, and conduct empirical analysis on our benchmark. Surprisingly, we found that existing cutting-edge LLMs unanimously exhibit extremely severe recitation behavior; by changing one phrase in the condition, top models such as OpenAI-o1 and DeepSeek-R1 can suffer a 60% performance loss on elementary school-level arithmetic and reasoning problems. Such findings are a wake-up call to the LLM community that compels us to re-evaluate the true intelligence level of cutting-edge LLMs.

A Decade of Deep Learning for Remote Sensing Spatiotemporal Fusion: Advances, Challenges, and Opportunities

Enzhe Sun,Yongchuan Cui,Peng Liu,Jining Yan

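Task: Systematically review the past decade of deep learning for remote sensing spatiotemporal fusion (STF).

Motivation: Hardware limitations and satellite launch costs make it challenging to acquire remote sensing imagery with high spatial and temporal resolution directly; STF addresses this by fusing imagery of different resolutions. Deep learning significantly outperforms traditional methods in STF, but a systematic review has been lacking.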

Details

Method: The paper reviews deep learning developments in STF by analyzing key research trends, method classifications, commonly used datasets, and evaluation metrics. Result: It summarizes deep learning applications in STF and discusses the main challenges in existing research and future research directions. Conclusion: The paper gives researchers a systematic review of deep learning in STF and points to future directions intended to inspire new ideas. Abstract: Hardware limitations and satellite launch costs make direct acquisition of high temporal-spatial resolution remote sensing imagery challenging. Remote sensing spatiotemporal fusion (STF) technology addresses this problem by merging high temporal but low spatial resolution imagery with high spatial but low temporal resolution imagery to efficiently generate high spatiotemporal resolution satellite images. STF provides unprecedented observational capabilities for land surface change monitoring, agricultural management, and environmental research. Deep learning (DL) methods have revolutionized the remote sensing spatiotemporal fusion field over the past decade through powerful automatic feature extraction and nonlinear modeling capabilities, significantly outperforming traditional methods in handling complex spatiotemporal data. Despite the rapid development of DL-based remote sensing STF, the community lacks a systematic review of this quickly evolving field. This paper comprehensively reviews DL developments in remote sensing STF over the last decade, analyzing key research trends, method classifications, commonly used datasets, and evaluation metrics. It discusses major challenges in existing research and identifies promising future research directions as references for researchers in this field to inspire new ideas. The specific models, datasets, and other information mentioned in this article have been collected in: https://github.com/yc-cui/Deep-Learning-Spatiotemporal-Fusion-Survey.

SRLCG: Self-Rectified Large-Scale Code Generation with Multidimensional Chain-of-Thought and Dynamic Backtracking

Hongru Ma,Yanjie Liang,Jiasheng Si,Weiyu Zhang,Hongjiao Guan,Chaoqun Zheng,Bing Xu,Wenpeng Lu

Task: Propose SRLCG, a framework that generates complete multi-file project code from a single prompt.

Motivation: Large language models (LLMs) mainly produce isolated code snippets, which leaves non-expert users unable to generate a complete project.

Details

Method: Uses a multidimensional chain-of-thought (CoT) and a self-rectification mechanism, combined with a dynamic backtracking algorithm, to generate correct and robust code files and integrate them into a complete project. Result: Experiments show that SRLCG generates code 15x longer than DeepSeek-V3, 16x longer than GPT-4, and at least 10x longer than other CoT-based baselines, while outperforming the baselines in correctness, robustness, and performance. Conclusion: SRLCG effectively supports non-expert users in generating large-scale, complete project code, and markedly improves the quality and scale of code generation. Abstract: Large language models (LLMs) have revolutionized code generation, significantly enhancing developer productivity. However, for a vast number of users with minimal coding knowledge, LLMs provide little support, as they primarily generate isolated code snippets rather than complete, large-scale project code. Without coding expertise, these users struggle to interpret, modify, and iteratively refine the outputs of LLMs, making it impossible to assemble a complete project. To address this issue, we propose Self-Rectified Large-Scale Code Generator (SRLCG), a framework that generates complete multi-file project code from a single prompt. SRLCG employs a novel multidimensional chain-of-thought (CoT) and self-rectification to guide LLMs in generating correct and robust code files, then integrates them into a complete and coherent project using our proposed dynamic backtracking algorithm. Experimental results show that SRLCG generates code 15x longer than DeepSeek-V3, 16x longer than GPT-4, and at least 10x longer than other leading CoT-based baselines. Furthermore, they confirm its improved correctness, robustness, and performance compared to baselines in large-scale code generation.

DBF-UNet: A Two-Stage Framework for Carotid Artery Segmentation with Pseudo-Label Generation

Haoxuan Li,Wei Song,Aofan Liu,Peiwu Qin

Task: Propose a two-stage segmentation framework that addresses sparse annotation in 3D carotid artery segmentation.

Motivation: In medical image analysis, 3D carotid artery segmentation suffers from insufficient annotated data; existing datasets contain expert annotations for only a subset of slices.

Details

Method: The first stage builds continuous vessel centerlines by interpolation and propagates labels along them; the second stage introduces the lightweight DBF-UNet architecture, which combines bidirectional feature fusion with multi-scale feature aggregation. Result: Validation on public datasets shows that the method effectively handles sparse annotations and outperforms existing approaches. Conclusion: The proposed two-stage framework and DBF-UNet perform well on 3D carotid artery segmentation and address the shortage of annotated data. Abstract: Medical image analysis faces significant challenges due to limited annotation data, particularly in three-dimensional carotid artery segmentation tasks, where existing datasets exhibit spatially discontinuous slice annotations with only a small portion of expert-labeled slices in complete 3D volumetric data. To address this challenge, we propose a two-stage segmentation framework. First, we construct continuous vessel centerlines by interpolating between annotated slice centroids and propagate labels along these centerlines to generate interpolated annotations for unlabeled slices. The slices with expert annotations are used for fine-tuning SAM-Med2D, while the interpolated labels on unlabeled slices serve as prompts to guide segmentation during inference. In the second stage, we propose a novel Dense Bidirectional Feature Fusion UNet (DBF-UNet). This lightweight architecture achieves precise segmentation of complete 3D vascular structures. The network incorporates bidirectional feature fusion in the encoder and integrates multi-scale feature aggregation with dense connectivity for effective feature reuse. Experimental validation on public datasets demonstrates that our proposed method effectively addresses the sparse annotation challenge in carotid artery segmentation while achieving superior performance compared to existing approaches. The source code is available at https://github.com/Haoxuanli-Thu/DBF-UNet.
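
The first-stage idea of interpolating between annotated slice centroids can be illustrated with a minimal sketch. It assumes plain linear interpolation along the slice axis and a hypothetical `interpolate_centerline` helper; the abstract does not state which interpolation scheme the authors use, so this is not their implementation.

```python
def interpolate_centerline(annotated):
    """Fill in a centroid for every slice between annotated ones.

    annotated: dict mapping slice index z -> (x, y) centroid of an
    expert-labeled mask (hypothetical input format).
    Linear interpolation is an assumption made for this sketch.
    """
    zs = sorted(annotated)
    centerline = {}
    for z0, z1 in zip(zs, zs[1:]):
        (x0, y0), (x1, y1) = annotated[z0], annotated[z1]
        for z in range(z0, z1):
            t = (z - z0) / (z1 - z0)  # fraction of the way from z0 to z1
            centerline[z] = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    # The last annotated slice is not covered by the loop above
    centerline[zs[-1]] = annotated[zs[-1]]
    return centerline


# Two annotated slices (z = 0 and z = 4); slices 1-3 are unlabeled
sparse = {0: (10.0, 20.0), 4: (14.0, 28.0)}
dense = interpolate_centerline(sparse)
```

The interpolated centroids could then serve as point prompts for the unlabeled slices, which is the role the abstract assigns to the interpolated labels.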

AgentNet: Decentralized Evolutionary Coordination for LLM-based Multi-Agent Systems

Yingxuan Yang,Huacan Chai,Shuai Shao,Yuanyi Song,Siyuan Qi,Renting Rui,Weinan Zhang

Task: Propose AgentNet, a decentralized framework that addresses the scalability, adaptability, and privacy problems of centralized coordination in multi-agent systems.

Motivation: Existing LLM-based multi-agent systems rely on centralized coordination, which causes scalability bottlenecks, limits adaptability, and creates single points of failure; privacy and knowledge-sharing concerns also hinder cross-organizational collaboration.

Details

Method: A decentralized, Retrieval-Augmented Generation (RAG)-based framework with a Directed Acyclic Graph (DAG) structure that lets agents adjust their connections dynamically, evolve autonomously, and collaborate. Result: Through its decentralized design, AgentNet improves fault tolerance, enables scalable specialization, and supports privacy-preserving collaboration across organizations. Conclusion: Through decentralized coordination and minimal data exchange, AgentNet overcomes the limitations of centralized systems while protecting sensitive information. Abstract: The rapid advancement of Large Language Models (LLMs) has catalyzed the development of multi-agent systems, where multiple LLM-based agents collaborate to solve complex tasks. However, existing systems predominantly rely on centralized coordination, which introduces scalability bottlenecks, limits adaptability, and creates single points of failure. Additionally, concerns over privacy and proprietary knowledge sharing hinder cross-organizational collaboration, leading to siloed expertise. To address these challenges, we propose AgentNet, a decentralized, Retrieval-Augmented Generation (RAG)-based framework that enables LLM-based agents to autonomously evolve their capabilities and collaborate efficiently in a Directed Acyclic Graph (DAG)-structured network. Unlike traditional multi-agent systems that depend on static role assignments or centralized control, AgentNet allows agents to specialize dynamically, adjust their connectivity, and route tasks without relying on predefined workflows. AgentNet's core design is built upon several key innovations: (1) Fully Decentralized Paradigm: Removing the central orchestrator, allowing agents to coordinate and specialize autonomously, fostering fault tolerance and emergent collective intelligence. (2) Dynamically Evolving Graph Topology: Real-time adaptation of agent connections based on task demands, ensuring scalability and resilience. (3) Adaptive Learning for Expertise Refinement: A retrieval-based memory system that enables agents to continuously update and refine their specialized skills. By eliminating centralized control, AgentNet enhances fault tolerance, promotes scalable specialization, and enables privacy-preserving collaboration across organizations. Through decentralized coordination and minimal data exchange, agents can leverage diverse knowledge sources while safeguarding sensitive information.

WikiVideo: Article Generation from Multiple Videos

Alexander Martin,Reno Kriz,William Gantt Walden,Kate Sanders,Hannah Recknor,Eugene Yang,Francis Ferraro,Benjamin Van Durme

Task: Automatically create a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections.

Motivation: Videos are an intuitive source for retrieval-augmented generation (RAG), but existing methods focus mainly on text or on low-level scene understanding and neglect high-level event semantics.

Details

Method: Introduces the WikiVideo benchmark and Collaborative Article Generation (CAG), an interactive article generation method that combines a reasoning model with a VideoLLM. Result: CAG outperforms existing methods in the experiments, showing its potential for integrating multimodal information. Conclusion: CAG offers a new direction for video-driven RAG and points to possible paths for future research. Abstract: We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text and existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.

Automated Explanation of Machine Learning Models of Footballing Actions in Words

Pegah Rahimian,Jernej Flisar,David Sumpter

Task: Propose a method that uses large language models to generate descriptions of football shots (wordalizations), bridging the gap between machine learning practice and how coaching staff communicate.

Motivation: The insights that machine learning models provide in football analytics are often not intuitive; coaches and practitioners need information that is easier to understand and act on.

Details

Method: First builds an expected goals model based on logistic regression, then uses the regression coefficients to write sentences describing the factors behind each shot, and finally uses a large language model to produce an entertaining description of the shot. Result: An open-source interactive application that presents shot descriptions from recent tournaments, together with a discussion of the method's potential in coaching and football commentary. Conclusion: The method can improve communication in football analytics and extends to the analysis of other football actions. Abstract: While football analytics has changed the way teams and analysts assess performance, there remains a communication gap between machine learning practice and how coaching staff talk about football. Coaches and practitioners require actionable insights, which are not always provided by models. To bridge this gap, we show how to build wordalizations (a novel approach that leverages large language models) for shots in football. Specifically, we first build an expected goals model using logistic regression. We then use the coefficients of this regression model to write sentences describing how factors (such as distance, angle and defensive pressure) contribute to the model's prediction. Finally, we use large language models to give an entertaining description of the shot. We describe our approach in a model card and provide an interactive open-source application describing shots in recent tournaments. We discuss how shot wordalizations might aid communication in coaching and football commentary, and give a further example of how the same approach can be applied to other actions in football.
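
The first two steps of the pipeline (a logistic-regression expected goals model, then sentences derived from its coefficients) can be sketched as follows. The coefficient values, feature names, and sentence template are all hypothetical and chosen only for illustration; the abstract does not give the fitted model or the authors' wording rules.

```python
import math

# Hypothetical intercept and coefficients, NOT values from the paper
INTERCEPT = -1.0
COEFFS = {"distance_m": -0.10, "angle_rad": 1.2, "defenders_nearby": -0.3}


def expected_goal(shot):
    """Logistic regression: probability of scoring from the shot features."""
    z = INTERCEPT + sum(COEFFS[name] * shot[name] for name in COEFFS)
    return 1.0 / (1.0 + math.exp(-z))


def describe(shot):
    """One sentence per feature, based on its contribution to the log-odds."""
    sentences = []
    for name, coef in COEFFS.items():
        contribution = coef * shot[name]
        if contribution > 0:
            effect = "raised"
        elif contribution < 0:
            effect = "lowered"
        else:
            effect = "did not change"
        sentences.append(
            f"{name} {effect} the chance of scoring "
            f"(log-odds contribution {contribution:+.2f})."
        )
    return sentences


shot = {"distance_m": 10, "angle_rad": 0.5, "defenders_nearby": 1}
xg = expected_goal(shot)
sentences = describe(shot)
```

In the approach described above, sentences like these would be passed to a large language model, which turns them into a more engaging description of the shot.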

Graph Classification and Radiomics Signature for Identification of Tuberculous Meningitis

Snigdha Agarwal,Ganaraja V H,Neelam Sinha,Abhilasha Indoria,Netravathi M,Jitender Saini

Task: This study aims to classify Tuberculous meningitis (TBM) patients using T1-weighted non-contrast MRI scans.

Motivation: Diagnosis of TBM often requires invasive procedures like lumbar puncture, and the study seeks a non-invasive alternative by identifying visual markers in specific brain regions.

Details

Method: The study proposes a novel Pixel-array Graphs Classifier (PAG-Classifier) that uses spatial relationships between 3D pixels in a graph-based framework, along with radiomics-based methodology for feature extraction and classification. Result: The PAG-Classifier achieved an average F1 score of 85.71% for cistern regions, while the radiomics features classifier achieved 92.85%, surpassing benchmarks by 15% and 22%, respectively. Bone and corpus callosum regions performed poorly (F1 scores below 50%). Conclusion: The PAG-Classifier is effective for non-invasive TBM analysis, particularly targeting the interpeduncular cistern, but bone and corpus callosum regions lack distinctive patterns for differentiation. Abstract: Introduction: Tuberculous meningitis (TBM) is a serious brain infection caused by Mycobacterium tuberculosis, characterized by inflammation of the meninges covering the brain and spinal cord. Diagnosis often requires invasive lumbar puncture (LP) and cerebrospinal fluid (CSF) analysis. Objectives: This study aims to classify TBM patients using T1-weighted (T1w) non-contrast Magnetic Resonance Imaging (MRI) scans. We hypothesize that specific brain regions, such as the interpeduncular cisterns, bone, and corpus callosum, contain visual markers that can non-invasively distinguish TBM patients from healthy controls. We propose a novel Pixel-array Graphs Classifier (PAG-Classifier) that leverages spatial relationships between neighbouring 3D pixels in a graph-based framework to extract significant features through eigen decomposition. These features are then used to train machine learning classifiers for effective patient classification. We validate our approach using a radiomics-based methodology, classifying TBM patients based on relevant radiomics features. Results: We utilized an internal dataset consisting of 52 scans, 32 from confirmed TBM patients based on mycobacteria detection in CSF, and 20 from healthy individuals. 
We achieved a 5-fold cross-validated average F1 score of 85.71% for cistern regions with our PAG-Classifier and 92.85% with the radiomics features classifier, surpassing current state-of-the-art benchmarks by 15% and 22%, respectively. However, bone and corpus callosum regions showed poor classification effectiveness, with average F1 scores below 50%. Conclusion: Our study suggests that algorithms like the PAG-Classifier serve as effective tools for non-invasive TBM analysis, particularly by targeting the interpeduncular cistern. Findings indicate that the bone and corpus callosum regions lack distinctive patterns for differentiation.

CrackSQL: A Hybrid SQL Dialect Translation System Powered by Large Language Models

Wei Zhou,Yuyang Gao,Xuanhe Zhou,Guoliang Li

Task: Develop CrackSQL, a hybrid SQL dialect translation system that combines rule-based and LLM-based methods to translate SQL queries between different database systems.

Motivation: Existing approaches (manual rewriting, rule-based systems, or LLM-based techniques) either carry high maintenance costs or produce unreliable results, especially for complex queries.

Details

Method: CrackSQL combines rule-based and LLM-based methods, using functionality-based query processing to segment queries, a cross-dialect syntax embedding model, and an adaptive local-to-global translation strategy to improve translation accuracy and robustness. Result: CrackSQL supports three translation modes and offers multiple deployment and access options (a web console, a PyPI package, and a command-line prompt), making it suitable for a variety of real-world use cases. Conclusion: With its hybrid approach, CrackSQL overcomes the limitations of existing SQL dialect translation methods and improves translation accuracy and practicality. Abstract: Dialect translation plays a key role in enabling seamless interaction across heterogeneous database systems. However, translating SQL queries between different dialects (e.g., from PostgreSQL to MySQL) remains a challenging task due to syntactic discrepancies and subtle semantic variations. Existing approaches including manual rewriting, rule-based systems, and large language model (LLM)-based techniques often involve high maintenance effort (e.g., crafting custom translation rules) or produce unreliable results (e.g., LLM generates non-existent functions), especially when handling complex queries. In this demonstration, we present CrackSQL, the first hybrid SQL dialect translation system that combines rule and LLM-based methods to overcome these limitations. CrackSQL leverages the adaptability of LLMs to minimize manual intervention, while enhancing translation accuracy by segmenting lengthy complex SQL via functionality-based query processing. To further improve robustness, it incorporates a novel cross-dialect syntax embedding model for precise syntax alignment, as well as an adaptive local-to-global translation strategy that effectively resolves interdependent query operations. CrackSQL supports three translation modes and offers multiple deployment and access options including a web console interface, a PyPI package, and a command-line prompt, facilitating adoption across a variety of real-world use cases.

GKAN: Explainable Diagnosis of Alzheimer's Disease Using Graph Neural Network with Kolmogorov-Arnold Networks

Tianqi Ding,Dawei Xiang,Keith E Schubert,Liang Dong

Task: Propose GCN-KAN, a novel single-modal framework that integrates Kolmogorov-Arnold Networks (KAN) into Graph Convolutional Networks (GCNs) to improve the accuracy and interpretability of Alzheimer's Disease (AD) diagnosis.

Motivation: Alzheimer's Disease (AD) is a complex neurodegenerative disorder that is challenging to diagnose. Existing GCNs show promise for modeling brain connectivity, but their reliance on linear transformations limits their ability to capture complex nonlinear patterns in neuroimaging data.

Details

Method: The GCN-KAN framework combines KAN with GCNs and uses learnable spline-based transformations to better represent interactions between brain regions. Result: On the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, GCN-KAN improves classification accuracy over traditional GCNs by 4-8% and provides interpretable insights into key brain regions associated with AD. Conclusion: GCN-KAN offers a robust and explainable tool for early AD diagnosis. Abstract: Alzheimer's Disease (AD) is a progressive neurodegenerative disorder that poses significant diagnostic challenges due to its complex etiology. Graph Convolutional Networks (GCNs) have shown promise in modeling brain connectivity for AD diagnosis, yet their reliance on linear transformations limits their ability to capture intricate nonlinear patterns in neuroimaging data. To address this, we propose GCN-KAN, a novel single-modal framework that integrates Kolmogorov-Arnold Networks (KAN) into GCNs to enhance both diagnostic accuracy and interpretability. Leveraging structural MRI data, our model employs learnable spline-based transformations to better represent brain region interactions. Evaluated on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, GCN-KAN outperforms traditional GCNs by 4-8% in classification accuracy while providing interpretable insights into key brain regions associated with AD. This approach offers a robust and explainable tool for early AD diagnosis.

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Saaket Agashe,Kyle Wong,Vincent Tu,Jiachen Yang,Ang Li,Xin Eric Wang

Task: Develop Agent S2, a compositional framework for automating tasks that involve interacting with graphical user interfaces (GUIs), in order to enhance human productivity.

Motivation: Current agents face significant challenges in grounding GUI elements, planning long-horizon tasks, and relying on a single generalist model, all of which limit their performance and applicability.

Details

Method: Uses a Mixture-of-Grounding technique for precise GUI localization and introduces Proactive Hierarchical Planning to refine action plans dynamically at multiple temporal scales. Result: Agent S2 achieves state-of-the-art performance on three computer use benchmarks, with substantial relative improvements (for example, 18.9% and 32.7% on the OSWorld 15-step and 50-step evaluations, respectively). Conclusion: By composing generalist and specialist models, Agent S2 markedly improves automation of GUI tasks and generalizes well. Abstract: Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries. However, current agents face significant challenges: imprecise grounding of GUI elements, difficulties with long-horizon task planning, and performance bottlenecks from relying on single generalist models for diverse cognitive tasks. To this end, we introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a novel Mixture-of-Grounding technique to achieve precise GUI localization and introduce Proactive Hierarchical Planning, dynamically refining action plans at multiple temporal scales in response to evolving observations. Evaluations demonstrate that Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves 18.9% and 32.7% relative improvements over leading baseline agents such as Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluation. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing previous best methods by 52.8% on WindowsAgentArena and by 16.52% on AndroidWorld relatively. Code available at https://github.com/simular-ai/Agent-S.

Neural Pruning for 3D Scene Reconstruction: Efficient NeRF Acceleration

Tianqi Ding,Dawei Xiang,Pablo Rivas,Liang Dong

Task: Study neural pruning as a strategy for improving the training efficiency of NeRF models.

Motivation: NeRF models take a long time to train and consume substantial resources, so a way to reduce model size and speed up training is needed.

Details

Method: Compares pruning approaches, including uniform sampling, importance-based methods, and coreset-based techniques. Result: Coreset-driven pruning achieves a 50% reduction in model size and a 35% training speedup with only a small drop in accuracy. Conclusion: Pruning is an effective way to improve the efficiency of NeRF models in resource-limited settings. Abstract: Neural Radiance Fields (NeRF) have become a popular 3D reconstruction approach in recent years. While they produce high-quality results, they also demand lengthy training times, often spanning days. This paper studies neural pruning as a strategy to address these concerns. We compare pruning approaches, including uniform sampling, importance-based methods, and coreset-based techniques, to reduce the model size and speed up training. Our findings show that coreset-driven pruning can achieve a 50% reduction in model size and a 35% speedup in training, with only a slight decrease in accuracy. These results suggest that pruning can be an effective method for improving the efficiency of NeRF models in resource-limited settings.

IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval

Bangwei Liu,Yicheng Bao,Shaohui Lin,Xuhong Wang,Xin Tan,Yingchun Wang,Yuan Xie,Chaochao Lu

Task: Design and evaluate Instance-Driven Multimodal Image Retrieval (IDMR), a new task that requires models to retrieve images containing the same instance as a query image while matching a text-described scenario.

Motivation: Current multimodal retrieval tasks lack complexity and practical application value; a task that demands finer-grained, instance-level consistency is needed.

Details

Method: Defines the IDMR task, builds the IDMR-bench benchmark, generates training data with a cross-domain synthesis method, and builds a retrieval model based on a Multimodal Large Language Model (MLLM). Result: The MLLM-based model outperforms existing methods on both traditional benchmarks and the zero-shot IDMR-bench, showing the potential of instance-aware retrieval. Conclusion: IDMR fills a gap in multimodal retrieval, and the MLLM-based model performs well at instance-level retrieval, opening a new direction for advanced retrieval applications. Abstract: Multimodal retrieval systems are becoming increasingly vital for cutting-edge AI technologies, such as embodied AI and AI-driven digital content industries. However, current multimodal retrieval tasks lack sufficient complexity and demonstrate limited practical application value. This inspires us to design Instance-Driven Multimodal Image Retrieval (IDMR), a novel task that requires models to retrieve images containing the same instance as a query image while matching a text-described scenario. Unlike existing retrieval tasks focused on global image similarity or category-level matching, IDMR demands fine-grained instance-level consistency across diverse contexts. To benchmark this capability, we develop IDMR-bench using real-world object tracking and first-person video data. Addressing the scarcity of training data, we propose a cross-domain synthesis method that creates 557K training samples by cropping objects from standard detection datasets. Our Multimodal Large Language Model (MLLM) based retrieval model, trained on 1.2M samples, outperforms state-of-the-art approaches on both traditional benchmarks and our zero-shot IDMR-bench. Experimental results demonstrate previous models' limitations in instance-aware retrieval and highlight the potential of MLLM for advanced retrieval applications. The full training dataset, code, and models in a wide range of sizes are available at https://github.com/BwLiu01/IDMR.

Artificial Intelligence-Assisted Prostate Cancer Diagnosis for Reduced Use of Immunohistochemistry

Anders Blilie,Nita Mulliqi,Xiaoyi Ji,Kelvin Szolnoky,Sol Erika Boman,Matteo Titus,Geraldine Martinez Gonzalez,José Asenjo,Marcello Gambacorta,Paolo Libretti,Einar Gudlaugsson,Svein R. Kjosavik,Lars Egevad,Emiel A. M. Janssen,Martin Eklund,Kimmo Kartasalo

Task: Evaluate an AI model's ability to reduce the need for immunohistochemical staining (IHC) in prostate cancer diagnosis.

Motivation: Prostate cancer diagnosis relies on histopathological evaluation, which is subject to variability, and IHC adds work and cost. AI may reduce reliance on IHC by accurately classifying atypical glands and borderline morphologies in H&E-stained sections.

Details

Method: Retrospective analysis of prostate core needle biopsies from three pathology sites, evaluating the AI model's ability to detect cancer in H&E-stained slides. Result: The AI model reaches AUC values of 0.951-0.993 for detecting cancer in H&E-stained slides; applying sensitivity-prioritized diagnostic thresholds reduces the need for IHC by 44.4%, 42.0%, and 20.7% in the three cohorts, with no false negative predictions. Conclusion: The AI model has the potential to optimize IHC use, streamline decision-making in prostate pathology, and ease resource burdens. Abstract: Prostate cancer diagnosis heavily relies on histopathological evaluation, which is subject to variability. While immunohistochemical staining (IHC) assists in distinguishing benign from malignant tissue, it involves increased work, higher costs, and diagnostic delays. Artificial intelligence (AI) presents a promising solution to reduce reliance on IHC by accurately classifying atypical glands and borderline morphologies in hematoxylin & eosin (H&E) stained tissue sections. In this study, we evaluated an AI model's ability to minimize IHC use without compromising diagnostic accuracy by retrospectively analyzing prostate core needle biopsies from routine diagnostics at three different pathology sites. These cohorts were composed exclusively of difficult cases where the diagnosing pathologists required IHC to finalize the diagnosis. The AI model demonstrated area under the curve values of 0.951-0.993 for detecting cancer in routine H&E-stained slides. Applying sensitivity-prioritized diagnostic thresholds reduced the need for IHC staining by 44.4%, 42.0%, and 20.7% in the three cohorts investigated, without a single false negative prediction. This AI model shows potential for optimizing IHC use, streamlining decision-making in prostate pathology, and alleviating resource burdens.
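
To make the idea of a sensitivity-prioritized threshold concrete, the sketch below picks the highest score cut-off that still keeps every cancer case above it, then counts how many cases fall below and could be called benign without IHC. This is an illustrative reading only: the function names, the toy scores, and the rule that below-threshold cases skip IHC are assumptions, and the abstract does not describe how the study's thresholds were actually derived.

```python
def sensitivity_first_threshold(scores, labels):
    """Highest threshold with no false negatives on this data.

    scores: model probabilities of cancer; labels: 1 = cancer, 0 = benign.
    Assumes at least one cancer case is present.
    """
    return min(s for s, y in zip(scores, labels) if y == 1)


def ihc_reduction(scores, labels):
    """Return (threshold, fraction of cases scored below it).

    Assumption for this sketch: cases below the threshold are reported
    as benign without ordering IHC; all others still receive IHC.
    """
    thr = sensitivity_first_threshold(scores, labels)
    spared = sum(1 for s in scores if s < thr)
    return thr, spared / len(scores)


# Toy data, not from the study
scores = [0.1, 0.2, 0.4, 0.6, 0.9]
labels = [0, 0, 1, 0, 1]
thr, fraction_spared = ihc_reduction(scores, labels)
```

In practice such a threshold would be fixed on separate tuning data and then checked on held-out cohorts, since a threshold chosen on the evaluation data itself is guaranteed to show zero false negatives.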

SuperDec: 3D Scene Decomposition with Superquadric Primitives

Elisabetta Fedele,Boyang Sun,Leonidas Guibas,Marc Pollefeys,Francis Engelmann

Task: Propose SuperDec, a method that creates compact 3D scene representations by decomposing scenes into superquadric primitives.

Motivation: Most recent work uses geometric primitives to obtain photorealistic 3D scene representations; this paper instead uses them to obtain a compact yet expressive representation.

Details

Method: Solves the problem locally on individual objects and relies on instance segmentation methods to scale to full 3D scenes, with a new architecture that efficiently decomposes point clouds of arbitrary objects into a compact set of superquadrics. Result: The architecture is trained on ShapeNet, and its generalization is validated on the ScanNet++ dataset and on Replica scenes. Conclusion: A compact superquadric-based representation is useful for a range of downstream applications, such as robotic tasks and controllable visual content generation and editing. Abstract: We present SuperDec, an approach for creating compact 3D scene representations via decomposition into superquadric primitives. While most recent works leverage geometric primitives to obtain photorealistic 3D scene representations, we propose to leverage them to obtain a compact yet expressive representation. We propose to solve the problem locally on individual objects and leverage the capabilities of instance segmentation methods to scale our solution to full 3D scenes. In doing that, we design a new architecture which efficiently decomposes point clouds of arbitrary objects into a compact set of superquadrics. We train our architecture on ShapeNet and we demonstrate its generalization capabilities on object instances extracted from the ScanNet++ dataset as well as on full Replica scenes. Finally, we show how a compact representation based on superquadrics can be useful for a diverse range of downstream applications, including robotic tasks and controllable visual content generation and editing.
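
For readers unfamiliar with superquadrics, the sketch below evaluates the standard inside-outside function of a superquadric in its own canonical frame. It shows why the primitive is compact: three scales and two shape exponents (plus a pose, omitted here) describe a whole family of shapes. The parameter names follow the usual textbook convention and are not taken from the SuperDec paper.

```python
def superquadric_f(x, y, z, a1, a2, a3, e1, e2):
    """Standard superquadric inside-outside function in the canonical frame.

    a1, a2, a3: scales along x, y, z; e1, e2: shape exponents.
    Returns < 1 inside the surface, 1 on it, and > 1 outside.
    Rotation and translation of the primitive are left out of this sketch.
    """
    xy_term = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    z_term = abs(z / a3) ** (2.0 / e1)
    return xy_term + z_term


# With e1 = e2 = 1 the superquadric is an ellipsoid (here, a unit sphere)
on_surface = superquadric_f(1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0)
inside = superquadric_f(0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0)
outside = superquadric_f(2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0)
```

Exponents close to 0 give box-like shapes and exponents of 1 give ellipsoids, so a small set of such primitives can approximate many everyday objects.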

TurboFill: Adapting Few-step Text-to-image Model for Fast Image Inpainting

Liangbin Xie,Daniil Pakhomov,Zhonghao Wang,Zongze Wu,Ziyan Chen,Yuqian Zhou,Haitian Zheng,Zhifei Zhang,Zhe Lin,Jiantao Zhou,Chao Dong

Task: Introduce TurboFill, a fast image inpainting model that enhances a few-step text-to-image diffusion model to achieve high-quality, efficient inpainting.

Motivation: Standard diffusion models generate high-quality results but at high computational cost. TurboFill addresses this by training an inpainting adapter.

Details

Method: Trains an inpainting adapter on DMD2, a few-step distilled text-to-image model, using a novel 3-step adversarial training scheme that keeps inpainted regions realistic, structurally consistent, and visually harmonious. Result: TurboFill performs strongly on DilationBench and HumanBench, outperforming multi-step BrushNet and few-step inpainting methods and setting a new benchmark for high-performance inpainting. Conclusion: With an efficient inpainting adapter and an adversarial training scheme, TurboFill achieves high-quality, fast image inpainting. Abstract: This paper introduces TurboFill, a fast image inpainting model that enhances a few-step text-to-image diffusion model with an inpainting adapter for high-quality and efficient inpainting. While standard diffusion models generate high-quality results, they incur high computational costs. We overcome this by training an inpainting adapter on a few-step distilled text-to-image model, DMD2, using a novel 3-step adversarial training scheme to ensure realistic, structurally consistent, and visually harmonious inpainted regions. To evaluate TurboFill, we propose two benchmarks: DilationBench, which tests performance across mask sizes, and HumanBench, based on human feedback for complex prompts. Experiments show that TurboFill outperforms both multi-step BrushNet and few-step inpainting methods, setting a new benchmark for high-performance inpainting tasks. Our project page: https://liangbinxie.github.io/projects/TurboFill/

MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

Siyuan Li,Luyuan Zhang,Zedong Wang,Juanxi Tian,Cheng Tan,Zicheng Liu,Chang Yu,Qingsong Xie,Haonan Lu,Haoqian Wang,Zhen Lei

Task: Propose MergeVQ, which incorporates token merging into VQ-based generative models to address the trade-off between generation quality on one side and representation learning and efficiency on the other.

Motivation: Existing methods struggle to balance generation quality against representation learning and efficiency in a shared latent space; MergeVQ aims to overcome this limitation.

Details

Method: During pre-training, MergeVQ decouples top-k semantics with a token merge module, combines this with LFQ and global alignment, and recovers fine details through cross-attention in the decoder; for generation, it introduces MergeAR, which performs KV Cache compression. Result: Experiments on ImageNet show that MergeVQ performs well in both visual representation learning and image generation while maintaining efficient token usage and inference speed. Conclusion: With a unified architecture, MergeVQ bridges the gap between image generation and visual representation learning, offering a new solution for VQ-based models. Abstract: Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in both self-supervised pre-training and image generation. However, most existing methods struggle to address the trade-off in shared latent space for generation quality vs. representation learning and efficiency. To push the limits of this paradigm, we propose MergeVQ, which incorporates token merging techniques into VQ-based generative models to bridge the gap between image generation and visual representation learning in a unified architecture. During pre-training, MergeVQ decouples top-k semantics from latent space with the token merge module after self-attention blocks in the encoder for subsequent Look-up Free Quantization (LFQ) and global alignment and recovers their fine-grained details through cross-attention in the decoder for reconstruction. As for the second-stage generation, we introduce MergeAR, which performs KV Cache compression for efficient raster-order prediction. Extensive experiments on ImageNet verify that MergeVQ as an AR generative model achieves competitive performance in both visual representation learning and image generation tasks while maintaining favorable token efficiency and inference speed. The code and model will be available at https://apexgen-x.github.io/MergeVQ.
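
Look-up Free Quantization is commonly described as quantizing each latent channel independently by its sign, so that the token index can be read off as a binary number without a nearest-neighbour search in a codebook. The sketch below assumes that common formulation; the abstract does not specify the exact variant used in MergeVQ, and `lfq_quantize` is a hypothetical name.

```python
def lfq_quantize(latent):
    """Sign-based quantization of one latent vector (assumed LFQ form).

    Each channel becomes +1.0 or -1.0, and the positive channels set
    the corresponding bits of the integer token index.
    """
    codes = [1.0 if v > 0 else -1.0 for v in latent]
    index = sum(1 << i for i, v in enumerate(latent) if v > 0)
    return codes, index


# A 3-channel latent gives an implicit codebook of 2**3 = 8 entries
codes, index = lfq_quantize([0.3, -1.2, 2.0])
```

Because the codebook is implicit, its size grows as a power of two in the number of channels without storing any embedding table.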

Enhancing 3T BOLD fMRI SNR using Unpaired 7T Data with Schrödinger Bridge Diffusion

Yujian Xiong,Xuanzhao Dong,Sebastian Waz,Wenhui Zhu,Negar Mallak,Zhong-lin Lu,Yalin Wang

Task: Propose a new framework that aligns 7T and 3T fMRI data in a shared parametric domain and applies an unpaired Brain Disk Schrödinger Bridge diffusion model to improve the spatiotemporal resolution and signal-to-noise ratio (SNR) of 3T data.

Motivation: Because 7T MRI systems are scarce, most research relies on 3T systems, which have lower spatiotemporal resolution and SNR; the quality of 3T data therefore needs to be raised toward the 7T level.

Details

Method: A new framework that aligns 7T and 3T fMRI data and applies an unpaired Brain Disk Schrödinger Bridge diffusion model. Result: In the enhanced 3T data, the SNR and the goodness-of-fit of the population receptive field (pRF) model improve significantly, approaching 7T quality. Conclusion: The method effectively improves the quality of 3T fMRI data toward the 7T level, addressing the limited availability of 7T data. Abstract: High spatial and temporal resolution, coupled with a strong signal-to-noise ratio (SNR), has made BOLD 7 Tesla fMRI an invaluable tool for understanding how the brain processes visual stimuli. However, the limited availability of 7T MRI systems means that most research relies on 3T MRI systems, which offer lower spatial and temporal resolution and SNR. This naturally raises the question: Can we enhance the spatiotemporal resolution and SNR of 3T BOLD fMRI data to approximate 7T quality? In this study, we propose a novel framework that aligns 7T and 3T fMRI data from different subjects and datasets in a shared parametric domain. We then apply an unpaired Brain Disk Schrödinger Bridge diffusion model to enhance the spatiotemporal resolution and SNR of the 3T data. Our approach addresses the challenge of limited 7T data by improving the 3T scan quality. We demonstrate its effectiveness by testing it on two distinct fMRI retinotopy datasets (one 7T and one 3T), as well as synthetic data. The results show that our method significantly improves the SNR and goodness-of-fit of the population receptive field (pRF) model in the enhanced 3T data, making it comparable to 7T quality. The code will be available on GitHub.

IntrinsiX: High-Quality PBR Generation using Image Priors

Peter Kocsis,Lukas Höllein,Matthias Nießner

Task: Propose IntrinsiX, a new method that generates high-quality intrinsic images from text descriptions.

Motivation: The outputs of existing text-to-image models contain baked-in scene lighting, whereas IntrinsiX predicts physically-based rendering (PBR) maps, which makes it suitable for content creation scenarios that require re-lighting, editing, and texture generation.

Details

Method: Exploits strong image priors, pre-trains a separate model for each PBR material component, and aligns these models with a new cross-intrinsic attention mechanism to obtain semantically coherent PBR predictions. A rendering loss is also proposed to constrain the model. Result: Experiments show that IntrinsiX generates detailed intrinsic images and significantly outperforms existing methods. Conclusion: IntrinsiX shows its potential in applications such as re-lighting, editing, and text-conditioned room-scale PBR texture generation. Abstract: We introduce IntrinsiX, a novel method that generates high-quality intrinsic images from text description. In contrast to existing text-to-image models whose outputs contain baked-in scene lighting, our approach predicts physically-based rendering (PBR) maps. This enables the generated outputs to be used for content creation scenarios in core graphics applications that facilitate re-lighting, editing, and texture generation tasks. In order to train our generator, we exploit strong image priors, and pre-train separate models for each PBR material component (albedo, roughness, metallic, normals). We then align these models with a new cross-intrinsic attention formulation that concatenates key and value features in a consistent fashion. This allows us to exchange information between each output modality and to obtain semantically coherent PBR predictions. To ground each intrinsic component, we propose a rendering loss which provides image-space signals to constrain the model, thus facilitating sharp details also in the output BRDF properties. Our results demonstrate detailed intrinsic generation with strong generalization capabilities that outperforms existing intrinsic image decomposition methods used with generated images by a significant margin. Finally, we show a series of applications, including re-lighting, editing, and text-conditioned room-scale PBR texture generation.

GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology

Saarthak Kapse,Pushpak Pati,Srikar Yellapragada,Srijan Das,Rajarsi R. Gupta,Joel Saltz,Dimitris Samaras,Prateek Prasanna

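Task: Proposes GECKO, a Multiple Instance Learning (MIL) pretraining method for deriving Whole Slide Image (WSI)-level embeddings from patch-level representations without supervision.

Motivation: Existing multimodal MIL pretraining approaches require additional clinical data, which raises costs and limits scalability. GECKO aims to avoid this by using a concept prior derived from the WSIs themselves.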

Details

Method: GECKO derives an interpretable concept prior by computing the similarity between WSI patches and textual descriptions of predefined pathology concepts. A dual-branch MIL network then aggregates the patch embeddings and the concept prior separately, and a contrastive objective aligns the two. Result: Across five different tasks, GECKO outperforms existing unimodal and multimodal pretraining approaches while providing clinical interpretability. Conclusion: GECKO not only improves performance but also bridges the gap between computational models and pathology expertise through its concept prior. Abstract: Pretraining a Multiple Instance Learning (MIL) aggregator enables the derivation of Whole Slide Image (WSI)-level embeddings from patch-level representations without supervision. While recent multimodal MIL pretraining approaches leveraging auxiliary modalities have demonstrated performance gains over unimodal WSI pretraining, the acquisition of these additional modalities necessitates extensive clinical profiling. This requirement increases costs and limits scalability in existing WSI datasets lacking such paired modalities. To address this, we propose Gigapixel Vision-Concept Knowledge Contrastive pretraining (GECKO), which aligns WSIs with a Concept Prior derived from the available WSIs. First, we derive an inherently interpretable concept prior by computing the similarity between each WSI patch and textual descriptions of predefined pathology concepts. GECKO then employs a dual-branch MIL network: one branch aggregates patch embeddings into a WSI-level deep embedding, while the other aggregates the concept prior into a corresponding WSI-level concept embedding. Both aggregated embeddings are aligned using a contrastive objective, thereby pretraining the entire dual-branch MIL model. Moreover, when auxiliary modalities such as transcriptomics data are available, GECKO seamlessly integrates them. Across five diverse tasks, GECKO consistently outperforms prior unimodal and multimodal pretraining approaches while also delivering clinically meaningful interpretability that bridges the gap between computational models and pathology expertise. Code is made available at https://github.com/bmi-imaginelab/GECKO
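
The concept prior is described as a similarity between each WSI patch and the text of each predefined pathology concept. A minimal sketch of that step follows, assuming cosine similarity and that patch and concept embeddings already live in a shared space; the abstract does not name the similarity measure or the encoder, and the vectors below are made-up toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def concept_prior(patch_embeddings, concept_embeddings):
    """Return a matrix with one row per patch and one column per concept."""
    return [[cosine(p, c) for c in concept_embeddings] for p in patch_embeddings]

# Hypothetical 2-D embeddings: three patches, two concepts.
patches = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
concepts = [[1.0, 0.0], [0.0, 1.0]]
prior = concept_prior(patches, concepts)
```

Each row of `prior` can be read directly as "how strongly this patch matches each named concept", which is what makes the prior interpretable before any aggregation takes place.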

A YOLO-Based Semi-Automated Labeling Approach to Improve Fault Detection Efficiency in Railroad Videos

Dylan Lester,James Gao,Samuel Sutphin,Pingping Zhu,Husnu Narman,Ammar Alzarrad

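Task: Proposes a semi-automated labeling method that uses a pre-trained YOLO model to streamline the labeling process for fault detection in railroad videos.

Motivation: Manual labeling of large-scale image and video datasets is time-consuming, error-prone, and costly, which hinders efficient machine learning workflows for fault detection in railroad videos.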

Details

Method: Initializes a YOLO model with a small set of manually labeled data and trains it iteratively, progressively reducing human intervention; a system is also developed to export YOLO's detection data as an editable text file for quick correction. Result: Labeling time drops from 2 to 4 minutes per image to 30 seconds to 2 minutes, significantly reducing labor costs and labeling errors. Conclusion: The method offers a cost-effective alternative for researchers and practitioners handling large fault detection datasets. Abstract: Manual labeling for large-scale image and video datasets is often time-intensive, error-prone, and costly, posing a significant barrier to efficient machine learning workflows in fault detection from railroad videos. This study introduces a semi-automated labeling method that utilizes a pre-trained You Only Look Once (YOLO) model to streamline the labeling process and enhance fault detection accuracy in railroad videos. By initiating the process with a small set of manually labeled data, our approach iteratively trains the YOLO model, using each cycle's output to improve model accuracy and progressively reduce the need for human intervention. To facilitate easy correction of model predictions, we developed a system to export YOLO's detection data as an editable text file, enabling rapid adjustments when detections require refinement. This approach decreases labeling time from an average of 2 to 4 minutes per image to 30 seconds to 2 minutes, effectively minimizing labor costs and labeling errors. Unlike costly AI-based labeling solutions on paid platforms, our method provides a cost-effective alternative for researchers and practitioners handling large datasets in fault detection and other detection-based machine learning applications.
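
The abstract mentions exporting detections as an editable text file but does not specify the layout. The sketch below assumes the common YOLO label convention, one line per box with `class x_center y_center width height` normalized to the image size; the function names and sample boxes are hypothetical.

```python
def to_yolo_line(class_id, box, image_width, image_height):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to one YOLO label line."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

def export_detections(detections, image_width, image_height):
    """Join all detections of one image into the text a reviewer would edit."""
    return "\n".join(
        to_yolo_line(class_id, box, image_width, image_height)
        for class_id, box in detections
    )

# Two made-up detections on a 640x480 frame.
detections = [(0, (100, 50, 300, 250)), (2, (0, 0, 64, 48))]
label_text = export_detections(detections, 640, 480)
```

Because every box is a single plain-text line, a reviewer can fix a class index or delete a false positive in any editor before the file is fed back into the next training cycle.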

AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

Junhao Cheng,Yuying Ge,Yixiao Ge,Jing Liao,Ying Shan

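Task: Proposes AnimeGamer, a method built on Multimodal Large Language Models (MLLMs) that generates game states comprising dynamic animation shots and character state updates, to improve the experience of infinite anime life simulation games.

Motivation: The existing approach uses large language models (LLMs) to generate only static images and neglects historical visual context and dynamics, resulting in inconsistent and unengaging gameplay.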

Details

Method: Uses MLLMs to generate game states, including dynamic animation shots and character state updates, and introduces action-aware multimodal representations that can be decoded into high-quality video clips. Result: In automated metrics and human evaluations, AnimeGamer outperforms existing methods in multiple aspects of the gaming experience. Conclusion: By combining historical animation shot representations with dynamic generation, AnimeGamer delivers a contextually consistent and dynamic gaming experience. Abstract: Recent advancements in image and video synthesis have opened up new promise in generative games. One particularly intriguing application is transforming characters from anime films into interactive, playable entities. This allows players to immerse themselves in the dynamic anime world as their favorite characters for life simulation through language instructions. Such games are defined as infinite games since they eliminate predetermined boundaries and fixed gameplay rules, where players can interact with the game world through open-ended language and experience ever-evolving storylines and environments. Recently, a pioneering approach for infinite anime life simulation employs large language models (LLMs) to translate multi-turn text dialogues into language instructions for image generation. However, it neglects historical visual context, leading to inconsistent gameplay. Furthermore, it only generates static images, failing to incorporate the dynamics necessary for an engaging gaming experience. In this work, we propose AnimeGamer, which is built upon Multimodal Large Language Models (MLLMs) to generate each game state, including dynamic animation shots that depict character movements and updates to character states, as illustrated in Figure 1. We introduce novel action-aware multimodal representations to represent animation shots, which can be decoded into high-quality video clips using a video diffusion model. By taking historical animation shot representations as context and predicting subsequent representations, AnimeGamer can generate games with contextual consistency and satisfactory dynamics. Extensive evaluations using both automated metrics and human evaluations demonstrate that AnimeGamer outperforms existing methods in various aspects of the gaming experience. Code and checkpoints are available at https://github.com/TencentARC/AnimeGamer.

Scaling Language-Free Visual Representation Learning

David Fan,Shengbang Tong,Jiachen Zhu,Koustuv Sinha,Zhuang Liu,Xinlei Chen,Michael Rabbat,Nicolas Ballas,Yann LeCun,Amir Bar,Saining Xie

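Task: Compares visual self-supervised learning (SSL) with Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA).

Motivation: Investigates whether visual SSL lags behind CLIP in multimodal settings because of the lack of language supervision or because of differences in training data.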

Details

Method: Trains visual SSL and CLIP models on the same MetaCLIP data and uses VQA as the testbed for comparison. Result: Visual SSL models scale better than CLIP models in terms of data and model capacity, their performance does not saturate even at 7B parameters, and they reach CLIP-level performance on multimodal tasks. Conclusion: Pure visual SSL can match language-supervised visual pretraining at scale, opening new opportunities for vision-centric representation learning. Abstract: Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is often attributed to the semantics introduced by language supervision, even though visual SSL and CLIP models are often trained on different data. In this work, we ask the question: "Do visual self-supervised approaches lag behind CLIP due to the lack of language supervision, or differences in the training data?" We study this question by training both visual SSL and CLIP models on the same MetaCLIP data, and leveraging VQA as a diverse testbed for vision encoders. In this controlled setup, visual SSL models scale better than CLIP models in terms of data and model capacity, and visual SSL performance does not saturate even after scaling up to 7B parameters. Consequently, we observe visual SSL methods achieve CLIP-level performance on a wide range of VQA and classic vision benchmarks. These findings demonstrate that pure visual SSL can match language-supervised visual pretraining at scale, opening new opportunities for vision-centric representation learning.

MixerMDM: Learnable Composition of Human Motion Diffusion Models

Pablo Ruiz-Ponce,German Barquero,Cristina Palmero,Sergio Escalera,José García-Rodríguez

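Task: Proposes MixerMDM, a learnable model composition technique for combining pre-trained text-conditioned human motion diffusion models to achieve fine-grained control over motion generation.

Motivation: Existing approaches to combining multiple pre-trained motion diffusion models overlook that the optimal way to combine the generation processes may depend on the models and the textual descriptions, which limits fine-grained control.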

Details

Method: Introduces MixerMDM, a dynamic mixing strategy trained adversarially to combine the denoising process of each model depending on the conditions driving the generation. Result: Combining single- and multi-person motion diffusion models with MixerMDM gives fine-grained control over each person's dynamics and over the overall interaction; a new evaluation technique is proposed to verify this. Conclusion: MixerMDM's dynamic mixing strategy significantly improves fine-grained control in text-conditioned human motion generation and performs well on the interaction and individual quality evaluations. Abstract: Generating human motion guided by conditions such as textual descriptions is challenging due to the need for datasets with pairs of high-quality motion and their corresponding conditions. The difficulty increases when aiming for finer control in the generation. To that end, prior works have proposed to combine several motion diffusion models pre-trained on datasets with different types of conditions, thus allowing control with multiple conditions. However, the proposed merging strategies overlook that the optimal way to combine the generation processes might depend on the particularities of each pre-trained generative model and also the specific textual descriptions. In this context, we introduce MixerMDM, the first learnable model composition technique for combining pre-trained text-conditioned human motion diffusion models. Unlike previous approaches, MixerMDM provides a dynamic mixing strategy that is trained in an adversarial fashion to learn to combine the denoising process of each model depending on the set of conditions driving the generation. By using MixerMDM to combine single- and multi-person motion diffusion models, we achieve fine-grained control on the dynamics of every person individually, and also on the overall interaction. Furthermore, we propose a new evaluation technique that, for the first time in this task, measures the interaction and individual quality by computing the alignment between the mixed generated motions and their conditions as well as the capabilities of MixerMDM to adapt the mixing throughout the denoising process depending on the motions to mix.

Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation

Junyu Xie,Tengda Han,Max Bain,Arsha Nagrani,Eshika Khandelwal,Gül Varol,Weidi Xie,Andrew Zisserman

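Task: Automatically generate Audio Descriptions (ADs) for edited video material such as movies and TV series.

Motivation: Improve generation quality by using "shots" as the fundamental units of video understanding and incorporating film grammar devices (such as shot scales and thread structures) to guide AD generation.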

Details

Method: Proposes a two-stage framework that extends temporal context to neighbouring shots and works with both open-source and proprietary Visual-Language Models (VLMs), integrating expert knowledge without additional training. Result: Achieves state-of-the-art performance among all training-free approaches and surpasses fine-tuned methods on several benchmarks. Conclusion: A new evaluation measure (the action score) and a new evaluation protocol further validate the method; the protocol treats automatic frameworks as AD generation assistants that produce multiple candidate ADs for selection. Abstract: Our objective is the automatic generation of Audio Descriptions (ADs) for edited video material, such as movies and TV series. To achieve this, we propose a two-stage framework that leverages "shots" as the fundamental units of video understanding. This includes extending temporal context to neighbouring shots and incorporating film grammar devices, such as shot scales and thread structures, to guide AD generation. Our method is compatible with both open-source and proprietary Visual-Language Models (VLMs), integrating expert knowledge from add-on modules without requiring additional training of the VLMs. We achieve state-of-the-art performance among all prior training-free approaches and even surpass fine-tuned methods on several benchmarks. To evaluate the quality of predicted ADs, we introduce a new evaluation measure -- an action score -- specifically targeted to assessing this important aspect of AD. Additionally, we propose a novel evaluation protocol that treats automatic frameworks as AD generation assistants and asks them to generate multiple candidate ADs for selection.

Assessing Foundation Models for Sea Ice Type Segmentation in Sentinel-1 SAR Imagery

Samira Alkaee Taleghan,Morteza Karimzadeh,Andrew P. Barrett,Walter N. Meier,Farnoush Banaei-Kashani

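Task: Evaluates ten remote sensing foundation models (FMs) for sea ice type segmentation and analyzes their seasonal and spatial generalization.

Motivation: Accurate segmentation of sea ice types is essential for safe navigation and resource extraction in ice-covered waters and for understanding polar climate processes, but existing deep learning methods rely on large labeled datasets, and the complex structure of sea ice and the characteristics of SAR imagery make segmentation challenging.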

Details

Method: Evaluates ten remote sensing FMs for sea ice type segmentation on Sentinel-1 SAR imagery, focusing on seasonal and spatial generalization. Result: Prithvi-600M performs best, with CROMA achieving a very similar F1-score. Conclusion: The study contributes a systematic methodology for selecting FMs, a comprehensive performance benchmark, and directions for improving domain-specific models for polar applications using SAR data. Abstract: Accurate segmentation of sea ice types is essential for mapping and operational forecasting of sea ice conditions for safe navigation and resource extraction in ice-covered waters, as well as for understanding polar climate processes. While deep learning methods have shown promise in automating sea ice segmentation, they often rely on extensive labeled datasets which require expert knowledge and are time-consuming to create. Recently, foundation models (FMs) have shown excellent results for segmenting remote sensing images by utilizing pre-training on large datasets using self-supervised techniques. However, their effectiveness for sea ice segmentation remains unexplored, especially given sea ice's complex structures, seasonal changes, and unique spectral signatures, as well as peculiar Synthetic Aperture Radar (SAR) imagery characteristics including banding and scalloping noise, and varying ice backscatter characteristics, which are often missing in standard remote sensing pre-training datasets. In particular, SAR images over polar regions are acquired using different modes than used to capture the images at lower latitudes by the same sensors that form training datasets for FMs. This study evaluates ten remote sensing FMs for sea ice type segmentation using Sentinel-1 SAR imagery, focusing on their seasonal and spatial generalization. Among the selected models, Prithvi-600M outperforms the baseline models, while CROMA achieves a very similar performance in F1-score. Our contributions include offering a systematic methodology for selecting FMs for sea ice data analysis, a comprehensive benchmarking study on performances of FMs for sea ice segmentation with tailored performance metrics, and insights into existing gaps and future directions for improving domain-specific models in polar applications using SAR data.

RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy

Zhonghan Zhao,Wenwei Zhang,Haian Huang,Kuikun Liu,Jianfei Gao,Gaoang Wang,Kai Chen

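Task: Proposes RIG, an end-to-end generalist policy that makes the first attempt to synergize reasoning and imagination.

Motivation: Prior work either incorporates only one of these abilities or relies on multiple specialized models, limiting the learning efficiency and generalization of the policy.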

Details

Method: Constructs a data pipeline that progressively integrates and enriches the imagination and reasoning content in trajectories collected from existing agents, and jointly learns reasoning and next image generation. Result: RIG shows more than 17× sample efficiency improvement along with better generalization; during inference it self-corrects based on imagination, improving the policy's robustness and interoperability. Conclusion: The synergy of reasoning and imagination not only improves the performance of the generalist policy but also enables test-time scaling to enhance overall performance. Abstract: Reasoning before action and imagining potential outcomes (i.e., world models) are essential for embodied agents operating in complex open-world environments. Yet, prior work either incorporates only one of these abilities in an end-to-end agent or integrates multiple specialized models into an agent system, limiting the learning efficiency and generalization of the policy. Thus, this paper makes the first attempt to synergize Reasoning and Imagination in an end-to-end Generalist policy, termed RIG. To train RIG in an end-to-end manner, we construct a data pipeline that progressively integrates and enriches the content of imagination and reasoning in the trajectories collected from existing agents. The joint learning of reasoning and next image generation explicitly models the inherent correlation between reasoning, action, and dynamics of environments, and thus exhibits more than $17\times$ sample efficiency improvements and generalization in comparison with previous works. During inference, RIG first reasons about the next action, produces a potential action, and then predicts the action outcomes, which offers the agent a chance to review and self-correct based on the imagination before taking real actions. Experimental results show that the synergy of reasoning and imagination not only improves the robustness, generalization, and interoperability of generalist policy but also enables test-time scaling to enhance overall performance.

Autonomous AI for Multi-Pathology Detection in Chest X-Rays: A Multi-Site Study in the Indian Healthcare System

Bargava Subramanian,Shajeev Jaikumar,Praveen Shastry,Naveen Kumarasami,Kalyan Sivasailam,Anandakumar D,Keerthana R,Mounigasri M,Kishore Prasath Venkatesh

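Task: Develops an autonomous AI system for chest X-ray (CXR) interpretation that classifies, detects, and segments 75 distinct pathologies.

Motivation: Address the shortage of diagnostic resources in the Indian healthcare system, optimize radiology workflows, and improve patient care.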

Details

Method: Combines Vision Transformers, Faster R-CNN, and various U-Net models (such as Attention U-Net, U-Net++, and Dense U-Net), trained on a dataset of over 5 million X-rays, and validates robustness through subgroup analyses across age, gender, and equipment type. Result: The AI system achieved up to 98% precision and over 95% recall for multi-pathology classification, and 99.8% precision and 99.6% recall for normal vs. abnormal classification; after deployment it processed over 150,000 scans, reducing reporting times and improving diagnostic accuracy. Conclusion: The AI system is a reliable autonomous tool that addresses diagnostic gaps in underserved areas, optimizing workflows and enhancing patient care across diverse healthcare settings. Abstract: Study Design: The study outlines the development of an autonomous AI system for chest X-ray (CXR) interpretation, trained on a vast dataset of over 5 million X-rays sourced from healthcare systems across India. This AI system integrates advanced architectures including Vision Transformers, Faster R-CNN, and various U-Net models (such as Attention U-Net, U-Net++, and Dense U-Net) to enable comprehensive classification, detection, and segmentation of 75 distinct pathologies. To ensure robustness, the study design includes subgroup analyses across age, gender, and equipment type, validating the model's adaptability and performance across diverse patient demographics and imaging environments. Performance: The AI system achieved up to 98% precision and over 95% recall for multi-pathology classification, with stable performance across demographic and equipment subgroups. For normal vs. abnormal classification, it reached 99.8% precision, 99.6% recall, and 99.9% negative predictive value (NPV). It was deployed in 17 major healthcare systems in India including diagnostic centers, large hospitals, and government hospitals. Over the deployment period, the system processed over 150,000 scans, averaging 2,000 chest X-rays daily, resulting in reduced reporting times and improved diagnostic accuracy. Conclusion: The high precision and recall validate the AI's capability as a reliable tool for autonomous normal vs. abnormal classification, pathology localization, and segmentation. This scalable AI model addresses diagnostic gaps in underserved areas, optimizing radiology workflows and enhancing patient care across diverse healthcare settings in India.
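
The abstract reports precision, recall, and negative predictive value (NPV). As a reminder of how the three relate to a confusion matrix, here is a small sketch using the standard definitions. The counts are invented for illustration and are not the study's data, and the abstract does not state which class is treated as positive.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of true positives that the model finds."""
    return tp / (tp + fn)

def negative_predictive_value(tn, fn):
    """Share of predicted negatives that are truly negative."""
    return tn / (tn + fn)

# Hypothetical confusion-matrix counts for one binary task.
tp, fp, fn, tn = 90, 10, 30, 870

scores = {
    "precision": precision(tp, fp),
    "recall": recall(tp, fn),
    "npv": negative_predictive_value(tn, fn),
}
```

With these toy counts precision is 0.90 while recall is 0.75, which shows why the three figures are reported separately: each one answers a different question about the same predictions.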

Diffusion models applied to skin and oral cancer classification

José J. M. Uliana,Renato A. Krohling

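Task: Investigates the application of diffusion models to medical image classification of skin and oral lesions.

Motivation: Explore the performance of diffusion models in medical image classification and compare them with existing deep learning models such as CNNs and Transformers.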

Details

Method: Evaluates the classification performance of the diffusion model on the PAD-UFES-20 (skin cancer) and P-NDB-UFES (oral cancer) datasets, and tests its robustness on the HIBA dataset. Result: On PAD-UFES-20 the diffusion model achieves a balanced accuracy of 0.6457 for six-class classification and 0.8357 for binary classification; on P-NDB-UFES it achieves a balanced accuracy of 0.9050. Its performance is competitive with existing models. Conclusion: Diffusion models are competitive for classifying medical images of skin and oral lesions and are a viable classification approach. Abstract: This study investigates the application of diffusion models in medical image classification (DiffMIC), focusing on skin and oral lesions. Utilizing the datasets PAD-UFES-20 for skin cancer and P-NDB-UFES for oral cancer, the diffusion model demonstrated competitive performance compared to state-of-the-art deep learning models like Convolutional Neural Networks (CNNs) and Transformers. Specifically, for the PAD-UFES-20 dataset, the model achieved a balanced accuracy of 0.6457 for six-class classification and 0.8357 for binary classification (cancer vs. non-cancer). For the P-NDB-UFES dataset, it attained a balanced accuracy of 0.9050. These results suggest that diffusion models are viable models for classifying medical images of skin and oral lesions. In addition, we investigate the robustness of the model trained on PAD-UFES-20 for skin cancer but tested on the clinical images of the HIBA dataset.
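
All results above are reported as balanced accuracy, which is the mean of the per-class recalls and therefore does not reward a model for favouring the majority class. A minimal sketch with made-up labels (not drawn from PAD-UFES-20 or P-NDB-UFES):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == label]
        correct = sum(1 for i in indices if y_pred[i] == label)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)

def plain_accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly, for comparison."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Imbalanced toy labels: 2 cancer cases, 8 non-cancer cases.
y_true = ["cancer"] * 2 + ["non-cancer"] * 8
y_pred = ["cancer", "non-cancer"] + ["non-cancer"] * 8

balanced = balanced_accuracy(y_true, y_pred)
plain = plain_accuracy(y_true, y_pred)
```

Here plain accuracy is 0.9 but balanced accuracy is only 0.75, because half of the rare cancer cases are missed. This gap is the reason the metric is preferred for imbalanced lesion datasets.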

CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation

Jixuan Leng,Chengsong Huang,Langlin Huang,Bill Yuchen Lin,William W. Cohen,Haohan Wang,Jiaxin Huang

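Task: Evaluates the reasoning capabilities of Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) on a multimodal reasoning task, crossword puzzles.

Motivation: Existing evaluation frameworks focus mainly on text-based reasoning or vision-language understanding and rarely assess the dynamic interplay between textual and visual constraints.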

Details

Method: Introduces the CrossWordBench benchmark, which uses a controllable crossword generation framework to evaluate reasoning under textual and visual constraints. Result: Reasoning LLMs outperform non-reasoning models, while LVLMs struggle with the task, and their performance correlates strongly with grid-parsing accuracy. Conclusion: The study reveals the limitations of current LLMs and LVLMs in multimodal reasoning and provides an effective approach for building multimodal constrained tasks for future evaluations. Abstract: Existing reasoning evaluation frameworks for Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) predominantly either assess text-based reasoning or vision-language understanding capabilities, with limited dynamic interplay between textual and visual constraints. To address this limitation, we introduce CrossWordBench, a benchmark designed to evaluate the reasoning capabilities of both LLMs and LVLMs through the medium of crossword puzzles, a task requiring multimodal adherence to semantic constraints from text-based clues and intersectional constraints from visual grid structures. CrossWordBench leverages a controllable puzzle generation framework that produces puzzles in multiple formats (text and image) and offers different evaluation strategies ranging from direct puzzle solving to interactive modes. Our extensive evaluation of over 20 models reveals that reasoning LLMs outperform non-reasoning models substantially by effectively leveraging crossing-letter constraints. We further demonstrate that LVLMs struggle with the task, showing a strong correlation between their puzzle-solving performance and grid-parsing accuracy. Our findings offer insights into the limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.

EAP4EMSIG -- Enhancing Event-Driven Microscopy for Microfluidic Single-Cell Analysis

Nils Friederich,Angelo Jovin Yamachui Sitcheu,Annika Nassal,Erenus Yildiz,Matthias Pesch,Maximilian Beichter,Lukas Scholtes,Bahar Akbaba,Thomas Lautenschlager,Oliver Neumann,Dietrich Kohlheyer,Hanno Scharr,Johannes Seiffarth,Katharina Nöh,Ralf Mikut

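Task: Develops an experiment automation pipeline for event-driven microscopy in smart microfluidic single-cell analysis.

Motivation: Address the lack of real-time insights in high-throughput experiments, which delays responses to stochastic events.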

Details

Method: Introduces three components: a fast, accurate deep learning autofocusing method, an evaluation of real-time segmentation methods, and a real-time data analysis dashboard. Result: The autofocusing method achieves a mean absolute error of 0.0226 µm with inference times below 50 ms; Cellpose 3 reaches the highest segmentation quality (Panoptic Quality 93.58%), while a distance-based method is fastest (121 ms, Panoptic Quality 93.02%). Conclusion: Deep learning foundation models are unsuitable for real-time segmentation, but the proposed autofocusing and segmentation methods perform well. Abstract: Microfluidic Live-Cell Imaging yields data on microbial cell factories. However, continuous acquisition is challenging as high-throughput experiments often lack real-time insights, delaying responses to stochastic events. We introduce three components in the Experiment Automation Pipeline for Event-Driven Microscopy to Smart Microfluidic Single-Cell Analysis: a fast, accurate Deep Learning autofocusing method predicting the focus offset, an evaluation of real-time segmentation methods and a real-time data analysis dashboard. Our autofocusing achieves a Mean Absolute Error of 0.0226 µm with inference times below 50 ms. Among eleven Deep Learning segmentation methods, Cellpose 3 reached a Panoptic Quality of 93.58%, while a distance-based method is fastest (121 ms, Panoptic Quality 93.02%). All six Deep Learning Foundation Models were unsuitable for real-time segmentation.

CF-CAM: Gradient Perturbation Mitigation and Feature Stabilization for Reliable Interpretability

Hongjie He,Xu Pan,Yudong Yao

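Task: Proposes a new class activation mapping method (CF-CAM) that addresses the limitations of existing methods in gradient instability and computational overhead.

Motivation: The opacity of deep learning decisions limits applicability in high-stakes domains, and existing CAM methods fall short in either gradient stability or computational efficiency.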

Details

Method: CF-CAM combines a hierarchical importance weighting strategy, density-aware channel clustering via DBSCAN, and gradient filtering with bilateral filters to enhance robustness. Result: Experiments show that CF-CAM outperforms existing methods in interpretability performance and robustness to gradient perturbations. Conclusion: CF-CAM provides a reliable solution for improving the interpretability of deep neural networks in critical applications. Abstract: As deep learning continues to advance, the opacity of neural network decision-making remains a critical challenge, limiting trust and applicability in high-stakes domains. Class Activation Mapping (CAM) techniques have emerged as a key approach to visualizing model decisions, yet existing methods face inherent trade-offs. Gradient-based CAM variants suffer from sensitivity to gradient perturbations, leading to unstable and unreliable explanations. Conversely, gradient-free approaches mitigate gradient instability but incur significant computational overhead and inference latency. To address these limitations, we propose Cluster Filter Class Activation Map (CF-CAM), a novel framework that reintroduces gradient-based weighting while enhancing robustness against gradient noise. CF-CAM employs a hierarchical importance weighting strategy to balance discriminative feature preservation and noise elimination. A density-aware channel clustering via Density-Based Spatial Clustering of Applications with Noise (DBSCAN) groups semantically relevant feature channels and discards noise-prone activations. Additionally, cluster-conditioned gradient filtering leverages bilateral filters to refine gradient signals, preserving edge-aware localization while suppressing noise impact. Experimental results demonstrate that CF-CAM achieves superior interpretability performance while maintaining resilience to gradient perturbations, outperforming state-of-the-art CAM methods in faithfulness and robustness. By effectively mitigating gradient instability without excessive computational cost, CF-CAM provides a reliable solution for enhancing the interpretability of deep neural networks in critical applications such as medical diagnosis and autonomous driving.

Detecting Glioma, Meningioma, and Pituitary Tumors, and Normal Brain Tissues based on Yolov11 and Yolov8 Deep Learning Models

Ahmed M. Taha,Salah A. Aly,Mohamed F. Darwish

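Task: Proposes an AI-driven technique based on the YoloV11 and YoloV8 deep learning models for detecting and classifying brain tumors (glioma, meningioma, and pituitary tumors).

Motivation: Manual interpretation of MRI scans is time-consuming, error-prone, and dependent on highly specialized expertise, so a fast and accurate automated diagnostic method is needed.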

Details

Method: Uses a transfer learning-based fine-tuning approach combined with advanced deep learning techniques to classify brain tumors into four categories: no tumor, glioma, meningioma, and pituitary tumor. Result: On the CE-MRI Figshare dataset, the YoloV8 and YoloV11 models reach accuracies of 99.49% and 99.56%, respectively, and a customized CNN reaches 96.98%, supporting the potential of CNNs for high-precision brain tumor detection and classification. Conclusion: The study illustrates the transformative role of deep learning models in medical imaging and diagnostics, enabling high-precision brain tumor detection and classification. Abstract: Accurate and quick diagnosis of Glioma, Meningioma, and Pituitary Tumors, as distinct from normal brain tissue, is crucial for optimal treatment planning and improved medical results. Magnetic Resonance Imaging (MRI) is widely used as a non-invasive diagnostic tool for detecting brain abnormalities, including tumors. However, manual interpretation of MRI scans is often time-consuming, prone to human error, and dependent on highly specialized expertise. This paper proposes an advanced AI-driven technique for detecting glioma, meningioma, and pituitary brain tumors using YoloV11 and YoloV8 deep learning models. Methods: Using a transfer learning-based fine-tuning approach, we integrate cutting-edge deep learning techniques with medical imaging to classify brain tumors into four categories: No-Tumor, Glioma, Meningioma, and Pituitary Tumors. Results: The study utilizes the publicly accessible CE-MRI Figshare dataset and involves fine-tuning the pre-trained YoloV8 and YoloV11 models, which reach accuracies of 99.49% and 99.56%, respectively, while a customized CNN reaches 96.98%. The results validate the potential of CNNs in achieving high precision in brain tumor detection and classification, highlighting their transformative role in medical imaging and diagnostics.

Can Diffusion Models Disentangle? A Theoretical Perspective

Liming Wang,Muhammad Jehanzeb Mirza,Yishu Gong,Yuan Gong,Jiaqi Zhang,Brian H. Tracey,Katerina Placek,Marco Vilela,James R. Glass

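Task: Proposes a new theoretical framework for understanding how diffusion models learn disentangled representations.

Motivation: Study the potential of diffusion models for disentangled representation learning and validate the theoretical framework.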

Details

Method: Establishes identifiability conditions for general disentangled latent variable models, analyzes training dynamics, and derives sample complexity bounds for disentangled latent subspace models. Result: Experiments across diverse tasks and modalities validate the theoretical framework and show that theory-inspired training strategies, such as style guidance regularization, consistently improve disentanglement performance. Conclusion: The proposed framework provides a theoretical basis for applying diffusion models to disentangled representation learning, and experiments confirm its validity. Abstract: This paper presents a novel theoretical framework for understanding how diffusion models can learn disentangled representations. Within this framework, we establish identifiability conditions for general disentangled latent variable models, analyze training dynamics, and derive sample complexity bounds for disentangled latent subspace models. To validate our theory, we conduct disentanglement experiments across diverse tasks and modalities, including subspace recovery in latent subspace Gaussian mixture models, image colorization, image denoising, and voice conversion for speech classification. Additionally, our experiments show that training strategies inspired by our theory, such as style guidance regularization, consistently enhance disentanglement performance.

GazeLLM: Multimodal LLMs incorporating Human Visual Attention

Jun Rekimoto

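Task: Proposes a method that optimizes first-person video analysis by integrating eye-tracking data and decomposes the video into regions of gaze focus to reduce the video data input.

Motivation: Handling high-resolution, long-duration videos produces large latent representations that demand substantial memory and computation, limiting what Multimodal Large Language Models (MLLMs) can handle. Reducing resolution lowers memory usage but can compromise comprehension.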

Details

Method: Integrates eye-tracking data, decomposes first-person video into regions of gaze focus, and processes these regions selectively to reduce the amount of input data. Result: The method achieves task comprehension equal to or better than full-resolution processing while reducing the input pixels to one-tenth. Conclusion: Selective processing of gaze-focused regions offers an efficient way for MLLMs to interpret and utilize human skills. Abstract: Large Language Models (LLMs) are advancing into Multimodal LLMs (MLLMs), capable of processing image, audio, and video as well as text. Combined with first-person video, MLLMs show promising potential for understanding human activities through video and audio, enabling many human-computer interaction and human-augmentation applications such as human activity support, real-world agents, and skill transfer to robots or other individuals. However, handling high-resolution, long-duration videos generates large latent representations, leading to substantial memory and processing demands, limiting the length and resolution MLLMs can manage. Reducing video resolution can lower memory usage but often compromises comprehension. This paper introduces a method that optimizes first-person video analysis by integrating eye-tracking data, and proposes a method that decomposes first-person vision video into sub areas for regions of gaze focus. By processing these selectively gaze-focused inputs, our approach achieves task comprehension equivalent to or even better than processing the entire image at full resolution, but with significantly reduced video data input (reducing the number of pixels to one-tenth), offering an efficient solution for using MLLMs to interpret and utilize human skills.
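
To make the one-tenth pixel budget concrete, the sketch below computes a single crop window centred on a gaze point that keeps roughly 10% of a frame's pixels. It is a simplification under stated assumptions: the paper decomposes frames into sub areas in a way the abstract does not detail, and the single-window design, the function name, and the `area_fraction` parameter are all hypothetical.

```python
import math

def gaze_crop(frame_width, frame_height, gaze_x, gaze_y, area_fraction=0.1):
    """Return (left, top, width, height) of a crop centred on the gaze point.

    Each side is scaled by sqrt(area_fraction) so the crop keeps roughly that
    share of the pixels; the window is shifted to stay inside the frame.
    """
    scale = math.sqrt(area_fraction)
    crop_w = round(frame_width * scale)
    crop_h = round(frame_height * scale)
    left = min(max(gaze_x - crop_w // 2, 0), frame_width - crop_w)
    top = min(max(gaze_y - crop_h // 2, 0), frame_height - crop_h)
    return left, top, crop_w, crop_h

# Gaze near the top-right corner of a 1920x1080 frame.
box = gaze_crop(1920, 1080, gaze_x=1900, gaze_y=20)
kept_fraction = (box[2] * box[3]) / (1920 * 1080)
```

For this frame the window is 607 by 342 pixels, pushed back inside the frame because the gaze point sits near a corner, and it retains about 10% of the original pixels.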

CBIL: Collective Behavior Imitation Learning for Fish from Real Videos

Yifan Wu,Zhiyang Dou,Yuko Ishiwaka,Shun Ogawa,Yuke Lou,Wenping Wang,Lingjie Liu,Taku Komura

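Task: Proposes CBIL, a scalable approach for learning fish schooling behavior directly from videos without relying on captured motion trajectories.

Motivation: Traditional rule-based methods rely on hand-crafted principles, which limits the diversity and realism of collective behaviors, while existing imitation learning methods require ground truth motion trajectories and perform poorly in high-density groups.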

Details

Method: Combines video representation learning (a Masked Video AutoEncoder, MVAE, that extracts implicit states in a self-supervised manner) with adversarial imitation learning to capture the complex movements of fish schools and imitate the distribution of motion patterns in the latent space. Result: CBIL effectively learns collective motion priors, applies to various animation tasks, is effective across different species, and can detect abnormal fish behavior in in-the-wild videos. Conclusion: CBIL is an effective approach for learning collective behavior directly from videos, with broad application potential. Abstract: Reproducing realistic collective behaviors presents a captivating yet formidable challenge. Traditional rule-based methods rely on hand-crafted principles, limiting motion diversity and realism in generated collective behaviors. Recent imitation learning methods learn from data but often require ground truth motion trajectories and struggle with authenticity, especially in high-density groups with erratic movements. In this paper, we present a scalable approach, Collective Behavior Imitation Learning (CBIL), for learning fish schooling behavior directly from videos, without relying on captured motion trajectories. Our method first leverages Video Representation Learning, where a Masked Video AutoEncoder (MVAE) extracts implicit states from video inputs in a self-supervised manner. The MVAE effectively maps 2D observations to implicit states that are compact and expressive for the following imitation learning stage. Then, we propose a novel adversarial imitation learning method to effectively capture complex movements of the schools of fish, allowing for efficient imitation of the distribution for motion patterns measured in the latent space. It also incorporates bio-inspired rewards alongside priors to regularize and stabilize training. Once trained, CBIL can be used for various animation tasks with the learned collective motion priors. We further show its effectiveness across different species. Finally, we demonstrate the application of our system in detecting abnormal fish behavior from in-the-wild videos.

ElaLoRA: Elastic & Learnable Low-Rank Adaptation for Efficient Model Fine-Tuning

Huandong Chang,Zicheng Ma,Mingyuan Ma,Zhenting Qi,Andrew Sabot,Hong Jiang,H. T. Kung

Task: Propose ElaLoRA, a low-rank adaptation (LoRA) framework with dynamically adjusted ranks, for efficient fine-tuning of large-scale pre-trained models.

Motivation: Existing methods rely on fixed ranks or focus only on rank pruning or only on rank expansion, and cannot adjust ranks dynamically to match the importance of different layers.

Details

Method: Dynamically prune and expand ranks based on gradient-derived importance scores. Result: ElaLoRA outperforms existing PEFT methods on multiple benchmarks, and the studies confirm that layers receiving higher rank allocations contribute more to model performance. Conclusion: Through its adaptive rank allocation mechanism, ElaLoRA offers a scalable and efficient fine-tuning solution for resource-constrained environments. Abstract: Low-Rank Adaptation (LoRA) has become a widely adopted technique for fine-tuning large-scale pre-trained models with minimal parameter updates. However, existing methods rely on fixed ranks or focus solely on either rank pruning or expansion, failing to adapt ranks dynamically to match the importance of different layers during training. In this work, we propose ElaLoRA, an adaptive low-rank adaptation framework that dynamically prunes and expands ranks based on gradient-derived importance scores. To the best of our knowledge, ElaLoRA is the first method that enables both rank pruning and expansion during fine-tuning. Experiments across multiple benchmarks demonstrate that ElaLoRA consistently outperforms existing PEFT methods across different parameter budgets. Furthermore, our studies validate that layers receiving higher rank allocations contribute more significantly to model performance, providing theoretical justification for our adaptive strategy. By introducing a principled and adaptive rank allocation mechanism, ElaLoRA offers a scalable and efficient fine-tuning solution, particularly suited for resource-constrained environments.
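
The idea of pruning and expanding ranks under a fixed budget can be illustrated with a small sketch. This is not ElaLoRA's actual algorithm: the function `reallocate_ranks`, the layer names, and the importance values are all hypothetical, and the abstract does not specify how importance scores are aggregated or how often ranks change. The sketch only shows one unit of rank moving from the least important layer to the most important one while the total stays constant.

```python
def reallocate_ranks(ranks, importance, step=1, min_rank=1):
    """Move `step` units of rank from the least to the most important layer.

    ranks:      dict mapping layer name -> current rank
    importance: dict mapping layer name -> importance score
                (assumed here to come from gradients; values are illustrative)
    The total rank budget is left unchanged.
    """
    # Layers that can give up `step` ranks without dropping below min_rank.
    donors = [name for name in ranks if ranks[name] - step >= min_rank]
    if not donors:
        return dict(ranks)
    donor = min(donors, key=lambda name: importance[name])
    receiver = max(ranks, key=lambda name: importance[name])
    # Nothing to do if the donor is already at least as important as the receiver.
    if importance[donor] >= importance[receiver]:
        return dict(ranks)
    new_ranks = dict(ranks)
    new_ranks[donor] -= step      # prune
    new_ranks[receiver] += step   # expand
    return new_ranks


# Hypothetical layers and scores, for illustration only.
ranks = {"attn.q": 4, "attn.v": 4, "mlp.up": 4}
importance = {"attn.q": 0.9, "attn.v": 0.2, "mlp.up": 0.5}
print(reallocate_ranks(ranks, importance))
# -> {'attn.q': 5, 'attn.v': 3, 'mlp.up': 4}
```

Keeping the sum of ranks fixed is a design choice of this sketch that mirrors the "parameter budget" framing in the abstract; a real implementation would also have to resize the low-rank matrices themselves.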

DiffDenoise: Self-Supervised Medical Image Denoising with Conditional Diffusion Models

Basar Demir,Yikang Liu,Xiao Chen,Eric Z. Chen,Lin Zhao,Boris Mailhe,Terrence Chen,Shanhui Sun

Task: Propose DiffDenoise, a self-supervised denoising method tailored to medical images and designed to preserve high-frequency details.

Motivation: Existing self-supervised denoising methods tend to over-smooth images, losing fine structures that matter in medical applications.

Details

Method: The method has three stages: 1) train a diffusion model on noisy images, using the outputs of a pretrained Blind-Spot Network as conditioning inputs; 2) introduce a novel stabilized reverse sampling technique that generates clean images by averaging diffusion sampling outputs initialized with symmetric noises; 3) train a supervised denoising network on noisy images paired with the denoised outputs of the diffusion model. Result: DiffDenoise outperforms existing state-of-the-art methods on both synthetic and real-world medical image denoising tasks. Conclusion: With theoretical support and practical validation, DiffDenoise is shown to be effective across various medical imaging modalities and anatomical structures. Abstract: Many self-supervised denoising approaches have been proposed in recent years. However, these methods tend to overly smooth images, resulting in the loss of fine structures that are essential for medical applications. In this paper, we propose DiffDenoise, a powerful self-supervised denoising approach tailored for medical images, designed to preserve high-frequency details. Our approach comprises three stages. First, we train a diffusion model on noisy images, using the outputs of a pretrained Blind-Spot Network as conditioning inputs. Next, we introduce a novel stabilized reverse sampling technique, which generates clean images by averaging diffusion sampling outputs initialized with a pair of symmetric noises. Finally, we train a supervised denoising network using noisy images paired with the denoised outputs generated by the diffusion model. Our results demonstrate that DiffDenoise outperforms existing state-of-the-art methods in both synthetic and real-world medical image denoising tasks. We provide both a theoretical foundation and practical insights, demonstrating the method's effectiveness across various medical imaging modalities and anatomical structures.

Deconver: A Deconvolutional Network for Medical Image Segmentation

Pooya Ashtari,Shahryar Noei,Fateme Nateghi Haredasht,Jonathan H. Chen,Giuseppe Jurman,Aleksandra Pizurica,Sabine Van Huffel

Task: Propose Deconver, a novel network for medical image segmentation that combines traditional deconvolution techniques with a U-shaped architecture.

Motivation: Address the limitations of convolutional neural networks (CNNs) and vision transformers (ViTs) in medical image segmentation, such as local receptive fields and high computational complexity.

Details

Method: Replace expensive attention mechanisms with efficient nonnegative deconvolution (NDC) operations, combining learnable deconvolution with a parameter-efficient design. Result: State-of-the-art Dice scores and Hausdorff distance on four datasets (ISLES'22, BraTS'23, GlaS, FIVES), with computational cost reduced by up to 90%. Conclusion: Deconver offers a practical solution for high-precision segmentation in resource-constrained clinical workflows, and the project code is open source. Abstract: While convolutional neural networks (CNNs) and vision transformers (ViTs) have advanced medical image segmentation, they face inherent limitations such as local receptive fields in CNNs and high computational complexity in ViTs. This paper introduces Deconver, a novel network that integrates traditional deconvolution techniques from image restoration as a core learnable component within a U-shaped architecture. Deconver replaces computationally expensive attention mechanisms with efficient nonnegative deconvolution (NDC) operations, enabling the restoration of high-frequency details while suppressing artifacts. Key innovations include a backpropagation-friendly NDC layer based on a provably monotonic update rule and a parameter-efficient design. Evaluated across four datasets (ISLES'22, BraTS'23, GlaS, FIVES) covering both 2D and 3D segmentation tasks, Deconver achieves state-of-the-art performance in Dice scores and Hausdorff distance while reducing computational costs (FLOPs) by up to 90% compared to leading baselines. By bridging traditional image restoration with deep learning, this work offers a practical solution for high-precision segmentation in resource-constrained clinical workflows. The project is available at https://github.com/pashtari/deconver.
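
For readers unfamiliar with the Dice score used as a headline metric here, the following is a minimal sketch of the standard definition on flat binary masks. The function name `dice_score`, the toy masks, and the empty-mask convention are assumptions for illustration; this is not taken from the Deconver code base.

```python
def dice_score(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks: 2 overlapping foreground pixels, 3 foreground pixels in each mask.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(dice_score(pred, target))  # 2 * 2 / (3 + 3) = 0.666...
```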

Think Small, Act Big: Primitive Prompt Learning for Lifelong Robot Manipulation

Yuanqi Yao,Siao Liu,Haoming Song,Delin Qu,Qizhi Chen,Yan Ding,Bin Zhao,Zhigang Wang,Xuelong Li,Dong Wang

Task: Propose Primitive Prompt Learning (PPL), a method that achieves lifelong robot manipulation through reusable and extensible primitives.

Motivation: Effectively leveraging prior knowledge for continuous skill acquisition remains challenging for lifelong robots, and existing methods (such as experience replay and parameter-efficient methods) fail to exploit the primitives shared between skills.

Details

Method: A two-stage learning scheme: 1) a multi-skill pre-training stage learns a set of primitive prompts that represent shared primitives; 2) in the lifelong learning stage, the pretrained prompts are frozen and new prompts are optimized, transferring knowledge from old skills to new ones. Result: In simulation and real-world experiments on a large-scale skill dataset, PPL outperforms existing methods. Conclusion: By sharing and extending primitives, PPL effectively addresses skill transfer and knowledge reuse in lifelong robot manipulation. Abstract: Building a lifelong robot that can effectively leverage prior knowledge for continuous skill acquisition remains significantly challenging. Despite the success of experience replay and parameter-efficient methods in alleviating the catastrophic forgetting problem, naively applying these methods fails to leverage the shared primitives between skills. To tackle these issues, we propose Primitive Prompt Learning (PPL) to achieve lifelong robot manipulation via reusable and extensible primitives. Within our two-stage learning scheme, we first learn a set of primitive prompts to represent shared primitives through a multi-skill pre-training stage, where motion-aware prompts are learned to capture semantic and motion primitives shared across different skills. Second, when acquiring new skills over the lifelong span, new prompts are appended and optimized alongside the frozen pretrained prompts, boosting learning via knowledge transfer from old skills to new ones. For evaluation, we construct a large-scale skill dataset and conduct extensive experiments in both simulation and real-world tasks, demonstrating PPL's superior performance over state-of-the-art methods.

Less is More: Efficient Black-box Attribution via Minimal Interpretable Subset Selection

Ruoyu Chen,Siyuan Liang,Jingzhi Li,Shiming Liu,Li Liu,Hua Zhang,Xiaochun Cao

Task: Support the development of trustworthy AI systems by identifying the input regions that most influence a model's decisions.

Motivation: The main task of attribution methods is to identify input-prediction interactions efficiently and accurately; when the input data is discrete (such as images), analyzing the relationship between inputs and outputs is challenging because of combinatorial explosion.

Details

Method: Propose LiMA (Less input is More faithful for Attribution), a novel and efficient black-box attribution mechanism that reformulates the attribution of important regions as a submodular subset selection problem. A submodular function quantifies subset importance, and a novel bidirectional greedy search algorithm ranks input sub-regions efficiently. Result: Extensive experiments on eight foundation models show that LiMA provides faithful explanations with fewer regions, with average improvements of 36.3% in Insertion and 39.6% in Deletion. LiMA is also 1.6 times faster than naive greedy search, and when explaining the causes of model prediction errors, its highest confidence is on average 86.1% higher than that of existing attribution algorithms. Conclusion: By optimizing submodular subset selection with an efficient search algorithm, LiMA markedly improves attribution accuracy and efficiency and supports the development of trustworthy AI systems. Abstract: Attribution methods, which aim to identify the input regions that most influence a model's decisions, are central to developing trustworthy AI systems. The primary task of existing attribution methods lies in efficiently and accurately identifying the relationships among input-prediction interactions. Particularly when the input data is discrete, such as images, analyzing the relationship between inputs and outputs poses a significant challenge due to the combinatorial explosion. In this paper, we propose a novel and efficient black-box attribution mechanism, LiMA (Less input is More faithful for Attribution), which reformulates the attribution of important regions as an optimization problem for submodular subset selection. First, to accurately assess interactions, we design a submodular function that quantifies subset importance and effectively captures their impact on decision outcomes. Then, to rank input sub-regions efficiently by their importance for attribution, we improve optimization efficiency through a novel bidirectional greedy search algorithm. LiMA identifies both the most and least important samples while ensuring an optimal attribution boundary that minimizes errors. Extensive experiments on eight foundation models demonstrate that our method provides faithful interpretations with fewer regions and exhibits strong generalization, with an average improvement of 36.3% in Insertion and 39.6% in Deletion. Our method also outperforms the naive greedy search in attribution efficiency, being 1.6 times faster. Furthermore, when explaining the reasons behind model prediction errors, the highest confidence achieved by our method is, on average, 86.1% higher than that of state-of-the-art attribution algorithms. The code is available at https://github.com/RuoyuChen10/LIMA.
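
To make "submodular subset selection" concrete, here is a minimal sketch of the naive greedy baseline that the abstract compares against, not LiMA's bidirectional search. The set function is a toy coverage function (the number of distinct items covered), chosen only because coverage is a textbook submodular function. The region names, the covered items, and `greedy_select` are hypothetical and do not reflect LiMA's actual scoring function.

```python
def greedy_select(regions, k):
    """Pick up to k regions by largest marginal gain of f(S) = |union of covered items|.

    regions: dict mapping region name -> set of items that region "explains"
             (a toy stand-in for a real attribution score).
    Returns the selection order and the set of covered items.
    """
    selected, covered = [], set()
    remaining = dict(regions)
    for _ in range(k):
        if not remaining:
            break
        # Marginal gain of each remaining region given what is already covered.
        best = max(remaining, key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # no region adds anything new
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered


# Hypothetical image sub-regions and the evidence each one covers.
regions = {
    "r0": {1, 2, 3},
    "r1": {3, 4},
    "r2": {1, 2},
    "r3": {5},
}
order, covered = greedy_select(regions, k=3)
print(order)    # -> ['r0', 'r1', 'r3']
print(covered)  # -> {1, 2, 3, 4, 5}
```

The selection order doubles as an importance ranking: `r2` is never picked because everything it covers is already explained by `r0`, which is the diminishing-returns behavior that submodularity formalizes.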

FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning

Jie Ma,Zhitao Gao,Qi Chai,Jun Liu,Pinghui Wang,Jing Tao,Zhou Su

Task: Introduce FortisAVQA, a new audio-visual question answering (AVQA) dataset, and design MAVEN, a robust Multimodal Audio-Visual Epistemic Network, to address overfitting and dataset bias in existing methods.

Motivation: Existing AVQA methods tend to overfit dataset biases, and current datasets cannot effectively diagnose the robustness of these methods.

Details

Method: Build FortisAVQA in two stages: (1) rephrase the questions in the MUSIC-AVQA test split; (2) introduce distribution shifts across questions. Also propose the MAVEN network, which uses a multifaceted cycle collaborative debiasing strategy. Result: MAVEN achieves state-of-the-art performance on FortisAVQA, an improvement of 7.81%, and ablation studies validate the debiasing components. Conclusion: FortisAVQA and MAVEN effectively address the robustness problems of existing AVQA methods and demonstrate plug-and-play capability. Abstract: Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning task requiring intelligent systems to accurately answer natural language queries based on paired audio-video inputs. However, existing AVQA approaches often suffer from overfitting to dataset biases, leading to poor robustness. Moreover, current datasets may not effectively diagnose these methods. To address these challenges, we first introduce a novel dataset, FortisAVQA, constructed in two stages: (1) rephrasing questions in the test split of the public MUSIC-AVQA dataset and (2) introducing distribution shifts across questions. The first stage expands the test space with greater diversity, while the second enables a refined robustness evaluation across rare, frequent, and overall question distributions. Second, we introduce a robust Multimodal Audio-Visual Epistemic Network (MAVEN) that leverages a multifaceted cycle collaborative debiasing strategy to mitigate bias learning. Experimental results demonstrate that our architecture achieves state-of-the-art performance on FortisAVQA, with a notable improvement of 7.81%. Extensive ablation studies on both datasets validate the effectiveness of our debiasing components. Additionally, our evaluation reveals the limited robustness of existing multimodal QA methods. We also verify the plug-and-play capability of our strategy by integrating it with various baseline models across both datasets. Our dataset and code are available at https://github.com/reml-group/fortisavqa.

Training Frozen Feature Pyramid DINOv2 for Eyelid Measurements with Infinite Encoding and Orthogonal Regularization

Chun-Hung Chen

Task: Use deep learning models (SE-ResNet, EfficientNet, and DINOv2) to automate the measurement of eyelid parameters (MRD1, MRD2, and LF).

Motivation: Manual measurement of eyelid parameters is inconsistent, so an automated, high-precision method is needed to improve diagnosis.

Details

Method: Evaluate SE-ResNet, EfficientNet, and DINOv2 on smartphone images, combined with focal loss, orthogonal regularization, and binary encoding strategies. Result: DINOv2 shows superior scalability and robustness under frozen conditions, and achieves high precision when combined with lightweight regressors such as MLP and Deep Ensemble. Conclusion: DINOv2 with these enhancements offers a reliable and accurate solution for mobile clinical applications and demonstrates the potential of foundation models in ophthalmic AI. Abstract: Accurate measurement of eyelid parameters such as Margin Reflex Distances (MRD1, MRD2) and Levator Function (LF) is critical in oculoplastic diagnostics but remains limited by manual, inconsistent methods. This study evaluates deep learning models (SE-ResNet, EfficientNet, and the vision transformer-based DINOv2) for automating these measurements using smartphone-acquired images. We assess performance across frozen and fine-tuned settings, using MSE, MAE, and R² metrics. DINOv2, pretrained through self-supervised learning, demonstrates superior scalability and robustness, especially under frozen conditions ideal for mobile deployment. Lightweight regressors such as MLP and Deep Ensemble offer high precision with minimal computational overhead. To address class imbalance and improve generalization, we integrate focal loss, orthogonal regularization, and binary encoding strategies. Our results show that DINOv2 combined with these enhancements delivers consistent, accurate predictions across all tasks, making it a strong candidate for real-world, mobile-friendly clinical applications. This work highlights the potential of foundation models in advancing AI-powered ophthalmic care.

Robust LiDAR-Camera Calibration with 2D Gaussian Splatting

Shuyi Zhou,Shuxiang Xie,Ryoichi Ishikawa,Takeshi Oishi

Task: Propose a calibration method for LiDAR-camera systems based on geometric constraints.

Motivation: Existing calibration methods rely on auxiliary target objects and involve complex operations, while targetless methods have yet to achieve practical effectiveness.

Details

Method: Use 2D Gaussian Splatting (2DGS) to reconstruct geometric information from camera image sequences, and estimate the LiDAR-camera extrinsic parameters through geometric constraints. Result: Optimizing the photometric, reprojection, and triangulation losses improves calibration robustness and accuracy. Conclusion: The method needs no auxiliary targets, is simple to operate, and outperforms existing methods. Abstract: LiDAR-camera systems have become increasingly popular in robotics recently. A critical and initial step in integrating the LiDAR and camera data is the calibration of the LiDAR-camera system. Most existing calibration methods rely on auxiliary target objects, which often involve complex manual operations, whereas targetless methods have yet to achieve practical effectiveness. Recognizing that 2D Gaussian Splatting (2DGS) can reconstruct geometric information from camera image sequences, we propose a calibration method that estimates LiDAR-camera extrinsic parameters using geometric constraints. The proposed method begins by reconstructing colorless 2DGS using LiDAR point clouds. Subsequently, we update the colors of the Gaussian splats by minimizing the photometric loss. The extrinsic parameters are optimized during this process. Additionally, we address the limitations of the photometric loss by incorporating the reprojection and triangulation losses, thereby enhancing the calibration robustness and accuracy.

Orientation Scores should be a Piece of Cake

Finn M. Sherry,Chase van de Geijn,Erik J. Bekkers,Remco Duits

Task: Propose a family of wavelets for orientation scores and show that they minimise position-orientation uncertainty.

Motivation: Achieve fast reconstruction and minimal uncertainty in position and orientation space, while improving the efficiency and interpretability of (PDE-)G-CNNs.

Details

Method: Derive the wavelet family axiomatically and show that it is well approximated by cake wavelets; then validate the use of cake wavelets in (PDE-)G-CNNs. Result: The uncertainty gap of cake wavelets is less than 1.1 and tends to the minimum of 1; experiments show that they reduce network complexity and improve interpretability with only a slight impact on performance. Conclusion: Cake wavelets are effective near-minimum-uncertainty wavelets that can simplify (PDE-)G-CNNs and improve their interpretability. Abstract: We axiomatically derive a family of wavelets for an orientation score, lifting from position space $\mathbb{R}^2$ to position and orientation space $\mathbb{R}^2\times S^1$, with fast reconstruction property, that minimise position-orientation uncertainty. We subsequently show that these minimum uncertainty states are well-approximated by cake wavelets: for standard parameters, the uncertainty gap of cake wavelets is less than 1.1, and in the limit, we prove the uncertainty gap tends to the minimum of 1. Next, we complete a previous theoretical argument that one does not have to train the lifting layer in (PDE-)G-CNNs, but can instead use cake wavelets. Finally, we show experimentally that in this way we can reduce the network complexity and improve the interpretability of (PDE-)G-CNNs, with only a slight impact on the model's performance.

Scaling Up Resonate-and-Fire Networks for Fast Deep Learning

Thomas E. Huber,Jules Lecomte,Borislav Polovnikov,Axel von Arnim

Task: Model the resonate-and-fire (RF) neuron as a structured state space model (SSM) and build an S5-RF layer on top of the S5 model, enabling efficient training and scaling of deep spiking neural networks (SNNs).

Motivation: Although RF neurons are biologically plausible and computationally simple, challenges in parameter initialization and efficient learning have limited their use in multi-layer networks.

Details

Method: Derive the RF neuron as an SSM from the HiPPO framework and build the S5-RF layer, which provides a generic initialization scheme and fast training. Result: S5-RF scales RF networks to a deep SNN with up to four layers for the first time, sets a new record of 78.8% on the Spiking Speech Commands dataset, and trains in under three hours. Conclusion: S5-RF achieves performance similar to reference SNNs with fewer spiking operations, offering a new approach for deep SNNs. Abstract: Spiking neural networks (SNNs) present a promising computing paradigm for neuromorphic processing of event-based sensor data. The resonate-and-fire (RF) neuron, in particular, appeals through its biological plausibility and complex dynamics combined with computational simplicity. Despite theoretically predicted benefits, challenges in parameter initialization and efficient learning have inhibited the implementation of RF networks, constraining their use to a single layer. In this paper, we address these shortcomings by deriving the RF neuron as a structured state space model (SSM) from the HiPPO framework. We introduce S5-RF, a new SSM layer composed of RF neurons based on the S5 model, that features a generic initialization scheme and fast training within a deep architecture. S5-RF scales, for the first time, an RF network to a deep SNN with up to four layers and sets a new state-of-the-art result of 78.8% for recurrent SNNs on the Spiking Speech Commands dataset in under three hours of training time. Moreover, compared to the reference SNNs that solve our benchmarking tasks, it achieves similar performance with far fewer spiking operations. Our code is publicly available at https://github.com/ThomasEHuber/s5-rf.
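
As background on what a resonate-and-fire neuron does, the sketch below simulates a single textbook RF neuron, i.e. a damped complex oscillator dz/dt = (b + iω)z + I that spikes when Im(z) crosses a threshold. It does not reproduce the S5-RF layer or its HiPPO-based initialization. All parameter values, the unit input pulses, the absence of a reset after spiking, and the function name `simulate_rf` are assumptions made for illustration.

```python
import cmath
import math


def simulate_rf(pulse_period_steps, n_steps=1000, dt=1e-3,
                freq_hz=10.0, damping=-5.0, threshold=1.5):
    """Count upward threshold crossings of Im(z) for a pulse-driven RF neuron.

    The state z is a damped complex oscillator; a unit pulse is added to its
    real part every `pulse_period_steps` steps. No reset is applied after a
    spike (a simplification of this sketch).
    """
    omega = 2.0 * math.pi * freq_hz
    # Exact one-step propagator of dz/dt = (damping + i*omega) * z.
    propagator = cmath.exp(complex(damping, omega) * dt)
    z = 0j
    spikes = 0
    above = False
    for n in range(n_steps):
        z *= propagator
        if n % pulse_period_steps == 0:
            z += 1.0  # unit input pulse
        now_above = z.imag > threshold
        if now_above and not above:
            spikes += 1  # count only the upward crossing
        above = now_above
    return spikes


# Pulses every 100 steps match the 10 Hz resonance (period 0.1 s);
# pulses every 50 steps arrive in antiphase and largely cancel.
resonant_spikes = simulate_rf(pulse_period_steps=100)
off_resonant_spikes = simulate_rf(pulse_period_steps=50)
print(resonant_spikes, off_resonant_spikes)
```

With these settings, the neuron spikes only when the input pulses arrive at its resonant frequency, because in-phase pulses accumulate while antiphase pulses cancel. This frequency selectivity is what distinguishes RF neurons from simple integrate-and-fire units.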

Multi-Task Neural Architecture Search Using Architecture Embedding and Transfer Rank

TingJie Zhang,HaiLin Liu

Task: Propose KTNAS, an evolutionary cross-task neural architecture search (NAS) algorithm that improves the efficiency of transferring architectural knowledge across tasks.

Motivation: In multi-task NAS, ranking disorder between the source and target tasks degrades architecture performance on downstream tasks; this must be addressed to improve transfer efficiency.

Details

Method: Convert neural architectures into graphs, use architecture embedding vectors for performance prediction, and introduce transfer rank, an instance-based classifier, to address the performance degradation. Result: Search efficiency and transfer performance are verified on NASBench-201 and Micro TransNAS-Bench-101, and KTNAS outperforms peer multi-task NAS algorithms. Conclusion: Through transfer rank, KTNAS markedly improves transfer performance in multi-task NAS, and experiments demonstrate its efficiency and scalability. Abstract: Multi-task neural architecture search (NAS) enables transferring architectural knowledge among different tasks. However, ranking disorder between the source task and the target task degrades the architecture performance on the downstream task. We propose KTNAS, an evolutionary cross-task NAS algorithm, to enhance transfer efficiency. Our data-agnostic method converts neural architectures into graphs and uses architecture embedding vectors for the subsequent architecture performance prediction. The concept of transfer rank, an instance-based classifier, is introduced into KTNAS to address the performance degradation issue. We verify the search efficiency on NASBench-201 and transferability to various vision tasks on Micro TransNAS-Bench-101. The scalability of our method is demonstrated on the DARTS search space, including CIFAR-10/100, MNIST/Fashion-MNIST, and MedMNIST. Experimental results show that KTNAS outperforms peer multi-task NAS algorithms in search efficiency and downstream task performance. Ablation studies demonstrate the vital importance of transfer rank for transfer performance.

Visual Environment-Interactive Planning for Embodied Complex-Question Answering

Ning Lan,Baoshan Ou,Xuemei Xie,Guangming Shi

Task: Study the Embodied Complex-Question Answering task, in which an embodied robot must understand human questions with intricate structures and abstract semantics.

Motivation: Existing methods usually plan once for all (one-step planning), rely on large models, and lack sufficient understanding of the environment, so they cannot handle complex questions effectively.

Details

Method: Propose a multi-step planning framework that builds a structured semantic space, where hierarchical visual perception and a chain expression of the question essence interact iteratively to support task planning. Result: Experiments show that the method performs well and stably on complex tasks, and its feasibility is verified in real-world scenarios. Conclusion: Through multi-step planning and iterative feedback, the framework optimizes the robot's action strategy and has practical value. Abstract: This study focuses on the Embodied Complex-Question Answering task, in which the embodied robot needs to understand human questions with intricate structures and abstract semantics. The core of this task lies in making appropriate plans based on the perception of the visual environment. Existing methods often generate plans in a once-for-all manner, i.e., one-step planning. Such approaches rely on large models, without sufficient understanding of the environment. Considering multi-step planning, we propose a framework that formulates plans in a sequential manner. To ensure the ability of our framework to tackle complex questions, we create a structured semantic space, where hierarchical visual perception and chain expression of the question essence can achieve iterative interaction. This space makes sequential task planning possible. Within the framework, we first parse human natural language based on a visual hierarchical scene graph, which can clarify the intention of the question. Then, we incorporate external rules to make a plan for the current step, weakening the reliance on large models. Every plan is generated based on feedback from visual perception, with multiple rounds of interaction until an answer is obtained. This approach enables continuous feedback and adjustment, allowing the robot to optimize its action strategy. To test our framework, we contribute a new dataset with more complex questions. Experimental results demonstrate that our approach performs excellently and stably on complex tasks. Moreover, the feasibility of our approach in real-world scenarios has been established, indicating its practical applicability.

Feature-Preserving Mesh Decimation for Normal Integration

Moritz Heep,Sven Behnke,Eduard Zell

Task: Reconstruct 3D surfaces from normal maps, replacing the dense pixel grid with a sparse anisotropic triangle mesh to improve computational efficiency.

Motivation: Integrating high-resolution normal maps consumes large computational resources, so a more efficient method is needed to reduce computation time.

Details

Method: Replace the dense pixel grid with a sparse anisotropic triangle mesh, combining the quadric error measure with optimal Delaunay triangulation. Result: Normal integration runtimes drop from hours to minutes while high surface accuracy is maintained. Conclusion: The proposed method markedly improves computational efficiency and is suited to normal integration of high-resolution images. Abstract: Normal integration reconstructs 3D surfaces from normal maps obtained e.g. by photometric stereo. These normal maps capture surface details down to the pixel level but require large computational resources for integration at high resolutions. In this work, we replace the dense pixel grid with a sparse anisotropic triangle mesh prior to normal integration. We adapt the triangle mesh to the local geometry in the case of complex surface structures and remove oversampling from flat featureless regions. For high-resolution images, the resulting compression reduces normal integration runtimes from hours to minutes while maintaining high surface accuracy. Our main contribution is the derivation of the well-known quadric error measure from mesh decimation for screen space applications and its combination with optimal Delaunay triangulation.
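
The "well-known quadric error measure" referred to above is, in its classic mesh-decimation form, the sum of squared distances from a candidate vertex to a set of planes, stored as a 4x4 matrix. The sketch below shows that classic 3D form only; the screen-space derivation that is the paper's contribution is not reproduced. The helper names and the two example planes are illustrative assumptions.

```python
def plane_quadric(a, b, c, d):
    """Quadric K = p p^T for the plane a*x + b*y + c*z + d = 0 (unit normal)."""
    p = (a, b, c, d)
    return [[pi * pj for pj in p] for pi in p]


def add_quadrics(q1, q2):
    """Quadrics of several planes are accumulated by element-wise addition."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(q1, q2)]


def quadric_error(q, point):
    """Evaluate v^T Q v with v = (x, y, z, 1): the sum of squared plane distances."""
    v = (point[0], point[1], point[2], 1.0)
    return sum(v[i] * q[i][j] * v[j] for i in range(4) for j in range(4))


# Two example planes: z = 0 and x = 0.
q = add_quadrics(plane_quadric(0, 0, 1, 0), plane_quadric(1, 0, 0, 0))
print(quadric_error(q, (0, 5, 0)))  # on both planes -> 0.0
print(quadric_error(q, (3, 0, 4)))  # 3^2 + 4^2 -> 25.0
```

In decimation, each vertex carries the accumulated quadric of its adjacent faces, and an edge collapse is scored by evaluating the summed quadric at the merged position, so flat regions (low error) are simplified first.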

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Saaket Agashe,Kyle Wong,Vincent Tu,Jiachen Yang,Ang Li,Xin Eric Wang

Task: Develop Agent S2, a novel compositional framework for automating graphical user interface (GUI) tasks to enhance human productivity.

Motivation: Current agents face significant challenges in grounding GUI elements, planning long-horizon tasks, and relying on a single generalist model, which limits their performance and generality.

Details

Method: Use a Mixture-of-Grounding technique for precise GUI localization, and introduce Proactive Hierarchical Planning to refine action plans dynamically at multiple temporal scales. Result: Agent S2 sets new SOTA performance on three prominent computer use benchmarks, with substantial relative improvements. Conclusion: By composing generalist and specialist models, Agent S2 markedly improves the automation performance and generality of GUI tasks. Abstract: Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries. However, current agents face significant challenges: imprecise grounding of GUI elements, difficulties with long-horizon task planning, and performance bottlenecks from relying on single generalist models for diverse cognitive tasks. To this end, we introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a novel Mixture-of-Grounding technique to achieve precise GUI localization and introduce Proactive Hierarchical Planning, dynamically refining action plans at multiple temporal scales in response to evolving observations. Evaluations demonstrate that Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves 18.9% and 32.7% relative improvements over leading baseline agents such as Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluation. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing previous best methods by a relative 52.8% on WindowsAgentArena and a relative 16.52% on AndroidWorld. Code available at https://github.com/simular-ai/Agent-S.
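
Because the improvements above are stated in relative terms, it helps to recall how a relative gain differs from an absolute one. The sketch below uses made-up success rates purely for illustration; the abstract does not give the underlying baseline or Agent S2 scores, and `relative_improvement` is a hypothetical helper.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# Hypothetical success rates (percent), not taken from the paper.
baseline_rate = 20.0
new_rate = 25.0
print(new_rate - baseline_rate)                       # absolute gain: 5.0 points
print(relative_improvement(new_rate, baseline_rate))  # relative gain: 25.0 percent
```

A large relative improvement can therefore correspond to a modest absolute gain when the baseline success rate is low, which is worth keeping in mind when comparing numbers across benchmarks.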

Personalized Federated Training of Diffusion Models with Privacy Guarantees

Kumar Kshitij Patel,Weitong Zhang,Lingxiao Wang

Task: Propose a novel federated learning framework for training diffusion models on decentralized private datasets to generate high-quality and diverse synthetic data.

Motivation: Address data scarcity, privacy protection, and compliance in sensitive fields (such as healthcare, finance, and biomedical research), along with the growing restrictions on public datasets.

Details

Method: Train diffusion models within a federated learning framework, exploiting personalization and the inherent noise of the forward diffusion process to ensure robust differential privacy guarantees. Result: Experiments show that the framework outperforms non-collaborative training in settings with high data heterogeneity and effectively reduces biases and imbalances in synthetic data, yielding fairer downstream models. Conclusion: The framework offers an effective way to generate high-quality, privacy-preserving synthetic data, suitable for data sharing and model training in sensitive fields. Abstract: The scarcity of accessible, compliant, and ethically sourced data presents a considerable challenge to the adoption of artificial intelligence (AI) in sensitive fields like healthcare, finance, and biomedical research. Furthermore, access to unrestricted public datasets is increasingly constrained due to rising concerns over privacy, copyright, and competition. Synthetic data has emerged as a promising alternative, and diffusion models, a cutting-edge generative AI technology, provide an effective solution for generating high-quality and diverse synthetic data. In this paper, we introduce a novel federated learning framework for training diffusion models on decentralized private datasets. Our framework leverages personalization and the inherent noise in the forward diffusion process to produce high-quality samples while ensuring robust differential privacy guarantees. Our experiments show that our framework outperforms non-collaborative training methods, particularly in settings with high data heterogeneity, and effectively reduces biases and imbalances in synthetic data, resulting in fairer downstream models.

WorldScore: A Unified Evaluation Benchmark for World Generation

Haoyi Duan,Hong-Xing Yu,Sirui Chen,Li Fei-Fei,Jiajun Wu

Task: Introduce the WorldScore benchmark, the first unified benchmark for world generation.

Motivation: Decompose world generation into next-scene generation tasks with camera trajectory-based layouts, enabling unified evaluation of 3D scene generation, 4D scene generation, and video generation models.

Details

Method: Use a dataset of 3,000 test examples covering diverse worlds: static and dynamic, indoor and outdoor, photorealistic and stylized. Result: Evaluating 19 representative models reveals key insights and challenges for each category of models. Conclusion: The WorldScore benchmark provides a dataset, evaluation code, and a leaderboard to support unified evaluation in world generation. Abstract: We introduce the WorldScore benchmark, the first unified benchmark for world generation. We decompose world generation into a sequence of next-scene generation tasks with explicit camera trajectory-based layout specifications, enabling unified evaluation of diverse approaches from 3D and 4D scene generation to video generation models. The WorldScore benchmark encompasses a curated dataset of 3,000 test examples that span diverse worlds: static and dynamic, indoor and outdoor, photorealistic and stylized. The WorldScore metrics evaluate generated worlds through three key aspects: controllability, quality, and dynamics. Through extensive evaluation of 19 representative models, including both open-source and closed-source ones, we reveal key insights and challenges for each category of models. Our dataset, evaluation code, and leaderboard can be found at https://haoyi-duan.github.io/WorldScore/

GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors

Tian-Xing Xu,Xiangjun Gao,Wenbo Hu,Xiaoyu Li,Song-Hai Zhang,Ying Shan

Task: Propose GeometryCrafter, a novel framework that recovers temporally coherent, high-fidelity point map sequences from open-world videos.

Motivation: Existing video depth estimation methods are inherently limited in geometric fidelity because of their affine-invariant predictions, which restricts their use in reconstruction and other metrically grounded downstream tasks.

Details

Method: Use a point map variational autoencoder (VAE) that learns a latent space agnostic to video latent distributions, combined with a video diffusion model that models the distribution of point map sequences. Result: Extensive evaluations on multiple datasets show that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization. Conclusion: GeometryCrafter enables accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. Abstract: Despite remarkable advancements in video depth estimation, existing methods exhibit inherent limitations in achieving geometric fidelity through affine-invariant predictions, limiting their applicability in reconstruction and other metrically grounded downstream tasks. We propose GeometryCrafter, a novel framework that recovers high-fidelity point map sequences with temporal coherence from open-world videos, enabling accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. At the core of our approach lies a point map Variational Autoencoder (VAE) that learns a latent space agnostic to video latent distributions for effective point map encoding and decoding. Leveraging the VAE, we train a video diffusion model to model the distribution of point map sequences conditioned on the input videos. Extensive evaluations on diverse datasets demonstrate that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.