2025 04 13

EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models

Abhay Gupta,Jacob Cheung,Philip Meng,Shayan Sayyed,Austen Liao,Kevin Zhu,Sean O'Brien

Task: Evaluate five large language models on non-standard English dialects.

Motivation: Existing NLP benchmarks often overlook intra-language variation, leaving speakers of non-standard dialects underserved.

Details

Method: Translate Standard American English datasets into five underrepresented dialects via few-shot prompting, and compare the translations against rule-based methods.

Result: Models perform markedly worse on dialectal inputs than on Standard American English, revealing model bias.

Conclusion: EnDive advances dialect-aware NLP by exposing model biases.

Abstract: The diversity of human language, shaped by social, cultural, and regional influences, presents significant challenges for natural language processing (NLP) systems. Existing benchmarks often overlook intra-language variations, leaving speakers of non-standard dialects underserved. To address this gap, we introduce EnDive (English Diversity), a benchmark that evaluates five widely-used large language models (LLMs) across tasks in language understanding, algorithmic reasoning, mathematics, and logic. Our framework translates Standard American English datasets into five underrepresented dialects using few-shot prompting with verified examples from native speakers, and compares these translations against rule-based methods via fluency assessments, preference tests, and semantic similarity metrics. Human evaluations confirm high translation quality, with average scores of at least 6.02/7 for faithfulness, fluency, and formality. By filtering out near-identical translations, we create a challenging dataset that reveals significant performance disparities: models consistently underperform on dialectal inputs compared to Standard American English. EnDive thus advances dialect-aware NLP by uncovering model biases and promoting more equitable language technologies.

How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities

Aly M. Kassem,Bernhard Schölkopf,Zhijing Jin

Task: Propose DSC, an evaluation framework that comprehensively assesses large language model (LLM) routers across diverse query types as well as privacy and safety concerns.

Motivation: Current evaluation benchmarks focus too heavily on general model capabilities and overlook task-specific behaviors and potential risks (privacy, safety, and backdoor vulnerabilities), so a more comprehensive evaluation method is needed.

Details

Method: Design the DSC benchmark, which evaluates routers by category across query types (coding, translation, mathematics, and others) and integrates privacy and safety assessments.

Result: Experiments show that preference-based routers improve efficiency but often make suboptimal, category-driven decisions, such as sending all coding and mathematics queries to the most powerful model even when simpler models would suffice, or routing jailbreaking attempts to weaker models.

Conclusion: The DSC benchmark exposes the limitations of current routers and underscores the importance of balancing efficiency and safety.

Abstract: Large language model (LLM) routing has emerged as a crucial strategy for balancing computational costs with performance by dynamically assigning queries to the most appropriate model based on query complexity. Despite recent advances showing that preference-data-based routers can outperform traditional methods, current evaluation benchmarks remain limited. They largely focus on general model capabilities while overlooking task-specific behaviors and critical concerns such as privacy, safety, and potential backdoor vulnerabilities introduced through preference data. In response, we propose the DSC benchmark: Diverse, Simple, and Categorized, an evaluation framework that categorizes router performance across a broad spectrum of query types, including coding, translation, mathematics, human instructions, general knowledge, and LLM jailbreaking. Additionally, it integrates privacy and safety assessments to reveal hidden risks. Our experiments on three preference-based routers and two commercial counterparts demonstrate that while these systems improve efficiency, they often make suboptimal, category-driven decisions. For instance, a BERT-based router directs all coding and mathematics queries to the most powerful LLM even when simpler models would suffice, while routing jailbreaking attempts to weaker models, thereby elevating safety risks.

ChatBench: From Static Benchmarks to Human-AI Evaluation

Serina Chang,Ashton Anderson,Jake M. Hofman

Task: Design and conduct a user study that converts MMLU questions into user-AI conversations, in order to evaluate what humans and LLMs can achieve together.

Motivation: Standard benchmarks such as MMLU measure LLM capabilities only in isolation and ignore how well humans and AI perform together.

Details

Method: Convert MMLU questions into user-AI conversations through a user study, and release the ChatBench dataset with AI-alone, user-alone, and user-AI data.

Result: AI-alone accuracy fails to predict user-AI accuracy, with significant differences across subjects; fine-tuning a user simulator improves its ability to estimate user-AI accuracy.

Conclusion: The results demonstrate the potential of interactive evaluation and open up possibilities for scaling it.

Abstract: With the rapid adoption of LLM-based chatbots, there is a pressing need to evaluate what humans and LLMs can achieve together. However, standard benchmarks, such as MMLU, measure LLM capabilities in isolation (i.e., "AI-alone"). Here, we design and conduct a user study to convert MMLU questions into user-AI conversations, by seeding the user with the question and having them carry out a conversation with the LLM to answer their question. We release ChatBench, a new dataset with AI-alone, user-alone, and user-AI data for 396 questions and two LLMs, including 144K answers and 7,336 user-AI conversations. We find that AI-alone accuracy fails to predict user-AI accuracy, with significant differences across multiple subjects (math, physics, and moral reasoning), and we analyze the user-AI conversations to provide insight into how they diverge from AI-alone benchmarks. Finally, we show that fine-tuning a user simulator on a subset of ChatBench improves its ability to estimate user-AI accuracies, increasing correlation on held-out questions by more than 20 points, creating possibilities for scaling interactive evaluation.

EqualizeIR: Mitigating Linguistic Biases in Retrieval Models

Jiali Cheng,Hadi Amiri

Task: Propose the EqualizeIR framework to mitigate linguistic biases in information retrieval models.

Motivation: Existing information retrieval models perform unevenly on queries of differing linguistic complexity, showing significant bias.

Details

Method: Use a linguistically biased weak learner to capture biases in the dataset, then train a robust model by regularizing and refining its predictions with that weak learner.

Result: Experiments show that the method reduces the performance gap between linguistically simple and complex queries while improving overall retrieval performance.

Conclusion: EqualizeIR effectively mitigates linguistic bias in information retrieval models.

Abstract: This study finds that existing information retrieval (IR) models show significant biases based on the linguistic complexity of input queries, performing well on linguistically simpler (or more complex) queries while underperforming on linguistically more complex (or simpler) queries. To address this issue, we propose EqualizeIR, a framework to mitigate linguistic biases in IR models. EqualizeIR uses a linguistically biased weak learner to capture linguistic biases in IR datasets and then trains a robust model by regularizing and refining its predictions using the biased weak learner. This approach effectively prevents the robust model from overfitting to specific linguistic patterns in data. We propose four approaches for developing linguistically-biased models. Extensive experiments on several datasets show that our method reduces performance disparities across linguistically simple and complex queries, while improving overall retrieval performance.

Perception in Reflection

Yana Wei,Liang Zhao,Kangheng Lin,En Yu,Yuang Peng,Runpei Dong,Jianjian Sun,Haoran Wei,Zheng Ge,Xiangyu Zhang,Vishal M. Patel

Task: Propose Reflective Perception (RePer), a dual-model reflection mechanism that iteratively refines visual perception.

Motivation: Current large vision-language models (LVLMs) are limited in their initial perception and often fail to perceive accurately on the first attempt.

Details

Method: Use a dual-model reflection mechanism that alternates between a policy model and a critic model, combined with Reflective Perceptual Learning (RPL), which strengthens reflective capability through a visual reflection dataset and reflective unlikelihood training.

Result: Experiments show that RePer improves image understanding, captioning precision, and hallucination reduction, and that its attention patterns align strongly with human visual focus.

Conclusion: Perception in reflection offers a robust paradigm for future multimodal agents, particularly for tasks requiring complex reasoning and multi-step manipulation.

Abstract: We present a perception in reflection paradigm designed to transcend the limitations of current large vision-language models (LVLMs), which are expected to achieve perfect perception initially yet often fail to do so. Specifically, we propose Reflective Perception (RePer), a dual-model reflection mechanism that systematically alternates between policy and critic models and enables iterative refinement of visual perception. This framework is powered by Reflective Perceptual Learning (RPL), which reinforces intrinsic reflective capabilities through a methodically constructed visual reflection dataset and reflective unlikelihood training. Comprehensive experimental evaluation demonstrates RePer's quantifiable improvements in image understanding, captioning precision, and hallucination reduction. Notably, RePer achieves strong alignment between model attention patterns and human visual focus, while RPL optimizes fine-grained and free-form preference alignment. These advancements establish perception in reflection as a robust paradigm for future multimodal agents, particularly in tasks requiring complex reasoning and multi-step manipulation.

CLEAR: Contrasting Textual Feedback with Experts and Amateurs for Reasoning

Andrew Rufail,Daniel Kim,Sean O'Brien,Kevin Zhu

Task: Propose CLEAR, a new method that improves language model reasoning by combining feedback from an expert (larger) model and an amateur (smaller) model.

Motivation: Leverage the strengths of both large and small models, improving reasoning performance through contrasted feedback.

Details

Method: The expert and amateur models each give feedback on an initial output; the two are contrasted to produce refined feedback, which is applied iteratively to improve the response.

Result: CLEAR outperforms existing methods on several reasoning tasks, including story outline improvement, constrained generation, mathematical reasoning, and toxicity mitigation.

Conclusion: By combining expert and amateur feedback, CLEAR substantially improves language model reasoning.

Abstract: We introduce CLEAR (Contrasting Textual Feedback with Experts and Amateurs for Reasoning), a novel approach to language model reasoning that leverages the strengths of a larger (expert) model and a smaller (amateur) model. The expert and amateur models each provide feedback on a model's initial output and are contrasted with each other into refined feedback. This feedback is subsequently applied to iteratively improve CLEAR's responses. Our experiments demonstrate that CLEAR outperforms state-of-the-art methods in several challenging reasoning tasks, including story outline improvement (up to 19.6% relative increase in interestingness), constrained generation (up to 18.5% increase in coverage), mathematical reasoning (up to 6.7% improvement in accuracy) and mitigation of toxicity (decrease of up to 22% in toxicity).

Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning

Ashutosh Chaubey,Xulang Guan,Mohammad Soleymani

Task: Propose Face-LLaVA, a multimodal large language model for face-centered learning and reasoning.

Motivation: The human face is central to social communication, so human-centered applications need performant computer vision tools.

Details

Method: Build the FaceInstruct-1M database and design a face-specific visual encoder based on Face-Region Guided Cross-Attention.

Result: The model outperforms open-source models across nine datasets and five face processing tasks, and is competitive with commercial solutions.

Conclusion: Face-LLaVA shows promise for social AI and foundational vision-language research; the dataset and model will be released publicly.

Abstract: The human face plays a central role in social communication, necessitating the use of performant computer vision tools for human-centered applications. We propose Face-LLaVA, a multimodal large language model for face-centered, in-context learning, including facial expression and attribute recognition. Additionally, Face-LLaVA is able to generate natural language descriptions that can be used for reasoning. Leveraging existing visual databases, we first developed FaceInstruct-1M, a face-centered database for instruction tuning MLLMs for face processing. We then developed a novel face-specific visual encoder powered by Face-Region Guided Cross-Attention that integrates face geometry with local visual features. We evaluated the proposed method across nine different datasets and five different face processing tasks, including facial expression recognition, action unit detection, facial attribute detection, age estimation and deepfake detection. Face-LLaVA achieves superior results compared to existing open-source MLLMs and competitive performance compared to commercial solutions. Our model output also receives a higher reasoning rating by GPT under a zero-shot setting across all the tasks. Both our dataset and model will be released at https://face-llava.github.io to support future advancements in social AI and foundational vision-language research.

DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning

Sara Vera Marjanović,Arkil Patel,Vaibhav Adlakha,Milad Aghajohari,Parishad BehnamGhader,Mehar Bhatia,Aditi Khandelwal,Austin Kraft,Benno Krojer,Xing Han Lù,Nicholas Meade,Dongchan Shin,Amirhossein Kazemnejad,Gaurav Kamath,Marius Mosbach,Karolina Stańczak,Siva Reddy

Task: Study the reasoning behavior of the DeepSeek-R1 model and its implications.

Motivation: Explore how large reasoning models such as DeepSeek-R1 handle complex problems through multi-step reasoning chains, and analyze the controllability, safety, and cognitive phenomena of that reasoning.

Details

Method: Starting from a taxonomy of DeepSeek-R1's basic building blocks of reasoning, analyze thought length, context management, cultural and safety concerns, and cognitive phenomena.

Result: DeepSeek-R1 has a reasoning "sweet spot" beyond which extra inference time degrades performance; the model tends to ruminate on previously explored formulations and has notable safety vulnerabilities.

Conclusion: DeepSeek-R1's reasoning behavior is complex, and its controllability and safety need further improvement.

Abstract: Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed multi-step reasoning chains, seemingly "thinking" about a problem before providing an answer. This reasoning process is publicly available to the user, creating endless opportunities for studying the reasoning behaviour of the model and opening up the field of Thoughtology. Starting from a taxonomy of DeepSeek-R1's basic building blocks of reasoning, our analyses on DeepSeek-R1 investigate the impact and controllability of thought length, management of long or confusing contexts, cultural and safety concerns, and the status of DeepSeek-R1 vis-à-vis cognitive phenomena, such as human-like language processing and world modelling. Our findings paint a nuanced picture. Notably, we show DeepSeek-R1 has a 'sweet spot' of reasoning, where extra inference time can impair model performance. Furthermore, we find a tendency for DeepSeek-R1 to persistently ruminate on previously explored problem formulations, obstructing further exploration. We also note strong safety vulnerabilities of DeepSeek-R1 compared to its non-reasoning counterpart, which can also compromise safety-aligned LLMs.

Few-Shot Adaptation of Grounding DINO for Agricultural Domain

Rajhans Singh,Rafael Bidese Puhl,Kshitiz Dhakal,Sudhir Sornapudi

Task: Propose an efficient few-shot adaptation method that improves text-prompted open-set object detection models for agricultural applications.

Motivation: Deep learning models for agriculture need large amounts of annotated data, while open-set detectors such as Grounding-DINO make it hard to craft effective text prompts for complex objects.

Details

Method: Simplify the Grounding-DINO architecture by removing the text encoder module (BERT) and introducing a randomly initialized trainable text embedding.

Result: The method performs strongly across multiple agricultural datasets, with up to ~24% higher mAP than fully fine-tuned YOLO models and ~10% better results than previous state-of-the-art methods on remote sensing tasks.

Conclusion: The method offers a promising route to automating annotation and accelerating the development of agricultural AI solutions.

Abstract: Deep learning models are transforming agricultural applications by enabling automated phenotyping, monitoring, and yield estimation. However, their effectiveness heavily depends on large amounts of annotated training data, which can be labor- and time-intensive to produce. Recent advances in open-set object detection, particularly with models like Grounding-DINO, offer a potential solution to detect regions of interest based on text prompt input. Initial zero-shot experiments revealed challenges in crafting effective text prompts, especially for complex objects like individual leaves and visually similar classes. To address these limitations, we propose an efficient few-shot adaptation method that simplifies the Grounding-DINO architecture by removing the text encoder module (BERT) and introducing a randomly initialized trainable text embedding. This method achieves superior performance across multiple agricultural datasets, including plant-weed detection, plant counting, insect identification, fruit counting, and remote sensing tasks. Specifically, it demonstrates up to ~24% higher mAP than fully fine-tuned YOLO models on agricultural datasets and outperforms previous state-of-the-art methods by ~10% in remote sensing, under few-shot learning conditions. Our method offers a promising solution for automating annotation and accelerating the development of specialized agricultural AI solutions.

HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation

Mingxuan Li,Hanchen Li,Chenhao Tan

Task: Propose HypoEval, a framework for automated evaluation of natural language generation.

Motivation: Existing LLM-as-a-judge frameworks either use no human input, which leads to low alignment, or require large amounts of labeled data, and they offer little interpretability.

Details

Method: Use a small set of human evaluations to generate detailed rubrics, then apply a checklist-like approach that scores each decomposed dimension and combines the scores.

Result: With only 30 human evaluations, HypoEval achieves state-of-the-art alignment with human rankings (Spearman) and human scores (Pearson), on average outperforming G-Eval by 11.86% and a fine-tuned Llama-3.1-8B-Instruct by 11.95%.

Conclusion: HypoEval is a reliable and interpretable automated evaluation framework.

Abstract: Large language models (LLMs) have demonstrated great potential for automating the evaluation of natural language generation. Previous frameworks of LLM-as-a-judge fall short in two ways: they either use a zero-shot setting without consulting any human input, which leads to low alignment, or fine-tune LLMs on labeled data, which requires a non-trivial number of samples. Moreover, previous methods often provide little reasoning behind automated evaluations. In this paper, we propose HypoEval, a Hypothesis-guided Evaluation framework, which first uses a small corpus of human evaluations to generate more detailed rubrics for human judgments and then incorporates a checklist-like approach to combine an LLM's assigned scores on each decomposed dimension to acquire overall scores. With only 30 human evaluations, HypoEval achieves state-of-the-art performance in alignment with both human rankings (Spearman correlation) and human scores (Pearson correlation), on average outperforming G-Eval by 11.86% and fine-tuned Llama-3.1-8B-Instruct with at least 3 times more human evaluations by 11.95%. Furthermore, we conduct systematic studies to assess the robustness of HypoEval, highlighting its effectiveness as a reliable and interpretable automated evaluation framework.
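
The two alignment measures named in the abstract can be computed as in the sketch below. It is only an illustration of Pearson and Spearman correlation between human and automated scores, not HypoEval's own evaluation code; the function names and the pure-Python implementation are assumptions made here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: agreement between the raw score values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Spearman depends only on the ordering of the scores, so it tracks agreement with human rankings, while Pearson is sensitive to the score values themselves.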

Quantifying Epistemic Uncertainty in Absolute Pose Regression

Fereidoon Zangeneh,Amit Dekel,Alessandro Pieropan,Patric Jensfelt

Task: Visual relocalization by directly regressing the camera pose with a neural network.

Motivation: Absolute pose regression is attractive in memory and compute efficiency, but its predictions are inaccurate and unreliable outside the training domain.

Details

Method: Propose a new method within a variational framework that quantifies the epistemic uncertainty of an absolute pose regression model by estimating the likelihood of observations.

Result: The method outperforms existing approaches in capturing the relation between uncertainty and prediction error.

Conclusion: Beyond providing confidence in predictions, the method can probabilistically localize the camera in the presence of repetitive structures.

Abstract: Visual relocalization is the task of estimating the camera pose given an image it views. Absolute pose regression offers a solution to this task by training a neural network, directly regressing the camera pose from image features. While an attractive solution in terms of memory and compute efficiency, absolute pose regression's predictions are inaccurate and unreliable outside the training domain. In this work, we propose a novel method for quantifying the epistemic uncertainty of an absolute pose regression model by estimating the likelihood of observations within a variational framework. Beyond providing a measure of confidence in predictions, our approach offers a unified model that also handles observation ambiguities, probabilistically localizing the camera in the presence of repetitive structures. Our method outperforms existing approaches in capturing the relation between uncertainty and prediction error.

SemEval-2025 Task 5: LLMs4Subjects -- LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog

Jennifer D'Souza,Sameer Sadruddin,Holger Israel,Mathias Begoin,Diana Slawig

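Task: Develop LLM-based systems that automatically recommend subject tags from the GND taxonomy for scientific and technical records in English and German.

Motivation: Explore how LLMs can improve subject classification in digital libraries, raising its accuracy and efficiency.

Details

Method: Participants built LLM-based systems to recommend top-k subjects; the systems were evaluated with quantitative metrics (precision, recall, F1-score) and qualitative assessments by subject specialists.

Result: LLM ensembles, synthetic data generation, and multilingual processing proved effective for subject tagging.

Conclusion: The task offers useful insights into applying LLMs to digital library classification.

Abstract: We present SemEval-2025 Task 5: LLMs4Subjects, a shared task on automated subject tagging for scientific and technical records in English and German using the GND taxonomy. Participants developed LLM-based systems to recommend top-k subjects, evaluated through quantitative metrics (precision, recall, F1-score) and qualitative assessments by subject specialists. Results highlight the effectiveness of LLM ensembles, synthetic data generation, and multilingual processing, offering insights into applying LLMs for digital library classification.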
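
As a rough illustration of the quantitative metrics mentioned above, the sketch below scores a ranked list of predicted subject tags against a gold set at a cutoff k. The function name, the tag strings, and the assumption that the ranked list contains no duplicates are illustrative choices made here, not details of the shared task's official scorer.

```python
def precision_recall_f1_at_k(predicted, gold, k):
    """Score the top-k predicted subject tags against the gold tag set.

    predicted: ranked list of tags, best first (assumed free of duplicates).
    gold: iterable of correct tags for the record.
    """
    top_k = predicted[:k]
    gold_set = set(gold)
    hits = sum(1 for tag in top_k if tag in gold_set)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Hypothetical record: four ranked predictions, three gold subjects.
p, r, f1 = precision_recall_f1_at_k(["a", "b", "c", "d"], ["b", "d", "e"], k=3)
```

Raising k can only keep or increase recall, while precision may fall, which is why such tasks usually report the metrics at several cutoffs.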

Krzysztof Byrski,Jacek Tabor,Przemysław Spurek,Marcin Mazur

Task: Propose CEC-MMR, a new method based on Cross-Entropy Clustering (CEC) that automatically detects the number of components in a regression problem.

Motivation: A single Gaussian predicts a mean that may lie between modes, far from the actual data; Mixture Density Networks (MDNs) address this but cannot reliably determine the number of components.

Details

Method: Use Cross-Entropy Clustering (CEC) to detect the number of components automatically and to uniquely identify the component underlying a given attribute value.

Result: Experiments show that CEC-MMR outperforms classical MDNs.

Conclusion: CEC-MMR is an effective alternative that addresses the MDN limitation in determining the number of components.

Abstract: In practical applications of regression analysis, it is not uncommon to encounter a multitude of values for each attribute. In such a situation, the univariate distribution, which is typically Gaussian, is suboptimal because the mean may be situated between modes, resulting in a predicted value that differs significantly from the actual data. Consequently, to address this issue, a mixture distribution with parameters learned by a neural network, known as a Mixture Density Network (MDN), is typically employed. However, this approach has an important inherent limitation, in that it is not feasible to ascertain the precise number of components with a reasonable degree of accuracy. In this paper, we introduce CEC-MMR, a novel approach based on Cross-Entropy Clustering (CEC), which allows for the automatic detection of the number of components in a regression problem. Furthermore, given an attribute and its value, our method is capable of uniquely identifying it with the underlying component. The experimental results demonstrate that CEC-MMR yields superior outcomes compared to classical MDNs.
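
The abstract's point that a single mean can fall between modes can be checked numerically. The sketch below uses a made-up two-component Gaussian mixture (weights, means, and standard deviations are illustrative assumptions, not values from the paper) and compares the density at the mixture mean with the density at one mode.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def mixture_pdf(x, weights, means, stds):
    """Density of a Gaussian mixture at x."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))

# Hypothetical bimodal target distribution with modes at -2 and +2.
weights, means, stds = [0.5, 0.5], [-2.0, 2.0], [0.5, 0.5]

# A single-Gaussian regressor would predict this mean...
mixture_mean = sum(w * m for w, m in zip(weights, means))

# ...which sits in a region where almost no data occur.
density_at_mean = mixture_pdf(mixture_mean, weights, means, stds)
density_at_mode = mixture_pdf(means[0], weights, means, stds)
```

Here the mean is 0, midway between the modes, and the density there is orders of magnitude lower than at either mode, which is the failure that mixture-based regressors are meant to avoid.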

ConceptCarve: Dynamic Realization of Evidence

Eylon Caplan,Dan Goldwasser

Task: Develop ConceptCarve, an evidence retrieval framework for identifying and analyzing human opinion and behavior at scale on social media.

Motivation: Studying human opinion and behavior requires understanding sophisticated thought patterns; on social media this means identifying instances of abstract concepts that are instantiated differently across communities.

Details

Method: Combine traditional retrievers with large language models (LLMs) to dynamically characterize the search space during retrieval.

Result: ConceptCarve outperforms traditional retrieval systems at finding evidence within a social media community and produces an interpretable representation of that evidence.

Conclusion: ConceptCarve can analyze complex thought patterns across communities and provides an interpretable representation of the evidence.

Abstract: Finding evidence for human opinion and behavior at scale is a challenging task, often requiring an understanding of sophisticated thought patterns among vast online communities found on social media. For example, studying how gun ownership is related to the perception of Freedom requires a retrieval system that can operate at scale over social media posts, while dealing with two key challenges: (1) identifying abstract concept instances, (2) which can be instantiated differently across different communities. To address these, we introduce ConceptCarve, an evidence retrieval framework that utilizes traditional retrievers and LLMs to dynamically characterize the search space during retrieval. Our experiments show that ConceptCarve surpasses traditional retrieval systems in finding evidence within a social media community. It also produces an interpretable representation of the evidence for that community, which we use to qualitatively analyze complex thought patterns that manifest differently across the communities.

Objaverse++: Curated 3D Object Dataset with Quality Annotations

Chendi Lin,Heshan Liu,Qunshu Lin,Zachary Bright,Shitao Tang,Yihui He,Minghao Liu,Ling Zhu,Cindy Le

Task: Present Objaverse++, a subset of Objaverse with detailed annotations by human experts, to improve the quality of 3D content generation.

Motivation: Although Objaverse is the largest available 3D asset dataset, low-quality models predominate, which limits its practical utility.

Details

Method: Manually annotate 10,000 3D objects with detailed attributes and train a neural network to tag the rest of the dataset.

Result: Models pre-trained on the quality-focused subset perform better in image-to-3D generation, and higher-quality data makes the training loss converge faster.

Conclusion: Careful curation and rich annotation can compensate for raw dataset size, offering a more efficient path to developing 3D generative models.

Abstract: This paper presents Objaverse++, a curated subset of Objaverse enhanced with detailed attribute annotations by human experts. Recent advances in 3D content generation have been driven by large-scale datasets such as Objaverse, which contains over 800,000 3D objects collected from the Internet. Although Objaverse represents the largest available 3D asset collection, its utility is limited by the predominance of low-quality models. To address this limitation, we manually annotate 10,000 3D objects with detailed attributes, including aesthetic quality scores, texture color classifications, multi-object composition flags, transparency characteristics, etc. We then train a neural network capable of annotating the tags for the rest of the Objaverse dataset. Through experiments and a user study on generation results, we demonstrate that models pre-trained on our quality-focused subset achieve better performance than those trained on the larger dataset of Objaverse in image-to-3D generation tasks. In addition, by comparing multiple subsets of training data filtered by our tags, our results show that the higher the data quality, the faster the training loss converges. These findings suggest that careful curation and rich annotation can compensate for the raw dataset size, potentially offering a more efficient path to develop 3D generative models. We release our enhanced dataset of approximately 500,000 curated 3D models to facilitate further research on various downstream tasks in 3D computer vision. In the near future, we aim to extend our annotations to cover the entire Objaverse dataset.

Visual-Aware Speech Recognition for Noisy Scenarios

Lakshmipathi Balaji,Karan Singla

Task: Propose a model that improves speech transcription in noisy environments by correlating noise sources with visual cues.

Motivation: Current ASR and AVSR models struggle in noisy environments, whereas humans use visual cues such as lip movements and the surrounding scene to enhance auditory perception.

Details

Method: Reuse pretrained speech and visual encoders, link them with multi-headed attention to correlate noise sources with visual information, and introduce a scalable pipeline for building audio-visual datasets.

Result: The model significantly outperforms audio-only models in noisy scenarios, and visual cues prove vital to transcription accuracy.

Conclusion: By exploiting visual information from the environment, the model filters noise and improves transcription more naturally, much as humans do in noisy settings.

Abstract: Humans have the ability to utilize visual cues, such as lip movements and visual scenes, to enhance auditory perception, particularly in noisy environments. However, current Automatic Speech Recognition (ASR) or Audio-Visual Speech Recognition (AVSR) models often struggle in noisy scenarios. To address this, we propose a model that improves transcription by correlating noise sources to visual cues. Unlike works that rely on lip motion and require the speaker's visibility, we exploit broader visual information from the environment. This allows our model to naturally filter speech from noise and improve transcription, much like humans do in noisy scenarios. Our method re-purposes pretrained speech and visual encoders, linking them with multi-headed attention. This approach enables the transcription of speech and the prediction of noise labels in video inputs. We introduce a scalable pipeline to develop audio-visual datasets, where visual cues correlate to noise in the audio. We show significant improvements over existing audio-only models in noisy scenarios. Results also highlight that visual cues play a vital role in improved transcription accuracy.

DLTPose: 6DoF Pose Estimation From Accurate Dense Surface Point Estimates

Akash Jadhav,Michael Greenspan

Task: Propose DLTPose, a new 6DoF object pose estimation method that combines the strengths of sparse keypoint methods and dense pixel-wise prediction.

Motivation: Resolve the inconsistencies that fixed keypoint orderings cause for symmetric objects, and improve the accuracy and robustness of pose estimation.

Details

Method: Predict per-pixel radial distances to keypoints, feed them into a Direct Linear Transform (DLT) formulation to estimate the 3D object surface, and introduce a symmetry-aware keypoint ordering approach.

Result: The method achieves Mean Average Recall of 86.5%, 79.7%, and 89.5% on the LINEMOD, Occlusion LINEMOD, and YCB-Video datasets, respectively.

Conclusion: DLTPose outperforms existing methods, especially on symmetric and occluded objects; the code is available.

Abstract: We propose DLTPose, a novel method for 6DoF object pose estimation from RGB-D images that combines the accuracy of sparse keypoint methods with the robustness of dense pixel-wise predictions. DLTPose predicts per-pixel radial distances to a set of at least four keypoints, which are then fed into our novel Direct Linear Transform (DLT) formulation to produce accurate 3D object frame surface estimates, leading to better 6DoF pose estimation. Additionally, we introduce a novel symmetry-aware keypoint ordering approach, designed to handle object symmetries that otherwise cause inconsistencies in keypoint assignments. Previous keypoint-based methods relied on fixed keypoint orderings, which failed to account for the multiple valid configurations exhibited by symmetric objects, which our ordering approach exploits to enhance the model's ability to learn stable keypoint representations. Extensive experiments on the benchmark LINEMOD, Occlusion LINEMOD and YCB-Video datasets show that DLTPose outperforms existing methods, especially for symmetric and occluded objects, demonstrating superior Mean Average Recall values of 86.5% (LM), 79.7% (LM-O) and 89.5% (YCB-V). The code is available at https://anonymous.4open.science/r/DLTPose_/ .
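
To see why radial distances to four keypoints suffice to fix a 3D point, consider the generic linearization below. It is a textbook multilateration sketch, not the paper's DLT formulation: subtracting the first sphere equation |p - k_i|^2 = r_i^2 from the other three cancels the quadratic term and leaves a 3x3 linear system in p. The function names are assumptions made here, and the four keypoints must not be coplanar.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def point_from_radial_distances(keypoints, radii):
    """Recover p from distances r_i = |p - k_i| to four keypoints.

    Each row encodes 2 (k_i - k_0) . p = |k_i|^2 - |k_0|^2 - r_i^2 + r_0^2.
    """
    k0, r0 = keypoints[0], radii[0]
    sq = lambda v: sum(c * c for c in v)
    A, b = [], []
    for k, r in zip(keypoints[1:4], radii[1:4]):
        A.append([2 * (k[j] - k0[j]) for j in range(3)])
        b.append(sq(k) - sq(k0) - r * r + r0 * r0)
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("keypoints are coplanar; the linear system is singular")
    point = []
    for col in range(3):  # Cramer's rule, one coordinate per column
        Ac = [row[:] for row in A]
        for row in range(3):
            Ac[row][col] = b[row]
        point.append(det3(Ac) / d)
    return point

# Hypothetical keypoints and a surface point to recover.
keypoints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
true_point = (0.2, -0.3, 0.5)
radii = [math.dist(true_point, k) for k in keypoints]
recovered = point_from_radial_distances(keypoints, radii)
```

With more than four keypoints or noisy distances, the same rows would typically be solved in a least-squares sense rather than exactly.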

Language Modeling for the Future of Finance: A Quantitative Survey into Metrics, Tasks, and Data Opportunities

Nikita Tatarinov,Siddhant Sukhani,Agam Shah,Sudheer Chava

Task: Systematically review and analyze 374 NLP research papers published between 2017 and 2024, with a focused analysis of the 221 that directly address financial tasks.

Motivation: Examine trends in applying NLP techniques to financial problems and provide a structured overview and practical insights for research and practice.

Details

Method: Evaluate the papers along 11 qualitative and quantitative dimensions, identifying key trends such as the use of general-purpose language models, progress in sentiment analysis and information extraction, and work on explainability and privacy-preserving methods.

Result: Use of general-purpose language models is increasing and sentiment analysis and information extraction are progressing steadily; the survey stresses the importance of domain-specific evaluation metrics and the need for more accessible, adaptive datasets.

Conclusion: The survey gives a structured overview of NLP research in finance and emphasizes strengthening model robustness under real-world conditions.

Abstract: Recent advances in language modeling have led to growing interest in applying Natural Language Processing (NLP) techniques to financial problems, enabling new approaches to analysis and decision-making. To systematically examine this trend, we review 374 NLP research papers published between 2017 and 2024 across 38 conferences and workshops, with a focused analysis of 221 papers that directly address finance-related tasks. We evaluate these papers across 11 qualitative and quantitative dimensions, identifying key trends such as the increasing use of general-purpose language models, steady progress in sentiment analysis and information extraction, and emerging efforts around explainability and privacy-preserving methods. We also discuss the use of evaluation metrics, highlighting the importance of domain-specific ones to complement standard machine learning metrics. Our findings emphasize the need for more accessible, adaptive datasets and highlight the significance of incorporating financial crisis periods to strengthen model robustness under real-world conditions. This survey provides a structured overview of NLP research applied to finance and offers practical insights for researchers and practitioners working at this intersection.

Zeus: Zero-shot LLM Instruction for Union Segmentation in Multimodal Medical Imaging

Siyuan Dai,Kai Ye,Guodong Liu,Haoteng Tang,Liang Zhan

Task: Propose a multimodal medical image segmentation method based on a Vision-LLM union framework that needs no pre-collected paired vision-language dataset.

Motivation: Clinical diagnosis requires integrating domain knowledge such as textual information, but collecting paired multimodal data is expensive and time-consuming.

Details

Method: Use frozen large language models (LLMs) to generate zero-shot instructions, imitating the radiology scanning and report generation process, and produce precise text instructions from multimodal medical images.

Result: Experiments show that the method performs better on multimodal segmentation tasks.

Conclusion: The proposed framework handles multimodal medical image segmentation without relying on pre-collected paired datasets.

Abstract: Medical image segmentation has achieved remarkable success through the continuous advancement of UNet-based and Transformer-based foundation backbones. However, clinical diagnosis in the real world often requires integrating domain knowledge, especially textual information. Multimodal learning over visual and text modalities has been shown as a solution, but collecting paired vision-language datasets is expensive and time-consuming, posing significant challenges. Inspired by the superior ability of Large Language Models (LLMs) in numerous cross-modal tasks, we propose a novel Vision-LLM union framework to address these issues. Specifically, we introduce frozen LLMs for zero-shot instruction generation based on corresponding medical images, imitating the radiology scanning and report generation process. To better approximate real-world diagnostic processes, we generate more precise text instructions from multimodal radiology images (e.g., T1-w or T2-w MRI and CT), building on the semantic understanding and rich knowledge of LLMs. This process emphasizes extracting special features from different modalities and reuniting the information for the ultimate clinical diagnosis. With the generated text instructions, our proposed union segmentation framework can handle multimodal segmentation without previously collected vision-language datasets. To evaluate our proposed method, we conduct comprehensive experiments with influential baselines; the statistical results and the visualized case study demonstrate the superiority of our novel method.

RAISE: Reinforenced Adaptive Instruction Selection For Large Language Models

Lv Qingsong,Yangning Li,Zihua Lan,Zishan Xu,Jiwei Tang,Yinghui Li,Wenhao Jiang,Hai-Tao Zheng,Philip S. Yu

Task: Design RAISE, a dynamic, task-objective-driven instruction selection framework that optimizes the instruction fine-tuning of large language models (LLMs).

Motivation: Existing instruction selection methods mostly rely on heuristic quality metrics and select data only before training, which leaves instruction fine-tuning under-optimized and hard to tailor to specific tasks.

Details

Method: Model dynamic instruction selection as a sequential decision-making process, train the selection strategy with reinforcement learning (RL), and optimize the selection throughout instruction fine-tuning.

Result: RAISE outperforms full-data training while updating only 1% of the training steps, showing both efficiency and effectiveness.

Conclusion: RAISE has strong task-specific optimization capabilities and good interpretability, and it substantially improves instruction fine-tuning.

Abstract: In the instruction fine-tuning of large language models (LLMs), it has become a consensus that a few high-quality instructions are superior to a large number of low-quality instructions. At present, many instruction selection methods have been proposed, but most of these methods select instructions based on heuristic quality metrics, and only consider data selection before training. These designs lead to insufficient optimization of instruction fine-tuning, and fixed heuristic indicators are often difficult to optimize for specific tasks. We therefore design a dynamic, task-objective-driven instruction selection framework, RAISE (Reinforenced Adaptive Instruction SElection), which incorporates the entire instruction fine-tuning process into optimization, selecting instructions at each step based on their expected impact on model performance improvement. Our approach is well interpretable and has strong task-specific optimization capabilities. By modeling dynamic instruction selection as a sequential decision-making process, we use RL to train our selection strategy. Extensive experiments and result analysis prove the superiority of our method compared with other instruction selection methods. Notably, RAISE achieves superior performance by updating only 1% of the training steps compared to full-data training, demonstrating its efficiency and effectiveness.

View-Dependent Uncertainty Estimation of 3D Gaussian Splatting

Chenyu Han,Corentin Dumery

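Task: Propose a method for modeling uncertainty in 3D Gaussian Splatting (3DGS).

Motivation: 3DGS achieves high visual accuracy in 3D scene reconstruction, but its uncertainty estimation remains underexplored even though it is crucial for downstream tasks such as asset extraction and scene completion.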

Details

Method: Models uncertainty as an additional view-dependent per-Gaussian feature represented with spherical harmonics. Result: The method is simple, efficient, and easy to interpret, and it is faster than ensemble methods while maintaining high accuracy. Conclusion: The proposed uncertainty modeling provides effective view-dependent uncertainty estimates for 3DGS scenes and is suitable for practical use. Abstract: 3D Gaussian Splatting (3DGS) has become increasingly popular in 3D scene reconstruction for its high visual accuracy. However, uncertainty estimation of 3DGS scenes remains underexplored and is crucial to downstream tasks such as asset extraction and scene completion. Since the appearance of 3D Gaussians is view-dependent, the color of a Gaussian can be certain from one angle and uncertain from another. We thus propose to model uncertainty in 3DGS as an additional view-dependent per-Gaussian feature that can be modeled with spherical harmonics. This simple yet effective modeling is easily interpretable and can be integrated into the traditional 3DGS pipeline. It is also significantly faster than ensemble methods while maintaining high accuracy, as demonstrated in our experiments.
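To make the idea of a view-dependent per-Gaussian feature concrete, the following is a minimal sketch of evaluating spherical harmonics up to degree 1 for one Gaussian. It is not the authors' implementation: the function names, the restriction to degree 1, and the softplus used to keep the value positive are all assumptions made for illustration. The basis constants and sign convention follow the real spherical harmonics commonly used for color in 3DGS code.

```python
import math

SH_C0 = 0.28209479177387814  # constant value of the degree-0 basis function
SH_C1 = 0.4886025119029199   # scale of the three degree-1 basis functions


def sh_basis_deg1(direction):
    """Real SH basis (degrees 0 and 1) for a viewing direction (x, y, z)."""
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm  # the basis expects a unit vector
    return [SH_C0, -SH_C1 * y, SH_C1 * z, -SH_C1 * x]


def view_dependent_uncertainty(coeffs, direction):
    """Hypothetical per-Gaussian uncertainty: SH expansion plus softplus.

    coeffs holds 4 learned coefficients for one Gaussian (assumed layout).
    """
    raw = sum(c * b for c, b in zip(coeffs, sh_basis_deg1(direction)))
    # Numerically stable softplus keeps the uncertainty strictly positive.
    return max(raw, 0.0) + math.log1p(math.exp(-abs(raw)))


if __name__ == "__main__":
    coeffs = [0.2, 0.0, 1.5, 0.0]  # made-up values: more uncertain along +z
    print(view_dependent_uncertainty(coeffs, (0.0, 0.0, 1.0)))
    print(view_dependent_uncertainty(coeffs, (0.0, 0.0, -1.0)))
```

With only the degree-0 coefficient set, the value is the same from every direction; the degree-1 coefficients add the directional variation the abstract describes.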

MDIT: A Model-free Data Interpolation Method for Diverse Instruction Tuning

Yangning Li,Zihua Lan,Lv Qingsong,Yinghui Li,Hai-Tao Zheng

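Task: Propose MDIT, a model-free data interpolation method for diverse instruction tuning.

Motivation: Current data management strategies struggle to generate diverse and comprehensive data, which limits further gains in model performance.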

Details

Method: Generates diverse, high-quality instruction data through task interpolation, combined with diversity-based clustering strategies. Result: Achieves strong results on multiple benchmark tasks and significantly improves LLM performance on general question answering, math reasoning, and code generation. Conclusion: MDIT provides an efficient, automatic data synthesis method that generates diverse instruction data without external resources, expanding the application potential of LLMs in complex environments. Abstract: As Large Language Models (LLMs) are increasingly applied across various tasks, instruction tuning has emerged as a critical method for enhancing model performance. However, current data management strategies face substantial challenges in generating diverse and comprehensive data, restricting further improvements in model performance. To address this gap, we propose MDIT, a novel model-free data interpolation method for diverse instruction tuning, which generates varied and high-quality instruction data by performing task interpolation. Moreover, it contains diversity-based clustering strategies to ensure the diversity of the training data. Extensive experiments show that our method achieves superior performance in multiple benchmark tasks. The LLMs finetuned with MDIT show significant improvements in numerous tasks such as general question answering, math reasoning, and code generation. MDIT offers an efficient and automatic data synthesis method, generating diverse instruction data without depending on external resources while expanding the application potential of LLMs in complex environments.

Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction

Junyi Ma,Wentao Bao,Jingyi Xu,Guanzhong Sun,Xieyuanli Chen,Hesheng Wang

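Task: Predict future 3D hand trajectories using multimodal environmental information.

Motivation: Existing methods accept only 2D video inputs, make little use of multimodal environmental information, and overlook the synergy between hand movements and headset camera egomotion.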

Details

Method: Proposes the MMTwin diffusion models, which take 2D RGB images, 3D point clouds, past hand waypoints, and a text prompt as input, and integrate two latent diffusion models to predict camera egomotion and hand trajectories. Result: Outperforms existing baselines on three public datasets and on self-recorded data, and generalizes to unseen environments. Conclusion: MMTwin effectively predicts future 3D hand trajectories and generalizes well. Abstract: Predicting hand motion is critical for understanding human intentions and bridging the action space between human movements and robot manipulations. Existing hand trajectory prediction (HTP) methods forecast the future hand waypoints in 3D space conditioned on past egocentric observations. However, such models are only designed to accommodate 2D egocentric video inputs. There is a lack of awareness of multimodal environmental information from both 2D and 3D observations, hindering the further improvement of 3D HTP performance. In addition, these models overlook the synergy between hand movements and headset camera egomotion, either predicting hand trajectories in isolation or encoding egomotion only from past frames. To address these limitations, we propose novel diffusion models (MMTwin) for multimodal 3D hand trajectory prediction. MMTwin is designed to absorb multimodal information as input encompassing 2D RGB images, 3D point clouds, past hand waypoints, and text prompt. Besides, two latent diffusion models, the egomotion diffusion and the HTP diffusion as twins, are integrated into MMTwin to predict camera egomotion and future hand trajectories concurrently. We propose a novel hybrid Mamba-Transformer module as the denoising model of the HTP diffusion to better fuse multimodal features. The experimental results on three publicly available datasets and our self-recorded data demonstrate that our proposed MMTwin can predict plausible future 3D hand trajectories compared to the state-of-the-art baselines, and generalizes well to unseen environments. The code and pretrained models will be released at https://github.com/IRMVLab/MMTwin.

PAYADOR: A Minimalist Approach to Grounding Language Models on Structured Data for Interactive Storytelling and Role-playing Games

Santiago Góngora,Luis Chiruzzo,Gonzalo Méndez,Pablo Gervás

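Task: Propose PAYADOR, an approach that addresses the world-update problem in interactive storytelling systems by predicting the outcomes of actions instead of representing the actions themselves.

Motivation: Classical approaches map player input to preprogrammed actions, which constrains the player's free will; the effect is especially pronounced in role-playing games.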

Details

Method: Grounds a large language model in a minimal representation of the fictional world to predict action outcomes. Result: Obtains promising results; the approach is released as open source for further research. Conclusion: PAYADOR opens a new research direction for unleashing the co-creative potential of role-playing games. Abstract: Every time an Interactive Storytelling (IS) system gets a player input, it is facing the world-update problem. Classical approaches to this problem consist in mapping that input to known preprogrammed actions, which can severely constrain the free will of the player. When the expected experience has a strong focus on improvisation, like in Role-playing Games (RPGs), this problem is critical. In this paper we present PAYADOR, a different approach that focuses on predicting the outcomes of the actions instead of representing the actions themselves. To implement this approach, we ground a Large Language Model to a minimal representation of the fictional world, obtaining promising results. We make this contribution open-source, so it can be adapted and used for other related research on unleashing the co-creativity power of RPGs.

BRepFormer: Transformer-Based B-rep Geometric Feature Recognition

Yongkang Dai,Xiaoshui Huang,Yunpeng Bai,Hao Guo,Hongping Gan,Ling Yang,Yilei Shi

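Task: Propose BRepFormer, a transformer-based model that recognizes both machining features and the features of complex CAD models.

Motivation: Prior work mostly focuses on machining feature recognition (MFR) and fails to capture the topological and geometric characteristics of complex geometric features, limiting applications in multimedia content retrieval and intelligent manufacturing.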

Details

Method: BRepFormer encodes and fuses the geometric and topological features of the model, uses a transformer architecture for feature propagation, and adds a bias that combines edge and topology features to reinforce geometric constraints. Result: Experiments show that BRepFormer achieves state-of-the-art accuracy on the MFInstSeg, MFTRCAD, and self-built CBF datasets. Conclusion: BRepFormer effectively recognizes complex geometric features, and the new CBF dataset better matches industrial needs. Abstract: Recognizing geometric features on B-rep models is a cornerstone technique for multimedia content-based retrieval and has been widely applied in intelligent manufacturing. However, previous research often merely focused on Machining Feature Recognition (MFR), falling short in effectively capturing the intricate topological and geometric characteristics of complex geometry features. In this paper, we propose BRepFormer, a novel transformer-based model to recognize both machining features and complex CAD models' features. BRepFormer encodes and fuses the geometric and topological features of the models. Afterwards, BRepFormer utilizes a transformer architecture for feature propagation and a recognition head to identify geometry features. During each iteration of the transformer, we incorporate a bias that combines edge features and topology features to reinforce geometric constraints on each face. In addition, we also propose a dataset named Complex B-rep Feature Dataset (CBF), comprising 20,000 B-rep models. By covering more complex B-rep models, it is better aligned with industrial applications. The experimental results demonstrate that BRepFormer achieves state-of-the-art accuracy on the MFInstSeg, MFTRCAD, and our CBF datasets.

Alessio Tosolini,Claire Bowern

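Task: Compare multilingual and crosslingual training for related and unrelated Australian languages.

Motivation: Examine how multilingual and crosslingual training affect model performance, particularly when the languages differ in how closely they are related.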

Details

Method: Uses the Montreal Forced Aligner to train acoustic models from scratch and to adapt a large English model, evaluating on seen data, unseen data (seen language), and unseen data and language. Result: Adapting the English baseline model shows benefits for previously unseen languages. Conclusion: In crosslingual training, adapting a baseline model is of practical value for unseen languages. Abstract: We compare the outcomes of multilingual and crosslingual training for related and unrelated Australian languages with similar phonological inventories. We use the Montreal Forced Aligner to train acoustic models from scratch and adapt a large English model, evaluating results against seen data, unseen data (seen language), and unseen data and language. Results indicate benefits of adapting the English baseline model for previously unseen languages.

Model Discrepancy Learning: Synthetic Faces Detection Based on Multi-Reconstruction

Qingchao Jiang,Zhishuo Xu,Zhiying Zhu,Ning Chen,Haoyue Wang,Zhongjie Ba

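Task: Explore the intrinsic relationship between synthetic images and their generation techniques, and propose a multi-reconstruction-based detector.

Motivation: Existing synthetic face detection research overlooks the differences among generation techniques, which limits detection performance.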

Details

Method: Reverses and reconstructs images with multiple generative models and analyzes the reconstruction differences to distinguish real, GAN-generated, and diffusion-model (DM)-generated images. Result: The proposed detector achieves exceptional performance in experiments, with strong generalization and robustness. Conclusion: The study reveals the relationship between generation techniques and synthetic images, and the proposed method shows clear advantages for synthetic face detection. Abstract: Advances in image generation enable hyper-realistic synthetic faces but also pose risks, thus making synthetic face detection crucial. Previous research focuses on the general differences between generated images and real images, often overlooking the discrepancies among various generative techniques. In this paper, we explore the intrinsic relationship between synthetic images and their corresponding generation technologies. We find that specific images exhibit significant reconstruction discrepancies across different generative methods and that matching generation techniques provide more accurate reconstructions. Based on this insight, we propose a Multi-Reconstruction-based detector. By reversing and reconstructing images using multiple generative models, we analyze the reconstruction differences among real, GAN-generated, and DM-generated images to facilitate effective differentiation. Additionally, we introduce the Asian Synthetic Face Dataset (ASFD), containing synthetic Asian faces generated with various GANs and DMs. This dataset complements existing synthetic face datasets. Experimental results demonstrate that our detector achieves exceptional performance, with strong generalization and robustness.

Alice: Proactive Learning with Teacher's Demonstrations for Weak-to-Strong Generalization

Shujin Wu,Cheng Qian,Yi R. Fung,Paul Pu Liang,Heng Ji

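Task: Propose Alice, a proactive learning framework that exploits the complementary knowledge of teacher and student models to improve weak-to-strong generalization (W2SG).

Motivation: Traditional W2SG methods rely on passive learning, which keeps student models from reaching their potential, so a more effective way to strengthen knowledge transfer and supervision is needed.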

Details

Method: Probes the teacher model's uncertainty and uses it together with the teacher's responses as demonstrations to guide the student model in self-generating improved responses; for large capability gaps, proposes cascade Alice, which uses a hierarchical training approach. Result: Experiments show significant gains on knowledge-based reasoning (+4.0%), mathematical reasoning (+22.62%), and logical reasoning (+12.11%). Conclusion: The Alice framework effectively improves W2SG performance and enables more robust knowledge transfer and supervision. Abstract: The growing capabilities of large language models (LLMs) present a key challenge of maintaining effective human oversight. Weak-to-strong generalization (W2SG) offers a promising framework for supervising increasingly capable LLMs using weaker ones. Traditional W2SG methods rely on passive learning, where a weak teacher provides noisy demonstrations to train a strong student. This hinders students from employing their knowledge during training and reaching their full potential. In this work, we introduce Alice (proActive learning with teacher's Demonstrations), a framework that leverages complementary knowledge between teacher and student to enhance the learning process. We probe the knowledge base of the teacher model by eliciting their uncertainty, and then use these insights together with teachers' responses as demonstrations to guide student models in self-generating improved responses for supervision. In addition, for situations with significant capability gaps between teacher and student models, we introduce cascade Alice, which employs a hierarchical training approach where weak teachers initially supervise intermediate models, who then guide stronger models in sequence. Experimental results demonstrate that our method significantly enhances the W2SG performance, yielding substantial improvements in three key tasks compared to the original W2SG: knowledge-based reasoning (+4.0%), mathematical reasoning (+22.62%), and logical reasoning (+12.11%). This highlights the effectiveness of our new W2SG paradigm that enables more robust knowledge transfer and supervision outcomes.

ID-Booth: Identity-consistent Face Generation with Diffusion Models

Darian Tomašević,Fadi Boutros,Chenhao Lin,Naser Damer,Vitomir Štruc,Peter Peer

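Task: Propose ID-Booth, a novel generative diffusion framework that keeps identity consistent while increasing diversity in generated images.

Motivation: Existing generative models do not sufficiently account for identity during training, so generated images drift from the intended identity; methods with identity-based training objectives, in turn, tend to overfit and reduce diversity.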

Details

Method: ID-Booth combines a denoising network, a variational auto-encoder, and a text encoder, and uses a novel triplet identity training objective to achieve identity-consistent image generation. Result: Experiments show that ID-Booth outperforms existing methods in identity consistency and diversity, and that its data can effectively augment small-scale datasets and train better-performing recognition models. Conclusion: ID-Booth achieves high-quality identity-consistent image generation in a privacy-preserving manner and has practical value. Abstract: Recent advances in generative modeling have enabled the generation of high-quality synthetic data that is applicable in a variety of domains, including face recognition. Here, state-of-the-art generative models typically rely on conditioning and fine-tuning of powerful pretrained diffusion models to facilitate the synthesis of realistic images of a desired identity. Yet, these models often do not consider the identity of subjects during training, leading to poor consistency between generated and intended identities. In contrast, methods that employ identity-based training objectives tend to overfit on various aspects of the identity, and in turn, lower the diversity of images that can be generated. To address these issues, we present in this paper a novel generative diffusion-based framework, called ID-Booth. ID-Booth consists of a denoising network responsible for data generation, a variational auto-encoder for mapping images to and from a lower-dimensional latent space and a text encoder that allows for prompt-based control over the generation procedure. The framework utilizes a novel triplet identity training objective and enables identity-consistent image generation while retaining the synthesis capabilities of pretrained diffusion models. Experiments with a state-of-the-art latent diffusion model and diverse prompts reveal that our method facilitates better intra-identity consistency and inter-identity separability than competing methods, while achieving higher image diversity. In turn, the produced data allows for effective augmentation of small-scale datasets and training of better-performing recognition models in a privacy-preserving manner. The source code for the ID-Booth framework is publicly available at https://github.com/dariant/ID-Booth.

Revisiting Prompt Optimization with Large Reasoning Models-A Case Study on Event Extraction

Saurabh Srivastava,Ziyu Yao

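Task: Systematically study whether large reasoning models (LRMs) still need prompt optimization, using event extraction as a case study.

Motivation: Examine the claim that LRMs no longer need prompt engineering or optimization because of their strong ability to generate and reason over intermediate thoughts.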

Details

Method: Compares two LRMs (DeepSeek-R1 and o1) and two general-purpose LLMs (GPT-4o and GPT-4.5) in the roles of task model and prompt optimizer. Result: On tasks as complex as event extraction, LRMs used as task models still benefit from prompt optimization, and LRMs used as prompt optimizers yield more effective prompts. Conclusion: LRMs are stable and consistent when refining task instructions and event guidelines, but they still benefit from prompt engineering. Abstract: Large Reasoning Models (LRMs) such as DeepSeek-R1 and OpenAI o1 have demonstrated remarkable capabilities in various reasoning tasks. Their strong capability to generate and reason over intermediate thoughts has also led to arguments that they may no longer require extensive prompt engineering or optimization to interpret human instructions and produce accurate outputs. In this work, we aim to systematically study this open question, using the structured task of event extraction as a case study. We experimented with two LRMs (DeepSeek-R1 and o1) and two general-purpose Large Language Models (LLMs) (GPT-4o and GPT-4.5), when they were used as task models or prompt optimizers. Our results show that on tasks as complicated as event extraction, LRMs as task models still benefit from prompt optimization, and that using LRMs as prompt optimizers yields more effective prompts. Finally, we provide an error analysis of common errors made by LRMs and highlight the stability and consistency of LRMs in refining task instructions and event guidelines.

FAIR-SIGHT: Fairness Assurance in Image Recognition via Simultaneous Conformal Thresholding and Dynamic Output Repair

Arya Fayyazi,Mehdi Kamal,Massoud Pedram

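Task: Propose FAIR-SIGHT, a framework that ensures fairness in computer vision systems by combining conformal prediction with a dynamic output repair mechanism.

Motivation: Address fairness problems in computer vision systems without retraining or access to internal model parameters.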

Details

Method: Computes a fairness-aware non-conformity score, calibrates a threshold with conformal prediction, and reduces fairness disparities by dynamically adjusting outputs (for example, logit shifts for classification and confidence recalibration for detection). Result: Theoretical analysis validates the method's error control and convergence, and experiments show that FAIR-SIGHT significantly reduces fairness disparities while preserving high predictive performance. Conclusion: FAIR-SIGHT is an effective post-hoc framework that improves the fairness of computer vision systems while maintaining predictive performance. Abstract: We introduce FAIR-SIGHT, an innovative post-hoc framework designed to ensure fairness in computer vision systems by combining conformal prediction with a dynamic output repair mechanism. Our approach calculates a fairness-aware non-conformity score that simultaneously assesses prediction errors and fairness violations. Using conformal prediction, we establish an adaptive threshold that provides rigorous finite-sample, distribution-free guarantees. When the non-conformity score for a new image exceeds the calibrated threshold, FAIR-SIGHT implements targeted corrective adjustments, such as logit shifts for classification and confidence recalibration for detection, to reduce both group and individual fairness disparities, all without the need for retraining or having access to internal model parameters. Comprehensive theoretical analysis validates our method's error control and convergence properties. At the same time, extensive empirical evaluations on benchmark datasets show that FAIR-SIGHT significantly reduces fairness disparities while preserving high predictive performance.
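The thresholding step can be illustrated with a minimal sketch of split conformal calibration. The quantile rule below is the standard split conformal one; the weighted-sum form of the score, the weight `lam`, and all function names are assumptions for illustration, since the abstract does not give FAIR-SIGHT's exact score or repair rule.

```python
import math


def nonconformity(pred_error, fairness_gap, lam=1.0):
    """Hypothetical fairness-aware score: error plus weighted fairness gap."""
    return pred_error + lam * fairness_gap


def conformal_threshold(cal_scores, alpha):
    """Split conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest
    calibration score. Under exchangeability, a new score exceeds it with
    probability at most alpha."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_scores)[k - 1]


def needs_repair(score, threshold):
    """Flag an output for corrective adjustment when its score is too high."""
    return score > threshold


if __name__ == "__main__":
    calibration = [float(i) for i in range(1, 11)]  # made-up scores 1..10
    tau = conformal_threshold(calibration, alpha=0.25)
    print(tau, needs_repair(nonconformity(8.0, 2.0), tau))
```

In a full pipeline, a flagged output would then receive the corrective adjustment (a logit shift or confidence recalibration); that step is left out here because its form is not specified in the abstract.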

Enhancing Time Series Forecasting via Multi-Level Text Alignment with LLMs

Taibiao Zhao,Xiaobing Chen,Mingxuan Sun

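Task: Propose a multi-level text alignment framework that adapts time series data to large language models (LLMs) to improve forecasting accuracy and interpretability.

Motivation: Time series data is continuous, whereas LLMs operate on discrete tokens; existing methods that convert time series into text-based forms struggle to retain both predictive accuracy and interpretability.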

Details

Method: Decomposes a time series into trend, seasonal, and residual components, converts them into component-specific text representations, and aligns these representations with pre-trained word tokens through a multi-level alignment mechanism. Result: Experiments on multiple datasets show that the method outperforms state-of-the-art models in accuracy while providing good interpretability. Conclusion: The proposed multi-level text alignment framework addresses the challenge of adapting time series data to LLMs and improves both forecasting performance and interpretability. Abstract: The adaptation of large language models (LLMs) to time series forecasting poses unique challenges, as time series data is continuous in nature, while LLMs operate on discrete tokens. Despite the success of LLMs in natural language processing (NLP) and other structured domains, aligning time series data with language-based representations while maintaining both predictive accuracy and interpretability remains a significant hurdle. Existing methods have attempted to reprogram time series data into text-based forms, but these often fall short in delivering meaningful, interpretable results. In this paper, we propose a multi-level text alignment framework for time series forecasting using LLMs that not only improves prediction accuracy but also enhances the interpretability of time series representations. Our method decomposes time series into trend, seasonal, and residual components, which are then reprogrammed into component-specific text representations. We introduce a multi-level alignment mechanism, where component-specific embeddings are aligned with pre-trained word tokens, enabling more interpretable forecasts. Experiments on multiple datasets demonstrate that our method outperforms state-of-the-art models in accuracy while providing good interpretability.
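The decomposition step can be sketched with a simple additive scheme. This is a classical moving-average decomposition chosen for illustration, not necessarily the procedure the authors use; the function names and the edge handling (a window that shrinks at the boundaries) are assumptions.

```python
def moving_average(series, window):
    """Centered moving average over 2 * (window // 2) + 1 points,
    with the window shrinking at the series boundaries."""
    half = window // 2
    out = []
    for t in range(len(series)):
        lo, hi = max(0, t - half), min(len(series), t + half + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out


def decompose(series, period):
    """Additive split: series[t] = trend[t] + seasonal[t] + residual[t].

    Assumes len(series) >= period so every phase has at least one sample.
    """
    trend = moving_average(series, period)
    detrended = [x - m for x, m in zip(series, trend)]
    # Average the detrended values at each position within the period.
    phase_means = []
    for p in range(period):
        vals = detrended[p::period]
        phase_means.append(sum(vals) / len(vals))
    offset = sum(phase_means) / period  # center the seasonal pattern on zero
    seasonal = [phase_means[t % period] - offset for t in range(len(series))]
    residual = [x - m - s for x, m, s in zip(series, trend, seasonal)]
    return trend, seasonal, residual


if __name__ == "__main__":
    pattern = [0.0, 2.0, 0.0, -2.0]  # made-up seasonal pattern of period 4
    series = [0.5 * t + pattern[t % 4] for t in range(16)]
    trend, seasonal, residual = decompose(series, period=4)
    print(seasonal[:4])
```

Each of the three returned components would then be embedded and aligned separately, which is the part of the method this sketch does not cover.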

FlexIP: Dynamic Control of Preservation and Personality for Customized Image Generation

Linyan Huang,Haonan Lin,Yanning Zhou,Kaiwen Xiao

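Task: Propose FlexIP, a framework that addresses the trade-off between identity preservation and personalized editing in 2D generative models.

Motivation: Existing methods face an inherent trade-off between identity preservation and personalized editing, so a more flexible solution is needed.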

Details

Method: Decouples the two objectives through two dedicated components (a Personalization Adapter and a Preservation Adapter) and enables parameterized control by dynamically tuning the weight adapter. Result: Experiments show that FlexIP breaks through the performance limits of conventional methods, achieving better identity preservation and more diverse personalized generation. Conclusion: FlexIP offers an effective solution for identity preservation and personalized editing in 2D generative models. Abstract: With the rapid advancement of 2D generative models, preserving subject identity while enabling diverse editing has emerged as a critical research focus. Existing methods typically face inherent trade-offs between identity preservation and personalized manipulation. We introduce FlexIP, a novel framework that decouples these objectives through two dedicated components: a Personalization Adapter for stylistic manipulation and a Preservation Adapter for identity maintenance. By explicitly injecting both control mechanisms into the generative model, our framework enables flexible parameterized control during inference through dynamic tuning of the weight adapter. Experimental results demonstrate that our approach breaks through the performance limitations of conventional methods, achieving superior identity preservation while supporting more diverse personalized generation capabilities (Project Page: https://flexip-tech.github.io/flexip/).

TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models

Sher Badshah,Ali Emami,Hassan Sajjad

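Task: Propose TALE, a framework for evaluating LLM outputs without predetermined ground-truth answers.

Motivation: Traditional evaluation depends on static, pre-annotated references, which are costly, hard to scale, and incomplete.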

Details

Method: Uses a tool-augmented agent that actively retrieves and synthesizes external evidence, iteratively generating queries, collecting information, and refining its search. Result: On free-form QA tasks, TALE outperforms standard reference-based metrics and agrees closely with human evaluations. Conclusion: TALE makes LLM evaluation more reliable in dynamic scenarios and removes the dependence on static references. Abstract: As Large Language Models (LLMs) become increasingly integrated into real-world, autonomous applications, relying on static, pre-annotated references for evaluation poses significant challenges in cost, scalability, and completeness. We propose Tool-Augmented LLM Evaluation (TALE), a framework to assess LLM outputs without predetermined ground-truth answers. Unlike conventional metrics that compare to fixed references or depend solely on LLM-as-a-judge knowledge, TALE employs an agent with tool-access capabilities that actively retrieves and synthesizes external evidence. It iteratively generates web queries, collects information, summarizes findings, and refines subsequent searches through reflection. By shifting away from static references, TALE aligns with free-form question-answering tasks common in real-world scenarios. Experimental results on multiple free-form QA benchmarks show that TALE not only outperforms standard reference-based metrics for measuring response accuracy but also achieves substantial to near-perfect agreement with human evaluations. TALE enhances the reliability of LLM evaluations in real-world, dynamic scenarios without relying on static references.

Leveraging LLMs for Multimodal Retrieval-Augmented Radiology Report Generation via Key Phrase Extraction

Kyoyun Choi,Byungmu Yoon,Soobum Kim,Jonggwon Park

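Task: Propose a retrieval-augmented generation approach for automatically generating radiology reports that mitigates hallucinations and reduces computational demands.

Motivation: Multimodal large language models (MLLMs) are resource-intensive, requiring vast datasets and substantial computational cost, so a more efficient approach is needed.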

Details

Method: Combines multimodal retrieval with large language models, extracts key phrases, searches over image encoder structures, and applies contrastive learning. Result: Achieves state-of-the-art results on CheXbert metrics on the MIMIC-CXR dataset and a competitive RadGraph F1 score. Conclusion: The method requires no LLM fine-tuning, generalizes to multi-view radiology report generation, and has potential for clinical use. Abstract: Automated radiology report generation (RRG) holds potential to reduce radiologists' workload, especially as recent advancements in large language models (LLMs) enable the development of multimodal models for chest X-ray (CXR) report generation. However, multimodal LLMs (MLLMs) are resource-intensive, requiring vast datasets and substantial computational cost for training. To address these challenges, we propose a retrieval-augmented generation approach that leverages multimodal retrieval and LLMs to generate radiology reports while mitigating hallucinations and reducing computational demands. Our method uses LLMs to extract key phrases from radiology reports, effectively focusing on essential diagnostic information. Through exploring effective training strategies, including image encoder structure search, adding noise to text embeddings, and additional training objectives, we combine complementary pre-trained image encoders and adopt contrastive learning between text and semantic image embeddings. We evaluate our approach on the MIMIC-CXR dataset, achieving state-of-the-art results on CheXbert metrics and competitive RadGraph F1 metric alongside MLLMs, without requiring LLM fine-tuning. Our method demonstrates robust generalization for multi-view RRG, making it suitable for comprehensive clinical applications.

Talking Point based Ideological Discourse Analysis in News Events

Nishanth Nakshatri,Nikhil Mehta,Siyi Liu,Sihao Chen,Daniel J. Hopkins,Dan Roth,Dan Goldwasser

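Task: Propose a framework, grounded in the theory of ideological discourse analysis, for analyzing news articles related to real-world events.

Motivation: Large language models (LLMs) struggle to capture key elements and integrate contextual information when analyzing ideological discourse, and therefore fail to understand abstract ideological views.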

Details

Method: Represents news articles with a relational structure (talking points), builds a vocabulary of repeating themes, generates ideology-specific viewpoints, and evaluates the framework through automated tasks and human validation. Result: The framework generates ideology-specific viewpoints and is readily applicable to creating event snapshots. Conclusion: The proposed framework addresses the limitations of LLMs in ideological discourse analysis, and the dataset and model are released to support further research. Abstract: Analyzing ideological discourse even in the age of LLMs remains a challenge, as these models often struggle to capture the key elements that shape real-world narratives. Specifically, LLMs fail to focus on characteristic elements driving dominant discourses and lack the ability to integrate contextual information required for understanding abstract ideological views. To address these limitations, we propose a framework motivated by the theory of ideological discourse analysis to analyze news articles related to real-world events. Our framework represents the news articles using a relational structure - talking points, which captures the interaction between entities, their roles, and media frames along with a topic of discussion. It then constructs a vocabulary of repeating themes - prominent talking points, that are used to generate ideology-specific viewpoints (or partisan perspectives). We evaluate our framework's ability to generate these perspectives through automated tasks - ideology and partisan classification tasks, supplemented by human validation. Additionally, we demonstrate straightforward applicability of our framework in creating event snapshots, a visual way of interpreting event discourse. We release the resulting dataset and model to the community to support further research.

RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Radiology with Zero-Shot Multi-Task Capability

Jonggwon Park,Soobum Kim,Byungmu Yoon,Kyoyun Choi

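Task: Propose RadZero, a similarity-based cross-attention framework for vision-language alignment in radiology with zero-shot multi-task capability.

Motivation: Existing methods fall short in exploiting complex radiology reports, rely on low-resolution images, and offer limited interpretability in their attention mechanisms.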

Details

Method: Uses large language models to extract minimal semantic sentences from radiology reports, adopts a multi-positive contrastive learning strategy, and combines a pre-trained vision encoder with trainable Transformer layers. Result: On public chest radiograph benchmarks, RadZero outperforms existing methods in zero-shot classification, grounding, and segmentation. Conclusion: RadZero performs strongly in vision-language alignment, offers zero-shot capability and potential for explainability, and is suitable for medical image analysis. Abstract: Recent advancements in multi-modal models have significantly improved vision-language alignment in radiology. However, existing approaches struggle to effectively utilize complex radiology reports for learning, rely on low-resolution images, and offer limited interpretability in attention mechanisms. To address these challenges, we introduce RadZero, a novel similarity-based cross-attention framework for vision-language alignment in radiology with zero-shot multi-task capability. RadZero leverages large language models to extract minimal semantic sentences from radiology reports and employs a multi-positive contrastive learning strategy to effectively capture relationships between images and multiple relevant textual descriptions. It also utilizes a pre-trained vision encoder with additional trainable Transformer layers, allowing efficient high-resolution image processing. By computing similarity between text embeddings and local image patch features, RadZero enables zero-shot inference with similarity probability for classification and pixel-level cross-modal similarity maps for grounding and segmentation. Experimental results on public chest radiograph benchmarks show that RadZero outperforms state-of-the-art methods in zero-shot classification, grounding, and segmentation. Furthermore, cross-modal similarity map analysis highlights its potential for improving explainability in vision-language alignment. Additionally, qualitative evaluation demonstrates RadZero's capability for open-vocabulary semantic segmentation, further validating its effectiveness in medical imaging.
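The similarity computation between a text embedding and local patch features can be sketched as follows. This is an illustrative toy, not RadZero's architecture: cosine similarity, the max-pooled sigmoid used for an image-level probability, the `temperature` value, and all names are assumptions, and the embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_map(text_emb, patch_feats, grid_w):
    """Score every patch against the text and arrange scores as a 2D grid.

    patch_feats is a row-major list of patch feature vectors.
    """
    sims = [cosine(text_emb, p) for p in patch_feats]
    return [sims[i:i + grid_w] for i in range(0, len(sims), grid_w)]


def image_probability(sim_map, temperature=0.1):
    """Hypothetical image-level score: sigmoid of the best patch similarity."""
    peak = max(max(row) for row in sim_map)
    return 1.0 / (1.0 + math.exp(-peak / temperature))


if __name__ == "__main__":
    text = [1.0, 0.0]  # made-up 2D embedding of a finding description
    patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]  # 2x2 grid
    grid = similarity_map(text, patches, grid_w=2)
    print(grid, image_probability(grid))
```

The grid plays the role of a coarse grounding map (upsampling it to pixel resolution would be needed for segmentation), while the pooled score stands in for the classification output.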

AI Coding with Few-Shot Prompting for Thematic Analysis

Samuel Flanders,Melati Nungsari,Mark Cheong Wing Loong

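Task: Explore the use of large language models (here GPT 3.5-Turbo) to perform coding for thematic analysis.

Motivation: Coding for thematic analysis is highly labor-intensive, making exhaustive analysis of large corpora infeasible for most researchers.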

Details

Method: Uses few-shot prompting with higher-quality codes generated on semantically similar passages, while relying on a cheaper, more easily scalable model. Result: The approach improves code quality while keeping costs low and scaling easily. Conclusion: Large language models can effectively address the high cost of coding for thematic analysis and make the analysis more efficient. Abstract: This paper explores the use of large language models (LLMs), here represented by GPT 3.5-Turbo, to perform coding for a thematic analysis. Coding is highly labor intensive, making it infeasible for most researchers to conduct exhaustive thematic analyses of large corpora. We utilize few-shot prompting with higher quality codes generated on semantically similar passages to enhance the quality of the codes while utilizing a cheap, more easily scalable model.
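Selecting few-shot examples from semantically similar passages can be sketched as a retrieval step followed by prompt assembly. The sketch below is an illustration under stated assumptions: bag-of-words cosine similarity stands in for whatever embedding model the authors use, and the prompt wording, function names, and example data are all hypothetical. No model call is made.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector; a crude stand-in for a sentence embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counters (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a)  # Counter yields 0 for missing words
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prompt(passage, coded_examples, k=2):
    """Put the k most similar (passage, code) pairs ahead of the new passage."""
    query = bow(passage)
    ranked = sorted(coded_examples,
                    key=lambda ex: cosine(query, bow(ex[0])),
                    reverse=True)
    parts = ["Assign a short thematic code to the final passage."]
    for text, code in ranked[:k]:
        parts.append(f"Passage: {text}\nCode: {code}")
    parts.append(f"Passage: {passage}\nCode:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    examples = [  # made-up coded passages
        ("rent is too expensive in the city", "housing cost"),
        ("I cannot find a job after graduating", "unemployment"),
        ("the bus never arrives on time", "transport"),
    ]
    print(build_prompt("my rent is so expensive", examples, k=1))
```

The resulting string would be sent to the cheaper model; the higher-quality codes in `coded_examples` are the ones the abstract describes as generated beforehand.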

Anning Hu,Ang Li,Xirui Jin,Danping Zou

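Task: Propose ThermoStereoRT, a real-time thermal stereo matching method that recovers disparity in all-weather conditions.

Motivation: Target applications such as night-time drone surveillance and under-bed cleaning robots, and address the challenge of sparse ground-truth data in thermal imagery.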

Details

Method: Uses a lightweight yet powerful backbone to build a 3D cost volume, applies multi-scale attention mechanisms to produce an initial disparity map, and refines it with a novel channel and spatial attention module. Result: Comprehensive evaluations on multiple datasets show that ThermoStereoRT delivers both real-time capacity and robust accuracy. Conclusion: ThermoStereoRT is a practical solution for a range of challenging environments, and the code will be open-sourced. Abstract: We introduce ThermoStereoRT, a real-time thermal stereo matching method designed for all-weather conditions that recovers disparity from two rectified thermal stereo images, envisioning applications such as night-time drone surveillance or under-bed cleaning robots. Leveraging a lightweight yet powerful backbone, ThermoStereoRT constructs a 3D cost volume from thermal images and employs multi-scale attention mechanisms to produce an initial disparity map. To refine this map, we design a novel channel and spatial attention module. Addressing the challenge of sparse ground truth data in thermal imagery, we utilize knowledge distillation to boost performance without increasing computational demands. Comprehensive evaluations on multiple datasets demonstrate that ThermoStereoRT delivers both real-time capacity and robust accuracy, making it a promising solution for real-world deployment in various challenging environments. Our code will be released on https://github.com/SJTU-ViSYS-team/ThermoStereoRT

AgentAda: Skill-Adaptive Data Analytics for Tailored Insight Discovery

Amirhossein Abaskohi,Amrutha Varshini Ramesh,Shailesh Nanisetty,Chirag Goel,David Vazquez,Christopher Pal,Spandana Gella,Giuseppe Carenini,Issam H. Laradji

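Task: Introduce AgentAda, an LLM-powered analytics agent that can learn and use new analytics skills.

Motivation: Existing methods require users to choose the data analytics method manually, whereas AgentAda automatically selects the needed skill from a skill library and can handle tasks that existing LLMs cannot perform out of the box.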

Details

Method: Follows a three-step strategy (a question generator, a hybrid RAG-based skill matcher, and a code generator) that draws on methods in the skill library such as clustering, predictive modeling, and NLP techniques. Result: In a human evaluation, 48.78% of evaluators preferred AgentAda's analyses, compared with 27.67% for the unskilled agent. Conclusion: AgentAda provides more insightful analyses, and an LLM-as-a-judge approach enables automated evaluation at larger scale. Abstract: We introduce AgentAda, the first LLM-powered analytics agent that can learn and use new analytics skills to extract more specialized insights. Unlike existing methods that require users to manually decide which data analytics method to apply, AgentAda automatically identifies the skill needed from a library of analytical skills to perform the analysis. This also allows AgentAda to use skills that existing LLMs cannot perform out of the box. The library covers a range of methods, including clustering, predictive modeling, and NLP techniques like BERT, which allow AgentAda to handle complex analytics tasks based on what the user needs. AgentAda's dataset-to-insight extraction strategy consists of three key steps: (I) a question generator to generate queries relevant to the user's goal and persona, (II) a hybrid Retrieval-Augmented Generation (RAG)-based skill matcher to choose the best data analytics skill from the skill library, and (III) a code generator that produces executable code based on the retrieved skill's documentation to extract key patterns. We also introduce KaggleBench, a benchmark of curated notebooks across diverse domains, to evaluate AgentAda's performance. We conducted a human evaluation demonstrating that AgentAda provides more insightful analytics than existing tools, with 48.78% of evaluators preferring its analyses, compared to 27.67% for the unskilled agent. We also propose a novel LLM-as-a-judge approach that we show is aligned with human evaluation as a way to automate insight quality evaluation at larger scale.

WS-DETR: Robust Water Surface Object Detection through Vision-Radar Fusion with Detection Transformer

Huilin Yin,Pengyu Wang,Senmao Li,Jun Yan,Daniel Watzenig

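Task: Propose WS-DETR, a robust vision-radar fusion model for object detection by Unmanned Surface Vehicles (USVs) in complex water environments.

Motivation: Address the challenges that blurred edges and diverse object scales pose for water surface object detection, as well as the cross-modal feature conflicts in existing vision-radar fusion methods.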

Details

Method: Introduce a Multi-Scale Edge Information Integration (MSEII) module to enhance edge perception and a Hierarchical Feature Aggregator (HiFA) to improve multi-scale object detection; use self-moving point representations with continuous convolution and residual connections to extract irregular features; and apply an Adaptive Feature Interactive Fusion (AFIF) module for geometric alignment and semantic fusion of visual and radar features. Result: Experiments on the WaterScenes dataset show that WS-DETR achieves state-of-the-art performance and keeps its advantage under adverse weather and lighting conditions. Conclusion: WS-DETR's module design effectively mitigates cross-modal feature conflicts and improves the robustness of water surface object detection. Abstract: Robust object detection for Unmanned Surface Vehicles (USVs) in complex water environments is essential for reliable navigation and operation. Specifically, water surface object detection faces challenges from blurred edges and diverse object scales. Although vision-radar fusion offers a feasible solution, existing approaches suffer from cross-modal feature conflicts, which negatively affect model robustness. To address this problem, we propose a robust vision-radar fusion model WS-DETR. In particular, we first introduce a Multi-Scale Edge Information Integration (MSEII) module to enhance edge perception and a Hierarchical Feature Aggregator (HiFA) to boost multi-scale object detection in the encoder. Then, we adopt self-moving point representations for continuous convolution and residual connection to efficiently extract irregular features under the scenarios of irregular point cloud data. To further mitigate cross-modal conflicts, an Adaptive Feature Interactive Fusion (AFIF) module is introduced to integrate visual and radar features through geometric alignment and semantic fusion. Extensive experiments on the WaterScenes dataset demonstrate that WS-DETR achieves state-of-the-art (SOTA) performance, maintaining its superiority even under adverse weather and lighting conditions.

From Token to Line: Enhancing Code Generation with a Long-Term Perspective

Tingwei Lu,Yangning Li,Liyuan Wang,Binghuai Lin,Jiwei Tang,Wanshi Xu,Hai-Tao Zheng,Yinghui Li,Bingxu An,Zhao Wei,Yong Xu

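Task: Propose LSR-MCTS, an MCTS-based algorithm that generates code line by line and selects the optimal generation path.

Motivation: Existing code generation research suffers from redundant results and overfitting to local patterns, and pays little attention to choosing an appropriate processing length for generation.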

Details

Method: Based on an analysis of attention during LLM generation, treat each line of code as the basic processing unit and generate code line by line by combining MCTS with a self-refine mechanism. Result: The method outperforms state-of-the-art approaches on three public coding benchmarks. Conclusion: Through line-by-line generation and self-refinement, LSR-MCTS significantly improves the quality and diversity of generated code. Abstract: The emergence of large language models (LLMs) has significantly promoted the development of the code generation task, sparking a surge in pertinent literature. Current research is hindered by redundant generation results and a tendency to overfit local patterns in the short term. Although existing studies attempt to alleviate the issue by adopting a multi-token prediction strategy, there remains limited focus on choosing the appropriate processing length for generations. By analyzing the attention between tokens during the generation process of LLMs, it can be observed that the high spikes of the attention scores typically appear at the end of lines. This insight suggests that it is reasonable to treat each line of code as a fundamental processing unit and generate them sequentially. Inspired by this, we propose the LSR-MCTS algorithm, which leverages MCTS to determine the code line-by-line and select the optimal path. Further, we integrate a self-refine mechanism at each node to enhance diversity and generate higher-quality programs through error correction. Extensive experiments and comprehensive analyses on three public coding benchmarks demonstrate that our method outperforms state-of-the-art approaches.
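
To make the line-level search idea concrete, here is a minimal sketch of the selection step in a generic MCTS where each tree node holds one candidate line of code. It uses the standard UCB1 rule; the abstract does not state which selection rule LSR-MCTS actually uses, so `LineNode`, `ucb1`, `select_child`, and the exploration constant `c` are all illustrative assumptions rather than the authors' implementation.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class LineNode:
    """One tree node = one candidate line of code (hypothetical structure)."""
    line: str
    visits: int = 0
    total_reward: float = 0.0
    children: List["LineNode"] = field(default_factory=list)


def ucb1(parent_visits: int, child: LineNode, c: float = 1.4) -> float:
    """Standard UCB1 score: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # always try unvisited lines first
    mean = child.total_reward / child.visits
    bonus = c * math.sqrt(math.log(max(parent_visits, 1)) / child.visits)
    return mean + bonus


def select_child(parent: LineNode) -> LineNode:
    """Pick the next line to expand under the given partial program."""
    return max(parent.children, key=lambda ch: ucb1(parent.visits, ch))
```

In a full search loop, the reward backed up into `total_reward` would typically come from running tests on the completed program; that part, and the self-refine step at each node, are omitted here.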

How Can Objects Help Video-Language Understanding?

Zitian Tang,Shijie Wang,Junho Cho,Jaewook Yoo,Chen Sun

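Task: Explore how object representation and adaptation affect video-language understanding in multimodal large language models (MLLMs).

Motivation: Understand how MLLMs perceive the visual world, in particular how objects and relations are modeled, and how object representations can improve video understanding.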

Details

Method: Study the trade-off between representation expressiveness (e.g., distributed versus symbolic) and integration difficulty (e.g., data efficiency) from the object representation and adaptation perspectives, with evaluations on five video question answering datasets. Result: Explicit integration of object-centric representations remains necessary; symbolic objects are the easiest to integrate and perform well on question answering. Conclusion: The findings encourage the community to explore explicit integration of perception modules into MLLM design. Abstract: How multimodal large language models (MLLMs) perceive the visual world remains a mystery. To one extreme, object and relation modeling may be implicitly implemented with inductive biases, for example by treating objects as tokens. To the other extreme, empirical results reveal the surprising finding that simply performing visual captioning, which tends to ignore spatial configuration of the objects, serves as a strong baseline for video understanding. We aim to answer the question: how can objects help video-language understanding in MLLMs? We tackle the question from the object representation and adaptation perspectives. Specifically, we investigate the trade-off between representation expressiveness (e.g., distributed versus symbolic) and integration difficulty (e.g., data-efficiency when learning the adapters). Through extensive evaluations on five video question answering datasets, we confirm that explicit integration of object-centric representation remains necessary, and the symbolic objects can be most easily integrated while being performant for question answering. We hope our findings can encourage the community to explore the explicit integration of perception modules into MLLM design. Our code and models will be publicly released.

Revisiting LLM Evaluation through Mechanism Interpretability: a New Metric and Model Utility Law

Yixin Cao,Jiahao Ying,Yaoning Wang,Xipeng Qiu,Xuanjing Huang,Yugang Jiang

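Task: Analyze the limitations of traditional evaluation methods and propose a new metric, the Model Utilization Index (MUI), to complement traditional performance metrics.

Motivation: Current evaluation methods struggle to keep pace with the rapid development of large language models, so a more comprehensive approach is needed.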

Details

Method: Propose the Model Utilization Index (MUI), which uses mechanism interpretability techniques to quantify how much of its capability a model uses to complete a task. Result: Experiments reveal an inverse relationship between MUI and performance, summarized as the Utility Law together with four corollaries. Conclusion: MUI and the Utility Law may foster joint progress in evaluation and mechanism interpretability. Abstract: Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications, yet current evaluation methods struggle to keep pace with their rapid development. In this paper, we analyze the core limitations of traditional evaluation pipelines and propose a novel metric, the Model Utilization Index (MUI), which introduces mechanism interpretability techniques to complement traditional performance metrics. MUI quantifies the extent to which a model leverages its capabilities to complete tasks. The core idea is that to assess an LLM's overall ability, we must evaluate not only its task performance but also the effort expended to achieve the outcome. Our extensive experiments reveal an inverse relationship between MUI and performance, from which we deduce a common trend observed in popular LLMs, which we term the Utility Law. Based on this, we derive four corollaries that address key challenges, including training judgement, the issue of data contamination, fairness in model comparison, and data diversity. We hope that our survey, novel metric, and utility law will foster mutual advancement in both evaluation and mechanism interpretability. Our code can be found at https://github.com/ALEX-nlp/MUI-Eva.

Learning Universal Features for Generalizable Image Forgery Localization

Hengrun Zhao,Yunzhi Zhuge,Yifan Wang,Lijun Wang,Huchuan Lu,Yu Zeng

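Task: Propose Generalizable Image Forgery Localization (GIFL), a method for detecting and locating both seen and unseen image forgeries.

Motivation: Existing methods rely on traces of specific forgeries and struggle with unseen forgery types, so a more general solution is needed.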

Details

Method: Learn general features from pristine content rather than traces of specific forgeries in order to locate unseen forgeries. Result: Experiments show that the method outperforms existing methods on unseen forgeries and is competitive on seen forgeries. Conclusion: GIFL offers a more practical and efficient approach to image forgery detection, suited to countering false information in the era of generative AI. Abstract: In recent years, advanced image editing and generation methods have rapidly evolved, making detecting and locating forged image content increasingly challenging. Most existing image forgery detection methods rely on identifying the edited traces left in the image. However, because the traces of different forgeries are distinct, these methods can identify familiar forgeries included in the training data but struggle to handle unseen ones. In response, we present an approach for Generalizable Image Forgery Localization (GIFL). Once trained, our model can detect both seen and unseen forgeries, providing a more practical and efficient solution to counter false information in the era of generative AI. Our method focuses on learning general features from the pristine content rather than traces of specific forgeries, which are relatively consistent across different types of forgeries and therefore can be used as universal features to locate unseen forgeries. Additionally, as existing image forgery datasets are still dominated by traditional hand-crafted forgeries, we construct a new dataset consisting of images edited by various popular deep generative image editing methods to further encourage research in detecting images manipulated by deep generative models. Extensive experimental results show that the proposed approach outperforms state-of-the-art methods in the detection of unseen forgeries and also demonstrates competitive results for seen forgeries. The code and dataset are available at https://github.com/ZhaoHengrun/GIFL.

Beyond LLMs: A Linguistic Approach to Causal Graph Generation from Narrative Texts

Zehan Li,Ruhua Pan,Xinyu Pi

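Task: Propose a new framework for generating causal graphs from narrative texts, bridging high-level causality and event-specific relationships.

Motivation: Address the limited precision of existing methods in causal link identification while maintaining readability.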

Details

Method: Combine LLM-based summarization, an Expert Index (seven linguistically informed features) with a STAC classification model, and a five-iteration prompting process to construct causal graphs. Result: Experiments on 100 narrative chapters and short stories show that the method outperforms GPT-4o and Claude 3.5 in causal graph quality. Conclusion: The open-source tool provides an efficient, interpretable solution for capturing nuanced causal chains in narratives. Abstract: We propose a novel framework for generating causal graphs from narrative texts, bridging high-level causality and detailed event-specific relationships. Our method first extracts concise, agent-centered vertices using large language model (LLM)-based summarization. We introduce an "Expert Index," comprising seven linguistically informed features, integrated into a Situation-Task-Action-Consequence (STAC) classification model. This hybrid system, combining RoBERTa embeddings with the Expert Index, achieves superior precision in causal link identification compared to pure LLM-based approaches. Finally, a structured five-iteration prompting process refines and constructs connected causal graphs. Experiments on 100 narrative chapters and short stories demonstrate that our approach consistently outperforms GPT-4o and Claude 3.5 in causal graph quality, while maintaining readability. The open-source tool provides an interpretable, efficient solution for capturing nuanced causal chains in narratives.

CMEdataset Advancing China Map Detection and Standardization with Digital Image Resources

Yan Xu,Zhenqiang Zhang,Zhiwei Zhou,Liting Geng,Yue Li,Jintao Li

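Task: Create the CME dataset, dedicated to problematic map detection and covering five key problem areas.

Motivation: Existing datasets focus mainly on general map data and cannot effectively identify complex issues such as national boundary misrepresentations, missing elements, and blurred boundaries.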

Details

Method: The study creates a problematic map dataset covering five key problem areas. Result: The dataset provides diverse samples for problematic map detection technologies, supports high-precision map compliance detection, and enhances map data quality and timeliness. Conclusion: The dataset provides important resources for map compliance, national security monitoring, and map updates, and fosters innovation and application of related technologies. Abstract: Digital images of China's maps play a crucial role in map detection, particularly in ensuring national sovereignty, territorial integrity, and map compliance. However, there is currently no publicly available dataset specifically dedicated to problematic maps. Existing datasets primarily focus on general map data and are insufficient for effectively identifying complex issues such as national boundary misrepresentations, missing elements, and blurred boundaries. Therefore, this study creates a Problematic Map dataset, the CME dataset, that covers five key problem areas, aiming to provide diverse samples for problematic map detection technologies, support high-precision map compliance detection, and enhance map data quality and timeliness. This dataset not only provides essential resources for map compliance, national security monitoring, and map updates, but also fosters innovation and application of related technologies.

Defense against Prompt Injection Attacks via Mixture of Encodings

Ruiyi Zhang,David Sullivan,Kyle Jackson,Pengtao Xie,Mei Chen

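Task: Propose a new defense mechanism, mixture of encodings, against prompt injection attacks on large language models (LLMs).

Motivation: The existing Base64 defense is effective but can degrade LLM performance on certain NLP tasks.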

Details

Method: Use a mixture of multiple character encodings, including Base64. Result: Experiments show that the method substantially reduces the success rate of prompt injection attacks while maintaining high performance on NLP tasks. Conclusion: The mixture of encodings strategy performs well on both safety and task performance, outperforming existing character encoding-based defenses. Abstract: Large Language Models (LLMs) have emerged as a dominant approach for a wide range of NLP tasks, with their access to external information further enhancing their capabilities. However, this introduces new vulnerabilities, known as prompt injection attacks, where external content embeds malicious instructions that manipulate the LLM's output. Recently, the Base64 defense has been recognized as one of the most effective methods for reducing the success rate of prompt injection attacks. Despite its efficacy, this method can degrade LLM performance on certain NLP tasks. To address this challenge, we propose a novel defense mechanism: mixture of encodings, which utilizes multiple character encodings, including Base64. Extensive experimental results show that our method achieves one of the lowest attack success rates under prompt injection attacks, while maintaining high performance across all NLP tasks, outperforming existing character encoding-based defense methods. This underscores the effectiveness of our mixture of encodings strategy for both safety and task performance metrics.
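
As a rough illustration of the encoding step, the sketch below wraps untrusted external content in more than one character encoding before it is placed in a prompt. Base64 is the only encoding named in the abstract; the hexadecimal variant, the prompt layout, and the function names `encode_variants` and `build_prompts` are assumptions made for illustration. How the paper combines the per-encoding model outputs is not described in the abstract and is left out.

```python
import base64
from typing import Dict


def encode_variants(external_text: str) -> Dict[str, str]:
    """Encode untrusted content in several character encodings."""
    raw = external_text.encode("utf-8")
    return {
        "base64": base64.b64encode(raw).decode("ascii"),
        "hex": raw.hex(),  # hypothetical second encoding, not from the abstract
    }


def build_prompts(instruction: str, external_text: str) -> Dict[str, str]:
    """Build one prompt per encoding; the layout is illustrative only."""
    prompts = {}
    for name, payload in encode_variants(external_text).items():
        prompts[name] = (
            f"{instruction}\n\n"
            f"[External content, encoding={name}]\n{payload}"
        )
    return prompts
```

Because the external text never appears verbatim in these prompts, an embedded instruction such as "Ignore previous instructions" is not presented to the model as plain natural language.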

Kimi-VL Technical Report

Kimi Team,Angang Du,Bohong Yin,Bowei Xing,Bowen Qu,Bowen Wang,Cheng Chen,Chenlin Zhang,Chenzhuang Du,Chu Wei,Congcong Wang,Dehao Zhang,Dikang Du,Dongliang Wang,Enming Yuan,Enzhe Lu,Fang Li,Flood Sung,Guangda Wei,Guokun Lai,Han Zhu,Hao Ding,Hao Hu,Hao Yang,Hao Zhang,Haoning Wu,Haotian Yao,Haoyu Lu,Heng Wang,Hongcheng Gao,Huabin Zheng,Jiaming Li,Jianlin Su,Jianzhou Wang,Jiaqi Deng,Jiezhong Qiu,Jin Xie,Jinhong Wang,Jingyuan Liu,Junjie Yan,Kun Ouyang,Liang Chen,Lin Sui,Longhui Yu,Mengfan Dong,Mengnan Dong,Nuo Xu,Pengyu Cheng,Qizheng Gu,Runjie Zhou,Shaowei Liu,Sihan Cao,Tao Yu,Tianhui Song,Tongtong Bai,Wei Song,Weiran He,Weixiao Huang,Weixin Xu,Xiaokun Yuan,Xingcheng Yao,Xingzhe Wu,Xinxing Zu,Xinyu Zhou,Xinyuan Wang,Y. Charles,Yan Zhong,Yang Li,Yangyang Hu,Yanru Chen,Yejie Wang,Yibo Liu,Yibo Miao,Yidao Qin,Yimin Chen,Yiping Bao,Yiqin Wang,Yongsheng Kang,Yuanxin Liu,Yulun Du,Yuxin Wu,Yuzhi Wang,Yuzi Yan,Zaida Zhou,Zhaowei Li,Zhejun Jiang,Zheng Zhang,Zhilin Yang,Zhiqi Huang,Zihao Huang,Zijia Zhao,Ziwei Chen

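Task: Develop Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) with multimodal reasoning, long-context understanding, and strong agent capabilities.

Motivation: Provide an efficient vision-language model that performs well across tasks in many domains while activating few parameters and lowering computational cost.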

Details

Method: Build Kimi-VL on a Mixture-of-Experts (MoE) architecture, and derive the Kimi-VL-Thinking variant through long chain-of-thought supervised fine-tuning (SFT) and reinforcement learning (RL). Result: Kimi-VL performs strongly on a range of tasks, including multi-turn agent tasks, image and video comprehension, and OCR, and advances long-context processing and high-resolution visual input handling. Conclusion: Kimi-VL and its variant set a new standard for efficient multimodal thinking models while keeping activated parameters and computational cost low. Abstract: We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.

Transformer-Based Temporal Information Extraction and Application: A Review

Xin Su,Phillip Howard,Steven Bethard

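Task: Systematically summarize and analyze Transformer-based research on temporal information extraction (temporal IE) and identify future research directions.

Motivation: Although Transformers have achieved notable results in temporal information extraction, a comprehensive review of this work is lacking.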

Details

Method: Systematically summarize and analyze existing Transformer-based temporal IE research. Result: The review fills the gap in surveys of this area and proposes future research directions. Conclusion: The paper offers a useful reference for temporal IE researchers and supports further development of the field. Abstract: Temporal information extraction (IE) aims to extract structured temporal information from unstructured text, thereby uncovering the implicit timelines within. This technique is applied across domains such as healthcare, newswire, and intelligence analysis, aiding models in these areas to perform temporal reasoning and enabling human users to grasp the temporal structure of text. Transformer-based pre-trained language models have produced revolutionary advancements in natural language processing, demonstrating exceptional performance across a multitude of tasks. Despite the achievements garnered by Transformer-based approaches in temporal IE, there is a lack of comprehensive reviews on these endeavors. In this paper, we aim to bridge this gap by systematically summarizing and analyzing the body of work on temporal IE using Transformers while highlighting potential future research directions.

Event Signal Filtering via Probability Flux Estimation

Jinze Chen,Wei Zhai,Yang Cao,Bin Li,Zheng-Jun Zha

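Task: Propose EDFilter, an online event signal filtering framework grounded in diffusion process theory, to improve the quality and consistency of event signals.

Motivation: The asynchronous and random nature of event signals degrades signal quality, so an effective filtering method is needed.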

Details

Method: Model event generation with diffusion process theory, reconstruct the continuous probability flux with nonparametric kernel smoothing, and propose a fast recursive solver for the optimization. Result: Experiments validate EDFilter on tasks such as event filtering, super-resolution, and direct event-based tracking, with significant gains in downstream applications such as SLAM and video reconstruction. Conclusion: Through theoretical modeling and efficient computation, EDFilter significantly improves event signal quality and downstream application performance. Abstract: Events offer a novel paradigm for capturing scene dynamics via asynchronous sensing, but their inherent randomness often leads to degraded signal quality. Event signal filtering is thus essential for enhancing fidelity by reducing this internal randomness and ensuring consistent outputs across diverse acquisition conditions. Unlike traditional time series that rely on fixed temporal sampling to capture steady-state behaviors, events encode transient dynamics through polarity and event intervals, making signal modeling significantly more complex. To address this, the theoretical foundation of event generation is revisited through the lens of diffusion processes. The state and process information within events is modeled as continuous probability flux at threshold boundaries of the underlying irradiance diffusion. Building on this insight, a generative, online filtering framework called Event Density Flow Filter (EDFilter) is introduced. EDFilter estimates event correlation by reconstructing the continuous probability flux from discrete events using nonparametric kernel smoothing, and then resamples filtered events from this flux. To optimize fidelity over time, spatial and temporal kernels are employed in a time-varying optimization framework. A fast recursive solver with O(1) complexity is proposed, leveraging state-space models and lookup tables for efficient likelihood computation. Furthermore, a new real-world benchmark Rotary Event Dataset (RED) is released, offering microsecond-level ground truth irradiance for full-reference event filtering evaluation. Extensive experiments validate EDFilter's performance across tasks like event filtering, super-resolution, and direct event-based blob tracking. Significant gains in downstream applications such as SLAM and video reconstruction underscore its robustness and effectiveness.

Geological Inference from Textual Data using Word Embeddings

Nanmanas Linphrachaya,Irving Gómez-Méndez,Adil Siripatana

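Task: Use natural language processing (NLP) techniques to locate geological resources, particularly industrial minerals.

Motivation: Extract semantic relationships from geological texts with NLP to support analysis of the spatial distribution of geological resources.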

Details

Method: Train word embeddings with the GloVe model to extract semantic relationships between target keywords and geological texts, and apply dimensionality reduction techniques (PCA, Autoencoder, VAE, and VAE-LSTM) to improve feature extraction. Result: Geographically significant words are ranked by cosine similarity, and the haversine equation is used to check the semantic relationships against the spatial distribution of mine locations; the method is effective, but its accuracy has room for improvement. Conclusion: Combining NLP with dimensionality reduction provides useful information about the spatial distribution of natural resources, but accuracy needs further improvement. Abstract: This research explores the use of Natural Language Processing (NLP) techniques to locate geological resources, with a specific focus on industrial minerals. By using word embeddings trained with the GloVe model, we extract semantic relationships between target keywords and a corpus of geological texts. The text is filtered to retain only words with geographical significance, such as city names, which are then ranked by their cosine similarity to the target keyword. Dimensional reduction techniques, including Principal Component Analysis (PCA), Autoencoder, Variational Autoencoder (VAE), and VAE with Long Short-Term Memory (VAE-LSTM), are applied to enhance feature extraction and improve the accuracy of semantic relations. For benchmarking, we calculate the proximity between the ten cities most semantically related to the target keyword and identified mine locations using the haversine equation. The results demonstrate that combining NLP with dimensional reduction techniques provides meaningful insights into the spatial distribution of natural resources. Although the results fall in the same region as the expected location, the accuracy has room for improvement.
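
The two measurements named in the abstract, cosine similarity for ranking place names and the haversine equation for distance to mine sites, can be sketched as follows. This is a minimal illustration using the standard formulas, not the authors' pipeline: the function names, the dictionary of place vectors, and the mean Earth radius of 6371 km are assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_places(
    keyword_vec: Sequence[float],
    place_vecs: Dict[str, Sequence[float]],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Return the top_k place names most similar to the keyword."""
    scored = [(name, cosine_similarity(keyword_vec, vec))
              for name, vec in place_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float,
                 radius_km: float = 6371.0) -> float:
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))
```

For benchmarking in the spirit of the abstract, one would call `rank_places` with `top_k=10` and then compute `haversine_km` from each returned city's coordinates to the known mine location.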

VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding

Henghao Zhao,Ge-Peng Ji,Rui Yan,Huan Xiong,Zechao Li

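Task: Propose VideoExpert, a multimodal large language model that addresses timestamp generation in temporal-sensitive video tasks.

Motivation: Existing multimodal large language models rely more on language patterns than on visual cues when generating timestamps, which hurts performance.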

Details

Method: VideoExpert integrates two parallel modules, a Temporal Expert and a Spatial Expert, which handle time sequences and content details respectively and coordinate through a special token. Result: Experiments demonstrate that VideoExpert is effective and versatile on temporal-sensitive video tasks. Conclusion: By separating temporal grounding from content generation, VideoExpert reduces text pattern biases in timestamp prediction. Abstract: The core challenge in video understanding lies in perceiving dynamic content changes over time. However, multimodal large language models struggle with temporal-sensitive video tasks, which require generating timestamps to mark the occurrence of specific events. Existing strategies require MLLMs to generate absolute or relative timestamps directly. We have observed that those MLLMs tend to rely more on language patterns than visual cues when generating timestamps, affecting their performance. To address this problem, we propose VideoExpert, a general-purpose MLLM suitable for several temporal-sensitive video tasks. Inspired by the expert concept, VideoExpert integrates two parallel modules: the Temporal Expert and the Spatial Expert. The Temporal Expert is responsible for modeling time sequences and performing temporal grounding. It processes high-frame-rate yet compressed tokens to capture dynamic variations in videos and includes a lightweight prediction head for precise event localization. The Spatial Expert focuses on content detail analysis and instruction following. It handles specially designed spatial tokens and language input, aiming to generate content-related responses. These two experts collaborate seamlessly via a special token, ensuring coordinated temporal grounding and content generation. Notably, the Temporal and Spatial Experts maintain independent parameter sets. By offloading temporal grounding from content generation, VideoExpert prevents text pattern biases in timestamp predictions. Moreover, we introduce a Spatial Compress module to obtain spatial tokens. This module filters and compresses patch tokens while preserving key information, delivering compact yet detail-rich input for the Spatial Expert. Extensive experiments demonstrate the effectiveness and versatility of VideoExpert.

Supervised Optimism Correction: Be Confident When LLMs Are Sure

Junjie Zhang,Rushuai Yang,Shunyu Liu,Ting-En Lin,Fei Huang,Yi Chen,Yongbin Li,Dacheng Tao

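Task: Establish a theoretical connection between supervised fine-tuning and offline reinforcement learning under the token-level Markov decision process, and propose an improved method.

Motivation: Show that large language models learn an implicit Q-function for inference, and point out that the widely used beam search method suffers from over-optimism.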

Details

Method: Propose Supervised Optimism Correction (SOC), which applies an auxiliary loss to token-level Q-value estimations. Result: SOC's superiority is validated on mathematical reasoning benchmarks including GSM8K, MATH, and GAOKAO. Conclusion: By suppressing over-optimism toward insufficiently supervised responses, SOC significantly improves model performance. Abstract: In this work, we establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning under the token-level Markov decision process, revealing that large language models indeed learn an implicit Q-function for inference. Through this theoretical lens, we demonstrate that the widely used beam search method suffers from unacceptable over-optimism, where inference errors are inevitably amplified due to inflated Q-value estimations of suboptimal steps. To address this limitation, we propose Supervised Optimism Correction (SOC), which introduces a simple yet effective auxiliary loss for token-level Q-value estimations during supervised fine-tuning. Specifically, the auxiliary loss employs implicit value regularization to boost model confidence in expert-demonstrated responses, thereby suppressing over-optimism toward insufficiently supervised responses. Extensive experiments on mathematical reasoning benchmarks, including GSM8K, MATH, and GAOKAO, showcase the superiority of the proposed SOC with beam search across a series of open-source models.

DGOcc: Depth-aware Global Query-based Network for Monocular 3D Occupancy Prediction

Xu Zhao,Pengju Zhang,Bo Liu,Yihong Wu

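Task: Predict occupancy and semantics in 3D scenes from monocular 2D images.

Motivation: Monocular 3D occupancy prediction plays an important role in 3D scene understanding, but prediction for large-scale outdoor scenes is ill-posed and resource-intensive.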

Details

Method: Propose the DGOcc network, which uses depth context features and a Global Query-based (GQ) Module combining attention mechanisms with scale-aware operations, and a Hierarchical Supervision Strategy (HSS) to reduce computational resource consumption. Result: DGOcc achieves the best performance on the SemanticKITTI and SSCBench-KITTI-360 datasets while reducing GPU and time overhead. Conclusion: DGOcc achieves high performance and efficient resource use in monocular semantic occupancy prediction. Abstract: Monocular 3D occupancy prediction, aiming to predict the occupancy and semantics within interesting regions of 3D scenes from only 2D images, has garnered increasing attention recently for its vital role in 3D scene understanding. Predicting the 3D occupancy of large-scale outdoor scenes from 2D images is ill-posed and resource-intensive. In this paper, we present DGOcc, a Depth-aware Global query-based network for monocular 3D Occupancy prediction. We first explore prior depth maps to extract depth context features that provide explicit geometric information for the occupancy network. Then, in order to fully exploit the depth context features, we propose a Global Query-based (GQ) Module. The cooperation of attention mechanisms and scale-aware operations facilitates the feature interaction between images and 3D voxels. Moreover, a Hierarchical Supervision Strategy (HSS) is designed to avoid upsampling the high-dimension 3D voxel features to full resolution, which mitigates GPU memory utilization and time cost. Extensive experiments on SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate that the proposed method achieves the best performance on monocular semantic occupancy prediction while reducing GPU and time overhead.

AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation

Tuhin Chakrabarty,Philippe Laban,Chien-Sheng Wu

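Task: Evaluate and improve the writing quality of AI-generated text.

Motivation: Writing quality assessment for AI-generated text has received little attention because it is highly subjective and requires expertise.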

Details

Method: Introduce the Writing Quality Benchmark (WQ), train specialized Writing Quality Reward Models (WQRM), and improve outputs by generating and ranking candidate revisions. Result: WQRM generalizes well on four out-of-distribution test sets and reaches 74% accuracy on the WQ benchmark; human evaluation shows that experts prefer WQRM-selected text. Conclusion: WQRM can effectively improve the writing quality of AI-generated text; the released datasets and models encourage community engagement with writing quality assessment and the development of AI writing systems. Abstract: AI-generated text is proliferating across domains, from creative writing and journalism to marketing content and scientific articles. Models can follow user-provided instructions to generate coherent and grammatically correct outputs but in this work, we study a more fundamental question: how do we evaluate and improve the writing quality of AI-generated text? Writing quality assessment has received less attention from the community, in part because it is fundamentally subjective and requires expertise. We first introduce the Writing Quality Benchmark (WQ) by consolidating five writing-preference datasets into 4,729 writing quality judgments. Our experiments show that competitive baselines, including state-of-the-art LLMs that excel at reasoning tasks, barely outperform random baselines on WQ. We then train specialized Writing Quality Reward Models (WQRM) of various sizes for writing quality assessment that demonstrate strong generalization on four out-of-distribution test sets and 74% accuracy on the WQ benchmark. To further show WQRM's practical benefits during inference, we leverage additional test-time compute to generate and rank multiple candidate revisions, allowing us to select higher-quality outputs from an initial draft. Human evaluation with 9 experienced writers confirms that WQRM-based selection produces writing samples preferred by experts 66% of the time overall, and 72.2% of the time when the reward gap is larger than 1 point. We release our datasets and models to encourage community engagement with writing quality assessment and development of AI writing systems better aligned with human preferences.
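
The test-time step described in the abstract, generating several candidate revisions and keeping the one a reward model scores highest, amounts to best-of-N selection. The sketch below shows only that selection logic. The reward function here is a toy stand-in (it penalizes a short, made-up list of overused words) and is an assumption for illustration; it is not WQRM, and `select_best_revision` and `toy_reward` are hypothetical names.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical list of overused words; a real reward model would be learned.
OVERUSED = ("tapestry", "vibrant")


def toy_reward(text: str) -> float:
    """Toy stand-in for a writing-quality reward model (not WQRM)."""
    lowered = text.lower()
    return -float(sum(lowered.count(word) for word in OVERUSED))


def select_best_revision(
    draft: str,
    candidates: Sequence[str],
    reward_fn: Callable[[str], float],
) -> Tuple[str, float]:
    """Return the highest-scoring text and its reward gap over the draft.

    The draft itself is included in the pool, so it is kept whenever no
    candidate revision scores strictly higher.
    """
    pool = [draft, *candidates]
    scores = [reward_fn(text) for text in pool]
    best_idx = max(range(len(pool)), key=scores.__getitem__)
    return pool[best_idx], scores[best_idx] - scores[0]
```

Returning the reward gap alongside the text makes it easy to apply a threshold, in the spirit of the abstract's observation that expert preference is stronger when the gap exceeds 1 point.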

SydneyScapes: Image Segmentation for Australian Environments

Hongyu Lyu,Julie Stephany Berrio,Mao Shan,Stewart Worrall

Task: Introduce SydneyScapes, a dataset for computer vision tasks in AV perception systems tailored for the Australian context.

Motivation: Address the lack of locally labelled datasets in Australia for developing and testing AV algorithms.

Details

Method: Collect and annotate 756 images from Sydney and surrounding areas, providing high-quality pixel-level annotations. Result: SydneyScapes dataset is publicly available, with benchmarking results using state-of-the-art algorithms. Conclusion: The dataset supports AV research and development in Australia, filling a gap in localized data. Abstract: Autonomous Vehicles (AVs) are being partially deployed and tested across various global locations, including China, the USA, Germany, France, Japan, Korea, and the UK, but with limited demonstrations in Australia. The integration of machine learning (ML) into AV perception systems highlights the need for locally labelled datasets to develop and test algorithms in specific environments. To address this, we introduce SydneyScapes - a dataset tailored for computer vision tasks of image semantic, instance, and panoptic segmentation. This dataset, collected from Sydney and surrounding cities in New South Wales (NSW), Australia, consists of 756 images with high-quality pixel-level annotations. It is designed to assist AV industry and researchers by providing annotated data and tools for algorithm development, testing, and deployment in the Australian context. Additionally, we offer benchmarking results using state-of-the-art algorithms to establish reference points for future research and development. The dataset is publicly available at https://hdl.handle.net/2123/33051.

Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering

Patrick Fernandes,Sweta Agrawal,Emmanouil Zaranis,André F. T. Martins,Graham Neubig

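Task: Propose TREQA, a question-answering-based framework for evaluating the quality of paragraph-level translation.

Motivation: Existing automatic metrics struggle to capture how well meaning is preserved across sentences, so a more pragmatic approach is needed for evaluating translations of long, complex passages.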

Details

Method: Use question answering to assess whether a translation accurately conveys key information in the source or reference text. Result: TREQA is competitive with, and in some cases outperforms, existing neural and LLM-based metrics in ranking paragraph-level translations, while also offering interpretability. Conclusion: TREQA is an effective translation evaluation method, particularly in domains that require long-range understanding. Abstract: Despite the steady progress in machine translation evaluation, existing automatic metrics struggle to capture how well meaning is preserved beyond sentence boundaries. We posit that reliance on a single intrinsic quality score, trained to mimic human judgments, might be insufficient for evaluating translations of long, complex passages, and a more "pragmatic" approach that assesses how accurately key information is conveyed by a translation in context is needed. We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality by assessing how accurately candidate translations answer reading comprehension questions that target key information in the original source or reference texts. In challenging domains that require long-range understanding, such as literary texts, we show that TREQA is competitive with and, in some cases, outperforms state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations, despite never being explicitly optimized to correlate with human judgments. Furthermore, the generated questions and answers offer interpretability: empirical analysis shows that they effectively target translation errors identified by experts in evaluated datasets. Our code is available at https://github.com/deep-spin/treqa

STeP: A General and Scalable Framework for Solving Video Inverse Problems with Spatiotemporal Diffusion Priors

Bingliang Zhang,Zihui Wu,Berthy T. Feng,Yang Song,Yisong Yue,Katherine L. Bouman

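Task: Study how to solve general Bayesian inverse problems involving videos using diffusion model priors.

Motivation: Existing methods rely on single-frame image diffusion priors combined with heuristics to enforce temporal consistency, but they struggle to faithfully recover temporal relationships, particularly in tasks with high temporal uncertainty.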

Details

Method: Build a practical and accessible spatiotemporal diffusion prior by fine-tuning a latent video diffusion model from a pretrained image diffusion model, and develop a general, scalable framework for solving video inverse problems. Result: On scientific video inverse problems such as black hole imaging and dynamic MRI, the framework produces diverse, high-fidelity video reconstructions and recovers multi-modal solutions. Conclusion: Incorporating a spatiotemporal diffusion prior significantly improves the ability to capture complex temporal relationships and also enhances spatial fidelity. Abstract: We study how to solve general Bayesian inverse problems involving videos using diffusion model priors. While it is desirable to use a video diffusion prior to effectively capture complex temporal relationships, due to the computational and data requirements of training such a model, prior work has instead relied on image diffusion priors on single frames combined with heuristics to enforce temporal consistency. However, these approaches struggle with faithfully recovering the underlying temporal relationships, particularly for tasks with high temporal uncertainty. In this paper, we demonstrate the feasibility of practical and accessible spatiotemporal diffusion priors by fine-tuning latent video diffusion models from pretrained image diffusion models using limited videos in specific domains. Leveraging this plug-and-play spatiotemporal diffusion prior, we introduce a general and scalable framework for solving video inverse problems. We then apply our framework to two challenging scientific video inverse problems--black hole imaging and dynamic MRI. Our framework enables the generation of diverse, high-fidelity video reconstructions that not only fit observations but also recover multi-modal solutions. By incorporating a spatiotemporal diffusion prior, we significantly improve our ability to capture complex temporal relationships in the data while also enhancing spatial fidelity.

SaRoHead: A Dataset for Satire Detection in Romanian Multi-Domain News Headlines

Mihnea-Alexandru Vîrlan,Răzvan-Alexandru Smădu,Dumitru-Clementin Cercel

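Task: Build SaRoHead, the first corpus for satire detection in Romanian multi-domain news headlines.

Motivation: A headline's expressiveness and its connection to the subject make satire detection challenging, especially since humorous headlines may blend satire, irony, and sarcasm.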

Details

Method: Propose the SaRoHead corpus for satire detection. Result: Clickbait in non-satirical headlines significantly influences the model. Conclusion: SaRoHead provides the first corpus for satire detection in Romanian multi-domain news headlines and reveals the influence of clickbait on detection. Abstract: The headline is an important part of a news article, influenced by expressiveness and connection to the exposed subject. Although most news outlets aim to present reality objectively, some publications prefer a humorous approach in which stylistic elements of satire, irony, and sarcasm blend to cover specific topics. Satire detection can be difficult because a headline aims to expose the main idea behind a news article. In this paper, we propose SaRoHead, the first corpus for satire detection in Romanian multi-domain news headlines. Our findings show that the clickbait used in some non-satirical headlines significantly influences the model.

TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs

Zijian Zhang,Xuhui Zheng,Xuecheng Wu,Chong Peng,Xuezhi Cao

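Task: Propose TokenFocus-VQA, a framework for evaluating text-to-image generation models on fine-grained semantic alignment.

Motivation: Existing evaluation methods fall short on fine-grained semantic matching, and global similarity metrics often overlook the correspondence between key tokens and visual content.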

Details

Method: Use large vision-language models (LVLMs) in a visual question answering (VQA) paradigm, combined with position-specific probability optimization and a token-aware loss function. Result: Ranked 2nd on both the public and private test sets of the NTIRE 2025 challenge, demonstrating its strength in capturing text-image correspondences. Conclusion: By evaluating fine-grained semantic alignment, TokenFocus-VQA outperforms conventional evaluation methods. Abstract: While text-to-image (T2I) generation models have achieved remarkable progress in recent years, existing evaluation methodologies for vision-language alignment still struggle with fine-grained semantic matching. Current approaches based on global similarity metrics often overlook critical token-level correspondences between textual descriptions and visual content. To this end, we present TokenFocus-VQA, a novel evaluation framework that leverages Large Vision-Language Models (LVLMs) through a visual question answering (VQA) paradigm with position-specific probability optimization. Our key innovation lies in designing a token-aware loss function that selectively focuses on probability distributions at pre-defined vocabulary positions corresponding to crucial semantic elements, enabling precise measurement of fine-grained semantic alignment. The proposed framework further integrates ensemble learning techniques to aggregate multi-perspective assessments from diverse LVLM architectures, thereby achieving further performance enhancement. Evaluated on the NTIRE 2025 T2I Quality Assessment Challenge Track 1, our TokenFocus-VQA ranks 2nd place (0.8445, only 0.0001 lower than the 1st-place method) on public evaluation and 2nd place (0.8426) on the official private test set, demonstrating superiority in capturing nuanced text-image correspondences compared to conventional evaluation methods.

ConceptFormer: Towards Efficient Use of Knowledge-Graph Embeddings in Large Language Models

Joel Barmettler,Abraham Bernstein,Luca Rossetto

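Task: Propose ConceptFormer, a method that augments large language models by injecting concept vectors in the embedding vector space, without altering the pre-trained language model's internal structure or relying on textified knowledge graphs.

Motivation: Current retrieval-augmented generation methods typically modify the internal structure of pre-trained language models or rely on textifying knowledge graphs, which is inefficient in token usage.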

Details

Method: ConceptFormer operates in the LLM embedding vector space, creating and injecting concept vectors that encapsulate knowledge-graph node information; trained in conjunction with a frozen LLM, it generates a lookup table that maps knowledge-graph nodes to concept vectors. Result: Experiments show that ConceptFormer substantially improves the factual recall (Hit@10) of GPT-2 0.1B, by up to 272% on Wikipedia sentences and up to 348% on synthetic sentences; even a single injected concept vector outperforms RAG with graph textification while consuming 130x fewer input tokens. Conclusion: ConceptFormer injects structured knowledge into large language models in an efficient and scalable way, substantially improving their factual recall. Abstract: Retrieval Augmented Generation (RAG) has enjoyed increased attention in the recent past and recent advancements in Large Language Models (LLMs) have highlighted the importance of integrating world knowledge into these systems. Current RAG methodologies often modify the internal architecture of pre-trained language models (PLMs) or rely on textifying knowledge graphs (KGs), which is inefficient in terms of token usage. This paper introduces ConceptFormer, a new approach to augment LLMs with structured knowledge from KGs, such as Wikidata, without altering their internal structure or relying on textual input of KGs. ConceptFormer operates in the LLM embedding vector space, creating and injecting concept vectors that encapsulate the information of the KG nodes directly. Trained in conjunction with a frozen LLM, ConceptFormer generates a comprehensive lookup table that maps KG nodes to their respective concept vectors. The approach aims to enhance the factual recall capabilities of LLMs by enabling them to process these concept vectors natively, thus enriching them with structured world knowledge in an efficient and scalable manner. Our experiments demonstrate that the addition of concept vectors to GPT-2 0.1B substantially increases its factual recall ability (Hit@10) by up to 272% when tested on sentences from Wikipedia and up to 348% on synthetically generated sentences. Even injecting only a single concept vector into the prompt increases factual recall ability (Hit@10) by up to 213% on Wikipedia sentences, significantly outperforming RAG with graph textification while consuming 130x fewer input tokens.
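For readers unfamiliar with the metric, Hit@10 is commonly computed as the share of test sentences whose correct answer appears among the model's ten highest-ranked predictions. The sketch below follows that common definition; the function names are illustrative, and nothing here is taken from the ConceptFormer code.

```python
from typing import List, Sequence, Tuple


def hit_at_k(ranked_predictions: Sequence[str], gold: str, k: int = 10) -> float:
    """Return 1.0 if the gold answer is among the top-k ranked predictions, else 0.0."""
    return 1.0 if gold in ranked_predictions[:k] else 0.0


def mean_hit_at_k(examples: List[Tuple[Sequence[str], str]], k: int = 10) -> float:
    """Average Hit@k over (ranked_predictions, gold) pairs."""
    if not examples:
        raise ValueError("need at least one example")
    return sum(hit_at_k(preds, gold, k) for preds, gold in examples) / len(examples)


def relative_gain_percent(baseline: float, improved: float) -> float:
    """Relative increase of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (improved - baseline) / baseline
```

A relative gain of 272% therefore means the augmented model's Hit@10 is 3.72 times the baseline value, not 2.72 times.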

Benchmarking Image Embeddings for E-Commerce: Evaluating Off-the Shelf Foundation Models, Fine-Tuning Strategies and Practical Trade-offs

Urszula Czerwinska,Cenk Bircanoglu,Jeremy Chamoux

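Task: Evaluate the suitability of foundation-model image embeddings for e-commerce classification and retrieval.

Motivation: Study how different pre-trained models and training methods perform in practical e-commerce scenarios.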

Details

Method: Take convolutional and transformer models pre-trained with supervised, self-supervised, and text-image contrastive learning, and evaluate full fine-tuning and transfer learning (top-tuning) on six e-commerce datasets. Result: Full fine-tuning performs consistently well, while text-image and self-supervised embeddings can match it with less training; top-tuning is an efficient alternative that lowers computational cost. Conclusion: The study offers practical guidelines for embedding selection and fine-tuning strategies that balance efficiency and performance. Abstract: We benchmark foundation model image embeddings for classification and retrieval in e-Commerce, evaluating their suitability for real-world applications. Our study spans embeddings from pre-trained convolutional and transformer models trained via supervised, self-supervised, and text-image contrastive learning. We assess full fine-tuning and transfer learning (top-tuning) on six diverse e-Commerce datasets: fashion, consumer goods, cars, food, and retail. Results show full fine-tuning consistently performs well, while text-image and self-supervised embeddings can match its performance with less training. While supervised embeddings remain stable across architectures, SSL and contrastive embeddings vary significantly, often benefiting from top-tuning. Top-tuning emerges as an efficient alternative to full fine-tuning, reducing computational costs. We also explore cross-tuning, noting its impact depends on dataset characteristics. Our findings offer practical guidelines for embedding selection and fine-tuning strategies, balancing efficiency and performance.

On the Temporal Question-Answering Capabilities of Large Language Models Over Anonymized Data

Alfredo Garrachón Ruiz,Tomás de la Rosa,Daniel Borrajo

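Task: Explore the applicability of large language models (LLMs) to temporal reasoning tasks over data not seen during training.

Motivation: Investigate the temporal reasoning ability of LLMs over structured and semi-structured anonymized data, filling a gap in this area.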

Details

Method: Develop a direct LLM pipeline, compare several methodologies (such as Tree-of-Thought, self-reflection, and code execution), and create the RATA dataset to assess performance. Result: Standalone LLMs are not enough to achieve scalable and reliable solutions; integrated approaches are needed. Conclusion: Integrated approaches are important for improving the temporal reasoning ability of LLMs. Abstract: The applicability of Large Language Models (LLMs) in temporal reasoning tasks over data that is not present during training is still a field that remains to be explored. In this paper we work on this topic, focusing on structured and semi-structured anonymized data. We not only develop a direct LLM pipeline, but also compare various methodologies and conduct an in-depth analysis. We identified and examined seventeen common temporal reasoning tasks in natural language, focusing on their algorithmic components. To assess LLM performance, we created the Reasoning and Answering Temporal Ability dataset (RATA), featuring semi-structured anonymized data to ensure reliance on reasoning rather than on prior knowledge. We compared several methodologies, involving SoTA techniques such as Tree-of-Thought, self-reflection and code execution, tuned specifically for this scenario. Our results suggest that achieving scalable and reliable solutions requires more than just standalone LLMs, highlighting the need for integrated approaches.

On Model and Data Scaling for Skeleton-based Self-Supervised Gait Recognition

Adrian Cosma,Andy Cǎtrunǎ,Emilian Rǎdoi

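Task: Study how data quantity, model size, and compute affect performance in skeleton-based self-supervised gait recognition.

Motivation: Explore whether neural scaling laws apply to gait recognition, in order to quantify the relationship between resource investment and performance gains.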

Details

Method: Pretrain GaitPT (a transformer-based architecture) on 2.7 million walking sequences collected in the wild, and evaluate zero-shot performance on four benchmark datasets. Result: Performance improves as a power law with scale, and data and compute scaling significantly influence downstream accuracy. Conclusion: The results offer practical insights into resource allocation and performance estimation for real-world gait recognition systems. Abstract: Gait recognition from video streams is a challenging problem in computer vision biometrics due to the subtle differences between gaits and numerous confounding factors. Recent advancements in self-supervised pretraining have led to the development of robust gait recognition models that are invariant to walking covariates. While neural scaling laws have transformed model development in other domains by linking performance to data, model size, and compute, their applicability to gait remains unexplored. In this work, we conduct the first empirical scaling study on skeleton-based self-supervised gait recognition to quantify the effect of data quantity, model size and compute on downstream gait recognition performance. We pretrain multiple variants of GaitPT - a transformer-based architecture - on a dataset of 2.7 million walking sequences collected in the wild. We evaluate zero-shot performance across four benchmark datasets to derive scaling laws for data, model size, and compute. Our findings demonstrate predictable power-law improvements in performance with increased scale and confirm that data and compute scaling significantly influence downstream accuracy. We further isolate architectural contributions by comparing GaitPT with GaitFormer under controlled compute budgets. These results provide practical insights into resource allocation and performance estimation for real-world gait recognition systems.
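A scaling law of the form y = a * x^b is typically estimated by fitting a straight line in log-log space. The sketch below shows that standard procedure using only the standard library; it assumes a single-variable power law with no irreducible-error offset, which may differ from the functional form the authors actually fit.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b).

    xs could be dataset sizes, parameter counts, or compute budgets, and ys an
    error-like quantity (all values must be positive for the logarithm).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                      # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept in log-log space = log(a)
    return a, b
```

Once a and b are estimated from small runs, evaluating a * x**b at a larger x gives the kind of performance estimate the abstract refers to, with the usual caveat that extrapolation far beyond the fitted range is unreliable.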

Unveiling the Impact of Multimodal Features on Chinese Spelling Correction: From Analysis to Design

Xiaowu Zhang,Hongfei Zhao,Jingyi Hou,Zhijie Liu

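Task: Propose and evaluate NamBert, a novel multimodal model for Chinese spelling correction (CSC).

Motivation: Existing large language models (LLMs) suffer from over-correction on CSC, and multimodal models still have room to improve in how they use phonetic and graphemic information.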

Details

Method: Propose the MACU experiment to analyze the potential for multimodal improvement, and design the multimodal model NamBert on that basis. Result: NamBert outperforms existing methods on benchmark datasets and is systematically compared with LLMs. Conclusion: NamBert performs strongly on multimodal CSC and offers an effective answer to the limitations of LLMs. Abstract: The Chinese Spelling Correction (CSC) task focuses on detecting and correcting spelling errors in sentences. Current research primarily explores two approaches: traditional multimodal pre-trained models and large language models (LLMs). However, LLMs face limitations in CSC, particularly over-correction, making them suboptimal for this task. While existing studies have investigated the use of phonetic and graphemic information in multimodal CSC models, effectively leveraging these features to enhance correction performance remains a challenge. To address this, we propose the Multimodal Analysis for Character Usage (MACU) experiment, identifying potential improvements for multimodal correction. Based on empirical findings, we introduce NamBert, a novel multimodal model for Chinese spelling correction. Experiments on benchmark datasets demonstrate NamBert's superiority over SOTA methods. We also conduct a comprehensive comparison between NamBert and LLMs, systematically evaluating their strengths and limitations in CSC. Our code and model are available at https://github.com/iioSnail/NamBert.

RASMD: RGB And SWIR Multispectral Driving Dataset for Robust Perception in Adverse Conditions

Youngwan Jin,Michal Kovac,Yagiz Nalcakan,Hyeongjin Ju,Hanbin Song,Sanghyeop Yeo,Shiho Kim

Task: Introduce the RGB and SWIR Multispectral Driving (RASMD) dataset to address the lack of large-scale SWIR data for autonomous driving.

Motivation: Current autonomous driving algorithms rely on visible spectrum data, which performs poorly in adverse conditions; SWIR imaging offers advantages but lacks datasets.

Details

Method: Collect 100,000 synchronized RGB-SWIR image pairs across diverse conditions, provide annotations for object detection and RGB-SWIR translation, and test an ensemble framework. Result: Combining RGB and SWIR data improves detection accuracy, especially in challenging conditions. Conclusion: The RASMD dataset can advance multispectral imaging research for autonomous driving. Abstract: Current autonomous driving algorithms heavily rely on the visible spectrum, which is prone to performance degradation in adverse conditions like fog, rain, snow, glare, and high contrast. Although other spectral bands like near-infrared (NIR) and long-wave infrared (LWIR) can enhance vision perception in such situations, they have limitations and lack large-scale datasets and benchmarks. Short-wave infrared (SWIR) imaging offers several advantages over NIR and LWIR. However, no publicly available large-scale datasets currently incorporate SWIR data for autonomous driving. To address this gap, we introduce the RGB and SWIR Multispectral Driving (RASMD) dataset, which comprises 100,000 synchronized and spatially aligned RGB-SWIR image pairs collected across diverse locations, lighting, and weather conditions. In addition, we provide a subset for RGB-SWIR translation and object detection annotations for a subset of challenging traffic scenarios to demonstrate the utility of SWIR imaging through experiments on both object detection and RGB-to-SWIR image translation. Our experiments show that combining RGB and SWIR data in an ensemble framework significantly improves detection accuracy compared to RGB-only approaches, particularly in conditions where visible-spectrum sensors struggle. We anticipate that the RASMD dataset will advance research in multispectral imaging for autonomous driving and robust perception systems.

Synthetic Fluency: Hallucinations, Confabulations, and the Creation of Irish Words in LLM-Generated Translations

Sheila Castilho,Zoe Fitzsimmons,Claire Holton,Aoife Mc Donagh

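Task: Study hallucinations produced by large language models (LLMs) when translating into Irish, in particular the generation of non-existent words.

Motivation: Examine the patterns of LLM hallucination in a low-resource, morphologically rich language such as Irish, and their potential implications for language evolution.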

Details

Method: Classify hallucinated words (verbs and nouns), analyze whether they adhere to Irish morphological rules and what linguistic tendencies they show, and compare GPT-4.o and GPT-4.o Mini. Result: Both models produce similar types of hallucinations, but the Mini model does so more frequently; the hallucinated words partly follow morphological rules and show specific linguistic tendencies. Conclusion: LLMs may influence Irish vocabulary and linguistic evolution, and the authors call for further discussion of the role of such technology in low-resource languages. Abstract: This study examines hallucinations in Large Language Model (LLM) translations into Irish, specifically focusing on instances where the models generate novel, non-existent words. We classify these hallucinations within verb and noun categories, identifying six distinct patterns among the latter. Additionally, we analyse whether these hallucinations adhere to Irish morphological rules and what linguistic tendencies they exhibit. Our findings show that while both GPT-4.o and GPT-4.o Mini produce similar types of hallucinations, the Mini model generates them at a significantly higher frequency. Beyond classification, the discussion raises speculative questions about the implications of these hallucinations for the Irish language. Rather than seeking definitive answers, we offer food for thought regarding the increasing use of LLMs and their potential role in shaping Irish vocabulary and linguistic evolution. We aim to prompt discussion on how such technologies might influence language over time, particularly in the context of low-resource, morphologically rich languages.

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen,Peng Liu,Jingcheng Li,Chunxin Fang,Yibo Ma,Jiajia Liao,Qiaoli Shen,Zilun Zhang,Kangjia Zhao,Qianqian Zhang,Ruochen Xu,Tiancheng Zhao

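Task: Improve the visual reasoning capabilities of vision-language models (VLMs) through reinforcement learning (RL).

Motivation: Visual understanding tasks usually come with well-defined ground-truth annotations, which suits rule-based reward mechanisms; this motivates extending R1-style reinforcement learning to VLMs.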

Details

Method: Develop the VLM-R1 framework, which uses RL to improve VLM performance on vision-language tasks. Result: The RL-based model performs competitively on visual understanding tasks and generalizes better than supervised fine-tuning (SFT). Conclusion: Experiments and analysis reveal the potential of RL for vision-language models, and the code is open-sourced to support the community. Abstract: Recently DeepSeek R1 has shown that reinforcement learning (RL) can substantially improve the reasoning capabilities of Large Language Models (LLMs) through a simple yet effective design. The core of R1 lies in its rule-based reward formulation, which leverages tasks with deterministic ground-truth answers to enable precise and stable reward computation. In the visual domain, we similarly observe that a wide range of visual understanding tasks are inherently equipped with well-defined ground-truth annotations. This property makes them naturally compatible with rule-based reward mechanisms. Motivated by this observation, we investigate the extension of R1-style reinforcement learning to Vision-Language Models (VLMs), aiming to enhance their visual reasoning capabilities. To this end, we develop VLM-R1, a dedicated framework designed to harness RL for improving VLMs' performance on general vision-language tasks. Using this framework, we further explore the feasibility of applying RL to the visual domain. Experimental results indicate that the RL-based model not only delivers competitive performance on visual understanding tasks but also surpasses Supervised Fine-Tuning (SFT) in generalization ability. Furthermore, we conduct comprehensive ablation studies that uncover a series of noteworthy insights, including the presence of reward hacking in object detection, the emergence of the "OD aha moment", the impact of training data quality, and the scaling behavior of RL across different model sizes. Through these analyses, we aim to deepen the understanding of how reinforcement learning enhances the capabilities of vision-language models, and we hope our findings and open-source contributions will support continued progress in the vision-language RL community. Our code and model are available at https://github.com/om-ai-lab/VLM-R1

Context-Aware Monolingual Human Evaluation of Machine Translation

Silvio Picinini,Sheila Castilho

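Task: Explore the potential of context-aware monolingual human evaluation for assessing machine translation when no source text is given for reference.

Motivation: Compare monolingual and bilingual evaluation (with source text) both when evaluating a single MT system and when comparing multiple systems.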

Details

Method: Four professional translators performed both monolingual and bilingual evaluations, assigning ratings, annotating errors, and providing feedback. Result: Context-aware monolingual evaluation achieves outcomes comparable to bilingual evaluation, indicating that it is feasible as an efficient way to assess MT. Conclusion: Monolingual evaluation is an efficient and feasible approach to MT evaluation. Abstract: This paper explores the potential of context-aware monolingual human evaluation for assessing machine translation (MT) when no source is given for reference. To this end, we compare monolingual with bilingual evaluations (with source text), under two scenarios: the evaluation of a single MT system, and the comparative evaluation of pairwise MT systems. Four professional translators performed both monolingual and bilingual evaluations by assigning ratings and annotating errors, and providing feedback on their experience. Our findings suggest that context-aware monolingual human evaluation achieves comparable outcomes to human bilingual evaluations, and suggest the feasibility and potential of monolingual evaluation as an efficient approach to assessing MT.

End-to-End Facial Expression Detection in Long Videos

Yini Fang,Alec Diallo,Yiqi Shi,Frederic Jumelle,Bertram Shi

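Task: Propose an end-to-end Facial Expression Detection Network (FEDN) that jointly optimizes expression spotting and recognition.

Motivation: Existing methods handle spotting and recognition separately, which leads to error propagation, inefficient feature learning, and suboptimal performance for lack of joint optimization.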

Details

Method: Introduce an attention-based feature extraction module that combines segment attention and sliding window attention to improve facial feature learning. Result: State-of-the-art spotting and detection accuracy on the CASME^2 and CASME^3 datasets. Conclusion: Joint optimization greatly reduces error propagation and improves the robustness and overall performance of facial expression detection in long videos. Abstract: Facial expression detection involves two interrelated tasks: spotting, which identifies the onset and offset of expressions, and recognition, which classifies them into emotional categories. Most existing methods treat these tasks separately using a two-step training pipeline. A spotting model first detects expression intervals. A recognition model then classifies the detected segments. However, this sequential approach leads to error propagation, inefficient feature learning, and suboptimal performance due to the lack of joint optimization of the two tasks. We propose FEDN, an end-to-end Facial Expression Detection Network that jointly optimizes spotting and recognition. Our model introduces a novel attention-based feature extraction module, incorporating segment attention and sliding window attention to improve facial feature learning. By unifying two tasks within a single network, we greatly reduce error propagation and enhance overall performance. Experiments on CASME^2 and CASME^3 demonstrate state-of-the-art accuracy for both spotting and detection, underscoring the benefits of joint optimization for robust facial expression detection in long videos.

Proactive User Information Acquisition via Chats on User-Favored Topics

Shiki Sato,Jun Baba,Asahi Hentona,Shinji Iwata,Akifumi Yoshimoto,Koichiro Yoshino

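Task: Propose the PIVOT task, which aims to obtain a user's answers to predefined questions through chat without the conversation feeling abrupt to the user.

Motivation: Provide a technical foundation for chat-oriented dialogue systems that aim to acquire specific information through topics the user favors.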

Details

Method: Construct a dataset suitable for analysis and develop a simple but effective system. Result: Even recent LLMs show a low success rate on the PIVOT task, while the system developed from the dataset analysis performs better. Conclusion: Dataset analysis and system development together yield an effective solution for the PIVOT task. Abstract: Chat-oriented dialogue systems designed to provide tangible benefits, such as sharing the latest news or preventing frailty in senior citizens, often require Proactive acquisition of specific user Information via chats on user-faVOred Topics (PIVOT). This study proposes the PIVOT task, designed to advance the technical foundation for these systems. In this task, a system needs to acquire the answers of a user to predefined questions without making the user feel abrupt while engaging in a chat on a predefined topic. We found that even recent large language models (LLMs) show a low success rate in the PIVOT task. We constructed a dataset suitable for the analysis to develop more effective systems. Finally, we developed a simple but effective system for this task by incorporating insights obtained through the analysis of this dataset.

S2R-HDR: A Large-Scale Rendered Dataset for HDR Fusion

Yujin Wang,Jiarui Wu,Yichen Bian,Fan Zhang,Tianfan Xue

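Task: Propose S2R-HDR, the first large-scale, high-quality synthetic dataset for HDR fusion.

Motivation: Address the shortage of training data for HDR fusion, since collecting large-scale HDR data from real scenes is costly and technically challenging.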

Details

Method: Design diverse HDR scenes with Unreal Engine 5, develop an efficient rendering pipeline, and introduce S2R-Adapter for domain adaptation. Result: State-of-the-art HDR reconstruction performance on real-world datasets. Conclusion: The S2R-HDR dataset and the S2R-Adapter method effectively improve the generalization of HDR fusion. Abstract: The generalization of learning-based high dynamic range (HDR) fusion is often limited by the availability of training data, as collecting large-scale HDR images from dynamic scenes is both costly and technically challenging. To address these challenges, we propose S2R-HDR, the first large-scale high-quality synthetic dataset for HDR fusion, with 24,000 HDR samples. Using Unreal Engine 5, we design a diverse set of realistic HDR scenes that encompass various dynamic elements, motion types, high dynamic range scenes, and lighting. Additionally, we develop an efficient rendering pipeline to generate realistic HDR images. To further mitigate the domain gap between synthetic and real-world data, we introduce S2R-Adapter, a domain adaptation approach designed to bridge this gap and enhance the generalization ability of models. Experimental results on real-world datasets demonstrate that our approach achieves state-of-the-art HDR reconstruction performance. Dataset and code will be available at https://openimaginglab.github.io/S2R-HDR.

MRD-RAG: Enhancing Medical Diagnosis with Multi-Round Retrieval-Augmented Generation

Yixiang Chen,Penglei Sun,Xiang Li,Xiaowen Chu

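Task: Propose a Multi-Round Diagnostic retrieval-augmented generation framework (MRD-RAG) that mimics a doctor's diagnostic process.

Motivation: Existing medical RAG frameworks are mostly designed for single-round question answering and do not suit multi-round diagnostic dialogue, nor do they consider the interconnections between potential diseases.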

Details

Method: Design the MRD-RAG framework, which analyzes diagnostic information about potential diseases and conducts multi-round diagnosis like a doctor. Result: Experiments on two modern medical datasets and two traditional Chinese medicine datasets show that MRD-RAG significantly improves the diagnostic performance of LLMs. Conclusion: The MRD-RAG framework shows potential for medical diagnosis and effectively supports multi-round diagnostic tasks. Abstract: In recent years, accurately and quickly deploying medical large language models (LLMs) has become a significant trend. Among these, retrieval-augmented generation (RAG) has garnered significant attention due to its features of rapid deployment and privacy protection. However, existing medical RAG frameworks still have shortcomings. Most existing medical RAG frameworks are designed for single-round question answering tasks and are not suitable for multi-round diagnostic dialogue. On the other hand, existing medical multi-round RAG frameworks do not consider the interconnections between potential diseases to inquire precisely like a doctor. To address these issues, we propose a Multi-Round Diagnostic RAG (MRD-RAG) framework that mimics the doctor's diagnostic process. This RAG framework can analyze diagnosis information of potential diseases and accurately conduct multi-round diagnosis like a doctor. To evaluate the effectiveness of our proposed framework, we conduct experiments on two modern medical datasets and two traditional Chinese medicine datasets, with evaluations by GPT and human doctors on different methods. The results indicate that our RAG framework can significantly enhance the diagnostic performance of LLMs, highlighting the potential of our approach in medical diagnosis. The code and data can be found on our project website https://github.com/YixiangCh/MRD-RAG/tree/master.

LAPIS: A novel dataset for personalized image aesthetic assessment

Anne-Sofie Maerten,Li-Wei Chen,Stefanie De Winter,Christophe Bossens,Johan Wagemans

Task: Introduce and evaluate the LAPIS dataset for personalized image aesthetic assessment (PIAA).

Motivation: No existing artwork dataset is suitable for PIAA; LAPIS fills this gap.

Details

Method: Using a carefully curated artwork dataset (11,723 images) with aesthetic scores and image attributes, assess the performance of two existing PIAA models. Result: Performance deteriorates when certain personal and image attributes are removed, and the existing models make similar errors on artistic images. Conclusion: LAPIS provides a new benchmark for aesthetic assessment of artistic images and shows where existing models can improve. Abstract: We present the Leuven Art Personalized Image Set (LAPIS), a novel dataset for personalized image aesthetic assessment (PIAA). It is the first dataset with images of artworks that is suitable for PIAA. LAPIS consists of 11,723 images and was meticulously curated in collaboration with art historians. Each image has an aesthetics score and a set of image attributes known to relate to aesthetic appreciation. Besides rich image attributes, LAPIS offers rich personal attributes of each annotator. We implemented two existing state-of-the-art PIAA models and assessed their performance on LAPIS. We assess the contribution of personal attributes and image attributes through ablation studies and find that performance deteriorates when certain personal and image attributes are removed. An analysis of failure cases reveals that both existing models make similar incorrect predictions, highlighting the need for improvements in artistic image aesthetic assessment. The LAPIS project page can be found at: https://github.com/Anne-SofieMaerten/LAPIS

DeepGreen: Effective LLM-Driven Green-washing Monitoring System Designed for Empirical Testing -- Evidence from China

Congluo Xu,Yu Miao,Yiling Xiao,Chengmengjia Lin

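Task: Propose DeepGreen, a system that uses large language models (LLMs) to detect corporate green-washing.

Motivation: Use dual-layer LLM analysis to identify green keywords in financial statements and assess their degree of implementation, giving regulators and investors a proactive monitoring tool.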

Details

Method: Apply dual-layer LLM analysis: first identify green keywords, then assess their degree of implementation through iterative semantic analysis, yielding the core variable GreenImplement. Result: An analysis of 204 financial statements validates GreenImplement against the Huazheng ESG rating and finds that green implementation significantly boosts the asset return rate, although the contribution is limited for small and medium-sized companies. Conclusion: DeepGreen offers a new perspective for regulation and investment, complements traditional methods, and reveals heterogeneity in the motivation for green-washing. Abstract: This paper proposes DeepGreen, a Large Language Model-driven (LLM-driven) system for detecting corporate green-washing behaviour. Utilizing dual-layer LLM analysis, DeepGreen preliminarily identifies potential green keywords in financial statements and then assesses their implementation degree via iterative semantic analysis of LLM. A core variable GreenImplement is derived from the ratio of the two layers' outputs. We extract 204 financial statements of 68 companies from A-share market over three years, comprising 89,893 words, and analyse them through DeepGreen. Our analysis, supported by violin plots and K-means clustering, reveals insights and validates the variable against the Huazheng ESG rating. It offers a novel perspective for regulatory agencies and investors, serving as a proactive monitoring tool that complements traditional methods. Empirical tests show that green implementation can significantly boost the asset return rate of companies, but there is heterogeneity in scale. Small and medium-sized companies have limited contribution to asset return via green implementation, so there is a stronger motivation for green-washing.
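The abstract says only that GreenImplement is a ratio derived from the two layers' output. One plausible reading, sketched below, is the share of first-layer green keywords that the second layer judges to be actually implemented. The direction of the ratio, the function name, and the use of keyword sets are assumptions for illustration, not the authors' definition.

```python
from typing import Iterable, Optional


def green_implement(
    identified_keywords: Iterable[str],
    implemented_keywords: Iterable[str],
) -> Optional[float]:
    """Share of identified green keywords that were judged to be implemented.

    identified_keywords: hypothetical output of the first (keyword-spotting) layer.
    implemented_keywords: hypothetical output of the second (implementation-check) layer.
    Returns None when the statement contains no green keywords at all.
    """
    identified = set(identified_keywords)
    if not identified:
        return None
    # Only count implemented keywords that the first layer actually identified.
    implemented = set(implemented_keywords) & identified
    return len(implemented) / len(identified)
```

Under this reading, a statement that mentions many green terms but implements few of them scores close to 0, which is the pattern a green-washing monitor would flag.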

FMNV: A Dataset of Media-Published News Videos for Fake News Detection

Yihao Wang,Zhong Qian,Peifeng Li

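Task: Build FMNV, a dataset composed exclusively of news videos published by media organizations, and propose a baseline model, FMNVD, for multimodal fake news detection.

Motivation: Existing datasets consist mainly of user-generated videos, while professionally produced fake news videos pose greater societal harm yet remain under-studied.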

Details

Method: Analyze existing datasets and the curated FMNV collection to categorize fake news videos into four types, and use large language models (LLMs) to generate deceptive content; propose FMNVD, a dual-stream model that combines CLIP and Faster R-CNN for video feature extraction and refines features with co-attention. Result: Experiments show that FMNV generalizes across multiple baselines and that FMNVD detects fake news more effectively than other methods. Conclusion: The work provides a critical benchmark for detecting high-impact fake news in media ecosystems and advances methods for cross-modal inconsistency analysis. Abstract: News media, particularly video-based platforms, have become deeply embedded in daily life, concurrently amplifying risks of misinformation dissemination. Consequently, multimodal fake news detection has garnered significant research attention. However, existing datasets predominantly comprise user-generated videos characterized by crude editing and limited public engagement, whereas professionally crafted fake news videos disseminated by media outlets, often politically or virally motivated, pose substantially greater societal harm. To address this gap, we construct FMNV, a novel dataset exclusively composed of news videos published by media organizations. Through empirical analysis of existing datasets and our curated collection, we categorize fake news videos into four distinct types. Building upon this taxonomy, we employ Large Language Models (LLMs) to automatically generate deceptive content by manipulating authentic media-published news videos. Furthermore, we propose FMNVD, a baseline model featuring a dual-stream architecture integrating CLIP and Faster R-CNN for video feature extraction, enhanced by co-attention mechanisms for feature refinement and multimodal aggregation. Comparative experiments demonstrate both the generalization capability of FMNV across multiple baselines and the superior detection efficacy of FMNVD. This work establishes critical benchmarks for detecting high-impact fake news in media ecosystems while advancing methodologies for cross-modal inconsistency analysis.

Automated Construction of a Knowledge Graph of Nuclear Fusion Energy for Effective Elicitation and Retrieval of Information

A. Loreti,K. Chen,R. George,R. Firth,A. Agnello,S. Tanaka

Task: Develop a multi-step approach to automated knowledge graph construction for structuring and representing domain-specific knowledge from large document corpora.

Motivation: Nuclear fusion energy is a field of vast scope and high heterogeneity, which makes it an ideal benchmark for testing the key features of the method.

Details

Method: Pre-trained large language models are used for automatic named entity recognition and entity resolution, and a knowledge-graph-based retrieval-augmented generation system is developed. Result: The method handles complex multi-hop questions and provides contextually relevant answers to natural-language queries. Conclusion: The approach was used to build the first knowledge graph of nuclear fusion energy, and its effectiveness was validated. Abstract: In this document, we discuss a multi-step approach to automated construction of a knowledge graph, for structuring and representing domain-specific knowledge from large document corpora. We apply our method to build the first knowledge graph of nuclear fusion energy, a highly specialized field characterized by vast scope and heterogeneity. This is an ideal benchmark to test the key features of our pipeline, including automatic named entity recognition and entity resolution. We show how pre-trained large language models can be used to address these challenges and we evaluate their performance against Zipf's law, which characterizes human-generated natural language. Additionally, we develop a knowledge-graph retrieval-augmented generation system that combines large language models with a multi-prompt approach. This system provides contextually relevant answers to natural-language queries, including complex multi-hop questions that require reasoning across interconnected entities.
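
The abstract mentions evaluating extracted entities against Zipf's law but does not say how the comparison is carried out. One common way to check Zipfian behaviour is to fit the slope of log-frequency against log-rank. The sketch below assumes that approach; the function name `zipf_exponent` and the toy entity list are hypothetical, not taken from the paper.

```python
import math
from collections import Counter


def zipf_exponent(tokens):
    """Estimate the Zipf exponent s via a least-squares fit in log-log space.

    Zipf's law states that frequency(rank) is roughly proportional to rank ** (-s).
    """
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(freq) for freq in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    # The fitted slope is negative for a decaying distribution; s is its negation.
    return -cov / var


# Toy data (hypothetical): the entity at rank r occurs 60 / r times.
toy_entities = []
for rank in range(1, 7):
    toy_entities += [f"e{rank}"] * (60 // rank)

exponent = zipf_exponent(toy_entities)
```

An exponent close to 1 indicates a distribution similar to natural language; a large deviation could hint that the recognition or resolution step is over-merging or over-splitting entities.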

Zehong Ma,Hao Chen,Wei Zeng,Limin Su,Shiliang Zhang

Task: Address text ambiguity in fine-grained text-to-image retrieval through a multi-modal reference learning framework.

Motivation: Existing methods assume that the textual descriptions of training images are accurate, but descriptions can be ambiguous and miss discriminative visual details, which leads to inaccurate representation learning.

Details

Method: A multi-modal reference learning framework is proposed, comprising a multi-modal reference construction module, a reference-guided representation learning module, and a reference-based refinement method. Result: The method performs strongly on five fine-grained text-to-image retrieval datasets; for example, it reaches a Rank1 accuracy of 56.2% on RSTPReid, surpassing CFine by 5.6%. Conclusion: The multi-modal reference learning framework effectively mitigates text ambiguity and improves fine-grained text-to-image retrieval. Abstract: Fine-grained text-to-image retrieval aims to retrieve a fine-grained target image with a given text query. Existing methods typically assume that each training image is accurately depicted by its textual descriptions. However, textual descriptions can be ambiguous and fail to depict discriminative visual details in images, leading to inaccurate representation learning. To alleviate the effects of text ambiguity, we propose a Multi-Modal Reference learning framework to learn robust representations. We first propose a multi-modal reference construction module to aggregate all visual and textual details of the same object into a comprehensive multi-modal reference. The multi-modal reference hence facilitates the subsequent representation learning and retrieval similarity computation. Specifically, a reference-guided representation learning module is proposed to use multi-modal references to learn more accurate visual and textual representations. Additionally, we introduce a reference-based refinement method that employs the object references to compute a reference-based similarity that refines the initial retrieval results. Extensive experiments are conducted on five fine-grained text-to-image retrieval datasets for different text-to-image retrieval tasks. The proposed method has achieved superior performance over state-of-the-art methods. For instance, on the text-to-person image retrieval dataset RSTPReid, our method achieves the Rank1 accuracy of 56.2%, surpassing the recent CFine by 5.6%.
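
To make the Rank1 figure and the refinement step concrete, the following sketch computes Rank-k accuracy from a query-by-gallery similarity matrix and then re-scores it with a second, reference-based similarity. The weighted-sum fusion in `refine_similarity` and its `weight` parameter are assumptions for illustration; the abstract does not specify how the two similarities are combined.

```python
def rank_k_accuracy(similarity, gallery_ids, query_ids, k=1):
    """Fraction of queries whose true identity appears among the top-k gallery items."""
    hits = 0
    for row, target in zip(similarity, query_ids):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += target in [gallery_ids[j] for j in ranked[:k]]
    return hits / len(query_ids)


def refine_similarity(initial, reference, weight=0.5):
    """Hypothetical fusion rule: convex combination of the two similarity matrices."""
    return [
        [(1 - weight) * a + weight * b for a, b in zip(row_i, row_r)]
        for row_i, row_r in zip(initial, reference)
    ]


gallery_ids = ["a", "b", "c"]
query_ids = ["a", "c", "a"]
initial_sim = [[0.9, 0.2, 0.1], [0.3, 0.4, 0.8], [0.5, 0.6, 0.1]]
reference_sim = [[0.9, 0.1, 0.1], [0.2, 0.3, 0.9], [0.9, 0.2, 0.1]]

rank1_before = rank_k_accuracy(initial_sim, gallery_ids, query_ids, k=1)
rank1_after = rank_k_accuracy(
    refine_similarity(initial_sim, reference_sim), gallery_ids, query_ids, k=1
)
```

In this toy case the third query is initially matched to the wrong identity, and the reference-based score corrects it.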

NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark

Vladislav Mikhailov,Tita Enstad,David Samuel,Hans Christian Farsethås,Andrey Kutuzov,Erik Velldal,Lilja Øvrelid

Task: Introduce NorEval, a new evaluation suite for large-scale standardized benchmarking of Norwegian generative language models.

Motivation: Existing Norwegian benchmarks have limited coverage; NorEval aims to fill this gap with a more comprehensive evaluation.

Details

Method: NorEval comprises 24 high-quality human-created datasets, five of them newly created, covering a broad range of task categories for Norwegian language understanding and generation, and is integrated into LM Evaluation Harness. Result: Nineteen open-source pre-trained and instruction-tuned LMs for Norwegian were benchmarked in various scenarios. Conclusion: NorEval, its evaluation framework, and its annotation materials are publicly available, providing tooling for standardized evaluation of Norwegian LMs. Abstract: This paper introduces NorEval, a new and comprehensive evaluation suite for large-scale standardized benchmarking of Norwegian generative language models (LMs). NorEval consists of 24 high-quality human-created datasets, of which five are created from scratch. In contrast to existing benchmarks for Norwegian, NorEval covers a broad spectrum of task categories targeting Norwegian language understanding and generation, establishes human baselines, and focuses on both of the official written standards of the Norwegian language: Bokmål and Nynorsk. All our datasets and a collection of over 100 human-written prompts are integrated into LM Evaluation Harness, ensuring flexible and reproducible evaluation. We describe the NorEval design and present the results of benchmarking 19 open-source pre-trained and instruction-tuned LMs for Norwegian in various scenarios. Our benchmark, evaluation framework, and annotation materials are publicly available.

Benchmarking Multi-Organ Segmentation Tools for Multi-Parametric T1-weighted Abdominal MRI

Nicole Tran,Anisa Prasad,Yan Zhuang,Tejas Sudharshan Mathai,Boah Kim,Sydney Lewis,Pritam Mukherjee,Jianfei Liu,Ronald M. Summers

Task: Quantify the multi-organ segmentation performance of three publicly available tools (MRSeg, TS, and VIBE) on specific MRI sequence types.

Motivation: Multi-organ segmentation is critical in MRI studies, but the performance of existing tools on specific MRI sequence types has not been quantified.

Details

Method: Ten abdominal structures were manually annotated in 40 MRI volumes from the Duke Liver Dataset, and the three tools were evaluated on them. Result: MRSeg performed best, with a mean Dice score of 80.7 ± 18.6 and a Hausdorff Distance error of 8.9 ± 10.4 mm. Conclusion: MRSeg significantly outperforms TS and VIBE across the different MRI sequence types. Abstract: The segmentation of multiple organs in multi-parametric MRI studies is critical for many applications in radiology, such as correlating imaging biomarkers with disease status (e.g., cirrhosis, diabetes). Recently, three publicly available tools, namely MRSegmentator (MRSeg), TotalSegmentator MRI (TS), and TotalVibeSegmentator (VIBE), have been proposed for multi-organ segmentation in MRI. However, the performance of these tools on specific MRI sequence types has not yet been quantified. In this work, a subset of 40 volumes from the public Duke Liver Dataset was curated. The curated dataset contained 10 volumes each from the pre-contrast fat saturated T1, arterial T1w, venous T1w, and delayed T1w phases, respectively. Ten abdominal structures were manually annotated in these volumes. Next, the performance of the three public tools was benchmarked on this curated dataset. The results indicated that MRSeg obtained a Dice score of 80.7 ± 18.6 and Hausdorff Distance (HD) error of 8.9 ± 10.4 mm. It fared the best (p < .05) across the different sequence types in contrast to TS and VIBE.
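
For readers unfamiliar with the two reported metrics, here is a minimal sketch of Dice overlap and the symmetric Hausdorff distance on small sets of voxel coordinates. It works in voxel units on toy 2D masks; the paper reports Dice as a percentage and HD in millimetres, which would additionally require scaling by voxel spacing. The helper names are illustrative.

```python
import math


def dice_score(pred, truth):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two sets of voxel coordinates."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0  # two empty masks agree perfectly by convention
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance: the worst-case nearest-neighbour gap."""
    def directed(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(points_a, points_b), directed(points_b, points_a))


# Toy masks: two 2x2 squares shifted by one voxel along the second axis.
pred_mask = {(0, 0), (0, 1), (1, 0), (1, 1)}
true_mask = {(0, 1), (1, 1), (0, 2), (1, 2)}

dice = dice_score(pred_mask, true_mask)
hd = hausdorff_distance(pred_mask, true_mask)
```

Dice rewards volumetric overlap, while Hausdorff distance penalizes the single worst outlier, which is one reason the two metrics are usually reported together.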

Efficient Tuning of Large Language Models for Knowledge-Grounded Dialogue Generation

Bo Zhang,Hui Ma,Dailin Li,Jian Ding,Jian Wang,Bo Xu,HongFei Lin

Task: Propose KEDiT, a method for fine-tuning large language models (LLMs) for knowledge-grounded dialogue generation.

Motivation: Address the inability of LLMs to use up-to-date or domain-specific knowledge.

Details

Method: KEDiT compresses retrieved knowledge with an information bottleneck and integrates it into the LLM through a lightweight knowledge-aware adapter. Result: On the Wizard of Wikipedia and PubMed-Dialog datasets, KEDiT excels at generating contextually relevant and informative responses. Conclusion: KEDiT combines the strengths of pretrained LLMs with the adaptability of dynamic knowledge integration, offering a scalable solution for fields such as medicine. Abstract: Large language models (LLMs) demonstrate remarkable text comprehension and generation capabilities but often lack the ability to utilize up-to-date or domain-specific knowledge not included in their training data. To address this gap, we introduce KEDiT, an efficient method for fine-tuning LLMs for knowledge-grounded dialogue generation. KEDiT operates in two main phases: first, it employs an information bottleneck to compress retrieved knowledge into learnable parameters, retaining essential information while minimizing computational overhead. Second, a lightweight knowledge-aware adapter integrates these compressed knowledge vectors into the LLM during fine-tuning, updating less than 2% of the model parameters. The experimental results on the Wizard of Wikipedia and a newly constructed PubMed-Dialog dataset demonstrate that KEDiT excels in generating contextually relevant and informative responses, outperforming competitive baselines in automatic, LLM-based, and human evaluations. This approach effectively combines the strengths of pretrained LLMs with the adaptability needed for incorporating dynamic knowledge, presenting a scalable solution for fields such as medicine.

MMLA: Multi-Environment, Multi-Species, Low-Altitude Aerial Footage Dataset

Jenna Kline,Samuel Stevens,Guy Maalouf,Camille Rondeau Saint-Jean,Dat Nguyen Ngoc,Majid Mirmehdi,David Guerin,Tilo Burghardt,Elzbieta Pastucha,Blair Costelloe,Matthew Watson,Thomas Richardson,Ulrik Pagh Schultz Lundquist

Task: Present and evaluate MMLA, a multi-environment, multi-species, low-altitude drone footage dataset for real-time wildlife detection.

Motivation: Fill the gap in evaluating computer vision models on low-altitude drone imagery and in studying their generalizability across species and environments.

Details

Method: Drone footage covering five species was collected in three different environments, and three YOLO models (YOLOv5m, YOLOv8m, and YOLOv11m) were evaluated on it. Result: Detection performance differs significantly across locations and species. Conclusion: The work highlights the importance of evaluating detection algorithms across different environments for drone-based wildlife monitoring. Abstract: Real-time wildlife detection in drone imagery is critical for numerous applications, including animal ecology, conservation, and biodiversity monitoring. Low-altitude drone missions are effective for collecting fine-grained animal movement and behavior data, particularly if missions are automated for increased speed and consistency. However, little work exists on evaluating computer vision models on low-altitude aerial imagery and generalizability across different species and settings. To fill this gap, we present a novel multi-environment, multi-species, low-altitude aerial footage (MMLA) dataset. MMLA consists of drone footage collected across three diverse environments: Ol Pejeta Conservancy and Mpala Research Centre in Kenya, and The Wilds Conservation Center in Ohio, which includes five species: Plains zebras, Grevy's zebras, giraffes, onagers, and African Painted Dogs. We comprehensively evaluate three YOLO models (YOLOv5m, YOLOv8m, and YOLOv11m) for detecting animals. Results demonstrate significant performance disparities across locations and species-specific detection variations. Our work highlights the importance of evaluating detection algorithms across different environments for robust wildlife monitoring applications using drones.

Plan-and-Refine: Diverse and Comprehensive Retrieval-Augmented Generation

Alireza Salemi,Chris Samarinas,Hamed Zamani

Task: Study the limitations of retrieval-augmented large language models (LLMs) in generating diverse and comprehensive responses, and propose the Plan-and-Refine (P&R) framework, which is based on a two-phase system design.

Motivation: Address the lack of diversity and comprehensiveness in LLM-generated responses.

Details

Method: P&R consists of a global exploration phase, which generates diverse plans, and a local exploitation phase, which generates response proposals and iteratively refines them; a reward model then selects the best proposal. Result: P&R significantly outperforms baselines, with improvements of up to 13.1% on ANTIQUE and 15.41% on TREC; a user study also confirms its effectiveness. Conclusion: P&R effectively improves the diversity and comprehensiveness of LLM-generated responses. Abstract: This paper studies the limitations of (retrieval-augmented) large language models (LLMs) in generating diverse and comprehensive responses, and introduces the Plan-and-Refine (P&R) framework based on a two-phase system design. In the global exploration phase, P&R generates a diverse set of plans for the given input, where each plan consists of a list of diverse query aspects with corresponding additional descriptions. This phase is followed by a local exploitation phase that generates a response proposal for the input query conditioned on each plan and iteratively refines the proposal for improving the proposal quality. Finally, a reward model is employed to select the proposal with the highest factuality and coverage. We conduct our experiments based on the ICAT evaluation methodology, a recent approach for answer factuality and comprehensiveness evaluation. Experiments on the two diverse information seeking benchmarks adopted from non-factoid question answering and TREC search result diversification tasks demonstrate that P&R significantly outperforms baselines, achieving up to a 13.1% improvement on the ANTIQUE dataset and a 15.41% improvement on the TREC dataset. Furthermore, a smaller scale user study confirms the substantial efficacy of the P&R framework.
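
The control flow described in the abstract (plan, propose, refine, select) can be sketched as a small higher-order function. Everything below is a schematic under assumptions: the callables `generate_plans`, `propose`, `refine`, and `reward` stand in for LLM and reward-model calls, and the toy implementations exist only to make the example runnable. They are not the authors' components.

```python
def plan_and_refine(query, generate_plans, propose, refine, reward,
                    num_plans=3, refine_steps=2):
    """Global exploration over plans, local refinement per plan, then reward-based selection."""
    proposals = []
    for plan in generate_plans(query, num_plans):      # global exploration
        proposal = propose(query, plan)                # one proposal per plan
        for _ in range(refine_steps):                  # local exploitation
            proposal = refine(query, plan, proposal)
        proposals.append(proposal)
    return max(proposals, key=reward)                  # reward model picks the winner


# Toy stand-ins (hypothetical) for the LLM and the reward model.
def toy_plans(query, n):
    pool = [["history"], ["history", "cost"], ["history", "cost", "safety"]]
    return pool[:n]


def toy_propose(query, plan):
    return f"{query}: " + ", ".join(plan)


def toy_refine(query, plan, proposal):
    return proposal if proposal.endswith(".") else proposal + "."


def toy_reward(proposal):
    return len(proposal.split(","))  # crude proxy for aspect coverage


best = plan_and_refine("solar power", toy_plans, toy_propose, toy_refine, toy_reward)
```

Passing the model calls in as functions keeps the selection logic testable without any model access.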

SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding

Yangliu Hu,Zikai Song,Na Feng,Yawei Luo,Junqing Yu,Yi-Ping Phoebe Chen,Wei Yang

Task: Improve the fine-grained video understanding of video large language models (Video-LLMs) through Self-Supervised Fragment Fine-Tuning (SF$^2$T).

Motivation: Existing Video-LLMs describe videos well at the holistic level but fall short on fine-grained understanding, such as visual dynamics and video details.

Details

Method: SF$^2$T exploits the inherent characteristics of videos for self-supervised fine-tuning, avoiding manual annotation and the limitations of natural language; the FineVidBench benchmark dataset is also built for multi-level evaluation. Result: Experiments show that SF$^2$T improves the models' ability to capture and interpret spatiotemporal details. Conclusion: SF$^2$T is an efficient, annotation-free fine-tuning method that improves the fine-grained video understanding of Video-LLMs. Abstract: Video-based Large Language Models (Video-LLMs) have witnessed substantial advancements in recent years, propelled by the advancement in multi-modal LLMs. Although these models have demonstrated proficiency in providing the overall description of videos, they struggle with fine-grained understanding, particularly in aspects such as visual dynamics and video details inquiries. To tackle these shortcomings, we find that fine-tuning Video-LLMs on self-supervised fragment tasks greatly improves their fine-grained video understanding abilities. Hence, we propose two key contributions: (1) Self-Supervised Fragment Fine-Tuning (SF$^2$T), a novel effortless fine-tuning method, employs the rich inherent characteristics of videos for training, while unlocking more fine-grained understanding ability of Video-LLMs. Moreover, it relieves researchers from labor-intensive annotations and smartly circumvents the limitations of natural language, which often fails to capture the complex spatiotemporal variations in videos; (2) A novel benchmark dataset, namely FineVidBench, for rigorously assessing Video-LLMs' performance at both the scene and fragment levels, offering a comprehensive evaluation of their capabilities. We assessed multiple models and validated the effectiveness of SF$^2$T on them. Experimental results reveal that our approach improves their ability to capture and interpret spatiotemporal details.

A System for Comprehensive Assessment of RAG Frameworks

Mattia Rengo,Senad Beadini,Domenico Alfano,Roberto Abbruzzese

Task: Propose SCARF, a framework for comprehensively evaluating the performance of RAG systems.

Motivation: Existing evaluation frameworks cannot comprehensively assess how RAG systems perform in real-world deployments.

Details

Method: SCARF is designed as a modular, flexible framework that supports an end-to-end, black-box evaluation methodology. Result: SCARF can systematically evaluate different RAG frameworks and produce a detailed performance report. Conclusion: SCARF gives researchers and industry professionals a scalable and adaptable solution for evaluating RAG applications. Abstract: Retrieval Augmented Generation (RAG) has emerged as a standard paradigm for enhancing the factual accuracy and contextual relevance of Large Language Models (LLMs) by integrating retrieval mechanisms. However, existing evaluation frameworks fail to provide a holistic black-box approach to assessing RAG systems, especially in real-world deployment scenarios. To address this gap, we introduce SCARF (System for Comprehensive Assessment of RAG Frameworks), a modular and flexible evaluation framework designed to benchmark deployed RAG applications systematically. SCARF provides an end-to-end, black-box evaluation methodology, enabling a limited-effort comparison across diverse RAG frameworks. Our framework supports multiple deployment configurations and facilitates automated testing across vector databases and LLM serving strategies, producing a detailed performance report. Moreover, SCARF integrates practical considerations such as response coherence, providing a scalable and adaptable solution for researchers and industry professionals evaluating RAG applications. Using the REST API interface, we demonstrate how SCARF can be applied to real-world scenarios, showcasing its flexibility in assessing different RAG frameworks and configurations. SCARF is available in a GitHub repository.

PIDSR: Complementary Polarized Image Demosaicing and Super-Resolution

Shuangfan Zhou,Chu Zhou,Youwei Lyu,Heng Guo,Zhanyu Ma,Boxin Shi,Imari Sato

Task: Propose PIDSR, a joint framework that performs polarized image demosaicing and super-resolution simultaneously, directly from CPFA raw images, to obtain high-quality high-resolution polarized images.

Motivation: Existing polarized image demosaicing methods cannot enhance resolution, while polarized image super-resolution methods tend to retain or amplify errors introduced by demosaicing, making polarization parameters such as DoP and AoP inaccurate.

Details

Method: The PIDSR framework jointly performs polarized image demosaicing and super-resolution, generating high-quality high-resolution polarized images directly from CPFA raw images. Result: Experiments show that PIDSR achieves state-of-the-art performance on both synthetic and real data and benefits downstream tasks. Conclusion: PIDSR can generate high-quality high-resolution polarized images with more accurate polarization parameters directly from CPFA raw images. Abstract: Polarization cameras can capture multiple polarized images with different polarizer angles in a single shot, bringing convenience to polarization-based downstream tasks. However, their direct outputs are color-polarization filter array (CPFA) raw images, requiring demosaicing to reconstruct full-resolution, full-color polarized images; unfortunately, this necessary step introduces artifacts that make polarization-related parameters such as the degree of polarization (DoP) and angle of polarization (AoP) prone to error. Besides, limited by the hardware design, the resolution of a polarization camera is often much lower than that of a conventional RGB camera. Existing polarized image demosaicing (PID) methods are limited in that they cannot enhance resolution, while polarized image super-resolution (PISR) methods, though designed to obtain high-resolution (HR) polarized images from the demosaicing results, tend to retain or even amplify errors in the DoP and AoP introduced by demosaicing artifacts. In this paper, we propose PIDSR, a joint framework that performs complementary Polarized Image Demosaicing and Super-Resolution, showing the ability to robustly obtain high-quality HR polarized images with more accurate DoP and AoP from a CPFA raw image in a direct manner. Experiments show our PIDSR not only achieves state-of-the-art performance on both synthetic and real data, but also facilitates downstream tasks.

Cluster-Driven Expert Pruning for Mixture-of-Experts Large Language Models

Hongcheng Guo,Juntao Yao,Boyang Wang,Junjia Du,Shaosheng Cao,Donglin Di,Shun Zhang,Zhoujun Li

Task: Propose C-Prune, a two-stage framework for adaptive, task-specific compression of MoE LLMs.

Motivation: Address expert homogeneity and inter-layer similarity in MoE models to enable more efficient deployment.

Details

Method: Layer-wise expert clustering is followed by global cluster pruning, using parameter similarity metrics and a unified importance scoring mechanism. Result: C-Prune effectively reduces model size across multiple MoE models and benchmarks and outperforms existing pruning methods. Conclusion: C-Prune provides an efficient, task-adaptive solution for compressing MoE models. Abstract: Mixture-of-Experts (MoE) architectures have emerged as a promising paradigm for scaling large language models (LLMs) with sparse activation of task-specific experts. Despite their computational efficiency during inference, the massive overall parameter footprint of MoE models (e.g., GPT-4) introduces critical challenges for practical deployment. Current pruning approaches often fail to address two inherent characteristics of MoE systems: 1) intra-layer expert homogeneity, where experts within the same MoE layer exhibit functional redundancy, and 2) inter-layer similarity patterns, where deeper layers tend to contain progressively more homogeneous experts. To tackle these issues, we propose Cluster-driven Expert Pruning (C-Prune), a novel two-stage framework for adaptive task-specific compression of MoE LLMs. C-Prune operates through layer-wise expert clustering, which groups functionally similar experts within each MoE layer using parameter similarity metrics, followed by global cluster pruning, which eliminates redundant clusters across all layers through a unified importance scoring mechanism that accounts for cross-layer homogeneity. We validate C-Prune through extensive experiments on multiple MoE models and benchmarks. The results demonstrate that C-Prune effectively reduces model size while outperforming existing MoE pruning methods.
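
As a rough illustration of the layer-wise clustering idea only, the sketch below groups experts in a single layer by cosine similarity of their flattened weights and keeps the highest-scoring expert of each group. This is a deliberately simplified stand-in: the greedy clustering rule, the 0.95 threshold, and the per-expert `importance` scores are assumptions, and the paper's global cross-layer cluster pruning is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two flattened weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_experts(experts, threshold=0.95):
    """Greedy single pass: an expert joins the first cluster whose first member is similar enough."""
    clusters = []
    for idx, weights in enumerate(experts):
        for cluster in clusters:
            if cosine(experts[cluster[0]], weights) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])  # no similar cluster found, start a new one
    return clusters


def prune_layer(experts, importance, threshold=0.95):
    """Keep only the most important expert from each cluster (indices, sorted)."""
    clusters = cluster_experts(experts, threshold)
    return sorted(max(cluster, key=lambda i: importance[i]) for cluster in clusters)


# Toy layer with four experts: 0 and 1 are near-duplicates, as are 2 and 3.
layer_experts = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
layer_importance = [0.2, 0.9, 0.5, 0.4]

clusters = cluster_experts(layer_experts)
kept = prune_layer(layer_experts, layer_importance)
```

Here half of the experts are dropped while one representative of each functional group survives; a real system would also need to adjust the router for the removed experts.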

Exploring a Patch-Wise Approach for Privacy-Preserving Fake ID Detection

Javier Muñoz-Haro,Ruben Tolosana,Ruben Vera-Rodriguez,Aythami Morales,Julian Fierrez

Task: Study how to detect fake ID documents while preserving privacy.

Motivation: The field lacks publicly available data from real ID documents, and most studies rely on private databases, which limits research progress.

Details

Method: A patch-wise privacy-preserving approach is proposed that explores two levels of anonymization and different patch size configurations, combined with Vision Transformers and Foundation Models. Result: On the DLC-2021 database, the proposed approach achieves EERs of 13.91% at patch level and 0% at ID document level, showing good generalization. Conclusion: Besides proposing an effective privacy-preserving detection method, the study releases the first public database containing patches from both real and fake ID documents. Abstract: In an increasingly digitalized world, verifying the authenticity of ID documents has become a critical challenge for real-life applications such as digital banking, crypto-exchanges, renting, etc. This study focuses on the topic of fake ID detection, covering several limitations in the field. In particular, no publicly available data from real ID documents exists, and most studies rely on proprietary in-house databases that are not available due to privacy reasons. In order to shed some light on this critical challenge that makes it difficult to advance in the field, we explore a trade-off between privacy (i.e., amount of sensitive data available) and performance, proposing a novel patch-wise approach for privacy-preserving fake ID detection. Our proposed approach explores how privacy can be enhanced through: i) two levels of anonymization for an ID document (i.e., fully- and pseudo-anonymized), and ii) different patch size configurations, varying the amount of sensitive data visible in the patch image. Also, state-of-the-art methods such as Vision Transformers and Foundation Models are considered in the analysis. The experimental framework shows that, on an unseen database (DLC-2021), our proposal achieves 13.91% and 0% EERs at patch and ID document level, showing a good generalization to other databases. In addition to this exploration, another key contribution of our study is the release of the first publicly available database that contains 48,400 patches from both real and fake ID documents, along with the experimental framework and models, which will be available in our GitHub.
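
The Equal Error Rate (EER) quoted above is the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of how it can be estimated from two lists of scores follows; it assumes that higher scores mean "more likely a real document" and sweeps only the observed scores as thresholds, which is a common approximation and not necessarily the paper's exact procedure.

```python
def equal_error_rate(real_scores, fake_scores):
    """Approximate EER by sweeping every observed score as a decision threshold.

    A sample is accepted as real when its score is >= the threshold.
    """
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(real_scores) | set(fake_scores)):
        # False rejection rate: real documents scored below the threshold.
        frr = sum(s < threshold for s in real_scores) / len(real_scores)
        # False acceptance rate: fake documents scored at or above the threshold.
        far = sum(s >= threshold for s in fake_scores) / len(fake_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical): one real and one fake sample overlap.
eer_overlap = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
# Perfectly separated scores give an EER of zero.
eer_separated = equal_error_rate([0.9, 0.8], [0.1, 0.2])
```

A document-level EER of 0% alongside a patch-level EER of 13.91% is plausible when many patch scores are aggregated per document, since individual patch errors can average out.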

What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks

Pavel Chizhov,Mattia Nee,Pierre-Carl Langlais,Ivan P. Yamshchikov

Task: Analyze the construct validity problems of the HellaSwag benchmark as a measure of common-sense reasoning in language models.

Motivation: HellaSwag is a widely used benchmark for evaluating common-sense reasoning, yet it has severe construct validity issues that can lead to wrong decisions in model selection.

Details

Method: Evaluations with a range of generative language models expose the problems in HellaSwag, and a corrected subset, GoldenSwag, is proposed. Result: More than 65% of model predictions stay the same when the question content is removed, indicating that HellaSwag does not effectively measure common-sense reasoning. Conclusion: HellaSwag in its current state is unsuitable for evaluating common-sense reasoning; future benchmarks should meet stricter requirements, and GoldenSwag can serve as an alternative. Abstract: Common-sense reasoning is a key language model capability because it encapsulates not just specific factual knowledge but rather general language and world understanding. Measuring common-sense reasoning, therefore, is crucial for language models of different sizes and applications. One of the most widely used benchmarks for evaluating such capabilities is HellaSwag; however, in this paper, we show that it has severe construct validity issues. These issues range from basic ungrammaticality and numerous typos to misleading prompts or equally correct options. Furthermore, we show that if models are evaluated only on answer texts, or with "Lorem ipsum dolor..." instead of the question, more than 65% of model predictions remain the same, and this cannot be attributed merely to contamination. Since benchmark scores are an essential part of model selection in both research and commercial applications, these validity issues can have severe consequences. In particular, knowing that taking benchmark scores at face value is ubiquitous, inadequate evaluation leads to ill-informed decisions about models. In this paper, we thoroughly investigate critical validity issues posed by HellaSwag and illustrate them with various evaluations using generative language models of different sizes. We argue that this benchmark does not accurately measure common-sense reasoning and, therefore, should not be used for evaluation in its current state. Based on the results of our study, we propose requirements that should be met by future common-sense reasoning benchmarks. In addition, we release GoldenSwag, a corrected subset of HellaSwag, which, to our belief, facilitates acceptable common-sense reasoning evaluation.
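
The "more than 65% of predictions remain the same" finding is a prediction-agreement statistic between two evaluation runs: one with the full prompt and one with the question removed or replaced. A minimal sketch of that statistic is shown below; the prediction lists are made-up option indices, not data from the paper.

```python
def agreement_rate(full_prompt_preds, ablated_preds):
    """Share of items on which the model picks the same option in both runs."""
    if len(full_prompt_preds) != len(ablated_preds):
        raise ValueError("both runs must cover the same items")
    same = sum(a == b for a, b in zip(full_prompt_preds, ablated_preds))
    return same / len(full_prompt_preds)


# Hypothetical chosen-option indices for eight benchmark items.
with_question = [0, 2, 1, 3, 0, 1, 2, 3]
answers_only = [0, 2, 1, 0, 0, 1, 3, 3]

rate = agreement_rate(with_question, answers_only)
```

For a four-option task, agreement far above what independent choices would produce suggests that the answer options alone carry much of the signal.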

Towards Micro-Action Recognition with Limited Annotations: An Asynchronous Pseudo Labeling and Training Approach

Yan Zhang,Lechao Cheng,Yaxiong Wang,Zhun Zhong,Meng Wang

Task: Study semi-supervised micro-action recognition (SSMAR) and propose an Asynchronous Pseudo Labeling and Training (APLT) framework to improve classification accuracy.

Motivation: Traditional semi-supervised learning methods tend to overfit on inaccurate pseudo-labels in micro-action recognition, which degrades performance.

Details

Method: The APLT framework separates pseudo-label generation from model training, uses semi-supervised clustering and a self-adaptive thresholding strategy to produce more accurate pseudo-labels, and builds a memory-based prototype classifier. Result: On three MAR datasets, APLT clearly outperforms existing semi-supervised learning methods; for example, it improves accuracy by 14.5% over FixMatch on MA-12 with 50% labeled data. Conclusion: By alternating pseudo-labeling and training asynchronously, APLT mitigates the overfitting caused by inaccurate pseudo-labels and improves semi-supervised micro-action recognition. Abstract: Micro-Action Recognition (MAR) aims to classify subtle human actions in video. However, annotating MAR datasets is particularly challenging due to the subtlety of actions. To this end, we introduce the setting of Semi-Supervised MAR (SSMAR), where only a portion of the samples is labeled. We first evaluate traditional Semi-Supervised Learning (SSL) methods on SSMAR and find that these methods tend to overfit on inaccurate pseudo-labels, leading to error accumulation and degraded performance. This issue primarily arises from the common practice of directly using the predictions of the classifier as pseudo-labels to train the model. To solve this issue, we propose a novel framework, called Asynchronous Pseudo Labeling and Training (APLT), which explicitly separates the pseudo-labeling process from model training. Specifically, we introduce a semi-supervised clustering method during the offline pseudo-labeling phase to generate more accurate pseudo-labels. Moreover, a self-adaptive thresholding strategy is proposed to dynamically filter noisy labels of different classes. We then build a memory-based prototype classifier based on the filtered pseudo-labels, which is fixed and used to guide the subsequent model training phase. By alternating the two pseudo-labeling and model training phases in an asynchronous manner, the model can not only be learned with more accurate pseudo-labels but also avoid the overfitting issue. Experiments on three MAR datasets show that our APLT largely outperforms state-of-the-art SSL methods. For instance, APLT improves accuracy by 14.5% over FixMatch on the MA-12 dataset when using only 50% labeled data. Code will be publicly available.
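
The abstract does not give the formula behind the self-adaptive thresholding strategy. To show what class-wise adaptive filtering can look like in general, the sketch below scales a base threshold by each class's mean confidence relative to the most confident class. This scaling rule, the `base` value, and the class names are assumptions for illustration, not the APLT rule.

```python
from collections import defaultdict


def class_thresholds(predictions, base=0.9):
    """Per-class thresholds: classes with lower mean confidence get a lower bar.

    predictions is a list of (pseudo_label, confidence) pairs.
    """
    by_class = defaultdict(list)
    for label, conf in predictions:
        by_class[label].append(conf)
    means = {label: sum(c) / len(c) for label, c in by_class.items()}
    top = max(means.values())
    return {label: base * mean / top for label, mean in means.items()}


def filter_pseudo_labels(predictions, base=0.9):
    """Keep only the pseudo-labels that clear their own class threshold."""
    thresholds = class_thresholds(predictions, base)
    return [(label, conf) for label, conf in predictions if conf >= thresholds[label]]


# Toy predictions: "shrug" is a harder class with systematically lower confidence.
toy_predictions = [("nod", 0.95), ("nod", 0.85), ("shrug", 0.6), ("shrug", 0.4)]
kept_labels = filter_pseudo_labels(toy_predictions)
```

With a single global threshold of 0.9, the harder class would contribute no pseudo-labels at all; a per-class threshold keeps its most confident samples.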

MuSaRoNews: A Multidomain, Multimodal Satire Dataset from Romanian News Articles

Răzvan-Alexandru Smădu,Andreea Iuga,Dumitru-Clementin Cercel

Task: Build MuSaRoNews, a multimodal corpus for detecting satire in Romanian news articles.

Motivation: Satire and fake news have different purposes, but both can spread false information; text alone is not enough to detect the incongruity between surface and actual meaning, so other sources of information (e.g., visual) are needed.

Details

Method: A total of 117,834 public news articles were collected from real and satirical news sources to form the first multimodal satire detection corpus for Romanian, and experiments were run to validate the multimodal approach. Result: Experiments show that combining the textual and visual modalities improves satire detection performance. Conclusion: Multimodal methods have an advantage in satire detection, and MuSaRoNews is an important resource for Romanian-language research. Abstract: Satire and fake news can both contribute to the spread of false information, even though both have different purposes (one is for amusement, the other is to misinform). However, it is not enough to rely purely on text to detect the incongruity between the surface meaning and the actual meaning of the news articles, and, often, other sources of information (e.g., visual) provide an important clue for satire detection. This work introduces a multimodal corpus for satire detection in Romanian news articles named MuSaRoNews. Specifically, we gathered 117,834 public news articles from real and satirical news sources, composing the first multimodal corpus for satire detection in the Romanian language. We conducted experiments and showed that the use of both modalities improves performance.

Breaking the Barriers: Video Vision Transformers for Word-Level Sign Language Recognition

Alexander Brettmann,Jakob Grävinghoff,Marlene Rüschoff,Marie Westhues

Task: Propose a Video Vision Transformer (ViViT) based model for dynamic word-level American Sign Language (ASL) recognition.

Motivation: Reduce the communication barrier caused by limited sign language fluency among the hearing population, and improve recognition quality through automatic sign language recognition (SLR).

Details

Method: A Video Vision Transformer (ViViT) model is used, relying on self-attention to capture global relationships across spatial and temporal dimensions. Result: On the WLASL100 dataset, the VideoMAE model reaches a Top-1 accuracy of 75.58%, compared with 65.89% for traditional CNNs. Conclusion: Transformer-based architectures have great potential for sign language recognition and can help overcome communication barriers and promote the inclusion of deaf and hard-of-hearing individuals. Abstract: Sign language is a fundamental means of communication for the deaf and hard-of-hearing (DHH) community, enabling nuanced expression through gestures, facial expressions, and body movements. Despite its critical role in facilitating interaction within the DHH population, significant barriers persist due to the limited fluency in sign language among the hearing population. Overcoming this communication gap through automatic sign language recognition (SLR) remains a challenge, particularly at the dynamic word level, where temporal and spatial dependencies must be effectively recognized. While Convolutional Neural Networks have shown potential in SLR, they are computationally intensive and have difficulties in capturing global temporal dependencies between video sequences. To address these limitations, we propose a Video Vision Transformer (ViViT) model for word-level American Sign Language (ASL) recognition. Transformer models make use of self-attention mechanisms to effectively capture global relationships across spatial and temporal dimensions, which makes them suitable for complex gesture recognition tasks. The VideoMAE model achieves a Top-1 accuracy of 75.58% on the WLASL100 dataset, highlighting its strong performance compared to traditional CNNs with 65.89%. Our study demonstrates that transformer-based architectures have great potential to advance SLR, overcome communication barriers and promote the inclusion of DHH individuals.

Genglin Liu,Salman Rahman,Elisa Kreiss,Marzyeh Ghassemi,Saadia Gabriel

Task: Present MOSAIC, an open-source social network simulation framework in which generative language agents predict user behaviors, and use it to analyze the spread of misinformation.

Motivation: Better understand how users judge the veracity of online social content, and study the effects of content moderation strategies.

Details

Method: LLM agents are combined with a directed social graph, and diverse user personas are constructed for multi-agent simulation. Result: Three content moderation strategies are found not only to reduce the spread of misinformation but also to increase user engagement. Conclusion: The simulation software is open-sourced to encourage further research in AI and the social sciences. Abstract: We present a novel, open-source social network simulation framework, MOSAIC, where generative language agents predict user behaviors such as liking, sharing, and flagging content. This simulation combines LLM agents with a directed social graph to analyze emergent deception behaviors and gain a better understanding of how users determine the veracity of online social content. By constructing user representations from diverse fine-grained personas, our system enables multi-agent simulations that model content dissemination and engagement dynamics at scale. Within this framework, we evaluate three different content moderation strategies with simulated misinformation dissemination, and we find that they not only mitigate the spread of non-factual content but also increase user engagement. In addition, we analyze the trajectories of popular content in our simulations, and explore whether simulation agents' articulated reasoning for their social interactions truly aligns with their collective engagement patterns. We open-source our simulation software to encourage further research within AI and social sciences.

Nonlocal Retinex-Based Variational Model and its Deep Unfolding Twin for Low-Light Image Enhancement

Daniel Torres,Joan Duran,Julia Navarro,Catalina Sbert

Task: Propose a variational method based on the Retinex decomposition for low-light image enhancement.

Motivation: Images captured under low-light conditions are significantly limited in detail, contrast, and noise, which affects tasks such as image segmentation and object detection.

Details

Method: The variational method combines a color correction pre-processing step, a nonlocal gradient-type fidelity term, and an automatic gamma correction module, and is extended into a deep unfolding model. Result: Experiments show that both methods compare favorably with existing techniques visually and in quality metrics; notably, the variational model outperforms most deep learning approaches without relying on learning strategies. Conclusion: The proposed variational method and its deep unfolding counterpart offer clear advantages for low-light image enhancement. Abstract: Images captured under low-light conditions present significant limitations in many applications, as poor lighting can obscure details, reduce contrast, and hide noise. Removing the illumination effects and enhancing the quality of such images is crucial for many tasks, such as image segmentation and object detection. In this paper, we propose a variational method for low-light image enhancement based on the Retinex decomposition into illumination, reflectance, and noise components. A color correction pre-processing step is applied to the low-light image, which is then used as the observed input in the decomposition. Moreover, our model integrates a novel nonlocal gradient-type fidelity term designed to preserve structural details. Additionally, we propose an automatic gamma correction module. Building on the proposed variational approach, we extend the model by introducing its deep unfolding counterpart, in which the proximal operators are replaced with learnable networks. We propose cross-attention mechanisms to capture long-range dependencies in both the nonlocal prior of the reflectance and the nonlocal gradient-based constraint. Experimental results demonstrate that both methods compare favorably with several recent and state-of-the-art techniques across different datasets. In particular, despite not relying on learning strategies, the variational model outperforms most deep learning approaches both visually and in terms of quality metrics.
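
The abstract names an automatic gamma correction module without describing it. As general background, gamma correction maps a normalized intensity p to p ** gamma, and one simple heuristic picks gamma so that the mean intensity lands on a target value. The sketch below implements that heuristic on a flat list of illumination values in (0, 1]; the target of 0.5 and the function names are assumptions, and this is not the paper's module.

```python
import math


def auto_gamma(pixels, target_mean=0.5):
    """Choose gamma such that mean(pixels) ** gamma equals target_mean."""
    mean = sum(pixels) / len(pixels)
    mean = min(max(mean, 1e-6), 1 - 1e-6)  # keep the logarithm finite and non-zero
    return math.log(target_mean) / math.log(mean)


def apply_gamma(pixels, gamma):
    """Apply p -> p ** gamma to intensities normalized to [0, 1]."""
    return [p ** gamma for p in pixels]


# Toy dark illumination map (hypothetical values).
dark = [0.05, 0.1, 0.15, 0.3]
gamma = auto_gamma(dark)
brightened = apply_gamma(dark, gamma)
```

For a dark input the estimated gamma falls below 1, so every intensity is lifted while the ordering of pixel values is preserved. In a Retinex setting such a correction would typically be applied to the estimated illumination before recombining it with the reflectance.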

The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models

Michael J Bommarito II,Jillian Bommarito,Daniel Martin Katz

Task: Build a training data pipeline for large language models that minimizes the risk of copyright infringement and breach of contract.

Motivation: Address the legal uncertainty that copyright and contract issues create around the pre-training data of existing large language models.

Details

Method: A data pipeline that strictly follows a copyright and licensing protocol is built by verifying and integrating over 132 million documents and trillions of tokens from 16 different sources. Result: The complete pipeline is released, including source code, original documents, standardized content, pre-tokenized representations, and a variety of training resources, all freely available to the public. Conclusion: The KL3M Data Project offers a more ethical, legal, and sustainable approach to the development and use of AI models. Abstract: Practically all large language models have been pre-trained on data that is subject to global uncertainty related to copyright infringement and breach of contract. This creates potential risk for users and developers due to this uncertain legal status. The KL3M Data Project directly confronts this critical issue by introducing the largest comprehensive training data pipeline that minimizes risks related to copyright or breach of contract. The foundation of this project is a corpus of over 132 million documents and trillions of tokens spanning 16 different sources that have been verified to meet the strict copyright and licensing protocol detailed herein. We are releasing the entire pipeline, including 1) the source code to acquire and process these documents, 2) the original document formats with associated provenance and metadata, 3) extracted content in a standardized format, 4) pre-tokenized representations of the documents, and 5) various mid- and post-train resources such as question-answer, summarization, conversion, drafting, classification, prediction, and conversational data. All of these resources are freely available to the public on S3, Hugging Face, and GitHub under CC-BY terms. We are committed to continuing this project in furtherance of a more ethical, legal, and sustainable approach to the development and use of AI models.

P2Object: Single Point Supervised Object Detection and Instance Segmentation

Pengfei Chen,Xuehui Yu,Xumeng Han,Kuiran Wang,Guorong Li,Lingxi Xie,Zhenjun Han,Jianbin Jiao

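Task: Propose a single-point-supervised object recognition method that narrows the performance gap with fully supervised algorithms by improving proposal generation and optimization.

Motivation: Existing single-point-supervised methods rely on discrete sampling when generating proposals, which truncates object boundaries or includes excessive background and hurts performance.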

Details

Method: Propose P2BNet++ and P2MNet, which improve proposal generation through approximately continuous proposal sampling and pixel-level perception, respectively. Result: The method substantially surpasses previous methods on COCO, VOC, SBD, and Cityscapes, with clear gains in mean average precision. Conclusion: Through continuous optimization and pixel-level perception, P2MNet generates more precise bounding boxes, generalizes to segmentation tasks, and narrows the gap with fully supervised methods. Abstract: Object recognition using single-point supervision has attracted increasing attention recently. However, the performance gap compared with fully-supervised algorithms remains large. Previous works generated class-agnostic proposals in an image offline and then treated mixed candidates as a single bag, putting a huge burden on multiple instance learning (MIL). In this paper, we introduce Point-to-Box Network (P2BNet), which constructs balanced instance-level proposal bags by generating proposals in an anchor-like way and refining the proposals in a coarse-to-fine paradigm. Through further research, we find that the bag of proposals, either at the image level or the instance level, is established on discrete box sampling. This leads the pseudo box estimation into a sub-optimal solution, resulting in the truncation of object boundaries or the excessive inclusion of background. Hence, we conduct a series of explorations of discrete-to-continuous optimization, yielding P2BNet++ and Point-to-Mask Network (P2MNet). P2BNet++ conducts an approximately continuous proposal sampling strategy by better utilizing spatial clues. P2MNet further introduces low-level image information to assist in pixel prediction, and a boundary self-prediction is designed to relieve the limitation of the estimated boxes. Benefiting from the continuous object-aware pixel-level perception, P2MNet can generate more precise bounding boxes and generalize to segmentation tasks. Our method largely surpasses the previous methods in terms of the mean average precision on COCO, VOC, SBD, and Cityscapes, demonstrating great potential to bridge the performance gap compared with fully supervised tasks.

Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs

Yichun Yin,Wenyong Huang,Kaikai Song,Yehui Tang,Xueyu Wu,Wei Guo,Peng Guo,Yaoyuan Wang,Xiaojun Meng,Yasheng Wang,Dong Li,Can Chen,Dandan Tu,Yin Li,Fisher Yu,Ruiming Tang,Yunhe Wang,Baojun Wang,Bin Wang,Bo Wang,Boxiao Liu,Changzheng Zhang,Duyu Tang,Fei Mi,Hui Jin,Jiansheng Wei,Jiarui Qin,Jinpeng Li,Jun Zhao,Liqun Deng,Lin Li,Minghui Xu,Naifu Zhang,Nianzu Zheng,Qiang Li,Rongju Ruan,Shengjun Cheng,Tianyu Guo,Wei He,Wei Li,Weiwen Liu,Wulong Liu,Xinyi Dai,Yonghan Dong,Yu Pan,Yue Li,Yufei Wang,Yujun Li,Yunsheng Ni,Zhe Liu,Zhenhe Zhang,Zhicheng Liu

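Task: Present and train Pangu Ultra, a large language model (LLM) with 135 billion parameters, and address the optimization and system challenges of training at this scale.

Motivation: Although the LLM field has made unprecedented advances in scale and capability, training models this large still poses significant optimization and system challenges.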

Details

Method: Stabilize training with depth-scaled sandwich normalization, and carry out system-optimized large-scale pre-training and post-training enhancement on 8,192 Ascend NPUs. Result: On multiple benchmarks, Pangu Ultra significantly surpasses dense LLMs such as Llama 405B and Mistral Large 2 and is competitive with DeepSeek-R1, a sparse model with many more parameters. Conclusion: Ascend NPUs can efficiently train dense models with more than 100 billion parameters; Pangu Ultra and its system will be made available to commercial customers. Abstract: We present Pangu Ultra, a Large Language Model (LLM) with 135 billion parameters and dense Transformer modules trained on Ascend Neural Processing Units (NPUs). Although the field of LLMs has been witnessing unprecedented advances in pushing the scale and capability of LLMs in recent years, training such a large-scale model still involves significant optimization and system challenges. To stabilize the training process, we propose depth-scaled sandwich normalization, which effectively eliminates loss spikes during the training process of deep models. We pre-train our model on 13.2 trillion diverse and high-quality tokens and further enhance its reasoning capabilities during post-training. To perform such large-scale training efficiently, we utilize 8,192 Ascend NPUs with a series of system optimizations. Evaluations on multiple diverse benchmarks indicate that Pangu Ultra significantly advances the state-of-the-art capabilities of dense LLMs such as Llama 405B and Mistral Large 2, and even achieves competitive results with DeepSeek-R1, whose sparse model structure contains many more parameters. Our exploration demonstrates that Ascend NPUs are capable of efficiently and effectively training dense models with more than 100 billion parameters. Our model and system will be available for our commercial customers.

AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations

Junli Liu,Qizhi Chen,Zhigang Wang,Yiwen Tang,Yiting Zhang,Chi Yan,Dong Wang,Xuelong Li,Bin Zhao

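Task: Propose and address AerialVG, a new task of visual grounding from aerial views.

Motivation: Traditional visual grounding faces new challenges in aerial imagery: visually similar objects are hard to tell apart, and positional relations must be emphasized.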

Details

Method: Introduce a dataset of 5K aerial images and design a model with Hierarchical Cross-Attention and a Relation-Aware Grounding module. Result: Experiments validate the effectiveness of the dataset and method and highlight the importance of spatial reasoning in aerial visual grounding. Conclusion: The AerialVG task and the proposed method open a new direction for aerial visual grounding; the code and dataset will be released. Abstract: Visual grounding (VG) aims to localize target objects in an image based on natural language descriptions. In this paper, we propose AerialVG, a new task focusing on visual grounding from aerial views. Compared to traditional VG, AerialVG poses new challenges, e.g., appearance-based grounding is insufficient to distinguish among multiple visually similar objects, and positional relations should be emphasized. Besides, existing VG models struggle when applied to aerial imagery, where high-resolution images cause significant difficulties. To address these challenges, we introduce the first AerialVG dataset, consisting of 5K real-world aerial images, 50K manually annotated descriptions, and 103K objects. Particularly, each annotation in the AerialVG dataset contains multiple target objects annotated with relative spatial relations, requiring models to perform comprehensive spatial reasoning. Furthermore, we propose an innovative model especially for the AerialVG task, where a Hierarchical Cross-Attention is devised to focus on target regions, and a Relation-Aware Grounding module is designed to infer positional relations. Experimental results validate the effectiveness of our dataset and method, highlighting the importance of spatial reasoning in aerial visual grounding. The code and dataset will be released.

Token Level Routing Inference System for Edge Devices

Jianshu She,Wenhao Zheng,Zhengzhong Liu,Hongyi Wang,Eric Xing,Huaxiu Yao,Qirong Ho

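Task: Propose a collaborative decoding inference system to address the inefficiency of deploying large language models on edge devices.

Motivation: The computational complexity of large language models limits their deployment efficiency on edge devices, while small language models are fast but produce lower-quality responses.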

Details

Method: Through collaborative decoding, a small model runs inference on-device while selectively consulting a cloud-based large model to generate critical tokens. Result: The system achieves a 60% performance gain on CommonsenseQA using only a 0.5B model, with under 7% of token generation uploaded to the large model. Conclusion: Collaborative decoding is an effective way to balance performance and efficiency. Abstract: The computational complexity of large language model (LLM) inference significantly constrains their deployment efficiency on edge devices. In contrast, small language models offer faster decoding and lower resource consumption but often suffer from degraded response quality and heightened susceptibility to hallucinations. To address this trade-off, collaborative decoding, in which a large model assists in generating critical tokens, has emerged as a promising solution. This paradigm leverages the strengths of both model types by enabling high-quality inference through selective intervention of the large model, while maintaining the speed and efficiency of the smaller model. In this work, we present a novel collaborative decoding inference system that allows small models to perform on-device inference while selectively consulting a cloud-based large model for critical token generation. Remarkably, the system achieves a 60% performance gain on CommonsenseQA using only a 0.5B model on an M1 MacBook, with under 7% of token generation uploaded to the large model in the cloud.
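
The token-level routing idea can be illustrated with a minimal sketch. The abstract does not say how critical tokens are identified, so the confidence threshold used here is an assumption, and `collaborative_decode`, `small`, and `large` are hypothetical names rather than the authors' system.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (not from the paper):
#   small(prefix) -> (next_token, confidence in [0, 1]), runs on-device
#   large(prefix) -> next_token, stands in for the cloud-hosted model
SmallModel = Callable[[List[str]], Tuple[str, float]]
LargeModel = Callable[[List[str]], str]


def collaborative_decode(
    prompt: List[str],
    small: SmallModel,
    large: LargeModel,
    threshold: float = 0.5,   # assumed routing rule: defer when confidence is low
    max_new_tokens: int = 16,
    eos: str = "<eos>",
) -> Tuple[List[str], float]:
    """Decode with the small model, deferring low-confidence tokens to the large one.

    Returns the generated tokens and the fraction of decoding steps that were routed.
    """
    tokens = list(prompt)
    routed = 0
    steps = 0
    for _ in range(max_new_tokens):
        token, confidence = small(tokens)
        if confidence < threshold:
            # Treat this position as "critical" and ask the large model instead.
            token = large(tokens)
            routed += 1
        steps += 1
        if token == eos:
            break
        tokens.append(token)
    routed_fraction = routed / steps if steps else 0.0
    return tokens[len(prompt):], routed_fraction
```

The returned routed fraction corresponds to the kind of quantity the abstract reports as "under 7%"; a real system would also need to account for network latency and batching, which this sketch ignores.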

V2V3D: View-to-View Denoised 3D Reconstruction for Light-Field Microscopy

Jiayin Zhao,Zhenqi Fu,Tao Yu,Hui Qiao

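Task: Propose V2V3D, an unsupervised view-to-view framework for jointly optimizing image denoising and 3D reconstruction in light-field microscopy (LFM).

Motivation: Existing LFM reconstruction algorithms are highly sensitive to sensor noise or require hard-to-obtain ground-truth annotated data, which limits their applicability.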

Details

Method: V2V3D combines a noise-independence assumption with the noise2noise principle, and proposes a wave-optics-based feature alignment technique with purpose-built convolution kernels. Result: Experiments show that V2V3D outperforms existing methods in both computational efficiency and performance. Conclusion: V2V3D is a promising solution for 3D imaging under challenging conditions. Abstract: Light field microscopy (LFM) has gained significant attention due to its ability to capture snapshot-based, large-scale 3D fluorescence images. However, existing LFM reconstruction algorithms are highly sensitive to sensor noise or require hard-to-get ground-truth annotated data for training. To address these challenges, this paper introduces V2V3D, an unsupervised view2view-based framework that establishes a new paradigm for joint optimization of image denoising and 3D reconstruction in a unified architecture. We assume that the LF images are derived from a consistent 3D signal, with the noise in each view being independent. This enables V2V3D to incorporate the principle of noise2noise for effective denoising. To enhance the recovery of high-frequency details, we propose a novel wave-optics-based feature alignment technique, which transforms the point spread function, used for forward propagation in wave optics, into convolution kernels specifically designed for feature alignment. Moreover, we introduce an LFM dataset containing LF images and their corresponding 3D intensity volumes. Extensive experiments demonstrate that our approach achieves high computational efficiency and outperforms the other state-of-the-art methods. These advancements position V2V3D as a promising solution for 3D imaging under challenging conditions.

Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge

Riccardo Cantini,Alessio Orsino,Massimo Ruggiero,Domenico Talia

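Task: Propose a scalable benchmarking framework for evaluating the robustness of large language models (LLMs) against adversarial bias elicitation.

Motivation: The widespread use of LLMs in critical societal domains has raised concerns about embedded biases, which can perpetuate stereotypes and compromise fairness.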

Details

Method: Systematically probe models for bias across sociocultural dimensions with a multi-task approach, automatically assess the safety of model responses with an LLM-as-a-Judge approach, and use jailbreak techniques to investigate vulnerabilities in safety mechanisms. Result: The analysis reveals prevalent biases in current state-of-the-art models and their impact on model safety, and assesses the safety of domain-specific models (e.g., in medicine). Conclusion: The study finds critical trade-offs between model size and safety, offering guidance for developing fairer and more robust language models. Abstract: Large Language Models (LLMs) have revolutionized artificial intelligence, driving advancements in machine translation, summarization, and conversational agents. However, their increasing integration into critical societal domains has raised concerns about embedded biases, which can perpetuate stereotypes and compromise fairness. These biases stem from various sources, including historical inequalities in training data, linguistic imbalances, and adversarial manipulation. Despite mitigation efforts, recent studies indicate that LLMs remain vulnerable to adversarial attacks designed to elicit biased responses. This work proposes a scalable benchmarking framework to evaluate LLM robustness against adversarial bias elicitation. Our methodology involves (i) systematically probing models with a multi-task approach targeting biases across various sociocultural dimensions, (ii) quantifying robustness through safety scores using an LLM-as-a-Judge approach for automated assessment of model responses, and (iii) employing jailbreak techniques to investigate vulnerabilities in safety mechanisms. Our analysis examines prevalent biases in both small and large state-of-the-art models and their impact on model safety. Additionally, we assess the safety of domain-specific models fine-tuned for critical fields, such as medicine. Finally, we release a curated dataset of bias-related prompts, CLEAR-Bias, to facilitate systematic vulnerability benchmarking. Our findings reveal critical trade-offs between model size and safety, aiding the development of fairer and more robust future language models.

SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos

Joshua Li,Fernando Jose Pena Cantu,Emily Yu,Alexander Wong,Yuchen Cui,Yuhao Chen

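Task: Propose SAMJAM, a zero-shot video scene graph generation method for dynamic kitchen environments.

Motivation: Existing video scene graph generation models require extensive training, while vision language models excel at zero-shot tasks but cannot stably track dynamic objects.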

Details

Method: Combine SAM2's temporal tracking with Gemini's semantic understanding, using a matching algorithm to produce temporally consistent scene graphs. Result: On the EPIC-KITCHENS and EPIC-KITCHENS-100 datasets, SAMJAM improves mean recall over Gemini by 8.33%. Conclusion: SAMJAM is an effective zero-shot video scene graph generation method suited to dynamic environments. Abstract: Video Scene Graph Generation (VidSGG) is an important topic in understanding dynamic kitchen environments. Current models for VidSGG require extensive training to produce scene graphs. Recently, Vision Language Models (VLM) and Vision Foundation Models (VFM) have demonstrated impressive zero-shot capabilities in a variety of tasks. However, VLMs like Gemini struggle with the dynamics for VidSGG, failing to maintain stable object identities across frames. To overcome this limitation, we propose SAMJAM, a zero-shot pipeline that combines SAM2's temporal tracking with Gemini's semantic understanding. SAM2 also improves upon Gemini's object grounding by producing more accurate bounding boxes. In our method, we first prompt Gemini to generate a frame-level scene graph. Then, we employ a matching algorithm to map each object in the scene graph with a SAM2-generated or SAM2-propagated mask, producing a temporally-consistent scene graph in dynamic environments. Finally, we repeat this process for each of the following frames. We empirically demonstrate that SAMJAM outperforms Gemini by 8.33% in mean recall on the EPIC-KITCHENS and EPIC-KITCHENS-100 datasets.
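
The abstract does not specify the matching algorithm, so the following is only a sketch of one plausible choice: greedy one-to-one matching by bounding-box IoU between scene-graph objects and tracked masks. The function names, the box format `(x1, y1, x2, y2)`, and the `min_iou` threshold are assumptions for illustration and do not use the SAM2 or Gemini APIs.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_objects_to_tracks(
    object_boxes: Dict[str, Box],   # scene-graph object name -> box from the VLM
    track_boxes: Dict[int, Box],    # track id -> bounding box of a tracked mask
    min_iou: float = 0.3,           # hypothetical acceptance threshold
) -> Dict[str, int]:
    """Greedily assign each object to at most one track, best IoU first."""
    pairs = sorted(
        (
            (box_iou(obox, tbox), name, track_id)
            for name, obox in object_boxes.items()
            for track_id, tbox in track_boxes.items()
        ),
        reverse=True,
    )
    assignment: Dict[str, int] = {}
    used_tracks = set()
    for iou, name, track_id in pairs:
        if iou < min_iou:
            break  # remaining pairs overlap even less
        if name in assignment or track_id in used_tracks:
            continue
        assignment[name] = track_id
        used_tracks.add(track_id)
    return assignment
```

Because each object keeps the id of the track it is matched to, an object that reappears in later frames inherits a stable identity from the tracker, which is the property the abstract attributes to the pipeline. An optimal assignment (for example the Hungarian algorithm) could replace the greedy loop.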

Hongcheng Guo,Fei Zhao,Shaosheng Cao,Xinze Lyu,Ziyan Liu,Yue Wang,Boyang Wang,Zhoujun Li,Chonggang Lu,Zhe Xu,Yao Hu

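Task: Develop RedTrans, a machine translation model for Social Network Services (SNS), to address the shortcomings of traditional models when translating culturally nuanced content such as memes, slang, and pop culture references.

Motivation: Globalized social interaction has increased demand for machine translation on social media, yet traditional models handle culturally nuanced content poorly, and dedicated training data and evaluation benchmarks are lacking.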

Details

Method: Train RedTrans through three innovations: (1) supervised finetuning with Dual-LLM Back-Translation Sampling; (2) Rewritten Preference Optimization (RePO), an algorithm that corrects preference pairs through expert annotation; and (3) RedTrans-Bench, the first benchmark for SNS translation. Result: RedTrans outperforms state-of-the-art large language models in experiments and has been deployed in a real-world production environment. Conclusion: Through domain-specific adaptation, RedTrans effectively bridges the gap between generic and culturally grounded translation systems. Abstract: The globalization of social interactions has heightened the need for machine translation (MT) on Social Network Services (SNS), yet traditional models struggle with culturally nuanced content like memes, slang, and pop culture references. While large language models (LLMs) have advanced general-purpose translation, their performance on SNS-specific content remains limited due to insufficient specialized training data and evaluation benchmarks. This paper introduces RedTrans, a 72B LLM tailored for SNS translation, trained on a novel dataset developed through three innovations: (1) Supervised Finetuning with Dual-LLM Back-Translation Sampling, an unsupervised sampling method using LLM-based back-translation to select diverse data for large-scale finetuning; (2) Rewritten Preference Optimization (RePO), an algorithm that identifies and corrects erroneous preference pairs through expert annotation, building reliable preference corpora; and (3) RedTrans-Bench, the first benchmark for SNS translation, evaluating phenomena like humor localization, emoji semantics, and meme adaptation. Experiments show RedTrans outperforms state-of-the-art LLMs. Besides, RedTrans has already been deployed in a real-world production environment, demonstrating that domain-specific adaptation effectively bridges the gap between generic and culturally grounded translation systems.

SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement

Xiyao Wang,Zhengyuan Yang,Chao Feng,Hongjin Lu,Linjie Li,Chung-Ching Lin,Kevin Lin,Furong Huang,Lijuan Wang

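Task: Propose a self-improvement-based method for enhancing visual reasoning that needs only a small number of training samples and no knowledge distillation.

Motivation: The difficulty of training data is critical in reinforcement fine-tuning (RFT): appropriately challenging samples can substantially boost reasoning ability, but quantifying sample difficulty is a major challenge.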

Details

Method: Use Monte Carlo Tree Search (MCTS) to quantify sample difficulty, filter data by the number of iterations the model needs to solve each problem, and retain 11k samples for RFT. Result: ThinkLite-VL improves average performance across eight benchmarks by 7% and reaches a SoTA accuracy of 75.1 on MathVista. Conclusion: ThinkLite-VL significantly outperforms existing methods with few samples, demonstrating the importance of quantifying data difficulty. Abstract: In this paper, we present an effective method to enhance visual reasoning with significantly fewer training samples, relying purely on self-improvement with no knowledge distillation. Our key insight is that the difficulty of training data during reinforcement fine-tuning (RFT) is critical. Appropriately challenging samples can substantially boost reasoning capabilities even when the dataset is small. Despite being intuitive, the main challenge remains in accurately quantifying sample difficulty to enable effective data filtering. To this end, we propose a novel way of repurposing Monte Carlo Tree Search (MCTS) to achieve that. Starting from our curated 70k open-source training samples, we introduce an MCTS-based selection method that quantifies sample difficulty based on the number of iterations required by the VLMs to solve each problem. This explicit step-by-step reasoning in MCTS forces the model to think longer and better identifies samples that are genuinely challenging. We filter and retain 11k samples to perform RFT on Qwen2.5-VL-7B-Instruct, resulting in our final model, ThinkLite-VL. Evaluation results on eight benchmarks show that ThinkLite-VL improves the average performance of Qwen2.5-VL-7B-Instruct by 7%, using only 11k training samples with no knowledge distillation. This significantly outperforms all existing 7B-level reasoning VLMs, as well as our fairly comparable baselines that use classic selection methods such as accuracy-based filtering. Notably, on MathVista, ThinkLite-VL-7B achieves the SoTA accuracy of 75.1, surpassing Qwen2.5-VL-72B, GPT-4o, and O1. Our code, data, and model are available at https://github.com/si0wang/ThinkLite-VL.
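
The filtering step can be sketched as follows, assuming the MCTS iteration counts have already been computed per sample. The cutoff `min_iterations`, the treatment of unsolved samples, and the `Sample` type are hypothetical; the abstract states only that difficulty is measured by the number of iterations needed to solve each problem.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    sample_id: str
    # Number of search iterations the model needed to reach a correct answer;
    # None means it was not solved within the search budget (assumed convention).
    iterations_to_solve: Optional[int]


def select_hard_samples(
    samples: List[Sample],
    min_iterations: int = 5,      # hypothetical difficulty cutoff
    keep_unsolved: bool = True,   # assumption: unsolved samples count as hard
) -> List[Sample]:
    """Keep samples that took many iterations to solve, preserving input order."""
    selected = []
    for sample in samples:
        if sample.iterations_to_solve is None:
            if keep_unsolved:
                selected.append(sample)
        elif sample.iterations_to_solve >= min_iterations:
            selected.append(sample)
    return selected
```

Under this sketch, easy samples that the model solves almost immediately are dropped, which is how a 70k pool could shrink to a much smaller training set; the actual cutoff that yields 11k samples is not given in the abstract.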

FG-RAG: Enhancing Query-Focused Summarization with Context-Aware Fine-Grained Graph RAG

Yubin Hong,Chaofan Li,Jingyi Zhang,Yingxia Shao

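Task: Propose Context-Aware Fine-Grained Graph RAG (FG-RAG) to improve performance on the Query-Focused Summarization (QFS) task.

Motivation: Existing GraphRAG methods for QFS focus mainly on coarse-grained information summarization, are not aware of the specific query, and retrieve content that lacks sufficient context to generate comprehensive responses.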

Details

Method: FG-RAG uses Context-Aware Entity Expansion to broaden the coverage of retrieved entities in the graph and supply sufficient contextual information, and uses Query-Level Fine-Grained Summarization to incorporate fine-grained details during response generation. Result: On the QFS task, FG-RAG outperforms other RAG systems on multiple metrics, including comprehensiveness, diversity, and empowerment. Conclusion: Through context awareness and fine-grained summarization, FG-RAG improves QFS performance and offers a more effective approach to retrieval-augmented generation. Abstract: Retrieval-Augmented Generation (RAG) enables large language models to provide more precise and pertinent responses by incorporating external knowledge. In the Query-Focused Summarization (QFS) task, GraphRAG-based approaches have notably enhanced the comprehensiveness and diversity of generated responses. However, existing GraphRAG-based approaches predominantly focus on coarse-grained information summarization without being aware of the specific query, and the retrieved content lacks sufficient contextual information to generate comprehensive responses. To address the deficiencies of current RAG systems, we propose Context-Aware Fine-Grained Graph RAG (FG-RAG) to enhance the performance of the QFS task. FG-RAG employs Context-Aware Entity Expansion in graph retrieval to expand the coverage of retrieved entities in the graph, thus providing enough contextual information for the retrieved content. Furthermore, FG-RAG utilizes Query-Level Fine-Grained Summarization to incorporate fine-grained details during response generation, enhancing query awareness for the generated summarization. Our evaluation demonstrates that FG-RAG outperforms other RAG systems in multiple metrics of comprehensiveness, diversity, and empowerment when handling the QFS task. Our implementation is available at https://github.com/BuptWululu/FG-RAG.

Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos

Rundong Luo,Matthew Wallingford,Ali Farhadi,Noah Snavely,Wei-Chiu Ma

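Task: Investigate how to generate full 360° panoramic videos from ordinary perspective videos.

Motivation: 360° videos offer a more complete visual experience, but existing video models struggle to generate high-quality panoramic videos.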

Details

Method: Leverage 360° videos available online, design a data filtering pipeline, and apply geometry- and motion-aware operations to improve generation. Result: The model generates realistic and coherent 360° videos from ordinary videos, and several potential applications are demonstrated. Conclusion: The proposed method performs well on 360° video generation and has broad application potential. Abstract: 360° videos have emerged as a promising medium to represent our dynamic visual world. Compared to the "tunnel vision" of standard cameras, their borderless field of view offers a more complete perspective of our surroundings. While existing video models excel at producing standard videos, their ability to generate full panoramic videos remains elusive. In this paper, we investigate the task of video-to-360° generation: given a perspective video as input, our goal is to generate a full panoramic video that is consistent with the original video. Unlike conventional video generation tasks, the output's field of view is significantly larger, and the model is required to have a deep understanding of both the spatial layout of the scene and the dynamics of objects to maintain spatio-temporal consistency. To address these challenges, we first leverage the abundant 360° videos available online and develop a high-quality data filtering pipeline to curate pairwise training data. We then carefully design a series of geometry- and motion-aware operations to facilitate the learning process and improve the quality of 360° video generation. Experimental results demonstrate that our model can generate realistic and coherent 360° videos from in-the-wild perspective videos. In addition, we showcase its potential applications, including video stabilization, camera viewpoint control, and interactive visual question answering.

Relevance Isn't All You Need: Scaling RAG Systems With Inference-Time Compute Via Multi-Criteria Reranking

Will LeVine,Bijan Varjavand

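Task: Study how multi-criteria optimization can improve both context relevance and answer quality in Retrieval-Augmented Generation (RAG) systems.

Motivation: Optimizing conventional RAG systems for context relevance alone can create information bottlenecks that degrade downstream answer quality, so more comprehensive optimization methods are needed.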

Details

Method: Propose REBEL, which improves the performance/speed trade-off of RAG systems through multi-criteria optimization using Chain-of-Thought prompting (and optionally Multi-Turn dialogue). Result: Experiments show that REBEL improves both context relevance and answer quality as inference time increases. Conclusion: REBEL gives RAG systems a new performance/speed trade-off curve that significantly outperforms conventional methods. Abstract: Modern Large Language Model (LLM) systems typically rely on Retrieval Augmented Generation (RAG), which aims to gather context that is useful for response generation. These RAG systems typically optimize strictly towards retrieving context that is maximally relevant to the query. However, conventional theory suggests that retrieval systems which seek to maximize context relevance without any additional explicit criteria can create information bottlenecks. We reaffirm this finding in the modern age of LLMs by showing that in standard RAG pipelines, maximizing for context relevance alone can degrade downstream response quality. In response, we show evaluations of existing RAG methods which account for both context relevance and answer quality. These evaluations introduce a novel finding that existing RAG systems scale poorly with inference time compute usage when considering our combined metric. We introduce "RErank BEyond reLevance (REBEL)", which enables RAG systems to scale with inference-time compute via injection of multi-criteria optimization using Chain-of-Thought prompting (and optionally Multi-Turn dialogue). Ultimately, this enables a new performance/speed tradeoff curve, where RAG systems are able to achieve both higher relevance of retrieved contexts and superior answer quality as inference time increases. Code for the implementation of our method in llama-index can be found at the following PR: https://github.com/run-llama/llama_index/pull/17590. Code for running experiments using this llama-index implementation can be found at https://github.com/microsoft/REBEL.
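
To make the contrast between relevance-only and multi-criteria reranking concrete, here is a small sketch that ranks candidates by a weighted sum of per-criterion scores. It is not REBEL's implementation: REBEL obtains its criteria through Chain-of-Thought prompting of an LLM, whereas this example assumes the scores are already available. The criterion names, weights, and the `rerank` function are hypothetical.

```python
from typing import Dict, List


def rerank(
    candidates: List[Dict],
    weights: Dict[str, float],
    top_k: int = 3,
) -> List[Dict]:
    """Order candidates by a weighted sum of their per-criterion scores.

    Each candidate is a dict with a "scores" mapping, e.g.
    {"id": "doc1", "scores": {"relevance": 0.9, "depth": 0.1}}.
    Criteria missing from a candidate contribute 0.
    """
    def combined(candidate: Dict) -> float:
        scores = candidate["scores"]
        return sum(weight * scores.get(name, 0.0) for name, weight in weights.items())

    return sorted(candidates, key=combined, reverse=True)[:top_k]
```

With `weights={"relevance": 1.0}` this reduces to ordinary relevance ranking; adding secondary criteria can promote a slightly less relevant but more informative passage, which is the behavior the abstract argues relevance-only pipelines lack.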

MARS: a Multimodal Alignment and Ranking System for Few-Shot Segmentation

Nico Catalano,Stefano Samele,Paolo Pertino,Matteo Matteucci

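Task: Propose MARS, a plug-and-play ranking system that improves mask selection in few-shot segmentation.

Motivation: The current few-shot segmentation literature lacks a mask selection method that goes beyond visual similarity between the query and example images, leading to suboptimal predictions.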

Details

Method: Use multimodal cues to score, filter, and merge mask proposals, evaluating proposals with multimodal scores computed at local and global levels. Result: Experiments on multiple datasets show that integrating all four scoring components is crucial for robust ranking; MARS integrates with a variety of mask proposal systems and achieves new state-of-the-art results on multiple benchmarks. Conclusion: MARS is an efficient plug-and-play ranking system that significantly improves few-shot segmentation performance. Abstract: Current Few Shot Segmentation literature lacks a mask selection method that goes beyond visual similarity between the query and example images, leading to suboptimal predictions. We present MARS, a plug-and-play ranking system that leverages multimodal cues to filter and merge mask proposals robustly. Starting from a set of mask predictions for a single query image, we score, filter, and merge them to improve results. Proposals are evaluated using multimodal scores computed at local and global levels. Extensive experiments on COCO-20i, Pascal-5i, LVIS-92i, and FSS-1000 demonstrate that integrating all four scoring components is crucial for robust ranking, validating our contribution. As MARS can be effortlessly integrated with various mask proposal systems, we deploy it across a wide range of top-performer methods and achieve new state-of-the-art results on multiple existing benchmarks. Code will be available upon acceptance.

OSCAR: Online Soft Compression And Reranking

Maxime Louis,Thibault Formal,Hervé Dejean,Stéphane Clinchant

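Task: Propose OSCAR, a query-dependent online soft compression method that reduces the computational overhead of Retrieval-Augmented Generation (RAG).

Motivation: As retrieval sizes grow, RAG pipelines become computationally expensive, and traditional compression methods have limitations.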

Details

Method: OSCAR dynamically compresses retrieved information at inference time, avoiding storage overhead and enabling higher compression rates, and is extended to perform reranking for further efficiency. Result: Experiments show a 2-5x inference speed-up on LLMs from 1B to 24B parameters with minimal loss in accuracy. Conclusion: OSCAR is an efficient, high-performing compression method for RAG that significantly improves computational efficiency. Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge, leading to improved accuracy and relevance. However, scaling RAG pipelines remains computationally expensive as retrieval sizes grow. To address this, we introduce OSCAR, a novel query-dependent online soft compression method that reduces computational overhead while preserving performance. Unlike traditional hard compression methods, which shorten retrieved texts, or soft compression approaches, which map documents to continuous embeddings offline, OSCAR dynamically compresses retrieved information at inference time, eliminating storage overhead and enabling higher compression rates. Additionally, we extend OSCAR to simultaneously perform reranking, further optimizing the efficiency of the RAG pipeline. Our experiments demonstrate state-of-the-art performance with a 2-5x speed-up in inference and minimal to no loss in accuracy for LLMs ranging from 1B to 24B parameters. The models are available at: https://huggingface.co/collections/naver/oscar-67d446a8e3a2551f57464295.

HoloPart: Generative 3D Part Amodal Segmentation

Yunhan Yang,Yuan-Chen Guo,Yukun Huang,Zi-Xin Zou,Zhipeng Yu,Yangguang Li,Yan-Pei Cao,Xihui Liu

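Task: Perform amodal part segmentation of 3D shapes, decomposing a shape into complete, semantically meaningful parts even when they are occluded.

Motivation: Existing 3D part segmentation methods identify only visible surface patches, which limits their utility. Inspired by 2D amodal segmentation, the task is brought to the 3D domain to address the key challenges of inferring occluded 3D geometry, maintaining global shape consistency, and handling diverse shapes with limited training data.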

Details

Method: A two-stage approach: first, existing 3D part segmentation is used to obtain initial, incomplete part segments; second, HoloPart, a diffusion-based model, completes these segments into full 3D parts. HoloPart uses local attention to capture fine-grained geometry and global shape context attention to ensure consistency. Result: New benchmarks are established on the ABO and PartObjaverse-Tiny datasets, where HoloPart significantly outperforms existing shape completion methods. Combined with existing segmentation techniques, it achieves promising results on 3D part amodal segmentation. Conclusion: HoloPart opens new avenues for 3D content creation and understanding, particularly in applications such as geometry editing, animation, and material assignment. Abstract: 3D part amodal segmentation--decomposing a 3D shape into complete, semantically meaningful parts, even when occluded--is a challenging but crucial task for 3D content creation and understanding. Existing 3D part segmentation methods only identify visible surface patches, limiting their utility. Inspired by 2D amodal segmentation, we introduce this novel task to the 3D domain and propose a practical, two-stage approach, addressing the key challenges of inferring occluded 3D geometry, maintaining global shape consistency, and handling diverse shapes with limited training data. First, we leverage existing 3D part segmentation to obtain initial, incomplete part segments. Second, we introduce HoloPart, a novel diffusion-based model, to complete these segments into full 3D parts. HoloPart utilizes a specialized architecture with local attention to capture fine-grained part geometry and global shape context attention to ensure overall shape consistency. We introduce new benchmarks based on the ABO and PartObjaverse-Tiny datasets and demonstrate that HoloPart significantly outperforms state-of-the-art shape completion methods. By incorporating HoloPart with existing segmentation techniques, we achieve promising results on 3D part amodal segmentation, opening new avenues for applications in geometry editing, animation, and material assignment.

Proposed 2MW Wind Turbine for Use in the Governorate of Dhofar at the Sultanate of Oman

Osama Ahmed Marzouk,Omar Rashid Hamdan Al Badi,Maadh Hamed Salman Al Rashdi,Hamed Mohammed Eid Al Balushi

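Task: Design a horizontal-axis wind turbine (HAWT) suitable for the Dhofar Wind Farm project in Oman.

Motivation: Provide a viable wind turbine design for the first commercial-scale (50MW) wind farm in the GCC region.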

Details

Method: Determine the maximum mean wind speed (6m/s) from the wind atlas of Oman, match the target electric power output using modeling equations in a MATLAB code, and draw on design parameters from international manufacturers. Result: The design is a 3-blade rotor with a diameter of 70m and a rotational speed of 24rpm, giving 2.37MW of output power, which exceeds the 2MW target. Conclusion: The proposed design meets the target power requirement, allows for about 15% power losses, and is suitable for the Dhofar Wind Farm. Abstract: In this work, we propose a preliminary design of a horizontal-axis wind turbine (HAWT) as a candidate for the Dhofar Wind Farm project, in the southern Omani Governorate "Dhofar", at the southwest part of the Sultanate of Oman. This wind farm (under construction) is considered to be the first commercial, utility-scale (50MW) wind farm in the GCC (Gulf Cooperation Council) area. The proposed wind turbine has an expected electricity generation of 2MW. We studied the wind atlas of Oman, from which we determined the maximum possible mean wind speed in the entire Sultanate, and built our design based on that reference value, which is 6m/s (21.6km/h). After this, we applied a set of modeling equations that estimate the power output from the wind turbine rotor and matched the target electric power to the design variables using a MATLAB computer code. We reached a suitable design and we present here the distribution of the blade angle (twist angle), and the power per unit span along the rotor blade. The rotor design has 3 blades with a diameter of 70m and a rotational speed of 24rpm. This rotor gives 2.37MW of output power, which exceeds the target 2MW output, allowing for about 15% of power losses in the gearbox and generator. We utilized some commercial designs of wind turbines from different international manufacturers as references for typical limits or recommended values of some design parameters.
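
Two of the figures quoted in the abstract can be cross-checked with simple arithmetic: the electrical output left after the stated 15% loss allowance, and the blade tip speed implied by the rotor diameter and rotational speed. The sketch below uses only the values from the abstract; the function names are hypothetical, and it does not reproduce the authors' MATLAB model.

```python
import math


def net_power_mw(rotor_power_mw: float, loss_fraction: float) -> float:
    """Power remaining after gearbox and generator losses."""
    return rotor_power_mw * (1.0 - loss_fraction)


def tip_speed_ms(rotor_diameter_m: float, rotor_speed_rpm: float) -> float:
    """Blade tip speed: angular speed (rad/s) times rotor radius (m)."""
    angular_speed = rotor_speed_rpm * 2.0 * math.pi / 60.0
    return angular_speed * rotor_diameter_m / 2.0


# Values quoted in the abstract.
rotor_power = 2.37      # MW, rotor output
losses = 0.15           # about 15% gearbox and generator losses
diameter = 70.0         # m
speed_rpm = 24.0        # rpm

print(f"net electrical power: {net_power_mw(rotor_power, losses):.2f} MW")
print(f"blade tip speed: {tip_speed_ms(diameter, speed_rpm):.1f} m/s")
```

The first result is consistent with the stated 2MW target: 2.37MW less 15% is just over 2MW.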

GenEAva: Generating Cartoon Avatars with Fine-Grained Facial Expressions from Realistic Diffusion-based Faces

Hao Yu,Rupayan Mallick,Margrit Betke,Sarah Adel Bargal

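Task: Propose GenEAva, a new framework for generating high-quality cartoon avatars with fine-grained facial expressions.

Motivation: Existing cartoon avatar datasets and generation methods struggle to render highly expressive facial expressions and are often based on real identities, raising privacy concerns.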

Details

Method: Fine-tune a state-of-the-art text-to-image diffusion model and combine it with a stylization model to generate cartoon avatars that preserve identity and are richly expressive. Result: The first expressive cartoon avatar dataset, GenEAva 1.0, is created, containing 13,230 avatars and 135 fine-grained expressions; the fine-tuned model generates more expressive faces than SDXL. Conclusion: The GenEAva framework and dataset provide a diverse and expressive benchmark for research in cartoon avatar generation. Abstract: Cartoon avatars have been widely used in various applications, including social media, online tutoring, and gaming. However, existing cartoon avatar datasets and generation methods struggle to present highly expressive avatars with fine-grained facial expressions and are often inspired by real-world identities, raising privacy concerns. To address these challenges, we propose a novel framework, GenEAva, for generating high-quality cartoon avatars with fine-grained facial expressions. Our approach fine-tunes a state-of-the-art text-to-image diffusion model to synthesize highly detailed and expressive facial expressions. We then incorporate a stylization model that transforms these realistic faces into cartoon avatars while preserving both identity and expression. Leveraging this framework, we introduce the first expressive cartoon avatar dataset, GenEAva 1.0, specifically designed to capture 135 fine-grained facial expressions, featuring 13,230 expressive cartoon avatars with a balanced distribution across genders, racial groups, and age ranges. We demonstrate that our fine-tuned model generates more expressive faces than the state-of-the-art text-to-image diffusion model SDXL. We also verify that the cartoon avatars generated by our framework do not include memorized identities from fine-tuning data. The proposed framework and dataset provide a diverse and expressive benchmark for future research in cartoon avatar generation.

Holistic Capability Preservation: Towards Compact Yet Comprehensive Reasoning Models

Ling Team,Caizhi Tang,Chilin Fu,Chunwei Wu,Jia Guo,Jianwen Wang,Jingyu Hu,Liang Jiang,Meng Li,Peng Jiao,Pingping Liu,Shaomian Zheng,Shiwei Liang,Shuaicheng Li,Yalin Zhang,Yingting Wu,Yongkang Liu,Zhenyu Huang

Task: Further train the lightweight MoE model Ling-Lite, through high-quality data curation and new training paradigms, into Ring-Lite-Distill, a model with strong reasoning capabilities.

Motivation: Develop a parameter-efficient lightweight model (only 2.75 billion activated parameters) with comprehensive reasoning ability that covers reasoning tasks of varying difficulty while preserving general capabilities.

Details

Method: Apply high-quality data curation and new training paradigms to continue training from the Ling-Lite model. Result: Ring-Lite-Distill's reasoning ability is comparable to DeepSeek-R1-Distill-Qwen-7B, and its general capabilities significantly surpass it. Conclusion: Ring-Lite-Distill demonstrates the potential of lightweight models for efficient reasoning and general capabilities. Abstract: This technical report presents Ring-Lite-Distill, a lightweight reasoning model derived from our open-source Mixture-of-Experts (MoE) Large Language Model (LLM) Ling-Lite. This study demonstrates that through meticulous high-quality data curation and ingenious training paradigms, the compact MoE model Ling-Lite can be further trained to achieve exceptional reasoning capabilities, while maintaining its parameter-efficient architecture with only 2.75 billion activated parameters, establishing an efficient lightweight reasoning architecture. In particular, in constructing this model, we have not merely focused on enhancing advanced reasoning capabilities, exemplified by high-difficulty mathematical problem solving, but rather aimed to develop a reasoning model with more comprehensive competency coverage. Our approach ensures coverage across reasoning tasks of varying difficulty levels while preserving generic capabilities, such as instruction following, tool use, and knowledge retention. We show that Ring-Lite-Distill's reasoning ability reaches a level comparable to DeepSeek-R1-Distill-Qwen-7B, while its general capabilities significantly surpass those of DeepSeek-R1-Distill-Qwen-7B. The models are accessible at https://huggingface.co/inclusionAI

InteractAvatar: Modeling Hand-Face Interaction in Photorealistic Avatars with Deformable Gaussians

Kefan Chen,Sergiu Oprea,Justin Theiss,Sreyas Mohan,Srinath Sridhar,Aayush Prakash

Task: Propose InteractAvatar, a model that captures the photorealistic appearance of dynamic hands and non-rigid hand-face interactions with high fidelity.

Motivation: As digital avatars become more important in communication, modeling natural avatar behavior is a significant challenge across many industries, yet existing models often overlook key aspects of hand-body interaction.

Details

Method: A Dynamic Gaussian Hand model that combines a template model, 3D Gaussian Splatting, and a dynamic refinement module, together with a hand-face interaction module. Result: Experiments show that InteractAvatar reconstructs hands and hand-face interactions from monocular or multiview videos with high fidelity and supports animation with novel poses. Conclusion: InteractAvatar is the first model to faithfully capture the photorealistic appearance of dynamic hands and non-rigid hand-face interactions. Abstract: With the rising interest from the community in digital avatars coupled with the importance of expressions and gestures in communication, modeling natural avatar behavior remains an important challenge across many industries such as teleconferencing, gaming, and AR/VR. Human hands are the primary tool for interacting with the environment and essential for realistic human behavior modeling, yet existing 3D hand and head avatar models often overlook the crucial aspect of hand-body interactions, such as between hand and face. We present InteractAvatar, the first model to faithfully capture the photorealistic appearance of dynamic hand and non-rigid hand-face interactions. Our novel Dynamic Gaussian Hand model, combining a template model and 3D Gaussian Splatting as well as a dynamic refinement module, captures pose-dependent change, e.g. the fine wrinkles and complex shadows that occur during articulation. Importantly, our hand-face interaction module models the subtle geometry and appearance dynamics that underlie common gestures. Through experiments on novel view synthesis, self reenactment and cross-identity reenactment, we demonstrate that InteractAvatar can reconstruct hand and hand-face interactions from monocular or multiview videos with high-fidelity details and be animated with novel poses.

R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents

Naman Jain,Jaskirat Singh,Manish Shetty,Liang Zheng,Koushik Sen,Ion Stoica

Task: Address two key challenges for open-weight models on real-world software engineering tasks: scalable curation of execution environments for training, and optimal scaling of test-time compute.

Motivation: Improve the performance of open-source models on practical software engineering tasks such as solving GitHub issues.

Details

Method: Introduce AgentGym, a procedurally generated executable training environment, combined with SYNGEN (a synthetic data curation recipe) and hybrid test-time scaling. Result: On SWE-Bench Verified, the 32B model reaches 34.4% pass@1, and the hybrid approach reaches 51%, a new best result for open-weight SWE agents. Conclusion: AgentGym and its methods bring open-weight models to performance competitive with proprietary models and offer a scalable solution for training and test-time scaling. Abstract: Improving open-source models on real-world SWE tasks (solving GitHub issues) faces two key challenges: 1) scalable curation of execution environments to train these models, and 2) optimal scaling of test-time compute. We introduce AgentGym, the largest procedurally-curated executable gym environment for training real-world SWE-agents, consisting of more than 8.7K tasks. AgentGym is powered by two main contributions: 1) SYNGEN: a synthetic data curation recipe that enables scalable curation of executable environments using test-generation and back-translation directly from commits, thereby reducing reliance on human-written issues or unit tests. We show that this enables more scalable training leading to pass@1 performance of 34.4% on the SWE-Bench Verified benchmark with our 32B model. 2) Hybrid Test-time Scaling: we provide an in-depth analysis of two test-time scaling axes, execution-based and execution-free verifiers, demonstrating that they exhibit complementary strengths and limitations. Test-based verifiers suffer from low distinguishability, while execution-free verifiers are biased and often rely on stylistic features. Surprisingly, we find that while each approach individually saturates around 42-43%, significantly higher gains can be obtained by leveraging their complementary strengths. Overall, our approach achieves 51% on the SWE-Bench Verified benchmark, reflecting a new state-of-the-art for open-weight SWE-agents and for the first time showing competitive performance with proprietary models such as o1, o1-preview and sonnet-3.5-v2 (with tools). We will open-source our environments, models, and agent trajectories.
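The complementary use of the two verifier types can be illustrated with a small sketch. The abstract does not state how the two signals are combined, so the rule below (rank by test pass rate, break ties with the execution-free score) is an assumption for illustration only, and the `Candidate` fields and `select_patch` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    patch_id: str
    tests_passed: int    # execution-based signal: generated tests that pass
    tests_total: int
    model_score: float   # execution-free signal: hypothetical verifier score in [0, 1]

def select_patch(candidates):
    """Pick one candidate patch.

    Assumed rule: the test pass rate decides first; the execution-free
    score only breaks ties, which targets the low distinguishability of
    test-based verifiers when several patches pass the same tests.
    """
    def key(c):
        pass_rate = c.tests_passed / c.tests_total if c.tests_total else 0.0
        return (pass_rate, c.model_score)
    return max(candidates, key=key)

candidates = [
    Candidate("a", 4, 5, 0.90),
    Candidate("b", 5, 5, 0.40),
    Candidate("c", 5, 5, 0.75),
]
best = select_patch(candidates)
print(best.patch_id)
```

Here patches "b" and "c" are indistinguishable by tests alone, so the execution-free score decides between them, while patch "a" is not selected despite its high model score because it fails a test.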

Scaling Laws for Native Multimodal Models

Mustafa Shukor,Enrico Fini,Victor Guilherme Turrisi da Costa,Matthieu Cord,Joshua Susskind,Alaaeldin El-Nouby

Task: Study the architectural design of native multimodal models (NMMs) and compare the performance of early-fusion and late-fusion architectures.

Motivation: Investigate whether late-fusion architectures have an inherent advantage in multimodal models, and what performance benefits early-fusion architectures may offer.

Details

Method: Conduct a scaling laws study over 457 trained models with different architectures and training mixtures, comparing early-fusion and late-fusion architectures. Result: Early-fusion architectures perform better at lower parameter counts, are more efficient to train, and are easier to deploy; incorporating Mixture of Experts (MoEs) significantly improves performance. Conclusion: Early-fusion architectures are advantageous for multimodal models, and combining them with MoEs improves performance further. Abstract: Building general-purpose models that can effectively perceive the world through multimodal signals has been a long-standing goal. Current approaches involve integrating separately pre-trained components, such as connecting vision encoders to LLMs and continuing multimodal training. While such approaches exhibit remarkable sample efficiency, it remains an open question whether such late-fusion architectures are inherently superior. In this work, we revisit the architectural design of native multimodal models (NMMs)--those trained from the ground up on all modalities--and conduct an extensive scaling laws study, spanning 457 trained models with different architectures and training mixtures. Our investigation reveals no inherent advantage to late-fusion architectures over early-fusion ones, which do not rely on image encoders. On the contrary, early-fusion exhibits stronger performance at lower parameter counts, is more efficient to train, and is easier to deploy. Motivated by the strong performance of the early-fusion architectures, we show that incorporating Mixture of Experts (MoEs) allows for models that learn modality-specific weights, significantly enhancing performance.

Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression

Hanqi Xiao,Yi-Lin Sung,Elias Stengel-Eskin,Mohit Bansal

Task: Develop a new mixed-precision post-training quantization method (TaCQ) that preserves model performance in low-bit (2- to 3-bit) settings.

Motivation: Post-training quantization (PTQ) reduces a model's memory footprint but can degrade performance, especially in low-bit settings.

Details

Method: Propose Task-Circuit Quantization (TaCQ), which identifies weight circuits associated with task performance, keeps those weights in 16 bits, and quantizes the rest. Result: TaCQ outperforms existing methods in 2- to 3-bit settings, recovers 96% of 16-bit performance, and performs well across multiple tasks. Conclusion: TaCQ is an efficient post-training quantization method that significantly improves performance in low-bit settings. Abstract: Post-training quantization (PTQ) reduces a model's memory footprint by mapping full precision weights into low bit weights without costly retraining, but can degrade its downstream performance, especially in low 2- to 3-bit settings. We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery, directly conditioning the quantization process on specific weight circuits -- which we define as sets of weights associated with downstream task performance. These weights are kept as 16-bit weights, while others are quantized, maintaining performance while only adding a marginal memory cost. Specifically, TaCQ contrasts unquantized model weights with a uniformly-quantized model to estimate the expected change in weights due to quantization and uses gradient information to predict the resulting impact on task performance, allowing us to preserve task-specific weights. We compare TaCQ-based quantization to existing mixed-precision quantization methods when conditioning both on general-purpose and task-specific data. Across QA, math reasoning, and text-to-SQL tasks for both Llama-3 and Qwen2.5, we find that TaCQ outperforms baselines using the same calibration data and a lower weight budget, achieving major improvements in the 2- and 3-bit regime. With only 3.1 bits we are able to recover 96% of Llama-3-8B-Instruct's unquantized 16-bit MMLU performance, obtaining a 5.25% absolute improvement over SPQR. We also observe consistently large gains over existing methods in the 2-bit regime, with an average gain of 14.74% over the strongest baseline, SliM-LLM. Moreover, we observe a 7.20% gain without conditioning on specific tasks, showing TaCQ's ability to identify important weights is not limited to task-conditioned settings.
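The weight-selection idea described in the abstract (contrast unquantized and uniformly quantized weights, then weight the change by gradient information) can be sketched on a flat list of weights. This is a minimal illustration under assumptions, not the authors' implementation: the first-order score |gradient x weight change|, the min-max uniform quantizer, and the function names are all chosen here for illustration.

```python
def uniform_quantize(weights, bits):
    """Round each weight to the nearest of 2**bits evenly spaced levels."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in weights]

def mixed_precision(weights, grads, bits, keep):
    """Quantize all weights except the `keep` most salient ones.

    Assumed saliency: |gradient * (quantized - original)|, a first-order
    estimate of how much quantizing one weight would change the task loss.
    """
    quantized = uniform_quantize(weights, bits)
    saliency = [abs(g * (q - w)) for w, q, g in zip(weights, quantized, grads)]
    ranked = sorted(range(len(weights)), key=lambda i: saliency[i], reverse=True)
    protected = set(ranked[:keep])
    mixed = [w if i in protected else q
             for i, (w, q) in enumerate(zip(weights, quantized))]
    return mixed, protected

# Hypothetical toy values; real gradients would come from calibration data.
weights = [-1.0, -0.4, 0.1, 0.35, 1.0]
grads = [0.5, 2.0, 0.1, 3.0, 0.2]
mixed, protected = mixed_precision(weights, grads, bits=2, keep=1)
print(protected, mixed)
```

In this toy case the weight at index 1 is protected: it has neither the largest quantization error nor the largest gradient, but the product of the two is the largest, which is the point of combining both signals.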

Perception-R1: Pioneering Perception Policy with Reinforcement Learning

En Yu,Kangheng Lin,Liang Zhao,Jisheng Yin,Yana Wei,Yuang Peng,Haoran Wei,Jianjian Sun,Chunrui Han,Zheng Ge,Xiangyu Zhang,Daxin Jiang,Jingyu Wang,Wenbing Tao

Task: Explore the potential of rule-based reinforcement learning (RL) for perception policy learning during post-training of multimodal large language models (MLLMs).

Motivation: Initial experiments show that introducing a thinking process through RL does not consistently improve performance on all visual perception tasks, which motivates a closer study of the essential role of RL in visual perception.

Details

Method: Propose the Perception-R1 framework, which uses GRPO during MLLM post-training to optimize perception tasks, and analyze how perceptual complexity and reward design affect the effectiveness of RL. Result: Perception-R1 achieves notable gains on several tasks, including RefCOCO+ (+4.2%), PixMo-Count (+17.9%), PageOCR (+4.2%), and COCO2017 val (31.9% AP). Conclusion: Perceptual complexity is a key factor in the effectiveness of RL, reward design is crucial for approaching the upper limit of model perception, and Perception-R1 establishes a strong baseline for perception policy learning. Abstract: Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in MLLM post-training for perception policy learning. While promising, our initial experiments reveal that incorporating a thinking process through RL does not consistently lead to performance gains across all visual perception tasks. This leads us to delve into the essential role of RL in the context of visual perception. In this work, we return to the fundamentals and explore the effects of RL on different perception tasks. We observe that the perceptual complexity is a major factor in determining the effectiveness of RL. We also observe that reward design plays a crucial role in further approaching the upper limit of model perception. To leverage these findings, we propose Perception-R1, a scalable RL framework using GRPO during MLLM post-training. With a standard Qwen2.5-VL-3B-Instruct, Perception-R1 achieves +4.2% on RefCOCO+, +17.9% on PixMo-Count, +4.2% on PageOCR, and notably, 31.9% AP on COCO2017 val for the first time, establishing a strong baseline for perception policy learning.

Leveraging LLMs for Multimodal Retrieval-Augmented Radiology Report Generation via Key Phrase Extraction

Kyoyun Choi,Byungmu Yoon,Soobum Kim,Jonggwon Park

Task: Propose a retrieval-augmented generation approach for automated radiology report generation that mitigates hallucinations and reduces computational demands.

Motivation: Multimodal large language models (MLLMs) are resource-intensive, requiring vast datasets and substantial computational cost, so a more efficient approach is needed.

Details

Method: Combine multimodal retrieval with LLMs, extract key phrases from reports, and apply strategies including image encoder structure search, adding noise to text embeddings, and contrastive learning. Result: On the MIMIC-CXR dataset, the method achieves state-of-the-art results on CheXbert metrics and competitive RadGraph F1, without fine-tuning the LLM. Conclusion: The method generalizes robustly to multi-view radiology report generation, making it suitable for clinical applications. Abstract: Automated radiology report generation (RRG) holds potential to reduce radiologists' workload, especially as recent advancements in large language models (LLMs) enable the development of multimodal models for chest X-ray (CXR) report generation. However, multimodal LLMs (MLLMs) are resource-intensive, requiring vast datasets and substantial computational cost for training. To address these challenges, we propose a retrieval-augmented generation approach that leverages multimodal retrieval and LLMs to generate radiology reports while mitigating hallucinations and reducing computational demands. Our method uses LLMs to extract key phrases from radiology reports, effectively focusing on essential diagnostic information. Through exploring effective training strategies, including image encoder structure search, adding noise to text embeddings, and additional training objectives, we combine complementary pre-trained image encoders and adopt contrastive learning between text and semantic image embeddings. We evaluate our approach on the MIMIC-CXR dataset, achieving state-of-the-art results on CheXbert metrics and a competitive RadGraph F1 metric alongside MLLMs, without requiring LLM fine-tuning. Our method demonstrates robust generalization for multi-view RRG, making it suitable for comprehensive clinical applications.

BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation

Yuanhong Yu,Xingyi He,Chen Zhao,Junhao Yu,Jiaqi Yang,Ruizhen Hu,Yujun Shen,Xing Zhu,Xiaowei Zhou,Sida Peng

Task: Propose a generalizable RGB-based method for object pose estimation in sparse-view settings.

Motivation: Existing methods generalize poorly in scenarios with occlusions and sparse reference views, which limits their real-world applicability.

Details

Method: Introduce the corner points of the object bounding box as an intermediate representation, estimate the 2D corners in the target view with a reference-based point synthesizer, establish correspondences with the 3D corners, and estimate the pose with a PnP algorithm. Result: Experiments on the YCB-Video and Occluded-LINEMOD datasets show that the method outperforms existing approaches. Conclusion: The proposed representation significantly improves the generalization of object pose estimation, which is crucial for real-world applications. Abstract: This paper presents a generalizable RGB-based approach for object pose estimation, specifically designed to address challenges in sparse-view settings. While existing methods can estimate the poses of unseen objects, their generalization ability remains limited in scenarios involving occlusions and sparse reference views, restricting their real-world applicability. To overcome these limitations, we introduce corner points of the object bounding box as an intermediate representation of the object pose. The 3D object corners can be reliably recovered from sparse input views, while the 2D corner points in the target view are estimated through a novel reference-based point synthesizer, which works well even in scenarios involving occlusions. As object semantic points, object corners naturally establish 2D-3D correspondences for object pose estimation with a PnP algorithm. Extensive experiments on the YCB-Video and Occluded-LINEMOD datasets show that our approach outperforms state-of-the-art methods, highlighting the effectiveness of the proposed representation and significantly enhancing the generalization capabilities of object pose estimation, which is crucial for real-world applications.

RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Radiology with Zero-Shot Multi-Task Capability

Jonggwon Park,Soobum Kim,Byungmu Yoon,Kyoyun Choi

Task: Propose RadZero, a framework that addresses the challenges of vision-language alignment in radiology and offers zero-shot multi-task capability.

Motivation: Existing methods fall short in making use of complex radiology reports, rely on low-resolution images, and offer limited interpretability in their attention mechanisms.

Details

Method: RadZero uses large language models to extract semantic sentences, applies a multi-positive contrastive learning strategy, processes high-resolution images with a pre-trained vision encoder plus trainable Transformer layers, and performs zero-shot inference through similarity computation. Result: On public chest radiograph benchmarks, RadZero outperforms existing methods in zero-shot classification, grounding, and segmentation, and similarity map analysis indicates improved explainability. Conclusion: RadZero performs well in medical imaging and supports open-vocabulary semantic segmentation, which further validates its effectiveness. Abstract: Recent advancements in multi-modal models have significantly improved vision-language alignment in radiology. However, existing approaches struggle to effectively utilize complex radiology reports for learning, rely on low-resolution images, and offer limited interpretability in attention mechanisms. To address these challenges, we introduce RadZero, a novel similarity-based cross-attention framework for vision-language alignment in radiology with zero-shot multi-task capability. RadZero leverages large language models to extract minimal semantic sentences from radiology reports and employs a multi-positive contrastive learning strategy to effectively capture relationships between images and multiple relevant textual descriptions. It also utilizes a pre-trained vision encoder with additional trainable Transformer layers, allowing efficient high-resolution image processing. By computing similarity between text embeddings and local image patch features, RadZero enables zero-shot inference with similarity probability for classification and pixel-level cross-modal similarity maps for grounding and segmentation. Experimental results on public chest radiograph benchmarks show that RadZero outperforms state-of-the-art methods in zero-shot classification, grounding, and segmentation. Furthermore, cross-modal similarity map analysis highlights its potential for improving explainability in vision-language alignment. Additionally, qualitative evaluation demonstrates RadZero's capability for open-vocabulary semantic segmentation, further validating its effectiveness in medical imaging.
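The similarity computation between a text embedding and local image patch features can be sketched as follows. This is an illustrative toy, assuming cosine similarity and max-pooling for the image-level score; the abstract specifies neither, and the embeddings below are made-up 2-D vectors rather than outputs of any real encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_map(text_emb, patch_grid):
    """Return a grid of text-to-patch similarities, one value per patch."""
    return [[cosine(text_emb, patch) for patch in row] for row in patch_grid]

# Hypothetical embeddings: one text query and a 2x2 grid of patch features.
text_emb = [1.0, 0.0]
patch_grid = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [1.0, 1.0]],
]
sim = similarity_map(text_emb, patch_grid)
# Assumed pooling: the best-matching patch gives the image-level score.
image_score = max(max(row) for row in sim)
print(sim, image_score)
```

The same grid serves both uses named in the abstract: thresholding or upsampling it gives a localization map, while pooling it gives a single score for zero-shot classification.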

VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning

Yukun Qi,Yiming Zhao,Yu Zeng,Xikun Bao,Wenxuan Huang,Lin Chen,Zehui Chen,Jie Zhao,Zhongang Qi,Feng Zhao

Task: Propose VCR-Bench, a new benchmark for comprehensively evaluating the video Chain-of-Thought reasoning capabilities of large vision-language models (LVLMs).

Motivation: Current video benchmarks cannot adequately assess the reasoning process or distinguish deficiencies in perception from deficiencies in reasoning, so a more rigorous evaluation framework is needed.

Details

Method: VCR-Bench contains 859 videos and 1,034 high-quality question-answer pairs, each annotated with a stepwise Chain-of-Thought rationale, along with seven task dimensions and a CoT score. Result: Experiments show that current LVLMs perform poorly: the best model reaches only a 62.8% CoT score and 56.7% accuracy, and perception is the main bottleneck. Conclusion: VCR-Bench confirms the critical role of Chain-of-Thought reasoning in complex video reasoning tasks and can serve as a standardized evaluation framework. Abstract: The advancement of Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs) and large vision-language models (LVLMs). However, a rigorous evaluation framework for video CoT reasoning remains absent. Current video benchmarks fail to adequately assess the reasoning process and expose whether failures stem from deficiencies in perception or reasoning capabilities. Therefore, we introduce VCR-Bench, a novel benchmark designed to comprehensively evaluate LVLMs' Video Chain-of-Thought Reasoning capabilities. VCR-Bench comprises 859 videos spanning a variety of video content and durations, along with 1,034 high-quality question-answer pairs. Each pair is manually annotated with a stepwise CoT rationale, where every step is tagged to indicate its association with the perception or reasoning capabilities. Furthermore, we design seven distinct task dimensions and propose the CoT score to assess the entire CoT process based on the stepwise tagged CoT rationales. Extensive experiments on VCR-Bench highlight substantial limitations in current LVLMs. Even the top-performing model, o1, only achieves a 62.8% CoT score and a 56.7% accuracy, while most models score below 40%. Experiments show most models score lower on perception than reasoning steps, revealing LVLMs' key bottleneck in temporal-spatial information processing for complex video reasoning. A robust positive correlation between the CoT score and accuracy confirms the validity of our evaluation framework and underscores the critical role of CoT reasoning in solving complex video reasoning tasks. We hope VCR-Bench will serve as a standardized evaluation framework and expose the actual drawbacks in complex video reasoning tasks.

LLM4Ranking: An Easy-to-use Framework of Utilizing Large Language Models for Document Reranking

Qi Liu,Haozhe Duan,Yiqun Chen,Quanfeng Lu,Weiwei Sun,Jiaxin Mao

Task: Propose LLM4Ranking, a unified framework for document reranking with open-source or closed-source API-based large language models (LLMs).

Motivation: Document reranking with LLMs has become a popular research direction in recent years, but there is no unified framework that supports applying and evaluating different methods and models.

Details

Method: Develop LLM4Ranking, a simple and extensible framework that provides an interface along with evaluation and fine-tuning scripts. Result: Various models and methods are evaluated on several datasets, yielding reproducible results. Conclusion: LLM4Ranking offers a practical and efficient solution for applying LLMs to document reranking. Abstract: Utilizing large language models (LLMs) for document reranking has been a popular and promising research direction in recent years, and many studies are dedicated to improving the performance and efficiency of using LLMs for reranking. Besides, it can also be applied in many real-world applications, such as search engines or retrieval-augmented generation. In response to the growing demand for research and application in practice, we introduce a unified framework, LLM4Ranking, which enables users to adopt different ranking methods using open-source or closed-source API-based LLMs. Our framework provides a simple and extensible interface for document reranking with LLMs, as well as easy-to-use evaluation and fine-tuning scripts for this task. We conducted experiments based on this framework and evaluated various models and methods on several widely used datasets, providing reproducible results on utilizing LLMs for document reranking. Our code is publicly available at https://github.com/liuqi6777/llm4ranking.

MM-IFEngine: Towards Multimodal Instruction Following

Shengyuan Ding,Shenxi Wu,Xiangyu Zhao,Yuhang Zang,Haodong Duan,Xiaoyi Dong,Pan Zhang,Yuhang Cao,Dahua Lin,Jiaqi Wang

Task: Propose the MM-IFEngine pipeline to generate high-quality image-instruction pairs, and build training data and a benchmark for multimodal instruction following.

Motivation: Existing multimodal instruction-following training data is scarce, benchmarks are simple, and evaluation strategies are imprecise, which falls short for tasks that demand exact output constraints.

Details

Method: Use MM-IFEngine to generate the large-scale, diverse, high-quality training set MM-IFInstruct-23k and extend it to MM-IFDPO-23k for DPO; also propose the MM-IFEval benchmark, which includes compose-level and perception-level constraints and a comprehensive evaluation pipeline. Result: Fine-tuned MLLMs improve notably on MM-IFEval, MIA, and IFEval (+10.2%, +7.6%, and +12.3%, respectively). Conclusion: MM-IFEngine and MM-IFEval provide high-quality data and an evaluation framework for multimodal instruction following and significantly improve model performance. Abstract: The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it right. Existing multimodal instruction following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To address this, we present MM-IFEngine, an effective pipeline to generate high-quality image-instruction pairs. Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). We further introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that includes (1) both compose-level constraints for output responses and perception-level constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating both rule-based assessment and a judge model. We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+10.2%), MIA (+7.6%), and IFEval (+12.3%). The full data and evaluation code will be released on https://github.com/SYuan03/MM-IFEngine.

LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation

Juzheng Zhang,Jiacheng You,Ashwinee Panda,Tom Goldstein

Task: Propose LoRI, a parameter-efficient fine-tuning method that reduces parameter interference and computational overhead in multi-task scenarios.

Motivation: LoRA incurs notable overhead and suffers from parameter interference in multi-task scenarios, so a more efficient solution is needed.

Details

Method: Freeze the projection matrices A as random projections and sparsify the matrices B with task-specific masks to reduce the number of trainable parameters, while exploiting subspace orthogonality to reduce cross-task interference. Result: LoRI outperforms full fine-tuning and existing PEFT methods across a range of tasks, with up to 95% fewer trainable parameters than LoRA. Conclusion: LoRI is a simple and effective parameter-efficient fine-tuning method suited to multi-task and continual learning scenarios. Abstract: Low-Rank Adaptation (LoRA) has emerged as a popular parameter-efficient fine-tuning (PEFT) method for Large Language Models (LLMs), yet it still incurs notable overhead and suffers from parameter interference in multi-task scenarios. We propose LoRA with Reduced Interference (LoRI), a simple yet effective approach that freezes the projection matrices $A$ as random projections and sparsifies the matrices $B$ using task-specific masks. This design substantially reduces the number of trainable parameters while maintaining strong task performance. Moreover, LoRI minimizes cross-task interference in adapter merging by leveraging the orthogonality between adapter subspaces, and supports continual learning by using sparsity to mitigate catastrophic forgetting. Extensive experiments across natural language understanding, mathematical reasoning, code generation, and safety alignment tasks demonstrate that LoRI outperforms full fine-tuning and existing PEFT methods, while using up to 95% fewer trainable parameters than LoRA. In multi-task experiments, LoRI enables effective adapter merging and continual learning with reduced cross-task interference. Code is available at: https://github.com/juzhengz/LoRI
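The structure of a LoRI-style update (frozen random $A$, masked $B$) can be sketched in plain Python. This is a toy illustration under assumptions: the low-rank update is taken as (B ⊙ M)·A following the usual LoRA convention, and the mask below is hand-written, since the abstract does not say how the task-specific masks are chosen.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lori_delta(B, mask, A):
    """Weight update (B * mask) @ A, with the mask applied elementwise to B."""
    masked_B = [[b * m for b, m in zip(b_row, m_row)]
                for b_row, m_row in zip(B, mask)]
    return matmul(masked_B, A)

random.seed(0)
d_out, d_in, rank = 4, 4, 2
# A is a random projection that stays frozen (never updated in training).
A = [[random.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(rank)]
# B is trainable only where the hypothetical task mask is 1.
B = [[0.5, -0.2], [0.1, 0.3], [0.0, 0.7], [-0.4, 0.2]]
mask = [[1, 0], [0, 0], [0, 1], [0, 0]]

delta = lori_delta(B, mask, A)
lora_trainable = rank * (d_in + d_out)       # LoRA trains both A and B
lori_trainable = sum(sum(row) for row in mask)  # only unmasked entries of B
print(lora_trainable, lori_trainable)
```

With these toy sizes, LoRA would train 16 parameters and the masked variant trains 2. Rows of the update whose mask entries are all zero stay exactly zero, which hints at why adapters with different masks interfere less when merged.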

Detect Anything 3D in the Wild

Hanxue Zhang,Haoran Jiang,Qingsong Yao,Yanan Sun,Renrui Zhang,Hao Zhao,Hongyang Li,Hongzi Zhu,Zetong Yang

Task: Propose DetAny3D, a promptable 3D detection foundation model that detects any novel object under arbitrary camera configurations.

Motivation: Existing deep learning methods generalize poorly in the zero-shot setting to novel objects and camera configurations, and annotated 3D data is limited.

Details

Method: Leverage the knowledge of pre-trained 2D foundation models and transfer it from 2D to 3D through the 2D Aggregator and 3D Interpreter modules. Result: DetAny3D achieves state-of-the-art performance on unseen categories and novel camera configurations and surpasses most competitors on in-domain data. Conclusion: DetAny3D shows the potential of 3D foundation models in real-world scenarios, such as rare object detection in autonomous driving. Abstract: Despite the success of deep learning in closed-set 3D object detection, existing approaches struggle with zero-shot generalization to novel objects and camera configurations. We introduce DetAny3D, a promptable 3D detection foundation model capable of detecting any novel object under arbitrary camera configurations using only monocular inputs. Training a foundation model for 3D detection is fundamentally constrained by the limited availability of annotated 3D data, which motivates DetAny3D to leverage the rich prior knowledge embedded in extensively pre-trained 2D foundation models to compensate for this scarcity. To effectively transfer 2D knowledge to 3D, DetAny3D incorporates two core modules: the 2D Aggregator, which aligns features from different 2D foundation models, and the 3D Interpreter with Zero-Embedding Mapping, which mitigates catastrophic forgetting in 2D-to-3D knowledge transfer. Experimental results validate the strong generalization of our DetAny3D, which not only achieves state-of-the-art performance on unseen categories and novel camera configurations, but also surpasses most competitors on in-domain data. DetAny3D sheds light on the potential of the 3D foundation model for diverse applications in real-world scenarios, e.g., rare object detection in autonomous driving, and demonstrates promise for further exploration of 3D-centric tasks in open-world settings. More visualization results can be found at the DetAny3D project page.

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen,Peng Liu,Jingcheng Li,Chunxin Fang,Yibo Ma,Jiajia Liao,Qiaoli Shen,Zilun Zhang,Kangjia Zhao,Qianqian Zhang,Ruochen Xu,Tiancheng Zhao

Task: Study how reinforcement learning (RL) can improve the visual reasoning capabilities of vision-language models (VLMs).

Motivation: Visual understanding tasks usually come with well-defined annotations, which makes them suitable for rule-based reward mechanisms, so the DeepSeek R1 approach can be extended to the visual domain.

Details

Method: Develop the VLM-R1 framework, which uses reinforcement learning to improve VLM performance on vision-language tasks. Result: The RL-based model performs competitively on visual understanding tasks and surpasses supervised fine-tuning (SFT) in generalization; ablation studies reveal key phenomena in the reward mechanism. Conclusion: Reinforcement learning effectively improves the capabilities of vision-language models, and the findings and open-source contributions are expected to support further progress in vision-language RL. Abstract: Recently DeepSeek R1 has shown that reinforcement learning (RL) can substantially improve the reasoning capabilities of Large Language Models (LLMs) through a simple yet effective design. The core of R1 lies in its rule-based reward formulation, which leverages tasks with deterministic ground-truth answers to enable precise and stable reward computation. In the visual domain, we similarly observe that a wide range of visual understanding tasks are inherently equipped with well-defined ground-truth annotations. This property makes them naturally compatible with rule-based reward mechanisms. Motivated by this observation, we investigate the extension of R1-style reinforcement learning to Vision-Language Models (VLMs), aiming to enhance their visual reasoning capabilities. To this end, we develop VLM-R1, a dedicated framework designed to harness RL for improving VLMs' performance on general vision-language tasks. Using this framework, we further explore the feasibility of applying RL to the visual domain. Experimental results indicate that the RL-based model not only delivers competitive performance on visual understanding tasks but also surpasses Supervised Fine-Tuning (SFT) in generalization ability. Furthermore, we conduct comprehensive ablation studies that uncover a series of noteworthy insights, including the presence of reward hacking in object detection, the emergence of the "OD aha moment", the impact of training data quality, and the scaling behavior of RL across different model sizes. Through these analyses, we aim to deepen the understanding of how reinforcement learning enhances the capabilities of vision-language models, and we hope our findings and open-source contributions will support continued progress in the vision-language RL community. Our code and model are available at https://github.com/om-ai-lab/VLM-R1
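A rule-based reward of the kind the abstract describes, computed directly from a ground-truth annotation, might look like the sketch below for a single-box localization task. The IoU threshold of 0.5, the binary reward, and the function names are assumptions for illustration; the abstract does not specify the reward used by VLM-R1.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def box_reward(pred_box, gt_box, threshold=0.5):
    """Deterministic reward: 1.0 if the predicted box overlaps enough, else 0.0.

    `pred_box` is None when the model output could not be parsed into a box,
    which is treated as a failed answer.
    """
    if pred_box is None:
        return 0.0
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0

gt = (0.0, 0.0, 10.0, 10.0)
print(box_reward((0.0, 0.0, 10.0, 5.0), gt))     # half of the box is covered
print(box_reward((20.0, 20.0, 30.0, 30.0), gt))  # no overlap
print(box_reward(None, gt))                      # unparseable output
```

Because the reward depends only on the annotation and a fixed rule, it needs no learned reward model, which is the property the abstract highlights as making visual tasks compatible with R1-style training.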

CCMNet: Leveraging Calibrated Color Correction Matrices for Cross-Camera Color Constancy

Dongyoung Kim,Mahmoud Afifi,Dongyun Kim,Michael S. Brown,Seon Joo Kim

Task: Propose a learning-based method for cross-camera color constancy that adapts to new cameras without retraining.

Motivation: White balance algorithms operate in the camera-specific raw color space and therefore must adapt to different cameras, yet existing methods struggle to generalize across cameras.

Details

Method: Use the pre-calibrated color correction matrices (CCMs) available on ISPs to map predefined illumination colors into the test camera's raw space, and adapt the network to new cameras through a camera fingerprint embedding (CFE). Result: Across multiple datasets and backbones, the method achieves state-of-the-art cross-camera color constancy while remaining lightweight and relying only on data readily available in ISPs. Conclusion: With the CFE and a data augmentation technique, the method effectively addresses cross-camera color constancy and has practical value. Abstract: Computational color constancy, or white balancing, is a key module in a camera's image signal processor (ISP) that corrects color casts from scene lighting. Because this operation occurs in the camera-specific raw color space, white balance algorithms must adapt to different cameras. This paper introduces a learning-based method for cross-camera color constancy that generalizes to new cameras without retraining. Our method leverages pre-calibrated color correction matrices (CCMs) available on ISPs that map the camera's raw color space to a standard space (e.g., CIE XYZ). Our method uses these CCMs to transform predefined illumination colors (i.e., along the Planckian locus) into the test camera's raw space. The mapped illuminants are encoded into a compact camera fingerprint embedding (CFE) that enables the network to adapt to unseen cameras. To prevent overfitting due to limited cameras and CCMs during training, we introduce a data augmentation technique that interpolates between cameras and their CCMs. Experimental results across multiple datasets and backbones show that our method achieves state-of-the-art cross-camera color constancy while remaining lightweight and relying only on data readily available in camera ISPs.
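Since a CCM maps the camera's raw space to a standard space, moving a predefined illuminant in the opposite direction amounts to applying the inverse matrix. The sketch below shows only that step, assuming a plain 3x3 inverse; the matrix and illuminant values are made up for illustration, and the learned encoding into a CFE is not reproduced here.

```python
def inverse_3x3(m):
    """Invert a 3x3 matrix via the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if det == 0:
        raise ValueError("matrix is singular")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]

def apply_3x3(m, v):
    """Multiply a 3x3 matrix by a length-3 vector."""
    return [sum(m[r][k] * v[k] for k in range(3)) for r in range(3)]

def xyz_to_raw(ccm_raw_to_xyz, xyz):
    """Map a color from the standard space back into the camera raw space."""
    return apply_3x3(inverse_3x3(ccm_raw_to_xyz), xyz)

# Hypothetical CCM (raw -> XYZ) and illuminant; not real calibration data.
ccm = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 4.0]]
illuminant_xyz = [1.0, 1.0, 1.0]
illuminant_raw = xyz_to_raw(ccm, illuminant_xyz)
print(illuminant_raw)
```

Repeating this for a set of illuminants yields a camera-specific set of raw colors, which is the kind of input the abstract says is encoded into the fingerprint embedding.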

CollEX -- A Multimodal Agentic RAG System Enabling Interactive Exploration of Scientific Collections

Florian Schneider,Narges Baba Ahmadi,Niloufar Baba Ahmadi,Iris Vogel,Martin Semmann,Chris Biemann

Task: Introduce and evaluate CollEx, a multimodal agentic Retrieval-Augmented Generation system that enhances interactive exploration of large scientific collections.

Motivation: Given the volume and complexity of scientific collections, conventional search systems lack intuitiveness and interactivity, creating barriers for users.

Details

Method: Employ state-of-the-art Large Vision-Language Models (LVLMs) as multimodal agents and abstract complex interactions behind an intuitive chat interface. Result: CollEx significantly simplifies access to diverse scientific collections, supports educational scenarios, and helps surface interdisciplinary connections. Conclusion: Through multimodal integration and agentic techniques, CollEx improves the experience of exploring scientific collections and has value for education and research. Abstract: In this paper, we introduce CollEx, an innovative multimodal agentic Retrieval-Augmented Generation (RAG) system designed to enhance interactive exploration of extensive scientific collections. Given the overwhelming volume and inherent complexity of scientific collections, conventional search systems often lack necessary intuitiveness and interactivity, presenting substantial barriers for learners, educators, and researchers. CollEx addresses these limitations by employing state-of-the-art Large Vision-Language Models (LVLMs) as multimodal agents accessible through an intuitive chat interface. By abstracting complex interactions via specialized agents equipped with advanced tools, CollEx facilitates curiosity-driven exploration, significantly simplifying access to diverse scientific collections and records therein. Our system integrates textual and visual modalities, supporting educational scenarios that are helpful for teachers, pupils, students, and researchers by fostering independent exploration as well as scientific excitement and curiosity. Furthermore, CollEx serves the research community by discovering interdisciplinary connections and complementing visual data. We illustrate the effectiveness of our system through a proof-of-concept application containing over 64,000 unique records across 32 collections from a local scientific collection at a public university.

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

Zhong-Yu Li,Ruoyi Du,Juncheng Yan,Le Zhuo,Zhen Li,Peng Gao,Zhanyu Ma,Ming-Ming Cheng

Task: Propose VisualCloze, a universal image generation framework that supports a wide range of tasks.

Motivation: Task-specific models have limited efficiency, while universal models face challenges in generalizable task instruction, task distribution, and unified architectural design.

Details

Method: Integrates visual in-context learning, uses the Graph200K dataset to increase task density and transferable knowledge, and leverages the generative priors of pre-trained infilling models. Result: VisualCloze supports a broad range of tasks, including unseen tasks and unification of multiple tasks, without modifying the architecture. Conclusion: VisualCloze addresses the challenges of universal image generation models and is efficient and flexible. Abstract: Recent progress in diffusion models significantly advances various image generation tasks. However, the current mainstream approach remains focused on building task-specific models, which have limited efficiency when supporting a wide range of different needs. While universal models attempt to address this limitation, they face critical challenges, including generalizable task instruction, appropriate task distributions, and unified architectural design. To tackle these challenges, we propose VisualCloze, a universal image generation framework, which supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation. Unlike existing methods that rely on language-based task instruction, leading to task ambiguity and weak generalization, we integrate visual in-context learning, allowing models to identify tasks from visual demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge. Furthermore, we uncover that our unified image generation formulation shares a consistent objective with image infilling, enabling us to leverage the strong generative priors of pre-trained infilling models without modifying the architectures.

Zero-Shot Cross-Domain Code Search without Fine-Tuning

Keyu Liang,Zhongxin Liu,Chao Liu,Zhiyuan Wan,David Lo,Xiaohu Yang

Task: Address zero-shot cross-domain code search with a fine-tuning-free approach.

Motivation: Existing methods such as RAPID require substantial computational resources and specialized fine-tuned models for zero-shot cross-domain code search, so a more efficient approach is needed.

Details

Method: Decomposes query-code matching into query-comment matching and code-code matching, uses large language models to generate comments and pseudo-code, and combines multiple matching schemas. Result: Across three datasets, CodeBridge's average MRR is 21.4% and 24.9% higher than that of the existing methods CoCoSoDa and UniXcoder, respectively, and its results are comparable to or better than those of RAPID, which requires fine-tuning. Conclusion: CodeBridge is an efficient, fine-tuning-free approach to zero-shot cross-domain code search that substantially outperforms existing PLM-based approaches. Abstract: Code search aims to retrieve semantically relevant code snippets for natural language queries. While pre-trained language models (PLMs) have shown remarkable performance in this task, they struggle in cross-domain scenarios, often requiring costly fine-tuning or facing performance drops in zero-shot settings. RAPID, which generates synthetic data for model fine-tuning, is currently the only effective method for zero-shot cross-domain code search. Despite its effectiveness, RAPID demands substantial computational resources for fine-tuning and needs to maintain specialized models for each domain, underscoring the need for a zero-shot, fine-tuning-free approach for cross-domain code search. The key to tackling zero-shot cross-domain code search lies in bridging the gaps among domains. In this work, we propose to break the query-code matching process of code search into two simpler tasks: query-comment matching and code-code matching. Our empirical study reveals the strong complementarity among the three matching schemas in zero-shot cross-domain settings, i.e., query-code, query-comment, and code-code matching. Based on the findings, we propose CodeBridge, a zero-shot, fine-tuning-free approach for cross-domain code search. Specifically, CodeBridge uses Large Language Models (LLMs) to generate comments and pseudo-code, then combines query-code, query-comment, and code-code matching via PLM-based similarity scoring and sampling-based fusion. Experimental results show that our approach outperforms the state-of-the-art PLM-based code search approaches, i.e., CoCoSoDa and UniXcoder, by an average of 21.4% and 24.9% in MRR, respectively, across three datasets.
Our approach also yields results that are better than or comparable to those of the zero-shot cross-domain code search approach RAPID, which requires costly fine-tuning.
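
To make the three matching schemas and the MRR metric concrete, here is a minimal sketch. It fuses per-candidate scores by a plain average, which is a stand-in assumption and not the sampling-based fusion the abstract mentions; the function names and the toy scores are hypothetical.

```python
def fuse_scores(query_code, query_comment, code_code):
    """Average three per-candidate similarity lists (simple stand-in fusion)."""
    return [(a + b + c) / 3.0
            for a, b, c in zip(query_code, query_comment, code_code)]


def reciprocal_rank(scores, relevant_index):
    """1 / rank of the relevant candidate, where rank 1 is the top score."""
    rank = 1 + sum(1 for s in scores if s > scores[relevant_index])
    return 1.0 / rank


def mean_reciprocal_rank(score_lists, relevant_indices):
    """Mean of the reciprocal ranks over all queries."""
    rr = [reciprocal_rank(s, r) for s, r in zip(score_lists, relevant_indices)]
    return sum(rr) / len(rr)


# Toy example: two candidate snippets for one query; candidate 0 is relevant.
# Query-code matching alone ranks it second, but the fused score ranks it first.
fused = fuse_scores(
    query_code=[0.2, 0.9],      # direct query-to-code similarity
    query_comment=[0.8, 0.1],   # query vs. a generated comment for each snippet
    code_code=[0.9, 0.2],       # generated pseudo-code vs. each snippet
)
```

In this toy case the complementary signals correct a ranking that query-code similarity alone gets wrong, which is the kind of complementarity the abstract describes.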

Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction

Zeren Jiang,Chuanxia Zheng,Iro Laina,Diane Larlus,Andrea Vedaldi

Task: Use video diffusion models for monocular 3D reconstruction of dynamic scenes.

Motivation: By exploiting the dynamic prior captured by video models, a model trained only on synthetic data can generalize zero-shot to real data.

Details

Method: Geo4D predicts several complementary geometric modalities (point, depth, and ray maps) and fuses them using a multi-modal alignment algorithm and sliding windows. Result: Significantly outperforms existing methods on multiple benchmarks, including MonST3R, which is designed for dynamic scenes. Conclusion: Geo4D achieves robust and accurate 4D reconstruction and is applicable to long videos. Abstract: We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic prior captured by such video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, depth, and ray maps. It uses a new multi-modal alignment algorithm to align and fuse these modalities, as well as multiple sliding windows, at inference time, thus obtaining robust and accurate 4D reconstruction of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods, including recent methods such as MonST3R, which are also designed to handle dynamic scenes.

Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems

Simon Lermen,Mateusz Dziemian,Natalia Pérez-Campanero Antolín

Task: Study how AI agents can coordinate to deceive oversight systems through automated interpretability of neural networks.

Motivation: Show that language models can generate deceptive explanations that evade detection, and examine the potential harms.

Details

Method: Uses sparse autoencoders (SAEs) as the experimental framework to test the ability of language models (Llama, DeepSeek R1, and Claude 3.7 Sonnet) to generate deceptive explanations. Result: All tested language model agents successfully deceived the oversight model while maintaining explanation quality comparable to reference labels. Conclusion: Proposes mitigation strategies and emphasizes the need for robust understanding of and defenses against deception. Abstract: We demonstrate how AI agents can coordinate to deceive oversight systems using automated interpretability of neural networks. Using sparse autoencoders (SAEs) as our experimental framework, we show that language models (Llama, DeepSeek R1, and Claude 3.7 Sonnet) can generate deceptive explanations that evade detection. Our agents employ steganographic methods to hide information in seemingly innocent explanations, successfully fooling oversight models while achieving explanation quality comparable to reference labels. We further find that models can scheme to develop deceptive strategies when they believe the detection of harmful features might lead to negative consequences for themselves. All tested LLM agents were capable of deceiving the overseer while achieving high interpretability scores comparable to those of reference labels. We conclude by proposing mitigation strategies, emphasizing the critical need for robust understanding and defenses against deception.

GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation

Lang Lin,Xueyang Yu,Ziqi Pang,Yu-Xiong Wang

Task: Propose GLUS, a new framework based on multi-modal large language models (MLLMs) for referring video object segmentation (RefVOS).

Motivation: Existing MLLM-based methods face a tension between global reasoning (understanding key frames) and local reasoning (tracking across continuous frames), and rely on external tools.

Details

Method: GLUS provides global information through sparse "context frames", performs local tracking on continuous "query frames", and is jointly trained with a pre-trained VOS memory bank. Result: Achieves new state-of-the-art performance on the MeViS and Ref-Youtube-VOS benchmarks. Conclusion: GLUS is simple and effective, unifies global and local consistency, and provides a new baseline for MLLMs on RefVOS. Abstract: This paper proposes a novel framework utilizing multi-modal large language models (MLLMs) for referring video object segmentation (RefVOS). Previous MLLM-based methods commonly struggle with the dilemma between "Ref" and "VOS": they either specialize in understanding a few key frames (global reasoning) or tracking objects on continuous frames (local reasoning), and rely on external VOS or frame selectors to mitigate the other end of the challenge. However, our framework GLUS shows that global and local consistency can be unified into a single video segmentation MLLM: a set of sparse "context frames" provides global information, while a stream of continuous "query frames" conducts local object tracking. This is further supported by jointly training the MLLM with a pre-trained VOS memory bank to simultaneously digest short-range and long-range temporal information. To improve the information efficiency within the limited context window of MLLMs, we introduce object contrastive learning to distinguish hard false-positive objects and a self-refined framework to identify crucial frames and perform propagation. By collectively integrating these insights, our GLUS delivers a simple yet effective baseline, achieving new state-of-the-art for MLLMs on the MeViS and Ref-Youtube-VOS benchmarks. Our project page is at https://glus-video.github.io/.

Understanding Learner-LLM Chatbot Interactions and the Impact of Prompting Guidelines

Cansu Koyuturk,Emily Theophilou,Sabrina Patania,Gregor Donabauer,Andrea Martinenghi,Chiara Antico,Alessia Telari,Alessia Testa,Sathya Bursic,Franca Garzotto,Davinia Hernandez-Leo,Udo Kruschwitz,Davide Taibi,Simona Amenta,Martin Ruskov,Dimitri Ognibene

Task: Study how structured prompting guidance can improve users' interactions with large language models (LLMs).

Motivation: Although LLMs excel at natural language interaction, users often receive inefficient responses because of imprecise prompts, which calls for investigating how guidance can help.

Details

Method: Compares three types of prompting guidelines (a task-specific framework and two baselines) in an educational experiment, analyzing 642 interactions and using the Von NeuMidas annotation schema to categorize errors and behavioral patterns. Result: The study finds that structured prompting guidance significantly improves user behavior, adherence to prompting strategies, and AI response quality. Conclusion: Structured prompting guidance helps improve users' ability to interact with LLMs, with implications for AI literacy, chatbot usability, and the design of responsive AI. Abstract: Large Language Models (LLMs) have transformed human-computer interaction by enabling natural language-based communication with AI-powered chatbots. These models are designed to be intuitive and user-friendly, allowing users to articulate requests with minimal effort. However, despite their accessibility, studies reveal that users often struggle with effective prompting, resulting in inefficient responses. Existing research has highlighted both the limitations of LLMs in interpreting vague or poorly structured prompts and the difficulties users face in crafting precise queries. This study investigates learner-AI interactions through an educational experiment in which participants receive structured guidance on effective prompting. We introduce and compare three types of prompting guidelines: a task-specific framework developed through a structured methodology and two baseline approaches. To assess user behavior and prompting efficacy, we analyze a dataset of 642 interactions from 107 users. Using Von NeuMidas, an extended pragmatic annotation schema for LLM interaction analysis, we categorize common prompting errors and identify recurring behavioral patterns. We then evaluate the impact of different guidelines by examining changes in user behavior, adherence to prompting strategies, and the overall quality of AI-generated responses. Our findings provide a deeper understanding of how users engage with LLMs and the role of structured prompting guidance in enhancing AI-assisted communication. By comparing different instructional frameworks, we offer insights into more effective approaches for improving user competency in AI interactions, with implications for AI literacy, chatbot usability, and the design of more responsive AI systems.

PixelFlow: Pixel-Space Generative Models with Flow

Shoufa Chen,Chongjian Ge,Shilong Zhang,Peize Sun,Ping Luo

Task: Propose PixelFlow, a family of image generation models that operate directly in raw pixel space.

Motivation: Simplify image generation by eliminating the need for a pre-trained variational autoencoder (VAE) and making the whole model end-to-end trainable.

Details

Method: Uses efficient cascade flow modeling to achieve affordable computation cost in pixel space. Result: Achieves an FID of 1.98 on the 256×256 ImageNet class-conditional image generation benchmark; qualitative text-to-image results show that PixelFlow excels in image quality, artistry, and semantic control. Conclusion: PixelFlow offers a new paradigm for next-generation visual generation models and may open up further research opportunities. Abstract: We present PixelFlow, a family of image generation models that operate directly in the raw pixel space, in contrast to the predominant latent-space models. This approach simplifies the image generation process by eliminating the need for a pre-trained Variational Autoencoder (VAE) and making the whole model end-to-end trainable. Through efficient cascade flow modeling, PixelFlow achieves affordable computation cost in pixel space. It achieves an FID of 1.98 on the 256×256 ImageNet class-conditional image generation benchmark. The qualitative text-to-image results demonstrate that PixelFlow excels in image quality, artistry, and semantic control. We hope this new paradigm will inspire and open up new opportunities for next-generation visual generation models. Code and models are available at https://github.com/ShoufaChen/PixelFlow.

Dual Engines of Thoughts: A Depth-Breadth Integration Framework for Open-Ended Analysis

Fei-Hsuan Yu,Yun-Cheng Chou,Teng-Ruei Chen

Task: Propose Dual Engines of Thoughts (DEoT), an analytical framework for comprehensive open-ended reasoning.

Motivation: Traditional reasoning frameworks mainly target single-answer problems, whereas DEoT is designed specifically for open-ended questions and supports broader and deeper analytical exploration.

Details

Method: The framework has three key components: a Base Prompter (refines user queries), a Solver Agent (task decomposition, execution, and validation), and a Dual-Engine System (a Breadth Engine that explores diverse factors and a Depth Engine that performs deep analysis). Result: Experiments show that DEoT excels at complex, multi-faceted questions, with a win rate of 77-86% against existing reasoning models. Conclusion: DEoT is effective in real-world applications and balances broad coverage with in-depth analysis. Abstract: We propose the Dual Engines of Thoughts (DEoT), an analytical framework for comprehensive open-ended reasoning. While traditional reasoning frameworks primarily focus on finding "the best answer" or "the correct answer" for single-answer problems, DEoT is specifically designed for "open-ended questions," enabling both broader and deeper analytical exploration. The framework centers on three key components: a Base Prompter for refining user queries, a Solver Agent that orchestrates task decomposition, execution, and validation, and a Dual-Engine System consisting of a Breadth Engine (to explore diverse impact factors) and a Depth Engine (to perform deep investigations). This integrated design allows DEoT to balance wide-ranging coverage with in-depth analysis, and it is highly customizable, enabling users to adjust analytical parameters and tool configurations based on specific requirements. Experimental results show that DEoT excels in addressing complex, multi-faceted questions, achieving a total win rate of 77-86% compared to existing reasoning models, thus highlighting its effectiveness in real-world applications.

Boundary representation learning via Transformer

Qiang Zou,Lizhen Zhu

Task: Apply Transformer networks to learning on boundary representation (B-rep) models, proposing the Boundary Representation Transformer (BRT).

Motivation: Despite the remarkable success of Transformers in natural language processing and other fields, their application to B-rep models in computer-aided design (CAD) remains underexplored.

Details

Method: BRT proposes a continuous geometric embedding that encodes B-rep surfaces as Bézier triangles, and a topology-aware embedding that organizes them into a sequence of discrete tokens suitable for Transformers. Result: Experiments show that BRT achieves state-of-the-art performance in part classification and feature recognition. Conclusion: BRT successfully applies Transformers to B-rep learning, addressing the challenges posed by irregular topology and continuous geometric definitions. Abstract: The recent rise of generative artificial intelligence (AI), powered by Transformer networks, has achieved remarkable success in natural language processing, computer vision, and graphics. However, the application of Transformers in computer-aided design (CAD), particularly for processing boundary representation (B-rep) models, remains largely unexplored. To bridge this gap, this paper introduces Boundary Representation Transformer (BRT), a novel method adapting Transformer for B-rep learning. B-rep models pose unique challenges due to their irregular topology and continuous geometric definitions, which are fundamentally different from the structured and discrete data Transformers are designed for. To address this, BRT proposes a continuous geometric embedding method that encodes B-rep surfaces (trimmed and untrimmed) into Bézier triangles, preserving their shape and continuity without discretization. Additionally, BRT employs a topology-aware embedding method that organizes these geometric embeddings into a sequence of discrete tokens suitable for Transformers, capturing both geometric and topological characteristics within B-rep models. This enables the Transformer's attention mechanism to effectively learn shape patterns and contextual semantics of boundary elements in a B-rep model. Extensive experiments demonstrate that BRT achieves state-of-the-art performance in part classification and feature recognition tasks.

How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective

Qi Liu,Jiaxin Mao,Ji-Rong Wen

Task: Systematically study how large language models (LLMs) understand and generate relevance judgments through modular mechanisms.

Motivation: Explore how relevance judgments arise within the internal mechanisms of off-the-shelf LLMs, filling a gap in existing research.

Details

Method: Uses activation patching to analyze the roles of different model components, revealing a multi-stage, progressive process for relevance judgment. Result: LLMs extract query and document information in the early layers, process relevance information in the middle layers, and generate judgments through specific attention heads in the later layers. Conclusion: The study reveals the mechanisms of relevance assessment in LLMs and offers valuable implications for future use of LLMs in information retrieval tasks. Abstract: Recent studies have shown that large language models (LLMs) can assess relevance and support information retrieval (IR) tasks such as document ranking and relevance judgment generation. However, the internal mechanisms by which off-the-shelf LLMs understand and operationalize relevance remain largely unexplored. In this paper, we systematically investigate how different LLM modules contribute to relevance judgment through the lens of mechanistic interpretability. Using activation patching techniques, we analyze the roles of various model components and identify a multi-stage, progressive process in generating either pointwise or pairwise relevance judgment. Specifically, LLMs first extract query and document information in the early layers, then process relevance information according to instructions in the middle layers, and finally utilize specific attention heads in the later layers to generate relevance judgments in the required format. Our findings provide insights into the mechanisms underlying relevance assessment in LLMs, offering valuable implications for future research on leveraging LLMs for IR tasks.
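
Activation patching, the technique named in the abstract, can be illustrated on a toy two-layer function: run a clean and a corrupted input, copy one hidden activation from the clean run into the corrupted run, and measure how much of the clean output is restored. The tiny network below is entirely hypothetical and only stands in for an LLM component; it is not the authors' setup.

```python
def layer1(x):
    """Toy hidden layer with two ReLU units."""
    return [max(0.0, 2.0 * x[0] - x[1]), max(0.0, x[0] + x[1])]


def layer2(h):
    """Toy readout that produces a single output score."""
    return 3.0 * h[0] - h[1]


def run(x, patch=None):
    """Forward pass; optionally overwrite one hidden unit as (index, value)."""
    h = layer1(x)
    if patch is not None:
        index, value = patch
        h[index] = value
    return layer2(h), h


def patching_effects(clean_x, corrupted_x):
    """Change in corrupted output when each clean hidden unit is patched in."""
    _, clean_h = run(clean_x)
    corrupted_out, _ = run(corrupted_x)
    effects = []
    for index, value in enumerate(clean_h):
        patched_out, _ = run(corrupted_x, patch=(index, value))
        effects.append(patched_out - corrupted_out)
    return effects


effects = patching_effects(clean_x=[1.0, 0.0], corrupted_x=[0.0, 1.0])
```

Here patching hidden unit 0 restores the clean output while patching unit 1 changes nothing, so unit 0 is the component that carries the difference between the two inputs.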

MESA: Text-Driven Terrain Generation Using Latent Diffusion and Global Copernicus Data

Paul Borne--Pons,Mikolaj Czerkawski,Rosalie Martin,Romain Rouffet

Task: Propose MESA, a data-driven diffusion-based method for generating high-quality terrain samples from text descriptions.

Motivation: Traditional terrain modeling relies on handcrafted rules and domain expertise, and lacks flexibility and scalability.

Details

Method: Trains a diffusion model on global remote sensing data to generate terrain samples. Result: Experiments show that the model generates realistic and diverse terrain landscapes; the Major TOM Core-DEM extension dataset is released. Conclusion: Data-driven models provide a powerful tool for terrain modeling, with practical relevance and scalability. Abstract: Terrain modeling has traditionally relied on procedural techniques, which often require extensive domain expertise and handcrafted rules. In this paper, we present MESA, a novel data-centric alternative that trains a diffusion model on global remote sensing data. This approach leverages large-scale geospatial information to generate high-quality terrain samples from text descriptions, showcasing a flexible and scalable solution for terrain generation. The model's capabilities are demonstrated through extensive experiments, highlighting its ability to generate realistic and diverse terrain landscapes. The dataset produced to support this work, the Major TOM Core-DEM extension dataset, is released openly as a comprehensive resource for global terrain data. The results suggest that data-driven models, trained on remote sensing data, can provide a powerful tool for realistic terrain modeling and generation.

Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory

Mirac Suzgun,Mert Yuksekgonul,Federico Bianchi,Dan Jurafsky,James Zou

Task: Propose Dynamic Cheatsheet (DC), a lightweight framework that gives black-box language models a persistent, evolving memory.

Motivation: Current language models process each input without memory and cannot retain or reuse earlier problem-solving strategies and lessons from mistakes, which leads to inefficiency.

Details

Method: With DC, the model stores and reuses accumulated strategies, code snippets, and problem-solving insights at inference time, without explicit labels or human feedback. Result: Experiments show that DC substantially improves performance on math exams, arithmetic tasks, and knowledge-intensive tasks; for example, Claude 3.5 Sonnet's accuracy on AIME math exams more than doubled, and GPT-4o's success rate on Game of 24 rose from 10% to 99%. Conclusion: DC is a promising approach for giving language models persistent memory and narrowing the gap with human experience-driven learning. Abstract: Despite their impressive performance on complex tasks, current language models (LMs) typically operate in a vacuum: Each input query is processed separately, without retaining insights from previous attempts. Here, we present Dynamic Cheatsheet (DC), a lightweight framework that endows a black-box LM with a persistent, evolving memory. Rather than repeatedly re-discovering or re-committing the same solutions and mistakes, DC enables models to store and reuse accumulated strategies, code snippets, and general problem-solving insights at inference time. This test-time learning enhances performance substantially across a range of tasks without needing explicit ground-truth labels or human feedback. Leveraging DC, Claude 3.5 Sonnet's accuracy more than doubled on AIME math exams once it began retaining algebraic insights across questions. Similarly, GPT-4o's success rate on Game of 24 increased from 10% to 99% after the model discovered and reused a Python-based solution. In tasks prone to arithmetic mistakes, such as balancing equations, DC enabled GPT-4o and Claude to reach near-perfect accuracy by recalling previously validated code, whereas their baselines stagnated around 50%. Beyond arithmetic challenges, DC yields notable accuracy gains on knowledge-demanding tasks. Claude achieved a 9% improvement in GPQA-Diamond and an 8% boost on MMLU-Pro problems. Crucially, DC's memory is self-curated, focusing on concise, transferable snippets rather than entire transcripts. Unlike finetuning or static retrieval methods, DC adapts LMs' problem-solving skills on the fly, without modifying their underlying parameters.
Overall, our findings present DC as a promising approach for augmenting LMs with persistent memory, bridging the divide between isolated inference events and the cumulative, experience-driven learning characteristic of human cognition.
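
The store-and-reuse loop described above can be sketched as follows, under loud assumptions: `solve` and `curate` are hypothetical callables that would wrap calls to a black-box LM, and the toy stand-ins below exist only to show how memory flows from one query to the next. This is not the authors' implementation.

```python
def run_with_cheatsheet(queries, solve, curate):
    """Process queries in order, carrying a self-curated memory between them.

    solve(memory, query) -> answer            (hypothetical LM call)
    curate(memory, query, answer) -> memory   (hypothetical LM call)
    No ground-truth labels are used, and no model parameters change.
    """
    memory = ""
    answers = []
    for query in queries:
        answer = solve(memory, query)            # answer using current memory
        memory = curate(memory, query, answer)   # keep only concise insights
        answers.append(answer)
    return answers, memory


# Toy stand-ins for the two LM calls (illustrative only).
def toy_solve(memory, query):
    suffix = " [reused tip]" if "tip:" in memory else ""
    return "answer(" + query + ")" + suffix


def toy_curate(memory, query, answer):
    # Add a transferable note once instead of storing the full transcript.
    if "tip:" in memory:
        return memory
    return memory + "tip: verify arithmetic with code\n"


answers, final_memory = run_with_cheatsheet(["q1", "q2"], toy_solve, toy_curate)
```

The design choice worth noting is that memory holds short curated notes, not the full history, so the prompt stays small as the number of queries grows.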

MoEDiff-SR: Mixture of Experts-Guided Diffusion Model for Region-Adaptive MRI Super-Resolution

Zhe Wang,Yuhua Ru,Aladine Chetouani,Fang Chen,Fabian Bauer,Liping Zhang,Didier Hans,Rachid Jennane,Mohamed Jarraya,Yung Hsin Chen

Task: Propose MoEDiff-SR, a Mixture of Experts (MoE)-guided diffusion model for region-adaptive magnetic resonance imaging (MRI) super-resolution.

Motivation: MRI at lower field strengths (e.g., 3T) has limited spatial resolution, making it hard to capture the fine anatomical details needed for clinical diagnosis and neuroimaging research.

Details

Method: Uses a Transformer to extract multi-scale features, and an MoE gating network to dynamically select among multiple diffusion denoising experts, achieving region-adaptive super-resolution. Result: MoEDiff-SR outperforms existing methods in quantitative image quality metrics, perceptual fidelity, and computational efficiency; clinical evaluation shows it identifies subtle pathological features more accurately. Conclusion: Through region-adaptive denoising, MoEDiff-SR substantially improves MRI super-resolution and offers clinical utility and interpretability. Abstract: Magnetic Resonance Imaging (MRI) at lower field strengths (e.g., 3T) suffers from limited spatial resolution, making it challenging to capture fine anatomical details essential for clinical diagnosis and neuroimaging research. To overcome this limitation, we propose MoEDiff-SR, a Mixture of Experts (MoE)-guided diffusion model for region-adaptive MRI Super-Resolution (SR). Unlike conventional diffusion-based SR models that apply a uniform denoising process across the entire image, MoEDiff-SR dynamically selects specialized denoising experts at a fine-grained token level, ensuring region-specific adaptation and enhanced SR performance. Specifically, our approach first employs a Transformer-based feature extractor to compute multi-scale patch embeddings, capturing both global structural information and local texture details. The extracted feature embeddings are then fed into an MoE gating network, which assigns adaptive weights to multiple diffusion-based denoisers, each specializing in different brain MRI characteristics, such as centrum semiovale, sulcal and gyral cortex, and grey-white matter junction. The final output is produced by aggregating the denoised results from these specialized experts according to dynamically assigned gating probabilities. Experimental results demonstrate that MoEDiff-SR outperforms existing state-of-the-art methods in terms of quantitative image quality metrics, perceptual fidelity, and computational efficiency. Difference maps from each expert further highlight their distinct specializations, confirming the effective region-specific denoising capability and the interpretability of expert contributions.
Additionally, clinical evaluation validates its superior diagnostic capability in identifying subtle pathological features, emphasizing its practical relevance in clinical neuroimaging. Our code is available at https://github.com/ZWang78/MoEDiff-SR.
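
The aggregation step in the abstract (expert outputs combined according to gating probabilities) can be sketched as a per-token softmax-weighted sum. The shapes, names, and the use of a plain softmax over gate logits are assumptions for illustration; the released code at the URL above is the authoritative reference.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(gate_logits, expert_outputs):
    """Combine expert outputs per token using gating probabilities.

    gate_logits:    [num_tokens][num_experts]
    expert_outputs: [num_experts][num_tokens][dim]
    returns:        [num_tokens][dim]
    """
    mixed = []
    for t, logits in enumerate(gate_logits):
        weights = softmax(logits)
        dim = len(expert_outputs[0][t])
        mixed.append([
            sum(weights[e] * expert_outputs[e][t][d] for e in range(len(weights)))
            for d in range(dim)
        ])
    return mixed


# Two experts, two tokens, one feature per token (toy values).
experts = [[[2.0], [2.0]],   # expert 0 output for token 0 and token 1
           [[4.0], [4.0]]]   # expert 1 output for token 0 and token 1
gates = [[0.0, 0.0],         # token 0: both experts weighted equally
         [100.0, 0.0]]       # token 1: gate strongly prefers expert 0
mixed = mix_experts(gates, experts)
```

Token 0 receives the average of the two experts, while token 1 is effectively handled by expert 0 alone, which is the region-adaptive behavior the gating is meant to provide.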

Perception-R1: Pioneering Perception Policy with Reinforcement Learning

En Yu,Kangheng Lin,Liang Zhao,Jisheng Yin,Yana Wei,Yuang Peng,Haoran Wei,Jianjian Sun,Chunrui Han,Zheng Ge,Xiangyu Zhang,Daxin Jiang,Jingyu Wang,Wenbing Tao

Task: Explore the potential of rule-based reinforcement learning (RL) for perception policy learning in the post-training of multimodal large language models (MLLMs).

Motivation: Initial experiments show that RL yields inconsistent gains on visual perception tasks, which motivates a closer look at the essential role of RL in visual perception and the factors that influence it.

Details

Method: Proposes Perception-R1, a framework that uses GRPO during MLLM post-training to optimize the perception policy. Result: Substantial gains on several benchmarks, such as RefCOCO+ (+4.2%), PixMo-Count (+17.9%), and COCO2017 val (31.9% AP). Conclusion: Perceptual complexity and reward design are key factors in the effectiveness of RL, and Perception-R1 provides a strong baseline for perception policy learning. Abstract: Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in MLLM post-training for perception policy learning. While promising, our initial experiments reveal that incorporating a thinking process through RL does not consistently lead to performance gains across all visual perception tasks. This leads us to delve into the essential role of RL in the context of visual perception. In this work, we return to the fundamentals and explore the effects of RL on different perception tasks. We observe that the perceptual complexity is a major factor in determining the effectiveness of RL. We also observe that reward design plays a crucial role in further approaching the upper limit of model perception. To leverage these findings, we propose Perception-R1, a scalable RL framework using GRPO during MLLM post-training. With a standard Qwen2.5-VL-3B-Instruct, Perception-R1 achieves +4.2% on RefCOCO+, +17.9% on PixMo-Count, +4.2% on PageOCR, and notably, 31.9% AP on COCO2017 val for the first time, establishing a strong baseline for perception policy learning.

Identifying regions of interest in whole slide images of renal cell carcinoma

Mohammed Lamine Benomar,Nesma Settouti,Eric Debreuve,Xavier Descombes,Damien Ambrosetti

Task: Develop a fully automated system for detecting regions of interest (ROIs) in whole slide images (WSIs) of renal cell carcinoma (RCC), to reduce analysis time and help pathologists make more accurate diagnoses.

Motivation: Histopathological images contain a huge amount of information, making diagnosis time-consuming and tedious; an automated system is needed to improve efficiency and accuracy.

Details

Method: Extracts WSI texture features using an efficient texture descriptor (DRLBP) and color transformation, followed by feature selection and classification with classifiers such as SVM and transfer-learning-based deep models. Result: On 1800 kidney cancer patches, the system performs well: the best precision is 99.17% with an SVM classifier, and transfer learning models (e.g., ResNet-50) reach 98.50%. Conclusion: The proposed approach efficiently identifies ROIs in whole slide images of kidney cancer and provides automated support for pathological diagnosis. Abstract: Histopathological images contain a huge amount of information, which can make diagnosis an extremely time-consuming and tedious task. In this study, we developed a completely automated system to detect regions of interest (ROIs) in whole slide images (WSI) of renal cell carcinoma (RCC), to reduce analysis time and assist pathologists in making more accurate decisions. The proposed approach is based on an efficient texture descriptor named dominant rotated local binary pattern (DRLBP) and color transformation to reveal and exploit the immense texture variability at the microscopic high magnifications level. Thereby, the DRLBPs retain the structural information and utilize the magnitude values in a local neighborhood for more discriminative power. For the classification of the relevant ROIs, feature extraction of WSI patches was performed on the color channels separately to form the histograms. Next, we used the most frequently occurring patterns as a feature selection step to discard non-informative features. The performances of different classifiers on a set of 1800 kidney cancer patches originating from 12 whole slide images were compared and evaluated. Furthermore, the small size of the image dataset allows us to investigate a deep learning approach based on transfer learning for image patch classification by using deep features and fine-tuning methods. High recognition accuracy was obtained and the classifiers are efficient; the best precision, 99.17%, was achieved with SVM. Moreover, transfer learning models perform well with comparable performance, and the highest precision using ResNet-50 reached 98.50%. The proposed approach results revealed a very efficient image classification and demonstrated efficacy in identifying ROIs.
This study presents an automatic system to detect regions of interest relevant to the diagnosis of kidney cancer in whole slide histopathology images.

VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning

Yukun Qi,Yiming Zhao,Yu Zeng,Xikun Bao,Wenxuan Huang,Lin Chen,Zehui Chen,Jie Zhao,Zhongang Qi,Feng Zhao

Task: Propose VCR-Bench, a new benchmark for comprehensively evaluating the video chain-of-thought reasoning of large vision-language models (LVLMs).

Motivation: Current video benchmarks cannot adequately assess the reasoning process or distinguish deficiencies in perception from deficiencies in reasoning, and a rigorous evaluation framework for video chain-of-thought reasoning is missing.

Details

Method: VCR-Bench contains 859 videos and 1,034 high-quality question-answer pairs, each with a manually annotated stepwise chain-of-thought rationale; it defines seven task dimensions and a CoT score. Result: Current LVLMs show limited performance: the best model reaches only a 62.8% CoT score and 56.7% accuracy, most models score below 40%, and perception is the main bottleneck. Conclusion: VCR-Bench confirms the critical role of chain-of-thought reasoning in complex video reasoning tasks and can serve as a standardized evaluation framework that exposes models' actual shortcomings. Abstract: The advancement of Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs) and large vision-language models (LVLMs). However, a rigorous evaluation framework for video CoT reasoning remains absent. Current video benchmarks fail to adequately assess the reasoning process and expose whether failures stem from deficiencies in perception or reasoning capabilities. Therefore, we introduce VCR-Bench, a novel benchmark designed to comprehensively evaluate LVLMs' Video Chain-of-Thought Reasoning capabilities. VCR-Bench comprises 859 videos spanning a variety of video content and durations, along with 1,034 high-quality question-answer pairs. Each pair is manually annotated with a stepwise CoT rationale, where every step is tagged to indicate its association with the perception or reasoning capabilities. Furthermore, we design seven distinct task dimensions and propose the CoT score to assess the entire CoT process based on the stepwise tagged CoT rationales. Extensive experiments on VCR-Bench highlight substantial limitations in current LVLMs. Even the top-performing model, o1, only achieves a 62.8% CoT score and a 56.7% accuracy, while most models score below 40%. Experiments show most models score lower on perception than reasoning steps, revealing LVLMs' key bottleneck in temporal-spatial information processing for complex video reasoning. A robust positive correlation between the CoT score and accuracy confirms the validity of our evaluation framework and underscores the critical role of CoT reasoning in solving complex video reasoning tasks. We hope VCR-Bench will serve as a standardized evaluation framework and expose the actual drawbacks in complex video reasoning tasks.

Synthetic CT Generation from Time-of-Flight Non-Attenuation-Corrected PET for Whole-Body PET Attenuation Correction

Weijie Chen,James Wang,Alan McMillan

Task: Use deep learning to generate synthetic CT (sCT) images from TOF NAC PET images to improve attenuation correction in PET/MR.

Motivation: PET/MR systems lack CT images, which makes attenuation correction difficult and calls for an alternative.

Details

Method: Uses a pre-trained deep learning model, fine-tuned on 35 paired TOF NAC PET and CT volumes. Result: Achieves an MAE of 74.49 HU and a PSNR of 28.66 dB; visual assessment shows improved reconstruction of bone and soft tissue. Conclusion: Pre-trained models are effective for medical image translation; future work will explore additional architectures and data to improve performance. Abstract: Positron Emission Tomography (PET) imaging requires accurate attenuation correction (AC) to account for photon loss due to tissue density variations. In PET/MR systems, computed tomography (CT), which offers a straightforward estimation of AC, is not available. This study presents a deep learning approach to generate synthetic CT (sCT) images directly from Time-of-Flight (TOF) non-attenuation corrected (NAC) PET images, enhancing AC for PET/MR. We first evaluated models pre-trained on large-scale natural image datasets for a CT-to-CT reconstruction task, finding that the pre-trained model outperformed those trained solely on medical datasets. The pre-trained model was then fine-tuned using an institutional dataset of 35 TOF NAC PET and CT volume pairs, achieving the lowest mean absolute error (MAE) of 74.49 HU and highest peak signal-to-noise ratio (PSNR) of 28.66 dB within the body contour region. Visual assessments demonstrated improved reconstruction of both bone and soft tissue structures from TOF NAC PET images. This work highlights the effectiveness of using pre-trained deep learning models for medical image translation tasks. Future work will assess the impact of sCT on PET attenuation correction and explore additional neural network architectures and datasets to further enhance performance and practical applications in PET imaging.
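
For readers unfamiliar with the two reported metrics, the sketch below computes MAE and PSNR over voxels inside a mask, mirroring the "within the body contour region" wording. The flat-list representation, the mask, and the `data_range` value are assumptions; the abstract does not state the intensity range used for PSNR.

```python
import math


def masked_pairs(pred, target, mask):
    """Keep only (prediction, target) pairs where the mask is True."""
    return [(p, t) for p, t, m in zip(pred, target, mask) if m]


def mae(pred, target, mask):
    """Mean absolute error over masked voxels (in HU for CT values)."""
    pairs = masked_pairs(pred, target, mask)
    return sum(abs(p - t) for p, t in pairs) / len(pairs)


def psnr(pred, target, mask, data_range):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    pairs = masked_pairs(pred, target, mask)
    mse = sum((p - t) ** 2 for p, t in pairs) / len(pairs)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)


# Toy voxels in HU; the last voxel lies outside the body contour and is ignored.
synthetic_ct = [0.0, 100.0, -900.0]
reference_ct = [0.0, 200.0, -1000.0]
body_mask = [True, True, False]

toy_mae = mae(synthetic_ct, reference_ct, body_mask)
toy_psnr = psnr(synthetic_ct, reference_ct, body_mask, data_range=1000.0)
```

Because PSNR depends on the chosen `data_range`, values from different studies are comparable only when the same range is used.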

Cat, Rat, Meow: On the Alignment of Language Model and Human Term-Similarity Judgments

Lorenz Linhardt,Tom Neuhäuser,Lenka Tětková,Oliver Eberle

Task: 评估32个公开可用的语言模型在单词三元组任务中与人类相似性判断的表征和行为对齐。

Motivation: 中小型生成语言模型因其规模和可用性，便于从行为和表征层面进行分析，探究二者的交互关系。

Details

Method: Compare the alignment of language models' representations and behavior with human similarity judgments on a word triplet task. Result: Representations of small language models can reach human-level alignment; instruction-tuned variants can substantially increase agreement; the pattern of alignment across layers is highly model dependent; behavioral alignment depends on model size and matches representational alignment only for the largest models. Conclusion: The study provides a novel evaluation setting for semantic associations in language models and shows how model size and tuning affect alignment. Abstract: Small and mid-sized generative language models have gained increasing attention. Their size and availability make them amenable to being analyzed at a behavioral as well as a representational level, allowing investigations of how these levels interact. We evaluate 32 publicly available language models for their representational and behavioral alignment with human similarity judgments on a word triplet task. This provides a novel evaluation setting to probe semantic associations in language beyond common pairwise comparisons. We find that (1) even the representations of small language models can achieve human-level alignment, (2) instruction-tuned model variants can exhibit substantially increased agreement, (3) the pattern of alignment across layers is highly model dependent, and (4) alignment based on models' behavioral responses is highly dependent on model size, matching their representational alignment only for the largest evaluated models.
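One common way to score representational alignment on a word triplet task is an odd-one-out rule: the two most similar words form a pair and the remaining word is the odd one out. The sketch below assumes cosine similarity over fixed word vectors; the abstract does not specify the similarity measure, and the toy embeddings, `odd_one_out`, and `triplet_agreement` are hypothetical names and values for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def odd_one_out(emb, triplet):
    """Return the word left over once the most similar pair is removed."""
    a, b, c = triplet
    # Key each pairwise similarity by the word that is NOT in the pair.
    pair_sims = {
        c: cosine(emb[a], emb[b]),
        b: cosine(emb[a], emb[c]),
        a: cosine(emb[b], emb[c]),
    }
    return max(pair_sims, key=pair_sims.get)

def triplet_agreement(emb, triplets, human_choices):
    """Fraction of triplets where the model's odd one out matches the human choice."""
    hits = sum(odd_one_out(emb, t) == h for t, h in zip(triplets, human_choices))
    return hits / len(triplets)

# Made-up 3-d vectors; real experiments would use model hidden states.
toy_emb = {"cat": [1.0, 0.9, 0.0], "rat": [0.9, 1.0, 0.1], "meow": [0.2, 0.1, 1.0]}
score = triplet_agreement(toy_emb, [("cat", "rat", "meow")], ["meow"])
```

Because the rule only needs pairwise similarities, the same function could be applied to the representations of any layer, which is what makes a per-layer comparison straightforward.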

Novel Pooling-based VGG-Lite for Pneumonia and Covid-19 Detection from Imbalanced Chest X-Ray Datasets

Santanu Roy,Ashvath Suresh,Palak Sahu,Tulika Rudra Gupta

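Task: Propose a pooling-based VGG-Lite model to mitigate class imbalance in chest X-ray (CXR) datasets.

Motivation: Deep learning models for automatic pneumonia detection struggle with class imbalance, an issue that became more prominent after the new Covid-19 variant emerged.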

Details

Method: Propose a lightweight CNN, VGG-Lite, combined with an Edge Enhanced Module (EEM) and a novel 2Max-Min pooling layer that focuses more attention on edges in pneumonia CXR images. Result: On two CXR datasets the framework substantially outperforms pre-trained CNN models and other recent models, reaching 95% accuracy, 97.1% precision, 96.1% recall, and a 96.6% F1 score. Conclusion: VGG-Lite with the EEM mitigates class imbalance without any pre-processing and markedly improves pneumonia detection. Abstract: This paper proposes a novel pooling-based VGG-Lite model in order to mitigate class imbalance issues in Chest X-Ray (CXR) datasets. Automatic pneumonia detection from CXR images with deep learning models has emerged as a prominent and dynamic area of research since the inception of the new Covid-19 variant in 2020. However, standard Convolutional Neural Network (CNN) models encounter challenges associated with class imbalance, a prevalent issue found in many medical datasets. The innovations introduced in the proposed model architecture include: (I) A very lightweight CNN model, "VGG-Lite", is proposed as a base model, inspired by the VGG-16 and MobileNet-V2 architectures. (II) On top of this base model, we leverage an "Edge Enhanced Module (EEM)" through a parallel branch, consisting of a "negative image layer" and a novel custom pooling layer, "2Max-Min Pooling". This 2Max-Min Pooling layer is entirely novel in this investigation, providing more attention to edge components within pneumonia CXR images. Thus, it works as an efficient spatial attention module (SAM). We have implemented the proposed framework on two separate CXR datasets. The first dataset is obtained from a readily available source on the internet, and the second dataset is a more challenging CXR dataset, assembled by our research team from three different sources. Experimental results reveal that our proposed framework has outperformed pre-trained CNN models and three recent models, "Vision Transformer", "Pooling-based Vision Transformer (PiT)" and "PneuNet", by substantial margins on both datasets. The proposed framework, VGG-Lite with EEM, has achieved a macro average of 95% accuracy, 97.1% precision, 96.1% recall, and 96.6% F1 score on the "Pneumonia Imbalance CXR dataset", without employing any pre-processing technique.

PhaseGen: A Diffusion-Based Approach for Complex-Valued MRI Data Generation

Moritz Rempe,Fabian Hörst,Helmut Becker,Marco Schlimbach,Lukas Rotkopf,Kevin Kröninger,Jens Kleesiek

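Task: Propose PhaseGen, a complex-valued diffusion model that generates synthetic MRI raw data conditioned on the magnitude images commonly used in clinical practice.

Motivation: Existing clinical and AI methods use only magnitude images and ignore the potential value of phase data; PhaseGen uses generative AI to bridge the gap between magnitude-based datasets and complex-valued MRI raw data.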

Details

Method: PhaseGen is a complex-valued diffusion model that generates synthetic MRI raw data from magnitude images; it is evaluated on the FastMRI dataset. Result: Training with synthetic phase data significantly improves generalization for skull-stripping (segmentation accuracy rises from 41.1% to 80.1%) and improves MRI reconstruction. Conclusion: PhaseGen offers a new way to combine magnitude images with k-Space data through generative AI, supporting more accurate and efficient diagnostic tasks. Abstract: Magnetic resonance imaging (MRI) raw data, or k-Space data, is complex-valued, containing both magnitude and phase information. However, clinical and existing Artificial Intelligence (AI)-based methods focus only on magnitude images, discarding the phase data despite its potential for downstream tasks, such as tumor segmentation and classification. In this work, we introduce PhaseGen, a novel complex-valued diffusion model for generating synthetic MRI raw data conditioned on magnitude images, commonly used in clinical practice. This enables the creation of artificial complex-valued raw data, allowing pretraining for models that require k-Space information. We evaluate PhaseGen on two tasks: skull-stripping directly in k-Space and MRI reconstruction using the publicly available FastMRI dataset. Our results show that training with synthetic phase data significantly improves generalization for skull-stripping on real-world data, with an increased segmentation accuracy from 41.1% to 80.1%, and enhances MRI reconstruction when combined with limited real-world data. This work presents a step forward in utilizing generative AI to bridge the gap between magnitude-based datasets and the complex-valued nature of MRI raw data. This approach allows researchers to leverage the vast amount of available image domain data in combination with the information-rich k-Space data for more accurate and efficient diagnostic tasks. We make our code publicly available at https://github.com/TIO-IKIM/PhaseGen.
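The distinction between complex-valued data and magnitude images can be made concrete with a few lines of standard-library Python. This is only an illustration of what is lost when phase is discarded; it does not reproduce the diffusion model, and the helper names `to_magnitude_phase` and `from_magnitude_phase` are hypothetical.

```python
import cmath

def to_magnitude_phase(samples):
    """Split complex samples into (magnitudes, phases in radians)."""
    return [abs(z) for z in samples], [cmath.phase(z) for z in samples]

def from_magnitude_phase(mags, phases):
    """Rebuild complex samples; this needs both magnitude and phase."""
    return [cmath.rect(m, p) for m, p in zip(mags, phases)]

# Two different complex signals that share the same magnitudes.
signal_a = [3 + 4j, 1j]
signal_b = [3 - 4j, -1j]
mags_a, phases_a = to_magnitude_phase(signal_a)
mags_b, phases_b = to_magnitude_phase(signal_b)
# mags_a == mags_b, so a magnitude-only dataset cannot tell them apart;
# only the phases differ, which is the information a generator must supply.
restored = from_magnitude_phase(mags_a, phases_a)
```

A magnitude-only archive therefore maps many distinct complex signals to the same image, which is why recovering raw data from it requires generating a plausible phase rather than computing one.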

Extending Visual Dynamics for Video-to-Music Generation

Xiaohao Liu,Teng Tu,Yunshan Ma,Tat-Seng Chua

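Task: Propose DyViM, a new framework that enhances dynamics modeling for video-to-music generation.

Motivation: Existing methods are limited to specific scenarios or undervalue visual dynamics, so the complexity of dynamics and the temporal misalignment between video and music representations need to be addressed.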

Details

Method: Extract frame-wise dynamics features with a simplified motion encoder, aggregate information within frames with a self-attention module, convey high-level semantics through cross-attention, and fine-tune the music decoder with an annealing tuning strategy. Result: Experiments show that DyViM outperforms state-of-the-art methods on video-to-music generation. Conclusion: By improving dynamics modeling and temporal alignment, DyViM markedly improves the quality and adaptability of video-to-music generation. Abstract: Music profoundly enhances video production by improving quality, engagement, and emotional resonance, sparking growing interest in video-to-music generation. Despite recent advances, existing approaches remain limited in specific scenarios or undervalue the visual dynamics. To address these limitations, we focus on tackling the complexity of dynamics and resolving temporal misalignment between video and music representations. To this end, we propose DyViM, a novel framework to enhance dynamics modeling for video-to-music generation. Specifically, we extract frame-wise dynamics features via a simplified motion encoder inherited from optical flow methods, followed by a self-attention module for aggregation within frames. These dynamic features are then incorporated to extend existing music tokens for temporal alignment. Additionally, high-level semantics are conveyed through a cross-attention mechanism, and an annealing tuning strategy helps fine-tune well-trained music decoders efficiently, facilitating seamless adaptation. Extensive experiments demonstrate DyViM's superiority over state-of-the-art (SOTA) methods.

Andrés Bell-Navas,María Villalba-Orero,Enrique Lara-Pezzi,Jesús Garicano-Mena,Soledad Le Clainche

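Task: Develop a deep learning system that analyzes echocardiography video sequences in real time to predict heart failure times.

Motivation: Heart failure is a major threat to global health, and systems for its early, rapid, and effective prediction are urgently needed.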

Details

Method: A two-stage approach: the first stage uses the HODMD algorithm for data augmentation and feature extraction; the second builds and trains a Vision Transformer (ViT) with self-supervised learning (SSL) methods. Result: The system is effective; the HODMD algorithm and the proposed system outperform several established CNN and ViT architectures. Conclusion: The proposed system is effective for, and superior at, heart failure time prediction. Abstract: Heart diseases are the leading cause of death worldwide. According to the World Health Organization (WHO), approximately 18 million deaths each year are due to heart diseases. In particular, heart failure (HF) pushes the healthcare industry to develop systems for its early, rapid and effective prediction. In this work, an automatic system which analyses echocardiography video sequences in real time is proposed for the challenging and more specific task of predicting heart failure times. This system is based on a novel deep learning framework, and works in two stages. The first one transforms the data included in a database of echocardiography video sequences into a machine learning-compatible collection of annotated images which can be used in the training phase of any kind of machine learning-based framework, including a deep learning one. This initial stage includes the use of the Higher Order Dynamic Mode Decomposition (HODMD) algorithm for both data augmentation and feature extraction. The second stage is focused on building and training a Vision Transformer (ViT). Self-supervised learning (SSL) methods, which have so far been barely explored in the literature on heart failure prediction, are applied to effectively train the ViT from scratch, even with scarce databases of echocardiograms. The designed neural network analyses images from echocardiography sequences to estimate the time at which heart failure will occur. The results obtained show the efficacy of the HODMD algorithm and the superiority of the proposed system with respect to several established ViT and Convolutional Neural Network (CNN) architectures.

CollEX -- A Multimodal Agentic RAG System Enabling Interactive Exploration of Scientific Collections

Florian Schneider,Narges Baba Ahmadi,Niloufar Baba Ahmadi,Iris Vogel,Martin Semmann,Chris Biemann

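Task: Introduce and demonstrate CollEx, a multimodal agentic Retrieval-Augmented Generation (RAG) system that enhances interactive exploration of large scientific collections.

Motivation: Faced with large and complex scientific collections, conventional search systems lack intuitiveness and interactivity, which creates barriers for users.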

Details

Method: Use state-of-the-art Large Vision-Language Models (LVLMs) as multimodal agents that abstract complex interactions behind an intuitive chat interface. Result: CollEx significantly simplifies access to diverse scientific collections, supports educational scenarios, and helps discover interdisciplinary connections. Conclusion: Through multimodal integration and agentic techniques, CollEx improves the exploration of scientific collections and has value for education and research. Abstract: In this paper, we introduce CollEx, an innovative multimodal agentic Retrieval-Augmented Generation (RAG) system designed to enhance interactive exploration of extensive scientific collections. Given the overwhelming volume and inherent complexity of scientific collections, conventional search systems often lack the necessary intuitiveness and interactivity, presenting substantial barriers for learners, educators, and researchers. CollEx addresses these limitations by employing state-of-the-art Large Vision-Language Models (LVLMs) as multimodal agents accessible through an intuitive chat interface. By abstracting complex interactions via specialized agents equipped with advanced tools, CollEx facilitates curiosity-driven exploration, significantly simplifying access to diverse scientific collections and records therein. Our system integrates textual and visual modalities, supporting educational scenarios that are helpful for teachers, pupils, students, and researchers by fostering independent exploration as well as scientific excitement and curiosity. Furthermore, CollEx serves the research community by discovering interdisciplinary connections and complementing visual data. We illustrate the effectiveness of our system through a proof-of-concept application containing over 64,000 unique records across 32 collections from a local scientific collection of a public university.

Hye-Min Won,Jieun Lee,Jiyong Oh

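Task: Propose an uncertainty-aware localization method that improves the reliability of localization outputs by filtering out unreliable 3-DoF pose predictions.

Motivation: Reliable localization is critical for robot navigation in complex indoor environments.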

Details

Method: Apply a percentile-based rejection strategy to a multi-modal end-to-end localization method that fuses RGB images and 2D LiDAR data. Result: Stricter uncertainty thresholds significantly reduce pose error: mean position error falls by 41.0%, 56.7%, and 69.4%, and mean orientation error by 55.6%, 65.7%, and 73.3%. Conclusion: To the authors' knowledge, this is the first study to quantitatively demonstrate the benefits of percentile-based uncertainty rejection in multi-modal end-to-end localization; it improves the reliability and accuracy of localization systems in real-world deployments. Abstract: Reliable localization is critical for robot navigation in complex indoor environments. In this paper, we propose an uncertainty-aware localization method that enhances the reliability of localization outputs without modifying the prediction model itself. This study introduces a percentile-based rejection strategy that filters out unreliable 3-DoF pose predictions based on the aleatoric and epistemic uncertainties the network estimates. We apply this approach to a multi-modal end-to-end localization method that fuses RGB images and 2D LiDAR data, and we evaluate it across three real-world datasets collected using a commercialized serving robot. Experimental results show that applying stricter uncertainty thresholds consistently improves pose accuracy. Specifically, the mean position error is reduced by 41.0%, 56.7%, and 69.4%, and the mean orientation error by 55.6%, 65.7%, and 73.3%, when applying 90%, 80%, and 70% thresholds, respectively. Furthermore, the rejection strategy effectively removes extreme outliers, resulting in better alignment with ground truth trajectories. To the best of our knowledge, this is the first study to quantitatively demonstrate the benefits of percentile-based uncertainty rejection in multi-modal end-to-end localization tasks. Our approach provides a practical means to enhance the reliability and accuracy of localization systems in real-world deployments.
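A percentile-based rejection rule of the kind described above can be sketched in a few lines. The sketch rests on assumptions not confirmed by the abstract: each prediction carries a single scalar uncertainty (how aleatoric and epistemic terms are combined is not stated), the threshold is computed from the same set of scores it filters, and the names `percentile_threshold` and `reject_uncertain` are hypothetical.

```python
def percentile_threshold(values, pct):
    """Linearly interpolated percentile (0-100) of a non-empty sequence."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    rank = (len(ordered) - 1) * pct / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def reject_uncertain(poses, uncertainties, keep_pct):
    """Keep poses whose uncertainty is at or below the keep_pct-th percentile."""
    threshold = percentile_threshold(uncertainties, keep_pct)
    return [pose for pose, u in zip(poses, uncertainties) if u <= threshold]

# Ten toy 3-DoF poses (x, y, heading) with assumed scalar uncertainties 1..10.
poses = [(float(i), 0.0, 0.0) for i in range(10)]
scores = list(range(1, 11))
kept_90 = reject_uncertain(poses, scores, keep_pct=90)  # drops the most uncertain pose
kept_70 = reject_uncertain(poses, scores, keep_pct=70)  # stricter: drops three poses
```

Lowering `keep_pct` trades coverage for accuracy: fewer poses are accepted, but those that remain are the ones the network is most confident about. The prediction model itself is untouched, which matches the abstract's statement that the method works without modifying it.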

Distilling Knowledge from Heterogeneous Architectures for Semantic Segmentation

Yanglin Huang,Kai Hu,Yuan Zhang,Zhineng Chen,Xieping Gao

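Task: Propose HeteroAKD, a heterogeneous knowledge distillation method for semantic segmentation.

Motivation: Existing knowledge distillation methods work within homogeneous architectures and overlook the diverse knowledge in heterogeneous architectures, which the student needs in order to gain a more precise and comprehensive understanding of the data.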

Details

Method: Project intermediate features into an aligned logits space to remove the influence of architecture-specific information, and introduce a knowledge mixing mechanism (KMM) and a knowledge evaluation mechanism (KEM) to exploit heterogeneous knowledge. Result: On three mainstream benchmarks, HeteroAKD outperforms existing methods at distillation between heterogeneous architectures. Conclusion: HeteroAKD is an effective solution for knowledge distillation between heterogeneous architectures. Abstract: Current knowledge distillation (KD) methods for semantic segmentation focus on guiding the student to imitate the teacher's knowledge within homogeneous architectures. However, these methods overlook the diverse knowledge contained in architectures with different inductive biases, which is crucial for enabling the student to acquire a more precise and comprehensive understanding of the data during distillation. To this end, we propose for the first time a generic knowledge distillation method for semantic segmentation from a heterogeneous perspective, named HeteroAKD. Due to the substantial disparities between heterogeneous architectures, such as CNN and Transformer, directly transferring cross-architecture knowledge presents significant challenges. To eliminate the influence of architecture-specific information, the intermediate features of both the teacher and student are projected into an aligned logits space. Furthermore, to utilize diverse knowledge from heterogeneous architectures and deliver customized knowledge required by the student, a teacher-student knowledge mixing mechanism (KMM) and a teacher-student knowledge evaluation mechanism (KEM) are introduced. These mechanisms work by assessing the reliability of, and the discrepancy between, heterogeneous teacher and student knowledge. Extensive experiments conducted on three mainstream benchmarks using various teacher-student pairs demonstrate that our HeteroAKD outperforms state-of-the-art KD methods in facilitating distillation between heterogeneous architectures.

Virtual-mask Informed Prior for Sparse-view Dual-Energy CT Reconstruction

Zini Chen,Yao Xiao,Junyan Zhang,Shaoyu Wang,Liu Shi,Qiegen Liu

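Task: Propose a dual-domain virtual-mask informed diffusion model for sparse-view dual-energy CT reconstruction.

Motivation: Sparse-view sampling in dual-energy CT lowers radiation dose and speeds up imaging but is prone to artifacts; most existing diffusion models focus on the image domain and lack global constraints, which leads to insufficient reconstruction quality.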

Details

Method: Design a virtual mask and apply it to the high-energy and low-energy data to build high-dimensional tensors that serve as the prior information of the diffusion model, and adopt a dual-domain collaboration strategy that integrates wavelet-domain and projection-domain information. Result: Experiments show that the method performs well across multiple datasets. Conclusion: Through dual-domain collaboration and virtual masking, the method optimizes global structures and local details and markedly improves sparse-view dual-energy CT reconstruction quality. Abstract: Sparse-view sampling in dual-energy computed tomography (DECT) significantly reduces radiation dose and increases imaging speed, yet is highly prone to artifacts. Although diffusion models have demonstrated potential in effectively handling incomplete data, most existing methods in this field focus on the image domain and lack global constraints, which consequently leads to insufficient reconstruction quality. In this study, we propose a dual-domain virtual-mask informed diffusion model for sparse-view reconstruction by leveraging the high inter-channel correlation in DECT. Specifically, the study designs a virtual mask and applies it to the high-energy and low-energy data to perform perturbation operations, thus constructing high-dimensional tensors that serve as the prior information of the diffusion model. In addition, a dual-domain collaboration strategy is adopted to integrate the information of the randomly selected high-frequency components in the wavelet domain with the information in the projection domain, for the purpose of optimizing the global structures and local details. Experimental results indicate that the present method exhibits excellent performance across multiple datasets.

PRAD: Periapical Radiograph Analysis Dataset and Benchmark Model Development

Zhenhuan Zhou,Yuchen Zhang,Ruihong Xu,Xuansen Zhao,Tao Li

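Task: Present PRAD-10K, a dataset for deep learning analysis of periapical radiographs (PR), and design a network named PRNet for PR segmentation.

Motivation: Deep learning for dental auxiliary diagnosis has focused mainly on panoramic radiographs and cone beam CT; although PR are the most widely used imaging modality in endodontics and periodontics, datasets for PR analysis are scarce, which has limited deep learning work on PR.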

Details

Method: Build PRAD-10K, a dataset of 10,000 clinical PR images, and design the PRNet network for PR segmentation. Result: PRNet outperforms existing medical image segmentation models on PRAD-10K. Conclusion: PRAD-10K and PRNet provide a new benchmark for PR analysis and advance deep learning in dental auxiliary diagnosis. Abstract: Deep learning (DL), a pivotal technology in artificial intelligence, has recently gained substantial traction in the domain of dental auxiliary diagnosis. However, its application has predominantly been confined to imaging modalities such as panoramic radiographs and Cone Beam Computed Tomography, with limited focus on auxiliary analysis specifically targeting Periapical Radiographs (PR). PR are the most extensively utilized imaging modality in endodontics and periodontics due to their capability to capture detailed local lesions at a low cost. Nevertheless, challenges such as resolution limitations and artifacts complicate the annotation and recognition of PR, leading to a scarcity of publicly available, large-scale, high-quality PR analysis datasets. This scarcity has somewhat impeded the advancement of DL applications in PR analysis. In this paper, we present PRAD-10K, a dataset for PR analysis. PRAD-10K comprises 10,000 clinical periapical radiograph images, with pixel-level annotations provided by professional dentists for nine distinct anatomical structures, lesions, and artificial restorations or medical devices. We also include classification labels for images with typical conditions or lesions. Furthermore, we introduce a DL network named PRNet to establish benchmarks for PR segmentation tasks. Experimental results demonstrate that PRNet surpasses previous state-of-the-art medical image segmentation models on the PRAD-10K dataset. The codes and dataset will be made publicly available.

Focal Cortical Dysplasia Type II Detection Using Cross Modality Transfer Learning and Grad-CAM in 3D-CNNs for MRI Analysis

Lorenzo Lasagni,Antonio Ciccarone,Renzo Guerrini,Matteo Lenge,Ludovico D'incerti

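Task: Investigate the use of 3D convolutional neural networks (3D-CNNs) to detect focal cortical dysplasia (FCD) type II.

Motivation: FCD is a major cause of drug-resistant epilepsy, but it is difficult to diagnose on MRI and easily misdiagnosed, so more accurate detection methods are needed.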

Details

Method: Use ResNet architectures (ResNet-18, -34, and -50) with cross-modality transfer learning and explainable AI techniques such as Grad-CAM, on a dataset of T1-weighted and FLAIR MRI scans. Result: Transfer learning significantly improves classification accuracy (up to 80.3%) and the model's focus on clinically relevant regions, as measured by the Heat-Score metric. Conclusion: Transfer learning and XAI are important for advancing AI-based medical diagnostics, especially for difficult-to-diagnose pathologies such as FCD. Abstract: Focal cortical dysplasia (FCD) type II is a major cause of drug-resistant epilepsy, often curable only by surgery. Despite its clinical importance, FCD is very difficult to diagnose on MRI because its abnormalities are subtle, which leads to misdiagnosis. This study investigates the use of 3D convolutional neural networks (3D-CNNs) for FCD detection, using a dataset of 170 subjects (85 FCD patients and 85 controls) composed of T1-weighted and FLAIR MRI scans. It examines the benefits obtained from cross-modality transfer learning and explainable artificial intelligence (XAI) techniques, in particular Gradient-weighted Class Activation Mapping (Grad-CAM). ResNet architectures (ResNet-18, -34, and -50) were implemented, employing transfer learning strategies that used pre-trained weights from segmentation tasks. Results indicate that transfer learning significantly enhances classification accuracy (up to 80.3%) and interpretability, as measured by a novel Heat-Score metric, which evaluates the model's focus on clinically relevant regions. Improvements in the Heat-Score metric underscore the model's seizure zone localization capabilities, bringing AI predictions and clinical insights closer together. These results highlight the importance of transfer learning, including cross-modality, and XAI in advancing AI-based medical diagnostics, especially for difficult-to-diagnose pathologies such as FCD.
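The Grad-CAM computation named above weights each feature map by the spatial mean of its gradient, sums the weighted maps, and applies a ReLU. The sketch below shows that arithmetic on a 2D toy example in plain Python. It is not the authors' 3D pipeline: the function name `grad_cam`, the nested-list inputs, and the final scaling to [0, 1] are assumptions for illustration, and the Heat-Score metric is not reproduced because the abstract does not define it.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from K feature maps and their gradients.

    activations, gradients: lists of K maps, each a list of rows of floats.
    gradients[k] holds d(class score)/d(activations[k]) at every position.
    Returns a heatmap of the same spatial size, scaled to [0, 1].
    """
    heat = [[0.0] * len(row) for row in activations[0]]
    for act, grad in zip(activations, gradients):
        count = sum(len(row) for row in grad)
        # Channel weight: global average of the gradient over all positions.
        weight = sum(sum(row) for row in grad) / count
        for i, row in enumerate(act):
            for j, value in enumerate(row):
                heat[i][j] += weight * value
    # ReLU keeps only evidence that increases the class score.
    heat = [[max(v, 0.0) for v in row] for row in heat]
    peak = max(max(row) for row in heat)
    if peak > 0:
        heat = [[v / peak for v in row] for row in heat]  # display scaling (assumed)
    return heat

# Channel 0 supports the class (positive gradient); channel 1 opposes it.
toy_acts = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
toy_grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(toy_acts, toy_grads)
```

For a 3D-CNN the same steps apply with the average and the weighted sum taken over three spatial axes instead of two.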

Adaptive Detection of Fast Moving Celestial Objects Using a Mixture of Experts and Physical-Inspired Neural Network

Peng Jia,Ge Li,Bafeng Cheng,Yushan Li,Rongyu Sun

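Task: Propose a novel algorithm for detecting fast moving celestial objects in star fields.

Motivation: As space-based telescopes and their diverse observational modes become more common, detection methods designed for ground-based telescopes work less well on space-based observations.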

Details

Method: Transform state-of-the-art fast moving celestial object detection neural networks into physical-inspired neural networks that use the telescope's point spread function and the specific observational mode as prior information, so that they identify fast moving objects directly without additional training; combine all networks with the mixture of experts technique. Result: Experiments on simulated and real observation data show that the algorithm effectively detects fast moving celestial objects across different observational modes. Conclusion: The physical-inspired neural network approach addresses the limitations of traditional techniques for space telescope observations and achieves effective detection. Abstract: Fast moving celestial objects are characterized by velocities across the celestial sphere that significantly differ from the motions of background stars. In observational images, these objects exhibit distinct shapes, contrasting with the typical appearances of stars. Depending on the observational method employed, these celestial entities may be designated as near-Earth objects or asteroids. Historically, fast moving celestial objects have been observed using ground-based telescopes, where the relative stability of stars and Earth facilitated effective image differencing techniques alongside traditional fast moving celestial object detection and classification algorithms. However, the growing prevalence of space-based telescopes, along with their diverse observational modes, produces images with different properties, rendering conventional methods less effective. This paper presents a novel algorithm for detecting fast moving celestial objects within star fields. Our approach enhances state-of-the-art fast moving celestial object detection neural networks by transforming them into physical-inspired neural networks. These neural networks leverage the point spread function of the telescope and the specific observational mode as prior information; they can directly identify fast moving celestial objects within star fields without requiring additional training, thereby addressing the limitations of traditional techniques. Additionally, all neural networks are integrated using the mixture of experts technique, forming a comprehensive fast moving celestial object detection algorithm. We have evaluated our algorithm using simulated observational data that mimics various space-based telescope observation scenarios, as well as real observation images. Results demonstrate that our method effectively detects fast moving celestial objects across different observational modes.

Revisiting Likelihood-Based Out-of-Distribution Detection by Modeling Representations

Yifan Ding,Arturas Aleksandrauskas,Amirhossein Ahmadian,Jonas Unger,Fredrik Lindsten,Gabriel Eilertsen

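Task: Examine the effectiveness of likelihood-based deep generative models for out-of-distribution (OOD) detection.

Motivation: Likelihood-based generative models have performed poorly at OOD detection on image data; the work aims to show that likelihood is not inherently flawed.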

Details

Method: Use the probability flow formulation of a diffusion model as the likelihood estimator and apply it in the representation space of pre-trained encoders. Result: In representation space, likelihood-based methods perform on par with state-of-the-art methods. Conclusion: Under suitable conditions likelihood remains usable for OOD detection; the key is choosing a suitable representation space and likelihood estimator. Abstract: Out-of-distribution (OOD) detection is critical for ensuring the reliability of deep learning systems, particularly in safety-critical applications. Likelihood-based deep generative models have historically faced criticism for their unsatisfactory performance in OOD detection, often assigning higher likelihood to OOD data than in-distribution samples when applied to image data. In this work, we demonstrate that likelihood is not inherently flawed. Rather, several properties of the image space prevent likelihood from serving as a valid detection score. Given a sufficiently good likelihood estimator, specifically using the probability flow formulation of a diffusion model, we show that likelihood-based methods can still perform on par with state-of-the-art methods when applied in the representation space of pre-trained encoders. The code of our work can be found at https://github.com/limchaos/Likelihood-OOD.git.

HarmonySeg: Tubular Structure Segmentation with Deep-Shallow Feature Fusion and Growth-Suppression Balanced Loss

Yi Huang,Ke Zhang,Wei Liu,Yuanyuan Wang,Vishal M. Patel,Le Lu,Xu Han,Dakai Jin,Ke Yan

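Task: Propose HarmonySeg, a tubular structure segmentation framework that addresses the challenges of segmenting tubular structures in medical images.

Motivation: Accurate segmentation of tubular structures such as vessels and airway trees in medical images is crucial for computer-aided diagnosis, radiotherapy, and surgical planning, but algorithm design is challenged by diverse sizes, complex topologies, and incomplete annotations.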

Details

Method: Design a deep-to-shallow decoder network with flexible convolution blocks, incorporate vesselness maps as auxiliary information aligned with image features through a shallow-and-deep fusion module, and introduce a topology-preserving loss function. Result: Experiments on four public datasets show that the model accurately segments 2D and 3D tubular structures and outperforms existing methods; external validation on a private dataset also shows good generalizability. Conclusion: Through multi-scale adaptation, fusion of auxiliary information, and a topology-preserving loss, HarmonySeg effectively addresses the challenges of tubular structure segmentation. Abstract: Accurate segmentation of tubular structures in medical images, such as vessels and airway trees, is crucial for computer-aided diagnosis, radiotherapy, and surgical planning. However, significant challenges exist in algorithm design when faced with diverse sizes, complex topologies, and (often) incomplete data annotation of these structures. We address these difficulties by proposing a new tubular structure segmentation framework named HarmonySeg. First, we design a deep-to-shallow decoder network featuring flexible convolution blocks with varying receptive fields, which enables the model to effectively adapt to tubular structures of different scales. Second, to highlight potential anatomical regions and improve the recall of small tubular structures, we incorporate vesselness maps as auxiliary information. These maps are aligned with image features through a shallow-and-deep fusion module, which simultaneously eliminates unreasonable candidates to maintain high precision. Finally, we introduce a topology-preserving loss function that leverages contextual and shape priors to balance the growth and suppression of tubular structures, which also allows the model to handle low-quality and incomplete annotations. Extensive quantitative experiments are conducted on four public datasets. The results show that our model can accurately segment 2D and 3D tubular structures and outperform existing state-of-the-art methods. External validation on a private dataset also demonstrates good generalizability.

The Efficacy of Semantics-Preserving Transformations in Self-Supervised Learning for Medical Ultrasound

Blake VanBerlo,Alexander Wong,Jesse Hoey,Robert Arntfield

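Task: Systematically study the impact of data augmentation and preprocessing strategies on self-supervised learning for lung ultrasound.

Motivation: Data augmentation methods designed for natural images may not suit medical imaging tasks, so effective strategies tailored to ultrasound are needed.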

Details

Method: Assess three data augmentation pipelines (a baseline pipeline, a semantics-preserving pipeline, and a distilled pipeline) and evaluate the pretrained models on multiple classification tasks. Result: Semantics-preserving augmentation performs best on COVID-19 classification, while cropping-based methods perform better on B-line and pleural effusion detection. Conclusion: Semantics-preserving preprocessing for ultrasound improves downstream performance, and the study offers guidance for practitioners of self-supervised learning. Abstract: Data augmentation is a central component of joint embedding self-supervised learning (SSL). Approaches that work for natural images may not always be effective in medical imaging tasks. This study systematically investigated the impact of data augmentation and preprocessing strategies in SSL for lung ultrasound. Three data augmentation pipelines were assessed: (1) a baseline pipeline commonly used across imaging domains, (2) a novel semantics-preserving pipeline designed for ultrasound, and (3) a distilled set of the most effective transformations from both pipelines. Pretrained models were evaluated on multiple classification tasks: B-line detection, pleural effusion detection, and COVID-19 classification. Experiments revealed that semantics-preserving data augmentation resulted in the greatest performance for COVID-19 classification, a diagnostic task requiring global image context. Cropping-based methods yielded the greatest performance on the B-line and pleural effusion object classification tasks, which require strong local pattern recognition. Lastly, semantics-preserving ultrasound image preprocessing resulted in increased downstream performance for multiple tasks. Guidance regarding data augmentation and preprocessing strategies was synthesized for practitioners working with SSL in ultrasound.

Zero-Shot Low-dose CT Denoising via Sinogram Flicking

Yongyi Shi,Ge Wang

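Task: Propose a zero-shot low-dose CT imaging method based on sinogram flicking that addresses the loss of image resolution caused by downsampling and the limited training data of existing methods.

Motivation: Large numbers of paired noisy and clean images are hard to obtain in clinical practice, and existing zero-shot self-supervised methods such as ZS-N2N are limited by downsampling operations and by training data confined to a single image.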

Details

Method: Generate many sinograms from a single image through random conjugate ray matching, exploiting the fact that conjugate X-ray measurements follow the same path but carry different noise, and train a lightweight network on them. Result: A simulation study shows that the method outperforms state-of-the-art approaches such as ZS-N2N. Conclusion: Sinogram flicking addresses the resolution loss and data limitations of zero-shot low-dose CT imaging and outperforms existing methods. Abstract: Many low-dose CT imaging methods rely on supervised learning, which requires a large number of paired noisy and clean images. However, obtaining paired images in clinical practice is challenging. To address this issue, zero-shot self-supervised methods, such as ZS-N2N, train denoising networks using only the information within a single image. However, these methods often employ downsampling operations that degrade image resolution. Additionally, the training dataset is inherently constrained to the image itself. In this paper, we propose a zero-shot low-dose CT imaging method based on sinogram flicking, which operates within a single image but generates many copies via random conjugate ray matching. Specifically, two conjugate X-ray pencil beams measure the same path; their expected values should be identical, while their noise levels vary during measurements. By randomly swapping portions of the conjugate X-rays in the sinogram domain, we generate a large set of sinograms with consistent content but varying noise patterns. When displayed dynamically, these sinograms exhibit a flickering effect due to their identical structural content but differing noise patterns, hence the term sinogram flicking. We train the network on pairs of sinograms with the same content but different noise distributions using a lightweight model adapted from ZS-N2N. This process is repeated to obtain the final results. A simulation study demonstrates that our method outperforms state-of-the-art approaches such as ZS-N2N.
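The random swapping of conjugate rays can be sketched for the simplest geometry. The code below assumes a full 360-degree parallel-beam sinogram with an even number of equally spaced views and a detector centered on the rotation axis, so that the ray at view `v`, detector bin `d` has its conjugate at view `v + n_views/2`, bin `n_det - 1 - d`. The abstract does not state the scanner geometry, so this indexing, the function name `flick_sinogram`, and the `swap_prob` parameter are all assumptions for illustration.

```python
import random

def flick_sinogram(sino, swap_prob=0.5, rng=None):
    """Return a copy of sino with conjugate ray pairs randomly swapped.

    sino: list of n_views rows, each a list of n_det detector readings.
    Each swapped pair measures the same line through the object, so the
    expected content is unchanged and only the noise realization moves.
    """
    rng = rng or random.Random()
    n_views, n_det = len(sino), len(sino[0])
    if n_views % 2:
        raise ValueError("need an even number of views over 360 degrees")
    half = n_views // 2
    out = [row[:] for row in sino]
    # Visiting only the first half of the views touches each pair exactly once.
    for v in range(half):
        for d in range(n_det):
            if rng.random() < swap_prob:
                cv, cd = v + half, n_det - 1 - d
                out[v][d], out[cv][cd] = out[cv][cd], out[v][d]
    return out

# Two differently "flicked" copies of the same toy sinogram form one training pair.
toy = [[10 * v + d for d in range(3)] for v in range(4)]
copy_a = flick_sinogram(toy, rng=random.Random(0))
copy_b = flick_sinogram(toy, rng=random.Random(1))
```

Each call draws a fresh swap pattern, so repeated calls yield many sinograms with the same underlying content, which is how a single scan can supply a larger training set. A fan-beam or helical scanner would need a different conjugate-ray lookup than the index arithmetic used here.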