2025 04 03

Repetitions are not all alike: distinct mechanisms sustain repetition in language models

Matéo Mahaut,Francesca Franzon

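Task: Explore the mechanisms underlying repetitive text generation in language models and how they manifest under different conditions.

Motivation: Text generated by language models can become repetitive for many reasons, yet prior work typically treats repetition as a single phenomenon even though several factors may drive it.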

Details

Method: Experimentally analyze the repetition behavior of language models under two conditions (repetition that emerges naturally and repetition induced through in-context learning) and examine how the internal mechanisms differ.

Result: Model confidence, reliance on attention heads, and response to perturbations differ markedly between the two conditions, indicating that repetition may be driven by different internal mechanisms.

Conclusion: Repetition in language models may be sustained by multiple internal mechanisms acting independently or in combination, which matters for understanding and mitigating it.

Abstract: Text generated by language models (LMs) can degrade into repetitive cycles, where identical word sequences are persistently repeated one after another. Prior research has typically treated repetition as a unitary phenomenon. However, repetitive sequences emerge under diverse tasks and contexts, raising the possibility that repetition may be driven by multiple underlying factors. Here, we experimentally explore the hypothesis that repetition in LMs can result from distinct mechanisms, reflecting different text generation strategies used by the model. We examine the internal workings of LMs under two conditions that prompt repetition: one in which repeated sequences emerge naturally after human-written text, and another where repetition is explicitly induced through an in-context learning (ICL) setup. Our analysis reveals key differences between the two conditions: the model exhibits varying levels of confidence, relies on different attention heads, and shows distinct patterns of change in response to controlled perturbations. These findings suggest that distinct internal mechanisms can interact to drive repetition, with implications for its interpretation and mitigation strategies. More broadly, our results highlight that the same surface behavior in LMs may be sustained by different underlying processes, acting independently or in combination.
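
To make the notion of a "repetitive cycle" concrete, the following minimal sketch checks whether a token sequence ends in the same block repeated back to back. It is an illustration only, not the authors' detection procedure: the function name `find_trailing_cycle`, the `min_repeats` parameter, and the whitespace tokenization are all assumptions.

```python
def find_trailing_cycle(tokens, min_repeats=3):
    """Return the shortest token block that ends the sequence as a cycle, or None.

    A block of length p qualifies if the last p * min_repeats tokens are
    exactly that block repeated min_repeats times in a row.
    (min_repeats is a hypothetical knob, not a value from the paper.)
    """
    n = len(tokens)
    for period in range(1, n // min_repeats + 1):
        block = tokens[n - period:]
        window = tokens[n - period * min_repeats:]
        if window == block * min_repeats:
            return block
    return None


# Invented example text; whitespace splitting stands in for a real tokenizer.
text = "the cat sat on the mat and then it slept and then it slept and then it slept"
print(find_trailing_cycle(text.split()))
```

A check like this only flags the surface behavior; the abstract's point is that the same surface pattern can arise from different internal mechanisms.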

Can LLMs Grasp Implicit Cultural Values? Benchmarking LLMs' Metacognitive Cultural Intelligence with CQ-Bench

Ziyi Liu,Priyanka Dey,Zhenyu Zhao,Jen-tse Huang,Rahul Gupta,Yang Liu,Jieyu Zhao

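Task: Evaluate the ability of large language models (LLMs) to infer implicit cultural values from natural conversations.

Motivation: Existing research mostly targets explicitly stated cultural norms and overlooks implicit values, which limits how effectively LLMs can engage with globally diverse users.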

Details

Method: Introduce the CQ-Bench benchmark, which generates multi-character conversational stories from the World Values Survey and GlobalOpinions datasets and validates them rigorously with GPT-4o.

Result: Some models reach human-level performance on value selection but perform poorly on attitude detection and value extraction; fine-tuning small models improves performance substantially.

Conclusion: CQ-Bench exposes the challenges LLMs face in cross-cultural reasoning and suggests practical paths for improving their abilities.

Abstract: Cultural Intelligence (CQ) refers to the ability to understand unfamiliar cultural contexts, a crucial skill for large language models (LLMs) to effectively engage with globally diverse users. While existing research often focuses on explicitly stated cultural norms, such approaches fail to capture the subtle, implicit values that underlie real-world conversations. To address this gap, we introduce CQ-Bench, a benchmark specifically designed to assess LLMs' capability to infer implicit cultural values from natural conversational contexts. We generate a dataset of multi-character conversation-based stories using values from the World Values Survey and GlobalOpinions datasets, with topics spanning ethics, religion, society, and politics. Our dataset construction pipeline includes rigorous validation procedures (incorporation, consistency, and implicitness checks) using GPT-4o, with 98.2% human-model agreement in the final validation. Our benchmark consists of three tasks of increasing complexity: attitude detection, value selection, and value extraction. We find that while o1 and Deepseek-R1 models reach human-level performance in value selection (0.809 and 0.814), they still fall short in nuanced attitude detection, with F1 scores of 0.622 and 0.635, respectively. In the value extraction task, GPT-4o-mini and o3-mini score 0.602 and 0.598, highlighting the difficulty of open-ended cultural reasoning. Notably, fine-tuning smaller models (e.g., LLaMA-3.2-3B) on only 500 culturally rich examples improves performance by over 10%, even outperforming stronger baselines (o3-mini) in some cases. Using CQ-Bench, we provide insights into the current challenges in LLMs' CQ research and suggest practical pathways for enhancing LLMs' cross-cultural reasoning abilities.

Is the Top Still Spinning? Evaluating Subjectivity in Narrative Understanding

Melanie Subbiah,Akankshya Mishra,Grace Kim,Liyan Tang,Greg Durrett,Kathleen McKeown

Task: Determining the faithfulness of a claim to a source document, particularly for ambiguous claims, by reframing the task to manage subjectivity.

Motivation: Binary judgments of claim faithfulness are unreliable for ambiguous claims, as subjectivity and differing interpretations can lead to inconsistent evaluations.

Details

Method: Introducing LLM-generated edits of summaries to measure ambiguity, resulting in the Ambiguity Rewrite Metric (ARM), which provides nuanced feedback.

Result: ARM achieves a 21% absolute improvement in annotator agreement on claim faithfulness, reducing subjectivity.

Conclusion: ARM offers a richer and more reliable evaluation metric for claim faithfulness compared to binary judgments, especially in contexts with high ambiguity.

Abstract: Determining faithfulness of a claim to a source document is an important problem across many domains. This task is generally treated as a binary judgment of whether the claim is supported or unsupported in relation to the source. In many cases, though, whether a claim is supported can be ambiguous. For instance, it may depend on making inferences from given evidence, and different people can reasonably interpret the claim as either supported or unsupported based on their agreement with those inferences. Forcing binary labels upon such claims lowers the reliability of evaluation. In this work, we reframe the task to manage the subjectivity involved with factuality judgments of ambiguous claims. We introduce LLM-generated edits of summaries as a method of providing a nuanced evaluation of claims: how much does a summary need to be edited to be unambiguous? Whether a claim gets rewritten and how much it changes can be used as an automatic evaluation metric, the Ambiguity Rewrite Metric (ARM), with a much richer feedback signal than a binary judgment of faithfulness. We focus on the area of narrative summarization as it is particularly rife with ambiguity and subjective interpretation. We show that ARM produces a 21% absolute improvement in annotator agreement on claim faithfulness, indicating that subjectivity is reduced.
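
The idea that "how much a claim changes" can serve as a score can be sketched with a token-level similarity ratio. This is a hedged illustration, not the ARM formula (the abstract does not give one): the function `rewrite_score`, the use of `difflib.SequenceMatcher`, and the example claim are assumptions, and the rewrite is supplied by hand rather than produced by an LLM.

```python
from difflib import SequenceMatcher


def rewrite_score(original, rewritten):
    """Share of the claim that changed, in [0, 1]; 0.0 means it was not rewritten.

    Assumption: change is measured as 1 minus difflib's similarity ratio
    over whitespace tokens. The paper may measure change differently.
    """
    a, b = original.split(), rewritten.split()
    return 1.0 - SequenceMatcher(None, a, b).ratio()


# Invented claim and hand-written rewrite, for illustration only.
claim = "Cobb returns home to his real children"
edited = "Cobb returns home to what may be his real children"
print(round(rewrite_score(claim, claim), 3))
print(round(rewrite_score(claim, edited), 3))
```

Unlike a supported/unsupported label, a graded score of this kind distinguishes a claim that needed a small hedge from one that had to be largely rewritten.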

Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models

Guy Kaplan,Michael Toker,Yuval Reif,Yonatan Belinkov,Roy Schwartz

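Task: Study the role of information flow between textual token representations in text-to-image (T2I) models and propose a method for reducing semantic leakage.

Motivation: T2I models suffer from semantic leakage, incorrect feature binding, and omission of key concepts; addressing these issues requires a closer analysis of information flow.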

Details

Method: Generate images by applying the diffusion component to subsets of textual tokens, observe information redundancy and leakage, and propose a training-free method for mitigating semantic leakage.

Result: Some tokens are redundant: removing them maintains image generation performance and reduces errors by 21%. The proposed method reduces semantic leakage by 85%.

Conclusion: The study offers a comprehensive analysis of information flow in T2I models and a practical method for improvement.

Abstract: Text-to-Image (T2I) models often suffer from issues such as semantic leakage, incorrect feature binding, and omissions of key concepts in the generated image. This work studies these phenomena by looking into the role of information flow between textual token representations. To this end, we generate images by applying the diffusion component on a subset of contextual token representations in a given prompt and observe several interesting phenomena. First, in many cases, a word or multiword expression is fully represented by one or two tokens, while other tokens are redundant. For example, in "San Francisco's Golden Gate Bridge", the token "gate" alone captures the full expression. We demonstrate the redundancy of these tokens by removing them after textual encoding and generating an image from the resulting representation. Surprisingly, we find that this process not only maintains image generation performance but also reduces errors by 21% compared to standard generation. We then show that information can also flow between different expressions in a sentence, which often leads to semantic leakage. Based on this observation, we propose a simple, training-free method to mitigate semantic leakage: replacing the leaked item's representation after the textual encoding with its uncontextualized representation. Remarkably, this simple approach reduces semantic leakage by 85%. Overall, our work provides a comprehensive analysis of information flow across textual tokens in T2I models, offering both novel insights and practical benefits.

Omnidirectional Depth-Aided Occupancy Prediction based on Cylindrical Voxel for Autonomous Driving

Chaofan Wu,Jiaheng Li,Jinghao Cao,Ming Li,Yongkang Feng,Jiayu Wu,Shuwen Xu,Zihang Gao,Sidan Du,Yang Li

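Task: Improve 3D perception for autonomous driving through omnidirectional depth estimation and a cylindrical voxel representation.

Motivation: Traditional methods lack geometric priors and therefore struggle to resolve geometric ambiguity.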

Details

Method: Propose OmniDepth-Occ, a depth-based Sketch-Coloring framework, and adopt a cylindrical voxel representation in polar coordinates.

Result: Experiments show that the method significantly improves 3D perception performance.

Conclusion: Introducing geometric priors and a novel representation effectively addresses 3D perception challenges in autonomous driving.

Abstract: Accurate 3D perception is essential for autonomous driving. Traditional methods often struggle with geometric ambiguity due to a lack of geometric priors. To address these challenges, we use omnidirectional depth estimation to introduce a geometric prior. Based on the depth information, we propose a Sketch-Coloring framework, OmniDepth-Occ. Additionally, our approach introduces a cylindrical voxel representation based on polar coordinates to better align with the radial nature of panoramic camera views. To address the lack of fisheye camera datasets for autonomous driving tasks, we also build a virtual scene dataset with six fisheye cameras, with a data volume twice that of SemanticKITTI. Experimental results demonstrate that our Sketch-Coloring network significantly enhances 3D perception performance.

μKE: Matryoshka Unstructured Knowledge Editing of Large Language Models

Zian Su,Ziyang Huang,Kaiyuan Zhang,Xiangyu Zhang

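Task: Propose Matryoshka Unstructured Knowledge Editing (μKE), a new method for improving unstructured knowledge editing in large language models (LLMs).

Motivation: LLMs used as knowledge bases are limited by static training data, which leads to hallucinations and safety risks, and existing unstructured knowledge editing methods (such as window-based autoregressive methods) disrupt the causal dependency between early memory updates and later output tokens.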

Details

Method: Design a new memory update mechanism (μKE) that preserves causal dependencies through a Matryoshka-style objective and adaptive loss coefficients.

Result: Empirical evaluation on two models across four benchmarks shows that μKE improves edit efficacy by up to 12.33% and remains robust on edits in diverse formats.

Conclusion: μKE is an effective unstructured knowledge editing method with potential for broad application in LLMs.

Abstract: Large language models (LLMs) have emerged as powerful knowledge bases yet are limited by static training data, leading to issues such as hallucinations and safety risks. Editing a model's internal knowledge through the locate-and-edit paradigm has proven a cost-effective alternative to retraining, though current unstructured approaches, especially window-based autoregressive methods, often disrupt the causal dependency between early memory updates and later output tokens. In this work, we first theoretically analyze these limitations and then introduce Matryoshka Unstructured Knowledge Editing (μKE), a novel memory update mechanism that preserves such dependencies via a Matryoshka-style objective and adaptive loss coefficients. Empirical evaluations on two models across four benchmarks demonstrate that μKE improves edit efficacy by up to 12.33% over state-of-the-art methods and remains robust when applied to diversely formatted edits, underscoring its potential for effective unstructured knowledge editing in LLMs.

Gaze-Guided 3D Hand Motion Prediction for Detecting Intent in Egocentric Grasping Tasks

Yufei He,Xucong Zhang,Arno H. A. Stienen

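Task: Predict sequences of hand motion and joint positions to detect human intent.

Motivation: Traditional methods that rely on physiological signal measurement are restrictive in neurorehabilitation applications and lack environmental context.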

Details

Method: Propose an approach that combines gaze information, historical hand motion sequences, and environmental object data, using a vector-quantized variational autoencoder and an autoregressive generative transformer for prediction.

Result: A pilot study with healthy subjects validates the approach; gaze information significantly improves prediction.

Conclusion: The method has potential for real-world application and performs especially well with fewer input frames.

Abstract: Human intention detection with hand motion prediction is critical for driving upper-extremity assistive robots in neurorehabilitation applications. However, traditional methods relying on physiological signal measurement are restrictive and often lack environmental context. We propose a novel approach that predicts future sequences of both hand poses and joint positions. This method integrates gaze information, historical hand motion sequences, and environmental object data, adapting dynamically to the assistive needs of the patient without prior knowledge of the intended object for grasping. Specifically, we use a vector-quantized variational autoencoder for robust hand pose encoding with an autoregressive generative transformer for effective hand motion sequence prediction. We demonstrate the usability of these novel techniques in a pilot study with healthy subjects. To train and evaluate the proposed method, we collect a dataset consisting of various types of grasp actions on different objects from multiple subjects. Through extensive experiments, we demonstrate that the proposed method can successfully predict sequential hand movement. In particular, gaze information significantly enhances prediction capabilities, especially with fewer input frames, highlighting the potential of the proposed method for real-world applications.

Medical large language models are easily distracted

Krithik Vishwanath,Anton Alyakin,Daniel Alexander Alber,Jin Vivian Lee,Douglas Kondziolka,Eric Karl Oermann

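Task: Evaluate the ability of large language models (LLMs) to filter out irrelevant information in clinical scenarios.

Motivation: Real-world clinical scenarios contain a great deal of extraneous information that can degrade LLM performance, and assistive technologies such as ambient dictation may introduce further noise, so the filtering ability of LLMs needs to be assessed.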

Details

Method: Develop the MedDistractQA benchmark, which embeds simulated real-world distractions in USMLE-style questions.

Result: Distracting statements (such as non-clinical uses of polysemous words or references to unrelated health conditions) can reduce LLM accuracy by up to 17.9%. Common remedies such as retrieval-augmented generation and medical fine-tuning did not help and in some cases degraded performance further.

Conclusion: LLMs lack the logical mechanisms needed to separate clinically relevant from irrelevant information, and more robust mitigation strategies are needed to make them resilient to distraction.

Abstract: Large language models (LLMs) have the potential to transform medicine, but real-world clinical scenarios contain extraneous information that can hinder performance. The rise of assistive technologies like ambient dictation, which automatically generates draft notes from live patient encounters, has the potential to introduce additional noise, making it crucial to assess the ability of LLMs to filter relevant data. To investigate this, we developed MedDistractQA, a benchmark using USMLE-style questions embedded with simulated real-world distractions. Our findings show that distracting statements (polysemous words with clinical meanings used in a non-clinical context or references to unrelated health conditions) can reduce LLM accuracy by up to 17.9%. Commonly proposed solutions to improve model performance, such as retrieval-augmented generation (RAG) and medical fine-tuning, did not change this effect and in some cases introduced their own confounders and further degraded performance. Our findings suggest that LLMs natively lack the logical mechanisms necessary to distinguish relevant from irrelevant clinical information, posing challenges for real-world applications. MedDistractQA and our results highlight the need for robust mitigation strategies to enhance LLM resilience to extraneous information.

Improving Applicability of Deep Learning based Token Classification models during Training

Anket Mehra,Malte Prieß,Marian Himstedt

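Task: Propose a new evaluation metric, Document Integrity Precision (DIP), to address the shortcomings of conventional classification metrics in assessing a model's practical applicability.

Motivation: Conventional classification metrics such as the F1-score cannot adequately assess a model's applicability in real business scenarios, particularly in visual document understanding and token classification.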

Details

Method: Train a LayoutLM-based model for token classification on German receipts and introduce DIP as a new metric that quantifies the share of test documents requiring manual intervention.

Result: Experiments show that conventional metrics are insensitive to changes in model performance, whereas DIP reflects the amount of manual intervention the model would require in deployment.

Conclusion: DIP is an evaluation metric suited to business scenarios; further research is needed to find comparable metrics for other tasks.

Abstract: This paper shows that further evaluation metrics are needed during model training to judge a model's applicability at inference time. As an example, a LayoutLM-based model is trained for token classification in documents. The documents are German receipts. We show that conventional classification metrics, represented by the F1-Score in our experiments, are insufficient for evaluating the applicability of machine learning models in practice. To address this problem, we introduce a novel metric, Document Integrity Precision (DIP), as a solution for visual document understanding and the token classification task. To the best of our knowledge, nothing comparable has been introduced in this context. DIP is a rigorous metric, describing how many documents of the test dataset require manual interventions. It enables AI researchers and software developers to conduct an in-depth investigation of the level of process automation in business software. To validate DIP, we conduct experiments with our models in different training settings to highlight and analyze how relevant DIP is to deciding whether a model should be deployed. Our results demonstrate that existing metrics barely change for isolated model impairments, whereas DIP indicates that the model requires substantial human interventions in deployment. The larger the set of entities being predicted, the less sensitive conventional metrics are, entailing poor automation quality. DIP, in contrast, remains a single value to be interpreted for entire entity sets. This highlights the importance of having metrics that focus on the business task for model training in production. Since DIP is created for the token classification task, more research is needed to find suitable metrics for other training tasks.
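
The contrast between a token-level metric and a document-level one can be shown with a small sketch. The abstract does not give the DIP formula, so the definition below is one plausible reading, stated as an assumption: the share of documents in which every token label is predicted correctly, whose complement is the share needing manual intervention. Function names, labels, and data are illustrative.

```python
def document_integrity_precision(docs):
    """Share of documents whose token labels are all predicted correctly.

    docs: list of (true_labels, predicted_labels) pairs, one per document.
    Assumed reading of DIP; 1 - DIP is then the share needing manual fixes.
    """
    if not docs:
        raise ValueError("need at least one document")
    intact = sum(1 for true, pred in docs if list(true) == list(pred))
    return intact / len(docs)


def token_accuracy(docs):
    """Conventional token-level accuracy pooled over all documents."""
    total = sum(len(true) for true, _ in docs)
    correct = sum(t == p for true, pred in docs for t, p in zip(true, pred))
    return correct / total


# Invented receipts: four documents of ten tokens, two with one missed entity.
docs = [
    (["O"] * 9 + ["TOTAL"], ["O"] * 9 + ["TOTAL"]),
    (["O"] * 9 + ["TOTAL"], ["O"] * 10),
    (["DATE"] + ["O"] * 9, ["DATE"] + ["O"] * 9),
    (["DATE"] + ["O"] * 9, ["O"] * 10),
]
print(token_accuracy(docs))                # 0.95
print(document_integrity_precision(docs))  # 0.5
```

Two wrong tokens out of forty barely move the pooled accuracy, yet half of the documents would still need a human to correct them, which is the gap the abstract describes.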

Detecting PTSD in Clinical Interviews: A Comparative Analysis of NLP Methods and Large Language Models

Feng Chen,Dror Ben-Zeev,Gillian Sparks,Arya Kadakia,Trevor Cohen

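Task: Evaluate natural language processing techniques for detecting post-traumatic stress disorder (PTSD) from clinical interview transcripts.

Motivation: PTSD is often underdiagnosed in clinical settings, and automated detection tools could help identify affected patients.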

Details

Method: Compare general and mental health-specific transformer models (BERT/RoBERTa), embedding-based methods (SentenceBERT/LLaMA), and large language model prompting strategies (zero-shot/few-shot/chain-of-thought) on the DAIC-WOZ dataset.

Result: Domain-specific models significantly outperform general ones (Mental-RoBERTa F1=0.643 vs. RoBERTa-base 0.485), and LLaMA embeddings combined with neural networks perform best (F1=0.700). Zero-shot prompting with DSM-5 criteria is also competitive (F1=0.657). Performance varies with symptom severity and comorbidity status.

Conclusion: Domain-adapted embeddings and large language models show promise for scalable screening, but detection of complex presentations needs to improve; the study offers insights for building clinically viable AI tools for PTSD assessment.

Abstract: Post-Traumatic Stress Disorder (PTSD) remains underdiagnosed in clinical settings, presenting opportunities for automated detection to identify patients. This study evaluates natural language processing approaches for detecting PTSD from clinical interview transcripts. We compared general and mental health-specific transformer models (BERT/RoBERTa), embedding-based methods (SentenceBERT/LLaMA), and large language model prompting strategies (zero-shot/few-shot/chain-of-thought) using the DAIC-WOZ dataset. Domain-specific models significantly outperformed general models (Mental-RoBERTa F1=0.643 vs. RoBERTa-base 0.485). LLaMA embeddings with neural networks achieved the highest performance (F1=0.700). Zero-shot prompting using DSM-5 criteria yielded competitive results without training data (F1=0.657). Performance varied significantly across symptom severity and comorbidity status, with higher accuracy for severe PTSD cases and patients with comorbid depression. Our findings highlight the potential of domain-adapted embeddings and LLMs for scalable screening while underscoring the need for improved detection of nuanced presentations and offering insights for developing clinically viable AI tools for PTSD assessment.

Cal or No Cal? -- Real-Time Miscalibration Detection of LiDAR and Camera Sensors

Ilir Tahiraj,Jeremialie Swadiryus,Felix Fent,Markus Lienkamp

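Task: Propose a binary classification framework for detecting the calibration state of sensors, as an alternative to directly regressing calibration parameters.

Motivation: Online calibration faces strict real-time and resource constraints that existing methods do not meet, so a more efficient approach is needed to keep autonomous vehicles safe.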

Details

Method: Use contrastive learning, comparing embedded features in a latent space to classify the calibration state of two different sensor modalities.

Result: The method outperforms the state of the art in detection performance, inference time, and resource demand.

Conclusion: The proposed miscalibration detection framework offers an efficient and reliable solution to sensor calibration in autonomous driving.

Abstract: The goal of extrinsic calibration is the alignment of sensor data to ensure an accurate representation of the surroundings and enable sensor fusion applications. From a safety perspective, sensor calibration is a key enabler of autonomous driving. In the current state of the art, a trend from target-based offline calibration towards targetless online calibration can be observed. However, online calibration is subject to strict real-time and resource constraints which are not met by state-of-the-art methods. This is mainly due to the high number of parameters to estimate, the reliance on geometric features, or the dependence on specific vehicle maneuvers. To meet these requirements and ensure the vehicle's safety at any time, we propose a miscalibration detection framework that shifts the focus from the direct regression of calibration parameters to a binary classification of the calibration state, i.e., calibrated or miscalibrated. To this end, we propose a contrastive learning approach that compares embedded features in a latent space to classify the calibration state of two different sensor modalities. Moreover, we provide a comprehensive analysis of the feature embeddings and challenging calibration errors that highlight the performance of our approach. As a result, our method outperforms the current state of the art in terms of detection performance, inference time, and resource demand. The code is open source and available at https://github.com/TUMFTM/MiscalibrationDetection.
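
The shift from regressing calibration parameters to a binary decision can be illustrated with a toy decision rule over two embeddings. This sketch is not the released code: it assumes the embeddings have already been produced by trained encoders, and the cosine similarity measure, the `threshold` value, and all names are hypothetical choices made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


def is_calibrated(camera_emb, lidar_emb, threshold=0.8):
    """Binary calibration state from embedding agreement.

    Assumption: aligned sensor pairs were trained to embed close together,
    so high similarity is read as 'calibrated'. threshold is hypothetical.
    """
    return cosine_similarity(camera_emb, lidar_emb) >= threshold


# Invented embeddings, for illustration only.
print(is_calibrated([0.9, 0.1, 0.4], [0.8, 0.2, 0.5]))
print(is_calibrated([0.9, 0.1, 0.4], [-0.2, 0.9, 0.1]))
```

A single comparison like this is cheap at inference time, which is the kind of property the abstract contrasts with estimating a full set of extrinsic parameters.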

A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates

Gonçalo Gomes,Chrysoula Zerva,Bruno Martins

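Task: Examine the current limitations of learned image captioning evaluation metrics and propose an improved approach.

Motivation: Existing metrics lack fine-grained assessment of individual misaligned words in captions and rely on single-point quality estimates that ignore uncertainty.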

Details

Method: Propose a simple yet effective strategy that calibrates CLIPScore distributions with a model-agnostic conformal risk control framework to address both problems.

Result: Experiments show that conformal risk control applied over distributions produced by simple operations such as input masking performs competitively with more complex methods, detects misaligned words effectively, and provides formal guarantees aligned with the desired risk level.

Conclusion: The method makes caption evaluation metrics more reliable and strengthens the correlation between uncertainty estimates and prediction errors.

Abstract: This study explores current limitations of learned image captioning evaluation metrics, specifically the lack of granular assessment for individual word misalignments within captions, and the reliance on single-point quality estimates without considering uncertainty. To address these limitations, we propose a simple yet effective strategy for generating and calibrating CLIPScore distributions. Leveraging a model-agnostic conformal risk control framework, we calibrate CLIPScore values for task-specific control variables to tackle these two limitations. Experimental results demonstrate that using conformal risk control over the distributions produced with simple methods such as input masking can achieve competitive performance compared to more complex approaches. Our method effectively detects misaligned words, while providing formal guarantees aligned with desired risk levels, and improving the correlation between uncertainty estimations and prediction errors, thus enhancing the overall reliability of caption evaluation metrics.
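
As a rough illustration of how conformal risk control picks a threshold with a risk guarantee, the sketch below applies the standard calibration rule (choose the smallest threshold whose adjusted empirical risk is at most the target level) to a toy word-flagging task. The loss, the scores, the threshold grid, and all names are assumptions for illustration; the paper's control variables and losses are not specified in the abstract.

```python
def missed_fraction(words, lam):
    """Loss for one caption: share of misaligned words that are NOT flagged.

    words: list of (score, is_misaligned); a word is flagged when score < lam.
    Raising lam flags more words, so this loss never increases with lam.
    """
    misaligned = [score for score, bad in words if bad]
    if not misaligned:
        return 0.0
    return sum(score >= lam for score in misaligned) / len(misaligned)


def calibrate_threshold(calibration_set, lambdas, alpha, loss_bound=1.0):
    """Smallest lam with (n/(n+1)) * risk + loss_bound/(n+1) <= alpha, else None."""
    n = len(calibration_set)
    if n == 0:
        raise ValueError("need at least one calibration example")
    for lam in sorted(lambdas):
        risk = sum(missed_fraction(words, lam) for words in calibration_set) / n
        if (n / (n + 1)) * risk + loss_bound / (n + 1) <= alpha:
            return lam
    return None


# Invented calibration captions: one misaligned word each, plus one aligned word.
scores = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
calibration = [[(s, True), (0.95, False)] for s in scores]
print(calibrate_threshold(calibration, [0.0, 0.25, 0.5, 0.75, 1.0], alpha=0.35))
```

The finite-sample correction term `loss_bound / (n + 1)` is what turns an empirical risk estimate into a guarantee on new data, under the usual exchangeability assumption.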

Coarse-to-Fine Learning for Multi-Pipette Localisation in Robot-Assisted In Vivo Patch-Clamp

Lan Wei,Gema Vera Gonzalez,Phatsimo Kgwarae,Alexander Timms,Denis Zahorovsky,Simon Schultz,Dandan Zhang

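Task: Develop a heatmap-augmented coarse-to-fine learning technique for real-time localisation of multiple pipettes in robot-assisted in vivo patch-clamp.

Motivation: Current multi-pipette patch-clamp procedures rely mainly on manual operation, which limits accessibility and scalability, and existing methods do not meet the needs of in vivo multi-pipette scenarios.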

Details

Method: Combine a generative adversarial network (GAN)-based background noise removal module with a two-stage Transformer model that achieves precise localisation through coarse heatmap prediction followed by fine-grained coordinate regression.

Result: Localisation accuracy exceeds 98% within 10 μm and 89% within 5 μm, with an average MSE of 2.52 μm.

Conclusion: The method significantly improves the precision and efficiency of real-time in vivo multi-pipette localisation, providing a reliable tool for neuroscience research.

Abstract: In vivo image-guided multi-pipette patch-clamp is essential for studying cellular interactions and network dynamics in neuroscience. However, current procedures mainly rely on manual expertise, which limits accessibility and scalability. Robotic automation presents a promising solution, but achieving precise real-time detection of multiple pipettes remains a challenge. Existing methods focus on ex vivo experiments or single pipette use, making them inadequate for in vivo multi-pipette scenarios. To address these challenges, we propose a heatmap-augmented coarse-to-fine learning technique to facilitate multi-pipette real-time localisation for robot-assisted in vivo patch-clamp. More specifically, we introduce a Generative Adversarial Network (GAN)-based module to remove background noise and enhance pipette visibility. We then introduce a two-stage Transformer model that starts with predicting the coarse heatmap of the pipette tips, followed by a fine-grained coordinate regression module for precise tip localisation. To ensure robust training, we use the Hungarian algorithm for optimal matching between the predicted and actual locations of tips. Experimental results demonstrate that our method achieved > 98% accuracy within 10 μm, and > 89% accuracy within 5 μm for the localisation of multi-pipette tips. The average MSE is 2.52 μm.
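
The matching step and the "accuracy within a radius" figures can be sketched as follows. To stay dependency-free, this example replaces the Hungarian algorithm with a brute-force search over permutations, which finds the same minimum-cost assignment but is only practical for a handful of pipettes. The Euclidean cost, the function names, and the coordinates are assumptions.

```python
import math
from itertools import permutations


def match_tips(predicted, actual):
    """Return assignment[i] = index of the actual tip matched to prediction i.

    Brute-force stand-in for the Hungarian algorithm: minimises the total
    Euclidean distance (assumed cost) over all one-to-one assignments.
    """
    if not predicted or len(predicted) != len(actual):
        raise ValueError("expected one prediction per actual tip")
    best = min(
        permutations(range(len(actual))),
        key=lambda perm: sum(
            math.dist(predicted[i], actual[j]) for i, j in enumerate(perm)
        ),
    )
    return list(best)


def accuracy_within(predicted, actual, radius_um):
    """Share of matched tips whose prediction lies within radius_um micrometres."""
    assignment = match_tips(predicted, actual)
    hits = sum(
        math.dist(predicted[i], actual[j]) <= radius_um
        for i, j in enumerate(assignment)
    )
    return hits / len(predicted)


# Invented tip coordinates in micrometres, listed in different orders.
predicted = [(102.0, 50.0), (10.0, 12.0)]
actual = [(10.0, 10.0), (100.0, 58.0)]
print(match_tips(predicted, actual))             # [1, 0]
print(accuracy_within(predicted, actual, 10.0))  # 1.0
print(accuracy_within(predicted, actual, 5.0))   # 0.5
```

Matching before scoring matters because a multi-pipette model outputs tips in no particular order; without it, a correct set of predictions could be compared against the wrong ground-truth tips.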

Catastrophic Forgetting in LLMs: A Comparative Analysis Across Language Tasks

Naimul Haque

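Task: Evaluate the continual fine-tuning ability of open-source large language models (LLMs) of different parameter sizes on key NLU tasks from the GLUE benchmark.

Motivation: As LLM agents increasingly handle tasks autonomously, models must adapt to new tasks without forgetting prior knowledge (the catastrophic forgetting problem).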

Details

Method: Use prompt engineering and task-specific adjustments to assess and compare how well models retain prior knowledge while learning new tasks.

Result: Phi-3.5-mini shows minimal forgetting and strong learning ability; Orca-2-7b and Qwen2.5-7B show excellent learning ability and overall performance after fine-tuning.

Conclusion: The study contributes to understanding catastrophic forgetting in LLMs and highlights the importance of prompt engineering in continual learning scenarios.

Abstract: Large Language Models (LLMs) have significantly advanced Natural Language Processing (NLP), particularly in Natural Language Understanding (NLU) tasks. As we progress toward an agentic world where LLM-based agents autonomously handle specialized tasks, it becomes crucial for these models to adapt to new tasks without forgetting previously learned information, a challenge known as catastrophic forgetting. This study evaluates the continual fine-tuning of various open-source LLMs with different parameter sizes (specifically models under 10 billion parameters) on key NLU tasks from the GLUE benchmark, including SST-2, MRPC, CoLA, and MNLI. By employing prompt engineering and task-specific adjustments, we assess and compare the models' abilities to retain prior knowledge while learning new tasks. Our results indicate that models such as Phi-3.5-mini exhibit minimal forgetting while maintaining strong learning capabilities, making them well-suited for continual learning environments. Additionally, models like Orca-2-7b and Qwen2.5-7B demonstrate impressive learning abilities and overall performance after fine-tuning. This work contributes to understanding catastrophic forgetting in LLMs and highlights the role of prompt engineering in optimizing model performance for continual learning scenarios.

Predicting Movie Production Years through Facial Recognition of Actors with Machine Learning

Asraa Muayed Abdalah,Noor Redha Alkazaz

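Task: Use machine learning algorithms to identify actors in images from Arab movies and extract their ages.

Motivation: Images from Arab movies pose challenges such as non-uniform lighting, varied poses, make-up, and costumes, which make recognition harder.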

Details

Method: Using the Arab Actors Dataset (AAD), apply multiple models for feature extraction and different machine learning algorithms for classification and prediction.

Result: Logistic regression performs best, with AUC, precision, CA, and F1-score of 99%, 86%, 85.5%, and 84.2%, respectively.

Conclusion: The findings can improve the precision and reliability of facial recognition technology for movie search services, recommendation algorithms, and genre classification.

Abstract: This study used machine learning algorithms to identify actors and extract their ages from images taken randomly from movies. Images taken from Arab movies pose challenges such as non-uniform lighting, varied and multiple poses, and additional elements appearing alongside an actor or group of actors. Additionally, the use of make-up, wigs, beards, and different accessories and costumes made it difficult for the system to recognize the same actor. The Arab Actors Dataset (AAD) comprises 574 images sourced from various movies, encompassing both black-and-white and color compositions. The images depict complete scenes or fragments thereof. Multiple models were employed for feature extraction, and diverse machine learning algorithms were utilized during the classification and prediction stages to determine the most effective algorithm for handling such image types. The Logistic Regression model exhibited the best performance compared to other models in the training phase, as evidenced by its AUC, precision, CA, and F1-score values of 99%, 86%, 85.5%, and 84.2%, respectively. The findings of this study can be used to improve the precision and reliability of facial recognition technology for uses such as movie search services, movie suggestion algorithms, and genre classification of movies.

Automated Factual Benchmarking for In-Car Conversational Systems using Large Language Models

Rafael Giebisch,Ken E. Friedl,Lev Sorokin,Andrea Stocco

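Task: Propose an LLM-based approach for automatically evaluating the factual accuracy of in-car conversational systems.

Motivation: Modern conversational systems are built on large language models (LLMs) and are prone to hallucinations (inaccurate or fictitious information), so a method is needed to verify their factual correctness.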

Details

Method: Instantiate five LLM-based methods that use ensembling techniques and diverse personae to increase agreement and reduce hallucinations.

Result: GPT-4 combined with Input Output Prompting agrees with expert evaluations of factual correctness more than 90% of the time, with an average response time of 4.5 s.

Conclusion: LLM-based testing is a viable approach to validating the factual correctness of conversational systems.

Abstract: In-car conversational systems promise to improve the in-vehicle user experience. Modern conversational systems are based on Large Language Models (LLMs), which makes them prone to errors such as hallucinations, i.e., inaccurate, fictitious, and therefore factually incorrect information. In this paper, we present an LLM-based methodology for the automatic factual benchmarking of in-car conversational systems. We instantiate our methodology with five LLM-based methods, leveraging ensembling techniques and diverse personae to enhance agreement and minimize hallucinations. We use our methodology to evaluate CarExpert, an in-car retrieval-augmented conversational question answering system, with respect to its factual correctness relative to a vehicle's manual. We produced a novel dataset specifically created for the in-car domain, and tested our methodology against an expert evaluation. Our results show that the combination of GPT-4 with Input Output Prompting achieves an agreement rate of over 90 per cent with expert evaluations of factual correctness, in addition to being the most efficient approach, with an average response time of 4.5s. Our findings suggest that LLM-based testing constitutes a viable approach for the validation of conversational systems regarding their factual correctness.

How does Watermarking Affect Visual Language Models in Document Understanding?

Chunxue Xu,Yiwei Wang,Bryan Hooi,Yujun Cai,Songze Li

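Task: Study how watermarks affect the performance of visual language models (VLMs) on document understanding tasks.

Motivation: Noise-like information such as watermarks may affect VLM performance, but a systematic evaluation framework is lacking.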

Details

Method: Propose a novel evaluation framework that accounts for document type, watermark position, and watermark content, and analyze experimentally how watermarks affect VLM performance.

Result: Watermarks reduce VLM performance significantly (by up to 36%), and scattered watermarks and watermarks with semantic content interfere more strongly.

Conclusion: Watermarks affect VLM performance by redistributing attention and altering semantic representations; the study offers insights for developing robust inference mechanisms on watermarked documents.

Abstract: Visual Language Models (VLMs) have become foundational models for document understanding tasks, widely used in the processing of complex multimodal documents across domains such as finance, law, and academia. However, documents often contain noise-like information, such as watermarks, which leads us to ask: do watermarks degrade the performance of VLMs in document understanding? To address this, we propose a novel evaluation framework to investigate the effect of visible watermarks on VLM performance. We take into account various factors, including different types of document data, the positions of watermarks within documents, and variations in watermark content. Our experimental results reveal that VLM performance can be significantly compromised by watermarks, with performance drop rates reaching up to 36%. We discover that scattered watermarks cause stronger interference than centralized ones, and that semantic content in watermarks creates greater disruption than simple visual occlusion. Through attention mechanism analysis and embedding similarity examination, we find that the performance drops are mainly attributable to the fact that watermarks 1) force widespread attention redistribution, and 2) alter semantic representation in the embedding space. Our research not only highlights significant challenges in deploying VLMs for document understanding, but also provides insights towards developing robust inference mechanisms on watermarked documents.

Grade Guard: A Smart System for Short Answer Automated Grading

Niharika Dadu,Harsh Vardhan Singh,Romi Banerjee

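Task: Propose Grade Guard, a new framework for improving the accuracy and reliability of large language models (LLMs) in Automated Short Answer Grading (ASAG).

Motivation: When grading short answers automatically, LLMs are influenced by the diverse perspectives in their training data, which leads to inaccurate evaluation of nuanced or partially correct answers.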

Details

Method: 1. Specialize the LLM to the task by tuning the temperature parameter against RMSE; 2. introduce an Indecisiveness Score (IS) that reflects grading uncertainty; 3. propose a Confidence-Aware Loss (CAL) to optimize the IS; 4. improve reliability through self-reflection and human re-evaluation.

Result: Grade Guard outperforms traditional methods across several models (Upstage Solar Pro, GPT 4-o Mini, and others), with RMSE improvements ranging from 4.00% to 23.64%.

Conclusion: Future work includes improving the interpretability of grades, expanding benchmark datasets, optimizing grading criteria, and supporting multilingual grading systems, in order to improve accuracy, adaptability, and fairness.

Abstract: The advent of large language models (LLMs) in the education sector has provided impetus to automate grading short answer questions. LLMs make evaluating short answers very efficient, thus addressing issues like staff shortage. However, in the task of Automated Short Answer Grading (ASAG), LLM responses are influenced by diverse perspectives in their training dataset, leading to inaccuracies in evaluating nuanced or partially correct answers. To address this challenge, we propose a novel framework, Grade Guard. 1. To enhance the task-based specialization of the LLMs, the temperature parameter is tuned using Root Mean Square Error (RMSE). 2. Unlike traditional approaches, LLMs in Grade Guard compute an Indecisiveness Score (IS) along with the grade to reflect uncertainty in predicted grades. 3. A Confidence-Aware Loss (CAL) is introduced to generate an optimized IS. 4. To improve reliability, self-reflection based on the optimized IS is introduced into the framework, enabling human re-evaluation to minimize incorrect grade assignments. Our experimentation shows that the best setting of Grade Guard outperforms traditional methods by 19.16% RMSE in Upstage Solar Pro, 23.64% RMSE in Upstage Solar Mini, 4.00% RMSE in Gemini 1.5 Flash, and 10.20% RMSE in GPT 4-o Mini. Future work includes improving interpretability by generating rationales for grades to enhance accuracy. Expanding benchmark datasets and annotating them with domain-specific nuances will enhance grading accuracy. Finally, analyzing feedback to enhance confidence in predicted grades, reduce biases, optimize grading criteria, and personalize learning while supporting multilingual grading systems will make the solution more accurate, adaptable, fair, and inclusive.
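
The first step of Grade Guard, choosing a temperature by RMSE, can be illustrated with a minimal sketch. The grades, the temperature grid, and the function names below are hypothetical; the abstract does not describe the authors' actual search procedure, so this only shows the general idea of scoring each setting against human grades and keeping the one with the lowest error.

```python
import math


def rmse(predicted, reference):
    """Root mean square error between two equal-length lists of grades."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("grade lists must be non-empty and the same length")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(reference))


def pick_temperature(grades_by_temperature, human_grades):
    """Return (temperature, rmse) for the setting with the lowest RMSE."""
    scored = {t: rmse(g, human_grades) for t, g in grades_by_temperature.items()}
    best = min(scored, key=scored.get)
    return best, scored[best]


# Hypothetical LLM grades for four answers at three temperature settings.
human = [2.0, 4.0, 3.0, 5.0]
candidates = {
    0.0: [2.0, 3.0, 3.0, 5.0],
    0.5: [2.0, 4.0, 3.0, 4.5],
    1.0: [1.0, 4.0, 5.0, 5.0],
}
best_t, best_rmse = pick_temperature(candidates, human)
```

In practice each list of grades would come from running the grader on a held-out set at that temperature; the selection step itself stays the same.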

SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering

Bingxin Li

Task: Propose SViQA, a unified speech-vision model that processes spoken questions directly, without text transcription.

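Motivation: Existing methods focus mainly on text-visual integration, while the heterogeneity of the speech and visual modalities has left speech-visual integration underexplored, limiting the potential of speech-visual interaction.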

Details

Method: Built on the LLaVA architecture, the model fuses speech signals with visual content through end-to-end speech feature extraction and cross-modal alignment optimization.

Result: On the SBVQA benchmark, SViQA reaches 75.62% accuracy, rising to 78.85% with mixed speech-text input.

Conclusion: SViQA shows strong performance and cross-modal generalization, offering an effective solution for speech-visual interaction.

Abstract: Multimodal models integrating speech and vision hold significant potential for advancing human-computer interaction, particularly in Speech-Based Visual Question Answering (SBVQA) where spoken questions about images require direct audio-visual understanding. Existing approaches predominantly focus on text-visual integration, leaving speech-visual modality gaps underexplored due to their inherent heterogeneity. To this end, we introduce SViQA, a unified speech-vision model that directly processes spoken questions without text transcription. Building upon the LLaVA architecture, our framework bridges auditory and visual modalities through two key innovations: (1) end-to-end speech feature extraction eliminating intermediate text conversion, and (2) cross-modal alignment optimization enabling effective fusion of speech signals with visual content. Extensive experimental results on the SBVQA benchmark demonstrate the proposed SViQA's state-of-the-art performance, achieving 75.62% accuracy, and competitive multimodal generalization. Leveraging speech-text mixed input boosts performance to 78.85%, a 3.23 percentage-point improvement over pure speech input, highlighting SViQA's enhanced robustness and effective cross-modal attention alignment.

Prompt-Reverse Inconsistency: LLM Self-Inconsistency Beyond Generative Randomness and Prompt Paraphrasing

Jihyun Janice Ahn,Wenpeng Yin

Task: Study Prompt-Reverse Inconsistency (PRIN) in large language models (LLMs) and its implications.

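Motivation: Prior work has focused on two types of generative inconsistency (randomness inconsistency and paraphrase inconsistency), whereas PRIN, a new form of self-inconsistency, has not been explored and may undermine the credibility of LLM-as-a-judge.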

Details

Method: Investigate PRIN through a series of experiments covering its extent across different LLMs, mitigation methods, potential applications, and its relationship to other inconsistencies.

Result: PRIN is widespread and exposes the difficulty LLMs have in adhering to basic logical rules; mitigation methods are proposed.

Conclusion: The study of PRIN offers a new perspective on the inner workings of LLMs and contributes to advancing trustworthy AI.

Abstract: While the inconsistency of LLMs is not a novel topic, prior research has predominantly addressed two types of generative inconsistencies: i) Randomness Inconsistency: running the same LLM for multiple trials yields varying responses; ii) Paraphrase Inconsistency: paraphrased prompts result in different responses from the same LLM. Randomness Inconsistency arises from the inherent randomness due to stochastic sampling in generative models, while Paraphrase Inconsistency is a consequence of the language modeling objectives, where paraphrased prompts alter the distribution of vocabulary logits. This research discovers Prompt-Reverse Inconsistency (PRIN), a new form of LLM self-inconsistency: given a question and a couple of LLM-generated answer candidates, the LLM often has conflicting responses when prompted "Which are correct answers?" and "Which are incorrect answers?". PRIN poses a big concern as it undermines the credibility of LLM-as-a-judge, and suggests a challenge for LLMs to adhere to basic logical rules. We conduct a series of experiments to investigate PRIN, examining the extent of PRIN across different LLMs, methods to mitigate it, potential applications, and its relationship with Randomness Inconsistency and Paraphrase Inconsistency. As the first study to explore PRIN, our findings offer valuable insights into the inner workings of LLMs and contribute to advancing trustworthy AI.
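
One way to make the PRIN check concrete is sketched below. It assumes a simple definition that the abstract does not spell out: the answers named as correct and the answers named as incorrect should together cover every candidate exactly once. The function names and the toy trials are hypothetical, and the paper's own metric may differ.

```python
def prin_consistent(candidates, judged_correct, judged_incorrect):
    """True when the two responses split the candidate set with no overlap or gap."""
    all_candidates = set(candidates)
    correct = set(judged_correct)
    incorrect = set(judged_incorrect)
    covers_everything = (correct | incorrect) == all_candidates
    no_overlap = not (correct & incorrect)
    return covers_everything and no_overlap


def prin_rate(trials):
    """Fraction of trials in which the two responses conflict."""
    conflicts = sum(not prin_consistent(*trial) for trial in trials)
    return conflicts / len(trials)


# Each trial: (candidates, reply to "which are correct?", reply to "which are incorrect?")
trials = [
    (["A", "B", "C"], ["A"], ["B", "C"]),       # consistent
    (["A", "B", "C"], ["A", "B"], ["B", "C"]),  # B is called both correct and incorrect
    (["A", "B", "C"], ["A"], ["C"]),            # B is called neither
]
rate = prin_rate(trials)
```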

Knowledge-Base based Semantic Image Transmission Using CLIP

Chongyang Li,Yanmei He,Tianqian Zhang,Mingjian He,Shouyin Liu

Task: Propose a knowledge-base (KB) assisted semantic communication framework for image transmission.

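Motivation: Traditional image transmission is evaluated with conventional metrics such as PSNR, whereas this framework focuses on semantic accuracy and offers a new evaluation paradigm for semantic-aware communication systems.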

Details

Method: At the receiver, a FAISS vector database is built from semantic embeddings of images extracted with the CLIP model; the transmitter extracts a 512-dimensional semantic feature and compresses it with a lightweight neural network for transmission; the receiver reconstructs the feature and retrieves the most semantically similar image from the KB by similarity matching.

Result: Experiments on CIFAR100 validate the effectiveness of the framework for semantic image transmission.

Conclusion: By judging transmission success through semantic matching rather than traditional metrics, the framework offers a new direction for semantic-aware communication systems.

Abstract: This paper proposes a novel knowledge-base (KB) assisted semantic communication framework for image transmission. At the receiver, a Facebook AI Similarity Search (FAISS) based vector database is constructed by extracting semantic embeddings from images using the Contrastive Language-Image Pre-Training (CLIP) model. During transmission, the transmitter first extracts a 512-dimensional semantic feature using the CLIP model, then compresses it with a lightweight neural network for transmission. After receiving the signal, the receiver reconstructs the feature back to 512 dimensions and performs similarity matching from the KB to retrieve the most semantically similar image. Semantic transmission success is determined by category consistency between the transmitted and retrieved images, rather than traditional metrics like Peak Signal-to-Noise Ratio (PSNR). The proposed system prioritizes semantic accuracy, offering a new evaluation paradigm for semantic-aware communication systems. Experimental validation on CIFAR100 demonstrates the effectiveness of the framework in achieving semantic image transmission.
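
The receiver-side matching step can be sketched in a few lines. This is a stand-in under stated assumptions: three-dimensional toy vectors replace the 512-dimensional CLIP features, a linear scan with cosine similarity replaces the FAISS index, and the labels and values are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(feature, knowledge_base):
    """Return the (label, embedding) entry most similar to the received feature."""
    return max(knowledge_base, key=lambda entry: cosine(feature, entry[1]))


# Hypothetical knowledge base held at the receiver.
kb = [
    ("cat", [0.9, 0.1, 0.0]),
    ("truck", [0.0, 0.2, 0.9]),
    ("tree", [0.1, 0.9, 0.1]),
]

sent_label = "cat"
received = [0.8, 0.2, 0.1]  # feature reconstructed after a noisy channel
label, _ = retrieve(received, kb)

# Success is judged by category consistency, not by pixel-level fidelity.
success = label == sent_label
```

The point of the design is visible even at this scale: the reconstructed feature does not need to match the transmitted one exactly, only to stay closer to the right category than to any other.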

ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning

Bairu Hou,Yang Zhang,Jiabao Ji,Yujian Liu,Kaizhi Qian,Jacob Andreas,Shiyu Chang

Task: Propose ThinkPrune, a method for optimizing the thinking length of long-thinking LLMs by reducing redundant and inefficient thinking.

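Motivation: Existing methods shorten thinking mainly by forcing the thinking process to exit early, without optimizing or consolidating it, which leads to a sub-optimal tradeoff between performance and length.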

Details

Method: Continue training the LLM with reinforcement learning (RL) under a token limit; unfinished thoughts and answers beyond the limit are discarded and receive zero reward. An iterative length pruning scheme tightens the token limit round by round.

Result: On the AIME24 dataset, the reasoning length of DeepSeek-R1-Distill-Qwen-1.5B is halved with only a 2% drop in performance; the pruned LLMs skip unnecessary steps while keeping the core reasoning intact.

Conclusion: ThinkPrune achieves a strong balance between performance and thinking length, offering an effective way to optimize long-thinking LLMs.

Abstract: We present ThinkPrune, a simple yet effective method for pruning the thinking length of long-thinking LLMs, which have been found to often produce inefficient and redundant thinking processes. Existing preliminary explorations of reducing thinking length primarily focus on forcing the thinking process to exit early, rather than adapting the LLM to optimize and consolidate the thinking process, and therefore the length-performance tradeoff observed so far is sub-optimal. To fill this gap, ThinkPrune offers a simple solution that continuously trains the long-thinking LLMs via reinforcement learning (RL) with an added token limit, beyond which any unfinished thoughts and answers will be discarded, resulting in a zero reward. To further preserve model performance, we introduce an iterative length pruning approach, where multiple rounds of RL are conducted, each with an increasingly stringent token limit. We observed that ThinkPrune results in a remarkable performance-length tradeoff: on the AIME24 dataset, the reasoning length of DeepSeek-R1-Distill-Qwen-1.5B can be reduced by half with only a 2% drop in performance. We also observed that after pruning, the LLMs can bypass unnecessary steps while keeping the core reasoning process complete. Code is available at https://github.com/UCSB-NLP-Chang/ThinkPrune.
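
The length-limited reward and the tightening schedule can be sketched as follows. Two details are assumptions made for illustration and are not taken from the paper: the reward inside the limit is a binary correctness signal, and the limits decrease linearly across rounds. The abstract only states that output beyond the limit earns zero reward and that each round uses a stricter limit.

```python
def length_limited_reward(is_correct, num_tokens, token_limit):
    """Reward 1.0 only for a correct answer that finishes within the token limit."""
    if num_tokens > token_limit:
        return 0.0  # output past the limit is discarded, so nothing is rewarded
    return 1.0 if is_correct else 0.0


def tightening_schedule(start_limit, end_limit, rounds):
    """Evenly spaced, decreasing token limits, one per RL round (assumed linear)."""
    if rounds < 2:
        return [end_limit]
    step = (start_limit - end_limit) / (rounds - 1)
    return [round(start_limit - i * step) for i in range(rounds)]


limits = tightening_schedule(4000, 2000, 3)

# The same correct 3500-token response is rewarded in round 1 but not afterwards.
rewards = [length_limited_reward(True, 3500, limit) for limit in limits]
```

Under this reward a long but correct response stops paying off once the limit drops below its length, which is the pressure that pushes the policy toward shorter reasoning.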

ShieldGemma 2: Robust and Tractable Image Content Moderation

Wenjun Zeng,Dana Kurniawan,Ryan Mullins,Yuchi Liu,Tamoghna Saha,Dirichi Ike-Njoku,Jindong Gu,Yiwen Song,Cai Xu,Jingjing Zhou,Aparna Joshi,Shravan Dheep,Mani Malek,Hamid Palangi,Joon Baek,Rick Pereira,Karthik Narasimhan

Task: Introduce ShieldGemma 2, a 4B-parameter image content moderation model built on Gemma 3.

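Motivation: Provide robust safety risk predictions for key harm categories (sexually explicit content, violence and gore, dangerous content) in both synthetic and natural images, to advance multimodal safety and responsible AI development.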

Details

Method: Built on Gemma 3, with a novel adversarial data generation pipeline that produces controlled, diverse, and robust images.

Result: On internal and external benchmarks, the model shows state-of-the-art performance compared with LlavaGuard, GPT-4o mini, and the base Gemma 3 model.

Conclusion: ShieldGemma 2 provides an open image moderation tool that advances multimodal safety and responsible AI development.

Abstract: We introduce ShieldGemma 2, a 4B parameter image content moderation model built on Gemma 3. This model provides robust safety risk predictions across the following key harm categories: Sexually Explicit, Violence & Gore, and Dangerous Content for synthetic images (e.g. output of any image generation model) and natural images (e.g. any image input to a Vision-Language Model). We evaluated on both internal and external benchmarks to demonstrate state-of-the-art performance compared to LlavaGuard, GPT-4o mini, and the base Gemma 3 model based on our policies. Additionally, we present a novel adversarial data generation pipeline which enables controlled, diverse, and robust image generation. ShieldGemma 2 provides an open image moderation tool to advance multimodal safety and responsible AI development.

Biomedical Question Answering via Multi-Level Summarization on a Local Knowledge Graph

Lingxiao Guan,Yuanhao Huang,Jie Liu

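Task: Propose a method that builds a local knowledge graph from propositional claims to better capture multi-document relationships and improve biomedical question answering.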

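Motivation: Retrieval Augmented Generation (RAG) performs well for question answering in many domains, but how to capture multi-document relationships effectively, especially for biomedical tasks, remains an open question.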

Details

Method: Build a local knowledge graph from propositional claims, then derive summaries from the graph by layerwise summarization to contextualize a small language model for question answering.

Result: On several biomedical QA benchmarks, the method matches or outperforms RAG baselines; targeted metrics confirm the effectiveness of each step.

Conclusion: Combining a knowledge graph with layerwise summarization improves biomedical question answering and offers a new approach to capturing multi-document relationships.

Abstract: In Question Answering (QA), Retrieval Augmented Generation (RAG) has revolutionized performance in various domains. However, how to effectively capture multi-document relationships, particularly critical for biomedical tasks, remains an open question. In this work, we propose a novel method that utilizes propositional claims to construct a local knowledge graph from retrieved documents. Summaries are then derived via layerwise summarization from the knowledge graph to contextualize a small language model to perform QA. We achieved comparable or superior performance with our method over RAG baselines on several biomedical QA benchmarks. We also evaluated each individual step of our methodology over a targeted set of metrics, demonstrating its effectiveness.

RipVIS: Rip Currents Video Instance Segmentation Benchmark for Beach Monitoring and Safety

Andrei Dumitriu,Florin Tatui,Florin Miron,Aakash Ralhan,Radu Tudor Ionescu,Radu Timofte

Task: Present RipVIS, a large-scale video instance segmentation benchmark designed specifically for accurate segmentation of rip currents.

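Motivation: Rip currents are a major cause of beach-related injuries and fatalities worldwide, yet identifying them accurately is difficult because of their amorphous nature and the lack of annotated data.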

Details

Method: Build a dataset of 184 videos (212,328 frames), of which 150 videos (163,528 frames) contain rip currents; fine-tune several models (Mask R-CNN, Cascade Mask R-CNN, and others) for the segmentation task; and introduce a post-processing step based on Temporal Confidence Aggregation (TCA).

Result: Results are reported with multiple metrics, with particular focus on the F2 score, which prioritizes recall and so penalizes false negatives.

Conclusion: RipVIS aims to set a new standard for rip current segmentation and contribute to safer beaches, sharing data and results with the research community through a benchmark website.

Abstract: Rip currents are strong, localized and narrow currents of water that flow outwards into the sea, causing numerous beach-related injuries and fatalities worldwide. Accurate identification of rip currents remains challenging due to their amorphous nature and the lack of annotated data, which often requires expert knowledge. To address these issues, we present RipVIS, a large-scale video instance segmentation benchmark explicitly designed for rip current segmentation. RipVIS is an order of magnitude larger than previous datasets, featuring 184 videos (212,328 frames), of which 150 videos (163,528 frames) are with rip currents, collected from various sources, including drones, mobile phones, and fixed beach cameras. Our dataset encompasses diverse visual contexts, such as wave-breaking patterns, sediment flows, and water color variations, across multiple global locations, including USA, Mexico, Costa Rica, Portugal, Italy, Greece, Romania, Sri Lanka, Australia and New Zealand. Most videos are annotated at 5 FPS to ensure accuracy in dynamic scenarios, supplemented by an additional 34 videos (48,800 frames) without rip currents. We conduct comprehensive experiments with Mask R-CNN, Cascade Mask R-CNN, SparseInst and YOLO11, fine-tuning these models for the task of rip current segmentation. Results are reported in terms of multiple metrics, with a particular focus on the F2 score to prioritize recall and reduce false negatives. To enhance segmentation performance, we introduce a novel post-processing step based on Temporal Confidence Aggregation (TCA). RipVIS aims to set a new standard for rip current segmentation, contributing towards safer beach environments. We offer a benchmark website to share data, models, and results with the research community, encouraging ongoing collaboration and future contributions, at https://ripvis.ai.
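
The emphasis on the F2 score is easier to see with numbers. The sketch below uses the standard F-beta definition with hypothetical detection counts; it is not taken from the RipVIS evaluation code.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 80 rip currents found, 40 false alarms, 20 missed.
f1 = f_beta(80, 40, 20, beta=1.0)
f2 = f_beta(80, 40, 20, beta=2.0)

# Swapping the error types (20 false alarms, 40 misses) leaves F1 unchanged
# but lowers F2, because beta = 2 weights recall more heavily than precision.
f1_swapped = f_beta(80, 20, 40, beta=1.0)
f2_swapped = f_beta(80, 20, 40, beta=2.0)
```

For a safety application this asymmetry is the intended behavior: a missed rip current costs more than a false alarm.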

Adaptive Rectification Sampling for Test-Time Compute Scaling

Zhendong Tan,Xingjun Zhang,Chaoyi Hu,Yancheng Pan,Shaoxun Wang

Task: Propose Adaptive Rectification Sampling (AR-Sampling), a method that guides large language models (LLMs) to self-correct at a finer granularity during reasoning.

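Motivation: Existing test-time scaling methods, such as generating more or longer chains of thought, can waste tokens and reduce readability, especially when the reasoning steps are already correct.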

Details

Method: Use a process-supervised reward model (PRM) as a verifier and construct trigger sentences that prompt the model to rectify its reasoning adaptively at the appropriate step.

Result: Experiments on GSM8K and MATH500 show that the method guides the model to rethink at a finer granularity and improves solution accuracy while generating a reasonable number of additional tokens.

Conclusion: AR-Sampling reduces token waste and improves reasoning accuracy, offering a finer-grained approach to LLM self-correction.

Abstract: The newly released OpenAI-o1 and DeepSeek-R1 have demonstrated that test-time scaling can significantly improve model performance, especially in complex tasks such as logical reasoning. Common test-time scaling methods involve generating more chains of thought (CoTs) or longer CoTs with self-correction. However, while self-correction can improve performance, it may lead to significant token waste and reduce readability of the CoT if the reasoning steps are already correct. To demonstrate that large language models (LLMs) can rectify errors at a more fine-grained level, we propose Adaptive Rectification Sampling (AR-Sampling), which can guide the LLMs to self-correct at the appropriate step. AR-Sampling leverages a process-supervised reward model (PRM) as a verifier and constructed trigger sentences to guide the model in adaptive step-level rethinking. Experiments on GSM8K and MATH500 indicate that our approach enables the models to rethink at a more fine-grained level, improving the accuracy of solutions while generating a reasonable number of additional tokens.

GRU-AUNet: A Domain Adaptation Framework for Contactless Fingerprint Presentation Attack Detection

Banafsheh Adami,Nima Karimian

Task: Propose GRU-AUNet, a domain adaptation approach that strengthens anti-spoofing for contactless fingerprints.

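Motivation: Current anti-spoofing methods for contactless fingerprints rely on domain adaptation learning, but their generalization and scalability are limited.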

Details

Method: GRU-AUNet combines a Swin Transformer-based UNet architecture, GRU-enhanced attention mechanisms, a Dynamic Filter Network in the bottleneck, and a combined Focal and Contrastive loss.

Result: On the CLARKSON, COLFISPOOF, and IIITD datasets, GRU-AUNet achieves an average BPCER of 0.09% and APCER of 1.2%, outperforming existing domain adaptation methods.

Conclusion: GRU-AUNet is robust against presentation attacks and addresses the limitations of existing methods.

Abstract: Although contactless fingerprints offer user comfort, they are more vulnerable to spoofing. The current solution for anti-spoofing in the area of contactless fingerprints relies on domain adaptation learning, limiting their generalization and scalability. To address these limitations, we introduce GRU-AUNet, a domain adaptation approach that integrates a Swin Transformer-based UNet architecture with GRU-enhanced attention mechanisms, a Dynamic Filter Network in the bottleneck, and a combined Focal and Contrastive Loss function. Trained on both genuine and spoof fingerprint images, GRU-AUNet demonstrates robust resilience against presentation attacks, achieving an average BPCER of 0.09% and APCER of 1.2% on the CLARKSON, COLFISPOOF, and IIITD datasets, outperforming state-of-the-art domain adaptation methods.
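
For readers unfamiliar with the two reported metrics: APCER is the share of attack presentations wrongly accepted as bona fide, and BPCER is the share of bona fide presentations wrongly rejected as attacks. The sketch below is a simplification that pools all attack types into one group, with made-up labels and predictions; formal evaluations typically compute APCER per attack type.

```python
def pad_error_rates(labels, predictions):
    """Return (apcer, bpcer) given 'attack' / 'bonafide' labels and predictions."""
    attack_preds = [p for l, p in zip(labels, predictions) if l == "attack"]
    bona_fide_preds = [p for l, p in zip(labels, predictions) if l == "bonafide"]
    if not attack_preds or not bona_fide_preds:
        raise ValueError("need at least one attack and one bona fide sample")
    # Attacks that slipped through as genuine.
    apcer = sum(p == "bonafide" for p in attack_preds) / len(attack_preds)
    # Genuine samples that were rejected as attacks.
    bpcer = sum(p == "attack" for p in bona_fide_preds) / len(bona_fide_preds)
    return apcer, bpcer


# Hypothetical run: 5 spoofs (1 accepted) and 10 genuine samples (1 rejected).
labels = ["attack"] * 5 + ["bonafide"] * 10
predictions = ["attack"] * 4 + ["bonafide"] + ["bonafide"] * 9 + ["attack"]
apcer, bpcer = pad_error_rates(labels, predictions)
```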

Foundations and Evaluations in NLP

Jungyeul Park

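Task: Explore two fundamental aspects of Natural Language Processing (NLP): the creation of linguistic resources and the evaluation of system performance.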

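Motivation: Develop a morpheme-based annotation scheme for Korean and address the limitations of traditional evaluation methods on preprocessing tasks.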

Details

Method: Propose a morpheme-based annotation scheme and a new evaluation framework, the jp-algorithm.

Result: The approach achieves state-of-the-art results on part-of-speech tagging, dependency parsing, and named entity recognition, and the jp-algorithm improves the accuracy and flexibility of evaluation.

Conclusion: The work offers key insights into processing morphologically rich languages such as Korean and lays a foundation for multilingual resource development and system evaluation.

Abstract: This memoir explores two fundamental aspects of Natural Language Processing (NLP): the creation of linguistic resources and the evaluation of NLP system performance. Over the past decade, my work has focused on developing a morpheme-based annotation scheme for the Korean language that captures linguistic properties from morphology to semantics. This approach has achieved state-of-the-art results in various NLP tasks, including part-of-speech tagging, dependency parsing, and named entity recognition. Additionally, this work provides a comprehensive analysis of segmentation granularity and its critical impact on NLP system performance. In parallel with linguistic resource development, I have proposed a novel evaluation framework, the jp-algorithm, which introduces an alignment-based method to address challenges in preprocessing tasks like tokenization and sentence boundary detection (SBD). Traditional evaluation methods assume identical tokenization and sentence lengths between gold standards and system outputs, limiting their applicability to real-world data. The jp-algorithm overcomes these limitations, enabling robust end-to-end evaluations across a variety of NLP tasks. It enhances accuracy and flexibility by incorporating linear-time alignment while preserving the complexity of traditional evaluation metrics. This memoir provides key insights into the processing of morphologically rich languages, such as Korean, while offering a generalizable framework for evaluating diverse end-to-end NLP systems. My contributions lay the foundation for future developments, with broader implications for multilingual resource development and system evaluation.

PolygoNet: Leveraging Simplified Polygonal Representation for Effective Image Classification

Salim Khazem,Jeremy Fix,Cédric Pradalier

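Task: Propose an efficient deep learning approach that uses polygonal representations of images to reduce computational complexity and overfitting.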

Motivation: Deep learning models perform well on image tasks but face challenges of computational complexity and overfitting.

Details

Method: Convert input images into polygonal representations (dominant points or contour coordinates) to reduce computational requirements and accelerate training.

Result: The resulting lightweight models perform comparably to methods that use full-resolution images and are suitable for deployment on edge devices.

Conclusion: Polygonal representations show potential for efficient and scalable deep learning in real-world scenarios.

Abstract: Deep learning models have achieved significant success in various image-related tasks. However, they often encounter challenges related to computational complexity and overfitting. In this paper, we propose an efficient approach that leverages polygonal representations of images using dominant points or contour coordinates. By transforming input images into these compact forms, our method significantly reduces computational requirements, accelerates training, and conserves resources, making it suitable for real-time and resource-constrained applications. These representations inherently capture essential image features while filtering noise, providing a natural regularization effect that mitigates overfitting. The resulting lightweight models achieve performance comparable to state-of-the-art methods using full-resolution images while enabling deployment on edge devices. Extensive experiments on benchmark datasets validate the effectiveness of our approach in reducing complexity, improving generalization, and facilitating edge computing applications. This work demonstrates the potential of polygonal representations in advancing efficient and scalable deep learning solutions for real-world scenarios. The code for the experiments of the paper is available at https://github.com/salimkhazem/PolygoNet.

Breaking BERT: Gradient Attack on Twitter Sentiment Analysis for Targeted Misclassification

Akil Raj Subedi,Taniya Shah,Aswani Kumar Cherukuri,Thanos Vasilakos

Task: Study the vulnerability of BERT in Twitter sentiment analysis and propose a framework for constructing targeted adversarial texts.

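Motivation: NLP models such as BERT are widely used for sentiment analysis but are susceptible to adversarial attacks, so their inherent vulnerabilities need to be exposed, including through stealthier attack methods.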

Details

Method: Fine-tune a pre-trained BERT model, analyze gradients to rank word importance, iteratively replace important words to generate adversarial text, and evaluate how well the text deceives the classifier.

Result: A gradient-based framework for generating adversarial text that can deceive a sentiment classification model without raising alarm.

Conclusion: The framework exposes the vulnerability of BERT models and provides a reference for future defenses against adversarial attacks.

Abstract: Social media platforms like Twitter have increasingly relied on Natural Language Processing (NLP) techniques to analyze and understand the sentiments expressed in user-generated content. One such state-of-the-art NLP model is Bidirectional Encoder Representations from Transformers (BERT), which has been widely adopted in sentiment analysis. BERT is susceptible to adversarial attacks. This paper aims to scrutinize the inherent vulnerabilities of such models in Twitter sentiment analysis. It aims to formulate a framework for constructing targeted adversarial texts capable of deceiving these models while maintaining stealth. In contrast to conventional methodologies, such as Importance Reweighting, the framework's core idea is to rely on gradients to prioritize the importance of individual words within the text. It uses a white-box approach to attain fine-grained sensitivity, pinpointing words that exert maximal influence on the classification outcome. This paper is organized into three interdependent phases. It starts with fine-tuning a pre-trained BERT model on Twitter data. It then analyzes gradients of the model to rank words by their importance, and iteratively replaces those with feasible candidates until an acceptable solution is found. Finally, it evaluates the effectiveness of the adversarial text against the custom-trained sentiment classification model. This assessment helps gauge the capacity of the adversarial text to subvert classification without raising any alarm.

rPPG-SysDiaGAN: Systolic-Diastolic Feature Localization in rPPG Using Generative Adversarial Network with Multi-Domain Discriminator

Banafsheh Adami,Nima Karimian

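Task: Propose a new deep learning architecture based on Generative Adversarial Networks (GANs) for extracting remote photoplethysmography (rPPG) signals from facial videos.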

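Motivation: Existing methods fall short in reconstructing the PPG signal accurately, particularly in distinguishing systolic and diastolic components, and focus mainly on heart rate extraction, which does not fully characterize the PPG signal.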

Details

Method: A GAN architecture with multiple discriminators that focus on the time domain, the frequency domain, and the second derivative of the original time-domain signal, combined with four loss terms (a variance loss, a dynamic time warping loss, a sparsity loss, and a second variance loss) to improve signal extraction.

Result: The multi-discriminator design and combined loss functions extract rPPG signals more accurately, particularly in distinguishing systolic and diastolic components.

Conclusion: The method improves the reconstruction accuracy of rPPG signals and offers an effective way to characterize the PPG signal more completely.

Abstract: Remote photoplethysmography (rPPG) offers a novel approach to noninvasive monitoring of vital signs, such as respiratory rate, utilizing a camera. Although several supervised and self-supervised methods have been proposed, they often fail to accurately reconstruct the PPG signal, particularly in distinguishing between systolic and diastolic components. Their primary focus tends to be solely on extracting heart rate, which may not accurately represent the complete PPG signal. To address this limitation, this paper proposes a novel deep learning architecture using Generative Adversarial Networks by introducing multi-discriminators to extract rPPG signals from facial videos. These discriminators focus on the time domain, the frequency domain, and the second derivative of the original time domain signal. The discriminator integrates four loss functions: variance loss to mitigate local minima caused by noise; dynamic time warping loss to address local minima induced by alignment and sequences of variable lengths; Sparsity Loss for heart rate adjustment, and Variance Loss to ensure a uniform distribution across the desired frequency domain and time interval between systolic and diastolic phases of the PPG signal.

GTR: Graph-Table-RAG for Cross-Table Question Answering

Jiaru Zou,Dongqi Fu,Sirui Chen,Xinrui He,Zihao Li,Yada Zhu,Jiawei Han,Jingrui He

Task: Propose GTR, a Graph-Table-RAG framework for cross-table question answering.

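Motivation: In real-world scenarios, user questions often require retrieving answers from multiple tables, yet existing data lacks a suitable benchmark.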

Details

Method: Reorganize the table corpus into a heterogeneous graph, extract the most relevant tables through a hierarchical coarse-to-fine retrieval process, and use graph-aware prompting to strengthen the tabular reasoning of downstream LLMs.

Result: GTR shows superior performance on cross-table question answering while maintaining high deployment efficiency.

Conclusion: GTR demonstrates practical potential and offers an effective solution for cross-table question answering.

Abstract: Beyond pure text, a substantial amount of knowledge is stored in tables. In real-world scenarios, user questions often require retrieving answers that are distributed across multiple tables. GraphRAG has recently attracted much attention for enhancing LLMs' reasoning capabilities by organizing external knowledge to address ad-hoc and complex questions, exemplifying a promising direction for cross-table question answering. In this paper, to address the current gap in available data, we first introduce a multi-table benchmark, MultiTableQA, comprising 60k tables and 25k user queries collected from real-world sources. Then, we propose the first Graph-Table-RAG framework, namely GTR, which reorganizes table corpora into a heterogeneous graph, employs a hierarchical coarse-to-fine retrieval process to extract the most relevant tables, and integrates graph-aware prompting for downstream LLMs' tabular reasoning. Extensive experiments show that GTR exhibits superior cross-table question-answering performance while maintaining high deployment efficiency, demonstrating its real-world practical applicability.

TenAd: A Tensor-based Low-rank Black Box Adversarial Attack for Video Classification

Kimia Haghjooei,Mansoor Rezghi

Task: Propose TenAd, a tensor-based low-rank adversarial attack that generates video adversarial examples efficiently in the black-box setting.

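Motivation: Existing adversarial attack methods usually treat video data as simple vectors, ignoring its multi-dimensional structure, and require a large number of queries, which makes them inefficient and easy to detect.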

Details

Method: Exploit the multi-dimensional nature of video data by representing videos as fourth-order tensors, and use a low-rank attack to shrink the search space and the number of queries.

Result: On standard video classification datasets, TenAd generates imperceptible adversarial perturbations with higher attack success rates and better query efficiency than existing methods.

Conclusion: Tensor-based methods show potential for adversarial attacks on video models; TenAd outperforms existing black-box attacks in success rate, query efficiency, and perturbation imperceptibility.

Abstract: Deep learning models have achieved remarkable success in computer vision but remain vulnerable to adversarial attacks, particularly in black-box settings where model details are unknown. Existing adversarial attack methods (even those that work with key frames) often treat video data as simple vectors, ignoring their inherent multi-dimensional structure, and require a large number of queries, making them inefficient and detectable. In this paper, we propose TenAd, a novel tensor-based low-rank adversarial attack that leverages the multi-dimensional properties of video data by representing videos as fourth-order tensors. By exploiting low-rank attack, our method significantly reduces the search space and the number of queries needed to generate adversarial examples in black-box settings. Experimental results on standard video classification datasets demonstrate that TenAd effectively generates imperceptible adversarial perturbations while achieving higher attack success rates and query efficiency compared to state-of-the-art methods. Our approach outperforms existing black-box adversarial attacks in terms of success rate, query efficiency, and perturbation imperceptibility, highlighting the potential of tensor-based methods for adversarial attacks on video models.
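
A rough count shows why a low-rank parameterization shrinks the search space. The sketch assumes a CP-style rank-R factorization, in which the perturbation is a sum of R outer products of one vector per tensor mode. The abstract does not say which decomposition TenAd uses, and the clip shape below is hypothetical.

```python
from math import prod


def full_parameter_count(shape):
    """Free parameters when every entry of the perturbation is searched directly."""
    return prod(shape)


def cp_parameter_count(shape, rank):
    """Free parameters of a rank-R CP factorization: R vectors per tensor mode."""
    return rank * sum(shape)


# Hypothetical clip: 16 frames of 112 x 112 pixels with 3 color channels.
video_shape = (16, 112, 112, 3)

dense = full_parameter_count(video_shape)
rank_one = cp_parameter_count(video_shape, rank=1)
rank_four = cp_parameter_count(video_shape, rank=4)
```

With these numbers the dense perturbation has 602,112 free values while the rank-4 version has 972, which suggests why a black-box attack that must estimate each search direction through queries benefits from the smaller space.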

Tasks and Roles in Legal AI: Data Curation, Annotation, and Verification

Allison Koenecke,Jed Stiglitz,David Mimno,Matthew Wilkens

Task: Examine the application of AI tools in the legal field and the challenges they face.

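Motivation: Legal documents differ from the web text that underlies most AI systems, and legal AI is expected to perform well in high-stakes settings, so problems of data curation, annotation, and output verification need to be addressed.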

Details

Method: Analyze the characteristics of legal data and draw on case studies to examine the application of AI in the legal field and its limitations.

Result: Identifies three key areas (data curation, data annotation, and output verification) and calls on legal and AI practitioners to collaborate across disciplines to build high-performing, reliable AI tools.

Conclusion: Progress in legal AI requires interdisciplinary collaboration and open resources to improve performance and reliability.

Abstract: The application of AI tools to the legal field feels natural: large legal document collections could be used with specialized AI to improve workflow efficiency for lawyers and ameliorate the "justice gap" for underserved clients. However, legal documents differ from the web-based text that underlies most AI systems. The challenges of legal AI are both specific to the legal domain, and confounded with the expectation of AI's high performance in high-stakes settings. We identify three areas of special relevance to practitioners: data curation, data annotation, and output verification. First, it is difficult to obtain usable legal texts. Legal collections are inconsistent, analog, and scattered for reasons technical, economic, and jurisdictional. AI tools can assist document curation efforts, but the lack of existing data also limits AI performance. Second, legal data annotation typically requires significant expertise to identify complex phenomena such as modes of judicial reasoning or controlling precedents. We describe case studies of AI systems that have been developed to improve the efficiency of human annotation in legal contexts and identify areas of underperformance. Finally, AI-supported work in the law is valuable only if results are verifiable and trustworthy. We describe both the abilities of AI systems to support evaluation of their outputs, as well as new approaches to systematic evaluation of computational systems in complex domains. We call on both legal and AI practitioners to collaborate across disciplines and to release open access materials to support the development of novel, high-performing, and reliable AI tools for legal applications.

FUSION: Frequency-guided Underwater Spatial Image recOnstructioN

Jaskaran Singh Walia,Shravan Venkatraman,Pavithra LK

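Task: Propose FUSION, a dual-domain deep learning framework for underwater image enhancement that uses both spatial-domain and frequency-domain information.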

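Motivation: Existing underwater image enhancement methods focus mainly on spatial-domain processing and neglect the frequency domain's potential to capture global color distributions and long-range dependencies.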

Details

Method: FUSION processes the RGB channels in the spatial domain with multi-scale convolutional kernels and adaptive attention, extracts global structural information with FFT-based frequency attention, and integrates features from both domains through a Frequency Guided Fusion module.

Result: On benchmark datasets (UIEB, EUVP, SUIM-E), FUSION outperforms existing methods in reconstruction fidelity (highest PSNR of 23.717 dB and SSIM of 0.883), perceptual quality (lowest LPIPS of 0.112), and visual enhancement (best UIQM of 3.414), with fewer parameters (0.28M) and lower computational complexity.

Conclusion: FUSION is efficient and suitable for real-time underwater imaging applications.

Abstract: Underwater images suffer from severe degradations, including color distortions, reduced visibility, and loss of structural details due to wavelength-dependent attenuation and scattering. Existing enhancement methods primarily focus on spatial-domain processing, neglecting the frequency domain's potential to capture global color distributions and long-range dependencies. To address these limitations, we propose FUSION, a dual-domain deep learning framework that jointly leverages spatial and frequency domain information. FUSION independently processes each RGB channel through multi-scale convolutional kernels and adaptive attention mechanisms in the spatial domain, while simultaneously extracting global structural information via FFT-based frequency attention. A Frequency Guided Fusion module integrates complementary features from both domains, followed by inter-channel fusion and adaptive channel recalibration to ensure balanced color distributions. Extensive experiments on benchmark datasets (UIEB, EUVP, SUIM-E) demonstrate that FUSION achieves state-of-the-art performance, consistently outperforming existing methods in reconstruction fidelity (highest PSNR of 23.717 dB and SSIM of 0.883 on UIEB), perceptual quality (lowest LPIPS of 0.112 on UIEB), and visual enhancement metrics (best UIQM of 3.414 on UIEB), while requiring significantly fewer parameters (0.28M) and lower computational complexity, demonstrating its suitability for real-time underwater imaging applications.

LITE: LLM-Impelled efficient Taxonomy Evaluation

Lin Zhang,Zhouhong Gu,Suhang Zheng,Tao Wang,Tianyu Li,Hongwei Feng,Yanghua Xiao

Task: Propose LITE, an efficient and flexible LLM-based method for evaluating taxonomy quality.

Motivation: Address the efficiency, fairness, and consistency problems of large-scale taxonomy evaluation.

Details

Method: Adopt a top-down hierarchical evaluation strategy that breaks the taxonomy into manageable substructures, and ensure reliable results through cross-validation and standardized input formats.

Result: Experiments show that LITE is highly reliable on complex evaluation tasks and effectively identifies semantic errors, logical contradictions, and structural flaws in taxonomies.

Conclusion: LITE provides both quantitative performance analysis and qualitative insights, and points to directions for improving taxonomies.

Abstract: This paper presents LITE, an LLM-based evaluation method designed for efficient and flexible assessment of taxonomy quality. To address challenges in large-scale taxonomy evaluation, such as efficiency, fairness, and consistency, LITE adopts a top-down hierarchical evaluation strategy, breaking down the taxonomy into manageable substructures and ensuring result reliability through cross-validation and standardized input formats. LITE also introduces a penalty mechanism to handle extreme cases and provides both quantitative performance analysis and qualitative insights by integrating evaluation metrics closely aligned with task objectives. Experimental results show that LITE demonstrates high reliability in complex evaluation tasks, effectively identifying semantic errors, logical contradictions, and structural flaws in taxonomies, while offering directions for improvement. Code is available at https://github.com/Zhang-l-i-n/TAXONOMY_DETECT.

Direction-Aware Hybrid Representation Learning for 3D Hand Pose and Shape Estimation

Shiyong Liu,Zhihao Li,Xiao Tang,Jianzhuang Liu

Task: Estimate 3D hand pose and shape from images and reduce jitter during motion capture.

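Motivation: Existing methods directly regress parametric model parameters under weak supervision, which involves a complex optimization problem and makes training difficult.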

Details

Method: Propose direction-aware hybrid features (DaHyF) that fuse implicit image features with explicit 2D joint coordinate features, with the fusion enhanced by pixel direction information in the camera coordinate system.

Result: On the FreiHAND dataset, accuracy improves by more than 33% over existing methods, and the method ranks first on the HO3Dv2 and HO3Dv3 leaderboards.

Conclusion: DaHyF performs well in real-time motion capture scenarios involving hand position variability, occlusion, and motion blur.

Abstract: Most model-based 3D hand pose and shape estimation methods directly regress the parametric model parameters from an image to obtain 3D joints under weak supervision. However, these methods involve solving a complex optimization problem with many local minima, making training difficult. To address this challenge, we propose learning direction-aware hybrid features (DaHyF) that fuse implicit image features and explicit 2D joint coordinate features. This fusion is enhanced by the pixel direction information in the camera coordinate system to estimate pose, shape, and camera viewpoint. Our method directly predicts 3D hand poses with DaHyF representation and reduces jittering during motion capture using prediction confidence based on contrastive learning. We evaluate our method on the FreiHAND dataset and show that it outperforms existing state-of-the-art methods by more than 33% in accuracy. DaHyF also achieves the top ranking on both the HO3Dv2 and HO3Dv3 leaderboards for the metric of Mean Joint Error (after scale and translation alignment). Compared to the second-best results, the largest improvement observed is 10%. We also demonstrate its effectiveness in real-time motion capture scenarios with hand position variability, occlusion, and motion blur.

Xingshan Zeng,Weiwen Liu,Xu Huang,Zezhong Wang,Lingzhi Wang,Liangyou Li,Yasheng Wang,Lifeng Shang,Xin Jiang,Ruiming Tang,Qun Liu

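Task: Propose ToolACE-R, a new method that improves the tool-invocation performance of large language models (LLMs) through adaptive self-refinement.

Motivation: Current tool-learning approaches focus mainly on synthesizing data for fine-tuning LLMs and neglect how to fully stimulate the model's potential.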

Details

Method: Uses a model-aware iterative training procedure that progressively adds training samples, lets LLMs iteratively refine their tool calls, and introduces an adaptive mechanism to improve computational efficiency. Result: Experiments on several benchmark datasets show that ToolACE-R is competitive with advanced API-based models even without refinement, and that adaptive self-refinement improves performance further. Conclusion: ToolACE-R is an efficient tool-learning method that is compatible with base models of various sizes and offers a new direction for tool learning. Abstract: Tool learning, which allows Large Language Models (LLMs) to leverage external tools for solving complex user tasks, has emerged as a promising avenue for extending model capabilities. However, current approaches primarily focus on data synthesis for fine-tuning LLMs to invoke tools effectively, largely ignoring how to fully stimulate the potential of the model. In this paper, we propose ToolACE-R, a novel method that introduces adaptive self-refinement for tool invocations. Our approach features a model-aware iterative training procedure that progressively incorporates more training samples based on the model's evolving capabilities. Additionally, it allows LLMs to iteratively refine their tool calls, optimizing performance without requiring external feedback. To further enhance computational efficiency, we integrate an adaptive mechanism when scaling the inference time, enabling the model to autonomously determine when to stop the refinement process. We conduct extensive experiments across several benchmark datasets, showing that ToolACE-R achieves competitive performance compared to advanced API-based models, even without any refinement. Furthermore, its performance can be further improved efficiently through adaptive self-refinement. Our results demonstrate the effectiveness of the proposed method, which is compatible with base models of various sizes, offering a promising direction for more efficient tool learning.

Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks

Jiawei Wang,Yushen Zuo,Yuanjun Chai,Zhendong Liu,Yichen Fu,Yichun Feng,Kin-man Lam

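Task: Study the safety vulnerabilities of vision-language models (VLMs) when they process noisy or corrupted images, and propose solutions.

Motivation: Existing VLMs adopt safety measures during training but overlook the vulnerabilities introduced by noise-augmented visual inputs, leaving the models susceptible to simple perturbation attacks.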

Details

Method: Proposes Robust-VLGuard, a multimodal safety dataset, together with noise-augmented fine-tuning, and DiffPure-VLM, which uses diffusion models to defend against optimization-based visual perturbation attacks. Result: Experiments show that the distribution-shifting property of diffusion models combines well with the fine-tuned VLMs, significantly reducing the success rate of adversarial perturbation attacks across varying intensities. Conclusion: Noise-augmented fine-tuning and diffusion models can effectively improve the safety of vision-language models while preserving their functionality. Abstract: Vision-Language Models (VLMs) extend the capabilities of Large Language Models (LLMs) by incorporating visual information, yet they remain vulnerable to jailbreak attacks, especially when processing noisy or corrupted images. Although existing VLMs adopt security measures during training to mitigate such attacks, vulnerabilities associated with noise-augmented visual inputs are overlooked. In this work, we identify that missing noise-augmented training causes critical security gaps: many VLMs are susceptible to even simple perturbations such as Gaussian noise. To address this challenge, we propose Robust-VLGuard, a multimodal safety dataset with aligned / misaligned image-text pairs, combined with noise-augmented fine-tuning that reduces attack success rates while preserving the functionality of the VLM. For stronger optimization-based visual perturbation attacks, we propose DiffPure-VLM, leveraging diffusion models to convert adversarial perturbations into Gaussian-like noise, which can be defended by VLMs with noise-augmented safety fine-tuning. Experimental results demonstrate that the distribution-shifting property of the diffusion model aligns well with our fine-tuned VLMs, significantly mitigating adversarial perturbations across varying intensities. The dataset and code are available at https://github.com/JarvisUSTC/DiffPure-RobustVLM.
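As a rough illustration of the kind of Gaussian-noise perturbation the abstract refers to, the sketch below adds i.i.d. Gaussian noise to a flattened image and clips the result back to the valid range. The function name `add_gaussian_noise`, the [0, 1] pixel range, and the `sigma` value are assumptions made for this example; the abstract does not specify how the authors generate or schedule noise during fine-tuning.

```python
import random


def add_gaussian_noise(pixels, sigma, seed=None):
    """Return a copy of `pixels` (floats in [0, 1]) with Gaussian noise added.

    Hypothetical helper for illustration only; not the paper's code.
    """
    rng = random.Random(seed)
    noisy = []
    for value in pixels:
        perturbed = value + rng.gauss(0.0, sigma)
        # Clip so the perturbed image stays a valid image.
        noisy.append(min(1.0, max(0.0, perturbed)))
    return noisy


# A tiny flattened "image"; real inputs would be full image tensors.
image = [0.0, 0.25, 0.5, 0.75, 1.0]
noisy_image = add_gaussian_noise(image, sigma=0.1, seed=0)
```

In a noise-augmented fine-tuning loop, a function like this would typically be applied to some fraction of the training images before they reach the vision encoder.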

FAIRE: Assessing Racial and Gender Bias in AI-Driven Resume Evaluations

Athena Wen,Tanush Patil,Ansh Saxena,Yicheng Fu,Sean O'Brien,Kevin Zhu

Task: Introduce a benchmark (FAIRE) to test for racial and gender bias in large language models (LLMs) used for resume evaluation.

Motivation: Address concerns about fairness and bias in AI-driven hiring practices.

Details

Method: Use direct scoring and ranking methods to measure bias when resumes are altered to reflect different racial or gender identities. Result: All models exhibit some degree of bias, with varying magnitude and direction. Conclusion: The benchmark provides insights into AI fairness and highlights the need for strategies to reduce bias in AI-driven recruitment. Abstract: In an era where AI-driven hiring is transforming recruitment practices, concerns about fairness and bias have become increasingly important. To explore these issues, we introduce a benchmark, FAIRE (Fairness Assessment In Resume Evaluation), to test for racial and gender bias in large language models (LLMs) used to evaluate resumes across different industries. We use two methods, direct scoring and ranking, to measure how model performance changes when resumes are slightly altered to reflect different racial or gender identities. Our findings reveal that while every model exhibits some degree of bias, the magnitude and direction vary considerably. This benchmark provides a clear way to examine these differences and offers valuable insights into the fairness of AI-based hiring tools. It highlights the urgent need for strategies to reduce bias in AI-driven recruitment. Our benchmark code and dataset are open-sourced at our repository: https://github.com/athenawen/FAIRE-Fairness-Assessment-In-Resume-Evaluation.git.

COST: Contrastive One-Stage Transformer for Vision-Language Small Object Tracking

Chunhui Zhang,Li Liu,Jialin Gao,Xin Sun,Hao Wen,Xi Zhou,Shiming Ge,Yanfeng Wang

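Task: Propose COST, a contrastive one-stage transformer fusion framework for vision-language (VL) tracking that aims to learn semantically consistent and unified VL representations.

Motivation: Most existing VL trackers rely on carefully designed multi-stage multi-modal fusion mechanisms, and direct multi-modal fusion ignores the distribution discrepancy between modalities in feature space, which can lead to suboptimal representations.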

Details

Method: Introduces a contrastive alignment strategy that maximizes the mutual information (MI) between a video and its language description to achieve cross-modal alignment, and uses a visual-linguistic transformer to build an efficient multi-modal fusion and reasoning mechanism. Result: COST achieves state-of-the-art performance on five existing VL tracking datasets and on the newly proposed VL-SOT500 dataset. Conclusion: Through contrastive alignment and transformer fusion, COST effectively improves VL tracking performance, and it contributes VL-SOT500, the first VL dataset dedicated to small object tracking. Abstract: Transformer has recently demonstrated great potential in improving vision-language (VL) tracking algorithms. However, most of the existing VL trackers rely on carefully designed mechanisms to perform the multi-stage multi-modal fusion. Additionally, direct multi-modal fusion without alignment ignores distribution discrepancy between modalities in feature space, potentially leading to suboptimal representations. In this work, we propose COST, a contrastive one-stage transformer fusion framework for VL tracking, aiming to learn semantically consistent and unified VL representations. Specifically, we introduce a contrastive alignment strategy that maximizes mutual information (MI) between a video and its corresponding language description. This enables effective cross-modal alignment, yielding semantically consistent features in the representation space. By leveraging a visual-linguistic transformer, we establish an efficient multi-modal fusion and reasoning mechanism, empirically demonstrating that a simple stack of transformer encoders effectively enables unified VL representations. Moreover, we contribute a newly collected VL tracking benchmark dataset for small object tracking, named VL-SOT500, with bounding boxes and language descriptions. Our dataset comprises two challenging subsets, VL-SOT230 and VL-SOT270, dedicated to evaluating generic and high-speed small object tracking, respectively. Small object tracking is notoriously challenging due to weak appearance and limited features, and this dataset is, to the best of our knowledge, the first to explore the usage of language cues to enhance visual representation for small object tracking. Extensive experiments demonstrate that COST achieves state-of-the-art performance on five existing VL tracking datasets, as well as on our proposed VL-SOT500 dataset. Source codes and dataset will be made publicly available.
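One common way to maximize mutual information between paired modalities is to minimize an InfoNCE loss, which lower-bounds the MI between the paired samples. The sketch below illustrates that general technique under the assumption that each video and each description has already been encoded into a feature vector. The abstract does not state COST's exact objective, so the function `info_nce`, the cosine-similarity scoring, and the temperature value are illustrative choices rather than the authors' implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(video_feats, text_feats, temperature=0.1):
    """Mean InfoNCE loss; the i-th video and i-th text are the positive pair."""
    videos = [normalize(v) for v in video_feats]
    texts = [normalize(t) for t in text_feats]
    total = 0.0
    for i, video in enumerate(videos):
        # Cosine similarity of this video to every text in the batch.
        logits = [dot(video, text) / temperature for text in texts]
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        total += log_denominator - logits[i]
    return total / len(videos)


# Toy 2-D features: aligned pairs should give a lower loss than swapped pairs.
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Only the video-to-text direction is shown; a symmetric variant would average this with the text-to-video loss.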

Refining Interactions: Enhancing Anisotropy in Graph Neural Networks with Language Semantics

Zhaoxing Li,Xiaoming Zhang,Haifeng Zhang,Chengxiang Liu

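Task: Explore combining large language models (LLMs) with graph neural networks (GNNs) to enhance capabilities on text-attributed graphs (TAGs).

Motivation: Existing methods feed textual descriptions of the graph structure or of neighbouring nodes directly into LLMs, so the LLMs treat structural information as general contextual text, which limits their effectiveness on graph-related tasks.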

Details

Method: Proposes the LanSAGNN framework, which extends the concept of anisotropic GNNs to the natural language level, uses LLMs to extract tailor-made semantic information for node pairs, and introduces an efficient dual-layer LLM fine-tuning architecture. Result: Experiments show that LanSAGNN significantly improves on existing LLM-based methods without increasing complexity, while also showing strong robustness against interference. Conclusion: By combining the strengths of LLMs and GNNs, LanSAGNN effectively improves performance on graph-related tasks. Abstract: The integration of Large Language Models (LLMs) with Graph Neural Networks (GNNs) has recently been explored to enhance the capabilities of Text Attribute Graphs (TAGs). Most existing methods feed textual descriptions of the graph structure or neighbouring nodes' text directly into LLMs. However, these approaches often cause LLMs to treat structural information simply as general contextual text, thus limiting their effectiveness in graph-related tasks. In this paper, we introduce LanSAGNN (Language Semantic Anisotropic Graph Neural Network), a framework that extends the concept of anisotropic GNNs to the natural language level. This model leverages LLMs to extract tailor-made semantic information for node pairs, effectively capturing the unique interactions within node relationships. In addition, we propose an efficient dual-layer LLMs finetuning architecture to better align LLMs' outputs with graph tasks. Experimental results demonstrate that LanSAGNN significantly enhances existing LLM-based methods without increasing complexity while also exhibiting strong robustness against interference.

On Data Synthesis and Post-training for Visual Abstract Reasoning

Ke Zhu,Yu Wang,Jiangjiang Liu,Qunyi Xie,Shanshan Liu,Gang Zhang

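Task: Improve the ability of large vision-language models (VLMs) on abstract visual reasoning (AVR) problems.

Motivation: Most current VLMs perform poorly on representative AVR benchmarks, some close to chance level, so a breakthrough approach is needed.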

Details

Method: Uses an innovative data synthesis and post-training process that gradually reduces task difficulty and guides the model to learn step by step. Result: The LLaVA-NeXT 7B model significantly outperforms both open-source and closed-source VLMs (e.g., Qwen-2-VL-72B and GPT-4o) on AVR tasks while retaining its multimodal comprehension abilities. Conclusion: The paper offers an early exploration of AVR and is expected to inspire further research. Abstract: This paper is a pioneering work attempting to address abstract visual reasoning (AVR) problems for large vision-language models (VLMs). We make a common LLaVA-NeXT 7B model capable of perceiving and reasoning about specific AVR problems, surpassing both open-sourced (e.g., Qwen-2-VL-72B) and closed-sourced powerful VLMs (e.g., GPT-4o) by a significant margin. This is a great breakthrough since almost all previous VLMs fail or show nearly random performance on representative AVR benchmarks. Our key success is our innovative data synthesis and post-training process, aiming to fully relieve the task difficulty and elicit the model to learn, step by step. Our 7B model is also shown to behave well on AVR without sacrificing common multimodal comprehension abilities. We hope our paper could serve as an early effort in this area and would inspire further research in abstract visual reasoning.

PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation

Zhengwei Tao,Zhi Jin,Bincheng Li,Xiaoying Bai,Haiyan Zhao,Chengfeng Dou,Xiancai Chen,Jia Li,Linyu Li,Chongyang Tao

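Task: Propose PROPHET, a new benchmark for evaluating future event forecasting based on retrieval-augmented generation (RAG) and reasoning.

Motivation: Existing benchmarks do not consider whether questions are inferable, so some questions may not be answerable through sound reasoning.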

Details

Method: Introduces Causal Intervened Likelihood (CIL), a statistical measure that assesses the inferability of questions through causal inference, and uses it to build the PROPHET benchmark. Result: Validates the effectiveness of CIL and evaluates several prediction systems on PROPHET. Conclusion: PROPHET provides a more reliable evaluation benchmark for future event forecasting and points to directions for future research. Abstract: Predicting future events stands as one of the ultimate aspirations of artificial intelligence. Recent advances in large language model (LLM)-based systems have shown remarkable potential in forecasting future events, thereby garnering significant interest in the research community. Currently, several benchmarks have been established to evaluate the forecasting capabilities by formalizing the event prediction as a retrieval-augmented generation (RAG) and reasoning task. In these benchmarks, each prediction question is answered with relevant retrieved news articles. However, because there is no consideration of whether the questions can be supported by valid or sufficient supporting rationales, some of the questions in these benchmarks may be inherently noninferable. To address this issue, we introduce a new benchmark, PROPHET, which comprises inferable forecasting questions paired with relevant news for retrieval. To ensure the inferability of the benchmark, we propose Causal Intervened Likelihood (CIL), a statistical measure that assesses inferability through causal inference. In constructing this benchmark, we first collected recent trend forecasting questions and then filtered the data using CIL, resulting in an inferable benchmark for event prediction. Through extensive experiments, we first demonstrate the validity of CIL and conduct in-depth investigations into event prediction with the aid of CIL. Subsequently, we evaluate several representative prediction systems on PROPHET, drawing valuable insights for future directions.

CFMD: Dynamic Cross-layer Feature Fusion for Salient Object Detection

Jin Lian,Zhongyu Wan,Ming Gao,JunFeng Chen

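Task: Propose CFMD, a novel cross-layer feature pyramid network that addresses the limitations of traditional CFPNs in computational efficiency and boundary accuracy.

Motivation: Traditional CFPNs suffer from a computational bottleneck and blurred boundaries, which limits their performance in salient object detection.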

Details

Method: Designs a context-aware feature aggregation module (CFLMA) and an adaptive dynamic upsampling unit (CFLMD), which dynamically adjust feature weights and preserve spatial details, respectively. Result: On three standard benchmarks, CFMD substantially improves pixel-level accuracy and boundary segmentation quality. Conclusion: By improving computational efficiency and segmentation performance, CFMD shows strong potential in salient object detection. Abstract: Cross-layer feature pyramid networks (CFPNs) have achieved notable progress in multi-scale feature fusion and boundary detail preservation for salient object detection. However, traditional CFPNs still suffer from two core limitations: (1) a computational bottleneck caused by complex feature weighting operations, and (2) degraded boundary accuracy due to feature blurring in the upsampling process. To address these challenges, we propose CFMD, a novel cross-layer feature pyramid network that introduces two key innovations. First, we design a context-aware feature aggregation module (CFLMA), which incorporates the state-of-the-art Mamba architecture to construct a dynamic weight distribution mechanism. This module adaptively adjusts feature importance based on image context, significantly improving both representation efficiency and generalization. Second, we introduce an adaptive dynamic upsampling unit (CFLMD) that preserves spatial details during resolution recovery. By adjusting the upsampling range dynamically and initializing with a bilinear strategy, the module effectively reduces feature overlap and maintains fine-grained boundary structures. Extensive experiments on three standard benchmarks using three mainstream backbone networks demonstrate that CFMD achieves substantial improvements in pixel-level accuracy and boundary segmentation quality, especially in complex scenes. The results validate the effectiveness of CFMD in jointly enhancing computational efficiency and segmentation performance, highlighting its strong potential in salient object detection tasks.

Chain of Correction for Full-text Speech Recognition with Large Language Models

Zhiyuan Tang,Dong Wang,Zhikai Zhou,Yong Liu,Shen Huang,Shidong Shang

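Task: Propose Chain of Correction (CoC), a large language model (LLM)-based method for full-text error correction in automatic speech recognition (ASR).

Motivation: Full-text error correction is promising for long contexts and a broad range of error types (such as punctuation restoration and inverse text normalization), but it still faces challenges in stability, controllability, completeness, and fluency.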

Details

Method: Uses Chain of Correction (CoC), which corrects errors segment by segment with pre-recognized text as guidance in a multi-turn chat format, and uses the pre-recognized full text as context. Result: Experiments show that CoC significantly outperforms baseline and benchmark systems at correcting errors in full-text ASR outputs. Conclusion: CoC effectively addresses the challenges of full-text error correction; the paper also examines correction threshold settings, long-text processing, and the potential of other information to guide correction. Abstract: Full-text error correction with Large Language Models (LLMs) for Automatic Speech Recognition (ASR) has gained increased attention due to its potential to correct errors across long contexts and address a broader spectrum of error types, including punctuation restoration and inverse text normalization. Nevertheless, many challenges persist, including issues related to stability, controllability, completeness, and fluency. To mitigate these challenges, this paper proposes the Chain of Correction (CoC) for full-text error correction with LLMs, which corrects errors segment by segment using pre-recognized text as guidance within a regular multi-turn chat format. The CoC also uses pre-recognized full text for context, allowing the model to better grasp global semantics and maintain a comprehensive overview of the entire content. Utilizing the open-sourced full-text error correction dataset ChFT, we fine-tune a pre-trained LLM to evaluate the performance of the CoC framework. Experimental results demonstrate that the CoC effectively corrects errors in full-text ASR outputs, significantly outperforming baseline and benchmark systems. We further analyze how to set the correction threshold to balance under-correction and over-rephrasing, extrapolate the CoC model on extremely long ASR outputs, and investigate whether other types of information can be employed to guide the error correction process.

Min Shi,Shihao Wang,Chieh-Yun Chen,Jitesh Jain,Kai Wang,Junjun Xiong,Guilin Liu,Zhiding Yu,Humphrey Shi

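Task: Propose a novel slow-fast architecture that addresses the challenge of balancing temporal resolution and spatial detail under a limited compute budget in video multi-modal large language models (MLLMs).

Motivation: Existing methods typically compress video representations with predefined rules, which causes irreversible information loss and ignores input instructions, so a more efficient approach is needed that preserves spatial detail while allowing more input frames.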

Details

Method: Uses a dual-token strategy: 1) "fast" visual tokens (compressed video features) are fed into the LLM together with text embeddings to provide a quick overview; 2) "slow" visual tokens (uncompressed video features) are cross-attended by text embeddings through hybrid decoder layers, enabling instruction-aware extraction of visual details. Result: The model extends the input from 16 to 128 frames with only a 3% increase in computation and achieves a 16% average performance improvement across five video understanding benchmarks; the 7B model achieves state-of-the-art performance among models of similar size. Conclusion: The slow-fast architecture is a plug-and-play design that can be integrated into other video MLLMs to improve efficiency and scalability. Abstract: Balancing temporal resolution and spatial detail under limited compute budget remains a key challenge for video-based multi-modal large language models (MLLMs). Existing methods typically compress video representations using predefined rules before feeding them into the LLM, resulting in irreversible information loss and often ignoring input instructions. To address this, we propose a novel slow-fast architecture that naturally circumvents this trade-off, enabling the use of more input frames while preserving spatial details. Inspired by how humans first skim a video before focusing on relevant parts, our slow-fast design employs a dual-token strategy: 1) "fast" visual tokens -- a compact set of compressed video features -- are fed into the LLM alongside text embeddings to provide a quick overview; 2) "slow" visual tokens -- uncompressed video features -- are cross-attended by text embeddings through specially designed hybrid decoder layers, enabling instruction-aware extraction of relevant visual details with linear complexity. We conduct systematic exploration to optimize both the overall architecture and key components. Experiments show that our model significantly outperforms self-attention-only baselines, extending the input capacity from 16 to 128 frames with just a 3% increase in computation, and achieving a 16% average performance improvement across five video understanding benchmarks. Our 7B model achieves state-of-the-art performance among models of similar size. Furthermore, our slow-fast architecture is a plug-and-play design that can be integrated into other video MLLMs to improve efficiency and scalability.

Context-Aware Toxicity Detection in Multiplayer Games: Integrating Domain-Adaptive Pretraining and Match Metadata

Adrien Schurger-Foy,Rafal Dariusz Kocielnik,Caglar Gulcehre,R. Michael Alvarez

Task: Adapting RoBERTa LLM for context-aware toxicity detection in competitive online video games.

Motivation: Address the challenge of detecting toxicity in video game chats due to context-dependent nature, specialized slang, and rarity of toxic interactions.

Details

Method: Enhanced pretrained embeddings with metadata and domain adaptive pretraining to capture nuances of player interactions, using datasets from DOTA 2 and MWIII. Result: Identified useful contextual sources (metadata, prior interactions) and demonstrated performance improvements for toxicity detection. Conclusion: Highlights the need for context-aware and domain-specific approaches for effective moderation in online gaming. Abstract: The detrimental effects of toxicity in competitive online video games are widely acknowledged, prompting publishers to monitor player chat conversations. This is challenging due to the context-dependent nature of toxicity, often spread across multiple messages or informed by non-textual interactions. Traditional toxicity detectors focus on isolated messages, missing the broader context needed for accurate moderation. This is especially problematic in video games, where interactions involve specialized slang, abbreviations, and typos, making it difficult for standard models to detect toxicity, especially given its rarity. We adapted RoBERTa LLM to support moderation tailored to video games, integrating both textual and non-textual context. By enhancing pretrained embeddings with metadata and addressing the unique slang and language quirks through domain adaptive pretraining, our method better captures the nuances of player interactions. Using two gaming datasets, from Defense of the Ancients 2 (DOTA 2) and Call of Duty®: Modern Warfare® III (MWIII), we demonstrate which sources of context (metadata, prior interactions, etc.) are most useful, how to best leverage them to boost performance, and the conditions conducive to doing so. This work underscores the importance of context-aware and domain-specific approaches for proactive moderation.

Prompt-Guided Attention Head Selection for Focus-Oriented Image Retrieval

Yuji Nozawa,Yu-Chieh Lin,Kazumoto Nakamura,Youyang Ng

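Task: Enhance pretrained Vision Transformer (ViT) models for focus-oriented image retrieval (FOIR) with visual prompting.

Motivation: Real-world image retrieval often involves complex images from which users want to retrieve a specific object, and conventional methods based on a single global feature vector perform poorly on this task.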

Details

Method: Proposes Prompt-guided attention Head Selection (PHS), which selects specific attention heads by matching their attention maps with the user's visual prompt so that the model focuses on the object of interest. Result: Experiments show that PHS substantially improves performance on multiple datasets without retraining the model or altering the images. Conclusion: PHS offers a practical, training-free solution for the FOIR task that effectively improves model performance. Abstract: The goal of this paper is to enhance pretrained Vision Transformer (ViT) models for focus-oriented image retrieval with visual prompting. In real-world image retrieval scenarios, both query and database images often exhibit complexity, with multiple objects and intricate backgrounds. Users often want to retrieve images with a specific object, which we define as the Focus-Oriented Image Retrieval (FOIR) task. While a standard image encoder can be employed to extract image features for similarity matching, it may not perform optimally in the multi-object-based FOIR task. This is because each image is represented by a single global feature vector. To overcome this, a prompt-based image retrieval solution is required. We propose an approach called Prompt-guided attention Head Selection (PHS) to leverage the head-wise potential of the multi-head attention mechanism in ViT in a promptable manner. PHS selects specific attention heads by matching their attention maps with the user's visual prompts, such as a point, box, or segmentation. This empowers the model to focus on the specific object of interest while preserving the surrounding visual context. Notably, PHS does not necessitate model re-training and avoids any image alteration. Experimental results show that PHS substantially improves performance on multiple datasets, offering a practical and training-free solution to enhance model performance in the FOIR task.
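The head-selection idea can be sketched as follows: score each attention head by how much of its attention mass falls inside the region marked by the visual prompt, then keep the best-scoring heads. The scoring rule (fraction of attention inside the mask), the `top_k` parameter, and the function name `select_heads` are assumptions for illustration; the abstract only says that heads are selected by matching their attention maps with the prompt.

```python
def select_heads(attention_maps, prompt_mask, top_k):
    """Return indices of the top_k heads whose attention best overlaps the prompt.

    attention_maps: one flattened per-patch attention map per head.
    prompt_mask: 1 for patches inside the user's prompt region, else 0.
    Hypothetical scoring rule for illustration only.
    """
    scores = []
    for head_index, attention in enumerate(attention_maps):
        total = sum(attention)
        inside = sum(a for a, m in zip(attention, prompt_mask) if m)
        share = inside / total if total > 0 else 0.0
        scores.append((share, head_index))
    # Highest share first; ties are broken by the lower head index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [head_index for _, head_index in scores[:top_k]]


# Three made-up heads over a 2x2 patch grid, flattened row by row.
attention_maps = [
    [0.7, 0.1, 0.1, 0.1],      # attends mostly to the top-left patch
    [0.1, 0.1, 0.4, 0.4],      # attends mostly to the bottom row
    [0.25, 0.25, 0.25, 0.25],  # uniform attention
]
prompt_mask = [0, 0, 1, 1]  # the user's box covers the bottom row
selected = select_heads(attention_maps, prompt_mask, top_k=2)
```

Because selection only reads existing attention maps, a scheme of this kind needs no retraining and leaves the input image untouched, which matches the training-free property the abstract emphasizes.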

From Smør-re-brød to Subwords: Training LLMs on Danish, One Morpheme at a Time

Mikkel Wildner Kildeberg,Emil Allerslev Schledermann,Nicolaj Larsen,Rob van der Goot

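Task: Study how Danish morphological data can be used to improve subword tokenization and thereby the performance of generative transformer models on Danish tasks.

Motivation: Existing subword tokenization techniques such as BPE often ignore morphological principles, yet morphological segmentation is essential for understanding language-specific word structure.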

Details

Method: Uses a Danish morphological dataset to train a semisupervised model for morphological segmentation, develops optimized tokenizers, and evaluates them on Danish word segmentation and in generative transformer models. Result: The custom morphological tokenizers substantially outperform the BPE tokenizer in morphological segmentation (F1 score of 58.84 vs. 39.28) and outperform BPE tokenizers across downstream tasks. Conclusion: Incorporating Danish morphological segmentation strategies into tokenizers improves the performance of generative transformer models on Danish tasks. Abstract: The best performing transformer-based language models use subword tokenization techniques, such as Byte-Pair-Encoding (BPE). However, these approaches often overlook linguistic principles, such as morphological segmentation, which we believe is fundamental for understanding language-specific word structure. In this study, we leverage an annotated Danish morphological dataset to train a semisupervised model for morphological segmentation, enabling the development of tokenizers optimized for Danish morphology. We evaluate four distinct tokenizers, including two custom morphological tokenizers, by analyzing their performance in morphologically segmenting Danish words. Additionally, we train two generative transformer models, CerebrasGPT-111M and LLaMA-3.2 1B, using these tokenizers and evaluate their downstream performance. Our findings reveal that our custom-developed tokenizers substantially enhance morphological segmentation, achieving an F1 score of 58.84, compared to 39.28 achieved by a Danish BPE tokenizer. In downstream tasks, models trained with our morphological tokenizers outperform those using BPE tokenizers across different evaluation metrics. These results highlight that incorporating Danish morphological segmentation strategies into tokenizers leads to improved performance in generative transformer models on Danish language
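To make the segmentation F1 score concrete, the sketch below computes a boundary-level F1 between a predicted and a gold segmentation of one word. Boundary-level scoring is an assumption here, since the abstract does not say how the reported F1 is computed. The gold split follows the hyphenation in the paper's title, and the "BPE-like" split is invented for the example.

```python
def boundaries(segments):
    """Character offsets at which a segmentation places an internal boundary."""
    offsets = set()
    position = 0
    for segment in segments[:-1]:
        position += len(segment)
        offsets.add(position)
    return offsets


def boundary_f1(predicted, gold):
    """F1 over internal boundary positions (one assumed way to score segmentation)."""
    predicted_offsets = boundaries(predicted)
    gold_offsets = boundaries(gold)
    if not predicted_offsets and not gold_offsets:
        return 1.0  # both treat the word as a single unsegmented unit
    true_positives = len(predicted_offsets & gold_offsets)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_offsets)
    recall = true_positives / len(gold_offsets)
    return 2 * precision * recall / (precision + recall)


gold = ["smør", "re", "brød"]      # split taken from the paper's title
bpe_like = ["smørre", "br", "ød"]  # made-up subword split for illustration
score = boundary_f1(bpe_like, gold)
```

Here the invented subword split recovers one of the two gold boundaries and adds one spurious boundary, so precision and recall are both 0.5.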

v-CLR: View-Consistent Learning for Open-World Instance Segmentation

Chang-Bin Zhang,Jinhong Ni,Yujie Zhong,Kai Han

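Task: Address the reliance of models on texture in open-world instance segmentation.

Motivation: Existing visual networks are biased toward learning appearance information such as texture, so they fail to detect novel objects with unseen textures in the open-world setting.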

Details

Method: Proposes a framework called view-consistent learning (v-CLR), which introduces additional views whose texture is substantially altered while the image structure is preserved, and enforces the model to learn appearance-invariant representations across views. Result: Achieves state-of-the-art performance on public benchmarks in both cross-class and cross-dataset settings. Conclusion: By strengthening object-awareness and reducing appearance dependency, v-CLR effectively improves the robustness of open-world instance segmentation. Abstract: In this paper, we address the challenging problem of open-world instance segmentation. Existing works have shown that vanilla visual networks are biased toward learning appearance information, e.g., texture, to recognize objects. This implicit bias causes the model to fail in detecting novel objects with unseen textures in the open-world setting. To address this challenge, we propose a learning framework, called view-Consistent LeaRning (v-CLR), which aims to enforce the model to learn appearance-invariant representations for robust instance segmentation. In v-CLR, we first introduce additional views for each image, where the texture undergoes significant alterations while preserving the image's underlying structure. We then encourage the model to learn the appearance-invariant representation by enforcing the consistency between object features across different views, for which we obtain class-agnostic object proposals using off-the-shelf unsupervised models that possess strong object-awareness. These proposals enable cross-view object feature matching, greatly reducing the appearance dependency while enhancing the object-awareness. We thoroughly evaluate our method on public benchmarks under both cross-class and cross-dataset settings, achieving state-of-the-art performance. Project page: https://visual-ai.github.io/vclr

Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation

Amanda Myntti,Erik Henriksson,Veronika Laippala,Sampo Pyysalo

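Task: Study how the register (text type) of pretraining data affects the performance of large language models (LLMs).

Motivation: Current pretraining data filtering methods are mostly binary (useful vs. useless) and lack a detailed understanding of how different kinds of text contribute to model performance.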

Details

Method: Uses register classification from corpus linguistics to categorize pretraining data, trains models on the classified data, and evaluates them on standard benchmarks. Result: Register substantially affects model performance: some registers (such as Opinion) are beneficial while others (such as News) perform poorly, and combining well-performing registers yields major improvements. Conclusion: Register is an important factor in explaining differences in model performance, and future data selection should pay closer attention to register. Abstract: Pretraining data curation is a cornerstone in Large Language Model (LLM) development, leading to growing research on quality filtering of large web corpora. From statistical quality flags to LLM-based labeling systems, datasets are divided into categories, frequently reducing to a binary: those passing the filters deemed as valuable examples, others discarded as useless or detrimental. However, a more detailed understanding of the contribution of different kinds of texts to model performance is still largely lacking. In this article, we present the first study utilizing registers (also known as genres) - a widely used standard in corpus linguistics to model linguistic variation - to curate pretraining datasets and investigate the effect of register on the performance of LLMs. We perform comparative studies by training models with register classified data and evaluating them using standard benchmarks, and show that the register of pretraining data substantially affects model performance. We uncover surprising relationships between the pretraining material and the resulting models: using the News register results in subpar performance, and on the contrary, including the Opinion class, covering texts such as reviews and opinion blogs, is highly beneficial. While a model trained on the entire unfiltered dataset outperforms those trained on datasets limited to a single register, combining well-performing registers like How-to-Instructions, Informational Description, and Opinion leads to major improvements. Furthermore, analysis of individual benchmark results reveals key differences in the strengths and drawbacks of specific register classes as pretraining data. These findings show that register is an important explainer of model variation and can facilitate more deliberate future data selection practices.

DALIP: Distribution Alignment-based Language-Image Pre-Training for Domain-Specific Data

Junjie Wu,Jiangtao Xie,Zhaolin Zhang,Qilong Wang,Qinghua Hu,Peihua Li,Sen Xu

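Task: Propose DALIP, a distribution alignment-based language-image pre-training method for biological data.

Motivation: Existing methods directly tune the original CLIP models on domain-specific data (such as biological data) without fully considering the characteristics of that data (such as its fine-grained nature), and they may lose CLIP's general-domain ability.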

Details

Method: DALIP optimizes CLIP models by matching the similarity between the feature distributions of image-text pairs, proposes a Multi-head Brownian Distance Covariance (MBDC) module to obtain second-order statistics efficiently, and collects the PlantMix-13M dataset. Result: DALIP clearly outperforms existing CLIP counterparts in the biological domain and generalizes to remote sensing and medical imaging; the PlantMix-13M dataset further improves performance. Conclusion: DALIP effectively handles the fine-grained nature of domain-specific data while preserving general-domain ability. Abstract: Recently, Contrastive Language-Image Pre-training (CLIP) has shown promising performance in domain-specific data (e.g., biology), and has attracted increasing research attention. Existing works generally focus on collecting extensive domain-specific data and directly tuning the original CLIP models. Intuitively, such a paradigm does not fully consider the characteristics of domain-specific data (e.g., the fine-grained nature of biological data) and so limits model capability, while mostly losing the original ability of CLIP in the general domain. In this paper, we propose a Distribution Alignment-based Language-Image Pre-Training (DALIP) method for biological data. Specifically, DALIP optimizes CLIP models by matching the similarity between feature distribution of image-text pairs instead of the original [cls] token, which can capture rich yet effective information inherent in image-text pairs as powerful representations, and so better cope with the fine-grained nature of biological data. Particularly, our DALIP efficiently approximates feature distribution via its first- and second-order statistics, while presenting a Multi-head Brownian Distance Covariance (MBDC) module to acquire second-order statistics of token features efficiently. Furthermore, we collect a new dataset for the plant domain (e.g., specific data in the biological domain) comprising 10M plant data with 3M general-domain data (namely PlantMix-13M) according to data mixing laws. Extensive experiments show that DALIP clearly outperforms existing CLIP counterparts in the biological domain, while generalizing well to remote sensing and medical imaging domains. Besides, our PlantMix-13M dataset further boosts the performance of DALIP in the plant domain, while preserving model ability in the general domain.

Testing Low-Resource Language Support in LLMs Using Language Proficiency Exams: the Case of Luxembourgish

Cedric Lothritz,Jordi Cabot

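Task: Investigate the viability of language proficiency exams as evaluation tools for Luxembourgish.

Motivation: Large language models (LLMs) are developed mainly with English-speaking users in mind and pay less attention to less-resourced languages such as Luxembourgish, for which evaluation tools and datasets are scarce.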

Details

Method: Evaluates language models of different sizes (such as ChatGPT, Claude, and DeepSeek-R1) on Luxembourgish using language proficiency exams. Result: Large models (such as ChatGPT) score high while smaller models perform poorly; exam performance can predict performance on other NLP tasks. Conclusion: Language proficiency exams are an effective evaluation tool for Luxembourgish, and large models perform strongly on this low-resource language. Abstract: Large Language Models (LLMs) have become an increasingly important tool in research and society at large. While LLMs are regularly used all over the world by experts and lay-people alike, they are predominantly developed with English-speaking users in mind, performing well in English and other wide-spread languages while less-resourced languages such as Luxembourgish are seen as a lower priority. This lack of attention is also reflected in the sparsity of available evaluation tools and datasets. In this study, we investigate the viability of language proficiency exams as such evaluation tools for the Luxembourgish language. We find that large models such as ChatGPT, Claude and DeepSeek-R1 typically achieve high scores, while smaller models show weak performances. We also find that the performances in such language exams can be used to predict performances in other NLP tasks.

All Patches Matter, More Patches Better: Enhance AI-Generated Image Detection via Panoptic Patch Learning

Zheng Yang,Ruoxin Chen,Zhiyuan Yan,Ke-Yue Zhang,Xinghe Fu,Shuang Wu,Xiujun Shu,Taiping Yao,Junchi Yan,Shouhong Ding,Xi Li

Task: Propose Panoptic Patch Learning (PPL), a framework for detecting synthetic artifacts in AI-generated images (AIGIs).

Motivation: The rapid growth of AI-generated images calls for robust and generalizable detection methods. Existing detectors exhibit a Few-Patch Bias: they rely on only a few regions to discriminate, which limits performance.

Details

Method: PPL comprises Random Patch Replacement and Patch-wise Contrastive Learning, which encourage the model to use artifacts from more regions. Result: Experiments on several benchmarks verify the effectiveness of PPL, with significant gains in detection robustness and generalization. Conclusion: By exploiting artifacts in all regions, PPL addresses the Few-Patch Bias and offers a more reliable approach to AIGI detection. Abstract: The exponential growth of AI-generated images (AIGIs) underscores the urgent need for robust and generalizable detection methods. In this paper, we establish two key principles for AIGI detection through systematic analysis: (1) All Patches Matter: Unlike conventional image classification where discriminative features concentrate on object-centric regions, each patch in AIGIs inherently contains synthetic artifacts due to the uniform generation process, suggesting that every patch serves as an important artifact source for detection. (2) More Patches Better: Leveraging distributed artifacts across more patches improves detection robustness by capturing complementary forensic evidence and reducing over-reliance on specific patches, thereby enhancing robustness and generalization. However, our counterfactual analysis reveals an undesirable phenomenon: naively trained detectors often exhibit a Few-Patch Bias, discriminating between real and synthetic images based on minority patches. We identify Lazy Learner as the root cause: detectors preferentially learn conspicuous artifacts in limited patches while neglecting broader artifact distributions. To address this bias, we propose the Panoptic Patch Learning (PPL) framework, involving: (1) Random Patch Replacement that randomly substitutes synthetic patches with real counterparts to compel models to identify artifacts in underutilized regions, encouraging the broader use of more patches; (2) Patch-wise Contrastive Learning that enforces consistent discriminative capability across all patches, ensuring uniform utilization of all patches. Extensive experiments across two different settings on several benchmarks verify the effectiveness of our approach.
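
The Random Patch Replacement step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_patch_replacement`, the non-overlapping square grid, and the `ratio` parameter are assumptions made for illustration, and images are plain nested lists so the example needs no third-party library.

```python
import random


def random_patch_replacement(synthetic, real, patch, ratio, rng):
    """Copy a random subset of patches from `real` into a copy of `synthetic`.

    Hypothetical helper: grid layout and `ratio` are illustrative assumptions.
    Both images are 2D lists of equal shape.
    """
    height, width = len(synthetic), len(synthetic[0])
    out = [row[:] for row in synthetic]  # leave the input untouched
    # Top-left corner of every non-overlapping patch in the grid.
    corners = [(r, c) for r in range(0, height, patch)
               for c in range(0, width, patch)]
    chosen = rng.sample(corners, int(len(corners) * ratio))
    for r, c in chosen:
        for i in range(r, min(r + patch, height)):
            out[i][c:c + patch] = real[i][c:c + patch]
    return out, chosen


synthetic = [[1] * 4 for _ in range(4)]  # stand-in for a generated image
real = [[0] * 4 for _ in range(4)]       # stand-in for a real image
mixed, chosen = random_patch_replacement(
    synthetic, real, patch=2, ratio=0.5, rng=random.Random(0))
```

With half of the four 2x2 patches swapped out, a detector trained on `mixed` can no longer rely on a single conspicuous region, which is the intuition the abstract gives for this step.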

ToM-RL: Reinforcement Learning Unlocks Theory of Mind in Small LLMs

Yi-Long Lu,Chunhui Zhang,Jiajun Song,Lifeng Fan,Wei Wang

Task: Investigate how rule-based reinforcement learning (RL), applied during the post-training phase of large language models (LLMs), improves social reasoning abilities such as Theory of Mind (ToM).

Motivation: RL performs well on structured reasoning tasks such as mathematics and logical inference, but its use in social reasoning, particularly ToM, remains underexplored.

Details

Method: Train small LLMs (0.5B to 7B parameters) with RL on a dataset of 3200 questions and evaluate them on the Hi-ToM benchmark. Result: The RL-trained 7B model reaches 84.50% accuracy on Hi-ToM, surpassing larger models such as GPT-4o and DeepSeek-v3; smaller models (≤3B parameters) suffer from reasoning collapse, while the 7B model maintains performance through consistent belief tracking. Conclusion: RL effectively enhances the social cognitive reasoning of LLMs, bridging the gap between structured problem-solving and nuanced social inference. Abstract: Recent advancements in rule-based reinforcement learning (RL), applied during the post-training phase of large language models (LLMs), have significantly enhanced their capabilities in structured reasoning tasks such as mathematics and logical inference. However, the effectiveness of RL in social reasoning, particularly in Theory of Mind (ToM), the ability to infer others' mental states, remains largely unexplored. In this study, we demonstrate that RL methods effectively unlock ToM reasoning capabilities even in small-scale LLMs (0.5B to 7B parameters). Using a modest dataset comprising 3200 questions across diverse scenarios, our RL-trained 7B model achieves 84.50% accuracy on the Hi-ToM benchmark, surpassing models like GPT-4o and DeepSeek-v3 despite significantly fewer parameters. While smaller models (≤3B parameters) suffer from reasoning collapse, larger models (7B parameters) maintain stable performance through consistent belief tracking. Additionally, our RL-based models demonstrate robust generalization to higher-order, out-of-distribution ToM problems, novel textual presentations, and previously unseen datasets. These findings highlight RL's potential to enhance social cognitive reasoning, bridging the gap between structured problem-solving and nuanced social inference in LLMs.

Leveraging Generalizability of Image-to-Image Translation for Enhanced Adversarial Defense

Haibo Zhang,Zhihua Yao,Kouichi Sakurai,Takeshi Saitoh

Task: Propose an improved image-to-image translation-based defense that generalizes better across adversarial attacks.

Motivation: Adversarial attacks expose critical vulnerabilities in machine learning models; existing defenses often overlook time and computational costs and generalize poorly to unseen attack types.

Details

Method: Add residual blocks to the image-to-image translation model; only a single model needs to be trained, it defends against diverse attack types, and it transfers well between target models. Result: Experiments show that the method restores classification accuracy from near zero to 72% on average, with performance comparable to state-of-the-art methods. Conclusion: The improved model is efficient and generalizable for adversarial defense, offering a feasible option for practical use. Abstract: In the rapidly evolving field of artificial intelligence, machine learning emerges as a key technology characterized by its vast potential and inherent risks. The stability and reliability of these models are important, as they are frequent targets of security threats. Adversarial attacks, first rigorously defined by Ian Goodfellow et al. in 2013, highlight a critical vulnerability: they can trick machine learning models into making incorrect predictions by applying nearly invisible perturbations to images. Although many studies have focused on constructing sophisticated defensive mechanisms to mitigate such attacks, they often overlook the substantial time and computational costs of training and maintaining these models. Ideally, a defense method should be able to generalize across various, even unseen, adversarial attacks with minimal overhead. Building on our previous work on image-to-image translation-based defenses, this study introduces an improved model that incorporates residual blocks to enhance generalizability. The proposed method requires training only a single model, effectively defends against diverse attack types, and is well-transferable between different target models. Experiments show that our model can restore the classification accuracy from near zero to an average of 72% while maintaining competitive performance compared to state-of-the-art methods.

InfiniteICL: Breaking the Limit of Context Window Size via Long Short-term Memory Transformation

Bowen Cao,Deng Cai,Wai Lam

Task: Propose InfiniteICL, a framework that addresses the finite context window of large language models (LLMs) in ultra-long contexts.

Motivation: The performance of existing LLMs in ultra-long contexts is limited by finite context windows, which hurts their ability to learn and reason.

Details

Method: Draw an analogy between context and parameters in LLMs and short- and long-term memory in human cognition, and transform temporary context knowledge into permanent parameter updates, enabling infinite context integration in theory. Result: The method reduces context length by 90% while reaching 103% of the average performance of full-context prompting on fact recall, grounded reasoning, and skill acquisition tasks. Conclusion: By breaking the limits of conventional context windows, InfiniteICL significantly improves the scalability and efficiency of LLMs. Abstract: In-context learning (ICL) is critical for large language models (LLMs), but its effectiveness is constrained by finite context windows, particularly in ultra-long contexts. To overcome this, we introduce InfiniteICL, a framework that parallels context and parameters in LLMs with short- and long-term memory in human cognitive systems, focusing on transforming temporary context knowledge into permanent parameter updates. This approach significantly reduces memory usage, maintains robust performance across varying input lengths, and theoretically enables infinite context integration through the principles of context knowledge elicitation, selection, and consolidation. Evaluations demonstrate that our method reduces context length by 90% while achieving 103% average performance of full-context prompting across fact recall, grounded reasoning, and skill acquisition tasks. When conducting sequential multi-turn transformations on complex, real-world contexts (with length up to 2M tokens), our approach surpasses full-context prompting while using only 0.4% of the original contexts. These findings highlight InfiniteICL's potential to enhance the scalability and efficiency of LLMs by breaking the limitations of conventional context window sizes.

TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding

Junwen Pan,Rui Zhang,Xin Wan,Yuan Zhang,Ming Lu,Qi She

Task: Propose TimeSearch, a new framework that enables large video-language models (LVLMs) to understand long videos in a human-like manner.

Motivation: Long videos contain so many frames that processing them leads to visual hallucinations, and existing methods struggle to interpret them accurately. TimeSearch is inspired by human hierarchical temporal search strategies.

Details

Method: TimeSearch combines two human-like behaviors: 1) Spotlight efficiently identifies relevant events through a Temporal-Augmented Frame Representation (TAFR); 2) Reflection uses the self-reflection ability of LVLMs to assess the correctness of the identified events. Result: Accuracy on LVBench rises from 41.8% to 51.5%, and mIoU on Charades-STA improves by 11.8%. Conclusion: TimeSearch substantially outperforms existing methods, and TAFR effectively elicits the temporal grounding ability of LVLMs. Abstract: Large video-language models (LVLMs) have shown remarkable performance across various video-language tasks. However, they encounter significant challenges when processing long videos because of the large number of video frames involved. Downsampling long videos in either space or time can lead to visual hallucinations, making it difficult to accurately interpret long videos. Motivated by human hierarchical temporal search strategies, we propose TimeSearch, a novel framework enabling LVLMs to understand long videos in a human-like manner. TimeSearch integrates two human-like primitives into a unified autoregressive LVLM: 1) Spotlight efficiently identifies relevant temporal events through a Temporal-Augmented Frame Representation (TAFR), explicitly binding visual features with timestamps; 2) Reflection evaluates the correctness of the identified events, leveraging the inherent temporal self-reflection capabilities of LVLMs. TimeSearch progressively explores key events and prioritizes temporal search based on reflection confidence. Extensive experiments on challenging long-video benchmarks confirm that TimeSearch substantially surpasses previous state-of-the-art, improving the accuracy from 41.8% to 51.5% on the LVBench. Additionally, experiments on temporal grounding demonstrate that appropriate TAFR is adequate to effectively stimulate the surprising temporal grounding ability of LVLMs in a simpler yet versatile manner, which improves mIoU on Charades-STA by 11.8%. The code will be released.
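
The mIoU figure reported for temporal grounding is, in its usual form, the mean intersection-over-union between predicted and ground-truth time spans. The sketch below shows that standard computation; the function names and the `(start, end)` tuple layout are assumptions for illustration, and the abstract does not specify the exact evaluation script used.

```python
def temporal_iou(pred, gt):
    """IoU of two time spans given as (start, end) in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average temporal IoU over paired predictions and ground truths."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


# One partially overlapping prediction and one exact match.
score = mean_iou([(0.0, 10.0), (0.0, 4.0)], [(5.0, 15.0), (0.0, 4.0)])
```

The first pair overlaps for 5 of 15 seconds (IoU 1/3) and the second matches exactly (IoU 1), so `score` is their average, 2/3.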

Style over Substance: Distilled Language Models Reason Via Stylistic Replication

Philip Lippmann,Jie Yang

Task: Study the transfer of stylistic patterns in the reasoning abilities of distilled models.

Motivation: Reasoning traces help distill knowledge into smaller, instruction-tuned models, but the nature of the reasoning that is transferred remains unclear.

Details

Method: Systematically analyze reasoning traces to identify the structural and lexical patterns of successful reasoning, and introduce two new datasets (a dataset of emergent reasoning traces and a synthetic dataset) to study the influence of stylistic patterns. Result: Models trained on the synthetic traces achieve comparable performance, indicating that distilled reasoning abilities rely significantly on surface-level patterns; performance improves even when the synthetic traces are altered to lead to the wrong answer. Conclusion: Stylistic patterns can efficiently enhance reasoning across diverse model families. Abstract: Specialized reasoning language models (RLMs) have demonstrated that scaling test-time computation through detailed reasoning traces significantly enhances performance. Although these traces effectively facilitate knowledge distillation into smaller, instruction-tuned models, the precise nature of transferred reasoning remains unclear. In this study, we investigate to what extent distilled models internalize replicated stylistic patterns during reasoning. To this end, we systematically analyze reasoning traces, identifying structural and lexical patterns that characterize successful reasoning. We then introduce two new datasets -- a dataset of emergent reasoning traces and a synthetic dataset explicitly constructed to replicate these stylistic patterns -- to precisely examine their influence on distilled models' reasoning capabilities. We find that models trained on the synthetic traces achieve comparable performance, indicating that distilled reasoning abilities rely significantly on surface-level patterns. Surprisingly, we observe an increase in performance even when the synthetic traces are altered to lead to the wrong answer. Our findings highlight how stylistic patterns can be leveraged to efficiently enhance LM reasoning across diverse model families.

MuTri: Multi-view Tri-alignment for OCT to OCTA 3D Image Translation

Zhuangzhuang Chen,Hualiang Wang,Chubin Ou,Xiaomeng Li

Task: Translate 3D OCT images into 3D OCTA images.

Motivation: Existing OCTA translation methods learn the mapping from the OCT domain to the OCTA domain directly in continuous and infinite space, guided by only a single view (the OCTA project map), which leads to suboptimal results.

Details

Method: Propose the multi-view tri-alignment framework (MuTri) in two stages: first, pre-train two VQ-VAE models to reconstruct 3D OCT and 3D OCTA data; second, learn the mapping in discrete and finite space through multi-view alignment. Result: Contrastive-inspired semantic alignment and vessel structure alignment are proposed, improving codebook learning and vessel structure detail. Conclusion: MuTri achieves better OCT-to-OCTA translation in discrete and finite space, and a large-scale dataset, OCTA2024, is collected. Abstract: Optical coherence tomography angiography (OCTA) shows its great importance in imaging microvascular networks by providing accurate 3D imaging of blood vessels, but it relies upon specialized sensors and expensive devices. For this reason, previous works show the potential to translate the readily available 3D Optical Coherence Tomography (OCT) images into 3D OCTA images. However, existing OCTA translation methods directly learn the mapping from the OCT domain to the OCTA domain in continuous and infinite space with guidance from only a single view, i.e., the OCTA project map, resulting in suboptimal results. To this end, we propose the multi-view Tri-alignment framework for OCT to OCTA 3D image translation in discrete and finite space, named MuTri. In the first stage, we pre-train two vector-quantized variational auto-encoders (VQ-VAE) by reconstructing 3D OCT and 3D OCTA data, providing semantic prior for subsequent multi-view guidance. In the second stage, our multi-view tri-alignment facilitates another VQ-VAE model to learn the mapping from the OCT domain to the OCTA domain in discrete and finite space. Specifically, a contrastive-inspired semantic alignment is proposed to maximize the mutual information with the pre-trained models from OCT and OCTA views, to facilitate codebook learning. Meanwhile, a vessel structure alignment is proposed to minimize the structure discrepancy with the pre-trained models from the OCTA project map view, benefiting from learning the detailed vessel structure information. We also collect the first large-scale dataset, namely, OCTA2024, which contains a pair of OCT and OCTA volumes from 846 subjects.
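
The phrase "discrete and finite space" refers to the quantization step of a VQ-VAE, where each continuous latent vector is snapped to its nearest entry in a learned codebook. The sketch below shows only that generic lookup, not MuTri itself; the name `quantize`, the toy 2-D codebook, and the squared Euclidean distance are illustrative assumptions.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector`.

    Generic VQ-VAE-style nearest-neighbor lookup using squared
    Euclidean distance; a real model applies it to every latent position.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)),
                key=lambda k: squared_distance(vector, codebook[k]))
    return index, codebook[index]


codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]  # toy 3-entry codebook
index, code = quantize([0.9, 1.2], codebook)
```

Because every latent is replaced by one of a finite set of codes, the translation model only has to predict code indices rather than arbitrary continuous values.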

OpenThaiGPT 1.6 and R1: Thai-Centric Open Source and Reasoning Large Language Models

Sumeth Yuenyong,Thodsaporn Chay-intr,Kobkrit Viriyayudhakorn

Task: Develop OpenThaiGPT 1.6 and R1 (OTG-1.6 and OTG-R1), two Thai-centric large language models (LLMs), to improve generalization and reasoning.

Motivation: Strengthen the generalization and reasoning abilities of Thai LLMs through distinct methodologies, to achieve better performance on Thai-language tasks.

Details

Method: OTG-1.6 uses Task Arithmetic model merging for broad generalization; OTG-R1 combines multi-stage training with the Less-Is-More Reasoning Hypothesis (LIMO) for advanced reasoning. Result: Benchmarks show strong performance on Thai-language tasks, competitive with larger-scale open-source Thai LLMs. Conclusion: The paper details the models, training processes, and results, showing improvements over previous models and setting new performance standards for Thai-centric LLMs. Abstract: We present OpenThaiGPT 1.6 and R1 (OTG-1.6 and OTG-R1), Thai-centric Large Language Models (LLMs) developed through distinct methodologies to enhance generalization and reasoning capabilities. OTG-1.6 employs Task Arithmetic model merging for broad generalization, while OTG-R1 integrates multi-stage training with the Less-Is-More Reasoning Hypothesis (LIMO) for advanced reasoning. Benchmark evaluations demonstrate superior performance across Thai language tasks, achieving competitive results against larger-scale open-source Thai LLMs. This paper details the proposed models, training processes, benchmarks, and results, highlighting improvements over previous models and establishing new performance standards for Thai-centric LLMs.
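
Task Arithmetic merging, in its general form, adds scaled "task vectors" (fine-tuned weights minus base weights) back onto a base model. The sketch below shows that general recipe on toy weights; the abstract does not state which checkpoints OTG-1.6 merges or what scaling it uses, so `task_arithmetic_merge`, the dict-of-lists weight format, and `scale=0.5` are all illustrative assumptions.

```python
def task_arithmetic_merge(base, finetuned_models, scale):
    """Merge models as base + scale * sum(finetuned - base), per parameter.

    Weights are dicts mapping a parameter name to a flat list of floats
    (a toy stand-in for real tensors).
    """
    merged = {}
    for name, base_weights in base.items():
        # Sum of task vectors for this parameter across all fine-tuned models.
        delta = [sum(model[name][i] - base_weights[i] for model in finetuned_models)
                 for i in range(len(base_weights))]
        merged[name] = [b + scale * d for b, d in zip(base_weights, delta)]
    return merged


base = {"w": [1.0, 2.0]}
model_a = {"w": [2.0, 2.0]}  # hypothetical fine-tune that changed the first weight
model_b = {"w": [1.0, 4.0]}  # hypothetical fine-tune that changed the second weight
merged = task_arithmetic_merge(base, [model_a, model_b], scale=0.5)
```

Here the two task vectors are `[1, 0]` and `[0, 2]`, so the merged weights are `[1.5, 3.0]`: each fine-tune contributes half of its change.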

Multimodal Point Cloud Semantic Segmentation With Virtual Point Enhancement

Zaipeng Duan,Xuzhong Hu,Pei An,Jie Ma

Task: Propose a multi-modal point cloud semantic segmentation method based on Virtual Point Enhancement (VPE) to address the sparsity and varying density of LiDAR point clouds.

Motivation: The sparsity and varying density of LiDAR point clouds make it hard to capture the details of medium-range and small targets.

Details

Method: Integrate virtual points generated from images, and introduce a spatial difference-driven adaptive filtering module and a noise-robust sparse feature encoder. Result: Effectiveness is validated on SemanticKITTI and nuScenes; on nuScenes, introducing 7.7% virtual points improves mIoU by 2.89%. Conclusion: Through virtual point enhancement and noise-robust feature extraction, the method improves point cloud semantic segmentation. Abstract: LiDAR-based 3D point cloud recognition has been proven beneficial in various applications. However, the sparsity and varying density pose a significant challenge in capturing intricate details of objects, particularly for medium-range and small targets. Therefore, we propose a multi-modal point cloud semantic segmentation method based on Virtual Point Enhancement (VPE), which integrates virtual points generated from images to address these issues. These virtual points are dense but noisy, and directly incorporating them can increase computational burden and degrade performance. Therefore, we introduce a spatial difference-driven adaptive filtering module that selectively extracts valuable pseudo points from these virtual points based on density and distance, enhancing the density of medium-range targets. Subsequently, we propose a noise-robust sparse feature encoder that incorporates noise-robust feature extraction and fine-grained feature enhancement. Noise-robust feature extraction exploits the 2D image space to reduce the impact of noisy points, while fine-grained feature enhancement boosts sparse geometric features through inner-voxel neighborhood point aggregation and downsampled voxel aggregation. The results on SemanticKITTI and nuScenes, two large-scale benchmark datasets, have validated the method's effectiveness, significantly improving mIoU by 2.89% with the introduction of 7.7% virtual points on nuScenes.

Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training

Zhijun Wang,Jiahuan Li,Hao Zhou,Rongxiang Weng,Jingang Wang,Xin Huang,Xue Han,Junlan Feng,Chao Deng,Shujian Huang

Task: Investigate the effect of code-switching on the multilingual capabilities of large language models (LLMs).

Motivation: LLMs show remarkable multilingual capabilities despite extreme language imbalance in the pre-training data; the study aims to uncover the reasons behind this.

Details

Method: Analyze code-switching in the pre-training corpus, categorize it into four types, and study its effect on multilingual performance; also explore strategies for synthesizing code-switching data. Result: Synthetic code-switching data markedly improves results on benchmarks and in representation space, and is effective for languages at different resource levels. Conclusion: Code-switching is key to multilingual capabilities, and synthetic code-switching data promotes language alignment and generalizes to languages at different resource levels. Abstract: Large language models (LLMs) exhibit remarkable multilingual capabilities despite the extreme language imbalance in the pre-training data. In this paper, we closely examine the reasons behind this phenomenon, focusing on the pre-training corpus. We find that the existence of code-switching, alternating between different languages within a context, is key to multilingual capabilities. We conduct an analysis to investigate code-switching in the pre-training corpus, examining its presence and categorizing it into four types within two quadrants. We then assess its impact on multilingual performance. These types of code-switching data are unbalanced in proportions and demonstrate different effects on facilitating language transfer. To better explore the power of code-switching for language alignment during pre-training, we investigate the strategy of synthetic code-switching. We continuously scale up the synthetic code-switching data and observe remarkable improvements in both benchmarks and representation space. Extensive experiments indicate that incorporating synthetic code-switching data enables better language alignment and generalizes well to high, medium, and low-resource languages with pre-training corpora of varying qualities.
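
To make "synthetic code-switching" concrete, the sketch below builds a code-switched sentence by swapping some words for their translations from a bilingual lexicon. The abstract does not describe how the authors synthesize their data, so dictionary substitution is a swapped-in stand-in technique, and `synthesize_code_switching`, the toy lexicon, and the `ratio` parameter are all hypothetical.

```python
import random


def synthesize_code_switching(tokens, lexicon, ratio, rng):
    """Replace a fraction of translatable tokens with their lexicon entries.

    Stand-in technique (dictionary substitution), not the paper's pipeline.
    """
    candidates = [i for i, token in enumerate(tokens) if token in lexicon]
    out = list(tokens)
    for i in rng.sample(candidates, int(len(candidates) * ratio)):
        out[i] = lexicon[tokens[i]]
    return out


tokens = ["the", "cat", "drinks", "milk"]
lexicon = {"cat": "猫", "milk": "牛奶"}  # toy English-to-Chinese lexicon
switched = synthesize_code_switching(tokens, lexicon, ratio=1.0, rng=random.Random(0))
```

With `ratio=1.0` every word found in the lexicon is swapped, giving a sentence that alternates between two languages within one context, which is how the abstract defines code-switching.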

BiSeg-SAM: Weakly-Supervised Post-Processing Framework for Boosting Binary Segmentation in Segment Anything Models

Encheng Su,Hu Cao,Alois Knoll

Task: Propose BiSeg-SAM, a SAM-guided weakly supervised prompting and boundary refinement network for accurate segmentation of polyps and skin lesions.

Motivation: Pixel-level annotation of medical images is time-consuming and costly, and applying foundational vision models such as SAM directly to medical segmentation gives unsatisfactory results because they lack domain-specific knowledge.

Details

Method: Fine-tune SAM together with a CNN module, introduce WeakBox to generate box prompts automatically and apply the MM2B transformation, apply SC loss for prediction scale alignment, and refine boundaries with the DetailRefine module. Result: The method outperforms existing methods on five polyp datasets and one skin cancer dataset. Conclusion: Through this combined approach, BiSeg-SAM achieves strong multi-task segmentation performance and clearly surpasses prior work. Abstract: Accurate segmentation of polyps and skin lesions is essential for diagnosing colorectal and skin cancers. While various segmentation methods for polyps and skin lesions using fully supervised deep learning techniques have been developed, the pixel-level annotation of medical images by doctors is both time-consuming and costly. Foundational vision models like the Segment Anything Model (SAM) have demonstrated superior performance; however, directly applying SAM to medical segmentation may not yield satisfactory results due to the lack of domain-specific medical knowledge. In this paper, we propose BiSeg-SAM, a SAM-guided weakly supervised prompting and boundary refinement network for the segmentation of polyps and skin lesions. Specifically, we fine-tune SAM combined with a CNN module to learn local features. We introduce a WeakBox with two functions: automatically generating box prompts for the SAM model and using our proposed Multi-choice Mask-to-Box (MM2B) transformation for rough mask-to-box conversion, addressing the mismatch between coarse labels and precise predictions. Additionally, we apply scale consistency (SC) loss for prediction scale alignment. Our DetailRefine module enhances boundary precision and segmentation accuracy by refining coarse predictions using a limited amount of ground truth labels. This comprehensive approach enables BiSeg-SAM to achieve excellent multi-task segmentation performance. Our method demonstrates significant superiority over state-of-the-art (SOTA) methods when tested on five polyp datasets and one skin cancer dataset.

YourBench: Easy Custom Evaluation Sets for Everyone

Sumuk Shashidhar,Clémentine Fourrier,Alina Lozovskia,Thomas Wolf,Gokhan Tur,Dilek Hakkani-Tür

Task: Propose YourBench, an open-source framework for dynamic, automated generation of reliable, up-to-date, and domain-tailored benchmarks, addressing the limitations of traditional static benchmarks and human evaluation.

Motivation: Traditional static benchmarks suffer from saturation and contamination, while human evaluation is costly and slow; this hinders timely or domain-specific assessment of large language models (LLMs).

Details

Method: Generate dynamic, automated benchmarks directly from user-provided documents without manual annotation, and introduce the Tempora-0325 dataset to ensure that generated data is grounded in the input rather than in the models' posterior parametric knowledge. Result: YourBench is validated on 7 diverse MMLU subsets, perfectly preserving the model performance rankings of the original benchmark (Spearman Rho = 1) for under 15 USD in total inference costs. Conclusion: Releasing the YourBench framework and its resources (such as the Tempora-0325 dataset and 150k+ question-answer pairs) facilitates reproducible research and lets the community generate bespoke benchmarks on demand, fostering more relevant and trustworthy LLM evaluation. Abstract: Evaluating large language models (LLMs) effectively remains a critical bottleneck, as traditional static benchmarks suffer from saturation and contamination, while human evaluations are costly and slow. This hinders timely or domain-specific assessment, crucial for real-world applications. We introduce YourBench, a novel, open-source framework that addresses these limitations by enabling dynamic, automated generation of reliable, up-to-date, and domain-tailored benchmarks cheaply and without manual annotation, directly from user-provided documents. We demonstrate its efficacy by replicating 7 diverse MMLU subsets using minimal source text, achieving this for under 15 USD in total inference costs while perfectly preserving the relative model performance rankings (Spearman Rho = 1) observed on the original benchmark. To ensure that YourBench generates data grounded in provided input instead of relying on posterior parametric knowledge in models, we also introduce Tempora-0325, a novel dataset of over 7K diverse documents, published exclusively after March 2025. Our comprehensive analysis spans 26 SoTA models from 7 major families across varying scales (3-671B parameters) to validate the quality of generated evaluations through rigorous algorithmic checks (e.g., citation grounding) and human assessments. We release the YourBench library, the Tempora-0325 dataset, 150k+ question answer pairs based on Tempora and all evaluation and inference traces to facilitate reproducible research and empower the community to generate bespoke benchmarks on demand, fostering more relevant and trustworthy LLM evaluation.
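
"Spearman Rho = 1" means the generated benchmark orders the models exactly as the original benchmark does. A minimal sketch of the statistic, using the textbook formula for untied ranks, is shown below; the model scores are invented placeholders for illustration, not values from the paper.

```python
def ranks(scores):
    """Rank scores from best (1) to worst (n); assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    rank = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        rank[index] = position
    return rank


def spearman_rho(a, b):
    """Spearman rank correlation for untied data: 1 - 6*sum(d^2) / (n(n^2-1))."""
    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


original = [0.9, 0.7, 0.5]   # placeholder accuracies on the original benchmark
generated = [0.6, 0.4, 0.2]  # placeholder accuracies on a generated benchmark
rho = spearman_rho(original, generated)
```

The absolute scores differ between the two lists, but the ordering is identical, so `rho` is 1.0; reversing one list would give -1.0.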

Deep LG-Track: An Enhanced Localization-Confidence-Guided Multi-Object Tracker

Ting Meng,Chunyun Fu,Xiangyan Yan,Zheng Liang,Pan Ji,Jianwen Wang,Tao Huang

Task: Propose Deep LG-Track, a new multi-object tracker that improves tracking accuracy and robustness through three key enhancements.

Motivation: Multi-object tracking is crucial in applications such as autonomous driving and security surveillance, but existing methods leave room for improvement in accuracy and robustness.

Details

Method: 1. Develop an adaptive Kalman filter that dynamically updates the covariance of measurement noise; 2. Formulate a new cost matrix that adaptively fuses motion and appearance information; 3. Introduce a dynamic appearance feature updating strategy. Result: Comprehensive evaluations on the MOT17 and MOT20 datasets show that Deep LG-Track outperforms state-of-the-art trackers across multiple performance metrics. Conclusion: Deep LG-Track is effective for multi-object tracking and outperforms prior trackers. Abstract: Multi-object tracking plays a crucial role in various applications, such as autonomous driving and security surveillance. This study introduces Deep LG-Track, a novel multi-object tracker that incorporates three key enhancements to improve the tracking accuracy and robustness. First, an adaptive Kalman filter is developed to dynamically update the covariance of measurement noise based on detection confidence and trajectory disappearance. Second, a novel cost matrix is formulated to adaptively fuse motion and appearance information, leveraging localization confidence and detection confidence as weighting factors. Third, a dynamic appearance feature updating strategy is introduced, adjusting the relative weighting of historical and current appearance features based on appearance clarity and localization accuracy. Comprehensive evaluations on the MOT17 and MOT20 datasets demonstrate that the proposed Deep LG-Track consistently outperforms state-of-the-art trackers across multiple performance metrics, highlighting its effectiveness in multi-object tracking tasks.
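
The idea of a confidence-weighted cost matrix can be sketched as a convex combination of motion and appearance costs. The abstract does not give the actual formula, so the linear weighting below, the function name `fused_cost_matrix`, and the use of a single per-detection confidence are assumptions chosen only to illustrate the concept.

```python
def fused_cost_matrix(motion, appearance, loc_conf):
    """Blend motion and appearance costs per (track, detection) pair.

    motion, appearance: [tracks][detections] cost matrices.
    loc_conf: one localization confidence in [0, 1] per detection.
    Assumed rule: trust motion more when localization confidence is high.
    """
    fused = []
    for motion_row, appearance_row in zip(motion, appearance):
        row = []
        for m, a, conf in zip(motion_row, appearance_row, loc_conf):
            weight = min(max(conf, 0.0), 1.0)  # clamp to [0, 1]
            row.append(weight * m + (1.0 - weight) * a)
        fused.append(row)
    return fused


motion = [[0.2, 0.9, 0.4]]      # one track, three detections
appearance = [[0.8, 0.1, 0.6]]
cost = fused_cost_matrix(motion, appearance, loc_conf=[1.0, 0.0, 0.5])
```

The fused matrix would then be passed to an assignment step (for example the Hungarian algorithm) to match tracks to detections.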

LRAGE: Legal Retrieval Augmented Generation Evaluation Tool

Minhu Park,Hongseok Oh,Eunkyung Choi,Wonseok Hwang

Task: Propose LRAGE, an open-source tool for holistic evaluation of retrieval-augmented generation (RAG) systems in the legal domain.

Motivation: In the legal domain, prior judicial decisions strongly influence decision-making, and the overall performance of a RAG system depends on several components, so a tool is needed to assess how changes in these components affect system accuracy.

Details

Method: LRAGE provides GUI and CLI interfaces and supports experiments over five components (retrieval corpora, retrieval algorithms, rerankers, LLM backbones, and evaluation metrics); it is validated on multilingual legal benchmarks (Korean, English, Chinese). Result: Experiments show how changes in the five components affect overall accuracy, demonstrating the practicality of LRAGE. Conclusion: LRAGE is an effective tool for evaluating legal-domain RAG systems, and its open-source code is available for further research and use. Abstract: Recently, building retrieval-augmented generation (RAG) systems to enhance the capability of large language models (LLMs) has become a common practice. Especially in the legal domain, previous judicial decisions play a significant role under the doctrine of stare decisis which emphasizes the importance of making decisions based on (retrieved) prior documents. However, the overall performance of a RAG system depends on many components: (1) retrieval corpora, (2) retrieval algorithms, (3) rerankers, (4) LLM backbones, and (5) evaluation metrics. Here we propose LRAGE, an open-source tool for holistic evaluation of RAG systems focusing on the legal domain. LRAGE provides GUI and CLI interfaces to facilitate seamless experiments and investigate how changes in the aforementioned five components affect the overall accuracy. We validated LRAGE using multilingual legal benches including Korean (KBL), English (LegalBench), and Chinese (LawBench) by demonstrating how the overall accuracy changes when varying the five components mentioned above. The source code is available at https://github.com/hoorangyee/LRAGE.

Mesh Mamba: A Unified State Space Model for Saliency Prediction in Non-Textured and Textured Meshes

Kaiwei Zhang,Dandan Zhu,Xiongkuo Min,Guangtao Zhai

Task: Study how geometric structure and texture interact in shaping visual attention, and develop a unified mesh saliency prediction model.

Motivation: Improve the adaptability of 3D vision by building a comprehensive mesh saliency dataset and an adaptable prediction model.

Details

Method: Introduce Mesh Mamba, a model based on a state space model (SSM) that combines subgraph embedding with a bidirectional SSM to analyze geometric structure and incorporate texture features. Result: The model performs well across various mesh types and shows high scalability and versatility. Conclusion: Mesh Mamba improves mesh saliency prediction while demonstrating the potential of global context modeling. Abstract: Mesh saliency enhances the adaptability of 3D vision by identifying and emphasizing regions that naturally attract visual attention. To investigate the interaction between geometric structure and texture in shaping visual attention, we establish a comprehensive mesh saliency dataset, which is the first to systematically capture the differences in saliency distribution under both textured and non-textured visual conditions. Furthermore, we introduce Mesh Mamba, a unified saliency prediction model based on a state space model (SSM), designed to adapt across various mesh types. Mesh Mamba effectively analyzes the geometric structure of the mesh while seamlessly incorporating texture features into the topological framework, ensuring coherence throughout appearance-enhanced modeling. More importantly, by subgraph embedding and a bidirectional SSM, the model enables global context modeling for both local geometry and texture, preserving the topological structure and improving the understanding of visual details and structural complexity. Through extensive theoretical and empirical validation, our model not only improves performance across various mesh types but also demonstrates high scalability and versatility, particularly through cross validations of various visual features.

Cross-Lingual Consistency: A Novel Inference Framework for Advancing Reasoning in Large Language Models

Zhiwei Yu,Tuo Li,Changhong Wang,Hui Chen,Lang Zhou

Task: Propose the Cross-Lingual Consistency (CLC) framework to improve the performance of large language models on complex multilingual reasoning tasks.

Motivation: Linguistic biases in multilingual training corpora cause semantic drift and logical inconsistencies, especially when LLMs with fewer parameters handle complex reasoning tasks.

Details

Method: The CLC framework integrates multilingual reasoning paths through majority voting. Result: On the CMATH dataset, CLC clearly outperforms conventional self-consistency (for example, a 9.5% gain for DeepSeek-Math-7B-Instruct). Conclusion: CLC neutralizes linguistic biases through multilingual ensemble voting and searches the multilingual solution space for better reasoning paths, yielding more globally optimal reasoning performance. Abstract: Chain-of-thought (CoT) has emerged as a critical mechanism for enhancing reasoning capabilities in large language models (LLMs), with self-consistency demonstrating notable promise in boosting performance. However, inherent linguistic biases in multilingual training corpora frequently cause semantic drift and logical inconsistencies, especially in sub-10B parameter LLMs handling complex inference tasks. To overcome these constraints, we propose the Cross-Lingual Consistency (CLC) framework, an innovative inference paradigm that integrates multilingual reasoning paths through majority voting to elevate LLMs' reasoning capabilities. Empirical evaluations on the CMATH dataset reveal CLC's superiority over the conventional self-consistency method, delivering 9.5%, 6.5%, and 6.0% absolute accuracy gains for DeepSeek-Math-7B-Instruct, Qwen2.5-Math-7B-Instruct, and Gemma2-9B-Instruct respectively. Expanding CLC's linguistic scope to 11 diverse languages implies two synergistic benefits: 1) neutralizing linguistic biases in multilingual training corpora through multilingual ensemble voting, 2) escaping monolingual reasoning traps by exploring the broader multilingual solution space. These dual benefits empirically enable more globally optimal reasoning paths compared to monolingual self-consistency baselines, as evidenced by the 4.1%-18.5% accuracy gains using Gemma2-9B-Instruct on the MGSM dataset.
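
The voting step of CLC can be sketched as pooling the final answers from reasoning paths sampled in several languages and taking the most frequent one. This is a minimal illustration under stated assumptions: the function name `cross_lingual_vote`, the input layout, and the sample answers are hypothetical, and the abstract does not say how answers are extracted, normalized, or how ties are broken.

```python
from collections import Counter


def cross_lingual_vote(answers_by_language):
    """Majority vote over final answers pooled from all languages.

    answers_by_language maps a language code to the list of final answers
    extracted from that language's sampled reasoning paths.
    """
    votes = Counter(answer
                    for answers in answers_by_language.values()
                    for answer in answers)
    return votes.most_common(1)[0][0]


# Hypothetical extracted answers for one math problem.
samples = {
    "en": ["12", "12"],
    "zh": ["12", "15"],
    "fr": ["15", "12"],
}
final_answer = cross_lingual_vote(samples)
```

Monolingual self-consistency is the special case where the dictionary holds a single language; pooling across languages lets a correct answer win even if one language's paths are systematically biased.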

Detecting Lip-Syncing Deepfakes: Vision Temporal Transformer for Analyzing Mouth Inconsistencies

Soumyya Kanti Datta,Shan Jia,Siwei Lyu

Task: Propose LIPINC-V2, a new framework for detecting lip-syncing deepfake videos.

Motivation: The artifacts in lip-syncing deepfakes are confined to the mouth region, making them harder to notice than other types of deepfakes, so more effective detection methods are needed.

Details

Method: Combine a vision temporal transformer with multihead cross-attention to detect lip-syncing deepfakes by identifying spatiotemporal inconsistencies in the mouth region. Result: LIPINC-V2 achieves state-of-the-art performance on the newly created LipSyncTIMIT dataset and two other benchmark datasets. Conclusion: LIPINC-V2 captures both short-term and long-term variations in mouth movement, markedly improving detection of lip-syncing deepfakes. Abstract: Deepfakes are AI-generated media in which the original content is digitally altered to create convincing but manipulated images, videos, or audio. Among the various types of deepfakes, lip-syncing deepfakes are among the most challenging to detect. In these videos, a person's lip movements are synthesized to match altered or entirely new audio using AI models. Therefore, unlike other types of deepfakes, the artifacts in lip-syncing deepfakes are confined to the mouth region, making them more subtle and thus harder to discern. In this paper, we propose LIPINC-V2, a novel detection framework that leverages a combination of vision temporal transformer with multihead cross-attention to detect lip-syncing deepfakes by identifying spatiotemporal inconsistencies in the mouth region. These inconsistencies appear across adjacent frames and persist throughout the video. Our model can successfully capture both short-term and long-term variations in mouth movement, enhancing its ability to detect these inconsistencies. Additionally, we created a new lip-syncing deepfake dataset, LipSyncTIMIT, which was generated using five state-of-the-art lip-syncing models to simulate real-world scenarios. Extensive experiments on our proposed LipSyncTIMIT dataset and two other benchmark deepfake datasets demonstrate that our model achieves state-of-the-art performance. The code and the dataset are available at https://github.com/skrantidatta/LIPINC-V2 .

TransientTables: Evaluating LLMs' Reasoning on Temporally Evolving Semi-structured Tables

Abhilash Shankarampeta,Harsh Mahajan,Tushar Kataria,Dan Roth,Vivek Gupta

Task: Evaluate the temporal reasoning capabilities of large language models (LLMs).

Motivation: Humans understand breakthroughs by reasoning over the temporal sequence of events, whereas LLMs are typically trained on static datasets, which limits their temporal reasoning.

Details

Method: Present the TRANSIENTTABLES dataset of 3,971 questions derived from over 14,000 tables covering 1,238 entities; use a template-based question-generation pipeline and introduce modeling strategies based on task decomposition. Result: Baseline results are established as a benchmark, and the task decomposition strategies improve LLM performance. Conclusion: The TRANSIENTTABLES dataset and task decomposition strategies offer a new way to evaluate and improve the temporal reasoning of LLMs. Abstract: Humans continuously make new discoveries, and understanding the temporal sequence of events leading to these breakthroughs is essential for advancing science and society. This ability to reason over time allows us to identify future steps and understand the effects of financial and political decisions on our lives. However, large language models (LLMs) are typically trained on static datasets, limiting their ability to perform effective temporal reasoning. To assess the temporal reasoning capabilities of LLMs, we present the TRANSIENTTABLES dataset, which comprises 3,971 questions derived from over 14,000 tables, spanning 1,238 entities across multiple time periods. We introduce a template-based question-generation pipeline that harnesses LLMs to refine both templates and questions. Additionally, we establish baseline results using state-of-the-art LLMs to create a benchmark. We also introduce novel modeling strategies centered around task decomposition, enhancing LLM performance.

ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction

Yuejiao Su,Yi Wang,Qiongyang Hu,Chuang Yang,Lap-Pui Chau

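Task: Propose and address the Egocentric Interaction Reasoning and pixel Grounding (Ego-IRG) task, which calls for coherent textual and pixel-level responses.

Motivation: Existing methods cannot produce coherent textual and pixel-level responses simultaneously in answer to user queries, and therefore lack flexibility.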

Details

Method: Designs a unified ANNEXE model that uses multimodal large language models to generate text- and pixel-level outputs. Result: Experiments on the Ego-IRGBench dataset show that ANNEXE outperforms other methods. Conclusion: The Ego-IRG task and the ANNEXE model provide an effective approach to comprehensive understanding of egocentric interactions. Abstract: Egocentric interaction perception is one of the essential branches in investigating human-environment interaction, which lays the basis for developing next-generation intelligent systems. However, existing egocentric interaction understanding methods cannot yield coherent textual and pixel-level responses simultaneously according to user queries, and so lack flexibility for varying downstream application requirements. To comprehend egocentric interactions exhaustively, this paper presents a novel task named Egocentric Interaction Reasoning and pixel Grounding (Ego-IRG). Taking an egocentric image with the query as input, Ego-IRG is the first task that aims to resolve the interactions through three crucial steps: analyzing, answering, and pixel grounding, which results in fluent textual and fine-grained pixel-level responses. Another challenge is that existing datasets cannot meet the conditions for the Ego-IRG task. To address this limitation, this paper creates the Ego-IRGBench dataset based on extensive manual efforts, which includes over 20k egocentric images with 1.6 million queries and corresponding multimodal responses about interactions. Moreover, we design a unified ANNEXE model to generate text- and pixel-level outputs utilizing multimodal large language models, which enables a comprehensive interpretation of egocentric interactions. The experiments on the Ego-IRGBench exhibit the effectiveness of our ANNEXE model compared with other works.

Célia Nouri,Jean-Philippe Cointet,Chloé Clavel

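Task: Propose a new approach that uses graph neural networks (GNNs) to model social media conversations for abusive language detection.

Motivation: Traditional Abusive Language Detection (ALD) models overlook conversational context, which makes their performance unreliable; existing NLP methods represent context in limited ways and report inconsistent results.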

Details

Method: Models social media conversations as graphs in which nodes represent comments and edges capture reply structure, and systematically studies graph representations and context windows to find the best ALD configuration. Result: The GNN model significantly outperforms context-agnostic baselines and linear context-aware methods in F1 score. Conclusion: Structured conversational context is critical for ALD, and GNNs provide a robust framework for context-aware abusive language detection. Abstract: Detecting abusive language in social media conversations poses significant challenges, as identifying abusiveness often depends on the conversational context, characterized by the content and topology of preceding comments. Traditional Abusive Language Detection (ALD) models often overlook this context, which can lead to unreliable performance metrics. Recent Natural Language Processing (NLP) methods that integrate conversational context often depend on limited and simplified representations, and report inconsistent results. In this paper, we propose a novel approach that utilizes graph neural networks (GNNs) to model social media conversations as graphs, where nodes represent comments, and edges capture reply structures. We systematically investigate various graph representations and context windows to identify the optimal configuration for ALD. Our GNN model outperforms both context-agnostic baselines and linear context-aware methods, achieving significant improvements in F1 scores. These findings demonstrate the critical role of structured conversational context and establish GNNs as a robust framework for advancing context-aware abusive language detection.
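
The graph view of a conversation described in this abstract can be illustrated with a minimal sketch. It only covers the data structure (comments as nodes, reply links as edges) and one possible reading of a "context window" as the chain of ancestors of a target comment. The field names `id` and `parent`, the helper names, and the hop-based window are assumptions made for illustration; the abstract does not specify the authors' graph construction or their GNN.

```python
def build_reply_graph(comments):
    """Turn a flat comment list into reply edges.

    comments: list of dicts with hypothetical keys 'id' and 'parent'
    ('parent' is None for the root post).
    Returns (edges, parent_of), where each edge is (child_id, parent_id).
    """
    parent_of = {c["id"]: c["parent"] for c in comments}
    edges = [(c["id"], c["parent"]) for c in comments if c["parent"] is not None]
    return edges, parent_of


def context_window(target_id, parent_of, max_hops):
    """Collect the ancestors of a target comment, nearest first,
    going back at most max_hops replies (an assumed window definition)."""
    context = []
    node = parent_of.get(target_id)
    while node is not None and len(context) < max_hops:
        context.append(node)
        node = parent_of.get(node)
    return context


# A toy thread: c0 is the root, c1 replies to c0, c2 and c3 reply to c1, c4 to c3.
thread = [
    {"id": "c0", "parent": None},
    {"id": "c1", "parent": "c0"},
    {"id": "c2", "parent": "c1"},
    {"id": "c3", "parent": "c1"},
    {"id": "c4", "parent": "c3"},
]
edges, parent_of = build_reply_graph(thread)
print(edges)
print(context_window("c4", parent_of, max_hops=2))
```

A linear context-aware baseline would instead take the previous comments in posting order, which for `c4` would include the sibling branch `c2` even though `c4` never replies to it; the reply graph keeps that distinction.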

Junlong Ren,Hao Wang

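Task: Achieve bi-directional retrieval between 3D and text modalities.

Motivation: Existing methods rely mainly on a single 3D representation (e.g., point clouds) and fail to exploit 2D-3D consistency and complementarity, which limits performance.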

Details

Method: Represents 3D shapes jointly with multi-view images and point clouds, strengthens encoder generalization through tri-modal alignment (image, point cloud, text) and a reconstruction task, and produces multimodal embeddings through fine-grained 2D-3D fusion. Result: On the Text2Shape dataset, the method significantly outperforms the previous best methods in both shape-to-text and text-to-shape retrieval. Conclusion: Joint multimodal representation combined with hard negative contrastive training significantly improves cross-modal 3D retrieval. Abstract: Cross-modal 3D retrieval is a critical yet challenging task, aiming to achieve bi-directional retrieval between 3D and text modalities. Current methods predominantly rely on a single 3D representation (e.g., point cloud), with few exploiting the 2D-3D consistency and complementary relationships, which constrains their performance. To bridge this gap, we propose to adopt multi-view images and point clouds to jointly represent 3D shapes, facilitating tri-modal alignment (i.e., image, point, text) for enhanced cross-modal 3D retrieval. Notably, we introduce tri-modal reconstruction to improve the generalization ability of encoders. Given point features, we reconstruct image features under the guidance of text features, and vice versa. With well-aligned point cloud and multi-view image features, we aggregate them as multimodal embeddings through fine-grained 2D-3D fusion to enhance geometric and semantic understanding. Recognizing the significant noise in current datasets where many 3D shapes and texts share similar semantics, we employ hard negative contrastive training to emphasize harder negatives with greater significance, leading to robust discriminative embeddings. Extensive experiments on the Text2Shape dataset demonstrate that our method outperforms previous state-of-the-art methods by a substantial margin in both shape-to-text and text-to-shape retrieval tasks.

STAR-1: Safer Alignment of Reasoning LLMs with 1K Data

Zijun Wang,Haoqin Tu,Yuhan Wang,Juncheng Wu,Jieru Mei,Brian R. Bartoldson,Bhavya Kailkhura,Cihang Xie

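Task: Build STAR-1, a high-quality, small-scale safety dataset designed for large reasoning models (LRMs).

Motivation: Address the critical need for safety alignment in large reasoning models.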

Details

Method: Integrates existing open-source safety datasets, generates policy-grounded deliberative reasoning samples, and filters them with a GPT-4o-based scoring system to keep examples aligned with best practices. Result: Fine-tuning LRMs with STAR-1 improves safety performance by 40% on average, with only a marginal 1.1% drop in reasoning ability. Conclusion: The design principles behind STAR-1 are effective, and the dataset works for both LRMs and traditional LLMs. Abstract: This paper introduces STAR-1, a high-quality, just-1k-scale safety dataset specifically designed for large reasoning models (LRMs) like DeepSeek-R1. Built on three core principles -- diversity, deliberative reasoning, and rigorous filtering -- STAR-1 aims to address the critical needs for safety alignment in LRMs. Specifically, we begin by integrating existing open-source safety datasets from diverse sources. Then, we curate safety policies to generate policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based safety scoring system to select training examples aligned with best practices. Experimental results show that fine-tuning LRMs with STAR-1 leads to an average 40% improvement in safety performance across four benchmarks, while only incurring a marginal decrease (e.g., an average of 1.1%) in reasoning ability measured across five reasoning tasks. Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs. Our project page is https://ucsc-vlaa.github.io/STAR-1.

Luminance-GS: Adapting 3D Gaussian Splatting to Challenging Lighting Conditions with View-Adaptive Curve Adjustment

Ziteng Cui,Xuangeng Chu,Tatsuya Harada

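Task: Propose Luminance-GS, a new method for high-quality novel view synthesis under diverse, challenging lighting conditions.

Motivation: Diverse real-world lighting and camera exposure settings significantly affect image quality; in multi-view scenarios, differences in lighting and image signal processor settings introduce photometric inconsistencies that pose a major challenge for NeRF- and 3DGS-based novel view synthesis frameworks.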

Details

Method: Adopts per-view color matrix mapping and view-adaptive curve adjustment without altering the original explicit 3DGS representation. Result: Luminance-GS achieves state-of-the-art results across lighting conditions (low-light, overexposure, and varying exposure) with real-time rendering speed and improved reconstruction quality. Conclusion: Luminance-GS is an effective method for high-quality novel view synthesis under diverse lighting conditions while retaining real-time rendering. Abstract: Capturing high-quality photographs under diverse real-world lighting conditions is challenging, as both natural lighting (e.g., low-light) and camera exposure settings (e.g., exposure time) significantly impact image quality. This challenge becomes more pronounced in multi-view scenarios, where variations in lighting and image signal processor (ISP) settings across viewpoints introduce photometric inconsistencies. Such lighting degradations and view-dependent variations pose substantial challenges to novel view synthesis (NVS) frameworks based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). To address this, we introduce Luminance-GS, a novel approach to achieving high-quality novel view synthesis results under diverse challenging lighting conditions using 3DGS. By adopting per-view color matrix mapping and view-adaptive curve adjustments, Luminance-GS achieves state-of-the-art (SOTA) results across various lighting conditions -- including low-light, overexposure, and varying exposure -- while not altering the original 3DGS explicit representation. Compared to previous NeRF- and 3DGS-based baselines, Luminance-GS provides real-time rendering speed with improved reconstruction quality.

Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation

Baban Gain,Dibyanayan Bandyopadhyay,Asif Ekbal

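Task: Give a comprehensive overview of recent progress in using large language models (LLMs) for machine translation (MT).

Motivation: Low-resource languages and domains lack sufficient parallel corpora, linguistic tools, and computational infrastructure; LLMs open new possibilities for machine translation in these settings.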

Details

Method: Analyzes techniques such as few-shot prompting, cross-lingual transfer, and parameter-efficient fine-tuning, along with LLM-based synthetic data generation strategies such as back-translation and lexical augmentation. Result: Compares LLM-based translation with traditional encoder-decoder models across language pairs and summarizes the strengths and limitations of each. Conclusion: Discusses challenges such as hallucinations, evaluation inconsistencies, and inherited biases, and outlines directions for building robust, inclusive, and scalable MT systems. Abstract: The advent of Large Language Models (LLMs) has significantly reshaped the landscape of machine translation (MT), particularly for low-resource languages and domains that lack sufficient parallel corpora, linguistic tools, and computational infrastructure. This survey presents a comprehensive overview of recent progress in leveraging LLMs for MT. We analyze techniques such as few-shot prompting, cross-lingual transfer, and parameter-efficient fine-tuning that enable effective adaptation to under-resourced settings. The paper also explores synthetic data generation strategies using LLMs, including back-translation and lexical augmentation. Additionally, we compare LLM-based translation with traditional encoder-decoder models across diverse language pairs, highlighting the strengths and limitations of each. We discuss persistent challenges such as hallucinations, evaluation inconsistencies, and inherited biases while also evaluating emerging LLM-driven metrics for translation quality. This survey offers practical insights and outlines future directions for building robust, inclusive, and scalable MT systems in the era of large-scale generative models.

High-fidelity 3D Object Generation from Single Image with RGBN-Volume Gaussian Reconstruction Model

Yiyang Shen,Kun Zhou,He Wang,Yin Yang,Tianjia Shao

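Task: Generate high-fidelity 3D objects from single-view images.

Motivation: Existing Gaussian-splatting-based single-view 3D generation methods suffer from geometric ambiguity and a lack of structure in the 3D Gaussians, which leads to distorted and blurry 3D objects.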

Details

Method: Proposes GS-RGBN, a new RGBN-volume Gaussian reconstruction model that uses a hybrid voxel-Gaussian representation and fuses RGB features with surface normal features to remove geometric ambiguity and structure the Gaussians. Result: Experiments show the method outperforms prior work in reconstruction quality, robustness, and efficiency. Conclusion: Through a structured 3D representation, GS-RGBN addresses geometric ambiguity and missing structure in single-view 3D generation and produces high-quality 3D objects. Abstract: Single-view 3D generation via Gaussian splatting has recently emerged and developed quickly. These methods learn 3D Gaussians from 2D RGB images generated by pre-trained multi-view diffusion (MVD) models, and have shown a promising avenue for 3D generation from a single image. Despite the current progress, these methods still suffer from the inconsistency jointly caused by the geometric ambiguity in the 2D images and the lack of structure of 3D Gaussians, leading to distorted and blurry 3D object generation. In this paper, we propose to fix these issues with GS-RGBN, a new RGBN-volume Gaussian Reconstruction Model designed to generate high-fidelity 3D objects from single-view images. Our key insight is that a structured 3D representation can simultaneously mitigate the aforementioned two issues. To this end, we propose a novel hybrid Voxel-Gaussian representation, where a 3D voxel representation contains explicit 3D geometric information, eliminating the geometric ambiguity from 2D images. It also structures Gaussians during learning so that the optimization tends to find better local optima. Our 3D voxel representation is obtained by a fusion module that aligns RGB features and surface normal features, both of which can be estimated from 2D images. Extensive experiments demonstrate the superiority of our method over prior works in terms of high-quality reconstruction results, robust generalization, and good efficiency.

Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure

Boshi Wang,Huan Sun

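Task: Study the Reversal Curse in large language models (LLMs) and how to overcome it.

Motivation: Understanding the cause of the Reversal Curse can help identify weaknesses in current models and improve their generalization and robustness.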

Details

Method: Experimentally examines two primary causes of the Reversal Curse (inconsistency and entanglement of concept representations) and proposes a model design based on JEPA. Result: The proposed model is the first to break the Reversal Curse without relying on data augmentation or non-causal masking, and special memory layers further improve generalization. Conclusion: The ability to reverse unlocks a new kind of memory integration that lets the model outperform frontier LLMs on arithmetic reasoning tasks. Abstract: Despite their impressive capabilities, LLMs exhibit a basic generalization failure known as the Reversal Curse, where they struggle to learn reversible factual associations. Understanding why this occurs could help identify weaknesses in current models and advance their generalization and robustness. In this paper, we conjecture that the Reversal Curse in LLMs is a manifestation of the long-standing binding problem in cognitive science, neuroscience and AI. Specifically, we identify two primary causes of the Reversal Curse stemming from transformers' limitations in conceptual binding: the inconsistency and entanglements of concept representations. We perform a series of experiments that support these conjectures. Our exploration leads to a model design based on JEPA (Joint-Embedding Predictive Architecture) that for the first time breaks the Reversal Curse without side-stepping it with specialized data augmentation or non-causal masking. Moreover, generalization can be further improved by incorporating special memory layers that support disentangled concept representations. We demonstrate that the skill of reversal unlocks a new kind of memory integration that enables models to solve large-scale arithmetic reasoning problems via parametric forward-chaining, outperforming frontier LLMs based on non-parametric memory and prolonged explicit reasoning.

Training-free Dense-Aligned Diffusion Guidance for Modular Conditional Image Synthesis

Zixuan Wang,Duo Peng,Feng Chen,Yuwei Yang,Yinjie Lei

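Task: Propose a new approach that treats conditional image synthesis as a modular combination of fundamental condition units.

Motivation: Current generative methods are usually tied to specific tasks and have limited applicability, so a more flexible way to handle diverse conditions is needed.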

Details

Method: Divides conditions into three basic units (text, layout, and drag) and designs a dedicated alignment module for each (DCA, DGA, DMA). Result: Experiments show strong performance across conditions such as textual descriptions, segmentation masks, drag manipulation, and their combinations. Conclusion: The modular design significantly improves the model's adaptability to diverse conditional generation tasks and broadens its range of application. Abstract: Conditional image synthesis is a crucial task with broad applications, such as artistic creation and virtual reality. However, current generative methods are often task-oriented with a narrow scope, handling a restricted condition with constrained applicability. In this paper, we propose a novel approach that treats conditional image synthesis as the modular combination of diverse fundamental condition units. Specifically, we divide conditions into three primary units: text, layout, and drag. To enable effective control over these conditions, we design a dedicated alignment module for each. For the text condition, we introduce a Dense Concept Alignment (DCA) module, which achieves dense visual-text alignment by drawing on diverse textual concepts. For the layout condition, we propose a Dense Geometry Alignment (DGA) module to enforce comprehensive geometric constraints that preserve the spatial configuration. For the drag condition, we introduce a Dense Motion Alignment (DMA) module to apply multi-level motion regularization, ensuring that each pixel follows its desired trajectory without visual artifacts. By flexibly inserting and combining these alignment modules, our framework enhances the model's adaptability to diverse conditional generation tasks and greatly expands its application range. Extensive experiments demonstrate the superior performance of our framework across a variety of conditions, including textual description, segmentation mask (bounding box), drag manipulation, and their combinations. Code is available at https://github.com/ZixuanWang0525/DADG.

A thorough benchmark of automatic text classification: From traditional approaches to large language models

Washington Cunha,Leonardo Rocha,Marcos André Gonçalves

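Task: Carry out a cost-benefit analysis of traditional and recent approaches to automatic text classification (ATC), including large language models (LLMs) and small language models (SLMs).

Motivation: Although recent approaches are more effective, a comprehensive analysis of whether their much higher cost is justified, especially relative to traditional methods such as SVMs and Logistic Regression, is still missing.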

Details

Method: Provides a scientifically sound comparative analysis of 12 traditional and recent ATC solutions (including 5 open LLMs) and a large benchmark of 22 datasets. Result: LLMs are more effective than traditional approaches (up to 26%-7.1% on average) and SLMs (up to 4.9%-1.9% on average), but cost far more compute (on average 590x and 8.5x slower than traditional methods and SLMs, respectively). Conclusion: Choose according to application needs: LLMs where the best possible effectiveness is required and the cost is affordable; traditional methods where resources are limited or LLM tuning is unaffordable; SLMs for a near-optimal trade-off between effectiveness and efficiency. Abstract: Automatic text classification (ATC) has experienced remarkable advancements in the past decade, best exemplified by recent small and large language models (SLMs and LLMs), leveraged by Transformer architectures. Despite recent effectiveness improvements, a comprehensive cost-benefit analysis investigating whether the effectiveness gains of these recent approaches compensate for their much higher costs when compared to more traditional text classification approaches such as SVMs and Logistic Regression is still missing in the literature. In this context, this work's main contributions are twofold: (i) we provide a scientifically sound comparative analysis of the cost-benefit of twelve traditional and recent ATC solutions including five open LLMs, and (ii) a large benchmark comprising 22 datasets, including sentiment analysis and topic classification, with their (train-validation-test) partitions based on folded cross-validation procedures, along with documentation and code. The release of code, data, and documentation enables the community to replicate experiments and advance the field in a more scientifically sound manner. Our comparative experimental results indicate that LLMs outperform traditional approaches (up to 26%-7.1% on average) and SLMs (up to 4.9%-1.9% on average) in terms of effectiveness. However, LLMs incur significantly higher computational costs due to fine-tuning, being, on average, 590x and 8.5x slower than traditional methods and SLMs, respectively.
Results suggest the following recommendations: (1) LLMs for applications that require the best possible effectiveness and can afford the costs; (2) traditional methods such as Logistic Regression and SVM for resource-limited applications or those that cannot afford the cost of tuning large LLMs; and (3) SLMs like Roberta for near-optimal effectiveness-efficiency trade-off.

Beyond Nearest Neighbor Interpolation in Data Augmentation

Olivier Rukundo

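Task: Improve the data transformation functions of convolutional neural networks so as to avoid both undefined categorical labels and pixel-level annotation errors.

Motivation: Using nearest neighbor interpolation to avoid undefined categorical labels overlooks the risk of exacerbating pixel-level annotation errors during data augmentation.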

Details

Method: Modifies the geometric transformation function to remove the reliance on nearest neighbor interpolation and adds a mean-based class filtering mechanism to handle undefined categorical labels. Result: On semantic segmentation tasks with three medical image datasets, alternative interpolation algorithms yield both qualitative and quantitative improvements. Conclusion: Alternative interpolation algorithms combined with class filtering can improve the quality of augmented data. Abstract: Avoiding the risk of undefined categorical labels by using nearest neighbor interpolation overlooks the risk of exacerbating pixel-level annotation errors in data augmentation. To avoid both risks, the author modified the data transformation functions of convolutional neural networks by incorporating a modified geometric transformation function. The modification improves the quality of augmented data by removing the reliance on nearest neighbor interpolation and by integrating a mean-based class filtering mechanism that handles undefined categorical labels under alternative interpolation algorithms. Experiments on semantic segmentation tasks using three medical image datasets demonstrated both qualitative and quantitative improvements with alternative interpolation algorithms.
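
The first risk named in this abstract, undefined categorical labels, can be seen in a one-dimensional toy case. The sketch below resamples a row of class ids at a fractional position with nearest neighbor and with linear interpolation. The class ids and function names are hypothetical, and the sketch does not attempt the paper's mean-based class filtering, whose details the abstract does not give.

```python
def nearest(labels, x):
    """Nearest-neighbor lookup at fractional position x (rounds half up)."""
    return labels[min(len(labels) - 1, int(x + 0.5))]


def linear(labels, x):
    """Linear interpolation between the two neighboring samples of x."""
    lo = int(x)
    hi = min(lo + 1, len(labels) - 1)
    t = x - lo
    return (1 - t) * labels[lo] + t * labels[hi]


# Hypothetical label row: 0 = background, 3 = lesion.
row = [0, 0, 3, 3]

print(nearest(row, 1.5))  # always one of the ids already in the row
print(linear(row, 1.5))   # a blend of 0 and 3, which is not a class id
```

Nearest neighbor always returns an id that exists in the mask, which is why it is the usual default for label maps; linear interpolation returns a blended value at class boundaries, so some post-processing step is needed before such values can be used as labels.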

Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection

Souradip Chakraborty,Mohammadreza Pourreza,Ruoxi Sun,Yiwen Song,Nino Scherrer,Jindong Gu,Furong Huang,Amrit Singh Bedi,Ahmad Beirami,Hamid Palangi,Tomas Pfister

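Task: Propose Iterative Agent Decoding (IAD) to improve AI agents on complex multi-modal applications, structured generation, and strategic planning.

Motivation: Existing methods such as Best-of-N (BON) sampling lack a mechanism for integrating iterative feedback, which limits performance gains.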

Details

Method: IAD combines iterative refinement with dynamic candidate evaluation and selection, with feedback guided by a verifier. Result: IAD clearly outperforms baselines on Sketch2Code, Text2SQL, and Webshop, with absolute gains of 3-10%. Conclusion: IAD's gains come mainly from verifier-guided refinement rather than sampling diversity, and verifier quality is critical to inference-time optimization. Abstract: While AI agents have shown remarkable performance at various tasks, they still struggle with complex multi-modal applications, structured generation and strategic planning. Improvements via standard fine-tuning are often impractical, as solving agentic tasks usually relies on black box API access without control over model parameters. Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance. However, BON lacks an iterative feedback integration mechanism. Hence, we propose Iterative Agent Decoding (IAD), which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier. IAD differs in how feedback is designed and integrated, specifically optimized to extract maximal signal from reward scores. We conduct a detailed comparison of baselines across key metrics on Sketch2Code, Text2SQL, and Webshop where IAD consistently outperforms baselines, achieving 3--6% absolute gains on Sketch2Code and Text2SQL (with and without LLM judges) and 8--10% gains on Webshop across multiple metrics. To better understand the source of IAD's gains, we perform controlled experiments to disentangle the effect of adaptive feedback from stochastic sampling, and find that IAD's improvements are primarily driven by verifier-guided refinement, not merely sampling diversity. We also show that both IAD and BON exhibit inference-time scaling with increased compute when guided by an optimal verifier. Our analysis highlights the critical role of verifier quality in effective inference-time optimization and examines the impact of noisy and sparse rewards on scaling behavior. Together, these findings offer key insights into the trade-offs and principles of effective inference-time optimization.
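
The contrast this abstract draws between Best-of-N sampling and verifier-guided iterative refinement can be sketched on a toy problem. Everything below is an illustrative assumption: the "generator" proposes integers, the "verifier" scores closeness to a hidden target, and the function names are made up. It is not the authors' IAD algorithm, only the general shape of the two loops.

```python
import random

TARGET = 42  # hidden optimum, known only to the toy verifier


def verifier(candidate):
    """Toy reward: higher is better, 0 is a perfect answer."""
    return -abs(candidate - TARGET)


def propose(rng, previous=None):
    """Toy generator. Without feedback it samples blindly; given a previous
    candidate it proposes a local revision of that candidate."""
    if previous is None:
        return rng.randint(0, 100)
    return previous + rng.randint(-10, 10)


def best_of_n(rng, n):
    """Best-of-N: draw n independent candidates and keep the top-scored one."""
    candidates = [propose(rng) for _ in range(n)]
    return max(candidates, key=verifier)


def iterative_decoding(rng, rounds, per_round):
    """Iterative refinement: each round revises the current best candidate
    and re-selects with the verifier."""
    best = propose(rng)
    history = [verifier(best)]
    for _ in range(rounds):
        revisions = [propose(rng, previous=best) for _ in range(per_round)]
        # Keep the incumbent in the pool so the score never regresses.
        best = max(revisions + [best], key=verifier)
        history.append(verifier(best))
    return best, history


bon = best_of_n(random.Random(0), n=8)
refined, scores = iterative_decoding(random.Random(0), rounds=4, per_round=2)
print(bon, verifier(bon))
print(refined, scores)
```

Both calls spend a comparable sampling budget here (8 draws versus 1 + 4 x 2 = 9). The structural difference is that the second loop conditions each new proposal on the verifier's current pick, which is the feedback path that Best-of-N lacks.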

Semi-Supervised Biomedical Image Segmentation via Diffusion Models and Teacher-Student Co-Training

Luca Ciampi,Gabriele Lagani,Giuseppe Amato,Fabrizio Falchi

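Task: Propose a semi-supervised teacher-student framework based on denoising diffusion probabilistic models (DDPMs) for biomedical image segmentation.

Motivation: Supervised deep learning for medical image segmentation requires large annotated datasets; reducing that requirement improves scalability in clinical settings.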

Details

Method: Uses DDPMs to generate segmentation masks, trains the teacher model without supervision under a cycle-consistency constraint, co-trains it with a twin-student network, and adds a multi-round pseudo-label generation strategy. Result: The method outperforms existing semi-supervised techniques on multiple biomedical imaging benchmarks. Conclusion: The method performs well when annotated data is limited and has high practical value. Abstract: Supervised deep learning for semantic segmentation has achieved excellent results in accurately identifying anatomical and pathological structures in medical images. However, it often requires large annotated training datasets, which limits its scalability in clinical settings. To address this challenge, semi-supervised learning is a well-established approach that leverages both labeled and unlabeled data. In this paper, we introduce a novel semi-supervised teacher-student framework for biomedical image segmentation, inspired by the recent success of generative models. Our approach leverages denoising diffusion probabilistic models (DDPMs) to generate segmentation masks by progressively refining noisy inputs conditioned on the corresponding images. The teacher model is first trained in an unsupervised manner using a cycle-consistency constraint based on noise-corrupted image reconstruction, enabling it to generate informative semantic masks. Subsequently, the teacher is integrated into a co-training process with a twin-student network. The student learns from ground-truth labels when available and from teacher-generated pseudo-labels otherwise, while the teacher continuously improves its pseudo-labeling capabilities. Finally, to further enhance performance, we introduce a multi-round pseudo-label generation strategy that iteratively improves the pseudo-labeling process. We evaluate our approach on multiple biomedical imaging benchmarks, spanning multiple imaging modalities and segmentation tasks. Experimental results show that our method consistently outperforms state-of-the-art semi-supervised techniques, highlighting its effectiveness in scenarios with limited annotated data.
The code to replicate our experiments can be found at https://github.com/ciampluca/diffusion_semi_supervised_biomedical_image_segmentation

OpenCodeReasoning: Advancing Data Distillation for Competitive Coding

Wasi Uddin Ahmad,Sean Narenthiran,Somshubra Majumdar,Aleksander Ficek,Siddhartha Jain,Jocelyn Huang,Vahid Noroozi,Boris Ginsburg

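Task: Build a high-quality supervised fine-tuning (SFT) dataset to improve the coding ability of models of various sizes.

Motivation: Existing work on distilling reasoning-based LLMs for coding relies on non-public datasets or omits data-processing details, which holds back progress.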

Details

Method: Constructs a high-quality SFT dataset and analyzes its data sources, the impact of code execution filtering, and instruction/solution diversity. Result: Using SFT alone, the distilled models reach 61.8% on LiveCodeBench and 24.6% on CodeContests, surpassing alternatives trained with reinforcement learning. Conclusion: Instruction diversity matters more than solution correctness; the datasets and distilled models will be open-sourced for the community. Abstract: Since the advent of reasoning-based large language models, many have found great success from distilling reasoning capabilities into student models. Such techniques have significantly bridged the gap between reasoning and standard LLMs on coding tasks. Despite this, much of the progress on distilling reasoning models remains locked behind proprietary datasets or lacks details on data curation, filtering and subsequent training. To address this, we construct a superior supervised fine-tuning (SFT) dataset that we use to achieve state-of-the-art coding capability results in models of various sizes. Our distilled models use only SFT to achieve 61.8% on LiveCodeBench and 24.6% on CodeContests, surpassing alternatives trained with reinforcement learning. We then perform analysis on the data sources used to construct our dataset, the impact of code execution filtering, and the importance of instruction/solution diversity. We observe that execution filtering negatively affected benchmark accuracy, leading us to prioritize instruction diversity over solution correctness. Finally, we also analyze the token efficiency and reasoning patterns utilized by these models. We will open-source these datasets and distilled models to the community.

RealityAvatar: Towards Realistic Loose Clothing Modeling in Animatable 3D Gaussian Avatars

Yahui Li,Zhi Zeng,Liming Pang,Guixuan Zhang,Shuwu Zhang

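Task: Propose RealityAvatar, an efficient framework for high-fidelity digital human modeling that targets loosely dressed avatars.

Motivation: Existing methods struggle to capture loose clothing dynamics because they rely mainly on global pose conditioning or static per-frame representations, which causes oversmoothing and temporal inconsistency in non-rigid regions.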

Details

Method: Uses 3D Gaussian Splatting to capture complex clothing deformations and motion dynamics, and adds a motion trend module and a latentbone encoder to explicitly model pose-dependent deformations and temporal variations in clothing behavior. Result: Experiments on benchmark datasets show the method captures fine-grained clothing deformations and motion-driven shape variations, significantly improving structural fidelity and perceptual quality in dynamic human reconstruction. Conclusion: RealityAvatar performs especially well in non-rigid regions while achieving better consistency across temporal frames. Abstract: Modeling animatable human avatars from monocular or multi-view videos has been widely studied, with recent approaches leveraging neural radiance fields (NeRFs) or 3D Gaussian Splatting (3DGS) achieving impressive results in novel-view and novel-pose synthesis. However, existing methods often struggle to accurately capture the dynamics of loose clothing, as they primarily rely on global pose conditioning or static per-frame representations, leading to oversmoothing and temporal inconsistencies in non-rigid regions. To address this, we propose RealityAvatar, an efficient framework for high-fidelity digital human modeling, specifically targeting loosely dressed avatars. Our method leverages 3D Gaussian Splatting to capture complex clothing deformations and motion dynamics while ensuring geometric consistency. By incorporating a motion trend module and a latentbone encoder, we explicitly model pose-dependent deformations and temporal variations in clothing behavior. Extensive experiments on benchmark datasets demonstrate the effectiveness of our approach in capturing fine-grained clothing deformations and motion-driven shape variations. Our method significantly enhances structural fidelity and perceptual quality in dynamic human reconstruction, particularly in non-rigid regions, while achieving better consistency across temporal frames.

Improving Applicability of Deep Learning based Token Classification models during Training

Anket Mehra,Malte Prieß,Marian Himstedt

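Task: Propose a new evaluation metric, Document Integrity Precision (DIP), that measures the practical applicability of models in document understanding tasks.

Motivation: Conventional classification metrics such as the F1-Score are insufficient for judging how a model will perform in practice, particularly in business settings that demand high automation quality.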

Details

Method: Trains a LayoutLM-based model for token classification in documents and introduces DIP as a new evaluation metric, validating its usefulness for deployment decisions. Result: Conventional metrics are barely sensitive to changes in model performance, whereas DIP accurately reflects how much manual intervention the model needs in deployment. Conclusion: DIP is an effective business-oriented metric that underlines the importance of focusing on the actual task during training; similar metrics for other tasks need further research. Abstract: This paper shows that further evaluation metrics during model training are needed to decide on a model's applicability in inference. As an example, a LayoutLM-based model is trained for token classification in documents. The documents are German receipts. We show that conventional classification metrics, represented by the F1-Score in our experiments, are insufficient for evaluating the applicability of machine learning models in practice. To address this problem, we introduce a novel metric, Document Integrity Precision (DIP), as a solution for visual document understanding and the token classification task. To the best of our knowledge, nothing comparable has been introduced in this context. DIP is a rigorous metric, describing how many documents of the test dataset require manual interventions. It enables AI researchers and software developers to conduct an in-depth investigation of the level of process automation in business software. In order to validate DIP, we conduct experiments with our created models to highlight and analyze the impact and relevance of DIP for evaluating whether the model should be deployed in different training settings. Our results demonstrate that existing metrics barely change for isolated model impairments, whereas DIP indicates that the model requires substantial human interventions in deployment. The larger the set of entities being predicted, the less sensitive conventional metrics are, entailing poor automation quality. DIP, in contrast, remains a single value to be interpreted for entire entity sets. This highlights the importance of having metrics that focus on the business task for model training in production.
Since DIP is created for the token classification task, more research is needed to find suitable metrics for other training tasks.
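
The gap between a token-level score and a document-level one can be made concrete with a small sketch. The abstract does not give the DIP formula, so `document_integrity` below is an assumed reading of it: the share of documents whose predicted token labels all match the gold labels, meaning no manual correction would be needed. Token accuracy stands in for the F1-Score to keep the example short; the label names and data are invented.

```python
def token_accuracy(docs):
    """Fraction of tokens labeled correctly, pooled over all documents."""
    total = sum(len(gold) for gold, _ in docs)
    correct = sum(g == p for gold, pred in docs for g, p in zip(gold, pred))
    return correct / total


def document_integrity(docs):
    """Assumed document-level score: share of documents with no label error."""
    intact = sum(gold == pred for gold, pred in docs)
    return intact / len(docs)


# Four toy receipts of ten tokens each; only the last token is a TOTAL field.
gold = ["O"] * 9 + ["TOTAL"]
docs = [
    (gold, list(gold)),                       # fully correct
    (gold, list(gold)),                       # fully correct
    (gold, ["O"] * 10),                       # misses the TOTAL field
    (gold, ["O"] * 8 + ["TOTAL", "TOTAL"]),   # one spurious TOTAL label
]

print(token_accuracy(docs))      # 38 of 40 tokens are right
print(document_integrity(docs))  # only 2 of 4 documents are untouched
```

Two wrong tokens out of forty leave the token-level score at 0.95, yet half of the documents would still need a human to fix them, which is the kind of discrepancy the abstract attributes to conventional metrics.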

Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models

Zhaochen Wang,Yujun Cai,Zi Huang,Bryan Hooi,Yiwei Wang,Ming-Hsuan Yang

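Task: Study how vision-language models (VLMs) handle ASCII art, especially when textual semantics conflict with visual patterns.

Motivation: Probe the limits of VLMs in multimodal processing, particularly under semantic-visual conflict, to expose potential flaws.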

Details

Method: Introduces a novel evaluation framework that systematically tests five state-of-the-art models (including GPT-4o, Claude, and Gemini) with adversarial ASCII art in which character-level semantics contradict the global visual pattern. Result: VLMs show a strong text-priority bias, and visual recognition declines sharply as semantic complexity increases; visual parameter tuning and prompt engineering bring only modest improvements. Conclusion: Current VLMs have fundamental flaws in how they integrate multimodal information and need architectural-level solutions, with important implications for content moderation systems vulnerable to adversarial examples. Abstract: Vision-language models (VLMs) have advanced rapidly in processing multimodal information, but their ability to reconcile conflicting signals across modalities remains underexplored. This work investigates how VLMs process ASCII art, a unique medium where textual elements collectively form visual patterns, potentially creating semantic-visual conflicts. We introduce a novel evaluation framework that systematically challenges five state-of-the-art models (including GPT-4o, Claude, and Gemini) using adversarial ASCII art, where character-level semantics deliberately contradict global visual patterns. Our experiments reveal a strong text-priority bias: VLMs consistently prioritize textual information over visual patterns, with visual recognition ability declining dramatically as semantic complexity increases. Various mitigation attempts through visual parameter tuning and prompt engineering yielded only modest improvements, suggesting that this limitation requires architectural-level solutions. These findings uncover fundamental flaws in how current VLMs integrate multimodal information, providing important guidance for future model development while highlighting significant implications for content moderation systems vulnerable to adversarial examples.

ShieldGemma 2: Robust and Tractable Image Content Moderation

Wenjun Zeng,Dana Kurniawan,Ryan Mullins,Yuchi Liu,Tamoghna Saha,Dirichi Ike-Njoku,Jindong Gu,Yiwen Song,Cai Xu,Jingjing Zhou,Aparna Joshi,Shravan Dheep,Mani Malek,Hamid Palangi,Joon Baek,Rick Pereira,Karthik Narasimhan

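Task: Introduce ShieldGemma 2, a 4B-parameter image content moderation model built on Gemma 3 that predicts safety risks for synthetic and natural images.

Motivation: Provide a robust tool for detecting potential harms in image content, such as sexually explicit material, violence and gore, and dangerous content, to advance multimodal safety and responsible AI development.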

Details

Method: Builds the model on Gemma 3, evaluates it on internal and external benchmarks, and proposes a novel adversarial data generation pipeline. Result: Outperforms LlavaGuard, GPT-4o mini, and the base Gemma 3 model. Conclusion: ShieldGemma 2 is a state-of-the-art image moderation tool that helps advance multimodal safety and responsible AI development. Abstract: We introduce ShieldGemma 2, a 4B parameter image content moderation model built on Gemma 3. This model provides robust safety risk predictions across the following key harm categories: Sexually Explicit, Violence & Gore, and Dangerous Content for synthetic images (e.g. output of any image generation model) and natural images (e.g. any image input to a Vision-Language Model). We evaluated on both internal and external benchmarks to demonstrate state-of-the-art performance compared to LlavaGuard, GPT-4o mini, and the base Gemma 3 model based on our policies. Additionally, we present a novel adversarial data generation pipeline which enables controlled, diverse, and robust image generation. ShieldGemma 2 provides an open image moderation tool to advance multimodal safety and responsible AI development.

Adriano Fragomeni,Dima Damen,Michael Wray

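Task: Propose MAC-VR, a new method that enhances video retrieval by leveraging modality-specific tags.

Motivation: Video retrieval requires aligning visual content with natural language descriptions, and existing methods leave room for improvement in cross-modal alignment.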

Details

Method: Aligns modalities in a latent space while learning and aligning auxiliary latent concepts derived from the features of a video and its corresponding caption. Result: Experiments on five datasets show that modality-specific tags improve cross-modal alignment, outperforming or matching current state-of-the-art methods. Conclusion: By introducing auxiliary latent concepts, MAC-VR substantially improves video retrieval performance. Abstract: Video retrieval requires aligning visual content with corresponding natural language descriptions. In this paper, we introduce Modality Auxiliary Concepts for Video Retrieval (MAC-VR), a novel approach that leverages modality-specific tags -- automatically extracted from foundation models -- to enhance video retrieval. We propose to align modalities in a latent space, along with learning and aligning auxiliary latent concepts, derived from the features of a video and its corresponding caption. We introduce these auxiliary concepts to improve the alignment of visual and textual latent concepts, and so are able to distinguish concepts from one another. We conduct extensive experiments on five diverse datasets: MSR-VTT, DiDeMo, TGIF, Charades and YouCook2. The experimental results consistently demonstrate that modality-specific tags improve cross-modal alignment, outperforming current state-of-the-art methods across three datasets and performing comparably or better across the other two.

Multilingual and Multi-Accent Jailbreaking of Audio LLMs

Jaechul Roh,Virat Shejwalkar,Amir Houmansadr

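Task: Study the security impact of multilingual and multi-accent audio adversarial attacks on large audio language models (LALMs).

Motivation: Expose the severity of multilingual and multi-accent audio jailbreaks, whose attack success rates rise sharply under cross-lingual and acoustic variation.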

Details

Method: Proposes the Multi-AudioJail framework, comprising (1) a dataset of adversarially perturbed multilingual/multi-accent audio and (2) a hierarchical evaluation pipeline that analyzes how acoustic perturbations interact with cross-lingual phonetics. Result: Acoustic perturbations (such as reverberation, echo, and whisper effects) raise jailbreak success rates (JSRs) by up to 57.25 percentage points, and multimodal LLMs are more vulnerable than unimodal systems. Conclusion: Urges the community to address the expanding multimodal attack surface; the authors plan to release the dataset to support research on cross-modal defenses. Abstract: Large Audio Language Models (LALMs) have significantly advanced audio understanding but introduce critical security risks, particularly through audio jailbreaks. While prior work has focused on English-centric attacks, we expose a far more severe vulnerability: adversarial multilingual and multi-accent audio jailbreaks, where linguistic and acoustic variations dramatically amplify attack success. In this paper, we introduce Multi-AudioJail, the first systematic framework to exploit these vulnerabilities through (1) a novel dataset of adversarially perturbed multilingual/multi-accent audio jailbreaking prompts, and (2) a hierarchical evaluation pipeline revealing how acoustic perturbations (e.g., reverberation, echo, and whisper effects) interact with cross-lingual phonetics to cause jailbreak success rates (JSRs) to surge by up to +57.25 percentage points (e.g., reverberated Kenyan-accented attack on MERaLiON). Crucially, our work further reveals that multimodal LLMs are inherently more vulnerable than unimodal systems: attackers need only exploit the weakest link (e.g., non-English audio inputs) to compromise the entire model, which we empirically show by multilingual audio-only attacks achieving 3.1x higher success rates than text-only attacks. We plan to release our dataset to spur research into cross-modal defenses, urging the community to address this expanding attack surface in multimodality as LALMs evolve.

DEPTHOR: Depth Enhancement from a Practical Light-Weight dToF Sensor and RGB Image

Jijun Xiang,Xuan Zhu,Xianqi Wang,Yu Wang,Hong Zhang,Fei Guo,Xin Yang

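Task: Use RGB images as guidance to convert raw dToF signals into high-precision, dense depth maps.

Motivation: Existing super-resolution-based methods perform well on public datasets but rely on idealized assumptions (such as accurate region correspondences and reliable dToF inputs), overlooking calibration errors and anomalous signals in dToF imaging, which limits real-world use.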

Details

Method: Proposes DEPTHOR, a novel completion-based method comprising a training strategy that simulates real-world dToF data and a network architecture that incorporates monocular depth estimation (MDE). Result: On the ZJU-L5 dataset, the training strategy significantly improves depth completion models to a level comparable with depth super-resolution methods, and the model improves Rel and RMSE by 27% and 18%, respectively. On a more challenging set of dToF samples, the method outperforms prior work, improving Rel and RMSE by 23% and 22%, respectively. Conclusion: Through its training strategy and network design, DEPTHOR addresses practical problems in dToF depth enhancement and significantly improves performance. Abstract: Depth enhancement, which uses RGB images as guidance to convert raw signals from dToF into high-precision, dense depth maps, is a critical task in computer vision. Although existing super-resolution-based methods show promising results on public datasets, they often rely on idealized assumptions like accurate region correspondences and reliable dToF inputs, overlooking calibration errors that cause misalignment and anomaly signals inherent to dToF imaging, limiting real-world applicability. To address these challenges, we propose a novel completion-based method, named DEPTHOR, featuring advances in both the training strategy and model architecture. First, we propose a method to simulate real-world dToF data from the accurate ground truth in synthetic datasets to enable noise-robust training. Second, we design a novel network that incorporates monocular depth estimation (MDE), leveraging global depth relationships and contextual information to improve prediction in challenging regions. On the ZJU-L5 dataset, our training strategy significantly enhances depth completion models, achieving results comparable to depth super-resolution methods, while our model achieves state-of-the-art results, improving Rel and RMSE by 27% and 18%, respectively. On a more challenging set of dToF samples we collected, our method outperforms SOTA methods on preliminary stereo-based GT, improving Rel and RMSE by 23% and 22%, respectively. Our code is available at https://github.com/ShadowBbBb/Depthor
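
To make the Rel and RMSE figures in the DEPTHOR entry above concrete, here is a minimal sketch of how such depth metrics are commonly computed. The abstract does not define "Rel"; this sketch assumes it means mean absolute relative error, and the function names and the "ground truth > 0 means valid" convention are illustrative assumptions rather than the authors' evaluation code.

```python
import math

def _valid_pairs(pred, gt):
    # Assumption: a ground-truth depth of 0 (or less) marks a pixel with no measurement.
    return [(p, g) for p, g in zip(pred, gt) if g > 0]

def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)

def rmse(pred, gt):
    """Root mean squared error over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))

def relative_improvement(baseline, ours):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - ours) / baseline
```

Under this reading, an "improvement of 27%" in Rel would correspond to `relative_improvement(baseline_rel, new_rel)` being about 0.27, for example an error falling from 0.100 to 0.073.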

Epistemic Alignment: A Mediating Framework for User-LLM Knowledge Delivery

Nicholas Clark,Hua Shen,Bill Howe,Tanushree Mitra

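Task: Propose the Epistemic Alignment Framework, a structured approach to bridging the gap between user needs and system capabilities in user-LLM knowledge delivery.

Motivation: Current LLM interfaces do not effectively support users' individual preferences for how information is presented, leaving users to rely on unverified prompts shared within communities.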

Details

Method: Derives ten knowledge-transmission challenges from the philosophical literature on epistemology, analyzes users' custom prompts and personalization strategies, and evaluates the policies and features of OpenAI and Anthropic. Result: Existing model providers have partially addressed these challenges but fall short on transparency, preference implementation, and verification tools. Conclusion: The Epistemic Alignment Framework gives AI developers concrete guidance for supporting diverse approaches to knowledge delivery and helps users obtain information presented in ways that better fit their needs. Abstract: LLMs increasingly serve as tools for knowledge acquisition, yet users cannot effectively specify how they want information presented. When users request that LLMs "cite reputable sources," "express appropriate uncertainty," or "include multiple perspectives," they discover that current interfaces provide no structured way to articulate these preferences. The result is prompt sharing folklore: community-specific copied prompts passed through trust relationships rather than based on measured efficacy. We propose the Epistemic Alignment Framework, a set of ten challenges in knowledge transmission derived from the philosophical literature of epistemology, concerning issues such as evidence quality assessment and calibration of testimonial reliance. The framework serves as a structured intermediary between user needs and system capabilities, creating a common vocabulary to bridge the gap between what users want and what systems deliver. Through a thematic analysis of custom prompts and personalization strategies shared on online communities where these issues are actively discussed, we find users develop elaborate workarounds to address each of the challenges. We then apply our framework to two prominent model providers, OpenAI and Anthropic, through content analysis of their documented policies and product features. Our analysis shows that while these providers have partially addressed the challenges we identified, they fail to establish adequate mechanisms for specifying epistemic preferences, lack transparency about how preferences are implemented, and offer no verification tools to confirm whether preferences were followed. For AI developers, the Epistemic Alignment Framework offers concrete guidance for supporting diverse approaches to knowledge; for users, it works toward information delivery that aligns with their specific needs rather than defaulting to one-size-fits-all approaches.

A topology-preserving three-stage framework for fully-connected coronary artery extraction

Yuehui Qiu,Dandan Shan,Yining Wang,Pei Dong,Dijia Wu,Xinnian Yang,Qingqi Hong,Dinggang Shen

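Task: Propose a topology-preserving three-stage framework for extracting the complete coronary artery tree.

Motivation: Coronary artery extraction is a key step in computer-aided diagnosis of coronary artery disease, but thin distal vessels, complex topology, and insufficient contrast cause existing methods to over-segment or under-segment.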

Details

Method: A three-stage framework of vessel segmentation, centerline reconnection, and missing vessel reconstruction, introducing a centerline-enhanced loss, a regularized walk algorithm, and implicit neural representation. Result: On the ASOCA and PDSCA datasets, Dice scores reach 88.53% and 85.07% with Hausdorff distances of 1.07 mm and 1.63 mm, respectively, outperforming existing methods. Conclusion: The proposed framework addresses the main challenges in coronary artery extraction and significantly improves performance. Abstract: Coronary artery extraction is a crucial prerequisite for computer-aided diagnosis of coronary artery disease. Accurately extracting the complete coronary tree remains challenging due to several factors, including the presence of thin distal vessels, tortuous topological structures, and insufficient contrast. These issues often result in over-segmentation and under-segmentation in current segmentation methods. To address these challenges, we propose a topology-preserving three-stage framework for fully-connected coronary artery extraction. This framework includes vessel segmentation, centerline reconnection, and missing vessel reconstruction. First, we introduce a new centerline-enhanced loss in the segmentation process. Second, for the broken vessel segments, we further propose a regularized walk algorithm to integrate distance, probabilities predicted by a centerline classifier, and directional cosine similarity, for reconnecting the centerlines. Third, we apply implicit neural representation and implicit modeling, to reconstruct the geometric model of the missing vessels. Experimental results show that our proposed framework outperforms existing methods, achieving Dice scores of 88.53% and 85.07%, with Hausdorff Distances (HD) of 1.07 mm and 1.63 mm on ASOCA and PDSCA datasets, respectively. Code will be available at https://github.com/YH-Qiu/CorSegRec.
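
The Dice score and Hausdorff distance reported in the coronary artery entry above can be illustrated with a small sketch. It uses the textbook definitions on sets of foreground voxel coordinates; the function names are hypothetical, and the paper's evaluation may differ in detail (for example, by measuring distances on surfaces in millimetres rather than on voxel indices).

```python
import math

def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two collections of voxel coordinates."""
    pred, target = set(pred), set(target)
    if not pred and not target:
        return 1.0  # convention: two empty masks agree perfectly
    return 2 * len(pred & target) / (len(pred) + len(target))

def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(src, dst):
        # Farthest that any point in src is from its nearest neighbour in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(a, b), directed(b, a))
```

This brute-force Hausdorff computation is quadratic in the number of points, which is fine for illustration but would need a spatial index or distance transform for full CT volumes.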

Scaling Test-Time Inference with Policy-Optimized, Dynamic Retrieval-Augmented Generation via KV Caching and Decoding

Sakhinana Sagar Srinivas,Venkataramana Runkana

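Task: Propose a comprehensive framework that enhances Retrieval-Augmented Generation (RAG) systems through dynamic retrieval strategies and reinforcement fine-tuning.

Motivation: Improve large language models on knowledge-intensive tasks (such as open-domain question answering and complex reasoning) while improving the utilization and relevance of retrieved information.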

Details

Method: Combines two complementary techniques, Policy-Optimized Retrieval-Augmented Generation (PORAG) and Adaptive Token-Layer Attention Scoring (ATLAS), and introduces CRITIC to selectively compress key-value caches. Result: Experiments show that the framework reduces hallucinations, strengthens domain-specific reasoning, and achieves significant efficiency and scalability gains over traditional RAG systems. Conclusion: The framework offers a new approach to building efficient, scalable RAG systems across diverse applications. Abstract: We present a comprehensive framework for enhancing Retrieval-Augmented Generation (RAG) systems through dynamic retrieval strategies and reinforcement fine-tuning. This approach significantly improves large language models on knowledge-intensive tasks, including open-domain question answering and complex reasoning. Our framework integrates two complementary techniques: Policy-Optimized Retrieval-Augmented Generation (PORAG), which optimizes the use of retrieved information, and Adaptive Token-Layer Attention Scoring (ATLAS), which dynamically determines retrieval timing and content based on contextual needs. Together, these techniques enhance both the utilization and relevance of retrieved content, improving factual accuracy and response quality. Designed as a lightweight solution compatible with any Transformer-based LLM without requiring additional training, our framework excels in knowledge-intensive tasks, boosting output accuracy in RAG settings. We further propose CRITIC, a novel method to selectively compress key-value caches by token importance, mitigating memory bottlenecks in long-context applications. The framework also incorporates test-time scaling techniques to dynamically balance reasoning depth and computational resources, alongside optimized decoding strategies for faster inference. Experiments on benchmark datasets show that our framework reduces hallucinations, strengthens domain-specific reasoning, and achieves significant efficiency and scalability gains over traditional RAG systems. This integrated approach advances the development of robust, efficient, and scalable RAG systems across diverse applications.
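
The idea of compressing a key-value cache "by token importance", mentioned in the RAG entry above, can be sketched generically as follows. The abstract does not say how CRITIC scores importance or selects tokens, so this is only an illustration of top-k cache pruning: the `importance` scores are taken as a given input, and the function name and signature are hypothetical.

```python
def compress_kv_cache(keys, values, importance, keep):
    """Keep the `keep` highest-importance cached tokens, preserving their original order.

    keys, values: per-token cache entries (any objects, one per token).
    importance:   one score per token (assumed to be supplied by some scoring rule).
    Returns the pruned keys, pruned values, and the kept token positions.
    """
    if not (len(keys) == len(values) == len(importance)):
        raise ValueError("keys, values and importance must have the same length")
    # Rank token positions by importance, highest first, and keep the top `keep`.
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore positional order for the retained tokens
    return [keys[i] for i in kept], [values[i] for i in kept], kept
```

Keeping the retained entries in their original order matters in practice because positional information is tied to the order of cached tokens.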

A$^\text{T}$A: Adaptive Transformation Agent for Text-Guided Subject-Position Variable Background Inpainting

Yizhe Tang,Zhimin Sun,Yuzhen Du,Ran Yi,Guangben Lu,Teng Hu,Luying Li,Lizhuang Ma,Fangyuan Zou

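Task: Propose a new task, "Text-Guided Subject-Position Variable Background Inpainting", which dynamically adjusts the subject position to harmonize the subject with the inpainted background, together with the Adaptive Transformation Agent (A$^\text{T}$A) method.

Motivation: Existing background inpainting methods typically strictly preserve the subject's original position from the source image, causing inconsistencies between the subject and the generated background.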

Details

Method: Designs a PosAgent Block that adaptively predicts displacement, a Reverse Displacement Transform (RDT) module that transforms feature maps in reverse order, and a Position Switch Embedding that controls whether the subject position is adaptively predicted. Result: Experiments validate the effectiveness of A$^\text{T}$A, which performs well in both subject-position variable and subject-position fixed inpainting. Conclusion: The proposed method improves subject-position variable background inpainting while maintaining good performance on subject-position fixed inpainting. Abstract: Image inpainting aims to fill the missing region of an image. Recently, there has been a surge of interest in foreground-conditioned background inpainting, a sub-task that fills the background of an image while the foreground subject and associated text prompt are provided. Existing background inpainting methods typically strictly preserve the subject's original position from the source image, resulting in inconsistencies between the subject and the generated background. To address this challenge, we propose a new task, the "Text-Guided Subject-Position Variable Background Inpainting", which aims to dynamically adjust the subject position to achieve a harmonious relationship between the subject and the inpainted background, and propose the Adaptive Transformation Agent (A$^\text{T}$A) for this task. Firstly, we design a PosAgent Block that adaptively predicts an appropriate displacement based on given features to achieve variable subject-position. Secondly, we design the Reverse Displacement Transform (RDT) module, which arranges multiple PosAgent blocks in a reverse structure, to transform hierarchical feature maps from deep to shallow based on semantic information. Thirdly, we equip A$^\text{T}$A with a Position Switch Embedding to control whether the subject's position in the generated image is adaptively predicted or fixed. Extensive comparative experiments validate the effectiveness of our A$^\text{T}$A approach, which not only demonstrates superior inpainting capabilities in subject-position variable inpainting, but also ensures good performance on subject-position fixed inpainting.

On Data Synthesis and Post-training for Visual Abstract Reasoning

Ke Zhu,Yu Wang,Jiangjiang Liu,Qunyi Xie,Shanshan Liu,Gang Zhang

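Task: Enable the LLaVA-NeXT 7B model to solve abstract visual reasoning (AVR) problems through innovative data synthesis and post-training.

Motivation: Most current vision-language models (VLMs) perform poorly, often near chance level, on representative AVR benchmarks, so a new approach is needed.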

Details

Method: An innovative data synthesis and post-training process that reduces task difficulty step by step and guides the model's learning. Result: The LLaVA-NeXT 7B model significantly outperforms large open-source and closed-source VLMs (such as Qwen-2-VL-72B and GPT-4o) on AVR tasks while retaining multimodal comprehension ability. Conclusion: The study offers an important breakthrough as an early exploration of abstract visual reasoning and may inspire further research. Abstract: This paper is a pioneering work attempting to address abstract visual reasoning (AVR) problems for large vision-language models (VLMs). We make a common LLaVA-NeXT 7B model capable of perceiving and reasoning about specific AVR problems, surpassing both open-source (e.g., Qwen-2-VL-72B) and powerful closed-source VLMs (e.g., GPT-4o) by a significant margin. This is a great breakthrough since almost all previous VLMs fail or show nearly random performance on representative AVR benchmarks. The key to our success is our innovative data synthesis and post-training process, which aims to ease the task difficulty and guide the model to learn step by step. Our 7B model is also shown to behave well on AVR without sacrificing common multimodal comprehension abilities. We hope our paper could serve as an early effort in this area and would inspire further research in abstract visual reasoning.

3DBonsai: Structure-Aware Bonsai Modeling Using Conditioned 3D Gaussian Splatting

Hao Wu,Hao Wang,Ruochong Li,Xuran Ma,Hui Xiong

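Task: Propose 3DBonsai, a new framework for generating 3D bonsai with complex structures.

Motivation: The 3D priors used by existing methods lack detailed and complex structural information, limiting their ability to generate intricate structures such as bonsai.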

Details

Method: Designs a trainable 3D space colonization algorithm, combined with random sampling and point cloud augmentation to produce 3D Gaussian priors, and proposes two generation pipelines (fine-structure and coarse-structure conditioned generation). Result: Experiments show that 3DBonsai significantly outperforms existing methods and provides a new benchmark for structure-aware 3D bonsai generation. Conclusion: With its new 3D priors and generation pipelines, 3DBonsai addresses the challenge of generating complex bonsai structures. Abstract: Recent advancements in text-to-3D generation have shown remarkable results by leveraging 3D priors in combination with 2D diffusion. However, previous methods utilize 3D priors that lack detailed and complex structural information, limiting them to generating simple objects and presenting challenges for creating intricate structures such as bonsai. In this paper, we propose 3DBonsai, a novel text-to-3D framework for generating 3D bonsai with complex structures. Technically, we first design a trainable 3D space colonization algorithm to produce bonsai structures, which are then enhanced through random sampling and point cloud augmentation to serve as the 3D Gaussian priors. We introduce two bonsai generation pipelines with distinct structural levels: fine structure conditioned generation, which initializes 3D Gaussians using a 3D structure prior to produce detailed and complex bonsai, and coarse structure conditioned generation, which employs a multi-view structure consistency module to align 2D and 3D structures. Moreover, we have compiled a unified 2D and 3D Chinese-style bonsai dataset. Our experimental results demonstrate that 3DBonsai significantly outperforms existing methods, providing a new benchmark for structure-aware 3D bonsai generation.

Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert Parallelism Design

Mohan Zhang,Pingzhi Li,Jie Peng,Mufan Qiu,Tianlong Chen

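Task: Propose a new collaboration-constrained routing (C2R) strategy to improve the efficiency and performance of Mixture-of-Experts (MoE) models.

Motivation: In practice, MoE models suffer from imbalanced expert activation and heavy communication overhead, which lower efficiency.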

Details

Method: Analyzes expert collaboration and specialization from a higher-order view of routing and proposes the C2R strategy, which encourages specialized expert groups and reduces communication overhead. Result: Average performance gains of 0.51% and 0.33% on LLaMA-MoE and Qwen-MoE, respectively, and an additional 20%-30% savings in total running time from reduced all2all communication. Conclusion: C2R improves both the efficiency and the performance of MoE models while lowering communication overhead. Abstract: Mixture-of-Experts (MoE) has successfully scaled up models while maintaining nearly constant computing costs. By employing a gating network to route input tokens, it selectively activates a subset of expert networks to process the corresponding token embeddings. However, in practice, the efficiency of MoE is challenging to achieve due to two key reasons: imbalanced expert activation, which leads to substantial idle time during model or expert parallelism, and insufficient capacity utilization; massive communication overhead, induced by numerous expert routing combinations in expert parallelism at the system level. Previous works typically formulate it as the load imbalance issue characterized by the gating network favoring certain experts over others or attribute it to static execution which fails to adapt to the dynamic expert workload at runtime. In this paper, we exploit it from a brand new perspective, a higher-order view and analysis of MoE routing policies: expert collaboration and specialization where some experts tend to activate broadly with others (collaborative), while others are more likely to activate only with a specific subset of experts (specialized). Our experiments reveal that most experts tend to be overly collaborative, leading to increased communication overhead from repeatedly sending tokens to different accelerators. To this end, we propose a novel collaboration-constrained routing (C2R) strategy to encourage more specialized expert groups, as well as to improve expert utilization, and present an efficient implementation of MoE that further leverages expert specialization. We achieve an average performance improvement of 0.51% and 0.33% on LLaMA-MoE and Qwen-MoE respectively across ten downstream NLP benchmarks, and reduce the all2all communication costs between GPUs, bringing an extra 20%-30% total running time savings on top of the existing SoTA, i.e. MegaBlocks.
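
The notions of "collaborative" and "specialized" experts in the C2R entry above can be made concrete with a small sketch of top-k routing and co-activation counting. This is not the C2R strategy itself, only one way to measure how broadly each expert is activated together with others; all function names are hypothetical and the gate scores are assumed to be given per token.

```python
from collections import Counter
from itertools import combinations

def top_k_experts(gate_scores, k):
    """Indices of the k experts with the highest gate score for one token."""
    ranked = sorted(range(len(gate_scores)), key=lambda e: gate_scores[e], reverse=True)
    return sorted(ranked[:k])

def co_activation_counts(routes):
    """Count how often each pair of experts is selected for the same token."""
    counts = Counter()
    for experts in routes:
        counts.update(combinations(experts, 2))
    return counts

def distinct_partners(counts, num_experts):
    """Number of different experts each expert has been co-activated with."""
    partners = {e: set() for e in range(num_experts)}
    for a, b in counts:
        partners[a].add(b)
        partners[b].add(a)
    return {e: len(p) for e, p in partners.items()}
```

An expert with many distinct partners is "collaborative" in the abstract's sense; if those partners live on different accelerators, each such pairing implies extra token transfers, which is the overhead the abstract attributes to overly collaborative routing.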

A Conic Transformation Approach for Solving the Perspective-Three-Point Problem

Haidong Wu,Snehal Bhayani,Janne Heikkilä

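Task: Propose a conic transformation method for solving the Perspective-Three-Point (P3P) problem.

Motivation: Current state-of-the-art solvers find the intersection of two conics by constructing a degenerate conic, whereas this method simplifies the problem with a new coordinate transformation.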

Details

Method: Maps the two conics to a new coordinate system in which one of them becomes a standard parabola, which simplifies finding their intersection. Result: The new method achieves higher speed while maintaining robustness and stability. Conclusion: The proposed conic transformation method is faster than existing P3P solvers while remaining comparably robust and stable. Abstract: We propose a conic transformation method to solve the Perspective-Three-Point (P3P) problem. In contrast to the current state-of-the-art solvers, which formulate the P3P problem by intersecting two conics and constructing a degenerate conic to find the intersection, our approach builds upon a new formulation based on a transformation that maps the two conics to a new coordinate system, where one of the conics becomes a standard parabola in a canonical form. This enables expressing one variable in terms of the other variable, and as a consequence, substantially simplifies the problem of finding the conic intersection. Moreover, the polynomial coefficients are fast to compute, and we only need to determine the real-valued intersection points, which avoids the requirement of using computationally expensive complex arithmetic. While the current state-of-the-art methods reduce the conic intersection problem to solving a univariate cubic equation, our approach, despite resulting in a quartic equation, is still faster thanks to this new simplified formulation. Extensive evaluations demonstrate that our method achieves higher speed while maintaining robustness and stability comparable to state-of-the-art methods.
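
The step from "one conic becomes a standard parabola" to "a quartic equation" in the P3P entry above can be seen with a short worked derivation. The symbols below are illustrative and not taken from the paper: it is assumed that the canonical parabola is $y = x^2$ and that the second conic has generic coefficients $a, \dots, f$ in the transformed coordinates.

```latex
\begin{align}
  y &= x^2
    && \text{(first conic, in canonical form)} \\
  a x^2 + b x y + c y^2 + d x + e y + f &= 0
    && \text{(second conic, generic coefficients)} \\
  c x^4 + b x^3 + (a + e) x^2 + d x + f &= 0
    && \text{(after substituting } y = x^2 \text{)}
\end{align}
```

Substituting the parabola eliminates $y$ directly, so the intersection points are given by the real roots of a single univariate quartic in $x$, with $y$ recovered as $x^2$.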

An Illusion of Progress? Assessing the Current State of Web Agents

Tianci Xue,Weijian Qi,Tianneng Shi,Chan Hee Song,Boyu Gou,Dawn Song,Huan Sun,Yu Su

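Task: Conduct a comprehensive and rigorous assessment of current autonomous web agents based on large language models (LLMs).

Motivation: As digitalization and cloud technologies advance, the web is becoming increasingly important in modern society, and LLM-based autonomous web agents hold great potential for work automation, so their capabilities need to be measured and monitored accurately.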

Details

Method: Introduces Online-Mind2Web, an online evaluation benchmark of 300 diverse, realistic tasks spanning 136 websites, and develops a new LLM-as-a-Judge automatic evaluation method. Result: The competency of current web agents differs substantially from previously reported results, pointing to shortcomings in existing benchmarks; the LLM-as-a-Judge method reaches about 85% agreement with human judgment, substantially higher than existing methods. Conclusion: The study presents the first comprehensive comparative analysis of the strengths and limitations of current web agents, informing future research. Abstract: As digitalization and cloud technologies evolve, the web is becoming increasingly important in modern society. Autonomous web agents based on large language models (LLMs) hold great potential in work automation. It is therefore important to accurately measure and monitor the progression of their capabilities. In this work, we conduct a comprehensive and rigorous assessment of the current state of web agents. Our results depict a very different picture of the competency of current agents, suggesting over-optimism in previously reported results. This gap can be attributed to shortcomings in existing benchmarks. We introduce Online-Mind2Web, an online evaluation benchmark consisting of 300 diverse and realistic tasks spanning 136 websites. It enables us to evaluate web agents under a setting that approximates how real users use these agents. To facilitate more scalable evaluation and development, we also develop a novel LLM-as-a-Judge automatic evaluation method and show that it can achieve around 85% agreement with human judgment, substantially higher than existing methods. Finally, we present the first comprehensive comparative analysis of current web agents, highlighting both their strengths and limitations to inspire future research.
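
The "around 85% agreement with human judgment" figure in the web agents entry above is an agreement rate between automatic and human verdicts. A minimal sketch of that computation follows; it assumes one boolean success/failure verdict per task from each source, and the function name is hypothetical rather than part of the benchmark's tooling.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of tasks on which the automatic judge and the human annotator agree."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("at least one labelled task is required")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

Raw agreement does not correct for chance, so when success rates are very skewed a chance-corrected statistic such as Cohen's kappa may be a useful complement.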

Benchmarking the Spatial Robustness of DNNs via Natural and Adversarial Localized Corruptions

Giulia Marchiori Pietrosanti,Giulio Rossolini,Alessandro Biondi,Giorgio Buttazzo

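Task: Study the spatial robustness of dense vision models under localized corruptions and propose an evaluation framework and method.

Motivation: The robustness of deep neural networks is crucial in safety-critical applications, especially in complex, dynamic environments where localized corruptions can arise.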

Details

Method: Introduces specialized metrics and an evaluation framework, and proposes region-aware multi-attack adversarial analysis. Result: Evaluation on 15 segmentation models shows that models respond differently to natural and adversarial localized corruptions. Conclusion: Ensemble models balance robustness to natural and adversarial localized corruptions, improving the reliability of dense vision tasks. Abstract: The robustness of DNNs is a crucial factor in safety-critical applications, particularly in complex and dynamic environments where localized corruptions can arise. While previous studies have evaluated the robustness of semantic segmentation (SS) models under whole-image natural or adversarial corruptions, the spatial robustness of dense vision models under localized corruptions has remained underexplored. This paper fills this gap by introducing specialized metrics for benchmarking the spatial robustness of segmentation models, alongside an evaluation framework to assess the impact of localized corruptions. Furthermore, we uncover the inherent complexity of characterizing worst-case robustness using a single localized adversarial perturbation. To address this, we propose region-aware multi-attack adversarial analysis, a method that enables a deeper understanding of model robustness against adversarial perturbations applied to specific regions. The proposed metrics and analysis were evaluated on 15 segmentation models in driving scenarios, uncovering key insights into the effects of localized corruption in both natural and adversarial forms. The results reveal that models respond to these two types of threats differently; for instance, transformer-based segmentation models demonstrate notable robustness to localized natural corruptions but are highly vulnerable to adversarial ones and vice-versa for CNN-based models. Consequently, we also address the challenge of balancing robustness to both natural and adversarial localized corruptions by means of ensemble models, thereby achieving a broader threat coverage and improved reliability for dense vision tasks.

Generative Retrieval and Alignment Model: A New Paradigm for E-commerce Retrieval

Ming Pang,Chunyuan Yuan,Xiaoyu He,Zheng Fang,Donghao Xie,Fanyi Qu,Xue Jiang,Changping Peng,Zhangang Lin,Zheng Luo,Jingping Shao

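Task: Propose a new e-commerce retrieval paradigm, the Generative Retrieval and Alignment Model (GRAM), to address the shortcomings of traditional sparse and dense retrieval in leveraging world knowledge and capturing nuanced features of queries and products.

Motivation: Traditional methods struggle to leverage general world knowledge and fail to align query and product features effectively, leading to low retrieval efficiency.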

Details

Method: GRAM jointly trains on text from queries and products to generate shared text identifier codes, uses a co-alignment strategy to optimize retrieval efficiency, and introduces a query-product scoring mechanism. Result: Offline and online A/B tests show that GRAM significantly outperforms traditional models and the latest generative retrieval models. Conclusion: GRAM improves retrieval efficiency, confirming its effectiveness and practicality in real-world use. Abstract: Traditional sparse and dense retrieval methods struggle to leverage general world knowledge and often fail to capture the nuanced features of queries and products. With the advent of large language models (LLMs), industrial search systems have started to employ LLMs to generate identifiers for product retrieval. Commonly used identifiers include (1) static/semantic IDs and (2) product term sets. The first approach requires creating a product ID system from scratch, missing out on the world knowledge embedded within LLMs. While the second approach leverages this general knowledge, the significant difference in word distribution between queries and products means that product-based identifiers often do not align well with user search queries, leading to missed product recalls. Furthermore, when queries contain numerous attributes, these algorithms generate a large number of identifiers, making it difficult to assess their quality, which results in low overall recall efficiency. To address these challenges, this paper introduces a novel e-commerce retrieval paradigm: the Generative Retrieval and Alignment Model (GRAM). GRAM employs joint training on text information from both queries and products to generate shared text identifier codes, effectively bridging the gap between queries and products. This approach not only enhances the connection between queries and products but also improves inference efficiency. The model uses a co-alignment strategy to generate codes optimized for maximizing retrieval efficiency. Additionally, it introduces a query-product scoring mechanism to compare product values across different codes, further boosting retrieval efficiency. Extensive offline and online A/B testing demonstrates that GRAM significantly outperforms traditional models and the latest generative retrieval models, confirming its effectiveness and practicality.

Bridge 2D-3D: Uncertainty-aware Hierarchical Registration Network with Domain Alignment

Zhixin Cheng,Jiacheng Deng,Xinjun Li,Baoqun Yin,Tianzhu Zhang

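Task: Propose a novel image-to-point cloud registration method that addresses the problems that arise when image patches are matched directly with point cloud patches.

Motivation: Directly and uniformly matching image patches with point cloud patches can focus on incorrect noise patches while ignoring key ones, and the large modality gap between images and point clouds makes the domain gap hard to bridge.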

Details

Method: Proposes the Uncertainty-aware Hierarchical Matching Module (UHMM) and the Adversarial Modal Alignment Module (AMAM), for multi-level feature fusion and for reducing the modality gap, respectively. Result: Shows superior results on the RGB-D Scene V2 and 7-Scenes benchmarks, making it a state-of-the-art method for image-to-point cloud registration. Conclusion: Through its module design, the method addresses key problems in image-to-point cloud registration and achieves significant performance gains. Abstract: Methods for image-to-point cloud registration typically determine the rigid transformation using a coarse-to-fine pipeline. However, directly and uniformly matching image patches with point cloud patches may lead to focusing on incorrect noise patches during matching while ignoring key ones. Moreover, due to the significant differences between image and point cloud modalities, it may be challenging to bridge the domain gap without specific improvements in design. To address the above issues, we innovatively propose the Uncertainty-aware Hierarchical Matching Module (UHMM) and the Adversarial Modal Alignment Module (AMAM). Within the UHMM, we model the uncertainty of critical information in image patches and facilitate multi-level fusion interactions between image and point cloud features. In the AMAM, we design an adversarial approach to reduce the domain gap between image and point cloud. Extensive experiments and ablation studies on RGB-D Scene V2 and 7-Scenes benchmarks demonstrate the superiority of our method, making it a state-of-the-art approach for image-to-point cloud registration tasks.

CASCADE Your Datasets for Cross-Mode Knowledge Retrieval of Language Models

Runlong Zhou,Yi Zhang

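Task: Study language models' ability to retrieve knowledge across different data modes, and its limitations.

Motivation: Language models perform poorly at cross-mode knowledge retrieval, so the causes need to be investigated and a solution proposed.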

Details

Method: Uses a controlled study of random token sequence memorization to examine dataset rewriting and to propose the CASCADE pretraining algorithm. Result: CASCADE outperforms dataset rewriting and effectively improves cross-mode retrieval. Conclusion: CASCADE gives language models a way to retrieve knowledge independently of the format in which it was presented. Abstract: Language models often struggle with cross-mode knowledge retrieval -- the ability to access knowledge learned in one format (mode) when queried in another. We demonstrate that models trained on multiple data sources (e.g., Wikipedia and TinyStories) exhibit significantly reduced accuracy when retrieving knowledge in a format different from its original training mode. This paper quantitatively investigates this phenomenon through a controlled study of random token sequence memorization across different modes. We first explore dataset rewriting as a solution, revealing that effective cross-mode retrieval requires prohibitively extensive rewriting efforts that follow a sigmoid-like relationship. As an alternative, we propose CASCADE, a novel pretraining algorithm that uses cascading datasets with varying sequence lengths to capture knowledge at different scales. Our experiments demonstrate that CASCADE outperforms dataset rewriting approaches, even when compressed into a single model with a unified loss function. This work provides both qualitative evidence of cross-mode retrieval limitations and a practical solution to enhance language models' ability to access knowledge independently of its presentational format.

FlowR: Flowing from Sparse to Dense 3D Reconstructions

Tobias Fischer,Samuel Rota Bulò,Yung-Hsu Yang,Nikhil Varma Keetha,Lorenzo Porzi,Norman Müller,Katja Schwarz,Jonathon Luiten,Marc Pollefeys,Peter Kontschieder

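Task: Improve novel view synthesis quality in sparse-reconstruction settings with a multi-view flow matching model.

Motivation: The quality of 3D Gaussian splatting drops sharply away from the training views, dense captures are costly, and existing methods do not fully exploit the available 3D information.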

Details

Method: Proposes a multi-view flow matching model that connects renderings from sparse reconstructions to those expected from dense reconstructions, generating novel views that improve reconstruction quality. Result: Novel view synthesis quality exceeds that of existing methods in both sparse-view and dense-view scenarios. Conclusion: The method significantly improves novel view synthesis quality in sparse-reconstruction settings and outperforms prior work. Abstract: 3D Gaussian splatting enables high-quality novel view synthesis (NVS) at real-time frame rates. However, its quality drops sharply as we depart from the training views. Thus, dense captures are needed to match the high-quality expectations of some applications, e.g. Virtual Reality (VR). However, such dense captures are very laborious and expensive to obtain. Existing works have explored using 2D generative models to alleviate this requirement by distillation or generating additional training views. These methods are often conditioned only on a handful of reference input views and thus do not fully exploit the available 3D information, leading to inconsistent generation results and reconstruction artifacts. To tackle this problem, we propose a multi-view, flow matching model that learns a flow to connect novel view renderings from possibly sparse reconstructions to renderings that we expect from dense reconstructions. This enables augmenting scene captures with novel, generated views to improve reconstruction quality. Our model is trained on a novel dataset of 3.6M image pairs and can process up to 45 views at 540x960 resolution (91K tokens) on one H100 GPU in a single forward pass. Our pipeline consistently improves NVS in sparse- and dense-view scenarios, leading to higher-quality reconstructions than prior works across multiple, widely-used NVS benchmarks.

Redefining technology for indigenous languages

Silvia Fernandez-Sabido,Laura Peniche-Sabido

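Task: Analyze the current state of indigenous languages and technological approaches to revitalizing them.

Motivation: Examine the causes of the devaluation of indigenous languages and the need for language rights legislation, and assess the feasibility of technology-driven revitalization.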

Details

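Method: Reviews the effects of externally developed technologies compared with those developed within communities, and proposes a participatory approach to including Indigenous knowledge in large language models (LLMs). Result: Externally developed technologies can be counterproductive, whereas those developed within communities become powerful instruments of expression; a participatory approach can enrich the technological landscape. Conclusion: Revitalizing indigenous languages requires combining community-led technology development with participatory knowledge sharing to achieve sustainable protection of language rights. Abstract: In this paper, we offer an overview of indigenous languages, identifying the causes of their devaluation and the need for legislation on language rights. We review the technologies used to revitalize these languages, finding that when they come from outside, they often have the opposite effect to what they seek; however, when developed from within communities, they become powerful instruments of expression. We propose that the inclusion of Indigenous knowledge in large language models (LLMs) will enrich the technological landscape, but must be done in a participatory environment that encourages the exchange of knowledge.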

ProtoGuard-guided PROPEL: Class-Aware Prototype Enhancement and Progressive Labeling for Incremental 3D Point Cloud Segmentation

Haosheng Li,Yuecong Xu,Junjie Chen,Kemi Ding

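Task: Address catastrophic forgetting in 3D point cloud semantic segmentation with a class-incremental learning method.

Motivation: Real-world environments evolve, so offline-trained segmentation models can catastrophically forget previously seen classes; point cloud data additionally exhibits high inter-class similarity, unclear boundaries, and imbalanced class distributions.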

Details

Method: Proposes ProtoGuard and PROPEL: the former maintains geometric and semantic prototypes during base-class training, and the latter guides pseudo-label propagation and updates during novel-class training based on density distribution and semantic similarity. Result: Performance improves markedly on S3DIS and ScanNet, with a maximum mIoU gain of 20.39% under the 5-step CIL scenario on S3DIS. Conclusion: ProtoGuard and PROPEL effectively address class-incremental learning for point cloud segmentation. Abstract: 3D point cloud semantic segmentation technology has been widely used. However, in real-world scenarios, the environment is evolving. Thus, offline-trained segmentation models may lead to catastrophic forgetting of previously seen classes. Class-incremental learning (CIL) is designed to address the problem of catastrophic forgetting. While point clouds are common, we observe high similarity and unclear boundaries between different classes. Meanwhile, they are known to be imbalanced in class distribution. These lead to issues including misclassification between similar classes and the long-tail problem, which have not been adequately addressed in previous CIL methods. We thus propose ProtoGuard and PROPEL (Progressive Refinement Of PsEudo-Labels). In the base-class training phase, ProtoGuard maintains geometric and semantic prototypes for each class, which are combined into prototype features using an attention mechanism. In the novel-class training phase, PROPEL inherits the base feature extractor and classifier, guiding pseudo-label propagation and updates based on density distribution and semantic similarity. Extensive experiments show that our approach achieves remarkable results on both the S3DIS and ScanNet datasets, improving the mIoU of 3D point cloud segmentation by a maximum of 20.39% under the 5-step CIL scenario on S3DIS.
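
The mIoU figure quoted above is the mean, over classes, of the intersection-over-union between predicted and ground-truth labels. A minimal sketch of that metric follows, assuming per-point integer class labels; the function name `mean_iou` and the toy labels are illustrative and are not taken from the paper, whose exact evaluation protocol may differ (for example in how absent classes are handled).

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean IoU over the classes that occur in the labels or the predictions."""
    ious = []
    for c in range(num_classes):
        # Points where both label and prediction are class c.
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        # Points where either label or prediction is class c.
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union > 0:  # skip classes that never occur (an assumption of this sketch)
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six points (made-up data).
labels = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
print(mean_iou(labels, preds, num_classes=3))
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.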

Representation Bending for Large Language Model Safety

Ashkan Yousefpour,Taeheon Kim,Ryan S. Kwon,Seungbeen Lee,Wonje Jeung,Seungju Han,Alvin Wan,Harrison Ngan,Youngjae Yu,Jonghyun Choi

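Task: Propose RepBend, a new method that improves the safety of large language models (LLMs) by disrupting the representations underlying harmful behaviors.

Motivation: LLMs carry safety risks, ranging from harmful content generation to broader societal harms, and existing techniques either fail to generalize across diverse attacks or require manual defenses.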

Details

Method: RepBend brings activation steering, which adjusts model behavior through simple vector arithmetic, into loss-based fine-tuning. Result: RepBend reduces attack success rates by up to 95% across diverse jailbreak benchmarks, outperforming prior methods with negligible impact on model usability and general capabilities. Conclusion: RepBend offers a scalable way to enhance LLM safety that holds up against diverse attacks. Abstract: Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks - ranging from harmful content generation to broader societal harms - pose significant challenges. These risks can be amplified by recent adversarial attacks, fine-tuning vulnerabilities, and the increasing deployment of LLMs in high-stakes environments. Existing safety-enhancing techniques, such as fine-tuning with human feedback or adversarial training, are still vulnerable as they address specific threats and often fail to generalize across unseen attacks, or require manual system-level defenses. This paper introduces RepBend, a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs, offering a scalable solution to enhance (potentially inherent) safety. RepBend brings the idea of activation steering - simple vector arithmetic for steering a model's behavior during inference - to loss-based fine-tuning. Through extensive evaluation, RepBend achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success rates across diverse jailbreak benchmarks, all with negligible reduction in model usability and general capabilities.
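
The "simple vector arithmetic" behind activation steering can be sketched as follows. This is only an illustration of inference-time steering in general, assuming a difference-of-means direction between safe and harmful activations; it does not reproduce RepBend's fine-tuning loss, and the function names and toy activations are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(safe_acts, harmful_acts):
    """Direction pointing from the mean harmful activation to the mean safe one.

    The difference-of-means construction is an assumption of this sketch.
    """
    mu_safe = mean_vector(safe_acts)
    mu_harm = mean_vector(harmful_acts)
    return [s - h for s, h in zip(mu_safe, mu_harm)]


def steer(hidden, direction, alpha=1.0):
    """Shift a hidden state along the steering direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-dimensional activations (made-up numbers).
safe = [[1.0, 0.0], [3.0, 0.0]]
harmful = [[0.0, 1.0], [0.0, 3.0]]
v = steering_vector(safe, harmful)
print(v)
print(steer([0.5, 0.5], v, alpha=0.5))
```

According to the abstract, RepBend moves this idea from inference time into a training objective, so the sketch shows only the arithmetic that inspires the method.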

Q-Adapt: Adapting LMM for Visual Quality Assessment with Progressive Instruction Tuning

Yiting Lu,Xin Li,Haoning Wu,Bingchen Li,Weisi Lin,Zhibo Chen

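Task: Propose Q-Adapt, a new perception-oriented instruction tuning paradigm that eliminates the conflict between overall quality explanation and attribute-wise perception answering in Explainable Image Quality Assessment (EIQA).

Motivation: Existing methods overlook the conflict between the two types of perception explanation during joint instruction tuning, which leads to insufficient perception understanding.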

Details

Method: A two-stage progressive instruction tuning strategy: the first stage gives the LMM universal perception knowledge through efficient transfer learning with LoRA; the second stage introduces instruction-adaptive visual prompt tuning to dynamically adapt visual features to the instructions of each task. Result: Q-Adapt yields a lightweight visual quality evaluator with comparable and, in some cases, superior performance on perception-related benchmarks and commonly used IQA databases. Conclusion: By eliminating the conflict and achieving synergy between the tasks, Q-Adapt enhances multi-faceted explanation in IQA. Abstract: The rapid advancement of Large Multi-modal Foundation Models (LMM) has paved the way for the possible Explainable Image Quality Assessment (EIQA) with instruction tuning from two perspectives: overall quality explanation, and attribute-wise perception answering. However, existing works usually overlooked the conflicts between these two types of perception explanations during joint instruction tuning, leading to insufficient perception understanding. To mitigate this, we propose a new paradigm for perception-oriented instruction tuning, i.e., Q-Adapt, which aims to eliminate the conflicts and achieve the synergy between these two EIQA tasks when adapting LMM, resulting in enhanced multi-faceted explanations of IQA. Particularly, we propose a progressive instruction tuning strategy by dividing the adaptation process of LMM for EIQA into two stages, where the first stage empowers the LMM with universal perception knowledge tailored for two tasks using an efficient transfer learning strategy, i.e., LoRA, and the second stage introduces the instruction-adaptive visual prompt tuning to dynamically adapt visual features for the different instructions from two tasks. In this way, our proposed Q-Adapt can achieve a lightweight visual quality evaluator, demonstrating comparable performance and, in some instances, superior results across perceptual-related benchmarks and commonly-used IQA databases. The source code is publicly available at https://github.com/yeppp27/Q-Adapt.
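
For readers unfamiliar with LoRA, the first-stage technique named above, the core idea is to keep a weight matrix W frozen and learn a low-rank update, giving W' = W + (alpha / r) * B A. The sketch below illustrates only that generic formula with plain Python lists; the helper names, shapes, and numbers are hypothetical and say nothing about how Q-Adapt configures LoRA.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the merged LoRA weight."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example: a 2x2 frozen weight with a rank-1 update (made-up numbers).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape (2, rank)
A = [[0.5, -0.5]]    # shape (rank, 2)
print(lora_merge(W, B, A, alpha=2.0, rank=1))
```

Only B and A are trained, so a d-by-k layer needs r * (d + k) trainable values instead of d * k, which is what makes the adaptation lightweight.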

Horizon Scans can be accelerated using novel information retrieval and artificial intelligence tools

Lena Schmidt,Oshin Sharma,Chris Marshall,Sonia Garcia Gonzalez Moral

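Task: Develop the open-source tools SCANAR and AIDOC to make information retrieval and analysis more efficient in healthcare horizon scanning.

Motivation: Current horizon scanning retrieves and analyzes information inefficiently, especially from unstructured sources such as news, so innovative tools are needed to streamline the process.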

Details

Method: SCANAR automates the retrieval and processing of news articles, with de-duplication and unsupervised relevancy ranking; AIDOC uses AI, including neural networks for semantic similarity, to reorder textual data and prioritize likely relevant entries. Result: SCANAR improved retrieval efficiency, and AIDOC cut manual review effort by around 62% at 95% recall, performing similarly to existing systematic review automation tools. Conclusion: SCANAR and AIDOC show potential to make horizon scanning more efficient; future work could optimize the models and integrate large language models into the workflow. Abstract: Introduction: Horizon scanning in healthcare assesses early signals of innovation, crucial for timely adoption. Current horizon scanning faces challenges in efficient information retrieval and analysis, especially from unstructured sources like news, presenting a need for innovative tools. Methodology: The study introduces SCANAR and AIDOC, open-source Python-based tools designed to improve horizon scanning. SCANAR automates the retrieval and processing of news articles, offering functionalities such as de-duplication and unsupervised relevancy ranking. AIDOC aids filtration by leveraging AI to reorder textual data based on relevancy, employing neural networks for semantic similarity, and subsequently prioritizing likely relevant entries for human review. Results: Twelve internal datasets from horizon scans and four external benchmarking datasets were used. SCANAR improved retrieval efficiency by automating processes previously dependent on manual labour. AIDOC displayed work-saving potential, achieving around 62% reduction in manual review efforts at 95% recall. Comparative analysis with benchmarking data showed AIDOC's performance was similar to existing systematic review automation tools, though performance varied depending on dataset characteristics. A smaller case-study on our news datasets shows the potential of ensembling large language models within the active-learning process for faster detection of relevant articles across news datasets. Conclusion: The validation indicates that SCANAR and AIDOC show potential to enhance horizon scanning efficiency by streamlining data retrieval and prioritisation. These tools may alleviate methodological limitations and allow broader, swifter horizon scans. Further studies are suggested to optimize these models and to design new workflows and validation processes that integrate large language models.
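
A figure such as "62% reduction in manual review at 95% recall" can be read as the share of a ranked list that a reviewer may skip once the target share of relevant items has been found. The abstract does not define the metric, so the sketch below is one plausible reading rather than AIDOC's actual computation; the function name and the toy ranking are hypothetical.

```python
import math


def review_reduction_at_recall(ranked_labels, target_recall=0.95):
    """Fraction of a ranked list left unscreened once target recall is reached.

    ranked_labels: 1 for relevant, 0 for irrelevant, in the order a reviewer
    would screen the items (highest predicted relevance first).
    """
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        raise ValueError("the list contains no relevant items")
    needed = math.ceil(target_recall * total_relevant)
    found = 0
    for screened, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= needed:
            return 1.0 - screened / len(ranked_labels)
    return 0.0  # not reached for valid input; kept for completeness


# Toy ranking: three relevant items, all within the first four positions.
ranking = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(review_reduction_at_recall(ranking, target_recall=0.95))
```

In the toy ranking, all three relevant items are found after screening four of ten entries, so 60% of the list need not be reviewed. The related work-saved-over-sampling measure additionally subtracts (1 - recall) from this value.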

Robust Unsupervised Domain Adaptation for 3D Point Cloud Segmentation Under Source Adversarial Attacks

Haosheng Li,Yuecong Xu,Junjie Chen,Kemi Ding

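Task: Explore the adversarial robustness of unsupervised domain adaptation (UDA) frameworks for 3D point cloud semantic segmentation.

Motivation: Existing UDA frameworks perform well on clean data but overlook adversarial robustness when the source domain itself is compromised.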

Details

Method: Designs a stealthy adversarial point cloud generation attack and builds a new dataset, AdvSynLiDAR; then develops the Adversarial Adaptation Framework (AAF), which extends the key point sensitive (KPS) loss into the Robust Long-Tail loss (RLT loss) and adds a decoder branch. Result: Experiments on AdvSynLiDAR show that AAF mitigates the performance degradation that source adversarial perturbations cause in UDA. Conclusion: AAF provides an adversarially robust approach to UDA for 3D point cloud segmentation. Abstract: Unsupervised domain adaptation (UDA) frameworks have shown good generalization capabilities for 3D point cloud semantic segmentation models on clean data. However, existing works overlook adversarial robustness when the source domain itself is compromised. To comprehensively explore the robustness of the UDA frameworks, we first design a stealthy adversarial point cloud generation attack that can significantly contaminate datasets with only minor perturbations to the point cloud surface. Based on that, we propose a novel dataset, AdvSynLiDAR, comprising synthesized contaminated LiDAR point clouds. With the generated corrupted data, we further develop the Adversarial Adaptation Framework (AAF) as the countermeasure. Specifically, by extending the key point sensitive (KPS) loss towards the Robust Long-Tail loss (RLT loss) and utilizing a decoder branch, our approach enables the model to focus on long-tail classes during the pre-training phase and leverages high-confidence decoded point cloud information to restore point cloud structures during the adaptation phase. We evaluated our AAF method on the AdvSynLiDAR dataset, where the results demonstrate that our AAF method can mitigate performance degradation under source adversarial perturbations for UDA in the 3D point cloud segmentation application.

Study of scaling laws in language families

Maelyson R. F. Santos,Marcelo A. F. Gomes

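Task: Study scaling laws in language families by analyzing emergent patterns in Zipf-like classification graphs.

Motivation: Examine the classification of language families at the macroscopic level (number of languages per family) and the microscopic level (number of speakers per language within a family) to shed light on the nature of linguistic diversity and distribution.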

Details

Method: Uses data from over six thousand languages to analyze Zipf-like classification graphs and studies the distribution of the fourteen largest contemporary language families. Result: Among the fourteen largest families, excluding Afro-Asiatic and Nilo-Saharan, the families fall into three quadruplets, each with a significantly different exponent in the Zipf graphs. Conclusion: The finding reveals the underlying structure and organization of major language families and offers new insight into linguistic diversity and distribution. Abstract: This article investigates scaling laws within language families using data from over six thousand languages and analyzing emergent patterns observed in Zipf-like classification graphs. Both macroscopic (based on number of languages by family) and microscopic (based on numbers of speakers by language in a family) aspects of these classifications are examined. Particularly noteworthy is the discovery of a distinct division among the fourteen largest contemporary language families, excluding Afro-Asiatic and Nilo-Saharan languages. These families are found to be distributed across three language family quadruplets, each characterized by significantly different exponents in the Zipf graphs. This finding sheds light on the underlying structure and organization of major language families, revealing intriguing insights into the nature of linguistic diversity and distribution.
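
A Zipf-like exponent is usually read off a rank-size plot: if size is proportional to rank^(-a), then log(size) is linear in log(rank) with slope -a. The sketch below estimates a by ordinary least squares on the log-log values. It is an assumption that the article fits its exponents this way, and the data here are synthetic, not real language counts.

```python
import math


def fit_zipf_exponent(sizes):
    """Estimate a in size ~ rank**(-a) by least squares on log-log values."""
    ranked = sorted(sizes, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(size) for size in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance  # the slope is -a


# Synthetic sizes that follow an exact power law with exponent 1.5.
synthetic = [1000.0 / rank ** 1.5 for rank in range(1, 21)]
print(round(fit_zipf_exponent(synthetic), 6))
```

On real data the fit would be applied separately to each group of families, and families in different quadruplets would yield visibly different slopes.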

BioAtt: Anatomical Prior Driven Low-Dose CT Denoising

Namhun Kim,UiHyun Cho

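Task: Propose BioAtt, a new low-dose CT denoising framework driven by anatomical priors, to address the over-smoothing of important anatomical details by existing methods.

Motivation: Existing deep learning methods for low-dose CT denoising over-smooth anatomical details because their attention is not targeted at anatomical structures.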

Details

Method: Extracts anatomical prior distributions from the pretrained vision-language model BiomedCLIP and embeds them into spatial attention, guiding the denoising model toward anatomically relevant regions. Result: BioAtt outperforms baseline and attention-based models in SSIM, PSNR, and RMSE across multiple anatomical regions. Conclusion: By introducing anatomical priors, BioAtt improves denoising performance and offers a new architectural paradigm, and its attention maps confirm that the gains stem from anatomical guidance. Abstract: Deep-learning-based denoising methods have significantly improved Low-Dose CT (LDCT) image quality. However, existing models often over-smooth important anatomical details due to their purely data-driven attention mechanisms. To address this challenge, we propose a novel LDCT denoising framework, BioAtt. The key innovation lies in attending anatomical prior distributions extracted from the pretrained vision-language model BiomedCLIP. These priors guide the denoising model to focus on anatomically relevant regions to suppress noise while preserving clinically relevant structures. We highlight three main contributions: BioAtt outperforms baseline and attention-based models in SSIM, PSNR, and RMSE across multiple anatomical regions. The framework introduces a new architectural paradigm by embedding anatomic priors directly into spatial attention. Finally, BioAtt attention maps provide visual confirmation that the improvements stem from anatomical guidance rather than increased model complexity.
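
Two of the reported metrics, RMSE and PSNR, are simple functions of the pixel-wise error between a denoised image and its reference. The sketch below computes both for flattened images, assuming intensities normalized to a known `data_range`; the pixel values are made up, and SSIM is left out because it needs windowed local statistics.

```python
import math


def rmse(reference, estimate):
    """Root-mean-square error between two equally sized pixel sequences."""
    n = len(reference)
    return math.sqrt(sum((r - e) ** 2 for r, e in zip(reference, estimate)) / n)


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    error = rmse(reference, estimate)
    if error == 0:
        return math.inf
    return 20 * math.log10(data_range / error)


# Toy 4-pixel "images" with intensities in [0, 1] (made-up numbers).
clean = [0.0, 0.5, 1.0, 0.5]
denoised = [0.1, 0.5, 0.9, 0.5]
print(round(rmse(clean, denoised), 4), round(psnr(clean, denoised), 2))
```

Lower RMSE and higher PSNR both indicate a result closer to the reference, which is why the two metrics tend to move together.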

Efficient Constant-Space Multi-Vector Retrieval

Sean MacAvaney,Antonio Mallia,Nicola Tonellotto

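Task: Propose encoding documents into a fixed number of vectors to address the high storage cost of multi-vector retrieval.

Motivation: Multi-vector retrieval methods such as ColBERT offer a strong trade-off between retrieval latency and effectiveness, but their storage cost is high because a (potentially compressed) vector must be stored for every token in the input collection.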

Details

Method: Encodes each document into a fixed number of vectors that are no longer tied to the input tokens, reducing storage costs and allowing better OS paging management. Result: Experiments with the ColBERT-v2 architecture on the MSMARCO passage corpus and BEIR show that passages can be effectively encoded into a fixed number of vectors while retaining most of the original effectiveness. Conclusion: Fixed-size document encoding substantially reduces storage costs while preserving most retrieval effectiveness. Abstract: Multi-vector retrieval methods, exemplified by the ColBERT architecture, have shown substantial promise for retrieval by providing strong trade-offs in terms of retrieval latency and effectiveness. However, they come at a high cost in terms of storage since a (potentially compressed) vector needs to be stored for every token in the input collection. To overcome this issue, we propose encoding documents to a fixed number of vectors, which are no longer necessarily tied to the input tokens. Beyond reducing the storage costs, our approach has the advantage that document representations become of a fixed size on disk, allowing for better OS paging management. Through experiments using the MSMARCO passage corpus and BEIR with the ColBERT-v2 architecture, a representative multi-vector ranking model architecture, we find that passages can be effectively encoded into a fixed number of vectors while retaining most of the original effectiveness.
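
ColBERT-style models score a document by late interaction: each query vector is matched to its most similar document vector, and the maxima are summed (often called MaxSim). The sketch below shows that scoring rule with every document holding the same number of vectors, which is the storage property the paper targets. How the paper produces those fixed vectors is not reproduced here, and the toy vectors are made up.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query vectors, of the best dot product against any document vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Two query vectors; each document is stored as exactly two vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0], [0.0, 1.0]]

scores = {"doc_a": maxsim_score(query, doc_a), "doc_b": maxsim_score(query, doc_b)}
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

Because every document occupies the same number of vectors, its on-disk record has a fixed size, so the offset of any document can be computed directly from its index.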

CLIP-SLA: Parameter-Efficient CLIP Adaptation for Continuous Sign Language Recognition

Sarah Alyami,Hamzah Luqman

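Task: Propose CLIP-SLA, a continuous sign language recognition framework that adapts CLIP's pre-trained visual encoder to sign language tasks through parameter-efficient fine-tuning (PEFT).

Motivation: Leverage CLIP's powerful visual encoder via PEFT to improve continuous sign language recognition while reducing the number of trainable parameters.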

Details

Method: Introduces two variants, SLA-Adapter and SLA-LoRA, which integrate PEFT modules into the CLIP visual encoder for efficient fine-tuning. Result: CLIP-SLA is validated on four datasets, where both variants outperform several SOTA models with fewer trainable parameters. Conclusion: The work shows the potential of adapting large-scale pre-trained models for sign language recognition and points to new directions for sign language understanding. Abstract: Continuous sign language recognition (CSLR) focuses on interpreting and transcribing sequences of sign language gestures in videos. In this work, we propose CLIP sign language adaptation (CLIP-SLA), a novel CSLR framework that adapts the powerful pre-trained visual encoder from the CLIP model to sign language tasks through parameter-efficient fine-tuning (PEFT). We introduce two variants, SLA-Adapter and SLA-LoRA, which integrate PEFT modules into the CLIP visual encoder, enabling fine-tuning with minimal trainable parameters. The effectiveness of the proposed frameworks is validated on four datasets: Phoenix2014, Phoenix2014-T, CSL-Daily, and Isharah-500, where both CLIP-SLA variants outperformed several SOTA models with fewer trainable parameters. Extensive ablation studies emphasize the effectiveness and flexibility of the proposed methods with different vision-language models for CSLR. These findings showcase the potential of adapting large-scale pre-trained models for scalable and efficient CSLR, paving the way for future advancements in sign language understanding.

PaperBench: Evaluating AI's Ability to Replicate AI Research

Giulio Starace,Oliver Jaffe,Dane Sherburn,James Aung,Jun Shern Chan,Leon Maksin,Rachel Dias,Evan Mays,Benjamin Kinsella,Wyatt Thompson,Johannes Heidecke,Amelia Glaese,Tejal Patwardhan

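Task: Evaluate the ability of AI agents to replicate state-of-the-art AI research.

Motivation: Develop a benchmark (PaperBench) for objectively evaluating how well AI agents understand and replicate complex research papers.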

Details

Method: Decomposes the replication of 20 ICML 2024 papers into 8,316 individually gradable sub-tasks and develops an LLM-based judge for automatic grading. Result: The best-performing agent tested, Claude 3.5 Sonnet (New), achieves an average replication score of 21.0%, and models do not yet outperform the human baseline. Conclusion: PaperBench provides a new tool for evaluating AI engineering capabilities, and the code is open-sourced to facilitate future research. Abstract: We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.

Overlap-Aware Feature Learning for Robust Unsupervised Domain Adaptation for 3D Semantic Segmentation

Junjie Chen,Yuecong Xu,Haosheng Li,Kemi Ding

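Task: Propose a tripartite framework that addresses the robustness of unsupervised domain adaptation (UDA) for 3D point cloud semantic segmentation (PCSS).

Motivation: Existing PCSS-UDA methods overlook their inherent vulnerability to real-world perturbations (e.g., snow, fog, rain) and adversarial distortions, with feature overlap and feature structure erosion undermining robustness.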

Details

Method: The framework comprises 1) a robustness evaluation model; 2) an invertible attention alignment module (IAAM); and 3) a contrastive memory bank with quality-aware contrastive learning. Result: On SynLiDAR-to-SemanticPOSS adaptation, mIoU improves by up to 14.3% under adversarial attack. Conclusion: The proposed framework effectively addresses robustness in PCSS-UDA, improving performance under adversarial attacks and perturbations. Abstract: 3D point cloud semantic segmentation (PCSS) is a cornerstone for environmental perception in robotic systems and autonomous driving, enabling precise scene understanding through point-wise classification. While unsupervised domain adaptation (UDA) mitigates label scarcity in PCSS, existing methods critically overlook the inherent vulnerability to real-world perturbations (e.g., snow, fog, rain) and adversarial distortions. This work first identifies two intrinsic limitations that undermine current PCSS-UDA robustness: (a) unsupervised features overlap from unaligned boundaries in shared-class regions and (b) feature structure erosion caused by domain-invariant learning that suppresses target-specific patterns. To address these problems, we propose a tripartite framework consisting of: 1) a robustness evaluation model quantifying resilience against adversarial attack/corruption types through robustness metrics; 2) an invertible attention alignment module (IAAM) enabling bidirectional domain mapping while preserving discriminative structure via attention-guided overlap suppression; and 3) a contrastive memory bank with quality-aware contrastive learning that progressively refines pseudo-labels with feature quality for more discriminative representations. Extensive experiments on SynLiDAR-to-SemanticPOSS adaptation demonstrate a maximum mIoU improvement of 14.3% under adversarial attack.

CoRAG: Collaborative Retrieval-Augmented Generation

Aashiq Muhamed,Mona Diab,Virginia Smith

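Task: Introduce and evaluate CoRAG, a framework that extends RAG to collaborative settings for knowledge-intensive tasks.

Motivation: Address how to make effective use of a shared knowledge base for knowledge-intensive tasks in collaborative settings, particularly in low-resource scenarios.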

Details

Method: Proposes CoRAG, in which clients jointly train a shared model using a collaborative passage store, and introduces the CRAB benchmark for evaluation. Result: CoRAG outperforms both parametric collaborative learning methods and locally trained RAG models in low-resource scenarios; the analysis also reveals the importance of relevant passages in the shared store and the surprising benefit of irrelevant ones. Conclusion: CoRAG demonstrates the viability of collaborative RAG while highlighting design challenges and future research directions, in particular the trade-off between leveraging a shared knowledge base and its potential risks. Abstract: Retrieval-Augmented Generation (RAG) models excel in knowledge-intensive tasks, especially under few-shot learning constraints. We introduce CoRAG, a framework extending RAG to collaborative settings, where clients jointly train a shared model using a collaborative passage store. To evaluate CoRAG, we introduce CRAB, a benchmark for collaborative homogeneous open-domain question answering. Our experiments demonstrate that CoRAG consistently outperforms both parametric collaborative learning methods and locally trained RAG models in low-resource scenarios. Further analysis reveals the critical importance of relevant passages within the shared store, the surprising benefits of incorporating irrelevant passages, and the potential for hard negatives to negatively impact performance. This introduces a novel consideration in collaborative RAG: the trade-off between leveraging a collectively enriched knowledge base and the potential risk of incorporating detrimental passages from other clients. Our findings underscore the viability of CoRAG, while also highlighting key design challenges and promising avenues for future research.

InvFussion: Bridging Supervised and Zero-shot Diffusion for Inverse Problems

Noam Elata,Hyungjin Chung,Jong Chul Ye,Tomer Michaeli,Michael Elad

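Task: Propose a framework for inverse problems that combines the strengths of training-based and zero-shot methods.

Motivation: Existing conditional synthesis methods trade performance against flexibility: training-based methods perform well but are inflexible, whereas zero-shot methods are flexible but perform worse.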

Details

Method: A novel architecture integrates the degradation operator directly into the denoiser: each block applies the operator to the network activations and conditions its output through the attention mechanism. Result: Experiments on FFHQ and ImageNet show state-of-the-art posterior-sampling performance, surpassing both training-based and zero-shot alternatives. Conclusion: The framework offers a versatile, accurate, and computationally efficient solution, demonstrating the advantages of dedicated network architectures for complex inverse problems. Abstract: Diffusion Models have demonstrated remarkable capabilities in handling inverse problems, offering high-quality posterior-sampling-based solutions. Despite significant advances, a fundamental trade-off persists, regarding the way the conditioned synthesis is employed: Training-based methods achieve high quality results, while zero-shot approaches trade this with flexibility. This work introduces a framework that combines the best of both worlds -- the strong performance of supervised approaches and the flexibility of zero-shot methods. This is achieved through a novel architectural design that seamlessly integrates the degradation operator directly into the denoiser. In each block, our proposed architecture applies the degradation operator on the network activations and conditions the output using the attention mechanism, enabling adaptation to diverse degradation scenarios while maintaining high performance. Our work demonstrates the versatility of the proposed architecture, operating as a general MMSE estimator, a posterior sampler, or a Neural Posterior Principal Component estimator. This flexibility enables a wide range of downstream tasks, highlighting the broad applicability of our framework. The proposed modification of the denoiser network offers a versatile, accurate, and computationally efficient solution, demonstrating the advantages of dedicated network architectures for complex inverse problems. Experimental results on the FFHQ and ImageNet datasets demonstrate state-of-the-art posterior-sampling performance, surpassing both training-based and zero-shot alternatives.

Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness

Haochen Wang,Yucheng Zhao,Tiancai Wang,Haoqiang Fan,Xiangyu Zhang,Zhaoxiang Zhang

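Task: Extend Large Multimodal Models (LMMs) from 2D images and videos to 3D scene understanding.

Motivation: The absence of large-scale 3D vision-language datasets has hindered progress in 3D scene understanding.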

Details

Method: Proposes Ross3D, which trains the model with 3D-aware visual supervision in the form of cross-view and global-view reconstruction. Result: Ross3D achieves state-of-the-art performance on various 3D scene understanding benchmarks and shows potential for leveraging large amounts of unlabeled 3D vision-only data. Conclusion: Ross3D offers a new perspective on 3D scene understanding and demonstrates the potential of semi-supervised learning. Abstract: The rapid development of Large Multimodal Models (LMMs) for 2D images and videos has spurred efforts to adapt these models for interpreting 3D scenes. However, the absence of large-scale 3D vision-language datasets has posed a significant obstacle. To address this issue, typical approaches focus on injecting 3D awareness into 2D LMMs by designing 3D input-level scene representations. This work provides a new perspective. We introduce reconstructive visual instruction tuning with 3D-awareness (Ross3D), which integrates 3D-aware visual supervision into the training procedure. Specifically, it incorporates cross-view and global-view reconstruction. The former requires reconstructing masked views by aggregating overlapping information from other views. The latter aims to aggregate information from all available views to recover Bird's-Eye-View images, contributing to a comprehensive overview of the entire scene. Empirically, Ross3D achieves state-of-the-art performance across various 3D scene understanding benchmarks. More importantly, our semi-supervised experiments demonstrate significant potential in leveraging large amounts of unlabeled 3D vision-only data.

GSR4B: Biomass Map Super-Resolution with Sentinel-1/2 Guidance

Kaan Karaman,Yuchang Jiang,Damien Robert,Vivien Sainte Fare Garnot,Maria João Santos,Jan Dirk Wegner

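Task: Propose a new method for high-resolution Above-Ground Biomass (AGB) estimation that leverages both high-resolution satellite observations and existing low-resolution biomass products.

Motivation: AGB mapping at large scale and high spatio-temporal resolution is essential for applications such as climate modeling, biodiversity assessment, and sustainable supply chain monitoring, yet it currently depends on costly airborne laser scanning or on coarser-resolution global biomass products.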

Details

Method: Casts the problem as Guided Super-Resolution (GSR), upsampling low-resolution biomass maps with high-resolution satellite images as guides. Result: Multi-Scale Guidance (MSG) outperforms direct regression on both regression and perception metrics and better captures high-biomass values, without significant computational overhead. Conclusion: The GSR framework is an accurate approach to high-resolution biomass mapping at scale; code and model weights are publicly available. Abstract: Accurate Above-Ground Biomass (AGB) mapping at both large scale and high spatio-temporal resolution is essential for applications ranging from climate modeling to biodiversity assessment, and sustainable supply chain monitoring. At present, fine-grained AGB mapping relies on costly airborne laser scanning acquisition campaigns usually limited to regional scales. Initiatives such as the ESA CCI map attempt to generate global biomass products from diverse spaceborne sensors but at a coarser resolution. To enable global, high-resolution (HR) mapping, several works propose to regress AGB from HR satellite observations such as ESA Sentinel-1/2 images. We propose a novel way to address HR AGB estimation, by leveraging both HR satellite observations and existing low-resolution (LR) biomass products. We cast this problem as Guided Super-Resolution (GSR), aiming at upsampling LR biomass maps (sources) from 100 to 10 m resolution, using auxiliary HR co-registered satellite images (guides). We compare super-resolving AGB maps with and without guidance, against direct regression from satellite images, on the public BioMassters dataset. We observe that Multi-Scale Guidance (MSG) outperforms direct regression both for regression (-780 t/ha RMSE) and perception (+2.0 dB PSNR) metrics, and better captures high-biomass values, without significant computational overhead. Interestingly, unlike the RGB+Depth setting they were originally designed for, our best-performing AGB GSR approaches are those that most preserve the guide image texture. Our results make a strong case for adopting the GSR framework for accurate HR biomass mapping at scale. Our code and model weights are made publicly available (https://github.com/kaankaramanofficial/GSR4B).

Advancing AI-Scientist Understanding: Making LLM Think Like a Physicist with Interpretable Reasoning

Yinggan Xu,Hana Kimlee,Yijia Xiao,Di Luo

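Task: Propose a framework that uses an interpretation module to improve the reliability and interpretability of large language models (LLMs) in physics research.

Motivation: Ensuring that LLM outputs in physics research are reliable and interpretable remains a significant challenge, and how to understand AI-generated outputs has not been adequately explored.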

Details

Method: The framework has three modules: a reasoning module, an interpretation module, and an AI-scientist interaction module. The interpretation module comprises multiple specialized agents (summarizers, model builders, UI builders, and testers) that structure LLM outputs within a physically grounded framework. Result: A case study shows that the approach enhances transparency, facilitates validation, and strengthens AI-augmented reasoning in scientific discovery. Conclusion: Through its interpretation module, the framework improves the reliability and interpretability of LLMs in physics research and offers a new approach to collaboration between AI and scientists. Abstract: Large Language Models (LLMs) are playing an expanding role in physics research by enhancing reasoning, symbolic manipulation, and numerical computation. However, ensuring the reliability and interpretability of their outputs remains a significant challenge. In our framework, we conceptualize the collaboration between AI and human scientists as a dynamic interplay among three modules: the reasoning module, the interpretation module, and the AI-scientist interaction module. Recognizing that effective physics reasoning demands rigorous logical consistency, quantitative precision, and deep integration with established theoretical models, we introduce the interpretation module to improve the understanding of AI-generated outputs, which has not previously been explored in the literature. This module comprises multiple specialized agents, including summarizers, model builders, UI builders, and testers, which collaboratively structure LLM outputs within a physically grounded framework, by constructing a more interpretable science model. A case study demonstrates that our approach enhances transparency, facilitates validation, and strengthens AI-augmented reasoning in scientific discovery.

DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

Yuxuan Luo,Zhengkun Rong,Lizhen Wang,Longhao Zhang,Tianshu Hu,Yongming Zhu

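Task: Propose DreamActor-M1, a diffusion transformer (DiT) based framework for fine-grained, multi-scale human animation with long-term temporal coherence.

Motivation: Existing image-based human animation methods fall short in fine-grained controllability, multi-scale adaptability, and long-term temporal coherence, which lowers their expressiveness and robustness.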

Details

Method: Hybrid control signals (implicit facial representations, 3D head spheres, and 3D body skeletons) provide robust control of facial expressions and body movements; a progressive training strategy handles data at varying resolutions and scales; and motion patterns from sequential frames are combined with complementary visual references to ensure long-term temporal coherence. Result: Experiments show that the method outperforms state-of-the-art work on portrait, upper-body, and full-body generation, with expressive results and robust long-term consistency. Conclusion: DreamActor-M1 achieves greater controllability, adaptability, and consistency in human animation. Abstract: While recent image-based human animation methods achieve realistic body and facial motion synthesis, critical gaps remain in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence, which leads to their lower expressiveness and robustness. We propose a diffusion transformer (DiT) based framework, DreamActor-M1, with hybrid guidance to overcome these limitations. For motion guidance, our hybrid control signals that integrate implicit facial representations, 3D head spheres, and 3D body skeletons achieve robust control of facial expressions and body movements, while producing expressive and identity-preserving animations. For scale adaptation, to handle various body poses and image scales ranging from portraits to full-body views, we employ a progressive training strategy using data with varying resolutions and scales. For appearance guidance, we integrate motion patterns from sequential frames with complementary visual references, ensuring long-term temporal coherence for unseen regions during complex movements. Experiments demonstrate that our method outperforms the state-of-the-art works, delivering expressive results for portraits, upper-body, and full-body generation with robust long-term consistency. Project Page: https://grisoon.github.io/DreamActor-M1/.

FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs

Mothilal Asokan,Kebin Wu,Fatima Albreiki

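Task: Propose FineLIP, which extends CLIP to handle longer text inputs and achieve fine-grained cross-modal alignment.

Motivation: CLIP text encoders can process only 77 text tokens, which limits their ability to handle long, detail-rich captions, and CLIP models struggle to capture fine-grained visual and textual information.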

Details

Method: FineLIP extends the positional embeddings to support longer text, dynamically aggregates local image and text tokens, and uses the aggregated results to enforce fine-grained token-to-token cross-modal alignment. Result: On zero-shot cross-modal retrieval and text-to-image generation, FineLIP outperforms existing state-of-the-art approaches. Conclusion: By improving long-text handling and fine-grained alignment, FineLIP substantially improves on CLIP. Abstract: As a pioneering vision-language model, CLIP (Contrastive Language-Image Pre-training) has achieved significant success across various domains and a wide range of downstream vision-language tasks. However, the text encoders in popular CLIP models are limited to processing only 77 text tokens, which constrains their ability to effectively handle longer, detail-rich captions. Additionally, CLIP models often struggle to effectively capture detailed visual and textual information, which hampers their performance on tasks that require fine-grained analysis. To address these limitations, we present a novel approach, FineLIP, that extends the capabilities of CLIP. FineLIP enhances cross-modal text-image mapping by incorporating Fine-grained alignment with Longer text input within the CLIP-style framework. FineLIP first extends the positional embeddings to handle longer text, followed by the dynamic aggregation of local image and text tokens. The aggregated results are then used to enforce fine-grained token-to-token cross-modal alignment. We validate our model on datasets with long, detailed captions across two tasks: zero-shot cross-modal retrieval and text-to-image generation. Quantitative and qualitative experimental results demonstrate the effectiveness of FineLIP, outperforming existing state-of-the-art approaches. Furthermore, comprehensive ablation studies validate the benefits of key design elements within FineLIP.

FIORD: A Fisheye Indoor-Outdoor Dataset with LIDAR Ground Truth for 3D Scene Reconstruction and Benchmarking

Ulas Gunes,Matias Turkulainen,Xuqian Ren,Arno Solin,Juho Kannala,Esa Rahtu

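Task: Present a fisheye image dataset designed for scene reconstruction tasks.

Motivation: Existing large-scale 3D scene reconstruction and novel view synthesis methods rely mainly on perspective image datasets with narrow fields of view (FoV), which require large image sets and extensive structure-from-motion (SfM) processing, limiting scalability.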

Details

Method: The dataset is captured with dual 200-degree fisheye lenses, provides 360-degree coverage of 5 indoor and 5 outdoor scenes, and includes sparse SfM point clouds and precise dense LIDAR point clouds as geometric ground truth. Result: The dataset supports robust benchmarking under challenging conditions such as occlusions and reflections, and suits a range of scene reconstruction, novel view synthesis, and image-based rendering methods. Conclusion: The dataset offers a more efficient and scalable basis for scene reconstruction tasks. Abstract: The development of large-scale 3D scene reconstruction and novel view synthesis methods mostly relies on datasets comprising perspective images with narrow fields of view (FoV). While effective for small-scale scenes, these datasets require large image sets and extensive structure-from-motion (SfM) processing, limiting scalability. To address this, we introduce a fisheye image dataset tailored for scene reconstruction tasks. Using dual 200-degree fisheye lenses, our dataset provides full 360-degree coverage of 5 indoor and 5 outdoor scenes. Each scene has sparse SfM point clouds and precise LIDAR-derived dense point clouds that can be used as geometric ground-truth, enabling robust benchmarking under challenging conditions such as occlusions and reflections. While the baseline experiments focus on vanilla Gaussian Splatting and NeRF-based Nerfacto methods, the dataset supports diverse approaches for scene reconstruction, novel view synthesis, and image-based rendering.

The LLM Wears Prada: Analysing Gender Bias and Stereotypes through Online Shopping Data

Massimiliano Luca,Ciro Beneduce,Bruno Lepri,Jacopo Staiano

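Task: Evaluate whether large language models (LLMs) can predict an individual's gender from online shopping histories, and analyze whether those predictions are influenced by gender biases and stereotypes.

Motivation: As LLMs are adopted widely across domains, the statistical correlations in their training data may hide biases, gender bias in particular. This study examines the issue through online shopping histories.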

Details

Method: Using a dataset of online purchase histories from users in the United States, the study evaluates the gender classification ability of six LLMs and analyzes their reasoning and product-gender co-occurrences. Result: The models infer gender with moderate accuracy, but their decisions often rest on stereotypical associations between product categories and gender. Explicit instructions to avoid bias reduce prediction certainty but do not eliminate stereotypical patterns. Conclusion: The study shows how persistent gender bias is in LLMs and underlines the need for more effective bias-mitigation strategies. Abstract: With the wide and cross-domain adoption of Large Language Models, it becomes crucial to assess to what extent the statistical correlations in training data, which underlie their impressive performance, hide subtle and potentially troubling biases. Gender bias in LLMs has been widely investigated from the perspectives of occupations, hobbies, and emotions typically associated with a specific gender. In this study, we introduce a novel perspective. We investigate whether LLMs can predict an individual's gender based solely on online shopping histories and whether these predictions are influenced by gender biases and stereotypes. Using a dataset of historical online purchases from users in the United States, we evaluate the ability of six LLMs to classify gender and we then analyze their reasoning and product-gender co-occurrences. Results indicate that while models can infer gender with moderate accuracy, their decisions are often rooted in stereotypical associations between product categories and gender. Furthermore, explicit instructions to avoid bias reduce the certainty of model predictions, but do not eliminate stereotypical patterns. Our findings highlight the persistent nature of gender biases in LLMs and emphasize the need for robust bias-mitigation strategies.

AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization

Chaohu Liu,Tianyi Gui,Yu Liu,Linli Xu

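Task: Propose AdPO, a new adversarial defense strategy based on preference optimization, to strengthen the robustness of large vision-language models (LVLMs) against adversarial attacks.

Motivation: Despite notable progress in real-world applications, LVLMs inherit the sensitivity of visual neural networks and are vulnerable to adversarial attacks that produce erroneous or malicious outputs. Existing adversarial fine-tuning methods improve robustness but often degrade performance on clean inputs.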

Details

Method: Adversarial training is reframed as a preference optimization problem; by modifying only the image encoder (e.g., CLIP ViT), the model's preference for generating normal outputs on clean inputs is strengthened while misleading outputs for adversarial examples are rejected. Result: AdPO achieves superior clean and adversarial performance across a variety of downstream tasks and, by training on smaller LVLMs and then transferring to larger models, maintains efficiency comparable to baseline methods. Conclusion: AdPO offers a new perspective for future adversarial defense research, and comprehensive experiments confirm its effectiveness. Abstract: Large Vision-Language Models (LVLMs), such as GPT-4o and LLaVA, have recently witnessed remarkable advancements and are increasingly being deployed in real-world applications. However, inheriting the sensitivity of visual neural networks, LVLMs remain vulnerable to adversarial attacks, which can result in erroneous or malicious outputs. While existing efforts utilize adversarial fine-tuning to enhance robustness, they often suffer from performance degradation on clean inputs. In this paper, we propose AdPO, a novel adversarial defense strategy for LVLMs based on preference optimization. For the first time, we reframe adversarial training as a preference optimization problem, aiming to enhance the model's preference for generating normal outputs on clean inputs while rejecting the potentially misleading outputs for adversarial examples. Notably, AdPO achieves this by solely modifying the image encoder, e.g., CLIP ViT, resulting in superior clean and adversarial performance in a variety of downstream tasks. Considering that training involves large language models (LLMs), the computational cost increases significantly. We validate that training on smaller LVLMs and subsequently transferring to larger models can achieve competitive performance while maintaining efficiency comparable to baseline methods. Our comprehensive experiments confirm the effectiveness of the proposed AdPO, which provides a novel perspective for future adversarial defense research.

Understanding Cross-Model Perceptual Invariances Through Ensemble Metamers

Lukas Boehm,Jonas Leo Mueller,Christoffer Loeffler,Leo Schwinn,Bjoern Eskofier,Dario Zanca

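Task: Study the perceptual invariances of artificial neural networks and explore them by generating metamers.

Motivation: Improve model explainability and align models with human vision.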

Details

Method: Metamers are generated with ensembles of artificial neural networks (including convolutional neural networks and vision transformers), and image-based metrics assess their semantic fidelity and naturalness. Result: Convolutional neural networks generate more recognizable and human-like metamers, while vision transformers produce more realistic but less transferable ones. Conclusion: Architectural biases have a marked effect on representational invariances. Abstract: Understanding the perceptual invariances of artificial neural networks is essential for improving explainability and aligning models with human vision. Metamers - stimuli that are physically distinct yet produce identical neural activations - serve as a valuable tool for investigating these invariances. We introduce a novel approach to metamer generation by leveraging ensembles of artificial neural networks, capturing shared representational subspaces across diverse architectures, including convolutional neural networks and vision transformers. To characterize the properties of the generated metamers, we employ a suite of image-based metrics that assess factors such as semantic fidelity and naturalness. Our findings show that convolutional neural networks generate more recognizable and human-like metamers, while vision transformers produce realistic but less transferable metamers, highlighting the impact of architectural biases on representational invariances.

Bridge the Gap between SNN and ANN for Image Restoration

Xin Su,Chen Wu,Zhuoran Zheng

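Task: Propose a new technique, asymmetric framework (ANN-SNN) distillation, to accelerate SNN training and improve SNN performance.

Motivation: Conventional ANNs consume a lot of energy on image restoration tasks, whereas SNNs are energy-efficient but costly to train and slow to converge.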

Details

Method: Intermediate features (feature maps) of the ANN serve as hints to guide the SNN's training. Result: The SNN-based image restoration model has only 1/300 of the teacher network's parameters and 1/50 of its energy consumption, yet matches the teacher on some denoising tasks. Conclusion: ANN-SNN distillation effectively addresses the low training efficiency of SNNs while improving their performance. Abstract: Models of dense prediction based on traditional Artificial Neural Networks (ANNs) require a lot of energy, especially for image restoration tasks. Currently, neural networks based on the SNN (Spiking Neural Network) framework are beginning to make their mark in the field of image restoration, especially as they typically use less than 10% of the energy of ANNs with the same architecture. However, training an SNN is much more expensive than training an ANN, due to the use of the heuristic gradient descent strategy. In other words, the process of the SNN's membrane potential signal changing from sparse to dense is very slow, which affects the convergence of the whole model. To tackle this problem, we propose a novel distillation technique, called asymmetric framework (ANN-SNN) distillation, in which the teacher is an ANN and the student is an SNN. Specifically, we leverage the intermediate features (feature maps) learned by the ANN as hints to guide the training process of the SNN. This approach not only accelerates the convergence of the SNN but also improves its final performance, effectively bridging the gap between the efficiency of the SNN and the superior learning capabilities of ANN. Extensive experimental results show that our designed SNN-based image restoration model, which has only 1/300 the number of parameters of the teacher network and 1/50 the energy consumption of the teacher network, is as good as the teacher network in some denoising tasks.
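
Illustrative sketch (not from the paper): the abstract says ANN feature maps act as hints for the SNN but gives no loss formula. A common form of hint-based distillation penalizes the mean squared difference between teacher and student features. The sketch below assumes that form, and also assumes the spiking student's features are averaged over time steps into firing rates before comparison. Both function names are hypothetical.

```python
def firing_rate(spike_trains):
    """Average binary spikes over time steps: [T][N] -> [N].

    Assumption: rate coding is used to turn spikes into dense features.
    """
    steps = len(spike_trains)
    return [sum(step[i] for step in spike_trains) / steps
            for i in range(len(spike_trains[0]))]


def hint_loss(teacher_features, student_features):
    """Mean squared error between flattened teacher and student features."""
    if len(teacher_features) != len(student_features):
        raise ValueError("feature maps must have the same number of elements")
    squared = [(t - s) ** 2 for t, s in zip(teacher_features, student_features)]
    return sum(squared) / len(squared)


# Toy example: 4 time steps, 3 neurons (values are made up).
spikes = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
student = firing_rate(spikes)   # [1.0, 0.25, 0.5]
teacher = [1.0, 0.25, 1.0]      # hypothetical ANN activations
loss = hint_loss(teacher, student)
```

In training, a term like this would be added to the restoration loss so that the student is pulled toward the teacher's intermediate representation as well as the target image.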

Dual-stream Transformer-GCN Model with Contextualized Representations Learning for Monocular 3D Human Pose Estimation

Mingrui Ye,Lianping Yang,Hegui Zhu,Zenghao Zheng,Xin Wang,Yantao Lo

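Task: Propose a new method for monocular 3D human pose estimation based on a Transformer-GCN dual-stream model.

Motivation: Address depth ambiguity, limited 3D-labeled data, imbalanced modeling, and insufficient model generalization in monocular 3D human pose estimation.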

Details

Method: A motion pre-training method based on contextualized representation learning masks 2D pose features and uses a Transformer-GCN dual-stream model to learn high-dimensional representations in a self-distillation setup. Result: The model reaches state-of-the-art performance on two benchmark datasets (MPJPE of 38.0mm and P-MPJPE of 31.9mm on Human3.6M; MPJPE of 15.9mm on MPI-INF-3DHP), and its robustness and generalization are confirmed on public datasets and in-the-wild videos. Conclusion: Through contextualized representation learning and spatial-temporal modeling, the method markedly improves generalization and makes a breakthrough in balancing global and local interactions. Abstract: This paper introduces a novel approach to monocular 3D human pose estimation using contextualized representation learning with the Transformer-GCN dual-stream model. Monocular 3D human pose estimation is challenged by depth ambiguity, limited 3D-labeled training data, imbalanced modeling, and restricted model generalization. To address these limitations, our work introduces a groundbreaking motion pre-training method based on contextualized representation learning. Specifically, our method involves masking 2D pose features and utilizing a Transformer-GCN dual-stream model to learn high-dimensional representations through a self-distillation setup. By focusing on contextualized representation learning and spatial-temporal modeling, our approach enhances the model's ability to understand spatial-temporal relationships between postures, resulting in superior generalization. Furthermore, leveraging the Transformer-GCN dual-stream model, our approach effectively balances global and local interactions in video pose estimation. The model adaptively integrates information from both the Transformer and GCN streams, where the GCN stream effectively learns local relationships between adjacent key points and frames, while the Transformer stream captures comprehensive global spatial and temporal features. Our model achieves state-of-the-art performance on two benchmark datasets, with an MPJPE of 38.0mm and P-MPJPE of 31.9mm on Human3.6M, and an MPJPE of 15.9mm on MPI-INF-3DHP. Furthermore, visual experiments on public datasets and in-the-wild videos demonstrate the robustness and generalization capabilities of our approach.
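
Illustrative sketch (not from the paper): MPJPE, the metric quoted above, is the mean Euclidean distance between predicted and ground-truth 3D joint positions. The toy joints below are made up. P-MPJPE additionally aligns the prediction to the ground truth with a rigid Procrustes transform before measuring, which this sketch does not implement.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean per-joint position error over paired 3D joints (same units as input)."""
    if len(predicted) != len(ground_truth):
        raise ValueError("both poses must have the same number of joints")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Toy two-joint pose in millimetres (values are made up).
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
truth = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error_mm = mpjpe(pred, truth)  # joint errors are 3 mm and 4 mm
```

On a benchmark the value is averaged over all frames and joints, so 38.0mm on Human3.6M means the predicted joints sit on average 38 mm from their annotated positions.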

Memory-efficient Low-latency Remote Photoplethysmography through Temporal-Spatial State Space Duality

Kegang Wang,Jiankai Tang,Yuxuan Fan,Jiatong Ji,Yuanchun Shi,Yuntao Wang

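Task: Propose ME-rPPG, a memory-efficient remote photoplethysmography (rPPG) algorithm that resolves the trilemma of model scalability, cross-dataset generalization, and real-time constraints.

Motivation: Deep learning brings performance gains to rPPG but at prohibitive computational cost, creating computational bottlenecks.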

Details

Method: Built on temporal-spatial state space duality, the algorithm uses a transferable state space to capture periodic variations across facial frames efficiently while keeping computational overhead low. Result: ME-rPPG performs strongly across multiple datasets (MAEs of 5.38, 0.70, and 0.25), improving on baselines by 21.3% to 60.2%, with only 3.6 MB of memory usage and 9.46 ms latency. Conclusion: ME-rPPG clearly outperforms existing methods in both accuracy and user satisfaction and is suitable for real-world deployment. Abstract: Remote photoplethysmography (rPPG), enabling non-contact physiological monitoring through facial light reflection analysis, faces critical computational bottlenecks as deep learning introduces performance gains at the cost of prohibitive resource demands. This paper proposes ME-rPPG, a memory-efficient algorithm built on temporal-spatial state space duality, which resolves the trilemma of model scalability, cross-dataset generalization, and real-time constraints. Leveraging a transferable state space, ME-rPPG efficiently captures subtle periodic variations across facial frames while maintaining minimal computational overhead, enabling training on extended video sequences and supporting low-latency inference. Achieving cross-dataset MAEs of 5.38 (MMPD), 0.70 (VitalVideo), and 0.25 (PURE), ME-rPPG outperforms all baselines with improvements ranging from 21.3% to 60.2%. Our solution enables real-time inference with only 3.6 MB memory usage and 9.46 ms latency -- surpassing existing methods by 19.5%-49.7% accuracy and 43.2% user satisfaction gains in real-world deployments. The code and demos are released for reproducibility at https://github.com/Health-HCI-Group/ME-rPPG-demo.

UniViTAR: Unified Vision Transformer with Native Resolution

Limeng Qiao,Yiyang Gan,Bairui Wang,Jie Qin,Shuang Xu,Siqi Yang,Lin Ma

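Task: Introduce UniViTAR, a family of vision foundation models for unified visual modality and native-resolution scenarios.

Motivation: Conventional Vision Transformers simplify visual modeling by ignoring the variability of natural visual data, which compromises spatial-contextual fidelity, and existing approaches lack systematic analysis.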

Details

Method: Architectural upgrades are combined with a progressive training paradigm (resolution curriculum learning and visual modality adaptation) and a hybrid training framework (contrastive loss and feature distillation). Result: Effectiveness is validated across multiple model scales (0.3B to 1B). Conclusion: UniViTAR performs well in unified visual modality and native-resolution scenarios while being trained only on public datasets. Abstract: Conventional Vision Transformers simplify visual modeling by standardizing input resolutions, often disregarding the variability of natural visual data and compromising spatial-contextual fidelity. While preliminary explorations have superficially investigated native resolution modeling, existing approaches still lack systematic analysis from a visual representation perspective. To bridge this gap, we introduce UniViTAR, a family of homogeneous vision foundation models tailored for unified visual modality and native resolution scenarios in the multimodal era. Our framework first conducts architectural upgrades to the vanilla paradigm by integrating multiple advanced components. Building upon these improvements, a progressive training paradigm is introduced, which strategically combines two core mechanisms: (1) resolution curriculum learning, transitioning from fixed-resolution pretraining to native resolution tuning, thereby leveraging ViT's inherent adaptability to variable-length sequences, and (2) visual modality adaptation via inter-batch image-video switching, which balances computational efficiency with enhanced temporal reasoning. In parallel, a hybrid training framework further synergizes sigmoid-based contrastive loss with feature distillation from a frozen teacher model, thereby accelerating early-stage convergence. Finally, with models trained exclusively on public datasets, extensive experiments across multiple model scales from 0.3B to 1B demonstrate its effectiveness.
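
Illustrative sketch (not from the paper): the abstract mentions a sigmoid-based contrastive loss without giving its formula. The sketch below assumes the pairwise form popularized by SigLIP, in which every image-text pair in a batch is scored independently as a match (+1) or a non-match (-1). The temperature and bias values are arbitrary, and the function names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_contrastive_loss(image_emb, text_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over an n-by-n batch of normalized embeddings.

    Assumption: row i of image_emb is the true match of row i of text_emb.
    """
    n = len(image_emb)
    total = 0.0
    for i in range(n):
        for j in range(n):
            similarity = sum(a * b for a, b in zip(image_emb[i], text_emb[j]))
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (temperature * similarity + bias))
    return -total / n


# Toy batch of two unit-norm embeddings (values are made up).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = sigmoid_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = sigmoid_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Because each pair is scored on its own, the loss needs no softmax normalization across the batch, which is one reason this family of losses is often chosen for large-batch training.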

Spatial-R1: Enhancing MLLMs in Video Spatial Reasoning

Kun Ouyang

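Task: Improve the spatial reasoning ability of multimodal large language models (MLLMs) for video understanding.

Motivation: Spatial reasoning is crucial for video understanding but remains challenging.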

Details

Method: Spatial-R1 makes two key contributions: (1) SR, a new video spatial reasoning dataset built from ScanNet with automatically generated QA pairs across seven task types; (2) fine-tuning with Task-Specific Group Relative Policy Optimization (GRPO). Result: On the VSI-Bench benchmark, Spatial-R1 improves significantly, gaining 7.4% over the baseline and outperforming strong contemporary models. Conclusion: The work validates specialized data curation and optimization techniques for improving complex spatial reasoning in video MLLMs. Abstract: Enhancing the spatial reasoning capabilities of Multi-modal Large Language Models (MLLMs) for video understanding is crucial yet challenging. We present Spatial-R1, a targeted approach involving two key contributions: the curation of SR, a new video spatial reasoning dataset from ScanNet with automatically generated QA pairs across seven task types, and the application of Task-Specific Group Relative Policy Optimization (GRPO) for fine-tuning. By training the Qwen2.5-VL-7B-Instruct model on SR using GRPO, Spatial-R1 significantly advances performance on the VSI-Bench benchmark, achieving a 7.4% gain over the baseline and outperforming strong contemporary models. This work validates the effectiveness of specialized data curation and optimization techniques for improving complex spatial reasoning in video MLLMs.
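
Illustrative sketch (not from the paper): the core idea of GRPO is to score each sampled response relative to the other responses drawn for the same prompt, instead of using a learned value function. The sketch below shows only that generic group normalization step. The paper's task-specific rewards are not described in the abstract, so the reward values here are made up.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one question; 1.0 means correct (made-up rewards).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average get a positive advantage and are reinforced; a group in which all answers score the same contributes no learning signal.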

Implicit Bias Injection Attacks against Text-to-Image Diffusion Models

Huayang Huang,Xiangye Jin,Jiaxu Miao,Yu Wu

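Task: Study implicit bias in text-to-image diffusion models (T2I DMs) and an attack framework for injecting it (IBI-Attacks).

Motivation: T2I models can generate content with specific tendencies that influence public perception, yet existing research focuses on explicit biases and overlooks implicit biases that lack explicit visual features.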

Details

Method: The implicit bias injection attack framework (IBI-Attacks) precomputes a general bias direction in the prompt embedding space and adaptively adjusts it for different inputs. Result: Experiments validate that the framework introduces bias through subtle and diverse modifications while preserving the original semantics. Conclusion: The concealment and transferability of implicit bias underscore the significance of this research. Abstract: The proliferation of text-to-image diffusion models (T2I DMs) has led to an increased presence of AI-generated images in daily life. However, biased T2I models can generate content with specific tendencies, potentially influencing people's perceptions. Intentional exploitation of these biases risks conveying misleading information to the public. Current research on bias primarily addresses explicit biases with recognizable visual patterns, such as skin color and gender. This paper introduces a novel form of implicit bias that lacks explicit visual features but can manifest in diverse ways across various semantic contexts. This subtle and versatile nature makes this bias challenging to detect, easy to propagate, and adaptable to a wide range of scenarios. We further propose an implicit bias injection attack framework (IBI-Attacks) against T2I diffusion models by precomputing a general bias direction in the prompt embedding space and adaptively adjusting it based on different inputs. Our attack module can be seamlessly integrated into pre-trained diffusion models in a plug-and-play manner without direct manipulation of user input or model retraining. Extensive experiments validate the effectiveness of our scheme in introducing bias through subtle and diverse modifications while preserving the original semantics. The strong concealment and transferability of our attack across various scenarios further underscore the significance of our approach. Code is available at https://github.com/Hannah1102/IBI-attacks.

Prompting Medical Vision-Language Models to Mitigate Diagnosis Bias by Generating Realistic Dermoscopic Images

Nusrat Munia,Abdullah-Al-Zubaer Imran

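Task: Propose DermDiT, a generative AI-based framework that generates new dermoscopic images to address bias in skin disease diagnosis.

Motivation: Existing AI models for skin disease diagnosis perform unevenly across subgroups (such as skin color), so dataset representativeness needs to improve.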

Details

Method: Vision-language models generate text prompts, and multimodal text-image learning is used to generate new dermoscopic images. Result: Experiments show that DermDiT generates high-quality images and improves the representation of minority groups in datasets. Conclusion: By generating synthetic images, the DermDiT framework effectively mitigates bias in skin disease diagnosis. Abstract: Artificial Intelligence (AI) in skin disease diagnosis has improved significantly, but a major concern is that these models frequently show biased performance across subgroups, especially regarding sensitive attributes such as skin color. To address these issues, we propose a novel generative AI-based framework, namely, Dermatology Diffusion Transformer (DermDiT), which leverages text prompts generated via Vision Language Models and multimodal text-image learning to generate new dermoscopic images. We utilize large vision language models to generate accurate and proper prompts for each dermoscopic image, which helps to generate synthetic images to improve the representation of underrepresented groups (patient, disease, etc.) in highly imbalanced datasets for clinical diagnoses. Our extensive experimentation shows that large vision language models provide much more insightful representations, which enable DermDiT to generate high-quality images. Our code is available at https://github.com/Munia03/DermDiT

BOGausS: Better Optimized Gaussian Splatting

Stéphane Pateux,Matthieu Gendrin,Luce Morin,Théo Ladune,Xiaoran Jiang

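Task: Propose a new optimization methodology (BOGausS) that produces lighter 3D Gaussian Splatting (3DGS) models while preserving quality.

Motivation: Although 3DGS delivers fast, high-fidelity rendering, building smaller models without sacrificing quality remains a challenge.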

Details

Method: The 3DGS training process is analyzed carefully and a new optimization methodology (BOGausS) is proposed. Result: BOGausS generates models up to ten times lighter than the original 3DGS with no quality degradation. Conclusion: BOGausS significantly boosts the performance of Gaussian Splatting relative to the state of the art. Abstract: 3D Gaussian Splatting (3DGS) proposes an efficient solution for novel view synthesis. Its framework provides fast and high-fidelity rendering. Although it is less complex than other solutions such as Neural Radiance Fields (NeRF), challenges remain in building smaller models without sacrificing quality. In this study, we perform a careful analysis of the 3DGS training process and propose a new optimization methodology. Our Better Optimized Gaussian Splatting (BOGausS) solution is able to generate models up to ten times lighter than the original 3DGS with no quality degradation, thus significantly boosting the performance of Gaussian Splatting compared to the state of the art.

CoMatcher: Multi-View Collaborative Feature Matching

Jintao Zhang,Zimin Xia,Mingyue Dong,Shuhan Shen,Linwei Yue,Xianwei Zheng

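Task: Propose a multi-view collaborative matching strategy for reliable track construction in complex scenarios.

Motivation: In image-set matching, the pairwise matching paradigm yields ambiguous estimates when the independent pairs exhibit significant occlusions or extreme viewpoint changes, mainly because of the inherent uncertainty of interpreting complex 3D structures from limited two-view observations.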

Details

Method: CoMatcher, a deep multi-view matcher, uses complementary context cues from different views to form a holistic 3D scene understanding and exploits cross-view projection consistency to infer a reliable global solution. Result: Extensive experiments on a variety of complex scenarios show that the method outperforms the mainstream two-view matching paradigm. Conclusion: The multi-view collaborative matching strategy effectively handles track construction in complex scenarios and improves matching reliability. Abstract: This paper proposes a multi-view collaborative matching strategy for reliable track construction in complex scenarios. We observe that the pairwise matching paradigms applied to image set matching often result in ambiguous estimation when the selected independent pairs exhibit significant occlusions or extreme viewpoint changes. This challenge primarily stems from the inherent uncertainty in interpreting intricate 3D structures based on limited two-view observations, as the 3D-to-2D projection leads to significant information loss. To address this, we introduce CoMatcher, a deep multi-view matcher to (i) leverage complementary context cues from different views to form a holistic 3D scene understanding and (ii) utilize cross-view projection consistency to infer a reliable global solution. Building on CoMatcher, we develop a groupwise framework that fully exploits cross-view relationships for large-scale matching tasks. Extensive experiments on various complex scenarios demonstrate the superiority of our method over the mainstream two-view matching paradigm.

A Diffusion-Based Framework for Occluded Object Movement

Zheng-Peng Duan,Jiawei Zhang,Siyu Liu,Zheng Lin,Chun-Le Guo,Dongqing Zou,Jimmy Ren,Chongyi Li

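Task: Propose DiffOOM, a diffusion-based framework for moving occluded objects in images.

Motivation: Existing image editing methods struggle with moving occluded objects, and occlusion in real-world images adds to the challenge.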

Details

Method: DiffOOM uses two parallel branches for object de-occlusion and movement: the de-occlusion branch completes the occluded portion with a background color-fill strategy and a continuously updated object mask; the movement branch places the object at the target location through latent optimization and uses local text-conditioned guidance to blend it into the new surroundings. Result: Experimental evaluations and a user study show that DiffOOM delivers superior performance. Conclusion: The DiffOOM framework effectively addresses occluded object movement and shows the potential of diffusion models for such tasks. Abstract: Seamlessly moving objects within a scene is a common requirement for image editing, but it is still a challenge for existing editing methods. Especially for real-world images, the occlusion situation further increases the difficulty. The main difficulty is that the occluded portion needs to be completed before movement can proceed. To leverage the real-world knowledge embedded in the pre-trained diffusion models, we propose a Diffusion-based framework specifically designed for Occluded Object Movement, named DiffOOM. The proposed DiffOOM consists of two parallel branches that perform object de-occlusion and movement simultaneously. The de-occlusion branch utilizes a background color-fill strategy and a continuously updated object mask to focus the diffusion process on completing the obscured portion of the target object. Concurrently, the movement branch employs latent optimization to place the completed object in the target location and adopts local text-conditioned guidance to integrate the object into new surroundings appropriately. Extensive evaluations demonstrate the superior performance of our method, which is further validated by a comprehensive user study.

GMAI-VL-R1: Harnessing Reinforcement Learning for Multimodal Medical Reasoning

Yanzhou Su,Tianbin Li,Jiyao Liu,Chenglong Ma,Junzhi Ning,Cheng Tang,Sibo Ju,Jin Ye,Pengcheng Chen,Ming Hu,Shixiang Tang,Lihao Liu,Bin Fu,Wenqi Shao,Xiaowei Hu,Xiangwen Liao,Yuanfeng Ji,Junjun He

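Task: Present GMAI-VL-R1, a multimodal medical reasoning model enhanced by reinforcement learning to improve reasoning in complex medical decision-making.

Motivation: Existing general medical AI models lack sufficient reasoning ability for complex medical decision-making.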

Details

Method: Decision-making is optimized through iterative reinforcement learning (RL) training, and a reasoning data synthesis method generates step-by-step reasoning data via rejection sampling. Result: Experiments show that after RL training, GMAI-VL-R1 excels at tasks such as medical image diagnosis and visual question answering. Conclusion: The work establishes new evaluation benchmarks for medical reasoning models and paves the way for future progress. Abstract: Recent advances in general medical AI have made significant strides, but existing models often lack the reasoning capabilities needed for complex medical decision-making. This paper presents GMAI-VL-R1, a multimodal medical reasoning model enhanced by reinforcement learning (RL) to improve its reasoning abilities. Through iterative training, GMAI-VL-R1 optimizes decision-making, significantly boosting diagnostic accuracy and clinical support. We also develop a reasoning data synthesis method, generating step-by-step reasoning data via rejection sampling, which further enhances the model's generalization. Experimental results show that after RL training, GMAI-VL-R1 excels in tasks such as medical image diagnosis and visual question answering. While the model demonstrates basic memorization with supervised fine-tuning, RL is crucial for true generalization. Our work establishes new evaluation benchmarks and paves the way for future advancements in medical reasoning models. Code, data, and model will be released at https://github.com/uni-medical/GMAI-VL-R1.
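
Illustrative sketch (not from the paper): rejection sampling for reasoning data usually means drawing several candidate reasoning traces per question and keeping only those whose final answer matches a trusted reference. The abstract does not give the paper's acceptance rule, so exact answer matching is an assumption here, and the generator below is a hypothetical stand-in for a model call.

```python
from itertools import cycle


def rejection_sample(question, reference_answer, generate, n_samples=8):
    """Keep only sampled reasoning traces that end in the reference answer."""
    kept = []
    for _ in range(n_samples):
        reasoning, answer = generate(question)
        if answer == reference_answer:  # assumed acceptance rule: exact match
            kept.append({"question": question,
                         "reasoning": reasoning,
                         "answer": answer})
    return kept


def make_toy_generator():
    """Hypothetical stand-in for a model: cycles through canned outputs."""
    canned = cycle([("trace-1", "A"), ("trace-2", "B"),
                    ("trace-3", "A"), ("trace-4", "C")])
    return lambda question: next(canned)


accepted = rejection_sample("toy question", "A", make_toy_generator(), n_samples=4)
```

The accepted traces can then serve as step-by-step supervision, since each one is at least consistent with the known answer even though its intermediate steps are not individually checked.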

Is Temporal Prompting All We Need For Limited Labeled Action Recognition?

Shreyank N Gowda,Boyan Gao,Xiao Gu,Xiaobo Jin

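Task: Present TP-CLIP, an adaptation of CLIP that uses temporal visual prompting for temporal modeling of video data while preserving its generalization ability.

Motivation: Video understanding currently depends on large-scale labeled datasets; vision-language models excel at zero-shot tasks, but applying them directly to video is computationally intensive and struggles with temporal modeling.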

Details

Method: TP-CLIP adapts CLIP through temporal visual prompting without modifying its core architecture, efficiently reusing its pre-trained capabilities. Result: Across multiple datasets, TP-CLIP performs strongly in zero-shot and few-shot learning, with better computational efficiency and fewer parameters than existing methods. Conclusion: TP-CLIP is an efficient video understanding method with strong generalization that outperforms existing techniques. Abstract: Video understanding has shown remarkable improvements in recent years, largely dependent on the availability of large-scale labeled datasets. Recent advancements in visual-language models, especially based on contrastive pretraining, have shown remarkable generalization in zero-shot tasks, helping to overcome this dependence on labeled datasets. Adaptations of such models for videos typically involve modifying the architecture of vision-language models to cater to video data. However, this is not trivial, since such adaptations are mostly computationally intensive and struggle with temporal modeling. We present TP-CLIP, an adaptation of CLIP that leverages temporal visual prompting for temporal adaptation without modifying the core CLIP architecture. This preserves its generalization abilities. TP-CLIP efficiently integrates into the CLIP architecture, leveraging its pre-trained capabilities for video data. Extensive experiments across various datasets demonstrate its efficacy in zero-shot and few-shot learning, outperforming existing approaches with fewer parameters and lower computational cost. In particular, we use just 1/3 the GFLOPs and 1/28 the number of tuneable parameters in comparison to recent state-of-the-art and still outperform it by up to 15.8% depending on the task and dataset.

Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness

Haochen Wang,Yucheng Zhao,Tiancai Wang,Haoqiang Fan,Xiangyu Zhang,Zhaoxiang Zhang

Task: Adapting Large Multimodal Models (LMMs) for interpreting 3D scenes by introducing reconstructive visual instruction tuning with 3D-awareness (Ross3D).

Motivation: The lack of large-scale 3D vision-language datasets has hindered the adaptation of 2D LMMs for 3D scene understanding.

Details

Method: Ross3D integrates 3D-aware visual supervision through cross-view and global-view reconstruction tasks. Result: Ross3D achieves state-of-the-art performance on 3D scene understanding benchmarks and shows potential in leveraging unlabeled 3D data. Conclusion: Ross3D provides an effective approach to enhance 3D scene understanding by incorporating 3D-aware supervision, demonstrating significant potential for semi-supervised learning. Abstract: The rapid development of Large Multimodal Models (LMMs) for 2D images and videos has spurred efforts to adapt these models for interpreting 3D scenes. However, the absence of large-scale 3D vision-language datasets has posed a significant obstacle. To address this issue, typical approaches focus on injecting 3D awareness into 2D LMMs by designing 3D input-level scene representations. This work provides a new perspective. We introduce reconstructive visual instruction tuning with 3D-awareness (Ross3D), which integrates 3D-aware visual supervision into the training procedure. Specifically, it incorporates cross-view and global-view reconstruction. The former requires reconstructing masked views by aggregating overlapping information from other views. The latter aims to aggregate information from all available views to recover Bird's-Eye-View images, contributing to a comprehensive overview of the entire scene. Empirically, Ross3D achieves state-of-the-art performance across various 3D scene understanding benchmarks. More importantly, our semi-supervised experiments demonstrate significant potential in leveraging large amounts of unlabeled 3D vision-only data.

Equivariant Spherical CNNs for Accurate Fiber Orientation Distribution Estimation in Neonatal Diffusion MRI with Reduced Acquisition Time

Haykel Snoussi,Davood Karimi

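Task: Propose a rotationally equivariant spherical convolutional neural network (sCNN) framework that predicts the fiber orientation distribution (FOD) from multi-shell dMRI signals acquired with a reduced set of gradient directions.

Motivation: Early, accurate assessment of neonatal brain microstructure is crucial for identifying neurodevelopmental disorders, but low signal-to-noise ratio, motion artifacts, and ongoing myelination make it challenging.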

Details

Method: The sCNN is trained and evaluated on 43 neonatal dMRI datasets from the Developing Human Connectome Project (dHCP) and compared with a multi-layer perceptron (MLP) baseline. Result: The sCNN achieves significantly lower mean squared error (MSE) and higher angular correlation coefficient (ACC) in FOD estimation, and tractography based on sCNN-predicted FODs shows better anatomical plausibility, coverage, and coherence than tractography based on the MLP's predictions. Conclusion: With their inherent rotational equivariance, sCNNs offer a promising approach to accurate and clinically efficient dMRI analysis, helping improve the diagnosis and characterization of early brain development. Abstract: Early and accurate assessment of brain microstructure using diffusion Magnetic Resonance Imaging (dMRI) is crucial for identifying neurodevelopmental disorders in neonates, but remains challenging due to low signal-to-noise ratio (SNR), motion artifacts, and ongoing myelination. In this study, we propose a rotationally equivariant Spherical Convolutional Neural Network (sCNN) framework tailored for neonatal dMRI. We predict the Fiber Orientation Distribution (FOD) from multi-shell dMRI signals acquired with a reduced set of gradient directions (30% of the full protocol), enabling faster and more cost-effective acquisitions. We train and evaluate the performance of our sCNN using real data from 43 neonatal dMRI datasets provided by the Developing Human Connectome Project (dHCP). Our results demonstrate that the sCNN achieves significantly lower mean squared error (MSE) and higher angular correlation coefficient (ACC) compared to a Multi-Layer Perceptron (MLP) baseline, indicating improved accuracy in FOD estimation. Furthermore, tractography results based on the sCNN-predicted FODs show improved anatomical plausibility, coverage, and coherence compared to those from the MLP. These findings highlight that sCNNs, with their inherent rotational equivariance, offer a promising approach for accurate and clinically efficient dMRI analysis, paving the way for improved diagnostic capabilities and characterization of early brain development.

Runhui Huang,Chunwei Wang,Junwei Yang,Guansong Lu,Yunlong Yuan,Jianhua Han,Lu Hou,Wei Zhang,Lanqing Hong,Hengshuang Zhao,Hang Xu

Task: Present ILLUME+, a model that uses dual visual tokenization and a diffusion decoder to improve both deep semantic understanding and high-fidelity image generation.

Motivation: Existing unified models struggle to handle the three fundamental capabilities of understanding, generation, and editing at the same time; ILLUME+ aims to address this.

Details

Method: Introduce the DualViTok dual visual tokenizer, use a diffusion model as the image detokenizer, and adopt a progressive training strategy. Result: ILLUME+ (3B) performs competitively with existing unified MLLMs and specialized models on multimodal understanding, generation, and editing benchmarks. Conclusion: ILLUME+ provides a scalable and versatile foundation for future multimodal applications. Abstract: We present ILLUME+ that leverages dual visual tokenization and a diffusion decoder to improve both deep semantic understanding and high-fidelity image generation. Existing unified models have struggled to simultaneously handle the three fundamental capabilities in a unified model: understanding, generation, and editing. Models like Chameleon and EMU3 utilize VQGAN for image discretization, but due to the lack of deep semantic interaction, they lag behind specialist models like LLaVA in visual understanding tasks. To mitigate this, LaViT and ILLUME employ semantic encoders for tokenization, but they struggle with image editing due to poor texture preservation. Meanwhile, the Janus series decouples the input and output image representations, limiting its ability to seamlessly handle interleaved image-text understanding and generation. In contrast, ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves both fine-grained textures and text-aligned semantics while enabling a coarse-to-fine image representation strategy for multimodal understanding and generation. Additionally, we employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution. ILLUME+ follows a continuous-input, discrete-output scheme within the unified MLLM and adopts a progressive training procedure that supports dynamic resolution across the vision tokenizer, MLLM, and diffusion decoder. This design allows for flexible and efficient context-aware image editing and generation across diverse tasks. ILLUME+ (3B) exhibits competitive performance against existing unified MLLMs and specialized models across multimodal understanding, generation, and editing benchmarks. With its strong performance, ILLUME+ provides a scalable and versatile foundation for future multimodal applications.
Project Page: https://illume-unified-mllm.github.io/.

End-to-End Driving with Online Trajectory Evaluation via BEV World Model

Yingyan Li,Yuqi Wang,Yang Liu,Jiawei He,Lue Fan,Zhaoxiang Zhang

Task: Propose WoTE, an end-to-end autonomous driving framework that uses a BEV world model for trajectory evaluation.

Motivation: Forecasting the future outcomes of a given trajectory makes trajectory evaluation more effective, which helps ensure driving safety.

Details

Method: Use a BEV world model to predict future BEV states for trajectory evaluation. The model has low latency compared with image-level world models and can be supervised seamlessly with off-the-shelf BEV-space traffic simulators. Result: Achieves state-of-the-art performance on the NAVSIM and Bench2Drive benchmarks. Conclusion: With an efficient BEV world model, WoTE significantly improves the effectiveness and safety of trajectory evaluation for autonomous driving. Abstract: End-to-end autonomous driving has achieved remarkable progress by integrating perception, prediction, and planning into a fully differentiable framework. Yet, to fully realize its potential, an effective online trajectory evaluation is indispensable to ensure safety. By forecasting the future outcomes of a given trajectory, trajectory evaluation becomes much more effective. This goal can be achieved by employing a world model to capture environmental dynamics and predict future states. Therefore, we propose an end-to-end driving framework WoTE, which leverages a BEV World model to predict future BEV states for Trajectory Evaluation. The proposed BEV world model is latency-efficient compared to image-level world models and can be seamlessly supervised using off-the-shelf BEV-space traffic simulators. We validate our framework on both the NAVSIM benchmark and the closed-loop Bench2Drive benchmark based on the CARLA simulator, achieving state-of-the-art performance. Code is released at https://github.com/liyingyanUCAS/WoTE.

Image Difference Grounding with Natural Language

Wenxuan Wang,Zijia Zhao,Yisi Zhang,Yepeng Tang,Erdong Hu,Xinlong Wang,Jing Liu

Task: Propose the Image Difference Grounding (IDG) task, which aims to precisely localize visual differences based on user instructions.

Motivation: Existing visual grounding methods are limited to single-image interpretation and cannot detect subtle but meaningful visual differences in multi-image scenarios such as automatic surveillance, while existing image difference understanding methods either lack cross-modal text guidance or provide only coarse-grained descriptions.

Details

Method: Introduce the DiffGround dataset and the DiffTracker baseline model, which combines feature differential enhancement and common suppression to precisely locate differences. Result: Experiments show the importance of the DiffGround dataset in enabling finer-grained image difference understanding. Conclusion: The DiffGround dataset and DiffTracker model will be publicly released to foster future research. Abstract: Visual grounding (VG) typically focuses on locating regions of interest within an image using natural language, and most existing VG methods are limited to single-image interpretations. This limits their applicability in real-world scenarios like automatic surveillance, where detecting subtle but meaningful visual differences across multiple images is crucial. Besides, previous work on image difference understanding (IDU) has either focused on detecting all change regions without cross-modal text guidance, or on providing coarse-grained descriptions of differences. Therefore, to push towards finer-grained vision-language perception, we propose Image Difference Grounding (IDG), a task designed to precisely localize visual differences based on user instructions. We introduce DiffGround, a large-scale and high-quality dataset for IDG, containing image pairs with diverse visual variations along with instructions querying fine-grained differences. Besides, we present a baseline model for IDG, DiffTracker, which effectively integrates feature differential enhancement and common suppression to precisely locate differences. Experiments on the DiffGround dataset highlight the importance of our IDG dataset in enabling finer-grained IDU. To foster future research, both DiffGround data and DiffTracker model will be publicly released.

Deep Representation Learning for Unsupervised Clustering of Myocardial Fiber Trajectories in Cardiac Diffusion Tensor Imaging

Mohini Anand,Xavier Tricoche

Task: Propose a novel deep learning framework for unsupervised clustering of myocardial fibers, identifying distinct fiber bundles in a data-driven way.

Motivation: Existing methods struggle to capture the complex structure of the myocardium from Diffusion Tensor Imaging (DTI) data, particularly because ground-truth labels are lacking and fiber trajectories are ambiguous and intertwined.

Details

Method: Combine a Bidirectional Long Short-Term Memory (BiLSTM) network, which captures local sequential information along fibers, with a Transformer autoencoder that learns global shape features, incorporate anatomical context, and cluster the resulting representations with a density-based algorithm. Result: Identifies 33 to 62 robust clusters that capture subtle distinctions in fiber trajectories, reaching a level of fiber-bundle delineation that, to the authors' knowledge, has not been achieved before. Conclusion: The framework offers a new, flexible, and quantitative way to analyze myocardial structure, with potential to improve surgical planning, characterize disease-related remodeling, and advance personalized cardiac care. Abstract: Understanding the complex myocardial architecture is critical for diagnosing and treating heart disease. However, existing methods often struggle to accurately capture this intricate structure from Diffusion Tensor Imaging (DTI) data, particularly due to the lack of ground truth labels and the ambiguous, intertwined nature of fiber trajectories. We present a novel deep learning framework for unsupervised clustering of myocardial fibers, providing a data-driven approach to identifying distinct fiber bundles. We uniquely combine a Bidirectional Long Short-Term Memory network to capture local sequential information along fibers, with a Transformer autoencoder to learn global shape features, with pointwise incorporation of essential anatomical context. Clustering these representations using a density-based algorithm identifies 33 to 62 robust clusters, successfully capturing the subtle distinctions in fiber trajectories with varying levels of granularity. Our framework offers a new, flexible, and quantitative way to analyze myocardial structure, achieving a level of delineation that, to our knowledge, has not been previously achieved, with potential applications in improving surgical planning, characterizing disease-related remodeling, and ultimately, advancing personalized cardiac care.

Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities

Jing Liu,Wenxuan Wang,Yisi Zhang,Yepeng Tang,Xingjian He,Longteng Guo,Tongtian Yue,Xinlong Wang

Task: Introduce the multi-granularity referring expression segmentation (MRES) task and develop UniRES++, a unified model for multi-granularity RES.

Motivation: Real-world scenarios require referring expression segmentation over targets at multiple granularities (multi-object, single-object, or part-level), but existing datasets and models focus mainly on object-level localization and lack the data resources and unified frameworks needed for multi-granularity RES.

Details

Method: Introduce the MRES task and the RefCOCOm benchmark, build the MRES-32M dataset, and propose UniRES++, which integrates object-level and part-level RES tasks. Result: UniRES++ achieves state-of-the-art performance on multiple benchmarks, including RefCOCOm, gRefCOCO, and the RefCOCO series. Conclusion: The RefCOCOm benchmark, MRES-32M dataset, and UniRES++ model advance research on multi-granularity visual grounding. Abstract: Referring expression segmentation (RES) aims at segmenting the entities' masks that match the descriptive language expression. While traditional RES methods primarily address object-level grounding, real-world scenarios demand a more versatile framework that can handle multiple levels of target granularity, such as multi-object, single object or part-level references. This introduces great challenges due to the diverse and nuanced ways users describe targets. However, existing datasets and models mainly focus on designing grounding specialists for object-level target localization, lacking the necessary data resources and unified frameworks for the more practical multi-grained RES. In this paper, we take a step further towards a RES task unified across visual granularities. To overcome the limitation of data scarcity, we introduce a new multi-granularity referring expression segmentation (MRES) task, alongside the RefCOCOm benchmark, which includes part-level annotations for advancing finer-grained visual understanding. In addition, we create MRES-32M, the largest visual grounding dataset, comprising over 32.2M masks and captions across 1M images, specifically designed for part-level vision-language grounding. To tackle the challenges of multi-granularity RES, we propose UniRES++, a unified multimodal large language model that integrates object-level and part-level RES tasks. UniRES++ incorporates targeted designs for fine-grained visual feature exploration. With the joint model architecture and parameters, UniRES++ achieves state-of-the-art performance across multiple benchmarks, including RefCOCOm for MRES, gRefCOCO for generalized RES, and RefCOCO, RefCOCO+, RefCOCOg for classic RES.
To foster future research into multi-grained visual grounding, our RefCOCOm benchmark, MRES-32M dataset and model UniRES++ will be publicly available at https://github.com/Rubics-Xuan/MRES.

Scene-Centric Unsupervised Panoptic Segmentation

Oliver Hahn,Christoph Reich,Nikita Araslanov,Daniel Cremers,Christian Rupprecht,Stefan Roth

Task: Propose an unsupervised panoptic segmentation method that needs no human annotations and trains directly on scene-centric imagery.

Motivation: Eliminate the dependence on object-centric training data and enable unsupervised understanding of complex scenes.

Details

Method: Combine visual representations, depth, and motion cues to generate high-resolution panoptic pseudo labels, then apply pseudo-label training and a panoptic self-training strategy. Result: On Cityscapes, the method improves panoptic quality (PQ) for unsupervised panoptic segmentation by 9.4 percentage points over the previous state of the art. Conclusion: The method accurately predicts panoptic segmentation of complex scenes without human annotations and significantly improves segmentation quality. Abstract: Unsupervised panoptic segmentation aims to partition an image into semantically meaningful regions and distinct object instances without training on manually annotated data. In contrast to prior work on unsupervised panoptic scene understanding, we eliminate the need for object-centric training data, enabling the unsupervised understanding of complex scenes. To that end, we present the first unsupervised panoptic method that directly trains on scene-centric imagery. In particular, we propose an approach to obtain high-resolution panoptic pseudo labels on complex scene-centric data, combining visual representations, depth, and motion cues. Utilizing both pseudo-label training and a panoptic self-training strategy yields a novel approach that accurately predicts panoptic segmentation of complex scenes without requiring any human annotations. Our approach significantly improves panoptic quality, e.g., surpassing the recent state of the art in unsupervised panoptic segmentation on Cityscapes by 9.4 percentage points in PQ.
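
For readers unfamiliar with the PQ metric reported here, the sketch below computes it from the standard definition: the sum of IoUs over matched segment pairs divided by |TP| + 0.5|FP| + 0.5|FN|, where a predicted and a ground-truth segment count as matched when their IoU exceeds 0.5. The function name and the example numbers are hypothetical and are not results from the paper.

```python
def panoptic_quality(matched_ious, num_false_pos, num_false_neg):
    """PQ for one class.

    matched_ious: IoU of each matched (true-positive) segment pair, each > 0.5.
    num_false_pos: predicted segments with no match.
    num_false_neg: ground-truth segments with no match.
    """
    num_true_pos = len(matched_ious)
    denominator = num_true_pos + 0.5 * num_false_pos + 0.5 * num_false_neg
    if denominator == 0:
        return 0.0  # no segments at all; returning 0 is a convention chosen here
    return sum(matched_ious) / denominator


# Hypothetical example: two matches, one spurious prediction, one missed segment.
example_pq = panoptic_quality([0.8, 0.6], num_false_pos=1, num_false_neg=1)
```

Since PQ lies in [0, 1] and is usually reported as a percentage, a gain of 9.4 points is an absolute difference in that percentage, not a relative improvement.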

VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step

Hanyang Wang,Fangfu Liu,Jiawei Chi,Yueqi Duan

Task: Recover 3D scenes from sparse views.

Motivation: Conventional methods degrade when input views overlap little, while existing video generative models can produce plausible 3D structure but suffer from slow inference and a lack of 3D constraints.

Details

Method: Propose VideoScene, which generates 3D scenes in one step using a 3D-aware leap flow distillation strategy and a dynamic denoising policy network. Result: Experiments show that VideoScene is faster and produces better 3D scene generation results than previous video diffusion models. Conclusion: VideoScene is an efficient tool with potential to advance future video-to-3D applications. Abstract: Recovering 3D scenes from sparse views is a challenging task because the problem is inherently ill-posed. Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic models) to mitigate the issue. However, they still suffer from performance degradation when input views have minimal overlap and insufficient visual information. Fortunately, recent video generative models show promise in addressing this challenge as they are capable of generating video clips with plausible 3D structures. Powered by large pretrained video diffusion models, some pioneering research has started to explore the potential of video generative priors to create 3D scenes from sparse views. Despite impressive improvements, these approaches are limited by slow inference time and the lack of 3D constraints, leading to inefficiencies and reconstruction artifacts that do not align with real-world geometric structure. In this paper, we propose VideoScene to distill the video diffusion model to generate 3D scenes in one step, aiming to build an efficient and effective tool to bridge the gap from video to 3D. Specifically, we design a 3D-aware leap flow distillation strategy to leap over time-consuming redundant information and train a dynamic denoising policy network to adaptively determine the optimal leap timestep during inference. Extensive experiments demonstrate that our VideoScene achieves faster and superior 3D scene generation results than previous video diffusion models, highlighting its potential as an efficient tool for future video-to-3D applications. Project Page: https://hanyang-21.github.io/VideoScene

GaussianLSS -- Toward Real-world BEV Perception: Depth Uncertainty Estimation via Gaussian Splatting

Shu-Wei Lu,Yi-Hsuan Tsai,Yi-Ting Chen

Task: Propose GaussianLSS, a bird's-eye view (BEV) perception framework based on uncertainty modeling, to improve perception for autonomous driving tasks.

Motivation: Existing projection-based methods lack uncertainty modeling and are computationally expensive, which limits their use in real-world applications.

Details

Method: GaussianLSS models depth uncertainty by learning a soft depth mean and computing the variance of the depth distribution, then transforms the depth distribution into 3D Gaussians and rasterizes them to build uncertainty-aware BEV features. Result: On nuScenes, GaussianLSS achieves state-of-the-art performance among unprojection-based methods; compared with projection-based methods it runs 2.5x faster and uses 0.3x less memory, with only a 0.4% IoU difference. Conclusion: GaussianLSS is an efficient, uncertainty-aware BEV perception framework suited to autonomous driving tasks. Abstract: Bird's-eye view (BEV) perception has gained significant attention because it provides a unified representation to fuse multiple view images and enables a wide range of downstream autonomous driving tasks, such as forecasting and planning. Recent state-of-the-art models utilize projection-based methods which formulate BEV perception as query learning to bypass explicit depth estimation. While we observe promising advancements in this paradigm, these methods still fall short of real-world applications because of the lack of uncertainty modeling and expensive computational requirements. In this work, we introduce GaussianLSS, a novel uncertainty-aware BEV perception framework that revisits unprojection-based methods, specifically the Lift-Splat-Shoot (LSS) paradigm, and enhances them with depth uncertainty modeling. GaussianLSS represents spatial dispersion by learning a soft depth mean and computing the variance of the depth distribution, which implicitly captures object extents. We then transform the depth distribution into 3D Gaussians and rasterize them to construct uncertainty-aware BEV features. We evaluate GaussianLSS on the nuScenes dataset, achieving state-of-the-art performance compared to unprojection-based methods. In particular, it provides significant advantages in speed, running 2.5x faster, and in memory efficiency, using 0.3x less memory compared to projection-based methods, while achieving competitive performance with only a 0.4% IoU difference.
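
The "soft depth mean" and "variance of the depth distribution" mentioned above can be illustrated for a single pixel. This is a minimal sketch, assuming the network outputs a categorical distribution over discrete depth bins (as in Lift-Splat-Shoot); the function name, bin values, and probabilities are hypothetical, and the step that turns these moments into 3D Gaussians is not shown.

```python
def depth_mean_and_variance(bin_probs, bin_depths):
    """First two moments of a per-pixel categorical depth distribution.

    bin_probs: non-negative weights over depth bins (normalized here).
    bin_depths: depth value, in metres, at the centre of each bin.
    """
    total = sum(bin_probs)
    probs = [p / total for p in bin_probs]
    mean = sum(p * d for p, d in zip(probs, bin_depths))
    variance = sum(p * (d - mean) ** 2 for p, d in zip(probs, bin_depths))
    return mean, variance


# Hypothetical pixel whose depth is split evenly between 10 m and 20 m.
mean_depth, depth_var = depth_mean_and_variance([0.5, 0.5], [10.0, 20.0])
```

A sharply peaked distribution gives a small variance (a tight Gaussian along the camera ray), while an ambiguous one like the example spreads the feature over a wider depth range.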

Diffusion-Guided Gaussian Splatting for Large-Scale Unconstrained 3D Reconstruction and Novel View Synthesis

Niluthpol Chowdhury Mithun,Tuan Pham,Qiao Wang,Ben Southall,Kshitij Minhas,Bogdan Matei,Stephan Mandt,Supun Samarasekera,Rakesh Kumar

Task: Propose GS-Diff, a novel 3DGS framework guided by a multi-view diffusion model, to address 3D reconstruction and novel view synthesis in large-scale, unconstrained environments.

Motivation: Existing methods such as 3DGS and NeRF perform poorly in large-scale, unconstrained environments, mainly because of sparse and uneven input coverage, transient occlusions, appearance variability, and inconsistent camera settings.

Details

Method: GS-Diff uses a multi-view diffusion model to generate pseudo-observations, turning under-constrained 3D reconstruction problems into well-posed ones, and integrates appearance embedding, monocular depth priors, dynamic object modeling, anisotropy regularization, and advanced rasterization techniques. Result: On four benchmarks, GS-Diff significantly outperforms state-of-the-art baselines. Conclusion: Guided by a multi-view diffusion model and supported by several enhancements, GS-Diff effectively addresses 3D reconstruction in large-scale, unconstrained environments. Abstract: Recent advancements in 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) have achieved impressive results in real-time 3D reconstruction and novel view synthesis. However, these methods struggle in large-scale, unconstrained environments where sparse and uneven input coverage, transient occlusions, appearance variability, and inconsistent camera settings lead to degraded quality. We propose GS-Diff, a novel 3DGS framework guided by a multi-view diffusion model to address these limitations. By generating pseudo-observations conditioned on multi-view inputs, our method transforms under-constrained 3D reconstruction problems into well-posed ones, enabling robust optimization even with sparse data. GS-Diff further integrates several enhancements, including appearance embedding, monocular depth priors, dynamic object modeling, anisotropy regularization, and advanced rasterization techniques, to tackle geometric and photometric challenges in real-world settings. Experiments on four benchmarks demonstrate that GS-Diff consistently outperforms state-of-the-art baselines by significant margins.

Learning from Streaming Video with Orthogonal Gradients

Tengda Han,Dilara Gokay,Joseph Heyward,Chuhan Zhang,Daniel Zoran,Viorica Pătrăucean,João Carreira,Dima Damen,Andrew Zisserman

Task: Study self-supervised representation learning from a continuous video stream.

Motivation: Standard video learning chops and shuffles videos to satisfy the independent and identically distributed (IID) assumption, but this assumption breaks for a continuous video stream, which degrades performance.

Details

Method: Propose a geometric modification to standard optimizers that decorrelates batches using orthogonal gradients; it applies to any optimizer and is demonstrated with SGD and AdamW. Result: In three scenarios (DoRA, VideoMAE, and future video prediction), the proposed orthogonal optimizer outperforms AdamW. Conclusion: The orthogonal optimizer alleviates the drop in representation learning performance when training on continuous video streams. Abstract: We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner. This differs from the standard approaches to video learning where videos are chopped and shuffled during training in order to create a non-redundant batch that satisfies the independent and identically distributed (IID) sample assumption expected by conventional training paradigms. When videos are only available as a continuous stream of input, the IID assumption is evidently broken, leading to poor performance. We demonstrate the drop in performance when moving from shuffled to sequential learning on three tasks: the one-video representation learning method DoRA, standard VideoMAE on multi-video datasets, and the task of future video prediction. To address this drop, we propose a geometric modification to standard optimizers, to decorrelate batches by utilising orthogonal gradients during training. The proposed modification can be applied to any optimizer; we demonstrate it with Stochastic Gradient Descent (SGD) and AdamW. Our proposed orthogonal optimizer allows models trained from streaming videos to alleviate the drop in representation learning performance, as evaluated on downstream tasks. In all three scenarios (DoRA, VideoMAE, future prediction), our orthogonal optimizer outperforms the strong AdamW baseline.
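
The abstract does not spell out the update rule, so the following is only one plausible reading of "orthogonal gradients": before each optimizer step, remove from the current gradient its component along the previous step's gradient, so that consecutive, highly correlated batches do not keep pushing in the same direction. The function name and the choice of the previous gradient as the reference direction are assumptions for illustration, not the authors' method.

```python
def orthogonalize(grad, prev_grad, eps=1e-12):
    """Project `grad` onto the subspace orthogonal to `prev_grad`.

    Both arguments are flat lists of floats (a flattened parameter gradient).
    Assumption: the reference direction is simply the previous step's gradient.
    """
    prev_sq_norm = sum(p * p for p in prev_grad)
    if prev_sq_norm < eps:
        return list(grad)  # nothing to project out
    scale = sum(g * p for g, p in zip(grad, prev_grad)) / prev_sq_norm
    return [g - scale * p for g, p in zip(grad, prev_grad)]


# Hypothetical 2-D example: the shared x-component is removed.
projected = orthogonalize([1.0, 1.0], [1.0, 0.0])
```

The projected gradient would then be passed to the base optimizer (SGD or AdamW in the abstract) in place of the raw gradient.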

Diagnosis of Pulmonary Hypertension by Integrating Multimodal Data with a Hybrid Graph Convolutional and Transformer Network

Fubao Zhu,Yang Zhang,Gengmin Liang,Jiaofen Nan,Yanting Li,Chuang Han,Danyang Sun,Zhiguo Wang,Chen Zhao,Wenxuan Zhou,Jian He,Yi Xu,Iokfai Cheang,Xu Zhu,Yanli Zhou,Weihua Zhou

Task: Develop and validate a deep learning-based diagnostic model that classifies patients as non-pulmonary hypertension (non-PH), pre-capillary PH, or post-capillary PH.

Motivation: Early and accurate diagnosis of pulmonary hypertension (PH) is essential for patient management, and distinguishing pre-capillary from post-capillary PH is critical for guiding treatment decisions.

Details

Method: A deep learning model combining graph convolutional networks (GCN), convolutional neural networks (CNN), and Transformers processes multimodal data, including short-axis sequences, four-chamber sequences, and clinical parameters. Result: On the test set the model achieves AUC = 0.81 ± 0.06 and accuracy = 0.73 ± 0.06; per-class AUCs are 0.74 ± 0.11 (non-PH), 0.86 ± 0.06 (pre-capillary PH), and 0.83 ± 0.10 (post-capillary PH). Conclusion: By effectively integrating multimodal data, the model has the potential to support clinical decision-making and help physicians make accurate and timely diagnoses. Abstract: Early and accurate diagnosis of pulmonary hypertension (PH) is essential for optimal patient management. Differentiating between pre-capillary and post-capillary PH is critical for guiding treatment decisions. This study develops and validates a deep learning-based diagnostic model for PH, designed to classify patients as non-PH, pre-capillary PH, or post-capillary PH. This retrospective study analyzed data from 204 patients (112 with pre-capillary PH, 32 with post-capillary PH, and 60 non-PH controls) at the First Affiliated Hospital of Nanjing Medical University. Diagnoses were confirmed through right heart catheterization. We selected 6 samples from each category for the test set (18 samples, 10%), with the remaining 186 samples used for the training set. This process was repeated 35 times for testing. This paper proposes a deep learning model that combines Graph convolutional networks (GCN), Convolutional neural networks (CNN), and Transformers. The model was developed to process multimodal data, including short-axis (SAX) sequences, four-chamber (4CH) sequences, and clinical parameters. Our model achieved a performance of Area under the receiver operating characteristic curve (AUC) = 0.81 ± 0.06 (standard deviation) and Accuracy (ACC) = 0.73 ± 0.06 on the test set. The discriminative abilities were as follows: non-PH subjects (AUC = 0.74 ± 0.11), pre-capillary PH (AUC = 0.86 ± 0.06), and post-capillary PH (AUC = 0.83 ± 0.10). The model has the potential to support clinical decision-making by effectively integrating multimodal data to assist physicians in making accurate and timely diagnoses.
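
The evaluation protocol described in the abstract (6 test samples per class, repeated 35 times) can be sketched as a repeated class-balanced random split. The class counts come from the abstract; the function name, label strings, and the use of the repetition index as the random seed are assumptions, and whether the authors sampled independently on each repetition is not stated.

```python
import random


def balanced_split(labels, per_class_test, seed):
    """Hold out `per_class_test` random samples of every class as the test set."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    test = []
    for label in sorted(by_class):
        test.extend(rng.sample(by_class[label], per_class_test))
    held_out = set(test)
    train = [i for i in range(len(labels)) if i not in held_out]
    return train, sorted(test)


# Cohort sizes from the abstract: 112 pre-capillary, 32 post-capillary, 60 non-PH.
labels = ["pre-capillary PH"] * 112 + ["post-capillary PH"] * 32 + ["non-PH"] * 60
splits = [balanced_split(labels, per_class_test=6, seed=rep) for rep in range(35)]
```

Each repetition yields 186 training and 18 test indices. The reported mean ± standard deviation would then be taken over the 35 test-set scores. Note that 18 of 204 patients is roughly 9% of the cohort, which the abstract reports as 10%.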

Mesh Compression with Quantized Neural Displacement Fields

Sai Karthikey Pentapati,Gregoire Phillips,Alan C. Bovik

Task: Extend implicit neural representations (INRs) to compress 3D triangle meshes.

Motivation: Existing INR methods compress structured data (such as SDFs, voxel grids, images, videos, and audio) well, but have seen limited application to unstructured data such as 3D meshes and point clouds.

Details

Method: Propose a simple yet effective method in which a small neural network encodes a displacement field that refines a coarse version of the 3D mesh. Result: The method preserves intricate geometric textures and demonstrates state-of-the-art performance at compression ratios from 4x to 380x. Conclusion: The method extends INRs to unstructured data and offers an efficient solution for compressing it. Abstract: Implicit neural representations (INRs) have been successfully used to compress a variety of 3D surface representations such as Signed Distance Functions (SDFs), voxel grids, and also other forms of structured data such as images, videos, and audio. However, these methods have been limited in their application to unstructured data such as 3D meshes and point clouds. This work presents a simple yet effective method that extends the usage of INRs to compress 3D triangle meshes. Our method uses a small neural network to encode a displacement field that refines a coarse version of the 3D mesh surface to be compressed. Once trained, the neural network weights occupy much less memory than the displacement field or the original surface. We show that our method is capable of preserving intricate geometric textures and demonstrates state-of-the-art performance for compression ratios ranging from 4x to 380x.

Novel sparse PCA method via Runge Kutta numerical method(s) for face recognition

Loc Hoang Tran,Luong Anh Tuan Nguyen

Task: Explore sparse Principal Component Analysis (PCA) for face recognition and compare its performance with standard PCA.

Motivation: Face recognition has broad applications in data science and biometric security, but standard PCA may not meet high accuracy requirements.

Details

Method: Solve sparse PCA with the Proximal Gradient method or Runge-Kutta numerical methods, and classify with k-nearest neighbors or kernel ridge regression. Result: Experiments show that sparse PCA combined with a classifier is more accurate than standard PCA, and that the Runge-Kutta approach is faster than the Proximal Gradient method. Conclusion: Sparse PCA performs better for face recognition, and the Runge-Kutta approach is more computationally efficient. Abstract: Face recognition is a crucial topic in data science and biometric security, with applications spanning military, finance, and retail industries. This paper explores the implementation of sparse Principal Component Analysis (PCA) using the Proximal Gradient method (also known as ISTA) and the Runge-Kutta numerical methods. To address the face recognition problem, we integrate sparse PCA with either the k-nearest neighbor method or the kernel ridge regression method. Experimental results demonstrate that combining sparse PCA (solved via the Proximal Gradient method or the Runge-Kutta numerical approach) with a classification system yields higher accuracy compared to standard PCA. Additionally, we observe that the Runge-Kutta-based sparse PCA computation consistently outperforms the Proximal Gradient method in terms of speed.
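
The Proximal Gradient method (ISTA) named in the abstract alternates a gradient step on the smooth part of the objective with a soft-thresholding step that handles the L1 penalty. The sketch below shows that generic update only; it assumes the caller supplies the gradient of the smooth term, and it does not reproduce the paper's specific sparse PCA objective or its Runge-Kutta variant. The function names are hypothetical.

```python
def soft_threshold(value, tau):
    """Proximal operator of tau * |x|: shrink toward zero by tau."""
    if value > tau:
        return value - tau
    if value < -tau:
        return value + tau
    return 0.0


def ista_step(x, grad, step_size, l1_weight):
    """One ISTA update: gradient step on the smooth term, then shrinkage.

    x, grad: flat lists of floats; grad is the gradient of the smooth term at x.
    """
    return [
        soft_threshold(xi - step_size * gi, step_size * l1_weight)
        for xi, gi in zip(x, grad)
    ]
```

The shrinkage step is what produces exact zeros in the loadings, which is the property that distinguishes sparse PCA from standard PCA.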

An Integrated AI-Enabled System Using One Class Twin Cross Learning (OCT-X) for Early Gastric Cancer Detection

Xian-Xian Liu,Yuanyuan Wei,Mingkun Xu,Yongze Guo,Hongwei Zhang,Huicong Dong,Qun Song,Qi Zhao,Wei Luo,Feng Tien,Juntao Gao,Simon Fong

Task: Propose the OCT-X algorithm, combining hardware and software technologies, to improve the accuracy and efficiency of early gastric cancer detection.

Motivation: Current early gastric cancer detection technologies suffer from high rates of misdiagnosis and missed diagnoses, so more accurate and efficient solutions are needed.

Details

Method: Use the One Class Twin Cross Learning (OCT-X) algorithm, with a fast double-threshold grid search strategy (FDT-GS) and a patch-based deep fully convolutional network, together with an integrated point-of-care testing (POCT) hardware device. Result: The system achieves 99.70% diagnostic accuracy, outperforming existing models by up to 4.47%, and shows a 10% improvement in multirate adaptability. Conclusion: OCT-X and the integrated system show potential for clinical diagnostics, offering a more accurate, efficient, and less invasive approach to early gastric cancer detection; future research will explore broader applications. Abstract: Early detection of gastric cancer, a leading cause of cancer-related mortality worldwide, remains hampered by the limitations of current diagnostic technologies, leading to high rates of misdiagnosis and missed diagnoses. To address these challenges, we propose an integrated system that synergizes advanced hardware and software technologies to balance speed and accuracy. Our study introduces the One Class Twin Cross Learning (OCT-X) algorithm. Leveraging a novel fast double-threshold grid search strategy (FDT-GS) and a patch-based deep fully convolutional network, OCT-X maximizes diagnostic accuracy through real-time data processing and seamless lesion surveillance. The hardware component includes an all-in-one point-of-care testing (POCT) device with high-resolution imaging sensors, real-time data processing, and wireless connectivity, facilitated by the NI CompactDAQ and LabVIEW software. Our integrated system achieved an unprecedented diagnostic accuracy of 99.70%, significantly outperforming existing models by up to 4.47%, and demonstrated a 10% improvement in multirate adaptability. These findings underscore the potential of OCT-X as well as the integrated system in clinical diagnostics, offering a path toward more accurate, efficient, and less invasive early gastric cancer detection. Future research will explore broader applications, further advancing oncological diagnostics. Code is available at https://github.com/liu37972/Multirate-Location-on-OCT-X-Learning.git.

Articulated Kinematics Distillation from Video Diffusion Models

Xuan Li,Qianli Ma,Tsung-Yi Lin,Yongxin Chen,Chenfanfu Jiang,Ming-Yu Liu,Donglai Xiang

Task: Present Articulated Kinematics Distillation (AKD), a framework for generating high-fidelity character animations.

Motivation: Merge the strengths of skeleton-based animation and modern generative models to overcome the difficulty 4D neural deformation fields have in preserving shape consistency.

Details

Method: Use a skeleton-based representation and distill complex motions from pre-trained video diffusion models through Score Distillation Sampling (SDS). Result: AKD achieves superior 3D consistency and motion quality on text-to-4D generation. Conclusion: AKD synthesizes consistent motion efficiently while maintaining structural integrity, and is compatible with physics-based simulation. Abstract: We present Articulated Kinematics Distillation (AKD), a framework for generating high-fidelity character animations by merging the strengths of skeleton-based animation and modern generative models. AKD uses a skeleton-based representation for rigged 3D assets, drastically reducing the Degrees of Freedom (DoFs) by focusing on joint-level control, which allows for efficient, consistent motion synthesis. Through Score Distillation Sampling (SDS) with pre-trained video diffusion models, AKD distills complex, articulated motions while maintaining structural integrity, overcoming challenges faced by 4D neural deformation fields in preserving shape consistency. This approach is naturally compatible with physics-based simulation, ensuring physically plausible interactions. Experiments show that AKD achieves superior 3D consistency and motion quality compared with existing works on text-to-4D generation. Project page: https://research.nvidia.com/labs/dir/akd/

Lightweight Deep Models for Dermatological Disease Detection: A Study on Instance Selection and Channel Optimization

Ian Mateos Gonzalez,Estefani Jaramilla Nava,Abraham Sánchez Morales,Jesús García-Ramírez,Ricardo Ramos-Aguilar

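Task: Propose a methodology for preprocessing the dermaMNIST dataset to improve its quality for the classification stage.

Motivation: Identifying dermatological disease is an important problem in Mexico, and many existing works use datasets directly without studying the behavior of the data, especially in the medical imaging domain.

Details

Method: Preprocess the dermaMNIST dataset and classify it with lightweight convolutional neural networks. Result: Reduces the number of training instances while obtaining performance similar to that of models such as ResNet. Conclusion: The method maintains performance with less data and is suitable for dermatological disease classification. Abstract: The identification of dermatological disease is an important problem in Mexico according to several studies. Several works in the literature use datasets from different repositories without studying the behavior of the data, especially in the medical imaging domain. In this work, we propose a methodology to preprocess the dermaMNIST dataset in order to improve its quality for the classification stage, where we use lightweight convolutional neural networks. In our results, we reduce the number of instances for the neural network training while obtaining performance similar to that of models such as ResNet.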

Prompting Forgetting: Unlearning in GANs via Textual Guidance

Piyush Nagasubramaniam,Neeraj Karamchandani,Chen Wu,Sencun Zhu

Task: Propose Text-to-Unlearn, a novel framework that selectively unlearns concepts from pre-trained Generative Adversarial Networks (GANs) using only text prompts.

Motivation: Generative models have powerful image-generation capabilities but raise ethical and legal challenges, especially around content removal. Current research focuses mainly on diffusion models, and unlearning in GANs remains largely unexplored.

Details

Method: Guide the unlearning process with natural language descriptions, without additional datasets or supervised fine-tuning, enabling feature unlearning, identity unlearning, and fine-grained tasks such as expression and multi-attribute removal. Result: Introduces an automatic unlearning assessment method adapted from state-of-the-art image-text alignment metrics, providing a comprehensive analysis of the unlearning methodology. Conclusion: To the authors' knowledge, Text-to-Unlearn is the first cross-modal unlearning framework for GANs, offering a flexible and efficient way to manage generative model behavior. Abstract: State-of-the-art generative models exhibit powerful image-generation capabilities, introducing various ethical and legal challenges to service providers hosting these models. Consequently, Content Removal Techniques (CRTs) have emerged as a growing area of research to control outputs without full-scale retraining. Recent work has explored the use of Machine Unlearning in generative models to address content removal. However, the focus of such research has been on diffusion models, and unlearning in Generative Adversarial Networks (GANs) has remained largely unexplored. We address this gap by proposing Text-to-Unlearn, a novel framework that selectively unlearns concepts from pre-trained GANs using only text prompts, enabling feature unlearning, identity unlearning, and fine-grained tasks like expression and multi-attribute removal in models trained on human faces. Leveraging natural language descriptions, our approach guides the unlearning process without requiring additional datasets or supervised fine-tuning, offering a scalable and efficient solution. To evaluate its effectiveness, we introduce an automatic unlearning assessment method adapted from state-of-the-art image-text alignment metrics, providing a comprehensive analysis of the unlearning methodology. To our knowledge, Text-to-Unlearn is the first cross-modal unlearning framework for GANs, representing a flexible and efficient advancement in managing generative model behavior.

A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates

Gonçalo Gomes,Chrysoula Zerva,Bruno Martins

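Task: Explore and address limitations of learned image captioning evaluation metrics, specifically the lack of granular assessment of individual word misalignments within captions and the reliance on single-point quality estimates that ignore uncertainty.

Motivation: Current learned image captioning evaluation metrics have two main problems: they lack granular assessment of individual word misalignments within captions, and they rely on single-point quality estimates without considering uncertainty.

Details

Method: Propose a simple yet effective strategy that generates and calibrates CLIPScore distributions using a model-agnostic conformal risk control framework. Result: Experiments show that conformal risk control applied to distributions produced by simple methods such as input masking achieves performance competitive with more complex approaches. The method effectively detects misaligned words, provides formal guarantees aligned with desired risk levels, and improves the correlation between uncertainty estimates and prediction errors. Conclusion: The method improves the reliability of caption evaluation metrics, particularly for granular assessment and uncertainty handling. Abstract: This study explores current limitations of learned image captioning evaluation metrics, specifically the lack of granular assessment for individual word misalignments within captions, and the reliance on single-point quality estimates without considering uncertainty. To address these limitations, we propose a simple yet effective strategy for generating and calibrating CLIPScore distributions. Leveraging a model-agnostic conformal risk control framework, we calibrate CLIPScore values for task-specific control variables, to tackle the aforementioned two limitations. Experimental results demonstrate that using conformal risk control, over the distributions produced with simple methods such as input masking, can achieve competitive performance compared to more complex approaches. Our method effectively detects misaligned words, while providing formal guarantees aligned with desired risk levels, and improving the correlation between uncertainty estimations and prediction errors, thus enhancing the overall reliability of caption evaluation metrics.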
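
The calibration step behind conformal risk control can be sketched generically. Given per-example losses on a held-out calibration set that are bounded by B and do not increase as a control parameter lambda grows, the standard recipe picks the smallest lambda whose corrected empirical risk, (n / (n + 1)) * mean loss + B / (n + 1), is at most the target level alpha. How the paper defines the loss and the control variable for CLIPScore is not stated in the abstract, so the scores, the thresholding loss, and all names below are hypothetical.

```python
def crc_threshold(losses_at, lambdas, alpha, loss_bound=1.0):
    """Smallest lambda whose corrected calibration risk is at most alpha.

    losses_at(lam) returns the per-example calibration losses at threshold lam;
    losses are assumed to lie in [0, loss_bound] and not increase with lam.
    """
    for lam in sorted(lambdas):
        losses = losses_at(lam)
        n = len(losses)
        # Equivalent to (n / (n + 1)) * mean(losses) + loss_bound / (n + 1).
        corrected_risk = (sum(losses) + loss_bound) / (n + 1)
        if corrected_risk <= alpha:
            return lam
    return None  # no candidate threshold meets the risk level


# Hypothetical calibration scores: the loss is 1 when a score exceeds lambda.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
candidates = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
chosen = crc_threshold(
    lambda lam: [1.0 if s > lam else 0.0 for s in scores], candidates, alpha=0.35
)
```

The extra B / (n + 1) term is what turns an empirical risk estimate into a finite-sample guarantee, and it also explains why very small alpha values may be unattainable with a small calibration set.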

ForestVO: Enhancing Visual Odometry in Forest Environments through ForestGlue

Thomas Pritchard,Saifullah Ijaz,Ronald Clark,Basaran Bahadir Kocer

Task: Propose ForestGlue and ForestVO for visual odometry in complex forest environments, aimed at autonomous navigation.

Motivation: Address the loss of feature correspondence accuracy in complex environments such as forests, caused by dense foliage, variable lighting, and repetitive textures.

Details

Method: Enhance the SuperPoint feature detector in four configurations and retrain LightGlue/SuperGlue for feature matching, combined with ForestVO, a transformer-based pose estimation model. Result: ForestGlue needs only 25% of the keypoints to reach pose estimation accuracy comparable to baseline models; on TartanAir forest sequences, ForestVO achieves an average relative pose error of 1.09 m, outperforming direct methods by 40%. Conclusion: The work provides an end-to-end deep learning pipeline for visual odometry in forest environments, significantly improving the accuracy and robustness of autonomous navigation systems. Abstract: Recent advancements in visual odometry systems have improved autonomous navigation; however, challenges persist in complex environments like forests, where dense foliage, variable lighting, and repetitive textures compromise feature correspondence accuracy. To address these challenges, we introduce ForestGlue, enhancing the SuperPoint feature detector through four configurations - grayscale, RGB, RGB-D, and stereo-vision - optimised for various sensing modalities. For feature matching, we employ LightGlue or SuperGlue, retrained with synthetic forest data. ForestGlue achieves comparable pose estimation accuracy to baseline models but requires only 512 keypoints - just 25% of the baseline's 2048 - to reach an LO-RANSAC AUC score of 0.745 at a 10° threshold. With only a quarter of keypoints needed, ForestGlue significantly reduces computational overhead, demonstrating effectiveness in dynamic forest environments, and making it suitable for real-time deployment on resource-constrained platforms. By combining ForestGlue with a transformer-based pose estimation model, we propose ForestVO, which estimates relative camera poses using matched 2D pixel coordinates between frames. On challenging TartanAir forest sequences, ForestVO achieves an average relative pose error (RPE) of 1.09 m and a kitti_score of 2.33%, outperforming direct-based methods like DSO by 40% in dynamic scenes. Despite using only 10% of the dataset for training, ForestVO maintains competitive performance with TartanVO while being a significantly lighter model. This work establishes an end-to-end deep learning pipeline specifically tailored for visual odometry in forested environments, leveraging forest-specific training data to optimise feature correspondence and pose estimation, thereby enhancing the accuracy and robustness of autonomous navigation systems.
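
For readers unfamiliar with the relative pose error (RPE) quoted above, the following is a minimal sketch of one common way to compute its translational part. The abstract does not state the exact evaluation protocol, so the frame gap `delta`, the use of a plain mean, and the helper names `translation_pose` and `mean_translational_rpe` are assumptions for illustration only. Poses are taken to be 4x4 homogeneous camera-to-world matrices.

```python
import numpy as np


def translation_pose(x, y=0.0, z=0.0):
    """Hypothetical helper: a 4x4 pose with identity rotation and given translation."""
    pose = np.eye(4)
    pose[:3, 3] = [x, y, z]
    return pose


def mean_translational_rpe(gt_poses, est_poses, delta=1):
    """Mean translational relative pose error over frame pairs (i, i + delta).

    For each pair, the relative motion is computed in both trajectories and the
    error is the translation norm of inv(gt_relative) @ est_relative.
    """
    errors = []
    for i in range(len(gt_poses) - delta):
        gt_rel = np.linalg.inv(gt_poses[i]) @ gt_poses[i + delta]
        est_rel = np.linalg.inv(est_poses[i]) @ est_poses[i + delta]
        err = np.linalg.inv(gt_rel) @ est_rel
        errors.append(np.linalg.norm(err[:3, 3]))
    return float(np.mean(errors))


# Toy trajectories: ground truth moves 1.0 m per frame, the estimate 1.5 m per frame.
gt = [translation_pose(1.0 * k) for k in range(5)]
est = [translation_pose(1.5 * k) for k in range(5)]
rpe = mean_translational_rpe(gt, est)
```

Because the error is computed on relative motions rather than absolute positions, it measures local drift per step and does not accumulate over the trajectory.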

BOLDSimNet: Examining Brain Network Similarity between Task and Resting-State fMRI

Boseong Kim,Debashis Das Chakladar,Haejun Chung,Ikbeom Jang

Task: Propose BOLDSimNet, a framework for measuring causal connectivity and network similarity across different cognitive states.

Motivation: Traditional causal connectivity methods struggle to capture directed information flow accurately in task-based and resting-state fMRI and cannot model multivariate dependencies, which limits the comparison of brain networks across cognitive states.

Details

Method: Use Multivariate Transfer Entropy (MTE) to measure causal connectivity, and group ROIs by functional similarity to improve network alignment accuracy. Result: Children showed higher similarity scores between task and resting states than adolescents, while adolescents showed more differences in the DAN and DMN, reflecting enhanced network adaptability. Conclusion: BOLDSimNet can quantify network similarity and identify attentional fluctuations across cognitive states, revealing developmental differences in the reconfiguration of causal brain networks. Abstract: Traditional causal connectivity methods in task-based and resting-state functional magnetic resonance imaging (fMRI) face challenges in accurately capturing directed information flow due to their sensitivity to noise and inability to model multivariate dependencies. These limitations hinder the effective comparison of brain networks between cognitive states, making it difficult to analyze network reconfiguration during task and resting states. To address these issues, we propose BOLDSimNet, a novel framework utilizing Multivariate Transfer Entropy (MTE) to measure causal connectivity and network similarity across different cognitive states. Our method groups functionally similar regions of interest (ROIs) rather than spatially adjacent nodes, improving accuracy in network alignment. We applied BOLDSimNet to fMRI data from 40 healthy controls and found that children exhibited higher similarity scores between task and resting states compared to adolescents, indicating reduced variability in attention shifts. In contrast, adolescents showed more differences between task and resting states in the Dorsal Attention Network (DAN) and the Default Mode Network (DMN), reflecting enhanced network adaptability. These findings emphasize developmental variations in the reconfiguration of the causal brain network, showcasing BOLDSimNet's ability to quantify network similarity and identify attentional fluctuations between different cognitive states.

3D Gaussian Inverse Rendering with Approximated Global Illumination

Zirui Wu,Jianteng Chen,Laijian Li,Shaoteng Wu,Zhikai Zhu,Kang Xu,Martin R. Oswald,Jie Song

Task: Propose a new method that achieves global illumination for 3D Gaussian Splatting through screen-space ray tracing.

Motivation: Existing 3D Gaussian Splatting methods typically bake illumination into their representations, limiting physically-based rendering and scene editing.

Details

Method: Augment the direct shading computed by 3D Gaussian Splatting with Monte-Carlo screen-space ray tracing to capture one-bounce indirect illumination. Result: Experiments show that the method allows for indirect illumination and supports real-time rendering and editing. Conclusion: The method achieves realistic global illumination without sacrificing the computational efficiency and editability of 3D Gaussian Splatting. Abstract: 3D Gaussian Splatting shows great potential in reconstructing photo-realistic 3D scenes. However, these methods typically bake illumination into their representations, limiting their use for physically-based rendering and scene editing. Although recent inverse rendering approaches aim to decompose scenes into material and lighting components, they often rely on simplifying assumptions that fail when editing. We present a novel approach that enables efficient global illumination for 3D Gaussian Splatting through screen-space ray tracing. Our key insight is that a substantial amount of indirect light can be traced back to surfaces visible within the current view frustum. Leveraging this observation, we augment the direct shading computed by 3D Gaussians with Monte-Carlo screen-space ray-tracing to capture one-bounce indirect illumination. In this way, our method enables realistic global illumination without sacrificing the computational efficiency and editability benefits of 3D Gaussians. Through experiments, we show that the screen-space approximation we utilize allows for indirect illumination and supports real-time rendering and editing. Code, data, and models will be made available at our project page: https://wuzirui.github.io/gs-ssr.

GarmageNet: A Dataset and Scalable Representation for Generic Garment Modeling

Siran Li,Ruiyang Liu,Chen Liu,Zhendong Wang,Gaofeng He,Yong-Lu Li,Xiaogang Jin,Huamin Wang

Task: Propose Garmage, a neural-network-and-CG-friendly garment representation for high-fidelity, non-watertight, multi-layered garment modeling.

Motivation: High-fidelity garment modeling remains challenging due to the lack of large-scale, high-quality datasets and of efficient representations that can handle non-watertight, multi-layer geometries.

Details

Method: Garmage, a dual 2D-3D representation, seamlessly encodes the accurate geometry and sewing pattern of complex multi-layered garments as structured per-panel geometry images. Built on this representation, the authors propose the GarmageNet generation framework and a robust stitching algorithm. Result: GarmageNet generates detailed multi-layered garments from user prompts or existing sewing patterns and ensures compatibility with industrial-grade simulations. An industrial-standard, large-scale, high-fidelity garment dataset is also released. Conclusion: Garmage and GarmageNet offer a new solution for large-scale, industrial-grade garment generation systems. Abstract: High-fidelity garment modeling remains challenging due to the lack of large-scale, high-quality datasets and efficient representations capable of handling non-watertight, multi-layer geometries. In this work, we introduce Garmage, a neural-network-and-CG-friendly garment representation that seamlessly encodes the accurate geometry and sewing pattern of complex multi-layered garments as a structured set of per-panel geometry images. As a dual-2D-3D representation, Garmage achieves an unprecedented integration of 2D image-based algorithms with 3D modeling workflows, enabling high fidelity, non-watertight, multi-layered garment geometries with direct compatibility for industrial-grade simulations. Built upon this representation, we present GarmageNet, a novel generation framework capable of producing detailed multi-layered garments with body-conforming initial geometries and intricate sewing patterns, based on user prompts or existing in-the-wild sewing patterns. Furthermore, we introduce a robust stitching algorithm that recovers per-vertex stitches, ensuring seamless integration into flexible simulation pipelines for downstream editing of sewing patterns, material properties, and dynamic simulations. Finally, we release an industrial-standard, large-scale, high-fidelity garment dataset featuring detailed annotations, vertex-wise correspondences, and a robust pipeline for converting unstructured production sewing patterns into GarmageNet standard structural assets, paving the way for large-scale, industrial-grade garment generation systems.

Domain Guidance: A Simple Transfer Approach for a Pre-trained Diffusion Model

Jincheng Zhong,Xiangcheng Zhang,Jianmin Wang,Mingsheng Long

Task: Propose Domain Guidance, a new approach to conditional generation based on pre-trained diffusion models.

Motivation: Diffusion models generate impressive outputs, but at the cost of large model scale and heavy compute, so a more efficient way to use pre-trained models for personalized generation is needed.

Details

Method: Domain Guidance leverages pre-trained knowledge to guide the sampling process toward the target domain. Result: Across multiple transfer benchmarks, Domain Guidance substantially outperforms standard fine-tuning, improving FID by over 19.6% and FD$_\text{DINOv2}$ by 23.4%. Conclusion: Domain Guidance is an efficient approach that requires no additional training and significantly improves generation quality. Abstract: Recent advancements in diffusion models have revolutionized generative modeling. However, the impressive and vivid outputs they produce often come at the cost of significant model scaling and increased computational demands. Consequently, building personalized diffusion models based on off-the-shelf models has emerged as an appealing alternative. In this paper, we introduce a novel perspective on conditional generation for transferring a pre-trained model. From this viewpoint, we propose *Domain Guidance*, a straightforward transfer approach that leverages pre-trained knowledge to guide the sampling process toward the target domain. Domain Guidance shares a formulation similar to advanced classifier-free guidance, facilitating better domain alignment and higher-quality generations. We provide both empirical and theoretical analyses of the mechanisms behind Domain Guidance. Our experimental results demonstrate its substantial effectiveness across various transfer benchmarks, achieving over a 19.6% improvement in FID and a 23.4% improvement in FD$_\text{DINOv2}$ compared to standard fine-tuning. Notably, existing fine-tuned models can seamlessly integrate Domain Guidance to leverage these benefits, without additional training.
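
Since the abstract describes Domain Guidance as sharing a formulation with classifier-free guidance, the following sketch shows what such a guidance-style combination looks like. It is an assumption, not the paper's stated equation: here the pre-trained model's noise prediction is treated as the reference branch and the fine-tuned model's prediction as the target branch, combined with a guidance weight `w` exactly as classifier-free guidance combines its unconditional and conditional branches. The function name `guided_prediction` and the toy arrays are hypothetical.

```python
import numpy as np


def guided_prediction(eps_reference, eps_target, w):
    """Guidance-style extrapolation: reference + w * (target - reference).

    eps_reference: noise prediction from the reference branch
                   (assumed here to be the pre-trained model).
    eps_target:    noise prediction from the target branch
                   (assumed here to be the fine-tuned, target-domain model).
    w:             guidance weight; w = 1 recovers the target prediction,
                   w > 1 pushes further away from the reference.
    """
    eps_reference = np.asarray(eps_reference, dtype=float)
    eps_target = np.asarray(eps_target, dtype=float)
    return eps_reference + w * (eps_target - eps_reference)


# Toy predictions for a 2-element latent.
eps_pre = np.array([0.0, 1.0])
eps_ft = np.array([1.0, 3.0])
eps_guided = guided_prediction(eps_pre, eps_ft, w=2.0)
```

Under this reading, no extra training is involved: both predictions come from models that already exist, and only the sampling step changes, which matches the abstract's remark that fine-tuned models can adopt the technique without additional training.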

STPNet: Scale-aware Text Prompt Network for Medical Image Segmentation

Dandan Shan,Zihan Li,Yunxiang Li,Qingde Li,Jie Tian,Qingqi Hong

Task: Propose STPNet, a scale-aware text prompt network, to enhance medical image segmentation.

Motivation: Traditional segmentation methods rely solely on visual features and struggle with uncertainty in lesion distribution and size.

Details

Method: Use multi-scale textual descriptions to guide lesion localization, and bridge the semantic gap between visual and linguistic modalities through retrieval-segmentation joint learning. Result: On three datasets (COVID-Xray, COVID-CT, and Kvasir-SEG), STPNet outperforms existing segmentation methods. Conclusion: Incorporating textual semantic knowledge into medical image analysis is effective, and STPNet demonstrates the benefits of cross-modal learning. Abstract: Accurate segmentation of lesions plays a critical role in medical image analysis and diagnosis. Traditional segmentation approaches that rely solely on visual features often struggle with the inherent uncertainty in lesion distribution and size. To address these issues, we propose STPNet, a Scale-aware Text Prompt Network that leverages vision-language modeling to enhance medical image segmentation. Our approach utilizes multi-scale textual descriptions to guide lesion localization and employs retrieval-segmentation joint learning to bridge the semantic gap between visual and linguistic modalities. Crucially, STPNet retrieves relevant textual information from a specialized medical text repository during training, eliminating the need for text input during inference while retaining the benefits of cross-modal learning. We evaluate STPNet on three datasets: COVID-Xray, COVID-CT, and Kvasir-SEG. Experimental results show that our vision-language approach outperforms state-of-the-art segmentation methods, demonstrating the effectiveness of incorporating textual semantic knowledge into medical image analysis. The code has been made publicly available at https://github.com/HUANGLIZI/STPNet.

Pro-DG: Procedural Diffusion Guidance for Architectural Facade Generation

Aleksander Plocharski,Jan Swidzinski,Przemyslaw Musialski

Task: Propose Pro-DG, a framework that combines a procedural shape grammar with diffusion-based image synthesis for controllable photo-realistic facade generation.

Motivation: Facades are multi-hierarchical structures, and traditional methods struggle to support large-scale edits while retaining local appearance fidelity.

Details

Method: Reconstruct the facade layout using grammar rules, introduce a hierarchical matching procedure that aligns facade structures at different levels, and use control maps to guide a generative diffusion pipeline. Result: A user study and quantitative measurements indicate that Pro-DG outperforms baselines in preserving architectural identity and in edit accuracy. Conclusion: Pro-DG is presented as the first method to integrate neuro-symbolically derived shape grammars with modern generative models, highlighting the potential of such approaches for precise and controllable image manipulation. Abstract: We present Pro-DG, a framework for procedurally controllable photo-realistic facade generation that combines a procedural shape grammar with diffusion-based image synthesis. Starting from a single input image, we reconstruct its facade layout using grammar rules, then edit that structure through user-defined transformations. As facades are inherently multi-hierarchical structures, we introduce a hierarchical matching procedure that aligns facade structures at different levels, which is used to introduce control maps to guide a generative diffusion pipeline. This approach retains local appearance fidelity while accommodating large-scale edits such as floor duplication or window rearrangement. We provide a thorough evaluation, comparing Pro-DG against inpainting-based baselines and synthetic ground truths. Our user study and quantitative measurements indicate improved preservation of architectural identity and higher edit accuracy. Our novel method is the first to integrate neuro-symbolically derived shape-grammars for modeling with modern generative models and highlights the broader potential of such approaches for precise and controllable image manipulation.

Instance Migration Diffusion for Nuclear Instance Segmentation in Pathology

Lirui Qi,Hongliang He,Tong Wang,Siwei Feng,Guohong Fu

Task: Propose IM-Diffusion, a data augmentation framework that generates more varied pathological images to improve nuclear instance segmentation performance.

Motivation: Limited labeled data in pathological images restricts the overall performance of nuclear instance segmentation.

Details

Method: A Nuclear Migration Module (NMM) constructs diverse nuclear layouts, and an Internuclear-regions Inpainting Module (IIM) generates diverse internuclear spatial relationships. Result: Evaluation on the CoNSeP and GLySAC datasets shows that images generated by IM-Diffusion effectively improve instance segmentation performance. Conclusion: By generating diverse pathological images, IM-Diffusion helps improve downstream task performance. Abstract: Nuclear instance segmentation plays a vital role in disease diagnosis within digital pathology. However, limited labeled data in pathological images restricts the overall performance of nuclear instance segmentation. To tackle this challenge, we propose a novel data augmentation framework, Instance Migration Diffusion Model (IM-Diffusion), designed to generate more varied pathological images by constructing diverse nuclear layouts and internuclear spatial relationships. In detail, we introduce a Nuclear Migration Module (NMM) which constructs diverse nuclear layouts by simulating the process of nuclear migration. Building on this, we further present an Internuclear-regions Inpainting Module (IIM) to generate diverse internuclear spatial relationships by structure-aware inpainting. On the basis of the above, IM-Diffusion generates more diverse pathological images with different layouts and internuclear spatial relationships, thereby facilitating downstream tasks. Evaluation on the CoNSeP and GLySAC datasets demonstrates that the images generated by IM-Diffusion effectively enhance overall instance segmentation performance. Code will be made public later.

Leveraging Embedding Techniques in Multimodal Machine Learning for Mental Illness Assessment

Abdelrahaman A. Hassan,Abdelrahman A. Ali,Aya E. Fouda,Radwa J. Hanafy,Mohammed E. Fouda

Task: Use multimodal machine learning to develop objective, scalable mental health diagnostic tools for detecting depression and post-traumatic stress disorder (PTSD).

Motivation: Mental health problems are increasingly prevalent worldwide, and traditional clinical assessments are limited in accessibility, objectivity, and consistency.

Details

Method: Combine text, audio, and video data; apply several preprocessing techniques (such as chunking and utterance-based formatting strategies); evaluate embedding models; use CNNs and BiLSTMs for feature extraction; explore data-level, feature-level, and decision-level fusion; and integrate Large Language Model (LLM) predictions. Result: Utterance-based chunking significantly improves performance, and decision-level fusion incorporating LLM predictions achieves the highest balanced accuracy (94.8% for depression, 96.2% for PTSD). Conclusion: Multimodal machine learning shows potential for developing more accurate, accessible, and personalized mental health tools. Abstract: The increasing global prevalence of mental disorders, such as depression and PTSD, requires objective and scalable diagnostic tools. Traditional clinical assessments often face limitations in accessibility, objectivity, and consistency. This paper investigates the potential of multimodal machine learning to address these challenges, leveraging the complementary information available in text, audio, and video data. Our approach involves a comprehensive analysis of various data preprocessing techniques, including novel chunking and utterance-based formatting strategies. We systematically evaluate a range of state-of-the-art embedding models for each modality and employ Convolutional Neural Networks (CNNs) and Bidirectional LSTM Networks (BiLSTMs) for feature extraction. We explore data-level, feature-level, and decision-level fusion techniques, including a novel integration of Large Language Model (LLM) predictions. We also investigate the impact of replacing Multilayer Perceptron classifiers with Support Vector Machines. We extend our analysis to severity prediction using PHQ-8 and PCL-C scores and multi-class classification (considering co-occurring conditions). Our results demonstrate that utterance-based chunking significantly improves performance, particularly for text and audio modalities. Decision-level fusion, incorporating LLM predictions, achieves the highest accuracy, with a balanced accuracy of 94.8% for depression and 96.2% for PTSD detection. The combination of CNN-BiLSTM architectures with utterance-level chunking, coupled with the integration of external LLM, provides a powerful and nuanced approach to the detection and assessment of mental health conditions. Our findings highlight the potential of multimodal machine learning (MMML) for developing more accurate, accessible, and personalized mental healthcare tools.
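
To make the terms "decision-level fusion" and "balanced accuracy" concrete, here is a minimal sketch. It is not the authors' pipeline: the fusion rule (a weighted average of per-source probabilities, thresholded at 0.5), the source names, and all numbers are hypothetical. Balanced accuracy is computed in the usual way, as the mean of per-class recall.

```python
import numpy as np


def fuse_decisions(prob_by_source, weights=None):
    """Late fusion: average positive-class probabilities across sources.

    prob_by_source: array-like of shape (n_sources, n_samples), one row per
                    classifier (e.g. text, audio, an LLM), values in [0, 1].
    weights:        optional per-source weights; uniform when None.
    Returns binary predictions of shape (n_samples,).
    """
    probs = np.asarray(prob_by_source, dtype=float)
    fused = np.average(probs, axis=0, weights=weights)
    return (fused >= 0.5).astype(int)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, which is robust to class imbalance."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))


# Hypothetical positive-class probabilities from three sources for four samples.
probs = [
    [0.9, 0.2, 0.6, 0.4],  # text model
    [0.8, 0.4, 0.3, 0.1],  # audio model
    [0.7, 0.3, 0.9, 0.2],  # LLM prediction
]
labels = [1, 0, 1, 1]
predictions = fuse_decisions(probs)
score = balanced_accuracy(labels, predictions)
```

Fusing at the decision level keeps each modality's model independent, so a source can be added or dropped without retraining the others; the trade-off is that cross-modal interactions are not learned jointly.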

TransientTables: Evaluating LLMs' Reasoning on Temporally Evolving Semi-structured Tables

Abhilash Shankarampeta,Harsh Mahajan,Tushar Kataria,Dan Roth,Vivek Gupta

Task: Evaluate the temporal reasoning capabilities of large language models (LLMs) and present the TRANSIENTTABLES dataset.

Motivation: Humans use temporal reasoning to understand sequences of events, whereas LLMs are trained on static datasets, which limits their temporal reasoning.

Details

Method: Present the TRANSIENTTABLES dataset, comprising 3,971 questions derived from over 14,000 tables and spanning 1,238 entities; generate questions with templates and introduce modeling strategies based on task decomposition. Result: Baseline results with state-of-the-art LLMs are established, and task decomposition is shown to improve LLM performance. Conclusion: The TRANSIENTTABLES dataset and the task decomposition strategies provide effective tools for evaluating and improving the temporal reasoning of LLMs. Abstract: Humans continuously make new discoveries, and understanding the temporal sequence of events leading to these breakthroughs is essential for advancing science and society. This ability to reason over time allows us to identify future steps and understand the effects of financial and political decisions on our lives. However, large language models (LLMs) are typically trained on static datasets, limiting their ability to perform effective temporal reasoning. To assess the temporal reasoning capabilities of LLMs, we present the TRANSIENTTABLES dataset, which comprises 3,971 questions derived from over 14,000 tables, spanning 1,238 entities across multiple time periods. We introduce a template-based question-generation pipeline that harnesses LLMs to refine both templates and questions. Additionally, we establish baseline results using state-of-the-art LLMs to create a benchmark. We also introduce novel modeling strategies centered around task decomposition, enhancing LLM performance.