Motivation: Existing methods adopt a fixed architecture, which makes it hard to adapt to new tasks, and they suffer from task architecture conflict and modality imbalance. Method: Proposes D-MoLE, which combines a dynamic layer-wise expert allocator with a gradient-based inter-modal continual curriculum to adjust the architecture dynamically and balance modality updates. Result: Experiments show that D-MoLE significantly outperforms existing baselines, with a 15% average improvement. Conclusion: This is the first work to study continual learning for MLLMs from an architectural perspective, offering a new approach to task adaptability. Abstract: Continual multimodal instruction tuning is crucial for adapting Multimodal Large Language Models (MLLMs) to evolving tasks. However, most existing methods adopt a fixed architecture and struggle to adapt to new tasks because of static model capacity. We propose to evolve the architecture under parameter budgets for dynamic task adaptation, which remains unexplored and poses two challenges: 1) task architecture conflict, where different tasks require varying layer-wise adaptations, and 2) modality imbalance, where different tasks rely unevenly on modalities, leading to unbalanced updates. To address these challenges, we propose a novel Dynamic Mixture of Curriculum LoRA Experts (D-MoLE) method, which automatically evolves the MLLM's architecture with controlled parameter budgets to continually adapt to new tasks while retaining previously learned knowledge. Specifically, we propose a dynamic layer-wise expert allocator, which automatically allocates LoRA experts across layers to resolve architecture conflicts, and routes instructions layer by layer to facilitate knowledge sharing among experts. Then, we propose a gradient-based inter-modal continual curriculum, which adjusts the update ratio of each module in the MLLM based on the difficulty of each modality within the task to alleviate the modality imbalance problem. Extensive experiments show that D-MoLE significantly outperforms state-of-the-art baselines, achieving a 15% average improvement over the best baseline. To the best of our knowledge, this is the first study of continual learning for MLLMs from an architectural perspective.

Libin Lan,Hongxing Li,Zunhui Xia,Juan Zhou,Xiaofei Zhu,Yongmei Li,Yudong Zhang,Xin Luo

Main category: cs.CV

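TL;DR: The paper proposes Cross-Modal Cluster-Guided Negative Sampling (CM-CGNS), which improves medical visual representation learning through better negative-sample selection and stronger local-detail extraction.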

Details

Motivation: In multimodal self-supervised learning from medical images and reports, existing models select negative samples poorly and overlook local details and low-level features, which hurts diagnostic accuracy. Method: 1) Extends k-means clustering to multimodal negative-sample selection through cross-modal attention; 2) introduces a Cross-Modal Masked Image Reconstruction (CM-MIR) module to strengthen local feature interaction. Result: On classification, detection, and segmentation tasks across five downstream datasets, CM-CGNS outperforms existing methods on multiple metrics. Conclusion: By improving negative-sample selection and local feature extraction, CM-CGNS significantly improves medical visual representation learning. Abstract: Learning medical visual representations directly from paired images and reports through multimodal self-supervised learning has emerged as a novel and efficient approach to digital diagnosis in recent years. However, existing models suffer from several severe limitations: 1) they neglect the selection of negative samples, resulting in a scarcity of hard negatives and the inclusion of false negatives; 2) they focus on global feature extraction but overlook the fine-grained local details that are crucial for medical image recognition tasks; and 3) their contrastive learning primarily targets high-level features while ignoring low-level details that are essential for accurate medical analysis. Motivated by these critical issues, this paper presents a Cross-Modal Cluster-Guided Negative Sampling (CM-CGNS) method built on two ideas. First, it extends the k-means clustering used for local text features in the single-modal domain to the multimodal domain through cross-modal attention. This improvement increases the number of negative samples and boosts the model's representation capability. Second, it introduces a Cross-Modal Masked Image Reconstruction (CM-MIR) module that leverages local text-to-image features obtained via cross-modal attention to reconstruct masked local image regions. This module significantly strengthens the model's cross-modal information interaction capabilities and retains low-level image features essential for downstream tasks. By addressing the aforementioned limitations, the proposed CM-CGNS can learn effective and robust medical visual representations suitable for various recognition tasks. Extensive experimental results on classification, detection, and segmentation tasks across five downstream datasets show that our method outperforms state-of-the-art approaches on multiple metrics, verifying its superior performance.
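As a rough illustration of how cluster assignments can guide negative selection, the sketch below drops candidates that share the anchor's cluster, on the assumption that they are likely false negatives. The function name `select_negatives`, the precomputed `cluster_ids`, and the same-cluster exclusion rule are all assumptions made for illustration; the abstract does not spell out the selection rule CM-CGNS actually uses.

```python
def select_negatives(anchor_idx, cluster_ids):
    """Return indices of candidates whose cluster differs from the anchor's.

    cluster_ids[i] is the (hypothetical, precomputed) k-means cluster label
    of sample i. Same-cluster samples, including the anchor itself, are
    treated as potential false negatives and excluded.
    """
    anchor_cluster = cluster_ids[anchor_idx]
    return [i for i, c in enumerate(cluster_ids) if c != anchor_cluster]


# Toy batch of four samples: samples 0 and 2 fall in the same cluster.
cluster_ids = [0, 1, 0, 2]
print(select_negatives(0, cluster_ids))  # prints [1, 3]
```

In a contrastive loss, the returned indices would determine which similarity terms enter the denominator for that anchor.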

[66] Predicting Patient Survival with Airway Biomarkers using nn-Unet/Radiomics

Zacharia Mesbah,Dhruv Jain,Tsiry Mayet,Romain Modzelewski,Romain Herault,Simon Bernard,Sebastien Thureau,Clement Chatelain

Main category: cs.CV

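TL;DR: The study uses a three-stage approach (airway segmentation, feature extraction, and classification) to evaluate how well airway imaging biomarkers predict survival outcomes in patients with lung fibrosis, and reports high segmentation and classification scores.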

Details

Motivation: The study explores how important airway-related imaging biomarkers are for predicting survival outcomes in patients with lung fibrosis. Method: A three-stage approach: 1) segment airway structures with nn-Unet; 2) extract key features from around the trachea and the airway; 3) feed the features into an SVM classifier. Result: The segmentation task scored 0.8601 and the classification task scored 0.7346. Conclusion: The method shows strong predictive ability in airway image analysis and offers an effective tool for survival prediction in lung fibrosis patients. Abstract: The primary objective of the AIIB 2023 competition is to evaluate the predictive significance of airway-related imaging biomarkers in determining the survival outcomes of patients with lung fibrosis. This study introduces a comprehensive three-stage approach. Initially, a segmentation network, namely nn-Unet, is employed to delineate the airway's structural boundaries. Subsequently, key features are extracted from the radiomic images centered around the trachea and an enclosing bounding box around the airway. This step is motivated by the potential presence of critical survival-related insights within the tracheal region as well as pertinent information encoded in the structure and dimensions of the airway. Lastly, radiomic features obtained from the segmented areas are integrated into an SVM classifier. We obtained an overall score of 0.8601 for the segmentation in Task 1 and 0.7346 for the classification in Task 2.

[67] Pose Matters: Evaluating Vision Transformers and CNNs for Human Action Recognition on Small COCO Subsets

MingZe Tang,Madiha Kazi

Main category: cs.CV

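TL;DR: This study compares models for human action recognition on the COCO image dataset and finds that the Vision Transformer (ViT) performs best at 90% accuracy, well ahead of convolutional networks and CLIP-based models.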

Details

Motivation: To explore performance differences among models on action recognition and analyze the causes of their failures. Method: Using a three-class subset of the COCO image dataset, the study tests models ranging from fully connected networks to transformer architectures and evaluates them with statistical analysis and visualization techniques. Result: ViT achieves the highest test accuracy (90%) and attends more accurately to action-relevant regions, whereas other models are easily distracted by the background. Conclusion: Transformer models beat traditional approaches in data efficiency and performance, and explainability techniques help diagnose why models fail. Abstract: This study explores human action recognition using a three-class subset of the COCO image corpus, benchmarking models from simple fully connected networks to transformer architectures. The binary Vision Transformer (ViT) achieved 90% mean test accuracy, significantly exceeding multiclass classifiers such as convolutional networks (approximately 35%) and CLIP-based models (approximately 62-64%). A one-way ANOVA (F = 61.37, p < 0.001) confirmed these differences are statistically significant. Qualitative analysis with SHAP explainer and LeGrad heatmaps indicated that the ViT localizes pose-specific regions (e.g., lower limbs for walking or running), while simpler feed-forward models often focus on background textures, explaining their errors. These findings emphasize the data efficiency of transformer representations and the importance of explainability techniques in diagnosing class-specific failures.
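For readers unfamiliar with the test cited in the abstract, here is a minimal standard-library sketch of how a one-way ANOVA F statistic is computed from per-group samples. The function name `one_way_anova_f` and the accuracy values in the usage example are made up for illustration; they are not the paper's data and will not reproduce F = 61.37.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean(x for g in groups for x in g)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual observations around their own group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical per-run test accuracies for three model families.
runs = {
    "cnn": [0.33, 0.36, 0.35],
    "clip": [0.62, 0.64, 0.63],
    "vit": [0.89, 0.91, 0.90],
}
f_stat, df_b, df_w = one_way_anova_f(list(runs.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F means the spread between group means is large relative to the spread within groups; the p-value then comes from the F distribution with the two degrees of freedom returned above.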

[68] MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space

Anshul Singh,Chris Biemann,Jan Strich

Main category: cs.CV

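TL;DR: MTabVQA is a new benchmark for evaluating the multi-hop reasoning of vision-language models over multiple table images; it exposes the limitations of existing models, and fine-tuning improves performance.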

Details

Motivation: Existing benchmarks cannot assess a model's ability to parse and reason over multiple table images; MTabVQA fills this gap. Method: Introduces the MTabVQA benchmark and the MTabVQA-Instruct dataset, and improves model performance through fine-tuning. Result: Experiments show that fine-tuning substantially improves performance on multi-tabular visual reasoning. Conclusion: MTabVQA provides an effective evaluation tool for multi-tabular visual question answering, and fine-tuning improves model capability. Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in interpreting visual layouts and text. However, a significant challenge remains in their ability to robustly interpret and reason over multi-tabular data presented as images, a common occurrence in real-world scenarios like web pages and digital documents. Existing benchmarks typically address single tables or non-visual data (text/structured). This leaves a critical gap: they don't assess the ability to parse diverse table images, correlate information across them, and perform multi-hop reasoning on the combined visual data. We introduce MTabVQA, a novel benchmark specifically designed for multi-tabular visual question answering to bridge that gap. MTabVQA comprises 3,745 complex question-answer pairs that necessitate multi-hop reasoning across several visually rendered table images. We provide extensive benchmark results for state-of-the-art VLMs on MTabVQA, revealing significant performance limitations. We further investigate post-training techniques to enhance these reasoning abilities and release MTabVQA-Instruct, a large-scale instruction-tuning dataset. Our experiments show that fine-tuning VLMs with MTabVQA-Instruct substantially improves their performance on visual multi-tabular reasoning. Code and dataset (https://huggingface.co/datasets/mtabvqa/MTabVQA-Eval) are available online (https://anonymous.4open.science/r/MTabVQA-EMNLP-B16E).

[96] TeleEval-OS: Performance evaluations of large language models for operations scheduling

Yanyan Wang,Yingying Wang,Junli Liang,Yin Xu,Yunlong Liu,Yiming Xu,Zhengwang Jiang,Zhehe Li,Fei Li,Long Zhao,Kuang Xu,Qi Song,Xiangyang Li

Main category: cs.CL

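TL;DR: The paper proposes TeleEval-OS, the first evaluation benchmark for telecommunications operation scheduling, to comprehensively assess large language models (LLMs) on these tasks, and finds that open-source LLMs outperform closed-source LLMs in specific scenarios.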

Details

Motivation: Telecommunications operation scheduling tasks are complex and lack evaluation benchmarks, which has held back exploration of LLMs' potential in this field. Method: Builds the TeleEval-OS benchmark, comprising 15 datasets across 13 subtasks that simulate four key operational stages, and tests 14 LLMs with zero-shot and few-shot evaluation. Result: Experiments show that open-source LLMs outperform closed-source LLMs in specific scenarios, demonstrating their potential in telecommunications operation scheduling. Conclusion: TeleEval-OS provides an evaluation tool for applying LLMs to telecommunications operation scheduling, and open-source LLMs have significant value in this field. Abstract: The rapid advancement of large language models (LLMs) has significantly propelled progress in artificial intelligence, demonstrating substantial application potential across multiple specialized domains. Telecommunications operation scheduling (OS) is a critical aspect of the telecommunications industry, involving the coordinated management of networks, services, risks, and human resources to optimize production scheduling and ensure unified service control. However, the inherent complexity and domain-specific nature of OS tasks, coupled with the absence of comprehensive evaluation benchmarks, have hindered thorough exploration of LLMs' application potential in this critical field. To address this research gap, we propose the first Telecommunications Operation Scheduling Evaluation Benchmark (TeleEval-OS). Specifically, this benchmark comprises 15 datasets across 13 subtasks, comprehensively simulating four key operational stages: intelligent ticket creation, intelligent ticket handling, intelligent ticket closure, and intelligent evaluation. To systematically assess the performance of LLMs on tasks of varying complexity, we categorize their capabilities in telecommunications operation scheduling into four hierarchical levels, arranged in ascending order of difficulty: basic NLP, knowledge Q&A, report generation, and report analysis. On TeleEval-OS, we leverage zero-shot and few-shot evaluation methods to comprehensively assess 10 open-source LLMs (e.g., DeepSeek-V3) and 4 closed-source LLMs (e.g., GPT-4o) across diverse scenarios. Experimental results demonstrate that open-source LLMs can outperform closed-source LLMs in specific scenarios, highlighting their significant potential and value in the field of telecommunications operation scheduling.

[97] Who is in the Spotlight: The Hidden Bias Undermining Multimodal Retrieval-Augmented Generation

Jiayu Yao,Shenghua Liu,Yiwei Wang,Lingrui Mei,Baolong Bi,Yuyao Ge,Zhecheng Li,Xueqi Cheng

Main category: cs.CL

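TL;DR: This paper studies how the position of evidence affects multimodal retrieval-augmented generation (RAG) systems, finds that position bias significantly affects performance, and proposes a way to quantify it.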

Details

Motivation: Current multimodal RAG systems are highly sensitive to the order of evidence, which leads to unstable performance and biased reasoning, so the effect of position bias needs to be studied. Method: Runs experiments on text, image, and mixed-modality tasks, and introduces the Position Sensitivity Index ($PSI_p$) and a visualization framework to analyze attention allocation patterns. Result: Multimodal interaction intensifies position bias, and the bias grows logarithmically with retrieval range. Conclusion: The study provides a theoretical foundation for position-aware analysis of RAG systems and recommends evidence reordering or debiasing strategies to improve reliability. Abstract: Multimodal Retrieval-Augmented Generation (RAG) systems have become essential in knowledge-intensive and open-domain tasks. As retrieval complexity increases, ensuring the robustness of these systems is critical. However, current RAG models are highly sensitive to the order in which evidence is presented, often resulting in unstable performance and biased reasoning, particularly as the number of retrieved items or modality diversity grows. This raises a central question: How does the position of retrieved evidence affect multimodal RAG performance? To answer this, we present the first comprehensive study of position bias in multimodal RAG systems. Through controlled experiments across text-only, image-only, and mixed-modality tasks, we observe a consistent U-shaped accuracy curve with respect to evidence position. To quantify this bias, we introduce the Position Sensitivity Index ($PSI_p$) and develop a visualization framework to trace attention allocation patterns across decoder layers. Our results reveal that multimodal interactions intensify position bias compared to unimodal settings, and that this bias increases logarithmically with retrieval range. These findings offer both theoretical and empirical foundations for position-aware analysis in RAG, highlighting the need for evidence reordering or debiasing strategies to build more reliable and equitable generation systems.
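To make the idea of an accuracy curve over evidence position concrete, the sketch below groups trial outcomes by the slot where the gold evidence was placed and computes per-position accuracy. The abstract does not define $PSI_p$, so `accuracy_spread` is only a hypothetical stand-in (best minus worst per-position accuracy), and the trial data is invented for illustration.

```python
from collections import defaultdict


def accuracy_by_position(trials):
    """trials: iterable of (evidence_position, is_correct) pairs."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for position, is_correct in trials:
        counts[position] += 1
        hits[position] += int(bool(is_correct))
    return {p: hits[p] / counts[p] for p in sorted(counts)}


def accuracy_spread(acc):
    """Hypothetical sensitivity proxy, NOT the paper's PSI_p:
    the gap between the best and worst per-position accuracy."""
    return max(acc.values()) - min(acc.values())


# Invented outcomes: the gold evidence sits at position 0, 1, or 2.
trials = [
    (0, True), (0, True),
    (1, False), (1, True),
    (2, True), (2, True),
]
acc = accuracy_by_position(trials)
print(acc)                   # prints {0: 1.0, 1: 0.5, 2: 1.0}
print(accuracy_spread(acc))  # prints 0.5
```

In this toy data, accuracy dips when the evidence is in the middle slot, which is the kind of U-shaped pattern the abstract describes.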

[98] Smotrom tvoja pa ander drogoj verden! Resurrecting Dead Pidgin with Generative Models: Russenorsk Case Study

Alexey Tikhonov,Sergei Shteiner,Anna Bykova,Ivan P. Yamshchikov

Main category: cs.CL

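TL;DR: This paper uses modern large language models to analyze the Russenorsk lexicon, builds a structured dictionary, checks hypotheses about its word formation and grammar, and presents a translation agent.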

Details

Motivation: To study Russenorsk, a unique trade pidgin, and explore its lexicon and grammatical structure. Method: Uses large language models to analyze surviving sources, build a dictionary, test hypotheses, and develop a translation agent. Result: Confirms some hypotheses from the academic literature and generates Russenorsk renderings of contemporary texts. Conclusion: The word-formation and grammatical principles of Russenorsk can be checked with modern techniques, and the translation agent offers a new tool for linguistic research. Abstract: Russenorsk, a pidgin language historically used in trade interactions between Russian and Norwegian speakers, represents a unique linguistic phenomenon. In this paper, we attempt to analyze its lexicon using modern large language models (LLMs), based on surviving literary sources. We construct a structured dictionary of the language, grouped by synonyms and word origins. Subsequently, we use this dictionary to formulate hypotheses about the core principles of word formation and grammatical structure in Russenorsk and show which hypotheses generated by large language models correspond to hypotheses previously proposed in the academic literature. We also develop a "reconstruction" translation agent that generates hypothetical Russenorsk renderings of contemporary Russian and Norwegian texts.

[99] A Large Language Model Based Pipeline for Review of Systems Entity Recognition from Clinical Notes

Hieu Nghiem,Hemanth Reddy Singareddy,Zhuqi Miao,Jivan Lamichhane,Abdulaziz Ahmed,Johnson Thomas,Dursun Delen,William Paiva

Main category: cs.CL

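TL;DR: Develops an automated large language model (LLM)-based pipeline for extracting Review of Systems (ROS) entities from clinical notes, combining open-source and commercial models to achieve low-cost, effective performance.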

Details

Motivation: To reduce the burden of ROS documentation in clinical notes with a scalable, locally deployable solution. Method: SecTag extracts the ROS sections, then few-shot LLMs identify ROS entity spans, their status, and the associated body systems; open-source models (Mistral, Llama, Gemma) and ChatGPT were tested. Result: ChatGPT performed best (span error rate 28.2%, status/system error rate 14.5%), and open-source models also performed well (span error rate 30.5-36.7%, status/system error rate 24.3-27.3%). Conclusion: The pipeline gives resource-limited healthcare settings a viable open-source alternative and helps reduce the ROS documentation burden. Abstract: Objective: Develop a cost-effective, large language model (LLM)-based pipeline for automatically extracting Review of Systems (ROS) entities from clinical notes. Materials and Methods: The pipeline extracts ROS sections using SecTag, followed by few-shot LLMs to identify ROS entity spans, their positive/negative status, and associated body systems. We implemented the pipeline using open-source LLMs (Mistral, Llama, Gemma) and ChatGPT. The evaluation was conducted on 36 general medicine notes containing 341 annotated ROS entities. Results: When integrating ChatGPT, the pipeline achieved the lowest error rates in detecting ROS entity spans and their corresponding statuses/systems (28.2% and 14.5%, respectively). Open-source LLMs enable local, cost-efficient execution of the pipeline while delivering promising performance with similarly low error rates (span: 30.5-36.7%; status/system: 24.3-27.3%). Discussion and Conclusion: Our pipeline offers a scalable and locally deployable solution to reduce ROS documentation burden. Open-source LLMs present a viable alternative to commercial models in resource-limited healthcare environments.

[1] EfficientQuant: An Efficient Post-Training Quantization for CNN-Transformer Hybrid Models on Edge Devices

[2] Adaptive Object Detection with ESRGAN-Enhanced Resolution & Faster R-CNN

[3] Technical Report for Argoverse2 Scenario Mining Challenges on Iterative Error Correction and Spatially-Aware Prompting

[4] Image-Based Method For Measuring And Classification Of Iron Ore Pellets Using Star-Convex Polygons

[5] Segment This Thing: Foveated Tokenization for Efficient Point-Prompted Segmentation

[6] Gender Fairness of Machine Learning Algorithms for Pain Detection

[7] Monocular 3D Hand Pose Estimation with Implicit Camera Alignment

[8] ContextLoss: Context Information for Topology-Preserving Segmentation

[9] JAFAR: Jack up Any Feature at Any Resolution

[10] Autonomous Computer Vision Development with Agentic AI

[11] FARCLUSS: Fuzzy Adaptive Rebalancing and Contrastive Uncertainty Learning for Semi-Supervised Semantic Segmentation

[12] On the development of an AI performance and behavioural measures for teaching and classroom management

[13] AlignHuman: Improving Motion and Fidelity via Timestep-Segment Preference Optimization for Audio-Driven Human Animation

[14] 3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks

[15] LLM-to-Phy3D: Physically Conform Online 3D Object Generation with LLMs

[16] Self-Calibrating BCIs: Ranking and Recovery of Mental Targets Without Labels

[17] SLRNet: A Real-Time LSTM-Based Sign Language Recognition System

[18] Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search

[19] Digitization of Document and Information Extraction using OCR

[20] VIBE: Can a VLM Read the Room?

[21] Synthetic Geology -- Structural Geology Meets Deep Learning

[22] Evaluating BiLSTM and CNN+GRU Approaches for Human Activity Recognition Using WiFi CSI Data

[23] Test-Time-Scaling for Zero-Shot Diagnosis with Visual-Language Reasoning

[24] Towards a general-purpose foundation model for fMRI analysis

[25] WaveFormer: A Lightweight Transformer Model for sEMG-based Gesture Recognition

[26] Teaching in adverse scenes: a statistically feedback-driven threshold and mask adjustment teacher-student framework for object detection in UAV images under adverse scenes

[27] BrainMAP: Multimodal Graph Learning For Efficient Brain Disease Localization

[28] Enhanced Vehicle Speed Detection Considering Lane Recognition Using Drone Videos in California

[29] Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models

[30] TARDIS STRIDE: A Spatio-Temporal Road Image Dataset for Exploration and Autonomy

[31] HyBiomass: Global Hyperspectral Imagery Benchmark Dataset for Evaluating Geospatial Foundation Models in Forest Aboveground Biomass Estimation

[32] GynSurg: A Comprehensive Gynecology Laparoscopic Surgery Dataset

[33] A Watermark for Auto-Regressive Image Generation Models

[34] Scalable Context-Preserving Model-Aware Deep Clustering for Hyperspectral Images

[35] Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation

[36] Dynamic Double Space Tower

[37] Stop learning it all to mitigate visual hallucination, Focus on the hallucination target

[38] Auto-Connect: Connectivity-Preserving RigFormer with Direct Preference Optimization

[39] Auditing Data Provenance in Real-world Text-to-Image Diffusion Models for Privacy and Copyright Protection

[40] TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models

[41] Uncertainty Awareness Enables Efficient Labeling for Cancer Subtyping in Digital Pathology

[42] On the Natural Robustness of Vision-Language Models Against Visual Perception Attacks in Autonomous Driving

[43] FAME: A Lightweight Spatio-Temporal Network for Model Attribution of Face-Swap Deepfakes

[44] Environmental Change Detection: Toward a Practical Task of Scene Change Detection

[45] Composite Data Augmentations for Synthetic Image Detection Against Real-World Perturbations

[46] Preserving Clusters in Prompt Learning for Unsupervised Domain Adaptation

[47] Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs

[48] GNSS-inertial state initialization by distance residuals

[49] FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation

[50] Leveraging Satellite Image Time Series for Accurate Extreme Event Detection

[51] Linearly Solving Robust Rotation Estimation

[52] EyeSim-VQA: A Free-Energy-Guided Eye Simulation Framework for Video Quality Assessment

[53] DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs

[54] VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories?

[55] Camera-based method for the detection of lifted truck axles using convolutional neural networks

[56] OV-MAP : Open-Vocabulary Zero-Shot 3D Instance Segmentation Map for Robots

[57] EasyARC: Evaluating Vision Language Models on True Visual Reasoning

[58] A$^2$LC: Active and Automated Label Correction for Semantic Segmentation

[59] Wi-CBR: WiFi-based Cross-domain Behavior Recognition via Multimodal Collaborative Awareness

[60] SignAligner: Harmonizing Complementary Pose Modalities for Coherent Sign Language Generation

[61] Evaluating Fairness and Mitigating Bias in Machine Learning: A Novel Technique using Tensor Data and Bayesian Regression

[62] DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation

[63] Prohibited Items Segmentation via Occlusion-aware Bilayer Modeling

[64] Dynamic Mixture of Curriculum LoRA Experts for Continual Multimodal Instruction Tuning

[65] Cross-Modal Clustering-Guided Negative Sampling for Self-Supervised Joint Learning from Medical Images and Reports

[66] Predicting Patient Survival with Airway Biomarkers using nn-Unet/Radiomics

[67] Pose Matters: Evaluating Vision Transformers and CNNs for Human Action Recognition on Small COCO Subsets

[68] MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space

[69] DMAF-Net: An Effective Modality Rebalancing Framework for Incomplete Multi-Modal Medical Image Segmentation

[70] Quizzard@INOVA Challenge 2025 -- Track A: Plug-and-Play Technique in Interleaved Multi-Image Model

[71] AgriPotential: A Novel Multi-Spectral and Multi-Temporal Remote Sensing Dataset for Agricultural Potentials

[72] DiffFuSR: Super-Resolution of all Sentinel-2 Multispectral Bands using Diffusion Models

[73] MambaVSR: Content-Aware Scanning State Space Model for Video Super-Resolution

[74] CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection

[75] AgentSense: Virtual Sensor Data Generation Using LLM Agent in Simulated Home Environments

[76] Real-Time Feedback and Benchmark Dataset for Isometric Pose Evaluation

[77] Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation

[78] GPLQ: A General, Practical, and Lightning QAT Method for Vision Transformers